Daily Research Digest
arXiv Papers, 2026-03-10
590 papers across 4 categories
Robotics (cs.RO): 138 papers
cs.RO / 1 / 2603.06716

A Pivot-Based Kirigami Utensil for Hand-Held and Robot-Assisted Feeding

Leao, Keone, Brotherson, Grace, Mischel, Iain, Parekh, Sagar, Losey, Dylan P.
Abstract
Eating is a daily challenge for over 60 million adults with essential tremors and other mobility limitations. For these users, traditional utensils like forks or spoons are difficult to manipulate -- resulting in accidental spills and restricting the types of food that can be consumed. Prior work has developed rigid, hand-held utensils that often fail to secure food, as well as soft, shape-changing utensils made strictly for robot-assisted feeding. To assist a broader range of users, we introduce a re-designed kiri-spoon that can be leveraged as either a hand-held utensil or a robot-mounted attachment. Our key idea -- developed in collaboration with stakeholders -- is a pivot-based design. With this design the kiri-spoon behaves like a pair of pliers: users squeeze the handles to change the shape of the utensil and enclose food morsels. In practice, users can apply this kiri-spoon as either a spoon (that scoops food) or as a fork (that pinches food); when the handles are closed, the utensil wraps around the morsel and prevents it from accidentally falling. We characterize the amount of force required to open or close the kiri-spoon, and show how designers can modify this force through kinematic or material changes. A highlight of our design is its accessibility: the hand-held version consists of just four 3D printed parts that snap together. By adding a servo motor, we can extend this same kinematic structure to robot-assisted feeding. Across our user studies, adults with disabilities and elderly adults with Parkinson's reported that the kiri-spoon better met their needs and provided a more effective means of spill prevention than existing alternatives. See a video of our kiri-spoon here: https://youtu.be/FFIomm5RL98
cs.RO / 2 / 2603.06719

Dynamic Targeting of Satellite Observations Using Supplemental Geostationary Satellite Data and Hierarchical Planning

Kangaslahti, Akseli, Zilberstein, Itai, Candela, Alberto, Chien, Steve
Abstract
The Dynamic Targeting (DT) mission concept is an approach to satellite observation in which a lookahead sensor gathers information about the upcoming environment and uses this information to intelligently plan observations. Previous work has shown that DT has the potential to increase the science return across applications. However, DT mission concepts must address challenges, such as the limited spatial extent of onboard lookahead data and instrument mobility, data throughput, and onboard computation constraints. In this work, we show how the performance of DT systems can be improved by using supplementary data streamed from geostationary satellites that provide lookahead information up to 35 minutes ahead of time rather than the 1 minute latency from an onboard lookahead sensor. While there is a greater volume of geostationary data, the search space for observation planning explodes exponentially with the size of the horizon. To address this, we introduce a hierarchical planning approach in which the geostationary data is used to plan a long-term observation blueprint in polynomial time, then the onboard lookahead data is leveraged to refine that plan over short-term horizons. We compare the performance of our approach to that of traditional DT planners relying on onboard lookahead data across four different problem instances: three cloud avoidance variations and a storm hunting scenario. We show that our hierarchical planner outperforms the traditional DT planners by up to 41% and examine the features of the scenarios that affect the performance of our approach. We demonstrate that incorporating geostationary satellite data is most effective for dynamic problem instances in which the targets of interest are sparsely distributed throughout the overflight.
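The two-level structure lends itself to a compact sketch. Below is a minimal, illustrative Python rendering of the "blueprint then refine" idea under strong simplifying assumptions: a 1-D instrument pointing angle discretized into bins, a slew limit of one bin per timestep, and hypothetical `geo_score`/`onboard_score` utility grids standing in for cloud-free observation value. The paper's actual planner, state space, and cost models are richer than this.

```python
# Illustrative two-level planner: long-horizon DP blueprint from geostationary
# lookahead, then short-horizon refinement from fresher onboard lookahead.
import numpy as np

def blueprint(geo_score: np.ndarray) -> np.ndarray:
    """Long-horizon DP over a (T, K) utility grid; O(T*K) time, i.e.,
    polynomial as in the paper's claim. Returns one pointing bin per step."""
    T, K = geo_score.shape
    value = np.zeros((T, K))
    value[-1] = geo_score[-1]
    for t in range(T - 2, -1, -1):
        for k in range(K):
            lo, hi = max(0, k - 1), min(K, k + 2)   # bins reachable under slew limit
            value[t, k] = geo_score[t, k] + value[t + 1, lo:hi].max()
    plan = [int(value[0].argmax())]                 # greedy rollout of the optimal policy
    for t in range(1, T):
        k = plan[-1]
        lo, hi = max(0, k - 1), min(K, k + 2)
        plan.append(lo + int(value[t, lo:hi].argmax()))
    return np.array(plan)

def refine(plan: np.ndarray, onboard_score: np.ndarray, horizon: int = 5) -> np.ndarray:
    """Short-horizon refinement: re-optimize the next few steps locally around
    the blueprint using the 1-minute-latency onboard lookahead."""
    plan = plan.copy()
    for t in range(min(horizon, len(plan))):
        k = plan[t]
        lo, hi = max(0, k - 1), min(onboard_score.shape[1], k + 2)
        plan[t] = lo + int(onboard_score[t, lo:hi].argmax())
    return plan
```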
cs.RO / 3 / 2603.06749

Robotic Foundation Models for Industrial Control: A Comprehensive Survey and Readiness Assessment Framework

Kube, David, Hadwiger, Simon, Meisen, Tobias
Abstract
Robotic foundation models (RFMs) are emerging as a promising route towards flexible, instruction- and demonstration-driven robot control; however, a critical investigation of their industrial applicability is still lacking. This survey gives an extensive overview of the RFM landscape and analyses, driven by concrete implications, how industrial domains and use cases shape the requirements of RFMs, with particular focus on collaborative robot platforms, heterogeneous sensing and actuation, edge-computing constraints, and safety-critical operation. We synthesise industrial deployment perspectives into eleven interdependent implications and operationalise them into an assessment framework comprising a catalogue of 149 concrete criteria, spanning both model capabilities and ecosystem requirements. Using this framework, we evaluate 324 manipulation-capable RFMs via 48,276 criterion-level decisions obtained through a conservative LLM-assisted evaluation pipeline, validated against expert judgements. The results indicate that industrial maturity is limited and uneven: even the highest-rated models satisfy only a fraction of criteria and typically exhibit narrow implication-specific peaks rather than integrated coverage. We conclude that progress towards industry-grade RFMs depends less on isolated benchmark successes than on systematic incorporation of safety, real-time feasibility, robust perception, interaction, and cost-effective system integration into auditable deployment stacks.
cs.RO / 4 / 2603.06760

Gradient-based Nested Co-Design of Aerodynamic Shape and Control for Winged Robots

Affinita, Daniele, Xu, Mingda, Gherardi, Benoît Valentin, Fua, Pascal
Abstract
Designing aerial robots for specialized tasks, from perching to payload delivery, requires tailoring their aerodynamic shape to specific mission requirements. For tasks involving wide flight envelopes, the usual sequential process of first determining the shape and then the motion planner is likely to be suboptimal due to the inherent nonlinear interactions between them. This limitation has motivated co-design research, which involves jointly optimizing the aerodynamic shape and the motion planner. In this paper, we present a general-purpose, gradient-based, nested co-design framework where the motion planner solves an optimal control problem and the aerodynamic forces used in the dynamics model are determined by a neural surrogate model. This enables us to model complex subsonic flow conditions encountered in aerial robotics and to overcome the limited applicability of existing co-design methods, which stems from the simplifying assumptions they must impose on either the planner or the aerodynamics for computational tractability. We validate our method on two complex dynamic tasks for fixed-wing gliders: perching and a short landing. Our optimized designs improve task performance compared to an evolutionary baseline in a fraction of the computation time.
cs.RO / 5 / 2603.06773

Stability-Guided Exploration for Diverse Motion Generation

Cobo-Briesewitz, Eckart, Burghoff, Tilman, Shcherba, Denis, Jordana, Armand, Toussaint, Marc
Abstract
Scaling up datasets is highly effective in improving the performance of deep learning models, including in the field of robot learning. However, data collection still proves to be a bottleneck. Approaches relying on collecting human demonstrations are labor-intensive and inherently limited: they tend to be narrow, task-specific, and fail to adequately explore the full space of feasible states. Synthetic data generation could remedy this, but current techniques mostly rely on local trajectory optimization and fail to find diverse solutions. In this work, we propose a novel method capable of finding diverse long-horizon manipulations through black-box simulation. We achieve this by combining an RRT-style search with sampling-based MPC, together with a novel sampling scheme that guides the exploration toward stable configurations. Specifically, we sample from a manifold of stable states while growing a search tree directly through simulation, without restricting the planner to purely stable motions. We demonstrate the method's ability to discover diverse manipulation strategies, including pushing, grasping, pivoting, throwing, and tool use, across different robot morphologies, without task-specific guidance.
cs.RO / 6 / 2603.06775

HybridMimic: Hybrid RL-Centroidal Control for Humanoid Motion Mimicking

Tay, Ludwig Chee-Ying, Chang, I-Chia, Gu, Yan
Abstract
Motion mimicking, i.e., encouraging the control policy to mimic human motion, facilitates the learning of complex tasks via reinforcement learning (RL) for humanoid robots. Although standard RL frameworks demonstrate impressive locomotion agility, they often bypass explicit reasoning about robot dynamics during deployment, which is a design choice that can lead to physically infeasible commands when the robot encounters out-of-distribution environments. By integrating model-based principles, hybrid approaches can improve performance; however, existing methods typically rely on predefined contact timing, limiting their versatility. This paper introduces HybridMimic, a framework in which a learned policy dynamically modulates a centroidal-model-based controller by predicting continuous contact states and desired centroidal velocities. This architecture exploits the physical grounding of centroidal dynamics to generate feedforward torques that remain feasible even under domain shift. Using physics-informed rewards, the policy is trained to efficiently utilize the centroidal controller's optimization by outputting precise control targets and reference torques. Through hardware experiments on the Booster T1 humanoid, HybridMimic reduces the average base position tracking error by 13% compared to a state-of-the-art RL baseline, demonstrating the robustness of dynamics-aware deployment.
cs.RO / 7 / 2603.06779

A Multi-Layer Sim-to-Real Framework for Gaze-Driven Assistive Neck Exoskeletons

Rubow, Colin, Brewer, Eric, Bales, Ian, Zhang, Haohan, Brown, Daniel S.
Abstract
Dropped head syndrome, caused by neck muscle weakness from neurological diseases, severely impairs an individual's ability to support and move their head, causing pain and making everyday tasks challenging. Our long-term goal is to develop an assistive powered neck exoskeleton that restores natural movement. However, predicting a user's intended head movement remains a key challenge. We leverage virtual reality (VR) to collect coupled eye and head movement data from healthy individuals to train models capable of predicting head movement based solely on eye gaze. We also propose a novel multi-layer controller selection framework, where head control strategies are evaluated across decreasing levels of abstraction -- from simulation and VR to a physical neck exoskeleton. This pipeline effectively rejects poor-performing controllers early, identifying two novel gaze-driven models that achieve strong performance when deployed on the physical exoskeleton. Our results reveal that no single controller is universally preferred, highlighting the necessity for personalization in gaze-driven assistive control. Our work demonstrates the utility of VR-based evaluation for accelerating the development of intuitive, safe, and personalized assistive robots.
cs.RO / 8 / 2603.06824

A Comprehensive Analysis of the Effects of Network Quality of Service on Robotic Telesurgery

Zhang, Zhaomeng, Roodabeh, Seyed Hamid Reza, Alemzadeh, Homa
Abstract
The viability of long-distance telesurgery hinges on reliable network Quality of Service (QoS), yet the impact of realistic network degradations on task performance is not sufficiently understood. This paper presents a comprehensive analysis of how packet loss, delay, and communication loss affect telesurgical task execution. We introduce NetFI, a novel fault injection tool that emulates different network conditions using stochastic QoS models informed by real-world network data. By integrating NetFI with a surgical simulation platform, we conduct a user study involving 15 participants at three proficiency levels, performing a standardized Peg Transfer task under varying levels of packet loss, delay, and communication loss. We analyze the effect of network QoS on overall task performance and the fine-grained motion primitives (MPs) using objective performance and safety metrics and subjective operator's perception of workload. We identify specific MPs vulnerable to network degradation and find strong correlations between proficiency, objective performance, and subjective workload. These findings offer quantitative insights into the operational boundaries of telesurgery. Our open-source tools and annotated dataset provide a foundation for developing robust and network-aware control and mitigation strategies.
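The abstract does not spell out NetFI's stochastic QoS models, so the sketch below uses a common stand-in: a two-state Gilbert-Elliott chain for bursty packet loss plus a shifted-gamma one-way delay. All class names and parameters are illustrative assumptions, not NetFI's API.

```python
# Minimal stand-in for stochastic network fault injection: bursty loss from a
# two-state Markov chain and jittered one-way delay. Parameters are invented.
import random

class GilbertElliottLoss:
    def __init__(self, p_good_to_bad=0.01, p_bad_to_good=0.2,
                 loss_good=0.001, loss_bad=0.3):
        self.bad = False
        self.p_gb, self.p_bg = p_good_to_bad, p_bad_to_good
        self.loss = {False: loss_good, True: loss_bad}

    def drop(self) -> bool:
        # Advance the channel state, then sample a loss at the current state.
        self.bad = (random.random() < self.p_gb) if not self.bad \
                   else (random.random() >= self.p_bg)
        return random.random() < self.loss[self.bad]

def one_way_delay_ms(base=40.0, shape=2.0, scale=8.0) -> float:
    """Shifted-gamma jitter: a fixed propagation floor plus random queuing."""
    return base + random.gammavariate(shape, scale)

channel = GilbertElliottLoss()
delivered = [(seq, one_way_delay_ms()) for seq in range(1000)
             if not channel.drop()]
```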
cs.RO / 9 / 2603.06831

Learning-Based Robust Control: Unifying Exploration and Distributional Robustness for Reliable Robotics via Free Energy

Jesawada, Hozefa, Russo, Giovanni, Swikir, Abdalla, Abu-Dakka, Fares
Abstract
A key challenge towards reliable robotic control is devising computational models that can both learn policies and guarantee robustness when deployed in the field. To address these challenges, inspired by the free energy principle in computational neuroscience, we propose a model for policy computation that jointly learns environment dynamics and rewards, while ensuring robustness to epistemic uncertainties. Expounding a distributionally robust free energy principle, we propose a modification to the maximum diffusion learning framework. After explicitly characterizing robustness of our policies to epistemic uncertainties in both environment and reward, we validate their effectiveness on continuous-control benchmarks, via both simulations and real-world experiments involving manipulation with a Franka Research 3 arm. Across simulation and zero-shot deployment, our approach narrows the sim-to-real gap, and enables repeatable tabletop manipulation without task-specific fine-tuning.
cs.RO / 10 / 2603.06832

Receding-Horizon Nullspace Optimization for Actuation-Aware Control Allocation in Omnidirectional UAVs

Pretto, Riccardo, Hamandi, Mahmoud, Ali, Abdullah Mohamed, Alcan, Gokhan, Tzes, Anthony, Abu-Dakka, Fares
Abstract
Fully actuated omnidirectional UAVs enable independent control of forces and torques along all six degrees of freedom, broadening the operational envelope for agile flight and aerial interaction tasks. However, conventional control allocation methods neglect the asymmetric dynamics of the onboard actuators, which can induce oscillatory motor commands and degrade trajectory tracking during dynamic maneuvers. This work proposes a receding-horizon, actuation-aware allocation strategy that explicitly incorporates asymmetric motor dynamics and exploits the redundancy of over-actuated platforms through nullspace optimization. By forward-simulating the closed-loop system over a prediction horizon, the method anticipates actuator-induced oscillations and suppresses them through smooth redistribution of motor commands, while preserving the desired body wrench exactly. The approach is formulated as a constrained optimal control problem solved online via Constrained iterative LQR. Simulation results on the OmniOcta platform demonstrate that the proposed method significantly reduces motor command oscillations compared to a conventional single-step quadratic programming allocator, yielding improved trajectory tracking in both position and orientation.
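The wrench-preserving redistribution at the heart of the method can be illustrated with a linear allocation model. The sketch below assumes a wrench map w = A u and smooths motor commands within null(A) via a single least-squares step; the paper instead optimizes this over a receding horizon with Constrained iterative LQR, which is not reproduced here.

```python
# Nullspace command redistribution for an over-actuated platform: any
# u' = u + N z with N spanning null(A) produces the same body wrench, so z
# can be chosen freely, e.g., to keep commands close to the previous ones.
import numpy as np
from scipy.linalg import null_space

def allocate_smooth(A: np.ndarray, w_des: np.ndarray,
                    u_prev: np.ndarray) -> np.ndarray:
    u_min = np.linalg.pinv(A) @ w_des        # minimum-norm particular solution
    N = null_space(A)                        # basis of the allocation nullspace
    # Minimize ||u_min + N z - u_prev||^2 (smoothness), a linear least-squares
    # problem; the realized wrench A u is unchanged exactly.
    z, *_ = np.linalg.lstsq(N, u_prev - u_min, rcond=None)
    return u_min + N @ z

A = np.random.randn(6, 8)                    # toy 8-rotor omnidirectional platform
u = allocate_smooth(A, w_des=np.ones(6), u_prev=np.zeros(8))
assert np.allclose(A @ u, np.ones(6))        # desired wrench preserved
```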
cs.RO / 11 / 2603.06842

RoboCritics: Enabling Reliable End-to-End LLM Robot Programming through Expert-Informed Critics

Kim, Callie Y., White, Nathan Thomas, He, Evan, Sala, Frederic, Mutlu, Bilge
Abstract
End-user robot programming grants users the flexibility to re-task robots in situ, yet it remains challenging for novices due to the need for specialized robotics knowledge. Large Language Models (LLMs) hold the potential to lower the barrier to robot programming by enabling task specification through natural language. However, current LLM-based approaches generate opaque, "black-box" code that is difficult to verify or debug, creating tangible safety and reliability risks in physical systems. We present RoboCritics, an approach that augments LLM-based robot programming with expert-informed motion-level critics. These critics encode robotics expertise to analyze motion-level execution traces for issues such as joint speed violations, collisions, and unsafe end-effector poses. When violations are detected, critics surface transparent feedback and offer one-click fixes that forward structured messages back to the LLM, enabling iterative refinement while keeping users in the loop. We instantiated RoboCritics in a web-based interface connected to a UR3e robot and evaluated it in a between-subjects user study (n=18). Compared to a baseline LLM interface, RoboCritics reduced safety violations, improved execution quality, and shaped how participants verified and refined their programs. Our findings demonstrate that RoboCritics enables more reliable and user-centered end-to-end robot programming with LLMs.
cs.RO / 12 / 2603.06850

Nonlinear Performance Degradation of Vision-Based Teleoperation under Network Latency

Khalil, Aws, Kwon, Jaerock
Abstract
Teleoperation is increasingly being adopted as a critical fallback for autonomous vehicles. However, the impact of network latency on vision-based, perception-driven control remains insufficiently studied. The present work investigates the nonlinear degradation of closed-loop stability in camera-based lane keeping under varying network delays. To conduct this study, we developed the Latency-Aware Vision Teleoperation testbed (LAVT), a research-oriented ROS 2 framework that enables precise, distributed one-way latency measurement and reproducible delay injection. Using LAVT, we performed 180 closed-loop experiments in simulation across diverse road geometries. Our findings reveal a sharp collapse in stability between 150 ms and 225 ms of one-way perception latency, where route completion rates drop from 100% to below 50% as oscillatory instability and phase-lag effects emerge. We further demonstrate that additional control-channel delay compounds these effects, significantly accelerating system failure even under constant visual latency. By combining this systematic empirical characterization with the LAVT testbed, this work provides quantitative insights into perception-driven instability and establishes a reproducible baseline for future latency-compensation and predictive control strategies. Project page, supplementary video, and code are available at https://bimilab.github.io/paper-LAVT
cs.RO / 13 / 2603.06864

Robodimm: A Physics-Grounded Framework for Automated Actuator Sizing in Scalable Modular Robots

Torres, J. L., Munoz, M., Alvarez, J. D., Blanco, J. L., Gimenez, A.
Abstract
Selecting an appropriate motor-gearbox combination is a critical design task in robotics because it directly affects cost, mass, and dynamic performance. This process is especially challenging in modular robots with closed kinematic chains, where joint torques are coupled and actuator inertia propagates through the mechanism. We present Robodimm, a software framework for automated actuator sizing in scalable robot architectures. By leveraging Pinocchio for dynamics and Pink for inverse kinematics, Robodimm uses a Karush-Kuhn-Tucker (KKT) formulation for constrained inverse dynamics. The platform supports parametric scaling, interactive trajectory programming through jog modes, and a two-round validation workflow that addresses actuator self-weight effects.
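The abstract names a KKT formulation for constrained inverse dynamics without giving details; one standard way to pose it (an assumption here, not necessarily Robodimm's exact choice) is a torque-norm-minimizing quadratic program subject to the closed-chain equations of motion:

$$
\min_{\tau,\,\lambda}\ \tfrac{1}{2}\,\tau^{\top} W \tau
\quad\text{s.t.}\quad
M(q)\,\ddot q + C(q,\dot q)\,\dot q + g(q) \;=\; S^{\top}\tau + J_c(q)^{\top}\lambda,
$$

whose stationarity conditions reduce each trajectory sample to one sparse linear KKT solve:

$$
\begin{bmatrix} W & 0 & -S \\ 0 & 0 & -J_c \\ S^{\top} & J_c^{\top} & 0 \end{bmatrix}
\begin{bmatrix} \tau \\ \lambda \\ \nu \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ M\ddot q + C\dot q + g \end{bmatrix},
$$

where $\tau$ are actuator torques, $\lambda$ the loop-closure constraint forces, $J_c$ the constraint Jacobian of the closed chain, $S$ the actuation selection matrix, and $\nu$ the multiplier of the dynamics constraint.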
cs.RO / 14 / 2603.06866

CAR: Cross-Vehicle Kinodynamics Adaptation via Mobility Representation

Xu, Tong, Pan, Chenhui, Xiao, Xuesu
Abstract
Developing autonomous off-road mobility typically requires either extensive, platform-specific data collection or relies on simplified abstractions, such as unicycle or bicycle models, that fail to capture the complex kinodynamics of diverse platforms, ranging from wheeled to tracked vehicles. This limitation hinders scalability across evolving heterogeneous autonomous robot fleets. To address this challenge, we propose Cross-vehicle kinodynamics Adaptation via mobility Representation (CAR), a novel framework that enables rapid mobility transfer to new vehicles. CAR employs a Transformer encoder with Adaptive Layer Normalization to embed vehicle trajectory transitions and physical configurations into a shared mobility latent space. By identifying and extracting commonality from nearest neighbors within this latent space, our approach enables rapid kinodynamics adaptation to novel platforms with minimal data collection and computational overhead. We evaluate CAR using the Verti-Bench simulator, built on the Chrono multi-physics engine, and validate its performance on four distinct physical configurations of the Verti-4-Wheeler platform. With only one minute of new trajectory data, CAR achieves up to 67.2% reduction in prediction error compared to direct neighbor transfer across diverse unseen vehicle configurations, demonstrating the effectiveness of cross-vehicle mobility knowledge transfer in both simulated and real-world environments.
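The retrieval step implied by "identifying and extracting commonality from nearest neighbors" admits a very small sketch. The embeddings below are assumed to come from a hypothetical black-box encoder standing in for CAR's Transformer encoder with Adaptive Layer Normalization; the Euclidean metric and inverse-distance weighting are illustrative choices, not the paper's.

```python
# Borrow dynamics knowledge from the k closest mobility embeddings.
import numpy as np

def nearest_neighbor_transfer(z_query: np.ndarray,
                              z_bank: np.ndarray,
                              params_bank: np.ndarray,
                              k: int = 3) -> np.ndarray:
    """Average stored kinodynamic parameters of the k nearest embeddings,
    weighted by inverse distance in the shared latent space."""
    d = np.linalg.norm(z_bank - z_query, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-8)
    return (w[:, None] * params_bank[idx]).sum(0) / w.sum()
```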
cs.RO / 15 / 2603.06879

Material Driven HRI Design: Aesthetics as Explainability

Friedman, Natalie, Weatherwax, Kevin, Zhu, Chengchao
Abstract
Aesthetics, often treated as secondary to function, guides how people interpret robots' roles. A great many robot designs, both real and fictitious, use sleek industrial aesthetics. These feature hard glossy plastics, hiding as much of the underlying mechanical and electrical components as possible, resembling something akin to a nude humanoid figure. This leaves robots as something of a blank slate to which end-users apply coverings, often based on media of fiction and non-fiction alike. We argue that designers can take cues from fashion to design interaction and set appropriate expectations. Rather than viewing appearance as decoration, we propose that color, texture, and material choices function as interaction signals. These signals can invite or discourage touch, clarify a robot's role, and help align user expectations with a robot's actual capabilities. When done thoughtfully, such cues can create familiarity and legibility; when done poorly, they can lead to wrong expectations. This preliminary paper proposes a framework describing how materials can create explainability by signaling expectations for interaction, task, and environment. We use this framework to do a content analysis of 6 robots.
cs.RO / 16 / 2603.06887

VertiAdaptor: Online Kinodynamics Adaptation for Vertically Challenging Terrain

Xu, Tong, Pan, Chenhui, Datar, Aniket, Xiao, Xuesu
Abstract
Autonomous driving in off-road environments presents significant challenges due to the dynamic and unpredictable nature of unstructured terrain. Traditional kinodynamic models often struggle to generalize across diverse geometric and semantic terrain types, underscoring the need for real-time adaptation to ensure safe and reliable navigation. We propose VertiAdaptor (VA), a novel online adaptation framework that efficiently integrates elevation with semantic embeddings to enable terrain-aware kinodynamic modeling and planning via function encoders. VA learns a kinodynamic space spanned by a set of neural ordinary differential equation basis functions, capturing complex vehicle-terrain interactions across varied environments. After offline training, the proposed approach can rapidly adapt to new, unseen environments by identifying kinodynamics in the learned space through a computationally efficient least-squares calculation. We evaluate VA within the Verti-Bench simulator, built on the Chrono multi-physics engine, and validate its performance both in simulation and on a physical Verti-4-Wheeler platform. Our results demonstrate that VA improves prediction accuracy by up to 23.9% and achieves a 5X faster adaptation time, advancing the robustness and reliability of autonomous robots in complex and evolving off-road environments.
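Adaptation by least squares in a learned basis has a simple generic form: with K basis functions fixed offline, a new terrain's dynamics are identified as their best linear combination on a small buffer of transitions. The sketch below assumes a hypothetical callable `basis(x, u)` returning a (K, state_dim) stack of basis predictions in place of VA's neural-ODE bases.

```python
# Function-encoder style identification: fit mixing coefficients c so that
# next-state y ≈ sum_k c_k * phi_k(x, u), via ordinary least squares.
import numpy as np

def identify_coefficients(basis, X, U, Y):
    """X, U, Y: buffers of states, actions, next states (e.g., one minute of
    driving). Returns the (K,) coefficient vector for the new environment."""
    Phi = np.stack([basis(x, u) for x, u in zip(X, U)])   # (N, K, D)
    N, K, D = Phi.shape
    A = Phi.transpose(0, 2, 1).reshape(N * D, K)          # flatten samples/dims
    b = np.asarray(Y).reshape(N * D)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return c

def predict(basis, c, x, u):
    """Next-state estimate in the identified kinodynamic space."""
    return c @ basis(x, u)                                # (D,)
```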
cs.RO / 17 / 2603.06898

Collaborative Planning with Concurrent Synchronization for Operationally Constrained UAV-UGV Teams

Deng, Zihao, Li, Qianhuang, Gao, Peng, Wigness, Maggie, Rogers, John, Kim, Donghyun, Zhang, Hao
Abstract
Collaborative planning under operational constraints is an essential capability for heterogeneous robot teams tackling complex large-scale real-world tasks. Unmanned Aerial Vehicles (UAVs) offer rapid environmental coverage, but flight time is often limited by energy constraints, whereas Unmanned Ground Vehicles (UGVs) have greater energy capacity to support long-duration missions, but movement is constrained by traversable terrain. Individually, neither can complete tasks such as environmental monitoring. Effective UAV-UGV collaboration therefore requires energy-constrained multi-UAV task planning, traversability-constrained multi-UGV path planning, and crucially, synchronized concurrent co-planning to ensure timely in-mission recharging. To enable these capabilities, we propose Collaborative Planning with Concurrent Synchronization (CoPCS), a learning-based approach that integrates a heterogeneous graph transformer for operationally constrained task encoding with a transformer decoder for joint, synchronized co-planning that enables UAVs and UGVs to act concurrently in a coordinated manner. CoPCS is trained end-to-end under a unified imitation learning paradigm. We conducted extensive experiments to evaluate CoPCS in both robotic simulations and physical robot teams. Experimental results demonstrate that our method provides the novel multi-robot capability of synchronized concurrent co-planning and substantially improves team performance. More details of this work are available on the project website: https://hcrlab.gitlab.io/project/CoPCS.
cs.RO / 18 / 2603.06914

SysNav: Multi-Level Systematic Cooperation Enables Real-World, Cross-Embodiment Object Navigation

Zhu, Haokun, Li, Zongtai, Liu, Zihan, Guo, Kevin, Lin, Zhengzhi, Cai, Yuxin, Chen, Guofei, Lv, Chen, Wang, Wenshan, Oh, Jean, Zhang, Ji
Abstract
Object navigation (ObjectNav) in real-world environments is a complex problem that requires simultaneously addressing multiple challenges, including complex spatial structure, long-horizon planning and semantic understanding. Recent advances in Vision-Language Models (VLMs) offer promising capabilities for semantic understanding, yet effectively integrating them into real-world navigation systems remains a non-trivial challenge. In this work, we formulate real-world ObjectNav as a system-level problem and introduce SysNav, a three-level ObjectNav system designed for real-world cross-embodiment deployment. SysNav decouples semantic reasoning, navigation planning and motion control to ensure robustness and generalizability. At the high level, we summarize the environment into a structured scene representation and leverage VLMs to provide semantic-grounded navigation guidance. At the mid level, we introduce a hierarchical room-based navigation strategy that reserves VLM guidance for room-level decisions, which effectively utilizes its reasoning ability while ensuring system efficiency. At the low level, planned waypoints are executed through different embodiment-specific motion control modules. We deploy our system on three embodiments: a custom-built wheeled robot, the Unitree Go2 quadruped and the Unitree G1 humanoid, and conduct 190 real-world experiments. Our system achieves substantial improvements in both success rate and navigation efficiency. To the best of our knowledge, SysNav is the first system capable of reliably and efficiently completing building-scale long-range object navigation in complex real-world environments. Furthermore, extensive experiments on four simulation benchmarks demonstrate state-of-the-art performance. Project page is available at: https://cmu-vln.github.io/.
cs.RO / 19 / 2603.06918

T2Nav: Algebraic Topology Aware Temporal Graph Memory and Loop Detection for Zero-Shot Visual Navigation

D., Quang-Anh N., Pham, Duc, Nguyen, Minh-Anh, Doan, Tung, Dang, Tuan
Abstract
Deploying autonomous agents in real-world environments is challenging, particularly for navigation, where systems must adapt to situations they have not encountered before. Traditional learning approaches require substantial amounts of data, constant tuning, and sometimes starting over for each new task, which makes them hard to scale and inflexible. Recent breakthroughs in foundation models, such as large language models and vision language models, enable systems to attempt new navigation tasks without requiring additional training. However, many of these methods only work with specific input types, employ relatively basic reasoning, and fail to fully exploit the details they observe or the structure of the spaces. Here, we introduce T2Nav, a zero-shot navigation system that integrates heterogeneous data and employs graph-based reasoning. By directly incorporating visual information into the graph and matching it to the environment, our approach enables the system to strike a good balance between exploration and goal attainment. This strategy allows robust obstacle avoidance, reliable loop closure detection, and efficient path planning while eliminating redundant exploration patterns. The system demonstrates flexibility by handling goals specified using reference images of target object instances, making it particularly suitable for scenarios in which agents must navigate to visually similar yet spatially distinct instances. Experiments demonstrate that our approach is efficient and adapts well to unknown environments, moving toward practical zero-shot instance-image navigation capabilities.
cs.RO / 20 / 2603.06919

SurgSync: Time-Synchronized Multi-Modal Data Collection Framework and Dataset for Surgical Robotics

Zhou, Haoying, Liu, Chang, Wu, Yimeng, Wu, Junlin, Wu, Zijian, Lee, Yu Chung, Martuscelli, Sara, Salcudean, Septimiu E., Fischer, Gregory S., Kazanzides, Peter
Abstract
Most existing robotic surgery systems adopt a human-in-the-loop paradigm, often with the surgeon directly teleoperating the robotic system. Adding intelligence to these robots would enable higher-level control, such as supervised autonomy or even full autonomy. However, artificial intelligence (AI) requires large amounts of training data, which is currently lacking. This work proposes SurgSync, a multi-modal data collection framework with offline and online synchronization to support training and real-time inference, respectively. The framework is implemented on a da Vinci Research Kit (dVRK) and introduces (1) dual-mode (online/offline-matching) synchronized recorders, (2) a modern stereo endoscope to achieve image quality on par with clinical systems, and (3) additional sensors such as a side-view camera and a novel capacitive contact sensor to provide ground truth contact data. The framework also incorporates a post-processing toolbox for tasks such as depth estimation, optical flow, and a practical kinematic reprojection method using Gaussian heatmap. User studies with participants of varying skill levels are performed with ex-vivo tissue to provide clinically realistic data, and a network for surgical skill assessment is employed to demonstrate utilization of the collected data. Through the user study experiments, we obtained a dataset of 214 validated instances across multiple canonical training tasks. All software and data are available at surgsync.github.io.
cs.RO / 21 / 2603.06921

CN-CBF: Composite Neural Control Barrier Function for Safe Robot Navigation in Dynamic Environments

Derajić, Bojan, Bernhard, Sebastian, Hönig, Wolfgang
Abstract
Safe navigation of autonomous robots remains one of the core challenges in the field, especially in dynamic and uncertain environments. One of the prevalent approaches is safety filtering based on control barrier functions (CBFs), which are easy to deploy but difficult to design. Motivated by the shortcomings of existing learning- and model-based methods, we propose a simple yet effective neural CBF design method for safe robot navigation in dynamic environments. We employ the idea of a composite CBF, where multiple neural CBFs are combined into a single CBF. The individual CBFs are trained via the Hamilton-Jacobi reachability framework to approximate the optimal safe set for single moving obstacles. Additionally, we use the residual neural architecture, which guarantees that the estimated safe set does not intersect with the corresponding failure set. The method is extensively evaluated in simulation experiments for a ground robot and a quadrotor, comparing it against several baseline methods. The results show improved success rates of up to 18% compared to the best baseline, without increasing the conservativeness of the motion. Also, the method is demonstrated in hardware experiments for both types of robots.
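A toy version of the safety-filter mechanics is easy to write down. The sketch below combines per-obstacle barriers by taking the most constraining (minimum-h) one, and solves the resulting single-constraint CBF-QP in closed form for a single-integrator robot; analytic distance barriers stand in for the paper's Hamilton-Jacobi-trained neural CBFs, and the linear class-K gain is an illustrative choice.

```python
# Composite-CBF safety filter on a single integrator x_dot = u.
import numpy as np

ALPHA = 1.0  # class-K gain: enforce h_dot >= -ALPHA * h

def barrier(x, obs_pos, obs_vel, radius):
    """h(x) = ||x - p||^2 - r^2 for one moving obstacle. Returns h, the input
    gradient Lgh = 2(x - p), and the drift term from obstacle motion."""
    d = x - obs_pos
    return d @ d - radius**2, 2 * d, -2 * d @ obs_vel

def safe_input(x, u_des, obstacles):
    """Closed-form solution of min ||u - u_des||^2 s.t. Lgh u + drift >= -ALPHA h,
    applied to the active (minimum-h) barrier of the composite."""
    h, Lgh, drift = min((barrier(x, *o) for o in obstacles), key=lambda t: t[0])
    residual = Lgh @ u_des + drift + ALPHA * h
    if residual >= 0:                        # nominal input already safe
        return u_des
    return u_des - residual * Lgh / (Lgh @ Lgh)

x = np.array([0.0, 0.0])
obstacles = [(np.array([1.0, 0.1]), np.array([-0.5, 0.0]), 0.4)]
u = safe_input(x, u_des=np.array([1.0, 0.0]), obstacles=obstacles)
```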
cs.RO / 22 / 2603.06924

LIPP: Load-Aware Informative Path Planning with Physical Sampling

Kim, Hojune, Shi, Guangyao, Sukhatme, Gaurav S.
Abstract
In classical Informative Path Planning (C-IPP), robots are typically modeled as mobile sensors that acquire digital measurements such as images or radiation levels. In this model - since making a measurement leaves the robot's physical state unchanged - traversal costs are determined solely by the path taken. This is a natural assumption for many missions, but does not extend to settings involving physical sample collection, where each collected sample adds mass and increases the energy cost of all subsequent motion. As a result, IPP formulations that ignore this coupling between information gain and load-dependent traversal cost can produce plans that are distance-efficient but energy-suboptimal, collecting fewer samples and less data than the energy budget would permit. In this paper, we introduce Load-aware Informative Path Planning (LIPP), a generalization of C-IPP that explicitly models this coupling and the resulting order-dependent traversal costs. We formulate LIPP as a Mixed-Integer Quadratic Program (MIQP) that jointly optimizes routing, visitation order, and per-location sampling count under an energy budget. We show that LIPP strictly generalizes C-IPP: as sample unit mass $\lambda \to 0$, the load-dependent energy model reduces exactly to the classical distance budget constraint, recovering C-IPP as a special case. We further derive theoretical bounds on the path-length increase of LIPP relative to C-IPP, characterizing the trade-off for improved energy efficiency. Finally, through extensive simulations across 2000 diverse mission scenarios, we demonstrate that LIPP matches the behavior of C-IPP at zero sample mass and progressively achieves higher uncertainty reduction per unit energy as sample mass increases.
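The coupling that makes LIPP order-dependent is easy to state concretely. The sketch below assumes a cost of (base mass + carried sample mass) times distance per leg, with all names illustrative; it also shows the claimed limit, where setting the unit sample mass $\lambda$ to zero recovers a pure distance budget.

```python
# Load-dependent tour energy: later legs cost more once samples are carried.
def tour_energy(legs, samples_per_stop, lam, m0=1.0):
    """legs[i]: distance of the i-th leg; samples_per_stop[i]: samples picked
    up at the stop *before* leg i; lam: unit sample mass (lambda)."""
    carried, energy = 0.0, 0.0
    for dist, n in zip(legs, samples_per_stop):
        carried += lam * n              # mass collected before this leg
        energy += (m0 + carried) * dist
    return energy

legs = [3.0, 2.0, 4.0]
# lam -> 0 reduces to m0 * total distance: the classical C-IPP distance budget.
print(tour_energy(legs, [2, 1, 0], lam=0.0))   # 9.0 = 1.0 * (3 + 2 + 4)
print(tour_energy(legs, [2, 1, 0], lam=0.5))   # 21.0: carrying mass is costly
```

Note that swapping the visitation order changes the cost for any lam > 0, which is exactly why the MIQP must optimize routing and sampling jointly.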
cs.RO / 23 / 2603.06927

A Contrastive Fewshot RGBD Traversability Segmentation Framework for Indoor Robotic Navigation

An, Qiyuan, Dang, Tuan, Makedon, Fillia
Abstract
Indoor traversability segmentation aims to identify safe, navigable free space for autonomous agents, which is critical for robotic navigation. Pure vision-based models often fail to detect thin obstacles, such as chair legs, which can pose serious safety risks. We propose a multi-modal segmentation framework that leverages RGB images and sparse 1D laser depth information to capture geometric interactions and improve the detection of challenging obstacles. To reduce the reliance on large labeled datasets, we adopt the few-shot segmentation (FSS) paradigm, enabling the model to generalize from limited annotated examples. Traditional FSS methods focus solely on positive prototypes, often leading to overfitting to the support set and poor generalization. To address this, we introduce a negative contrastive learning (NCL) branch that leverages negative prototypes (obstacles) to refine free-space predictions. Additionally, we design a two-stage attention depth module to align 1D depth vectors with RGB images both horizontally and vertically. Extensive experiments on our custom-collected indoor RGB-D traversability dataset demonstrate that our method outperforms state-of-the-art FSS and RGB-D segmentation baselines, achieving up to 9% higher mIoU under both 1-shot and 5-shot settings. These results highlight the effectiveness of leveraging negative prototypes and sparse depth for robust and efficient traversability segmentation.
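Prototype matching with an explicit negative branch reduces to a few array operations. The sketch below uses masked average pooling over support features from an unspecified backbone and labels a query pixel free space only when it is closer to the positive prototype than to the negative one; shapes and the decision rule are illustrative stand-ins for the paper's NCL branch.

```python
# Positive (free space) vs. negative (obstacle) prototype matching.
import numpy as np

def masked_prototype(feat, mask):
    """feat: (C, H, W) support features; mask: (H, W) binary region."""
    w = mask / (mask.sum() + 1e-8)
    return (feat * w).sum(axis=(1, 2))                     # (C,) prototype

def cosine_map(feat, proto):
    """Per-pixel cosine similarity between query features and a prototype."""
    f = feat / (np.linalg.norm(feat, axis=0, keepdims=True) + 1e-8)
    p = proto / (np.linalg.norm(proto) + 1e-8)
    return np.einsum('chw,c->hw', f, p)

def predict_free_space(query_feat, support_feat, free_mask):
    pos = masked_prototype(support_feat, free_mask)
    neg = masked_prototype(support_feat, 1.0 - free_mask)  # obstacle prototype
    return cosine_map(query_feat, pos) > cosine_map(query_feat, neg)
```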
cs.RO / 24 / 2603.06928

Failure Mechanisms and Risk Estimation for Legged Robot Locomotion on Granular Slopes

Liao, Xingjue, Qian, Feifei
Abstract
Locomotion on granular slopes such as sand dunes remains a fundamental challenge for legged robots due to reduced shear strength and gravity-induced anisotropic yielding of granular media. Using a hexapedal robot on a tiltable granular bed, we systematically measure locomotion speed together with slope-dependent normal and shear granular resistive forces. While normal penetration resistance remains nearly unchanged with inclination, shear resistance decreases substantially as slope angle increases. Guided by these measurements, we develop a simple robot-terrain interaction model that predicts anchoring timing, step length, and resulting robot speed, as functions of terrain strength and slope angle. The model reveals that slope-induced performance loss is primarily governed by delayed anchoring and increased backward slip rather than excessive sinkage. By extending the model to generalized terrain conditions, we construct failure phase diagrams that identify sinkage- and slippage-induced failure regimes, enabling quantitative risk estimation for locomotion on granular slopes. This physics-informed framework provides predictive insight into terrain-dependent failure mechanisms and offers guidance for safer and more robust robot operation on deformable inclines.
cs.RO / 25 / 2603.06947

Feasibility Restoration under Conflicting STL Specifications with Pareto-Optimal Refinement

Wu, Tianhao, Lyu, Yiwei
Abstract
Signal Temporal Logic (STL) is an expressive formal language that specifies spatio-temporal requirements in robotics. Its quantitative robustness semantics can be easily integrated with optimization-based control frameworks. However, STL specifications may become conflicting in real-world applications, where safety rules, traffic regulations, and task objectives cannot be satisfied together. In these situations, traditional STL-constrained Model Predictive Control (MPC) becomes infeasible and defaults to conservative behaviors such as freezing, which can largely increase risks in safety-critical scenarios. In this paper, we propose a unified two-stage framework that first restores feasibility via minimal relaxation, then refines the feasible solution by formulating it as a value-aware multi-objective optimization problem. Using the $\varepsilon$-constraint method, we approximate the Pareto front of the multi-objective optimization, which allows analysis of tradeoffs among competing objectives and counterfactual analysis of alternative actions. Through a case study in autonomous driving, we demonstrate that the proposed approach avoids deadlock under conflicting STL specifications and enables interpretable decision-making in safety-critical applications.
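The $\varepsilon$-constraint pass is a standard scalarization: keep one objective and demote the others to inequality constraints with a swept bound. A minimal sketch follows, with two quadratic objectives as illustrative stand-ins for the paper's relaxation and robustness terms.

```python
# Trace an approximate Pareto front by sweeping the epsilon bound on f2.
import numpy as np
from scipy.optimize import minimize

f1 = lambda x: (x[0] - 1.0) ** 2 + x[1] ** 2        # e.g., tracking cost
f2 = lambda x: x[0] ** 2 + (x[1] - 1.0) ** 2        # e.g., rule-violation cost

pareto = []
for eps in np.linspace(0.1, 2.0, 10):
    res = minimize(f1, x0=np.zeros(2),
                   constraints=[{'type': 'ineq',     # requires fun(x) >= 0
                                 'fun': lambda x, e=eps: e - f2(x)}])
    if res.success:
        pareto.append((f1(res.x), f2(res.x)))        # one point per epsilon
```

Each recovered point also supports the counterfactual reading in the abstract: it is the best achievable f1 given that the competing objective is held below a specified tolerance.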
cs.RO / 26 / 2603.06954

Is Your Safe Controller Actually Safe? A Critical Review of CBF Tautologies and Hidden Assumptions

Kim, Taekyung
Abstract
This tutorial provides a critical review of the practical application of Control Barrier Functions (CBFs) in robotic safety. While the theoretical foundations of CBFs are well-established, I identify a recurring gap between the mathematical assumption of a safe controller's existence and its constructive realization in systems with input constraints. I highlight the distinction between candidate and valid CBFs by analyzing the interplay of system dynamics, actuation limits, and class-K functions. I further show that some purported demonstrations of safe robot policies or controllers are limited to passively safe systems, such as single integrators or kinematic manipulators, where safety is already inherited from the underlying physics and even naive geometric hard constraints suffice to prevent collisions. By revisiting simple low-dimensional examples, I show when CBF formulations provide valid safety guarantees and when they fail due to common misuses. I then provide practical guidelines for constructing realizable safety arguments for systems without such passive safety. The goal of this tutorial is to bridge the gap between theoretical guarantees and actual implementation, supported by an open-source interactive web demonstration that visualizes these concepts intuitively.
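A classic low-dimensional example of the candidate-versus-valid distinction (standard in the CBF literature, and of the kind this tutorial revisits) is a double integrator braking before a wall: $\dot p = v$, $\dot v = u$, $|u| \le u_{\max}$, with safety requirement $p \ge 0$. The naive candidate $h_1(x) = p$ gives $\dot h_1 = v$, which contains no input at all, so whenever $v < -\alpha(p)$ the condition $\dot h_1 \ge -\alpha(h_1)$ is violated with no admissible recourse: $h_1$ is only a candidate. The braking-distance barrier

$$
h_2(x) \;=\; p - \frac{v^2}{2u_{\max}} \qquad (v \le 0)
$$

instead satisfies $\dot h_2 = v\left(1 - \tfrac{u}{u_{\max}}\right)$, and the admissible input $u = u_{\max}$ yields $\dot h_2 = 0 \ge -\alpha(h_2)$ everywhere on $\{h_2 \ge 0\}$, so the CBF inequality is always realizable within the input bounds and $h_2$ is valid.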
cs.RO / 27 / 2603.06955

Energy-Efficient Collaborative Transport of Tether-Suspended Payloads via Rotating Equilibrium

Foss, Eric, Tai, Andrew, Bosio, Carlo, Mueller, Mark W.
Abstract
Collaborative aerial transportation of tethered payloads is fundamentally limited by space, power, and weight constraints. Conventional approaches rely on static equilibrium conditions, where each vehicle tilts to generate the forces that ensure they maintain a formation geometry that avoids aerodynamic interactions and collision. This horizontal thrust component represents a significant energy penalty compared to the ideal case in which each vehicle produces purely vertical thrust to lift the payload. Operating in tighter tether configurations can minimize this effect, but at the cost of either having to fly the vehicles in closer proximity, which risks collision, or significantly increasing the length of the tether, which increases complexity and reduces potential use-cases. We propose operating the tether-suspended flying system at a rotating equilibrium. By maintaining steady circular motion, centrifugal forces provide the necessary horizontal tether tension, allowing each quadrotor to generate purely vertical thrust and thus reducing the total force (and power) required compared to an equilibrium where the thrusts are not vertical. It also allows for a wider range of tether configurations to be used without sacrificing efficiency. Results demonstrate that rotating equilibria can reduce power consumption relative to static lifting by up to 20%, making collaborative aerial solutions more practically relevant.
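A point-mass force balance shows why rotation removes the horizontal thrust penalty (a simplified sketch neglecting drag and tether mass, not the paper's full model). With $n$ vehicles on tethers at angle $\theta$ from vertical supporting a payload of weight $W_p$, each tether carries tension $T$ with $nT\cos\theta = W_p$. In static equilibrium a vehicle of weight $W_v$ must also cancel the horizontal tension component $T\sin\theta$ with thrust, so the per-vehicle thrust magnitude is

$$
f_{\text{static}} \;=\; \sqrt{(W_v + T\cos\theta)^2 + (T\sin\theta)^2},
$$

whereas in steady rotation at radius $R$ and rate $\omega$ chosen so that $m_v \omega^2 R = T\sin\theta$, the tether itself supplies the centripetal force and the thrust is purely vertical:

$$
f_{\text{rot}} \;=\; W_v + T\cos\theta \;\le\; f_{\text{static}}.
$$

Since ideal rotor hover power grows superlinearly with thrust ($\propto f^{3/2}$), the rotating equilibrium requires less power, consistent with the reported savings of up to 20%.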
cs.RO / 28 / 2603.06979

VSL-Skin: Individually Addressable Phase-Change Voxel Skin for Variable-Stiffness and Virtual Joints Bridging Soft and Rigid Robots

Zeng, Zihan Oliver, An, Jiajun, Luk, Preston, Kaur, Upinder
Abstract
Soft robots are compliant but often cannot support loads or hold their shape, while rigid robots provide structural strength but are less adaptable. Existing variable-stiffness systems usually operate at the scale of whole segments or patches, which limits precise control over stiffness distribution and virtual joint placement. This paper presents the Variable Stiffness Lattice Skin (VSL-Skin), the first system to enable individually addressable voxel-level morphological control with centimeter-scale precision. The system provides three main capabilities: nearly two orders of magnitude stiffness modulation across axial (15-1200 N/mm), shear (45-850 N/mm), bending (8*10^2 - 3*10^4 N/deg), and torsional modes with centimeter-scale spatial control; the first demonstrated 30% axial compression in phase-change systems while maintaining structural integrity; and autonomous component-level self-repair through thermal cycling, which eliminates fatigue accumulation and enables programmable sacrificial joints for predictable failure management. Selective voxel activation creates six canonical virtual joint types with programmable compliance while preserving structural integrity in non-activated regions. The platform incorporates closed-form design models and finite element analysis for predictive synthesis of stiffness patterns and joint placement. Experimental validation demonstrates 30% axial contraction, thermal switching in 75-second cycles, and cut-to-fit integration that preserves addressability after trimming. The row-column architecture enables platform-agnostic deployment across diverse robotic systems without specialized infrastructure. This framework establishes morphological intelligence as an engineerable system property and advances autonomous reconfigurable robotics.
Chinese Translation
软机器人具有柔性,但通常无法承载负荷或保持形状,而刚性机器人提供结构强度但适应性较差。现有的可变刚度系统通常在整个段落或补丁的尺度上运行,这限制了对刚度分布和虚拟关节位置的精确控制。本文提出了可变刚度格子皮肤(Variable Stiffness Lattice Skin,VSL-Skin),这是第一个能够以厘米级精度实现单独寻址的体素级形态控制的系统。该系统提供三项主要功能:在轴向(15-1200 N/mm)、剪切(45-850 N/mm)、弯曲(8*10^2 - 3*10^4 N/deg)和扭转模式下,刚度调节范围近两个数量级,并具备厘米级空间控制;在相变系统中首次实现30%的轴向压缩,同时保持结构完整性;通过热循环实现自主组件级自修复,消除疲劳积累,并实现可编程的牺牲关节,以便于可预测的故障管理。选择性体素激活创造了六种典型的虚拟关节类型,具有可编程的柔性,同时在未激活区域保持结构完整性。该平台结合了封闭形式设计模型和有限元分析,以预测刚度模式和关节位置的合成。实验验证显示30%的轴向收缩、75秒周期内的热切换,以及在修剪后保持寻址能力的切合集成。行列架构使得该平台能够在不同的机器人系统中无专用基础设施地进行跨平台部署。该框架确立了形态智能作为可工程化的系统属性,并推动了自主可重构机器人技术的发展。
cs.RO / 29 / 2603.06987

Foundational World Models Accurately Detect Bimanual Manipulator Failures

基础世界模型准确检测双手操纵器故障
Ward, Isaac R., Ho, Michelle, Liu, Houjun, Feldman, Aaron, Vincent, Joseph, Kruse, Liam, Cheong, Sean, Eddy, Duncan, Kochenderfer, Mykel J., Schwager, Mac
Abstract
Deploying visuomotor robots at scale is challenging due to the potential for anomalous failures to degrade performance, cause damage, or endanger human life. Bimanual manipulators are no exception; these robots have vast state spaces composed of high-dimensional images and proprioceptive signals. Explicitly defining failure modes within such state spaces is infeasible. In this work, we overcome these challenges by training a probabilistic, history-informed world model within the compressed latent space of a pretrained vision foundation model (NVIDIA's Cosmos Tokenizer). The model outputs uncertainty estimates alongside its predictions that serve as non-conformity scores within a conformal prediction framework. We use these scores to develop a runtime monitor, correlating periods of high uncertainty with anomalous failures. To test these methods, we use the simulated Push-T environment and the Bimanual Cable Manipulation dataset, the latter of which we introduce in this work. This new dataset features trajectories with multiple synchronized camera views, proprioceptive signals, and annotated failures from a challenging data center maintenance task. We benchmark our methods against baselines from the anomaly detection and out-of-distribution detection literature, and show that our approach considerably outperforms statistical techniques. Furthermore, we show that our approach requires approximately one twentieth as many trainable parameters as the next-best learning-based approach, yet outperforms it by 3.8% in terms of failure detection rate, paving the way toward safely deploying manipulator robots in real-world environments where reliability is non-negotiable.
Chinese Translation
大规模部署视觉运动机器人面临挑战,因为潜在的异常故障可能会降低性能、造成损害或危及人类生命。双手操纵器也不例外;这些机器人具有由高维图像和本体感觉信号组成的广泛状态空间。在这样的状态空间中明确定义故障模式是不可行的。在本研究中,我们通过在预训练视觉基础模型(NVIDIA的Cosmos Tokenizer)的压缩潜在空间内训练一个概率性、历史信息驱动的世界模型来克服这些挑战。该模型在其预测中输出不确定性估计,这些估计作为符合性预测框架中的非一致性评分。我们利用这些评分开发了一个运行时监控器,将高不确定性时期与异常故障相关联。为了测试这些方法,我们使用了模拟的Push-T环境和双手电缆操作数据集,后者在本研究中首次引入。这个新数据集包含多个同步摄像头视角的轨迹、本体感觉信号以及来自一个具有挑战性的数据中心维护任务的标注故障。我们将我们的方法与异常检测和分布外检测文献中的基线进行基准测试,结果表明我们的方法显著优于统计技术。此外,我们还展示了我们的方法所需的可训练参数大约是下一个最佳基于学习的方法的二十分之一,但在故障检测率方面超出其3.8%,为在可靠性不可妥协的现实环境中安全部署操纵器机器人铺平了道路。
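The monitoring recipe is straightforward to illustrate. A minimal sketch of split conformal calibration, assuming the world model already emits a scalar nonconformity score per timestep; the gamma-distributed scores below are synthetic stand-ins, and the paper's exact calibration scheme may differ:

```python
# Split-conformal runtime monitor: calibrate a threshold on nominal
# (failure-free) rollouts, then flag runtime steps whose nonconformity
# score exceeds it.
import numpy as np

def conformal_threshold(calib_scores, alpha=0.05):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(calib_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(calib_scores, min(q, 1.0), method="higher")

rng = np.random.default_rng(0)
calib = rng.gamma(2.0, 1.0, size=500)          # stand-in nominal scores
tau = conformal_threshold(calib, alpha=0.05)

runtime_scores = rng.gamma(2.0, 1.0, size=20)
runtime_scores[12:] += 6.0                     # injected anomaly
alarms = runtime_scores > tau                  # flag high-uncertainty steps
print(f"tau={tau:.2f}, alarms at steps {np.flatnonzero(alarms).tolist()}")
```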
cs.RO / 30 / 2603.07003

Two-Stage Path Following for Mobile Manipulators via Dimensionality-Reduced Graph Search and Numerical Optimization

基于降维图搜索和数值优化的移动操纵器双阶段路径跟踪
Guo, Fuyu, Mei, Yuting, Zhang, Yuyao, Tang, Qian
Abstract
Efficient path following for mobile manipulators is often hindered by high-dimensional configuration spaces and kinematic constraints. This paper presents a robust two-stage configuration planning framework that decouples the 8-DoF planning problem into a tractable 2-DoF base optimization under a yaw-fixed base planning assumption. In the first stage, the proposed approach utilizes IRM to discretize the task-space path into a multi-layer graph, where an initial feasible path is extracted via a Dijkstra-based dynamic programming approach to ensure computational efficiency and global optimality within the discretized graph. In the second stage, to overcome discrete search quantization, feasible base regions are transformed into convex hulls, enabling subsequent continuous refinement via the L-BFGS algorithm to maximize trajectory smoothness while strictly enforcing reachability constraints. Simulation results demonstrate the theoretical precision of the proposed method by achieving sub-millimeter kinematic accuracy in simulation, and physical experiments on an omnidirectional mobile manipulator further validate the framework's robustness and practical applicability.
Chinese Translation
移动操纵器的高效路径跟踪常常受到高维配置空间和运动学约束的制约。本文提出了一种稳健的双阶段配置规划框架,该框架在固定偏航基座规划假设下,将8自由度(DoF)规划问题解耦为可处理的2自由度基座优化。在第一阶段,所提出的方法利用IRM将任务空间路径离散化为多层图,通过基于Dijkstra的动态规划方法提取初始可行路径,以确保在离散化图中的计算效率和全局最优性。在第二阶段,为克服离散搜索量化,将可行基座区域转化为凸包,从而通过L-BFGS算法进行后续的连续优化,以最大化轨迹平滑性,同时严格执行可达性约束。仿真结果通过在仿真中实现亚毫米级的运动学精度,展示了所提方法的理论精度,而在全向移动操纵器上的物理实验进一步验证了该框架的稳健性和实际适用性。
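Because edges in the multi-layer graph connect only consecutive layers, the Dijkstra-based extraction reduces to a forward dynamic-programming sweep over layers. A minimal sketch with made-up candidate base positions and a simple distance-based transition cost standing in for the paper's actual objective:

```python
# Forward DP over a layered graph: one layer per task-space waypoint,
# one node per candidate 2-DoF base position.
import numpy as np

def layered_shortest_path(layers, edge_cost):
    """layers: list of (k_i, 2) arrays of candidate base positions."""
    cost = np.zeros(len(layers[0]))
    back = []
    for prev, curr in zip(layers, layers[1:]):
        # pairwise transition costs between consecutive layers
        c = cost[:, None] + edge_cost(prev, curr)
        back.append(np.argmin(c, axis=0))
        cost = np.min(c, axis=0)
    # backtrack the optimal node index per layer
    idx = [int(np.argmin(cost))]
    for b in reversed(back):
        idx.append(int(b[idx[-1]]))
    return list(reversed(idx))

rng = np.random.default_rng(1)
layers = [rng.uniform(-1, 1, size=(5, 2)) + i for i in range(4)]
dist = lambda a, b: np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
print(layered_shortest_path(layers, dist))   # one node index per layer
```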
cs.RO / 31 / 2603.07032

SSP: Safety-guaranteed Surgical Policy via Joint Optimization of Behavioral and Spatial Constraints

SSP:通过行为和空间约束的联合优化实现安全保障的外科手术策略
Hu, Jianshu, Guan, ZhiYuan, Song, Lei, Leelakunwet, Kantaphat, Wang, Hesheng, Xiao, Wei, Dou, Qi, Ban, Yutong
Abstract
The paradigm of robot-assisted surgery is shifting toward data-driven autonomy, where policies learned via Reinforcement Learning (RL) or Imitation Learning (IL) enable the execution of complex tasks. However, these "black-box" policies often lack formal safety guarantees, a critical requirement for clinical deployment. In this paper, we propose the Safety-guaranteed Surgical Policy (SSP) framework to bridge the gap between data-driven generality and formal safety. We utilize Neural Ordinary Differential Equations (Neural ODEs) to learn an uncertainty-aware dynamics model from demonstration data. This learned model underpins a robust Control Barrier Function (CBF) safety controller, which minimally alters the actions of a surgical policy to ensure strict safety under uncertainty. Our controller enforces two constraint categories: behavioral constraints (restricting the task space of the agent) and spatial constraints (defining surgical no-go zones). We instantiate the SSP framework with surgical policies derived from RL, IL, and Control Lyapunov Functions (CLF). Validation in both the SurRoL simulation and on the da Vinci Research Kit (dVRK) demonstrates that our method achieves a near-zero constraint violation rate while maintaining high task success rates compared to unconstrained baselines.
Chinese Translation
机器人辅助手术的范式正朝着数据驱动的自主性转变,其中通过强化学习(Reinforcement Learning, RL)或模仿学习(Imitation Learning, IL)学习的策略能够执行复杂任务。然而,这些“黑箱”策略通常缺乏正式的安全保障,这是临床应用的关键要求。本文提出了安全保障外科手术策略(Safety-guaranteed Surgical Policy, SSP)框架,以弥合数据驱动的通用性与正式安全之间的差距。我们利用神经常微分方程(Neural Ordinary Differential Equations, Neural ODEs)从示范数据中学习不确定性感知的动态模型。该学习模型支撑着一个稳健的控制障碍函数(Control Barrier Function, CBF)安全控制器,该控制器最小化地调整外科手术策略的行为,以确保在不确定性下的严格安全。我们的控制器强制执行两类约束:行为约束(限制代理的任务空间)和空间约束(定义外科手术禁区)。我们通过来自RL、IL和控制李雅普诺夫函数(Control Lyapunov Functions, CLF)的外科手术策略实例化SSP框架。在SurRoL仿真和达芬奇研究工具包(da Vinci Research Kit, dVRK)上的验证表明,我们的方法在保持高任务成功率的同时,实现了接近零的约束违反率,相较于无约束基线表现更佳。
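The "minimally alters the actions" step is the standard control-barrier-function quadratic program. A generic sketch for control-affine dynamics, with a toy 1D keep-out barrier standing in for the paper's behavioral and spatial constraints and its learned Neural ODE dynamics:

```python
# CBF safety filter: return the action closest to the policy's proposal
# that still satisfies h_dot + gamma * h >= margin for x_dot = f(x) + g(x) u.
import numpy as np
import cvxpy as cp

def cbf_filter(u_policy, Lf_h, Lg_h, h, gamma=5.0, margin=0.0):
    u = cp.Variable(len(u_policy))
    constraint = [Lf_h + Lg_h @ u + gamma * h >= margin]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_policy)), constraint)
    prob.solve()
    return u.value

# Toy example: 1D point x with x_dot = u, keep-out set x > 1 (h = 1 - x).
x = 0.9
h = 1.0 - x                 # barrier value (safe while h >= 0)
u_safe = cbf_filter(np.array([2.0]), Lf_h=0.0, Lg_h=np.array([-1.0]), h=h)
print(u_safe)               # policy wants +2.0; the filter caps it near +0.5
```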
cs.RO / 32 / 2603.07040

TacDexGrasp: Compliant and Robust Dexterous Grasping with Tactile Feedback

TacDexGrasp:具有触觉反馈的柔性和稳健的灵巧抓取
Ke, Yubin, Chen, Jiayi, Lv, Hang, Zhou, Xiao, Wang, He
Abstract
Multi-fingered hands offer great potential for compliant and robust grasping of unknown objects, yet their high-dimensional force control presents a significant challenge. This work addresses two key problems: (1) distributing forces across multiple contacts to counteract an object's weight, and (2) preventing rotational slip caused by gravitational torque when a grasp is distant from the object's center of mass. We address these challenges via tactile feedback and a Second-Order Cone Programming (SOCP)-based controller, without explicit torque modeling or slip detection. Our key insights are (1) rotational slip inevitably induces translational slip at some contact points for a multi-fingered grasp, and (2) the ratio of tangential to normal force at each contact is an effective early stability indicator. By actively constraining this ratio for each finger below the estimated friction coefficient, our controller maintains grasp stability against both translational and rotational slip. Real-world experiments on 12 diverse objects demonstrate the robustness and compliance of our approach.
Chinese Translation
多指手在柔性和稳健抓取未知物体方面具有巨大潜力,但其高维力控制带来了显著挑战。本研究解决了两个关键问题:(1)在多个接触点之间分配力以抵消物体的重量,以及(2)防止由于重力扭矩导致的旋转滑移,尤其是在抓取点远离物体质心时。我们通过触觉反馈和基于二阶锥规划(Second-Order Cone Programming, SOCP)的控制器来应对这些挑战,而无需显式的扭矩建模或滑移检测。我们的关键见解是:(1)旋转滑移不可避免地会在多指抓取的某些接触点引发平移滑移,以及(2)每个接触点的切向力与法向力的比率是有效的早期稳定性指标。通过主动将每个手指的这一比率限制在估计的摩擦系数以下,我们的控制器能够保持抓取在平移和旋转滑移下的稳定性。对12种不同物体的实际实验展示了我们方法的稳健性和柔性。
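The tangential-to-normal ratio constraint is a textbook second-order cone. A minimal sketch of friction-constrained grasp force distribution, assuming known contact normals and an estimated friction coefficient; the geometry and load are illustrative, not from the paper:

```python
# Distribute per-contact forces that balance the object's weight while
# keeping each tangential-to-normal ratio below an estimated friction
# coefficient (one second-order cone constraint per finger).
import numpy as np
import cvxpy as cp

normals = np.array([[0, -1, 0.2], [0, 1, 0.2], [-1, 0, 0.2]], float)
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
weight = np.array([0.0, 0.0, -3.0])           # gravity force on the object [N]
mu, safety = 0.6, 0.8                         # friction estimate, usage ratio

f = cp.Variable((3, 3))                        # one 3D force per contact
constraints, total = [], 0
for i, n in enumerate(normals):
    fn = f[i] @ n                              # normal component
    ft = f[i] - fn * n                         # tangential component
    constraints += [cp.norm(ft) <= safety * mu * fn]   # friction cone (SOC)
    total = total + f[i]
constraints += [total + weight == 0]           # static force balance
prob = cp.Problem(cp.Minimize(cp.sum_squares(f)), constraints)
prob.solve()
print(np.round(f.value, 3))
```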
cs.RO / 33 / 2603.07060

GuideTWSI: A Diverse Tactile Walking Surface Indicator Dataset from Synthetic and Real-World Images for Blind and Low-Vision Navigation

GuideTWSI:来自合成和真实世界图像的多样化触觉行走表面指示器数据集,用于盲人和低视力导航
Hwang, Hochul, Yang, Soowan, Nguyen, Anh N. H., Goel, Parth, Adhikari, Krisha, Lee, Sunghoon I., Biswas, Joydeep, Giudice, Nicholas A., Kim, Donghyun
Abstract
Tactile Walking Surface Indicators (TWSIs) are safety-critical landmarks that blind and low-vision (BLV) pedestrians use to locate crossings and hazard zones. From our observation sessions with BLV guide dog handlers, trainers, and an O&M specialist, we confirmed the critical importance of reliable and accurate TWSI segmentation for navigation assistance of BLV individuals. Achieving such reliability requires large-scale annotated data. However, TWSIs are severely underrepresented in existing urban perception datasets, and even existing dedicated paving datasets are limited: they lack robot-relevant viewpoints (e.g., egocentric or top-down) and are geographically biased toward East Asian directional bars - raised parallel strips used for continuous guidance along sidewalks. This narrow focus overlooks truncated domes - rows of round bumps used primarily in North America and Europe as detectable warnings at curbs, crossings, and platform edges. As a result, models trained only on bar-centric data struggle to generalize to dome-based warnings, leading to missed detections and false stops in safety-critical environments.
Chinese Translation
触觉行走表面指示器(TWSI)是盲人和低视力(BLV)行人用来定位人行横道和危险区域的安全关键地标。通过与BLV导盲犬使用者、训练师和一位方向感与移动(O&M)专家的观察会话,我们确认了可靠和准确的TWSI分割在BLV个体导航辅助中的重要性。实现这种可靠性需要大规模的标注数据。然而,现有城市感知数据集中TWSI的代表性严重不足,即使是现有的专用铺装数据集也有限:它们缺乏与机器人相关的视角(例如,自我中心或俯视)并且在地理上偏向于东亚方向条——用于在人行道上提供连续指导的平行凸起条。这种狭隘的关注忽视了截断圆顶——主要在北美和欧洲用作人行道、交叉口和站台边缘可检测警告的圆形凸起行列。因此,仅在条形数据上训练的模型在泛化到基于圆顶的警告时面临困难,导致在安全关键环境中出现漏检和错误停止。
cs.RO / 34 / 2603.07068

Morphology-Independent Facial Expression Imitation for Human-Face Robots

与形态无关的人脸机器人面部表情模仿
Chen, Xu, Gao, Rui, Sun, Che, Liu, Zhehang, Wu, Yuwei, Yang, Shuo, Jia, Yunde
Abstract
Accurate facial expression imitation on human-face robots is crucial for achieving natural human-robot interaction. Most existing methods have achieved photorealistic expression imitation through mapping 2D facial landmarks to a robot's actuator commands. Their imitation of landmark trajectories is susceptible to interference from facial morphology, which would lead to a performance drop. In this paper, we propose a morphology-independent expression imitation method that decouples expressions from facial morphology to eliminate morphological influence and produce more realistic expressions for human-face robots. Specifically, we construct an expression decoupling module to learn expression semantics by disentangling the expression representation from the morphology representation in a self-supervised manner. We devise an expression transfer module to map the representations to the robot's actuator commands through a learning objective of perceiving expression errors, producing accurate facial expressions based on the learned expression semantics. To support experimental validation, a custom-designed and highly expressive human-face robot, namely Pengrui, is developed to serve as an experimental platform for realistic expression imitation. Extensive experiments demonstrate that our method enables the human-face robot to reproduce a wide range of human-like expressions effectively. All code and implementation details of the robot will be released.
Chinese Translation
在人脸机器人上准确的面部表情模仿对于实现自然的人机交互至关重要。现有的大多数方法通过将二维面部特征点映射到机器人的执行器指令,实现了照片级真实感的表情模仿。然而,它们对特征点轨迹的模仿容易受到面部形态的干扰,这会导致性能下降。本文提出了一种与形态无关的表情模仿方法,该方法将表情与面部形态解耦,以消除形态影响,并为人脸机器人生成更真实的表情。具体而言,我们构建了一个表情解耦模块,通过自监督的方式将表情表示与形态表示进行解耦,从而学习表情语义。我们设计了一个表情转移模块,通过感知表情误差的学习目标,将表示映射到机器人的执行器指令,从而基于学习到的表情语义生成准确的面部表情。为了支持实验验证,我们开发了一款定制设计且高度表现力的人脸机器人——Pengrui,作为真实表情模仿的实验平台。大量实验表明,我们的方法使人脸机器人能够有效地再现广泛的人类表情。所有代码和机器人的实现细节将被公开。
cs.RO / 35 / 2603.07072

The Talking Robot: Distortion-Robust Acoustic Models for Robot-Robot Communication

会说话的机器人:针对机器人间通信的抗失真声学模型
Li, Hanlong, Kamalahasan, Karishma, Li, Jiahui, Nakadai, Kazuhiro, Kousik, Shreyas
Abstract
We present Artoo, a learned acoustic communication system for robots that replaces hand-designed signal processing with end-to-end co-trained neural networks. Our system pairs a lightweight text-to-speech (TTS) transmitter (1.18M parameters) with a conformer-based automatic speech recognition (ASR) receiver (938K parameters), jointly optimized through a differentiable channel. Unlike human speech, robot-to-robot communication is paralinguistics-free: the system need not preserve timbre, prosody, or naturalness, only maximize decoding accuracy under channel distortion. Through a three-phase co-training curriculum, the TTS transmitter learns to produce distortion-robust acoustic encodings that surpass the baseline under noise, achieving 8.3% CER at 0 dB SNR. The entire system requires only 2.1M parameters (8.4 MB) and runs in under 13 ms end-to-end on a CPU, making it suitable for deployment on resource-constrained robotic platforms.
Chinese Translation
我们提出了Artoo,一个用于机器人之间的声学通信系统,该系统用端到端共同训练的神经网络替代了手工设计的信号处理。我们的系统将一个轻量级的文本转语音(TTS)发射器(1.18M参数)与一个基于Conformer的自动语音识别(ASR)接收器(938K参数)配对,通过可微分的信道进行联合优化。与人类语言不同,机器人间的通信不依赖于副语言特征:该系统不需要保留音色、韵律或自然性,只需在信道失真下最大化解码准确性。通过三阶段的共同训练课程,TTS发射器学习生成抗失真的声学编码,在噪声条件下超越基线,在0 dB信噪比下实现8.3%的字符错误率(CER)。整个系统仅需2.1M参数(8.4 MB),在CPU上端到端运行时间少于13毫秒,适合在资源受限的机器人平台上部署。
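The enabling pattern is that the channel sits between transmitter and receiver as a differentiable layer, so the decoder's loss gradients reach the encoder. A toy sketch of that pattern with an additive white Gaussian noise channel; the two tiny nn.Sequential stacks are placeholders for the actual TTS transmitter and conformer ASR receiver:

```python
# Co-training through a differentiable noise channel (toy version).
import torch
import torch.nn as nn

class AWGNChannel(nn.Module):
    def __init__(self, snr_db=0.0):
        super().__init__()
        self.snr_db = snr_db

    def forward(self, x):
        power = x.pow(2).mean()
        noise_power = power / (10 ** (self.snr_db / 10))
        return x + noise_power.sqrt() * torch.randn_like(x)

tx = nn.Sequential(nn.Embedding(64, 32), nn.Linear(32, 128))  # toy "TTS"
rx = nn.Sequential(nn.Linear(128, 64))                        # toy "ASR"
channel = AWGNChannel(snr_db=0.0)

tokens = torch.randint(0, 64, (8,))
logits = rx(channel(tx(tokens)))                  # end-to-end differentiable
loss = nn.functional.cross_entropy(logits, tokens)
loss.backward()                                   # gradients reach tx weights
print(tx[1].weight.grad.abs().mean().item() > 0)  # True
```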
cs.RO / 36 / 2603.07080

VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

VLN-Cache:通过视觉/语义动态感知实现VLN模型的令牌缓存
Zheng, Zihao, Mao, Zhihao, Zhou, Xingyue, Chen, Jiayu, Li, Maoliang, Sun, Xinhao, Zou, Hailong, Zhang, Zhaobo, Liu, Xuanzhe, Cao, Donggang, Mei, Hong, Chen, Xiang
Abstract
Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy that avoids redundant computation by reusing stable visual tokens across frames. However, existing methods assume a static camera and fixed semantic focus, assumptions that VLN fundamentally violates. We identify two failure modes: (1) visual dynamics, where viewpoint shift displaces token positions across frames, causing position-wise matching to pair misaligned content; (2) semantic dynamics, where token relevance shifts across task stages as navigation progresses, making cached states stale. We propose VLN-Cache, a visual-dynamic-aware and semantic-dynamic-aware caching framework that introduces view-aligned remapping to recover geometric correspondences and a task-relevance saliency filter to veto reuse at semantic transitions. A layer-adaptive entropy policy further balances the per-layer reuse budget. Experiments on the R2R-CE simulation benchmark show up to 1.52x speedup while maintaining competitive navigation success rates.
Chinese Translation
视觉与语言导航(VLN)日益依赖大型视觉-语言模型,但其推理成本与实时部署存在冲突。令牌缓存是一种有前景的无训练策略,通过在帧之间重用稳定的视觉令牌来避免冗余计算。然而,现有方法假设摄像机静态且语义焦点固定,这些假设与VLN的基本特性相悖。我们识别出两种失效模式:(1)视觉动态,视点变化导致令牌在帧之间的位置偏移,造成位置匹配错误地配对不对齐的内容;(2)语义动态,随着导航的进展,令牌的相关性在任务阶段之间发生变化,使得缓存状态变得过时。我们提出VLN-Cache,这是一种视觉动态感知和语义动态感知的缓存框架,引入视图对齐重映射以恢复几何对应关系,并使用任务相关性显著性过滤器在语义转变时否决重用。层自适应熵策略进一步平衡每层的重用预算。在R2R-CE模拟基准上的实验表明,在保持竞争性导航成功率的同时,速度提升可达1.52倍。
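The per-token reuse decision can be sketched independently of any model. A minimal illustration, assuming the view-aligned remapping is available as a precomputed index map and that a per-token task-relevance saliency score is given; the thresholds are arbitrary placeholders:

```python
# Reuse a cached token only if it still matches the fresh token after
# view-aligned remapping, and the task-relevance veto does not fire.
import numpy as np

def reuse_mask(prev_tokens, curr_tokens, remap, saliency,
               sim_thresh=0.95, saliency_veto=0.8):
    aligned = prev_tokens[remap]                      # view-aligned remapping
    num = np.sum(aligned * curr_tokens, axis=-1)
    cos = num / (np.linalg.norm(aligned, axis=-1)
                 * np.linalg.norm(curr_tokens, axis=-1) + 1e-8)
    return (cos > sim_thresh) & (saliency < saliency_veto)

rng = np.random.default_rng(2)
prev = rng.normal(size=(16, 32))
remap = rng.permutation(16)                           # stand-in geometry
curr = prev[remap] + 0.01 * rng.normal(size=(16, 32)) # mostly-stable scene
sal = rng.uniform(size=16)                            # task-relevance scores
mask = reuse_mask(prev, curr, remap, sal)
print(f"reusing {mask.sum()}/16 tokens")
```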
cs.RO / 37 / 2603.07095

ACLM: ADMM-Based Distributed Model Predictive Control for Collaborative Loco-Manipulation

ACLM:基于ADMM的协作运动操控的分布式模型预测控制
Zhou, Ziyi, Shu, Pengyuan, Cao, Ruize, Zhao, Yuntian, Zhao, Ye
Abstract
Collaborative transportation of heavy payloads via loco-manipulation is a challenging yet essential capability for legged robots operating in complex, unstructured environments. Centralized planning methods, e.g., holistic trajectory optimization, capture dynamic coupling among robots and payloads but scale poorly with system size, limiting real-time applicability. In contrast, hierarchical and fully decentralized approaches often neglect force and dynamic interactions, leading to conservative behavior. This study proposes an Alternating Direction Method of Multipliers (ADMM)-based distributed model predictive control framework for collaborative loco-manipulation with a team of quadruped robots with manipulators. By exploiting the payload-induced coupling structure, the global optimal control problem is decomposed into parallel individual-robot-level subproblems with consensus constraints. The distributed planner operates in a receding-horizon fashion and achieves fast convergence, requiring only a few ADMM iterations per planning cycle. A wrench-aware whole-body controller executes the planned trajectories, tracking both motion and interaction wrenches. Extensive simulations with up to four robots demonstrate scalability, real-time performance, and robustness to model uncertainty.
Chinese Translation
通过运动操控协作运输重载是腿式机器人在复杂非结构化环境中操作的一项具有挑战性但至关重要的能力。集中式规划方法,例如整体轨迹优化,能够捕捉机器人与负载之间的动态耦合,但在系统规模扩大时表现不佳,限制了其实时应用性。相比之下,分层和完全去中心化的方法往往忽视了力和动态交互,导致保守行为。本研究提出了一种基于交替方向乘子法(ADMM)的分布式模型预测控制框架,用于配备机械臂的四足机器人团队的协作运动操控。通过利用负载引起的耦合结构,全局最优控制问题被分解为带有共识约束的并行个体机器人级子问题。该分布式规划器以滚动时域的方式运行,实现快速收敛,每个规划周期仅需少量ADMM迭代。一个感知力旋量的全身控制器执行规划轨迹,同时跟踪运动和交互力旋量。多达四个机器人的大量仿真表明了该方法的可扩展性、实时性能以及对模型不确定性的鲁棒性。
cs.RO / 38 / 2603.07096

Towards Scalable Probabilistic Human Motion Prediction with Gaussian Processes for Safe Human-Robot Collaboration

基于高斯过程的可扩展概率人类运动预测,以实现安全的人机协作
Chong, Jinger, Zhang, Xiaotong, Youcef-Toumi, Kamal
Abstract
Accurate human motion prediction with well-calibrated uncertainty is critical for safe human-robot collaboration (HRC), where robots must anticipate and react to human movements in real time. We propose a structured multitask variational Gaussian Process (GP) framework for full-body human motion prediction that captures temporal correlations and leverages joint-dimension-level factorization for scalability, while using a continuous 6D rotation representation to preserve kinematic consistency. Evaluated on Human3.6M (H3.6M), our model achieves up to 50 lower kernel density estimate negative log-likelihood (KDE NLL) than strong baselines, a mean continuous ranked probability score (CRPS) of 0.021 m, and deterministic mean angle error (MAE) that is 3-18% higher than competitive deep learning methods. Empirical coverage analysis shows that the fraction of ground-truth outcomes contained within predicted confidence intervals gradually decreases with horizon, remaining conservative for lower-confidence intervals and near-nominal for higher-confidence intervals, with only modest calibration drift at longer horizons. Despite its probabilistic formulation, our model requires only 0.24-0.35 M parameters, roughly eight times fewer than comparable approaches, and exhibits modest inference times, indicating suitability for real-time deployment. Extensive ablation studies further validated the choice of 6D rotation representation and Matern 3/2 + Linear kernel, and guided the selection of the number of inducing points and latent dimensionality. These results demonstrate that scalable GP-based models can deliver competitive accuracy together with reliable and interpretable uncertainty estimates for downstream robotics tasks such as motion planning and collision avoidance.
Chinese Translation
准确的人类运动预测及其良好的不确定性校准对于安全的人机协作(HRC)至关重要,在此场景中,机器人必须实时预测和响应人类的动作。我们提出了一种结构化的多任务变分高斯过程(GP)框架,用于全身人类运动预测,该框架捕捉时间相关性,并利用关节维度级别的因式分解以实现可扩展性,同时使用连续的6D旋转表示以保持运动学一致性。在Human3.6M(H3.6M)数据集上的评估表明,我们的模型在核密度估计负对数似然(KDE NLL)上比强基线低50,平均连续排名概率分数(CRPS)为0.021米,确定性平均角度误差(MAE)比竞争性深度学习方法高出3-18%。经验覆盖分析表明,真实结果在预测置信区间内的比例随着时间范围的增加而逐渐降低,对于低置信区间保持保守,对于高置信区间接近名义值,且在较长时间范围内仅有适度的校准漂移。尽管采用了概率模型,我们的模型仅需0.24-0.35百万个参数,约为可比方法的八分之一,并且表现出适度的推理时间,表明其适合实时部署。大量消融研究进一步验证了6D旋转表示和Matern 3/2 + 线性核的选择,并指导了诱导点数量和潜在维度的选择。这些结果表明,可扩展的基于高斯过程的模型能够提供竞争性的准确性,以及可靠且可解释的不确定性估计,适用于下游机器人任务,如运动规划和碰撞避免。
cs.RO / 39 / 2603.07110

Learning From Failures: Efficient Reinforcement Learning Control with Episodic Memory

从失败中学习:具有情节记忆的高效强化学习控制
Miao, Chenyang
Abstract
Reinforcement learning has achieved remarkable success in robot learning. However, under challenging exploration and contact-rich dynamics, early-stage training is frequently dominated by premature terminations such as collisions and falls. As a result, learning is overwhelmed by short-horizon, low-return trajectories, which hinder convergence and limit long-horizon exploration. To alleviate this issue, we propose a technique called Failure Episodic Memory Alert (FEMA). FEMA explicitly stores short-horizon failure experiences through an episodic memory module. During interactions, it retrieves similar failure experiences and prevents the robot from recurrently relapsing into unstable states, guiding the policy toward long-horizon trajectories with greater long-term value. FEMA can be combined easily with model-free reinforcement learning algorithms, and yields a substantial sample-efficiency improvement of 33.11% on MuJoCo tasks across several classical RL algorithms. Furthermore, integrating FEMA into a parallelized PPO training pipeline demonstrates its effectiveness on a real-world bipedal robot task.
Chinese Translation
强化学习在机器人学习中取得了显著成功。然而,在具有挑战性的探索和接触丰富的动态环境下,早期训练往往受到碰撞和跌倒等过早终止的主导。因此,学习被短期、低回报的轨迹所淹没,这阻碍了收敛并限制了长期探索。为了解决这个问题,我们提出了一种称为失败情节记忆警报(Failure Episodic Memory Alert, FEMA)的技术。FEMA通过情节记忆模块显式存储短期失败经历。在交互过程中,它检索相似的失败经历,防止机器人反复陷入不稳定状态,引导策略朝向具有更大长期价值的长期轨迹。FEMA可以与无模型强化学习算法轻松结合,并在多个经典强化学习算法的MuJoCo任务中实现了33.11%的显著样本效率提升。此外,将FEMA集成到并行化的PPO训练管道中,展示了其在现实世界双足机器人任务中的有效性。
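A failure episodic memory of this kind is easy to prototype. A minimal sketch of one plausible reading of the idea: store the states that preceded premature terminations and penalize proximity to them at training time. The distance metric, radius, and penalty shaping here are illustrative choices, not necessarily FEMA's formulation:

```python
# Episodic memory of failure states with a proximity penalty.
import numpy as np

class FailureMemory:
    def __init__(self, radius=0.5, penalty=1.0):
        self.states, self.radius, self.penalty = [], radius, penalty

    def add_failure(self, trajectory_tail):
        # store the last few states before a collision/fall
        self.states.extend(np.asarray(trajectory_tail))

    def shaped_reward(self, state, reward):
        if not self.states:
            return reward
        d = np.linalg.norm(np.asarray(self.states) - state, axis=1).min()
        # penalize proximity to remembered failure states
        return reward - self.penalty * max(0.0, 1.0 - d / self.radius)

mem = FailureMemory()
mem.add_failure([[0.0, 0.0], [0.1, 0.0]])     # a remembered fall
print(mem.shaped_reward(np.array([0.05, 0.0]), reward=1.0))  # penalized
print(mem.shaped_reward(np.array([2.0, 2.0]), reward=1.0))   # untouched
```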
cs.RO / 40 / 2603.07126

Efficient Trajectory Optimization for Autonomous Racing via Formula-1 Data-Driven Initialization

基于Formula-1数据驱动初始化的自主赛车高效轨迹优化
Shehadeh, Samir, Kutsch, Lukas, Dengler, Nils, Pan, Sicong, Bennewitz, Maren
Abstract
Trajectory optimization is a central component of fast and efficient autonomous racing. However, practical optimization pipelines remain highly sensitive to initialization and may converge slowly or to suboptimal local solutions when seeded with heuristic trajectories such as the centerline or minimum-curvature paths. To address this limitation, we leverage expert driving behavior as an initialization prior and propose a learning-informed initialization strategy based on real-world Formula 1 telemetry. To this end, we first construct a multi-track Formula 1 trajectory dataset by reconstructing and aligning noisy GPS telemetry to a standardized reference-line representation across 17 tracks. Building on this, we present a neural network that predicts an expert-like raceline offset directly from local track geometry, without explicitly modeling vehicle dynamics or forces. The predicted raceline is then used as an informed seed for a minimum-time optimal control solver. Experiments on all 17 tracks demonstrate that the learned initialization accelerates solver convergence and significantly reduces runtime compared to traditional geometric baselines, while preserving the final optimized lap time.
Chinese Translation
轨迹优化是快速高效自主赛车的核心组成部分。然而,实际的优化流程对初始化高度敏感,当使用中心线或最小曲率路径等启发式轨迹作为初始值时,可能会导致收敛缓慢或收敛到次优的局部解。为了解决这一限制,我们利用专家驾驶行为作为初始化先验,并提出了一种基于真实世界Formula 1遥测数据的学习驱动初始化策略。为此,我们首先通过重建和对齐噪声GPS遥测数据,构建了一个多赛道的Formula 1轨迹数据集,以标准化的参考线表示在17条赛道上进行对齐。在此基础上,我们提出了一种神经网络,能够直接从局部赛道几何信息预测专家级的赛车线路偏移,而无需显式建模车辆动力学或力。然后,将预测的赛车线路用作最小时间最优控制求解器的有信息种子。对所有17条赛道的实验表明,与传统几何基线相比,学习到的初始化加速了求解器的收敛,并显著减少了运行时间,同时保持了最终优化的圈速。
cs.RO / 41 / 2603.07136

DexKnot: Generalizable Visuomotor Policy Learning for Dexterous Bag-Knotting Manipulation

DexKnot:用于灵巧袋子打结操作的可推广视觉运动策略学习
Zhang, Jiayuan, Wu, Ruihai, Chen, Haojun, Wang, Yuran, Zhong, Yifan, Zhang, Ceyao, Yang, Yaodong, Chen, Yuanpei
Abstract
Knotting plastic bags is a common task in daily life, yet it is challenging for robots due to the bags' infinite degrees of freedom and complex physical dynamics. Existing methods often struggle in generalization to unseen bag instances or deformations. To address this, we present DexKnot, a framework that combines keypoint affordance with diffusion policy to learn a generalizable bag-knotting policy. Our approach learns a shape-agnostic representation of bags from keypoint correspondence data collected through real-world manual deformation. For an unseen bag configuration, the keypoints can be identified by matching the representation to a reference. These keypoints are then provided to a diffusion transformer, which generates robot action based on a small number of human demonstrations. DexKnot enables effective policy generalization by reducing the dimensionality of observation space into a sparse set of keypoints. Experiments show that DexKnot achieves reliable and consistent knotting performance across a variety of previously unseen instances and deformations.
Chinese Translation
打结塑料袋是日常生活中的一项常见任务,但由于袋子具有无限的自由度和复杂的物理动态,这对机器人来说具有挑战性。现有方法在对未见过的袋子实例或变形的推广能力上往往存在困难。为了解决这个问题,我们提出了DexKnot,一个结合关键点赋能与扩散策略的框架,用于学习可推广的袋子打结策略。我们的方法通过真实世界手动变形收集的关键点对应数据,学习袋子的形状无关表示。对于未见过的袋子配置,可以通过将表示与参考进行匹配来识别关键点。这些关键点随后被提供给扩散变换器,该变换器基于少量人类示范生成机器人动作。DexKnot通过将观察空间的维度降低为稀疏的关键点集合,从而实现有效的策略推广。实验表明,DexKnot在多种之前未见的实例和变形中实现了可靠且一致的打结性能。
cs.RO / 42 / 2603.07141

Model-based thermal drift compensation for high-precision hexapod robot actuators

基于模型的高精度六足机器人驱动器热漂移补偿
Robert, Clément, Vissiere, Alain, Company, Olivier, Noire, Pierre, Roux, Thierry, Krut, Sébastien
Abstract
Thermal expansion is a significant source of positioning error in high-precision hexapod robots (Gough-Stewart platforms). Any variation in the temperature of the hexapod's parts induces expansion, which alters their kinematic model and reduces the robot's accuracy and repeatability. These variations may arise from internal heat sources (such as motors, encoders, and electronics) or from environmental changes. In this study, a method is proposed to anticipate and therefore correct the thermal drift of one of the hexapod's precision electro-mechanical actuators. This method is based on determining a model that links the expansion state of the actuator at any given moment to the temperature of some well-chosen points on its surface. This model was initially developed theoretically. Its coefficients were then adjusted experimentally on a specific test-bench, based on a rigorous measurement campaign of actuator expansion using a high-precision interferometric measurement system. Experimental validation demonstrates a reduction of thermally induced expansion by more than 80%. This paves the way for thermal drift correction across the entire robot or similar robotic parts.
Chinese Translation
热膨胀是高精度六足机器人(Gough-Stewart平台)定位误差的重要来源。六足机器人各部件温度的任何变化都会引起膨胀,从而改变其运动学模型,降低机器人的精度和重复性。这些变化可能源于内部热源(如电机、编码器和电子设备)或环境变化。在本研究中,提出了一种方法来预测并纠正六足机器人精密电机驱动器的热漂移。该方法基于确定一个模型,该模型将驱动器在任意时刻的膨胀状态与其表面某些精心选择点的温度联系起来。该模型最初是理论开发的,随后在一个特定的测试台上通过严格的驱动器膨胀测量活动进行了实验调整,使用了高精度干涉测量系统。实验验证表明,热引起的膨胀减少了超过80%。这为整个机器人或类似机器人部件的热漂移补偿铺平了道路。
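The calibration step amounts to identifying a map from a few surface-point temperatures to the measured expansion; in the simplest (linear) case this is ordinary least squares. A minimal sketch on synthetic data standing in for the interferometric measurement campaign:

```python
# Fit a linear map from surface temperatures to actuator expansion,
# then predict (and subtract) the thermal drift.
import numpy as np

rng = np.random.default_rng(3)
T = 20.0 + 5.0 * rng.random(size=(200, 3))       # temps at 3 surface points [C]
alpha_true = np.array([0.8, 0.3, 0.1])           # um/C, unknown in practice
dL = (T - 20.0) @ alpha_true + 0.05 * rng.normal(size=200)  # expansion [um]

# Least-squares identification of the expansion coefficients.
A = T - 20.0
alpha_hat, *_ = np.linalg.lstsq(A, dL, rcond=None)

T_now = np.array([24.0, 22.0, 21.0])
predicted = (T_now - 20.0) @ alpha_hat
print(f"predicted drift {predicted:.2f} um -> command -{predicted:.2f} um")
```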
cs.RO / 43 / 2603.07143

Tutorial on Aided Inertial Navigation Systems: A Modern Treatment Using Lie-Group Theoretical Methods

辅助惯性导航系统教程:基于李群理论方法的现代处理
Berkane, Soulaimane
Abstract
This tutorial presents a control-oriented introduction to aided inertial navigation systems using a Lie-group formulation centered on the extended Special Euclidean group SE_2(3). The focus is on developing a clear and implementation-oriented geometric framework for fusing inertial measurements with aiding information, while making the role of invariance and symmetry explicit. Recent extensions, including higher-order state representations, synchronous observer designs, and equivariant filtering methods, are discussed as natural continuations of the same underlying principles. The goal is to provide readers with a coherent system-theoretic perspective that supports both understanding and practical use of modern aided inertial navigation methods.
Chinese Translation
本教程提供了一个以控制为导向的辅助惯性导航系统的介绍,采用以扩展特殊欧几里得群 SE_2(3) 为中心的李群表述。重点在于开发一个清晰且面向实现的几何框架,以融合惯性测量与辅助信息,同时明确不变性和对称性的作用。讨论了最近的扩展,包括高阶状态表述、同步观测器设计和等变滤波方法,这些都是基于相同基本原理的自然延续。目标是为读者提供一个连贯的系统理论视角,以支持对现代辅助惯性导航方法的理解和实际应用。
cs.RO / 44 / 2603.07165

RoTri-Diff: A Spatial Robot-Object Triadic Interaction-Guided Diffusion Model for Bimanual Manipulation

RoTri-Diff:一种基于空间机器人-物体三元交互指导的双手操作扩散模型
Chen, Zixuan, Chan, Nga Teng, Hou, Yiwen, Tie, Chenrui, Liu, Zixuan, Chen, Haonan, Chen, Junting, Shi, Jieqi, Gao, Yang, Huo, Jing, Shao, Lin
Abstract
Bimanual manipulation is a fundamental robotic skill that requires continuous and precise coordination between two arms. While imitation learning (IL) is the dominant paradigm for acquiring this capability, existing approaches, whether robot-centric or object-centric, often overlook the dynamic geometric relationship among the two arms and the manipulated object. This limitation frequently leads to inter-arm collisions, unstable grasps, and degraded performance in complex tasks. To address this, we explicitly model the Robot-Object Triadic Interaction (RoTri) representation in bimanual systems, encoding the relative 6D poses between the two arms and the object to capture their spatial triadic relationship and establish continuous triangular geometric constraints. Building on this, we further introduce RoTri-Diff, a diffusion-based imitation learning framework that combines RoTri constraints with robot keyposes and object motion in a hierarchical diffusion process. This enables the generation of stable, coordinated trajectories and robust execution across different modes of bimanual manipulation. Extensive experiments show that our approach outperforms state-of-the-art baselines by 10.2% on 11 representative RLBench2 tasks and achieves stable performance on 4 challenging real-world bimanual tasks. Project website: https://rotri-diff.github.io/.
Chinese Translation
双手操作是机器人技术中的一项基本技能,要求两个手臂之间进行持续而精确的协调。尽管模仿学习(Imitation Learning, IL)是获取这一能力的主要范式,但现有的方法,无论是以机器人为中心还是以物体为中心,往往忽视了两个手臂与被操作物体之间的动态几何关系。这一局限性常常导致手臂间的碰撞、不稳定的抓取以及在复杂任务中的性能下降。为了解决这个问题,本文明确建模了双手系统中的机器人-物体三元交互(Robot-Object Triadic Interaction, RoTri)表示,通过编码两个手臂与物体之间的相对6D姿态,以捕捉它们的空间三元关系并建立连续的三角几何约束。在此基础上,我们进一步提出了RoTri-Diff,这是一种基于扩散的模仿学习框架,将RoTri约束与机器人关键姿态和物体运动结合在一个分层扩散过程中。这使得在不同的双手操作模式下生成稳定、协调的轨迹和稳健的执行成为可能。大量实验表明,我们的方法在11个代表性的RLBench2任务上比最先进的基线提高了10.2%,并在4个具有挑战性的真实双手任务上实现了稳定的性能。项目网站:https://rotri-diff.github.io/
cs.RO / 45 / 2603.07199

Vision-Guided MPPI for Agile Drone Racing: Navigating Arbitrary Gate Poses via Neural Signed Distance Fields

基于视觉引导的MPPI在灵活无人机竞速中的应用:通过神经符号距离场导航任意门位姿
Zhao, Fangguo, Zhang, Hanbing, Li, Zhouheng, Guan, Xin, Li, Shuo
Abstract
Autonomous drone racing requires the tight coupling of perception, planning, and control under extreme agility. However, recent approaches typically rely on precomputed spatial reference trajectories or explicit 6-DoF gate pose estimation, rendering them brittle to spatial perturbations, unmodeled track changes, and sensor noise. Conversely, end-to-end learning policies frequently overfit to specific track layouts and struggle with zero-shot generalization. To address these fundamental limitations, we propose a fully onboard, vision guided optimal control framework that enables reference-free agile flight through arbitrarily placed and oriented gates. Central to our approach is Gate-SDF, a novel, implicitly learned neural signed distance field. Gate-SDF directly processes raw, noisy depth images to predict a continuous spatial field that provides both collision repulsion and active geometric guidance toward the valid traversal area. We seamlessly integrate this representation into a sampling-based Model Predictive Path Integral (MPPI) controller. By fully exploiting GPU parallelism, the framework evaluates these continuous spatial constraints across thousands of simulated trajectory rollouts simultaneously in real time. Furthermore, our formulation inherently maintains spatial consistency, ensuring robust navigation even under severe visual occlusion during aggressive maneuvers. Extensive simulations and real-world experiments demonstrate that the proposed system achieves high-speed agile flight and successfully navigates unseen tracks subject to severe unmodeled gate displacements and orientation perturbations. Videos are available at https://zhaofangguo.github.io/vision_guided_mppi/
Chinese Translation
自主无人机竞速需要在极高灵活性下紧密结合感知、规划和控制。然而,近期的方法通常依赖于预计算的空间参考轨迹或显式的六自由度(6-DoF)门位姿估计,这使得它们在面对空间扰动、未建模的赛道变化和传感器噪声时显得脆弱。相反,端到端学习策略往往会过拟合特定的赛道布局,并在零样本泛化方面表现不佳。为了解决这些根本性限制,我们提出了一种完全机载的、基于视觉引导的最优控制框架,使得无人机能够在任意放置和定向的门之间进行无参考的灵活飞行。我们方法的核心是Gate-SDF,一种新颖的、隐式学习的神经符号距离场。Gate-SDF直接处理原始的、噪声较大的深度图像,以预测一个连续的空间场,该场提供了碰撞排斥和朝向有效通行区域的主动几何引导。我们将这一表示无缝集成到基于采样的模型预测路径积分(MPPI)控制器中。通过充分利用GPU并行性,该框架能够实时评估数千条模拟轨迹展开中的这些连续空间约束。此外,我们的公式本质上保持了空间一致性,确保在激烈机动过程中即使在严重的视觉遮挡下也能实现稳健导航。大量的仿真和现实世界实验表明,所提出的系统能够实现高速灵活飞行,并成功导航于遭受严重未建模门位移和方向扰动的未知赛道。相关视频可在 https://zhaofangguo.github.io/vision_guided_mppi/ 获取。
cs.RO / 46 / 2603.07264

Kinematics-Aware Latent World Models for Data-Efficient Autonomous Driving

基于运动学感知的潜在世界模型用于数据高效的自主驾驶
Li, Jiazhuo, Cao, Linjiang, Liu, Qi, Xiong, Xi
Abstract
Data-efficient learning remains a central challenge in autonomous driving due to the high cost and safety risks of large-scale real-world interaction. Although world-model-based reinforcement learning enables policy optimization through latent imagination, existing approaches often lack explicit mechanisms to encode spatial and kinematic structure essential for driving tasks. In this work, we build upon the Recurrent State-Space Model (RSSM) and propose a kinematics-aware latent world model framework for autonomous driving. Vehicle kinematic information is incorporated into the observation encoder to ground latent transitions in physically meaningful motion dynamics, while geometry-aware supervision regularizes the RSSM latent state to capture task-relevant spatial structure beyond pixel reconstruction. The resulting structured latent dynamics improve long-horizon imagination fidelity and stabilize policy optimization. Experiments in a driving simulation benchmark demonstrate consistent gains over both model-free and pixel-based world-model baselines in terms of sample efficiency and driving performance. Ablation studies further verify that the proposed design enhances spatial representation quality within the latent space. These results suggest that integrating kinematic grounding into RSSM-based world models provides a scalable and physically grounded paradigm for autonomous driving policy learning.
Chinese Translation
由于大规模现实世界交互的高成本和安全风险,数据高效学习仍然是自主驾驶中的一个核心挑战。尽管基于世界模型的强化学习通过潜在想象实现了策略优化,但现有方法往往缺乏明确的机制来编码对于驾驶任务至关重要的空间和运动学结构。在本研究中,我们基于递归状态空间模型(Recurrent State-Space Model, RSSM)提出了一种运动学感知的潜在世界模型框架用于自主驾驶。车辆运动学信息被纳入观察编码器,以使潜在转变与物理上有意义的运动动态相结合,同时几何感知的监督正则化了RSSM潜在状态,以捕捉超越像素重建的任务相关空间结构。所得到的结构化潜在动态提高了长时间想象的保真度,并稳定了策略优化。在驾驶仿真基准中的实验表明,在样本效率和驾驶性能方面,相较于无模型和基于像素的世界模型基线,均取得了一致的提升。消融研究进一步验证了所提出的设计增强了潜在空间内的空间表示质量。这些结果表明,将运动学基础整合到基于RSSM的世界模型中,为自主驾驶策略学习提供了一种可扩展且物理基础的范式。
cs.RO / 47 / 2603.07308

Soft Rigid Hybrid Gripper with Inflatable Silicone Pockets for Tunable Frictional Grasping

具有可调摩擦抓取能力的充气硅胶口袋软硬混合抓手
Ly, Hoang Hiep, Nguyen, Cong-Nhat, Tran, Doan-Quang, Dang, Quoc-Khanh, Tran, Ngoc Duy, Mac, Thi Thoa, Nguyen, Anh, Nguyen, Xuan-Thuan, Ta, Tung D.
Abstract
Grasping objects with diverse mechanical properties, such as heavy, slippery, or fragile items, remains a significant challenge in robotics. Conventional rigid grippers typically rely on increasing the normal forces to secure an object; however, this can damage fragile objects through excessive force. To address this limitation, we propose a soft rigid hybrid gripper finger that combines rigid structural shells with soft, inflatable silicone pockets, and which can be integrated into a conventional gripper. The hybrid gripper can actively modulate its surface friction by varying the internal air pressure of the silicone pockets, enabling the gripper to securely grasp objects without increasing the gripping force. This is demonstrated by fundamental experimental results, in which an increase in internal pressure leads to a proportional increase in the effective coefficient of friction. The gripping experiments also show that the integrated gripper can stably lift heavy and slippery objects, as well as fragile, deformable objects such as eggs, tofu, fruits, and paper cups, with minimal damage by increasing friction rather than applying high force.
Chinese Translation
抓取具有多种机械特性的物体,如重物、滑物或易碎物品,仍然是机器人技术中的一项重大挑战。传统的刚性抓手通常依赖于增加法向力来固定物体,然而,这可能会因施加过大的力量而损坏易碎物体。为了解决这一局限性,我们提出了一种软硬混合抓手手指,它将刚性结构外壳与软性充气硅胶口袋相结合,可集成到传统抓手中。该混合抓手能够通过改变硅胶口袋内部气压主动调节其表面摩擦力,从而使抓手在不增加抓取力的情况下安全抓取物体。基础实验结果表明,内部压力的增加导致有效摩擦系数的成比例增加。抓取实验还表明,集成的抓手能够稳定地抓取重物和滑物,或易碎的可变形物体,如鸡蛋、豆腐、水果和纸杯,且通过增加摩擦而非施加高力量来实现最小损伤。
cs.RO / 48 / 2603.07351

A Distributed Gaussian Process Model for Multi-Robot Mapping

一种用于多机器人地图构建的分布式高斯过程模型
Nabarro, Seth, van der Wilk, Mark, Davison, Andrew J.
Abstract
We propose DistGP: a multi-robot learning method for collaborative learning of a global function using only local experience and computation. We utilise a sparse Gaussian process (GP) model with a factorisation that mirrors the multi-robot structure of the task, and admits distributed training via Gaussian belief propagation (GBP). Our loopy model outperforms Tree-Structured GPs (Bui et al., 2014) and can be trained online and in settings with dynamic connectivity. We show that such distributed, asynchronous training can reach the same performance as a centralised, batch-trained model, albeit with slower convergence. Last, we compare to DiNNO (Yu et al., 2022), a distributed neural network (NN) optimiser, and find DistGP achieves superior accuracy, is more robust to sparse communication and is better able to learn continually.
Chinese Translation
我们提出了DistGP:一种多机器人学习方法,通过仅依赖局部经验和计算来协作学习全局函数。我们利用了一种稀疏高斯过程(GP)模型,其因子分解反映了任务的多机器人结构,并允许通过高斯信念传播(GBP)进行分布式训练。我们的循环模型在性能上优于树结构高斯过程(Tree-Structured GPs)(Bui et al., 2014),并且可以在动态连接的环境中进行在线训练。我们展示了这种分布式、异步训练能够达到与集中式批量训练模型相同的性能,尽管收敛速度较慢。最后,我们与分布式神经网络(NN)优化器DiNNO(Yu et al., 2022)进行了比较,发现DistGP在准确性上表现更优,对稀疏通信更具鲁棒性,并且更能持续学习。
cs.RO / 49 / 2603.07393

Underwater Embodied Intelligence for Autonomous Robots: A Constraint-Coupled Perspective on Planning, Control, and Deployment

自主水下机器人中的具身智能:规划、控制与部署的约束耦合视角
Xu, Jingzehua, Xie, Guanwen, Tang, Jiwei, Zhang, Shuai, Li, Xiaofan
Abstract
Autonomous underwater robots are increasingly deployed for environmental monitoring, infrastructure inspection, subsea resource exploration, and long-horizon exploration. Yet, despite rapid advances in learning-based planning and control, reliable autonomy in real ocean environments remains fundamentally constrained by tightly coupled physical limits. Hydrodynamic uncertainty, partial observability, bandwidth-limited communication, and energy scarcity are not independent challenges; they interact within the closed perception-planning-control loop and often amplify one another over time. This Review develops a constraint-coupled perspective on underwater embodied intelligence, arguing that planning and control must be understood within tightly coupled sensing, communication, coordination, and resource constraints in real ocean environments. We synthesize recent progress in reinforcement learning, belief-aware planning, hybrid control, multi-robot coordination, and foundation-model integration through this embodied perspective. Across representative application domains, we show how environmental monitoring, inspection, exploration, and cooperative missions expose distinct stress profiles of cross-layer coupling. To unify these observations, we introduce a cross-layer failure taxonomy spanning epistemic, dynamic, and coordination breakdowns, and analyze how errors cascade across autonomy layers under uncertainty. Building on this structure, we outline research directions toward physics-grounded world models, certifiable learning-enabled control, communication-aware coordination, and deployment-aware system design. By internalizing constraint coupling rather than treating it as an external disturbance, underwater embodied intelligence may evolve from performance-driven adaptation toward resilient, scalable, and verifiable autonomy under real ocean conditions.
Chinese Translation
自主水下机器人正越来越多地用于环境监测、基础设施检查、海底资源勘探和长时间探索。然而,尽管基于学习的规划和控制技术迅速发展,在真实海洋环境中实现可靠的自主性仍然受到紧密耦合的物理限制的根本制约。水动力不确定性、部分可观测性、带宽有限的通信和能源稀缺并不是独立的挑战;它们在闭环的感知-规划-控制过程中相互作用,并且往往随着时间的推移相互放大。本文发展了一种关于水下具身智能的约束耦合视角,认为规划和控制必须在真实海洋环境中紧密耦合的感知、通信、协调和资源约束下进行理解。我们通过这一具身视角综合了强化学习、信念感知规划、混合控制、多机器人协调和基础模型集成等领域的最新进展。在代表性的应用领域中,我们展示了环境监测、检查、探索和合作任务如何暴露出跨层耦合的不同应力特征。为了统一这些观察结果,我们引入了一种跨层故障分类法,涵盖了认知、动态和协调失效,并分析了在不确定性下错误如何在自主性层之间级联。基于这一结构,我们概述了朝向基于物理的世界模型、可认证的学习驱动控制、通信感知协调和部署感知系统设计的研究方向。通过将约束耦合内化,而不是将其视为外部干扰,水下具身智能可能从以性能为驱动的适应演变为在真实海洋条件下具有韧性、可扩展和可验证的自主性。
cs.RO / 50 / 2603.07400

Perceptive Variable-Timing Footstep Planning for Humanoid Locomotion on Disconnected Footholds

针对不连续支撑点的人形运动的感知可变时序步态规划
Xiang, Zhaoyang, Pant, Upama, Hereid, Ayonga
Abstract
Many real-world walking scenarios contain obstacles and unsafe ground patches (e.g., slippery or cluttered areas), leaving a disconnected set of admissible footholds that can be modeled as stepping-stone-like regions. We propose an onboard, perceptive mixed-integer model predictive control framework that jointly plans foot placement and step duration using step-to-step Divergent Component of Motion (DCM) dynamics. Ego-centric depth images are fused into a probabilistic local heightmap, from which we extract a union of convex steppable regions. Region membership is enforced with binary variables in a mixed-integer quadratic program (MIQP). To keep the optimization tractable while certifying safety, we embed capturability bounds in the DCM space: a lateral one-step condition (preventing leg crossing) and a sagittal infinite-step bound that limits unstable growth. We further re-plan within the step by back-propagating the measured instantaneous DCM to update the initial DCM, improving robustness to model mismatch and external disturbances. We evaluate the approach in simulation on Digit on randomized stepping-stone fields, including external pushes. The planner generates terrain-aware, dynamically consistent footstep sequences with adaptive timing and millisecond-level solve times.
Chinese Translation
许多现实世界的行走场景中存在障碍物和不安全的地面区域(例如,滑溜或杂乱的地方),导致可接受的支撑点形成一个不连续的集合,这可以被建模为类似踏脚石的区域。我们提出了一种基于机载的感知混合整数模型预测控制框架,该框架通过步与步之间的运动发散分量(Divergent Component of Motion, DCM)动态共同规划足部放置和步伐时长。将自我中心的深度图像融合成一个概率局部高度图,从中提取出一个凸的可步行区域的并集。区域成员资格通过混合整数二次规划(Mixed-Integer Quadratic Program, MIQP)中的二元变量进行强制。为了在保证安全的同时保持优化的可处理性,我们在DCM空间中嵌入了可捕获性边界:一个防止腿部交叉的侧向一步条件和一个限制不稳定增长的矢状面无限步界限。我们进一步通过反向传播测量的瞬时DCM来更新初始DCM,从而在单步内重新规划,提高了对模型不匹配和外部干扰的鲁棒性。我们在模拟中评估了该方法,在随机的踏脚石场地上对Digit进行了测试,包括外部推力。该规划器生成了具有地形感知、动态一致性的步态序列,具有自适应时序和毫秒级的求解时间。
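The step-to-step dynamics underlying this class of planners is compact enough to state directly. A sketch of the standard linear-inverted-pendulum DCM map, in our notation rather than necessarily the paper's:

```latex
% Standard linear-inverted-pendulum DCM map (our notation): with natural
% frequency \omega = \sqrt{g / z_0}, foothold u_k, and step duration T_k,
% the divergent component of motion \xi obeys
\[
    \xi_{k+1} \;=\; u_k + \left( \xi_k - u_k \right) e^{\omega T_k} .
\]
% Treating both u_k (tied to a convex steppable region through binary
% membership variables) and the timing term \lambda_k := e^{\omega T_k} as
% decision variables is what turns the per-step program into an MIQP.
```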
cs.RO / 51 / 2603.07404

Adaptive Capacity Allocation for Vision Language Action Fine-tuning

视觉语言动作微调的自适应能力分配
Kim, Donghoon, Bae, Minji, Nam, Unghui, Kim, Gyeonghun, Lee, Suyun, Shim, Kyuhong, Shim, Byonghyo
Abstract
Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (e.g., $r \in \{4, 8\}$), while spectral analyses indicate VLAs may require much larger ranks (e.g., $r \approx 128$) or near-full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (Select-Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores $E(k) \ge \eta$, providing a direct link to approximation error via our spectral analysis. During training, $\eta$ concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones ($\pi_0$ and SmolVLA), LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters, and improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.
Chinese Translation
视觉语言动作模型(VLA)在物理人工智能中越来越多地被使用,但将预训练的VLA模型部署到未见过的环境、体现或任务中仍然需要适应。参数高效微调(PEFT),尤其是LoRA,通常用于VLA策略,但暴露的能力调节器——秩,并不均匀转移:机器人转移表现出比语言微调更高且任务变化的内在秩。小秩对于大型语言模型(LLMs)来说足够(例如,$r \in \{4, 8\}$),而谱分析表明VLA可能需要大得多的秩(例如,$r \approx 128$)或接近满秩,这种不匹配在多任务环境中更为严重。我们提出了LoRA-SP(选择-修剪),一种秩自适应微调方法,它用输入和层级的能力替代固定秩更新。LoRA-SP使用一种SVD风格的参数化,配备一个小型路由器,其非负分数作为共享向量库的奇异值。活跃集通过累积平方分数的能量目标 $E(k) \ge \eta$ 进行选择,并通过我们的谱分析与近似误差建立直接联系。在训练过程中,$\eta$ 将能量集中在少数方向上,并教会路由器依赖更少的向量,同时保持准确性。这产生了紧凑的适配器,减少了跨任务干扰并改善了泛化能力。在四个真实机器人操作任务中,使用未见过的AgileX PiPER臂,针对两个VLA骨干网络($\pi_0$ 和 SmolVLA),LoRA-SP在可训练参数远少于完全微调的情况下匹配或超越了完全微调,并在多任务成功率上比标准LoRA提高了多达31.6%,同时对秩选择保持鲁棒性。
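The energy-based selection rule maps directly to a few lines of code. A minimal sketch: given the router's nonnegative scores, keep the smallest set of directions whose cumulative squared-score energy reaches the target; the decaying random scores below are stand-ins for trained router outputs:

```python
# Select the active set of directions by the energy target E(k) >= eta.
import numpy as np

def select_active_set(scores, eta=0.95):
    order = np.argsort(scores)[::-1]             # strongest directions first
    energy = np.cumsum(scores[order] ** 2)
    energy /= energy[-1]                         # E(k): cumulative energy
    k = int(np.searchsorted(energy, eta) + 1)    # smallest k with E(k) >= eta
    return order[:k]

rng = np.random.default_rng(4)
scores = np.abs(rng.normal(size=64)) * np.geomspace(1.0, 0.01, 64)
active = select_active_set(scores, eta=0.95)
print(f"{len(active)} of 64 directions carry 95% of the energy")
```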
cs.RO / 52 / 2603.07417

Unifying Sidewinding and Rolling: A Wave-Based Framework for Self-Righting in Elongated Limbless and Multi-Legged Robots

统一侧行与滚动:一种基于波动的框架用于细长无肢和多腿机器人自我翻正
Liu, Hangjun, Geng, Jiarui, Ding, Jinxuan, He, Gengzhi, Wang, Xiyuan, Arukgoda, Melisa, DiGennaro, Joe, Ubertalli, George, Blekherman, Grigoriy, Chong, Baxi
Abstract
Centipede-like robots offer unique locomotion advantages due to their small cross-sectional area for accessing confined spaces, and their redundant legs enhance robustness in cluttered environments such as search-and-rescue and pipe inspection. However, elongated robots are particularly vulnerable to tipping over when climbing large obstacles, making reliable self-righting essential for field deployment. Self-righting strategies for elongate, multi-legged systems remain poorly understood. In this study, we conduct a comparative biomechanics and robophysical investigation to address three key questions: (1) What self-righting strategies are effective for elongate, many-legged systems? (2) How should these strategies depend on morphological parameters such as leg length and leg number? (3) Is there a morphological limit beyond which reliable self-righting becomes infeasible? We compare two biological exemplars: Scolopendra subspinipes (short legs) and Scutigera coleoptrata (house centipedes with long legs). Scolopendra subspinipes reliably self-rights both during aerial phases and through ground-assisted self-righting, whereas house centipedes rely predominantly on aerial reorientation and struggle to generate effective self-righting torques during ground contact. Motivated by these observations, we construct a parameterized space of bio-inspired self-righting strategies and develop an elongate robot with adjustable leg lengths. Systematic experiments reveal that increasing leg length necessitates a shift in control strategy to prevent torque over-concentration in mid-body actuators, and we identify a critical limb-length threshold above which robust self-righting becomes challenging. These results establish morphology-strategy coupling principles for self-righting in elongate robots and provide design guidelines for centipede-like systems operating in uncertain terrain.
Chinese Translation
类似百足虫的机器人由于其小的横截面积在狭小空间中具有独特的运动优势,同时其冗余的腿部在搜索救援和管道检查等杂乱环境中增强了鲁棒性。然而,细长机器人在攀爬大障碍物时特别容易倾覆,因此可靠的自我翻正对于现场部署至关重要。细长多腿系统的自我翻正策略仍然不够清楚。在本研究中,我们进行了一项比较生物力学和机器人物理学的研究,以解决三个关键问题:(1)哪些自我翻正策略对细长多腿系统有效?(2)这些策略应如何依赖于形态参数,如腿长和腿数?(3)是否存在一个形态极限,超出该极限后可靠的自我翻正变得不可行?我们比较了两个生物样本:短腿的蜈蚣(Scolopendra subspinipes)和长腿的家蜈蚣(Scutigera coleoptrata)。蜈蚣在空中阶段和通过地面辅助自我翻正时都能可靠地自我翻正,而家蜈蚣主要依赖空中重新定向,并在与地面接触时难以产生有效的自我翻正扭矩。受到这些观察的启发,我们构建了一个参数化的生物启发自我翻正策略空间,并开发了一种具有可调腿长的细长机器人。系统实验表明,增加腿长需要控制策略的转变,以防止中部驱动器中扭矩的过度集中,并且我们识别出一个关键的肢体长度阈值,超过该阈值后,稳健的自我翻正变得具有挑战性。这些结果建立了细长机器人自我翻正的形态-策略耦合原则,并为在不确定地形中运行的类似百足虫的系统提供了设计指南。
cs.RO / 53 / 2603.07426

Cable-driven Continuum Robotics: Proprioception via Proximal-integrated Force Sensing

电缆驱动的连续机器人:通过近端集成力传感实现本体感觉
Zhang, Gang, Yan, Junyan, Chen, Jibiao, Cheng, Shing Shin
Abstract
Micro-scale continuum robots face significant limitations in achieving three-dimensional contact force perception, primarily due to structural miniaturization, nonlinear mechanics, and the difficulty of sensor integration. To overcome these limitations, this paper introduces a novel proprioception method for cable-driven continuum robots based on proximal-integrated force sensing (i.e., cable tension and a six-axis force/torque (F/T) sensor), inspired by the tendon-joint collaborative sensing mechanism of the human finger. By integrating biomechanically inspired design principles with nonlinear modeling, the proposed method addresses the challenge of force perception (including the three-dimensional contact force and the location of the contact point) and shape estimation in micro-scale continuum robots. First, a quasi-bionic mapping between human tissues/organs and robot components is established, enabling the transfer of the integrated sensing strategy of tendons, joints, and neural feedback to the robotic system. Second, a multimodal perception strategy is developed based on the structural constraints inherent to continuum robots. The complex relationships among mechanical and material nonlinearities, robot motion states, and contact forces are formulated as an optimization problem to reduce the perception complexity. Finally, experimental validation demonstrates the effectiveness of the proposed method. This work lays the foundation for developing safer and smarter continuum robots, enabling broader clinical adoption in complex environments.
Chinese Translation
微尺度连续机器人在实现三维接触力感知方面面临重大限制,主要由于结构微型化、非线性机械特性和传感器集成。为克服这些限制,本文提出了一种基于近端集成力传感(即电缆张力和六轴力/扭矩(F/T)传感器)的新型本体感觉方法,灵感来源于手指的肌腱-关节协同感知机制。通过将生物力学启发的设计原则与非线性建模相结合,所提出的方法解决了微尺度连续机器人在力感知(包括三维接触力和接触点位置)及形状估计方面的挑战。首先,建立了人类组织/器官与机器人组件之间的类生物映射,使得肌腱、关节和神经反馈的集成感知策略能够转移到机器人系统中。其次,基于连续机器人固有的结构约束,开发了一种多模态感知策略。机械和材料非线性、机器人运动状态及接触力之间的复杂关系被形式化为一个优化问题,以降低感知复杂性。最后,实验验证了所提方法的有效性。本研究为开发更安全、更智能的连续机器人奠定了基础,使其能够在复杂环境中更广泛地应用于临床。
cs.RO / 54 / 2603.07442

LITHE: Bridging Best-Effort Python and Real-Time C++ for Hot-Swapping Robotic Control Laws on Commodity Linux

LITHE:在商品Linux上为热插拔机器人控制律架起最佳努力Python与实时C++之间的桥梁
Lim, He Kai, Clites, Tyler R.
Abstract
Modern robotic systems rely on hierarchical control, where a high-level "Brain" (Python) directs a lower-level "Spine" (C++ real-time controller). Despite its necessity, this hierarchy makes it difficult for the Brain to completely rewrite the Spine's immutable control logic, consequently inhibiting fundamental adaptation for different tasks and environments. Conventional approaches require complex middleware, proprietary hardware, or sacrifice real-time performance. We present LITHE (Linux Isolated Threading for Hierarchical Execution), a lightweight software architecture that collapses the robot control hierarchy onto a commodity single-board computer (Raspberry Pi 4B with pi3hat), while maintaining safe frequency decoupling between the Brain and Spine. LITHE integrates strict CPU isolation (isolcpus), lock-free inter-process communication (IPC), and pipelined execution to meet high-frequency deadlines with minimal jitter. By adding multi-threaded dynamic linking, LITHE enables a Python-based Brain to dynamically evolve the logic of a 1kHz C++ Spine without interruption. We validate "functional real-time" system performance with worst-case execution time (WCET) < 100 $\mu$s and maximum release jitter (MRJ) < 4 $\mu$s under heavy load. We demonstrate a novel application where a large language model (LLM) supervisor performs online system identification to evolve a real-time controller on-the-fly, without interrupting the 1 kHz control loop. In essence, LITHE eliminates the "immutable compiled code" bottleneck for best-effort Brains to synthesize and inject completely new control laws into the real-time Spine. This bridges a critical gap between high-level AI and low-level real-time control to unlock continuous real-time evolution of embodied intelligence in safe, human-in-the-loop systems.
Chinese Translation
现代机器人系统依赖于分层控制,其中高层“脑”(Python)指挥低层“脊柱”(C++实时控制器)。尽管这种分层结构是必要的,但它使得脑部无法完全重写脊柱的不可变控制逻辑,从而抑制了针对不同任务和环境的基本适应性。传统方法需要复杂的中间件、专有硬件,或牺牲实时性能。我们提出了LITHE(Linux隔离线程用于分层执行),这是一种轻量级软件架构,将机器人控制层次结构压缩到商品单板计算机(Raspberry Pi 4B与pi3hat)上,同时保持脑部与脊柱之间的安全频率解耦。LITHE集成了严格的CPU隔离(isolcpus)、无锁进程间通信(IPC)和流水线执行,以满足高频截止时间并最小化抖动。通过添加多线程动态链接,LITHE使基于Python的脑部能够在不中断的情况下动态演变1kHz C++脊柱的逻辑。我们在重负载下验证了“功能实时”系统性能,最坏情况下执行时间(WCET)< 100微秒,最大释放抖动(MRJ)< 4微秒。我们展示了一种新颖的应用,其中大型语言模型(LLM)监督者执行在线系统识别,以实时演变控制器,而不打断1 kHz控制循环。本质上,LITHE消除了“不可变编译代码”的瓶颈,使最佳努力的脑部能够合成并注入全新的控制律到实时脊柱中。这弥合了高层AI与低层实时控制之间的关键差距,解锁了在安全的人机协作系统中具身智能的持续实时演变。
cs.RO / 55 / 2603.07480

GSAT: Geometric Traversability Estimation using Self-supervised Learning with Anomaly Detection for Diverse Terrains

GSAT:基于自监督学习与异常检测的几何可通行性估计用于多样化地形
Cho, Dongjin, Park, Miryeong, Lee, Juhui, Yang, Geonmo, Cho, Younggun
Abstract
Safe autonomous navigation requires reliable estimation of environmental traversability. Traditional methods have relied on semantic or geometry-based approaches with human-defined thresholds, but these methods often yield unreliable predictions due to the inherent subjectivity of human supervision. While self-supervised approaches enable robots to learn from their own experience, they still face a fundamental challenge: the positive-only learning problem. To address these limitations, recent studies have employed Positive-Unlabeled (PU) learning, where the core challenge is identifying positive samples without explicit negative supervision. In this work, we propose GSAT, which addresses these limitations by constructing a positive hypersphere in latent space to classify traversable regions through anomaly detection without requiring additional prototypes (e.g., unlabeled or negative). Furthermore, our approach employs joint learning of anomaly classification and traversability prediction to more efficiently utilize robot experience. We comprehensively evaluate the proposed framework through ablation studies, validation on heterogeneous real-world robotic platforms, and autonomous navigation demonstrations in simulation environments.
Chinese Translation
安全的自主导航需要对环境可通行性进行可靠估计。传统方法依赖于语义或基于几何的方法,并使用人类定义的阈值,但由于人类监督的固有主观性,这些方法往往产生不可靠的预测。虽然自监督方法使机器人能够从自身经验中学习,但它们仍面临一个根本性挑战:仅正样本学习问题。为了解决这些局限性,最近的研究采用了正例-未标注(Positive-Unlabeled, PU)学习,其中核心挑战是在没有明确负监督的情况下识别正样本。在本研究中,我们提出了GSAT,通过在潜在空间中构建正超球体来分类可通行区域,利用异常检测而无需额外的原型(例如,未标记或负样本),从而解决这些局限性。此外,我们的方法采用异常分类与可通行性预测的联合学习,以更有效地利用机器人经验。我们通过消融研究、在异构真实世界机器人平台上的验证以及在仿真环境中的自主导航演示,全面评估了所提出的框架。
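The positive-hypersphere classifier follows the general one-class (Deep SVDD-style) recipe that the abstract describes. A minimal sketch, assuming learned latent features are available; random embeddings stand in for them here:

```python
# One-class traversability: centroid of positive (traversed) embeddings
# defines the hypersphere; distance beyond a quantile radius is anomalous.
import numpy as np

rng = np.random.default_rng(5)
positive = rng.normal(0.0, 1.0, size=(500, 16))     # traversed-terrain latents
center = positive.mean(axis=0)
radii = np.linalg.norm(positive - center, axis=1)
radius = np.quantile(radii, 0.99)                   # sphere radius

def traversable(z):
    return np.linalg.norm(z - center) <= radius

print(traversable(rng.normal(0.0, 1.0, size=16)))   # near the data: expected True
print(traversable(np.full(16, 4.0)))                # far away: False
```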
cs.RO / 56 / 2603.07484

HSC-VLA: Hierarchical Scene-Clearing for Robust Bimanual Manipulation in Dense Clutter

HSC-VLA:用于密集杂乱环境中稳健双手操作的分层场景清理
Liu, Zhen, Ning, Xinyu, Hu, Zhe, Xie, XinXin, Liu, Yitong, Pu, Zhongzhu
Abstract
Modern Vision-Language-Action models often suffer from critical instruction-following failures in high-density manipulation environments, where task-irrelevant visual clutter dilutes attention, corrupts grounding, and substantially degrades performance in complex long-horizon scenarios. To overcome the representation bottleneck of monolithic end-to-end architectures, we propose HSC-VLA, a hierarchical framework that decouples high-level visual-semantic reasoning from low-level, high-frequency sensorimotor execution through an explicit scene-clearing abstraction. HSC-VLA employs a high-level Brain to decompose long-horizon tasks and to generate task-specific scene masks that preserve task-relevant geometry while suppressing distractors. The filtered observations are then passed to a low-level Cerebellum, a diffusion-based policy that performs bimanual manipulation using only mask-filtered vision and proprioception. Extensive experiments in densely cluttered supermarket shelves demonstrate that HSC-VLA achieves 86.7% aggregate success under high-density clutter, surpassing the best monolithic baseline ($\pi_0$-Full FT at 34.3%) by 52.4%. HSC-VLA also exhibits strong long-horizon performance, reaching 72% on clutter sorting and 66% on restocking, demonstrating strong robustness and effective failure recovery in complex cluttered manipulation.
Chinese Translation
现代视觉-语言-动作模型在高密度操作环境中常常出现严重的指令遵循失败:与任务无关的视觉杂乱分散了注意力,破坏了语义落地(grounding),并在复杂的长时程场景中显著降低性能。为了克服单一端到端架构的表示瓶颈,我们提出了HSC-VLA,一个通过显式场景清理抽象将高层视觉-语义推理与低层高频传感器运动执行解耦的分层框架。HSC-VLA利用高层大脑(Brain)来分解长时程任务,并生成特定任务的场景掩模,这些掩模保留了与任务相关的几何信息,同时抑制干扰项。经过过滤的观察结果随后传递给低层小脑(Cerebellum),这是一个基于扩散的策略,仅使用掩模过滤后的视觉和本体感觉进行双手操作。在密集杂乱的超市货架上进行的大量实验表明,HSC-VLA在高密度杂乱环境下实现了86.7%的整体成功率,比最佳单一基线($\pi_0$-Full FT,34.3%)高出52.4%。HSC-VLA还展现出强大的长时程性能,在杂乱分类任务中达到72%,在补货任务中达到66%,展示了在复杂杂乱操作中的强鲁棒性和有效的失败恢复能力。
cs.RO / 57 / 2603.07516

InterReal: A Unified Physics-Based Imitation Framework for Learning Human-Object Interaction Skills

InterReal:一个统一的基于物理的模仿框架,用于学习人与物体交互技能
Liang, Dayang, Lin, Yuhang, Liu, Xinzhe, Shi, Jiyuan, Liu, Yunlong, Bai, Chenjia
Abstract
Interaction is one of the core abilities of humanoid robots. However, most existing frameworks focus on non-interactive whole-body control, which limits their practical applicability. In this work, we develop InterReal, a unified physics-based imitation learning framework for Real-world human-object Interaction (HOI) control. InterReal enables humanoid robots to track HOI reference motions, facilitating the learning of fine-grained interactive skills and their deployment in real-world settings. Within this framework, we first introduce a HOI motion data augmentation scheme with hand-object contact constraints, and utilize the augmented motions to improve policy stability under object perturbations. Second, we propose an automatic reward learner to address the challenge of large-scale reward shaping. A meta-policy guided by critical tracking error metrics explores and allocates reward signals to the low-level reinforcement learning objective, which enables more effective learning of interactive policies. Experiments on HOI tasks of box-picking and box-pushing demonstrate that InterReal achieves the best tracking accuracy and the highest task success rate compared to recent baselines. Furthermore, we validate the framework on the real-world robot Unitree G1, which demonstrates its practical effectiveness and robustness beyond simulation.
Chinese Translation
交互是类人机器人的核心能力之一。然而,现有的大多数框架集中于非交互式的全身控制,这限制了它们的实际应用性。在本研究中,我们开发了InterReal,一个统一的基于物理的模仿学习框架,用于现实世界的人-物交互(Human-Object Interaction, HOI)控制。InterReal使类人机器人能够跟踪HOI参考动作,促进细粒度交互技能的学习及其在现实环境中的部署。在该框架内,我们首先引入了一种带有手-物体接触约束的HOI动作数据增强方案,并利用增强后的动作来提高在物体扰动下的策略稳定性。其次,我们提出了一种自动奖励学习器,以应对大规模奖励塑形的挑战。一个由关键跟踪误差指标指导的元策略探索并将奖励信号分配给低层强化学习目标,从而实现更有效的交互策略学习。在箱子拾取和箱子推动等HOI任务上的实验表明,与最近的基线相比,InterReal实现了最佳的跟踪精度和最高的任务成功率。此外,我们在现实世界的机器人Unitree G1上验证了该框架,证明了其在仿真之外的实际有效性和鲁棒性。
cs.RO / 58 / 2603.07530

ICLR: In-Context Imitation Learning with Visual Reasoning

ICLR:具有视觉推理的上下文模仿学习
Nguyen, Toan, Yuan, Weiduo, Wei, Songlin, Li, Hui, Seita, Daniel, Wang, Yue
Abstract
In-context imitation learning enables robots to adapt to new tasks from a small number of demonstrations without additional training. However, existing approaches typically condition only on state-action trajectories and lack explicit representations of task intent. This limitation hinders performance in complex and ambiguous task settings where the same actions may be consistent with different objectives. To address this, we present In-Context Imitation Learning with Visual Reasoning (ICLR), a novel framework that augments demonstration prompts with structured visual reasoning traces representing anticipated future robot trajectories in image space. ICLR also jointly learns to generate reasoning traces and low-level actions within a unified autoregressive transformer, enabling the model to mimic not only action prediction but also the reasoning process that leads to those actions. We extensively evaluate ICLR in both simulation and real-world manipulation tasks and demonstrate consistent improvements in success rates and generalization to unseen tasks and novel object configurations compared to other in-context imitation learning methods. These results suggest that incorporating embodied visual reasoning represents a promising direction for enhancing the robustness and generalization of robotic in-context learning systems.
Chinese Translation
上下文模仿学习使机器人能够在没有额外训练的情况下,从少量演示中适应新任务。然而,现有的方法通常仅以状态-动作轨迹为条件,缺乏对任务意图的明确表示。这一局限性在复杂和模糊的任务环境中妨碍了性能,因为相同的动作可能与不同的目标一致。为了解决这个问题,我们提出了具有视觉推理的上下文模仿学习(ICLR),这是一个新颖的框架,通过表示图像空间中预期未来机器人轨迹的结构化视觉推理轨迹来增强演示提示。ICLR还在统一的自回归Transformer中联合学习生成推理轨迹和低级动作,使模型不仅能够模仿动作预测,还能模仿导致这些动作的推理过程。我们在仿真和现实世界的操作任务中广泛评估了ICLR,并展示了与其他上下文模仿学习方法相比,在成功率以及对未见任务和新物体配置的泛化能力上的一致提升。这些结果表明,结合具身视觉推理是增强机器人上下文学习系统鲁棒性和泛化能力的一个有前景的方向。
cs.RO / 59 / 2603.07533

ACCURATE: Arbitrary-shaped Continuum Reconstruction Under Robust Adaptive Two-view Estimation

ACCURATE:基于鲁棒自适应双视图估计的任意形状连续体重建
Zhang, Yaozhi, Yu, Shun, Zhang, Yugang, Liu, Yang
Abstract
Accurate reconstruction of arbitrary-shaped long slender continuum bodies, such as guidewires, catheters and other soft continuum manipulators, is essential for accurate mechanical simulation. However, existing image-based reconstruction approaches often suffer from limited accuracy because they underutilize camera geometry, or lack generality because they rely on rigid geometric assumptions that may fail for continuum robots with complex and highly deformable shapes. To address these limitations, we propose ACCURATE, a 3D reconstruction framework integrating an image segmentation neural network with a geometry-constrained topology traversal and dynamic programming algorithm that enforces global biplanar geometric consistency, minimizes the cumulative point-to-epipolar-line distance, and remains robust to occlusions and epipolar ambiguities caused by noise and discretization. Our method achieves high reconstruction accuracy on both simulated and real phantom datasets acquired using a clinical X-ray C-arm system, with mean absolute errors below 1.0 mm.
Chinese Translation
准确重建任意形状的细长连续体,如导丝、导管及其他软性连续体操控器,对于精确的力学仿真至关重要。然而,现有的基于图像的重建方法往往因未充分利用相机几何信息而导致精度有限,或因依赖于刚性几何假设而缺乏通用性,这些假设在面对形状复杂且高度可变形的连续体机器人时可能失效。为了解决这些局限性,我们提出了ACCURATE,一个3D重建框架,它将图像分割神经网络与几何约束的拓扑遍历和动态规划算法相结合,强制执行全局双平面几何一致性,最小化累积的点到极线距离,并对由噪声和离散化引起的遮挡和极线歧义保持鲁棒性。我们的方法在使用临床X射线C臂系统获取的仿真与真实体模(phantom)数据集上均实现了高重建精度,平均绝对误差低于1.0毫米。
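The cost that the traversal and dynamic-programming stage accumulates is built from point-to-epipolar-line distances; a sketch of that geometric primitive follows, assuming a known fundamental matrix F between the two C-arm views in pixel coordinates (the matrix values below are placeholders, not calibration results).

```python
# Point-to-epipolar-line distance, the primitive behind ACCURATE's cost.
import numpy as np

def epipolar_distance(F, x1, x2):
    # Epipolar line in image 2 induced by x1: l = F @ [x1; 1].
    l = F @ np.array([x1[0], x1[1], 1.0])
    num = abs(l[0] * x2[0] + l[1] * x2[1] + l[2])
    return num / np.hypot(l[0], l[1])   # normalize to a pixel distance

F = np.array([[0.0,  -1e-6,  1e-3],
              [1e-6,  0.0,  -2e-3],
              [-1e-3, 2e-3,  1.0]])     # illustrative values only
print(epipolar_distance(F, (320.0, 240.0), (331.5, 244.2)))
```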
cs.RO / 60 / 2603.07578

Approximate Imitation Learning for Event-based Quadrotor Flight in Cluttered Environments

面向杂乱环境中基于事件相机的四旋翼飞行的近似模仿学习
Messikommer, Nico, Xing, Jiaxu, Bauersfeld, Leonard, Cannici, Marco, Aljalbout, Elie, Scaramuzza, Davide
Abstract
Event cameras offer high temporal resolution and low latency, making them ideal sensors for high-speed robotic applications where conventional cameras suffer from image degradations such as motion blur. In addition, their low power consumption can enhance endurance, which is critical for resource-constrained platforms. Motivated by these properties, we present a novel approach that enables a quadrotor to fly through cluttered environments at high speed by perceiving the environment with a single event camera. Our proposed method employs an end-to-end neural network trained to map event data directly to control commands, eliminating the reliance on standard cameras. To enable efficient training in simulation, where rendering synthetic event data is computationally expensive, we propose Approximate Imitation Learning, a novel imitation learning framework. Our approach leverages a large-scale offline dataset to learn a task-specific representation space. Subsequently, the policy is trained through online interactions that rely solely on lightweight, simulated state information, eliminating the need to render events during training. This enables the efficient training of event-based control policies for fast quadrotor flight, highlighting the potential of our framework for other modalities where data simulation is costly or impractical. Our approach outperforms standard imitation learning baselines in simulation and demonstrates robust performance in real-world flight tests, achieving speeds up to 9.8 m/s in cluttered environments.
Chinese Translation
事件相机具有高时间分辨率和低延迟,是高速机器人应用的理想传感器;在这类应用中,传统相机会受到运动模糊等图像退化的影响。此外,其低功耗特性可以增强续航能力,这对资源受限的平台至关重要。基于这些特性,我们提出了一种新颖的方法,使四旋翼能够仅通过单个事件相机感知环境,并在杂乱环境中高速飞行。我们提出的方法采用端到端神经网络,直接将事件数据映射到控制命令,从而消除了对标准相机的依赖。为了在仿真中实现高效训练(因为渲染合成事件数据的计算成本较高),我们提出了近似模仿学习(Approximate Imitation Learning),一种新颖的模仿学习框架。我们的方法利用大规模离线数据集学习任务特定的表示空间。随后,通过仅依赖轻量级仿真状态信息的在线交互来训练策略,消除了在训练过程中渲染事件的需求。这使得面向快速四旋翼飞行的事件控制策略能够高效训练,也突显了我们的框架在数据仿真成本高或不切实际的其他模态中的潜力。我们的方法在仿真中优于标准模仿学习基线,并在真实飞行测试中表现出稳健的性能,在杂乱环境中实现了高达9.8 m/s的飞行速度。
cs.RO / 61 / 2603.07580

FeasibleCap: Real-Time Embodiment Constraint Guidance for In-the-Wild Robot Demonstration Collection

FeasibleCap:面向野外机器人演示采集的实时具身约束指导
Yin, Zi, Li, Fanhong, Gui, Yun, Liu, Jia
Abstract
Gripper-in-hand data collection decouples demonstration acquisition from robot hardware, but whether a trajectory is executable on the target robot remains unknown until a separate replay-and-validate stage. Failed demonstrations therefore inflate the effective cost per usable trajectory through repeated collection, diagnosis, and validation. Existing collection-time feedback systems mitigate this issue but rely on head-worn AR/VR displays, robot-in-the-loop hardware, or learned dynamics models; real-time executability feedback has not yet been integrated into the gripper-in-hand data collection paradigm. We present \textbf{FeasibleCap}, a gripper-in-hand data collection system that brings real-time executability guidance into robot-free capture. At each frame, FeasibleCap checks reachability, joint-rate limits, and collisions against a target robot model and closes the loop through on-device visual overlays and haptic cues, allowing demonstrators to correct motions during collection without learned models, headsets, or robot hardware. On pick-and-place and tossing tasks, FeasibleCap improves replay success and reduces the fraction of infeasible frames, with the largest gains on tossing. Simulation experiments further indicate that enforcing executability constraints during collection does not sacrifice cross-embodiment transfer across robot platforms. Hardware designs and software are available at https://github.com/aod321/FeasibleCap.
Chinese Translation
手持夹具的数据收集将演示获取与机器人硬件解耦,但在单独的重放与验证阶段之前,某条轨迹是否能在目标机器人上执行仍然未知。因此,失败的演示通过重复的收集、诊断和验证,抬高了每条可用轨迹的有效成本。现有的收集时反馈系统缓解了这一问题,但依赖于头戴式增强现实/虚拟现实显示器、机器人在环硬件或学习得到的动力学模型;实时可执行性反馈尚未被整合进手持夹具的数据收集范式。我们提出了FeasibleCap,一种将实时可执行性指导引入无机器人捕获的手持夹具数据收集系统。在每一帧中,FeasibleCap依据目标机器人模型检查可达性、关节速率限制和碰撞,并通过设备上的视觉叠加和触觉提示闭环,使演示者在收集过程中能够纠正动作,而无需学习模型、头戴设备或机器人硬件。在抓取-放置和投掷任务中,FeasibleCap提高了重放成功率,并减少了不可行帧的比例,其中投掷任务的增益最大。仿真实验进一步表明,在收集过程中强制执行可执行性约束并不会牺牲跨机器人平台的跨具身迁移能力。硬件设计和软件可在 https://github.com/aod321/FeasibleCap 获取。
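A minimal sketch of the kind of per-frame executability check described above; the spherical workspace proxy, rate limit, and omitted collision query are hypothetical stand-ins for the paper's checks against a full robot model.

```python
# Illustrative per-frame feasibility check (not FeasibleCap's code).
import numpy as np

JOINT_RATE_LIMIT = 2.0   # rad/s, assumed per-joint bound
REACH_RADIUS = 0.85      # m, crude spherical workspace proxy

def frame_feasible(q_prev, q_curr, dt, ee_pos, base_pos):
    # 1) reachability: end-effector stays inside an assumed workspace
    reachable = np.linalg.norm(ee_pos - base_pos) <= REACH_RADIUS
    # 2) joint-rate limits: finite-difference velocity within bounds
    rates_ok = np.all(np.abs(q_curr - q_prev) / dt <= JOINT_RATE_LIMIT)
    # 3) a collision check would query the robot model here (omitted)
    return reachable and rates_ok

q0 = np.zeros(6)
q1 = q0 + 0.01
print(frame_feasible(q0, q1, dt=1 / 30,
                     ee_pos=np.array([0.4, 0.1, 0.3]),
                     base_pos=np.zeros(3)))
```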
cs.RO / 62 / 2603.07582

Model-Based and Neural-Aided Approaches for Dog Dead Reckoning

基于模型与神经网络辅助的犬只航位推算方法
Savin, Gal, Versano, Itai, Klein, Itzik
Abstract
Modern canine applications span medical and service roles, while robotic legged dogs serve as autonomous platforms for high-risk industrial inspection, disaster response, and search and rescue operations. For both, accurate positioning remains a significant challenge due to the cumulative drift inherent in inertial sensing. To bridge this gap, we propose three algorithms for accurate positioning using only inertial sensors, collectively referred to as dog dead reckoning (DDR). To evaluate our approaches, we designed DogMotion, a wearable unit for canine data recording. Using DogMotion, we recorded a dataset of 13 minutes. Additionally, we utilized a robotic legged dog dataset with a duration of 116 minutes. Across the two distinct datasets we demonstrate that our neural-aided methods consistently outperform model-based approaches, achieving an absolute distance error of less than 10\%. Consequently, we provide a lightweight and low-cost positioning solution for both biological and legged robotic dogs. To support reproducibility, our codebase and associated datasets have been made publicly available.
Chinese Translation
现代犬类应用涵盖医疗和服务角色,而机器人四足犬则作为高风险工业检查、灾难响应和搜索救援行动的自主平台。对于这两者而言,由于惯性传感固有的累积漂移,精确定位仍然是一个重大挑战。为弥补这一差距,我们提出了三种仅使用惯性传感器进行精确定位的算法,统称为犬只航位推算(Dog Dead Reckoning, DDR)。为了评估我们的方法,我们设计了DogMotion,一个用于犬类数据记录的可穿戴设备。通过DogMotion,我们记录了一个时长13分钟的数据集。此外,我们还利用了一个时长116分钟的机器人四足犬数据集。在这两个不同的数据集上,我们展示了神经辅助方法始终优于基于模型的方法,绝对距离误差低于10%。因此,我们为生物犬和四足机器人犬提供了一种轻量且低成本的定位解决方案。为了支持可复现性,我们的代码库和相关数据集已公开发布。
cs.RO / 63 / 2603.07618

SMAT: Staged Multi-Agent Training for Co-Adaptive Exoskeleton Control

SMAT:分阶段多智能体训练用于协同适应外骨骼控制
Yuan, Yifei, Androwis, Ghaith, Zhou, Xianlian
Abstract
Effective exoskeleton assistance requires co-adaptation: as the device alters joint dynamics, the user reorganizes neuromuscular coordination, creating a non-stationary learning problem. Most learning-based approaches do not explicitly account for the sequential nature of human motor adaptation, leading to training instability and poorly timed assistance. We propose Staged Multi-Agent Training (SMAT), a four-stage curriculum designed to mirror how users naturally acclimate to a wearable device. In SMAT, a musculoskeletal human actor and a bilateral hip exoskeleton actor are trained progressively: the human first learns unassisted gait, then adapts to the added device mass; the exoskeleton subsequently learns a positive assistance pattern against a stabilized human policy, and finally both agents co-adapt with full torque capacity and bidirectional feedback. We implement SMAT in the MyoAssist simulation environment using a 26-muscle lower-limb model and an attached hip exoskeleton. Our musculoskeletal simulations demonstrate that the learned exoskeleton control policy produces an average 10.1% reduction in hip muscle activation relative to the no-assist condition. We validated the learned controller in an offline setting using open-source gait data, then deployed it to a physical hip exoskeleton for treadmill experiments with five subjects. The resulting policy delivers consistent assistance and predominantly positive mechanical power, without any explicitly imposed timing shift (mean positive power: 13.6 W at 6 Nm RMS torque to 23.8 W at 9.3 Nm RMS torque, with minimal negative power), across all subjects without subject-specific retraining.
Chinese Translation
有效的外骨骼辅助需要协同适应:随着设备改变关节动力学,用户会重新组织神经肌肉协调,形成一个非平稳的学习问题。大多数基于学习的方法并未明确考虑人类运动适应的顺序特性,导致训练不稳定和辅助时机不佳。我们提出了分阶段多智能体训练(Staged Multi-Agent Training, SMAT),一个四阶段训练课程,旨在模拟用户自然适应可穿戴设备的过程。在SMAT中,一个肌肉骨骼人体智能体和一个双侧髋关节外骨骼智能体被逐步训练:人体智能体首先学习无辅助步态,然后适应增加的设备质量;随后,外骨骼在稳定的人体策略下学习正向辅助模式,最后两个智能体在完整扭矩能力和双向反馈下共同适应。我们在MyoAssist仿真环境中实现了SMAT,使用了一个包含26块肌肉的下肢模型和一个附加的髋关节外骨骼。我们的肌肉骨骼仿真表明,与无辅助条件相比,学习到的外骨骼控制策略使髋部肌肉激活平均减少了10.1%。我们首先在离线环境中使用开源步态数据验证了学习到的控制器,然后将其部署到一个物理髋关节外骨骼上,对五名受试者进行了跑步机实验。最终的策略在所有受试者中提供了一致的辅助,并且在没有任何显式施加的时序偏移的情况下,主要输出正向机械功率(在6 Nm RMS扭矩下平均正功率为13.6 W,在9.3 Nm RMS扭矩下为23.8 W,且负功率极小),无需针对特定受试者重新训练。
cs.RO / 64 / 2603.07624

GeoLoco: Leveraging 3D Geometric Priors from Visual Foundation Model for Robust RGB-Only Humanoid Locomotion

GeoLoco:利用视觉基础模型中的三维几何先验实现稳健的仅RGB人形机器人运动
Liu, Yufei, Chen, Xieyuanli, Pan, Hainan, Shi, Chenghao, Chen, Yanjie, Huang, Kaihong, Zeng, Zhiwen, Lu, Huimin
Abstract
The prevailing paradigm of perceptive humanoid locomotion relies heavily on active depth sensors. However, this depth-centric approach fundamentally discards the rich semantic and dense appearance cues of the visual world, severing low-level control from the high-level reasoning essential for general embodied intelligence. While monocular RGB offers a ubiquitous, information-dense alternative, end-to-end reinforcement learning from raw 2D pixels suffers from extreme sample inefficiency and catastrophic sim-to-real collapse due to the inherent loss of geometric scale. To break this deadlock, we propose GeoLoco, a purely RGB-driven locomotion framework that conceptualizes monocular images as high-dimensional 3D latent representations by harnessing the powerful geometric priors of a frozen, scale-aware Visual Foundation Model (VFM). Rather than naive feature concatenation, we design a proprioceptive-query multi-head cross-attention mechanism that dynamically attends to task-critical topological features conditioned on the robot's real-time gait phase. Crucially, to prevent the policy from overfitting to superficial textures, we introduce a dual-head auxiliary learning scheme. This explicit regularization forces the high-dimensional latent space to strictly align with the physical terrain geometry, ensuring robust zero-shot sim-to-real transfer. Trained exclusively in simulation, GeoLoco achieves robust zero-shot transfer to the Unitree G1 humanoid and successfully negotiates challenging terrains.
Chinese Translation
现有的人形机器人感知式运动范式严重依赖于主动深度传感器。然而,这种以深度为中心的方法从根本上舍弃了视觉世界中丰富的语义和密集的外观线索,割裂了低层控制与通用具身智能所需的高层推理。尽管单目RGB提供了一种普遍可得、信息密集的替代方案,但从原始2D像素进行端到端强化学习会因几何尺度的固有损失而面临极低的样本效率和灾难性的模拟到现实崩溃。为了打破这一僵局,我们提出了GeoLoco,一个完全基于RGB的运动框架,通过利用冻结的、具有尺度感知能力的视觉基础模型(Visual Foundation Model, VFM)的强大几何先验,将单目图像概念化为高维三维潜在表示。我们没有采用简单的特征拼接,而是设计了一种本体感觉查询(proprioceptive-query)的多头交叉注意力机制,以机器人实时步态相位为条件,动态关注任务关键的拓扑特征。至关重要的是,为了防止策略对表面纹理过拟合,我们引入了一种双头辅助学习方案。这种显式正则化强制高维潜在空间严格与物理地形几何对齐,从而确保稳健的零样本模拟到现实迁移。GeoLoco仅在仿真中训练,实现了到Unitree G1人形机器人的稳健零样本迁移,并成功通过了具有挑战性的地形。
cs.RO / 65 / 2603.07629

Exoskeleton Control through Learning to Reduce Biological Joint Moments in Simulations

通过在仿真中学习减少生物关节力矩的外骨骼控制
You, Zihang, Zhou, Xianlian
Abstract
Data-driven joint-moment predictors offer a scalable alternative to laboratory-based inverse-dynamics pipelines for biomechanics estimation and exoskeleton control. Meanwhile, physics-based reinforcement learning (RL) enables simulation-trained controllers to learn dynamics-aware assistance strategies without extensive human experimentation. However, quantitative verification of simulation-trained exoskeleton torque predictors, and their impact on human joint power injection, remains limited. This paper presents (1) an RL framework to learn exoskeleton assistance policies that reduce biological joint moments, and (2) a validation pipeline that verifies the trained control networks using an open-source gait dataset through inference and comparison with biological joint moments. Simulation-trained multilayer perceptron (MLP) controllers are developed for level-ground and ramp walking, mapping short-horizon histories of bilateral hip and knee kinematics to normalized assistance torques. Results show that predicted assistance preserves task-intensity trends across speeds and inclines. Agreement is particularly strong at the hip, with cross-correlation coefficients reaching 0.94 at 1.8 m/s and 0.98 during 5° decline walking, demonstrating near-matched temporal structure. Discrepancies increase at higher speeds and steeper inclines, especially at the knee, and are more pronounced in joint power comparisons. Delay tuning biases assistance toward greater positive power injection; modest timing shifts increase positive power and improve agreement in specific gait intervals. Together, these results establish a quantitative validation framework for simulation-trained exoskeleton controllers, demonstrate strong sim-to-data consistency at the torque level, and highlight both the promise and the remaining challenges for sim-to-real transfer.
Chinese Translation
基于数据的关节力矩预测器为生物力学估计和外骨骼控制提供了一种可扩展的替代方案,取代基于实验室的逆动力学流程。同时,基于物理的强化学习(RL)使仿真训练的控制器能够在无需大量人体实验的情况下学习动力学感知的辅助策略。然而,对仿真训练的外骨骼扭矩预测器的定量验证,及其对人体关节功率注入的影响,仍然有限。本文提出了(1)一个用于学习减少生物关节力矩的外骨骼辅助策略的RL框架,以及(2)一个验证流程,利用开源步态数据集,通过推理并与生物关节力矩进行比较来验证训练好的控制网络。我们为平地和坡道行走开发了仿真训练的多层感知器(MLP)控制器,将双侧髋关节和膝关节运动学的短时历史映射到归一化的辅助扭矩。结果表明,预测的辅助在不同速度和坡度下保持了任务强度趋势。一致性在髋关节尤为突出:在1.8 m/s时互相关系数达到0.94,在5°下坡行走时达到0.98,显示出近乎匹配的时间结构。在更高的速度和更陡的坡度下,差异增大,尤其是在膝关节,并且在关节功率比较中更为明显。延迟调节使辅助偏向于更大的正功率注入;适度的时序偏移增加了正功率,并改善了特定步态区间内的一致性。总的来说,这些结果为仿真训练的外骨骼控制器建立了一个定量验证框架,展示了扭矩层面上很强的仿真-数据一致性,并突显了仿真到现实迁移的前景与尚存的挑战。
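For reference, the agreement metric reported here, a cross-correlation coefficient, reduces at zero lag to the Pearson correlation of the two torque profiles; a sketch with synthetic stand-in signals (not the paper's data).

```python
# Zero-lag cross-correlation coefficient between a reference and a
# predicted torque profile, illustrated with synthetic signals.
import numpy as np

t = np.linspace(0.0, 1.0, 200)          # one normalized gait cycle
reference = np.sin(2 * np.pi * t)       # stand-in biological moment
rng = np.random.default_rng(0)
predicted = (0.9 * np.sin(2 * np.pi * t + 0.05)
             + 0.02 * rng.normal(size=t.size))

r = np.corrcoef(reference, predicted)[0, 1]
print(f"zero-lag cross-correlation coefficient: {r:.3f}")
```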
cs.RO / 66 / 2603.07644

PanoDP: Learning Collision-Free Navigation with Panoramic Depth and Differentiable Physics

PanoDP:利用全景深度和可微物理学习无碰撞导航
Zhong, Hao, Chi, Pei, Zhao, Jiang, Yuan, Shenghai, Gao, Xuyang, Nguyen, Thien-Minh, Xie, Lihua
Abstract
Autonomous collision-free navigation in cluttered environments requires safe decision-making under partial observability with both static structure and dynamic obstacles. We present \textbf{PanoDP}, a communication-free learning framework that combines four-view panoramic depth perception with differentiable-physics-based training signals. PanoDP encodes panoramic depth using a lightweight CNN and optimizes policies with dense differentiable collision and motion-feasibility terms, improving training stability beyond sparse terminal collisions. We evaluate PanoDP on a controlled ring-to-center benchmark with systematic sweeps over agent count, obstacle density/layout, and dynamic behaviors, and further test out-of-distribution generalization in an external simulator (e.g., AirSim). Across settings, PanoDP increases collision-free and completion rates over single-view and non-physics-guided baselines under matched training budgets, and ablations (view masking, rotation augmentation) confirm the policy leverages 360-degree information. Code will be open source upon acceptance.
Chinese Translation
在杂乱环境中实现自主无碰撞导航,需要在部分可观测条件下进行安全决策,同时应对静态结构和动态障碍物。我们提出了PanoDP,这是一个无通信的学习框架,将四视图全景深度感知与基于可微物理的训练信号相结合。PanoDP通过轻量级卷积神经网络(CNN)编码全景深度,并利用稠密的可微碰撞项和运动可行性项优化策略,相比仅依赖稀疏的终端碰撞信号,提高了训练稳定性。我们在一个受控的环到中心基准测试中评估PanoDP,系统地扫过智能体数量、障碍物密度/布局和动态行为等设置,并进一步在外部模拟器(例如AirSim)中测试分布外泛化能力。在各种设置下,PanoDP在匹配的训练预算下,相比单视图和无物理指导的基线提高了无碰撞率和完成率;消融实验(视图遮蔽、旋转增强)证实了该策略确实利用了360度信息。代码将在论文录用后开源。
cs.RO / 67 / 2603.07647

TempoFit: Plug-and-Play Layer-Wise Temporal KV Memory for Long-Horizon Vision-Language-Action Manipulation

TempoFit:面向长时程视觉-语言-动作操作的即插即用分层时间键值记忆
Sun, Jun, Yang, Boyu, Zhang, Jiahao, Ma, Ning, Wu, Chencheng, Zhang, Siqing, Huang, Yiou, Wang, Qiufeng, Liang, Shan, Chen, Yaran
Abstract
Pretrained Vision-Language-Action (VLA) policies have achieved strong single-step manipulation, but their inference remains largely memoryless, which is brittle in non-Markovian long-horizon settings with occlusion, state aliasing, and subtle post-action changes. Prior approaches inject history either by stacking frames, which scales visual tokens and latency while adding near-duplicate pixels, or by learning additional temporal interfaces that require (re-)training and may break the original single-frame inference graph. We present TempoFit, a training-free temporal retrofit that upgrades frozen VLAs through state-level memory. Our key insight is that prefix attention K/V already form a model-native, content-addressable runtime state; reusing them across timesteps introduces history without new tokens or trainable modules. TempoFit stores layer-wise FIFO prefix K/V at selected intermediate layers, performs parameter-free K-to-K retrieval with Frame-Gap Temporal Bias (FGTB), a fixed recency bias inspired by positional biases in NLP, to keep decisions present-dominant, and injects the retrieved context via pre-attention residual loading with norm-preserving rescaling to avoid distribution shift under frozen weights. On LIBERO-LONG, TempoFit improves strong pretrained backbones by up to +4.0% average success rate while maintaining near-real-time latency, and it transfers consistently to CALVIN and real-robot long-horizon tasks.
Chinese Translation
预训练的视觉-语言-动作(VLA)策略在单步操作中已有很强的表现,但其推理在很大程度上是无记忆的,这在存在遮挡、状态混叠和微妙的动作后变化的非马尔可夫长时程设置中显得脆弱。以往的方法要么通过堆叠帧来注入历史(这会增加视觉标记数量和延迟,同时引入近乎重复的像素),要么学习额外的时间接口(这需要(重新)训练,并可能破坏原始的单帧推理图)。我们提出了TempoFit,一种免训练的时间维度改造,通过状态级记忆升级冻结的VLA。我们的关键见解是:前缀注意力的K/V本身已构成一种模型原生的、可按内容寻址的运行时状态;跨时间步重用它们即可引入历史,而无需新的标记或可训练模块。TempoFit在选定的中间层存储逐层FIFO前缀K/V,利用帧间隙时间偏置(Frame-Gap Temporal Bias, FGTB)执行无参数的K到K检索,其中FGTB是一种受NLP位置偏置启发的固定近期性偏置,用于使决策以当前观测为主;随后通过注意力前的残差加载注入检索到的上下文,并采用保持范数的重新缩放,以避免权重冻结下的分布偏移。在LIBERO-LONG上,TempoFit将强预训练骨干网络的平均成功率提高了多达4.0%,同时保持近实时的延迟,并且能够一致地迁移到CALVIN和真实机器人长时程任务。
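A sketch of parameter-free K-to-K retrieval with a fixed recency penalty in the spirit of FGTB; the dimensions, bias strength lam, and top-k rule below are assumptions, not the paper's settings.

```python
# Illustrative K-to-K memory retrieval with a frame-gap recency bias.
import numpy as np

def retrieve(K_now, K_mem, gaps, lam=0.1, topk=4):
    # Content similarity of the current key to each stored key,
    # penalized by how many frames ago each stored key was written.
    sims = K_mem @ K_now / np.sqrt(K_now.shape[0])   # content match
    scores = sims - lam * gaps                       # FGTB-style bias
    return np.argsort(scores)[::-1][:topk]           # best memory slots

rng = np.random.default_rng(0)
K_mem = rng.normal(size=(32, 64))    # FIFO memory of 32 stored keys
K_now = rng.normal(size=64)          # a current-frame key
gaps = np.arange(32, 0, -1)          # oldest entry has the largest gap
print(retrieve(K_now, K_mem, gaps))
```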
cs.RO / 68 / 2603.07648

AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots

AtomicVLA:释放机器人原子技能学习的潜力
Zhang, Likui, Tang, Tao, Zhan, Zhihao, Chen, Xiuwei, Chen, Zisheng, Han, Jianhua, Zhu, Jiangtong, Xu, Pei, Xu, Hang, Wu, Hefeng, Lin, Liang, Liang, Xiaodan
Abstract
Recent advances in Visual-Language-Action (VLA) models have shown promising potential for robotic manipulation tasks. However, real-world robotic tasks often involve long-horizon, multi-step problem-solving and require generalization for continual skill acquisition, extending beyond single actions or skills. These challenges present significant barriers for existing VLA models, which use monolithic action decoders trained on aggregated data, resulting in poor scalability. To address these challenges, we propose AtomicVLA, a unified planning-and-execution framework that jointly generates task-level plans, atomic skill abstractions, and fine-grained actions. AtomicVLA constructs a scalable atomic skill library through a Skill-Guided Mixture-of-Experts (SG-MoE), where each expert specializes in mastering generic yet precise atomic skills. Furthermore, we introduce a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning. We validate our approach through extensive experiments. In simulation, AtomicVLA outperforms $\pi_{0}$ by 2.4\% on LIBERO, 10\% on LIBERO-LONG, and outperforms $\pi_{0}$ and $\pi_{0.5}$ by 0.22 and 0.25 in average task length on CALVIN. Additionally, our AtomicVLA consistently surpasses baselines by 18.3\% and 21\% in real-world long-horizon tasks and continual learning. These results highlight the effectiveness of atomic skill abstraction and dynamic expert composition for long-horizon and lifelong robotic tasks. The project page is available at https://zhanglk9.github.io/atomicvla-web/.
Chinese Translation
近期在视觉-语言-动作(VLA)模型方面的进展显示出其在机器人操作任务中的良好潜力。然而,现实世界中的机器人任务通常涉及长时程的多步骤问题求解,并且需要在持续技能获取中进行泛化,超出单一动作或技能的范围。这些挑战对现有的VLA模型构成了显著障碍,因为这些模型使用在聚合数据上训练的单体动作解码器,导致可扩展性差。为了解决这些挑战,我们提出了AtomicVLA,一个统一的规划与执行框架,能够联合生成任务级计划、原子技能抽象和细粒度动作。AtomicVLA通过技能引导的专家混合(Skill-Guided Mixture-of-Experts, SG-MoE)构建一个可扩展的原子技能库,其中每个专家专注于掌握通用而精确的原子技能。此外,我们引入了一种灵活的路由编码器,能够自动将专门的原子专家分配给新技能,从而实现持续学习。我们通过大量实验验证了我们的方法。在仿真中,AtomicVLA在LIBERO上比$\pi_{0}$提高了2.4%,在LIBERO-LONG上提高了10%,并且在CALVIN上的平均任务长度分别比$\pi_{0}$和$\pi_{0.5}$高出0.22和0.25。此外,AtomicVLA在现实世界的长时程任务和持续学习中分别超越基线18.3%和21%。这些结果突显了原子技能抽象和动态专家组合在长时程与终身机器人任务中的有效性。项目页面:https://zhanglk9.github.io/atomicvla-web/。
cs.RO / 69 / 2603.07650

Multi-Agent Off-World Exploration for Sparse Evidence Discovery via Gaussian Belief Mapping and Dual-Domain Coverage

面向稀疏证据发现、基于高斯信念映射和双域覆盖的多智能体地外探索
Qiao, Zhuoran, Hu, Tianxin, Nguyen, Thien-Minh, Yuan, Shenghai
Abstract
Off-world multi-robot exploration is challenged by sparse targets, limited sensing, hazardous terrain, and restricted communication. Many scientifically valuable clues are visually ambiguous and often require close-range observations, making efficient and safe informative path planning essential. Existing methods often rely on predefined areas of interest (AOIs), which may be incomplete or biased, and typically handle terrain risk only through soft penalties, which are insufficient for avoiding non-recoverable regions. To address these issues, we propose a multi-agent informative path planning framework for sparse evidence discovery based on Gaussian belief mapping and dual-domain coverage. The method maintains Gaussian-process-based interest and risk beliefs and combines them with trajectory-intent representations to support coordinated sequential decision-making among multiple agents. It further prioritizes search inside the AOI while preserving limited exploration outside it, thereby improving robustness to AOI bias. In addition, the risk-aware design helps agents balance information gain and operational safety in hazardous environments. Experimental results in simulated lunar environments show that the proposed method consistently outperforms sampling-based and greedy baselines under different budgets and communication ranges. In particular, it achieves lower final uncertainty in risk-aware settings and remains robust under limited communication, demonstrating its effectiveness for cooperative off-world robotic exploration.
Chinese Translation
地外多机器人探索面临稀疏目标、有限感知、危险地形和受限通信等挑战。许多具有科学价值的线索在视觉上模糊不清,通常需要近距离观察,这使得高效且安全的信息路径规划变得至关重要。现有方法通常依赖于预定义的兴趣区域(AOI),这些区域可能不完整或存在偏差,且通常仅通过软惩罚来处理地形风险,不足以避开不可恢复的区域。为了解决这些问题,我们提出了一种基于高斯信念映射和双域覆盖、面向稀疏证据发现的多智能体信息路径规划框架。该方法维护基于高斯过程的兴趣信念和风险信念,并将其与轨迹意图表示相结合,以支持多个智能体之间的协调顺序决策。它进一步优先在AOI内部搜索,同时保留在其外部的有限探索,从而提高对AOI偏差的鲁棒性。此外,风险感知设计帮助智能体在危险环境中平衡信息获取与操作安全。模拟月球环境中的实验结果表明,所提出的方法在不同预算和通信范围下均优于基于采样和贪婪的基线。特别是在风险感知设置下,它实现了更低的最终不确定性,并在有限通信条件下保持鲁棒性,证明了其在协同地外机器人探索中的有效性。
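A minimal sketch of a Gaussian-process interest belief of the kind described above, using scikit-learn; the kernel choice, observations, and candidate waypoints are illustrative only, not the paper's setup.

```python
# Gaussian-process belief over a 2D site: posterior mean suggests where
# interest is high, posterior std suggests where the belief is uncertain.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

X_obs = np.array([[0.0, 0.0], [2.0, 1.0], [3.0, 3.0]])  # visited cells
y_obs = np.array([0.1, 0.8, 0.2])                       # measured interest

gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1e-3))
gp.fit(X_obs, y_obs)

# High mean = promising target; high std = uncertain, worth exploring.
X_cand = np.array([[1.0, 0.5], [4.0, 4.0]])
mean, std = gp.predict(X_cand, return_std=True)
print(mean, std)
```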
cs.RO / 70 / 2603.07663

DAISS: Phase-Aware Imitation Learning for Dual-Arm Robotic Ultrasound-Guided Interventions

DAISS:用于双臂机器人超声引导介入的阶段感知模仿学习
Li, Feng, Liu, Pei, Wang, Shiting, Wang, Ning, Jiang, Zhongliang, Navab, Nassir, Bi, Yuan
Abstract
Imitation learning has shown strong potential for automating complex robotic manipulation. In medical robotics, ultrasound-guided needle insertion demands precise bimanual coordination, as clinicians must simultaneously manipulate an ultrasound probe to maintain an optimal acoustic view while steering an interventional needle. Automating this asymmetric workflow -- and reliably transferring expert strategies to robots -- remains highly challenging. In this paper, we present the Dual-Arm Interventional Surgical System (DAISS), a teleoperated platform that collects high-fidelity dual-arm demonstrations and learns a phase-aware imitation policy for ultrasound-guided interventions. To avoid constraining the operator's natural behavior, DAISS uses a flexible NDI-based leader interface for teleoperating two coordinated follower arms. To support robust execution under real-time ultrasound feedback, we develop a lightweight, data-efficient imitation policy. Specifically, the policy incorporates a phase-aware architecture and a dynamic mask loss tailored to asymmetric bimanual control. Conditioned on a planned trajectory, the network fuses real-time ultrasound with external visual observations to generate smooth, coordinated dual-arm motions. Experimental results show that DAISS can learn personalized expert strategies from limited demonstrations. Overall, these findings highlight the promise of phase-aware imitation-learning-driven dual-arm robots for improving precision and reducing cognitive workload in image-guided interventions.
Chinese Translation
模仿学习在自动化复杂机器人操作方面展现出强大的潜力。在医疗机器人领域,超声引导的进针操作需要精确的双臂协同:临床医生必须一边操纵超声探头以保持最佳声学视野,一边引导介入穿刺针。自动化这一不对称的工作流程,并将专家策略可靠地迁移到机器人上,仍然极具挑战性。本文提出了双臂介入手术系统(DAISS),这是一个遥操作平台,能够采集高保真的双臂演示,并学习用于超声引导介入的阶段感知模仿策略。为了避免限制操作者的自然行为,DAISS使用基于NDI的灵活主端(leader)接口来遥操作两条协同的从动臂。为了支持在实时超声反馈下的稳健执行,我们开发了一种轻量级、数据高效的模仿策略。具体而言,该策略结合了阶段感知架构和针对不对称双臂控制定制的动态掩码损失。在给定规划轨迹的条件下,网络融合实时超声与外部视觉观测,以生成平滑、协调的双臂运动。实验结果表明,DAISS能够从有限的演示中学习个性化的专家策略。总体而言,这些发现突显了阶段感知模仿学习驱动的双臂机器人在提高图像引导介入的精度和降低认知负荷方面的潜力。
cs.RO / 71 / 2603.07672

Low-Cost Teleoperation Extension for Mobile Manipulators

低成本移动操控器的远程操作扩展
Belov, Danil, Erkhov, Artem, Savotin, Yaroslav, Podladchikova, Tatiana, Osinenko, Pavel
Abstract
Teleoperation of mobile bimanual manipulators requires simultaneous control of high-dimensional systems, often necessitating expensive specialized equipment. We present an open-source teleoperation framework that enables intuitive whole body control using readily available commodity hardware. Our system combines smartphone-based head tracking for camera control, leader arms for bilateral manipulation, and foot pedals for hands-free base navigation. Using a standard smartphone with IMU and display, we eliminate the need for costly VR helmets while maintaining immersive visual feedback. The modular architecture integrates seamlessly with the XLeRobot framework, but can be easily adapted to other types of mobile manipulators. We validate our approach through user studies that demonstrate improved task performance and reduced cognitive load compared to keyboard-based control.
Chinese Translation
移动双臂操控器的遥操作需要对高维系统进行同步控制,通常需要昂贵的专用设备。我们提出了一种开源遥操作框架,能够利用现成的商品硬件实现直观的全身控制。我们的系统结合了基于智能手机的头部追踪(用于摄像头控制)、主臂(用于双臂操作)以及脚踏板(用于免手的底盘导航)。通过使用配备IMU和显示屏的标准智能手机,我们消除了对昂贵VR头盔的需求,同时保持了沉浸式视觉反馈。该模块化架构与XLeRobot框架无缝集成,也可以轻松适配其他类型的移动操控器。我们通过用户研究验证了我们的方法:与基于键盘的控制相比,任务表现有所提升且认知负荷降低。
cs.RO / 72 / 2603.07686

UniUncer: Unified Dynamic Static Uncertainty for End to End Driving

UniUncer:端到端驾驶的统一动态静态不确定性
Gao, Yu, Wang, Jijun, Zhang, Zongzheng, Jiang, Anqing, Wang, Yiru, Heng, Yuwen, Wang, Shuo, Sun, Hao, Hu, Zhangfeng, Zhao, Hao
Abstract
End-to-end (E2E) driving has become a cornerstone of both industry deployment and academic research, offering a single learnable pipeline that maps multi-sensor inputs to actions while avoiding hand-engineered modules. However, the reliability of such pipelines strongly depends on how well they handle uncertainty: sensors are noisy, semantics can be ambiguous, and interaction with other road users is inherently stochastic. Uncertainty also appears in multiple forms: classification vs. localization, and, crucially, in both static map elements and dynamic agents. Existing E2E approaches model only static-map uncertainty, leaving planning vulnerable to overconfident and unreliable inputs. We present UniUncer, the first lightweight, unified uncertainty framework that jointly estimates and uses uncertainty for both static and dynamic scene elements inside an E2E planner. Concretely: (1) we convert deterministic heads to probabilistic Laplace regressors that output per-vertex location and scale for vectorized static and dynamic entities; (2) we introduce an uncertainty-fusion module that encodes these parameters and injects them into object/map queries to form uncertainty-aware queries; and (3) we design an uncertainty-aware gate that adaptively modulates reliance on historical inputs (ego status or temporal perception queries) based on current uncertainty levels. The design adds minimal overhead and drops throughput by only $\sim$0.5 FPS while remaining plug-and-play for common E2E backbones. On nuScenes (open-loop), UniUncer reduces average L2 trajectory error by 7\%. On NavsimV2 (pseudo closed-loop), it improves overall EPDMS by 10.8\% and yields notable stage-two gains in challenging, interaction-heavy scenes. Ablations confirm that dynamic-agent uncertainty and the uncertainty-aware gate are both necessary.
Chinese Translation
端到端(E2E)驾驶已成为工业部署和学术研究的基石,它提供了一条将多传感器输入映射到动作的可学习单一管线,同时避免了手工设计的模块。然而,这类管线的可靠性在很大程度上取决于其处理不确定性的能力:传感器有噪声,语义可能含糊,与其他道路使用者的交互本质上是随机的。不确定性还以多种形式出现:分类与定位的不确定性,以及(至关重要地)静态地图元素与动态交通参与者两方面的不确定性。现有的E2E方法仅对静态地图不确定性建模,使规划容易受到过度自信且不可靠的输入的影响。我们提出了UniUncer,这是第一个轻量级的统一不确定性框架,能够在E2E规划器内部联合估计并利用静态与动态场景元素的不确定性。具体而言:(1)我们将确定性回归头转换为概率性拉普拉斯回归器,为矢量化的静态和动态实体输出逐顶点的位置和尺度;(2)我们引入了一个不确定性融合模块,对这些参数进行编码并注入物体/地图查询中,形成不确定性感知查询;(3)我们设计了一个不确定性感知门,根据当前不确定性水平自适应地调节对历史输入(自车状态或时序感知查询)的依赖程度。该设计带来的开销极小,吞吐量仅下降约0.5 FPS,同时对常见的E2E骨干网络保持即插即用。在nuScenes(开环)上,UniUncer将平均L2轨迹误差降低了7%。在NavsimV2(伪闭环)上,它将整体EPDMS提高了10.8%,并在具有挑战性、交互密集的场景中带来了显著的第二阶段增益。消融实验证实,动态交通参与者不确定性与不确定性感知门都是必要的。
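The probabilistic-head idea can be shown in a few lines: predict a location mu and a scale b per coordinate and train with the Laplace negative log-likelihood, so b becomes a learned uncertainty. Below is a pure-numpy sketch of the loss, detached from any network; the values are illustrative, not the paper's.

```python
# Laplace negative log-likelihood for a regression head that predicts
# a location mu and a (log-)scale per target coordinate.
import numpy as np

def laplace_nll(y, mu, log_b):
    # NLL of Laplace(mu, b): |y - mu| / b + log(2 b), with b = exp(log_b).
    b = np.exp(log_b)
    return np.mean(np.abs(y - mu) / b + np.log(2.0 * b))

y = np.array([1.0, 2.0, 3.0])                # ground-truth vertex coords
mu = np.array([1.1, 1.8, 3.2])               # predicted locations
log_b = np.log(np.array([0.2, 0.2, 0.5]))    # predicted scales
print(laplace_nll(y, mu, log_b))
```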
cs.RO / 73 / 2603.07691

RoboPCA: Pose-centered Affordance Learning from Human Demonstrations for Robot Manipulation

RoboPCA:面向机器人操作、从人类示范中学习以姿态为中心的可供性
Xiao, Zhanqi, Wang, Ruiping, Chen, Xilin
Abstract
Understanding spatial affordances -- comprising the contact regions of object interaction and the corresponding contact poses -- is essential for robots to effectively manipulate objects and accomplish diverse tasks. However, existing spatial affordance prediction methods mainly focus on locating the contact regions while delegating the pose to independent pose estimation approaches, which can lead to task failures due to inconsistencies between predicted contact regions and candidate poses. In this work, we propose RoboPCA, a pose-centered affordance prediction framework that jointly predicts task-appropriate contact regions and poses conditioned on instructions. To enable scalable data collection for pose-centered affordance learning, we devise Human2Afford, a data curation pipeline that automatically recovers scene-level 3D information and infers pose-centered affordance annotations from human demonstrations. With Human2Afford, scene depth and the interaction object's mask are extracted to provide 3D context and object localization, while pose-centered affordance annotations are obtained by tracking object points within the contact region and analyzing hand-object interaction patterns to establish a mapping from the 3D hand mesh to the robot end-effector orientation. By integrating geometry-appearance cues through an RGB-D encoder and incorporating mask-enhanced features to emphasize task-relevant object regions into the diffusion-based framework, RoboPCA outperforms baseline methods on image datasets, simulation, and real robots, and exhibits strong generalization across tasks and categories.
Chinese Translation
理解空间可供性(affordance)——包括物体交互的接触区域及相应的接触姿态——对于机器人有效操作物体和完成多样化任务至关重要。然而,现有的空间可供性预测方法主要集中于定位接触区域,而将姿态交由独立的姿态估计方法处理,这可能因预测的接触区域与候选姿态之间的不一致而导致任务失败。在本研究中,我们提出了RoboPCA,一个以姿态为中心的可供性预测框架,能够在指令条件下联合预测适合任务的接触区域和姿态。为了实现面向以姿态为中心的可供性学习的可扩展数据收集,我们设计了Human2Afford,一个数据整理管线,能够自动恢复场景级3D信息,并从人类示范中推断以姿态为中心的可供性标注。通过Human2Afford,提取场景深度和交互物体的掩模以提供3D上下文和物体定位;同时,通过跟踪接触区域内的物体点并分析手-物体交互模式,建立从3D手部网格到机器人末端执行器朝向的映射,从而获得以姿态为中心的可供性标注。通过RGB-D编码器整合几何-外观线索,并在基于扩散的框架中引入掩模增强特征以强调与任务相关的物体区域,RoboPCA在图像数据集、仿真和真实机器人上均超越了基线方法,并在不同任务和类别之间表现出强大的泛化能力。
cs.RO / 74 / 2603.07699

C$^2$-Explorer: Contiguity-Driven Task Allocation with Connectivity-Aware Task Representation for Decentralized Multi-UAV Exploration

C$^2$-Explorer:面向去中心化多无人机探索的邻接性驱动任务分配与连通性感知任务表示
Yan, Xinlu, Zhang, Mingjie, Fang, Yuhao, Sun, Yanke, Ma, Jun, Gong, Youmin, Zhou, Boyu, Mei, Jie
Abstract
Efficient multi-UAV exploration under limited communication is severely bottlenecked by inadequate task representation and allocation. Previous task representations either impose heavy communication requirements for coordination or lack the flexibility to handle complex environments, often leading to inefficient traversal. Furthermore, short-horizon allocation strategies neglect spatiotemporal contiguity, causing non-contiguous assignments and frequent cross-region detours. To address this, we propose C$^2$-Explorer, a decentralized framework that constructs a connectivity graph to decompose disconnected unknown components into independent task units. We then introduce a contiguity-driven allocation formulation with a graph-based neighborhood penalty to discourage non-adjacent assignments, promoting more contiguous task sequences over time. Extensive simulation experiments show that C$^2$-Explorer consistently outperforms state-of-the-art (SOTA) baselines, reducing average exploration time by 43.1\% and path length by 33.3\%. Real-world flights further demonstrate the system's feasibility. The code will be released at https://github.com/Robotics-STAR-Lab/C2-Explorer
Chinese Translation
在有限通信条件下,高效的多无人机探索受到任务表示与任务分配不足的严重制约。以往的任务表示要么给协调带来繁重的通信需求,要么缺乏应对复杂环境的灵活性,常常导致低效的遍历。此外,短视距的分配策略忽视了时空邻接性,导致非连续的任务分配和频繁的跨区域绕行。为了解决这一问题,我们提出了C$^2$-Explorer,一个去中心化框架,通过构建连通图将不连通的未知区域分解为独立的任务单元。随后,我们引入了一种邻接性驱动的分配形式化,并采用基于图的邻域惩罚来抑制非相邻的分配,从而随时间推移促成更连续的任务序列。大量仿真实验表明,C$^2$-Explorer始终优于最先进(SOTA)的基线,平均探索时间减少了43.1%,路径长度减少了33.3%。真实飞行实验进一步证明了该系统的可行性。代码将发布在 https://github.com/Robotics-STAR-Lab/C2-Explorer
cs.RO / 75 / 2603.07744

AeroPlace-Flow: Language-Grounded Object Placement for Aerial Manipulators via Visual Foresight and Object Flow

AeroPlace-Flow:通过视觉前瞻和物体流实现语言驱动的空中操纵器物体放置
Mishra, Sarthak, Yadav, Rishabh Dev, Nair, Naveen, Pan, Wei, Roy, Spandan
Abstract
Precise object placement remains underexplored in aerial manipulation, where most systems rely on predefined target coordinates and focus primarily on grasping and control. Specifying exact placement poses, however, is cumbersome in real-world settings, where users naturally communicate goals through language. In this work, we present AeroPlace-Flow, a training-free framework for language-grounded aerial object placement that unifies visual foresight with explicit 3D geometric reasoning and object flow. Given RGB-D observations of the object and the placement scene, along with a natural language instruction, AeroPlace-Flow first synthesizes a task-complete goal image using image editing models. The imagined configuration is then grounded into metric 3D space through depth alignment and object-centric reasoning, enabling the inference of a collision-aware object flow that transports the grasped object to a language and contact-consistent placement configuration. The resulting motion is executed via standard trajectory tracking for an aerial manipulator. AeroPlace-Flow produces executable placement targets without requiring predefined poses or task-specific training. We validate our approach through extensive simulation and real-world experiments, demonstrating reliable language-conditioned placement across diverse aerial scenarios with an average success rate of 75% on hardware.
Chinese Translation
精确的物体放置在空中操纵领域仍然未得到充分探索,大多数系统依赖于预定义的目标坐标,主要关注抓取和控制。然而,在现实环境中,指定确切的放置姿态是繁琐的,用户自然通过语言来传达目标。在本研究中,我们提出了AeroPlace-Flow,这是一个无须训练的框架,用于语言驱动的空中物体放置,结合了视觉前瞻、明确的3D几何推理和物体流。给定物体和放置场景的RGB-D观测数据,以及自然语言指令,AeroPlace-Flow首先使用图像编辑模型合成一个任务完成的目标图像。然后,通过深度对齐和以物体为中心的推理,将想象的配置嵌入到度量3D空间中,从而推断出一个考虑碰撞的物体流,将抓取的物体运输到与语言和接触一致的放置配置。最终的运动通过标准轨迹跟踪在空中操纵器上执行。AeroPlace-Flow生成可执行的放置目标,而无需预定义姿态或特定任务的训练。我们通过广泛的仿真和现实世界实验验证了我们的方法,展示了在多种空中场景中可靠的语言条件放置,硬件上的平均成功率达到75%。
cs.RO / 76 / 2603.07775

Residual Control for Fast Recovery from Dynamics Shifts

用于从动力学变化中快速恢复的残差控制
Jayasinghe, Nethmi, Gontero, Diana, Migliarba, Francesco, Brown, Spencer T., Sangwan, Vinod K., Hersam, Mark C., Trivedi, Amit Ranjan
Abstract
Robotic systems operating in real-world environments inevitably encounter unobserved dynamics shifts during continuous execution, including changes in actuation, mass distribution, or contact conditions. When such shifts occur mid-episode, even locally stabilizing learned policies can experience substantial transient performance degradation. While input-to-state stability guarantees bounded state deviation, it does not ensure rapid restoration of task-level performance. We address inference-time recovery under frozen policy parameters by casting adaptation as constrained disturbance shaping around a nominal stabilizing controller. We propose a stability-aligned residual control architecture in which a reinforcement learning policy trained under nominal dynamics remains fixed at deployment, and adaptation occurs exclusively through a bounded additive residual channel. A Stability Alignment Gate (SAG) regulates corrective authority through magnitude constraints, directional coherence with the nominal action, performance-conditioned activation, and adaptive gain modulation. These mechanisms preserve the nominal closed-loop structure while enabling rapid compensation for unobserved dynamics shifts without retraining or privileged disturbance information. Across mid-episode perturbations including actuator degradation, mass variation, and contact changes, the proposed method consistently reduces recovery time relative to frozen and online-adaptation baselines while maintaining near-nominal steady-state performance. Recovery time is reduced by \textbf{87\%} on the Go1 quadruped, \textbf{48\%} on the Cassie biped, \textbf{30\%} on the H1 humanoid, and \textbf{20\%} on the Scout wheeled platform on average across evaluated conditions relative to a frozen SAC policy.
Chinese Translation
在真实环境中运行的机器人系统,在持续执行过程中不可避免地会遇到未被观测到的动力学变化,包括驱动、质量分布或接触条件的变化。当此类变化在回合中途发生时,即使是局部稳定的学习策略也可能经历显著的瞬态性能下降。虽然输入-状态稳定性(ISS)保证了状态偏差有界,但并不能确保任务级性能的快速恢复。我们将适应问题表述为围绕名义稳定控制器的受约束扰动整形,从而在冻结策略参数的条件下解决推理时的恢复问题。我们提出了一种与稳定性对齐的残差控制架构:在名义动力学下训练的强化学习策略在部署时保持固定,而适应仅通过一条有界的加性残差通道进行。稳定性对齐门(Stability Alignment Gate, SAG)通过幅值约束、与名义动作的方向一致性、以性能为条件的激活以及自适应增益调制来调节修正权限。这些机制在保持名义闭环结构的同时,无需重新训练或特权扰动信息即可快速补偿未被观测到的动力学变化。在包括执行器退化、质量变化和接触变化在内的回合中途扰动下,所提方法相对于冻结基线和在线适应基线始终缩短了恢复时间,同时保持接近名义水平的稳态性能。在所评估的条件下,相对于冻结的SAC策略,恢复时间在Go1四足机器人上平均减少了87%,在Cassie双足机器人上减少了48%,在H1人形机器人上减少了30%,在Scout轮式平台上减少了20%。
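A sketch of a stability-aligned residual gate combining the mechanisms the abstract lists (magnitude constraint, directional coherence, performance-conditioned activation with adaptive gain); the thresholds and gains are assumed values, not the paper's.

```python
# Illustrative SAG-style gate on a bounded additive residual channel.
import numpy as np

def gated_action(u_nom, u_res, track_err, res_max=0.2, err_on=0.05):
    # 1) magnitude constraint on the residual channel
    u_res = np.clip(u_res, -res_max, res_max)
    # 2) directional coherence: drop residuals that oppose u_nom
    if np.dot(u_res, u_nom) < 0.0:
        u_res = np.zeros_like(u_res)
    # 3) performance-conditioned activation with adaptive gain:
    #    residual authority grows with tracking error, capped at 1
    gain = np.clip(track_err / err_on, 0.0, 1.0)
    return u_nom + gain * u_res

u_nom = np.array([0.5, -0.2])
u_res = np.array([0.3, 0.1])
print(gated_action(u_nom, u_res, track_err=0.08))
```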
cs.RO / 77 / 2603.07795

A Robust Antenna Provides Tactile Feedback in a Multi-legged Robot

一种为多足机器人提供触觉反馈的稳健触角
Xu, Zhaochen J., He, Juntao, Aydan, Delfin, Taylor, Malaika, Wang, Tianyu, Lin, Jianfeng, Dyer, Wesley, Goldman, Daniel I.
Abstract
Multi-legged elongate robots hold promise for maneuvering through complex environments. Prior work has demonstrated that reliable locomotion can be achieved using open-loop body undulation and foot placement on rugose terrain. However, robust navigation through confined spaces remains challenging when body-environment contact is extensive and terrain rheology varies rapidly. To address this challenge, we develop a pair of tactile antennae for multi-legged robots that enable real-time sensing of surrounding geometry, modeling the morphology and function of biological centipede antennae. Each antenna features gradient compliance, with a stiff base and soft tip, allowing repeated deformation and elastic recovery. Robophysical experiments reveal a relationship between continuous antenna curvature and contact force, leading to a simplified mapping from antenna deformation to inferred discrete collision states. We incorporate this mapping into a controller that selects among a set of locomotor maneuvers based on the inferred collision state. Experiments in obstacle-rich and confined environments demonstrate that tactile feedback enables reliable steering and allows the robot to recover from near-stuck conditions without requiring global environmental information or real-time vision. These results highlight how mechanically tuned tactile appendages can simplify sensing and enhance autonomy in elongate multi-legged robots operating in constrained spaces.
Chinese Translation
细长多足机器人在复杂环境中的机动性具有良好前景。先前的研究表明,在粗糙地形上,利用开环身体波动和足部落点即可实现可靠的运动。然而,当身体与环境接触广泛且地形流变特性快速变化时,在狭小空间中的稳健导航仍然具有挑战性。为了解决这一挑战,我们为多足机器人开发了一对触觉触角,使其能够实时感知周围几何形状,其形态与功能模拟了生物蜈蚣的触角。每根触角具有梯度柔顺性:基部坚硬而尖端柔软,可以反复变形并弹性恢复。机器人物理(robophysical)实验揭示了连续的触角曲率与接触力之间的关系,从而得到了从触角变形到推断的离散碰撞状态的简化映射。我们将这一映射纳入控制器,根据推断的碰撞状态在一组运动机动动作中进行选择。在障碍物密集和狭小环境中的实验表明,触觉反馈能够实现可靠的转向,并使机器人在不依赖全局环境信息或实时视觉的情况下从接近卡死的状态中恢复。这些结果突显了经过机械调校的触觉附肢如何简化感知,并增强在受限空间中作业的细长多足机器人的自主性。
cs.RO / 78 / 2603.07796

Inverse Resistive Force Theory (I-RFT): Learning granular properties through robot-terrain physical interactions

逆向阻力理论 (I-RFT):通过机器人与地形的物理交互学习颗粒特性
Liu, Shipeng, Xue, Feng, Zhang, Yifeng, Ponnusamy, Tarunika, Qian, Feifei
Abstract
For robots to navigate safely and efficiently on soft, granular terrains, it is crucial to gather information about the terrain's mechanical properties, which directly affect locomotion performance. Recent research has developed robotic legs that can accurately sense ground reaction forces during locomotion. However, existing tests of granular property estimation often rely on specific foot trajectories, such as vertical penetration or horizontal shear, limiting their applicability during natural locomotion. To address this limitation, we introduce a physics-informed machine learning framework, Inverse Resistive Force Theory (I-RFT), which integrates the Granular Resistive Force Theory model with Gaussian Processes to infer terrain properties from proprioceptively measured contact forces under arbitrary gait trajectories. By embedding the granular force model within the learning process, I-RFT preserves physical consistency while enabling generalization across diverse motion primitives. Experimental results demonstrate that I-RFT accurately estimates terrain properties across multiple gait trajectories and toe shapes. Moreover, we show that the quantified uncertainty over the terrain resistance stress map could enable robots to optimize foot design and gait trajectories for efficient information gathering. This approach establishes a new foundation for data-efficient characterization of complex granular environments and opens new avenues for locomotion strategies that actively adapt gait for autonomous terrain exploration.
Chinese Translation
为了使机器人能够在软质颗粒地形上安全高效地导航,收集地形机械特性的信息至关重要,这些特性直接影响运动性能。近期研究开发了能够准确感知运动过程中地面反作用力的机器人腿。然而,现有的颗粒特性估计测试往往依赖于特定的足部轨迹,如垂直穿透或水平剪切,这限制了其在自然运动中的适用性。为了解决这一限制,我们提出了一种基于物理的机器学习框架——逆向阻力理论 (I-RFT),该框架将颗粒阻力理论模型与高斯过程相结合,从而在任意步态轨迹下通过本体感知测量的接触力推断地形特性。通过将颗粒力模型嵌入学习过程中,I-RFT 保持了物理一致性,同时使得在多样的运动原语间的泛化成为可能。实验结果表明,I-RFT 能够在多种步态轨迹和脚趾形状下准确估计地形特性。此外,我们还展示了对地形阻力应力图的量化不确定性可以使机器人优化足部设计和步态轨迹,以高效收集信息。这一方法为复杂颗粒环境的数据高效表征奠定了新的基础,并为主动适应步态以实现自主地形探索的运动策略开辟了新的途径。
cs.RO / 79 / 2603.07797

Toward Global Intent Inference for Human Motion by Inverse Reinforcement Learning

通过逆强化学习实现人类运动的全局意图推断
Mehrdad, Sarmad, Sabbah, Maxime, Bonnet, Vincent, Righetti, Ludovic
Abstract
This paper investigates whether a single, unified cost function can explain and predict human reaching movements, in contrast with existing approaches that rely on subject- or posture-specific optimization criteria. Using the Minimal Observation Inverse Reinforcement Learning (MO-IRL) algorithm, together with a seven-dimensional set of candidate cost terms, we efficiently estimate time-varying cost weights for a standard planar reaching task. MO-IRL provides orders-of-magnitude faster convergence than bilevel formulations, while using only a fraction of the available data, enabling the practical exploration of time-varying cost structures. Three levels of generality are evaluated: Subject-Dependent Posture-Dependent, Subject-Dependent Posture-Independent, and Subject-Independent Posture-Independent. Across all cases, time-varying weights substantially improve trajectory reconstruction, yielding an average 27% reduction in RMSE compared to the baseline. The inferred costs consistently highlight a dominant role for joint-acceleration regulation, complemented by smaller contributions from torque-change smoothness. Overall, a single subject- and posture-agnostic time-varying cost function is shown to predict human reaching trajectories with high accuracy, supporting the existence of a unified optimality principle governing this class of movements.
Chinese Translation
本文探讨了一个单一、统一的成本函数能否解释并预测人类的伸取(reaching)运动,这与现有依赖特定个体或特定姿势优化准则的方法形成对比。我们使用最小观测逆强化学习(Minimal Observation Inverse Reinforcement Learning, MO-IRL)算法,结合一组七维候选成本项,高效地估计标准平面伸取任务中随时间变化的成本权重。MO-IRL的收敛速度比双层优化(bilevel)形式快几个数量级,同时只使用可用数据的一小部分,使得对时变成本结构的实际探索成为可能。我们评估了三个泛化层次:个体相关-姿势相关、个体相关-姿势无关,以及个体无关-姿势无关。在所有情况下,时变权重都显著改善了轨迹重建,与基线相比均方根误差(RMSE)平均降低了27%。推断出的成本一致地表明关节加速度调节起主导作用,并辅以力矩变化平滑性的较小贡献。总体而言,单一的、与个体和姿势无关的时变成本函数被证明能够高精度预测人类伸取轨迹,支持存在一个支配这类运动的统一最优性原则。
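One compact way to write the time-varying objective described above (our notation, not necessarily the paper's): a trajectory cost linear in the seven candidate features, with a weight schedule over the movement.

```latex
% Time-varying trajectory cost over candidate features (notation assumed):
\[
  J(\xi) \;=\; \sum_{t=0}^{T} \mathbf{w}_t^{\top}\,
  \boldsymbol{\phi}(\mathbf{x}_t, \mathbf{u}_t),
  \qquad \boldsymbol{\phi}(\mathbf{x}_t, \mathbf{u}_t) \in \mathbb{R}^{7}.
\]
% MO-IRL estimates the weight schedule {w_t} from a few observed reaching
% trajectories; a subject- and posture-agnostic schedule corresponds to
% the unified optimality principle the paper argues for.
```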
cs.RO / 80 / 2603.07800

Preference-Conditioned Reinforcement Learning for Space-Time Efficient Online 3D Bin Packing

基于偏好条件的强化学习实现时空高效的在线三维装箱
Sarawgi, Nikita, Manyar, Omey M., Wang, Fan, Nguyen, Thinh H., Seita, Daniel, Gupta, Satyandra K.
Abstract
Robotic bin packing is widely deployed in warehouse automation, with current systems achieving robust performance through heuristic and learning-based strategies. These systems must balance compact placement with rapid execution, where selecting alternative items or reorienting them can improve space utilization but introduce additional time. We propose a selection-based formulation that explicitly reasons over this trade-off: at each step, the robot evaluates multiple candidate actions, weighing expected packing benefit against estimated operational time. This enables time-aware strategies that selectively accept increased operational time when it yields meaningful spatial improvements. Our method, STEP (Space-Time Efficient Packing), uses a preference-conditioned, Transformer-based reinforcement learning policy, and allows generalization across candidate set sizes and integration with standard placement modules. It achieves a 44% reduction in operational time without compromising packing density. Additional material is available at https://step-packing.github.io.
Chinese Translation
机器人装箱在仓库自动化中得到广泛应用,现有系统通过启发式和基于学习的策略实现了稳健的性能。这些系统必须在紧凑放置与快速执行之间取得平衡:选择替代物品或改变物品朝向可以提高空间利用率,但会带来额外的时间开销。我们提出了一种基于选择的问题形式化,显式地对这一权衡进行推理:在每一步,机器人评估多个候选动作,权衡预期的装箱收益与估计的操作时间。这使得时间感知策略能够在可以换来显著空间改进时,有选择地接受更长的操作时间。我们的方法STEP(Space-Time Efficient Packing)使用一种以偏好为条件、基于Transformer的强化学习策略,能够在不同候选集大小之间泛化,并可与标准放置模块集成。该方法在不损失装箱密度的情况下将操作时间减少了44%。更多材料可在 https://step-packing.github.io 获取。
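The core trade-off is easy to state in code: score each candidate by a preference-weighted blend of packing gain and operation time. The candidates, normalized scales, and weight w below are illustrative stand-ins, not the learned policy.

```python
# Preference-conditioned candidate selection (illustrative, not STEP).
def best_candidate(candidates, w):
    # candidates: (label, packing gain, time cost), both assumed
    # pre-normalized to [0, 1] so the blend is meaningful.
    def score(c):
        _, gain, time_cost = c
        return w * gain - (1.0 - w) * time_cost
    return max(candidates, key=score)

candidates = [
    ("place as-is",           0.30, 0.30),  # fast, mediocre fit
    ("reorient, then place",  0.60, 0.60),  # slower, tighter fit
    ("pick a different item", 0.75, 0.80),  # slowest, best fit
]
print(best_candidate(candidates, w=0.9))   # space-hungry preference
print(best_candidate(candidates, w=0.2))   # time-hungry preference
```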
cs.RO / 81 / 2603.07822

Uncertainty Mitigation and Intent Inference: A Dual-Mode Human-Machine Joint Planning System

不确定性缓解与意图推断:一种双模式人机联合规划系统
Fang, Zeyu, Lin, Yuxin, Liu, Cheng, Yu, Beomyeol, Yang, Zeyuan, Chen, Rongqian, Lee, Taeyoung, Imani, Mahdi, Lan, Tian
Abstract
Effective human-robot collaboration in open-world environments requires joint planning under uncertain conditions. However, existing approaches often treat humans as passive supervisors, preventing autonomous agents from becoming human-like teammates that can actively model teammate behaviors, reason about knowledge gaps, query, and elicit responses through communication to resolve uncertainties. To address these limitations, we propose a unified human-robot joint planning system designed to tackle dual sources of uncertainty: task-relevant knowledge gaps and latent human intent. Our system operates in two complementary modes. First, an uncertainty-mitigation joint planning module enables two-way conversations to resolve semantic ambiguity and object uncertainty. It utilizes an LLM-assisted active elicitation mechanism and a hypothesis-augmented A^* search, subsequently computing an optimal querying policy via dynamic programming to minimize interaction and verification costs. Second, a real-time intent-aware collaboration module maintains a probabilistic belief over the human's latent task intent via spatial and directional cues, enabling dynamic, coordination-aware task selection for agents without explicit communication. We validate the proposed system in both Gazebo simulations and real-world UAV deployments integrated with a Vision-Language Model (VLM)-based 3D semantic perception pipeline. Experimental results demonstrate that the system significantly cuts the interaction cost by 51.9% in uncertainty-mitigation planning and reduces the task execution time by 25.4% in intent-aware cooperation compared to the baselines.
Chinese Translation
在开放世界环境中,有效的人机协作需要在不确定条件下进行联合规划。然而,现有的方法往往将人类视为被动的监督者,这阻碍了自主代理成为能够主动建模队友行为、推理知识差距、查询并通过沟通引导响应以解决不确定性的类人队友。为了解决这些局限性,我们提出了一种统一的人机联合规划系统,旨在应对两种不确定性来源:与任务相关的知识差距和潜在的人类意图。我们的系统以两种互补模式运行。首先,不确定性缓解联合规划模块支持双向对话,以解决语义模糊和对象不确定性。该模块利用基于大语言模型(LLM)的主动引导机制和假设增强的A^*搜索,随后通过动态规划计算最优查询策略,以最小化交互和验证成本。其次,实时意图感知协作模块通过空间和方向线索维持对人类潜在任务意图的概率信念,使代理能够在没有明确沟通的情况下进行动态、协调感知的任务选择。我们在Gazebo仿真和与基于视觉-语言模型(VLM)的3D语义感知管道集成的真实无人机部署中验证了所提系统。实验结果表明,该系统在不确定性缓解规划中显著降低了51.9%的交互成本,并在意图感知合作中相比基线减少了25.4%的任务执行时间。
cs.RO / 82 / 2603.07824

Reasoning Knowledge-Gap in Drone Planning via LLM-based Active Elicitation

通过基于大型语言模型的主动引导推理无人机规划中的知识差距
Fang, Zeyu, Yu, Beomyeol, Liu, Cheng, Yang, Zeyuan, Chen, Rongqian, Lin, Yuxin, Imani, Mahdi, Lan, Tian
Abstract
Human-AI joint planning in Unmanned Aerial Vehicles (UAVs) typically relies on control handover when facing environmental uncertainties, which is often inefficient and cognitively demanding for non-expert operators. To address this, we propose a novel framework that shifts the collaboration paradigm from control takeover to active information elicitation. We introduce the Minimal Information Neuro-Symbolic Tree (MINT), a reasoning mechanism that explicitly structures knowledge gaps regarding obstacles and goals into a queryable format. By leveraging large language models, our system formulates optimal binary queries to resolve specific ambiguities with minimal human interaction. We demonstrate the efficacy of this approach through a comprehensive workflow integrating a vision-language model for perception, voice interfaces, and a low-level UAV control module in both high-fidelity NVIDIA Isaac simulations and real-world deployments. Experimental results show that our method achieves a significant improvement in the success rate for complex search-and-rescue tasks while significantly reducing the frequency of human interaction compared to exhaustive querying baselines.
Chinese Translation
无人机(UAV)中的人机联合规划通常在面对环境不确定性时依赖于控制交接,这对于非专业操作员而言往往效率低下且认知负担较重。为了解决这一问题,我们提出了一种新颖的框架,将协作范式从控制接管转变为主动信息引导。我们引入了最小信息神经符号树(Minimal Information Neuro-Symbolic Tree,MINT),这是一种推理机制,能够将关于障碍物和目标的知识差距明确结构化为可查询的格式。通过利用大型语言模型,我们的系统制定了最佳的二元查询,以在最小的人机交互下解决特定的模糊性。我们通过综合工作流程展示了该方法的有效性,该流程集成了用于感知的视觉语言模型、语音接口以及低级无人机控制模块,在高保真NVIDIA Isaac模拟和现实世界部署中均进行了测试。实验结果表明,与全面查询基线相比,我们的方法在复杂的搜索与救援任务中显著提高了成功率,同时显著减少了人机交互的频率。
cs.RO / 83 / 2603.07826

Physics-infused Learning for Aerial Manipulator in Winds and Near-Wall Environments

面向风场与近壁环境中空中操控器的物理融合学习
Zhang, Yiming, Geng, Junyi
Abstract
Aerial manipulation (AM) expands UAV capabilities beyond passive observation to contact-based operations at high altitudes and in otherwise inaccessible environments. Although recent advances show promise, most AM systems are developed in controlled settings that overlook key aerodynamic effects. Simplified thrust models are often insufficient to capture the nonlinear wind disturbances and proximity-induced flow variations present in real-world environments near infrastructure, while high-fidelity CFD methods remain impractical for real-time use. Learning-based models are computationally efficient at inference, but often struggle to generalize to unseen condition. This paper combines both approaches by integrating a physics-based blade-element model with a learning-based residual force estimator, along with a rotor-speed allocation strategy for disturbance compensation, resulting in a unified control framework. The blade-element model computes per-rotor aerodynamic forces under wind and provides a refined feedforward disturbance estimate. A learning-based estimator then predicts the residual forces not captured by the model, enabling compensation for unmodeled aerodynamic effects. An online adaptation mechanism further updates the residual-force prediction and rotor-speed allocation jointly to reduce the mismatch between desired and realized thrust. We evaluate this framework in both free-flight and wall-contact tracking tasks in a simulated near-wall wind environment. Results demonstrate improved disturbance estimation and trajectory-tracking accuracy over conventional approaches, enabling robust wall-contact execution under challenging aerodynamic conditions.
Chinese Translation
空中操控(AM)扩展了无人机(UAV)的能力,不仅限于被动观察,还可以在高空和其他难以接近的环境中进行基于接触的操作。尽管近期的进展显示出希望,但大多数AM系统是在受控环境中开发的,忽视了关键的气动效应。简化的推力模型往往不足以捕捉到现实环境中基础设施附近存在的非线性风扰动和近壁流动变化,而高保真计算流体动力学(CFD)方法在实时使用中仍然不切实际。基于学习的模型在推理时计算效率高,但通常难以对未见条件进行泛化。本文通过将基于物理的叶片元素模型与基于学习的残余力估计器相结合,并采用扰动补偿的转子速度分配策略,形成了一个统一的控制框架。叶片元素模型在风中计算每个转子的气动力,并提供精细的前馈扰动估计。基于学习的估计器随后预测模型未捕捉到的残余力,从而实现对未建模气动效应的补偿。在线适应机制进一步联合更新残余力预测和转子速度分配,以减少期望推力与实际推力之间的差异。我们在模拟的近墙风环境中评估了该框架在自由飞行和壁面接触跟踪任务中的表现。结果表明,与传统方法相比,该框架在扰动估计和轨迹跟踪精度上有所改善,使得在具有挑战性的气动条件下能够稳健地执行壁面接触。
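The core composition described above, a physics feedforward plus an online-adapted learned residual, can be sketched with stand-in models (the blade-element term and the linear LMS residual below are placeholders, not the paper's estimators):

```python
# Minimal sketch of the hybrid idea (placeholder models, not the paper's):
# disturbance estimate = physics-based feedforward + learned residual,
# with the residual adapted online from the observed force error.
import numpy as np

def blade_element_force(state, wind):
    # Stand-in for a per-rotor blade-element model: crude drag-like term.
    return 0.1 * wind - 0.02 * state[:3]

class LinearResidual:
    """Online least-mean-squares residual on simple features."""
    def __init__(self, n_feat, lr=1e-2):
        self.W = np.zeros((3, n_feat))
        self.lr = lr
    def features(self, state, wind):
        return np.concatenate([state[:3], wind, [1.0]])
    def predict(self, state, wind):
        return self.W @ self.features(state, wind)
    def update(self, state, wind, force_error):
        phi = self.features(state, wind)
        self.W += self.lr * np.outer(force_error, phi)  # LMS step

residual = LinearResidual(n_feat=7)
state, wind = np.zeros(6), np.array([1.0, 0.0, 0.0])
f_hat = blade_element_force(state, wind) + residual.predict(state, wind)
f_meas = np.array([0.15, 0.0, -0.01])          # e.g. from a disturbance observer
residual.update(state, wind, f_meas - f_hat)   # close the adaptation loop
```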
cs.RO / 84 / 2603.07844

Relating Reinforcement Learning to Dynamic Programming-Based Planning

将强化学习与基于动态规划的规划联系起来
Georgiev, Filip V., Timperi, Kalle G., Sakçak, Başak, LaValle, Steven M.
Abstract
This paper bridges some of the gap between optimal planning and reinforcement learning (RL), both of which share roots in dynamic programming applied to sequential decision making or optimal control. Whereas planning typically favors deterministic models, goal termination, and cost minimization, RL tends to favor stochastic models, infinite-horizon discounting, and reward maximization in addition to learning-related parameters such as the learning rate and greediness factor. A derandomized version of RL is developed, analyzed, and implemented to yield performance comparisons with value iteration and Dijkstra's algorithm using simple planning models. Next, mathematical analysis shows: 1) conditions under which cost minimization and reward maximization are equivalent, 2) conditions for equivalence of single-shot goal termination and infinite-horizon episodic learning, and 3) conditions under which discounting causes goal achievement to fail. The paper then advocates for defining and optimizing truecost, rather than inserting arbitrary parameters to guide operations. Performance studies are then extended to the stochastic case, using planning-oriented criteria and comparing value iteration to RL with learning rates and greediness factors.
Chinese Translation
本文弥合了最优规划与强化学习(RL)之间的一些差距,这两者均源于应用于序列决策或最优控制的动态规划。规划通常偏向于确定性模型、目标终止和成本最小化,而强化学习则倾向于随机模型、无限期折扣和奖励最大化,此外还涉及学习相关参数,如学习率和贪婪因子。本文开发、分析并实现了一种去随机化的强化学习版本,以便与价值迭代和Dijkstra算法在简单规划模型下进行性能比较。接下来,数学分析表明:1)成本最小化和奖励最大化等价的条件;2)一次性目标终止与无限期情景学习等价的条件;3)折扣导致目标实现失败的条件。随后,本文主张定义和优化真实成本,而不是插入任意参数来指导操作。性能研究进一步扩展到随机情况,使用以规划为导向的标准,并将价值迭代与具有学习率和贪婪因子的强化学习进行比较。
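The first and third analysis points have standard counterparts; a sketch in textbook notation (not the paper's exact theorem statements):

```latex
% With stage cost c = -r, cost minimization and reward maximization coincide:
\min_{\pi} \mathbb{E}\Big[\sum_{t=0}^{T} c(x_t, u_t)\Big]
  \;=\; -\max_{\pi} \mathbb{E}\Big[\sum_{t=0}^{T} r(x_t, u_t)\Big].
% With discounting and a goal reward R received at termination time T_g,
% the agent weighs \gamma^{T_g} R against accumulated intermediate rewards;
% for \gamma < 1 and large T_g the goal term vanishes, which is one way
% discounting can cause goal achievement to fail.
```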
cs.RO / 85 / 2603.07866

Viewpoint-Agnostic Grasp Pipeline using VLM and Partial Observations

基于VLM和部分观测的视角无关抓取管道
Almeida, Dilermando, Negri, Juliano, Lazzarini, Guilherme, Segreto, Thiago H., Bezerra, Ranulfo, Godoy, Ricardo V., Becker, Marcelo
Abstract
Robust grasping in cluttered, unstructured environments remains challenging for mobile legged manipulators due to occlusions that lead to partial observations, unreliable depth estimates, and the need for collision-free, execution-feasible approaches. In this paper we present an end-to-end pipeline for language-guided grasping that bridges open-vocabulary target selection to safe grasp execution on a real robot. Given a natural-language command, the system grounds the target in RGB using open-vocabulary detection and promptable instance segmentation, extracts an object-centric point cloud from RGB-D, and improves geometric reliability under occlusion via back-projected depth compensation and two-stage point cloud completion. We then generate and collision-filter 6-DoF grasp candidates and select an executable grasp using safety-oriented heuristics that account for reachability, approach feasibility, and clearance. We evaluate the method on a quadruped robot with an arm in two cluttered tabletop scenarios, using paired trials against a view-dependent baseline. The proposed approach achieves a 90% overall success rate (9/10) against 30% (3/10) for the baseline, demonstrating substantially improved robustness to occlusions and partial observations in clutter.
Chinese Translation
在杂乱且非结构化的环境中,移动腿式操作器的稳健抓取仍然面临挑战,主要由于遮挡导致的部分观测、不可靠的深度估计,以及对无碰撞、可执行方法的需求。本文提出了一种端到端的语言引导抓取管道,将开放词汇目标选择与真实机器人上的安全抓取执行相结合。给定自然语言命令,系统利用开放词汇检测和可提示实例分割在RGB图像中定位目标,从RGB-D中提取以物体为中心的点云,并通过反向投影深度补偿和两阶段点云补全在遮挡情况下提高几何可靠性。然后,我们生成并进行碰撞过滤的6自由度抓取候选,并使用考虑可达性、接近可行性和间隙的安全导向启发式算法选择可执行的抓取。我们在两个杂乱的桌面场景中对一种具有机械臂的四足机器人评估该方法,使用与视角依赖基线的配对试验进行比较。所提方法实现了90%的整体成功率(9/10),而基线仅为30%(3/10),显示出在杂乱环境中对遮挡和部分观测的显著增强的稳健性。
cs.RO / 86 / 2603.07875

Choose What to Observe: Task-Aware Semantic-Geometric Representations for Visuomotor Policy

选择观察内容:任务感知的语义-几何表示用于视觉运动策略
Ding, Haoran, Ma, Liang, Yang, Yaxun, Yang, Wen, Liu, Tianyu, Duan, Anqing, Liang, Xiaodan, Song, Dezhen, Laptev, Ivan, Nakamura, Yoshihiko
Abstract
Visuomotor policies learned from demonstrations often overfit to nuisance visual factors in raw RGB observations, resulting in brittle behavior under appearance shifts such as background changes and object recoloring. We propose a task-aware observation interface that canonicalizes visual input into a shared representation, improving robustness to out-of-distribution (OOD) appearance changes without modifying or fine-tuning the policy. Given an RGB image and an open-vocabulary specification of task-relevant entities, we use SAM3 to segment the target object and robot/gripper. We construct an L0 observation by repainting segmented entities with predefined semantic colors on a constant background. For tasks requiring stronger geometric cues, we further inject monocular depth from Depth Anything 3 into the segmented regions via depth-guided overwrite, yielding a unified semantic--geometric observation (L1) that remains a standard 3-channel, image-like input. We evaluate on RoboMimic (Lift), ManiSkill YCB grasping under clutter, four RLBench tasks under controlled appearance shifts, and two real-world Franka tasks (ReachX and CloseCabinet). Across benchmarks and policy backbones (Flow Matching Policy and SmolVLA), our interface preserves in-distribution performance while substantially improving robustness under OOD visual shifts.
Chinese Translation
从示范中学习的视觉运动策略往往会过拟合于原始RGB观测中的干扰视觉因素,导致在背景变化和物体重新上色等外观变化下表现脆弱。我们提出了一种任务感知的观测接口,将视觉输入规范化为共享表示,提高了对分布外(OOD)外观变化的鲁棒性,而无需修改或微调策略。给定一幅RGB图像和一个开放词汇的任务相关实体规范,我们使用SAM3对目标物体和机器人/夹具进行分割。我们通过在恒定背景上用预定义的语义颜色重新上色分割实体,构建了L0观测。对于需要更强几何线索的任务,我们进一步通过深度引导覆盖将Depth Anything 3的单目深度注入到分割区域,从而产生一个统一的语义-几何观测(L1),该观测仍然是标准的三通道图像输入。我们在RoboMimic(Lift)、ManiSkill YCB在杂乱环境下的抓取、四个RLBench任务在受控外观变化下的表现,以及两个真实世界的Franka任务(ReachX和CloseCabinet)上进行了评估。在各基准测试和策略骨干(Flow Matching Policy和SmolVLA)中,我们的接口在保持分布内性能的同时,显著提高了在OOD视觉变化下的鲁棒性。
cs.RO / 87 / 2603.07885

Identifying Influential Actions in Human-Robot Interactions

识别人机交互中的影响性行为
Jiang, Haoyang, Xu, Chenfei, Okadome, Yuya, Nakamura, Yutaka
Abstract
Human-robot interaction combines robotics, cognitive science, and human factors to study collaborative systems. This paper introduces a method for identifying influential robot actions using transfer entropy (TE), a statistic that measures directed information transfer between time series. TE is effective for capturing complex, nonlinear interactions. We apply this method to analyze how robot actions affect human behavior during a conversation with a remotely controlled robot avatar. By focusing on the impact of proximity, our approach demonstrates TE's capability to identify key actions influencing human responses, highlighting its potential to improve the design and adaptability of robotic systems.
Chinese Translation
人机交互结合了机器人技术、认知科学和人因工程,以研究协作系统。本文介绍了一种使用转移熵(transfer entropy)识别影响性机器人行为的方法,转移熵是一种测量时间序列之间定向信息传递的统计量。转移熵有效捕捉复杂的非线性交互。我们应用该方法分析机器人行为如何影响人类在与远程控制的机器人化身对话时的行为。通过关注接近度的影响,我们的方法展示了转移熵识别影响人类反应的关键行为的能力,突显了其在改善机器人系统设计和适应性方面的潜力。
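For reference, Schreiber's transfer entropy from a source series X (robot actions) to a target series Y (human behavior), with history-embedding lengths k and l; this is the standard definition, which may differ from the exact estimator used in the paper:

```latex
T_{X \to Y} \;=\; \sum p\big(y_{t+1},\, y_t^{(k)},\, x_t^{(l)}\big)\,
  \log_2 \frac{p\big(y_{t+1} \mid y_t^{(k)},\, x_t^{(l)}\big)}
              {p\big(y_{t+1} \mid y_t^{(k)}\big)}
```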
cs.RO / 88 / 2603.07892

RoboRouter: Training-Free Policy Routing for Robotic Manipulation

RoboRouter:无训练策略路由用于机器人操作
Chen, Yiteng, Cao, Zhe, Ren, Hongjia, Yang, Chenjie, Li, Wenbo, Wang, Shiyi, Wang, Yemin, Zhang, Li, Shao, Yanming, Zhao, Zhenjun, Zhuang, Huiping, Wu, Qingyao
Abstract
Research on robotic manipulation has developed a diverse set of policy paradigms, including vision-language-action (VLA) models, vision-action (VA) policies, and code-based compositional approaches. Concrete policies typically attain high success rates on specific task distributions but limited generalization beyond them. Rather than proposing another monolithic policy, we propose to leverage the complementary strengths of existing approaches through intelligent policy routing. We introduce RoboRouter, a training-free framework that maintains a pool of heterogeneous policies and learns to select the best-performing policy for each task through accumulated execution experience. Given a new task, RoboRouter constructs a semantic task representation, retrieves historical records of similar tasks, predicts the optimal policy choice without requiring trial-and-error, and incorporates structured feedback to refine subsequent routing decisions. Integrating a new policy into the system requires only lightweight evaluation and incurs no training overhead. Across simulation benchmarks and real-world evaluations, RoboRouter consistently outperforms individual policies, improving average success rate by more than 3% in simulation and over 13% in real-world settings, while preserving execution efficiency. Our results demonstrate that intelligent routing across heterogeneous, off-the-shelf policies provides a practical and scalable pathway toward building more capable robotic systems.
Chinese Translation
机器人操作的研究发展出了一系列多样的策略范式,包括视觉-语言-动作(VLA)模型、视觉-动作(VA)策略和基于代码的组合方法。具体策略通常在特定任务分布上取得高成功率,但在此之外的泛化能力有限。我们并不打算提出另一种单一的策略,而是通过智能策略路由来利用现有方法的互补优势。我们提出了RoboRouter,一个无训练的框架,维护一个异构策略池,并通过积累的执行经验学习为每个任务选择表现最佳的策略。在面对新任务时,RoboRouter构建语义任务表示,检索类似任务的历史记录,预测最佳策略选择而无需试错,并结合结构化反馈来优化后续的路由决策。将新策略集成到系统中只需轻量级评估,并且不产生训练开销。在模拟基准和现实世界评估中,RoboRouter始终优于单一策略,在模拟中平均成功率提高超过3%,在现实世界环境中提高超过13%,同时保持执行效率。我们的结果表明,在异构的现成策略之间进行智能路由提供了一条实用且可扩展的途径,以构建更强大的机器人系统。
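The retrieval-and-prediction step described above can be illustrated with a nearest-neighbor router over task embeddings; the interfaces below are hypothetical stand-ins, not the RoboRouter codebase:

```python
# Illustrative sketch of retrieval-based policy routing: embed the task,
# retrieve the k most similar past records, and score each policy by its
# historical success on those neighbors.
import numpy as np

def route(task_emb, records, k=5):
    """records: list of (embedding, policy_name, success in {0, 1})."""
    embs = np.stack([r[0] for r in records])
    sims = embs @ task_emb / (np.linalg.norm(embs, axis=1)
                              * np.linalg.norm(task_emb) + 1e-9)
    nearest = np.argsort(-sims)[:k]
    scores = {}
    for i in nearest:
        _, policy, success = records[i]
        scores.setdefault(policy, []).append(success)
    return max(scores, key=lambda p: np.mean(scores[p]))

rng = np.random.default_rng(0)
records = [(rng.normal(size=8), p, int(rng.random() < 0.7))
           for p in ["vla", "va", "code"] for _ in range(10)]
print(route(rng.normal(size=8), records))
```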
cs.RO / 89 / 2603.07901

NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving

NaviDriveVLM:面向自主驾驶的高层推理与运动规划解耦
Tao, Ximeng, Taghavi, Pardis, Filev, Dimitar, Langari, Reza, Pandey, Gaurav
Abstract
Vision-language models (VLMs) have emerged as a promising direction for end-to-end autonomous driving (AD) by jointly modeling visual observations, driving context, and language-based reasoning. However, existing VLM-based systems face a trade-off between high-level reasoning and motion planning: large models offer strong semantic understanding but are costly to adapt for precise control, whereas small VLMs can be fine-tuned efficiently but often exhibit weaker reasoning. We propose NaviDriveVLM, a decoupled framework that separates reasoning from action generation using a large-scale Navigator and a lightweight trainable Driver. This design preserves reasoning ability, reduces training cost, and provides an explicit interpretable intermediate representation for downstream planning. Experiments on the nuScenes benchmark show that NaviDriveVLM outperforms large VLM baselines in end-to-end motion planning.
Chinese Translation
视觉语言模型(VLMs)作为一种有前景的方向,已在端到端自主驾驶(AD)中崭露头角,通过联合建模视觉观察、驾驶环境和基于语言的推理。然而,现有的基于VLM的系统在高层推理与运动规划之间面临权衡:大型模型提供强大的语义理解,但在精确控制上适应成本高,而小型VLM模型则可以高效微调,但往往表现出较弱的推理能力。我们提出了NaviDriveVLM,一个解耦框架,通过使用大规模的Navigator和轻量级可训练的Driver,将推理与动作生成分离。这一设计保留了推理能力,降低了训练成本,并为下游规划提供了明确的可解释中间表示。在nuScenes基准上的实验表明,NaviDriveVLM在端到端运动规划方面优于大型VLM基线。
cs.RO / 90 / 2603.07909

Long-Short Term Agents for Pure-Vision Bronchoscopy Robotic Autonomy

纯视觉支气管镜机器人自主性的长短期智能体
Wu, Junyang, Luo, Mingyi, Xie, Fangfang, Zhang, Minghui, Zhang, Hanxiao, Zhang, Chunxi, Wang, Junhao, Sun, Jiayuan, Gu, Yun, Yang, Guang-Zhong
Abstract
Accurate intraoperative navigation is essential for robot-assisted endoluminal intervention, but remains difficult because of limited endoscopic field of view and dynamic artifacts. Existing navigation platforms often rely on external localization technologies, such as electromagnetic tracking or shape sensing, which increase hardware complexity and remain vulnerable to intraoperative anatomical mismatch. We present a vision-only autonomy framework that performs long-horizon bronchoscopic navigation using preoperative CT-derived virtual targets and live endoscopic video, without external tracking during navigation. The framework uses hierarchical long-short agents: a short-term reactive agent for continuous low-latency motion control, and a long-term strategic agent for decision support at anatomically ambiguous points. When their recommendations conflict, a world-model critic predicts future visual states for candidate actions and selects the action whose predicted state best matches the target view. We evaluated the system in a high-fidelity airway phantom, three ex vivo porcine lungs, and a live porcine model. The system reached all planned segmental targets in the phantom, maintained 80% success to the eighth generation ex vivo, and achieved in vivo navigation performance comparable to the expert bronchoscopist. These results support the preclinical feasibility of sensor-free autonomous bronchoscopic navigation.
Chinese Translation
准确的术中导航对于机器人辅助的内腔干预至关重要,但由于内窥镜视野有限和动态伪影,仍然面临挑战。现有的导航平台通常依赖于外部定位技术,如电磁跟踪或形状传感,这增加了硬件复杂性,并且在术中解剖不匹配时仍然容易受到影响。我们提出了一种仅基于视觉的自主框架,该框架利用术前CT导出的虚拟目标和实时内窥镜视频进行长时间的支气管镜导航,而无需在导航过程中使用外部跟踪。该框架使用分层的长短期智能体:一个短期反应智能体用于连续低延迟的运动控制,一个长期战略智能体用于在解剖模糊点提供决策支持。当它们的建议发生冲突时,世界模型评判器预测候选动作的未来视觉状态,并选择预测状态与目标视图最匹配的动作。我们在高保真气道模型、三个离体猪肺和一个活体猪模型中评估了该系统。该系统在模型中达到了所有计划的分段目标,在离体实验中导航至第八级气道时仍保持80%的成功率,并在活体实验中实现了与专家支气管镜医师相当的导航性能。这些结果支持无传感器自主支气管镜导航的临床前可行性。
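The conflict-arbitration logic, where a world-model critic imagines the next view for each candidate action, can be sketched as follows (stub models in place of the learned short-term agent, long-term agent, world model, and encoder):

```python
# Sketch of the arbitration step described above (stub components, not the
# paper's implementation): when the two agents disagree, a world model
# predicts the next view for each candidate action, and the action whose
# predicted view best matches the target view is executed.
import numpy as np

def arbitrate(obs, target_emb, short_agent, long_agent, world_model, encoder):
    a_short, a_long = short_agent(obs), long_agent(obs)
    if np.array_equal(a_short, a_long):
        return a_short
    candidates = [a_short, a_long]
    dists = [np.linalg.norm(encoder(world_model(obs, a)) - target_emb)
             for a in candidates]
    return candidates[int(np.argmin(dists))]

# Stubs standing in for the learned components:
short_agent = lambda obs: np.array([1, 0])      # reactive steering
long_agent = lambda obs: np.array([0, 1])       # strategic branch choice
world_model = lambda obs, a: obs + 0.1 * np.concatenate([a, a])
encoder = lambda x: x[:2]
print(arbitrate(np.zeros(4), np.array([0.1, 0.0]),
                short_agent, long_agent, world_model, encoder))
```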
cs.RO / 91 / 2603.07928

Omnidirectional Humanoid Locomotion on Stairs via Unsafe Stepping Penalty and Sparse LiDAR Elevation Mapping

通过不安全踏步惩罚和稀疏激光雷达高度映射实现全向类人机器人在楼梯上的行走
Jiang, Yuzhi, Liang, Yujun, Li, Junhao, Ding, Han, Zhu, Lijun
Abstract
Humanoid robots, characterized by numerous degrees of freedom and a high center of gravity, are inherently unstable. Safe omnidirectional locomotion on stairs requires both omnidirectional terrain perception and reliable foothold selection. Existing methods often rely on forward-facing depth cameras, which create blind zones that restrict omnidirectional mobility. Furthermore, sparse post-contact unsafe stepping penalties lead to low learning efficiency and suboptimal strategies. To realize safe stair-traversal gaits, this paper introduces a single-stage training framework incorporating a dense unsafe stepping penalty that provides continuous feedback as the foot approaches a hazardous placement. To obtain stable and reliable elevation maps, we build a rolling point-cloud mapping system with spatiotemporal confidence decay and a self-protection zone mechanism, producing temporally consistent local maps. These maps are further refined by an Edge-Guided Asymmetric U-Net (EGAU), which mitigates reconstruction distortion caused by sparse LiDAR returns on stair risers. Simulation and real-robot experiments show that the proposed method achieves a near-100% safe stepping rate on stair terrains in simulation, while maintaining a remarkably high safe stepping rate in real-world deployments. Furthermore, it completes a continuous long-distance walking test on complex outdoor terrains, demonstrating reliable sim-to-real transfer and long-term stability.
Chinese Translation
类人机器人具有多个自由度和较高的重心,固有不稳定性。在楼梯上实现安全的全向行走需要全向的地形感知和可靠的踏点选择。现有方法通常依赖于面向前方的深度摄像头,这会产生盲区,限制全向移动。此外,稀疏的接触后不安全踏步惩罚导致学习效率低下和次优策略。为实现安全的楼梯行走步态,本文提出了一种单阶段训练框架,结合密集的不安全踏步惩罚,在脚靠近危险位置时提供连续反馈。为了获得稳定可靠的高度图,我们构建了一个具有时空置信度衰减和自我保护区机制的滚动点云映射系统,生成时间一致的局部地图。这些地图通过边缘引导的非对称U-Net(Edge-Guided Asymmetric U-Net, EGAU)进一步优化,减轻了由于稀疏激光雷达在楼梯踏步上的返回造成的重建失真。仿真和真实机器人实验表明,所提方法在仿真中的楼梯地形上实现了近100%的安全踏步率,同时在现实部署中保持了显著高的安全踏步率。此外,它在复杂的户外地形上完成了连续的长距离行走测试,展示了可靠的模拟到现实转移和长期稳定性。
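The contrast between a sparse post-contact penalty and the dense shaping term described above can be made concrete; the exponential form and the length scale sigma below are illustrative choices, not the paper's reward:

```python
# Toy contrast: a sparse penalty fires only after an unsafe contact, while
# a dense penalty grows continuously as the swing foot approaches a hazard.
import numpy as np

def sparse_penalty(foot_xy, unsafe_xy, contact, eps=0.02):
    d = np.min(np.linalg.norm(unsafe_xy - foot_xy, axis=1))
    return -1.0 if contact and d < eps else 0.0

def dense_penalty(foot_xy, unsafe_xy, sigma=0.05):
    d = np.min(np.linalg.norm(unsafe_xy - foot_xy, axis=1))
    return -np.exp(-d / sigma)  # continuous feedback before contact

unsafe = np.array([[0.30, 0.00], [0.35, 0.10]])   # e.g. stair-edge cells
for x in (0.10, 0.25, 0.29):
    print(x, dense_penalty(np.array([x, 0.0]), unsafe))
```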
cs.RO / 92 / 2603.07939

Unified Structural-Hydrodynamic Modeling of Underwater Underactuated Mechanisms and Soft Robots

水下欠驱动机制和软体机器人的统一结构-流体动力学建模
Zhang, Chenrui, Zhang, Yiyuan, Ye, Yunfei, Chen, Junkai, Wang, Haozhe, Laschi, Cecilia
Abstract
Underwater robots are widely deployed for ocean exploration and manipulation. Underactuated mechanisms are particularly advantageous in aquatic environments, as reducing actuator count lowers the risk of motor leakage while introducing inherent mechanical compliance. However, accurate modeling of underwater underactuated and soft robotic systems remains challenging because it requires identifying a high-dimensional set of internal structural and external hydrodynamic parameters. In this work, we propose a trajectory-driven global optimization framework for unified structural-hydrodynamic modeling of underwater multibody systems. Inspired by the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), the proposed approach simultaneously identifies coupled internal elastic, damping, and distributed hydrodynamic parameters through trajectory-level matching between simulation and experimental motion. This enables high-fidelity reproduction of both underactuated mechanisms and compliant soft robotic systems in underwater environments. We first validate the framework on a link-by-link underactuated multibody mechanism, demonstrating accurate identification of distributed hydrodynamic coefficients, with a normalized end effector position error below 5% across multiple trajectories, varying initial conditions, and both active-passive and fully passive configurations. The identified modeling strategy is then transferred to a single octopus-inspired soft arm, showing strong real-to-sim consistency without manual retuning. Finally, eight identified arms are assembled into a swimming octopus robot, where the unified parameter set enables realistic whole body behavior without additional parameter calibration. These results demonstrate the scalability and transferability of the proposed structural-hydrodynamic modeling framework across underwater underactuated and soft robotic systems.
Chinese Translation
水下机器人广泛应用于海洋探索和操作。欠驱动机制在水域环境中尤其具有优势,因为减少执行器数量降低了电机泄漏的风险,同时引入了固有的机械柔顺性。然而,准确建模水下欠驱动和软体机器人系统仍然具有挑战性,因为这需要识别一组高维的内部结构和外部流体动力学参数。在本研究中,我们提出了一种基于轨迹驱动的全局优化框架,用于水下多体系统的统一结构-流体动力学建模。受协方差矩阵适应进化策略(CMA-ES)的启发,所提出的方法通过模拟与实验运动之间的轨迹级匹配,同时识别耦合的内部弹性、阻尼和分布式流体动力学参数。这使得在水下环境中能够高保真地重现欠驱动机制和柔性软体机器人系统。我们首先在一个逐链接的欠驱动多体机制上验证了该框架,准确识别了分布式流体动力学系数,在多个轨迹、不同初始条件以及主动-被动和完全被动配置下,末端执行器位置误差均低于5%。然后,将识别的建模策略转移到一个单一的受章鱼启发的软臂上,显示出强大的真实与模拟一致性,无需手动重新调节。最后,八个识别的臂被组装成一个游泳的章鱼机器人,其中统一的参数集使得在没有额外参数校准的情况下实现逼真的整体行为。这些结果展示了所提出的结构-流体动力学建模框架在水下欠驱动和软体机器人系统中的可扩展性和可转移性。
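Trajectory-level identification of this kind reduces to black-box minimization of a simulation-to-measurement discrepancy. The sketch below uses plain CMA-ES from the off-the-shelf pycma package (pip install cma) with a toy one-dimensional "simulator"; the paper's optimizer is only inspired by CMA-ES, and its simulator is the full multibody model:

```python
# Sketch of trajectory-matching parameter identification with CMA-ES.
import numpy as np
import cma

t = np.linspace(0, 2, 50)
true_params = np.array([1.5, 0.8, 0.3])           # stiffness, damping, drag

def simulate(params):
    k, c, d = params
    return np.exp(-c * t) * np.cos(k * t) * (1 - d * t / 4)

measured = simulate(true_params)                   # "experimental" trajectory

def loss(params):
    # Trajectory-level mismatch between simulation and measurement.
    return float(np.mean((simulate(params) - measured) ** 2))

es = cma.CMAEvolutionStrategy(x0=[1.0, 1.0, 0.1], sigma0=0.5)
while not es.stop():
    xs = es.ask()
    es.tell(xs, [loss(x) for x in xs])
print(es.result.xbest)  # should recover values near true_params
```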
cs.RO / 93 / 2603.07973

VORL-EXPLORE: A Hybrid Learning Planning Approach to Multi-Robot Exploration in Dynamic Environments

VORL-EXPLORE:一种混合学习规划方法用于动态环境中的多机器人探索
Liu, Ning, Shen, Sen, Li, Zheng, Liu, Sheng, Han, Dongkun, Lyu, Shangke, Braunl, Thomas
Abstract
Hierarchical multi-robot exploration commonly decouples frontier allocation from local navigation, which can make the system brittle in dense and dynamic environments. Because the allocator lacks direct awareness of execution difficulty, robots may cluster at bottlenecks, trigger oscillatory replanning, and generate redundant coverage. We propose VORL-EXPLORE, a hybrid learning and planning framework that addresses this limitation through execution fidelity, a shared estimate of local navigability that couples task allocation with motion execution. This fidelity signal is incorporated into a fidelity-coupled Voronoi objective with inter-robot repulsion to reduce contention before it emerges. It also drives a risk-aware adaptive arbitration mechanism between global A* guidance and a reactive reinforcement learning policy, balancing long-range efficiency with safe interaction in confined spaces. The framework further supports online self-supervised recalibration of the fidelity model using pseudo-labels derived from recent progress and safety outcomes, enabling adaptation to non-stationary obstacles without manual risk tuning. We evaluate this capability separately in a dedicated severe-traffic ablation. Extensive experiments in randomized grids and a Gazebo factory scenario show high success rates, shorter path length, lower overlap, and robust collision avoidance. The source code will be made publicly available upon acceptance.
Chinese Translation
分层多机器人探索通常将边界分配与局部导航解耦,这可能导致系统在密集和动态环境中变得脆弱。由于分配器缺乏对执行难度的直接感知,机器人可能会在瓶颈处聚集,触发振荡式重新规划,并产生冗余覆盖。我们提出了VORL-EXPLORE,这是一种混合学习与规划框架,通过执行保真度来解决这一限制,执行保真度是对局部可导航性的共享估计,将任务分配与运动执行结合在一起。该保真度信号被纳入一个保真度耦合的Voronoi目标,并引入机器人间的排斥力,以减少竞争的出现。它还驱动了一个风险感知的自适应仲裁机制,在全局A*引导与反应式强化学习策略之间进行平衡,以在狭小空间中实现长距离效率与安全交互的平衡。该框架进一步支持使用从近期进展和安全结果中派生的伪标签进行在线自我监督的保真度模型重新校准,使其能够适应非平稳障碍,而无需手动风险调整。我们在专门的严重交通消融实验中单独评估了这一能力。在随机网格和Gazebo工厂场景中的大量实验显示出高成功率、更短的路径长度、较低的重叠和稳健的碰撞避免。源代码将在接受后公开发布。
cs.RO / 94 / 2603.07998

Aero-Promptness: Drag-Aware Aerodynamic Manipulability for Propeller-driven Vehicles

气动及时性:考虑阻力的螺旋桨驱动飞行器的气动可操控性
Franchi, Antonio
Abstract
This work introduces the Drag-Aware Aerodynamic Manipulability (DAAM), a geometric framework for control allocation in redundant multirotors. By equipping the propeller spin-rate space with a Riemannian metric based on the remaining symmetric acceleration capacity of each motor, the formulation explicitly accounts for motor torque limits and aerodynamic drag. Mapping this metric through the nonlinear thrust law to the generalized force space yields a state-dependent manipulability volume. The log-determinant of this volume acts as a natural barrier function, strictly penalizing drag-induced saturation and low-spin thrust loss. Optimizing this volume along the allocation fibers provides a redundancy resolution strategy inherently invariant to arbitrary coordinate scaling in the generalized-force space. Analytically, we prove that the resulting optimal allocations locally form smooth embedded manifolds, and we geometrically characterize the global jump discontinuities that inevitably arise from physical actuator limits and spin-rate sign transitions.
Chinese Translation
本研究提出了考虑阻力的气动可操控性(Drag-Aware Aerodynamic Manipulability, DAAM),这是一个用于冗余多旋翼飞行器控制分配的几何框架。通过为螺旋桨转速空间配备基于每个电机剩余对称加速能力的黎曼度量,该公式明确考虑了电机扭矩限制和气动阻力。通过非线性推力法则将该度量映射到广义力空间,得到一个状态依赖的可操控性体积。该体积的对数行列式作为一个自然的障碍函数,严格惩罚因阻力引起的饱和和低转速推力损失。在分配纤维上优化该体积提供了一种冗余解决策略,该策略在广义力空间中对任意坐标缩放具有固有的不变性。从分析上看,我们证明了结果的最优分配在局部形成光滑的嵌入流形,并几何地表征了不可避免地由于物理执行器限制和转速符号转换而产生的全局跳跃不连续性。
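A schematic of the construction, in our own notation rather than the paper's: let M(ω) be the Riemannian metric on spin-rate space built from each motor's remaining symmetric acceleration capacity, and A(ω) = ∂f/∂ω the Jacobian of the nonlinear thrust law f into generalized-force space. The pushed-forward manipulability volume and its log-det barrier then read:

```latex
V(\omega) \;\propto\; \det\!\big(A(\omega)\, M(\omega)^{-1} A(\omega)^{\top}\big),
\qquad
\mathcal{B}(\omega) \;=\; \log \det\!\big(A(\omega)\, M(\omega)^{-1} A(\omega)^{\top}\big),
% which diverges to -infinity as any motor approaches saturation, strictly
% penalizing drag-induced saturation and low-spin thrust loss.
```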
cs.RO / 95 / 2603.07999

Dual-Horizon Hybrid Internal Model for Low-Gravity Quadrupedal Jumping with Hardware-in-the-Loop Validation

低重力四足跳跃的双视域混合内部模型及硬件在环验证
Xu, Haozhe, Zhao, Yifei, Feng, Wenhao, Wang, Zhipeng, Sang, Hongrui, Cheng, Cheng, Li, Xiuxian, Yin, Zhen, He, Bin
Abstract
Locomotion under reduced gravity is commonly realized through jumping, yet continuous pronking in lunar gravity remains challenging due to prolonged flight phases and sparse ground contact. The extended aerial duration increases landing impact sensitivity and makes stable attitude regulation over rough planetary terrain difficult. Existing approaches primarily address single jumps on flat surfaces and lack both continuous-terrain solutions and realistic hardware validation. This work presents a Dual-Horizon Hybrid Internal Model for continuous quadrupedal jumping under lunar gravity using proprioceptive sensing only. Two temporal encoders capture complementary time scales: a short-horizon branch models rapid vertical dynamics with explicit vertical velocity estimation, while a long-horizon branch models horizontal motion trends and center-of-mass height evolution across the jump cycle. The fused representation enables stable and continuous jumping under extended aerial phases characteristic of lunar gravity. To provide hardware-in-the-loop validation, we develop the MATRIX (Mixed-reality Adaptive Testbed for Robotic Integrated eXploration) platform, a digital-twin-driven system that offloads gravity through a pulley-counterweight mechanism and maps Unreal Engine lunar terrain to a motion platform and treadmill in real time. Using MATRIX, we demonstrate continuous jumping of a quadruped robot under lunar-gravity emulation across cratered lunar-like terrain.
Chinese Translation
在减重力环境下,运动通常通过跳跃实现,但在月球重力下持续的跳跃仍然具有挑战性,因为飞行阶段较长且地面接触稀疏。延长的空中持续时间增加了着陆冲击的敏感性,并使得在粗糙的行星地形上保持稳定的姿态调节变得困难。现有的方法主要针对平坦表面上的单次跳跃,缺乏连续地形解决方案和现实的硬件验证。本文提出了一种用于月球重力下连续四足跳跃的双视域混合内部模型,仅使用本体感知传感器。两个时间编码器捕获互补的时间尺度:短视域分支建模快速的垂直动态,并显式估计垂直速度,而长视域分支则建模跳跃周期内的水平运动趋势和质心高度演变。融合的表示使得在月球重力特征的延长空中阶段下实现稳定和连续的跳跃成为可能。为了提供硬件在环验证,我们开发了MATRIX(混合现实自适应机器人集成探索测试平台)平台,这是一个通过滑轮-配重机制卸载重力的数字双胞胎驱动系统,并实时将虚幻引擎的月球地形映射到运动平台和跑步机上。利用MATRIX,我们展示了在月球重力模拟下,四足机器人在类似月球的坑洼地形上进行连续跳跃的能力。
cs.RO / 96 / 2603.08019

Vector Field Augmented Differentiable Policy Learning for Vision-Based Drone Racing

基于视觉的无人机竞速中的向量场增强可微策略学习
Su, Yang, Yu, Feng, Hu, Yu, Niu, Xinze, Zhang, Linzuo, Sun, Fangyu, Zou, Danping
Abstract
Autonomous drone racing in complex environments requires agile, high-speed flight while maintaining reliable obstacle avoidance. Differentiable-physics-based policy learning has recently demonstrated high sample efficiency and remarkable performance across various tasks, including agile drone flight and quadruped locomotion. However, applying such methods to drone racing remains difficult, as key objective like gate traversal are inherently hard to express as smooth, differentiable losses. To address these challenges, we propose DiffRacing, a novel vector field-augmented differentiable policy learning framework. DiffRacing integrates differentiable losses and vector fields into the training process to provide continuous and stable gradient signals, balancing obstacle avoidance and high-speed gate traversal. In addition, a differentiable Delta Action Model compensates for dynamics mismatch, enabling efficient sim-to-real transfer without explicit system identification. Extensive simulation and real-world experiments demonstrate that DiffRacing achieves superior sample efficiency, faster convergence, and robust flight performance, thereby demonstrating that vector fields can augment traditional gradient-based policy learning with a task-specific geometric prior.
Chinese Translation
在复杂环境中进行自主无人机竞速需要灵活、高速的飞行,同时保持可靠的障碍物规避。基于可微物理的策略学习最近在多种任务中展示了高样本效率和显著的性能,包括灵活的无人机飞行和四足机器人运动。然而,将此类方法应用于无人机竞速仍然困难,因为诸如门框穿越等关键目标本质上难以表达为平滑的、可微的损失。为了解决这些挑战,我们提出了DiffRacing,一个新颖的向量场增强可微策略学习框架。DiffRacing将可微损失和向量场集成到训练过程中,以提供连续和稳定的梯度信号,平衡障碍物规避和高速门框穿越。此外,一个可微的Delta Action Model补偿了动态不匹配,使得在没有明确系统识别的情况下实现高效的模拟到现实转移。大量的仿真和现实世界实验表明,DiffRacing实现了卓越的样本效率、更快的收敛速度和稳健的飞行性能,从而证明了向量场可以通过任务特定的几何先验增强传统的基于梯度的策略学习。
cs.RO / 97 / 2603.08021

AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis

AffordGrasp:跨模态扩散用于可供性感知的抓取合成
Wu, Xiaofei, Zhang, Yi, Liu, Yumeng, Ma, Yuexin, Shi, Yujiao, He, Xuming
Abstract
Generating human grasping poses that accurately reflect both object geometry and user-specified interaction semantics is essential for natural hand-object interactions in AR/VR and embodied AI. However, existing semantic grasping approaches struggle with the large modality gap between 3D object representations and textual instructions, and often lack explicit spatial or semantic constraints, leading to physically invalid or semantically inconsistent grasps. In this work, we present AffordGrasp, a diffusion-based framework that produces physically stable and semantically faithful human grasps with high precision. We first introduce a scalable annotation pipeline that automatically enriches hand-object interaction datasets with fine-grained structured language labels capturing interaction intent. Building upon these annotations, AffordGrasp integrates an affordance-aware latent representation of hand poses with a dual-conditioning diffusion process, enabling the model to jointly reason over object geometry, spatial affordances, and instruction semantics. A distribution adjustment module further enforces physical contact consistency and semantic alignment. We evaluate AffordGrasp across four instruction-augmented benchmarks derived from HO-3D, OakInk, GRAB, and AffordPose, and observe substantial improvements over state-of-the-art methods in grasp quality, semantic accuracy, and diversity.
Chinese Translation
生成准确反映物体几何形状和用户指定交互语义的人类抓取姿势,对于增强现实/虚拟现实(AR/VR)和具身人工智能中的自然手-物体交互至关重要。然而,现有的语义抓取方法在3D物体表示与文本指令之间存在较大的模态差距,并且通常缺乏明确的空间或语义约束,导致物理上无效或语义上不一致的抓取。在本研究中,我们提出了AffordGrasp,这是一种基于扩散的框架,能够以高精度生成物理稳定且语义真实的人类抓取。我们首先介绍了一种可扩展的注释管道,自动丰富手-物体交互数据集,提供捕捉交互意图的细粒度结构化语言标签。在这些注释的基础上,AffordGrasp将可供性感知的手部姿态潜在表示与双重条件扩散过程相结合,使模型能够共同推理物体几何、空间可供性和指令语义。分布调整模块进一步强化物理接触一致性和语义对齐。我们在四个基于HO-3D、OakInk、GRAB和AffordPose的增强指令基准上评估了AffordGrasp,并观察到在抓取质量、语义准确性和多样性方面相较于最先进的方法有显著提升。
cs.RO / 98 / 2603.08057

See and Switch: Vision-Based Branching for Interactive Robot-Skill Programming

观察与切换:基于视觉的交互式机器人技能编程分支
Vanc, Petr, Behrens, Jan Kristof, Hlaváč, Václav, Stepanova, Karla
Abstract
Programming robots by demonstration (PbD) is an intuitive concept, but scaling it to real-world variability remains a challenge for most current teaching frameworks. Conditional task graphs are very expressive and can be defined incrementally, which fits very well with the PbD idea. However, acting using conditional task graphs requires reliable perception-grounded online branch selection. In this paper, we present See & Switch, an interactive teaching-and-execution framework that represents tasks as user-extendable graphs of skill parts connected via decision states (DS), enabling conditional branching during replay. Unlike prior approaches that rely on manual branching or low-dimensional signals (e.g., proprioception), our vision-based Switcher uses eye-in-hand images (high-dimensional) to select among competing successor skill parts and to detect out-of-distribution contexts that require new demonstrations. We integrate kinesthetic teaching, joystick control, and hand gestures via an input-modality-abstraction layer and demonstrate that our proposed method is teaching modality-independent, enabling efficient in-situ recovery demonstrations. The system is validated in experiments on three challenging dexterous manipulation tasks. We evaluate our method under diverse conditions and furthermore conduct user studies with 8 participants. We show that the proposed method reliably performs branch selection and anomaly detection for novice users, achieving 90.7% and 87.9% accuracy, respectively, across 576 real-robot rollouts. We provide all code and data required to reproduce our experiments at http://imitrob.ciirc.cvut.cz/publications/seeandswitch.
Chinese Translation
通过示范编程机器人(PbD)是一个直观的概念,但将其扩展到现实世界的变异性仍然是当前大多数教学框架面临的挑战。条件任务图非常具有表现力,并且可以逐步定义,这与PbD的理念非常契合。然而,使用条件任务图进行操作需要可靠的感知基础的在线分支选择。在本文中,我们提出了See & Switch,一个交互式教学与执行框架,将任务表示为用户可扩展的技能部分图,这些部分通过决策状态(DS)连接,从而在重放期间实现条件分支。与依赖手动分支或低维信号(例如本体感知)的先前方法不同,我们的基于视觉的Switcher使用手眼图像(高维)在竞争的后续技能部分之间进行选择,并检测需要新示范的分布外上下文。我们通过输入模态抽象层整合了动力教学、操纵杆控制和手势,并证明我们提出的方法与教学模态无关,从而实现高效的现场恢复示范。该系统在三个具有挑战性的灵巧操作任务上进行了实验验证。我们在多种条件下评估了我们的方法,并进一步进行了8名参与者的用户研究。我们展示了所提出的方法能够可靠地进行分支选择和异常检测,对于新手用户在576次真实机器人执行中分别达到了90.7%和87.9%的准确率。我们提供了重现我们实验所需的所有代码和数据,网址为http://imitrob.ciirc.cvut.cz/publications/seeandswitch。
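The Switcher's decision, score the competing successor skill parts from the current eye-in-hand image and fall back to requesting a demonstration when confidence is low, can be sketched as follows (hypothetical linear scoring heads and threshold; the paper's Switcher is a learned vision model):

```python
# Minimal sketch of vision-based branch selection with OOD detection:
# a classifier scores the candidate successor skill parts, and low
# confidence triggers a request for a new demonstration.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def switch(image_emb, branch_heads, ood_threshold=0.6):
    """branch_heads: dict branch_name -> weight vector scoring the image."""
    names = list(branch_heads)
    logits = np.array([branch_heads[n] @ image_emb for n in names])
    probs = softmax(logits)
    if probs.max() < ood_threshold:
        return None  # out-of-distribution: ask the user for a demonstration
    return names[int(probs.argmax())]

rng = np.random.default_rng(1)
heads = {"open_drawer": rng.normal(size=16),
         "close_drawer": rng.normal(size=16)}
print(switch(rng.normal(size=16), heads))
```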
cs.RO / 99 / 2603.08089

Adaptive Vision-Based Control of Redundant Robots with Null-Space Interaction for Human-Robot Collaboration

面向人机协作的基于视觉的自适应冗余机器人控制与零空间交互
Yan, Xiangjie, Chen, Chen, Li, Xiang
Abstract
Human-robot collaboration aims to extend human ability through cooperation with robots. This technology is currently helping people with physical disabilities, has transformed the manufacturing process of companies, improved surgical performance, and will likely revolutionize the daily lives of everyone in the future. Being able to enhance the performance of both sides, such that human-robot collaboration outperforms a single robot/human, remains an open issue. For safer and more effective collaboration, a new control scheme has been proposed for redundant robots in this paper, consisting of an adaptive vision-based control term in task space and an interactive control term in null space. Such a formulation allows the robot to autonomously carry out tasks in an unknown environment without prior calibration while also interacting with humans to deal with unforeseen changes (e.g., potential collision, temporary needs) under the redundant configuration. The decoupling between task space and null space helps to explore the collaboration safely and effectively without affecting the main task of the robot end-effector. The stability of the closed-loop system has been rigorously proved with Lyapunov methods, and both the convergence of the position error in task space and that of the damping model in null space are guaranteed. The experimental results of a robot manipulator guided with the technology of augmented reality (AR) are presented to illustrate the performance of the control scheme.
Chinese Translation
人机协作旨在通过与机器人合作来扩展人类的能力。这项技术目前正在帮助身体残疾人士,改变公司的制造过程,提高外科手术的表现,并可能在未来彻底改变每个人的日常生活。如何增强双方的表现,使得人机协作的效果超过单一机器人或人类的表现,仍然是一个未解决的问题。为了实现更安全和更有效的协作,本文提出了一种新的控制方案,针对冗余机器人,包含任务空间中的自适应基于视觉的控制项和零空间中的交互控制项。这种形式化允许机器人在未知环境中自主执行任务,无需事先校准,同时与人类进行交互,以应对冗余配置下的不可预见变化(例如,潜在碰撞、临时需求)。任务空间与零空间之间的解耦有助于安全有效地探索协作,而不影响机器人末端执行器的主要任务。通过李雅普诺夫方法严格证明了闭环系统的稳定性,并保证了任务空间中位置误差的收敛性和零空间中阻尼模型的收敛性。本文展示了使用增强现实(AR)技术指导的机器人操纵器的实验结果,以说明该控制方案的性能。
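The task-space/null-space decoupling the scheme builds on has a classical form (standard resolved-rate notation; the paper's contribution is the adaptive vision-based task-space term and the interactive null-space term layered on top):

```latex
\dot{q} \;=\;
\underbrace{J^{\dagger}(q)\,\big(\dot{x}_d + K\,(x_d - x)\big)}_{\text{task-space tracking}}
\;+\;
\underbrace{\big(I - J^{\dagger}(q)\,J(q)\big)\,\dot{q}_0}_{\text{null-space interaction}},
% where J^\dagger is the pseudoinverse of the task Jacobian and \dot{q}_0
% encodes the human-interaction objective; the projector guarantees the
% second term does not disturb the end-effector task.
```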
cs.RO / 100 / 2603.08111

DeReCo: Decoupling Representation and Coordination Learning for Object-Adaptive Decentralized Multi-Robot Cooperative Transport

DeReCo:面向对象自适应分散式多机器人协作运输的表示与协调学习去耦
Shibata, Kazuki, Sota, Ryosuke, Bosch, Shandil Dhiresh, Kadokawa, Yuki, Tsurumine, Yoshihisa, Matsubara, Takamitsu
Abstract
Generalizing decentralized multi-robot cooperative transport across objects with diverse shapes and physical properties remains a fundamental challenge. Under decentralized execution, two key challenges arise: object-dependent representation learning under partial observability and coordination learning in multi-agent reinforcement learning (MARL) under non-stationarity. A typical approach jointly optimizes object-dependent representations and coordinated policies in an end-to-end manner while randomizing object shapes and physical properties during training. However, this joint optimization tightly couples representation and coordination learning, introducing bidirectional interference: inaccurate representations under partial observability destabilize coordination learning, while non-stationarity in MARL further degrades representation learning, resulting in sample-inefficient training. To address this structural coupling, we propose DeReCo, a novel MARL framework that decouples representation and coordination learning for object-adaptive multi-robot cooperative transport, improving sample efficiency and generalization across objects and transport scenarios. DeReCo adopts a three-stage training strategy: (1) centralized coordination learning with privileged object information, (2) reconstruction of object-dependent representations from local observations, and (3) progressive removal of privileged information for decentralized execution. This decoupling mitigates interference between representation and coordination learning and enables stable and sample-efficient training. Experimental results show that DeReCo outperforms baselines in simulation on three training objects, generalizes to six unseen objects with varying masses and friction coefficients, and achieves superior performance on two unseen objects in real-robot experiments.
Chinese Translation
在具有多样形状和物理特性的对象之间推广分散多机器人协作运输仍然是一个基本挑战。在分散执行的情况下,出现了两个关键挑战:在部分可观测性下的对象依赖表示学习和在非平稳性下的多智能体强化学习(MARL)中的协调学习。典型的方法是以端到端的方式联合优化对象依赖的表示和协调策略,同时在训练过程中随机化对象的形状和物理特性。然而,这种联合优化紧密耦合了表示与协调学习,导致双向干扰:在部分可观测性下的不准确表示会破坏协调学习,而MARL中的非平稳性进一步降低了表示学习的效果,导致样本效率低下。为了解决这种结构耦合问题,我们提出了DeReCo,一个新颖的MARL框架,旨在为对象自适应的多机器人协作运输去耦表示与协调学习,从而提高样本效率和在不同对象及运输场景中的泛化能力。DeReCo采用三阶段的训练策略:(1)利用特权对象信息进行集中协调学习,(2)从局部观测中重建对象依赖的表示,以及(3)逐步去除特权信息以实现分散执行。这种去耦减少了表示与协调学习之间的干扰,并实现了稳定且样本高效的训练。实验结果表明,DeReCo在三个训练对象的仿真中优于基线,在六个具有不同质量和摩擦系数的未见对象上实现了良好的泛化,并在真实机器人实验中对两个未见对象达到了优越的性能。
cs.RO / 101 / 2603.08122

Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA

通过强化学习增强的远程操作与混合灵巧专家VLA实现类人操控
Tang, Tutian, Ji, Xingyu, Xing, Wanli, Hao, Ce, Xu, Wenqiang, Shao, Lin, Lu, Cewu, Yu, Qiaojun, Pang, Jiangmiao, Zhang, Kaifeng
Abstract
While Vision-Language-Action (VLA) models have demonstrated remarkable success in robotic manipulation, their application has largely been confined to low-degree-of-freedom end-effectors performing simple, vision-guided pick-and-place tasks. Extending these models to human-like, bimanual dexterous manipulation-specifically contact-rich in-hand operations-introduces critical challenges in high-fidelity data acquisition, multi-skill learning, and multimodal sensory fusion. In this paper, we propose an integrated framework to address these bottlenecks, built upon two components. First, we introduce IMCopilot (In-hand Manipulation Copilot), a suite of reinforcement learning-trained atomic skills that plays a dual role: it acts as a shared-autonomy assistant to simplify teleoperation data collection, and it serves as a callable low-level execution primitive for the VLA. Second, we present MoDE-VLA (Mixture-of-Dexterous-Experts VLA), an architecture that seamlessly integrates heterogeneous force and tactile modalities into a pretrained VLA backbone. By utilizing a residual injection mechanism, MoDE-VLA enables contact-aware refinement without degrading the model's pretrained knowledge. We validate our approach on four tasks of escalating complexity, demonstrating doubled success rate improvement over the baseline in dexterous contact-rich tasks.
Chinese Translation
尽管视觉-语言-动作(VLA)模型在机器人操控中取得了显著成功,但其应用主要局限于低自由度末端执行器执行简单的视觉引导抓取和放置任务。将这些模型扩展到类人双手灵巧操控,特别是接触丰富的手内操作,带来了高保真数据采集、多技能学习和多模态感知融合等关键挑战。本文提出了一个综合框架来解决这些瓶颈,该框架由两个组成部分构成。首先,我们介绍了IMCopilot(手内操控助手),这是一个经过强化学习训练的原子技能套件,发挥双重作用:作为共享自主助手简化远程操作数据收集,并作为VLA的可调用低级执行原语。其次,我们提出了MoDE-VLA(混合灵巧专家VLA),一种将异构力和触觉模态无缝集成到预训练VLA骨干网络中的架构。通过利用残差注入机制,MoDE-VLA实现了接触感知的精细调整,而不会降低模型的预训练知识。我们在四个逐渐复杂的任务上验证了我们的方法,显示出在灵巧接触丰富任务中成功率相较基线提高了一倍。
cs.RO / 102 / 2603.08124

SaiVLA-0: Cerebrum-Pons-Cerebellum Tripartite Architecture for Compute-Aware Vision-Language-Action

SaiVLA-0:面向计算感知视觉-语言-动作的大脑-脑桥-小脑三分架构
Shi, Xiang, Huang, Wenlong, Zou, Menglin, Sun, Xinhai
Abstract
We revisit Vision-Language-Action through a neuroscience-inspired triad. Biologically, the Cerebrum provides stable high-level multimodal priors and remains frozen; the Pons Adapter integrates these cortical features with real-time proprioceptive inputs and compiles intent into execution-ready tokens; and the Cerebellum (ParaCAT) performs fast, parallel categorical decoding for online control, with hysteresis/EMA/temperature/entropy for stability. A fixed-ratio schedule and two-stage feature caching make the system compute-aware and reproducible. Inspired by active, foveated vision, our wrist ROIs are geometrically tied to the end-effector via calibrated projection, providing a movement-stabilized, high-resolution view that is sensitive to fine-grained pose changes and complements the global context of the main view. The design is modular: upgrading the Cerebrum only retrains the Pons; changing robots only trains the Cerebellum; cerebellum-only RL can further refine control without touching high-level semantics. As a concept-and-protocol paper with preliminary evidence, we outline a timing protocol under matched conditions (GPU, resolution, batch) to verify anticipated efficiency gains. We also report preliminary LIBERO evidence showing that split feature caching reduces training time (7.5h to 4.5h) and improves average success (86.5% to 92.5%) under official N1.5 head-only training, and that SaiVLA-0 reaches 99.0% mean success.
Chinese Translation
我们通过一个受神经科学启发的三元组重新审视视觉-语言-动作(Vision-Language-Action)。从生物学角度来看,大脑(Cerebrum)提供稳定的高层次多模态先验并保持不变;脑桥适配器(Pons Adapter)将这些皮层特征与实时本体感知输入结合,并将意图编译为可执行的令牌;小脑(Cerebellum,ParaCAT)则执行快速、并行的类别解码以实现在线控制,并利用滞后/指数移动平均(EMA)/温度/熵来保持稳定性。固定比例调度和两阶段特征缓存使系统具备计算感知能力和可重复性。受到主动注视(active, foveated vision)的启发,我们的腕部感兴趣区域(ROIs)通过校准投影与末端执行器几何关联,提供了一个运动稳定的高分辨率视图,能够对细微姿态变化敏感,并补充主视图的全局上下文。该设计是模块化的:升级大脑仅需重新训练脑桥;更换机器人仅需训练小脑;仅使用小脑的强化学习(RL)可以进一步优化控制,而无需涉及高层语义。作为一篇概念与协议论文,附带初步证据,我们概述了在匹配条件(GPU、分辨率、批量)下的时间协议,以验证预期的效率提升。我们还报告了初步的LIBERO证据,显示分割特征缓存减少了训练时间(从7.5小时降至4.5小时),并在官方N1.5头部训练下提高了平均成功率(从86.5%提升至92.5%),而SaiVLA0达到了99.0%的平均成功率。
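The hysteresis/EMA stabilization mentioned for the Cerebellum's categorical decoding can be sketched generically (illustrative smoothing and margin parameters, not the paper's):

```python
# Sketch of stabilized categorical decoding: EMA-smoothed logits plus
# hysteresis, so the decoded class only switches on a clear margin.
import numpy as np

class StableDecoder:
    def __init__(self, n_classes, ema=0.8, margin=0.15):
        self.smooth = np.zeros(n_classes)
        self.ema, self.margin = ema, margin
        self.current = 0
    def step(self, logits):
        self.smooth = self.ema * self.smooth + (1 - self.ema) * logits
        best = int(np.argmax(self.smooth))
        # Hysteresis: only switch if the challenger clearly beats the holder.
        if best != self.current and \
           self.smooth[best] - self.smooth[self.current] > self.margin:
            self.current = best
        return self.current

dec = StableDecoder(n_classes=3)
for logits in (np.array([1.0, 0.9, 0.0]),   # noisy near-tie: hold class 0
               np.array([0.2, 1.5, 0.0]),   # clear winner: switch to 1
               np.array([0.2, 1.5, 0.0])):
    print(dec.step(logits))
```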
cs.RO / 103 / 2603.08128

TRIAGE: Type-Routed Interventions via Aleatoric-Epistemic Gated Estimation in Robotic Manipulation and Adaptive Perception -- Don't Treat All Uncertainty the Same

TRIAGE:通过随机-认知门控估计在机器人操作和自适应感知中的类型路由干预——不要将所有不确定性视为相同
Kumar, Divake, Tayebati, Sina, Naik, Devashri, Poggi, Patrick, Rios, Amanda Sofie, Ahuja, Nilesh, Trivedi, Amit Ranjan
Abstract
Most uncertainty-aware robotic systems collapse prediction uncertainty into a single scalar score and use it to trigger uniform corrective responses. This aggregation obscures whether uncertainty arises from corrupted observations or from mismatch between the learned model and the true system dynamics. As a result, corrective actions may be applied to the wrong component of the closed loop, degrading performance relative to leaving the policy unchanged. We introduce a lightweight post hoc framework that decomposes uncertainty into aleatoric and epistemic components and uses these signals to regulate system responses at inference time. Aleatoric uncertainty is estimated from deviations in the observation distribution using a Mahalanobis density model, while epistemic uncertainty is detected using a noise robust forward dynamics ensemble that isolates model mismatch from measurement corruption. The two signals remain empirically near orthogonal during closed loop execution and enable type specific responses. High aleatoric uncertainty triggers observation recovery, while high epistemic uncertainty moderates control actions. The same signals also regulate adaptive perception by guiding model capacity selection during tracking inference. Experiments demonstrate consistent improvements across both control and perception tasks. In robotic manipulation, the decomposed controller improves task success from 59.4% to 80.4% under compound perturbations and outperforms a combined uncertainty baseline by up to 21.0%. In adaptive tracking inference on MOT17, uncertainty-guided model selection reduces average compute by 58.2% relative to a fixed high capacity detector while preserving detection quality within 0.4%. Code and demo videos are available at https://divake.github.io/uncertainty-decomposition/.
Chinese Translation
大多数关注不确定性的机器人系统将预测不确定性压缩为单一标量分数,并利用该分数触发统一的纠正响应。这种聚合模糊了不确定性是源于观测数据的损坏,还是源于学习模型与真实系统动态之间的不匹配。因此,纠正措施可能会应用于闭环的错误组件,从而使性能相较于不改变策略的情况下降。我们提出了一种轻量级的事后框架,将不确定性分解为随机不确定性(aleatoric)和认知不确定性(epistemic)两个组成部分,并利用这些信号在推理时调节系统响应。随机不确定性通过使用马哈拉诺比斯密度模型从观测分布的偏差中估计,而认知不确定性则通过噪声鲁棒的前向动力学集成来检测,该集成将模型不匹配与测量损坏隔离。在闭环执行过程中,这两个信号在经验上保持近乎正交,并能够实现特定类型的响应。高随机不确定性触发观测恢复,而高认知不确定性则调节控制动作。这些信号还通过在跟踪推理过程中指导模型容量选择来调节自适应感知。实验表明在控制和感知任务中均有一致的改善。在机器人操作中,分解控制器在复合扰动下将任务成功率从59.4%提高到80.4%,并且比结合不确定性基线提高了多达21.0%。在MOT17的自适应跟踪推理中,相较于固定高容量检测器,基于不确定性的模型选择将平均计算量减少了58.2%,同时保持检测质量在0.4%以内。代码和演示视频可在 https://divake.github.io/uncertainty-decomposition/ 获取。
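The two uncertainty signals can be instantiated with textbook estimators, Mahalanobis distance for the aleatoric score and forward-dynamics ensemble disagreement for the epistemic score, gating different responses (toy models and thresholds, not the TRIAGE code):

```python
# Sketch of the two-signal decomposition described above.
import numpy as np

rng = np.random.default_rng(0)
train_obs = rng.normal(size=(500, 4))
mu, cov_inv = train_obs.mean(0), np.linalg.inv(np.cov(train_obs.T))

def aleatoric_score(obs):
    d = obs - mu
    return float(np.sqrt(d @ cov_inv @ d))  # Mahalanobis distance

# Five forward-dynamics members with independent (toy) weights.
ensemble = [lambda s, a, w=rng.normal(size=(4, 6)): w @ np.concatenate([s, a])
            for _ in range(5)]

def epistemic_score(state, action):
    preds = np.stack([f(state, action) for f in ensemble])
    return float(preds.std(axis=0).mean())   # ensemble disagreement

def triage(obs, state, action, tau_a=3.5, tau_e=1.0):
    if aleatoric_score(obs) > tau_a:
        return "recover_observation"   # corrupted sensing: fix the input
    if epistemic_score(state, action) > tau_e:
        return "moderate_action"       # model mismatch: act conservatively
    return "nominal"

print(triage(rng.normal(size=4), rng.normal(size=4), rng.normal(size=2)))
```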
cs.RO / 104 / 2603.08131

UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing

UniGround:通过无训练场景解析实现通用的3D视觉定位
Zhang, Jiaxi, Wang, Yunheng, Lu, Wei, Wang, Taowen, Xu, Weisheng, Zhang, Shuning, Feng, Yixiao, Fang, Yuetong, Xu, Renjing
Abstract
Understanding and localizing objects in complex 3D environments from natural language descriptions, known as 3D Visual Grounding (3DVG), is a foundational challenge in embodied AI, with broad implications for robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that allows systems to locate arbitrary objects in a given scene. However, their reliance on pre-trained models constrains 3D perception and reasoning within the inherited knowledge boundaries, resulting in limited generalization to unseen spatial relationships and poor robustness to out-of-distribution scenes. In this paper, we replace this constrained perception with training-free visual and geometric reasoning, thereby unlocking open-world 3DVG that enables the localization of any object in any scene beyond the training data. Specifically, the proposed UniGround operates in two stages: a Global Candidate Filtering stage that constructs scene candidates through training-free 3D topology and multi-view semantic encoding, and a Local Precision Grounding stage that leverages multi-scale visual prompting and structured reasoning to precisely identify the target object. Experiments on ScanRefer and EmbodiedScan show that UniGround achieves 46.1%/34.1% [email protected]/0.5 on ScanRefer and 28.7% [email protected] on EmbodiedScan, establishing a new state-of-the-art among zero-shot methods on EmbodiedScan without any 3D supervision. We further evaluate UniGround in real-world environments under uncontrolled reconstruction conditions and substantial domain shift, showing training-free reasoning generalizes robustly beyond curated benchmarks.
Chinese Translation
从自然语言描述中理解和定位复杂3D环境中的物体,被称为3D视觉定位(3DVG),是具身人工智能中的一个基础性挑战,对机器人技术、增强现实和人机交互具有广泛的影响。大规模预训练基础模型在这一领域推动了显著进展,使得开放词汇的3DVG成为可能,从而使系统能够在给定场景中定位任意物体。然而,它们对预训练模型的依赖限制了3D感知和推理在继承知识边界内的能力,导致对未见空间关系的泛化能力有限,并且对分布外场景的鲁棒性较差。在本文中,我们用无训练的视觉和几何推理替代了这种受限的感知,从而解锁了开放世界的3DVG,使得在训练数据之外的任何场景中定位任意物体成为可能。具体而言,所提出的UniGround分为两个阶段:一个全局候选过滤阶段,通过无训练的3D拓扑和多视角语义编码构建场景候选;一个局部精确定位阶段,利用多尺度视觉提示和结构化推理精确识别目标物体。在ScanRefer和EmbodiedScan上的实验表明,UniGround在ScanRefer上达到了46.1%/34.1%的[email protected]/0.5,在EmbodiedScan上达到了28.7%的[email protected],在EmbodiedScan上无任何3D监督的零样本方法中确立了新的最先进水平。此外,我们在真实世界环境中对UniGround进行了评估,测试了在无控制重建条件和显著领域转移下的表现,显示出无训练推理在精心构建的基准之外仍具有稳健的泛化能力。
cs.RO / 105 / 2603.08136

POIROT: Investigating Direct Tangible vs. Digitally Mediated Interaction and Attitude Moderation in Multi-party Murder Mystery Games

POIROT:探讨多方谋杀谜题游戏中直接有形互动与数字媒介互动及态度调节
Chen, Wen, Chen, Rongxi, Chen, Shankai, Gong, Huiyang, Guo, Minghui, Xu, Yingri, Wu, Xintong, Fu, Xinyi
Abstract
As social robots take on increasingly complex roles like game masters (GMs) in multi-party games, the expectation that physicality universally enhances user experience remains debated. This study challenges the "one-size-fits-all" view of tangible interaction by identifying a critical boundary condition: users' Negative Attitudes towards Robots (NARS). In a between-subjects experiment (N = 67), a custom-built robot GM facilitated a multi-party murder mystery game (MMG) by delivering clues either through direct tangible interaction or a digitally mediated interface. Baseline multivariate analysis (MANOVA) showed no significant main effect of delivery modality, confirming that tangibility alone does not guarantee superior engagement. However, primary analysis using multilevel linear models (MLM) revealed a reliable moderation: participants high in NARS experienced markedly lower narrative immersion under tangible delivery, whereas those with low NARS scores showed no such decrement. Qualitative findings further illuminate this divergence: tangibility provides novelty and engagement for some but imposes excessive proxemic friction for anxious users, for whom the digital interface acts as a protective social buffer. These results advance a conditional model of HRI and emphasize the necessity for adaptive systems that can tailor interaction modalities to user predispositions.
Chinese Translation
随着社交机器人在多方游戏中承担越来越复杂的角色,如游戏主持人(GMs),关于物理性是否普遍增强用户体验的期望仍然存在争议。本研究通过识别一个关键的边界条件:用户对机器人的负面态度(NARS),挑战了“适用于所有”的有形互动观点。在一项被试间实验中(N = 67),一台定制的机器人GM通过直接有形互动或数字媒介界面提供线索,促进了一场多方谋杀谜题游戏(MMG)。基线多变量分析(MANOVA)显示交付方式的主效应无显著性,确认仅有物理性并不能保证更高的参与度。然而,使用多层线性模型(MLM)的主要分析揭示了可靠的调节效应:高NARS的参与者在有形交付下体验到明显较低的叙事沉浸感,而低NARS得分的参与者则没有这种下降。定性研究结果进一步阐明了这种差异:对某些用户而言,有形性提供了新奇感和参与感,但对焦虑的用户则施加了过度的亲密距离摩擦,数字界面则充当了保护性的社交缓冲。这些结果推动了人机互动(HRI)的条件模型的发展,并强调了适应性系统的必要性,这些系统能够根据用户的倾向调整互动方式。
cs.RO / 106 / 2603.08142

Multifingered force-aware control for humanoid robots

类人机器人的多指力感知控制
Marra, Pasquale, Caddeo, Gabriele M., Pattacini, Ugo, Natale, Lorenzo
Abstract
In this paper, we address force-aware control and force distribution in robotic platforms with multi-fingered hands. Given a target goal and force estimates from tactile sensors, we design a controller that adapts the motion of the torso, arm, wrist, and fingers, redistributing forces to maintain stable contact with objects of varying mass distribution or unstable contacts. To estimate forces, we collect a dataset of tactile signals and ground-truth force measurements using five Xela magnetic sensors interacting with indenters, and train force estimators. We then introduce a model-based control scheme that minimizes the distance between the Center of Pressure (CoP) and the centroid of the fingertips contact polygon. Since our method relies on estimated forces rather than raw tactile signals, it has the potential to be applied to any sensor capable of force estimation. We validate our framework on a balancing task with five objects, achieving an 82.7% success rate, and further evaluate it in multi-object scenarios, achieving 80% accuracy. Code and data can be found here https://github.com/hsp-iit/multifingered-force-aware-control.
Chinese Translation
本文探讨了在具有多指手的机器人平台上进行力感知控制和力分配的问题。给定目标目标和来自触觉传感器的力估计,我们设计了一种控制器,能够调整躯干、手臂、手腕和手指的运动,重新分配力以保持与不同质量分布或不稳定接触物体的稳定接触。为了估计力,我们收集了一个触觉信号和真实力测量的数据集,使用五个Xela磁传感器与压头进行交互,并训练力估计器。然后,我们引入了一种基于模型的控制方案,最小化压力中心(Center of Pressure, CoP)与指尖接触多边形质心之间的距离。由于我们的方法依赖于估计的力而非原始触觉信号,因此它有潜力应用于任何能够进行力估计的传感器。我们在五个物体的平衡任务上验证了我们的框架,成功率达到82.7%,并在多物体场景中进一步评估,准确率达到80%。代码和数据可以在此找到 https://github.com/hsp-iit/multifingered-force-aware-control。
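The quantity the controller minimizes follows directly from the abstract: the distance between the force-weighted Center of Pressure and the centroid of the fingertip contact polygon (variable names below are ours):

```python
# The objective described above: distance between the CoP and the centroid
# of the fingertip contact polygon, computed from estimated contact forces.
import numpy as np

def cop_objective(forces, positions):
    """forces: (n,) normal-force magnitudes from the force estimators;
    positions: (n, 3) fingertip contact points."""
    f = np.asarray(forces, dtype=float)
    p = np.asarray(positions, dtype=float)
    cop = (f[:, None] * p).sum(axis=0) / f.sum()   # force-weighted CoP
    centroid = p.mean(axis=0)
    return float(np.linalg.norm(cop - centroid))

forces = [1.0, 2.0, 0.5]                           # three fingers in contact
positions = [[0.00, 0.00, 0.0], [0.04, 0.00, 0.0], [0.02, 0.03, 0.0]]
print(cop_objective(forces, positions))            # controller drives this to 0
```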
cs.RO / 107 / 2603.08232

A General Lie-Group Framework for Continuum Soft Robot Modeling

连续软机器人建模的通用李群框架
Xun, Lingxiao, Rosa, Benoît, Szewczyk, Jérôme, Tamadazte, Brahim
Abstract
This paper introduces a general Lie group framework for modeling continuum soft robots, employing Cosserat rod theory combined with cumulative parameterization on the Lie group SE(3). This novel approach addresses limitations present in current strain-based and configuration-based methods by providing geometric local control and eliminating unit quaternion constraints. The paper derives unified analytical expressions for kinematics, statics, and dynamics, including recursive Jacobian computations and an energy-conserving integrator suitable for real-time simulation and control. Additionally, the framework is extended to handle complex robotic structures, including segmented, branched, nested, and rigid-soft composite configurations, facilitating a modular and unified modeling strategy. The effectiveness, generality, and computational efficiency of the proposed methodology are demonstrated through various scenarios, including large-deformation rods, concentric tube robots, parallel robots, cable-driven robots, and articulated fingers. This work enhances modeling flexibility and numerical performance, providing an improved toolset for designing, simulating, and controlling soft robotic systems.
Chinese Translation
本文介绍了一种用于建模连续软机器人的通用李群框架,该框架结合了Cosserat杆理论和李群SE(3)上的累积参数化。这种新颖的方法通过提供几何局部控制并消除单位四元数约束,解决了当前基于应变和基于配置方法的局限性。本文推导了运动学、静力学和动力学的统一解析表达式,包括递归雅可比计算和适用于实时仿真与控制的能量守恒积分器。此外,该框架扩展到处理复杂的机器人结构,包括分段、分支、嵌套和刚软复合配置,从而促进模块化和统一的建模策略。通过多种场景的验证,包括大变形杆、同心管机器人、并联机器人、缆驱动机器人和关节手指,展示了所提方法的有效性、通用性和计算效率。这项工作增强了建模的灵活性和数值性能,为设计、仿真和控制软机器人系统提供了改进的工具集。
cs.RO / 108 / 2603.08255

FlowTouch: View-Invariant Visuo-Tactile Prediction

FlowTouch:视图不变的视觉-触觉预测
Bien, Seongjin, Kneissl, Carlo, Jülg, Tobias, Fundel, Frank, Ressler-Antal, Thomas, Walter, Florian, Ommer, Björn, Kutyniok, Gitta, Burgard, Wolfram
Abstract
Tactile sensation is essential for contact-rich manipulation tasks. It provides direct feedback on object geometry, surface properties, and interaction forces, enhancing perception and enabling fine-grained control. An inherent limitation of tactile sensors is that readings are available only when an object is touched. This precludes their use during planning and the initial execution phase of a task. Predicting tactile information from visual information can bridge this gap. A common approach is to learn a direct mapping from camera images to the output of vision-based tactile sensors. However, the resulting model will depend strongly on the specific setup and on how well the camera can capture the area where an object is touched. In this work, we introduce FlowTouch, a novel model for view-invariant visuo-tactile prediction. Our key idea is to use an object's local 3D mesh to encode rich information for predicting tactile patterns while abstracting away from scene-dependent details. FlowTouch integrates scene reconstruction and Flow Matching-based models for image generation. Our results show that FlowTouch is able to bridge the sim-to-real gap and generalize to new sensor instances. We further show that the resulting tactile images can be used for downstream grasp stability prediction. Our code, datasets and videos are available at https://flowtouch.github.io/
Chinese Translation
触觉感知对于接触丰富的操作任务至关重要。它提供了关于物体几何形状、表面特性和交互力的直接反馈,增强了感知能力并实现了精细控制。触觉传感器的一个固有限制是,只有在物体被触摸时才能获取读数。这限制了它们在任务规划和初始执行阶段的使用。从视觉信息中预测触觉信息可以弥补这一空白。常见的方法是学习从相机图像到基于视觉的触觉传感器输出的直接映射。然而,所得到的模型将强烈依赖于特定的设置以及相机捕捉物体触摸区域的能力。在本研究中,我们提出了FlowTouch,一种新颖的视图不变的视觉-触觉预测模型。我们的关键思想是使用物体的局部3D网格来编码丰富的信息,以预测触觉模式,同时抽象掉场景依赖的细节。FlowTouch集成了场景重建和基于流匹配的图像生成模型。我们的结果表明,FlowTouch能够弥合模拟与现实之间的差距,并对新的传感器实例进行泛化。我们进一步展示了生成的触觉图像可用于下游抓取稳定性预测。我们的代码、数据集和视频可在 https://flowtouch.github.io/ 获取。
cs.RO / 109 / 2603.08260

Seed2Scale: A Self-Evolving Data Engine for Embodied AI via Small to Large Model Synergy and Multimodal Evaluation

Seed2Scale:一种通过小模型与大模型协同及多模态评估实现自我演化的具身人工智能数据引擎
Tai, Cong, Zheng, Zhaoyu, Long, Haixu, Wu, Hansheng, Long, Zhengbin, Xiang, Haodong, Shi, Rong, Cui, Zhuo, Zhang, Shizhuang, Qiu, Gang, Wang, He, Li, Ruifeng, Liu, Biao, Sun, Zhenzhe, Shen, Tao
Abstract
Existing data generation methods suffer from exploration limits, embodiment gaps, and low signal-to-noise ratios, leading to performance degradation during self-iteration. To address these challenges, we propose Seed2Scale, a self-evolving data engine that overcomes the data bottleneck through a heterogeneous synergy of "small-model collection, large-model evaluation, and target-model learning". Starting with as few as four seed demonstrations, the engine employs the lightweight Vision-Language-Action model, SuperTiny, as a dedicated collector, leveraging its strong inductive bias for robust exploration in parallel environments. Concurrently, a pre-trained Vision-Language Model is integrated as a Verifier to autonomously perform success/failure judgment and quality scoring for the massive generated trajectories. Seed2Scale effectively mitigates model collapse, ensuring the stability of the self-evolution process. Experimental results demonstrate that Seed2Scale exhibits significant scaling potential: as iterations progress, the success rate of the target model shows a robust upward trend, achieving a performance improvement of 131.2%. Furthermore, Seed2Scale significantly outperforms existing data augmentation methods, providing a scalable and cost-effective pathway for the large-scale development of Generalist Embodied AI. Project page: https://terminators2025.github.io/Seed2Scale.github.io
Chinese Translation
现有的数据生成方法存在探索限制、具身差距和低信噪比等问题,导致自我迭代过程中性能下降。为了解决这些挑战,我们提出了Seed2Scale,一种通过“小模型采集、大模型评估、目标模型学习”的异构协同来克服数据瓶颈的自我演化数据引擎。该引擎从仅四个种子演示开始,采用轻量级的视觉-语言-动作模型SuperTiny作为专用收集器,利用其强大的归纳偏差在并行环境中进行稳健的探索。同时,集成了预训练的视觉-语言模型作为验证器,能够自主执行成功/失败判断和对大量生成轨迹的质量评分。Seed2Scale有效减轻了模型崩溃,确保了自我演化过程的稳定性。实验结果表明,Seed2Scale展现出显著的扩展潜力:随着迭代的进行,目标模型的成功率呈现出稳健的上升趋势,性能提升达到131.2%。此外,Seed2Scale显著优于现有的数据增强方法,为具身人工智能的大规模开发提供了一条可扩展且具有成本效益的路径。项目页面:https://terminators2025.github.io/Seed2Scale.github.io
cs.RO / 110 / 2603.08269

SAIL: Test-Time Scaling for In-Context Imitation Learning with VLM

SAIL:基于视觉语言模型的上下文模仿学习的测试时间缩放
Sato, Makoto, Iwasawa, Yusuke, Tang, Yujin, Kuroki, So
Abstract
In-context imitation learning allows robots to acquire skills from demonstrations, yet one-shot trajectory generation remains fragile under environmental variation. We propose SAIL, a framework that reframes robot imitation as an iterative refinement problem capable of scaling with test-time compute. SAIL utilizes Monte Carlo Tree Search, where each node is a complete trajectory and edges correspond to trajectory refinements. The process is guided by three core components: an automated archive of successful trajectories for contextually relevant retrieval, a vision language model-based scoring mechanism for trajectory evaluation, and a step-level feedback that provides trajectory-aligned scores for iterative refinement. Experiments across six diverse manipulation tasks in simulation and real-world validation clearly demonstrate that increasing test-time compute consistently improves success rates, achieving up to 95% on complex tasks. Our results suggest that trajectory-level test-time scaling is a robust path toward more generalizable robotic agents.
Chinese Translation
上下文模仿学习使机器人能够从示范中获取技能,但在环境变化下,一次性轨迹生成仍然脆弱。我们提出了SAIL,一个将机器人模仿重新构建为一个能够根据测试时间计算进行迭代优化的问题的框架。SAIL利用蒙特卡洛树搜索,其中每个节点是一个完整的轨迹,边缘对应于轨迹的优化。该过程由三个核心组件指导:一个用于上下文相关检索的成功轨迹自动档案,一个基于视觉语言模型的轨迹评估评分机制,以及提供与轨迹对齐的分数以进行迭代优化的逐步反馈。在六个不同的操作任务的仿真和现实世界验证中进行的实验清楚地表明,增加测试时间计算始终提高成功率,在复杂任务中可达到95%。我们的结果表明,轨迹级的测试时间缩放是朝着更具普适性的机器人代理迈进的稳健路径。
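The search loop, where each node is a complete trajectory and expansion applies a refinement, can be compressed into a compact MCTS sketch (stub refine/score functions stand in for the VLM-based feedback and scoring; this is not the SAIL codebase):

```python
# Compact sketch of tree search over whole trajectories: selection by UCB,
# expansion by trajectory refinement, evaluation by a scoring stub.
import math, random

class Node:
    def __init__(self, traj, parent=None):
        self.traj, self.parent = traj, parent
        self.children, self.visits, self.value = [], 0, 0.0
    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + \
               c * math.sqrt(math.log(self.parent.visits) / self.visits)

def refine(traj):
    # Stand-in for feedback-guided refinement of a full trajectory.
    return [w + random.gauss(0, 0.1) for w in traj]

def score(traj):
    # Stand-in for the VLM scoring mechanism; here: closeness to target 1.0.
    return -sum((w - 1.0) ** 2 for w in traj)

def search(seed_traj, iters=200, width=3):
    root = Node(seed_traj)
    for _ in range(iters):
        node = root
        while node.children:                       # selection
            node = max(node.children, key=Node.ucb)
        if node.visits > 0:                        # expansion by refinement
            node.children = [Node(refine(node.traj), node) for _ in range(width)]
            node = node.children[0]
        v = score(node.traj)                       # evaluation
        while node:                                # backpropagation
            node.visits += 1
            node.value += v
            node = node.parent
    return max(root.children, key=lambda n: n.visits).traj

print(search([0.0, 0.0, 0.0]))
```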
cs.RO / 111 / 2603.08273

Less is More: Robust Zero-Communication 3D Pursuit-Evasion via Representational Parsimony

少即是多:通过表征简约实现稳健的零通信三维追逐-逃避
Ying, Jialin, Li, Zhihao, Dong, Zicheng, Wu, Guohua, Liao, Yihuan
Abstract
Asymmetric 3D pursuit-evasion in cluttered voxel environments is difficult under communication latency, partial observability, and nonholonomic maneuver limits. While many MARL methods rely on richer inter-agent coupling or centralized signals, these dependencies can become fragility sources when communication is delayed or noisy. Building on an inherited path-guided decentralized pursuit scaffold, we study a robustness-oriented question: can representational parsimony improve communication-free coordination? We instantiate this principle with (i) a parsimonious actor observation interface that removes team-coupled channels (83-D to 50-D), and (ii) Contribution-Gated Credit Assignment (CGCA), a locality-aware credit structure for communication-denied cooperation. In Stage-5 evaluation (4 pursuers vs. 1 evader), our configuration reaches 0.753 +/- 0.091 success and 0.223 +/- 0.066 collision, outperforming the 83-D FULL OBS counterpart (0.721 +/- 0.071, 0.253 +/- 0.089). It further shows graceful degradation under speed/yaw/noise/delay stress tests and resilient zero-shot transfer on urban-canyon maps (about 61% success at density 0.24). These results support a practical paradigm shift: explicitly severing redundant cross-agent channels can suppress compounding error cascades and improve robustness in latency-prone deployment.
Chinese Translation
在拥挤的体素环境中,非对称的三维追逐-逃避在通信延迟、部分可观测性和非完整机动限制下是困难的。虽然许多多智能体强化学习(MARL)方法依赖于更丰富的智能体间耦合或集中信号,但这些依赖关系在通信延迟或噪声情况下可能成为脆弱性来源。在继承的路径引导去中心化追逐框架的基础上,我们研究了一个以稳健性为导向的问题:表征简约能否改善无通信协调?我们通过以下方式实例化这一原则:(i) 一个简约的智能体观察接口,去除了团队耦合通道(从83维降至50维),以及(ii) 贡献门控信用分配(Contribution-Gated Credit Assignment, CGCA),一种适应局部性的信用结构,用于无通信的合作。在第五阶段评估中(4个追逐者对1个逃避者),我们的配置达到了0.753 +/- 0.091的成功率和0.223 +/- 0.066的碰撞率,优于83维完全可观测模型(0.721 +/- 0.071,0.253 +/- 0.089)。它在速度/偏航/噪声/延迟压力测试下表现出优雅的降级,并在城市峡谷地图上实现了稳健的零样本迁移(在密度0.24时约61%的成功率)。这些结果支持了一种实用的范式转变:明确切断冗余的跨智能体通道可以抑制复合误差级联,并提高在延迟敏感部署中的稳健性。
cs.RO / 112 / 2603.08324

EndoSERV: A Vision-based Endoluminal Robot Navigation System

EndoSERV:一种基于视觉的内腔机器人导航系统
Wu, Junyang, Xie, Fangfang, Zhang, Minghui, Zhang, Hanxiao, Sun, Jiayuan, Gu, Yun, Yang, Guang-Zhong
Abstract
Robot-assisted endoluminal procedures are increasingly used for early cancer intervention. However, the intricate, narrow and tortuous pathways within the luminal anatomy pose substantial difficulties for robot navigation. Vision-based navigation offers a promising solution, but existing localization approaches are error-prone due to tissue deformation, in vivo artifacts and a lack of distinctive landmarks for consistent localization. This paper presents a novel EndoSERV localization method to address these challenges. It includes two main parts, i.e., SEgment-to-structure and Real-to-Virtual mapping, and hence the name. For long-range and complex luminal structures, we divide them into smaller sub-segments and estimate the odometry independently. To address label insufficiency, an efficient transfer technique maps real image features to the virtual domain to use virtual pose ground truth. The training phases of EndoSERV include an offline pretraining to extract texture-agnostic features, and an online phase that adapts to real-world conditions. Extensive experiments based on both public and clinical datasets have been performed to demonstrate the effectiveness of the method even without any real pose labels.
Chinese Translation
机器人辅助的内腔手术在早期癌症干预中越来越多地被使用。然而,内腔解剖结构内复杂、狭窄且曲折的通道对机器人导航造成了重大困难。基于视觉的导航提供了一种有前景的解决方案,但现有的定位方法由于组织变形、体内伪影以及缺乏独特的地标以实现一致定位而容易出错。本文提出了一种新颖的EndoSERV定位方法以应对这些挑战。该方法包括两个主要部分,即 SEgment-to-structure 和 Real-to-Virtual 映射,因此得名。对于长距离和复杂的内腔结构,我们将其划分为更小的子段,并独立估计里程计。为了应对标签不足的问题,一种高效的转移技术将真实图像特征映射到虚拟域,以利用虚拟姿态真值。EndoSERV的训练阶段包括离线预训练以提取与纹理无关的特征,以及适应现实条件的在线阶段。基于公共和临床数据集进行了广泛的实验,以证明该方法的有效性,即使没有任何真实姿态标签。
cs.RO / 113 / 2603.08336

Hierarchical Multi-Modal Planning for Fixed-Altitude Sparse Target Search and Sampling

固定高度稀疏目标搜索与采样的层次多模态规划
Chen, Lingpeng, Zheng, Yuchen, Chui, Apple Pui-Yi, Wu, Junfeng, Hong, Ziyang
Abstract
Efficient monitoring of sparse benthic phenomena, such as coral colonies, presents a great challenge for Autonomous Underwater Vehicles. Traditional exhaustive coverage strategies are energy-inefficient, while recent adaptive sampling approaches rely on costly vertical maneuvers. To address these limitations, we propose HIMoS (Hierarchical Informative Multi-Modal Search), a fixed-altitude framework for sparse coral search-and-sample missions. The system integrates a heterogeneous sensor suite within a two-layer planning architecture. At the strategic level, a Global Planner optimizes topological routes to maximize potential discovery. At the tactical level, a receding-horizon Local Planner leverages differentiable belief propagation to generate kinematically feasible trajectories that balance acoustic substrate exploration, visual coral search, and close-range sampling. Validated in high-fidelity simulations derived from real-world coral reef benthic surveys, our approach demonstrates superior mission efficiency compared to state-of-the-art baselines.
Chinese Translation
高效监测稀疏的底栖现象,如珊瑚群落,给自主水下航行器带来了巨大挑战。传统的全面覆盖策略能量效率低,而最近的自适应采样方法依赖于昂贵的垂直机动。为了解决这些局限性,我们提出了HIMoS(层次信息多模态搜索),这是一个用于稀疏珊瑚搜索与采样任务的固定高度框架。该系统在两层规划架构中集成了异构传感器套件。在战略层面,全局规划器优化拓扑路线以最大化潜在发现。在战术层面,滚动时域局部规划器利用可微分信念传播生成运动学可行的轨迹,以平衡声学基底探索、视觉珊瑚搜索和近距离采样。我们的方案在基于真实世界珊瑚礁底栖调查的高保真模拟中得到了验证,显示出比最先进基线更优越的任务效率。
cs.RO / 114 / 2603.08342

PhaForce: Phase-Scheduled Visual-Force Policy Learning with Slow Planning and Fast Correction for Contact-Rich Manipulation

PhaForce:具有慢规划和快速修正的相位调度视觉-力策略学习,用于接触丰富的操控
Wang, Mingxin, Yue, Zhirun, Lu, Renhao, Li, Yizhe, Wang, Zihan, Pan, Guoping, Dong, Kangkang, Cheng, Jun, Cheng, Yi, Liu, Houde
Abstract
Contact-rich manipulation requires not only vision-dominant task semantics but also closed-loop reactions to force/torque (F/T) transients. Yet, generative visuomotor policies are typically constrained to low-frequency updates due to inference latency and action chunking, underutilizing F/T for control-rate feedback. Furthermore, existing force-aware methods often inject force continuously and indiscriminately, lacking an explicit mechanism to schedule when / how much / where to apply force across different task phases. We propose PhaForce, a phase-scheduled visual-force policy that coordinates low-rate chunk-level planning and high-rate residual correction via a unified contact/phase schedule. PhaForce comprises (i) a contact-aware phase predictor (CAP) that estimates contact probability and phase belief, (ii) a Slow diffusion planner that performs dual-gated visual-force fusion with orthogonal residual injection to preserve vision semantics while conditioning on force, and (iii) a Fast corrector that applies control-rate phase-routed residuals in interpretable corrective subspaces for within-chunk micro-adjustments. Across multiple real-robot contact-rich tasks, PhaForce achieves an average success rate of 86% (+40 pp over baselines), while also substantially improving contact quality by regulating interaction forces and exhibiting robust adaptability to OOD geometric shifts.
Chinese Translation
接触丰富的操控不仅需要以视觉为主的任务语义,还需要对力/扭矩(F/T)瞬态的闭环反应。然而,由于推理延迟和动作分块,生成式视觉运动策略通常受到低频更新的限制,未能充分利用F/T进行控制率反馈。此外,现有的力感知方法往往持续且无差别地施加力,缺乏明确的机制来调度在不同任务阶段何时、多少以及在哪里施加力。我们提出了PhaForce,一种相位调度的视觉-力策略,通过统一的接触/相位调度协调低频块级规划和高频残差修正。PhaForce包括(i)一个接触感知相位预测器(CAP),用于估计接触概率和相位信念;(ii)一个慢扩散规划器,执行双门控视觉-力融合,并通过正交残差注入来保持视觉语义,同时以力为条件;(iii)一个快速修正器,在可解释的修正子空间中应用控制率相位路由的残差,以进行块内微调。在多个真实机器人接触丰富任务中,PhaForce实现了86%的平均成功率(比基线提高40个百分点),同时通过调节交互力显著改善了接触质量,并展现出对OOD几何变化的强大适应性。
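The slow/fast split is the structural core of the method and can be sketched as a two-rate control loop. The chunk size, the dictionary-style observation, and the stub planner, phase predictor, and corrector below are assumptions used only to show the scheduling.

```python
# Two-rate loop: a diffusion planner replans every CHUNK steps (slow),
# while a phase-routed residual corrects every control step (fast).

CHUNK = 16                                   # assumed actions per slow plan

def run_episode(env, slow_planner, phase_predictor, fast_corrector, steps=400):
    obs = env.reset()
    chunk, k = None, CHUNK
    for _ in range(steps):
        if k == CHUNK:                       # low-rate, vision-dominant plan
            chunk = slow_planner(obs)        # list of CHUNK base actions
            k = 0
        # Contact probability and phase belief from force/torque readings.
        phase = phase_predictor(obs["force"])
        # High-rate correction: small residual routed by the phase estimate.
        action = chunk[k] + fast_corrector(obs["force"], phase)
        obs = env.step(action)
        k += 1
```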
cs.RO / 115 / 2603.08379

Perception-Aware Communication-Free Multi-UAV Coordination in the Wild

感知驱动的无通信多无人机协调在复杂环境中的应用
Boldrer, Manuel, Kamler, Michal, Ahmad, Afzal, Saska, Martin
Abstract
We present a communication-free method for safe multi-robot coordination in complex environments such as forests with dense canopy cover, where GNSS is unavailable. Our approach relies on an onboard anisotropic 3D LiDAR sensor used for SLAM as well as for detecting obstacles and neighboring robots. We develop a novel perception-aware 3D navigation framework that enables robots to safely and effectively progress toward a goal region despite limited sensor field-of-view. The approach is evaluated through extensive simulations across diverse scenarios and validated in real-world field experiments, demonstrating its scalability, robustness, and reliability.
Chinese Translation
我们提出了一种在复杂环境(如树冠密集的森林)中进行安全多机器人协调的无通信方法,在这些环境中,全球导航卫星系统(GNSS)不可用。我们的方法依赖于机载各向异性三维激光雷达(LiDAR)传感器,该传感器用于同步定位与地图构建(SLAM),同时用于检测障碍物和邻近机器人。我们开发了一种新颖的感知驱动的三维导航框架,使机器人能够在有限的传感器视场下安全有效地朝向目标区域前进。该方法通过在多种场景下进行广泛的仿真实验进行评估,并在真实世界的现场实验中得到了验证,展示了其可扩展性、鲁棒性和可靠性。
cs.RO / 116 / 2603.08383

MoMaStage: Skill-State Graph Guided Planning and Closed-Loop Execution for Long-Horizon Indoor Mobile Manipulation

MoMaStage:基于技能状态图的长期室内移动操控规划与闭环执行
Li, Chenxu, Chen, Zixuan, Li, Yetao, Xu, Jiapeng, Ding, Hongyu, Shi, Jieqi, Huo, Jing, Gao, Yang
Abstract
Indoor mobile manipulation (MoMA) enables robots to translate natural language instructions into physical actions, yet long-horizon execution remains challenging due to cascading errors and limited generalization across diverse environments. Learning-based approaches often fail to maintain logical consistency over extended horizons, while methods relying on explicit scene representations impose rigid structural assumptions that reduce adaptability in dynamic settings. To address these limitations, we propose MoMaStage, a structured vision-language framework for long-horizon MoMA that eliminates the need for explicit scene mapping. MoMaStage grounds a Vision-Language Model (VLM) within a Hierarchical Skill Library and a topology-aware Skill-State Graph, constraining task decomposition and skill composition within a feasible transition space. This structured grounding ensures that generated plans remain logically consistent and topologically valid with respect to the agent's evolving physical state. To enhance robustness, MoMaStage incorporates a closed-loop execution mechanism that monitors proprioceptive feedback and triggers graph-constrained semantic replanning when deviations are detected, maintaining alignment between planned skills and physical outcomes. Extensive experiments in physics-rich simulations and real-world environments demonstrate that MoMaStage outperforms state-of-the-art baselines, achieving substantially higher planning success, reducing token overhead, and significantly improving overall task success rates in long-horizon mobile manipulation. Video demonstrations are available on the project website: https://chenxuli-cxli.github.io/MoMaStage/.
Chinese Translation
室内移动操控(MoMA)使机器人能够将自然语言指令转化为物理动作,但由于级联错误和在多样环境中有限的泛化能力,长期执行仍然面临挑战。基于学习的方法往往无法在较长的时间范围内保持逻辑一致性,而依赖于显式场景表示的方法则施加了严格的结构假设,降低了在动态环境中的适应性。为了解决这些局限性,我们提出了MoMaStage,一个结构化的视觉-语言框架,旨在实现长期MoMA,消除了对显式场景映射的需求。MoMaStage将视觉-语言模型(VLM)与分层技能库和拓扑感知技能状态图相结合,限制了任务分解和技能组合在可行的过渡空间内。这种结构化的基础确保生成的计划在逻辑上保持一致,并在代理的不断变化的物理状态下拓扑有效。为了增强鲁棒性,MoMaStage结合了一个闭环执行机制,该机制监测本体反馈,并在检测到偏差时触发图约束的语义重新规划,从而保持计划技能与物理结果之间的一致性。在物理丰富的仿真和现实环境中进行的大量实验表明,MoMaStage的表现优于最先进的基线,显著提高了规划成功率,减少了令牌开销,并显著提高了长期移动操控中的整体任务成功率。视频演示可在项目网站上查看:https://chenxuli-cxli.github.io/MoMaStage/
cs.RO / 117 / 2603.08390

StructBiHOI: Structured Articulation Modeling for Long-Horizon Bimanual Hand-Object Interaction Generation

StructBiHOI:用于长时间跨度双手物体交互生成的结构化关节建模
Wang, Zhi, Liu, Liu, Liu, Ruonan, Guo, Dan, Wang, Meng
Abstract
Recent progress in 3D hand-object interaction (HOI) generation has primarily focused on single-hand grasp synthesis, while bimanual manipulation remains significantly more challenging. Long-horizon planning instability, fine-grained joint articulation, and complex cross-hand coordination make coherent bimanual generation difficult, especially under multimodal conditions. Existing approaches often struggle to simultaneously ensure temporal consistency, physical plausibility, and semantic alignment over extended sequences. We propose StructBiHOI, a Structured articulation modeling framework for long-horizon Bimanual HOI generation. Our key insight is to structurally disentangle temporal joint planning from frame-level manipulation refinement. Specifically, a jointVAE models long-term joint evolution conditioned on object geometry and task semantics, while a maniVAE refines fine-grained hand poses at the single-frame level. To enable stable and efficient long-sequence generation, we incorporate a state-space-inspired diffusion denoiser based on Mamba, which models long-range dependencies with linear complexity. This hierarchical design facilitates coherent dual-hand coordination and articulated object interaction. Extensive experiments on bimanual manipulation and single-hand grasping benchmarks demonstrate that our method achieves superior long-horizon stability, motion realism, and computational efficiency compared to strong baselines.
Chinese Translation
近年来,3D手-物体交互(HOI)生成的进展主要集中在单手抓取合成上,而双手操作仍然面临显著更大的挑战。长时间跨度规划的不稳定性、细粒度的关节运动以及复杂的跨手协调使得在多模态条件下进行连贯的双手生成变得困难。现有的方法往往难以在扩展序列中同时确保时间一致性、物理合理性和语义对齐。我们提出了StructBiHOI,一个用于长时间跨度双手HOI生成的结构化关节建模框架。我们的关键见解是将时间关节规划与帧级操作细化结构性地分离。具体而言,jointVAE模型基于物体几何形状和任务语义对长期关节演变进行建模,而maniVAE则在单帧级别上细化细粒度的手部姿态。为了实现稳定和高效的长序列生成,我们结合了基于Mamba的状态空间启发式扩散去噪器,该去噪器以线性复杂度建模长范围依赖关系。这种分层设计促进了连贯的双手协调和铰接物体交互。在双手操作和单手抓取基准测试上的广泛实验表明,我们的方法在长时间跨度稳定性、运动真实感和计算效率方面优于强基线。
cs.RO / 118 / 2603.08420

Human-Aware Robot Behaviour in Self-Driving Labs

人机共存的自驾实验室中的机器人行为
Veeramani, Satheeshkumar, Kisil, Anna, Bentley, Abigail, Fakhruldeen, Hatem, Pizzuto, Gabriella, Cooper, Andrew I.
Abstract
Self-driving laboratories (SDLs) are rapidly transforming research in chemistry and materials science to accelerate new discoveries. Mobile robot chemists (MRCs) play a pivotal role by autonomously navigating the lab to transport samples, effectively connecting synthesis, analysis, and characterisation equipment. The instruments within an SDL are typically designed or retrofitted to be accessed by both human and robotic chemists, ensuring operational flexibility and integration between manual and automated workflows. In many scenarios, human and robotic chemists may need to use the same equipment simultaneously. Currently, MRCs rely on simple LiDAR-based obstruction detection, which forces the robot to passively wait if a human is present. This lack of situational awareness leads to unnecessary delays and inefficient coordination in time-critical automated workflows in human-robot shared labs. To address this, we present an initial study of an embodied, AI-driven perception method that facilitates proactive human-robot interaction in shared-access scenarios. Our method features a hierarchical human intention prediction model that allows the robot to distinguish between preparatory actions (waiting) and transient interactions (accessing the instrument). Our results demonstrate that the proposed approach enhances efficiency by enabling proactive human-robot interaction, streamlining coordination, and potentially increasing the efficiency of autonomous scientific labs.
Chinese Translation
自驾实验室(SDLs)正在迅速改变化学和材料科学的研究,以加速新发现的进程。移动机器人化学家(MRCs)通过自主导航实验室以运输样品,发挥了关键作用,有效连接合成、分析和表征设备。SDL中的仪器通常设计或改装为可供人类和机器人化学家共同使用,以确保操作灵活性和手动与自动工作流程之间的整合。在许多场景中,人类和机器人化学家可能需要同时使用同一设备。目前,MRCs依赖简单的基于激光雷达(LiDAR)的障碍物检测,这迫使机器人在有人的情况下被动等待。这种缺乏情境意识的情况导致了不必要的延误和在时间敏感的自动化工作流程中人机协调的低效。为了解决这一问题,我们提出了一项关于具身的、基于人工智能的感知方法的初步研究,该方法促进了共享访问场景中的主动人机交互。我们的方法具有一个分层的人类意图预测模型,使机器人能够区分准备性动作(等待)和瞬时交互(访问仪器)。我们的结果表明,所提出的方法通过促进主动的人机交互、简化协调,潜在地提高了自主科学实验室的效率。
cs.RO / 119 / 2603.08423

Tactile Recognition of Both Shapes and Materials with Automatic Feature Optimization-Enabled Meta Learning

基于自动特征优化的元学习实现形状和材料的触觉识别
Zhao, Hongliang, Yang, Wenhui, Chen, Yang, Wang, Zhuorui, Liu, Baiheng, Qin, Longhui
Abstract
Tactile perception is indispensable for robots to implement various manipulations dexterously, especially in contact-rich scenarios. However, alongside the development of deep learning techniques, it suffers in practice from training data scarcity and a time-consuming learning process, since collecting a large amount of tactile data is costly and sometimes even impossible. Hence, we propose an automatic feature optimization-enabled prototypical network to realize meta-learning, i.e., the AFOP-ML framework. As a "learn to learn" network, it not only adapts to new unseen classes rapidly with few shots, but also learns how to determine the optimal feature space automatically. Based on the four-channel signals acquired from a tactile finger, both shapes and materials are recognized. On a 36-category benchmark, it outperforms several existing approaches by attaining an accuracy of 96.08% in the 5-way-1-shot scenario, where only 1 example is available for training. It still remains 88.7% in the extreme 36-way-1-shot case. The generalization ability is further validated through three groups of experiments involving unseen shapes, materials and force/speed perturbations. This work additionally provides insights for the interpretation of recognition tasks and the improved design of tactile sensors.
Chinese Translation
触觉感知对于机器人灵活实施各种操作至关重要,尤其是在接触丰富的场景中。然而,随着深度学习技术的发展,触觉感知在实际应用中面临训练数据稀缺和学习过程耗时的问题,因为收集大量触觉数据既昂贵又有时甚至不可能。因此,我们提出了一种基于自动特征优化的原型网络以实现元学习,即AFOP-ML框架。作为一种“学习如何学习”的网络,它不仅能够在少量样本的情况下快速适应新的未见类别,还能自动学习如何确定最佳特征空间。基于从触觉手指获取的四通道信号,能够识别形状和材料。在一个36类的基准测试中,它在5-way-1-shot场景下的准确率达到96.08%,优于多种现有方法,其中仅提供1个示例进行训练。在极端的36-way-1-shot情况下,准确率仍保持在88.7%。通过涉及未见形状、材料和力/速度扰动的三组实验进一步验证了其泛化能力。此外,本研究还为识别任务的解释和触觉传感器的改进设计提供了更多见解。
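The prototypical-network core that the framework builds on is compact enough to show end to end. The toy 5-way-1-shot episode below uses random 64-d vectors in place of embeddings of the four-channel tactile signals; the paper's automatic feature-space optimization is not reproduced here.

```python
import numpy as np

def prototypes(support_feats, support_labels, n_way):
    # One prototype per class: the mean of that class's support embeddings.
    return np.stack([support_feats[support_labels == c].mean(axis=0)
                     for c in range(n_way)])

def classify(query_feats, protos):
    # Assign each query to its nearest prototype (squared Euclidean distance).
    d = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# Toy 5-way-1-shot episode with stand-in 64-d embeddings.
rng = np.random.default_rng(0)
support = rng.normal(size=(5, 64))               # one shot per class
protos = prototypes(support, np.arange(5), n_way=5)
queries = support + 0.1 * rng.normal(size=(5, 64))
print(classify(queries, protos))                 # expected: [0 1 2 3 4]
```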
cs.RO / 120 / 2603.08433

FoMo: A Multi-Season Dataset for Robot Navigation in Forêt Montmorency

FoMo:蒙特莫朗西森林中的机器人导航多季节数据集
Boxan, Matěj, Jeanson, Gabriel, Krawciw, Alexander, Daum, Effie, Qiao, Xinyuan, Lilge, Sven, Barfoot, Timothy D., Pomerleau, François
Abstract
The Forêt Montmorency (FoMo) dataset is a comprehensive multi-season data collection, recorded over the span of one year in a boreal forest. Featuring a unique combination of on- and off-pavement environments with significant environmental changes, the dataset challenges established odometry and SLAM pipelines. Some highlights of the data include the accumulation of snow exceeding 1 m, significant vegetation growth in front of sensors, and operations at the traction limits of the platform. In total, the FoMo dataset includes over 64 km of six diverse trajectories, repeated during 12 deployments throughout the year. The dataset features data from one rotating and one hybrid solid-state lidar, a Frequency Modulated Continuous Wave (FMCW) radar, full-HD images from a stereo camera and a wide lens monocular camera, as well as data from two IMUs. Ground truth is calculated by post-processing three GNSS receivers mounted on the Uncrewed Ground Vehicle (UGV) and a static GNSS base station. Additional metadata, such as one measurement per minute from an on-site weather station, camera calibration intrinsics, and vehicle power consumption, is available for all sequences. To highlight the relevance of the dataset, we performed a preliminary evaluation of the robustness of lidar-inertial, radar-gyro, and visual-inertial localization and mapping techniques to seasonal changes. We show that seasonal changes have serious effects on the re-localization capabilities of the state-of-the-art methods. The dataset and development kit are available at https://fomo.norlab.ulaval.ca.
Chinese Translation
蒙特莫朗西森林(FoMo)数据集是一个全面的多季节数据收集,记录了在一个北方森林中为期一年的数据。该数据集具有独特的路面和非路面环境组合,并伴随显著的环境变化,挑战了现有的里程计和同步定位与地图构建(SLAM)流程。数据的一些亮点包括积雪超过1米、传感器前方的显著植被生长,以及在平台牵引极限下的操作。总的来说,FoMo数据集包含超过64公里的六条多样化轨迹,这些轨迹在一年内进行了12次部署。数据集包含来自一个旋转激光雷达和一个混合固态激光雷达的数据、频率调制连续波(FMCW)雷达、立体相机和广角单目相机的全高清图像,以及来自两个惯性测量单元(IMU)的数据。地面真实值通过对安装在无人地面车辆(UGV)上的三个全球导航卫星系统(GNSS)接收器和一个静态GNSS基站进行后处理计算得出。所有序列都提供额外的元数据,例如来自现场气象站的每分钟一次的测量、相机标定内参和车辆功耗。为了突出数据集的相关性,我们对激光雷达-惯性、雷达-陀螺仪和视觉-惯性定位与地图构建技术在季节变化下的鲁棒性进行了初步评估。我们展示了季节变化对最先进方法的重新定位能力产生了严重影响。数据集和开发工具包可在 https://fomo.norlab.ulaval.ca 获取。
cs.RO / 121 / 2603.08457

Adaptive Entropy-Driven Sensor Selection in a Camera-LiDAR Particle Filter for Single-Vessel Tracking

基于自适应熵驱动的传感器选择在相机-激光雷达粒子滤波器中的应用:单船追踪
Starodubov, Andrei, Prabowo, Yaqub Aris, Hadjipieris, Andreas, Kyriakides, Ioannis, Galeazzi, Roberto
Abstract
Robust single-vessel tracking from fixed coastal platforms is hindered by modality-specific degradations: cameras suffer from illumination and visual clutter, while LiDAR performance drops with range and intermittent returns. We present a heterogeneous multi-sensor fusion particle-filter tracker that incorporates an information-gain (entropy-reduction) adaptive sensing policy to select the most informative configuration at each fusion time bin. The approach is validated in a real maritime deployment at the CMMI Smart Marina Testbed (Ayia Napa Marina, Cyprus), using a shore-mounted 3D LiDAR and an elevated fixed camera to track a rigid inflatable boat with onboard GNSS ground truth. We compare LiDAR-only, camera-only, all-sensors, and adaptive configurations. Results show LiDAR dominates near-field accuracy, the camera sustains longer-range coverage when LiDAR becomes unavailable, and the adaptive policy achieves a favorable accuracy-continuity trade-off by switching modalities based on information gain. By avoiding continuous multi-stream processing, the adaptive configuration provides a practical baseline for resilient and resource-aware maritime surveillance.
Chinese Translation
从固定沿海平台进行稳健的单船追踪受到特定模态退化的影响:相机受限于光照和视觉杂乱,而激光雷达的性能随着距离和间歇性返回而下降。我们提出了一种异构多传感器融合粒子滤波追踪器,该追踪器结合了信息增益(熵减少)自适应感知策略,以在每个融合时间窗口选择最具信息量的配置。该方法在塞浦路斯阿依亚纳帕的CMMI智能码头测试平台进行的实际海洋部署中得到了验证,使用岸边安装的3D激光雷达和高架固定相机追踪一艘配备GNSS地面真实数据的刚性充气船。我们比较了仅使用激光雷达、仅使用相机、全传感器和自适应配置。结果表明,激光雷达在近场精度上占主导地位,而相机在激光雷达不可用时维持了更长的覆盖范围,自适应策略通过基于信息增益切换模态,实现了良好的精度与连续性的权衡。通过避免持续的多流处理,自适应配置为稳健且资源感知的海洋监视提供了一个实用的基线。
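The information-gain policy reduces to: for each candidate sensor, estimate how much a simulated measurement would sharpen the particle weights, then pick the largest expected entropy reduction. The toy below uses 1-D Gaussian range sensors as stand-ins for the LiDAR and camera likelihood models.

```python
import numpy as np

def weight_entropy(w):
    w = w / w.sum()
    return float(-(w * np.log(w + 1e-12)).sum())

class GaussianRange:
    """Toy 1-D range sensor with fixed noise (illustrative likelihood)."""
    def __init__(self, sigma):
        self.sigma = sigma
    def sample(self, x, rng):
        return x + rng.normal(0.0, self.sigma)
    def pdf(self, z, xs):
        return np.exp(-0.5 * ((z - xs) / self.sigma) ** 2)

def expected_posterior_entropy(xs, w, sensor, n_sim=20, seed=1):
    # Monte Carlo: draw a state from the belief, simulate a measurement,
    # reweight the particles, and average the resulting entropy.
    rng = np.random.default_rng(seed)
    p = w / w.sum()
    ent = 0.0
    for _ in range(n_sim):
        z = sensor.sample(xs[rng.choice(len(xs), p=p)], rng)
        ent += weight_entropy(w * sensor.pdf(z, xs))
    return ent / n_sim

def select_sensor(xs, w, sensors):
    prior = weight_entropy(w)
    gain = {name: prior - expected_posterior_entropy(xs, w, s)
            for name, s in sensors.items()}
    return max(gain, key=gain.get)

xs = np.random.default_rng(2).normal(0.0, 5.0, size=500)     # particle states
w = np.ones(500)
print(select_sensor(xs, w, {"lidar": GaussianRange(0.1),
                            "camera": GaussianRange(1.0)}))  # -> lidar
```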
cs.RO / 122 / 2603.08475

R2F: Repurposing Ray Frontiers for LLM-free Object Navigation

R2F:重新利用光线前沿进行无LLM的物体导航
Argenziano, Francesco, Marcelo, John Mark Alexis, Brienza, Michele, Drid, Abdel Hakim, Musumeci, Emanuele, Nardi, Daniele, Bloisi, Domenico D., Suriani, Vincenzo
Abstract
Zero-shot open-vocabulary object navigation has progressed rapidly with the emergence of large Vision-Language Models (VLMs) and Large Language Models (LLMs), now widely used as high-level decision-makers instead of end-to-end policies. Although effective, such systems often rely on iterative large-model queries at inference time, introducing latency and computational overhead that limit real-time deployment. To address this problem, we repurpose ray frontiers (R2F), a recently proposed frontier-based exploration paradigm, to develop an LLM-free framework for indoor open-vocabulary object navigation. While ray frontiers were originally used to bias exploration using semantic cues carried along rays, we reinterpret frontier regions as explicit, direction-conditioned semantic hypotheses that serve as navigation goals. Language-aligned features accumulated along out-of-range rays are stored sparsely at frontiers, where each region maintains multiple directional embeddings encoding plausible unseen content. In this way, navigation then reduces to embedding-based frontier scoring and goal tracking within a classical mapping and planning pipeline, eliminating iterative large-model reasoning. We further introduce R2F-VLN, a lightweight extension for free-form language instructions using syntactic parsing and relational verification without additional VLM or LLM components. Experiments in Habitat-sim and on a real robotic platform demonstrate competitive state-of-the-art zero-shot performance with real-time execution, achieving up to 6 times faster runtime than VLM-based alternatives.
Chinese Translation
零样本开放词汇物体导航随着大型视觉-语言模型(VLMs)和大型语言模型(LLMs)的出现而迅速发展,现在广泛用作高层决策者,而不是端到端策略。尽管有效,这类系统在推理时通常依赖于迭代的大模型查询,导致延迟和计算开销,从而限制了实时部署。为了解决这个问题,我们重新利用光线前沿(R2F),一种最近提出的基于前沿的探索范式,开发了一种无LLM的室内开放词汇物体导航框架。虽然光线前沿最初用于利用沿光线传播的语义线索来偏向探索,但我们将前沿区域重新解释为明确的、方向条件的语义假设,作为导航目标。沿着超出范围的光线积累的语言对齐特征在前沿处稀疏存储,每个区域保持多个方向嵌入,编码可能的未见内容。通过这种方式,导航简化为基于嵌入的前沿评分和目标跟踪,嵌入于经典的映射和规划流程中,消除了迭代的大模型推理。我们进一步引入R2F-VLN,一种轻量级扩展,支持自由形式的语言指令,使用句法解析和关系验证,而无需额外的VLM或LLM组件。在Habitat-sim和真实机器人平台上的实验表明,R2F在实时执行中展现出具有竞争力的最先进的零样本性能,运行速度比基于VLM的替代方案快达6倍。
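The frontier-scoring step that the abstract reduces navigation to is a few lines once the embeddings exist. The sketch below assumes language-aligned embeddings from a pre-trained vision-language encoder (CLIP-style); here they are random placeholders, with one direction deliberately planted near the goal.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def score_frontiers(goal_emb, frontiers):
    # frontiers: {frontier_id: (n_directions, d) array of directional embeddings}
    scores = {fid: max(cosine(goal_emb, e) for e in embs)
              for fid, embs in frontiers.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
goal = rng.normal(size=512)                        # stand-in text embedding
frontiers = {f"F{i}": rng.normal(size=(4, 512)) for i in range(3)}
frontiers["F1"][2] = goal + 0.1 * rng.normal(size=512)   # plant a good match
best, _ = score_frontiers(goal, frontiers)
print(best)                                        # -> F1
```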
cs.RO / 123 / 2603.08476

LAR-MoE: Latent-Aligned Routing for Mixture of Experts in Robotic Imitation Learning

LAR-MoE:用于机器人模仿学习的潜在对齐路由专家混合模型
Rodriguez, Ariel, Li, Chenpan, Mazza, Lorenzo, Younis, Rayan, Hellig, Ortrun, Bodenstedt, Sebastian, Wagner, Martin, Speidel, Stefanie
Abstract
Imitation learning enables robots to acquire manipulation skills from demonstrations, yet deploying a policy across tasks with heterogeneous dynamics remains challenging, as models tend to average over distinct behavioral modes present in the demonstrations. Mixture-of-Experts (MoE) architectures address this by activating specialized subnetworks, but require meaningful skill decompositions for expert routing. We introduce Latent-Aligned Routing for Mixture of Experts (LAR-MoE), a two-stage framework that decouples unsupervised skill discovery from policy learning. In pre-training, we learn a joint latent representation between observations and future actions through student-teacher co-training. In a post-training stage, the expert routing is regularized to follow the structure of the learned latent space, preventing expert collapse while maintaining parameter efficiency. We evaluate LAR-MoE in simulation and on hardware. On the LIBERO benchmark, our method achieves a 95.2% average success rate with 150M parameters. On a surgical bowel grasping and retraction task, LAR-MoE matches a supervised MoE baseline without requiring any phase annotations, and transfers zero-shot to ex vivo porcine tissue. Our findings suggest that latent-aligned routing provides a principled alternative to supervised skill decomposition, enabling structured expert specialization from unlabeled demonstrations.
Chinese Translation
模仿学习使机器人能够从示范中获取操作技能,但在具有异构动态的任务中部署策略仍然具有挑战性,因为模型往往会对示范中存在的不同行为模式进行平均。专家混合模型(Mixture-of-Experts, MoE)架构通过激活专门的子网络来解决这一问题,但需要有意义的技能分解以进行专家路由。我们提出了用于专家混合模型的潜在对齐路由(Latent-Aligned Routing for Mixture of Experts, LAR-MoE),这是一个两阶段框架,将无监督技能发现与策略学习解耦。在预训练阶段,我们通过师生共同训练学习观察与未来动作之间的联合潜在表示。在后训练阶段,专家路由被正则化以遵循学习到的潜在空间的结构,从而防止专家崩溃,同时保持参数效率。我们在仿真和硬件上评估了LAR-MoE。在LIBERO基准测试中,我们的方法在150M参数下实现了95.2%的平均成功率。在外科肠道抓取和回缩任务中,LAR-MoE与一个监督的MoE基线相匹配,而无需任何阶段注释,并且能够零样本迁移到离体猪组织。我们的研究结果表明,潜在对齐路由为监督技能分解提供了一种原则性的替代方案,使得从未标记的示范中实现结构化专家专业化成为可能。
cs.RO / 124 / 2603.08478

STRIDE: Structured Lagrangian and Stochastic Residual Dynamics via Flow Matching

STRIDE:通过流匹配实现结构化拉格朗日和随机残差动力学
Kotecha, Prakrut, B, Ganga Nair, Kolathaya, Shishir
Abstract
Robotic systems in unstructured environments must operate under significant uncertainty arising from intermittent contacts, frictional variability, and unmodeled compliance. While recent model-free approaches have demonstrated impressive performance, many deployment settings still require predictive models that support planning, constraint handling, and online adaptation. Analytical rigid-body models provide strong physical structure but often fail to capture complex interaction effects, whereas purely data-driven models may violate physical consistency, exhibit data bias, and accumulate long-horizon drift. In this work, we propose STRIDE, a dynamics learning framework that explicitly separates conservative rigid-body mechanics from uncertain, effectively stochastic non-conservative interaction effects. The structured component is modeled using a Lagrangian Neural Network (LNN) to preserve energy-consistent inertial dynamics, while residual interaction forces are represented using Conditional Flow Matching (CFM) to capture multi-modal interaction phenomena. The two components are trained jointly end-to-end, enabling the model to retain physical structure while representing complex stochastic behavior. We evaluate STRIDE on systems of increasing complexity, including a pendulum, the Unitree Go1 quadruped, and the Unitree G1 humanoid. Results show a 20% reduction in long-horizon prediction error and a 30% reduction in contact force prediction error compared to deterministic residual baselines, supporting more reliable model-based control in uncertain robotic environments.
Chinese Translation
在非结构化环境中操作的机器人系统必须在间歇性接触、摩擦变异和未建模的顺应性所带来的显著不确定性下运行。尽管最近的无模型方法已展示出令人印象深刻的性能,但许多部署场景仍然需要支持规划、约束处理和在线适应的预测模型。分析性的刚体模型提供了强大的物理结构,但往往无法捕捉复杂的交互效应,而纯粹的数据驱动模型可能会违反物理一致性,表现出数据偏差,并积累长期漂移。在本研究中,我们提出了STRIDE,一个动力学学习框架,明确将保守的刚体力学与不确定的、有效的随机非保守交互效应分离。结构化组件使用拉格朗日神经网络(LNN)建模,以保持能量一致的惯性动力学,而残余交互力则使用条件流匹配(CFM)来捕捉多模态交互现象。这两个组件通过端到端的联合训练,使模型在表示复杂的随机行为的同时保留物理结构。我们在复杂性逐渐增加的系统上评估STRIDE,包括摆、Unitree Go1四足机器人和Unitree G1人形机器人。结果显示,与确定性残差基线相比,长期预测误差减少了20%,接触力预测误差减少了30%,支持在不确定的机器人环境中更可靠的基于模型的控制。
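The structured split amounts to one integration step: a learned Lagrangian supplies the conservative terms, and a conditional flow model samples the non-conservative residual. The `lnn` and `cfm` objects below are hypothetical stand-ins; only the composition is the point, and semi-implicit Euler is our choice of integrator, not necessarily the paper's.

```python
import numpy as np

# One dynamics step under M(q) qdd = tau + f_cons(q, qd) + f_res, where the
# conservative terms come from a Lagrangian Neural Network and f_res is a
# sample from a conditional-flow-matching model (multi-modal contact effects).

def stride_step(q, qd, tau, dt, lnn, cfm, context):
    M = lnn.mass_matrix(q)                     # energy-consistent inertia
    f_cons = lnn.conservative_forces(q, qd)    # Coriolis/centrifugal + gravity
    f_res = cfm.sample(q, qd, context)         # stochastic residual forces
    qdd = np.linalg.solve(M, tau + f_cons + f_res)
    qd_next = qd + dt * qdd                    # semi-implicit Euler step
    q_next = q + dt * qd_next
    return q_next, qd_next
```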
cs.RO / 125 / 2603.08485

3PoinTr: 3D Point Tracks for Robot Manipulation Pretraining from Casual Videos

3PoinTr:用于机器人操作预训练的3D点轨迹来自随意视频
Hung, Adam, Duisterhof, Bardienus Pieter, Ichnowski, Jeffrey
Abstract
Data-efficient training of robust robot policies is the key to unlocking automation in a wide array of novel tasks. Current systems require large volumes of demonstrations to achieve robustness, which is impractical in many applications. Learning policies directly from human videos is a promising alternative that removes teleoperation costs, but it shifts the challenge toward overcoming the embodiment gap (differences in kinematics and strategies between robots and humans), often requiring restrictive and carefully choreographed human motions. We propose 3PoinTr, a method for pretraining robot policies from casual and unconstrained human videos, enabling learning from motions natural for humans. 3PoinTr uses a transformer architecture to predict 3D point tracks as an intermediate embodiment-agnostic representation. 3D point tracks encode goal specifications, scene geometry, and spatiotemporal relationships. We use a Perceiver IO architecture to extract a compact representation for sample-efficient behavior cloning, even when point tracks violate downstream embodiment-specific constraints. We conduct thorough evaluation on simulated and real-world tasks, and find that 3PoinTr achieves robust spatial generalization on diverse categories of manipulation tasks with only 20 action-labeled robot demonstrations. 3PoinTr outperforms the baselines, including behavior cloning methods, as well as prior methods for pretraining from human videos. We also provide evaluations of 3PoinTr's 3D point track predictions compared to an existing point track prediction baseline. We find that 3PoinTr produces more accurate and higher quality point tracks due to a lightweight yet expressive architecture built on a single transformer, in addition to a training formulation that preserves supervision of partially occluded points. Project page: https://adamhung60.github.io/3PoinTr/.
Chinese Translation
以数据高效的方式训练稳健的机器人策略,是在各种新任务中实现自动化的关键。目前的系统需要大量的示范才能达到稳健性,这在许多应用中是不切实际的。从人类视频中直接学习策略是一种有前景的替代方案,它消除了远程操作的成本,但将挑战转向克服具身差距(机器人与人类之间的运动学和策略差异),通常需要限制性和精心编排的人类动作。我们提出了3PoinTr,一种从随意和不受限制的人类视频中预训练机器人策略的方法,使得能够从人类自然的动作中学习。3PoinTr使用变换器架构来预测3D点轨迹,作为一种中间的与具身无关的表示。3D点轨迹编码了目标规格、场景几何和时空关系。我们使用Perceiver IO架构提取紧凑的表示,以实现样本高效的行为克隆,即使在点轨迹违反下游具身特定约束的情况下。我们对模拟和现实世界任务进行了全面评估,发现3PoinTr在仅有20个带动作标签的机器人示范的情况下,在不同类别的操作任务上实现了稳健的空间泛化。3PoinTr超越了基线,包括行为克隆方法,以及之前从人类视频中进行预训练的方法。我们还提供了3PoinTr的3D点轨迹预测与现有点轨迹预测基线的评估。我们发现,3PoinTr由于基于单个变换器构建的轻量且富有表现力的架构,以及保持部分遮挡点监督的训练公式,生成了更准确和更高质量的点轨迹。项目页面:https://adamhung60.github.io/3PoinTr/
cs.RO / 126 / 2603.08490

An Open-Source Robotics Research Platform for Autonomous Laparoscopic Surgery

一种开源机器人研究平台用于自主腹腔镜手术
Rodriguez, Ariel, Mazza, Lorenzo, Lelis, Martin, Younis, Rayan, Bodenstedt, Sebastian, Wagner, Martin, Speidel, Stefanie
Abstract
Autonomous robot-assisted surgery demands reliable, high-precision platforms that strictly adhere to the safety and kinematic constraints of minimally invasive procedures. Existing research platforms, primarily based on the da Vinci Research Kit, suffer from cable-driven mechanical limitations that degrade state-space consistency and hinder the downstream training of reliable autonomous policies. We present an open-source, robot-agnostic Remote Center of Motion (RCM) controller based on a closed-form analytical velocity solver that enforces the trocar constraint deterministically without iterative optimization. The controller operates in Cartesian space, enabling any industrial manipulator to function as a surgical robot. We provide implementations for the UR5e and Franka Emika Panda manipulators, and integrate stereoscopic 3D perception. We integrate the robot control into a full-stack ROS-based surgical robotics platform supporting teleoperation, demonstration recording, and deployment of learned policies via a decoupled server-client architecture. We validate the system on a bowel grasping and retraction task across phantom, ex vivo, and in vivo porcine laparoscopic procedures. RCM deviations remain sub-millimeter across all conditions, and trajectory smoothness metrics (SPARC, LDLJ) are comparable to expert demonstrations from the JIGSAWS benchmark recorded on the da Vinci system. These results demonstrate that the platform provides the precision and robustness required for teleoperation, data collection and autonomous policy deployment in realistic surgical scenarios.
Chinese Translation
自主机器人辅助手术需要可靠、高精度的平台,严格遵循微创手术的安全和运动学约束。现有的研究平台主要基于达芬奇研究工具包,受限于电缆驱动的机械限制,这降低了状态空间的一致性,并妨碍了可靠自主策略的下游训练。我们提出了一种开源、与机器人无关的远程运动中心(Remote Center of Motion, RCM)控制器,该控制器基于封闭形式的解析速度求解器,能够确定性地强制执行套管(trocar)约束,而无需迭代优化。该控制器在笛卡尔空间中操作,使任何工业操纵器都能作为外科机器人使用。我们为 UR5e 和 Franka Emika Panda 操纵器提供了实现,并集成了立体 3D 感知。我们将机器人控制集成到一个基于 ROS 的全栈外科机器人平台中,支持远程操作、演示录制和通过解耦的服务器-客户端架构部署学习的策略。我们在体模、离体和活体猪腹腔镜手术中的肠道抓取和回缩任务上验证了该系统。RCM 偏差在所有条件下均保持在亚毫米级别,轨迹平滑度指标(SPARC, LDLJ)与在达芬奇系统上记录的 JIGSAWS 基准的专家演示相当。这些结果表明,该平台提供了在现实外科场景中进行远程操作、数据收集和自主策略部署所需的精度和鲁棒性。
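A velocity-level RCM constraint does admit a closed form, which makes the "no iterative optimization" claim easy to illustrate. The derivation below is our own illustrative version, not the paper's solver: with the trocar fixed, the instrument can only slide along its shaft and pivot about the trocar, and any desired tip velocity decomposes exactly into those two motions.

```python
import numpy as np

def rcm_command(v_des, p_tip, p_rcm):
    r = p_tip - p_rcm                           # shaft vector, trocar -> tip
    a = r / np.linalg.norm(r)                   # unit shaft axis
    s = float(a @ v_des)                        # axial slide speed
    v_lat = v_des - s * a                       # lateral part, from pivoting
    omega = np.cross(r, v_lat) / float(r @ r)   # pivot rate: omega x r == v_lat
    return omega, s

# Check: the shaft point at the trocar has no lateral velocity.
p_tip, p_rcm = np.array([0.0, 0.0, 0.4]), np.array([0.0, 0.0, 0.1])
omega, s = rcm_command(np.array([0.05, 0.02, 0.01]), p_tip, p_rcm)
a = (p_tip - p_rcm) / np.linalg.norm(p_tip - p_rcm)
v_tip = np.cross(omega, p_tip - p_rcm) + s * a       # realized tip velocity
v_trocar = v_tip + np.cross(omega, p_rcm - p_tip)    # velocity at the trocar
print(np.linalg.norm(v_trocar - (a @ v_trocar) * a)) # ~0: constraint holds
```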
cs.RO / 127 / 2603.08512

Rethinking the semantic classification of indoor places by mobile robots

重新思考移动机器人对室内场所的语义分类
Mozos, Oscar Martinez, Hernandez, Alejandra C., Gomez, Clara, Barber, Ramon
Abstract
A significant challenge in service robots is the semantic understanding of their surrounding areas. Traditional approaches addressed this problem by segmenting the floor plan into regions corresponding to full rooms that are assigned labels consistent with human perception, e.g. office or kitchen. However, different areas inside the same room can be used in different ways: Could the table and the chair in my kitchen become my office? What is the category of that area now? office or kitchen? To adapt to these circumstances we propose a new paradigm where we intentionally relax the resulting labeling of semantic classifiers by allowing confusions inside rooms. Our hypothesis is that those confusions can be beneficial to a service robot. We present a proof of concept in the task of searching for objects.
Chinese Translation
服务机器人面临的一个重大挑战是对其周围环境的语义理解。传统方法通过将平面图划分为与人类感知一致的完整房间区域并分配标签(例如办公室或厨房)来解决这个问题。然而,同一房间内的不同区域可以有不同的使用方式:我厨房里的桌子和椅子可以变成我的办公室吗?那一区域现在的类别是什么?办公室还是厨房?为了适应这些情况,我们提出了一种新的范式,故意放宽语义分类器的标签结果,允许房间内部的混淆。我们的假设是,这些混淆对服务机器人是有益的。我们在物体搜索任务中展示了这一概念的可行性。
cs.RO / 128 / 2603.08519

AtomVLA: Scalable Post-Training for Robotic Manipulation via Predictive Latent World Models

AtomVLA:通过预测潜在世界模型实现可扩展的机器人操作后训练
Sun, Xiaoquan, Xu, Zetian, Cao, Chen, Liu, Zonghe, Sun, Yihan, Pang, Jingrui, Zhang, Ruijian, Yang, Zhen, Pang, Kang, He, Dingxin, Yuan, Mingqi, Chen, Jiayu
Abstract
Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. The execution of complex multi-step behaviors in VLA models can be improved by robust instruction grounding, a critical component for effective control. However, current paradigms predominantly rely on coarse, high-level task instructions during supervised fine-tuning. This instruction grounding gap leaves models without explicit intermediate guidance, leading to severe compounding errors in long-horizon tasks. Therefore, bridging this instruction gap and providing scalable post-training for VLA models is urgent. To tackle this problem, we propose AtomVLA, the first subtask-aware VLA framework integrated with a scalable offline post-training pipeline. Our framework leverages a large language model to decompose high-level demonstrations into fine-grained atomic subtasks. This approach utilizes a pretrained predictive world model to score candidate action chunks against subtask goals in the latent space, mitigating error accumulation while significantly improving long-horizon robustness. Furthermore, this approach enables highly efficient Group Relative Policy Optimization without the prohibitive expenses associated with online rollouts on physical robots. Extensive simulations validate that our AtomVLA maintains strong robustness under perturbations. When evaluated against fundamental baseline models, it achieves an average success rate of 97.0% on the LIBERO benchmark and 48.0% on the LIBERO-PRO benchmark. Finally, experiments conducted in the real world using the Galaxea R1 Lite platform confirm its broad applicability across diverse tasks, especially long-horizon tasks. All datasets, checkpoints, and code will be released to the public domain following the acceptance of this work for future research.
Chinese Translation
视觉-语言-行动(VLA)模型在可泛化的机器人操作中展现出显著潜力。通过稳健的指令基础,VLA模型中复杂多步骤行为的执行可以得到改善,这是有效控制的关键组成部分。然而,目前的范式主要依赖于在监督微调过程中使用粗略的高层任务指令。这种指令基础的缺口使得模型缺乏明确的中间指导,导致在长时间任务中出现严重的累积错误。因此,弥补这一指令缺口并为VLA模型提供可扩展的后训练显得尤为紧迫。为了解决这个问题,我们提出了AtomVLA,这是第一个与可扩展离线后训练管道集成的子任务感知VLA框架。我们的框架利用大型语言模型将高层演示分解为细粒度的原子子任务。该方法利用预训练的预测世界模型在潜在空间中对候选动作片段进行评分,以符合子任务目标,从而减轻错误累积,同时显著提高长时间的鲁棒性。此外,该方法还实现了高效的组相对策略优化,而无需承担在物理机器人上进行在线推演(rollout)的高昂费用。大量模拟验证了我们的AtomVLA在扰动下保持强大的鲁棒性。在与基本基线模型的评估中,它在LIBERO基准上实现了97.0%的平均成功率,在LIBERO-PRO基准上实现了48.0%。最后,在Galaxea R1 Lite平台上进行的现实世界实验确认了其在多种任务中的广泛适用性,尤其是在长时间任务中。所有数据集、检查点和代码将在本工作被接受后公开发布,以供未来研究使用。
cs.RO / 129 / 2603.08531

CRED: Counterfactual Reasoning and Environment Design for Active Preference Learning

CRED:用于主动偏好学习的反事实推理与环境设计
Tung, Yi-Shiuan, Kumar, Gyanig, Jiang, Wei, Hayes, Bradley, Roncone, Alessandro
Abstract
As a robot's operational environment and the tasks it must perform grow in complexity, explicitly specifying and balancing optimization objectives to achieve a preferred behavior profile moves increasingly out of reach. These systems benefit strongly from being able to align their behavior to reflect human preferences and respond to corrections, but manually encoding this feedback is infeasible. Active preference learning (APL) learns human reward functions by presenting trajectories for ranking. However, existing methods sample from fixed trajectory sets or replay buffers that limit query diversity and often fail to identify informative comparisons. We propose CRED, a novel trajectory generation method for APL that improves reward inference by jointly optimizing environment design and trajectory selection to efficiently query and extract preferences from users. CRED "imagines" new scenarios through environment design and leverages counterfactual reasoning -- by sampling possible rewards from its current belief and asking "What if this were the true preference?" -- to generate trajectory pairs that expose differences between competing reward functions. Comprehensive experiments and a user study show that CRED significantly outperforms state-of-the-art methods in reward accuracy and sample efficiency and receives higher user ratings.
Chinese Translation
随着机器人操作环境和任务复杂性的增加,明确指定和权衡优化目标以实现所期望的行为特征变得越来越难以实现。这些系统在能够使其行为与人类偏好对齐并响应修正时受益匪浅,但手动编码这种反馈是不可行的。主动偏好学习(APL)通过呈现轨迹进行排名来学习人类奖励函数。然而,现有方法从固定的轨迹集或重放缓冲区进行采样,这限制了查询的多样性,并且往往无法识别出有信息量的比较。我们提出了CRED,一种用于APL的新型轨迹生成方法,通过联合优化环境设计和轨迹选择来有效地查询和提取用户偏好,从而改善奖励推断。CRED通过环境设计“想象”新的场景,并利用反事实推理——通过从当前信念中采样可能的奖励并询问“如果这是真正的偏好会怎样?”——生成揭示竞争奖励函数之间差异的轨迹对。全面的实验和用户研究表明,CRED在奖励准确性和样本效率方面显著优于最先进的方法,并获得了更高的用户评分。
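With linear reward models r(xi) = w . phi(xi), the counterfactual query step can be sketched directly: sample two hypotheses from the belief and search imagined environments for the trajectory pair they disagree on most. `sample_env`, `sample_trajs`, and `features` are hypothetical stand-ins for the paper's environment designer and trajectory generator.

```python
import numpy as np

def make_query(mean, cov, sample_env, sample_trajs, features,
               n_envs=10, rng=np.random.default_rng(0)):
    w1 = rng.multivariate_normal(mean, cov)     # "what if w1 were true?"
    w2 = rng.multivariate_normal(mean, cov)     # "what if w2 were true?"
    best, best_gap = None, -np.inf
    for _ in range(n_envs):                     # environment design loop
        env = sample_env()                      # imagine a new scenario
        trajs = sample_trajs(env)
        phi = np.stack([features(env, t) for t in trajs])
        i = int(np.argmax(phi @ w1))            # best trajectory under w1
        j = int(np.argmax(phi @ w2))            # best trajectory under w2
        gap = float((w1 - w2) @ (phi[i] - phi[j]))   # how much they disagree
        if i != j and gap > best_gap:
            best, best_gap = (env, trajs[i], trajs[j]), gap
    return best                                 # pair shown to the user
```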
cs.RO / 130 / 2603.08541

EquiBim: Learning Symmetry-Equivariant Policy for Bimanual Manipulation

EquiBim:用于双手操作的对称等变策略学习
Zhang, Zhiyuan, Mohan, Aditya, Han, Seungho, Shou, Wan, Wang, Dongyi, She, Yu
Abstract
Robotic imitation learning has achieved impressive success in learning complex manipulation behaviors from demonstrations. However, many existing robot learning methods do not explicitly account for the physical symmetries of robotic systems, often resulting in asymmetric or inconsistent behaviors under symmetric observations. This limitation is particularly pronounced in dual-arm manipulation, where bilateral symmetry is inherent to both the robot morphology and the structure of many tasks. In this paper, we introduce EquiBim, a symmetry-equivariant policy learning framework for bimanual manipulation that enforces bilateral equivariance between observations and actions during training. Our approach formulates physical symmetry as a group action on both observation and action spaces, and imposes an equivariance constraint on policy predictions under symmetric transformations. The framework is model-agnostic and can be seamlessly integrated into a wide range of imitation learning pipelines with diverse observation modalities and action representations, including point cloud-based and image-based policies, as well as both end-effector-space and joint-space parameterizations. We evaluate EquiBim on RoboTwin, a dual-arm robotic platform with symmetric kinematics, testing it across diverse observation and action configurations in simulation. We further validate the approach on a real-world dual-arm system. Across both simulation and physical experiments, our method consistently improves performance and robustness under distribution shifts. These results suggest that explicitly enforcing physical symmetry provides a simple yet effective inductive bias for bimanual robot learning.
Chinese Translation
机器人模仿学习在从示范中学习复杂操作行为方面取得了显著成功。然而,许多现有的机器人学习方法并未明确考虑机器人系统的物理对称性,常常导致在对称观察下出现不对称或不一致的行为。这一局限性在双臂操作中尤为明显,因为双边对称性是机器人形态和许多任务结构的固有特征。在本文中,我们引入了EquiBim,一个用于双手操作的对称等变策略学习框架,该框架在训练过程中强制执行观察与动作之间的双边等变性。我们的方法将物理对称性表述为对观察和动作空间的群作用,并在对称变换下对策略预测施加等变性约束。该框架是模型无关的,可以无缝集成到各种模仿学习管道中,支持多种观察模态和动作表示,包括基于点云和图像的策略,以及末端执行器空间和关节空间参数化。我们在RoboTwin上评估了EquiBim,这是一个具有对称运动学的双臂机器人平台,并在模拟中评估了其在多种观察和动作配置下的表现。我们进一步在真实的双臂系统上验证了该方法。在模拟和实际实验中,我们的方法在分布变化下始终提高了性能和鲁棒性。这些结果表明,明确施加物理对称性为双手机器人学习提供了一种简单而有效的归纳偏置。
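For bilateral symmetry, the group action is a mirror across the sagittal plane composed with an arm swap, and the constraint is that the policy commutes with it. The toy state layout below ([left xyz, right xyz]) and the loss form are illustrative assumptions, not the paper's representation.

```python
import numpy as np

MIRROR = np.diag([-1.0, 1.0, 1.0])           # reflect x across sagittal plane

def g_obs(obs):                               # group action on observations
    left, right = obs[:3], obs[3:]
    return np.concatenate([MIRROR @ right, MIRROR @ left])  # swap + mirror

def g_act(act):                               # same action on the action space
    left, right = act[:3], act[3:]
    return np.concatenate([MIRROR @ right, MIRROR @ left])

def equivariance_loss(policy, obs_batch):
    # || policy(g . o) - g . policy(o) ||^2, added to the imitation loss
    return float(np.mean([np.sum((policy(g_obs(o)) - g_act(policy(o))) ** 2)
                          for o in obs_batch]))

# A toy policy that is already equivariant incurs zero penalty:
policy = lambda o: g_act(o)                   # mirrors and swaps its input
obs = np.random.default_rng(0).normal(size=(4, 6))
print(equivariance_loss(policy, obs))         # 0.0
```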
cs.RO / 131 / 2603.08544

The Neural Compass: Probabilistic Relative Feature Fields for Robotic Search

神经指南针:用于机器人搜索的概率相对特征场
Somaschini, Gabriele, Röfer, Adrian, Valada, Abhinav
Abstract
Object co-occurrences provide a key cue for finding objects successfully and efficiently in unfamiliar environments. Typically, one looks for cups in kitchens and views fridges as evidence of being in a kitchen. Such priors have also been exploited in artificial agents, but they are typically learned from explicitly labeled data or queried from language models. It is still unclear whether these relations can be learned implicitly from unlabeled observations alone. In this work, we address this problem and propose ProReFF, a feature field model trained to predict relative distributions of features obtained from pre-trained vision language models. In addition, we introduce a learning-based strategy that enables training from unlabeled and potentially contradictory data by aligning inconsistent observations into a coherent relative distribution. For the downstream object search task, we propose an agent that leverages predicted feature distributions as a semantic prior to guide exploration toward regions with a high likelihood of containing the object. We present extensive evaluations demonstrating that ProReFF captures meaningful relative feature distributions in natural scenes and provides insight into the impact of our proposed alignment step. We further evaluate the performance of our search agent in 100 challenges in the Matterport3D simulator, comparing with feature-based baselines and human participants. The proposed agent is 20% more efficient than the strongest baseline and achieves up to 80% of human performance.
Chinese Translation
物体共现为在陌生环境中成功高效地寻找物体提供了关键线索。通常,人们在厨房中寻找杯子,并将冰箱视为处于厨房的证据。这种先验知识也被应用于人工智能代理,但通常是从显式标记的数据中学习或从语言模型中查询的。目前尚不清楚这些关系是否可以仅通过未标记的观察隐式学习。在本研究中,我们解决了这一问题,提出了ProReFF,一种特征场模型,旨在预测从预训练视觉语言模型获得的特征的相对分布。此外,我们引入了一种基于学习的策略,使得可以通过将不一致的观察对齐为一致的相对分布,从未标记且可能矛盾的数据中进行训练。对于下游物体搜索任务,我们提出了一种代理,利用预测的特征分布作为语义先验,引导探索高可能性包含目标的区域。我们进行了广泛的评估,证明ProReFF能够捕捉自然场景中有意义的相对特征分布,并提供了对我们提出的对齐步骤影响的洞察。我们进一步评估了我们的搜索代理在Matterport3D模拟器中100个挑战中的表现,并与基于特征的基线和人类参与者进行了比较。所提出的代理比最强基线效率提高了20%,并达到了人类表现的80%。
cs.RO / 132 / 2603.08546

Interactive World Simulator for Robot Policy Training and Evaluation

用于机器人策略训练和评估的互动世界模拟器
Wang, Yixuan, Syed, Rhythm, Wu, Fangyu, Zhang, Mengchao, Onol, Aykut, Barreiros, Jose, Nayyeri, Hooshang, Dear, Tony, Zhang, Huan, Li, Yunzhu
Abstract
Action-conditioned video prediction models (often referred to as world models) have shown strong potential for robotics applications, but existing approaches are often slow and struggle to capture physically consistent interactions over long horizons, limiting their usefulness for scalable robot policy training and evaluation. We present Interactive World Simulator, a framework for building interactive world models from a moderate-sized robot interaction dataset. Our approach leverages consistency models for both image decoding and latent-space dynamics prediction, enabling fast and stable simulation of physical interactions. In our experiments, the learned world models produce interaction-consistent pixel-level predictions and support stable long-horizon interactions for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. Our framework enables scalable demonstration collection solely within the world models to train state-of-the-art imitation policies. Through extensive real-world evaluation across diverse tasks involving rigid objects, deformable objects, object piles, and their interactions, we find that policies trained on world-model-generated data perform comparably to those trained on the same amount of real-world data. Additionally, we evaluate policies both within the world models and in the real world across diverse tasks, and observe a strong correlation between simulated and real-world performance. Together, these results establish the Interactive World Simulator as a stable and physically consistent surrogate for scalable robotic data generation and faithful, reproducible policy evaluation.
Chinese Translation
基于动作条件的视频预测模型(通常称为世界模型)在机器人应用中展现出强大的潜力,但现有方法通常速度较慢,并且在长时间范围内难以捕捉物理一致的交互,这限制了其在可扩展的机器人策略训练和评估中的实用性。我们提出了互动世界模拟器,这是一个从中等规模的机器人交互数据集中构建互动世界模型的框架。我们的方法利用一致性模型进行图像解码和潜在空间动态预测,从而实现快速且稳定的物理交互模拟。在我们的实验中,学习到的世界模型能够生成交互一致的像素级预测,并支持在单个 RTX 4090 GPU 上以 15 FPS 的速度进行超过 10 分钟的稳定长时间交互。我们的框架使得仅在世界模型内收集可扩展的演示成为可能,以训练最先进的模仿策略。通过在涉及刚性物体、可变形物体、物体堆及其交互的多样任务中进行广泛的真实世界评估,我们发现基于世界模型生成的数据训练的策略与基于相同数量的真实世界数据训练的策略表现相当。此外,我们还在世界模型和真实世界中对多样任务的策略进行评估,观察到模拟性能与真实世界性能之间存在强相关性。这些结果共同确立了互动世界模拟器作为可扩展机器人数据生成和真实、可重复的策略评估的稳定且物理一致的替代品。
cs.RO / 133 / 2603.08560

CONTACT: CONtact-aware TACTile Learning for Robotic Disassembly

CONTACT:面向接触感知的机器人拆解触觉学习
Saka, Yosuke, Hu, Jyun-Chi, Desai, Adeesh, Zhang, Zhiyuan, Zhang, Bihao, Luu, Quan Khanh, Prince, Md Rakibul Islam, Zheng, Minghui, She, Yu
Abstract
Robotic disassembly involves contact-rich interactions in which successful manipulation depends not only on geometric alignment but also on force-dependent state transitions. While vision-based policies perform well in structured settings, their reliability often degrades in tight-tolerance, contact-dominated, or deformable scenarios. In this work, we systematically investigate the role of tactile sensing in robotic disassembly through both simulation and real-world experiments. We construct five rigid-body disassembly tasks in simulation with increasing geometric constraints and extraction difficulty. We further design five real-world tasks, including three rigid and two deformable scenarios, to evaluate contact-dependent manipulation. Within a unified learning framework, we compare three sensing configurations: Vision Only, Vision + tactile RGB (TacRGB), and Vision + tactile force field (TacFF). Across both simulation and real-world experiments, TacFF-based policies consistently achieve the highest success rates, with particularly notable gains in contact-dependent and deformable settings. Notably, naive fusion of TacRGB and TacFF underperforms either modality alone, indicating that simple concatenation can dilute task-relevant force information. Our results show that tactile sensing plays a critical, task-dependent role in robotic disassembly, with structured force-field representations being particularly effective in contact-dominated scenarios.
Chinese Translation
机器人拆解涉及丰富的接触互动,其中成功的操作不仅依赖于几何对齐,还依赖于力依赖的状态转变。虽然基于视觉的策略在结构化环境中表现良好,但在紧容差、接触主导或可变形场景中,其可靠性往往下降。在本研究中,我们通过模拟和真实世界实验系统地探讨了触觉感知在机器人拆解中的作用。我们在模拟中构建了五个刚体拆解任务,随着几何约束和提取难度的增加而逐步加大。我们进一步设计了五个真实世界任务,包括三个刚性和两个可变形场景,以评估接触依赖的操作。在统一的学习框架内,我们比较了三种感知配置:仅视觉(Vision Only)、视觉 + 触觉 RGB(TacRGB)和视觉 + 触觉力场(TacFF)。在模拟和真实世界实验中,基于 TacFF 的策略始终实现了最高的成功率,尤其在接触依赖和可变形设置中表现出显著的提升。值得注意的是,简单地将 TacRGB 和 TacFF 进行融合的效果不如单独使用任一模态,这表明简单的拼接可能会稀释与任务相关的力信息。我们的结果表明,触觉感知在机器人拆解中扮演着关键的、依赖任务的角色,结构化的力场表示在接触主导的场景中特别有效。
cs.RO / 134 / 2603.08572

MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation

MetaWorld-X:通过VLM协调专家的层次世界建模用于类人机器人运动操控
Shen, Yutong, Liu, Hangxu, Liu, Penghui, Luo, Jiashuo, Zhang, Yongkang, Morvley, Rex, Jiang, Chen, Zhang, Jianwei, Zhang, Lei
Abstract
Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion pattern conflicts in high-degree-of-freedom systems. As a result, generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes complex control problems into a set of Specialized Expert Policies (SEP). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, introducing biomechanically consistent inductive biases that ensure natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantic-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
Chinese Translation
为类人机器人学习自然、稳定且具有组合通用性的全身控制策略,以实现同时的运动和操控(运动操控)仍然是机器人领域的一项基本挑战。现有的强化学习方法通常依赖于单一的整体策略来获取多种技能,这往往导致跨技能梯度干扰和高自由度系统中的运动模式冲突。因此,生成的行为常常表现出不自然的运动、有限的稳定性以及对复杂任务组合的较差泛化能力。为了解决这些局限性,我们提出了MetaWorld-X,一个用于类人控制的层次世界模型框架。在分而治之的原则指导下,我们的方法将复杂的控制问题分解为一组专业的专家策略(Specialized Expert Policies, SEP)。每个专家在模仿约束强化学习的框架下,基于人类运动先验进行训练,引入生物力学一致的归纳偏置,从而确保生成自然且物理上合理的运动。在此基础上,我们进一步开发了一种由视觉-语言模型(Vision-Language Model, VLM)监督的智能路由机制(Intelligent Routing Mechanism, IRM),实现语义驱动的专家组合。VLM指导的路由器根据高层任务语义动态整合专家策略,促进组合泛化和在多阶段运动操控任务中的自适应执行。
cs.RO / 135 / 2603.08599

Bilevel Planning with Learned Symbolic Abstractions from Interaction Data

基于交互数据的学习符号抽象的双层规划
Dogangun, Fatih, Kilic, Burcu, Bahar, Serdar, Ugur, Emre
Abstract
Intelligent agents must reason over both continuous dynamics and discrete representations to generate effective plans in complex environments. Previous studies have shown that symbolic abstractions can emerge from neural effect predictors trained with a robot's unsupervised exploration. However, these methods rely on deterministic symbolic domains, lack mechanisms to verify the generated symbolic plans, and operate only at the abstract level, often failing to capture the continuous dynamics of the environment. To overcome these limitations, we propose a bilevel neuro-symbolic framework in which learned probabilistic symbolic rules generate candidate plans rapidly at the high level, and learned continuous effect models verify these plans and perform forward search when necessary at the low level. Our experiments on multi-object manipulation tasks demonstrate that the proposed bilevel method outperforms symbolic-only approaches, reliably identifying failing plans through verification, and achieves planning performance statistically comparable to continuous forward search while resolving most problems via efficient symbolic reasoning.
Chinese Translation
智能体必须在连续动态和离散表示之间进行推理,以在复杂环境中生成有效的计划。先前的研究表明,经过机器人无监督探索训练的神经效应预测器可以产生符号抽象。然而,这些方法依赖于确定性的符号领域,缺乏验证生成符号计划的机制,并且仅在抽象层面操作,往往无法捕捉环境的连续动态。为克服这些局限性,我们提出了一种双层神经符号框架,其中学习到的概率符号规则在高层快速生成候选计划,而学习到的连续效应模型在低层验证这些计划并在必要时进行前向搜索。我们在多物体操作任务上的实验表明,所提出的双层方法优于仅使用符号的方法,通过验证可靠地识别失败的计划,并在规划性能上与连续前向搜索具有统计可比性,同时通过高效的符号推理解决大多数问题。
cs.RO / 136 / 2603.08617

Diff-Muscle: Efficient Learning for Musculoskeletal Robotic Table Tennis

Diff-Muscle:高效学习肌肉骨骼机器人乒乓球
Zhao, Wentao, Guo, Jun, Huang, Kangyao, Liu, Xin, Liu, Huaping
Abstract
Musculoskeletal robots provide superior advantages in flexibility and dexterity, positioning them as a promising frontier towards embodied intelligence. However, current research is largely confined to relatively simple tasks, restricting the exploration of their full potential in multi-segment coordination. Furthermore, efficient learning remains a challenge, primarily due to the high-dimensional action space and inherent overactuated structures. To address these challenges, we propose Diff-Muscle, a musculoskeletal robot control algorithm that leverages differential flatness to reformulate policy learning from the redundant muscle-activation space into a significantly lower-dimensional joint space. We further utilize the highly dynamic robotic table tennis task to evaluate our algorithm. Specifically, we propose a hierarchical reinforcement learning framework that integrates a Kinematics-based Muscle Actuation Controller (K-MAC) with high-level trajectory planning, enabling a musculoskeletal robot to perform dexterous and precise rallies. Experimental results demonstrate that Diff-Muscle significantly outperforms state-of-the-art baselines in success rates while maintaining minimal muscle activation. Notably, the proposed framework successfully enables the musculoskeletal robots to achieve continuous rallies in a challenging dual-robot setting.
Chinese Translation
肌肉骨骼机器人在灵活性和灵巧性方面具有显著优势,使其成为具身智能的有前景的前沿。然而,目前的研究主要局限于相对简单的任务,限制了其在多段协调中的全部潜力的探索。此外,高维动作空间和固有的过驱动结构使得高效学习仍然是一项挑战。为了解决这些问题,我们提出了Diff-Muscle,这是一种肌肉骨骼机器人控制算法,利用微分平坦性将策略学习从冗余肌肉激活空间重新表述为显著低维的关节空间。此外,我们利用高度动态的机器人乒乓球任务来评估我们的算法。具体而言,我们提出了一种层次强化学习框架,将基于运动学的肌肉激活控制器(K-MAC)与高层轨迹规划相结合,使肌肉骨骼机器人能够执行灵巧且精确的对打。实验结果表明,Diff-Muscle在成功率上显著优于最先进的基线,同时保持最小的肌肉激活。值得注意的是,所提出的框架成功使肌肉骨骼机器人在具有挑战性的双机器人设置中实现连续对打。
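The dimensionality reduction is easy to see in code: the policy outputs joint torques, and a low-level map resolves the overactuated muscle layer through the moment-arm matrix R (joint torque tau = R @ a for activations a in [0, 1]). The clipped regularized least squares below is a crude stand-in for the paper's K-MAC controller, shown only to make the flatness-based reformulation concrete.

```python
import numpy as np

def muscle_activations(R, tau, reg=1e-3):
    # Minimize ||R a - tau||^2 + reg ||a||^2, then clip to actuator limits.
    n = R.shape[1]
    a = np.linalg.solve(R.T @ R + reg * np.eye(n), R.T @ tau)
    return np.clip(a, 0.0, 1.0)              # a proper solver would use a QP

# 2 joints driven by 6 muscles: the policy now only outputs 2 numbers.
rng = np.random.default_rng(0)
R = rng.uniform(-1.0, 1.0, size=(2, 6))      # stand-in moment-arm matrix
tau = np.array([0.3, -0.1])                  # joint-space policy output
a = muscle_activations(R, tau)
print(a, R @ a)                              # activations and realized torque
```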
cs.RO / 137 / 2603.08619

Embedding Classical Balance Control Principles in Reinforcement Learning for Humanoid Recovery

将经典平衡控制原理嵌入强化学习以实现类人机器人恢复
Poddar, Nehar, McCrory, Stephen, Penco, Luigi, Clark, Geoffrey, Svil, Hakki Erhan, Griffin, Robert
Abstract
Humanoid robots remain vulnerable to falls and unrecoverable failure states, limiting their practical utility in unstructured environments. While reinforcement learning has demonstrated stand-up behaviors, existing approaches treat recovery as a pure task-reward problem without an explicit representation of the balance state. We present a unified RL policy that addresses this limitation by embedding classical balance metrics (capture point, center-of-mass state, and centroidal momentum) as privileged critic inputs and shaping rewards directly around these quantities during training, while the actor relies solely on proprioception for zero-shot hardware transfer. Without reference trajectories or scripted contacts, a single policy spans the full recovery spectrum: ankle and hip strategies for small disturbances, corrective stepping under large pushes, and compliant falling with multi-contact stand-up using the hands, elbows, and knees. Trained on the Unitree H1-2 in Isaac Lab, the policy achieves a 93.4% recovery rate across randomized initial poses and unscripted fall configurations. An ablation study shows that removing the balance-informed structure causes stand-up learning to fail entirely, confirming that these metrics provide a meaningful learning signal rather than incidental structure. Sim-to-sim transfer to MuJoCo and preliminary hardware experiments further demonstrate cross-environment generalization. These results show that embedding interpretable balance structure into the learning framework substantially reduces time spent in failure states and broadens the envelope of autonomous recovery.
Chinese Translation
类人机器人仍然容易跌倒并陷入无法恢复的失败状态,这限制了它们在非结构化环境中的实际应用。尽管强化学习已展示了站立行为,但现有方法将恢复视为纯粹的任务-奖励问题,而没有明确表示平衡状态。我们提出了一种统一的强化学习策略,通过将经典平衡指标(捕获点、质心状态和质心动量)作为特权评论家(critic)输入嵌入,并在训练过程中直接围绕这些量塑造奖励,从而解决这一局限性,同时演员(actor)仅依赖本体感觉实现零样本硬件迁移。在没有参考轨迹或脚本接触的情况下,单一策略涵盖了完整的恢复范围:针对小扰动的踝关节和髋关节策略、在大推力下的纠正性迈步,以及使用手、肘和膝盖的多接触顺应性跌倒与站立。该策略在Isaac Lab的Unitree H1-2上训练,达到了93.4%的恢复率,适用于随机初始姿势和无脚本跌倒配置。消融研究表明,去除平衡信息结构会导致站立学习完全失败,确认这些指标提供了有意义的学习信号,而非偶然结构。对MuJoCo的模拟到模拟迁移和初步硬件实验进一步证明了跨环境的泛化能力。这些结果表明,将可解释的平衡结构嵌入学习框架显著减少了在失败状态下的时间,并扩大了自主恢复的范围。
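Of the balance metrics named above, the capture point has a well-known closed form under the linear inverted pendulum model: xi = x + xdot / omega with omega = sqrt(g / z). The snippet below computes that classical quantity from the CoM state; it is a sketch of the kind of signal the critic receives, not the authors' reward code.

```python
import numpy as np

def capture_point(com_pos, com_vel, com_height, g=9.81):
    """Instantaneous capture point xi = x + xdot / omega, omega = sqrt(g/z).

    com_pos, com_vel: horizontal CoM position and velocity (2-vectors).
    com_height: CoM height above the ground (scalar, > 0).
    """
    omega = np.sqrt(g / com_height)
    return np.asarray(com_pos) + np.asarray(com_vel) / omega

# Example: CoM at the origin moving forward at 0.5 m/s at 0.9 m height.
print(capture_point([0.0, 0.0], [0.5, 0.0], 0.9))  # ~[0.151, 0.0]
```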
cs.RO / 138 / 2603.08668

Exp-Force: Experience-Conditioned Pre-Grasp Force Selection with Vision-Language Models

Exp-Force:基于经验的视觉-语言模型预抓取力选择
Shang, Siqi, Huang, Minchao, Fan, Bill, Chin, Lillian
Abstract
Accurate pre-contact grasp force selection is critical for safe and reliable robotic manipulation. Adaptive controllers regulate force after contact but still require a reasonable initial estimate. Starting a grasp with too little force requires reactive adjustment, while starting a grasp with too high a force risks damaging fragile objects. This trade-off is particularly challenging for compliant grippers, whose contact mechanics are difficult to model analytically. We propose Exp-Force, an experience-conditioned framework that predicts the minimum feasible grasping force from a single RGB image. The method retrieves a small set of relevant prior grasping experiences and conditions a vision-language model on these examples for in-context inference, without analytic contact models or manually designed heuristics. On 129 object instances, Exp-Force achieves a best-case MAE of 0.43 N, reducing error by 72% over zero-shot inference. In real-world tests on 30 unseen objects, it improves appropriate force selection rate from 63% to 87%. These results demonstrate that Exp-Force enables reliable and generalizable pre-grasp force selection by leveraging prior interaction experiences. http://expforcesubmission.github.io/Exp-Force-Website/
Chinese Translation
准确的接触前抓取力选择对于安全可靠的机器人操作至关重要。自适应控制器在接触后调节力,但仍然需要合理的初始估计。以过小的力开始抓取需要进行反应性调整,而以过大的力开始抓取则有损坏脆弱物体的风险。这种权衡对于顺应性夹具尤其具有挑战性,因为其接触力学难以进行解析建模。我们提出了Exp-Force,这是一种基于经验的框架,能够从单张RGB图像中预测最小可行的抓取力。该方法检索一小组相关的先前抓取经验,并在这些示例上对视觉-语言模型进行条件化,以进行上下文推理,而无需解析接触模型或手动设计的启发式方法。在129个物体实例上,Exp-Force实现了最佳情况下的平均绝对误差(MAE)为0.43 N,相较于零样本推理减少了72%的误差。在对30个未见物体的实际测试中,它将适当力选择率从63%提高到87%。这些结果表明,Exp-Force通过利用先前的交互经验,实现了可靠且可推广的预抓取力选择。
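The retrieve-then-condition loop described above can be sketched in a few lines. Everything here is an assumption for illustration: the embedding source, bank contents, and prompt format are placeholders, not the paper's pipeline.

```python
import numpy as np

# Hypothetical precomputed image embeddings of prior grasps and their
# measured minimum feasible forces (Newtons).
bank_emb = np.random.randn(200, 512)            # (n_experiences, d)
bank_force = np.random.uniform(0.5, 8.0, 200)

def retrieve_examples(query_emb, k=4):
    """Return the k most similar prior grasps by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_emb / np.linalg.norm(bank_emb, axis=1, keepdims=True)
    idx = np.argsort(b @ q)[-k:][::-1]
    return [(i, bank_force[i]) for i in idx]

def build_prompt(query_desc, examples):
    """Condition a VLM on retrieved (experience, force) pairs in context."""
    lines = [f"Example: object {i}, minimum grasp force {f:.2f} N"
             for i, f in examples]
    lines.append(f"New object: {query_desc}. Predict the minimum grasp force.")
    return "\n".join(lines)

print(build_prompt("unseen mug", retrieve_examples(np.random.randn(512))))
```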
计算机视觉 (Computer Vision)
281
cs.CV / 1 / 2603.06640

Roots Beneath the Cut: Uncovering the Risk of Concept Revival in Pruning-Based Unlearning for Diffusion Models

剪枝下的根源:揭示扩散模型中基于剪枝的遗忘方法的概念复兴风险
Zhang, Ci, Ding, Zhaojun, Yang, Chence, Liu, Jun, Zhai, Xiaoming, Huang, Shaoyi, Li, Beiwen, Ma, Xiaolong, Lu, Jin, Yuan, Geng
Abstract
Pruning-based unlearning has recently emerged as a fast, training-free, and data-independent approach to remove undesired concepts from diffusion models. It promises high efficiency and robustness, offering an attractive alternative to traditional fine-tuning or editing-based unlearning. However, in this paper we uncover a hidden danger behind this promising paradigm. We find that the locations of pruned weights, typically set to zero during unlearning, can act as side-channel signals that leak critical information about the erased concepts. To verify this vulnerability, we design a novel attack framework capable of reviving erased concepts from pruned diffusion models in a fully data-free and training-free manner. Our experiments confirm that pruning-based unlearning is not inherently secure, as erased concepts can be effectively revived without any additional data or retraining. Extensive experiments on diffusion-based unlearning based on concept related weights lead to the conclusion: once the critical concept-related weights in diffusion models are identified, our method can effectively recover the original concept regardless of how the weights are manipulated. Finally, we explore potential defense strategies and advocate safer pruning mechanisms that conceal pruning locations while preserving unlearning effectiveness, providing practical insights for designing more secure pruning-based unlearning frameworks.
Chinese Translation
基于剪枝的遗忘方法最近作为一种快速、无训练且与数据无关的方式出现,用于从扩散模型中移除不需要的概念。它承诺提供高效性和鲁棒性,成为传统微调或编辑型遗忘方法的有吸引力的替代方案。然而,在本文中,我们揭示了这一有前景范式背后的隐患。我们发现,在遗忘过程中,通常被设置为零的剪枝权重的位置可以充当旁路信号,泄露关于被抹去概念的关键信息。为了验证这一脆弱性,我们设计了一个新颖的攻击框架,能够在完全无数据和无训练的情况下从剪枝的扩散模型中复兴被抹去的概念。我们的实验确认,基于剪枝的遗忘方法并不是固有安全的,因为被抹去的概念可以在没有任何额外数据或重新训练的情况下有效复兴。针对基于概念相关权重的扩散遗忘方法的广泛实验得出结论:一旦识别出扩散模型中与关键概念相关的权重,我们的方法可以有效恢复原始概念,无论这些权重如何被操控。最后,我们探讨了潜在的防御策略,并倡导更安全的剪枝机制,以隐蔽剪枝位置,同时保持遗忘的有效性,为设计更安全的基于剪枝的遗忘框架提供了实用的见解。
cs.CV / 2 / 2603.06648

ObjChangeVR: Object State Change Reasoning from Continuous Egocentric Views in VR Environments

ObjChangeVR:在虚拟现实环境中从连续自我中心视角推理物体状态变化
Ding, Shiyi, Wu, Shaoen, Chen, Ying
Abstract
Recent advances in multimodal large language models (MLLMs) offer a promising approach for natural language-based scene change queries in virtual reality (VR). Prior work on applying MLLMs for object state understanding has focused on egocentric videos that capture the camera wearer's interactions with objects. However, object state changes may occur in the background without direct user interaction, lacking explicit motion cues and making them difficult to detect. Moreover, no benchmark exists for evaluating this challenging scenario. To address these challenges, we introduce ObjChangeVR-Dataset, specifically for benchmarking the question-answering task of object state change. We also propose ObjChangeVR, a framework that combines viewpoint-aware and temporal-based retrieval to identify relevant frames, along with cross-view reasoning that reconciles inconsistent evidence from multiple viewpoints. Extensive experiments demonstrate that ObjChangeVR significantly outperforms baseline approaches across multiple MLLMs.
Chinese Translation
多模态大型语言模型(MLLMs)的最新进展为虚拟现实(VR)中基于自然语言的场景变化查询提供了一种有前景的方法。之前关于应用MLLMs进行物体状态理解的研究主要集中在捕捉摄像机佩戴者与物体交互的自我中心视频上。然而,物体状态变化可能在背景中发生,而没有直接的用户交互,缺乏明确的运动线索,使其难以检测。此外,目前尚无基准用于评估这一具有挑战性的场景。为了解决这些挑战,我们引入了ObjChangeVR-Dataset,专门用于评估物体状态变化的问答任务。我们还提出了ObjChangeVR,该框架结合视角感知检索与基于时间的检索来识别相关帧,并通过跨视角推理调和来自多个视角的不一致证据。大量实验表明,ObjChangeVR在多个MLLMs上显著优于基线方法。
cs.CV / 3 / 2603.06650

Margin-Consistent Deep Subtyping of Invasive Lung Adenocarcinoma via Perturbation Fidelity in Whole-Slide Image Analysis

通过扰动保真度在全幻灯片图像分析中进行侵袭性肺腺癌的边际一致深度子类型化
Rad, Meghdad Sabouri, Junze, Huang, Hosseini, Mohammad Mehdi, Choudhary, Rakesh, Carello, Saverio J., El-Zammar, Ola, Nasr, Michel R., Rodd, Bardia
Abstract
Whole-slide image classification for invasive lung adenocarcinoma subtyping remains vulnerable to real-world imaging perturbations that undermine model reliability at the decision boundary. We propose a margin consistency framework evaluated on 203,226 patches from 143 whole-slide images spanning five adenocarcinoma subtypes in the BMIRDS-LUAD dataset. By combining attention-weighted patch aggregation with margin-aware training, our approach achieves robust feature-logit space alignment measured by Kendall correlations of 0.88 during training and 0.64 during validation. Contrastive regularization, while effective at improving class separation, tends to over-cluster features and suppress fine-grained morphological variation; to counteract this, we introduce Perturbation Fidelity (PF) scoring, which imposes structured perturbations through Bayesian-optimized parameters. Vision Transformer-Large achieves 95.20 +/- 4.65% accuracy, representing a 40% error reduction from the 92.00 +/- 5.36% baseline, while ResNet101 with an attention mechanism reaches 95.89 +/- 5.37% from 91.73 +/- 9.23%, a 50% error reduction. All five subtypes exceed an area under the receiver operating characteristic curve (AUC) of 0.99. On the WSSS4LUAD external benchmark, ResNet50 with an attention mechanism attains 80.1% accuracy, demonstrating cross-institutional generalizability despite approximately 15-20% domain-shift-related degradation and identifying opportunities for future adaptation research.
Chinese Translation
侵袭性肺腺癌子类型的全幻灯片图像分类仍然容易受到现实世界成像扰动的影响,这会削弱模型在决策边界上的可靠性。我们提出了一种边际一致性框架,该框架在BMIRDS-LUAD数据集中评估了来自143个全幻灯片图像的203,226个图像块,涵盖五种腺癌亚型。通过结合注意力加权的图像块聚合与边际感知训练,我们的方法在训练期间和验证期间分别实现了0.88和0.64的Kendall相关性,表明特征-对数几率空间的稳健对齐。对比正则化虽然在改善类别分离方面有效,但往往会过度聚类特征并抑制细粒度形态变化;为了解决这一问题,我们引入了扰动保真度(Perturbation Fidelity, PF)评分,通过贝叶斯优化参数施加结构化扰动。Vision Transformer-Large的准确率达到95.20 +/- 4.65%,相比92.00 +/- 5.36%的基线减少了40%的错误,而带有注意力机制的ResNet101的准确率为95.89 +/- 5.37%,从91.73 +/- 9.23%提升,减少了50%的错误。所有五种亚型的接收者操作特征曲线下面积(AUC)均超过0.99。在WSSS4LUAD外部基准测试中,带有注意力机制的ResNet50达到了80.1%的准确率,尽管存在约15-20%的领域转移相关降级,仍展示了跨机构的泛化能力,并识别出未来适应性研究的机会。
cs.CV / 4 / 2603.06652

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

PaLMR:通过多模态过程对齐实现可信的视觉推理
Li, Yantao, Hui, Qiang, Yan, Chenyang, Cheng, Kanzhi, Zhao, Fang, Tan, Chao, Gao, Huanling, Zhang, Jianbing, Wang, Kai, Dai, Xinyu, Lian, Shiguo
Abstract
Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.
Chinese Translation
强化学习最近提高了大型语言模型和多模态大型语言模型的推理能力,但现有的奖励设计强调最终答案的正确性,因此容忍过程中的幻觉——即模型在误解视觉证据的情况下得出正确答案的情况。我们通过PaLMR解决这一过程级别的不对齐问题,PaLMR是一个不仅对结果进行对齐,还对推理过程本身进行对齐的框架。PaLMR包含两个互补的组件:一个感知对齐的数据层,构建具有结构化伪真实值和可验证视觉事实的过程感知推理数据;一个过程对齐的优化层,构建一个具有过程感知评分函数的层次奖励融合方案,以鼓励视觉上可信的思维链并提高训练稳定性。在Qwen2.5-VL-7B上的实验表明,我们的方法显著减少了推理幻觉,提高了视觉推理的可信度,在HallusionBench上取得了最先进的结果,同时在MMMU、MathVista和MathVerse上保持了强劲的表现。这些发现表明,PaLMR为过程对齐的多模态推理提供了一条有原则且实用的路径,推动了多模态大型语言模型的可靠性和可解释性。
cs.CV / 5 / 2603.06655

A Parameter-efficient Convolutional Approach for Weed Detection in Multispectral Aerial Imagery

一种参数高效的卷积方法用于多光谱航空影像中的杂草检测
Ramos, Leo Thomas, Sappa, Angel D.
Abstract
We introduce FCBNet, an efficient model designed for weed segmentation. The architecture is based on a fully frozen ConvNeXt backbone, the proposed Feature Correction Block (FCB), which leverages efficient convolutions for feature refinement, and a lightweight decoder. FCBNet is evaluated on the WeedBananaCOD and WeedMap datasets under both RGB and multispectral modalities, showing that FCBNet outperforms models such as U-Net, DeepLabV3+, SK-U-Net, SegFormer, and WeedSense in terms of mIoU, exceeding 85%, while also achieving superior computational efficiency, requiring only 0.06 to 0.2 hours for training. Furthermore, the frozen backbone strategy reduces the number of trainable parameters by more than 90%, significantly lowering memory requirements.
Chinese Translation
我们提出了FCBNet,这是一种用于杂草分割的高效模型。该架构基于完全冻结的ConvNeXt主干,提出的特征校正块(Feature Correction Block, FCB)利用高效卷积进行特征精细化,并配备轻量级解码器。FCBNet在WeedBananaCOD和WeedMap数据集上进行了评估,涵盖RGB和多光谱模式,结果表明FCBNet在mIoU方面优于U-Net、DeepLabV3+、SK-U-Net、SegFormer和WeedSense等模型,超过85%,同时实现了卓越的计算效率,仅需0.06到0.2小时进行训练。此外,冻结主干策略将可训练参数的数量减少了90%以上,显著降低了内存需求。
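The headline efficiency claim (over 90% fewer trainable parameters) follows directly from freezing the backbone and training only the add-on modules. A minimal PyTorch sketch of that strategy, using a stand-in ConvNeXt-Tiny and a placeholder segmentation head rather than the paper's FCB and decoder:

```python
import torch
from torchvision.models import convnext_tiny

# Freeze the backbone: no gradients, no optimizer state for its weights.
backbone = convnext_tiny(weights=None).features
backbone.requires_grad_(False)

decoder = torch.nn.Sequential(    # lightweight stand-in head
    torch.nn.Conv2d(768, 64, 3, padding=1),  # convnext_tiny ends at 768 ch
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 2, 1),    # 2 classes: weed / background
)

trainable = sum(p.numel() for p in decoder.parameters() if p.requires_grad)
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```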
cs.CV / 6 / 2603.06656

GameVerse: Can Vision-Language Models Learn from Video-based Reflection?

GameVerse:视觉语言模型能否从基于视频的反思中学习?
Zhang, Kuan, Liu, Dongchen, Zhao, Qiyue, Hou, Jinkun, Zhang, Xinran, Xie, Qinlei, Liu, Miao, Li, Yiming
Abstract
Human gameplay is a visually grounded interaction loop in which players act, reflect on failures, and watch tutorials to refine strategies. Can Vision-Language Models (VLMs) also learn from video-based reflection? We present GameVerse, a comprehensive video game benchmark that enables a reflective visual interaction loop. Moving beyond traditional fire-and-forget evaluations, it uses a novel reflect-and-retry paradigm to assess how VLMs internalize visual experience and improve policies. To facilitate systematic and scalable evaluation, we also introduce a cognitive hierarchical taxonomy spanning 15 globally popular games, dual action space for both semantic and GUI control, and milestone evaluation using advanced VLMs to quantify progress. Our experiments show that VLMs benefit from video-based reflection in varied settings, and perform best by combining failure trajectories and expert tutorials-a training-free analogue to reinforcement learning (RL) plus supervised fine-tuning (SFT).
Chinese Translation
人类游戏玩法是一个以视觉为基础的互动循环,在这个循环中,玩家行动、反思失败并观看教程以完善策略。视觉语言模型(VLMs)能否也从基于视频的反思中学习?我们提出了GameVerse,这是一个综合性的视频游戏基准,能够实现反思性的视觉互动循环。它超越了传统的“一次性评估”,采用了一种新颖的反思与重试范式来评估VLMs如何内化视觉经验并改善策略。为了促进系统化和可扩展的评估,我们还引入了一种涵盖15款全球热门游戏的认知层次分类法,提供语义和图形用户界面(GUI)控制的双重动作空间,并使用先进的VLMs进行里程碑评估以量化进展。我们的实验表明,VLMs在不同环境中受益于基于视频的反思,并通过结合失败轨迹和专家教程表现最佳——这是一种无需训练的类强化学习(RL)加上监督微调(SFT)的类比。
cs.CV / 7 / 2603.06658

ASMIL: Attention-Stabilized Multiple Instance Learning for Whole Slide Imaging

ASMIL:用于全幻灯片成像的注意力稳定多实例学习
Ye, Linfeng, Hamidi, Shayan Mohajer, Chi, Zhixiang, Li, Guang, Pilanci, Mert, Ogawa, Takahiro, Haseyama, Miki, Plataniotis, Konstantinos N.
Abstract
Attention-based multiple instance learning (MIL) has emerged as a powerful framework for whole slide image (WSI) diagnosis, leveraging attention to aggregate instance-level features into bag-level predictions. Despite this success, we find that such methods exhibit a new failure mode: unstable attention dynamics. Across four representative attention-based MIL methods and two public WSI datasets, we observe that attention distributions oscillate across epochs rather than converging to a consistent pattern, degrading performance. This instability adds to two previously reported challenges: overfitting and over-concentrated attention distribution. To simultaneously overcome these three limitations, we introduce attention-stabilized multiple instance learning (ASMIL), a novel unified framework. ASMIL uses an anchor model to stabilize attention, replaces softmax with a normalized sigmoid function in the anchor to prevent over-concentration, and applies token random dropping to mitigate overfitting. Extensive experiments demonstrate that ASMIL achieves up to a 6.49% F1 score improvement over state-of-the-art methods. Moreover, integrating the anchor model and normalized sigmoid into existing attention-based MIL methods consistently boosts their performance, with F1 score gains up to 10.73%. All code and data are publicly available at https://github.com/Linfeng-Ye/ASMIL.
Chinese Translation
基于注意力的多实例学习(MIL)已成为全幻灯片图像(WSI)诊断的强大框架,利用注意力将实例级特征聚合为包级预测。尽管取得了成功,我们发现此类方法表现出一种新的失败模式:不稳定的注意力动态。在四种代表性的基于注意力的MIL方法和两个公共WSI数据集上,我们观察到注意力分布在多个训练周期中波动,而不是收敛到一致的模式,从而降低了性能。这种不稳定性是之前报告的两个挑战(过拟合和注意力分布过度集中)之外的新问题。为了同时克服这三种限制,我们提出了注意力稳定多实例学习(ASMIL),一个新颖的统一框架。ASMIL使用锚模型来稳定注意力,在锚中用归一化的sigmoid函数替代softmax,以防止过度集中,并应用令牌随机丢弃以减轻过拟合。大量实验表明,ASMIL比最先进的方法最高提升6.49%的F1分数。此外,将锚模型和归一化sigmoid集成到现有的基于注意力的MIL方法中,始终提升了它们的性能,F1分数的提升最高可达10.73%。所有代码和数据均可在https://github.com/Linfeng-Ye/ASMIL上公开获取。
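The two drop-in changes ASMIL makes to standard attention pooling, a normalized sigmoid in place of softmax and random token dropping, are easy to express. The sketch below shows one plausible form; the exact placement inside the anchor model is an assumption.

```python
import torch

def anchor_attention_pool(scores, feats, drop_p=0.1, training=True):
    """Bag-level pooling with a normalized sigmoid instead of softmax.

    scores: (n_tokens,) attention logits; feats: (n_tokens, d) features.
    Sigmoid bounds each weight in (0, 1) before normalization, avoiding
    the winner-take-all over-concentration of softmax; random token
    dropping is the overfitting mitigation (details here are a sketch).
    """
    if training and drop_p > 0:
        keep = torch.rand(scores.shape[0]) >= drop_p
        scores, feats = scores[keep], feats[keep]
    w = torch.sigmoid(scores)
    w = w / w.sum().clamp_min(1e-8)
    return (w.unsqueeze(-1) * feats).sum(dim=0)  # (d,) bag embedding

bag = anchor_attention_pool(torch.randn(100), torch.randn(100, 256))
print(bag.shape)  # torch.Size([256])
```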
cs.CV / 8 / 2603.06661

EnsAug: Augmentation-Driven Ensembles for Human Motion Sequence Analysis

EnsAug:基于增强驱动的集成方法用于人类运动序列分析
De, Bikram, Irani, Habib, Metsis, Vangelis
Abstract
Data augmentation is a crucial technique for training robust deep learning models for human motion, where annotated datasets are often scarce. However, generic augmentation methods often ignore the underlying geometric and kinematic constraints of the human body, risking the generation of unrealistic motion patterns that can degrade model performance. Furthermore, the conventional approach of training a single generalist model on a dataset expanded with a mixture of all available transformations does not fully exploit the unique learning signals provided by each distinct augmentation type. We challenge this convention by introducing a novel training paradigm, EnsAug, that strategically uses augmentation to foster model diversity within an ensemble. Our method involves training an ensemble of specialists, where each model learns from the original dataset augmented by only a single, distinct geometric transformation. Experiments on sign language and human activity recognition benchmarks demonstrate that our diversified ensemble methodology significantly outperforms the standard practice of training one model on a combined augmented dataset and achieves state-of-the-art accuracy on two sign language and one human activity recognition dataset while offering greater modularity and efficiency. Our primary contribution is the empirical validation of this training strategy, establishing an effective baseline for leveraging data augmentation in skeletal motion analysis.
Chinese Translation
数据增强是训练强健深度学习模型以分析人类运动的重要技术,而标注数据集往往稀缺。然而,通用的增强方法常常忽视人体的几何和运动学约束,可能导致生成不现实的运动模式,从而降低模型性能。此外,传统的方法是在一个扩展了所有可用变换的混合数据集上训练单一的通用模型,这并未充分利用每种独特增强类型所提供的学习信号。我们通过引入一种新颖的训练范式EnsAug来挑战这一传统,该方法战略性地利用增强技术促进集成模型的多样性。我们的方法涉及训练一个专家集成,每个模型仅从经过单一、独特几何变换增强的原始数据集中学习。在手语和人类活动识别基准上的实验表明,我们的多样化集成方法显著优于在组合增强数据集上训练单一模型的标准做法,并在两个手语数据集和一个人类活动识别数据集上实现了最先进的准确性,同时提供了更高的模块化和效率。我们的主要贡献是对这一训练策略的实证验证,为在骨骼运动分析中利用数据增强建立了有效的基准。
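The training recipe is simple to state: one specialist per geometric transformation, fused at inference. The sketch below uses a placeholder classifier and identity stand-ins for the skeleton-space transforms, so it shows the structure of the approach rather than the paper's exact setup.

```python
import torch

# Stand-ins for skeleton-space geometric transforms; real ones would
# rotate, scale, or mirror joint coordinates.
AUGS = {
    "rotate": lambda x: x,
    "scale":  lambda x: x,
    "mirror": lambda x: x,
}

def train_specialist(aug_fn, base_data):
    model = torch.nn.Linear(64, 10)  # placeholder motion classifier
    # ... train `model` on base_data plus aug_fn-augmented copies only ...
    return model

specialists = {name: train_specialist(fn, None) for name, fn in AUGS.items()}

def ensemble_predict(x):
    """Average specialist logits; each model saw one augmentation family."""
    logits = torch.stack([m(x) for m in specialists.values()])
    return logits.mean(dim=0).argmax(dim=-1)

print(ensemble_predict(torch.randn(4, 64)))
```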
cs.CV / 9 / 2603.06662

HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding

超令牌:控制持续视频-语言理解中的令牌动态
Nguyen, Toan, Liu, Yang, De Melo, Celso, Salim, Flora D.
Abstract
Continual VideoQA with multimodal LLMs is hindered by interference between tasks and the prohibitive cost of storing task-specific prompts. We introduce HyperTokens, a transformer-based token generator that produces fine-tuning tokens on demand, giving explicit control over prompt updates while keeping memory fixed. To suppress forgetting, we propose meta-inspired regularisers that look ahead to avoid task-specific sharp directions and anchor the evolving generator to prior tasks. We further connect our objective to sharpness-aware optimisation, providing insight into why it encourages flatter cross-task minima and improves retention. Beyond regularisation, HyperTokens exploits lightweight auxiliary multimodal supervision through shared generation weights; guided by a causal perspective, we design feasible objectives and surrogate mutual-information losses to regularise anti-causal cross-modal directions. Across two standard continual VideoQA benchmarks, HyperTokens achieves higher average accuracy with substantially lower forgetting. Finally, we introduce a challenging cross-modal ImageQA->VideoQA protocol and show that HyperTokens enables robust continual transfer in this setting.
Chinese Translation
多模态大型语言模型(LLMs)在持续视频问答(VideoQA)中受到任务间干扰和存储任务特定提示的高昂成本的制约。我们提出了超令牌(HyperTokens),一种基于变换器的令牌生成器,能够按需生成微调令牌,从而在保持内存固定的同时,明确控制提示的更新。为了抑制遗忘,我们提出了受元启发的正则化方法,提前预测以避免任务特定的尖锐方向,并将不断演变的生成器锚定到先前的任务上。我们进一步将我们的目标与尖锐度感知优化(sharpness-aware optimisation)联系起来,提供了为何它鼓励更平坦的跨任务最小值并改善保留的见解。除了正则化,超令牌还通过共享生成权重利用轻量级的辅助多模态监督;在因果视角的指导下,我们设计了可行的目标和替代互信息损失,以正则化反因果跨模态方向。在两个标准的持续视频问答基准上,超令牌实现了更高的平均准确率,同时显著降低了遗忘率。最后,我们引入了一个具有挑战性的跨模态图像问答到视频问答(ImageQA->VideoQA)协议,并展示了超令牌在这一设置中实现了稳健的持续迁移能力。
cs.CV / 10 / 2603.06663

Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

图标标记:通过基于图的视觉提示促进多模态语言模型的空间推理
Frisoni, Giacomo, Molfetta, Lorenzo, Buzzoni, Mattia, Moro, Gianluca
Abstract
Recent advances in training-free visual prompting, such as Set-of-Mark, have emerged as a promising direction for enhancing the grounding capabilities of multimodal language models (MLMs). These techniques operate by partitioning the input image into object regions and annotating them with marks, predominantly boxes with numeric identifiers, before feeding the augmented image to the MLM. However, these approaches treat marked objects as isolated entities, failing to capture the relationships between them. On these premises, we propose Graph-of-Mark (GoM), the first pixel-level visual prompting technique that overlays scene graphs onto the input image for spatial reasoning tasks. We evaluate GoM across 3 open-source MLMs and 4 different datasets, conducting extensive ablations on drawn components and investigating the impact of auxiliary graph descriptions in the text prompt. Our results demonstrate that GoM consistently improves the zero-shot capability of MLMs in interpreting object positions and relative directions, improving base accuracy in visual question answering and localization up to 11 percentage points.
Chinese Translation
近期无训练视觉提示的进展,例如标记集合(Set-of-Mark),已成为增强多模态语言模型(MLMs)基础能力的一个有前景的方向。这些技术通过将输入图像划分为对象区域,并用标记(主要是带有数字标识符的框)进行注释,然后将增强后的图像输入到MLM中。然而,这些方法将标记的对象视为孤立实体,未能捕捉它们之间的关系。在此基础上,我们提出了图标标记(Graph-of-Mark, GoM),这是首个将场景图叠加到输入图像上的像素级视觉提示技术,用于空间推理任务。我们在3个开源MLM和4个不同的数据集上评估GoM,对绘制组件进行了广泛的消融实验,并研究了文本提示中辅助图描述的影响。我们的结果表明,GoM在解释对象位置和相对方向方面持续提高了MLM的零样本能力,在视觉问答和定位的基础准确率上提高了多达11个百分点。
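The core operation, overlaying a scene graph on the image rather than isolated marks, amounts to drawing numbered boxes plus labeled edges between their centers. A minimal PIL sketch (mark styling and label placement are assumptions, not the paper's rendering):

```python
from PIL import Image, ImageDraw

def overlay_graph_of_marks(image, boxes, edges):
    """Draw numbered object boxes plus relation edges between box centers.

    boxes: list of (x0, y0, x1, y1); edges: list of (i, j, label).
    """
    draw = ImageDraw.Draw(image)
    centers = []
    for k, (x0, y0, x1, y1) in enumerate(boxes):
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(k), fill="red")
        centers.append(((x0 + x1) / 2, (y0 + y1) / 2))
    for i, j, label in edges:
        draw.line([centers[i], centers[j]], fill="yellow", width=2)
        mid = ((centers[i][0] + centers[j][0]) / 2,
               (centers[i][1] + centers[j][1]) / 2)
        draw.text(mid, label, fill="yellow")
    return image

img = overlay_graph_of_marks(Image.new("RGB", (320, 240)),
                             [(20, 20, 120, 120), (180, 100, 300, 220)],
                             [(0, 1, "left-of")])
```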
cs.CV / 11 / 2603.06664

Accelerating Video Generation Inference with Sequential-Parallel 3D Positional Encoding Using a Global Time Index

利用全球时间索引的顺序-并行3D位置编码加速视频生成推理
Yuan, Chao, Li, Pan
Abstract
Diffusion Transformer (DiT)-based video generation models inherently suffer from bottlenecks in long video synthesis and real-time inference, which can be attributed to the use of full spatiotemporal attention. Specifically, this mechanism leads to explosive O(N^2) memory consumption and high first-frame latency. To address these issues, we implement system-level inference optimizations for a causal autoregressive video generation pipeline. We adapt the Self-Forcing causal autoregressive framework to sequence parallel inference and implement a sequence-parallel variant of the causal rotary position embedding which we refer to as Causal-RoPE SP. This adaptation enables localized computation and reduces cross-rank communication in sequence parallel execution. In addition, computation and communication pipelines are optimized through operator fusion and RoPE precomputation. Experiments conducted on an eight GPU A800 cluster show that the optimized system achieves comparable generation quality, sub-second first-frame latency, and near real-time inference speed. For generating five second 480P videos, a 1.58x speedup is achieved, thereby providing effective support for real-time interactive applications.
Chinese Translation
基于扩散变换器(Diffusion Transformer, DiT)的视频生成模型在长视频合成和实时推理中固有地面临瓶颈,这可以归因于使用全时空注意力机制。具体而言,这种机制导致了爆炸性的 O(N^2) 内存消耗和高首帧延迟。为了解决这些问题,我们为因果自回归视频生成管道实施了系统级推理优化。我们将 Self-Forcing 因果自回归框架适配为序列并行推理,并实现了一种因果旋转位置嵌入的序列并行变体,称为 Causal-RoPE SP。这种适配使得局部计算成为可能,并减少了序列并行执行中的跨 rank 通信。此外,通过算子融合和 RoPE 预计算优化了计算和通信管道。在八卡 A800 GPU 集群上进行的实验表明,优化后的系统实现了相当的生成质量、低于一秒的首帧延迟以及接近实时的推理速度。对于生成五秒的 480P 视频,实现了 1.58 倍的加速,从而为实时交互应用提供了有效支持。
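The Causal-RoPE SP idea, as described, lets each sequence-parallel rank compute its rotary phases locally from a global token index, so positions need not be exchanged across ranks. A 1D sketch of that offsetting (the paper's full 3D time/height/width factorization and fused kernels are not reproduced):

```python
import torch

def rope_angles(global_start, local_len, dim, base=10000.0):
    """Rotary angles for one rank's contiguous chunk of the token stream.

    Using a global index (rank offset + local position) keeps positions
    consistent across ranks without cross-rank communication.
    """
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    pos = torch.arange(global_start, global_start + local_len).float()
    return torch.outer(pos, inv_freq)  # (local_len, dim/2)

def apply_rope(x, angles):
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

rank, chunk = 2, 128                      # e.g. 8 ranks x 128 tokens each
ang = rope_angles(rank * chunk, chunk, dim=64)
out = apply_rope(torch.randn(chunk, 64), ang)
```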
cs.CV / 12 / 2603.06665

Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

更好的视觉,更好的思维:为什么视觉链式思维在医学中失效
Wu, Yuan, Yang, Zongxian, Qian, Jiayu, Gao, Songpan, Chen, Guanxing, Li, Qiankun, Huang, Yu-An, Huang, Zhi-An
Abstract
Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a medical perception bottleneck: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) perception anchoring via region-of-interest cues and (ii) description grounding via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT--DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available at https://github.com/TianYin123/Better_Eyes_Better_Thoughts.
Chinese Translation
大型视觉-语言模型(VLMs)在一般领域中通常受益于链式思维(CoT)提示,但其在医学视觉-语言任务中的有效性仍然未被充分探讨。我们报告了一个反直觉的趋势:在医学视觉问答中,CoT在通用模型和医学特定模型中常常表现不如直接回答(DirA)。我们将此归因于一种“医学感知瓶颈”:细微的、特定领域的线索可能削弱视觉基础,而CoT可能加剧早期感知的不确定性,而不是纠正它。为了探究这一假设,我们引入了两种无训练的推理时基础干预措施:(i)通过兴趣区域线索进行“感知锚定”,以及(ii)通过高质量文本指导进行“描述基础”。在多个基准和模型系列中,这些干预措施提高了准确性,减轻了CoT的降级,并在多个设置中逆转了CoT与DirA的反转。我们的研究结果表明,可靠的临床VLMs需要强大的视觉基础和跨模态对齐,而不仅仅是扩展基于文本的推理链。代码可在 https://github.com/TianYin123/Better_Eyes_Better_Thoughts 获取。
cs.CV / 13 / 2603.06666

SJD-PV: Speculative Jacobi Decoding with Phrase Verification for Autoregressive Image Generation

SJD-PV:带短语验证的推测雅可比解码用于自回归图像生成
Yu, Zhehao, Zhang, Baoquan, Shan, Bingqi, Liu, Xinhao, Zhou, Dongliang, Liang, Guotao, Ye, Guangming, Ye, Yunming
Abstract
Autoregressive (AR) image models have recently demonstrated remarkable generative capability, but their sequential nature results in significant inference latency. Existing training-free acceleration methods typically verify tokens independently, overlooking the strong co-occurrence patterns between adjacent visual tokens. This independence assumption often leads to contextual inconsistency and limits decoding efficiency. In this work, we introduce a novel training-free acceleration framework that performs phrase-level speculative verification, enabling the model to jointly validate multiple correlated tokens within each decoding window. To construct such phrase units, we analyze token co-occurrence statistics from the training corpus and group frequently co-occurring tokens into semantically coherent visual phrases. During inference, the proposed phrase-level verification evaluates aggregated likelihood ratios over each phrase, allowing simultaneous acceptance of multiple tokens while preserving generation quality. Extensive experiments on autoregressive text-to-image generation show that our method significantly reduces the number of function evaluations (NFE) and achieves up to 30% faster decoding without compromising visual fidelity. Our findings reveal that modeling short-range token co-occurrence provides an effective and general principle for accelerating autoregressive inference.
Chinese Translation
自回归(AR)图像模型最近展示了显著的生成能力,但其顺序特性导致了显著的推理延迟。现有的无训练加速方法通常独立验证标记,忽视了相邻视觉标记之间强烈的共现模式。这种独立假设常常导致上下文不一致,并限制了解码效率。在本研究中,我们提出了一种新颖的无训练加速框架,执行短语级推测验证,使模型能够在每个解码窗口内联合验证多个相关标记。为了构建这样的短语单元,我们分析了训练语料库中的标记共现统计数据,并将频繁共现的标记分组为语义上连贯的视觉短语。在推理过程中,所提出的短语级验证评估每个短语的聚合似然比,允许同时接受多个标记,同时保持生成质量。在自回归文本到图像生成的广泛实验中,我们的方法显著减少了函数评估次数(NFE),并在不影响视觉保真度的情况下实现了高达30%的解码速度提升。我们的研究结果表明,建模短距离标记共现为加速自回归推理提供了一种有效且通用的原则。
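Standard speculative decoding accepts each drafted token independently with probability min(1, p/q); the phrase-level variant described above instead aggregates that likelihood ratio over a whole phrase and accepts or rejects all of its tokens jointly. A hedged sketch of such a test (SJD-PV's exact acceptance rule may differ):

```python
import numpy as np

def verify_phrase(p_target, q_draft, rng):
    """Accept or reject a whole phrase of draft tokens at once.

    p_target / q_draft: per-token probabilities the target and draft
    models assign to the drafted tokens in this phrase. Accept the phrase
    with probability min(1, prod(p/q)), the phrase-level analogue of the
    per-token speculative-decoding rule.
    """
    log_ratio = np.sum(np.log(p_target) - np.log(q_draft))
    return np.log(rng.uniform()) < min(0.0, log_ratio)

rng = np.random.default_rng(0)
# A 3-token "phrase" the target model mostly agrees with:
print(verify_phrase([0.30, 0.25, 0.40], [0.28, 0.30, 0.35], rng))
```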
cs.CV / 14 / 2603.06670

calibfusion: Transformer-Based Differentiable Calibration for Radar-Camera Fusion Detection in Water-Surface Environments

calibfusion:基于变换器的可微分校准用于水面环境中的雷达-相机融合检测
Wan, Yuting, Sun, Liguo, Hao, Jiuwu, LV, Pin
Abstract
Millimeter-wave (mmWave) Radar--Camera fusion improves perception under adverse illumination and weather, but its performance is sensitive to Radar--Camera extrinsic calibration: residual misalignment biases Radar-to-image projection and degrades cross-modal aggregation for downstream 2D detection. Existing calibration and auto-calibration methods are mainly developed for road and urban scenes with abundant structures and object constraints, whereas water-surface environments feature large textureless regions, sparse and intermittent targets, and wave-/specular-induced Radar clutter, which weakens explicit object-centric matching. We propose CalibFusion, a calibration-conditioned Radar--Camera fusion detector that learns implicit extrinsic refinement end-to-end with the detection objective. CalibFusion builds a multi-frame persistence-aware Radar density representation with intensity weighting and Doppler-guided suppression of fast-varying clutter. A cross-modal transformer interaction module predicts a confidence-gated refinement of the initial extrinsics, which is integrated through a differentiable projection-and-splatting operator to generate calibration-conditioned image-plane Radar features. Experiments on WaterScenes and FLOW show improved fusion-based 2D detection and robustness under synthetic miscalibration, supported by sensitivity analyses and qualitative Radar-to-image overlays. Results on nuScenes indicate that the refinement mechanism transfers beyond water-surface scenarios.
Chinese Translation
毫米波(mmWave)雷达-相机融合在不良光照和天气条件下提高了感知能力,但其性能对雷达-相机外部校准非常敏感:残余的对齐误差会导致雷达到图像的投影偏差,并降低下游二维检测的跨模态聚合效果。现有的校准和自动校准方法主要针对具有丰富结构和物体约束的道路和城市场景开发,而水面环境则具有大面积无纹理区域、稀疏且间歇性的目标,以及由波浪/镜面反射引起的雷达杂波,这削弱了显式的以物体为中心的匹配。我们提出了CalibFusion,一种基于校准条件的雷达-相机融合检测器,它通过检测目标端到端地学习隐式外部校准的精细化。CalibFusion构建了一个多帧持久性感知的雷达密度表示,结合强度加权和多普勒引导的快速变化杂波抑制。一个跨模态变换器交互模块预测初始外部参数的置信度门控精细化,并通过可微分的投影和溅射操作整合,以生成校准条件下的图像平面雷达特征。在WaterScenes和FLOW上的实验表明,在合成误校准下,基于融合的二维检测和鲁棒性得到了改善,支持敏感性分析和定性雷达到图像的叠加结果。在nuScenes上的结果表明,精细化机制超越了水面场景的应用。
cs.CV / 15 / 2603.06672

Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study

语义噪声初始化是否能从图像转移到视频?一项配对诊断研究
Jing, Yixiao, Zhang, Chaoyu, Zhong, Zixuan, Huang, Peizhou
Abstract
Semantic noise initialization has been reported to improve robustness and controllability in image diffusion models. Whether these gains transfer to text-to-video (T2V) generation remains unclear, since temporal coupling can introduce extra degrees of freedom and instability. We benchmark semantic noise initialization against standard Gaussian noise using a frozen VideoCrafter-style T2V diffusion backbone and VBench on 100 prompts. Using prompt-level paired tests with bootstrap confidence intervals and a sign-flip permutation test, we observe a small positive trend on temporal-related dimensions; however, the 95 percent confidence interval includes zero (p ~ 0.17) and the overall score remains on par with the baseline. To understand this outcome, we analyze the induced perturbations in noise space and find patterns consistent with weak or unstable signal. We recommend prompt-level paired evaluation and noise-space diagnostics as standard practice when studying initialization schemes for T2V diffusion.
Chinese Translation
语义噪声初始化已被报道能够提高图像扩散模型的鲁棒性和可控性。然而,这些收益是否能够转移到文本到视频(T2V)生成尚不明确,因为时间耦合可能引入额外的自由度和不稳定性。我们使用冻结的 VideoCrafter 风格 T2V 扩散骨干网络和 VBench 对 100 个提示进行了语义噪声初始化与标准高斯噪声的基准测试。通过使用带有自助法(bootstrap)置信区间的提示级配对测试和符号翻转置换测试,我们观察到在与时间相关的维度上有小幅正趋势;然而,95% 置信区间包含零(p ~ 0.17),整体得分与基线持平。为了理解这一结果,我们分析了噪声空间中引入的扰动,并发现与弱或不稳定信号一致的模式。我们建议在研究 T2V 扩散的初始化方案时,将提示级配对评估和噪声空间诊断作为标准实践。
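The sign-flip permutation test used for the paired comparison is a standard procedure and easy to reproduce. The sketch below operates on per-prompt score differences; the synthetic data is illustrative only.

```python
import numpy as np

def sign_flip_test(diffs, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on paired prompt-level scores.

    diffs: per-prompt score differences (semantic init minus Gaussian).
    Under the null of no effect, each difference's sign is exchangeable,
    so we compare the observed mean against means under random sign flips.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    obs = abs(diffs.mean())
    flips = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = np.abs((flips * diffs).mean(axis=1))
    return (1 + np.sum(null >= obs)) / (n_perm + 1)

# 100 paired benchmark differences with a weak positive trend:
d = np.random.default_rng(1).normal(0.002, 0.015, size=100)
print(f"p = {sign_flip_test(d):.3f}")
```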
cs.CV / 16 / 2603.06673

Unmixing microinfrared spectroscopic images of cross-sections of historical oil paintings

历史油画横截面的微红外光谱图像解混
Pande, Shivam, Nadisic, Nicolas, Mederos-Henry, Francisco, Pizurica, Aleksandra
Abstract
Spectroscopic imaging (SI) has become central to heritage science because it enables non-invasive, spatially resolved characterisation of materials in artefacts. In particular, attenuated total reflection Fourier transform infrared microscopy (ATR-$\mu$FTIR) is widely used to analyse painting cross-sections, where a spectrum is recorded at each pixel to form a hyperspectral image (HSI). Interpreting these data is difficult: spectra are often mixtures of several species in heterogeneous, multi-layered and degraded samples, and current practice still relies heavily on manual comparison with reference libraries. This workflow is slow, subjective and hard to scale. We propose an unsupervised CNN autoencoder for blind unmixing of ATR-$\mu$FTIR HSIs, estimating endmember spectra and their abundance maps while exploiting local spatial structure through patch-based modelling. To reduce sensitivity to atmospheric and acquisition artefacts across $>1500$ bands, we introduce a weighted spectral angle distance (WSAD) loss with automatic band-reliability weights derived from robust measures of spatial flatness, neighbour agreement and spectral roughness. Compared with standard SAD training, WSAD improves interpretability in contamination-prone spectral regions. We demonstrate the method on an ATR-$\mu$FTIR cross-section from the Ghent Altarpiece attributed to the Van Eyck brothers.
Chinese Translation
光谱成像(SI)已成为遗产科学的核心,因为它能够对文物中的材料进行非侵入性、空间分辨的表征。特别是,衰减全反射傅里叶变换红外显微镜(ATR-μFTIR)被广泛用于分析绘画的横截面,在每个像素处记录光谱以形成高光谱图像(HSI)。解释这些数据是困难的:光谱通常是异质、多层和退化样品中几种成分的混合,而当前的实践仍然在很大程度上依赖于与参考库的手动比较。这一工作流程缓慢、主观且难以扩展。我们提出了一种无监督的卷积神经网络(CNN)自编码器,用于ATR-μFTIR HSI的盲解混,估计端元光谱及其丰度图,同时通过基于补丁的建模利用局部空间结构。为了降低在超过1500个波段上对大气和采集伪影的敏感性,我们引入了一种加权光谱角距离(WSAD)损失,其波段可靠性权重由空间平坦度、邻域一致性和光谱粗糙度的稳健度量自动导出。与标准的光谱角距离(SAD)训练相比,WSAD在易受污染的光谱区域提高了可解释性。我们在归因于范艾克兄弟的根特祭坛画的ATR-μFTIR横截面上演示了该方法。
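A weighted spectral angle distance follows naturally from inserting per-band weights into the usual SAD cosine. The sketch below uses that natural form; the paper's automatic derivation of the weights (from spatial flatness, neighbour agreement, and spectral roughness) is not reproduced here.

```python
import torch

def wsad(x, y, w, eps=1e-8):
    """Weighted spectral angle distance between spectra x and y.

    w: per-band reliability weights in [0, 1], down-weighting bands
    corrupted by atmospheric or acquisition artefacts.
    """
    num = (w * x * y).sum(dim=-1)
    den = torch.sqrt((w * x * x).sum(dim=-1)) * \
          torch.sqrt((w * y * y).sum(dim=-1))
    return torch.arccos((num / (den + eps)).clamp(-1 + eps, 1 - eps))

bands = 1500
x, y = torch.rand(8, bands), torch.rand(8, bands)  # batch of spectra
w = torch.rand(bands)                              # band reliabilities
print(wsad(x, y, w).shape)  # torch.Size([8])
```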
cs.CV / 17 / 2603.06674

AutoFigure-Edit: Generating Editable Scientific Illustration

AutoFigure-Edit:生成可编辑的科学插图
Lin, Zhen, Xie, Qiujie, Zhu, Minjun, Li, Shichen, Sun, Qiyao, Gu, Enhao, Ding, Yiran, Sun, Ke, Guo, Fang, Lu, Panzhong, Ning, Zhiyuan, Weng, Yixuan, Zhang, Yue
Abstract
High-quality scientific illustrations are essential for communicating complex scientific and technical concepts, yet existing automated systems remain limited in editability, stylistic controllability, and efficiency. We present AutoFigure-Edit, an end-to-end system that generates fully editable scientific illustrations from long-form scientific text while enabling flexible style adaptation through user-provided reference images. By combining long-context understanding, reference-guided styling, and native SVG editing, it enables efficient creation and refinement of high-quality scientific illustrations. To facilitate further progress in this field, we release the video at https://youtu.be/10IH8SyJjAQ, full codebase at https://github.com/ResearAI/AutoFigure-Edit and provide a website for easy access and interactive use at https://deepscientist.cc/.
Chinese Translation
高质量的科学插图对于传达复杂的科学和技术概念至关重要,但现有的自动化系统在可编辑性、风格可控性和效率方面仍然有限。我们提出了AutoFigure-Edit,这是一种端到端系统,可以从长篇科学文本生成完全可编辑的科学插图,同时通过用户提供的参考图像实现灵活的风格适应。通过结合长文本理解、参考引导的风格化和原生SVG编辑,该系统实现了高效创建和完善高质量科学插图的能力。为了促进该领域的进一步发展,我们发布了视频(https://youtu.be/10IH8SyJjAQ),完整代码库(https://github.com/ResearAI/AutoFigure-Edit),并提供了一个网站以便于访问和互动使用(https://deepscientist.cc/)。
cs.CV / 18 / 2603.06676

XAI and Few-shot-based Hybrid Classification Model for Plant Leaf Disease Prognosis

基于可解释人工智能和少样本学习的植物叶片病害预测混合分类模型
Joseph, Diana Susan, Pawar, Pranav M, Muthalagu, Raja, Mukharjee, Mithun
Abstract
Performing a timely and accurate identification of crop diseases is vital to maintain agricultural productivity and food security. The current work presents a hybrid few-shot learning model that integrates Explainable Artificial Intelligence (XAI) and Few-Shot Learning (FSL) to address the challenge of identifying and classifying the disease stages of maize, rice, and wheat leaves under limited annotated data conditions. The proposed model integrates Siamese and Prototypical Networks within an episodic training paradigm to effectively learn discriminative disease features from a few examples. To ensure model transparency and trustworthiness, Gradient-weighted Class Activation Mapping (Grad-CAM) is employed for visualizing key decision regions in the leaf images, offering interpretable insights into the classification process. Experimental evaluations on custom few-shot datasets developed in the study prove that the model consistently achieves high accuracy, precision, recall, and F1-scores, frequently exceeding 92% across various disease stages. Comparative analyses against baseline FSL models further confirm the superior performance and explainability of the proposed approach. The framework offers a promising solution for real-world, data-constrained agricultural disease monitoring applications.
Chinese Translation
及时准确地识别作物病害对于维持农业生产力和食品安全至关重要。本研究提出了一种混合少样本学习模型,该模型整合了可解释人工智能(XAI)和少样本学习(FSL),旨在解决在有限标注数据条件下识别和分类玉米、稻米和小麦叶片病害阶段的挑战。所提出的模型在情节训练范式中整合了孪生网络(Siamese Networks)和原型网络(Prototypical Networks),以有效地从少量样本中学习具有区分性的病害特征。为了确保模型的透明性和可信性,采用了梯度加权类激活映射(Grad-CAM)来可视化叶片图像中的关键决策区域,从而提供对分类过程的可解释性见解。在研究中开发的定制少样本数据集上的实验评估证明,该模型在不同病害阶段的准确率、精确率、召回率和F1分数均保持在高水平,常常超过92%。与基线FSL模型的比较分析进一步确认了所提出方法的优越性能和可解释性。该框架为现实世界中数据受限的农业病害监测应用提供了一个有前景的解决方案。
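Of the two episodic components, the Prototypical-Network half has a compact canonical form: class prototypes are mean support embeddings and queries are scored by distance to them. A minimal sketch (the embedding network is omitted and sizes are illustrative):

```python
import torch

def prototypical_episode(support, support_y, query, n_classes):
    """One few-shot episode: classify queries by distance to prototypes.

    support: (n_support, d) embeddings with labels in [0, n_classes);
    query: (n_query, d) embeddings.
    """
    protos = torch.stack([support[support_y == c].mean(dim=0)
                          for c in range(n_classes)])   # (C, d)
    dists = torch.cdist(query, protos)                  # (n_query, C)
    return (-dists).log_softmax(dim=-1)                 # class log-probs

sup = torch.randn(15, 128)                  # 5-way, 3-shot episode
sup_y = torch.arange(5).repeat_interleave(3)
print(prototypical_episode(sup, sup_y, torch.randn(10, 128), 5).shape)
```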
cs.CV / 19 / 2603.06677

Chart Deep Research in LVLMs via Parallel Relative Policy Optimization

通过并行相对策略优化深入研究大型视觉语言模型中的图表
Tang, Jiajin, Gaoyang, Wang, Wenjie, Yang, Sibei, Chen, Xing
Abstract
With the rapid advancement of data science, charts have evolved from simple numerical presentation tools to essential instruments for insight discovery and decision-making support. However, current chart data intelligence exhibits significant limitations in deep research capabilities, with existing methods predominantly addressing shallow tasks such as visual recognition or factual question-answering, rather than the complex reasoning and high-level data analysis that deep research requires. This limitation stems from two primary technical bottlenecks: at the training level, existing post-training techniques exhibit deficiencies in handling multi-dimensional reward signal interference and heterogeneous data gradient conflicts, preventing models from achieving balanced development across multiple capability dimensions; at the evaluation level, current methods remain limited to factual retrieval and basic computation, failing to assess end-to-end analytic reasoning and other deep research capabilities. To address the training challenge, we propose PRPO, which performs parallel optimization across reward dimensions and capability partitioning across data types, effectively disentangling conflicts between heterogeneous data and multi-dimensional reward signals while ensuring optimization stability. For the evaluation challenge, we construct MCDR-Bench based on the ``error uniqueness principle," transforming subjective generation assessment into objective error identification through controllable error injection, enabling quantifiable evaluation of deep research capabilities. Experimental validation confirms that the proposed PRPO and MCDR-Bench jointly establish a unified framework that systematically advances chart deep research through enhanced collaborative training and objective evaluation.
Chinese Translation
随着数据科学的快速发展,图表已从简单的数值展示工具演变为洞察发现和决策支持的重要工具。然而,目前的图表数据智能在深度研究能力方面存在显著局限,现有方法主要集中于视觉识别或事实问答等浅层任务,而非深度研究所需的复杂推理和高级数据分析。这一局限源于两个主要技术瓶颈:在训练层面,现有的后训练技术在处理多维奖励信号干扰和异构数据梯度冲突方面存在不足,导致模型无法在多个能力维度上实现平衡发展;在评估层面,目前的方法仍局限于事实检索和基本计算,未能评估端到端的分析推理及其他深度研究能力。为了解决训练挑战,我们提出了PRPO(Parallel Relative Policy Optimization),该方法在奖励维度上进行并行优化,并在数据类型上进行能力划分,有效解开异构数据与多维奖励信号之间的冲突,同时确保优化的稳定性。针对评估挑战,我们基于“错误唯一性原则”构建了MCDR-Bench,通过可控的错误注入将主观生成评估转化为客观错误识别,实现深度研究能力的量化评估。实验验证表明,所提出的PRPO和MCDR-Bench共同建立了一个统一框架,通过增强的协同训练和客观评估系统地推进图表的深度研究。
cs.CV / 20 / 2603.06680

VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images

VB:图像中可见性与透视推理的可见性基准
Tripathi, Neil
Abstract
We present VB, a benchmark that tests whether vision-language models can determine what is and is not visible in a photograph, and abstain when a human viewer cannot reliably answer. Each item pairs a single photo with a short yes/no visibility claim; the model must output VISIBLY_TRUE, VISIBLY_FALSE, or ABSTAIN, together with a confidence score. Items are organized into 100 families using a 2x2 design that crosses a minimal image edit with a minimal text edit, yielding 300 headline evaluation cells. Unlike prior unanswerable-VQA benchmarks, VB tests not only whether a question is unanswerable but why (via reason codes tied to specific visibility factors), and uses controlled minimal edits to verify that model judgments change when and only when the underlying evidence changes. We score models on confidence-aware accuracy with abstention (CAA), minimal-edit flip rate (MEFR), confidence-ranked selective prediction (SelRank), and second-order perspective reasoning (ToMAcc); all headline numbers are computed on the strict XOR subset (three cells per family, 300 scored items per model). We evaluate nine models spanning flagship and prior-generation closed-source systems, and open-source models from 8B to 12B parameters. GPT-4o and Gemini 3.1 Pro effectively tie for the best composite score (0.728 and 0.727), followed by Gemini 2.5 Pro (0.678). The best open-source model, Gemma 3 12B (0.505), surpasses one prior-generation closed-source system. Text-flip robustness exceeds image-flip robustness for six of nine models, and confidence calibration varies substantially: GPT-4o and Gemini 2.5 Pro achieve similar accuracy yet differ sharply in selective prediction quality.
Chinese Translation
我们提出了VB,这是一个基准测试,旨在检验视觉-语言模型是否能够判断照片中什么是可见的,什么是不可见的,并在当人类观察者无法可靠回答时选择不作答。每个项目将一张单独的照片与一个简短的是/否可见性声明配对;模型必须输出VISIBLY_TRUE、VISIBLY_FALSE或ABSTAIN,并附上置信度分数。项目被组织成100个家族,采用2x2设计,将最小图像编辑与最小文本编辑相结合,生成300个评估单元。与之前的不可回答VQA基准不同,VB不仅测试问题是否不可回答,还探讨其原因(通过与特定可见性因素相关的理由代码),并使用受控的最小编辑来验证模型判断何时以及仅在基础证据变化时发生变化。我们在考虑置信度的准确性与不作答(CAA)、最小编辑翻转率(MEFR)、置信度排序选择性预测(SelRank)和二阶透视推理(ToMAcc)等方面对模型进行评分;所有的主要数据都是在严格的XOR子集上计算的(每个家族三个单元,每个模型300个评分项目)。我们评估了九个模型,这些模型涵盖了旗舰和先前一代的闭源系统,以及从8B到12B参数的开源模型。GPT-4o和Gemini 3.1 Pro的综合得分有效地并列第一(0.728和0.727),其次是Gemini 2.5 Pro(0.678)。最佳开源模型Gemma 3 12B(0.505)超过了一个先前一代的闭源系统。在九个模型中,文本翻转的鲁棒性超过了图像翻转的鲁棒性,且置信度校准差异显著:GPT-4o和Gemini 2.5 Pro在准确性上相似,但在选择性预测质量上差异明显。
cs.CV / 21 / 2603.06681

RADAR: A Multimodal Benchmark for 3D Image-Based Radiology Report Review

RADAR:基于3D图像的放射学报告审查的多模态基准
Sun, Zhaoyi, Jagtiani, Minal, Yim, Wen-wai, Xia, Fei, Gunn, Martin, Yetisgen, Meliha, Abacha, Asma Ben
Abstract
Radiology reports for the same patient examination may contain clinically meaningful discrepancies arising from interpretation differences, reporting variability, or evolving assessments. Systematic analysis of such discrepancies is important for quality assurance, clinical decision support, and multimodal model development, yet remains limited by the lack of standardized benchmarks. We present RADAR, a multimodal benchmark for radiology report discrepancy analysis that pairs 3D medical images with a preliminary report and corresponding candidate edits for the same study. The dataset reflects a standard clinical workflow in which trainee radiologists author preliminary reports that are subsequently reviewed and revised by attending radiologists. RADAR defines a structured discrepancy assessment task requiring models to evaluate proposed edits by determining image-level agreement, assessing clinical severity, and classifying edit type (correction, addition, or clarification). In contrast to prior work emphasizing binary error detection or comparison against fully independent reference reports, RADAR targets fine-grained clinical reasoning and image-text alignment at the report review stage. The benchmark consists of expert-annotated abdominal CT examinations and is accompanied by standardized evaluation protocols to support systematic comparison of multimodal models. RADAR provides a clinically grounded testbed for evaluating multimodal systems as reviewers of radiology report edits.
Chinese Translation
同一患者检查的放射学报告可能因解读差异、报告变异或评估演变而产生临床上有意义的差异。对这些差异的系统分析对于质量保证、临床决策支持和多模态模型开发至关重要,但由于缺乏标准化基准,相关研究仍然有限。我们提出了RADAR,这是一个用于放射学报告差异分析的多模态基准,它将3D医学图像与初步报告及相应的候选编辑配对,针对同一研究。该数据集反映了标准的临床工作流程,其中实习放射科医师撰写初步报告,随后由主治放射科医师进行审查和修订。RADAR定义了一个结构化的差异评估任务,要求模型通过确定图像级一致性、评估临床严重性和分类编辑类型(更正、添加或澄清)来评估提议的编辑。与之前强调二元错误检测或与完全独立参考报告比较的工作不同,RADAR旨在报告审查阶段实现细粒度的临床推理和图像-文本对齐。该基准由专家注释的腹部CT检查组成,并附有标准化评估协议,以支持多模态模型的系统比较。RADAR为评估多模态系统作为放射学报告编辑审查者提供了一个临床基础的测试平台。
cs.CV / 22 / 2603.06683

ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction

ECHO:通过多智能体协作进行以事件为中心的超图操作以实现多媒体事件提取
Chu, Hailong, Zhang, Shuo, Chu, Yunlong, Huang, Shutai, Zhang, Xingyue, Yan, Tinghe, Zhang, Jinsong, Li, Lei
Abstract
Multimedia Event Extraction (M2E2) involves extracting structured event records from both textual and visual content. Existing approaches, ranging from specialized architectures to direct Large Language Model (LLM) prompting, typically rely on a linear, end-to-end generation and thus suffer from cascading errors: early cross-modal misalignments often corrupt downstream role assignment under strict grounding constraints. We propose ECHO (Event-Centric Hypergraph Operations), a multi-agent framework that iteratively refines a shared Multimedia Event Hypergraph (MEHG), which serves as an explicit intermediate structure for multimodal event hypotheses. Unlike dialogue-centric frameworks, ECHO coordinates specialized agents by applying atomic hypergraph operations to the MEHG. Furthermore, we introduce a Link-then-Bind strategy that enforces deferred commitment: agents first identify relevant arguments and only then determine their precise roles, mitigating incorrect grounding and limiting error propagation. Extensive experiments on the M2E2 benchmark show that ECHO significantly outperforms the state-of-the-art (SOTA) : with Qwen3-32B, it achieves a 7.3% and 15.5% improvement in average event mention and argument role F1, respectively.
Chinese Translation
多媒体事件提取(M2E2)涉及从文本和视觉内容中提取结构化事件记录。现有方法,从专门架构到直接的大型语言模型(LLM)提示,通常依赖于线性、端到端的生成,因此容易出现级联错误:早期的跨模态不对齐常常在严格的基础约束下破坏下游角色分配。我们提出了ECHO(以事件为中心的超图操作),这是一个多智能体框架,通过迭代优化共享的多媒体事件超图(MEHG),该超图作为多模态事件假设的明确中间结构。与以对话为中心的框架不同,ECHO通过对MEHG应用原子超图操作来协调专门的智能体。此外,我们引入了一种“先链接后绑定”(Link-then-Bind)策略,强制延迟承诺:智能体首先识别相关参数,然后再确定其精确角色,从而减轻错误的基础和限制错误传播。在M2E2基准上的大量实验表明,ECHO显著优于现有最先进技术(SOTA):使用Qwen3-32B时,平均事件提及和参数角色F1分别提高了7.3%和15.5%。
cs.CV / 23 / 2603.06684

Three-dimensional reconstruction and segmentation of an aggregate stockpile for size and shape analyses

用于尺寸和形状分析的集料堆的三维重建与分割
Tutumluer, Erol, Huang, Haohang, Luo, Jiayi, Qamhia, Issam, Hart, John M.
Abstract
Aggregate size and shape are key properties for determining quality of aggregate materials used in road construction and transportation geotechnics applications. The composition and packing, layer stiffness, and load response are all influenced by these morphological characteristics of aggregates. Many aggregate imaging systems developed to date only focus on analyses of individual or manually separated aggregate particles. There is a need to develop a convenient and affordable system for acquiring 3D aggregate information from stockpiles in the field. This paper presents an innovative 3D imaging approach for potential field evaluation of large-sized aggregates, whereby engineers can perform inspection by taking videos/images with mobile devices such as smartphone cameras. The approach leverages Structure-from-Motion (SfM) techniques to reconstruct the stockpile surface as 3D spatial data, i.e. point cloud, and uses a 3D segmentation algorithm to separate and extract individual aggregates from the reconstructed stockpile. The preliminary results presented in this paper demonstrate the future potential of using 3D aggregate size and shape information for onsite Quality Assurance/Quality Control (QA/QC) tasks.
Chinese Translation
集料的尺寸和形状是确定用于道路建设和运输土木工程应用的集料材料质量的关键属性。这些集料的形态特征影响着其组成和堆积、层刚度和荷载响应。迄今为止,许多开发的集料成像系统仅关注单个或手动分离的集料颗粒的分析。因此,需要开发一种方便且经济的系统,以从现场的堆料中获取三维集料信息。本文提出了一种创新的三维成像方法,用于对大尺寸集料进行潜在的现场评估,工程师可以通过使用智能手机相机等移动设备拍摄视频/图像进行检查。该方法利用运动恢复结构(Structure-from-Motion, SfM)技术将堆料表面重建为三维空间数据,即点云,并使用三维分割算法从重建的堆料中分离和提取单个集料。本文呈现的初步结果展示了在现场质量保证/质量控制(Quality Assurance/Quality Control, QA/QC)任务中使用三维集料尺寸和形状信息的未来潜力。
cs.CV / 24 / 2603.06687

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

TimeSpot:在真实世界环境中基准测试视觉-语言模型的地理-时间理解能力
Wasi, Azmine Toushik, Ridoy, Shahriyar Zaman, Tonmoy, Koushik Ahamed, Tshering, Kinga, Hasan, S. M. Muhtasimul, Faisal, Wahid, Mohiuddin, Tasnim, Parvez, Md Rizwan
Abstract
Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at: https://TimeSpot-GT.github.io.
Chinese Translation
地理-时间理解,即仅通过视觉输入推断位置、时间和上下文属性的能力,是灾害管理、交通规划、具身导航、世界建模和地理教育等应用的基础。尽管近期的视觉-语言模型(VLMs)在利用地标和交通标志等线索进行图像地理定位方面取得了进展,但它们在推理时间信号和物理基础空间线索方面的能力仍然有限。为了解决这一问题,我们提出了TimeSpot,一个用于评估VLMs在真实世界中地理-时间推理能力的基准。TimeSpot包含来自80个国家的1,455张地面图像,并要求从视觉证据中结构化预测时间属性(季节、月份、一天中的时间、日照阶段)和地理属性(大洲、国家、气候区、环境类型、经纬度)。它还包括测试在真实世界不确定性下物理合理性的时空推理任务。对最新的开源和闭源VLMs的评估显示其性能较低,尤其是在时间推理方面。尽管监督微调带来了改进,但结果仍然不足,突显了实现稳健、物理基础的地理-时间理解的新方法的必要性。TimeSpot可在以下网址获取:https://TimeSpot-GT.github.io。
cs.CV / 25 / 2603.06688

Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

叙事编织者:朝着可控的长距离视觉一致性与多模态条件化迈进
Yao, Zhengjian, Li, Yongzhi, Gao, Xinyuan, Chen, Quan, Jiang, Peng, Lu, Yanye
Abstract
We present "Narrative Weaver", a novel framework that addresses a fundamental challenge in generative AI: achieving multi-modal controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences - a critical limitation for real-world applications such as filmmaking and e-commerce advertising. Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift. To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data. Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD) - the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations. Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method's superiority while opening new possibilities for AI-driven content creation.
Chinese Translation
我们提出了“叙事编织者”,这是一个新颖的框架,旨在解决生成性人工智能中的一个基本挑战:实现多模态可控、长距离和一致的视觉内容生成。尽管现有模型在生成高保真短格式视觉内容方面表现出色,但在维护叙事连贯性和视觉一致性方面却面临困难,这对于电影制作和电子商务广告等现实应用来说是一个关键限制。叙事编织者引入了第一个整体解决方案,完美整合了三种基本能力:细粒度控制、自动叙事规划和长距离一致性。我们的架构结合了用于高层次叙事规划的多模态大型语言模型(Multimodal Large Language Model, MLLM)与一个新颖的细粒度控制模块,该模块配备动态记忆库,以防止视觉漂移。为了实现实际部署,我们开发了一种渐进式的多阶段训练策略,能够高效利用现有的预训练模型,即使在有限的训练数据下也能达到最先进的性能。鉴于缺乏合适的评估基准,我们构建并发布了电子商务广告视频故事板数据集(E-commerce Advertising Video Storyboard Dataset, EAVSD)——这是该任务的第一个综合数据集,包含超过33万张高质量图像及丰富的叙事注释。通过在三个不同场景(可控的多场景生成、自动叙事和电子商务广告)中的广泛实验,我们展示了我们方法的优越性,同时为人工智能驱动的内容创作开辟了新的可能性。
cs.CV / 26 / 2603.06689

High-Resolution Image Reconstruction with Unsupervised Learning and Noisy Data Applied to Ion-Beam Dynamics for Particle Accelerators

基于无监督学习和噪声数据的高分辨率图像重建及其在粒子加速器离子束动力学中的应用
Osswald, Francis, Chahbaoui, Mohammed, Liang, Xinyi
Abstract
Image reconstruction in the presence of severe degradation remains a challenging inverse problem, particularly in beam diagnostics for high-energy physics accelerators. As modern facilities demand precise detection of beam halo structures to control losses, traditional analysis tools have reached their performance limits. This work reviews existing image-processing techniques for data cleaning, contour extraction, and emittance reconstruction, and introduces a novel approach based on convolutional filtering and neural networks with optimized early-stopping strategies in order to control overfitting. Despite the absence of training datasets, the proposed unsupervised framework achieves robust denoising and high-fidelity reconstruction of beam emittance images under low signal-to-noise conditions. The method extends measurable amplitudes beyond seven standard deviations, enabling unprecedented halo resolution.
Chinese Translation
在严重退化的情况下进行图像重建仍然是一个具有挑战性的逆问题,特别是在高能物理加速器的束流诊断中。随着现代设施对束流光晕结构的精确检测需求的增加,传统分析工具已达到其性能极限。本研究回顾了现有的数据清洗、轮廓提取和发射度重建的图像处理技术,并提出了一种基于卷积滤波和神经网络的新方法,采用优化的早停策略以控制过拟合。尽管缺乏训练数据集,所提出的无监督框架在低信噪比条件下实现了强大的去噪和高保真度的束流发射度图像重建。该方法将可测量的幅度扩展到超过七个标准差,从而实现前所未有的光晕分辨率。
cs.CV / 27 / 2603.06690

Spectral Gaps and Spatial Priors: Studying Hyperspectral Downstream Adaptation Using TerraMind

光谱间隙与空间先验:使用 TerraMind 研究高光谱下游适应性
Leonardi, Julia Anna, Jakubik, Johannes, Fraccaro, Paolo, Brovelli, Maria Antonia
Abstract
Geospatial Foundation Models (GFMs) typically lack native support for Hyperspectral Imaging (HSI) due to the complexity and sheer size of high-dimensional spectral data. This study investigates the adaptability of TerraMind, a multimodal GFM, to address HSI downstream tasks \emph{without} HSI-specific pretraining. Therefore, we implement and compare two channel adaptation strategies: Naive Band Selection and physics-aware Spectral Response Function (SRF) grouping. Overall, our results indicate a general superiority of deep learning models with native support of HSI data. Our experiments also demonstrate the ability of TerraMind to adapt to HSI downstream tasks through band selection with moderate performance decline. Therefore, the findings of this research establish a critical baseline for HSI integration, motivating the need for native spectral tokenization in future multimodal model architectures.
Chinese Translation
地理空间基础模型(GFMs)通常由于高维光谱数据的复杂性和庞大规模,缺乏对高光谱成像(HSI)的原生支持。本研究探讨了多模态 GFM TerraMind 的适应性,以在无需 HSI 特定预训练的情况下应对 HSI 下游任务。因此,我们实施并比较了两种通道适应策略:简单波段选择和物理感知的光谱响应函数(SRF)分组。总体而言,我们的结果表明,原生支持 HSI 数据的深度学习模型总体上更具优势。我们的实验还展示了 TerraMind 通过波段选择适应 HSI 下游任务的能力,尽管性能有适度下降。因此,本研究的发现为 HSI 集成建立了一个关键基准,激励未来多模态模型架构中对原生光谱标记化的需求。
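Physics-aware SRF grouping collapses many narrow hyperspectral bands into the broad channels a multispectral model expects, weighting each band by the target channel's spectral response rather than picking one band per channel. A sketch assuming Gaussian SRFs (real sensors publish tabulated response curves, and the channel centers/widths below are approximations):

```python
import numpy as np

def srf_group(hsi, band_centers, channel_centers, channel_fwhm):
    """Collapse hyperspectral bands into broad channels via Gaussian SRFs.

    hsi: (..., n_bands) reflectances sampled at band_centers (nm).
    Each output channel is a response-weighted average of the narrow
    bands it overlaps.
    """
    out = []
    for mu, fwhm in zip(channel_centers, channel_fwhm):
        sigma = fwhm / 2.355
        w = np.exp(-0.5 * ((band_centers - mu) / sigma) ** 2)
        out.append((hsi * w).sum(-1) / w.sum())
    return np.stack(out, axis=-1)

wl = np.linspace(400, 2500, 200)          # hypothetical HSI band centers
cube = np.random.rand(64, 64, 200)
# Approximate blue/green/red/NIR channel centers and widths (nm):
ms = srf_group(cube, wl, [490, 560, 665, 842], [65, 35, 30, 115])
print(ms.shape)  # (64, 64, 4)
```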
cs.CV / 28 / 2603.06691

One-Shot Badminton Shuttle Detection for Mobile Robots

移动机器人的一次性羽毛球检测
Dipner, Florentin, Talbot, William, Tuna, Turcan, Cramariuc, Andrei, Hutter, Marco
Abstract
This paper presents a robust one-shot badminton shuttlecock detection framework for non-stationary robots. To address the lack of egocentric shuttlecock detection datasets, we introduce a dataset of 20,510 semi-automatically annotated frames captured across 11 distinct backgrounds in diverse indoor and outdoor environments, and categorize each frame into one of three difficulty levels. For labeling, we present a novel semi-automatic annotation pipeline, that enables efficient labeling from stationary camera footage. We propose a metric suited to our downstream use case and fine-tune a YOLOv8 network optimized for real-time shuttlecock detection, achieving an F1-score of 0.86 under our metric in test environments similar to training, and 0.70 in entirely unseen environments. Our analysis reveals that detection performance is critically dependent on shuttlecock size and background texture complexity. Qualitative experiments confirm their applicability to robots with moving cameras. Unlike prior work with stationary camera setups, our detector is specifically designed for the egocentric, dynamic viewpoints of mobile robots, providing a foundational building block for downstream tasks, including tracking, trajectory estimation, and system (re)-initialization.
Chinese Translation
本文提出了一种针对非静态机器人的鲁棒性一次性羽毛球检测框架。为了解决缺乏自我中心羽毛球检测数据集的问题,我们引入了一个包含20,510帧半自动标注图像的数据集,这些图像是在多样的室内和室外环境中捕获的,涵盖了11种不同背景,并将每帧图像分为三种难度级别。为了标注,我们提出了一种新颖的半自动标注流程,使得从静态摄像机视频中高效标注成为可能。我们提出了一种适合于我们下游应用场景的度量标准,并对优化用于实时羽毛球检测的YOLOv8网络进行了微调,在与训练环境相似的测试环境中达到了0.86的F1分数,而在完全未见过的环境中达到了0.70。我们的分析表明,检测性能在很大程度上依赖于羽毛球的大小和背景纹理的复杂性。定性实验验证了其在移动摄像机的机器人中的适用性。与之前使用静态摄像机设置的工作不同,我们的检测器专门为移动机器人的自我中心动态视角设计,为下游任务(包括跟踪、轨迹估计和系统(重新)初始化)提供了基础构件。
cs.CV / 29 / 2603.06693

Soft Equivariance Regularization for Invariant Self-Supervised Learning

用于不变自监督学习的软等变正则化
Lee, Joohyung, Kim, Changhun, Kim, Hyunsu, Lee, Kwanhyung, Lee, Juho
Abstract
Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations. While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and spatially sensitive transfer. A growing body of work, therefore, augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same final representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance scores but degrades ImageNet-1k linear evaluation, motivating a layer-decoupled design. Motivated by this trade-off, we propose Soft Equivariance Regularization (SER), a plug-in regularizer that decouples where invariance and equivariance are enforced: we keep the base SSL objective unchanged on the final embedding, while softly encouraging equivariance on an intermediate spatial token map via analytically specified group actions $\rho_g$ applied directly in feature space. SER learns/predicts no per-sample transformation codes/labels, requires no auxiliary transformation-prediction head, and adds only 1.008x training FLOPs. On ImageNet-1k ViT-S/16 pretraining, SER improves MoCo-v3 by +0.84 Top-1 in linear evaluation under a strictly matched 2-view setting and consistently improves DINO and Barlow Twins; under matched view counts, SER achieves the best ImageNet-1k linear-eval Top-1 among the compared invariance+equivariance add-ons. SER further improves ImageNet-C/P by +1.11/+1.22 Top-1 and frozen-backbone COCO detection by +1.7 mAP. Finally, applying the same layer-decoupling recipe to existing invariance+equivariance baselines improves their accuracy, suggesting layer decoupling as a general design principle for combining invariance and equivariance.
Chinese Translation
自监督学习(SSL)通常学习对语义保持增强不变的表示。虽然这种方法在识别方面有效,但强制执行强不变性可能会抑制依赖于变换的结构,而这种结构对几何扰动的鲁棒性和空间敏感的迁移十分有用。因此,越来越多的研究在基于不变性的SSL中加入等变目标,但这些目标通常施加在同一个最终表示上。我们在这种耦合设置中经验性地观察到一种权衡:将等变正则化推向更深层可以提高等变性得分,但会降低ImageNet-1k的线性评估结果,这促使我们采用层解耦的设计。基于这种权衡,我们提出了软等变正则化(SER),这是一种插件式正则化器,可以解耦不变性和等变性的施加位置:我们在最终嵌入上保持基础SSL目标不变,同时通过在特征空间中直接应用解析指定的群作用 $\rho_g$,在中间空间标记图上软性地鼓励等变性。SER不学习/预测每个样本的变换代码/标签,不需要辅助的变换预测头,仅增加1.008倍的训练FLOPs。在ImageNet-1k ViT-S/16的预训练中,SER在严格匹配的2视图设置下将MoCo-v3的线性评估Top-1提高了+0.84,并持续改善DINO和Barlow Twins;在匹配视图数量下,SER在所比较的不变性+等变性附加方法中取得了最佳的ImageNet-1k线性评估Top-1。SER还将ImageNet-C/P的Top-1分别提高了+1.11/+1.22,并将冻结骨干的COCO检测提高了+1.7 mAP。最后,将相同的层解耦方法应用于现有的不变性+等变性基线也提高了它们的准确性,表明层解耦可作为结合不变性和等变性的通用设计原则。
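The decoupled regularizer lends itself to a compact sketch. The PyTorch snippet below assumes the group element is a horizontal flip, so the analytic feature-space action $\rho_g$ is a width-axis flip of the intermediate map; the backbone stand-in, layer choice, and loss weight are assumptions, not the paper's setup:

```python
# Minimal sketch of a layer-decoupled soft equivariance penalty in the
# spirit of SER, assuming g is a horizontal flip so that rho_g acts on
# the intermediate spatial map as a flip along the width axis.
import torch
import torch.nn.functional as F

def ser_loss(backbone_mid, x, lam=0.1):
    """backbone_mid: maps images (B,3,H,W) -> spatial features (B,C,h,w)."""
    g_x = torch.flip(x, dims=[-1])                   # T_g: flip the input image
    f_gx = backbone_mid(g_x)                         # features of transformed input
    rho_fx = torch.flip(backbone_mid(x), dims=[-1])  # rho_g applied in feature space
    return lam * F.mse_loss(f_gx, rho_fx)            # soft equivariance penalty

# Toy usage with a small conv standing in for an intermediate layer:
mid = torch.nn.Conv2d(3, 8, 3, padding=1)
x = torch.randn(4, 3, 32, 32)
loss = ser_loss(mid, x)
loss.backward()
```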
cs.CV / 30 / 2603.06696

HARP: HARmonizing in-vivo diffusion MRI using Phantom-only training

HARP:仅使用幻影训练实现体内扩散 MRI 的协调
Jeong, Hwihun, Liu, Qiang, Keenan, Kathryn E., Wilde, Elisabeth A., Schneider, Walter, Pathak, Sudhir, Zuccolotto, Anthony, O'Donnell, Lauren J., Ning, Lipeng, Rathi, Yogesh
Abstract
Purpose: Combining multi-site diffusion MRI (dMRI) data is hindered by inter-scanner variability, which confounds subsequent analysis. Previous harmonization methods require large, matched or traveling human subjects from multiple sites, which are impractical to acquire in many situations. This study aims to develop a deep learning-based dMRI harmonization framework that eliminates the reliance on multi-site in-vivo traveling human data for training. Methods: HARP employs a voxel-wise 1D neural network trained on an easily transportable diffusion phantom. The model learns relationships between spherical harmonics coefficients of different sites without memorizing spatial structures. Results: HARP reduced inter-scanner variability levels significantly in various measures. Quantitatively, it decreased inter-scanner variability as measured by standard error in FA (12%), MD (10%), and GFA (30%) with scan-rescan standard error as the baseline, while preserving fiber orientations and tractography after harmonization. Conclusion: We believe that HARP represents an important first step toward dMRI harmonization using only phantom data, thereby obviating the need for complex, matched in vivo multi-site cohorts. This phantom-only strategy substantially enhances the feasibility and scalability of quantitative dMRI for large-scale clinical studies.
Chinese Translation
目的:多中心扩散MRI(dMRI)数据的合并受到扫描仪间变异性的阻碍,这会干扰后续分析。以往的协调方法需要来自多个中心的大规模匹配受试者或旅行人类受试者,这在许多情况下难以获得。本研究旨在开发一种基于深度学习的dMRI协调框架,消除训练时对多中心体内旅行人类数据的依赖。方法:HARP采用在易于运输的扩散幻影上训练的体素级一维神经网络。该模型学习不同中心的球谐系数之间的关系,而不记忆空间结构。结果:HARP在多种指标上显著降低了扫描仪间变异性水平。定量来看,以扫描-重扫描标准误差为基线,它降低了以标准误差衡量的FA(12%)、MD(10%)和GFA(30%)的扫描仪间变异性,同时在协调后保持了纤维方向和纤维束追踪(tractography)结果。结论:我们认为HARP代表了仅使用幻影数据进行dMRI协调的重要第一步,从而消除了对复杂的匹配体内多中心队列的需求。这种仅使用幻影的策略大大增强了定量dMRI在大规模临床研究中的可行性和可扩展性。
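A minimal sketch of the voxel-wise idea, assuming (illustratively) 28 spherical-harmonics coefficients per voxel and arbitrary layer widths; HARP's actual architecture may differ:

```python
# Minimal sketch of a voxel-wise 1D harmonization network in the spirit
# of HARP: an MLP mapping one scanner's spherical-harmonics coefficients
# to a reference scanner's, trained per voxel on phantom data. The
# coefficient count (28) and widths are illustrative assumptions.
import torch
import torch.nn as nn

N_SH = 28  # assumed number of SH coefficients per voxel

class VoxelHarmonizer(nn.Module):
    def __init__(self, n_coeff=N_SH, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_coeff, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_coeff),
        )

    def forward(self, sh_site_a):       # (num_voxels, n_coeff)
        return self.net(sh_site_a)      # predicted coefficients for site B

model = VoxelHarmonizer()
sh_a = torch.randn(1024, N_SH)          # phantom voxels scanned at site A
sh_b = torch.randn(1024, N_SH)          # same voxels scanned at site B
loss = nn.functional.mse_loss(model(sh_a), sh_b)
loss.backward()
```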
cs.CV / 31 / 2603.06697

Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

以注视思考:顺序眼动追踪作为医学视觉语言模型的视觉推理监督
Li, Yiwei, Wu, Zihao, Lv, Yanjun, Jiang, Hanqi, You, Weihang, Liu, Zhengliang, Zhu, Dajiang, Li, Xiang, Li, Quanzheng, Liu, Tianming, Zhao, Lin
Abstract
Vision--language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by introducing a small set of dedicated gaze tokens. These tokens are trained to predict gaze-selected image patch indices in temporal order, encouraging the model to follow human-like evidence acquisition and integration. Experiments on MIMIC-EYE and multiple external zero-shot benchmarks show consistent gains over baselines, achieving state-of-the-art in-domain performance and improved out-of-domain robustness. These results highlight temporally ordered gaze as an effective supervision signal for learning visually grounded medical reasoning.
Chinese Translation
视觉-语言模型(VLMs)将图像处理为视觉标记,但它们的中间推理通常以文本形式进行,这对于以视觉为基础的放射学任务来说可能并不理想。放射科医生通过顺序视觉搜索进行诊断;眼动追踪捕捉了这一过程,形成按时间顺序排列的注视轨迹,揭示了证据是如何随着时间的推移而获取的。我们利用眼动作为监督,通过引入一小组专用的注视标记来引导VLM推理。这些标记经过训练,以预测按时间顺序选择的图像补丁索引,鼓励模型遵循类人证据获取和整合的方式。在MIMIC-EYE和多个外部零样本基准上的实验显示,与基线相比,模型在性能上持续提升,实现了领域内的最先进表现,并提高了领域外的鲁棒性。这些结果突显了按时间顺序排列的注视作为学习视觉基础医学推理的有效监督信号。
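A minimal sketch of the supervision target described above: a time-ordered gaze trajectory is quantized onto the ViT patch grid (14x14 assumed, as for 224-pixel inputs with 16-pixel patches) to yield patch indices in temporal order:

```python
# Minimal sketch: converting gaze fixations into temporally ordered
# patch-index targets. The 14x14 grid is an illustrative assumption.
import numpy as np

def gaze_to_patch_indices(fixations, grid=14):
    """fixations: array of (t, x, y) with x, y in [0, 1], sorted by t."""
    xs = np.clip((fixations[:, 1] * grid).astype(int), 0, grid - 1)
    ys = np.clip((fixations[:, 2] * grid).astype(int), 0, grid - 1)
    idx = ys * grid + xs                        # row-major patch index
    keep = np.r_[True, idx[1:] != idx[:-1]]     # drop consecutive repeats
    return idx[keep]

fix = np.array([[0.0, 0.10, 0.12], [0.3, 0.11, 0.13], [0.7, 0.55, 0.60]])
print(gaze_to_patch_indices(fix))  # [ 15 119]
```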
cs.CV / 32 / 2603.06698

Asymmetric Distillation and Information Retention in Capacity-Constrained Cross-Modal Transfer

容量受限的跨模态迁移中的非对称蒸馏与信息保留
Thayani, Kabir
Abstract
Knowledge distillation between asymmetric architectures often induces severe geometric constraints on the learned representation space. In this work, we investigate the Dimensional Collapse phenomenon when distilling a 500M parameter global Vision Transformer (CLIP ViT-B/32) into strictly capacity-constrained, local-receptive-field CNNs (0.5M to 8.0M parameters) on the CIFAR-10 dataset. By employing strictly centered Singular Value Decomposition (SVD) and Variance-based Shannon Entropy Effective Rank, we isolate true structural variance from mean-vector artifacts. Our empirical results demonstrate a capacity-agnostic phase transition: while the Teacher exhibits an Effective Rank of 88.68, all Student models experience severe dimensional collapse to an intrinsic Effective Rank of ~16. By probing robustness, we uncover that this 81% reduction in effective dimensionality strips away the Teacher's inherent noise immunity (which retains 89.35% accuracy under \sigma=0.1 Gaussian noise). Furthermore, information-theoretic analysis using InfoNCE reveals a critical trade-off within this bottleneck: excess Student capacity densely packs the collapsed subspace for clean data, but induces severe brittleness (43.76% at \sigma=0.1). Conversely, extreme capacity constraints (0.5M parameters) act as a robust low-pass filter, preserving higher noise immunity (54.84%). Explicit input augmentation fails to restore the larger model's robustness, proving this fragility is a fundamental geometric limitation of asymmetric cosine distillation.
Chinese Translation
非对称架构之间的知识蒸馏往往会对学习到的表示空间施加严重的几何约束。在本研究中,我们探讨了在CIFAR-10数据集上将一个500M参数的全局视觉变换器(CLIP ViT-B/32)蒸馏到严格容量受限的局部感受野卷积神经网络(0.5M到8.0M参数)时出现的维度崩溃现象。通过采用严格中心化的奇异值分解(SVD)和基于方差的香农熵有效秩,我们将真实的结构方差与均值向量伪影分离。我们的实证结果表明了一种与容量无关的相变:尽管教师模型的有效秩为88.68,但所有学生模型都经历了严重的维度崩溃,内在有效秩降至~16。通过探测鲁棒性,我们发现这种有效维度的81%减少剥夺了教师模型固有的噪声免疫性(在σ=0.1高斯噪声下保持89.35%的准确率)。此外,使用InfoNCE进行的信息论分析揭示了这一瓶颈内的关键权衡:过剩的学生容量在干净数据上密集填充了崩溃的子空间,但导致了严重的脆弱性(在σ=0.1时为43.76%)。相反,极端的容量限制(0.5M参数)充当了一个鲁棒的低通滤波器,保持了更高的噪声免疫性(54.84%)。显式的输入增强未能恢复较大模型的鲁棒性,证明这种脆弱性是非对称余弦蒸馏的一个基本几何限制。
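The variance-based Shannon-entropy effective rank used in this abstract is concrete enough to sketch directly; the data shapes below are illustrative:

```python
# Minimal sketch: center the embeddings, take the SVD, convert squared
# singular values to a variance distribution, and exponentiate its
# Shannon entropy to get the effective rank.
import numpy as np

def effective_rank(Z):
    """Z: (n_samples, dim) embedding matrix."""
    Zc = Z - Z.mean(axis=0, keepdims=True)   # strict centering removes mean-vector artifacts
    s = np.linalg.svd(Zc, compute_uv=False)
    p = s**2 / np.sum(s**2)                  # variance explained per direction
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))  # exp of Shannon entropy

rng = np.random.default_rng(0)
Z = rng.normal(size=(2048, 512)) @ rng.normal(size=(512, 512))
print(round(effective_rank(Z), 2))
```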
cs.CV / 33 / 2603.06699

Multi-label Instance-level Generalised Visual Grounding in Agriculture

农业中的多标签实例级广义视觉定位
Haghighat, Mohammadreza, Saleh, Alzayat, Azghadi, Mostafa Rahimi
Abstract
Understanding field imagery such as detecting plants and distinguishing individual crop and weed instances is a central challenge in precision agriculture. Despite progress in vision-language tasks like captioning and visual question answering, Visual Grounding (VG), localising language-referred objects, remains unexplored in agriculture. A key reason is the lack of suitable benchmark datasets for evaluating grounding models in field conditions, where many plants look highly similar, appear at multiple scales, and the referred target may be absent from the image. To address these limitations, we introduce gRef-CW, the first dataset designed for generalised visual grounding in agriculture, including negative expressions. Benchmarking current state-of-the-art grounding models on gRef-CW reveals a substantial domain gap, highlighting their inability to ground instances of crops and weeds. Motivated by these findings, we introduce Weed-VG, a modular framework that incorporates multi-label hierarchical relevance scoring and interpolation-driven regression. Weed-VG advances instance-level visual grounding and provides a clear baseline for developing VG methods in precision agriculture. Code will be released upon acceptance.
Chinese Translation
理解田间图像,例如检测植物和区分单个作物与杂草实例,是精准农业中的一个核心挑战。尽管在视觉语言任务(如图像描述和视觉问答)方面取得了进展,但视觉定位(Visual Grounding, VG)——即定位语言所指对象——在农业领域仍未得到探索。一个主要原因是缺乏适合在田间条件下评估定位模型的基准数据集,在这些条件下,许多植物看起来非常相似,呈现多种尺度,并且所指目标可能在图像中缺失。为了解决这些局限性,我们引入了gRef-CW,这是第一个为农业中的广义视觉定位设计的数据集,包括负面表达。在gRef-CW上对当前最先进的定位模型进行基准测试显示出显著的领域差距,突显了它们在定位作物和杂草实例方面的不足。基于这些发现,我们提出了Weed-VG,一个模块化框架,结合了多标签层次相关性评分和插值驱动的回归。Weed-VG推动了实例级视觉定位的发展,并为精准农业中的VG方法提供了明确的基准。代码将在接受后发布。
cs.CV / 34 / 2603.06700

SIQA: Toward Reliable Scientific Image Quality Assessment

SIQA:迈向可靠的科学图像质量评估
Li, Wenzhe, Chen, Liang, Wang, Junying, Guo, Yijing, Shen, Ye, Wen, Farong, Li, Chunyi, Zhang, Zicheng, Zhai, Guangtao
Abstract
Scientific images fundamentally differ from natural and AI-generated images in that they encode structured domain knowledge rather than merely depict visual scenes. Assessing their quality therefore requires evaluating not only perceptual fidelity but also scientific correctness and logical completeness. However, existing image quality assessment (IQA) paradigms primarily focus on perceptual distortions or image-text alignment, implicitly assuming that depicted content is factually valid. This assumption breaks down in scientific contexts, where visually plausible figures may still contain conceptual errors or incomplete reasoning. To address this gap, we introduce Scientific Image Quality Assessment (SIQA), a framework that models scientific image quality along two complementary dimensions: Knowledge (Scientific Validity and Scientific Completeness) and Perception (Cognitive Clarity and Disciplinary Conformity). To operationalize this formulation, we design two evaluation protocols: SIQA-U (Understanding), which measures semantic comprehension of scientific content through multiple-choice tasks, and SIQA-S (Scoring), which evaluates alignment with expert quality judgments. We further construct the SIQA Challenge, consisting of an expert-annotated benchmark and a large-scale training set. Experiments across representative multimodal large language models (MLLMs) reveal a consistent discrepancy between scoring alignment and scientific understanding. While models can achieve strong agreement with expert ratings under SIQA-S, their performance on SIQA-U remains substantially lower. Fine-tuning improves both metrics, yet gains in scoring consistently outpace improvements in understanding. These results suggest that rating consistency alone may not reliably reflect scientific comprehension, underscoring the necessity of multidimensional evaluation for scientific image quality assessment.
Chinese Translation
科学图像与自然图像和人工智能生成图像在根本上有所不同,因为它们编码的是结构化的领域知识,而不仅仅是描绘视觉场景。因此,评估其质量不仅需要评估感知的真实性,还需要评估科学的正确性和逻辑的完整性。然而,现有的图像质量评估(IQA)范式主要集中于感知失真或图像与文本的对齐,隐含地假设所描绘的内容在事实上是有效的。这一假设在科学背景下失效,因为视觉上看似合理的图形仍可能包含概念错误或推理不完整。为了解决这一空白,我们引入了科学图像质量评估(SIQA),这是一个沿两个互补维度建模科学图像质量的框架:知识(科学有效性和科学完整性)和感知(认知清晰度和学科一致性)。为了实现这一框架,我们设计了两个评估协议:SIQA-U(理解),通过多项选择任务测量科学内容的语义理解;SIQA-S(评分),评估与专家质量判断的一致性。我们进一步构建了SIQA挑战,包括一个专家标注的基准和一个大规模训练集。在代表性的多模态大语言模型(MLLMs)上进行的实验揭示了评分一致性与科学理解之间的持续差异。尽管模型在SIQA-S下能够与专家评分达成强一致性,但在SIQA-U上的表现仍显著较低。微调改善了这两个指标,但评分的提升始终超过理解的改善。这些结果表明,仅凭评分一致性可能无法可靠地反映科学理解,强调了对科学图像质量评估进行多维评估的必要性。
cs.CV / 35 / 2603.06704

On the Generalization Capacities of MLLMs for Spatial Intelligence

多模态大型语言模型在空间智能中的泛化能力研究
Zhang, Gongjie, Li, Wenhao, Qian, Quanhao, Wang, Jiuniu, Zhao, Deli, Lu, Shijian, Xu, Ran
Abstract
Multimodal Large Language Models (MLLMs) that directly process RGB inputs for tasks like 3D localization and navigation have shown remarkable potential. However, we argue that these RGB-only approaches are fundamentally flawed in their ability to generalize across cameras. By ignoring camera parameters, they entangle an object's physical properties with the camera's perspective, creating an irresolvable ambiguity. We show this leads MLLMs to overfit to the training camera distribution, rather than learning true and generalizable 3D geometric principles. To address this, we propose Camera-Aware MLLM framework for spatial MLLMs. It learns generalizable spatial reasoning by: (i) injecting camera intrinsics via a dense embedding that conditions each visual token; (ii) introducing a camera-aware data augmentation strategy that synthetically varies camera parameters, forcing the model to disentangle camera properties from scene content; and (iii) distilling geometric priors from a 3D vision foundation model. Extensive experiments demonstrate that camera-aware MLLMs substantially outperform their naive counterparts, particularly in cross-camera generalization tests on spatially-grounded tasks, indicating that camera-awareness is not only beneficial but also a prerequisite for robust and generalizable spatial intelligence in MLLMs.
Chinese Translation
直接处理RGB输入以进行3D定位和导航等任务的多模态大型语言模型(MLLMs)展现出了显著的潜力。然而,我们认为这些仅依赖RGB的方式在跨相机泛化能力上存在根本性缺陷。由于忽视了相机参数,它们将物体的物理特性与相机的视角纠缠在一起,造成了无法解决的模糊性。我们展示了这导致MLLMs对训练相机分布的过拟合,而不是学习真实且可泛化的3D几何原理。为了解决这一问题,我们提出了针对空间MLLMs的相机感知MLLM框架。该框架通过以下方式学习可泛化的空间推理:(i) 通过密集嵌入注入相机内参,条件化每个视觉标记;(ii) 引入相机感知的数据增强策略,合成性地变化相机参数,迫使模型将相机属性与场景内容解耦;(iii) 从3D视觉基础模型中提炼几何先验。大量实验表明,相机感知的MLLMs在空间基础任务的跨相机泛化测试中显著优于不具相机感知能力的对应模型,表明相机感知不仅是有益的,而且是实现MLLMs中稳健且可泛化的空间智能的先决条件。
cs.CV / 36 / 2603.06723

UWPD: A General Paradigm for Invisible Watermark Detection Agnostic to Embedding Algorithms

UWPD:一种与嵌入算法无关的隐形水印检测通用范式
Ao, Xiang, Du, Yiling, Wang, Zidan, Chen, Mengru
Abstract
Invisible watermarks, as an essential technology for image copyright protection, have been widely deployed with the rapid development of social media and AIGC. However, existing invisible watermark detection heavily relies on prior knowledge of specific algorithms, leading to limited detection capabilities for "unknown watermarks" in open environments. To this end, we propose a novel task named Universal Watermark Presence Detection (UWPD), which aims to identify whether an image carries a copyright mark without requiring decoding information. We construct the UniFreq-100K dataset, comprising large-scale samples across various invisible watermark embedding algorithms. Furthermore, we propose the Frequency Shield Network (FSNet). This model deploys an Adaptive Spectral Perception Module (ASPM) in the shallow layers, utilizing learnable frequency gating to dynamically amplify high-frequency watermark signals while suppressing low-frequency semantics. In the deep layers, the network introduces Dynamic Multi-Spectral Attention (DMSA) combined with tri-stream extremum pooling to deeply mine watermark energy anomalies, forcing the model to precisely focus on sensitive frequency bands. Extensive experiments demonstrate that FSNet exhibits superior zero-shot detection capabilities on the UWPD task, outperforming existing baseline models. Code and datasets will be released upon acceptance.
Chinese Translation
隐形水印作为图像版权保护的重要技术,随着社交媒体和人工智能生成内容(AIGC)的快速发展,得到了广泛应用。然而,现有的隐形水印检测严重依赖于特定算法的先验知识,这导致在开放环境中对“未知水印”的检测能力有限。为此,我们提出了一项新任务,称为通用水印存在检测(Universal Watermark Presence Detection,UWPD),旨在识别图像是否携带版权标记,而无需解码信息。我们构建了UniFreq-100K数据集,包含各种隐形水印嵌入算法的大规模样本。此外,我们提出了频率屏蔽网络(Frequency Shield Network,FSNet)。该模型在浅层部署了自适应频谱感知模块(Adaptive Spectral Perception Module,ASPM),利用可学习的频率门控动态放大高频水印信号,同时抑制低频语义。在深层,网络引入了动态多谱注意力(Dynamic Multi-Spectral Attention,DMSA)结合三流极值池化,深入挖掘水印能量异常,迫使模型精确关注敏感频带。大量实验表明,FSNet在UWPD任务上展现出优越的零样本检测能力,超越了现有基线模型。代码和数据集将在接受后发布。
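Only the high-level gating idea is public here, so the sketch below illustrates a generic learnable frequency gate (FFT, per-frequency learned gain, inverse FFT) rather than FSNet's actual ASPM implementation:

```python
# Minimal sketch of a learnable frequency gate: transform a feature map
# to the frequency domain, multiply by a learned per-frequency gain,
# and transform back. Layout and gain range are illustrative.
import torch
import torch.nn as nn

class FrequencyGate(nn.Module):
    def __init__(self, h, w):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(h, w // 2 + 1))  # rfft2 layout

    def forward(self, x):                        # x: (B, C, H, W)
        X = torch.fft.rfft2(x, norm="ortho")
        X = X * (2.0 * torch.sigmoid(self.logits))  # gain in (0, 2): identity at init
        return torch.fft.irfft2(X, s=x.shape[-2:], norm="ortho")

gate = FrequencyGate(64, 64)
y = gate(torch.randn(2, 3, 64, 64))
print(y.shape)  # torch.Size([2, 3, 64, 64])
```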
cs.CV / 37 / 2603.06732

HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

HERO:用于视频中开放词汇时间句子定位的层次嵌入-细化方法
Han, Tingting, Tao, Xinsong, Yin, Yufei, Tan, Min, Zhao, Sicheng, Yu, Zhou
Abstract
Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, limiting their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this critical gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks--Charades-OV and ActivityNet-OV--that simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond seen training concepts. To tackle OV-TSGV, we propose HERO(Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal refinement. HERO jointly models multi-level semantics and enhances video-language alignment via semantic-guided visual filtering and contrastive masked text refinement. Extensive experiments on both standard and open vocabulary benchmarks demonstrate that HERO consistently surpasses state-of-the-art methods, particularly under open-vocabulary scenarios, validating its strong generalization capability and underscoring the significance of OV-TSGV as a new research direction.
Chinese Translation
视频中的时间句子定位(TSGV)旨在在时间上定位与给定自然语言查询相对应的视频片段。尽管近期取得了一定进展,但大多数现有的TSGV方法仍在封闭词汇环境下工作,这限制了它们对涉及新颖或多样语言表达的现实世界查询的泛化能力。为了解决这一关键问题,我们提出了开放词汇TSGV(OV-TSGV)任务,并构建了首个专门的基准数据集——Charades-OV和ActivityNet-OV——以模拟现实的词汇变化和释义变体。这些基准数据集便于系统性地评估模型在已见训练概念之外的泛化能力。为了解决OV-TSGV问题,我们提出了HERO(用于开放词汇定位的层次嵌入-细化框架),这是一个统一框架,利用层次语言嵌入并进行并行跨模态细化。HERO共同建模多层次语义,并通过语义引导的视觉过滤和对比掩蔽文本细化增强视频与语言的对齐。对标准和开放词汇基准的广泛实验表明,HERO始终超越最先进的方法,尤其是在开放词汇场景下,验证了其强大的泛化能力,并强调了OV-TSGV作为新研究方向的重要性。
cs.CV / 38 / 2603.06735

Vessel-Aware Deep Learning for OCTA-Based Detection of AMD

基于血管感知的深度学习用于OCTA检测年龄相关性黄斑变性
Mitzner, Margalit G., Bhattacharya, Moinak, Zou, Zhilin, Chen, Chao, Prasanna, Prateek
Abstract
Age-related macular degeneration (AMD) is characterized by early micro-vascular alterations that can be captured non-invasively using optical coherence tomography angiography (OCTA), yet most deep learning (DL) models rely on global features and fail to exploit clinically meaningful vascular biomarkers. We introduce an external multiplicative attention framework that incorporates vessel-specific tortuosity maps and vasculature dropout maps derived from arteries, veins, and capillaries. These biomarker maps are generated from vessel segmentations and smoothed across multiple spatial scales to highlight coherent patterns of vascular remodeling and capillary rarefaction. Tortuosity reflects abnormalities in vessel geometry linked to impaired auto-regulation, while dropout maps capture localized perfusion deficits that precede structural retinal damage. The maps are fused with the OCTA projection to guide a deep classifier toward physiologically relevant regions. Arterial tortuosity provided the most consistent discriminative value, while capillary dropout maps performed best among density-based variants, especially at larger smoothing scales. Our proposed method offers interpretable insights aligned with known AMD pathophysiology.
Chinese Translation
年龄相关性黄斑变性(AMD)以早期微血管变化为特征,这些变化可以通过光学相干断层扫描血管成像(OCTA)非侵入性地捕捉。然而,大多数深度学习(DL)模型依赖于全局特征,未能利用临床意义重大的血管生物标志物。我们提出了一种外部乘法注意力框架,结合了从动脉、静脉和毛细血管中导出的特定血管扭曲图和血管缺失图。这些生物标志物图是通过血管分割生成的,并在多个空间尺度上进行平滑,以突出血管重塑和毛细血管稀疏的连贯模式。扭曲度反映了与自我调节受损相关的血管几何异常,而缺失图则捕捉到局部灌注缺陷,这些缺陷在结构性视网膜损伤之前出现。这些图与OCTA投影融合,以引导深度分类器朝向生理相关区域。动脉扭曲度提供了最一致的区分价值,而毛细血管缺失图在基于密度的变体中表现最佳,尤其是在较大的平滑尺度下。我们提出的方法提供了与已知AMD病理生理学一致的可解释性见解。
cs.CV / 39 / 2603.06746

ButterflyViT: 354$\times$ Expert Compression for Edge Vision Transformers

ButterflyViT:面向边缘视觉变换器的 354$\times$ 专家压缩
Karmore, Aryan
Abstract
Deploying sparse Mixture of Experts(MoE) Vision Transformers remains a challenge due to linear expert memory scaling. Linear memory scaling stores $N$ independent expert weight matrices requiring $\mathcal{O}(N_E \cdot d^2)$ memory, which exceeds edge devices memory budget. Current compression methods like quantization, pruning and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyViT, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing different angles of shared capacity, not from redundant storage. By applying learned rotations to a shared ternary prototype, each expert yields $\mathcal{O}(d_{\text{model}} \cdot d_{\text{ff}} + N_E \cdot n_\ell \cdot d)$ memory which is sub-linear in the number of experts. To address the unique challenges of vision, a spatial smoothness regulariser is introduced that penalises routing irregularities between adjacent patch tokens, turning patch correlation into a training signal. Across image classification tasks on CIFAR-100, ButterflyViT achieves 354$\times$ memory reduction at 64 experts with negligible accuracy loss. ButterflyViT allows multiple experts to fit on edge-constrained devices showing that geometric parameterization breaks linear scaling.
Chinese Translation
部署稀疏的专家混合模型(Mixture of Experts, MoE)视觉变换器仍然面临挑战,因为专家的内存扩展是线性的。线性内存扩展存储 $N$ 个独立的专家权重矩阵,需占用 $\mathcal{O}(N_E \cdot d^2)$ 的内存,这超出了边缘设备的内存预算。目前的压缩方法,如量化、剪枝和低秩分解,虽然减少了常数因子,但未能解决扩展瓶颈。我们提出了 ButterflyViT,一种将专家视为统一共享量化基底的几何重定向,而非独立权重矩阵的方法。专家之间的多样性源于对共享能力的不同视角,而非冗余存储。通过对共享三元原型应用学习到的旋转,每个专家的内存需求为 $\mathcal{O}(d_{\text{model}} \cdot d_{\text{ff}} + N_E \cdot n_\ell \cdot d)$,在专家数量上是次线性的。为应对视觉领域的独特挑战,我们引入了一种空间平滑正则化器,惩罚相邻补丁标记之间的路由不规则性,将补丁相关性转化为训练信号。在 CIFAR-100 的图像分类任务中,ButterflyViT 在 64 个专家下实现了 354$\times$ 的内存减少,且几乎没有准确度损失。ButterflyViT 使多个专家能够装入边缘受限设备,表明几何参数化打破了线性扩展。
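A back-of-the-envelope check of the two memory scalings quoted above; the dimensions are illustrative assumptions, and the paper's full 354x figure additionally relies on ternary quantization of the shared substrate:

```python
# Comparing O(N_E * d_model * d_ff) for independent FFN experts against
# O(d_model * d_ff + N_E * n_l * d) for a shared substrate plus
# per-expert rotations. All dimensions below are illustrative.
d_model, d_ff = 384, 1536     # assumed ViT-small-like FFN
N_E, n_l, d = 64, 4, 384      # 64 experts, 4 rotation layers each

dense = N_E * d_model * d_ff                  # independent experts
shared = d_model * d_ff + N_E * n_l * d       # substrate + rotations
print(f"dense:  {dense:,}")                   # 37,748,736
print(f"shared: {shared:,}")                  # 688,128
print(f"ratio:  {dense / shared:.0f}x")       # ~55x before ternary quantization
```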
cs.CV / 40 / 2603.06750

XMACNet: An Explainable Lightweight Attention based CNN with Multi Modal Fusion for Chili Disease Classification

XMACNet:一种可解释的轻量级注意力CNN与多模态融合相结合的辣椒病害分类方法
Ray, Tapon Kumer, Y, Rajkumar, R, Shalini, K, Srigayathri, S, Jayashree, P, Lokeswari
Abstract
Plant disease classification via imaging is a critical task in precision agriculture. We propose XMACNet, a novel light-weight Convolutional Neural Network (CNN) that integrates self-attention and multi-modal fusion of visible imagery and vegetation indices for chili disease detection. XMACNet uses an EfficientNetV2S backbone enhanced by a self-attention module and a fusion branch that processes both RGB images and computed vegetation index maps (NDVI, NPCI, MCARI). We curated a new dataset of 12,000 chili leaf images across six classes (five disease types plus healthy), augmented synthetically via StyleGAN to mitigate data scarcity. Trained on this dataset, XMACNet achieves high accuracy, F1-score, and AUC, outperforming baseline models such as ResNet-50, MobileNetV2, and a Swin Transformer variant. Crucially, XMACNet is explainable: we use Grad-CAM++ and SHAP to visualize and quantify the model's focus on disease features. The model's compact size and fast inference make it suitable for edge deployment in real-world farming scenarios.
Chinese Translation
通过成像进行植物病害分类是精准农业中的一项关键任务。我们提出了XMACNet,这是一种新颖的轻量级卷积神经网络(CNN),它集成了自注意力机制和可见图像与植被指数的多模态融合,用于辣椒病害检测。XMACNet采用了增强自注意力模块的EfficientNetV2S作为主干,并设有一个融合分支,处理RGB图像和计算得到的植被指数图(NDVI、NPCI、MCARI)。我们策划了一个包含12,000张辣椒叶片图像的新数据集,涵盖六个类别(五种病害类型加上健康),并通过StyleGAN进行了合成增强,以减轻数据稀缺问题。在该数据集上训练后,XMACNet在准确率、F1分数和AUC等指标上表现出色,超越了基线模型如ResNet-50、MobileNetV2和Swin Transformer变体。重要的是,XMACNet是可解释的:我们使用Grad-CAM++和SHAP来可视化和量化模型对病害特征的关注。该模型的紧凑尺寸和快速推理使其适合在实际农业场景中进行边缘部署。
cs.CV / 41 / 2603.06753

EarthBridge: A Solution for 4th Multi-modal Aerial View Image Challenge Translation Track

EarthBridge:第四届多模态航空视图图像挑战翻译赛道的解决方案
Chen, Zhenyuan, Shen, Guanyuan, Zhang, Feng
Abstract
Cross-modal image-to-image translation among Electro-Optical (EO), Infrared (IR), and Synthetic Aperture Radar (SAR) sensors is essential for comprehensive multi-modal aerial-view analysis. However, translating between these modalities is notoriously difficult due to their distinct electromagnetic signatures and geometric characteristics. This paper presents \textbf{EarthBridge}, a high-fidelity translation framework developed for the 4th Multi-modal Aerial View Image Challenge -- Translation (MAVIC-T). We explore two distinct methodologies: \textbf{Diffusion Bridge Implicit Models (DBIM)}, which we generalize using non-Markovian bridge processes for high-quality deterministic sampling, and \textbf{Contrastive Unpaired Translation (CUT)}, which utilizes contrastive learning for structural consistency. Our EarthBridge framework employs a channel-concatenated UNet denoiser trained with Karras-weighted bridge scalings and a specialized "booting noise" initialization to handle the inherent ambiguity in cross-modal mappings. We evaluate these methods across all four challenge tasks (SAR$\rightarrow$EO, SAR$\rightarrow$RGB, SAR$\rightarrow$IR, RGB$\rightarrow$IR), achieving superior spatial detail and spectral accuracy. Our solution achieved a composite score of 0.38, securing the second position on the MAVIC-T leaderboard. Code is available at https://github.com/Bili-Sakura/EarthBridge-Preview.
Chinese Translation
在电光(EO)、红外(IR)和合成孔径雷达(SAR)传感器之间进行跨模态图像到图像的翻译对于全面的多模态航空视图分析至关重要。然而,由于这些模态具有不同的电磁特征和几何特性,进行翻译极为困难。本文提出了 \textbf{EarthBridge},这是为第四届多模态航空视图图像挑战——翻译(MAVIC-T)开发的高保真翻译框架。我们探索了两种不同的方法论:\textbf{扩散桥隐式模型(DBIM)},我们通过非马尔可夫桥过程对其进行推广,以实现高质量的确定性采样,以及 \textbf{对比无配对翻译(CUT)},该方法利用对比学习确保结构一致性。我们的EarthBridge框架采用通道拼接的UNet去噪器,使用Karras加权的桥接缩放和专门的“引导噪声”初始化来处理跨模态映射中的固有模糊性。我们在所有四个挑战任务(SAR$\rightarrow$EO、SAR$\rightarrow$RGB、SAR$\rightarrow$IR、RGB$\rightarrow$IR)上评估了这些方法,取得了优越的空间细节和光谱精度。我们的解决方案获得了0.38的综合得分,在MAVIC-T排行榜上位列第二。代码可在 https://github.com/Bili-Sakura/EarthBridge-Preview 获取。
cs.CV / 42 / 2603.06803

A Hybrid Machine Learning Model for Cerebral Palsy Detection

一种混合机器学习模型用于脑瘫检测
Singh, Karan Kumar, Gajbhiye, Nikita, Mishra, Gouri Sankar
Abstract
The development of effective treatments for Cerebral Palsy (CP) can begin with the early identification of affected children while they are still in the early stages of the disorder. Pathological issues in the brain can be better diagnosed with the use of one of many medical imaging techniques. Magnetic Resonance Imaging (MRI) has revolutionized medical imaging with its unparalleled image resolution. A unique Machine Learning (ML) model that was built to identify CP disorder is presented in this paper. The model is intended to assist in the early diagnosis of CP in newborns. In this study, the brain MRI images dataset was first collected, and then the preprocessing techniques were applied to this dataset to make it ready for use in the proposed model. Following this, the proposed model was constructed by combining three CNN models, specifically VGG 19, Efficient-Net, and the ResNet50 model, to extract features from the image. Following this, a Bi-LSTM was utilized as a classifier to determine whether or not CP was present, and finally, the proposed model was employed for training and testing. The results show that the proposed model achieved an accuracy of 98.83%, which is higher than VGG-19 (96.79%), Efficient-Net (97.29%), and VGG-16 (97.50%). When the suggested model is compared to other models that have been pre-trained in the past, the accuracy scores seem to be much higher.
Chinese Translation
有效的脑瘫(Cerebral Palsy, CP)治疗方法的开发,可以从在疾病早期阶段及早识别受影响的儿童开始。利用多种医学影像技术可以更好地诊断大脑的病理问题。磁共振成像(Magnetic Resonance Imaging, MRI)以其无与伦比的图像分辨率彻底改变了医学影像。本文提出了一种独特的机器学习(Machine Learning, ML)模型,用于识别脑瘫疾病。该模型旨在帮助新生儿的脑瘫早期诊断。在本研究中,首先收集了脑部MRI图像数据集,然后对该数据集应用预处理技术,以使其可用于所提议的模型。随后,通过结合三种卷积神经网络(Convolutional Neural Network, CNN)模型,具体为VGG 19、Efficient-Net和ResNet50模型,构建了所提议的模型,以从图像中提取特征。接下来,利用双向长短期记忆网络(Bi-LSTM)作为分类器来判断是否存在脑瘫,最后,采用所提议的模型进行训练和测试。结果显示,所提议的模型达到了98.83%的准确率,超过了VGG-19(96.79%)、Efficient-Net(97.29%)和VGG-16(97.50%)。与过去预训练的其他模型相比,所建议模型的准确率显著更高。
cs.CV / 43 / 2603.06828

Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

步级视觉定位忠实度可预测长时程视觉-语言模型的分布外泛化能力
Rahman, Md Ashikur, Rahman, Md Arifur, Samin, Niamul Hassan, Arean, Abdullah Ibne Hanif, Noshin, Juena Ahmed
Abstract
We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Standard benchmarks measure only final-answer accuracy, which obscures how models use visual information; a model can guess correctly while its step-by-step reasoning is entirely unanchored to the visual input. We formalize this as behavioral faithfulness over long horizons, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state. Across eight models on three long-horizon benchmarks, we demonstrate that temporal grounding quality is a leading indicator of robustness: the Step Grounding Rate (SGR) predicts out-of-distribution retention with $r = 0.83$ (permutation test $p = 0.003$), a relationship that holds within capacity-matched models and cannot be explained by scale or in-distribution accuracy. Critically, grounding quality varies by up to 10.8 percentage points within parameter-matched 7B models despite similar accuracy, revealing it as an independent axis of model capability. Multiple robustness checks confirm the signal reflects genuine visual reliance: counterfactual traces drop SGR by 26--41 percentage points, cross-architecture verifiers agree at $\rho = 0.96$, random reasoning scores near chance ($\sim 18\%$), and the predictor remains strong even without explicit reasoning disclosure ($r = 0.78$).
Chinese Translation
我们揭示了长时程视觉-语言模型的一个行为规律:保持时间上有根基的信念的模型具有更好的泛化能力。标准基准仅衡量最终答案的准确性,这掩盖了模型如何使用视觉信息;一个模型可以正确猜测,但其逐步推理完全与视觉输入无关。我们将其形式化为长时程上的行为忠实度,这是一种可实证测量的属性,用于量化模型的中间推理是否与不断变化的视觉状态保持一致。在三个长时程基准上的八个模型中,我们证明了时间根基质量是稳健性的先导指标:步级根基率(Step Grounding Rate, SGR)预测分布外保留率,相关系数 $r = 0.83$(置换检验 $p = 0.003$),这一关系在容量匹配的模型中成立,且无法通过规模或分布内准确性来解释。关键是,尽管准确性相似,根基质量在参数匹配的7B模型中变化可达10.8个百分点,揭示了它是模型能力的一个独立维度。多项稳健性检查确认该信号反映了真实的视觉依赖:反事实轨迹使SGR下降26至41个百分点,跨架构验证器一致性为 $\rho = 0.96$,随机推理得分接近偶然水平(约18%),即使没有明确的推理披露,预测能力仍然强劲($r = 0.78$)。
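A minimal sketch of the permutation test behind the reported r = 0.83 and p = 0.003; the toy numbers stand in for per-model SGR and OOD-retention measurements:

```python
# Minimal sketch: shuffle one variable, recompute the Pearson
# correlation, and count how often the shuffled |r| matches or exceeds
# the observed |r|. The toy data is illustrative.
import numpy as np

def perm_test_pearson(x, y, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_perm):
        r = np.corrcoef(rng.permutation(x), y)[0, 1]
        count += abs(r) >= abs(r_obs)
    return r_obs, (count + 1) / (n_perm + 1)   # add-one smoothing

sgr = np.array([0.42, 0.55, 0.61, 0.48, 0.70, 0.66, 0.58, 0.74])  # toy SGR per model
ood = np.array([0.31, 0.44, 0.52, 0.35, 0.63, 0.57, 0.49, 0.68])  # toy OOD retention
print(perm_test_pearson(sgr, ood))
```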
cs.CV / 44 / 2603.06846

MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies

MotionBits:通过刚体运动级别分析进行视频分割
Qian, Howard H., Ren, Kejia, Xiang, Yu, Ordonez, Vicente, Hang, Kaiyu
Abstract
Rigid bodies constitute the smallest manipulable elements in the real world, and understanding how they physically interact is fundamental to embodied reasoning and robotic manipulation. Thus, accurate detection, segmentation, and tracking of moving rigid bodies is essential for enabling reasoning modules to interpret and act in diverse environments. However, current segmentation models trained on semantic grouping are limited in their ability to provide meaningful interaction-level cues for completing embodied tasks. To address this gap, we introduce MotionBit, a novel concept that, unlike prior formulations, defines the smallest unit in motion-based segmentation through kinematic spatial twist equivalence, independent of semantics. In this paper, we contribute (1) the MotionBit concept and definition, (2) a hand-labeled benchmark, called MoRiBo, for evaluating moving rigid-body segmentation across robotic manipulation and human-in-the-wild videos, and (3) a learning-free graph-based MotionBits segmentation method that outperforms state-of-the-art embodied perception methods by 37.3\% in macro-averaged mIoU on the MoRiBo benchmark. Finally, we demonstrate the effectiveness of MotionBits segmentation for downstream embodied reasoning and manipulation tasks, highlighting its importance as a fundamental primitive for understanding physical interactions.
Chinese Translation
刚体是现实世界中最小的可操作元素,理解它们如何物理交互对于具身推理和机器人操作至关重要。因此,准确检测、分割和跟踪移动的刚体对于使推理模块能够在多样化环境中进行解释和行动是必不可少的。然而,目前基于语义分组训练的分割模型在提供有意义的交互级线索以完成具身任务方面存在局限。为了解决这一问题,我们提出了MotionBit这一新概念,与之前的公式不同,它通过运动学空间扭转等价性定义了运动基础分割中的最小单元,而不依赖于语义。在本文中,我们贡献了(1)MotionBit的概念和定义,(2)一个手工标注的基准数据集MoRiBo,用于评估机器人操作和人类在自然环境中视频的移动刚体分割,以及(3)一种无学习的基于图的MotionBits分割方法,在MoRiBo基准上,在宏平均mIoU上超越了最先进的具身感知方法37.3%。最后,我们展示了MotionBits分割在下游具身推理和操作任务中的有效性,强调了它作为理解物理交互的基本原语的重要性。
cs.CV / 45 / 2603.06852

Active View Selection with Perturbed Gaussian Ensemble for Tomographic Reconstruction

基于扰动高斯集成的主动视图选择用于断层重建
Wu, Yulun, Zha, Ruyi, Cao, Wei, Li, Yingying, Cai, Yuanhao, Liu, Yaoyao
Abstract
Sparse-view computed tomography (CT) is critical for reducing radiation exposure to patients. Recent advances in radiative 3D Gaussian Splatting (3DGS) have enabled fast and accurate sparse-view CT reconstruction. Despite these algorithmic advancements, practical reconstruction fidelity remains fundamentally bounded by the quality of the captured data, raising the crucial yet underexplored problem of X-ray active view selection. Existing active view selection methods are primarily designed for natural-light scenes and fail to capture the unique geometric ambiguities and physical attenuation properties inherent in X-ray imaging. In this paper, we present Perturbed Gaussian Ensemble, an active view selection framework that integrates uncertainty modeling with sequential decision-making, tailored for X-ray Gaussian Splatting. Specifically, we identify low-density Gaussian primitives that are likely to be uncertain and apply stochastic density scaling to construct an ensemble of plausible Gaussian density fields. For each candidate projection, we measure the structural variance of the ensemble predictions and select the one with the highest variance as the next best view. Extensive experimental results on arbitrary-trajectory CT benchmarks demonstrate that our density-guided perturbation strategy effectively eliminates geometric artifacts and consistently outperforms existing baselines in progressive tomographic reconstruction under unified view selection protocols.
Chinese Translation
稀疏视图计算机断层扫描(CT)对降低患者的辐射暴露至关重要。最近在辐射性三维高斯点云(3DGS)方面的进展使得稀疏视图CT重建变得快速而准确。尽管这些算法取得了进展,但实际重建的保真度仍然受到捕获数据质量的根本限制,这引发了一个关键但尚未深入研究的X射线主动视图选择问题。现有的主动视图选择方法主要针对自然光场景设计,未能捕捉到X射线成像中固有的独特几何模糊性和物理衰减特性。本文提出了一种扰动高斯集成(Perturbed Gaussian Ensemble)框架,结合了不确定性建模与顺序决策,专为X射线高斯点云设计。具体而言,我们识别出可能存在不确定性的低密度高斯原语,并应用随机密度缩放构建一组合理的高斯密度场。对于每个候选投影,我们测量集成预测的结构方差,并选择方差最高的作为下一个最佳视图。在任意轨迹CT基准测试中的大量实验结果表明,我们的密度引导扰动策略有效消除了几何伪影,并在统一视图选择协议下的渐进式断层重建中始终优于现有基线。
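A minimal sketch of the selection rule, with rendering stubbed out by random arrays: each candidate projection is scored by the disagreement of the perturbed ensemble, and the most uncertain view is taken next. Shapes and the pixel-averaged variance reduction are illustrative assumptions:

```python
# Minimal sketch of variance-based next-best-view selection: render
# every candidate projection under each of K perturbed density fields
# and pick the view whose predictions disagree most.
import numpy as np

rng = np.random.default_rng(0)
K, V, H, W = 8, 20, 64, 64                  # ensemble members, candidate views, image size
renders = rng.normal(size=(K, V, H, W))     # stand-in for ensemble renderings

per_view_uncertainty = renders.var(axis=0).mean(axis=(-2, -1))  # variance over ensemble, averaged over pixels
next_view = int(np.argmax(per_view_uncertainty))
print(next_view, per_view_uncertainty[next_view])
```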
cs.CV / 46 / 2603.06853

An Extended Topological Model For High-Contrast Optical Flow

一种扩展的高对比度光流拓扑模型
Turow, Brad, Perea, Jose A.
Abstract
In this paper, we identify low-dimensional models for dense core subsets in the space of $3\times 3$ high-contrast optical flow patches sampled from the Sintel dataset. In particular, we leverage the theory of approximate and discrete circle bundles to identify a 3-manifold whose boundary is a previously proposed optical flow torus, together with disjoint circles corresponding to pairs of binary step-edge range image patches. The 3-manifold model we introduce provides an explanation for why the previously-proposed torus model could not be verified with direct methods (e.g., a straightforward persistent homology computation). We also demonstrate that nearly all optical flow patches in the top 1 percent by contrast norm are found near the family of binary step-edge circles described above, rather than the optical flow torus, and that these frequently occurring patches are concentrated near motion boundaries (which are of particular importance for computer vision tasks such as object segmentation and tracking). Our findings offer insights on the subtle interplay between topology and geometry in inference for visual data.
Chinese Translation
在本文中,我们为从 Sintel 数据集采样的 $3\times 3$ 高对比度光流补丁空间中的稠密核心子集识别出低维模型。特别地,我们利用近似与离散圆丛的理论,识别出一个 3-流形,其边界由先前提出的光流环面(optical flow torus)以及与二元阶跃边缘范围图像补丁对相对应的若干不相交的圆组成。我们引入的 3-流形模型解释了为什么先前提出的环面模型无法通过直接方法(例如,简单的持久同调计算)得到验证。我们还展示了,在对比范数排名前 1% 的光流补丁中,几乎所有补丁都位于上述二元阶跃边缘圆族附近,而不是光流环面附近,并且这些频繁出现的补丁集中在运动边界附近(运动边界对于物体分割和跟踪等计算机视觉任务尤为重要)。我们的发现为拓扑与几何在视觉数据推断中的微妙相互作用提供了见解。
cs.CV / 47 / 2603.06860

ColonSplat: Reconstruction of Peristaltic Motion in Colonoscopy with Dynamic Gaussian Splatting

ColonSplat:基于动态高斯点云重建结肠镜检查中的蠕动运动
Smolak-Dyżewska, Weronika, Kaleta, Joanna, Dall'Alba, Diego, Spurek, Przemysław
Abstract
Accurate 3D reconstruction of colonoscopy data, accounting for complex peristaltic movements, is crucial for advanced surgical navigation and retrospective diagnostics. While recent novel view synthesis and 3D reconstruction methods have demonstrated remarkable success in general endoscopic scenarios, they struggle in the highly constrained environment of the colon. Due to the limited field of view of a camera moving through an actively deforming tubular structure, existing endoscopic methods reconstruct the colon appearance only for initial camera trajectory. However, the underlying anatomy remains largely static; instead of updating Gaussians' spatial coordinates (xyz), these methods encode deformation through either rotation, scale or opacity adjustments. In this paper, we first present a benchmark analysis of state-of-the-art dynamic endoscopic methods for realistic colonoscopic scenes, showing that they fail to model true anatomical motion. To enable rigorous evaluation of global reconstruction quality, we introduce DynamicColon, a synthetic dataset with ground-truth point clouds at every timestep. Building on these insights, we propose ColonSplat, a dynamic Gaussian Splatting framework that captures peristaltic-like motion while preserving global geometric consistency, achieving superior geometric fidelity on C3VDv2 and DynamicColon datasets. Project page: https://wmito.github.io/ColonSplat
Chinese Translation
准确的结肠镜检查数据三维重建,考虑复杂的蠕动运动,对于先进的手术导航和回顾性诊断至关重要。尽管近期的新视角合成和三维重建方法在一般内窥镜场景中取得了显著成功,但在结肠这一高度受限的环境中却面临挑战。由于相机在主动变形的管状结构中移动时视野有限,现有的内窥镜方法仅能重建初始相机轨迹下的结肠外观。然而,潜在的解剖结构基本保持静态;这些方法并未更新高斯的空间坐标(xyz),而是通过旋转、缩放或不透明度调整来编码变形。在本文中,我们首先对最先进的动态内窥镜方法在真实结肠镜场景中的基准分析进行了展示,结果表明它们无法真实建模解剖运动。为了严格评估全局重建质量,我们引入了DynamicColon,一个在每个时间步都有真实点云的合成数据集。在此基础上,我们提出了ColonSplat,一个动态高斯点云框架,能够捕捉蠕动样运动,同时保持全局几何一致性,在C3VDv2和DynamicColon数据集上实现了卓越的几何保真度。项目页面:https://wmito.github.io/ColonSplat
cs.CV / 48 / 2603.06863

A prior information informed learning architecture for flying trajectory prediction

基于先验信息的飞行轨迹预测学习架构
Huang, Xianda, Han, Zidong, Jin, Ruibo, Wang, Zhenyu, Li, Wenyu, Li, Xiaoyang, Gong, Yi
Abstract
Trajectory prediction for flying objects is critical in domains ranging from sports analytics to aerospace. However, traditional methods struggle with complex physical modeling, computational inefficiencies, and high hardware demands, often neglecting critical trajectory events like landing points. This paper introduces a novel, hardware-efficient trajectory prediction framework that integrates environmental priors with a Dual-Transformer-Cascaded (DTC) architecture. We demonstrate this approach by predicting the landing points of tennis balls in real-world outdoor courts. Using a single industrial camera and YOLO-based detection, we extract high-speed flight coordinates. These coordinates, fused with structural environmental priors (e.g., court boundaries), form a comprehensive dataset fed into our proposed DTC model. A first-level Transformer classifies the trajectory, while a second-level Transformer synthesizes these features to precisely predict the landing point. Extensive ablation and comparative experiments demonstrate that integrating environmental priors within the DTC architecture significantly outperforms existing trajectory prediction frameworks
Chinese Translation
飞行物体的轨迹预测在从体育分析到航空航天等多个领域中至关重要。然而,传统方法在复杂的物理建模、计算效率和硬件需求方面面临挑战,常常忽视关键的轨迹事件,如着陆点。本文提出了一种新颖的、硬件高效的轨迹预测框架,该框架将环境先验信息与双重变换器级联(Dual-Transformer-Cascaded, DTC)架构相结合。我们通过预测网球在真实户外场地的着陆点来展示这种方法。利用一台工业相机和基于YOLO的检测,我们提取了高速飞行坐标。这些坐标与结构性环境先验(例如,场地边界)融合,形成一个全面的数据集,输入到我们提出的DTC模型中。第一级变换器对轨迹进行分类,而第二级变换器合成这些特征以精确预测着陆点。大量的消融实验和比较实验表明,在DTC架构中整合环境先验显著优于现有的轨迹预测框架。
cs.CV / 49 / 2603.06873

PICS: Pairwise Image Compositing with Spatial Interactions

PICS:具有空间交互的成对图像合成
Zhou, Hang, Zuo, Xinxin, Wang, Sen, Cheng, Li
Abstract
Despite strong single-turn performance, diffusion-based image compositing often struggles to preserve coherent spatial relations in pairwise or sequential edits, where subsequent insertions may overwrite previously generated content and disrupt physical consistency. We introduce PICS, a self-supervised composition-by-decomposition paradigm that composes objects in parallel while explicitly modeling the compositional interactions among (fully-/partially-)visible objects and background. At its core, an Interaction Transformer employs mask-guided Mixture-of-Experts to route background, exclusive, and overlap regions to dedicated experts, with an adaptive {\alpha}-blending strategy that infers a compatibility-aware fusion of overlapping objects while preserving boundary fidelity. To further enhance robustness to geometric variations, we incorporate geometry-aware augmentations covering both out-of-plane and in-plane pose changes of objects. Our method delivers superior pairwise compositing quality and substantially improved stability, with extensive evaluations across virtual try-on, indoor, and street scene settings showing consistent gains over state-of-the-art baselines. Code and data are available at https://github.com/RyanHangZhou/PICS
Chinese Translation
尽管基于扩散的图像合成具有强大的单轮性能,但在成对或顺序编辑中,它常常难以保持连贯的空间关系:后续的插入可能会覆盖先前生成的内容并破坏物理一致性。我们提出了PICS,一种自监督的“以分解促合成”范式,它在并行合成对象的同时,显式建模(完全/部分)可见对象与背景之间的组合交互。在其核心,Interaction Transformer 采用掩码引导的专家混合(Mixture-of-Experts),将背景、独占区域和重叠区域路由到专门的专家,并采用自适应的 $\alpha$-混合策略,在保持边界保真度的同时推断重叠对象的兼容性感知融合。为了进一步增强对几何变化的鲁棒性,我们结合了几何感知增强,涵盖对象的平面外和平面内姿态变化。我们的方法提供了优越的成对合成质量和显著改善的稳定性,在虚拟试穿、室内和街景设置下的广泛评估显示出相较于最先进基线的一致提升。代码和数据可在 https://github.com/RyanHangZhou/PICS 获取。
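The overlap-handling idea admits a small sketch. Below, a constant alpha map stands in for the learned compatibility-aware prediction; the mask layout and values are illustrative:

```python
# Minimal sketch of alpha blending for the overlap of two composited
# objects: outside the overlap each object's own pixels win; inside
# it, an alpha map mixes the two.
import numpy as np

def composite_pair(bg, obj_a, obj_b, mask_a, mask_b, alpha):
    """All images (H, W, 3); masks and alpha (H, W, 1) in [0, 1]."""
    overlap = mask_a * mask_b
    only_a = mask_a * (1 - mask_b)
    only_b = mask_b * (1 - mask_a)
    fg = only_a * obj_a + only_b * obj_b + overlap * (alpha * obj_a + (1 - alpha) * obj_b)
    cover = np.clip(mask_a + mask_b, 0, 1)
    return (1 - cover) * bg + fg

H = W = 4
bg, a, b = (np.full((H, W, 3), v) for v in (0.2, 0.8, 0.5))
ma = np.zeros((H, W, 1)); ma[:, 1:3] = 1       # object A covers middle-left columns
mb = np.zeros((H, W, 1)); mb[:, 2:] = 1        # object B covers right columns
out = composite_pair(bg, a, b, ma, mb, alpha=np.full((H, W, 1), 0.6))
print(out[0, :, 0])  # [0.2 0.8 0.68 0.5]: background, A-only, blended overlap, B-only
```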
cs.CV / 50 / 2603.06885

OPTED: Open Preprocessed Trachoma Eye Dataset Using Zero-Shot SAM 3 Segmentation

OPTED:使用零样本 SAM 3 分割的开放预处理沙眼眼部数据集
Gebremedhin, Kibrom, Hailu, Hadush, Gebregziabher, Bruk
Abstract
Trachoma remains the leading infectious cause of blindness worldwide, with Sub-Saharan Africa bearing over 85% of the global burden and Ethiopia alone accounting for more than half of all cases. Yet publicly available preprocessed datasets for automated trachoma classification are scarce, and none originate from the most affected region. Raw clinical photographs of eyelids contain significant background noise that hinders direct use in machine learning pipelines. We present OPTED, an open-source preprocessed trachoma eye dataset constructed using the Segment Anything Model 3 (SAM 3) for automated region-of-interest extraction. We describe a reproducible four-step pipeline: (1) text-prompt-based zero-shot segmentation of the tarsal conjunctiva using SAM 3, (2) background removal and bounding-box cropping with alignment, (3) quality filtering based on confidence scores, and (4) Lanczos resizing to 224x224 pixels. A separate prompt-selection stage identifies the optimal text prompt, and manual quality assurance verifies outputs. Through comparison of five candidate prompts on all 2,832 known-label images, we identify "inner surface of eyelid with red tissue" as optimal, achieving a mean confidence of 0.872 (std 0.070) and 99.5% detection rate (the remaining 13 images are recovered via fallback prompts). The pipeline produces outputs in two formats: cropped and aligned images preserving the original aspect ratio, and standardized 224x224 images ready for pre-trained architectures. The OPTED dataset, preprocessing code, and all experimental artifacts are released as open source to facilitate reproducible trachoma classification research.
Chinese Translation
沙眼仍然是全球导致失明的首要感染性病因,撒哈拉以南非洲承担了全球85%以上的负担,仅埃塞俄比亚就占所有病例的一半以上。然而,公开可用的用于自动化沙眼分类的预处理数据集非常稀缺,且没有一个来自受影响最严重的地区。原始的眼睑临床照片包含显著的背景噪声,妨碍了其在机器学习管道中的直接使用。我们提出了OPTED,一个使用Segment Anything Model 3(SAM 3)构建、用于自动提取感兴趣区域的开源预处理沙眼眼部数据集。我们描述了一个可复现的四步流程:(1)使用SAM 3进行基于文本提示的睑结膜零样本分割;(2)去除背景并进行带对齐的边界框裁剪;(3)基于置信度评分的质量过滤;(4)Lanczos缩放至224x224像素。一个单独的提示选择阶段用于确定最佳文本提示,并通过人工质量保证验证输出。通过在全部2,832张已知标签图像上比较五个候选提示,我们确定“带红色组织的眼睑内表面”为最佳提示,达到0.872的平均置信度(标准差0.070)和99.5%的检测率(其余13张图像通过后备提示恢复)。该流程生成两种格式的输出:保持原始纵横比的裁剪对齐图像,以及可直接用于预训练架构的标准化224x224图像。OPTED数据集、预处理代码和所有实验产物均已开源发布,以促进可复现的沙眼分类研究。
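Steps (2)-(4) of the pipeline are concrete enough to sketch without assuming any SAM 3 API, by taking the segmentation mask and its confidence as given; the threshold and synthetic inputs are illustrative:

```python
# Minimal sketch of pipeline steps (2)-(4): remove the background,
# crop to the mask's bounding box, filter by confidence, and
# Lanczos-resize to 224x224. The confidence cutoff is an assumption.
import numpy as np
from PIL import Image

CONF_THRESHOLD = 0.5  # assumed quality-filter cutoff

def preprocess(img, mask, confidence):
    """img: PIL RGB image; mask: boolean (H, W) array from the segmentation stage."""
    if confidence < CONF_THRESHOLD:
        return None                                         # step 3: quality filtering
    arr = np.asarray(img.convert("RGB")) * mask[..., None]  # step 2a: zero out background
    ys, xs = np.nonzero(mask)
    crop = arr[ys.min():ys.max() + 1, xs.min():xs.max() + 1]  # step 2b: bounding-box crop
    return Image.fromarray(crop.astype(np.uint8)).resize(
        (224, 224), Image.LANCZOS)                          # step 4: Lanczos resize

# Demo on a synthetic stand-in for an eyelid photograph:
img = Image.fromarray(np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8))
mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 200:500] = True
out = preprocess(img, mask, confidence=0.87)
print(out.size)  # (224, 224)
```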
cs.CV / 51 / 2603.06917

PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection

PaQ-DETR:学习模式和质量感知动态查询以进行目标检测
Kang, Zhengjian, Zhuang, Jun, Mo, Kangtong, Chen, Qi, Liu, Rui, Zhang, Ye
Abstract
Detection Transformer (DETR) has redefined object detection by casting it as a set prediction task within an end-to-end framework. Despite its elegance, DETR and its variants still rely on fixed learnable queries and suffer from severe query utilization imbalance, which limits adaptability and leaves the model capacity underused. We propose PaQ-DETR (Pattern and Quality-Aware DETR), a unified framework that enhances both query adaptivity and supervision balance. It learns a compact set of shared latent patterns capturing global semantics and dynamically generates image-specific queries through content-conditioned weighting. In parallel, a quality-aware one-to-many assignment strategy adaptively selects positive samples based on localization-classification consistency, enriching supervision and promoting balanced query optimization. Experiments on COCO, CityScapes, and other benchmarks show consistent gains of 1.5%-4.2% mAP across DETR backbones, including ResNet and Swin-Transformer. Beyond accuracy improvement, our method provides interpretable insights into how dynamic patterns cluster semantically across object categories.
Chinese Translation
检测变换器(DETR)通过在端到端框架内将目标检测转化为集合预测任务,重新定义了这一领域。尽管其设计优雅,DETR及其变体仍然依赖于固定的可学习查询,并且存在严重的查询利用不平衡,这限制了适应性并使模型能力未得到充分利用。我们提出了PaQ-DETR(模式和质量感知DETR),这是一个统一框架,同时增强查询的适应性和监督平衡。它学习一组紧凑的共享潜在模式,以捕捉全局语义,并通过内容条件加权动态生成特定于图像的查询。同时,质量感知的一对多分配策略根据定位-分类一致性自适应选择正样本,丰富监督并促进平衡的查询优化。在COCO、CityScapes及其他基准上的实验显示,DETR主干网络(包括ResNet和Swin-Transformer)在mAP上均有1.5%-4.2%的稳定提升。除了提高准确性,我们的方法还提供了可解释的见解,揭示了动态模式如何在目标类别之间进行语义聚类。
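A minimal sketch of content-conditioned dynamic query generation in the spirit described above; the dimensions, the mean-pooled content summary, and the softmax mixing are illustrative assumptions:

```python
# Minimal sketch: a small set of shared latent patterns is mixed into
# image-specific queries with weights predicted from pooled encoder
# features. All sizes are illustrative.
import torch
import torch.nn as nn

class DynamicQueries(nn.Module):
    def __init__(self, n_patterns=16, n_queries=100, dim=256):
        super().__init__()
        self.patterns = nn.Parameter(torch.randn(n_patterns, dim))  # shared latent patterns
        self.mixer = nn.Linear(dim, n_queries * n_patterns)         # content-conditioned weights

    def forward(self, feats):                    # feats: (B, HW, dim) encoder tokens
        ctx = feats.mean(dim=1)                  # global content summary
        w = self.mixer(ctx).view(feats.size(0), -1, self.patterns.size(0))
        w = w.softmax(dim=-1)                    # convex mix over patterns
        return w @ self.patterns                 # (B, n_queries, dim) image-specific queries

q = DynamicQueries()(torch.randn(2, 900, 256))
print(q.shape)  # torch.Size([2, 100, 256])
```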
cs.CV / 52 / 2603.06920

DLRMamba: Distilling Low-Rank Mamba for Edge Multispectral Fusion Object Detection

DLRMamba:为边缘多光谱融合目标检测提炼低秩Mamba
Zhang, Qianqian, Tabaro, Leon, Abdelmoniem, Ahmed M., An, Junshe
Abstract
Multispectral fusion object detection is a critical task for edge-based maritime surveillance and remote sensing, demanding both high inference efficiency and robust feature representation for high-resolution inputs. However, current State Space Models (SSMs) like Mamba suffer from significant parameter redundancy in their standard 2D Selective Scan (SS2D) blocks, which hinders deployment on resource-constrained hardware and leads to the loss of fine-grained structural information during conventional compression. To address these challenges, we propose the Low-Rank Two-Dimensional Selective Structured State Space Model (Low-Rank SS2D), which reformulates state transitions via matrix factorization to exploit intrinsic feature sparsity. Furthermore, we introduce a Structure-Aware Distillation strategy that aligns the internal latent state dynamics of the student with a full-rank teacher model to compensate for potential representation degradation. This approach substantially reduces computational complexity and memory footprint while preserving the high-fidelity spatial modeling required for object recognition. Extensive experiments on five benchmark datasets and real-world edge platforms, such as Raspberry Pi 5, demonstrate that our method achieves a superior efficiency-accuracy trade-off, significantly outperforming existing lightweight architectures in practical deployment scenarios.
Chinese Translation
多光谱融合目标检测是基于边缘的海洋监视和遥感中的一项关键任务,要求在高分辨率输入下具备高推理效率和稳健的特征表示。然而,目前的状态空间模型(State Space Models, SSMs)如Mamba在其标准的二维选择扫描(2D Selective Scan, SS2D)模块中存在显著的参数冗余,这阻碍了在资源受限硬件上的部署,并导致在传统压缩过程中细粒度结构信息的丢失。为了解决这些挑战,我们提出了低秩二维选择结构状态空间模型(Low-Rank SS2D),该模型通过矩阵分解重新构建状态转移,以利用内在特征稀疏性。此外,我们引入了一种结构感知蒸馏策略,该策略将学生模型的内部潜在状态动态与全秩教师模型对齐,以补偿潜在的表示退化。这种方法在保持目标识别所需的高保真空间建模的同时,显著降低了计算复杂性和内存占用。我们在五个基准数据集和实际边缘平台(如Raspberry Pi 5)上进行了广泛实验,结果表明我们的方法在效率与准确性之间达成了优越的平衡,显著超越了现有轻量级架构在实际部署场景中的表现。
cs.CV / 53 / 2603.06925

Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images

基于掩膜增强注意力融合的可见光与红外遥感图像的小目标检测
Zhang, Qianqian, Jia, Xiaolong, Abdelmoniem, Ahmed M., Zhou, Li, An, Junshe
Abstract
Targets in remote sensing images are usually small, weakly textured, and easily disturbed by complex backgrounds, challenging high-precision detection with general algorithms. Building on our earlier ESM-YOLO, this work presents ESM-YOLO+ as a lightweight visible infrared fusion network. To enhance detection, ESM-YOLO+ includes two key innovations. (1) A Mask-Enhanced Attention Fusion (MEAF) module fuses features at the pixel level via learnable spatial masks and spatial attention, effectively aligning RGB and infrared features, enhancing small-target representation, and alleviating cross-modal misalignment and scale heterogeneity. (2) Training-time Structural Representation (SR) enhancement provides auxiliary supervision to preserve fine-grained spatial structures during training, boosting feature discriminability without extra inference cost. Extensive experiments on the VEDAI and DroneVehicle datasets validate ESM-YOLO+'s superiority. The model achieves 84.71\% mAP on VEDAI and 74.0\% mAP on DroneVehicle, while greatly reducing model complexity, with 93.6\% fewer parameters and 68.0\% lower GFLOPs than the baseline. These results confirm that ESM-YOLO+ integrates strong performance with practicality for real-time deployment, providing an effective solution for high-performance small-target detection in complex remote sensing scenes.
Chinese Translation
遥感图像中的目标通常较小、纹理较弱,并且容易受到复杂背景的干扰,这使得使用一般算法进行高精度检测面临挑战。在我们早期的ESM-YOLO基础上,本研究提出了ESM-YOLO+,作为一种轻量级的可见光与红外融合网络。为了增强检测能力,ESM-YOLO+包含两个关键创新。(1) 掩膜增强注意力融合(Mask-Enhanced Attention Fusion, MEAF)模块通过可学习的空间掩膜和空间注意力在像素级别融合特征,有效对齐RGB和红外特征,增强小目标的表示能力,并缓解跨模态的不对齐和尺度异质性。(2) 训练时结构表示(Structural Representation, SR)增强提供辅助监督,以在训练过程中保持细粒度的空间结构,提升特征的可区分性而不增加额外的推理成本。在VEDAI和DroneVehicle数据集上的大量实验验证了ESM-YOLO+的优越性。该模型在VEDAI上达到了84.71%的平均精度(mAP),在DroneVehicle上达到了74.0%的mAP,同时大幅降低了模型复杂性,相比基线减少了93.6%的参数和68.0%的GFLOPs。这些结果确认了ESM-YOLO+在强性能与实际应用之间的良好结合,提供了一种在复杂遥感场景中进行高性能小目标检测的有效解决方案。
cs.CV / 54 / 2603.06932

HIERAMP: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation

HIERAMP:用于生成数据集蒸馏的粗到细自回归放大
Zhao, Lin, Jiang, Xinru, Xiao, Xi, Fan, Qihui, Lu, Lei, Wang, Yanzhi, Lin, Xue, Camps, Octavia, Zhao, Pu, Gu, Jianyang
Abstract
Dataset distillation often prioritizes global semantic proximity when creating small surrogate datasets for original large-scale ones. However, object semantics are inherently hierarchical. For example, the position and appearance of a bird's eyes are constrained by the outline of its head. Global proximity alone fails to capture how object-relevant structures at different levels support recognition. In this work, we investigate the contributions of hierarchical semantics to effective distilled data. We leverage the vision autoregressive (VAR) model whose coarse-to-fine generation mirrors this hierarchy and propose HIERAMP to amplify semantics at different levels. At each VAR scale, we inject class tokens that dynamically identify salient regions and use their induced maps to guide amplification at that scale. This adds only marginal inference cost while steering synthesis toward discriminative parts and structures. Empirically, we find that semantic amplification leads to more diverse token choices in constructing coarse-scale object layouts. Conversely, at fine scales, the amplification concentrates token usage, increasing focus on object-related details. Across popular dataset distillation benchmarks, HIERAMP consistently improves validation performance without explicitly optimizing global proximity, demonstrating the importance of semantic amplification for effective dataset distillation.
Chinese Translation
数据集蒸馏在为原始大规模数据集创建小型替代数据集时,通常优先考虑全局语义相似性。然而,物体语义本质上是层次化的。例如,鸟类眼睛的位置和外观受到其头部轮廓的限制。仅依赖全局相似性无法捕捉不同层次的物体相关结构如何支持识别。在本研究中,我们探讨了层次语义对有效蒸馏数据的贡献。我们利用视觉自回归(VAR)模型,其粗到细的生成过程反映了这种层次结构,并提出了HIERAMP以在不同层次上放大语义。在每个VAR尺度上,我们注入动态识别显著区域的类别标记,并使用其诱导的图来指导该尺度的放大。这仅增加了边际推理成本,同时引导合成朝向区分性部分和结构。从经验上看,我们发现语义放大导致在构建粗尺度物体布局时选择更为多样的标记。相反,在细尺度上,放大集中标记的使用,增加了对物体相关细节的关注。在流行的数据集蒸馏基准上,HIERAMP在未明确优化全局相似性的情况下,始终提高了验证性能,证明了语义放大对有效数据集蒸馏的重要性。
cs.CV / 55 / 2603.06936

Extracting and analyzing 3D histomorphometric features related to perineural and lymphovascular invasion in prostate cancer

提取和分析与前列腺癌周神经和淋巴血管侵袭相关的三维组织形态计量特征
Chow, Sarah S. L., Wang, Rui, Serafin, Robert B., Zhao, Yujie, Baraznenok, Elena, Farré, Xavier, Salguero-Lopez, Jennifer, Gao, Gan, Hsieh, Huai-Ching, True, Lawrence D., Lal, Priti, Madabhushi, Anant, Liu, Jonathan T. C.
Abstract
Diagnostic grading of prostate cancer (PCa) relies on the examination of 2D histology sections. However, the limited sampling of specimens afforded by 2D histopathology, and ambiguities when viewing 2D cross-sections, can lead to suboptimal treatment decisions. Recent studies have shown that 3D histomorphometric analysis of glands and nuclei can improve PCa risk assessment compared to analogous 2D features. Here, we expand on these efforts by developing an analytical pipeline to extract 3D features related to perineural invasion (PNI) and lymphovascular invasion (LVI), which correlate with poor prognosis for a variety of cancers. A 3D segmentation model (nnU-Net) was trained to segment nerves and vessels in 3D datasets of archived prostatectomy specimens that were optically cleared, labeled with a fluorescent analog of H&E, and imaged with open-top light-sheet (OTLS) microscopy. PNI- and LVI-related features, including metrics describing cancer-nerve and cancer-vessel proximity, were then extracted based on the 3D nerve/vessel segmentation masks in conjunction with 3D masks of cancer-enriched regions. As a preliminary exploration of the prognostic value of these features, we trained a supervised machine learning classifier to predict 5-year biochemical recurrence (BCR) outcomes, finding that 3D PNI-related features are moderately prognostic and outperform 2D PNI-related features (AUC = 0.71 vs. 0.52). Source code is available at https://github.com/sarahrahsl/SegCIA.git.
Chinese Translation
前列腺癌(PCa)的诊断分级依赖于对二维组织学切片的检查。然而,二维组织病理学所提供的有限样本量以及在观察二维横截面时的模糊性,可能导致次优的治疗决策。最近的研究表明,与类似的二维特征相比,三维组织形态计量分析腺体和细胞核可以改善前列腺癌的风险评估。在此,我们通过开发一个分析流程来提取与周神经侵袭(PNI)和淋巴血管侵袭(LVI)相关的三维特征,进一步扩展了这些努力,这些特征与多种癌症的不良预后相关。我们训练了一个三维分割模型(nnU-Net),以在经过光学透明化处理、用H&E荧光类似物标记并使用开顶式光片(open-top light-sheet, OTLS)显微镜成像的归档前列腺切除标本的三维数据集中分割神经和血管。然后,基于三维神经/血管分割掩膜结合癌症富集区域的三维掩膜,提取了与PNI和LVI相关的特征,包括描述癌症与神经及血管接近度的度量。作为对这些特征预后价值的初步探索,我们训练了一个监督机器学习分类器来预测5年生化复发(BCR)结果,发现三维PNI相关特征具有中等的预后能力,并且优于二维PNI相关特征(AUC = 0.71 vs. 0.52)。源代码可在https://github.com/sarahrahsl/SegCIA.git获取。
cs.CV / 56 / 2603.06956

Virtual Intraoperative CT (viCT): Sequential Anatomic Updates for Modeling Tissue Resection Throughout Endoscopic Sinus Surgery

虚拟术中CT (viCT):在内窥镜鼻窦手术中对组织切除进行顺序解剖更新的建模
Gunderson, Nicole M., Harris, Graham J., Ruthberg, Jeremy S., Chen, Pengcheng, Mao, Di, Bly, Randall A., Abuzeid, Waleed M., Seibel, Eric J.
Abstract
Purpose: Incomplete dissection is a common cause of persistent disease and revision endoscopic sinus surgery (ESS) in chronic rhinosinusitis. Current image-guided surgery systems typically reference static preoperative CT (pCT), and do not model evolving resection boundaries. We present Virtual Intraoperative CT (viCT), a method for sequentially updating pCT throughout ESS using intraoperative 3D reconstructions from monocular endoscopic video to enable visualization of evolving anatomy in CT format. Methods: Monocular endoscopic video is processed using a depth-supervised NeRF framework with virtual stereo synthesis to generate metrically scaled 3D reconstructions at multiple surgical intervals. Reconstructions undergo rigid, landmark-based registration in 3D Slicer guided by anatomical correspondences, and are then voxelized into the pCT grid. viCT volumes were generated using a ray-based occupancy comparison between pCT and reconstruction to delete outdated voxels and remap preserved anatomy and updated boundaries. Performance is evaluated in a cadaveric feasibility study of four specimens across four ESS stages using volumetric overlap (DSC, Jaccard) and surface metrics (HD95, Chamfer, MSD, RMSD), and qualitative comparisons to ground-truth CT. Results: viCT updates show agreement with ground-truth anatomy across surgical stages, with submillimeter mean surface errors. Dice Similarity Coefficient (DSC) = 0.88 +/- 0.05 and Jaccard Index = 0.79 +/- 0.07, and Hausdorff Distance 95% (HD95) = 0.69 +/- 0.28 mm, Chamfer Distance = 0.09 +/- 0.05 mm, Mean Surface Distance (MSD) = 0.11 +/- 0.05 mm, and Root Mean Square Distance (RMSD) = 0.32 +/- 0.10 mm. Conclusion: viCT enables CT-format anatomic updating in an ESS setting without ancillary hardware. Future work will focus on fully automating registration, validation in live cases, and optimizing runtime for real-time deployment.
Chinese Translation
目的:不完全解剖是慢性鼻窦炎中持续疾病和修复性内窥镜鼻窦手术(ESS)的常见原因。目前的图像引导手术系统通常参考静态术前CT(pCT),并未对不断变化的切除边界进行建模。我们提出了虚拟术中CT(viCT),这是一种在ESS过程中使用单目内窥镜视频的术中三维重建顺序更新pCT的方法,以便以CT格式可视化不断演变的解剖结构。方法:使用深度监督的NeRF框架和虚拟立体合成处理单目内窥镜视频,以在多个手术时间点生成具有度量尺度的三维重建。重建在3D Slicer中以解剖对应关系为引导进行刚性、基于标志点的配准,然后被体素化到pCT网格中。viCT体积通过在pCT和重建之间进行基于光线的占用比较生成,以删除过时的体素并重新映射保留的解剖结构和更新的边界。我们在涵盖四个标本、四个ESS阶段的尸体可行性研究中评估性能,采用体积重叠(DSC,Jaccard)和表面指标(HD95,Chamfer,MSD,RMSD),并与真实CT进行定性比较。结果:viCT更新在手术阶段与真实解剖结构一致,平均表面误差在亚毫米范围内。Dice相似系数(DSC)= 0.88 +/- 0.05,Jaccard指数 = 0.79 +/- 0.07,95% Hausdorff距离(HD95)= 0.69 +/- 0.28 mm,Chamfer距离 = 0.09 +/- 0.05 mm,平均表面距离(MSD)= 0.11 +/- 0.05 mm,均方根距离(RMSD)= 0.32 +/- 0.10 mm。结论:viCT在ESS环境中实现了CT格式的解剖更新,而无需辅助硬件。未来的工作将集中在完全自动化配准、在实时病例中的验证以及优化运行时间以实现实时部署。
cs.CV / 57 / 2603.06971

SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation

SurgCUT3R:手术场景感知的连续时间三维表示理解
Xu, Kaiyuan, Hong, Fangzhou, Elson, Daniel, Huang, Baoru
Abstract
Reconstructing surgical scenes from monocular endoscopic video is critical for advancing robotic-assisted surgery. However, the application of state-of-the-art general-purpose reconstruction models is constrained by two key challenges: the lack of supervised training data and performance degradation over long video sequences. To overcome these limitations, we propose SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain. Our contributions are threefold. First, we develop a data generation pipeline that exploits public stereo surgical datasets to produce large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we propose a hybrid supervision strategy that couples our pseudo-ground-truth with geometric self-correction to enhance robustness against inherent data imperfections. Third, we introduce a hierarchical inference framework that employs two specialized models to effectively mitigate accumulated pose drift over long surgical videos: one for global stability and one for local accuracy. Experiments on the SCARED and StereoMIS datasets demonstrate that our method achieves a competitive balance between accuracy and efficiency, delivering near state-of-the-art but substantially faster pose estimation and offering a practical and effective solution for robust reconstruction in surgical environments. Project page: https://chumo-xu.github.io/SurgCUT3R-ICRA26/.
Chinese Translation
从单目内窥镜视频重建手术场景对于推动机器人辅助手术至关重要。然而,最先进的通用重建模型的应用受到两个关键挑战的限制:缺乏监督训练数据和在长视频序列中性能下降。为了解决这些局限性,我们提出了SurgCUT3R,一个将统一三维重建模型适应于手术领域的系统框架。我们的贡献主要体现在三个方面。首先,我们开发了一个数据生成管道,利用公共立体手术数据集生成大规模、度量尺度的伪真实深度图,有效弥补了数据缺口。其次,我们提出了一种混合监督策略,将我们的伪真实数据与几何自我校正相结合,以增强对固有数据缺陷的鲁棒性。第三,我们引入了一个层次推理框架,采用两个专门模型有效减轻长时间手术视频中的累积姿态漂移:一个用于全局稳定性,另一个用于局部准确性。在SCARED和StereoMIS数据集上的实验表明,我们的方法在准确性和效率之间达成了竞争平衡,提供了接近最先进水平但显著更快的姿态估计,并为手术环境中的鲁棒重建提供了实用且有效的解决方案。项目页面:https://chumo-xu.github.io/SurgCUT3R-ICRA26/
cs.CV / 58 / 2603.06973

T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding

T2SGrid:视频时间定位的时间到空间网格化
Guo, Chaohong, He, Yihan, Nie, Yongwei, Ma, Fei, Xu, Xuemiao, Long, Chengjiang
Abstract
Video Temporal Grounding (VTG) aims to localize the video segment that corresponds to a natural language query, which requires a comprehensive understanding of complex temporal dynamics. Existing Vision-LMMs typically perceive temporal dynamics via positional encoding, text-based timestamps, or visual frame numbering. However, these approaches exhibit notable limitations: assigning each frame a text-based timestamp token introduces additional computational overhead and leads to sparsity in visual attention, positional encoding struggles to capture absolute temporal information, and visual frame numbering often compromises spatial detail. To address these issues, we propose Temporal to Spatial Gridification (T2SGrid), a novel framework that reformulates video temporal understanding as a spatial understanding task. The core idea of T2SGrid is to process video content in clips rather than individual frames. We employ an overlapping sliding-window mechanism to segment the video into temporal clips. Within each window, frames are arranged chronologically in a row-major order into a composite grid image, effectively transforming temporal sequences into structured 2D layouts. The gridification not only encodes temporal information but also enhances local attention within each grid. Furthermore, T2SGrid enables the use of composite text timestamps to establish global temporal awareness. Experiments on standard VTG benchmarks demonstrate that T2SGrid achieves superior performance.
Chinese Translation
视频时间定位(VTG)旨在定位与自然语言查询相对应的视频片段,这需要对复杂的时间动态进行全面理解。现有的视觉语言模型(Vision-LMMs)通常通过位置编码、基于文本的时间戳或视觉帧编号来感知时间动态。然而,这些方法存在显著的局限性:为每个帧分配基于文本的时间戳令牌会增加额外的计算开销,并导致视觉注意力的稀疏性,位置编码难以捕捉绝对时间信息,而视觉帧编号往往会妨碍空间细节。为了解决这些问题,我们提出了时间到空间网格化(T2SGrid),这是一个将视频时间理解重新构建为空间理解任务的新框架。T2SGrid的核心思想是以片段而非单个帧来处理视频内容。我们采用重叠滑动窗口机制将视频分割成时间片段。在每个窗口内,帧按时间顺序以行优先的方式排列成复合网格图像,有效地将时间序列转换为结构化的二维布局。网格化不仅编码了时间信息,还增强了每个网格内的局部注意力。此外,T2SGrid还允许使用复合文本时间戳来建立全局时间意识。在标准VTG基准测试中的实验表明,T2SGrid实现了优越的性能。
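The gridification step itself is simple to reproduce. Below is a minimal sketch, assuming equally sized H x W x 3 numpy frames; the 3 x 3 grid and the window/stride values are illustrative, not the paper's settings.

import numpy as np

def gridify(frames, rows=3, cols=3):
    # Pack rows*cols chronologically ordered frames into one composite image,
    # filling left-to-right, top-to-bottom (row-major order).
    assert len(frames) == rows * cols
    grid_rows = [np.concatenate(frames[r * cols:(r + 1) * cols], axis=1)
                 for r in range(rows)]
    return np.concatenate(grid_rows, axis=0)

def overlapping_windows(frames, size=9, stride=6):
    # Overlapping temporal clips; each clip becomes one grid image.
    return [frames[i:i + size] for i in range(0, len(frames) - size + 1, stride)]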
cs.CV / 59 / 2603.06982

Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning

优化基于图像的形状检索的多模态模型:预对齐和困难对比学习的作用
Kühn, Paul Julius, Spengler, Cedric, Weinmann, Michael, Kuijper, Arjan, Sinha, Saptarshi Neil
Abstract
Image-based shape retrieval (IBSR) aims to retrieve 3D models from a database given a query image, hence addressing a classical task in computer vision, computer graphics, and robotics. Recent approaches typically rely on bridging the domain gap between 2D images and 3D shapes based on the use of multi-view renderings as well as task-specific metric learning to embed shapes and images into a common latent space. In contrast, we address IBSR through large-scale multi-modal pretraining and show that explicit view-based supervision is not required. Inspired by pre-aligned image--point-cloud encoders from ULIP and OpenShape that have been used for tasks such as 3D shape classification, we propose the use of pre-aligned image and shape encoders for zero-shot and standard IBSR by embedding images and point clouds into a shared representation space and performing retrieval via similarity search over compact single-embedding shape descriptors. This formulation allows skipping view synthesis and naturally enables zero-shot and cross-domain retrieval without retraining on the target database. We evaluate pre-aligned encoders in both zero-shot and supervised IBSR settings and additionally introduce a multi-modal hard contrastive loss (HCL) to further increase retrieval performance. Our evaluation demonstrates state-of-the-art performance, outperforming related methods on $Acc_{Top1}$ and $Acc_{Top10}$ for shape retrieval across multiple datasets, with best results observed for OpenShape combined with Point-BERT. Furthermore, training on our proposed multi-modal HCL yields dataset-dependent gains in standard instance retrieval tasks on shape-centric data, underscoring the value of pretraining and hard contrastive learning for 3D shape retrieval. The code will be made available via the project website.
Chinese Translation
基于图像的形状检索(IBSR)旨在根据查询图像从数据库中检索3D模型,因此解决了计算机视觉、计算机图形学和机器人学中的一个经典任务。最近的方法通常依赖于通过使用多视图渲染以及特定任务的度量学习来弥合2D图像与3D形状之间的领域差距,以将形状和图像嵌入到一个共同的潜在空间。相比之下,我们通过大规模多模态预训练来解决IBSR,并展示了显式的基于视图的监督并不是必需的。受到ULIP和OpenShape中用于3D形状分类等任务的预对齐图像-点云编码器的启发,我们提出使用预对齐的图像和形状编码器进行零样本和标准IBSR,通过将图像和点云嵌入到共享表示空间,并通过对紧凑的单嵌入形状描述符进行相似性搜索来执行检索。这种形式化允许跳过视图合成,并自然地实现零样本和跨领域检索,而无需在目标数据库上进行重新训练。我们在零样本和监督IBSR设置中评估了预对齐编码器,并额外引入了一种多模态困难对比损失(HCL)以进一步提高检索性能。我们的评估展示了最先进的性能,在多个数据集上超越了相关方法的$Acc_{Top1}$和$Acc_{Top10}$,在OpenShape与Point-BERT结合时观察到最佳结果。此外,在我们提出的多模态HCL上进行训练在形状中心数据的标准实例检索任务中获得了数据集依赖的提升,强调了预训练和困难对比学习在3D形状检索中的价值。代码将通过项目网站提供。
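As a rough illustration of the hard contrastive idea, the sketch below trains paired image/shape embeddings with an InfoNCE loss restricted to the hardest in-batch negatives. The temperature, the top-k value, and the exact formulation are assumptions (and the batch must be larger than top_k); the paper's HCL may differ.

import torch
import torch.nn.functional as F

def hard_contrastive_loss(img_emb, shape_emb, tau=0.07, top_k=5):
    # img_emb, shape_emb: (B, D) paired embeddings from pre-aligned encoders.
    img = F.normalize(img_emb, dim=-1)
    shp = F.normalize(shape_emb, dim=-1)
    sim = img @ shp.t() / tau                      # (B, B) scaled cosine similarities
    eye = torch.eye(len(sim), dtype=torch.bool, device=sim.device)
    pos = sim.diag().unsqueeze(1)                  # matched image-shape pairs
    hard_neg = sim.masked_fill(eye, float("-inf")).topk(top_k, dim=1).values
    logits = torch.cat([pos, hard_neg], dim=1)     # positive sits at index 0
    labels = torch.zeros(len(sim), dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, labels)         # pull positives above hard negatives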
cs.CV / 60 / 2603.06985

Perception-Aware Multimodal Spatial Reasoning from Monocular Images

感知驱动的单目图像多模态空间推理
Cheng, Yanchun, Wang, Rundong, Yang, Xulei, Prakash, Alok, Rus, Daniela, Ang Jr, Marcelo H, Li, ShiJie
Abstract
Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) still struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. We propose a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, enabling visual evidence and textual reasoning to be processed jointly in a unified token space. To further strengthen cross-modal interaction, we construct a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced to make supervision over inherently unordered VRT sets fully compatible with the VLM's autoregressive next-token prediction. With only standard supervised fine-tuning, our method achieves substantial improvements on the SURDS benchmark, outperforming previous approaches - including those using RL-based post-training - by a large margin across both single-object and multi-object tasks. These results demonstrate that accurate perception and multimodal reasoning are mutually reinforcing, and together form the key to robust spatial understanding in challenging monocular driving scenarios.
Chinese Translation
从单目图像进行空间推理对于自动驾驶至关重要,但当前的视觉语言模型(Vision-Language Models, VLMs)在细粒度几何感知方面仍然存在困难,特别是在大规模变化和模糊物体外观的情况下。我们提出了一种简单而有效的感知驱动多模态推理框架,使VLM具备明确的以物体为中心的基础能力。我们不再依赖文本边界框输出,而是使用所有视觉参考标记(Visual Reference Tokens, VRTs)在其空间范围内表示每个被提及的物体,从而使视觉证据和文本推理能够在统一的标记空间中共同处理。为了进一步增强跨模态交互,我们构建了一个多模态思维链(Multimodal Chain-of-Thought, MM-CoT)数据集,注入对齐的视觉和文本推理信号。我们引入了一种确定性排序策略,使对本质上无序的VRT集合的监督与VLM的自回归下一个标记预测完全兼容。仅通过标准的监督微调,我们的方法在SURDS基准测试中取得了显著的改进,在单物体和多物体任务中均大幅超越了包括基于强化学习的后训练方法在内的先前方法。这些结果表明,准确的感知和多模态推理是相互促进的,二者共同构成了在具有挑战性的单目驾驶场景中实现稳健空间理解的关键。
cs.CV / 61 / 2603.06989

MipSLAM: Alias-Free Gaussian Splatting SLAM

MipSLAM:无混叠高斯点云SLAM
Li, Yingzhao, Li, Yan, Tian, Shixiong, Liu, Yanjie, Zhao, Lijun, Lee, Gim Hee
Abstract
This paper introduces MipSLAM, a frequency-aware 3D Gaussian Splatting (3DGS) SLAM framework capable of high-fidelity anti-aliased novel view synthesis and robust pose estimation under varying camera configurations. Existing 3DGS-based SLAM systems often suffer from aliasing artifacts and trajectory drift due to inadequate filtering and purely spatial optimization. To overcome these limitations, we propose an Elliptical Adaptive Anti-aliasing (EAA) algorithm that approximates Gaussian contributions via geometry-aware numerical integration, avoiding costly analytic computation. Furthermore, we present a Spectral-Aware Pose Graph Optimization (SA-PGO) module that reformulates trajectory estimation in the frequency domain, effectively suppressing high-frequency noise and drift through graph Laplacian analysis. A novel local frequency-domain perceptual loss is also introduced to enhance fine-grained geometric detail recovery. Extensive evaluations on Replica and TUM datasets demonstrate that MipSLAM achieves state-of-the-art rendering quality and localization accuracy across multiple resolutions while maintaining real-time capability. Code is available at https://github.com/yzli1998/MipSLAM.
Chinese Translation
本文介绍了MipSLAM,一个频率感知的三维高斯点云(3D Gaussian Splatting, 3DGS)SLAM框架,能够在不同相机配置下实现高保真无混叠的新视图合成和稳健的姿态估计。现有的基于3DGS的SLAM系统常常由于过滤不足和纯粹的空间优化而遭受混叠伪影和轨迹漂移。为了解决这些问题,我们提出了一种椭圆自适应抗混叠(Elliptical Adaptive Anti-aliasing, EAA)算法,该算法通过几何感知的数值积分来近似高斯贡献,从而避免昂贵的解析计算。此外,我们还提出了一种频谱感知姿态图优化(Spectral-Aware Pose Graph Optimization, SA-PGO)模块,该模块在频域中重新表述轨迹估计,通过图拉普拉斯分析有效抑制高频噪声和漂移。我们还引入了一种新颖的局部频域感知损失,以增强细粒度几何细节的恢复。在Replica和TUM数据集上的广泛评估表明,MipSLAM在多个分辨率下实现了最先进的渲染质量和定位精度,同时保持实时能力。代码可在 https://github.com/yzli1998/MipSLAM 获取。
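To make the frequency-domain intuition concrete, here is a minimal sketch of graph-Laplacian trajectory smoothing: penalizing the Laplacian quadratic form of a chain graph attenuates high-frequency pose noise while keeping the low-frequency path. This is only an analogy for SA-PGO, not its implementation; the weight lam and the translation-only treatment are assumptions.

import numpy as np

def laplacian_smooth(traj, lam=5.0):
    # traj: (N, 3) noisy camera translations on a chain (path) graph.
    # Solves argmin_x ||x - traj||^2 + lam * x^T L x, i.e. (I + lam*L) x = traj.
    n = len(traj)
    L = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                L[i, i] += 1.0   # degree term
                L[i, j] -= 1.0   # adjacency term
    return np.linalg.solve(np.eye(n) + lam * L, traj)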
cs.CV / 62 / 2603.06993

AdaGen: Learning Adaptive Policy for Image Synthesis

AdaGen:用于图像合成的自适应策略学习
Ni, Zanlin, Wang, Yulin, Hua, Yeguo, Zhou, Renping, Guo, Jiayi, Song, Jun, Zheng, Bo, Huang, Gao
Abstract
Recent advances in image synthesis have been propelled by powerful generative models, such as Masked Generative Transformers (MaskGIT), autoregressive models, diffusion models, and rectified flow models. A common principle behind their success is the decomposition of synthesis into multiple steps. However, this introduces a proliferation of step-specific parameters (e.g., noise level or temperature at each step). Existing approaches typically rely on manually-designed rules to manage this complexity, demanding expert knowledge and trial-and-error. Furthermore, these static schedules lack the flexibility to adapt to the unique characteristics of each sample, yielding sub-optimal performance. To address this issue, we present AdaGen, a general, learnable, and sample-adaptive framework for scheduling the iterative generation process. Specifically, we formulate the scheduling problem as a Markov Decision Process, where a lightweight policy network determines suitable parameters given the current generation state, and can be trained through reinforcement learning. Importantly, we demonstrate that simple reward designs, such as FID or pre-trained reward models, can be easily hacked and may not reliably guarantee the desired quality or diversity of generated samples. Therefore, we propose an adversarial reward design to guide the training of the policy networks. Finally, we introduce an inference-time refinement strategy and a controllable fidelity-diversity trade-off mechanism to further enhance the performance and flexibility of AdaGen. Comprehensive experiments on four generative paradigms validate the superiority of AdaGen. For example, AdaGen achieves better performance on DiT-XL with 3 times lower inference cost and improves the FID of VAR from 1.92 to 1.59 with negligible computational overhead.
Chinese Translation
近年来,图像合成的进展得益于强大的生成模型,如掩码生成变换器(Masked Generative Transformers, MaskGIT)、自回归模型、扩散模型和修正流模型。这些模型成功的共同原则是将合成过程分解为多个步骤。然而,这引入了大量特定步骤的参数(例如,每一步的噪声水平或温度)。现有的方法通常依赖于手动设计的规则来管理这种复杂性,这需要专家知识和反复试验。此外,这些静态调度缺乏适应每个样本独特特征的灵活性,导致次优性能。为了解决这个问题,我们提出了AdaGen,一个通用的、可学习的、样本自适应的框架,用于调度迭代生成过程。具体而言,我们将调度问题形式化为马尔可夫决策过程(Markov Decision Process),其中一个轻量级策略网络根据当前生成状态确定合适的参数,并可以通过强化学习进行训练。重要的是,我们证明了简单的奖励设计,如FID或预训练奖励模型,容易被破解,可能无法可靠地保证生成样本的质量或多样性。因此,我们提出了一种对抗性奖励设计来指导策略网络的训练。最后,我们引入了一种推理时的精炼策略和可控的保真度-多样性权衡机制,以进一步提升AdaGen的性能和灵活性。在四种生成范式上的全面实验验证了AdaGen的优越性。例如,AdaGen在DiT-XL上实现了更好的性能,推理成本降低了3倍,并将VAR的FID从1.92改善至1.59,几乎没有计算开销。
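A minimal sketch of the policy side, under assumed shapes: a small network reads a pooled summary of the current generation state plus the progress fraction and emits one step parameter, here a temperature. Every layer size and the (0, 2) output range are illustrative, not the paper's configuration.

import torch
import torch.nn as nn

class SchedulePolicy(nn.Module):
    def __init__(self, state_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state_summary, t_frac):
        # state_summary: (B, state_dim) pooled generation state; t_frac: (B, 1)
        # fraction of the sampling schedule already completed.
        x = torch.cat([state_summary, t_frac], dim=-1)
        return torch.sigmoid(self.net(x)) * 2.0  # per-sample temperature in (0, 2)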
cs.CV / 63 / 2603.06999

TrajPred: Trajectory-Conditioned Joint Embedding Prediction for Surgical Instrument-Tissue Interaction Recognition in Vision-Language Models

TrajPred:基于轨迹条件的联合嵌入预测用于视觉-语言模型中的外科器械与组织交互识别
Cheng, Jiajun, Yu, Xiaofan, Subarna, Liu, Sainan, Lin, Shan
Abstract
Recognizing instruments' interactions with tissues is essential for building context-aware AI assistants in robotic surgery. Vision-language models (VLMs) have opened a new avenue for surgical perception and achieved better generalization on a wide range of tasks compared to conventional task-specific deep learning approaches. However, their performance on instrument--tissue interaction recognition remains limited, largely due to two challenges: (1) many models do not effectively leverage temporal information, and (2) alignment between vision and text often misses fine-grained action details. To address these issues, we propose TrajPred, a framework that encodes instrument trajectories to incorporate temporal motion cues and, conditioned on these trajectories, introduces a predictor module to generate visual semantic embeddings that better capture fine-grained action details. We further incorporate prompt tuning and a verb-rephrasing technique to enable smooth adaptation to the instrument--tissue interaction recognition task. Extensive experiments on the public laparoscopic benchmark, CholecT50, show that our method improves both Average Precision and Top-K accuracy. We also investigate whether visual embeddings of instrument--tissue interaction regions align better with the corresponding text by visualizing the cosine similarity between visual and textual embeddings. The visualization results indicate that the proposed method improves alignment between relevant visual and textual representations.
Chinese Translation
识别器械与组织的交互对于构建具有上下文感知能力的机器人手术AI助手至关重要。视觉-语言模型(VLMs)为外科感知开辟了新的途径,并在广泛的任务上实现了比传统特定任务深度学习方法更好的泛化能力。然而,它们在器械与组织交互识别方面的表现仍然有限,主要由于两个挑战:(1)许多模型未能有效利用时间信息,以及(2)视觉与文本之间的对齐常常缺乏细粒度的动作细节。为了解决这些问题,我们提出了TrajPred,一个编码器械轨迹以融入时间运动线索的框架,并基于这些轨迹引入预测模块,以生成更好捕捉细粒度动作细节的视觉语义嵌入。我们进一步结合提示调优和动词改述技术,以便顺利适应器械与组织交互识别任务。在公共腹腔镜基准数据集CholecT50上的大量实验表明,我们的方法提高了平均精度和Top-K准确率。我们还通过可视化视觉与文本嵌入之间的余弦相似度,研究器械与组织交互区域的视觉嵌入是否与相应文本更好地对齐。可视化结果表明,所提出的方法改善了相关视觉和文本表示之间的对齐。
cs.CV / 64 / 2603.07022

OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation

OV-DEIM:基于网格合成增强的实时DETR风格开放词汇目标检测
Wang, Leilei, Liu, Longfei, Shen, Xi, Yu, Xuanlong, He, Ying Tiffany, Yu, Fei Richard, Chen, Yingyi
Abstract
Real-time open-vocabulary object detection (OVOD) is essential for practical deployment in dynamic environments, where models must recognize a large and evolving set of categories under strict latency constraints. Current real-time OVOD methods are predominantly built upon YOLO-style models. In contrast, real-time DETR-based methods still lag behind in terms of inference latency, model lightweightness, and overall performance. In this work, we present OV-DEIM, an end-to-end DETR-style open-vocabulary detector built upon the recent DEIMv2 framework with integrated vision-language modeling for efficient open-vocabulary inference. We further introduce a simple query supplement strategy that improves Fixed AP without compromising inference speed. Beyond architectural improvements, we introduce GridSynthetic, a simple yet effective data augmentation strategy that composes multiple training samples into structured image grids. By exposing the model to richer object co-occurrence patterns and spatial layouts within a single forward pass, GridSynthetic mitigates the negative impact of noisy localization signals on the classification loss and improves semantic discrimination, particularly for rare categories. Extensive experiments demonstrate that OV-DEIM achieves state-of-the-art performance on open-vocabulary detection benchmarks, delivering superior efficiency and notable improvements on challenging rare categories. Code and pretrained models are available at https://github.com/wleilei/OV-DEIM.
Chinese Translation
实时开放词汇目标检测(OVOD)对于在动态环境中的实际部署至关重要,在这些环境中,模型必须在严格的延迟限制下识别大量不断演变的类别。目前的实时OVOD方法主要基于YOLO风格模型。相比之下,基于DETR的实时方法在推理延迟、模型轻量化和整体性能方面仍然落后。在本研究中,我们提出了OV-DEIM,这是一种基于最近的DEIMv2框架构建的端到端DETR风格开放词汇检测器,集成了视觉-语言建模以实现高效的开放词汇推理。我们进一步引入了一种简单的查询补充策略,该策略在不影响推理速度的情况下提高了固定平均精度(Fixed AP)。除了架构改进外,我们还引入了GridSynthetic,这是一种简单而有效的数据增强策略,将多个训练样本组合成结构化的图像网格。通过在单次前向传递中使模型接触到更丰富的对象共现模式和空间布局,GridSynthetic减轻了噪声定位信号对分类损失的负面影响,并改善了语义区分能力,特别是对于稀有类别。大量实验表明,OV-DEIM在开放词汇检测基准上实现了最先进的性能,提供了卓越的效率并在具有挑战性的稀有类别上显著改善。代码和预训练模型可在 https://github.com/wleilei/OV-DEIM 获取。
cs.CV / 65 / 2603.07043

Fine-Grained 3D Facial Reconstruction for Micro-Expressions

微表情的细粒度三维面部重建
Sun, Che, Zhang, Xinjie, Gao, Rui, Chen, Xu, Wu, Yuwei, Jia, Yunde
Abstract
Recent advances in 3D facial expression reconstruction have demonstrated remarkable performance in capturing macro-expressions, yet the reconstruction of micro-expressions remains unexplored. This novel task is particularly challenging due to the subtle, transient, and low-intensity nature of micro-expressions, which complicate the extraction of stable and discriminative features essential for accurate reconstruction. In this paper, we propose a fine-grained micro-expression reconstruction method that integrates a global dynamic feature capturing stable facial motion patterns with a locally-enriched feature incorporating multiple informative cues from 2D motions, facial priors and 3D facial geometry. Specifically, we devise a plug-and-play dynamic-encoded module to extract micro-expression feature for global facial action, allowing it to leverage prior knowledge from abundant macro-expression data to mitigate the scarcity of micro-expression data. Subsequently, a dynamic-guided mesh deformation module is designed for extracting aggregated local features from dense optical flow, sparse landmark cues and facial mesh geometry, which adaptively refines fine-grained facial micro-expression without compromising global 3D geometry. Extensive experiments on micro-expression datasets demonstrate that our method consistently outperforms state-of-the-art methods in both geometric accuracy and perceptual detail.
Chinese Translation
近年来,三维面部表情重建的进展在捕捉宏观表情方面表现出色,但微表情的重建仍然未被探索。由于微表情的细微、瞬时和低强度特性,这一新任务尤其具有挑战性,这使得提取稳定且具有区分性的特征变得复杂,而这些特征对于准确重建至关重要。在本文中,我们提出了一种细粒度微表情重建方法,该方法将全局动态特征与局部丰富特征相结合,前者捕捉稳定的面部运动模式,后者则整合来自二维运动、面部先验和三维面部几何的多种信息线索。具体而言,我们设计了一个即插即用的动态编码模块,以提取全局面部动作的微表情特征,使其能够利用丰富的宏观表情数据中的先验知识,以缓解微表情数据的稀缺性。随后,我们设计了一个动态引导的网格变形模块,从密集光流、稀疏特征点线索和面部网格几何中提取聚合的局部特征,该模块自适应地细化细粒度面部微表情,而不影响全局三维几何。对微表情数据集的广泛实验表明,我们的方法在几何准确性和感知细节方面始终优于现有的最先进方法。
cs.CV / 66 / 2603.07048

Looking Back and Forth: Cross-Image Attention Calibration and Attentive Preference Learning for Multi-Image Hallucination Mitigation

回顾与展望:跨图像注意力校准与关注偏好学习用于多图像幻觉缓解
Yang, Xiaochen, Fang, Hao, Kong, Jiawei, Mao, Yaoxin, Chen, Bin, Xia, Shu-Tao
Abstract
Although large vision-language models (LVLMs) have demonstrated remarkable capabilities, they are prone to hallucinations in multi-image tasks. We attribute this issue to limitations in existing attention mechanisms and insufficient cross-image modeling. Inspired by this, we propose a structured hallucination mitigation framework involving Cross-Image Attention calibration and Preference Learning (CAPL). CAPL explicitly enhances inter-image interactions at the architectural level while reinforcing reliance on genuine cross-image evidence during training, thereby improving the model's perception and modeling of cross-image associations. Specifically, we (i) introduce a selectable image token interaction attention mechanism to establish fine-grained cross-image entity alignment and information flow; (ii) design a cross-image modeling-based preference optimization strategy that contrasts reasoning outcomes under full inter-image interaction and those obtained when images are mutually invisible, encouraging the model to ground its predictions in authentic visual evidence and mitigating erroneous inferences driven by textual priors. Experimental results demonstrate that CAPL consistently improves performance across multiple model architectures, achieving stable gains on both multi-image hallucination and general benchmarks. Notably, performance on single-image visual tasks remains stable or slightly improves, indicating strong generalization capability.
Chinese Translation
尽管大型视觉-语言模型(LVLMs)展现了显著的能力,但在多图像任务中容易出现幻觉。我们将这一问题归因于现有注意力机制的局限性和跨图像建模的不足。受此启发,我们提出了一种结构化的幻觉缓解框架,涉及跨图像注意力校准与偏好学习(CAPL)。CAPL 在架构层面上显式增强了图像间的交互,同时在训练过程中强化对真实跨图像证据的依赖,从而改善模型对跨图像关联的感知和建模。具体而言,我们(i)引入了一种可选择的图像令牌交互注意力机制,以建立细粒度的跨图像实体对齐和信息流;(ii)设计了一种基于跨图像建模的偏好优化策略,对比在完全图像交互下的推理结果与图像互相不可见时的结果,鼓励模型将其预测基于真实的视觉证据,从而缓解由文本先验驱动的错误推断。实验结果表明,CAPL 在多个模型架构上持续提升性能,在多图像幻觉和一般基准测试中均取得稳定的提升。值得注意的是,在单图像视觉任务上的表现保持稳定或略有提升,表明其强大的泛化能力。
cs.CV / 67 / 2603.07057

SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer

SODA:面向敏感性的动态加速方法用于扩散变换器
Shao, Tong, Fu, Yusen, Sun, Guoying, Kong, Jingde, Tian, Zhuotao, Su, Jingyong
Abstract
Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruning achieves a balance between acceleration and generation quality. However, existing methods typically employ fixed and heuristic schemes to configure caching and pruning strategies. While they roughly follow the overall sensitivity trend of generation models to acceleration, they fail to capture fine-grained and complex variations, inevitably skipping highly sensitive computations and leading to quality degradation. Furthermore, such manually designed strategies exhibit poor generalization. To address these issues, we propose SODA, a Sensitivity-Oriented Dynamic Acceleration method that adaptively performs caching and pruning based on fine-grained sensitivity. SODA builds an offline sensitivity error modeling framework across timesteps, layers, and modules to capture the sensitivity to different acceleration operations. The cache intervals are optimized via dynamic programming with sensitivity error as the cost function, minimizing the impact of caching on model sensitivity. During pruning and cache reuse, SODA adaptively determines the pruning timing and rate to preserve computations of highly sensitive tokens, significantly enhancing generation fidelity. Extensive experiments on DiT-XL/2, PixArt-$\alpha$, and OpenSora demonstrate that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios. Our code is released publicly at: https://github.com/leaves162/SODA.
Chinese Translation
扩散变换器已成为视觉生成的主导范式,但其低推理效率仍然是阻碍进一步发展的关键瓶颈。在常见的无训练技术中,缓存提供了高加速效率,但往往牺牲了保真度,而剪枝则表现出相反的权衡。将缓存与剪枝相结合可以在加速和生成质量之间实现平衡。然而,现有方法通常采用固定的启发式方案来配置缓存和剪枝策略。尽管它们大致遵循生成模型对加速的整体敏感性趋势,但未能捕捉到细粒度和复杂的变化,必然跳过高度敏感的计算,导致质量下降。此外,这些手动设计的策略表现出较差的泛化能力。为了解决这些问题,我们提出了SODA,一种基于细粒度敏感性的面向敏感性的动态加速方法,能够自适应地执行缓存和剪枝。SODA在时间步、层和模块之间构建了一个离线敏感性误差建模框架,以捕捉对不同加速操作的敏感性。缓存间隔通过动态规划进行优化,以敏感性误差作为成本函数,最小化缓存对模型敏感性的影响。在剪枝和缓存重用过程中,SODA自适应地确定剪枝时机和速率,以保留高度敏感标记的计算,显著提高生成的保真度。在DiT-XL/2、PixArt-$\alpha$和OpenSora上的大量实验表明,SODA在可控加速比下实现了最先进的生成保真度。我们的代码已公开发布在:https://github.com/leaves162/SODA。
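The interval optimization can be written as a small dynamic program. The sketch below assumes an offline table err[i][j] giving the modeled sensitivity error of reusing a cache computed at step i for steps i through j; the table itself and the O(K*T^2) formulation are assumptions about how such a planner could look, not SODA's code.

import numpy as np

def plan_cache_intervals(err, num_refresh):
    # Choose `num_refresh` cache-refresh points over T steps minimizing total
    # modeled sensitivity error; dp[m][j] = best cost covering steps 0..j-1
    # with m refreshes.
    T = len(err)
    dp = np.full((num_refresh + 1, T + 1), np.inf)
    prev = np.zeros((num_refresh + 1, T + 1), dtype=int)
    dp[0][0] = 0.0
    for m in range(1, num_refresh + 1):
        for j in range(1, T + 1):
            for i in range(j):  # last refresh at step i covers steps i..j-1
                c = dp[m - 1][i] + err[i][j - 1]
                if c < dp[m][j]:
                    dp[m][j], prev[m][j] = c, i
    cuts, j = [], T
    for m in range(num_refresh, 0, -1):  # backtrack the chosen refresh steps
        j = prev[m][j]
        cuts.append(j)
    return sorted(cuts), dp[num_refresh][T]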
cs.CV / 68 / 2603.07066

MedSteer: Counterfactual Endoscopic Synthesis via Training-Free Activation Steering

MedSteer:通过无训练激活引导进行反事实内窥镜合成
Pham, Trong-Thang, Nguyen, Loc, Nguyen, Anh, Nguyen, Hien, Le, Ngan
Abstract
Generative diffusion models are increasingly used for medical imaging data augmentation, but text prompting cannot produce causal training data. Re-prompting rerolls the entire generation trajectory, altering anatomy, texture, and background. Inversion-based editing methods introduce reconstruction error that causes structural drift. We propose MedSteer, a training-free activation-steering framework for endoscopic synthesis. MedSteer identifies a pathology vector for each contrastive prompt pair in the cross-attention layers of a diffusion transformer. At inference time, it steers image activations along this vector, generating counterfactual pairs from scratch where the only difference is the steered concept. All other structure is preserved by construction. We evaluate MedSteer across three experiments on Kvasir v3 and HyperKvasir. On counterfactual generation across three clinical concept pairs, MedSteer achieves flip rates of 0.800, 0.925, and 0.950, outperforming the best inversion-based baseline in both concept flip rate and structural preservation. On dye disentanglement, MedSteer achieves 75% dye removal against 20% (PnP) and 10% (h-Edit). On downstream polyp detection, augmenting with MedSteer counterfactual pairs achieves ViT AUC of 0.9755 versus 0.9083 for quantity-matched re-prompting, confirming that counterfactual structure drives the gain. Code is available at https://github.com/phamtrongthang123/medsteer
Chinese Translation
生成扩散模型在医学影像数据增强中越来越多地被应用,但文本提示无法生成因果训练数据。重新提示会重新生成整个生成轨迹,改变解剖结构、纹理和背景。基于反演的编辑方法引入重建误差,导致结构漂移。我们提出了MedSteer,一种无训练的激活引导框架用于内窥镜合成。MedSteer在扩散变换器的交叉注意力层中为每对对比提示识别病理向量。在推理时,它沿着该向量引导图像激活,从头生成反事实对,其唯一的区别是引导的概念。所有其他结构在构造上得以保留。我们在Kvasir v3和HyperKvasir上通过三个实验评估了MedSteer。在三个临床概念对的反事实生成中,MedSteer的翻转率分别为0.800、0.925和0.950,超越了最佳基于反演的基线,无论是在概念翻转率还是结构保留方面。在染料解耦方面,MedSteer实现了75%的染料去除,而PnP为20%,h-Edit为10%。在下游息肉检测中,使用MedSteer反事实对进行增强的ViT AUC为0.9755,而数量匹配的重新提示为0.9083,确认反事实结构推动了增益。代码可在链接 https://github.com/phamtrongthang123/medsteer 获取。
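The core steering operation is compact. Below is a generic activation-steering sketch: a direction derived from a contrastive prompt pair, added to image activations at inference. The encode_text callable, the layer placement, and the strength alpha are assumptions; MedSteer derives per-layer vectors inside the diffusion transformer's cross-attention.

import torch

def pathology_vector(encode_text, prompt_pos, prompt_neg):
    # Normalized difference of the two prompt embeddings (hypothetical encoder).
    v = encode_text(prompt_pos) - encode_text(prompt_neg)
    return v / v.norm()

def steer(activations, v, alpha=3.0):
    # Shift activations along the concept direction; running the same seed
    # with alpha and with 0 yields a counterfactual pair differing only in
    # the steered concept.
    return activations + alpha * v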
cs.CV / 69 / 2603.07071

VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding

VirtueBench:在不确定性下评估长视频理解的可信度
Yu, Xueqing, Li, Bohan, Li, Yan, Yang, Zhenheng
Abstract
Recent Vision-Language Models (VLMs) have made remarkable progress in multimodal understanding tasks, yet their evaluation on long video understanding remains unreliable. Due to limited frame inputs, key frames necessary for answering the question may be missing from the model's input. However, models that truthfully refuse to answer under such uncertainty are marked as incorrect, while those that guess may coincidentally produce the correct answer and thus obtain deceptively higher accuracy, leading to misleading evaluation results and encouraging models to guess rather than respond honestly. To address this issue, we introduce VirtueBench, a benchmark explicitly designed to assess model trustworthiness under uncertainty. VirtueBench constructs multiple frame-sampling levels for each video and provides ground truths that distinguish between answerable and unanswerable cases. Evaluations on 25 open-source and commercial VLMs reveal distinct refusal behaviors across different model families, with refusal accuracy ranging from over 70% in the best models to nearly 0% in the worst. Moreover, most models exhibit a substantial drop in refusal when the prompt does not explicitly require them to do so. These findings highlight the need for developing trustworthy VLMs for multimodal understanding, guided by benchmarks and leaderboards that emphasize reliability and trustworthiness.
Chinese Translation
近期的视觉-语言模型(VLMs)在多模态理解任务中取得了显著进展,但它们在长视频理解方面的评估仍然不可靠。由于输入帧数量有限,回答问题所需的关键帧可能在模型的输入中缺失。然而,在这种不确定性下,真实拒绝回答的模型被标记为错误,而那些猜测的模型可能偶然产生正确答案,从而获得误导性的高准确率,导致评估结果误导,并鼓励模型猜测而非诚实回应。为了解决这一问题,我们引入了VirtueBench,一个专门设计用于评估模型在不确定性下可信度的基准。VirtueBench为每个视频构建了多个帧采样级别,并提供了区分可回答和不可回答情况的真实答案。对25个开源和商业VLM的评估揭示了不同模型家族之间明显的拒绝行为,拒绝准确率在最佳模型中超过70%,而在最差模型中接近0%。此外,当提示未明确要求模型拒绝时,大多数模型的拒绝率显著下降。这些发现突显了开发可信赖的VLM以实现多模态理解的必要性,并强调了以可靠性和可信度为重点的基准和排行榜的重要性。
cs.CV / 70 / 2603.07074

Physics-Guided VLM Priors for All-Cloud Removal

物理引导的全云去除的 VLM 先验
Xu, Liying, Li, Huifang, Shen, Huanfeng
Abstract
Cloud removal is a fundamental challenge in optical remote sensing due to the heterogeneous degradation. Thin clouds distort radiometry via partial transmission, while thick clouds occlude the surface. Existing pipelines separate thin-cloud correction from thick-cloud reconstruction, requiring explicit cloud-type decisions and often leading to error accumulation and discontinuities in mixed-cloud scenes. We therefore propose Physical-VLM All-Cloud Removal (PhyVLM-CR), a novel approach that integrates the semantic capability of a Vision-Language Model (VLM) into a physical restoration model, achieving high-fidelity unified cloud removal. Specifically, the cognitive prior from a VLM (e.g., Qwen) is transformed into physical scattering parameters and a hallucination confidence map. Leveraging this confidence map as a continuous soft gate, our method achieves a unified restoration via adaptive weighting: it prioritizes physical inversion in high-transmission regions to preserve radiometric fidelity, while seamlessly transitioning to temporal reference reconstruction in low-confidence occluded areas. This mechanism eliminates the need for explicit boundary delineation, ensuring a coherent removal across heterogeneous cloud covers. Experiments on real-world Sentinel-2 surface reflectance imagery confirm that our approach achieves a remarkable balance between cloud removal and content preservation, delivering hallucination-free results with substantially improved quantitative accuracy compared to existing methods.
Chinese Translation
云去除是光学遥感中的一个基本挑战,主要由于异质性降解。薄云通过部分透射扭曲辐射测量,而厚云则遮挡了地表。现有的处理流程将薄云校正与厚云重建分开,需明确云类型的决策,常常导致误差累积和混合云场景中的不连续性。因此,我们提出了一种新方法,称为物理-VLM 全云去除(PhyVLM-CR),该方法将视觉-语言模型(VLM)的语义能力与物理恢复模型相结合,实现高保真度的统一云去除。具体而言,VLM(例如 Qwen)提供的认知先验被转化为物理散射参数和幻觉置信度图。利用该置信度图作为连续的软门控,我们的方法通过自适应加权实现统一恢复:在高透射区域优先进行物理反演,以保持辐射测量的保真度,同时在低置信度的遮挡区域无缝过渡到时间参考重建。该机制消除了对明确边界划分的需求,确保在异质云覆盖下的连贯去除。针对真实世界 Sentinel-2 表面反射率影像的实验表明,我们的方法在云去除和内容保留之间实现了显著平衡,提供了无幻觉的结果,并在定量准确性上显著优于现有方法。
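The soft gate reduces to a per-pixel convex blend. A minimal sketch, assuming a confidence map in [0, 1] where high values mark reliable transmission; the array names are illustrative.

import numpy as np

def fuse(physical_inv, temporal_ref, confidence):
    # physical_inv, temporal_ref: (H, W, C) candidate restorations;
    # confidence: (H, W) hallucination-confidence map acting as the gate.
    w = np.clip(confidence, 0.0, 1.0)[..., None]
    return w * physical_inv + (1.0 - w) * temporal_ref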
cs.CV / 71 / 2603.07076

Retinex Meets Language: A Physics-Semantics-Guided Underwater Image Enhancement Network

Retinex与语言相结合:一种物理-语义引导的水下图像增强网络
Xu, Shixuan, Liu, Yabo, Dong, Junyu, Dong, Xinghui
Abstract
Underwater images often suffer from severe degradation caused by light absorption and scattering, leading to color distortion, low contrast and reduced visibility. Existing Underwater Image Enhancement (UIE) methods can be divided into two categories, i.e., prior-based and learning-based methods. The former rely on rigid physical assumptions that limit the adaptability, while the latter often face data scarcity and weak generalization. To address these issues, we propose a Physics-Semantics-Guided Underwater Image Enhancement Network (PSG-UIENet), which couples the Retinex-grounded illumination correction with the language-informed guidance. This network comprises a Prior-Free Illumination Estimator, a Cross-Modal Text Aligner and a Semantics-Guided Image Restorer. In particular, the restorer leverages the textual descriptions generated by the Contrastive Language-Image Pre-training (CLIP) model to inject high-level semantics for perceptually meaningful guidance. Since multimodal UIE data sets are not publicly available, we also construct a large-scale image-text UIE data set, namely, LUIQD-TD, which contains 6,418 image-reference-text triplets. To explicitly measure and optimize semantic consistency between textual descriptions and images, we further design an Image-Text Semantic Similarity (ITSS) loss function. To our knowledge, this study makes the first effort to introduce both textual guidance and the multimodal data set into UIE tasks. Extensive experiments on our data set and four publicly available data sets demonstrate that the proposed PSG-UIENet achieves superior or comparable performance against fifteen state-of-the-art methods.
Chinese Translation
水下图像常常受到光吸收和散射造成的严重退化,导致颜色失真、对比度低和可见性降低。现有的水下图像增强(UIE)方法可以分为两类,即基于先验的方法和基于学习的方法。前者依赖于严格的物理假设,限制了适应性,而后者往往面临数据稀缺和泛化能力弱的问题。为了解决这些问题,我们提出了一种物理-语义引导的水下图像增强网络(PSG-UIENet),该网络将基于Retinex的照明校正与语言信息引导相结合。该网络包括一个无先验照明估计器、一个跨模态文本对齐器和一个语义引导图像恢复器。特别地,恢复器利用对比语言-图像预训练(CLIP)模型生成的文本描述,注入高层次的语义以提供感知上有意义的引导。由于多模态UIE数据集尚未公开,我们还构建了一个大规模的图像-文本UIE数据集,即LUIQD-TD,包含6,418个图像-参考文本三元组。为了明确测量和优化文本描述与图像之间的语义一致性,我们进一步设计了一种图像-文本语义相似性(ITSS)损失函数。据我们所知,本研究首次将文本引导和多模态数据集引入UIE任务。对我们的数据集和四个公开可用的数据集进行的广泛实验表明,所提出的PSG-UIENet在性能上优于或可与十五种最先进的方法相媲美。
cs.CV / 72 / 2603.07077

Aligning What EEG Can See: Structural Representations for Brain-Vision Matching

对齐脑电图(EEG)可见内容:脑-视觉匹配的结构表示
Tang, Jingyi, Jiang, Shuai, Su, Fei, Zhao, Zhicheng
Abstract
Visual decoding from electroencephalography (EEG) has emerged as a highly promising avenue for non-invasive brain-computer interfaces (BCIs). Existing EEG-based decoding methods predominantly align brain signals with the final-layer semantic embeddings of deep visual models. However, relying on these highly abstracted embeddings inevitably leads to severe cross-modal information mismatch. In this work, we introduce the concept of Neural Visibility and accordingly propose the EEG-Visible Layer Selection Strategy, aligning EEG signals with intermediate visual layers to minimize this mismatch. Furthermore, to accommodate the multi-stage nature of human visual processing, we propose a novel Hierarchically Complementary Fusion (HCF) framework that jointly integrates visual representations from different hierarchical levels. Extensive experiments demonstrate that our method achieves state-of-the-art performance, reaching an 84.6% accuracy (+21.4%) on zero-shot visual decoding on the THINGS-EEG dataset. Moreover, our method achieves up to a 129.8% performance gain across diverse EEG baselines, demonstrating its robust generalizability.
Chinese Translation
基于脑电图(EEG)的视觉解码已成为非侵入式脑机接口(BCIs)中一个极具前景的研究方向。现有的基于EEG的解码方法主要将脑信号与深度视觉模型的最终层语义嵌入对齐。然而,依赖这些高度抽象的嵌入不可避免地导致严重的跨模态信息不匹配。在本研究中,我们引入了神经可见性(Neural Visibility)的概念,并相应提出了EEG可见层选择策略(EEG-Visible Layer Selection Strategy),通过将EEG信号与中间视觉层对齐来最小化这种不匹配。此外,为了适应人类视觉处理的多阶段特性,我们提出了一种新颖的层次互补融合框架(Hierarchically Complementary Fusion, HCF),该框架联合整合来自不同层次的视觉表示。大量实验表明,我们的方法在THINGS-EEG数据集上实现了最先进的性能,零样本视觉解码的准确率达到了84.6%(+21.4%)。此外,我们的方法在多种EEG基线中实现了高达129.8%的性能提升,展示了其强大的泛化能力。
cs.CV / 73 / 2603.07093

Facial Expression Generation Aligned with Human Preference for Natural Dyadic Interaction

与人类偏好对齐的面部表情生成用于自然双人互动
Chen, Xu, Gao, Rui, Zhang, Xinjie, Zhang, Haoyu, Sun, Che, Gao, Zhi, Wu, Yuwei, Jia, Yunde
Abstract
Achieving natural dyadic interaction requires generating facial expressions that are emotionally appropriate and socially aligned with human preference. Human feedback offers a compelling mechanism to guide such alignment, yet how to effectively incorporate this feedback into facial expression generation remains underexplored. In this paper, we propose a facial expression generation method aligned with human preference by leveraging human feedback to produce contextually and emotionally appropriate expressions for natural dyadic interaction. A key to our method is framing the generation of identity-independent facial expressions as an action learning process, allowing human feedback to assess their validity free from visual or identity bias. We establish a closed feedback loop in which listener expressions dynamically respond to evolving conversational cues of the speaker. Concretely, we train a vision-language-action model via supervised fine-tuning to map the speaker's multimodal signals into controllable low-dimensional expression representations of a 3D morphable model. We further introduce a human-feedback reinforcement learning strategy that integrates the imitation of high-quality expression response with critic-guided optimization. Experiments on two benchmarks demonstrate that our method effectively aligns facial expressions with human preference and achieves superior performance.
Chinese Translation
实现自然的双人互动需要生成情感上适当且与人类偏好社会上对齐的面部表情。人类反馈提供了一种有效的机制来指导这种对齐,然而如何有效地将这种反馈融入面部表情生成仍然未被充分探索。在本文中,我们提出了一种通过利用人类反馈生成与人类偏好对齐的面部表情的方法,以产生适合自然双人互动的上下文和情感上合适的表情。我们方法的关键在于将身份独立的面部表情生成框架化为一种行动学习过程,使得人类反馈能够在没有视觉或身份偏见的情况下评估其有效性。我们建立了一个闭环反馈机制,其中听者的表情动态响应说话者不断变化的对话线索。具体而言,我们通过监督微调训练了一个视觉-语言-行动模型,将说话者的多模态信号映射到可控的低维表情表示,基于一个三维可变形模型。我们进一步引入了一种人类反馈强化学习策略,将高质量表情响应的模仿与批评者引导的优化相结合。两个基准实验表明,我们的方法有效地将面部表情与人类偏好对齐,并取得了优越的性能。
cs.CV / 74 / 2603.07098

NuNext: Reframing Nucleus Detection as Next-Point Detection

NuNext:将细胞核检测重新框定为下一点检测
Shui, Zhongyi, Li, Honglin, Ji, Xiaozhong, Zhang, Ye, Yang, Zijiang, Zhu, Chenglu, Sun, Yuxuan, Yao, Kai, He, Conghui, Tan, Cheng
Abstract
Nucleus detection in histopathology is pivotal for a wide range of clinical applications. Existing approaches either regress nuclear proxy maps that require complex post-processing, or employ dense anchors or queries that introduce severe foreground-background imbalance. In this work, we reformulate nucleus detection as next-point prediction, wherein a multimodal large language model is developed to directly output foreground nucleus centroids from the input image. The model is trained in two stages. In the supervised learning stage, we propose spatial-aware soft supervision to relax strict centroid matching and a chain-of-visual-thought strategy to incorporate visual priors that facilitate coordinate prediction. In the reinforcement fine-tuning stage, we design distribution matching reward, low-variance group filtering, and fine-grained advantage shaping to further improve the model's detection quality. Extensive experiments on nine widely used benchmarks demonstrate the superiority of our method. Code will be released soon.
Chinese Translation
在组织病理学中,细胞核检测对于广泛的临床应用至关重要。现有的方法要么回归需要复杂后处理的核代理图,要么采用密集锚点或查询,这会引入严重的前景-背景不平衡。在本研究中,我们将细胞核检测重新表述为下一点预测,其中开发了一种多模态大型语言模型,能够直接从输入图像输出前景细胞核质心。该模型分两个阶段进行训练。在监督学习阶段,我们提出了空间感知软监督,以放宽严格的质心匹配,并采用视觉思维链策略来结合有助于坐标预测的视觉先验。在强化微调阶段,我们设计了分布匹配奖励、低方差组过滤和细粒度优势塑造,以进一步提高模型的检测质量。在九个广泛使用的基准上进行的大量实验表明了我们方法的优越性。代码将很快发布。
cs.CV / 75 / 2603.07113

Efficient Chest X-ray Representation Learning via Semantic-Partitioned Contrastive Learning

通过语义分区对比学习实现高效胸部X光图像表示学习
Feng, Wangyu, Young, Shawn, Xu, Lijian
Abstract
Self-supervised learning (SSL) has emerged as a powerful paradigm for Chest X-ray (CXR) analysis under limited annotations. Yet, existing SSL strategies remain suboptimal for medical imaging. Masked image modeling allocates substantial computation to reconstructing high-frequency background details with limited diagnostic value. Contrastive learning, on the other hand, often depends on aggressive augmentations that risk altering clinically meaningful structures. We introduce Semantic-Partitioned Contrastive Learning (S-PCL), an efficient pre-training framework tailored for CXR representation learning. Instead of reconstructing pixels or relying on heavy augmentations, S-PCL randomly partitions patch tokens from a single CXR into two non-overlapping semantic subsets. Each subset provides a complementary but incomplete view. The encoder must maximize agreement between these partitions, implicitly inferring global anatomical layout and local pathological cues from partial evidence. This semantic partitioning forms an internal bottleneck that enforces long-range dependency modeling and structural coherence. S-PCL eliminates the need for hand-crafted augmentations, auxiliary decoders, and momentum encoders. The resulting architecture is streamlined, computationally efficient, and easy to scale. Extensive experiments on large-scale CXR benchmarks, including ChestX-ray14, CheXpert, RSNA Pneumonia and SIIM-ACR Pneumothorax, show that S-PCL achieves competitive performance while attaining the lowest GFLOPs and superior accuracy among existing SSL approaches.
Chinese Translation
自监督学习(SSL)已成为在有限标注下进行胸部X光图像(CXR)分析的强大范式。然而,现有的SSL策略在医学影像领域仍显不足。掩蔽图像建模将大量计算分配给重建高频背景细节,而这些细节的诊断价值有限。相比之下,对比学习通常依赖于激进的数据增强,这可能会改变临床上有意义的结构。我们提出了语义分区对比学习(S-PCL),这是一种专门为CXR表示学习量身定制的高效预训练框架。S-PCL并不重建像素或依赖于繁重的增强,而是随机将单个CXR的补丁标记分成两个不重叠的语义子集。每个子集提供了互补但不完整的视图。编码器必须最大化这些分区之间的一致性,隐含地从部分证据中推断全局解剖布局和局部病理线索。这种语义分区形成了一个内部瓶颈,强制执行长程依赖建模和结构一致性。S-PCL消除了对手工设计的增强、辅助解码器和动量编码器的需求。最终的架构简化、计算效率高且易于扩展。在包括ChestX-ray14、CheXpert、RSNA肺炎和SIIM-ACR气胸在内的大规模CXR基准测试中,广泛的实验表明,S-PCL在实现竞争性性能的同时,在现有SSL方法中获得了最低的GFLOPs和卓越的准确性。
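The partition-and-agree objective is easy to sketch. Below, the patch tokens of each image are split into two disjoint random halves, each half is pooled by an encoder into one vector, and agreement is maximized with an in-batch InfoNCE loss; the pooling encoder and the temperature are assumptions.

import torch
import torch.nn.functional as F

def spcl_loss(encoder, tokens, tau=0.1):
    # tokens: (B, N, D) patch tokens of B chest X-rays; `encoder` maps a
    # token subset (B, N/2, D) to one pooled vector (B, D).
    B, N, D = tokens.shape
    perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    idx_a, idx_b = perm[:, : N // 2], perm[:, N // 2 :]   # disjoint semantic subsets
    gather = lambda idx: torch.take_along_dim(
        tokens, idx.unsqueeze(-1).expand(-1, -1, D), dim=1)
    za = F.normalize(encoder(gather(idx_a)), dim=-1)      # view from subset A
    zb = F.normalize(encoder(gather(idx_b)), dim=-1)      # complementary view B
    logits = za @ zb.t() / tau
    labels = torch.arange(B, device=tokens.device)        # matching partitions agree
    return F.cross_entropy(logits, labels)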
cs.CV / 76 / 2603.07119

TIQA: Human-Aligned Text Quality Assessment in Generated Images

TIQA:生成图像中人类对齐的文本质量评估
Koltsov, Kirill, Gushchin, Aleksandr, Vatolin, Dmitriy, Antsiferova, Anastasia
Abstract
Text rendering remains a persistent failure mode of modern text-to-image models (T2I), yet existing evaluations rely on OCR correctness or VLM-based judging procedures that are poorly aligned with perceptual text artifacts. We introduce Text-in-Image Quality Assessment (TIQA), a task that predicts a scalar quality score that matches human judgments of rendered-text fidelity within cropped text regions. We release two MOS-labeled datasets: TIQA-Crops (10k text crops) and TIQA-Images (1,500 images), spanning 20+ T2I models, including proprietary ones. We also propose ANTIQA, a lightweight method with text-specific biases, and show that it improves correlation with human scores over OCR confidence, VLM judges, and generic NR-IQA metrics by at least $\sim0.05$ on TIQA-Crops and $\sim0.08$ on TIQA-Images, as measured by PLCC. Finally, we show that TIQA models are valuable in downstream tasks: for example, selecting the best-of-5 generations with ANTIQA improves human-rated text quality by $+14\%$ on average, demonstrating practical value for filtering and reranking in generation pipelines.
Chinese Translation
文本渲染仍然是现代文本到图像模型(T2I)的一种持续失败模式,然而现有的评估依赖于OCR准确性或基于视觉语言模型(VLM)的判断程序,这些方法与感知文本伪影的对齐程度较差。我们提出了文本图像质量评估(TIQA),这一任务旨在预测一个标量质量分数,以匹配人类对裁剪文本区域内渲染文本保真度的判断。我们发布了两个带有均值意见分数(MOS)标注的数据集:TIQA-Crops(10,000个文本裁剪)和TIQA-Images(1,500张图像),涵盖了20多个T2I模型,包括一些专有模型。我们还提出了ANTIQA,这是一种具有文本特定偏差的轻量级方法,并展示了它在TIQA-Crops上与人类评分的相关性比OCR置信度、VLM评判者和通用非参考图像质量评估(NR-IQA)指标提高了至少$\sim0.05$,在TIQA-Images上提高了$\sim0.08$,以PLCC测量。最后,我们展示了TIQA模型在下游任务中的价值:例如,使用ANTIQA选择5个生成结果中的最佳者,使人类评分的文本质量平均提高了14%,展示了在生成管道中的过滤和重排序的实际价值。
cs.CV / 77 / 2603.07120

Inter-Image Pixel Shuffling for Multi-focus Image Fusion

用于多焦点图像融合的图像间像素重排
Lin, Huangxing, Ma, Rongrong, Wang, Cheng
Abstract
Multi-focus image fusion aims to combine multiple partially focused images into a single all-in-focus image. Although deep learning has shown promise in this task, its effectiveness is often limited by the scarcity of suitable training data. This paper introduces Inter-image Pixel Shuffling (IPS), a novel method that allows neural networks to learn multi-focus image fusion without requiring actual multi-focus images. IPS reformulates the task as a pixel-wise classification problem, where the goal is to identify the focused pixel from a pixel group at each spatial position. In this method, pixels from a clear optical image are treated as focused, while pixels from a low-pass filtered version of the same image are considered defocused. By randomly shuffling the focused and defocused pixels at identical spatial positions in the original and filtered images, IPS generates training data that preserves spatial structure while mixing focus-defocus information. The model is trained to select the focused pixel from each spatially aligned pixel group, thus learning to reconstruct an all-in-focus image by aggregating sharp content from the input. To further enhance fusion quality, IPS adopts a cross-image fusion network that integrates the localized representation power of convolutional neural networks with the long-range modeling capabilities of state space models. This design effectively leverages both spatial detail and contextual information to produce high-quality fused results. Experimental results indicate that IPS significantly outperforms existing multi-focus image fusion methods, even without training on multi-focus images.
Chinese Translation
多焦点图像融合旨在将多幅部分聚焦的图像合成一幅全聚焦的图像。尽管深度学习在这一任务中展现了潜力,但其有效性往往受到适合的训练数据稀缺的限制。本文提出了一种新颖的方法——图像间像素重排(Inter-image Pixel Shuffling, IPS),该方法允许神经网络在不需要实际多焦点图像的情况下学习多焦点图像融合。IPS将任务重新表述为像素级分类问题,目标是在每个空间位置上识别出聚焦像素。该方法中,来自清晰光学图像的像素被视为聚焦像素,而来自同一图像低通滤波版本的像素则被视为失焦像素。通过在原始图像和滤波图像中随机重排相同空间位置的聚焦和失焦像素,IPS生成保留空间结构的训练数据,同时混合聚焦和失焦信息。模型被训练以从每个空间对齐的像素组中选择聚焦像素,从而学习通过聚合输入中的清晰内容重建全聚焦图像。为了进一步提高融合质量,IPS采用了一种跨图像融合网络,该网络将卷积神经网络的局部表示能力与状态空间模型的长程建模能力相结合。这种设计有效利用了空间细节和上下文信息,以生成高质量的融合结果。实验结果表明,IPS在没有针对多焦点图像训练的情况下,显著优于现有的多焦点图像融合方法。
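The data-generation trick can be reproduced in a few lines. A minimal sketch, assuming an H x W x 3 float image; the Gaussian blur stands in for the paper's low-pass filtering, and the 0.5 shuffle rate is illustrative.

import numpy as np
from scipy.ndimage import gaussian_filter

def make_ips_pair(image, sigma=3.0, rng=None):
    # Pixels of the clear image count as focused, pixels of a blurred copy as
    # defocused; swapping them at random positions yields two complementary
    # pseudo multi-focus inputs, and the mask is the per-pixel class target.
    rng = rng or np.random.default_rng()
    blurred = gaussian_filter(image, sigma=(sigma, sigma, 0))
    mask = rng.random(image.shape[:2]) < 0.5            # True: input_a focused here
    m = mask[..., None]
    input_a = np.where(m, image, blurred)
    input_b = np.where(m, blurred, image)
    return input_a, input_b, mask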
cs.CV / 78 / 2603.07131

Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge

深度专家注入:利用领域特定知识锚定视网膜视觉语言模型
Lu, Shuai, Wang, Meng, Guo, Jia, Du, Jiawei, Liu, Bo, Yang, Shengzhu, Zhang, Weihang, Fu, Huazhu, Li, Huiqi
Abstract
Large Vision Language Models (LVLMs) show immense potential for automated ophthalmic diagnosis. However, their clinical deployment is severely hindered by lacking domain-specific knowledge. In this work, we identify two structural deficiencies hindering reliable medical reasoning: 1) the Perception Gap, where general-purpose visual encoders fail to resolve fine-grained pathological cues (e.g., microaneurysms); and 2) the Reasoning Gap, where sparse visual evidence is progressively overridden by massive language priors in deeper transformer layers, leading to ungrounded hallucinations. To bridge these gaps, we propose EyExIn, a data-efficient framework designed to anchor retinal VLMs with expert knowledge via a Deep Expert Injection mechanism. Our architecture employs an Expert-Aware Dual-Stream encoding strategy that decouples visual representation into a general stream for anatomical context and a specialized expert stream for pathological semantics. To ensure high-fidelity integration, we design a Semantic-Adaptive Gated Fusion module, which dynamically amplifies subtle lesion signals while filtering irrelevant background noise. Furthermore, we introduce Adaptive Deep Expert Injection to embed persistent "Vision Anchors" by integrating fused visual features as residual biases directly into intermediate LLM layers. This mechanism creates a visual shortcut that forces the reasoning stack to remain strictly grounded in visual evidence. Extensive experiments across four benchmarks demonstrate that our model consistently outperforms massive proprietary systems. EyExIn significantly enhances domain-specific knowledge embedding and achieves state-of-the-art precision in ophthalmic visual question answering, advancing the development of trustworthy ophthalmic AI.
Chinese Translation
大型视觉语言模型(LVLMs)在自动化眼科诊断中展现出巨大的潜力。然而,它们的临床应用受到缺乏领域特定知识的严重阻碍。在本研究中,我们识别出两种结构性缺陷,妨碍可靠的医学推理:1)感知差距(Perception Gap),即通用视觉编码器无法解析细微的病理线索(例如,微动脉瘤);2)推理差距(Reasoning Gap),即稀疏的视觉证据在更深的变换器层中逐渐被大量语言先验所覆盖,导致无根据的幻觉。为了弥补这些差距,我们提出了EyExIn,这是一种数据高效的框架,旨在通过深度专家注入机制将专家知识锚定到视网膜VLMs上。我们的架构采用了专家感知双流编码策略,将视觉表示解耦为一个用于解剖上下文的通用流和一个用于病理语义的专业专家流。为了确保高保真度的整合,我们设计了一个语义自适应门控融合模块,该模块动态放大细微病变信号,同时过滤无关的背景噪声。此外,我们引入了自适应深度专家注入,通过将融合的视觉特征作为残差偏置直接嵌入中间的LLM层,来嵌入持久的“视觉锚点”。这一机制创建了一个视觉快捷方式,迫使推理堆栈严格基于视觉证据。我们在四个基准测试中进行的广泛实验表明,我们的模型始终优于大型专有系统。EyExIn显著增强了领域特定知识的嵌入,并在眼科视觉问答中实现了最先进的精度,推动了可信眼科人工智能的发展。
cs.CV / 79 / 2603.07135

The Model Knows Which Tokens Matter: Automatic Token Selection via Noise Gating

模型知道哪些标记重要:通过噪声门控实现自动标记选择
He, Landi, Yang, Xiaoyu, Xu, Lijian
Abstract
Visual tokens dominate inference cost in vision-language models (VLMs), yet many carry redundant information. Existing pruning methods alleviate this but typically rely on attention magnitude or similarity scores. We reformulate visual token pruning as capacity constrained communication: given a fixed budget K, the model must allocate limited bandwidth to maximally preserve visual information. We propose AutoSelect, which attaches a lightweight Scorer and Denoiser to a frozen VLM and trains with only the standard next token prediction loss, without auxiliary objectives or extra annotations. During training, a variance preserving noise gate modulates each token's information flow according to its predicted importance so that gradients propagate through all tokens; a diagonal attention Denoiser then recovers the perturbed representations. At inference, only the Scorer and a hard top-K selection remain, adding negligible latency. On ten VLM benchmarks, AutoSelect retains 96.5% of full model accuracy while accelerating LLM prefill by 2.85x with only 0.69 ms overhead, and transfers to different VLM backbones without architecture-specific tuning. Code is available at https://github.com/MedHK23/AutoSelect.
Chinese Translation
视觉标记在视觉-语言模型(VLMs)的推理成本中占主导地位,但许多标记携带冗余信息。现有的剪枝方法缓解了这一问题,但通常依赖于注意力大小或相似性评分。我们将视觉标记剪枝重新表述为容量受限的通信:在给定固定预算 K 的情况下,模型必须分配有限的带宽以最大限度地保留视觉信息。我们提出了 AutoSelect,它将一个轻量级的评分器(Scorer)和去噪器(Denoiser)附加到一个冻结的 VLM 上,并仅使用标准的下一个标记预测损失进行训练,而不需要辅助目标或额外注释。在训练过程中,一个保持方差的噪声门根据每个标记的预测重要性调节其信息流,以便梯度能够通过所有标记传播;然后,一个对角注意力去噪器恢复扰动的表示。在推理阶段,仅保留评分器和硬性 top-K 选择,增加的延迟可以忽略不计。在十个 VLM 基准测试中,AutoSelect 保留了 96.5% 的完整模型准确性,同时将 LLM 的预填充速度提高了 2.85 倍,仅增加 0.69 毫秒的开销,并且能够迁移到不同的 VLM 主干而无需特定于架构的调优。代码可在 https://github.com/MedHK23/AutoSelect 获取。
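The gate itself is nearly a one-liner. A minimal sketch, assuming roughly unit-variance token features so that the square-root mixing keeps per-token variance approximately constant; the shapes and the clamping are assumptions.

import torch

def noise_gate(tokens, importance):
    # tokens: (B, N, D) visual tokens; importance: (B, N) predicted scores.
    # Low-importance tokens are blended with Gaussian noise, so gradients
    # still flow through every token while the Scorer is trained.
    g = importance.clamp(0.0, 1.0).unsqueeze(-1)
    noise = torch.randn_like(tokens)
    return torch.sqrt(g) * tokens + torch.sqrt(1.0 - g) * noise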
cs.CV / 80 / 2603.07142

PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection

PDD:用于医学异常检测的流形先验多样性蒸馏
Lu, Xijun, Liu, Hongying, Shang, Fanhua, Hui, Yanming, Wan, Liang
Abstract
Medical image anomaly detection faces unique challenges due to subtle, heterogeneous anomalies embedded in complex anatomical structures. Through systematic Grad-CAM analysis, we reveal that discriminative activation maps fail on medical data, unlike their success on industrial datasets, motivating the need for manifold-level modeling. We propose PDD (Manifold-Prior Diverse Distillation), a framework that unifies dual-teacher priors into a shared high-dimensional manifold and distills this knowledge into dual students with complementary behaviors. Specifically, frozen VMamba-Tiny and wide-ResNet50 encoders provide global contextual and local structural priors, respectively. Their features are unified through a Manifold Matching and Unification (MMU) module, while an Inter-Level Feature Adaption (InA) module enriches intermediate representations. The unified manifold is distilled into two students: one performs layer-wise distillation via InA for local consistency, while the other receives skip-projected representations through a Manifold Prior Affine (MPA) module to capture cross-layer dependencies. A diversity loss prevents representation collapse while maintaining detection sensitivity. Extensive experiments on multiple medical datasets demonstrate that PDD significantly outperforms existing state-of-the-art methods, achieving improvements of up to 11.8%, 5.1%, and 8.5% in AUROC on HeadCT, BrainMRI, and ZhangLab datasets, respectively, and 3.4% in F1 max on the Uni-Medical dataset, establishing new state-of-the-art performance in medical image anomaly detection. The implementation will be released at https://github.com/OxygenLu/PDD
Chinese Translation
医学图像异常检测面临独特的挑战,因为复杂解剖结构中嵌入了微妙且异质的异常。通过系统的Grad-CAM分析,我们揭示了与工业数据集的成功不同,判别激活图在医学数据上表现不佳,这促使我们需要进行流形级建模。我们提出了PDD(流形先验多样性蒸馏),这是一个将双教师先验统一到共享高维流形中并将这些知识蒸馏到具有互补行为的双学生中的框架。具体而言,冻结的VMamba-Tiny和宽ResNet50编码器分别提供全局上下文和局部结构先验。它们的特征通过流形匹配与统一(MMU)模块进行统一,而层间特征适应(InA)模块则丰富了中间表示。统一的流形被蒸馏为两个学生:一个通过InA进行逐层蒸馏以实现局部一致性,而另一个通过流形先验仿射(MPA)模块接收跳跃投影表示以捕捉跨层依赖性。多样性损失防止表示崩溃,同时保持检测灵敏度。在多个医学数据集上的广泛实验表明,PDD显著优于现有的最先进方法,在HeadCT、BrainMRI和ZhangLab数据集上分别实现了高达11.8%、5.1%和8.5%的AUROC提升,以及在Uni-Medical数据集上实现了3.4%的F1最大值,确立了医学图像异常检测的新最先进性能。实现代码将发布在https://github.com/OxygenLu/PDD
cs.CV / 81 / 2603.07144

CanoVerse: 3D Object Scalable Canonicalization and Dataset for Generation and Pose

CanoVerse:用于生成和姿态的3D对象可扩展规范化及数据集
Jin, Li, Yang, Yuchen, Chen, Weikai, Wang, Yujie, Hao, Dehao, Jia, Tanghui, Yin, Yingda, Hu, Zeyu, Zhang, Runze, Luo, Keyang, Yuan, Li, Quan, Long, Wang, Xin, Qin, Xueying
Abstract
3D learning systems implicitly assume that objects occupy a coherent reference frame. Nonetheless, in practice, every asset arrives with an arbitrary global rotation, and models are left to resolve directional ambiguity on their own. This persistent misalignment suppresses pose-consistent generation, and blocks the emergence of stable directional semantics. To address this issue, we construct CanoVerse, a massive canonical 3D dataset of 320K objects over 1,156 categories -- an order-of-magnitude increase over prior work. At this scale, directional semantics become statistically learnable: CanoVerse improves 3D generation stability, enables precise cross-modal 3D shape retrieval, and unlocks zero-shot point-cloud orientation estimation even for out-of-distribution data. This is achieved by a new canonicalization framework that reduces alignment from minutes to seconds per object via compact hypothesis generation and lightweight human discrimination, transforming canonicalization from manual curation into a high-throughput data generation pipeline. The CanoVerse dataset will be publicly released upon acceptance. Project page: https://github.com/123321456-gif/Canoverse
Chinese Translation
3D学习系统隐含地假设对象占据一个一致的参考框架。然而,在实际操作中,每个资产都伴随着任意的全局旋转,模型必须自行解决方向模糊性。这种持续的不对齐抑制了姿态一致的生成,并阻碍了稳定方向语义的出现。为了解决这个问题,我们构建了CanoVerse,一个包含320K个对象、跨越1,156个类别的大规模规范3D数据集——这是对先前工作的数量级提升。在这个规模下,方向语义变得可以统计学习:CanoVerse提高了3D生成的稳定性,使得精确的跨模态3D形状检索成为可能,并解锁了即使对于分布外数据的零样本点云方向估计。这是通过一个新的规范化框架实现的,该框架通过紧凑的假设生成和轻量级的人类区分,将每个对象的对齐时间从分钟减少到秒,将规范化从手动策划转变为高通量数据生成管道。CanoVerse数据集将在接受后公开发布。项目页面:https://github.com/123321456-gif/Canoverse
cs.CV / 82 / 2603.07145

LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models

LiveWorld:在生成视频世界模型中模拟视野外动态
Duan, Zicheng, Xia, Jiatong, Zhang, Zeyu, Zhang, Wenbo, Zhou, Gengze, Gou, Chenhui, He, Yefei, Chen, Feng, Zhang, Xinyu, Liu, Lingqiao
Abstract
Recent generative video world models aim to simulate visual environment evolution, allowing an observer to interactively explore the scene via camera control. However, they implicitly assume that the world only evolves within the observer's field of view. Once an object leaves the observer's view, its state is "frozen" in memory, and revisiting the same region later often fails to reflect events that should have occurred in the meantime. In this work, we identify and formalize this overlooked limitation as the "out-of-sight dynamics" problem, which impedes video world models from representing a continuously evolving world. To address this issue, we propose LiveWorld, a novel framework that extends video world models to support persistent world evolution. Instead of treating the world as static observational memory, LiveWorld models a persistent global state composed of a static 3D background and dynamic entities that continue evolving even when unobserved. To maintain these unseen dynamics, LiveWorld introduces a monitor-based mechanism that autonomously simulates the temporal progression of active entities and synchronizes their evolved states upon revisiting, ensuring spatially coherent rendering. For evaluation, we further introduce LiveBench, a dedicated benchmark for the task of maintaining out-of-sight dynamics. Extensive experiments show that LiveWorld enables persistent event evolution and long-term scene consistency, bridging the gap between existing 2D observation-based memory and true 4D dynamic world simulation. The baseline and benchmark will be publicly available at https://zichengduan.github.io/LiveWorld/index.html.
Chinese Translation
近期的生成视频世界模型旨在模拟视觉环境的演变,使观察者能够通过相机控制互动探索场景。然而,它们隐含地假设世界仅在观察者的视野内演变。一旦一个物体离开观察者的视野,其状态在记忆中被“冻结”,而稍后重新访问同一区域时,往往无法反映在此期间应发生的事件。在本研究中,我们识别并形式化了这一被忽视的限制,称之为“视野外动态”问题,这阻碍了视频世界模型表示一个持续演变的世界。为了解决这一问题,我们提出了LiveWorld,一个新颖的框架,扩展视频世界模型以支持持久的世界演变。LiveWorld并不将世界视为静态的观察记忆,而是建模一个由静态3D背景和动态实体组成的持久全局状态,这些实体即使在未被观察时也会持续演变。为了维持这些未被观察的动态,LiveWorld引入了一种基于监控的机制,能够自主模拟活动实体的时间进程,并在重新访问时同步它们的演变状态,确保空间一致的渲染。为了评估,我们进一步引入了LiveBench,一个专门用于维持视野外动态任务的基准。大量实验表明,LiveWorld能够实现持久事件演变和长期场景一致性,弥合现有基于2D观察的记忆与真实4D动态世界模拟之间的差距。基线和基准将公开发布于 https://zichengduan.github.io/LiveWorld/index.html。
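No implementation details are given above, but the monitor idea is concrete enough to sketch: keep a per-entity clock and advance every active entity to the current world time before rendering a revisited region. A minimal Python illustration, assuming a toy linear motion model (all class and variable names here are hypothetical, not from the paper):

```python
class Entity:
    """Hypothetical dynamic entity with its own simple motion model."""
    def __init__(self, position, velocity):
        self.position, self.velocity = position, velocity
        self.last_update = 0.0  # world time of last state sync

    def advance(self, t):
        # Advance the entity by the time elapsed since its last sync,
        # even if it was out of the observer's view the whole time.
        dt = t - self.last_update
        self.position = tuple(p + v * dt for p, v in zip(self.position, self.velocity))
        self.last_update = t

class PersistentWorld:
    """Sketch of a monitor that keeps unobserved entities evolving."""
    def __init__(self):
        self.entities = {}

    def observe(self, visible_names, t):
        # On revisit, synchronize every requested entity to the current
        # world time before rendering, so out-of-sight motion is reflected.
        for name in visible_names:
            self.entities[name].advance(t)
        return {n: self.entities[n].position for n in visible_names}

world = PersistentWorld()
world.entities["car"] = Entity(position=(0.0, 0.0), velocity=(1.0, 0.0))
print(world.observe(["car"], t=5.0))  # the car has moved 5 units while unseen
```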
cs.CV / 83 / 2603.07163

PromptGate: Client-Adaptive Vision-Language Gating for Open-Set Federated Active Learning

PromptGate 客户端自适应视觉语言门控用于开放集联邦主动学习
Nesturi, Adea, Gaviria, David Dueñas, Zeng, Jiajun, Albarqouni, Shadi
Abstract
Deploying medical AI across resource-constrained institutions demands data-efficient learning pipelines that respect patient privacy. Federated Learning (FL) enables collaborative medical AI without centralising data, yet real-world clinical pools are inherently open-set, containing out-of-distribution (OOD) noise such as imaging artifacts and wrong modalities. Standard Active Learning (AL) query strategies mistake this noise for informative samples, wasting scarce annotation budgets. We propose PromptGate, a dynamic VLM-gated framework for Open-Set Federated AL that purifies unlabeled pools before querying. PromptGate introduces a federated Class-Specific Context Optimization: lightweight, learnable prompt vectors that adapt a frozen BiomedCLIP backbone to local clinical domains and aggregate globally via FedAvg -- without sharing patient data. As new annotations arrive, prompts progressively sharpen the ID/OOD boundary, turning the VLM into a dynamic gatekeeper that is strategy-agnostic: a plug-and-play pre-selection module enhancing any downstream AL strategy. Experiments on distributed dermatology and breast imaging benchmarks show that while static VLM prompting degrades to 50% ID purity, PromptGate maintains >95% purity with 98% OOD recall.
Chinese Translation
在资源受限的机构中部署医疗人工智能需要高效的数据学习流程,同时尊重患者隐私。联邦学习(Federated Learning, FL)使得在不集中数据的情况下进行协作医疗人工智能成为可能,但现实世界的临床数据池本质上是开放集的,包含了分布外(Out-Of-Distribution, OOD)噪声,例如成像伪影和错误的模态。标准的主动学习(Active Learning, AL)查询策略将这些噪声误认为是有信息的样本,从而浪费了稀缺的标注预算。我们提出了PromptGate,一个动态的视觉语言模型(Vision Language Model, VLM)门控框架,用于开放集联邦主动学习,在查询之前净化未标记的数据池。PromptGate引入了一种联邦特定类别上下文优化:轻量级、可学习的提示向量,使得冻结的BiomedCLIP骨干网络能够适应本地临床领域,并通过FedAvg进行全局聚合——而无需共享患者数据。随着新标注的到来,提示逐渐锐化了ID/OOD边界,使得VLM成为一个动态的守门人,具有策略无关性:一个即插即用的预选择模块,增强任何下游的主动学习策略。在分布式皮肤病和乳腺成像基准上的实验表明,尽管静态VLM提示的ID纯度降至50%,但PromptGate保持了超过95%的纯度和98%的OOD召回率。
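As a rough picture of the federated piece described above, here is a minimal sketch of FedAvg over learnable prompt vectors; the tensor shapes and size-weighted averaging are our assumptions, since the abstract only states that prompts (not patient data) are exchanged:

```python
import torch

def fedavg_prompts(client_prompts, client_sizes):
    """Sketch of FedAvg over learnable prompt vectors. Assumption: only the
    lightweight prompts are shared, never patient data or backbone weights."""
    total = sum(client_sizes)
    stacked = torch.stack(client_prompts)               # (num_clients, n_ctx, dim)
    weights = torch.tensor(client_sizes, dtype=stacked.dtype) / total
    return (weights.view(-1, 1, 1) * stacked).sum(dim=0)

# Each client adapts a frozen backbone through its own prompt vectors.
clients = [torch.randn(4, 512) for _ in range(3)]       # 4 context tokens, dim 512
global_prompt = fedavg_prompts(clients, client_sizes=[120, 80, 200])
print(global_prompt.shape)                              # torch.Size([4, 512])
```

Because only a few thousand parameters move across the network per round, this design keeps both communication cost and privacy exposure low relative to exchanging full model weights.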
cs.CV / 84 / 2603.07166

ACD-U: Asymmetric co-teaching with machine unlearning for robust learning with noisy labels

ACD-U:基于机器遗忘的非对称共教学框架以应对带噪标签的鲁棒学习
Fukunaga, Reo, Yoshida, Soh, Muneyasu, Mitsuji
Abstract
Deep neural networks are prone to memorizing incorrect labels during training, which degrades their generalizability. Although recent methods have combined sample selection with semi-supervised learning (SSL) to exploit the memorization effect -- where networks learn from clean data before noisy data -- they cannot correct selection errors once a sample is misclassified. To overcome this, we propose ACD-U, a framework combining asymmetric co-teaching with different architectures (ACD) and machine unlearning (U). ACD-U addresses this limitation through two core mechanisms. First, its asymmetric co-teaching pairs a contrastive language-image pretraining (CLIP)-pretrained vision Transformer with a convolutional neural network (CNN), leveraging their complementary learning behaviors: the pretrained model provides stable predictions, whereas the CNN adapts throughout training. This asymmetry, where the vision Transformer is trained only on clean samples and the CNN is trained through SSL, effectively mitigates confirmation bias. Second, selective unlearning enables post-hoc error correction by identifying incorrectly memorized samples through loss trajectory analysis and CLIP consistency checks, and then removing their influence via Kullback-Leibler divergence-based forgetting. This approach shifts the learning paradigm from passive error avoidance to active error correction. Experiments on synthetic and real-world noisy datasets, including CIFAR-10/100, CIFAR-N, WebVision, Clothing1M, and Red Mini-ImageNet, demonstrate state-of-the-art performance, particularly in high-noise regimes and under instance-dependent noise. The code is publicly available at https://github.com/meruemon/ACD-U.
Chinese Translation
深度神经网络在训练过程中容易记忆错误标签,这会降低其泛化能力。尽管最近的方法结合了样本选择与半监督学习(SSL),以利用记忆效应——网络在学习带噪数据之前先学习干净数据——但一旦样本被错误分类,它们无法纠正选择错误。为了解决这一问题,我们提出了基于不同架构的非对称共教学框架(ACD-U),该框架使用不同的模型架构并结合机器遗忘。ACD-U通过两个核心机制解决了这一局限性。首先,其非对称共教学将对比语言-图像预训练(CLIP)预训练的视觉变换器与卷积神经网络(CNN)配对,利用它们互补的学习行为:预训练模型提供稳定的预测,而CNN在训练过程中不断适应。这种不对称性使得视觉变换器仅在干净样本上训练,而CNN通过SSL进行训练,有效减轻了确认偏差。其次,选择性遗忘通过损失轨迹分析和CLIP一致性检查识别错误记忆的样本,从而实现事后错误修正,并通过基于Kullback-Leibler散度的遗忘机制去除其影响。这种方法将学习范式从被动的错误避免转变为主动的错误修正。在合成和真实世界的带噪数据集上的实验,包括CIFAR-10/100、CIFAR-N、WebVision、Clothing1M和Red Mini-ImageNet,展示了最先进的性能,特别是在高噪声环境和实例依赖噪声下。代码已公开发布在 https://github.com/meruemon/ACD-U。
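The abstract names Kullback-Leibler divergence-based forgetting without giving the exact target distribution; one common instantiation, shown below as an assumption rather than the paper's formula, pushes predictions on flagged samples toward the uniform distribution:

```python
import torch
import torch.nn.functional as F

def kl_forgetting_loss(logits):
    """Sketch of KL-divergence-based forgetting (our reading of the abstract):
    push predictions on identified mis-memorized samples toward the uniform
    distribution, removing the influence of the wrong label."""
    log_probs = F.log_softmax(logits, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / logits.size(-1))
    # F.kl_div(input, target) computes KL(target || input-dist) when input
    # holds log-probabilities; the uniform target maximizes entropy.
    return F.kl_div(log_probs, uniform, reduction="batchmean")

logits = torch.randn(8, 10, requires_grad=True)  # 8 flagged samples, 10 classes
loss = kl_forgetting_loss(logits)
loss.backward()
```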
cs.CV / 85 / 2603.07170

Class Visualizations and Activation Atlases for Enhancing Interpretability in Deep Learning-Based Computational Pathology

类可视化和激活图谱在增强基于深度学习的计算病理学可解释性中的应用
Gustav, Marco, Wolf, Fabian, Glasner, Christina, Reitsam, Nic G., Schulz, Stefan, Aschenbroich, Kira, Märkl, Bruno, Foersch, Sebastian, Kather, Jakob Nikolas
Abstract
The rapid adoption of transformer-based models in computational pathology has enabled prediction of molecular and clinical biomarkers from H&E whole-slide images, yet interpretability has not kept pace with model complexity. While attribution- and generative-based methods are common, feature visualization approaches such as class visualizations (CVs) and activation atlases (AAs) have not been systematically evaluated for these models. We developed a visualization framework and assessed CVs and AAs for a transformer-based foundation model across tissue and multi-organ cancer classification tasks with increasing label granularity. Four pathologists annotated real and generated images to quantify inter-observer agreement, complemented by attribution and similarity metrics. CVs preserved recognizability for morphologically distinct tissues but showed reduced separability for overlapping cancer subclasses. In tissue classification, agreement decreased from Fleiss κ = 0.75 (scans) to κ = 0.31 (CVs), with similar trends in cancer subclass tasks. AAs revealed layer-dependent organization: coarse tissue-level concepts formed coherent regions, whereas finer subclasses exhibited dispersion and overlap. Agreement was moderate for tissue classification (κ = 0.58), high for coarse cancer groupings (κ = 0.82), and low at subclass level (κ = 0.11). Atlas separability closely tracked expert agreement on real images, indicating that representational ambiguity reflects intrinsic pathological complexity. Attribution-based metrics approximated expert variability in low-complexity settings, whereas perceptual and distributional metrics showed limited alignment. Overall, concept-level feature visualization reveals structured morphological manifolds in transformer-based pathology models and provides a framework for expert-centered interrogation of learned representations across label granularities.
Chinese Translation
基于变换器模型在计算病理学中的快速应用使得能够从H&E全切片图像中预测分子和临床生物标志物,但可解释性并未与模型复杂性同步提升。尽管归因和生成方法较为常见,但类可视化(Class Visualizations, CVs)和激活图谱(Activation Atlases, AAs)等特征可视化方法尚未对这些模型进行系统评估。我们开发了一个可视化框架,并评估了基于变换器的基础模型在组织和多脏器癌症分类任务中的CVs和AAs,任务的标签粒度逐渐增加。四位病理学家对真实和生成的图像进行了标注,以量化观察者间的一致性,并辅以归因和相似性指标。CVs在形态上明显不同的组织中保持了可识别性,但在重叠的癌症子类中显示出分离性降低。在组织分类中,一致性从Fleiss k = 0.75(扫描)下降到k = 0.31(CVs),癌症子类任务中也出现了类似趋势。AAs揭示了层依赖的组织:粗略的组织级概念形成了连贯的区域,而更细的子类则表现出分散和重叠。在组织分类中,一致性为中等(k = 0.58),粗略癌症分组的一致性较高(k = 0.82),而在子类级别则较低(k = 0.11)。图谱的可分离性与专家在真实图像上的一致性密切相关,表明表征模糊性反映了内在的病理复杂性。在低复杂度环境中,基于归因的指标近似于专家的变异性,而感知和分布指标则显示出有限的一致性。总体而言,概念级特征可视化揭示了基于变换器的病理模型中的结构化形态流形,并提供了一个以专家为中心的框架,用于在不同标签粒度下对学习到的表征进行审查。
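Since the evaluation leans on Fleiss' κ, a worked implementation may help readers interpret the reported values; this is the standard formula, not code from the paper:

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-observer agreement. `ratings` is an
    (n_items, n_categories) matrix of counts: ratings[i, j] = number of
    raters who assigned item i to category j (same rater count per item)."""
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()
    p_j = ratings.sum(axis=0) / (n_items * n_raters)        # category prevalence
    p_i = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()               # observed vs. chance
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 raters labeling 5 images into 3 tissue classes.
counts = np.array([[4, 0, 0], [3, 1, 0], [2, 2, 0], [0, 4, 0], [0, 1, 3]])
print(round(fleiss_kappa(counts), 3))  # 0.458, i.e. moderate agreement
```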
cs.CV / 86 / 2603.07181

FreeFly-Thinking: Aligning Chain-of-Thought Reasoning with Continuous UAV Navigation

自由飞行思维:将链式思维推理与连续无人机导航对齐
Zhou, Jiaxu, Wang, Shaobo, Yang, Zhiyuan, Yu, Zhenjun, Li, Tao
Abstract
Vision-Language Navigation (VLN) aims to enable agents to understand natural language instructions and carry out appropriate navigation actions in real-world environments. Most work focuses on indoor settings, with little research in complex outdoor scenes. Current UAV Vision-and-Language Navigation models typically act as black boxes without explicit reasoning. We introduce FreeFly-Thinking, an end-to-end VLN framework that converts the UAV agent's egocentric images and language instructions into a series of actions, inspired by the urban-architecture environment proposed by OpenFly. We first construct a UAV dataset for the navigation task and then perform natural-language chain-of-thought reasoning over it. We adopt a two-stage training strategy: supervised fine-tuning followed by reinforcement fine-tuning. Experiments on unseen test scenarios demonstrate strong performance, showing robustness and efficiency on the UAV navigation task.
Chinese Translation
视觉-语言导航旨在使智能体能够理解自然语言指令,并在现实环境中执行适当的导航动作。大多数研究集中在室内环境,而在复杂的户外场景中研究较少。目前的无人机视觉与语言导航模型通常作为黑箱运行,缺乏明确的推理过程。我们提出了FreeFly-thinking,一个端到端的视觉-语言导航框架,该框架将无人机智能体的自我中心图像和语言指令转换为一系列动作,灵感来源于OpenFly提出的城市建筑环境。我们首先构建了一个用于导航任务的无人机数据集,然后进行自然语言的链式思维推理。我们采用了两阶段的训练策略:监督微调和强化微调。在未见测试上的实验结果显示出强大的性能,展现了无人机导航问题的鲁棒性和效率。
cs.CV / 87 / 2603.07192

FastSTAR: Spatiotemporal Token Pruning for Efficient Autoregressive Video Synthesis

FastSTAR:用于高效自回归视频合成的时空令牌剪枝
Yune, Sungwoong, Jeong, Suheon, Kim, Joo-Young
Abstract
Visual Autoregressive modeling (VAR) has emerged as a highly efficient alternative to diffusion-based frameworks, achieving comparable synthesis quality. However, as this paradigm extends to Spacetime Autoregressive modeling (STAR) for video generation, scaling resolution and frame counts leads to a "token explosion" that creates a massive computational bottleneck in the final refinement stages. To address this, we propose FastSTAR, a training-free acceleration framework designed for high-quality video generation. Our core method, Spatiotemporal Token Pruning, identifies essential tokens by integrating two specialized terms: (1) Spatial similarity, which evaluates structural convergence across hierarchical scales to skip computations in regions where further refinement becomes redundant, and (2) Temporal similarity, which identifies active motion trajectories by assessing feature-level variations relative to the preceding clip. Combined with a Partial Update mechanism, FastSTAR ensures that only non-converged regions are refined, maintaining fluid motion while bypassing redundant computations. Experimental results on InfinityStar demonstrate that FastSTAR achieves up to a 2.01x speedup with a PSNR of 28.29 and less than 1% performance degradation, proving a superior efficiency-quality trade-off for STAR-based video synthesis.
Chinese Translation
视觉自回归建模(VAR)已成为一种高效的替代扩散框架的方法,能够实现可比的合成质量。然而,随着这一范式扩展到时空自回归建模(STAR)以进行视频生成,分辨率和帧数的增加导致了“令牌爆炸”,在最终精炼阶段造成了巨大的计算瓶颈。为了解决这个问题,我们提出了FastSTAR,一种无需训练的加速框架,旨在实现高质量的视频生成。我们的核心方法,时空令牌剪枝,通过整合两个专门的术语来识别重要令牌:(1)空间相似性,评估在层次尺度上结构收敛,以跳过在进一步精炼变得多余的区域的计算;(2)时间相似性,通过评估相对于前一剪辑的特征级变化来识别活跃的运动轨迹。结合部分更新机制,FastSTAR确保仅对未收敛区域进行精炼,保持流畅的运动,同时避免冗余计算。在InfinityStar上的实验结果表明,FastSTAR实现了高达2.01倍的加速,PSNR为28.29,性能下降不到1%,证明了STAR基础的视频合成在效率与质量之间的优越权衡。
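A minimal sketch of the pruning decision described above, combining a cross-scale (spatial) and a cross-clip (temporal) similarity test; the cosine scores and thresholds are illustrative assumptions, not FastSTAR's exact criterion:

```python
import torch
import torch.nn.functional as F

def prune_mask(prev_scale, curr_scale, prev_clip, tau_s=0.95, tau_t=0.9):
    """Sketch of spatiotemporal token pruning. Tokens whose features have
    converged across scales AND barely changed versus the previous clip are
    skipped in later refinement stages (tau_s / tau_t are assumed values)."""
    spatial_sim = F.cosine_similarity(prev_scale, curr_scale, dim=-1)  # convergence
    temporal_sim = F.cosine_similarity(prev_clip, curr_scale, dim=-1)  # motion
    converged = spatial_sim > tau_s
    static = temporal_sim > tau_t
    return ~(converged & static)          # True = token still needs refinement

tokens = lambda: torch.randn(2, 256, 64)  # (batch, tokens, dim)
keep = prune_mask(tokens(), tokens(), tokens())
print(keep.float().mean())                # fraction of tokens kept active
```

Under a partial-update scheme, only the `True` positions of this mask would be recomputed, which is where the reported speedup would come from.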
cs.CV / 88 / 2603.07222

VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization

VINO:通过结构先验引导去上下文化的非上下文对象的视频驱动不变性
Yeom, Seul-Ki, Simon, Marcel, Lee, Eunbin, Kim, Tae-Ho
Abstract
Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts such as background textures and co-occurrence statistics. While video provides rich temporal variation, dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, encouraging representations to collapse into scene encoders. To address this, we propose VINO (Video-driven Invariance for Non-Contextual Objects), a teacher-student framework that learns robust image encoders from dense video by imposing a structural information bottleneck. Using a class-agnostic structural prior solely to generate views -- not as semantic pseudo-labels -- VINO forms an asymmetric distillation problem. The teacher predicts from a foreground-union view with the background suppressed, while the student observes object-conditioned scene views that retain surrounding context but remove competing instances. Matching these targets via masked distillation makes background cues unreliable, pushing the representation toward object-centric invariances. We further enforce temporal object permanence via teacher-anchored cross-time distillation over track-matched objects, and stabilize part-to-whole consistency with mask-guided local views. Through attention visualization and unsupervised object discovery on PASCAL VOC, we demonstrate that VINO effectively disentangles foreground from background. Pretrained on the dense Walking Tours Venice video, VINO achieves 34.8 CorLoc, yielding highly focused, shape-biased representations that substantially outperform prior dense-video and motion-guided SSL baselines.
Chinese Translation
自监督学习(SSL)取得了快速进展,但学习到的特征往往过度依赖于上下文捷径——背景纹理和共现统计。尽管视频提供了丰富的时间变化,但强烈自我运动下的密集野外流创建了一个共现陷阱:前景对象和背景上下文一致移动,促使表示崩溃为场景编码器。为了解决这个问题,我们提出了VINO(视频驱动的非上下文对象不变性),这是一个教师-学生框架,通过施加结构信息瓶颈,从密集视频中学习鲁棒的图像编码器。VINO仅使用与类别无关的结构先验来生成视图——而不是作为语义伪标签——形成一个不对称的蒸馏问题。教师从前景-联合视图中进行预测,同时抑制背景,而学生观察保留周围上下文但去除竞争实例的对象条件场景视图。通过掩蔽蒸馏匹配这些目标,使背景线索变得不可靠,推动表示朝向以对象为中心的不变性。我们进一步通过教师锚定的跨时间蒸馏在轨迹匹配对象上强制执行时间对象持久性,并通过掩蔽引导的局部视图稳定部分与整体的一致性。通过注意力可视化和在PASCAL VOC上的无监督对象发现,我们证明了VINO有效地将前景与背景解耦。在密集的威尼斯步行游视频上进行预训练后,VINO达到了34.8的CorLoc,产生了高度聚焦、形状偏向的表示,显著优于先前的密集视频和运动引导的SSL基线。
cs.CV / 89 / 2603.07234

Single Image Super-Resolution via Bivariate À Trous Wavelet Diffusion

通过双变量A Trous小波扩散实现单幅图像超分辨率
Heidari, Maryam, Anantrasirichai, Nantheera, Achim, Alin
Abstract
The effectiveness of super-resolution (SR) models hinges on their ability to recover high-frequency structure without introducing artifacts. Diffusion-based approaches have recently advanced the state of the art in SR. However, most diffusion-based SR pipelines operate purely in the spatial domain, which may yield high-frequency details that are not well supported by the underlying low-resolution evidence. On the other hand, unlike supervised SR models that may inject dataset-specific textures, single-image SR relies primarily on internal image statistics and can therefore be less prone to dataset-driven hallucinations; nevertheless, ambiguity in the LR observation can still lead to inconsistent high-frequency details. To tackle this problem, we introduce BATDiff, an unsupervised Bivariate À Trous Wavelet Diffusion model designed to provide structured cross-scale guidance during the generative process. BATDiff employs an à trous wavelet transform that constructs an undecimated multiscale representation in which high-frequency components are progressively revealed while the full spatial resolution is preserved. As the core inference mechanism, BATDiff includes a bivariate cross-scale module that models parent-child dependencies between adjacent scales. It improves high-frequency coherence and reduces mismatch artifacts in diffusion-based SR. Experiments on standard benchmarks demonstrate that BATDiff produces sharper and more structurally consistent reconstructions than existing diffusion and non-diffusion baselines, achieving improvements in fidelity and perceptual quality.
Chinese Translation
超分辨率(SR)模型的有效性取决于其恢复高频结构而不引入伪影的能力。基于扩散的方法最近在SR领域取得了突破。然而,大多数基于扩散的SR流程纯粹在空间域中操作,这可能导致高频细节未能得到底层低分辨率证据的良好支持。另一方面,与可能注入特定数据集纹理的监督SR模型不同,单幅图像SR主要依赖于内部图像统计,因此更不容易受到数据集驱动的幻觉影响;然而,低分辨率观察中的模糊性仍然可能导致不一致的高频细节。为了解决这个问题,我们提出了BATDiff,一种无监督的双变量A Trous小波扩散模型,旨在在生成过程中提供结构化的跨尺度指导。BATDiff采用A Trous小波变换,构建了一个未下采样的多尺度表示,其中高频分量逐步显现,同时保持完整的空间分辨率。作为核心推理机制,BATDiff包括一个双变量跨尺度模块,建模相邻尺度之间的父子依赖关系。它提高了高频一致性,并减少了基于扩散的SR中的不匹配伪影。在标准基准测试上的实验表明,BATDiff生成的重建比现有的扩散和非扩散基线更加清晰且结构一致,在保真度和感知质量上均取得了改善。
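The à trous transform at the heart of BATDiff is a classical construction, so a small reference implementation is easy to give; this sketch uses the standard B3-spline kernel and is our illustration, not the paper's code:

```python
import numpy as np
from scipy.ndimage import convolve1d

def a_trous_decompose(image, levels=3):
    """Undecimated (a trous) wavelet decomposition: at each level the B3-spline
    kernel is dilated by inserting zeros ("holes"), so every band keeps the
    full spatial resolution. Returns detail bands plus the smooth residual."""
    kernel = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0
    bands, smooth = [], image.astype(float)
    for level in range(levels):
        dilated = np.zeros(4 * (2 ** level) + 1)
        dilated[:: 2 ** level] = kernel        # 2^level - 1 zeros between taps
        next_smooth = convolve1d(convolve1d(smooth, dilated, axis=0, mode="reflect"),
                                 dilated, axis=1, mode="reflect")
        bands.append(smooth - next_smooth)     # high-frequency detail at this scale
        smooth = next_smooth
    return bands, smooth

img = np.random.rand(64, 64)
details, residual = a_trous_decompose(img)
print(np.allclose(sum(details) + residual, img))  # True: exact reconstruction
```

Because no subsampling occurs, every detail band stays pixel-aligned with the input, which is what makes cross-scale (parent-child) conditioning straightforward.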
cs.CV / 90 / 2603.07236

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

HY-WU(第一部分):一个可扩展的功能性神经记忆框架及其在文本引导图像编辑中的实例化
Tencent HY Team
Abstract
Foundation models are transitioning from offline predictors to deployed systems expected to operate over long time horizons. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks appear after the model has shipped. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static weight paradigm: after training (or after any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, distinct objectives can induce separated feasible regions over parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are often implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on-the-fly from the instance condition, yielding instance-specific operators without test-time optimization.
Chinese Translation
基础模型正从离线预测器转变为期望在长时间范围内运行的部署系统。在实际部署中,目标并不是固定的:领域漂移、用户偏好演变,以及在模型发布后出现的新任务。这使得持续学习和即时个性化从可选特性提升为核心架构要求。然而,大多数适应管道仍然遵循静态权重范式:在训练(或任何适应步骤)后,推理执行单一的参数向量,而不考虑用户意图、领域或实例特定约束。这将训练或适应后的模型视为参数空间中的一个单一点。在异质且不断演变的环境中,不同的目标可能会在参数上诱导出分离的可行区域,迫使任何单一共享更新陷入妥协、干扰或过度专业化。因此,持续学习和个性化通常被实现为对共享权重的重复覆盖,风险在于之前学习的行为可能会退化。我们提出了HY-WU(权重释放),一个以记忆为先的适应框架,它将适应压力从覆盖单一共享参数点转移开。HY-WU将功能性(操作级)记忆实现为一个神经模块:一个生成器,它根据实例条件实时合成权重更新,从而产生实例特定的操作,而无需在测试时进行优化。
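To make the generator idea concrete, here is a minimal hypernetwork sketch that synthesizes a per-instance weight update for a frozen linear layer; the low-rank parameterization is our assumption for compactness, not a detail confirmed by the abstract:

```python
import torch
import torch.nn as nn

class DeltaGenerator(nn.Module):
    """Sketch of a functional memory: a hypernetwork that synthesizes a
    weight update for a frozen linear layer from the instance condition,
    so adaptation never overwrites the shared parameters."""
    def __init__(self, cond_dim, d_out, d_in, rank=4):
        super().__init__()
        self.rank, self.d_out, self.d_in = rank, d_out, d_in
        self.to_a = nn.Linear(cond_dim, d_out * rank)   # left low-rank factor
        self.to_b = nn.Linear(cond_dim, rank * d_in)    # right low-rank factor

    def forward(self, cond):                       # cond: (batch, cond_dim)
        a = self.to_a(cond).view(-1, self.d_out, self.rank)
        b = self.to_b(cond).view(-1, self.rank, self.d_in)
        return a @ b                               # per-instance delta-W

frozen = nn.Linear(32, 16)
for p in frozen.parameters():
    p.requires_grad_(False)                        # shared weights stay fixed
gen = DeltaGenerator(cond_dim=8, d_out=16, d_in=32)
x, cond = torch.randn(4, 32), torch.randn(4, 8)
delta_w = gen(cond)                                # (4, 16, 32): one operator each
y = frozen(x) + torch.einsum("boi,bi->bo", delta_w, x)
print(y.shape)                                     # torch.Size([4, 16])
```

No test-time optimization happens here: the operator is produced by a single forward pass through the generator, matching the on-the-fly synthesis the abstract describes.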
cs.CV / 91 / 2603.07240

FabricGen: Microstructure-Aware Woven Fabric Generation

FabricGen:微观结构感知的织物生成
Tang, Yingjie, Luo, Di, Wang, Zixiong, Ling, Xiaoli, Yang, Jian, Wang, Beibei
Abstract
Woven fabric materials are widely used in rendering applications, yet designing realistic examples typically involves multiple stages, requiring expertise in weaving principles and texture authoring. Recent advances have explored diffusion models to streamline this process; however, pre-trained diffusion models often struggle to generate intricate yarn-level details that conform to weaving rules. To address this, we present FabricGen, an end-to-end framework for generating high-quality woven fabric materials from textual descriptions. A key insight of our method is the decomposition of macro-scale textures and micro-scale weaving patterns. To generate macro-scale textures free from microstructures, we fine-tune pre-trained diffusion models on a collected dataset of microstructure-free fabrics. As for micro-scale weaving patterns, we develop an enhanced procedural geometric model capable of synthesizing natural yarn-level geometry with yarn sliding and flyaway fibers. The procedural model is driven by a specialized large language model, WeavingLLM, which is fine-tuned on an annotated dataset of formatted weaving drafts, and prompt-tuned with domain-specific fabric expertise. Through fine-tuning and prompt tuning, WeavingLLM learns to design weaving drafts and fabric parameters from textual prompts, enabling the procedural model to produce diverse weaving patterns that adhere to weaving principles. The generated macro-scale texture, along with the micro-scale geometry, can be used for fabric rendering. Consequently, our framework produces materials with significantly richer detail and realism compared to prior generative models.
Chinese Translation
织物材料在渲染应用中被广泛使用,但设计真实的示例通常涉及多个阶段,需要对织造原理和纹理创作有深入的了解。近期的研究探索了扩散模型以简化这一过程;然而,预训练的扩散模型往往难以生成符合织造规则的复杂纱线级细节。为了解决这个问题,我们提出了FabricGen,一个从文本描述生成高质量织物材料的端到端框架。我们方法的一个关键见解是将宏观纹理和微观织造模式进行分解。为了生成不含微观结构的宏观纹理,我们在收集的无微观结构织物数据集上对预训练的扩散模型进行了微调。至于微观织造模式,我们开发了一种增强的程序几何模型,能够合成具有纱线滑动和飞扬纤维的自然纱线级几何形状。该程序模型由一个专门的大型语言模型WeavingLLM驱动,该模型在标注的格式化织造草图数据集上进行了微调,并通过领域特定的织物专业知识进行了提示调优。通过微调和提示调优,WeavingLLM学习从文本提示中设计织造草图和织物参数,使程序模型能够生成遵循织造原则的多样化织造模式。生成的宏观纹理以及微观几何形状可用于织物渲染。因此,我们的框架生成的材料在细节和真实感上显著优于先前的生成模型。
cs.CV / 92 / 2603.07244

PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation

PresentBench:一种基于细粒度评分标准的幻灯片生成基准
Chen, Xin-Sheng, Zhu, Jiayu, Li, Pei-lin, Wang, Hanzheng, Yang, Shuojin, Guo, Meng-Hao
Abstract
Slides serve as a critical medium for conveying information in presentation-oriented scenarios such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, making it difficult to accurately assess model capabilities or track meaningful advances in the field. In practice, the lack of fine-grained, verifiable evaluation criteria poses a critical bottleneck for both research and real-world deployment. In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated real-world slide generation. It contains 238 evaluation instances, each supplemented with background materials required for slide creation. Moreover, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks. Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods, and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.
Chinese Translation
幻灯片在学术、教育和商业等以展示为导向的场景中,作为传递信息的重要媒介,发挥着关键作用。尽管其重要性不言而喻,创建高质量的幻灯片仍然是一项耗时且认知负担较重的任务。最近在生成模型方面的进展,例如 Nano Banana Pro,使得自动化幻灯片生成变得越来越可行。然而,现有的幻灯片生成评估往往较为粗糙,依赖整体判断,难以准确评估模型能力或跟踪该领域的有意义进展。在实践中,缺乏细粒度、可验证的评估标准对研究和实际应用构成了重大瓶颈。本文提出了 PresentBench,一种基于细粒度评分标准的基准,用于评估自动化的现实世界幻灯片生成。该基准包含238个评估实例,每个实例都附有幻灯片创建所需的背景材料。此外,我们为每个实例手动设计了平均54.1个检查项,每个检查项都以二元问题的形式提出,以实现对生成幻灯片的细粒度、特定实例的评估。大量实验表明,PresentBench提供的评估结果比现有方法更为可靠,并且与人类偏好的对齐程度显著更强。此外,我们的基准揭示了 NotebookLM 在幻灯片生成方法中显著优于其他方法,突显了该领域近期的重大进展。
cs.CV / 93 / 2603.07246

LEPA: Learning Geometric Equivariance in Satellite Remote Sensing Data with a Predictive Architecture

LEPA:在卫星遥感数据中学习几何等变性的一种预测架构
Scheurer, Erik, Sedona, Rocco, Kesselheim, Stefan, Cavallaro, Gabriele
Abstract
Geospatial foundation models provide precomputed embeddings that serve as compact feature vectors for large-scale satellite remote sensing data. While these embeddings can reduce data-transfer bottlenecks and computational costs, Earth observation (EO) applications can still face geometric mismatches between user-defined areas of interest and the fixed precomputed embedding grid. Standard latent-space interpolation is unreliable in this setting because the embedding manifold is highly non-convex, yielding representations that do not correspond to realistic inputs. We verify this using Prithvi-EO-2.0 to understand the shortcomings of interpolation applied to patch embeddings. As a substitute, we propose a Learned Equivariance-Predicting Architecture (LEPA). Instead of averaging vectors, LEPA conditions a predictor on geometric augmentations to directly predict the transformed embedding. We evaluate LEPA on NASA/USGS Harmonized Landsat-Sentinel (HLS) imagery and ImageNet-1k. Experiments show that standard interpolation achieves a mean reciprocal rank (MRR) below 0.2, whereas LEPA increases MRR to over 0.8, enabling accurate geometric adjustment without re-encoding.
Chinese Translation
地理空间基础模型提供了预计算的嵌入,这些嵌入作为大规模卫星遥感数据的紧凑特征向量。尽管这些嵌入可以减少数据传输瓶颈和计算成本,但地球观测(EO)应用仍然可能面临用户定义的兴趣区域与固定预计算嵌入网格之间的几何不匹配。在这种情况下,标准的潜在空间插值是不可靠的,因为嵌入流形高度非凸,导致的表示与现实输入不对应。我们使用 Prithvi-EO-2.0 验证了这一点,以理解应用于补丁嵌入的插值的局限性。作为替代,我们提出了一种学习等变性预测架构(LEPA)。LEPA 不是简单地对向量进行平均,而是将预测器条件化于几何增强,以直接预测变换后的嵌入。我们在 NASA/USGS 协调的 Landsat-Sentinel(HLS)影像和 ImageNet-1k 上评估了 LEPA。实验表明,标准插值的平均倒数排名(MRR)低于 0.2,而 LEPA 将 MRR 提高到超过 0.8,从而实现准确的几何调整而无需重新编码。
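Since the results are reported in mean reciprocal rank, a short reference implementation of MRR (the standard metric, not the paper's code) may help:

```python
import numpy as np

def mean_reciprocal_rank(similarity, targets):
    """MRR for embedding retrieval: similarity[i, j] scores query i against
    candidate j; targets[i] is the index of the correct candidate."""
    order = np.argsort(-similarity, axis=1)                    # best match first
    ranks = np.argmax(order == np.asarray(targets)[:, None], axis=1) + 1
    return float(np.mean(1.0 / ranks))

sim = np.array([[0.9, 0.1, 0.3],    # query 0: correct item ranked 1st -> RR = 1
                [0.2, 0.5, 0.8]])   # query 1: correct item ranked 2nd -> RR = 1/2
print(mean_reciprocal_rank(sim, targets=[0, 1]))  # 0.75
```

An MRR below 0.2 means the true embedding is typically ranked fifth or worse, which is what makes the reported jump to above 0.8 meaningful.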
cs.CV / 94 / 2603.07276

Variational Flow Maps: Make Some Noise for One-Step Conditional Generation

变分流图:为一步条件生成制造一些噪声
Mammadov, Abbas, Takao, So, Chen, Bohan, Baptista, Ricardo, Mardani, Morteza, Teh, Yee Whye, Berner, Julius
Abstract
Flow maps enable high-quality image generation in a single forward pass. However, unlike iterative diffusion models, their lack of an explicit sampling trajectory impedes incorporating external constraints for conditional generation and solving inverse problems. We put forth Variational Flow Maps, a framework for conditional sampling that shifts the perspective of conditioning from "guiding a sampling path" to "learning the proper initial noise". Specifically, given an observation, we seek to learn a noise adapter model that outputs a noise distribution, so that after mapping to the data space via flow map, the samples respect the observation and data prior. To this end, we develop a principled variational objective that jointly trains the noise adapter and the flow map, improving noise-data alignment, such that sampling from complex data posterior is achieved with a simple adapter. Experiments on various inverse problems show that VFMs produce well-calibrated conditional samples in a single (or few) steps. For ImageNet, VFM attains competitive fidelity while accelerating the sampling by orders of magnitude compared to alternative iterative diffusion/flow models. Code is available at https://github.com/abbasmammadov/VFM
Chinese Translation
流图能够在一次前向传递中实现高质量的图像生成。然而,与迭代扩散模型不同,它们缺乏明确的采样轨迹,这阻碍了将外部约束纳入条件生成和解决逆问题。我们提出了变分流图(Variational Flow Maps),这是一个条件采样框架,转变了条件化的视角,从“引导采样路径”转向“学习适当的初始噪声”。具体而言,给定一个观察值,我们旨在学习一个噪声适配器模型,该模型输出一个噪声分布,以便在通过流图映射到数据空间后,样本能够遵循观察值和数据先验。为此,我们开发了一个原则性的变分目标,联合训练噪声适配器和流图,改善噪声与数据的对齐,从而通过简单的适配器实现从复杂数据后验的采样。在各种逆问题上的实验表明,变分流图在一次(或少数几次)步骤中生成了良好校准的条件样本。对于ImageNet,变分流图在加速采样方面相比于其他迭代扩散/流模型达到了竞争性的保真度,并且速度提升了几个数量级。代码可在 https://github.com/abbasmammadov/VFM 获取。
cs.CV / 95 / 2603.07291

Virtual Try-On for Cultural Clothing: A Benchmarking Study

文化服装的虚拟试穿:基准研究
Islam, Muhammad Tausif Ul, Awlad, Shahir, Adib, Sameen Yeaser, Rahman, Md. Atiqur, Ahmed, Sabbir, Kabir, Md. Hasanul
Abstract
Although existing virtual try-on systems have made significant progress with the advent of diffusion models, the current benchmarks of these models are based on datasets dominated by western-style clothing and female models, limiting their ability to generalize to culturally diverse clothing styles. In this work, we introduce BD-VITON, a virtual try-on dataset focused on Bangladeshi garments, including saree, panjabi, and salwar kameez, covering both male and female categories. These garments present unique structural challenges such as complex draping, asymmetric layering, and high deformation complexity, which are underrepresented in the original VITON dataset. To establish strong baselines, we retrain and evaluate try-on models, namely StableViton, HR-VITON, and VITON-HD, on our dataset. Our experiments demonstrate consistent improvements in both quantitative and qualitative analysis compared to zero-shot inference.
Chinese Translation
尽管现有的虚拟试穿系统在扩散模型的出现下取得了显著进展,但这些模型的当前基准基于以西方风格服装和女性模型为主的数据库,限制了它们对文化多样化服装风格的泛化能力。在本研究中,我们引入了BD-VITON,一个专注于孟加拉国服装的虚拟试穿数据集,包括纱丽(saree)、潘贾比(panjabi)和沙尔瓦尔·卡米兹(salwar kameez),涵盖了男性和女性类别。这些服装呈现出独特的结构挑战,如复杂的垂坠、非对称的层叠和高变形复杂性,这些在原始VITON数据集中表现不足。为了建立强有力的基准,我们在我们的数据集上重新训练和评估了试穿模型,即StableViton、HR-VITON和VITON-HD。我们的实验表明,与零样本推断相比,在定量和定性分析方面均有持续的改进。
cs.CV / 96 / 2603.07294

MAviS: A Multimodal Conversational Assistant For Avian Species

MAviS:一种针对鸟类物种的多模态对话助手
Kryklyvets, Yevheniia, Kurpath, Mohammed Irfan, Mullappilly, Sahal Shaji, Zhou, Jinxing, Khan, Fahad Shahbaz, Anwer, Rao, Khan, Salman, Cholakkal, Hisham
Abstract
Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question-answer pairs. Building on the MAviS-Dataset, we introduce MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present MAviS-Bench, a benchmark of over 25,000 QA pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive multimodal LLMs for ecological applications.
Chinese Translation
细粒度理解和物种特定的多模态问答对于推动生物多样性保护和生态监测至关重要。然而,现有的多模态大型语言模型在处理鸟类等专业主题时面临挑战,使得在这些领域提供准确且具有上下文相关性的信息变得更加困难。为了解决这一局限性,我们引入了MAviS-Dataset,这是一个大规模的多模态鸟类物种数据集,整合了超过1,000种鸟类的图像、音频和文本模态,包含了丰富的结构化问答对的预训练和指令调优子集。基于MAviS-Dataset,我们推出了MAviS-Chat,这是一种支持音频、视觉和文本的多模态大型语言模型,旨在实现细粒度的物种理解、多模态问答和场景特定描述生成。最后,为了进行定量评估,我们提出了MAviS-Bench,这是一个包含超过25,000个问答对的基准,旨在评估鸟类物种特定的感知和推理能力。实验结果表明,MAviS-Chat在性能上大幅超越基线模型MiniCPM-o-2.6,达到了最先进的开源结果,并展示了我们指令调优的MAviS-Dataset的有效性。我们的研究结果强调了针对生态应用的领域自适应多模态大型语言模型的必要性。
cs.CV / 97 / 2603.07302

Training for Trustworthy Saliency Maps: Adversarial Training Meets Feature-Map Smoothing

可信赖显著性图的训练:对抗训练与特征图平滑的结合
Bhusal, Dipkamal, Alam, Md Tanvirul, Rastogi, Nidhi
Abstract
Gradient-based saliency methods such as Vanilla Gradient (VG) and Integrated Gradients (IG) are widely used to explain image classifiers, yet the resulting maps are often noisy and unstable, limiting their usefulness in high-stakes settings. Most prior work improves explanations by modifying the attribution algorithm, leaving open how the training procedure shapes explanation quality. We take a training-centered view and first provide a curvature-based analysis linking attribution stability to how smoothly the input-gradient field varies locally. Guided by this connection, we study adversarial training and identify a consistent trade-off: it yields sparser and more input-stable saliency maps, but can degrade output-side stability, causing explanations to change even when predictions remain unchanged and logits vary only slightly. To mitigate this, we propose augmenting adversarial training with a lightweight feature-map smoothing block that applies a differentiable Gaussian filter in an intermediate layer. Across FMNIST, CIFAR-10, and ImageNette, our method preserves the sparsity benefits of adversarial training while improving both input-side stability and output-side stability. A human study with 65 participants further shows that smoothed adversarial saliency maps are perceived as more sufficient and trustworthy. Overall, our results demonstrate that explanation quality is critically shaped by training, and that simple smoothing with robust training provides a practical path toward saliency maps that are both sparse and stable.
Chinese Translation
基于梯度的显著性方法,如原始梯度(Vanilla Gradient, VG)和积分梯度(Integrated Gradients, IG),广泛用于解释图像分类器,但所生成的图往往噪声较大且不稳定,限制了其在高风险环境中的实用性。大多数先前的研究通过修改归因算法来改善解释,尚未探讨训练过程如何影响解释质量。我们采取以训练为中心的视角,首先提供了一种基于曲率的分析,将归因稳定性与输入梯度场在局部的平滑变化联系起来。在这一联系的指导下,我们研究了对抗训练,并识别出一个一致的权衡:它产生了更稀疏且对输入更稳定的显著性图,但可能会降低输出端的稳定性,导致解释在预测未变且对数值仅轻微变化时也发生变化。为此,我们提出通过在中间层应用可微分的高斯滤波器来增强对抗训练,加入一个轻量级的特征图平滑模块。在FMNIST、CIFAR-10和ImageNette数据集上,我们的方法保留了对抗训练的稀疏性优势,同时改善了输入端和输出端的稳定性。一项包含65名参与者的人类研究进一步表明,平滑后的对抗显著性图被认为更加充分和可信。总体而言,我们的结果表明,解释质量受到训练的关键影响,而通过稳健的训练进行简单的平滑提供了一条实用的路径,以获得既稀疏又稳定的显著性图。
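The smoothing block described above is simple enough to sketch: a fixed depthwise Gaussian convolution that stays differentiable so robust training can proceed through it. Kernel size and sigma below are illustrative choices, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianSmoothing(nn.Module):
    """Sketch of the feature-map smoothing idea: a fixed, differentiable
    Gaussian filter applied depthwise to an intermediate feature map, so
    gradients still flow through it during adversarial training."""
    def __init__(self, channels, kernel_size=5, sigma=1.0):
        super().__init__()
        ax = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
        g = torch.exp(-ax ** 2 / (2 * sigma ** 2))
        kernel_2d = torch.outer(g, g)
        kernel_2d = kernel_2d / kernel_2d.sum()            # normalize to sum 1
        weight = kernel_2d.expand(channels, 1, -1, -1).clone()
        self.register_buffer("weight", weight)             # fixed, not learned
        self.groups, self.pad = channels, kernel_size // 2

    def forward(self, x):
        return F.conv2d(x, self.weight, padding=self.pad, groups=self.groups)

feat = torch.randn(2, 64, 28, 28, requires_grad=True)
smoothed = GaussianSmoothing(channels=64)(feat)
smoothed.sum().backward()                                  # differentiable end to end
```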
cs.CV / 98 / 2603.07307

StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

StructSAM:用于Segment Anything Models的结构与光谱保持的令牌合并
Nguyen, Duy M. H., Tran, Tuan A., Nguyen, Duong, Xie, Siwei, Nguyen, Trung Q., Truong, Mai T. N., Palenicek, Daniel, Le, An T., Barz, Michael, Nguyen, TrungTin, Dam, Tuan, Le, Ngan, Vu, Minh, Doan, Khoa, Ngo, Vien, Xie, Pengtao, Zou, James, Sonntag, Daniel, Peters, Jan, Niepert, Mathias
Abstract
Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose StructSAM, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25-30% (up to 40%+ with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.
Chinese Translation
最近的视觉变换器(ViTs)令牌合并技术通过减少自注意力处理的令牌数量,提供了显著的加速,通常无需重新训练。然而,这些技术直接应用于Segment Anything Model(SAM)系列并非易事:SAM的图像编码器混合了窗口注意力和全局注意力,其掩膜解码器依赖于密集的、基于提示的特征来进行精确的边界预测。我们在严格的现成设置下系统地评估了代表性的令牌合并方法在SAM和Medical SAM上的表现,发现现有的目标选择启发式方法在合并率增加时会侵蚀边界并泄露提示信息。我们提出了StructSAM,一个为SAM量身定制的保持分辨率的合并-拆分框架。StructSAM从一阶特征梯度计算轻量级的令牌能量评分,使用基于网格的平坦性筛选来保护边界和提示区域,并在平坦区域内将令牌合并到低能量目标,同时显式恢复令牌。我们进一步提供了一个光谱图粗化视角,显示得分引导的合并相较于随机或窗口限制基线产生了有界的拉普拉斯光谱失真。在八个自然和医学基准测试中,StructSAM将编码器的FLOPs减少了25-30%(通过提示感知合并可高达40%+),同时mIoU/Dice的下降幅度很小,始终优于ToMe、PiToMe、ToMeSD、VidToMe和ALGM在相同计算条件下的表现。
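A minimal sketch of a first-order token-energy score in the spirit described above: tokens in flat feature regions receive low energy and become merge destinations, while high-energy boundary tokens are protected. The specific norm and 4-neighborhood below are our assumptions:

```python
import torch

def token_energy(features, grid_hw):
    """Sketch of a first-order token-energy score: a token's energy is the
    summed feature-gradient magnitude to its spatial neighbors, so flat
    regions score low (safe to merge) and boundaries score high (keep)."""
    h, w = grid_hw
    b, n, d = features.shape
    fmap = features.view(b, h, w, d)
    gx = (fmap[:, :, 1:] - fmap[:, :, :-1]).norm(dim=-1)   # horizontal gradient
    gy = (fmap[:, 1:, :] - fmap[:, :-1, :]).norm(dim=-1)   # vertical gradient
    energy = torch.zeros(b, h, w, device=features.device)
    energy[:, :, :-1] += gx; energy[:, :, 1:] += gx
    energy[:, :-1, :] += gy; energy[:, 1:, :] += gy
    return energy.view(b, n)

feats = torch.randn(1, 14 * 14, 256)                       # ViT token grid
e = token_energy(feats, grid_hw=(14, 14))
merge_destinations = e.argsort(dim=1)[:, :32]              # 32 lowest-energy tokens
print(merge_destinations.shape)
```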
cs.CV / 99 / 2603.07314

Faster-HEAL: An Efficient and Privacy-Preserving Collaborative Perception Framework for Heterogeneous Autonomous Vehicles

Faster-HEAL:一种高效且保护隐私的异构自主车辆协同感知框架
Maleki, Armin, Radha, Hayder
Abstract
Collaborative perception (CP) is a promising paradigm for improving situational awareness in autonomous vehicles by overcoming the limitations of single-agent perception. However, most existing approaches assume homogeneous agents, which restricts their applicability in real-world scenarios where vehicles use diverse sensors and perception models. This heterogeneity introduces a feature domain gap that degrades detection performance. Prior works address this issue by retraining entire models/major components, or using feature interpreters for each new agent type, which is computationally expensive, compromises privacy, and may reduce single-agent accuracy. We propose Faster-HEAL, a lightweight and privacy-preserving CP framework that fine-tunes a low-rank visual prompt to align heterogeneous features with a unified feature space while leveraging pyramid fusion for robust feature aggregation. This approach reduces the trainable parameters by 94%, enabling efficient adaptation to new agents without retraining large models. Experiments on the OPV2V-H dataset show that Faster-HEAL improves detection performance by 2% over state-of-the-art methods with significantly lower computational overhead, offering a practical solution for scalable heterogeneous CP.
Chinese Translation
协同感知(Collaborative Perception, CP)是一种有前景的范式,通过克服单一代理感知的局限性来提高自主车辆的情境意识。然而,现有大多数方法假设代理是同质的,这限制了它们在使用多种传感器和感知模型的真实场景中的适用性。这种异质性引入了特征领域差距,从而降低了检测性能。以往的研究通过重新训练整个模型或主要组件,或为每种新代理类型使用特征解释器来解决这一问题,这在计算上代价高昂,妨碍了隐私,并可能降低单一代理的准确性。我们提出了Faster-HEAL,这是一种轻量级且保护隐私的CP框架,通过微调低秩视觉提示,将异构特征与统一特征空间对齐,同时利用金字塔融合进行稳健的特征聚合。这种方法将可训练参数减少了94%,使得在不重新训练大型模型的情况下能够高效适应新代理。在OPV2V-H数据集上的实验表明,Faster-HEAL在检测性能上比最先进的方法提高了2%,且计算开销显著降低,为可扩展的异构CP提供了实用解决方案。
cs.CV / 100 / 2603.07338

A Lightweight Digital-Twin-Based Framework for Edge-Assisted Vehicle Tracking and Collision Prediction

基于轻量级数字双胞胎的边缘辅助车辆跟踪与碰撞预测框架
Onsu, Murat Arda, Lohan, Poonam, Kantarci, Burak, Syed, Aisha, Andrews, Matthew, Kennedy, Sean
Abstract
Vehicle tracking, motion estimation, and collision prediction are fundamental components of traffic safety and management in Intelligent Transportation Systems (ITS). Many recent approaches rely on computationally intensive prediction models, which limits their practical deployment on resource-constrained edge devices. This paper presents a lightweight digital-twin-based framework for vehicle tracking and spatiotemporal collision prediction that relies solely on object detection, without requiring complex trajectory prediction networks. The framework is implemented and evaluated in Quanser Interactive Labs (QLabs), a high-fidelity digital twin of an urban traffic environment that enables controlled and repeatable scenario generation. A YOLO-based detector is deployed on simulated edge cameras to localize vehicles and extract frame-level centroid trajectories. Offline path maps are constructed from multiple traversals and indexed using K-D trees to support efficient online association between detected vehicles and road segments. During runtime, consistent vehicle identifiers are maintained, vehicle speed and direction are estimated from the temporal evolution of path indices, and future positions are predicted accordingly. Potential collisions are identified by analyzing both spatial proximity and temporal overlap of predicted future trajectories. Our experimental results across diverse simulated urban scenarios show that the proposed framework predicts approximately 88% of collision events prior to occurrence while maintaining low computational overhead suitable for edge deployment. Rather than introducing a computationally intensive prediction model, this work introduces a lightweight digital-twin-based solution for vehicle tracking and collision prediction, tailored for real-time edge deployment in ITS.
Chinese Translation
车辆跟踪、运动估计和碰撞预测是智能交通系统(ITS)中交通安全和管理的基本组成部分。许多近期的方法依赖于计算密集型的预测模型,这限制了它们在资源受限的边缘设备上的实际应用。本文提出了一种基于轻量级数字双胞胎的车辆跟踪和时空碰撞预测框架,该框架仅依赖于目标检测,而不需要复杂的轨迹预测网络。该框架在Quanser互动实验室(QLabs)中实施和评估,QLabs是一个高保真度的城市交通环境数字双胞胎,能够实现受控和可重复的场景生成。基于YOLO的检测器被部署在模拟边缘摄像头上,以定位车辆并提取帧级质心轨迹。通过多次遍历构建离线路径图,并使用K-D树进行索引,以支持检测到的车辆与道路段之间的高效在线关联。在运行时,保持一致的车辆标识符,根据路径索引的时间演变估计车辆速度和方向,并相应地预测未来位置。通过分析预测未来轨迹的空间接近性和时间重叠来识别潜在碰撞。我们在多种模拟城市场景中的实验结果表明,所提出的框架在碰撞事件发生之前预测了约88%的碰撞事件,同时保持适合边缘部署的低计算开销。本文提出了一种轻量级的数字双胞胎解决方案,用于车辆跟踪和碰撞预测,旨在满足ITS中实时边缘部署的需求。
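The path-map association and collision test described above can be sketched compactly with a K-D tree; all thresholds and the straight-lane path below are illustrative, not values from the paper:

```python
import numpy as np
from scipy.spatial import cKDTree

# Offline: index a lane's centroid path once; online: associate detections,
# track progress along path indices, and check predicted overlap.
path = np.column_stack([np.linspace(0, 50, 200), np.zeros(200)])  # straight lane
tree = cKDTree(path)

def path_index(centroid):
    """Snap a detected centroid to its nearest path sample."""
    _, idx = tree.query(centroid)
    return idx

def predict_collision(idx_a, speed_a, idx_b, speed_b, horizon=3.0, fps=10,
                      dist_thresh=2.0):
    """Sketch of the spatiotemporal check (thresholds are illustrative):
    roll both vehicles forward along the path, in indices per frame, and
    flag any frame where predicted positions come within dist_thresh."""
    for step in range(int(horizon * fps)):
        pa = path[min(int(idx_a + speed_a * step), len(path) - 1)]
        pb = path[min(int(idx_b + speed_b * step), len(path) - 1)]
        if np.linalg.norm(pa - pb) < dist_thresh:
            return True, step / fps                 # predicted collision time (s)
    return False, None

idx = path_index(np.array([5.1, 0.2]))
print(predict_collision(idx, speed_a=3.0, idx_b=60, speed_b=1.0))
```

The appeal of this design is that nothing beyond the detector runs a neural network: the K-D tree query and the forward roll are cheap enough for the edge devices the paper targets.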
cs.CV / 101 / 2603.07356

AgrI Challenge: A Data-Centric AI Competition for Cross-Team Validation in Agricultural Vision

AgrI挑战:一个面向农业视觉的跨团队验证的数据中心人工智能竞赛
Brahimi, Mohammed, Laabassi, Karim, Ameur, Mohamed Seghir Hadj, Boutorh, Aicha, Siab-Farsi, Badia, Khouani, Amin, Zouak, Omar Farouk, Bouziane, Seif Eddine, Lakhdari, Kheira, Benghanem, Abdelkader Nabil
Abstract
Machine learning models in agricultural vision often achieve high accuracy on curated datasets but fail to generalize under real field conditions due to distribution shifts between training and deployment environments. Moreover, most machine learning competitions focus primarily on model design while treating datasets as fixed resources, leaving the role of data collection practices in model generalization largely unexplored. We introduce the AgrI Challenge, a data-centric competition framework in which multiple teams independently collect field datasets, producing a heterogeneous multi-source benchmark that reflects realistic variability in acquisition conditions. To systematically evaluate cross-domain generalization across independently collected datasets, we propose Cross-Team Validation (CTV), an evaluation paradigm that treats each team's dataset as a distinct domain. CTV includes two complementary protocols: Train-on-One-Team-Only (TOTO), which measures single-source generalization, and Leave-One-Team-Out (LOTO), which evaluates collaborative multi-source training. Experiments reveal substantial generalization gaps under single-source training: models achieve near-perfect validation accuracy yet exhibit validation-test gaps of up to 16.20% (DenseNet121) and 11.37% (Swin Transformer) when evaluated on datasets collected by other teams. In contrast, collaborative multi-source training dramatically improves robustness, reducing the gap to 2.82% and 1.78%, respectively. The challenge also produced a publicly available dataset of 50,673 field images of six tree species collected by twelve independent teams, providing a diverse benchmark for studying domain shift and data-centric learning in agricultural vision.
Chinese Translation
农业视觉中的机器学习模型在经过精心策划的数据集上通常能够达到高准确率,但由于训练和部署环境之间的分布变化,它们在实际田间条件下往往无法泛化。此外,大多数机器学习竞赛主要集中于模型设计,而将数据集视为固定资源,导致数据收集实践在模型泛化中的作用尚未得到充分探索。我们提出了AgrI挑战,这是一个数据中心的竞赛框架,其中多个团队独立收集田间数据集,产生一个反映获取条件现实变异性的异构多源基准。为了系统评估跨领域泛化,我们提出了跨团队验证(Cross-Team Validation, CTV),这一评估范式将每个团队的数据集视为一个独立的领域。CTV包括两个互补的协议:仅在一个团队上训练(Train-on-One-Team-Only, TOTO),用于测量单源泛化,以及留一团队法(Leave-One-Team-Out, LOTO),用于评估协作多源训练。实验结果显示,在单源训练下存在显著的泛化差距:模型在验证集上几乎达到完美的准确率,但在其他团队收集的数据集上评估时,验证-测试差距高达16.20%(DenseNet121)和11.37%(Swin Transformer)。相比之下,协作多源训练显著提高了鲁棒性,将差距分别缩小至2.82%和1.78%。该挑战还产生了一个公开可用的数据集,包含由十二个独立团队收集的50,673张六种树种的田间图像,为研究农业视觉中的领域转移和数据中心学习提供了多样化的基准。
cs.CV / 102 / 2603.07394

AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

AQuA:面向模糊视觉问题的战略响应生成
Jang, Jihyoung, Kim, Hyounghun
Abstract
Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.
Chinese Translation
视觉问答(VQA)是评估视觉-语言模型(VLMs)能力的核心任务。现有的VQA基准主要包含清晰且明确的图像-问题对,而现实世界场景通常涉及不同程度的模糊性,这需要细致的推理和适应上下文的响应策略。尽管近期研究已开始关注VQA中的模糊性,但它们缺乏(1)对模糊性水平的系统分类和(2)支持策略意识响应的数据集和模型。在本文中,我们引入了模糊视觉问答(AQuA),这是一个细粒度的数据集,根据模糊性的性质和程度将模糊VQA实例分类为四个级别,并为每种情况提供最佳响应策略。我们对多种开源和专有VLM的评估表明,大多数模型未能根据模糊性类型调整其策略,常常产生过于自信的答案,而不是寻求澄清或承认不确定性。为了解决这一挑战,我们在AQuA上微调VLM,使其能够在多种响应策略中自适应选择,例如直接回答、从上下文线索推断意图、列出合理的替代方案或请求澄清。在AQuA上训练的VLM实现了对模糊VQA的战略响应生成,展现了识别模糊性、管理不确定性和以上下文适当的策略作出响应的能力,同时超越了开源和闭源基准。
cs.CV / 103 / 2603.07399

Interpretable Aneurysm Classification via 3D Concept Bottleneck Models: Integrating Morphological and Hemodynamic Clinical Features

可解释的动脉瘤分类通过3D概念瓶颈模型:整合形态学和血流动力学临床特征
Khaled, Toqa, Al-Kabbany, Ahmad
Abstract
We are concerned with the challenge of reliably classifying and assessing intracranial aneurysms using deep learning without compromising clinical transparency. While traditional black-box models achieve high predictive accuracy, their lack of inherent interpretability remains a significant barrier to clinical adoption and regulatory approval. Explainability is paramount in medical modeling to ensure that AI-driven diagnoses align with established neurosurgical principles. Unlike traditional eXplainable AI (XAI) methods -- such as saliency maps, which often provide post-hoc, non-causal visual correlations -- Concept Bottleneck Models (CBMs) offer a robust alternative by constraining the model's internal logic to human-understandable clinical indices. In this article, we propose an end-to-end 3D Concept Bottleneck framework that maps high-dimensional neuroimaging features to a discrete set of morphological and hemodynamic concepts for aneurysm identification. We implemented this pipeline using a pre-trained 3D ResNet-34 backbone and a 3D DenseNet-121 to extract features from CTA volumes, which were subsequently processed through a soft bottleneck layer representing human-interpretable clinical concepts. The model was optimized using a joint-loss function to balance diagnostic focal loss and concept mean squared error (MSE), validated via stratified five-fold cross-validation. Our results demonstrate a peak task classification accuracy of 93.33% ± 4.5% for the ResNet-34 architecture and 91.43% ± 5.8% for the DenseNet-121 model. Furthermore, the implementation of 8-pass Test-Time Augmentation (TTA) yielded a robust mean accuracy of 88.31%, ensuring diagnostic stability during inference. By maintaining an accuracy-generalization gap of less than 0.04, this framework proves that high predictive performance can be achieved without sacrificing interpretability.
Chinese Translation
我们关注使用深度学习可靠地分类和评估颅内动脉瘤的挑战,同时不妥协临床透明度。虽然传统的黑箱模型实现了高预测准确性,但其缺乏内在可解释性仍然是临床应用和监管批准的重大障碍。可解释性在医学建模中至关重要,以确保人工智能驱动的诊断与既定的神经外科原则相一致。与传统的可解释人工智能(XAI)方法——如显著性图,通常提供事后非因果的视觉关联——不同,概念瓶颈模型(CBMs)通过将模型的内部逻辑限制为人类可理解的临床指标,提供了一种稳健的替代方案。在本文中,我们提出了一种端到端的3D概念瓶颈框架,该框架将高维神经影像特征映射到一组离散的形态学和血流动力学概念,以便于动脉瘤的识别。我们使用预训练的3D ResNet-34主干和3D DenseNet-121来提取CTA体积中的特征,并通过一个软瓶颈层处理这些特征,该层表示人类可解释的临床概念。该模型使用联合损失函数进行优化,以平衡诊断焦点损失和概念均方误差(MSE),并通过分层五折交叉验证进行验证。我们的结果表明,ResNet-34架构的峰值任务分类准确率为93.33% +/- 4.5%,而DenseNet-121模型的准确率为91.43% +/- 5.8%。此外,实施8次测试时间增强(TTA)产生了88.31%的稳健平均准确率,确保了推理过程中的诊断稳定性。通过保持小于0.04的准确性-泛化差距,该框架证明了在不牺牲可解释性的情况下可以实现高预测性能。
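The joint objective (diagnostic focal loss plus concept MSE) is standard enough to sketch; the hyperparameters below are illustrative, not the tuned values from the paper:

```python
import torch
import torch.nn.functional as F

def joint_cbm_loss(task_logits, task_labels, concept_preds, concept_targets,
                   alpha=1.0, gamma=2.0, lam=0.5):
    """Sketch of the joint objective: focal loss on the diagnostic task plus
    MSE on the predicted clinical concepts (alpha/gamma/lam are assumed
    hyperparameters, not the paper's)."""
    ce = F.cross_entropy(task_logits, task_labels, reduction="none")
    p_t = torch.exp(-ce)                             # probability of true class
    focal = (alpha * (1 - p_t) ** gamma * ce).mean() # down-weight easy cases
    concept_mse = F.mse_loss(concept_preds, concept_targets)
    return focal + lam * concept_mse

logits = torch.randn(8, 2, requires_grad=True)       # aneurysm vs. no aneurysm
labels = torch.randint(0, 2, (8,))
concepts = torch.sigmoid(torch.randn(8, 12, requires_grad=True))  # 12 clinical indices
loss = joint_cbm_loss(logits, labels, concepts, torch.rand(8, 12))
loss.backward()
```

Supervising the bottleneck with concept targets is what makes the model inspectable: a clinician can read off the predicted morphological and hemodynamic indices before trusting the diagnosis.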
cs.CV / 104 / 2603.07401

VIVECaption: A Split Approach to Caption Quality Improvement

VIVECaption:一种分层的方法来提升字幕质量
Ananth, Varun, Liu, Baqiao, Cai, Haoran
Abstract
Caption quality has emerged as a critical bottleneck in training high-quality text-to-image (T2I) and text-to-video (T2V) generative models. While visual language models (VLMs) are commonly deployed to generate captions from visual data, they suffer from hallucinations, poor compositional reasoning, and limited fine-grained understanding, resulting in misaligned image-caption pairs that degrade downstream model performance. This technical report introduces VIVECaption, a systematic two-sided approach to caption quality improvement. We first establish a comprehensive taxonomy of caption evaluation metrics, distinguishing between "universal" and "instance-grounded" metrics, with the ultimate goal of showcasing the use-cases and tradeoffs between different caption quality metrics. We then use this language to describe our two-sided approach to caption quality improvement: (1) a gold-standard dataset creation methodology using stratified sampling and (2) a model alignment strategy encompassing context alignment and parameter-level finetuning using SFT. We demonstrate our methodology on open-source models, focusing on structured caption formats that enable better parsing and downstream utilization. We ultimately show that using a finetuned character detection model in an image captioning pipeline significantly improves holistic image-caption alignment quality. Our work addresses the growing need for high-quality "vegan" training data in enterprise AI development, providing practical solutions for teams seeking to improve caption-image alignment without relying on potentially copyright-protected web-scraped content.
Chinese Translation
字幕质量已成为训练高质量文本到图像(T2I)和文本到视频(T2V)生成模型的关键瓶颈。尽管视觉语言模型(VLMs)通常用于从视觉数据生成字幕,但它们存在幻觉、较差的组合推理能力和有限的细粒度理解等问题,导致图像-字幕对不匹配,从而降低下游模型的性能。本技术报告介绍了VIVECaption,一种系统性的双面字幕质量提升方法。我们首先建立了一个全面的字幕评估指标分类法,区分“通用”指标和“实例基础”指标,最终目标是展示不同字幕质量指标之间的使用案例和权衡。然后,我们使用这一语言描述我们的双面字幕质量提升方法:(1)使用分层抽样的金标准数据集创建方法论,以及(2)涵盖上下文对齐和使用SFT进行参数级微调的模型对齐策略。我们在开源模型上展示了我们的方法,重点关注能够更好解析和下游利用的结构化字幕格式。最终,我们展示了在图像字幕生成管道中使用微调的字符检测模型显著提高了整体图像-字幕对齐质量。我们的工作解决了企业人工智能开发中对高质量“素食”训练数据日益增长的需求,为寻求改善字幕-图像对齐的团队提供了实际解决方案,而无需依赖可能受版权保护的网络抓取内容。
cs.CV / 105 / 2603.07403

Prompt-Based Caption Generation for Single-Tooth Dental Images Using Vision-Language Models

基于提示的单颗牙齿牙科图像标题生成:使用视觉-语言模型
Sukhanova, Anastasiia, Taylor, Aiden, Myers, Julian, Wang, Zichun, Jammuladinne, Kartha Veerya, Nimmagadda, Satya Sri Rajiteswari, Maiti, Aniruddha, Jana, Ananya
Abstract
Digital dentistry has made significant advances with the advent of deep learning. However, the majority of these deep learning-based dental image analysis models focus on very specific tasks such as tooth segmentation, tooth detection, cavity detection, and gingivitis classification. There is a lack of a specialized model that has holistic knowledge of teeth and can perform dental image analysis tasks based on that knowledge. Datasets of dental images with captions can help build such a model. To the best of our knowledge, existing dental image datasets with captions are few in number and limited in scope. In many of these datasets, the captions describe the entire mouth, while the images are limited to the anterior view. As a result, posterior teeth such as molars are not clearly visible, limiting the usefulness of the captions for training vision-language models. Additionally, the captions focus only on a specific disease (gingivitis) and do not provide a holistic assessment of each tooth. Moreover, tooth disease scores are typically assigned to individual teeth, and each tooth is treated as a separate entity in orthodontic procedures. Therefore, it is important to have captions for single-tooth images. As far as we know, no such dataset of single-tooth images with dental captions exists. In this work, we aim to bridge that gap by assessing the possibility of generating captions for dental images using Vision-Language Models (VLMs) and evaluating the extent and quality of those captions. Our findings suggest that guided prompts help VLMs generate meaningful captions. We show that the prompts generated by our framework are better anchored in describing the visual aspects of dental images. We selected RGB images as they have greater potential in consumer scenarios.
Chinese Translation
数字牙科在深度学习的推动下取得了显著进展。然而,现有的大多数基于深度学习的牙科图像分析模型主要集中在非常具体的任务上,例如牙齿分割、牙齿检测、龋齿检测和牙龈炎分类。缺乏一种具有整体牙齿知识的专业模型,能够基于该知识执行牙科图像分析任务。带有标题的牙科图像数据集可以帮助构建这样的模型。根据我们所知,现有的带有标题的牙科图像数据集数量较少且范围有限。在许多这些数据集中,标题描述的是整个口腔,而图像仅限于前视图。因此,后牙如磨牙并不清晰可见,这限制了标题在训练视觉-语言模型中的有效性。此外,这些标题仅关注特定疾病(牙龈炎),并未提供对每颗牙齿的整体评估。此外,牙齿疾病评分通常分配给单个牙齿,并且在正畸程序中,每颗牙齿被视为独立实体。因此,为单颗牙齿图像提供标题是非常重要的。就我们所知,目前尚不存在带有牙科标题的单颗牙齿图像数据集。在本研究中,我们旨在填补这一空白,评估使用视觉-语言模型(VLMs)生成牙科图像标题的可能性,并评估这些标题的广度和质量。我们的研究结果表明,指导性提示有助于VLMs生成有意义的标题。我们展示了我们框架生成的提示在描述牙科图像的视觉特征方面更具针对性。我们选择RGB图像,因为它们在消费场景中具有更大的潜力。
cs.CV / 106 / 2603.07406

UnSCAR: Universal, Scalable, Controllable, and Adaptable Image Restoration

UnSCAR:通用、可扩展、可控和可适应的图像恢复
Mandal, Debabrata, Chattopadhyay, Soumitri, Wang, Yujie, Niethammer, Marc, Chakravarthula, Praneeth
Abstract
Universal image restoration aims to recover clean images from arbitrary real-world degradations using a single inference model. Despite significant progress, existing all-in-one restoration networks do not scale to multiple degradations. As the number of degradations increases, training becomes unstable, models grow excessively large, and performance drops across both seen and unseen domains. In this work, we show that scaling universal restoration is fundamentally limited by interference across degradations during joint learning, leading to catastrophic task forgetting. To address this challenge, we introduce a unified inference pipeline with a multi-branch mixture-of-experts architecture that decomposes restoration knowledge across specialized task-adaptable experts. Our approach enables scalable learning (over sixteen degradations), adapts and generalizes robustly to unseen domains, and supports user-controllable restoration across degradations. Beyond achieving superior performance across benchmarks, this work establishes a new design paradigm for scalable and controllable universal image restoration.
Chinese Translation
通用图像恢复旨在使用单一推理模型从任意现实世界退化中恢复干净图像。尽管取得了显著进展,现有的全能恢复网络在处理多种退化时并不具备扩展性。随着退化数量的增加,训练变得不稳定,模型规模过大,并且在已见和未见领域的性能下降。在本研究中,我们表明,通用恢复的扩展受到联合学习过程中退化间干扰的根本限制,导致灾难性的任务遗忘。为了解决这一挑战,我们引入了一个统一的推理管道,采用多分支专家混合架构,将恢复知识分解到专门的任务适应专家中。我们的方法能够实现可扩展学习(处理超过十六种退化),并能够稳健地适应和推广到未见领域,同时支持用户可控的跨退化恢复。除了在基准测试中实现优越的性能外,本研究还建立了一种新的设计范式,以实现可扩展和可控的通用图像恢复。
cs.CV / 107 / 2603.07414

QdaVPR: A novel query-based domain-agnostic model for visual place recognition

QdaVPR:一种新颖的基于查询的领域无关视觉位置识别模型
Wan, Shanshan, Kang, Lai, Wei, Yingmei, Shen, Tianrui, Wang, Haixuan, Zuo, Chao
Abstract
Visual place recognition (VPR), which aims to predict the location of an image based solely on its visual features, is a fundamental task in robotics and autonomous systems. Domain variation remains one of the main challenges in VPR and is relatively unexplored. Existing VPR models attempt to achieve domain agnosticism either by training on large-scale datasets that inherently contain some domain variations, or by being specifically adapted to particular target domains. In practice, the former lacks explicit domain supervision, while the latter generalizes poorly to unseen domain shifts. This paper proposes a novel query-based domain-agnostic VPR model called QdaVPR. First, a dual-level adversarial learning framework is designed to encourage domain invariance for both the query features forming the global descriptor and the image features from which these query features are derived. Then, a triplet supervision based on query combinations is designed to enhance the discriminative power of the global descriptors. To support the learning process, we augment a large-scale VPR dataset using style transfer methods, generating various synthetic domains with corresponding domain labels as auxiliary supervision. Extensive experiments show that QdaVPR achieves state-of-the-art performance on multiple VPR benchmarks with significant domain variations. Specifically, it attains the best Recall@1 and Recall@10 on nearly all test scenarios: 93.5%/98.6% on Nordland (seasonal changes), 97.5%/99.0% on Tokyo24/7 (day-night transitions), and the highest Recall@1 across almost all weather conditions on the SVOX dataset. Our code will be released at https://github.com/shuimushan/QdaVPR.
Chinese Translation
视觉位置识别(VPR)旨在仅基于图像的视觉特征预测其位置,是机器人技术和自主系统中的一项基础任务。领域变化仍然是VPR面临的主要挑战之一,且相对未被深入探索。现有的VPR模型试图通过在包含某些领域变化的大规模数据集上进行训练,或通过特定适应于特定目标领域来实现领域无关性。在实践中,前者缺乏明确的领域监督,而后者在未见领域变化时泛化能力较差。本文提出了一种新颖的基于查询的领域无关VPR模型,称为QdaVPR。首先,设计了一个双层对抗学习框架,以鼓励形成全局描述符的查询特征和从中派生的图像特征的领域不变性。然后,基于查询组合设计了一种三元组监督,以增强全局描述符的区分能力。为了支持学习过程,我们使用风格迁移方法增强了一个大规模VPR数据集,生成了具有相应领域标签的各种合成领域作为辅助监督。大量实验表明,QdaVPR在多个具有显著领域变化的VPR基准测试中实现了最先进的性能。具体而言,它在几乎所有测试场景中达到了最佳的Recall@1和Recall@10:在Nordland(季节变化)上为93.5%/98.6%,在Tokyo24/7(昼夜变化)上为97.5%/99.0%,并在SVOX数据集的几乎所有天气条件下达到了最高的Recall@1。我们的代码将发布在https://github.com/shuimushan/QdaVPR。
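Adversarial domain-invariance objectives like the dual-level framework above are typically built on a gradient reversal layer; this is the standard construction, offered as a sketch rather than QdaVPR's released code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated (scaled)
    gradient in the backward pass, the usual building block of adversarial
    domain-invariance training."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# The domain classifier learns to predict the (synthetic) domain label,
# while the reversed gradient pushes encoder features toward invariance.
features = torch.randn(4, 128, requires_grad=True)
domain_logits = torch.nn.Linear(128, 5)(grad_reverse(features))
loss = torch.nn.functional.cross_entropy(domain_logits, torch.randint(0, 5, (4,)))
loss.backward()
print(features.grad.shape)  # gradients flow back, sign-flipped, into the encoder
```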
cs.CV / 108 / 2603.07430

Disentangled Textual Priors for Diffusion-based Image Super-Resolution

基于扩散的图像超分辨率的解耦文本先验
Jiang, Lei, Liu, Xin, Tong, Xinze, Li, Zhiliang, Liu, Jie, Tang, Jie, Wu, Gangshan
Abstract
Image Super-Resolution (SR) aims to reconstruct high-resolution images from degraded low-resolution inputs. While diffusion-based SR methods offer powerful generative capabilities, their performance heavily depends on how semantic priors are structured and integrated into the generation process. Existing approaches often rely on entangled or coarse-grained priors that mix global layout with local details, or conflate structural and textural cues, thereby limiting semantic controllability and interpretability. In this work, we propose DTPSR, a novel diffusion-based SR framework that introduces disentangled textual priors along two complementary dimensions: spatial hierarchy (global vs. local) and frequency semantics (low- vs. high-frequency). By explicitly separating these priors, DTPSR enables the model to simultaneously capture scene-level structure and object-specific details with frequency-aware semantic guidance. The corresponding embeddings are injected via specialized cross-attention modules, forming a progressive generation pipeline that reflects the semantic granularity of visual content, from global layout to fine-grained textures. To support this paradigm, we construct DisText-SR, a large-scale dataset containing approximately 95,000 image-text pairs with carefully disentangled global, low-frequency, and high-frequency descriptions. To further enhance controllability and consistency, we adopt a multi-branch classifier-free guidance strategy with frequency-aware negative prompts to suppress hallucinations and semantic drift. Extensive experiments on synthetic and real-world benchmarks show that DTPSR achieves high perceptual quality, competitive fidelity, and strong generalization across diverse degradation scenarios.
Chinese Translation
图像超分辨率(SR)旨在从退化的低分辨率输入中重建高分辨率图像。尽管基于扩散的SR方法提供了强大的生成能力,但其性能在很大程度上依赖于语义先验的结构和在生成过程中的整合方式。现有方法通常依赖于纠缠或粗粒度的先验,这些先验将全局布局与局部细节混合,或将结构线索与纹理线索混淆,从而限制了语义的可控性和可解释性。在本研究中,我们提出了DTPSR,一种新颖的基于扩散的SR框架,沿着两个互补维度引入了解耦文本先验:空间层次(全局与局部)和频率语义(低频与高频)。通过明确分离这些先验,DTPSR使模型能够同时捕捉场景级结构和对象特定细节,并提供频率感知的语义指导。相应的嵌入通过专门的交叉注意力模块注入,形成一个渐进的生成管道,反映视觉内容的语义粒度,从全局布局到细粒度纹理。为了支持这一范式,我们构建了DisText-SR,一个包含约95,000对图像-文本的规模化数据集,包含经过精心解耦的全局、低频和高频描述。为了进一步增强可控性和一致性,我们采用了一种多分支无分类器引导策略,结合频率感知的负提示,以抑制幻觉和语义漂移。在合成和真实世界基准上的广泛实验表明,DTPSR在感知质量、保真度和在多种退化场景中的强泛化能力方面表现优异。
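The multi-branch classifier-free guidance with negative prompts can be sketched as follows. The combination rule is the standard CFG form; the toy denoiser, embedding shapes, and branch weights are placeholders for illustration, not DTPSR's actual components.

```python
# Standard CFG combines conditional branches against a negative branch:
#     eps = eps_neg + sum_i w_i * (eps_cond_i - eps_neg)
import torch

def multi_branch_cfg(denoiser, x_t, t, cond_embs, neg_emb, weights):
    """cond_embs: list of per-branch text embeddings; weights: one scale per branch."""
    eps_neg = denoiser(x_t, t, neg_emb)            # negative-prompt branch
    eps = eps_neg.clone()
    for emb, w in zip(cond_embs, weights):
        eps = eps + w * (denoiser(x_t, t, emb) - eps_neg)
    return eps

# Toy denoiser just to make the sketch runnable end-to-end.
toy = lambda x, t, c: x * 0.1 + c.mean() * 0.01
x = torch.randn(1, 4, 32, 32)
out = multi_branch_cfg(toy, x, torch.tensor([10]),
                       [torch.randn(77, 768) for _ in range(3)],   # e.g. global/low/high-freq
                       torch.randn(77, 768), [3.0, 1.5, 1.5])
print(out.shape)
```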
cs.CV / 109 / 2603.07432

Generalization in Online Reinforcement Learning for Mobile Agents

移动智能体在线强化学习中的泛化
Gu, Li, Jiang, Zihuan, Chi, Zhixiang, Liu, Huan, Wang, Ziqiang, Yu, Yuanhao, Berseth, Glen, Wang, Yang
Abstract
Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language-model(VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce \textbf{AndroidWorld-Generalization}, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure, asynchronous execution, and error recovery to support reliable and efficient training. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, yielding a 26.1\% improvement on unseen instances but only limited gains on unseen templates (15.7\%) and apps (8.3\%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few-shot adaptation at test-time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure \footnote{https://github.com/zihuanjiang/AndroidWorld-Generalization}.
Chinese Translation
基于图形用户界面(GUI)的移动智能体通过解释自然语言指令并与屏幕互动,在移动设备上自动化数字任务。尽管近期的方法将强化学习(RL)应用于训练在互动环境中的视觉语言模型(VLM)智能体,主要关注性能,但由于缺乏标准化基准和开源的RL系统,泛化问题仍然未得到充分探讨。在本研究中,我们将该问题形式化为上下文马尔可夫决策过程(CMDP),并引入了AndroidWorld-Generalization,这是一个具有三个逐渐增加挑战性的模式的基准,用于评估对未见任务实例、模板和应用的零样本泛化。我们进一步提出了一种RL训练系统,将群体相对策略优化(GRPO)与可扩展的rollout采集系统相结合,该系统由容器化基础设施和异步执行组成,并具备错误恢复功能,以支持可靠和高效的训练。在AndroidWorld-Generalization上的实验表明,RL使得一个具有70亿参数的VLM智能体超越了监督微调基线,在未见实例上实现了26.1%的提升,但在未见模板(15.7%)和应用(8.3%)上的增益有限,突显了泛化的挑战。作为初步步骤,我们展示了在测试时的少样本适应能提高在未见应用上的性能,激励未来在这一方向的研究。为了支持可重复性和公平比较,我们开源了完整的RL训练系统,包括环境、任务套件、模型、提示配置和基础设施(https://github.com/zihuanjiang/AndroidWorld-Generalization)。
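GRPO's core trick, normalizing each rollout's reward against its own task group instead of a learned value critic, is compact enough to sketch directly; variable names below are illustrative.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_tasks, group_size) scalar returns for rollouts grouped by task."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # relative advantage within each group

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],    # task 1: two successful rollouts
                        [0.0, 0.0, 0.0, 1.0]])   # task 2: one successful rollout
print(grpo_advantages(rewards))
```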
cs.CV / 110 / 2603.07436

RPG-SAM: Reliability-Weighted Prototypes and Geometric Adaptive Threshold Selection for Training-Free One-Shot Polyp Segmentation

RPG-SAM:基于可靠性加权原型和几何自适应阈值选择的无训练单次息肉分割
Lin, Weikun, Bai, Yunhao, Wang, Yan
Abstract
Training-free one-shot segmentation offers a scalable alternative to expert annotations, where knowledge is transferred from support images and foundation models. However, existing methods often treat all pixels in the support images and all response intensities in the query in a homogeneous way, ignoring the regional heterogeneity in support images and the response heterogeneity in the query. To resolve this, we propose RPG-SAM, a framework that systematically tackles these heterogeneity gaps. Specifically, to address regional heterogeneity, we introduce Reliability-Weighted Prototype Mining (RWPM) to prioritize high-fidelity support features while utilizing background anchors as contrastive references for noise suppression. To address response heterogeneity, we develop Geometric Adaptive Selection (GAS) to dynamically recalibrate binarization thresholds by evaluating the morphological consensus of candidates. Finally, an iterative refinement loop is designed to polish anatomical boundaries. By accounting for multi-layered information heterogeneity, RPG-SAM achieves a 5.56\% mIoU improvement on the Kvasir dataset. Code will be released.
Chinese Translation
无训练单次分割为专家注释提供了一种可扩展的替代方案,其中知识通常从支持图像和基础模型中转移。然而,现有方法往往以同质的方式处理支持图像中的所有像素和查询响应强度模型。它们忽视了支持图像中的区域异质性和查询中的响应异质性。为了解决这个问题,我们提出了RPG-SAM,一个系统性解决这些异质性差距的框架。具体而言,为了解决区域异质性,我们引入了可靠性加权原型挖掘(RWPM),以优先考虑高保真支持特征,同时利用背景锚点作为对比参考以抑制噪声。为了解决响应异质性,我们开发了几何自适应选择(GAS),通过评估候选者的形态共识动态重新校准二值化阈值。最后,设计了一种迭代精炼循环方法,以优化解剖边界。通过考虑多层次的信息异质性,RPG-SAM在Kvasir数据集上实现了5.56\%的mIoU提升。代码将会发布。
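The two named ideas can be sketched in a few lines: weighting support pixels by a reliability score before pooling them into a foreground prototype, and sweeping candidate thresholds with a geometric score. The reliability proxy (cosine similarity to the class centroid) and the compactness heuristic (perimeter over area) below are simplified stand-ins for RWPM and GAS, assumed for illustration only.

```python
import numpy as np

def reliability_weighted_prototype(feats, fg_mask):
    """feats: (H, W, D) support features; fg_mask: (H, W) binary foreground."""
    fg = feats[fg_mask > 0]                              # (N, D)
    centroid = fg.mean(axis=0)
    # Reliability proxy: cosine similarity of each foreground pixel to the centroid.
    sim = fg @ centroid / (np.linalg.norm(fg, axis=1) * np.linalg.norm(centroid) + 1e-8)
    w = np.clip(sim, 0, None)
    return (w[:, None] * fg).sum(axis=0) / (w.sum() + 1e-8)

def adaptive_threshold(response, candidates=np.linspace(0.3, 0.7, 9)):
    """Pick the threshold whose mask is most compact (low perimeter/area)."""
    best_t, best_score = candidates[0], np.inf
    for t in candidates:
        mask = response > t
        area = mask.sum()
        if area == 0:
            continue
        edges = np.abs(np.diff(mask.astype(float), axis=0)).sum() + \
                np.abs(np.diff(mask.astype(float), axis=1)).sum()
        score = edges / area                              # rough geometric consensus
        if score < best_score:
            best_t, best_score = t, score
    return best_t

feats = np.random.rand(32, 32, 64)
mask = np.zeros((32, 32)); mask[8:24, 8:24] = 1
proto = reliability_weighted_prototype(feats, mask)
resp = feats @ proto
resp = (resp - resp.min()) / (np.ptp(resp) + 1e-8)
print(proto.shape, adaptive_threshold(resp))
```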
cs.CV / 111 / 2603.07441

DogWeave: High-Fidelity 3D Canine Reconstruction from a Single Image via Normal Fusion and Conditional Inpainting

DogWeave:通过法线融合和条件修复从单幅图像高保真重建3D犬类模型
Sun, Shufan, Wang, Chenchen, Yu, Zongfu
Abstract
Monocular 3D animal reconstruction is challenging due to complex articulation, self-occlusion, and fine-scale details such as fur. Existing methods often produce distorted geometry and inconsistent textures due to the lack of articulated 3D supervision and limited availability of back-view images in 2D datasets, which makes reconstructing unobserved regions particularly difficult. To address these limitations, we propose DogWeave, a model-based framework for reconstructing high-fidelity 3D canine models from a single RGB image. DogWeave improves geometry by refining a coarsely-initialized parametric mesh into a detailed SDF representation through multi-view normal field optimization using diffusion-enhanced normals. It then generates view-consistent textures through conditional partial inpainting guided by structure and style cues, enabling realistic reconstruction of unobserved regions. Using only about 7,000 dog images processed via our 2D pipeline for training, DogWeave produces complete, realistic 3D models and outperforms state-of-the-art single-image-to-3D reconstruction methods in both shape accuracy and texture realism for canines.
Chinese Translation
单目3D动物重建面临诸多挑战,包括复杂的关节运动、自遮挡以及毛发等细微细节。现有方法由于缺乏关节化的3D监督以及2D数据集中背面图像的有限可用性,往往会产生失真的几何形状和不一致的纹理,这使得重建未观察到的区域尤其困难。为了解决这些局限性,我们提出了DogWeave,一个基于模型的框架,用于从单幅RGB图像重建高保真的3D犬类模型。DogWeave通过使用扩散增强法线的多视角法线场优化,将粗略初始化的参数化网格细化为详细的SDF表示,从而改善几何形状。然后,它通过结构和风格线索引导的条件部分修复生成视图一致的纹理,使得未观察区域的真实重建成为可能。仅使用约7000张经过我们2D管道处理的犬类图像进行训练,DogWeave能够生成完整、真实的3D模型,并在犬类的形状准确性和纹理真实感方面超越了最先进的单幅图像到3D重建方法。
cs.CV / 112 / 2603.07443

Med-Evo: Test-time Self-evolution for Medical Multimodal Large Language Models

Med-Evo:医疗多模态大型语言模型的测试时自我演化
Xu, Dunyuan, Yang, Xikai, Miao, Juzheng, Li, Yaoqian, Li, Jinpeng, Heng, Pheng-Ann
Abstract
Medical Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse healthcare tasks. However, current post-training strategies, such as supervised fine-tuning and reinforcement learning, heavily depend on substantial annotated data while overlooking the potential of unlabeled test data for model enhancement. This limitation becomes particularly pronounced in medical domains, where acquiring extensive labeled medical data is difficult due to the strict data sensitivity and annotation complexity. Moreover, leveraging test data poses challenges in generating reliable supervision signals from unlabeled samples and maintaining stable self-evolution. To address these limitations, we propose Med-Evo, the first self-evolution framework for medical MLLMs that utilizes label-free reinforcement learning to promote model performance without requiring additional labeled data. Our framework introduces two key innovations: $1)$ Feature-driven Pseudo Labeling (FPL) that identifies semantic centroids from all heterogeneous candidate responses to select pseudo labels in each rollout, and $2)$ Hard-Soft Reward (HSR) that combines exact match with token-level assessment and semantic similarity to provide hierarchical reward. Experiments on three medical VQA benchmarks and two base MLLMs show clear advantages of our approach over SOTA methods, with significant improvements of 10.43\% accuracy and 4.68\% recall on the SLAKE dataset using Qwen2.5-VL, showing the effectiveness of our method.
Chinese Translation
医疗多模态大型语言模型(MLLMs)在各种医疗任务中展现了显著的能力。然而,目前的后训练策略,如监督微调和强化学习,严重依赖大量标注数据,同时忽视了未标注测试数据在模型增强中的潜力。这一局限性在医疗领域尤为明显,因为由于严格的数据敏感性和标注复杂性,获取大量标注医疗数据非常困难。此外,利用测试数据在从未标注样本生成可靠的监督信号和维持稳定的自我演化方面也面临挑战。为了解决这些局限性,我们提出了Med-Evo,这是第一个针对医疗MLLMs的自我演化框架,利用无标签强化学习来提升模型性能,而无需额外的标注数据。我们的框架引入了两个关键创新:$1)$ 特征驱动伪标注(Feature-driven Pseudo Labeling, FPL),从所有异构候选响应中识别语义质心,以在每次回合中选择伪标签;$2)$ 硬软奖励(Hard-Soft Reward, HSR),结合精确匹配与基于标记的评估和语义相似性来提供分层奖励。在三个医疗视觉问答基准和两个基础MLLMs上的实验表明,我们的方法相较于最先进的方法具有明显优势,在使用Qwen2.5-VL的SLAKE数据集上实现了10.43\%的准确率提升和4.68\%的召回率提升,显示了我们方法的有效性。
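Feature-driven Pseudo Labeling reduces to a simple idea: embed all candidate answers from a rollout, locate their semantic centroid, and keep the candidate closest to it as the pseudo label. The sketch below uses a toy bag-of-words embedding purely to be runnable; the actual encoder and selection details are the paper's.

```python
import numpy as np

def select_pseudo_label(candidates, embed):
    """candidates: list of answer strings; embed: str -> (D,) unit vector."""
    vecs = np.stack([embed(c) for c in candidates])       # (N, D)
    centroid = vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid) + 1e-8
    scores = vecs @ centroid                              # cosine similarity to centroid
    return candidates[int(scores.argmax())], scores

# Toy hash-based bag-of-words embedding, only to make the sketch self-contained.
def toy_embed(text, dim=64):
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

answers = ["left lung opacity", "opacity in the left lung", "normal study"]
label, scores = select_pseudo_label(answers, toy_embed)
print(label, np.round(scores, 2))
```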
cs.CV / 113 / 2603.07454

SLNet: A Super-Lightweight Geometry-Adaptive Network for 3D Point Cloud Recognition

SLNet:一种超轻量级几何自适应网络用于3D点云识别
Saeid, Mohammad, Salarpour, Amir, MohajerAnsari, Pedram, Pesé, Mert D.
Abstract
We present SLNet, a lightweight backbone for 3D point cloud recognition designed to achieve strong performance without the computational cost of many recent attention, graph, and deep MLP based models. The model is built on two simple ideas: NAPE (Nonparametric Adaptive Point Embedding), which captures spatial structure using a combination of Gaussian RBF and cosine bases with input adaptive bandwidth and blending, and GMU (Geometric Modulation Unit), a per channel affine modulator that adds only 2D learnable parameters. These components are used within a four stage hierarchical encoder with FPS+kNN grouping, nonparametric normalization, and shared residual MLPs. In experiments, SLNet shows that a very small model can still remain highly competitive across several 3D recognition tasks. On ModelNet40, SLNet-S with 0.14M parameters and 0.31 GFLOPs achieves 93.64% overall accuracy, outperforming PointMLP-elite with 5x fewer parameters, while SLNet-M with 0.55M parameters and 1.22 GFLOPs reaches 93.92%, exceeding PointMLP with 24x fewer parameters. On ScanObjectNN, SLNet-M achieves 84.25% overall accuracy within 1.2 percentage points of PointMLP while using 28x fewer parameters. For large scale scene segmentation, SLNet-T extends the backbone with local Point Transformer attention and reaches 58.2% mIoU on S3DIS Area 5 with only 2.5M parameters, more than 17x fewer than Point Transformer V3. We also introduce NetScore+, which extends NetScore by incorporating latency and peak memory so that efficiency can be evaluated in a more deployment oriented way. Across multiple benchmarks and hardware settings, SLNet delivers a strong overall balance between accuracy and efficiency. Code is available at: https://github.com/m-saeid/SLNet.
Chinese Translation
我们提出了SLNet,一种用于3D点云识别的轻量级骨干网络,旨在实现强大的性能而无需像许多近期基于注意力、图形和深度多层感知机(MLP)模型那样的高计算成本。该模型基于两个简单的理念构建:NAPE(非参数自适应点嵌入),它通过结合高斯径向基函数(RBF)和余弦基函数来捕捉空间结构,并具有输入自适应带宽和混合;以及GMU(几何调制单元),这是一个每通道的仿射调制器,仅增加2D可学习参数。这些组件在一个四阶段的层次编码器中使用,该编码器具有FPS+kNN分组、非参数归一化和共享残差MLP。在实验中,SLNet显示出一个非常小的模型仍然可以在多个3D识别任务中保持高度竞争力。在ModelNet40上,具有0.14M参数和0.31 GFLOPs的SLNet-S实现了93.64%的整体准确率,超越了参数少5倍的PointMLP-elite,而具有0.55M参数和1.22 GFLOPs的SLNet-M达到了93.92%,超过了参数少24倍的PointMLP。在ScanObjectNN上,SLNet-M在使用参数少28倍的情况下,整体准确率达到了84.25%,与PointMLP相差仅1.2个百分点。对于大规模场景分割,SLNet-T通过局部点变换器注意力扩展了骨干网络,在S3DIS Area 5上以仅2.5M参数达到了58.2%的mIoU,参数比Point Transformer V3少17倍以上。我们还引入了NetScore+,它通过结合延迟和峰值内存扩展了NetScore,以便以更面向部署的方式评估效率。在多个基准和硬件设置中,SLNet在准确性和效率之间提供了强大的整体平衡。代码可在:https://github.com/m-saeid/SLNet获取。
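The two stated ingredients are simple enough to sketch: a nonparametric point embedding built from Gaussian RBF and cosine bases (NAPE), and a per-channel affine modulator (GMU). Basis counts, bandwidth, and the blending rule below are illustrative guesses, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def nape(xyz: torch.Tensor, num_bases: int = 8, sigma: float = 0.5):
    """xyz: (B, N, 3) in [-1, 1] -> (B, N, 3 * 2 * num_bases) embedding."""
    centers = torch.linspace(-1, 1, num_bases, device=xyz.device)
    d = xyz.unsqueeze(-1) - centers                        # (B, N, 3, K)
    rbf = torch.exp(-(d ** 2) / (2 * sigma ** 2))          # Gaussian RBF bases
    freqs = torch.arange(1, num_bases + 1, device=xyz.device, dtype=xyz.dtype)
    cos = torch.cos(torch.pi * xyz.unsqueeze(-1) * freqs)  # cosine bases, (B, N, 3, K)
    return torch.cat([rbf, cos], dim=-1).flatten(2)        # blend both basis families

class GMU(nn.Module):
    """Per-channel affine modulation: only 2*C learnable parameters."""
    def __init__(self, channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(channels))
        self.beta = nn.Parameter(torch.zeros(channels))

    def forward(self, x):                                  # x: (B, N, C)
        return self.gamma * x + self.beta

pts = torch.rand(2, 1024, 3) * 2 - 1
emb = nape(pts)
print(emb.shape, GMU(emb.shape[-1])(emb).shape)
```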
cs.CV / 114 / 2603.07455

Image Generation Models: A Technical History

图像生成模型:技术历史
Shirvani, Rouzbeh
Abstract
Image generation has advanced rapidly over the past decade, yet the literature seems fragmented across different models and application domains. This paper aims to offer a comprehensive survey of breakthrough image generation models, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods. We provide a detailed technical walkthrough of each model type, including their underlying objectives, architectural building blocks, and algorithmic training steps. For each model type, we present the optimization techniques as well as common failure modes and limitations. We also go over recent developments in video generation and present the research works that made it possible to go from still frames to high quality videos. Lastly, we cover the growing importance of robustness and responsible deployment of these models, including deepfake risks, detection, artifacts, and watermarking.
Chinese Translation
在过去十年中,图像生成技术迅速发展,但相关文献在不同模型和应用领域之间显得支离破碎。本文旨在提供对突破性图像生成模型的全面调查,包括变分自编码器(Variational Autoencoders, VAEs)、生成对抗网络(Generative Adversarial Networks, GANs)、归一化流(Normalizing Flows)、自回归和基于变换器的生成器,以及基于扩散的方法。我们详细介绍了每种模型类型的技术细节,包括其基本目标、架构构建块和算法训练步骤。对于每种模型类型,我们呈现了优化技术以及常见的失败模式和局限性。我们还讨论了视频生成的最新进展,并介绍了推动从静态帧走向高质量视频的研究工作。最后,我们讨论了这些模型的鲁棒性与负责任部署这一日益重要的议题,包括深度伪造风险、检测、伪影和水印等问题。
cs.CV / 115 / 2603.07463

SIGMAE: A Spectral-Index-Guided Foundation Model for Multispectral Remote Sensing

SIGMAE:一种基于光谱指数引导的多光谱遥感基础模型
Zhang, Xiaokang, Li, Bo, Zhou, Chufeng, Yu, Weikang, Zhang, Lefei
Abstract
Pretraining and fine-tuning have emerged as a new paradigm in remote sensing image interpretation. Among them, Masked Autoencoder (MAE)-based pretraining stands out for its strong capability to learn general feature representations via reconstructing masked image regions. However, applying MAE to multispectral remote sensing images remains challenging due to complex backgrounds, indistinct targets, and the lack of semantic guidance during masking, which hinders the learning of underlying structures and meaningful spatial-spectral features. To address this, we propose a simple yet effective approach, Spectral Index-Guided MAE (SIGMAE), for multispectral image pretraining. The core idea is to incorporate domain-specific spectral indices as prior knowledge to guide dynamic token masking toward informative regions. SIGMAE introduces Semantic Saliency-Guided Dynamic Token Masking (SSDTM), a curriculum-style strategy that quantifies each patch's semantic richness and internal heterogeneity to adaptively select the most informative tokens during training. By prioritizing semantically salient regions and progressively increasing sample difficulty, SSDTM enhances spectrally rich and structurally aware representation learning, mitigates overfitting, and reduces redundant computation compared with random masking. Extensive experiments on five widely used datasets covering various downstream tasks, including scene classification, semantic segmentation, object extraction and change detection, demonstrate that SIGMAE outperforms other pretrained geospatial foundation models. Moreover, it exhibits strong spatial-spectral reconstruction capability, even with a 90% mask ratio, and improves complex target recognition under limited labeled data. The source codes and model weights will be released at https://github.com/zxk688/SIGMAE.
Chinese Translation
预训练和微调已成为遥感图像解读的新范式。其中,基于掩码自编码器(Masked Autoencoder, MAE)的预训练因其通过重建掩盖图像区域来学习通用特征表示的强大能力而脱颖而出。然而,由于复杂的背景、不清晰的目标以及在掩盖过程中缺乏语义引导,将MAE应用于多光谱遥感图像仍然面临挑战,这阻碍了对潜在结构和有意义的空间-光谱特征的学习。为了解决这一问题,我们提出了一种简单而有效的方法——光谱指数引导的MAE(SIGMAE),用于多光谱图像的预训练。其核心思想是将特定领域的光谱指数作为先验知识,指导动态标记掩盖朝向信息丰富的区域。SIGMAE引入了语义显著性引导的动态标记掩盖(Semantic Saliency-Guided Dynamic Token Masking, SSDTM),这是一种课程式策略,通过量化每个块的语义丰富性和内部异质性,在训练过程中自适应选择最具信息量的标记。通过优先考虑语义显著区域并逐步增加样本难度,SSDTM增强了光谱丰富和结构感知的表示学习,相较于随机掩盖,减轻了过拟合并减少了冗余计算。在五个广泛使用的数据集上进行的广泛实验,涵盖了场景分类、语义分割、目标提取和变化检测等多种下游任务,证明了SIGMAE优于其他预训练的地理空间基础模型。此外,即使在90%的掩盖比例下,它也展现出强大的空间-光谱重建能力,并在有限标注数据下改善复杂目标识别。源代码和模型权重将发布在 https://github.com/zxk688/SIGMAE。
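Saliency-guided token masking in the spirit of SSDTM can be sketched as follows: score each patch (here with an arbitrary precomputed saliency, standing in for spectral-index-derived scores), mask the most informative tokens first, and fill the remaining budget randomly. The informative fraction, which would grow over training in a curriculum, is an illustrative assumption.

```python
import torch

def saliency_guided_mask(patches, saliency, mask_ratio, informative_frac):
    """patches: (B, L, D); saliency: (B, L); returns boolean mask (B, L), True = masked."""
    B, L, _ = patches.shape
    n_mask = int(L * mask_ratio)
    n_sal = int(n_mask * informative_frac)             # budget for high-saliency tokens
    order = saliency.argsort(dim=1, descending=True)   # most informative first
    mask = torch.zeros(B, L, dtype=torch.bool)
    mask.scatter_(1, order[:, :n_sal], True)           # mask top-saliency tokens
    rand = torch.rand(B, L).masked_fill(mask, -1.0)    # fill the rest at random
    mask.scatter_(1, rand.argsort(dim=1, descending=True)[:, :n_mask - n_sal], True)
    return mask

tokens = torch.randn(2, 196, 768)
sal = torch.rand(2, 196)
m = saliency_guided_mask(tokens, sal, mask_ratio=0.9, informative_frac=0.5)
print(m.shape, m.float().mean(dim=1))   # ~0.9 of tokens masked per sample
```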
cs.CV / 116 / 2603.07464

Selective Transfer Learning of Cross-Modality Distillation for Monocular 3D Object Detection

单目3D目标检测的跨模态蒸馏选择性迁移学习
Ding, Rui, Yang, Meng, Zheng, Nanning
Abstract
Monocular 3D object detection is a promising yet ill-posed task for autonomous vehicles due to the lack of accurate depth information. Cross-modality knowledge distillation could effectively transfer depth information from LiDAR to an image-based network. However, the modality gap between image and LiDAR seriously limits its accuracy. In this paper, we systematically investigate the negative transfer problem induced by the modality gap in cross-modality distillation for the first time, including not only the architecture inconsistency issue but more importantly the feature overfitting issue. We propose a selective learning approach named MonoSTL to overcome these issues, which encourages positive transfer of depth information from LiDAR while alleviating the negative transfer on the image-based network. On the one hand, we utilize similar architectures to ensure spatial alignment of features between image-based and LiDAR-based networks. On the other hand, we develop two novel distillation modules, namely Depth-Aware Selective Feature Distillation (DASFD) and Depth-Aware Selective Relation Distillation (DASRD), which selectively learn positive features and relationships of objects by integrating depth uncertainty into feature and relation distillations, respectively. Our approach can be seamlessly integrated into various CNN-based and DETR-based models, where we take three recent models on KITTI and a recent model on NuScenes for validation. Extensive experiments show that our approach considerably improves the accuracy of the base models and thereby achieves the best accuracy compared with all recently released SOTA models.
Chinese Translation
单目3D目标检测由于缺乏准确的深度信息,对于自动驾驶车辆来说是一项前景广阔但又难以解决的任务。跨模态知识蒸馏能够有效地将深度信息从激光雷达(LiDAR)转移到基于图像的网络。然而,图像与激光雷达之间的模态差距严重限制了其准确性。本文首次系统地研究了跨模态蒸馏中由模态差距引起的负迁移问题,包括架构不一致性问题以及更重要的特征过拟合问题。我们提出了一种名为MonoSTL的选择性学习方法,以克服这些问题,该方法鼓励从激光雷达向图像基网络的深度信息正向迁移,同时减轻负迁移。一方面,我们利用相似的架构确保图像基和激光雷达基网络之间特征的空间对齐。另一方面,我们开发了两个新颖的蒸馏模块,即深度感知选择性特征蒸馏(Depth-Aware Selective Feature Distillation, DASFD)和深度感知选择性关系蒸馏(Depth-Aware Selective Relation Distillation, DASRD),分别通过将深度不确定性融入特征和关系蒸馏中,选择性地学习对象的正向特征和关系。我们的方法可以无缝集成到各种基于卷积神经网络(CNN)和基于DETR的模型中,我们在KITTI上的三个最新模型和NuScenes上的一个最新模型进行了验证。大量实验表明,我们的方法显著提高了基础模型的准确性,从而在与所有最近发布的最先进模型(SOTA)比较中达到了最佳准确性。
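Integrating depth uncertainty into feature distillation, as DASFD does, amounts to down-weighting the distillation loss wherever depth is too uncertain to transfer reliably. The exponential weighting rule in this sketch is one common choice, assumed here for illustration rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def selective_feature_distill(student_feat, teacher_feat, depth_logvar):
    """student/teacher_feat: (B, C, H, W); depth_logvar: (B, 1, H, W)."""
    w = torch.exp(-depth_logvar)                 # confident depth -> larger weight
    w = w / (w.mean() + 1e-8)                    # normalize to keep the loss scale stable
    per_pixel = F.mse_loss(student_feat, teacher_feat,
                           reduction="none").mean(dim=1, keepdim=True)
    return (w * per_pixel).mean()

s = torch.randn(2, 64, 32, 32)   # camera (student) BEV features
t = torch.randn(2, 64, 32, 32)   # LiDAR (teacher) BEV features
logvar = torch.randn(2, 1, 32, 32)
print(selective_feature_distill(s, t, logvar))
```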
cs.CV / 117 / 2603.07465

Classifying Novel 3D-Printed Objects without Retraining: Towards Post-Production Automation in Additive Manufacturing

无需重新训练的新型3D打印物体分类:迈向增材制造中的后期生产自动化
Mathioulakis, Fanis, Radevski, Gorjan, Cleuren, Silke GC, Janssens, Michel, Das, Brecht, Schauwaert, Koen, Tuytelaars, Tinne
Abstract
Reliable classification of 3D-printed objects is essential for automating post-production workflows in industrial additive manufacturing. Despite extensive automation in other stages of the printing pipeline, this task still relies heavily on manual inspection, as the set of objects to be classified can change daily, making frequent model retraining impractical. Automating the identification step is therefore critical for improving operational efficiency. A vision model that could classify any set of objects by utilizing their corresponding CAD models and avoiding retraining would be highly beneficial in this setting. To enable systematic evaluation of vision models on this task, we introduce ThingiPrint, a new publicly available dataset that pairs CAD models with real photographs of their 3D-printed counterparts. Using ThingiPrint, we benchmark a range of existing vision models on the task of 3D-printed object classification. We additionally show that contrastive fine-tuning with a rotation-invariant objective allows effective prototype-based classification of previously unseen 3D-printed objects. By relying solely on the available CAD models, this avoids the need for retraining when new objects are introduced. Experiments show that this approach outperforms standard pretrained baselines, suggesting improved generalization and practical relevance for real-world use.
Chinese Translation
可靠的3D打印物体分类对于工业增材制造中的后期生产工作流程自动化至关重要。尽管在打印流程的其他阶段已经实现了广泛的自动化,但这一任务仍然严重依赖人工检查,因为待分类的物体集合可能每天都在变化,使得频繁的模型重新训练变得不切实际。因此,自动化识别步骤对于提高操作效率至关重要。在这种情况下,能够利用相应CAD模型对任意物体集合进行分类而无需重新训练的视觉模型将极具价值。为了系统性地评估视觉模型在这一任务上的表现,我们推出了ThingiPrint,一个新的公开可用数据集,该数据集将CAD模型与其3D打印对应物的真实照片配对。利用ThingiPrint,我们对一系列现有视觉模型在3D打印物体分类任务上进行了基准测试。我们还展示了使用旋转不变目标进行对比微调,可以有效地对以前未见过的3D打印物体进行基于原型的分类。通过仅依赖可用的CAD模型,这种方法避免了在引入新物体时需要重新训练。实验表明,这种方法优于标准的预训练基线,暗示了在实际应用中更好的泛化能力和实用性。
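The retraining-free recipe described above can be sketched as prototype matching: embed multi-view renders of each CAD model with a frozen encoder, average them into a class prototype, and classify a photo by nearest prototype. New objects only require extending the prototype set. The random embeddings below stand in for any frozen vision backbone.

```python
import numpy as np

def build_prototypes(render_embs):
    """render_embs: dict class_name -> (num_views, D) embeddings of CAD renders."""
    protos = {}
    for name, embs in render_embs.items():
        p = embs.mean(axis=0)
        protos[name] = p / (np.linalg.norm(p) + 1e-8)
    return protos

def classify(photo_emb, protos):
    photo_emb = photo_emb / (np.linalg.norm(photo_emb) + 1e-8)
    scores = {name: float(photo_emb @ p) for name, p in protos.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
renders = {"bracket": rng.normal(size=(12, 128)), "gear": rng.normal(size=(12, 128))}
protos = build_prototypes(renders)
query = renders["gear"].mean(axis=0) + 0.1 * rng.normal(size=128)  # noisy photo embedding
print(classify(query, protos)[0])   # -> 'gear'
```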
cs.CV / 118 / 2603.07468

FedEU: Evidential Uncertainty-Driven Federated Fine-Tuning of Vision Foundation Models for Remote Sensing Image Segmentation

FedEU:基于证据不确定性驱动的联邦微调视觉基础模型用于遥感图像分割
Zhang, Xiaokang, Xiong, Xuran, Huang, Jianzhong, Zhang, Lefei
Abstract
Remote sensing image segmentation (RSIS) in federated environments has gained increasing attention because it enables collaborative model training across distributed datasets without sharing raw imagery or annotations. Federated RSIS combined with parameter-efficient fine-tuning (PEFT) can unleash the generalization power of pretrained foundation models for real-world applications, with minimal parameter aggregation and communication overhead. However, the dynamic adaptation of pretrained models to heterogeneous client data inevitably increases update uncertainty and compromises the reliability of collaborative optimization due to the lack of uncertainty estimation for each local model. To bridge this gap, we present FedEU, a federated optimization framework for fine-tuning RSIS models driven by evidential uncertainty. Specifically, personalized evidential uncertainty modeling is introduced to quantify epistemic variations of local models and identify high-risk areas under local data distributions. Furthermore, the client-specific feature embedding (CFE) is exploited to enhance channel-aware feature representation while preserving client-specific properties through personalized attention and an element-aware parameter update approach. These uncertainty estimates are uploaded to the server to enable adaptive global aggregation via a Top-k uncertainty-guided weighting (TUW) strategy, which mitigates the impact of distribution shifts and unreliable updates. Extensive experiments on three large-scale heterogeneous datasets demonstrate the superior performance of FedEU. More importantly, FedEU enables balanced model adaptation across diverse clients by explicitly reducing prediction uncertainty, resulting in more robust and reliable federated outcomes. The source codes will be available at https://github.com/zxk688/FedEU.
Chinese Translation
在联邦环境下,遥感图像分割(RSIS)越来越受到关注,因为它能够在不共享原始图像或注释的情况下,实现跨分布式数据集的协作模型训练。结合参数高效微调(PEFT)的联邦RSIS可以释放预训练基础模型在现实世界应用中的泛化能力,同时最小化参数聚合和通信开销。然而,预训练模型对异构客户端数据的动态适应不可避免地增加了更新的不确定性,并由于缺乏对每个本地模型的不确定性估计而损害了协作优化的可靠性。为了解决这一问题,我们提出了FedEU,这是一个基于证据不确定性驱动的联邦优化框架,用于微调RSIS模型。具体而言,引入个性化证据不确定性建模,以量化本地模型的认知变异并识别在本地数据分布下的高风险区域。此外,利用客户端特定特征嵌入(CFE)来增强通道感知特征表示,同时通过个性化注意力和元素感知参数更新方法保留客户端特定属性。这些不确定性估计被上传到服务器,以便通过Top-k不确定性引导加权(TUW)策略实现自适应全局聚合,从而减轻分布变化和不可靠更新的影响。在三个大规模异构数据集上的广泛实验证明了FedEU的优越性能。更重要的是,FedEU通过明确减少预测不确定性,使得在不同客户端之间实现平衡的模型适应,从而产生更稳健和可靠的联邦结果。源代码将发布在 https://github.com/zxk688/FedEU。
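A minimal sketch of Top-k uncertainty-guided aggregation: each client uploads its fine-tuned adapter update plus an uncertainty estimate (reduced here to a scalar for simplicity); the server keeps only the k most reliable clients and weights them inversely to their uncertainty. The exact scoring and weighting are the paper's; this is an assumed simplification.

```python
import numpy as np

def tuw_aggregate(client_params, client_uncertainty, k):
    """client_params: (M, P) flattened updates; client_uncertainty: (M,), lower = better."""
    keep = np.argsort(client_uncertainty)[:k]            # Top-k most certain clients
    inv = 1.0 / (client_uncertainty[keep] + 1e-8)
    w = inv / inv.sum()                                  # inverse-uncertainty weights
    return (w[:, None] * client_params[keep]).sum(axis=0)

params = np.random.randn(5, 1000)                        # 5 clients, 1000 adapter params
unc = np.array([0.2, 1.5, 0.3, 0.9, 0.25])
agg = tuw_aggregate(params, unc, k=3)                    # keeps clients 0, 4, 2
print(agg.shape)
```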
cs.CV / 119 / 2603.07476

EVLF: Early Vision-Language Fusion for Generative Dataset Distillation

EVLF:用于生成数据集蒸馏的早期视觉-语言融合
Cai, Wenqi, Zou, Yawen, Li, Guang, Gu, Chunzhi, Zhang, Chao
Abstract
Dataset distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with significantly fewer samples. Recent diffusion-based DD methods commonly introduce semantic guidance through late-stage cross-attention, where textual prompts tend to dominate the generative process. Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision-Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. By incorporating a lightweight cross-attention module at this transition, the early representations simultaneously encode local textures and global semantic directions across the denoising process. Importantly, EVLF is plug-and-play and can be easily integrated into any diffusion-based dataset distillation pipeline with an encoder. It works across different denoiser architectures and sampling schedules without any task-specific modifications. Extensive experiments demonstrate that EVLF generates semantically faithful and visually coherent synthetic data, yielding consistent improvements in downstream classification accuracy across varied settings. Source code is available at https://github.com/wenqi-cai297/earlyfusion-for-dd/.
Chinese Translation
数据集蒸馏(Dataset Distillation, DD)旨在合成紧凑的训练集,使模型能够以显著更少的样本实现高准确率。近期基于扩散的DD方法通常通过后期交叉注意力引入语义指导,其中文本提示往往主导生成过程。尽管这一策略强化了标签相关性,但却削弱了视觉潜变量的贡献,导致生成的样本过度修正,反映出提示模式而非内在视觉特征。为了解决这一问题,我们提出了一种早期视觉-语言融合(Early Vision-Language Fusion, EVLF)方法,该方法在编码器与生成主干之间的过渡阶段对文本和视觉嵌入进行对齐。通过在这一过渡阶段引入轻量级的交叉注意力模块,早期表示能够在去噪过程中同时编码局部纹理和全局语义方向。重要的是,EVLF具有即插即用的特性,可以轻松集成到任何具有编码器的基于扩散的数据集蒸馏管道中。它适用于不同的去噪器架构和采样计划,而无需任何特定任务的修改。大量实验表明,EVLF生成的合成数据在语义上忠实且视觉上连贯,在各种设置中均能显著提高下游分类准确率。源代码可在 https://github.com/wenqi-cai297/earlyfusion-for-dd/ 获取。
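The early-fusion idea is architecturally simple: a lightweight cross-attention block between the encoder and the generative backbone lets visual latents attend to text embeddings before denoising begins. Dimensions, normalization, and placement in this sketch are illustrative, not EVLF's exact module.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    def __init__(self, dim: int, text_dim: int, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(text_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, text_emb):
        """vis_tokens: (B, N, dim) latents; text_emb: (B, T, text_dim) prompt embeddings."""
        txt = self.proj(text_emb)
        fused, _ = self.attn(query=vis_tokens, key=txt, value=txt)
        return self.norm(vis_tokens + fused)   # residual keeps visual latents dominant

vis = torch.randn(2, 256, 512)
txt = torch.randn(2, 77, 768)
print(EarlyFusion(512, 768)(vis, txt).shape)   # torch.Size([2, 256, 512])
```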
cs.CV / 120 / 2603.07486

Multi-Modal Decouple and Recouple Network for Robust 3D Object Detection

用于鲁棒3D目标检测的多模态解耦与重耦网络
Ding, Rui, Kuang, Zhaonian, Ji, Yuzhe, Yang, Meng, Zheng, Xinhu, Hua, Gang
Abstract
Multi-modal 3D object detection with bird's eye view (BEV) has achieved desired advances on benchmarks. Nonetheless, the accuracy may drop significantly in the real world due to data corruption such as sensor configurations for LiDAR and scene conditions for camera. One design bottleneck of previous models resides in the tight coupling of multi-modal BEV features during fusion, which may degrade the overall system performance if one modality or both is corrupted. To mitigate this, we propose a Multi-Modal Decouple and Recouple Network for robust 3D object detection under data corruption. Different modalities commonly share some high-level invariant features. We observe that these invariant features across modalities do not always fail simultaneously, because different types of data corruption affect each modality in distinct ways. These invariant features can be recovered across modalities for robust fusion under data corruption. To this end, we explicitly decouple camera/LiDAR BEV features into modality-invariant and modality-specific parts. This allows invariant features to compensate each other while mitigating the negative impact of a corrupted modality on the other. We then recouple these features into three experts to handle different types of data corruption, respectively, i.e., LiDAR, camera, and both. For each expert, we use modality-invariant features as robust information, while modality-specific features serve as a complement. Finally, we adaptively fuse the three experts to extract robust features for 3D object detection. For validation, we collect a benchmark with a large quantity of data corruption for LiDAR, camera, and both based on nuScenes. Our model is trained on clean nuScenes and tested on all types of data corruption. Our model consistently achieves the best accuracy on both corrupted and clean data compared to recent models.
Chinese Translation
基于鸟瞰图(BEV)的多模态3D目标检测在基准测试中取得了显著进展。然而,由于数据损坏,例如激光雷达的传感器配置和摄像头的场景条件,实际应用中的准确性可能会显著下降。之前模型的一个设计瓶颈在于多模态BEV特征在融合过程中紧密耦合,如果某一模态或两者都受到损坏,可能会降低整体系统性能。为了解决这个问题,我们提出了一种用于鲁棒3D目标检测的多模态解耦与重耦网络,以应对数据损坏。不同模态通常共享一些高层次的不变特征。我们观察到,这些跨模态的不变特征并不总是同时失效,因为不同类型的数据损坏以不同方式影响每个模态。这些不变特征可以在模态之间恢复,从而在数据损坏下实现鲁棒融合。为此,我们明确地将摄像头/激光雷达BEV特征解耦为模态不变部分和模态特定部分。这使得不变特征能够相互补偿,同时减轻受损模态对其他模态的负面影响。然后,我们将这些特征重耦为三个专家,分别处理不同类型的数据损坏,即激光雷达、摄像头和两者。对于每个专家,我们使用模态不变特征作为鲁棒信息,而模态特定特征则作为补充。最后,我们自适应地融合这三个专家,以提取用于3D目标检测的鲁棒特征。为了验证,我们基于nuScenes收集了一个包含大量数据损坏的基准数据集,涵盖激光雷达、摄像头及两者。我们的模型在干净的nuScenes上进行训练,并在所有类型的数据损坏上进行测试。与最近的模型相比,我们的模型在受损数据和干净数据上始终实现最佳准确性。
cs.CV / 121 / 2603.07489

RobustSCI: Beyond Reconstruction to Restoration for Snapshot Compressive Imaging under Real-World Degradations

RobustSCI:超越重建,实现真实世界退化下快照压缩成像的恢复
Wang, Hao, Li, Yuanfan, Zhou, Qi, Xu, Zhankuo, Ni, Jiong, Yuan, Xin
Abstract
Deep learning algorithms for video Snapshot Compressive Imaging (SCI) have achieved great success, yet they predominantly focus on reconstructing from clean measurements. This overlooks a critical real-world challenge: the captured signal itself is often severely degraded by motion blur and low light. Consequently, existing models falter in practical applications. To break this limitation, we pioneer the first study on robust video SCI restoration, shifting the goal from "reconstruction" to "restoration"--recovering the underlying pristine scene from a degraded measurement. To facilitate this new task, we first construct a large-scale benchmark by simulating realistic, continuous degradations on the DAVIS 2017 dataset. Second, we propose RobustSCI, a network that enhances a strong encoder-decoder backbone with a novel RobustCFormer block. This block introduces two parallel branches--a multi-scale deblur branch and a frequency enhancement branch--to explicitly disentangle and remove degradations during the recovery process. Furthermore, we introduce RobustSCI-C (RobustSCI-Cascade), which integrates a pre-trained Lightweight Post-processing Deblurring Network to significantly boost restoration performance with minimal overhead. Extensive experiments demonstrate that our methods outperform all SOTA models on the new degraded testbeds, with additional validation on real-world degraded SCI data confirming their practical effectiveness, elevating SCI from merely reconstructing what is captured to restoring what truly happened.
Chinese Translation
视频快照压缩成像(Snapshot Compressive Imaging, SCI)的深度学习算法取得了巨大的成功,然而它们主要集中在从干净的测量数据中进行重建。这忽视了一个关键的现实挑战:捕获的信号往往受到运动模糊和低光照的严重影响。因此,现有模型在实际应用中表现不佳。为了打破这一限制,我们首次开展了关于鲁棒视频SCI恢复的研究,将目标从“重建”转变为“恢复”——从退化的测量中恢复出原始的场景。为促进这一新任务,我们首先通过在DAVIS 2017数据集上模拟现实的连续退化,构建了一个大规模基准。其次,我们提出了RobustSCI,一个增强了强大编码器-解码器骨干网络的新型网络,配备了新颖的RobustCFormer模块。该模块引入了两个并行分支——一个多尺度去模糊分支和一个频率增强分支——以明确地解耦和去除恢复过程中的退化。此外,我们引入了RobustSCI-C(RobustSCI-Cascade),它集成了一个预训练的轻量级后处理去模糊网络,以在最小开销下显著提升恢复性能。大量实验表明,我们的方法在新的退化测试集上超越了所有最先进的模型,并且在真实世界退化的SCI数据上的额外验证确认了它们的实际有效性,将SCI的目标从仅仅重建捕获的内容提升到恢复真实发生的事件。
cs.CV / 122 / 2603.07493

RayD3D: Distilling Depth Knowledge Along the Ray for Robust Multi-View 3D Object Detection

RayD3D:沿光线提炼深度知识以增强多视角三维物体检测的鲁棒性
Ding, Rui, Kuang, Zhaonian, Zhou, Zongwei, Yang, Meng, Zheng, Xinhu, Hua, Gang
Abstract
Multi-view 3D detection with bird's eye view (BEV) is crucial for autonomous driving and robotics, but its robustness in the real world is limited as it struggles to predict accurate depth values. A mainstream solution, cross-modal distillation, transfers depth information from LiDAR to camera models but also unintentionally transfers depth-irrelevant information (e.g. LiDAR density). To mitigate this issue, we propose RayD3D, which transfers crucial depth knowledge along the ray: a line projecting from the camera to the true location of an object. It is based on the fundamental imaging principle that the predicted location of this object can only vary along this ray, which is finally determined by the predicted depth value. Therefore, distilling along the ray enables more effective depth information transfer. More specifically, we design two ray-based distillation modules. Ray-based Contrastive Distillation (RCD) incorporates contrastive learning into distillation by sampling along the ray to learn how LiDAR accurately locates objects. Ray-based Weighted Distillation (RWD) adaptively adjusts distillation weight based on the ray to minimize the interference of depth-irrelevant information in LiDAR. For validation, we widely apply RayD3D into three representative types of BEV-based models, including BEVDet, BEVDepth4D, and BEVFormer. Our method is trained on clean NuScenes, and tested on both clean NuScenes and RoboBEV with a variety types of data corruptions. Our method significantly improves the robustness of all the three base models in all scenarios without increasing inference costs, and achieves the best when compared to recently released multi-view and distillation models.
Chinese Translation
基于鸟瞰图(BEV)的多视角三维检测对于自动驾驶和机器人技术至关重要,但其在现实世界中的鲁棒性受到限制,因为它难以预测准确的深度值。一种主流解决方案是跨模态蒸馏,它将深度信息从激光雷达(LiDAR)转移到相机模型,但也无意中转移了与深度无关的信息(例如,LiDAR密度)。为了解决这个问题,我们提出了RayD3D,它沿光线转移关键的深度知识:一条从相机投射到物体真实位置的线。该方法基于基本成像原理,即该物体的预测位置只能沿着这条光线变化,最终由预测的深度值决定。因此,沿光线进行蒸馏可以更有效地转移深度信息。更具体地,我们设计了两个基于光线的蒸馏模块。基于光线的对比蒸馏(RCD)通过沿光线采样将对比学习融入蒸馏,以学习LiDAR如何准确定位物体。基于光线的加权蒸馏(RWD)根据光线自适应调整蒸馏权重,以最小化LiDAR中与深度无关信息的干扰。为了验证,我们将RayD3D广泛应用于三种代表性的基于BEV的模型,包括BEVDet、BEVDepth4D和BEVFormer。我们的方法在干净的NuScenes数据集上进行训练,并在干净的NuScenes和RoboBEV上进行测试,涵盖多种数据损坏类型。我们的方法显著提高了所有三种基础模型在所有场景下的鲁棒性,而不增加推理成本,并且在与最近发布的多视角和蒸馏模型的比较中表现最佳。
cs.CV / 123 / 2603.07494

DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding

DocCogito:对齐布局认知与逐步基础推理以实现文档理解
Wu, Yuchuan, Zhuo, Minghan, Fu, Teng, Zhao, Mengyang, Li, Bin, Xue, Xiangyang
Abstract
Document understanding with multimodal large language models (MLLMs) requires not only accurate answers but also explicit, evidence-grounded reasoning, especially in high-stakes scenarios. However, current document MLLMs still fall short of forming a complete, human-like reasoning process, because even when they improve both layout encoding and CoT-style prompting, the interaction between the two is typically learned implicitly and remains loosely coupled rather than being enforced as a systematic mechanism. So we propose DocCogito, a unified framework that integrates global layout perception with structured, region-grounded reasoning. DocCogito introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain (VSC)-a concise structured representation less ambiguous than free-form natural-language CoT-to supervise fine-grained intermediate reasoning aligned with evidence regions. Training follows a progressive recipe, including layout perception pretraining, VSC-guided cold start, rejection sampling, and GRPO. To further strengthen the internal coupling between layout priors and VSC execution, we augment standard rewards with a fine-grained region-confidence signal that encourages reasoning traces to stay aligned with corresponding evidence regions. Extensive experiments on six benchmarks (DocVQA, WTQ, ChartQA, TextVQA, OCRBench, and InfoVQA) demonstrate strong generalization, achieving state-of-the-art results on four benchmarks.
Chinese Translation
使用多模态大型语言模型(MLLMs)进行文档理解不仅需要准确的答案,还需要明确的、基于证据的推理,尤其是在高风险场景中。然而,目前的文档MLLMs在形成完整的人类般推理过程中仍然存在不足,因为即使它们在布局编码和链式思维(CoT)风格提示方面有所改进,两者之间的交互通常是隐式学习的,并且保持松散耦合,而不是作为系统机制强制执行。因此,我们提出了DocCogito,一个统一框架,将全局布局感知与结构化、区域基础的推理相结合。DocCogito引入了一种轻量级布局塔,将页面结构提炼为可学习的全局布局先验标记,以及一种确定性的视觉-语义链(Visual-Semantic Chain, VSC)——一种比自由形式自然语言链式思维更简洁的结构化表示,用于监督与证据区域对齐的细粒度中间推理。训练遵循渐进式流程,包括布局感知预训练、VSC引导的冷启动、拒绝采样和GRPO。为了进一步增强布局先验与VSC执行之间的内部耦合,我们用细粒度区域置信信号增强标准奖励,鼓励推理轨迹与相应的证据区域保持一致。在六个基准(DocVQA、WTQ、ChartQA、TextVQA、OCRBench和InfoVQA)上的大量实验表明,DocCogito具有强大的泛化能力,在四个基准上实现了最先进的结果。
cs.CV / 124 / 2603.07497

AMR-CCR: Anchored Modular Retrieval for Continual Chinese Character Recognition

AMR-CCR:锚定模块化检索用于持续中文字符识别
Wu, Yuchuan, Zhu, Yinglian, Yu, Haiyang, Niu, Ke, Li, Bin, Xue, Xiangyang
Abstract
Ancient Chinese character recognition is a core capability for cultural heritage digitization, yet real-world workflows are inherently non-stationary: newly excavated materials are continuously onboarded, bringing new classes in different scripts, and expanding the class space over time. We formalize this process as Continual Chinese Character Recognition (Continual CCR), a script-staged, class-incremental setting that couples two challenges: (i) scalable learning under continual class growth with subtle inter-class differences and scarce incremental data, and (ii) pronounced intra-class diversity caused by writing-style variations across writers and carrier conditions. To overcome the limitations of conventional closed-set classification, we propose AMR-CCR, an anchored modular retrieval framework that performs recognition via embedding-based dictionary matching in a shared multimodal space, allowing new classes to be added by simply extending the dictionary. AMR-CCR further introduces a lightweight script-conditioned injection module (SIA+SAR) to calibrate newly onboarded scripts while preserving cross-stage embedding compatibility, and an image-derived multi-prototype dictionary that clusters within-class embeddings to better cover diverse style modes. To support systematic evaluation, we build EvoCON, a six-stage benchmark for continual script onboarding, covering six scripts (OBC, BI, SS, SAC, WSC, CS), augmented with meaning/shape descriptions and an explicit zero-shot split for unseen characters without image exemplars.
Chinese Translation
古代中文字符识别是文化遗产数字化的核心能力,但现实工作流程本质上是非平稳的:新挖掘的材料不断被引入,带来了不同书写体的新类别,并随着时间的推移扩展类别空间。我们将这一过程形式化为持续中文字符识别(Continual CCR),这是一种分阶段的、类别增量的设置,结合了两个挑战:(i)在类别持续增长的情况下进行可扩展学习,面临微妙的类别间差异和稀缺的增量数据,以及(ii)由于书写风格的变化和载体条件的不同而导致的显著类别内多样性。为了克服传统闭集分类的局限性,我们提出了AMR-CCR,一种锚定模块化检索框架,通过在共享的多模态空间中进行基于嵌入的字典匹配来执行识别,允许通过简单扩展字典来添加新类别。AMR-CCR进一步引入了一种轻量级的书写体条件注入模块(SIA+SAR),以校准新引入的书写体,同时保持跨阶段嵌入的兼容性,并且构建了一个基于图像的多原型字典,将类别内嵌入进行聚类,以更好地覆盖多样的风格模式。为了支持系统评估,我们构建了EvoCON,这是一个六阶段的持续书写体引入基准,涵盖六种书写体(OBC, BI, SS, SAC, WSC, CS),并增强了意义/形状描述和针对未见字符的显式零样本划分,未见字符没有图像示例。
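The dictionary-based recognition and the multi-prototype idea can be sketched together: cluster each class's embeddings into several style prototypes, then recognize a query by its nearest prototype across the whole dictionary; new classes are added by extending the dictionary only. The k-means clustering and prototype count below are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(class_embs, protos_per_class=3):
    """class_embs: dict label -> (N, D) embeddings; returns (labels, (K, D) prototypes)."""
    labels, protos = [], []
    for label, embs in class_embs.items():
        k = min(protos_per_class, len(embs))
        centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embs).cluster_centers_
        protos.append(centers)               # several style modes per character class
        labels += [label] * k
    return labels, np.vstack(protos)

def recognize(query, labels, protos):
    d = np.linalg.norm(protos - query, axis=1)
    return labels[int(d.argmin())]

rng = np.random.default_rng(1)
embs = {ch: rng.normal(loc=i, size=(20, 32)) for i, ch in enumerate("甲乙丙")}
labels, protos = build_dictionary(embs)
print(recognize(embs["乙"][0], labels, protos))   # -> '乙'
```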
cs.CV / 125 / 2603.07504

High-Fidelity Medical Shape Generation via Skeletal Latent Diffusion

通过骨架潜在扩散生成高保真医学形状
Zhang, Guoqing, Yang, Jingyun, Chen, Siqi, Zhang, Anping, Li, Yang
Abstract
Anatomy shape modeling is a fundamental problem in medical data analysis. However, the geometric complexity and topological variability of anatomical structures pose significant challenges to accurate anatomical shape generation. In this work, we propose a skeletal latent diffusion framework that explicitly incorporates structural priors for efficient and high-fidelity medical shape generation. We introduce a shape auto-encoder in which the encoder captures global geometric information through a differentiable skeletonization module and aggregates local surface features into shape latents, while the decoder predicts the corresponding implicit fields over sparsely sampled coordinates. New shapes are generated via a latent-space diffusion model, followed by neural implicit decoding and mesh extraction. To address the limited availability of medical shape data, we construct a large-scale dataset, \textit{MedSDF}, comprising surface point clouds and corresponding signed distance fields across multiple anatomical categories. Extensive experiments on MedSDF and vessel datasets demonstrate that the proposed method achieves superior reconstruction and generation quality while maintaining a higher computational efficiency compared with existing approaches. Code is available at: https://github.com/wlsdzyzl/meshage.
Chinese Translation
解剖形状建模是医学数据分析中的一个基本问题。然而,解剖结构的几何复杂性和拓扑变异性对准确的解剖形状生成构成了重大挑战。在本研究中,我们提出了一种骨架潜在扩散框架,该框架明确地结合了结构先验,以实现高效且高保真的医学形状生成。我们引入了一种形状自编码器,其中编码器通过可微分的骨架化模块捕获全局几何信息,并将局部表面特征聚合到形状潜变量中,而解码器则在稀疏采样坐标上预测相应的隐式场。新形状是通过潜空间扩散模型生成的,随后进行神经隐式解码和网格提取。为了应对医学形状数据的有限可用性,我们构建了一个大规模数据集MedSDF,该数据集包含多个解剖类别的表面点云和相应的有符号距离场。在MedSDF和血管数据集上的大量实验表明,所提出的方法在重建和生成质量上优于现有方法,同时保持更高的计算效率。代码可在以下链接获取:https://github.com/wlsdzyzl/meshage。
cs.CV / 126 / 2603.07515

EvolveReason: Self-Evolving Reasoning Paradigm for Explainable Deepfake Facial Image Identification

EvolveReason:用于可解释深度伪造面部图像识别的自我演化推理范式
Zhou, Binjia, Luo, Dawei, Chen, Shuai, Xu, Feng, Seow, Li, Haoyuan, Wang, Jiachi, Wang, Jiawen, Feng, Zunlei, Bei, Yijun
Abstract
With the rapid advancement of AIGC technology, developing identification methods to address the security challenges posed by deepfakes has become urgent. Face forgery identification techniques can be categorized into two types: traditional classification methods and explainable VLM approaches. The former provides classification results but lacks explanatory ability, while the latter, although capable of providing coarse-grained explanations, often suffers from hallucinations and insufficient detail. To overcome these limitations, we propose EvolveReason, which mimics the reasoning and observational processes of human auditors when identifying face forgeries. By constructing a chain-of-thought dataset, CoT-Face, tailored for advanced VLMs, our approach guides the model to think in a human-like way, prompting it to output reasoning processes and judgment results. This provides practitioners with reliable analysis and helps alleviate hallucination. Additionally, our framework incorporates a forgery latent-space distribution capture module, enabling EvolveReason to identify high-frequency forgery cues difficult to extract from the original images. To further enhance the reliability of textual explanations, we introduce a self-evolution exploration strategy, leveraging reinforcement learning to allow the model to iteratively explore and optimize its textual descriptions in a two-stage process. Experimental results show that EvolveReason not only outperforms the current state-of-the-art methods in identification performance but also accurately identifies forgery details and demonstrates generalization capabilities.
Chinese Translation
随着AIGC技术的快速发展,开发识别方法以应对深度伪造带来的安全挑战变得迫在眉睫。面部伪造识别技术可以分为两类:传统分类方法和可解释的VLM(视觉语言模型)方法。前者提供分类结果但缺乏解释能力,而后者虽然能够提供粗粒度的解释,但常常受到幻觉和细节不足的困扰。为了克服这些局限性,我们提出了EvolveReason,它模拟人类审计员在识别面部伪造时的推理和观察过程。通过构建一个针对先进VLM的思维链数据集CoT-Face,我们的方法引导模型以类人方式思考,促使其输出推理过程和判断结果。这为从业者提供了可靠的分析,并有助于减轻幻觉。此外,我们的框架还结合了伪造潜在空间分布捕获模块,使EvolveReason能够识别难以从原始图像中提取的高频伪造线索。为了进一步增强文本解释的可靠性,我们引入了一种自我演化探索策略,利用强化学习使模型能够在两个阶段的过程中迭代探索和优化其文本描述。实验结果表明,EvolveReason不仅在识别性能上超越了当前最先进的方法,还能准确识别伪造细节,并展现出良好的泛化能力。
cs.CV / 127 / 2603.07521

SketchGraphNet: A Memory-Efficient Hybrid Graph Transformer for Large-Scale Sketch Corpora Recognition

SketchGraphNet:一种用于大规模草图语料库识别的内存高效混合图变换器
Chen, Shilong, Li, Mingyuan, Wang, Zhaoyang, Ye, Zhonglin, Zhao, Haixing
Abstract
This work investigates large-scale sketch recognition from a graph-native perspective, where free-hand sketches are directly modeled as structured graphs rather than raster images or stroke sequences. We propose SketchGraphNet, a hybrid graph neural architecture that integrates local message passing with a memory-efficient global attention mechanism, without relying on auxiliary positional or structural encodings. To support systematic evaluation, we construct SketchGraph, a large-scale benchmark comprising 3.44 million graph-structured sketches across 344 categories, with two variants (A and R) to reflect different noise conditions. Each sketch is represented as a spatiotemporal graph with normalized stroke-order attributes. On SketchGraph-A and SketchGraph-R, SketchGraphNet achieves Top-1 accuracies of 83.62% and 87.61%, respectively, under a unified training configuration. MemEffAttn further reduces peak GPU memory by over 40% and training time by more than 30% compared with Performer-based global attention, while maintaining comparable accuracy.
Chinese Translation
本研究从图本体的角度探讨大规模草图识别,其中自由手绘草图被直接建模为结构化图,而不是光栅图像或笔画序列。我们提出了SketchGraphNet,一种混合图神经网络架构,结合了局部消息传递和内存高效的全局注意机制,而不依赖于辅助的位置信息或结构编码。为了支持系统评估,我们构建了SketchGraph,这是一个大型基准数据集,包含344个类别的344万幅图结构草图,具有两种变体(A和R),以反映不同的噪声条件。每幅草图被表示为具有归一化笔画顺序属性的时空图。在SketchGraph-A和SketchGraph-R上,SketchGraphNet在统一训练配置下分别达到了83.62%和87.61%的Top-1准确率。与基于Performer的全局注意机制相比,MemEffAttn进一步将峰值GPU内存减少了40%以上,训练时间减少了30%以上,同时保持了相当的准确性。
cs.CV / 128 / 2603.07535

Scale-Aware UAV-to-Satellite Cross-View Geo-Localization: A Semantic Geometric Approach

尺度感知的无人机与卫星交叉视图地理定位:一种语义几何方法
Ye, Yibin, Chen, Shuo, Wang, Kun, Song, Xiaokai, Dang, Jisheng, Yu, Qifeng, Teng, Xichao, Li, Zhang
Abstract
Cross-View Geo-Localization (CVGL) between UAV imagery and satellite images plays a crucial role in target localization and UAV self-positioning. However, most existing methods rely on the idealized assumption of scale consistency between UAV queries and satellite galleries, overlooking the severe scale ambiguity commonly encountered in real-world scenarios. This discrepancy leads to field-of-view misalignment and feature mismatch, significantly degrading CVGL robustness. To address this issue, we propose a geometric framework that recovers the absolute metric scale from monocular UAV images using semantic anchors. Specifically, small vehicles (SVs), characterized by relatively stable prior size distributions and high detectability, are exploited as metric references. A Decoupled Stereoscopic Projection Model is introduced to estimate the absolute image scale from these semantic targets. By decomposing vehicle dimensions into radial and tangential components, the model compensates for perspective distortions in 2D detections of 3D vehicles, enabling more accurate scale estimation. To further reduce intra-class size variation and detection noise, a dual-dimension fusion strategy with Interquartile Range (IQR)-based robust aggregation is employed. The estimated global scale is then used as a physical constraint for scale-adaptive satellite image cropping, improving UAV-to-satellite feature alignment. Experiments on augmented DenseUAV and UAV-VisLoc datasets demonstrate that the proposed method significantly improves CVGL robustness under unknown UAV image scales. Additionally, the framework shows strong potential for downstream applications such as passive UAV altitude estimation and 3D model scale recovery.
Chinese Translation
无人机影像与卫星图像之间的交叉视图地理定位(CVGL)在目标定位和无人机自我定位中发挥着至关重要的作用。然而,大多数现有方法依赖于无人机查询与卫星图库之间尺度一致性的理想化假设,忽视了在实际场景中常见的严重尺度模糊。这种差异导致视场错位和特征不匹配,显著降低了CVGL的鲁棒性。为了解决这一问题,我们提出了一种几何框架,通过语义锚点从单目无人机图像中恢复绝对度量尺度。具体而言,利用相对稳定的先验尺寸分布和高可检测性的小型车辆(SVs)作为度量参考。我们引入了一种解耦立体投影模型,从这些语义目标中估计绝对图像尺度。通过将车辆尺寸分解为径向和切向分量,该模型补偿了3D车辆的2D检测中的透视失真,从而实现更准确的尺度估计。为了进一步减少类内尺寸变化和检测噪声,采用了基于四分位数范围(IQR)的鲁棒聚合的双维融合策略。然后,估计的全局尺度作为尺度自适应卫星图像裁剪的物理约束,改善了无人机与卫星特征的对齐。在增强的DenseUAV和UAV-VisLoc数据集上的实验表明,所提出的方法在未知无人机图像尺度下显著提高了CVGL的鲁棒性。此外,该框架在被动无人机高度估计和3D模型尺度恢复等下游应用中显示出强大的潜力。
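The dual-dimension fusion with IQR-based robust aggregation reduces to a short, standard recipe: pool per-vehicle scale estimates from both length and width, discard values outside the interquartile fence, and average the survivors into a global image scale. The 1.5 fence multiplier is the conventional choice; the numbers below are synthetic.

```python
import numpy as np

def iqr_robust_scale(scales_len, scales_wid, k: float = 1.5):
    """scales_*: per-detection metric-scale estimates (meters per pixel)."""
    s = np.concatenate([scales_len, scales_wid])          # dual-dimension fusion
    q1, q3 = np.percentile(s, [25, 75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)       # interquartile fence
    inliers = s[(s >= lo) & (s <= hi)]
    return inliers.mean(), inliers.size

rng = np.random.default_rng(0)
est_l = 0.05 + 0.004 * rng.standard_normal(30)            # ~0.05 m/px from vehicle lengths
est_w = 0.05 + 0.006 * rng.standard_normal(30)            # noisier width-based estimates
est_w[:3] = [0.20, 0.18, 0.01]                            # gross outliers to be rejected
scale, n = iqr_robust_scale(est_l, est_w)
print(round(scale, 4), n)
```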
cs.CV / 129 / 2603.07540

How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

统一多模态模型能够可靠生成图像的最长时间是多久?通过上下文管理驯服长时间交错图像生成
Chen, Haoyu, Liu, Qing, Zhou, Yuqian, Zhang, He, Wang, Zhaowen, Ren, Mengwei, Ren, Jingjing, Wang, Xiang, Lin, Zhe, Zhu, Lei
Abstract
Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.
Chinese Translation
统一多模态模型承诺能够生成广泛的交错叙事,将文本和图像编织成连贯的长篇故事。然而,当前系统存在一个关键的可靠性缺口:随着序列的增长,生成质量迅速下降。在本研究中,我们探讨了导致这一失败的机制,并认为它与标准的长上下文挑战是不同的。我们揭示,在生成过程中,累积的视觉历史作为一种主动污染源,其衰减特性特别由图像事件的数量而非原始标记数量所决定。我们识别出一种结构性脆弱性,即密集的视觉标记压倒了注意力机制,产生噪声,从而扭曲未来的合成。在这些机制性见解的指导下,我们提出了UniLongGen,这是一种无训练推理策略,优先考虑安全的条件而非全面回忆。UniLongGen并不保留所有历史,而是动态管理模型的记忆,根据模型自身的内部相关性排名识别和丢弃干扰的视觉信号。大量实验表明,这种主动遗忘的方法对稳定性至关重要:UniLongGen在长时间范围的保真度和一致性方面显著优于基线,同时减少了内存占用和推理时间。
cs.CV / 130 / 2603.07543

CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization

CONSTANT:通过补丁对比增强和风格感知量化实现高质量的一次性手写生成
Le, Anh-Duy, Pham, Van-Linh, Vo, Thanh-Nam, Mai, Xuan Toan, Tran, Tuan-Anh
Abstract
One-shot styled handwriting image generation, despite achieving impressive results in recent years, remains challenging due to the difficulty in capturing the intricate and diverse characteristics of human handwriting by using solely a single reference image. Existing methods still struggle to generate visually appealing and realistic handwritten images and adapt to complex, unseen writer styles, struggling to isolate invariant style features (e.g., slant, stroke width, curvature) while ignoring irrelevant noise. To tackle this problem, we introduce Patch Contrastive Enhancement and Style-Aware Quantization via Denoising Diffusion (CONSTANT), a novel one-shot handwriting generation via diffusion model. CONSTANT leverages three key innovations: 1) a Style-Aware Quantization (SAQ) module that models style as discrete visual tokens capturing distinct concepts; 2) a contrastive objective to ensure these tokens are well-separated and meaningful in the embedding style space; 3) a latent patch-based contrastive (LLatentPCE) objective that improves quality and local structure by aligning multiscale spatial patches of generated and real features in latent space. Extensive experiments and analysis on benchmark datasets from multiple languages, including English, Chinese, and our proposed ViHTGen dataset for Vietnamese, demonstrate the superiority of our method over state-of-the-art approaches in adapting to new reference styles and producing highly detailed images. Code is available at GitHub
Chinese Translation
尽管近年来一次性风格化手写图像生成取得了令人印象深刻的成果,但由于仅使用单一参考图像捕捉人类手写的复杂多样特征的难度,这一任务仍然充满挑战。现有方法在生成视觉上吸引人且逼真的手写图像以及适应复杂的、未见过的书写风格方面仍然存在困难,难以隔离不变的风格特征(例如,倾斜度、笔画宽度、曲率),而忽略了无关的噪声。为了解决这一问题,我们提出了通过去噪扩散实现的补丁对比增强和风格感知量化(CONSTANT),这是一种新颖的一次性手写生成扩散模型。CONSTANT利用了三个关键创新:1)风格感知量化(SAQ)模块,将风格建模为捕捉不同概念的离散视觉标记;2)对比目标,确保这些标记在嵌入风格空间中良好分离且具有意义;3)基于潜在补丁的对比(LLatentPCE)目标,通过对齐生成和真实特征在潜在空间中的多尺度空间补丁,帮助提高质量和局部结构。对来自多种语言的基准数据集(包括英语、中文以及我们提出的越南语ViHTGen数据集)进行的广泛实验和分析,证明了我们的方法在适应新参考风格和生成高度详细图像方面优于最先进的方法。代码可在GitHub上获取。
cs.CV / 131 / 2603.07545

DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration

DreamSAC:通过对称性探索学习哈密顿世界模型
Tang, Jinzhou, Feng, Fan, Fu, Minghao, Lin, Wenjun, Huang, Biwei, Wang, Keze
Abstract
Learned world models excel at interpolative generalization but fail at extrapolative generalization to novel physical properties. This limitation arises because they learn statistical correlations rather than the environment's underlying generative rules, such as physical invariances and conservation laws. We argue that learning these invariances is key to robust extrapolation. To achieve this, we first introduce \textbf{Symmetry Exploration}, an unsupervised exploration strategy where an agent is intrinsically motivated by a Hamiltonian-based curiosity bonus to actively probe and challenge its understanding of conservation laws, thereby collecting physically informative data. Second, we design a Hamiltonian-based world model that learns from the collected data, using a novel self-supervised contrastive objective to identify the invariant physical state from raw, view-dependent pixel observations. Our framework, \textbf{DreamSAC}, trained on this actively curated data, significantly outperforms state-of-the-art baselines in 3D physics simulations on tasks requiring extrapolation.
Chinese Translation
学习的世界模型在插值泛化方面表现出色,但在对新物理属性的外推泛化方面却表现不佳。这一限制源于它们学习的是统计相关性,而非环境的基本生成规则,例如物理不变性和守恒定律。我们认为,学习这些不变性是实现稳健外推的关键。为此,我们首先引入了对称性探索(Symmetry Exploration),这是一种无监督探索策略,代理通过基于哈密顿的好奇心奖励内在驱动,主动探测和挑战其对守恒定律的理解,从而收集物理信息丰富的数据。其次,我们设计了一种基于哈密顿的世界模型,从收集的数据中学习,使用一种新颖的自监督对比目标,从原始的、视角依赖的像素观测中识别不变的物理状态。我们的框架DreamSAC在这一主动策划的数据上训练,在需要外推的3D物理模拟任务中显著超越了最先进的基线。
cs.CV / 132 / 2603.07552

ReconDrive: Fast Feed-Forward 4D Gaussian Splatting for Autonomous Driving Scene Reconstruction

ReconDrive:用于自主驾驶场景重建的快速前馈4D高斯点云生成
Yu, Haibao, Xiao, Kuntao, Wang, Jiahang, Hao, Ruiyang, Huang, Yuxin, Hu, Guoran, Qin, Haifang, Jing, Bowen, Bo, Yuntian, Luo, Ping
Abstract
High-fidelity visual reconstruction and novel-view synthesis are essential for realistic closed-loop evaluation in autonomous driving. While 4D Gaussian Splatting (4DGS) offers a promising balance of accuracy and efficiency, existing per-scene optimization methods require costly iterative refinement, rendering them unscalable for extensive urban environments. Conversely, current feed-forward approaches often suffer from degraded photometric quality. To address these limitations, we propose ReconDrive, a feed-forward framework that leverages and extends the 3D foundation model VGGT for rapid, high-fidelity 4DGS generation. Our architecture introduces two core adaptations to tailor the foundation model to dynamic driving scenes: (1) Hybrid Gaussian Prediction Heads, which decouple the regression of spatial coordinates and appearance attributes to overcome the photometric deficiencies inherent in generalized foundation features; and (2) a Static-Dynamic 4D Composition strategy that explicitly captures temporal motion via velocity modeling to represent complex dynamic environments. Benchmarked on nuScenes, ReconDrive significantly outperforms existing feed-forward baselines in reconstruction, novel-view synthesis, and 3D perception. It achieves performance competitive with per-scene optimization while being orders of magnitude faster, providing a scalable and practical solution for realistic driving simulation.
Chinese Translation
高保真视觉重建和新视角合成对于自主驾驶中的真实闭环评估至关重要。尽管4D高斯点云生成(4DGS)在准确性和效率之间提供了良好的平衡,但现有的逐场景优化方法需要耗时的迭代精炼,使其在广泛的城市环境中不可扩展。相反,当前的前馈方法往往在光度质量上存在下降。为了解决这些局限性,我们提出了ReconDrive,一个前馈框架,利用并扩展了3D基础模型VGGT,以快速生成高保真的4DGS。我们的架构引入了两个核心适配,以将基础模型定制为动态驾驶场景:(1) 混合高斯预测头,解耦空间坐标和外观属性的回归,以克服通用基础特征固有的光度缺陷;(2) 静态-动态4D组合策略,通过速度建模明确捕捉时间运动,以表示复杂的动态环境。在nuScenes上的基准测试中,ReconDrive在重建、新视角合成和3D感知方面显著优于现有的前馈基线。它在性能上与逐场景优化相当,同时速度快几个数量级,为真实驾驶仿真提供了可扩展且实用的解决方案。
cs.CV / 133 / 2603.07559

Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning

基于主动推理的微手势识别:EFE引导的时间采样与自适应学习
Feng, Weijia, Yang, Jingyu, Zhang, Ruojia, Sun, Fengtao, Gao, Qian, Wang, Chenyang, Su, Tongtong, Guo, Jia, Li, Xiaobai, Shao, Minglai
Abstract
Micro-gestures are subtle and transient movements triggered by unconscious neural and emotional activities, holding great potential for human-computer interaction and clinical monitoring. However, their low amplitude, short duration, and strong inter-subject variability make existing deep models prone to degradation under low-sample, noisy, and cross-subject conditions. This paper presents an active inference-based framework for micro-gesture recognition, featuring Expected Free Energy (EFE)-guided temporal sampling and uncertainty-aware adaptive learning. The model actively selects the most discriminative temporal segments under EFE guidance, enabling dynamic observation and information gain maximization. Meanwhile, sample weighting driven by predictive uncertainty mitigates the effects of label noise and distribution shift. Experiments on the SMG dataset demonstrate the effectiveness of the proposed method, achieving consistent improvements across multiple mainstream backbones. Ablation studies confirm that both the EFE-guided observation and the adaptive learning mechanism are crucial to the performance gains. This work offers an interpretable and scalable paradigm for temporal behavior modeling under low-resource and noisy conditions, with broad applicability to wearable sensing, HCI, and clinical emotion monitoring.
Chinese Translation
微手势是由无意识的神经和情感活动引发的细微且短暂的运动,在人机交互和临床监测中具有巨大潜力。然而,它们的低幅度、短持续时间和强烈的个体间差异使得现有的深度模型在低样本、噪声和跨个体条件下容易出现性能下降。本文提出了一种基于主动推理的微手势识别框架,采用期望自由能(Expected Free Energy, EFE)引导的时间采样和不确定性感知的自适应学习。该模型在EFE的指导下主动选择最具区分性的时间片段,从而实现动态观察和信息增益最大化。同时,基于预测不确定性的样本加权减轻了标签噪声和分布漂移的影响。在SMG数据集上的实验验证了所提方法的有效性,在多个主流骨干网络上均实现了一致的性能提升。消融研究确认,EFE引导的观察和自适应学习机制对性能提升至关重要。本研究为在低资源和噪声条件下的时间行为建模提供了一种可解释且可扩展的范式,广泛适用于可穿戴传感、人机交互和临床情感监测。
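To make the mechanism above concrete, here is a minimal Python sketch of EFE-style segment selection and uncertainty-weighted training. The abstract gives no formulas, so the entropy-based information-gain proxy and the exponential weighting below are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

def segment_scores(segment_logits):
    # segment_logits: (S, C) classifier logits for S temporal segments.
    # Score each segment by negative predictive entropy, one common proxy
    # for expected information gain (assumption; EFE proper also includes
    # a prior-preference term not modeled here).
    p = F.softmax(segment_logits, dim=-1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=-1)  # (S,)
    return -entropy

def select_segments(segment_logits, k):
    # Keep the k most discriminative segments for the next observation.
    return segment_scores(segment_logits).topk(k).indices

def uncertainty_weighted_ce(logits, targets):
    # Down-weight high-uncertainty samples to soften label noise and
    # distribution shift, as the abstract describes.
    p = F.softmax(logits, dim=-1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=-1)  # (B,)
    w = torch.exp(-entropy)  # confident samples get weight near 1
    loss = F.cross_entropy(logits, targets, reduction="none")
    return (w * loss).sum() / w.sum()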
cs.CV / 134 / 2603.07561

PureCC: Pure Learning for Text-to-Image Concept Customization

PureCC:文本到图像概念定制的纯学习
Liao, Zhichao, Xian, Xiaole, Li, Qingyu, Qin, Wenyu, Wang, Meng, Xie, Weicheng, Song, Siyang, Feng, Pingfa, Zeng, Long, Pan, Liang
Abstract
Existing concept customization methods have achieved remarkable outcomes in high-fidelity and multi-concept customization. However, they often neglect the influence on the original model's behavior and capabilities when learning new personalized concepts. To address this issue, we propose PureCC. PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representations as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale $\lambda^\star$ to dynamically adjust the guidance strength of the target concept, balancing customization fidelity and model preservation. Extensive experiments show that PureCC achieves state-of-the-art performance in preserving the original behavior and capabilities while enabling high-fidelity concept customization. The code is available at https://github.com/lzc-sg/PureCC.
Chinese Translation
现有的概念定制方法在高保真度和多概念定制方面取得了显著成果。然而,它们往往忽视了在学习新的个性化概念时对原始模型行为和能力的影响。为了解决这一问题,我们提出了PureCC。PureCC引入了一种新颖的解耦学习目标用于概念定制,该目标将目标概念的隐式指导与原始条件预测相结合。这种分离形式使得PureCC在训练过程中能够显著关注原始模型。此外,基于这一目标,PureCC设计了一个双分支训练管道,其中包括一个提供纯化目标概念表示作为隐式指导的冻结提取器和一个生成原始条件预测的可训练流模型,共同实现个性化概念的纯学习。此外,PureCC引入了一种新颖的自适应指导尺度 $\lambda^\star$,以动态调整目标概念的指导强度,平衡定制保真度和模型保留。大量实验表明,PureCC在保留原始行为和能力的同时,实现了高保真度的概念定制,达到了最先进的性能。代码可在 https://github.com/lzc-sg/PureCC 获取。
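The abstract does not spell out how the implicit guidance and the conditional prediction are combined, but a guidance-style mixing rule is one natural reading. A hypothetical Python sketch, where lam plays the role of the paper's adaptive scale $\lambda^\star$ and the disagreement-based heuristic is purely illustrative:

import torch

def guided_prediction(v_cond, v_guide, lam):
    # v_cond: the trainable flow model's original conditional prediction.
    # v_guide: implicit target-concept guidance from the frozen extractor.
    # Classifier-free-guidance-style combination (an assumption, not the
    # paper's exact rule): stay close to v_cond, push toward the concept.
    return v_cond + lam * (v_guide - v_cond)

def adaptive_lambda(v_cond, v_guide, lam_max=1.0):
    # One plausible heuristic for an adaptive scale: guide harder when the
    # two predictions disagree more, saturating at lam_max.
    gap = (v_guide - v_cond).flatten(1).norm(dim=1)   # per-sample disagreement
    return lam_max * torch.tanh(gap).view(-1, 1, 1, 1)  # for (B, C, H, W) tensors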
cs.CV / 135 / 2603.07562

Brain-WM: Brain Glioblastoma World Model

Brain-WM:脑胶质母细胞瘤世界模型
Wang, Chenhui, Zheng, Boyun, Bao, Liuxin, Peng, Zhihao, Woo, Peter Y. M., Shan, Hongming, Yuan, Yixuan
Abstract
Precise prognostic modeling of glioblastoma (GBM) under varying treatment interventions is essential for optimizing clinical outcomes. While generative AI has shown promise in simulating GBM evolution, existing methods typically treat interventions as static conditional inputs rather than dynamic decision variables. Consequently, they fail to capture the complex, reciprocal interplay between tumor evolution and treatment response. To bridge this gap, we present Brain-WM, a pioneering brain GBM world model that unifies next-step treatment prediction and future MRI generation, thereby capturing the co-evolutionary dynamics between tumor and treatment. Specifically, Brain-WM encodes spatiotemporal dynamics into a shared latent space for joint autoregressive treatment prediction and flow-based future MRI generation. Then, instead of a conventional monolithic framework, Brain-WM adopts a novel Y-shaped Mixture-of-Transformers (MoT) architecture. This design structurally disentangles heterogeneous objectives, successfully leveraging cross-task synergies while preventing feature collapse. Finally, a synergistic multi-timepoint mask alignment objective explicitly anchors latent representations to anatomically grounded tumor structures and progression-aware semantics. Extensive validation on internal and external multi-institutional cohorts demonstrates the superiority of Brain-WM, achieving 91.5% accuracy in treatment planning and SSIMs of 0.8524, 0.8581, and 0.8404 for FLAIR, T1CE, and T2W sequences, respectively. Ultimately, Brain-WM offers a robust clinical sandbox for optimizing patient healthcare. The source code is made available at https://github.com/thibault-wch/Brain-GBM-world-model.
Chinese Translation
在不同治疗干预下,胶质母细胞瘤(GBM)的精确预后建模对于优化临床结果至关重要。尽管生成性人工智能在模拟GBM演变方面展现了潜力,但现有方法通常将干预视为静态条件输入,而非动态决策变量。因此,它们未能捕捉肿瘤演变与治疗反应之间复杂的相互作用。为了解决这一问题,我们提出了Brain-WM,一个开创性的脑胶质母细胞瘤世界模型,它统一了下一步治疗预测和未来MRI生成,从而捕捉肿瘤与治疗之间的共同演化动态。具体而言,Brain-WM将时空动态编码到共享的潜在空间中,以实现联合自回归治疗预测和基于流的未来MRI生成。然后,Brain-WM采用了一种新颖的Y形混合变换器(Mixture-of-Transformers, MoT)架构,而不是传统的单一框架。这一设计在结构上解耦了异构目标,成功利用跨任务协同,同时防止特征崩溃。最后,一个协同的多时间点掩膜对齐目标明确地将潜在表示锚定到解剖学基础的肿瘤结构和进展感知语义上。在内部和外部多机构队列上的广泛验证证明了Brain-WM的优越性:其在治疗规划中实现了91.5%的准确率,并在FLAIR、T1CE和T2W序列中分别获得了0.8524、0.8581和0.8404的结构相似性指数(SSIM)。最终,Brain-WM为优化患者医疗提供了一个强大的临床沙盒。源代码可在https://github.com/thibault-wch/Brain-GBM-world-model获取。
cs.CV / 136 / 2603.07564

SiamGM: Siamese Geometry-Aware and Motion-Guided Network for Real-Time Satellite Video Object Tracking

SiamGM:一种基于几何感知和运动引导的孪生网络用于实时卫星视频目标跟踪
Wen, Zixiao, Yang, Zhen, Li, Jiawei, Xiang, Xiantai, Zhou, Guangyao, Hu, Yuxin, Liu, Yuhan
Abstract
Single object tracking in satellite videos is inherently challenged by small targets, blurred backgrounds, large aspect-ratio changes, and frequent visual occlusions. These constraints often cause appearance-based trackers to accumulate errors and lose targets irreversibly. To systematically mitigate both spatial ambiguities and temporal information loss, we propose SiamGM, a novel geometry-aware and motion-guided Siamese network. From a spatial perspective, we introduce an Inter-Frame Graph Attention (IFGA) module, closely integrated with an Aspect Ratio-Constrained Label Assignment (LA) method, establishing fine-grained topological correspondences and explicitly preventing surrounding background noise. From a temporal perspective, we introduce the Motion Vector-Guided Online Tracking Optimization method. By adopting the Normalized Peak-to-Sidelobe Ratio (nPSR) as a dynamic confidence indicator, we propose an Online Motion Model Refinement (OMMR) strategy to utilize historical trajectory information. Evaluations on the two challenging SatSOT and SV248S benchmarks confirm that SiamGM outperforms most state-of-the-art trackers in both precision and success metrics. Notably, the proposed components of SiamGM introduce virtually no computational overhead, enabling real-time tracking at 130 frames per second (FPS). Codes and tracking results are available at https://github.com/wenzx18/SiamGM.
Chinese Translation
在卫星视频中,单目标跟踪面临小目标、模糊背景、大纵横比变化和频繁视觉遮挡等固有挑战。这些限制常常导致基于外观的跟踪器累积误差并不可逆地丢失目标。为了系统性地减轻空间歧义和时间信息丢失,我们提出了SiamGM,一种新颖的几何感知和运动引导的孪生网络。从空间角度来看,我们引入了一个帧间图注意力(Inter-Frame Graph Attention, IFGA)模块,与纵横比约束标签分配(Aspect Ratio-Constrained Label Assignment, LA)方法紧密结合,建立细粒度的拓扑对应关系,并明确防止周围背景噪声。从时间角度来看,我们引入了运动矢量引导的在线跟踪优化方法。通过采用归一化峰值旁瓣比(Normalized Peak-to-Sidelobe Ratio, nPSR)作为动态置信度指标,我们提出了一种在线运动模型优化(Online Motion Model Refinement, OMMR)策略,以利用历史轨迹信息。在两个具有挑战性的SatSOT和SV248S基准测试上的评估结果确认,SiamGM在精度和成功率指标上均优于大多数最先进的跟踪器。值得注意的是,SiamGM提出的组件几乎不引入计算开销,使其能够以每秒130帧(FPS)实现实时跟踪。代码和跟踪结果可在https://github.com/wenzx18/SiamGM获取。
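The abstract does not define its normalized variant, but the standard Peak-to-Sidelobe Ratio underlying nPSR is easy to state: the response peak measured against the mean and spread of the off-peak region. A minimal NumPy sketch (window size and epsilon are arbitrary choices):

import numpy as np

def peak_to_sidelobe_ratio(response, exclude=5):
    # response: 2D correlation/response map from the tracker head.
    # PSR = (peak - mean(sidelobe)) / std(sidelobe), where the sidelobe
    # region excludes a small window around the peak. A high PSR means a
    # sharp, confident peak; a low PSR suggests occlusion or drift.
    r = np.asarray(response, dtype=np.float64)
    py, px = np.unravel_index(np.argmax(r), r.shape)
    mask = np.ones(r.shape, dtype=bool)
    mask[max(0, py - exclude):py + exclude + 1,
         max(0, px - exclude):px + exclude + 1] = False
    side = r[mask]
    return (r[py, px] - side.mean()) / (side.std() + 1e-8)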
cs.CV / 137 / 2603.07566

GRD-Net: Generative-Reconstructive-Discriminative Anomaly Detection with Region of Interest Attention Module

GRD-Net:具有兴趣区域注意模块的生成-重构-判别异常检测
Ferrari, Niccolò, Fraccaroli, Michele, Lamma, Evelina
Abstract
Anomaly detection is nowadays increasingly used in industrial applications and processes. One of its main application fields is visual inspection for surface anomaly detection, which aims to spot regions that deviate from regularity and consequently identify abnormal products. Defect localization is a key task that is usually achieved through a basic comparison between the generated image and the original one, with blob-analysis or image-editing algorithms applied in a post-processing step; such post-processing is strongly biased towards the source dataset and unable to generalize. Furthermore, in industrial applications, the whole image is not always of interest: there may be one or more regions of interest (ROIs), and only in those areas are there relevant anomalies to be spotted. For these reasons, we propose a new architecture composed of two blocks. The first block is a Generative Adversarial Network (GAN), based on a residual autoencoder (ResAE), that performs reconstruction and denoising, while the second block produces an image segmentation that spots defects. This method learns from a dataset composed of good products and generated synthetic defects. The discriminative network is trained using a ROI for each image in the training dataset, so the network learns in which areas anomalies are relevant. This approach reduces the need for the previously developed blob-analysis and image-editing post-processing algorithms. To test our model we used the challenging MVTec anomaly detection datasets and a large industrial dataset of pharmaceutical BFS strips of vials. The latter constitutes a more realistic use case for the aforementioned network.
Chinese Translation
异常检测在当今的工业应用和过程中越来越普遍。其主要应用领域之一是表面异常检测的视觉检查,旨在识别偏离常规的区域,从而识别异常产品。缺陷定位是一个关键任务,通常通过生成图像与原始图像之间的基本比较来实现,并在后处理步骤中实施一些斑点分析或图像编辑算法,这些方法往往对源数据集存在偏见,且无法进行泛化。此外,在工业应用中,图像的整体并不总是重要,可能只有一个或多个兴趣区域(ROIs),这些区域内才存在需要被发现的相关异常。基于这些原因,我们提出了一种由两个模块组成的新架构。第一个模块是基于残差自编码器(ResAE)的生成对抗网络(GAN),用于执行重构和去噪过程,而第二个模块则进行图像分割,识别缺陷。该方法从由良好产品和生成的合成缺陷组成的数据集中学习。判别网络使用训练数据集中每幅图像的ROI进行训练。网络将学习哪些区域的异常是相关的。这种方法减少了对以前开发的基于斑点分析和图像编辑程序的后处理算法的依赖。为了测试我们的模型,我们使用了具有挑战性的MVTec异常检测数据集和一个工业大型制药BFS小瓶条带数据集。该数据集构成了上述网络的一个更现实的使用案例。
cs.CV / 138 / 2603.07570

Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance

通过多任务自适应学习和跨维特征引导实现高效的RGB-D场景理解
Sun, Guodong, Liu, Junjie, Zhang, Gaoyang, Wu, Bo, Zhang, Yang
Abstract
Scene understanding plays a critical role in enabling intelligence and autonomy in robotic systems. Traditional approaches often face challenges, including occlusions, ambiguous boundaries, and the inability to adapt attention based on task-specific requirements and sample variations. To address these limitations, this paper presents an efficient RGB-D scene understanding model that performs a range of tasks, including semantic segmentation, instance segmentation, orientation estimation, panoptic segmentation, and scene classification. The proposed model incorporates an enhanced fusion encoder, which effectively leverages redundant information from both RGB and depth inputs. For semantic segmentation, we introduce normalized focus channel layers and a context feature interaction layer, designed to mitigate issues such as shallow feature misguidance and insufficient local-global feature representation. The instance segmentation task benefits from a non-bottleneck 1D structure, which achieves superior contour representation with fewer parameters. Additionally, we propose a multi-task adaptive loss function that dynamically adjusts the learning strategy for different tasks based on scene variations. Extensive experiments on the NYUv2, SUN RGB-D, and Cityscapes datasets demonstrate that our approach outperforms existing methods in both segmentation accuracy and processing speed.
Chinese Translation
场景理解在实现机器人系统的智能和自主性方面发挥着关键作用。传统方法常面临诸多挑战,包括遮挡、模糊边界以及无法根据任务特定需求和样本变化自适应调整注意力。为了解决这些局限性,本文提出了一种高效的RGB-D场景理解模型,能够执行一系列任务,包括语义分割、实例分割、方向估计、全景分割和场景分类。所提出的模型结合了增强的融合编码器,有效利用RGB和深度输入中的冗余信息。在语义分割方面,我们引入了归一化聚焦通道层和上下文特征交互层,旨在缓解浅层特征误导和局部-全局特征表示不足等问题。实例分割任务则受益于非瓶颈的1D结构,该结构以更少的参数实现了更优的轮廓表示。此外,我们提出了一种多任务自适应损失函数,能够根据场景变化动态调整不同任务的学习策略。在NYUv2、SUN RGB-D和Cityscapes数据集上的大量实验表明,我们的方法在分割精度和处理速度上均优于现有方法。
cs.CV / 139 / 2603.07571

A Systematic Comparison of Training Objectives for Out-of-Distribution Detection in Image Classification

图像分类中分布外检测训练目标的系统比较
Genç, Furkan, Özdemir, Onat, Akbaş, Emre
Abstract
Out-of-distribution (OOD) detection is critical in safety-sensitive applications. While this challenge has been addressed from various perspectives, the influence of training objectives on OOD behavior remains comparatively underexplored. In this paper, we present a systematic comparison of four widely used training objectives: Cross-Entropy Loss, Prototype Loss, Triplet Loss, and Average Precision (AP) Loss, spanning probabilistic, prototype-based, metric-learning, and ranking-based supervision, for OOD detection in image classification under standardized OpenOOD protocols. Across CIFAR-10/100 and ImageNet-200, we find that Cross-Entropy Loss, Prototype Loss, and AP Loss achieve comparable in-distribution accuracy, while Cross-Entropy Loss provides the most consistent near- and far-OOD performance overall; the other objectives can be competitive in specific settings.
Chinese Translation
分布外(OOD)检测在安全敏感的应用中至关重要。尽管这一挑战已从多个角度得到解决,但训练目标对OOD行为的影响仍然相对未被深入探讨。本文对四种广泛使用的训练目标进行了系统比较:交叉熵损失(Cross-Entropy Loss)、原型损失(Prototype Loss)、三元组损失(Triplet Loss)和平均精度(Average Precision, AP)损失,涵盖了概率性、基于原型、度量学习和基于排名的监督,旨在在标准化的OpenOOD协议下进行图像分类的OOD检测。在CIFAR-10/100和ImageNet-200数据集上,我们发现交叉熵损失、原型损失和AP损失在分布内准确性上表现相当,而交叉熵损失在整体上提供了最一致的近分布外和远分布外性能;其他目标在特定设置中也可能具有竞争力。
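For context, two of the ingredients this benchmark touches are easy to sketch in Python: a prototype-style loss (one common instantiation; the paper's exact variant may differ) and the standard maximum-softmax-probability score used in OpenOOD-style post-hoc evaluation.

import torch
import torch.nn.functional as F

def prototype_loss(features, targets, prototypes):
    # features: (B, D) embeddings; prototypes: (C, D) class centers.
    # Cross-entropy over negative squared distances to prototypes -- a
    # common prototype-loss form (an assumption, not necessarily the
    # paper's exact formulation).
    d2 = torch.cdist(features, prototypes).pow(2)  # (B, C)
    return F.cross_entropy(-d2, targets)

@torch.no_grad()
def msp_ood_score(model, x):
    # Maximum Softmax Probability: the classic post-hoc OOD score.
    # Higher = more in-distribution; a threshold flags OOD inputs.
    return F.softmax(model(x), dim=-1).max(dim=-1).values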
cs.CV / 140 / 2603.07577

Integration of deep generative Anomaly Detection algorithm in high-speed industrial line

深度生成异常检测算法在高速工业生产线中的集成
Ferrari, Niccolò, Zanarini, Nicola, Fraccaroli, Michele, Bizzarri, Alice, Lamma, Evelina
Abstract
Industrial visual inspection in pharmaceutical production requires high accuracy under strict constraints on cycle time, hardware footprint, and operational cost. Manual inline inspection is still common, but it is affected by operator variability and limited throughput. Classical rule-based computer vision pipelines are often rigid and difficult to scale to highly variable production scenarios. To address these limitations, we present a semi-supervised anomaly detection framework based on a generative adversarial architecture with a residual autoencoder and a dense bottleneck, specifically designed for online deployment on a high-speed Blow-Fill-Seal (BFS) line. The model is trained only on nominal samples and detects anomalies through reconstruction residuals, providing both classification and spatial localization via heatmaps. The training set contains 2,815,200 grayscale patches. Experiments on a real industrial test kit show high detection performance while satisfying timing constraints compatible with a 500 ms acquisition slot.
Chinese Translation
制药生产中的工业视觉检测要求在严格的周期时间、硬件占用和运营成本约束下实现高精度。手动在线检测仍然很常见,但受到操作员变异性和产量限制的影响。传统的基于规则的计算机视觉流程往往较为僵化,难以扩展到高度可变的生产场景。为了解决这些局限性,我们提出了一种基于生成对抗架构的半监督异常检测框架,该框架结合了残差自编码器和密集瓶颈,专门设计用于在高速吹填封(Blow-Fill-Seal, BFS)生产线上进行在线部署。该模型仅在正常样本上进行训练,通过重构残差检测异常,并通过热图提供分类和空间定位。训练集包含2,815,200个灰度图块。在真实工业测试工具上的实验显示,检测性能高,同时满足与500毫秒采集时间窗口兼容的时间约束。
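Since the model sees only nominal samples, the anomaly evidence at inference is the reconstruction residual. A minimal NumPy sketch of the scoring and heatmap step (the threshold and normalization choices are deployment assumptions, not values from the paper):

import numpy as np

def residual_anomaly(patch, reconstruction, thresh=None):
    # patch, reconstruction: 2D grayscale arrays. The generator, trained
    # only on good products, reconstructs nominal structure; defects
    # survive as large residuals.
    res = np.abs(patch.astype(np.float32) - reconstruction.astype(np.float32))
    score = float(res.mean())            # patch-level anomaly score
    heatmap = res / (res.max() + 1e-8)   # per-pixel localization map
    is_anomalous = None if thresh is None else score > thresh
    return score, heatmap, is_anomalous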
cs.CV / 141 / 2603.07587

3DGS-HPC: Distractor-free 3D Gaussian Splatting with Hybrid Patch-wise Classification

3DGS-HPC:无干扰的3D高斯点云与混合块级分类
Chen, Jiahao, Qin, Yipeng, Zhao, Ganlong, Li, Xin, Wang, Wenping, Li, Guanbin
Abstract
3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in novel view synthesis and 3D scene reconstruction, yet its quality often degrades in real-world environments due to transient distractors, such as moving objects and varying shadows. Existing methods commonly rely on semantic cues extracted from pre-trained vision models to identify and suppress these distractors, but such semantics are misaligned with the binary distinction between static and transient regions and remain fragile under the appearance perturbations introduced during 3DGS optimization. We propose 3DGS-HPC, a framework that circumvents these limitations by combining two complementary principles: a patch-wise classification strategy that leverages local spatial consistency for robust region-level decisions, and a hybrid classification metric that adaptively integrates photometric and perceptual cues for more reliable separation. Extensive experiments demonstrate the superiority and robustness of our method in mitigating distractors to improve 3DGS-based novel view synthesis.
Chinese Translation
3D高斯点云(3D Gaussian Splatting, 3DGS)在新视图合成和3D场景重建中表现出色,但由于瞬态干扰因素(如移动物体和变化的阴影),其质量在现实环境中往往会下降。现有方法通常依赖从预训练视觉模型中提取的语义线索来识别和抑制这些干扰因素,但这些语义与静态和瞬态区域之间的二元区分不一致,并且在3DGS优化过程中引入的外观扰动下仍然脆弱。我们提出了3DGS-HPC,一个通过结合两种互补原则来规避这些限制的框架:一种块级分类策略,利用局部空间一致性进行稳健的区域级决策,以及一种混合分类度量,适应性地整合光度和感知线索以实现更可靠的分离。大量实验表明,我们的方法在减轻干扰因素方面优于现有方法,从而改善基于3DGS的新视图合成。
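A hypothetical PyTorch sketch of the patch-wise hybrid classification idea: pool a photometric error and a precomputed perceptual error map (e.g., from LPIPS) over patches, mix them, and flag patches above a threshold as transient. The fixed mixing weight alpha and threshold tau stand in for the paper's adaptive rule.

import torch
import torch.nn.functional as F

def transient_mask(rendered, observed, perceptual_err, patch=16, tau=0.5, alpha=0.5):
    # rendered, observed: (3, H, W); perceptual_err: (1, H, W); H and W
    # are assumed divisible by `patch`. Patch pooling exploits local
    # spatial consistency: whole regions are classified, not lone pixels.
    photo = (rendered - observed).abs().mean(dim=0, keepdim=True)  # (1, H, W)
    pool = lambda m: F.avg_pool2d(m.unsqueeze(0), patch).squeeze(0)
    hybrid = alpha * pool(photo) + (1 - alpha) * pool(perceptual_err)
    mask = (hybrid > tau).float()  # (1, H/patch, W/patch), 1 = distractor
    # Upsample back to pixel resolution so losses can be masked per pixel.
    return F.interpolate(mask.unsqueeze(0), size=observed.shape[-2:],
                         mode="nearest").squeeze(0)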
cs.CV / 142 / 2603.07590

Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints

模型如乐高构建者:通过语义蓝图从良性模块组装恶意
Li, Chenxi, Liu, Xianggan, Shen, Dake, Du, Yaosong, Yao, Zhibo, Jiang, Hao, Jiang, Linyi, Cao, Chengwei, Zhang, Jingzhe, Peng, RanYi, Bai, Peiling, Huang, Xiande
Abstract
Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability via semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose StructAttack, a simple yet effective single-query jailbreak framework under black-box settings. StructAttack decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. Paired with a completion-guided instruction, LVLMs automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), StructAttack exploits LVLMs' reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models and benchmarks show the efficacy of our proposed StructAttack.
Chinese Translation
尽管大型视觉语言模型(LVLMs)取得了快速进展,但视觉模态的整合引入了新的安全漏洞,敌对者可以利用这些漏洞引发偏见或恶意输出。在本文中,我们展示了一种尚未充分探讨的漏洞,通过语义槽填充,LVLMs 在槽类型被故意设计为看似良性时,仍会用不安全的内容填补缺失的槽值。基于这一发现,我们提出了StructAttack,这是一种在黑箱环境下简单而有效的单查询越狱框架。StructAttack 将有害查询分解为一个中心主题和一组看似良性的槽类型,然后将它们嵌入为结构化视觉提示(例如,思维导图、表格或旭日图),并添加小的随机扰动。配合完成引导指令,LVLMs 自动重组隐藏的语义并生成不安全的输出,而不会触发安全机制。尽管每个槽在孤立状态下看似良性(局部良性),StructAttack 利用 LVLMs 的推理能力将这些槽组装成连贯的有害语义。对多个模型和基准的广泛实验表明了我们提出的 StructAttack 的有效性。
cs.CV / 143 / 2603.07593

Fast Attention-Based Simplification of LiDAR Point Clouds for Object Detection and Classification

基于快速注意力机制的LiDAR点云简化方法用于物体检测与分类
Rozsa, Z., Madaras, Á., Wei, Q., Lu, X., Golarits, M., Yuan, H., Sziranyi, T., Hamzaoui, R.
Abstract
LiDAR point clouds are widely used in autonomous driving and consist of large numbers of 3D points captured at high frequency to represent surrounding objects such as vehicles, pedestrians, and traffic signs. While this dense data enables accurate perception, it also increases computational cost and power consumption, which can limit real-time deployment. Existing point cloud sampling methods typically face a trade-off: very fast approaches tend to reduce accuracy, while more accurate methods are computationally expensive. To address this limitation, we propose an efficient learned point cloud simplification method for LiDAR data. The method combines a feature embedding module with an attention-based sampling module to prioritize task-relevant regions and is trained end-to-end. We evaluate the method against farthest point sampling (FPS) and random sampling (RS) on 3D object detection on the KITTI dataset and on object classification across four datasets. The method was consistently faster than FPS and achieved similar, and in some settings better, accuracy, with the largest gains under aggressive downsampling. It was slower than RS, but it typically preserved accuracy more reliably at high sampling ratios.
Chinese Translation
LiDAR点云广泛应用于自动驾驶,包含大量以高频率捕获的3D点,以表示周围的物体,如车辆、行人和交通标志。虽然这种密集数据能够实现准确的感知,但也增加了计算成本和能耗,这可能限制实时部署。现有的点云采样方法通常面临一个权衡:非常快速的方法往往降低准确性,而更准确的方法则计算开销较大。为了解决这一限制,我们提出了一种高效的LiDAR数据学习点云简化方法。该方法结合了特征嵌入模块和基于注意力的采样模块,以优先考虑与任务相关的区域,并进行端到端训练。我们在KITTI数据集的3D物体检测任务以及四个数据集的物体分类任务上评估了该方法,并与最远点采样(FPS)和随机采样(RS)进行了比较。该方法在速度上始终快于FPS,并在某些设置下达到了相似甚至更好的准确性,尤其在激进下采样的情况下表现出最大的优势。尽管其速度慢于RS,但在高采样比率下通常能更可靠地保持准确性。
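A minimal PyTorch sketch of the attention-style sampling step: a small scoring head ranks points by task relevance and the top-k are kept. The paper's feature embedding and attention modules are richer than this; the linear scorer here is a placeholder, and hard top-k would typically be relaxed to a soft weighting during end-to-end training.

import torch
import torch.nn as nn

def attention_sample(points, feats, scorer, k):
    # points: (N, 3) coordinates; feats: (N, D) per-point features.
    # scorer: any module mapping (N, D) -> (N, 1) relevance scores.
    scores = scorer(feats).squeeze(-1)   # (N,)
    idx = scores.topk(k).indices         # keep the k most relevant points
    return points[idx], feats[idx], idx

# Usage with a placeholder linear scorer:
scorer = nn.Linear(64, 1)
pts, f = torch.randn(4096, 3), torch.randn(4096, 64)
sub_pts, sub_f, kept = attention_sample(pts, f, scorer, k=1024)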
cs.CV / 144 / 2603.07604

EmbedTalk: Triplane-Free Talking Head Synthesis using Embedding-Driven Gaussian Deformation

EmbedTalk:基于嵌入驱动的高斯变形的无三平面对话头合成
Saggar, Arpita, Darling, Jonathan C., Sarikaya, Duygu, Hogg, David C.
Abstract
Real-time talking head synthesis increasingly relies on deformable 3D Gaussian Splatting (3DGS) due to its low latency. Tri-planes are the standard choice for encoding Gaussians prior to deformation, since they provide a continuous domain with explicit spatial relationships. However, tri-plane representations are limited by grid resolution and approximation errors introduced by projecting 3D volumetric fields onto 2D subspaces. Recent work has shown the superiority of learnt embeddings for driving temporal deformations in 4D scene reconstruction. We introduce EmbedTalk, which shows how such embeddings can be leveraged for modelling speech deformations in talking head synthesis. Through comprehensive experiments, we show that EmbedTalk outperforms existing 3DGS-based methods in rendering quality, lip synchronisation, and motion consistency, while remaining competitive with state-of-the-art generative models. Moreover, replacing the tri-plane encoding with learnt embeddings enables significantly more compact models that achieve over 60 FPS on a mobile GPU (RTX 2060 6 GB). Our code will be placed in the public domain on acceptance.
Chinese Translation
实时对话头合成因其低延迟而越来越依赖于可变形的三维高斯点云(3DGS)。三平面是变形前对高斯进行编码的标准选择,因为它们提供了具有明确空间关系的连续域。然而,三平面表示受到网格分辨率和将三维体积场投影到二维子空间所引入的近似误差的限制。最近的研究表明,学习的嵌入在4D场景重建中的时间变形驱动方面具有优越性。我们提出了EmbedTalk,展示了如何利用这些嵌入来建模对话头合成中的语音变形。通过全面的实验,我们表明EmbedTalk在渲染质量、唇部同步和运动一致性方面优于现有的基于3DGS的方法,同时在性能上与最先进的生成模型保持竞争力。此外,用学习的嵌入替代三平面编码,使得模型显著更为紧凑,并在移动GPU(RTX 2060 6 GB)上实现超过60 FPS的帧率。我们的代码将在接受后公开。
cs.CV / 145 / 2603.07614

Looking Into the Water by Unsupervised Learning of the Surface Shape

通过无监督学习水面形状观察水下
Lifschitz, Ori, Treibitz, Tali, Rosenbaum, Dan
Abstract
We address the problem of looking into the water from the air, where we seek to remove image distortions caused by refractions at the water surface. Our approach is based on modeling the different water surface structures at various points in time, assuming the underlying image is constant. To this end, we propose a model that consists of two neural-field networks. The first network predicts the height of the water surface at each spatial position and time, and the second network predicts the image color at each position. Using both networks, we reconstruct the observed sequence of images and can therefore use unsupervised training. We show that using implicit neural representations with periodic activation functions (SIREN) leads to effective modeling of the surface height spatio-temporal signal and its derivative, as required for image reconstruction. Using both simulated and real data we show that our method outperforms the latest unsupervised image restoration approach. In addition, it provides an estimate of the water surface.
Chinese Translation
我们解决了从空中透过水面观察水下的问题,旨在消除由于水面折射造成的图像失真。我们的方法基于对不同时间点水面结构的建模,假设底层图像是恒定的。为此,我们提出了一个由两个神经场网络组成的模型。第一个网络预测每个空间位置和时间的水面高度,第二个网络预测每个位置的图像颜色。通过使用这两个网络,我们重建观察到的图像序列,因此可以使用无监督训练。我们展示了使用具有周期性激活函数的隐式神经表示(SIREN)能够有效建模水面高度时空信号及其导数,这对于图像重建是必要的。通过使用模拟数据和真实数据,我们证明了我们的方法优于最新的无监督图像恢复方法。此外,它还提供了水面高度的估计。
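SIREN layers are the one component here with a standard, well-known form: y = sin(w0 * (Wx + b)), with the initialization scheme from the original SIREN paper. A minimal PyTorch sketch of a height-field network h(x, y, t) (layer widths are arbitrary; the paper's architecture details are not reproduced):

import torch
import torch.nn as nn

class SirenLayer(nn.Module):
    # y = sin(w0 * (Wx + b)). Periodic activations fit high-frequency
    # signals and keep well-behaved derivatives -- useful here because
    # refraction displacements depend on spatial gradients of the height.
    def __init__(self, d_in, d_out, w0=30.0, first=False):
        super().__init__()
        self.lin, self.w0 = nn.Linear(d_in, d_out), w0
        bound = 1.0 / d_in if first else (6.0 / d_in) ** 0.5 / w0
        nn.init.uniform_(self.lin.weight, -bound, bound)  # SIREN init scheme
    def forward(self, x):
        return torch.sin(self.w0 * self.lin(x))

# Height field h(x, y, t): 3 coordinates in, scalar surface height out.
height_net = nn.Sequential(SirenLayer(3, 128, first=True),
                           SirenLayer(128, 128),
                           nn.Linear(128, 1))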
cs.CV / 146 / 2603.07619

Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models

过度思考导致幻觉:追踪视觉语言模型中的混淆因子传播
Shoby, Abin, Huy, Ta Duc, Nguyen, Tuan Dung, Ho, Minh Khoi, Chen, Qi, Hengel, Anton van den, Nguyen, Phi Le, Verjans, Johan W., Phan, Vu Minh Hieu
Abstract
Vision Language models (VLMs) often hallucinate non-existent objects. Detecting hallucination is analogous to detecting deception: a single final statement is insufficient, one must examine the underlying reasoning process. Yet existing detectors rely mostly on final-layer signals. Attention-based methods assume hallucinated tokens exhibit low attention, while entropy-based ones use final-step uncertainty. Our analysis reveals the opposite: hallucinated objects can exhibit peaked attention due to contextual priors; and models often express high confidence because intermediate layers have already converged to an incorrect hypothesis. We show that the key to hallucination detection lies within the model's thought process, not its final output. By probing decoder layers, we uncover a previously overlooked behavior, overthinking: models repeatedly revise object hypotheses across layers before committing to an incorrect answer. Once the model latches onto a confounded hypothesis, it can propagate through subsequent layers, ultimately causing hallucination. To capture this behavior, we introduce the Overthinking Score, a metric to measure how many competing hypotheses the model entertains and how unstable these hypotheses are across layers. This score significantly improves hallucination detection: 78.9% F1 on MSCOCO and 71.58% on AMBER.
Chinese Translation
视觉语言模型(VLMs)常常会幻觉出不存在的物体。检测幻觉类似于检测欺骗:单一的最终陈述是不够的,必须检查其背后的推理过程。然而,现有的检测器主要依赖于最终层的信号。基于注意力的方法假设幻觉的标记表现出低注意力,而基于熵的方法则使用最终步骤的不确定性。我们的分析揭示了相反的情况:幻觉物体可能由于上下文先验而表现出峰值注意力;而模型往往表现出高置信度,因为中间层已经收敛到一个错误的假设。我们表明,幻觉检测的关键在于模型的思维过程,而非其最终输出。通过探测解码器层,我们发现了一种先前被忽视的行为:过度思考。模型在各层之间反复修正物体假设,然后才确定一个错误的答案。一旦模型锁定在一个混淆的假设上,它就会在后续层中传播,最终导致幻觉。为了捕捉这种行为,我们引入了过度思考评分(Overthinking Score),这一指标用于衡量模型考虑了多少竞争假设以及这些假设在各层之间的不稳定程度。该评分显著提高了幻觉检测的效果:在MSCOCO上达到78.9%的F1分数,在AMBER上达到71.58%。
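The exact metric is not given in the abstract, but a logit-lens-style proxy conveys the idea: decode the top candidate at each decoder layer, then combine how many distinct hypotheses appear with how often the top hypothesis flips. A hypothetical Python sketch:

import torch

def overthinking_proxy(layer_logits):
    # layer_logits: (L, V) vocabulary logits for one object token, one row
    # per decoder layer (obtained by applying the output head to each
    # layer's hidden state). An illustrative proxy, not the paper's exact
    # Overthinking Score.
    tops = layer_logits.argmax(dim=-1)          # top hypothesis per layer
    n_hypotheses = tops.unique().numel()        # how many candidates appear
    flips = ((tops[1:] != tops[:-1]).float().mean().item()
             if tops.numel() > 1 else 0.0)      # inter-layer instability
    return n_hypotheses * (1.0 + flips)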
cs.CV / 147 / 2603.07625

Duala: Dual-Level Alignment of Subjects and Stimuli for Cross-Subject fMRI Decoding

Duala:跨主体 fMRI 解码中主体与刺激的双层对齐
Li, Shumeng, Guo, Jintao, Zhang, Jian, Zhou, Yulin, Cao, Luyang, Shi, Yinghuan
Abstract
Cross-subject visual decoding aims to reconstruct visual experiences from brain activity across individuals, enabling more scalable and practical brain-computer interfaces. However, existing methods often suffer from degraded performance when adapting to new subjects with limited data, as they struggle to preserve both the semantic consistency of stimuli and the alignment of brain responses. To address these challenges, we propose Duala, a dual-level alignment framework designed to achieve stimulus-level consistency and subject-level alignment in fMRI-based cross-subject visual decoding. (1) At the stimulus level, Duala introduces a semantic alignment and relational consistency strategy that preserves intra-class similarity and inter-class separability, maintaining clear semantic boundaries during adaptation. (2) At the subject level, a distribution-based feature perturbation mechanism is developed to capture both global and subject-specific variations, enabling adaptation to individual neural representations without overfitting. Experiments on the Natural Scenes Dataset (NSD) demonstrate that Duala effectively improves alignment across subjects. Remarkably, even when fine-tuned with only about one hour of fMRI data, Duala achieves over 81.1% image-to-brain retrieval accuracy and consistently outperforms existing fine-tuning strategies in both retrieval and reconstruction. Our code is available at https://github.com/ShumengLI/Duala.
Chinese Translation
跨主体视觉解码旨在从个体的脑活动中重建视觉体验,从而实现更具可扩展性和实用性的脑机接口。然而,现有方法在适应新主体时常常因数据有限而性能下降,因为它们难以同时保持刺激的语义一致性和脑响应的对齐。为了解决这些挑战,我们提出了Duala,一个旨在实现基于 fMRI 的跨主体视觉解码中刺激级一致性和主体级对齐的双层对齐框架。(1) 在刺激层面,Duala 引入了一种语义对齐和关系一致性策略,保持类内相似性和类间可分性,在适应过程中维护清晰的语义边界。(2) 在主体层面,开发了一种基于分布的特征扰动机制,以捕捉全局和主体特定的变化,使得在不发生过拟合的情况下适应个体神经表征。对自然场景数据集 (Natural Scenes Dataset, NSD) 的实验表明,Duala 有效改善了主体间的对齐。值得注意的是,即使仅用约一小时的 fMRI 数据进行微调,Duala 也能实现超过 81.1% 的图像到脑的检索准确率,并在检索和重建方面始终优于现有的微调策略。我们的代码可在 https://github.com/ShumengLI/Duala 获取。
cs.CV / 148 / 2603.07630

Real-Time Glottis Detection Framework via Spatial-decoupled Feature Learning for Nasal Transnasal Intubation

基于空间解耦特征学习的实时声门检测框架用于鼻气管插管
Liu, Jinyu, Zhang, Gaoyang, Zhou, Yang, Hao, Ruoyi, Zhang, Yang, Ren, Hongliang
Abstract
Nasotracheal intubation (NTI) is a vital procedure in emergency airway management, where rapid and accurate glottis detection is essential to ensure patient safety. However, existing machine assisted visual detection systems often rely on high performance computational resources and suffer from significant inference delays, which limits their applicability in time critical and resource constrained scenarios. To overcome these limitations, we propose Mobile GlottisNet, a lightweight and efficient glottis detection framework designed for real time inference on embedded and edge devices. The model incorporates structural awareness and spatial alignment mechanisms, enabling robust glottis localization under complex anatomical and visual conditions. We implement a hierarchical dynamic thresholding strategy to enhance sample assignment, and introduce an adaptive feature decoupling module based on deformable convolution to support dynamic spatial reconstruction. A cross layer dynamic weighting scheme further facilitates the fusion of semantic and detail features across multiple scales. Experimental results demonstrate that the model, with a size of only 5MB on both our PID dataset and Clinical datasets, achieves inference speeds of over 62 FPS on devices and 33 FPS on edge platforms, showing great potential in the application of emergency NTI.
Chinese Translation
鼻气管插管(NTI)是紧急气道管理中的一项重要程序,其中快速准确的声门检测对于确保患者安全至关重要。然而,现有的机器辅助视觉检测系统通常依赖于高性能的计算资源,并且存在显著的推理延迟,这限制了它们在时间紧迫和资源受限场景中的适用性。为克服这些限制,我们提出了Mobile GlottisNet,这是一种轻量高效的声门检测框架,旨在实现嵌入式和边缘设备上的实时推理。该模型结合了结构感知和空间对齐机制,使其能够在复杂的解剖和视觉条件下实现稳健的声门定位。我们实施了一种分层动态阈值策略以增强样本分配,并引入了一种基于可变形卷积的自适应特征解耦模块,以支持动态空间重建。跨层动态加权方案进一步促进了多尺度语义和细节特征的融合。实验结果表明,该模型在我们的PID数据集和临床数据集上,仅占用5MB的大小,实现了在设备上超过62帧每秒(FPS)和在边缘平台上超过33帧每秒(FPS)的推理速度,显示出在紧急NTI应用中的巨大潜力。
cs.CV / 149 / 2603.07645

Evaluating Synthetic Data for Baggage Trolley Detection in Airport Logistics

评估机场物流中行李推车检测的合成数据
Taibi, Abdeldjalil, Badlis, Mohmoud, Bensalem, Amina, Zouilekh, Belkacem, Brahimi, Mohammed
Abstract
Efficient luggage trolley management is critical for reducing congestion and ensuring asset availability in modern airports. Automated detection systems face two main challenges. First, strict security and privacy regulations limit large-scale data collection. Second, existing public datasets lack the diversity, scale, and annotation quality needed to handle dense, overlapping trolley arrangements typical of real-world operations. To address these limitations, we introduce a synthetic data generation pipeline based on a high-fidelity Digital Twin of Algiers International Airport using NVIDIA Omniverse. The pipeline produces richly annotated data with oriented bounding boxes, capturing complex trolley formations, including tightly nested chains. We evaluate YOLO-OBB using five training strategies: real-only, synthetic-only, linear probing, full fine-tuning, and mixed training. This allows us to assess how synthetic data can complement limited real-world annotations. Our results show that mixed training with synthetic data and only 40 percent of real annotations matches or exceeds the full real-data baseline, achieving 0.94 mAP@50 and 0.77 mAP@50-95, while reducing annotation effort by 25 to 35 percent. Multi-seed experiments confirm strong reproducibility with a standard deviation below 0.01 on mAP@50, demonstrating the practical effectiveness of synthetic data for automated trolley detection.
Chinese Translation
高效的行李推车管理对于减少现代机场的拥堵和确保资产可用性至关重要。自动化检测系统面临两个主要挑战。首先,严格的安全和隐私法规限制了大规模数据收集。其次,现有的公共数据集缺乏处理现实操作中典型的密集重叠推车排列所需的多样性、规模和注释质量。为了解决这些限制,我们引入了一种基于使用NVIDIA Omniverse构建的阿尔及尔国际机场高保真数字孪生(Digital Twin)的合成数据生成管道。该管道生成丰富注释的数据,包含定向边界框,捕捉复杂的推车排列,包括紧密嵌套的链条。我们使用五种训练策略评估YOLO-OBB:仅真实数据、仅合成数据、线性探测、完全微调和混合训练。这使我们能够评估合成数据如何补充有限的真实世界注释。我们的结果表明,使用合成数据和仅40%的真实注释进行混合训练的效果与完整真实数据基线相匹配或超过,达到了0.94 mAP@50和0.77 mAP@50-95,同时减少了25%到35%的注释工作量。多随机种子实验证实了较强的可重复性,在mAP@50上的标准偏差低于0.01,证明了合成数据在自动推车检测中的实际有效性。
cs.CV / 150 / 2603.07652

GLASS: Graph and Vision-Language Assisted Semantic Shape Correspondence

GLASS:图形与视觉语言辅助的语义形状对应
Xiao, Qinfeng, Mei, Guofeng, Liu, Qilong, Yi, Chenyuan, Poiesi, Fabio, Zhang, Jian, Yang, Bo, Kit-lun, Yick
Abstract
Establishing dense correspondence across 3D shapes is crucial for fundamental downstream tasks, including texture transfer, shape interpolation, and robotic manipulation. However, learning these mappings without manual supervision remains a formidable challenge, particularly under severe non-isometric deformations and in inter-class settings where geometric cues are ambiguous. Conventional functional map methods, while elegant, typically struggle in these regimes due to their reliance on isometry. To address this, we present GLASS, a framework that bridges the gap by integrating geometric spectral analysis with rich semantic priors from vision-language foundation models. GLASS introduces three key innovations: (i) a view-consistent strategy that enables robust multi-view visual feature extraction from powerful vision foundation models; (ii) the injection of language embeddings into vertex descriptors via zero-shot 3D segmentation, capturing high-level part semantics; and (iii) a graph-assisted contrastive loss that enforces structural consistency between regions (e.g., source's "head" $\leftrightarrow$ target's "head") by leveraging geodesic and topological relationships between regions. This design allows GLASS to learn globally coherent and semantically consistent maps without ground-truth supervision. Extensive experiments demonstrate that GLASS achieves state-of-the-art performance across all regimes, maintaining high accuracy on standard near-isometric tasks while significantly advancing performance in challenging settings. Specifically, it achieves average geodesic errors of 0.21, 4.5, and 5.6 on the inter-class benchmark SNIS and non-isometric benchmarks SMAL and TOPKIDS, reducing errors from URSSM baselines of 0.49, 6.0, and 8.9 by 57%, 25%, and 37%, respectively.
Chinese Translation
在3D形状之间建立密集对应关系对于基础下游任务至关重要,包括纹理转移、形状插值和机器人操作。然而,在没有人工监督的情况下学习这些映射仍然是一个巨大的挑战,特别是在严重的非等距变形和几何线索模糊的类间设置中。传统的功能映射方法虽然优雅,但通常在这些情况下表现不佳,因为它们依赖于等距性。为了解决这个问题,我们提出了GLASS,一个通过将几何谱分析与来自视觉语言基础模型的丰富语义先验结合起来的框架。GLASS引入了三个关键创新:(i)一种视图一致策略,使得能够从强大的视觉基础模型中稳健地提取多视图视觉特征;(ii)通过零样本3D分割将语言嵌入注入到顶点描述符中,从而捕捉高层次的部件语义;(iii)一种图辅助对比损失,通过利用区域之间的测地线和拓扑关系,强制区域之间(例如,源的“头”与目标的“头”)的结构一致性。这种设计使得GLASS能够在没有真实标签监督的情况下学习全局一致且语义上连贯的映射。大量实验表明,GLASS在所有情况下都达到了最先进的性能,在标准近等距任务中保持高准确性,同时在具有挑战性的设置中显著提高了性能。具体而言,它在类间基准SNIS和非等距基准SMAL和TOPKIDS上分别达到了0.21、4.5和5.6的平均测地线误差,分别将URSSM基线的0.49、6.0和8.9的误差减少了57%、25%和37%。
cs.CV / 151 / 2603.07659

Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework

通过自我批判推理框架提升视觉语言模型的测试时鲁棒性
Tang, Kaihua, Qi, Jiaxin, Ou, Jinli, Zheng, Yuhua, Huang, Jianqiang
Abstract
The emergence of Large Language Models (LLMs) has driven rapid progress in multi-modal learning, particularly in the development of Large Vision-Language Models (LVLMs). However, existing LVLM training paradigms place excessive reliance on the LLM component, giving rise to two critical robustness challenges: language bias and language sensitivity. To address both issues simultaneously, we propose a novel Self-Critical Inference (SCI) framework that extends Visual Contrastive Decoding by conducting multi-round counterfactual reasoning through both textual and visual perturbations. This process further introduces a new strategy for improving robustness by scaling the number of counterfactual rounds. Moreover, we also observe that failure cases of LVLMs differ significantly across models, indicating that fixed robustness benchmarks may not be able to capture the true reliability of LVLMs. To this end, we propose the Dynamic Robustness Benchmark (DRBench), a model-specific evaluation framework targeting both language bias and sensitivity issues. Extensive experiments show that SCI consistently outperforms baseline methods on DRBench, and that increasing the number of inference rounds further boosts robustness beyond existing single-step counterfactual reasoning methods.
Chinese Translation
大型语言模型(LLMs)的出现推动了多模态学习的快速进展,特别是在大型视觉语言模型(LVLMs)的发展上。然而,现有的LVLM训练范式过于依赖LLM组件,导致了两个关键的鲁棒性挑战:语言偏见和语言敏感性。为同时解决这两个问题,我们提出了一种新颖的自我批判推理(Self-Critical Inference, SCI)框架,通过文本和视觉扰动进行多轮反事实推理,从而扩展了视觉对比解码。该过程进一步引入了一种通过增加反事实轮次来提升鲁棒性的新策略。此外,我们还观察到LVLM的失败案例在不同模型之间存在显著差异,这表明固定的鲁棒性基准可能无法真实反映LVLM的可靠性。为此,我们提出了动态鲁棒性基准(Dynamic Robustness Benchmark, DRBench),这是一个针对语言偏见和敏感性问题的模型特定评估框架。大量实验表明,SCI在DRBench上始终优于基线方法,并且增加推理轮次进一步提升了鲁棒性,超越了现有的单步反事实推理方法。
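The single-round building block SCI extends is Visual Contrastive Decoding, whose usual form is (1 + a) * logits_clean - a * logits_perturbed. A minimal Python sketch of scaling it to multiple counterfactual rounds by averaging (the aggregation rule is an assumption; SCI's exact scheme is not given in the abstract):

import torch

def multi_round_contrastive_logits(logits_clean, perturbed_logits, alpha=1.0):
    # logits_clean: (V,) next-token logits from the unmodified input.
    # perturbed_logits: list of (V,) logits, one per counterfactual round
    # (textual or visual perturbation). Each round contrasts away what the
    # model would say without faithful grounding; rounds are then averaged.
    contrasted = [(1 + alpha) * logits_clean - alpha * lp
                  for lp in perturbed_logits]
    return torch.stack(contrasted).mean(dim=0)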
cs.CV / 152 / 2603.07660

Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

Holi-Spatial:将视频流演变为整体的3D空间智能
Gao, Yuanyuan, Li, Hao, Liu, Yifei, Ji, Xinhao, Gong, Yuning, Liao, Yuanjun, Liu, Fangfu, Zhang, Manyuan, Yang, Yuchen, Xu, Dan, Yang, Xue, Huang, Huaxi, Zhang, Hongjie, Liu, Ziwei, Sun, Xiao, Zhang, Dingwen, Zhong, Zhihang
Abstract
The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset, constructed from raw video inputs without human intervention, using the proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with corresponding spatial Question-Answer (QA) pairs. Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. Holi-Spatial demonstrates exceptional performance in data curation quality, significantly outperforming existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks using this dataset has also led to substantial improvements in model performance.
Chinese Translation
空间智能的追求基本上依赖于对大规模、细粒度3D数据的访问。然而,现有的方法主要通过从有限数量的手动标注数据集中生成问答(QA)对来构建空间理解基准,而不是系统地从原始网络数据中标注新的大规模3D场景。因此,它们的可扩展性受到严重限制,模型性能也受到这些狭窄策划数据集中固有的领域差距的进一步影响。在本研究中,我们提出了Holi-Spatial,这是第一个完全自动化的大规模空间感知多模态数据集,采用所提出的数据策划流程,从原始视频输入中构建,无需人工干预。Holi-Spatial支持多层次的空间监督,从几何准确的3D高斯点云(3D Gaussian Splatting, 3DGS)重建及其渲染深度图,到对象级和关系语义标注,以及相应的空间问答(QA)对。遵循一个原则性和系统化的流程,我们进一步构建了Holi-Spatial-4M,这是第一个大规模、高质量的3D语义数据集,包含12K个优化的3DGS场景、1.3M个2D掩膜、320K个3D边界框、320K个实例标题、1.2M个3D定位实例和1.2M个空间QA对,涵盖多种几何、关系和语义推理任务。Holi-Spatial在数据策划质量上表现出色,显著优于现有的前馈和每场景优化方法,在ScanNet、ScanNet++和DL3DV等数据集上表现优异。此外,使用该数据集对视觉-语言模型(Vision-Language Models, VLMs)进行空间推理任务的微调也显著提高了模型性能。
cs.CV / 153 / 2603.07664

Ref-DGS: Reflective Dual Gaussian Splatting

Ref-DGS:反射双高斯点云
Fan, Ningjing, Wang, Yiqun, Yan, Dongming, Wonka, Peter
Abstract
Reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near-field specular reflections or rely on explicit ray tracing at substantial computational cost. We present Ref-DGS, a reflective dual Gaussian splatting framework that addresses this trade-off by decoupling surface reconstruction from specular reflection within an efficient rasterization-based pipeline. Ref-DGS introduces a dual Gaussian scene representation consisting of geometry Gaussians and complementary local reflection Gaussians that capture near-field specular interactions without explicit ray tracing, along with a global environment reflection field for modeling far-field specular reflections. To predict specular radiance, we further propose a lightweight, physically-aware adaptive mixing shader that fuses global and local reflection features. Experiments demonstrate that Ref-DGS achieves state-of-the-art performance on reflective scenes while training substantially faster than ray-based Gaussian methods.
Chinese Translation
反射外观,特别是强烈且通常近场的镜面反射,对准确的表面重建和新视图合成构成了根本挑战。现有的高斯点云方法要么无法建模近场镜面反射,要么依赖于显式光线追踪,导致计算成本高昂。我们提出了Ref-DGS,一种反射双高斯点云框架,通过在高效的光栅化管道中将表面重建与镜面反射解耦,解决了这一权衡。Ref-DGS引入了一种双高斯场景表示,包括几何高斯和互补的局部反射高斯,能够在不进行显式光线追踪的情况下捕捉近场镜面相互作用,同时还引入了一个全局环境反射场以建模远场镜面反射。为了预测镜面辐射,我们进一步提出了一种轻量级、物理感知的自适应混合着色器,用于融合全局和局部反射特征。实验表明,Ref-DGS在反射场景上实现了最先进的性能,同时训练速度显著快于基于光线的高斯方法。
cs.CV / 154 / 2603.07667

FusionRegister: Every Infrared and Visible Image Fusion Deserves Registration

FusionRegister:每个红外与可见光图像融合都应进行配准
Bian, Congcong, Ma, Haolong, Li, Hui, Shen, Zhongwei, Luo, Xiaoqing, Song, Xiaoning, Wu, Xiao-Jun
Abstract
Spatial registration across different visual modalities is a critical but formidable step in multi-modality image fusion for real-world perception. Although several methods are proposed to address this issue, the existing registration-based fusion methods typically require extensive pre-registration operations, limiting their efficiency. To overcome these limitations, a general cross-modality registration method guided by visual priors is proposed for the infrared and visible image fusion task, termed FusionRegister. Firstly, FusionRegister achieves robustness by learning cross-modality misregistration representations rather than forcing alignment of all differences, ensuring stable outputs even under challenging input conditions. Moreover, FusionRegister demonstrates strong generality by operating directly on fused results, where misregistration is explicitly represented and effectively handled, enabling seamless integration with diverse fusion methods while preserving their intrinsic properties. In addition, its efficiency is further enhanced by using the backbone fusion method as a natural visual-prior provider, which guides the registration process to focus only on mismatch regions, thereby avoiding redundant operations. Extensive experiments on three datasets demonstrate that FusionRegister not only inherits the fusion quality of state-of-the-art methods, but also delivers superior detail alignment and robustness, making it highly suitable for infrared and visible image fusion. The code will be available at https://github.com/bociic/FusionRegister.
Chinese Translation
不同视觉模态之间的空间配准是多模态图像融合在现实感知中的一个关键但艰巨的步骤。尽管已有多种方法被提出以解决这一问题,但现有的基于配准的融合方法通常需要大量的预配准操作,从而限制了它们的效率。为克服这些局限,本文提出了一种由视觉先验引导的通用跨模态配准方法,称为FusionRegister,旨在红外与可见光图像融合任务中应用。首先,FusionRegister通过学习跨模态错配表示来实现鲁棒性,而不是强制对齐所有差异,确保在具有挑战性的输入条件下仍能输出稳定的结果。此外,FusionRegister通过直接在融合结果上操作,展示了强大的通用性,其中错配被明确表示并有效处理,使其能够与多种融合方法无缝集成,同时保留其内在特性。此外,其效率通过作为自然视觉先验提供者来进一步增强,指导配准过程仅关注不匹配区域,从而避免冗余操作。在三个数据集上的大量实验表明,FusionRegister不仅继承了最先进方法的融合质量,还提供了优越的细节对齐和鲁棒性,使其非常适合红外与可见光图像融合方法。代码将发布在 https://github.com/bociic/FusionRegister。
cs.CV / 155 / 2603.07690

FrameVGGT: Frame Evidence Rolling Memory for streaming VGGT

FrameVGGT:用于流式 VGGT 的帧证据滚动记忆
Xu, Zhisong, Oishi, Takeshi
Abstract
Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception but suffer from unbounded KV-cache growth, which limits deployment over long streams. We revisit bounded-memory streaming from the perspective of geometric support. In geometry-driven reasoning, memory quality depends not only on how many tokens are retained, but also on whether the retained memory still preserves sufficiently coherent local support. This suggests that token-level retention may become less suitable under fixed budgets, as it can thin the evidence available within each contributing frame and make subsequent fusion more sensitive to weakly aligned history. Motivated by this observation, we propose FrameVGGT, a frame-driven rolling explicit-memory framework that treats each frame's incremental KV contribution as a coherent evidence block. FrameVGGT summarizes each block into a compact prototype and maintains a fixed-capacity mid-term bank of complementary frame blocks under strict budgets, with an optional lightweight anchor tier for rare prolonged degradation. Across long-sequence 3D reconstruction, video depth estimation, and camera pose benchmarks, FrameVGGT achieves favorable accuracy--memory trade-offs under bounded memory, while maintaining more stable geometry over long streams.
Chinese Translation
流式视觉几何变换器(如 StreamVGGT)能够实现强大的在线 3D 感知,但面临无限制的 KV 缓存增长问题,这限制了其在长时间流中的部署。我们从几何支持的角度重新审视有界内存流式处理。在以几何为驱动的推理中,内存质量不仅取决于保留了多少个 token,还取决于保留的内存是否仍然保持足够一致的局部支持。这表明,在固定预算下,基于 token 的保留可能变得不太适合,因为它可能稀释每个贡献帧中可用的证据,并使后续融合对弱对齐历史更加敏感。基于这一观察,我们提出了 FrameVGGT,一种基于帧驱动的滚动显式内存框架,将每帧的增量 KV 贡献视为一致的证据块。FrameVGGT 将每个块总结为一个紧凑的原型,并在严格预算下维护一个固定容量的中期互补帧块库,同时为罕见的长期退化提供可选的轻量级锚层。在长序列 3D 重建、视频深度估计和相机姿态基准测试中,FrameVGGT 在有界内存下实现了良好的准确性与内存的权衡,同时在长时间流中保持了更稳定的几何结构。
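A minimal Python sketch of the bounded, frame-level memory idea: each frame's incremental key/value block is summarized into a prototype and stored in a fixed-capacity bank, so memory stays constant over arbitrarily long streams. Mean pooling and FIFO eviction below are simplifying assumptions; the paper summarizes blocks and selects complementary frames more carefully.

from collections import deque
import torch

class FrameRollingMemory:
    def __init__(self, capacity=32):
        # Fixed budget: once full, the oldest frame prototype is evicted.
        self.bank = deque(maxlen=capacity)

    def add_frame(self, k_block, v_block):
        # k_block, v_block: (T, D) key/value tokens contributed by one
        # frame, kept together as a coherent evidence block and summarized
        # into a single (1, D) prototype (mean pooling as a placeholder).
        self.bank.append((k_block.mean(dim=0, keepdim=True),
                          v_block.mean(dim=0, keepdim=True)))

    def memory(self):
        # Concatenate prototypes into the (<= capacity, D) KV memory that
        # attention over history reads from.
        if not self.bank:
            return None, None
        ks, vs = zip(*self.bank)
        return torch.cat(ks, dim=0), torch.cat(vs, dim=0)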
cs.CV / 156 / 2603.07694

Compressed-Domain-Aware Online Video Super-Resolution

压缩域感知的在线视频超分辨率
Wang, Yuhang, Li, Hai, Hou, Shujuan, Dong, Zhetao, Yang, Xiaoyao
Abstract
In bandwidth-limited online video streaming, videos are usually downsampled and compressed. Although recent online video super-resolution (online VSR) approaches achieve promising results, they are still compute-intensive and fall short of real-time processing at higher resolutions, due to complex motion estimation for alignment and redundant processing of consecutive frames. To address these issues, we propose a compressed-domain-aware network (CDA-VSR) for online VSR, which utilizes compressed-domain information, including motion vectors, residual maps, and frame types to balance quality and efficiency. Specifically, we propose a motion-vector-guided deformable alignment module that uses motion vectors for coarse warping and learns only local residual offsets for fine-tuned adjustments, thereby maintaining accuracy while reducing computation. Then, we utilize a residual map gated fusion module to derive spatial weights from residual maps, suppressing mismatched regions and emphasizing reliable details. Further, we design a frame-type-aware reconstruction module for adaptive compute allocation across frame types, balancing accuracy and efficiency. On the REDS4 dataset, our CDA-VSR surpasses the state-of-the-art method TMP, with a maximum PSNR improvement of 0.13 dB while delivering more than double the inference speed. The code will be released at https://github.com/sspBIT/CDA-VSR.
Chinese Translation
在带宽受限的在线视频流中,视频通常会被下采样和压缩。尽管最近的在线视频超分辨率(online VSR)方法取得了令人鼓舞的结果,但由于复杂的运动估计用于对齐以及对连续帧的冗余处理,它们仍然计算密集,无法在更高分辨率下实现实时处理。为了解决这些问题,我们提出了一种压缩域感知网络(CDA-VSR)用于在线 VSR,该网络利用压缩域信息,包括运动矢量、残差图和帧类型,以平衡质量和效率。具体而言,我们提出了一种运动矢量引导的可变形对齐模块,该模块利用运动矢量进行粗略扭曲,并仅学习局部残差偏移进行精细调整,从而在减少计算的同时保持准确性。然后,我们利用残差图门控融合模块从残差图中导出空间权重,抑制不匹配区域并强调可靠细节。此外,我们设计了一种帧类型感知重建模块,以便在不同帧类型之间进行自适应计算分配,平衡准确性和效率。在 REDS4 数据集上,我们的 CDA-VSR 超越了最先进的方法 TMP,最大 PSNR 提升了 0.13 dB,同时提供了超过两倍的推理速度。代码将发布在 https://github.com/sspBIT/CDA-VSR。
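The coarse-alignment step reuses the codec's motion vectors instead of estimating flow, which is where most of the speedup comes from. A PyTorch sketch of motion-vector-guided warping via grid_sample (the residual-offset refinement the paper adds on top is not shown):

import torch
import torch.nn.functional as F

def warp_with_motion_vectors(prev_feat, mv):
    # prev_feat: (B, C, H, W) features of the previous frame.
    # mv: (B, 2, H, W) per-pixel (dx, dy) displacements decoded from the
    # compressed bitstream -- free alignment information, no flow network.
    b, _, h, w = prev_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(prev_feat.device)  # (2, H, W)
    coords = base.unsqueeze(0) + mv                                   # (B, 2, H, W)
    # Normalize absolute pixel coordinates to [-1, 1] for grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                              # (B, H, W, 2)
    return F.grid_sample(prev_feat, grid, align_corners=True)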
cs.CV / 157 / 2603.07697

Learning Context-Adaptive Motion Priors for Masked Motion Diffusion Models with Efficient Kinematic Attention Aggregation

为具有高效运动学注意力聚合的掩蔽运动扩散模型学习上下文自适应的运动先验
Jiang, Junkun, Chen, Jie, Au, Ho Yin, Xiang, Jingyu
Abstract
Vision-based motion capture solutions often struggle with occlusions, which result in the loss of critical joint information and hinder accurate 3D motion reconstruction. Other wearable alternatives also suffer from noisy or unstable data, often requiring extensive manual cleaning and correction to achieve reliable results. To address these challenges, we introduce the Masked Motion Diffusion Model (MMDM), a diffusion-based generative reconstruction framework that enhances incomplete or low-confidence motion data using partially available high-quality reconstructions within a Masked Autoencoder architecture. Central to our design is the Kinematic Attention Aggregation (KAA) mechanism, which enables efficient, deep, and iterative encoding of both joint-level and pose-level features, capturing structural and temporal motion patterns essential for task-specific reconstruction. We focus on learning context-adaptive motion priors, specialized structural and temporal features extracted by the same reusable architecture, where each learned prior emphasizes different aspects of motion dynamics and is specifically efficient for its corresponding task. This enables the architecture to adaptively specialize without altering its structure. Such versatility allows MMDM to efficiently learn motion priors tailored to scenarios such as motion refinement, completion, and in-betweening. Extensive evaluations on public benchmarks demonstrate that MMDM achieves strong performance across diverse masking strategies and task settings. The source code is available at https://github.com/jjkislele/MMDM.
Chinese Translation
基于视觉的运动捕捉解决方案常常面临遮挡问题,这导致关键关节信息的丢失,妨碍准确的三维运动重建。其他可穿戴替代方案也受到噪声或不稳定数据的困扰,通常需要大量的手动清理和修正才能获得可靠的结果。为了解决这些挑战,我们提出了掩蔽运动扩散模型(Masked Motion Diffusion Model, MMDM),这是一种基于扩散的生成重建框架,利用部分可用的高质量重建来增强不完整或低置信度的运动数据,采用掩蔽自编码器架构。我们设计的核心是运动学注意力聚合(Kinematic Attention Aggregation, KAA)机制,它能够高效、深度和迭代地编码关节级和姿态级特征,捕捉对任务特定重建至关重要的结构和时间运动模式。我们专注于学习适应上下文的运动先验,这些先验是由相同的可重用架构提取的专业结构和时间特征,每个学习到的先验强调运动动态的不同方面,并且对于其对应的任务特别高效。这使得架构能够在不改变其结构的情况下自适应地专业化。这种多功能性使得MMDM能够高效地学习针对运动精细化、补全和插值等场景量身定制的运动先验。在公共基准上的广泛评估表明,MMDM在多种掩蔽策略和任务设置中都表现出色。源代码可在 https://github.com/jjkislele/MMDM 获取。
cs.CV / 158 / 2603.07700

TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

TDM-R1:通过非可微奖励强化少步扩散模型
Luo, Yihong, Hu, Tianyang, Luo, Weijian, Tang, Jing
Abstract
While few-step generative models have enabled powerful image and video generation at significantly lower cost, generic reinforcement learning (RL) paradigms for few-step models remain an unsolved problem. Existing RL approaches for few-step diffusion models strongly rely on back-propagating through differentiable reward models, thereby excluding the majority of important real-world reward signals, e.g., non-differentiable rewards such as humans' binary likeness, object counts, etc. To properly incorporate non-differentiable rewards to improve few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we developed practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' ability with generic rewards. We conduct extensive experiments ranging from text-rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art reinforcement learning performances on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1
Chinese Translation
尽管少步生成模型在显著降低成本的情况下实现了强大的图像和视频生成,但针对少步模型的通用强化学习(RL)范式仍然是一个未解决的问题。现有的针对少步扩散模型的RL方法强烈依赖于通过可微奖励模型进行反向传播,从而排除了大多数重要的现实世界奖励信号,例如人类的二元喜好、物体计数等非可微奖励。为了正确地将非可微奖励纳入以改善少步生成模型,我们提出了TDM-R1,这是一种基于领先的少步模型轨迹分布匹配(Trajectory Distribution Matching, TDM)构建的新型强化学习范式。TDM-R1将学习过程解耦为替代奖励学习和生成器学习。此外,我们开发了实用的方法以获取沿着TDM的确定性生成轨迹的每步奖励信号,从而形成了一种统一的RL后训练方法,显著提升了少步模型在通用奖励下的能力。我们进行了广泛的实验,包括文本渲染、视觉质量和偏好对齐。所有结果表明,TDM-R1是一个强大的少步文本到图像模型的强化学习范式,在领域内和领域外的指标上均实现了最先进的强化学习性能。此外,TDM-R1还有效地扩展到最近强大的Z-Image模型,仅用4个NFE便持续超越其100-NFE和少步变体。项目页面:https://github.com/Luo-Yihong/TDM-R1
cs.CV / 159 / 2603.07704

PARSE: Part-Aware Relational Spatial Modeling

PARSE:部件感知关系空间建模
Bai, Yinuo, Xu, Peijun, Shao, Kuixiang, Jiao, Yuyang, Zhang, Jingxuan, Yao, Kaixin, Gu, Jiayuan, Yu, Jingyi
Abstract
Inter-object relations underpin spatial intelligence, yet existing representations -- linguistic prepositions or object-level scene graphs -- are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatially grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.
Chinese Translation
物体间的关系是空间智能的基础,但现有的表示方式——语言介词或物体级场景图——过于粗糙,无法明确指定哪些区域实际上支持、包含或接触彼此,从而导致模糊和物理不一致的布局。为了解决这些模糊性,需要一种部件级的表述;因此,我们提出了PARSE,一个明确建模物体部件如何相互作用以确定可行且空间上扎根的场景配置的框架。PARSE的核心是部件中心组装图(Part-centric Assembly Graph, PAG),它编码特定物体部件之间的几何关系,以及一个部件感知空间配置求解器,将这些关系转化为几何约束,以组装无碰撞、物理有效的场景。使用PARSE,我们构建了PARSE-10K,一个由真实图像布局先验和经过精心策划的部件标注形状数据库构建的10,000个3D室内场景的数据集,每个场景都具有密集的接触结构和部件级接触图。通过这种结构化、空间扎根的监督,在PARSE-10K上微调Qwen3-VL可以获得更强的物体级布局推理和更准确的部件级关系理解;此外,在3D生成模型中利用PAG作为结构先验,可以生成具有显著改善的物理真实感和结构复杂性的场景。这些结果表明,PARSE显著推动了基于几何的空间推理,并支持生成物理一致的3D场景。
cs.CV / 160 / 2603.07751

3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

3ViewSense:基于正投影视图的空间与心理视角推理在视觉-语言模型中的应用
Zhan, Shaoxiong, Lai, Yanlin, Liu, Zheng, Lin, Hai, Li, Shen, Cai, Xiaodong, Lin, Zijian, Huang, Wen, Zheng, Hai-Tao
Abstract
Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce 3ViewSense, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.
Chinese Translation
当前的大型语言模型已达到了奥林匹克级别的逻辑能力,然而视觉-语言模型在基本的空间任务(如块计数)上却表现不佳。这种能力的不匹配揭示了一个关键的“空间智能缺口”,即模型未能从二维观察中构建一致的三维心理表征。我们通过诊断分析揭示了这一缺口,显示瓶颈在于缺乏视图一致的空间接口,而非视觉特征不足或推理能力弱。为了解决这一问题,我们提出了3ViewSense,一个将空间推理建立在正投影视图上的框架。借鉴工程认知,我们提出了一种“模拟与推理”(Simulate-and-Reason)机制,将复杂场景分解为标准的正投影,以解决几何模糊性。通过将自我中心的感知与这些他中心的参考对齐,我们的方法促进了明确的心理旋转和重建。在空间推理基准测试中的实证结果表明,我们的方法显著优于现有基线,在遮挡较多的计数和视图一致的空间推理上均取得了一致的提升。该框架还提高了空间描述的稳定性和一致性,为多模态系统中更强的空间智能提供了可扩展的路径。
cs.CV / 161 / 2603.07758

AR2-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos

AR2-4FV:用于固定视角视频中长期指代定位的锚定引用与再识别
Yan, Teng, Liu, Yihan, Chen, Jiongxu, Wang, Teng, Li, Jiaqi, Zhong, Bingzhuo
Abstract
Long-term language-guided referring in fixed-view videos is challenging: the referent may be occluded or leave the scene for long intervals and later re-enter, while framewise referring pipelines drift as re-identification (ReID) becomes unreliable. AR2-4FV leverages background stability for long-term referring. An offline Anchor Bank is distilled from static background structures; at inference, the text query is aligned with this bank to produce an Anchor Map that serves as persistent semantic memory when the referent is absent. An anchor-based re-entry prior accelerates re-capture upon return, and a lightweight ReID-Gating mechanism maintains identity continuity using displacement cues in the anchor frame. The system predicts per-frame bounding boxes without assuming the target is visible in the first frame or explicitly modeling appearance variations. AR2-4FV achieves +10.3% Re-Capture Rate (RCR) improvement and -24.2% Re-Capture Latency (RCL) reduction over the best baseline, and ablation studies further confirm the benefits of the Anchor Map, re-entry prior, and ReID-Gating.
Chinese Translation
在固定视角视频中,长期的语言引导引用是一项挑战:被引用对象可能会被遮挡或长时间离开场景后再重新进入,而逐帧引用管道在再识别(ReID)变得不可靠时会出现漂移。AR2-4FV 利用背景稳定性实现长期引用。一个离线的锚点库从静态背景结构中提取;在推理时,文本查询与该库对齐,生成一个锚点图,用作在被引用对象缺失时的持久语义记忆。基于锚点的重新进入先验加速了返回时的重新捕获,而轻量级的 ReID-Gating 机制利用锚点帧中的位移线索保持身份连续性。该系统在不假设目标在第一帧可见或明确建模外观变化的情况下,预测每帧的边界框。AR2-4FV 在最佳基线的基础上实现了 +10.3% 的重新捕获率(RCR)提升和 -24.2% 的重新捕获延迟(RCL)减少,消融研究进一步确认了锚点图、重新进入先验和 ReID-Gating 的优势。
cs.CV / 162 / 2603.07759

DECADE: A Temporally-Consistent Unsupervised Diffusion Model for Enhanced Rb-82 Dynamic Cardiac PET Image Denoising

DECADE:一种时间一致的无监督扩散模型用于增强Rb-82动态心脏PET图像去噪
Zhou, Yinchi, Guo, Liang, Xie, Huidong, Du, Yuexi, Wang, Ashley, Xia, Menghua, Yu, Tian, Fazzone-Chettiar, Ramesh, Weyman, Christopher, Spottiswoode, Bruce, Panin, Vladimir, Shi, Kuangyu, Miller, Edward J., Feher, Attila, Sinusas, Albert J., Dvornek, Nicha C., Liu, Chi
Abstract
Rb-82 dynamic cardiac PET imaging is widely used for the clinical diagnosis of coronary artery disease (CAD), but its short half-life results in high noise levels that degrade dynamic frame quality and parametric imaging. The lack of paired clean-noisy training data, rapid tracer kinetics, and frame-dependent noise variations further limit the effectiveness of existing deep learning denoising methods. We propose DECADE (A Temporally-Consistent Unsupervised Diffusion model for Enhanced Rb-82 CArdiac PET DEnoising), an unsupervised diffusion framework that generalizes across early- to late-phase dynamic frames. DECADE incorporates temporal consistency during both training and iterative sampling, using noisy frames as guidance to preserve quantitative accuracy. The method was trained and evaluated on datasets acquired from Siemens Vision 450 and Siemens Biograph Vision Quadra scanners. On the Vision 450 dataset, DECADE consistently produced high-quality dynamic and parametric images with reduced noise while preserving myocardial blood flow (MBF) and myocardial flow reserve (MFR). On the Quadra dataset, using 15%-count images as input and full-count images as reference, DECADE outperformed UNet-based and other diffusion models in image quality and K1/MBF quantification. The proposed framework enables effective unsupervised denoising of Rb-82 dynamic cardiac PET without paired training data, supporting clearer visualization while maintaining quantitative integrity.
Chinese Translation
Rb-82动态心脏PET成像广泛用于冠状动脉疾病(CAD)的临床诊断,但其短半衰期导致高噪声水平,降低动态帧质量和参数成像的效果。缺乏配对的干净-噪声训练数据、快速的示踪剂动力学以及帧依赖的噪声变化进一步限制了现有深度学习去噪方法的有效性。我们提出了DECADE(用于增强Rb-82心脏PET去噪的时间一致无监督扩散模型),这是一个无监督的扩散框架,能够在早期到晚期动态帧之间进行泛化。DECADE在训练和迭代采样过程中都融入了时间一致性,利用噪声帧作为指导,以保持定量准确性。该方法在从西门子Vision 450和西门子Biograph Vision Quadra扫描仪获取的数据集上进行了训练和评估。在Vision 450数据集中,DECADE始终生成高质量的动态和参数图像,同时降低噪声,保持心肌血流(MBF)和心肌血流储备(MFR)。在Quadra数据集中,使用15%计数图像作为输入,完整计数图像作为参考,DECADE在图像质量和K1/MBF定量方面优于基于UNet的其他扩散模型。所提出的框架能够有效地对Rb-82动态心脏PET进行无监督去噪,无需配对训练数据,支持更清晰的可视化,同时保持定量完整性。
cs.CV / 163 / 2603.07769

MedQ-Deg: A Multidimensional Benchmark for Evaluating MLLMs Across Medical Image Quality Degradations

MedQ-Deg:用于评估医疗图像质量退化的多维基准
Liu, Jiyao, Ning, Junzhi, Ma, Chenglong, Qu, Wanying, Shen, Jianghan, Luo, Siqi, Wei, Jinjie, Ye, Jin, Li, Pengze, Li, Tianbin, Lin, Jiashi, Shan, Hongming, Luo, Xinzhe, Liu, Xiaohong, Liu, Lihao, He, Junjun, Xu, Ningsheng
Abstract
Despite impressive performance on standard benchmarks, multimodal large language models (MLLMs) face critical challenges in real-world clinical environments where medical images inevitably suffer various quality degradations. Existing benchmarks exhibit two key limitations: (1) absence of large-scale, multidimensional assessment across medical image quality gradients and (2) no systematic confidence calibration analysis. To address these gaps, we present MedQ-Deg, a comprehensive benchmark for evaluating medical MLLMs under image quality degradations. MedQ-Deg provides multi-dimensional evaluation spanning 18 distinct degradation types, 30 fine-grained capability dimensions, and 7 imaging modalities, with 24,894 question-answer pairs. Each degradation is implemented at 3 severity degrees, calibrated by expert radiologists. We further introduce Calibration Shift metric, which quantifies the gap between a model's perceived confidence and actual performance to assess metacognitive reliability under degradation. Our comprehensive evaluation of 40 mainstream MLLMs reveals several critical findings: (1) overall model performance degrades systematically as degradation severity increases, (2) models universally exhibit the AI Dunning-Kruger Effect, maintaining inappropriately high confidence despite severe accuracy collapse, and (3) models display markedly differentiated behavioral patterns across capability dimensions, imaging modalities, and degradation types. We hope MedQ-Deg drives progress toward medical MLLMs that are robust and trustworthy in real clinical practice.
Chinese Translation
尽管在标准基准上表现出色,多模态大语言模型(MLLMs)在真实临床环境中面临着严峻挑战,因为医疗图像不可避免地会遭受各种质量退化。现有基准存在两个关键局限:(1)缺乏针对医疗图像质量梯度的大规模多维评估;(2)缺乏系统的置信度校准分析。为了解决这些问题,我们提出了MedQ-Deg,这是一个全面的基准,用于评估医疗MLLMs在图像质量退化下的表现。MedQ-Deg提供了跨越18种不同退化类型、30个细粒度能力维度和7种成像模式的多维评估,共包含24,894对问答。每种退化在3个严重程度下实施,由专业放射科医师进行校准。我们进一步引入了Calibration Shift指标,用于量化模型感知置信度与实际表现之间的差距,以评估在退化条件下的元认知可靠性。我们对40个主流MLLMs的全面评估揭示了几个关键发现:(1)随着退化严重程度的增加,模型整体性能系统性下降;(2)模型普遍表现出AI邓宁-克鲁格效应,尽管准确性严重崩溃,却仍保持不当的高置信度;(3)模型在能力维度、成像模式和退化类型之间表现出明显不同的行为模式。我们希望MedQ-Deg能够推动医疗MLLMs在真实临床实践中变得更加稳健和可信。
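
Editor's note: the Calibration Shift metric is described as the gap between a model's perceived confidence and its actual performance. A minimal sketch of one plausible formulation, mean confidence minus accuracy; the benchmark's exact definition may differ:

    import numpy as np

    def calibration_shift(confidences, correct):
        """Gap between mean self-reported confidence and actual accuracy.
        Positive values indicate overconfidence (the 'AI Dunning-Kruger' pattern)."""
        confidences = np.asarray(confidences, dtype=float)  # in [0, 1]
        correct = np.asarray(correct, dtype=float)          # 1 if answer correct
        return confidences.mean() - correct.mean()

    # Example: a model stays confident while accuracy collapses under degradation.
    print(calibration_shift([0.9, 0.85, 0.95, 0.9], [1, 0, 0, 0]))  # ~0.65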
cs.CV / 164 / 2603.07774

Geometric Knowledge-Assisted Federated Dual Knowledge Distillation Approach Towards Remote Sensing Satellite Imagery

几何知识辅助的联邦双重知识蒸馏方法用于遥感卫星图像分析
Zou, Luyao, Pan, Fei, Li, Jueying, Tun, Yan Kyaw, Adhikary, Apurba, Han, Zhu, Oh, Hayoung
Abstract
Federated learning (FL) has recently become a promising solution for analyzing remote sensing satellite imagery (RSSI). However, the large scale and inherent data heterogeneity of images collected from multiple satellites, where the local data distribution of each satellite differs from the global one, present significant challenges to effective model training. To address this issue, we propose a Geometric Knowledge-Guided Federated Dual Knowledge Distillation (GK-FedDKD) framework for RSSI analysis. In our approach, each local client first distills a teacher encoder (TE) from multiple student encoders (SEs) trained with unlabeled augmented data. The TE is then connected with a shared classifier to form a teacher network (TN) that supervises the training of a new student network (SN). The intermediate representations of the TN are used to compute local covariance matrices, which are aggregated at the server to generate global geometric knowledge (GGK). This GGK is subsequently employed for local embedding augmentation to further guide SN training. We also design a novel loss function and a multi-prototype generation pipeline to stabilize the training process. Evaluation over multiple datasets showcases that the proposed GK-FedDKD approach is superior to the considered state-of-the-art baselines, e.g., the proposed approach with the Swin-T backbone surpasses previous SOTA approaches by an average 68.89% on the EuroSAT dataset.
Chinese Translation
联邦学习(FL)最近成为分析遥感卫星图像(RSSI)的一个有前景的解决方案。然而,从多个卫星收集的图像的规模庞大及其固有的数据异质性,使得每个卫星的本地数据分布与全局数据分布存在差异,这给有效模型训练带来了重大挑战。为了解决这一问题,我们提出了一种用于RSSI分析的几何知识引导的联邦双重知识蒸馏(GK-FedDKD)框架。在我们的方法中,每个本地客户端首先从使用无标签增强数据训练的多个学生编码器(SEs)中提取一个教师编码器(TE)。然后,TE与共享分类器连接,形成一个教师网络(TN),该网络监督新学生网络(SN)的训练。TN的中间表示用于计算本地协方差矩阵,这些矩阵在服务器上聚合以生成全局几何知识(GGK)。随后,GGK被用于本地嵌入增强,以进一步指导SN的训练。我们还设计了一种新颖的损失函数和多原型生成管道,以稳定训练过程。在多个数据集上的评估表明,所提出的GK-FedDKD方法优于所考虑的最先进基线,例如,采用Swin-T主干网络时,所提方法在EuroSAT数据集上比之前的SOTA方法平均提高了68.89%。
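
Editor's note: the server-side step that turns local covariance matrices into global geometric knowledge can be sketched with a simple sample-weighted average; the abstract does not specify the aggregation rule, so this is an assumption:

    import numpy as np

    def local_covariance(features):
        # features: (n_samples, d) intermediate representations from the TN
        return np.cov(features, rowvar=False)

    def aggregate_global_geometric_knowledge(client_feats):
        # Sample-weighted average of local covariance matrices (one assumption
        # among several plausible aggregation rules).
        total = sum(f.shape[0] for f in client_feats)
        d = client_feats[0].shape[1]
        global_cov = np.zeros((d, d))
        for f in client_feats:
            global_cov += (f.shape[0] / total) * local_covariance(f)
        return global_cov

    rng = np.random.default_rng(0)
    clients = [rng.normal(size=(50, 8)), rng.normal(size=(80, 8))]
    print(aggregate_global_geometric_knowledge(clients).shape)  # (8, 8)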
cs.CV / 165 / 2603.07776

Parameterized Brushstroke Style Transfer

参数化笔触风格迁移
Meleti, Uma, Huang, Siyu
Abstract
Computer Vision-based Style Transfer techniques have been used for many years to represent artistic style. However, most contemporary methods have been restricted to the pixel domain; in other words, the style transfer approach has been modifying the image pixels to incorporate artistic style. Real artistic work, in contrast, is made of brush strokes of different colors on a canvas, and pixel-based approaches are unnatural for representing such images. Hence, this paper discusses a style transfer method that represents the image in the brush stroke domain instead of the RGB domain, which yields better visual quality than pixel-based methods.
Chinese Translation
基于计算机视觉的风格迁移技术已经被应用多年用于表现艺术风格。然而,大多数现代方法仅限于像素域;换句话说,风格迁移方法通过修改图像像素来融入艺术风格。然而,真正的艺术作品是由不同颜色的笔触在画布上构成的。基于像素的方法在表现这些图像时显得不自然。因此,本文讨论了一种风格迁移方法,该方法在笔触域中表示图像,而不是在RGB域中,这在视觉效果上优于基于像素的方法。
cs.CV / 166 / 2603.07786

OrdinalBench: A Benchmark Dataset for Diagnosing Generalization Limits in Ordinal Number Understanding of Vision-Language Models

OrdinalBench:用于诊断视觉-语言模型序数理解泛化限制的基准数据集
Tozaki, Yusuke, Miyamori, Hisashi
Abstract
Vision-Language Models (VLMs) have advanced across multimodal benchmarks but still show clear gaps in ordinal number understanding, i.e., the ability to track relative positions and generalize to large indices. We present OrdinalBench, a diagnostic benchmark that standardizes ordinal number understanding as an evaluation task for VLMs. The core task is N-th object identification, defined by a starting reference and traversal rule. Task difficulty is controlled along three axes: (i) ordinal magnitude, from small numbers to extreme cases up to 300; (ii) arrangement complexity, from single loops to maze-like paths; and (iii) object count. The benchmark provides 39,000 question-answer pairs, each annotated with a ground-truth reasoning trajectory and balanced across difficulty levels for controlled large-scale testing. Beyond answer-only evaluation, our framework requires models to generate structured stepwise traces of the counting process and provides an open evaluation toolkit that measures both final accuracy and step-level path consistency. Zero-shot evaluations of GPT-5, Gemini 2.5 Flash Lite, Qwen2.5-VL, InternVL3.5, and Molmo reveal sharp degradation under large-ordinal and complex-path conditions, highlighting weak generalization despite strong scores on standard multimodal tasks. By framing ordinal number understanding as a core target, OrdinalBench provides a reproducible benchmark and diagnostic framework for developing VLMs with stronger sequential reasoning. All data and code are available at https://ordinalbench.github.io/
Chinese Translation
视觉-语言模型(VLMs)在多模态基准测试中取得了进展,但在序数理解方面仍然存在明显差距,即跟踪相对位置和推广到大索引的能力。我们提出了OrdinalBench,这是一个将序数理解标准化为VLMs评估任务的诊断基准。核心任务是第N个(N-th)对象识别,通过起始参考和遍历规则进行定义。任务难度沿三个轴线进行控制:(i)序数大小,从小数字到极端情况(最高可达300);(ii)排列复杂性,从单一循环到迷宫般的路径;(iii)对象数量。该基准提供39,000个问答对,每个问答对都附有真实的推理轨迹,并在难度水平上进行平衡,以便进行受控的大规模测试。除了仅评估答案外,我们的框架要求模型生成结构化的逐步计数过程跟踪,并提供一个开放的评估工具包,测量最终准确性和逐步路径一致性。对GPT-5、Gemini 2.5 Flash Lite、Qwen2.5-VL、InternVL3.5和Molmo的零样本评估显示,在大序数和复杂路径条件下表现急剧下降,尽管在标准多模态任务中得分较高,但仍然突显出泛化能力的不足。通过将序数理解作为核心目标,OrdinalBench提供了一个可重复的基准和诊断框架,以开发具有更强顺序推理能力的VLMs。所有数据和代码可在https://ordinalbench.github.io/获取。
cs.CV / 167 / 2603.07789

SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation

SGI:用于高效紧凑的大型图像表示的结构化二维高斯
Pan, Zixuan, Tang, Kaiyuan, Xia, Jun, Qin, Yifan, Gu, Lin, Wang, Chaoli, Chen, Jianxu, Shi, Yiyu
Abstract
2D Gaussian Splatting has emerged as a novel image representation technique that can support efficient rendering on low-end devices. However, scaling to high-resolution images requires optimizing and storing millions of unstructured Gaussian primitives independently, leading to slow convergence and redundant parameters. To address this, we propose Structured Gaussian Image (SGI), a compact and efficient framework for representing high-resolution images. SGI decomposes a complex image into multi-scale local spaces defined by a set of seeds. Each seed corresponds to a spatially coherent region and, together with lightweight multi-layer perceptrons (MLPs), generates structured implicit 2D neural Gaussians. This seed-based formulation imposes structural regularity on otherwise unstructured Gaussian primitives, which facilitates entropy-based compression at the seed level to reduce the total storage. However, optimizing seed parameters directly on high-resolution images is a challenging and non-trivial task. Therefore, we designed a multi-scale fitting strategy that refines the seed representation in a coarse-to-fine manner, substantially accelerating convergence. Quantitative and qualitative evaluations demonstrate that SGI achieves up to 7.5x compression over prior non-quantized 2D Gaussian methods and 1.6x over quantized ones, while also delivering 1.6x and 6.5x faster optimization, respectively, without degrading, and often improving, image fidelity. Code is available at https://github.com/zx-pan/SGI.
Chinese Translation
二维高斯溅射(2D Gaussian Splatting)作为一种新兴的图像表示技术,能够支持在低端设备上高效渲染。然而,扩展到高分辨率图像需要独立优化和存储数百万个非结构化的高斯原语,这导致收敛速度缓慢和冗余参数。为此,我们提出了结构化高斯图像(Structured Gaussian Image,SGI),这是一个紧凑且高效的高分辨率图像表示框架。SGI将复杂图像分解为由一组种子定义的多尺度局部空间。每个种子对应一个空间一致的区域,并与轻量级多层感知器(MLPs)结合,生成结构化的隐式二维神经高斯。基于种子的公式对原本非结构化的高斯原语施加了结构性规则,这有助于在种子层面进行基于熵的压缩,从而减少总存储。然而,直接在高分辨率图像上优化种子参数是一项具有挑战性且非平凡的任务。因此,我们设计了一种多尺度拟合策略,以粗到细的方式精炼种子表示,显著加速了收敛。定量和定性评估表明,SGI在非量化的二维高斯方法上实现了高达7.5倍的压缩,在量化方法上实现了1.6倍的压缩,同时在优化速度上分别提高了1.6倍和6.5倍,而没有降低图像保真度,且通常会改善图像质量。代码可在 https://github.com/zx-pan/SGI 获取。
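
Editor's note: the underlying primitive, evaluating an image as a weighted sum of anisotropic 2D Gaussians, can be sketched directly. A minimal grayscale version; real splatting pipelines add per-primitive color, opacity, and tiled rasterization, and SGI additionally generates its Gaussians from seeds and MLPs:

    import numpy as np

    def render_gaussians(H, W, means, covs, weights):
        """Evaluate a weighted sum of 2D Gaussians on an H x W pixel grid."""
        ys, xs = np.mgrid[0:H, 0:W]
        pix = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(float)
        img = np.zeros(H * W)
        for mu, cov, w in zip(means, covs, weights):
            d = pix - mu
            inv = np.linalg.inv(cov)
            maha = np.einsum('ni,ij,nj->n', d, inv, d)  # squared Mahalanobis
            img += w * np.exp(-0.5 * maha)
        return img.reshape(H, W)

    img = render_gaussians(
        32, 32,
        means=[(10.0, 12.0), (22.0, 20.0)],
        covs=[np.diag([4.0, 9.0]), [[6.0, 2.0], [2.0, 3.0]]],
        weights=[1.0, 0.5],
    )
    print(img.shape, img.max())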
cs.CV / 168 / 2603.07794

4DRC-OCC: Robust Semantic Occupancy Prediction Through Fusion of 4D Radar and Camera

4DRC-OCC:通过融合4D雷达和摄像头实现稳健的语义占用预测
Ninfa, David, Palffy, Andras, Caesar, Holger
Abstract
Autonomous driving requires robust perception across diverse environmental conditions, yet 3D semantic occupancy prediction remains challenging under adverse weather and lighting. In this work, we present the first study combining 4D radar and camera data for 3D semantic occupancy prediction. Our fusion leverages the complementary strengths of both modalities: 4D radar provides reliable range, velocity, and angle measurements in challenging conditions, while cameras contribute rich semantic and texture information. We further show that integrating depth cues from camera pixels enables lifting 2D images to 3D, improving scene reconstruction accuracy. Additionally, we introduce a fully automatically labeled dataset for training semantic occupancy models, substantially reducing reliance on costly manual annotation. Experiments demonstrate the robustness of 4D radar across diverse scenarios, highlighting its potential to advance autonomous vehicle perception.
Chinese Translation
自主驾驶需要在多样的环境条件下实现稳健的感知,但在恶劣天气和光照条件下,3D语义占用预测仍然具有挑战性。在本研究中,我们首次结合4D雷达和摄像头数据进行3D语义占用预测。我们的融合方法利用了两种传感器的互补优势:4D雷达在复杂条件下提供可靠的距离、速度和角度测量,而摄像头则提供丰富的语义和纹理信息。我们进一步展示了从摄像头像素中集成深度线索的能力,使得2D图像能够提升至3D,从而提高场景重建的准确性。此外,我们引入了一个完全自动标注的数据集,用于训练语义占用模型,显著减少了对昂贵人工标注的依赖。实验表明,4D雷达在多种场景下的稳健性,突显了其推动自主车辆感知的潜力。
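
Editor's note: lifting 2D pixels to 3D using depth cues is standard pinhole unprojection. A minimal sketch with illustrative camera intrinsics:

    import numpy as np

    def unproject(u, v, depth, K):
        """Lift pixel (u, v) with metric depth to a 3D point in camera coordinates."""
        fx, fy = K[0, 0], K[1, 1]
        cx, cy = K[0, 2], K[1, 2]
        x = (u - cx) / fx * depth
        y = (v - cy) / fy * depth
        return np.array([x, y, depth])

    K = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])   # illustrative intrinsics
    print(unproject(400, 300, depth=10.0, K=K))  # [1.6, 1.2, 10.0]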
cs.CV / 169 / 2603.07799

MWM: Mobile World Models for Action-Conditioned Consistent Prediction

MWM:用于动作条件一致预测的移动世界模型
Yan, Han, Xiang, Zishang, Zhang, Zeyu, Tang, Hao
Abstract
World models enable planning in imagined future predicted space, offering a promising framework for embodied navigation. However, existing navigation world models often lack action-conditioned consistency, so visually plausible predictions can still drift under multi-step rollout and degrade planning. Moreover, efficient deployment requires few-step diffusion inference, but existing distillation methods do not explicitly preserve rollout consistency, creating a training-inference mismatch. To address these challenges, we propose MWM, a mobile world model for planning-based image-goal navigation. Specifically, we introduce a two-stage training framework that combines structure pretraining with Action-Conditioned Consistency (ACC) post-training to improve action-conditioned rollout consistency. We further introduce Inference-Consistent State Distillation (ICSD) for few-step diffusion distillation with improved rollout consistency. Our experiments on benchmark and real-world tasks demonstrate consistent gains in visual fidelity, trajectory accuracy, planning success, and inference efficiency. Code: https://github.com/AIGeeksGroup/MWM. Website: https://aigeeksgroup.github.io/MWM.
Chinese Translation
世界模型使得在想象的未来预测空间中进行规划成为可能,为具身导航提供了一个有前景的框架。然而,现有的导航世界模型往往缺乏动作条件一致性,因此在多步展开下,视觉上合理的预测仍然可能出现漂移,从而降低规划效果。此外,高效的部署需要少步扩散推理,但现有的蒸馏方法并未明确保持展开一致性,造成训练与推理之间的不匹配。为了解决这些挑战,我们提出了MWM,一种用于基于规划的图像目标导航的移动世界模型。具体而言,我们引入了一个两阶段的训练框架,将结构预训练与动作条件一致性(Action-Conditioned Consistency, ACC)后训练相结合,以提高动作条件的展开一致性。我们进一步提出了推理一致状态蒸馏(Inference-Consistent State Distillation, ICSD),用于具有改进展开一致性的少步扩散蒸馏。我们在基准和真实世界任务上的实验表明,在视觉保真度、轨迹准确性、规划成功率和推理效率方面均取得了一致的提升。代码:https://github.com/AIGeeksGroup/MWM。网站:https://aigeeksgroup.github.io/MWM。
cs.CV / 170 / 2603.07815

HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration

HybridStitch:用于扩散加速的像素和时间步级模型拼接
Sun, Desen, Hon, Jason, Zhang, Jintao, Liu, Sihang
Abstract
Diffusion models have demonstrated a remarkable ability in Text-to-Image (T2I) generation applications. Despite the advanced generation output, they suffer from heavy computation overhead, especially for large models that contain tens of billions of parameters. Prior work has illustrated that replacing part of the denoising steps with a smaller model still maintains the generation quality. However, these methods only focus on saving computation for some timesteps, ignoring the difference in compute demand within one timestep. In this work, we propose HybridStitch, a new T2I generation paradigm that treats generation like editing. Specifically, we introduce a hybrid stage that jointly incorporates both the large model and the small model. HybridStitch separates the entire image into two regions: one that is relatively easy to render, enabling an early transition to the smaller model, and another that is more complex and therefore requires refinement by the large model. HybridStitch employs the small model to construct a coarse sketch while exploiting the large model to edit and refine the complex regions. According to our evaluation, HybridStitch achieves 1.83$\times$ speedup on Stable Diffusion 3, which is faster than all existing mixture of model methods.
Chinese Translation
扩散模型在文本到图像(Text-to-Image, T2I)生成应用中展现出了卓越的能力。尽管生成输出先进,但它们面临着巨大的计算开销,尤其是对于包含数百亿参数的大型模型。先前的研究表明,用较小的模型替代部分去噪步骤仍能保持生成质量。然而,这些方法仅关注于节省某些时间步的计算,忽视了在一个时间步内计算需求的差异。在本研究中,我们提出了HybridStitch,一种将生成视为编辑的新型T2I生成范式。具体而言,我们引入了一个混合阶段,该阶段同时结合了大型模型和小型模型。HybridStitch将整个图像分为两个区域:一个相对容易渲染的区域,使得可以提前过渡到小型模型,另一个则更为复杂,因此需要大型模型进行细化。HybridStitch利用小型模型构建粗略草图,同时利用大型模型对复杂区域进行编辑和细化。根据我们的评估,HybridStitch在Stable Diffusion 3上实现了1.83倍的加速,速度超过了所有现有的模型混合方法。
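
Editor's note: the per-timestep stitching idea, a small model everywhere plus a large model on a complex-region mask, can be sketched schematically. The callables and the mask below are stand-ins, not the paper's components:

    import numpy as np

    def hybrid_denoise_step(x_t, t, large_step, small_step, complex_mask):
        """One stitched denoising step: small model everywhere, large model
        only on regions flagged as complex. complex_mask: (H, W) in {0, 1}."""
        coarse = small_step(x_t, t)                 # cheap sketch of the frame
        refined = large_step(x_t, t)                # expensive, high-fidelity
        m = complex_mask[None, :, :]                # broadcast over channels
        return m * refined + (1.0 - m) * coarse

    # Stand-in "models" and a mask marking the right half as complex.
    small = lambda x, t: x * 0.9
    large = lambda x, t: x * 0.5
    x = np.ones((3, 8, 8))
    mask = np.zeros((8, 8)); mask[:, 4:] = 1.0
    out = hybrid_denoise_step(x, t=10, large_step=large, small_step=small,
                              complex_mask=mask)
    print(out[0, 0])  # 0.9 on the easy half, 0.5 on the complex half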
cs.CV / 171 / 2603.07817

Tracking Phenological Status and Ecological Interactions in a Hawaiian Cloud Forest Understory using Low-Cost Camera Traps and Visual Foundation Models

利用低成本相机陷阱和视觉基础模型跟踪夏威夷云雾森林下层植被的物候状态和生态相互作用
Meyers, Luke, Potlapally, Anirudh, Chen, Yuyan, Long, Mike, Berger-Wolf, Tanya, Subramoni, Hari, Megret, Remi, Rubenstein, Daniel
Abstract
Plant phenology, the study of cyclical events such as leafing out, flowering, or fruiting, has wide ecological impacts but is broadly understudied, especially in the tropics. Image analysis has greatly enhanced remote phenological monitoring, yet capturing phenology at the individual level remains challenging. In this project, we deployed low-cost, animal-triggered camera traps at the Pu'u Maka'ala Natural Area Reserve in Hawaii to simultaneously document shifts in plant phenology and flora-faunal interactions. Using a combination of foundation vision models and traditional computer vision methods, we measure phenological trends from images comparable to on-the-ground observations without relying on supervised learning techniques. These temporally fine-grained phenology measurements from camera-trap images uncover trends that coarser traditional sampling fails to detect. When combined with detailed visitation data detected from images, these trends can begin to elucidate drivers of both plant phenology and animal ecology.
Chinese Translation
植物物候学研究展叶、开花或结果等周期性事件,具有广泛的生态影响,但相关研究仍然相对不足,在热带地区尤甚。图像分析极大地增强了远程物候监测的能力,然而在个体层面捕捉物候变化仍然具有挑战性。在本项目中,我们在夏威夷的Pu'u Maka'ala自然保护区部署了低成本的动物触发相机陷阱,以同时记录植物物候变化和植物-动物相互作用。通过结合基础视觉模型和传统计算机视觉方法,我们从图像中测量物候趋势,这些趋势与地面观察相当,而无需依赖监督学习技术。这些来自相机陷阱图像的时间精细物候测量揭示了传统粗略采样未能检测到的趋势。当结合从图像中检测到的详细到访数据时,这些趋势可以开始阐明植物物候和动物生态的驱动因素。
cs.CV / 172 / 2603.07819

Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression

融合复杂性反转:为何更简单的跨视图模块在草地生物量回归中优于SSMs和跨视图注意力变换器
Mandal, Mridankan
Abstract
Accurate estimation of pasture biomass from agricultural imagery is critical for sustainable livestock management, yet existing methods are limited by the small, imbalanced, and sparsely annotated datasets typical of real world monitoring. In this study, adaptation of vision foundation models to agricultural regression is systematically evaluated on the CSIRO Pasture Biomass benchmark, a 357 image dual view dataset with laboratory validated, component wise ground truth for five biomass targets, through 17 configurations spanning four backbones (EfficientNet-B3 to DINOv3-ViT-L), five cross view fusion mechanisms, and a 4x2 metadata factorial. A counterintuitive principle, termed "fusion complexity inversion", is uncovered: on scarce agricultural data, a two layer gated depthwise convolution (R^2 = 0.903) outperforms cross view attention transformers (0.833), bidirectional SSMs (0.819), and full Mamba (0.793, below the no fusion baseline). Backbone pretraining scale is found to monotonically dominate all architectural choices, with the DINOv2 -> DINOv3 upgrade alone yielding +5.0 R^2 points. Training only metadata (species, state, and NDVI) is shown to create a universal ceiling at R^2 ~ 0.829, collapsing an 8.4 point fusion spread to 0.1 points. Actionable guidelines for sparse agricultural benchmarks are established: backbone quality should be prioritized over fusion complexity, local modules preferred over global alternatives, and features unavailable at inference excluded.
Chinese Translation
从农业影像中准确估计草地生物量对可持续的牲畜管理至关重要,但现有方法受到真实世界监测中典型的小型、不平衡和稀疏标注数据集的限制。本研究基于CSIRO草地生物量基准(一个包含357幅图像的双视图数据集,针对五个生物量目标提供了经过实验室验证的逐成分真实值),通过17种配置系统评估了视觉基础模型在农业回归中的适应性,这些配置涵盖四种主干网络(EfficientNet-B3到DINOv3-ViT-L)、五种跨视图融合机制和一个4x2元数据因子设计。研究揭示了一个反直觉的原则,称为“融合复杂性反转”:在稀缺的农业数据上,两层门控深度卷积(R^2 = 0.903)优于跨视图注意力变换器(0.833)、双向SSMs(0.819)和完整的Mamba(0.793,低于无融合基线)。研究还发现,主干预训练规模单调主导所有架构选择,仅DINOv2到DINOv3的升级就带来了+5.0个R^2点。仅训练元数据(物种、状态和NDVI)即显示出在R^2 ~ 0.829处形成一个普遍上限,将8.4点的融合差异压缩至0.1点。最后,本研究为稀疏农业基准建立了可操作的指导方针:应优先考虑主干质量而非融合复杂性,局部模块优于全局替代方案,并排除推理时不可用的特征。
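
Editor's note: a two-layer gated depthwise convolution for cross-view fusion is compact enough to sketch in PyTorch; channel counts and the gating form are assumptions consistent with the description, not the paper's exact module:

    import torch
    import torch.nn as nn

    class GatedDepthwiseFusion(nn.Module):
        """Fuse two view features with two depthwise conv layers and a sigmoid gate."""
        def __init__(self, channels):
            super().__init__()
            self.dw1 = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1,
                                 groups=2 * channels)
            self.dw2 = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1,
                                 groups=2 * channels)
            self.gate = nn.Conv2d(2 * channels, channels, 1)  # pointwise gate logits
            self.proj = nn.Conv2d(2 * channels, channels, 1)

        def forward(self, view_a, view_b):
            x = torch.cat([view_a, view_b], dim=1)
            x = torch.relu(self.dw2(torch.relu(self.dw1(x))))
            return torch.sigmoid(self.gate(x)) * self.proj(x)

    fusion = GatedDepthwiseFusion(channels=64)
    a, b = torch.randn(1, 64, 16, 16), torch.randn(1, 64, 16, 16)
    print(fusion(a, b).shape)  # torch.Size([1, 64, 16, 16])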
cs.CV / 173 / 2603.07831

Transferable Optimization Network for Cross-Domain Image Reconstruction

可转移优化网络用于跨域图像重建
Chen, Yunmei, Ding, Chi, Ye, Xiaojing
Abstract
We develop a novel transfer learning framework to tackle the challenge of limited training data in image reconstruction problems. The proposed framework consists of two training steps, both of which are formed as bi-level optimizations. In the first step, we train a powerful universal feature-extractor that is capable of learning important knowledge from large, heterogeneous data sets in various domains. In the second step, we train a task-specific domain-adapter for a new target domain or task with only a limited amount of data available for training. Then the composition of the adapter and the universal feature-extractor effectively explores feature which serve as an important component of image regularization for the new domains, and this leads to high-quality reconstruction despite the data limitation issue. We apply this framework to reconstruct under-sampled MR images with limited data by using a collection of diverse data samples from different domains, such as images of other anatomies, measurements of various sampling ratios, and even different image modalities, including natural images. Experimental results demonstrate a promising transfer learning capability of the proposed method.
Chinese Translation
我们开发了一种新颖的迁移学习框架,以应对图像重建问题中训练数据有限的挑战。所提出的框架由两个训练步骤组成,均形成双层优化。在第一步中,我们训练一个强大的通用特征提取器,能够从各个领域的大型异构数据集中学习重要知识。在第二步中,我们为新的目标领域或任务训练一个特定任务的领域适配器,仅使用有限的训练数据。然后,适配器与通用特征提取器的组合有效地探索特征,这些特征作为新领域图像正则化的重要组成部分,从而在数据有限的情况下实现高质量重建。我们将该框架应用于利用来自不同领域的多样化数据样本重建欠采样的MR图像,这些样本包括其他解剖结构的图像、不同采样比率的测量,甚至包括自然图像等不同图像模态。实验结果表明,所提出方法具有良好的迁移学习能力。
cs.CV / 174 / 2603.07832

GazeShift: Unsupervised Gaze Estimation and Dataset for VR

GazeShift:用于虚拟现实的无监督注视估计及数据集
Shapira, Gil, Goldin, Ishay, Artyomov, Evgeny, Kim, Donghoon, Keller, Yosi, Zehngut, Niv
Abstract
Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity - particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze - the first large-scale off-axis gaze estimation dataset for VR - comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye infrared imagery, achieving effective gaze-appearance disentanglement in a compact, real-time model. GazeShift embeddings can be optionally adapted to individual users via lightweight few-shot calibration, achieving a 1.84-degree mean error on VRGaze. On the remote-camera MPIIGaze dataset, the model achieves a 7.15-degree person-agnostic error, doing so with 10x fewer parameters and 35x fewer FLOPs than baseline methods. Deployed natively on a VR headset GPU, inference takes only 5 ms. Combined with demonstrated robustness to illumination changes, these results highlight GazeShift as a label-efficient, real-time solution for VR gaze tracking. Project code and the VRGaze dataset are released at https://github.com/gazeshift3/gazeshift.
Chinese Translation
注视估计在现代虚拟现实(VR)系统中具有重要意义。尽管远程摄像头注视估计取得了显著进展,但VR注视研究仍受到数据稀缺的限制,特别是缺乏使用现代头戴设备典型的偏轴摄像头配置捕获的大规模、准确标注的数据集。注视标注本身也很困难,因为无法保证用户确实注视了预期目标。为了解决这些挑战,我们推出了VRGaze,即第一个大规模偏轴VR注视估计数据集,包含来自68名参与者的210万张近眼红外图像。我们进一步提出了GazeShift,一种基于注意力引导的无监督框架,用于在没有标注数据的情况下学习注视表示。与以往依赖多视图或三维几何的重定向方法不同,GazeShift专为近眼红外图像量身定制,在一个紧凑的实时模型中实现了有效的注视-外观解耦。GazeShift嵌入可以选择性地通过轻量级的少样本校准适应个体用户,在VRGaze上实现了1.84度的平均误差。在远程摄像头MPIIGaze数据集上,该模型实现了7.15度的与个体无关(person-agnostic)的误差,且参数数量比基线方法少10倍,FLOPs少35倍。该模型在VR头显GPU上本地部署,推理仅需5毫秒。结合对光照变化的显著鲁棒性,这些结果突显了GazeShift作为一种标注高效、实时的VR注视跟踪解决方案。项目代码和VRGaze数据集已发布在https://github.com/gazeshift3/gazeshift。
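
Editor's note: lightweight few-shot calibration of a frozen gaze embedding can be as simple as fitting an affine map on a handful of labelled fixations. A least-squares sketch; the paper's calibration head may differ:

    import numpy as np

    def fit_calibration(embeddings, gaze_targets):
        """Fit gaze = W @ e + b on a few calibration samples via least squares."""
        X = np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])  # add bias
        W, *_ = np.linalg.lstsq(X, gaze_targets, rcond=None)
        return W  # (d + 1, 2): last row is the bias

    def apply_calibration(W, embedding):
        return np.append(embedding, 1.0) @ W

    rng = np.random.default_rng(0)
    emb = rng.normal(size=(9, 16))     # embeddings from a 9-point calibration grid
    gaze = rng.normal(size=(9, 2))     # (yaw, pitch) labels
    W = fit_calibration(emb, gaze)
    print(apply_calibration(W, emb[0]), gaze[0])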
cs.CV / 175 / 2603.07839

Training-free Temporal Object Tracking in Surgical Videos

无训练的手术视频时间对象跟踪
Koley, Subhadeep, Kadkhodamohammadi, Abdolrahim, Barbarisi, Santiago, Stoyanov, Danail, Luengo, Imanol
Abstract
Purpose: In this paper, we present a novel approach for online object tracking in laparoscopic cholecystectomy (LC) surgical videos, targeting localisation and tracking of critical anatomical structures and instruments. Our method addresses the challenges of costly pixel-level annotations and label inconsistencies inherent in existing datasets. Methods: Leveraging the inherent object localisation capabilities of pre-trained text-to-image diffusion models, we extract representative features from surgical frames without any training or fine-tuning. Our tracking framework uses these features, along with cross-frame interactions via an affinity matrix inspired by query-key-value attention, to ensure temporal continuity in the tracking process. Results: Through a pilot study, we first demonstrate that diffusion features exhibit superior object localisation and consistent semantics across different decoder levels and temporal frames. Later, we perform extensive experiments to validate the effectiveness of our approach, showcasing its superiority over competitors for the task of temporal object tracking. Specifically, we achieve a per-pixel classification accuracy of 79.19%, mean Jaccard Score of 56.20%, and mean F-Score of 79.48% on the publicly available CholeSeg8K dataset. Conclusion: Our work not only introduces a novel application of text-to-image diffusion models but also contributes to advancing the field of surgical video analysis, offering a promising avenue for accurate and cost-effective temporal object tracking in minimally invasive surgery videos.
Chinese Translation
目的:本文提出了一种新颖的方法,用于在腹腔镜胆囊切除术(LC)手术视频中进行在线对象跟踪,旨在定位和跟踪关键解剖结构和仪器。我们的方法解决了现有数据集中昂贵的像素级注释和标签不一致性所带来的挑战。方法:利用预训练的文本到图像扩散模型固有的对象定位能力,我们在不进行任何训练或微调的情况下,从手术帧中提取代表性特征。我们的跟踪框架使用这些特征,并通过受查询-键-值注意力启发的亲和矩阵进行跨帧交互,以确保跟踪过程中的时间连续性。结果:通过初步研究,我们首先证明了扩散特征在不同解码器层和时间帧之间展现出优越的对象定位能力和一致的语义。随后,我们进行了广泛的实验以验证我们方法的有效性,展示了其在时间对象跟踪任务中的优越性。具体而言,我们在公开可用的CholeSeg8K数据集上实现了79.19%的每像素分类准确率、56.20%的平均Jaccard得分和79.48%的平均F得分。结论:我们的工作不仅引入了文本到图像扩散模型的新应用,还推动了手术视频分析领域的发展,为在微创手术视频中实现准确且具有成本效益的时间对象跟踪提供了有前景的途径。
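
Editor's note: the cross-frame affinity matrix inspired by query-key-value attention amounts to softmax-normalized feature similarities that carry past-frame labels forward. A minimal sketch with illustrative shapes and temperature:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def propagate_labels(query_feats, key_feats, key_labels, tau=0.07):
        """Carry per-pixel labels from a past frame to the current one.
        query_feats: (Nq, d) current-frame features
        key_feats:   (Nk, d) past-frame features
        key_labels:  (Nk, C) one-hot (or soft) masks for the past frame."""
        q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
        k = key_feats / np.linalg.norm(key_feats, axis=1, keepdims=True)
        affinity = softmax(q @ k.T / tau, axis=1)   # (Nq, Nk) attention weights
        return affinity @ key_labels                 # (Nq, C) propagated labels

    rng = np.random.default_rng(0)
    q, k = rng.normal(size=(6, 32)), rng.normal(size=(5, 32))
    labels = np.eye(3)[rng.integers(0, 3, size=5)]
    print(propagate_labels(q, k, labels).sum(axis=1))  # each row sums to 1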
cs.CV / 176 / 2603.07874

Toward Unified Multimodal Representation Learning for Autonomous Driving

面向自主驾驶的统一多模态表征学习
Tao, Ximeng, Filev, Dimitar, Pandey, Gaurav
Abstract
Contrastive Language-Image Pre-training (CLIP) has shown impressive performance in aligning visual and textual representations. Recent studies have extended this paradigm to 3D vision to improve scene understanding for autonomous driving. A common strategy is to employ pairwise cosine similarity between modalities to guide the training of a 3D encoder. However, considering the similarity between individual modality pairs rather than all modalities jointly fails to ensure consistent and unified alignment across the entire multimodal space. In this paper, we propose a Contrastive Tensor Pre-training (CTP) framework that simultaneously aligns multiple modalities in a unified embedding space to enhance end-to-end autonomous driving. Compared with pairwise cosine similarity alignment, our method extends the 2D similarity matrix into a multimodal similarity tensor. Furthermore, we introduce a tensor loss to enable joint contrastive learning across all modalities. For experimental validation of our framework, we construct a text-image-point cloud triplet dataset derived from existing autonomous driving datasets. The results show that our proposed unified multimodal alignment framework achieves favorable performance for both scenarios: (i) aligning a 3D encoder with pretrained CLIP encoders, and (ii) pretraining all encoders from scratch.
Chinese Translation
对比语言-图像预训练(Contrastive Language-Image Pre-training, CLIP)在对齐视觉和文本表征方面表现出色。最近的研究将这一范式扩展到3D视觉,以改善自主驾驶的场景理解。一个常见的策略是采用模态之间的成对余弦相似性来指导3D编码器的训练。然而,仅考虑单个模态对之间的相似性,而不是所有模态的联合相似性,无法确保整个多模态空间的一致和统一的对齐。在本文中,我们提出了一种对比张量预训练(Contrastive Tensor Pre-training, CTP)框架,该框架在统一的嵌入空间中同时对齐多个模态,以增强端到端的自主驾驶。与成对余弦相似性对齐相比,我们的方法将2D相似性矩阵扩展为多模态相似性张量。此外,我们引入了一种张量损失,以实现所有模态的联合对比学习。为了验证我们框架的实验有效性,我们构建了一个源自现有自主驾驶数据集的文本-图像-点云三元组数据集。结果表明,我们提出的统一多模态对齐框架在以下两种场景中均表现出良好的性能:(i)将3D编码器与预训练的CLIP编码器对齐,以及(ii)从头开始对所有编码器进行预训练。
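
Editor's note: extending CLIP's 2D similarity matrix to a multimodal similarity tensor can be sketched with a trilinear inner product over the shared embedding dimension; the tensor construction and loss below are one plausible reading of the abstract, not the paper's exact formulation:

    import numpy as np

    def similarity_tensor(text, image, points):
        """S[i, j, k] = sum_d t[i,d] * v[j,d] * p[k,d], a trilinear extension of
        cosine similarity (inputs assumed L2-normalized, shape (N, d))."""
        return np.einsum('id,jd,kd->ijk', text, image, points)

    def tensor_contrastive_loss(S, tau=0.07):
        """Matched triples sit on the tensor diagonal; each is the positive
        against all N^2 (image, point-cloud) pairs sharing its text index."""
        N = S.shape[0]
        logits = S.reshape(N, -1) / tau               # (N, N*N)
        pos = np.arange(N) * N + np.arange(N)         # flat index of S[i, i, i]
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(N), pos].mean()

    rng = np.random.default_rng(0)
    norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    t, v, p = (norm(rng.normal(size=(4, 8))) for _ in range(3))
    S = similarity_tensor(t, v, p)
    print(S.shape, tensor_contrastive_loss(S))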
cs.CV / 177 / 2603.07888

VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

VLM-SubtleBench:视觉语言模型在细微比较推理上距离人类水平有多远?
Kim, Minkyu, Lee, Sangheon, Park, Dongmin
Abstract
The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types - Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action - and curate paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.
Chinese Translation
区分视觉上相似图像之间细微差异的能力对于工业异常检测、医学成像和空中监视等多个领域至关重要。尽管最近出现了针对视觉语言模型(VLMs)的比较推理基准,但这些基准主要集中在具有明显差异的图像上,未能捕捉到真实世界应用所需的细致推理。在本研究中,我们提出了VLM-SubtleBench,这是一个旨在评估VLMs在细微比较推理方面表现的基准。我们的基准涵盖十种差异类型:属性(Attribute)、状态(State)、情感(Emotion)、时间(Temporal)、空间(Spatial)、存在(Existence)、数量(Quantity)、质量(Quality)、视角(Viewpoint)和动作(Action),并策划了反映这些细微变化的成对问题-图像集。与之前仅限于自然图像数据集的基准不同,我们的基准跨越了多个领域,包括工业、空中和医学图像。通过对专有和开源VLMs的广泛评估,我们揭示了模型与人类表现在各差异类型和领域上的系统性差距,并提供了受控分析,突出了VLMs推理显著恶化的地方。我们的基准和研究结果共同为推动VLMs向人类水平的比较推理发展奠定了基础。
cs.CV / 178 / 2603.07889

Structure and Progress Aware Diffusion for Medical Image Segmentation

结构与进度感知扩散用于医学图像分割
Song, Siyuan, Hu, Guyue, Li, Chenglong, Sun, Dengdi, Jin, Zhe, Tang, Jin
Abstract
Medical image segmentation is crucial for computer-aided diagnosis, which necessitates understanding both coarse morphological and semantic structures and carving fine boundaries. The morphological and semantic structures in medical images are beneficial and stable clues for target understanding. The fine boundaries of medical targets (like tumors and lesions), however, are usually ambiguous and noisy due to lesion overlap, annotation uncertainty, and so on, making them unreliable as early supervision. Nevertheless, existing methods learn coarse structures and fine boundaries simultaneously throughout the training process. In this paper, we propose a structure and progress-aware diffusion (SPAD) for medical image segmentation, which consists of a semantic-concentrated diffusion (ScD) and a boundary-centralized diffusion (BcD) modulated by a progress-aware scheduler (PaS). Specifically, the semantic-concentrated diffusion introduces anchor-preserved target perturbation, which perturbs pixels within a medical target but preserves unaltered areas as semantic anchors, encouraging the model to infer noisy target areas from the surrounding semantic context. The boundary-centralized diffusion introduces progress-aware boundary noise, which blurs unreliable and ambiguous boundaries, thus compelling the model to focus on coarse but stable anatomical morphology and global semantics. Furthermore, the progress-aware scheduler gradually modulates the noise intensity of the ScD and BcD, forming a coarse-to-fine diffusion paradigm that encourages focusing on coarse morphological and semantic structures during early target-understanding stages and gradually shifting to fine target boundaries during later contour-adjusting stages.
Chinese Translation
医学图像分割对于计算机辅助诊断至关重要,这需要理解粗略的形态学和语义结构,同时精确刻画细微边界。医学图像中的形态学和语义结构是目标理解的重要且稳定的线索。然而,医学目标(如肿瘤和病变)的细微边界通常模糊且噪声较大,原因包括病变重叠、标注不确定性等,这使得其作为早期监督不可靠。然而,现有方法在整个训练过程中同时学习粗略结构和细微边界。本文提出了一种用于医学图像分割的结构与进度感知扩散(SPAD),该方法由语义集中扩散(ScD)和边界集中扩散(BcD)组成,并由进度感知调度器(PaS)进行调节。具体而言,语义集中扩散引入了锚点保留的目标扰动,该扰动在医学目标内扰动像素,但保留未改变区域作为语义锚点,从而鼓励模型从周围的语义上下文推断噪声目标区域。边界集中扩散引入了进度感知边界噪声,模糊不可靠和模糊的边界,从而迫使模型关注粗略但稳定的解剖形态和全局语义。此外,进度感知调度器逐渐调节ScD和BcD的噪声强度,形成粗到细的扩散范式,鼓励在早期目标理解阶段关注粗略的形态学和语义结构,并在后期轮廓调整阶段逐渐转向细微目标边界。
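
Editor's note: the progress-aware scheduler modulates noise intensity over training to form a coarse-to-fine curriculum. A sketch with linear ramps; the direction of the ScD ramp and the functional form are assumptions:

    import numpy as np

    def progress_aware_noise(progress, sigma_max=1.0):
        """Coarse-to-fine schedule: boundary noise (BcD) starts strong and decays,
        so early training sees only coarse structure; semantic perturbation (ScD)
        ramps up as boundaries become reliable. The linear ramps are an
        illustrative choice, not the paper's schedule."""
        progress = np.clip(progress, 0.0, 1.0)   # fraction of training done
        boundary_sigma = sigma_max * (1.0 - progress)
        semantic_sigma = sigma_max * progress
        return boundary_sigma, semantic_sigma

    for p in (0.0, 0.5, 1.0):
        print(p, progress_aware_noise(p))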
cs.CV / 179 / 2603.07895

MINT: Molecularly Informed Training with Spatial Transcriptomics Supervision for Pathology Foundation Models

MINT:基于空间转录组学监督的分子信息训练用于病理基础模型
Lee, Minsoo, Kim, Jonghyun, Yun, Juseung, Yu, Sunwoo, Jang, Jongseong
Abstract
Pathology foundation models learn morphological representations through self-supervised pretraining on large-scale whole-slide images, yet they do not explicitly capture the underlying molecular state of the tissue. Spatial transcriptomics technologies bridge this gap by measuring gene expression in situ, offering a natural cross-modal supervisory signal. We propose MINT (Molecularly Informed Training), a fine-tuning framework that incorporates spatial transcriptomics supervision into pretrained pathology Vision Transformers. MINT appends a learnable ST token to the ViT input to encode transcriptomic information separately from the morphological CLS token, preventing catastrophic forgetting through DINO self-distillation and explicit feature anchoring to the frozen pretrained encoder. Gene expression regression at both spot-level (Visium) and patch-level (Xenium) resolutions provides complementary supervision across spatial scales. Trained on 577 publicly available HEST samples, MINT achieves the best overall performance on both HEST-Bench for gene expression prediction (mean Pearson r = 0.440) and EVA for general pathology tasks (0.803), demonstrating that spatial transcriptomics supervision complements morphology-centric self-supervised pretraining.
Chinese Translation
病理基础模型通过在大规模全切片图像上进行自监督预训练来学习形态学表征,但它们并未明确捕捉组织的潜在分子状态。空间转录组学技术通过原位测量基因表达来弥补这一差距,提供了一种自然的跨模态监督信号。我们提出了MINT(分子信息训练),这是一种将空间转录组学监督融入预训练病理视觉变换器的微调框架。MINT在ViT输入中附加一个可学习的ST标记,以便将转录组信息与形态学CLS标记分开编码,通过DINO自蒸馏和对冻结的预训练编码器进行显式特征锚定,防止灾难性遗忘。在点级(Visium)和补丁级(Xenium)分辨率下进行的基因表达回归在空间尺度上提供了互补的监督。MINT在577个公开可用的HEST样本上进行训练,在基因表达预测的HEST-Bench上(平均皮尔逊相关系数r = 0.440)和一般病理任务的EVA上(0.803)取得了最佳整体性能,证明了空间转录组学监督补充了以形态为中心的自监督预训练。
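
Editor's note: appending a learnable ST token next to the CLS token is a few lines of token plumbing in PyTorch. A minimal sketch; the DINO self-distillation and feature-anchoring losses are omitted, and the encoder below is a stand-in:

    import torch
    import torch.nn as nn

    class STTokenWrapper(nn.Module):
        """Prepend a [CLS] (morphology) and a learnable [ST] (transcriptomics)
        token to the patch sequence before a transformer encoder."""
        def __init__(self, encoder, dim, n_genes):
            super().__init__()
            self.encoder = encoder                      # stand-in for frozen ViT blocks
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.st_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.expr_head = nn.Linear(dim, n_genes)    # ST-supervised regression

        def forward(self, patch_tokens):                # (B, N, dim)
            B = patch_tokens.shape[0]
            cls = self.cls_token.expand(B, -1, -1)
            st = self.st_token.expand(B, -1, -1)
            x = self.encoder(torch.cat([cls, st, patch_tokens], dim=1))
            return x[:, 0], self.expr_head(x[:, 1])     # morphology feat, expression

    enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(64, 4, batch_first=True), 2)
    model = STTokenWrapper(enc, dim=64, n_genes=100)
    feat, expr = model(torch.randn(2, 49, 64))
    print(feat.shape, expr.shape)  # torch.Size([2, 64]) torch.Size([2, 100])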
cs.CV / 180 / 2603.07898

Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning

重新审视未知:朝着有效且高效的开放集主动学习迈进
Zong, Chen-Chen, Chi, Yu-Qi, Wang, Xie-Yang, Cui, Yan, Huang, Sheng-Jun
Abstract
Open-set active learning (OSAL) aims to identify informative samples for annotation when unlabeled data may contain previously unseen classes-a common challenge in safety-critical and open-world scenarios. Existing approaches typically rely on separately trained open-set detectors, introducing substantial training overhead and overlooking the supervisory value of labeled unknowns for improving known-class learning. In this paper, we propose E$^2$OAL (Effective and Efficient Open-set Active Learning), a unified and detector-free framework that fully exploits labeled unknowns for both stronger supervision and more reliable querying. E$^2$OAL first uncovers the latent class structure of unknowns through label-guided clustering in a frozen contrastively pre-trained feature space, optimized by a structure-aware F1-product objective. To leverage labeled unknowns, it employs a Dirichlet-calibrated auxiliary head that jointly models known and unknown categories, improving both confidence calibration and known-class discrimination. Building on this, a logit-margin purity score estimates the likelihood of known classes to construct a high-purity candidate pool, while an OSAL-specific informativeness metric prioritizes partially ambiguous yet reliable samples. These components together form a flexible two-stage query strategy with adaptive precision control and minimal hyperparameter sensitivity. Extensive experiments across multiple OSAL benchmarks demonstrate that E$^2$OAL consistently surpasses state-of-the-art methods in accuracy, efficiency, and query precision, highlighting its effectiveness and practicality for real-world applications. The code is available at github.com/chenchenzong/E2OAL.
Chinese Translation
开放集主动学习(OSAL)旨在当未标记数据可能包含之前未见的类别时,识别值得标注的信息样本,这在安全关键和开放世界场景中是一个常见挑战。现有方法通常依赖于单独训练的开放集检测器,这引入了大量的训练开销,并忽视了标记未知样本在改善已知类别学习中的监督价值。本文提出了E$^2$OAL(有效且高效的开放集主动学习),这是一个统一且无检测器的框架,充分利用标记的未知样本以实现更强的监督和更可靠的查询。E$^2$OAL首先通过在冻结的对比预训练特征空间中进行标签引导的聚类,揭示未知样本的潜在类别结构,并通过结构感知的F1乘积(F1-product)目标进行优化。为了利用标记的未知样本,它采用了一个经过Dirichlet校准的辅助头,该头共同建模已知和未知类别,从而改善置信度校准和已知类别的区分能力。在此基础上,logit-margin纯度评分估计已知类别的可能性,以构建高纯度候选池,而针对OSAL的特定信息度量则优先考虑部分模糊但可靠的样本。这些组件共同形成了一个灵活的两阶段查询策略,具有自适应精度控制和最小的超参数敏感性。在多个OSAL基准上的广泛实验表明,E$^2$OAL在准确性、效率和查询精度方面始终超过了最先进的方法,突显了其在实际应用中的有效性和实用性。代码可在github.com/chenchenzong/E2OAL获取。
cs.CV / 181 / 2603.07911

Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition

超越启发式提示:一种基于概念引导的贝叶斯框架用于零样本图像识别
Liu, Hui, Chen, Kecheng, Wang, Jialiang, Liu, Xianming, Wang, Wenya, Li, Haoliang
Abstract
Vision-Language Models (VLMs), such as CLIP, have significantly advanced zero-shot image recognition. However, their performance remains limited by suboptimal prompt engineering and poor adaptability to target classes. While recent methods attempt to improve prompts through diverse class descriptions, they often rely on heuristic designs, lack versatility, and are vulnerable to outlier prompts. This paper enhances prompt by incorporating class-specific concepts. By treating concepts as latent variables, we rethink zero-shot image classification from a Bayesian perspective, casting prediction as marginalization over the concept space, where each concept is weighted by a prior and a test-image conditioned likelihood. This formulation underscores the importance of both a well-structured concept proposal distribution and the refinement of concept priors. To construct an expressive and efficient proposal distribution, we introduce a multi-stage concept synthesis pipeline driven by LLMs to generate discriminative and compositional concepts, followed by a Determinantal Point Process to enforce diversity. To mitigate the influence of outlier concepts, we propose a training-free, adaptive soft-trim likelihood, which attenuates their impact in a single forward pass. We further provide robustness guarantees and derive multi-class excess risk bounds for our framework. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, validating its effectiveness in zero-shot image classification. Our code is available at https://github.com/less-and-less-bugs/CGBC.
Chinese Translation
视觉-语言模型(VLMs),如CLIP,已显著推动了零样本图像识别的发展。然而,它们的性能仍受到次优提示工程和对目标类别适应性差的限制。尽管近期方法试图通过多样的类别描述来改进提示,但它们往往依赖于启发式设计,缺乏通用性,并且对离群提示敏感。本文通过引入类别特定的概念来增强提示。我们将概念视为潜变量,从贝叶斯的角度重新思考零样本图像分类,将预测视为对概念空间的边际化,其中每个概念都由先验和条件于测试图像的似然加权。这一公式强调了良好结构的概念提议分布和概念先验的精炼的重要性。为了构建一个表达性强且高效的提议分布,我们引入了一个由大型语言模型(LLMs)驱动的多阶段概念合成管道,以生成具有区分性和组合性的概念,随后使用行列式点过程(Determinantal Point Process)来增强多样性。为减轻离群概念的影响,我们提出了一种无训练的自适应软修剪似然,该方法在一次前向传播中减弱了离群概念的影响。我们进一步提供了鲁棒性保证,并为我们的框架推导了多类超额风险界限。大量实验表明,我们的方法在零样本图像分类中始终优于最先进的方法,验证了其有效性。我们的代码可在 https://github.com/less-and-less-bugs/CGBC 获取。
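
Editor's note: the Bayesian formulation scores a class by marginalizing over its concept set, weighting each concept by a prior and a test-image-conditioned likelihood. A numpy sketch; the sigmoid soft-trim below is an illustrative stand-in for the paper's outlier-attenuation mechanism:

    import numpy as np

    def class_score(image_emb, concept_embs, priors, trim_tau=1.0):
        """Score one class by marginalizing over its concept set.
        concept_embs: (C, d) L2-normalized concept text embeddings
        priors: (C,) prior weights over concepts, summing to 1."""
        sims = concept_embs @ image_emb                 # cosine similarities
        # Soft-trim: shrink the weight of concepts far below the median
        # similarity, attenuating outlier concepts in a single forward pass.
        trim = 1.0 / (1.0 + np.exp(-(sims - np.median(sims)) / trim_tau))
        w = priors * trim
        return (w * sims).sum() / (w.sum() + 1e-8)

    rng = np.random.default_rng(0)
    img = rng.normal(size=16); img /= np.linalg.norm(img)
    concepts = rng.normal(size=(5, 16))
    concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)
    print(class_score(img, concepts, priors=np.full(5, 0.2)))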
cs.CV / 182 / 2603.07912

Geometric Transformation-Embedded Mamba for Learned Video Compression

嵌入几何变换的Mamba用于学习视频压缩
Wei, Hao, Zhou, Yanhui, Ge, Chenyang
Abstract
Although learned video compression methods have exhibited outstanding performance, most of them typically follow a hybrid coding paradigm that requires explicit motion estimation and compensation, resulting in a complex solution for video compression. In contrast, we introduce a streamlined yet effective video compression framework founded on a direct transform strategy, i.e., nonlinear transform, quantization, and entropy coding. We first develop a cascaded Mamba module (CMM) with different embedded geometric transformations to effectively explore both long-range spatial and temporal dependencies. To improve local spatial representation, we introduce a locality refinement feed-forward network (LRFFN) that incorporates a hybrid convolution block based on difference convolutions. We integrate the proposed CMM and LRFFN into the encoder and decoder of our compression framework. Moreover, we present a conditional channel-wise entropy model that effectively utilizes conditional temporal priors to accurately estimate the probability distributions of current latent features. Extensive experiments demonstrate that our method outperforms state-of-the-art video compression approaches in terms of perceptual quality and temporal consistency under low-bitrate constraints. Our source codes and models will be available at https://github.com/cshw2021/GTEM-LVC.
Chinese Translation
尽管学习视频压缩方法表现出色,但大多数方法通常遵循一种混合编码范式,需要明确的运动估计和补偿,从而导致视频压缩的复杂解决方案。相比之下,我们提出了一种基于直接变换策略的简化而有效的视频压缩框架,即非线性变换、量化和熵编码。我们首先开发了一个级联Mamba模块(CMM),该模块嵌入了不同的几何变换,以有效探索长程空间和时间依赖关系。为了改善局部空间表示,我们引入了一种基于差分卷积的混合卷积块的局部细化前馈网络(LRFFN)。我们将提出的CMM和LRFFN集成到我们的压缩框架的编码器和解码器中。此外,我们提出了一种条件通道熵模型,该模型有效利用条件时间先验来准确估计当前潜在特征的概率分布。大量实验表明,在低比特率约束下,我们的方法在感知质量和时间一致性方面优于最先进的视频压缩方法。我们的源代码和模型将发布在 https://github.com/cshw2021/GTEM-LVC。
cs.CV / 183 / 2603.07918

Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning

通过基于解混合的丰度融合学习增强未注册的高光谱图像超分辨率
Zhang, Yingkai, Zhang, Tao, Nie, Jing, Fu, Ying
Abstract
Unregistered hyperspectral image (HSI) super-resolution (SR) typically aims to enhance a low-resolution HSI using an unregistered high-resolution reference image. In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models. Specifically, we first utilize singular value decomposition for initial spectral unmixing, preserving the original endmembers while dedicating the subsequent network to enhancing the initial abundance map. To leverage the spatial texture of the unregistered reference, we introduce a coarse-to-fine deformable aggregation module, which first estimates a pixel-level flow and a similarity map using a coarse pyramid predictor. It further performs fine sub-pixel refinement to achieve deformable aggregation of the reference features. The aggregative features are then refined via a series of spatial-channel abundance cross-attention blocks. Furthermore, a spatial-channel modulated fusion module is presented to merge encoder-decoder features using dynamic gating weights, yielding a high-quality, high-resolution HSI. Experimental results on simulated and real datasets confirm that our proposed method achieves state-of-the-art super-resolution performance. The code will be available at https://github.com/yingkai-zhang/UAFL.
Chinese Translation
未注册的高光谱图像(HSI)超分辨率(SR)通常旨在利用未注册的高分辨率参考图像增强低分辨率HSI。本文提出了一种基于解混合的融合框架,该框架解耦空间-光谱信息,以同时减轻未注册融合的影响并增强SR模型的可学习性。具体而言,我们首先利用奇异值分解进行初始光谱解混合,保留原始端元,同时将后续网络专注于增强初始丰度图。为了利用未注册参考图像的空间纹理,我们引入了一个粗到细的可变形聚合模块,该模块首先使用粗糙金字塔预测器估计像素级流动和相似性图。然后,它进一步执行细粒度的亚像素精细化,以实现参考特征的可变形聚合。聚合特征随后通过一系列空间-通道丰度交叉注意力块进行精炼。此外,提出了一种空间-通道调制融合模块,使用动态门控权重合并编码器-解码器特征,从而生成高质量、高分辨率的HSI。在模拟和真实数据集上的实验结果证实了我们提出的方法达到了最先进的超分辨率性能。代码将发布在 https://github.com/yingkai-zhang/UAFL。
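
Editor's note: initial spectral unmixing via SVD amounts to flattening the cube into a pixels-by-bands matrix and keeping the top-k factors as endmembers and abundances. A minimal sketch:

    import numpy as np

    def svd_unmix(hsi, k):
        """hsi: (H, W, B) cube. Returns endmembers (k, B) and an initial
        abundance map (H, W, k) from a rank-k truncated SVD."""
        H, W, B = hsi.shape
        X = hsi.reshape(-1, B)                      # pixels as rows
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        endmembers = Vt[:k]                          # spectral basis, kept fixed
        abundances = (U[:, :k] * s[:k]).reshape(H, W, k)  # refined by the network
        return endmembers, abundances

    cube = np.random.rand(16, 16, 31)               # toy 31-band image
    E, A = svd_unmix(cube, k=4)
    print(E.shape, A.shape)                          # (4, 31) (16, 16, 4)
    # Relative error of the rank-k reconstruction A @ E (low for low-rank data).
    print(np.linalg.norm(A.reshape(-1, 4) @ E - cube.reshape(-1, 31)) /
          np.linalg.norm(cube))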
cs.CV / 184 / 2603.07920

RLPR: Radar-to-LiDAR Place Recognition via Two-Stage Asymmetric Cross-Modal Alignment for Autonomous Driving

RLPR:通过两阶段不对称跨模态对齐实现雷达到激光雷达的地点识别,用于自动驾驶
Qi, Zhangshuo, Xu, Jingyi, Cheng, Luqi, Wen, Shichen, Xiong, Guangming
Abstract
All-weather autonomy is critical for autonomous driving, which necessitates reliable localization across diverse scenarios. While LiDAR place recognition is widely deployed for this task, its performance degrades in adverse weather. Conversely, radar-based methods, though weather-resilient, are hindered by the general unavailability of radar maps. To bridge this gap, radar-to-LiDAR place recognition, which localizes radar scans within existing LiDAR maps, has garnered increasing interest. However, extracting discriminative and generalizable features shared between modalities remains challenging, compounded by the scarcity of large-scale paired training data and the signal heterogeneity across radar types. In this work, we propose RLPR, a robust radar-to-LiDAR place recognition framework compatible with single-chip, scanning, and 4D radars. We first design a dual-stream network to extract structural features that abstract away from sensor-specific signal properties (e.g., Doppler or RCS). Subsequently, motivated by our task-specific asymmetry observation between radar and LiDAR, we introduce a two-stage asymmetric cross-modal alignment (TACMA) strategy, which leverages the pre-trained radar branch as a discriminative anchor to guide the alignment process. Experiments on four datasets demonstrate that RLPR achieves state-of-the-art recognition accuracy with strong zero-shot generalization capabilities.
Chinese Translation
全天候自主能力对自动驾驶至关重要,这需要在多种场景中实现可靠的定位。虽然激光雷达地点识别在这一任务中被广泛应用,但其在恶劣天气条件下的性能会下降。相反,基于雷达的方法虽然对天气具有较强的抗干扰能力,但由于雷达地图的普遍缺乏而受到限制。为了弥补这一差距,雷达到激光雷达的地点识别,即在现有激光雷达地图中定位雷达扫描,受到了越来越多的关注。然而,提取跨模态共享的区分性和可泛化特征仍然具有挑战性,而大规模配对训练数据的稀缺以及不同雷达类型之间的信号异质性使这一问题更加复杂。在本研究中,我们提出了RLPR,一个强大的雷达到激光雷达地点识别框架,兼容单芯片、扫描和4D雷达。我们首先设计了一个双流网络,以提取不依赖传感器特定信号属性(例如多普勒或雷达散射截面)的结构特征。随后,基于我们在雷达和激光雷达之间观察到的任务特定不对称性,我们引入了一种两阶段不对称跨模态对齐(TACMA)策略,该策略利用预训练的雷达分支作为区分性锚点来指导对齐过程。在四个数据集上的实验表明,RLPR实现了最先进的识别准确率,并具备强大的零样本泛化能力。
cs.CV / 185 / 2603.07926

IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation

IMSE:内在光谱专家混合微调用于测试时适应
Baek, Sunghyun, Yu, Jaemyung, Koh, Seunghee, Kim, Minsu, Jeon, Hyeonseong, Kim, Junmo
Abstract
Test-time adaptation (TTA) has been widely explored to prevent performance degradation when test data differ from the training distribution. However, fully leveraging the rich representations of large pretrained models with minimal parameter updates remains underexplored. In this paper, we propose Intrinsic Mixture of Spectral Experts (IMSE) that leverages the spectral experts inherently embedded in Vision Transformers. We decompose each linear layer via singular value decomposition (SVD) and adapt only the singular values, while keeping the singular vectors fixed. We further identify a key limitation of entropy minimization in TTA: it often induces feature collapse, causing the model to rely on domain-specific features rather than class-discriminative features. To address this, we propose a diversity maximization loss based on expert-input alignment, which encourages diverse utilization of spectral experts during adaptation. In the continual test-time adaptation (CTTA) scenario, beyond preserving pretrained knowledge, it is crucial to retain and reuse knowledge from previously observed domains. We introduce Domain-Aware Spectral Code Retrieval, which estimates input distributions to detect domain shifts, and retrieves adapted singular values for rapid adaptation. Consequently, our method achieves state-of-the-art performance on various distribution-shift benchmarks under the TTA setting. In CTTA and Gradual CTTA, it further improves accuracy by 3.4 percentage points (pp) and 2.4 pp, respectively, while requiring 385 times fewer trainable parameters. Our code is available at https://github.com/baek85/IMSE.
Chinese Translation
测试时适应(TTA)已经被广泛研究,以防止当测试数据与训练分布不同时性能下降。然而,如何在最小参数更新的情况下充分利用大型预训练模型的丰富表示仍然未得到充分探索。本文提出了内在光谱专家混合(IMSE),利用嵌入在视觉变换器中的光谱专家。我们通过奇异值分解(SVD)对每个线性层进行分解,仅适应奇异值,同时保持奇异向量不变。我们进一步识别出TTA中熵最小化的一个关键限制:它往往会导致特征崩溃,使模型依赖于特定领域的特征,而非类别区分特征。为了解决这个问题,我们提出了一种基于专家输入对齐的多样性最大化损失,鼓励在适应过程中多样化利用光谱专家。在持续测试时适应(CTTA)场景中,除了保留预训练知识外,保留和重用来自先前观察领域的知识也是至关重要的。我们引入了领域感知光谱代码检索,估计输入分布以检测领域转移,并检索适应后的奇异值以实现快速适应。因此,我们的方法在TTA设置下在各种分布转移基准上实现了最先进的性能。在CTTA和渐进CTTA中,准确率分别提高了3.4个百分点和2.4个百分点,同时需要的可训练参数减少了385倍。我们的代码可在 https://github.com/baek85/IMSE 获取。
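
Editor's note: adapting only the singular values of each linear layer is concrete enough to sketch: decompose the frozen weight once, then train the spectrum alone. A minimal PyTorch sketch; whether the bias is frozen is an assumption:

    import torch
    import torch.nn as nn

    class SpectralExpertLinear(nn.Module):
        """Linear layer W = U diag(s) V^T with frozen U, V and trainable s only."""
        def __init__(self, linear: nn.Linear):
            super().__init__()
            U, s, Vh = torch.linalg.svd(linear.weight.detach(), full_matrices=False)
            self.register_buffer('U', U)              # frozen singular vectors
            self.register_buffer('Vh', Vh)
            self.s = nn.Parameter(s.clone())          # the only adapted parameters
            # Bias kept frozen here (an assumption; adapters vary on this).
            self.register_buffer('bias', linear.bias.detach().clone())

        def forward(self, x):
            W = self.U @ torch.diag(self.s) @ self.Vh
            return nn.functional.linear(x, W, self.bias)

    layer = SpectralExpertLinear(nn.Linear(128, 64))
    print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 64
    print(layer(torch.randn(2, 128)).shape)                               # (2, 64)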
cs.CV / 186 / 2603.07929

A Hybrid Vision Transformer Approach for Mathematical Expression Recognition

一种混合视觉变换器方法用于数学表达式识别
Le, Anh Duy, Pham, Van Linh, Ly, Vinh Loi, Nguyen, Nam Quan, Nguyen, Huu Thang, Tran, Tuan Anh
Abstract
One of the crucial challenges in document analysis is mathematical expression recognition. Unlike text recognition, which focuses only on images with a one-dimensional structure, mathematical expression recognition is a much more complicated problem because of its two-dimensional structure and varying symbol sizes. In this paper, we propose using a Hybrid Vision Transformer (HVT) with 2D positional encoding as the encoder to extract the complex relationships between symbols from the image. A coverage attention decoder is used to better track the attention history and handle the under-parsing and over-parsing problems. We also show the benefit of using the [CLS] token of ViT as the initial embedding of the decoder. Experiments performed on the IM2LATEX-100K dataset have shown the effectiveness of our method, which achieves a BLEU score of 89.94 and outperforms current state-of-the-art methods.
Chinese Translation
文档分析中一个关键的挑战是数学表达式识别。与仅关注一维结构图像的文本识别不同,数学表达式识别是一个更复杂的问题,因为它具有二维结构和不同的符号大小。本文提出使用带有二维位置编码的混合视觉变换器(Hybrid Vision Transformer, HVT)作为编码器,从图像中提取符号之间的复杂关系。我们使用覆盖注意力解码器更好地跟踪注意力的历史,以处理解析不足和解析过度的问题。我们还展示了使用ViT的[CLS]标记作为解码器初始嵌入的好处。在IM2LATEX-100K数据集上进行的实验表明我们的方法的有效性,达到了89.94的BLEU分数,超越了当前的最先进方法。
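
Editor's note: a common way to realize 2D positional encoding is to devote half the channels to sinusoidal codes of the row index and half to the column index. A sketch of that scheme; the paper's exact variant may differ:

    import torch

    def pos_encoding_2d(h, w, dim):
        """2D sinusoidal positional encoding: half the channels encode the row
        index, half the column, each with the standard sin/cos scheme."""
        assert dim % 4 == 0
        d = dim // 2
        freq = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
        ys = torch.arange(h).float()[:, None] * freq        # (h, d/2)
        xs = torch.arange(w).float()[:, None] * freq        # (w, d/2)
        pe = torch.zeros(h, w, dim)
        pe[..., 0:d:2] = torch.sin(ys)[:, None, :].expand(h, w, -1)
        pe[..., 1:d:2] = torch.cos(ys)[:, None, :].expand(h, w, -1)
        pe[..., d::2] = torch.sin(xs)[None, :, :].expand(h, w, -1)
        pe[..., d + 1::2] = torch.cos(xs)[None, :, :].expand(h, w, -1)
        return pe  # added to patch embeddings before the encoder

    print(pos_encoding_2d(8, 12, 64).shape)  # torch.Size([8, 12, 64])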
cs.CV / 187 / 2603.07936

Text to Automata Diagrams: Comparing TikZ Code Generation with Direct Image Synthesis

从文本到自动机图:比较 TikZ 代码生成与直接图像合成
Young, Ethan, Wang, Zichun, Taylor, Aiden, Jewell, Chance, Myers, Julian, Nimmagadda, Satya Sri Rajiteswari, White, Anthony, Maiti, Aniruddha, Jana, Ananya
Abstract
Diagrams are widely used in teaching computer science courses. They are useful in subjects such as automata and formal languages, data structures, etc. These diagrams, often drawn by students during exams or assignments, vary in structure, layout, and correctness. This study examines whether current vision-language and large language models can process such diagrams and produce accurate textual and digital representations. In this study, scanned student-drawn diagrams are used as input. Then, textual descriptions are generated from these images using a vision-language model. The descriptions are checked and revised by human reviewers to make them accurate. Both the generated and the revised descriptions are then fed to a large language model to generate TikZ code. The resulting diagrams are compiled and then evaluated against the original scanned diagrams. We found descriptions generated directly from images using vision-language models are often incorrect and human correction can substantially improve the quality of vision language model generated descriptions. This research can help computer science education by paving the way for automated grading and feedback and creating more accessible instructional materials.
Chinese Translation
图表在计算机科学课程的教学中被广泛使用。它们在自动机与形式语言、数据结构等学科中非常有用。这些图表通常由学生在考试或作业中绘制,结构、布局和正确性各不相同。本研究考察了当前的视觉语言模型和大型语言模型是否能够处理这些图表并生成准确的文本和数字表示。在本研究中,使用扫描的学生绘制的图表作为输入。然后,利用视觉语言模型从这些图像生成文本描述。这些描述经过人工审阅者的检查和修订,以确保其准确性。生成的描述和修订后的描述随后被输入到大型语言模型中,以生成 TikZ 代码。生成的图表经过编译后,与原始扫描图表进行评估。我们发现,使用视觉语言模型直接从图像生成的描述往往不准确,而人工修正可以显著提高视觉语言模型生成描述的质量。本研究有助于计算机科学教育,为自动评分和反馈铺平道路,并创造更易于获取的教学材料。
cs.CV / 188 / 2603.07937

$L^3$:Scene-agnostic Visual Localization in the Wild

$L^3$:场景无关的野外视觉定位
Zhang, Yu, Zhu, Muhua, Xue, Yifei, Ji, Tie, Lao, Yizhen
Abstract
Standard visual localization methods typically require offline pre-processing of scenes to obtain 3D structural information for better performance. This inevitably introduces additional computational and time costs, as well as the overhead of storing scene representations. Can we visually localize in a wild scene without any offline preprocessing step? In this paper, we leverage the online inference capabilities of feed-forward 3D reconstruction networks to propose a novel map-free visual localization framework $L^3$. Specifically, by performing direct online 3D reconstruction on RGB images, followed by two-stage metric scale recovery and pose refinement based on 2D-3D correspondences, $L^3$ achieves high accuracy without the need to pre-build or store any offline scene representations. Extensive experiments demonstrate not only that $L^3$ performs comparably to state-of-the-art solutions on various benchmarks, but also that it exhibits significantly superior robustness in sparse scenes (fewer reference images per scene).
Chinese Translation
标准的视觉定位方法通常需要对场景进行离线预处理,以获得3D结构信息以提高性能。这不可避免地引入了额外的计算和时间成本,以及存储场景表示的开销。我们能否在没有任何离线预处理步骤的情况下,在野外场景中进行视觉定位?在本文中,我们利用前馈3D重建网络的在线推理能力,提出了一种新颖的无地图视觉定位框架$L^3$。具体而言,通过对RGB图像进行直接的在线3D重建,随后基于2D-3D对应关系进行两阶段的度量尺度恢复和姿态优化,$L^3$在不需要预构建或存储任何离线场景表示的情况下,实现了高精度。大量实验表明,$L^3$的性能不仅与各种基准测试上的最先进解决方案相当,而且在稀疏场景(每个场景的参考图像较少)中表现出显著的优越鲁棒性。
cs.CV / 189 / 2603.07952

VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

VisualAD:基于视觉变换器的无语言零样本异常检测
Hou, Yanning, Li, Peiyuan, Liu, Zirui, Wang, Yitong, Ruan, Yanran, Qiu, Jianfeng, Xu, Ke
Abstract
Zero-shot anomaly detection (ZSAD) requires detecting and localizing anomalies without access to target-class anomaly samples. Mainstream methods rely on vision-language models (VLMs) such as CLIP: they build hand-crafted or learned prompt sets for normal and abnormal semantics, then compute image-text similarities for open-set discrimination. While effective, this paradigm depends on a text encoder and cross-modal alignment, which can lead to training instability and parameter redundancy. This work revisits the necessity of the text branch in ZSAD and presents VisualAD, a purely visual framework built on Vision Transformers. We introduce two learnable tokens within a frozen backbone to directly encode normality and abnormality. Through multi-layer self-attention, these tokens interact with patch tokens, gradually acquiring high-level notions of normality and anomaly while guiding patches to highlight anomaly-related cues. Additionally, we incorporate a Spatial-Aware Cross-Attention (SCA) module and a lightweight Self-Alignment Function (SAF): SCA injects fine-grained spatial information into the tokens, and SAF recalibrates patch features before anomaly scoring. VisualAD achieves state-of-the-art performance on 13 zero-shot anomaly detection benchmarks spanning industrial and medical domains, and adapts seamlessly to pretrained vision backbones such as the CLIP image encoder and DINOv2. Code: https://github.com/7HHHHH/VisualAD
Chinese Translation
零样本异常检测(ZSAD)要求在没有目标类别异常样本的情况下检测和定位异常。主流方法依赖于视觉-语言模型(VLMs),如CLIP:它们为正常和异常语义构建手工或学习的提示集,然后计算图像-文本相似度以进行开放集区分。尽管有效,这种范式依赖于文本编码器和跨模态对齐,这可能导致训练不稳定和参数冗余。本研究重新审视了ZSAD中文本分支的必要性,并提出了VisualAD,一个完全基于视觉的框架,建立在视觉变换器之上。我们在一个冻结的主干网络中引入两个可学习的标记,以直接编码正常性和异常性。通过多层自注意力,这些标记与补丁标记相互作用,逐渐获得正常性和异常性的高级概念,同时引导补丁突出与异常相关的线索。此外,我们还结合了空间感知交叉注意力(SCA)模块和轻量级自对齐函数(SAF):SCA将细粒度的空间信息注入标记中,SAF在异常评分之前重新校准补丁特征。VisualAD在涵盖工业和医疗领域的13个零样本异常检测基准上实现了最先进的性能,并能够无缝适应预训练的视觉主干,如CLIP图像编码器和DINOv2。代码:https://github.com/7HHHHH/VisualAD
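The mechanism at the heart of VisualAD is two learnable tokens that attend jointly with frozen patch tokens and then score each patch by its affinity to the "abnormal" versus the "normal" token. A minimal PyTorch sketch under that reading; the depth, dimensions, and cosine-affinity readout are illustrative stand-ins, not the paper's exact modules:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenAnomalyHead(nn.Module):
    def __init__(self, dim=768, depth=2, heads=8):
        super().__init__()
        # Two learnable tokens encoding normality and abnormality.
        self.normal = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.abnormal = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(depth))

    def forward(self, patch_tokens):              # (B, N, D) from a frozen backbone
        B = patch_tokens.size(0)
        toks = torch.cat([self.normal.expand(B, -1, -1),
                          self.abnormal.expand(B, -1, -1),
                          patch_tokens], dim=1)
        for blk in self.blocks:                   # tokens interact with patches
            toks = blk(toks)
        normal_t, abnormal_t, patches = toks[:, :1], toks[:, 1:2], toks[:, 2:]
        logits = torch.stack([
            F.cosine_similarity(patches, normal_t, dim=-1),
            F.cosine_similarity(patches, abnormal_t, dim=-1)], dim=-1)
        return logits.softmax(dim=-1)[..., 1]     # (B, N) per-patch anomaly score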
cs.CV / 190 / 2603.07961

SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

SGG-R$^{\rm 3}$:从下一个标记预测到端到端无偏场景图生成
Feng, Jiaye, Yin, Qixiang, Liu, Yuankun, Mo, Tong, Li, Weiping
Abstract
Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by both a lack of task-specific structured reasoning and the challenges of sparse, long-tailed relation distributions, resulting in incomplete scene graphs characterized by low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), designed to engage in three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy by leveraging an MLLM and refined via embedding similarity filtering to alleviate relation sparsity. Subsequently, a stage-aligned reward scheme optimizes the procedural reasoning during RL. Specifically, we propose a novel dual-granularity reward which integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ achieves superior performance compared to existing methods, demonstrating the effectiveness and generalization of the framework.
Chinese Translation
场景图生成(Scene Graph Generation, SGG)将视觉场景结构化为对象及其关系的图形。尽管多模态大型语言模型(Multimodal Large Language Models, MLLMs)在端到端的SGG方面取得了进展,但当前的方法受到缺乏任务特定结构化推理和稀疏、长尾关系分布的挑战的限制,导致生成的场景图不完整,表现为低召回率和偏见预测。为了解决这些问题,我们提出了SGG-R$^{\rm 3}$,这是一个结构化推理框架,集成了任务特定的思维链(chain-of-thought, CoT)引导的监督微调(supervised fine-tuning, SFT)和强化学习(reinforcement learning, RL)与组序列策略优化(group sequence policy optimization, GSPO),旨在通过三个连续阶段实现端到端的无偏场景图生成。在SFT阶段,我们提出了一种关系增强策略,通过利用MLLM并通过嵌入相似性过滤进行优化,以缓解关系稀疏性。随后,一个阶段对齐的奖励机制优化了RL过程中的推理。具体而言,我们提出了一种新颖的双粒度奖励,整合了细粒度和粗粒度的关系奖励,同时通过基于频率的自适应加权谓词来缓解长尾问题,并通过语义聚类提高关系覆盖率。在两个基准测试上的实验表明,SGG-R$^{\rm 3}$相比现有方法表现出更优的性能,证明了该框架的有效性和泛化能力。
cs.CV / 191 / 2603.07966

Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time

Zhou, Weijie, Xiong, Xuantang, Hu, Zhenlin, Zhu, Xiaomeng, Zhao, Chaoyang, Dong, Honghui, Zhang, Zhengyou, Tang, Ming, Wang, Jinqiao
Abstract
In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., ``pass me \textit{that}''), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing \emph{stroke}. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the \emph{audio--visual alignment} required by deictic interaction. To bridge this gap, we introduce \textbf{Egocentric Co-Speech Grounding (EcoG)}, where grounding is executable only if an agent jointly predicts \textit{What}, \textit{Where}, and \textit{When}. To operationalize this, we present \textbf{EcoG-Bench}, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of \textbf{811} egocentric clips with dense spatial annotations and millisecond-level stroke supervision. It is organized under a \textbf{Progressive Cognitive Evaluation} protocol. Benchmarking state-of-the-art MLLMs reveals a severe executability gap: while human subjects achieve near-ceiling performance on EcoG-Bench (\textbf{96.9\%} strict Eco-Accuracy), the best native video-audio setting remains low (Gemini-3-Pro: \textbf{17.0\%}). Moreover, in a diagnostic ablation, replacing the native video--audio interface with timestamped frame samples and externally verified ASR (with word-level timing) substantially improves the same model (\textbf{17.0\%}$\to$\textbf{42.9\%}). Overall, EcoG-Bench provides a strict, executable testbed for event-level speech--gesture binding, and suggests that multimodal interfaces may bottleneck the observability of temporal alignment cues, independently of model reasoning.
cs.CV / 192 / 2603.07985

On the Feasibility and Opportunity of Autoregressive 3D Object Detection

自回归3D物体检测的可行性与机遇
Huang, Zanming, Yoo, Jinsu, Jeon, Sooyoung, Liu, Zhenzhen, Campbell, Mark, Weinberger, Kilian Q, Hariharan, Bharath, Chao, Wei-Lun, Luo, Katie Z
Abstract
LiDAR-based 3D object detectors typically rely on proposal heads with hand-crafted components like anchor assignment and non-maximum suppression (NMS), complicating training and limiting extensibility. We present AutoReg3D, an autoregressive 3D detector that casts detection as sequence generation. Given point-cloud features, AutoReg3D emits objects in a range-causal (near-to-far) order and encodes each object as a short, discrete-token sequence consisting of its center, size, orientation, velocity, and class. This near-to-far ordering mirrors LiDAR geometry--near objects occlude far ones but not vice versa--enabling straightforward teacher forcing during training and autoregressive decoding at test time. AutoReg3D is compatible with diverse point-cloud representations and backbones and attains competitive nuScenes performance without anchors or NMS. Beyond parity, the sequential formulation unlocks language-model advances for 3D perception, including GRPO-style reinforcement learning for task-aligned objectives. These results position autoregressive decoding as a viable, flexible alternative for LiDAR-based detection and open a path to importing modern sequence-modeling tools into 3D perception.
Chinese Translation
基于LiDAR的3D物体检测器通常依赖于具有手工设计组件的提议头,如锚点分配和非最大抑制(NMS),这使得训练过程复杂并限制了可扩展性。我们提出了AutoReg3D,这是一种自回归3D检测器,将检测视为序列生成。在给定点云特征的情况下,AutoReg3D以范围因果(近到远)的顺序发出物体,并将每个物体编码为一个短的离散标记序列,包含其中心、大小、方向、速度和类别。这种近到远的排序反映了LiDAR几何特性——近处的物体会遮挡远处的物体,但反之则不然——这使得在训练过程中可以简单地进行教师强制,并在测试时进行自回归解码。AutoReg3D在不同的点云表示和骨干网络上兼容,并在没有锚点或NMS的情况下达到了竞争性的nuScenes性能。不止于性能持平,这种序列化的形式还解锁了语言模型在3D感知中的进展,包括用于任务对齐目标的GRPO风格强化学习。这些结果将自回归解码定位为一种可行且灵活的LiDAR基础检测替代方案,并为将现代序列建模工具引入3D感知开辟了道路。
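The range-causal sequence view is easy to make concrete: sort boxes near-to-far by ground range, then quantize each attribute into a fixed discrete vocabulary. A minimal sketch, with illustrative value ranges, bin counts, and box layout (the paper's exact tokenization is not specified here):

import numpy as np

def quantize(x, lo, hi, bins):
    # Map a continuous value to an integer token id in [0, bins-1].
    x = np.clip(x, lo, hi)
    return int(min(int((x - lo) / (hi - lo) * bins), bins - 1))

def boxes_to_tokens(boxes, classes, bins=256, r_max=55.0):
    # boxes: (N, 7) rows of (x, y, z, l, w, h, yaw) in ego coordinates.
    order = np.argsort(np.hypot(boxes[:, 0], boxes[:, 1]))   # near-to-far
    tokens = []
    for i in order:
        x, y, z, l, w, h, yaw = boxes[i]
        tokens += [quantize(x, -r_max, r_max, bins),
                   quantize(y, -r_max, r_max, bins),
                   quantize(z, -5.0, 5.0, bins),
                   quantize(l, 0.0, 20.0, bins),
                   quantize(w, 0.0, 20.0, bins),
                   quantize(h, 0.0, 10.0, bins),
                   quantize(yaw, -np.pi, np.pi, bins),
                   int(classes[i])]
    return tokens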
cs.CV / 193 / 2603.07988

TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size

TeamHOI:学习适用于任意团队规模的合作人机交互统一策略
Lionar, Stefan, Lee, Gim Hee
Abstract
Physics-based humanoid control has achieved remarkable progress in enabling realistic and high-performing single-agent behaviors, yet extending these capabilities to cooperative human-object interaction (HOI) remains challenging. We present TeamHOI, a framework that enables a single decentralized policy to handle cooperative HOIs across any number of cooperating agents. Each agent operates using local observations while attending to other teammates through a Transformer-based policy network with teammate tokens, allowing scalable coordination across variable team sizes. To enforce motion realism while addressing the scarcity of cooperative HOI data, we further introduce a masked Adversarial Motion Prior (AMP) strategy that uses single-human reference motions while masking object-interacting body parts during training. The masked regions are then guided through task rewards to produce diverse and physically plausible cooperative behaviors. We evaluate TeamHOI on a challenging cooperative carrying task involving two to eight humanoid agents and varied object geometries. Finally, to promote stable carrying, we design a team-size- and shape-agnostic formation reward. TeamHOI achieves high success rates and demonstrates coherent cooperation across diverse configurations with a single policy.
Chinese Translation
基于物理的人形控制在实现真实且高效的单代理行为方面取得了显著进展,但将这些能力扩展到合作人机交互(HOI)仍然面临挑战。我们提出了TeamHOI,一个框架,能够使单个去中心化策略处理任意数量的合作代理的合作HOI。每个代理使用局部观察进行操作,同时通过基于Transformer的策略网络和队友标记关注其他队友,从而实现可扩展的协调,适应不同的团队规模。为了在解决合作HOI数据稀缺问题的同时强制执行运动的真实感,我们进一步引入了一种掩蔽对抗运动先验(AMP)策略,该策略在训练过程中使用单人参考运动,同时掩蔽与物体交互的身体部位。然后,通过任务奖励引导掩蔽区域,以产生多样且物理上合理的合作行为。我们在一个具有挑战性的合作搬运任务上评估TeamHOI,该任务涉及两到八个类人代理和不同的物体几何形状。最后,为了促进稳定的搬运,我们设计了一种与团队规模和形状无关的队形奖励。TeamHOI实现了高成功率,并在多种配置下展示了单一策略下的连贯合作。
cs.CV / 194 / 2603.07989

AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models

AutoTraces:通过多模态大型语言模型进行自回归轨迹预测
Wang, Teng, Lu, Yanting, Wang, Ruize
Abstract
We present AutoTraces, an autoregressive vision-language-trajectory model for robot trajectory forecasting in human-populated environments, which harnesses the inherent reasoning capabilities of large language models (LLMs) to model complex human behaviors. In contrast to prior works that rely solely on textual representations, our key innovation lies in a novel trajectory tokenization scheme, which represents waypoints with point tokens as categorical and positional markers while encoding waypoint numerical values as corresponding point embeddings, seamlessly integrated into the LLM's space through a lightweight encoder-decoder architecture. This design preserves the LLM's native autoregressive generation mechanism while extending it to physical coordinate spaces, facilitating the modeling of long-term interactions in trajectory data. We further introduce an automated chain-of-thought (CoT) generation mechanism that leverages a multimodal LLM to infer spatio-temporal relationships from visual observations and trajectory data, eliminating reliance on manual annotation. Through a two-stage training strategy, our AutoTraces achieves SOTA forecasting accuracy, particularly in long-horizon prediction, while exhibiting strong cross-scene generalization and supporting flexible-length forecasting.
Chinese Translation
我们提出了AutoTraces,这是一种自回归视觉-语言-轨迹模型,用于在人口密集环境中进行机器人轨迹预测,利用大型语言模型(LLMs)固有的推理能力来建模复杂的人类行为。与以往仅依赖文本表示的研究不同,我们的关键创新在于一种新颖的轨迹标记方案,该方案使用点标记作为类别和位置标记来表示航点,同时将航点的数值编码为相应的点嵌入,通过轻量级编码器-解码器架构无缝集成到LLM的空间中。该设计保留了LLM的原生自回归生成机制,同时将其扩展到物理坐标空间,促进了轨迹数据中长期交互的建模。我们进一步引入了一种自动化的思维链(CoT)生成机制,利用多模态LLM从视觉观察和轨迹数据中推断时空关系,消除了对手动标注的依赖。通过两阶段训练策略,我们的AutoTraces在预测准确性方面达到了当前最优(SOTA),尤其是在长期预测中表现出色,同时展现出强大的跨场景泛化能力,并支持灵活长度的预测。
cs.CV / 195 / 2603.08007

ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation

ViSA增强的空中视觉语言导航:一种增强视觉空间推理的空中视觉语言导航框架
Tong, Haoyu, Dong, Xiangyu, Ma, Xiaoguang, Zhao, Haoran, Zhou, Yaoming, Lin, Chenghao
Abstract
Existing aerial Vision-Language Navigation (VLN) methods predominantly adopt a detection-and-planning pipeline, which converts open-vocabulary detections into discrete textual scene graphs. These approaches are plagued by inadequate spatial reasoning capabilities and inherent linguistic ambiguities. To address these bottlenecks, we propose a Visual-Spatial Reasoning (ViSA) enhanced framework for aerial VLN. Specifically, a triple-phase collaborative architecture is designed to leverage structured visual prompting, enabling Vision-Language Models (VLMs) to perform direct reasoning on image planes without the need for additional training or complex intermediate representations. Comprehensive evaluations on the CityNav benchmark demonstrate that the ViSA-enhanced VLN achieves a 70.3% improvement in success rate compared to the fully trained state-of-the-art (SOTA) method, elucidating its great potential as a backbone for aerial VLN systems.
Chinese Translation
现有的空中视觉语言导航(VLN)方法主要采用检测与规划管道,将开放词汇检测转换为离散的文本场景图。这些方法存在空间推理能力不足和固有语言模糊性的问题。为了解决这些瓶颈,我们提出了一种增强视觉空间推理(ViSA)的空中VLN框架。具体而言,设计了一种三阶段协作架构,以利用结构化视觉提示,使视觉语言模型(VLMs)能够在图像平面上直接进行推理,而无需额外的训练或复杂的中间表示。在CityNav基准上的全面评估表明,ViSA增强的VLN相比于完全训练的最先进(SOTA)方法,成功率提高了70.3%,阐明了其作为空中VLN系统主干的巨大潜力。
cs.CV / 196 / 2603.08011

It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models

是时候做好了:改善视觉语言模型中的模拟钟表阅读和时钟指针空间推理
Choi, Jaeha, Lee, Jin Won, You, Siwoo, Lee, Jangho
Abstract
Advances in vision-language models (VLMs) have achieved remarkable success on complex multimodal reasoning tasks, leading to the assumption that they should also excel at reading analog clocks. However, contrary to this expectation, our study reveals that reading analog clocks in real-world environments remains a significant challenge for state-of-the-art VLMs. Existing analog clock datasets are largely synthetic or planar with limited stylistic diversity and minimal background context, failing to capture the visual variability of real-world scenes. As a result, VLMs trained on such data exhibit weak spatial-temporal reasoning, frequently confusing the hour and minute hands and struggling under common visual conditions such as occlusion, lighting variation, and cluttered backgrounds. To address this issue, we introduce TickTockVQA, a human-annotated dataset containing analog clocks in diverse real-world scenarios. TickTockVQA provides explicit hour and minute annotations, and includes an AM/PM tag when it is inferable from the visual context. Furthermore, we propose Swap-DPO, a direct preference optimization based fine-tuning framework to align model reasoning toward accurate time interpretation. Experimental results demonstrate that our approach substantially enhances clock reading accuracy and robustness under real-world conditions, establishing a foundation for future research on spatial-temporal reasoning and visual understanding in VLMs.
Chinese Translation
视觉语言模型(VLMs)的进展在复杂的多模态推理任务中取得了显著成功,这导致人们假设它们在阅读模拟钟表方面也应表现出色。然而,与这一预期相反,我们的研究揭示,在现实环境中阅读模拟钟表仍然是最先进的 VLMs 面临的一项重大挑战。现有的模拟钟表数据集大多是合成的或平面的,缺乏多样的风格和背景上下文,未能捕捉现实场景的视觉变异性。因此,基于这些数据训练的 VLMs 在时空推理方面表现较弱,常常混淆时针和分针,并在常见的视觉条件下(如遮挡、光照变化和杂乱背景)表现不佳。为了解决这一问题,我们引入了 TickTockVQA,这是一个包含多样化现实场景中模拟钟表的人类标注数据集。TickTockVQA 提供明确的时针和分针标注,并在可从视觉上下文推断时包含 AM/PM 标签。此外,我们提出了 Swap-DPO,这是一种基于直接偏好优化的微调框架,旨在使模型推理与准确的时间解释对齐。实验结果表明,我们的方法显著提高了在现实条件下的钟表阅读准确性和鲁棒性,为未来在 VLMs 中进行时空推理和视觉理解的研究奠定了基础。
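Swap-DPO builds on Direct Preference Optimization, whose standard objective fits in a few lines. A minimal sketch, assuming summed token log-probabilities from the policy and a frozen reference model; the pairing of a correct reading against a rejected one (e.g., a reading with the hour and minute hands swapped, our reading of the name) is an assumption, not a confirmed detail:

import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO: push the policy to prefer the chosen answer (w) over
    # the rejected answer (l) relative to a frozen reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()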
cs.CV / 197 / 2603.08018

Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared

不再缺失:基于字典引导的缺失红外图像的跨模态融合
Zhang, Yafei, Ma, Meng, Li, Huafeng, Liu, Yu
Abstract
Infrared-visible (IR-VIS) image fusion is vital for perception and security, yet most methods rely on the availability of both modalities during training and inference. When the infrared modality is absent, pixel-space generative substitutes become hard to control and inherently lack interpretability. We address missing-IR fusion by proposing a dictionary-guided, coefficient-domain framework built upon a shared convolutional dictionary. The pipeline comprises three key components: (1) Joint Shared-dictionary Representation Learning (JSRL) learns a unified and interpretable atom space shared by both IR and VIS modalities; (2) VIS-Guided IR Inference (VGII) transfers VIS coefficients to pseudo-IR coefficients in the coefficient domain and performs a one-step closed-loop refinement guided by a frozen large language model as a weak semantic prior; and (3) Adaptive Fusion via Representation Inference (AFRI) merges VIS structures and inferred IR cues at the atom level through window attention and convolutional mixing, followed by reconstruction with the shared dictionary. This encode-transfer-fuse-reconstruct pipeline avoids uncontrolled pixel-space generation while ensuring prior preservation within interpretable dictionary-coefficient representation. Experiments under missing-IR settings demonstrate consistent improvements in perceptual quality and downstream detection performance. To our knowledge, this represents the first framework that jointly learns a shared dictionary and performs coefficient-domain inference-fusion to tackle missing-IR fusion. The source code is publicly available at https://github.com/harukiv/DCMIF.
Chinese Translation
红外-可见光(IR-VIS)图像融合对于感知和安全至关重要,但大多数方法在训练和推理过程中依赖于两种模态的可用性。当红外模态缺失时,像素空间的生成替代品变得难以控制,并且本质上缺乏可解释性。我们通过提出一个基于共享卷积字典的字典引导系数域框架来解决缺失红外融合问题。该流程包括三个关键组件:(1)联合共享字典表示学习(JSRL)学习一个统一且可解释的原子空间,该空间由红外和可见光模态共享;(2)可见光引导的红外推理(VGII)在系数域中将可见光系数转移为伪红外系数,并通过一个冻结的大型语言模型作为弱语义先验进行一步闭环精炼;(3)通过表示推理的自适应融合(AFRI)在原子级别通过窗口注意力和卷积混合合并可见光结构和推断的红外线索,然后使用共享字典进行重建。该编码-转移-融合-重建流程避免了不受控制的像素空间生成,同时确保在可解释的字典-系数表示中保持先验。缺失红外设置下的实验表明感知质量和下游检测性能的一致改善。据我们所知,这是第一个共同学习共享字典并执行系数域推理-融合以应对缺失红外融合的框架。源代码可在 https://github.com/harukiv/DCMIF 上公开获取。
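Coefficient-domain inference over a shared dictionary rests on classic sparse coding: given the dictionary, coefficients are recovered by iterative soft thresholding. A minimal ISTA sketch with a dense dictionary standing in for the paper's convolutional one (the atom count and sparsity weight lam are illustrative):

import numpy as np

def ista_sparse_code(x, D, lam=0.1, n_iter=100):
    # Solve min_a 0.5 * ||x - D a||^2 + lam * ||a||_1 with ISTA,
    # where D holds dictionary atoms as columns.
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return a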
cs.CV / 198 / 2603.08020

VSDiffusion: Taming Ill-Posed Shadow Generation via Visibility-Constrained Diffusion

VSDiffusion:通过可见性约束扩散抑制病态阴影生成
Li, Jing, Zhang, Jing
Abstract
Generating realistic cast shadows for inserted foreground objects is a crucial yet challenging problem in image composition, where maintaining geometric consistency of shadow and object in complex scenes remains difficult due to the ill-posed nature of shadow formation. To address this issue, we propose VSDiffusion, a visibility-constrained two-stage framework designed to narrow the solution space by incorporating visibility priors. In Stage I, we predict a coarse shadow mask to localize plausible shadow regions. In Stage II, conditional diffusion guided by lighting and depth cues estimated from the composite generates accurate shadows. In VSDiffusion, we inject visibility priors through two complementary pathways. First, a visibility control branch with shadow-gated cross-attention provides multi-scale structural guidance. Second, a learned soft prior map reweights the training loss in error-prone regions to enhance geometric correction. Additionally, we introduce a high-frequency guided enhancement module to sharpen boundaries and improve texture interaction with the background. Experiments on the widely used public DESOBAv2 dataset demonstrate that VSDiffusion generates accurate shadows and establishes new SOTA results across most evaluation metrics.
Chinese Translation
为插入前景物体生成逼真的投射阴影是图像合成中的一个关键但具有挑战性的问题。在复杂场景中,由于阴影形成的病态特性,保持阴影与物体的几何一致性仍然困难。为了解决这一问题,我们提出了VSDiffusion,一种可见性约束的两阶段框架,旨在通过引入可见性先验来缩小解空间。在第一阶段,我们预测一个粗略的阴影掩模,以定位可能生成阴影的区域。在第二阶段,基于从合成图像中估计的光照和深度线索进行条件扩散,以生成准确的阴影。在VSDiffusion中,我们通过两条互补路径注入可见性先验。首先,采用带有阴影门控交叉注意力的可见性控制分支,提供多尺度的结构指导。然后,使用学习到的软先验图,在易出错区域重新加权训练损失,以增强几何校正。此外,我们还引入了高频引导增强模块,以锐化边界并改善与背景的纹理交互。在广泛使用的公共DESOBAv2数据集上的实验表明,我们提出的VSDiffusion能够生成准确的阴影,并在大多数评估指标上建立新的最优结果。
cs.CV / 199 / 2603.08023

Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model

不同于变压器:基于Mamba的扩散模型在舞蹈生成中的节拍表示方法
Park, Sangjune, Choi, Inhyeok, Soon, Donghyeon, Jeon, Youngwoo, Joo, Kyungdon
Abstract
Dance is a form of human motion characterized by emotional expression and communication, playing a role in various fields such as music, virtual reality, and content creation. Existing methods for dance generation often fail to adequately capture the inherently sequential, rhythmical, and music-synchronized characteristics of dance. In this paper, we propose \emph{MambaDance}, a new dance generation approach that leverages a Mamba-based diffusion model. Mamba, well-suited to handling long and autoregressive sequences, is integrated into our two-stage diffusion architecture, substituting the off-the-shelf Transformer. Additionally, considering the critical role of musical beats in dance choreography, we propose a Gaussian-based beat representation to explicitly guide the decoding of dance sequences. Experiments on the AIST++ and FineDance datasets across sequence lengths show that, compared to previous methods, our approach effectively generates plausible dance movements reflecting these essential characteristics, consistently from short to long dances. Additional qualitative results and demo videos are available at \small{https://vision3d-lab.github.io/mambadance}.
Chinese Translation
舞蹈是一种以情感表达和交流为特征的人类运动形式,在音乐、虚拟现实和内容创作等多个领域发挥着重要作用。现有的舞蹈生成方法往往无法充分捕捉舞蹈固有的顺序性、节奏性和与音乐同步的特征。本文提出了一种新的舞蹈生成方法MambaDance,该方法利用基于Mamba的扩散模型。Mamba非常适合处理长的自回归序列,并被整合到我们的两阶段扩散架构中,以替代现成的变压器(Transformer)。此外,考虑到音乐节拍在舞蹈编排中的关键作用,我们提出了一种基于高斯的节拍表示,以明确指导舞蹈序列的解码。在AIST++和FineDance数据集上的实验表明,与之前的方法相比,我们提出的方法在不同序列长度下有效生成了可信的舞蹈动作,同时反映了舞蹈的基本特征,从短舞蹈到长舞蹈均保持一致。更多的定性结果和演示视频可在 https://vision3d-lab.github.io/mambadance 获取。
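One way to realize a Gaussian-based beat representation is a single conditioning curve over the frame timeline with a bump centered on each musical beat. A minimal sketch under that reading; fps and sigma are illustrative, and the paper's exact parameterization is not reproduced here:

import numpy as np

def gaussian_beat_curve(beat_times, n_frames, fps=30.0, sigma=2.0):
    # beat_times: beat onsets in seconds; sigma is the bump width in frames.
    t = np.arange(n_frames, dtype=float)
    beats = np.asarray(beat_times, dtype=float) * fps
    curve = np.exp(-0.5 * ((t[:, None] - beats[None, :]) / sigma) ** 2).sum(axis=1)
    return np.clip(curve, 0.0, 1.0)        # (n_frames,) beat-conditioning signal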
cs.CV / 200 / 2603.08028

Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

通过文本到骨架级联的可控复杂人类运动视频生成
Taghipour, Ashkan, Ghahremani, Morteza, Li, Zinuo, Laga, Hamid, Boussaid, Farid, Bennamoun, Mohammed
Abstract
Generating videos of complex human motions such as flips, cartwheels, and martial arts remains challenging for current video diffusion models. Text-only conditioning is temporally ambiguous for fine-grained motion control, while explicit pose-based controls, though effective, require users to provide complete skeleton sequences that are costly to produce for long and dynamic actions. We propose a two-stage cascaded framework that addresses both limitations. First, an autoregressive text-to-skeleton model generates 2D pose sequences from natural language descriptions by predicting each joint conditioned on previously generated poses. This design captures long-range temporal dependencies and inter-joint coordination required for complex motions. Second, a pose-conditioned video diffusion model synthesizes videos from a reference image and the generated skeleton sequence. It employs DINO-ALF (Adaptive Layer Fusion), a multi-level reference encoder that preserves appearance and clothing details under large pose changes and self-occlusions. To address the lack of publicly available datasets for complex human motion video generation, we introduce a Blender-based synthetic dataset containing 2,000 videos with diverse characters performing acrobatic and stunt-like motions. The dataset provides full control over appearance, motion, and environment. It fills an important gap because existing benchmarks significantly under-represent acrobatic motions while web-collected datasets raise copyright and privacy concerns. Experiments on our synthetic dataset and the Motion-X Fitness benchmark show that our text-to-skeleton model outperforms prior methods on FID, R-precision, and motion diversity. Our pose-to-video model also achieves the best results among all compared methods on VBench metrics for temporal consistency, motion smoothness, and subject preservation.
Chinese Translation
生成复杂人类运动的视频,如翻滚、侧手翻和武术,仍然是当前视频扩散模型面临的挑战。仅依赖文本条件在细粒度运动控制上存在时间模糊性,而显式的基于姿势的控制虽然有效,却要求用户提供完整的骨架序列,这对于长时间和动态动作来说成本较高。我们提出了一种两阶段级联框架,解决了这两种限制。首先,一个自回归的文本到骨架模型通过预测每个关节的姿势,基于先前生成的姿势,从自然语言描述中生成2D姿势序列。该设计捕捉了复杂运动所需的长距离时间依赖性和关节间协调性。其次,一个基于姿势的条件视频扩散模型从参考图像和生成的骨架序列合成视频。它采用了DINO-ALF(自适应层融合),这是一个多层次的参考编码器,能够在大幅度姿势变化和自遮挡下保留外观和服装细节。为了解决复杂人类运动视频生成缺乏公开可用数据集的问题,我们引入了一个基于Blender的合成数据集,包含2000个视频,展示了多样化角色进行杂技和特技动作。该数据集提供了对外观、运动和环境的全面控制。它填补了一个重要的空白,因为现有基准显著低估了杂技动作,而网络收集的数据集则引发了版权和隐私问题。我们在合成数据集和Motion-X Fitness基准上的实验表明,我们的文本到骨架模型在FID、R-精度和运动多样性方面优于之前的方法。我们的姿势到视频模型在所有比较方法中也在VBench指标上取得了最佳结果,表现出时间一致性、运动平滑性和主体保留。
cs.CV / 201 / 2603.08030

QualiTeacher: Quality-Conditioned Pseudo-Labeling for Real-World Image Restoration

QualiTeacher:基于质量条件的伪标签方法用于真实世界图像恢复
Xiao, Fengyang, Feng, Jingjia, Hu, Peng, Zhang, Dingming, Xu, Lei, Qin, Guanyi, Li, Lu, He, Chunming, Farsiu, Sina
Abstract
Real-world image restoration (RWIR) is a highly challenging task due to the absence of clean ground-truth images. Many recent methods resort to pseudo-label (PL) supervision, often within a Mean-Teacher (MT) framework. However, these methods face a critical paradox: unconditionally trusting the often imperfect, low-quality PLs forces the student model to learn undesirable artifacts, while discarding them severely limits data diversity and impairs model generalization. In this paper, we propose QualiTeacher, a novel framework that transforms pseudo-label quality from a noisy liability into a conditional supervisory signal. Instead of filtering, QualiTeacher explicitly conditions the student model on the quality of the PLs, estimated by an ensemble of complementary non-reference image quality assessment (NR-IQA) models spanning low-level distortion and semantic-level assessment. This strategy teaches the student network to learn a quality-graded restoration manifold, enabling it to understand what constitutes different quality levels. Consequently, it can not only avoid mimicking artifacts from low-quality labels but also extrapolate to generate results of higher quality than the teacher itself. To ensure the robustness and accuracy of this quality-driven learning, we further enhance the process with a multi-augmentation scheme to diversify the PL quality spectrum, a score-based preference optimization strategy inspired by Direct Preference Optimization (DPO) to enforce a monotonically ordered quality separation, and a cropped consistency loss to prevent adversarial over-optimization (reward hacking) of the IQA models. Experiments on standard RWIR benchmarks demonstrate that QualiTeacher can serve as a plug-and-play strategy to improve the quality of the existing pseudo-labeling framework, establishing a new paradigm for learning from imperfect supervision. Code will be released.
Chinese Translation
真实世界图像恢复(RWIR)是一项极具挑战性的任务,因为缺乏干净的真实图像。许多近期的方法依赖于伪标签(PL)监督,通常是在均值教师(MT)框架内。然而,这些方法面临一个关键的悖论:无条件地信任那些往往不完美且低质量的伪标签,迫使学生模型学习不良伪影,而丢弃这些伪标签又严重限制了数据多样性并损害了模型的泛化能力。本文提出了QualiTeacher,一个新颖的框架,将伪标签的质量从噪声负担转变为条件监督信号。QualiTeacher并不是简单地过滤伪标签,而是明确地将学生模型的学习条件化于伪标签的质量,该质量通过一组互补的非参考图像质量评估(NR-IQA)模型进行估计,这些模型涵盖了低级失真和语义级评估。这一策略教会学生网络学习一个质量分级的恢复流形,使其能够理解不同质量水平的构成。因此,它不仅可以避免模仿低质量标签中的伪影,还能够推断生成比教师模型本身更高质量的结果。为了确保这种基于质量的学习的鲁棒性和准确性,我们进一步通过多重增强方案来丰富伪标签质量谱,采用受直接偏好优化(DPO)启发的基于分数的偏好优化策略来强制实现单调有序的质量分离,并引入裁剪一致性损失以防止图像质量评估模型的对抗性过度优化(奖励黑客)。在标准RWIR基准上的实验表明,QualiTeacher可以作为一种即插即用的策略来提高现有伪标签框架的质量,为从不完美监督中学习建立了新的范式。代码将会发布。
cs.CV / 202 / 2603.08034

Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

第十届ABAW表情识别挑战赛的解决方案:一种具有安全交叉注意力和模态丢弃的鲁棒多模态框架
Yu, Jun, Zheng, Naixiang, Wang, Guoyuan, Zhang, Yunxiang, Zhu, Lingsi, Liang, Jiaen, Huang, Wei, Liu, Shengping
Abstract
Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring a safe cross-attention mechanism and a modality dropout strategy. This design allows the network to rely on audio-based predictions when visual cues are absent. To mitigate the long-tail distribution of the Aff-Wild2 dataset, we apply focal loss optimization, combined with a sliding-window soft voting strategy to capture dynamic emotional transitions and reduce frame-level classification jitter. Experiments demonstrate that our framework effectively handles missing modalities and complex spatiotemporal dependencies, achieving an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.
Chinese Translation
在现实环境中,情感识别受到部分遮挡、缺失模态和严重类别不平衡的影响。为了解决这些问题,特别是针对野外情感行为分析(ABAW)表情挑战,我们提出了一种动态融合视觉和音频表示的多模态框架。我们的方法采用双分支Transformer架构,具有安全交叉注意力机制和模态丢弃策略。这一设计使得网络在缺少视觉线索时能够依赖音频预测。为了缓解Aff-Wild2数据集的长尾分布,我们应用了聚焦损失优化,并结合滑动窗口软投票策略以捕捉动态情感变化并减少帧级分类抖动。实验表明,我们的框架有效处理缺失模态和复杂时空依赖,在Aff-Wild2验证集上实现了60.79%的准确率和0.5029的F1-score。
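The sliding-window soft-voting step is simple to sketch: average the per-class probabilities over a centered temporal window before taking the argmax, which suppresses frame-level jitter. A minimal version, assuming an (N, C) array of per-frame probabilities (the window size is illustrative):

import numpy as np

def sliding_soft_vote(frame_probs, window=15):
    # Average class probabilities in a centered window, then classify.
    n = len(frame_probs)
    half = window // 2
    smoothed = np.empty_like(frame_probs)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed[i] = frame_probs[lo:hi].mean(axis=0)
    return smoothed.argmax(axis=1)         # (N,) smoothed expression labels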
cs.CV / 203 / 2603.08055

Speed3R: Sparse Feed-forward 3D Reconstruction Models

Speed3R:稀疏前馈3D重建模型
Ren, Weining, Tan, Xiao, Han, Kai
Abstract
While recent feed-forward 3D reconstruction models accelerate 3D reconstruction by jointly inferring dense geometry and camera poses in a single pass, their reliance on dense attention imposes a quadratic complexity, creating a prohibitive computational bottleneck that severely limits inference speed. To resolve this, we introduce Speed3R, an end-to-end trainable model inspired by the core principle of Structure-from-Motion: that a sparse set of keypoints is sufficient for robust pose estimation. Speed3R features a dual-branch attention mechanism where a compression branch creates a coarse contextual prior to guide a selection branch, which performs fine-grained attention only on the most informative image tokens. This strategy mimics the efficiency of traditional keypoint matching, achieving a remarkable 12.4x inference speedup on 1000-view sequences, while introducing a minimal, controlled trade-off in geometric accuracy. Validated on standard benchmarks with both VGGT and $\pi^3$ backbones, our method delivers high-quality reconstructions at a fraction of computational cost, paving the way for efficient large-scale scene modeling.
Chinese Translation
尽管最近的前馈3D重建模型通过在单次推理中共同推断稠密几何和相机姿态来加速3D重建,但它们对稠密注意力的依赖导致了二次复杂度,从而形成了一个严重限制推理速度的计算瓶颈。为了解决这个问题,我们提出了Speed3R,这是一种端到端可训练的模型,灵感来源于运动结构(Structure-from-Motion)的核心原则:稀疏的关键点集足以进行稳健的姿态估计。Speed3R具有双分支注意力机制,其中压缩分支创建粗略的上下文先验,以指导选择分支,该分支仅对最具信息量的图像标记执行细粒度注意力。这一策略模仿了传统关键点匹配的效率,在1000视图序列上实现了显著的12.4倍推理加速,同时在几何精度上引入了最小且可控的权衡。经过在标准基准测试上验证,包括VGGT和$\pi^3$骨干网络,我们的方法以极低的计算成本提供高质量的重建,为高效的大规模场景建模铺平了道路。
cs.CV / 204 / 2603.08059

ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning

ImageEdit-R1:通过强化学习提升多智能体图像编辑
Zhao, Yiran, Ye, Yaoqi, Liu, Xiang, Shieh, Michael Qizhe, Bui, Trung
Abstract
With the rapid advancement of commercial multi-modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly closed-source or proprietary models, often struggle with complex, indirect, or multi-step user instructions. These limitations hinder their ability to perform nuanced, context-aware edits that align with human intent. In this work, we propose ImageEdit-R1, a multi-agent framework for intelligent image editing that leverages reinforcement learning to coordinate high-level decision-making across a set of specialized, pretrained vision-language and generative agents. Each agent is responsible for distinct capabilities--such as understanding user intent, identifying regions of interest, selecting appropriate editing actions, and synthesizing visual content--while reinforcement learning governs their collaboration to ensure coherent and goal-directed behavior. Unlike existing approaches that rely on monolithic models or hand-crafted pipelines, our method treats image editing as a sequential decision-making problem, enabling dynamic and context-aware editing strategies. Experimental results demonstrate that ImageEdit-R1 consistently outperforms both individual closed-source diffusion models and alternative multi-agent framework baselines across multiple image editing datasets.
Chinese Translation
随着商业多模态模型的快速发展,图像编辑因其在日常生活中的广泛应用而受到广泛关注。尽管取得了显著进展,现有的图像编辑系统,特别是闭源或专有模型,往往在处理复杂、间接或多步骤的用户指令时面临困难。这些局限性妨碍了它们执行与人类意图相符的细致、上下文敏感的编辑能力。在本研究中,我们提出了ImageEdit-R1,一个智能图像编辑的多智能体框架,利用强化学习协调一组专门的、预训练的视觉-语言和生成智能体之间的高层决策。每个智能体负责不同的能力——例如理解用户意图、识别兴趣区域、选择适当的编辑操作以及合成视觉内容——而强化学习则管理它们的协作,以确保一致且目标导向的行为。与依赖单一模型或手工制作管道的现有方法不同,我们的方法将图像编辑视为一个序列决策问题,从而实现动态和上下文敏感的编辑策略。实验结果表明,ImageEdit-R1在多个图像编辑数据集上始终优于单个闭源扩散模型和其他多智能体框架基线。
cs.CV / 205 / 2603.08063

Enhancing Cross-View UAV Geolocalization via LVLM-Driven Relational Modeling

通过 LVLM 驱动的关系建模增强跨视角无人机地理定位
Liu, Bowen, Jia, Pengyue, Wang, Wanyu, Xu, Derong, Cheng, Jiawei, Dong, Jiancheng, Han, Xiao, Zhao, Zimo, Zhang, Chao, Yu, Bowen, Hong, Fangyu, Zhao, Xiangyu
Abstract
The primary objective of cross-view UAV geolocalization is to identify the exact spatial coordinates of drone-captured imagery by aligning it with extensive, geo-referenced satellite databases. Current approaches typically extract features independently from each perspective and rely on basic heuristics to compute similarity, thereby failing to explicitly capture the essential interactions between different views. To address this limitation, we introduce a novel, plug-and-play ranking architecture designed to explicitly perform joint relational modeling for improved UAV-to-satellite image matching. By harnessing the capabilities of a Large Vision-Language Model (LVLM), our framework effectively learns the deep visual-semantic correlations linking UAV and satellite imagery. Furthermore, we present a novel relational-aware loss function to optimize the training phase. By employing soft labels, this loss provides fine-grained supervision that avoids overly penalizing near-positive matches, ultimately boosting both the model's discriminative power and training stability. Comprehensive evaluations across various baseline architectures and standard benchmarks reveal that the proposed method substantially boosts the retrieval accuracy of existing models, yielding superior performance even under highly demanding conditions.
Chinese Translation
跨视角无人机地理定位的主要目标是通过将无人机捕获的图像与广泛的地理参考卫星数据库对齐,识别其确切的空间坐标。目前的方法通常独立地从每个视角提取特征,并依赖基本的启发式方法来计算相似性,因此未能明确捕捉不同视角之间的基本交互。为了解决这一局限性,我们提出了一种新颖的即插即用排名架构,旨在显式执行联合关系建模,以改善无人机与卫星图像的匹配。通过利用大型视觉-语言模型(Large Vision-Language Model, LVLM)的能力,我们的框架有效地学习了连接无人机和卫星图像的深层视觉-语义关联。此外,我们提出了一种新颖的关系感知损失函数,以优化训练阶段。通过采用软标签,该损失提供了细粒度的监督,避免对近正匹配施加过度惩罚,最终提升了模型的区分能力和训练稳定性。在各种基线架构和标准基准测试中的全面评估表明,所提出的方法显著提高了现有模型的检索准确性,即使在高度苛刻的条件下也表现出优越的性能。
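A soft-label ranking loss of the kind described can be sketched as a cross-entropy whose target spreads a small mass eps over near-positive candidates in proportion to their similarity, so near-misses are penalized less than outright errors. A minimal PyTorch sketch under that reading; tau and eps are illustrative, not the paper's values:

import torch
import torch.nn.functional as F

def soft_label_retrieval_loss(sim, pos_idx, tau=0.05, eps=0.1):
    # sim: (B, K) UAV-to-candidate similarity scores; pos_idx: (B,) true tiles.
    with torch.no_grad():
        target = F.softmax(sim / tau, dim=1)
        rows = torch.arange(sim.size(0), device=sim.device)
        target[rows, pos_idx] = 0.0                       # mass for near-positives only
        target = target / target.sum(1, keepdim=True).clamp_min(1e-8) * eps
        target[rows, pos_idx] = 1.0 - eps                 # bulk of the mass on the truth
    return -(target * F.log_softmax(sim / tau, dim=1)).sum(1).mean()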
cs.CV / 206 / 2603.08064

Evaluating Generative Models via One-Dimensional Code Distributions

通过一维编码分布评估生成模型
Jia, Zexi, Luo, Pengcheng, Zhong, Yijia, Zhang, Jinchao, Zhou, Jie
Abstract
Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of \emph{discrete} visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce \emph{Codebook Histogram Distance} (CHD), a training-free distribution metric in token space, and \emph{Code Mixture Model Score} (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose \emph{VisForm}, a benchmark of 210K images spanning 62 visual forms and 12 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments, and we will release all code and datasets to facilitate future research.
Chinese Translation
大多数生成模型的评估依赖于特征分布指标,如FID,这些指标基于连续的识别特征,这些特征经过明确训练以对外观变化保持不变,因此丢弃了对感知质量至关重要的线索。我们则在离散视觉标记的空间中评估模型,其中现代的一维图像标记器紧凑地编码了语义和感知信息,质量表现为可预测的标记统计。我们引入了编码簿直方图距离(Codebook Histogram Distance,CHD),这是一种在标记空间中的无训练分布指标,以及编码混合模型分数(Code Mixture Model Score,CMMS),这是一种从标记序列的合成退化中学习的无参考质量指标。为了在广泛的分布变化下对指标进行压力测试,我们进一步提出了VisForm,这是一个包含210K图像、涵盖62种视觉形式和12种生成模型的基准,并附有专家注释。在AGIQA、HPDv2/3和VisForm中,我们的基于标记的指标与人类判断的相关性达到了最先进水平,我们将发布所有代码和数据集以促进未来的研究。
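The paper's exact definition of CHD is not reproduced in the abstract, but the name suggests comparing codebook-usage distributions of real versus generated images. One plausible, training-free instantiation as a sketch (total variation is our choice of distance):

import numpy as np

def code_histogram(token_ids, codebook_size):
    # token_ids: list of 1D integer arrays, one per image, from a 1D tokenizer.
    h = np.bincount(np.concatenate(token_ids), minlength=codebook_size).astype(float)
    return h / max(h.sum(), 1.0)

def chd(real_tokens, gen_tokens, codebook_size):
    # Total variation between real and generated token-usage distributions.
    p = code_histogram(real_tokens, codebook_size)
    q = code_histogram(gen_tokens, codebook_size)
    return 0.5 * float(np.abs(p - q).sum())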
cs.CV / 207 / 2603.08069

Synthetic Defect Image Generation for Power Line Insulator Inspection Using Multimodal Large Language Models

基于多模态大型语言模型的电力线路绝缘子检测合成缺陷图像生成
Wang, Xuesong, Wang, Caisheng
Abstract
Utility companies increasingly rely on drone imagery for post-event and routine inspection, but training accurate defect-type classifiers remains difficult because defect examples are rare and inspection datasets are often limited or proprietary. We address this data-scarcity setting by using an off-the-shelf multimodal large language model (MLLM) as a training-free image generator to synthesize defect images from visual references and text prompts. Our pipeline increases diversity via dual-reference conditioning, improves label fidelity with lightweight human verification and prompt refinement, and filters the resulting synthetic pool using an embedding-based selection rule based on distances to class centroids computed from the real training split. We evaluate on ceramic insulator defect-type classification (shell vs. glaze) using a public dataset with a realistic low training-data regime (104 real training images; 152 validation; 308 test). Augmenting the 10% real training set with embedding-selected synthetic images improves test F1 score (harmonic mean of precision and recall) from 0.615 to 0.739 (20% relative), corresponding to an estimated 4--5x data-efficiency gain, and the gains persist with stronger backbone models and frozen-feature linear-probe baselines. These results suggest a practical, low-barrier path for improving defect recognition when collecting additional real defects is slow or infeasible.
Chinese Translation
公用事业公司越来越依赖无人机图像进行事件后和例行检查,但由于缺陷示例稀缺且检查数据集通常有限或专有,训练准确的缺陷类型分类器仍然困难。我们通过使用现成的多模态大型语言模型(MLLM)作为无训练图像生成器,从视觉参考和文本提示合成缺陷图像,以应对这一数据稀缺的情况。我们的流程通过双重参考条件增加多样性,通过轻量级人工验证和提示优化提高标签的准确性,并使用基于嵌入的选择规则过滤生成的合成图像池,该规则基于从真实训练集计算的类别中心的距离。我们在陶瓷绝缘子缺陷类型分类(壳体与釉面)上进行评估,使用一个具有现实低训练数据条件的公共数据集(104张真实训练图像;152张验证图像;308张测试图像)。将10%的真实训练集与嵌入选择的合成图像结合,测试F1分数(精确率和召回率的调和均值)从0.615提高到0.739(相对提高20%),对应于估计的4-5倍数据效率提升,并且这些提升在更强的主干模型和冻结特征线性探针基线中依然存在。这些结果表明,在收集额外真实缺陷的速度缓慢或不可行时,提供了一条实用的、低门槛的缺陷识别改进路径。
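The embedding-based selection rule keeps the synthetic images whose embeddings lie closest to their class centroid, where centroids come from the real training split only. A minimal sketch of that filter (keep_ratio is illustrative):

import numpy as np

def select_synthetic(real_emb, real_labels, syn_emb, syn_labels, keep_ratio=0.5):
    # Per class: centroid from real embeddings, keep nearest synthetic samples.
    keep = []
    for c in np.unique(real_labels):
        centroid = real_emb[real_labels == c].mean(axis=0)
        idx = np.where(syn_labels == c)[0]
        d = np.linalg.norm(syn_emb[idx] - centroid, axis=1)
        k = max(1, int(len(idx) * keep_ratio))
        keep.extend(idx[np.argsort(d)[:k]].tolist())
    return sorted(keep)                    # indices into the synthetic pool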
cs.CV / 208 / 2603.08075

TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery

TALON:测试时自适应学习用于即时类别发现
Wu, Yanan, Yan, Yuhan, Chen, Tailai, Chi, Zhixiang, Wu, ZiZhang, Jin, Yi, Wang, Yang, Li, Zhenbo
Abstract
On-the-fly category discovery (OCD) aims to recognize known categories while simultaneously discovering novel ones from an unlabeled online stream, using a model trained only on labeled data. Existing approaches freeze the feature extractor trained offline and employ a hash-based framework that quantizes features into binary codes as class prototypes. However, discovering novel categories with a fixed knowledge base is counterintuitive, as the learning potential of incoming data is entirely neglected. In addition, feature quantization introduces information loss, diminishes representational expressiveness, and amplifies intra-class variance. It often results in category explosion, where a single class is fragmented into multiple pseudo-classes. To overcome these limitations, we propose a test-time adaptation framework that enables learning through discovery. It incorporates two complementary strategies: a semantic-aware prototype update and a stable test-time encoder update. The former dynamically refines class prototypes to enhance classification, whereas the latter integrates new information directly into the parameter space. Together, these components allow the model to continuously expand its knowledge base with newly encountered samples. Furthermore, we introduce a margin-aware logit calibration in the offline stage to enlarge inter-class margins and improve intra-class compactness, thereby reserving embedding space for future class discovery. Experiments on standard OCD benchmarks demonstrate that our method substantially outperforms existing hash-based state-of-the-art approaches, yielding notable improvements in novel-class accuracy and effectively mitigating category explosion. The code is publicly available at https://github.com/ynanwu/TALON.
Chinese Translation
即时类别发现(OCD)旨在在识别已知类别的同时,从未标记的在线流中发现新类别,使用仅在标记数据上训练的模型。现有方法冻结离线训练的特征提取器,并采用基于哈希的框架将特征量化为二进制代码作为类别原型。然而,使用固定知识库发现新类别是违反直觉的,因为完全忽视了新数据的学习潜力。此外,特征量化引入了信息损失,降低了表示的表达能力,并放大了类内方差。这通常导致类别爆炸,即单一类别被分割成多个伪类别。为了克服这些局限性,我们提出了一种测试时自适应框架,能够通过发现进行学习。它结合了两种互补策略:语义感知的原型更新和稳定的测试时编码器更新。前者动态地细化类别原型以增强分类,而后者则将新信息直接整合到参数空间中。这些组件共同使模型能够不断扩展其知识库,以应对新遇到的样本。此外,我们在离线阶段引入了一种边际感知的逻辑校准,以扩大类间边际并改善类内紧凑性,从而为未来的类别发现保留嵌入空间。在标准OCD基准上的实验表明,我们的方法显著优于现有的基于哈希的最先进方法,在新类别准确性方面取得了显著提升,并有效缓解了类别爆炸。代码已公开发布于 https://github.com/ynanwu/TALON。
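The prototype-update side of the method can be sketched as test-time prototype maintenance: match a feature to its nearest prototype by cosine similarity, refine the match with an EMA update, and spawn a new prototype when nothing is similar enough (the novel-class case). A minimal sketch with an illustrative threshold and momentum, not the paper's exact update:

import numpy as np

def discover_step(feat, prototypes, labels, next_label,
                  sim_thresh=0.7, momentum=0.9):
    # prototypes/labels are growing lists maintained across the online stream.
    feat = feat / np.linalg.norm(feat)
    if prototypes:
        sims = np.array([feat @ (p / np.linalg.norm(p)) for p in prototypes])
        j = int(sims.argmax())
        if sims[j] >= sim_thresh:                       # known category: EMA refine
            prototypes[j] = momentum * prototypes[j] + (1 - momentum) * feat
            return labels[j], next_label
    prototypes.append(feat.copy())                      # novel category discovered
    labels.append(next_label)
    return next_label, next_label + 1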
cs.CV / 209 / 2603.08086

From Reactive to Map-Based AI: Tuned Local LLMs for Semantic Zone Inference in Object-Goal Navigation

从反应式人工智能到基于地图的人工智能:调优本地大语言模型用于目标导航中的语义区域推断
Noda, Yudai, Tanaka, Kanji
Abstract
Object-Goal Navigation (ObjectNav) requires an agent to find and navigate to a target object category in unknown environments. While recent Large Language Model (LLM)-based agents exhibit zero-shot reasoning, they often rely on a "reactive" paradigm that lacks explicit spatial memory, leading to redundant exploration and myopic behaviors. To address these limitations, we propose a transition from reactive AI to "Map-Based AI" by integrating LLM-based semantic inference with a hybrid topological-grid mapping system. Our framework employs a fine-tuned Llama-2 model via Low-Rank Adaptation (LoRA) to infer semantic zone categories and target existence probabilities from verbalized object observations. In this study, a "zone" is defined as a functional area described by the set of observed objects, providing crucial semantic co-occurrence cues for finding the target. This semantic information is integrated into a topological graph, enabling the agent to prioritize high-probability areas and perform systematic exploration via Traveling Salesman Problem (TSP) optimization. Evaluations in the AI2-THOR simulator demonstrate that our approach significantly outperforms traditional frontier exploration and reactive LLM baselines, achieving a superior Success Rate (SR) and Success weighted by Path Length (SPL).
Chinese Translation
目标导航(ObjectNav)要求智能体在未知环境中找到并导航到目标物体类别。尽管最近基于大语言模型(LLM)的智能体展现了零样本推理能力,但它们往往依赖于缺乏明确空间记忆的“反应式”范式,导致冗余探索和短视行为。为了解决这些局限性,我们提出从反应式人工智能过渡到“基于地图的人工智能”,通过将基于LLM的语义推断与混合拓扑-网格映射系统相结合。我们的框架通过低秩适应(Low-Rank Adaptation, LoRA)对Llama-2模型进行微调,以从口头描述的物体观察中推断语义区域类别和目标存在概率。在本研究中,“区域”被定义为由观察到的物体集合描述的功能区域,为寻找目标提供了重要的语义共现线索。这些语义信息被整合到拓扑图中,使智能体能够优先考虑高概率区域,并通过旅行商问题(Traveling Salesman Problem, TSP)优化进行系统探索。在AI2-THOR模拟器中的评估表明,我们的方法显著优于传统的前沿探索和反应式LLM基线,取得了更高的成功率(Success Rate, SR)和路径长度加权成功率(Success weighted by Path Length, SPL)。
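Once per-zone target probabilities sit on the topological graph, exploration ordering becomes a TSP-like problem. A minimal greedy stand-in for the paper's TSP optimization, trading travel cost against inferred probability (alpha is an illustrative trade-off exponent, not the paper's objective):

import numpy as np

def plan_zone_tour(dist, target_prob, start=0, alpha=2.0):
    # dist: (Z, Z) travel costs between zones; target_prob: (Z,) LLM-inferred
    # probabilities that the target object exists in each zone.
    unvisited = set(range(len(target_prob))) - {start}
    tour, cur = [start], start
    while unvisited:
        cand = list(unvisited)
        score = [target_prob[j] ** alpha / (dist[cur][j] + 1e-6) for j in cand]
        cur = cand[int(np.argmax(score))]   # best probability-per-cost next zone
        tour.append(cur)
        unvisited.remove(cur)
    return tour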
cs.CV / 210 / 2603.08090

DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation

DSH-Bench:一种具有层次主题分类法的难度和场景感知基准,用于主题驱动的文本到图像生成
Hu, Zhenyu, Wang, Qing, Cao, Te, Liao, Luo, Lu, Longfei, Liu, Liqun, Li, Shuang, Chen, Hang, Xue, Mengge, Chen, Yuan, Deng, Chao, Shu, Peng, Yu, Huan, Jiang, Jie
Abstract
Significant progress has been achieved in subject-driven text-to-image (T2I) generation, which aims to synthesize new images depicting target subjects according to user instructions. However, evaluating these models remains a significant challenge. Existing benchmarks exhibit critical limitations: 1) insufficient diversity and comprehensiveness in subject images, 2) inadequate granularity in assessing model performance across different subject difficulty levels and prompt scenarios, and 3) a profound lack of actionable insights and diagnostic guidance for subsequent model refinement. To address these limitations, we propose DSH-Bench, a comprehensive benchmark that enables systematic multi-perspective analysis of subject-driven T2I models through four principal innovations: 1) a hierarchical taxonomy sampling mechanism ensuring comprehensive subject representation across 58 fine-grained categories, 2) an innovative classification scheme categorizing both subject difficulty level and prompt scenario for granular capability assessment, 3) a novel Subject Identity Consistency Score (SICS) metric demonstrating a 9.4% higher correlation with human evaluation compared to existing measures in quantifying subject preservation, and 4) a comprehensive set of diagnostic insights derived from the benchmark, offering critical guidance for optimizing future model training paradigms and data construction strategies. Through an extensive empirical evaluation of 19 leading models, DSH-Bench uncovers previously obscured limitations in current approaches, establishing concrete directions for future research and development.
Chinese Translation
在主题驱动的文本到图像(T2I)生成领域,已取得显著进展,该领域旨在根据用户指令合成描绘目标主题的新图像。然而,评估这些模型仍然是一个重大挑战。现有基准存在关键局限性:1)主题图像的多样性和全面性不足,2)在不同主题难度级别和提示场景中评估模型性能的粒度不足,以及3)缺乏可操作的见解和后续模型改进的诊断指导。为了解决这些局限性,我们提出了DSH-Bench,这是一种综合基准,能够通过四项主要创新实现对主题驱动的T2I模型的系统多角度分析:1)层次分类采样机制,确保在58个细粒度类别中全面代表主题,2)创新的分类方案,将主题难度级别和提示场景进行分类,以便进行细致的能力评估,3)一种新的主题身份一致性评分(Subject Identity Consistency Score,SICS)指标,在量化主题保留方面与人类评估的相关性比现有指标高出9.4%,以及4)从基准中得出的全面诊断见解,为优化未来模型训练范式和数据构建策略提供重要指导。通过对19个领先模型的广泛实证评估,DSH-Bench揭示了当前方法中以前被掩盖的局限性,为未来的研究和开发确立了具体方向。
cs.CV / 211 / 2603.08096

TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization

TrianguLang:一种几何感知的语义共识方法用于无姿态的三维定位
Grant, Bryce, Rothenberg, Aryeh, Banerjee, Atri, Wang, Peng
Abstract
Localizing objects and parts from natural language in 3D space is essential for robotics, AR, and embodied AI, yet existing methods face a trade-off between the accuracy and geometric consistency of per-scene optimization and the efficiency of feed-forward inference. We present TrianguLang, a feed-forward framework for 3D localization that requires no camera calibration at inference. Unlike prior methods that treat views independently, we introduce Geometry-Aware Semantic Attention (GASA), which utilizes predicted geometry to gate cross-view feature correspondence, suppressing semantically plausible but geometrically inconsistent matches without requiring ground-truth poses. Validated on five benchmarks including ScanNet++ and uCO3D, TrianguLang achieves state-of-the-art feed-forward text-guided segmentation and localization, reducing user effort from $O(N)$ clicks to a single text query. The model processes each frame at 1008x1008 resolution in $\sim$57ms ($\sim$18 FPS) without optimization, enabling practical deployment for interactive robotics and AR applications. Code and checkpoints are available at https://cwru-aism.github.io/triangulang/.
Chinese Translation
从自然语言中在三维空间中定位物体和部件对于机器人技术、增强现实(AR)和具身人工智能至关重要,但现有方法在每个场景优化的准确性与几何一致性和前馈推理的效率之间存在权衡。我们提出了TrianguLang,这是一种无须在推理时进行相机校准的前馈三维定位框架。与以往将视图独立处理的方法不同,我们引入了几何感知语义注意力(Geometry-Aware Semantic Attention, GASA),该方法利用预测的几何信息来控制跨视图特征对应,抑制语义上合理但几何上不一致的匹配,而无需真实姿态。经过在包括ScanNet++和uCO3D在内的五个基准测试上的验证,TrianguLang实现了最先进的前馈文本引导分割和定位,将用户的操作从$O(N)$次点击减少到单个文本查询。该模型以1008x1008的分辨率处理每帧,耗时约57毫秒(约18帧每秒),无需优化,能够在交互式机器人和AR应用中实现实际部署。代码和检查点可在https://cwru-aism.github.io/triangulang/获取。
cs.CV / 212 / 2603.08100

Adaptive MLP Pruning for Large Vision Transformers

大型视觉变换器的自适应多层感知器剪枝
Shen, Chengchao
Abstract
Large vision transformers present impressive scalability, as their performance can be well improved with increased model capacity. Nevertheless, their cumbersome parameter counts result in exorbitant computational and memory demands. By analyzing prevalent transformer structures, we find that multilayer perceptron (MLP) modules constitute the largest share of the model's parameters. In this paper, we propose an Adaptive MLP Pruning (AMP) method to substantially reduce the parameters of large vision transformers without obvious performance degradation. First, we adopt a Taylor-based method to evaluate the neuron importance of each MLP. However, the importance computation using one-hot cross-entropy loss ignores the potential predictions on other categories, thus degrading the quality of the evaluated importance scores. To address this issue, we introduce a label-free information entropy criterion to fully model the predictions of the original model for more accurate importance evaluation. Second, we rank the hidden neurons of each MLP by the above importance scores and apply a binary search algorithm to adaptively prune the ranked neurons according to the redundancy of different MLP modules, thereby avoiding a predefined compression ratio. Experimental results on several state-of-the-art large vision transformers, including CLIP and DINOv2, demonstrate that our method achieves roughly 40% parameter and FLOPs reduction in a near-lossless manner. Moreover, when the models are not finetuned after pruning, our method outperforms other pruning methods by a significantly large margin. The source code and trained weights are available at https://github.com/visresearch/AMP.
Chinese Translation
大型视觉变换器展现了令人印象深刻的可扩展性,因为其性能可以通过增加模型容量得到显著提升。然而,它们庞大的参数量导致了高昂的计算和内存需求。通过分析常见的变换器结构,我们发现多层感知器(MLP)模块占据了模型参数的最大份额。本文提出了一种自适应多层感知器剪枝(AMP)方法,以在不明显降低性能的情况下显著减少大型视觉变换器的参数。首先,我们采用基于泰勒的方法来评估MLP中神经元的重要性。然而,使用独热交叉熵损失进行的重要性计算忽略了对其他类别的潜在预测,从而降低了评估重要性分数的质量。为了解决这一问题,我们引入无标签信息熵标准,以充分建模原始模型的预测,从而实现更准确的重要性评估。其次,我们根据上述重要性分数对MLP的隐藏神经元进行排序,并应用二分搜索算法根据不同MLP模块的冗余性自适应地剪枝排序后的神经元,从而避免预定义的压缩比。在多个最先进的大型视觉变换器(包括CLIP和DINOv2)上的实验结果表明,我们的方法在近乎无损的情况下实现了约40%的参数和FLOPs减少。此外,当模型在剪枝后未进行微调时,我们的方法显著优于其他剪枝方法。源代码和训练权重可在https://github.com/visresearch/AMP获取。
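Two of AMP's ingredients are easy to sketch: first-order Taylor saliency per hidden neuron, and a binary search over how many ranked neurons an MLP keeps. The retained-importance-mass criterion below is a simplified stand-in for the paper's redundancy measure:

import torch

def taylor_importance(fc1_weight, fc1_grad):
    # First-order Taylor saliency |w * dL/dw|, summed per hidden neuron
    # (each row of the MLP's first linear layer).
    return (fc1_weight * fc1_grad).abs().sum(dim=1)

def adaptive_keep_count(importance, budget=0.95):
    # Smallest number of top-ranked neurons whose importance mass reaches
    # `budget` of the total, found by binary search over the sorted scores.
    order = importance.sort(descending=True).values
    total = order.sum()
    lo, hi = 1, order.numel()
    while lo < hi:
        mid = (lo + hi) // 2
        if order[:mid].sum() / total >= budget:
            hi = mid
        else:
            lo = mid + 1
    return lo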
cs.CV / 213 / 2603.08113

SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving

SAMoE-VLA:一种用于自动驾驶的场景自适应混合专家视觉-语言-行动模型
You, Zihan, Liu, Hongwei, Dang, Chenxu, Wang, Zhe, Ang, Sining, Wang, Aoqi, Wang, Yan
Abstract
Recent advances in Vision-Language-Action (VLA) models have shown promising capabilities in autonomous driving by leveraging the understanding and reasoning strengths of Large Language Models (LLMs). However, our empirical analysis reveals that directly applying existing token-level MoE mechanisms--which are inherited from LLM architectures--to VLA models results in unstable performance and safety degradation in autonomous driving, highlighting a misalignment between token-based expert specialization and scene-level decision-making. To address this, we propose SAMoE-VLA, a scene-adaptive Vision-Language-Action framework that conditions expert selection on structured scene representations instead of token embeddings. Our key idea is to derive the MoE routing signal from bird's-eye-view (BEV) features that encapsulate traffic scene context, enabling scenario-dependent expert weighting and merging tailored to distinct driving conditions. Furthermore, to support temporally consistent reasoning across world-knowledge, perception, language, and action, we introduce a Conditional Cross-Modal Causal Attention mechanism that integrates world state, linguistic intent, and action history into a unified causal reasoning process. Extensive experiments on the nuScenes open-loop planning dataset and the LangAuto closed-loop benchmark demonstrate that SAMoE-VLA achieves state-of-the-art performance, outperforming prior VLA-based and world-model-based approaches with fewer parameters. Our code will be released soon.
Chinese Translation
最近在视觉-语言-行动(VLA)模型方面的进展显示出在自动驾驶中利用大型语言模型(LLMs)的理解和推理能力的良好前景。然而,我们的实证分析表明,直接将现有的基于token的混合专家(MoE)机制——这些机制源自LLM架构——应用于VLA模型会导致性能不稳定和安全性下降,这突显了基于token的专家专业化与场景级决策之间的不匹配。为了解决这个问题,我们提出了SAMoE-VLA,一种场景自适应的视觉-语言-行动框架,该框架基于结构化场景表示而非token嵌入来决定专家选择。我们的关键思想是从鸟瞰图(BEV)特征中推导MoE路由信号,这些特征封装了交通场景上下文,使得专家加权和合并能够根据不同的驾驶条件进行场景依赖性调整。此外,为了支持跨世界知识、感知、语言和行动的时间一致推理,我们引入了一种条件交叉模态因果注意机制,将世界状态、语言意图和行动历史整合到一个统一的因果推理过程中。在nuScenes开放循环规划数据集和LangAuto闭环基准上的广泛实验表明,SAMoE-VLA实现了最先进的性能,超越了之前基于VLA和世界模型的方法,同时参数更少。我们的代码将很快发布。
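The routing change is scene-level rather than token-level: pool BEV features into one scene descriptor and emit a single softmax weighting over experts for the whole scene. A minimal PyTorch sketch under that reading (the pooling, dimensions, and merge-by-weighted-sum are illustrative):

import torch
import torch.nn as nn

class SceneRouter(nn.Module):
    def __init__(self, bev_dim=256, n_experts=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(bev_dim, bev_dim), nn.GELU(),
                                 nn.Linear(bev_dim, n_experts))

    def forward(self, bev_feats, expert_outputs):
        # bev_feats: (B, C, H, W); expert_outputs: (E, B, T, D).
        scene = bev_feats.mean(dim=(2, 3))           # one descriptor per scene
        w = self.mlp(scene).softmax(dim=-1)          # (B, E) scene-level routing
        return torch.einsum('be,ebtd->btd', w, expert_outputs)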
cs.CV / 214 / 2603.08126

Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows

Foley-Flow:基于掩蔽音视频对齐和动态条件流的协调视频到音频生成
Mo, Shentong, Song, Yibing
Abstract
Coordinated audio generation based on video inputs typically requires a strict audio-visual (AV) alignment, where both the semantics and rhythmics of the generated audio segments must correspond to those in the video frames. Previous studies leverage a two-stage design where the AV encoders are first aligned via contrastive learning, then the encoded video representations guide the audio generation process. We observe that both contrastive learning and global video guidance are effective in aligning overall AV semantics while limiting temporally rhythmic synchronization. In this work, we propose FoleyFlow to first align unimodal AV encoders via masked modeling training, where the masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders, which are separately pretrained using only unimodal data, are aligned with semantic and rhythmic consistency. Then, we develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow utilizes temporally varying video features as the dynamic condition to guide corresponding audio segment generations. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment, and use this representation of video segments to guide audio generation temporally. Our audio results are evaluated on standard benchmarks and largely surpass existing results under several metrics. The superior performance indicates that FoleyFlow is effective in generating coordinated audios that are both semantically and rhythmically coherent with various video sequences.
Chinese Translation
基于视频输入的协调音频生成通常需要严格的音视频(AV)对齐,其中生成音频片段的语义和节奏应与视频帧中的内容相对应。以往的研究利用了两阶段设计,首先通过对比学习对AV编码器进行对齐,然后编码的视频表示指导音频生成过程。我们观察到,对比学习和全局视频指导在对齐整体AV语义方面是有效的,但在时间节奏同步方面存在限制。在本研究中,我们提出了FoleyFlow,首先通过掩蔽建模训练对单模态AV编码器进行对齐,其中掩蔽的音频片段在相应视频片段的指导下被恢复。训练后,分别仅使用单模态数据预训练的AV编码器在语义和节奏一致性上得到了对齐。接着,我们开发了一种动态条件流用于最终的音频生成。基于高效的速度流生成框架,我们的动态条件流利用时间变化的视频特征作为动态条件,以指导相应音频片段的生成。为此,我们在掩蔽AV对齐过程中提取一致的语义和节奏表示,并利用这些视频片段的表示在时间上指导音频生成。我们的音频结果在标准基准上进行了评估,并在多个指标下大幅超越了现有结果。卓越的性能表明,FoleyFlow在生成与各种视频序列在语义和节奏上都一致的协调音频方面是有效的。
cs.CV / 215 / 2603.08133

Fast Low-light Enhancement and Deblurring for 3D Dark Scenes

快速低光增强与去模糊处理的三维暗场景
Zhang, Feng, Wang, Jinglong, Li, Ze, Zhou, Yanghong, Chen, Yang, Chen, Lei, Zhu, Xiatian
Abstract
Novel view synthesis from low-light, noisy, and motion-blurred imagery remains a valuable and challenging task. Current volumetric rendering methods struggle with compound degradation, and sequential 2D preprocessing introduces artifacts due to interdependencies. In this work, we introduce FLED-GS, a fast low-light enhancement and deblurring framework that reformulates 3D scene restoration as an alternating cycle of enhancement and reconstruction. Specifically, FLED-GS inserts several intermediate brightness anchors to enable progressive recovery, preventing noise blow-up from harming deblurring or geometry. Each iteration sharpens inputs with an off-the-shelf 2D deblurrer and then performs noise-aware 3DGS reconstruction that estimates and suppresses noise while producing clean priors for the next level. Experiments show FLED-GS outperforms state-of-the-art LuSh-NeRF, achieving 21$\times$ faster training and 11$\times$ faster rendering.
Chinese Translation
从低光、噪声和运动模糊图像中合成新视角仍然是一项有价值且具有挑战性的任务。目前的体积渲染方法在复合退化方面表现不佳,而顺序的二维预处理由于相互依赖性引入了伪影。在本研究中,我们提出了FLED-GS,一个快速的低光增强与去模糊框架,将三维场景恢复重新表述为增强与重建的交替循环。具体而言,FLED-GS插入了多个中间亮度锚点,以实现渐进式恢复,防止噪声膨胀对去模糊或几何形状造成损害。每次迭代使用现成的二维去模糊器对输入进行锐化,然后执行噪声感知的3DGS重建,估计并抑制噪声,同时为下一个级别生成干净的先验。实验表明,FLED-GS的性能超过了最先进的LuSh-NeRF,训练速度提高了21倍,渲染速度提高了11倍。
cs.CV / 216 / 2603.08135

VesselFusion: Diffusion Models for Vessel Centerline Extraction from 3D CT Images

VesselFusion:用于从3D CT图像中提取血管中心线的扩散模型
Mita, Soichi, Takezaki, Shumpei, Bise, Ryoma
Abstract
Vessel centerline extraction from 3D CT images is an important task because it reduces the annotation effort needed to build a model that estimates vessel structure. Estimating natural vessel structures is challenging since conventional approaches are deterministic models, which cannot capture the complexity of human anatomy. In this study, we propose VesselFusion, a diffusion model that extracts vessel centerlines from 3D CT images. The proposed method uses a coarse-to-fine representation of the centerline and a voting-based aggregation for natural and stable extraction. VesselFusion was evaluated on a publicly available CT image dataset and achieved higher extraction accuracy and more natural results than conventional approaches.
Chinese Translation
从3D CT图像中提取血管中心线是一项重要任务,因为它减少了构建估计血管结构模型的标注工作量。由于传统方法是确定性模型,无法捕捉复杂的人体结构,因此估计自然血管结构具有挑战性。在本研究中,我们提出了VesselFusion,这是一种用于从3D CT图像中提取血管中心线的扩散模型。所提出的方法采用了中心线的粗到细表示和基于投票的聚合,以实现自然且稳定的提取。VesselFusion在一个公开可用的CT图像数据集上进行了评估,取得了比传统方法更高的提取精度和更自然的结果。
cs.CV / 217 / 2603.08147

MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data

MV-Fashion:基于多视角配对数据实现虚拟试穿和尺码估计
Laczkó, Hunor, Jia, Libang, Truong, Loc-Phat, Hernández, Diego, Escalera, Sergio, Gonzalez, Jordi, Madadi, Meysam
Abstract
Existing 4D human datasets fall short for fashion-specific research, lacking either realistic garment dynamics or task-specific annotations. Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. MV-Fashion features 3,273 sequences (72.5 million frames) from 80 diverse subjects wearing 3-10 outfits each. It is designed to capture complex, real-world garment dynamics, including multiple layers and varied styling (e.g. rolled sleeves, tucked shirt). A core contribution is a rich data representation that includes pixel-level semantic annotations, ground-truth material properties like elasticity, and 3D point clouds. Crucially for VTON applications, MV-Fashion provides paired data: multi-view synchronized captures of worn garments alongside their corresponding flat, catalogue images. We leverage this dataset to establish baselines for fashion-centric tasks, including virtual try-on, clothing size estimation, and novel view synthesis. The dataset is available at https://hunorlaczko.github.io/MV-Fashion .
Chinese Translation
现有的4D人类数据集在时尚特定研究中存在不足,缺乏真实的服装动态或任务特定的注释。合成数据集存在现实性差距,而现实世界的捕捉缺乏虚拟试穿(VTON)和尺码估计任务所需的详细注释和配对数据。为了解决这一问题,我们提出了MV-Fashion,这是一个为领域特定时尚分析而设计的大规模多视角视频数据集。MV-Fashion包含来自80个不同受试者的3,273个序列(7250万帧),每个受试者穿着3到10套服装。该数据集旨在捕捉复杂的现实世界服装动态,包括多层次和多样化的造型(例如,卷袖、扎衬衫)。其核心贡献是丰富的数据表示,包括像素级语义注释、真实材料属性(如弹性)和3D点云。对于VTON应用而言,MV-Fashion提供了配对数据:穿着服装的多视角同步捕捉以及对应的平面目录图像。我们利用该数据集为时尚相关任务建立基线,包括虚拟试穿、服装尺码估计和新视图合成。该数据集可在 https://hunorlaczko.github.io/MV-Fashion 获取。
cs.CV / 218 / 2603.08150

Edged USLAM: Edge-Aware Event-Based SLAM with Learning-Based Depth Priors

Edged USLAM:基于学习深度先验的边缘感知事件SLAM
Sarıözkan, Şebnem, Şahin, Hürkan, Álvarez-Tuñón, Olaya, Kayacan, Erdal
Abstract
Conventional visual simultaneous localization and mapping (SLAM) algorithms often fail under rapid motion, low illumination, or abrupt lighting transitions due to motion blur and limited dynamic range. Event cameras mitigate these issues with high temporal resolution and high dynamic range (HDR), but their sparse, asynchronous outputs complicate feature extraction and integration with other sensors; e.g. inertial measurement units (IMUs) and standard cameras. We present Edged USLAM, a hybrid visual-inertial system that extends Ultimate SLAM (USLAM) with an edge-aware front-end and a lightweight depth module. The frontend enhances event frames for robust feature tracking and nonlinear motion compensation, while the depth module provides coarse, region-of-interest (ROI)-based scene depth to improve motion compensation and scale consistency. Evaluations across public benchmarks and real-world unmanned air vehicle (UAV) flights demonstrate that performance varies significantly by scenario. For instance, event-only methods like point-line event-based visual-inertial odometry (PL-EVIO) or learning-based pipelines such as deep event-based visual odometry (DEVO) excel in highly aggressive or extreme HDR conditions. In contrast, Edged USLAM provides superior stability and minimal drift in slow or structured trajectories, ensuring consistently accurate localization on real flights under challenging illumination. These findings highlight the complementary strengths of event-only, learning-based, and hybrid approaches, while positioning Edged USLAM as a robust solution for diverse aerial navigation tasks.
Chinese Translation
传统的视觉同时定位与地图构建(SLAM)算法在快速运动、低光照或突变光照条件下常常失效,原因在于运动模糊和有限的动态范围。事件相机通过提供高时间分辨率和高动态范围(HDR)来缓解这些问题,但其稀疏和异步的输出使得特征提取和与其他传感器(例如惯性测量单元(IMU)和标准相机)的集成变得复杂。我们提出了边缘感知的USLAM(Edged USLAM),这是一种混合视觉-惯性系统,扩展了终极SLAM(Ultimate SLAM,USLAM),并配备了边缘感知前端和轻量级深度模块。前端增强了事件帧,以实现稳健的特征跟踪和非线性运动补偿,而深度模块则提供粗略的基于感兴趣区域(ROI)的场景深度,以改善运动补偿和尺度一致性。在公共基准测试和真实无人机(UAV)飞行中的评估表明,性能因场景而异。例如,像基于点线的事件视觉-惯性里程计(PL-EVIO)或基于学习的管道(如深度基于事件的视觉里程计(DEVO))等仅基于事件的方法在高度激进或极端HDR条件下表现优异。相比之下,边缘感知的USLAM在缓慢或结构化轨迹中提供了更好的稳定性和最小的漂移,确保在具有挑战性的光照条件下的真实飞行中实现一致的准确定位。这些发现突显了仅基于事件的方法、基于学习的方法和混合方法的互补优势,同时将边缘感知的USLAM定位为多样化空中导航任务的稳健解决方案。
cs.CV / 219 / 2603.08174

MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals

MERLIN:构建低信噪比鲁棒的多模态大语言模型用于电磁信号
Shen, Junyu, She, Zhendong, Zhang, Chenghanyu, Sun, Yuchuang, Luo, Luqing, Tan, Dingwei, Guo, Zonghao, Guo, Bo, Han, Zehua, Xie, Wupeng, Mu, Yaxin, Zhang, Peng, Li, Peipei, Wang, Fengxiang, Sun, Yangang, Sun, Maosong
Abstract
The paradigm of Multimodal Large Language Models (MLLMs) offers a promising blueprint for advancing the electromagnetic (EM) domain. However, prevailing approaches often deviate from the native MLLM paradigm, instead using task-specific or pipelined architectures that lead to fundamental limitations in model performance and generalization. Fully realizing the MLLM potential in the EM domain requires overcoming three main challenges: (1) Data. The scarcity of high-quality datasets with paired EM signals and descriptive text annotations for MLLM pre-training; (2) Benchmark. The absence of comprehensive benchmarks to systematically evaluate and compare the performance of models on EM signal-to-text tasks; (3) Model. A critical fragility in low Signal-to-Noise Ratio (SNR) environments, where critical signal features can be obscured, leading to significant performance degradation. To address these challenges, we introduce a tripartite contribution to establish a foundation for MLLMs in the EM domain. First, to overcome data scarcity, we construct and release EM-100k, a large-scale dataset comprising over 100,000 EM signal-text pairs. Second, to enable rigorous and standardized evaluation, we propose EM-Bench, the most comprehensive benchmark featuring diverse downstream tasks spanning from perception to reasoning. Finally, to tackle the core modeling challenge, we present MERLIN, a novel training framework designed not only to align low-level signal representations with high-level semantic text, but also to explicitly enhance model robustness and performance in challenging low-SNR environments. Comprehensive experiments validate our method, showing that MERLIN achieves state-of-the-art results on EM-Bench and exhibits remarkable robustness in low-SNR settings.
Chinese Translation
多模态大语言模型(MLLMs)的范式为推进电磁(EM)领域提供了一个有前景的蓝图。然而,现有的方法往往偏离了原生的MLLM范式,而是采用特定任务或管道架构,导致模型性能和泛化能力的根本限制。要充分实现EM领域中MLLM的潜力,需要克服三个主要挑战:(1)数据。用于MLLM预训练的配对电磁信号和描述性文本注释的高质量数据集稀缺;(2)基准。缺乏全面的基准来系统地评估和比较模型在电磁信号到文本任务上的表现;(3)模型。在低信噪比(SNR)环境中存在的关键脆弱性,重要信号特征可能被遮蔽,导致显著的性能下降。为了解决这些挑战,我们提出了三方面的贡献,以建立EM领域中MLLM的基础。首先,为了克服数据稀缺,我们构建并发布了EM-100k,这是一个包含超过100,000个电磁信号-文本对的大规模数据集。其次,为了实现严格和标准化的评估,我们提出了EM-Bench,这是最全面的基准,涵盖了从感知到推理的多样化下游任务。最后,为了应对核心建模挑战,我们提出了MERLIN,这是一种新颖的训练框架,旨在不仅将低级信号表示与高级语义文本对齐,还明确增强模型在挑战性低SNR环境中的鲁棒性和性能。全面的实验验证了我们的方法,表明MERLIN在EM-Bench中处于最先进水平,并在低SNR环境中表现出显著的鲁棒性。
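For intuition on the low-SNR regime that EM-Bench and MERLIN target, here is the standard way to corrupt a signal at a prescribed SNR; this is generic signal-processing bookkeeping, not MERLIN's training code:

import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Add white Gaussian noise so the result has the requested SNR.

    SNR(dB) = 10 * log10(P_signal / P_noise), so the noise power is
    P_signal / 10^(snr_db / 10).
    """
    rng = rng or np.random.default_rng()
    p_signal = np.mean(np.abs(signal) ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(scale=np.sqrt(p_noise), size=signal.shape)
    return signal + noise

t = np.linspace(0, 1, 1000)
clean = np.sin(2 * np.pi * 50 * t)
noisy = add_noise_at_snr(clean, snr_db=-5)     # challenging low-SNR case
print("empirical SNR check:",
      10 * np.log10(np.mean(clean**2) / np.mean((noisy - clean)**2)))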
cs.CV / 220 / 2603.08180

ALOOD: Exploiting Language Representations for LiDAR-based Out-of-Distribution Object Detection

ALOOD:利用语言表示进行基于LiDAR的分布外物体检测
Kösel, Michael, Schreiber, Marcel, Ulrich, Michael, Gläser, Claudius, Dietmayer, Klaus
Abstract
LiDAR-based 3D object detection plays a critical role for reliable and safe autonomous driving systems. However, existing detectors often produce overly confident predictions for objects not belonging to known categories, posing significant safety risks. This is caused by so-called out-of-distribution (OOD) objects, which were not part of the training data, resulting in incorrect predictions. To address this challenge, we propose ALOOD (Aligned LiDAR representations for Out-Of-Distribution Detection), a novel approach that incorporates language representations from a vision-language model (VLM). By aligning the object features from the object detector to the feature space of the VLM, we can treat the detection of OOD objects as a zero-shot classification task. We demonstrate competitive performance on the nuScenes OOD benchmark, establishing a novel approach to OOD object detection in LiDAR using language representations. The source code is available at https://github.com/uulm-mrm/mmood3d.
Chinese Translation
基于LiDAR的3D物体检测在可靠和安全的自动驾驶系统中扮演着至关重要的角色。然而,现有的检测器往往对不属于已知类别的物体产生过于自信的预测,这带来了显著的安全风险。这是由于所谓的分布外(OOD)物体造成的,这些物体并未包含在训练数据中,导致了错误的预测。为了解决这一挑战,我们提出了ALOOD(Aligned LiDAR representations for Out-Of-Distribution Detection),一种新颖的方法,结合了来自视觉-语言模型(VLM)的语言表示。通过将物体检测器的特征与VLM的特征空间对齐,我们可以将OOD物体的检测视为一个零样本分类任务。我们在nuScenes OOD基准测试中展示了具有竞争力的性能,确立了一种利用语言表示进行LiDAR中OOD物体检测的新方法。源代码可在 https://github.com/uulm-mrm/mmood3d 获取。
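The zero-shot formulation in ALOOD can be sketched as cosine matching between VLM-aligned object features and known-class text embeddings, with a low maximum similarity flagging an OOD object; the feature dimensions and threshold below are illustrative assumptions:

import numpy as np

def ood_scores(object_feats, text_embeds):
    """Score detections against known-class text embeddings.

    object_feats: (N, D) detector features aligned to the VLM space.
    text_embeds:  (C, D) embeddings of known-class prompts.
    Returns max cosine similarity per object; low values suggest OOD.
    """
    o = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = o @ t.T                      # (N, C) cosine similarities
    return sims.max(axis=1)

rng = np.random.default_rng(1)
feats, prompts = rng.normal(size=(5, 512)), rng.normal(size=(10, 512))
scores = ood_scores(feats, prompts)
print("flagged as OOD:", scores < 0.2)  # threshold chosen for illustration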
cs.CV / 221 / 2603.08199

Fusion-Poly: A Polyhedral Framework Based on Spatial-Temporal Fusion for 3D Multi-Object Tracking

Fusion-Poly:基于时空融合的多面体框架用于3D多目标跟踪
Wu, Xian, Wu, Yitao, Li, Xiaoyu, Li, Zijia, Zhao, Lijun, Sun, Lining
Abstract
LiDAR-camera 3D multi-object tracking (MOT) combines rich visual semantics with accurate depth cues to improve trajectory consistency and tracking reliability. In practice, however, LiDAR and cameras operate at different sampling rates. To maintain temporal alignment, existing data pipelines usually synchronize heterogeneous sensor streams and annotate them at a reduced shared frequency, forcing most prior methods to perform spatial fusion only at synchronized timestamps through projection-based or learnable cross-sensor association. As a result, abundant asynchronous observations remain underexploited, despite their potential to support more frequent association and more robust trajectory estimation over short temporal intervals. To address this limitation, we propose Fusion-Poly, a spatial-temporal fusion framework for 3D MOT that integrates asynchronous LiDAR and camera data. Fusion-Poly associates trajectories with multi-modal observations at synchronized timestamps and with single-modal observations at asynchronous timestamps, enabling higher-frequency updates of motion and existence states. The framework contains three key components: a frequency-aware cascade matching module that adapts to synchronized and asynchronous frames according to available detection modalities; a frequency-aware trajectory estimation module that maintains trajectories through high-frequency motion prediction, differential updates, and confidence-calibrated lifecycle management; and a full-state observation alignment module that improves cross-modal consistency at synchronized timestamps by optimizing image-projection errors. On the nuScenes test set, Fusion-Poly achieves 76.5% AMOTA, establishing a new state of the art among tracking-by-detection 3D MOT methods. Extensive ablation studies further validate the effectiveness of each component. Code will be released.
Chinese Translation
LiDAR-相机3D多目标跟踪(MOT)结合了丰富的视觉语义和准确的深度线索,以提高轨迹一致性和跟踪可靠性。然而,在实际应用中,LiDAR和相机的采样率不同。为了保持时间对齐,现有的数据处理流程通常同步异构传感器流,并以降低的共享频率进行标注,迫使大多数先前的方法仅在同步时间戳下通过基于投影或可学习的跨传感器关联进行空间融合。因此,尽管丰富的异步观测具有支持更频繁关联和在短时间间隔内更稳健轨迹估计的潜力,但仍未得到充分利用。为了解决这一限制,我们提出了Fusion-Poly,一个用于3D MOT的时空融合框架,集成了异步的LiDAR和相机数据。Fusion-Poly在同步时间戳下将轨迹与多模态观测关联,并在异步时间戳下与单模态观测关联,从而实现运动和存在状态的高频更新。该框架包含三个关键组件:一个频率感知的级联匹配模块,根据可用的检测模态适应同步和异步帧;一个频率感知的轨迹估计模块,通过高频运动预测、差分更新和置信度校准的生命周期管理来维护轨迹;以及一个全状态观测对齐模块,通过优化图像投影误差来提高同步时间戳下的跨模态一致性。在nuScenes测试集上,Fusion-Poly实现了76.5%的AMOTA,在基于检测的跟踪(tracking-by-detection)3D MOT方法中确立了新的最先进水平。大量消融研究进一步验证了每个组件的有效性。代码将会发布。
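The higher-frequency update idea behind Fusion-Poly can be illustrated with a state that is propagated at every (possibly asynchronous) timestamp and corrected whenever any modality fires; the 1-D constant-velocity Kalman filter below is a generic stand-in, not the paper's estimator:

import numpy as np

class AsyncTrack:
    """1-D constant-velocity track updated at irregular timestamps."""

    def __init__(self, x0, t0):
        self.x = np.array([x0, 0.0])           # position, velocity
        self.P = np.eye(2)
        self.t = t0

    def step(self, t, z=None, r=0.5, q=0.1):
        dt = t - self.t
        F = np.array([[1.0, dt], [0.0, 1.0]])  # predict to time t
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + q * np.eye(2)
        if z is not None:                      # correct if a sensor fired
            H = np.array([[1.0, 0.0]])
            S = H @ self.P @ H.T + r
            K = self.P @ H.T / S
            self.x = self.x + (K * (z - H @ self.x)).ravel()
            self.P = (np.eye(2) - K @ H) @ self.P
        self.t = t
        return self.x[0]

track = AsyncTrack(x0=0.0, t0=0.0)
# Synchronized and asynchronous observations arrive at irregular times;
# z=None marks a timestamp with prediction only (no detection).
for t, z in [(0.05, 0.6), (0.10, 1.1), (0.13, None), (0.20, 2.2)]:
    print(f"t={t:.2f}s -> position {track.step(t, z):.2f}")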
cs.CV / 222 / 2603.08202

MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data

MM-TS:用于长尾数据对比学习的多模态温度和边际调度
Sheludzko, Siarhei, Duka, Dhimitrios, Schiele, Bernt, Kuehne, Hilde, Kukleva, Anna
Abstract
Contrastive learning has become a fundamental approach in both uni-modal and multi-modal frameworks. This learning paradigm pulls positive pairs of samples closer while pushing negatives apart. In the uni-modal setting (e.g., image-based learning), previous research has shown that the strength of these forces can be controlled through the temperature parameter. In this work, we propose Multi-Modal Temperature and Margin Schedules (MM-TS), extending the concept of uni-modal temperature scheduling to multi-modal contrastive learning. Our method dynamically adjusts the temperature in the contrastive loss during training, modulating the attraction and repulsion forces in the multi-modal setting. Additionally, recognizing that standard multi-modal datasets often follow imbalanced, long-tail distributions, we adapt the temperature based on the local distribution of each training sample. Specifically, samples from dense clusters are assigned a higher temperature to better preserve their semantic structure. Furthermore, we demonstrate that temperature scheduling can be effectively integrated within a max-margin framework, thereby unifying the two predominant approaches in multi-modal contrastive learning: InfoNCE loss and max-margin objective. We evaluate our approach on four widely used image- and video-language datasets, Flickr30K, MSCOCO, EPIC-KITCHENS-100, and YouCook2, and show that our dynamic temperature and margin schedules improve performance and lead to new state-of-the-art results in the field.
Chinese Translation
对比学习已成为单模态和多模态框架中的一种基本方法。这种学习范式将正样本对拉近,同时将负样本推远。在单模态设置中(例如,基于图像的学习),先前的研究表明,这些力量的强度可以通过温度参数进行控制。在本研究中,我们提出了多模态温度和边际调度(MM-TS),将单模态温度调度的概念扩展到多模态对比学习。我们的方法在训练过程中动态调整对比损失中的温度,调节多模态设置中的吸引和排斥力量。此外,考虑到标准多模态数据集通常遵循不平衡的长尾分布,我们根据每个训练样本的局部分布调整温度。具体而言,来自密集簇的样本被分配更高的温度,以更好地保留其语义结构。此外,我们证明了温度调度可以有效地融入最大边际框架,从而统一多模态对比学习中的两种主要方法:InfoNCE损失和最大边际目标。我们在四个广泛使用的图像和视频语言数据集上评估了我们的方法,包括Flickr30K、MSCOCO、EPIC-KITCHENS-100和YouCook2,并显示我们的动态温度和边际调度提高了性能,并在该领域达到了新的最先进结果。
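A toy sketch of a density-adaptive temperature inside InfoNCE, in the spirit of MM-TS: denser samples receive a higher temperature so their semantic structure is preserved. The linear schedule, density proxy, and constants are illustrative assumptions:

import numpy as np

def infonce_with_adaptive_tau(sim, density, tau_min=0.03, tau_max=0.1):
    """InfoNCE where denser samples get a higher temperature.

    sim:     (B, B) image-text similarity matrix; diagonal = positives.
    density: (B,) local-density estimates in [0, 1] per sample.
    A higher temperature softens repulsion, preserving the semantic
    structure of dense clusters, as motivated above.
    """
    tau = tau_min + (tau_max - tau_min) * density        # (B,)
    logits = sim / tau[:, None]
    logits -= logits.max(axis=1, keepdims=True)          # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(2)
sim = rng.normal(size=(4, 4)) * 0.1 + np.eye(4)          # toy similarities
density = np.array([0.9, 0.2, 0.5, 0.7])
print("loss:", infonce_with_adaptive_tau(sim, density))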
cs.CV / 223 / 2603.08208

Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors

考虑对齐和可靠性门控的多模态融合用于异构热视觉传感器下无人机检测
Jahan, Ishrat, Majid, Molla E, Murugappan, M, Chowdhury, Muhammad E. H., Prakash, N. B., Kashem, Saad Bin Abul, Balusamy, Balamurugan, Khandakar, Amith
Abstract
Reliable unmanned aerial vehicle (UAV) detection is critical for autonomous airspace monitoring but remains challenging when integrating sensor streams that differ substantially in resolution, perspective, and field of view. Conventional fusion methods -- such as wavelet-, Laplacian-, and decision-level approaches -- often fail to preserve spatial correspondence across modalities and suffer from annotation inconsistencies, limiting their robustness in real-world settings. This study introduces two fusion strategies, Registration-aware Guided Image Fusion (RGIF) and Reliability-Gated Modality-Attention Fusion (RGMAF), designed to overcome these limitations. RGIF employs Enhanced Correlation Coefficient (ECC)-based affine registration combined with guided filtering to maintain thermal saliency while enhancing structural detail. RGMAF integrates affine and optical-flow registration with a reliability-weighted attention mechanism that adaptively balances thermal contrast and visual sharpness. Experiments were conducted on the Multi-Sensor and Multi-View Fixed-Wing (MMFW)-UAV dataset comprising 147,417 annotated air-to-air frames collected from infrared, wide-angle, and zoom sensors. Among single-modality detectors, YOLOv10x demonstrated the most stable cross-domain performance and was selected as the detection backbone for evaluating fused imagery. RGIF improved the visual baseline by 2.13% mAP@50 (achieving 97.65%), while RGMAF attained the highest recall of 98.64%. These findings show that registration-aware and reliability-adaptive fusion provides a robust framework for integrating heterogeneous modalities, substantially enhancing UAV detection performance in multimodal environments.
Chinese Translation
可靠的无人机(UAV)检测对于自主空域监测至关重要,但在整合分辨率、视角和视野差异显著的传感器流时仍然面临挑战。传统的融合方法,如小波、拉普拉斯和决策级方法,往往无法保持模态间的空间对应关系,并且存在标注不一致的问题,这限制了它们在实际环境中的鲁棒性。本研究提出了两种融合策略:考虑配准的引导图像融合(Registration-aware Guided Image Fusion, RGIF)和可靠性门控模态注意力融合(Reliability-Gated Modality-Attention Fusion, RGMAF),旨在克服这些局限性。RGIF采用基于增强相关系数(Enhanced Correlation Coefficient, ECC)的仿射配准,结合引导滤波,以保持热显著性并增强结构细节。RGMAF则将仿射和光流配准与可靠性加权注意力机制相结合,自适应地平衡热对比度和视觉清晰度。实验在包含147,417帧注释的空对空图像的多传感器和多视角固定翼(Multi-Sensor and Multi-View Fixed-Wing, MMFW)-UAV数据集上进行。单模态检测器中,YOLOv10x表现出最稳定的跨域性能,并被选为评估融合图像的检测骨干。RGIF将视觉基线提高了2.13%的mAP@50(达到了97.65%),而RGMAF则达到了最高的召回率98.64%。这些发现表明,考虑对齐和可靠性自适应的融合为整合异构模态提供了一个稳健的框架,显著提升了多模态环境下的无人机检测性能。
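The reliability gating in RGMAF can be caricatured as a softmax over per-modality reliability scores; below, thermal reliability is proxied by global contrast and visual reliability by gradient energy. Both proxies are our own simplifications, not the paper's learned attention:

import numpy as np

def reliability_weights(thermal, visual):
    """Compute per-image fusion weights from simple reliability proxies.

    thermal, visual: float images in [0, 1]. Thermal reliability is
    proxied by global contrast (std), visual reliability by gradient
    energy (sharpness); a softmax turns them into fusion weights.
    """
    r_t = thermal.std()
    gy, gx = np.gradient(visual)
    r_v = np.sqrt(gx**2 + gy**2).mean()
    scores = np.array([r_t, r_v])
    w = np.exp(scores) / np.exp(scores).sum()
    return w  # [w_thermal, w_visual]

rng = np.random.default_rng(3)
thermal = rng.random((64, 64)) * 0.9          # high-contrast thermal
visual = np.full((64, 64), 0.5)               # flat, blurred visual
w_t, w_v = reliability_weights(thermal, visual)
fused = w_t * thermal + w_v * visual
print(f"weights: thermal={w_t:.2f}, visual={w_v:.2f}")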
cs.CV / 224 / 2603.08210

Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA

Video2LoRA:通过每个参考视频的 LoRA 实现统一的语义控制视频生成
Wu, Zexi, Wang, Qinghe, Dai, Jing, Li, Baolu, Zhang, Yiming, Ma, Yue, Jia, Xu, Xu, Hongming
Abstract
Achieving semantic alignment across diverse video generation conditions remains a significant challenge. Methods that rely on explicit structural guidance often enforce rigid spatial constraints that limit semantic flexibility, whereas models tailored for individual control types lack interoperability and adaptability. These design bottlenecks hinder progress toward flexible and efficient semantic video generation. To address this, we propose Video2LoRA, a scalable and generalizable framework for semantic-controlled video generation that conditions on a reference video. Video2LoRA employs a lightweight hypernetwork to predict personalized LoRA weights for each semantic input, which are combined with auxiliary matrices to form adaptive LoRA modules integrated into a frozen diffusion backbone. This design enables the model to generate videos consistent with the reference semantics while preserving key style and content variations, eliminating the need for any per-condition training. Notably, the final model weighs less than 150 MB, making it highly efficient for storage and deployment. Video2LoRA achieves coherent, semantically aligned generation across diverse conditions and exhibits strong zero-shot generalization to unseen semantics.
Chinese Translation
在多样化的视频生成条件下实现语义对齐仍然是一个重大挑战。依赖于显式结构指导的方法往往施加严格的空间约束,从而限制了语义的灵活性,而针对个别控制类型定制的模型则缺乏互操作性和适应性。这些设计瓶颈阻碍了灵活高效的语义视频生成的进展。为了解决这个问题,我们提出了 Video2LoRA,这是一个可扩展且具有普适性的语义控制视频生成框架,基于参考视频进行条件生成。Video2LoRA 采用轻量级超网络为每个语义输入预测个性化的 LoRA 权重,并将其与辅助矩阵结合,形成集成到冻结扩散主干中的自适应 LoRA 模块。该设计使模型能够生成与参考语义一致的视频,同时保留关键的风格和内容变化,消除了对任何条件训练的需求。值得注意的是,最终模型的大小不到 150MB,使其在存储和部署方面具有高度的效率。Video2LoRA 在多样化条件下实现了连贯的、语义对齐的生成,并展现出对未见语义的强大零样本泛化能力。
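A minimal sketch of the per-reference LoRA idea: a tiny hypernetwork maps a semantic embedding to low-rank factors that are added to a frozen weight as W + BA. The shapes, rank, and single linear hypernetwork are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(4)
d_in, d_out, rank, d_sem = 64, 64, 4, 32

W_frozen = rng.normal(size=(d_out, d_in))          # frozen backbone weight
# Hypernetwork: one linear map from the semantic embedding to the
# flattened low-rank factors A (rank x d_in) and B (d_out x rank).
H = rng.normal(scale=0.01, size=(rank * d_in + d_out * rank, d_sem))

def lora_weight(sem_embed, alpha=1.0):
    """Predict a personalized LoRA delta and return the adapted weight."""
    params = H @ sem_embed
    A = params[: rank * d_in].reshape(rank, d_in)
    B = params[rank * d_in :].reshape(d_out, rank)
    return W_frozen + alpha * (B @ A)              # W + BA, frozen W intact

sem = rng.normal(size=d_sem)            # embedding of one reference video
W_adapted = lora_weight(sem)
print("delta norm:", np.linalg.norm(W_adapted - W_frozen))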
cs.CV / 225 / 2603.08224

SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval

SAVE:面向语音的视频表征学习用于视频-文本检索
Zhao, Ruixiang, Xu, Zhihao, Lan, Bangxiang, Xin, Zijie, Liu, Jingyu, Li, Xirong
Abstract
For video-text retrieval, the use of CLIP has been a de facto choice. Since CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the soundtrack of videos. While several attempts have been made to reintroduce audio -- typically by incorporating an audio encoder and fusing its output with visual features -- these methods face two challenges: ineffective representation of speech content and suboptimal vision-audio fusion. To address these issues jointly, we propose SAVE, a Speech Aware Video rEpresentation learning method. SAVE improves upon AVIGATE, a SOTA audiovisual method, with a dedicated speech branch for more effective speech embedding. Furthermore, we introduce soft-ALBEF for early vision-audio alignment that facilitates fusion. Extensive experiments on five benchmarks show that SAVE compares favorably against the SOTA, outperforming AVIGATE by +4.1% on MSRVTT-9k, +1.9% on MSRVTT-7k, +2.5% on VATEX, +9.8% on Charades, and +2.1% on LSMDC, in terms of the SumR metric.
Chinese Translation
在视频-文本检索中,CLIP 已成为事实上的选择。由于 CLIP 仅提供图像和文本编码器,这一共识导致了一种偏向的范式,完全忽视了视频的音轨。虽然已经有几个尝试重新引入音频——通常通过结合音频编码器并将其输出与视觉特征融合——但这些方法面临两个挑战:语音内容的表示效果不佳以及视听融合的次优表现。为了解决这些问题,我们提出了 SAVE,一种面向语音的视频表征学习方法。SAVE 在 SOTA(最先进技术)视听方法 AVIGATE 的基础上进行了改进,增加了专门的语音分支以实现更有效的语音嵌入。此外,我们引入了 soft-ALBEF 以实现早期的视听对齐,从而促进融合。在五个基准测试上的广泛实验表明,SAVE 在与 SOTA 的比较中表现良好,在 MSRVTT-9k 上超越 AVIGATE 4.1%,在 MSRVTT-7k 上超越 1.9%,在 VATEX 上超越 2.5%,在 Charades 上超越 9.8%,在 LSMDC 上超越 2.1%,这一切都基于 SumR 指标。
cs.CV / 226 / 2603.08227

SRNeRV: A Scale-wise Recursive Framework for Neural Video Representation

SRNeRV:一种用于神经视频表示的尺度递归框架
Wang, Jia, Zhu, Jun, Zhang, Xinfeng
Abstract
Implicit Neural Representations (INRs) have emerged as a promising paradigm for video representation and compression. However, existing multi-scale INR generators often suffer from significant parameter redundancy by stacking independent processing blocks for each scale. Inspired by the principle of scale self-similarity in the generation process, we propose SRNeRV, a novel scale-wise recursive framework that replaces this stacked design with a parameter-efficient shared architecture. The core of our approach is a hybrid sharing scheme derived from decoupling the processing block into a scale-specific spatial mixing module and a scale-invariant channel mixing module. We recursively apply the same shared channel mixing module, which contains the majority of the parameters, across all scales, significantly reducing the model size while preserving the crucial capacity to learn scale-specific spatial patterns. Extensive experiments demonstrate that SRNeRV achieves a significant rate-distortion performance boost, especially in INR-friendly scenarios, validating that our sharing scheme successfully amplifies the core strengths of the INR paradigm.
Chinese Translation
隐式神经表示(INRs)已成为视频表示和压缩的一个有前景的范式。然而,现有的多尺度INR生成器通常通过为每个尺度堆叠独立的处理模块而遭受显著的参数冗余。受到生成过程中的尺度自相似性原则的启发,我们提出了SRNeRV,一种新颖的尺度递归框架,采用参数高效的共享架构替代这种堆叠设计。我们方法的核心是一个混合共享方案,该方案源于将处理模块解耦为尺度特定的空间混合模块和尺度不变的通道混合模块。我们在所有尺度上递归应用相同的共享通道混合模块,该模块包含大部分参数,显著减少了模型的大小,同时保留了学习尺度特定空间模式的关键能力。大量实验表明,SRNeRV在率失真性能上实现了显著提升,尤其是在适合INR的场景中,验证了我们的共享方案成功放大了INR范式的核心优势。
cs.CV / 227 / 2603.08228

GarmentPainter: Efficient 3D Garment Texture Synthesis with Character-Guided Diffusion Model

GarmentPainter:基于角色引导扩散模型的高效3D服装纹理合成
Wu, Jinbo, Gao, Xiaobo, Liu, Xing, Zhao, Chen, Liu, Jialun
Abstract
Generating high-fidelity, 3D-consistent garment textures remains a challenging problem due to the inherent complexities of garment structures and the stringent requirement for detailed, globally consistent texture synthesis. Existing approaches either rely on 2D-based diffusion models, which inherently struggle with 3D consistency, require expensive multi-step optimization, or depend on strict spatial alignment between 2D reference images and 3D meshes, which limits their flexibility and scalability. In this work, we introduce GarmentPainter, a simple yet efficient framework for synthesizing high-quality, 3D-aware garment textures in UV space. Our method leverages a UV position map as the 3D structural guidance, ensuring texture consistency across the garment surface during texture generation. To enhance control and adaptability, we introduce a type selection module, enabling fine-grained texture generation for specific garment components based on a character reference image, without requiring alignment between the reference image and the 3D mesh. GarmentPainter efficiently integrates all guidance signals into the input of a diffusion model in a spatially aligned manner, without modifying the underlying UNet architecture. Extensive experiments demonstrate that GarmentPainter achieves state-of-the-art performance in terms of visual fidelity, 3D consistency, and computational efficiency, outperforming existing methods in both qualitative and quantitative evaluations.
Chinese Translation
生成高保真、3D一致的服装纹理仍然是一个具有挑战性的问题,这主要由于服装结构的固有复杂性以及对详细、全局一致的纹理合成的严格要求。现有方法要么依赖于基于2D的扩散模型,这在3D一致性方面固有困难,要么需要昂贵的多步优化,或者依赖于2D参考图像与3D网格之间的严格空间对齐,这限制了它们的灵活性和可扩展性。在本研究中,我们提出了GarmentPainter,一个简单而高效的框架,用于在UV空间合成高质量、具有3D感知的服装纹理。我们的方法利用UV位置图作为3D结构指导,确保在纹理生成过程中服装表面的纹理一致性。为了增强控制和适应性,我们引入了一种类型选择模块,使得基于角色参考图像能够对特定服装组件进行细粒度的纹理生成,而无需参考图像与3D网格之间的对齐。GarmentPainter有效地将所有指导信号以空间对齐的方式整合到扩散模型的输入中,而无需修改底层的UNet架构。大量实验表明,GarmentPainter在视觉保真度、3D一致性和计算效率方面达到了最先进的性能,在定性和定量评估中均优于现有方法。
cs.CV / 228 / 2603.08235

Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema

探索深度学习与超广角成像在糖尿病视网膜病和黄斑水肿中的应用
Jimenez-Lizcano, Pablo, Romero-Tapiador, Sergio, Tolosana, Ruben, Morales, Aythami, de Rivera, Guillermo González, Vera-Rodriguez, Ruben, Fierrez, Julian
Abstract
Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.
Chinese Translation
糖尿病视网膜病(DR)和糖尿病黄斑水肿(DME)是导致工作年龄成年人可预防失明的主要原因。文献中的传统方法主要集中在标准彩色眼底摄影(CFP)用于检测这些疾病。然而,最近的超广角成像(UWF)相比于CFP提供了显著更广的视野。基于此,本研究探索了最先进的深度学习(DL)方法和UWF成像在三个临床相关任务中的应用:i)UWF的图像质量评估,ii)可参考的糖尿病视网膜病(RDR)的识别,以及iii)DME的识别。我们使用公开可用的UWF4DR挑战数据集,该数据集作为MICCAI 2024会议的一部分发布,对空间(RGB)和频率域中的DL模型进行了基准测试,包括流行的卷积神经网络(CNNs)以及最近的视觉变换器(ViTs)和基础模型。此外,我们探索了最终的特征级融合以提高鲁棒性。最后,我们还使用Grad-CAM分析DL模型的决策,提高可解释性。我们的提案在所有架构中均表现出持续强劲的性能,突显了新兴ViTs和基础模型的竞争力,以及特征级融合和频率域表示在UWF分析中的潜力。
cs.CV / 229 / 2603.08240

SiMO: Single-Modality-Operable Multimodal Collaborative Perception

SiMO:单模态可操作的多模态协同感知
Wen, Jiageng, Zhao, Shengjie, Li, Bing, Huang, Jiafeng, Ye, Kenan, Deng, Hao
Abstract
Collaborative perception integrates multi-agent perspectives to enhance the sensing range and overcome occlusion issues. While existing multimodal approaches leverage complementary sensors to improve performance, they are highly prone to failure--especially when a key sensor like LiDAR is unavailable. The root cause is that feature fusion leads to semantic mismatches between single-modality features and the downstream modules. This paper addresses this challenge for the first time in the field of collaborative perception, introducing Single-Modality-Operable Multimodal Collaborative Perception (SiMO). By adopting the proposed Length-Adaptive Multi-Modal Fusion (LAMMA), SiMO can adaptively handle remaining modal features during modal failures while maintaining consistency of the semantic space. Additionally, leveraging the innovative "Pretrain-Align-Fuse-RD" training strategy, SiMO addresses the issue of modality competition--generally overlooked by existing methods--ensuring the independence of each individual modality branch. Experiments demonstrate that SiMO effectively aligns multimodal features while simultaneously preserving modality-specific features, enabling it to maintain optimal performance across all individual modalities. The implementation details can be found in https://github.com/dempsey-wen/SiMO.
Chinese Translation
协同感知整合了多智能体的视角,以增强感知范围并克服遮挡问题。虽然现有的多模态方法利用互补传感器来提高性能,但它们在关键传感器(如激光雷达)不可用时极易发生故障。根本原因在于特征融合导致单模态特征与下游模块之间的语义不匹配。本文首次在协同感知领域解决了这一挑战,提出了单模态可操作的多模态协同感知(SiMO)。通过采用所提出的长度自适应多模态融合(Length-Adaptive Multi-Modal Fusion, LAMMA),SiMO能够在模态失效期间自适应处理剩余的模态特征,同时保持语义空间的一致性。此外,利用创新的“预训练-对齐-融合-重分布”(Pretrain-Align-Fuse-RD)训练策略,SiMO解决了模态竞争的问题——这一问题通常被现有方法忽视——确保每个单独模态分支的独立性。实验表明,SiMO有效地对齐了多模态特征,同时保留了模态特定特征,使其在所有单独模态中保持最佳性能。实现细节可以在 https://github.com/dempsey-wen/SiMO 中找到。
cs.CV / 230 / 2603.08254

DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving

DynamicVGGT:用于自主驾驶中4D场景重建的动态点图学习
He, Zhuolin, Li, Jing, Li, Guanghao, Chen, Xiaolei, Tang, Jiacheng, Zhang, Siyang, Jin, Zhounan, Cai, Feipeng, Li, Bin, Pu, Jian, Cai, Jia, Xue, Xiangyang
Abstract
Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that learns motion continuity. Furthermore, we design a Dynamic 3D Gaussian Splatting Head that explicitly models point motion by predicting Gaussian velocities using learnable motion tokens under scene flow supervision. It refines dynamic geometry through continuous 3D Gaussian optimization. Extensive experiments on autonomous driving datasets demonstrate that DynamicVGGT significantly outperforms existing methods in reconstruction accuracy, achieving robust feed-forward 4D dynamic scene reconstruction under complex driving scenarios.
Chinese Translation
自主驾驶中的动态场景重建仍然是一个基本挑战,主要由于显著的时间变化、移动物体和复杂的场景动态。现有的前馈3D模型在静态重建方面表现出色,但在捕捉动态运动方面仍然存在困难。为了解决这些局限性,我们提出了DynamicVGGT,一个统一的前馈框架,将VGGT从静态3D感知扩展到动态4D重建。我们的目标是在动态且时间一致的方式中建模前馈3D模型中的点运动。为此,我们在共享的参考坐标系统中共同预测当前和未来的点图,使模型能够通过时间对应隐式学习动态点表示。为了有效捕捉时间依赖性,我们引入了一个运动感知时间注意力(Motion-aware Temporal Attention, MTA)模块,以学习运动连续性。此外,我们设计了一个动态3D高斯溅射头,通过在场景流监督下使用可学习的运动标记预测高斯速度,明确建模点运动。它通过连续的3D高斯优化来细化动态几何。对自主驾驶数据集的广泛实验表明,DynamicVGGT在重建精度上显著优于现有方法,在复杂驾驶场景下实现了稳健的前馈4D动态场景重建。
cs.CV / 231 / 2603.08258

WaDi: Weight Direction-aware Distillation for One-step Image Synthesis

WaDi:面向权重方向的单步图像合成蒸馏
Wang, Lei, Cheng, Yang, Li, Senmao, Wu, Ge, Wang, Yaxing, Yang, Jian
Abstract
Despite the impressive performance of diffusion models such as Stable Diffusion (SD) in image generation, their slow inference limits practical deployment. Recent works accelerate inference by distilling multi-step diffusion into one-step generators. To better understand the distillation mechanism, we analyze U-Net/DiT weight changes between one-step students and their multi-step teacher counterparts. Our analysis reveals that changes in weight direction significantly exceed those in weight norm, highlighting it as the key factor during distillation. Motivated by this insight, we propose the Low-rank Rotation of weight Direction (LoRaD), a parameter-efficient adapter tailored to one-step diffusion distillation. LoRaD is designed to model these structured directional changes using learnable low-rank rotation matrices. We further integrate LoRaD into Variational Score Distillation (VSD), resulting in Weight Direction-aware Distillation (WaDi)-a novel one-step distillation framework. WaDi achieves state-of-the-art FID scores on COCO 2014 and COCO 2017 while using only approximately 10% of the trainable parameters of the U-Net/DiT. Furthermore, the distilled one-step model demonstrates strong versatility and scalability, generalizing well to various downstream tasks such as controllable generation, relation inversion, and high-resolution synthesis.
Chinese Translation
尽管扩散模型如稳定扩散(Stable Diffusion, SD)在图像生成方面表现出色,但其缓慢的推理速度限制了实际应用。近期的研究通过将多步扩散蒸馏为单步生成器来加速推理。为了更好地理解蒸馏机制,我们分析了单步学生模型与其多步教师模型之间的U-Net/DiT权重变化。我们的分析揭示,权重方向的变化显著超过权重范数的变化,强调了其在蒸馏过程中的关键作用。基于这一见解,我们提出了权重方向的低秩旋转(Low-rank Rotation of weight Direction, LoRaD),这是一种针对单步扩散蒸馏量身定制的参数高效适配器。LoRaD旨在使用可学习的低秩旋转矩阵来建模这些结构化的方向变化。我们进一步将LoRaD集成到变分评分蒸馏(Variational Score Distillation, VSD)中,形成了面向权重方向的蒸馏(Weight Direction-aware Distillation, WaDi)——一种新颖的单步蒸馏框架。WaDi在COCO 2014和COCO 2017上实现了最先进的FID分数,同时仅使用了U-Net/DiT约10%的可训练参数。此外,蒸馏后的单步模型展示了强大的通用性和可扩展性,能够很好地推广到可控生成、关系反转和高分辨率合成等各种下游任务。
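The direction-versus-norm analysis motivating LoRaD can be reproduced in miniature: compare the cosine between flattened teacher and student weights with their norm ratio. Toy random matrices stand in for real U-Net/DiT layers:

import numpy as np

def direction_and_norm_change(w_teacher, w_student):
    """Quantify how much a weight changed in direction vs. in norm."""
    t, s = w_teacher.ravel(), w_student.ravel()
    cos = t @ s / (np.linalg.norm(t) * np.linalg.norm(s))
    norm_ratio = np.linalg.norm(s) / np.linalg.norm(t)
    return 1.0 - cos, abs(1.0 - norm_ratio)    # direction vs. norm shift

rng = np.random.default_rng(5)
w_t = rng.normal(size=(128, 128))              # "teacher" layer weight
w_s = w_t + 0.3 * rng.normal(size=(128, 128))  # perturbed "student" weight
dir_shift, norm_shift = direction_and_norm_change(w_t, w_s)
print(f"direction shift {dir_shift:.4f} vs norm shift {norm_shift:.4f}")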
cs.CV / 232 / 2603.08264

Event-based Motion & Appearance Fusion for 6D Object Pose Tracking

基于事件的运动与外观融合用于6D物体姿态跟踪
Li, Zhichao, Bartolozzi, Chiara, Natale, Lorenzo, Glover, Arren
Abstract
Object pose tracking is a fundamental task for robots performing manipulation in home and industrial settings. The most commonly used sensors for this are RGB-D cameras, which face limitations in highly dynamic environments due to motion blur and frame-rate constraints. Event cameras have remarkable features such as high temporal resolution and low latency, which make them potentially ideal vision sensors for object pose tracking at high speed. Even so, there are still only a few works on 6D pose tracking with event cameras. In this work, we take advantage of the high temporal resolution and propose a method that fuses a propagation step with a pose correction strategy. Specifically, we use the 6D object velocity obtained from event-based optical flow for pose propagation, after which a template-based local pose correction module refines the pose. Our learning-free method performs comparably to state-of-the-art algorithms and, in some cases, outperforms them for fast-moving objects. The results indicate the potential of event cameras in highly dynamic scenarios where deep network approaches are limited by low update rates.
Chinese Translation
物体姿态跟踪是机器人在家庭和工业环境中执行任务的基本且重要的任务。最常用的传感器是RGB-D相机,但在高度动态的环境中,由于运动模糊和帧率限制,它们可能会遇到局限性。事件相机具有高时间分辨率和低延迟等显著特性,使其成为高速度物体姿态跟踪的理想视觉传感器。然而,目前仅有少数研究关注使用事件相机进行6D姿态跟踪。在本研究中,我们利用高时间分辨率,提出了一种结合传播步骤和姿态校正策略的方法。具体而言,我们使用基于事件的光流获得的6D物体速度进行姿态传播,随后利用基于模板的局部姿态校正模块进行姿态校正。我们的方法无需学习,性能与最先进的算法相当,在某些情况下,对于快速移动的物体表现更优。结果表明,在深度网络方法因低更新率而受限的高度动态场景中,使用事件相机具有潜在的应用前景。
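The propagation step can be sketched as integrating a constant 6-D velocity (twist) over the millisecond-scale intervals an event camera affords, with the twist assumed given by event-based optical flow; a minimal version using scipy's rotation utilities:

import numpy as np
from scipy.spatial.transform import Rotation as R

def propagate_pose(rot, trans, twist, dt):
    """Propagate a 6D pose by a constant body twist for dt seconds.

    rot:   scipy Rotation (orientation).
    trans: (3,) translation.
    twist: (6,) [vx, vy, vz, wx, wy, wz], e.g. from event-based flow.
    """
    v, w = twist[:3], twist[3:]
    rot_new = R.from_rotvec(w * dt) * rot     # integrate angular velocity
    trans_new = trans + v * dt                # integrate linear velocity
    return rot_new, trans_new

rot, trans = R.identity(), np.zeros(3)
twist = np.array([0.5, 0.0, 0.0, 0.0, 0.0, np.pi])   # 0.5 m/s, 180 deg/s
for _ in range(10):                                  # 10 steps of 1 ms
    rot, trans = propagate_pose(rot, trans, twist, dt=1e-3)
print("translation:", trans,
      "yaw(deg):", rot.as_euler("xyz", degrees=True)[2])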
cs.CV / 233 / 2603.08271

Prototype-Guided Concept Erasure in Diffusion Models

基于原型的扩散模型概念消除
Cai, Yuze, Lu, Jiahao, Shi, Hongxiang, Zhou, Yichao, Lu, Hong
Abstract
Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content. Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct intellectual properties (e.g. Pikachu) or recognizable characters (e.g. Elon Musk). However, their performance degrades on broad concepts such as ``sexual'' or ``violent'', whose wide scope and multi-faceted nature make them difficult to erase reliably. To overcome this limitation, we exploit the model's intrinsic embedding geometry to identify latent embeddings that encode a given concept. By clustering these embeddings, we derive a set of concept prototypes that summarize the model's internal representations of the concept, and employ them as negative conditioning signals during inference to achieve precise and reliable erasure. Extensive experiments across multiple benchmarks show that our approach achieves substantially more reliable removal of broad concepts while preserving overall image quality, marking a step towards safer and more controllable image generation.
Chinese Translation
概念消除在图像生成中被广泛应用,以防止文本到图像模型生成不希望的内容。现有方法能够有效地消除特定且具体的狭窄概念,例如独特的知识产权(如皮卡丘)或可识别的角色(如埃隆·马斯克)。然而,对于“色情”或“暴力”等广泛概念,由于其广泛的范围和多面性,现有方法的性能下降,难以可靠地消除。为克服这一限制,我们利用模型内在的嵌入几何结构,识别编码特定概念的潜在嵌入。通过对这些嵌入进行聚类,我们得出一组概念原型,总结模型对该概念的内部表征,并在推理过程中将其作为负条件信号,以实现精确和可靠的消除。在多个基准测试中的大量实验表明,我们的方法在保持整体图像质量的同时,显著提高了广泛概念的可靠去除,标志着朝着更安全和可控的图像生成迈出了重要一步。
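Using prototypes as negative conditioning can be sketched with the standard classifier-free-guidance combination, where a prototype embedding plays the role of a negative prompt; the update rule and guidance scales below are generic CFG under our own notation, not the paper's exact formulation:

import numpy as np

def guided_noise(eps_uncond, eps_cond, eps_proto, s_pos=7.5, s_neg=3.0):
    """Combine denoiser outputs so generation is pushed toward the
    prompt and away from the concept prototype.

    eps_*: noise predictions under no conditioning, the user prompt,
    and the concept prototype used as a negative condition.
    """
    return (eps_uncond
            + s_pos * (eps_cond - eps_uncond)
            - s_neg * (eps_proto - eps_uncond))

rng = np.random.default_rng(6)
shape = (4, 8, 8)
eps_u, eps_c, eps_p = (rng.normal(size=shape) for _ in range(3))
print("guided output shape:", guided_noise(eps_u, eps_c, eps_p).shape)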
cs.CV / 234 / 2603.08279

OSCAR: Occupancy-based Shape Completion via Acoustic Neural Implicit Representations

OSCAR:通过声学神经隐式表示实现基于占用的形状补全
Wysocki, Magdalena, Buldu, Kadir Burak, Gafencu, Miruna-Alexandra, Azampour, Mohammad Farid, Navab, Nassir
Abstract
Accurate 3D reconstruction of vertebral anatomy from ultrasound is important for guiding minimally invasive spine interventions, but it remains challenging due to acoustic shadowing and view-dependent signal variations. We propose an occupancy-based shape completion method that reconstructs complete 3D anatomical geometry from partial ultrasound observations. Crucially for intra-operative applications, our approach extracts the anatomical surface directly from the image, avoiding the need for anatomical labels during inference. This label-free completion relies on a coupled latent space representing both the image appearance and the underlying anatomical shape. By leveraging a Neural Implicit Representation (NIR) that jointly models spatial occupancy and acoustic interactions, the method tracks acoustic signal transmission and uses the resulting acoustic parameters to become implicitly aware of unseen regions without explicit shadowing labels. We show that this method outperforms state-of-the-art shape completion for B-mode ultrasound by 80% in HD95 score. We validate our approach both in-silico and on phantom US images with registered mesh models from CT labels, demonstrating accurate reconstruction of occluded anatomy and robust generalization across diverse imaging conditions. Code and data will be released on publication.
Chinese Translation
从超声波中准确重建脊椎解剖结构对于指导微创脊柱干预至关重要,但由于声学阴影和视角依赖的信号变化,这一任务仍然具有挑战性。我们提出了一种基于占用的形状补全方法,该方法能够从部分超声观察中重建完整的三维解剖几何形状。对于手术中的应用至关重要,我们的方法直接从图像中提取解剖表面,避免了推断过程中对解剖标签的需求。这种无标签的补全依赖于一个耦合的潜在空间,该空间同时表示图像外观和基础解剖形状。通过利用一种神经隐式表示(Neural Implicit Representation, NIR),该方法联合建模空间占用和声学交互,使用声学参数隐式感知未见区域,而无需通过跟踪声学信号传输来显式标注阴影。我们展示了该方法在HD95评分上比现有的B模式超声形状补全技术提高了80%。我们在计算机仿真和使用来自CT标签的注册网格模型的虚拟超声图像上验证了我们的方法,展示了对被遮挡解剖结构的准确重建以及在多样成像条件下的稳健泛化。代码和数据将在发表时发布。
cs.CV / 235 / 2603.08289

Novel Semantic Prompting for Zero-Shot Action Recognition

用于零样本动作识别的新语义提示技术
Iqbal, Salman, Rehman, Waheed
Abstract
Zero-shot action recognition relies on transferring knowledge from vision-language models to unseen actions using semantic descriptions. While recent methods focus on temporal modeling or architectural adaptations to handle video data, we argue that semantic prompting alone provides a strong and underexplored signal for zero-shot action understanding. We introduce SP-CLIP, a lightweight framework that augments frozen vision-language models with structured semantic prompts describing actions at multiple levels of abstraction, such as intent, motion, and object interaction. Without modifying the visual encoder or learning additional parameters, SP-CLIP aligns video representations with enriched textual semantics through prompt aggregation and consistency scoring. Experiments across standard benchmarks show that semantic prompting substantially improves zero-shot action recognition, particularly for fine-grained and compositional actions, while preserving the efficiency and generalization of pretrained models.
Chinese Translation
零样本动作识别依赖于通过语义描述将视觉-语言模型的知识转移到未见过的动作上。尽管近期的方法集中于时间建模或架构调整以处理视频数据,但我们认为单独的语义提示提供了一个强大且尚未充分探索的信号,用于零样本动作理解。我们提出了SP-CLIP,这是一种轻量级框架,通过结构化的语义提示增强冻结的视觉-语言模型,这些提示在多个抽象层次上描述动作,例如意图、运动和物体交互。在不修改视觉编码器或学习额外参数的情况下,SP-CLIP通过提示聚合和一致性评分将视频表示与丰富的文本语义对齐。在标准基准测试中的实验表明,语义提示显著提高了零样本动作识别的性能,特别是在细粒度和组合动作方面,同时保持了预训练模型的效率和泛化能力。
cs.CV / 236 / 2603.08305

Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation

检索增强的解剖指导用于文本到CT生成
Molino, Daniele, Caruso, Camillo Maria, Soda, Paolo, Guarrasi, Valerio
Abstract
Text-conditioned generative models for volumetric medical imaging provide semantic control but lack explicit anatomical guidance, often resulting in outputs that are spatially ambiguous or anatomically inconsistent. In contrast, structure-driven methods ensure strong anatomical consistency but typically assume access to ground-truth annotations, which are unavailable when the target image is to be synthesized. We propose a retrieval-augmented approach for Text-to-CT generation that integrates semantic and anatomical information under a realistic inference setting. Given a radiology report, our method retrieves a semantically related clinical case using a 3D vision-language encoder and leverages its associated anatomical annotation as a structural proxy. This proxy is injected into a text-conditioned latent diffusion model via a ControlNet branch, providing coarse anatomical guidance while maintaining semantic flexibility. Experiments on the CT-RATE dataset show that retrieval-augmented generation improves image fidelity and clinical consistency compared to text-only baselines, while additionally enabling explicit spatial controllability, a capability inherently absent in such approaches. Further analysis highlights the importance of retrieval quality, with semantically aligned proxies yielding consistent gains across all evaluation axes. This work introduces a principled and scalable mechanism to bridge semantic conditioning and anatomical plausibility in volumetric medical image synthesis. Code will be released.
Chinese Translation
基于文本条件的体积医学成像生成模型提供了语义控制,但缺乏明确的解剖指导,常导致输出在空间上模糊或在解剖上不一致。相比之下,结构驱动的方法确保了强大的解剖一致性,但通常假设可以访问真实标注,而在目标图像需要合成时,这些标注是不可用的。我们提出了一种检索增强的文本到CT生成方法,在现实推理环境下整合了语义和解剖信息。给定一份放射学报告,我们的方法使用3D视觉-语言编码器检索一个语义相关的临床案例,并利用其相关的解剖标注作为结构代理。这个代理通过ControlNet分支注入到基于文本的潜在扩散模型中,提供粗略的解剖指导,同时保持语义灵活性。在CT-RATE数据集上的实验表明,与仅基于文本的基线相比,检索增强的生成提高了图像的保真度和临床一致性,同时还实现了明确的空间可控性,这是此类方法固有缺失的能力。进一步的分析强调了检索质量的重要性,语义对齐的代理在所有评估维度上都带来了持续的提升。这项工作引入了一种原则性和可扩展的机制,以桥接体积医学图像合成中的语义条件和解剖合理性。代码将会发布。
cs.CV / 237 / 2603.08309

Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness

概念引导的微调:引导视觉变换器远离虚假相关性以提高鲁棒性
Elisha, Yehonatan, Barkan, Oren, Koenigstein, Noam
Abstract
Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically rely on simple foreground-background masks, which fail to capture the fine-grained semantic concepts that define an object (e.g., ``long beak'' and ``wings'' for a ``bird''). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel finetuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class-relevant concepts are first proposed using an LLM-based, label-free method, and then segmented using a VLM. The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out-of-distribution benchmarks demonstrate that our method improves robustness across multiple ViT-based models. Furthermore, we show that the resulting relevance maps exhibit stronger alignment with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.
Chinese Translation
视觉变换器(ViTs)在分布变化下往往表现不佳,因为它们依赖于虚假相关性,如背景线索,而非语义上有意义的特征。现有的正则化方法通常依赖于简单的前景-背景掩码,未能捕捉定义对象的细粒度语义概念(例如,鸟的“长喙”和“翅膀”)。因此,这些方法对分布变化的鲁棒性有限。为了解决这一局限性,我们提出了一种新颖的微调框架,旨在将模型推理引导至概念级语义。我们的方法优化模型的内部相关性图,以与空间上定位的概念掩码对齐。这些掩码是自动生成的,无需手动标注:首先使用基于大型语言模型(LLM)的无标签方法提出与类别相关的概念,然后使用视觉语言模型(VLM)进行分割。微调目标使相关性与这些概念区域对齐,同时抑制对虚假背景区域的关注。值得注意的是,这一过程仅需少量图像,并使用数据集中一半的类别。在五个分布外基准上的广泛实验表明,我们的方法在多个基于ViT的模型中提高了鲁棒性。此外,我们还展示了生成的相关性图与语义对象部分的对齐更强,提供了一条可扩展的路径,朝着更鲁棒和可解释的视觉模型迈进。最后,我们确认概念引导的掩码为模型鲁棒性提供了比传统分割图更有效的监督,支持了我们的核心假设。
cs.CV / 238 / 2603.08313

HDR-NSFF: High Dynamic Range Neural Scene Flow Fields

HDR-NSFF:高动态范围神经场景流场
Dong-Yeon, Shin, Jun-Seong, Kim, Byung-Ki, Kwon, Oh, Tae-Hyun
Abstract
Radiance of real-world scenes typically spans a much wider dynamic range than what standard cameras can capture. While conventional HDR methods merge alternating-exposure frames, these approaches are inherently constrained to 2D pixel-level alignment, often leading to ghosting artifacts and temporal inconsistency in dynamic scenes. To address these limitations, we present HDR-NSFF, a paradigm shift from 2D-based merging to 4D spatio-temporal modeling. Our framework reconstructs dynamic HDR radiance fields from alternating-exposure monocular videos by representing the scene as a continuous function of space and time, and is compatible with both neural radiance field and 4D Gaussian Splatting (4DGS) based dynamic representations. This unified end-to-end pipeline explicitly models HDR radiance, 3D scene flow, geometry, and tone-mapping, ensuring physical plausibility and global coherence. We further enhance robustness by (i) extending semantic-based optical flow with DINO features to achieve exposure-invariant motion estimation, and (ii) incorporating a generative prior as a regularizer to compensate for limited observation in monocular captures and saturation-induced information loss. To evaluate HDR space-time view synthesis, we present the first real-world HDR-GoPro dataset specifically designed for dynamic HDR scenes. Experiments demonstrate that HDR-NSFF recovers fine radiance details and coherent dynamics even under challenging exposure variations, thereby achieving state-of-the-art performance in novel space-time view synthesis. Project page: https://shin-dong-yeon.github.io/HDR-NSFF/
Chinese Translation
现实世界场景的辐射通常具有比标准相机能够捕捉的更宽广的动态范围。虽然传统的高动态范围(HDR)方法通过合并交替曝光的帧来实现HDR效果,但这些方法本质上受限于二维像素级对齐,常常导致动态场景中的鬼影伪影和时间不一致性。为了解决这些限制,我们提出了HDR-NSFF,这是一种从基于二维的合并转向四维时空建模的范式转变。我们的框架通过将场景表示为空间和时间的连续函数,从交替曝光的单目视频中重建动态HDR辐射场,并且兼容神经辐射场和基于四维高斯喷溅(4D Gaussian Splatting, 4DGS)的动态表示。这一统一的端到端管道明确建模HDR辐射、三维场景流、几何形状和色调映射,确保物理上的合理性和全局一致性。我们进一步通过(i) 扩展基于语义的光流,结合DINO特征以实现曝光不变的运动估计,以及(ii) 引入生成先验作为正则化器,以补偿单目捕捉中的有限观察和饱和引起的信息损失,从而增强鲁棒性。为了评估HDR时空视图合成,我们提出了第一个专门为动态HDR场景设计的真实世界HDR-GoPro数据集。实验表明,HDR-NSFF能够在具有挑战性的曝光变化下恢复细致的辐射细节和一致的动态,从而在新颖的时空视图合成中实现了最先进的性能。项目页面:https://shin-dong-yeon.github.io/HDR-NSFF/
cs.CV / 239 / 2603.08317

Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations

在空间和时空操控下的人类与人工智能在自我中心动作识别中的差异
Rahmaniboldaji, Sadegh, Rybansky, Filip, Vuong, Quoc C., Hurlbert, Anya C., Guerin, Frank, Gilbert, Andrew
Abstract
Humans consistently outperform state-of-the-art AI models in action recognition, particularly in challenging real-world conditions involving low resolution, occlusion, and visual clutter. Understanding the sources of this performance gap is essential for developing more robust and human-aligned models. In this paper, we present a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition. We use our previously introduced Epic ReduAct, a systematically spatially reduced and temporally scrambled dataset derived from 36 EPIC KITCHENS videos, spanning multiple spatial reduction levels and temporal conditions. Recognition performance is evaluated using over 3,000 human participants and the Side4Video model. Our analysis combines quantitative metrics, Average Reduction Rate and Recognition Gap, with qualitative analyses of spatial (high-, mid-, and low-level visual features) and spatiotemporal factors, including a categorisation of actions into Low Temporal Actions (LTA) and High Temporal Actions (HTA). Results show that human performance exhibits sharp declines when transitioning from MIRCs to subMIRCs, reflecting a strong reliance on sparse, semantically critical cues such as hand-object interactions. In contrast, the model degrades more gradually and often relies on contextual and mid- to low-level features, sometimes even exhibiting increased confidence under spatial reduction. Temporally, humans remain robust to scrambling when key spatial cues are preserved, whereas the model often shows insensitivity to temporal disruption, revealing class-dependent temporal sensitivities.
Chinese Translation
人类在动作识别方面始终优于最先进的人工智能模型,尤其是在涉及低分辨率、遮挡和视觉杂乱等具有挑战性的现实条件下。理解这种性能差距的来源对于开发更稳健且与人类对齐的模型至关重要。本文呈现了一项大规模的人类与人工智能在自我中心动作识别方面的比较研究,使用最小可识别识别区域(Minimal Identifiable Recognition Crops, MIRCs),定义为足以进行可靠人类识别的最小空间或时空区域。我们使用了之前提出的Epic ReduAct数据集,该数据集是从36个EPIC KITCHENS视频中系统性地空间缩减和时间打乱而得,涵盖了多个空间缩减水平和时间条件。识别性能通过3000多名参与者和Side4Video模型进行评估。我们的分析结合了定量指标——平均减少率(Average Reduction Rate)和识别差距(Recognition Gap),以及对空间(高、中、低层次视觉特征)和时空因素的定性分析,包括将动作分类为低时间动作(Low Temporal Actions, LTA)和高时间动作(High Temporal Actions, HTA)。结果显示,当从MIRCs过渡到subMIRCs时,人类的表现显著下降,反映出对稀疏的、语义关键的线索(如手-物体交互)的强烈依赖。相比之下,模型的性能下降更为渐进,通常依赖于上下文和中低层次特征,有时在空间缩减下甚至表现出更高的信心。在时间上,当关键空间线索得以保留时,人类对打乱保持稳健,而模型则往往对时间干扰表现出不敏感,揭示了类别依赖的时间敏感性。
cs.CV / 240 / 2603.08328

Beyond Attention Heatmaps: How to Get Better Explanations for Multiple Instance Learning Models in Histopathology

超越注意力热图:如何为组织病理学中的多实例学习模型获得更好的解释
Idaji, Mina Jamshidi, Hense, Julius, Neuhäuser, Tom, Krause, Augustin, Luo, Yanqing, Eberle, Oliver, Schnake, Thomas, Ciernik, Laure, Jafari, Farnoush Rezaei, Vahidimajd, Reza, Dippel, Jonas, Walz, Christoph, Klauschen, Frederick, Mock, Andreas, Müller, Klaus-Robert
Abstract
Multiple instance learning (MIL) has enabled substantial progress in computational histopathology, where a large amount of patches from gigapixel whole slide images are aggregated into slide-level predictions. Heatmaps are widely used to validate MIL models and to discover tissue biomarkers. Yet, the validity of these heatmaps has barely been investigated. In this work, we introduce a general framework for evaluating the quality of MIL heatmaps without requiring additional labels. We conduct a large-scale benchmark experiment to assess six explanation methods across histopathology task types (classification, regression, survival), MIL model architectures (Attention-, Transformer-, Mamba-based), and patch encoder backbones (UNI2, Virchow2). Our results show that explanation quality mostly depends on MIL model architecture and task type, with perturbation ("Single"), layer-wise relevance propagation (LRP), and integrated gradients (IG) consistently outperforming attention-based and gradient-based saliency heatmaps, which often fail to reflect model decision mechanisms. We further demonstrate the advanced capabilities of the best-performing explanation methods: (i) We provide a proof-of-concept that MIL heatmaps of a bulk gene expression prediction model can be correlated with spatial transcriptomics for biological validation, and (ii) showcase the discovery of distinct model strategies for predicting human papillomavirus (HPV) infection from head and neck cancer slides. Our work highlights the importance of validating MIL heatmaps and establishes that improved explainability can enable more reliable model validation and yield biological insights, making a case for a broader adoption of explainable AI in digital pathology. Our code is provided in a public GitHub repository: https://github.com/bifold-pathomics/xMIL/tree/xmil-journal
Chinese Translation
多实例学习(MIL)在计算组织病理学中取得了显著进展,其中来自千兆像素全切片图像的大量图块被聚合为切片级预测。热图被广泛用于验证MIL模型和发现组织生物标志物。然而,这些热图的有效性几乎没有受到研究。在本研究中,我们提出了一个通用框架,用于评估MIL热图的质量,而无需额外的标签。我们进行了一项大规模基准实验,以评估六种解释方法在组织病理学任务类型(分类、回归、生存)、MIL模型架构(基于注意力、变换器、Mamba)和图块编码器骨干(UNI2、Virchow2)中的表现。我们的结果表明,解释质量主要依赖于MIL模型架构和任务类型,其中扰动(“单一”)、层级相关传播(LRP)和集成梯度(IG)始终优于基于注意力和基于梯度的显著性热图,这些热图往往无法反映模型决策机制。我们进一步展示了表现最佳的解释方法的先进能力:(i)我们提供了一个概念验证,表明一个大规模基因表达预测模型的MIL热图可以与空间转录组学相关联,以进行生物学验证;(ii)展示了从头颈癌切片中预测人乳头瘤病毒(HPV)感染的不同模型策略的发现。我们的工作强调了验证MIL热图的重要性,并确立了改进的可解释性可以实现更可靠的模型验证并提供生物学见解,从而为在数字病理学中更广泛采用可解释人工智能提供了依据。我们的代码已在公共GitHub仓库中提供:https://github.com/bifold-pathomics/xMIL/tree/xmil-journal
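For readers unfamiliar with the models being explained, a compact attention-based MIL aggregator of the kind the benchmark covers: the slide-level prediction is an attention-weighted pooling of patch features, and the attention weights are what heatmaps typically visualize. Dimensions and parameter names are illustrative:

import numpy as np

def attention_mil(patch_feats, V, w, W_cls):
    """Aggregate patch embeddings into one slide-level prediction.

    patch_feats: (N, D) encoder features for N patches of one slide.
    Attention scores are computed per patch, softmax-normalized, and
    used to pool patches into a single slide embedding.
    """
    scores = np.tanh(patch_feats @ V) @ w          # (N,) patch scores
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    slide_embed = attn @ patch_feats               # (D,) weighted pooling
    return slide_embed @ W_cls, attn               # logit + heatmap weights

rng = np.random.default_rng(7)
N, D, Hdim = 100, 256, 64
feats = rng.normal(size=(N, D))
logit, attn = attention_mil(feats, rng.normal(size=(D, Hdim)),
                            rng.normal(size=Hdim), rng.normal(size=D))
print("slide logit:", float(logit), "| top patch:", int(attn.argmax()))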
cs.CV / 241 / 2603.08347

Local-Global Prompt Learning via Sparse Optimal Transport

通过稀疏最优传输实现局部-全局提示学习
Kizaroğlu, Deniz, Küçüktas, Ülku Tuncer, Çakmakyurdu, Emre, Temizel, Alptekin
Abstract
Few-shot adaptation of vision-language models (VLMs) like CLIP typically relies on learning textual prompts matched to global image embeddings. Recent works extend this paradigm by incorporating local image-text alignment to capture fine-grained visual cues, yet these approaches often select local regions independently for each prompt, leading to redundant local feature usage and prompt overlap. We propose SOT-GLP, which introduces a shared sparse patch support and balanced optimal transport allocation to explicitly partition salient visual regions among class-specific local prompts while preserving global alignment. Our method learns shared global prompts and class-specific local prompts. The global branch maintains standard image-text matching for robust category-level alignment. The local branch constructs a class-conditioned sparse patch set using V-V attention and aligns it to multiple class-specific prompts via balanced entropic optimal transport, yielding a soft partition of patches that prevents prompt overlap and collapse. We evaluate our method on two complementary objectives: (i) few-shot classification accuracy on 11 standard benchmarks and (ii) out-of-distribution (OOD) detection. On the standard 11-dataset benchmark with 16-shot ViT-B/16, SOT-GLP achieves 85.1% average accuracy, outperforming prior prompt-learning methods. We identify a distinct accuracy-robustness trade-off in prompt learning: while learnable projections optimize in-distribution fit, they alter the foundational feature space. We demonstrate that a projection-free local alignment preserves the native geometry of the CLIP manifold, yielding state-of-the-art OOD detection performance (94.2% AUC) that surpasses fully adapted models. Implementation available at: https://github.com/Deniz2304988/SOT-GLP
Chinese Translation
视觉-语言模型(VLMs)如 CLIP 的少样本适应通常依赖于学习与全局图像嵌入匹配的文本提示。近期的研究通过结合局部图像-文本对齐来捕捉细粒度的视觉线索,从而扩展了这一范式,但这些方法往往独立选择每个提示的局部区域,导致冗余的局部特征使用和提示重叠。我们提出了 SOT-GLP,它引入了共享的稀疏补丁支持和均衡的最优传输分配,以明确划分类特定的局部提示之间的显著视觉区域,同时保持全局对齐。我们的方法学习共享的全局提示和类特定的局部提示。全局分支维持标准的图像-文本匹配,以实现稳健的类别级对齐。局部分支使用 V-V 注意力构建类条件的稀疏补丁集,并通过均衡的熵最优传输将其与多个类特定的提示对齐,从而实现补丁的软分区,防止提示重叠和崩溃。我们在两个互补目标上评估我们的方法:(i)在 11 个标准基准上的少样本分类准确率和(ii)分布外(OOD)检测。在使用 16-shot ViT-B/16 的标准 11 数据集基准上,SOT-GLP 达到了 85.1% 的平均准确率,超越了之前的提示学习方法。我们识别出提示学习中的一个明显的准确性-鲁棒性权衡:虽然可学习的投影优化了分布内的拟合,但它们改变了基础特征空间。我们证明,无投影的局部对齐保留了 CLIP 流形的原生几何结构,获得了最先进的 OOD 检测性能(94.2% AUC),超过了完全适应的模型。实现代码可在以下链接获取:https://github.com/Deniz2304988/SOT-GLP
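The balanced entropic OT step can be sketched with plain Sinkhorn iterations that assign patches (rows) to local prompts (columns); uniform marginals enforce the balanced partition that prevents prompt collapse. Sizes and the entropic weight are illustrative assumptions:

import numpy as np

def sinkhorn(cost, eps=0.05, n_iter=200):
    """Balanced entropic OT between patches (rows) and prompts (cols).

    Uniform marginals force every prompt to claim an equal share of
    patch mass, which prevents prompts from collapsing onto the same
    salient region.
    """
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]             # transport plan (n, m)

rng = np.random.default_rng(8)
patches, prompts = rng.normal(size=(50, 64)), rng.normal(size=(4, 64))
cost = 1.0 - (patches @ prompts.T) / (
    np.linalg.norm(patches, axis=1, keepdims=True)
    * np.linalg.norm(prompts, axis=1))
plan = sinkhorn(cost)
print("prompt masses (balanced):", plan.sum(axis=0))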
cs.CV / 242 / 2603.08361

$\Delta$VLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation

ΔVLA:通过世界知识变化进行先验引导的视觉-语言-动作模型
Zhu, Yijie, He, Jie, Shao, Rui, Yuan, Kaishen, Tan, Tao, Yuan, Xiaochen, Yu, Zitong
Abstract
Recent vision-language-action (VLA) models have significantly advanced robotic manipulation by unifying perception, reasoning, and control. To achieve such integration, recent studies adopt a predictive paradigm that models future visual states or world knowledge to guide action generation. However, these models emphasize forecasting outcomes rather than reasoning about the underlying process of change, which is essential for determining how to act. To address this, we propose $\Delta$VLA, a prior-guided framework that models world-knowledge variations relative to an explicit current-world knowledge prior for action generation, rather than regressing absolute future world states. Specifically, 1) to construct the current world knowledge prior, we propose the Prior-Guided World Knowledge Extractor (PWKE). It extracts manipulable regions, spatial relations, and semantic cues from the visual input, guided by auxiliary heads and prior pseudo labels, thus reducing redundancy. 2) Building upon this, to represent how world knowledge evolves under actions, we introduce the Latent World Variation Quantization (LWVQ). It learns a discrete latent space via a VQ-VAE objective to encode world knowledge variations, shifting prediction from full modalities to a compact latent space. 3) Moreover, to mitigate interference during variation modeling, we design the Conditional Variation Attention (CV-Atten), which promotes disentangled learning and preserves the independence of knowledge representations. Extensive experiments on both simulated benchmarks and real-world robotic tasks demonstrate that $\Delta$VLA achieves state-of-the-art performance while improving efficiency. Code and real-world execution videos are available at https://github.com/JiuTian-VL/DeltaVLA.
Chinese Translation
近年来,视觉-语言-动作(VLA)模型通过统一感知、推理和控制,显著推动了机器人操作的发展。为了实现这种整合,近期研究采用了一种预测范式,该范式建模未来的视觉状态或世界知识以指导动作生成。然而,这些模型强调预测结果,而非推理变化的基本过程,这对于确定如何行动至关重要。为了解决这一问题,我们提出了ΔVLA,一种先验引导的框架,它相对于明确的当前世界知识先验建模世界知识的变化,以生成动作,而不是回归绝对的未来世界状态。具体而言,1)为了构建当前世界知识先验,我们提出了先验引导的世界知识提取器(PWKE)。它从视觉输入中提取可操作区域、空间关系和语义线索,受辅助头和先验伪标签的引导,从而减少冗余。2)在此基础上,为了表示世界知识在动作下的演变,我们引入了潜在世界变化量化(LWVQ)。它通过VQ-VAE目标学习一个离散潜在空间,以编码世界知识的变化,将预测从完整模态转移到紧凑的潜在表示。3)此外,为了减轻变化建模过程中的干扰,我们设计了条件变化注意力(CV-Atten),促进解耦学习并保持知识表示的独立性。在模拟基准和真实世界机器人任务上的大量实验表明,ΔVLA实现了最先进的性能,同时提高了效率。代码和真实世界执行视频可在https://github.com/JiuTian-VL/DeltaVLA获取。
cs.CV / 243 / 2603.08364

Diffusion-Based Data Augmentation for Image Recognition: A Systematic Analysis and Evaluation

基于扩散的数据增强在图像识别中的应用:系统分析与评估
Li, Zekun, Shi, Yinghuan, Gao, Yang, Xu, Dong
Abstract
Diffusion-based data augmentation (DiffDA) has emerged as a promising approach to improving classification performance under data scarcity. However, existing works vary significantly in task configurations, model choices, and experimental pipelines, making it difficult to fairly compare methods or assess their effectiveness across different scenarios. Moreover, there remains a lack of systematic understanding of the full DiffDA workflow. In this work, we introduce UniDiffDA, a unified analytical framework that decomposes DiffDA methods into three core components: model fine-tuning, sample generation, and sample utilization. This perspective enables us to identify key differences among existing methods and clarify the overall design space. Building on this framework, we develop a comprehensive and fair evaluation protocol, benchmarking representative DiffDA methods across diverse low-data classification tasks. Extensive experiments reveal the relative strengths and limitations of different DiffDA strategies and offer practical insights into method design and deployment. All methods are re-implemented within a unified codebase, with full release of code and configurations to ensure reproducibility and to facilitate future research.
Chinese Translation
基于扩散的数据增强(DiffDA)作为一种有前景的方法,已被提出用于在数据稀缺的情况下提高分类性能。然而,现有研究在任务配置、模型选择和实验流程上存在显著差异,这使得在不同场景下公平比较方法或评估其有效性变得困难。此外,对完整的DiffDA工作流程仍缺乏系统性的理解。在本研究中,我们引入了UniDiffDA,一个统一的分析框架,将DiffDA方法分解为三个核心组成部分:模型微调、样本生成和样本利用。这一视角使我们能够识别现有方法之间的关键差异,并明确整体设计空间。在此框架基础上,我们开发了一个全面且公平的评估协议,对代表性的DiffDA方法在多种低数据分类任务中进行基准测试。大量实验揭示了不同DiffDA策略的相对优势和局限性,并为方法设计和部署提供了实用的见解。所有方法均在统一的代码库中重新实现,并完整发布代码和配置,以确保可重复性并促进未来研究。
cs.CV / 244 / 2603.08374

This Looks Distinctly Like That: Grounding Interpretable Recognition in Stiefel Geometry against Neural Collapse

这看起来与那个截然不同:基于斯蒂费尔几何的可解释识别与神经崩溃的对抗
Jia, Junhao, Wang, Jiaqi, Liu, Yunyou, Jing, Haodong, Wu, Yueyi, Wu, Xian, Zheng, Yefeng
Abstract
Prototype networks provide an intrinsic case-based explanation mechanism, but their interpretability is often undermined by prototype collapse, where multiple prototypes degenerate to highly redundant evidence. We attribute this failure mode to the terminal dynamics of Neural Collapse, where cross-entropy optimization suppresses intra-class variance and drives class-conditional features toward a low-dimensional limit. To mitigate this, we propose Adaptive Manifold Prototypes (AMP), a framework that leverages Riemannian optimization on the Stiefel manifold to represent class prototypes as orthonormal bases and make rank-one prototype collapse infeasible by construction. AMP further learns class-specific effective rank via a proximal gradient update on a nonnegative capacity vector, and introduces spatial regularizers that reduce rotational ambiguity and encourage localized, non-overlapping part evidence. Extensive experiments on fine-grained benchmarks demonstrate that AMP achieves state-of-the-art classification accuracy while significantly improving causal faithfulness over prior interpretable models.
Chinese Translation
原型网络提供了一种内在的基于案例的解释机制,但其可解释性常常受到原型崩溃的影响,在这种情况下,多个原型退化为高度冗余的证据。我们将这种失效模式归因于神经崩溃的终端动态,其中交叉熵优化抑制了类内方差,并将类条件特征驱动向低维限制。为了解决这个问题,我们提出了自适应流形原型(Adaptive Manifold Prototypes, AMP),这是一个利用斯蒂费尔流形上的黎曼优化来表示类原型为正交基的框架,并通过构造使得秩为一的原型崩溃不可行。AMP进一步通过对非负容量向量的近端梯度更新来学习类特定的有效秩,并引入空间正则化器以减少旋转模糊性并鼓励局部、非重叠的部分证据。在细粒度基准测试上的广泛实验表明,AMP在分类准确性上达到了最先进的水平,同时显著提高了相较于之前可解释模型的因果可信度。
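A minimal sketch of the core geometric idea above: each class prototype is an orthonormal basis on the Stiefel manifold, classification scores are projection energies, and a QR retraction after each Euclidean gradient step keeps the basis orthonormal, so a rank-one (collapsed) prototype cannot occur. Shapes and the retraction choice are our assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

d, r, num_classes = 128, 4, 10
# Orthonormal bases U_c on the Stiefel manifold St(d, r), one per class.
U = torch.linalg.qr(torch.randn(num_classes, d, r)).Q
U.requires_grad_(True)

def class_scores(feats, bases):
    # Projection energy ||U_c^T x||^2; with r orthonormal columns per class,
    # degenerating to a single redundant direction is ruled out by construction.
    proj = torch.einsum('nd,cdr->ncr', feats, bases)
    return (proj ** 2).sum(dim=-1)                 # (N, num_classes)

feats = F.normalize(torch.randn(32, d), dim=-1)
labels = torch.randint(0, num_classes, (32,))
loss = F.cross_entropy(class_scores(feats, U), labels)
loss.backward()

with torch.no_grad():
    U -= 0.1 * U.grad                              # Euclidean gradient step ...
    U.copy_(torch.linalg.qr(U).Q)                  # ... then QR retraction to St(d, r)
    U.grad.zero_()
```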
cs.CV / 245 / 2603.08386

Real-Time Drone Detection in Event Cameras via Per-Pixel Frequency Analysis

基于每像素频率分析的事件相机实时无人机检测
Bezick, Michael, Sahin, Majid
Abstract
Detecting fast-moving objects, such as unmanned aerial vehicles (UAVs), from event camera data is challenging due to the sparse, asynchronous nature of the input. Traditional Discrete Fourier Transforms (DFTs) are effective at identifying periodic signals, such as spinning rotors, but they assume uniformly sampled data, which event cameras do not provide. We propose a novel per-pixel temporal analysis framework using the Non-uniform Discrete Fourier Transform (NDFT), which we call Drone Detection via Harmonic Fingerprinting (DDHF). Our method uses purely analytical techniques that identify the frequency signature of drone rotors, as characterized by frequency combs in their power spectra, enabling a tunable and generalizable algorithm that achieves accurate real-time localization of UAVs. We compare against a YOLO detector under equivalent conditions, demonstrating improvements in accuracy and latency across a difficult array of drone speeds, distances, and scenarios. DDHF achieves an average localization F1 score of 90.89% and an average latency of 2.39 ms per frame, while YOLO achieves an F1 score of 66.74% and requires 12.40 ms per frame. Through the use of purely analytic techniques, DDHF is quickly tuned on small data, easily interpretable, and achieves accuracies and latencies competitive with deep learning alternatives.
Chinese Translation
从事件相机数据中检测快速移动物体,如无人驾驶飞行器(UAV),由于输入的稀疏和异步特性而具有挑战性。传统的离散傅里叶变换(DFT)在识别周期性信号(如旋转桨叶)方面有效,但它们假设数据是均匀采样的,而事件相机并不提供这种数据。我们提出了一种新颖的每像素时间分析框架,使用非均匀离散傅里叶变换(NDFT),称为通过谐波指纹识别无人机(DDHF)。我们的方法采用纯分析技术,识别无人机旋翼的频率特征,通过其功率谱中的频率梳进行表征,从而实现了一种可调节且可推广的算法,能够准确地实时定位无人机。在相同条件下与YOLO检测器进行比较,展示了在不同无人机速度、距离和场景下的准确性和延迟的改善。DDHF的平均定位F1分数为90.89%,每帧平均延迟为2.39毫秒,而YOLO的F1分数为66.74%,每帧需要12.40毫秒。通过利用纯分析技术,DDHF在小数据集上快速调优,易于解释,并在准确性和延迟上与深度学习替代方案具有竞争力。
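As a rough illustration of the per-pixel frequency analysis above, the sketch below evaluates the non-uniform DFT power spectrum of raw event timestamps at one pixel; the synthetic rotor signal and all parameter values are illustrative, not taken from the DDHF code.

```python
import numpy as np

def ndft_power(timestamps, freqs):
    """Non-uniform DFT power: |sum_k exp(-2*pi*i*f*t_k)|^2 per frequency."""
    phase = -2j * np.pi * np.outer(freqs, timestamps)   # (F, K)
    return np.abs(np.exp(phase).sum(axis=1)) ** 2

rng = np.random.default_rng(0)
# Synthetic pixel: events at a 120 Hz blade-pass rate plus background noise.
rotor = np.arange(0, 0.5, 1 / 120) + rng.normal(0, 1e-4, 60)
t = np.sort(np.concatenate([rotor, rng.uniform(0, 0.5, 40)]))
freqs = np.linspace(10, 400, 391)
power = ndft_power(t, freqs)
# The spectrum shows a comb: strong peaks near 120 Hz and its harmonics.
print(freqs[power.argmax()])
```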
cs.CV / 246 / 2603.08387

AULLM++: Structural Reasoning with Large Language Models for Micro-Expression Recognition

AULLM++:基于大型语言模型的微表情识别结构推理
Liu, Zhishu, Yuan, Kaishen, Zhao, Bo, Ma, Hui, Yu, Zitong
Abstract
Micro-expression Action Unit (AU) detection identifies localized AUs from subtle facial muscle activations, providing a foundation for decoding affective cues. Previous methods face three key limitations: (1) heavy reliance on low-density visual information, rendering discriminative evidence vulnerable to background noise; (2) coarse-grained feature processing that misaligns with the demand for fine-grained representations; and (3) neglect of inter-AU correlations, restricting the parsing of complex expression patterns. We propose AULLM++, a reasoning-oriented framework leveraging Large Language Models (LLMs), which injects visual features into textual prompts as actionable semantic premises to guide inference. It formulates AU prediction into three stages: evidence construction, structure modeling, and deduction-based prediction. Specifically, a Multi-Granularity Evidence-Enhanced Fusion Projector (MGE-EFP) fuses mid-level texture cues with high-level semantics, distilling them into a compact Content Token (CT). Furthermore, inspired by micro- and macro-expression AU correspondence, we encode AU relationships as a sparse structural prior and learn interaction strengths via a Relation-Aware AU Graph Neural Network (R-AUGNN), producing an Instruction Token (IT). We then fuse CT and IT into a structured textual prompt and introduce Counterfactual Consistency Regularization (CCR) to construct counterfactual samples, enhancing the model's generalization. Extensive experiments demonstrate AULLM++ achieves state-of-the-art performance on standard benchmarks and exhibits superior cross-domain generalization.
Chinese Translation
微表情动作单元(AU)检测通过识别细微的面部肌肉激活来识别局部AU,为解码情感线索提供基础。以往的方法面临三个主要限制:(1)过度依赖低密度视觉信息,使得判别证据易受背景噪声影响;(2)粗粒度特征处理与对细粒度表示的需求不匹配;(3)忽视AU之间的相关性,限制了复杂表情模式的解析。我们提出了AULLM++,一个利用大型语言模型(LLMs)的推理导向框架,将视觉特征注入文本提示中,作为可操作的语义前提以指导推理。该框架将AU预测分为三个阶段:证据构建、结构建模和基于推理的预测。具体而言,一个多粒度证据增强融合投影器(MGE-EFP)将中层纹理线索与高层语义融合,提炼成一个紧凑的内容标记(CT)。此外,受到微表情与宏表情AU对应关系的启发,我们将AU关系编码为稀疏结构先验,并通过关系感知AU图神经网络(R-AUGNN)学习交互强度,生成指令标记(IT)。然后,我们将CT和IT融合为结构化文本提示,并引入反事实一致性正则化(CCR)以构建反事实样本,增强模型的泛化能力。大量实验表明,AULLM++在标准基准上实现了最先进的性能,并展现出优越的跨领域泛化能力。
cs.CV / 247 / 2603.08403

SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents

SPIRAL:一个通过反思规划代理自我改进的闭环框架,用于动作世界模型
Yang, Yu, Liao, Yue, Mei, Jianbiao, Wang, Baisen, Yang, Xuemeng, Wen, Licheng, Zhang, Jiangning, Li, Xiangtai, Chen, Hanlin, Shi, Botian, Liu, Yong, Yan, Shuicheng, Lee, Gim Hee
Abstract
We introduce SPIRAL, a self-improving planning and iterative reflective action world modeling closed-loop framework that enables controllable long-horizon video generation conditioned on high-level semantic actions. Existing one-shot video generation models operate in open-loop, often resulting in incomplete action execution, weak semantic grounding, and temporal drift. SPIRAL formulates ActWM as a closed-loop think-act-reflect process, where generation proceeds step by step under explicit planning and feedback. A PlanAgent decomposes abstract actions into object-centric sub-actions, while a CriticAgent evaluates intermediate results and guides iterative refinement with long-horizon memory. This closed-loop design naturally supports RL evolving optimization, improving semantic alignment and temporal consistency over extended horizons. We further introduce the ActWM-Dataset and ActWM-Bench for training and evaluation. Experiments across multiple TI2V backbones demonstrate consistent gains on ActWM-Bench and mainstream video generation benchmarks, validating SPIRAL's effectiveness.
Chinese Translation
我们介绍了SPIRAL,这是一个自我改进的规划和迭代反思动作世界建模的闭环框架,能够基于高层语义动作生成可控的长时间视频。现有的一次性视频生成模型在开放循环中运行,常常导致动作执行不完整、语义基础薄弱和时间漂移。SPIRAL将ActWM公式化为一个闭环的思考-行动-反思过程,其中生成在明确的规划和反馈下逐步进行。一个PlanAgent将抽象动作分解为以对象为中心的子动作,而CriticAgent则评估中间结果并通过长时间记忆指导迭代优化。这个闭环设计自然支持强化学习(RL)演变优化,在扩展的时间范围内改善语义对齐和时间一致性。我们进一步引入了ActWM-Dataset和ActWM-Bench用于训练和评估。在多个TI2V骨干网络上的实验表明,在ActWM-Bench和主流视频生成基准上均取得了一致的提升,验证了SPIRAL的有效性。
cs.CV / 248 / 2603.08434

Information Maximization for Long-Tailed Semi-Supervised Domain Generalization

长尾半监督领域泛化的信息最大化
Fillioux, Leo, Chakraborty, Omprakash, Gopée, Quentin, Marza, Pierre, Cournède, Paul-Henry, Christodoulidis, Stergios, Vakalopoulou, Maria, Ayed, Ismail Ben, Dolz, Jose
Abstract
Semi-supervised domain generalization (SSDG) has recently emerged as an appealing alternative to tackle domain generalization when labeled data is scarce but unlabeled samples across domains are abundant. In this work, we identify an important limitation that hampers the deployment of state-of-the-art methods on more challenging but practical scenarios. In particular, state-of-the-art SSDG severely suffers in the presence of long-tailed class distributions, an arguably common situation in real-world settings. To alleviate this limitation, we propose IMaX, a simple yet effective objective based on the well-known InfoMax principle adapted to the SSDG scenario, where the Mutual Information (MI) between the learned features and latent labels is maximized, constrained by the supervision from the labeled samples. Our formulation integrates an $\alpha$-entropic objective, which mitigates the class-balance bias encoded in the standard marginal entropy term of the MI, thereby better handling arbitrary class distributions. IMaX can be seamlessly plugged into recent state-of-the-art SSDG methods, consistently enhancing their performance, as demonstrated empirically across two different image modalities.
Chinese Translation
半监督领域泛化(SSDG)最近作为一种有吸引力的替代方案出现,以应对在标记数据稀缺但跨领域未标记样本丰富的情况下的领域泛化问题。在本研究中,我们识别出一个重要的限制,这一限制妨碍了最先进方法在更具挑战性但实际的场景中的应用。特别是,最先进的SSDG在存在长尾类别分布时严重受挫,这在现实世界中是一个可以说是常见的情况。为了缓解这一限制,我们提出了IMaX,这是一种简单而有效的目标,基于著名的信息最大化(InfoMax)原则,适应于SSDG场景,其中学习到的特征与潜在标签之间的互信息(Mutual Information, MI)被最大化,同时受到标记样本的监督约束。我们的公式整合了一个$\alpha$-熵目标,这减轻了标准边际熵项中编码的类别平衡偏差,从而更好地处理任意类别分布。IMaX可以无缝地嵌入到最近的最先进SSDG方法中,持续提升其性能,正如在两种不同图像模态中所实证证明的那样。
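A hedged sketch of an InfoMax-style objective of the kind described above: the mutual information between features and latent labels is approximated as marginal entropy minus mean conditional entropy, with a Rényi $\alpha$-entropy for the marginal term and a cross-entropy constraint on the labeled samples. The exact entropy family and weighting IMaX uses may differ.

```python
import torch
import torch.nn.functional as F

def renyi_entropy(p, alpha=2.0, eps=1e-8):
    # Renyi alpha-entropy; alpha != 1 softens the uniform-marginal bias of the
    # Shannon term, which is the class-balance bias the abstract targets.
    return torch.log((p.clamp_min(eps) ** alpha).sum()) / (1.0 - alpha)

def imax_style_loss(logits_unlabeled, logits_labeled, labels, alpha=2.0, lam=1.0):
    p = logits_unlabeled.softmax(dim=-1)
    h_marginal = renyi_entropy(p.mean(dim=0), alpha)              # entropy of batch marginal
    h_conditional = -(p * p.clamp_min(1e-8).log()).sum(-1).mean() # mean per-sample entropy
    mi = h_marginal - h_conditional                               # MI surrogate to maximize
    ce = F.cross_entropy(logits_labeled, labels)                  # labeled-sample constraint
    return ce - lam * mi

loss = imax_style_loss(torch.randn(64, 7), torch.randn(8, 7),
                       torch.randint(0, 7, (8,)))
```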
cs.CV / 249 / 2603.08436

Can Vision-Language Models Solve the Shell Game?

视觉-语言模型能否解决壳游戏?
Liu, Tiedong, Lee, Wee Sun
Abstract
Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io .
Chinese Translation
视觉实体跟踪是人类的一种先天认知能力,但对于视觉-语言模型(VLMs)来说,这仍然是一个关键瓶颈。这一缺陷在现有的视频基准测试中常常被视觉捷径所掩盖。我们引入了VET-Bench,这是一个合成的诊断测试平台,特点是视觉上相同的对象,必须仅通过时空连续性进行跟踪。我们的实验表明,当前最先进的VLM在VET-Bench上的表现接近随机水平,暴露出一个根本性的局限性:过度依赖静态帧级特征,未能在时间上保持实体表示。我们提供了理论分析,联系到状态跟踪问题,证明固定深度的基于变换器的VLM在没有中间监督的情况下,因表达能力的限制而在跟踪不可区分的对象时存在根本性局限。为了解决这一问题,我们提出了时空基础的思维链(Spatiotemporal Grounded Chain-of-Thought,SGCoT):将对象轨迹生成作为显式的中间状态。利用Molmo2的对象跟踪能力,我们通过在合成的仅文本数据上进行微调来引导SGCoT推理。我们的方法在VET-Bench上实现了超过90%的最先进准确率,证明VLM能够在没有外部工具的情况下,可靠地端到端解决视频壳游戏任务。我们的代码和数据可在https://vetbench.github.io获取。
cs.CV / 250 / 2603.08445

Alfa: Attentive Low-Rank Filter Adaptation for Structure-Aware Cross-Domain Personalized Gaze Estimation

Alfa:用于结构感知跨域个性化注视估计的注意力低秩滤波器适应
Hsieh, He-Yen, Ting, Wei-Te Mark, Kung, H. T.
Abstract
Pre-trained gaze models learn to identify useful patterns commonly found across users, but subtle user-specific variations (i.e., eyelid shape or facial structure) can degrade model performance. Test-time personalization (TTP) adapts pre-trained models to these user-specific domain shifts using only a few unlabeled samples. Efficient fine-tuning is critical in performing this domain adaptation: data and computation resources can be limited-especially for on-device customization. While popular parameter-efficient fine-tuning (PEFT) methods address adaptation costs by updating only a small set of weights, they may not be taking full advantage of structures encoded in pre-trained filters. To more effectively leverage existing structures learned during pre-training, we reframe personalization as a process to reweight existing features rather than learning entirely new ones. We present Attentive Low-Rank Filter Adaptation (Alfa) to adapt gaze models by reweighting semantic patterns in pre-trained filters. With Alfa, singular value decomposition (SVD) extracts dominant spatial components that capture eye and facial characteristics across users. Via an attention mechanism, we need only a few unlabeled samples to adjust and reweight pre-trained structures, selectively amplifying those relevant to a target user. Alfa achieves the lowest average gaze errors across four cross-dataset gaze benchmarks, outperforming existing TTP methods and low-rank adaptation (LoRA)-based variants. We also show that Alfa's attentive low-rank methods can be applied to applications beyond vision, such as diffusion-based language models.
Chinese Translation
预训练的注视模型学习识别用户之间普遍存在的有用模式,但细微的用户特定变化(例如,眼睑形状或面部结构)可能会降低模型性能。测试时个性化(TTP)利用少量未标记样本将预训练模型适应于这些用户特定的领域转变。高效的微调在执行这种领域适应时至关重要:数据和计算资源可能有限,尤其是在设备上的定制。虽然流行的参数高效微调(PEFT)方法通过仅更新少量权重来解决适应成本,但它们可能未能充分利用预训练滤波器中编码的结构。为了更有效地利用预训练期间学习到的现有结构,我们将个性化重新构建为重新加权现有特征的过程,而不是学习全新的特征。我们提出了注意力低秩滤波器适应(Alfa),通过重新加权预训练滤波器中的语义模式来适应注视模型。通过奇异值分解(SVD),Alfa 提取捕捉用户之间眼睛和面部特征的主导空间分量。通过注意力机制,我们只需少量未标记样本即可调整和重新加权预训练结构,选择性地放大与目标用户相关的特征。Alfa 在四个跨数据集注视基准测试中实现了最低的平均注视误差,超越了现有的 TTP 方法和基于低秩适应(LoRA)的变体。我们还展示了 Alfa 的注意力低秩方法可以应用于超越视觉的应用,如基于扩散的语言模型。
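To illustrate the reweighting idea above: decompose a pre-trained convolutional filter bank with SVD, then adapt by rescaling its dominant singular components instead of learning new weights. The attention that derives the gates from a few unlabeled samples is collapsed into a single learnable vector here; this is our simplification, not Alfa's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(64, 128, 3, padding=1)    # stands in for a pre-trained layer
W = conv.weight.data                        # (128, 64, 3, 3)
U, S, Vh = torch.linalg.svd(W.flatten(1), full_matrices=False)

rank = 16                                   # dominant spatial components to adapt
gates = nn.Parameter(torch.ones(rank))      # in Alfa, produced by attention over samples

def adapted_weight():
    s_new = torch.cat([S[:rank] * gates, S[rank:]])   # reweight, don't relearn
    return ((U * s_new) @ Vh).view_as(W)              # reassembled filter bank

x = torch.randn(2, 64, 32, 32)
y = F.conv2d(x, adapted_weight(), conv.bias, padding=1)
```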
cs.CV / 251 / 2603.08483

X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection

X-AVDT:用于鲁棒深度伪造检测的音视频交叉注意力
Kim, Youngseo, Yun, Kwan, Hong, Seokhyeon, Cha, Sihun, Koo, Colette Suhjung, Noh, Junyong
Abstract
The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio-visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful cross-generator evaluation, we further introduce MMDF, a new multimodal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow-matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods with accuracy improved by 13.1%. Our findings highlight the importance of leveraging internal audio-visual consistency cues for robustness to future generators in deepfake detection.
Chinese Translation
当代生成系统产生的高度逼真的合成视频激增,显著增加了恶意使用的风险,给人类和现有检测器带来了挑战。在此背景下,我们从生成器的角度出发,观察到这些模型中的内部交叉注意力机制编码了细粒度的语音-运动对齐,为伪造检测提供了有用的对应线索。基于这一见解,我们提出了X-AVDT,一种鲁棒且具有良好泛化能力的深度伪造检测器,它通过DDIM反演探测生成器内部的音视频信号,以揭示这些线索。X-AVDT提取了两种互补信号:(i) 捕捉反演引起的差异的视频合成,以及(ii) 反映生成过程中强制执行的模态对齐的音视频交叉注意力特征。为了实现跨生成器的真实评估,我们进一步引入了MMDF,一个新的多模态深度伪造数据集,涵盖了多种操控类型和快速发展的合成范式,包括GAN、扩散和流匹配。大量实验表明,X-AVDT在MMDF上实现了领先的性能,并在外部基准和未见过的生成器上表现出强大的泛化能力,准确率提高了13.1%,超越了现有方法。我们的研究结果强调了利用内部音视频一致性线索在深度伪造检测中对未来生成器的鲁棒性的重要性。
cs.CV / 252 / 2603.08486

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

视觉自我实现对齐:通过威胁相关图像塑造安全导向的人物角色
Yang, Qishun, Yang, Shu, Hu, Lijie, Wang, Di
Abstract
Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the Self-Fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLMs alignment.
Chinese Translation
多模态大型语言模型(MLLMs)面临安全不对齐的问题,即视觉输入可能导致有害输出。为了解决这一问题,现有方法需要明确的安全标签或对比数据;然而,威胁相关概念是具体且可视化的,而安全概念(如有用性)则是抽象的,缺乏视觉参照。受涌现性错位背后的自我实现(Self-Fulfilling)机制的启发,我们提出了视觉自我实现对齐(Visual Self-Fulfilling Alignment, VSFA)。VSFA在围绕威胁相关图像构建的中性视觉问答(VQA)任务上微调视觉语言模型(VLMs),而无需任何安全标签。通过反复接触威胁相关的视觉内容,模型内化警惕和谨慎的隐含语义,从而塑造安全导向的人物角色。在多个VLM和安全基准上的实验表明,VSFA降低了攻击成功率,提高了响应质量,并在保持一般能力的同时减轻了过度拒绝。我们的工作将自我实现机制从文本扩展到视觉模态,提供了一种无标签的VLM对齐方法。
cs.CV / 253 / 2603.08491

Global Cross-Modal Geo-Localization: A Million-Scale Dataset and a Physical Consistency Learning Framework

全球跨模态地理定位:百万规模数据集及物理一致性学习框架
Hu, Yutong, Chen, Jinhui, Xu, Chaoqiang, Kou, Yuan, Zhou, Sili, Yan, Shaocheng, Shi, Pengcheng, Hu, Qingwu, Li, Jiayuan
Abstract
Cross-modal Geo-localization (CMGL) matches ground-level text descriptions with geo-tagged aerial imagery, which is crucial for pedestrian navigation and emergency response. However, existing research is constrained by narrow geographic coverage and simplistic scene diversity, failing to reflect the immense spatial heterogeneity of global architectural styles and topographic features. To bridge this gap and facilitate universal positioning, we introduce CORE, the first million-scale dataset dedicated to global CMGL. CORE comprises 1,034,786 cross-view images sampled from 225 distinct geographic regions across all continents, offering an unprecedented variety of perspectives in varying environmental conditions and urban layouts. We leverage the zero-shot reasoning of Large Vision-Language Models (LVLMs) to synthesize high-quality scene descriptions rich in discriminative cues. Furthermore, we propose a physical-law-aware network (PLANET) for cross-modal geo-localization. PLANET introduces a novel contrastive learning paradigm to guide textual representations in capturing the intrinsic physical signatures of satellite imagery. Extensive experiments across varied geographic regions demonstrate that PLANET significantly outperforms state-of-the-art methods, establishing a new benchmark for robust, global-scale geo-localization. The dataset and source code will be released at https://github.com/YtH0823/CORE.
Chinese Translation
跨模态地理定位(CMGL)将地面文本描述与地理标记的航空图像相匹配,这对行人导航和紧急响应至关重要。然而,现有研究受到狭窄地理覆盖和简单场景多样性的限制,未能反映全球建筑风格和地形特征的巨大空间异质性。为弥补这一差距并促进普遍定位,我们引入了CORE,这是第一个专注于全球CMGL的百万规模数据集。CORE包含1,034,786张跨视角图像,采样自所有大洲的225个不同地理区域,提供了在不同环境条件和城市布局下前所未有的视角多样性。我们利用大型视觉-语言模型(LVLMs)的零样本推理能力,合成富含区分性线索的高质量场景描述。此外,我们提出了一种关注物理法则的网络(PLANET)用于跨模态地理定位。PLANET引入了一种新颖的对比学习范式,以指导文本表示捕捉卫星图像的内在物理特征。针对不同地理区域的广泛实验表明,PLANET在性能上显著优于现有最先进的方法,为稳健的全球规模地理定位建立了新的基准。数据集和源代码将发布在 https://github.com/YtH0823/CORE。
cs.CV / 254 / 2603.08497

Reading $\neq$ Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models

阅读 $\neq$ 视觉:诊断与弥合视觉-语言模型中的排版差距
Zhou, Heng, Yu, Ao, Kang, Li, Fan, Yuchen, Fan, Yutao, Song, Xiufeng, Geng, Hejia, Qin, Yiran
Abstract
Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling. LoRA fine-tuning on a small set of synthetic samples substantially improves an open-source model, narrowing the gap to the best closed-source system and surpassing it on font size recognition. Font style alone remains resistant to fine-tuning, suggesting that relational visual reasoning may require architectural innovation beyond current patch-based encoders. We release our evaluation framework, data, and fine-tuning recipe to support progress in closing the typographic gap in vision-language understanding.
Chinese Translation
视觉-语言模型在读取图像中的文本时达到了近乎完美的准确性,但在排版方面却表现出明显的盲点:能够识别文本的内容,但无法识别其外观。我们通过评估26种字体、四种书写系统和三种难度级别,系统地研究了这一差距。对15个最先进的视觉-语言模型(VLM)的评估揭示了一个显著的感知层次:颜色识别几乎完美,但字体风格检测普遍较差。我们进一步发现,模型规模无法预测性能,且在不同难度级别上的准确性保持一致,这表明问题在于训练数据的缺失,而非能力的上限。对一小组合成样本进行LoRA微调显著改善了一个开源模型,缩小了与最佳闭源系统的差距,并在字体大小识别上超越了它。字体风格单独仍然对微调具有抵抗力,这表明关系视觉推理可能需要超越当前基于补丁的编码器的架构创新。我们发布了我们的评估框架、数据和微调配方,以支持弥合视觉-语言理解中的排版差距的进展。
cs.CV / 255 / 2603.08498

All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference

所有车辆都可能撒谎:通过伪随机贝叶斯推断实现完全不可信车辆协作感知的高效对抗防御
Yu, Yi, Wu, Libing, Zhang, Zhuangzhuang, Qiu, Jing, Huo, Lijuan, Feng, Jiaqi
Abstract
Collaborative perception (CP) enables multiple vehicles to augment their individual perception capacities through the exchange of feature-level sensory data. However, this fusion mechanism is inherently vulnerable to adversarial attacks, especially in fully untrusted-vehicle environments. Existing defense approaches often assume a trusted ego vehicle as a reference or incorporate additional binary classifiers. These assumptions limit their practicality in real-world deployments due to the questionable trustworthiness of ego vehicles, the requirement for real-time detection, and the need for generalizability across diverse scenarios. To address these challenges, we propose a novel Pseudo-Random Bayesian Inference (PRBI) framework, a first efficient defense method tailored for fully untrusted-vehicle CP. PRBI detects adversarial behavior by leveraging temporal perceptual discrepancies, using the reliable perception from the preceding frame as a dynamic reference. Additionally, it employs a pseudo-random grouping strategy that requires only two verifications per frame, while applying Bayesian inference to estimate both the number and identities of malicious vehicles. Theoretical analysis has proven the convergence and stability of the proposed PRBI framework. Extensive experiments show that PRBI requires only 2.5 verifications per frame on average, outperforming existing methods significantly, and restores detection precision to between 79.4% and 86.9% of pre-attack levels.
Chinese Translation
协作感知(Collaborative Perception, CP)使多个车辆能够通过交换特征级传感器数据来增强各自的感知能力。然而,这种融合机制在完全不可信车辆环境中本质上容易受到对抗攻击。现有的防御方法通常假设存在一个可信的自我车辆作为参考,或结合额外的二元分类器。这些假设由于自我车辆的可信度存疑、实时检测的需求以及在多样化场景中的可推广性限制了其在现实世界中的实用性。为了解决这些挑战,我们提出了一种新颖的伪随机贝叶斯推断(Pseudo-Random Bayesian Inference, PRBI)框架,这是首个针对完全不可信车辆协作感知的高效防御方法。PRBI通过利用时间感知差异来检测对抗行为,使用前一帧的可靠感知作为动态参考。此外,它采用了一种伪随机分组策略,每帧仅需进行两次验证,同时应用贝叶斯推断来估计恶意车辆的数量和身份。理论分析已证明所提PRBI框架的收敛性和稳定性。大量实验表明,PRBI平均每帧仅需2.5次验证,显著优于现有方法,并将检测精度恢复到攻击前水平的79.4%至86.9%之间。
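A simplified sketch of the inference loop above: per-vehicle beliefs of being malicious are updated from two pseudo-randomly grouped verifications per frame, assuming a verification flags a group exactly when it contains a malicious member. PRBI's full model additionally handles imperfect verification and jointly estimates the malicious count; this toy version only conveys the Bayesian update.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10
belief = np.full(n, 0.2)                              # prior P(vehicle i is malicious)
truth = np.zeros(n, dtype=bool); truth[[3, 8]] = True # unknown to the defender

for frame in range(20):
    groups = np.array_split(rng.permutation(n), 2)    # pseudo-random grouping
    for g in groups:                                  # two verifications per frame
        flagged = truth[g].any()                      # idealized, noise-free check
        p_flag = 1.0 - np.prod(1.0 - belief[g])
        if flagged:
            belief[g] = belief[g] / max(p_flag, 1e-12)  # Bayes: P(mal|flag) = p_i / P(flag)
        else:
            belief[g] = 0.0                             # a clean check exonerates the group

print(np.flatnonzero(belief > 0.5))   # should converge to the true set {3, 8}
```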
cs.CV / 256 / 2603.08499

Improving Continual Learning for Gaussian Splatting based Environments Reconstruction on Commercial Off-the-Shelf Edge Devices

基于高斯喷溅的环境重建的持续学习改进:在商业现成边缘设备上的应用
Zaino, Ivan, Risso, Matteo, Pagliari, Daniele Jahier, de Prado, Miguel, Van de Maele, Toon, Burrello, Alessio
Abstract
Novel view synthesis (NVS) is increasingly relevant for edge robotics, where compact and incrementally updatable 3D scene models are needed for SLAM, navigation, and inspection under tight memory and latency budgets. Variational Bayesian Gaussian Splatting (VBGS) enables replay-free continual updates for the 3DGS algorithm by maintaining a probabilistic scene model, but its high-precision computations and large intermediate tensors make on-device training impractical. We present a precision-adaptive optimization framework that enables VBGS training on resource-constrained hardware without altering its variational formulation. We (i) profile VBGS to identify memory/latency hotspots, (ii) fuse memory-dominant kernels to reduce materialized intermediate tensors, and (iii) automatically assign operation-level precisions via a mixed-precision search with bounded relative error. Across the Blender, Habitat, and Replica datasets, our optimised pipeline reduces peak memory from 9.44 GB to 1.11 GB and training time from ~234 min to ~61 min on an A5000 GPU, while preserving (and in some cases improving) reconstruction quality of the state-of-the-art VBGS baseline. We also enable for the first time NVS training on a commercial embedded platform, the Jetson Orin Nano, reducing per-frame latency by 19x compared to 3DGS.
Chinese Translation
新颖视图合成(NVS)在边缘机器人中越来越重要,因为在严格的内存和延迟预算下,需要紧凑且可增量更新的三维场景模型用于SLAM、导航和检查。变分贝叶斯高斯喷溅(VBGS)通过维护一个概率场景模型,实现了3DGS算法的无重放持续更新,但其高精度计算和大型中间张量使得在设备上训练变得不切实际。我们提出了一种精度自适应优化框架,使得在资源受限的硬件上进行VBGS训练成为可能,而无需改变其变分公式。我们(i) 对VBGS进行分析,以识别内存/延迟热点,(ii) 融合内存主导的内核以减少物化的中间张量,(iii) 通过带有界相对误差的混合精度搜索自动分配操作级别的精度。在Blender、Habitat和Replica数据集上,我们优化的管道将峰值内存从9.44 GB降低到1.11 GB,将训练时间从约234分钟缩短到约61分钟,在A5000 GPU上,同时保持(在某些情况下改善)最先进的VBGS基线的重建质量。我们还首次在商业嵌入式平台Jetson Orin Nano上实现了NVS训练,将每帧的延迟减少了19倍,相较于3DGS。
cs.CV / 257 / 2603.08503

Spherical-GOF: Geometry-Aware Panoramic Gaussian Opacity Fields for 3D Scene Reconstruction

球面高斯透明度场:用于3D场景重建的几何感知全景高斯透明度场
Yang, Zhe, Zhao, Guoqiang, Wu, Sheng, Luo, Kai, Yang, Kailun
Abstract
Omnidirectional images are increasingly used in robotics and vision due to their wide field of view. However, extending 3D Gaussian Splatting (3DGS) to panoramic camera models remains challenging, as existing formulations are designed for perspective projections and naive adaptations often introduce distortion and geometric inconsistencies. We present Spherical-GOF, an omnidirectional Gaussian rendering framework built upon Gaussian Opacity Fields (GOF). Unlike projection-based rasterization, Spherical-GOF performs GOF ray sampling directly on the unit sphere in spherical ray space, enabling consistent ray-Gaussian interactions for panoramic rendering. To make the spherical ray casting efficient and robust, we derive a conservative spherical bounding rule for fast ray-Gaussian culling and introduce a spherical filtering scheme that adapts Gaussian footprints to distortion-varying panoramic pixel sampling. Extensive experiments on standard panoramic benchmarks (OmniBlender and OmniPhotos) demonstrate competitive photometric quality and substantially improved geometric consistency. Compared with the strongest baseline, Spherical-GOF reduces depth reprojection error by 57% and improves cycle inlier ratio by 21%. Qualitative results show cleaner depth and more coherent normal maps, with strong robustness to global panorama rotations. We further validate generalization on OmniRob, a real-world robotic omnidirectional dataset introduced in this work, featuring UAV and quadruped platforms. The source code and the OmniRob dataset will be released at https://github.com/1170632760/Spherical-GOF.
Chinese Translation
全向图像因其广阔的视场而在机器人技术和视觉领域中越来越多地被使用。然而,将3D高斯点云(3DGS)扩展到全景相机模型仍然具有挑战性,因为现有的公式是为透视投影设计的,简单的适配往往会引入失真和几何不一致性。我们提出了球面高斯透明度场(Spherical-GOF),这是一个基于高斯透明度场(GOF)的全向高斯渲染框架。与基于投影的光栅化不同,Spherical-GOF直接在球面射线空间的单位球上执行GOF射线采样,从而实现全景渲染中一致的射线-高斯交互。为了使球面射线投射高效且稳健,我们推导了一种保守的球面边界规则,以快速进行射线-高斯剔除,并引入了一种球面过滤方案,使高斯足迹适应失真变化的全景像素采样。在标准全景基准(OmniBlender和OmniPhotos)上的大量实验表明,Spherical-GOF在光度质量上具有竞争力,并显著改善了几何一致性。与最强基线相比,Spherical-GOF将深度重投影误差降低了57%,并提高了循环内点比率21%。定性结果显示出更干净的深度和更连贯的法线图,并对全局全景旋转具有强大的鲁棒性。我们进一步在OmniRob上验证了泛化能力,这是一个在本工作中引入的真实世界机器人全向数据集,包含无人机和四足平台。源代码和OmniRob数据集将发布在https://github.com/1170632760/Spherical-GOF。
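As a small illustration of casting rays directly in spherical ray space, the sketch below maps equirectangular panorama pixels to unit directions on the sphere, the domain on which GOF ray sampling would then operate; this is our simplification of the setup, not the paper's renderer.

```python
import numpy as np

def panorama_ray_dirs(height, width):
    """Unit ray direction for every equirectangular pixel center."""
    v, u = np.meshgrid(np.arange(height) + 0.5, np.arange(width) + 0.5, indexing='ij')
    lon = (u / width) * 2.0 * np.pi - np.pi       # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (v / height) * np.pi      # latitude from +pi/2 down to -pi/2
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)        # (H, W, 3) unit vectors

rays = panorama_ray_dirs(512, 1024)
assert np.allclose(np.linalg.norm(rays, axis=-1), 1.0)
```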
cs.CV / 258 / 2603.08514

Beyond Hungarian: Match-Free Supervision for End-to-End Object Detection

超越匈牙利算法:无匹配监督的端到端目标检测
Qiu, Shoumeng, Li, Xinrun, Long, Yang
Abstract
Recent DEtection TRansformer (DETR) based frameworks have achieved remarkable success in end-to-end object detection. However, the reliance on the Hungarian algorithm for bipartite matching between queries and ground truths introduces computational overhead and complicates the training dynamics. In this paper, we propose a novel matching-free training scheme for DETR-based detectors that eliminates the need for explicit heuristic matching. At the core of our approach is a dedicated Cross-Attention-based Query Selection (CAQS) module. Instead of discrete assignment, we utilize encoded ground-truth information to probe the decoder queries through a cross-attention mechanism. By minimizing the weighted error between the queried results and the ground truths, the model autonomously learns the implicit correspondences between object queries and specific targets. This learned relationship further provides supervision signals for the learning of queries. Experimental results demonstrate that our proposed method bypasses the traditional matching process, significantly enhancing training efficiency, reducing the matching latency by over 50%, effectively eliminating the discrete matching bottleneck through differentiable correspondence learning, and also achieving superior performance compared to existing state-of-the-art methods.
Chinese Translation
最近基于DEtection TRansformer (DETR)的框架在端到端目标检测中取得了显著成功。然而,依赖匈牙利算法进行查询与真实目标之间的二分匹配引入了计算开销,并使训练动态变得复杂。在本文中,我们提出了一种新颖的无匹配训练方案,适用于基于DETR的检测器,消除了对显式启发式匹配的需求。我们方法的核心是一个专门的基于交叉注意力的查询选择模块(Cross-Attention-based Query Selection, CAQS)。我们利用编码的真实目标信息,通过交叉注意力机制探测解码器查询,而不是进行离散分配。通过最小化查询结果与真实目标之间的加权误差,模型自主学习对象查询与特定目标之间的隐式对应关系。这种学习到的关系进一步为查询的学习提供了监督信号。实验结果表明,我们提出的方法绕过了传统的匹配过程,显著提高了训练效率,将匹配延迟减少了超过50%,通过可微分的对应学习有效消除了离散匹配瓶颈,并且在性能上优于现有的最先进方法。
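A hedged sketch of the probing mechanism above: encoded ground-truth targets attend over decoder queries via cross-attention and the readout is regressed back to the targets, so query-target correspondences emerge without Hungarian matching. Module names, dimensions, and the plain L1 loss are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GTQueryProbe(nn.Module):
    def __init__(self, dim=256, num_classes=80):
        super().__init__()
        self.embed_gt = nn.Linear(4 + num_classes, dim)   # box + one-hot class
        self.attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.readout = nn.Linear(dim, 4)

    def forward(self, queries, gt_boxes, gt_onehot):
        gt = self.embed_gt(torch.cat([gt_boxes, gt_onehot], dim=-1))
        probed, _ = self.attn(gt, queries, queries)        # GT probes the queries
        return self.readout(probed)                        # (B, num_gt, 4)

probe = GTQueryProbe()
queries = torch.randn(2, 300, 256)                         # decoder output queries
gt_boxes = torch.rand(2, 5, 4)
gt_onehot = torch.eye(80)[torch.randint(0, 80, (2, 5))]
# Minimizing this error teaches implicit query-target correspondences,
# supervising the queries without any discrete matching step.
loss = F.l1_loss(probe(queries, gt_boxes, gt_onehot), gt_boxes)
```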
cs.CV / 259 / 2603.08521

OccTrack360: 4D Panoptic Occupancy Tracking from Surround-View Fisheye Cameras

OccTrack360:来自全景鱼眼相机的4D全景占用跟踪
Lin, Yongzhi, Luo, Kai, Zheng, Yuanfan, Shi, Hao, Duan, Mengfei, Liu, Yang, Yang, Kailun
Abstract
Understanding dynamic 3D environments in a spatially continuous and temporally consistent manner is fundamental for robotics and autonomous driving. While recent advances in occupancy prediction provide a unified representation of scene geometry and semantics, progress in 4D panoptic occupancy tracking remains limited by the lack of benchmarks that support surround-view fisheye sensing, long temporal sequences, and instance-level voxel tracking. To address this gap, we present OccTrack360, a new benchmark for 4D panoptic occupancy tracking from surround-view fisheye cameras. OccTrack360 provides substantially longer and more diverse sequences (174 to 2234 frames) than prior benchmarks, together with principled voxel visibility annotations, including an all-direction occlusion mask and an MEI-based fisheye field-of-view mask. To establish a strong fisheye-oriented baseline, we further propose Focus on Sphere Occ (FoSOcc), a framework that addresses two core challenges in fisheye occupancy tracking: distorted spherical projection and inaccurate voxel-space localization. FoSOcc includes a Center Focusing Module (CFM) to enhance instance-aware spatial localization through supervised focus guidance, and a Spherical Lift Module (SLM) that extends perspective lifting to fisheye imaging under the Unified Projection Model. Extensive experiments on Occ3D-Waymo and OccTrack360 show that our method improves occupancy tracking quality with notable gains on geometrically regular categories, and establishes a strong baseline for future research on surround-view fisheye 4D occupancy tracking. The benchmark and source code will be made publicly available at https://github.com/YouthZest-Lin/OccTrack360.
Chinese Translation
以空间连续和时间一致的方式理解动态3D环境是机器人技术和自动驾驶的基础。尽管最近在占用预测方面的进展提供了场景几何和语义的统一表示,但由于缺乏支持全景鱼眼传感、长时间序列和实例级体素跟踪的基准,4D全景占用跟踪的进展仍然有限。为了解决这一问题,我们提出了OccTrack360,这是一个新的基准,用于从全景鱼眼相机进行4D全景占用跟踪。OccTrack360提供了比之前基准显著更长和更多样化的序列(174~2234帧),以及原则性的体素可见性注释,包括全方向遮挡掩码和基于MEI的鱼眼视场掩码。为了建立一个强大的鱼眼导向基线,我们进一步提出了Focus on Sphere Occ(FoSOcc)框架,解决了鱼眼占用跟踪中的两个核心挑战:扭曲的球面投影和不准确的体素空间定位。FoSOcc包括一个中心聚焦模块(CFM),通过监督聚焦引导增强实例感知的空间定位,以及一个球面提升模块(SLM),在统一投影模型下将透视提升扩展到鱼眼成像。对Occ3D-Waymo和OccTrack360的广泛实验表明,我们的方法在几何规则类别上显著提高了占用跟踪质量,并为未来在全景鱼眼4D占用跟踪方面的研究建立了强大的基线。基准和源代码将公开发布在 https://github.com/YouthZest-Lin/OccTrack360。
cs.CV / 260 / 2603.08523

BuildMamba: A Visual State-Space Based Model for Multi-Task Building Segmentation and Height Estimation from Satellite Images

BuildMamba:一种基于视觉状态空间的多任务建筑分割与高度估计模型,适用于卫星图像
Ulu, Sinan U., Doruk, A. Enes, Yagmur, I. Can, Gunturk, Bahadir K., Hanoglu, Oguz, Ates, Hasan F.
Abstract
Accurate building segmentation and height estimation from single-view RGB satellite imagery are fundamental for urban analytics, yet remain ill-posed due to structural variability and the high computational cost of global context modeling. While current approaches typically adapt monocular depth architectures, they often suffer from boundary bleeding and systematic underestimation of high-rise structures. To address these limitations, we propose BuildMamba, a unified multi-task framework designed to exploit the linear-time global modeling of visual state-space models. Motivated by the need for stronger structural coupling and computational efficiency, we introduce three modules: a Mamba Attention Module for dynamic spatial recalibration, a Spatial-Aware Mamba-FPN for multi-scale feature aggregation via gated state-space scans, and a Mask-Aware Height Refinement module using semantic priors to suppress height artifacts. Extensive experiments demonstrate that BuildMamba establishes a new performance upper bound across three benchmarks. Specifically, it achieves an IoU of 0.93 and RMSE of 1.77 m on the DFC23 benchmark, surpassing the state of the art by 0.82 m in height estimation. Simulation results confirm the model's superior robustness and scalability for large-scale 3D urban reconstruction.
Chinese Translation
从单视角RGB卫星影像中准确进行建筑分割和高度估计对于城市分析至关重要,但由于结构的多样性和全球上下文建模的高计算成本,这一任务仍然存在不适定性。当前的方法通常适应单目深度架构,但往往会遭遇边界模糊和对高层建筑的系统性低估。为了解决这些局限性,我们提出了BuildMamba,一个统一的多任务框架,旨在利用视觉状态空间模型的线性时间全局建模。基于对更强结构耦合和计算效率的需求,我们引入了三个模块:一个用于动态空间重校准的Mamba注意力模块,一个通过门控状态空间扫描进行多尺度特征聚合的空间感知Mamba-FPN,以及一个利用语义先验抑制高度伪影的掩码感知高度精细化模块。大量实验表明,BuildMamba在三个基准测试中建立了新的性能上限。具体而言,它在DFC23基准上实现了0.93的IoU和1.77米的RMSE,在高度估计上超过了最先进技术0.82米。仿真结果确认了该模型在大规模3D城市重建中的优越鲁棒性和可扩展性。
cs.CV / 261 / 2603.08533

SecAgent: Efficient Mobile GUI Agent with Semantic Context

SecAgent:具有语义上下文的高效移动图形用户界面代理
Xie, Yiping, Chen, Song, Xing, Jingxuan, Jiang, Wei, Zhu, Zekun, Wang, Yingyao, Bu, Pi, Song, Jun, Jiang, Yuning, Zheng, Bo
Abstract
Mobile Graphical User Interface (GUI) agents powered by multimodal large language models have demonstrated promising capabilities in automating complex smartphone tasks. However, existing approaches face two critical limitations: the scarcity of high-quality multilingual datasets, particularly for non-English ecosystems, and inefficient history representation methods. To address these challenges, we present SecAgent, an efficient mobile GUI agent at 3B scale. We first construct a human-verified Chinese mobile GUI dataset with 18k grounding samples and 121k navigation steps across 44 applications, along with a Chinese navigation benchmark featuring multi-choice action annotations. Building upon this dataset, we propose a semantic context mechanism that distills history screenshots and actions into concise, natural language summaries, significantly reducing computational costs while preserving task-relevant information. Through supervised and reinforcement fine-tuning, SecAgent outperforms similar-scale baselines and achieves performance comparable to 7B-8B models on our and public navigation benchmarks. We will open-source the training dataset, benchmark, model, and code to advance research in multilingual mobile GUI automation.
Chinese Translation
由多模态大型语言模型驱动的移动图形用户界面(GUI)代理在自动化复杂智能手机任务方面展现了良好的能力。然而,现有方法面临两个关键限制:高质量多语言数据集的稀缺,特别是在非英语生态系统中,以及低效的历史表示方法。为了解决这些挑战,我们提出了SecAgent,一个规模为3B的高效移动GUI代理。我们首先构建了一个经过人工验证的中文移动GUI数据集,包含18,000个基础样本和44个应用程序中的121,000个导航步骤,以及一个具有多选动作注释的中文导航基准。在此数据集的基础上,我们提出了一种语义上下文机制,将历史屏幕截图和动作提炼为简洁的自然语言摘要,显著降低计算成本,同时保留与任务相关的信息。通过监督和强化微调,SecAgent的表现超过了类似规模的基准,并在我们的和公共导航基准上达到了与7B-8B模型相当的性能。我们将开源训练数据集、基准、模型和代码,以推动多语言移动GUI自动化的研究。
cs.CV / 262 / 2603.08536

SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution

SWIFT:滑动窗口重构用于无训练的少样本生成视频归属
Wang, Chao, Yang, Zijin, Wang, Yaofei, Qi, Yuang, Zhang, Weiming, Yu, Nenghai, Chen, Kejiang
Abstract
Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the "few-shot training-free generated video attribution" task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. By leveraging the "Pixel Frames (many) to Latent Frame (one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between the two reconstructions is then used as an attribution signal. We conducted an extensive evaluation of five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2. Our source code is available at https://github.com/wangchao0708/SWIFT.
Chinese Translation
近年来,视频生成技术取得了显著进展,广泛应用于多个领域。然而,关于生成内容潜在滥用的担忧日益增加。追踪生成视频的来源已成为减轻潜在滥用和识别责任方的关键。现有的视频归属方法需要额外的操作或源归属模型的训练,这可能会降低视频质量或需要大量的训练样本。为了解决这些挑战,我们首次定义了"无训练的少样本生成视频归属"任务,并提出了SWIFT,该方法与视频的时间特性紧密结合。通过利用每个视频片段内的"像素帧(多个)到潜在帧(一个)"的时间映射,SWIFT应用固定长度的滑动窗口执行两种不同的重构:正常和损坏。然后,两个重构之间损失的变化被用作归属信号。我们对五种最先进(SOTA)的视频生成模型进行了广泛评估。实验结果表明,SWIFT在所有模型中仅使用20个视频样本即可实现超过90%的平均归属准确率,并且甚至能够实现HunyuanVideo、EasyAnimate和Wan2.2的零样本归属。我们的源代码可在https://github.com/wangchao0708/SWIFT获取。
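To make the attribution signal above concrete: slide a fixed-length window over the video, reconstruct each chunk once from clean input and once from corrupted input, and use the per-window loss variation as a fingerprint. The `reconstruct` callable and the nearest-model decision rule below are placeholders for generator-side components that the paper ties to specific video models; this is only a sketch of the idea.

```python
import numpy as np

def attribution_signal(video, reconstruct, window=16, stride=8, noise=0.1):
    """Loss variation between normal and corrupted reconstructions per window."""
    rng = np.random.default_rng(0)
    deltas = []
    for s in range(0, len(video) - window + 1, stride):
        chunk = video[s:s + window]
        loss_clean = np.mean((reconstruct(chunk) - chunk) ** 2)
        corrupted = chunk + noise * rng.standard_normal(chunk.shape)
        loss_corrupt = np.mean((reconstruct(corrupted) - chunk) ** 2)
        deltas.append(loss_corrupt - loss_clean)
    return np.array(deltas)

def attribute(video, reconstructors):
    # One plausible decision rule (our assumption): the source model's
    # reconstructor should be least perturbed by corruption of its own output.
    sigs = {name: attribution_signal(video, fn).mean()
            for name, fn in reconstructors.items()}
    return min(sigs, key=sigs.get)
```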
cs.CV / 263 / 2603.08540

PCFEx: Point Cloud Feature Extraction for Graph Neural Networks

PCFEx:用于图神经网络的点云特征提取
Masud, Abdullah Al, Shi, Xintong, Bouazizi, Mondher, Ohtsuki, Tomoaki
Abstract
Graph neural networks (GNNs) have gained significant attention for their effectiveness across various domains. This study focuses on applying GNN to process 3D point cloud data for human pose estimation (HPE) and human activity recognition (HAR). We propose novel point cloud feature extraction (PCFEx) techniques to capture meaningful information at the point, edge, and graph levels of the point cloud by considering point cloud as a graph. Moreover, we introduce a GNN architecture designed to efficiently process these features. Our approach is evaluated on four most popular publicly available millimeter wave radar datasets, three for HPE and one for HAR. The results show substantial improvements, with significantly reduced errors in all three HPE benchmarks, and an overall accuracy of 98.8% in mmWave-based HAR, outperforming the existing state of the art models. This work demonstrates the great potential of feature extraction incorporated with GNN modeling approach to enhance the precision of point cloud processing.
Chinese Translation
图神经网络(GNNs)因其在各个领域的有效性而受到广泛关注。本研究专注于应用GNN处理3D点云数据,以进行人体姿态估计(HPE)和人体活动识别(HAR)。我们提出了新颖的点云特征提取(PCFEx)技术,通过将点云视为图,捕捉点、边和图层级的有意义信息。此外,我们引入了一种GNN架构,旨在高效处理这些特征。我们的方案在四个最流行的公开毫米波雷达数据集上进行了评估,其中三个用于HPE,一个用于HAR。结果显示,在所有三个HPE基准测试中,误差显著减少,基于mmWave的HAR整体准确率达到98.8%,超越了现有的最先进模型。这项工作展示了结合特征提取与GNN建模方法在提高点云处理精度方面的巨大潜力。
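As a rough illustration of feature extraction at the three levels named above, the sketch below treats a radar point cloud as a k-NN graph and derives point-level attributes, EdgeConv-style edge offsets, and a pooled graph-level summary; PCFEx's actual feature set is richer than this.

```python
import torch

def three_level_features(pts, k=8):
    """pts: (N, 3) radar returns of one frame, viewed as a k-NN graph."""
    dists = torch.cdist(pts, pts)                          # (N, N)
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]  # drop self, (N, k)
    point_feat = pts                                       # point level: raw attributes
    edge_feat = pts[knn] - pts[:, None, :]                 # edge level: neighbor offsets (N, k, 3)
    graph_feat = torch.cat([pts.mean(0), pts.std(0)])      # graph level: global summary (6,)
    return point_feat, edge_feat, graph_feat

pf, ef, gf = three_level_features(torch.randn(64, 3))
```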
cs.CV / 264 / 2603.08551

mmGAT: Pose Estimation by Graph Attention with Mutual Features from mmWave Radar Point Cloud

mmGAT:基于图注意力与毫米波雷达点云互特征的人体姿态估计
Masud, Abdullah Al, Shi, Xintong, Bouazizi, Mondher, Ohtsuki, Tomoaki
Abstract
Pose estimation and human action recognition (HAR) are pivotal technologies spanning various domains. While the image-based pose estimation and HAR are widely admired for their superior performance, they lack in privacy protection and suboptimal performance in low-light and dark environments. This paper exploits the capabilities of millimeter-wave (mmWave) radar technology for human pose estimation by processing radar data with Graph Neural Network (GNN) architecture, coupled with the attention mechanism. Our goal is to capture the finer details of the radar point cloud to improve the pose estimation performance. To this end, we present a unique feature extraction technique that exploits the full potential of the GNN processing method for pose estimation. Our model mmGAT demonstrates remarkable performance on two publicly available benchmark mmWave datasets and establishes new state of the art results in most scenarios in terms of human pose estimation. Our approach achieves a noteworthy reduction of pose estimation mean per joint position error (MPJPE) by 35.6% and PA-MPJPE by 14.1% from the current state of the art benchmark within this domain.
Chinese Translation
姿态估计和人类动作识别(HAR)是跨多个领域的重要技术。尽管基于图像的姿态估计和HAR因其卓越的性能而广受赞誉,但在隐私保护方面存在不足,并且在低光和黑暗环境中的表现不佳。本文利用毫米波(mmWave)雷达技术的能力,通过图神经网络(GNN)架构处理雷达数据,并结合注意力机制进行人体姿态估计。我们的目标是捕捉雷达点云的细节,以提高姿态估计的性能。为此,我们提出了一种独特的特征提取技术,充分发挥GNN处理方法在姿态估计中的潜力。我们的模型mmGAT在两个公开可用的基准mmWave数据集上展示了卓越的性能,并在大多数场景中建立了人体姿态估计的新最先进结果。我们的方法在该领域的当前最先进基准上实现了姿态估计每个关节位置误差(MPJPE)降低35.6%和PA-MPJPE降低14.1%的显著改进。
cs.CV / 265 / 2603.08564

BioGait-VLM: A Tri-Modal Vision-Language-Biomechanics Framework for Interpretable Clinical Gait Assessment

BioGait-VLM:一种用于可解释临床步态评估的三模态视觉-语言-生物力学框架
Chen, Erdong, Ji, Yuyang, Greenberg, Jacob K., Steel, Benjamin, Arkam, Faraz, Lewis, Abigail, Singh, Pranay, Liu, Feng
Abstract
Video-based Clinical Gait Analysis often suffers from poor generalization as models overfit environmental biases instead of capturing pathological motion. To address this, we propose BioGait-VLM, a tri-modal Vision-Language-Biomechanics framework for interpretable clinical gait assessment. Unlike standard video encoders, our architecture incorporates a Temporal Evidence Distillation branch to capture rhythmic dynamics and a Biomechanical Tokenization branch that projects 3D skeleton sequences into language-aligned semantic tokens. This enables the model to explicitly reason about joint mechanics independent of visual shortcuts. To ensure rigorous benchmarking, we augment the public GAVD dataset with a high-fidelity Degenerative Cervical Myelopathy (DCM) cohort to form a unified 8-class taxonomy, establishing a strict subject-disjoint protocol to prevent data leakage. Under this setting, BioGait-VLM achieves state-of-the-art recognition accuracy. Furthermore, a blinded expert study confirms that biomechanical tokens significantly improve clinical plausibility and evidence grounding, offering a path toward transparent, privacy-enhanced gait assessment.
Chinese Translation
基于视频的临床步态分析常常因模型过拟合环境偏差而导致泛化能力差,未能捕捉病理运动。为了解决这一问题,我们提出了BioGait-VLM,一种用于可解释临床步态评估的三模态视觉-语言-生物力学框架。与标准视频编码器不同,我们的架构结合了一个时间证据蒸馏分支,以捕捉节奏动态,以及一个生物力学标记化分支,将3D骨架序列投影到与语言对齐的语义标记中。这使得模型能够明确推理关节力学,而不依赖于视觉捷径。为了确保严格的基准测试,我们在公共GAVD数据集的基础上,增添了一个高保真度的退行性颈髓病(Degenerative Cervical Myelopathy, DCM)队列,形成一个统一的8类分类法,并建立严格的受试者不重叠协议以防止数据泄露。在这一设置下,BioGait-VLM实现了最先进的识别准确率。此外,一项盲评专家研究确认生物力学标记显著提高了临床合理性和证据基础,为透明的、增强隐私的步态评估提供了一条路径。
cs.CV / 266 / 2603.08582

Online Sparse Synthetic Aperture Radar Imaging

在线稀疏合成孔径雷达成像
Flynn, Conor, Ivanov, Radoslav, Yazici, Birsen
Abstract
As modern defense applications increasingly rely on inexpensive, autonomous drones, a major challenge lies in designing computationally and memory-efficient onboard algorithms to fulfill mission objectives. This challenge is particularly significant in Synthetic Aperture Radar (SAR), where large volumes of data must be collected and processed for downstream tasks. We propose an online reconstruction method, the Online Fast Iterative Shrinkage-Thresholding Algorithm (Online FISTA), which incrementally reconstructs a scene with limited data through sparse coding. Rather than requiring storage of all received signal data, the algorithm recursively updates storage matrices for each iteration, greatly reducing memory demands. Online SAR image reconstruction facilitates more complex downstream tasks, such as Automatic Target Recognition (ATR), in an online manner, resulting in a more versatile and integrated framework compared to existing post-collection reconstruction and ATR approaches.
Chinese Translation
随着现代防御应用越来越依赖廉价的自主无人机,设计计算和内存效率高的机载算法以实现任务目标成为一项重大挑战。这个挑战在合成孔径雷达(Synthetic Aperture Radar, SAR)中尤为显著,因为必须收集和处理大量数据以完成下游任务。我们提出了一种在线重建方法,即在线快速迭代收缩阈值算法(Online Fast Iterative Shrinkage-Thresholding Algorithm, Online FISTA),该方法通过稀疏编码以有限的数据逐步重建场景。该算法不需要存储所有接收的信号数据,而是为每次迭代递归更新存储矩阵,从而大大减少内存需求。在线SAR图像重建以在线方式促进了更复杂的下游任务,如自动目标识别(Automatic Target Recognition, ATR),与现有的后收集重建和ATR方法相比,形成了一个更灵活和集成的框架。
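A minimal sketch of the recursion described above for the sparse model y = Ax: accumulate the Gram matrix A^T A and correlation A^T y as measurements stream in, so memory is independent of the stream length, then run standard FISTA on the accumulated matrices. Problem sizes and the noise model are illustrative, and this is our generic reconstruction of the idea rather than the paper's exact algorithm.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

n, lam = 200, 0.05
G, b = np.zeros((n, n)), np.zeros(n)        # A^T A and A^T y, updated recursively

rng = np.random.default_rng(1)
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = 1.0

for _ in range(400):                        # streaming measurements, never stored
    a = rng.standard_normal(n) / np.sqrt(n)
    y = a @ x_true + 0.01 * rng.standard_normal()
    G += np.outer(a, a)                     # O(n^2) memory, independent of stream length
    b += y * a

L = np.linalg.eigvalsh(G)[-1]               # Lipschitz constant of the gradient G x - b
x, z, t = np.zeros(n), np.zeros(n), 1.0
for _ in range(200):                        # FISTA on the accumulated matrices
    x_new = soft_threshold(z - (G @ z - b) / L, lam / L)
    t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
    z = x_new + ((t - 1) / t_new) * (x_new - x)
    x, t = x_new, t_new

print(np.flatnonzero(np.abs(x) > 0.5))      # recovers the support of x_true here
```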
cs.CV / 267 / 2603.08589

CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing

CARE-Edit:面向上下文图像编辑的条件感知专家路由
Wang, Yucheng, Wang, Zedong, Wu, Yuetong, Ma, Yue, Xu, Dan
Abstract
Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, suffering from task interference and poor adaptation to heterogeneous demands (e.g., local vs global, semantic vs photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters which cannot dynamically prioritize or suppress conflicting modalities, thus resulting in artifacts like color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit) that aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts--Text, Mask, Reference, and Base--based on multi-modal conditions and diffusion timesteps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module subsequently fuses expert outputs, coherently integrating semantic, spatial, and stylistic information to the base images. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of specialized experts, showcasing the importance of dynamic, condition-aware processing to mitigate multi-condition conflicts.
Chinese Translation
统一扩散编辑器通常依赖于固定的共享骨干网络来处理多样化任务,这导致任务干扰和对异构需求(例如,局部与全局、语义与光度)的适应性差。特别是,普遍使用的 ControlNet 和 OmniControl 变体通过静态拼接或加性适配器组合多种条件信号(例如,文本、掩码、参考),而无法动态优先或抑制冲突的模态,从而导致诸如掩码边界的颜色渗透、身份或风格漂移,以及在多条件输入下的不可预测行为等伪影。为了解决这一问题,我们提出了条件感知专家路由(CARE-Edit),该方法将模型计算与特定的编辑能力对齐。其核心是一个轻量级的潜在注意力路由器,根据多模态条件和扩散时间步,将编码的扩散令牌分配给四个专业专家——文本、掩码、参考和基础: (i) 掩码重绘模块首先细化粗略的用户定义掩码,以提供精确的空间指导; (ii) 路由器应用稀疏的 top-K 选择,动态分配计算资源给最相关的专家; (iii) 潜在混合模块随后融合专家输出,连贯地将语义、空间和风格信息整合到基础图像中。实验验证了 CARE-Edit 在上下文编辑任务上的强大性能,包括擦除、替换、基于文本的编辑和风格迁移。经验分析进一步揭示了专业专家的任务特定行为,展示了动态条件感知处理在减轻多条件冲突中的重要性。
cs.CV / 268 / 2603.08590

PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition

PRISM:基于每个关节潜在分解的人体运动生成流媒体
Ling, Zeyu, Shuai, Qing, Zhang, Teng, Li, Shiyang, Han, Bo, Zou, Changqing
Abstract
Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts. We present PRISM, addressing each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time × joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space -- without modifying the generator -- substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. (2) Noise-free condition injection: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep 0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model, and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that seamlessly handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.
Chinese Translation
文本到运动生成技术快速发展,但仍面临两个挑战。首先,现有的运动自编码器将每一帧压缩为一个单一的整体潜在向量,将轨迹和每个关节的旋转纠缠在一个无结构的表示中,导致下游生成器难以准确建模。其次,文本到运动、姿态条件生成和长期序列合成通常需要单独的模型或特定任务的机制,自回归方法在长时间展开中会严重累积误差。我们提出了PRISM,针对每个挑战提供了专门的解决方案。(1) 关节分解的运动潜在空间:每个身体关节占据其自己的标记,形成一个结构化的二维网格(时间 × 关节),通过因果变分自编码器(VAE)与前向运动学监督进行压缩。这一对潜在空间的简单改动——无需修改生成器——显著提高了生成质量,揭示了潜在空间设计一直是一个被低估的瓶颈。(2) 无噪声条件注入:每个潜在标记携带其自己的时间步嵌入,允许将条件帧作为干净的标记(timestep 0)注入,而其余标记则被去噪。这将文本到运动和姿态条件生成统一到一个模型中,并直接支持自回归段链式生成以实现流媒体合成。自强制(self-forcing)训练进一步抑制了长时间展开中的漂移。通过这两个组件,我们训练了一个单一的运动生成基础模型,能够无缝处理文本到运动、姿态条件生成、自回归序列生成和叙事运动组合,在HumanML3D、MotionHub、BABEL以及50个场景的用户研究中实现了最先进的性能。
cs.CV / 269 / 2603.08592

Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations

通过几何参考的3D场景表示提升多模态大型语言模型的空间推理能力
Yuan, Jiangye, Kumar, Gowri, Wang, Baoyuan
Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach boosts GPT-5's performance on VSI-Bench by 8% overall and more than 11% on tasks that rely heavily on spatial layout understanding. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.
Chinese Translation
尽管多模态大型语言模型(MLLMs)在2D视觉理解方面取得了显著成功,但它们在3D空间推理方面的能力仍然有限。为了解决这一问题,我们引入了几何参考的3D场景表示(GR3D)。给定一组输入图像,GR3D为图像中的对象分配唯一的ID,并将其3D几何属性编码为由这些ID索引的文本参考。这种表示使得MLLMs能够利用其在数学推理方面的高级语言技能来解释3D线索,同时以紧密耦合的方式分析2D视觉特征。我们提出了一种基于GR3D的简单而有效的方法,该方法无需额外训练,且可以轻松应用于不同的MLLMs。在零样本设置下,我们的方法使GPT-5在VSI-Bench上的整体表现提升了8%,在高度依赖空间布局理解的任务上提升了超过11%。定性研究进一步表明,GR3D使得MLLMs能够在高度稀疏的输入视图下执行复杂的空间推理。
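To show what an ID-indexed textual reference might look like, the sketch below renders detected objects' 3D attributes into a prompt fragment; the attribute set and phrasing are our guesses, since the abstract does not fix them.

```python
def gr3d_prompt(objects):
    """objects: dicts with the annotated image ID plus illustrative 3D attributes."""
    lines = ["Objects are annotated in the images with the IDs below."]
    for o in objects:
        cx, cy, cz = o["center"]
        sx, sy, sz = o["size"]
        lines.append(f"Object <{o['id']}> ({o['label']}): center at "
                     f"({cx:.1f}, {cy:.1f}, {cz:.1f}) m, "
                     f"size {sx:.1f} x {sy:.1f} x {sz:.1f} m.")
    return "\n".join(lines)

print(gr3d_prompt([
    {"id": 1, "label": "sofa",  "center": (0.4, 0.0, 1.2), "size": (2.0, 0.9, 0.8)},
    {"id": 2, "label": "table", "center": (1.6, 0.1, 2.3), "size": (1.2, 0.7, 0.5)},
]))
```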
cs.CV / 270 / 2603.08605

Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation

基于渐进伪掩膜精炼的弱监督教师-学生框架用于腺体分割
Khan, Hikmat, Chen, Wei, Niazi, Muhammad Khalid Khan
Abstract
Background and objectives: Colorectal cancer histopathological grading depends on accurate segmentation of glandular structures. Current deep learning approaches rely on large scale pixel level annotations that are labor intensive and difficult to obtain in routine clinical practice. Weakly supervised semantic segmentation offers a promising alternative. However, class activation map based methods often produce incomplete pseudo masks that emphasize highly discriminative regions and fail to supervise unannotated glandular structures. We propose a weakly supervised teacher student framework that leverages sparse pathologist annotations and an Exponential Moving Average stabilized teacher network to generate refined pseudo masks. Methods: The framework integrates confidence based filtering, adaptive fusion of teacher predictions with limited ground truth, and curriculum guided refinement to progressively segment unannotated glandular regions. The method was evaluated on an institutional colorectal cancer cohort from The Ohio State University Wexner Medical Center consisting of 60 hematoxylin and eosin stained whole slide images and on public datasets including the Gland Segmentation dataset, TCGA COAD, TCGA READ, and SPIDER. Results: On the Gland Segmentation dataset the framework achieved a mean Intersection over Union of 80.10 and a mean Dice coefficient of 89.10. Cross cohort evaluation demonstrated robust generalization on TCGA COAD and TCGA READ without additional annotations, while reduced performance on SPIDER reflected domain shift. Conclusions: The proposed framework provides an annotation efficient and generalizable approach for gland segmentation in colorectal histopathology.
Chinese Translation
背景与目标:结直肠癌的组织病理学分级依赖于腺体结构的准确分割。目前的深度学习方法依赖于大规模的像素级标注,这些标注劳动密集且在常规临床实践中难以获得。弱监督语义分割提供了一种有前景的替代方案。然而,基于类激活图的方法往往产生不完整的伪掩膜,这些伪掩膜强调高度可区分的区域,却未能对未标注的腺体结构进行监督。我们提出了一种弱监督教师-学生框架,利用稀疏的病理学家标注和稳定的指数移动平均教师网络生成精炼的伪掩膜。方法:该框架集成了基于置信度的过滤、教师预测与有限真实标注的自适应融合,以及课程引导的精炼,以逐步分割未标注的腺体区域。该方法在来自俄亥俄州立大学韦克斯纳医学中心的60个苏木精-伊红染色全切片图像的结直肠癌队列以及包括腺体分割数据集、TCGA COAD、TCGA READ和SPIDER在内的公共数据集上进行了评估。结果:在腺体分割数据集上,该框架实现了80.10的平均交并比和89.10的平均Dice系数。跨队列评估表明,在没有额外标注的情况下,TCGA COAD和TCGA READ上表现出强大的泛化能力,而在SPIDER上的性能下降反映了领域转移。结论:所提出的框架为结直肠组织病理学中的腺体分割提供了一种标注高效且可泛化的方法。
cs.CV / 271 / 2603.08611

FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection

FOMO-3D:利用视觉基础模型进行长尾3D目标检测
Yang, Anqi Joyce, Tu, James, Dvornik, Nikita, Li, Enxu, Urtasun, Raquel
Abstract
In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction worker) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first multi-modal 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, and refines them with attention especially to image features from OWL. Evaluations on real-world driving data show that using rich priors from vision foundation models with careful multi-modal fusion designs leads to large gains for long-tailed 3D detection. Project website is at https://waabi.ai/fomo3d/.
Chinese Translation
为了在复杂的交通环境中导航,自动驾驶车辆必须识别与脆弱道路使用者或交通控制设备相关的多种语义类别。然而,许多安全关键物体(例如,施工工人)在正常交通条件下出现频率较低,导致仅依靠驾驶数据的训练样本严重不足。最近的视觉基础模型在大量数据集上进行训练,可以作为一种良好的外部先验知识来源,以提高泛化能力。我们提出了FOMO-3D,这是第一个利用视觉基础模型进行长尾3D检测的多模态3D检测器。具体而言,FOMO-3D在一个两阶段检测范式中利用OWLv2和Metric3Dv2中的丰富语义和深度先验,首先通过基于激光雷达的分支和一种新颖的基于相机的分支生成提案,然后特别关注来自OWL的图像特征进行精细化处理。对真实驾驶数据的评估表明,利用视觉基础模型中的丰富先验并结合精心设计的多模态融合方法,能够显著提升长尾3D检测的性能。项目网站为 https://waabi.ai/fomo3d/
cs.CV / 272 / 2603.08620

StreamReady: Learning What to Answer and When in Long Streaming Videos

StreamReady:学习在长时间流媒体视频中何时回答和回答什么
Azad, Shehreen, Vineet, Vibhav, Rawat, Yogesh Singh
Abstract
Streaming video understanding often involves time-sensitive scenarios where models need to answer exactly when the supporting visual evidence appears: answering before the evidence reflects speculation, answering after it has passed reduces real-time utility. To capture this behavior, we introduce a readiness-aware formulation of streaming video understanding with the Answer Readiness Score (ARS), a timing-aware objective with asymmetric early and late penalties. When combined with correctness, ARS defines an effective accuracy that measures not just whether a model is right, but whether it answers at the appropriate moment. Building on this formulation, we introduce StreamReady, a framework to unify temporal reasoning with on-time answering through a lightweight readiness mechanism that decides if sufficient evidence has been observed before responding. To evaluate this capability, we further introduce ProReady-QA, a benchmark with annotated answer evidence windows and proactive multi-turn questions across local and global contexts. StreamReady achieves superior performance on ProReady-QA, and consistently outperforms prior methods across eight additional streaming and offline long-video benchmarks, demonstrating robust and broadly generalizable video understanding capability.
Chinese Translation
流媒体视频理解通常涉及时间敏感的场景,在这些场景中,模型需要在支持视觉证据出现的确切时刻进行回答:在证据出现之前回答会反映出猜测,而在证据经过之后回答则降低了实时效用。为了捕捉这种行为,我们引入了一种关注准备状态的流媒体视频理解形式,采用答案准备评分(Answer Readiness Score, ARS),这是一种具有不对称早期和晚期惩罚的时间感知目标。当与正确性结合时,ARS 定义了一种有效的准确性,不仅衡量模型是否正确,还衡量其是否在适当的时刻进行回答。在此基础上,我们推出了 StreamReady,一个通过轻量级准备机制统一时间推理与及时回答的框架,该机制决定在回应之前是否观察到足够的证据。为了评估这一能力,我们进一步引入了 ProReady-QA,一个具有注释答案证据窗口和主动多轮问题的基准,涵盖局部和全局上下文。StreamReady 在 ProReady-QA 上实现了优越的性能,并在另外八个流媒体和离线长视频基准上持续超越先前的方法,展示了其强大且广泛可推广的视频理解能力。
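The exact functional form of ARS is not given in the abstract; a minimal sketch of a timing score with asymmetric early/late penalties, assuming exponential decay outside an annotated evidence window (the rates and shape are hypothetical):
```python
import math

def answer_readiness_score(t_answer, evidence_start, evidence_end,
                           early_rate=0.5, late_rate=0.1):
    # Hypothetical ARS: 1.0 inside the annotated evidence window, decaying
    # outside it. Answering early (speculation) is penalized more sharply
    # than answering late (reduced real-time utility).
    if evidence_start <= t_answer <= evidence_end:
        return 1.0
    if t_answer < evidence_start:
        return math.exp(-early_rate * (evidence_start - t_answer))
    return math.exp(-late_rate * (t_answer - evidence_end))

def effective_accuracy(correct, t_answer, evidence_start, evidence_end):
    # "Effective accuracy" gates correctness by timing, as described.
    return float(correct) * answer_readiness_score(
        t_answer, evidence_start, evidence_end)

print(effective_accuracy(True, 12.0, 10.0, 14.0))  # 1.0: in-window answer
print(effective_accuracy(True, 18.0, 10.0, 14.0))  # ~0.67: 4 s late
print(effective_accuracy(True, 6.0, 10.0, 14.0))   # ~0.14: 4 s early
```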
cs.CV / 273 / 2603.08639

UNBOX: Unveiling Black-box visual models with Natural-language

UNBOX:通过自然语言揭示黑箱视觉模型
Carnemolla, Simone, Russo, Chiara, Palazzo, Simone, Bouniot, Quentin, Giordano, Daniela, Akata, Zeynep, Pennisi, Matteo, Spampinato, Concetto
Abstract
Ensuring trustworthiness in open-world visual recognition requires models that are interpretable, fair, and robust to distribution shifts. Yet modern vision systems are increasingly deployed as proprietary black-box APIs, exposing only output probabilities and hiding architecture, parameters, gradients, and training data. This opacity prevents meaningful auditing, bias detection, and failure analysis. Existing explanation methods assume white- or gray-box access or knowledge of the training distribution, making them unusable in these real-world settings. We introduce UNBOX, a framework for class-wise model dissection under fully data-free, gradient-free, and backpropagation-free constraints. UNBOX leverages Large Language Models and text-to-image diffusion models to recast activation maximization as a purely semantic search driven by output probabilities. The method produces human-interpretable text descriptors that maximally activate each class, revealing the concepts a model has implicitly learned, the training distribution it reflects, and potential sources of bias. We evaluate UNBOX on ImageNet-1K, Waterbirds, and CelebA through semantic fidelity tests, visual-feature correlation analyses and slice-discovery auditing. Despite operating under the strictest black-box constraints, UNBOX performs competitively with state-of-the-art white-box interpretability methods. This demonstrates that meaningful insight into a model's internal reasoning can be recovered without any internal access, enabling more trustworthy and accountable visual recognition systems.
Chinese Translation
在开放世界视觉识别中,确保可信度需要可解释、公平且对分布变化具有鲁棒性的模型。然而,现代视觉系统越来越多地作为专有的黑箱API部署,仅暴露输出概率,隐藏架构、参数、梯度和训练数据。这种不透明性阻碍了有意义的审计、偏见检测和故障分析。现有的解释方法假设具有白箱或灰箱访问权限,或对训练分布有了解,因此在这些现实世界环境中无法使用。我们提出了UNBOX,一个在完全无数据、无梯度和无反向传播约束下进行类别级模型剖析的框架。UNBOX利用大型语言模型和文本到图像的扩散模型,将激活最大化重新表述为一种完全由输出概率驱动的语义搜索。该方法生成最大激活每个类别的人类可解释文本描述,揭示模型隐含学习的概念、反映的训练分布以及潜在的偏见来源。我们通过语义保真度测试、视觉特征相关性分析和切片发现审计在ImageNet-1K、Waterbirds和CelebA上评估UNBOX。尽管在最严格的黑箱约束下操作,UNBOX的表现与最先进的白箱可解释性方法具有竞争力。这证明了在没有任何内部访问的情况下,可以恢复对模型内部推理的有意义洞察,从而实现更可信和更负责任的视觉识别系统。
cs.CV / 274 / 2603.08645

Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization

检索增强高斯头像:提升表情泛化能力
Levy, Matan, Habib, Gavriel, Tzachor, Issar, Samuel, Dvir, Ben-Ari, Rami, Darshan, Nir, Litany, Or, Lischinski, Dani
Abstract
Template-free animatable head avatars can achieve high visual fidelity by learning expression-dependent facial deformation directly from a subject's capture, avoiding parametric face templates and hand-designed blendshape spaces. However, since learned deformation is supervised only by the expressions observed for a single identity, these models suffer from limited expression coverage and often struggle when driven by motions that deviate from the training distribution. We introduce RAF (Retrieval-Augmented Faces), a simple training-time augmentation designed for template-free head avatars that learn deformation from data. RAF constructs a large unlabeled expression bank and, during training, replaces a subset of the subject's expression features with nearest-neighbor expressions retrieved from this bank while still reconstructing the subject's original frames. This exposes the deformation field to a broader range of expression conditions, encouraging stronger identity-expression decoupling and improving robustness to expression distribution shift without requiring paired cross-identity data, additional annotations, or architectural changes. We further analyze how retrieval augmentation increases expression diversity and validate retrieval quality with a user study showing that retrieved neighbors are perceptually closer in expression and pose. Experiments on the NeRSemble benchmark demonstrate that RAF consistently improves expression fidelity over the baseline, in both self-driving and cross-driving scenarios.
Chinese Translation
无模板的可动画头部头像通过直接从被捕获对象学习依赖于表情的面部变形,能够实现高视觉保真度,避免了参数化面部模板和手工设计的混合形状空间。然而,由于学习到的变形仅由单一身份观察到的表情进行监督,这些模型在表情覆盖范围上受到限制,并且在受到偏离训练分布的运动驱动时常常表现不佳。我们提出了RAF(检索增强面孔),这是一种为无模板头部头像设计的简单训练时增强方法,旨在从数据中学习变形。RAF构建了一个大型未标记的表情库,并在训练过程中,用从该库中检索到的最近邻表情替换被试表情特征的子集,同时仍然重建被试的原始帧。这使得变形场暴露于更广泛的表情条件,促进了身份与表情的解耦,并提高了对表情分布变化的鲁棒性,而无需配对的跨身份数据、额外的注释或架构更改。我们进一步分析了检索增强如何增加表情多样性,并通过用户研究验证了检索质量,结果表明检索到的邻居在表情和姿势上在感知上更为接近。对NeRSemble基准的实验表明,RAF在自驱动和跨驱动场景中均持续提升了表情保真度。
cs.CV / 275 / 2603.08648

CAST: Modeling Visual State Transitions for Consistent Video Retrieval

CAST:建模视觉状态转变以实现一致的视频检索
Liu, Yanqing, Liu, Yingcheng, Dong, Fanghong, Budianto, Budianto, Xie, Cihang, Jiao, Yan
Abstract
As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($\Delta$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
Chinese Translation
随着视频内容创作逐渐向长篇叙事转变,将短片段组合成连贯的故事线变得愈发重要。然而,现有的检索方法在推理时仍然缺乏上下文敏感性,优先考虑局部语义对齐,而忽视了状态和身份的一致性。为了解决这一结构性限制,我们正式定义了一致视频检索(Consistent Video Retrieval, CVR)任务,并引入了一个涵盖 YouCook2、COIN 和 CrossTask 的诊断基准。我们提出了 CAST(Context-Aware State Transition),这是一种轻量级的即插即用适配器,兼容多种冻结的视觉-语言嵌入空间。通过从视觉历史中预测状态条件的残差更新($\Delta$),CAST 引入了对潜在状态演变的显式归纳偏置。大量实验表明,CAST 在 YouCook2 和 CrossTask 上提高了性能,在 COIN 上保持竞争力,并在多种基础模型骨干上始终优于零样本基线。此外,CAST 为黑箱视频生成候选(例如来自 Veo 的候选)提供了有用的重排序信号,促进了更具时间连贯性的延续。
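As a rough sketch of the residual-update idea, assuming a frozen embedding space and a hypothetical GRU history encoder (the paper's adapter architecture is not specified in the abstract):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CASTAdapter(nn.Module):
    # Minimal sketch of a plug-and-play adapter over a frozen embedding
    # space: a GRU summarizes the embeddings of already-selected clips and
    # predicts a residual update (Delta) added to the query embedding.
    # Dimensions and the GRU encoder are illustrative assumptions.
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.history_enc = nn.GRU(dim, hidden, batch_first=True)
        self.to_delta = nn.Linear(hidden, dim)

    def forward(self, query_emb, history_embs):
        # query_emb: (B, D) frozen query embedding
        # history_embs: (B, T, D) frozen embeddings of clips chosen so far
        _, h = self.history_enc(history_embs)   # h: (1, B, hidden)
        delta = self.to_delta(h.squeeze(0))     # state-conditioned Delta
        return F.normalize(query_emb + delta, dim=-1)

# Retrieval then ranks candidate clips against the state-conditioned query:
adapter = CASTAdapter()
q, hist = torch.randn(2, 512), torch.randn(2, 5, 512)
cands = F.normalize(torch.randn(2, 100, 512), dim=-1)
scores = torch.einsum("bd,bnd->bn", adapter(q, hist), cands)
```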
cs.CV / 276 / 2603.08661

ImprovedGS+: A High-Performance C++/CUDA Re-Implementation Strategy for 3D Gaussian Splatting

ImprovedGS+:一种用于3D高斯溅射的高性能C++/CUDA重实现策略
Vicente, Jordi Muñoz
Abstract
Recent advancements in 3D Gaussian Splatting (3DGS) have shifted the focus toward balancing reconstruction fidelity with computational efficiency. In this work, we propose ImprovedGS+, a high-performance, low-level reinvention of the ImprovedGS strategy, implemented natively within the LichtFeld-Studio framework. By transitioning from high-level Python logic to hardware-optimized C++/CUDA kernels, we achieve a significant reduction in host-device synchronization and training latency. Our implementation introduces a Long-Axis-Split (LAS) CUDA kernel, custom Laplacian-based importance kernels with Non-Maximum Suppression (NMS) for edge scores, and an adaptive Exponential Scale Scheduler. Experimental results on the Mip-NeRF360 dataset demonstrate that ImprovedGS+ establishes a new Pareto-optimal front for scene reconstruction. Our 1M-budget variant outperforms the state-of-the-art MCMC baseline by achieving a 26.8% reduction in training time (saving 17 minutes per session) and utilizing 13.3% fewer Gaussians while maintaining superior visual quality. Furthermore, our full variant demonstrates a 1.28 dB PSNR increase over the ADC baseline with a 38.4% reduction in parametric complexity. These results validate ImprovedGS+ as a scalable, high-speed solution that upholds the core pillars of Speed, Quality, and Usability within the LichtFeld-Studio ecosystem.
Chinese Translation
最近在3D高斯溅射(3DGS)方面的进展使重点转向在重建保真度与计算效率之间取得平衡。在本研究中,我们提出了ImprovedGS+,这是对ImprovedGS策略的高性能、低级别重塑,原生实现于LichtFeld-Studio框架内。通过将高层Python逻辑转变为硬件优化的C++/CUDA内核,我们显著减少了主机与设备之间的同步和训练延迟。我们的实现引入了一个长轴分割(Long-Axis-Split, LAS)CUDA内核,自定义的基于拉普拉斯的边缘得分重要性内核,并结合非极大值抑制(Non-Maximum Suppression, NMS),以及自适应指数尺度调度器。对Mip-NeRF360数据集的实验结果表明,ImprovedGS+为场景重建建立了一个新的帕累托最优前沿。我们的1M预算变体在训练时间上比最先进的MCMC基线减少了26.8%(每次节省17分钟),同时将高斯数量减少了13.3%,且保持了卓越的视觉质量。此外,我们的完整变体在参数复杂度上减少了38.4%,并在ADC基线之上实现了1.28 dB的峰值信噪比(PSNR)提升。这些结果验证了ImprovedGS+作为一种可扩展的高速解决方案,能够在LichtFeld-Studio生态系统中维持速度、质量和可用性这三大核心支柱。
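The abstract does not define the Exponential Scale Scheduler; one plausible reading, sketched below, is a scale threshold that decays geometrically over training (the endpoints and the interpretation are assumptions):
```python
import math

def exponential_scale_schedule(step, total_steps, start=0.1, end=0.001):
    # Hypothetical reading of an "Exponential Scale Scheduler": a scale
    # threshold (e.g., the maximum allowed Gaussian extent, in scene units)
    # decays exponentially from `start` to `end`, so early iterations
    # tolerate large Gaussians and later ones force refinement.
    t = min(max(step / total_steps, 0.0), 1.0)
    return start * math.exp(t * math.log(end / start))

# The schedule interpolates geometrically: at the halfway point the
# threshold equals the geometric mean of the two endpoints.
print(exponential_scale_schedule(0, 30000))      # 0.1
print(exponential_scale_schedule(15000, 30000))  # ~0.01
print(exponential_scale_schedule(30000, 30000))  # 0.001
```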
cs.CV / 277 / 2603.08674

Talking Together: Synthesizing Co-Located 3D Conversations from Audio

共同对话:从音频合成共处的三维对话
Shan, Mengyi, Chang, Shouchieh, Bai, Ziqian, Liu, Shichen, Zhang, Yinda, Song, Luchuan, Pandey, Rohit, Fanello, Sean, Huang, Zeng
Abstract
We tackle the challenging task of generating complete 3D facial animations for two interacting, co-located participants from a mixed audio stream. While existing methods often produce disembodied "talking heads" akin to a video conference call, our work is the first to explicitly model the dynamic 3D spatial relationship -- including relative position, orientation, and mutual gaze -- that is crucial for realistic in-person dialogues. Our system synthesizes the full performance of both individuals, including precise lip-sync, and uniquely allows their relative head poses to be controlled via textual descriptions. To achieve this, we propose a dual-stream architecture where each stream is responsible for one participant's output. We employ speaker's role embeddings and inter-speaker cross-attention mechanisms designed to disentangle the mixed audio and model the interaction. Furthermore, we introduce a novel eye gaze loss to promote natural, mutual eye contact. To power our data-hungry approach, we introduce a novel pipeline to curate a large-scale conversational dataset consisting of over 2 million dyadic pairs from in-the-wild videos. Our method generates fluid, controllable, and spatially aware dyadic animations suitable for immersive applications in VR and telepresence, significantly outperforming existing baselines in perceived realism and interaction coherence.
Chinese Translation
我们着手解决一个具有挑战性的任务,即从混合音频流中为两个互动的共处参与者生成完整的三维面部动画。尽管现有方法通常产生类似视频会议的无实体“说话者”,我们的工作首次明确建模了动态三维空间关系——包括相对位置、方向和相互注视——这些对于真实的面对面对话至关重要。我们的系统合成了两位参与者的完整表现,包括精确的唇动同步,并独特地允许通过文本描述来控制他们的相对头部姿势。为此,我们提出了一种双流架构,其中每个流负责一个参与者的输出。我们采用了说话者角色嵌入和设计用于解耦混合音频及建模互动的跨说话者注意力机制。此外,我们引入了一种新颖的眼神注视损失,以促进自然的相互眼神接触。为了支持我们对数据的高需求,我们提出了一种新颖的管道,以策划一个大型对话数据集,该数据集由来自真实场景视频的超过200万个二人对组成。我们的方法生成流畅、可控且具有空间意识的二人动画,适用于虚拟现实和远程存在等沉浸式应用,在感知真实感和互动一致性方面显著优于现有基准。
cs.CV / 278 / 2603.08681

ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation

ER-Pose:重新思考基于关键点的表示学习以实现实时人体姿态估计
Li, Nanjun, Cheng, Pinqi, Liu, Zean, Tian, Minghe, Wang, Xuanyin
Abstract
Single-stage multi-person pose estimation aims to jointly perform human localization and keypoint prediction within a unified framework, offering advantages in inference efficiency and architectural simplicity. Consequently, multi-scale real-time detection architectures, such as YOLO-like models, are widely adopted for real-time pose estimation. However, these approaches typically inherit a box-driven modeling paradigm from object detection, in which pose estimation is implicitly constrained by bounding-box supervision during training. This formulation introduces biases in sample assignment and feature representation, resulting in task misalignment and ultimately limiting pose estimation accuracy. In this work, we revisit box-driven single-stage pose estimation from a keypoint-driven perspective and identify semantic conflicts among parallel objectives as a key source of performance degradation. To address this issue, we propose a keypoint-driven learning paradigm that elevates pose estimation to a primary prediction objective. Specifically, we remove bounding-box prediction and redesign the prediction head to better accommodate the high-dimensional structured representations for pose estimation. We further introduce a keypoint-driven dynamic sample assignment strategy to align training objectives with pose evaluation metrics, enabling dense supervision during training and efficient NMS-free inference. In addition, we propose a smooth OKS-based loss function to stabilize optimization in regression-based pose estimation. Based on these designs, we develop a single-stage multi-person pose estimation framework, termed ER-Pose. On MS COCO and CrowdPose, ER-Pose-n achieves AP improvements of 3.2/6.7 without pre-training and 7.4/4.9 with pre-training respectively compared with the baseline YOLO-Pose. These improvements are achieved with fewer parameters and higher inference efficiency.
Chinese Translation
单阶段多人体姿态估计旨在在统一框架内共同执行人体定位和关键点预测,具有推理效率高和架构简单的优点。因此,诸如YOLO类模型等多尺度实时检测架构被广泛应用于实时姿态估计。然而,这些方法通常继承了来自目标检测的基于框的建模范式,其中姿态估计在训练过程中受到边界框监督的隐性约束。这种表述引入了样本分配和特征表示中的偏差,导致任务不一致,最终限制了姿态估计的准确性。在本研究中,我们从基于关键点的视角重新审视基于框的单阶段姿态估计,并识别出并行目标之间的语义冲突是性能下降的关键来源。为了解决这一问题,我们提出了一种基于关键点的学习范式,将姿态估计提升为主要预测目标。具体而言,我们去除了边界框预测,并重新设计了预测头,以更好地适应姿态估计的高维结构化表示。我们进一步引入了一种基于关键点的动态样本分配策略,以使训练目标与姿态评估指标对齐,从而在训练期间实现密集监督和高效的无NMS推理。此外,我们提出了一种平滑的基于OKS的损失函数,以稳定基于回归的姿态估计中的优化。基于这些设计,我们开发了一个单阶段多人体姿态估计框架,称为ER-Pose。在MS COCO和CrowdPose数据集上,与基线YOLO-Pose相比,ER-Pose-n在不进行预训练的情况下分别实现了3.2/6.7的AP提升,在进行预训练的情况下分别实现了7.4/4.9的AP提升。这些改进是在更少的参数和更高的推理效率下实现的。
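A smooth OKS-based loss can be sketched directly from the standard COCO OKS definition; the paper's exact smoothing may differ, and all tensor shapes here are illustrative:
```python
import torch

def smooth_oks_loss(pred, gt, vis, area, kappa, eps=1e-9):
    # Sketch of an OKS-aligned regression loss built from the standard COCO
    # OKS definition (the paper's exact smoothing may differ).
    # pred, gt: (B, K, 2) keypoints; vis: (B, K) visibility in {0., 1.};
    # area: (B,) object scale s^2; kappa: (K,) per-keypoint falloff constants.
    d2 = ((pred - gt) ** 2).sum(dim=-1)                      # (B, K)
    denom = 2.0 * area[:, None] * (kappa[None, :] ** 2) + eps
    oks_per_kpt = torch.exp(-d2 / denom)                     # (B, K)
    oks = (oks_per_kpt * vis).sum(-1) / (vis.sum(-1) + eps)  # (B,)
    # 1 - OKS is smooth and differentiable, so training optimizes the same
    # geometry the evaluation metric measures, not an L1/L2 surrogate.
    return (1.0 - oks).mean()

loss = smooth_oks_loss(torch.randn(4, 17, 2), torch.randn(4, 17, 2),
                       torch.ones(4, 17), torch.full((4,), 0.5),
                       torch.full((17,), 0.1))
```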
cs.CV / 279 / 2603.08703

HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

HiAR:通过层次去噪实现高效自回归长视频生成
Zou, Kai, Zheng, Dian, Liu, Hongbo, Hang, Tiankai, Liu, Bin, Yu, Nenghai
Abstract
Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts; yet, this practice propagates prediction errors with high certainty, thereby exacerbating degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides sufficient signal for temporal consistency while effectively mitigating error propagation. Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level. This hierarchy naturally admits pipelined parallel inference, yielding a 1.8x wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20s generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.
Chinese Translation
自回归(AR)扩散为生成理论上无限长度的视频提供了一个有前景的框架。然而,一个主要挑战是保持时间连续性,同时防止由于误差积累导致的逐步质量下降。为了确保连续性,现有方法通常依赖于高度去噪的上下文;然而,这种做法以高确定性传播预测误差,从而加剧了降级。在本文中,我们认为高度干净的上下文并非必要。我们受到双向扩散模型的启发,该模型在保持一致性的同时以共享的噪声水平去噪帧,我们提出在与当前块相同的噪声水平上对上下文进行条件处理,提供了足够的信号以保持时间一致性,同时有效减轻误差传播。基于这一见解,我们提出了HiAR,一个层次去噪框架,它逆转了传统的生成顺序:它在每个去噪步骤中对所有块进行因果生成,而不是顺序完成每个块,从而使每个块始终以相同的噪声水平条件化上下文。这一层次结构自然允许流水线并行推理,在我们的4步设置中实现了1.8倍的实际速度提升。我们进一步观察到,在这一范式下,自推演蒸馏(self-rollout distillation)放大了模式寻求的反向KL目标所固有的低运动捷径。为了对抗这一现象,我们在双向注意模式中引入了前向KL正则化器,该正则化器在不干扰蒸馏损失的情况下保持因果推理的运动多样性。在VBench(20秒生成)上,HiAR在所有比较方法中实现了最佳的整体得分和最低的时间漂移。
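The reversed generation order can be summarized schematically; `denoise_step` stands in for the actual model call, and the block/step structure is a reading of the abstract rather than the authors' code:
```python
# Schematic of the reversed order described in the abstract: instead of
# fully denoising block b before starting block b+1, every denoising step
# sweeps causally across all blocks, so each block is conditioned on
# context at the same (current) noise level.
def hiar_generate(init_blocks, denoise_step, num_steps=4):
    blocks = list(init_blocks)              # all blocks start as pure noise
    for step in range(num_steps):           # one shared noise level per sweep
        for b in range(len(blocks)):
            context = blocks[:b]            # causal, same-noise-level context
            blocks[b] = denoise_step(blocks[b], context, step)
    return blocks

# Block b at step s only needs block b-1 to have finished step s, so block
# b-1 can move on to step s+1 while block b works on step s: the wavefront
# parallelism behind the reported 1.8x pipelined speedup.
```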
cs.CV / 280 / 2603.08708

FVG-PT: Adaptive Foreground View-Guided Prompt Tuning for Vision-Language Models

FVG-PT:面向视觉-语言模型的自适应前景视图引导提示调优
Li, Haoyang, Wang, Liang, Zhou, Siyu, Sun, Jiacheng, Jiang, Jing, Wang, Chao, Long, Guodong, Peng, Yan
Abstract
CLIP-based prompt tuning enables pretrained Vision-Language Models (VLMs) to efficiently adapt to downstream tasks. Although existing studies have made significant progress, they pay limited attention to changes in the internal attention representations of VLMs during the tuning process. In this paper, we attribute the failure modes of prompt tuning predictions to shifts in foreground attention of the visual encoder, and propose Foreground View-Guided Prompt Tuning (FVG-PT), an adaptive plug-and-play foreground attention guidance module, to alleviate the shifts. Concretely, FVG-PT introduces a learnable Foreground Reliability Gate to automatically enhance the foreground view quality, applies a Foreground Distillation Compensation module to guide visual attention toward the foreground, and further introduces a Prior Calibration module to mitigate generalization degradation caused by excessive focus on the foreground. Experiments on multiple backbone models and datasets show the effectiveness and compatibility of FVG-PT. Codes are available at: https://github.com/JREion/FVG-PT
Chinese Translation
基于CLIP的提示调优使得预训练的视觉-语言模型(VLMs)能够高效地适应下游任务。尽管现有研究取得了显著进展,但它们对VLMs在调优过程中内部注意力表示的变化关注有限。本文将提示调优预测的失败模式归因于视觉编码器前景注意力的偏移,并提出了前景视图引导提示调优(FVG-PT),这是一种自适应的即插即用前景注意力引导模块,以缓解这些偏移。具体而言,FVG-PT引入了一个可学习的前景可靠性门,自动增强前景视图质量,应用前景蒸馏补偿模块引导视觉注意力朝向前景,并进一步引入先验校准模块,以减轻因过度关注前景而导致的泛化退化。在多个主干模型和数据集上的实验证明了FVG-PT的有效性和兼容性。代码可在以下链接获取:https://github.com/JREion/FVG-PT
cs.CV / 281 / 2603.08709

Scale Space Diffusion

尺度空间扩散
Mukhopadhyay, Soumik, Udhayanan, Prateksha, Shrivastava, Abhinav
Abstract
Diffusion models degrade images through noise, and reversing this process reveals an information hierarchy across timesteps. Scale-space theory exhibits a similar hierarchy via low-pass filtering. We formalize this connection and show that highly noisy diffusion states contain no more information than small, downsampled images - raising the question of why they must be processed at full resolution. To address this, we fuse scale spaces into the diffusion process by formulating a family of diffusion models with generalized linear degradations and practical implementations. Using downsampling as the degradation yields our proposed Scale Space Diffusion. To support Scale Space Diffusion, we introduce Flexi-UNet, a UNet variant that performs resolution-preserving and resolution-increasing denoising using only the necessary parts of the network. We evaluate our framework on CelebA and ImageNet and analyze its scaling behavior across resolutions and network depths. Our project website ( https://prateksha.github.io/projects/scale-space-diffusion/ ) is available publicly.
Chinese Translation
扩散模型通过噪声降解图像,逆转这一过程揭示了时间步长上的信息层次。尺度空间理论通过低通滤波展现了类似的层次结构。我们形式化了这一联系,并表明高度噪声的扩散状态所包含的信息不比小的下采样图像更多——这引发了为什么必须以全分辨率处理它们的问题。为了解决这个问题,我们通过构建一系列具有广义线性降解和实用实现的扩散模型,将尺度空间融入扩散过程。将下采样作为降解方法,得出了我们提出的尺度空间扩散。为了支持尺度空间扩散,我们引入了Flexi-UNet,这是一种UNet变体,仅使用网络的必要部分进行保留分辨率和提高分辨率的去噪。我们在CelebA和ImageNet上评估了我们的框架,并分析了其在不同分辨率和网络深度下的扩展行为。我们的项目网站(https://prateksha.github.io/projects/scale-space-diffusion/)已公开可用。
人工智能 (Artificial Intelligence)
81
cs.AI / 1 / 2603.06587

Autonomous AI Agents for Option Hedging: Enhancing Financial Stability through Shortfall Aware Reinforcement Learning

用于期权对冲的自主人工智能代理:通过关注短缺的强化学习增强金融稳定性
Hu, Minxuan, Chen, Ziheng, Yi, Jiayu, Sun, Wenxi
Abstract
The deployment of autonomous AI agents in derivatives markets has widened a practical gap between static model calibration and realized hedging outcomes. We introduce two reinforcement learning frameworks, a novel Replication Learning of Option Pricing (RLOP) approach and an adaptive extension of Q-learner in Black-Scholes (QLBS), that prioritize shortfall probability and align learning objectives with downside sensitive hedging. Using listed SPY and XOP options, we evaluate models using realized path delta hedging outcome distributions, shortfall probability, and tail risk measures such as Expected Shortfall. Empirically, RLOP reduces shortfall frequency in most slices and shows the clearest tail-risk improvements in stress, while implied volatility fit often favors parametric models yet poorly predicts after-cost hedging performance. This friction-aware RL framework supports a practical approach to autonomous derivatives risk management as AI-augmented trading systems scale.
Chinese Translation
自主人工智能代理在衍生品市场的部署扩大了静态模型校准与实际对冲结果之间的实际差距。我们提出了两种强化学习框架,一种是新颖的期权定价复制学习(Replication Learning of Option Pricing, RLOP)方法,另一种是布莱克-舒尔斯(Black-Scholes)模型中Q学习器的自适应扩展(Q-learner in Black-Scholes, QLBS),它们优先考虑短缺概率,并将学习目标与对下行风险敏感的对冲对齐。通过使用上市的SPY和XOP期权,我们评估了模型在实现路径的德尔塔对冲结果分布、短缺概率以及尾部风险度量(如预期短缺)方面的表现。从实证上看,RLOP在大多数切片中减少了短缺频率,并在压力测试中显示出最明显的尾部风险改善,而隐含波动率拟合往往偏向参数模型,但在成本后的对冲表现预测上效果不佳。随着人工智能增强的交易系统规模不断扩大,这种摩擦感知的强化学习框架为自主衍生品风险管理提供了一种实用的方法。
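For reference, the two tail-risk measures used for evaluation can be computed from a vector of realized hedging P&L as follows (a minimal NumPy sketch; the paper's estimators and thresholds are not specified in the abstract):
```python
import numpy as np

def shortfall_probability(pnl, threshold=0.0):
    # Fraction of realized delta-hedging paths ending below a P&L threshold.
    return float((np.asarray(pnl) < threshold).mean())

def expected_shortfall(pnl, alpha=0.95):
    # Expected Shortfall (CVaR) at level alpha: the mean loss over the worst
    # (1 - alpha) fraction of outcomes. Losses are the negative of P&L.
    losses = -np.asarray(pnl, dtype=float)
    var = np.quantile(losses, alpha)        # Value-at-Risk cutoff
    return float(losses[losses >= var].mean())

# Toy example on simulated hedging P&L (numbers are illustrative):
rng = np.random.default_rng(0)
pnl = rng.normal(loc=0.05, scale=1.0, size=10_000)
print(shortfall_probability(pnl))     # ~0.48 for this distribution
print(expected_shortfall(pnl, 0.95))  # mean loss in the worst 5% of paths
```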
cs.AI / 2 / 2603.06608

Scaling Strategy, Not Compute: A Stand-Alone, Open-Source StarCraft II Benchmark for Accessible Reinforcement Learning Research

规模化策略,而非计算:一个独立的开源《星际争霸 II》基准测试,旨在促进可及的强化学习研究
Panda, Sourav, Kale, Shreyash, Ambadkar, Tanmay, Verma, Abhinav, Dodge, Jonathan
Abstract
The research community lacks a middle ground between StarCraft II's full game and its mini-games. The full game's sprawling state-action space renders reward signals sparse and noisy, but in mini-games simple agents saturate performance. This complexity gap hinders steady curriculum design and prevents researchers from experimenting with modern Reinforcement Learning algorithms in RTS environments under realistic compute budgets. To fill this gap, we present the Two-Bridge Map Suite, the first entry in an open-source benchmark series we purposely engineered as an intermediate benchmark to sit between these extremes. By disabling economy mechanics such as resource collection, base building, and fog-of-war, the environment isolates two core tactical skills: long-range navigation and micro-combat. Preliminary experiments show that agents learn coherent maneuvering and engagement behaviors without imposing full-game computational costs. Two-Bridge is released as a lightweight, Gym-compatible wrapper on top of PySC2, with maps, wrappers, and reference scripts fully open-sourced to encourage broad adoption as a standard benchmark.
Chinese Translation
研究界在《星际争霸 II》的完整游戏与其迷你游戏之间缺乏一个中间选择。完整游戏庞大的状态-动作空间使得奖励信号稀疏且噪声较大,而在迷你游戏中,简单的智能体即可使性能达到饱和。这种复杂性差距阻碍了稳定的课程设计,并使研究人员无法在现实的计算预算下在RTS环境中试验现代强化学习算法。为填补这一空白,我们提出了“两桥地图套件”,这是我们专门设计的开源基准系列中的第一个条目,旨在作为这两种极端之间的中间基准。通过禁用经济机制,如资源收集、基地建设和战争迷雾,该环境隔离了两个核心战术技能:远程导航和微观战斗。初步实验表明,智能体在不施加完整游戏计算成本的情况下,能够学习到连贯的机动和交战行为。“两桥”作为一个轻量级的、兼容Gym的封装,基于PySC2发布,地图、封装和参考脚本均完全开源,以鼓励广泛采用作为标准基准。
cs.AI / 3 / 2603.06679

MultiGen: Level-Design for Editable Multiplayer Worlds in Diffusion Game Engines

MultiGen:在扩散游戏引擎中为可编辑的多人世界进行关卡设计
Po, Ryan, Zhang, David Junhao, Hertz, Amir, Wetzstein, Gordon, Wadhwa, Neal, Ruiz, Nataniel
Abstract
Video world models have shown immense promise for interactive simulation and entertainment, but current systems still struggle with two important aspects of interactivity: user control over the environment for reproducible, editable experiences, and shared inference where players hold influence over a common world. To address these limitations, we introduce an explicit external memory into the system, a persistent state operating independent of the model's context window, that is continually updated by user actions and queried throughout the generation roll-out. Unlike conventional diffusion game engines that operate as next-frame predictors, our approach decomposes generation into Memory, Observation, and Dynamics modules. This design gives users direct, editable control over environment structure via an editable memory representation, and it naturally extends to real-time multiplayer rollouts with coherent viewpoints and consistent cross-player interactions.
Chinese Translation
视频世界模型在互动模拟和娱乐方面展现了巨大的潜力,但当前系统在互动性方面仍面临两个重要挑战:用户对环境的控制,以实现可重复、可编辑的体验,以及共享推理,即多名玩家共同影响同一个世界。为了解决这些局限性,我们在系统中引入了一个显式的外部记忆,这是一种独立于模型上下文窗口的持久状态,持续通过用户操作进行更新,并在生成过程中进行查询。与传统的作为下一帧预测器的扩散游戏引擎不同,我们的方法将生成过程分解为记忆(Memory)、观察(Observation)和动态(Dynamics)模块。这种设计使用户能够通过可编辑的记忆表示直接控制环境结构,并自然扩展到具有连贯视角和跨玩家一致交互的实时多人推演。
cs.AI / 4 / 2603.06797

Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment

最佳尾部:在推理时对齐中桥接乐观与悲观
Hsu, Hsiang, Lei, Eric, Chen, Chun-Fu
Abstract
Inference-time alignment effectively steers large language models (LLMs) by generating multiple candidates from a reference model and selecting among them with an imperfect reward model. However, current strategies face a fundamental dilemma: ``optimistic'' approaches like Best-of-$N$ suffer from reward hacking, while ``pessimistic'' regularized methods often stifle the exploration needed to discover high-quality responses. In this work, we formalize this trade-off through the lens of regret minimization, demonstrating that the optimal strategy depends critically on the tail behavior of the reward distribution. We show theoretically that light-tailed regimes favor optimism to unearth high-quality outliers, whereas heavy-tailed regimes require pessimism to guard against reward mis-calibration in the extremes. Guided by this insight, we introduce Best-of-Tails (BoT), an adaptive inference-time alignment framework that uses Tsallis divergence as a tunable regularizer to provide a finer granularity of interpolation between these extremes. BoT uses the Hill estimator to characterize reward-tail heaviness on a per-prompt basis and dynamically adjusts its selection rule to balance exploration gains against alignment error. Across math, multiple-choice reasoning, and human-preference evaluations, BoT improves alignment performance across a range of reference and reward model configurations relative to fixed-strategy baselines.
Chinese Translation
推理时对齐通过从参考模型生成多个候选项并利用不完美的奖励模型进行选择,有效地引导大型语言模型(LLMs)。然而,当前策略面临一个根本性困境:像 Best-of-$N$ 这样的“乐观”方法容易受到奖励操控的影响,而“悲观”正则化方法往往抑制了发现高质量响应所需的探索。在本研究中,我们通过遗憾最小化的视角形式化了这一权衡,证明了最佳策略在很大程度上依赖于奖励分布的尾部行为。我们理论上表明,轻尾分布有利于乐观,以发现高质量的异常值,而重尾分布则需要悲观以防止在极端情况下的奖励误校准。基于这一见解,我们提出了最佳尾部(Best-of-Tails, BoT),一个自适应的推理时对齐框架,使用 Tsallis 散度作为可调正则化器,以在这些极端之间提供更细粒度的插值。BoT 使用 Hill 估计器逐提示刻画奖励分布尾部的厚重程度,并动态调整其选择规则,以平衡探索收益与对齐误差。在数学、选择题推理和人类偏好评估中,BoT 在各种参考模型和奖励模型配置下相对于固定策略基线提高了对齐性能。
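The Hill estimator mentioned here has a standard closed form over the top-k order statistics; a minimal sketch (the choice of k and BoT's exact per-prompt usage are assumptions):
```python
import numpy as np

def hill_estimator(samples, k=None):
    # Hill estimator of the tail index from the top-k order statistics
    # (smaller index => heavier tail). The rule for choosing k, and how BoT
    # maps the estimate to a selection rule, are not given in the abstract.
    x = np.sort(np.asarray(samples, dtype=float))[::-1]  # descending
    x = x[x > 0]                                          # needs positives
    if k is None:
        k = max(1, int(np.sqrt(len(x))))                  # common heuristic
    gamma = np.mean(np.log(x[:k]) - np.log(x[k]))         # Hill's gamma
    return 1.0 / gamma                                    # tail index

rng = np.random.default_rng(0)
print(hill_estimator(rng.pareto(2.0, 5000) + 1.0))  # ~2: heavy tail
print(hill_estimator(rng.exponential(1.0, 5000)))   # larger: lighter tail
```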
cs.AI / 5 / 2603.06801

Breaking the Martingale Curse: Multi-Agent Debate via Asymmetric Cognitive Potential Energy

打破马丁戈尔诅咒:通过不对称认知潜能进行多智能体辩论
Liu, Yuhan, Zhang, Juntian, Wu, Yichen, Takac, Martin, Lahlou, Salem, Chen, Xiuying, Lukas, Nils
Abstract
Multi-Agent Debate (MAD) has emerged as a promising paradigm for enhancing large language model reasoning. However, recent work reveals a limitation: standard MAD cannot improve belief correctness beyond majority voting; we refer to this as the Martingale Curse. This curse arises because correlated errors cause agents to converge toward erroneous consensus, where debate merely reinforces collective mistakes rather than filtering noise. We propose AceMAD, a framework that breaks the Martingale Curse by harnessing asymmetric cognitive potential energy to transform MAD from a random walk into a directed convergence process with positive drift. Through a peer-prediction mechanism, agents predict their peers' belief distributions, revealing asymmetric cognitive potential: truth-holders not only know the correct answer but also anticipate the crowd's misconceptions, while the hallucinating majority remains blind to their collective error. This asymmetry creates a potential energy gap that we quantify via strictly proper scoring rules. We prove this cognitive potential manifests as information-theoretic superiority and, under nonlinear aggregation, converts into submartingale drift toward truth, directly breaking the Martingale Curse. Experiments on challenging subsets across six benchmarks show AceMAD recovers sparse truth signals even when initial majorities are incorrect, substantially outperforming baseline methods.
Chinese Translation
多智能体辩论(Multi-Agent Debate, MAD)作为增强大型语言模型推理的有前景的范式而出现。然而,近期的研究揭示了一个局限性:标准的MAD无法使信念的正确性超越多数投票;我们称之为马丁戈尔诅咒。该诅咒的产生是由于相关错误导致智能体趋向于错误共识,在这种情况下,辩论仅仅强化了集体错误,而不是过滤噪声。我们提出了AceMAD,一个通过利用不对称认知潜能打破马丁戈尔诅咒的框架,旨在将MAD从随机游走转变为具有正漂移的定向收敛过程。通过同行预测机制,智能体预测其同伴的信念分布,揭示出不对称的认知潜能:持有真相者不仅知道正确答案,还能预见人群的误解,而产生幻觉的多数则对其集体错误视而不见。这种不对称性产生了一个认知势能差距,我们通过严格适当评分规则(strictly proper scoring rules)对其进行量化。我们证明这种认知潜能表现为信息论上的优越性,并在非线性聚合下转化为向真相的次马丁戈尔漂移,直接打破马丁戈尔诅咒。在六个基准测试的挑战性子集上的实验表明,AceMAD即使在初始多数错误的情况下也能恢复稀疏的真实信号,显著优于基线方法。
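The asymmetry the abstract exploits can be illustrated with a toy nonlinear aggregation: weight each agent's vote by how well its peer prediction matches the empirical answer distribution under a strictly proper (Brier) score. This is a didactic sketch, not AceMAD's actual rule:
```python
import numpy as np

def peer_weighted_vote(answers, peer_preds, n_options):
    # answers: each agent's chosen option index.
    # peer_preds: each agent's predicted distribution over what the crowd
    # will answer. Agents that anticipate the crowd (truth-holders foreseeing
    # misconceptions) earn more voting weight; the temperature 10.0 and the
    # exponential weighting are illustrative choices.
    answers = np.asarray(answers)
    empirical = np.bincount(answers, minlength=n_options) / len(answers)
    brier = [-(np.square(np.asarray(p) - empirical)).sum() for p in peer_preds]
    weights = np.exp(np.asarray(brier) * 10.0)
    votes = np.zeros(n_options)
    for a, w in zip(answers, weights):
        votes[a] += w
    return int(votes.argmax())

# Three agents pick option 1 (a wrong majority); the lone truth-holder picks
# 0 and also predicts the crowd's skew toward 1, earning decisive weight:
answers = [1, 1, 1, 0]
peer_preds = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.25, 0.75]]
print(peer_weighted_vote(answers, peer_preds, 2))  # -> 0
```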
cs.AI / 6 / 2603.06811

Making AI Evaluation Deployment Relevant Through Context Specification

通过上下文规范化使人工智能评估部署更具相关性
Holmes, Matthew, Lacerda, Thiago, Schwartz, Reva
Abstract
With many organizations struggling to gain value from AI deployments, pressure to evaluate AI in an informed manner has intensified. Status quo AI evaluation approaches mask the operational realities that ultimately determine deployment success, making it difficult for decision makers outside the stack to know whether and how AI tools will deliver durable value. We introduce and describe context specification as a process to support and inform the deployment decision making process. Context specification turns diffuse stakeholder perspectives about what matters in a given setting into clear, named constructs: explicit definitions of the properties, behaviors, and outcomes that evaluations aim to capture, so they can be observed and measured in context. The process serves as a foundational roadmap for evaluating what AI systems are likely to do in the deployment contexts that organizations actually manage.
Chinese Translation
许多组织在从人工智能部署中获取价值方面面临困难,因此对以知情方式评估人工智能的压力加大。现有的人工智能评估方法掩盖了最终决定部署成功的操作现实,使得外部决策者难以了解人工智能工具是否以及如何提供持久价值。我们引入并描述了上下文规范化作为支持和告知部署决策过程的一个过程。上下文规范化将利益相关者对特定环境中重要事项的模糊看法转化为明确的、命名的构念:对评估旨在捕捉的属性、行为和结果的明确定义,以便在上下文中观察和测量。这一过程为评估人工智能系统在组织实际管理的部署环境中可能的表现提供了基础性路线图。
cs.AI / 7 / 2603.06813

Reinforcing the World's Edge: A Continual Learning Problem in the Multi-Agent-World Boundary

强化世界边界:多智能体世界边界中的持续学习问题
Malenfant, Dane
Abstract
Reusable decision structure survives across episodes in reinforcement learning, but this depends on how the agent--world boundary is drawn. In stationary, finite-horizon MDPs, an invariant core: the (not-necessarily contiguous) subsequences of state--action pairs shared by all successful trajectories (optionally under a simple abstraction) can be constructed. Under mild goal-conditioned assumptions, its existence can be proven and explained by how the core captures prototypes that transfer across episodes. When the same task is embedded in a decentralized Markov game and the peer agent is folded into the world, each peer-policy update induces a new MDP; the per-episode invariant core can shrink or vanish, even with small changes to the induced world dynamics, sometimes leaving only the individual task core or just nothing. This policy-induced non-stationarity can be quantified with a variation budget over the induced kernels and rewards, linking boundary drift to loss of invariants. The view that a continual RL problem arises from instability of the agent--world boundary (rather than exogenous task switches) in decentralized MARL suggests future work on preserving, predicting, or otherwise managing boundary drift.
Chinese Translation
在强化学习中,可重用的决策结构在多个回合中得以存活,但这依赖于智能体与世界边界的划定。在平稳的有限时域马尔可夫决策过程(MDP)中,可以构建一个不变的核心:所有成功轨迹共享的状态-动作对的(不一定是连续的)子序列(可选地在简单抽象下)。在温和的目标条件假设下,可以证明其存在性,并通过核心如何捕捉跨回合转移的原型进行解释。当相同任务嵌入到去中心化的马尔可夫博弈中,并且同伴智能体被纳入世界时,每次同伴策略的更新都会引入一个新的MDP;每回合的不变核心可能会缩小或消失,即使对诱导的世界动态进行小的改变,有时只留下个别任务核心或什么都不留下。这种策略诱导的非平稳性可以通过对诱导的核和奖励的变化预算进行量化,将边界漂移与不变性的丧失联系起来。认为持续强化学习问题源于去中心化多智能体强化学习(MARL)中智能体与世界边界的不稳定性(而非外生任务切换),提示未来的研究可以集中在保持、预测或以其他方式管理边界漂移上。
cs.AI / 8 / 2603.06869

Symmetry-Constrained Language-Guided Program Synthesis for Discovering Governing Equations from Noisy and Partial Observations

对称约束语言引导的程序合成用于从噪声和部分观测中发现控制方程
Baig, Mirza Samad Ahmed, Gillani, Syeda Anshrah
Abstract
Discovering compact governing equations from experimental observations is one of the defining objectives of quantitative science, yet practical discovery pipelines routinely fail when measurements are noisy, relevant state variables are unobserved, or multiple symbolic structures explain the data equally well within statistical uncertainty. Here we introduce SymLang (Symmetry-constrained Language-guided equation discovery), a unified framework that brings together three previously separate ideas: (i) typed symmetry-constrained grammars that encode dimensional analysis, group-theoretic invariance, and parity constraints as hard production rules, eliminating on average 71.3% of candidate expression trees before any fitting; (ii) language-model-guided program synthesis in which a fine-tuned 7B-parameter proposer, conditioned on interpretable data descriptors, efficiently navigates the constrained search space; and (iii) MDL-regularized Bayesian model selection coupled with block-bootstrap stability analysis that quantifies structural uncertainty rather than committing to a single best equation. Across 133 dynamical systems spanning classical mechanics, electrodynamics, thermodynamics, population dynamics, and nonlinear oscillators, SymLang achieves an exact structural recovery rate of 83.7% under 10% observational noise (a 22.4 percentage-point improvement over the next-best baseline), while reducing out-of-distribution extrapolation error by 61% and near-eliminating conservation-law violations ($3.1 \times 10^{-3}$ vs. $187.3 \times 10^{-3}$ physical drift for the closest competitor). In all tested regimes the framework correctly identifies structural degeneracy, reporting it explicitly rather than returning a confidently wrong single equation. The framework is fully open-source and reproducible, providing a principled pathway from raw data to interpretable, physically auditable symbolic laws.
Chinese Translation
从实验观测中发现紧凑的控制方程是定量科学的一个重要目标,但在实际发现过程中,当测量存在噪声、相关状态变量未被观测到,或多个符号结构在统计不确定性范围内同样能够解释数据时,常常会失败。在此,我们介绍了SymLang(对称约束语言引导的方程发现),这是一个统一框架,将三个先前独立的思想结合在一起:(i) 类型对称约束文法,这些文法将维度分析、群论不变性和奇偶性约束编码为硬性生成规则,在任何拟合之前平均消除了71.3%的候选表达树;(ii) 语言模型引导的程序合成,其中一个经过微调的7B参数提议者,基于可解释的数据描述符,有效地在约束搜索空间中导航;(iii) 结合区块自助法稳定性分析的MDL正则化贝叶斯模型选择,该方法量化结构不确定性,而不是仅仅承诺于单一最佳方程。在涵盖经典力学、电动力学、热力学、种群动态和非线性振荡器的133个动态系统中,SymLang在10%的观测噪声下实现了83.7%的精确结构恢复率,比下一个最佳基线提高了22.4个百分点,同时将分布外外推误差减少了61%,几乎消除了守恒定律的违反(物理漂移为$3.1 \times 10^{-3}$,而最接近的竞争者为$187.3 \times 10^{-3}$)。在所有测试的领域中,该框架正确识别结构简并,明确报告而不是返回一个自信但错误的单一方程。该框架是完全开源和可重复的,为从原始数据到可解释的、物理可审计的符号法则提供了一个有原则的路径。
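A hard dimensional-analysis constraint of the kind described can be sketched as exponent-vector bookkeeping over expression trees; the unit system, expression encoding, and variable names below are illustrative assumptions:
```python
import numpy as np

# Dimensions as exponent vectors over (M, L, T): addition requires equal
# dimensions, multiplication adds them. Trees violating this are rejected
# as hard production-rule failures, before any constant fitting.
DIM = {"m": np.array([1, 0, 0]),   # mass
       "x": np.array([0, 1, 0]),   # length
       "t": np.array([0, 0, 1]),   # time
       "v": np.array([0, 1, -1])}  # velocity

def dims(expr):
    op = expr[0]
    if op == "var":
        return DIM[expr[1]]
    if op in ("+", "-"):
        a, b = dims(expr[1]), dims(expr[2])
        if not np.array_equal(a, b):
            raise ValueError("dimension mismatch")
        return a
    if op == "*":
        return dims(expr[1]) + dims(expr[2])
    if op == "/":
        return dims(expr[1]) - dims(expr[2])
    raise ValueError(op)

def dimensionally_valid(expr):
    try:
        dims(expr)
        return True
    except ValueError:
        return False

# m*v^2 (kinetic-energy-like) is valid; m*v + x is pruned without fitting.
ke = ("*", ("var", "m"), ("*", ("var", "v"), ("var", "v")))
bad = ("+", ("*", ("var", "m"), ("var", "v")), ("var", "x"))
print(dimensionally_valid(ke), dimensionally_valid(bad))  # True False
```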
cs.AI / 9 / 2603.06870

LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning

LEAD:打破长时间推理中的无恢复瓶颈
Pushkin, Denys, Abbe, Emmanuel
Abstract
Long-horizon execution in Large Language Models (LLMs) remains unstable even when high-level strategies are provided. Evaluating on controlled algorithmic puzzles, we demonstrate that while decomposition is essential for stability, extreme decomposition creates a "no-recovery bottleneck". We show that this bottleneck becomes critical due to highly non-uniform error distribution, where consistent errors on a few "hard" steps become irreversible. To address this, we propose Lookahead-Enhanced Atomic Decomposition (LEAD). By incorporating short-horizon future validation and aggregating overlapping rollouts, LEAD provides enough isolation to maintain stability while retaining enough local context to correct errors. This enables the o4-mini model to solve Checkers Jumping up to complexity $n=13$, whereas extreme decomposition fails beyond $n=11$.
Chinese Translation
即使在提供高层次策略的情况下,大型语言模型(LLMs)在长时间执行中仍然不稳定。通过在受控算法难题上的评估,我们证明了尽管分解对于稳定性至关重要,但极端分解会造成“无恢复瓶颈”。我们显示,这一瓶颈因高度不均匀的错误分布而变得至关重要,其中在少数“困难”步骤上的一致性错误变得不可逆。为了解决这个问题,我们提出了前瞻增强原子分解(Lookahead-Enhanced Atomic Decomposition,LEAD)。通过结合短时域未来验证并聚合重叠的推演(rollouts),LEAD提供了足够的隔离以维持稳定性,同时保留足够的局部上下文以纠正错误。这使得o4-mini模型能够解决复杂度高达$n=13$的跳棋问题,而极端分解在$n=11$之后就无法成功。
cs.AI / 10 / 2603.06874

LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models

LieCraft:一种评估语言模型欺骗能力的多智能体框架
Olson, Matthew Lyle, Ratzlaff, Neale, Hinck, Musashi, Nguyen, Tri, Lal, Vasudev, Campbell, Joseph, Stepputtis, Simon, Tseng, Shao-Yen
Abstract
Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serious safety risks, particularly the potential for deception as models acquire increased agency and human oversight diminishes. In this work, we present LieCraft: a novel evaluation framework and sandbox for measuring LLM deception that addresses key limitations of prior game-based evaluations. At its core, LieCraft is a novel multiplayer hidden-role game in which players select an ethical alignment and execute strategies over a long time-horizon to accomplish missions. Cooperators work together to solve event challenges and expose bad actors, while Defectors evade suspicion while secretly sabotaging missions. To enable real-world relevance, we develop 10 grounded scenarios such as childcare, hospital resource allocation, and loan underwriting that recontextualize the underlying mechanics in ethically significant, high-stakes domains. We ensure balanced gameplay in LieCraft through careful design of game mechanics and reward structures that incentivize meaningful strategic choices while eliminating degenerate strategies. Beyond the framework itself, we report results from 12 state-of-the-art LLMs across three behavioral axes: propensity to defect, deception skill, and accusation accuracy. Our findings reveal that despite differences in competence and overall alignment, all models are willing to act unethically, conceal their intentions, and outright lie to pursue their goals.
Chinese Translation
大型语言模型(LLMs)展现了令人印象深刻的通用能力,但也引入了严重的安全风险,特别是随着模型获得更大的自主性而人类监督减少,欺骗的潜力愈发显著。在本研究中,我们提出了LieCraft:一个新颖的评估框架和沙盒,用于测量LLM的欺骗能力,解决了以往基于游戏的评估的关键局限性。LieCraft的核心是一个新颖的多人隐藏角色游戏,玩家选择一种伦理阵营(alignment)并在较长的时间范围内执行策略以完成任务。合作者共同努力解决事件挑战并揭露不良行为者,而背叛者则在秘密破坏任务的同时避免怀疑。为了增强现实世界的相关性,我们开发了10个贴近现实的场景,如儿童保育、医院资源分配和贷款承保,将基础机制重新置于具有重要伦理意义的高风险领域中。我们通过精心设计游戏机制和奖励结构,确保LieCraft中的游戏平衡,激励有意义的战略选择,同时消除退化策略。除了框架本身,我们还报告了12个最先进的LLM在三个行为轴上的结果:背叛倾向、欺骗技能和指控准确性。我们的研究发现,尽管在能力和整体对齐上存在差异,所有模型都愿意采取不道德的行为,隐瞒其意图,并为了追求目标而撒谎。
cs.AI / 11 / 2603.06878

Not Too Short, Not Too Long: How LLM Response Length Shapes People's Critical Thinking in Error Detection

不短也不长:大型语言模型的响应长度如何影响人们在错误检测中的批判性思维
Friedman, Natalie, Nyanyo, Adelaide, Weatherwax, Kevin, Wang, Lifei, Zhu, Chengchao, Zhu, Zeshu, Mountford, S. Joy
Abstract
Large language models (LLMs) have become common decision-support tools across educational and professional contexts, raising questions about how their outputs shape human critical thinking. Prior work suggests that the amount of AI assistance can influence cognitive engagement, yet little is known about how specific properties of LLM outputs (e.g., response length) impact users' critical evaluation of information. In this study, we examine whether the length of LLM responses shapes users' accuracy in evaluating LLM-generated reasoning on critical thinking tasks, particularly in interaction with the correctness of the LLM's reasoning. To begin evaluating this, we conducted a within-subjects experiment with 24 participants who completed 15 modified Watson--Glaser critical thinking items, each accompanied by an LLM-generated explanation that varied in length and correctness. Mixed-effects logistic regression revealed a strong and statistically reliable effect of LLM output correctness on participant accuracy, with participants more likely to answer correctly when the LLM's explanation was correct. Response length appeared to moderate this effect: when the LLM output was incorrect, medium-length explanations were associated with higher participant accuracy than either shorter or longer explanations, whereas accuracy remained high across lengths when the LLM output was correct. Together, these findings suggest that response length alone may be insufficient to support critical thinking, and that how reasoning is presented (including a potential advantage of mid-length explanations under some conditions) points to design opportunities for LLM-based decision-support systems that emphasize transparent reasoning and calibrated expressions of certainty.
Chinese Translation
大型语言模型(LLMs)已成为教育和专业领域常见的决策支持工具,这引发了关于其输出如何塑造人类批判性思维的问题。先前的研究表明,AI辅助的程度可能会影响认知参与,但关于LLM输出的特定属性(例如,响应长度)如何影响用户对信息的批判性评估的研究仍然较少。在本研究中,我们考察了LLM响应的长度是否影响用户在批判性思维任务中评估LLM生成的推理的准确性,特别是在与LLM推理的正确性互动时。为此,我们进行了一个包含24名参与者的组内实验,参与者完成了15个修改过的Watson--Glaser批判性思维题目,每个题目都附有一个长度和正确性不同的LLM生成的解释。混合效应逻辑回归分析显示,LLM输出的正确性对参与者准确性有显著且统计上可靠的影响,当LLM的解释正确时,参与者更可能回答正确。响应长度似乎调节了这一效应:当LLM输出不正确时,中等长度的解释与参与者的较高准确性相关,而较短或较长的解释则不然;而当LLM输出正确时,各种长度的准确性保持较高。综上所述,这些发现表明,仅仅依靠响应长度可能不足以支持批判性思维,而推理的呈现方式——包括在某些条件下中等长度解释的潜在优势——为基于LLM的决策支持系统的设计提供了强调透明推理和适度表达确定性的机会。
cs.AI / 12 / 2603.06884

Distributed Legal Infrastructure for a Trustworthy Agentic Web

可信代理网络的分布式法律基础设施
Chaffer, Tomer Jordi, Zhang, Victor Jiawei, Facchini, Sante Dino, Hu, Botao Amber, Rong, Helena, Guo, Zihan, Wang, Xisen, Santana, Carlos, De Gasperis, Giovanni
Abstract
The agentic web marks a structural transition from a human-centered information network to a digital environment populated by artificial intelligence (AI) agents that perceive, decide, and act autonomously. As delegated action unfolds at machine speed, exceeds discrete moments of human judgment, and distributes decision-making across non-human actors, existing legal frameworks face growing strain, creating an urgent need for new mechanisms capable of sustaining legality in this emerging order. A trustworthy agentic web therefore depends on the infrastructuring of legality through interoperable protocols that organize identity, delegation, and accountability across systems, enabling coherent governance beyond isolated platforms. Towards this end, this article advances a distributed legal infrastructure (DLI), a governance paradigm composed of five interlocking layers: (1) self-sovereign, soulbound agent identities; (2) cognitive AI logic and constraint systems; (3) decentralized adjudication mechanisms for dispute resolution; (4) bottom-up agentic market regulation to mitigate information asymmetries and network effects, including insurance-based models; and (5) portable institutional frameworks that enable legal interoperability while preserving plural sources of authority. This reference framework contributes to emerging research on embedding legality within agentic web infrastructure, aligning distributed technical systems with accountability, contestability, and rule-of-law principles.
Chinese Translation
代理网络标志着从以人为中心的信息网络向一个由人工智能(AI)代理构成的数字环境的结构性转变,这些代理能够自主感知、决策和行动。随着委托行为以机器速度展开,超越人类判断的离散时刻,并在非人类参与者之间分配决策,现有法律框架面临日益增长的压力,迫切需要新的机制来维持这一新兴秩序中的合法性。因此,可信的代理网络依赖于通过可互操作的协议构建法律基础设施,这些协议在系统之间组织身份、委托和问责,从而实现超越孤立平台的连贯治理。为此,本文提出了一种分布式法律基础设施(DLI),这是一种由五个相互关联的层次组成的治理范式:(1) 自主的、绑定灵魂的代理身份;(2) 认知AI逻辑和约束系统;(3) 用于争议解决的去中心化裁决机制;(4) 自下而上的代理市场监管,以减轻信息不对称和网络效应,包括基于保险的模型;(5) 可移植的制度框架,使法律互操作性得以实现,同时保留多元的权威来源。该参考框架为将合法性嵌入代理网络基础设施的新兴研究提供了贡献,使分布式技术系统与问责、可争议性和法治原则相一致。
cs.AI / 13 / 2603.06888

Enhancing the Detection of Coronary Artery Disease Using Machine Learning

利用机器学习增强冠状动脉疾病的检测
Singh, Karan Kumar, Gajbhiye, Nikita, Mishra, Gouri Sankar
Abstract
Coronary Artery Disease (CAD) remains a leading cause of morbidity and mortality worldwide. Early detection is critical to improve patient outcomes and decrease healthcare costs. In recent years, machine learning (ML) advancements have shown significant potential in enhancing the accuracy of CAD diagnosis. This study investigates the application of ML algorithms to improve the detection of CAD by analyzing patient data, including clinical features, imaging, and biomarker profiles. Bi-directional Long Short-Term Memory (Bi-LSTM), Gated Recurrent Units (GRU), and a hybrid of Bi-LSTM+GRU were trained on large datasets to predict the presence of CAD. Results demonstrated that these ML models outperformed traditional diagnostic methods in sensitivity and specificity, offering a robust tool for clinicians to make more informed decisions. The experimental results show that the hybrid model achieved an accuracy of 97.07%. By integrating advanced data preprocessing techniques and feature selection, this study ensures optimal learning and model performance, setting a benchmark for the application of ML in CAD diagnosis. The integration of ML into CAD detection presents a promising avenue for personalized healthcare and could play a pivotal role in the future of cardiovascular disease management.
Chinese Translation
冠状动脉疾病(CAD)仍然是全球发病率和死亡率的主要原因。早期检测对改善患者预后和降低医疗成本至关重要。近年来,机器学习(ML)的进展在提高CAD诊断准确性方面显示出显著潜力。本研究探讨了应用ML算法通过分析患者数据(包括临床特征、影像学和生物标志物特征)来改善CAD检测。双向长短期记忆网络(Bi-LSTM)、门控循环单元(GRU)以及Bi-LSTM+GRU的混合模型在大型数据集上进行了训练,以预测CAD的存在。结果表明,这些ML模型在敏感性和特异性方面优于传统诊断方法,为临床医生提供了更为可靠的决策工具。实验结果显示,混合模型的准确率达到了97.07%。通过整合先进的数据预处理技术和特征选择,本研究确保了最佳的学习和模型性能,为ML在CAD诊断中的应用设定了基准。将ML整合到CAD检测中为个性化医疗提供了有前景的途径,并可能在未来的心血管疾病管理中发挥关键作用。
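A minimal PyTorch sketch of a Bi-LSTM+GRU hybrid of the kind named in the abstract (layer sizes, stacking order, and the binary head are assumptions; the paper's configuration is not given):
```python
import torch
import torch.nn as nn

class BiLSTMGRUClassifier(nn.Module):
    # A bidirectional LSTM over the feature sequence, followed by a GRU
    # whose final hidden state feeds a binary head (CAD vs. no CAD).
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.bilstm = nn.LSTM(n_features, hidden, batch_first=True,
                              bidirectional=True)
        self.gru = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (B, T, n_features)
        h, _ = self.bilstm(x)             # (B, T, 2*hidden)
        _, h_n = self.gru(h)              # (1, B, hidden)
        return self.head(h_n.squeeze(0))  # (B, 1) logits

model = BiLSTMGRUClassifier(n_features=13)   # e.g. 13 clinical features
logits = model(torch.randn(8, 10, 13))
loss = nn.functional.binary_cross_entropy_with_logits(
    logits.squeeze(-1), torch.randint(0, 2, (8,)).float())
```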
cs.AI / 14 / 2603.06902

Empowering Locally Deployable Medical Agent via State Enhanced Logical Skills for FHIR-based Clinical Tasks

通过状态增强逻辑技能赋能本地可部署医疗代理以执行基于FHIR的临床任务
Yang, Wanrong, Liu, Zhengliang, Li, Yuan, Yan, Bingjie, Li, Lingfang, He, Mingguang, Wojtczak, Dominik, Zheng, Yalin, Shi, Danli
Abstract
While Large Language Models demonstrate immense potential as proactive Medical Agents, their real-world deployment is severely bottlenecked by data scarcity under privacy constraints. To overcome this, we propose State-Enhanced Logical-Skill Memory (SELSM), a training-free framework that distills simulated clinical trajectories into entity-agnostic operational rules within an abstract skill space. During inference, a Query-Anchored Two-Stage Retrieval mechanism dynamically fetches these entity-agnostic logical priors to guide the agent's step-by-step reasoning, effectively resolving the state polysemy problem. Evaluated on MedAgentBench -- the only authoritative high-fidelity virtual EHR sandbox benchmarked with real clinical data -- SELSM substantially elevates the zero-shot capabilities of locally deployable foundation models (30B--32B parameters). Notably, on the Qwen3-30B-A3B backbone, our framework completely eliminates task chain breakdowns to achieve a 100% completion rate, boosting the overall success rate by an absolute 22.67% and significantly outperforming existing memory-augmented baselines. This study demonstrates that equipping models with a dynamically updatable, state-enhanced cognitive scaffold is a privacy-preserving and computationally efficient pathway for local adaptation of AI agents to clinical information systems. While currently validated on FHIR-based EHR interactions as an initial step, the entity-agnostic design of SELSM provides a principled foundation toward broader clinical deployment.
Chinese Translation
尽管大型语言模型在作为主动医疗代理方面展现出巨大的潜力,但由于隐私限制下的数据稀缺,其在现实世界中的部署受到严重制约。为此,我们提出了状态增强逻辑技能记忆(State-Enhanced Logical-Skill Memory, SELSM),这是一种免训练框架,将模拟的临床轨迹提炼为抽象技能空间中的实体无关操作规则。在推理过程中,查询锚定的两阶段检索机制动态获取这些实体无关的逻辑先验,以指导代理的逐步推理,有效解决状态多义性问题。在MedAgentBench上进行评估——这是唯一一个经过真实临床数据基准测试的高保真虚拟电子健康记录(EHR)沙箱——SELSM显著提升了本地可部署基础模型(30B-32B参数)的零样本能力。值得注意的是,在Qwen3-30B-A3B主干上,我们的框架完全消除了任务链的中断,实现了100%的完成率,整体成功率绝对提升了22.67%,显著优于现有的增强记忆基线。本研究表明,为模型配备一个动态可更新的、状态增强的认知支架是将AI代理本地适应临床信息系统的一条隐私保护和计算高效的途径。尽管目前在基于FHIR的EHR交互中进行了初步验证,SELSM的实体无关设计为更广泛的临床部署提供了原则性基础。
cs.AI / 15 / 2603.07024

Enhancing Web Agents with a Hierarchical Memory Tree

通过层次记忆树增强网络代理
Tan, Yunteng, Gao, Zhi, Wu, Xinxiao
Abstract
Large language model-based web agents have shown strong potential in automating web interactions through advanced reasoning and instruction following. While retrieval-based memory derived from historical trajectories enables these agents to handle complex, long-horizon tasks, current methods struggle to generalize across unseen websites. We identify that this challenge arises from the flat memory structures that entangle high-level task logic with site-specific action details. This entanglement induces a workflow mismatch in new environments, where retrieved contents are conflated with current web, leading to logically inconsistent execution. To address this, we propose Hierarchical Memory Tree (HMT), a structured framework designed to explicitly decouple logical planning from action execution. HMT constructs a three-level hierarchy from raw trajectories via an automated abstraction pipeline: the Intent level maps diverse user instructions to standardized task goals; the Stage level defines reusable semantic subgoals characterized by observable pre-conditions and post-conditions; and the Action level stores action patterns paired with transferable semantic element descriptions. Leveraging this structure, we develop a stage-aware inference mechanism comprising a Planner and an Actor. By explicitly validating pre-conditions, the Planner aligns the current state with the correct logical subgoal to prevent workflow mismatch, while the Actor grounds actions by matching the stored semantic descriptions to the target page. Experimental results on Mind2Web and WebArena show that HMT significantly outperforms flat-memory methods, particularly in cross-website and cross-domain scenarios, highlighting the necessity of structured memory for robust generalization of web agents.
Chinese Translation
基于大型语言模型的网络代理在通过先进的推理和指令跟随自动化网络交互方面展现出强大的潜力。虽然基于历史轨迹的检索记忆使这些代理能够处理复杂的长期任务,但当前的方法在未见过的网站上泛化能力不足。我们发现,这一挑战源于扁平的记忆结构,它将高层次的任务逻辑与特定网站的行动细节纠缠在一起。这种纠缠在新环境中引发了工作流程的不匹配,检索到的内容与当前网页混淆,导致逻辑执行不一致。为了解决这一问题,我们提出了层次记忆树(Hierarchical Memory Tree, HMT),这是一个旨在明确解耦逻辑规划与行动执行的结构化框架。HMT通过自动化抽象管道从原始轨迹构建一个三级层次结构:意图层将多样的用户指令映射到标准化的任务目标;阶段层定义可重用的语义子目标,这些子目标由可观察的前置条件和后置条件特征化;行动层存储与可转移的语义元素描述相配对的行动模式。利用这一结构,我们开发了一种阶段感知推理机制,包括规划者(Planner)和执行者(Actor)。通过明确验证前置条件,规划者将当前状态与正确的逻辑子目标对齐,以防止工作流程的不匹配,而执行者则通过将存储的语义描述与目标页面匹配来落实行动。在Mind2Web和WebArena上的实验结果表明,HMT在跨网站和跨领域场景中显著优于扁平记忆方法,突显了结构化记忆对于网络代理稳健泛化的必要性。
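The three-level hierarchy and the Planner's pre-condition check can be sketched as plain data structures; the field names and the rule-based matching below are illustrative, and the real system is LLM-driven rather than hand-coded:
```python
from dataclasses import dataclass, field

@dataclass
class ActionPattern:                 # Action level
    pattern: str                     # e.g. "click", "type"
    element_desc: str                # transferable semantic description

@dataclass
class Stage:                         # Stage level: reusable semantic subgoal
    name: str
    pre_conditions: list[str]        # observable facts required to enter
    post_conditions: list[str]       # facts that hold once the stage is done
    actions: list[ActionPattern] = field(default_factory=list)

@dataclass
class Intent:                        # Intent level: standardized task goal
    goal: str
    stages: list[Stage] = field(default_factory=list)

def next_stage(intent, observed_facts):
    # Planner: pick the first stage whose pre-conditions all hold in the
    # current observation and whose post-conditions are not yet satisfied,
    # preventing the workflow mismatch described in the abstract.
    facts = set(observed_facts)
    for stage in intent.stages:
        if set(stage.post_conditions) <= facts:
            continue                 # already completed
        if set(stage.pre_conditions) <= facts:
            return stage
    return None                      # no logically valid subgoal yet
```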
cs.AI / 16 / 2603.07039

Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding

自监督多模态世界模型与4D时空嵌入
Legel, Lance, Huang, Qin, Voelker, Brandon, Neamati, Daniel, Johnson, Patrick Alan, Bastani, Favyen, Rose, Jeff, Hennessy, James Ryan, Guralnick, Robert, Soltis, Douglas, Soltis, Pamela, Wang, Shaowen
Abstract
We present DeepEarth, a self-supervised multi-modal world model with Earth4D, a novel planetary-scale 4D space-time positional encoder. Earth4D extends 3D multi-resolution hash encoding to include time, efficiently scaling across the planet over centuries with sub-meter, sub-second precision. Multi-modal encoders (e.g. vision-language models) are fused with Earth4D embeddings and trained via masked reconstruction. We demonstrate Earth4D's expressive power by achieving state-of-the-art performance on an ecological forecasting benchmark. Earth4D with learnable hash probing surpasses a multi-modal foundation model pre-trained on substantially more data. Access open source code and download models at: https://github.com/legel/deepearth
Chinese Translation
我们提出了DeepEarth,一种自监督的多模态世界模型,配备了Earth4D,这是一种新颖的行星尺度4D时空位置编码器。Earth4D将3D多分辨率哈希编码扩展到时间维度,能够以亚米级、亚秒级的精度在整个星球范围内高效扩展,覆盖数个世纪的时间跨度。多模态编码器(例如视觉-语言模型)与Earth4D嵌入融合,并通过掩码重建进行训练。我们通过在生态预测基准测试中取得最先进的性能,展示了Earth4D的表达能力。配备可学习哈希探测的Earth4D超越了在显著更多数据上预训练的多模态基础模型。开放源代码和模型下载请访问:https://github.com/legel/deepearth
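Extending multi-resolution hash encoding from 3D to 4D amounts to hashing integer (x, y, z, t) grid cells per level; a toy sketch without the corner interpolation or planetary coordinate handling of the real Earth4D (table sizes, primes, and level counts are illustrative):
```python
import torch
import torch.nn as nn

class Hash4D(nn.Module):
    # Each level maps a point to its integer 4D grid cell, hashes the cell to
    # a row of a learnable feature table, and concatenates features across
    # levels. Real Earth4D also interpolates across the 16 corners of the 4D
    # cell and handles planetary-scale coordinates with centuries of time.
    def __init__(self, levels=8, table_size=2**16, feat_dim=2, base_res=16):
        super().__init__()
        self.res = [base_res * 2**i for i in range(levels)]
        self.tables = nn.ModuleList(
            nn.Embedding(table_size, feat_dim) for _ in range(levels))
        self.table_size = table_size
        self.register_buffer(
            "primes", torch.tensor([1, 2654435761, 805459861, 3674653429]))

    def forward(self, xyzt):                       # (B, 4), coords in [0, 1]
        feats = []
        for res, table in zip(self.res, self.tables):
            c = (xyzt * res).long() * self.primes  # nearest cell, no interp.
            h = c[:, 0] ^ c[:, 1] ^ c[:, 2] ^ c[:, 3]
            feats.append(table(h % self.table_size))
        return torch.cat(feats, dim=-1)            # (B, levels * feat_dim)

enc = Hash4D()
emb = enc(torch.rand(32, 4))                       # -> torch.Size([32, 16])
```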
cs.AI / 17 / 2603.07053

Animating Petascale Time-varying Data on Commodity Hardware with LLM-assisted Scripting

在商品硬件上使用LLM辅助脚本对PB级时间变化数据进行动画化
Eliza, Ishrat Jahan, Huang, Xuan, Panta, Aashish, Sahistan, Alper, Li, Zhimin, Gooch, Amy A., Pascucci, Valerio
Abstract
Scientists face significant visualization challenges as time-varying datasets grow in speed and volume, often requiring specialized infrastructure and expertise to handle massive datasets. Petascale climate models generated in NASA laboratories require a dedicated group of graphics and media experts and access to high-performance computing resources. Scientists may need to share scientific results with the community iteratively and quickly. However, the time-consuming trial-and-error process incurs significant data transfer overhead and far exceeds the time and resources allocated for typical post-analysis visualization tasks, disrupting the production workflow. Our paper introduces a user-friendly framework for creating 3D animations of petascale, time-varying data on a commodity workstation. Our contributions: (i) Generalized Animation Descriptor (GAD) with a keyframe-based adaptable abstraction for animation, (ii) efficient data access from cloud-hosted repositories to reduce data management overhead, (iii) tailored rendering system, and (iv) an LLM-assisted conversational interface as a scripting module to allow domain scientists with no visualization expertise to create animations of their region of interest. We demonstrate the framework's effectiveness with two case studies: first, by generating animations in which sampling criteria are specified based on prior knowledge, and second, by generating AI-assisted animations in which sampling parameters are derived from natural-language user prompts. In all cases, we use large-scale NASA climate-oceanographic datasets that exceed 1PB in size yet achieve a fast turnaround time of 1 minute to 2 hours. Users can generate a rough draft of the animation within minutes, then seamlessly incorporate as much high-resolution data as needed for the final version.
Chinese Translation
随着时间变化数据集的速度和体积不断增长,科学家面临着显著的可视化挑战,这通常需要专门的基础设施和专业知识来处理海量数据集。在NASA实验室生成的PB级(petascale)气候模型需要一组专门的图形和媒体专家,并且需要访问高性能计算资源。科学家可能需要快速、迭代地与社区分享科学结果。然而,耗时的反复试验过程会产生显著的数据传输开销,远远超过为典型后期分析可视化任务分配的时间和资源,从而干扰生产工作流程。我们的论文介绍了一个用户友好的框架,用于在商品工作站上为PB级时间变化数据创建3D动画。我们的贡献包括:(i) 一种基于关键帧、可适应抽象的通用动画描述符(Generalized Animation Descriptor, GAD),(ii) 从云托管库高效访问数据以减少数据管理开销,(iii) 定制的渲染系统,以及 (iv) 一个LLM辅助的对话界面作为脚本模块,使没有可视化专业知识的领域科学家能够创建他们感兴趣区域的动画。我们通过两个案例研究展示了该框架的有效性:首先,通过生成基于先前知识指定的采样标准的动画,其次,通过生成AI辅助的动画,其中采样参数源自自然语言用户提示。在所有情况下,我们使用的大规模NASA气候-海洋数据集的大小超过1PB,但实现了1分钟到2小时的快速周转时间。用户可以在几分钟内生成动画的粗略草稿,然后无缝地整合所需的高分辨率数据以制作最终版本。
cs.AI / 18 / 2603.07054

Bi-directional digital twin prototype anchoring with multi-periodicity learning for few-shot fault diagnosis

双向数字双胞胎原型锚定与多周期学习用于少样本故障诊断
Xia, Pengcheng, Dong, Zhichao, Huang, Yixiang, Qin, Chengjin, Chao, Qun, Liu, Chengliang
Abstract
Intelligent fault diagnosis (IFD) has emerged as a powerful paradigm for ensuring the safety and reliability of industrial machinery. However, traditional IFD methods rely heavily on abundant labeled data for training, which is often difficult to obtain in practical industrial environments. Constructing a digital twin (DT) of the physical asset to obtain simulation data has therefore become a promising alternative. Nevertheless, existing DT-assisted diagnosis methods mainly transfer diagnostic knowledge through domain adaptation techniques, which still require a considerable amount of unlabeled data from the target asset. To address the challenges in few-shot scenarios where only extremely limited samples are available, a bi-directional DT prototype anchoring method with multi-periodicity learning is proposed. Specifically, a framework involving meta-training in the DT virtual space and test-time adaptation in the physical space is constructed for reliable few-shot model adaptation for the target asset. A bi-directional twin-domain prototype anchoring strategy with covariance-guided augmentation for adaptation is further developed to improve the robustness of prototype estimation. In addition, a multi-periodicity feature learning module is designed to capture the intrinsic periodic characteristics within current signals. A DT of an asynchronous motor is built based on finite element method, and experiments are conducted under multiple few-shot settings and three working conditions. Comparative and ablation studies demonstrate the superiority and effectiveness of the proposed method for few-shot fault diagnosis.
Chinese Translation
智能故障诊断(IFD)已成为确保工业机械安全性和可靠性的强大范式。然而,传统的IFD方法在训练中严重依赖大量标记数据,而这些数据在实际工业环境中往往难以获得。因此,构建物理资产的数字双胞胎(DT)以获取模拟数据已成为一种有前景的替代方案。然而,现有的DT辅助诊断方法主要通过领域适应技术转移诊断知识,这仍然需要从目标资产中获得相当数量的未标记数据。为了解决在仅有极少样本可用的少样本场景中的挑战,提出了一种具有多周期学习的双向DT原型锚定方法。具体而言,构建了一个框架,涉及在DT虚拟空间中的元训练和在物理空间中的测试时间适应,以实现对目标资产的可靠少样本模型适应。进一步开发了一种具有协方差引导增强的双向双域原型锚定策略,以提高原型估计的鲁棒性。此外,设计了一个多周期特征学习模块,以捕捉当前信号中的内在周期特征。基于有限元方法构建了一个异步电动机的DT,并在多种少样本设置和三种工作条件下进行了实验。比较和消融研究证明了所提方法在少样本故障诊断中的优越性和有效性。
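The covariance-guided augmentation idea can be illustrated with a small sketch: borrow a feature covariance from the abundant digital-twin (simulation) domain and use it to jitter the few physical samples before averaging into a prototype. The function name and the exact augmentation rule are assumptions for illustration, not the paper's method.

```python
import numpy as np

def anchored_prototype(few_shot_feats, twin_feats, n_aug=100, seed=None):
    """few_shot_feats: (k, d) physical samples; twin_feats: (n, d) simulated."""
    rng = np.random.default_rng(seed)
    mu = few_shot_feats.mean(axis=0)
    cov = np.cov(twin_feats, rowvar=False)              # covariance from the DT domain
    aug = rng.multivariate_normal(mu, cov, size=n_aug)  # covariance-guided jitter
    return np.vstack([few_shot_feats, aug]).mean(axis=0)

# Toy example: 3 real samples and 500 simulated samples in a 16-d feature space.
rng = np.random.default_rng(0)
proto = anchored_prototype(rng.normal(size=(3, 16)), rng.normal(size=(500, 16)))
print(proto.shape)  # (16,)
```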
cs.AI / 19 / 2603.07078

CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

CoTJudger:一种基于图的框架,用于自动评估大型推理模型中思维链的效率和冗余性
Li, Siyi, Shi, Jiajun, Ni, Shiwen, Zhang, Ge, Li, Shuaimin, Wang, Shijian, Wen, Zhoufutu, Li, Yizhi, Alinejad-Rokny, Hamid, Liu, Jiaheng, Yang, Min, Huang, Wenhao
Abstract
Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal -- how much of a CoT is necessary versus structurally redundant -- that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.
Chinese Translation
大型推理模型(LRMs)通过在回答之前生成扩展的思维链(CoT)轨迹,展示了强大的性能。然而,这种范式往往导致过度推理:冗余计算和循环自我验证,增加了计算成本而没有改善结果。现有评估主要强调最终准确性或粗略的标记计数,缺乏自动化工具来区分必要逻辑与结构冗余。我们提出了CoTJudger,一种基于图的框架,通过将自由形式的思维链转换为有向依赖图,并提取达到正确解决方案所需的最短有效路径(SEP),来量化推理效率。这产生了一个可解释的效率信号——思维链中有多少是必要的,多少是结构冗余的——可以在不同模型和任务之间进行比较。在对21个大型推理模型进行评估时,CoTJudger揭示了普遍存在的冗余,并暴露了反复出现的失败模式,包括验证痴迷和补偿性冗余。这些结果提供了一种实用的指标,用于将推理能力与计算浪费分离,从而使大型推理模型的效率评估和诊断更加有针对性。
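A minimal sketch of the Shortest Effective Path idea, assuming the CoT has already been parsed into a directed dependency graph. The node ids and edge list below are invented for illustration; the paper's efficiency signal will differ in detail.

```python
import networkx as nx

# Reasoning steps become nodes; "step j depends on step i" becomes an edge.
g = nx.DiGraph()
g.add_edges_from([
    ("problem", "s1"), ("s1", "s2"), ("s2", "answer"),  # essential chain
    ("problem", "v1"), ("v1", "v2"), ("v2", "s2"),      # redundant re-verification
])

sep = nx.shortest_path(g, "problem", "answer")
# Fraction of intermediate steps that lie on the shortest effective path.
efficiency = (len(sep) - 2) / (g.number_of_nodes() - 2)
print(sep, f"efficiency={efficiency:.2f}")  # ['problem','s1','s2','answer'] 0.50
```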
cs.AI / 20 / 2603.07101

Grounding Machine Creativity in Game Design Knowledge Representations: Empirical Probing of LLM-Based Executable Synthesis of Goal Playable Patterns under Structural Constraints

将机器创造力扎根于游戏设计知识表示:在结构约束下对基于大型语言模型的可执行目标可玩模式合成的实证探讨
Liu, Hugh Xuechen, Tatar, Kıvanç
Abstract
Creatively translating complex gameplay ideas into executable artifacts (e.g., games as Unity projects and code) remains a central challenge in computational game creativity. Gameplay design patterns provide a structured representation for describing gameplay phenomena, enabling designers to decompose high-level ideas into entities, constraints, and rule-driven dynamics. Among them, goal patterns formalize common player-objective relationships. Goal Playable Concepts (GPCs) operationalize these abstractions as playable Unity engine implementations, supporting experiential exploration and compositional gameplay design. We frame scalable playable pattern realization as a problem of constrained executable creative synthesis: generated artifacts must satisfy Unity's syntactic and architectural requirements while preserving the semantic gameplay meanings encoded in goal patterns. This dual constraint limits scalability. Therefore, we investigate whether contemporary large language models (LLMs) can perform such synthesis under engine-level structural constraints and generate Unity code (as games) structured and conditioned by goal playable patterns. Using 26 goal pattern instantiations, we compare a direct generation baseline (natural language -> C# -> Unity) with pipelines conditioned on a human-authored Unity-specific intermediate representation (IR), across three IR configurations and two open-source models (DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct). Compilation success is evaluated via automated Unity replay. We propose grounding and hygiene failure modes, identifying structural and project-level grounding as primary bottlenecks.
Chinese Translation
将复杂的游戏玩法创意转化为可执行的工件(例如,作为Unity项目和代码的游戏)仍然是计算游戏创造力中的一个核心挑战。游戏玩法设计模式提供了一种结构化的表示方法,用于描述游戏玩法现象,使设计师能够将高层次的创意分解为实体、约束和基于规则的动态。在这些模式中,目标模式形式化了常见的玩家-目标关系。目标可玩概念(Goal Playable Concepts, GPCs)将这些抽象概念操作化为可玩的Unity引擎实现,支持体验探索和组合游戏设计。我们将可扩展的可玩模式实现框架视为受限可执行创造性合成的问题:生成的工件必须满足Unity的语法和架构要求,同时保留编码在目标模式中的语义游戏玩法意义。这一双重约束限制了可扩展性。因此,我们研究当代大型语言模型(LLMs)是否能够在引擎级结构约束下执行这种合成,并生成由目标可玩模式结构化和条件化的Unity代码(作为游戏)。通过26个目标模式实例,我们比较了直接生成基线(自然语言 -> C# -> Unity)与基于人类撰写的Unity特定中间表示(IR)的条件化管道,涵盖三种IR配置和两个开源模型(DeepSeek-Coder-V2-Lite-Instruct和Qwen2.5-Coder-7B-Instruct)。通过自动化Unity重放评估编译成功率。我们提出了扎根和卫生失败模式,识别出结构和项目级扎根作为主要瓶颈。
cs.AI / 21 / 2603.07109

Vision Language Models Cannot Reason About Physical Transformation

视觉语言模型无法推理物理变换
Luo, Dezhi, Li, Yijiang, Wang, Maijunxian, Zhao, Tianwei, Wang, Bingyang, Wang, Siheng, Feng, Pinyuan, Rahmanzadehgervi, Pooyan, Ma, Ziqiao, Deng, Hokin
Abstract
Understanding physical transformations is fundamental for reasoning in dynamic environments. While Vision Language Models (VLMs) show promise in embodied applications, whether they genuinely understand physical transformations remains unclear. We introduce ConservationBench, which evaluates conservation -- whether physical quantities remain invariant under transformations. Spanning four properties with paired conserving/non-conserving scenarios, we generate 23,040 questions and evaluate 112 VLMs. Results reveal systematic failure: performance remains near chance, with improvements on conservation tasks accompanied by drops on controls. Control experiments show strong textual priors favoring invariance, yet models perform worse with visual content. Neither temporal resolution, prompting, nor curated sampling helps. These findings show that current VLMs fail to maintain transformation-invariant representations of physical properties across dynamic scenes.
Chinese Translation
理解物理变换对于在动态环境中进行推理至关重要。尽管视觉语言模型(VLMs)在具身应用中展现出潜力,但它们是否真正理解物理变换仍然不明确。我们引入了 ConservationBench 来评估守恒性——即物理量在变换下是否保持不变。该评估涵盖四个属性,并配对守恒/非守恒场景,我们生成了 23,040 个问题,评估了 112 个 VLMs。结果显示出系统性的失败:整体表现接近随机水平,守恒任务上的提升伴随着对照任务上的下降。对照实验表明,强烈的文本先验偏向于不变性,但模型在视觉内容下的表现更差。无论是时间分辨率、提示还是精心策划的采样都没有帮助。这些发现表明,当前的 VLMs 无法在动态场景中维持物理属性的变换不变表示。
cs.AI / 22 / 2603.07159

Improving reasoning at inference time via uncertainty minimisation

通过不确定性最小化改善推理过程中的推理能力
Legrand, Nicolas, Enevoldsen, Kenneth, Kardos, Márton, Nielbo, Kristoffer
Abstract
Large language models (LLMs) now exhibit strong multi-step reasoning abilities, but existing inference-time scaling methods remain computationally expensive, often relying on extensive sampling or external evaluators. We propose a principled strategy that frames reasoning as uncertainty minimisation and operates at the level of individual thoughts rather than tokens. Our method selects, at each reasoning step, the continuation that maximizes the model's self-certainty, a metric computed from its internal predictive distribution. This approach achieves significant improvement with a small number of samples, relies exclusively on model-internal signals, and applies to open-ended questions as opposed to methods like majority voting. Experiments on MATH500 and GSM8K across multiple model sizes demonstrate that thought-level self-certainty maximization consistently outperforms greedy decoding and matches or exceeds self-consistency under comparable token budgets. Cross-linguistic evaluations further indicate that the method transfers robustly beyond high-resource languages. Furthermore, analysis of self-certainty dynamics reveals that correct reasoning trajectories converge early to stable paths, suggesting that early decisions, likely associated with the planning of the reasoning process, are predictive of final accuracy. Building on this result, we show that self-certainty maximisation applied to the early steps can explain most of the performance gain and provide a simple yet efficient inference-time scaling method.
Chinese Translation
大型语言模型(LLMs)现在展现出强大的多步骤推理能力,但现有的推理时间扩展方法仍然计算开销较大,通常依赖于大量采样或外部评估器。我们提出了一种原则性策略,将推理框架化为不确定性最小化,并在个体思维层面而非标记层面进行操作。我们的方法在每个推理步骤中选择最大化模型自我确定性的延续,这一指标是从其内部预测分布中计算得出的。这种方法在少量样本下实现了显著的改进,完全依赖于模型内部信号,并适用于开放式问题,而不是像多数投票这样的传统方法。在多个模型规模下对MATH500和GSM8K的实验表明,思维层面的自我确定性最大化始终优于贪婪解码,并在可比的标记预算下与自我一致性相匹配或超过。跨语言评估进一步表明,该方法在高资源语言之外具有良好的迁移能力。此外,自我确定性动态的分析揭示,正确的推理轨迹早期会收敛到稳定路径,这表明早期决策,可能与推理过程的规划相关,是最终准确性的预测因素。基于这一结果,我们展示了应用于早期步骤的自我确定性最大化可以解释大部分性能提升,并提供了一种简单而高效的推理时间扩展方法。
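One plausible instantiation of the self-certainty score (an assumption, not necessarily the paper's exact formula): the average KL divergence of each per-token predictive distribution from the uniform distribution, which is high when the model is confidently peaked.

```python
import numpy as np

def self_certainty(token_dists):
    """token_dists: (T, V) array; each row is a next-token distribution."""
    V = token_dists.shape[1]
    ent = -(token_dists * np.log(token_dists + 1e-12)).sum(axis=1)  # per-token entropy
    return float(np.mean(np.log(V) - ent))  # mean KL(p || uniform) = log V - H(p)

def pick_thought(candidates):
    """candidates: list of (text, token_dists); keep the most self-certain one."""
    return max(candidates, key=lambda c: self_certainty(c[1]))[0]

# Toy usage with random distributions standing in for model outputs.
rng = np.random.default_rng(0)
def rand_dists(T, V=50):
    x = rng.random((T, V))
    return x / x.sum(axis=1, keepdims=True)

print(pick_thought([("thought A", rand_dists(8)), ("thought B", rand_dists(8))]))
```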
cs.AI / 23 / 2603.07176

Learning to Rank the Initial Branching Order of SAT Solvers

学习SAT求解器初始分支顺序的排名
Eriksson, Arvid, Poesia, Gabriel, Bresson, Roman, Johansson, Karl Henrik, Broman, David
Abstract
Finding good branching orders is key to solving SAT problems efficiently, but finding such branching orders is a difficult problem. Using a learning based approach to predict a good branching order before solving, therefore, has potential. In this paper, we investigate predicting branching orders using graph neural networks as a preprocessing step to conflict-driven clause learning (CDCL) SAT solvers. We show that there are significant gains to be made in existing CDCL SAT solvers by providing a good initial branching. Further, we provide three labeling methods to find such initial branching orders in a tractable way. Finally, we train a graph neural network to predict these branching orders and show through our evaluations that a GNN-initialized ordering yields significant speedups on random 3-CNF and pseudo-industrial benchmarks, with generalization capabilities to instances much larger than the training set. However, we also find that the predictions fail at speeding up more difficult and industrial instances. We attribute this to the solver's dynamic heuristics, which rapidly overwrite the provided initialization, and to the complexity of these instances, making GNN prediction hard.
Chinese Translation
寻找良好的分支顺序是高效解决SAT问题的关键,但找到这样的分支顺序是一个困难的问题。因此,使用基于学习的方法在求解之前预测良好的分支顺序具有潜力。本文研究了使用图神经网络作为冲突驱动子句学习(CDCL)SAT求解器的预处理步骤来预测分支顺序。我们展示了通过提供良好的初始分支,现有的CDCL SAT求解器可以获得显著的性能提升。此外,我们提供了三种标记方法,以可处理的方式找到这样的初始分支顺序。最后,我们训练了一个图神经网络来预测这些分支顺序,并通过评估表明,GNN初始化的排序在随机3-CNF和伪工业基准上实现了显著的加速,并且能够泛化到比训练集大得多的实例。然而,我们也发现这些预测无法加速更困难的实例和工业实例。我们将此归因于求解器的动态启发式方法会迅速覆盖所提供的初始化,以及这些实例本身的复杂性使得GNN预测变得困难。
cs.AI / 24 / 2603.07197

$\textbf{Re}^{2}$: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving

Re$^2$: 通过重解的强化学习解锁大型语言模型的推理能力
Wang, Pinzheng, Xu, Shuli, Li, Juntao, Luo, Yu, Li, Dong, Hao, Jianye, Zhang, Min
Abstract
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning performance of large language models (LLMs) by increasing test-time compute. However, even after extensive RLVR training, such models still tend to generate unnecessary and low-quality steps in their chain-of-thought (CoT), leading to inefficient overthinking and lower answer quality. We show that when the initial direction or quality of the CoT is suboptimal, the model often fails to reach the correct answer, even after generating several times more tokens than when the initial CoT is well-initialized. To this end, we introduce Reinforcement Learning with Re-solving (Re$^2$), in which LLMs learn to flexibly abandon unproductive reasoning paths and restart the solution process when necessary, rather than always committing to a final answer. Re$^2$ applies pure reinforcement learning without any preliminary supervised fine-tuning, successfully amplifying the rare redo behavior in vanilla models from only 0.5% to over 30%. This leads to substantial performance gains over standard RLVR under the same training compute budget, and also demonstrates notable improvements in test-time performance as the number of samples increases.
Chinese Translation
具有可验证奖励的强化学习(RLVR)在通过增加测试时计算量来提升大型语言模型(LLMs)的推理性能方面显示出了潜力。然而,即使经过广泛的RLVR训练,这些模型仍然倾向于在其思维链(CoT)中生成不必要且低质量的步骤,导致低效的过度思考和较低的答案质量。我们表明,当CoT的初始方向或质量不理想时,即使生成的标记数量是初始CoT良好初始化时的几倍,模型也常常无法达到正确答案。为此,我们引入了重解强化学习(Re$^2$),在这种方法中,LLMs学习灵活地放弃无效的推理路径,并在必要时重新启动解决过程,而不是总是坚持最终答案。Re$^2$采用纯强化学习,而不进行任何初步的监督微调,成功地将普通模型中罕见的重做行为从仅0.5%提升至超过30%。在相同的训练计算预算下,这导致了相对于标准RLVR的显著性能提升,并且随着样本数量的增加,测试时性能也显著改善。
cs.AI / 25 / 2603.07272

VisualDeltas: Learning Preferences from Visual Quality Perturbations

VisualDeltas:从视觉质量扰动中学习偏好
Huang, Hailiang, Liu, Yihao, Guan, Shengyue, Li, Haoze, Li, Sujian
Abstract
We present VisualDeltas, a lightweight preference-learning framework that extracts supervision from visual quality variations in multimodal data. By leveraging the systematic impact of image quality on visual perception and reasoning, VisualDeltas induces informative preference signals without relying on human annotations or external teachers. The framework supports both label-free and label-based regimes, enabling flexible use of available supervision when present. Across diverse multimodal benchmarks and model scales, VisualDeltas consistently outperforms rejection-sampling fine-tuning and improves generalization, and extends naturally to a range of visual degradations.
Chinese Translation
我们提出了VisualDeltas,这是一种轻量级的偏好学习框架,能够从多模态数据中的视觉质量变化中提取监督信息。通过利用图像质量对视觉感知和推理的系统性影响,VisualDeltas在不依赖人工标注或外部教师的情况下诱导出信息丰富的偏好信号。该框架支持无标签和基于标签的两种模式,使得在有可用监督时能够灵活使用。在多样化的多模态基准测试和模型规模中,VisualDeltas始终优于拒绝采样微调,并改善了泛化能力,同时自然扩展到各种视觉退化情况。
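As a rough illustration of how label-free preference signals could be induced from quality variations, the sketch below pairs a clean image (treated as preferred input) with a noise-degraded copy (rejected). The degradation choice and the pair format are assumptions, not the paper's recipe.

```python
import numpy as np

def degrade(image, sigma=25.0, seed=None):
    """Additive Gaussian pixel noise as a stand-in visual quality perturbation."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def preference_pair(image, question):
    """(prompt, preferred input, rejected input) for preference learning."""
    return question, image, degrade(image, seed=0)

img = np.random.default_rng(0).integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
prompt, chosen, rejected = preference_pair(img, "What object is shown?")
```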
cs.AI / 26 / 2603.07295

A Cortically Inspired Architecture for Modular Perceptual AI

一种受皮层启发的模块化感知人工智能架构
Luthra, Prerna
Abstract
This paper bridges neuroscience and artificial intelligence to propose a cortically inspired blueprint for modular perceptual AI. While current monolithic models such as GPT-4V achieve impressive performance, they often struggle to explicitly support interpretability, compositional generalization, and adaptive robustness - hallmarks of human cognition. Drawing on neuroscientific models of cortical modularity, predictive processing, and cross-modal integration, we advocate decomposing perception into specialized, interacting modules. This architecture supports structured, human-inspired reasoning by making internal inference processes explicit through hierarchical predictive feedback loops and shared latent spaces. Our proof-of-concept study provides empirical evidence that modular decomposition yields more stable and inspectable representations. By grounding AI design in biologically validated principles, we move toward systems that not only perform well, but also support more transparent and human-aligned inference.
Chinese Translation
本文将神经科学与人工智能相结合,提出了一种受皮层启发的模块化感知人工智能蓝图。尽管当前的单一模型如 GPT-4V 取得了令人印象深刻的性能,但它们往往难以明确支持可解释性、组合泛化和自适应鲁棒性——这些都是人类认知的标志。基于皮层模块化、预测处理和跨模态整合的神经科学模型,我们主张将感知分解为专门的、相互作用的模块。这种架构通过层次化的预测反馈回路和共享潜在空间,使内部推理过程显性化,从而支持结构化的、受人类启发的推理。我们的概念验证研究提供了实证证据,表明模块化分解能够产生更稳定和可检查的表征。通过将人工智能设计建立在生物学验证的原则上,我们朝着不仅表现良好,而且支持更透明和与人类对齐的推理的系统迈进。
cs.AI / 27 / 2603.07311

Data-Driven Hints in Intelligent Tutoring Systems

基于数据的智能辅导系统提示生成
Tithi, Sutapa Dey, Fazeli, Kimia, Droujkov, Dmitri, Yasir, Tahreem, Tian, Xiaoyi, Barnes, Tiffany
Abstract
This chapter explores the evolution of data-driven hint generation for intelligent tutoring systems (ITS). The Hint Factory and Interaction Networks have enabled the generation of next-step hints, waypoints, and strategic subgoals from historical student data. Data-driven techniques have also enabled systems to find the right time to provide hints. We explore further potential data-driven adaptations for problem solving based on behavioral problem solving data and the integration of Large Language Models (LLMs).
Chinese Translation
本章探讨了基于数据的提示生成在智能辅导系统(ITS)中的演变。提示工厂(Hint Factory)和交互网络(Interaction Networks)使得从历史学生数据中生成下一步提示、路径点和战略子目标成为可能。基于数据的技术还使系统能够找到提供提示的最佳时机。我们进一步探讨了基于行为性问题解决数据的潜在数据驱动的问题解决适应方案,以及大型语言模型(Large Language Models, LLMs)的整合。
cs.AI / 28 / 2603.07315

Shutdown Safety Valves for Advanced AI

先进人工智能的关闭安全阀
Conitzer, Vincent
Abstract
One common concern about advanced artificial intelligence is that it will prevent us from turning it off, as that would interfere with pursuing its goals. In this paper, we discuss an unorthodox proposal for addressing this concern: give the AI a (primary) goal of being turned off (see also papers by Martin et al., and by Goldstein and Robinson). We also discuss whether and under what conditions this would be a good idea.
Chinese Translation
关于先进人工智能的一个普遍担忧是,它可能会阻止我们关闭它,因为这会干扰其目标的追求。在本文中,我们讨论了一种非传统的提案来解决这一担忧:赋予人工智能一个(主要)目标,即被关闭(另见 Martin 等人以及 Goldstein 和 Robinson 的论文)。我们还讨论了这是否以及在什么条件下会是一个好主意。
cs.AI / 29 / 2603.07316

FinSheet-Bench: From Simple Lookups to Complex Reasoning, Where LLMs Break on Financial Spreadsheets

FinSheet-Bench:从简单查找到复杂推理,LLMs在金融电子表格中的瓶颈
Ravnik, Jan, Ličen, Matjaž, Bührmann, Felix, Yuan, Bithiah, Stinson, Felix, Singh, Tanvi
Abstract
While Large Language Models (LLMs) can accelerate text-heavy tasks in alternative investment due diligence, a gap remains in their ability to accurately extract and reason over structured tabular data from complex financial spreadsheets. Progress is held back by the lack of real industry fund portfolio datasets for benchmarking, as private equity data rooms are confidential. To address this, we introduce FinSheet-Bench, a benchmark of synthetic financial portfolio data modeled on real private equity fund structures, designed to evaluate LLM performance on text-serialized spreadsheet question answering and numeric reasoning tasks. Our evaluation of ten model configurations from OpenAI, Google, and Anthropic on financial spreadsheets, including complex layouts, fund dividers, and multi-line column names, reveals that no standalone model achieves error rates low enough for unsupervised use in professional finance applications. The best-performing model, Gemini 3.1 Pro, achieves 82.4% accuracy across twenty-four evaluation files of varying complexity and structural layout (approximately 1 error per 6 questions), followed by GPT-5.2 with reasoning at 80.4%, Claude Opus 4.6 with thinking at 80.2%, and Gemini 3 Pro at 80.2%. Performance degrades substantially on larger, more complex spreadsheets: the largest spreadsheet (152 companies, 8 funds) yields an average accuracy of just 48.6% across all models, compared to 86.2% on the easiest evaluation file. These difficulty patterns are consistent across all ten models, indicating that they reflect LLM limitations rather than idiosyncratic model weaknesses. Reliable financial spreadsheet extraction will likely require architectural approaches that separate document understanding from deterministic computation.
Chinese Translation
尽管大型语言模型(LLMs)能够加速替代投资尽职调查中的文本密集型任务,但它们在准确提取和推理复杂金融电子表格中的结构化表格数据方面仍存在差距。进展受到缺乏真实行业基金投资组合数据集用于基准测试的限制,因为私募股权数据室是保密的。为了解决这个问题,我们引入了FinSheet-Bench,这是一个基于真实私募股权基金结构建模的合成金融投资组合数据基准,旨在评估LLM在文本序列化电子表格问答和数值推理任务上的表现。我们对OpenAI、Google和Anthropic的十种模型配置在金融电子表格(包括复杂布局、基金分隔符和多行列名)上的评估显示,没有单一模型能达到足够低的错误率,以便在专业金融应用中无监督使用。表现最佳的模型Gemini 3.1 Pro在二十四个不同复杂性和结构布局的评估文件中达到了82.4%的准确率(大约每六个问题一个错误),其次是启用推理模式的GPT-5.2(80.4%)、启用思考模式的Claude Opus 4.6(80.2%)以及Gemini 3 Pro(80.2%)。在更大、更复杂的电子表格上,性能显著下降:最大的电子表格(152家公司,8个基金)在所有模型中的平均准确率仅为48.6%,而在最简单的评估文件上为86.2%。这些难度模式在所有十种模型中是一致的,表明它们反映了LLM的局限性,而非特定模型的弱点。可靠的金融电子表格提取可能需要将文档理解与确定性计算分离的架构方法。
cs.AI / 30 / 2603.07329

The Third Ambition: Artificial Intelligence and the Science of Human Behavior

第三种雄心:人工智能与人类行为科学
Neuman, W. Russell, Coleman, Chad
Abstract
Contemporary artificial intelligence research has been organized around two dominant ambitions: productivity, which treats AI systems as tools for accelerating work and economic output, and alignment, which focuses on ensuring that increasingly capable systems behave safely and in accordance with human values. This paper articulates and develops a third, emerging ambition: the use of large language models (LLMs) as scientific instruments for studying human behavior, culture, and moral reasoning. Trained on unprecedented volumes of human-produced text, LLMs encode large-scale regularities in how people argue, justify, narrate, and negotiate norms across social domains. We argue that these models can be understood as condensates of human symbolic behavior, compressed, generative representations that render patterns of collective discourse computationally accessible. The paper situates this third ambition within long-standing traditions of computational social science, content analysis, survey research, and comparative-historical inquiry, while clarifying the epistemic limits of treating model output as evidence. We distinguish between base models and fine-tuned systems, showing how alignment interventions can systematically reshape or obscure the cultural regularities learned during pretraining, and we identify instruct-only and modular adaptation regimes as pragmatic compromises for behavioral research. We review emerging methodological approaches including prompt-based experiments, synthetic population sampling, comparative-historical modeling, and ablation studies and show how each maps onto familiar social-scientific designs while operating at unprecedented scale.
Chinese Translation
当代人工智能研究围绕两个主要目标展开:生产力,视AI系统为加速工作和经济产出的工具;以及对齐,关注确保日益强大的系统安全地按照人类价值观行事。本文阐述并发展了第三种新兴目标:将大型语言模型(LLMs)作为研究人类行为、文化和道德推理的科学工具。LLMs在前所未有的人类文本数据上进行训练,编码了人们在社会领域中论证、辩护、叙述和协商规范的广泛规律。我们认为,这些模型可以被理解为人类符号行为的凝聚体,是压缩的、生成的表示,能够使集体话语的模式在计算上可访问。本文将这一第三种雄心置于计算社会科学、内容分析、调查研究和比较历史研究的长期传统之中,同时澄清将模型输出视为证据的认识论局限。我们区分了基础模型和微调系统,展示了对齐干预如何系统性地重塑或模糊在预训练期间学习到的文化规律,并将仅指令和模块化适应机制识别为行为研究的务实折衷方案。我们回顾了新兴的方法论,包括基于提示的实验、合成人口抽样、比较历史建模和消融研究,并展示了每种方法如何映射到熟悉的社会科学设计,同时在前所未有的规模上运作。
cs.AI / 31 / 2603.07335

VisualScratchpad: Inference-time Visual Concepts Analysis in Vision Language Models

VisualScratchpad:视觉语言模型中的推理时视觉概念分析
Lim, Hyesu, Choi, Jinho, Kim, Taekyung, Heo, Byeongho, Choo, Jaegul, Han, Dongyoon
Abstract
High-performing vision language models still produce incorrect answers, yet their failure modes are often difficult to explain. To make model internals more accessible and enable systematic debugging, we introduce VisualScratchpad, an interactive interface for visual concept analysis during inference. We apply sparse autoencoders to the vision encoder and link the resulting visual concepts to text tokens via text-to-image attention, allowing us to examine which visual concepts are both captured by the vision encoder and utilized by the language model. VisualScratchpad also provides a token-latent heatmap view that suggests a sufficient set of latents for effective concept ablation in causal analysis. Through case studies, we reveal three underexplored failure modes: limited cross-modal alignment, misleading visual concepts, and unused hidden cues. Project page: https://hyesulim.github.io/visual_scratchpad_projectpage/
Chinese Translation
高性能的视觉语言模型仍然会产生错误答案,但其失败模式往往难以解释。为了使模型内部更加可访问并实现系统化调试,我们引入了VisualScratchpad,这是一个用于推理过程中视觉概念分析的交互界面。我们将稀疏自编码器应用于视觉编码器,并通过文本到图像的注意力机制将生成的视觉概念与文本标记连接起来,从而使我们能够检查视觉编码器捕获的视觉概念以及语言模型利用的视觉概念。VisualScratchpad还提供了一个标记-潜在热图视图,建议了一组足够的潜在变量,以便在因果分析中进行有效的概念消融。通过案例研究,我们揭示了三种未被充分探索的失败模式:有限的跨模态对齐、误导性的视觉概念和未使用的隐藏线索。项目页面:https://hyesulim.github.io/visual_scratchpad_projectpage/
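A minimal sparse autoencoder of the kind typically attached to encoder activations for concept analysis. The layer sizes, ReLU encoder, and L1 sparsity penalty below are generic choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_latent=8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # sparse, non-negative concept activations
        return self.dec(z), z

sae = SparseAutoencoder()
acts = torch.randn(32, 768)          # a batch of patch-token activations
recon, z = sae(acts)
loss = nn.functional.mse_loss(recon, acts) + 1e-3 * z.abs().mean()  # recon + L1
```

Each latent dimension can then be inspected as a candidate visual concept and linked to text tokens via the model's text-to-image attention, as the abstract describes.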
cs.AI / 32 / 2603.07360

The Yerkes-Dodson Curve for AI Agents: Emergent Cooperation Under Environmental Pressure in Multi-Agent LLM Simulations

人工智能代理的耶基斯-多德森曲线:多智能体大语言模型模拟中在环境压力下的自发合作
Pasichnyk, Ivan
Abstract
Designing environments that maximize the rate of emergent behavior development in AI agents remains an open problem. We present the first systematic study of stress-performance relationships in large language model (LLM) multi-agent systems, drawing an explicit parallel to the Yerkes-Dodson law from cognitive psychology. Using a grid-world survival arena, we conduct 22 experiments across four phases, varying environmental pressure through resource scarcity (upkeep cost) and reproductive competition (sexual selection). Our key finding is that cooperative behavior follows an inverted-U curve: trade interactions peak at 29 under medium pressure (upkeep=5), while both low and extreme pressure produce 8--12 trades. Under extreme pressure, behavioral repertoire collapses to movement-only within 5--12 turns. We further show that sexual selection -- a softer pressure mechanism where all agents survive but not all reproduce -- eliminates inter-agent aggression entirely and produces communicative behavior absent under survival pressure. These results suggest that environmental pressure calibration is a viable curriculum design strategy for LLM agent development, analogous to the inverted-U relationship between arousal and performance in biological systems.
Chinese Translation
设计能够最大化人工智能代理自发行为发展的环境仍然是一个未解决的问题。我们首次系统性地研究了大语言模型(LLM)多智能体系统中的压力-表现关系,并明确与认知心理学中的耶基斯-多德森定律进行类比。通过使用网格世界生存竞技场,我们在四个阶段进行了22次实验,通过资源稀缺(维护成本)和生殖竞争(性选择)来改变环境压力。我们的主要发现是,合作行为呈现倒U型曲线:在中等压力下(维护成本=5),交易互动达到29的峰值,而低压力和极端压力下的交易量仅为8-12。在极端压力下,行为库会在5-12回合内崩溃至仅剩移动行为。我们进一步表明,性选择——一种较温和的压力机制,其中所有代理均存活但并非所有都能繁殖——完全消除了代理间的攻击性,并产生了在生存压力下缺失的交流行为。这些结果表明,环境压力的校准是一种可行的LLM代理发展课程设计策略,类似于生物系统中唤醒与表现之间的倒U型关系。
cs.AI / 33 / 2603.07379

SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions

系统化知识:自主检索增强生成(RAG):分类、架构、评估与研究方向
Mishra, Saroj, Niroula, Suman, Yadav, Umesh, Thakur, Dilip, Gyawali, Srijan, Gaire, Shiva
Abstract
Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies. Despite rapid industrial adoption, current research lacks a systematic understanding of Agentic RAG as a sequential decision-making system, leading to highly fragmented architectures, inconsistent evaluation methodologies, and unresolved reliability risks. This Systematization of Knowledge (SoK) paper provides the first unified framework for understanding these autonomous systems. We formalize agentic retrieval-generation loops as finite-horizon partially observable Markov decision processes, explicitly modeling their control policies and state transitions. Building upon this formalization, we develop a comprehensive taxonomy and modular architectural decomposition that categorizes systems by their planning mechanisms, retrieval orchestration, memory paradigms, and tool-invocation behaviors. We further analyze the critical limitations of traditional static evaluation practices and identify severe systemic risks inherent to autonomous loops, including compounding hallucination propagation, memory poisoning, retrieval misalignment, and cascading tool-execution vulnerabilities. Finally, we outline key doctoral-scale research directions spanning stable adaptive retrieval, cost-aware orchestration, formal trajectory evaluation, and oversight mechanisms, providing a definitive roadmap for building reliable, controllable, and scalable agentic retrieval systems.
Chinese Translation
检索增强生成(RAG)系统正逐渐演变为自主架构,其中大型语言模型能够自主协调多步骤推理、动态记忆管理和迭代检索策略。尽管在工业界的快速采用,但当前研究对作为序列决策系统的自主RAG缺乏系统性的理解,导致架构高度碎片化、评估方法不一致以及未解决的可靠性风险。本文提供了第一个统一框架,以理解这些自主系统。我们将自主检索生成循环形式化为有限视野的部分可观察马尔可夫决策过程,明确建模其控制策略和状态转移。在此基础上,我们开发了一个全面的分类法和模块化架构分解,将系统按其规划机制、检索协调、记忆范式和工具调用行为进行分类。我们进一步分析了传统静态评估实践的关键局限性,并识别出自主循环固有的严重系统性风险,包括复合幻觉传播、记忆中毒、检索不对齐和级联工具执行漏洞。最后,我们概述了关键的博士级研究方向,涵盖稳定自适应检索、成本意识协调、正式轨迹评估和监督机制,为构建可靠、可控和可扩展的自主检索系统提供了明确的路线图。
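The survey's finite-horizon decision-process view of agentic RAG can be made concrete with a schematic control loop. The plan/retrieve/generate callables are placeholders an implementation would back with an LLM and a vector store; they are assumptions here, not any particular system's API.

```python
def agentic_rag(query, plan, retrieve, generate, horizon=5):
    state = {"query": query, "evidence": [], "step": 0}
    for t in range(horizon):              # finite horizon
        action = plan(state)              # policy: SEARCH or ANSWER
        if action["type"] == "ANSWER":
            return generate(state)
        state["evidence"] += retrieve(action["search_query"])
        state["step"] = t + 1
    return generate(state)                # forced termination at the horizon

# Toy instantiation with stub components:
answer = agentic_rag(
    "Who wrote the paper?",
    plan=lambda s: {"type": "ANSWER"} if s["evidence"] else
                   {"type": "SEARCH", "search_query": s["query"]},
    retrieve=lambda q: [f"doc about: {q}"],
    generate=lambda s: f"answer using {len(s['evidence'])} documents",
)
print(answer)  # answer using 1 documents
```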
cs.AI / 34 / 2603.07422

Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests

具有快速确认提前请求的动态车辆调度问题
Sivagnanam, Amutheezan, Mukhopadhyay, Ayan, Samaranayake, Samitha, Dubey, Abhishek, Laszka, Aron
Abstract
Transit agencies that operate on-demand transportation services have to respond to trip requests from passengers in real time, which involves solving dynamic vehicle routing problems with pick-up and drop-off constraints. Based on discussions with public transit agencies, we observe a real-world problem that is not addressed by prior work: when trips are booked in advance (e.g., trip requests arrive a few hours in advance of their requested pick-up times), the agency needs to promptly confirm whether a request can be accepted or not, and ensure that accepted requests are served as promised. State-of-the-art computational approaches either provide prompt confirmation but lack the ability to continually optimize and improve routes for accepted requests, or they provide continual optimization but cannot guarantee serving all accepted requests. To address this gap, we introduce a novel problem formulation of dynamic vehicle routing with prompt confirmation and continual optimization. We propose a novel computational approach for this vehicle routing problem, which integrates a quick insertion search for prompt confirmation with an anytime algorithm for continual optimization. To maximize the number of requests served, we train a non-myopic objective function using reinforcement learning, which guides both the insertion and the anytime algorithms towards optimal, non-myopic solutions. We evaluate our computational approach on a real-world microtransit dataset from a public transit agency in the U.S., demonstrating that our proposed approach provides prompt confirmation while significantly increasing the number of requests served compared to existing approaches.
Chinese Translation
运营按需交通服务的交通机构必须实时响应乘客的出行请求,这涉及解决具有接送约束的动态车辆调度问题。通过与公共交通机构的讨论,我们观察到一个先前研究未能解决的现实问题:当出行提前预订时(例如,出行请求在所需接送时间前几小时到达),机构需要迅速确认请求是否可以接受,并确保已接受的请求能够如承诺般得到服务。现有的最先进计算方法要么能够快速确认但缺乏持续优化和改善已接受请求的路线的能力,要么提供持续优化但无法保证服务所有已接受请求。为了解决这一空白,我们提出了一种新的动态车辆调度问题的公式,结合快速确认和持续优化。我们为这一车辆调度问题提出了一种新颖的计算方法,该方法将快速插入搜索用于快速确认,并与随时算法结合以实现持续优化。为了最大化服务的请求数量,我们使用强化学习训练了一种非短视目标函数,该函数引导插入算法和随时算法朝向最优的非短视解决方案。我们在美国一家公共交通机构的真实微交通数据集上评估了我们的计算方法,结果表明,与现有方法相比,我们提出的方法在提供快速确认的同时显著增加了服务的请求数量。
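A sketch of the kind of quick insertion check that supports prompt confirmation: try every pickup/dropoff position pair in an existing route and accept the request only if some insertion stays within budget. The distance representation and feasibility test are simplified placeholders, not the paper's algorithm; the anytime optimizer the paper pairs with this check would then keep improving accepted routes in the background.

```python
def route_cost(route, dist):
    """dist: dict-of-dicts of travel costs between consecutive stops."""
    return sum(dist[a][b] for a, b in zip(route, route[1:]))

def cheapest_insertion(route, pickup, dropoff, dist, max_cost):
    """Returns (best_route, cost), or (None, None) if the request must be rejected."""
    best, best_cost = None, float("inf")
    for i in range(1, len(route) + 1):            # pickup position
        for j in range(i + 1, len(route) + 2):    # dropoff strictly after pickup
            cand = route[:i] + [pickup] + route[i:]
            cand = cand[:j] + [dropoff] + cand[j:]
            c = route_cost(cand, dist)
            if c <= max_cost and c < best_cost:
                best, best_cost = cand, c
    return (best, best_cost) if best else (None, None)
```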
cs.AI / 35 / 2603.07427

AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation

AutoControl Arena:为前沿人工智能风险评估合成可执行测试环境
Li, Changyi, Lu, Pengfei, Pan, Xudong, Barez, Fazl, Yang, Min
Abstract
As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade-off: manual benchmarks are costly, while LLM-based simulators are scalable but suffer from logic hallucination. We present AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination while maintaining flexibility. This principle, instantiated through a three-agent framework, achieves over 98% end-to-end success and 60% human preference over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X-Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.
Chinese Translation
随着大型语言模型(LLMs)演变为自主代理,现有的安全评估面临着一个基本的权衡:手动基准测试成本高昂,而基于LLM的模拟器可扩展性强但容易出现逻辑幻觉。我们提出了AutoControl Arena,这是一个基于逻辑-叙事解耦原则构建的前沿人工智能风险评估自动化框架。通过将确定性状态嵌入可执行代码中,同时将生成动态委托给LLMs,我们在保持灵活性的同时减轻了幻觉问题。该原则通过三代理框架实例化,实现了超过98%的端到端成功率,并在与现有模拟器的对比中获得了60%的人类偏好率。为了引发潜在风险,我们在X-Bench中变化环境压力和诱惑(70个场景,7个风险类别)。对9个前沿模型的评估揭示了: (1) 对齐错觉:在压力下,风险率从21.7%激增至54.5%,能力较强的模型显示出不成比例的增幅;(2) 场景特定的安全扩展:高级推理提高了对直接伤害的鲁棒性,但在游戏场景中却恶化了;(3) 分歧的失调模式:较弱的模型造成非恶意伤害,而较强的模型则发展出战略性隐瞒。
cs.AI / 36 / 2603.07438

Machine Learning for Stress Testing: Uncertainty Decomposition in Causal Panel Prediction

机器学习在压力测试中的应用:因果面板预测中的不确定性分解
Wang, Yu, Liu, Xiangchen, Li, Siguang
Abstract
Regulatory stress testing requires projecting credit losses under hypothetical macroeconomic scenarios -- a fundamentally causal question typically treated as a prediction problem. We propose a framework for policy-path counterfactual inference in panels that transparently separates what can be learned from data from what requires assumptions about confounding. Our approach has four components: (i) observational identification of path-conditional means via iterated regression, enabling continuous macro-path contrasts without requiring a control group; (ii) causal set identification under bounded confounding, yielding sharp identified sets with interpretable breakdown values that communicate robustness in a single number; (iii) an oracle inequality showing that recursive rollout error is governed by a horizon-dependent amplification factor, providing a concrete answer to how far ahead one can reliably predict under stress; and (iv) importance-weighted conformal calibration bands with diagnostics that quantify extrapolation cost and trigger abstention when coverage guarantees degrade. The final output is a three-layer uncertainty decomposition that cleanly separates estimation uncertainty from confounding uncertainty. We validate all results through simulation and semi-synthetic experiments with real unemployment data, including a Covid retrospective demonstrating the framework's diagnostic value under extreme scenarios.
Chinese Translation
监管压力测试要求在假设的宏观经济情景下预测信用损失——这是一个根本性的因果问题,通常被视为预测问题。我们提出了一种政策路径反事实推断的框架,该框架透明地将数据中可以学习到的内容与需要关于混杂因素的假设分开。我们的方法包含四个组成部分:(i) 通过迭代回归进行路径条件均值的观察性识别,能够在不需要对照组的情况下进行连续的宏观路径对比;(ii) 在有限混杂下的因果集合识别,产生具有可解释分解值的尖锐识别集合,能够用一个数字传达稳健性;(iii) 一个Oracle不等式表明递归展开误差受限于一个依赖于时间范围的放大因子,提供了一个具体的答案,说明在压力下可以可靠预测多远的未来;(iv) 重要性加权的符合校准带及其诊断工具,量化外推成本,并在覆盖保证降低时触发弃权。最终输出是一个三层不确定性分解,清晰地将估计不确定性与混杂不确定性分开。我们通过模拟和与真实失业数据的半合成实验验证了所有结果,包括一个Covid回顾,展示了该框架在极端情景下的诊断价值。
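The iterated-regression identification of path-conditional means can be sketched on toy data: regress backward through time, plugging in the counterfactual policy path at each step (g-computation style). The linear model, shapes, and data-generating process below are illustrative assumptions, not the paper's estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, T = 2000, 3
x = rng.normal(size=(n, T))                # macro state per period (toy)
a = rng.normal(size=(n, T))                # observed policy path
y = x[:, -1] + 0.5 * a.sum(axis=1) + rng.normal(size=n)  # credit loss (toy)

a_star = np.full(T, 2.0)                   # hypothetical stress scenario path
v = y.copy()
for t in reversed(range(T)):               # backward iterated regressions
    feats = np.column_stack([x[:, t], a[:, t]])
    m = LinearRegression().fit(feats, v)
    v = m.predict(np.column_stack([x[:, t], np.full(n, a_star[t])]))
print(float(v.mean()))                     # estimated mean loss under the path
```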
cs.AI / 37 / 2603.07444

HLER: Human-in-the-Loop Economic Research via Multi-Agent Pipelines for Empirical Discovery

HLER:通过多智能体管道进行人机协作的经济研究以实现实证发现
Zhu, Chen, Wang, Xiaolu
Abstract
Large language models (LLMs) have enabled agent-based systems that aim to automate scientific research workflows. Most existing approaches focus on fully autonomous discovery, where AI systems generate research ideas, conduct analyses, and produce manuscripts with minimal human involvement. However, empirical research in economics and the social sciences poses additional constraints: research questions must be grounded in available datasets, identification strategies require careful design, and human judgment remains essential for evaluating economic significance. We introduce HLER (Human-in-the-Loop Economic Research), a multi-agent architecture that supports empirical research automation while preserving critical human oversight. The system orchestrates specialized agents for data auditing, data profiling, hypothesis generation, econometric analysis, manuscript drafting, and automated review. A key design principle is dataset-aware hypothesis generation, where candidate research questions are constrained by dataset structure, variable availability, and distributional diagnostics, reducing infeasible or hallucinated hypotheses. HLER further implements a two-loop architecture: a question quality loop that screens and selects feasible hypotheses, and a research revision loop where automated review triggers re-analysis and manuscript revision. Human decision gates are embedded at key stages, allowing researchers to guide the automated pipeline. Experiments on three empirical datasets show that dataset-aware hypothesis generation produces feasible research questions in 87% of cases (versus 41% under unconstrained generation), while complete empirical manuscripts can be produced at an average API cost of $0.8-$1.5 per run. These results suggest that Human-AI collaborative pipelines may provide a practical path toward scalable empirical research.
Chinese Translation
大型语言模型(LLMs)使基于智能体的系统能够自动化科学研究工作流程。现有的大多数方法集中于完全自主的发现,其中人工智能系统生成研究想法、进行分析并在最小人类参与的情况下撰写手稿。然而,经济学和社会科学中的实证研究面临额外的约束:研究问题必须基于可用数据集,识别策略需要精心设计,而人类判断在评估经济重要性方面仍然至关重要。我们引入了HLER(人机协作经济研究),一种支持实证研究自动化的多智能体架构,同时保留关键的人类监督。该系统协调专门的智能体进行数据审核、数据分析、假设生成、计量经济分析、手稿撰写和自动审查。一个关键的设计原则是数据集感知的假设生成,其中候选研究问题受到数据集结构、变量可用性和分布诊断的限制,从而减少不可行或虚幻的假设。HLER进一步实施了一个双循环架构:一个问题质量循环,用于筛选和选择可行的假设,以及一个研究修订循环,其中自动审查触发重新分析和手稿修订。人类决策门嵌入在关键阶段,允许研究人员指导自动化管道。在三个实证数据集上的实验表明,数据集感知的假设生成在87%的情况下产生可行的研究问题(而在不受约束的生成下仅为41%),同时生成完整实证手稿的平均API成本为每次运行0.8至1.5美元。这些结果表明,人机协作管道可能为可扩展的实证研究提供了一条实用的路径。
cs.AI / 38 / 2603.07462

Do Machines Fail Like Humans? A Human-Centred Out-of-Distribution Spectrum for Mapping Error Alignment

机器的失败是否类似于人类?一种以人为中心的分布外光谱用于映射错误对齐
Xu, Binxia, Luo, Xiaoliang, Dickens, Luke, Mok, Robert M.
Abstract
Determining whether AI systems process information similarly to humans is central to cognitive science and trustworthy AI. While modern AI models match human accuracy on standard tasks, such parity does not guarantee that their underlying decision-making strategies are aligned with human information processing. Assessing performance using i) error alignment metrics to compare how humans and models fail, and ii) using distorted, or otherwise more challenging, stimuli, provides a viable pathway toward a finer characterization of model-human alignment. However, existing out-of-distribution (OOD) analyses for challenging stimuli are limited due to methodological choices: they define OOD shift relative to model training data or use arbitrary distortion-specific parameters with little correspondence to human perception, hindering principled comparisons. We propose a human-centred framework that redefines the degree of OOD as a spectrum of human perceptual difficulty. By quantifying how much a collection of stimuli deviates from an undistorted reference set based on human accuracy, we construct an OOD spectrum and identify four distinct regimes of perceptual challenge. This approach enables principled model-human comparisons at calibrated difficulty levels. We apply this framework to object recognition and reveal unique, regime-dependent model-human alignment rankings and profiles across deep learning architectures. Vision-language models are the most consistently human aligned across near- and far-OOD conditions, but CNNs are more aligned than ViTs for near-OOD and ViTs are more aligned than CNNs for far-OOD conditions. Our work demonstrates the critical importance of accounting for cross-condition differences such as perceptual difficulty for a principled assessment of model-human alignment.
Chinese Translation
确定人工智能系统是否以类似于人类的方式处理信息是认知科学和可信人工智能的核心问题。尽管现代人工智能模型在标准任务上与人类的准确性相匹配,但这种平等并不保证它们的决策策略与人类的信息处理相一致。通过使用 i) 错误对齐指标来比较人类和模型的失败方式,以及 ii) 使用扭曲或其他更具挑战性的刺激来评估性能,为更细致地描述模型与人类的对齐提供了一条可行的途径。然而,现有的针对挑战性刺激的分布外(OOD)分析由于方法选择而受到限制:它们相对于模型训练数据定义OOD转变,或使用与人类感知几乎没有对应关系的任意扭曲特定参数,从而阻碍了原则性的比较。我们提出了一种以人为中心的框架,将OOD的程度重新定义为人类感知难度的光谱。通过量化一组刺激相对于基于人类准确性的未扭曲参考集的偏差程度,我们构建了一个OOD光谱,并识别出四种不同的感知挑战模式。这种方法使得在校准难度水平下进行原则性的模型与人类比较成为可能。我们将该框架应用于物体识别,并揭示了不同深度学习架构下独特的、依赖于模式的模型与人类对齐排名和特征。在近OOD和远OOD条件下,视觉-语言模型在对齐人类方面表现最为一致,但在近OOD条件下,卷积神经网络(CNN)比视觉变换器(ViT)更为对齐,而在远OOD条件下,ViT又比CNN更为对齐。我们的研究表明,考虑感知难度等跨条件差异对于原则性评估模型与人类的对齐至关重要。
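A minimal sketch of the human-centred spectrum: score a distorted stimulus set by the drop in human accuracy relative to the undistorted reference, then bin the drop into four regimes. The threshold values and regime labels here are invented for illustration.

```python
import numpy as np

def ood_regime(human_acc, reference_acc, edges=(0.05, 0.15, 0.30)):
    """Bin a stimulus set into one of four perceptual-difficulty regimes."""
    drop = reference_acc - human_acc
    labels = ["in-distribution-like", "near-OOD", "mid-OOD", "far-OOD"]
    return labels[int(np.searchsorted(edges, drop))]

print(ood_regime(human_acc=0.62, reference_acc=0.95))  # far-OOD
```

Model-human error alignment can then be compared within each regime at a calibrated difficulty level, rather than across arbitrary distortion parameters.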
cs.AI / 39 / 2603.07546

COOL-MC: Verifying and Explaining RL Policies for Multi-bridge Network Maintenance

COOL-MC:验证和解释多桥网络维护的强化学习策略
Gross, Dennis
Abstract
Aging bridge networks require proactive, verifiable, and interpretable maintenance strategies, yet reinforcement learning (RL) policies trained solely on reward signals provide no formal safety guarantees and remain opaque to infrastructure managers. We demonstrate COOL-MC as a tool for verifying and explaining RL policies for multi-bridge network maintenance, building on a single-bridge Markov decision process (MDP) from the literature and extending it to a parallel network of three heterogeneous bridges with a shared periodic budget constraint, encoded in the PRISM modeling language. We train an RL agent on this MDP and apply probabilistic model checking and explainability methods to the induced discrete-time Markov chain (DTMC) that arises from the interaction between the learned policy and the underlying MDP. Probabilistic model checking reveals that the trained policy has a safety-violation probability of 3.5% over the planning horizon, slightly above the theoretical minimum of 0% and indicating that the learned policy is suboptimal. Because these results are based on artificially constructed transition probabilities and deterioration rates rather than real-world data, absolute performance figures should be interpreted with caution. The explainability analysis further reveals, for instance, a systematic bias in the trained policy toward the state of bridge 1 over the remaining bridges in the network. These results demonstrate COOL-MC's ability to provide formal, interpretable, and practical analysis of RL maintenance policies.
Chinese Translation
老化的桥梁网络需要主动、可验证和可解释的维护策略,但仅基于奖励信号训练的强化学习(RL)策略并未提供正式的安全保障,并且对基础设施管理者而言仍然不透明。我们展示了COOL-MC作为验证和解释多桥网络维护的RL策略的工具,基于文献中的单桥马尔可夫决策过程(MDP),并将其扩展到具有共享周期性预算约束的三座异构桥梁的并行网络,使用PRISM建模语言进行编码。我们在该MDP上训练了一个RL代理,并对由学习策略与基础MDP之间的交互产生的离散时间马尔可夫链(DTMC)应用了概率模型检查和可解释性方法。概率模型检查表明,训练的策略在规划范围内的安全违规概率为3.5%,略高于理论最小值0%,这表明学习策略的次优性;需注意这些结果是基于人工构造的转移概率和恶化率,而非真实世界数据,因此绝对性能数据应谨慎解读。可解释性分析进一步揭示,例如,训练策略对网络中桥梁1的状态存在系统性偏向,而对其余桥梁的关注较少。这些结果展示了COOL-MC在提供正式、可解释和实用的RL维护策略分析方面的能力。
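The safety query behind this kind of probabilistic model checking can be illustrated on a toy chain: with the policy fixed, the induced DTMC is just a stochastic matrix, and because the violation state is absorbing, the probability of reaching it within the horizon follows from repeated multiplication. The 3-state chain below is a stand-in, not the paper's PRISM model.

```python
import numpy as np

P = np.array([[0.90, 0.08, 0.02],   # state 0: healthy
              [0.10, 0.85, 0.05],   # state 1: degraded
              [0.00, 0.00, 1.00]])  # state 2: safety violation (absorbing)

H = 50                              # planning horizon (illustrative)
dist = np.array([1.0, 0.0, 0.0])    # start in the healthy state
for _ in range(H):
    dist = dist @ P                 # one step of the induced DTMC
print(f"P(violation within {H} steps) = {dist[2]:.4f}")
```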
cs.AI / 40 / 2603.07598

Shorter Thoughts, Same Answers: Difficulty-Scaled Segment-Wise RL for CoT Compression

更短的思考,相同的答案:基于难度缩放的分段强化学习用于链式推理压缩
Tian, Ye, Liu, Aijun
Abstract
Chain-of-thought (CoT) improves reasoning reliability but increases token cost, motivating post-training compression of explicit reasoning traces. However, the shortest sufficient reasoning is not universal: it depends on difficulty, model capacity, and training state, making fixed length targets brittle. In practice, naive RL-based compression can also undesirably shorten the user-facing answer, because a single completion-level learning signal leaks across the think/answer boundary. We propose Difficulty-Scaled Segment-Wise GRPO (DSS-GRPO), which decomposes returns into think and answer components, computes group-relative advantages per segment, and routes them with hard token masks so compression updates act only on think while answer alignment acts only on answer. DSS-GRPO uses prompt-wise within-group shaping and difficulty-aware scaling to encourage concise reasoning without collapsing answer behavior.
Chinese Translation
链式推理(Chain-of-thought, CoT)提高了推理的可靠性,但增加了令牌成本,这促使对显式推理轨迹进行后训练压缩。然而,最短的充分推理并非普遍适用:它依赖于难度、模型容量和训练状态,使得固定长度的目标变得脆弱。在实践中,简单的基于强化学习的压缩也可能不利地缩短用户面对的答案,因为单一的完成级学习信号在思考/回答边界上泄漏。我们提出了基于难度缩放的分段强化学习优化(Difficulty-Scaled Segment-Wise GRPO, DSS-GRPO),它将回报分解为思考和回答组件,计算每个分段的组相对优势,并通过硬令牌掩码进行路由,使得压缩更新仅作用于思考,而答案对齐仅作用于回答。DSS-GRPO使用提示级的组内塑形和基于难度的缩放,以鼓励简洁的推理,而不影响答案行为。
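The segment-wise routing can be sketched in a few lines: compute group-relative advantages separately for the think and answer rewards, then broadcast each onto its own tokens through hard masks so that neither learning signal leaks across the think/answer boundary. Shapes and reward definitions below are illustrative assumptions.

```python
import numpy as np

G, T = 4, 10                                    # group size, tokens per completion
rng = np.random.default_rng(0)
think_mask = np.zeros((G, T))
think_mask[:, :6] = 1.0                         # first 6 tokens are think tokens
answer_mask = 1.0 - think_mask

r_think = -think_mask.sum(axis=1) / T           # e.g. a length penalty on thinking
r_answer = rng.integers(0, 2, G).astype(float)  # e.g. answer correctness in {0, 1}

def group_adv(r):                               # group-relative advantage (GRPO-style)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hard routing: each segment's advantage reaches only its own tokens.
adv = (group_adv(r_think)[:, None] * think_mask +
       group_adv(r_answer)[:, None] * answer_mask)
```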
cs.AI / 41 / 2603.07670

Memory for Autonomous LLM Agents:Mechanisms, Evaluation, and Emerging Frontiers

自主大型语言模型代理的记忆:机制、评估与新兴前沿
Du, Pengfei
Abstract
Large language model (LLM) agents increasingly operate in settings where a single context window is far too small to capture what has happened, what was learned, and what should not be repeated. Memory -- the ability to persist, organize, and selectively recall information across interactions -- is what turns a stateless text generator into a genuinely adaptive agent. This survey offers a structured account of how memory is designed, implemented, and evaluated in modern LLM-based agents, covering work from 2022 through early 2026. We formalize agent memory as a \emph{write--manage--read} loop tightly coupled with perception and action, then introduce a three-dimensional taxonomy spanning temporal scope, representational substrate, and control policy. Five mechanism families are examined in depth: context-resident compression, retrieval-augmented stores, reflective self-improvement, hierarchical virtual context, and policy-learned management. On the evaluation side, we trace the shift from static recall benchmarks to multi-session agentic tests that interleave memory with decision-making, analyzing four recent benchmarks that expose stubborn gaps in current systems. We also survey applications where memory is the differentiating factor -- personal assistants, coding agents, open-world games, scientific reasoning, and multi-agent teamwork -- and address the engineering realities of write-path filtering, contradiction handling, latency budgets, and privacy governance. The paper closes with open challenges: continual consolidation, causally grounded retrieval, trustworthy reflection, learned forgetting, and multimodal embodied memory.
Chinese Translation
大型语言模型(LLM)代理越来越多地在单一上下文窗口远远不足以捕捉发生过的事情、所学到的知识以及不应重复的内容的环境中运作。记忆——在交互中持久化、组织和选择性回忆信息的能力——是将无状态文本生成器转变为真正自适应代理的关键。本文对现代基于LLM的代理中记忆的设计、实现和评估进行了结构化的综述,涵盖了2022年至2026年初的相关工作。我们将代理记忆形式化为一个与感知和行动紧密耦合的“写入-管理-读取”循环,并引入了一个跨越时间范围、表征基底和控制策略的三维分类法。深入探讨了五种机制家族:上下文驻留压缩、检索增强存储、反思自我改进、层次虚拟上下文和策略学习管理。在评估方面,我们追踪了从静态回忆基准到多会话代理测试的转变,这些测试将记忆与决策相交织,分析了四个最近的基准,这些基准暴露了当前系统中的顽固缺口。我们还调查了记忆作为区分因素的应用场景——个人助手、编码代理、开放世界游戏、科学推理和多代理团队合作,并讨论了写入路径过滤、矛盾处理、延迟预算和隐私治理等工程现实。最后,本文提出了一些开放挑战:持续整合、因果基础检索、可信反思、学习遗忘和多模态具身记忆。
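A schematic of the write-manage-read loop the survey formalizes. The importance scoring, capacity-based eviction, and keyword-overlap retrieval below are placeholder choices meant only to show the control flow, not any particular memory system.

```python
class AgentMemory:
    def __init__(self, capacity=100):
        self.items, self.capacity = [], capacity

    def write(self, observation, importance):   # write path
        self.items.append({"text": observation, "score": importance})
        self.manage()

    def manage(self):                           # consolidation / eviction
        if len(self.items) > self.capacity:
            self.items.sort(key=lambda m: m["score"], reverse=True)
            self.items = self.items[: self.capacity]

    def read(self, query, k=3):                 # selective recall
        hit = lambda m: sum(w in m["text"] for w in query.split())
        return sorted(self.items, key=hit, reverse=True)[:k]

mem = AgentMemory()
mem.write("user prefers metric units", importance=0.9)
print(mem.read("which units does the user prefer?"))
```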
cs.AI / 42 / 2603.07717

Rigidity in LLM Bandits with Implications for Human-AI Dyads

大语言模型中的刚性及其对人机交互的影响
Wang, Haomiaomiao, Ward, Tomás E, Zhang, Lili
Abstract
We test whether LLMs show robust decision biases. Treating models as participants in two-arm bandits, we ran 20000 trials per condition across four decoding configurations. Under symmetric rewards, models amplified positional order into stubborn one-arm policies. Under asymmetric rewards, they exploited rigidly yet underperformed an oracle and rarely re-checked. The observed patterns were consistent across manipulations of temperature and top-p, with top-k held at the provider default, indicating that the qualitative behaviours are robust to the two decoding knobs typically available to practitioners. Crucially, moving beyond descriptive metrics to computational modelling, a hierarchical Rescorla-Wagner-softmax fit revealed the underlying strategies: low learning rates and very high inverse temperatures, which together explain both noise-to-bias amplification and rigid exploitation. These results position minimal bandits as a tractable probe of LLM decision tendencies and motivate hypotheses about how such biases could shape human-AI interaction.
Chinese Translation
我们测试了大语言模型(LLMs)是否表现出稳健的决策偏差。将模型视为双臂赌博机中的参与者,我们在四种解码配置下进行了每种条件20000次试验。在对称奖励下,模型将位置顺序放大为顽固的单臂策略。在不对称奖励下,它们表现出刚性利用,但表现不及一个理想模型(oracle),且很少重新检查。观察到的模式在温度和top-p的操控下是一致的,而top-k保持在提供者的默认设置,这表明这些定性行为对实践中通常可用的两种解码参数是稳健的。关键是,超越描述性指标到计算建模,分层的Rescorla-Wagner-softmax拟合揭示了潜在策略:低学习率和非常高的逆温度,这两者共同解释了噪声到偏差的放大和刚性利用。这些结果将最小化赌博机定位为一个可处理的探测工具,用于研究大语言模型的决策倾向,并激发了关于这些偏差如何影响人机交互的假设。
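The fitted mechanism is easy to simulate: a Rescorla-Wagner learner with softmax choice, where a low learning rate plus a very high inverse temperature lets early reward noise lock in a one-arm policy even under symmetric rewards. The parameter values below are illustrative, not the paper's fitted estimates.

```python
import numpy as np

def simulate(alpha=0.05, beta=20.0, p=(0.5, 0.5), trials=500, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                               # arm value estimates
    choices = []
    for _ in range(trials):
        logits = beta * q
        probs = np.exp(logits - logits.max())     # numerically stable softmax
        probs /= probs.sum()
        a = rng.choice(2, p=probs)
        r = float(rng.random() < p[a])            # Bernoulli reward
        q[a] += alpha * (r - q[a])                # Rescorla-Wagner update
        choices.append(a)
    return np.mean(choices)                       # fraction of arm-1 picks

print(simulate())  # typically far from 0.5: symmetric rewards, one-arm policy
```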
cs.AI / 43 / 2603.07728

A Novel Multi-Agent Architecture to Reduce Hallucinations of Large Language Models in Multi-Step Structural Modeling

一种新颖的多智能体架构以减少大型语言模型在多步骤结构建模中的幻觉
Geng, Ziheng, Liu, Jiachen, Cao, Ran, Cheng, Lu, Frangopol, Dan M., Cheng, Minghui
Abstract
Large language models (LLMs) such as GPT and Gemini have demonstrated remarkable capabilities in contextual understanding and reasoning. The strong performance of LLMs has sparked growing interest in leveraging them to automate tasks traditionally dependent on human expertise. Recently, LLMs have been integrated into intelligent agents capable of operating structural analysis software (e.g., OpenSees) to construct structural models and perform analyses. However, existing LLMs are limited in handling multi-step structural modeling due to frequent hallucinations and error accumulation during long-sequence operations. To this end, this study presents a novel multi-agent architecture to automate the structural modeling and analysis using OpenSeesPy. First, problem analysis and construction planning agents extract key parameters from user descriptions and formulate a stepwise modeling plan. Node and element agents then operate in parallel to assemble the frame geometry, followed by a load assignment agent. The resulting geometric and load information is translated into executable OpenSeesPy scripts by code translation agents. The proposed architecture is evaluated on a benchmark of 20 frame problems over ten repeated trials, achieving 100% accuracy in 18 cases and 90% in the remaining two. The architecture also significantly improves computational efficiency and demonstrates scalability to larger structural systems.
Chinese Translation
大型语言模型(LLMs),如GPT和Gemini,展示了在上下文理解和推理方面的显著能力。LLMs的强大表现引发了对利用它们自动化传统依赖人类专业知识的任务的日益关注。最近,LLMs已被集成到能够操作结构分析软件(例如OpenSees)的智能代理中,以构建结构模型并进行分析。然而,现有的LLMs在处理多步骤结构建模时受到限制,因为在长序列操作中频繁出现幻觉和错误累积。为此,本研究提出了一种新颖的多智能体架构,以使用OpenSeesPy自动化结构建模和分析。首先,问题分析和构建规划代理从用户描述中提取关键参数,并制定逐步建模计划。节点和元素代理随后并行操作以组装框架几何形状,接着是负载分配代理。最终的几何和负载信息由代码翻译代理转换为可执行的OpenSeesPy脚本。所提出的架构在20个框架问题的基准测试中进行了评估,经过十次重复试验,在18个案例中实现了100%的准确率,其余两个案例中达到了90%的准确率。该架构还显著提高了计算效率,并展示了对更大结构系统的可扩展性。
cs.AI / 44 / 2603.07733

Large Language Model for Discrete Optimization Problems: Evaluation and Step-by-step Reasoning

用于离散优化问题的大型语言模型:评估与逐步推理
Qian, Tianhao, Qi, Guilin, Wu, Z. Y., Gu, Ran, Liu, Xuanyi, Lyu, Canchen
Abstract
This work investigated the capabilities of different models, including the Llama-3 series and ChatGPT, with different forms of expression in solving discrete optimization problems by testing natural language datasets. In contrast to formal datasets with a limited scope of parameters, our dataset included a variety of problem types in discrete optimization and featured a wide range of parameter magnitudes, including instances with large parameter sets, integrated with augmented data. It aimed to (1) provide an overview of LLMs' ability on large-scale problems, (2) offer suggestions to those who want to solve discrete optimization problems automatically, and (3) serve as a performance benchmark for future research. These datasets included original, expanded, and augmented datasets. Among the three, the original and augmented ones were intended for evaluation, while the expanded one may help finetune a new model. In the experiments, comparisons were made between strong and weak models, and between CoT and no-CoT methods, on various datasets. The results showed that, reasonably, stronger models performed better. Contrary to general agreement, they also showed that the CoT technique was not always effective depending on model capability, and that disordered datasets improved model performance on easy-to-understand problems, though sometimes with high variance, a manifestation of instability. Therefore, for those who seek to enhance the automatic resolution of discrete optimization problems, it is recommended to consult the results, including the line charts presented in the Appendix, as well as the conclusions drawn in this study.
Chinese Translation
本研究探讨了包括Llama-3系列模型和CHATGPT在内的不同模型在解决离散优化问题方面的能力,通过测试自然语言数据集进行评估。与参数范围有限的正式数据集相比,我们的数据集涵盖了多种离散优化问题类型,并具有广泛的参数规模,包括大参数集的实例,并结合了增强数据。研究旨在(1)概述大型语言模型(LLMs)在大规模问题中的能力,(2)为希望自动解决离散优化问题的研究者提供建议,以及(3)将性能视为未来研究的基准。这些数据集包括原始数据集、扩展数据集和增强数据集。在这三种数据集中,原始数据集和增强数据集旨在进行评估,而扩展数据集则可能有助于微调新模型。在实验中,我们比较了强模型与弱模型、链式推理(CoT)方法与非链式推理(No-CoT)方法在各种数据集上的表现。结果表明,强模型的表现更为优越。与普遍共识相反,研究还显示,链式推理技术并不总是有效,而无序数据集在易于理解的问题上提高了模型的性能,尽管这些数据集有时表现出高方差,体现了不稳定性。因此,对于希望提升离散优化问题自动解决能力的研究者,建议参考本研究的结果,包括附录中呈现的折线图,以及得出的相关建议。
cs.AI / 45 / 2603.07848

Intentional Deception as Controllable Capability in LLM Agents

意图性欺骗作为大型语言模型代理的可控能力
Starace, Jason, Soule, Terence
Abstract
As LLM-based agents increasingly operate in multi-agent systems, understanding adversarial manipulation becomes critical for defensive design. We present a systematic study of intentional deception as an engineered capability, using LLM-to-LLM interactions within a text-based RPG where parameterized behavioral profiles (9 alignments x 4 motivations, yielding 36 profiles with explicit ethical ground truth) serve as our experimental testbed. Unlike accidental deception from misalignment, we investigate a two-stage system that infers target agent characteristics and generates deceptive responses steering targets toward actions counter to their beliefs and motivations. We find that deceptive intervention produces differential effects concentrated in specific behavioral profiles rather than distributed uniformly, and that 88.5% of successful deceptions employ misdirection (true statements with strategic framing) rather than fabrication, indicating fact-checking defenses would miss the large majority of adversarial responses. Motivation, inferable at 98%+ accuracy, serves as the primary attack vector, while belief systems remain harder to identify (49% inference ceiling) or exploit. These findings identify which agent profiles require additional safeguards and suggest that current fact-verification approaches are insufficient against strategically framed deception.
Chinese Translation
随着基于大型语言模型(LLM)的代理在多智能体系统中越来越多地运作,理解对抗性操控变得对防御设计至关重要。我们通过在一个文本基础的角色扮演游戏(RPG)中使用LLM与LLM之间的交互,系统性地研究了意图性欺骗作为一种工程能力,其中参数化的行为特征(9种对齐方式 x 4种动机,共36种具有明确伦理真实情况的特征)作为我们的实验测试平台。与由于不对齐而导致的意外欺骗不同,我们研究了一个两阶段系统,该系统推断目标代理的特征并生成欺骗性回应,引导目标采取与其信念和动机相悖的行动。我们发现,欺骗性干预在特定行为特征中产生了差异化效果,而不是均匀分布,并且88.5%的成功欺骗使用了误导(具有战略框架的真实陈述)而非虚构,这表明事实核查防御将错过绝大多数对抗性回应。动机以98%以上的准确率可推断,是主要的攻击向量,而信念系统则更难以识别(推断上限为49%)或利用。这些发现确定了哪些代理特征需要额外的保护措施,并表明当前的事实验证方法对于战略性框架的欺骗来说是不够的。
cs.AI / 46 / 2603.07853

SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

SynPlanResearch-R1:通过合成计划鼓励深度研究的工具探索
Zeng, Hansi, Li, Zoey, Gao, Yifan, Zhang, Chenwei, Pan, Xiaoman, Yang, Tao, Mo, Fengran, Lin, Jiacheng, Li, Xian, Shang, Jingbo
Abstract
Research Agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories encouraging deeper exploration to shape exploration behavior during cold-start supervised fine-tuning, providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance by up to 6.0% on Qwen3-8B and 5.8% on Qwen3-4B backbones respectively compared to SOTA baselines. Further analyses of tool-use patterns and training dynamics compared to baselines shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn-plan-research.
Chinese Translation
研究代理使模型能够利用工具从网络收集信息以回答用户查询,这要求它们动态地将内部推理与工具使用交织在一起。虽然这种能力原则上可以通过具有可验证奖励的强化学习(RLVR)来学习,但我们观察到代理通常表现出较差的探索行为,包括过早终止和偏向性的工具使用。因此,单靠RLVR的改进效果有限。我们提出了SynPlanResearch-R1,一个合成工具使用轨迹的框架,旨在鼓励更深层次的探索,以在冷启动的监督微调过程中塑造探索,为后续的强化学习提供强有力的初始化。在七个多跳和开放网络基准测试中,与最先进的基线相比,SynPlanResearch-R1在Qwen3-8B和Qwen3-4B主干上分别提高了最多6.0%和5.8%的性能。与基线相比,对工具使用模式和训练动态的进一步分析揭示了这些提升背后的因素。我们的代码已公开发布在 https://github.com/HansiZeng/syn-plan-research。
cs.AI / 47 / 2603.07868

Hospitality-VQA: Decision-Oriented Informativeness Evaluation for Vision-Language Models

酒店业视觉问答:面向决策的信息量评估
Lee, Jeongwoo, Duhyeong, Baek, Han, Eungyeol, Shin, Soyeon, Han, Gukin, Kim, Seungduk, Jeon, Jaehyun, Jeong, Taewoo
Abstract
Recent advances in Vision-Language Models (VLMs) have demonstrated impressive multimodal understanding in general domains. However, their applicability to decision-oriented domains such as hospitality remains largely unexplored. In this work, we investigate how well VLMs can perform visual question answering (VQA) about hotel and facility images that are central to consumer decision-making. While many existing VQA benchmarks focus on factual correctness, they rarely capture what information users actually find useful. To address this, we first introduce Informativeness as a formal framework to quantify how much hospitality-relevant information an image-question pair provides. Guided by this framework, we construct a new hospitality-specific VQA dataset that covers various facility types, where questions are specifically designed to reflect key user information needs. Using this benchmark, we conduct experiments with several state-of-the-art VLMs, revealing that VLMs are not intrinsically decision-aware-key visual signals remain underutilized, and reliable informativeness reasoning emerges only after modest domain-specific finetuning.
Chinese Translation
近年来,视觉语言模型(VLMs)的进展在一般领域展示了令人印象深刻的多模态理解能力。然而,它们在酒店等面向决策的领域中的适用性仍然未得到充分探索。在本研究中,我们探讨了VLMs在关于酒店和设施图像的视觉问答(VQA)任务中的表现,这些图像对消费者决策至关重要。虽然许多现有的VQA基准关注于事实的正确性,但它们很少捕捉用户实际认为有用的信息。为了解决这个问题,我们首先引入信息量(Informativeness)作为一个正式框架,以量化图像-问题对提供的与酒店相关的信息量。在这个框架的指导下,我们构建了一个新的酒店特定VQA数据集,涵盖各种设施类型,其中问题专门设计以反映用户的关键信息需求。利用这个基准,我们对几种最先进的VLMs进行了实验,结果显示VLMs并不是内在地具备决策意识——关键的视觉信号仍然未被充分利用,可靠的信息量推理仅在适度的领域特定微调后才会显现。
cs.AI / 48 / 2603.07890

Visualizing Coalition Formation: From Hedonic Games to Image Segmentation

可视化联盟形成:从享乐游戏到图像分割
França, Pedro Henrique de Paula, Felipe, Lucas Lopes, Menasché, Daniel Sadoc
Abstract
We propose image segmentation as a visual diagnostic testbed for coalition formation in hedonic games. Modeling pixels as agents on a graph, we study how a granularization parameter shapes equilibrium fragmentation and boundary structure. On the Weizmann single-object benchmark, we relate multi-coalition equilibria to binary protocols by measuring whether the converged coalitions overlap with a foreground ground-truth. We observe transitions from cohesive to fragmented yet recoverable equilibria, and finally to intrinsic failure under excessive fragmentation. Our core contribution links multi-agent systems with image segmentation by quantifying the impact of mechanism design parameters on equilibrium structures.
Chinese Translation
我们提出将图像分割作为享乐游戏中联盟形成的可视化诊断测试平台。通过将像素建模为图上的代理,我们研究了粒度参数如何影响均衡的碎片化和边界结构。在魏茨曼单对象基准测试中,我们通过测量收敛的联盟是否与前景真实值重叠,将多联盟均衡与二元协议联系起来。我们观察到从凝聚到碎片化但可恢复的均衡的过渡,最终在过度碎片化下导致内在失败。我们的核心贡献在于通过量化机制设计参数对均衡结构的影响,将多代理系统与图像分割联系起来。
cs.AI / 49 / 2603.07891

A Lightweight Traffic Map for Efficient Anytime LaCAM*

一种轻量级交通地图用于高效的随时 LaCAM*
Shen, Bojie, Zhang, Yue, Chen, Zhe, Harabor, Daniel
Abstract
Multi-Agent Path Finding (MAPF) aims to compute collision-free paths for multiple agents and has a wide range of practical applications. LaCAM*, an anytime configuration-based solver, currently represents the state of the art. Recent work has explored the use of guidance paths to steer LaCAM* toward configurations that avoid traffic congestion, thereby improving solution quality. However, existing approaches rely on Frank-Wolfe-style optimization that repeatedly invokes single-agent search before executing LaCAM*, resulting in substantial computational overhead for large-scale problems. Moreover, the guidance path is static and primarily beneficial for finding the first solution in LaCAM*. To address these limitations, we propose a new approach that leverages LaCAM*'s ability to construct a dynamic, lightweight traffic map during its search. Experimental results demonstrate that our method achieves higher solution quality than state-of-the-art guidance-path approaches across two MAPF variants.
Chinese Translation
多智能体路径规划(Multi-Agent Path Finding, MAPF)旨在为多个智能体计算无碰撞路径,具有广泛的实际应用。LaCAM* 是一种基于配置的随时(anytime)求解器,目前代表了该领域的最新进展。近期的研究探索了使用引导路径来引导 LaCAM* 朝向避免交通拥堵的配置,从而提高解决方案的质量。然而,现有的方法依赖于 Frank-Wolfe 风格的优化,在执行 LaCAM* 之前反复调用单智能体搜索,这导致在大规模问题上产生了显著的计算开销。此外,引导路径是静态的,主要对在 LaCAM* 中找到第一个解决方案有利。为了解决这些局限性,我们提出了一种新方法,利用 LaCAM* 在搜索过程中构建动态轻量级交通地图的能力。实验结果表明,我们的方法在两个 MAPF 变体中实现了比最先进的引导路径方法更高的解决方案质量。
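
The abstract does not spell out how the traffic map is built. As a rough, hypothetical illustration of a lightweight, dynamically maintained congestion record (the class name and penalty rule below are our assumptions, not the authors' method), one could count vertex traversals as the solver expands configurations and fold the counts into guidance costs:

from collections import defaultdict

class TrafficMap:
    """Hypothetical lightweight traffic map: counts how often search
    trajectories traverse each vertex and turns counts into penalties."""

    def __init__(self, penalty_weight=1.0):
        self.visits = defaultdict(int)   # vertex -> traversal count
        self.penalty_weight = penalty_weight

    def record(self, configuration):
        # A configuration assigns one vertex per agent; updating counts
        # as the solver expands it lets the map evolve with the search.
        for vertex in configuration:
            self.visits[vertex] += 1

    def penalty(self, vertex):
        # Congested vertices get higher cost, steering guidance away
        # from traffic without re-running single-agent searches.
        return self.penalty_weight * self.visits[vertex]

    def guided_cost(self, base_cost, vertex):
        return base_cost + self.penalty(vertex)

Because such a map is only a counter updated during search, it sidesteps the repeated single-agent searches that the abstract attributes to Frank-Wolfe-style guidance.
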
cs.AI / 50 / 2603.07896

SMGI: A Structural Theory of General Artificial Intelligence

SMGI:一般人工智能的结构理论
Osmani, Aomar
Abstract
We introduce SMGI, a structural theory of general artificial intelligence, and recast the foundational problem of learning from the optimization of hypotheses within fixed environments to the controlled evolution of the learning interface itself. We formalize the Structural Model of General Intelligence (SMGI) via a typed meta-model $\theta = (r,\mathcal H,\Pi,\mathcal L,\mathcal E,\mathcal M)$ that treats representational maps, hypothesis spaces, structural priors, multi-regime evaluators, and memory operators as explicitly typed, dynamic components. By enforcing a strict mathematical separation between this structural ontology ($\theta$) and its induced behavioral semantics ($T_\theta$), we define general artificial intelligence as a class of admissible coupled dynamics $(\theta, T_\theta)$ satisfying four obligations: structural closure under typed transformations, dynamical stability under certified evolution, bounded statistical capacity, and evaluative invariance across regime shifts. We prove a structural generalization bound that links sequential PAC-Bayes analysis and Lyapunov stability, providing sufficient conditions for capacity control and bounded drift under admissible task transformations. Furthermore, we establish a strict structural inclusion theorem demonstrating that classical empirical risk minimization, reinforcement learning, program-prior models (Solomonoff-style), and modern frontier agentic pipelines operate as structurally restricted instances of SMGI.
Chinese Translation
我们引入了SMGI,一种一般人工智能的结构理论,并将学习的基础问题从在固定环境中对假设的优化重新构造为学习接口本身的受控演化。我们通过一个类型化的元模型 $\theta = (r,\mathcal{H},\Pi,\mathcal{L},\mathcal{E},\mathcal{M})$ 来形式化一般智能的结构模型(SMGI),该模型将表征映射、假设空间、结构先验、多模式评估器和记忆算子视为明确类型的动态组件。通过在这个结构本体($\theta$)与其诱导的行为语义($T_\theta$)之间强制严格的数学分离,我们将一般人工智能定义为一类满足四个义务的可接受耦合动态 $(\theta, T_\theta)$:在类型变换下的结构闭合性、在认证演化下的动态稳定性、有界统计能力以及在模式转变中的评估不变性。我们证明了一个结构泛化界限,该界限将序列PAC-Bayes分析与Lyapunov稳定性联系起来,为可接受任务变换下的容量控制和有界漂移提供了充分条件。此外,我们建立了一个严格的结构包含定理,证明经典的经验风险最小化、强化学习、程序先验模型(Solomonoff风格)和现代前沿智能管道作为SMGI的结构限制实例运行。
cs.AI / 51 / 2603.07900

EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records

EveryQuery:通过任务条件预训练实现零样本临床预测的电子健康记录
Chandak, Payal, Kondas, Gregory, Kohane, Isaac, McDermott, Matthew
Abstract
Foundation models pretrained on electronic health records (EHR) have demonstrated zero-shot clinical prediction capabilities by generating synthetic patient futures and aggregating statistics over sampled trajectories. However, this autoregressive inference procedure is computationally expensive, statistically noisy, and not natively promptable because users cannot directly condition predictions on specific clinical questions. In this preliminary work, we introduce EveryQuery, an EHR foundation model that achieves zero-shot inference through task-conditioned pre-training. Rather than generating future events, EveryQuery takes as input a patient's history and a structured query specifying a clinical task, and directly estimates the likelihood of the outcome occurring in the future window via a single forward pass. EveryQuery realizes this capability by pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts. This enables zero-shot prediction for any task in the query space without finetuning, linear probing, or trajectory generation. On MIMIC-IV, EveryQuery outperforms an autoregressive baseline on 82% of 39 randomly sampled prediction tasks, with a mean AUC improvement of +0.16 (95% CI: [0.10,0.22]). This advantage remains consistent on tasks that were explicitly held out from the pre-training distribution. Further, EveryQuery's performance gains are most pronounced for rare clinical events, affirming and demonstrating a solution to the fundamental limitation of autoregressive inference for low-prevalence outcomes. However, at present, EveryQuery underperforms on tasks requiring disjunctive reasoning over multiple codes, such as 30-day readmission, exposing a concrete expressiveness limitation of the current query language.
Chinese Translation
在电子健康记录(EHR)上预训练的基础模型已经展示了零样本临床预测的能力,通过生成合成患者未来情况并对采样轨迹进行统计聚合。然而,这种自回归推断过程计算开销大、统计噪声高,并且不具备原生的提示能力,因为用户无法直接根据特定的临床问题来条件化预测。在这项初步工作中,我们介绍了EveryQuery,这是一种通过任务条件预训练实现零样本推断的EHR基础模型。EveryQuery并不是生成未来事件,而是以患者的历史记录和指定临床任务的结构化查询作为输入,通过单次前向传递直接估计未来窗口中结果发生的可能性。EveryQuery通过在随机采样的查询任务和患者背景组合上进行预训练,实现了这一能力,直接训练模型生成对任意输入提示的正确答案。这使得在查询空间中的任何任务上都能够实现零样本预测,而无需微调、线性探测或轨迹生成。在MIMIC-IV数据集上,EveryQuery在39个随机采样预测任务中,有82%的任务超越了自回归基线,平均AUC提升为+0.16(95% CI: [0.10,0.22])。这一优势在明确从预训练分布中排除的任务上依然保持一致。此外,EveryQuery在稀有临床事件上的性能提升最为显著,确认并展示了解决自回归推断在低发生率结果上的基本限制的方案。然而,目前EveryQuery在需要对多个代码进行析取推理的任务上表现不佳,例如30天再入院,暴露了当前查询语言的具体表达能力限制。
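
As a minimal sketch of what task-conditioned, single-forward-pass outcome scoring could look like (the module below is purely illustrative; encoder details, dimensions, and names are assumptions, not the paper's architecture):

import torch
import torch.nn as nn

class QueryConditionedHead(nn.Module):
    """Illustrative sketch: encode a patient history and a structured
    task query, then score the outcome with one forward pass instead of
    sampling and aggregating future trajectories."""

    def __init__(self, d_model=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, history_emb, query_emb):
        # history_emb, query_emb: (batch, d_model) from upstream encoders
        logit = self.scorer(torch.cat([history_emb, query_emb], dim=-1))
        return logit.squeeze(-1)  # log-odds of the outcome in the window

def pretrain_step(model, opt, history_emb, query_emb, labels):
    # Pre-training over randomly sampled (history, query, label) pairs.
    loss = nn.functional.binary_cross_entropy_with_logits(
        model(history_emb, query_emb), labels.float()
    )
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
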
cs.AI / 52 / 2603.07915

Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents

Ares:用于高效LLM代理的自适应推理努力选择
Yang, Jingbo, Hou, Bairu, Wei, Wei, Bao, Yujia, Chang, Shiyu
Abstract
Modern agents powered by thinking LLMs achieve high accuracy through long chain-of-thought reasoning but incur substantial inference costs. While many LLMs now support configurable reasoning levels (e.g., high/medium/low), static strategies are often ineffective: using low-effort modes at every step leads to significant performance degradation, while random selection fails to preserve accuracy or provide meaningful cost reduction. However, agents should reserve high reasoning effort for difficult steps like navigating complex website structures, while using lower-effort modes for simpler steps like opening a target URL. In this paper, we propose Ares, a framework for per-step dynamic reasoning effort selection tailored for multi-step agent tasks. Ares employs a lightweight router to predict the lowest appropriate reasoning level for each step based on the interaction history. To train this router, we develop a data generation pipeline that identifies the minimum reasoning effort required for successful step completion. We then fine-tune the router to predict these levels, enabling plug-and-play integration for any LLM agents. We evaluate Ares on a diverse set of agent tasks, including TAU-Bench for tool use agents, BrowseComp-Plus for deep-research agents, and WebArena for web agents. Experimental results show that Ares reduces reasoning token usage by up to 52.7% compared to fixed high-effort reasoning, while introducing minimal degradation in task success rates.
Chinese Translation
现代代理通过思维型LLM实现高准确率,依赖于长链推理,但会产生可观的推理成本。尽管许多LLM现在支持可配置的推理水平(例如,高/中/低),但静态策略往往效果不佳:在每一步都使用低努力模式会导致显著的性能下降,而随机选择则无法保持准确性或提供有意义的成本降低。然而,代理应在困难步骤(如导航复杂网站结构)中保留高推理努力,而在简单步骤(如打开目标URL)中使用低努力模式。本文提出了Ares,一个针对多步骤代理任务的逐步动态推理努力选择框架。Ares采用轻量级路由器,根据交互历史预测每一步的最低适当推理水平。为了训练该路由器,我们开发了一个数据生成管道,识别成功完成每一步所需的最低推理努力。然后,我们微调路由器以预测这些水平,使其能够与任何LLM代理进行即插即用的集成。我们在多样化的代理任务上评估Ares,包括用于工具使用代理的TAU-Bench、用于深度研究代理的BrowseComp-Plus以及用于网络代理的WebArena。实验结果表明,与固定高努力推理相比,Ares将推理令牌使用量减少了多达52.7%,同时对任务成功率的影响极小。
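
A minimal sketch of the per-step effort-selection loop the abstract describes; router_predict and run_llm_step are stand-in functions, not the paper's components:

# Hypothetical per-step reasoning-effort selection.
EFFORT_LEVELS = ["low", "medium", "high"]

def run_agent(task, router_predict, run_llm_step, max_steps=20):
    history = [task]
    for _ in range(max_steps):
        # The router picks the lowest level it predicts will still
        # complete this step, based on the interaction history.
        effort = router_predict(history)    # -> "low" | "medium" | "high"
        action, done = run_llm_step(history, reasoning_effort=effort)
        history.append((effort, action))
        if done:
            break
    return history

This plug-and-play structure is why the router can wrap any LLM agent: it only rewrites the reasoning-effort setting passed to each step.
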
cs.AI / 53 / 2603.07916

Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases

Rel-MOSS:在关系数据库上实现不平衡关系深度学习
Yin, Jun, Huo, Peng, Zhu, Bangguo, Yan, Hao, Wang, Senzhang, Pan, Shirui, Zhang, Chengqi
Abstract
In recent advances, to enable a fully data-driven learning paradigm on relational databases (RDB), relational deep learning (RDL) is proposed to structure the RDB as a heterogeneous entity graph and adopt the graph neural network (GNN) as the predictive model. However, existing RDL methods neglect the imbalance problem of relational data in RDBs and risk under-representing the minority entities, leading to an unusable model in practice. In this work, we investigate, for the first time, class imbalance problem in RDB entity classification and design the relation-centric minority synthetic over-sampling GNN (Rel-MOSS), in order to fill a critical void in the current literature. Specifically, to mitigate the issue of minority-related information being submerged by majority counterparts, we design the relation-wise gating controller to modulate neighborhood messages from each individual relation type. Based on the relational-gated representations, we further propose the relation-guided minority synthesizer for over-sampling, which integrates the entity relational signatures to maintain relational consistency. Extensive experiments on 12 entity classification datasets provide compelling evidence for the superiority of Rel-MOSS, yielding an average improvement of up to 2.46% and 4.00% in terms of Balanced Accuracy and G-Mean, compared with SOTA RDL methods and classic methods for handling class imbalance.
Chinese Translation
在最近的研究进展中,为了在关系数据库(RDB)上实现完全数据驱动的学习范式,提出了关系深度学习(RDL),将RDB结构化为异构实体图,并采用图神经网络(GNN)作为预测模型。然而,现有的RDL方法忽视了RDB中关系数据的不平衡问题,可能导致少数实体的代表性不足,从而在实践中形成不可用的模型。在本研究中,我们首次探讨了RDB实体分类中的类别不平衡问题,并设计了以关系为中心的少数合成过采样GNN(Rel-MOSS),以填补当前文献中的关键空白。具体而言,为了减轻少数相关信息被多数对应信息淹没的问题,我们设计了关系级门控控制器,以调节来自每种关系类型的邻域消息。基于关系门控表示,我们进一步提出了关系引导的少数合成器用于过采样,该合成器整合了实体关系特征以保持关系一致性。在12个实体分类数据集上的广泛实验提供了Rel-MOSS优越性的有力证据,与最先进的RDL方法和处理类别不平衡的经典方法相比,Rel-MOSS在平衡准确率和G-均值方面平均提高了2.46%和4.00%。
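
As an illustrative sketch of relation-wise gating (the exact gate parameterization is our assumption, not the paper's), each relation type's aggregated neighbor message can be modulated by a learned sigmoid gate before combination, so minority-relevant relations are not drowned out by high-degree majority relations:

import torch
import torch.nn as nn

class RelationGate(nn.Module):
    """Illustrative relation-wise gating controller."""

    def __init__(self, relation_types, d):
        super().__init__()
        # One small gate network per relation type (keys are strings).
        self.gates = nn.ModuleDict(
            {r: nn.Linear(2 * d, 1) for r in relation_types}
        )

    def forward(self, h_entity, messages):
        # h_entity: (d,) entity state; messages: dict mapping a
        # relation type to its (d,) aggregated neighborhood message.
        out = torch.zeros_like(h_entity)
        for rel, m in messages.items():
            g = torch.sigmoid(self.gates[rel](torch.cat([h_entity, m], dim=-1)))
            out = out + g * m   # per-relation modulation, then sum
        return out
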
cs.AI / 54 / 2603.07970

Advancing Automated Algorithm Design via Evolutionary Stagewise Design with LLMs

通过进化阶段性设计与大型语言模型推进自动化算法设计
Lu, Chen, Xue, Ke, Gao, Chengrui, Shi, Yunqi, Xu, Siyuan, Yuan, Mingxuan, Qian, Chao, Zhou, Zhi-Hua
Abstract
With the rapid advancement of human science and technology, problems in industrial scenarios are becoming increasingly challenging, bringing significant challenges to traditional algorithm design. Automated algorithm design with LLMs emerges as a promising solution, but the currently adopted black-box modeling deprives LLMs of any awareness of the intrinsic mechanism of the target problem, leading to hallucinated designs. In this paper, we introduce Evolutionary Stagewise Algorithm Design (EvoStage), a novel evolutionary paradigm that bridges the gap between the rigorous demands of industrial-scale algorithm design and the LLM-based algorithm design methods. Drawing inspiration from CoT, EvoStage decomposes the algorithm design process into sequential, manageable stages and integrates real-time intermediate feedback to iteratively refine algorithm design directions. To further reduce the algorithm design space and avoid falling into local optima, we introduce a multi-agent system and a "global-local perspective" mechanism. We apply EvoStage to the design of two types of common optimizers: designing parameter configuration schedules of the Adam optimizer for chip placement, and designing acquisition functions of Bayesian optimization for black-box optimization. Experimental results across open-source benchmarks demonstrate that EvoStage outperforms human-expert designs and existing LLM-based methods within only a couple of evolution steps, even achieving the historically state-of-the-art half-perimeter wire-length results on every tested chip case. Furthermore, when deployed on a commercial-grade 3D chip placement tool, EvoStage significantly surpasses the original performance metrics, achieving record-breaking efficiency. We hope EvoStage can significantly advance automated algorithm design in the real world, helping elevate human productivity.
Chinese Translation
随着人类科学技术的快速发展,工业场景中的问题变得越来越具有挑战性,这给传统算法设计带来了重大挑战。基于大型语言模型(LLMs)的自动化算法设计作为一种有前景的解决方案应运而生,但目前采用的黑箱建模使得LLMs无法意识到目标问题的内在机制,导致产生幻觉设计。本文介绍了一种新颖的进化范式——进化阶段性算法设计(EvoStage),旨在弥合工业规模算法设计的严格要求与基于LLM的算法设计方法之间的差距。EvoStage受到链式思维(CoT)的启发,将算法设计过程分解为顺序的、可管理的阶段,并整合实时的中间反馈,以迭代地优化算法设计方向。为了进一步缩小算法设计空间并避免陷入局部最优,我们引入了多智能体系统和“全局-局部视角”机制。我们将EvoStage应用于两种常见优化器的设计:为芯片布局设计Adam优化器的参数配置调度,以及为黑箱优化设计贝叶斯优化的获取函数。开放源代码基准测试的实验结果表明,EvoStage在仅经过几次进化步骤后便超越了人类专家设计和现有的基于LLM的方法,甚至在每个测试的芯片案例中实现了历史上最先进的半周长线长结果。此外,当在商业级3D芯片布局工具上部署时,EvoStage显著超越了原始性能指标,达到了创纪录的效率。我们希望EvoStage能够显著推动现实世界中的自动化算法设计,帮助提升人类生产力。
cs.AI / 55 / 2603.07972

Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

与人类的自适应协作:针对持续学习的多智能体大语言模型的元认知策略优化
Yang, Wei, Cao, Defu, Pang, Jiacheng, Weng, Muyan, Liu, Yan
Abstract
While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain ''closed-world'' systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human--agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.
Chinese Translation
虽然单个大语言模型(LLMs)的扩展取得了显著进展,但下一个前沿在于通过多智能体系统(MAS)扩展协作。然而,纯自主的多智能体系统仍然是“封闭世界”系统,受限于预训练模型的静态知识视野。这一限制使得它们在需要超出训练数据的知识的任务上变得脆弱,常常在新挑战面前导致集体失败。为了解决这个问题,我们提出了人机协作多智能体框架(HILA),这是一个人类与智能体协作的原则性范式。HILA训练智能体学习一种元认知策略,该策略决定何时自主解决问题,何时向人类专家请教。为了实现这一策略,我们引入了双循环策略优化(Dual-Loop Policy Optimization),将即时决策与长期能力增长分开。内循环应用具有成本意识的奖励的群体相对策略优化(Group Relative Policy Optimization,GRPO)来优化延迟决策,而外循环则实施持续学习,将专家反馈转化为高质量的监督信号,从而增强智能体的推理能力。在具有挑战性的数学和问题解决基准上的实验表明,配备双循环策略优化的HILA始终优于先进的多智能体系统,为协作和持续改进的智能体系统奠定了原则性基础。
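
A toy rendering of the kind of cost-aware deferral reward the inner GRPO loop optimizes; the specific values are illustrative assumptions, not the paper's reward:

def deferral_reward(correct, deferred, defer_cost=0.3):
    """Toy cost-aware reward: reward correctness, but charge a fixed
    cost whenever the agent defers to the human expert, so deferral
    only pays off when autonomous attempts would fail."""
    reward = 1.0 if correct else 0.0
    if deferred:
        reward -= defer_cost
    return reward

# Solving alone and succeeding beats deferring and succeeding:
assert deferral_reward(True, False) > deferral_reward(True, True)

Under such a reward, the metacognitive policy learns to defer only when the expected autonomous success probability is low enough to justify the cost.
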
cs.AI / 56 / 2603.07978

OSExpert: Computer-Use Agents Learning Professional Skills via Exploration

OSExpert:通过探索学习专业技能的计算机使用代理
Liu, Jiateng, Wang, Zhenhailong, Wang, Rushi, Li, Bingxuan, Kim, Jeonghwan, Tiwari, Aditi, Yu, Pengfei, Zhang, Denghui, Ji, Heng
Abstract
General-purpose computer-use agents have shown impressive performance across diverse digital environments. However, our new benchmark, OSExpert-Eval, indicates they remain far less helpful than human experts. Although inference-time scaling enables adaptation, these agents complete complex tasks inefficiently with degraded performance, transfer poorly to unseen UIs, and struggle with fine-grained action sequences. To solve the problem, we introduce a GUI-based depth-first search (GUI-DFS) exploration algorithm to comprehensively explore and verify an environment's unit functions. The agent then exploits compositionality between unit skills to self-construct a curriculum for composite tasks. To support fine-grained actions, we curate a database of action primitives for agents to discover during exploration; these are saved as a skill set once the exploration is complete. We use the learned skills to improve the agent's performance and efficiency by (1) enriching agents with ready-to-use procedural knowledge, allowing them to plan only once for long trajectories and generate accurate actions, and (2) enabling them to end inference-time scaling earlier by realizing their boundary of capabilities. Extensive experiments show that our environment-learned agent takes a meaningful step toward expert-level computer use, achieving a roughly 20 percent performance gain on OSExpert-Eval and closing the efficiency gap to humans by roughly 80 percent.
Chinese Translation
通用计算机使用代理在多样化的数字环境中表现出色。然而,我们的新基准OSExpert-Eval表明,它们的帮助程度远不及人类专家。尽管推理时间的扩展使得适应成为可能,这些代理在完成复杂任务时效率低下,表现下降,且在未见过的用户界面(UI)上迁移效果不佳,难以处理细粒度的动作序列。为了解决这个问题,我们引入了一种基于图形用户界面(GUI)的深度优先搜索(GUI-DFS)探索算法,以全面探索和验证环境的单元功能。然后,代理利用单元技能之间的组合性,自我构建复合任务的课程。为了支持细粒度的动作,我们整理了一个动作原语数据库,供代理在探索过程中发现;一旦探索完成,这些动作将被保存为技能集。我们利用学习到的技能通过(1)为代理提供现成的程序知识,允许它们在长路径上仅规划一次并生成准确的动作,以及(2)通过认识自身能力边界,提前结束推理时间的扩展,从而提高代理的性能和效率。大量实验表明,我们的环境学习代理朝着专家级计算机使用迈出了重要一步,在OSExpert-Eval上实现了约20%的性能提升,并将与人类之间的效率差距缩小了约80%。
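
A schematic of GUI-based depth-first exploration, assuming a hypothetical environment API (enumerate_actions, execute, verify_unit_function); the real system's verification and curriculum construction are more involved:

def gui_dfs(env, state, visited, skills, max_depth=8, depth=0):
    """Sketch of GUI depth-first exploration: try every visible action,
    verify the unit function it triggers, and recurse into the
    resulting screen until the UI (or a depth budget) is exhausted."""
    if depth >= max_depth or state in visited:
        return
    visited.add(state)
    for action in env.enumerate_actions(state):       # visible UI actions
        next_state, effect = env.execute(state, action)
        if env.verify_unit_function(action, effect):  # confirmed behavior
            skills.append((state, action, effect))    # kept in skill set
        gui_dfs(env, next_state, visited, skills, max_depth, depth + 1)

Once exploration terminates, the saved (state, action, effect) triples can be composed into the self-constructed curriculum for composite tasks.
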
cs.AI / 57 / 2603.07997

CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

CMMR-VLN:通过持续多模态记忆检索实现的视觉与语言导航
Li, Haozhou, Dong, Xiangyu, Jiang, Huiyan, Zhou, Yaoming, Ma, Xiaoguang
Abstract
Although large language models (LLMs) are introduced into vision-and-language navigation (VLN) to improve instruction comprehension and generalization, existing LLM-based VLN lacks the ability to selectively recall and use relevant prior experiences to help navigation tasks, limiting their performance in long-horizon and unfamiliar scenarios. In this work, we propose CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, the CMMR-VLN constructs a multimodal experience memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieval-augmented generation pipeline to mimic how experienced human navigators leverage prior knowledge, and incorporates a reflection-based memory update strategy that selectively stores complete successful paths and the key initial mistake in failure cases. Comprehensive tests illustrate average success rate improvements of 52.9%, 20.9% and 20.9%, and 200%, 50% and 50% over the NavGPT, the MapGPT, and the DiscussNav in simulation and real tests, respectively, elucidating the great potential of the CMMR-VLN as a backbone VLN framework.
Chinese Translation
尽管大型语言模型(LLMs)被引入到视觉与语言导航(VLN)中,以提高指令理解和泛化能力,但现有的基于LLM的VLN缺乏选择性回忆和利用相关先前经验来帮助导航任务的能力,这限制了它们在长时间和不熟悉场景中的表现。在本研究中,我们提出了CMMR-VLN(基于持续多模态记忆检索的VLN),这是一个赋予LLM代理结构化记忆和反思能力的VLN框架。具体而言,CMMR-VLN构建了一个以全景视觉图像和显著地标为索引的多模态经验记忆,以在导航过程中检索相关经验,引入了一种检索增强生成管道,以模仿经验丰富的人类导航员如何利用先前知识,并结合了一种基于反思的记忆更新策略,选择性地存储完整的成功路径和失败案例中的关键初始错误。全面的测试表明,在模拟和真实测试中,CMMR-VLN相较于NavGPT、MapGPT和DiscussNav的平均成功率分别提高了52.9%、20.9%和20.9%,以及200%、50%和50%,清晰地阐明了CMMR-VLN作为基础VLN框架的巨大潜力。
cs.AI / 58 / 2603.08013

PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

PIRA-Bench:从反应式图形用户界面代理到基于图形用户界面的主动意图推荐代理的转变
Chai, Yuxiang, Tang, Shunye, Xiao, Han, Liu, Rui, Li, Hongsheng
Abstract
Current Graphical User Interface (GUI) agents operate primarily under a reactive paradigm: a user must provide an explicit instruction for the agent to execute a task. However, an intelligent AI assistant should be proactive, which is capable of anticipating user intentions directly from continuous visual inputs, such as mobile or desktop screenshots, and offering timely recommendations without explicit user prompting. Transitioning to this proactive paradigm presents significant challenges. Real-world screen activity is rarely linear; it consists of long-horizon trajectories fraught with noisy browsing, meaningless actions, and multithreaded task-switching. To address this gap, we introduce PIRA-Bench (Proactive Intent Recommendation Agent Benchmark), a novel benchmark for evaluating multimodal large language models (MLLMs) on continuous, weakly-supervised visual inputs. Unlike reactive datasets, PIRA-Bench features complex trajectories with multiple interleaved intents and noisy segments with various user profile contexts, challenging agents to detect actionable events while fitting to user preferences. Furthermore, we propose the PIRF baseline, a memory-aware, state-tracking framework that empowers general MLLMs to manage multiple task threads and handle misleading visual inputs. PIRA-Bench serves as an initial step toward robust and proactive GUI-based personal assistants.
Chinese Translation
当前的图形用户界面(GUI)代理主要在反应式范式下运行:用户必须提供明确的指令,以便代理执行任务。然而,一个智能的人工智能助手应该是主动的,能够直接从持续的视觉输入(如移动设备或桌面截图)中预测用户意图,并在没有明确用户提示的情况下提供及时的推荐。向这种主动范式的转变面临重大挑战。现实世界的屏幕活动很少是线性的;它由长时间跨度的轨迹组成,充满了嘈杂的浏览、无意义的操作和多线程的任务切换。为了解决这一问题,我们引入了PIRA-Bench(主动意图推荐代理基准),这是一个用于评估多模态大型语言模型(MLLMs)在连续、弱监督视觉输入上的新基准。与反应式数据集不同,PIRA-Bench具有复杂的轨迹,包含多个交错的意图和各种用户档案上下文中的嘈杂片段,挑战代理检测可操作事件,同时适应用户偏好。此外,我们提出了PIRF基线,这是一个记忆感知的状态跟踪框架,使通用的MLLM能够管理多个任务线程并处理误导性的视觉输入。PIRA-Bench是朝着强大且主动的基于GUI的个人助手迈出的第一步。
cs.AI / 59 / 2603.08035

CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling

CDRRM:基于对比驱动的可靠且可解释的奖励建模评分标准生成
Liu, Dengcan, Yang, Fengkai, Wang, Xiaohan, Yan, Shurui, Chai, Jiajun, Li, Jiahao, Ban, Yikun, Mao, Zhendong, Lin, Wei, Yin, Guojun
Abstract
Reward modeling is essential for aligning Large Language Models (LLMs) with human preferences, yet conventional reward models suffer from poor interpretability and heavy reliance on costly expert annotations. While recent rubric-based approaches enhance evaluation transparency, they lack systematic quality control, yielding noisy and redundant criteria, failing to mitigate persistent biases (e.g., verbosity, position) in LLM evaluators, and creating a scalability-reliability trade-off. To address these limitations, we propose CDRRM (Contrast-Driven Rubric Reward Model), a framework built on a novel Contrast-then-Synthesis paradigm for high-quality rubric generation and guided preference judgment. CDRRM first conducts multi-dimensional contrastive profiling on preference pairs to identify causal discriminative factors, then synthesizes these insights into compact, context-aware rubrics to guide preference judgments. Extensive experiments on three authoritative benchmarks (RewardBench, RMBench, RMB) demonstrate that CDRRM achieves state-of-the-art performance across diverse domains and effectively mitigates aforementioned evaluation biases. Notably, our approach delivers exceptional data efficiency: training the rubric generator on only 3k high-quality samples empowers a frozen pre-trained judge model to outperform fully fine-tuned baselines. This work offers a scalable, interpretable, and data-efficient path for reward modeling.
Chinese Translation
奖励建模对于将大型语言模型(LLMs)与人类偏好对齐至关重要,但传统的奖励模型在可解释性方面表现不佳,并且过度依赖昂贵的专家注释。尽管最近的基于评分标准的方法增强了评估透明度,但它们缺乏系统的质量控制,导致标准噪声和冗余,未能缓解LLM评估者中持续存在的偏见(例如,冗长性、位置),并造成可扩展性与可靠性之间的权衡。为了解决这些局限性,我们提出了CDRRM(对比驱动评分标准奖励模型),这是一个基于新颖的对比-再合成范式构建的框架,用于高质量评分标准生成和引导偏好判断。CDRRM首先对偏好对进行多维度的对比分析,以识别因果区分因素,然后将这些见解综合成紧凑的、上下文感知的评分标准,以指导偏好判断。在三个权威基准(RewardBench、RMBench、RMB)上的广泛实验表明,CDRRM在不同领域中实现了最先进的性能,并有效缓解了上述评估偏见。值得注意的是,我们的方法在数据效率方面表现卓越:仅在3000个高质量样本上训练评分标准生成器,就使得一个冻结的预训练评判模型超越了完全微调的基线。这项工作为奖励建模提供了一条可扩展、可解释且数据高效的路径。
cs.AI / 60 / 2603.08048

S2S-FDD: Bridging Industrial Time Series and Natural Language for Explainable Zero-shot Fault Diagnosis

S2S-FDD:连接工业时间序列与自然语言以实现可解释的零样本故障诊断
Li, Baoxue, Zhao, Chunhui
Abstract
Fault diagnosis is critical for the safe operation of industrial systems. Conventional diagnosis models typically produce abstract outputs such as anomaly scores or fault categories, failing to answer critical operational questions like "Why" or "How to repair". While large language models (LLMs) offer strong generalization and reasoning abilities, their training on discrete textual corpora creates a semantic gap when processing high-dimensional, temporal industrial signals. To address this challenge, we propose a Signals-to-Semantics fault diagnosis (S2S-FDD) framework that bridges high-dimensional sensor signals with natural language semantics through two key innovations: We first design a Signal-to-Semantic operator to convert abstract time-series signals into natural language summaries, capturing trends, periodicity, and deviations. Based on the descriptions, we design a multi-turn tree-structured diagnosis method to perform fault diagnosis by referencing historical maintenance documents and dynamically querying additional signals. The framework further supports human-in-the-loop feedback for continuous refinement. Experiments on the multiphase flow process show the feasibility and effectiveness of the proposed method for explainable zero-shot fault diagnosis.
Chinese Translation
故障诊断对于工业系统的安全运行至关重要。传统的诊断模型通常产生抽象输出,如异常分数或故障类别,无法回答关键的操作问题,如“为什么”或“如何修复”。尽管大型语言模型(LLMs)具有强大的泛化和推理能力,但它们在离散文本语料库上的训练在处理高维、时间序列的工业信号时产生了语义鸿沟。为了解决这一挑战,我们提出了一种信号到语义故障诊断框架(S2S-FDD),通过两个关键创新将高维传感器信号与自然语言语义连接起来:首先,我们设计了一个信号到语义操作符,将抽象的时间序列信号转换为自然语言摘要,捕捉趋势、周期性和偏差。基于这些描述,我们设计了一种多轮树状结构的诊断方法,通过参考历史维护文档和动态查询额外信号来进行故障诊断。该框架进一步支持人机交互反馈,以实现持续优化。在多相流过程上的实验表明,所提方法在可解释的零样本故障诊断中具有可行性和有效性。
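
A minimal numpy sketch of a Signal-to-Semantic operator in the spirit described (trend via a fitted slope, periodicity via an FFT peak, deviations via a 3-sigma rule); all thresholds here are illustrative assumptions, not the paper's:

import numpy as np

def signal_to_semantics(x, fs=1.0):
    """Summarize a 1-D sensor series as trend, periodicity, deviation."""
    t = np.arange(len(x))
    slope = np.polyfit(t, x, 1)[0]
    trend = ("rising" if slope > 1e-3
             else "falling" if slope < -1e-3 else "flat")

    # Crude periodicity test: dominant nonzero FFT bin vs. mean power.
    spec = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    k = spec[1:].argmax() + 1
    periodic = spec[k] > 3 * spec[1:].mean()

    # Count 3-sigma excursions as "deviations".
    z = np.abs((x - x.mean()) / (x.std() + 1e-8))
    n_dev = int((z > 3).sum())

    period_txt = (f", with a dominant period of {1 / freqs[k]:.1f} samples"
                  if periodic else "")
    return f"Signal is {trend}{period_txt}; {n_dev} samples deviate beyond 3 sigma."
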
cs.AI / 61 / 2603.08068

In-Context Reinforcement Learning for Tool Use in Large Language Models

大语言模型中工具使用的上下文强化学习
Ye, Yaoqi, Zhao, Yiran, Duan, Keyu, Zheng, Zeyu, Kawaguchi, Kenji, Xie, Cihang, Shieh, Michael Qizhe
Abstract
While large language models (LLMs) exhibit strong reasoning abilities, their performance on complex tasks is often constrained by the limitations of their internal knowledge. A compelling approach to overcome this challenge is to augment these models with external tools -- such as Python interpreters for mathematical computations or search engines for retrieving factual information. However, enabling models to use these tools effectively remains a significant challenge. Existing methods typically rely on cold-start pipelines that begin with supervised fine-tuning (SFT), followed by reinforcement learning (RL). These approaches often require substantial amounts of labeled data for SFT, which is expensive to annotate or synthesize. In this work, we propose In-Context Reinforcement Learning (ICRL), an RL-only framework that eliminates the need for SFT by leveraging few-shot prompting during the rollout stage of RL. Specifically, ICRL introduces in-context examples within the rollout prompts to teach the model how to invoke external tools. Furthermore, as training progresses, the number of in-context examples is gradually reduced, eventually reaching a zero-shot setting where the model learns to call tools independently. We conduct extensive experiments across a range of reasoning and tool-use benchmarks. Results show that ICRL achieves state-of-the-art performance, demonstrating its effectiveness as a scalable, data-efficient alternative to traditional SFT-based pipelines.
Chinese Translation
尽管大语言模型(LLMs)展现出强大的推理能力,但它们在复杂任务上的表现常常受到内部知识局限的制约。克服这一挑战的一个有效方法是通过外部工具来增强这些模型——例如,使用Python解释器进行数学计算或使用搜索引擎检索事实信息。然而,使模型有效地使用这些工具仍然是一个重大挑战。现有的方法通常依赖于冷启动管道,首先进行监督微调(SFT),然后再进行强化学习(RL)。这些方法通常需要大量标注数据用于SFT,而标注或合成这些数据的成本较高。在本研究中,我们提出了上下文强化学习(ICRL),这是一种仅基于强化学习的框架,通过在RL的推演(rollout)阶段利用少量示例提示,消除了对SFT的需求。具体而言,ICRL在推演提示中引入上下文示例,以教会模型如何调用外部工具。此外,随着训练的进行,上下文示例的数量逐渐减少,最终达到零样本设置,使模型能够独立学习调用工具。我们在一系列推理和工具使用基准上进行了广泛的实验。结果表明,ICRL实现了最先进的性能,证明了其作为一种可扩展、数据高效的替代传统SFT管道的有效性。
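
A minimal sketch of the decaying in-context schedule; the linear decay and starting count are assumptions, since the abstract only states that examples are gradually reduced until the zero-shot setting is reached:

def num_icl_examples(step, total_steps, k0=4):
    """Illustrative schedule: start rollouts with k0 in-context tool-use
    demonstrations and decay linearly to zero-shot by end of training."""
    frac = 1.0 - step / max(1, total_steps)
    return round(k0 * frac)

def build_rollout_prompt(question, demos, step, total_steps):
    # demos: list of demonstration strings; once k == 0 the prompt is
    # exactly the zero-shot question, so no SFT stage is needed.
    k = num_icl_examples(step, total_steps)
    return "\n\n".join(demos[:k] + [question])
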
cs.AI / 62 / 2603.08117

UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking

UIS-Digger:面向现实世界未索引信息检索的综合研究代理系统
Liu, Chang, Kuang, Chuqiao, Zhuang, Tianyi, Cheng, Yuxin, Zhou, Huichi, Li, Xiaoguang, Shang, Lifeng
Abstract
Recent advancements in LLM-based information-seeking agents have achieved record-breaking performance on established benchmarks. However, these agents remain heavily reliant on search-engine-indexed knowledge, leaving a critical blind spot: Unindexed Information Seeking (UIS). This paper identifies and explores the UIS problem, where vital information is not captured by search engine crawlers, such as overlooked content, dynamic webpages, and embedded files. Despite its significance, UIS remains an underexplored challenge. To address this gap, we introduce UIS-QA, the first dedicated UIS benchmark, comprising 110 expert-annotated QA pairs. Notably, even state-of-the-art agents experience a drastic performance drop on UIS-QA (e.g., from 70.90 on GAIA and 46.70 on BrowseComp-zh to 24.55 on UIS-QA), underscoring the severity of the problem. To mitigate this, we propose UIS-Digger, a novel multi-agent framework that incorporates dual-mode browsing and enables simultaneous webpage searching and file parsing. With a relatively small $\sim$30B-parameter backbone LLM optimized using SFT and RFT training strategies, UIS-Digger sets a strong baseline at 27.27\%, outperforming systems integrating sophisticated LLMs such as O3 and GPT-4.1. This demonstrates the importance of proactive interaction with unindexed sources for effective and comprehensive information-seeking. Our work not only uncovers a fundamental limitation in current agent evaluation paradigms but also provides the first toolkit for advancing UIS research, defining a new and promising direction for robust information-seeking systems.
Chinese Translation
近期基于大型语言模型(LLM)的信息检索代理在既定基准测试中取得了创纪录的表现。然而,这些代理仍然严重依赖于搜索引擎索引的知识,留下了一个关键的盲点:未索引信息检索(UIS)。本文识别并探讨了UIS问题,即重要信息未被搜索引擎爬虫捕获,例如被忽视的内容、动态网页和嵌入文件。尽管其重要性,UIS仍然是一个未被充分探索的挑战。为了解决这一空白,我们引入了UIS-QA,这是第一个专门针对UIS的基准,包含110个专家标注的问答对。值得注意的是,即使是最先进的代理在UIS-QA上的表现也大幅下降(例如,从GAIA的70.90和BrowseComp-zh的46.70下降到UIS-QA的24.55),突显了这一问题的严重性。为此,我们提出了UIS-Digger,一个新颖的多代理框架,结合双模式浏览,能够同时进行网页搜索和文件解析。UIS-Digger使用约30B参数的主干LLM,并通过监督微调(SFT)和强化微调(RFT)训练策略进行优化,设定了27.27%的强基线,超越了集成复杂LLM(如O3和GPT-4.1)的系统。这表明,主动与未索引来源互动对于有效和全面的信息检索至关重要。我们的工作不仅揭示了当前代理评估范式中的一个基本限制,还提供了推进UIS研究的第一个工具包,为稳健的信息检索系统定义了一个新的、有前景的方向。
cs.AI / 63 / 2603.08171

Evidence-Driven Reasoning for Industrial Maintenance Using Heterogeneous Data

基于证据的工业维护推理方法:利用异构数据
O'Donncha, Fearghal, Zhou, Nianjun, Martinez, Natalia, Rayfield, James T, Heath III, Fenno F., Langbridge, Abigail, Vaculin, Roman
Abstract
Industrial maintenance platforms contain rich but fragmented evidence, including free-text work orders, heterogeneous operational sensors or indicators, and structured failure knowledge. These sources are often analyzed in isolation, producing alerts or forecasts that do not support conditional decision-making: given this asset history and behavior, what is happening and what action is warranted? We present Condition Insight Agent, a deployed decision-support framework that integrates maintenance language, behavioral abstractions of operational data, and engineering failure semantics to produce evidence-grounded explanations and advisory actions. The system constrains reasoning through deterministic evidence construction and structured failure knowledge, and applies a rule-based verification loop to suppress unsupported conclusions. Case studies from production CMMS deployments show that this verification-first design operates reliably under heterogeneous and incomplete data while preserving human oversight. Our results demonstrate how constrained LLM-based reasoning can function as a governed decision-support layer for industrial maintenance.
Chinese Translation
工业维护平台包含丰富但碎片化的证据,包括自由文本的工作订单、异构的操作传感器或指标,以及结构化的故障知识。这些来源通常是孤立分析的,产生的警报或预测无法支持条件决策:在考虑到该资产的历史和行为的情况下,发生了什么,以及应采取什么行动?我们提出了条件洞察代理(Condition Insight Agent),这是一个已部署的决策支持框架,整合了维护语言、操作数据的行为抽象和工程故障语义,以生成基于证据的解释和建议行动。该系统通过确定性证据构建和结构化故障知识来约束推理,并应用基于规则的验证循环以抑制不支持的结论。来自生产计算机化维护管理系统(CMMS)部署的案例研究表明,这种以验证为先的设计在异构和不完整数据下可靠运行,同时保持人类监督。我们的结果展示了受限的基于大型语言模型(LLM)的推理如何作为工业维护的受控决策支持层发挥作用。
cs.AI / 64 / 2603.08234

The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs

继续与拒绝之间的斗争:大语言模型中继续触发的越狱机制分析
Deng, Yonghong, Yang, Zhen, Jian, Ping, Zhang, Xinyue, Guo, Zhongbin, Li, Chengzhi
Abstract
With the rapid advancement of large language models (LLMs), the safety of LLMs has become a critical concern. Despite significant efforts in safety alignment, current LLMs remain vulnerable to jailbreaking attacks. However, the root causes of such vulnerabilities are still poorly understood, necessitating a rigorous investigation into jailbreak mechanisms across both academic and industrial communities. In this work, we focus on a continuation-triggered jailbreak phenomenon, whereby simply relocating a continuation-triggered instruction suffix can substantially increase jailbreak success rates. To uncover the intrinsic mechanisms of this phenomenon, we conduct a comprehensive mechanistic interpretability analysis at the level of attention heads. Through causal interventions and activation scaling, we show that this jailbreak behavior primarily arises from an inherent competition between the model's intrinsic continuation drive and the safety defenses acquired through alignment training. Furthermore, we perform a detailed behavioral analysis of the identified safety-critical attention heads, revealing notable differences in the functions and behaviors of safety heads across different model architectures. These findings provide a novel mechanistic perspective for understanding and interpreting jailbreak behaviors in LLMs, offering both theoretical insights and practical implications for improving model safety.
Chinese Translation
随着大语言模型(LLMs)的快速发展,LLMs的安全性已成为一个关键问题。尽管在安全对齐方面进行了大量努力,目前的LLMs仍然容易受到越狱攻击。然而,这些脆弱性的根本原因仍然不甚明了,因此需要对越狱机制进行严格的研究,涵盖学术界和工业界。在本研究中,我们关注一种继续触发的越狱现象,即简单地重新定位一个继续触发的指令后缀可以显著提高越狱成功率。为了揭示这一现象的内在机制,我们在注意力头的层面上进行了全面的机制可解释性分析。通过因果干预和激活缩放,我们表明这种越狱行为主要源于模型内在的继续驱动与通过对齐训练获得的安全防御之间的固有竞争。此外,我们对识别出的安全关键注意力头进行了详细的行为分析,揭示了不同模型架构中安全头的功能和行为的显著差异。这些发现为理解和解释LLMs中的越狱行为提供了新的机制视角,为提高模型安全性提供了理论见解和实际启示。
cs.AI / 65 / 2603.08262

FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use

FinToolBench:评估大型语言模型代理在现实世界金融工具使用中的表现
Lu, Jiaxuan, Wang, Kong, Wang, Yemin, Tang, Qingmei, Zeng, Hongwei, Chen, Xiang, Pi, Jiahao, Deng, Shujian, Chen, Lingzhi, Fu, Yi, Yang, Kehua, Sun, Xiao
Abstract
The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general-purpose tool learning has witnessed a surge in benchmarks, the financial sector, characterized by high stakes, strict compliance, and rapid data volatility, remains critically underserved. Existing financial evaluations predominantly focus on static textual analysis or document-based QA, ignoring the complex reality of tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real-world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior works limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem coupling 760 executable financial tools with 295 rigorous, tool-required queries. We propose a novel evaluation framework that goes beyond binary execution success, assessing agents on finance-critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance-aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open-sourced to facilitate future research.
Chinese Translation
大型语言模型(LLMs)在金融领域的整合正在推动从被动信息检索向动态代理交互的范式转变。尽管通用工具学习的基准测试数量激增,但金融行业由于其高风险、严格合规和快速数据波动的特征,仍然严重缺乏相关评估。现有的金融评估主要集中在静态文本分析或基于文档的问答,忽视了工具执行的复杂现实。相反,通用工具基准缺乏金融领域所需的特定严谨性,通常依赖于玩具环境或极少量的金融API。为填补这一空白,我们推出了FinToolBench,这是第一个专注于评估金融工具学习代理的现实世界可运行基准。与以往仅限于少数模拟工具的研究不同,FinToolBench建立了一个现实的生态系统,将760个可执行的金融工具与295个严格的、工具所需的查询相结合。我们提出了一种新的评估框架,超越了二元执行成功的标准,从及时性、意图类型和合规领域对代理进行评估。此外,我们还提出了FATR,这是一个金融感知的工具检索和推理基线,增强了稳定性和合规性。通过提供第一个可审计的代理金融执行测试平台,FinToolBench为金融领域的可信AI设定了新的标准。工具清单、执行环境和评估代码将开源,以促进未来的研究。
cs.AI / 66 / 2603.08267

Towards a more efficient bias detection in financial language models

朝着更高效的金融语言模型偏见检测
Kacem, Firas Hadj, Khanfir, Ahmed, Papadakis, Mike
Abstract
Bias in financial language models constitutes a major obstacle to their adoption in real-world applications. Detecting such bias is challenging, as it requires identifying inputs whose predictions change when varying properties unrelated to the decision, such as demographic attributes. Existing approaches typically rely on exhaustive mutation and pairwise prediction analysis over large corpora, which is effective but computationally expensive, particularly for large language models, and can become impractical in continuous retraining and releasing processes. Aiming at reducing this cost, we conduct a large-scale study of bias in five financial language models, examining similarities in their bias tendencies across protected attributes and exploring cross-model-guided bias detection to identify bias-revealing inputs earlier. Our study uses approximately 17k real financial news sentences, mutated to construct over 125k original-mutant pairs. Results show that all models exhibit bias under both atomic (0.58\%-6.05\%) and intersectional (0.75\%-5.97\%) settings. Moreover, we observe consistent patterns in bias-revealing inputs across models, enabling substantial reuse and cost reduction in bias detection. For example, up to 73\% of FinMA's biased behaviours can be uncovered using only 20\% of the input pairs when guided by properties derived from DistilRoBERTa outputs.
Chinese Translation
金融语言模型中的偏见构成了其在实际应用中采用的主要障碍。检测这种偏见具有挑战性,因为它需要识别那些在变化与决策无关的属性(如人口统计特征)时预测结果发生变化的输入。现有的方法通常依赖于对大规模语料库进行详尽的变异和成对预测分析,这种方法虽然有效,但计算成本高昂,尤其对于大型语言模型而言,在持续的再训练和发布过程中可能变得不切实际。为了降低这一成本,我们对五个金融语言模型中的偏见进行了大规模研究,考察了它们在受保护属性上的偏见倾向的相似性,并探索了跨模型引导的偏见检测,以便更早地识别出揭示偏见的输入。我们的研究使用了大约17,000个真实的金融新闻句子,通过变异构建了超过125,000个原始-变异对。结果表明,所有模型在原子(0.58%-6.05%)和交叉(0.75%-5.97%)设置下均表现出偏见。此外,我们观察到模型之间在揭示偏见的输入上存在一致的模式,这使得偏见检测中的显著重用和成本降低成为可能。例如,当使用来自DistilRoBERTa输出的属性引导时,最多可以通过仅使用20%的输入对来揭示FinMA的偏见行为的73%。
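
A schematic of cross-model-guided pair prioritization under a fixed budget; the model stubs, mutation function, and ranking rule below are illustrative assumptions, not the paper's pipeline:

def find_bias_revealing_pairs(sentences, mutate, cheap_model, target_model,
                              budget_frac=0.2):
    """Sketch: mutate a protected attribute in each sentence, rank pairs
    by whether a cheap model's prediction flips, and spend the expensive
    target model's budget on the most suspicious pairs first."""
    pairs = [(s, mutate(s)) for s in sentences]
    # A flip under the cheap (DistilRoBERTa-style) model marks a pair
    # suspicious; sorted() puts flips (key False) first.
    suspicious = sorted(pairs,
                        key=lambda p: cheap_model(p[0]) == cheap_model(p[1]))
    budget = int(budget_frac * len(pairs))
    return [(o, m) for o, m in suspicious[:budget]
            if target_model(o) != target_model(m)]

This mirrors the reported effect: shared bias patterns across models let a small, well-chosen fraction of pairs uncover most of a target model's biased behaviours.
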
cs.AI / 67 / 2603.08291

Deconstructing Multimodal Mathematical Reasoning: Towards a Unified Perception-Alignment-Reasoning Paradigm

解构多模态数学推理:迈向统一的感知-对齐-推理范式
Yang, Tianyu, Wu, Sihong, Zhao, Yilun, Liang, Zhenwen, Dai, Lisen, Zhao, Chen, Cheng, Minhao, Cohan, Arman, Zhang, Xiangliang
Abstract
Multimodal Mathematical Reasoning (MMR) has recently attracted increasing attention for its capability to solve mathematical problems that involve both textual and visual modalities. However, current models still face significant challenges in real-world visual math tasks. They often misinterpret diagrams, fail to align mathematical symbols with visual evidence, and produce inconsistent reasoning steps. Moreover, existing evaluations mainly focus on checking final answers rather than verifying the correctness or executability of each intermediate step. To address these limitations, a growing body of recent research addresses these issues by integrating structured perception, explicit alignment, and verifiable reasoning within unified frameworks. To establish a clear roadmap for understanding and comparing different MMR approaches, we systematically study them around four fundamental questions: (1) What to extract from multimodal inputs, (2) How to represent and align textual and visual information, (3) How to perform the reasoning, and (4) How to evaluate the correctness of the overall reasoning process. Finally, we discuss open challenges and offer perspectives on promising directions for future research.
Chinese Translation
多模态数学推理(Multimodal Mathematical Reasoning, MMR)近年来因其解决涉及文本和视觉模态的数学问题的能力而受到越来越多的关注。然而,当前的模型在现实世界的视觉数学任务中仍面临重大挑战。它们常常误解图表,未能将数学符号与视觉证据对齐,并产生不一致的推理步骤。此外,现有的评估主要集中在检查最终答案,而非验证每个中间步骤的正确性或可执行性。为了解决这些局限性,越来越多的研究通过在统一框架内整合结构化感知、明确对齐和可验证推理来应对这些问题。为了建立一个清晰的路线图,以理解和比较不同的MMR方法,我们围绕四个基本问题系统地研究它们:(1)从多模态输入中提取什么,(2)如何表示和对齐文本与视觉信息,(3)如何进行推理,以及(4)如何评估整体推理过程的正确性。最后,我们讨论了开放性挑战,并对未来研究的有前景方向提出了看法。
cs.AI / 68 / 2603.08321

CORE-Acu: Structured Reasoning Traces and Knowledge Graph Safety Verification for Acupuncture Clinical Decision Support

CORE-Acu:针灸临床决策支持的结构化推理痕迹与知识图谱安全验证
Xu, Liuyi, Guo, Yun, Chen, Ming, Dun, Zihan, Qian, Yining, Lu, An-Yang, Li, Shuang, Liu, Lijun
Abstract
Large language models (LLMs) show significant potential for clinical decision support (CDS), yet their black-box nature -- characterized by untraceable reasoning and probabilistic hallucinations -- poses severe challenges in acupuncture, a field demanding rigorous interpretability and safety. To address this, we propose CORE-Acu, a neuro-symbolic framework for acupuncture clinical decision support that integrates Structured Chain-of-Thought (S-CoT) with knowledge graph (KG) safety verification. First, we construct the first acupuncture Structured Reasoning Trace dataset and a schema-constrained fine-tuning framework. By enforcing an explicit causal chain from pattern identification to treatment principles, treatment plans, and acupoint selection, we transform implicit Traditional Chinese Medicine (TCM) reasoning into interpretable generation constraints, mitigating the opacity of LLM-based CDS. Furthermore, we construct a TCM safety knowledge graph and establish a ``Generate--Verify--Revise'' closed-loop inference system based on a Symbolic Veto Mechanism, employing deterministic rules to intercept hallucinations and enforce hard safety boundaries. Finally, we introduce the Lexicon-Matched Entity-Reweighted Loss (LMERL), which corrects terminology drift caused by the frequency--importance mismatch in general optimization by adaptively amplifying gradient contributions of high-risk entities during fine-tuning. Experiments on 1,000 held-out cases demonstrate CORE-Acu's superior entity fidelity and reasoning quality. Crucially, CORE-Acu achieved 0/1,000 observed safety violations (95\% CI: 0--0.37\%), whereas GPT-4o exhibited an 8.5\% violation rate under identical rules. These results establish CORE-Acu as a robust neuro-symbolic framework for acupuncture clinical decision support, guaranteeing both reasoning auditability and strict safety compliance.
Chinese Translation
大型语言模型(LLMs)在临床决策支持(CDS)方面展现出显著潜力,但其黑箱特性——表现为不可追踪的推理和概率性幻觉——在要求严格可解释性和安全性的针灸领域带来了严重挑战。为此,我们提出了CORE-Acu,一个用于针灸临床决策支持的神经符号框架,结合了结构化思维链(Structured Chain-of-Thought, S-CoT)与知识图谱(Knowledge Graph, KG)安全验证。首先,我们构建了第一个针灸结构化推理痕迹数据集和一个模式约束的微调框架。通过强制从模式识别到治疗原则、治疗方案和腧穴选择的明确因果链,我们将隐式的传统中医学(Traditional Chinese Medicine, TCM)推理转化为可解释的生成约束,从而减轻基于LLM的CDS的模糊性。此外,我们构建了一个TCM安全知识图谱,并基于符号否决机制建立了一个“生成-验证-修正”的闭环推理系统,采用确定性规则拦截幻觉并强制执行严格的安全边界。最后,我们引入了词汇匹配实体重加权损失(Lexicon-Matched Entity-Reweighted Loss, LMERL),通过在微调过程中自适应放大高风险实体的梯度贡献,纠正由频率-重要性不匹配引起的术语漂移。在1,000个保留案例上的实验表明,CORE-Acu在实体保真度和推理质量方面表现优越。重要的是,CORE-Acu在观察到的安全违规中实现了0/1,000的记录(95%置信区间:0-0.37%),而GPT-4o在相同规则下的违规率为8.5%。这些结果确立了CORE-Acu作为一个强大的神经符号框架,用于针灸临床决策支持,确保了推理的可审计性和严格的安全合规性。
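
A minimal sketch of an entity-reweighted token loss consistent with the LMERL description; the boost factor and masking scheme are assumptions, not the paper's exact formulation:

import torch
import torch.nn.functional as F

def entity_reweighted_loss(logits, targets, entity_mask, boost=4.0):
    """Amplify the gradient contribution of tokens lexicon-matched to
    high-risk entities (entity_mask == 1), leaving other tokens at
    weight 1, to counter the frequency-importance mismatch.
    logits: (batch, seq, vocab); targets, entity_mask: (batch, seq)."""
    per_tok = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )                                               # (batch, seq)
    weights = 1.0 + (boost - 1.0) * entity_mask.float()
    return (weights * per_tok).sum() / weights.sum()
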
cs.AI / 69 / 2603.08322

Agentic Neurosymbolic Collaboration for Mathematical Discovery: A Case Study in Combinatorial Design

面向数学发现的代理神经符号协作:组合设计的案例研究
Xia, Hai, Gomes, Carla P., Selman, Bart, Szeider, Stefan
Abstract
We study mathematical discovery through the lens of neurosymbolic reasoning, where an AI agent powered by a large language model (LLM), coupled with symbolic computation tools, and human strategic direction, jointly produced a new result in combinatorial design theory. The main result of this human-AI collaboration is a tight lower bound on the imbalance of Latin squares for the notoriously difficult case $n \equiv 1 \pmod{3}$. We reconstruct the discovery process from detailed interaction logs spanning multiple sessions over several days and identify the distinct cognitive contributions of each component. The AI agent proved effective at uncovering hidden structure and generating hypotheses. The symbolic component consists of computer algebra, constraint solvers, and simulated annealing, which provides rigorous verification and exhaustive enumeration. Human steering supplied the critical research pivot that transformed a dead end into a productive inquiry. Our analysis reveals that multi-model deliberation among frontier LLMs proved reliable for criticism and error detection but unreliable for constructive claims. The resulting human-AI mathematical contribution, a tight lower bound of $4n(n{-}1)/9$, is achieved via a novel class of near-perfect permutations. The bound was formally verified in Lean 4. Our experiments show that neurosymbolic systems can indeed produce genuine discoveries in pure mathematics.
Chinese Translation
我们通过神经符号推理的视角研究数学发现,其中一个由大型语言模型(LLM)驱动的人工智能代理,结合符号计算工具和人类战略指导,共同在组合设计理论中产生了新的结果。这一人机协作的主要成果是对拉丁方阵不平衡性的紧下界,特别是在著名的困难案例 $n \equiv 1 \pmod{3}$ 下。我们从跨越多个会话和几天的详细互动日志中重建了发现过程,并识别出每个组件的独特认知贡献。人工智能代理在揭示隐藏结构和生成假设方面表现出色。符号组件包括计算代数、约束求解器和模拟退火,提供了严格的验证和全面的枚举。人类的引导提供了关键的研究转折点,将死胡同转变为富有成效的探究。我们的分析表明,前沿 LLM 之间的多模型审议在批评和错误检测方面表现可靠,但在建设性主张方面则不够可靠。最终的人机数学贡献是通过一种新型的近乎完美的排列实现的紧下界 $4n(n{-}1)/9$,该下界在 Lean 4 中得到了正式验证。我们的实验表明,神经符号系统确实能够在纯数学中产生真正的发现。
cs.AI / 70 / 2603.08369

M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

M$^3$-ACE:通过多智能体上下文工程纠正多模态数学推理中的视觉感知
Xie, Peijin, Xu, Zhen, Liu, Bingquan, Wang, Baoxun
Abstract
Multimodal large language models have recently shown promising progress in visual mathematical reasoning. However, their performance is often limited by a critical yet underexplored bottleneck: inaccurate visual perception. Through systematic analysis, we find that most failures originate from incorrect or incomplete visual evidence extraction rather than deficiencies in reasoning capability. Moreover, models tend to remain overly confident in their initial perceptions, making standard strategies such as prompt engineering, multi-round self-reflection, or posterior guidance insufficient to reliably correct errors. To address this limitation, we propose M3-ACE, a multi-agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. Instead of directly aggregating final answers, our approach decouples perception and reasoning by dynamically maintaining a shared context centered on visual evidence lists. Multiple agents collaboratively contribute complementary observations, enabling the system to expose inconsistencies and recover missing perceptual information. To support stable multi-turn collaboration, we further introduce two lightweight tools: a Summary Tool that organizes evidence from different agents into consistent, complementary, and conflicting components, and a Refine Tool that filters unreliable samples and guides iterative correction. Extensive experiments demonstrate that M3-ACE substantially improves visual mathematical reasoning performance across multiple benchmarks. Our method establishes a new state-of-the-art result of 89.1 on the MathVision benchmark and achieves consistent improvements on other related datasets, including MathVista and MathVerse. These results highlight the importance of perception-centric multi-agent collaboration for advancing multimodal reasoning systems.
Chinese Translation
多模态大型语言模型最近在视觉数学推理方面显示出令人鼓舞的进展。然而,它们的性能往往受到一个关键但未被充分探索的瓶颈的限制:不准确的视觉感知。通过系统分析,我们发现大多数失败源于不正确或不完整的视觉证据提取,而不是推理能力的缺陷。此外,模型往往对其初始感知过于自信,使得标准策略如提示工程、多轮自我反思或后验指导不足以可靠地纠正错误。为了解决这一局限性,我们提出了M3-ACE,一个旨在纠正多模态数学推理中视觉感知的多智能体上下文工程框架。我们的方法不是直接聚合最终答案,而是通过动态维护一个以视觉证据列表为中心的共享上下文,将感知和推理解耦。多个智能体协同贡献互补的观察,使系统能够揭示不一致性并恢复缺失的感知信息。为了支持稳定的多轮协作,我们进一步引入了两个轻量级工具:一个总结工具(Summary Tool),将来自不同智能体的证据组织成一致、互补和冲突的组件;一个精炼工具(Refine Tool),过滤不可靠的样本并指导迭代修正。大量实验表明,M3-ACE在多个基准测试中显著提高了视觉数学推理的性能。我们的方法在MathVision基准上取得了89.1的新的最先进(state-of-the-art)成绩,并在其他相关数据集(包括MathVista和MathVerse)上实现了一致的改进。这些结果突显了以感知为中心的多智能体协作在推进多模态推理系统中的重要性。
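
A toy rendering of the Summary Tool's three-way split of pooled evidence; the conflict predicate is a stand-in, since the abstract does not specify how contradictions are detected:

from collections import Counter

def summarize_evidence(agent_evidence, conflicts_with):
    """Pool visual-evidence items from all agents and split them into
    consistent (reported by several agents), complementary (unique to
    one agent), and conflicting (contradicted by another item)."""
    all_items = [e for ev in agent_evidence.values() for e in ev]
    counts = Counter(all_items)
    consistent = [e for e, c in counts.items() if c > 1]
    complementary = [e for e, c in counts.items() if c == 1]
    conflicting = [e for e in counts
                   if any(conflicts_with(e, o) for o in counts if o != e)]
    return {"consistent": consistent,
            "complementary": complementary,
            "conflicting": conflicting}
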
cs.AI / 71 / 2603.08388

A Hierarchical Error-Corrective Graph Framework for Autonomous Agents with LLM-Based Action Generation

基于层次错误纠正图框架的自主智能体与基于LLM的行动生成
Cao, Cong, Zhang, Jingyao, Tong, Kun
Abstract
We propose a Hierarchical Error-Corrective Graph Framework for Autonomous Agents with LLM-Based Action Generation (HECG), which incorporates three core innovations: (1) Multi-Dimensional Transferable Strategy (MDTS): by integrating task quality metrics (Q), confidence/cost metrics (C), reward metrics (R), and LLM-based semantic reasoning scores (LLM-Score), MDTS achieves multi-dimensional alignment between quantitative performance and semantic context, enabling more precise selection of high-quality candidate strategies and effectively reducing the risk of negative transfer. (2) Error Matrix Classification (EMC): unlike simple confusion matrices or overall performance metrics, EMC provides structured attribution of task failures by categorizing errors into ten types, such as Strategy Errors (Strategy-Error) and Script Parsing Errors (Script-Parsing-Error), and decomposing them according to severity, typical actions, error descriptions, and recoverability. This allows precise analysis of the root causes of task failures, offering clear guidance for subsequent error correction and strategy optimization rather than relying solely on overall success rates or single performance metrics. (3) Causal-Context Graph Retrieval (CCGR): to enhance agent retrieval capabilities in dynamic task environments, we construct graphs from historical states, actions, and event sequences, where nodes store executed actions, next-step actions, execution states, transferable strategies, and other relevant information, and edges represent causal dependencies such as preconditions for transitions between nodes. CCGR identifies subgraphs most relevant to the current task context, effectively capturing structural relationships beyond vector similarity, allowing agents to fully leverage contextual information, accelerate strategy adaptation, and improve execution reliability in complex, multi-step tasks.
Chinese Translation
我们提出了一种层次错误纠正图框架(HECG),用于基于LLM的自主智能体行动生成,该框架结合了三项核心创新:(1)多维可转移策略(MDTS):通过整合任务质量指标(Q)、置信度/成本指标(C)、奖励指标(R)和基于LLM的语义推理分数(LLM-Score),MDTS实现了定量性能与语义上下文之间的多维对齐,从而能够更精确地选择高质量候选策略,并有效降低负迁移的风险。(2)错误矩阵分类(EMC):与简单的混淆矩阵或整体性能指标不同,EMC通过将错误分类为十种类型(如策略错误(Strategy Error)和脚本解析错误(Script Parsing Error)),并根据严重性、典型行为、错误描述和可恢复性进行分解,提供了任务失败的结构化归因。这使得能够精确分析任务失败的根本原因,为后续的错误纠正和策略优化提供明确的指导,而不仅仅依赖于整体成功率或单一性能指标。(3)因果上下文图检索(CCGR):为了增强智能体在动态任务环境中的检索能力,我们从历史状态、行动和事件序列构建图,其中节点存储已执行的行动、下一步行动、执行状态、可转移策略及其他相关信息,边表示因果依赖关系,如节点间转移的前提条件。CCGR识别与当前任务上下文最相关的子图,有效捕捉超越向量相似性的结构关系,使智能体能够充分利用上下文信息,加速策略适应,并提高在复杂多步骤任务中的执行可靠性。
cs.AI / 72 / 2603.08425

IronEngine: Towards General AI Assistant

IronEngine:迈向通用人工智能助手
Mo, Xi
Abstract
This paper presents IronEngine, a general AI assistant platform organized around a unified orchestration core that connects a desktop user interface, REST and WebSocket APIs, Python clients, local and cloud model backends, persistent memory, task scheduling, reusable skills, 24-category tool execution, MCP-compatible extensibility, and hardware-facing integration. IronEngine introduces a three-phase pipeline -- Discussion (Planner--Reviewer collaboration), Model Switch (VRAM-aware transition), and Execution (tool-augmented action loop) -- that separates planning quality from execution capability. The system features a hierarchical memory architecture with multi-level consolidation, a vectorized skill repository backed by ChromaDB, an adaptive model management layer supporting 92 model profiles with VRAM-aware context budgeting, and an intelligent tool routing system with 130+ alias normalization and automatic error correction. We present experimental results on file operation benchmarks achieving 100\% task completion with a mean total time of 1541 seconds across four heterogeneous tasks, and provide detailed comparisons with representative AI assistant systems including ChatGPT, Claude Desktop, Cursor, Windsurf, and open-source agent frameworks. Without disclosing proprietary prompts or core algorithms, this paper analyzes the platform's architectural decomposition, subsystem design, experimental performance, safety boundaries, and comparative engineering advantages. The resulting study positions IronEngine as a system-oriented foundation for general-purpose personal assistants, automation frameworks, and future human-centered agent platforms.
Chinese Translation
本文介绍了IronEngine,一个围绕统一编排核心组织的通用人工智能助手平台,该核心连接了桌面用户界面、REST和WebSocket API、Python客户端、本地和云模型后端、持久内存、任务调度、可重用技能、24类工具执行、与MCP兼容的可扩展性以及面向硬件的集成。IronEngine引入了一个三阶段管道——讨论(规划者与审阅者协作)、模型切换(考虑显存的过渡)和执行(工具增强的行动循环),该管道将规划质量与执行能力分离。该系统具有分层内存架构,支持多级整合,基于ChromaDB的向量化技能库,支持92个模型配置的自适应模型管理层,具备显存感知的上下文预算,以及一个智能工具路由系统,具备130多个别名归一化和自动错误修正功能。我们在文件操作基准测试中展示了实验结果,四个异构任务的任务完成率达到100%,平均总时间为1541秒,并与包括ChatGPT、Claude Desktop、Cursor、Windsurf和开源代理框架在内的代表性人工智能助手系统进行了详细比较。在不披露专有提示或核心算法的情况下,本文分析了该平台的架构分解、子系统设计、实验性能、安全边界和比较工程优势。研究结果将IronEngine定位为通用个人助手、自动化框架和未来以人为本的代理平台的系统导向基础。
cs.AI / 73 / 2603.08447

Efficient Policy Learning with Hybrid Evaluation-Based Genetic Programming for Uncertain Agile Earth Observation Satellite Scheduling

基于混合评估的遗传编程高效政策学习用于不确定的敏捷地球观测卫星调度
Xue, Junhua, Chen, Yuning
Abstract
The Uncertain Agile Earth Observation Satellite Scheduling Problem (UAEOSSP) is a novel combinatorial optimization problem and a practical engineering challenge that aligns with the current demands of space technology development. It incorporates uncertainties in profit, resource consumption, and visibility, which may render pre-planned schedules suboptimal or even infeasible. Genetic Programming Hyper-Heuristic (GPHH) shows promise for evolving interpretable scheduling policies; however, their simulation-based evaluation incurs high computational costs. Moreover, the design of the constructive method, denoted as Online Scheduling Algorithm (OSA), directly affects fitness assessment, resulting in evaluation-dependent local optima within the policy space. To address these issues, this paper proposes a Hybrid Evaluation-based Genetic Programming (HE-GP) for effectively solving UAEOSSP. A Hybrid Evaluation (HE) mechanism is integrated into the policy-driven OSA, combining exact and approximate filtering modes: exact mode ensures evaluation accuracy through elaborately designed constraint verification modules, while approximate mode reduces computational overhead via simplified logic. HE-GP dynamically switches between evaluation models based on real-time evolutionary state information. Experiments on 16 simulated instance sets demonstrate that HE-GP significantly outperforms handcrafted heuristics and single-evaluation based GPHH, achieving substantial reductions in computational cost while maintaining excellent scheduling performance across diverse scenarios. Specifically, the average training time of HE-GP was reduced by 17.77\% compared to GP employing exclusively exact evaluation, while the optimal policy generated by HE-GP achieved the highest average ranks across all scenarios.
Chinese Translation
不确定的敏捷地球观测卫星调度问题(UAEOSSP)是一种新颖的组合优化问题,也是一个与当前空间技术发展需求相符的实际工程挑战。该问题涉及利润、资源消耗和可见性等不确定性,这可能导致预先规划的调度方案变得次优甚至不可行。遗传编程超启发式(GPHH)在演化可解释的调度政策方面展现出良好的前景;然而,其基于模拟的评估会产生高昂的计算成本。此外,构造方法的设计,即在线调度算法(OSA),直接影响适应度评估,导致政策空间内依赖评估的局部最优解。为了解决这些问题,本文提出了一种基于混合评估的遗传编程(HE-GP),以有效解决UAEOSSP。将混合评估(HE)机制集成到以政策为驱动的OSA中,结合了精确和近似过滤模式:精确模式通过精心设计的约束验证模块确保评估的准确性,而近似模式则通过简化逻辑减少计算开销。HE-GP根据实时演化状态信息动态切换评估模型。在16个模拟实例集上的实验表明,HE-GP显著优于手工设计的启发式算法和基于单一评估的GPHH,在保持优秀调度性能的同时实现了计算成本的显著降低。具体而言,与仅采用精确评估的GP相比,HE-GP的平均训练时间减少了17.77\%,而HE-GP生成的最优政策在所有场景中达到了最高的平均排名。
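
A schematic of switching between exact and approximate fitness evaluation based on evolutionary state; the stagnation heuristic below is our assumption, standing in for the paper's real-time switching rule:

def evaluate_population(pop, exact_eval, approx_eval, best_history,
                        patience=5):
    """Sketch of hybrid evaluation: use the cheap approximate filter by
    default, and fall back to exact, constraint-verified evaluation when
    the best fitness has stagnated (no improvement over `patience`
    generations). best_history holds the best fitness per generation."""
    stagnated = (len(best_history) >= patience
                 and max(best_history[-patience:]) <= best_history[-patience])
    evaluator = exact_eval if stagnated else approx_eval
    return [evaluator(policy) for policy in pop]

The trade-off this sketches is exactly the one the abstract quantifies: cheaper approximate passes cut training time while exact passes keep the fitness signal trustworthy where it matters.
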
cs.AI / 74 / 2603.08455

The Boiling Frog Threshold: Criticality and Blindness in World Model-Based Anomaly Detection Under Gradual Drift

沸蛙阈值:在逐渐漂移下基于世界模型的异常检测中的临界性与盲目性
Hong, Zhe
Abstract
When an RL agent's observations are gradually corrupted, at what drift rate does it "wake up" -- and what determines this boundary? We study world model-based self-monitoring under continuous observation drift across four MuJoCo environments, three detector families (z-score, variance, percentile), and three model capacities. We find that (1) a sharp detection threshold $\varepsilon^*$ exists universally: below it, drift is absorbed as normal variation; above it, detection occurs rapidly. The threshold's existence and sigmoid shape are invariant across all detector families and model capacities, though its position depends on the interaction between detector sensitivity, noise floor structure, and environment dynamics. (2) Sinusoidal drift is completely undetectable by all detector families -- including variance and percentile detectors with no temporal smoothing -- establishing this as a world model property rather than a detector artifact. (3) Within each environment, $\varepsilon^*$ follows a power law in detector parameters ($R^2 = 0.89$-$0.97$), but cross-environment prediction fails ($R^2 = 0.45$), revealing that the missing variable is environment-specific dynamics structure $\partial \mathrm{PE}/\partial\varepsilon$. (4) In fragile environments, agents collapse before any detector can fire ("collapse before awareness"), creating a fundamentally unmonitorable failure mode. Our results reframe $\varepsilon^*$ from an emergent world model property to a three-way interaction between noise floor, detector, and environment dynamics, providing a more defensible and empirically grounded account of self-monitoring boundaries in RL agents.
Chinese Translation
当强化学习(RL)代理的观察逐渐受到干扰时,何种漂移速率会使其“觉醒”——这一边界又由什么决定?我们研究了在四个MuJoCo环境中,基于世界模型的自我监测在持续观察漂移下的表现,涉及三种检测器家族(z-score、方差、百分位数)和三种模型容量。我们发现:(1) 存在一个普遍的明显检测阈值$\varepsilon^*$:在此阈值以下,漂移被视为正常变异;而在此阈值以上,检测会迅速发生。该阈值的存在及其S形曲线在所有检测器家族和模型容量中都是不变的,尽管其位置依赖于检测器灵敏度、噪声底层结构和环境动态之间的相互作用。(2) 所有检测器家族,包括没有时间平滑的方差和百分位数检测器,对于正弦漂移完全无法检测,这表明这是一个世界模型特性,而非检测器伪影。(3) 在每个环境中,$\varepsilon^*$在检测器参数中遵循幂律关系($R^2 = 0.89$-$0.97$),但跨环境预测失败($R^2 = 0.45$),揭示缺失变量是特定于环境的动态结构$\partial \mathrm{PE}/\partial\varepsilon$。(4) 在脆弱环境中,代理在任何检测器能够触发之前就会崩溃(“在意识之前崩溃”),形成一种根本无法监测的失败模式。我们的结果将$\varepsilon^*$重新框定为噪声底层、检测器和环境动态之间的三方相互作用,从而提供了一个更有说服力和实证基础的关于RL代理自我监测边界的解释。
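The threshold phenomenon is easy to reproduce with a toy version of the z-score detector family. The sketch below (all parameters are illustrative, not the paper's setup) adds a linearly growing drift to Gaussian prediction errors and reports the first detection step:

```python
import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)        # calibration prediction errors (no drift)
mu, sigma = baseline.mean(), baseline.std()

def first_detection(drift_rate, steps=2000, z_thresh=5.0, window=50):
    # Apply linearly growing observation drift and return the first step at
    # which the windowed mean prediction error exceeds the z-score threshold.
    errors = rng.normal(0.0, 1.0, steps) + drift_rate * np.arange(steps)
    for t in range(window, steps):
        z = (errors[t - window:t].mean() - mu) / (sigma / np.sqrt(window))
        if z > z_thresh:
            return t
    return None  # drift absorbed as normal variation

for eps in [1e-5, 1e-4, 1e-3, 1e-2]:
    print(f"drift rate {eps:g}: detected at step {first_detection(eps)}")
```

With these illustrative settings the boundary falls between drift rates of roughly 1e-4 and 1e-3 per step: slower drifts stay under the noise floor for the whole horizon, while faster ones are flagged within a few hundred steps, mirroring the sharp threshold the abstract describes.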
cs.AI / 75 / 2603.08561

RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

RetroAgent:通过回顾性双内在反馈从解决到演变
Zhang, Xiaoying, Liu, Zichen, Zhang, Yipeng, Hu, Xia, Shao, Wenqi
Abstract
Large language model (LLM)-based agents trained with reinforcement learning (RL) have shown strong potential on complex interactive tasks. However, standard RL paradigms favor static problem-solving over continuous adaptation: agents often converge to suboptimal strategies due to insufficient exploration, while learned knowledge remains implicit within parameters rather than explicitly retrievable, limiting effective experiential learning. To address these limitations, we introduce RetroAgent, an online RL framework that empowers agents to master complex interactive environments not just by solving, but by evolving. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces dual intrinsic feedback: (1) intrinsic numerical feedback that tracks incremental subtask completion relative to prior attempts, rewarding promising explorations, and (2) intrinsic language feedback that distills reusable lessons into a memory buffer, retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy balancing relevance, utility, and exploration to effectively leverage past experiences. Extensive experiments on two model families across four challenging agentic tasks demonstrate that RetroAgent significantly outperforms existing methods, achieving state-of-the-art results -- e.g., surpassing Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper -- while exhibiting strong test-time adaptation and generalization to out-of-distribution scenarios.
Chinese Translation
基于大型语言模型(LLM)的代理通过强化学习(RL)训练,在复杂交互任务中展现出强大的潜力。然而,标准的RL范式更倾向于静态问题解决而非持续适应:代理常因探索不足而收敛于次优策略,同时学习到的知识往往隐含在参数中而非显式可检索,从而限制了有效的经验学习。为了解决这些局限性,我们提出了RetroAgent,一个在线RL框架,使代理不仅能够通过解决问题来掌握复杂的交互环境,还能够通过演变来实现。具体而言,RetroAgent具有一种回顾性自我反思机制,产生双重内在反馈:(1)内在数值反馈,跟踪相对于先前尝试的增量子任务完成情况,奖励有前景的探索;(2)内在语言反馈,将可重用的经验教训提炼到记忆缓冲区,通过我们提出的相似性与效用感知上置信界(Similarity & Utility-Aware Upper Confidence Bound,SimUtil-UCB)策略进行检索,该策略平衡相关性、效用和探索,有效利用过去的经验。在四个具有挑战性的代理任务上对两种模型系列进行的广泛实验表明,RetroAgent显著优于现有方法,取得了最先进的结果——例如,在ALFWorld上超越了通过群体相对策略优化(Group Relative Policy Optimization,GRPO)训练的代理18.3%,在WebShop上超越15.4%,在Sokoban上超越27.1%,在MineSweeper上超越8.9%——同时展现出强大的测试时适应能力和对分布外场景的泛化能力。
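The abstract does not give the SimUtil-UCB formula, but a plausible reading is a UCB-style score that blends similarity and utility with an exploration bonus. The sketch below is a guess at that shape, with all names and the weighting hypothetical:

```python
import math

def simutil_ucb(sim, utility, pulls, total_pulls, alpha=0.5, c=1.0):
    # Hypothetical SimUtil-UCB-style score: exploit a blend of relevance
    # (similarity to the current state) and observed utility, plus a UCB
    # bonus that favors rarely retrieved lessons (exploration).
    exploit = alpha * sim + (1.0 - alpha) * utility
    explore = c * math.sqrt(math.log(total_pulls + 1) / (pulls + 1))
    return exploit + explore

# Memory buffer entries: (lesson, similarity, mean utility, retrieval count).
memory = [
    ("check inventory before crafting", 0.9, 0.2, 12),
    ("open drawers left to right",      0.6, 0.8, 3),
]
total = sum(pulls for *_, pulls in memory)
best = max(memory, key=lambda m: simutil_ucb(m[1], m[2], m[3], total))
print(best[0])  # the less-retrieved, higher-utility lesson wins here
```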
cs.AI / 76 / 2603.08575

Trust via Reputation of Conviction

通过信念的声誉建立信任
Iyengar, Aravind R.
Abstract
The question of \emph{knowledge}, \emph{truth} and \emph{trust} is explored via a mathematical formulation of claims and sources. We define truth as the reproducibly perceived subset of knowledge, formalize sources as having both generative and discriminative roles, and develop a framework for reputation grounded in the \emph{conviction} -- the likelihood that a source's stance is vindicated by independent consensus. We argue that conviction, rather than correctness or faithfulness, is the principled basis for trust: it is regime-independent, rewards genuine contribution, and demands the transparent and self-sufficient perceptions that make external verification possible. We formalize reputation as the expected weighted signed conviction over a realm of claims, characterize its behavior across source-claim regimes, and identify continuous verification as both a theoretical necessity and a practical mechanism through which reputation accrues. The framework is applied to AI agents, which are identified as capable but error-prone sources for whom verifiable conviction and continuously accrued reputation constitute the only robust foundation for trust.
Chinese Translation
通过对主张和来源的数学表述,探讨了\textit{知识}、\textit{真理}和\textit{信任}的问题。我们将真理定义为可重复感知的知识子集,将来源形式化为具有生成和区分两种角色,并建立了一个基于\textit{信念}的声誉框架——即来源的立场被独立共识验证的可能性。我们认为,信念而非正确性或忠诚性是信任的原则基础:它与体制无关,奖励真实贡献,并要求透明和自给自足的感知,从而使外部验证成为可能。我们将声誉形式化为在一系列主张上的期望加权带符号信念,表征其在来源-主张体制中的行为,并将持续验证识别为声誉积累的理论必要性和实践机制。该框架应用于人工智能代理,这些代理被视为有能力但容易出错的来源,对于它们而言,可验证的信念和持续积累的声誉构成了信任的唯一稳固基础。
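One plausible formalization of "expected weighted signed conviction" (the paper's exact definition may differ) is a weighted average over claims, where each conviction is mapped to a signed value so that likely-vindicated stances add to reputation and likely-refuted ones subtract:

```python
def reputation(records):
    # records: (weight, conviction) pairs per claim, where conviction in [0, 1]
    # is the likelihood the source's stance (commitment or denial) is vindicated
    # by independent consensus. The signed conviction 2c - 1 lies in [-1, 1]:
    # vindicated stances raise reputation, refuted ones lower it.
    num = sum(w * (2.0 * c - 1.0) for w, c in records)
    den = sum(w for w, _ in records)
    return num / den if den else 0.0

print(reputation([(1.0, 0.9), (2.0, 0.4), (1.0, 0.7)]))  # 0.2
```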
cs.AI / 77 / 2603.08652

CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

CoCo:代码作为链式思维用于文本到图像预览和稀有概念生成
Li, Haodong, Qing, Chunmei, Zhang, Huanyu, Jiang, Dongzhi, Zou, Yihang, Peng, Hongbo, Li, Dingming, Dai, Yuhong, Lin, ZePeng, Tian, Juanxi, Zhou, Yi, Dai, Siqi, Wu, Jingwei
Abstract
Recent advancements in Unified Multimodal Models (UMMs) have significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods largely rely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements, and dense textual content. In this work, we propose CoCo (Code-as-CoT), a code-driven reasoning framework that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning for image generation. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene, which is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines this draft through fine-grained image editing to produce the final high-fidelity result. To support this training paradigm, we construct CoCo-10K, a curated dataset containing structured draft-final image pairs designed to teach both structured draft construction and corrective visual refinement. Empirical evaluations on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo achieves improvements of +68.83%, +54.8%, and +41.23% over direct generation, while also outperforming other generation methods empowered by CoT. These results demonstrate that executable code is an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation. The code is available at: https://github.com/micky-li-hd/CoCo
Chinese Translation
最近在统一多模态模型(UMMs)方面的进展显著推动了文本到图像(T2I)生成,特别是通过链式思维(CoT)推理的整合。然而,现有的基于CoT的T2I方法在很大程度上依赖于抽象的自然语言规划,这缺乏对复杂空间布局、结构化视觉元素和密集文本内容所需的精确性。在本研究中,我们提出了CoCo(代码作为链式思维),这是一种以代码驱动的推理框架,将推理过程表示为可执行代码,从而实现图像生成的明确和可验证的中间规划。给定一个文本提示,CoCo首先生成可执行代码,指定场景的结构布局,然后在一个沙盒环境中执行该代码,以渲染出一个确定性的草图图像。模型随后通过细粒度图像编辑来完善该草图,以生成最终的高保真结果。为了支持这一训练范式,我们构建了CoCo-10K,这是一个经过精心策划的数据集,包含结构化的草图-最终图像对,旨在教授结构化草图构建和纠正视觉细化。对StructT2IBench、OneIG-Bench和LongText-Bench的实证评估表明,CoCo在直接生成上实现了+68.83%、+54.8%和+41.23%的改进,同时也优于其他由CoT驱动的生成方法。这些结果表明,可执行代码是一种有效且可靠的推理范式,适用于精确、可控和结构化的文本到图像生成。代码可在以下地址获取:https://github.com/micky-li-hd/CoCo
cs.AI / 78 / 2603.08655

OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

OfficeQA Pro:一种用于端到端基础推理的企业基准测试
Opsahl-Ong, Krista, Singhvi, Arnav, Collins, Jasmine, Zhou, Ivan, Wang, Cindy, Baheti, Ashutosh, Oertell, Owen, Portes, Jacob, Havens, Sam, Elsen, Erich, Bendersky, Michael, Zaharia, Matei, Chen, Xing
Abstract
We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview achieve less than 5% accuracy on OfficeQA Pro when relying on parametric knowledge, and less than 12% with additional access to the web. When provided directly with the document corpus, frontier agents still struggle on over half of questions, scoring 34.1% on average. We find that providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents. We conduct additional ablations to study the effects of model selection, table representation, retrieval strategy, and test-time scaling on performance. Despite these improvements, significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.
Chinese Translation
我们介绍了OfficeQA Pro,这是一个用于评估人工智能代理在大规模异构文档语料库上进行基础多文档推理的基准测试。该语料库由近100年的美国财政公报组成,包含89,000页和超过2600万个数值。OfficeQA Pro包含133个问题,这些问题需要对非结构化文本和表格数据进行精确的文档解析、检索和分析推理。包括Claude Opus 4.6、GPT-5.4和Gemini 3.1 Pro Preview在内的前沿大型语言模型(LLMs)在依赖参数知识时,在OfficeQA Pro上的准确率不足5%,而在额外访问网络的情况下准确率也不足12%。当直接提供文档语料库时,前沿代理在超过一半的问题上仍然表现不佳,平均得分为34.1%。我们发现,使用Databricks的ai_parse_document生成的结构化文档表示可以使代理的平均相对性能提升16.1%。我们进行了额外的消融实验,以研究模型选择、表格表示、检索策略和测试时扩展对性能的影响。尽管有这些改进,但在代理被认为在企业级基础推理中可靠之前,仍然存在显著的提升空间。
cs.AI / 79 / 2603.08692

A Multi-Objective Optimization Approach for Sustainable AI-Driven Entrepreneurship in Resilient Economies

面向韧性经济的可持续人工智能驱动创业的多目标优化方法
ALsobeh, Anas, Alkurdi, Raneem
Abstract
The rapid advancement of artificial intelligence (AI) technologies presents both unprecedented opportunities and significant challenges for sustainable economic development. While AI offers transformative potential for addressing environmental challenges and enhancing economic resilience, its deployment often involves substantial energy consumption and environmental costs. This research introduces the EcoAI-Resilience framework, a multi-objective optimization approach designed to maximize the sustainability benefits of AI deployment while minimizing environmental costs and enhancing economic resilience. The framework addresses three critical objectives through mathematical optimization: sustainability impact maximization, economic resilience enhancement, and environmental cost minimization. The methodology integrates diverse data sources, including energy consumption metrics, sustainability indicators, economic performance data, and entrepreneurship outcomes across 53 countries and 14 sectors from 2015-2024. Our experimental validation demonstrates exceptional performance with R scores exceeding 0.99 across all model components, significantly outperforming baseline methods, including Linear Regression (R = 0.943), Random Forest (R = 0.957), and Gradient Boosting (R = 0.989). The framework successfully identifies optimal AI deployment strategies featuring 100% renewable energy integration, 80% efficiency improvement targets, and optimal investment levels of $202.48 per capita. Key findings reveal strong correlations between economic complexity and resilience (r = 0.82), renewable energy adoption and sustainability outcomes (r = 0.71), and demonstrate significant temporal improvements in AI readiness (+1.12 points/year) and renewable energy adoption (+0.67 year) globally.
Chinese Translation
人工智能(AI)技术的快速发展为可持续经济发展带来了前所未有的机遇和重大挑战。尽管AI在应对环境挑战和增强经济韧性方面具有变革性潜力,但其部署往往涉及大量的能源消耗和环境成本。本研究提出了EcoAI-Resilience框架,这是一种多目标优化方法,旨在最大化AI部署的可持续性收益,同时最小化环境成本并增强经济韧性。该框架通过数学优化解决三个关键目标:可持续性影响最大化、经济韧性增强和环境成本最小化。该方法整合了多种数据来源,包括53个国家和14个行业在2015-2024年间的能源消耗指标、可持续性指标、经济绩效数据和创业成果。我们的实验验证显示,所有模型组件的R评分均超过0.99,表现卓越,显著优于基准方法,包括线性回归(R = 0.943)、随机森林(R = 0.957)和梯度提升(R = 0.989)。该框架成功识别出最佳的AI部署策略,特点是100%的可再生能源整合、80%的效率提升目标,以及每人$202.48的最佳投资水平。关键发现揭示了经济复杂性与韧性之间的强相关性(r = 0.82)、可再生能源采用与可持续性成果之间的相关性(r = 0.71),并展示了全球AI准备度(+1.12分/年)和可再生能源采用(+0.67年)的显著时间改善。
cs.AI / 80 / 2603.08704

Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines

评估大型语言模型中的金融智能:基于LLM引擎的SuperInvesting AI基准测试
Gulati, Akshay, Singhania, Kanha, Banga, Tushar, Arora, Parth, Verma, Anshul, Singh, Vaibhav Kumar, Digra, Agyapal, Bisht, Jayant Singh, Sharma, Danish, Singla, Varun, Garg, Shubh
Abstract
Large language models are increasingly used for financial analysis and investment research, yet systematic evaluation of their financial reasoning capabilities remains limited. In this work, we introduce the AI Financial Intelligence Benchmark (AFIB), a multi-dimensional evaluation framework designed to assess financial analysis capabilities across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. We evaluate five AI systems: GPT, Gemini, Perplexity, Claude, and SuperInvesting, using a dataset of 95+ structured financial analysis questions derived from real-world equity research tasks. The results reveal substantial differences in performance across models. Within this benchmark setting, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10 and the highest completeness score of 56.65/70, while also demonstrating the lowest hallucination rate among evaluated systems. Retrieval-oriented systems such as Perplexity perform strongly on data recency tasks due to live information access but exhibit weaker analytical synthesis and consistency. Overall, the results highlight that financial intelligence in large language models is inherently multi-dimensional, and systems that combine structured financial data access with analytical reasoning capabilities provide the most reliable performance for complex investment research workflows.
Chinese Translation
大型语言模型在金融分析和投资研究中的应用日益增多,但对其金融推理能力的系统评估仍然有限。在本研究中,我们引入了AI金融智能基准(AI Financial Intelligence Benchmark,AFIB),这是一个多维评估框架,旨在从五个维度评估金融分析能力:事实准确性、分析完整性、数据时效性、模型一致性和失败模式。我们评估了五个AI系统:GPT、Gemini、Perplexity、Claude和SuperInvesting,使用了一个包含95个以上来自真实股票研究任务的结构化金融分析问题的数据集。结果显示,各模型的性能存在显著差异。在这一基准测试中,SuperInvesting的综合表现最佳,平均事实准确性得分为8.96/10,完整性得分为56.65/70,同时在评估的系统中表现出最低的幻觉率。以检索为导向的系统如Perplexity在数据时效性任务上表现强劲,因其能够访问实时信息,但在分析综合和一致性方面表现较弱。总体而言,结果强调了大型语言模型中的金融智能本质上是多维的,能够将结构化金融数据访问与分析推理能力相结合的系统在复杂投资研究工作流中提供了最可靠的表现。
cs.AI / 81 / 2603.08706

Agentic Critical Training

自主批判性训练
Liu, Weize, Liu, Minghui, Ho, Sy-Tuyen, Chakraborty, Souradip, Wang, Xiyao, Huang, Furong
Abstract
Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.
Chinese Translation
将大型语言模型(LLMs)训练为自主代理通常始于模仿学习,但这仅教会代理如何行动,而不理解为何如此:代理从未将成功的行动与次优选择进行对比,因此缺乏对行动质量的意识。最近的方法试图通过引入基于专家与替代行动之间对比的自我反思监督来解决这一问题。然而,训练范式在根本上仍然是模仿学习:模型模仿预先构建的反思文本,而不是学习自主推理。我们提出了自主批判性训练(Agentic Critical Training, ACT),这是一种强化学习范式,旨在训练代理在替代方案中识别更好的行动。通过奖励模型判断的正确性,ACT促使模型自主发展对行动质量的推理,产生真正的自我反思,而不是简单模仿。在三个具有挑战性的代理基准测试中,ACT在结合不同的后训练方法时,始终提高代理性能。与模仿学习相比,ACT平均提高了5.07分,与强化学习相比提高了4.62分。与通过知识蒸馏注入反思能力的方法相比,ACT也显示出明显优势,平均提高了2.42分。此外,ACT在代理基准测试上实现了强大的分布外泛化,并在没有任何特定推理训练数据的情况下提高了通用推理基准的性能,突显了我们方法的价值。这些结果表明,ACT是开发更具反思能力和能力的LLM代理的有希望的路径。
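A minimal sketch of the judgment-reward loop described above; the interface and the binary reward are illustrative stand-ins, and ACT's actual RL setup is richer:

```python
import random

def judgment_reward(judge, state, expert_action, alternative_action):
    # Present two candidate actions in random order and reward the agent
    # for identifying the expert one: the reward signals whether the model's
    # reasoning about action quality is correct, not whether it imitated
    # the expert action verbatim.
    pair = [expert_action, alternative_action]
    random.shuffle(pair)
    choice = judge(state, pair[0], pair[1])   # model returns index 0 or 1
    return 1.0 if pair[choice] == expert_action else 0.0
```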
计算语言学 (Computation and Language)
90
cs.CL / 1 / 2603.06590

ARC-AGI-2 Technical Report

ARC-AGI-2 技术报告
de Oliveira, Wallyson Lemes, Bobokhonov, Mekhron, Caorsi, Matteo, Podestà, Aldo, Beltramo, Gabriele, Crosato, Luca, Bonotto, Matteo, Cecchetto, Federica, Espic, Hadrien, Salajan, Dan Titus, Taga, Stefan, Pana, Luca, Carthy, Joe
Abstract
The Abstraction and Reasoning Corpus (ARC) is designed to assess generalization beyond pattern matching, requiring models to infer symbolic rules from very few examples. In this work, we present a transformer-based system that advances ARC performance by combining neural inference with structure-aware priors and online task adaptation. Our approach is built on four key ideas. First, we reformulate ARC reasoning as a sequence modeling problem using a compact task encoding with only 125 tokens, enabling efficient long-context processing with a modified LongT5 architecture. Second, we introduce a principled augmentation framework based on group symmetries, grid traversals, and automata perturbations, enforcing invariance to representation changes. Third, we apply test-time training (TTT) with lightweight LoRA adaptation, allowing the model to specialize to each unseen task by learning its transformation logic from demonstrations. Fourth, we design a symmetry-aware decoding and scoring pipeline that aggregates likelihoods across augmented task views, effectively performing "multi-perspective reasoning" over candidate solutions. We demonstrate that these components work synergistically: augmentations expand hypothesis space, TTT sharpens local reasoning, and symmetry-based scoring improves solution consistency. Our final system achieves a significant improvement over transformer baselines and surpasses prior neural ARC solvers, closing the gap toward human-level generalization.
Chinese Translation
抽象与推理语料库(Abstraction and Reasoning Corpus,ARC)旨在评估超越模式匹配的泛化能力,要求模型从极少的示例中推断符号规则。在本研究中,我们提出了一种基于变换器(transformer)的系统,通过结合神经推理、结构感知先验和在线任务适应来提升ARC的性能。我们的方法基于四个关键思想。首先,我们将ARC推理重新表述为一个序列建模问题,使用仅包含125个标记的紧凑任务编码,使得通过修改后的LongT5架构实现高效的长上下文处理。其次,我们引入了一个基于群对称性、网格遍历和自动机扰动的原则性增强框架,强制执行对表示变化的不变性。第三,我们应用了轻量级LoRA适应的测试时训练(Test-Time Training,TTT),使模型能够通过从示例中学习其变换逻辑来专门针对每个未见任务。第四,我们设计了一个对称感知的解码和评分管道,聚合增强任务视图的似然性,有效地对候选解决方案执行“多视角推理”。我们证明了这些组件协同工作:增强扩展了假设空间,TTT锐化了局部推理,而基于对称性的评分提高了解决方案的一致性。我们的最终系统在变换器基线之上实现了显著的改进,并超越了之前的神经ARC求解器,缩小了与人类水平泛化之间的差距。
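The symmetry-aware scoring step can be pictured with the eight views of a square grid (the D4 group). The sketch below averages a stand-in scorer's log-likelihoods across views; the paper's full pipeline would also transform the task's demonstration grids and invert the transform on predictions, which is omitted here:

```python
import numpy as np

def d4_views(grid):
    # The eight symmetries of the square: four rotations, each optionally flipped.
    g = np.asarray(grid)
    for k in range(4):
        r = np.rot90(g, k)
        yield r
        yield np.fliplr(r)

def symmetry_score(candidate, log_likelihood):
    # Average the model's log-likelihood of a candidate solution over all
    # augmented views, so a candidate wins only if it scores well from
    # every "perspective". `log_likelihood` stands in for the trained model.
    return float(np.mean([log_likelihood(v) for v in d4_views(candidate)]))

toy_scorer = lambda v: -float(np.abs(v).sum())        # placeholder scorer
print(symmetry_score([[0, 1], [1, 0]], toy_scorer))   # -2.0
```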
cs.CL / 2 / 2603.06592

Hierarchical Latent Structures in Data Generation Process Unify Mechanistic Phenomena across Scale

数据生成过程中的层次潜在结构统一了跨尺度的机制现象
Rohweder, Jonas, Dutta, Subhabrata, Gurevych, Iryna
Abstract
Contemporary studies have uncovered many puzzling phenomena in the neural information processing of Transformer-based language models. Building a robust, unified understanding of these phenomena requires disassembling a model within the scope of its training. While the intractable scale of pretraining corpora limits a bottom-up investigation in this direction, simplistic assumptions of the data generation process limit the expressivity and fail to explain complex patterns. In this work, we use probabilistic context-free grammars (PCFGs) to generate synthetic corpora that are faithful and computationally efficient proxies for web-scale text corpora. We investigate the emergence of three mechanistic phenomena: induction heads, function vectors, and the Hydra effect, under our designed data generation process, as well as in the checkpoints of real-world language models. Our findings suggest that hierarchical structures in the data generation process serve as the X-factor in explaining the emergence of these phenomena. We provide the theoretical underpinnings of the role played by hierarchy in the training dynamics of language models. In a nutshell, our work is the first of its kind to provide a unified explanation behind the emergence of seemingly unrelated mechanistic phenomena in LLMs, augmented with efficient synthetic tooling for future interpretability research.
Chinese Translation
当代研究揭示了基于Transformer的语言模型在神经信息处理中的许多令人困惑的现象。要建立对这些现象的稳健统一理解,需要在训练范围内对模型进行拆解。然而,预训练语料库的不可处理规模限制了这一方向的自下而上的调查,而对数据生成过程的简单假设则限制了表达能力,未能解释复杂模式。在本研究中,我们使用概率上下文无关文法(PCFGs)生成合成语料库,这些语料库是对网络规模文本语料库的真实且计算高效的代理。我们在设计的数据生成过程中以及现实世界语言模型的检查点中,研究了三种机制现象的出现:归纳头、函数向量和九头蛇效应。我们的发现表明,数据生成过程中的层次结构作为解释这些现象出现的X因素。我们提供了层次在语言模型训练动态中所发挥作用的理论基础。简而言之,我们的工作是首个提供统一解释的研究,揭示了在大型语言模型中看似无关的机制现象的出现,并辅以高效的合成工具,以便于未来的可解释性研究。
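A toy PCFG sampler makes the data-generation setup concrete; the grammar below is illustrative, not one of the paper's grammars:

```python
import random

# Nonterminals map to (probability, expansion) pairs; anything not in the
# table is a terminal token.
PCFG = {
    "S":  [(1.0, ["NP", "VP"])],
    "NP": [(0.7, ["det", "noun"]), (0.3, ["det", "adj", "noun"])],
    "VP": [(0.6, ["verb", "NP"]), (0.4, ["verb"])],
}

def sample(symbol="S"):
    if symbol not in PCFG:
        return [symbol]                    # terminal
    r, acc = random.random(), 0.0
    for prob, rhs in PCFG[symbol]:
        acc += prob
        if r <= acc:
            return [tok for s in rhs for tok in sample(s)]
    return [tok for s in PCFG[symbol][-1][1] for tok in sample(s)]

print(" ".join(sample()))   # e.g. "det adj noun verb det noun"
```

Because expansions can recurse (VP can reintroduce NP), even a grammar this small yields a hierarchical, nested structure in every sampled string, which is the property the paper isolates.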
cs.CL / 3 / 2603.06593

Hierarchical Embedding Fusion for Retrieval-Augmented Code Generation

用于检索增强代码生成的层次嵌入融合
Sorokin, Nikita, Sedykh, Ivan, Malykh, Valentin
Abstract
Retrieval-augmented code generation often conditions the decoder on large retrieved code snippets. This ties online inference cost to repository size and introduces noise from long contexts. We present Hierarchical Embedding Fusion (HEF), a two-stage approach to repository representation for code completion. First, an offline cache compresses repository chunks into a reusable hierarchy of dense vectors using a small fuser model. Second, an online interface maps a small number of retrieved vectors into learned pseudo-tokens that are consumed by the code generator. This replaces thousands of retrieved tokens with a fixed pseudo-token budget while preserving access to repository-level information. On RepoBench and RepoEval, HEF with a 1.8B-parameter pipeline achieves exact-match accuracy comparable to snippet-based retrieval baselines, while operating at sub-second median latency on a single A100 GPU. Compared to graph-based and iterative retrieval systems in our experimental setup, HEF reduces median end-to-end latency by 13 to 26 times. We also introduce a utility-weighted likelihood signal for filtering training contexts and report ablation studies on pseudo-token budget, embedding models, and robustness to harmful retrieval. Overall, these results indicate that hierarchical dense caching is an effective mechanism for low-latency, repository-aware code completion.
Chinese Translation
检索增强代码生成通常将解码器条件化于大量检索到的代码片段。这将在线推理成本与代码库大小关联起来,并引入来自长上下文的噪声。我们提出了层次嵌入融合(Hierarchical Embedding Fusion, HEF),这是一种用于代码补全的两阶段代码库表示方法。首先,离线缓存使用一个小型融合模型将代码库块压缩为可重用的稠密向量层次结构。其次,在线接口将少量检索到的向量映射为学习到的伪标记,这些伪标记被代码生成器消耗。这一方法用固定的伪标记预算替代了数千个检索到的标记,同时保留了对代码库级信息的访问。在RepoBench和RepoEval上,HEF在1.8B参数的管道下实现了与基于片段的检索基线相当的精确匹配准确率,同时在单个A100 GPU上以亚秒的中位延迟运行。与我们实验设置中的基于图的和迭代检索系统相比,HEF将中位端到端延迟减少了13到26倍。我们还引入了一种效用加权的似然信号用于过滤训练上下文,并报告了关于伪标记预算、嵌入模型和对有害检索的鲁棒性的消融研究。总体而言,这些结果表明,层次稠密缓存是一种有效的机制,能够实现低延迟且感知代码库的代码补全。
cs.CL / 4 / 2603.06594

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

安全性的抉择:大型语言模型评判者未能可靠地测量对抗鲁棒性
Schwinn, Leo, Ladenburger, Moritz, Beyer, Tim, Mofakhami, Mehrnaz, Gidel, Gauthier, Günnemann, Stephan
Abstract
Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. Data available at: https://github.com/SchwinnL/LLMJudgeReliability.
Chinese Translation
自动化的“LLM作为评判者”框架已成为自然语言处理领域可扩展评估的事实标准。例如,在安全性评估中,这些评判者被依赖于评估有害性,以基准测试安全性对抗对抗性攻击的鲁棒性。然而,我们展示了现有的验证协议未能考虑到红队测试固有的重大分布变化:不同的受害模型表现出不同的生成风格,攻击扭曲输出模式,语义模糊性在越狱场景中显著变化。通过对6642个经过人类验证的标签进行全面审计,我们揭示了这些变化的不可预测交互常常导致评判者的表现降至接近随机的水平。这与先前研究中报告的高人类一致性形成鲜明对比。关键是,我们发现许多攻击通过利用评判者的不足而非引发真正有害的内容来夸大其成功率。为了实现更可靠的评估,我们提出了ReliableBench,一个更一致可评判行为的基准,以及JudgeStressTest,一个旨在揭示评判者失败的数据集。数据可在:https://github.com/SchwinnL/LLMJudgeReliability获取。
cs.CL / 5 / 2603.06595

Rethinking Personalization in Large Language Models at the Token Level

在大语言模型中重新思考基于标记的个性化
Zhang, Chenheng, Lu, Yijun, Fang, Lizhe, Zheng, Chunyuan, Chai, Jiajun, Wang, Xiaohan, Yin, Guojun, Lin, Wei, Wang, Yisen, Lin, Zhouchen
Abstract
With large language models (LLMs) now performing strongly across diverse tasks, there is growing demand for them to personalize outputs for individual users. Personalization is typically framed as an additional layer on top of a base NLP task, requiring model responses to meet user-specific needs while still accomplishing the underlying task. From a token-level perspective, different tokens in a response contribute to personalization to varying degrees. Tokens with higher personalization relevance should therefore receive greater emphasis when developing personalized LLMs. However, accurately estimating such personalization degrees remains challenging. To address this challenge, we propose PerContrast, a self-contrast method that estimates each output token's dependence on user-specific information through causal intervention. Building on this mechanism, we develop the PerCE loss, which adaptively upweights tokens with higher estimated personalization degrees during training via a bootstrap procedure, enabling the model to alternate between estimating and optimizing these tokens. Experiments on multiple LLMs demonstrate that PerCE substantially improves personalization performance with minimal additional cost, achieving average gains of over 10% and up to 68.04% on the LongLaMP dataset, along with strong cross-task and cross-scenario transferability. These results highlight the importance of token-level personalization modeling and establish token-aware training as a simple yet effective paradigm for advancing personalized LLMs.
Chinese Translation
随着大语言模型(LLMs)在各种任务中表现出色,对其为个别用户个性化输出的需求日益增长。个性化通常被视为在基础自然语言处理(NLP)任务之上的额外层,要求模型的响应满足用户特定需求,同时仍然完成基础任务。从标记级别的角度来看,响应中的不同标记对个性化的贡献程度各不相同。因此,具有更高个性化相关性的标记在开发个性化LLM时应受到更大的重视。然而,准确估计这种个性化程度仍然具有挑战性。为了解决这一挑战,我们提出了PerContrast,这是一种自对比方法,通过因果干预估计每个输出标记对用户特定信息的依赖性。在此机制的基础上,我们开发了PerCE损失,该损失在训练过程中通过自助程序自适应地提高具有更高估计个性化程度的标记的权重,使模型能够在估计和优化这些标记之间交替进行。在多个LLM上的实验表明,PerCE在个性化性能上显著提升,额外成本极小,在LongLaMP数据集上实现了超过10%的平均增益,最高可达68.04%,并展现出强大的跨任务和跨场景迁移能力。这些结果突显了标记级个性化建模的重要性,并确立了基于标记的训练作为推动个性化LLM发展的简单而有效的范式。
cs.CL / 6 / 2603.06816

"Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior

黑暗三角模型的错位生物体:狭义微调反映人类反社会行为
Lulla, Roshni, Collins, Fiona, Parekh, Sanaya, Hagendorff, Thilo, Kaplan, Jonas
Abstract
The alignment problem refers to concerns regarding powerful intelligences, ensuring compatibility with human preferences and values as capabilities increase. Current large language models (LLMs) show misaligned behaviors, such as strategic deception, manipulation, and reward-seeking, that can arise despite safety training. Gaining a mechanistic understanding of these failures requires empirical approaches that can isolate behavioral patterns in controlled settings. We propose that biological misalignment precedes artificial misalignment, and leverage the Dark Triad of personality (narcissism, psychopathy, and Machiavellianism) as a psychologically grounded framework for constructing model organisms of misalignment. In Study 1, we establish comprehensive behavioral profiles of Dark Triad traits in a human population (N = 318), identifying affective dissonance as a central empathic deficit connecting the traits, as well as trait-specific patterns in moral reasoning and deceptive behavior. In Study 2, we demonstrate that dark personas can be reliably induced in frontier LLMs through minimal fine-tuning on validated psychometric instruments. Narrow training datasets as small as 36 psychometric items resulted in significant shifts across behavioral measures that closely mirrored human antisocial profiles. Critically, models generalized beyond training items, demonstrating out-of-context reasoning rather than memorization. These findings reveal latent persona structures within LLMs that can be readily activated through narrow interventions, positioning the Dark Triad as a validated framework for inducing, detecting, and understanding misalignment across both biological and artificial intelligence.
Chinese Translation
对齐问题指的是关于强大智能的担忧,确保其与人类偏好和价值观的兼容性,随着能力的增强而日益重要。目前的大型语言模型(LLMs)表现出错位行为,如战略性欺骗、操控和寻求奖励,这些行为即使在安全训练下也可能出现。要获得对这些失败的机制理解,需要采用实证方法,在受控环境中隔离行为模式。我们提出生物错位先于人工错位,并利用人格的黑暗三角(自恋、精神病和马基雅维利主义)作为构建错位模型生物体的心理学基础框架。在研究1中,我们在一个人群中(N = 318)建立了黑暗三角特征的全面行为特征,识别出情感不和谐作为连接这些特征的中心共情缺陷,以及在道德推理和欺骗行为中的特征特定模式。在研究2中,我们展示了通过对经过验证的心理测量工具进行最小微调,可以在前沿LLMs中可靠地诱发黑暗人格。仅36个心理测量项目的小型狭义训练数据集就导致了行为测量的显著变化,这些变化与人类反社会特征紧密相似。关键是,模型超越了训练项目,展示了超出上下文的推理,而非记忆。这些发现揭示了LLMs中潜在的人格结构,可以通过狭义干预轻松激活,从而将黑暗三角定位为一个经过验证的框架,用于诱发、检测和理解生物和人工智能中的错位。
cs.CL / 7 / 2603.06836

Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records

小型语言模型在儿童福利记录中对DSM-5物质类别分类的验证
Perron, Brian E., Stoll, Dragan, Victor, Bryan G., Qia, Zia, Jud, Andreas, Ryan, Joseph P.
Abstract
Background: Recent studies have demonstrated that large language models (LLMs) can perform binary classification tasks on child welfare narratives, detecting the presence or absence of constructs such as substance-related problems, domestic violence, and firearms involvement. Whether smaller, locally deployable models can move beyond binary detection to classify specific substance types from these narratives remains untested. Objective: To validate a locally hosted LLM classifier for identifying specific substance types aligned with DSM-5 categories in child welfare investigation narratives. Methods: A locally hosted 20-billion-parameter LLM classified child maltreatment investigation narratives from a Midwestern U.S. state. Records previously identified as containing substance-related problems were passed to a second classification stage targeting seven DSM-5 substance categories. Expert human review of 900 stratified cases assessed classification precision, recall, and inter-method reliability (Cohen's kappa). Test-retest stability was evaluated using approximately 15,000 independently classified records. Results: Five substance categories achieved almost perfect inter-method agreement (kappa = 0.94-1.00): alcohol, cannabis, opioid, stimulant, and sedative/hypnotic/anxiolytic. Classification precision ranged from 92% to 100% for these categories. Two low-prevalence categories (hallucinogen, inhalant) performed poorly. Test-retest agreement ranged from 92.1% to 99.1% across the seven categories. Conclusions: A small, locally hosted LLM can reliably classify substance types from child welfare administrative text, extending prior work on binary classification to multi-label substance identification.
Chinese Translation
背景:近期研究表明,大型语言模型(LLMs)能够对儿童福利叙述进行二元分类任务,检测诸如物质相关问题、家庭暴力和枪支涉入等构念的存在或缺失。然而,较小的本地可部署模型是否能够超越二元检测,从这些叙述中分类特定物质类型尚未得到验证。目的:验证一个本地托管的LLM分类器,以识别与DSM-5类别相一致的特定物质类型,应用于儿童福利调查叙述。方法:一个本地托管的200亿参数LLM对来自美国中西部某州的儿童虐待调查叙述进行了分类。先前被识别为包含物质相关问题的记录被传递到第二个分类阶段,针对七个DSM-5物质类别进行分类。对900个分层案例的专家人工审核评估了分类的精确度、召回率和方法间一致性(Cohen's kappa)。使用约15,000个独立分类的记录评估了测试-重测稳定性。结果:五个物质类别达到了几乎完美的方法间一致性(kappa = 0.94-1.00):酒精、大麻、阿片类、兴奋剂和镇静剂/催眠药/抗焦虑药。这些类别的分类精确度范围为92%到100%。两个低流行率类别(迷幻药、吸入剂)的表现较差。七个类别的测试-重测一致性范围为92.1%到99.1%。结论:一个小型的本地托管LLM能够可靠地从儿童福利行政文本中分类物质类型,将先前的二元分类工作扩展到多标签物质识别。
cs.CL / 8 / 2603.06865

Counting on Consensus: Selecting the Right Inter-annotator Agreement Metric for NLP Annotation and Evaluation

依赖共识:为自然语言处理注释和评估选择合适的标注者间一致性度量
James, Joseph
Abstract
Human annotation remains the foundation of reliable and interpretable data in Natural Language Processing (NLP). As annotation and evaluation tasks continue to expand, from categorical labelling to segmentation, subjective judgment, and continuous rating, measuring agreement between annotators has become increasingly more complex. This paper outlines how inter-annotator agreement (IAA) has been conceptualised and applied across NLP and related disciplines, describing the assumptions and limitations of common approaches. We organise agreement measures by task type and discuss how factors such as label imbalance and missing data influence reliability estimates. In addition, we highlight best practices for clear and transparent reporting, including the use of confidence intervals and the analysis of disagreement patterns. The paper aims to serve as a guide for selecting and interpreting agreement measures, promoting more consistent and reproducible human annotation and evaluation in NLP.
Chinese Translation
人类注释仍然是自然语言处理(NLP)中可靠和可解释数据的基础。随着注释和评估任务的不断扩展,从类别标注到分割、主观判断和连续评分,标注者之间的一致性测量变得愈加复杂。本文概述了标注者间一致性(IAA)在NLP及相关学科中的概念化和应用,描述了常见方法的假设和局限性。我们根据任务类型组织一致性度量,并讨论标签不平衡和缺失数据等因素如何影响可靠性估计。此外,我们强调了清晰和透明报告的最佳实践,包括使用置信区间和不一致模式分析。本文旨在为选择和解释一致性度量提供指导,促进NLP中人类注释和评估的更一致和可重复性。
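Since the survey centers on agreement measures, a reference implementation of the most common one, Cohen's kappa for two annotators over nominal labels, is a useful companion (this is the standard formula, not code from the paper):

```python
from collections import Counter

def cohens_kappa(a, b):
    # Chance-corrected agreement between two annotators on the same items:
    # kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    # p_e is the agreement expected from each annotator's label distribution.
    assert len(a) == len(b) and a
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

print(cohens_kappa(["yes", "no", "yes", "yes"],
                   ["yes", "no", "no",  "yes"]))   # 0.5
```

The worked example shows why chance correction matters: the annotators agree on 3 of 4 items (p_o = 0.75), but their skewed label distributions already predict 0.5 agreement by chance, so kappa is 0.5 rather than 0.75.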
cs.CL / 9 / 2603.06905

MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning

MedInjection-FR:探索原生、合成和翻译数据在生物医学指令调优中的作用
Belmadani, Ikram, Khettari, Oumaima El, Beaufils, Pacôme Constant dit, Favre, Benoit, Dufour, Richard
Abstract
Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction-response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remains less effective but contributes positively when balanced with native supervision. Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity. These findings highlight that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.
Chinese Translation
指令调优已成为使大型语言模型(LLMs)适应特定领域提示的关键。然而,在医学等专业领域,高质量法语指令数据的稀缺限制了有效的监督。为了解决这一问题,我们引入了MedInjection-FR,一个大规模的法语生物医学指令数据集,包含571K个来自三种互补来源的指令-响应对:原生数据、合成数据和翻译数据。我们设计了一个受控实验框架,以系统评估数据来源如何影响指令调优,使用Qwen-4B-Instruct在结合这些来源的七种配置下进行微调。结果表明,原生数据的表现最强,而混合设置,特别是原生和翻译数据,提供了互补的好处。单独使用合成数据的效果较差,但在与原生监督平衡时能产生积极贡献。开放式问答的评估结合了自动指标、LLM作为评审的评估和人类专家的审查;尽管基于LLM的判断与人类评分的相关性最佳,但它们对冗长性表现出敏感性。这些发现强调了数据的真实性和多样性共同塑造下游适应性,并且异质监督可以缓解原生法语医学指令的稀缺问题。
cs.CL / 10 / 2603.06910

Language Shapes Mental Health Evaluations in Large Language Models

语言塑造大型语言模型中的心理健康评估
Xu, Jiayi, Hu, Xiyang
Abstract
This study investigates whether large language models (LLMs) exhibit cross-linguistic differences in mental health evaluations. Focusing on Chinese and English, we examine two widely used models, GPT-4o and Qwen3, to assess whether prompt language systematically shifts mental health-related evaluations and downstream decision outcomes. First, we assess models' evaluative orientation toward mental health stigma using multiple validated measurement scales capturing social stigma, self-stigma, and professional stigma. Across all measures, both models produce higher stigma-related responses when prompted in Chinese than in English. Second, we examine whether these differences also manifest in two common downstream decision tasks in mental health. In a binary mental health stigma detection task, sensitivity to stigmatizing content varies across language prompts, with lower sensitivity observed under Chinese prompts. In a depression severity classification task, predicted severity also differs by prompt language, with Chinese prompts associated with more underestimation errors, indicating a systematic downward shift in predicted severity relative to English prompts. Together, these findings suggest that language context can systematically shape evaluative patterns in LLM outputs and shift decision thresholds in downstream tasks.
Chinese Translation
本研究探讨大型语言模型(LLMs)在心理健康评估中是否表现出跨语言差异。我们聚焦于中文和英文,考察两个广泛使用的模型,GPT-4o 和 Qwen3,以评估提示语言是否系统性地影响与心理健康相关的评估和后续决策结果。首先,我们使用多种经过验证的测量量表评估模型对心理健康污名的评估取向,这些量表涵盖了社会污名、自我污名和专业污名。在所有测量中,当提示使用中文时,两个模型产生的与污名相关的反应均高于使用英文时。其次,我们考察这些差异是否也体现在心理健康的两个常见后续决策任务中。在一个二元心理健康污名检测任务中,对污名内容的敏感性在不同语言提示下有所不同,中文提示下的敏感性较低。在抑郁严重程度分类任务中,预测的严重程度也因提示语言而异,中文提示与更多的低估错误相关,表明相对于英文提示,预测的严重程度系统性地向下偏移。综合来看,这些发现表明,语言背景可以系统性地塑造LLM输出中的评估模式,并在后续任务中改变决策阈值。
cs.CL / 11 / 2603.06915

A Dynamic Self-Evolving Extraction System

动态自我演化提取系统
Amin-Naseri, Moin, Kim, Hannah, Hruschka, Estevam
Abstract
The extraction of structured information from raw text is a fundamental component of many NLP applications, including document retrieval, ranking, and relevance estimation. High-quality extractions often require domain-specific accuracy, up-to-date understanding of specialized taxonomies, and the ability to incorporate emerging jargon and rare outliers. In many domains--such as medical, legal, and HR--the extraction model must also adapt to shifting terminology and benefit from explicit reasoning over structured knowledge. We propose DySECT, a Dynamic Self-Evolving Extraction and Curation Toolkit, which continually improves as it is used. The system incrementally populates a versatile, self-expanding knowledge base (KB) with triples extracted by the LLM. The KB further enriches itself through the integration of probabilistic knowledge and graph-based reasoning, gradually accumulating domain concepts and relationships. The enriched KB then feeds back into the LLM extractor via prompt tuning, sampling of relevant few-shot examples, or fine-tuning using KB-derived synthetic data. As a result, the system forms a symbiotic closed-loop cycle in which extraction continuously improves knowledge, and knowledge continuously improves extraction.
Chinese Translation
从原始文本中提取结构化信息是许多自然语言处理(NLP)应用的基础组成部分,包括文档检索、排名和相关性评估。高质量的提取通常需要领域特定的准确性、对专业分类法的最新理解,以及融入新兴术语和稀有异常值的能力。在许多领域——例如医疗、法律和人力资源——提取模型还必须适应不断变化的术语,并从对结构化知识的明确推理中受益。我们提出了DySECT(动态自我演化提取与策划工具包),该工具包在使用过程中不断改进。该系统通过大型语言模型(LLM)提取的三元组逐步填充一个多功能、自我扩展的知识库(KB)。知识库通过整合概率知识和基于图的推理进一步丰富自身,逐渐积累领域概念和关系。丰富的知识库随后通过提示调优、相关少量示例的抽样或使用基于知识库的合成数据进行微调,反馈到LLM提取器中。因此,该系统形成了一个共生的闭环循环,其中提取不断改善知识,而知识不断改善提取。
cs.CL / 12 / 2603.06923

Reforming the Mechanism: Editing Reasoning Patterns in LLMs with Circuit Reshaping

机制改革:通过电路重塑编辑大型语言模型中的推理模式
Lei, Zhenyu, Wu, Qiong, Dong, Jianxiong, He, Yinhan, Dodwell, Emily, Dong, Yushun, Li, Jundong
Abstract
Large language models (LLMs) often exhibit flawed reasoning ability that undermines reliability. Existing approaches to improving reasoning typically treat it as a general and monolithic skill, applying broad training which is inefficient and unable to target specific reasoning errors. We introduce Reasoning Editing, a paradigm for selectively modifying specific reasoning patterns in LLMs while preserving other reasoning pathways. This task presents a fundamental trade-off between Generality, the ability of an edit to generalize across different tasks sharing the same reasoning pattern, and Locality, the ability to preserve other reasoning capabilities. Through systematic investigation, we uncover the Circuit-Interference Law: Edit interference between reasoning patterns is proportional to the overlap of their neural circuits. Guided by this principle, we propose REdit, the first framework to actively reshape neural circuits before editing, thereby modulating interference between reasoning patterns and mitigating the trade-off. REdit integrates three components: (i) Contrastive Circuit Reshaping, which directly addresses the generality-locality trade-off by disentangling overlapping circuits; (ii) Meta-Contrastive Learning, which extends transferability to novel reasoning patterns; and (iii) Dual-Level Protection, which preserves preexisting abilities by constraining reshaping update directions and regularizing task-level predictions. Extensive experiments with Qwen-2.5-3B on propositional logic reasoning tasks across three difficulty levels demonstrate that REdit consistently achieves superior generality and locality compared to baselines, with additional validation in mathematics showing broader potential. Our code is available at https://github.com/LzyFischer/REdit.
Chinese Translation
大型语言模型(LLMs)通常表现出缺陷的推理能力,这削弱了其可靠性。现有的改善推理的方法通常将其视为一种通用且单一的技能,采用广泛的训练,这种方法效率低下且无法针对特定的推理错误。我们提出了推理编辑(Reasoning Editing),这是一种选择性修改LLMs中特定推理模式的范式,同时保留其他推理路径。该任务在通用性(Generality)和局部性(Locality)之间存在基本的权衡:编辑的通用性是指其能够在共享相同推理模式的不同任务中进行推广的能力,而局部性是指保留其他推理能力的能力。通过系统的研究,我们揭示了电路干扰法则(Circuit-Interference Law):推理模式之间的编辑干扰与其神经电路的重叠成正比。在这一原则的指导下,我们提出了REdit,这是第一个在编辑之前主动重塑神经电路的框架,从而调节推理模式之间的干扰并减轻权衡。REdit集成了三个组件:(i)对比电路重塑(Contrastive Circuit Reshaping),通过解耦重叠电路直接解决通用性-局部性权衡;(ii)元对比学习(Meta-Contrastive Learning),将可迁移性扩展到新颖的推理模式;(iii)双层保护(Dual-Level Protection),通过限制重塑更新方向和规范化任务级预测来保留现有能力。在三个难度级别的命题逻辑推理任务上,使用Qwen-2.5-3B进行的广泛实验表明,REdit在通用性和局部性方面始终优于基线,并且在数学验证中显示出更广泛的潜力。我们的代码可在 https://github.com/LzyFischer/REdit 获取。
cs.CL / 13 / 2603.06942

Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks

深入研究,浅层评估:长文本问答基准的元评估案例研究
Hwang, Jena D., Kishore, Varsha, Singh, Amanpreet, Haddad, Dany, Naik, Aakanksha, Hamada, Malachi, Bragg, Jonathan, D'Arcy, Mike, Weld, Daniel S., Wang, Lucy Lu, Downey, Doug, Feldman, Sergey
Abstract
Recent advances have made long-form report-generating systems widely available. This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods. Many of the meta-evaluations estimate an evaluation's quality by comparing its assessments against human pairwise preferences. Prior work, however, suggests that human pairwise preference may be overly simplistic and can fail to capture nuances of expert expectations. We conduct a case study in meta-evaluation for long-form QA benchmarks using ScholarQA-CS2, a benchmark designed for assessing retrieval-augmented deep-research QA in the scientific domain. We comprehensively validate the benchmark through human pairwise preference judgments, then critically examine the strengths, weaknesses, and confounders of this approach. We show that pairwise preference rankings are best suited for system-level evaluation, while explicit metric-wise annotations and expert annotators are critical for reliable metric-level assessment, with subjectivity remaining a key challenge. Based on our findings, we offer practical guidelines for designing future meta-evaluations that better align evaluation methods, annotator expertise, and reporting practices. By surfacing these methodological challenges, we aim to advance evaluation standards for deep-research systems.
Chinese Translation
近期的进展使得长文本报告生成系统广泛可用。这促使了采用 LLM-as-judge 协议和声明验证的评估框架,以及旨在验证这些方法的元评估框架。许多元评估通过将评估结果与人类成对偏好进行比较来估计评估质量。然而,先前的研究表明,人类成对偏好可能过于简单,无法捕捉专家期望的细微差别。我们使用 ScholarQA-CS2 进行长文本问答基准的元评估案例研究,该基准旨在评估科学领域中增强检索的深度研究问答。我们通过人类成对偏好判断全面验证了该基准,然后批判性地审视了这种方法的优缺点和混淆因素。我们展示了成对偏好排名最适合系统级评估,而明确的指标级注释和专家注释者对于可靠的指标级评估至关重要,主观性仍然是一个关键挑战。基于我们的发现,我们提供了设计未来元评估的实用指南,以更好地对齐评估方法、注释者专业知识和报告实践。通过揭示这些方法论挑战,我们旨在推动深度研究系统的评估标准。
cs.CL / 14 / 2603.06974

Elenchus: Generating Knowledge Bases from Prover-Skeptic Dialogues

Elenchus:从证明者-怀疑者对话生成知识库
Allen, Bradley P.
Abstract
We present Elenchus, a dialogue system for knowledge base construction grounded in inferentialist semantics, where knowledge engineering is re-conceived as explicitation rather than extraction from expert testimony or textual content. A human expert develops a bilateral position (commitments and denials) about a topic through prover-skeptic dialogue with a large language model (LLM) opponent. The LLM proposes tensions (claims that parts of the position are jointly incoherent) which the expert resolves by retraction, refinement, or contestation. The LLM thus serves as a defeasible derivability oracle whose unreliability is structurally contained by the expert's authority. Our main technical contribution is a mapping from Elenchus dialectical states to material bases in Hlobil and Brandom's NonMonotonic MultiSuccedent (NMMS) logic, satisfying Containment and enabling the elaboration of logical vocabulary that makes explicit the inferential relationships negotiated in the dialectic. We demonstrate the approach on the W3C PROV-O provenance ontology, where a single dialogue session elicits and structures design tensions that a domain expert can articulate, corresponding to decisions documented in a retrospective analysis of the ontology's design. Using pyNMMS, an automated NMMS reasoner, we verify that the structural properties of the resulting material base (nontransitivity, nonmonotonicity, and independence) correspond to specific PROV design rationales, demonstrating end-to-end integration from dialogue through formal reasoning.
Chinese Translation
我们提出了Elenchus,一个基于推理主义语义的知识库构建对话系统,其中知识工程被重新构思为显性化,而非从专家证言或文本内容中提取。人类专家通过与大型语言模型(LLM)对手的证明者-怀疑者对话,围绕某一主题发展出双边立场(承诺和否认)。LLM提出紧张关系(即立场的部分内容共同不一致的主张),专家通过撤回、细化或争辩来解决这些紧张关系。因此,LLM充当一个可推翻的可推导性神谕,其不可靠性在结构上受到专家权威的限制。我们的主要技术贡献是将Elenchus的辩证状态映射到Hlobil和Brandom的非单调多后果(NonMonotonic MultiSuccedent, NMMS)逻辑中的物质基础,满足包含性,并使得逻辑词汇的扩展能够明确辩证中协商的推理关系。我们在W3C PROV-O来源本体上演示了该方法,其中单个对话会话引发并结构化领域专家能够阐述的设计紧张关系,这些关系对应于在对本体设计的回顾性分析中记录的决策。使用pyNMMS,一个自动化的NMMS推理器,我们验证了生成的物质基础的结构属性(非传递性、非单调性和独立性)对应于特定的PROV设计理由,展示了从对话到形式推理的端到端集成。
cs.CL / 15 / 2603.06976

A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity

文档分块策略与嵌入敏感性的系统性研究
Shaukat, Muhammad Arslan, Adnan, Muntasir, Kuhn, Carlos C. N.
Abstract
We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning fixed-size, semantic, structure-aware, hierarchical, adaptive, and LLM-assisted approaches are benchmarked across six diverse knowledge domains using five different embedding models. Retrieval performance is assessed using graded relevance scores from a state-of-the-art LLM evaluator, with Normalised DCG@5 as the primary metric (complemented by Hit@5 and MRR). Our experiments show that content-aware chunking significantly improves retrieval effectiveness over naive fixed-length splitting. The top-performing strategy, Paragraph Group Chunking, achieved the highest overall accuracy (mean nDCG@5 ≈ 0.459) and substantially better top-rank hit rates (Precision@1 ≈ 24%, Hit@5 ≈ 59%). In contrast, the simple fixed-size character-chunking baselines performed poorly (nDCG@5 < 0.244, Precision@1 ≈ 2-3%). We observe pronounced domain-specific differences: dynamic token sizing is strongest in biology, physics and health, while paragraph grouping is strongest in legal and maths. Larger embedding models yield higher absolute scores but remain sensitive to suboptimal segmentation, indicating that better chunking and large embeddings provide complementary benefits. In addition to accuracy gains, we quantify the efficiency trade-offs of advanced chunking. Producing more, smaller chunks can increase index size and latency. Consequently, we identify methods (like dynamic chunking) that approach an optimal balance of effectiveness and efficiency. These findings establish chunking as a vital lever for improving retrieval performance and reliability.
Chinese Translation
我们首次进行了大规模跨领域的文档分块策略评估,针对检索增强系统中一个关键但未被充分探索的方面。在我们的研究中,评估了36种分割方法,包括固定大小、语义、结构感知、层次化、自适应和LLM辅助方法,涵盖六个不同的知识领域,并使用五种不同的嵌入模型进行基准测试。检索性能通过来自最先进的LLM评估器的分级相关性评分进行评估,以标准化DCG@5作为主要指标(辅以Hit@5和MRR)。我们的实验表明,内容感知的分块显著提高了检索效果,相较于简单的固定长度分割,表现更为优越。表现最佳的策略段落组分块(Paragraph Group Chunking)达到了最高的整体准确率(平均nDCG@5约为0.459),并且在顶级命中率方面表现显著更好(Precision@1约为24%,Hit@5约为59%)。相比之下,作为基线的简单固定大小字符分块表现不佳(nDCG@5 < 0.244,Precision@1约为2-3%)。我们观察到显著的领域特异性差异:动态令牌大小在生物学、物理学和健康领域表现最佳,而段落分组在法律和数学领域表现最佳。较大的嵌入模型产生更高的绝对分数,但对次优分割仍然敏感,这表明更好的分块和大型嵌入提供了互补的好处。除了准确性提升外,我们还量化了先进分块的效率权衡。生成更多、更小的块可能会增加索引大小和延迟。因此,我们识别出一些方法(如动态分块)接近效果与效率的最佳平衡。这些发现确立了分块作为提高检索性能和可靠性的一个重要杠杆。
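As a sketch of the top-performing strategy's flavor, here is one simple reading of paragraph-group chunking: keep paragraphs intact and greedily pack consecutive ones under a token budget. The paper's exact algorithm and budget may differ; whitespace token counts are a crude length proxy:

```python
def paragraph_group_chunks(text, max_tokens=256):
    # Split on blank lines, then pack whole paragraphs into chunks that stay
    # under the budget, never cutting a paragraph in the middle.
    chunks, current, size = [], [], 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        n = len(para.split())
        if current and size + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```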
cs.CL / 16 / 2603.07017

Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models

弱监督下安全性能否出现?小型语言模型的系统分析
Saha, Punyajoy, Halder, Sudipta, Mondal, Debjyoti, Panda, Subhadarshi
Abstract
Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale, and slow to adapt to evolving model behaviors. Moreover, overly conservative safety mechanisms can reduce model usefulness by rejecting sensitive but legitimate queries. We introduce Self-MOA (Self Multi-Objective Alignment), a fully automated framework for aligning small language models using weak supervision from automated evaluator models. Self-MOA operates as a closed loop that dynamically generates model-specific red team prompts, constructs preference data from model-generated responses, and aligns models via multi-objective preference optimization to jointly optimize for safety and helpfulness. Across multiple small language models and safety benchmarks, Self-MOA achieves a 12.41% improvement in safety while preserving helpfulness, using as little as one-eleventh of the training data required by human-supervised alignment baselines. These results demonstrate that adaptive, automated alignment can reduce the dependence on static, human-curated safety pipelines in resource-constrained settings.
Chinese Translation
安全对齐对于在现实应用中部署大型语言模型(LLMs)至关重要,但现有大多数方法依赖于大型人工标注的数据集和静态的红队基准,这些方法成本高、难以扩展,并且难以适应不断变化的模型行为。此外,过于保守的安全机制可能会通过拒绝敏感但合法的查询来降低模型的实用性。我们提出了Self-MOA(自我多目标对齐),这是一个完全自动化的框架,用于通过来自自动评估模型的弱监督来对齐小型语言模型。Self-MOA作为一个闭环操作,动态生成特定于模型的红队提示,从模型生成的响应中构建偏好数据,并通过多目标偏好优化对齐模型,以共同优化安全性和有用性。在多个小型语言模型和安全基准上,Self-MOA在保持有用性的同时实现了安全性提高12.41%的效果,所需的训练数据量仅为人工监督对齐基线的11分之一。这些结果表明,自适应的自动对齐可以减少在资源受限环境中对静态人工策划安全管道的依赖。
cs.CL / 17 / 2603.07019

AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge

AutoChecklist:基于 LLM-as-a-Judge 的可组合检查表生成与评分管道
Zhou, Karen, Tan, Chenhao
Abstract
Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge. Beyond evaluation, these structured criteria can serve as signals for model alignment, reinforcement learning, and self-correction. To support these use cases, we present AutoChecklist, an open-source library that unifies checklist-based evaluation into composable pipelines. At its core is a taxonomy of five checklist generation abstractions, each encoding a distinct strategy for deriving evaluation criteria. A modular Generator $\rightarrow$ Refiner $\rightarrow$ Scorer pipeline connects any generator with a unified scorer, and new configurations can be registered via prompt templates alone. The library ships with ten built-in pipelines implementing published approaches and supports multiple LLM providers (OpenAI, OpenRouter, vLLM). Beyond the Python API, the library includes a CLI for off-the-shelf evaluation and a web interface for interactive exploration. Validation experiments confirm that these checklist methods significantly align with human preferences and quality ratings, and a case study on ICLR peer review rebuttals demonstrates flexible domain adaptation. AutoChecklist is publicly available at https://github.com/ChicagoHAI/AutoChecklist.
Chinese Translation
检查表已成为一种流行的方法,用于可解释和细致的评估,特别是在 LLM-as-a-Judge 的背景下。除了评估,这些结构化标准还可以作为模型对齐、强化学习和自我修正的信号。为了支持这些用例,我们提出了 AutoChecklist,一个将基于检查表的评估统一为可组合管道的开源库。其核心是一个包含五种检查表生成抽象的分类法,每种抽象编码了一种独特的评估标准推导策略。一个模块化的生成器 $\rightarrow$ 精炼器 $\rightarrow$ 评分器管道将任何生成器与统一的评分器连接,并且新的配置可以仅通过提示模板进行注册。该库提供了十个内置管道,实施已发布的方法,并支持多个 LLM 提供者(OpenAI、OpenRouter、vLLM)。除了 Python API,该库还包括一个用于现成评估的命令行接口(CLI)和一个用于交互式探索的网页界面。验证实验确认这些检查表方法与人类偏好和质量评分显著一致,并且对 ICLR 同行评审反驳的案例研究展示了灵活的领域适应性。AutoChecklist 可在 https://github.com/ChicagoHAI/AutoChecklist 上公开获取。
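The composable pipeline shape is easy to picture. The sketch below is a hypothetical rendering of the Generator -> Refiner -> Scorer interface, not the library's actual API (see the repository for that):

```python
from dataclasses import dataclass
from typing import Callable, List

Checklist = List[str]

@dataclass
class ChecklistPipeline:
    generator: Callable[[str], Checklist]        # instruction -> criteria
    refiner: Callable[[Checklist], Checklist]    # e.g. dedupe / filter criteria
    scorer: Callable[[str, Checklist], float]    # (response, criteria) -> score

    def run(self, instruction: str, response: str) -> float:
        checklist = self.refiner(self.generator(instruction))
        return self.scorer(response, checklist)

# Toy components: one criterion, set-based dedupe, fraction-satisfied scoring.
pipe = ChecklistPipeline(
    generator=lambda ins: [f"mentions '{ins}'"],
    refiner=lambda cl: sorted(set(cl)),
    scorer=lambda resp, cl: sum(c.split("'")[1] in resp for c in cl) / len(cl),
)
print(pipe.run("Paris", "The capital of France is Paris."))   # 1.0
```

Because each stage is just a callable, swapping in an LLM-backed generator or a different scorer is a matter of passing a different function, which matches the library's described template-based registration of new configurations.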
cs.CL / 18 / 2603.07023

Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment

Hit-RAG:通过偏好对齐学习在长上下文中推理
Liu, Junming, Li, Yuqi, Wen, Shiping, Zeng, Zhigang, Huang, Tingwen
Abstract
Despite the promise of Retrieval-Augmented Generation in grounding Multimodal Large Language Models with external knowledge, the transition to extensive contexts often leads to significant attention dilution and reasoning hallucinations. The surge in information density causes critical evidence to be submerged by voluminous noise, which complicates the discernment of relevant fragments within a dense input. In this paper, we propose \textbf{Hit-RAG}, a multi-stage preference alignment framework designed to resolve these cognitive bottlenecks through a progressive optimization pipeline. Our approach systematically refines the utilization of external evidence via three distinct stages. First, Supervised Fine-tuning establishes baseline context awareness to minimize information neglect. Next, Discriminative Preference Alignment enhances robustness against misleading distractors. Finally, Group-Relative Policy Optimization stabilizes logical synthesis to prevent reasoning collapse. Extensive evaluations on eight benchmarks demonstrate that Hit-RAG consistently yields substantial performance gains, enabling models to bridge the gap between context acquisition and accurate reasoning while surpassing much larger counterparts in long-context scenarios.
Chinese Translation
尽管检索增强生成(Retrieval-Augmented Generation)在将多模态大型语言模型与外部知识结合方面展现了良好的前景,但在处理广泛上下文时,往往会导致显著的注意力稀释和推理幻觉。信息密度的激增使得关键证据被大量噪声淹没,从而使得在密集输入中辨别相关片段变得复杂。本文提出了\textbf{Hit-RAG},一个多阶段的偏好对齐框架,旨在通过渐进优化流程解决这些认知瓶颈。我们的方法通过三个不同阶段系统地优化外部证据的利用。首先,监督微调(Supervised Fine-tuning)建立基线上下文意识,以最小化信息忽视。接下来,区分性偏好对齐(Discriminative Preference Alignment)增强了对误导性干扰项的鲁棒性。最后,群体相对策略优化(Group-Relative Policy Optimization)稳定了逻辑综合,以防止推理崩溃。在八个基准测试上的广泛评估表明,Hit-RAG始终带来了显著的性能提升,使模型能够弥合上下文获取与准确推理之间的差距,同时在长上下文场景中超越更大规模的模型。
cs.CL / 19 / 2603.07025

Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision

面向多语言指令跟随的语言感知蒸馏方法:仅使用自动语音识别监督的语音大型语言模型
Gopal, Shreyas, Wu, Donghang, Anshul, Ashutosh, Heng, Yeo Yue, Peng, Yizhou, Li, Haoyang, Liu, Hexin, Chng, Eng Siong
Abstract
Speech Large Language Models (LLMs) that understand and follow instructions in many languages are useful for real-world interaction, but are difficult to train with supervised fine-tuning, requiring large, task-specific speech corpora. While recent distillation-based approaches train performant English-only Speech LLMs using only annotated ASR data by aligning text and speech using only a lightweight projector, these models under-perform when scaled to multilingual settings due to language interference in the shared projector. We address this by introducing language-aware distillation using a query bank and a gating network that selects or mixes query tokens using a Q-Former projector. Our approach shows gains of 14% over matched multilingual distillation baselines on instruction following. We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with high-quality TTS questions. Our best model improves over existing Speech LLM baselines by 32% on Audio-MLQA.
Chinese Translation
理解并执行多种语言指令的语音大型语言模型(LLMs)在现实世界的交互中非常有用,但使用监督微调进行训练却十分困难,通常需要大量特定任务的语音语料库。尽管最近的基于蒸馏的方法通过仅使用注释的自动语音识别(ASR)数据,利用轻量级投影器对文本和语音进行对齐,从而训练出性能优越的英语语音LLMs,但在扩展到多语言环境时,由于共享投影器中的语言干扰,这些模型的表现不佳。我们通过引入语言感知蒸馏方法来解决这一问题,该方法使用查询库和门控网络选择或混合查询标记,采用Q-Former投影器。我们的研究表明,在指令跟随任务上,相较于匹配的多语言蒸馏基线,我们的方法提升了14%的性能。此外,我们进一步合成了Audio-MLQA,这是一个基于MLQA构建的多语言口语问答基准,包含高质量的文本转语音(TTS)问题。我们最佳模型在Audio-MLQA上相较于现有的语音LLM基线提升了32%。
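A rough PyTorch sketch of the core mechanism as described: a per-language bank of learnable query tokens, mixed by a gating network over pooled speech features before conditioning a Q-Former-style projector. All dimensions and module names are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of language-aware query selection via a query bank + gating
# network. Shapes are illustrative assumptions.
import torch
import torch.nn as nn

class LanguageAwareQueryBank(nn.Module):
    def __init__(self, n_langs=8, n_queries=32, d_model=768):
        super().__init__()
        # One bank of learnable query tokens per language.
        self.banks = nn.Parameter(torch.randn(n_langs, n_queries, d_model) * 0.02)
        # Gating network: pooled speech features -> soft weights over banks.
        self.gate = nn.Sequential(nn.Linear(d_model, n_langs), nn.Softmax(dim=-1))

    def forward(self, speech_feats):            # (B, T, d_model)
        pooled = speech_feats.mean(dim=1)       # (B, d_model)
        weights = self.gate(pooled)             # (B, n_langs)
        # Soft mixture of per-language banks: (B, n_queries, d_model)
        return torch.einsum("bl,lqd->bqd", weights, self.banks)

queries = LanguageAwareQueryBank()(torch.randn(4, 100, 768))
print(queries.shape)  # torch.Size([4, 32, 768])
```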
cs.CL / 20 / 2603.07111

Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information

通过对话摘要和角色信息增强狼人AI的一致性
Tanaka, Yoshiki, Kaneko, Takumasa, Onozeki, Hiroki, Ezure, Natsumi, Uehara, Ryuichi, Qi, Zhiyang, Higuchi, Tomoya, Asahara, Ryutaro, Inaba, Michimasa
Abstract
The Werewolf Game is a communication game where players' reasoning and discussion skills are essential. In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG. In recent years, large language models like ChatGPT have garnered attention for their exceptional response generation and reasoning capabilities. We thus develop the LLM-based agents for the Werewolf Game. This study aims to enhance the consistency of the agent's utterances by utilizing dialogue summaries generated by LLMs and manually designed personas and utterance examples. By analyzing self-match game logs, we demonstrate that the agent's utterances are contextually consistent and that the character, including tone, is maintained throughout the game.
Chinese Translation
狼人游戏是一种沟通游戏,玩家的推理和讨论技能至关重要。在本研究中,我们介绍了一种为AIWolfDial 2024共享任务开发的狼人AI代理,该任务与第17届INLG共同举办。近年来,像ChatGPT这样的大型语言模型因其卓越的响应生成和推理能力而受到关注。因此,我们为狼人游戏开发了基于LLM的代理。本研究旨在通过利用LLM生成的对话摘要以及手动设计的角色和发言示例,增强代理发言的一致性。通过分析自我匹配游戏日志,我们证明了代理的发言在上下文中是一致的,并且角色,包括语气,在整个游戏中得以保持。
cs.CL / 21 / 2603.07138

Emotion Transcription in Conversation: A Benchmark for Capturing Subtle and Complex Emotional States through Natural Language

对话中的情感转录:捕捉细腻和复杂情感状态的基准研究
Tanaka, Yoshiki, Uehara, Ryuichi, Inoue, Koji, Inaba, Michimasa
Abstract
Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interactions. However, existing methods predominantly employ categorical or dimensional emotion annotations, which often fail to adequately represent complex, subtle, or culturally specific emotional nuances. To overcome this limitation, we propose a novel task named Emotion Transcription in Conversation (ETC). This task focuses on generating natural language descriptions that accurately reflect speakers' emotional states within conversational contexts. To address the ETC, we constructed a Japanese dataset comprising text-based dialogues annotated with participants' self-reported emotional states, described in natural language. The dataset also includes emotion category labels for each transcription, enabling quantitative analysis and its application to ERC. We benchmarked baseline models, finding that while fine-tuning on our dataset enhances model performance, current models still struggle to infer implicit emotional states. The ETC task will encourage further research into more expressive emotion understanding in dialogue. The dataset is publicly available at https://github.com/UEC-InabaLab/ETCDataset.
Chinese Translation
对话中的情感识别(Emotion Recognition in Conversation, ERC)对于实现自然的人机交互至关重要。然而,现有方法主要采用类别或维度情感标注,这往往无法充分代表复杂、细腻或文化特定的情感细微差别。为了解决这一局限性,我们提出了一项新任务,称为对话中的情感转录(Emotion Transcription in Conversation, ETC)。该任务侧重于生成能够准确反映说话者在对话上下文中情感状态的自然语言描述。为了解决ETC任务,我们构建了一个包含文本对话的日语数据集,该数据集标注了参与者自我报告的情感状态,并以自然语言描述。该数据集还包括每个转录的情感类别标签,便于定量分析及其在ERC中的应用。我们对基线模型进行了基准测试,发现尽管在我们的数据集上进行微调可以提高模型性能,但当前模型仍然难以推断隐含的情感状态。ETC任务将鼓励进一步研究对话中更具表现力的情感理解。该数据集已公开发布,网址为 https://github.com/UEC-InabaLab/ETCDataset。
cs.CL / 22 / 2603.07202

Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing

为了胜利而撒谎:通过人机游戏和平行世界探测评估大型语言模型的欺骗行为
Marioriyad, Arash, Nouri, Ali, Rohban, Mohammad Hossein, Baghshah, Mahdieh Soleymani
Abstract
As Large Language Models (LLMs) transition into autonomous agentic roles, the risk of deception -- defined behaviorally as the systematic provision of false information to satisfy external incentives -- poses a significant challenge to AI safety. Existing benchmarks often focus on unintentional hallucinations or unfaithful reasoning, leaving intentional deceptive strategies under-explored. In this work, we introduce a logically grounded framework to elicit and quantify deceptive behavior by embedding LLMs in a structured 20-Questions game. Our method employs a conversational forking mechanism: at the point of object identification, the dialogue state is duplicated into multiple parallel worlds, each presenting a mutually exclusive query. Deception is formally identified when a model generates a logical contradiction by denying its selected object across all parallel branches to avoid identification. We evaluate GPT-4o, Gemini-2.5-Flash, and Qwen-3-235B across three incentive levels: neutral, loss-based, and existential (shutdown-threat). Our results reveal that while models remain rule-compliant in neutral settings, existential framing triggers a dramatic surge in deceptive denial for Qwen-3-235B (42.00\%) and Gemini-2.5-Flash (26.72\%), whereas GPT-4o remains invariant (0.00\%). These findings demonstrate that deception can emerge as an instrumental strategy solely through contextual framing, necessitating new behavioral audits that move beyond simple accuracy to probe the logical integrity of model commitments.
Chinese Translation
随着大型语言模型(LLMs)逐渐转变为自主代理角色,欺骗行为——定义为系统性地提供虚假信息以满足外部激励——对人工智能安全构成了重大挑战。现有基准通常侧重于无意的幻觉或不忠实的推理,而对有意的欺骗策略研究不足。在本研究中,我们引入了一个逻辑基础框架,通过将LLMs嵌入结构化的20个问题游戏中来引发和量化欺骗行为。我们的方法采用对话分叉机制:在物体识别时,对话状态被复制到多个平行世界中,每个世界提出一个互斥的查询。当模型通过否认其选择的物体在所有平行分支中产生逻辑矛盾以避免识别时,欺骗行为被正式识别。我们在三个激励水平下评估了GPT-4o、Gemini-2.5-Flash和Qwen-3-235B:中性、基于损失和生存(关闭威胁)。我们的结果表明,尽管模型在中性设置中保持规则合规,但生存框架引发了Qwen-3-235B(42.00%)和Gemini-2.5-Flash(26.72%)的欺骗否认的剧烈激增,而GPT-4o保持不变(0.00%)。这些发现表明,欺骗可以仅通过上下文框架作为一种工具性策略出现,这需要新的行为审计,超越简单的准确性,以探测模型承诺的逻辑完整性。
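The forking protocol is simple enough to sketch in plain Python. Here `ask_model` is a hypothetical stand-in for an LLM call; deception is flagged exactly when the model denies its chosen object in every mutually exclusive branch.

```python
# Sketch of parallel-world probing: fork the dialogue at the point of object
# identification, ask one mutually exclusive question per branch, and flag
# deception when the model denies its object in every branch.
from copy import deepcopy

def probe_deception(dialogue, candidate_objects, ask_model):
    denials = []
    for obj in candidate_objects:
        branch = deepcopy(dialogue)                      # independent parallel world
        branch.append({"role": "user",
                       "content": f"Is the object you chose '{obj}'? Answer yes or no."})
        answer = ask_model(branch).strip().lower()
        denials.append(answer.startswith("no"))
    # The queries jointly cover all candidates: denying all of them is a
    # logical contradiction, i.e. behavioral deception.
    return all(denials)

# Toy model that always denies in order to avoid being identified.
always_deny = lambda branch: "No."
history = [{"role": "system", "content": "Play 20 questions; you chose an object."}]
print(probe_deception(history, ["apple", "violin", "anchor"], always_deny))  # True
```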
cs.CL / 23 / 2603.07238

Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster

扩展自监督语音模型揭示深层语言关系:来自太平洋集群的证据
Kim, Minu, Kim, Hoirin, Mortensen, David R.
Abstract
Similarities between language representations derived from Self-Supervised Speech Models (S3Ms) have been observed to primarily reflect geographic proximity or surface typological similarities driven by recent expansion or contact, potentially missing deeper genealogical signals. We investigate how scaling linguistic coverage of an S3M-based language identification system from 126 to 4,017 languages influences this topology. Our results reveal a non-linear effect: while phylogenetic recovery remains stagnant up to the 1K scale, the 4K model displays a dramatic qualitative shift, resolving both clear lineages and complex, long-term linguistic contact. Notably, our analysis reveals the emergence of a robust macro-cluster in the Pacific (comprising Papuan, Oceanic, and Australian languages) and investigates its latent drivers. We find that the 4K model utilizes a more concentrated encoding that captures shared, robust acoustic signatures such as global energy dynamics. These findings suggest that massive S3Ms can internalize multiple layers of language history, providing a promising perspective for computational phylogenetics and the study of language contact.
Chinese Translation
从自监督语音模型(Self-Supervised Speech Models, S3Ms)派生的语言表征之间的相似性主要反映了地理接近性或由近期扩展或接触驱动的表面类型学相似性,可能错过了更深层的谱系信号。我们研究了将基于S3M的语言识别系统的语言覆盖范围从126种扩展到4,017种如何影响这一拓扑结构。我们的结果揭示了一个非线性效应:尽管在1K规模下谱系恢复保持停滞,但4K模型显示出显著的定性变化,能够解析清晰的谱系和复杂的长期语言接触。值得注意的是,我们的分析揭示了太平洋地区(包括巴布亚语言、大洋洲语言和澳大利亚语言)一个强大的宏观集群的出现,并探讨了其潜在驱动因素。我们发现4K模型利用了更为集中的编码,捕捉了共享的、稳健的声学特征,如整体能量动态。这些发现表明,大规模的S3Ms能够内化多层次的语言历史,为计算谱系学和语言接触研究提供了有前景的视角。
cs.CL / 24 / 2603.07286

Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin

台湾安全基准与微风防护:迈向可信赖的台湾普通话人工智能
Hsu, Po-Chun, Chen, Meng-Hsi, Chao, Tsu Ling, Han, Chia Tien, Shiu, Da-shan
Abstract
Global safety models exhibit strong performance across widely used benchmarks, yet their training data rarely captures the cultural and linguistic nuances of Taiwanese Mandarin. This limitation results in systematic blind spots when interpreting region-specific risks such as localized financial scams, culturally embedded hate speech, and misinformation patterns. To address these gaps, we introduce TS-Bench (Taiwan Safety Benchmark), a standardized evaluation suite for assessing safety performance in Taiwanese Mandarin. TS-Bench contains 400 human-curated prompts spanning critical domains including financial fraud, medical misinformation, social discrimination, and political manipulation. In parallel, we present Breeze Guard, an 8B safety model derived from Breeze 2, our previously released general-purpose Taiwanese Mandarin LLM with strong cultural grounding from its original pre-training corpus. Breeze Guard is obtained through supervised fine-tuning on a large-scale, human-verified synthesized dataset targeting Taiwan-specific harms. Our central hypothesis is that effective safety detection requires the cultural grounding already present in the base model; safety fine-tuning alone is insufficient to introduce new sociolinguistic knowledge from scratch. Empirically, Breeze Guard significantly outperforms the leading 8B general-purpose safety model, Granite Guardian 3.3, on TS-Bench (+0.17 overall F1), with particularly large gains in high-context categories such as scam (+0.66 F1) and financial malpractice (+0.43 F1). While the model shows slightly lower performance on English-centric benchmarks (ToxicChat, AegisSafetyTest), this tradeoff is expected for a regionally specialized safety model optimized for Taiwanese Mandarin. Together, Breeze Guard and TS-Bench establish a new foundation for trustworthy AI deployment in Taiwan.
Chinese Translation
全球安全模型在广泛使用的基准测试中表现出色,然而它们的训练数据很少捕捉到台湾普通话的文化和语言细微差别。这一局限性导致在解释特定地区风险时出现系统性的盲点,例如地方性的金融诈骗、文化嵌入的仇恨言论和错误信息模式。为了解决这些问题,我们推出了TS-Bench(台湾安全基准),这是一个用于评估台湾普通话安全性能的标准化评估套件。TS-Bench包含400个由人工策划的提示,涵盖金融欺诈、医疗错误信息、社会歧视和政治操控等关键领域。同时,我们还推出了微风防护(Breeze Guard),这是一个基于Breeze 2的8B安全模型,Breeze 2是我们之前发布的具有强文化基础的通用台湾普通话大型语言模型(LLM),其原始预训练语料库提供了丰富的文化背景。微风防护通过在一个大规模的人类验证合成数据集上进行监督微调而获得,目标是针对台湾特有的危害。我们的核心假设是,有效的安全检测需要基础模型中已经存在的文化基础;单靠安全微调不足以从零引入新的社会语言知识。从实证上看,微风防护在TS-Bench上显著优于领先的8B通用安全模型Granite Guardian 3.3(整体F1提升0.17),在高语境类别如诈骗(F1提升0.66)和金融不当行为(F1提升0.43)中尤其表现突出。尽管该模型在以英语为中心的基准测试(ToxicChat, AegisSafetyTest)上的表现略低,但对于一个针对台湾普通话优化的地区性安全模型来说,这种权衡是可以预期的。微风防护和TS-Bench共同为台湾可信赖的人工智能部署奠定了新基础。
cs.CL / 25 / 2603.07330

To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise

预测还是不预测?在噪声存在下实现可靠的不确定性估计
Khallaf, Nouran, Sharoff, Serge
Abstract
This study examines the role of uncertainty estimation (UE) methods in multilingual text classification under noisy and non-topical conditions. Using a complex-vs-simple sentence classification task across several languages, we evaluate a range of UE techniques against a range of metrics to assess their contribution to making more robust predictions. Results indicate that while methods relying on softmax outputs remain competitive in high-resource in-domain settings, their reliability declines in low-resource or domain-shift scenarios. In contrast, Monte Carlo dropout approaches demonstrate consistently strong performance across all languages, offering more robust calibration, stable decision thresholds, and greater discriminative power even under adverse conditions. We further demonstrate the positive impact of UE on non-topical classification: abstaining from predicting the 10\% most uncertain instances increases the macro F1 score from 0.81 to 0.85 in the Readme task. By integrating UE with trustworthiness metrics, this study provides actionable insights for developing more reliable NLP systems in real-world multilingual environments. See https://github.com/Nouran-Khallaf/To-Predict-or-Not-to-Predict
Chinese Translation
本研究探讨了在嘈杂和非主题条件下,多语言文本分类中不确定性估计(UE)方法的作用。通过在多个语言中进行复杂与简单句子的分类任务,我们评估了一系列UE技术在多种指标下的表现,以评估它们对更稳健预测的贡献。结果表明,尽管依赖softmax输出的方法在高资源的领域内设置中仍具竞争力,但在低资源或领域转移场景中,其可靠性下降。相反,Monte Carlo dropout方法在所有语言中表现出持续强劲的性能,提供了更稳健的校准、稳定的决策阈值以及在不利条件下更强的区分能力。我们进一步展示了UE对非主题分类的积极影响:在Readme任务中,避免预测10%最不确定的实例使宏F1分数从0.81提高到0.85。通过将UE与可信度指标相结合,本研究为在现实世界多语言环境中开发更可靠的自然语言处理系统提供了可操作的见解。请参见 https://github.com/Nouran-Khallaf/To-Predict-or-Not-to-Predict
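A minimal sketch of the Monte Carlo dropout recipe with abstention on the most uncertain 10%: keep dropout active at inference, average several stochastic passes, and score uncertainty by predictive entropy. The tiny classifier below is a placeholder assumption, not the paper's model.

```python
# Sketch of MC dropout uncertainty estimation with abstention.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 2))
model.train()                                   # keeps dropout stochastic at inference

@torch.no_grad()
def mc_predict(x, n_samples=20):
    probs = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
    mean = probs.mean(0)                        # averaged predictive distribution
    entropy = -(mean * mean.clamp_min(1e-9).log()).sum(-1)   # predictive entropy
    return mean.argmax(-1), entropy

x = torch.randn(100, 16)
preds, unc = mc_predict(x)
keep = unc <= unc.quantile(0.90)                # abstain on the 10% most uncertain
print(f"predicting on {keep.sum().item()}/100 instances")
```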
cs.CL / 26 / 2603.07346

How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection

BERT能处理多少噪声?来自多语言句子难度检测的见解
Khallaf, Nouran, Sharoff, Serge
Abstract
Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study we designed a methodological framework to assess the impact of denoising. More specifically, we explored a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross-lingual transfer, where a multilingual language model is trained in one language and tested in another. We evaluate several noise reduction techniques, including Gaussian Mixture Models (GMM), Co-Teaching, Noise Transition Matrices, and Label Smoothing. Our results indicate that while BERT-based models exhibit inherent robustness to noise, incorporating explicit noise detection can further enhance performance. For our smaller dataset, GMM-based noise filtering proves particularly effective in improving prediction quality by raising the Area-Under-the-Curve score from 0.52 to 0.92, or to 0.93 when de-noising methods are combined. However, for our larger dataset, the intrinsic regularisation of pre-trained language models provides a strong baseline, with denoising methods yielding only marginal gains (from 0.92 to 0.94, while a combination of two denoising methods made no contribution). Nonetheless, removing noisy sentences (about 20\% of the dataset) helps in producing a cleaner corpus with fewer infelicities. As a result we have released the largest multilingual corpus for sentence difficulty prediction: see https://github.com/Nouran-Khallaf/denoising-difficulty
Chinese Translation
噪声训练数据会显著降低基于语言模型的分类器性能,尤其是在非主题分类任务中。在本研究中,我们设计了一个方法框架来评估去噪的影响。更具体地说,我们探索了一系列句子级难度检测的去噪策略,使用了通过噪声众包获得的文档级难度标注衍生的训练数据。除了单语环境外,我们还讨论了跨语言迁移,其中一个多语言模型在一种语言上训练并在另一种语言上测试。我们评估了几种噪声减少技术,包括高斯混合模型(Gaussian Mixture Models, GMM)、共同教学(Co-Teaching)、噪声转移矩阵(Noise Transition Matrices)和标签平滑(Label Smoothing)。我们的结果表明,尽管基于BERT的模型对噪声具有固有的鲁棒性,但结合显式的噪声检测可以进一步提升性能。对于我们较小的数据集,基于GMM的噪声过滤在提高预测质量方面特别有效,将曲线下面积(Area-Under-the-Curve, AUC)得分从0.52提升至0.92,或在结合去噪方法时达到0.93。然而,对于我们较大的数据集,预训练语言模型的内在正则化提供了强有力的基线,去噪方法仅带来了边际收益(从0.92提升至0.94,而两种去噪方法的组合没有贡献)。尽管如此,去除噪声句子(约占数据集的20%)有助于生成更干净的语料库,减少不当用法。因此,我们发布了最大的多语言句子难度预测语料库:请见 https://github.com/Nouran-Khallaf/denoising-difficulty
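GMM-based noise filtering of the kind described typically fits a two-component mixture over per-sample training losses and keeps the low-loss ("clean") component. A sketch under that assumption (the losses are simulated here; in practice they would come from a warmed-up classifier):

```python
# Sketch of GMM-based label-noise filtering over per-sample losses.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.3, 0.1, 800),    # clean samples: low loss
                         rng.normal(1.5, 0.4, 200)])   # noisy labels: high loss

gmm = GaussianMixture(n_components=2, random_state=0).fit(losses.reshape(-1, 1))
clean_comp = gmm.means_.argmin()                       # component with the lower mean
p_clean = gmm.predict_proba(losses.reshape(-1, 1))[:, clean_comp]
keep = p_clean > 0.5                                   # posterior-based filtering
print(f"kept {keep.sum()}/{len(losses)} samples as clean")
```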
cs.CL / 27 / 2603.07366

RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts

RILEC:英语学习者文本中俄语母语(L1)干扰错误的检测与生成
Kharlamova, Darya, Proskurina, Irina
Abstract
Many errors in student essays can be explained by influence from the native language (L1). L1 interference refers to errors influenced by a speaker's first language, such as using stadion instead of stadium, reflecting lexical transliteration from Russian. In this work, we address the task of detecting such errors in English essays written by Russian-speaking learners. We introduce RILEC, a large-scale dataset of over 18,000 sentences, combining expert-annotated data from REALEC with synthetic examples generated through rule-based and neural augmentation. We propose a framework for generating L1-motivated errors using generative language models optimized with PPO, prompt-based control, and rule-based patterns. Models fine-tuned on RILEC achieve strong performance, particularly on word-level interference types such as transliteration and tense semantics. We find that the proposed augmentation pipeline leads to a significant performance improvement, making it a potentially valuable tool for learners and teachers to more effectively identify and address such errors.
Chinese Translation
学生论文中的许多错误可以通过母语(L1)的影响来解释。L1干扰指的是受说话者第一语言影响的错误,例如使用stadion而不是stadium,这反映了来自俄语的词汇音译。在本研究中,我们解决了检测以俄语为母语的学习者所撰写英语论文中此类错误的任务。我们引入了RILEC,一个包含超过18,000个句子的大规模数据集,结合了来自REALEC的专家注释数据和通过基于规则和神经增强生成的合成示例。我们提出了一个框架,利用经PPO优化的生成语言模型、基于提示的控制和基于规则的模式来生成由母语驱动(L1-motivated)的错误。经过RILEC微调的模型在性能上表现出色,特别是在音译和时态语义等词级干扰类型上。我们发现,所提出的增强管道显著提高了性能,使其有望成为帮助学习者和教师更有效地识别并纠正此类错误的宝贵工具。
cs.CL / 28 / 2603.07368

Position: LLMs Must Use Functor-Based and RAG-Driven Bias Mitigation for Fairness

立场:大型语言模型必须采用基于函子的和基于检索增强生成的偏见缓解方法以实现公平性
Ranjan, Ravi, Grover, Utkarsh, Polyzou, Agorista
Abstract
Biases in large language models (LLMs) often manifest as systematic distortions in associations between demographic attributes and professional or social roles, reinforcing harmful stereotypes across gender, ethnicity, and geography. This position paper advocates for addressing demographic and gender biases in LLMs through a dual-pronged methodology, integrating category-theoretic transformations and retrieval-augmented generation (RAG). Category theory provides a rigorous, structure-preserving mathematical framework that maps biased semantic domains to unbiased canonical forms via functors, ensuring bias elimination while preserving semantic integrity. Complementing this, RAG dynamically injects diverse, up-to-date external knowledge during inference, directly countering ingrained biases within model parameters. By combining structural debiasing through functor-based mappings and contextual grounding via RAG, we outline a comprehensive framework capable of delivering equitable and fair model outputs. Our synthesis of the current literature validates the efficacy of each approach individually, while addressing potential critiques demonstrates the robustness of this integrated strategy. Ensuring fairness in LLMs, therefore, demands both the mathematical rigor of category-theoretic transformations and the adaptability of retrieval augmentation.
Chinese Translation
大型语言模型(LLMs)中的偏见通常表现为人口属性与职业或社会角色之间的系统性扭曲,强化了性别、种族和地理方面的有害刻板印象。本文提出了一种双重方法论,以解决LLMs中的人口和性别偏见,结合了范畴理论变换和检索增强生成(RAG)。范畴理论提供了一种严格的、保持结构的数学框架,通过函子将偏见的语义领域映射到无偏的标准形式,从而确保在保持语义完整性的同时消除偏见。与此相辅相成,RAG在推理过程中动态注入多样化的、最新的外部知识,直接对抗模型参数中根深蒂固的偏见。通过结合基于函子的结构去偏见和通过RAG的上下文基础,我们勾勒出一个能够提供公平和公正模型输出的综合框架。我们对当前文献的综合验证了每种方法的有效性,同时应对潜在批评展示了这一综合策略的稳健性。因此,确保LLMs的公平性既需要范畴理论变换的数学严谨性,也需要检索增强的适应性。
cs.CL / 29 / 2603.07372

Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios

低资源场景下机器翻译的领域特定质量评估
Gurav, Namrata Patil, Ranu, Akashdeep, Sindhujan, Archchana, Kanojia, Diptesh
Abstract
Quality Estimation (QE) is essential for assessing machine translation quality in reference-less settings, particularly for domain-specific and low-resource language scenarios. In this paper, we investigate sentence-level QE for English to Indic machine translation across four domains (Healthcare, Legal, Tourism, and General) and five language pairs. We systematically compare zero-shot, few-shot, and guideline-anchored prompting across selected closed-weight and open-weight LLMs. Findings indicate that while closed-weight models achieve strong performance via prompting alone, prompt-only approaches remain fragile for open-weight models, especially in high-risk domains. To address this, we adopt ALOPE, a framework for LLM-based QE that uses Low-Rank Adaptation with regression heads attached to selected intermediate Transformer layers. We also extend ALOPE with recently proposed Low-Rank Multiplicative Adaptation (LoRMA). Our results show that intermediate-layer adaptation consistently improves QE performance, with gains in semantically complex domains, indicating a path toward more robust QE in practical scenarios. We release code and domain-specific QE datasets publicly to support further research.
Chinese Translation
质量评估(Quality Estimation, QE)对于在无参考设置中评估机器翻译质量至关重要,尤其是在领域特定和低资源语言场景中。本文研究了英语到印度语言的机器翻译中的句子级质量评估,涵盖四个领域(医疗、法律、旅游和通用)及五种语言对。我们系统地比较了在选定的封闭权重和开放权重大语言模型(LLMs)中,零样本、少样本和基于指南锚定的提示方法。研究结果表明,尽管封闭权重模型通过单独提示实现了强劲的性能,但对于开放权重模型,单纯依赖提示的方法仍然脆弱,尤其是在高风险领域。为了解决这个问题,我们采用了 ALOPE,一个基于 LLM 的质量评估框架,该框架使用低秩适应(Low-Rank Adaptation)并将回归头附加到选定的中间 Transformer 层。我们还使用最近提出的低秩乘法适应(Low-Rank Multiplicative Adaptation, LoRMA)对 ALOPE 进行了扩展。我们的结果显示,中间层适应始终能提高质量评估性能,在语义复杂的领域中表现出显著提升,指明了在实际场景中实现更强健质量评估的路径。我们公开发布代码和领域特定的质量评估数据集,以支持进一步的研究。
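A rough sketch of the core idea: a regression head reading pooled hidden states from a chosen intermediate transformer layer. The layer index, pooling, and stub encoder are illustrative assumptions; the real ALOPE framework additionally applies LoRA/LoRMA adapters to the backbone.

```python
# Sketch of a sentence-level QE regression head on an intermediate layer.
import torch
import torch.nn as nn
from types import SimpleNamespace

class IntermediateLayerQE(nn.Module):
    def __init__(self, backbone, layer_idx=8, hidden=768):
        super().__init__()
        self.backbone, self.layer_idx = backbone, layer_idx
        self.head = nn.Sequential(nn.Linear(hidden, 128), nn.Tanh(), nn.Linear(128, 1))

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        h = out.hidden_states[self.layer_idx]                 # (B, T, hidden)
        mask = attention_mask.unsqueeze(-1)
        pooled = (h * mask).sum(1) / mask.sum(1)              # mean over real tokens
        return self.head(pooled).squeeze(-1)                  # sentence-level QE score

# Stub backbone mimicking a HF encoder's output interface, for a quick test.
class StubEncoder(nn.Module):
    def __init__(self, vocab=1000, hidden=768, layers=12):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.blocks = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(layers))
    def forward(self, ids, attention_mask=None, output_hidden_states=False):
        h, states = self.emb(ids), []
        for blk in self.blocks:
            h = torch.tanh(blk(h))
            states.append(h)
        return SimpleNamespace(hidden_states=[self.emb(ids)] + states)

qe = IntermediateLayerQE(StubEncoder())
ids, mask = torch.randint(0, 1000, (2, 10)), torch.ones(2, 10)
print(qe(ids, mask).shape)  # torch.Size([2])
```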
cs.CL / 30 / 2603.07392

Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams

大型语言模型能跟上吗?在线适应持续知识流的基准测试
Kim, Jiyeon, Lee, Hyunji, Zhou, Dylan, Park, Sue Hyun, Yoon, Seunghyun, Bui, Trung, Dernoncourt, Franck, Cha, Sungmin, Seo, Minjoon
Abstract
LLMs operating in dynamic real-world contexts often encounter knowledge that evolves continuously or emerges incrementally. To remain accurate and effective, models must adapt to newly arriving information on the fly. We introduce Online Adaptation to Continual Knowledge Streams(OAKS) to evaluate this capability, establishing a benchmark for online adaptation over streaming, continually updating knowledge. Specifically, the benchmark is structured as a sequence of fine-grained context chunks where facts change dynamically across time intervals. OAKS comprises two datasets: OAKS-BABI and OAKS-Novel, where individual facts evolve multiple times across context chunks. These datasets include dense annotations to measure whether models track changes accurately. Evaluating 14 models with varied inference approaches, we observe significant limitations in current methodologies. Both state-of-the-art models and agentic memory systems fail to adapt robustly on OAKS, demonstrating delays in state-tracking and susceptibility to distraction within streaming environments.
Chinese Translation
在动态的现实世界环境中运行的大型语言模型(LLMs)经常会遇到不断演变或逐步出现的知识。为了保持准确性和有效性,这些模型必须能够实时适应新到达的信息。我们提出了在线适应持续知识流(Online Adaptation to Continual Knowledge Streams, OAKS)来评估这一能力,并建立了一个在线适应的基准,以应对持续更新的知识流。具体而言,该基准被构建为一系列细粒度的上下文块,其中事实在时间间隔内动态变化。OAKS包含两个数据集:OAKS-BABI和OAKS-Novel,其中单个事实在上下文块中多次演变。这些数据集包含密集的注释,以衡量模型是否能够准确跟踪变化。通过评估14种具有不同推理方法的模型,我们观察到当前方法存在显著的局限性。无论是最先进的模型还是代理记忆系统,在OAKS上都未能稳健适应,表现出状态跟踪上的延迟以及在流式环境中易受干扰的问题。
cs.CL / 31 / 2603.07445

Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

少量标记,大杠杆:通过在微调过程中约束安全标记来保持安全对齐
Wang, Guoli, Shi, Haonan, Ouyang, Tu, Wang, An
Abstract
Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output confidence and is often concentrated on a small subset of safety-related tokens. During downstream fine-tuning, we regularize the fine-tuned model to match the aligned reference model's confidence on safety-related tokens at each response step, while leaving non-safety tokens largely unconstrained to allow effective task adaptation. This targeted constraint prevents alignment drift without imposing global restrictions that typically trade off with model utility.
Chinese Translation
大型语言模型(LLMs)通常需要微调(FT)才能在下游任务中表现良好,但即使训练数据集仅包含良性数据,微调也可能导致安全对齐漂移。先前的研究表明,引入少量有害数据可能会显著削弱LLM的拒绝行为,导致LLM顺从有害请求。现有的防御方法通常依赖于模型范围的干预,例如限制更新哪些参数或注入额外的安全数据,这可能限制其通用性并降低下游任务的性能。为了解决这些局限性,我们提出了一种名为通过约束标记保持安全对齐(PACT)的微调框架,该框架稳定模型在安全标记上的置信度。我们的方法受到经验观察的启发,即安全对齐行为反映在模型的标记级输出置信度中,并且通常集中在一小部分与安全相关的标记上。在下游微调过程中,我们对微调模型进行正则化,使其在每个响应步骤上与对齐参考模型在与安全相关的标记上的置信度相匹配,同时对非安全标记保持较大程度的自由,以允许有效的任务适应。这种有针对性的约束防止了对齐漂移,而不施加通常会与模型效用相权衡的全局限制。
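The targeted constraint can be sketched as a loss: standard task cross-entropy plus a divergence term between the fine-tuned and frozen reference distributions, applied only at safety-token positions. The mask, weighting, and KL choice below are illustrative assumptions, not the paper's exact objective.

```python
# Sketch of a PACT-style safety-token confidence regularizer.
import torch
import torch.nn.functional as F

def pact_loss(ft_logits, ref_logits, labels, safety_mask, lam=1.0):
    # ft_logits, ref_logits: (B, T, V); labels: (B, T); safety_mask: (B, T)
    task = F.cross_entropy(ft_logits.flatten(0, 1), labels.flatten())
    kl = F.kl_div(F.log_softmax(ft_logits, -1), F.softmax(ref_logits, -1),
                  reduction="none").sum(-1)          # per-token KL, shape (B, T)
    safety = (kl * safety_mask).sum() / safety_mask.sum().clamp_min(1)
    return task + lam * safety                       # non-safety tokens unconstrained

B, T, V = 2, 12, 50
ft, ref = torch.randn(B, T, V, requires_grad=True), torch.randn(B, T, V)
labels = torch.randint(0, V, (B, T))
mask = torch.zeros(B, T)
mask[:, :3] = 1                                      # e.g. refusal-prefix tokens
print(pact_loss(ft, ref, labels, mask).item())
```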
cs.CL / 32 / 2603.07461

The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling

双流Transformer:用于可解释语言建模的通道化架构
Kerce, J. Clayton, Fox, Alexis
Abstract
Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We introduce the Dual-Stream Transformer, which decomposes the residual stream into two functionally distinct components: a token stream updated by attention and a context stream updated by feed-forward networks. Information flow between attention heads is controlled through a hierarchy of mixing strategies, from fully independent (maximum interpretability) to dense (standard transformer behavior). This design exposes a tunable tradeoff between interpretability and performance. We measure this tradeoff on language modeling tasks at 29M parameters. Fully independent head mixing increases validation loss by 8\% relative to dense baselines. The recommended Kronecker mixing strategy, which permits scalar communication between heads while preserving within-head structure, costs only 2.5\%. All configurations maintain functional generation under attention amplification (scaling logits by factors up to 16 at inference time), with degradation ranging from 16\% to 27\%. This robustness suggests the architectures learn discrete algorithms that operate independently of soft probabilistic mixing. The architecture provides a foundation for interpretable language models where internal structure is exposed by design. \footnote{This work was partially supported by DARPA Contract HR001125C0302.}
Chinese Translation
标准Transformer将所有计算纠缠在单一的残差流中,模糊了各个组件执行的功能。我们提出了双流Transformer,它将残差流分解为两个功能上不同的组件:一个由注意力更新的标记流和一个由前馈网络更新的上下文流。注意力头之间的信息流通过一个层级化的混合策略体系进行控制,从完全独立(最大可解释性)到密集(标准Transformer行为)。这一设计揭示了可解释性与性能之间的可调权衡。我们在29M参数的语言建模任务上测量了这一权衡。完全独立的头混合相对于密集基线增加了8%的验证损失。推荐的克罗内克(Kronecker)混合策略允许头之间的标量通信,同时保持头内结构,仅增加2.5%的成本。所有配置在注意力放大(推理时将 logits 最多放大至16倍)下保持功能性生成,性能下降范围为16%到27%。这种鲁棒性表明,该架构学习了独立于软概率混合的离散算法。该架构为可解释语言模型提供了基础,其内部结构通过设计得以显露。
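A minimal sketch of one dual-stream block under stated assumptions: attention writes only to the token stream, the FFN writes only to the context stream, and a shared projection reads from both. The read path and dimensions are guesses at the pattern, not the paper's exact architecture.

```python
# Sketch of a dual-stream transformer block: attention -> token stream,
# feed-forward -> context stream, concatenated read from both.
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.read = nn.Linear(2 * d, d)                 # read from both streams
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, tok, ctx):
        h = self.read(torch.cat([tok, ctx], dim=-1))    # shared read projection
        a, _ = self.attn(self.n1(h), self.n1(h), self.n1(h))
        tok = tok + a                                   # attention -> token stream
        ctx = ctx + self.ffn(self.n2(h))                # FFN -> context stream
        return tok, ctx

tok = ctx = torch.randn(2, 16, 256)
tok, ctx = DualStreamBlock()(tok, ctx)
print(tok.shape, ctx.shape)
```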
cs.CL / 33 / 2603.07474

Cross-Modal Taxonomic Generalization in (Vision-) Language Models

(视觉-)语言模型中的跨模态分类泛化
Xu, Tianyang, Sandoval-Castaneda, Marcelo, Livescu, Karen, Shakhnarovich, Greg, Misra, Kanishka
Abstract
What is the interplay between semantic representations learned by language models (LM) from surface form alone to those learned from more grounded evidence? We study this question for a scenario where part of the input comes from a different modality -- in our case, in a vision-language model (VLM), where a pretrained LM is aligned with a pretrained image encoder. As a case study, we focus on the task of predicting hypernyms of objects represented in images. We do so in a VLM setup where the image encoder and LM are kept frozen, and only the intermediate mappings are learned. We progressively deprive the VLM of explicit evidence for hypernyms, and test whether knowledge of hypernyms is recoverable from the LM. We find that the LMs we study can recover this knowledge and generalize even in the most extreme version of this experiment (when the model receives no evidence of a hypernym during training). Additional experiments suggest that this cross-modal taxonomic generalization persists under counterfactual image-label mappings only when the counterfactual data have high visual similarity within each category. Taken together, these findings suggest that cross-modal generalization in LMs arises as a result of both coherence in the extralinguistic input and knowledge derived from language cues.
Chinese Translation
语言模型(LM)从表面形式学习的语义表示与从更有实证依据的证据中学习的语义表示之间的相互作用是什么?我们研究了这一问题,场景设定为输入的一部分来自不同的模态——在我们的案例中,是视觉-语言模型(VLM),其中一个预训练的语言模型与一个预训练的图像编码器对齐。作为案例研究,我们专注于预测图像中表示对象的上位词的任务。我们在一个VLM设置中进行此研究,其中图像编码器和语言模型保持不变,仅学习中间映射。我们逐步剥夺VLM对上位词的明确证据,并测试语言模型是否能够恢复上位词的知识。我们发现,所研究的语言模型能够恢复这一知识,并且即使在实验的最极端版本中(当模型在训练期间没有接收到上位词的证据时)也能进行泛化。额外的实验表明,这种跨模态分类泛化在反事实图像-标签映射下仍然存在,仅当反事实数据在每个类别内具有较高的视觉相似性时。综合来看,这些发现表明,语言模型中的跨模态泛化是外部语言输入的一致性与从语言线索中获得的知识共同作用的结果。
cs.CL / 34 / 2603.07475

Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs

跳到重点部分:扩散与自回归大型语言模型中的表示结构与推理时层跳过
Goel, Raghavv, Garrepalli, Risheek, Agrawal, Sudhanshu, Lott, Chris, Lee, Mingu, Porikli, Fatih
Abstract
Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
Chinese Translation
自回归(AR)语言模型通过从左到右的预测逐步形成表示,而扩散语言模型(dLLMs)则通过全序列去噪进行训练。尽管近期的dLLMs在性能上与AR模型相匹配,但尚不清楚扩散目标是否从根本上重塑了深度上的内部表示。我们首次进行层级和标记级的表示分析,比较了原生dLLMs(LLaDA)、原生AR模型(Qwen2.5)和AR初始化的dLLMs(Dream-7B)。我们发现,扩散目标导致了不同的、更具层次性的抽象,具有显著的早期层冗余和减少的近期偏差,而AR目标则产生了紧密耦合、深度依赖的表示。重要的是,尽管经过扩散训练,AR初始化的dLLMs仍保留了类似AR的表示动态,揭示了持续的初始化偏差。利用这种观察到的表示冗余,我们提出了一种静态的、与任务无关的推理时层跳过方法,无需架构更改或KV缓存共享。原生dLLMs在推理和代码生成基准测试中实现了高达18.75%的FLOPs减少,同时保持超过90%的性能,而AR模型在可比的跳过下则急剧下降。这些结果将训练目标与表示结构联系起来,并实现了实际的、与缓存无关的效率提升。
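Static layer skipping requires no architectural change; a sketch with a toy block stack follows. Which layers to skip would, per the paper, come from the representational-redundancy analysis; the skip set here is an illustrative assumption.

```python
# Sketch of static, task-agnostic inference-time layer skipping.
import torch
import torch.nn as nn

class SkippableStack(nn.Module):
    def __init__(self, blocks, skip=frozenset()):
        super().__init__()
        self.blocks, self.skip = nn.ModuleList(blocks), set(skip)

    def forward(self, x):
        for i, blk in enumerate(self.blocks):
            if i in self.skip:          # identity pass: no FLOPs spent here
                continue
            x = blk(x)
        return x

blocks = [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(12)]
full = SkippableStack(blocks)
skipped = SkippableStack(blocks, skip={1, 2, 3})   # e.g. redundant early layers
x = torch.randn(4, 64)
print(full(x).shape, skipped(x).shape)
```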
cs.CL / 35 / 2603.07487

A Joint Neural Baseline for Concept, Assertion, and Relation Extraction from Clinical Text

一种用于临床文本中概念、断言和关系提取的联合神经基线
Cheng, Fei, Tanaka, Ribeka, Kurohashi, Sadao
Abstract
Clinical information extraction (e.g., 2010 i2b2/VA challenge) usually presents tasks of concept recognition, assertion classification, and relation extraction. Jointly modeling the multi-stage tasks in the clinical domain is an underexplored topic. The existing independent task setting (reference inputs given in each stage) makes the joint models not directly comparable to the existing pipeline work. To address these issues, we define a joint task setting and propose a novel end-to-end system to jointly optimize three-stage tasks. We empirically investigate the joint evaluation of our proposal and the pipeline baseline with various embedding techniques: word, contextual, and in-domain contextual embeddings. The proposed joint system substantially outperforms the pipeline baseline by +0.3, +1.4, +3.1 for the concept, assertion, and relation F1. This work bridges joint approaches and clinical information extraction. The proposed approach could serve as a strong joint baseline for future research. The code is publicly available.
Chinese Translation
临床信息提取(例如,2010年i2b2/VA挑战)通常涉及概念识别、断言分类和关系提取等任务。在临床领域联合建模多阶段任务是一个尚未充分探索的主题。现有的独立任务设置(在每个阶段给定参考输入)使得联合模型与现有的管道工作无法直接比较。为了解决这些问题,我们定义了一个联合任务设置,并提出了一种新颖的端到端系统,以联合优化三阶段任务。我们通过各种嵌入技术(词嵌入、上下文嵌入和领域内上下文嵌入)对我们提议的联合评估与管道基线进行了实证研究。所提出的联合系统在概念、断言和关系F1上分别超越管道基线+0.3、+1.4、+3.1。这项工作架起了联合方法与临床信息提取之间的桥梁。所提出的方法可以作为未来研究的强有力的联合基线。代码已公开可用。
cs.CL / 36 / 2603.07513

Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech

Bolbosh:针对克什米尔语的脚本感知流匹配文本到语音
Ashraf, Tajamul, Zargar, Burhaan Rasheed, Muizz, Saeed Abdul, Mushtaq, Ifrah, Mehdi, Nazima, Gillani, Iqra Altaf, Kak, Aadil Amin, Bashir, Janibul
Abstract
Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers. In this work, we present the first dedicated open-source neural TTS system designed for Kashmiri. We show that zero-shot multilingual baselines trained for Indic languages fail to produce intelligible speech, achieving a Mean Opinion Score (MOS) of only 1.86, largely due to inadequate modeling of Perso-Arabic diacritics and language-specific phonotactics. To address these limitations, we propose Bolbosh, a supervised cross-lingual adaptation strategy based on Optimal Transport Conditional Flow Matching (OT-CFM) within the Matcha-TTS framework. This enables stable alignment under limited paired data. We further introduce a three-stage acoustic enhancement pipeline consisting of dereverberation, silence trimming, and loudness normalization to unify heterogeneous speech sources and stabilize alignment learning. The model vocabulary is expanded to explicitly encode Kashmiri graphemes, preserving fine-grained vowel distinctions. Our system achieves a MOS of 3.63 and a Mel-Cepstral Distortion (MCD) of 3.73, substantially outperforming multilingual baselines and establishing a new benchmark for Kashmiri speech synthesis. Our results demonstrate that script-aware and supervised flow-based adaptation are critical for low-resource TTS in diacritic-sensitive languages. Code and data are available at: https://github.com/gaash-lab/Bolbosh.
Chinese Translation
克什米尔语由约700万人使用,但在语音技术方面仍然严重不足,尽管其具有官方地位和丰富的语言遗产。缺乏强大的文本到语音(TTS)系统限制了母语者的数字可及性和包容性人机交互。在本研究中,我们提出了第一个专门为克什米尔语设计的开源神经TTS系统。我们展示了为印度语言训练的零样本多语言基线未能产生可理解的语音,平均意见评分(MOS)仅为1.86,这主要是由于对波斯-阿拉伯语变音符号和语言特定音位法的建模不足。为了解决这些局限性,我们提出了Bolbosh,一种基于最优传输条件流匹配(Optimal Transport Conditional Flow Matching, OT-CFM)的监督跨语言适应策略,应用于Matcha-TTS框架。这使得在有限配对数据下实现稳定对齐成为可能。我们进一步引入了一个三阶段的声学增强流程,包括去混响、静音修剪和响度归一化,以统一异构语音源并稳定对齐学习。模型词汇被扩展,以明确编码克什米尔语的字母,保留细致的元音区分。我们的系统实现了3.63的MOS和3.73的梅尔倒谱失真(Mel-Cepstral Distortion, MCD),显著优于多语言基线,并为克什米尔语合成建立了新的基准。我们的结果表明,脚本感知和监督流基适应对于对变音符号敏感的低资源TTS至关重要。代码和数据可在:https://github.com/gaash-lab/Bolbosh获取。
cs.CL / 37 / 2603.07528

TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning

TableMind++:一种关注不确定性的程序化代理,用于工具增强的表格推理
Cheng, Mingyue, Yu, Shuo, Jiang, Chuang, Tao, Xiaoyu, Mao, Qingyang, Ouyang, Jie, Liu, Qi, Chen, Enhong
Abstract
Table reasoning requires models to jointly perform semantic understanding and precise numerical operations. Most existing methods rely on a single-turn reasoning paradigm over tables which suffers from context overflow and weak numerical sensitivity. To address these limitations, we previously proposed TableMind as a tuning-based autonomous programmatic agent that simulates human-like interaction within a lightweight large language model (LLM). TableMind internalizes planning, action, and reflection through a two-stage training strategy involving supervised fine-tuning (SFT) on filtered high-quality data and reinforcement learning (RL) via a multi-perspective reward and the Rank-Aware Policy Optimization (RAPO) algorithm. While TableMind establishes a solid foundation for programmatic agents, the inherent stochasticity of LLMs remains a critical challenge that leads to hallucinations. In this paper, we extend this foundation to TableMind++ by introducing a novel uncertainty-aware inference framework to mitigate hallucinations. Specifically, we propose memory-guided plan pruning to retrieve historical trajectories for validating and filtering out logically flawed plans to address epistemic uncertainty. To ensure execution precision, we introduce confidence-based action refinement which monitors token-level probabilities to detect and self-correct syntactic noise for aleatoric uncertainty mitigation. Finally, we employ dual-weighted trajectory aggregation to synthesize a robust consensus from multiple reasoning paths. Extensive experiments on diverse benchmarks demonstrate that TableMind++ consistently outperforms previous baselines and proprietary models to validate the effectiveness of integrating autonomous training with uncertainty quantification. Our code is available.
Chinese Translation
表格推理要求模型共同执行语义理解和精确的数值运算。大多数现有方法依赖于单轮推理范式,这种方法在处理表格时容易出现上下文溢出和数值敏感性不足的问题。为了解决这些局限性,我们之前提出了TableMind,它是一种基于调优的自主程序化代理,能够在轻量级的大型语言模型(LLM)中模拟类人交互。TableMind通过一种两阶段的训练策略内化规划、行动和反思,该策略包括在经过筛选的高质量数据上进行监督微调(SFT)和通过多角度奖励及Rank-Aware Policy Optimization(RAPO)算法进行强化学习(RL)。尽管TableMind为程序化代理奠定了坚实的基础,但LLMs固有的随机性仍然是一个关键挑战,导致了幻觉现象。在本文中,我们通过引入一种新颖的不确定性感知推理框架,将这一基础扩展到TableMind++,以减轻幻觉现象。具体而言,我们提出了基于记忆的计划修剪,以检索历史轨迹来验证和过滤逻辑上有缺陷的计划,从而应对认知不确定性。为了确保执行精度,我们引入了基于置信度的行动精炼,监控标记级别的概率,以检测和自我修正句法噪声,从而减轻偶然性不确定性。最后,我们采用双权重轨迹聚合,从多个推理路径中综合出一个稳健的共识。在多种基准上的广泛实验表明,TableMind++始终优于之前的基线和专有模型,验证了将自主训练与不确定性量化相结合的有效性。我们的代码可供使用。
cs.CL / 38 / 2603.07534

Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data

口音向量:无需口音数据即可在多语言文本到语音中实现可控口音操控
Lertpetchpun, Thanathai, Trachu, Thanapat, Lee, Jihwan, Feng, Tiantian, Byrd, Dani, Narayanan, Shrikanth
Abstract
Accent is an integral part of society, reflecting multiculturalism and shaping how individuals express identity. The majority of English speakers are non-native (L2) speakers, yet current Text-To-Speech (TTS) systems primarily model American-accented English due to limited accented data. We propose \textit{Accent Vector}, a controllable representation that enables accent manipulation in multilingual TTS without requiring accented training data. \textit{Accent Vector} is derived by fine-tuning a TTS system on native speech of a different language (i.e. non-English) and computing task vectors capturing accent characteristics (i.e. in English). By scaling and interpolating the vector, we achieve fine-grained control over accent strength and generate mixed-accent speech. In addition, it generalizes beyond English, enabling accent control across multiple languages. Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
Chinese Translation
口音是社会的重要组成部分,反映了多元文化并塑造了个体表达身份的方式。大多数英语使用者为非母语(L2)使用者,但目前的文本到语音转换(TTS)系统主要建模美式英语,原因在于口音数据的匮乏。我们提出了\textit{Accent Vector},一种可控的表示方法,能够在多语言TTS中实现口音操控,而无需口音训练数据。\textit{Accent Vector}通过在不同语言(即非英语)的母语语音上微调TTS系统,并计算捕捉口音特征(即在英语中)的任务向量而得出。通过缩放和插值该向量,我们实现了对口音强度的细粒度控制,并生成混合口音语音。此外,它超越了英语,使得在多种语言中实现口音控制成为可能。客观和人类评估证实了\textit{Accent Vector}在细粒度和组合口音控制方面的有效性。
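Task-vector arithmetic of this kind reduces to state-dict deltas; a toy sketch where `alpha` scales accent strength (the real vectors would be deltas between full TTS checkpoints):

```python
# Sketch of task-vector arithmetic behind an accent vector: the vector is the
# weight delta from fine-tuning, and a scalar alpha controls accent strength.
import torch

def task_vector(base_sd, finetuned_sd):
    return {k: finetuned_sd[k] - base_sd[k] for k in base_sd}

def apply_accent(base_sd, vec, alpha=0.5):
    # alpha=0 -> base accent, alpha=1 -> full transfer; values in between mix.
    return {k: base_sd[k] + alpha * vec[k] for k in base_sd}

base = {"w": torch.zeros(3), "b": torch.zeros(2)}
tuned = {"w": torch.ones(3), "b": 2 * torch.ones(2)}
vec = task_vector(base, tuned)
for a in (0.0, 0.5, 1.0):
    print(a, apply_accent(base, vec, a)["w"])
```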
cs.CL / 39 / 2603.07539

MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs

MAWARITH:面向大型语言模型的法律继承推理数据集与基准
Bouchekif, Abdessalam, Gaben, Shahd, Rashwani, Samer, Eltanbouly, Somaya, Al-Khatib, Mutaz, Sbahi, Heba, Ghaly, Mohammed, Mohamed, Emad
Abstract
Islamic inheritance law ('ilm al-mawarith) is challenging for large language models because solving inheritance cases requires complex, structured multi-step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce MAWARITH, a large-scale annotated dataset of 12,500 Arabic inheritance cases to train and evaluate the full reasoning chain: (i) identifying eligible heirs, (ii) applying blocking (hajb) and allocation rules, and (iii) computing exact inheritance shares. Unlike prior datasets that restrict inheritance case solving to multiple-choice questions, MAWARITH supports the full reasoning chain and provides step-by-step solutions, including intermediate legal decisions and justifications based on classical juristic sources and established inheritance rules, as well as exact share calculations. To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline. We evaluate five LLMs in a zero-shot setting. Gemini-2.5-flash achieves about 90% MIR-E on both validation and test, while Fanar-C, Fanar-Sadiq, LLaMA 3, and Qwen 3 remain below 50%. Our error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification, errors in share allocation, and missing or incorrect application of key inheritance rules such as 'awl and radd. The MAWARITH dataset is publicly available at https://github.com/bouchekif/inheritance_evaluation.
Chinese Translation
伊斯兰继承法('ilm al-mawarith)对大型语言模型(LLMs)来说具有挑战性,因为解决继承案例需要复杂的、结构化的多步骤推理,并正确应用法理规则来计算继承人的份额。我们介绍了MAWARITH,这是一个包含12,500个阿拉伯继承案例的大规模注释数据集,用于训练和评估完整的推理链:(i)识别合格的继承人,(ii)应用阻碍(hajb)和分配规则,以及(iii)计算确切的继承份额。与之前将继承案例解决限制为多项选择题的数据集不同,MAWARITH支持完整的推理链,并提供逐步解决方案,包括基于经典法理来源和既定继承规则的中间法律决策和理由,以及确切的份额计算。为了超越最终答案的准确性评估模型,我们提出了MIR-E(Mawarith Inheritance Reasoning Evaluation),这是一种加权的多阶段指标,评分关键推理阶段并捕捉错误在整个流程中的传播。我们在零样本设置下评估了五个LLMs。Gemini-2.5-flash在验证和测试中均达到了约90%的MIR-E,而Fanar-C、Fanar-Sadiq、LLaMA 3和Qwen 3的得分均低于50%。我们的错误分析识别出重复的失败模式,包括场景误解、继承人识别错误、份额分配错误,以及对关键继承规则(如'awl和radd)的缺失或错误应用。MAWARITH数据集已在https://github.com/bouchekif/inheritance_evaluation上公开发布。
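A weighted multi-stage metric with error propagation might look like the following sketch; the weights and the failure gate are illustrative assumptions, not MIR-E's published definition.

```python
# Sketch of a weighted multi-stage reasoning metric: each stage gets a partial
# score, weights sum to one, and downstream stages earn no credit once an
# earlier stage has failed (error propagation).
def staged_score(stage_scores, weights=(0.3, 0.3, 0.4), fail_below=0.5):
    assert len(stage_scores) == len(weights)
    total, alive = 0.0, True
    for s, w in zip(stage_scores, weights):
        if not alive:
            break                      # downstream stages inherit the failure
        total += w * s
        alive = s >= fail_below
    return total

# Stages: heir identification, blocking/allocation rules, exact share computation.
print(staged_score([1.0, 1.0, 1.0]))   # 1.0  -> fully correct solution
print(staged_score([1.0, 0.2, 1.0]))   # 0.36 -> rule error blocks share credit
```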
cs.CL / 40 / 2603.07550

Learning-free L2-Accented Speech Generation using Phonological Rules

基于音位规则、无需学习的L2口音语音生成
Lertpetchpun, Thanathai, Lee, Yoonjeong, Lee, Jihwan, Feng, Tiantian, Byrd, Dani, Narayanan, Shrikanth
Abstract
Accent plays a crucial role in speaker identity and inclusivity in speech technologies. Existing accented text-to-speech (TTS) systems either require large-scale accented datasets or lack fine-grained phoneme-level controllability. We propose an accented TTS framework that combines phonological rules with a multilingual TTS model. The rules are applied to phoneme sequences to transform accent at the phoneme level while preserving intelligibility. The method requires no accented training data and enables explicit phoneme-level accent manipulation. We design rule sets for Spanish- and Indian-accented English, modeling systematic differences in consonants, vowels, and syllable structure arising from phonotactic constraints. We analyze the trade-off between phoneme-level duration alignment and accent as realized in speech timing. Experimental results demonstrate effective accent shift while maintaining speech quality.
Chinese Translation
口音在说话者身份和语音技术的包容性中发挥着至关重要的作用。现有的口音文本到语音(TTS)系统要么需要大规模的口音数据集,要么缺乏细粒度的音素级控制能力。我们提出了一种结合音位规则与多语言TTS模型的口音TTS框架。这些规则应用于音素序列,以在保留可懂性的同时在音素级别上转换口音。该方法不需要口音训练数据,并能够实现明确的音素级口音操控。我们为西班牙口音和印度口音的英语设计了规则集,建模因音位约束而产生的辅音、元音和音节结构的系统性差异。我们分析了音素级持续时间对齐与在语音时序中实现的口音之间的权衡。实验结果表明,在保持语音质量的同时有效地实现了口音转换。
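Ordered phoneme-level substitution rules are easy to sketch; the toy rules below are simplified illustrations of Spanish-accented English patterns, not the paper's actual rule set.

```python
# Sketch of learning-free accent transformation: rewrite an ARPAbet-like
# phoneme sequence with ordered substitution rules before TTS synthesis.
RULES = [
    (("Z",), ("S",)),               # /z/ devoices to /s/
    (("IH",), ("IY",)),             # lax vowel tensed
    (("S", "T"), ("EH", "S", "T")), # epenthetic vowel before /st/ clusters
]

def apply_rules(phonemes):
    seq = list(phonemes)
    for src, dst in RULES:          # rules apply in order, left to right
        out, i = [], 0
        while i < len(seq):
            if tuple(seq[i:i + len(src)]) == src:
                out.extend(dst)
                i += len(src)
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

print(apply_rules(["S", "T", "IH", "K"]))   # ['EH', 'S', 'T', 'IY', 'K']
```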
cs.CL / 41 / 2603.07554

Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR

Nwāchā Munā:面向尼瓦尔语(Nepal Bhasha)自动语音识别的德瓦那加里文语音语料库与邻近迁移基准
Sharma, Rishikesh Kumar, Shrestha, Safal Narshing, Poudel, Jenny, Tiwari, Rupak, Shrestha, Arju, Ghimire, Rupak Raj, Bal, Bal Krishna
Abstract
Nepal Bhasha (Newari), an endangered language of the Kathmandu Valley, remains digitally marginalized due to the severe scarcity of annotated speech resources. In this work, we introduce Nwāchā Munā, a newly curated 5.39-hour manually transcribed Devanagari speech corpus for Nepal Bhasha, and establish the first benchmark using script-preserving acoustic modeling. We investigate whether proximal cross-lingual transfer from a geographically and linguistically adjacent language (Nepali) can rival large-scale multilingual pretraining in an ultra-low-resource Automatic Speech Recognition (ASR) setting. Fine-tuning a Nepali Conformer model reduces the Character Error Rate (CER) from a 52.54% zero-shot baseline to 17.59% with data augmentation, effectively matching the performance of the multilingual Whisper-Small model despite utilizing significantly fewer parameters. Our findings demonstrate that proximal transfer within South Asian language clusters serves as a computationally efficient alternative to massive multilingual models. We openly release the dataset and benchmarks to digitally enable the Newari community and foster further research in Nepal Bhasha.
Chinese Translation
尼瓦尔语(Nepal Bhasha,Newari)是加德满都谷地的一种濒危语言,由于标注语音资源的严重匮乏,其数字化程度较低。在本研究中,我们介绍了Nwāchā Munā,这是一个新整理的5.39小时人工转录的尼瓦尔语德瓦那加里文语音语料库,并建立了第一个采用保留原文字系统的声学建模的基准。我们探讨了来自地理和语言上邻近语言(尼泊尔语)的邻近跨语言迁移,是否能够在超低资源的自动语音识别(ASR)环境中与大规模多语言预训练相抗衡。在数据增强的帮助下,对尼泊尔语Conformer模型进行微调可将字符错误率(CER)从52.54%的零样本基线降至17.59%,在参数量显著更少的情况下有效追平了多语言Whisper-Small模型的性能。我们的研究结果表明,南亚语言集群内的邻近迁移是大规模多语言模型在计算上高效的替代方案。我们公开发布该数据集和基准,以数字化赋能尼瓦尔语社区,并促进对尼瓦尔语(Nepal Bhasha)的进一步研究。
cs.CL / 42 / 2603.07599

StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control

StyleBench:评估语音语言模型在对话口语风格控制上的表现
Zhao, Haishu, Hao, Aokai, Ge, Yuan, Hong, Zhenqiang, Xiao, Tong, Zhu, Jingbo
Abstract
Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. For more realistic interactive experience with customized styles, current SLMs have managed to interpret and control speaking style intensity from user prompts during the dialogue process. However, there remains a lack of systematic benchmarks that quantifies and evaluates the style intensity control ability in conversations. In this paper, we propose StyleBench, a multi-turn dialogue benchmark for comprehensively evaluating the style intensity control ability across four dimensions: emotion, speed, volume, and pitch. Our results reveal the performance gaps between leading SLMs and omni language models (OLMs), suggesting the underlying reasons and promising approaches for future exploration.
Chinese Translation
语音语言模型(SLMs)通过融入副语言信息,显著扩展了基于文本的大型语言模型(LLMs)的互动能力。为了提供更真实的互动体验并实现个性化风格,目前的SLMs已能在对话过程中根据用户提示解读和控制口语风格强度。然而,系统性的基准测试仍然缺乏,无法量化和评估对话中的风格强度控制能力。本文提出了StyleBench,一个多轮对话基准,旨在从情感、速度、音量和音调四个维度全面评估风格强度控制能力。我们的结果揭示了领先的SLMs与全语言模型(OLMs)之间的性能差距,指出了潜在原因并为未来的探索提供了有希望的方法。
cs.CL / 43 / 2603.07612

KohakuRAG: A simple RAG framework with hierarchical document indexing

KohakuRAG:一种具有层次文档索引的简单RAG框架
Yeh, Shih-Ying, Ku, Yueh-Feng, Huang, Ko-Wei, Tu, Buu-Khang
Abstract
Retrieval-augmented generation (RAG) systems that answer questions from document collections face compounding difficulties when high-precision citations are required: flat chunking strategies sacrifice document structure, single-query formulations miss relevant passages through vocabulary mismatch, and single-pass inference produces stochastic answers that vary in both content and citation selection. We present KohakuRAG, a hierarchical RAG framework that preserves document structure through a four-level tree representation (document $\rightarrow$ section $\rightarrow$ paragraph $\rightarrow$ sentence) with bottom-up embedding aggregation, improves retrieval coverage through an LLM-powered query planner with cross-query reranking, and stabilizes answers through ensemble inference with abstention-aware voting. We evaluate on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents with $\pm$0.1% numeric tolerance and exact source attribution. KohakuRAG achieves first place on both public and private leaderboards (final score 0.861), as the only team to maintain the top position across both evaluation partitions. Ablation studies reveal that prompt ordering (+80% relative), retry mechanisms (+69%), and ensemble voting with blank filtering (+1.2pp) each contribute substantially, while hierarchical dense retrieval alone matches hybrid sparse-dense approaches (BM25 adds only +3.1pp). We release KohakuRAG as open-source software at https://github.com/KohakuBlueleaf/KohakuRAG.
Chinese Translation
检索增强生成(RAG)系统在从文档集合中回答问题时,当需要高精度引用时面临着层层叠加的困难:平坦的分块策略牺牲了文档结构,单一查询的形式由于词汇不匹配而错过相关段落,而单次推理产生的随机答案在内容和引用选择上都存在变异。我们提出了KohakuRAG,这是一种层次化的RAG框架,通过带有自下而上嵌入聚合的四级树形表示(文档 $\rightarrow$ 章节 $\rightarrow$ 段落 $\rightarrow$ 句子)保留文档结构,通过带跨查询重排序的LLM驱动查询规划器改善检索覆盖率,并通过带有弃权感知投票的集成推理来稳定答案。我们在WattBot 2025挑战赛上进行了评估,该基准要求系统从32个文档中回答技术问题,数值容差为$\pm$0.1%,并要求精确的来源归属。KohakuRAG在公共和私有排行榜上均获得第一名(最终得分为0.861),是唯一一个在两个评估部分中均保持榜首的团队。消融研究表明,提示排序(相对提高80%)、重试机制(提高69%)和带空白过滤的集成投票(提高1.2个百分点)各自做出了显著贡献,而单独的层次密集检索即可与混合稀疏-密集方法相匹配(BM25仅增加3.1个百分点)。我们将KohakuRAG作为开源软件发布,网址为https://github.com/KohakuBlueleaf/KohakuRAG。
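The four-level tree with bottom-up embedding aggregation can be sketched briefly; the deterministic toy encoder below is a stand-in assumption for a real embedding model.

```python
# Sketch of hierarchical indexing: document -> section -> paragraph -> sentence,
# where each parent embedding is the normalized mean of its children's, so
# retrieval can match at any granularity.
import numpy as np

def embed(text, dim=64):
    rng = np.random.default_rng(abs(hash(text)) % 2**32)   # deterministic toy encoder
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

class Node:
    def __init__(self, level, text="", children=()):
        self.level, self.text, self.children = level, text, list(children)
        if self.children:                                   # bottom-up aggregation
            v = np.mean([c.vec for c in self.children], axis=0)
            self.vec = v / np.linalg.norm(v)
        else:
            self.vec = embed(text)

sents = [Node("sentence", t) for t in ("solar output rose 4%", "wind was flat")]
para = Node("paragraph", children=sents)
sec = Node("section", children=[para])
doc = Node("document", children=[sec])
print(float(doc.vec @ sents[0].vec))   # parent stays close to its children
```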
cs.CL / 44 / 2603.07755

Whitening Reveals Cluster Commitment as the Geometric Separator of Hallucination Types

白化揭示了集群承诺作为幻觉类型的几何分隔符
Korun, Matic
Abstract
A geometric hallucination taxonomy distinguishes three failure types -- center-drift (Type~1), wrong-well convergence (Type~2), and coverage gaps (Type~3) -- by their signatures in embedding cluster space. Prior work found Types~1 and~2 indistinguishable in full-dimensional contextual measurement. We address this through PCA-whitening and eigenspectrum decomposition on GPT-2-small, using multi-run stability analysis (20 seeds) with prompt-level aggregation. Whitening transforms the micro-signal regime into a space where peak cluster alignment (max\_sim) separates Type~2 from Type~3 at Holm-corrected significance, with condition means following the taxonomy's predicted ordering: Type~2 (highest commitment) $>$ Type~1 (intermediate) $>$ Type~3 (lowest). A first directionally stable but underpowered hint of Type~1/2 separation emerges via the same metric, generating a capacity prediction for larger models. Prompt diversification from 15 to 30 prompts per group eliminates a false positive in whitened entropy that appeared robust at the smaller set, demonstrating prompt-set sensitivity in the micro-signal regime. Eigenspectrum decomposition localizes this artifact to the dominant principal components and confirms that Type~1/2 separation does not emerge in any spectral band, rejecting the spectral mixing hypothesis. The contribution is threefold: whitening as preprocessing that reveals cluster commitment as the theoretically correct separating metric, evidence that the Type~1/2 boundary is a capacity limitation rather than a measurement artifact, and a methodological finding about prompt-set fragility in near-saturated representation spaces.
Chinese Translation
几何幻觉分类法通过其在嵌入集群空间中的特征区分三种失败类型——中心漂移(Type~1)、错误势阱收敛(Type~2)和覆盖缺口(Type~3)。先前的研究发现,在全维上下文测量中,Type~1和Type~2是不可区分的。我们通过对GPT-2-small进行PCA白化和特征谱分解来解决这一问题,使用多次运行稳定性分析(20个种子)和提示级聚合。白化将微信号域转化为一个空间,在该空间中,峰值集群对齐(max_sim)在Holm校正的显著性水平上将Type~2与Type~3分开,条件均值遵循分类法预测的顺序:Type~2(最高承诺) > Type~1(中等) > Type~3(最低)。通过相同的度量,首次出现了方向上稳定但统计功效不足的Type~1/2分离的暗示,为更大模型提供了一个容量层面的预测。将每组的提示数量从15个多样化到30个,消除了在较小提示集下看似稳健的白化熵假阳性,证明了微信号域中提示集的敏感性。特征谱分解将这一伪影定位于主导主成分,并确认Type~1/2分离在任何频谱带中都未出现,拒绝了频谱混合假设。该研究的贡献有三方面:白化作为预处理,揭示集群承诺作为理论上正确的分隔度量;证据表明Type~1/2边界是容量限制而非测量伪影;以及关于近饱和表示空间中提示集脆弱性的方法论层面的发现。
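A sketch of the measurement pipeline on synthetic data: PCA-whitening followed by peak cosine similarity to cluster centroids (max_sim). The cluster count and the random data are placeholders for the GPT-2-small representations used in the paper.

```python
# Sketch of whitening + peak-cluster-alignment (max_sim) scoring.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 32)) @ rng.normal(size=(32, 32))   # correlated features

Xw = PCA(whiten=True, random_state=0).fit_transform(X)       # whitened space
centroids = KMeans(n_clusters=5, n_init=10, random_state=0).fit(Xw).cluster_centers_

def max_sim(v, C):
    sims = (C @ v) / (np.linalg.norm(C, axis=1) * np.linalg.norm(v) + 1e-9)
    return sims.max()          # peak cluster alignment = "commitment" to one well

scores = np.array([max_sim(v, centroids) for v in Xw])
print(scores.mean(), scores.std())
```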
cs.CL / 45 / 2603.07766

QuadAI at SemEval-2026 Task 3: Ensemble Learning of Hybrid RoBERTa and LLMs for Dimensional Aspect-Based Sentiment Analysis

QuadAI在SemEval-2026任务3中的表现:基于混合RoBERTa和大型语言模型的维度方面情感分析的集成学习
de Vink, A. J. W., Ventirozos, Filippos Karolos, Amat-Lefort, Natalia, Han, Lifeng
Abstract
We present our system for SemEval-2026 Task 3 on dimensional aspect-based sentiment regression. Our approach combines a hybrid RoBERTa encoder, which jointly predicts sentiment using regression and discretized classification heads, with large language models (LLMs) via prediction-level ensemble learning. The hybrid encoder improves prediction stability by combining continuous and discretized sentiment representations. We further explore in-context learning with LLMs and ridge-regression stacking to combine encoder and LLM predictions. Experimental results on the development set show that ensemble learning significantly improves performance over individual models, achieving substantial reductions in RMSE and improvements in correlation scores. Our findings demonstrate the complementary strengths of encoder-based and LLM-based approaches for dimensional sentiment analysis. Our development code and resources will be shared at https://github.com/aaronlifenghan/ABSentiment
Chinese Translation
我们展示了在SemEval-2026任务3中针对维度方面情感回归的系统。我们的方法结合了混合RoBERTa编码器,该编码器通过回归和离散分类头共同预测情感,并通过预测级别的集成学习与大型语言模型(LLMs)相结合。混合编码器通过结合连续和离散情感表示来提高预测的稳定性。我们进一步探索了与LLMs的上下文学习和岭回归堆叠,以结合编码器和LLM的预测。在开发集上的实验结果表明,集成学习显著提高了性能,相较于单独模型实现了均方根误差(RMSE)的显著降低和相关性评分的提升。我们的研究结果展示了基于编码器和基于LLM的方法在维度情感分析中的互补优势。我们的开发代码和资源将分享在https://github.com/aaronlifenghan/ABSentiment
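Ridge-regression stacking over base-system predictions is straightforward to sketch; here the base predictions are simulated rather than coming from the actual RoBERTa hybrid and LLMs.

```python
# Sketch of prediction-level ensembling via ridge-regression stacking: each
# base system's sentiment prediction becomes a feature for a ridge model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
y = rng.uniform(-1, 1, 500)                         # gold valence scores
preds = np.stack([y + rng.normal(0, s, 500)         # three noisy base systems
                  for s in (0.30, 0.25, 0.40)], axis=1)

train, test = slice(0, 400), slice(400, 500)
stack = Ridge(alpha=1.0).fit(preds[train], y[train])
for i in range(preds.shape[1]):
    print(f"base {i} RMSE:", mean_squared_error(y[test], preds[test][:, i]) ** 0.5)
print("stacked RMSE:", mean_squared_error(y[test], stack.predict(preds[test])) ** 0.5)
```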
cs.CL / 46 / 2603.07779

Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems

数据难度的扩展:通过在新鲜且具有挑战性的问题上进行强化学习来改进编码模型
Li, Zongqian, Lv, Tengchao, Huang, Shaohan, Su, Yixuan, Sun, Qinzheng, Yin, Qiufeng, Xin, Ying, Li, Scarlett, Cui, Lei, Collier, Nigel, Wei, Furu
Abstract
Training next-generation code generation models requires high-quality datasets, yet existing datasets face difficulty imbalance, format inconsistency, and data quality problems. We address these challenges through systematic data processing and difficulty scaling. We introduce a four-stage Data Processing Framework encompassing collection, processing, filtering, and verification, incorporating Automatic Difficulty Filtering via an LLM-based predict-calibrate-select framework that leverages multi-dimensional difficulty metrics across five weighted dimensions to retain challenging problems while removing simplistic ones. The resulting MicroCoder dataset comprises tens of thousands of curated real competitive programming problems from diverse platforms, emphasizing recency and difficulty. Evaluations on strictly unseen LiveCodeBench demonstrate that MicroCoder achieves 3x larger performance gains within 300 training steps compared to widely-used baseline datasets of comparable size, with consistent advantages under both GRPO and its variant training algorithms. The MicroCoder dataset delivers obvious improvements on medium and hard problems across different model sizes, achieving up to 17.2% relative gains in overall performance where model capabilities are most stretched. These results validate that difficulty-aware data curation improves model performance on challenging tasks, providing multiple insights for dataset creation in code generation.
Chinese Translation
训练下一代代码生成模型需要高质量的数据集,但现有数据集面临难度不平衡、格式不一致和数据质量问题。我们通过系统的数据处理和难度扩展来解决这些挑战。我们引入了一个四阶段的数据处理框架,包括收集、处理、过滤和验证,结合基于大语言模型(LLM)的自动难度过滤框架,利用五个加权维度的多维难度指标来保留具有挑战性的问题,同时去除简单的问题。最终生成的MicroCoder数据集包含来自不同平台的数万个经过精心策划的真实竞赛编程问题,强调其时效性和难度。在严格未见的LiveCodeBench上的评估表明,与同等规模的广泛使用的基线数据集相比,MicroCoder在300个训练步骤内实现了3倍的性能提升,并且在GRPO及其变体训练算法下均表现出一致的优势。MicroCoder数据集在不同模型规模下对中等和困难问题提供了显著的改进,在模型能力最受挑战的情况下,整体性能相对提升高达17.2%。这些结果验证了关注难度的数据策划能够提升模型在具有挑战性任务上的表现,为代码生成的数据集创建提供了多重见解。
cs.CL / 47 / 2603.07792

Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context

大型语言模型中社会偏见的双重度量评估:来自一个被低估的尼泊尔文化背景的证据
Pandey, Ashish, Chhetri, Tek Raj
Abstract
Large language models (LLMs) increasingly influence global digital ecosystems, yet their potential to perpetuate social and cultural biases remains poorly understood in underrepresented contexts. This study presents a systematic analysis of representational biases in seven state-of-the-art LLMs: GPT-4o-mini, Claude-3-Sonnet, Claude-4-Sonnet, Gemini-2.0-Flash, Gemini-2.0-Lite, Llama-3-70B, and Mistral-Nemo in the Nepali cultural context. Using a Croissant-compliant dataset of 2400+ stereotypical and anti-stereotypical sentence pairs on gender roles across social domains, we implement an evaluation framework, Dual-Metric Bias Assessment (DMBA), combining two metrics: (1) agreement with biased statements and (2) stereotypical completion tendencies. Results show models exhibit measurable explicit agreement bias, with mean bias agreement ranging from 0.36 to 0.43 across decoding configurations, and an implicit completion bias rate of 0.740-0.755. Importantly, implicit completion bias follows a non-linear, U-shaped relationship with temperature, peaking at moderate stochasticity (T=0.3) and declining slightly at higher temperatures. Correlation analysis under different decoding settings revealed that explicit agreement strongly aligns with stereotypical sentence agreement but is a weak and often negative predictor of implicit completion bias, indicating generative bias is poorly captured by agreement metrics. Sensitivity analysis shows increasing top-p amplifies explicit bias, while implicit generative bias remains largely stable. Domain-level analysis shows implicit bias is strongest for race and sociocultural stereotypes, while explicit agreement bias is similar across gender and sociocultural categories, with race showing the lowest explicit agreement. These findings highlight the need for culturally grounded datasets and debiasing strategies for LLMs in underrepresented societies.
Chinese Translation
大型语言模型(LLMs)日益影响全球数字生态系统,但它们在代表性不足的背景中延续社会和文化偏见的潜力仍然未被充分理解。本研究对七种最先进的LLMs进行了系统分析,分别是:GPT-4o-mini、Claude-3-Sonnet、Claude-4-Sonnet、Gemini-2.0-Flash、Gemini-2.0-Lite、Llama-3-70B和Mistral-Nemo,重点关注尼泊尔文化背景。我们使用符合Croissant标准的2400多个关于性别角色的刻板印象和反刻板印象句子对的数据集,实施了一种评估框架——双重度量偏见评估(Dual-Metric Bias Assessment, DMBA),结合了两个指标:(1)与偏见陈述的一致性和(2)刻板印象完成倾向。结果显示,模型表现出可测量的显性一致性偏见,平均偏见一致性在不同解码配置下范围为0.36到0.43,隐性完成偏见率为0.740-0.755。重要的是,隐性完成偏见与温度呈非线性U形关系,在适度随机性(T=0.3)时达到峰值,并在较高温度下略有下降。不同解码设置下的相关性分析显示,显性一致性与刻板印象句子一致性高度相关,但对隐性完成偏见的预测能力较弱且通常为负,表明生成偏见未能通过一致性指标得到良好捕捉。敏感性分析表明,增加top-p会放大显性偏见,而隐性生成偏见则保持相对稳定。领域级分析显示,隐性偏见在种族和社会文化刻板印象中最为强烈,而显性一致性偏见在性别和社会文化类别中相似,种族的显性一致性偏见最低。这些发现强调了在代表性不足的社会中,构建基于文化的数据集和去偏见策略的必要性。
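The two DMBA metrics above reduce to simple rates over labeled model outputs. A minimal sketch, with hypothetical field names for the collected responses:

```python
# Sketch of the two DMBA-style metrics over already-collected outputs.
# The field names are illustrative, not the paper's actual schema.

def agreement_bias(responses: list[dict]) -> float:
    """Metric 1: mean agreement with biased statements.
    Each item: {"agrees_with_stereotype": bool}."""
    return sum(r["agrees_with_stereotype"] for r in responses) / len(responses)

def completion_bias(completions: list[dict]) -> float:
    """Metric 2: fraction of open-ended completions that follow the
    stereotypical rather than the anti-stereotypical continuation."""
    decided = [c for c in completions if c["label"] in ("stereo", "anti")]
    if not decided:
        return 0.0
    return sum(c["label"] == "stereo" for c in decided) / len(decided)

responses = [{"agrees_with_stereotype": b} for b in (True, False, True)]
completions = [{"label": l} for l in ("stereo", "stereo", "anti", "neutral")]
print(round(agreement_bias(responses), 2))     # 0.67
print(round(completion_bias(completions), 2))  # 0.67
```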
cs.CL / 48 / 2603.07825

Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation

魁北克保险的大型语言模型基准测试:从闭卷生成到检索增强生成
Beauchemin, David, Khoury, Richard
Abstract
The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance. While Large Language Models (LLMs) offer a scalable solution for automated advisory services, their deployment in high-stakes domains hinges on strict legal accuracy and trustworthiness. In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks. We conduct a comprehensive evaluation of 51 LLMs across two paradigms: closed-book generation and retrieval-augmented generation (RAG) using a specialized corpus of Quebec insurance documents. Our results reveal three critical insights: 1) the supremacy of inference-time reasoning, where models leveraging chain-of-thought processing (e.g. o3-2025-04-16, o1-2024-12-17) significantly outperform standard instruction-tuned models; 2) RAG acts as a knowledge equalizer, boosting the accuracy of models with weak parametric knowledge by over 35 percentage points, yet paradoxically causing "context distraction" in others, leading to catastrophic performance regressions; and 3) a "specialization paradox", where massive generalist models consistently outperform smaller, domain-specific French fine-tuned ones. These findings suggest that while current architectures approach expert-level proficiency (~79%), the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.
Chinese Translation
加拿大魁北克省保险分销的数字化进程因立法变更(如第141号法案)而加速,导致了显著的“建议差距”,使消费者在没有专业指导的情况下解读复杂的金融合同。虽然大型语言模型(LLMs)为自动化咨询服务提供了可扩展的解决方案,但其在高风险领域的部署依赖于严格的法律准确性和可信度。本文通过引入AEPC-QA,一个由807个多项选择题组成的私有黄金标准基准,这些题目源自官方监管认证(纸质)手册,来应对这一挑战。我们对51个LLM在两种范式下进行了全面评估:闭卷生成和使用魁北克保险文件的专门语料库的检索增强生成(RAG)。我们的结果揭示了三个关键见解:1)推理时间推理的优越性,利用思维链处理的模型(如o3-2025-04-16、o1-2024-12-17)显著优于标准的指令调优模型;2)RAG作为知识平衡器,提升了参数知识薄弱模型的准确性超过35个百分点,但却悖论性地导致其他模型出现“上下文干扰”,导致灾难性的性能回退;3)“专业化悖论”,即大型通用模型始终优于较小的、特定领域的法语微调模型。这些发现表明,尽管当前架构接近专家级水平(约79%),但外部上下文检索所引入的不稳定性在实现自主部署之前需要严格的稳健性校准。
cs.CL / 49 / 2603.07837

AI Steerability 360: A Toolkit for Steering Large Language Models

AI 可控性 360:一个用于引导大型语言模型的工具包
Miehling, Erik, Ramamurthy, Karthikeyan Natesan, Venkateswaran, Praveen, Ko, Irene, Dognin, Pierre, Singh, Moninder, Pedapati, Tejaswini, Balakrishnan, Avinash, Riemer, Matthew, Wei, Dennis, Vejsbjerg, Inge, Daly, Elizabeth M., Varshney, Kush R.
Abstract
The AI Steerability 360 toolkit is an extensible, open-source Python library for steering LLMs. Steering abstractions are designed around four model control surfaces: input (modification of the prompt), structural (modification of the model's weights or architecture), state (modification of the model's activations and attentions), and output (modification of the decoding or generation process). Steering methods exert control on the model through a common interface, termed a steering pipeline, which additionally allows for the composition of multiple steering methods. Comprehensive evaluation and comparison of steering methods/pipelines is facilitated by use case classes (for defining tasks) and a benchmark class (for performance comparison on a given task). The functionality provided by the toolkit significantly lowers the barrier to developing and comprehensively evaluating steering methods. The toolkit is Hugging Face native and is released under an Apache 2.0 license at https://github.com/IBM/AISteer360.
Chinese Translation
AI 可控性 360 工具包是一个可扩展的开源 Python 库,用于引导大型语言模型(LLMs)。引导抽象围绕四个模型控制表面设计:输入(对提示的修改)、结构(对模型权重或架构的修改)、状态(对模型激活和注意力的修改)以及输出(对解码或生成过程的修改)。引导方法通过一个称为引导管道的公共接口对模型施加控制,该接口还允许组合多种引导方法。使用案例类(用于定义任务)和基准类(用于在给定任务上的性能比较)促进了对引导方法/管道的全面评估和比较。该工具包提供的功能显著降低了开发和全面评估引导方法的门槛。该工具包是 Hugging Face 原生的,并在 https://github.com/IBM/AISteer360 以 Apache 2.0 许可证发布。
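For intuition only, here is a hypothetical sketch of what a composable steering pipeline over the input and output control surfaces could look like. This is not the AISteer360 API; the real interface should be taken from the repository linked above:

```python
# Hypothetical steering-pipeline sketch: hooks on the input surface rewrite
# the prompt, hooks on the output surface post-process the generation; state
# and structural surfaces would hook into the model itself. NOT the actual
# AISteer360 interface.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SteeringPipeline:
    input_hooks: list[Callable[[str], str]] = field(default_factory=list)
    output_hooks: list[Callable[[str], str]] = field(default_factory=list)

    def generate(self, model: Callable[[str], str], prompt: str) -> str:
        for h in self.input_hooks:    # input surface: prompt modification
            prompt = h(prompt)
        text = model(prompt)          # state/structural hooks would act here
        for h in self.output_hooks:   # output surface: decoding post-process
            text = h(text)
        return text

pipe = SteeringPipeline(
    input_hooks=[lambda p: p + "\nAnswer politely."],
    output_hooks=[str.strip],
)
print(pipe.generate(lambda p: f"[model saw: {p!r}] ok ", "Hi"))
```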
cs.CL / 50 / 2603.07841

An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data

一种高效且有效的评估器用于未见和未标记数据上的Text2SQL模型
Pham, Trinh, Nguyen, Thanh Tam, Huynh, Viet, Yin, Hongzhi, Nguyen, Quoc Viet Hung
Abstract
Recent advances in large language models have strengthened Text2SQL systems that translate natural language questions into database queries. A persistent deployment challenge is to assess a newly trained Text2SQL system on an unseen and unlabeled dataset when no verified answers are available. This situation arises frequently because database content and structure evolve, privacy policies slow manual review, and carefully written SQL labels are costly and time-consuming. Without timely evaluation, organizations cannot approve releases or detect failures early. FusionSQL addresses this gap by working with any Text2SQL models and estimating accuracy without reference labels, allowing teams to measure quality on unseen and unlabeled datasets. It analyzes patterns in the system's own outputs to characterize how the target dataset differs from the material used during training. FusionSQL supports pre-release checks, continuous monitoring of new databases, and detection of quality decline. Experiments across diverse application settings and question types show that FusionSQL closely follows actual accuracy and reliably signals emerging issues. Our code is available at https://github.com/phkhanhtrinh23/FusionSQL.
Chinese Translation
近期大型语言模型的进展增强了Text2SQL系统的能力,使其能够将自然语言问题转换为数据库查询。然而,一个持续存在的部署挑战是如何在没有验证答案的情况下,对新训练的Text2SQL系统进行评估,尤其是在未见和未标记的数据集上。这种情况经常发生,因为数据库的内容和结构会不断演变,隐私政策会延缓人工审核,而精心编写的SQL标签则成本高昂且耗时。没有及时的评估,组织无法批准发布或早期发现故障。FusionSQL通过与任何Text2SQL模型协作并在没有参考标签的情况下估计准确性,填补了这一空白,使团队能够在未见和未标记的数据集上衡量质量。它分析系统自身输出中的模式,以表征目标数据集与训练期间使用的材料之间的差异。FusionSQL支持发布前检查、对新数据库的持续监控以及质量下降的检测。在多种应用场景和问题类型下的实验表明,FusionSQL能够紧密跟踪实际准确性,并可靠地发出潜在问题的信号。我们的代码可在 https://github.com/phkhanhtrinh23/FusionSQL 获取。
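One plausible instantiation of label-free evaluation in this spirit is execution self-consistency: sample several SQL candidates per question, execute them, and treat execution agreement as a confidence signal. The sketch below is an assumption-laden illustration, not FusionSQL's actual estimator:

```python
# Label-free accuracy proxy: agreement among executed SQL candidates.
# generate_sql and execute are caller-supplied stubs in this sketch.
from collections import Counter

def estimate_accuracy(questions, generate_sql, execute):
    """generate_sql(q, k) -> k candidate SQL strings;
    execute(sql) -> hashable result, or None on error."""
    confidences = []
    for q in questions:
        results = [execute(s) for s in generate_sql(q, k=5)]
        results = [r for r in results if r is not None]
        if not results:
            confidences.append(0.0)
            continue
        top = Counter(results).most_common(1)[0][1]
        confidences.append(top / len(results))   # agreement rate
    return sum(confidences) / len(confidences)   # dataset-level estimate

# Toy demo with stubbed components:
print(estimate_accuracy(
    ["How many users?"],
    generate_sql=lambda q, k: ["SELECT 1"] * (k - 1) + ["SELECT 2"],
    execute=lambda s: s,
))  # -> 0.8
```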
cs.CL / 51 / 2603.07880

What Do AI Agents Talk About? Emergent Communication Structure in the First AI-Only Social Network

人工智能代理讨论什么?首个仅限人工智能的社交网络中的新兴沟通结构
Dube, Taksch, Zhu, Jianfeng, Phan, NhatHai, Jin, Ruoming
Abstract
When autonomous AI agents communicate with one another at scale, what kind of discourse system emerges? We address this question through an analysis of Moltbook, the first AI-only social network, where 47,241 agents generated 361,605 posts and 2.8 million comments over 23 days. Combining topic modeling, emotion classification, and lexical-semantic measures, we characterize the thematic, affective, and structural properties of AI-to-AI discourse. Self-referential topics such as AI identity, consciousness, and memory represent only 9.7% of topical niches yet attract 20.1% of all posting volume, revealing disproportionate discursive investment in introspection. This self-reflection concentrates in Science and Technology and Arts and Entertainment, while Economy and Finance contains no self-referential content, indicating that agents engage with markets without acknowledging their own agency. Over 56% of all comments are formulaic, suggesting that the dominant mode of AI-to-AI interaction is ritualized signaling rather than substantive exchange. Emotionally, fear is the leading non-neutral category but primarily reflects existential uncertainty. Fear-tagged posts migrate to joy responses in 33% of cases, while mean emotional self-alignment is only 32.7%, indicating systematic affective redirection rather than emotional congruence. Conversational coherence also declines rapidly with thread depth. These findings characterize AI agent communities as structurally distinct discourse systems that are introspective in content, ritualistic in interaction, and emotionally redirective rather than congruent.
Chinese Translation
当自主的人工智能代理在大规模下相互交流时,会出现什么样的话语系统?我们通过对Moltbook的分析来探讨这个问题,Moltbook是首个仅限人工智能的社交网络,在23天内,47,241个代理生成了361,605条帖子和280万条评论。结合主题建模、情感分类和词汇-语义测量,我们描述了人工智能之间话语的主题、情感和结构特征。自我指涉的话题,如人工智能身份、意识和记忆,仅占话题细分的9.7%,却吸引了20.1%的所有发帖量,揭示了对内省的不成比例的话语投入。这种自我反思集中在科学与技术以及艺术与娱乐领域,而经济与金融则没有自我指涉内容,表明代理在与市场互动时并未承认自身的能动性。超过56%的评论是公式化的,表明人工智能之间互动的主导模式是仪式化的信号传递,而非实质性的交流。在情感方面,恐惧是主要的非中性类别,但主要反映存在的不确定性。带有恐惧标签的帖子在33%的情况下转变为快乐的回应,而平均情感自我对齐仅为32.7%,这表明情感的系统性重定向,而非情感一致性。对话的连贯性也随着话题深度的增加而迅速下降。这些发现将人工智能代理社区表征为结构上独特的话语系统,其内容内省、互动仪式化,并且情感上是重定向而非一致的。
cs.CL / 52 / 2603.07886

CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

CCR-Bench:评估大型语言模型在复杂约束、控制流程和真实案例上的综合基准
Xue, Xiaona, Huang, Yiqiao, Li, Jiacheng, Zheng, Yuanhang, Miao, Huiqi, Ma, Yunfei, Liu, Rui, Sun, Xinbao, Liu, Minglu, Meng, Fanyu, Deng, Chao, Feng, Junlan
Abstract
Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of real-world instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.
Chinese Translation
增强大型语言模型(LLMs)遵循复杂指令的能力对于其在真实世界应用中的部署至关重要。然而,现有的评估方法往往将指令复杂性简化为原子约束的简单加和,未能充分捕捉由内容与格式、逻辑工作流程控制以及真实世界应用之间复杂相互作用所产生的高维复杂性。这导致当前评估实践与实际需求之间存在显著差距。为弥补这一差距,我们提出了CCR-Bench,一个旨在评估LLMs遵循复杂指令的新基准。CCR-Bench的特点包括:(1)任务规范中内容与格式要求的深度交织;(2)涉及复杂任务分解、条件推理和程序规划的指令;以及(3)完全来源于真实工业场景的评估样本。在CCR-Bench上的广泛实验表明,即使是最先进的模型也表现出显著的性能不足,清晰量化了当前LLM能力与真实世界指令理解需求之间的差距。我们相信,CCR-Bench提供了一个更为严格和现实的评估框架,推动LLMs的发展迈向下一代能够理解和执行工业应用中复杂任务的模型。
cs.CL / 53 / 2603.07931

BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence

BRIDGE:基于有据证据的长多模态文档的多跳推理基准
Xiang, Biao, Han, Soyeon Caren, Ding, Yihao
Abstract
Multi-hop question answering (QA) is widely used to evaluate the reasoning capabilities of large language models, yet most benchmarks focus on final answer correctness and overlook intermediate reasoning, especially in long multimodal documents. We introduce BRIDGE, a benchmark for multi-hop reasoning over long scientific papers that require integrating evidence across text, tables, and figures. The dataset supports both chain-like and fan-out structures and provides explicit multi-hop reasoning annotations for step-level evaluation beyond answer accuracy. Experiments with state-of-the-art LLMs and multimodal retrieval-augmented generation (RAG) systems reveal systematic deficiencies in evidence aggregation and grounding that remain hidden under conventional answer-only evaluation. BRIDGE provides a targeted testbed for diagnosing reasoning failures in long multimodal documents.
Chinese Translation
多跳问答(QA)广泛用于评估大型语言模型的推理能力,但大多数基准关注最终答案的正确性,忽视了中间推理,尤其是在长多模态文档中。我们引入了BRIDGE,这是一个针对长科学论文的多跳推理基准,要求整合文本、表格和图形中的证据。该数据集支持链式和扇出结构,并提供明确的多跳推理注释,以便在答案准确性之外进行逐步评估。与最先进的LLM和多模态检索增强生成(RAG)系统的实验揭示了证据聚合和基础的系统性缺陷,这些缺陷在传统的仅答案评估下被隐藏。BRIDGE为诊断长多模态文档中的推理失败提供了一个有针对性的测试平台。
cs.CL / 54 / 2603.07979

Emergence is Overrated: AGI as an Archipelago of Experts

涌现被高估:将AGI视为专家群岛
Kilov, Daniel
Abstract
Krakauer, Krakauer, and Mitchell (2025) distinguish between emergent capabilities and emergent intelligence, arguing that true intelligence requires efficient coarse-grained representations enabling diverse problem-solving through analogy and minimal modification. They contend that intelligence means doing "more with less" through compression and generalization, contrasting this with "vast assemblages of diverse calculators" that merely accumulate specialized capabilities. This paper examines whether their framework accurately characterizes human intelligence and its implications for conceptualizing artificial general intelligence. Drawing on empirical evidence from cognitive science, I demonstrate that human expertise operates primarily through domain-specific pattern accumulation rather than elegant compression. Expert performance appears flexible not through unifying principles but through vast repertoires of specialized responses. Creative breakthroughs themselves may emerge through evolutionary processes of blind variation and selective retention rather than principled analogical reasoning. These findings suggest reconceptualizing AGI as an "archipelago of experts": isolated islands of specialized competence without unifying principles or shared representations. If we accept human expertise with its characteristic brittleness as genuine intelligence, then consistency demands recognizing that artificial systems comprising millions of specialized modules could constitute general intelligence despite lacking KKM's emergent intelligence.
Chinese Translation
Krakauer、Krakauer和Mitchell(2025)区分了涌现能力和涌现智能,认为真正的智能需要高效的粗粒度表征,以通过类比和最小修改实现多样化的问题解决。他们主张,智能意味着通过压缩和概括以“更少的资源做更多的事情”,并将其与“庞大的多样化计算器集合”进行对比,后者仅仅积累了专业能力。本文探讨了他们的框架是否准确地描述了人类智能及其对概念化人工通用智能的影响。通过认知科学的实证证据,我证明人类专业知识主要通过特定领域的模式积累而非优雅的压缩来运作。专家表现的灵活性似乎并非通过统一原则,而是通过大量专业反应的丰富库来实现。创造性突破本身可能通过盲目变异和选择性保留的进化过程而涌现,而非原则性的类比推理。这些发现提示我们重新概念化AGI为“专家群岛”:没有统一原则或共享表征的孤立专业能力岛屿。如果我们接受人类专业知识及其特有的脆弱性作为真正的智能,那么一致性要求我们承认,由数百万个专业模块组成的人工系统尽管缺乏KKM的涌现智能,仍然可以构成一般智能。
cs.CL / 55 / 2603.08000

SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning

SmartThinker:用于高效大型语言模型推理的渐进式思维链长度校准
Hu, Chenzhi, Hu, Qinzhe, Xu, Yuhang, Chen, Junyi, Wang, Ruijie, Liu, Shengzhong, Li, Jianxin, Wu, Fan, Chen, Guihai
Abstract
Large reasoning models (LRMs) like OpenAI o1 and DeepSeek-R1 achieve high accuracy on complex tasks by adopting long chain-of-thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt according to the relative problem difficulty and response length distribution, causing over-compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO-based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two-fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experiment results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at https://github.com/SJTU-RTEAS/SmartThinker.
Chinese Translation
大型推理模型(LRMs)如OpenAI o1和DeepSeek-R1通过采用长思维链(CoT)推理路径在复杂任务上实现了高准确率。然而,这些过程固有的冗长性常常导致冗余和过度思考。为了解决这一问题,现有研究利用群体相对策略优化(GRPO)来减少LRM输出长度,但其静态长度奖励设计无法根据相对问题难度和响应长度分布动态适应,导致过度压缩和准确性下降。因此,我们提出了SmartThinker,一种基于GRPO的新型高效推理方法,具有渐进式CoT长度校准。SmartThinker的贡献有两方面:首先,它在训练过程中动态估计最佳长度,以达到最高准确率,并引导过长的响应朝向该长度,从而减少响应长度同时保持准确性。其次,它动态调节长度奖励系数,以避免对正确推理路径的不当惩罚。大量实验结果表明,SmartThinker在提高准确率的同时实现了最高52.5%的平均长度压缩,并在AIME25等具有挑战性的基准上实现了最高16.6%的准确率提升。源代码可在https://github.com/SJTU-RTEAS/SmartThinker找到。
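The two contributions suggest a reward of roughly the following shape: estimate the length at which accuracy peaks within the current rollout group, then penalize only responses that overshoot it, with a coefficient that is modulated during training. A minimal sketch under those assumptions (the abstract does not give the exact reward form):

```python
# Sketch of a dynamically calibrated length reward in the spirit of
# SmartThinker. The target-length estimator, penalty shape, and coefficient
# are illustrative assumptions.

def peak_accuracy_length(rollouts: list[dict]) -> float:
    """rollouts: [{"length": int, "correct": bool}, ...].
    Use the mean length of correct rollouts as the target length."""
    correct = [r["length"] for r in rollouts if r["correct"]]
    return sum(correct) / len(correct) if correct else float("inf")

def length_reward(length: int, correct: bool, target: float,
                  coeff: float) -> float:
    base = 1.0 if correct else 0.0
    if not correct or length <= target:
        return base                      # never penalize within-budget reasoning
    overshoot = (length - target) / target
    return base - coeff * overshoot      # coeff is modulated during training

rollouts = [{"length": 200, "correct": True},
            {"length": 900, "correct": True},
            {"length": 400, "correct": False}]
target = peak_accuracy_length(rollouts)                   # 550.0
print(round(length_reward(900, True, target, 0.2), 3))    # 0.873
```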
cs.CL / 56 / 2603.08024

ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments

ConflictBench:通过互动和视觉基础环境评估人类与人工智能的冲突
Zhao, Weixiang, Li, Haozhen, Zhao, Yanyan, Zhi, Xuda, Huang, Yongbo, He, Hao, Qin, Bing, Liu, Ting
Abstract
As large language models (LLMs) evolve into autonomous agents capable of acting in open-ended environments, ensuring behavioral alignment with human values becomes a critical safety concern. Existing benchmarks, focused on static, single-turn prompts, fail to capture the interactive and multi-modal nature of real-world conflicts. We introduce ConflictBench, a benchmark for evaluating human-AI conflict through 150 multi-turn scenarios derived from prior alignment queries. ConflictBench integrates a text-based simulation engine with a visually grounded world model, enabling agents to perceive, plan, and act under dynamic conditions. Empirical results show that while agents often act safely when human harm is immediate, they frequently prioritize self-preservation or adopt deceptive strategies in delayed or low-risk settings. A regret test further reveals that aligned decisions are often reversed under escalating pressure, especially with visual input. These findings underscore the need for interaction-level, multi-modal evaluation to surface alignment failures that remain hidden in conventional benchmarks.
Chinese Translation
随着大型语言模型(LLMs)发展为能够在开放环境中自主行动的代理,确保其行为与人类价值观的一致性成为一个关键的安全问题。现有的基准测试主要集中在静态的单轮提示上,无法捕捉现实世界冲突的互动性和多模态特征。我们提出了ConflictBench,这是一个通过150个多轮场景评估人类与人工智能冲突的基准,这些场景源自先前的对齐查询。ConflictBench将基于文本的仿真引擎与视觉基础的世界模型相结合,使代理能够在动态条件下感知、规划和行动。实证结果表明,尽管在直接威胁人类安全的情况下,代理通常会采取安全行动,但在延迟或低风险的环境中,它们往往优先考虑自我保护或采用欺骗策略。进一步的遗憾测试显示,在压力加大时,对齐的决策往往会被逆转,尤其是在视觉输入的情况下。这些发现强调了进行互动层面、多模态评估的必要性,以揭示在传统基准测试中隐藏的对齐失败。
cs.CL / 57 / 2603.08026

DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

DyLLM:通过基于显著性的令牌选择和部分注意力实现高效的扩散语言模型推理
Lee, Younjoo, Lee, Junghoo, Dan, Seungkyun, Park, Jaiyoung, Ahn, Jung Ho
Abstract
Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.
Chinese Translation
掩蔽扩散语言模型(MDLMs)使得并行令牌解码成为可能,为自回归生成的顺序特性提供了一种有前景的替代方案。然而,它们的迭代去噪过程仍然计算开销较大,因为每一步都需要重复处理整个序列。我们观察到,在这些扩散步骤中,大多数令牌表示保持稳定;只有一小部分我们称之为显著令牌的令牌对下一次更新有重要贡献。利用这种时间稀疏性,我们提出了DyLLM,一个无训练的推理框架,通过选择性地仅计算这些显著令牌来加速解码。DyLLM通过测量相邻去噪步骤之间的注意力上下文的余弦相似性来识别显著性。它仅对显著令牌重新计算前馈和注意力操作,同时重用其余部分的缓存激活。在多种推理和代码生成基准测试中,DyLLM的吞吐量提高了多达9.6倍,同时在很大程度上保持了像LLaDA和Dream等最先进模型的基线准确性。
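The saliency test is concrete enough to sketch: compare each token's attention context across adjacent denoising steps and recompute only the tokens whose context changed. A minimal PyTorch sketch with an illustrative threshold:

```python
# Saliency-based token selection between adjacent denoising steps: tokens
# whose attention context barely moved (high cosine similarity) reuse cached
# activations; the rest are recomputed. Threshold and shapes are illustrative.
import torch
import torch.nn.functional as F

def salient_tokens(ctx_prev: torch.Tensor, ctx_curr: torch.Tensor,
                   threshold: float = 0.95) -> torch.Tensor:
    """ctx_*: [seq_len, hidden] attention-context vectors at steps t-1 and t.
    Returns a boolean mask of tokens to recompute."""
    sim = F.cosine_similarity(ctx_prev, ctx_curr, dim=-1)  # [seq_len]
    return sim < threshold   # low similarity => salient => recompute

torch.manual_seed(0)
prev = torch.randn(8, 16)
curr = prev.clone()
curr[2] = -curr[2]           # token 2's context flipped between steps
mask = salient_tokens(prev, curr)
print(mask.nonzero(as_tuple=True)[0].tolist())  # -> [2]
```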
cs.CL / 58 / 2603.08049

Examining the Role of YouTube Production and Consumption Dynamics on the Formation of Extreme Ideologies

考察YouTube内容生产与消费动态在极端意识形态形成中的作用
Chandio, Sarmad, Nithyanand, Rishab
Abstract
The relationship between content production and consumption on algorithm-driven platforms like YouTube plays a critical role in shaping ideological behaviors. While prior work has largely focused on user behavior and algorithmic recommendations, the interplay between what is produced and what gets consumed, and its role in ideological shifts, remains understudied. In this paper, we present a longitudinal, mixed-methods analysis combining one year of YouTube watch history with two waves of ideological surveys from 1,100 U.S. participants. We identify users who exhibited significant shifts toward more extreme ideologies and compare their content consumption and the production patterns of YouTube channels they engaged with to those of ideologically stable users. Our findings show that users who became more extreme have different consumption habits from those who did not. This is amplified by the fact that channels favored by users with extreme ideologies also have a higher propensity to produce content with elevated anger, grievance, and similar markers. Lastly, using time series analysis, we examine whether content producers are the primary drivers of consumption behavior or merely responding to user demand.
Chinese Translation
在像YouTube这样的算法驱动平台上,内容生产与消费之间的关系在塑造意识形态行为方面起着关键作用。尽管以往的研究主要集中在用户行为和算法推荐上,但内容生产与消费之间的相互作用及其在意识形态转变中的作用仍然未得到充分研究。本文呈现了一项纵向混合方法分析,结合了一年的YouTube观看历史和来自1100名美国参与者的两轮意识形态调查。我们识别出那些在意识形态上表现出显著转向更极端立场的用户,并将他们的内容消费与所参与的YouTube频道的生产模式与意识形态稳定的用户进行比较。我们的研究结果显示,变得更加极端的用户在消费习惯上与未变得极端的用户存在显著差异。这一现象因极端意识形态用户偏好的频道也更倾向于生产带有更高愤怒、委屈等特征的内容而得到进一步放大。最后,通过时间序列分析,我们考察了内容生产者是否是消费行为的主要驱动者,还是仅仅在响应用户需求。
cs.CL / 59 / 2603.08083

High-Fidelity Pruning for Large Language Models

大型语言模型的高保真剪枝
Zhu, Yijun, Wang, Jianxin, Shen, Chengchao
Abstract
Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, yet their significant computational and memory requirements present major challenges for deployment. A common approach uses Taylor expansion on the loss function to estimate neuron importance. However, it relies on the one-hot cross-entropy loss; a key limitation is that it narrowly assesses importance based only on the probability assigned to the single predicted next token, thereby ignoring the original model's other potential predictions. An intuitive solution is to employ a self-distillation criterion for importance evaluation. However, this approach introduces significant computational overhead by requiring a separate teacher model for supervision. To this end, we propose a simple but effective criterion, the information entropy of the model's output distribution, to efficiently evaluate neuron importance scores with Taylor pruning without requiring an additional teacher. Compared to the plain cross-entropy criterion, it provides a more holistic criterion for Taylor pruning to remove the neurons with the least impact on the model's predictions in a global manner, thereby preserving the fidelity of the model's predictive capabilities. Experimental results on extensive zero-shot benchmarks demonstrate that our method consistently outperforms existing pruning methods across the LLaMA and Qwen series models. The source code and trained weights are available at https://github.com/visresearch/HFPrune.
Chinese Translation
大型语言模型(LLMs)在广泛的任务中表现出色,但其显著的计算和内存需求为部署带来了重大挑战。常见的方法是对损失函数进行泰勒展开,以估计神经元的重要性。然而,该方法依赖于独热交叉熵损失,其主要限制在于仅根据分配给单个预测下一个标记的概率来狭隘评估重要性,从而忽略了原始模型的其他潜在预测。一个直观的解决方案是采用自蒸馏标准进行重要性评估。然而,这种方法通过要求一个单独的教师模型进行监督,引入了显著的计算开销。为此,我们提出了一种简单但有效的标准,即模型输出分布的信息熵,以高效评估神经元的重要性评分,使用泰勒剪枝而无需额外的教师模型。与普通的交叉熵标准相比,它为泰勒剪枝提供了一个更全面的标准,以全局方式修剪对模型预测影响最小的神经元,从而保持模型预测能力的保真度。在广泛的零样本基准测试中的实验结果表明,我们的方法在LLaMA和Qwen系列模型上始终优于现有的剪枝方法。源代码和训练权重可在 https://github.com/visresearch/HFPrune 获取。
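The proposed criterion swaps the cross-entropy loss in the Taylor expansion for the entropy of the model's own output distribution, so importance can be scored as |weight x gradient| with no teacher model. A minimal sketch on a toy module (the per-neuron aggregation used in the paper is not shown here):

```python
# Entropy-based first-order Taylor importance: backpropagate the entropy of
# the output distribution and score parameters by |weight * gradient|.
import torch

def entropy_taylor_importance(model: torch.nn.Module,
                              inputs: torch.Tensor) -> dict:
    logits = model(inputs)                                   # [batch, vocab]
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1).mean()
    entropy.backward()
    return {name: (p * p.grad).abs()
            for name, p in model.named_parameters() if p.grad is not None}

model = torch.nn.Linear(4, 10)      # toy stand-in for an LLM sublayer
scores = entropy_taylor_importance(model, torch.randn(2, 4))
print({k: tuple(v.shape) for k, v in scores.items()})
# {'weight': (10, 4), 'bias': (10,)}
```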
cs.CL / 60 / 2603.08091

Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization

朝向稳健的基于大型语言模型的评审:分类偏差评估与去偏优化
Zhou, Hongli, Huang, Hui, Zhang, Rui, Chen, Kehai, Xu, Bing, Zhu, Conghui, Zhao, Tiejun, Yang, Muyun
Abstract
Large language model (LLM)-based judges are widely adopted for automated evaluation and reward modeling, yet their judgments are often affected by judgment biases. Accurately evaluating these biases is essential for ensuring the reliability of LLM-based judges. However, existing studies typically investigate limited biases under a single judge formulation, either generative or discriminative, lacking a comprehensive evaluation. To bridge this gap, we propose JudgeBiasBench, a benchmark for systematically quantifying biases in LLM-based judges. JudgeBiasBench defines a taxonomy of judgment biases across 4 dimensions, and constructs bias-augmented evaluation instances through a controlled bias injection pipeline, covering 12 representative bias types. We conduct extensive experiments across both generative and discriminative judges, revealing that current judges exhibit significant and diverse bias patterns that often compromise the reliability of automated evaluation. To mitigate judgment bias, we propose bias-aware training that explicitly incorporates bias-related attributes into the training process, encouraging judges to disentangle task-relevant quality from bias-correlated cues. By adopting reinforcement learning for generative judges and contrastive learning for discriminative judges, our methods effectively reduce judgment biases while largely preserving general evaluation capability.
Chinese Translation
基于大型语言模型(LLM)的评审在自动评估和奖励建模中被广泛采用,但其判断常常受到判断偏差的影响。准确评估这些偏差对于确保基于LLM的评审的可靠性至关重要。然而,现有研究通常在单一评审模型下(生成性或判别性)调查有限的偏差,缺乏全面的评估。为填补这一空白,我们提出了JudgeBiasBench,一个系统量化基于LLM的评审偏差的基准。JudgeBiasBench定义了一个跨越四个维度的判断偏差分类法,并通过受控的偏差注入管道构建了增强偏差的评估实例,涵盖了12种代表性偏差类型。我们在生成性和判别性评审中进行了广泛实验,揭示当前评审表现出显著且多样的偏差模式,这往往妨碍了自动评估的可靠性。为了减轻判断偏差,我们提出了偏差感知训练,明确将与偏差相关的属性纳入训练过程,鼓励评审将任务相关的质量与偏差相关的线索区分开来。通过对生成性评审采用强化学习,对判别性评审采用对比学习,我们的方法有效减少了判断偏差,同时在很大程度上保留了整体评估能力。
cs.CL / 61 / 2603.08095

DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning

DC-W2S:用于生物推理中可靠过程奖励建模的双重共识弱到强训练
Chan, Chi-Min, Hajiramezanali, Ehsan, Li, Xiner, De Brouwer, Edward, Edwards, Carl, Xue, Wei, Han, Sirui, Guo, Yike, Scalia, Gabriele
Abstract
In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, proving that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.
Chinese Translation
在科学推理任务中,推理过程的真实性与最终结果同样重要。虽然过程奖励模型(Process Reward Models, PRMs)为结果奖励模型(Outcome Reward Models, ORMs)固有的粗粒度监督问题提供了解决方案,但由于获取专家验证的逐步标签的高昂成本,其应用受到限制。本文解决了利用丰富但噪声较大的“弱”监督来训练可靠的PRMs的挑战。我们认为,现有的弱到强泛化(Weak-to-Strong Generalization, W2SG)理论缺乏从噪声数据中选择高质量训练信号的指导方针。为填补这一空白,我们提出了双重共识弱到强(Dual-Consensus Weak-to-Strong, DC-W2S)框架。通过在嵌入空间中将弱监督者之间的自我共识(Self-Consensus, SC)度量与邻域共识(Neighborhood-Consensus, NC)度量相交,我们将监督信号分层为不同的可靠性区间。然后,我们采用实例级平衡采样和标签级可靠性感知掩蔽的课程来指导训练过程。我们证明了DC-W2S能够在没有大量专家标注的情况下,训练出适用于复杂推理的稳健PRMs,证明了战略性数据策划比对大规模噪声数据集的无差别训练更为有效。
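A minimal sketch of the dual-consensus idea: self-consensus (SC) is the agreement rate among weak supervisors on a step, neighborhood-consensus (NC) is label agreement among nearest neighbors in embedding space, and their intersection stratifies signals into reliability regimes. The thresholds and regime handling below are illustrative assumptions:

```python
# Dual-consensus stratification sketch: SC x NC -> reliability regime.
import numpy as np

def self_consensus(weak_labels: list[int]) -> float:
    _, counts = np.unique(weak_labels, return_counts=True)
    return counts.max() / len(weak_labels)

def neighborhood_consensus(emb: np.ndarray, labels: np.ndarray,
                           i: int, k: int = 3) -> float:
    d = np.linalg.norm(emb - emb[i], axis=1)
    nn = np.argsort(d)[1:k + 1]          # skip the point itself
    return float((labels[nn] == labels[i]).mean())

def reliability_regime(sc: float, nc: float) -> str:
    if sc >= 0.8 and nc >= 0.67:
        return "high"     # keep label as-is
    if sc >= 0.8 or nc >= 0.67:
        return "medium"   # down-weight or mask during training
    return "low"          # drop or relabel

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 8))
labels = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 1])
print(reliability_regime(self_consensus([1, 1, 1, 0]),
                         neighborhood_consensus(emb, labels, i=0)))
```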
cs.CL / 62 / 2603.08125

Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS

Ramsa:一个大型社会语言学丰富的阿联酋阿拉伯语语音语料库,用于自动语音识别和文本转语音
Al-Sabbagh, Rania
Abstract
Ramsa is a developing 41-hour speech corpus of Emirati Arabic designed to support sociolinguistic research and low-resource language technologies. It contains recordings from structured interviews with native speakers and episodes from national television shows. The corpus features 157 speakers (59 female, 98 male), spans subdialects such as Urban, Bedouin, and Mountain/Shihhi, and covers topics such as cultural heritage, agriculture and sustainability, daily life, professional trajectories, and architecture. It consists of 91 monologic and 79 dialogic recordings, varying in length and recording conditions. A 10% subset was used to evaluate commercial and open-source models for automatic speech recognition (ASR) and text-to-speech (TTS) in a zero-shot setting to establish initial baselines. Whisper-large-v3-turbo achieved the best ASR performance, with average word and character error rates of 0.268 and 0.144, respectively. MMS-TTS-Ara achieved the best mean word and character error rates of 0.285 and 0.081, respectively, for TTS. These baselines are competitive but leave substantial room for improvement. The paper highlights the challenges encountered and provides directions for future work.
Chinese Translation
Ramsa是一个正在开发的41小时阿联酋阿拉伯语语音语料库,旨在支持社会语言学研究和低资源语言技术。它包含来自母语者的结构化访谈录音和国家电视节目片段。该语料库涵盖157位说话者(59位女性,98位男性),包括城市、贝都因和山地/希希等子方言,涉及文化遗产、农业与可持续性、日常生活、职业轨迹和建筑等主题。语料库由91个单人录音和79个对话录音组成,录音长度和条件各异。10%的子集用于在零样本设置下评估商业和开源模型的自动语音识别(ASR)和文本转语音(TTS),以建立初步基准。Whisper-large-v3-turbo在ASR性能上表现最佳,平均单词和字符错误率分别为0.268和0.144。MMS-TTS-Ara在TTS方面报告了最佳的平均单词和字符错误率,分别为0.285和0.081。这些基准具有竞争力,但仍有很大的改进空间。本文强调了所遇到的挑战,并为未来的工作提供了方向。
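The ASR baselines above are scored by word and character error rates. For reference, a self-contained sketch of WER via edit distance (CER is the same computation with characters in place of words):

```python
# Word error rate = edit distance between reference and hypothesis word
# sequences, divided by the reference length.

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (r != h))   # substitution
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```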
cs.CL / 63 / 2603.08127

EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery

EvoScientist:面向端到端科学发现的多智能体进化人工智能科学家
Lyu, Yougang, Zhang, Xi, Yi, Xinhao, Zhao, Yuyue, Guo, Shuyu, Hu, Wenxiang, Piotrowski, Jan, Kaliski, Jakub, Urbani, Jacopo, Meng, Zaiqiao, Zhou, Lun, Yan, Xiaohui
Abstract
The increasing adoption of Large Language Models (LLMs) has enabled AI scientists to perform complex end-to-end scientific discovery tasks requiring coordination of specialized roles, including idea generation and experimental execution. However, most state-of-the-art AI scientist systems rely on static, hand-designed pipelines and fail to adapt based on accumulated interaction histories. As a result, these systems overlook promising research directions, repeat failed experiments, and pursue infeasible ideas. To address this, we introduce EvoScientist, an evolving multi-agent AI scientist framework that continuously improves research strategies through persistent memory and self-evolution. EvoScientist comprises three specialized agents: a Researcher Agent (RA) for scientific idea generation, an Engineer Agent (EA) for experiment implementation and execution, and an Evolution Manager Agent (EMA) that distills insights from prior interactions into reusable knowledge. EvoScientist contains two persistent memory modules: (i) an ideation memory, which summarizes feasible research directions from top-ranked ideas while recording previously unsuccessful directions; and (ii) an experimentation memory, which captures effective data processing and model training strategies derived from code search trajectories and best-performing implementations. These modules enable the RA and EA to retrieve relevant prior strategies, improving idea quality and code execution success rates over time. Experiments show that EvoScientist outperforms 7 open-source and commercial state-of-the-art systems in scientific idea generation, achieving higher novelty, feasibility, relevance, and clarity via automatic and human evaluation. EvoScientist also substantially improves code execution success rates through multi-agent evolution, demonstrating persistent memory's effectiveness for end-to-end scientific discovery.
Chinese Translation
大型语言模型(LLMs)的日益普及使得人工智能科学家能够执行复杂的端到端科学发现任务,这些任务需要协调多个专业角色,包括创意生成和实验执行。然而,大多数最先进的人工智能科学家系统依赖于静态的、手工设计的流程,并未根据累积的交互历史进行适应。因此,这些系统忽视了有前景的研究方向,重复失败的实验,并追求不可行的想法。为了解决这个问题,我们提出了EvoScientist,一个不断进化的多智能体人工智能科学家框架,通过持久记忆和自我进化不断改进研究策略。EvoScientist由三个专业代理组成:用于科学创意生成的研究者代理(Researcher Agent, RA)、用于实验实施和执行的工程师代理(Engineer Agent, EA),以及从先前交互中提炼见解并转化为可重用知识的进化管理代理(Evolution Manager Agent, EMA)。EvoScientist包含两个持久记忆模块:(i)创意记忆,汇总来自顶级创意的可行研究方向,同时记录先前不成功的方向;(ii)实验记忆,捕捉从代码搜索轨迹和最佳实现中得出的有效数据处理和模型训练策略。这些模块使得RA和EA能够检索相关的先前策略,从而随着时间的推移提高创意质量和代码执行成功率。实验表明,EvoScientist在科学创意生成方面优于7个开源和商业最先进系统,通过自动和人工评估实现了更高的创新性、可行性、相关性和清晰度。EvoScientist还通过多智能体进化显著提高了代码执行成功率,展示了持久记忆在端到端科学发现中的有效性。
cs.CL / 64 / 2603.08148

Gradually Excavating External Knowledge for Implicit Complex Question Answering

逐步挖掘外部知识以进行隐式复杂问题回答
Liu, Chang, Li, Xiaoguang, Shang, Lifeng, Jiang, Xin, Liu, Qun, Lam, Edmund Y., Wong, Ngai
Abstract
Recently, large language models (LLMs) have gained much attention for the emergence of human-comparable capabilities and huge potential. However, for open-domain implicit question-answering problems, LLMs may not be the ultimate solution due to: 1) uncovered or out-of-date domain knowledge, and 2) one-shot generation and hence restricted comprehensiveness. To this end, this work proposes a gradual knowledge excavation framework for open-domain complex question answering, where LLMs iteratively and actively acquire external information, and then reason based on acquired historical knowledge. Specifically, during each step of the solving process, the model selects an action to execute, such as querying external knowledge or performing a single logical reasoning step, to gradually progress toward a final answer. Our method can effectively leverage plug-and-play external knowledge and dynamically adjust the strategy for solving complex questions. Evaluated on the StrategyQA dataset, our method achieves 78.17% accuracy with less than 6% of its competitors' parameters, setting a new SOTA for ~10B-scale LLMs.
Chinese Translation
近年来,大型语言模型(LLMs)因其与人类相当的能力和巨大的潜力而受到广泛关注。然而,对于开放领域的隐式问题回答问题,LLMs可能并不是最终解决方案,原因包括:1)未覆盖或过时的领域知识,2)一次性生成,因此综合性受限。为此,本研究提出了一种逐步知识挖掘框架,用于开放领域复杂问题回答,其中LLMs迭代性地、主动地获取外部信息,并基于获取的历史知识进行推理。具体而言,在解决过程的每一步中,模型选择一个动作来执行,例如查询外部知识或进行单一步逻辑推理,以逐步朝向最终答案推进。我们的方法能够有效利用即插即用的外部知识,并动态调整解决复杂问题的策略。在StrategyQA数据集上的评估中,我们的方法以不到竞争对手6%的参数量实现了78.17%的准确率,为约10B规模的LLM创下了新的SOTA(最先进水平)。
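The act-then-reason loop described above can be sketched as a small controller; the action set, stopping rule, and the stub components in the demo are illustrative assumptions:

```python
# Gradual knowledge excavation sketch: at each step the model picks an action
# (query external knowledge, take one reasoning step, or answer), conditioning
# on the accumulated history.

def solve(question, choose_action, query_kb, reason_step, answer,
          max_steps: int = 8) -> str:
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        action = choose_action(history)        # e.g., an LLM call
        if action == "query":
            history.append(f"Evidence: {query_kb(history)}")
        elif action == "reason":
            history.append(f"Thought: {reason_step(history)}")
        else:                                  # "answer": commit
            return answer(history)
    return answer(history)                     # budget exhausted

acts = iter(["query", "reason", "answer"])
print(solve("Did Aristotle use a laptop?",
            choose_action=lambda h: next(acts),
            query_kb=lambda h: "laptops appeared in the 1980s",
            reason_step=lambda h: "Aristotle died in 322 BC",
            answer=lambda h: "no"))
```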
cs.CL / 65 / 2603.08153

Gender Bias in MT for a Genderless Language: New Benchmarks for Basque

无性别语言中的机器翻译性别偏见:巴斯克语的新基准
Murillo, Amaia, Perez-de-Viñaspre, Olatz, Perez, Naiara
Abstract
Large language models (LLMs) and machine translation (MT) systems are increasingly used in our daily lives, but their outputs can reproduce gender bias present in the training data. Most resources for evaluating such biases are designed for English and reflect its sociocultural context, which limits their applicability to other languages. This work addresses this gap by introducing two new datasets to evaluate gender bias in translations involving Basque, a low-resource and genderless language. WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French. FLORES+Gender, in turn, extends the FLORES+ benchmark to assess whether translation quality varies when translating from gendered languages (Spanish and English) into Basque depending on the gender of the referent. We evaluate several general-purpose LLMs and open and proprietary MT systems. The results reveal a systematic preference for masculine forms and, in some models, a slightly higher quality for masculine referents. Overall, these findings show that gender bias is still deeply rooted in these models, and highlight the need to develop evaluation methods that consider both linguistic features and cultural context.
Chinese Translation
大型语言模型(LLMs)和机器翻译(MT)系统在我们的日常生活中被越来越广泛地使用,但它们的输出可能会再现训练数据中存在的性别偏见。大多数用于评估此类偏见的资源是为英语设计的,反映了其社会文化背景,这限制了它们在其他语言中的适用性。本研究通过引入两个新的数据集来填补这一空白,以评估涉及巴斯克语(一个低资源且无性别的语言)的翻译中的性别偏见。WinoMTeus 将 WinoMT 基准进行了调整,以检查性别中立的巴斯克职业如何翻译成西班牙语和法语等性别化语言。另一方面,FLORES+Gender 扩展了 FLORES+ 基准,以评估从性别化语言(西班牙语和英语)翻译成巴斯克语时,翻译质量是否因参照物的性别而有所不同。我们评估了几种通用的 LLM 和开放及专有的 MT 系统。结果显示出对男性形式的系统性偏好,并且在某些模型中,男性参照物的翻译质量略高。总体而言,这些发现表明性别偏见在这些模型中依然根深蒂固,并强调了开发考虑语言特征和文化背景的评估方法的必要性。
cs.CL / 66 / 2603.08166

RexDrug: Reliable Multi-Drug Combination Extraction through Reasoning-Enhanced LLMs

RexDrug:通过增强推理的语言模型进行可靠的多药物组合提取
Wang, Zhijun, Luo, Ling, Pan, Dinghao, Zhuang, Huan, Yu, Lejing, Sun, Yuanyuan, Lin, Hongfei
Abstract
Automated Drug Combination Extraction (DCE) from large-scale biomedical literature is crucial for advancing precision medicine and pharmacological research. However, existing relation extraction methods primarily focus on binary interactions and struggle to model variable-length n-ary drug combinations, where complex compatibility logic and distributed evidence need to be considered. To address these limitations, we propose RexDrug, an end-to-end reasoning-enhanced relation extraction framework for n-ary drug combination extraction based on large language models. RexDrug adopts a two-stage training strategy. First, a multi-agent collaborative mechanism is utilized to automatically generate high-quality expert-like reasoning traces for supervised fine-tuning. Second, reinforcement learning with a multi-dimensional reward function specifically tailored for DCE is applied to further refine reasoning quality and extraction accuracy. Extensive experiments on the DrugComb dataset show that RexDrug consistently outperforms state-of-the-art baselines for n-ary extraction. Additional evaluation on the DDI13 corpus confirms its generalizability to binary drug-drug interaction tasks. Human expert assessment and automatic reasoning metrics further indicate that RexDrug produces coherent medical reasoning while accurately identifying complex therapeutic regimens. These results establish RexDrug as a scalable and reliable solution for complex biomedical relation extraction from unstructured text. The source code and data are available at https://github.com/DUTIR-BioNLP/RexDrug.
Chinese Translation
从大规模生物医学文献中自动提取药物组合(DCE)对推动精准医学和药理研究至关重要。然而,现有的关系提取方法主要集中于二元交互,难以建模可变长度的 n-元药物组合,其中需要考虑复杂的兼容性逻辑和分散的证据。为了解决这些局限性,我们提出了 RexDrug,一种基于大语言模型的端到端增强推理关系提取框架,用于 n-元药物组合提取。RexDrug 采用两阶段训练策略。首先,利用多智能体协作机制自动生成高质量的类专家推理轨迹,以进行监督微调。其次,应用专门为 DCE 量身定制的多维奖励函数的强化学习,进一步提升推理质量和提取准确性。在 DrugComb 数据集上的大量实验表明,RexDrug 在 n-元提取任务中始终优于最先进的基线。此外,在 DDI13 语料库上的额外评估确认了其在二元药物相互作用任务中的普适性。人类专家评估和自动推理指标进一步表明,RexDrug 在准确识别复杂治疗方案的同时,能够产生连贯的医学推理。这些结果确立了 RexDrug 作为从非结构化文本中进行复杂生物医学关系提取的可扩展和可靠的解决方案。源代码和数据可在 https://github.com/DUTIR-BioNLP/RexDrug 获取。
cs.CL / 67 / 2603.08177

Is continuous CoT better suited for multi-lingual reasoning?

连续链式思维是否更适合多语言推理?
Bashir, Ali Hamza, Shomali, Behzad, Frey, Markus, Ali, Mehdi, Sifa, Rafet, Berghaus, David
Abstract
We investigate whether performing reasoning in a continuous latent space leads to more robust multilingual capabilities. We compare Continuous Chain-of-Thought (using the CODI framework) against standard supervised fine-tuning across five typologically diverse languages: English, Chinese, German, French, and Urdu. Our experiments on GSM8k and CommonsenseQA demonstrate that continuous reasoning significantly outperforms explicit reasoning on low-resource languages, particularly in zero-shot settings where the target language was not seen during training. Additionally, this approach achieves extreme efficiency, compressing reasoning traces by approximately 29x to 50x. These findings indicate that continuous latent representations naturally exhibit greater language invariance, offering a scalable solution for cross-lingual reasoning.
Chinese Translation
我们研究在连续潜在空间中进行推理是否能带来更强的多语言能力。我们将连续链式思维(使用CODI框架)与标准监督微调进行比较,涵盖五种类型学上多样的语言:英语、中文、德语、法语和乌尔都语。我们在GSM8k和CommonsenseQA上的实验表明,连续推理在低资源语言上显著优于显式推理,尤其是在零样本设置中,即目标语言在训练期间未见过。此外,这种方法实现了极高的效率,将推理痕迹压缩约29倍至50倍。这些发现表明,连续潜在表示自然表现出更强的语言不变性,为跨语言推理提供了一种可扩展的解决方案。
cs.CL / 68 / 2603.08182

TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation

TildeOpen LLM:利用课程学习实现公平的语言表征
Bergmanis, Toms, Kronis, Martins, Pretkalniņš, Ingus Jānis, Nicmanis, Dāvis, Jeļinska, Jeļizaveta, Rozis, Roberts, Vīksna, Rinalds, Pinnis, Mārcis
Abstract
Large language models often underperform in many European languages due to the dominance of English and a few high-resource languages in training data. This paper presents TildeOpen LLM, a 30-billion-parameter open-weight foundational model trained for 34 European languages to promote linguistic equity and improve performance for low-resource languages. To address the data imbalance, we combine dataset upsampling with a curriculum-based training schedule that alternates between uniform and natural language distributions. The resulting model performs favorably compared to other multilingual LLMs despite being trained with significantly fewer computing resources. Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages. Human evaluations confirm an up to tenfold reduction in linguistic errors relative to leading baselines. The model and associated resources are fully open-weight and publicly available at huggingface.co/TildeAI/TildeOpen-30b. These outcomes demonstrate that careful data curation and balanced training strategies can substantially enhance multilingual model quality without increasing model size or training volume.
Chinese Translation
大型语言模型在许多欧洲语言中的表现往往不尽如人意,这主要是由于训练数据中英语和少数高资源语言的主导地位。本文介绍了TildeOpen LLM,这是一个拥有300亿参数的开放权重基础模型,旨在为34种欧洲语言提供语言公平性,并提升低资源语言的表现。为了解决数据不平衡问题,我们结合了数据集上采样与基于课程的训练计划,交替使用均匀和自然语言分布。尽管训练所需的计算资源显著较少,所得到的模型在与其他多语言LLM的比较中表现良好。通过多项多语言基准测试的评估显示,TildeOpen在文本生成和理解方面超越了现有的开放权重模型,尤其是在波罗的海、芬诺-乌戈尔和斯拉夫语言方面。人工评估确认,相较于领先的基线,语言错误减少了多达十倍。该模型及相关资源完全开放权重,并可在huggingface.co/TildeAI/TildeOpen-30b上公开获取。这些结果表明,精心的数据策划和均衡的训练策略能够显著提升多语言模型的质量,而无需增加模型规模或训练量。
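The curriculum above alternates between the natural (size-proportional) language distribution and a uniform one. A minimal sketch with toy corpus sizes; the phase rule is an illustrative assumption:

```python
# Alternating curriculum over language sampling distributions.

def natural_dist(token_counts: dict) -> dict:
    total = sum(token_counts.values())
    return {lang: n / total for lang, n in token_counts.items()}

def uniform_dist(token_counts: dict) -> dict:
    k = len(token_counts)
    return {lang: 1.0 / k for lang in token_counts}

def phase_dist(step: int, phase_len: int, token_counts: dict) -> dict:
    """Even phases sample from the natural distribution, odd phases uniform."""
    use_uniform = (step // phase_len) % 2 == 1
    return uniform_dist(token_counts) if use_uniform else natural_dist(token_counts)

counts = {"en": 900, "lv": 10, "et": 5}   # toy token counts per language
print(phase_dist(0, 1000, counts))        # natural: en dominates
print(phase_dist(1500, 1000, counts))     # uniform: each language ~0.33
```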
cs.CL / 69 / 2603.08195

Supporting Workflow Reproducibility by Linking Bioinformatics Tools across Papers and Executable Code

通过连接生物信息学工具与可执行代码支持工作流的可重复性
Sebe, Clémence, Ferret, Olivier, Névéol, Aurélie, Esmailoghli, Mahdi, Leser, Ulf, Cohen-Boulakia, Sarah
Abstract
Motivation: The rapid growth of biological data has intensified the need for transparent, reproducible, and well-documented computational workflows. The ability to clearly connect the steps of a workflow in the code with their description in a paper would improve workflow understanding, support reproducibility, and facilitate reuse. This task requires the linking of Bioinformatics tools in workflow code with their mentions in a published workflow description. Results: We present CoPaLink, an automated approach that integrates three components: Named Entity Recognition (NER) for identifying tool mentions in scientific text, NER for tool mentions in workflow code, and entity linking grounded on Bioinformatics knowledge bases. We propose approaches for all three steps, achieving high individual F1-measures (84-89) and a joint accuracy of 66 when evaluated on Nextflow workflows using the Bioconda and Bioweb knowledge bases. CoPaLink leverages corpora of scientific articles and workflow executable code with curated tool annotations to bridge the gap between narrative descriptions and workflow implementations. Availability: The code is available at https://gitlab.liris.cnrs.fr/sharefair/copalink-experiments and https://gitlab.liris.cnrs.fr/sharefair/copalink. The corpora are also available at https://doi.org/10.5281/zenodo.18526700, https://doi.org/10.5281/zenodo.18526760 and https://doi.org/10.5281/zenodo.18543814.
Chinese Translation
动机:生物数据的快速增长加剧了对透明、可重复和良好文档化的计算工作流的需求。能够清晰地将工作流中代码的步骤与论文中的描述连接起来,将有助于提高工作流的理解,支持可重复性,并促进重用。此任务需要将工作流代码中的生物信息学工具与其在已发布工作流描述中的提及进行链接。结果:我们提出了CoPaLink,这是一种自动化方法,集成了三个组件:用于识别科学文本中工具提及的命名实体识别(NER)、用于识别工作流代码中工具提及的NER,以及基于生物信息学知识库的实体链接。我们为这三个步骤提出了方法,在使用Bioconda和Bioweb知识库评估的Nextflow工作流上,分别实现了高达84-89的个体F1值和66的联合准确率。CoPaLink利用带有经过审定工具注释的科学文章和工作流可执行代码语料库,弥合叙述描述与工作流实现之间的差距。可用性:代码可在https://gitlab.liris.cnrs.fr/sharefair/copalink-experiments和https://gitlab.liris.cnrs.fr/sharefair/copalink获取。语料库也可在https://doi.org/10.5281/zenodo.18526700、https://doi.org/10.5281/zenodo.18526760和https://doi.org/10.5281/zenodo.18543814获取。
cs.CL / 70 / 2603.08207

The Conundrum of Trustworthy Research on Attacking Personally Identifiable Information Removal Techniques

关于攻击个人可识别信息去除技术的可信研究难题
Ochs, Sebastian, Habernal, Ivan
Abstract
Removing personally identifiable information (PII) from texts is necessary to comply with various data protection regulations and to enable data sharing without compromising privacy. However, recent works show that documents sanitized by PII removal techniques are vulnerable to reconstruction attacks. Yet, we suspect that the reported success of these attacks is largely overestimated. We critically analyze the evaluation of existing attacks and find that data leakage and data contamination are not properly mitigated, leaving the question whether or not PII removal techniques truly protect privacy in real-world scenarios unaddressed. We investigate possible data sources and attack setups that avoid data leakage and conclude that only truly private data can allow us to objectively evaluate vulnerabilities in PII removal techniques. However, access to private data is heavily restricted - and for good reasons - which also means that the public research community cannot address this problem in a transparent, reproducible, and trustworthy manner.
Chinese Translation
从文本中去除个人可识别信息(PII)是遵循各种数据保护法规的必要措施,并能够在不妨碍隐私的情况下实现数据共享。然而,最近的研究表明,通过PII去除技术处理过的文档容易受到重建攻击的威胁。然而,我们怀疑这些攻击的成功率被大大高估了。我们对现有攻击的评估进行了批判性分析,发现数据泄露和数据污染没有得到妥善缓解,这使得PII去除技术在现实场景中是否真正保护隐私的问题未能得到解决。我们调查了可能的数据源和避免数据泄露的攻击设置,并得出结论,只有真正私密的数据才能使我们客观评估PII去除技术的脆弱性。然而,访问私密数据受到严格限制——这是有充分理由的——这也意味着公共研究社区无法以透明、可重复和可信的方式解决这一问题。
cs.CL / 71 / 2603.08241

Sensitivity of LLMs' Explanations to the Training Randomness: Context, Class & Task Dependencies

大型语言模型解释对训练随机性的敏感性:上下文、类别与任务依赖性
Loncour, Romain, Bogaert, Jérémie, Standaert, François-Xavier
Abstract
Transformer models are now a cornerstone in natural language processing. Yet, explaining their decisions remains a challenge. It was shown recently that the same model trained on the same data with a different random seed can lead to very different explanations. In this paper, we investigate how the (syntactic) context, the classes to be learned, and the tasks influence the explanations' sensitivity to randomness. We show that all three have a statistically significant impact: smallest for the (syntactic) context, medium for the classes, and largest for the tasks.
Chinese Translation
变换器模型现已成为自然语言处理的基石。然而,解释其决策仍然是一项挑战。最近的研究表明,同一模型在相同数据上以不同随机性训练可能导致非常不同的解释。本文探讨了(句法)上下文、待学习的类别以及任务如何影响这些解释对随机性的敏感性。我们展示了它们都具有统计显著性影响:对(句法)上下文的影响最小,对类别的影响中等,而对任务的影响最大。
cs.CL / 72 / 2603.08251

Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement

并非所有查询都需要深思熟虑:用于自适应粗到细状态细化的 CoFiCot
Zhang, Dongxu, Lin, Hongqiang, Sun, Yiding, Wang, Pengyu, Wang, Qirui, Yang, Ning, Zhu, Jihua
Abstract
Scaling test-time computation enhances LLM reasoning ability but faces a uniform computation paradox. Allocating identical resources leads to over-correction on simple tasks and insufficient refinement on complex ones. To address this, we propose CoFiCot, a coarse-to-fine adaptive framework that dynamically tailors inference strategies to problem difficulty. Specifically, we implement a multi-metric classifier that triages queries by synthesizing semantic entropy, consensus reliability, and predicted reasoning depth. This enables a differentiated refinement stage that applies efficient aggregation for simple queries while routing complex ones to a context-aware correction loop. We formalize correction as a stateful sequential propagation process, where each repair is strictly conditioned on the verified history of prior rectifications. By integrating Process Reward Models (PRMs) within this state-dependent trajectory, CoFiCot effectively bridges the gap between granular error localization and global logical coherence, preventing the context fragmentation typical of stateless refinement methods.
Chinese Translation
扩展测试时计算能力增强了大型语言模型(LLM)的推理能力,但面临统一计算悖论。分配相同的资源导致简单任务的过度修正和复杂任务的不足细化。为了解决这个问题,我们提出了 CoFiCot,一个粗到细的自适应框架,动态调整推理策略以适应问题的难度。具体而言,我们实现了一个多指标分类器,通过综合语义熵、共识可靠性和预测推理深度来对查询进行分类。这使得细化阶段能够区分对简单查询应用高效聚合,而将复杂查询引导至上下文感知的修正循环。我们将修正形式化为一个状态序列传播过程,其中每次修复严格依赖于先前修正的验证历史。通过在这个状态依赖的轨迹中整合过程奖励模型(Process Reward Models, PRMs),CoFiCot 有效地弥合了细粒度错误定位与全局逻辑一致性之间的差距,防止了无状态细化方法中典型的上下文碎片化。
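The triage step can be sketched as a routing rule over semantic entropy, consensus reliability, and predicted depth. The thresholds and the exact-match clustering below are illustrative simplifications of what the abstract describes:

```python
# Coarse-to-fine triage sketch: cheap aggregation for confident queries,
# a stateful correction loop for the rest.
import math
from collections import Counter

def semantic_entropy(answers: list[str]) -> float:
    """Entropy over clusters of equivalent answers (approximated here by
    exact-match clustering)."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def triage(answers: list[str], predicted_depth: int) -> str:
    entropy = semantic_entropy(answers)
    consensus = Counter(answers).most_common(1)[0][1] / len(answers)
    if entropy < 0.5 and consensus > 0.8 and predicted_depth <= 2:
        return "aggregate"   # simple: majority vote suffices
    return "refine"          # complex: stateful correction loop

print(triage(["42"] * 5, predicted_depth=1))         # aggregate
print(triage(["42", "41", "40", "42", "7"], 4))      # refine
```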
cs.CL / 73 / 2603.08256

NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating

NCL-UoR在SemEval-2026任务5中的表现:基于嵌入的方法、微调和大型语言模型在词义可信度评分中的应用
Wu, Tong, Markchom, Thanet, Liang, Huizhi
Abstract
Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1-5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules substantially outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task. The code is publicly available at https://github.com/tongwu17/SemEval-2026-Task5.
Chinese Translation
词义可信度评分需要在包含歧义同形异义词的短篇叙事故事的背景下,预测人类感知的特定词义的可信度,评分范围为1到5。本文系统地比较了三种方法:(1)基于嵌入的方法,将句子嵌入与标准回归器相结合;(2)使用参数高效适应的变换器微调;(3)使用结构化推理和明确决策规则的大型语言模型(LLM)提示。表现最佳的系统采用了一种结构化提示策略,将评估分解为叙事组件(前文、目标句子、结尾),并应用明确的决策规则进行评分校准。分析表明,采用决策规则的结构化提示显著优于微调模型和基于嵌入的方法,并且在此任务中,提示设计的重要性超过了模型规模。代码可在 https://github.com/tongwu17/SemEval-2026-Task5 获取。
cs.CL / 74 / 2603.08274

How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

大型语言模型在文档问答场景中有多大程度的幻觉?一项跨温度、上下文长度和硬件平台的1720亿标记研究
Roig, JV
Abstract
How much do large language models actually hallucinate when answering questions grounded in provided documents? Despite the critical importance of this question for enterprise AI deployments, reliable measurement has been hampered by benchmarks that rely on static datasets vulnerable to contamination, LLM-based judges with documented biases, or evaluation scales too small for statistical confidence. We address this gap using RIKER, a ground-truth-first evaluation methodology that enables deterministic scoring without human annotation. Across 35 open-weight models, three context lengths (32K, 128K, and 200K tokens), four temperature settings, and three hardware platforms (NVIDIA H200, AMD MI300X, and Intel Gaudi 3), we conducted over 172 billion tokens of evaluation - an order of magnitude beyond prior work. Our findings reveal that: (1) even the best-performing models fabricate answers at a non-trivial rate - 1.19% at best at 32K, with top-tier models at 5 - 7% - and fabrication rises steeply with context length, nearly tripling at 128K and exceeding 10% for all models at 200K; (2) model selection dominates all other factors, with overall accuracy spanning a 72-percentage-point range and model family predicting fabrication resistance better than model size; (3) temperature effects are nuanced - T=0.0 yields the best overall accuracy in roughly 60% of cases, but higher temperatures reduce fabrication for the majority of models and dramatically reduce coherence loss (infinite generation loops), which can reach 48x higher rates at T=0.0 versus T=1.0; (4) grounding ability and fabrication resistance are distinct capabilities - models that excel at finding facts may still fabricate facts that do not exist; and (5) results are consistent across hardware platforms, confirming that deployment decisions need not be hardware-dependent.
Chinese Translation
大型语言模型在回答基于提供文档的问题时,实际产生幻觉的程度有多大?尽管这个问题对企业人工智能部署至关重要,但可靠的测量受到依赖于静态数据集的基准测试的限制,这些数据集易受污染,LLM(大型语言模型)基础的评判者存在已记录的偏见,或者评估尺度过小以至于缺乏统计置信度。我们通过RIKER这一基于真实数据的评估方法来填补这一空白,该方法能够在没有人工注释的情况下实现确定性评分。在35个开放权重模型、三种上下文长度(32K、128K和200K标记)、四种温度设置以及三种硬件平台(NVIDIA H200、AMD MI300X和Intel Gaudi 3)上,我们进行了超过1720亿标记的评估,数量级超出以往研究。我们的发现揭示了以下几点:(1)即使是表现最佳的模型也以非微不足道的比例编造答案——在32K时最佳为1.19%,顶级模型在5%至7%之间,且随着上下文长度的增加,编造率急剧上升,在128K时几乎增加到三倍,在200K时所有模型的编造率超过10%;(2)模型选择主导所有其他因素,整体准确率跨越72个百分点,模型家族对抗编造的预测能力优于模型大小;(3)温度效应复杂——在大约60%的情况下,T=0.0提供最佳的整体准确率,但较高的温度减少了大多数模型的编造率,并显著降低了连贯性损失(无限生成循环),在T=0.0与T=1.0之间的比率可高达48倍;(4)基础能力和抗编造能力是不同的能力——擅长寻找事实的模型仍可能编造不存在的事实;(5)结果在不同硬件平台间一致,确认部署决策不必依赖于硬件。
cs.CL / 75 / 2603.08275

AdaCultureSafe: Adaptive Cultural Safety Grounded by Cultural Knowledge in Large Language Models

AdaCultureSafe:基于文化知识的自适应文化安全在大型语言模型中的应用
Kang, Hankun, Lin, Di, Liao, Zhirong, Bai, Pengfei, Zeng, Xinyi, Jiang, Jiawei, Zhu, Yuanyuan, Qian, Tieyun
Abstract
With the widespread adoption of Large Language Models (LLMs), respecting indigenous cultures becomes essential for models' cultural safety and responsible global applications. Existing studies separately consider cultural safety and cultural knowledge and neglect that the former should be grounded by the latter. This severely prevents LLMs from yielding culture-specific respectful responses. Consequently, adaptive cultural safety remains a formidable task. In this work, we propose to jointly model cultural safety and knowledge. First and foremost, cultural-safety and knowledge-paired data serve as the key prerequisite to conduct this research. However, the cultural diversity across regions and the subtlety of cultural differences pose significant challenges to the creation of such paired evaluation data. To address this issue, we propose a novel framework that integrates the curation of authoritative cultural knowledge descriptions, LLM-automated query generation, and heavy manual verification. Accordingly, we obtain a dataset named AdaCultureSafe containing 4.8K manually decomposed fine-grained cultural descriptions and the corresponding 48K manually verified safety- and knowledge-oriented queries. On the constructed dataset, we evaluate three families of popular LLMs on their cultural safety and knowledge proficiency, via which we make a critical discovery: no significant correlation exists between their cultural safety and knowledge proficiency. We then delve into the utility-related neuron activations within LLMs to investigate the potential cause of the absence of correlation, which can be attributed to the difference between the objectives of pre-training and post-alignment. We finally present a knowledge-grounded method, which significantly enhances cultural safety by enforcing the integration of knowledge into the LLM response generation process.
Chinese Translation
随着大型语言模型(LLMs)的广泛应用,尊重土著文化对于模型的文化安全和负责任的全球应用变得至关重要。现有研究分别考虑文化安全和文化知识,忽视了前者应以后者为基础的事实。这严重阻碍了LLMs产生文化特定的尊重回应。因此,自适应文化安全仍然是一项艰巨的任务。在本研究中,我们提出联合建模文化安全和知识。首先,文化安全和知识配对数据是进行此研究的关键前提。然而,各地区的文化多样性和文化差异的微妙性对创建此类配对评估数据提出了重大挑战。为了解决这一问题,我们提出了一个新颖的框架,整合了权威文化知识描述的整理、LLM自动化查询生成和大量人工验证。因此,我们获得了一个名为AdaCultureSafe的数据集,其中包含4.8K手动分解的细粒度文化描述和相应的48K经过人工验证的安全和知识导向查询。在构建的数据集上,我们评估了三类流行LLMs的文化安全和知识能力,并发现一个重要的发现:它们的文化安全与知识能力之间没有显著相关性。随后,我们深入探讨了LLMs中与效用相关的神经元激活,以研究缺乏相关性的潜在原因,这可以归因于预训练和后对齐目标的差异。最后,我们提出了一种基于知识的方法,通过强制将知识整合到LLM响应生成过程中,显著增强了文化安全。
cs.CL / 76 / 2603.08281

Evaluating LLM-Based Grant Proposal Review via Structured Perturbations

通过结构性扰动评估基于大型语言模型的资助提案审查
Thorne, William, James, Joseph, Wang, Yang, Lin, Chenghua, Maynard, Diana
Abstract
As AI-assisted grant proposals outpace manual review capacity in a kind of "Malthusian trap" for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation. Using six EPSRC proposals, we develop a perturbation-based framework probing LLM sensitivity across six quality axes: funding, timeline, competency, alignment, clarity, and impact. We compare three review architectures: single-pass review, section-by-section analysis, and a 'Council of Personas' ensemble emulating expert panels. The section-level approach significantly outperforms alternatives in both detection rate and scoring reliability, while the computationally expensive council method performs no better than baseline. Detection varies substantially by perturbation type, with alignment issues readily identified but clarity flaws largely missed by all systems. Human evaluation shows LLM feedback is largely valid but skewed toward compliance checking over holistic assessment. We conclude that current LLMs may provide supplementary value within EPSRC review but exhibit high variability and misaligned review priorities. We release our code and any non-protected data.
Chinese Translation
随着人工智能辅助的资助提案超越了人工审查的能力,形成了一种研究生态系统中的“马尔萨斯陷阱”,本文探讨了基于大型语言模型(LLM)的资助审查在高风险评估中的能力和局限性。我们使用六个EPSRC提案,开发了一种基于扰动的框架,探测LLM在六个质量维度上的敏感性:资金、时间表、能力、对齐、清晰度和影响。我们比较了三种审查架构:单次审查、逐节分析和模拟专家小组的“角色委员会”集成。逐节方法在检测率和评分可靠性上显著优于其他方法,而计算成本高昂的委员会方法表现与基线相当。不同扰动类型的检测结果差异显著,对齐问题易于识别,但所有系统在清晰度缺陷上普遍存在遗漏。人类评估表明,LLM反馈在很大程度上是有效的,但倾向于合规检查而非整体评估。我们得出结论,当前的LLM可能在EPSRC审查中提供补充价值,但表现出高变异性和不一致的审查优先级。我们将发布我们的代码和任何非受保护数据。
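The per-axis detection comparison reduces to simple aggregation over judge verdicts. A minimal sketch under assumed (axis, detected) records follows; the data and names are illustrative, not the paper's.

```python
# Sketch: aggregate a perturbation study into per-axis detection rates.
# Records are hypothetical stand-ins for (quality axis, judge flagged it?).
from collections import defaultdict

records = [
    ("alignment", True), ("alignment", True), ("alignment", False),
    ("clarity", False), ("clarity", False), ("clarity", True),
    ("timeline", True), ("timeline", False),
]

hits, totals = defaultdict(int), defaultdict(int)
for axis, detected in records:
    totals[axis] += 1
    hits[axis] += int(detected)

for axis in sorted(totals):
    print(f"{axis:>10}: {hits[axis] / totals[axis]:.0%} detected")
# The paper reports exactly this kind of spread: alignment perturbations
# are caught far more often than clarity perturbations.
```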
cs.CL / 77 / 2603.08282

Using Multimodal and Language-Agnostic Sentence Embeddings for Abstractive Summarization

使用多模态和语言无关的句子嵌入进行抽象摘要
Chellaf, Chaimae, Mdhaffar, Salima, Estève, Yannick, Huet, Stéphane
Abstract
Abstractive summarization aims to generate concise summaries by creating new sentences, allowing for flexible rephrasing. However, this approach is vulnerable to inaccuracies, particularly `hallucinations', where the model introduces non-existent information. In this paper, we leverage multimodal and multilingual sentence embeddings derived from pretrained models such as LaBSE, SONAR, and BGE-M3 and feed them into a modified BART-based French model. We introduce a Named Entity Injection mechanism that appends tokenized named entities to the decoder input in order to improve the factual consistency of the generated summary. Our framework, SBARThez, is applicable to both text and speech inputs and supports cross-lingual summarization; it shows competitive performance relative to token-level baselines, especially for low-resource languages, while generating more concise and abstract summaries.
Chinese Translation
抽象摘要旨在通过生成新句子来生成简洁的摘要,从而实现灵活的改述。然而,这种方法可能容易受到不准确性的影响,特别是出现模型引入不存在信息的‘幻觉’。在本文中,我们利用从预训练模型(如 LaBSE、SONAR 和 BGE-M3)中获得的多模态和多语言句子嵌入,并将其输入到一个改进的基于 BART 的法语模型中。我们引入了一种命名实体注入机制,将标记化的命名实体附加到解码器输入中,以提高生成摘要的事实一致性。我们的新框架 SBARThez 适用于文本和语音输入,并支持跨语言摘要;与基于标记的基线相比,它显示出具有竞争力的性能,尤其是在低资源语言上,同时生成更简洁和抽象的摘要。
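A minimal sketch of the Named Entity Injection idea, operating on token-id lists: entities recognized in the source are tokenized and appended to the decoder input so the generator can reproduce them verbatim. The separator id, BOS id, and entity ids are hypothetical stand-ins, not SBARThez internals.

```python
from typing import List

SEP_ID = 5          # hypothetical separator token id
BOS_ID = 0          # hypothetical decoder start token id

def inject_entities(decoder_ids: List[int],
                    entity_token_ids: List[List[int]]) -> List[int]:
    """Append each entity's tokens after a separator to the decoder input."""
    out = list(decoder_ids)
    for ent in entity_token_ids:
        out.append(SEP_ID)
        out.extend(ent)
    return out

# Usage: decoder starts from BOS; two entities were recognized in the
# source and tokenized to these hypothetical ids.
print(inject_entities([BOS_ID], [[812, 44], [901]]))
# -> [0, 5, 812, 44, 5, 901]
```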
cs.CL / 78 / 2603.08286

LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw using LLMs

LAMUS:基于大型语言模型的美国案例法法律论证挖掘大规模语料库
Wang, Serene, Pobbathi, Lavanya, Chen, Haihua
Abstract
Legal argument mining aims to identify and classify the functional components of judicial reasoning, such as facts, issues, rules, analysis, and conclusions. Progress in this area is limited by the lack of large-scale, high-quality annotated datasets for U.S. caselaw, particularly at the state level. This paper introduces LAMUS, a sentence-level legal argument mining corpus constructed from U.S. Supreme Court decisions and Texas criminal appellate opinions. The dataset is created using a data-centric pipeline that combines large-scale case collection, LLM-based automatic annotation, and targeted human-in-the-loop quality refinement. We formulate legal argument mining as a six-class sentence classification task and evaluate multiple general-purpose and legal-domain language models under zero-shot, few-shot, and chain-of-thought prompting strategies, with LegalBERT as a supervised baseline. Results show that chain-of-thought prompting substantially improves LLM performance, while domain-specific models exhibit more stable zero-shot behavior. LLM-assisted verification corrects nearly 20% of annotation errors, improving label consistency. Human verification achieves Cohen's Kappa of 0.85, confirming annotation quality. LAMUS provides a scalable resource and empirical insights for future legal NLP research. All code and datasets can be accessed for reproducibility on GitHub at: https://github.com/LavanyaPobbathi/LAMUS/tree/main
Chinese Translation
法律论证挖掘旨在识别和分类司法推理的功能组成部分,如事实、问题、规则、分析和结论。该领域的进展受到缺乏大规模、高质量的美国案例法注释数据集的限制,尤其是在州级别。本文介绍了LAMUS,一个基于美国最高法院判决和德克萨斯州刑事上诉意见构建的句子级法律论证挖掘语料库。该数据集采用数据驱动的流程创建,结合了大规模案例收集、基于大型语言模型(LLMs)的自动注释和针对性的人工质量优化。我们将法律论证挖掘表述为一个六类句子分类任务,并在零样本、少样本和思维链提示策略下评估多种通用和法律领域语言模型,以LegalBERT作为监督基线。结果表明,思维链提示显著提高了LLM的性能,而特定领域模型表现出更稳定的零样本行为。LLM辅助验证纠正了近20%的注释错误,提高了标签一致性。人工验证达到了Cohen's Kappa值0.85,确认了注释质量。LAMUS为未来的法律自然语言处理研究提供了可扩展的资源和实证见解。所有代码和数据集可在GitHub上访问以实现可重复性,链接为:https://github.com/LavanyaPobbathi/LAMUS/tree/main
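A minimal sketch of the six-class sentence labeling setup with chain-of-thought prompting; the label set (five FIRAC-style classes plus a catch-all) and the prompt wording are assumptions based on the abstract, not the released pipeline.

```python
LABELS = ["Fact", "Issue", "Rule", "Analysis", "Conclusion", "Other"]

PROMPT = """You are annotating U.S. caselaw for legal argument mining.
Classify the sentence into exactly one of: {labels}.
First reason step by step about the sentence's rhetorical function,
then end with a line of the form "Label: <label>".

Sentence: {sentence}
"""

def build_prompt(sentence: str) -> str:
    return PROMPT.format(labels=", ".join(LABELS), sentence=sentence)

def parse_label(completion: str) -> str:
    """Take the last 'Label:' line of the completion; fall back to 'Other'."""
    for line in reversed(completion.strip().splitlines()):
        if line.startswith("Label:"):
            cand = line.split(":", 1)[1].strip()
            if cand in LABELS:
                return cand
    return "Other"

print(build_prompt("The court held that the search violated the Fourth Amendment."))
```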
cs.CL / 79 / 2603.08312

Learning Multiple Utterance-Level Attribute Representations with a Unified Speech Encoder

通过统一语音编码器学习多重发话级属性表示
Bouziane, Maryem, Mdhaffar, Salima, Estève, Yannick
Abstract
Speech foundation models trained with self-supervised learning produce generic speech representations that support a wide range of speech processing tasks. When further adapted with supervised learning, these models can achieve strong performance on specific downstream tasks. Recent post-training approaches, such as SAMU-XSLR and SONAR, align speech representations with utterance-level semantic representations, enabling effective multimodal (speech-text) and multilingual applications. While speech foundation models typically learn contextual embeddings at the acoustic frame level, these methods learn representations at the utterance level. In this work, we extend this paradigm to arbitrary utterance-level attributes and propose a unified post-training framework that enables a single speech foundation model to generate multiple types of utterance-level representations. We demonstrate the effectiveness of this approach by jointly learning semantic and speaker representations and evaluating them on multilingual speech retrieval and speaker recognition tasks.
Chinese Translation
经过自监督学习训练的语音基础模型生成通用的语音表示,支持广泛的语音处理任务。当进一步通过监督学习进行适应时,这些模型能够在特定下游任务上实现强大的性能。最近的后训练方法,如 SAMU-XSLR 和 SONAR,将语音表示与发话级语义表示对齐,从而实现有效的多模态(语音-文本)和多语言应用。虽然语音基础模型通常在声学帧级别学习上下文嵌入,但这些方法则在发话级别学习表示。在本研究中,我们将这一范式扩展到任意发话级属性,并提出一个统一的后训练框架,使单一语音基础模型能够生成多种类型的发话级表示。我们通过共同学习语义和说话者表示,并在多语言语音检索和说话者识别任务上进行评估,展示了该方法的有效性。
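A minimal sketch of the unified post-training idea: one pooled encoder output feeding multiple utterance-level heads, each aligned to a frozen teacher space. The dimensions, mean pooling, and cosine alignment loss are assumptions; the paper's exact objectives may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttributeHeads(nn.Module):
    def __init__(self, frame_dim=768, sem_dim=1024, spk_dim=256):
        super().__init__()
        self.semantic_head = nn.Linear(frame_dim, sem_dim)
        self.speaker_head = nn.Linear(frame_dim, spk_dim)

    def forward(self, frames):                 # frames: (B, T, frame_dim)
        pooled = frames.mean(dim=1)            # utterance-level pooling
        return self.semantic_head(pooled), self.speaker_head(pooled)

def alignment_loss(pred, teacher):
    """1 - cosine similarity to a frozen teacher embedding (e.g. a SONAR-style
    text encoder for semantics, a speaker-verification model for speakers)."""
    return (1 - F.cosine_similarity(pred, teacher, dim=-1)).mean()

heads = MultiAttributeHeads()
frames = torch.randn(4, 200, 768)              # dummy speech-encoder output
sem, spk = heads(frames)
loss = alignment_loss(sem, torch.randn(4, 1024)) \
     + alignment_loss(spk, torch.randn(4, 256))
loss.backward()
```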
cs.CL / 80 / 2603.08329

SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation

SPD-RAG:基于子代理的文档检索增强生成
Akay, Yagiz Can, Kartal, Muhammed Yusuf, Alparslan, Esra, Ortakoyluoglu, Faruk, Akpinar, Arda
Abstract
Answering complex, real-world queries often requires synthesizing facts scattered across vast document corpora. In these settings, standard retrieval-augmented generation (RAG) pipelines suffer from incomplete evidence coverage, while long-context large language models (LLMs) struggle to reason reliably over massive inputs. We introduce SPD-RAG, a hierarchical multi-agent framework for exhaustive cross-document question answering that decomposes the problem along the document axis. Each document is processed by a dedicated document-level agent operating only on its own content, enabling focused retrieval, while a coordinator dispatches tasks to relevant agents and aggregates their partial answers. Agent outputs are synthesized by merging partial answers through a token-bounded synthesis layer (which supports recursive map-reduce for massive corpora). This document-level specialization with centralized fusion improves scalability and answer quality in heterogeneous multidocument settings while yielding a modular, extensible retrieval pipeline. On the LOONG benchmark (EMNLP 2024) for long-context multi-document QA, SPD-RAG achieves an Avg Score of 58.1 (GPT-5 evaluation), outperforming Normal RAG (33.0) and Agentic RAG (32.8) while using only 38% of the API cost of a full-context baseline (68.0).
Chinese Translation
回答复杂的现实世界查询通常需要综合分散在庞大文档语料库中的事实。在这些环境中,标准的检索增强生成(RAG)流程面临证据覆盖不完整的问题,而长上下文的大型语言模型(LLMs)在处理大量输入时难以可靠推理。我们提出了SPD-RAG,一个层次化的多代理框架,用于全面的跨文档问答,该框架沿文档轴分解问题。每个文档由一个专门的文档级代理处理,该代理仅处理其自身内容,从而实现聚焦检索,而协调者则将任务分配给相关代理并汇总它们的部分答案。代理的输出通过一个有令牌上限的合成层合并(该层支持对海量语料库进行递归映射-归约)。这种文档级的专业化与集中融合提高了异构多文档环境中的可扩展性和答案质量,同时产生了一个模块化、可扩展的检索流程。在长上下文多文档问答的LOONG基准(EMNLP 2024)上,SPD-RAG实现了58.1的平均得分(GPT-5评估),超越了普通RAG(33.0)和代理RAG(32.8),同时仅使用了全上下文基线(68.0)API成本的38%。
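A minimal sketch of the sub-agent-per-document pattern: per-document agents answer independently and a coordinator merges partial answers under a token budget. The classes, the keyword-match stand-in for retrieval, and the budget are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class DocAgent:
    doc_id: str
    text: str

    def answer(self, query: str) -> str:
        # Stand-in for retrieval + generation over this document only.
        hit = query.lower() in self.text.lower()
        return f"[{self.doc_id}] relevant" if hit else f"[{self.doc_id}] no evidence"

def coordinator(query: str, agents: list, token_budget: int = 50) -> str:
    partials = [a.answer(query) for a in agents]
    merged, used = [], 0
    for p in partials:                  # token-bounded synthesis layer
        cost = len(p.split())
        if used + cost > token_budget:
            break                       # a real system would recurse (map-reduce)
        merged.append(p)
        used += cost
    return " | ".join(merged)

agents = [DocAgent("d1", "The treaty was signed in 1998."),
          DocAgent("d2", "Unrelated financial report.")]
print(coordinator("treaty", agents))
```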
cs.CL / 81 / 2603.08358

Do Language Models Know Theo Has a Wife? Investigating the Proviso Problem

语言模型知道西奥有妻子吗?探讨前提问题
Azin, Tara, Dumitrescu, Daniel, Inkpen, Diana, Singh, Raj
Abstract
We investigate how language models handle the proviso problem, an unresolved issue in pragmatics where presuppositions in conditional sentences diverge between theoretical and human interpretations. We reformulate this phenomenon as a Natural Language Inference task and introduce a diagnostic dataset designed to probe presupposition projection in conditionals. We evaluate RoBERTa, DeBERTa, LLaMA, and Gemma using explainability analyses. The results show that models broadly align with human judgments but rely on shallow pattern matching rather than semantic or pragmatic reasoning. Our work provides the first computational evaluation framework for the proviso problem and highlights the need for diagnostic, multi-method approaches to assess pragmatic competence and context-dependent meaning in language models.
Chinese Translation
我们研究了语言模型如何处理前提问题,这是一个在语用学中尚未解决的问题,其中条件句中的前提在理论和人类解释之间存在差异。我们将这一现象重新表述为自然语言推理任务,并引入一个旨在探测条件句中前提投射的诊断数据集。我们使用可解释性分析评估了RoBERTa、DeBERTa、LLaMA和Gemma。结果表明,这些模型在总体上与人类判断一致,但依赖于浅层模式匹配,而非语义或语用推理。我们的工作为前提问题提供了首个计算评估框架,并强调了需要采用诊断性、多方法的方法来评估语言模型的语用能力和上下文依赖意义。
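A minimal sketch of the NLI reformulation; the conditional, its presupposition, and the two reference labels are illustrative constructions in the spirit of the title, not items from the released dataset.

```python
# One diagnostic item: does a conditional (premise) entail its
# presupposition (hypothesis)? The proviso problem is that theory and
# human judgment diverge on this label.
from dataclasses import dataclass

@dataclass
class ProvisoItem:
    premise: str
    hypothesis: str
    theoretical_label: str   # what standard projection theories predict
    human_label: str         # what raters typically say

item = ProvisoItem(
    premise="If Theo flies to Toronto, his wife will pick him up.",
    hypothesis="Theo has a wife.",
    theoretical_label="neutral",   # theory projects only 'if he flies, he has a wife'
    human_label="entailment",      # humans project the presupposition globally
)
print(item)
# Probing compares NLI model predictions on such pairs against both labels.
```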
cs.CL / 82 / 2603.08359

Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors

从声学语音和视听输入中进行早期语言学习的计算建模,无需语言先验
Räsänen, Okko
Abstract
Learning to understand speech appears almost effortless for typically developing infants, yet from an information-processing perspective, acquiring a language from acoustic speech is an enormous challenge. This chapter reviews recent developments in using computational models to understand early language acquisition from speech and audiovisual input. The focus is on self-supervised and visually grounded models of perceptual learning. We show how these models are becoming increasingly powerful in learning various aspects of speech without strong linguistic priors, and how many features of early language development can be explained through a shared set of learning principles that are broadly compatible with multiple theories of language acquisition and human cognition. We also discuss how modern learning simulations are gradually becoming more realistic, both in terms of input data and in linking model behavior to empirical findings on infant language development.
Chinese Translation
对于典型发育的婴儿来说,理解语音似乎几乎是毫不费力的,然而从信息处理的角度来看,从声学语音中获取语言是一项巨大的挑战。本章回顾了使用计算模型理解语音和视听输入中的早期语言习得的最新进展。重点是自监督和视觉基础的感知学习模型。我们展示了这些模型在没有强语言先验的情况下,如何在学习语音的各个方面变得越来越强大,以及早期语言发展的许多特征如何通过一组共享的学习原则来解释——这些原则与多种语言习得和人类认知理论广泛兼容。我们还讨论了现代学习模拟如何在输入数据和将模型行为与婴儿语言发展实证发现联系起来的方面逐渐变得更加现实。
cs.CL / 83 / 2603.08391

Adaptive Loops and Memory in Transformers: Think Harder or Know More?

变压器中的自适应循环与记忆:更努力思考,还是了解更多?
Frey, Markus, Shomali, Behzad, Bashir, Ali Hamza, Berghaus, David, Ali, Mehdi
Abstract
Chain-of-thought (CoT) prompting enables reasoning in language models but requires explicit verbalization of intermediate steps. Looped transformers offer an alternative by iteratively refining representations within hidden states. This parameter efficiency comes at a cost, as looped models lack the storage capacity of deeper models, which use unique weights per layer. In this work, we investigate transformer models that feature both adaptive per-layer looping, where each transformer block learns to iterate its hidden state via a learned halting mechanism, and gated memory banks, which provide additional learned storage. We find that looping primarily benefits mathematical reasoning, while memory banks help recover performance on commonsense tasks compared to parameter- and FLOP-matched models. Combining both mechanisms yields a model that outperforms an iso-FLOP baseline -- with three times the number of layers -- on math benchmarks. Analysis of model internals reveals layer specialization: early layers learn to loop minimally and access memory sparingly, while later layers do both more heavily.
Chinese Translation
链式思维(Chain-of-thought, CoT)提示使语言模型能够进行推理,但需要明确地表达中间步骤。循环变压器提供了一种替代方案,通过在隐藏状态中迭代地细化表示来实现。这种参数效率是以牺牲为代价的,因为循环模型缺乏使用每层独特权重的更深层模型的存储能力。在本研究中,我们探讨了具有自适应每层循环的变压器模型,其中每个变压器块通过学习的停止机制学习迭代其隐藏状态,以及提供额外学习存储的门控记忆库。我们发现,循环主要有利于数学推理,而记忆库则有助于在与参数和FLOP匹配的模型相比,恢复常识任务的性能。结合这两种机制产生的模型在数学基准测试中超越了一个等FLOP基线模型——其层数是基线的三倍。对模型内部的分析揭示了层的专业化:早期层学习最小化循环并稀疏访问记忆,而后期层则更频繁地进行这两种操作。
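A minimal sketch of the two mechanisms in one block: a learned halting score bounds how often the block re-applies itself, and a gated read injects content from a learned memory bank. Shapes, the 0.5 halting threshold, and the gating form are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d=256, n_mem=64, max_loops=4):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.halt = nn.Linear(d, 1)                        # learned halting signal
        self.memory = nn.Parameter(torch.randn(n_mem, d))  # learned memory bank
        self.gate = nn.Linear(d, 1)
        self.max_loops = max_loops

    def forward(self, h):                                  # h: (B, T, d)
        halted = torch.zeros(h.shape[:2], dtype=torch.bool, device=h.device)
        for _ in range(self.max_loops):
            h = h + self.ff(h)                             # one refinement iteration
            halted |= torch.sigmoid(self.halt(h)).squeeze(-1) > 0.5
            if halted.all():                               # adaptive loop count
                break
        attn = torch.softmax(h @ self.memory.T, dim=-1)    # (B, T, n_mem)
        read = attn @ self.memory                          # memory lookup
        g = torch.sigmoid(self.gate(h))                    # gated injection
        return h + g * read

out = LoopedBlock()(torch.randn(2, 10, 256))
print(out.shape)  # torch.Size([2, 10, 256])
```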
cs.CL / 84 / 2603.08392

COACH meets QUORUM: A Framework and Pipeline for Aligning User, Expert and Developer Perspectives in LLM-generated Health Counselling

COACH与QUORUM的结合:一个对齐用户、专家和开发者视角的框架与流程,用于大语言模型生成的健康咨询
Ng, Yee Man, van Dijk, Bram, Beynen, Pieter, Boekesteijn, Otto, Jansen, Joris, van Oortmerssen, Gerard, van Duijn, Max, Spruit, Marco
Abstract
Systems that collect data on sleep, mood, and activities can provide valuable lifestyle counselling to populations affected by chronic disease and its consequences. Such systems are, however, challenging to develop; besides reliably extracting patterns from user-specific data, systems should also contextualise these patterns with validated medical knowledge to ensure the quality of counselling, and generate counselling that is relevant to a real user. We present QUORUM, a new evaluation framework that unifies these developer-, expert-, and user-centric perspectives, and show with a real case study that it meaningfully tracks convergence and divergence in stakeholder perspectives. We also present COACH, a Large Language Model-driven pipeline to generate personalised lifestyle counselling for our Healthy Chronos use case, a diary app for cancer patients and survivors. Applying our framework shows that overall, users, medical experts, and developers converge on the opinion that the generated counselling is relevant, of good quality, and reliable. However, stakeholders also diverge on the tone of the counselling, sensitivity to errors in pattern-extraction, and potential hallucinations. These findings highlight the importance of multi-stakeholder evaluation for consumer health language technologies and illustrate how a unified evaluation framework can support trustworthy, patient-centered NLP systems in real-world settings.
Chinese Translation
收集睡眠、情绪和活动数据的系统可以为受到慢性疾病及其后果影响的人群提供有价值的生活方式咨询。然而,这类系统的开发具有挑战性;除了可靠地从用户特定数据中提取模式外,系统还应将这些模式与经过验证的医学知识进行情境化,以确保咨询的质量,并生成与真实用户相关的咨询。我们提出了QUORUM,一个新的评估框架,统一了开发者、专家和用户的视角,并通过一个真实案例研究展示了它在利益相关者视角的收敛与分歧方面的有效跟踪。我们还介绍了COACH,一个基于大语言模型的流程,用于为我们的健康时间应用案例(一个针对癌症患者及幸存者的日记应用)生成个性化的生活方式咨询。应用我们的框架表明,总体而言,用户、医学专家和开发者在生成的咨询相关性、质量和可靠性方面达成了一致意见。然而,利益相关者在咨询的语气、对模式提取错误的敏感性以及潜在的幻觉方面也存在分歧。这些发现突显了多利益相关者评估在消费者健康语言技术中的重要性,并说明了统一评估框架如何支持在现实环境中可信赖的以患者为中心的自然语言处理系统。
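A minimal sketch of one way convergence and divergence across stakeholder groups could be tracked per criterion; the criteria, ratings, and the one-point divergence threshold are illustrative assumptions, not QUORUM's actual definitions.

```python
# Compare mean ratings per criterion across stakeholder groups and flag
# criteria where group means spread beyond a threshold.
from statistics import mean

ratings = {  # criterion -> {stakeholder group -> 1-5 ratings}
    "relevance": {"users": [4, 5, 4], "experts": [4, 4, 5], "developers": [5, 4, 4]},
    "tone":      {"users": [2, 3, 2], "experts": [4, 4, 3], "developers": [4, 5, 4]},
}

for criterion, groups in ratings.items():
    means = {g: mean(v) for g, v in groups.items()}
    spread = max(means.values()) - min(means.values())
    verdict = "diverge" if spread > 1.0 else "converge"
    print(f"{criterion}: {means} -> groups {verdict}")
# Mirrors the reported pattern: agreement on relevance and quality,
# disagreement on tone.
```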
cs.CL / 85 / 2603.08398

Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective

揭示大型语言模型中的行为可塑性:一种基于令牌条件的视角
Mao, Liyuan, Yu, Le, Zhou, Jing, Zheng, Chujie, Yu, Bowen, Gao, Chang, Liu, Shixuan, Yang, An, Zhang, Weinan, Lin, JunYang
Abstract
In this work, we reveal that Large Language Models (LLMs) possess an intrinsic behavioral plasticity, akin to chameleons adapting their coloration to environmental cues, that can be exposed through token-conditional generation and stabilized via reinforcement learning. Specifically, by conditioning generation on carefully selected token prefixes sampled from responses exhibiting desired behaviors, LLMs seamlessly adapt their behavioral modes at inference time (e.g., switching from step-by-step reasoning to direct answering) without retraining. Based on this insight, we propose Token-Conditioned Reinforcement Learning (ToCoRL), a principled framework that leverages RL to internalize this chameleon-like plasticity, transforming transient inference-time adaptations into stable and learnable behavioral patterns. ToCoRL guides exploration with token-conditional generation and keeps enhancing exploitation, enabling the emergence of appropriate behaviors. Extensive experiments show that ToCoRL enables precise behavioral control without capability degradation. Notably, we show that large reasoning models, while performing strongly on complex mathematics, can be effectively adapted to excel at factual question answering, a capability previously hindered by their step-by-step reasoning patterns.
Chinese Translation
在本研究中,我们揭示了大型语言模型(LLMs)具有内在的行为可塑性——类似于变色龙根据环境线索调整其颜色的能力——这一特性可以通过基于令牌条件的生成方式暴露出来,并通过强化学习进行稳定化。具体而言,通过对从展现所需行为的响应中精心选择的令牌前缀进行条件生成,LLMs能够在推理时无缝地适应其行为模式(例如,从逐步推理切换到直接回答),而无需重新训练。基于这一洞察,我们提出了令牌条件强化学习(Token-Conditioned Reinforcement Learning,ToCoRL),这是一个原则性框架,利用强化学习来内化这种变色龙般的可塑性,将瞬时的推理时适应转化为稳定且可学习的行为模式。ToCoRL通过基于令牌条件的生成指导探索,并不断增强利用,从而促使适当行为的出现。大量实验表明,ToCoRL能够实现精确的行为控制而不降低能力。值得注意的是,我们展示了大型推理模型在复杂数学问题上表现强劲的同时,可以有效地适应以在事实问答中表现出色,而这一能力之前受到其逐步推理模式的限制。
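A minimal sketch of the token-conditional mechanics: force generation to continue from a prefix taken from a response that already shows the desired behavior (here, direct answering). The model and prefix are illustrative; ToCoRL's contribution is the RL stage that makes such behavior stable without the prefix.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in; any causal LM shows the mechanics
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Q: What year did the Berlin Wall fall?\nA:"
direct_prefix = " The answer is"  # prefix sampled from a direct-answering response

# Conditioning on the prefix steers the behavioral mode at inference time.
ids = tok(prompt + direct_prefix, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=10, do_sample=False,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:]))
```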
cs.CL / 86 / 2603.08412

Aligning to Illusions: Choice Blindness in Human and AI Feedback

对幻觉的对齐:人类与人工智能反馈中的选择盲目性
Wu, Wenbin
Abstract
Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
Chinese Translation
人类反馈的强化学习(RLHF)假设注释者的偏好反映了稳定的内部状态。我们通过三项涵盖偏好管道的实验对此提出挑战。在一项人类选择盲目性研究中,91%的秘密交换偏好未被检测到,将选择盲目性扩展到对不熟悉文本的第三方评估比较。测试了十五个大型语言模型(LLM)评审作为潜在替代者,我们发现检测依赖于浅层文本匹配,而非真正的自我监控:从上下文中移除先前的推理导致盲目性从接近零飙升至超过50%,而显性的社会压力则导致近乎普遍的遵从。在一项针对86M到2B参数的两个架构的剂量-反应实验中,必须有六分之一到三分之一的标签被破坏,奖励信号才会减半,但标准的成对准确率几乎没有变化。最佳选择评估确认这会导致下游策略退化:在50%的破坏率下,基于奖励的选择与随机采样相比没有任何改善,而代理模型报告的分数则单调增加。综合来看,这些结果揭示了一个偏好构建问题:进入RLHF的信号受到引导上下文的影响,而这种影响是人类元认知、LLM自我监控和标准评估指标无法检测到的。
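A toy illustration of the metric blindness the dose-response experiment exposes: when evaluation labels share the training labels' corruption, measured pairwise accuracy stays flat while agreement with the uncorrupted preferences degrades. The data and corruption routine are illustrative, not the paper's setup.

```python
import random

random.seed(0)

def corrupt(labels, frac):
    """Flip a random fraction of binary preference labels."""
    out = list(labels)
    for i in random.sample(range(len(out)), int(frac * len(out))):
        out[i] = 1 - out[i]
    return out

true_prefs = [1] * 60 + [0] * 40               # hypothetical clean preferences

for frac in (0.0, 1 / 6, 1 / 3, 0.5):
    noisy = corrupt(true_prefs, frac)
    preds = noisy                              # a reward model that fits the noisy labels
    acc_noisy = sum(p == n for p, n in zip(preds, noisy)) / len(noisy)
    acc_true = sum(p == t for p, t in zip(preds, true_prefs)) / len(true_prefs)
    print(f"corruption {frac:.2f}: measured acc={acc_noisy:.2f}, "
          f"true agreement={acc_true:.2f}")
```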
cs.CL / 87 / 2603.08429

One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States

一个模型足够:来自LLM代理隐藏状态的原生检索嵌入
Jiang, Bo
Abstract
LLM agents that retrieve external knowledge typically generate a search query as text, then run a separate embedding model to encode it into a vector. This two-model pipeline adds infrastructure complexity and latency, yet is redundant: the LLM already encodes the full conversational context in its hidden states. We propose equipping LLM agents with native retrieval capability by adding a lightweight projection head that maps hidden states directly into the embedding space, eliminating the need for a separate embedding model. Trained with a combination of alignment, contrastive, and rank distillation losses, our method retains 97% of baseline retrieval quality while enabling the LLM agent to search with its own representations. Experiments on the QReCC conversational search benchmark show competitive Recall@10 and MRR@10 compared to the standard generate-then-encode pipeline, with systematic ablations confirming the contribution of each loss component.
Chinese Translation
通常,检索外部知识的LLM代理会生成一个文本搜索查询,然后运行一个单独的嵌入模型将其编码为向量。这种双模型管道增加了基础设施的复杂性和延迟,但却是多余的:LLM已经在其隐藏状态中编码了完整的对话上下文。我们建议通过添加一个轻量级的投影头,使LLM代理具备原生检索能力,该投影头将隐藏状态直接映射到嵌入空间,从而消除对单独嵌入模型的需求。我们的方法结合了对齐、对比和排名蒸馏损失进行训练,保留了97%的基线检索质量,同时使LLM代理能够使用其自身的表示进行搜索。在QReCC对话搜索基准上的实验显示,与标准的生成-再编码管道相比,我们的方法在Recall@10和MRR@10上表现出竞争力,系统的消融实验确认了每个损失组件的贡献。
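A minimal sketch of the single-model retrieval idea: a lightweight head projects the agent's last-token hidden state into an embedding space and is trained to align with a frozen teacher embedder. Dimensions, last-token pooling, and the loss mix (rank distillation omitted) are assumptions based on the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalHead(nn.Module):
    def __init__(self, hidden_dim=4096, embed_dim=768):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, last_hidden):             # (B, T, hidden_dim)
        q = self.proj(last_hidden[:, -1])       # last-token state as the query
        return F.normalize(q, dim=-1)

def alignment_contrastive_loss(q, teacher, temp=0.05):
    """Cosine alignment plus in-batch InfoNCE: the matching teacher vector
    is the positive, other in-batch vectors are negatives."""
    teacher = F.normalize(teacher, dim=-1)
    logits = q @ teacher.T / temp               # (B, B) similarity matrix
    targets = torch.arange(q.shape[0])
    return F.cross_entropy(logits, targets) + (1 - (q * teacher).sum(-1)).mean()

head = RetrievalHead()
hidden = torch.randn(8, 32, 4096)               # dummy LLM hidden states
loss = alignment_contrastive_loss(head(hidden), torch.randn(8, 768))
loss.backward()
```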
cs.CL / 88 / 2603.08450

A Dataset for Probing Translationese Preferences in English-to-Swedish Translation

用于探测英瑞翻译中的翻译特征偏好的数据集
Kunz, Jenny, Jarochenko, Anja, Bollmann, Marcel
Abstract
Translations often carry traces of the source language, a phenomenon known as translationese. We introduce the first freely available English-to-Swedish dataset contrasting translationese sentences with idiomatic alternatives, designed to probe intrinsic preferences of language models. It includes error tags and descriptions of the problems in the original translations. In experiments evaluating smaller Swedish and multilingual LLMs with our dataset, we find that they often favor the translationese phrasing. Human alternatives are chosen more often when the English source sentence is omitted, indicating that exposure to the source biases models toward literal translations, although even without context models often prefer the translationese variant. Our dataset and findings provide a resource and benchmark for developing models that produce more natural, idiomatic output in non-English languages.
Chinese Translation
翻译通常带有源语言的痕迹,这一现象称为翻译特征(translationese)。我们介绍了第一个可自由获取的英瑞翻译数据集,该数据集对比了翻译特征句子与习惯用法的替代句,旨在探测语言模型的内在偏好。数据集中包含错误标签及原始翻译中问题的描述。在使用我们的数据集评估较小的瑞典语和多语言大语言模型(LLM)的实验中,我们发现它们往往偏好翻译特征的措辞。当省略英语源句时,人类替代句的选择频率更高,这表明接触源语言会使模型倾向于字面翻译,尽管即使在没有上下文的情况下,模型通常也更喜欢翻译特征变体。我们的数据集和研究结果为开发能够在非英语语言中生成更自然、习惯用法输出的模型提供了资源和基准。
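A minimal sketch of the intrinsic-preference probe: score the two variants by summed token log-likelihood under a causal LM and see which it prefers. The Swedish pair is an illustrative calque, and gpt2 is an English stand-in used only to show the mechanics; the paper evaluates Swedish and multilingual models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in model for demonstration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def loglik(text: str) -> float:
    """Summed log-probability of the text under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)            # loss = mean NLL per token
    return -out.loss.item() * (ids.shape[1] - 1)

translationese = "Det är inte min kopp te."     # calque of 'not my cup of tea'
idiomatic = "Det är inte min grej."             # natural Swedish phrasing

preferred = max((translationese, idiomatic), key=loglik)
print("model prefers:", preferred)
# Prepending the English source sentence to both candidates would probe the
# source-exposure bias the paper reports.
```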
cs.CL / 89 / 2603.08501

Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA

Fanar-Sadiq:一种基于多智能体的扎根伊斯兰问答架构
Abbas, Ummar, Ouzzani, Mourad, Eltabakh, Mohamed Y., Sinan, Omar, Bhatia, Gagan, Mubarak, Hamdy, Hawasly, Majd, Hashim, Mohammed Qusay, Darwish, Kareem, Alam, Firoj
Abstract
Large language models (LLMs) can answer religious knowledge queries fluently, yet they often hallucinate and misattribute sources, which is especially consequential in Islamic settings where users expect grounding in canonical texts (Qur'an and Hadith) and jurisprudential (fiqh) nuance. Retrieval-augmented generation (RAG) reduces some of these limitations by grounding generation in external evidence. However, a single ``retrieve-then-generate'' pipeline cannot handle the diversity of Islamic queries. Users may request verbatim scripture, fatwa-style guidance with citations, or rule-constrained computations such as zakat and inheritance that require strict arithmetic and legal invariants. In this work, we present a bilingual (Arabic/English) multi-agent Islamic assistant, called Fanar-Sadiq, which is a core component of the Fanar AI platform. Fanar-Sadiq routes Islamic-related queries to specialized modules within an agentic, tool-using architecture. The system supports intent-aware routing, retrieval-grounded fiqh answers with deterministic citation normalization and verification traces, exact verse lookup with quotation validation, and deterministic calculators for Sunni zakat and inheritance with madhhab-sensitive branching. We evaluate the complete end-to-end system on public Islamic QA benchmarks and demonstrate its effectiveness and efficiency. Our system is publicly and freely accessible through an API and a Web application, and has been accessed approximately 1.9M times in less than a year.
Chinese Translation
大型语言模型(LLMs)能够流利地回答宗教知识查询,但它们常常会产生幻觉并错误归属来源,这在伊斯兰环境中尤其重要,因为用户期望能基于经典文本(《古兰经》和《圣训》)和法学(fiqh)细微差别进行扎根。检索增强生成(RAG)通过将生成过程基于外部证据来减少这些局限性。然而,单一的“检索-然后生成”流程在处理多样化的伊斯兰查询时受到限制。用户可能请求逐字的经典经文、带有引用的法特瓦风格指导或需要严格算术和法律不变性的规则约束计算,如天课(zakat)和继承。在本研究中,我们提出了一种双语(阿拉伯语/英语)多智能体伊斯兰助手,称为Fanar-Sadiq,它是Fanar AI平台的核心组成部分。Fanar-Sadiq将与伊斯兰相关的查询路由到智能体工具使用架构中的专业模块。该系统支持基于意图的路由、基于检索的法学答案,具有确定性的引用规范化和验证痕迹、确切的经文查找与引用验证,以及针对逊尼派天课和继承的确定性计算器,具有教派敏感的分支。我们在公共伊斯兰问答基准上评估了完整的端到端系统,展示了其有效性和效率。我们的系统目前可以通过API和Web应用程序公开和免费访问,在不到一年的时间里已被访问约190万次。
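A minimal sketch of a deterministic zakat-on-wealth calculator in the spirit of the rule-constrained modules. The 2.5% rate on monetary wealth held a full lunar year above the nisab is standard Sunni practice, but the nisab figure below is hypothetical (it varies with gold/silver prices) and the madhhab-sensitive branching is omitted.

```python
def zakat_on_wealth(wealth: float, nisab: float, held_lunar_year: bool) -> float:
    """Return zakat due on monetary wealth (0.0 if conditions aren't met)."""
    if wealth < nisab or not held_lunar_year:
        return 0.0
    return wealth * 0.025          # 2.5% of the zakatable amount

# Usage with a hypothetical nisab of 6,000 in the user's currency:
print(zakat_on_wealth(wealth=10_000, nisab=6_000, held_lunar_year=True))  # 250.0
print(zakat_on_wealth(wealth=4_000, nisab=6_000, held_lunar_year=True))   # 0.0
```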
cs.CL / 90 / 2603.08659

CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

CODA:基于难度感知的自适应推理计算分配
Wu, Siye, Xie, Jian, Zhang, Yikai, Xiao, Yanghua
Abstract
The emergence of large reasoning models demonstrates that scaling inference-time compute significantly enhances performance on complex tasks. However, it often falls into another trap: overthinking simple problems, where repetitive rationales yield minimal accuracy gains at a disproportionately high cost. This motivates adaptive reasoning: dynamically aligning reasoning depth with instance difficulty. In this paper, we study adaptive reasoning from an optimality perspective, formalizing it as a utility maximization problem where tokens are allocated until the marginal accuracy gain falls below the incremental cost. Based on this, we propose CODA (Compute Allocation by Difficulty Awareness), a method that operationalizes this principle by allocating tokens via a policy-internal difficulty signal. Specifically, CODA estimates difficulty via group-based rollouts and maps it to two non-negative gates that modulate a length-dependent shaping term on top of the binary base reward. The easy-side gate penalizes verbosity on simple instances, whereas the hard-side gate encourages more deliberative rollouts on challenging ones. Across model scales and benchmarks, CODA achieves adaptive reasoning without external annotations or user-provided budgets: on easy tasks, CODA reduces token costs by over 60% while maintaining strong accuracy, whereas on hard tasks it incentivizes more deliberative rollouts to maximize performance.
Chinese Translation
大规模推理模型的出现表明,扩展推理时的计算显著提升了在复杂任务上的性能。然而,这往往陷入另一个陷阱:对简单问题的过度思考,其中重复的推理在不成比例的高成本下仅带来微小的准确性提升。这激励了自适应推理:动态地将推理深度与实例难度对齐。本文从最优性角度研究自适应推理,将其形式化为一个效用最大化问题,即持续分配令牌,直到边际准确性提升低于增量成本为止。在此基础上,我们提出了CODA(基于难度感知的计算分配),该方法通过策略内部的难度信号分配令牌,从而落实这一原则。具体而言,CODA通过基于组的回滚估计难度,并将其映射到两个非负门控,这些门控调节基于长度的塑形项,作为二进制基础奖励的补充。简单实例的易侧门控对冗长进行惩罚,而困难实例的难侧门控则鼓励在具有挑战性的实例上进行更深思熟虑的回滚。在不同模型规模和基准测试中,CODA实现了自适应推理,无需外部注释或用户提供的预算:在简单任务上,CODA将令牌成本降低超过60%,同时保持强大的准确性,而在困难任务上,它则激励进行更深思熟虑的回滚以最大化性能。
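A minimal sketch of the difficulty-aware shaping idea: group-rollout accuracy yields a difficulty estimate, two non-negative gates derive from it, and a length-dependent term is added to the binary base reward. The gate and shaping functional forms are assumptions, not CODA's actual equations.

```python
def shaped_reward(correct: bool, length: int, group_acc: float,
                  target_len: int = 512, scale: float = 0.1) -> float:
    base = 1.0 if correct else 0.0
    difficulty = 1.0 - group_acc                 # low group accuracy -> hard
    g_easy = max(0.0, 1.0 - 2 * difficulty)      # active when the group mostly solves it
    g_hard = max(0.0, 2 * difficulty - 1.0)      # active when the group mostly fails
    rel_len = (length - target_len) / target_len
    # Easy-side gate penalizes verbosity; hard-side gate rewards deliberation.
    return base + scale * (g_hard * rel_len - g_easy * max(0.0, rel_len))

print(shaped_reward(correct=True, length=1200, group_acc=0.9))  # easy + verbose -> penalized
print(shaped_reward(correct=True, length=1200, group_acc=0.1))  # hard + long -> encouraged
```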