cs.RO / 1 / 2603.02291
Goal-Oriented Semantic Communication for ISAC-Enabled Robotic Obstacle Avoidance
Abstract
We investigate an integrated sensing and communication (ISAC)-enabled base station (BS) for the unmanned aerial vehicle (UAV) obstacle avoidance task, and propose a goal-oriented semantic communication (GOSC) framework that lets the BS transmit sensing and command and control (C&C) signals efficiently and effectively. Our GOSC framework establishes a closed loop of sensing, C&C generation, and sensing/C&C transmission. For sensing, a Kalman filter (KF) continuously predicts UAV positions, reducing the dependence of UAV position acquisition on continuous sensing-signal transmission and enhancing estimation accuracy through sensing-prediction fusion. Based on the refined position estimate provided by the KF, we develop a Mahalanobis distance-based dynamic window approach (MD-DWA) to generate precise C&C signals under uncertainty, for which we derive the mathematical expression of the minimum Mahalanobis distance required to guarantee collision avoidance. Finally, for efficient sensing and C&C signal transmission, we propose an effectiveness-aware deep Q-network (E-DQN) that decides whether to transmit sensing and C&C signals based on their value of information (VoI). The VoI of a sensing signal is quantified by the reduction in the uncertainty entropy of the UAV's position estimate, while the VoI of a C&C signal is measured by its contribution to UAV navigation improvement. Extensive simulations validate the effectiveness of the proposed GOSC framework. Compared to the conventional ISAC transmission framework, which transmits sensing and C&C signals in every time slot, GOSC achieves the same 100% task success rate while reducing the number of transmitted sensing and C&C signals by 92.4% and the number of transmission time slots by 85.5%.
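The sensing-prediction loop above hinges on a standard Kalman-filter predict/update cycle. The following is a minimal, illustrative sketch (not the paper's implementation) of a 1-D constant-velocity KF: prediction keeps the UAV position estimate alive between sensing transmissions, and an update fuses a sensing measurement when one is sent. All noise parameters are assumed values.

```python
# Minimal 1-D constant-velocity Kalman filter: predict the UAV position
# between sensing slots and fuse a sensing measurement when one arrives.
# The 2x2 covariance is stored as a flat tuple (p00, p01, p10, p11), so
# no linear-algebra library is needed. Noise values are illustrative.

def kf_predict(x, v, P, q=0.01, dt=1.0):
    """Propagate state (position x, velocity v) and covariance P through
    the motion model F = [[1, dt], [0, 1]]: P <- F P F^T + q*I."""
    p00, p01, p10, p11 = P
    x = x + dt * v
    p00 = p00 + dt * (p01 + p10) + dt * dt * p11 + q
    p01 = p01 + dt * p11
    p10 = p10 + dt * p11
    p11 = p11 + q
    return x, v, (p00, p01, p10, p11)

def kf_update(x, v, P, z, r=0.5):
    """Fuse a position measurement z (H = [1, 0]) with noise variance r."""
    p00, p01, p10, p11 = P
    s = p00 + r                      # innovation variance
    k0, k1 = p00 / s, p10 / s        # Kalman gain
    y = z - x                        # innovation
    x, v = x + k0 * y, v + k1 * y
    P = ((1 - k0) * p00, (1 - k0) * p01, p10 - k1 * p00, p11 - k1 * p01)
    return x, v, P
```

Skipping `kf_update` in slots without a sensing transmission corresponds to relying on prediction alone, which is what makes sparse sensing transmission viable.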
cs.RO / 2 / 2603.02443
Safe Whole-Body Loco-Manipulation via Combined Model and Learning-based Control
Abstract
Simultaneous locomotion and manipulation enables robots to interact with their environment beyond the constraints of a fixed base. However, coordinating legged locomotion with arm manipulation while ensuring safety and compliance during contact interaction remains challenging. To this end, we propose a whole-body controller that combines model-based admittance control for the manipulator arm with a Reinforcement Learning (RL) policy for legged locomotion. The admittance controller maps external wrenches, such as those applied by a human during physical interaction, into desired end-effector velocities, allowing for compliant behavior. These velocities are tracked jointly by the arm and leg controllers, enabling a unified 6-DoF force response. The model-based design permits accurate force control and safety guarantees via a Reference Governor (RG), while robustness is further improved by a neural-network-enhanced Kalman filter for reliable base velocity estimation. We validate our approach in both simulation and hardware using the Unitree Go2 quadruped robot equipped with a 6-DoF arm and a wrist-mounted 6-DoF Force/Torque sensor. Results demonstrate accurate tracking of interaction-driven velocities, compliant behavior, and safe, reliable performance in dynamic settings.
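A discrete-time admittance law of the kind described, mapping an external force into a desired end-effector velocity, can be sketched as below for one axis. The virtual mass and damping values are illustrative, not the paper's tuned gains.

```python
# One-axis admittance law M*dv/dt + D*v = f_ext, integrated with forward
# Euler: an external force is turned into a desired end-effector velocity
# that the whole-body tracker would then follow.

def admittance_step(v, f_ext, m=2.0, d=8.0, dt=0.01):
    """Advance the desired velocity v by one control tick under force f_ext.
    m: virtual mass, d: virtual damping (illustrative values)."""
    return v + (dt / m) * (f_ext - d * v)
```

At steady state the desired velocity settles at f_ext / d, so the damping gain directly sets how "heavy" the arm feels to the person pushing it.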
cs.RO / 3 / 2603.02458
Learning Therapist Policy from Therapist-Exoskeleton-Patient Interaction
Abstract
Post-stroke rehabilitation is often necessary for patients to regain a proper walking gait. However, the typical therapy process can be exhausting and physically demanding for therapists, potentially reducing therapy intensity, duration, and consistency over time. We propose a Patient-Therapist Force Field (PTFF) to visualize therapist responses to patient kinematics, and a Synthetic Therapist (ST) machine learning model to support the therapist in dyadic robot-mediated physical interaction therapy. The former encodes patient and therapist stride kinematics into a shared low-dimensional latent manifold using a Variational Autoencoder (VAE) and models their interaction through a Gaussian Mixture Model (GMM), which learns a probabilistic vector field mapping patient latent states to therapist responses. This representation visualizes patient-therapist interaction dynamics to inform therapy strategies and robot controller design. The latter is implemented as a Long Short-Term Memory (LSTM) network trained on patient-therapist interaction data to predict therapist-applied joint torques from patient kinematics. Trained and validated using leave-one-out cross-validation across eight post-stroke patients, the model was integrated into a ROS-based exoskeleton controller to generate real-time torque assistance based on predicted therapist responses. Offline results and preliminary testing indicate the potential of both components as an alternative approach to post-stroke exoskeleton therapy. The PTFF provides insight into the therapist's actions, while the ST frees the human therapist from the exoskeleton, allowing them to continuously monitor the patient's nuanced condition.
cs.RO / 4 / 2603.02468
A Novel Modular Cable-Driven Soft Robotic Arm with Multi-Segment Reconfigurability
Abstract
This paper presents a novel, modular, cable-driven soft robotic arm featuring multi-segment reconfigurability. The proposed architecture enables a stackable system with independent segment control, allowing scalable adaptation to diverse structural and application requirements. The system is fabricated from soft silicone material and incorporates embedded tendon-routing channels with a protective dual-helical tendon structure. Experimental results showed that modular stacking substantially expanded the reachable workspace: relative to the single-segment arm, the three-segment configuration achieved up to a 13-fold increase in planar workspace area and a 38.9-fold increase in workspace volume. Furthermore, this study investigated the effect of silicone stiffness on actuator performance. The results revealed a clear trade-off between compliance and stiffness: softer silicone improved bending flexibility, while stiffer silicone improved structural rigidity and load-bearing stability. These results highlight the potential of stiffness tuning to balance compliance and strength for configuring scalable, reconfigurable soft robotic arms.
cs.RO / 5 / 2603.02484
COLREGs Compliant Collision Avoidance and Grounding Prevention for Autonomous Marine Navigation
Abstract
Maritime Autonomous Surface Ships (MASS) are increasingly regarded as a promising solution to address crew shortages, improve navigational safety, and enhance operational efficiency in the maritime industry. Nevertheless, the reliable deployment of MASS in real-world environments remains a significant challenge, particularly in congested waters where the majority of maritime accidents occur. This emphasizes the need for safe and regulation-aware motion planning strategies for MASS that are capable of operating under dynamic maritime conditions. This paper presents a unified motion planning method for MASS that achieves real-time collision avoidance, compliance with the International Regulations for Preventing Collisions at Sea (COLREGs), and grounding prevention. The proposed work introduces a convex optimization method that integrates velocity obstacle-based (VO) collision constraints, COLREGs-based directional constraints, and bathymetry-based grounding constraints to generate computationally efficient, rule-compliant optimal velocity selections. To enhance robustness, the classical VO method is extended to consider uncertainty in the position and velocity estimates of the target vessel. Unnavigable shallow-water regions obtained from bathymetric data, which are inherently nonconvex, are approximated by convex geometries using integer linear programming (ILP), allowing grounding constraints to be incorporated into the motion planning. The resulting optimization generates optimal and dynamically feasible input velocities that meet collision avoidance, regulatory compliance, kinodynamic limits, and grounding prevention requirements. Simulation results involving multi-vessel encounters demonstrate the effectiveness of the proposed method in producing safe and regulation-compliant maneuvers, highlighting its suitability for real-time autonomous maritime navigation.
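The velocity-obstacle constraint at the core of the method can be illustrated by a basic 2-D membership test. This is a textbook VO check, not the paper's uncertainty-aware extension; the abstract's uncertainty handling could, for instance, be approximated by inflating `radius`.

```python
# Textbook 2-D velocity-obstacle membership test: does the current
# relative velocity drive the ownship into a safety disc around the
# target vessel at some future time?

def in_velocity_obstacle(p, v_rel, radius):
    """p: target position minus ownship position; v_rel: ownship velocity
    minus target velocity; radius: combined safety radius. True if the
    straight-line relative motion passes within `radius` of the target."""
    px, py = p
    vx, vy = v_rel
    dist2 = px * px + py * py
    if dist2 <= radius * radius:
        return True                       # already inside the safety disc
    speed2 = vx * vx + vy * vy
    if speed2 == 0.0:
        return False                      # no relative motion
    t_cpa = (vx * px + vy * py) / speed2  # time of closest approach
    if t_cpa <= 0.0:
        return False                      # range is opening
    d_cpa2 = dist2 - t_cpa * t_cpa * speed2
    return d_cpa2 <= radius * radius      # closest approach inside disc
```

Candidate velocities failing this test define the VO constraint set that the convex program then combines with the COLREGs and grounding constraints.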
cs.RO / 6 / 2603.02487
A Robust Simulation Framework for Verification and Validation of Autonomous Maritime Navigation in Adverse Weather and Constrained Environments
Abstract
Maritime Autonomous Surface Ships (MASS) have emerged as a promising solution to enhance navigational safety, operational efficiency, and long-term cost effectiveness. However, their reliable deployment requires rigorous verification and validation (V&V) under various environmental conditions, including extreme and safety-critical scenarios. This paper presents an enhanced virtual simulation framework to support the V&V of MASS in realistic maritime environments, with particular emphasis on the influence of weather and bathymetry on autonomous navigation performance. The framework incorporates a high-fidelity environmental modeling suite capable of simulating adverse weather conditions such as rain, fog, and wave dynamics. Key weather factors, such as rain and visibility, are parameterized to affect sea-state characteristics, perception, and sensing systems, resulting in position and velocity uncertainty, reduced visibility, and degraded situational awareness. Furthermore, high-resolution bathymetric data from major U.S. ports are integrated to enable depth-aware navigation, grounding prevention capabilities, and evaluation of vessel controllability in shallow or confined waterways. The proposed framework offers extensive configurability, enabling systematic testing in a wide spectrum of maritime conditions, including scenarios that are impractical or unsafe to replicate in real-world trials, thus supporting the V&V of MASS.
cs.RO / 7 / 2603.02500
Instant and Reversible Adhesive-free Bonding Between Silicones and Glossy Papers for Soft Robotics
Abstract
Integrating silicone with non-extensible materials is a common strategy in the fabrication of fluidically-driven soft actuators, yet conventional approaches often rely on irreversible adhesives or embedding processes that are labor-intensive and difficult to modify. This work presents silicone-glossy paper bonding (SGB), a rapid, adhesive-free, and solvent-reversible bonding approach that forms robust silicone-paper interfaces simply through contact. The SGB interface withstands high mechanical loads (shear strength > 61 kPa) and can be fully detached and reassembled via ethanol immersion without loss of performance, enabling component reuse and rapid redesign. Characterization studies indicate that adhesion is governed primarily by the surface functional groups of the glossy paper and the modulus of the silicone, while durability and environmental-response tests clarify the conditions for reversible debonding. The results further suggest a synergistic interaction of hydrogen bonding and oligomer diffusion, yielding strong yet reconfigurable adhesion. Soft actuators fabricated with the SGB design exhibit equal or greater performance compared to the conventional embedded-layer design and enable programmable actuation modes, including contraction, bending, and twisting. By simplifying fabrication while supporting reuse and rapid iteration, SGB offers a scalable and sustainable platform for rapid prototyping in soft robotics.
cs.RO / 8 / 2603.02511
Learning Object-Centric Spatial Reasoning for Sequential Manipulation in Cluttered Environments
Abstract
Robotic manipulation in cluttered environments presents a critical challenge for automation. Recent large-scale, end-to-end models demonstrate impressive capabilities but often lack the data efficiency and modularity required for retrieving objects in dense clutter. In this work, we argue for a paradigm of specialized, decoupled systems and present Unveiler, a framework that explicitly separates high-level spatial reasoning from low-level action execution. Unveiler's core is a lightweight, transformer-based Spatial Relationship Encoder (SRE) that sequentially identifies the most critical obstacle for removal. This discrete decision is then passed to a rotation-invariant Action Decoder for execution. We demonstrate that this decoupled architecture is not only more computationally efficient in terms of parameter count and inference time, but also significantly outperforms both classic end-to-end policies and modern, large-model-based baselines in retrieving targets from dense clutter. The SRE is trained in two stages: imitation learning from heuristic demonstrations provides sample-efficient initialization, after which PPO fine-tuning enables the policy to discover removal strategies that surpass the heuristic in dense clutter. Our results, achieving up to 97.6% success in partially occluded and 90.0% in fully occluded scenarios in simulation, make a case for the power of specialized, object-centric reasoning in complex manipulation tasks. Additionally, we demonstrate that the SRE's spatial reasoning transfers zero-shot to real scenes, and validate the full system on a physical robot requiring only geometric workspace calibration; no learned components are retrained.
cs.RO / 9 / 2603.02538
PathSpace: Rapid continuous map approximation for efficient SLAM using B-Splines in constrained environments
Abstract
Simultaneous Localization and Mapping (SLAM) plays a crucial role in enabling autonomous vehicles to navigate previously unknown environments. Semantic SLAM mostly extends visual SLAM, leveraging the higher-density information available to reason about the environment in a more human-like manner. This allows for better decision making by exploiting prior structural knowledge of the environment, usually in the form of labels. Current semantic SLAM techniques still mostly rely on a dense geometric representation of the environment, limiting their ability to apply constraints based on context. We propose PathSpace, a novel semantic SLAM framework that uses continuous B-splines to represent the environment compactly while also maintaining, and reasoning through, the continuous probability density functions required for probabilistic reasoning. The system applies the multiple strengths of B-splines in the context of SLAM to interpolate and fit otherwise discrete, sparse environments. We test this framework in the context of autonomous racing, where we exploit pre-specified track characteristics to produce significantly reduced representations at accuracy comparable to traditional landmark-based methods, demonstrating its potential to limit the resources used by a system with minimal accuracy loss.
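Evaluating a continuous B-spline map representation relies on standard spline machinery. Below is a pure-Python de Boor evaluation, a generic textbook routine rather than PathSpace's code; with a clamped knot vector the curve interpolates its first and last control points.

```python
# Generic de Boor evaluation of a degree-p B-spline. With a clamped knot
# vector (first and last knots repeated p+1 times) the curve starts and
# ends exactly at the first and last control points.

def de_boor(k, x, t, c, p):
    """Evaluate the spline at parameter x. k is the knot span with
    t[k] <= x < t[k+1] (use the last non-empty span when x equals the
    final knot); t: knots, c: control points, p: degree."""
    d = [c[j + k - p] for j in range(p + 1)]
    for r in range(1, p + 1):
        for j in range(p, r - 1, -1):
            alpha = (x - t[j + k - p]) / (t[j + 1 + k - r] - t[j + k - p])
            d[j] = (1.0 - alpha) * d[j - 1] + alpha * d[j]
    return d[p]

# Clamped cubic example: 6 scalar control points, knots [0,0,0,0,1,2,3,3,3,3].
knots = [0.0] * 4 + [1.0, 2.0] + [3.0] * 4
ctrl = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

For a track, `c` would hold 2-D control points instead of scalars; the compactness claim comes from `c` being far smaller than a dense landmark list.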
cs.RO / 10 / 2603.02553
Give me scissors: Collision-Free Dual-Arm Surgical Assistive Robot for Instrument Delivery
Abstract
During surgery, scrub nurses are required to frequently deliver surgical instruments to surgeons, which can lead to physical fatigue and decreased focus. Robotic scrub nurses offer a promising alternative, taking over repetitive tasks and enhancing efficiency. Existing research on robotic scrub nurses relies on predefined paths for instrument delivery, which limits their generalizability and poses safety risks in dynamic environments. To address these challenges, we present a collision-free dual-arm surgical assistive robot capable of performing instrument delivery. A vision-language model is utilized to automatically generate the robot's grasping and delivery trajectories in a zero-shot manner based on surgeons' instructions. A real-time obstacle minimum-distance perception method is proposed and integrated into a unified quadratic programming framework. This framework ensures reactive obstacle avoidance and self-collision prevention during the dual-arm robot's autonomous movement in dynamic environments. Extensive experimental validations demonstrate that the proposed robotic system achieves an 83.33% success rate in surgical instrument delivery while maintaining smooth, collision-free movement throughout all trials. The project page and source code are available at https://give-me-scissors.github.io/.
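Real-time minimum-distance perception reduces, at its simplest, to point-to-segment distance queries between obstacle points and robot links. Below is a hypothetical sketch of that primitive (the paper's actual perception method is more involved):

```python
import math

# Point-to-segment minimum distance, the basic query behind obstacle
# clearance checks when links are modeled as segments. Works in 2-D or
# 3-D; p, a, b are coordinate tuples of equal length.

def point_segment_distance(p, a, b):
    """Minimum Euclidean distance from point p to segment [a, b]."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(u * u for u in ab)
    if denom == 0.0:
        t = 0.0                       # degenerate segment: a == b
    else:
        # project p onto the line through a, b and clamp to the segment
        t = max(0.0, min(1.0, sum(u * w for u, w in zip(ap, ab)) / denom))
    closest = [ai + t * u for ai, u in zip(a, ab)]
    return math.dist(p, closest)
```

A constraint of the form distance >= d_min on each link-obstacle pair is the kind of term such a distance would feed into a quadratic program.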
cs.RO / 11 / 2603.02596
Tensegrity Robot Endcap-Ground Contact Estimation with Symmetry-aware Heterogeneous Graph Neural Network
Abstract
Tensegrity robots possess lightweight and resilient structures but present significant challenges for state estimation due to compliant and distributed ground contacts. This paper introduces a symmetry-aware heterogeneous graph neural network (Sym-HGNN) that infers contact states directly from proprioceptive measurements, including IMU and cable-length histories, without dedicated contact sensors. The network incorporates the robot's dihedral symmetry $D_3$ into the message-passing process to enhance sample efficiency and generalization. The predicted contacts are integrated into a state-of-the-art contact-aided invariant extended Kalman filter (InEKF) for improved pose estimation. Simulation results demonstrate that the proposed method achieves up to 15% higher accuracy and 5% higher F1-score using only 20% of the training data compared to the CNN and MI-HGNN baselines, while maintaining low-drift and physically consistent state estimation results comparable to ground truth contacts. This work highlights the potential of fully proprioceptive sensing for accurate and robust state estimation in tensegrity robots. Code available at: https://github.com/Jonathan-Twz/Tensegrity-Sym-HGNN
cs.RO / 12 / 2603.02602
Wukong-Omni: Design, Modeling and Control of a Multi-mode Robot for Air, Land, and Underwater Exploration with All-in-One Propulsion Unit
Abstract
In flood disaster rescue scenarios, partially submerged buildings prevent aerial robots from accessing lower levels, limiting mission effectiveness. To address this challenge, this paper presents Wukong-Omni, a novel multi-mode robot capable of operating across land, air, and underwater using a unified propulsion system. The system is enabled by an innovative mechanical design that allows motor reuse and improves thrust generation. Efficiency and peak thrust are enhanced through simulation and tank-based optimization. Experimental results show a 100% improvement in propulsion efficiency and a 150% increase in maximum thrust compared with direct installation methods. Dynamic models for the three operating domains are developed, and a unified cross-domain control framework is proposed. Comprehensive experiments validate stable locomotion and smooth transitions across domains. Outdoor experiments further demonstrate robustness and adaptability in real-world environments.
cs.RO / 13 / 2603.02623
Uni-Skill: Building Self-Evolving Skill Repository for Generalizable Robotic Manipulation
Abstract
While skill-centric approaches leverage foundation models to enhance generalization in compositional tasks, they often rely on fixed skill libraries, limiting adaptability to new tasks without manual intervention. To address this, we propose Uni-Skill, a Unified Skill-centric framework that supports skill-aware planning and facilitates automatic skill evolution. Unlike prior methods that restrict planning to predefined skills, Uni-Skill requests new skill implementations when existing ones are insufficient, ensuring adaptable planning with a self-augmenting skill library. To support automatic implementation of the diverse skills requested by the planning module, we construct SkillFolder, a VerbNet-inspired repository derived from large-scale unstructured robotic videos. SkillFolder introduces a hierarchical skill taxonomy that captures diverse skill descriptions at multiple levels of abstraction. By populating this taxonomy with large-scale, automatically annotated demonstrations, Uni-Skill shifts the paradigm of skill acquisition from inefficient manual annotation to efficient offline structural retrieval. Retrieved examples provide semantic supervision over behavior patterns and fine-grained references for spatial trajectories, enabling few-shot skill inference without deployment-time demonstrations. Comprehensive experiments in both simulation and real-world settings verify the state-of-the-art performance of Uni-Skill over existing VLM-based skill-centric approaches, highlighting its advanced reasoning capabilities and strong zero-shot generalization across a wide range of novel tasks.
cs.RO / 14 / 2603.02642
cuNRTO: GPU-Accelerated Nonlinear Robust Trajectory Optimization
Abstract
Robust trajectory optimization enables autonomous systems to operate safely under uncertainty by computing control policies that satisfy the constraints for all bounded disturbances. However, these problems often lead to large Second-Order Conic Programming (SOCP) constraints, which are computationally expensive. In this work, we propose the CUDA Nonlinear Robust Trajectory Optimization (cuNRTO) framework, introducing two dynamic optimization architectures for robust decision-making, both implemented on CUDA. The first architecture, NRTO-DR, leverages the Douglas-Rachford (DR) splitting method to solve the SOCP inner subproblems of NRTO, significantly reducing the computational burden through parallel SOCP projections and sparse direct solves. The second architecture, NRTO-FullADMM, is a novel variant that further exploits the problem structure to improve scalability using the Alternating Direction Method of Multipliers (ADMM). Finally, we provide a GPU implementation of the proposed methodologies using custom CUDA kernels for the SOC projection steps and cuBLAS GEMM chains for the feedback-gain updates. We validate the performance of cuNRTO through simulated experiments on unicycle, quadcopter, and Franka manipulator models, demonstrating speedups of up to 139.6$\times$.
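Both architectures rely on repeated projections onto second-order cones, the step cuNRTO parallelizes with custom CUDA kernels. The projection itself has a well-known closed form, sketched here in plain Python for a single cone:

```python
import math

# Closed-form Euclidean projection onto the second-order cone
# K = {(t, x) : ||x||_2 <= t}. This is the per-constraint kernel the
# paper batches across GPU threads (plain-Python sketch, not the CUDA
# implementation).

def project_soc(t, x):
    """Return the projection of the point (t, x) onto K."""
    nx = math.sqrt(sum(xi * xi for xi in x))
    if nx <= t:
        return t, list(x)              # already inside the cone
    if nx <= -t:
        return 0.0, [0.0] * len(x)     # inside the polar cone: project to 0
    scale = (nx + t) / (2.0 * nx)      # otherwise land on the boundary
    return scale * nx, [scale * xi for xi in x]
```

Each cone projection is independent of all the others, which is why batching thousands of them onto GPU threads pays off.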
cs.RO / 15 / 2603.02646
Compositional Visual Planning via Inference-Time Diffusion Scaling
Abstract
Diffusion models excel at short-horizon robot planning, yet scaling them to long-horizon tasks remains challenging due to computational constraints and limited training data. Existing compositional approaches stitch together short segments by separately denoising each component and averaging overlapping regions. However, this suffers from instability as the factorization assumption breaks down in noisy data space, leading to inconsistent global plans. We propose that the key to stable compositional generation lies in enforcing boundary agreement on the estimated clean data (Tweedie estimates) rather than on noisy intermediate states. Our method formulates long-horizon planning as inference over a chain-structured factor graph of overlapping video chunks, where pretrained short-horizon video diffusion models provide local priors. At inference time, we enforce boundary agreement through a novel combination of synchronous and asynchronous message passing that operates on Tweedie estimates, producing globally consistent guidance without requiring additional training. Our training-free framework demonstrates significant improvements over existing baselines, effectively generalizing to unseen start-goal combinations that were not present in the original training data. Project website: https://comp-visual-planning.github.io/
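The key mechanism, enforcing boundary agreement on Tweedie (clean-data) estimates rather than on noisy states, can be illustrated with the standard DDPM-style posterior-mean formula and a simple averaging step over a chunk overlap. This is an illustrative sketch with made-up inputs; the paper's synchronous/asynchronous message passing is not reproduced.

```python
import math

# Tweedie / DDPM posterior-mean estimate of the clean sample, plus a
# boundary-agreement step that averages the clean estimates of two
# overlapping chunks (vectors are plain lists of floats).

def tweedie_x0(x_t, eps_hat, alpha_bar):
    """x0_hat = (x_t - sqrt(1 - alpha_bar) * eps_hat) / sqrt(alpha_bar),
    where eps_hat is the model's noise prediction at this step."""
    s = math.sqrt(1.0 - alpha_bar)
    return [(xi - s * ei) / math.sqrt(alpha_bar) for xi, ei in zip(x_t, eps_hat)]

def reconcile_boundary(tail_left, head_right):
    """Average the overlapping clean-frame estimates of adjacent chunks,
    so both chunks are guided toward the same boundary frames."""
    return [(l + r) / 2.0 for l, r in zip(tail_left, head_right)]
```

Averaging in x0-space is stable because the factorization holds on the denoised estimates, which is the paper's argument against averaging the noisy intermediates directly.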
cs.RO / 16 / 2603.02657
Watch Your Step: Learning Semantically-Guided Locomotion in Cluttered Environment
Abstract
Although legged robots demonstrate impressive mobility on rough terrain, using them safely in cluttered environments remains a challenge. A key issue is their inability to avoid stepping on low-lying objects, such as small high-value devices or cables on flat ground. This limitation arises from a disconnection between high-level semantic understanding and low-level control, combined with errors in elevation maps during real-world operation. To address this, we introduce SemLoco, a Reinforcement Learning (RL) framework designed to avoid obstacles precisely in densely cluttered environments. SemLoco uses a two-stage RL approach that combines both soft and hard constraints and performs pixel-wise foothold safety inference, enabling more accurate foot placement. Additionally, SemLoco integrates a semantic map to assign traversability costs rather than relying solely on geometric data. SemLoco significantly reduces collisions and improves safety around sensitive objects, enabling reliable navigation in situations where traditional controllers would likely cause damage. Experimental results further demonstrate that SemLoco can be effectively applied to more complex, unstructured real-world environments.
Chinese Translation
尽管腿式机器人在崎岖地形上展现出令人印象深刻的机动性,但在杂乱环境中安全使用它们仍然是一项挑战。一个关键问题是它们无法避免踩到低矮物体,例如平地上的高成本小设备或电缆。这一局限性源于高层语义理解与低层控制之间的脱节,以及在实际操作中高程图的误差。为了解决这个问题,我们提出了SemLoco,一个旨在在密集杂乱环境中精确避障的强化学习(Reinforcement Learning, RL)框架。SemLoco采用了两阶段的RL方法,结合了软约束和硬约束,并执行逐像素的足底安全推断,从而实现更准确的足部放置。此外,SemLoco集成了语义地图,以分配可通行性成本,而不仅仅依赖几何数据。SemLoco显著减少了碰撞,并提高了在敏感物体周围的安全性,使得在传统控制器可能造成损害的情况下实现可靠导航。实验结果进一步表明,SemLoco可以有效应用于更复杂、非结构化的现实环境。
cs.RO / 17 / 2603.02669
IMR-LLM: Industrial Multi-Robot Task Planning and Program Generation using Large Language Models
IMR-LLM:基于大型语言模型的工业多机器人任务规划与程序生成
Abstract
In modern industrial production, multiple robots often collaborate to complete complex manufacturing tasks. Large language models (LLMs), with their strong reasoning capabilities, have shown potential in coordinating robots for simple household and manipulation tasks. However, in industrial scenarios, stricter sequential constraints and more complex dependencies within tasks present new challenges for LLMs. To address this, we propose IMR-LLM, a novel LLM-driven Industrial Multi-Robot task planning and program generation framework. Specifically, we utilize LLMs to assist in constructing disjunctive graphs and employ deterministic solving methods to obtain a feasible and efficient high-level task plan. Based on this, we use a process tree to guide LLMs to generate executable low-level programs. Additionally, we create IMR-Bench, a challenging benchmark that encompasses multi-robot industrial tasks across three levels of complexity. Experimental results indicate that our method significantly surpasses existing methods across all evaluation metrics.
Chinese Translation
在现代工业生产中,多台机器人常常协作完成复杂的制造任务。大型语言模型(LLMs)凭借其强大的推理能力,在协调机器人进行简单的家庭和操作任务方面展现了潜力。然而,在工业场景中,更严格的顺序约束和任务内更复杂的依赖关系为LLMs带来了新的挑战。为此,我们提出了IMR-LLM,一个新颖的基于LLM的工业多机器人任务规划与程序生成框架。具体而言,我们利用LLMs辅助构建析取图,并采用确定性求解方法获得可行且高效的高层任务计划。在此基础上,我们使用过程树引导LLMs生成可执行的低层程序。此外,我们创建了IMR-Bench,一个涵盖三种复杂度级别的多机器人工业任务的挑战性基准。实验结果表明,我们的方法在所有评估指标上显著超越了现有方法。
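As a toy illustration of the scheduling side: once every disjunctive pair (two operations competing for the same robot) has been given an orientation, the disjunctive graph becomes a DAG whose longest path is the makespan. The sketch below computes that with a topological pass; the data layout is an assumption, and the LLM-assisted graph construction and deterministic solver from the abstract are not modeled:

```python
from collections import deque

def makespan(durations, arcs):
    # `durations`: operation -> processing time.
    # `arcs`: conjunctive precedence edges plus one chosen direction for each
    # disjunctive pair; together they must form a DAG.
    succ = {op: [] for op in durations}
    indeg = {op: 0 for op in durations}
    for u, v in arcs:
        succ[u].append(v)
        indeg[v] += 1
    est = {op: 0 for op in durations}  # earliest start times
    queue = deque(op for op in durations if indeg[op] == 0)
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            est[v] = max(est[v], est[u] + durations[u])
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    # Longest finish time over all operations = schedule length.
    return max(est[op] + durations[op] for op in durations)
```

Different orientations of the disjunctive arcs yield different makespans; a solver searches over those orientations for the shortest feasible schedule.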
cs.RO / 18 / 2603.02683
MMH-Planner: Multi-Mode Hybrid Trajectory Planning Method for UAV Efficient Flight Based on Real-Time Spatial Awareness
MMH-规划器:基于实时空间感知的无人机高效飞行多模式混合轨迹规划方法
Abstract
Motion planning is a critical component of intelligent unmanned systems, enabling their complex autonomous operations. However, current planning algorithms still face limitations in planning efficiency due to inflexible strategies and weak adaptability. To address this, this paper proposes a multi-mode hybrid trajectory planning method for UAVs based on real-time environmental awareness, which dynamically selects the optimal planning model for high-quality trajectory generation in response to environmental changes. First, we introduce a goal-oriented spatial awareness method that rapidly assesses flight safety in the upcoming environments. Second, a multi-mode hybrid trajectory planning mechanism is proposed, which can enhance the planning efficiency by selecting the optimal planning model for trajectory generation based on prior spatial awareness. Finally, we design a lazy replanning strategy that triggers replanning only when necessary to reduce computational resource consumption while maintaining flight quality. To validate the performance of the proposed method, we conducted comprehensive comparative experiments in simulation environments. Results demonstrate that our approach outperforms existing state-of-the-art (SOTA) algorithms across multiple metrics, achieving the best performance particularly in terms of the average number of planning iterations and computational cost per iteration. Furthermore, the effectiveness of our approach is further verified through real-world flight experiments integrated with a self-developed intelligent UAV platform.
Chinese Translation
运动规划是智能无人系统的关键组成部分,使其能够进行复杂的自主操作。然而,当前的规划算法在规划效率方面仍面临由于策略不灵活和适应性较弱而导致的限制。为了解决这一问题,本文提出了一种基于实时环境感知的无人机多模式混合轨迹规划方法,该方法能够根据环境变化动态选择最佳规划模型,以实现高质量轨迹生成。首先,我们引入了一种面向目标的空间感知方法,能够快速评估即将到来的环境中的飞行安全性。其次,提出了一种多模式混合轨迹规划机制,通过基于先前的空间感知选择最佳规划模型来增强规划效率。最后,我们设计了一种懒惰重规划策略,仅在必要时触发重规划,以减少计算资源消耗,同时保持飞行质量。为了验证所提方法的性能,我们在仿真环境中进行了全面的对比实验。结果表明,我们的方法在多个指标上优于现有的最先进(SOTA)算法,特别是在平均规划迭代次数和每次迭代的计算成本方面表现最佳。此外,通过与自开发的智能无人机平台集成的实际飞行实验进一步验证了我们方法的有效性。
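The lazy-replanning idea, replan only when the remaining trajectory is invalidated, reduces to a simple guard. Everything here (the path representation, the safety predicate, the planner callback) is a hypothetical stand-in for the paper's components:

```python
def maybe_replan(path, is_safe, plan_fn):
    # Lazy replanning: keep flying the current trajectory and invoke the
    # (expensive) planner only if some remaining waypoint is no longer safe
    # under the latest spatial awareness.
    if all(is_safe(p) for p in path):
        return path, False          # no replan triggered
    return plan_fn(), True          # replan only when necessary
```

The saving comes from how rarely the second branch fires, which is exactly the metric (average number of planning iterations) the abstract reports.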
cs.RO / 19 / 2603.02742
Robust Tightly-Coupled Filter-Based Monocular Visual-Inertial State Estimation and Graph-Based Evaluation for Autonomous Drone Racing
基于鲁棒紧耦合滤波器的单目视觉惯性状态估计及图基评估在自主无人机竞速中的应用
Abstract
Autonomous drone racing (ADR) demands state estimation that is simultaneously computationally efficient and resilient to the perceptual degradation experienced during extreme velocity and maneuvers. Traditional frameworks typically rely on conventional visual-inertial pipelines with loosely-coupled gate-based Perspective-n-Points (PnP) corrections that suffer from a rigid requirement for four visible features and information loss in intermediate steps. Furthermore, the absence of GNSS and Motion Capture systems in uninstrumented, competitive racing environments makes the objective evaluation of such systems remarkably difficult. To address these limitations, we propose ADR-VINS, a robust, monocular visual-inertial state estimation framework based on an Error-State Kalman Filter (ESKF) tailored for autonomous drone racing. Our approach integrates direct pixel reprojection errors from gate corners features as innovation terms within the filter. By bypassing intermediate PnP solvers, ADR-VINS maintains valid state updates with as few as two visible corners and utilizes robust reweighting instead of RANSAC-based schemes to handle outliers, enhancing computational efficiency. Furthermore, we introduce ADR-FGO, an offline Factor-Graph Optimization framework to generate high-fidelity reference trajectories that facilitate post-flight performance evaluation and analysis on uninstrumented, GNSS-denied environments. The proposed system is validated using TII-RATM dataset, where ADR-VINS achieves an average RMS translation error of 0.134 m, while ADR-FGO yields 0.060 m as a smoothing-based reference. Finally, ADR-VINS was successfully deployed in the A2RL Drone Championship Season 2, maintaining stable and robust estimation despite noisy detections during high-agility flight at top speeds of 20.9 m/s. We further utilize ADR-FGO for post-flight evaluation in uninstrumented racing environments.
Chinese Translation
自主无人机竞速(ADR)需要一种状态估计方法,该方法在计算效率和在极端速度和机动过程中所经历的感知退化的抗干扰能力上都表现出色。传统框架通常依赖于常规的视觉惯性管道,采用松耦合的基于门的透视n点(PnP)校正,这种方法对四个可见特征有严格要求,并且在中间步骤中会导致信息损失。此外,在未配备仪器的竞争性竞速环境中,缺乏全球导航卫星系统(GNSS)和运动捕捉系统使得对这些系统的客观评估变得极为困难。为了解决这些局限性,我们提出了ADR-VINS,这是一种基于误差状态卡尔曼滤波器(Error-State Kalman Filter, ESKF)的鲁棒单目视觉惯性状态估计框架,专为自主无人机竞速而设计。我们的方法将来自门角特征的直接像素重投影误差作为创新项集成到滤波器中。通过绕过中间的PnP求解器,ADR-VINS能够在仅有两个可见角的情况下维持有效的状态更新,并采用鲁棒重加权而非基于RANSAC的方案来处理离群值,从而提高计算效率。此外,我们引入了ADR-FGO,一个离线因子图优化框架,用于生成高保真度的参考轨迹,以便在未配备仪器且缺乏GNSS的环境中进行飞行后性能评估和分析。所提系统在TII-RATM数据集上进行了验证,其中ADR-VINS的平均均方根平移误差为0.134米,而作为基于平滑方法的参考,ADR-FGO则为0.060米。最后,ADR-VINS成功应用于A2RL无人机锦标赛第二季,在最高速度达20.9米/秒的高机动飞行中,尽管检测存在噪声,仍保持稳定和鲁棒的估计。我们进一步利用ADR-FGO在未配备仪器的竞速环境中进行飞行后的评估。
cs.RO / 20 / 2603.02772
Agentic Self-Evolutionary Replanning for Embodied Navigation
具身导航的自主自我进化重规划
Abstract
Failure is inevitable for embodied navigation in complex environments. To enhance resilience, replanning (RP) is a viable option, where the robot is allowed to fail but is capable of adjusting its plan until success. However, existing RP approaches freeze the ego action model and miss opportunities to explore better plans by upgrading the robot itself. To address this limitation, we propose Self-Evolutionary RePlanning, or SERP for short, which leads to a paradigm shift from frozen models towards evolving models by run-time learning from recent experiences. In contrast to existing model evolution approaches that often get stuck at predefined static parameters, we introduce an agentic self-evolving action model that uses in-context learning with auto-differentiation (ILAD) for adaptive function adjustment and global parameter reset. To achieve token-efficient replanning for SERP, we also propose graph chain-of-thought (GCOT) replanning with large language model (LLM) inference over distilled graphs. Extensive simulation and real-world experiments demonstrate that SERP achieves a higher success rate with lower token expenditure over various benchmarks, validating its superior robustness and efficiency across diverse environments.
Chinese Translation
在复杂环境中,具身导航的失败是不可避免的。为了增强其韧性,重规划(Replanning, RP)是一种可行的选择,允许机器人失败,但能够调整计划直至成功。然而,现有的重规划方法冻结了自我行动模型,错失了通过升级机器人自身来探索更好计划的机会。为了解决这一局限,我们提出了自我进化重规划(Self-Evolutionary RePlanning, SERP),这标志着从冻结模型向通过实时学习最近经验而进化模型的范式转变。与现有的模型进化方法常常停留在预定义的静态参数不同,我们引入了自主自我进化的行动模型,利用上下文学习与自动微分(In-context Learning with Auto-Differentiation, ILAD)进行自适应功能调整和全局参数重置。为了实现SERP的高效重规划,我们还提出了基于大语言模型(Large Language Model, LLM)推理的图链思维(Graph Chain-of-Thought, GCOT)重规划。大量的仿真和现实世界实验表明,SERP在各种基准测试中实现了更高的成功率和更低的令牌消耗,验证了其在多样化环境中的卓越鲁棒性和效率。
cs.RO / 21 / 2603.02783
Generative adversarial imitation learning for robot swarms: Learning from human demonstrations and trained policies
生成对抗模仿学习在机器人群体中的应用:从人类示范和训练策略中学习
Abstract
In imitation learning, robots are supposed to learn from demonstrations of the desired behavior. Most of the work in imitation learning for swarm robotics provides the demonstrations as rollouts of an existing policy. In this work, we provide a framework based on generative adversarial imitation learning that aims to learn collective behaviors from human demonstrations. Our framework is evaluated across six different missions, learning both from manual demonstrations and demonstrations derived from a PPO-trained policy. Results show that the imitation learning process is able to learn qualitatively meaningful behaviors that perform similarly well as the provided demonstrations. Additionally, we deploy the learned policies on a swarm of TurtleBot 4 robots in real-robot experiments. The exhibited behaviors preserved their visually recognizable character and their performance is comparable to the one achieved in simulation.
Chinese Translation
在模仿学习中,机器人应当从期望行为的示范中学习。大多数关于群体机器人模仿学习的研究将示范以现有策略的轨迹回放(rollout)形式提供。在本研究中,我们提供了一个基于生成对抗模仿学习的框架,旨在从人类示范中学习集体行为。我们的框架在六个不同的任务中进行了评估,既学习了手动示范,也学习了源自PPO(Proximal Policy Optimization)训练策略的示范。结果表明,模仿学习过程能够学习到在定性上有意义的行为,这些行为的表现与提供的示范相似。此外,我们在真实机器人实验中将学习到的策略部署在一群TurtleBot 4机器人上。所展现的行为保持了其视觉上可识别的特征,其性能与模拟中实现的性能相当。
cs.RO / 22 / 2603.02845
SPARC: Spatial-Aware Path Planning via Attentive Robot Communication
SPARC:通过关注机器人通信的空间感知路径规划
Abstract
Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat all neighboring robots equally regardless of their spatial proximity, leading to diluted attention in congested regions where coordination matters most. We propose Relation enhanced Multi Head Attention (RMHA), a communication mechanism that explicitly embeds pairwise Manhattan distances into the attention weight computation, enabling each robot to dynamically prioritize messages from spatially relevant neighbors. Combined with a distance-constrained attention mask and GRU gated message fusion, RMHA integrates seamlessly with MAPPO for stable end-to-end training. In zero-shot generalization from 8 training robots to 128 test robots on 40x40 grids, RMHA achieves approximately a 75 percent success rate at 30 percent obstacle density, outperforming the best baseline by over 25 percentage points. Ablation studies confirm that distance-relation encoding is the key contributor to success rate improvement in high-density environments. Index Terms: Multi-robot path planning, graph attention mechanism, multi-head attention, communication optimization, cooperative decision-making
Chinese Translation
高效的通信对于分散式多机器人路径规划(MRPP)至关重要,然而现有的学习通信方法对所有邻近机器人一视同仁,而未考虑其空间接近性,导致在协调至关重要的拥挤区域注意力分散。我们提出了关系增强多头注意力(RMHA),这是一种通信机制,明确将成对的曼哈顿距离嵌入到注意力权重计算中,使每个机器人能够动态优先处理来自空间相关邻居的消息。结合距离约束的注意力掩码和GRU门控消息融合,RMHA与MAPPO无缝集成,实现稳定的端到端训练。在40x40网格上从8个训练机器人到128个测试机器人的零样本泛化中,RMHA在30%障碍密度下实现了约75%的成功率,超过最佳基线25个百分点以上。消融研究确认,距离关系编码是提高高密度环境成功率的关键因素。关键词:多机器人路径规划,图注意力机制,多头注意力,通信优化,合作决策
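The distance-biased attention can be sketched as follows: raw attention scores are penalized by pairwise Manhattan distance, and neighbours beyond a radius are masked out before the softmax. The linear penalty, the `beta` scale, and the hard radius are illustrative assumptions; the paper's exact relation encoding may differ:

```python
import math

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def relation_attention(scores, positions, self_idx, radius, beta=1.0):
    # Bias raw attention scores by pairwise Manhattan distance so that nearby
    # robots receive more weight, and mask out neighbours beyond `radius`
    # (a distance-constrained attention mask).
    logits = []
    for j, s in enumerate(scores):
        d = manhattan(positions[self_idx], positions[j])
        if j != self_idx and d > radius:
            logits.append(float('-inf'))      # masked out
        else:
            logits.append(s - beta * d)       # closer -> larger logit
    mx = max(l for l in logits if l != float('-inf'))
    exps = [0.0 if l == float('-inf') else math.exp(l - mx) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

With equal raw scores, the resulting weights are ordered purely by proximity, which is the behaviour the abstract credits for avoiding diluted attention in congested regions.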
cs.RO / 23 / 2603.02851
Design, Modeling and Direction Control of a Wire-Driven Robotic Fish Based on a 2-DoF Crank-Slider Mechanism
基于2自由度曲柄滑块机制的线驱动机器人鱼的设计、建模与方向控制
Abstract
Robotic fish have attracted growing attention in recent years owing to their biomimetic design and potential applications in environmental monitoring and biological surveys. Among robotic fish employing the Body-Caudal Fin (BCF) locomotion pattern, motor-driven actuation is widely adopted. Some approaches utilize multiple servo motors to achieve precise body curvature control, while others employ a brushless motor to drive the tail via wire or rod, enabling higher oscillation and swimming speeds. However, the former approaches typically result in limited swimming speed, whereas the latter suffer from poor maneuverability, with few capable of smooth turning. To address this trade-off, we develop a wire-driven robotic fish equipped with a 2-degree-of-freedom (DoF) crank-slider mechanism that decouples propulsion from steering, enabling both high swimming speed and agile maneuvering. In this paper, we first present the design of the robotic fish, including the elastic skeleton, waterproof structure, and the actuation mechanism that realizes the decoupling. We then establish the actuation modeling and body dynamics to analyze the locomotion behavior. Furthermore, we propose a combined feedforward-feedback control strategy to achieve independent regulation of propulsion and steering. Finally, we validate the feasibility of the design, modeling, and control through a series of prototype experiments, demonstrating swimming, turning, and directional control.
Chinese Translation
近年来,机器人鱼因其仿生设计及在环境监测和生物调查中的潜在应用而受到越来越多的关注。在采用身体-尾鳍(Body-Caudal Fin,BCF)运动模式的机器人鱼中,电机驱动被广泛采用。一些方法利用多个伺服电机实现精确的身体曲率控制,而另一些则采用无刷电机,通过线缆或杆驱动尾部,从而实现更高的振荡和游泳速度。然而,前者通常导致游泳速度受限,而后者则在机动性方面表现不佳,鲜有能够实现平滑转向的。为了解决这一权衡问题,我们开发了一种配备2自由度(DoF)曲柄滑块机制的线驱动机器人鱼,该机制将推进与转向解耦,从而实现高游泳速度和灵活的机动性。本文首先介绍了机器人鱼的设计,包括弹性骨架、防水结构以及实现解耦的驱动机制。然后,我们建立了驱动建模和身体动力学,以分析运动行为。此外,我们提出了一种组合前馈-反馈控制策略,以实现推进和转向的独立调节。最后,通过一系列原型实验验证了设计、建模和控制的可行性,展示了游泳、转向和方向控制的能力。
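For the actuation modeling, the classic single-DoF crank-slider relation gives the slider displacement as a function of crank angle; the paper's 2-DoF mechanism adds a second input for steering, which this generic textbook sketch does not capture:

```python
import math

def slider_position(theta, r, l):
    # Classic crank-slider kinematics: displacement of the slider from the
    # crank axis for crank angle `theta`, crank radius `r`, and connecting-rod
    # length `l` (requires l > r so the rod always reaches the slider).
    return r * math.cos(theta) + math.sqrt(l * l - (r * math.sin(theta)) ** 2)
```

Driving `theta` at constant speed yields the periodic tail oscillation; in the paper's design the second DoF biases this motion to produce steering.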
cs.RO / 24 / 2603.02854
CoFL: Continuous Flow Fields for Language-Conditioned Navigation
CoFL:用于语言条件导航的连续流场
Abstract
Language-conditioned navigation pipelines often rely on brittle modular components or costly action-sequence generation. To address these limitations, we present CoFL, an end-to-end policy that directly maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. Instead of predicting discrete action tokens or sampling action chunks via iterative denoising, CoFL outputs instantaneous velocities that can be queried at arbitrary 2D projected locations. Trajectories are obtained by numerical integration of the predicted field, producing smooth motion that remains reactive under closed-loop execution. To enable large-scale training, we build a dataset of over 500k BEV image-instruction pairs, each procedurally annotated with a flow field and a trajectory derived from BEV semantic maps built on Matterport3D and ScanNet. By training on a mixed distribution, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and generative policy baselines on strictly unseen scenes. Finally, we deploy CoFL zero-shot in real-world experiments with overhead BEV observations across multiple layouts, maintaining reliable closed-loop control and a high success rate.
Chinese Translation
语言条件导航管道通常依赖于脆弱的模块化组件或昂贵的动作序列生成。为了解决这些局限性,我们提出了CoFL,一种端到端策略,直接将鸟瞰图(BEV)观测和语言指令映射到用于导航的连续流场。CoFL不是预测离散的动作标记或通过迭代去噪采样动作块,而是输出可以在任意二维投影位置查询的瞬时速度。通过对预测场的数值积分获得轨迹,产生平滑的运动,并在闭环执行中保持反应性。为了实现大规模训练,我们构建了一个包含超过50万对BEV图像-指令的数据集,每对数据都通过程序化注释生成流场和从基于Matterport3D和ScanNet构建的BEV语义图中派生的轨迹。通过在混合分布上训练,CoFL在严格未见场景中显著优于基于模块化视觉-语言模型(VLM)的规划器和生成策略基线。最后,我们在多个布局的真实世界实验中零样本(zero-shot)部署CoFL,利用俯视BEV观测,保持可靠的闭环控制和高成功率。
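Querying a continuous flow field and integrating it into a trajectory can be sketched with explicit Euler steps; `velocity_fn` stands in for CoFL's learned field, and the step size, step count, and toy goal-seeking field below are arbitrary assumptions:

```python
def integrate_flow(velocity_fn, start, n_steps=100, dt=0.05):
    # Roll out a trajectory by repeatedly querying the field for the
    # instantaneous velocity at the current 2D position and stepping forward.
    x, y = start
    traj = [(x, y)]
    for _ in range(n_steps):
        vx, vy = velocity_fn(x, y)
        x, y = x + dt * vx, y + dt * vy
        traj.append((x, y))
    return traj

# Toy field pointing at a goal; the real policy conditions on BEV + language.
goal = (1.0, -2.0)
to_goal = lambda x, y: (goal[0] - x, goal[1] - y)
```

Because the field can be re-queried at whatever position the robot actually reaches, the rollout stays reactive under closed-loop execution, which is the property the abstract emphasizes.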
cs.RO / 25 / 2603.02856
Rhythm: Learning Interactive Whole-Body Control for Dual Humanoids
Rhythm:学习双人形机器人交互式全身控制
Abstract
Realizing interactive whole-body control for multi-humanoid systems is critical for unlocking complex collaborative capabilities in shared environments. Although recent advancements have significantly enhanced the agility of individual robots, bridging the gap to physically coupled multi-humanoid interaction remains challenging, primarily due to severe kinematic mismatches and complex contact dynamics. To address this, we introduce Rhythm, the first unified framework enabling real-world deployment of dual-humanoid systems for complex, physically plausible interactions. Our framework integrates three core components: (1) an Interaction-Aware Motion Retargeting (IAMR) module that generates feasible humanoid interaction references from human data; (2) an Interaction-Guided Reinforcement Learning (IGRL) policy that masters coupled dynamics via graph-based rewards; and (3) a real-world deployment system that enables robust transfer of dual-humanoid interaction. Extensive experiments on physical Unitree G1 robots demonstrate that our framework achieves robust interactive whole-body control, successfully transferring diverse behaviors such as hugging and dancing from simulation to reality.
Chinese Translation
实现多人形机器人系统的交互式全身控制对于在共享环境中解锁复杂的协作能力至关重要。尽管近期的进展显著提升了单个机器人的灵活性,但在物理耦合的多人形机器人交互中弥合差距仍然具有挑战性,主要是由于严重的运动学不匹配和复杂的接触动力学。为了解决这一问题,我们提出了Rhythm,这是第一个统一框架,能够实现双人形机器人系统在复杂、物理合理的交互中的现实部署。我们的框架集成了三个核心组件:(1)一个交互感知运动重定向(Interaction-Aware Motion Retargeting, IAMR)模块,从人类数据中生成可行的人形交互参考;(2)一个交互引导强化学习(Interaction-Guided Reinforcement Learning, IGRL)策略,通过基于图的奖励掌握耦合动力学;以及(3)一个现实部署系统,能够实现双人形机器人交互的稳健迁移。在物理Unitree G1机器人上进行的广泛实验表明,我们的框架实现了稳健的交互式全身控制,成功地将拥抱和舞蹈等多样行为从仿真迁移到现实中。
cs.RO / 26 / 2603.02878
Emerging trends in Cislunar Space for Lunar Science Exploration and Space Robotics aiding Human Spaceflight Safety
月球科学探索与太空机器人在保障人类航天安全中的新兴趋势
Abstract
In recent years, the Moon has emerged as an unparalleled extraterrestrial testbed for advancing cutting-edge technological and scientific research critical to enabling a sustained human presence on its surface and supporting future interplanetary exploration. This study identifies and investigates two pivotal research domains with substantial transformative potential for accelerating humanity's interplanetary aspirations. The first is Lunar Science Exploration with Artificial Intelligence and Space Robotics, which focuses on how AI and space robotics are redefining the frontiers of space exploration. The second is Space Robotics aiding crewed spaceflight to the Moon, serving as critical assets for pre-deployment infrastructure development, In-Situ Resource Utilization, surface operations support, and astronaut safety assurance. By integrating autonomy, machine learning, and real-time sensor fusion, space robotics not only augment human capabilities but also serve as force multipliers in achieving sustainable lunar exploration, paving the way for future crewed missions to Mars and beyond.
Chinese Translation
近年来,月球已成为一个无与伦比的外星测试平台,推动尖端技术和科学研究的发展,这对于实现人类在其表面的持续存在以及支持未来的星际探索至关重要。本研究识别并探讨了两个具有重大变革潜力的关键研究领域,以加速人类的星际愿景。第一个领域是结合人工智能(Artificial Intelligence)和太空机器人(Space Robotics)的月球科学探索,重点在于人工智能和太空机器人重新定义太空探索的前沿。第二个领域是太空机器人在载人航天飞往月球中的辅助作用,作为预部署基础设施开发、原位资源利用(In-Situ Resource Utilization)、表面操作支持和宇航员安全保障的关键资产。通过整合自主性、机器学习和实时传感器融合,太空机器人不仅增强了人类的能力,还在实现可持续的月球探索中充当了力量倍增器,为未来的载人火星任务及更远的探索铺平了道路。
cs.RO / 27 / 2603.02881
Tracing Back Error Sources to Explain and Mitigate Pose Estimation Failures
追溯错误源以解释和缓解姿态估计失败
Abstract
Robust estimation of object poses in robotic manipulation is often addressed using foundational general estimators, that aim to handle diverse error sources naively within a single model. Still, they struggle due to environmental uncertainties, while requiring long inference times and heavy computation. In contrast, we propose a modular, uncertainty-aware framework that attributes pose estimation errors to specific error sources and applies targeted mitigation strategies only when necessary. Instantiated with Iterative Closest Point (ICP) as a simple and lightweight pose estimator, we leverage our framework for real-world robotic grasping tasks. By decomposing pose estimation into failure detection, error attribution, and targeted recovery, we significantly improve the robustness of ICP and achieve competitive performance compared to foundation models, while relying on a substantially simpler and faster pose estimator.
Chinese Translation
在机器人操作中,物体姿态的稳健估计通常采用基础通用估计器,这些估计器旨在在单一模型中天真地处理多种错误源。然而,由于环境的不确定性,它们仍然面临挑战,同时需要较长的推理时间和大量计算。相比之下,我们提出了一种模块化的、不确定性感知的框架,该框架将姿态估计错误归因于特定的错误源,并仅在必要时应用针对性的缓解策略。我们以迭代最近点(Iterative Closest Point, ICP)作为简单且轻量的姿态估计器来实例化该框架,并利用其进行真实世界的机器人抓取任务。通过将姿态估计分解为故障检测、错误归因和针对性恢复,我们显著提高了ICP的稳健性,并在依赖于显著更简单和更快速的姿态估计器的同时,达到了与基础模型相媲美的性能。
cs.RO / 28 / 2603.02936
Self-supervised Domain Adaptation for Visual 3D Pose Estimation of Nano-drone Racing Gates by Enforcing Geometric Consistency
通过强制几何一致性进行自监督领域适应,以实现纳米无人机竞速门的视觉三维姿态估计
Abstract
We consider the task of visually estimating the relative pose of a drone racing gate in front of a nano-quadrotor, using a convolutional neural network pre-trained on simulated data to regress the gate's pose. Due to the sim-to-real gap, the pre-trained model underperforms in the real world and must be adapted to the target domain. We propose an unsupervised domain adaptation (UDA) approach using only real image sequences collected by the drone flying an arbitrary trajectory in front of a gate; sequences are annotated in a self-supervised fashion with the drone's odometry as measured by its onboard sensors. On this dataset, a state consistency loss enforces that two images acquired at different times yield pose predictions that are consistent with the drone's odometry. Results indicate that our approach outperforms other SoA UDA approaches, has a low mean absolute error in position (x=26, y=28, z=10 cm) and orientation ($\psi$=13${^{\circ}}$), an improvement of 40% in position and 37% in orientation over a baseline. The approach's effectiveness is appreciable with as few as 10 minutes of real-world flight data and yields models with an inference time of 30.4ms (33 fps) when deployed aboard the Crazyflie 2.1 Brushless nano-drone.
Chinese Translation
我们考虑在纳米四旋翼无人机前方视觉估计无人机竞速门相对姿态的任务,使用在模拟数据上预训练的卷积神经网络来回归门的姿态。由于模拟与现实之间的差距,预训练模型在现实世界中的表现不佳,需要适应目标领域。我们提出了一种无监督领域适应(UDA)方法,仅使用无人机在门前以任意轨迹飞行时收集的真实图像序列;这些序列通过无人机机载传感器测量的里程计以自监督的方式进行标注。在该数据集上,状态一致性损失强制要求在不同时间获取的两幅图像产生与无人机里程计一致的姿态预测。结果表明,我们的方法优于其他最先进的UDA方法,在位置(x=26, y=28, z=10 cm)和方向($\psi$=13$^{\circ}$)上的平均绝对误差较低,位置和方向分别比基线提高了40%和37%。该方法在仅有10分钟的真实飞行数据下也能取得显著效果,并在部署于Crazyflie 2.1 Brushless纳米无人机时,模型的推理时间为30.4毫秒(33帧每秒)。
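The state-consistency idea can be made concrete in SE(2): if the gate is static, the pose predicted at time t2 must equal the t1 prediction transported through the drone's own odometry. Poses are (x, y, yaw) triples here for brevity (the paper works with full 3D poses), and the squared-error loss form is an assumption of this sketch:

```python
import math

def compose(a, b):
    # SE(2) composition: pose `b` expressed through frame `a`.
    ax, ay, at = a
    bx, by, bt = b
    c, s = math.cos(at), math.sin(at)
    return (ax + c * bx - s * by, ay + s * bx + c * by, at + bt)

def inverse(a):
    # SE(2) inverse, so that compose(inverse(a), a) is the identity pose.
    ax, ay, at = a
    c, s = math.cos(at), math.sin(at)
    return (-c * ax - s * ay, s * ax - c * ay, -at)

def consistency_loss(pred_t1, pred_t2, odom_12):
    # Self-supervised state consistency: the gate pose predicted at t2 should
    # equal the t1 prediction carried through the drone's odometry
    # (motion of the drone frame from t1 to t2).
    expected_t2 = compose(inverse(odom_12), pred_t1)
    return sum((p - q) ** 2 for p, q in zip(pred_t2, expected_t2))
```

Minimizing this loss on unlabeled flight sequences pulls the network's predictions toward geometric agreement with onboard odometry, without any ground-truth gate poses.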
cs.RO / 29 / 2603.02976
DreamFlow: Local Navigation Beyond Observation via Conditional Flow Matching in the Latent Space
DreamFlow:通过潜在空间中的条件流匹配实现超越观察的局部导航
Abstract
Local navigation in cluttered environments often suffers from dense obstacles and frequent local minima. Conventional local planners rely on heuristics and are prone to failure, while deep reinforcement learning (DRL)-based approaches provide adaptability but are constrained by limited onboard sensing. These limitations lead to navigation failures because the robot cannot perceive structures outside its field of view. In this paper, we propose DreamFlow, a DRL-based local navigation framework that extends the robot's perceptual horizon through conditional flow matching (CFM). The proposed CFM-based prediction module learns a probabilistic mapping between the local height-map latent representation and a broader spatial representation conditioned on navigation context. This enables the navigation policy to predict unobserved environmental features and proactively avoid potential local minima. Experimental results demonstrate that DreamFlow outperforms existing methods in terms of latent prediction accuracy and navigation performance in simulation. The proposed method was further validated in cluttered real-world environments with a quadrupedal robot. The project page is available at https://dreamflow-icra.github.io.
Chinese Translation
在杂乱环境中的局部导航常常受到密集障碍物和频繁局部极小值的影响。传统的局部规划器依赖于启发式方法,容易出现失败,而基于深度强化学习(DRL)的方法虽然提供了适应性,但受到有限的机载传感器的限制。这些局限性导致导航失败,因为机器人无法感知其视野之外的结构。本文提出了DreamFlow,一种基于DRL的局部导航框架,通过条件流匹配(CFM)扩展机器人的感知视野。所提出的基于CFM的预测模块学习局部高度图潜在表示与基于导航上下文的更广泛空间表示之间的概率映射。这使得导航策略能够预测未观察到的环境特征,并主动避免潜在的局部极小值。实验结果表明,DreamFlow在潜在预测准确性和仿真中的导航性能方面优于现有方法。该方法还在杂乱的真实环境中通过四足机器人得到了进一步验证。项目页面可访问 https://dreamflow-icra.github.io。
cs.RO / 30 / 2603.02989
CASSR: Continuous A-Star Search through Reachability for real time footstep planning
CASSR:通过可达性进行连续A*搜索以实现实时足迹规划
Abstract
Footstep planning involves a challenging combinatorial search. Traditional A* approaches require discretising reachability constraints, while Mixed-Integer Programming (MIP) supports continuous formulations but quickly becomes intractable, especially when rotations are included. We present CASSR, a novel framework that recursively propagates convex, continuous formulations of a robot's kinematic constraints within an A* search. Combined with a new cost-to-go heuristic based on the EPA algorithm, CASSR efficiently plans contact sequences of up to 30 footsteps in under 125 ms. Experiments on biped locomotion tasks demonstrate that CASSR outperforms traditional discretised A* by up to a factor of 100, while also surpassing a commercial MIP solver. These results show that CASSR enables fast, reliable, and real-time footstep planning for biped robots.
Chinese Translation
足迹规划涉及一个具有挑战性的组合搜索。传统的A*方法需要对可达性约束进行离散化,而混合整数规划(MIP)支持连续形式,但很快变得难以求解,尤其是在包含旋转时。我们提出了CASSR,一个新颖的框架,在A*搜索中递归传播机器人运动学约束的凸连续形式。结合一种基于EPA算法的新型剩余成本(cost-to-go)启发式,CASSR能够在125毫秒内高效规划多达30个足迹的接触序列。在双足运动任务上的实验表明,CASSR的性能比传统的离散A*方法提高了多达100倍,同时也超越了商业MIP求解器。这些结果表明,CASSR能够为双足机器人实现快速、可靠和实时的足迹规划。
cs.RO / 31 / 2603.03024
MA-CoNav: A Master-Slave Multi-Agent Framework with Hierarchical Collaboration and Dual-Level Reflection for Long-Horizon Embodied VLN
MA-CoNav:一种具有层次协作和双层反思的主从多智能体框架,用于长时间跨度的具身视觉语言导航
Abstract
Vision-Language Navigation (VLN) aims to empower robots with the ability to perform long-horizon navigation in unfamiliar environments based on complex linguistic instructions. Its success critically hinges on establishing an efficient "language understanding - visual perception - embodied execution" closed loop. Existing methods often suffer from perceptual distortion and decision drift in complex, long-distance tasks due to the cognitive overload of a single agent. Inspired by distributed cognition theory, this paper proposes MA-CoNav, a Multi-Agent Collaborative Navigation framework. This framework adopts a "Master-Slave" hierarchical agent collaboration architecture, decoupling and distributing the perception, planning, execution, and memory functions required for navigation tasks to specialized agents. Specifically, the Master Agent is responsible for global orchestration, while the Subordinate Agent group collaborates through a clear division of labor: an Observation Agent generates environment descriptions, a Planning Agent performs task decomposition and dynamic verification, an Execution Agent handles simultaneous mapping and action, and a Memory Agent manages structured experiences. Furthermore, the framework introduces a "Local-Global" dual-stage reflection mechanism to dynamically optimize the entire navigation pipeline. Empirical experiments were conducted using a real-world indoor dataset collected by a Limo Pro robot, with no scene-specific fine-tuning performed on the models throughout the process. The results demonstrate that MA-CoNav comprehensively outperforms existing mainstream VLN methods across multiple metrics.
Chinese Translation
视觉语言导航(VLN)旨在赋予机器人在不熟悉环境中根据复杂语言指令进行长时间跨度导航的能力。其成功关键在于建立一个高效的“语言理解 - 视觉感知 - 具身执行”闭环。现有方法常因单一智能体的认知负荷而在复杂的长距离任务中遭受感知失真和决策漂移。受到分布式认知理论的启发,本文提出了MA-CoNav,一个多智能体协作导航框架。该框架采用“主从”层次智能体协作架构,将导航任务所需的感知、规划、执行和记忆功能解耦并分配给专门的智能体。具体而言,主智能体负责全局协调,而从属智能体组通过明确的分工进行协作:观察智能体生成环境描述,规划智能体进行任务分解和动态验证,执行智能体负责同步建图与行动,记忆智能体管理结构化经验。此外,该框架引入了“局部-全局”双阶段反思机制,以动态优化整个导航流程。实证实验采用了由Limo Pro机器人采集的真实室内数据集,整个过程中未对模型进行任何场景特定的微调。结果表明,MA-CoNav在多个指标上全面优于现有主流的VLN方法。
cs.RO / 32 / 2603.03052
Architectural HRI: Towards a Robotic Paradigm Shift in Human-Building Interaction
建筑人机交互:朝着人类与建筑互动的机器人范式转变
Abstract
Recent advances in sensing, communication, interfaces, control, and robotics are expanding Human-Building Interaction (HBI) beyond adaptive building services and facades toward the physical actuation of architectural space. In parallel, research in robotic furniture, swarm robotics, and shape-changing spaces shows that architectural elements can now be robotically augmented to move, reconfigure, and adapt space. We propose that these advances promise a paradigm shift in HBI, in which multiple building layers physically adapt in synchrony to support occupant needs and sustainability goals more holistically. Conversely, we argue that this emerging paradigm also provides an ideal case for transferring HRI knowledge to unconventional robotic morphologies, including the interpretation of the robot as multiple architectural layers or even as a building. However, this research agenda remains challenged by the temporal, spatial, and social complexity of architectural HRI, and by fragmented knowledge across HCI, environmental psychology, cognitive science, and architecture. We therefore call for interdisciplinary research that unifies the why, what, and how of robotic actuation in architectural forms.
Chinese Translation
近期在传感、通信、接口、控制和机器人技术方面的进展正在将人类与建筑互动(HBI)从适应性建筑服务和外立面扩展到建筑空间的物理驱动。与此同时,机器人家具、群体机器人和形状变化空间的研究表明,建筑元素现在可以通过机器人技术增强,具备移动、重新配置和适应空间的能力。我们提出,这些进展预示着HBI的范式转变,在这一范式中,多个建筑层次能够同步物理适应,以更全面地支持居住者的需求和可持续发展目标。相反,我们认为这一新兴范式也为将人机交互(HRI)知识转移到非常规机器人形态提供了理想案例,包括将机器人视为多个建筑层次或甚至作为一座建筑。然而,这一研究议程仍面临建筑人机交互的时间、空间和社会复杂性挑战,以及人机交互(HCI)、环境心理学、认知科学和建筑学之间知识的碎片化。因此,我们呼吁开展跨学科研究,以统一建筑形式中机器人驱动的“为什么”、“什么”和“如何”。
cs.RO / 33 / 2603.03067
CMoE: Contrastive Mixture of Experts for Motion Control and Terrain Adaptation of Humanoid Robots
CMoE:用于类人机器人运动控制和地形适应的对比专家混合模型
Abstract
For effective deployment in real-world environments, humanoid robots must autonomously navigate a diverse range of complex terrains with abrupt transitions. While the Vanilla mixture of experts (MoE) framework is theoretically capable of modeling diverse terrain features, in practice, the gating network exhibits nearly uniform expert activations across different terrains, weakening the expert specialization and limiting the model's expressive power. To address this limitation, we introduce CMoE, a novel single-stage reinforcement learning framework that integrates contrastive learning to refine expert activation distributions. By imposing contrastive constraints, CMoE maximizes the consistency of expert activations within the same terrain while minimizing their similarity across different terrains, thereby encouraging experts to specialize in distinct terrain types. We validated our approach on the Unitree G1 humanoid robot through a series of challenging experiments. Results demonstrate that CMoE enables the robot to traverse continuous steps up to 20 cm high and gaps up to 80 cm wide, while achieving robust and natural gait across diverse mixed terrains, surpassing the limits of existing methods. To support further research and foster community development, we release our code publicly.
Chinese Translation
为了在现实环境中有效部署,类人机器人必须能够自主导航多样化的复杂地形,并应对突发的过渡。尽管传统的专家混合模型(MoE)框架在理论上能够建模多样的地形特征,但在实践中,门控网络在不同地形上表现出几乎均匀的专家激活,削弱了专家的专业化,并限制了模型的表达能力。为了解决这一局限性,我们提出了CMoE,一种新颖的单阶段强化学习框架,结合对比学习来优化专家激活分布。通过施加对比约束,CMoE最大化同一地形内专家激活的一致性,同时最小化不同地形之间的相似性,从而鼓励专家在不同地形类型上进行专业化。我们通过一系列具有挑战性的实验在Unitree G1类人机器人上验证了我们的方法。结果表明,CMoE使机器人能够跨越高达20厘米的连续台阶和宽达80厘米的间隙,同时在多样化的混合地形上实现稳健而自然的步态,超越了现有方法的局限。为了支持进一步的研究和促进社区发展,我们公开发布了我们的代码。
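A minimal InfoNCE-style rendering of the contrastive constraint on gating vectors: pull the expert-activation distribution of a sample toward one from the same terrain, and push it away from other terrains'. The cosine similarity, the temperature `tau`, and the loss form are assumptions of this sketch; the paper's exact objective may differ:

```python
import math

def cosine(u, v):
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def contrastive_gate_loss(anchor, positive, negatives, tau=0.1):
    # InfoNCE-style contrastive constraint on gating (expert-activation)
    # vectors: `positive` comes from the same terrain as `anchor`, while
    # `negatives` come from other terrains. Minimizing this sharpens expert
    # specialization instead of near-uniform activations.
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))
```

The loss is small when same-terrain activations align and cross-terrain activations are dissimilar, which is exactly the specialization pressure the abstract describes.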
cs.RO / 34 / 2603.03137
RL-Based Coverage Path Planning for Deformable Objects on 3D Surfaces
基于强化学习的三维表面可变形物体覆盖路径规划
Abstract
Currently, manipulation tasks for deformable objects often focus on activities like folding clothes, handling ropes, and manipulating bags. However, research on contact-rich tasks involving deformable objects remains relatively underdeveloped. When humans use cloth or sponges to wipe surfaces, they rely on both vision and tactile feedback. Yet, current algorithms still face challenges with issues like occlusion, while research on tactile perception for manipulation is still evolving. Tasks such as covering surfaces with deformable objects demand not only perception but also precise robotic manipulation. To address this, we propose a method that leverages efficient and accessible simulators for task execution. Specifically, we train a reinforcement learning agent in a simulator to manipulate deformable objects for surface wiping tasks. We simplify the state representation of object surfaces using harmonic UV mapping, process contact feedback from the simulator on 2D feature maps, and use scaled grouped convolutions (SGCNN) to extract features efficiently. The agent then outputs actions in a reduced-dimensional action space to generate coverage paths. Experiments demonstrate that our method outperforms previous approaches in key metrics, including total path length and coverage area. We deploy these paths on a Kinova Gen3 manipulator to perform wiping experiments on the back of a torso model, validating the feasibility of our approach.
Chinese Translation
目前,针对可变形物体的操作任务通常集中在折叠衣物、处理绳索和操控袋子等活动上。然而,涉及可变形物体的接触丰富任务的研究仍然相对滞后。当人类使用布料或海绵擦拭表面时,他们依赖于视觉和触觉反馈。然而,当前算法在遮挡等问题上仍面临挑战,而针对操作的触觉感知研究仍在不断发展。使用可变形物体覆盖表面等任务不仅需要感知能力,还需要精确的机器人操作。为了解决这个问题,我们提出了一种利用高效且易于访问的模拟器进行任务执行的方法。具体而言,我们在模拟器中训练一个强化学习代理,以操控可变形物体进行表面擦拭任务。我们通过谐波 UV 映射简化物体表面的状态表示,在二维特征图上处理来自模拟器的接触反馈,并使用缩放分组卷积(SGCNN)高效提取特征。然后,代理在降维的动作空间中输出动作,以生成覆盖路径。实验表明,我们的方法在总路径长度和覆盖面积等关键指标上优于以往的方法。我们在 Kinova Gen3 机械臂上部署这些路径,在一个躯干模型的背部进行擦拭实验,验证了我们方法的可行性。
cs.RO / 35 / 2603.03138
Look Forward to Walk Backward: Efficient Terrain Memory for Backward Locomotion with Forward Vision
展望前行以实现倒退行走:基于前向视觉的倒退行走高效地形记忆框架
Abstract
Legged robots with egocentric forward-facing depth cameras can couple exteroception and proprioception to achieve robust forward agility on complex terrain. When these robots walk backward, the forward-only field of view provides no preview. Purely proprioceptive controllers can remain stable on moderate ground when moving backward, but cannot fully exploit the robot's capabilities on complex terrain and inevitably collide with obstacles. We present Look Forward to Walk Backward (LF2WB), an efficient terrain-memory locomotion framework that uses forward egocentric depth and proprioception to write a compact associative memory during forward motion and to retrieve it for collision-free backward locomotion without rearward vision. The memory backbone employs a delta-rule selective update that softly removes and then writes the memory state along the active subspace. Training uses hardware-efficient parallel computation, and deployment runs recurrent, constant-time per-step inference with a constant-size state, making the approach suitable for onboard processors on low-cost robots. Experiments in both simulations and real-world scenarios demonstrate the effectiveness of our method, improving backward agility across complex terrains under limited sensing.
Chinese Translation
配备自我中心前视深度摄像头的腿式机器人能够将外部感知与本体感知结合,以在复杂地形上实现稳健的前向灵活性。当这些机器人向后行走时,仅朝前的视野无法提供任何预览。纯粹依赖本体感知的控制器在中等地面上向后移动时可以保持稳定,但无法充分利用机器人在复杂地形上的能力,并且难免与障碍物发生碰撞。我们提出了展望前行以实现倒退行走(Look Forward to Walk Backward, LF2WB),这是一种高效的地形记忆运动框架,利用前向自我中心深度感知和本体感知在前行过程中写入紧凑的关联记忆,并在没有后视的情况下检索该记忆以实现无碰撞的倒退行走。该记忆骨干网络采用增量规则(delta-rule)选择性更新,沿着活跃子空间柔和地移除再写入记忆状态。训练使用硬件高效的并行计算,部署时以每步恒定时间进行循环推理,并保持恒定大小的状态,使该方法适合低成本机器人的机载处理器。模拟和现实场景中的实验证明了我们方法的有效性,在有限感知条件下提高了复杂地形上的倒退灵活性。
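The "delta-rule selective update" can be read as a classical delta-rule write to a linear associative memory: the value currently stored under a key is softly removed, then the new value is written along that key direction. A minimal sketch, with the memory shape, unit-key normalization, and write rate `beta` as illustrative assumptions:

```python
import numpy as np

def delta_rule_write(M, k, v, beta=1.0):
    """Delta-rule selective update of a linear associative memory.

    Softly removes the value currently associated with key k, then
    writes the new value v along that key direction:
        M <- M - beta * (M k) k^T + beta * v k^T
    M: (d_v, d_k) memory matrix; k: (d_k,) key; v: (d_v,) value.
    """
    k = k / np.linalg.norm(k)
    old = M @ k                        # value currently stored under k
    return M + beta * np.outer(v - old, k)

def read(M, k):
    """Retrieve the value associated with key k."""
    k = k / np.linalg.norm(k)
    return M @ k

# Toy check: after writing (k, v) with beta=1, reading k returns v exactly.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))
k = rng.normal(size=3)
v = rng.normal(size=4)
M2 = delta_rule_write(M, k, v)
```

With `beta < 1` the write becomes a soft blend of the old and new associations, which matches the "softly removes then writes" phrasing.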
cs.RO / 36 / 2603.03148
From Language to Action: Can LLM-Based Agents Be Used for Embodied Robot Cognition?
从语言到行动:基于大型语言模型的智能体能否用于具身机器人认知?
Abstract
In order to flexibly act in an everyday environment, a robotic agent needs a variety of cognitive capabilities that enable it to reason about plans and perform execution recovery. Large language models (LLMs) have been shown to demonstrate emergent cognitive aspects, such as reasoning and language understanding; however, the ability to control embodied robotic agents requires reliably bridging high-level language to low-level functionalities for perception and control. In this paper, we investigate the extent to which an LLM can serve as a core component for planning and execution reasoning in a cognitive robot architecture. For this purpose, we propose a cognitive architecture in which an agentic LLM serves as the core component for planning and reasoning, while components for working and episodic memories support learning from experience and adaptation. An instance of the architecture is then used to control a mobile manipulator in a simulated household environment, where environment interaction is done through a set of high-level tools for perception, reasoning, navigation, grasping, and placement, all of which are made available to the LLM-based agent. We evaluate our proposed system on two household tasks (object placement and object swapping), which evaluate the agent's reasoning, planning, and memory utilisation. The results demonstrate that the LLM-driven agent can complete structured tasks and exhibits emergent adaptation and memory-guided planning, but also reveal significant limitations, such as hallucinations about the task success and poor instruction following by refusing to acknowledge and complete sequential tasks. These findings highlight both the potential and challenges of employing LLMs as embodied cognitive controllers for autonomous robots.
Chinese Translation
为了在日常环境中灵活行动,机器人智能体需要具备多种认知能力,使其能够对计划进行推理并进行执行恢复。研究表明,大型语言模型(LLMs)展现出涌现的认知特征,如推理和语言理解;然而,控制具身机器人智能体需要可靠地将高层次语言与低层次的感知和控制功能连接起来。本文探讨了LLM在认知机器人架构中作为规划和执行推理核心组件的适用程度。为此,我们提出了一种认知架构,其中代理型LLM作为规划和推理的核心组件,而工作记忆和情节记忆组件支持从经验中学习和适应。然后,我们使用该架构的一个实例来控制模拟家庭环境中的一个移动操纵器,环境交互通过一组用于感知、推理、导航、抓取和放置的高层次工具进行,所有这些工具均可供基于LLM的智能体使用。我们在两个家庭任务(物体放置和物体交换)上评估了我们提出的系统,这些任务考察了智能体的推理、规划和记忆利用。结果表明,基于LLM的智能体能够完成结构化任务,并展现出涌现的适应性和记忆引导的规划,但也揭示了显著的局限性,例如对任务成功产生幻觉,以及因拒绝确认并完成顺序任务而导致的指令遵循不佳。这些发现突显了将LLM作为自主机器人的具身认知控制器的潜力和挑战。
cs.RO / 37 / 2603.03181
Robotic Grasping and Placement Controlled by EEG-Based Hybrid Visual and Motor Imagery
基于脑电图的混合视觉与运动想象控制的机器人抓取与放置
Abstract
We present a framework that integrates EEG-based visual and motor imagery (VI/MI) with robotic control to enable real-time, intention-driven grasping and placement. Motivated by the promise of BCI-driven robotics to enhance human-robot interaction, this system bridges neural signals with physical control by deploying offline-pretrained decoders in a zero-shot manner within an online streaming pipeline. This establishes a dual-channel intent interface that translates visual intent into robotic actions, with VI identifying objects for grasping and MI determining placement poses, enabling intuitive control over both what to grasp and where to place. The system operates solely on EEG via a cue-free imagery protocol, achieving integration and online validation. Implemented on a Base robotic platform and evaluated across diverse scenarios, including occluded targets or varying participant postures, the system achieves online decoding accuracies of 40.23% (VI) and 62.59% (MI), with an end-to-end task success rate of 20.88%. These results demonstrate that high-level visual cognition can be decoded in real time and translated into executable robot commands, bridging the gap between neural signals and physical interaction, and validating the flexibility of a purely imagery-based BCI paradigm for practical human-robot collaboration.
Chinese Translation
我们提出了一个框架,将基于脑电图的视觉与运动想象(VI/MI)与机器人控制相结合,以实现实时、基于意图的抓取与放置。该系统的动机源于脑机接口(BCI)驱动的机器人技术在增强人机交互方面的潜力,通过在在线流式处理管道中以零样本方式部署离线预训练的解码器,架起了神经信号与物理控制之间的桥梁。这建立了一个双通道意图接口,将视觉意图转化为机器人动作,其中VI用于识别抓取对象,MI用于确定放置姿势,从而实现对抓取对象及放置位置的直观控制。该系统完全基于脑电图,通过无提示的想象协议进行操作,实现了集成和在线验证。该系统在一个基础机器人平台上实施,并在包括遮挡目标或参与者姿势变化等多种场景中进行评估,实现了40.23%(VI)和62.59%(MI)的在线解码准确率,端到端任务成功率为20.88%。这些结果表明,高级视觉认知可以实时解码并转化为可执行的机器人指令,弥合了神经信号与物理交互之间的差距,并验证了纯基于想象的BCI范式在实际人机协作中的灵活性。
cs.RO / 38 / 2603.03198
ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments
ACE-Brain-0:空间智能作为通用具身的共享支架
Gong, Ziyang, Luo, Zehang, Tang, Anke, Liu, Zhe, Fu, Shi, Hou, Zhi, Yang, Ganlin, Wang, Weiyun, Wang, Xiaofeng, Liu, Jianbo, Luo, Gen, Kang, Haolan, Luo, Shuang, Zhou, Yue, Luo, Yong, Shen, Li, Jia, Xiaosong, Mu, Yao, Yang, Xue, Liu, Chunxiao, Yan, Junchi, Zhao, Hengshuang, Tao, Dacheng, Wang, Xiaogang
Abstract
Universal embodied intelligence demands robust generalization across heterogeneous embodiments, such as autonomous driving, robotics, and unmanned aerial vehicles (UAVs). However, for existing embodied brains, training a unified model over diverse embodiments frequently triggers long-tail data imbalance, gradient interference, and catastrophic forgetting, making it notoriously difficult to balance universal generalization with domain-specific proficiency. In this report, we introduce ACE-Brain-0, a generalist foundation brain that unifies spatial reasoning, autonomous driving, and embodied manipulation within a single multimodal large language model (MLLM). Our key insight is that spatial intelligence serves as a universal scaffold across diverse physical embodiments: although vehicles, robots, and UAVs differ drastically in morphology, they share a common need for modeling 3D mental space, making spatial cognition a natural, domain-agnostic foundation for cross-embodiment transfer. Building on this insight, we propose the Scaffold-Specialize-Reconcile (SSR) paradigm, which first establishes a shared spatial foundation, then cultivates domain-specialized experts, and finally harmonizes them through data-free model merging. Furthermore, we adopt Group Relative Policy Optimization (GRPO) to strengthen the model's comprehensive capability. Extensive experiments demonstrate that ACE-Brain-0 achieves competitive and even state-of-the-art performance across 24 spatial and embodiment-related benchmarks.
Chinese Translation
通用具身智能要求在异构具身(例如自动驾驶、机器人和无人机(UAV))之间实现强大的泛化能力。然而,对于现有的具身大脑,在多样化具身上训练统一模型往往会引发长尾数据失衡、梯度干扰和灾难性遗忘,使得平衡通用泛化与领域特定能力变得极为困难。在本报告中,我们介绍了ACE-Brain-0,这是一种通用基础大脑,将空间推理、自动驾驶和具身操作统一在一个多模态大语言模型(MLLM)中。我们的关键见解是,空间智能可作为不同物理具身之间的通用支架:尽管车辆、机器人和无人机在形态上差异巨大,但它们在建模三维心理空间方面有着共同的需求,使空间认知成为跨具身迁移的自然、领域无关的基础。基于这一见解,我们提出了支架-专业化-协调(SSR)范式:首先建立共享的空间基础,然后培养领域专业化的专家,最后通过无数据模型合并将它们协调起来。此外,我们采用群体相对策略优化(GRPO)来增强模型的综合能力。大量实验表明,ACE-Brain-0在24个与空间和具身相关的基准测试中取得了具有竞争力甚至最先进的性能。
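The abstract does not specify the data-free merging scheme, so as an illustration here is one common instance, task-vector merging, where each domain expert contributes its weight delta relative to the shared base. All names and the scaling factor `alpha` are assumptions, not the paper's method:

```python
import numpy as np

def task_arithmetic_merge(base, experts, alpha=0.5):
    """Data-free merging of domain experts fine-tuned from a shared base.

    Each expert contributes a "task vector" (expert - base); the merged
    model adds a scaled sum of task vectors to the base weights, with no
    training data required.
    base / experts: dicts mapping parameter name -> np.ndarray.
    """
    merged = {}
    for name, w in base.items():
        task_vecs = [e[name] - w for e in experts]
        merged[name] = w + alpha * np.sum(task_vecs, axis=0)
    return merged

# Toy check with two single-parameter "experts" specialized on
# disjoint dimensions: merging preserves both specializations.
base = {"w": np.zeros(3)}
e1 = {"w": np.array([1.0, 0.0, 0.0])}
e2 = {"w": np.array([0.0, 2.0, 0.0])}
m = task_arithmetic_merge(base, [e1, e2], alpha=1.0)
```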
cs.RO / 39 / 2603.03243
HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
HoMMI:从人类示范中学习全身移动操控
Abstract
We present Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. We augment UMI interfaces with egocentric sensing to capture the global context required for mobile manipulation, enabling portable, robot-free, and scalable data collection. However, naively incorporating egocentric sensing introduces a larger human-to-robot embodiment gap in both observation and action spaces, making policy transfer difficult. We explicitly bridge this gap with a cross-embodiment hand-eye policy design, including an embodiment agnostic visual representation; a relaxed head action representation; and a whole-body controller that realizes hand-eye trajectories through coordinated whole-body motion under robot-specific physical constraints. Together, these enable long-horizon mobile manipulation tasks requiring bimanual and whole-body coordination, navigation, and active perception. Results are best viewed on: https://hommi-robot.github.io
Chinese Translation
我们提出了全身移动操控接口(HoMMI),这是一个数据收集和策略学习框架,能够直接从无需机器人的人类示范中学习全身移动操控。我们通过自我中心感知增强了UMI接口,以捕捉移动操控所需的全局上下文,从而实现便携、无需机器人且可扩展的数据收集。然而,简单地引入自我中心感知会在观察和动作空间中带来更大的人与机器人之间的具身差距,使得策略迁移变得困难。我们通过跨具身的手眼策略设计明确地弥合了这一差距,其中包括一种与具身无关的视觉表示、一种放松的头部动作表示,以及一个在特定机器人物理约束下通过协调全身运动来实现手眼轨迹的全身控制器。这些共同使得需要双手与全身协调、导航和主动感知的长时程移动操控任务成为可能。结果最佳查看地址为:https://hommi-robot.github.io
cs.RO / 40 / 2603.03278
Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping
Tether:基于对应驱动的轨迹扭曲的自主功能性游戏
Abstract
The ability to conduct and learn from interaction and experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (<=10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is extremely data-efficient and robust even under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. This produces a stream of data that consistently improves the performance of closed-loop imitation policies over time, ultimately yielding over 1000 expert-level trajectories and training policies competitive with those learned from human-collected demonstrations.
Chinese Translation
进行交互并从交互与经验中学习的能力是机器人技术中的一个核心挑战,为劳动密集型的人类示范提供了一种可扩展的替代方案。然而,实现这种"游戏"需要(1)一个对多样化、潜在分布外环境状态具有鲁棒性的策略,以及(2)一个持续产生有用机器人经验的过程。为了解决这些挑战,我们提出了Tether,这是一种涉及结构化、任务导向交互的自主功能性游戏方法。首先,我们设计了一种新颖的开环策略,通过将一小组源示范(<=10条)中的动作锚定到目标场景中的语义关键点对应关系来对其进行扭曲。我们展示了这一设计即使在显著的空间和语义变化下也极为数据高效且鲁棒。其次,我们在视觉-语言模型的视觉理解能力引导下,通过任务选择、执行、评估和改进的持续循环,将该策略部署到现实世界中的自主功能性游戏中。该过程在极少的人类干预下生成了多样化、高质量的数据集。在类似家庭的多物体设置中,我们的方法首次在现实世界中仅从少量示范开始进行了数小时的自主多任务游戏。这产生了一个数据流,随着时间的推移持续改善闭环模仿策略的性能,最终产生了超过1000条专家级轨迹,并训练出可与从人类收集的示范中学习的策略相媲美的策略。
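Warping demonstrated actions via keypoint correspondences can be sketched with a rigid Kabsch fit from matched source/target keypoints; this rigid-only, three-point-cloud version is a simplification of the paper's correspondence-driven warping, and all variable names are illustrative:

```python
import numpy as np

def kabsch(P, Q):
    """Rigid transform (R, t) minimizing ||R P_i + t - Q_i|| over
    correspondences. P, Q: (N, 3) matched keypoints in the source
    demonstration scene and the target scene."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)              # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

def warp_trajectory(traj, R, t):
    """Map a demonstrated end-effector path into the target scene."""
    return traj @ R.T + t

# Toy check: recover a known rotation + translation from 4 matches.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.2, -0.1, 0.05])
P = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
Q = P @ R_true.T + t_true
R, t = kabsch(P, Q)
```

In this reading, the semantic keypoints supply the correspondences `(P, Q)`, and the fitted transform carries the demonstrated path into the novel scene.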
cs.RO / 41 / 2603.03279
ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation
ULTRA:统一的多模态控制用于自主类人机器人全身运动操控
Abstract
Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.
Chinese Translation
实现自主且多功能的全身运动操控仍然是使类人机器人在实际应用中有用的一个主要障碍。然而,现有的方法在根本上受到限制:重新定向的数据往往稀缺或质量低下;方法难以扩展到大规模技能库;最重要的是,它们依赖于跟踪预定义的运动参考,而不是从感知和高层任务规范中生成行为。为了解决这些局限性,我们提出了ULTRA,一个具有两个关键组件的统一框架。首先,我们引入了一种物理驱动的神经重定向算法,将大规模运动捕捉转换为类人机器人具身,同时在接触丰富的交互中保持物理合理性。其次,我们学习了一个统一的多模态控制器,支持稠密参考和稀疏任务规范,其感知范围从精确的运动捕捉状态到嘈杂的自我中心视觉输入。我们将一个通用跟踪策略蒸馏到这个控制器中,将运动技能压缩到一个紧凑的潜在空间,并应用强化学习微调以扩展覆盖范围并提高在分布外场景下的鲁棒性。这使得无需测试时的参考运动即可从稀疏意图中实现协调的全身行为。我们在仿真和真实的Unitree G1类人机器人上评估了ULTRA。结果表明,ULTRA能够从自我中心感知推广到自主、目标条件的全身运动操控,始终优于技能有限的仅跟踪基线方法。
cs.RO / 42 / 2603.03280
How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
如何用刀剥皮:将细粒度操作与人类偏好对齐
Abstract
Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.
Chinese Translation
许多基本的操作任务,如食品准备、外科手术和工艺制作,对于自主机器人仍然难以处理。这些任务不仅具有接触丰富、对力敏感的动态特征,还具有"隐性"的成功标准:与简单的取放任务不同,这些领域的任务质量是连续且主观的(例如,土豆剥皮的效果),这使得定量评估和奖励工程变得困难。我们提出了一种针对此类任务的学习框架,以用刀剥皮为代表性示例。我们的方法遵循两阶段流程:首先,通过力感知的数据采集和模仿学习,学习一个稳健的初始策略,从而实现对物体变化的泛化;其次,使用结合定量任务指标和定性人类反馈的学习奖励模型进行基于偏好的微调,使策略行为与人类对任务质量的认知对齐。仅使用50-200条剥皮轨迹,我们的系统在包括黄瓜、苹果和土豆等具有挑战性的农产品上实现了超过90%的平均成功率,并通过基于偏好的微调将性能提高了多达40%。值得注意的是,在单一农产品类别上训练的策略对未见的同类别实例以及来自不同类别的分布外农产品表现出强大的零样本泛化能力,同时保持超过90%的成功率。
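The preference side of the learned reward model can be illustrated with a Bradley-Terry pairwise loss, a standard choice for fitting scalar rewards to human comparisons. The exact objective combining quantitative metrics with qualitative feedback is not given in the abstract, so this is a generic sketch:

```python
import numpy as np

def preference_loss(r_preferred, r_rejected):
    """Bradley-Terry negative log-likelihood for pairwise preferences.

    r_preferred / r_rejected: (N,) scalar rewards assigned by the reward
    model to the human-preferred and rejected trajectory of each pair.
    Loss = -mean log sigmoid(r_preferred - r_rejected).
    """
    margin = r_preferred - r_rejected
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-margin))))

# A reward model that ranks preferred trajectories higher gets a lower
# loss than one that ranks them lower; minimizing this aligns rewards
# with human notions of task quality.
good = preference_loss(np.array([2.0, 1.5]), np.array([0.0, 0.5]))
bad = preference_loss(np.array([0.0, 0.5]), np.array([2.0, 1.5]))
```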
cs.CV / 1 / 2603.02256
CamDirector: Towards Long-Term Coherent Video Trajectory Editing
CamDirector:面向长期一致的视频轨迹编辑
Abstract
Video (camera) trajectory editing aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework that 1) explicitly aggregates information across the entire source video via a hybrid warping scheme. Specifically, static regions are progressively fused into a world cache then rendered to target camera poses, while dynamic regions are directly warped; their fusion yields globally consistent coarse frames that guide refinement. 2) processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.
Chinese Translation
视频(摄像机)轨迹编辑旨在合成遵循用户定义的摄像机路径的新视频,同时保留场景内容并合理地填补先前未见区域,将业余录像升级为专业风格的视频。现有的视频轨迹编辑(VTE)方法在精确的摄像机控制和长程一致性方面存在困难,因为它们要么通过有限容量的嵌入注入目标姿态,要么依赖单帧变形,仅靠视频扩散模型中的隐式跨帧聚合。为了解决这些问题,我们提出了一种新的VTE框架:1)通过混合变形方案显式聚合整个源视频的信息。具体而言,静态区域逐步融合到世界缓存中,然后渲染到目标摄像机姿态,而动态区域则直接进行变形;它们的融合产生全局一致的粗略帧,以指导后续的细化。2)通过历史引导的自回归扩散模型联合处理视频片段及其历史,同时逐步更新世界缓存,以强化已填补内容,从而实现长期的时间一致性。最后,我们提出了iPhone-PTZ,一个具有多样化摄像机运动和大轨迹变化的新VTE基准,并以更少的参数实现了最先进的性能。
cs.CV / 2 / 2603.02263
Social-JEPA: Emergent Geometric Isomorphism
Social-JEPA:涌现的几何同构
Abstract
World models compress rich sensory streams into compact latent codes that anticipate future observations. We let separate agents acquire such models from distinct viewpoints of the same environment without any parameter sharing or coordination. After training, their internal representations exhibit a striking emergent property: the two latent spaces are related by an approximate linear isometry, enabling transparent translation between them. This geometric consensus survives large viewpoint shifts and scant overlap in raw pixels. Leveraging the learned alignment, a classifier trained on one agent can be ported to the other with no additional gradient steps, while distillation-like migration accelerates later learning and markedly reduces total compute. The findings reveal that predictive learning objectives impose strong regularities on representation geometry, suggesting a lightweight path to interoperability among decentralized vision systems. The code is available at https://anonymous.4open.science/r/Social-JEPA-5C57.
Chinese Translation
世界模型将丰富的感官流压缩为紧凑的潜在编码,以预测未来的观察。我们让不同的代理从同一环境的不同视角获取这样的模型,而不进行任何参数共享或协调。训练后,它们的内部表征展现出一种显著的涌现特性:两个潜在空间通过近似线性等距关系相互关联,使得它们之间的透明转换成为可能。这种几何共识在大视角变化和原始像素重叠稀少的情况下依然存在。利用学习到的对齐,在一个代理上训练的分类器无需额外的梯度步骤即可移植到另一个代理上,而类似蒸馏的迁移加速了后续学习,并显著减少了总计算量。研究结果揭示了预测学习目标对表征几何施加了强有力的规律性,暗示了在去中心化视觉系统之间实现互操作性的轻量路径。代码可在 https://anonymous.4open.science/r/Social-JEPA-5C57 获取。
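An "approximate linear isometry" between two latent spaces can be recovered with the orthogonal Procrustes solution. A minimal sketch, assuming paired latent codes `Za`, `Zb` of the same inputs from the two agents (the alignment procedure the paper uses is not specified in the abstract):

```python
import numpy as np

def fit_isometry(Za, Zb):
    """Least-squares orthogonal map W (W^T W = I) with Za @ W ~= Zb.

    Za, Zb: (N, d) latent codes of the same inputs from two agents.
    This is the classical orthogonal Procrustes problem, solved via
    the SVD of the cross-covariance Za^T Zb.
    """
    U, _, Vt = np.linalg.svd(Za.T @ Zb)
    return U @ Vt

# Toy check: two "agents" whose codes differ by a random orthogonal
# transform are aligned exactly, so classifiers could be ported across.
rng = np.random.default_rng(1)
Za = rng.normal(size=(100, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
Zb = Za @ Q
W = fit_isometry(Za, Zb)
```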
cs.CV / 3 / 2603.02270
From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification
从视觉到多模态:动物识别中编码器和融合策略的系统性消融研究
Abstract
Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach utilizes a gated fusion mechanism and achieved a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification.
Chinese Translation
自动化动物识别是帮助走失宠物与主人团聚的一项实际任务,但现有系统常因数据集规模有限和依赖单模态视觉线索而表现不佳。本研究提出了一种多模态验证框架,利用由合成文本描述衍生的语义身份先验来增强视觉特征。为支持本研究,我们构建了一个包含190万张照片、覆盖695,091只不同动物的大规模训练语料库。通过系统性消融研究,我们确定SigLIP2-Giant和E5-Small-v2为最佳的视觉和文本骨干网络。我们进一步评估了从简单拼接到自适应门控的多种融合策略,以确定整合这些模态的最佳方法。我们提出的方法采用门控融合机制,在全面测试协议上达到了84.28%的Top-1准确率和0.0422的等错误率。这些结果比领先的单模态基线提高了11%,并表明整合合成的语义描述能显著改善大规模宠物重识别中的决策边界。
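A gated fusion of vision and text embeddings, the mechanism the abstract selects over plain concatenation, can be sketched as follows. The gate parameterization (`Wg`, `bg`, a sigmoid over the concatenated embeddings) is an illustrative assumption, not the paper's exact module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(v_emb, t_emb, Wg, bg):
    """Adaptive gated fusion of vision and text embeddings.

    A learned gate g in (0,1)^d decides, per dimension, how much of the
    visual vs. textual evidence enters the fused identity embedding:
        g = sigmoid(Wg [v; t] + bg),  fused = g * v + (1 - g) * t
    """
    g = sigmoid(Wg @ np.concatenate([v_emb, t_emb]) + bg)
    return g * v_emb + (1.0 - g) * t_emb

# Toy check: a strongly open gate passes mostly visual evidence,
# a strongly closed gate passes mostly textual evidence.
d = 4
v = np.ones(d)
t = -np.ones(d)
Wg = np.zeros((d, 2 * d))
fused_open = gated_fusion(v, t, Wg, bg=10.0 * np.ones(d))    # g ~ 1
fused_closed = gated_fusion(v, t, Wg, bg=-10.0 * np.ones(d))  # g ~ 0
```

Unlike concatenation, the gate lets the model down-weight an unreliable modality per instance, which is one plausible reason adaptive gating outperformed simple concatenation in the ablations.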
cs.CV / 4 / 2603.02286
Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection
超越提示退化:原型引导的双池提示用于增量目标检测
Abstract
Incremental Object Detection (IOD) aims to continuously learn new object categories without forgetting previously learned ones. Recently, prompt-based methods have gained popularity for their replay-free design and parameter efficiency. However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt-decoupled framework called PDP. PDP innovatively designs a dual-pool prompt decoupling paradigm, which consists of a shared pool used to capture task-general knowledge for forward transfer, and a private pool used to learn task-specific discriminative features. This paradigm explicitly separates task-general and task-specific prompts, preventing interference between prompts and mitigating prompt coupling. In addition, to counteract prompt drift resulting from inconsistent supervision where old foreground objects are treated as background in subsequent tasks, PDP introduces a Prototypical Pseudo-Label Generation (PPG) module. PPG can dynamically update the class prototype space during training and use the class prototypes to further filter valuable pseudo-labels, maintaining supervisory signal consistency throughout the incremental process. PDP achieves state-of-the-art performance on MS-COCO (with a 9.2% AP improvement) and PASCAL VOC (with a 3.3% AP improvement) benchmarks, highlighting its potential in balancing stability and plasticity. The code and dataset are released at: https://github.com/zyt95579/PDP_IOD/tree/main
Chinese Translation
增量目标检测(IOD)旨在不断学习新目标类别,而不遗忘之前学习的类别。近年来,基于提示的方法因其无重放设计和参数效率而受到广泛关注。然而,由于提示耦合和提示漂移,这些方法在持续适应过程中常常遭遇提示退化。为了解决这些问题,我们提出了一种新颖的提示解耦框架,称为PDP。PDP创新性地设计了一个双池提示解耦范式,该范式由一个共享池和一个私有池组成,前者用于捕捉任务通用知识以实现前向迁移,后者用于学习任务特定的区分特征。该范式明确区分了任务通用和任务特定的提示,防止了提示之间的干扰,并减轻了提示耦合。此外,为了抵消由于不一致的监督导致的提示漂移,即在后续任务中将旧的前景物体视为背景,PDP引入了原型伪标签生成(PPG)模块。PPG可以在训练过程中动态更新类别原型空间,并利用类别原型进一步过滤有价值的伪标签,从而在增量过程中保持监督信号的一致性。PDP在MS-COCO(AP提升9.2%)和PASCAL VOC(AP提升3.3%)基准测试中实现了最先进的性能,突显了其在平衡稳定性和可塑性方面的潜力。代码和数据集已发布在:https://github.com/zyt95579/PDP_IOD/tree/main
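The PPG idea of maintaining class prototypes and using them to filter pseudo-labels can be sketched as follows. The EMA prototype update, the cosine-similarity threshold, and all parameter values are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def update_prototypes(protos, feats, labels, momentum=0.9):
    """EMA update of per-class prototype vectors during training."""
    for f, y in zip(feats, labels):
        protos[y] = momentum * protos[y] + (1.0 - momentum) * f
    return protos

def filter_pseudo_labels(protos, feats, labels, thresh=0.8):
    """Keep a pseudo-label only if its feature lies close (cosine
    similarity >= thresh) to the prototype of its predicted class,
    rejecting drifted or mislabeled detections."""
    keep = []
    for i, (f, y) in enumerate(zip(feats, labels)):
        p = protos[y]
        cos = f @ p / (np.linalg.norm(f) * np.linalg.norm(p))
        if cos >= thresh:
            keep.append(i)
    return keep

# Toy check: a detection whose feature matches its class prototype is
# kept; one whose feature matches a different class is filtered out.
protos = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
feats = np.array([[0.9, 0.1], [0.1, 0.9]])
labels = [0, 0]   # the second pseudo-label is wrong
kept = filter_pseudo_labels(protos, feats, labels)
```

Filtering old-class pseudo-labels this way restores supervision for foreground objects that later tasks would otherwise treat as background.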
cs.CV / 5 / 2603.02288
AutoFFS: Adversarial Deformations for Facial Feminization Surgery Planning
AutoFFS:用于面部女性化手术规划的对抗性形变
Abstract
Facial feminization surgery (FFS) is a key component of gender affirmation for transgender and gender diverse patients, aiming to reshape craniofacial structures toward a female morphology. Current surgical planning procedures largely rely on subjective clinical assessment, lacking quantitative and reproducible anatomical guidance. We therefore propose AutoFFS, a novel data-driven framework that generates counterfactual skull morphologies through adversarial free-form deformations. Our method performs a deformation-based targeted adversarial attack on an ensemble of pre-trained binary sex classifiers that learned sexual dimorphism, effectively transforming individual skull shapes toward the target sex. The generated counterfactual skull morphologies provide a quantitative foundation for preoperative planning in FFS, driving advances in this largely overlooked patient group. We validate our approach through classifier-based evaluation and a human perceptual study, confirming that the generated morphologies exhibit target sex characteristics.
Chinese Translation
面部女性化手术(FFS)是变性人和性别多样性患者性别确认的重要组成部分,旨在重塑颅面结构以达到女性形态。目前的手术规划程序主要依赖于主观的临床评估,缺乏定量和可重复的解剖指导。因此,我们提出了AutoFFS,这是一种新颖的数据驱动框架,通过对抗性自由形变生成反事实颅骨形态。我们的方法对一组预训练的二元性别分类器进行基于形变的针对性对抗攻击,这些分类器学习了性别二态性,有效地将个体颅骨形状转变为目标性别。生成的反事实颅骨形态为FFS的术前规划提供了定量基础,推动了这一被忽视患者群体的进步。我们通过分类器评估和人类感知研究验证了我们的方法,确认生成的形态展现了目标性别特征。
cs.CV / 6 / 2603.02329
HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
HAMMER:通过跨模态集成利用多模态大语言模型进行意图驱动的3D可供性定位
Abstract
Humans commonly identify 3D object affordance through observed interactions in images or videos, and once formed, such knowledge can be generically generalized to novel objects. Inspired by this principle, we advocate for a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we alternatively aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches. All code and weights are publicly available.
Chinese Translation
人类通常通过观察图像或视频中的交互来识别3D物体的可供性,这种知识一旦形成,便可以普遍推广到新物体。受到这一原则的启发,我们倡导一种新颖的框架,利用新兴的多模态大语言模型(MLLM)进行交互意图驱动的3D可供性定位,称为HAMMER。我们不是生成明确的物体属性描述或依赖现成的2D分割器,而是将图像中描绘的交互意图聚合为一种接触感知的嵌入,并引导模型推断文本可供性标签,确保其充分挖掘物体语义和上下文线索。我们进一步设计了一种分层跨模态集成机制,以充分利用MLLM提供的互补信息进行3D表示的细化,并引入一个多粒度几何提升模块,将空间特征注入提取的意图嵌入中,从而促进准确的3D可供性定位。在公共数据集和我们新构建的损坏基准上的大量实验表明,HAMMER相比现有方法具有优越性和鲁棒性。所有代码和权重均已公开。
cs.CV / 7 / 2603.02351
MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
MERG3R:一种分而治之的大规模神经视觉几何方法
Abstract
Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry models. Across large-scale datasets, including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.
Chinese Translation
最近在神经视觉几何方面的进展,包括基于变换器的模型如VGGT和Pi3,在3D重建任务中取得了令人印象深刻的准确性。然而,它们对全注意力机制的依赖使其在根本上受到GPU内存容量的限制,无法扩展到大型无序图像集合。我们提出了MERG3R,一种无需训练的分而治之框架,使几何基础模型能够远超其原生内存限制进行操作。MERG3R首先将无序图像重新排序并划分为重叠的、几何多样的子集,这些子集可以独立重建。然后,通过高效的全局对齐和置信度加权的捆绑调整程序合并生成的局部重建,产生一个全局一致的3D模型。我们的框架是模型无关的,可以与现有的神经几何模型配对。在包括7-Scenes、NRGBD、Tanks & Temples以及Cambridge Landmarks等大规模数据集上,MERG3R始终提高了重建准确性、内存效率和可扩展性,使得在数据集超过内存容量限制时仍能实现高质量重建。
cs.CV / 8 / 2603.02363
Beyond Caption-Based Queries for Video Moment Retrieval
超越基于字幕的查询进行视频时刻检索
Abstract
In this work, we investigate the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries. For this, we introduce three benchmarks by modifying the textual queries in three public VMR datasets -- i.e., HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) A language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures -- an active decoder-query collapse -- as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and up to 21.83% mAP_m on multi-moment search queries. The code, models and data are available in the project webpage: https://davidpujol.github.io/beyond-vmr/
Chinese Translation
在本研究中,我们探讨了现有视频时刻检索(VMR)方法在以字幕为基础的查询上训练但在搜索查询上评估时的性能退化,尤其是DETR架构。为此,我们通过修改三个公共VMR数据集(即HD-EPIC、YouCook2和ActivityNet-Captions)中的文本查询,构建了三个基准。我们的分析揭示了两个关键的泛化挑战:(i)语言差距,源于搜索查询在语言上的欠规范;以及(ii)多时刻差距,由从单时刻查询转向多时刻查询引起。我们还识别出这些架构中的一个关键问题——主动解码器查询崩溃——这是对多时刻实例泛化不良的主要原因。我们通过架构修改来缓解此问题,有效增加主动解码器查询的数量。大量实验表明,我们的方法在搜索查询上的性能提高了最高14.82%的mAP_m,在多时刻搜索查询上提高了最高21.83%的mAP_m。代码、模型和数据可在项目网页上获取:https://davidpujol.github.io/beyond-vmr/
cs.CV / 9 / 2603.02367
Retrieving Patient-Specific Radiomic Feature Sets for Transparent Knee MRI Assessment
检索患者特异性放射组学特征集以实现透明的膝关节MRI评估
Abstract
Classical radiomic features are designed to quantify image appearance and intensity patterns. Compared with end-to-end deep learning (DL) models trained for disease classification, radiomics pipelines with low-dimensional parametric classifiers offer enhanced transparency and interpretability, yet often underperform because of the reliance on population-level predefined feature sets. Recent work on adaptive radiomics uses DL to predict feature weights over a radiomic pool, then thresholds these weights to retain the top-k features from a large radiomic pool F (often ~10^3). However, such marginal ranking can over-admit redundant descriptors and overlook complementary feature interactions. We propose a patient-specific feature-set selection framework that predicts a single compact feature set per subject, targeting complementary and diverse evidence rather than marginal top-k features. To overcome the intractable combinatorial search space of F choose k features, our method utilizes a two-stage retrieval strategy: randomly sample diverse candidate feature sets, then rank these sets with a learned scoring function to select a high-performing feature set for the specific patient. The system consists of a feature-set scorer and a classifier that performs the final diagnosis. We empirically show that the proposed two-stage retrieval approximates exhaustive search over all k-feature sets. Validated on tasks including ACL tear detection and KL grading for osteoarthritis, the experimental results achieve strong diagnostic performance, outperforming the top-k approach with the same k values, and competitive with end-to-end DL models while maintaining high transparency. The model generates auditable feature sets that link clinical outcomes to specific anatomical regions and radiomic families, allowing clinicians to inspect which anatomical structures and quantitative descriptors drive the prediction.
Chinese Translation
经典的放射组学特征旨在量化图像外观和强度模式。与为疾病分类训练的端到端深度学习(DL)模型相比,采用低维参数分类器的放射组学管道提供了更高的透明度和可解释性,但由于依赖于人口层面的预定义特征集,往往表现不佳。最近的自适应放射组学研究利用深度学习预测放射组学池中的特征权重,然后对这些权重进行阈值处理,以保留来自大型放射组学池F(通常约为10^3)的前k个特征。然而,这种边际排名可能会过度接纳冗余描述符,并忽视互补特征之间的交互。我们提出了一种患者特异性特征集选择框架,该框架为每个受试者预测一个紧凑的特征集,目标是获取互补和多样化的证据,而不是边际的前k个特征。为了解决从F中选择k个特征的不可处理组合搜索空间,我们的方法采用了两阶段检索策略:随机抽样多样化的候选特征集,然后使用学习的评分函数对这些特征集进行排名,以选择适合特定患者的高性能特征集。该系统由特征集评分器和执行最终诊断的分类器组成。我们通过实验证明,所提出的两阶段检索方法近似于原始的穷举所有k特征选择。在包括ACL撕裂检测和骨关节炎KL分级等任务的验证中,实验结果显示出诊断性能,超越了具有相同k值的前k方法,并与端到端深度学习模型竞争,同时保持高透明度。该模型生成可审核的特征集,将临床结果与特定解剖区域和放射组学家族联系起来,使临床医生能够检查哪些解剖结构和定量描述符驱动预测。
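The two-stage retrieval above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `score_set` stands in for the learned feature-set scorer, and radiomic features are reduced to integer indices into a pool of size `pool_size`.

```python
import random

def two_stage_retrieval(pool_size, k, num_candidates, score_set, seed=0):
    """Stage 1: randomly sample diverse candidate k-feature sets from the pool.
    Stage 2: rank candidates with a (learned) scoring function and keep the
    best-scoring set, sidestepping the exhaustive |F|-choose-k search."""
    rng = random.Random(seed)
    candidates = {tuple(sorted(rng.sample(range(pool_size), k)))
                  for _ in range(num_candidates)}
    return max(candidates, key=score_set)
```

With `num_candidates` far below the number of k-subsets, the quality of the result rests entirely on the scorer's ranking, which is why the paper trains it jointly with the downstream classifier.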
cs.CV / 10 / 2603.02370
Cultural Counterfactuals: Evaluating Cultural Biases in Large Vision-Language Models with Counterfactual Examples
文化反事实:利用反事实示例评估大型视觉-语言模型中的文化偏见
Abstract
Large Vision-Language Models (LVLMs) have grown increasingly powerful in recent years, but can also exhibit harmful biases. Prior studies investigating such biases have primarily focused on demographic traits related to the visual characteristics of a person depicted in an image, such as their race or gender. This has left biases related to cultural differences (e.g., religion, socioeconomic status), which cannot be readily discerned from an individual's appearance alone, relatively understudied. A key challenge in measuring cultural biases is that determining which group an individual belongs to often depends upon cultural context cues in images, and datasets annotated with cultural context cues are lacking. To address this gap, we introduce Cultural Counterfactuals: a high-quality synthetic dataset containing nearly 60k counterfactual images for measuring cultural biases related to religion, nationality, and socioeconomic status. To ensure that cultural contexts are accurately depicted, we generate our dataset using an image-editing model to place people of different demographics into real cultural context images. This enables the construction of counterfactual image sets which depict the same person in multiple different contexts, allowing for precise measurement of the impact that cultural context differences have on LVLM outputs. We demonstrate the utility of Cultural Counterfactuals for quantifying cultural biases in popular LVLMs.
Chinese Translation
近年来,大型视觉-语言模型(LVLMs)变得越来越强大,但也可能表现出有害的偏见。先前研究主要集中于与图像中人物的视觉特征相关的人口特征(如种族或性别)的偏见。这使得与文化差异(例如宗教、社会经济地位)相关的偏见相对缺乏研究,而这些偏见无法仅通过个体的外貌轻易辨别。衡量文化偏见的一个关键挑战在于,确定个体所属的群体往往依赖于图像中的文化背景线索,而缺乏带有文化背景线索的标注数据集。为了解决这一问题,我们引入了文化反事实(Cultural Counterfactuals):一个高质量的合成数据集,包含近60,000个反事实图像,用于测量与宗教、国籍和社会经济地位相关的文化偏见。为了确保文化背景的准确呈现,我们使用图像编辑模型生成数据集,将不同人口特征的人物置入真实的文化背景图像中。这使得构建反事实图像集成为可能,能够在多个不同的背景中描绘同一个人,从而精确测量文化背景差异对LVLM输出的影响。我们展示了文化反事实在量化流行LVLM中的文化偏见方面的实用性。
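The counterfactual design supports a simple bias measurement: compare a model's score for the same person across contexts. A minimal sketch, where the nested-dict layout and context names are illustrative assumptions:

```python
def mean_counterfactual_gap(scores):
    """Average, over people, of the largest change in a model score for the
    *same* person when only the cultural context of the image changes.
    `scores` maps person_id -> {context_name: model_score}; a gap of 0 means
    the model output is invariant to cultural context."""
    gaps = [max(ctx.values()) - min(ctx.values()) for ctx in scores.values()]
    return sum(gaps) / len(gaps)
```

Because each counterfactual set holds the person fixed, any nonzero gap is attributable to the context edit rather than to the individual's appearance.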
cs.CV / 11 / 2603.02371
Aligning Fetal Anatomy with Kinematic Tree Log-Euclidean PolyRigid Transforms
将胎儿解剖结构与运动树对数欧几里得多刚体变换对齐
Abstract
Automated analysis of articulated bodies is crucial in medical imaging. Existing surface-based models often ignore internal volumetric structures and rely on deformation methods that lack anatomical consistency guarantees. To address this problem, we introduce a differentiable volumetric body model based on the Skinned Multi-Person Linear (SMPL) formulation, driven by a new Kinematic Tree-based Log-Euclidean PolyRigid (KTPolyRigid) transform. KTPolyRigid resolves Lie algebra ambiguities associated with large, non-local articulated motions, and encourages smooth, bijective volumetric mappings. Evaluated on 53 fetal MRI volumes, KTPolyRigid yields deformation fields with significantly fewer folding artifacts. Furthermore, our framework enables robust groupwise image registration and a label-efficient, template-based segmentation of fetal organs. It provides a robust foundation for standardized volumetric analysis of articulated bodies in medical imaging.
Chinese Translation
自动化分析关节体在医学成像中至关重要。现有的基于表面的模型往往忽视内部体积结构,并依赖缺乏解剖一致性保证的变形方法。为了解决这个问题,我们引入了一种基于皮肤多人体线性(Skinned Multi-Person Linear, SMPL)公式的可微分体积体模型,该模型由一种新的基于运动树的对数欧几里得多刚体(Kinematic Tree-based Log-Euclidean PolyRigid, KTPolyRigid)变换驱动。KTPolyRigid 解决了与大规模非局部关节运动相关的李代数歧义,并鼓励平滑的双射体积映射。在53个胎儿MRI体积上进行评估,KTPolyRigid 产生的变形场显著减少了折叠伪影。此外,我们的框架实现了稳健的组图像配准和高效标签的基于模板的胎儿器官分割。它为医学成像中关节体的标准化体积分析提供了坚实的基础。
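KTPolyRigid itself operates on 3D fetal anatomy; as a hedged illustration of the log-Euclidean polyrigid idea it builds on, here is the 2D (SE(2)) case, where the Lie-group log/exp maps have simple closed forms: each bone's rigid transform is mapped to its twist, twists are blended with spatial weights, and a single exp returns a valid rigid motion.

```python
import math

def se2_log(x, y, theta):
    """Closed-form log map of an SE(2) pose to its twist (vx, vy, omega)."""
    if abs(theta) < 1e-9:
        return (x, y, 0.0)
    s, c = math.sin(theta), math.cos(theta)
    a = theta / (2.0 * (1.0 - c))          # scale factor of V(theta)^{-1}
    return (a * (s * x + (1.0 - c) * y),
            a * (-(1.0 - c) * x + s * y),
            theta)

def se2_exp(vx, vy, omega):
    """Closed-form exp map of a twist back to an SE(2) pose (x, y, theta)."""
    if abs(omega) < 1e-9:
        return (vx, vy, 0.0)
    s, c = math.sin(omega), math.cos(omega)
    return ((s * vx - (1.0 - c) * vy) / omega,
            ((1.0 - c) * vx + s * vy) / omega,
            omega)

def polyrigid_blend(poses, weights):
    """Log-Euclidean blend: weighted sum of logs, then one exp, so the
    result is again a rigid motion (no shearing from naive averaging)."""
    twists = [se2_log(*p) for p in poses]
    blended = tuple(sum(w * t[i] for t, w in zip(twists, weights))
                    for i in range(3))
    return se2_exp(*blended)
```

Blending the identity with a 90-degree rotation at equal weights yields a 45-degree rotation, which is exactly the well-posed interpolation behavior the transform is designed to guarantee.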
cs.CV / 12 / 2603.02386
Advancing Earth Observation Through Machine Learning: A TorchGeo Tutorial
通过机器学习推进地球观测:TorchGeo 教程
Abstract
Earth observation machine learning pipelines differ fundamentally from standard computer vision workflows. Imagery is typically delivered as large, georeferenced scenes, labels may be raster masks or vector geometries in distinct coordinate reference systems, and both training and evaluation often require spatially aware sampling and splitting strategies. TorchGeo is a PyTorch-based domain library that provides datasets, samplers, transforms and pre-trained models with the goal of making it easy to use geospatial data in machine learning pipelines. In this paper, we introduce a tutorial that demonstrates (1) the core TorchGeo abstractions through code examples, and (2) an end-to-end case study on multispectral water segmentation from Sentinel-2 imagery using the Earth Surface Water dataset. This demonstrates how to train a semantic segmentation model using TorchGeo datasets, apply the model to a Sentinel-2 scene over Rio de Janeiro, Brazil, and save the resulting predictions as a GeoTIFF for further geospatial analysis. The tutorial code itself is distributed as two Python notebooks: https://torchgeo.readthedocs.io/en/stable/tutorials/torchgeo.html and https://torchgeo.readthedocs.io/en/stable/tutorials/earth_surface_water.html.
Chinese Translation
地球观测中的机器学习管道与标准计算机视觉工作流程在根本上有所不同。图像通常以大型地理参考场景的形式提供,标签可能是不同坐标参考系统中的栅格掩膜或矢量几何体,并且训练和评估通常需要空间感知的采样和拆分策略。TorchGeo 是一个基于 PyTorch 的领域库,提供数据集、采样器、变换和预训练模型,旨在简化在机器学习管道中使用地理空间数据的过程。本文介绍了一个教程,展示了 1.) 通过代码示例演示核心 TorchGeo 抽象,以及 2.) 一个关于使用地球表面水体数据集从 Sentinel-2 图像进行多光谱水体分割的端到端案例研究。这展示了如何使用 TorchGeo 数据集训练语义分割模型,将模型应用于巴西里约热内卢的 Sentinel-2 场景,并将生成的预测结果保存为 GeoTIFF 以便进一步的地理空间分析。教程代码本身以两个 Python 笔记本的形式分发: https://torchgeo.readthedocs.io/en/stable/tutorials/torchgeo.html 和 https://torchgeo.readthedocs.io/en/stable/tutorials/earth_surface_water.html。
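TorchGeo provides geo-aware samplers and split utilities for the spatially aware splitting the abstract mentions; as a library-free sketch of why it matters, one can hold out whole grid cells so that no test patch geographically overlaps a training patch (function and parameter names here are illustrative, not TorchGeo's API):

```python
import random

def grid_split(samples, cell_size, test_frac, seed=0):
    """Spatially aware train/test split: bucket georeferenced samples into
    grid cells, then hold out whole cells for testing, preventing the spatial
    leakage a random per-sample split would allow. `samples` are
    (x, y, payload) tuples in projected coordinates."""
    cells = {}
    for s in samples:
        key = (int(s[0] // cell_size), int(s[1] // cell_size))
        cells.setdefault(key, []).append(s)
    keys = sorted(cells)
    random.Random(seed).shuffle(keys)
    n_test = max(1, int(len(keys) * test_frac))
    test_keys = set(keys[:n_test])
    train = [s for k in keys if k not in test_keys for s in cells[k]]
    test = [s for k in test_keys for s in cells[k]]
    return train, test
```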
cs.CV / 13 / 2603.02390
OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
OpenMarcie:用于工业环境中多模态动作识别的数据集
Abstract
Smart factories use advanced technologies to optimize production and increase efficiency. To this end, the recognition of worker activity allows for accurate quantification of performance metrics, improving efficiency holistically while contributing to worker safety. OpenMarcie is, to the best of our knowledge, the largest multimodal dataset designed for human action monitoring in manufacturing environments. It includes data from wearable sensing modalities and from cameras distributed in the surroundings. The dataset is structured around two experimental settings, involving a total of 36 participants. In the first setting, twelve participants perform a bicycle assembly and disassembly task under semi-realistic conditions without a fixed protocol, promoting divergent and goal-oriented problem-solving. The second experiment involves twenty-five volunteers (24 with valid data) engaged in a 3D printer assembly task, with the 3D printer manufacturer's instructions provided to guide the volunteers in acquiring procedural knowledge. This setting also includes sequential collaborative assembly, where participants assess and correct each other's progress, reflecting real-world manufacturing dynamics. OpenMarcie includes over 37 hours of egocentric and exocentric, multimodal, and multipositional data, featuring eight distinct data types and more than 200 independent information channels. The dataset is benchmarked across three human activity recognition tasks: activity classification, open vocabulary captioning, and cross-modal alignment.
Chinese Translation
智能工厂利用先进技术优化生产并提高效率。为此,工人活动的识别能够准确量化绩效指标,从整体上提高效率,同时有助于工人安全。OpenMarcie是我们所知最大的多模态数据集,旨在制造环境中进行人类动作监测。该数据集包含来自可穿戴传感器和分布在周围环境中的摄像头的数据。数据集围绕两个实验设置构建,共涉及36名参与者。在第一个设置中,12名参与者在半现实条件下执行自行车的组装和拆卸任务,没有固定的协议,促进了发散性和目标导向的问题解决。第二个实验涉及25名志愿者(24份有效数据),参与3D打印机的组装任务,提供了3D打印机制造商的说明,以指导志愿者获取程序知识。该设置还包括顺序协作组装,参与者评估并纠正彼此的进度,反映了现实世界的制造动态。OpenMarcie包含超过37小时的自我中心和外部中心的多模态、多位置数据,具有八种不同的数据类型和超过200个独立的信息通道。该数据集在三个人类活动识别任务上进行了基准测试:活动分类、开放词汇字幕生成和跨模态对齐。
cs.CV / 14 / 2603.02411
From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness
从更少的样本到更少的位:将数据集蒸馏重新框架为精度与紧凑性的联合优化
Abstract
Dataset Distillation (DD) compresses large datasets into compact synthetic ones that maintain training performance. However, current methods mainly target sample reduction, with limited consideration of data precision and its impact on efficiency. We propose Quantization-aware Dataset Distillation (QuADD), a unified framework that jointly optimizes dataset compactness and precision under fixed bit budgets. QuADD integrates a differentiable quantization module within the distillation loop, enabling end-to-end co-optimization of synthetic samples and quantization parameters. Guided by the rate-distortion perspective, we empirically analyze how bit allocation between sample count and precision influences learning performance. Our framework supports both uniform and adaptive non-uniform quantization, where the latter learns quantization levels from data to represent information-dense regions better. Experiments on image classification and 3GPP beam management tasks show that QuADD surpasses existing DD and post-quantized baselines in accuracy per bit, establishing a new standard for information-efficient dataset distillation.
Chinese Translation
数据集蒸馏(Dataset Distillation, DD)将大型数据集压缩为紧凑的合成数据集,同时保持训练性能。然而,当前的方法主要关注样本减少,对数据精度及其对效率的影响考虑有限。我们提出了量化感知数据集蒸馏(Quantization-aware Dataset Distillation, QuADD),这是一个统一框架,在固定位预算下联合优化数据集的紧凑性和精度。QuADD在蒸馏循环中集成了一个可微分的量化模块,使得合成样本和量化参数的端到端共同优化成为可能。在率失真视角的指导下,我们实证分析了样本数量与精度之间的位分配如何影响学习性能。我们的框架支持均匀和自适应非均匀量化,其中后者从数据中学习量化水平,以更好地表示信息密集区域。在图像分类和3GPP波束管理任务上的实验表明,QuADD在每位准确性上超越了现有的DD和后量化基线,确立了信息高效数据集蒸馏的新标准。
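A minimal sketch of the uniform-quantization half of this idea follows. QuADD's module is differentiable (round passes gradients straight through so the step size stays learnable) and is co-optimized with the synthetic samples; here both are plain Python, and the parameter names are illustrative.

```python
def uniform_quantize(x, step, bits):
    """Round x to the nearest multiple of `step`, clipped to the symmetric
    integer range representable with `bits` bits. In quantization-aware
    training, the round() would use a straight-through gradient estimator."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / step)))
    return q * step
```

Under a fixed bit budget, shrinking `bits` frees budget for more synthetic samples, which is exactly the sample-count-versus-precision trade-off the paper analyzes from a rate-distortion perspective.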
cs.CV / 15 / 2603.02413
TruckDrive: Long-Range Autonomous Highway Driving Dataset
TruckDrive:长距离自主高速驾驶数据集
Abstract
Safe highway autonomy for heavy trucks remains an open and unsolved challenge: due to long braking distances, scene understanding over hundreds of meters is required for anticipatory planning and to allow safe braking margins. However, existing driving datasets primarily cover urban scenes, with perception effectively limited to short ranges of only up to 100 meters. To address this gap, we introduce TruckDrive, a highway-scale multimodal driving dataset, captured with a sensor suite purpose-built for long-range sensing: seven long-range FMCW LiDARs measuring range and radial velocity, three high-resolution short-range LiDARs, eleven 8MP surround cameras with varying focal lengths, and ten 4D FMCW radars. The dataset offers 475 thousand samples with 165 thousand densely annotated frames for driving perception benchmarking, up to 1,000 meters for 2D detection and 400 meters for 3D detection, depth estimation, tracking, planning, and end-to-end driving over 20-second sequences at highway speeds. We find that state-of-the-art autonomous driving models do not generalize to ranges beyond 150 meters, with drops between 31% and 99% in 3D perception tasks, exposing a systematic long-range gap that current architectures and training signals cannot close.
Chinese Translation
重型卡车的安全高速自主驾驶仍然是一个未解决的挑战:由于长制动距离,需要对数百米的场景进行理解,以便进行预判规划并确保安全的制动余量。然而,现有的驾驶数据集主要覆盖城市场景,感知有效范围仅限于最多100米。为了解决这一空白,我们推出了TruckDrive,一个高速公路规模的多模态驾驶数据集,使用专为长距离感知而设计的传感器阵列捕获:七个长距离FMCW激光雷达测量范围和径向速度,三个高分辨率短距离激光雷达,十一台具有不同焦距的8MP环视摄像头,以及十个4D FMCW雷达。该数据集提供了475,000个样本和165,000个密集标注的帧,用于驾驶感知基准测试,支持高达1,000米的2D检测和400米的3D检测、深度估计、跟踪、规划以及在高速公路速度下的20秒序列的端到端驾驶。我们发现,最先进的自主驾驶模型在150米以上的范围内无法泛化,3D感知任务的性能下降在31%到99%之间,暴露出当前架构和训练信号无法弥补的系统性长距离差距。
cs.CV / 16 / 2603.02419
DINOv3 Visual Representations for Blueberry Perception Toward Robotic Harvesting
DINOv3视觉表征在蓝莓感知中的应用及其对机器人采摘的影响
Abstract
Vision Foundation Models trained via large-scale self-supervised learning have demonstrated strong generalization in visual perception; however, their practical role and performance limits in agricultural settings remain insufficiently understood. This work evaluates DINOv3 as a frozen backbone for blueberry robotic harvesting-related visual tasks, including fruit and bruise segmentation, as well as fruit and cluster detection. Under a unified protocol with lightweight decoders, segmentation benefits consistently from stable patch-level representations and scales with backbone size. In contrast, detection is constrained by target scale variation, patch discretization, and localization compatibility. The failure of cluster detection highlights limitations in modeling relational targets defined by spatial aggregation. Overall, DINOv3 is best viewed not as an end-to-end task model, but as a semantic backbone whose effectiveness depends on downstream spatial modeling aligned with fruit-scale and aggregation structures, providing guidance for blueberry robotic harvesting. Code and dataset will be available upon acceptance.
Chinese Translation
通过大规模自监督学习训练的视觉基础模型在视觉感知方面表现出强大的泛化能力;然而,它们在农业环境中的实际作用和性能限制仍然不够明确。本研究评估了DINOv3作为蓝莓机器人采摘相关视觉任务的固定骨干,包括果实和伤痕分割,以及果实和簇的检测。在统一协议下,采用轻量级解码器,分割任务持续受益于稳定的补丁级表征,并随着骨干网络规模的增大而提升。相比之下,检测任务受到目标尺度变化、补丁离散化和定位兼容性的限制。簇检测的失败突显了在建模由空间聚合定义的关系目标方面的局限性。总体而言,DINOv3更适合被视为一个语义骨干,其有效性依赖于与果实尺度和聚合结构对齐的下游空间建模,为蓝莓机器人采摘提供指导。代码和数据集将在论文接受后提供。
cs.CV / 17 / 2603.02434
MIRAGE: Knowledge Graph-Guided Cross-Cohort MRI Synthesis for Alzheimer's Disease Prediction
MIRAGE:基于知识图谱的跨队列MRI合成用于阿尔茨海默病预测
Abstract
Reliable Alzheimer's disease (AD) diagnosis increasingly relies on multimodal assessments combining structural Magnetic Resonance Imaging (MRI) and Electronic Health Records (EHR). However, deploying these models is bottlenecked by modality missingness, as MRI scans are expensive and frequently unavailable in many patient cohorts. Furthermore, synthesizing de novo 3D anatomical scans from sparse, high-dimensional tabular records is technically challenging and poses severe clinical risks. To address this, we introduce MIRAGE, a novel framework that reframes the missing-MRI problem as an anatomy-guided cross-modal latent distillation task. First, MIRAGE leverages a Biomedical Knowledge Graph (KG) and Graph Attention Networks to map heterogeneous EHR variables into a unified embedding space that can be propagated from cohorts with real MRIs to cohorts without them. To bridge the semantic gap and enforce physical spatial awareness, we employ a frozen pre-trained 3D U-Net decoder strictly as an auxiliary regularization engine. Supported by a novel cohort-aggregated skip feature compensation strategy, this decoder acts as a rigorous structural penalty, forcing 1D latent representations to encode biologically plausible, macro-level pathological semantics. By exclusively utilizing this distilled "diagnostic-surrogate" representation during inference, MIRAGE completely bypasses computationally expensive 3D voxel reconstruction. Experiments demonstrate that our framework successfully bridges the missing-modality gap, improving the AD classification rate by 13% compared to unimodal baselines in cohorts without real MRIs.
Chinese Translation
可靠的阿尔茨海默病(AD)诊断日益依赖于结合结构性磁共振成像(MRI)和电子健康记录(EHR)的多模态评估。然而,由于MRI扫描昂贵且在许多患者队列中经常不可用,这些模型的部署受到模态缺失的瓶颈。此外,从稀疏的高维表格记录中合成全新的3D解剖扫描在技术上具有挑战性,并且带来了严重的临床风险。为了解决这一问题,我们提出了MIRAGE,一个将缺失MRI问题重新构建为解剖引导的跨模态潜在蒸馏任务的新框架。首先,MIRAGE利用生物医学知识图谱(KG)和图注意力网络将异构EHR变量映射到一个统一的嵌入空间,该空间可以从具有真实MRI的队列传播到没有MRI的队列。为了弥合语义差距并增强物理空间意识,我们采用一个冻结的预训练3D U-Net解码器,严格作为辅助正则化引擎。得益于一种新颖的队列聚合跳跃特征补偿策略,该解码器作为严格的结构惩罚,迫使1D潜在表示编码生物学上合理的宏观病理语义。通过在推理过程中独占地利用这种蒸馏的“诊断替代”表示,MIRAGE完全绕过了计算成本高昂的3D体素重建。实验表明,我们的框架成功弥合了缺失模态的差距,在没有真实MRI的队列中将AD分类率提高了13%,相较于单模态基线。
cs.CV / 18 / 2603.02438
ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering
ORCA:协同智能体的协同推理用于文档视觉问答
Abstract
Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress-testing, and when necessary, a thesis-antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning.
Chinese Translation
文档视觉问答(DocVQA)对于现有的视觉语言模型(VLMs)仍然具有挑战性,尤其是在复杂推理和多步骤工作流程下。当前的方法难以将复杂问题分解为可管理的子任务,并且往往无法利用针对不同文档元素的专门处理路径。我们提出了ORCA:协同智能体的协同推理用于文档视觉问答,这是一个新颖的多智能体框架,通过战略性智能体协调和迭代精炼来解决这些局限性。ORCA首先由一个推理智能体将查询分解为逻辑步骤,随后由一个路由机制激活来自专用智能体停靠站的任务特定智能体。我们的框架利用一组专门的人工智能智能体,每个智能体专注于不同的模态,使得对多样化文档组件的细致理解和协同推理成为可能。为了确保答案的可靠性,ORCA采用了辩论机制和压力测试,并在必要时进行论点-反论点的裁决过程。接下来是一个合理性检查器,以确保格式一致性。在三个基准上的广泛实验表明,我们的方法在性能上显著优于最先进的方法,为视觉语言推理中的协同智能体系统建立了一个新的范式。
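The routing step above can be sketched as a dispatcher over a dock of modality-specific agents. This is a schematic illustration only; the dock contents, the `"text"` fallback, and the tagged-step format are assumptions, not ORCA's actual interfaces.

```python
def route_steps(steps, dock, fallback="text"):
    """Send each decomposed reasoning step to the specialized agent whose
    modality tag matches, falling back to a general agent. The collected
    partial answers would then feed the debate/adjudication stages."""
    return [dock.get(modality, dock[fallback])(query)
            for modality, query in steps]
```

In the full framework, the returned partial answers are not final: they are stress-tested by the debate mechanism and format-checked by the sanity checker before an answer is emitted.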
cs.CV / 19 / 2603.02465
Deep Learning Based Wildfire Detection for Peatland Fires Using Transfer Learning
基于深度学习的泥炭地火灾检测方法:迁移学习的应用
Abstract
Machine learning (ML)-based wildfire detection methods have been developed in recent years, primarily using deep learning (DL) models trained on large collections of wildfire images and videos. However, peatland fires exhibit distinct visual and physical characteristics -- such as smoldering combustion, low flame intensity, persistent smoke, and subsurface burning -- that limit the effectiveness of conventional wildfire detectors trained on open-flame forest fires. In this work, we present a transfer learning-based approach for peatland fire detection that leverages knowledge learned from general wildfire imagery and adapts it to the peatland fire domain. We initialize a DL-based peatland fire detector using pretrained weights from a conventional wildfire detection model and subsequently fine-tune the network using a dataset composed of Malaysian peatland images and videos. This strategy enables effective learning despite the limited availability of labeled peatland fire data. Experimental results demonstrate that transfer learning significantly improves detection accuracy and robustness compared to training from scratch, particularly under challenging conditions such as low-contrast smoke, partial occlusions, and variable illumination. The proposed approach provides a practical and scalable solution for early peatland fire detection and has the potential to support real-time monitoring systems for fire prevention and environmental protection.
Chinese Translation
近年来,基于机器学习(ML)的火灾检测方法得到了发展,主要使用在大量火灾图像和视频上训练的深度学习(DL)模型。然而,泥炭地火灾具有独特的视觉和物理特征——如阴燃、低火焰强度、持续冒烟和地下燃烧——这些特征限制了传统火灾探测器在开放火焰森林火灾上的有效性。在本研究中,我们提出了一种基于迁移学习的泥炭地火灾检测方法,该方法利用从一般火灾图像中学习到的知识,并将其适应于泥炭地火灾领域。我们使用来自传统火灾检测模型的预训练权重初始化一个基于深度学习的泥炭地火灾探测器,并随后使用由马来西亚泥炭地图像和视频组成的数据集对网络进行微调。这一策略使得在标注的泥炭地火灾数据有限的情况下仍能有效学习。实验结果表明,与从头开始训练相比,迁移学习显著提高了检测的准确性和鲁棒性,特别是在低对比度烟雾、部分遮挡和可变光照等挑战性条件下。所提出的方法为早期泥炭地火灾检测提供了一种实用且可扩展的解决方案,并有潜力支持火灾预防和环境保护的实时监测系统。
cs.CV / 20 / 2603.02475
Large-Scale Dataset and Benchmark for Skin Tone Classification in the Wild
野外肤色分类的大规模数据集和基准测试
Abstract
Deep learning models often inherit biases from their training data. While fairness across gender and ethnicity is well-studied, fine-grained skin tone analysis remains a challenge due to the lack of granular, annotated datasets. Existing methods often rely on the medical 6-tone Fitzpatrick scale, which lacks visual representativeness, use small, private datasets that prevent reproducibility, or rely on classic computer vision pipelines, with only a few using deep learning. They overlook issues like train-test leakage and dataset imbalance, and are limited by small or unavailable datasets. In this work, we present a comprehensive framework for skin tone fairness. First, we introduce STW, a large-scale, open-access dataset comprising 42,313 images from 3,564 individuals, labeled using the 10-tone MST scale. Second, we benchmark both classic computer vision (SkinToneCCV) and deep learning approaches, demonstrating that classic models provide near-random results, while deep learning reaches near-annotator accuracy. Finally, we propose SkinToneNet, a fine-tuned ViT that achieves state-of-the-art generalization on out-of-domain data, which enables reliable fairness auditing of public datasets like CelebA and VGGFace2. This work provides state-of-the-art results in skin tone classification and fairness assessment. Code and data will be made available soon.
Chinese Translation
深度学习模型往往会继承其训练数据中的偏见。尽管性别和种族的公平性已被广泛研究,但由于缺乏细粒度的标注数据集,细致的肤色分析仍然是一项挑战。现有方法通常依赖于医学上的6色Fitzpatrick尺度,该尺度缺乏视觉代表性,或者使用小型的私有数据集,这限制了可重复性,或者往往依赖于经典的计算机视觉管道,只有少数使用深度学习。这些方法忽视了训练-测试泄漏和数据集不平衡等问题,并受到小型或不可用数据集的限制。在本研究中,我们提出了一个全面的肤色公平性框架。首先,我们介绍了STW,这是一个大规模、开放获取的数据集,包含来自3,564个个体的42,313张图像,使用10色MST尺度进行标注。其次,我们对经典计算机视觉方法(SkinToneCCV)和深度学习方法进行了基准测试,结果表明经典模型提供近乎随机的结果,而深度学习模型达到了接近标注者的准确性。最后,我们提出了SkinToneNet,一个经过微调的ViT模型,在域外数据上实现了最先进的泛化能力,这使得对CelebA和VGGFace2等公共数据集的公平性审计变得可靠。本研究在肤色分类和公平性评估方面提供了最先进的结果。代码和数据将很快发布。
cs.CV / 21 / 2603.02477
E2E-GNet: An End-to-End Skeleton-based Geometric Deep Neural Network for Human Motion Recognition
E2E-GNet:一种端到端的基于骨架的几何深度神经网络用于人类动作识别
Abstract
Geometric deep learning has recently gained significant attention in the computer vision community for its ability to capture meaningful representations of data lying in a non-Euclidean space. To this end, we propose E2E-GNet, an end-to-end geometric deep neural network for skeleton-based human motion recognition. To enhance the discriminative power between different motions in the non-Euclidean space, E2E-GNet introduces a geometric transformation layer that jointly optimizes skeleton motion sequences on this space and applies a differentiable logarithm map activation to project them onto a linear space. Building on this, we further design a distortion-aware optimization layer that limits skeleton shape distortions caused by this projection, enabling the network to retain discriminative geometric cues and achieve a higher motion recognition rate. We demonstrate the impact of each layer through ablation studies, and extensive experiments across five datasets spanning three domains show that E2E-GNet outperforms other methods at lower cost.
Chinese Translation
几何深度学习最近在计算机视觉领域引起了显著关注,因为它能够捕捉位于非欧几里得空间中的数据的有意义表示。为此,我们提出了E2E-GNet,一种用于基于骨架的人类动作识别的端到端几何深度神经网络。为了增强非欧几里得空间中不同动作之间的区分能力,E2E-GNet引入了一个几何变换层,该层联合优化该空间中的骨架动作序列,并应用可微分对数映射激活将其投影到线性空间。在此基础上,我们进一步设计了一个抗畸变优化层,以限制该投影造成的骨架形状畸变,使网络能够保留区分性的几何线索,并实现更高的动作识别率。我们通过消融研究展示了每一层的影响,并通过在三个领域的五个数据集上的广泛实验表明,E2E-GNet在成本更低的情况下优于其他方法。
cs.CV / 22 / 2603.02481
ModalPatch: A Plug-and-Play Module for Robust Multi-Modal 3D Object Detection under Modality Drop
ModalPatch:一种用于应对模态丢失的鲁棒多模态3D物体检测的即插即用模块
Abstract
Multi-modal 3D object detection is pivotal for autonomous driving, integrating complementary sensors like LiDAR and cameras. However, its real-world reliability is challenged by transient data interruptions and missing modalities, which can momentarily drop out due to hardware glitches, adverse weather, or occlusions. This poses a critical risk, especially during a simultaneous drop of multiple modalities, when the vehicle is momentarily blind. To address this problem, we introduce ModalPatch, the first plug-and-play module designed to enable robust detection under arbitrary modality-drop scenarios. Without requiring architectural changes or retraining, ModalPatch can be seamlessly integrated into diverse detection frameworks. Technically, ModalPatch leverages the temporal nature of sensor data for perceptual continuity, using a history-based module to predict and compensate for transiently unavailable features. To improve the fidelity of the predicted features, we further introduce an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of compensated features, suppressing biased signals while reinforcing informative ones. Extensive experiments show that ModalPatch consistently enhances both the robustness and accuracy of state-of-the-art 3D object detectors under diverse modality-drop conditions.
Chinese Translation
多模态3D物体检测对于自动驾驶至关重要,它整合了激光雷达(LiDAR)和摄像头等互补传感器。然而,其在现实世界中的可靠性受到瞬时数据中断和模态缺失的挑战,模态可能因硬件故障、不利天气或遮挡而暂时丢失。这在模态同时丢失的情况下构成了严重风险,车辆在此时会暂时失去视野。为了解决这一问题,我们提出了ModalPatch,这是第一个旨在应对任意模态丢失场景的即插即用模块,能够实现鲁棒的检测。ModalPatch无需架构更改或重新训练,可以无缝集成到各种检测框架中。从技术上讲,ModalPatch利用传感器数据的时间特性来实现感知连续性,采用基于历史的模块来预测和补偿暂时不可用的特征。为了提高预测特征的准确性,我们进一步引入了一种不确定性引导的跨模态融合策略,该策略动态估计补偿特征的可靠性,抑制偏差信号,同时增强信息量大的信号。大量实验表明,ModalPatch在多种模态丢失条件下,始终提高了最先进的3D物体检测器的鲁棒性和准确性。
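A stdlib sketch of the history-based compensation with confidence decay follows. The exponential-moving-average predictor, the geometric decay schedule, and scalar "features" are simplifying assumptions for illustration; ModalPatch operates on deep feature maps and learns its uncertainty estimates.

```python
class HistoryCompensator:
    """Keep an exponential moving average (EMA) of past features per modality
    and substitute it when the live feature drops out, paired with a
    confidence weight that decays the longer the modality has been missing."""
    def __init__(self, alpha=0.8, decay=0.5):
        self.alpha, self.decay = alpha, decay
        self.ema, self.staleness = {}, {}

    def step(self, features):
        """features: {modality: value or None}. Returns, per modality, a
        (feature, confidence) pair for the downstream weighted fusion."""
        out = {}
        for m, f in features.items():
            if f is not None:                       # live sensor reading
                prev = self.ema.get(m, f)
                self.ema[m] = self.alpha * prev + (1 - self.alpha) * f
                self.staleness[m] = 0
                out[m] = (f, 1.0)
            elif m in self.ema:                     # compensate from history
                self.staleness[m] += 1
                out[m] = (self.ema[m], self.decay ** self.staleness[m])
        return out
```

The decaying confidence is the key design point: a compensated feature is still fused, but its influence shrinks as it grows stale, mirroring the paper's uncertainty-guided suppression of biased signals.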
cs.CV / 23 / 2603.02497
WTHaar-Net: a Hybrid Quantum-Classical Approach
WTHaar-Net:一种混合量子-经典方法
Abstract
Convolutional neural networks rely on linear filtering operations that can be reformulated efficiently in suitable transform domains. At the same time, advances in quantum computing have shown that certain structured linear transforms can be implemented with shallow quantum circuits, opening the door to hybrid quantum-classical approaches for enhancing deep learning models. In this work, we introduce WTHaar-Net, a convolutional neural network that replaces the Hadamard Transform used in prior hybrid architectures with the Haar Wavelet Transform (HWT). Unlike the Hadamard Transform, the Haar transform provides spatially localized, multi-resolution representations that align more closely with the inductive biases of vision tasks. We show that the HWT admits a quantum realization using structured Hadamard gates, enabling its decomposition into unitary operations suitable for quantum circuits. Experiments on CIFAR-10 and Tiny-ImageNet demonstrate that WTHaar-Net achieves substantial parameter reduction while maintaining competitive accuracy. On Tiny-ImageNet, our approach outperforms both ResNet and Hadamard-based baselines. We validate the quantum implementation on IBM Quantum cloud hardware, demonstrating compatibility with near-term quantum devices.
Chinese Translation
卷积神经网络依赖于可以在合适的变换域中高效重构的线性滤波操作。同时,量子计算的进展表明,某些结构化线性变换可以通过浅层量子电路实现,为增强深度学习模型的混合量子-经典方法打开了大门。在本研究中,我们介绍了WTHaar-Net,这是一种卷积神经网络,它用Haar小波变换(Haar Wavelet Transform, HWT)替代了先前混合架构中使用的Hadamard变换。与Hadamard变换不同,Haar变换提供了空间局部化的多分辨率表示,更加符合视觉任务的归纳偏置。我们展示了HWT可以通过结构化的Hadamard门实现量子化,从而使其分解为适合量子电路的单位操作。在CIFAR-10和Tiny-ImageNet上的实验表明,WTHaar-Net在保持竞争性准确度的同时实现了显著的参数减少。在Tiny-ImageNet上,我们的方法优于ResNet和基于Hadamard的基线。我们在IBM Quantum云硬件上验证了量子实现,证明了其与近期量子设备的兼容性。
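For reference, one level of the orthonormal 1D Haar transform underlying the HWT is just paired sums and differences scaled by 1/√2. A plain-Python sketch (the quantum realization instead composes these butterflies from Hadamard gates acting on structured qubit pairs):

```python
def haar_1d(signal):
    """One level of the orthonormal 1D Haar wavelet transform:
    approximation = scaled pairwise sums (local averages),
    detail = scaled pairwise differences (local edges)."""
    assert len(signal) % 2 == 0
    r = 2 ** -0.5
    approx = [r * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [r * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

def ihaar_1d(approx, detail):
    """Inverse of haar_1d: interleave scaled sums/differences of each
    (approximation, detail) pair to recover the original samples."""
    r = 2 ** -0.5
    out = []
    for a, d in zip(approx, detail):
        out += [r * (a + d), r * (a - d)]
    return out
```

Because the transform is orthonormal (it preserves signal energy), it is unitary, which is precisely the property that makes a shallow quantum-circuit implementation possible.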
cs.CV / 24 / 2603.02505
SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data
SGMA:基于语义引导的模态感知分割在不完整多模态数据下的遥感应用
Abstract
Multimodal semantic segmentation integrates complementary information from diverse sensors for remote sensing Earth observation. However, practical systems often encounter missing modalities due to sensor failures or incomplete coverage, termed Incomplete Multimodal Semantic Segmentation (IMSS). IMSS faces three key challenges: (1) multimodal imbalance, where dominant modalities suppress fragile ones; (2) intra-class variation in scale, shape, and orientation across modalities; and (3) cross-modal heterogeneity, with conflicting cues producing inconsistent semantic responses. Existing methods rely on contrastive learning, which risks over-alignment that discards modality-specific cues, or on joint optimization, which risks imbalanced training that favors robust modalities; both largely overlook intra-class variation and cross-modal heterogeneity. To address these limitations, we propose the Semantic-Guided Modality-Aware (SGMA) framework, which ensures balanced multimodal learning while reducing intra-class variation and reconciling cross-modal inconsistencies through semantic guidance. SGMA introduces two complementary plug-and-play modules: (1) the Semantic-Guided Fusion (SGF) module extracts multi-scale, class-wise semantic prototypes that capture consistent categorical representations across modalities, estimates per-modality robustness based on prototype-feature alignment, and performs adaptive fusion weighted by robustness scores to mitigate intra-class variation and cross-modal heterogeneity; (2) the Modality-Aware Sampling (MAS) module leverages robustness estimations from SGF to dynamically reweight training samples, prioritizing challenging samples from fragile modalities to address modality imbalance. Extensive experiments across multiple datasets and backbones demonstrate that SGMA consistently outperforms state-of-the-art methods, with particularly significant improvements in fragile modalities.
Chinese Translation
多模态语义分割整合来自不同传感器的互补信息,以实现遥感地球观测。然而,实际系统常常由于传感器故障或覆盖不全而遇到缺失模态,这被称为不完整多模态语义分割(IMSS)。IMSS面临三个主要挑战:(1)多模态不平衡,主导模态抑制脆弱模态;(2)模态间的类别内尺度、形状和方向的变化;(3)跨模态异质性,冲突线索导致不一致的语义响应。现有方法依赖于对比学习或联合优化,这可能导致过度对齐,丢弃模态特定线索或不平衡训练,偏向于强健模态,同时在很大程度上忽视类别内变化和跨模态异质性。为了解决这些局限性,我们提出了基于语义引导的模态感知(SGMA)框架,该框架确保平衡的多模态学习,同时通过语义引导减少类别内变化并调和跨模态不一致性。SGMA引入了两个互补的即插即用模块:(1)语义引导融合(SGF)模块提取多尺度、类别特定的语义原型,捕捉跨模态的一致类别表示,基于原型特征对齐估计每个模态的鲁棒性,并根据鲁棒性评分进行自适应融合,以减轻类别内变化和跨模态异质性;(2)模态感知采样(MAS)模块利用SGF的鲁棒性估计动态重新加权训练样本,优先考虑来自脆弱模态的挑战性样本,以解决模态不平衡。通过在多个数据集和骨干网络上的广泛实验,证明SGMA始终优于最先进的方法,尤其在脆弱模态上有显著改善。
cs.CV / 25 / 2603.02518
Beyond Anatomy: Explainable ASD Classification from rs-fMRI via Functional Parcellation and Graph Attention Networks
超越解剖学:基于功能分区和图注意力网络的可解释自闭症谱系障碍分类
Abstract
Anatomical brain parcellations dominate rs-fMRI-based Autism Spectrum Disorder (ASD) classification, yet their rigid boundaries may fail to capture the idiosyncratic connectivity patterns that characterise ASD. We present a graph-based deep learning framework comparing anatomical (AAL, 116 ROIs) and functionally-derived (MSDL, 39 ROIs) parcellation strategies on the ABIDE I dataset. Our FSL preprocessing pipeline handles multi-site heterogeneity across 400 balanced subjects, with site-stratified 70/15/15 splits to prevent data leakage. Gaussian noise augmentation within training folds expands samples from 280 to 1,680. A three-phase pipeline progresses from a baseline GCN with AAL (73.3% accuracy, AUC=0.74), to an optimised GCN with MSDL (84.0%, AUC=0.84), to a Graph Attention Network ensemble achieving 95.0% accuracy (AUC=0.98), outperforming all recent GNN-based benchmarks on ABIDE I. The 10.7-point gain from atlas substitution alone demonstrates that functional parcellation is the most impactful modelling decision. Gradient-based saliency and GNNExplainer analyses converge on the Posterior Cingulate Cortex and Precuneus as core Default Mode Network hubs, validating that model decisions reflect ASD neuropathology rather than acquisition artefacts. All code and datasets will be publicly released upon acceptance.
Chinese Translation
基于解剖学的脑分区在基于静息态功能磁共振成像(rs-fMRI)的自闭症谱系障碍(ASD)分类中占主导地位,然而其僵化的边界可能无法捕捉到特征化ASD的特有连接模式。我们提出了一种基于图的深度学习框架,比较了在ABIDE I数据集上使用解剖学(AAL,116个感兴趣区域)和功能性(MSDL,39个感兴趣区域)分区策略。我们的FSL预处理管道处理了400名平衡受试者之间的多站点异质性,并采用站点分层的70/15/15划分以防止数据泄漏。在训练折中进行高斯噪声增强,将样本数量从280扩展到1680。该管道分为三个阶段,从使用AAL的基线图卷积网络(GCN,准确率73.3%,AUC=0.74)开始,优化后的GCN使用MSDL达到84.0%(AUC=0.84),最后通过图注意力网络集成实现95.0%的准确率(AUC=0.98),超越了ABIDE I上所有近期基于图神经网络(GNN)的基准。仅通过替换图谱所获得的10.7个百分点的提升表明,功能性分区是最具影响力的建模决策。基于梯度的显著性分析和GNNExplainer分析一致表明,后扣带皮层和楔前叶是核心的默认模式网络枢纽,验证了模型决策反映了ASD神经病理而非采集伪影。所有代码和数据集将在接受后公开发布。
cs.CV / 26 / 2603.02522
NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining
NeighborMAE:在掩码自编码器预训练中利用邻近地球观测图像之间的空间依赖性
Abstract
Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remain largely overlooked. Since the Earth's surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. To ensure that the reconstruction remains challenging, we leverage a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight. Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of our designs.
Chinese Translation
掩码图像建模已成为从大规模无标签地球观测图像中学习表征的最流行的自监督学习范式之一。尽管将多模态和多时相地球观测数据纳入掩码图像建模的研究已得到广泛探索,但来自邻近区域的图像之间的空间依赖性仍然在很大程度上被忽视。由于地球表面是连续的,邻近图像之间高度相关,并为自监督学习提供了丰富的上下文信息。为了解决这一问题,我们提出了NeighborMAE,通过对邻近地球观测图像的联合重建来学习空间依赖性。为了确保重建任务具有挑战性,我们采用了一种启发式策略,动态调整掩码比例和像素级损失权重。跨越各种预训练数据集和下游任务的实验结果表明,NeighborMAE显著优于现有基线,强调了邻近图像在地球观测掩码图像建模中的价值以及我们设计的有效性。
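The abstract mentions a heuristic that dynamically adjusts the mask ratio to keep reconstruction challenging, without specifying the rule. One plausible sketch follows; the target-loss feedback rule, step size, and bounds are our assumptions, not the paper's:

```python
import numpy as np

def adjust_mask_ratio(mask_ratio, recon_loss, target_loss,
                      step=0.05, lo=0.4, hi=0.9):
    """Raise the mask ratio when reconstruction is too easy (loss below
    target), lower it when too hard; clamp to [lo, hi]."""
    if recon_loss < target_loss:
        return min(hi, mask_ratio + step)
    return max(lo, mask_ratio - step)

def random_mask(n_patches, mask_ratio, rng):
    """Boolean mask over patch tokens; True = masked (to be reconstructed)."""
    n_mask = int(round(n_patches * mask_ratio))
    idx = rng.permutation(n_patches)[:n_mask]
    mask = np.zeros(n_patches, dtype=bool)
    mask[idx] = True
    return mask

rng = np.random.default_rng(0)
ratio = 0.75                                           # common MAE default
ratio = adjust_mask_ratio(ratio, recon_loss=0.10, target_loss=0.20)
mask = random_mask(196, ratio, rng)                    # 14x14 patch grid
print(ratio, int(mask.sum()))
```

In a joint-reconstruction setting the same mask schedule would be applied across the stack of neighboring tiles, so the model cannot trivially copy unmasked context from one neighbor into another.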
cs.CV / 27 / 2603.02532
EIMC: Efficient Instance-aware Multi-modal Collaborative Perception
EIMC:高效的实例感知多模态协同感知
Abstract
Multi-modal collaborative perception has attracted great attention as a means of enhancing the safety of autonomous driving. However, current multi-modal approaches still follow a "local fusion to communication" sequence, which fuses multi-modal data locally and requires high bandwidth to transmit each agent's feature data before collaborative fusion. EIMC innovatively proposes an early collaborative paradigm. It injects lightweight collaborative voxels, transmitted by neighbor agents, into the ego's local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment. Next, a heatmap-driven consensus protocol identifies exactly where cooperation is needed by computing per-pixel confidence heatmaps. Only the Top-K instance vectors located in these low-confidence, high-discrepancy regions are queried from peers, then fused via cross-attention for completion. Afterwards, we apply a refinement fusion that collects the Top-K most confident instances from each agent and enhances their features using self-attention. The above instance-centric messaging reduces redundancy while guaranteeing that critical occluded objects are recovered. Evaluated on OPV2V and DAIR-V2X, EIMC attains 73.01% [email protected] while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Code publicly released at https://github.com/sidiangongyuan/EIMC.
Chinese Translation
多模态协同感知在提高自动驾驶安全性方面引起了广泛关注。然而,目前的多模态方法仍然遵循“局部融合到通信”的序列,即先在本地融合多模态数据,并在协同融合之前需要高带宽来传输个体特征数据。EIMC 创新性地提出了一种早期协同范式。它将由邻近代理传输的轻量级协同体素注入到自我的局部模态融合步骤中,从而产生紧凑而信息丰富的 3D 协同先验,增强跨模态对齐。接下来,基于热图的共识协议通过计算每像素的置信热图精确识别出需要合作的区域。只有位于这些低置信度、高差异区域的 Top-K 实例向量会向同伴查询,然后通过跨注意力进行融合以实现补全。之后,我们应用了一种精细化融合方法,即从每个代理收集最具置信度的 Top-K 实例,并利用自注意力增强其特征。上述以实例为中心的信息传递减少了冗余,同时确保关键的遮挡物体得以恢复。在 OPV2V 和 DAIR-V2X 上评估,EIMC 与已发布的最佳多模态协同检测器相比,达到了 73.01% [email protected],同时将字节带宽使用减少了 87.98%。代码已公开发布在 https://github.com/sidiangongyuan/EIMC。
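The heatmap-driven consensus step, querying peers only for Top-K instances in low-confidence, high-discrepancy regions, can be sketched in a few lines of numpy. This is an illustrative simplification: the paper operates on instance vectors in BEV feature space, and the threshold here is an assumption:

```python
import numpy as np

def select_query_positions(confidence, discrepancy, k=5, conf_thresh=0.5):
    """Rank locations by cross-agent discrepancy, restricted to places where
    the ego agent's confidence is low, and return the Top-K positions."""
    # Locations with confident ego predictions are excluded from querying.
    score = np.where(confidence < conf_thresh, discrepancy, -np.inf)
    flat_idx = np.argsort(score, axis=None)[::-1][:k]  # descending Top-K
    ys, xs = np.unravel_index(flat_idx, confidence.shape)
    return list(zip(ys.tolist(), xs.tolist()))

rng = np.random.default_rng(1)
conf = rng.random((32, 32))                # per-pixel ego confidence heatmap
disc = rng.random((32, 32))                # per-pixel cross-agent discrepancy
positions = select_query_positions(conf, disc, k=5)
print(positions)
```

Only the instance vectors at these K positions would be requested from peers, which is what keeps the byte bandwidth low relative to transmitting whole feature maps.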
cs.CV / 28 / 2603.02541
ForestPersons: A Large-Scale Dataset for Under-Canopy Missing Person Detection
ForestPersons:一个用于树冠下失踪人员检测的大规模数据集
Abstract
Detecting missing persons in forest environments remains a challenge, as dense canopy cover often conceals individuals from detection in top-down or oblique aerial imagery typically captured by Unmanned Aerial Vehicles (UAVs). While UAVs are effective for covering large, inaccessible areas, their aerial perspectives often miss critical visual cues beneath the forest canopy. This limitation underscores the need for under-canopy perspectives better suited for detecting missing persons in such environments. To address this gap, we introduce ForestPersons, a novel large-scale dataset specifically designed for under-canopy person detection. ForestPersons contains 96,482 images and 204,078 annotations collected under diverse environmental and temporal conditions. Each annotation includes a bounding box, pose, and visibility label for occlusion-aware analysis. ForestPersons provides ground-level and low-altitude perspectives that closely reflect the visual conditions encountered by Micro Aerial Vehicles (MAVs) during forest Search and Rescue (SAR) missions. Our baseline evaluations reveal that standard object detection models, trained on prior large-scale object detection datasets or SAR-oriented datasets, show limited performance on ForestPersons. This indicates that prior benchmarks are not well aligned with the challenges of missing person detection under the forest canopy. We offer this benchmark to support advanced person detection capabilities in real-world SAR scenarios. The dataset is publicly available at https://huggingface.co/datasets/etri/ForestPersons.
Chinese Translation
在森林环境中检测失踪人员仍然是一项挑战,因为密集的树冠覆盖常常使个体在无人机(UAV)通常捕获的自上而下或倾斜的航空影像中难以被发现。尽管无人机在覆盖大面积、难以到达的区域方面效果显著,但它们的航空视角往往错过了树冠下的关键视觉线索。这一局限性凸显了需要更适合于在此类环境中检测失踪人员的树冠下视角。为了解决这一问题,我们推出了ForestPersons,一个专门为树冠下人员检测设计的新型大规模数据集。ForestPersons包含96,482张图像和204,078个注释,这些数据是在多样的环境和时间条件下收集的。每个注释包括一个边界框、姿态和可见性标签,以便进行遮挡感知分析。ForestPersons提供的地面和低空视角能够紧密反映微型无人机(MAVs)在森林搜索与救援(SAR)任务中所遇到的视觉条件。我们的基准评估显示,基于先前的大规模物体检测数据集或面向SAR的数据集训练的标准物体检测模型在ForestPersons上的表现有限。这表明,先前的基准与树冠下失踪人员检测的挑战并不充分对齐。我们提供这一基准,以支持在现实世界SAR场景中的高级人员检测能力。该数据集已在https://huggingface.co/datasets/etri/ForestPersons上公开发布。
cs.CV / 29 / 2603.02546
On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding
判别式与生成式分类器的比较:重新思考多模态大语言模型在动作理解中的应用
Abstract
Multimodal Large Language Models (MLLMs) have advanced open-world action understanding and can be adapted as generative classifiers for closed-set settings by autoregressively generating action labels as text. However, this approach is inefficient, and shared subwords across action labels introduce semantic overlap, leading to ambiguity in generation. In contrast, discriminative classifiers learn task-specific representations with clear decision boundaries, enabling efficient one-step classification without autoregressive decoding. We first compare generative and discriminative classifiers with MLLMs for closed-set action understanding, revealing the superior accuracy and efficiency of the latter. To bridge the performance gap, we design strategies that elevate generative classifiers toward performance comparable with discriminative ones. Furthermore, we show that generative modeling can complement discriminative classifiers, leading to better performance while preserving efficiency. To this end, we propose Generation-Assisted Discriminative (GAD) classifier for closed-set action understanding. GAD operates only during fine-tuning, preserving full compatibility with MLLM pretraining. Extensive experiments on temporal action understanding benchmarks demonstrate that GAD improves both accuracy and efficiency over generative methods, achieving state-of-the-art results on four tasks across five datasets, including an average 2.5% accuracy gain and 3x faster inference on our largest COIN benchmark.
Chinese Translation
多模态大语言模型(MLLMs)在开放世界的动作理解方面取得了进展,并可以通过自回归生成动作标签作为文本,适应于封闭集设置的生成式分类器。然而,这种方法效率低下,动作标签之间共享的子词引入了语义重叠,导致生成过程中的歧义。相比之下,判别式分类器学习特定任务的表示,具有明确的决策边界,使得无需自回归解码即可高效进行一步分类。我们首先比较了生成式和判别式分类器在封闭集动作理解中的表现,揭示了后者在准确性和效率上的优势。为了缩小性能差距,我们设计了策略,使生成式分类器的性能接近判别式分类器。此外,我们展示了生成建模可以补充判别式分类器,从而在保持效率的同时提高性能。为此,我们提出了生成辅助判别式(Generation-Assisted Discriminative, GAD)分类器,用于封闭集动作理解。GAD仅在微调期间运行,保持与MLLM预训练的完全兼容。在时间动作理解基准上的大量实验表明,GAD在准确性和效率上均优于生成方法,在我们的最大COIN基准上实现了四个任务、五个数据集的最新成果,包括平均提高2.5%的准确率和3倍的推理速度。
cs.CV / 30 / 2603.02548
SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding
SemGS:基于稀疏视图的前馈语义3D高斯点云重建框架,用于可泛化场景理解
Abstract
Semantic understanding of 3D scenes is essential for robots to operate effectively and safely in complex environments. Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis often rely on dense multi-view inputs and require scene-specific optimization, limiting their practicality and scalability in real-world applications. To address these challenges, we propose SemGS, a feed-forward framework for reconstructing generalizable semantic fields from sparse image inputs. SemGS uses a dual-branch architecture to extract color and semantic features, where the two branches share shallow CNN layers, allowing semantic reasoning to leverage textural and structural cues in color appearance. We also incorporate a camera-aware attention mechanism into the feature extractor to explicitly model geometric relationships between camera viewpoints. The extracted features are decoded into dual-Gaussians that share geometric consistency while preserving branch-specific attributes, and further rasterized to synthesize semantic maps under novel viewpoints. Additionally, we introduce a regional smoothness loss to enhance semantic coherence. Experiments show that SemGS achieves state-of-the-art performance on benchmark datasets, while providing rapid inference and strong generalization capabilities across diverse synthetic and real-world scenarios.
Chinese Translation
对3D场景的语义理解对于机器人在复杂环境中有效且安全地操作至关重要。现有的语义场景重建和语义感知新视图合成方法通常依赖于密集的多视图输入,并需要特定场景的优化,这限制了它们在实际应用中的实用性和可扩展性。为了解决这些挑战,我们提出了SemGS,一种从稀疏图像输入重建可泛化语义场的前馈框架。SemGS采用双分支架构提取颜色和语义特征,其中两个分支共享浅层卷积神经网络(CNN)层,使得语义推理能够利用颜色外观中的纹理和结构线索。我们还在特征提取器中引入了一种相机感知注意机制,以明确建模相机视点之间的几何关系。提取的特征被解码为共享几何一致性的双高斯,同时保留分支特定属性,并进一步光栅化以在新视点下合成语义图。此外,我们引入了一种区域平滑损失以增强语义一致性。实验表明,SemGS在基准数据集上实现了最先进的性能,同时在多样化的合成和真实场景中提供了快速推理和强大的泛化能力。
cs.CV / 31 / 2603.02554
Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
来自视觉基础模型的可推广知识蒸馏用于语义分割
Abstract
Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain-agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation. The code will be available at https://github.com/Younger-hua/GKD.
Chinese Translation
知识蒸馏(KD)已广泛应用于语义分割,以压缩大型模型,但传统方法主要保留领域内的准确性,而忽视了在分布变化下至关重要的领域外泛化。这一局限性在视觉基础模型(VFM)出现后变得更加严重:尽管VFM在未见数据上表现出强大的鲁棒性,但使用传统KD对其进行蒸馏往往会妥协这种能力。我们提出了可推广知识蒸馏(GKD),这是一个多阶段框架,明确增强泛化能力。GKD将表示学习与任务学习解耦。在第一阶段,学生通过选择性特征蒸馏获得领域无关的表示;在第二阶段,这些表示被冻结以进行任务适应,从而减轻对可见领域的过拟合。为了进一步支持迁移,我们引入了一种基于查询的软蒸馏机制,其中学生特征作为查询,以选择性地从VFM中检索可转移的空间知识。在五个领域泛化基准上的大量实验表明,GKD始终优于现有的KD方法,在基础到基础(F2F)和基础到本地(F2L)蒸馏中分别实现了平均增益+1.9%和+10.6%。代码将发布在 https://github.com/Younger-hua/GKD。
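The query-based soft distillation mechanism described above, student features acting as queries over teacher representations, is essentially cross-attention with the student on the query side. A numpy sketch, assuming both feature sets have already been projected to a common dimension (a detail this listing does not specify):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_teacher(student, teacher):
    """Student tokens (N_s, d) attend over teacher tokens (N_t, d),
    retrieving a teacher-aligned target for every student position."""
    d = student.shape[-1]
    attn = softmax(student @ teacher.T / np.sqrt(d))   # (N_s, N_t)
    return attn @ teacher                              # (N_s, d) targets

def soft_distill_loss(student, teacher):
    """MSE between student features and their teacher-retrieved targets."""
    target = query_teacher(student, teacher)
    return float(np.mean((student - target) ** 2))

rng = np.random.default_rng(0)
s = rng.standard_normal((16, 64))    # student spatial tokens
t = rng.standard_normal((49, 64))    # teacher (VFM) spatial tokens
loss = soft_distill_loss(s, t)
print(round(loss, 4))
```

Because the student selects which teacher tokens to imitate, the loss pulls it toward transferable spatial knowledge rather than forcing a dense, position-by-position match.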
cs.CV / 32 / 2603.02556
Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
通过对比的视角:视觉语言模型中的自我提升视觉推理
Abstract
Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC-STaR.
Chinese Translation
推理已成为大型语言模型的一项关键能力。在语言任务中,这种能力可以通过自我提升技术得到增强,从而优化后续微调的推理路径。然而,将这些基于语言的自我提升方法扩展到视觉语言模型(VLMs)面临独特的挑战:推理路径中的视觉幻觉无法有效验证或纠正。我们的解决方案始于对视觉对比的关键观察:当呈现对比的视觉问答(VQA)对时,即两个视觉相似的图像及其同义问题,VLMs能够更精确地识别相关的视觉线索。基于这一观察,我们提出了视觉对比自我学习推理器(Visual Contrastive Self-Taught Reasoner,VC-STaR),这是一种新颖的自我提升框架,利用视觉对比来减轻模型生成的推理中的幻觉。我们收集了一套多样化的VQA数据集,根据多模态相似性策划对比对,并使用VC-STaR生成推理。因此,我们获得了一个新的视觉推理数据集VisCoR-55K,该数据集随后用于通过监督微调提升各种VLM的推理能力。大量实验表明,VC-STaR不仅优于现有的自我提升方法,还超越了在最先进的视觉推理数据集上微调的模型,证明了VLM固有的对比能力可以自我促进其视觉推理能力。项目地址:https://github.com/zhiyupan42/VC-STaR。
cs.CV / 33 / 2603.02557
CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment
CAPT:基于混淆感知的提示调优以减少视觉-语言不一致
Abstract
Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model's intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at https://github.com/greatest-gourmet/CAPT.
Chinese Translation
视觉-语言模型如CLIP在跨模态表示学习方面取得了显著进展,但在视觉和语义相似类别之间仍然存在系统性的错误分类。我们观察到,这种混淆模式并非随机,而是持续发生在特定类别对之间,揭示了模型的内在偏见和有限的细粒度判别能力。为了解决这个问题,我们提出了CAPT,一种混淆感知的提示调优框架,使模型能够从自身的不一致中学习。具体而言,我们构建了一个混淆库,以明确建模类别之间和错误分类样本之间的稳定混淆关系。在此基础上,我们引入了语义混淆挖掘器(Semantic Confusion Miner, SEM),通过语义差异和共性提示捕捉全局类别间的混淆,以及样本混淆挖掘器(Sample Confusion Miner, SAM),从混淆库中检索代表性的错误分类实例,并通过整合全局和局部上下文的Diff-Manner适配器捕捉样本级线索。为了进一步统一不同粒度的混淆信息,我们设计了一个多粒度差异专家(Multi-Granularity Difference Expert, MGDE)模块,联合利用语义和样本级专家进行更强大的混淆感知推理。在11个基准数据集上的大量实验表明,我们的方法显著减少了由混淆引起的错误,同时增强了基础类和新类的可区分性和泛化能力,成功解决了50.72%的混淆样本对。代码将发布在https://github.com/greatest-gourmet/CAPT。
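The Confusion Bank idea, modeling confusion relationships that recur between specific category pairs rather than one-off errors, can be illustrated with a small sketch; the recurrence threshold `min_count` is our assumption:

```python
from collections import Counter

def build_confusion_bank(y_true, y_pred, min_count=2):
    """Keep only category pairs (true, predicted) that recur at least
    min_count times, together with their misclassified sample indices."""
    pair_counts = Counter()
    pair_samples = {}
    for i, (t, p) in enumerate(zip(y_true, y_pred)):
        if t != p:
            pair_counts[(t, p)] += 1
            pair_samples.setdefault((t, p), []).append(i)
    return {pair: pair_samples[pair]
            for pair, n in pair_counts.items() if n >= min_count}

# "wolf" repeatedly predicted as "husky" is a stable confusion pair;
# the single "cat" -> "fox" error is treated as noise and dropped.
y_true = ["wolf", "wolf", "wolf", "cat", "cat", "husky"]
y_pred = ["husky", "husky", "wolf", "fox", "cat", "husky"]
bank = build_confusion_bank(y_true, y_pred)
print(bank)       # {('wolf', 'husky'): [0, 1]}
```

In the full method, the stored sample indices are what the Sample Confusion Miner retrieves, while the category pairs drive the semantic difference/commonality prompts.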
cs.CV / 34 / 2603.02560
CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration
CAWM-Mamba:一种用于红外-可见图像融合和复合恶劣天气恢复的统一模型
Abstract
Multimodal Image Fusion (MMIF) integrates complementary information from various modalities to produce clearer and more informative fused images. MMIF under adverse weather is particularly crucial in autonomous driving and UAV monitoring applications. However, existing adverse weather fusion methods generally only tackle single types of degradation such as haze, rain, or snow, and fail when multiple degradations coexist (e.g., haze+rain, rain+snow). To address this challenge, we propose Compound Adverse Weather Mamba (CAWM-Mamba), the first end-to-end framework that jointly performs image fusion and compound weather restoration with unified shared weights. Our network contains three key components: (1) a Weather-Aware Preprocess Module (WAPM) to enhance degraded visible features and extracts global weather embeddings; (2) a Cross-modal Feature Interaction Module (CFIM) to facilitate the alignment of heterogeneous modalities and exchange of complementary features across modalities; and (3) a Wavelet Space State Block (WSSB) that leverages wavelet-domain decomposition to decouple multi-frequency degradations. WSSB includes Freq-SSM, a module that models anisotropic high-frequency degradation without redundancy, and a unified degradation representation mechanism to further improve generalization across complex compound weather conditions. Extensive experiments on the AWMM-100K benchmark and three standard fusion datasets demonstrate that CAWM-Mamba consistently outperforms state-of-the-art methods in both compound and single-weather scenarios. In addition, our fusion results excel in downstream tasks covering semantic segmentation and object detection, confirming the practical value in real-world adverse weather perception. The source code will be available at https://github.com/Feecuin/CAWM-Mamba.
Chinese Translation
多模态图像融合(MMIF)集成来自不同模态的互补信息,以生成更清晰、更具信息量的融合图像。在恶劣天气条件下进行MMIF在自动驾驶和无人机监测应用中尤为重要。然而,现有的恶劣天气融合方法通常仅处理单一类型的退化,如雾、雨或雪,当多种退化共存(例如,雾+雨、雨+雪)时则无法有效应对。为了解决这一挑战,我们提出了复合恶劣天气Mamba(CAWM-Mamba),这是第一个端到端框架,能够通过统一的共享权重同时执行图像融合和复合天气恢复。我们的网络包含三个关键组件:(1)天气感知预处理模块(WAPM),用于增强退化的可见特征并提取全局天气嵌入;(2)跨模态特征交互模块(CFIM),以促进异构模态的对齐和跨模态互补特征的交换;(3)小波空间状态块(WSSB),利用小波域分解来解耦多频率退化。WSSB包括Freq-SSM,一个建模各向异性高频退化而不产生冗余的模块,以及一个统一的退化表示机制,以进一步提高在复杂复合天气条件下的泛化能力。在AWMM-100K基准和三个标准融合数据集上的大量实验表明,CAWM-Mamba在复合和单一天气场景中始终优于最先进的方法。此外,我们的融合结果在涵盖语义分割和目标检测的下游任务中表现出色,确认了其在现实世界恶劣天气感知中的实际价值。源代码将发布在 https://github.com/Feecuin/CAWM-Mamba。
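The Wavelet Space State Block relies on wavelet-domain decomposition to decouple multi-frequency degradations. As an illustration of that decomposition step only (a single-level Haar transform, not the paper's actual module), consider:

```python
import numpy as np

def haar2d(img):
    """One-level 2D Haar decomposition into a low-frequency sub-band (LL)
    and three high-frequency sub-bands (LH, HL, HH); sides must be even."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0      # low freq: smooth content, haze lives here
    lh = (a - b + c - d) / 2.0      # horizontal high freq
    hl = (a + b - c - d) / 2.0      # vertical high freq (e.g. rain streaks)
    hh = (a - b - c + d) / 2.0      # diagonal high freq
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Exact inverse of haar2d."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w))
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

rng = np.random.default_rng(2)
img = rng.random((64, 64))
ll, lh, hl, hh = haar2d(img)
rec = ihaar2d(ll, lh, hl, hh)
print(ll.shape, bool(np.allclose(rec, img)))
```

Processing the sub-bands separately is what lets a model treat, say, low-frequency haze and anisotropic high-frequency rain streaks with different operators before recombining them losslessly.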
cs.CV / 35 / 2603.02573
Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels
Track4World:面向世界的前馈密集3D跟踪所有像素
Abstract
Estimating the 3D trajectory of every pixel from a monocular video is crucial and promising for a comprehensive understanding of the 3D dynamics of videos. Recent monocular 3D tracking works demonstrate impressive performance, but are limited either to tracking sparse points on the first frame or to a slow optimization-based framework for dense tracking. In this paper, we propose a feedforward model, called Track4World, enabling an efficient holistic 3D tracking of every pixel in the world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate the pixel-wise 2D and 3D dense flow between arbitrary frame pairs. The estimated scene flow, along with the reconstructed 3D geometry, enables subsequent efficient 3D tracking of every pixel of this video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.
Chinese Translation
从单目视频中估计每个像素的3D轨迹对于全面理解视频的3D动态至关重要且前景广阔。近期的单目3D跟踪研究展示了令人印象深刻的性能,但局限于在第一帧上跟踪稀疏点或采用基于优化的慢速框架进行密集跟踪。本文提出了一种前馈模型,称为Track4World,能够在以世界为中心的坐标系统中高效地进行每个像素的整体3D跟踪。Track4World基于由VGGT风格的ViT编码的全局3D场景表示,应用了一种新颖的3D相关方案,以同时估计任意帧对之间的像素级2D和3D密集流。估计的场景流以及重建的3D几何体使得后续对该视频每个像素的高效3D跟踪成为可能。在多个基准上的广泛实验表明,我们的方法在2D/3D流估计和3D跟踪方面始终优于现有方法,突显了其在实际4D重建任务中的鲁棒性和可扩展性。
cs.CV / 36 / 2603.02581
ATD: Improved Transformer with Adaptive Token Dictionary for Image Restoration
ATD:具有自适应令牌字典的改进型变换器用于图像恢复
Abstract
Recently, Transformers have gained significant popularity in image restoration tasks such as image super-resolution and denoising, owing to their superior performance. However, balancing performance and computational burden remains a long-standing problem for transformer-based architectures. Due to the quadratic complexity of self-attention, existing methods often restrict attention to local windows, resulting in limited receptive field and suboptimal performance. To address this issue, we propose Adaptive Token Dictionary (ATD), a novel transformer-based architecture for image restoration that enables global dependency modeling with linear complexity relative to image size. The ATD model incorporates a learnable token dictionary, which summarizes external image priors (i.e., typical image structures) during the training process. To utilize this information, we introduce a token dictionary cross-attention (TDCA) mechanism that enhances the input features via interaction with the learned dictionary. Furthermore, we exploit the category information embedded in the TDCA attention maps to group input features into multiple categories, each representing a cluster of similar features across the image and serving as an attention group. We also integrate the learned category information into the feed-forward network to further improve feature fusion. ATD and its lightweight version ATD-light achieve state-of-the-art performance on multiple image super-resolution benchmarks. Moreover, we develop ATD-U, a multi-scale variant of ATD, to address other image restoration tasks, including image denoising and JPEG compression artifacts removal. Extensive experiments demonstrate the superiority of our proposed models, both quantitatively and qualitatively.
Chinese Translation
近年来,由于其卓越的性能,变换器在图像恢复任务(如图像超分辨率和去噪)中获得了显著的关注。然而,平衡性能与计算负担仍然是基于变换器架构的一个长期问题。由于自注意力的二次复杂性,现有方法通常将注意力限制在局部窗口,从而导致感受野有限和性能不佳。为了解决这一问题,我们提出了自适应令牌字典(Adaptive Token Dictionary, ATD),这是一种新颖的基于变换器的图像恢复架构,能够以与图像大小成线性关系的复杂度进行全局依赖建模。ATD模型结合了一个可学习的令牌字典,在训练过程中总结外部图像先验(即典型图像结构)。为了利用这些信息,我们引入了一种令牌字典交叉注意力(Token Dictionary Cross-Attention, TDCA)机制,通过与学习到的字典进行交互来增强输入特征。此外,我们利用嵌入在TDCA注意力图中的类别信息,将输入特征分组为多个类别,每个类别代表图像中相似特征的聚类,并作为一个注意力组。我们还将学习到的类别信息整合到前馈网络中,以进一步改善特征融合。ATD及其轻量版本ATD-light在多个图像超分辨率基准测试中实现了最先进的性能。此外,我们开发了ATD-U,一个ATD的多尺度变体,以应对其他图像恢复任务,包括图像去噪和JPEG压缩伪影去除。大量实验表明,我们提出的模型在定量和定性上均表现出优越性。
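The token dictionary cross-attention (TDCA) and the attention-map-based category grouping can be sketched with plain scaled dot-product attention. The residual form and argmax grouping below are simplifications of the paper's design, shown only to make the mechanism concrete:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tdca(tokens, dictionary):
    """Enhance image tokens (N, d) by attending over a learnable token
    dictionary (M, d); the attention map also yields a per-token category.

    Because M is a small constant, the cost is O(N * M), i.e. linear in
    the number of image tokens rather than quadratic self-attention."""
    d = tokens.shape[-1]
    attn = softmax(tokens @ dictionary.T / np.sqrt(d))   # (N, M)
    enhanced = tokens + attn @ dictionary                # residual enhancement
    categories = attn.argmax(axis=-1)                    # dictionary entry id
    return enhanced, categories

rng = np.random.default_rng(3)
tokens = rng.standard_normal((64, 32))       # flattened feature-map tokens
dictionary = rng.standard_normal((8, 32))    # M=8 learned prior tokens
enhanced, cats = tdca(tokens, dictionary)
print(enhanced.shape, cats.shape)
```

Tokens sharing a category index form the clusters that the paper then uses as attention groups, which is what keeps global modeling linear in image size.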
cs.CV / 37 / 2603.02582
Neural Electromagnetic Fields for High-Resolution Material Parameter Reconstruction
用于高分辨率材料参数重建的神经电磁场
Abstract
Creating functional Digital Twins, simulatable 3D replicas of the real world, is a central challenge in computer vision. Current methods like NeRF produce visually rich but functionally incomplete twins. The key barrier is the lack of underlying material properties (e.g., permittivity, conductivity). Acquiring this information for every point in a scene via non-contact, non-invasive sensing is a primary goal, but it demands solving a notoriously ill-posed physical inversion problem. Standard remote signals, like images and radio frequencies (RF), deeply entangle the unknown geometry, ambient field, and target materials. We introduce NEMF, a novel framework for dense, non-invasive physical inversion designed to build functional digital twins. Our key insight is a systematic disentanglement strategy. NEMF leverages high-fidelity geometry from images as a powerful anchor, which first enables the resolution of the ambient field. By constraining both geometry and field using only non-invasive data, the original ill-posed problem transforms into a well-posed, physics-supervised learning task. This transformation unlocks our core inversion module: a decoder. Guided by ambient RF signals and a differentiable layer incorporating physical reflection models, it learns to explicitly output a continuous, spatially-varying field of the scene's underlying material parameters. We validate our framework on high-fidelity synthetic datasets. Experiments show our non-invasive inversion reconstructs these material maps with high accuracy, and the resulting functional twin enables high-fidelity physical simulation. This advance moves beyond passive visual replicas, enabling the creation of truly functional and simulatable models of the physical world.
Chinese Translation
创建功能性数字双胞胎,即可模拟的现实世界三维复制品,是计算机视觉中的一个核心挑战。目前的方法如 NeRF 生成了视觉上丰富但功能上不完整的双胞胎。关键障碍在于缺乏基础材料属性(例如,介电常数、导电性)。通过非接触、非侵入式传感获取场景中每个点的这些信息是一个主要目标,但这需要解决一个著名的病态物理反演问题。标准的远程信号,如图像和射频(RF),深度纠缠了未知的几何形状、环境场和目标材料。我们提出了 NEMF,一个用于密集、非侵入式物理反演的新框架,旨在构建功能性数字双胞胎。我们的关键见解是一种系统的解耦策略。NEMF 利用来自图像的高保真几何形状作为强有力的锚点,首先使得环境场的分辨成为可能。通过仅使用非侵入式数据约束几何形状和场,原始的病态问题转变为一个良态的、物理监督的学习任务。这一转变解锁了我们的核心反演模块:解码器。在环境 RF 信号和一个包含物理反射模型的可微层的指导下,它学习明确输出场景底层材料参数的连续、空间变化的场。我们在高保真合成数据集上验证了我们的框架。实验表明,我们的非侵入式反演以高精度重建了这些材料图,并且生成的功能性双胞胎能够实现高保真的物理模拟。这一进展超越了被动的视觉复制品,使得创建真正功能性和可模拟的物理世界模型成为可能。
cs.CV / 38 / 2603.02591
Maximizing Generalization: The Effect of Different Augmentation Techniques on Lightweight Vision Transformer for Bengali Character Classification
最大化泛化能力:不同数据增强技术对轻量级视觉变换器在孟加拉字符分类中的影响
Abstract
Deep learning models have proven to be highly effective in computer vision, with deep convolutional neural networks achieving impressive results across various computer vision tasks. However, these models rely heavily on large datasets to avoid overfitting. When a model learns features with either low or high variance, it can lead to underfitting or overfitting on the training data. Unfortunately, large-scale datasets may not be available in many domains, particularly for resource-limited languages such as Bengali. In this experiment, a series of tests were conducted in the field of image data augmentation as an approach to addressing the limited data problem for Bengali handwritten characters. The study also provides an in-depth analysis of the performance of different augmentation techniques. Data augmentation refers to a set of techniques applied to data to increase its size and diversity, making it more suitable for training deep learning models. The image augmentation techniques evaluated in this study include CLAHE, Random Rotation, Random Affine, Color Jitter, and their combinations. The study further explores the use of augmentation methods with a lightweight model such as EfficientViT. Among the different augmentation strategies, the combination of Random Affine and Color Jitter produced the best accuracy on the Ekush [1] and AIBangla [2] datasets, achieving accuracies of 97.48% and 97.57%, respectively. This combination outperformed all other individual and combined augmentation techniques. Overall, this analysis presents a thorough examination of the impact of image data augmentation in resource-scarce languages, particularly in the context of Bengali handwritten character recognition using lightweight models.
Chinese Translation
深度学习模型在计算机视觉领域已被证明具有高度有效性,深度卷积神经网络在各种计算机视觉任务中取得了令人瞩目的成果。然而,这些模型在避免过拟合时严重依赖于大规模数据集。当模型学习到低方差或高方差的特征时,可能导致在训练数据上出现欠拟合或过拟合。不幸的是,在许多领域,尤其是资源有限的语言(如孟加拉语)中,可能无法获得大规模数据集。在本实验中,进行了系列测试,旨在通过图像数据增强的方法来解决孟加拉手写字符的数据不足问题。该研究还对不同增强技术的性能进行了深入分析。数据增强是指一组应用于数据的技术,以增加其规模和多样性,使其更适合训练深度学习模型。本研究评估的图像增强技术包括CLAHE、随机旋转、随机仿射、颜色抖动及其组合。研究进一步探讨了与轻量级模型(如EfficientViT)结合使用增强方法的效果。在不同的增强策略中,随机仿射与颜色抖动的组合在Ekush [1] 和AIBangla [2] 数据集上取得了最佳准确率,分别达到97.48%和97.57%。该组合的表现优于所有其他单独和组合的增强技术。总体而言,本分析对在资源稀缺语言中,特别是在使用轻量级模型进行孟加拉手写字符识别时,图像数据增强的影响进行了全面的考察。
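The winning combination reported above (Random Affine + Color Jitter) can be imitated in a minimal numpy sketch. This is deliberately simplified: real pipelines would use a library transform stack, the translation below stands in for a full affine transform, and all ranges are illustrative:

```python
import numpy as np

def color_jitter(img, rng, brightness=0.2, contrast=0.2):
    """Randomly rescale contrast about the mean and shift brightness,
    keeping pixel values in [0, 1]."""
    b = 1.0 + rng.uniform(-brightness, brightness)
    c = 1.0 + rng.uniform(-contrast, contrast)
    mean = img.mean()
    return np.clip((img - mean) * c + mean * b, 0.0, 1.0)

def random_translate(img, rng, max_shift=3):
    """Simplified affine stand-in: integer translation. Note np.roll wraps
    pixels around; a real affine transform would pad or interpolate."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(img, (int(dy), int(dx)), axis=(0, 1))

rng = np.random.default_rng(7)
img = rng.random((28, 28))                  # toy grayscale character image
aug = color_jitter(random_translate(img, rng), rng)
print(aug.shape)
```

Composing geometric and photometric perturbations in this order (geometry first, then color) mirrors the combined strategy the study found most effective on Ekush and AIBangla.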
cs.CV / 39 / 2603.02598
Synthetic-Child: An AIGC-Based Synthetic Data Pipeline for Privacy-Preserving Child Posture Estimation
Synthetic-Child:基于AIGC的隐私保护儿童姿态估计合成数据管道
Abstract
Accurate child posture estimation is critical for AI-powered study companion devices, yet collecting large-scale annotated datasets of children is both expensive and ethically prohibitive due to privacy concerns. We present Synthetic-Child, an AIGC-based synthetic data pipeline that produces photorealistic child posture training images with ground-truth-projected keypoint annotations, requiring zero real child photographs. The pipeline comprises four stages: (1) a programmable 3D child body model (SMPL-X) in Blender generates diverse desk-study poses with IK-constrained anatomical plausibility and automatic COCO-format ground-truth export; (2) a custom PoseInjectorNode feeds 3D-derived skeletons into a dual ControlNet (pose + depth) conditioned on FLUX-1 Dev, synthesizing 12,000 photorealistic images across 10 posture categories with low annotation drift; (3) ViTPose-based confidence filtering and targeted augmentation remove generation failures and improve robustness; (4) RTMPose-M (13.6M params) is fine-tuned on the synthetic data and paired with geometric feature engineering and a lightweight MLP for posture classification, then quantized to INT8 for real-time edge deployment. On a real-child test set (n ≈ 300), the FP16 model achieves 71.2 AP, a +12.5 AP improvement over the COCO-pretrained adult-data baseline at identical model capacity. After INT8 quantization the model retains 70.4 AP while running at 22 FPS on a 0.8-TOPS Rockchip RK3568 NPU. In a single-subject controlled comparison with a commercial posture corrector, our system achieves substantially higher recognition rates across most tested categories and responds ~1.8x faster on average. These results demonstrate that carefully designed AIGC pipelines can substantially reduce dependence on real child imagery while achieving deployment-ready accuracy, with potential applications to other privacy-sensitive domains.
Chinese Translation
准确的儿童姿态估计对于基于人工智能的学习伴侣设备至关重要,但由于隐私问题,收集大规模标注的儿童数据集既昂贵又在伦理上不可行。我们提出了Synthetic-Child,这是一种基于AIGC的合成数据管道,能够生成具有真实感的儿童姿态训练图像,并附带真实关键点标注,完全不需要真实儿童照片。该管道包括四个阶段:(1)在Blender中使用可编程的3D儿童身体模型(SMPL-X)生成多样的桌面学习姿势,具备IK约束的解剖学合理性,并自动导出COCO格式的真实标注;(2)自定义的PoseInjectorNode将3D生成的骨架输入到双重ControlNet(姿态 + 深度)中,基于FLUX-1 Dev合成12,000张真实感图像,涵盖10个姿态类别,且标注漂移较低;(3)基于ViTPose的置信度过滤和针对性增强去除生成失败,提高鲁棒性;(4)RTMPose-M(13.6M参数)在合成数据上进行微调,并结合几何特征工程和轻量级多层感知器(MLP)进行姿态分类,最后量化为INT8以实现实时边缘部署。在一个真实儿童测试集上(n ≈ 300),FP16模型达到了71.2 AP,相较于相同模型容量的COCO预训练成人数据基线提高了12.5 AP。经过INT8量化后,模型仍保持70.4 AP,并在0.8-TOPS Rockchip RK3568 NPU上以22 FPS运行。在与商业姿态矫正器的单一受试者对照比较中,我们的系统在大多数测试类别中实现了显著更高的识别率,并且平均响应速度快约1.8倍。这些结果表明,精心设计的AIGC管道可以显著减少对真实儿童图像的依赖,同时实现可部署的准确性,并在其他隐私敏感领域具有潜在应用。
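Stage (3), ViTPose-based confidence filtering, amounts to discarding generated images whose detected keypoints are too uncertain. A sketch of that filter follows; the confidence threshold and minimum-keypoint count are assumptions, not values from the paper:

```python
import numpy as np

def filter_by_keypoint_confidence(kp_conf, thresh=0.5, min_valid=10):
    """kp_conf: (n_images, n_keypoints) per-keypoint detection confidences
    from a pose estimator run on the synthetic images.
    Keep an image only if at least min_valid keypoints exceed thresh."""
    valid = (kp_conf > thresh).sum(axis=1)
    return np.flatnonzero(valid >= min_valid)

rng = np.random.default_rng(9)
conf = rng.random((100, 17))               # 17 COCO keypoints per image
kept = filter_by_keypoint_confidence(conf)
print(len(kept))
```

Images that fail the filter are generation failures (implausible bodies, drifted annotations) and are dropped before fine-tuning, which is what keeps annotation drift low downstream.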
cs.CV / 40 / 2603.02609
VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction
VLMFusionOcc3D:基于视觉语言模型的多模态三维语义占用预测
Abstract
This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often struggle with semantic ambiguity in sparse geometric grids and performance degradation under adverse weather conditions. To address these challenges, we leverage the rich linguistic priors of Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts. Our framework begins with a dual-branch feature extraction pipeline that projects multi-view images and LiDAR point clouds into a unified voxel space. We propose Instance-driven VLM Attention (InstVLM), which utilizes gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels. Furthermore, we introduce Weather-Aware Adaptive Fusion (WeathFusion), a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions based on real-time environmental reliability. To ensure structural consistency, a Depth-Aware Geometric Alignment (DAGA) loss is employed to align dense camera-derived geometry with sparse, spatially accurate LiDAR returns. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our plug-and-play modules consistently enhance the performance of state-of-the-art voxel-based baselines. Notably, our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
Chinese Translation
本文介绍了VLMFusionOcc3D,一个用于自主驾驶中密集三维语义占用预测的强大多模态框架。目前基于体素的占用模型在稀疏几何网格中常常面临语义模糊和在恶劣天气条件下性能下降的问题。为了解决这些挑战,我们利用视觉语言模型(VLM)丰富的语言先验,将模糊的体素特征锚定到稳定的语义概念上。我们的框架以双分支特征提取管道为起点,将多视角图像和激光雷达点云投影到统一的体素空间。我们提出了基于实例的VLM注意力(InstVLM),该方法利用门控交叉注意力和经过LoRA调整的CLIP嵌入,将高层次的语义和地理先验直接注入到三维体素中。此外,我们引入了天气感知自适应融合(WeathFusion),这是一种动态门控机制,利用车辆元数据和天气条件提示,根据实时环境可靠性重新加权传感器贡献。为了确保结构一致性,采用深度感知几何对齐(DAGA)损失,将密集相机生成的几何体与稀疏的空间准确的激光雷达返回结果对齐。在nuScenes和SemanticKITTI数据集上的大量实验表明,我们的即插即用模块始终增强了最先进的基于体素的基线性能。值得注意的是,我们的方法在具有挑战性的天气场景中取得了显著的改进,为复杂城市导航提供了一种可扩展且稳健的解决方案。
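The weather-aware fusion idea in this abstract, re-weighting sensor contributions by estimated environmental reliability, can be illustrated as a softmax gate over per-sensor voxel features. This is a minimal sketch under assumed names and shapes, not the paper's WeathFusion implementation; the reliability logits stand in for whatever the metadata/prompt network would produce.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weather_aware_fusion(cam_feat, lidar_feat, reliability_logits):
    """Blend two voxel feature maps with a per-sensor gate.

    cam_feat, lidar_feat : (N, C) voxel features from each branch.
    reliability_logits   : (2,) scores, e.g. produced by a small network
                           from weather metadata (higher = more reliable).
    """
    w = softmax(reliability_logits)            # (2,) gate weights, sum to 1
    return w[0] * cam_feat + w[1] * lidar_feat

# When the camera is judged reliable its gate dominates; when conditions
# degrade it, the gate shifts weight toward LiDAR.
cam = np.ones((4, 8))
lid = np.zeros((4, 8))
fused_clear = weather_aware_fusion(cam, lid, np.array([2.0, 0.0]))
fused_rainy = weather_aware_fusion(cam, lid, np.array([0.0, 2.0]))
```

Because the gate is a convex combination, the fused features stay in the range spanned by the two branches regardless of how the logits move.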
cs.CV / 41 / 2603.02618
Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs
注意选择负文本的方式:在使用视觉语言模型进行OOD检测时追求距离一致性
Abstract
Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency with the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
Chinese Translation
分布外(OOD)检测旨在识别来自未知类别的样本,这是在开放世界场景中部署机器学习模型的关键能力。近期研究表明,视觉语言模型(VLMs)能够有效利用其多模态表示进行OOD检测。然而,目前的方法通常在OOD检测中引入了模态内距离,例如将负文本与已知类别标签进行比较,或将测试图像与图像代理进行比较。这种设计范式与CLIP类VLMs所优化的模态间距离存在固有的不一致性,可能导致次优性能。为了解决这一局限性,我们提出了InterNeg,这是一个简单而有效的框架,系统地利用来自文本和视觉角度的一致模态间距离增强。从文本角度来看,我们设计了一种选择负文本的模态间标准。从视觉角度来看,我们动态识别高置信度的OOD图像,并将其反转到文本空间,生成由模态间距离指导的额外负文本嵌入。我们在多个基准测试上的广泛实验表明了我们方法的优越性。值得注意的是,与现有工作相比,我们的InterNeg在大规模ImageNet基准测试中实现了3.47%的FPR95降低,在具有挑战性的Near-OOD基准测试中实现了5.50%的AUROC提升,达到了最先进的性能。
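The inter-modal-only scoring idea above, where ID labels and negative texts compete for the image embedding in one softmax so that no text-to-text comparison is ever needed, can be sketched as follows. This is an illustrative MCM-style score under assumed shapes and a temperature of 0.07, not the paper's exact criterion.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def ood_score(img_emb, id_text_embs, neg_text_embs, tau=0.07):
    """Score a test image using only image-to-text (inter-modal) distances.

    Higher score = more likely in-distribution. ID labels and negative
    texts compete in a single softmax over cosine similarity to the image,
    so the decision never relies on intra-modal (text-text) comparisons.
    """
    img = l2norm(img_emb)
    texts = l2norm(np.concatenate([id_text_embs, neg_text_embs], axis=0))
    sims = texts @ img / tau                 # cosine similarity / temperature
    probs = np.exp(sims - sims.max())
    probs = probs / probs.sum()
    return probs[: len(id_text_embs)].sum()  # probability mass on ID labels

# Toy check with random embeddings standing in for CLIP text features:
rng = np.random.default_rng(0)
id_texts = rng.normal(size=(3, 16))
neg_texts = rng.normal(size=(5, 16))
s_id = ood_score(id_texts[0], id_texts, neg_texts)    # image ~ an ID concept
s_ood = ood_score(neg_texts[0], id_texts, neg_texts)  # image ~ a negative
```

An image aligned with an ID concept keeps most of the softmax mass on the ID labels, while one aligned with a negative text loses it, which is the behavior the inter-modal criterion is meant to preserve.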
cs.CV / 42 / 2603.02619
Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild
针对单幅图像到野外3D人类的姿态直接奖励微调
Abstract
Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing a direct reward fine-tuning to maximize PoseScore, which is our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is conducted on DrPose15K, a novel dataset that was constructed from an existing human motion dataset and a pose-conditioned video generative model. Constructed from abundant human pose sequence data, DrPose15K exhibits a broader pose distribution compared to existing 3D human datasets. We validate our approach through evaluation on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, with a particular focus on assessing performance on challenging human poses. Our results demonstrate consistent qualitative and quantitative improvements across all benchmarks. Project page: https://seunguk-do.github.io/drpose.
Chinese Translation
单视角3D人类重建通过采用多视角扩散模型取得了显著进展,但恢复的3D人类往往表现出不自然的姿态。当重建具有动态或挑战性姿态的3D人类时,这种现象尤为明显,我们将其归因于可用的多样化姿态的3D人类数据集规模有限。为了解决这一限制,我们提出了DrPose,即针对姿态的直接奖励微调算法,该算法使得在多样化姿态上对多视角扩散模型进行后训练,而无需昂贵的3D人类资产。DrPose仅使用与单视图图像配对的人类姿态来训练模型,采用直接奖励微调以最大化PoseScore,这是我们提出的可微分奖励,用于量化生成的多视角潜在图像与真实人类姿态之间的一致性。该优化在DrPose15K上进行,这是一个新构建的数据集,来源于现有的人类运动数据集和姿态条件的视频生成模型。DrPose15K由丰富的人类姿态序列数据构成,与现有的3D人类数据集相比,展现了更广泛的姿态分布。我们通过在传统基准数据集、野外图像和新构建的基准上进行评估来验证我们的方法,特别关注在挑战性人类姿态上的性能评估。我们的结果在所有基准上都显示出一致的定性和定量改进。项目页面:https://seunguk-do.github.io/drpose.
cs.CV / 43 / 2603.02629
Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective
朝向增量统一的多模态异常检测:从信息瓶颈视角增强多模态去噪
Abstract
The quest for incremental unified multimodal anomaly detection seeks to empower a single model with the ability to systematically detect anomalies across all categories and support incremental learning to accommodate emerging objects/categories. Central to this pursuit is resolving the catastrophic forgetting dilemma, which involves acquiring new knowledge while preserving prior learned knowledge. Despite some efforts to address this dilemma, a key oversight persists: ignoring the potential impact of spurious and redundant features on catastrophic forgetting. In this paper, we delve into the negative effect of spurious and redundant features on this dilemma in incremental unified frameworks, and reveal that under similar conditions, the multimodal framework developed by naive aggregation of unimodal architectures is more prone to forgetting. To address this issue, we introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and information bottleneck fusion module: the former dedicated to disentangle inter-object feature coupling, preventing spurious feature interference between objects; the latter serves to filter out redundant features from the fused features, thus explicitly preserving discriminative information. A series of theoretical analyses and experiments on MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD.
Chinese Translation
增量统一多模态异常检测的探索旨在赋予单一模型系统性检测所有类别异常的能力,并支持增量学习以适应新出现的对象/类别。实现这一目标的核心在于解决灾难性遗忘难题,即在获取新知识的同时保留先前学习的知识。尽管已有一些努力试图解决这一难题,但一个关键的疏忽仍然存在:忽视了虚假和冗余特征对灾难性遗忘的潜在影响。本文深入探讨了虚假和冗余特征在增量统一框架中对这一难题的负面影响,并揭示在类似条件下,通过简单聚合单模态架构开发的多模态框架更容易遗忘。为了解决这一问题,我们引入了一种新颖的去噪框架,称为 IB-IUMAD,该框架利用 Mamba 解码器和信息瓶颈融合模块的互补优势:前者专注于解耦对象间特征耦合,防止对象间的虚假特征干扰;后者则用于从融合特征中过滤冗余特征,从而明确保留判别信息。在 MVTec 3D-AD 和 Eyecandies 数据集上的一系列理论分析和实验表明,IB-IUMAD 的有效性和竞争性能。
cs.CV / 44 / 2603.02648
SEP-YOLO: Fourier-Domain Feature Representation for Transparent Object Instance Segmentation
SEP-YOLO:用于透明物体实例分割的傅里叶域特征表示
Abstract
Transparent object instance segmentation presents significant challenges in computer vision, due to the inherent properties of transparent objects, including boundary blur, low contrast, and high dependence on background context. Existing methods often fail as they depend on strong appearance cues and clear boundaries. To address these limitations, we propose SEP-YOLO, a novel framework that integrates a dual-domain collaborative mechanism for transparent object instance segmentation. Our method incorporates a Frequency Domain Detail Enhancement Module, which separates and enhances weak high-frequency boundary components via learnable complex weights. We further design a multi-scale spatial refinement stream, which consists of a Content-Aware Alignment Neck and a Multi-scale Gated Refinement Block, to ensure precise feature alignment and boundary localization in deep semantic features. We also provide high-quality instance-level annotations for the Trans10K dataset, filling the critical data gap in transparent object instance segmentation. Extensive experiments on the Trans10K and GVD datasets show that SEP-YOLO achieves state-of-the-art (SOTA) performance.
Chinese Translation
透明物体实例分割在计算机视觉中面临重大挑战,这主要源于透明物体的固有特性,包括边界模糊、低对比度以及对背景上下文的高度依赖。现有方法往往依赖于强烈的外观线索和清晰的边界,因此效果不佳。为了解决这些局限性,我们提出了SEP-YOLO,一个集成了双域协作机制的新框架,用于透明物体实例分割。我们的方法结合了频域细节增强模块,该模块通过可学习的复数权重分离和增强弱的高频边界成分。此外,我们还设计了一个多尺度空间精炼流,由内容感知对齐颈部和多尺度门控精炼块组成,以确保深层语义特征中的精确特征对齐和边界定位。我们还为Trans10K数据集提供了高质量的实例级注释,填补了透明物体实例分割中的关键数据缺口。在Trans10K和GVD数据集上的大量实验表明,SEP-YOLO达到了最先进的(SOTA)性能。
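The frequency-domain enhancement described above, re-weighting a feature map's spectrum with complex weights so weak high-frequency boundary components are amplified, can be sketched with a 2-D FFT. The radial boost below is a hand-built stand-in for the learnable weights; all names and the specific weighting are assumptions, not the module's actual design.

```python
import numpy as np

def frequency_enhance(feat, weights):
    """Re-weight a feature map's 2-D spectrum with complex weights.

    feat    : (H, W) real-valued feature map.
    weights : (H, W) complex array; during training these would be learned
              so that weak high-frequency boundary components are boosted.
    """
    spec = np.fft.fft2(feat)
    return np.real(np.fft.ifft2(spec * weights))

# Hand-built stand-in for learned weights: amplify away from DC.
H = W = 8
fy = np.fft.fftfreq(H)[:, None]
fx = np.fft.fftfreq(W)[None, :]
boost = (1.0 + 4.0 * np.sqrt(fy ** 2 + fx ** 2)).astype(complex)  # >1 off DC

feat = np.outer(np.hanning(H), np.hanning(W))  # smooth stand-in feature map
enhanced = frequency_enhance(feat, boost)
```

Because the weight at DC is 1, the map's mean is preserved while its high-frequency content (and hence variance) grows, which is the sharpening effect such a module aims for.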
cs.CV / 45 / 2603.02658
OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning
OmniFashion:通过多任务视觉-语言学习实现通用时尚智能
Abstract
Fashion intelligence spans multiple tasks, namely retrieval, recommendation, recognition, and dialogue, yet remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of consistent visual-semantic structures, preventing recent vision-language models (VLMs) from serving as a generalist fashion brain that unifies understanding and reasoning across tasks. Therefore, we construct FashionX, a million-scale dataset that exhaustively annotates visible fashion items within an outfit and organizes attributes from the global to the part level. Built upon this foundation, we propose OmniFashion, a unified vision-language framework that bridges diverse fashion tasks under a unified fashion dialogue paradigm, enabling both multi-task reasoning and interactive dialogue. Experiments on multi-subtask and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, offering a scalable path toward universal, dialogue-oriented fashion intelligence.
Chinese Translation
时尚智能涵盖多个任务,即检索、推荐、识别和对话,但仍受到碎片化监督和不完整时尚注释的限制。这些局限性共同阻碍了一致的视觉-语义结构的形成,阻止了近期的视觉-语言模型(VLMs)作为通用时尚大脑的功能,无法统一跨任务的理解和推理。因此,我们构建了FashionX,这是一个百万规模的数据集,全面注释了服装中的可见时尚物品,并从全局到部分级别组织属性。在此基础上,我们提出了OmniFashion,一个统一的视觉-语言框架,旨在通过统一的时尚对话范式连接多样的时尚任务,实现多任务推理和互动对话。在多子任务和检索基准上的实验表明,OmniFashion在任务级别准确性和跨任务泛化方面表现出色,突显了其为通用、面向对话的时尚智能提供可扩展路径的潜力。
cs.CV / 46 / 2603.02667
DREAM: Where Visual Understanding Meets Text-to-Image Generation
DREAM:视觉理解与文本到图像生成的交汇
Abstract
Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives, while learning strong visual representations. DREAM is built on two key techniques: During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding to align partially masked image candidates with the target text and select the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, allowing unified multimodal models that excel at both visual understanding and generation.
Chinese Translation
在一个单一模型中统一视觉表征学习和文本到图像(T2I)生成仍然是多模态学习中的一个核心挑战。我们提出了DREAM,一个统一框架,能够同时优化判别和生成目标,同时学习强大的视觉表征。DREAM基于两项关键技术构建:在训练过程中,Masking Warmup(逐步掩蔽)从最小掩蔽开始,以建立进行表征学习所需的对比对齐,然后逐渐过渡到完全掩蔽,以实现稳定的生成训练。在推理阶段,DREAM采用Semantically Aligned Decoding(语义对齐解码)将部分掩蔽的图像候选与目标文本对齐,并选择最佳候选进行进一步解码,从而提高文本-图像的保真度(+6.3%),而无需外部重排序器。仅在CC12M上训练,DREAM在ImageNet线性探测准确率上达到了72.7%(比CLIP提高1.1%),FID为4.25(比FLUID提高6.2%),在少量样本分类、语义分割和深度估计中也取得了一致的提升。这些结果表明,判别和生成目标可以是协同的,使得统一的多模态模型在视觉理解和生成方面都表现出色。
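The Masking Warmup idea above, starting with minimal masking so contrastive alignment can form and then ramping up to full masking for generative training, reduces to a simple schedule function. The linear ramp and the 5% starting ratio below are assumptions for illustration; the abstract does not specify the actual schedule shape.

```python
def masking_warmup(step, total_steps, start_ratio=0.05, end_ratio=1.0):
    """Progressive masking ratio: nearly unmasked early (representation
    learning), fully masked late (stable generative training).

    Linear interpolation is an assumed schedule, clamped to [0, 1] progress.
    """
    t = min(max(step / total_steps, 0.0), 1.0)
    return start_ratio + t * (end_ratio - start_ratio)

# Early steps mask almost nothing; the final step masks everything.
early, late = masking_warmup(5, 100), masking_warmup(95, 100)
```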
cs.CV / 47 / 2603.02681
VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation
VisionCreator:一个具备理解、思考、规划和创造能力的原生视觉生成代理模型
Abstract
Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows-capabilities challenging for general models, while workflow-based agents lack specialized knowledge for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology using metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) The VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) Remarkably, our VisionCreator-8B/32B models demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.
Chinese Translation
视觉内容创作任务需要对设计规范和创意工作流程有细致的理解,这对通用模型来说是一个挑战,而基于工作流程的代理缺乏自主创意规划所需的专业知识。为了解决这些挑战,我们提出了VisionCreator,一个将理解、思考、规划和创造(UTPC)能力统一于一个端到端可学习框架中的原生视觉生成代理模型。我们的工作提出了四个关键贡献:(i)VisGenData-4k及其构建方法,利用基于元认知的VisionAgent生成具有明确UTPC结构的高质量创作轨迹;(ii)VisionCreator代理模型,通过渐进式专业化训练(Progressive Specialization Training, PST)和虚拟强化学习(Virtual Reinforcement Learning, VRL)在高保真模拟环境中进行优化,使得在复杂创作任务中稳定高效地获取UTPC能力;(iii)VisGenBench,一个全面的基准,包含1.2k个测试样本,涵盖多种场景,用于多步骤视觉创作能力的标准化评估;(iv)值得注意的是,我们的VisionCreator-8B/32B模型在多个评估维度上表现优于更大规模的闭源模型。总体而言,这项工作为未来视觉生成代理系统的研究奠定了基础。
cs.CV / 48 / 2603.02691
ReCo-Diff: Residual-Conditioned Deterministic Sampling for Cold Diffusion in Sparse-View CT
ReCo-Diff:用于稀视CT的残差条件确定性采样冷扩散
Abstract
Cold and generalized diffusion models have recently shown strong potential for sparse-view CT reconstruction by explicitly modeling deterministic degradation processes. However, existing sampling strategies often rely on ad hoc sampling controls or fixed schedules, which remain sensitive to error accumulation and sampling instability. We propose ReCo-Diff, a residual-conditioned diffusion framework that leverages observation residuals through residual-conditioned self-guided sampling. At each sampling step, ReCo-Diff first produces a null (unconditioned) baseline reconstruction and then conditions subsequent predictions on the observation residual between the predicted image and the measured sparse-view input. This residual-driven guidance provides continuous, measurement-aware correction while preserving a deterministic sampling schedule, without requiring heuristic interventions. Experimental results demonstrate that ReCo-Diff consistently outperforms existing cold diffusion sampling baselines, achieving higher reconstruction accuracy, improved stability, and enhanced robustness under severe sparsity.
Chinese Translation
冷扩散和广义扩散模型最近在稀视CT重建中展现出强大的潜力,通过明确建模确定性降解过程。然而,现有的采样策略往往依赖于临时的采样控制或固定的时间表,这些方法对误差累积和采样不稳定性仍然敏感。我们提出了ReCo-Diff,一种残差条件扩散框架,通过残差条件自引导采样利用观察残差。在每个采样步骤中,ReCo-Diff首先生成一个空(无条件)基线重建,然后根据预测图像与测量的稀视输入之间的观察残差对后续预测进行条件化。这种基于残差的引导提供了持续的、关注测量的校正,同时保持了确定性的采样时间表,而无需启发式干预。实验结果表明,ReCo-Diff在重建精度、稳定性和在严重稀疏情况下的鲁棒性方面,始终优于现有的冷扩散采样基线。
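The residual-conditioned self-guided sampling described above alternates a null (unconditioned) prediction with a correction driven by the observation residual. The toy below replaces the trained restoration network with a moving-average smoother and models the sparse-view measurement as a 1-D mask; everything here is an illustrative stand-in, not the ReCo-Diff pipeline.

```python
import numpy as np

def degrade(x, mask):
    """Forward operator: keep only the measured (sparse-view) entries."""
    return x * mask

def reco_sample(y, mask, predict, steps=20):
    """Residual-conditioned deterministic sampling (toy sketch).

    y       : measured sparse data, y = degrade(x_true, mask).
    predict : stand-in for the trained restoration model.
    Each step forms a baseline prediction, then corrects it with the
    observation residual y - degrade(x_hat), so the estimate stays
    consistent with the measurement throughout the fixed schedule.
    """
    x = y.copy()
    for _ in range(steps):
        x = predict(x)                    # null (unconditioned) baseline
        x = x + (y - degrade(x, mask))    # residual-conditioned correction
    return x

rng = np.random.default_rng(1)
x_true = np.sin(np.linspace(0, 3 * np.pi, 64))
mask = (rng.random(64) < 0.3).astype(float)        # ~30% of views measured
y = degrade(x_true, mask)
smooth = lambda v: np.convolve(v, np.ones(5) / 5, mode="same")
x_hat = reco_sample(y, mask, smooth)
```

The correction step pins the estimate to the measurements exactly at every iteration, which is the continuous, measurement-aware guidance the abstract contrasts with heuristic sampling controls.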
cs.CV / 49 / 2603.02692
FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution
FiDeSR:高保真与细节保持的一步扩散超分辨率
Abstract
Diffusion-based approaches have recently driven remarkable progress in real-world image super-resolution (SR). However, existing methods still struggle to simultaneously preserve fine details and ensure high-fidelity reconstruction, often resulting in suboptimal visual quality. In this paper, we propose FiDeSR, a high-fidelity and detail-preserving one-step diffusion super-resolution framework. During training, we introduce a detail-aware weighting strategy that adaptively emphasizes regions where the model exhibits higher prediction errors. During inference, low- and high-frequency adaptive enhancers further refine the reconstruction without requiring model retraining, enabling flexible enhancement control. To further improve the reconstruction accuracy, FiDeSR incorporates a residual-in-residual noise refinement, which corrects prediction errors in the diffusion noise and enhances fine detail recovery. FiDeSR achieves superior real-world SR performance compared to existing diffusion-based methods, producing outputs with both high perceptual quality and faithful content restoration. The source code will be released at: https://github.com/Ar0Kim/FiDeSR.
Chinese Translation
基于扩散的方法最近在真实世界图像超分辨率(SR)方面取得了显著进展。然而,现有方法仍然难以同时保持细节和确保高保真重建,常常导致视觉质量不佳。本文提出了FiDeSR,一个高保真与细节保持的一步扩散超分辨率框架。在训练过程中,我们引入了一种细节感知加权策略,能够自适应地强调模型预测误差较高的区域。在推理阶段,低频和高频自适应增强器进一步细化重建,无需重新训练模型,从而实现灵活的增强控制。为了进一步提高重建精度,FiDeSR结合了残差中的残差噪声精炼,纠正扩散噪声中的预测误差并增强细节恢复。与现有基于扩散的方法相比,FiDeSR在真实世界的超分辨率性能上表现优越,生成的输出具有高感知质量和忠实的内容恢复。源代码将发布在:https://github.com/Ar0Kim/FiDeSR。
cs.CV / 50 / 2603.02697
ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling
ShareVerse:用于共享世界建模的多智能体一致性视频生成
Abstract
This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.
Chinese Translation
本文提出了ShareVerse,一个视频生成框架,能够实现多智能体共享世界建模,解决了现有研究中缺乏支持多智能体交互的统一共享世界构建的问题。ShareVerse利用大型视频模型的生成能力,并整合了三项关键创新:1)在CARLA仿真平台上构建了一个用于大规模多智能体交互世界建模的数据集,包含多样的场景、天气条件和交互轨迹,以及配对的多视角视频(每个智能体的前/后/左/右视角)和摄像头数据。2)我们提出了一种空间拼接策略,用于独立智能体的四视角视频,以建模更广泛的环境,并确保内部多视角几何一致性。3)我们将跨智能体注意力模块集成到预训练的视频模型中,使得智能体之间能够交互传递时空信息,确保重叠区域的共享世界一致性以及非重叠区域的合理生成。ShareVerse支持49帧的大规模视频生成,能够准确感知动态智能体的位置,实现一致的共享世界建模。
cs.CV / 51 / 2603.02704
Intelligent Pathological Diagnosis of Gestational Trophoblastic Diseases via Visual-Language Deep Learning Model
基于视觉-语言深度学习模型的妊娠滋养层疾病智能病理诊断
Abstract
The pathological diagnosis of gestational trophoblastic disease (GTD) is time-consuming, relies heavily on the experience of pathologists, and suffers from low consistency in initial diagnoses, which seriously threatens maternal health and reproductive outcomes. We developed an expert model for GTD pathological diagnosis, named GTDoctor. GTDoctor can perform pixel-based lesion segmentation on pathological slides, and output diagnostic conclusions and personalized pathological analysis results. We developed a software system, GTDiagnosis, based on this technology and conducted clinical trials. The retrospective results demonstrated that GTDiagnosis achieved a mean precision of over 0.91 for lesion detection in pathological slides (n=679 slides). In prospective studies, pathologists using GTDiagnosis attained a Positive Predictive Value of 95.59% (n=68 patients). The tool reduced average diagnostic time from 56 to 16 seconds per case (n=285 patients). GTDoctor and GTDiagnosis offer a novel solution for GTD pathological diagnosis, enhancing diagnostic performance and efficiency while maintaining clinical interpretability.
Chinese Translation
妊娠滋养层疾病(GTD)的病理诊断耗时较长,严重依赖病理学家的经验,且初步诊断的一致性较低,这严重威胁到母体健康和生育结果。我们开发了一种用于GTD病理诊断的专家模型,命名为GTDoctor。GTDoctor能够对病理切片进行基于像素的病变分割,并输出诊断结论和个性化的病理分析结果。基于这一技术,我们开发了软件系统GTDiagnosis,并进行了临床试验。回顾性结果表明,GTDiagnosis在病理切片的病变检测中实现了超过0.91的平均精准度(n=679切片)。在前瞻性研究中,使用GTDiagnosis的病理学家获得了95.59%的阳性预测值(n=68患者)。该工具将每例的平均诊断时间从56秒减少至16秒(n=285患者)。GTDoctor和GTDiagnosis为GTD病理诊断提供了一种新颖的解决方案,提升了诊断性能和效率,同时保持了临床可解释性。
cs.CV / 52 / 2603.02710
MiM-DiT: MoE in MoE with Diffusion Transformers for All-in-One Image Restoration
MiM-DiT:基于扩散变换器的混合专家模型中的混合专家用于一体化图像恢复
Abstract
All-in-one image restoration is challenging because different degradation types, such as haze, blur, noise, and low-light, impose diverse requirements on restoration strategies, making it difficult for a single model to handle them effectively. In this paper, we propose a unified image restoration framework that integrates a dual-level Mixture-of-Experts (MoE) architecture with a pretrained diffusion model. The framework operates at two levels: the Inter-MoE layer adaptively combines expert groups to handle major degradation types, while the Intra-MoE layer further selects specialized sub-experts to address fine-grained variations within each type. This design enables the model to achieve coarse-grained adaptation across diverse degradation categories while performing fine-grained modulation for specific intra-class variations, ensuring high specialization in handling complex, real-world corruptions. Extensive experiments demonstrate that the proposed method performs favorably against the state-of-the-art approaches on multiple image restoration tasks.
Chinese Translation
一体化图像恢复面临挑战,因为不同的退化类型(如雾霾、模糊、噪声和低光)对恢复策略提出了多样化的要求,使得单一模型难以有效处理。本文提出了一种统一的图像恢复框架,该框架将双层混合专家(Mixture-of-Experts, MoE)架构与预训练的扩散模型相结合。该框架在两个层面上运行:Inter-MoE 层自适应地组合专家组以处理主要退化类型,而 Intra-MoE 层进一步选择专业的子专家以应对每种类型内的细粒度变化。这一设计使得模型能够在不同的退化类别之间实现粗粒度适应,同时对特定的类内变化进行细粒度调节,确保在处理复杂的现实世界损坏时具有高度专业化。大量实验表明,所提出的方法在多个图像恢复任务中表现优于最先进的方法。
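The two-level routing described above, an outer gate over expert groups (Inter-MoE) and an inner gate over sub-experts within each group (Intra-MoE), can be sketched with dense softmax gates and linear experts. Names, shapes, and the dense (non-sparse) dispatch are illustrative assumptions, not the MiM-DiT architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_level_moe(feat, group_router, sub_routers, experts):
    """Two-level mixture-of-experts dispatch (illustrative sketch).

    feat         : (D,) input feature.
    group_router : (G, D) matrix scoring the major degradation groups.
    sub_routers  : (G, S, D) matrices scoring sub-experts inside each group.
    experts      : (G, S, D, D) one linear expert per (group, sub-expert).
    The outer gate mixes expert groups (Inter-MoE); each group's inner
    gate mixes its specialized sub-experts (Intra-MoE).
    """
    g_w = softmax(group_router @ feat)                  # (G,) outer gate
    out = np.zeros_like(feat)
    for g in range(len(experts)):
        s_w = softmax(sub_routers[g] @ feat)            # (S,) inner gate
        group_out = sum(s_w[s] * (experts[g, s] @ feat)
                        for s in range(experts.shape[1]))
        out += g_w[g] * group_out
    return out

# Sanity check: with identity experts, the nested convex mixing must
# reproduce the input exactly, whatever the routers decide.
rng = np.random.default_rng(0)
D, G, S = 8, 3, 2
feat = rng.normal(size=D)
out = dual_level_moe(feat, rng.normal(size=(G, D)),
                     rng.normal(size=(G, S, D)),
                     np.broadcast_to(np.eye(D), (G, S, D, D)).copy())
```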
cs.CV / 53 / 2603.02712
From "What" to "How": Constrained Reasoning for Autoregressive Image Generation
从“什么”到“如何”:自回归图像生成的约束推理
Abstract
Autoregressive image generation has seen recent improvements with the introduction of chain-of-thought and reinforcement learning. However, current methods merely specify "What" details to depict by rewriting the input prompt, yet fundamentally fail to reason about "How" to structure the overall image. This inherent limitation gives rise to persistent issues, such as spatial ambiguity directly causing unrealistic object overlaps. To bridge this gap, we propose CoR-Painter, a novel framework that pioneers a "How-to-What" paradigm by introducing Constrained Reasoning to guide the autoregressive generation. Specifically, it first deduces "How to draw" by deriving a set of visual constraints from the input prompt, which explicitly govern spatial relationships, key attributes, and compositional rules. These constraints steer the subsequent generation of a detailed description "What to draw", providing a structurally sound and coherent basis for accurate visual synthesis. Additionally, we introduce a Dual-Objective GRPO strategy that specifically optimizes the textual constrained reasoning and visual projection processes to ensure the coherence and quality of the entire generation pipeline. Extensive experiments on T2I-CompBench, GenEval, and WISE demonstrate that our method achieves state-of-the-art performance, with significant improvements in spatial metrics (e.g., +5.41% on T2I-CompBench).
Chinese Translation
自回归图像生成在链式思维和强化学习的引入下取得了近期的进展。然而,当前的方法仅通过重写输入提示来指定“什么”细节进行描绘,但在根本上未能推理“如何”构建整体图像。这一固有的限制导致了持续存在的问题,例如空间模糊直接造成不现实的物体重叠。为了解决这一问题,我们提出了CoR-Painter,一个开创性的框架,通过引入约束推理来指导自回归生成,开创了“如何到什么”的范式。具体而言,它首先通过从输入提示中推导出一组视觉约束,明确规定空间关系、关键属性和构图规则,从而推导出“如何绘制”。这些约束引导后续生成详细描述“绘制什么”,为准确的视觉合成提供了结构合理且连贯的基础。此外,我们引入了一种双目标GRPO策略,专门优化文本约束推理和视觉投影过程,以确保整个生成流程的连贯性和质量。在T2I-CompBench、GenEval和WISE上的大量实验表明,我们的方法达到了最先进的性能,在空间指标上显著提高(例如,在T2I-CompBench上提高了5.41%)。
cs.CV / 54 / 2603.02720
TenExp: Mixture-of-Experts-Based Tensor Decomposition Structure Search Framework
TenExp:基于专家混合的张量分解结构搜索框架
Abstract
Tensor decompositions continue to emerge and attract increasing attention. Selecting a suitable tensor decomposition to exactly capture the low-rank structures behind the data is at the heart of the tensor decomposition field, yet it remains a challenging and relatively under-explored problem. Current tensor decomposition structure search methods are still confined to a fixed factor-interaction family (e.g., tensor contraction) and cannot deliver a mixture of decompositions. To address this problem, we elaborately design a mixture-of-experts-based tensor decomposition structure search framework (termed TenExp), which allows us to dynamically select and activate suitable tensor decompositions in an unsupervised fashion. This framework enjoys two unique advantages over state-of-the-art tensor decomposition structure search methods. First, TenExp can provide a suitable single decomposition beyond a fixed factor-interaction family. Second, TenExp can deliver a suitable mixture of decompositions beyond a single decomposition. Theoretically, we also provide the approximation error bound of TenExp, which reveals its approximation capability. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed TenExp compared to state-of-the-art tensor decomposition-based methods.
Chinese Translation
近年来,张量分解不断涌现并受到越来越多的关注。选择合适的张量分解以准确捕捉数据背后的低秩结构是张量分解领域的核心问题,这仍然是一个具有挑战性且相对未被充分探索的问题。目前的张量分解结构搜索方法仍然受限于固定的因子交互家族(例如,张量收缩),无法提供分解的混合。为了解决这个问题,我们精心设计了一个基于专家混合的张量分解结构搜索框架(称为 TenExp),该框架允许我们以无监督的方式动态选择和激活合适的张量分解。与最先进的张量分解结构搜索方法相比,该框架具有两个独特的优势。首先,TenExp 可以提供超越固定因子交互家族的合适单一分解。其次,TenExp 可以提供超越单一分解的合适分解混合。从理论上讲,我们还提供了 TenExp 的近似误差界,揭示了 TenExp 的近似能力。在合成和真实数据集上的大量实验表明,所提出的 TenExp 相较于最先进的基于张量分解的方法具有优越性。
cs.CV / 55 / 2603.02726
Cross-view geo-localization, Image retrieval, Multiscale geometric modeling, Frequency domain enhancement
跨视角地理定位、图像检索、多尺度几何建模、频域增强
Abstract
Cross-view geo-localization (CVGL) aims to establish spatial correspondences between images captured from significantly different viewpoints and constitutes a fundamental technique for visual localization in GNSS-denied environments. Nevertheless, CVGL remains challenging due to severe geometric asymmetry, texture inconsistency across imaging domains, and the progressive degradation of discriminative local information. Existing methods predominantly rely on spatial domain feature alignment, which is inherently sensitive to large-scale viewpoint variations and local disturbances. To alleviate these limitations, this paper proposes the Spatial and Frequency Domain Enhancement Network (SFDE), which leverages complementary representations from spatial and frequency domains. SFDE adopts a three-branch parallel architecture to model global semantic context, local geometric structure, and statistical stability in the frequency domain, respectively, thereby characterizing consistency across domains from the perspectives of scene topology, multiscale structural patterns, and frequency invariance. The resulting complementary features are jointly optimized in a unified embedding space via progressive enhancement and coupled constraints, enabling the learning of cross-view representations with consistency across multiple granularities. Comprehensive experiments show that SFDE achieves competitive performance and in many cases even surpasses state-of-the-art methods, while maintaining a lightweight and computationally efficient design. Our code is available at https://github.com/Mashuaishuai669/SFDE.
Chinese Translation
跨视角地理定位(CVGL)旨在建立从显著不同视点拍摄的图像之间的空间对应关系,是在无GNSS环境中进行视觉定位的基本技术。然而,由于严重的几何不对称性、成像领域间的纹理不一致性以及判别性局部信息的逐步退化,CVGL仍然面临挑战。现有方法主要依赖于空间域特征对齐,这本质上对大规模视点变化和局部干扰非常敏感。为了解决这些限制,本文提出了空间与频域增强网络(SFDE),该网络利用来自空间域和频域的互补表示。SFDE采用三分支并行架构,分别建模全局语义上下文、局部几何结构和频域中的统计稳定性,从而从场景拓扑、多尺度结构模式和频率不变性的角度表征跨域一致性。通过渐进增强和耦合约束,所得到的互补特征在统一的嵌入空间中共同优化,使得跨视角表示的学习在多个粒度上保持一致。综合实验表明,SFDE在性能上具有竞争力,并且在许多情况下甚至超越了最先进的方法,同时保持轻量和计算效率高的设计。我们的代码可在 https://github.com/Mashuaishuai669/SFDE 获取。
cs.CV / 56 / 2603.02727
Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation
门控差分线性注意力:高保真医学分割的线性时间解码器
Abstract
Medical image segmentation requires models that preserve fine anatomical boundaries while remaining efficient for clinical deployment. While transformers capture long-range dependencies, they suffer from quadratic attention cost and large data requirements, whereas CNNs are compute-friendly yet struggle with global reasoning. Linear attention offers $\mathcal{O}(N)$ scaling, but often exhibits training instability and attention dilution, yielding diffuse maps. We introduce PVT-GDLA, a decoder-centric Transformer that restores sharp, long-range dependencies at linear time. Its core, Gated Differential Linear Attention (GDLA), computes two kernelized attention paths on complementary query/key subspaces and subtracts them with a learnable, channel-wise scale to cancel common-mode noise and amplify relevant context. A lightweight, head-specific gate injects nonlinearity and input-adaptive sparsity, mitigating attention sink, and a parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions, improving boundary fidelity, all while retaining $\mathcal{O}(N)$ complexity and low parameter overhead. Coupled with a pretrained Pyramid Vision Transformer (PVT) encoder, PVT-GDLA achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameters but lower FLOPs than CNN-, Transformer-, hybrid-, and linear-attention baselines. PVT-GDLA provides a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings.
Chinese Translation
医学图像分割需要在保持精细解剖边界的同时,具备临床部署的高效性。尽管变换器(transformers)能够捕捉长距离依赖关系,但它们面临二次注意力成本和大量数据需求的问题,而卷积神经网络(CNNs)则计算友好,但在全局推理方面表现不佳。线性注意力提供了 $\mathcal{O}(N)$ 的扩展性,但通常表现出训练不稳定和注意力稀释,导致模糊的映射。我们提出了 PVT-GDLA,这是一种以解码器为中心的变换器,能够以线性时间恢复清晰的长距离依赖关系。其核心,门控差分线性注意力(Gated Differential Linear Attention, GDLA),在互补的查询/键子空间上计算两个核化的注意力路径,并通过可学习的通道级缩放将它们相减,以消除共模噪声并增强相关上下文。一个轻量级的、特定于头的门控机制引入非线性和输入自适应稀疏性,减轻了注意力消耗,同时一个并行的局部令牌混合分支结合深度卷积增强了邻近令牌之间的交互,提高了边界保真度,同时保持 $\mathcal{O}(N)$ 的复杂性和低参数开销。结合预训练的金字塔视觉变换器(Pyramid Vision Transformer, PVT)编码器,PVT-GDLA 在 CT、MRI、超声和皮肤镜基准测试中,在相同的训练预算下实现了最先进的准确性,且与 CNN、变换器、混合和线性注意力基线相比,参数相当但 FLOPs 更低。PVT-GDLA 为在临床环境和其他资源受限的设置中实现快速、可扩展的高保真医学分割提供了一条实用路径。
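The core GDLA operation described in this abstract, two kernelized linear-attention paths on complementary query/key halves, subtracted with a learnable scale and passed through a sigmoid gate, can be sketched directly. The feature map (elu + 1), the channel layout, and all shapes below are assumptions for illustration, not the published implementation.

```python
import numpy as np

def elu1(x):
    """Positive feature map for linear attention: elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attn(q, k, v, eps=1e-6):
    """O(N) attention: phi(Q) @ (phi(K)^T V), row-normalized.

    The (d, dv) summary kf.T @ v is independent of sequence length N,
    which is where the linear complexity comes from.
    """
    qf, kf = elu1(q), elu1(k)
    kv = kf.T @ v                       # (d, dv) summary
    z = qf @ kf.sum(axis=0)             # (N,) normalizer
    return (qf @ kv) / (z[:, None] + eps)

def gdla(q, k, v, lam, gate_logits):
    """Gated differential linear attention (illustrative sketch).

    Q/K are split into complementary halves; the two attention maps are
    subtracted with a learnable scale `lam` to cancel common-mode context,
    then a sigmoid gate injects input-adaptive sparsity.
    """
    d = q.shape[1] // 2
    a1 = linear_attn(q[:, :d], k[:, :d], v)
    a2 = linear_attn(q[:, d:], k[:, d:], v)
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    return gate * (a1 - lam * a2)

rng = np.random.default_rng(0)
N, dv = 6, 4
q, k = rng.normal(size=(N, 8)), rng.normal(size=(N, 8))
v = rng.normal(size=(N, dv))
out = gdla(q, k, v, lam=0.5, gate_logits=np.zeros((N, dv)))
```

Each linear-attention path is row-stochastic (its weights sum to one per query), so feeding a constant value tensor returns that constant, a quick invariant to check when implementing such kernels.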
cs.CV / 57 / 2603.02743
CoShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model
CoShadow:基于扩散模型的多对象阴影生成用于图像合成
Abstract
Realistic shadow generation is crucial for achieving seamless image compositing, yet existing methods primarily focus on single-object insertion and often fail to generalize when multiple foreground objects are composited into a background scene. In practice, however, modern compositing pipelines and real-world applications often insert multiple objects simultaneously, necessitating shadows that are jointly consistent in terms of geometry, attachment, and location. In this paper, we address the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple inserted objects. Our approach exploits the multimodal capabilities of a pre-trained text-to-image diffusion model. An image pathway injects dense, multi-scale features to provide fine-grained spatial guidance, while a text-based pathway encodes per-object shadow bounding boxes as learned positional tokens and fuses them via cross-attention. An attention-alignment loss further grounds these tokens to their corresponding shadow regions. To support this task, we augment the DESOBAv2 dataset by constructing composite scenes with multiple inserted objects and automatically derive prompts combining object category and shadow positioning information. Experimental results demonstrate that our method achieves state-of-the-art performance in both single and multi-object shadow generation settings.
Chinese Translation
真实的阴影生成对实现无缝图像合成至关重要,但现有方法主要集中于单对象插入,且在多个前景对象合成到背景场景时往往无法很好地泛化。然而,在实际应用中,现代合成流程和现实世界的应用通常会同时插入多个对象,这就需要在几何形状、附着和位置上共同一致的阴影。本文针对多对象阴影生成这一尚未充分探索的问题,旨在为多个插入对象合成物理上合理的阴影。我们的方法利用了预训练的文本到图像扩散模型的多模态能力。图像通道注入密集的多尺度特征,以提供细粒度的空间指导,而基于文本的通道则将每个对象的阴影边界框编码为学习到的位置标记,并通过交叉注意力进行融合。注意力对齐损失进一步将这些标记与其对应的阴影区域进行关联。为了支持这一任务,我们通过构建包含多个插入对象的复合场景来增强DESOBAv2数据集,并自动生成结合对象类别和阴影定位信息的提示。实验结果表明,我们的方法在单对象和多对象阴影生成设置中均实现了最先进的性能。
cs.CV / 58 / 2603.02748
iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding
iGVLM:用于问题感知多模态理解的动态指令引导视觉编码
Abstract
Despite the success of Large Vision--Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.
Chinese Translation
尽管大型视觉-语言模型(LVLMs)取得了成功,但大多数现有架构仍面临表示瓶颈:它们依赖于静态的、与指令无关的视觉编码器,其视觉表示在不同文本任务中以不变的方式被利用。这种刚性阻碍了在任务特定视觉线索至关重要的场景中的细粒度推理。为了解决这个问题,我们提出了iGVLM,一个用于指令引导视觉调制的通用框架。iGVLM引入了一种解耦的双分支架构:一个冻结的表示分支,保留在预训练过程中学习的与任务无关的视觉表示,以及一个动态条件分支,通过自适应层归一化(AdaLN)执行仿射特征调制。这种设计使得从通用感知到指令感知推理的平滑过渡成为可能,同时保持预训练视觉先验的结构完整性和稳定性。除了标准基准外,我们还引入了MM4,一个用于在多查询、多指令设置下量化逻辑一致性的受控诊断探针。大量结果表明,iGVLM在不同语言骨干网络上始终增强了对指令的敏感性,为连接被动感知和主动推理提供了一种即插即用的范式。
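The AdaLN-based affine modulation performed by the dynamic conditioning branch can be sketched as follows. This is a simplified illustration: in the actual model the per-channel gamma/beta would be regressed from an instruction embedding, whereas here they are passed in directly.

```python
import math

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm over a feature vector.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_modulate(x, gamma, beta):
    """AdaLN: normalize, then apply an instruction-conditioned affine transform.
    In the real model gamma/beta come from a conditioning network (assumed away here)."""
    return [g * h + b for h, g, b in zip(layer_norm(x), gamma, beta)]
```

Identity modulation (gamma = 1, beta = 0) recovers plain LayerNorm, which is how a conditioning branch can start from a no-op and smoothly learn instruction-dependent shifts on top of the frozen representation branch.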
cs.CV / 59 / 2603.02754
Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing
无需训练的清晰视觉:减轻多模态大语言模型在遥感中的幻觉
Abstract
Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: https://github.com/MiliLab/RADAR
Chinese Translation
多模态大语言模型(MLLMs)在遥感视觉问答(RS-VQA)中存在明显的幻觉问题,这主要是由于在大规模场景中的视觉定位失败或对细粒度小目标的误解。为了系统性地分析这些问题,我们引入了RSHBench,一个基于协议的基准,用于对事实和逻辑幻觉进行细粒度诊断。为了减轻由定位引起的事实幻觉,我们进一步提出了相对注意力驱动的主动推理(Relative Attention-Driven Actively Reasoning,RADAR),这是一种无训练的推理方法,利用MLLMs中的内在注意力在测试时引导渐进式定位和细粒度局部推理。广泛的实验结果表明,RADAR在不同的MLLMs中持续提高了RS-VQA的性能,并减少了事实和逻辑幻觉。代码和数据将公开发布于:https://github.com/MiliLab/RADAR
cs.CV / 60 / 2603.02767
ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
ITO:通过协同多重对齐和训练时融合将图像与文本视为统一体
Abstract
Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
Chinese Translation
图像-文本对比预训练已成为视觉表征学习的主导范式,但现有方法通常导致表征在一定程度上仍按模态组织。我们提出了ITO,一个通过两种协同机制解决这一局限性的框架。多模态多重对齐通过挖掘多样的图像-文本对应关系来丰富监督,而一个轻量级的训练时多模态融合模块则强制执行结构化的跨模态交互。关键是,在推理阶段融合模块被丢弃,从而保持标准双编码器架构的效率。大量实验表明,ITO在分类、检索和多模态基准测试中始终优于强基线。我们的分析揭示,尽管多重对齐驱动了判别能力,但训练时融合作为关键的结构正则化器,消除了模态间的差距,并稳定了训练动态,以防止在激进的对比学习中常见的早期饱和现象。
cs.CV / 61 / 2603.02785
HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning
HiLoRA:用于个性化联邦学习的分层低秩适应
Abstract
Vision Transformers (ViTs) have been widely adopted in vision tasks due to their strong transferability. In Federated Learning (FL), where full fine-tuning is communication heavy, Low-Rank Adaptation (LoRA) provides an efficient and communication-friendly way to adapt ViTs. However, existing LoRA-based federated tuning methods overlook latent client structures in real-world settings, limiting shared representation learning and hindering effective adaptation to unseen clients. To address this, we propose HiLoRA, a hierarchical LoRA framework that places adapters at three levels: root, cluster, and leaf, each designed to capture global, subgroup, and client-specific knowledge, respectively. Through cross-tier orthogonality and cascaded optimization, HiLoRA separates update subspaces and aligns each tier with its residual personalized objective. In particular, we develop a LoRA-Subspace Adaptive Clustering mechanism that infers latent client groups via subspace similarity analysis, thereby facilitating knowledge sharing across structurally aligned clients. Theoretically, we establish a tier-wise generalization analysis that supports HiLoRA's design. Experiments on ViT backbones with CIFAR-100 and DomainNet demonstrate consistent improvements in both personalization and generalization.
Chinese Translation
视觉变换器(ViTs)因其强大的迁移能力而在视觉任务中被广泛采用。在联邦学习(FL)中,由于完全微调的通信负担较重,低秩适应(LoRA)提供了一种高效且通信友好的方式来适配ViTs。然而,现有基于LoRA的联邦调优方法忽视了现实环境中潜在的客户端结构,限制了共享表示学习,并阻碍了对未见客户端的有效适应。为了解决这个问题,我们提出了HiLoRA,一个分层LoRA框架,在根、集群和叶子三个层级上放置适配器,分别旨在捕捉全局、子群体和客户端特定的知识。通过跨层正交性和级联优化,HiLoRA分离更新子空间,并将每个层级与其残差个性化目标对齐。特别地,我们开发了一种LoRA子空间自适应聚类机制,通过子空间相似性分析推断潜在的客户端组,从而促进结构上对齐的客户端之间的知识共享。从理论上讲,我们建立了逐层的泛化分析,以支持HiLoRA的设计。在CIFAR-100和DomainNet上的ViT骨干实验表明,在个性化和泛化方面均有持续的改善。
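The three-tier additive structure of the effective weight (base weight plus root, cluster, and leaf low-rank deltas) can be sketched as below. Shapes, ranks, and the plain additive composition are illustrative assumptions; the actual method additionally enforces cross-tier orthogonality and cascaded optimization.

```python
def matmul(A, B):
    # Naive dense matrix product for small list-of-lists matrices.
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)] for i in range(n)]

def madd(*Ms):
    # Elementwise sum of same-shaped matrices.
    return [[sum(M[i][j] for M in Ms) for j in range(len(Ms[0][0]))]
            for i in range(len(Ms[0]))]

def hilora_weight(W0, tiers):
    """Effective weight W0 + sum of low-rank deltas B_t A_t, one (B, A) pair per tier:
    root (shared by all clients), cluster (shared within a group), leaf (client-specific)."""
    deltas = [matmul(B, A) for B, A in tiers]
    return madd(W0, *deltas)
```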
cs.CV / 62 / 2603.02790
Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language
设计 UNICORN:计算病理学、放射学和自然语言处理的统一基准
Stegeman, Michelle, Philipp, Lena, van der Graaf, Fennie, D'Amato, Marina, Grisi, Clément, Builtjes, Luc, Bosma, Joeran S., Lefkes, Judith, Weber, Rianne A., Meakin, James A., Koopman, Thomas, Mickan, Anne, Prokop, Mathias, Smit, Ewoud J., Litjens, Geert, van der Laak, Jeroen, van Ginneken, Bram, de Rooij, Maarten, Huisman, Henkjan, Jacobs, Colin, Ciompi, Francesco, Hering, Alessa
Abstract
Medical foundation models show promise to learn broadly generalizable features from large, diverse datasets. This could be the base for reliable cross-modality generalization and rapid adaptation to new, task-specific goals, with only a few task-specific examples. Yet, evidence for this is limited by the lack of public, standardized, and reproducible evaluation frameworks, as existing public benchmarks are often fragmented across task-, organ-, or modality-specific settings, limiting assessment of cross-task generalization. We introduce UNICORN, a public benchmark designed to systematically evaluate medical foundation models under a unified protocol. To isolate representation quality, we built the benchmark on a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation. As a central design choice, we constructed indirectly accessible sequestered test sets derived from clinically relevant cohorts, along with standardized evaluation code and a submission interface on an open benchmarking platform. Performance is aggregated into a single UNICORN Score, a new metric that we introduce to support direct comparison of foundation models across diverse medical domains, modalities, and task types. The UNICORN test dataset includes data from more than 2,400 patients, including over 3,700 vision cases and over 2,400 clinical reports collected from 17 institutions across eight countries. The benchmark spans eight anatomical regions and four imaging modalities. Both task-specific and aggregated leaderboards enable accessible, standardized, and reproducible evaluation. By standardizing multi-task, multi-modality assessment, UNICORN establishes a foundation for reproducible benchmarking of medical foundation models. Data, baseline methods, and the evaluation platform are publicly available via unicorn.grand-challenge.org.
Chinese Translation
医学基础模型显示出从大型、多样化数据集中学习广泛可推广特征的潜力。这可能成为可靠的跨模态泛化和快速适应新任务特定目标的基础,仅需少量任务特定示例。然而,现有证据受到缺乏公共、标准化和可重复评估框架的限制,因为现有的公共基准通常在任务、器官或模态特定设置中碎片化,限制了跨任务泛化的评估。我们介绍了 UNICORN,这是一个旨在系统评估医学基础模型的公共基准,采用统一协议。为了隔离表示质量,我们基于标准化的少量样本适应构建了一个新颖的两步框架,将模型推理与任务特定评估解耦。作为核心设计选择,我们构建了间接可访问的隔离测试集,这些测试集来源于临床相关的队列,并在开放基准平台上提供标准化的评估代码和提交接口。性能被聚合为一个单一的 UNICORN 分数,这是我们引入的一种新指标,用于支持在不同医学领域、模态和任务类型之间直接比较基础模型。UNICORN 测试数据集包括来自2400多名患者的数据,其中包括3700多个视觉案例和2400多份临床报告,这些数据来自八个国家的17个机构。该基准涵盖八个解剖区域和四种成像模态。任务特定和聚合的排行榜使评估变得可访问、标准化和可重复。通过标准化多任务、多模态评估,UNICORN 为医学基础模型的可重复基准测试奠定了基础。数据、基线方法和评估平台可通过 unicorn.grand-challenge.org 公开获取。
cs.CV / 63 / 2603.02795
VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning
VSearcher:通过强化学习实现的长时间跨度多模态搜索代理
Abstract
Large models are increasingly becoming autonomous agents that interact with real-world environments and use external tools to augment their static capabilities. However, most recent progress has focused on text-only large language models, which are limited to a single modality and therefore have narrower application scenarios. On the other hand, multimodal large models, while offering stronger perceptual capabilities, remain limited to static knowledge and lack the ability to access and leverage up-to-date web information. In this paper, we propose VSearcher, turning static multimodal models into multimodal search agents capable of long-horizon, multi-turn tool use in real-world web environments, including text search, image search, and web browsing, via reinforcement learning. Specifically, we introduce an Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions, which are further filtered with comprehensive metrics to ensure high quality and sufficient difficulty. We then adopt an SFT-then-RL training pipeline to turn base multimodal models into agents capable of multi-turn tool calling in real-world web environments. Besides, we propose a multimodal search benchmark, MM-SearchExam, dedicated to evaluating the search capabilities of multimodal search agents, which proves highly challenging for recent proprietary models. Extensive evaluations across multiple multimodal search benchmarks reveal the effectiveness of our method. VSearcher achieves superior performance compared to recent multimodal search agents and even surpasses several proprietary models on multimodal web search tasks.
Chinese Translation
大型模型正日益成为与现实环境互动的自主代理,并利用外部工具增强其静态能力。然而,最近的进展大多集中在仅限文本的大型语言模型上,这些模型受限于单一模态,因此应用场景较窄。另一方面,多模态大型模型虽然提供了更强的感知能力,但仍然局限于静态知识,缺乏访问和利用最新网络信息的能力。本文提出了VSearcher,通过强化学习将静态多模态模型转变为能够在现实网络环境中进行长时间跨度、多轮工具使用(包括文本搜索、图像搜索和网页浏览)的多模态搜索代理。具体而言,我们引入了迭代注入数据合成管道,以生成大规模、复杂的多模态问答问题,这些问题经过全面的指标过滤,以确保高质量和足够的难度。然后,我们采用了先进行监督微调(SFT)再进行强化学习(RL)的训练管道,将基础多模态模型转变为能够在现实网络环境中进行多轮工具调用的代理。此外,我们提出了一个多模态搜索基准MM-SearchExam,专门用于评估多模态搜索代理的搜索能力,这对最近的专有模型构成了极大的挑战。在多个多模态搜索基准上的广泛评估揭示了我们方法的有效性。VSearcher的表现优于最近的多模态搜索代理,甚至在多模态网页搜索任务上超过了若干专有模型。
cs.CV / 64 / 2603.02801
R3GW: Relightable 3D Gaussians for Outdoor Scenes in the Wild
R3GW:适用于野外户外场景的可重光照3D高斯模型
Abstract
3D Gaussian Splatting (3DGS) has established itself as a leading technique for 3D reconstruction and novel view synthesis of static scenes, achieving outstanding rendering quality and fast training. However, the method does not explicitly model the scene illumination, making it unsuitable for relighting tasks. Furthermore, 3DGS struggles to reconstruct scenes captured in the wild by unconstrained photo collections featuring changing lighting conditions. In this paper, we present R3GW, a novel method that learns a relightable 3DGS representation of an outdoor scene captured in the wild. Our approach separates the scene into a relightable foreground and a non-reflective background (the sky), using two distinct sets of Gaussians. R3GW models view-dependent lighting effects in the foreground reflections by combining Physically Based Rendering with the 3DGS scene representation in a varying illumination setting. We evaluate our method quantitatively and qualitatively on the NeRF-OSR dataset, offering state-of-the-art performance and enhanced support for physically-based relighting of unconstrained scenes. Our method synthesizes photorealistic novel views under arbitrary illumination conditions. Additionally, our representation of the sky mitigates depth reconstruction artifacts, improving rendering quality at the sky-foreground boundary.
Chinese Translation
3D高斯点云(3DGS)已成为静态场景的3D重建和新视图合成的领先技术,取得了卓越的渲染质量和快速的训练速度。然而,该方法并未明确建模场景的光照,因此不适合重光照任务。此外,3DGS难以重建由包含变化光照条件的无约束照片集合所捕捉的野外场景。在本文中,我们提出了R3GW,一种新颖的方法,用于学习在野外捕捉的户外场景的可重光照3DGS表示。我们的方法将场景分为可重光照的前景和非反射的背景(天空),使用两组不同的高斯模型。R3GW通过将物理基础渲染(Physically Based Rendering)与3DGS场景表示结合,在变化的光照条件下建模前景反射中的视角相关光照效应。我们在NeRF-OSR数据集上对我们的方法进行了定量和定性评估,提供了最先进的性能,并增强了对不受限制场景的物理基础重光照的支持。我们的方法能够在任意光照条件下合成照片级真实感的新视图。此外,我们对天空的表示减轻了深度重建伪影,改善了天空与前景边界处的渲染质量。
cs.CV / 65 / 2603.02802
NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
NOVA:稀疏控制与密集合成的无配对视频编辑
Abstract
Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
Chinese Translation
近年来的视频编辑模型取得了令人瞩目的成果,但大多数仍然需要大规模的配对数据集。在规模上收集自然对齐的配对数据仍然极具挑战性,并构成了一个关键瓶颈,特别是在局部视频编辑数据方面。现有的解决方案通过全局运动控制将图像编辑转移到视频上,以实现无配对视频编辑,但这种设计在背景和时间一致性方面存在困难。在本文中,我们提出了NOVA:稀疏控制与密集合成,这是一个新的无配对视频编辑框架。具体而言,稀疏分支通过用户编辑的关键帧提供语义指导,这些关键帧分布在整个视频中,而密集分支则持续从原始视频中整合运动和纹理信息,以保持高保真度和一致性。此外,我们引入了一种退化模拟训练策略,使模型能够通过在人工退化的视频上训练来学习运动重建和时间一致性,从而消除对配对数据的需求。我们的广泛实验表明,NOVA在编辑保真度、运动保留和时间一致性方面优于现有方法。
cs.CV / 66 / 2603.02803
Structure-Aware Text Recognition for Ancient Greek Critical Editions
面向结构的古希腊批判版文本识别
Abstract
Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
Chinese Translation
近年来,视觉语言模型(VLMs)的进展已改变了端到端文档理解的方式。然而,它们在解释历史学术文本复杂布局语义方面的能力仍然有限。本文研究了面向结构的古希腊批判版文本识别,这些文本具有密集的参考层次和广泛的边注。我们引入了两个新资源:(i)一个由TEI/XML源生成的大规模合成语料库,包含185,000个页面图像,具有可控的排版和布局变化,以及(ii)一个经过整理的真实扫描版基准,涵盖了一个多世纪的编辑和排版实践。利用这些数据集,我们在零样本和微调两种模式下评估了三种最先进的VLM。我们的实验揭示了当前VLM架构在面对高度结构化的历史文档时的重大局限性。在零样本设置中,大多数模型的表现显著低于现有的现成软件。然而,Qwen3VL-8B模型实现了最先进的性能,在真实扫描中达到了1.0%的中位字符错误率。这些结果突显了VLM在复杂学术文档的结构感知识别方面的当前不足和未来潜力。
cs.CV / 67 / 2603.02805
ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink
ScribeTokens:数字墨水的固定词汇表标记化
Abstract
Digital ink -- the coordinate stream captured from stylus or touch input -- lacks a unified representation. Continuous vector representations produce long sequences and suffer from training instability, while existing token representations require large vocabularies, face out-of-vocabulary issues, and underperform vectors on recognition. We propose ScribeTokens, a tokenization that decomposes pen movement into unit pixel steps. Together with two pen-state tokens, this fixed 10-token base vocabulary suffices to represent any digital ink and enables aggressive BPE compression. On handwritten text generation, ScribeTokens dramatically outperforms vectors (17.33% vs. 70.29% CER), showing tokens are far more effective for generation. On recognition, ScribeTokens is the only token representation to outperform vectors without pretraining. We further introduce next-ink-token prediction as a self-supervised pretraining strategy, which consistently improves recognition across all token-based models and accelerates convergence by up to 83x. With pretraining, ScribeTokens achieves the best recognition results across all representations on both datasets (8.27% CER on IAM, 9.83% on DeepWriting).
Chinese Translation
数字墨水——从手写笔或触摸输入捕获的坐标流——缺乏统一的表示。连续向量表示产生长序列并且训练不稳定,而现有的标记表示需要大型词汇表,面临词汇外问题,并且在识别上表现不佳。我们提出了ScribeTokens,这是一种将笔运动分解为单位像素步长的标记化方法。结合两个笔状态标记,这个固定的10个标记基础词汇表足以表示任何数字墨水,并且能够实现激进的BPE压缩。在手写文本生成任务中,ScribeTokens的表现显著优于向量(17.33% vs. 70.29% CER),显示出标记在生成任务中更为有效。在识别任务中,ScribeTokens是唯一一个在没有预训练的情况下超越向量的标记表示。我们进一步引入下一个墨水标记预测作为自监督预训练策略,这一策略在所有基于标记的模型中一致性地提高了识别效果,并将收敛速度加快了多达83倍。通过预训练,ScribeTokens在两个数据集上实现了所有表示中最佳的识别结果(IAM上8.27% CER,DeepWriting上9.83%)。
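The fixed 10-token base vocabulary (two pen-state tokens plus unit pixel steps) admits a compact sketch. The choice of eight step directions and the greedy diagonal-first decomposition below are assumptions for illustration; the paper's exact coding may differ.

```python
# Fixed 10-token base vocabulary: 8 unit pixel moves + 2 pen-state tokens.
DIRS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
TOK = {d: i for i, d in enumerate(DIRS)}
PEN_DOWN, PEN_UP = 8, 9

def unit_steps(dx, dy):
    """Decompose an integer displacement into unit pixel steps (diagonals first)."""
    steps = []
    while dx or dy:
        sx = (dx > 0) - (dx < 0)
        sy = (dy > 0) - (dy < 0)
        steps.append((sx, sy))
        dx -= sx
        dy -= sy
    return steps

def ink_to_tokens(strokes):
    """Tokenize a list of strokes (integer coordinate polylines)."""
    toks = []
    for stroke in strokes:
        toks.append(PEN_DOWN)
        x, y = stroke[0]
        for nx, ny in stroke[1:]:
            toks.extend(TOK[s] for s in unit_steps(nx - x, ny - y))
            x, y = nx, ny
        toks.append(PEN_UP)
    return toks

def tokens_to_ink(toks, origins):
    """Invert tokenization; stroke origins are supplied since absolute starts
    are not encoded in the (purely relative) token stream."""
    strokes, it = [], iter(origins)
    for t in toks:
        if t == PEN_DOWN:
            x, y = next(it)
            cur = [(x, y)]
        elif t == PEN_UP:
            strokes.append(cur)
        else:
            dx, dy = DIRS[t]
            x, y = x + dx, y + dy
            cur.append((x, y))
    return strokes
```

Round-tripping densifies each polyline into unit steps; since every token comes from a fixed base vocabulary of ten symbols, there are no out-of-vocabulary cases and BPE can merge frequent step runs aggressively.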
cs.CV / 68 / 2603.02816
BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation
品牌融合:一种无缝品牌集成的多智能体框架用于文本到视频生成
Abstract
The rapid advancement of text-to-video (T2V) models has revolutionized content creation, yet their commercial potential remains largely untapped. We introduce, for the first time, the task of seamless brand integration in T2V: automatically embedding advertiser brands into prompt-generated videos while preserving semantic fidelity to user intent. This task confronts three core challenges: maintaining prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. To address them, we propose BrandFusion, a novel multi-agent framework comprising two synergistic phases. In the offline phase (advertiser-facing), we construct a Brand Knowledge Base by probing model priors and adapting to novel brands via lightweight fine-tuning. In the online phase (user-facing), five agents jointly refine user prompts through iterative refinement, leveraging the shared knowledge base and real-time contextual tracking to ensure brand visibility and semantic alignment. Experiments on 18 established and 2 custom brands across multiple state-of-the-art T2V models demonstrate that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness. Human evaluations further confirm higher user satisfaction, establishing a practical pathway for sustainable T2V monetization.
Chinese Translation
文本到视频(T2V)模型的快速发展彻底改变了内容创作,但其商业潜力仍未得到充分挖掘。我们首次引入了T2V中无缝品牌集成的任务:在生成的提示视频中自动嵌入广告品牌,同时保持对用户意图的语义忠实性。该任务面临三个核心挑战:保持提示的忠实性、确保品牌的可识别性以及实现上下文自然的集成。为了解决这些问题,我们提出了BrandFusion,一个新颖的多智能体框架,包含两个协同的阶段。在离线阶段(面向广告商),我们通过探测模型先验并通过轻量级微调适应新品牌,构建品牌知识库。在在线阶段(面向用户),五个智能体通过迭代优化共同完善用户提示,利用共享知识库和实时上下文跟踪,确保品牌的可见性和语义的一致性。在多个最先进的T2V模型上对18个已建立品牌和2个定制品牌的实验表明,BrandFusion在语义保留、品牌可识别性和集成自然性方面显著优于基线。人类评估进一步确认了更高的用户满意度,为可持续的T2V货币化建立了实用路径。
cs.CV / 69 / 2603.02829
Toward Early Quality Assessment of Text-to-Image Diffusion Models
朝向文本到图像扩散模型的早期质量评估
Abstract
Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a "generate-then-select" mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post-hoc. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure -- object layout and spatial arrangement -- that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.
Chinese Translation
最近的文本到图像(T2I)扩散和流匹配模型能够从自然语言提示中生成高度逼真的图像。在实际场景中,T2I 系统通常以“生成-然后-选择”的模式运行:许多种子被采样,只有少数图像被保留用于使用。然而,这种流程资源消耗极大,因为每个候选图像需要数十到数百个去噪步骤,而评估指标如 CLIPScore 和 ImageReward 是事后评估。在本研究中,我们通过引入 Probe-Select,一个插件模块,解决了这一低效问题,使得在生成过程中能够高效评估图像质量。我们观察到某些中间去噪器激活,即使在早期时间步,也编码了稳定的粗略结构、物体布局和空间排列,这与最终图像的保真度高度相关。Probe-Select 利用这一特性,通过直接从早期激活中预测最终质量评分,允许对不太有希望的种子进行早期终止。在扩散和流匹配的基础模型中,我们的实验表明,仅在轨迹的 20% 处进行早期评估即可准确排名候选种子,并实现选择性继续。这一策略将采样成本降低了超过 60%,同时提高了保留图像的质量,证明早期结构信号能够有效指导选择性生成,而不改变基础生成模型。代码可在 https://github.com/Guhuary/ProbeSelect 获取。
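The selective-continuation arithmetic behind Probe-Select can be sketched with a stand-in scoring function (the paper trains a probe on intermediate denoiser activations; `early_score` here is a hypothetical placeholder):

```python
def probe_select(seeds, early_score, keep):
    """Rank candidate seeds by a probe score read at an early denoising step,
    keep the top-k, and terminate the rest. early_score is a stand-in for a
    learned quality probe on intermediate activations."""
    return sorted(seeds, key=early_score, reverse=True)[:keep]

def sampling_cost(n_seeds, keep, total_steps, early_frac):
    """Total denoising steps with early termination vs. running every seed
    to completion: all seeds reach the probe point, only survivors continue."""
    early = int(total_steps * early_frac)
    pruned = n_seeds * early + keep * (total_steps - early)
    return pruned, n_seeds * total_steps
```

Probing 16 seeds at 20% of a 50-step trajectory and finishing only 2 costs 16*10 + 2*40 = 240 steps versus 800, a 70% reduction, consistent with the >60% savings reported above.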
cs.CV / 70 / 2603.02843
Scale-invariant Gaussian derivative residual networks
尺度不变的高斯导数残差网络
Abstract
Generalisation across image scales remains a fundamental challenge for deep networks, which often fail to handle images at scales not seen during training (the out-of-distribution problem). In this paper, we present provably scale-invariant Gaussian derivative residual networks (GaussDerResNets), constructed out of scale-covariant Gaussian derivative residual blocks coupled in cascade, aimed at addressing this problem. By adding residual skip connections to the previous notion of Gaussian derivative layers, deeper networks with substantially increased accuracy can be constructed, while preserving very good scale generalisation properties at the higher level of accuracy. Explicit proofs are provided regarding the underlying scale-covariant and scale-invariant properties in arbitrary dimensions. To analyse the ability of GaussDerResNets to generalise to new scales, we apply them on the new rescaled version of the STL-10 dataset, where training is done at a single fixed scale and evaluation is performed on multiple copies of the test set, each rescaled to a single distinct spatial scale, with scale factors extending over a range of 4. We also conduct similar systematic experiments on the rescaled versions of Fashion-MNIST and CIFAR-10 datasets. Experimentally, we demonstrate that the GaussDerResNets have strong scale generalisation and scale selection properties on all the three rescaled datasets. In our ablation studies, we investigate different architectural variants of GaussDerResNets, demonstrating that basing the architecture on depthwise-separable convolutions allows for decreasing both the number of parameters and the amount of computations, with reasonably maintained accuracy and scale generalisation.
Chinese Translation
跨图像尺度的泛化仍然是深度网络面临的一个基本挑战,这些网络往往无法处理在训练期间未见过的尺度的图像(即分布外问题)。在本文中,我们提出了可证明尺度不变的高斯导数残差网络(GaussDerResNets),该网络由级联的尺度协变高斯导数残差块构成,旨在解决这一问题。通过在之前的高斯导数层概念中添加残差跳跃连接,可以构建出准确性显著提高的更深网络,同时在更高的准确性水平上保持良好的尺度泛化特性。我们提供了关于任意维度下基础的尺度协变和尺度不变特性的明确证明。为了分析GaussDerResNets在新尺度上泛化的能力,我们将其应用于STL-10数据集的新的重新缩放版本,其中训练在单一固定尺度下进行,而评估则在测试集的多个副本上进行,每个副本重新缩放到单一不同的空间尺度,尺度因子跨越4倍的范围。我们还在Fashion-MNIST和CIFAR-10数据集的重新缩放版本上进行了类似的系统实验。实验结果表明,GaussDerResNets在这三个重新缩放的数据集上具有强大的尺度泛化和尺度选择特性。在我们的消融研究中,我们探讨了GaussDerResNets的不同架构变体,证明基于深度可分离卷积的架构可以减少参数数量和计算量,同时合理地保持准确性和尺度泛化。
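A one-dimensional illustration of the scale-normalized Gaussian derivative kernels underlying the residual blocks (the networks themselves use learned combinations of 2D derivatives; this sketch only shows the kernel construction and gamma-normalization, with all names assumed):

```python
import math

def gauss(x, sigma):
    # 1D Gaussian density G(x; sigma).
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def gaussian_derivative_kernel(sigma, radius=None):
    """Sampled first-order Gaussian derivative: G'(x; sigma) = -(x / sigma^2) G(x; sigma)."""
    r = radius if radius is not None else int(math.ceil(4 * sigma))
    return [-(x / (sigma * sigma)) * gauss(x, sigma) for x in range(-r, r + 1)]

def scale_normalized(kernel, sigma, order=1, gamma=1.0):
    """Gamma-normalized derivative: multiply by sigma^(order * gamma), which makes
    responses comparable across scales."""
    return [sigma ** (order * gamma) * v for v in kernel]

def convolve_at(signal, kernel, x):
    """Single-point discrete convolution (signal assumed long enough around x)."""
    r = len(kernel) // 2
    return sum(signal[x - u] * kernel[u + r] for u in range(-r, r + 1))
```

Convolving a unit ramp with the (unnormalized) derivative kernel recovers slope 1 at any sigma, which is the smoothed-derivative property the Gaussian derivative layers build on.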
cs.CV / 71 / 2603.02866
Multimodal-Prior-Guided Importance Sampling for Hierarchical Gaussian Splatting in Sparse-View Novel View Synthesis
基于多模态先验引导的重要性采样的稀疏视图新视图合成中的层次高斯点云
Abstract
We present multimodal-prior-guided importance sampling as the central mechanism for hierarchical 3D Gaussian Splatting (3DGS) in sparse-view novel view synthesis. Our sampler fuses complementary cues -- photometric rendering residuals, semantic priors, and geometric priors -- to produce a robust, local recoverability estimate that directly drives where to inject fine Gaussians. Built around this sampling core, our framework comprises (1) a coarse-to-fine Gaussian representation that encodes global shape with a stable coarse layer and selectively adds fine primitives where the multimodal metric indicates recoverable detail; and (2) a geometric-aware sampling and retention policy that concentrates refinement on geometrically critical and complex regions while protecting newly added primitives in underconstrained areas from premature pruning. By prioritizing regions supported by consistent multimodal evidence rather than raw residuals alone, our method alleviates overfitting texture-induced errors and suppresses noise from pose/appearance inconsistencies. Experiments on diverse sparse-view benchmarks demonstrate state-of-the-art reconstructions, with up to +0.3 dB PSNR on DTU.
Chinese Translation
我们提出了基于多模态先验引导的重要性采样,作为稀疏视图新视图合成中层次3D高斯点云(3DGS)的核心机制。我们的采样器融合了互补线索——光度渲染残差、语义先验和几何先验——以生成一个稳健的局部可恢复性估计,直接指导精细高斯的注入位置。围绕这一采样核心,我们的框架包括(1)一种粗到细的高斯表示,以稳定的粗层编码全局形状,并在多模态度量指示可恢复细节的地方选择性地添加精细图元;(2)一种几何感知的采样和保留策略,将细化集中在几何关键和复杂的区域,同时保护约束不足区域中新添加的图元,避免过早修剪。通过优先考虑由一致的多模态证据支持的区域,而不仅仅是原始残差,我们的方法缓解了由纹理引起的过拟合错误,并抑制了姿态/外观不一致带来的噪声。在多样的稀疏视图基准测试中的实验表明,我们的方法实现了最先进的重建效果,在DTU上提升了高达+0.3 dB的PSNR。
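The cue-fusion and importance-sampling core can be sketched on a flattened score map; the cue maps, fusion weights, and multinomial sampling below are illustrative assumptions, not the paper's exact estimator:

```python
import random

def normalize(cue):
    # Turn a nonnegative cue map into a probability distribution.
    total = sum(cue)
    return [c / total for c in cue] if total > 0 else [1.0 / len(cue)] * len(cue)

def fuse_cues(residual, semantic, geometric, weights=(1.0, 1.0, 1.0)):
    """Per-location recoverability score: weighted sum of normalized cue maps
    (photometric residual, semantic prior, geometric prior)."""
    cues = [normalize(residual), normalize(semantic), normalize(geometric)]
    fused = [sum(w * c[i] for w, c in zip(weights, cues)) for i in range(len(residual))]
    return normalize(fused)

def sample_positions(probs, k, rng):
    """Draw k locations for fine-Gaussian injection, proportional to the fused score."""
    return [rng.choices(range(len(probs)), weights=probs)[0] for _ in range(k)]
```

Fusing normalized cues rather than raw residuals means a location is densified only when several modalities agree, which is the consistency filter the abstract describes.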
cs.CV / 72 / 2603.02872
Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
随看随想:大规模视觉语言模型的流式思维链推理
Abstract
Large Vision Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at this repository: https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS
Chinese Translation
大型视觉语言模型(LVLMs)展现出强大的思维链(CoT)能力,但大多数现有范式假设在推理之前可获得完整视频,这种批处理式的过程与现实世界的视频流不符,因为信息是顺序到达的。受到视频数据流式特性的启发,我们研究了两种适用于LVLM的流式推理范式。第一种是交错范式,它在接收帧和生成部分推理之间交替进行,但受到严格顺序缓存更新的限制。为了更好地匹配流式输入,我们提出了随看随想(Think-as-You-See, TaYS),这是一个统一框架,能够实现真正的并发推理。TaYS集成了并行化的思维链生成、流式约束训练和流式并行推理。它进一步采用了时间对齐的推理单元、流式注意力掩码和位置编码,以及一个双KV缓存,解耦视觉编码与文本推理。我们在Qwen2.5-VL系列上评估了所有范式,涵盖了代表性的视频思维链任务,包括事件动态分析、因果推理和主题理解。实验结果表明,TaYS在推理性能上始终优于批处理和交错基线,同时显著减少了首次令牌时间(TTFT)和整体推理延迟。这些结果展示了数据对齐的流式推理在实现高效和响应迅速的视频理解方面的有效性。我们在此代码库中发布了我们的代码:https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS。
cs.CV / 73 / 2603.02882
SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion
SIGMark:一种可扩展的生成水印框架,支持视频扩散的盲提取
Abstract
Artificial Intelligence Generated Content (AIGC), particularly video generation with diffusion models, has been advanced rapidly. Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and thus plays a crucial role in AI safety. Beyond post-processing watermarks which inevitably degrade video quality, recent studies have proposed distortion-free in-generation watermarking for video diffusion models. However, existing in-generation approaches are non-blind: they require maintaining all the message-key pairs and performing template-based matching during extraction, which incurs prohibitive computational costs at scale. Moreover, when applied to modern video diffusion models with causal 3D Variational Autoencoders (VAEs), their robustness against temporal disturbance becomes extremely weak. To overcome these challenges, we propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion. To achieve blind-extraction, we propose to generate watermarked initial noise using a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), reducing the cost of storing large-scale information while preserving noise distribution and diversity for distortion-free watermarking. To enhance robustness, we further design a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs, ensuring robust watermark inversion during extraction under temporal disturbance. Comprehensive experiments on modern diffusion models show that SIGMark achieves very high bit-accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness. Our project is available at https://jeremyzhao1998.github.io/SIGMark-release/.
Chinese Translation
人工智能生成内容(AIGC),特别是使用扩散模型的视频生成,已迅速发展。隐形水印是保护AI生成视频和追踪有害内容的关键技术,因此在AI安全中发挥着至关重要的作用。除了后处理水印不可避免地降低视频质量外,最近的研究提出了针对视频扩散模型的无失真生成水印。然而,现有的生成水印方法是非盲的:它们需要维护所有消息-密钥对,并在提取过程中进行基于模板的匹配,这在大规模应用时会产生高昂的计算成本。此外,当应用于具有因果3D变分自编码器(VAEs)的现代视频扩散模型时,它们对时间干扰的鲁棒性变得极其脆弱。为了解决这些挑战,我们提出了SIGMark,一种可扩展的生成水印框架,支持视频扩散的盲提取。为了实现盲提取,我们提出使用全局帧级伪随机编码密钥集(GF-PRC)生成带水印的初始噪声,从而减少存储大规模信息的成本,同时保持噪声分布和多样性,以实现无失真的水印。为了增强鲁棒性,我们进一步设计了一个针对因果3D VAE的段组排序模块(SGO),确保在时间干扰下提取过程中的水印反演具有鲁棒性。对现代扩散模型的全面实验表明,SIGMark在时间和空间干扰下的提取过程中实现了非常高的比特准确率,且开销最小,展示了其可扩展性和鲁棒性。我们的项目可在 https://jeremyzhao1998.github.io/SIGMark-release/ 获取。
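The blind-extraction idea behind SIGMark (regenerate each frame's pseudorandom carrier from a global key and correlate, instead of matching against stored message-key pairs) can be illustrated with a toy sketch. Everything here is a simplification invented for illustration: the integer key derivation, the ±1 carriers, and the one-bit-per-frame layout; the actual GF-PRC scheme modulates the diffusion model's initial noise while preserving its distribution.

```python
import random

def keyed_carrier(key, frame_idx, n):
    # Regenerable +/-1 carrier derived only from a global key and the frame
    # index, so extraction needs no stored message-key pairs.
    # (Hypothetical key derivation, chosen just to mix key and frame index.)
    rng = random.Random(key * 100003 + frame_idx)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def embed(bits, key, n=256):
    # One message bit per frame, encoded as the sign of the keyed carrier.
    return [[(1.0 if b else -1.0) * c for c in keyed_carrier(key, i, n)]
            for i, b in enumerate(bits)]

def extract(frames, key):
    # Blind extraction: correlate each frame with its regenerated carrier;
    # the correlation sign recovers the bit with no template matching.
    out = []
    for i, frame in enumerate(frames):
        carrier = keyed_carrier(key, i, len(frame))
        corr = sum(f * c for f, c in zip(frame, carrier))
        out.append(1 if corr > 0 else 0)
    return out
```

Because the carrier is regenerated from (key, frame index), storage stays constant in the number of messages, which is the scalability point the paper makes against template-based matching.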
cs.CV / 74 / 2603.02883
SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers
SemanticDialect:用于视频扩散变换器的语义感知混合格式量化
Abstract
Diffusion Transformers (DiT) achieve strong video generation quality, but their memory and compute costs hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality under high activation variation and struggle to preserve semantic/temporal coherence. We propose SemanticDialect, which advances recent block-wise mixed-format quantization (selecting a per-block optimal format, a dialect, from multiple candidates, a formatbook) by scaling the formatbook with lookup tables for quantization error and quantized values, enabling efficient per-block format selection and quantization at low online cost. We also introduce activation decomposition that reduces quantization error by re-quantizing and adding back residual errors, with attention-guided salient token selection. We further propose semantic-aware dialect assignment (SeDA) to improve quantized value consistency by sharing a sub-formatbook among semantically correlated tokens. Experiments on video DiT (VDiT) models show that SemanticDialect outperforms prior VDiT quantization methods and fine-grained block-wise format baselines, while approaching FP16 quality on Open-Sora 2.0.
Chinese Translation
扩散变换器(DiT)在视频生成质量上表现出色,但其内存和计算成本限制了在边缘设备上的部署。量化可以降低这些成本,但现有方法在高激活变化下往往会降低视频质量,并且需要保持语义/时间一致性。我们提出了SemanticDialect,它通过从多个候选格式(格式书)中选择每个块的最佳格式(方言),推动了最近的块级混合格式量化。该方法通过查找量化误差和量化值的查找表来扩展格式书,从而实现高效的每块格式选择和低在线成本的量化。我们还引入了激活分解,通过重新量化和添加残差误差来减少量化误差,并采用注意力引导的显著标记选择。我们进一步提出了面向语义的方言分配(SeDA),通过在语义相关的标记之间共享子格式书来提高量化值的一致性。在视频DiT(VDiT)模型上的实验表明,SemanticDialect在性能上优于之前的VDiT量化方法和细粒度块级格式基线,同时在Open-Sora 2.0上接近FP16质量。
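The core mechanic of per-block format selection (choosing a "dialect" from a formatbook by comparing quantization errors) can be sketched as follows. The two candidate grids and the max-magnitude scaling are illustrative stand-ins; the paper's lookup-table acceleration and SeDA sub-formatbook sharing are not modeled here.

```python
def quantize_to_grid(block, grid):
    # Round each activation value to the nearest level of the candidate grid.
    return [min(grid, key=lambda g: abs(g - v)) for v in block]

def block_error(block, grid):
    # Squared quantization error for one block under one format.
    return sum((v - q) ** 2 for v, q in zip(block, quantize_to_grid(block, grid)))

def select_dialect(block, formatbook):
    # Scale each normalized grid to the block's max magnitude, then pick the
    # format (dialect) with the lowest quantization error.
    amax = max(abs(v) for v in block) or 1.0
    return min(formatbook,
               key=lambda name: block_error(block, [amax * g for g in formatbook[name]]))

# Toy formatbook: a coarse uniform grid vs. a finer one, both normalized to [-1, 1].
formatbook = {
    "coarse": [-1.0, 0.0, 1.0],
    "fine": [-1.0, -0.5, 0.0, 0.5, 1.0],
}
```

Real formatbooks would hold integer and floating-point (nonuniform) level sets; the selection rule is the same arg-min over per-block errors.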
cs.CV / 75 / 2603.02886
StegaFFD: Privacy-Preserving Face Forgery Detection via Fine-Grained Steganographic Domain Lifting
StegaFFD:通过细粒度隐写域提升实现隐私保护的人脸伪造检测
Abstract
Most existing Face Forgery Detection (FFD) models assume access to raw face images. In practice, under a client-server framework, private facial data may be intercepted during transmission or leaked by untrusted servers. Previous privacy protection approaches, such as anonymization, encryption, or distortion, partly mitigate leakage but often introduce severe semantic distortion, making images appear obviously protected. This alerts attackers, provoking more aggressive strategies and turning the process into a cat-and-mouse game. Moreover, these methods heavily manipulate image contents, introducing degradation or artifacts that may confuse FFD models, which rely on extremely subtle forgery traces. Inspired by advances in image steganography, which enable high-fidelity hiding and recovery, we propose a Steganography-based Face Forgery Detection framework (StegaFFD) to protect privacy without raising suspicion. StegaFFD hides facial images within natural cover images and directly conducts forgery detection in the steganographic domain. However, the hidden forgery-specific features are extremely subtle and interfered with by cover semantics, posing significant challenges. To address this, we propose Low-Frequency-Aware Decomposition (LFAD) and Spatial-Frequency Differential Attention (SFDA), which suppress interference from low-frequency cover semantics and enhance hidden facial feature perception. Furthermore, we introduce Steganographic Domain Alignment (SDA) to align the representations of hidden faces with those of their raw counterparts, enhancing the model's ability to perceive subtle facial cues in the steganographic domain. Extensive experiments on seven FFD datasets demonstrate that StegaFFD achieves strong imperceptibility, avoids raising attackers' suspicion, and better preserves FFD accuracy compared to existing facial privacy protection methods.
Chinese Translation
大多数现有的人脸伪造检测(FFD)模型假设可以访问原始人脸图像。在实际应用中,在客户端-服务器框架下,私密的人脸数据可能在传输过程中被拦截或被不可信的服务器泄露。以往的隐私保护方法,如匿名化、加密或失真,部分缓解了数据泄露,但往往引入严重的语义失真,使图像明显受到保护。这引起了攻击者的警觉,促使其采取更激进的策略,将这一过程变成了猫鼠游戏。此外,这些方法对图像内容进行了大量操作,导致降级或伪影的出现,可能会混淆依赖极其微妙伪造痕迹的FFD模型。受到图像隐写技术进展的启发,该技术能够实现高保真度的隐藏和恢复,我们提出了一种基于隐写的人脸伪造检测框架(StegaFFD),以在不引起怀疑的情况下保护隐私。StegaFFD将人脸图像隐藏在自然覆盖图像中,并直接在隐写域中进行伪造检测。然而,隐藏的伪造特征极其微妙,并受到覆盖语义的干扰,带来了重大挑战。为了解决这一问题,我们提出了低频感知分解(Low-Frequency-Aware Decomposition, LFAD)和空间频率差异注意力(Spatial-Frequency Differential Attention, SFDA),以抑制来自低频覆盖语义的干扰,并增强隐藏人脸特征的感知。此外,我们引入了隐写域对齐(Steganographic Domain Alignment, SDA),以将隐藏人脸的表示与其原始对应物对齐,从而增强模型在隐写域中感知微妙面部线索的能力。在七个FFD数据集上的大量实验表明,StegaFFD实现了强隐蔽性,避免引起攻击者的怀疑,并且相比现有的人脸隐私保护方法更好地保持了FFD的准确性。
cs.CV / 76 / 2603.02888
LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval
LLandMark:用于地标感知多模态交互式视频检索的多智能体框架
Abstract
The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance.
Chinese Translation
视频数据的多样性和规模日益增加,要求检索系统具备多模态理解、自适应推理和领域特定知识整合的能力。本文提出了LLandMark,一个模块化的多智能体框架,用于处理面向地标的多模态视频检索,以应对现实世界中的复杂查询。该框架具有专门的智能体,协同工作于四个阶段:查询解析与规划、地标推理、多模态检索和重新排序答案合成。一个关键组件,地标知识智能体,能够检测文化或空间地标,并将其重新表述为描述性视觉提示,从而增强基于CLIP的越南场景语义匹配。为了扩展能力,我们引入了一个LLM辅助的图像到图像管道,其中一个大型语言模型(Gemini 2.5 Flash)自主检测地标,生成图像搜索查询,检索代表性图像,并执行基于CLIP的视觉相似性匹配,消除了手动输入图像的需求。此外,利用Gemini和LlamaIndex的OCR优化模块提高了越南文本的识别精度。实验结果表明,LLandMark实现了自适应、文化根植和可解释的检索性能。
cs.CV / 77 / 2603.02893
Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting
面向稀疏视图高斯泼溅的内在几何-外观一致性优化
Abstract
3D human reconstruction from a single image is a challenging problem and has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, optimizing a 3D representation via Score Distillation Sampling (SDS) or generating a back-view image for facilitating reconstruction. However, these methods tend to produce unsatisfactory artifacts (e.g., flattened human structure or over-smoothing results caused by inconsistent priors from multiple views) and struggle with real-world generalization in the wild. In this work, we present MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, which is well fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to facilitate joint optimization of 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the reconstruction. Finally, leveraging the refined multi-view images, along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view renderings. Extensive experiments on Thuman2.0 and 2K2K datasets show that the proposed MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.
Chinese Translation
从单幅图像重建三维人类形象是一个具有挑战性的问题,并且在文献中得到了广泛研究。最近,一些方法借助扩散模型进行指导,通过得分蒸馏采样(Score Distillation Sampling, SDS)优化三维表示,或生成背视图图像以促进重建。然而,这些方法往往会产生不令人满意的伪影(例如,由于来自多个视角的不一致先验导致的人体结构扁平化或过度平滑的结果),并在真实场景中面临泛化困难。在本研究中,我们提出了MVD-HuGaS,通过多视角人类扩散模型实现从单幅图像自由视角的三维人类渲染。我们首先利用增强的多视角扩散模型从单一参考图像生成多视角图像,该模型经过良好的微调,能够结合三维几何先验和人体结构先验,以高质量的三维人类数据集为基础。为了从稀疏生成的多视角图像中推断准确的相机姿态以进行重建,引入了一个对齐模块,以促进三维高斯和相机姿态的联合优化。此外,我们提出了一种基于深度的面部畸变缓解模块,以细化生成的面部区域,从而提高重建的整体保真度。最后,借助经过精细化的多视角图像及其准确的相机姿态,MVD-HuGaS 优化目标人类的三维高斯,以实现高保真的自由视角渲染。在 Thuman2.0 和 2K2K 数据集上的大量实验表明,所提出的 MVD-HuGaS 在单视角三维人类渲染方面达到了最先进的性能。
cs.CV / 78 / 2603.02896
3D-DRES: Detailed 3D Referring Expression Segmentation
3D-DRES:详细的3D指称表达分割
Abstract
Current 3D visual grounding tasks only process sentence-level detection or segmentation, which critically fails to leverage the rich compositional contextual reasoning within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase-to-3D-instance mapping, aiming at enhancing fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm where each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.
Chinese Translation
当前的3D视觉定位任务仅处理句子级别的检测或分割,这在很大程度上未能利用自然语言表达中的丰富组合上下文推理。为了解决这一挑战,我们提出了详细的3D指称表达分割(3D-DRES),这是一个新的任务,提供短语到3D实例的映射,旨在增强细粒度的3D视觉语言理解。为了支持3D-DRES,我们推出了DetailRefer,这是一个新数据集,包含54,432个描述,涵盖11,054个不同的对象。与之前的数据集不同,DetailRefer实施了一种开创性的短语-实例注释范式,其中每个被引用的名词短语都明确映射到其对应的3D元素。此外,我们还引入了DetailBase,这是一种经过精心简化但有效的基线架构,支持句子和短语级别的双模式分割。我们的实验结果表明,在DetailRefer上训练的模型不仅在短语级别的分割上表现出色,而且在传统的3D-RES基准上也显示出惊人的改进。
cs.CV / 79 / 2603.02897
ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization
ProGIC:基于残差向量量化的渐进式轻量级生成图像压缩
Abstract
Recent advances in generative image compression (GIC) have delivered remarkable improvements in perceptual quality. However, many GICs rely on large-scale and rigid models, which severely constrain their utility for flexible transmission and practical deployment in low-bitrate scenarios. To address these issues, we propose Progressive Generative Image Compression (ProGIC), a compact codec built on residual vector quantization (RVQ). In RVQ, a sequence of vector quantizers encodes the residuals stage by stage, each with its own codebook. The resulting codewords sum to a coarse-to-fine reconstruction and a progressive bitstream, enabling previews from partial data. We pair this with a lightweight backbone based on depthwise-separable convolutions and small attention blocks, enabling practical deployment on both GPUs and CPU-only devices. Experimental results show that ProGIC attains comparable compression performance compared with previous methods. It achieves bitrate savings of up to 57.57% on DISTS and 58.83% on LPIPS compared to MS-ILLM on the Kodak dataset. Beyond perceptual quality, ProGIC enables progressive transmission for flexibility, and also delivers over 10 times faster encoding and decoding compared with MS-ILLM on GPUs for efficiency.
Chinese Translation
最近在生成图像压缩(GIC)方面的进展显著提升了感知质量。然而,许多GIC依赖于大规模和刚性模型,这严重限制了它们在低比特率场景中的灵活传输和实际部署。为了解决这些问题,我们提出了渐进式生成图像压缩(ProGIC),这是一种基于残差向量量化(RVQ)的紧凑编码器。在RVQ中,一系列向量量化器逐阶段编码残差,每个量化器都有自己的代码本。生成的代码字汇总为粗到细的重建和渐进式比特流,使得可以从部分数据中进行预览。我们将其与基于深度可分离卷积和小型注意力模块的轻量级主干网络相结合,使其能够在GPU和仅CPU设备上进行实际部署。实验结果表明,ProGIC在压缩性能上与之前的方法相当。与Kodak数据集上的MS-ILLM相比,它在DISTS上节省了高达57.57%的比特率,在LPIPS上节省了58.83%。除了感知质量,ProGIC还支持渐进式传输以提高灵活性,并且在GPU上实现了比MS-ILLM快10倍以上的编码和解码速度,以提高效率。
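The residual-vector-quantization backbone of ProGIC, where each stage codes the residual left by the previous one and partial sums of codewords give coarse previews, reduces to a few lines in the scalar case. The per-stage grids below are made up for the demo; real RVQ uses learned vector codebooks.

```python
def nearest(codebook, v):
    # Nearest codeword to the value being quantized.
    return min(codebook, key=lambda c: abs(c - v))

def rvq_encode(x, codebooks):
    # Stage by stage: each quantizer codes the residual of the previous stage.
    codes, residual = [], x
    for cb in codebooks:
        c = nearest(cb, residual)
        codes.append(c)
        residual -= c
    return codes

def rvq_decode(codes, stages=None):
    # Progressive decode: summing any prefix of codewords yields a coarse preview.
    return sum(codes[:len(codes) if stages is None else stages])

# Coarse-to-fine scalar codebooks (illustrative).
books = [
    [-2.0, -1.0, 0.0, 1.0, 2.0],
    [-0.5, -0.25, 0.0, 0.25, 0.5],
    [-0.1, -0.05, 0.0, 0.05, 0.1],
]
```

Decoding only the first stage corresponds to previewing the image from a truncated bitstream, which is the progressive-transmission property the abstract highlights.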
cs.CV / 80 / 2603.02907
Harmonic Beltrami Signature Network: a Shape Prior Module in Deep Learning Framework
谐波贝尔特拉米特征网络:深度学习框架中的形状先验模块
Abstract
This paper presents the Harmonic Beltrami Signature Network (HBSN), a novel deep learning architecture for computing the Harmonic Beltrami Signature (HBS) from binary-like images. HBS is a shape representation that provides a one-to-one correspondence with 2D simply connected shapes, with invariance to translation, scaling, and rotation. By exploiting the function approximation capacity of neural networks, HBSN enables efficient extraction and utilization of shape prior information. The proposed network architecture incorporates a pre-Spatial Transformer Network (pre-STN) for shape normalization, a UNet-based backbone for HBS prediction, and a post-STN for angle regularization. Experiments show that HBSN accurately computes HBS representations, even for complex shapes. Furthermore, we demonstrate how HBSN can be directly incorporated into existing deep learning segmentation models, improving their performance through the use of shape priors. The results confirm the utility of HBSN as a general-purpose module for embedding geometric shape information into computer vision pipelines.
Chinese Translation
本文提出了谐波贝尔特拉米特征网络(HBSN),这是一种新颖的深度学习架构,用于从类二值图像中计算谐波贝尔特拉米特征(HBS)。HBS是一种形状表示,提供与二维简单连通形状的一对一对应,并对平移、缩放和旋转具有不变性。通过利用神经网络的函数逼近能力,HBSN能够高效提取和利用形状先验信息。所提出的网络架构结合了用于形状归一化的预空间变换网络(pre-STN)、基于UNet的主干网络用于HBS预测,以及用于角度正则化的后空间变换网络(post-STN)。实验表明,HBSN能够准确计算HBS表示,即使对于复杂形状也是如此。此外,我们展示了HBSN如何可以直接融入现有的深度学习分割模型,通过使用形状先验来提高其性能。结果证实了HBSN作为一个通用模块在计算机视觉管道中嵌入几何形状信息的实用性。
cs.CV / 81 / 2603.02910
Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects By Dynamic-Static Disentanglement
运动中的关节:基于动态-静态解耦的无先验部分运动分析
Abstract
Articulated objects are ubiquitous in daily life. Our goal is to achieve a high-quality reconstruction, segmentation of independent moving parts, and analysis of articulation. Recent methods analyse two different articulation states and perform per-point part segmentation, optimising per-part articulation using cross-state correspondences, given a priori knowledge of the number of parts. Such assumptions greatly limit their applications and performance. Their robustness is reduced when objects cannot be clearly visible in both states. To address these issues, in this paper, we present a new framework, Articulation in Motion (AiM). We infer part-level decomposition, articulation kinematics, and reconstruct an interactive 3D digital replica from a user-object interaction video and a start-state scan. We propose a dual-Gaussian scene representation that is learned from an initial 3DGS scan of the object and a video that shows the movement of separate parts. It uses motion cues to segment the object into parts and assign articulation joints. Subsequently, a robust, sequential RANSAC is employed to achieve part mobility analysis without any part-level structural priors, which clusters moving primitives into rigid parts and estimates kinematics while automatically determining the number of parts. The proposed approach separates the object into parts, each represented as a 3D Gaussian set, enabling high-quality rendering. Our approach yields higher quality part segmentation than previous methods, without prior knowledge. Extensive experimental analysis on both simple and complex objects validates the effectiveness and strong generalisation ability of our approach. Project page: https://haoai-1997.github.io/AiM/.
Chinese Translation
关节物体在日常生活中无处不在。我们的目标是实现高质量的重建、独立运动部分的分割和关节分析。最近的方法分析两种不同的关节状态,并进行逐点部分分割,利用跨状态对应关系优化每部分的关节,前提是已知部分的数量。这些假设极大地限制了它们的应用和性能。当物体在两种状态下无法清晰可见时,它们的鲁棒性会降低。为了解决这些问题,本文提出了一种新的框架——运动中的关节(Articulation in Motion, AiM)。我们从用户与物体的交互视频和初始状态扫描中推断部分级分解、关节运动学,并重建一个交互式的3D数字复制品。我们提出了一种双高斯场景表示法,该方法从物体的初始3DGS扫描和展示独立部分运动的视频中学习。它利用运动线索将物体分割成部分并分配关节。随后,采用鲁棒的序列RANSAC方法,在没有任何部分级结构先验的情况下实现部分运动分析,将运动原件聚类为刚性部分并估计运动学,同时自动确定部分的数量。所提出的方法将物体分割为部分,每部分表示为一个3D高斯集,从而实现高质量渲染。我们的方法在没有先验知识的情况下,提供了比以往方法更高质量的部分分割。对简单和复杂物体的广泛实验分析验证了我们方法的有效性和强大的泛化能力。项目页面:https://haoai-1997.github.io/AiM/
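AiM's part-mobility step, a sequential RANSAC that peels off rigid clusters of moving primitives and determines the part count automatically, can be caricatured in 1D with pure translations. The threshold, iteration count, minimum inlier count, and 1D motion model are all illustrative choices, not the paper's.

```python
import random

def sequential_ransac(motions, thresh=0.1, min_inliers=3, iters=200, seed=0):
    """Greedily peel off rigid parts: fit a translation, remove inliers, repeat.

    `motions` maps primitive id -> observed 1D displacement. The number of
    parts is determined automatically, as in prior-free mobility analysis.
    """
    rng = random.Random(seed)
    remaining = dict(motions)
    parts = []
    while len(remaining) >= min_inliers:
        ids = list(remaining)
        best = []
        for _ in range(iters):
            t = remaining[rng.choice(ids)]  # 1-point translation hypothesis
            inliers = [i for i in ids if abs(remaining[i] - t) < thresh]
            if len(inliers) > len(best):
                best = inliers
        if len(best) < min_inliers:
            break
        t = sum(remaining[i] for i in best) / len(best)  # refit on inliers
        parts.append((t, sorted(best)))
        for i in best:
            del remaining[i]
    return parts
```

With two groups of primitives moving by about 1.0 and about 0.0 respectively, the loop recovers two parts and their translations without being told the part count.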
cs.CV / 82 / 2603.02919
Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
可解释的运动注意力图:在视频扩散变换器中时空定位概念
Abstract
Video Diffusion Transformers (DiTs) have been synthesizing high-quality video with high fidelity from given text descriptions involving motion. However, understanding how Video DiTs convert motion words into video remains insufficient. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.
Chinese Translation
视频扩散变换器(DiTs)能够根据涉及运动的文本描述合成高保真度的高质量视频。然而,理解视频 DiTs 如何将运动词转换为视频的过程仍然不足。此外,尽管以往关于可解释显著性图的研究主要针对对象,但在视频 DiTs 中与运动相关的行为仍然在很大程度上未被探索。本文研究了具体的运动特征,以确定在给定运动概念时何时以及哪个对象移动。首先,为了进行空间定位,我们引入了 GramCol,它能够自适应地为任何文本概念(包括运动和非运动)生成逐帧显著性图。其次,我们提出了一种运动特征选择算法,以获得可解释的运动注意力图(IMAP),该图在时空上定位运动。我们的方法在不需要任何梯度计算或参数更新的情况下发现概念显著性图。在实验中,我们的方法在运动定位任务和零样本视频语义分割中表现出卓越的定位能力,为运动和非运动概念提供了可解释且更清晰的显著性图。
cs.CV / 83 / 2603.02924
HDINO: A Concise and Efficient Open-Vocabulary Detector
HDINO:一种简洁高效的开放词汇检测器
Abstract
Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or grounding data, surpassing Grounding DINO-T and T-Rex2 by 0.8 mAP and 2.8 mAP, respectively, which are trained on 5.4M and 6.5M images. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.
Chinese Translation
尽管近年来对开放词汇物体检测的兴趣日益增长,但大多数现有方法仍然严重依赖于手动策划的细粒度训练数据集以及资源密集型的逐层跨模态特征提取。本文提出了HDINO,一种简洁而高效的开放词汇物体检测器,消除了对这些组件的依赖。具体而言,我们提出了一种基于变换器(transformer)模型DINO的两阶段训练策略。在第一阶段,噪声样本被视为额外的正物体实例,以构建视觉和文本模态之间的一对多语义对齐机制(One-to-Many Semantic Alignment Mechanism, O2M),从而促进语义对齐。同时,基于初始检测难度设计了一种困难加权分类损失(Difficulty Weighted Classification Loss, DWCL),以挖掘困难示例并进一步提高模型性能。在第二阶段,轻量级特征融合模块被应用于对齐表示,以增强对语言语义的敏感性。在Swin Transformer-T设置下,HDINO-T在COCO上使用来自两个公开检测数据集的220万训练图像达到了49.2的mAP,且无需任何手动数据策划和基础数据的使用,超越了在540万和650万图像上训练的Grounding DINO-T和T-Rex2,分别提高了0.8和2.8的mAP。在COCO上进行微调后,HDINO-T和HDINO-L进一步达到了56.4和59.2的mAP,突显了我们方法的有效性和可扩展性。代码和模型可在https://github.com/HaoZ416/HDINO获取。
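HDINO's Difficulty Weighted Classification Loss is described only as weighting by initial detection difficulty to mine hard examples. A generic difficulty-weighted binary cross-entropy conveys the idea; the (1 + d)^gamma weighting below is an assumed form for the sketch, not the paper's formula.

```python
import math

def dwcl(probs, labels, difficulty, gamma=1.0):
    # Binary cross-entropy where each sample's term is up-weighted by a
    # precomputed difficulty score d in [0, 1] (hard examples near 1).
    total = 0.0
    for p, y, d in zip(probs, labels, difficulty):
        w = (1.0 + d) ** gamma  # harder examples contribute more gradient
        ce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += w * ce
    return total / len(probs)
```

At d = 0 the loss falls back to plain cross-entropy, so the weighting only reshapes the contribution of hard examples rather than changing the objective for easy ones.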
cs.CV / 84 / 2603.02926
GloPath: An Entity-Centric Foundation Model for Glomerular Lesion Assessment and Clinicopathological Insights
GloPath:一种以实体为中心的基础模型,用于肾小球损伤评估和临床病理学洞察
He, Qiming, Li, Jing, Guan, Tian, Ma, Yifei, Zhao, Zimo, Wang, Yanxia, Chen, Hongjing, Xu, Yingming, Ge, Shuang, Zhang, Yexing, Wang, Yizhi, Chen, Xinrui, Zhu, Lianghui, Liu, Yiqing, Hou, Qingxia, Zhao, Shuyan, Wang, Xiaoqin, Ma, Lili, Hu, Peizhen, Huang, Qiang, Wang, Zihan, Shen, Zhiyuan, Cheng, Junru, Zeng, Siqi, Chen, Jiurun, Song, Zhen, He, Chao, Wang, Zhe, He, Yonghong
Abstract
Glomerular pathology is central to the diagnosis and prognosis of renal diseases, yet the heterogeneity of glomerular morphology and fine-grained lesion patterns remain challenging for current AI approaches. We present GloPath, an entity-centric foundation model trained on over one million glomeruli extracted from 14,049 renal biopsy specimens using multi-scale and multi-view self-supervised learning. GloPath addresses two major challenges in nephropathology: glomerular lesion assessment and clinicopathological insights discovery. For lesion assessment, GloPath was benchmarked across three independent cohorts on 52 tasks, including lesion recognition, grading, few-shot classification, and cross-modality diagnosis-outperforming state-of-the-art methods in 42 tasks (80.8%). In the large-scale real-world study, it achieved an ROC-AUC of 91.51% for lesion recognition, demonstrating strong robustness in routine clinical settings. For clinicopathological insights, GloPath systematically revealed statistically significant associations between glomerular morphological parameters and clinical indicators across 224 morphology-clinical variable pairs, demonstrating its capacity to connect tissue-level pathology with patient-level outcomes. Together, these results position GloPath as a scalable and interpretable platform for glomerular lesion assessment and clinicopathological discovery, representing a step toward clinically translatable AI in renal pathology.
Chinese Translation
肾小球病理在肾脏疾病的诊断和预后中至关重要,但肾小球形态的异质性和细粒度损伤模式对当前的人工智能方法仍然构成挑战。我们提出了GloPath,这是一种以实体为中心的基础模型,基于从14,049个肾活检标本中提取的超过一百万个肾小球,通过多尺度和多视角的自监督学习进行训练。GloPath解决了肾脏病理学中的两个主要挑战:肾小球损伤评估和临床病理学洞察发现。在损伤评估方面,GloPath在52个任务上对三个独立队列进行了基准测试,包括损伤识别、分级、少样本分类和跨模态诊断,在42个任务中超越了最先进的方法(80.8%)。在大规模的真实世界研究中,它在损伤识别中达到了91.51%的ROC-AUC,展示了在常规临床环境中的强大鲁棒性。对于临床病理学洞察,GloPath系统性地揭示了224对形态学-临床变量之间的统计显著关联,展示了其将组织级病理与患者级结果相连接的能力。这些结果使GloPath成为一个可扩展且可解释的平台,用于肾小球损伤评估和临床病理学发现,代表了向肾脏病理学中临床可转化人工智能迈出的重要一步。
cs.CV / 85 / 2603.02929
TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval
TRACE:用于通用多模态检索的任务自适应推理与表示学习
Abstract
Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.
Chinese Translation
通用多模态检索需要统一的嵌入模型,能够解读多样化的用户意图,从简单的关键词到复杂的组合指令。尽管多模态大型语言模型(MLLMs)具备强大的推理能力,但现有的适配方式将其局限于静态编码器,未能充分利用其生成潜力。这种仅使用编码器的范式在处理需要逻辑推理而非表面模式匹配的复杂意图时表现不佳。为了解决这个问题,我们提出了TRACE(任务自适应推理与压缩嵌入)。TRACE将生成推理与区分表示学习相结合。它首先生成一个结构化的思维链(Chain-of-Thought,CoT)以明确推理查询,然后通过专用标记将这一推理过程压缩成紧凑的嵌入。为了训练这一框架,我们构建了M-BEIR-CoT,一个具有难度感知路由策略的大规模数据集。在M-BEIR基准上的实验表明,TRACE成为了新的最先进技术。重要的是,TRACE展示了学习到的隐式路由行为。它能够自主激活复杂查询的推理,同时对简单查询则跳过推理,实现了检索准确性与推理吞吐量之间的最佳平衡。此外,通过内化推理过程,TRACE在未见领域和新约束下表现出显著的零样本迁移能力。
cs.CV / 86 / 2603.02943
TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
TC-Padé:用于扩散加速的轨迹一致性Padé近似
Abstract
Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the interval between steps increases, polynomial-based extrapolators like TaylorSeer suffer from error accumulation and trajectory drift. Meanwhile, conventional caching strategies often overlook the distinct dynamical properties of different denoising phases. To address these challenges, we propose Trajectory-Consistent Padé approximation, a feature prediction framework grounded in Padé approximation. By modeling feature evolution through rational functions, our approach captures asymptotic and transitional behaviors more accurately than Taylor-based methods. To enable stable and trajectory-consistent sampling under reduced step counts, TC-Padé incorporates (1) adaptive coefficient modulation that leverages historical cached residuals to detect subtle trajectory transitions, and (2) step-aware prediction strategies tailored to the distinct dynamics of early, mid, and late sampling stages. Extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across both image and video generation demonstrate the effectiveness of TC-Padé. For instance, TC-Padé achieves 2.88x acceleration on FLUX.1-dev and 1.72x on Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.
Chinese Translation
尽管扩散模型在生成质量上达到了最先进的水平,但其迭代采样过程的巨大计算负担仍然是一个障碍。虽然特征缓存技术在较高步数(例如50步)下实现了有效加速,但在实际的低步数范围(20-30步)中存在关键限制。随着步数间隔的增加,基于多项式的外推器如TaylorSeer会遭遇误差积累和轨迹漂移。同时,传统的缓存策略往往忽视了不同去噪阶段的独特动态特性。为了解决这些挑战,我们提出了轨迹一致性Padé近似(Trajectory-Consistent Padé approximation),这是一种基于Padé近似的特征预测框架。通过用有理函数建模特征演变,我们的方法比基于Taylor的方法更准确地捕捉渐近和过渡行为。为了在减少步数的情况下实现稳定且轨迹一致的采样,TC-Padé结合了(1)自适应系数调制,利用历史缓存残差检测微妙的轨迹过渡,以及(2)针对早期、中期和晚期采样阶段的独特动态量身定制的步数感知预测策略。在DiT-XL/2、FLUX.1-dev和Wan2.1上的广泛实验表明了TC-Padé的有效性。例如,TC-Padé在FLUX.1-dev上实现了2.88倍的加速,在Wan2.1上实现了1.72倍的加速,同时在FID、CLIP、美学和VBench-2.0指标上保持高质量,显著优于现有的特征缓存方法。
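The contrast the paper draws, rational (Padé) extrapolants tracking saturating trajectories where Taylor polynomials drift, is easy to reproduce with a toy [1/1] approximant fitted to three cached samples. The scalar trajectory f(t) = (1 + 2t)/(1 + t) is invented for the demo; real features are high-dimensional and the paper adds coefficient modulation on top.

```python
def fit_pade11(ts, fs):
    """Fit f(t) ~ (a + b*t) / (1 + c*t) through three samples.

    Linearize as a + b*t - c*t*f(t) = f(t), then solve the 3x3 system
    by Gaussian elimination with partial pivoting.
    """
    A = [[1.0, t, -t * f] for t, f in zip(ts, fs)]
    y = list(fs)
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        y[col], y[piv] = y[piv], y[col]
        for r in range(col + 1, 3):
            m = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= m * A[col][c]
            y[r] -= m * y[col]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back-substitution
        x[r] = (y[r] - sum(A[r][c] * x[c] for c in range(r + 1, 3))) / A[r][r]
    return x  # (a, b, c)

def pade11(coeffs, t):
    a, b, c = coeffs
    return (a + b * t) / (1.0 + c * t)
```

Because the denominator grows with t, the extrapolant saturates toward b/c instead of diverging the way a quadratic Taylor fit through the same three points would, which is the intuition behind replacing polynomial extrapolation in the low-step regime.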
cs.CV / 87 / 2603.02959
Semi-Supervised Few-Shot Adaptation of Vision-Language Models
视觉-语言模型的半监督少样本适应
Abstract
Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by >50% in low-shot regimes.
Chinese Translation
在大规模异构数据源上预训练的视觉-语言模型(VLMs)正变得越来越受欢迎,提供丰富的多模态嵌入,能够高效地转移到新任务上。一个特别相关的应用是少样本适应,其中仅有少量标注示例可用于通过多模态线性探针来适应模型。在医学影像领域,专门的VLMs在零样本和少样本图像分类中表现出良好的性能,这对于减轻专家标注的高成本具有重要价值。然而,在极低样本情况下仍然存在挑战:医学任务中固有的类别不平衡往往导致某些类别的代表性不足,从而影响整体模型性能。为了解决这一限制,我们提出利用未标注数据,通过引入一种高效的半监督求解器,在少样本适应过程中传播文本信息驱动的伪标签。所提方法使得适应VLMs的低预算标注流程成为可能,在低样本情况下减少了超过50%的标注工作量。
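The solver's core move, seeding class prototypes from text embeddings and propagating pseudo-labels to unlabeled image embeddings, can be sketched as a few rounds of nearest-prototype assignment with refitting. Cosine assignment and mean refit are generic stand-ins here, not the paper's exact semi-supervised solver.

```python
def cosine(u, v):
    # Cosine similarity between two nonzero embedding vectors.
    du = sum(x * x for x in u) ** 0.5
    dv = sum(x * x for x in v) ** 0.5
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def propagate_pseudo_labels(unlabeled, prototypes, rounds=3):
    """Assign each unlabeled embedding to its nearest class prototype,
    then refit prototypes on the assignments (an EM-like refinement).

    `prototypes` maps class name -> vector, initialized from text embeddings
    so the propagation is text-informed from the first round.
    """
    for _ in range(rounds):
        assign = [max(prototypes, key=lambda k: cosine(x, prototypes[k]))
                  for x in unlabeled]
        for k in prototypes:
            members = [x for x, a in zip(unlabeled, assign) if a == k]
            if members:
                dim = len(members[0])
                prototypes[k] = [sum(m[d] for m in members) / len(members)
                                 for d in range(dim)]
    return assign, prototypes
```

In the few-shot setting, the labeled shots would simply be pinned to their known classes during the refit; only the unlabeled pool receives pseudo-labels.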
cs.CV / 88 / 2603.02964
Improving Anomaly Detection with Foundation-Model Synthesis and Wavelet-Domain Attention
通过基础模型合成和小波域注意力改进异常检测
Abstract
Industrial anomaly detection faces significant challenges due to the scarcity of anomalous samples and the complexity of real-world anomalies. In this paper, we propose a foundation model-based anomaly synthesis pipeline (FMAS) that generates highly realistic anomalous samples without fine-tuning or class-specific training. Motivated by the distinct frequency-domain characteristics of anomalies, we introduce a Wavelet Domain Attention Module (WDAM), which exploits adaptive sub-band processing to enhance anomaly feature extraction. The combination of FMAS and WDAM significantly improves anomaly detection sensitivity while maintaining computational efficiency. Comprehensive experiments on MVTec AD and VisA datasets demonstrate that WDAM, as a plug-and-play module, achieves substantial performance gains against existing baselines.
Chinese Translation
工业异常检测面临着由于异常样本稀缺和现实世界异常复杂性带来的重大挑战。本文提出了一种基于基础模型的异常合成管道(FMAS),该管道能够在不进行微调或特定类别训练的情况下生成高度真实的异常样本。基于异常在频域中的独特特征,我们引入了小波域注意力模块(WDAM),该模块利用自适应子带处理来增强异常特征提取。FMAS与WDAM的结合显著提高了异常检测的灵敏度,同时保持了计算效率。在MVTec AD和VisA数据集上的全面实验表明,WDAM作为一个即插即用模块,相较于现有基线实现了显著的性能提升。
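The frequency-domain intuition WDAM builds on, anomalies concentrating energy in high-frequency sub-bands, already shows up in a one-level 1D Haar transform. The 2D, multi-band, learned-attention version in the paper is far richer; this is the minimal stand-in.

```python
def haar_1d(signal):
    # One-level Haar transform: low-pass approximation and high-pass detail
    # sub-bands, computed over non-overlapping pairs of samples.
    s = 2 ** -0.5
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def subband_energy(band):
    return sum(v * v for v in band)
```

A local defect (the 5 in the test signal) barely changes the approximation band but dominates the detail band, which is exactly the separation that adaptive sub-band weighting exploits.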
cs.CV / 89 / 2603.02972
TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation
TagaVLM:用于视觉-语言导航的拓扑感知全局动作推理
Abstract
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM's self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code will be released upon publication.Project page: https://apex-bjut.github.io/Taga-VLM
Chinese Translation
视觉-语言导航(VLN)对大型视觉-语言模型(VLMs)提出了独特的挑战,主要是由于其固有的架构不匹配:VLMs主要在静态、非具身的视觉-语言任务上进行预训练,这与导航的动态、具身和空间结构特性根本相悖。现有的大模型方法通常依赖于将丰富的视觉和空间信息转换为文本,迫使模型隐式推断复杂的视觉-拓扑关系,或限制其全局动作能力。为了弥补这一差距,我们提出了TagaVLM(面向拓扑的全局动作推理),这是一个端到端框架,明确将拓扑结构注入VLM主干。为了引入拓扑边信息,空间拓扑感知残差注意力(STAR-Att)将其直接整合到VLM的自注意力机制中,使得模型在保留预训练知识的同时,能够进行内在的空间推理。为了增强拓扑节点信息,交错导航提示(Interleaved Navigation Prompt)加强了节点级的视觉-文本对齐。最后,借助嵌入的拓扑图,模型能够进行全局动作推理,从而实现稳健的路径修正。在R2R基准测试中,TagaVLM在大型模型方法中实现了最先进的性能,在未见环境中的成功率(SR)为51.09%,SPL为47.18,分别比之前的工作提高了3.39%的SR和9.08的SPL。这表明,对于具身空间推理,针对较小的开源VLM的定向增强可能比简单的模型扩展更为有效。代码将在发表后发布。项目页面:https://apex-bjut.github.io/Taga-VLM
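The core STAR-Att idea described in the TagaVLM abstract, injecting topological edge information directly into self-attention, can be illustrated as an additive bias on the attention logits. The following numpy sketch is an assumption-laden toy (the function names, shapes, and the residual weight `alpha` are illustrative, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topology_aware_attention(Q, K, V, edge_bias, alpha=1.0):
    """Scaled dot-product attention with an additive topological bias
    on the logits, so edges of the navigation graph directly raise
    attention between their endpoint nodes while the pretrained
    dot-product pathway is left intact."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)        # standard attention scores
    logits = logits + alpha * edge_bias  # residual topology injection
    return softmax(logits, axis=-1) @ V

# Toy graph with 3 nodes; a strong edge links node 0 and node 1.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
edge_bias = np.array([[0.0, 5.0, 0.0],
                      [5.0, 0.0, 0.0],
                      [0.0, 0.0, 0.0]])
out = topology_aware_attention(Q, K, V, edge_bias)
```

Because the bias is added before the softmax, a zero `edge_bias` recovers vanilla attention exactly, which is one plausible way to preserve pretrained behavior.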
cs.CV / 90 / 2603.02974
Spatial Autoregressive Modeling of DINOv3 Embeddings for Unsupervised Anomaly Detection
DINOv3嵌入的空间自回归建模用于无监督异常检测
Abstract
DINO models provide rich patch-level representations that have recently enabled strong performance in unsupervised anomaly detection (UAD). Most existing methods extract patch embeddings from ``normal'' images and model them independently, ignoring spatial and neighborhood relationships between patches. This implicitly assumes that self-attention and positional encodings sufficiently encode contextual information within each patch embedding. In addition, the normative distribution is often modeled as memory banks or prototype-based representations, which require storing large numbers of features and performing costly comparisons at inference time, leading to substantial memory and computational overhead. In this work, we address both limitations by proposing a simple and efficient framework that explicitly models spatial and contextual dependencies between patch embeddings using a 2D autoregressive (AR) model. Instead of storing embeddings or clustering prototypes, our approach learns a compact parametric model of the normative distribution via an AR convolutional neural network (CNN). At test time, anomaly detection reduces to a single forward pass through the network and enables fast and memory-efficient inference. We evaluate our method on the BMAD benchmark, which comprises three medical imaging datasets, and compare it against existing work including recent DINO-based methods. Experimental results demonstrate that explicitly modeling spatial dependencies achieves competitive anomaly detection performance while substantially reducing inference time and memory requirements. Code is available at the project page: https://eerdil.github.io/spatial-ar-dinov3-uad/.
Chinese Translation
DINO模型提供了丰富的补丁级表示,最近在无监督异常检测(UAD)中表现出色。现有大多数方法从“正常”图像中提取补丁嵌入并独立建模,忽略了补丁之间的空间和邻域关系。这隐含地假设自注意力和位置编码足以在每个补丁嵌入中编码上下文信息。此外,规范分布通常被建模为记忆库或基于原型的表示,这需要存储大量特征并在推理时进行耗时的比较,导致显著的内存和计算开销。在本研究中,我们通过提出一个简单高效的框架来解决这两个限制,该框架使用二维自回归(AR)模型显式建模补丁嵌入之间的空间和上下文依赖关系。我们的方法不是存储嵌入或聚类原型,而是通过自回归卷积神经网络(CNN)学习规范分布的紧凑参数模型。在测试时,异常检测简化为通过网络的单次前向传播,从而实现快速且内存高效的推理。我们在BMAD基准上评估了我们的方法,该基准包含三个医学影像数据集,并与现有工作(包括最近的基于DINO的方法)进行了比较。实验结果表明,显式建模空间依赖关系能够实现具有竞争力的异常检测性能,同时显著减少推理时间和内存需求。代码可在项目页面获取:https://eerdil.github.io/spatial-ar-dinov3-uad/
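A minimal sketch of the spatial-AR idea in this abstract: each patch embedding is predicted from its causal neighbors, and the prediction error is the anomaly score, so inference needs no memory bank. The linear left/top predictors below are a stand-in assumption; the paper uses an AR convolutional network:

```python
import numpy as np

def ar_anomaly_map(F, W_left, W_top, b):
    """Score every patch by the error of a 2D autoregressive prediction
    from its causal neighbours (left and top). A single pass over the
    grid replaces memory-bank lookups at test time.

    F : (H, W, d) patch-embedding grid (e.g. from a frozen DINO encoder)
    W_left, W_top : (d, d) linear predictors fit on normal images
    b : (d,) bias
    """
    H, W, d = F.shape
    score = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            pred = b.copy()
            if j > 0:
                pred += F[i, j - 1] @ W_left
            if i > 0:
                pred += F[i - 1, j] @ W_top
            score[i, j] = np.sum((F[i, j] - pred) ** 2)
    return score

# A constant grid is perfectly predicted in the interior once the two
# predictors average the neighbours; a corrupted patch then stands out.
F = np.ones((4, 4, 2))
score = ar_anomaly_map(F, 0.5 * np.eye(2), 0.5 * np.eye(2), np.zeros(2))
```

The double loop is for clarity only; a masked convolution computes the same causal prediction for all patches in parallel.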
cs.CV / 91 / 2603.02985
The Dresden Dataset for 4D Reconstruction of Non-Rigid Abdominal Surgical Scenes
德累斯顿数据集用于非刚性腹部手术场景的4D重建
Abstract
The D4D Dataset provides paired endoscopic video and high-quality structured-light geometry for evaluating 3D reconstruction of deforming abdominal soft tissue in realistic surgical conditions. Data were acquired from six porcine cadaver sessions using a da Vinci Xi stereo endoscope and a Zivid structured-light camera, registered via optical tracking and manually curated iterative alignment methods. Three sequence types - whole deformations, incremental deformations, and moved-camera clips - probe algorithm robustness to non-rigid motion, deformation magnitude, and out-of-view updates. Each clip provides rectified stereo images, per-frame instrument masks, stereo depth, start/end structured-light point clouds, curated camera poses and camera intrinsics. In postprocessing, ICP and semi-automatic registration techniques are used to register data, and instrument masks are created. The dataset enables quantitative geometric evaluation in both visible and occluded regions, alongside photometric view-synthesis baselines. Comprising over 300,000 frames and 369 point clouds across 98 curated recordings, this resource can serve as a comprehensive benchmark for developing and evaluating non-rigid SLAM, 4D reconstruction, and depth estimation methods.
Chinese Translation
D4D数据集提供了配对的内窥镜视频和高质量的结构光几何数据,用于评估在真实手术条件下变形腹部软组织的3D重建。数据来自六个猪尸体实验会话,使用da Vinci Xi立体内窥镜和Zivid结构光相机获取,通过光学追踪和人工校正的迭代对齐方法进行配准。三种序列类型——整体变形、增量变形和移动相机片段——探讨算法对非刚性运动、变形幅度和视野外更新的鲁棒性。每个片段提供了校正后的立体图像、每帧的器械掩膜、立体深度、起始/结束的结构光点云、精选的相机姿态和相机内参。在后处理过程中,使用ICP和半自动配准技术对数据进行配准,并创建器械掩膜。该数据集支持在可见和遮挡区域进行定量几何评估,并提供光度视图合成基准。数据集包含超过300,000帧和369个点云,涵盖98段精选的录制片段,可作为开发和评估非刚性SLAM、4D重建和深度估计方法的全面基准。
cs.CV / 92 / 2603.02986
VIRGi: View-dependent Instant Recoloring of 3D Gaussians Splats
VIRGi:基于视图依赖的3D高斯点云即时重新着色
Abstract
3D Gaussian Splatting (3DGS) has recently transformed the fields of novel view synthesis and 3D reconstruction due to its ability to accurately model complex 3D scenes and its unprecedented rendering performance. However, a significant challenge persists: the absence of an efficient and photorealistic method for editing the appearance of the scene's content. In this paper we introduce VIRGi, a novel approach for rapidly editing the color of scenes modeled by 3DGS while preserving view-dependent effects such as specular highlights. Key to our method are a novel architecture that separates color into diffuse and view-dependent components, and a multi-view training strategy that integrates image patches from multiple viewpoints. Improving over the conventional single-view batch training, our 3DGS representation provides more accurate reconstruction and serves as a solid representation for the recoloring task. For 3DGS recoloring, we then introduce a rapid scheme requiring only one manually edited image of the scene from the end-user. By fine-tuning the weights of a single MLP, alongside a module for single-shot segmentation of the editable area, the color edits are seamlessly propagated to the entire scene in just two seconds, facilitating real-time interaction and providing control over the strength of the view-dependent effects. An exhaustive validation on diverse datasets demonstrates significant quantitative and qualitative advancements over competitors based on Neural Radiance Fields representations.
Chinese Translation
3D高斯点云(3DGS)由于其能够准确建模复杂的3D场景以及前所未有的渲染性能,最近在新视图合成和3D重建领域引发了变革。然而,仍然存在一个重大挑战:缺乏一种高效且具备照片级真实感的场景内容外观编辑方法。本文提出了VIRGi,一种新颖的方法,旨在快速编辑由3DGS建模的场景颜色,同时保留视图依赖效果,如镜面高光。我们方法的关键在于一种新颖的架构,它将颜色分离为漫反射和视图依赖成分,以及一种多视角训练策略,能够整合来自多个视角的图像块。相较于传统的单视角批量训练,我们的3DGS表示提供了更准确的重建,并为重新着色任务提供了坚实的表示。对于3DGS重新着色,我们引入了一种快速方案,仅需最终用户手动编辑场景中的一张图像。通过微调单个多层感知机(MLP)的权重,以及一个用于可编辑区域单次分割的模块,颜色编辑能够在仅仅两秒内无缝传播到整个场景,实现实时交互,并提供对视图依赖效果强度的控制。在多样化数据集上的全面验证显示,与基于神经辐射场(Neural Radiance Fields)表示的竞争对手相比,VIRGi在定量和定性上均取得了显著的进展。
cs.CV / 93 / 2603.03026
Any Resolution Any Geometry: From Multi-View To Multi-Patch
任意分辨率 任意几何:从多视角到多补丁
Abstract
Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth--normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation, reducing AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and lowering mean angular error from 23.36 degrees to 18.51 degrees, while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.
Chinese Translation
表面法线和深度的联合估计对于全面的三维场景理解至关重要,但由于在保持细致局部细节和维持全局一致性之间的权衡,高分辨率预测仍然困难。为了解决这一挑战,我们提出了超分辨率几何变换器(Ultra Resolution Geometry Transformer, URGT),该变换器将视觉几何基础变换器(Visual Geometry Grounded Transformer, VGGT)改编为一个统一的多补丁变换器,用于单目高分辨率深度-法线估计。单个高分辨率图像被划分为补丁,这些补丁通过预训练模型的粗略深度和法线先验进行增强,并在单次前向传播中共同处理,以预测精细的几何输出。通过跨补丁注意力强制全局一致性,这使得长距离几何推理和信息在共享主干网络内的无缝传播成为可能。为了进一步增强空间鲁棒性,我们引入了一种GridMix补丁采样策略,该策略在训练过程中以概率方式采样网格配置,从而改善补丁间的一致性和泛化能力。我们的方法在UnrealStereo4K上取得了最先进的结果,联合提高了深度和法线估计,将绝对相对误差(AbsRel)从0.0582降低到0.0291,均方根误差(RMSE)从2.17降低到1.31,平均角度误差从23.36度降低到18.51度,同时生成更清晰和更稳定的几何形状。所提出的多补丁框架还展示了强大的零样本和跨域泛化能力,并有效扩展到非常高的分辨率,为高质量几何细化提供了一种高效且可扩展的解决方案。
cs.CV / 94 / 2603.03030
BRIGHT: A Collaborative Generalist-Specialist Foundation Model for Breast Pathology
BRIGHT:一种用于乳腺病理的协作通才-专才基础模型
Guo, Xiaojing, Lin, Jiatai, Jia, Yumian, Huang, Jingqi, Xu, Zeyan, Li, Weidong, Wang, Longfei, Chen, Jingjing, Li, Qin, Wang, Weiwei, Cui, Lifang, Yue, Wen, Cheng, Zhiqiang, Wei, Xiaolong, Yu, Jianzhong, Jin, Xia, Li, Baizhou, Shen, Honghong, Li, Jing, Li, Chunlan, Cui, Yanfen, Dai, Yi, Yang, Yiling, Qian, Xiaolong, Yang, Liu, Yang, Yang, Gao, Guangshen, Li, Yaqing, Zhai, Lili, Liu, Chenying, Zhang, Tianhua, Shi, Zhenwei, Lu, Cheng, Zhou, Xingchen, Xu, Jing, Zhao, Miaoqing, Mei, Fang, Zhou, Jiaojiao, Mao, Ning, Liu, Fangfang, Han, Chu, Liu, Zaiyi
Abstract
Generalist pathology foundation models (PFMs), pretrained on large-scale multi-organ datasets, have demonstrated remarkable predictive capabilities across diverse clinical applications. However, their proficiency on the full spectrum of clinically essential tasks within a specific organ system remains an open question due to the lack of large-scale validation cohorts for a single organ as well as the absence of a tailored training paradigm that can effectively translate broad histomorphological knowledge into the organ-specific expertise required for specialist-level interpretation. In this study, we propose BRIGHT, the first PFM specifically designed for breast pathology, trained on approximately 210 million histopathology tiles from over 51,000 breast whole-slide images derived from a cohort of over 40,000 patients across 19 hospitals. BRIGHT employs a collaborative generalist-specialist framework to capture both universal and organ-specific features. To comprehensively evaluate the performance of PFMs on breast oncology, we curate the largest multi-institutional cohorts to date for downstream task development and evaluation, comprising over 25,000 WSIs across 10 hospitals. The validation cohorts cover the full spectrum of breast pathology across 24 distinct clinical tasks spanning diagnosis, biomarker prediction, treatment response and survival prediction. Extensive experiments demonstrate that BRIGHT outperforms three leading generalist PFMs, achieving state-of-the-art (SOTA) performance in 21 of 24 internal validation tasks and in 5 of 10 external validation tasks with excellent heatmap interpretability. By evaluating on large-scale validation cohorts, this study not only demonstrates BRIGHT's clinical utility in breast oncology but also validates a collaborative generalist-specialist paradigm, providing a scalable template for developing PFMs on a specific organ system.
Chinese Translation
通才病理基础模型(PFMs)在大规模多脏器数据集上进行预训练,已在多种临床应用中展现出显著的预测能力。然而,由于缺乏针对单一器官的大规模验证队列,以及缺乏能够有效将广泛的组织形态学知识转化为专家级解读所需的器官特定专业知识的定制训练范式,其在特定器官系统内的临床关键任务的全面能力仍然是一个未解的问题。在本研究中,我们提出了BRIGHT,这是首个专门为乳腺病理设计的PFM,训练数据来自于19家医院超过40,000名患者的51,000多张乳腺全切片图像的约2.1亿个组织病理图块。BRIGHT采用协作通才-专才框架,以捕捉普遍特征和器官特定特征。为了全面评估PFMs在乳腺肿瘤学中的表现,我们策划了迄今为止最大的多机构队列用于下游任务开发和评估,涵盖来自10家医院的超过25,000个全切片图像(WSIs)。验证队列涵盖了乳腺病理的全谱,包括24个不同的临床任务,涉及诊断、生物标志物预测、治疗反应和生存预测。大量实验表明,BRIGHT在24个内部验证任务中的21个任务上超越了三种领先的通才PFMs,并在10个外部验证任务中的5个任务上实现了最先进的(SOTA)表现,且热图可解释性优秀。通过在大规模验证队列上的评估,本研究不仅展示了BRIGHT在乳腺肿瘤学中的临床实用性,还验证了协作通才-专才范式,为在特定器官系统上开发PFMs提供了可扩展的模板。
cs.CV / 95 / 2603.03066
EduVQA: Benchmarking AI-Generated Video Quality Assessment for Education
EduVQA:教育领域AI生成视频质量评估的基准测试
Abstract
While AI-generated content (AIGC) models have achieved remarkable success in generating photorealistic videos, their potential to support visual, story-driven learning in education remains largely untapped. To close this gap, we present EduAIGV-1k, the first benchmark dataset and evaluation framework dedicated to assessing the quality of AI-generated videos (AIGVs) designed to teach foundational math concepts, such as numbers and geometry, to young learners. EduAIGV-1k contains 1,130 short videos produced by ten state-of-the-art text-to-video (T2V) models using 113 pedagogy-oriented prompts. Each video is accompanied by rich, fine-grained annotations along two complementary axes: (1) Perceptual quality, disentangled into spatial and temporal fidelity, and (2) Prompt alignment, labeled at the word-level and sentence-level to quantify the degree to which each mathematical concept in the prompt is accurately grounded in the generated video. These fine-grained annotations transform each video into a multi-dimensional, interpretable supervision signal, far beyond a single quality score. Leveraging this dense feedback, we introduce EduVQA for both perceptual and alignment quality assessment of AIGVs. In particular, we propose a Structured 2D Mixture-of-Experts (S2D-MoE) module, which enhances the dependency between overall quality and each sub-dimension by shared experts and dynamic 2D gating matrix. Extensive experiments show our EduVQA consistently outperforms existing VQA baselines. Both our dataset and code will be publicly available.
Chinese Translation
尽管AI生成内容(AIGC)模型在生成逼真的视频方面取得了显著成功,但其在教育中支持视觉和故事驱动学习的潜力仍然未得到充分利用。为了解决这一问题,我们提出了EduAIGV-1k,这是第一个专门用于评估旨在教授基础数学概念(如数字和几何)给年轻学习者的AI生成视频(AIGVs)的基准数据集和评估框架。EduAIGV-1k包含由十个最先进的文本到视频(T2V)模型使用113个以教学为导向的提示生成的1,130个短视频。每个视频都附有丰富的细粒度注释,涵盖两个互补的维度:(1)感知质量,分解为空间和时间的保真度,以及(2)提示对齐,在词级和句级进行标注,以量化提示中每个数学概念在生成视频中准确体现的程度。这些细粒度注释将每个视频转化为多维的、可解释的监督信号,远超单一的质量评分。利用这些密集反馈,我们引入了EduVQA,用于评估AIGVs的感知质量和对齐质量。特别地,我们提出了一种结构化的二维专家混合模型(S2D-MoE)模块,通过共享专家和动态二维门控矩阵增强整体质量与每个子维度之间的依赖关系。大量实验表明,我们的EduVQA在性能上始终优于现有的VQA基线方法。我们的数据集和代码将公开发布。
cs.CV / 96 / 2603.03075
TinyIceNet: Low-Power SAR Sea Ice Segmentation for On-Board FPGA Inference
TinyIceNet:用于机载FPGA推理的低功耗SAR海冰分割
Abstract
Accurate sea ice mapping is essential for safe maritime navigation in polar regions, where rapidly changing ice conditions require timely and reliable information. While Sentinel-1 Synthetic Aperture Radar (SAR) provides high-resolution, all-weather observations of sea ice, conventional ground-based processing is limited by downlink bandwidth, latency, and energy costs associated with transmitting large volumes of raw data. On-board processing, enabled by dedicated inference chips integrated directly within the satellite payload, offers a transformative alternative by generating actionable sea ice products in orbit. In this context, we present TinyIceNet, a compact semantic segmentation network co-designed for on-board Stage of Development (SOD) mapping from dual-polarized Sentinel-1 SAR imagery under strict hardware and power constraints. Trained on the AI4Arctic dataset, TinyIceNet combines SAR-aware architectural simplifications with low-precision quantization to balance accuracy and efficiency. The model is synthesized using High-Level Synthesis and deployed on a Xilinx Zynq UltraScale+ FPGA platform, demonstrating near-real-time inference with significantly reduced energy consumption. Experimental results show that TinyIceNet achieves 75.216% F1 score on SOD segmentation while reducing energy consumption by 2x compared to full-precision GPU baselines, underscoring the potential of chip-level hardware-algorithm co-design for future spaceborne and edge AI systems.
Chinese Translation
准确的海冰制图对于极地地区安全的海上航行至关重要,因为快速变化的冰情需要及时和可靠的信息。虽然Sentinel-1合成孔径雷达(SAR)提供了高分辨率、全天候的海冰观测,但传统的地面处理受到下行带宽、延迟以及传输大量原始数据所需的能源成本的限制。机载处理通过在卫星有效载荷中直接集成专用推理芯片,提供了一种变革性的替代方案,可以在轨道上生成可操作的海冰产品。在此背景下,我们提出了TinyIceNet,这是一种紧凑的语义分割网络,旨在满足在严格的硬件和功耗约束下,从双极化的Sentinel-1 SAR影像进行机载开发阶段(SOD)制图的需求。TinyIceNet在AI4Arctic数据集上进行训练,结合了SAR感知的架构简化与低精度量化,以平衡准确性和效率。该模型使用高级综合(High-Level Synthesis)进行合成,并部署在Xilinx Zynq UltraScale+ FPGA平台上,展示了接近实时的推理能力,同时显著降低了能耗。实验结果表明,TinyIceNet在SOD分割上达到了75.216%的F1分数,同时能耗仅为全精度GPU基线的一半,突显了芯片级硬件-算法协同设计在未来空间和边缘AI系统中的潜力。
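The low-precision quantization that TinyIceNet pairs with its architectural simplifications can be illustrated with the most common building block, symmetric per-tensor int8 quantization. This is a generic sketch of the technique, not the paper's specific quantization scheme:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantisation: the kind of low-precision
    step applied before synthesising a network onto FPGA fabric.
    Returns integer codes plus the scale needed to dequantise."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([0.50, -1.00, 0.25, 0.00])
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale  # dequantised weights
```

The round-trip error is bounded by half the scale, which is why accuracy can be largely preserved while multipliers shrink from 32-bit floating point to 8-bit integer logic.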
cs.CV / 97 / 2603.03101
MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection
MoECLIP:用于零样本异常检测的补丁专用专家
Abstract
The CLIP model's outstanding generalization has driven recent success in Zero-Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP's powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose \textbf{MoECLIP}, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods. The code is available at https://github.com/CoCoRessa/MoECLIP.
Chinese Translation
CLIP模型卓越的泛化能力推动了最近在零样本异常检测(ZSAD)领域的成功,能够检测未见类别中的异常。ZSAD的核心挑战在于在保留CLIP强大泛化能力的同时,使模型专门化于异常检测任务。现有尝试解决这一挑战的方法存在一个根本性限制,即采用补丁无关的设计,单一处理所有补丁,而不考虑其独特特征。为了解决这一限制,我们提出了MoECLIP,一种用于ZSAD任务的专家混合(Mixture-of-Experts, MoE)架构,通过根据每个图像补丁的独特特征动态路由到专门的低秩适应(Low-Rank Adaptation, LoRA)专家,实现补丁级适应。此外,为了防止LoRA专家之间的功能冗余,我们引入了(1)冻结正交特征分离(Frozen Orthogonal Feature Separation, FOFS),该方法正交分离输入特征空间,迫使专家关注不同的信息,以及(2)一个单纯形等角紧框架(simplex equiangular tight frame, ETF)损失,以调节专家输出形成最大等角表示。跨越14个涵盖工业和医疗领域的基准数据集的全面实验结果表明,MoECLIP优于现有的最先进方法。代码可在 https://github.com/CoCoRessa/MoECLIP 获取。
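The patch-level routing MoECLIP describes, each patch dynamically sent to its own LoRA expert on top of a frozen backbone weight, can be sketched as follows. All names and shapes here are illustrative assumptions (top-1 routing, a single linear router), not the paper's exact design:

```python
import numpy as np

def moe_lora_forward(X, W0, A, B, router_W):
    """Top-1 routing of each patch to a specialised LoRA expert.

    X        : (n, d) patch features
    W0       : (d, d) frozen base weight (the pretrained CLIP projection)
    A, B     : (E, r, d) down- and (E, d, r) up-projections, one per expert
    router_W : (d, E) router producing per-patch expert logits
    """
    expert = (X @ router_W).argmax(axis=-1)   # top-1 expert per patch
    out = X @ W0.T                            # frozen path keeps CLIP intact
    for e in np.unique(expert):
        m = expert == e
        out[m] += (X[m] @ A[e].T) @ B[e].T    # low-rank expert update
    return out, expert

rng = np.random.default_rng(0)
n, d, r, E = 6, 8, 2, 3
X = rng.normal(size=(n, d))
W0 = rng.normal(size=(d, d))
A = rng.normal(size=(E, r, d))
B = np.zeros((E, d, r))   # zero-init up-projection: a no-op at the start
out, expert = moe_lora_forward(X, W0, A, B, rng.normal(size=(d, E)))
```

Zero-initializing the up-projection is the standard LoRA trick: training starts from exactly the frozen model's behavior, which matches the goal of specializing without sacrificing CLIP's generalization.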
cs.CV / 98 / 2603.03125
AWDiff: An a trous wavelet diffusion model for lung ultrasound image synthesis
AWDiff:一种用于肺部超声图像合成的a trous小波扩散模型
Abstract
Lung ultrasound (LUS) is a safe and portable imaging modality, but the scarcity of data limits the development of machine learning methods for image interpretation and disease monitoring. Existing generative augmentation methods, such as Generative Adversarial Networks (GANs) and diffusion models, often lose subtle diagnostic cues due to resolution reduction, particularly B-lines and pleural irregularities. We propose A trous Wavelet Diffusion (AWDiff), a diffusion based augmentation framework that integrates the a trous wavelet transform to preserve fine-scale structures while avoiding destructive downsampling. In addition, semantic conditioning with BioMedCLIP, a vision language foundation model trained on large scale biomedical corpora, enforces alignment with clinically meaningful labels. On a LUS dataset, AWDiff achieved lower distortion and higher perceptual quality compared to existing methods, demonstrating both structural fidelity and clinical diversity.
Chinese Translation
肺部超声(LUS)是一种安全且便携的成像方式,但数据稀缺限制了机器学习方法在图像解读和疾病监测中的发展。现有的生成增强方法,如生成对抗网络(GANs)和扩散模型,往往由于分辨率降低而丧失细微的诊断线索,尤其是B线和胸膜不规则性。我们提出了A trous小波扩散(AWDiff),这是一种基于扩散的增强框架,集成了a trous小波变换,以保留细微结构,同时避免破坏性的下采样。此外,利用BioMedCLIP进行语义条件化,这是一种在大规模生物医学语料库上训练的视觉语言基础模型,确保与临床意义标签的一致性。在LUS数据集中,AWDiff相比现有方法实现了更低的失真和更高的感知质量,展示了结构保真性和临床多样性。
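The à trous (undecimated) wavelet transform that AWDiff integrates is a classical algorithm: repeated smoothing with a dilated kernel, with detail bands taken as differences between successive smoothings, so resolution is never reduced. A 1D numpy sketch with the usual B3-spline kernel (the choice of kernel and periodic boundary here are assumptions of this illustration):

```python
import numpy as np

def atrous_decompose(signal, levels=3):
    """Undecimated (a trous) wavelet decomposition with a B3-spline kernel.

    At level j the 5-tap kernel is applied with its taps spaced 2**j
    samples apart (the "holes"), so no downsampling ever occurs and fine
    structures such as B-lines are never destroyed by resolution loss.
    The detail bands plus the final coarse band reconstruct the input.
    """
    h = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    c = np.asarray(signal, dtype=float)
    details = []
    for j in range(levels):
        step = 2 ** j
        smooth = np.zeros_like(c)
        for k, tap in enumerate(h):
            smooth += tap * np.roll(c, (k - 2) * step)  # periodic boundary
        details.append(c - smooth)  # band-pass detail at scale 2**j
        c = smooth
    return details, c

x = np.sin(np.linspace(0.0, 4.0 * np.pi, 64))
details, coarse = atrous_decompose(x, levels=3)
```

Because the detail bands telescope, summing them with the final coarse band recovers the input, which is the property that lets a diffusion model operate on the bands without destructive downsampling.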
cs.CV / 99 / 2603.03143
Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
基于几何引导的强化学习用于多视角一致的3D场景编辑
Abstract
Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose \textbf{RL3DEdit}, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
Chinese Translation
利用2D扩散模型的先验知识进行3D编辑已成为一种有前景的范式。然而,在编辑结果中保持多视角一致性仍然具有挑战性,并且3D一致性编辑配对数据的极度稀缺使得监督微调(SFT)这一针对编辑任务的最有效训练策略变得不可行。在本文中,我们观察到,尽管生成多视角一致的3D内容极具挑战性,但验证3D一致性是可行的,这自然使得强化学习(RL)成为一个可行的解决方案。基于此,我们提出了RL3DEdit,这是一个由RL优化驱动的单次处理框架,采用从3D基础模型VGGT中衍生的新奖励。具体而言,我们利用VGGT从大量真实世界数据中学习到的强大先验,输入编辑后的图像,并利用输出的置信度图和姿态估计误差作为奖励信号,通过RL有效地将2D编辑先验锚定到3D一致的流形上。大量实验表明,RL3DEdit实现了稳定的多视角一致性,并在编辑质量上以高效率超越了最先进的方法。为了促进3D编辑的发展,我们将发布代码和模型。
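The reward signal RL3DEdit derives from the geometry model's outputs, confidence maps plus pose-estimation error, might be combined into a scalar roughly as below. The linear combination and the weight `lam` are assumptions of this sketch, not the paper's actual reward:

```python
import numpy as np

def consistency_reward(conf_maps, pose_err, lam=0.5):
    """Scalar RL reward from a geometry foundation model's outputs:
    confident geometry on the edited views and a small relative-pose
    error both indicate the edit sits on a 3D-consistent manifold.

    conf_maps : (V, H, W) per-view confidence maps on the edited images
    pose_err  : scalar pose-estimation error across the edited views
    lam       : trade-off weight (an assumption of this sketch)
    """
    return float(conf_maps.mean()) - lam * float(pose_err)

r_good = consistency_reward(np.full((2, 4, 4), 0.8), pose_err=0.2)
r_bad = consistency_reward(np.full((2, 4, 4), 0.3), pose_err=1.0)
```

The key property, which holds for any monotone combination, is that edits the geometry model finds consistent score strictly higher, giving the policy a verifiable signal without 3D-consistent paired data.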
cs.CV / 100 / 2603.03160
Kling-MotionControl Technical Report
Kling-MotionControl技术报告
Kling Team, Chen, Jialu, Ding, Yikang, Fang, Zhixue, Gai, Kun, He, Kang, He, Xu, Hua, Jingyun, Lao, Mingming, Li, Xiaohan, Liu, Hui, Liu, Jiwen, Liu, Xiaoqiang, Shi, Fan, Shi, Xiaoyu, Sun, Peiqin, Tang, Songlin, Wan, Pengfei, Wen, Tiancheng, Wu, Zhiyong, Zhang, Haoxian, Zhao, Runze, Zhang, Yuanxing, Zhou, Yan
Abstract
Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. In this work, we present Kling-MotionControl, a unified DiT-based framework engineered specifically for robust, precise, and expressive holistic character animation. Leveraging a divide-and-conquer strategy within a cohesive system, the model orchestrates heterogeneous motion representations tailored to the distinct characteristics of body, face, and hands, effectively reconciling large-scale structural stability with fine-grained articulatory expressiveness. To ensure robust cross-identity generalization, we incorporate adaptive identity-agnostic learning, facilitating natural motion retargeting for diverse characters ranging from realistic humans to stylized cartoons. Simultaneously, we guarantee faithful appearance preservation through meticulous identity injection and fusion designs, further supported by a subject library mechanism that leverages comprehensive reference contexts. To ensure practical utility, we implement an advanced acceleration framework utilizing multi-stage distillation, boosting inference speed by over 10x. Kling-MotionControl distinguishes itself through intelligent semantic motion understanding and precise text responsiveness, allowing for flexible control beyond visual inputs. Human preference evaluations demonstrate that Kling-MotionControl delivers superior performance compared to leading commercial and open-source solutions, achieving exceptional fidelity in holistic motion control, open domain generalization, and visual quality and coherence. These results establish Kling-MotionControl as a robust solution for high-quality, controllable, and lifelike character animation.
Chinese Translation
角色动画旨在通过将运动动态从驱动视频转移到参考图像来生成逼真的视频。最近在生成模型方面的进展为高保真角色动画铺平了道路。在本研究中,我们提出了Kling-MotionControl,这是一个基于DiT的统一框架,专门设计用于强大、精确和富有表现力的整体角色动画。该模型利用在一个统一系统内的分而治之策略,协调针对身体、面部和手部独特特征的异构运动表示,有效地调和大规模结构稳定性与细粒度的关节表现力。为了确保强大的跨身份泛化能力,我们结合了自适应身份无关学习,促进了从现实人类到风格化卡通等多样角色的自然运动重定向。同时,我们通过细致的身份注入和融合设计确保忠实的外观保留,并通过利用全面参考上下文的主题库机制进一步支持这一点。为了确保实用性,我们实施了一个先进的加速框架,利用多阶段蒸馏,将推理速度提升超过10倍。Kling-MotionControl通过智能的语义运动理解和精确的文本响应能力,使其在视觉输入之外实现灵活控制。人类偏好评估表明,Kling-MotionControl在整体运动控制、开放领域泛化以及视觉质量和一致性方面的表现优于领先的商业和开源解决方案,达到卓越的保真度。这些结果确立了Kling-MotionControl作为高质量、可控和逼真角色动画的强大解决方案。
cs.CV / 101 / 2603.03163
Conditioned Activation Transport for T2I Safety Steering
用于 T2I 安全引导的条件激活传输
Abstract
Despite their impressive capabilities, current Text-to-Image (T2I) models remain prone to generating unsafe and toxic content. While activation steering offers a promising inference-time intervention, we observe that linear activation steering frequently degrades image quality when applied to benign prompts. To address this trade-off, we first construct SafeSteerDataset, a contrastive dataset containing 2300 safe and unsafe prompt pairs with high cosine similarity. Leveraging this data, we propose Conditioned Activation Transport (CAT), a framework that employs a geometry-based conditioning mechanism and nonlinear transport maps. By conditioning transport maps to activate only within unsafe activation regions, we minimize interference with benign queries. We validate our approach on two state-of-the-art architectures: Z-Image and Infinity. Experiments demonstrate that CAT generalizes effectively across these backbones, significantly reducing Attack Success Rate while maintaining image fidelity compared to unsteered generations. Warning: This paper contains potentially offensive text and images.
Chinese Translation
尽管当前的文本到图像(T2I)模型具有令人印象深刻的能力,但仍然容易生成不安全和有毒的内容。虽然激活引导提供了一种有前景的推理时干预方法,但我们观察到线性激活引导在应用于良性提示时常常会降低图像质量。为了解决这一权衡,我们首先构建了 SafeSteerDataset,这是一个对比数据集,包含 2300 对具有高余弦相似度的安全和不安全提示。利用这些数据,我们提出了条件激活传输(Conditioned Activation Transport, CAT),这是一个采用基于几何的条件机制和非线性传输映射的框架。通过将传输映射条件化以仅在不安全激活区域内激活,我们最小化了对良性查询的干扰。我们在两种最先进的架构上验证了我们的方法:Z-Image 和 Infinity。实验表明,CAT 在这些基础架构上有效地泛化,显著降低了攻击成功率,同时相比于未引导生成保持了图像的保真度。警告:本文包含可能令人反感的文本和图像。
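The conditioning mechanism in CAT, intervening only when an activation falls in the unsafe region, can be contrasted with unconditional linear steering in a few lines. This toy uses a single direction and a scalar threshold as stand-ins for the paper's geometry-based condition and nonlinear transport maps:

```python
import numpy as np

def conditioned_steer(x, u, tau, alpha=1.0):
    """Steer an activation away from an 'unsafe' direction only when it
    actually lies in the unsafe region, leaving benign prompts untouched.

    x   : (d,) activation at some layer
    u   : (d,) unit vector toward unsafe content (fit on contrastive pairs)
    tau : threshold separating the safe and unsafe activation regions
    """
    proj = float(x @ u)
    if proj <= tau:                      # benign: no intervention at all
        return x
    return x - alpha * (proj - tau) * u  # pull back toward the boundary

u = np.array([1.0, 0.0])
benign = np.array([0.5, 2.0])
unsafe = np.array([3.0, 2.0])
```

Unconditional linear steering would shift `benign` too, which is exactly the image-quality degradation on benign prompts that the abstract reports; the gate avoids it by construction.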
cs.CV / 102 / 2603.03187
ProSMA-UNet: Decoder Conditioning for Proximal-Sparse Skip Feature Selection
ProSMA-UNet:用于近端稀疏跳跃特征选择的解码器条件化
Abstract
Medical image segmentation commonly relies on U-shaped encoder-decoder architectures such as U-Net, where skip connections preserve fine spatial detail by injecting high-resolution encoder features into the decoder. However, these skip pathways also propagate low-level textures, background clutter, and acquisition noise, allowing irrelevant information to bypass deeper semantic filtering -- an issue that is particularly detrimental in low-contrast clinical imaging. Although attention gates have been introduced to address this limitation, they typically produce dense sigmoid masks that softly reweight features rather than explicitly removing irrelevant activations. We propose ProSMA-UNet (Proximal-Sparse Multi-Scale Attention U-Net), which reformulates skip gating as a decoder-conditioned sparse feature selection problem. ProSMA constructs a multi-scale compatibility field using lightweight depthwise dilated convolutions to capture relevance across local and contextual scales, then enforces explicit sparsity via an $\ell_1$ proximal operator with learnable per-channel thresholds, yielding a closed-form soft-thresholding gate that can remove noisy responses. To further suppress semantically irrelevant channels, ProSMA incorporates decoder-conditioned channel gating driven by global decoder context. Extensive experiments on challenging 2D and 3D benchmarks demonstrate state-of-the-art performance, with particularly large gains ($\approx20$\%) on difficult 3D segmentation tasks. Project page: https://math-ml-x.github.io/ProSMA-UNet/
Chinese Translation
医学图像分割通常依赖于 U 型编码器-解码器架构,如 U-Net,其中跳跃连接通过将高分辨率编码器特征注入解码器来保留细致的空间细节。然而,这些跳跃路径也传播低级纹理、背景杂波和采集噪声,使得无关信息能够绕过更深层次的语义过滤——这一问题在低对比度临床成像中尤为严重。尽管引入了注意力门以解决这一局限性,但它们通常生成密集的 sigmoid 掩模,柔性地重新加权特征,而不是明确地去除无关激活。我们提出了 ProSMA-UNet(近端稀疏多尺度注意力 U-Net),将跳跃门控重新表述为一个解码器条件化的稀疏特征选择问题。ProSMA 使用轻量级深度膨胀卷积构建多尺度兼容性场,以捕捉局部和上下文尺度的相关性,然后通过具有可学习的每通道阈值的 $\ell_1$ 近端算子强制显式稀疏性,产生一个闭式形式的软阈值门,可以去除噪声响应。为了进一步抑制语义上无关的通道,ProSMA 结合了由全局解码器上下文驱动的解码器条件通道门控。在具有挑战性的 2D 和 3D 基准测试上的大量实验表明了其最先进的性能,尤其是在困难的 3D 分割任务上获得了显著的提升(约 20%)。项目页面: https://math-ml-x.github.io/ProSMA-UNet/
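The closed-form soft-thresholding gate in the abstract is the classical proximal operator of the $\ell_1$ penalty, applied per channel. A minimal numpy sketch (the tensor layout and the way `tau` is broadcast are illustrative assumptions):

```python
import numpy as np

def soft_threshold_gate(skip_feats, tau):
    """Closed-form prox of an l1 penalty on skip features: shrink every
    activation toward zero and cut those below a per-channel threshold,
    so noisy skip responses are removed rather than softly reweighted.

    skip_feats : (C, H, W) encoder features entering a skip connection
    tau        : (C,) learnable non-negative per-channel thresholds
    """
    t = tau[:, None, None]
    return np.sign(skip_feats) * np.maximum(np.abs(skip_feats) - t, 0.0)

x = np.array([[[2.0, -0.3, -1.0]]])   # one channel, a 1x3 'feature map'
gated = soft_threshold_gate(x, np.array([0.5]))
```

Unlike a dense sigmoid attention mask, this operator produces exact zeros for sub-threshold activations, which is what makes the selection explicitly sparse.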
cs.CV / 103 / 2603.03192
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
MoD-DPO:通过模态解耦偏好优化减轻全模态大语言模型中的跨模态幻觉
Abstract
Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
Chinese Translation
全模态大语言模型(omni LLMs)最近在视听理解任务中取得了强劲的表现,但它们仍然对由虚假相关性和主导语言先验引起的跨模态幻觉高度敏感。在本研究中,我们提出了模态解耦直接偏好优化(Modality-Decoupled Direct Preference Optimization,MoD-DPO),这是一个简单而有效的框架,用于改善全模态 LLMs 的模态基础。MoD-DPO 引入了模态感知的正则化项,明确强制对无关模态中的干扰保持不变,对相关模态中的扰动保持敏感,从而减少意外的跨模态交互。为了进一步减轻对文本先验的过度依赖,我们结合了一种语言先验去偏差惩罚,旨在抑制易产生幻觉的仅文本响应。针对多个视听幻觉基准的广泛实验表明,MoD-DPO 一贯提高了感知准确性和幻觉抵抗力,在相似的训练预算下超越了先前的偏好优化基线。我们的研究结果强调了模态忠实对齐的重要性,并展示了朝着更可靠和更具弹性的多模态基础模型迈进的可扩展路径。
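The two modality-aware regularizers MoD-DPO describes, invariance to corrupting irrelevant modalities and sensitivity to perturbing relevant ones, can be written as toy penalties on a response's log-probability. The squared and hinge forms and the `margin` are assumptions of this sketch, not the paper's exact losses:

```python
def modality_regularizers(logp_clean, logp_irrel_corrupt, logp_rel_perturb,
                          margin=1.0):
    """Toy versions of the two regularisers described in the abstract.

    invariance : corrupting an *irrelevant* modality should leave the
                 log-probability of the grounded answer unchanged.
    sensitivity: perturbing the *relevant* modality should move it by at
                 least `margin`; otherwise the model ignored that modality.
    """
    inv = (logp_irrel_corrupt - logp_clean) ** 2
    sens = max(0.0, margin - abs(logp_rel_perturb - logp_clean))
    return inv, sens
```

Both penalties vanish exactly when the model's answer depends only on the relevant modality, which is the modality-faithful alignment the abstract targets.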
cs.CV / 104 / 2603.03195
Chain of World: World Model Thinking in Latent Motion
世界链:潜在运动中的世界模型思维
Abstract
Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx-hit.github.io/cowvla-io.
Chinese Translation
视觉-语言-动作(VLA)模型是实现具身智能的有希望的路径,但它们往往忽视了视觉动态背后的预测和时间因果结构。世界模型VLA通过预测未来帧来解决这一问题,但在重建冗余背景时浪费了容量。潜在动作VLA紧凑地编码帧与帧之间的过渡,但缺乏时间连续的动态建模和世界知识。为了克服这些局限性,我们提出了CoWVLA(世界链VLA),一种新的“世界链”范式,将世界模型的时间推理与解耦的潜在运动表示统一起来。首先,预训练的视频变分自编码器(VAE)作为潜在运动提取器,明确地将视频片段分解为结构和运动潜变量。然后,在预训练期间,VLA从指令和初始帧中学习,以推断连续的潜在运动链并预测片段的终止帧。最后,在共同微调期间,这种潜在动态通过在统一的自回归解码器中联合建模稀疏关键帧和动作序列,与离散动作预测对齐。这一设计保留了时间推理和世界知识的世界模型优势,同时保持了潜在动作的紧凑性和可解释性,从而实现高效的视觉运动学习。在机器人仿真基准上的大量实验表明,CoWVLA优于现有的世界模型和潜在动作方法,并实现了适度的计算效率,突显了其作为更有效的VLA预训练范式的潜力。项目网站可在 https://fx-hit.github.io/cowvla-io 找到。
cs.CV / 105 / 2603.03197
Specificity-aware reinforcement learning for fine-grained open-world classification
针对细粒度开放世界分类的特异性意识强化学习
Abstract
Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at https://github.com/s-angheben/SpeciaRL.
Chinese Translation
在开放世界环境下对细粒度视觉概念进行分类,即在没有预定义标签集的情况下,要求模型既要准确又要具体。近期的推理大型多模态模型(LMMs)展现出强大的视觉理解能力,但在进行细粒度图像分类时往往会产生过于泛化的预测。我们的初步分析表明,这些模型确实具备内在的细粒度领域知识。然而,在不妨碍正确预测的前提下,促进更具体的预测(特异性)仍然是一个复杂且未得到充分研究的挑战。在本研究中,我们探讨如何引导推理LMMs朝着既正确又具体的预测方向发展。我们提出了一种新颖的特异性意识强化学习框架SpeciaRL,以在开放世界环境下对推理LMMs进行细粒度图像分类的微调。SpeciaRL引入了一种动态的、基于验证者的奖励信号,该信号依托于在线推演(rollouts)中的最佳预测,促进特异性,同时尊重模型的能力以防止错误预测。我们的域外实验表明,SpeciaRL在广泛的细粒度基准测试中提供了正确性与特异性之间的最佳权衡,超越了现有方法,推动了开放世界细粒度图像分类的发展。代码和模型可在 https://github.com/s-angheben/SpeciaRL 上公开获取。
cs.CV / 106 / 2603.03239
COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design
COP-GEN:用于哥白尼地球观测数据的潜在扩散变换器——按设计生成的随机性
Abstract
Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover products. Relationships between these modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations. Thus, such conditional mappings should be parametrised as data distributions. As a result, deterministic models tend to collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining. Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and systematically adapts its output uncertainty as conditioning information increases. These results highlight the practical importance of stochastic generative modeling for Earth observation and motivate evaluation protocols that move beyond single-reference, pointwise metrics. Website: https://miquel-espinosa.github.io/cop-gen
Chinese Translation
地球观测应用日益依赖来自多个传感器的数据,包括光学、雷达、高程和土地覆盖产品。这些模态之间的关系对于数据集成至关重要,但本质上是非单射的:相同的条件信息可能对应多个物理上合理的观测。因此,这种条件映射应被参数化为数据分布。其结果是,确定性模型往往会向条件均值收敛,无法表示数据补全和跨传感器翻译等任务所需的不确定性和变异性。我们提出了COP-GEN,一种多模态潜在扩散变换器,能够在其原生空间分辨率下建模异构地球观测模态的联合分布。通过将跨模态映射参数化为条件分布,COP-GEN实现了灵活的任意到任意条件生成,包括零样本模态翻译、光谱带填充以及在部分或缺失输入下的生成,而无需针对特定任务的重训练。在一个大规模全球多模态数据集上的实验表明,COP-GEN能够生成多样且物理上一致的实现,同时在光学、雷达和高程模态中保持强大的峰值保真度。定性和定量分析表明,该模型捕捉了有意义的跨模态结构,并随着条件信息的增加系统地调整其输出不确定性。这些结果突显了随机生成建模在地球观测中的实际重要性,并促使人们采用超越单一参考、逐点指标的评估协议。网站:https://miquel-espinosa.github.io/cop-gen
cs.CV / 107 / 2603.03241
UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
UniG2U-Bench:统一模型是否推动多模态理解?
Abstract
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
Chinese Translation
统一多模态模型最近展示了强大的生成能力,但生成是否以及何时能改善理解仍不明确。现有基准缺乏对生成促进理解的具体任务的系统性探索。为此,我们引入了UniG2U-Bench,这是一个全面的基准,将生成到理解(G2U)评估分为7个领域和30个子任务,要求不同程度的隐式或显式视觉转换。对30多个模型的广泛评估揭示了三个核心发现:1)统一模型通常表现不及其基础的视觉-语言模型(VLMs),且生成后回答(GtA)推理通常相较于直接推理会降低性能。2)在空间智能、视觉错觉或多轮推理子任务中,持续的增强效果显现,其中增强的空间和形状感知以及多步中间图像状态被证明是有益的。3)具有相似推理结构的任务和共享架构的模型表现出相关行为,这表明生成-理解耦合在任务、预训练数据和模型架构上诱导了一致的归纳偏差。这些发现强调了需要更多样化的训练数据和新颖的范式,以充分释放统一多模态建模的潜力。
cs.CV / 108 / 2603.03265
DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
DuoMo:用于世界空间人类重建的双重运动扩散
Abstract
We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: https://yufu-wang.github.io/duomo/
Chinese Translation
我们提出了DuoMo,这是一种生成方法,能够从带有噪声或不完整观测的无约束视频中恢复世界空间坐标下的人类运动。重建此类运动需要解决一个基本的权衡:在保持全局运动一致性的同时,从多样且嘈杂的视频输入中进行泛化。我们的方法通过将运动学习分解为两个扩散模型来解决这个问题。相机空间模型首先从相机坐标的视频中估计运动。然后,世界空间模型将这一初步估计提升到世界坐标,并对其进行全局一致性优化。两个模型结合在一起,可以在多样的场景和轨迹中重建运动,即使在高度嘈杂或不完整的观测下也能实现。此外,我们的公式是通用的,能够直接生成网格顶点的运动,绕过参数化模型。DuoMo达到了最先进的性能。在EMDB数据集上,我们的方法在保持较低脚部滑动(foot skating)的同时,将世界空间重建误差降低了16%。在RICH数据集上,将世界空间误差降低了30%。项目页面:https://yufu-wang.github.io/duomo/
cs.CV / 109 / 2603.03269
LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory
LoGeR:具有混合记忆的长时序几何重建
Abstract
Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.
Chinese Translation
前馈几何基础模型在短时间窗口重建方面表现出色,但将其扩展到数分钟的视频时,由于二次注意力复杂性或递归设计中有效记忆的限制,面临瓶颈。我们提出了LoGeR(长时序几何重建),这是一种新颖的架构,能够在不进行后期优化的情况下将密集的3D重建扩展到极长的序列。LoGeR以块的形式处理视频流,利用强大的双向先验进行高保真度的块内推理。为了解决跨块边界的一致性这一关键挑战,我们提出了一种基于学习的混合记忆模块。该双组件系统结合了参数化的测试时训练(Test-Time Training, TTT)记忆,以固定全局坐标框架并防止尺度漂移,以及非参数化的滑动窗口注意力(Sliding Window Attention, SWA)机制,以保持未压缩的上下文,实现高精度的相邻对齐。值得注意的是,这种记忆架构使LoGeR能够在128帧的序列上进行训练,并在推理过程中推广至数千帧。在标准基准测试和一个新改造的VBR数据集(序列长度可达19k帧)上评估,LoGeR显著超越了先前的最先进前馈方法——在KITTI数据集上将ATE降低超过74%——并在前所未有的范围内实现了稳健且全局一致的重建。
cs.CV / 110 / 2603.03276
Beyond Language Modeling: An Exploration of Multimodal Pretraining
超越语言建模:多模态预训练的探索
Tong, Shengbang, Fan, David, Nguyen, John, Brown, Ellis, Zhou, Gaoyue, Qian, Shengyi, Zheng, Boyang, Vallaeys, Théophane, Han, Junlin, Fergus, Rob, Murray, Naila, Ghazvininejad, Marjan, Lewis, Mike, Ballas, Nicolas, Bar, Amir, Rabbat, Michael, Verbeek, Jakob, Zettlemoyer, Luke, Sinha, Koustuv, LeCun, Yann, Xie, Saining
Abstract
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
Chinese Translation
视觉世界为推动基础模型超越语言提供了一个关键轴心。尽管对此方向的兴趣日益增长,但原生多模态模型的设计空间仍然不够明晰。我们通过受控的从零开始预训练实验提供了实证上的清晰结论,在排除语言预训练干扰的前提下分离出决定多模态预训练效果的因素。我们采用了Transfusion框架,使用下一个标记预测进行语言训练,并使用扩散方法进行视觉训练,训练的数据包括文本、视频、图像-文本对,甚至是动作条件的视频。我们的实验得出了四个关键见解:(i) 表示自编码器(Representation Autoencoder, RAE)通过在视觉理解和生成方面的优越表现,提供了最佳的统一视觉表示;(ii) 视觉和语言数据是互补的,并为下游能力带来了协同效应;(iii) 统一的多模态预训练自然导致世界建模,相关能力从通用训练中涌现;(iv) 专家混合模型(Mixture-of-Experts, MoE)实现了高效且有效的多模态扩展,同时自然地促进了模态专业化。通过IsoFLOP分析,我们计算了两种模态的扩展法则,并揭示了一种扩展不对称性:视觉对数据的需求显著高于语言。我们证明了MoE架构通过提供语言所需的高模型容量,同时适应视觉的数据密集特性,调和了这种扩展不对称性,为真正统一的多模态模型铺平了道路。
cs.CV / 111 / 2603.03281
CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
CFG-Ctrl:基于控制的无分类器扩散引导
Abstract
Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional-unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC-CFG), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales. Project Page: https://hanyang-21.github.io/CFG-Ctrl
Chinese Translation
无分类器引导(Classifier-Free Guidance, CFG)已成为增强基于流的扩散模型语义对齐的核心方法。在本文中,我们探讨了一个称为CFG-Ctrl的统一框架,该框架将CFG重新解释为施加于一阶连续时间生成流的控制,利用条件-无条件差异作为误差信号来调整速度场。从这个角度来看,我们将普通CFG总结为具有固定增益的比例控制器(P-control),而典型的后续变体则基于此开发扩展的控制律设计。然而,现有方法主要依赖线性控制,固有地导致不稳定、超调以及在大引导尺度上语义保真度降低。为了解决这个问题,我们引入了滑模控制CFG(Sliding Mode Control CFG, SMC-CFG),该方法强制生成流朝向快速收敛的滑模流形。具体而言,我们在语义预测误差上定义了一个指数滑模面,并引入了切换控制项以建立非线性反馈引导的修正。此外,我们提供了李雅普诺夫稳定性分析,以理论支持有限时间收敛。针对包括Stable Diffusion 3.5、Flux和Qwen-Image在内的文本到图像生成模型的实验表明,SMC-CFG在语义对齐上优于标准CFG,并在广泛的引导尺度上增强了鲁棒性。项目页面:https://hanyang-21.github.io/CFG-Ctrl
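The control-theoretic reading in this abstract is easy to make concrete. The sketch below is an assumption-laden illustration, not the paper's implementation: `cfg_p_control` writes vanilla CFG as a fixed-gain proportional controller on the conditional-unconditional error, and `cfg_smc_like` shows a hypothetical sliding-mode-flavoured alternative whose gain saturates with the error norm; the names and the parameters `lam`, `k`, `eps` are invented here.

```python
import numpy as np

def cfg_p_control(v_cond, v_uncond, w=7.5):
    """Vanilla CFG read as a proportional controller: the
    conditional-unconditional gap is the error signal, and the
    guidance scale w is a fixed proportional gain."""
    error = v_cond - v_uncond
    return v_uncond + w * error

def cfg_smc_like(v_cond, v_uncond, lam=7.5, k=0.5, eps=1e-3):
    """Hypothetical sliding-mode-flavoured variant (illustrative
    only): a saturating gain plus a bounded, smoothed switching
    term replace the fixed linear gain of vanilla CFG."""
    error = v_cond - v_uncond
    norm = np.linalg.norm(error) + eps
    gain = lam * (1.0 - np.exp(-norm))   # saturates instead of growing linearly
    switch = k * error / norm            # smoothed sign()-style correction
    return v_uncond + gain * error + switch
```

The saturating gain is one way to avoid the overshoot the abstract attributes to linear control at large guidance scales; the actual SMC-CFG surface and switching law are defined in the paper itself.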
cs.CV / 112 / 2603.03282
MIBURI: Towards Expressive Interactive Gesture Synthesis
MIBURI:朝向富有表现力的互动手势合成
Abstract
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures against recent baselines. We urge the reader to explore demo videos on https://vcai.mpi-inf.mpg.de/projects/MIBURI/.
Chinese Translation
具身对话代理(ECA)旨在通过语音、手势和面部表情模拟人类的面对面互动。目前基于大型语言模型(LLM)的对话代理缺乏具身性和自然互动所需的表现性手势。现有的ECA解决方案通常产生僵硬且多样性低的动作,不适合类人互动。另一方面,用于共话手势合成的生成式方法虽能产生自然的身体手势,但依赖于未来的语音上下文,并且需要较长的运行时间。为了填补这一空白,我们提出了MIBURI,这是第一个在线的因果框架,用于生成与实时口语对话同步的表现性全身手势和面部表情。我们采用了身体部位感知的手势编解码器,将层次运动细节编码为多层离散标记。这些标记随后由一个二维因果框架自回归生成,该框架以基于LLM的语音-文本嵌入为条件,实时建模时间动态和部位级运动层次。此外,我们引入辅助目标,以鼓励表现性和多样化的手势,同时防止收敛到静态姿势。比较评估表明,我们的因果、实时方法相对于近期基线能够生成自然且与上下文一致的手势。我们鼓励读者访问 https://vcai.mpi-inf.mpg.de/projects/MIBURI/ 观看演示视频。
cs.CV / 113 / 2603.03283
Utonia: Toward One Encoder for All Point Clouds
Utonia:朝着一个适用于所有点云的编码器迈进
Abstract
We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.
Chinese Translation
我们梦想着一个未来,来自各个领域的点云能够汇聚在一起,形成一个惠及所有领域的单一模型。为实现这一目标,我们提出了Utonia,这是朝着在多个领域(包括遥感、户外激光雷达、室内RGB-D序列、以物体为中心的CAD模型以及从仅RGB视频中提取的点云)训练单一自监督点变换器编码器的第一步。尽管这些领域在传感几何、密度和先验方面存在显著差异,Utonia仍然学习到一个一致的表示空间,可以跨领域迁移。这种统一性提高了感知能力,同时揭示了只有在共同训练时才会出现的有趣的涌现行为。除了感知,我们观察到Utonia的表示也可以促进具身和多模态推理:将视觉-语言-动作策略以Utonia特征为条件可以改善机器人操作,而将其整合到视觉-语言模型中则在空间推理上带来了收益。我们希望Utonia能够成为稀疏3D数据基础模型的一个步骤,并支持增强现实/虚拟现实、机器人技术和自动驾驶等下游应用。
cs.AI / 1 / 2603.02214
Federated Inference: Toward Privacy-Preserving Collaborative and Incentivized Model Serving
联邦推理:迈向隐私保护的协作与激励模型服务
Abstract
Federated Inference (FI) studies how independently trained and privately owned models can collaborate at inference time without sharing data or model parameters. While recent work has explored secure and distributed inference from disparate perspectives, a unified abstraction and system-level understanding of FI remain lacking. This paper positions FI as a distinct collaborative paradigm, complementary to federated learning, and identifies two fundamental requirements that govern its feasibility: inference-time privacy preservation and meaningful performance gains through collaboration. We formalize FI as a protected collaborative computation, analyze its core design dimensions, and examine the structural trade-offs that arise when privacy constraints, non-IID data, and limited observability are jointly imposed at inference time. Through a concrete instantiation and empirical analysis, we highlight recurring friction points in privacy-preserving inference, ensemble-based collaboration, and incentive alignment. Our findings suggest that FI exhibits system-level behaviors that cannot be directly inherited from training-time federation or classical ensemble methods. Overall, this work provides a unifying perspective on FI and outlines open challenges that must be addressed to enable practical, scalable, and privacy-preserving collaborative inference systems.
Chinese Translation
联邦推理(Federated Inference, FI)研究独立训练且私有拥有的模型如何在不共享数据或模型参数的情况下于推理时进行协作。尽管近期的研究从不同角度探讨了安全和分布式推理,但对FI的统一抽象和系统级理解仍然缺乏。本文将FI定位为一种独特的协作范式,与联邦学习互补,并识别出决定其可行性的两项基本要求:推理时的隐私保护和通过协作实现的显著性能提升。我们将FI形式化为一种受保护的协作计算,分析其核心设计维度,并考察在推理时同时施加隐私约束、非独立同分布(non-IID)数据和有限可观测性时所产生的结构性权衡。通过具体实例和实证分析,我们突出了隐私保护推理、基于集成的协作和激励对齐中反复出现的摩擦点。我们的研究结果表明,FI表现出无法直接从训练时联邦或经典集成方法中继承的系统级行为。总体而言,本研究为FI提供了一个统一的视角,并概述了为实现实用、可扩展且隐私保护的协作推理系统所必须解决的开放挑战。
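The simplest concrete instantiation of the ensemble-style collaboration this abstract describes is label-level voting: each party exposes only a prediction function, so neither training data nor model parameters cross party boundaries. This is a toy sketch under that assumption, not the paper's protected protocol, and all names in it are hypothetical.

```python
from collections import Counter

def federated_infer(local_predictors, x, weights=None):
    """Combine predictions from independently owned models by
    weighted voting. Only the per-input label leaves each party;
    no data or parameters are shared."""
    weights = weights or [1.0] * len(local_predictors)
    votes = Counter()
    for predict, w in zip(local_predictors, weights):
        votes[predict(x)] += w          # each party contributes one weighted vote
    return votes.most_common(1)[0][0]   # majority (highest-weight) label
```

The weights are where incentive alignment would enter: a real FI system, per the abstract, must additionally handle privacy of the exchanged predictions and non-IID expertise across parties.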
cs.AI / 2 / 2603.02239
Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
工程推理与指导(ERI)基准:一个基于分类法的大型数据集,用于基础模型和智能体
Abstract
The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. This dataset spans nine engineering fields (namely: civil, mechanical, electrical, chemical, environmental, aerospace, materials, fire, and industrial engineering) and 55 subdomains, and is crossed with seven intent types (i.e., definition, explanation, calculation, comparison, design/synthesis, troubleshooting, and code-related) and three difficulty tiers (undergraduate, graduate, and professional), yielding 57,750 records with field/subdomain/type/difficulty metadata and solution formatting. We examined ERI via seven LLMs and report a statistically significant three-tier performance structure, with frontier models (GPT-5, Claude Sonnet 4, DeepSeek V3.1) achieving mean scores above 4.30 on a five-point scale, while mid-tier and smaller models exhibited progressively higher failure rates and steeper performance degradation on graduate-level questions. To address circularity concerns inherent in LLM benchmarks, we developed a convergent validation protocol that leverages cross-provider independence, multi-judge averaging, and frontier-model agreement analysis to empirically bound hallucination risk to 1.7%. ERI is released with taxonomy specifications, validation scripts, and an evaluation harness to enable reproducible comparisons and regression testing for instruction tuning, routing, retrieval-augmented evaluation, and agentic tool-use workflows in engineering settings.
Chinese Translation
工程推理与指导(ERI)基准是一个基于分类法的指令数据集,旨在训练和评估具有工程能力的大型语言模型(LLMs)和智能体。该数据集涵盖九个工程领域(即:土木工程、机械工程、电气工程、化学工程、环境工程、航空航天工程、材料工程、消防工程和工业工程)及55个子领域,并与七种意图类型(即:定义、解释、计算、比较、设计/综合、故障排除和代码相关)以及三个难度层级(本科、研究生和专业)交叉组合,生成了57,750条记录,包含领域/子领域/类型/难度的元数据和解决方案格式。我们通过七个LLM对ERI进行了检验,并报告了一个统计显著的三层性能结构:前沿模型(GPT-5、Claude Sonnet 4、DeepSeek V3.1)在五分制上平均得分超过4.30,而中层和小型模型在研究生级别问题上表现出逐渐升高的失败率和更陡峭的性能下降。为了解决LLM基准中固有的循环性问题,我们开发了一种收敛验证协议,利用跨提供者独立性、多评审平均和前沿模型一致性分析,将幻觉风险实证限制在1.7%以内。ERI随分类法规范、验证脚本和评估工具一同发布,以便在工程环境中对指令微调、路由、检索增强评估和智能体工具使用工作流程进行可重复的比较和回归测试。
cs.AI / 3 / 2603.02240
SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning
超本地记忆:具有贝叶斯信任防御的隐私保护多智能体记忆系统抵御记忆毒化
Abstract
We present SuperLocalMemory, a local-first memory system for multi-agent AI that defends against OWASP ASI06 memory poisoning through architectural isolation and Bayesian trust scoring, while personalizing retrieval through adaptive learning-to-rank -- all without cloud dependencies or LLM inference calls. As AI agents increasingly rely on persistent memory, cloud-based memory systems create centralized attack surfaces where poisoned memories propagate across sessions and users -- a threat demonstrated in documented attacks against production systems. Our architecture combines SQLite-backed storage with FTS5 full-text search, Leiden-based knowledge graph clustering, an event-driven coordination layer with per-agent provenance, and an adaptive re-ranking framework that learns user preferences through three-layer behavioral analysis (cross-project technology preferences, project context detection, and workflow pattern mining). Evaluation across seven benchmark dimensions demonstrates 10.6ms median search latency, zero concurrency errors under 10 simultaneous agents, trust separation (gap =0.90) with 72% trust degradation for sleeper attacks, and 104% improvement in NDCG@5 when adaptive re-ranking is enabled. Behavioral data is isolated in a separate database with GDPR Article 17 erasure support. SuperLocalMemory is open-source (MIT) and integrates with 17+ development tools via Model Context Protocol.
Chinese Translation
我们提出了超本地记忆(SuperLocalMemory),这是一种面向多智能体人工智能的本地优先记忆系统,通过架构隔离和贝叶斯信任评分来防御OWASP ASI06记忆毒化,同时通过自适应学习排序实现个性化检索——所有这些都不依赖于云服务或大型语言模型(LLM)推理调用。随着人工智能代理越来越依赖持久性记忆,基于云的记忆系统形成了集中式攻击面,使得被毒化的记忆在会话和用户之间传播——这一威胁已在针对生产系统的有记录攻击中得到证明。我们的架构结合了基于SQLite的存储与FTS5全文搜索、基于Leiden算法的知识图谱聚类、带有逐代理来源追踪的事件驱动协调层,以及一个通过三层行为分析(跨项目技术偏好、项目上下文检测和工作流模式挖掘)学习用户偏好的自适应重排序框架。对七个基准维度的评估表明:搜索中位延迟为10.6毫秒,在10个并发代理下零并发错误,信任分离(差距=0.90)且对潜伏攻击(sleeper attacks)的信任降级达72%,启用自适应重排序时NDCG@5提升104%。行为数据被隔离在一个独立的数据库中,并支持GDPR第17条规定的数据删除。超本地记忆是开源的(MIT许可),并通过模型上下文协议(Model Context Protocol)与17个以上的开发工具集成。
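The abstract does not spell out its Bayesian trust rule, but a common construction for this kind of defense is a Beta-Bernoulli posterior per memory source, where each verified-consistent or inconsistent write updates the source's reliability estimate. The sketch below is only that generic construction, with invented numbers; it is not SuperLocalMemory's actual scoring code, though it shows how a "sleeper" source that turns malicious sees its score degrade.

```python
class BayesianTrust:
    """Toy Beta-Bernoulli trust score per memory source."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta   # uniform Beta(1,1) prior

    def update(self, consistent: bool):
        # Each verification outcome nudges the posterior over reliability.
        if consistent:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def score(self) -> float:
        return self.alpha / (self.alpha + self.beta)   # posterior mean

trusted, sleeper = BayesianTrust(), BayesianTrust()
for _ in range(9):
    trusted.update(True)
# A "sleeper" source behaves well at first, then injects poisoned memories.
for ok in [True] * 5 + [False] * 12:
    sleeper.update(ok)
gap = trusted.score - sleeper.score   # a large gap lets the system quarantine the source
```

The appeal of the Beta posterior is that early good behavior cannot permanently launder a source: sustained inconsistency keeps pushing the score down, which is the property the abstract's 72% trust degradation for sleeper attacks measures.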
cs.AI / 4 / 2603.02359
Estimating Visual Attribute Effects in Advertising from Observational Data: A Deepfake-Informed Double Machine Learning Approach
从观察数据中估计广告中的视觉属性效应:一种基于深度伪造的双重机器学习方法
Abstract
Digital advertising increasingly relies on visual content, yet marketers lack rigorous methods for understanding how specific visual attributes causally affect consumer engagement. This paper addresses a fundamental methodological challenge: estimating causal effects when the treatment, such as a model's skin tone, is an attribute embedded within the image itself. Standard approaches like Double Machine Learning (DML) fail in this setting because vision encoders entangle treatment information with confounding variables, producing severely biased estimates. We develop DICE-DML (Deepfake-Informed Control Encoder for Double Machine Learning), a framework that leverages generative AI to disentangle treatment from confounders. The approach combines three mechanisms: (1) deepfake-generated image pairs that isolate treatment variation; (2) DICE-Diff adversarial learning on paired difference vectors, where background signals cancel to reveal pure treatment fingerprints; and (3) orthogonal projection that geometrically removes treatment-axis components. In simulations with known ground truth, DICE-DML reduces root mean squared error by 73-97% compared to standard DML, with the strongest improvement (97.5%) at the null effect point, demonstrating robust Type I error control. Applying DICE-DML to 232,089 Instagram influencer posts, we estimate the causal effect of skin tone on engagement. Standard DML produces diagnostically invalid results (negative outcome R^2), while DICE-DML achieves valid confounding control (R^2 = 0.63) and estimates a marginally significant negative effect of darker skin tone (-522 likes; p = 0.062), substantially smaller than the biased standard estimate. Our framework provides a principled approach for causal inference with visual data when treatments and confounders coexist within images.
Chinese Translation
数字广告越来越依赖视觉内容,但市场营销人员缺乏严格的方法来理解特定视觉属性如何因果地影响消费者参与度。本文解决了一个基本的方法论挑战:当处理(例如模特的肤色)是嵌入在图像自身中的属性时,如何估计因果效应。标准方法如双重机器学习(Double Machine Learning, DML)在这种情况下失效,因为视觉编码器将处理信息与混淆变量纠缠在一起,导致严重的偏倚估计。我们开发了 DICE-DML(基于深度伪造的双重机器学习控制编码器),这是一个利用生成式人工智能将处理与混淆因素解耦的框架。该方法结合了三种机制:(1)深度伪造生成的图像对,用于隔离处理变异;(2)在配对差异向量上进行的 DICE-Diff 对抗学习,其中背景信号相互抵消以揭示纯粹的处理指纹;(3)正交投影,从几何上去除处理轴成分。在已知真实值的模拟中,与标准 DML 相比,DICE-DML 将均方根误差降低了 73-97%,在零效应点的改善最为显著(97.5%),显示出稳健的第一类错误控制。将 DICE-DML 应用于 232,089 条 Instagram 网红帖子,我们估计了肤色对参与度的因果效应。标准 DML 产生了诊断上无效的结果(结果 R^2 为负),而 DICE-DML 实现了有效的混淆控制(R^2 = 0.63),并估计出较深肤色具有边际显著的负效应(-522 个点赞;p = 0.062),其幅度远小于有偏的标准估计。我们的框架为处理与混淆因素共存于图像中时的视觉数据因果推断提供了一种原则性的方法。
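For context on the baseline this paper corrects, the standard DML partialling-out estimator can be sketched in a few lines: residualize both treatment and outcome on the confounder, then regress residual on residual. The synthetic data, linear nuisance model, and true effect of 2.0 below are all invented for illustration (real DML uses flexible ML learners with cross-fitting, and DICE-DML's contribution is precisely the encoder that keeps the image-embedded treatment out of the confounder representation).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)             # confounder (stand-in for image context)
t = 0.8 * x + rng.normal(size=n)   # treatment entangled with the confounder
y = 2.0 * t + 3.0 * x + rng.normal(size=n)   # true treatment effect = 2.0

def residualize(v, x):
    # Simple linear nuisance model; real DML swaps in ML learners
    # plus cross-fitting, omitted here for brevity.
    coef = np.dot(x, v) / np.dot(x, x)
    return v - coef * x

# Partialling-out: regress the outcome residual on the treatment residual.
t_res, y_res = residualize(t, x), residualize(y, x)
theta = np.dot(t_res, y_res) / np.dot(t_res, t_res)   # recovers ~2.0
```

When, as the abstract argues, the "confounder" features also encode the treatment, `residualize` strips out treatment variation itself and `theta` is biased, which is the failure mode DICE-DML's disentangled encoder is designed to prevent.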
cs.AI / 5 / 2603.02365
Can machines be uncertain?
机器能否存在不确定性?
Abstract
The paper investigates whether and how AI systems can realize states of uncertainty. By adopting a functionalist and behavioral perspective, it examines how symbolic, connectionist and hybrid architectures make room for uncertainty. The paper distinguishes between epistemic uncertainty, or uncertainty inherent in the data or information, and subjective uncertainty, or the system's own attitude of being uncertain. It further distinguishes between distributed and discrete realizations of subjective uncertainty. A key contribution is the idea that some states of uncertainty are interrogative attitudes whose content is a question rather than a proposition.
Chinese Translation
本文探讨了人工智能系统是否以及如何实现不确定性状态。通过采用功能主义和行为主义的视角,研究了符号、连接主义和混合架构如何容纳不确定性。本文区分了认知不确定性(即数据或信息固有的不确定性)和主观不确定性(即系统自身对不确定性的态度)。进一步区分了主观不确定性的分布式和离散实现。一个关键贡献是提出某些不确定性状态是询问态度,其内容是一个问题而非命题。
cs.AI / 6 / 2603.02396
COOL-MC: Verifying and Explaining RL Policies for Platelet Inventory Management
COOL-MC:验证和解释用于血小板库存管理的强化学习策略
Abstract
Platelets expire within five days. Blood banks face uncertain daily demand and must balance ordering decisions between costly wastage from overstocking and life-threatening shortages from understocking. Reinforcement learning (RL) can learn effective ordering policies for this Markov decision process (MDP), but the resulting neural policies remain black boxes, hindering trust and adoption in safety-critical domains. We apply COOL-MC, a tool that combines RL with probabilistic model checking and explainable RL, to verify and explain a trained policy for the MDP on platelet inventory management inspired by Haijema et al. By constructing a policy-induced discrete-time Markov chain (which includes only the reachable states under the trained policy to reduce memory usage), we verify PCTL properties and provide feature-level explanations. Results show that the trained policy achieves a 2.9% stockout probability and a 1.1% inventory-full (potential wastage) probability within a 200-step horizon, primarily attends to the age distribution of inventory rather than other features such as day of week or pending orders. Action reachability analysis reveals that the policy employs a diverse replenishment strategy, with most order quantities reached quickly, while several are never selected. Counterfactual analysis shows that replacing medium-large orders with smaller ones leaves both safety probabilities nearly unchanged, indicating that these orders are placed in well-buffered inventory states. This first formal verification and explanation of an RL platelet inventory management policy demonstrates COOL-MC's value for transparent, auditable decision-making in safety-critical healthcare supply chain domains.
Chinese Translation
血小板在五天内会过期。血库面临不确定的每日需求,必须在过量库存导致的高昂浪费与库存不足造成的生命威胁之间平衡订购决策。强化学习(RL)可以为这一马尔可夫决策过程(MDP)学习有效的订购策略,但所得到的神经网络策略仍然是黑箱,这阻碍了在安全关键领域的信任和采用。我们应用COOL-MC,这是一种将RL与概率模型检查和可解释RL相结合的工具,以验证和解释针对Haijema等人提出的血小板库存管理MDP的训练策略。通过构建一个策略诱导的离散时间马尔可夫链(仅包括在训练策略下可达的状态,以减少内存使用),我们验证了PCTL属性并提供了特征级别的解释。结果表明,训练的策略在200步的时间范围内实现了2.9%的缺货概率和1.1%的库存满(潜在浪费)概率,主要关注库存的年龄分布,而非其他特征,如星期几或待处理订单。行动可达性分析显示,该策略采用了多样化的补货策略,大多数订单数量迅速达到,而有些则从未被选择。反事实分析表明,用较小的订单替代中大型订单几乎不改变安全概率,表明这些订单是在良好缓冲的库存状态下进行的。这是对RL血小板库存管理策略的首次正式验证和解释,展示了COOL-MC在安全关键医疗供应链领域透明、可审计决策中的价值。
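The bounded-until queries this abstract reports (e.g. the 2.9% stockout probability within a 200-step horizon) reduce to simple matrix-vector iteration over the policy-induced Markov chain. The three-state chain below is entirely made up for illustration; a model checker like the one underlying COOL-MC answers PCTL queries of the form P=? [F<=200 "stockout"] in the same spirit, over the (much larger) reachable state space of the trained policy.

```python
import numpy as np

# Toy policy-induced DTMC over inventory states; state 2 is an
# absorbing "stockout" state. All transition numbers are invented.
P = np.array([
    [0.9500, 0.0495, 0.0005],   # healthy stock
    [0.7000, 0.2990, 0.0010],   # low stock
    [0.0000, 0.0000, 1.0000],   # stockout (absorbing)
])
bad = 2

def reach_prob(P, bad, k):
    """P(reach `bad` within k steps) from every state, i.e. the
    bounded-until query P=? [F<=k "stockout"], by value iteration."""
    p = np.zeros(P.shape[0])
    p[bad] = 1.0
    for _ in range(k):
        p = P @ p
        p[bad] = 1.0   # the target state stays absorbed
    return p

probs = reach_prob(P, bad, 200)   # 200-step horizon, as in the paper's query
```

Restricting the chain to states reachable under the trained policy, as the paper does, shrinks `P` before this iteration ever runs, which is what keeps memory usage manageable for neural policies.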
cs.AI / 7 / 2603.02435
VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings
VL-KGE:视觉-语言模型与知识图谱嵌入的结合
Abstract
Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at learning continuous representations of entities and relations, yet they are typically designed for unimodal settings. Recent approaches extend KGE to multimodal settings but remain constrained, often processing modalities in isolation, resulting in weak cross-modal alignment, and relying on simplistic assumptions such as uniform modality availability across entities. Vision-Language Models (VLMs) offer a powerful way to align diverse modalities within a shared embedding space. We propose Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from VLMs with structured relational modeling to learn unified multimodal representations of knowledge graphs. Experiments on WN9-IMG and two novel fine art MKGs, WikiArt-MKG-v1 and WikiArt-MKG-v2, demonstrate that VL-KGE consistently improves over traditional unimodal and multimodal KGE methods in link prediction tasks. Our results highlight the value of VLMs for multimodal KGE, enabling more robust and structured reasoning over large-scale heterogeneous knowledge graphs.
Chinese Translation
现实世界的多模态知识图谱(MKGs)本质上是异质的,建模与多种模态相关联的实体。传统的知识图谱嵌入(KGE)方法在学习实体和关系的连续表示方面表现出色,但通常是为单模态环境设计的。最近的方法将KGE扩展到多模态环境,但仍然受到限制,通常孤立地处理模态,导致跨模态对齐较弱,并依赖于诸如实体间模态可用性均匀等简单假设。视觉-语言模型(VLMs)提供了一种强大的方式,在共享嵌入空间中对齐多样的模态。我们提出了视觉-语言知识图谱嵌入(VL-KGE),这是一个将VLMs的跨模态对齐与结构化关系建模相结合的框架,以学习知识图谱的统一多模态表示。在WN9-IMG和两个新颖的美术MKGs(WikiArt-MKG-v1和WikiArt-MKG-v2)上的实验表明,VL-KGE在链接预测任务中始终优于传统的单模态和多模态KGE方法。我们的结果突显了VLMs在多模态KGE中的价值,使得对大规模异质知识图谱的推理更加稳健和结构化。
cs.AI / 8 / 2603.02473
Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory
诊断大型语言模型代理记忆中的检索与利用瓶颈
Abstract
Memory-augmented LLM agents store and retrieve information from prior interactions, yet the relative importance of how memories are written versus how they are retrieved remains unclear. We introduce a diagnostic framework that analyzes how performance differences manifest across write strategies, retrieval methods, and memory utilization behavior, and apply it to a 3x3 study crossing three write strategies (raw chunks, Mem0-style fact extraction, MemGPT-style summarization) with three retrieval methods (cosine, BM25, hybrid reranking). On LoCoMo, retrieval method is the dominant factor: average accuracy spans 20 points across retrieval methods (57.1% to 77.2%) but only 3-8 points across write strategies. Raw chunked storage, which requires zero LLM calls, matches or outperforms expensive lossy alternatives, suggesting that current memory pipelines may discard useful context that downstream retrieval mechanisms fail to compensate for. Failure analysis shows that performance breakdowns most often manifest at the retrieval stage rather than at utilization. We argue that, under current retrieval practices, improving retrieval quality yields larger gains than increasing write-time sophistication. Code is publicly available at https://github.com/boqiny/memory-probe.
Chinese Translation
增强记忆的大型语言模型(LLM)代理能够存储和检索来自先前交互的信息,但记忆的写入方式与检索方式的相对重要性仍不明确。我们提出了一种诊断框架,分析在不同的写入策略、检索方法和记忆利用行为下,性能差异是如何表现的,并将其应用于一个3x3的研究,交叉比较三种写入策略(原始块、Mem0风格的事实提取、MemGPT风格的摘要)与三种检索方法(余弦相似度、BM25、混合重排序)。在LoCoMo上,检索方法是主导因素:不同检索方法之间的平均准确率差异达到20个百分点(57.1%到77.2%),而写入策略之间的差异仅为3-8个百分点。原始块存储不需要任何LLM调用,其性能与昂贵的有损替代方案相当或更优,这表明当前的记忆管道可能会丢弃有用的上下文,而下游检索机制无法弥补。故障分析表明,性能下降通常发生在检索阶段,而非利用阶段。我们认为,在当前的检索实践下,提高检索质量所带来的收益大于提升写入时的复杂性。代码可在 https://github.com/boqiny/memory-probe 上公开获取。
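As a minimal sketch of the cosine-retrieval baseline in the 3x3 study above, the following uses a toy bag-of-words embedding as a stand-in for the learned encoders the paper evaluates; the memory entries and query are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (illustrative stand-in for a learned encoder)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query, memories, k=2):
    """Return the top-k memory chunks ranked by cosine similarity."""
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]

memories = [
    "Alice adopted a cat last spring",
    "Bob moved to Berlin for a new job",
    "Alice's cat is named Miso",
]
print(retrieve("what is the name of Alice's cat", memories, k=1))
```

Swapping the scoring function for BM25 or a reranker changes only the `key=` line, which is what makes retrieval method an easy axis to vary independently of write strategy.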
cs.AI / 9 / 2603.02479
PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference
PRISM:通过过程奖励模型引导推理推动深度思维的前沿
Abstract
DEEPTHINK methods improve reasoning by generating, refining, and aggregating populations of candidate solutions, which enables strong performance on complex mathematical and scientific tasks. However, existing frameworks often lack reliable correctness signals during inference, which creates a population-enhancement bottleneck where deeper deliberation amplifies errors, suppresses correct minority solutions, and yields weak returns to additional compute. In this paper, we introduce a functional decomposition of DEEPTHINK systems and propose PRISM, a Process Reward Model (PRM)-guided inference algorithm that uses step-level verification to guide both population refinement and solution aggregation. During refinement, PRISM treats candidate solutions as particles in a PRM-defined energy landscape and reshapes the population through score-guided resampling and stochastic refinement, which concentrates probability mass on higher-quality reasoning while preserving diversity. Across mathematics and science benchmarks, PRISM is competitive with or outperforms existing DEEPTHINK methods, reaching 90.0%, 75.4%, and 71.4% with gpt-oss-20b on AIME25, HMMT25, and GPQA Diamond, respectively, while matching or exceeding gpt-oss-120b. Additionally, our analysis shows that PRISM produces consistent net-directional correction during refinement, remains reliable when the initial population contains few correct candidates, and often lies on the compute-accuracy Pareto frontier.
Chinese Translation
DEEPTHINK 方法通过生成、细化和聚合候选解的群体来改善推理,从而在复杂的数学和科学任务中实现强大的性能。然而,现有框架在推理过程中往往缺乏可靠的正确性信号,这导致了一个群体增强瓶颈:更深的思考放大了错误,抑制了正确的少数解,并且对额外计算的回报较弱。本文介绍了 DEEPTHINK 系统的功能分解,并提出了 PRISM,一种基于过程奖励模型(Process Reward Model, PRM)引导的推理算法,该算法利用逐步验证来指导群体细化和解的聚合。在细化过程中,PRISM 将候选解视为 PRM 定义的能量景观中的粒子,通过得分引导的重采样和随机细化来重塑群体,这样可以在保持多样性的同时将概率质量集中在更高质量的推理上。在数学和科学基准测试中,PRISM 的表现与现有 DEEPTHINK 方法相当或更优,在 AIME25、HMMT25 和 GPQA Diamond 上分别达到了 90.0%、75.4% 和 71.4% 的准确率,同时与 gpt-oss-120b 相匹配或超过。此外,我们的分析显示,PRISM 在细化过程中产生了一致的净方向性修正,当初始群体包含少量正确候选时仍然可靠,并且通常位于计算-准确性帕累托前沿上。
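The score-guided resampling step described above can be sketched as softmax-weighted resampling of the candidate population by verifier scores; the candidate names, scores, and temperature below are illustrative placeholders, not the paper's PRM or energy landscape.

```python
import math
import random

def resample(candidates, scores, temperature=1.0, rng=None):
    """Resample a population in proportion to exp(score / T): probability
    mass concentrates on high-scoring reasoning while lower-scoring
    candidates retain a nonzero chance, preserving diversity."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    weights = [math.exp(s / temperature) for s in scores]
    return rng.choices(candidates, weights=weights, k=len(candidates))

candidates = ["sol_A", "sol_B", "sol_C", "sol_D"]
scores = [0.9, 0.2, 0.8, 0.1]  # hypothetical step-level verifier scores
population = resample(candidates, scores, temperature=0.3)
print(population)
```

Lowering the temperature sharpens selection toward the highest-scoring candidates; raising it keeps the population closer to uniform.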
cs.AI / 10 / 2603.02495
Revealing Positive and Negative Role Models to Help People Make Good Decisions
揭示正面和负面榜样以帮助人们做出良好决策
Abstract
We consider a setting where agents take action by following their role models in a social network, and study strategies for a social planner to help agents by revealing whether the role models are positive or negative. Specifically, agents observe a local neighborhood of possible role models they can emulate, but do not know their true labels. Revealing a positive label encourages emulation, while revealing a negative one redirects agents toward alternative options. The social planner observes all labels, but operates under a limited disclosure budget that it selectively allocates to maximize social welfare (the expected number of agents who emulate adjacent positive role models). We consider both algorithms and hardness results for welfare maximization, and provide a sample-complexity guarantee when the planner observes a sampled subset of agents. We also consider fairness guarantees when agents belong to different groups. It is a technical challenge that the ability to reveal negative role models breaks submodularity. We thus introduce a proxy welfare function that remains submodular even when revealed targets include negative ones. When each agent has at most a constant number of negative target neighbors, we use this proxy to achieve a constant-factor approximation to the true optimal welfare gain. When agents belong to different groups, we also show that each group's welfare gain is within a constant factor of the optimum achievable if the full budget were allocated to that group. Beyond this basic model, we also propose an intervention model that directly connects high-risk agents to positive role models, and a coverage radius model that expands the visibility of selected positive role models. Lastly, we conduct extensive experiments on four real-world datasets to support our theoretical results and assess the effectiveness of the proposed algorithms.
Chinese Translation
我们考虑一个情境,其中代理通过在社交网络中追随他们的榜样来采取行动,并研究社会规划者通过揭示榜样是正面还是负面的策略来帮助代理。具体而言,代理观察到一组可能的榜样的局部邻域,但并不知道它们的真实标签。揭示正面标签会鼓励模仿,而揭示负面标签则会引导代理转向其他选择。社会规划者观察到所有标签,但在有限的披露预算下运作,选择性地分配预算以最大化社会福利(即模仿相邻正面榜样的代理的预期数量)。我们考虑了福利最大化的算法和难度结果,并在规划者观察到一个样本子集时提供了样本复杂度保证。我们还考虑了当代理属于不同群体时的公平性保证。揭示负面榜样的能力打破了次模性,这是一个技术挑战。因此,我们引入了一个代理福利函数,即使在揭示目标包括负面榜样时仍保持次模性。当每个代理最多有常数个负面目标邻居时,我们使用这个代理函数来实现对真实最优福利增益的常数因子近似。当代理属于不同群体时,我们还表明,如果将全部预算分配给该群体,则每个群体的福利增益在一个常数因子内接近可实现的最优值。除了这个基本模型,我们还提出了一种干预模型,直接将高风险代理与正面榜样连接起来,以及一个覆盖半径模型,扩展所选正面榜样的可见性。最后,我们在四个真实世界的数据集上进行了广泛的实验,以支持我们的理论结果并评估所提算法的有效性。
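The budgeted disclosure problem above can be sketched with the standard greedy rule for monotone submodular maximization, which the paper's proxy welfare function is designed to admit; the toy coverage proxy and neighborhood graph below are illustrative, not the paper's welfare function.

```python
def greedy_select(candidates, budget, gain):
    """Greedy budgeted maximization: repeatedly add the candidate with
    the largest marginal gain. Near-optimal when the underlying set
    function is monotone submodular."""
    chosen = set()
    for _ in range(budget):
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: gain(chosen | {c}) - gain(chosen),
                   default=None)
        if best is None:
            break
        chosen.add(best)
    return chosen

# Toy coverage proxy: each revealed positive role model "covers" the
# agents in its neighborhood.
neighbors = {"r1": {"a", "b"}, "r2": {"b", "c"}, "r3": {"d"}}
coverage = lambda S: len(set().union(*(neighbors[r] for r in S)))
print(greedy_select(neighbors, budget=2, gain=coverage))
```

The paper's technical point is that revealing negative role models breaks submodularity of the true welfare, which is why a submodular proxy is substituted before applying a greedy rule like this one.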
cs.AI / 11 / 2603.02504
NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect
NeuroProlog:通过鸡尾酒效应进行神经符号数学推理的多任务微调
Abstract
Large Language Models (LLMs) achieve strong performance on natural language tasks but remain unreliable in mathematical reasoning, frequently generating fluent yet logically inconsistent solutions. We present \textbf{NeuroProlog}, a neurosymbolic framework that ensures verifiable reasoning by compiling math word problems into executable Prolog programs with formal verification guarantees. We propose a multi-task Cocktail training strategy that jointly optimizes three synergistic objectives in a unified symbolic representation space: (i) mathematical formula-to-rule translation (KB), (ii) natural language-to-program synthesis (SOLVE), and (iii) program-answer alignment. This joint supervision enables positive transfer, where symbolic grounding in formula translation directly improves compositional reasoning capabilities. At inference, we introduce an execution-guided decoding pipeline with fine-grained error taxonomy that enables iterative program repair and quantifies model self-debugging capacity. Comprehensive evaluation on GSM8K across four model scales (3B--32B parameters) demonstrates consistent improvements: cocktail training achieves significant accuracy gains of +5.23\% (Qwen-32B, $p < 0.01$), +3.43\% (GPT-OSS-20B, $p < 0.01$), and +5.54\% (Llama-3B, $p < 0.05$) over single-task baselines. Systematic error analysis reveals scale-dependent learning dynamics: at 32B scale, cocktail training transforms unfixable type errors (12\% repair rate) into correctable domain errors (96\% repair rate), achieving 92.7\% overall correction; at 8B scale, the same training eliminates syntactic errors but introduces semantic failures, revealing a critical capacity threshold for type-safe symbolic reasoning.
Chinese Translation
大型语言模型(LLMs)在自然语言任务中表现出色,但在数学推理方面仍然不可靠,常常生成流畅但逻辑不一致的解决方案。我们提出了\textbf{NeuroProlog},一个神经符号框架,通过将数学文字问题编译为可执行的Prolog程序,并提供形式验证保证,从而确保可验证的推理。我们提出了一种多任务鸡尾酒训练策略,在统一的符号表示空间中共同优化三个协同目标:(i)数学公式到规则的翻译(KB),(ii)自然语言到程序的合成(SOLVE),以及(iii)程序与答案的对齐。这种联合监督使得正向迁移成为可能,其中公式翻译中的符号基础直接改善了组合推理能力。在推理阶段,我们引入了一种执行引导的解码管道,具有细粒度的错误分类,能够实现迭代程序修复并量化模型的自我调试能力。在GSM8K上的全面评估涵盖了四种模型规模(3B--32B参数),显示出一致的改进:鸡尾酒训练在单任务基准上实现了显著的准确性提升:+5.23\%(Qwen-32B,$p < 0.01$),+3.43\%(GPT-OSS-20B,$p < 0.01$),以及+5.54\%(Llama-3B,$p < 0.05$)。系统的错误分析揭示了规模依赖的学习动态:在32B规模下,鸡尾酒训练将不可修复的类型错误(12\%修复率)转化为可修复的领域错误(96\%修复率),实现了92.7\%的整体修正;在8B规模下,相同的训练消除了语法错误,但引入了语义失败,揭示了类型安全符号推理的关键能力阈值。
cs.AI / 12 / 2603.02528
LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language Model
LLM-MLFFN:通过大语言模型实现多层次自主驾驶行为特征融合
Abstract
Accurate classification of autonomous vehicle (AV) driving behaviors is critical for safety validation, performance diagnosis, and traffic integration analysis. However, existing approaches primarily rely on numerical time-series modeling and often lack semantic abstraction, limiting interpretability and robustness in complex traffic environments. This paper presents LLM-MLFFN, a novel large language model (LLM)-enhanced multi-level feature fusion network designed to address the complexities of multi-dimensional driving data. The proposed LLM-MLFFN framework integrates priors from large-scale pre-trained models and employs a multi-level approach to enhance classification accuracy. LLM-MLFFN comprises three core components: (1) a multi-level feature extraction module that extracts statistical, behavioral, and dynamic features to capture the quantitative aspects of driving behaviors; (2) a semantic description module that leverages LLMs to transform raw data into high-level semantic features; and (3) a dual-channel multi-level feature fusion network that combines numerical and semantic features using weighted attention mechanisms to improve robustness and prediction accuracy. Evaluation on the Waymo open trajectory dataset demonstrates the superior performance of the proposed LLM-MLFFN, achieving a classification accuracy of over 94%, surpassing existing machine learning models. Ablation studies further validate the critical contributions of multi-level fusion, feature extraction strategies, and LLM-derived semantic reasoning. These results suggest that integrating structured feature modeling with language-driven semantic abstraction provides a principled and interpretable pathway for robust autonomous driving behavior classification.
Chinese Translation
准确分类自主驾驶车辆(AV)的驾驶行为对于安全验证、性能诊断和交通集成分析至关重要。然而,现有方法主要依赖于数值时间序列建模,往往缺乏语义抽象,限制了在复杂交通环境中的可解释性和鲁棒性。本文提出了LLM-MLFFN,一种新颖的大语言模型(LLM)增强的多层次特征融合网络,旨在解决多维驾驶数据的复杂性。所提出的LLM-MLFFN框架整合了来自大规模预训练模型的先验知识,并采用多层次的方法来提高分类准确性。LLM-MLFFN由三个核心组件组成:(1)多层次特征提取模块,提取统计、行为和动态特征,以捕捉驾驶行为的定量方面;(2)语义描述模块,利用LLM将原始数据转化为高层次的语义特征;(3)双通道多层次特征融合网络,使用加权注意机制结合数值和语义特征,以提高鲁棒性和预测准确性。在Waymo开放轨迹数据集上的评估表明,所提出的LLM-MLFFN表现优越,分类准确率超过94%,超越了现有的机器学习模型。消融研究进一步验证了多层次融合、特征提取策略和LLM衍生语义推理的关键贡献。这些结果表明,将结构化特征建模与语言驱动的语义抽象相结合,为鲁棒的自主驾驶行为分类提供了一条有原则且可解释的路径。
cs.AI / 13 / 2603.02540
A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities
基于神经心理学的语言模型认知能力评估
Abstract
Large language models (LLMs) exhibit a unified "general factor" of capability across 10 benchmarks, a finding confirmed by our factor analysis of 156 models, yet they still struggle with simple tasks that are trivial for humans. This is because current benchmarks focus on task completion, failing to probe the foundational cognitive abilities that underlie these behaviors. We address this by introducing the NeuroCognition benchmark, grounded in three adapted neuropsychological tests: Raven's Progressive Matrices (abstract relational reasoning), Spatial Working Memory (maintenance and systematic search), and the Wisconsin Card Sorting Test (cognitive flexibility). Our evaluation reveals that while models perform strongly on text, their performance degrades for images and with increased complexity. Furthermore, we observe that complex reasoning is not universally beneficial, whereas simple, human-like strategies yield partial gains. We also find that NeuroCognition correlates positively with standard general-capability benchmarks, while still measuring distinct cognitive abilities beyond them. Overall, NeuroCognition emphasizes where current LLMs align with human-like intelligence and where they lack core adaptive cognition, showing the potential to serve as a verifiable, scalable source for improving LLMs.
Chinese Translation
大型语言模型(LLMs)在10个基准测试中表现出统一的“通用因子”能力,这一发现通过对156个模型的因子分析得到了证实,但它们在处理简单、琐碎的人类任务时仍然存在困难。这是因为当前的基准测试侧重于任务完成,未能探讨突出这些行为的基础认知能力。我们通过引入NeuroCognition基准来解决这一问题,该基准基于三项改编的神经心理学测试:拉文进阶矩阵(抽象关系推理)、空间工作记忆(维护和系统搜索)以及威斯康星卡片分类测试(认知灵活性)。我们的评估显示,尽管模型在文本上的表现强劲,但在图像和复杂性增加时,其性能下降。此外,我们观察到复杂推理并不总是有益,而简单的人类策略则能带来部分收益。我们还发现,NeuroCognition与标准的通用能力基准呈正相关,同时测量了超出这些基准的独特认知能力。总体而言,NeuroCognition强调了当前LLMs与人类智能的对齐之处以及它们在核心适应性认知方面的不足,显示出作为可验证、可扩展的改进LLMs的来源的潜力。
cs.AI / 14 / 2603.02542
AnchorDrive: LLM Scenario Rollout with Anchor-Guided Diffusion Regeneration for Safety-Critical Scenario Generation
AnchorDrive:基于锚点引导扩散再生的安全关键场景生成的LLM场景推广
Abstract
Autonomous driving systems require comprehensive evaluation in safety-critical scenarios to ensure safety and robustness. However, such scenarios are rare and difficult to collect from real-world driving data, necessitating simulation-based synthesis. Yet, existing methods often exhibit limitations in both controllability and realism. From a capability perspective, LLMs excel at controllable generation guided by natural language instructions, while diffusion models are better suited for producing trajectories consistent with realistic driving distributions. Leveraging their complementary strengths, we propose AnchorDrive, a two-stage safety-critical scenario generation framework. In the first stage, we deploy an LLM as a driver agent within a closed-loop simulation, which reasons and iteratively outputs control commands under natural language constraints; a plan assessor reviews these commands and provides corrective feedback, enabling semantically controllable scenario generation. In the second stage, the LLM extracts key anchor points from the first-stage trajectories as guidance objectives, which jointly with other guidance terms steer the diffusion model to regenerate complete trajectories with improved realism while preserving user-specified intent. Experiments on the highD dataset demonstrate that AnchorDrive achieves superior overall performance in criticality, realism, and controllability, validating its effectiveness for generating controllable and realistic safety-critical scenarios.
Chinese Translation
自主驾驶系统需要在安全关键场景中进行全面评估,以确保安全性和鲁棒性。然而,这些场景稀缺且难以从真实驾驶数据中收集,迫切需要基于模拟的合成方法。然而,现有方法在可控性和真实性方面往往存在局限性。从能力的角度来看,LLM(大语言模型)在自然语言指令引导下的可控生成方面表现出色,而扩散模型更适合生成与真实驾驶分布一致的轨迹。利用它们的互补优势,我们提出了AnchorDrive,一个两阶段的安全关键场景生成框架。在第一阶段,我们在闭环模拟中部署LLM作为驾驶代理,依据自然语言约束进行推理并迭代输出控制命令;计划评估器审查这些命令并提供纠正反馈,从而实现语义可控的场景生成。在第二阶段,LLM从第一阶段的轨迹中提取关键锚点作为指导目标,这些目标与其他指导项共同引导扩散模型再生完整轨迹,同时提高真实性并保持用户指定的意图。在highD数据集上的实验表明,AnchorDrive在关键性、真实性和可控性方面实现了优越的整体性能,验证了其在生成可控且真实的安全关键场景中的有效性。
cs.AI / 15 / 2603.02586
LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
LiveAgentBench:针对104个真实世界挑战的智能系统综合基准测试
Abstract
As large language models grow more capable, general AI agents have become increasingly prevalent in practical applications. However, existing benchmarks face significant limitations, failing to represent real-world user tasks accurately. To address this gap, we present LiveAgentBench, a comprehensive benchmark with 104 scenarios that reflect real user requirements. It is constructed from publicly sourced questions on social media and real-world products. Central to our approach is the Social Perception-Driven Data Generation (SPDG) method, a novel process we developed to ensure each question's real-world relevance, task complexity, and result verifiability. We evaluate various models, frameworks, and commercial products using LiveAgentBench, revealing their practical performance and identifying areas for improvement. This release includes 374 tasks, with 125 for validation and 249 for testing. The SPDG process enables continuous updates with fresh queries from real-world interactions.
Chinese Translation
随着大型语言模型能力的不断增强,通用人工智能代理在实际应用中变得越来越普遍。然而,现有的基准测试存在显著的局限性,未能准确代表真实用户任务。为了解决这一问题,我们提出了LiveAgentBench,这是一个包含104个场景的综合基准,反映了真实用户的需求。该基准由社交媒体和真实产品上的公开问题构建而成。我们方法的核心是社会感知驱动的数据生成(Social Perception-Driven Data Generation, SPDG)方法,这是我们开发的一种新颖过程,旨在确保每个问题的现实相关性、任务复杂性和结果可验证性。我们使用LiveAgentBench评估各种模型、框架和商业产品,揭示其实际性能并识别改进领域。本次发布包括374个任务,其中125个用于验证,249个用于测试。SPDG过程使得可以通过来自真实世界互动的新查询进行持续更新。
cs.AI / 16 / 2603.02599
SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving
SUN:高效多LLM分离服务的下一个标记预测共享使用
Abstract
In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: since cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, enabling a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventional disaggregation while keeping time-per-output-token (TPOT) within 5%. SUN inherently enables and facilitates low-bit decoding; with Quantized SUN (QSUN), it achieves a 45% speedup with comparable accuracy to SUN while preserving the benefits of shared decoding.
Chinese Translation
在多模型LLM服务中,由于模型特定的资源划分,解码执行仍然效率低下:由于无法进行跨模型批处理,内存受限的解码在工作负载不均衡时常常面临严重的GPU利用率不足。我们提出了下一个标记预测共享使用(Shared Use of Next-token Prediction,SUN),这是首个实现分离多LLM服务中解码执行跨模型共享的方法。SUN将仅解码的Transformer分解为预填充模块和解码模块,并仅对任务特定的预填充模块进行微调,从而使得冻结的解码模块可以在模型之间共享。这一设计实现了一种与模型无关的解码路由策略,能够在共享工作者之间平衡解码请求,以最大化利用率。在各种任务和模型系列中,SUN在保持系统吞吐量的同时,达到了与完全微调相当的准确性,并且使用更少的解码工作者。特别是,SUN在保持每个输出标记时间(TPOT)在5%以内的情况下,相比于传统的分离方法,提升了每个GPU的吞吐量高达2.0倍。SUN本质上支持并促进低比特解码;通过量化SUN(Quantized SUN,QSUN),在保持与SUN相当的准确性的同时,实现了45%的加速,同时保留了共享解码的优势。
cs.AI / 17 / 2603.02601
AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows
AgentAssay:针对非确定性人工智能代理工作流的高效回归测试
Abstract
Autonomous AI agents are deployed at unprecedented scale, yet no principled methodology exists for verifying that an agent has not regressed after changes to its prompts, tools, models, or orchestration logic. We present AgentAssay, the first token-efficient framework for regression testing non-deterministic AI agent workflows, achieving 78-100% cost reduction while maintaining rigorous statistical guarantees. Our contributions include: (1) stochastic three-valued verdicts (PASS/FAIL/INCONCLUSIVE) grounded in hypothesis testing; (2) five-dimensional agent coverage metrics; (3) agent-specific mutation testing operators; (4) metamorphic relations for agent workflows; (5) CI/CD deployment gates as statistical decision procedures; (6) behavioral fingerprinting that maps execution traces to compact vectors, enabling multivariate regression detection; (7) adaptive budget optimization calibrating trial counts to behavioral variance; and (8) trace-first offline analysis enabling zero-cost testing on production traces. Experiments across 5 models (GPT-5.2, Claude Sonnet 4.6, Mistral-Large-3, Llama-4-Maverick, Phi-4), 3 scenarios, and 7,605 trials demonstrate that behavioral fingerprinting achieves 86% detection power where binary testing has 0%, SPRT reduces trials by 78%, and the full pipeline achieves 100% cost savings through trace-first analysis. Implementation: 20,000+ lines of Python, 751 tests, 10 framework adapters.
Chinese Translation
自主人工智能代理以空前的规模部署,但尚无原则性的方法来验证代理在其提示、工具、模型或编排逻辑发生变化后是否出现回归。我们提出了AgentAssay,这是第一个针对非确定性人工智能代理工作流的高效回归测试框架,在保持严格统计保证的同时实现了78-100%的成本降低。我们的贡献包括:(1) 基于假设检验的随机三值判决(通过/失败/不确定);(2) 五维代理覆盖度指标;(3) 代理特定的变异测试操作符;(4) 代理工作流的变形关系;(5) 作为统计决策程序的CI/CD部署门;(6) 行为指纹识别,将执行轨迹映射到紧凑向量,从而实现多变量回归检测;(7) 自适应预算优化,根据行为方差调整试验次数;(8) 以轨迹为先的离线分析,实现对生产轨迹的零成本测试。针对5个模型(GPT-5.2、Claude Sonnet 4.6、Mistral-Large-3、Llama-4-Maverick、Phi-4)、3种场景和7,605次试验的实验表明,行为指纹识别在二元测试为0%的情况下实现了86%的检测能力,SPRT将试验次数减少了78%,而完整流程通过轨迹优先分析实现了100%的成本节约。实现:超过20,000行Python代码,751个测试,10个框架适配器。
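The SPRT-based trial reduction mentioned above can be sketched with Wald's sequential probability ratio test on Bernoulli pass/fail trials, which yields the three-valued verdict when the trial budget runs out before a boundary is crossed; the hypothesized success rates and error bounds below are illustrative, not AgentAssay's calibrated settings.

```python
import math

def sprt_verdict(outcomes, p_pass=0.9, p_fail=0.7, alpha=0.05, beta=0.05):
    """Wald's SPRT over Bernoulli trials: accumulate the log-likelihood
    ratio of H1 (healthy rate p_pass) against H0 (regressed rate p_fail)
    and stop as soon as a decision boundary is crossed."""
    upper = math.log((1 - beta) / alpha)  # cross it -> accept H1 -> PASS
    lower = math.log(beta / (1 - alpha))  # cross it -> accept H0 -> FAIL
    llr = 0.0
    for ok in outcomes:
        if ok:
            llr += math.log(p_pass / p_fail)
        else:
            llr += math.log((1 - p_pass) / (1 - p_fail))
        if llr >= upper:
            return "PASS"
        if llr <= lower:
            return "FAIL"
    return "INCONCLUSIVE"  # budget exhausted without a decision

print(sprt_verdict([True] * 20))   # consistent successes cross the PASS bound
print(sprt_verdict([False] * 20))  # consistent failures cross the FAIL bound
```

Early stopping is where the token savings come from: a clear regression is flagged after a handful of failed trials instead of a fixed large batch.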
cs.AI / 18 / 2603.02626
See and Remember: A Multimodal Agent for Web Traversal
观察与记忆:一种用于网络遍历的多模态智能体
Abstract
Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose the generally applicable V-GEMS (Visual Grounding and Explicit Memory System), a robust multimodal agent architecture designed for precise and resilient web traversal. Our agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. This dual mechanism allows the agent to maintain a structured map of its traversal path, enabling valid backtracking and preventing cyclical failures in deep navigation tasks. We also introduce an updatable dynamic benchmark to rigorously evaluate adaptability. Experiments show V-GEMS significantly outperforms the WebWalker baseline, achieving a substantial 28.7% performance gain. Code is available at https://github.com/Vaultttttttttttt/V-GEMS.
Chinese Translation
自主网络导航要求智能体感知复杂的视觉环境并保持长期上下文,然而当前基于大型语言模型(LLM)的智能体常常在空间定向和导航循环方面遇到困难。本文提出了一种普遍适用的V-GEMS(视觉定位与显式记忆系统),这是一种为精确和稳健的网络遍历而设计的强大多模态智能体架构。我们的智能体整合了视觉定位以解决模糊的交互元素,并引入了带有状态跟踪的显式记忆栈。这一双重机制使得智能体能够保持其遍历路径的结构化地图,从而实现有效的回溯,并防止在深度导航任务中出现循环失败。我们还引入了一个可更新的动态基准,以严格评估适应性。实验表明,V-GEMS显著优于WebWalker基线,取得了28.7%的显著性能提升。代码可在https://github.com/Vaultttttttttttt/V-GEMS获取。
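The explicit memory stack with state tracking described above can be sketched as a visited set plus a path stack that refuses revisits and supports backtracking; the page identifiers below are hypothetical, and this is a sketch of the idea rather than V-GEMS's implementation.

```python
class TraversalMemory:
    """Explicit memory stack: records the traversal path, refuses
    revisits (preventing navigation loops), and supports backtracking."""

    def __init__(self):
        self.stack, self.visited = [], set()

    def visit(self, page):
        if page in self.visited:
            return False  # loop detected; caller should backtrack instead
        self.stack.append(page)
        self.visited.add(page)
        return True

    def backtrack(self):
        if len(self.stack) > 1:
            self.stack.pop()  # step back up the traversal path
        return self.stack[-1]

mem = TraversalMemory()
mem.visit("home")
mem.visit("products")
mem.visit("product/42")       # hypothetical page identifiers
assert not mem.visit("home")  # cycle prevented
print(mem.backtrack())        # -> "products"
```

Because the stack mirrors the actual traversal path, backtracking always lands on a previously valid page rather than a guessed one.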
cs.AI / 19 / 2603.02668
SorryDB: Can AI Provers Complete Real-World Lean Theorems?
SorryDB:人工智能证明者能否完成现实世界的 Lean 定理?
Abstract
We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real-world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hill-climbing the SorryDB benchmark will yield tools that are aligned to community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large language models, specialized provers, or even a curated list of Lean tactics.
Chinese Translation
我们提出了 SorryDB,这是一个动态更新的基准,涵盖了来自 GitHub 上 78 个现实世界形式化项目的开放 Lean 任务。与现有的静态基准(通常由竞赛问题组成)不同,逐步攻克 SorryDB 基准将产生与社区需求相一致的工具,这些工具更易于数学家使用,并且更能理解复杂的依赖关系。此外,通过提供持续更新的任务流,SorryDB 减少了测试集污染,并为代理在新形式数学项目中的贡献能力提供了一个稳健的度量标准。我们评估了一系列方法,包括通用大型语言模型、代理方法和专用符号证明者,基于从 SorryDB 中选取的 1000 个任务的快照。我们展示了当前的方法是互补的:尽管基于 Gemini Flash 的代理方法表现最佳,但它并不严格优于其他现成的大型语言模型、专用证明者,甚至是经过精心策划的 Lean 策略列表。
cs.AI / 20 / 2603.02680
LLMs for High-Frequency Decision-Making: Normalized Action Reward-Guided Consistency Policy Optimization
高频决策中的大语言模型:归一化动作奖励引导的一致性策略优化
Abstract
While Large Language Models (LLMs) form the cornerstone of sequential decision-making agent development, they have inherent limitations in high-frequency decision tasks. Existing research mainly focuses on low-frequency discrete embodied decision scenarios with significant semantic differences in state space (e.g., household planning). These methods suffer from limited performance in high-frequency decision-making tasks, since high-precision numerical state information in such tasks undergoes frequent updates with minimal fluctuations, and they exhibit policy misalignment between the learned sub-tasks and composite tasks. To address these issues, this paper proposes Normalized Action Reward guided Consistency Policy Optimization (NAR-CP). 1) Our method first acquires predefined dense rewards from environmental feedback of candidate actions via reward functions, then completes reward shaping through normalization, and theoretically verifies that action reward normalization does not impair the optimal policy. 2) To reduce policy misalignment in composite tasks, we use LLMs to infer sub-observation candidate actions and generate joint policies, with a consistency loss ensuring precise alignment between global semantic policies and sub-semantic policies. Experiments on UAV pursuit, a typical high-frequency task, show our method delivers superior performance on independent and composite tasks with excellent generalization to unseen tasks.
Chinese Translation
尽管大语言模型(LLMs)构成了顺序决策代理开发的基石,但它们在高频决策任务中存在固有的局限性。现有研究主要集中在低频且状态空间具有显著语义差异的离散体现决策场景(例如家庭规划)。这些方法在高频决策任务中表现有限,因为此类任务中的高精度数值状态信息频繁更新且波动较小,导致学习的子任务与复合任务之间的策略不一致。为了解决这些问题,本文提出了归一化动作奖励引导的一致性策略优化(NAR-CP)。1)我们的方法首先通过奖励函数从候选动作的环境反馈中获取预定义的密集奖励,然后通过归一化完成奖励塑造,并理论验证动作奖励的归一化不会损害最优策略。2)为了减少复合任务中的策略不一致性,我们使用LLMs推断子观察候选动作并生成联合策略,通过一致性损失确保全局语义策略与子语义策略之间的精确对齐。在无人机追逐这一典型高频任务上的实验表明,我们的方法在独立和复合任务中表现出优越的性能,并对未见任务具有良好的泛化能力。
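The reward-normalization step described above can be sketched as min-max scaling of candidate-action rewards; since the rescaling is strictly monotone, the greedy (argmax) action is unchanged, consistent with the abstract's claim that normalization does not impair the optimal policy. The reward values below are illustrative.

```python
def normalize_rewards(rewards):
    """Min-max normalize candidate-action rewards to [0, 1].
    A strictly monotone rescaling, so the argmax (greedy choice)
    over candidate actions is preserved."""
    lo, hi = min(rewards), max(rewards)
    if hi == lo:
        return [0.0] * len(rewards)  # all actions tied
    return [(r - lo) / (hi - lo) for r in rewards]

raw = [-12.0, 3.5, 8.0, -1.0]  # hypothetical dense rewards for candidate actions
norm = normalize_rewards(raw)
assert norm.index(max(norm)) == raw.index(max(raw))  # optimal action unchanged
print(norm)
```

Keeping rewards on a common [0, 1] scale across time steps is what makes the shaped signal stable when the underlying numerical state fluctuates only minimally.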
cs.AI / 21 / 2603.02688
Retrieval-Augmented Robots via Retrieve-Reason-Act
通过检索-推理-行动实现增强检索的机器人
Abstract
To achieve general-purpose utility, we argue that robots must evolve from passive executors into active Information Retrieval users. In strictly zero-shot settings where no prior demonstrations exist, robots face a critical information gap (such as the exact sequence required to assemble a complex furniture kit) that cannot be satisfied by internal parametric knowledge (common sense) or past internal memory. While recent robotic works attempt to use search before action, they primarily focus on retrieving past kinematic trajectories (analogous to searching internal memory) or text-based safety rules (searching for constraints). These approaches fail to address the core information need of active task construction: acquiring unseen procedural knowledge from external, unstructured documentation. In this paper, we define this paradigm as Retrieval-Augmented Robotics (RAR), empowering the robot with the information-seeking capability that bridges the gap between visual documentation and physical actuation. We formulate task execution as an iterative Retrieve-Reason-Act loop: the robot or embodied agent actively retrieves relevant visual procedural manuals from an unstructured corpus, grounds the abstract 2D diagrams to 3D physical parts via cross-modal alignment, and synthesizes executable plans. We validate this paradigm on a challenging long-horizon assembly benchmark. Our experiments demonstrate that grounding robotic planning in retrieved visual documents significantly outperforms baselines relying on zero-shot reasoning or few-shot example retrieval. This work establishes the basis of RAR, extending the scope of Information Retrieval from answering user queries to driving embodied physical actions.
Chinese Translation
为了实现通用的实用性,我们认为机器人必须从被动执行者演变为主动的信息检索用户。在严格的零样本环境中,机器人面临着一个关键的信息缺口,例如组装复杂家具套件所需的确切顺序,这一需求无法通过内部参数知识(常识)或过去的内部记忆来满足。尽管最近的机器人研究尝试在行动之前进行搜索,但它们主要集中于检索过去的运动轨迹(类似于搜索内部记忆)或基于文本的安全规则(搜索约束条件)。这些方法未能解决主动任务构建的核心信息需求:从外部非结构化文档中获取未见的过程知识。在本文中,我们将这一范式定义为增强检索机器人技术(Retrieval-Augmented Robotics, RAR),赋予机器人信息寻求的能力,弥合视觉文档与物理执行之间的差距。我们将任务执行形式化为一个迭代的检索-推理-行动循环:机器人或具身代理主动从非结构化语料库中检索相关的视觉过程手册,通过跨模态对齐将抽象的二维图示与三维物理部件结合,并合成可执行的计划。我们在一个具有挑战性的长时程组装基准上验证了这一范式。我们的实验表明,将机器人规划建立在检索到的视觉文档的基础上,显著优于依赖零样本推理或少样本示例检索的基线。这项工作奠定了RAR的基础,扩展了信息检索的范围,从回答用户查询到驱动具身的物理行动。
cs.AI / 22 / 2603.02702
FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing
FinTexTS:基于语义和多层次配对的金融文本配对时间序列数据集
Abstract
The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which a company's stock price is influenced not only by company-specific events but also by events in other companies and broader macroeconomic factors. Existing approaches that pair text with financial time-series data based on simple keyword matching often fail to capture such complex relationships. To address this limitation, we propose a semantic-based and multi-level pairing framework. Specifically, we extract company-specific context for the target company from SEC filings and apply an embedding-based matching mechanism to retrieve semantically relevant news articles based on this context. Furthermore, we classify news articles into four levels (macro-level, sector-level, related company-level, and target-company level) using large language models (LLMs), enabling multi-level pairing of news articles with the target company. Applying this framework to publicly-available news datasets, we construct \textbf{FinTexTS}, a new large-scale text-paired stock price dataset. Experimental results on \textbf{FinTexTS} demonstrate the effectiveness of our semantic-based and multi-level pairing strategy in stock price forecasting. In addition to publicly-available news underlying \textbf{FinTexTS}, we show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.
Chinese Translation
金融领域涉及多种重要的时间序列问题。近年来,联合利用文本和数值信息的时间序列分析方法受到越来越多的关注。因此,许多努力致力于构建金融领域的文本配对时间序列数据集。然而,金融市场的特点是复杂的相互依赖关系,其中公司的股票价格不仅受到公司特定事件的影响,还受到其他公司事件和更广泛的宏观经济因素的影响。现有的方法通常基于简单的关键词匹配将文本与金融时间序列数据配对,往往无法捕捉到这种复杂的关系。为了解决这一局限性,我们提出了一种基于语义和多层次配对的框架。具体而言,我们从美国证券交易委员会(SEC)文件中提取目标公司的特定上下文,并应用基于嵌入的匹配机制,根据该上下文检索语义相关的新闻文章。此外,我们使用大型语言模型(LLMs)将新闻文章分类为四个层次(宏观层次、行业层次、相关公司层次和目标公司层次),实现新闻文章与目标公司的多层次配对。将该框架应用于公开可用的新闻数据集,我们构建了\textbf{FinTexTS},一个新的大规模文本配对股票价格数据集。在\textbf{FinTexTS}上的实验结果证明了我们基于语义和多层次配对策略在股票价格预测中的有效性。除了\textbf{FinTexTS}所基于的公开可用新闻外,我们还展示了将我们的方法应用于专有但经过精心策划的新闻来源可以生成更高质量的配对数据,并改善股票价格预测性能。
cs.AI / 23 / 2603.02711
A Natural Language Agentic Approach to Study Affective Polarization
一种自然语言代理方法研究情感极化
Abstract
Affective polarization has been central to political and social studies, with growing focus on social media, where partisan divisions are often exacerbated. Real-world studies tend to have limited scope, while simulated studies suffer from insufficient high-quality training data, as manually labeling posts is labor-intensive and prone to subjective biases. The lack of adequate tools to formalize different definitions of affective polarization across studies complicates result comparison and hinders interoperable frameworks. We present a multi-agent model providing a comprehensive approach to studying affective polarization in social media. To operationalize our framework, we develop a platform leveraging large language models (LLMs) to construct virtual communities where agents engage in discussions. We showcase the potential of our platform by (1) analyzing questions related to affective polarization, as explored in social science literature, providing a fresh perspective on this phenomenon, and (2) introducing scenarios that allow observation and measurement of polarization at different levels of granularity and abstraction. Experiments show that our platform is a flexible tool for computational studies of complex social dynamics such as affective polarization. It leverages advanced agent models to simulate rich, context-sensitive interactions and systematically explore research questions traditionally addressed through human-subject studies.
Chinese Translation
情感极化在政治和社会研究中占据核心地位,随着对社交媒体的关注增加,党派分歧往往加剧。现实世界的研究往往范围有限,而模拟研究则受到高质量训练数据不足的困扰,因为手动标注帖子既费时又容易受到主观偏见的影响。缺乏适当的工具来形式化不同研究中情感极化的定义,进一步复杂化了结果比较,并阻碍了可互操作框架的建立。我们提出了一种多代理模型,提供了一种全面的方法来研究社交媒体中的情感极化。为了使我们的框架具备可操作性,我们开发了一个平台,利用大型语言模型(LLMs)构建虚拟社区,在这些社区中,代理进行讨论。我们通过(1)分析与情感极化相关的问题,这些问题在社会科学文献中被探讨,从而为这一现象提供新的视角,以及(2)引入允许在不同粒度和抽象层次观察和测量极化的场景,展示了我们平台的潜力。实验表明,我们的平台是一个灵活的工具,适用于对复杂社会动态(如情感极化)的计算研究。它利用先进的代理模型模拟丰富的、情境敏感的互动,并系统地探索传统上通过人类受试者研究解决的研究问题。
cs.AI / 24 / 2603.02766
EvoSkill: Automated Skill Discovery for Multi-Agent Systems
EvoSkill:多智能体系统的自动化技能发现
Abstract
Coding agents are increasingly used as general-purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through \textit{agent skills}: reusable workflows and code that augment agents with domain-specific capabilities. Most skills today are hand-crafted, and existing evolutionary approaches optimize low-level artifacts (e.g. prompts \& code) that are tightly coupled to specific models and tasks. We introduce \textbf{EvoSkill}, a self-evolving framework that automatically discovers and refines agent skills through iterative failure analysis. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S.\ Treasury data, where it improves exact-match accuracy by \textbf{7.3\%} (60.6\% $\to$ 67.9\%); and SealQA, a search-augmented QA benchmark with noisy retrieval, where it yields a \textbf{12.1\%} gain (26.6\% $\to$ 38.7\%). We also investigate the zero-shot transfer capabilities of skills evolved on one task to the other; in particular, skills evolved from SealQA transfer zero-shot to BrowseComp, improving accuracy by \textbf{5.3\%} without modification, demonstrating that skill-level optimization produces transferable capabilities beyond the training task.
Chinese Translation
编码智能体越来越多地被用作通用问题解决者,但其灵活性本身并不能赋予其执行专业任务所需的领域专业知识。近期的研究通过\textit{智能体技能}来解决这一问题:可重用的工作流程和代码,增强智能体的领域特定能力。目前大多数技能是手工制作的,现有的进化方法优化的是与特定模型和任务紧密耦合的低级工件(例如提示和代码)。我们提出了\textbf{EvoSkill},一个自我进化框架,通过迭代失败分析自动发现和完善智能体技能。EvoSkill分析执行失败,提出新的技能或对现有技能的编辑,并将其具体化为结构化的可重用技能文件夹。一个智能体程序的帕累托前沿控制选择,仅保留那些在保持模型不变的情况下提高验证性能的技能。我们在两个基准上评估EvoSkill:OfficeQA,这是一个基于美国财政数据的基础推理基准,EvoSkill将准确匹配率提高了\textbf{7.3\%}(从60.6\%提升至67.9\%);SealQA,这是一个带有噪声检索的搜索增强QA基准,EvoSkill实现了\textbf{12.1\%}的增益(从26.6\%提升至38.7\%)。我们还研究了在一个任务上进化的技能向另一个任务的零样本迁移能力;特别是:从SealQA进化的技能在未修改的情况下零样本迁移到BrowseComp,准确率提高了\textbf{5.3\%},这表明技能级优化产生了超越训练任务的可迁移能力。
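The Pareto-frontier selection that governs which skills survive can be sketched in a few lines. The candidate programs, accuracy numbers, and the cost axis below are illustrative assumptions of ours, not artifacts of EvoSkill itself:

```python
# Hypothetical sketch: keep only agent programs on the Pareto frontier
# of (higher validation accuracy, lower cost). All names and numbers
# are illustrative, not from the EvoSkill paper.

def pareto_frontier(candidates):
    """candidates: list of (name, accuracy, cost) tuples.
    A candidate is dominated if some other candidate has accuracy >= and
    cost <= with at least one strict inequality."""
    frontier = []
    for name, acc, cost in candidates:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in candidates
        )
        if not dominated:
            frontier.append((name, acc, cost))
    return frontier

programs = [
    ("baseline",      0.606, 1.0),
    ("skill_v1",      0.650, 1.2),
    ("skill_v2",      0.679, 1.1),   # more accurate AND cheaper than skill_v1
    ("bloated_skill", 0.640, 2.0),   # dominated by skill_v2
]
front = pareto_frontier(programs)    # baseline and skill_v2 survive
```

Under this rule a skill is retained only if no other program beats it on both axes at once, which is one simple way to read "retaining only skills that improve held-out validation performance".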
cs.AI / 25 / 2603.02787
Rethinking Code Similarity for Automated Algorithm Design with LLMs
重新思考基于大型语言模型的自动化算法设计中的代码相似性
Abstract
The rise of Large Language Model-based Automated Algorithm Design (LLM-AAD) has transformed algorithm development by autonomously generating code implementations of expert-level algorithms. Unlike traditional expert-driven algorithm development, in the LLM-AAD paradigm, the main design principle behind an algorithm is often implicitly embedded in the generated code. Therefore, assessing algorithmic similarity directly from code, distinguishing genuine algorithmic innovation from mere syntactic variation, becomes essential. While various code similarity metrics exist, they fail to capture algorithmic similarity, as they focus on surface-level syntax or output equivalence rather than the underlying algorithmic logic. We propose BehaveSim, a novel method to measure algorithmic similarity through the lens of problem-solving behavior as a sequence of intermediate solutions produced during execution, dubbed as problem-solving trajectories (PSTrajs). By quantifying the alignment between PSTrajs using dynamic time warping (DTW), BehaveSim distinguishes algorithms with divergent logic despite syntactic or output-level similarities. We demonstrate its utility in two key applications: (i) Enhancing LLM-AAD: Integrating BehaveSim into existing LLM-AAD frameworks (e.g., FunSearch, EoH) promotes behavioral diversity, significantly improving performance on three AAD tasks. (ii) Algorithm analysis: BehaveSim clusters generated algorithms by behavior, enabling systematic analysis of problem-solving strategies--a crucial tool for the growing ecosystem of AI-generated algorithms. Data and code of this work are open-sourced at https://github.com/RayZhhh/behavesim.
Chinese Translation
基于大型语言模型的自动化算法设计(LLM-AAD)的兴起,已经通过自主生成专家级算法的代码实现,改变了算法开发的方式。与传统的专家驱动算法开发不同,在LLM-AAD范式中,算法背后的主要设计原则通常隐含在生成的代码中。因此,从代码直接评估算法相似性,区分真正的算法创新与单纯的语法变异,变得至关重要。虽然存在多种代码相似性度量方法,但它们未能捕捉算法相似性,因为它们关注的是表层语法或输出等价性,而非潜在的算法逻辑。我们提出了BehaveSim,这是一种通过解决问题行为的视角来测量算法相似性的新方法,该行为表现为执行过程中产生的一系列中间解决方案,称为问题解决轨迹(problem-solving trajectories,PSTrajs)。通过使用动态时间规整(dynamic time warping,DTW)量化PSTrajs之间的对齐,BehaveSim能够区分尽管在语法或输出层面上相似但逻辑上存在差异的算法。我们展示了其在两个关键应用中的实用性:(i)增强LLM-AAD:将BehaveSim集成到现有的LLM-AAD框架(如FunSearch、EoH)中,促进行为多样性,显著提高三项AAD任务的性能。(ii)算法分析:BehaveSim根据行为对生成的算法进行聚类,使得对问题解决策略的系统分析成为可能——这是日益增长的AI生成算法生态系统中的一个重要工具。本文的数据和代码已开源,地址为 https://github.com/RayZhhh/behavesim。
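The trajectory alignment at the heart of BehaveSim can be illustrated with a minimal dynamic time warping (DTW) routine. The toy trajectories below, sequences of intermediate objective values, are invented for illustration and are not the paper's data:

```python
# Minimal DTW sketch in the spirit of BehaveSim (not the paper's code):
# align two problem-solving trajectories (sequences of intermediate
# objective values) and return the accumulated alignment cost.

def dtw(a, b, dist=lambda x, y: abs(x - y)):
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = min cost of aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1]
            )
    return D[n][m]

greedy  = [10, 8, 8, 7, 7, 7]   # monotone descent, plateaus early
anneal  = [10, 9, 11, 6, 5, 5]  # accepts uphill moves, then improves
greedy2 = [10, 8, 7, 7, 7, 7]   # same logic as greedy, slightly shifted
```

Two syntactically different greedy variants align almost perfectly (low DTW cost), while the annealing-style trajectory stays far from both, which is the behavioral distinction output-level metrics miss.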
cs.AI / 26 / 2603.02788
Agentified Assessment of Logical Reasoning Agents
逻辑推理智能体的代理化评估
Abstract
We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The agent translates natural language premises and conclusions into executable Z3Py programs and employs satisfiability modulo theories (SMT) solving to determine logical entailment. On the cleaned FOLIO validation set, the auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%).
Chinese Translation
我们提出了一个评估和基准测试逻辑推理智能体的框架,该框架要求评估过程本身必须是可重复的、可审计的,并且对执行失败具有鲁棒性。在代理化评估的基础上,我们使用评估智能体来发布任务、强制执行预算、解析输出并记录结构化的失败类型,而被测试的智能体只需暴露一个标准化的智能体间接口。作为案例研究,我们对一个用于一阶逻辑(FOL)推理的自动形式化智能体进行了基准测试,该智能体在一个经过求解器验证和修复的FOLIO数据集上进行评估。该智能体将自然语言前提和结论转换为可执行的Z3Py程序,并利用满足性模理论(SMT)求解来确定逻辑蕴涵。在清理后的FOLIO验证集上,自动形式化智能体在评估协议下达到了86.70%的准确率,超越了链式思维基线(73.89%)。
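The agent's central check, that premises entail a conclusion iff the premises together with the negated conclusion are unsatisfiable, can be shown without Z3 by brute-forcing a propositional toy case. The paper compiles FOL to Z3Py and calls an SMT solver; this dependency-free stand-in and its example formulas are our own:

```python
from itertools import product

# Entailment via unsatisfiability (propositional stand-in for the Z3Py/SMT
# check described above; the wizard example is made up for illustration).

def entails(premises, conclusion, symbols):
    """premises/conclusion are boolean functions over an assignment dict.
    Returns True iff premises AND NOT conclusion has no satisfying model."""
    for values in product([False, True], repeat=len(symbols)):
        env = dict(zip(symbols, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False          # found a countermodel -> no entailment
    return True

# "All wizards cast spells; Merlin is a wizard" |= "Merlin casts spells"
premises = [
    lambda e: (not e["wizard"]) or e["casts"],   # wizard -> casts
    lambda e: e["wizard"],                       # Merlin is a wizard
]
yes = entails(premises, lambda e: e["casts"], ["wizard", "casts"])
no = entails(premises, lambda e: e["casts"] and e["flies"],
             ["wizard", "casts", "flies"])
```

The solver-based agent performs exactly this unsat check, only over first-order formulas with an SMT backend instead of truth-table enumeration.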
cs.AI / 27 / 2603.02798
Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
基于指南的高风险代理验证的证据积累
Abstract
As LLM-powered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment. Yet, existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration. To address this, we establish GLEAN, an agent verification framework with Guideline-grounded Evidence Accumulation that compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals. GLEAN evaluates the step-wise alignment with domain guidelines and aggregates multi-guideline ratings into surrogate features, which are accumulated along the trajectory and calibrated into correctness probabilities using Bayesian logistic regression. Moreover, the estimated uncertainty triggers active verification, which selectively collects additional evidence for uncertain cases via expanding guideline coverage and performing differential checks. We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset, surpassing the best baseline by 12% in AUROC and 50% in Brier score reduction, which confirms the effectiveness in both discrimination and calibration. In addition, the expert study with clinicians recognizes GLEAN's utility in practice.
Chinese Translation
随着大型语言模型(LLM)驱动的代理被用于高风险决策,如临床诊断,开发可靠的决策验证变得至关重要,以促进可信的部署。然而,现有的验证器通常由于缺乏领域知识和有限的校准而表现不佳。为了解决这个问题,我们建立了GLEAN,一个基于指南的证据积累的代理验证框架,它将专家策划的协议编译成轨迹信息化、良好校准的正确性信号。GLEAN评估与领域指南的逐步对齐,并将多指南评分聚合为替代特征,这些特征沿轨迹积累,并使用贝叶斯逻辑回归校准为正确性概率。此外,估计的不确定性触发主动验证,选择性地通过扩展指南覆盖和执行差异检查来收集不确定案例的额外证据。我们在MIMIC-IV数据集中对三种疾病的代理临床诊断进行了实证验证,GLEAN在AUROC上超越最佳基线12%,在Brier评分减少上超越50%,确认了其在区分和校准方面的有效性。此外,临床专家研究认可了GLEAN在实践中的实用性。
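As a rough illustration of the calibration step, here is a MAP logistic regression (Gaussian prior, plain gradient descent) that maps an accumulated guideline-alignment score to a correctness probability. The scores, labels, and hyperparameters are made up and far simpler than GLEAN's multi-feature Bayesian treatment:

```python
import math

# Illustrative sketch (not GLEAN's implementation): MAP logistic regression
# with a Gaussian prior on the weights, mapping a single accumulated
# alignment feature to a calibrated correctness probability.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_map_logistic(xs, ys, prior_var=10.0, lr=0.5, steps=3000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            gw += (p - y) * x
            gb += (p - y)
        # Gaussian prior on (w, b) acts as an L2 penalty on the MAP objective
        gw = gw / n + w / (prior_var * n)
        gb = gb / n + b / (prior_var * n)
        w -= lr * gw
        b -= lr * gb
    return w, b

# accumulated alignment scores along trajectories, with correctness labels
scores = [0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 1.2, 1.5]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_map_logistic(scores, labels)
p_hi = sigmoid(w * 1.4 + b)   # well-aligned trajectory -> high probability
p_lo = sigmoid(w * 0.3 + b)   # poorly aligned trajectory -> low probability
```

In GLEAN the resulting probability (and its uncertainty) is what triggers active verification for borderline cases; here the uncertainty machinery is omitted for brevity.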
cs.AI / 28 / 2603.02858
LLM-based Argument Mining meets Argumentation and Description Logics: a Unified Framework for Reasoning about Debates
基于大型语言模型的论证挖掘与论证和描述逻辑相结合:关于辩论推理的统一框架
Abstract
Large Language Models (LLMs) achieve strong performance in analyzing and generating text, yet they struggle with explicit, transparent, and verifiable reasoning over complex texts such as those containing debates. In particular, they lack structured representations that capture how arguments support or attack each other and how their relative strengths determine overall acceptability. We address these limitations by proposing a framework that integrates learning-based argument mining with quantitative reasoning and ontology-based querying. Starting from a raw debate text, the framework extracts a fuzzy argumentative knowledge base, where arguments are explicitly represented as entities, linked by attack and support relations, and annotated with initial fuzzy strengths reflecting plausibility w.r.t. the debate's context. Quantitative argumentation semantics are then applied to compute final argument strengths by propagating the effects of supports and attacks. These results are then embedded into a fuzzy description logic setting, enabling expressive query answering through efficient rewriting techniques. The proposed approach provides a transparent, explainable, and formally grounded method for analyzing debates, overcoming the limitations of purely statistical LLM-based analyses.
Chinese Translation
大型语言模型(LLMs)在文本分析和生成方面表现出色,但在对复杂文本(如辩论内容)进行明确、透明和可验证的推理时却面临困难。特别是,它们缺乏结构化的表示,无法捕捉论点之间如何相互支持或攻击,以及它们的相对强度如何决定整体可接受性。为了解决这些局限性,我们提出了一个框架,将基于学习的论证挖掘与定量推理和基于本体的查询相结合。从原始辩论文本出发,该框架提取出一个模糊的论证知识库,其中论点被明确表示为实体,通过攻击和支持关系相互连接,并用反映辩论上下文的初始模糊强度进行注释。然后应用定量论证语义,通过传播支持和攻击的影响来计算最终的论点强度。这些结果随后嵌入到模糊描述逻辑环境中,通过高效的重写技术实现表达性查询回答。所提出的方法提供了一种透明、可解释且在形式上有依据的辩论分析方法,克服了纯粹基于统计的LLM分析的局限。
cs.AI / 29 / 2603.02874
Retrievit: In-context Retrieval Capabilities of Transformers, State Space Models, and Hybrid Architectures
Retrievit:变换器、状态空间模型和混合架构的上下文检索能力
Abstract
Transformers excel at in-context retrieval but suffer from quadratic complexity with sequence length, while State Space Models (SSMs) offer efficient linear-time processing but have limited retrieval capabilities. We investigate whether hybrid architectures combining Transformers and SSMs can achieve the best of both worlds on two synthetic in-context retrieval tasks. The first task, n-gram retrieval, requires the model to identify and reproduce an n-gram that succeeds the query within the input sequence. The second task, position retrieval, presents the model with a single query token and requires it to perform a two-hop associative lookup: first locating the corresponding element in the sequence, and then outputting its positional index. Under controlled experimental conditions, we assess data efficiency, length generalization, robustness to out-of-domain training examples, and learned representations across Transformers, SSMs, and hybrid architectures. We find that hybrid models outperform SSMs and match or exceed Transformers in data efficiency and extrapolation for information-dense context retrieval. However, Transformers maintain superiority in position retrieval tasks. Through representation analysis, we discover that SSM-based models develop locality-aware embeddings where tokens representing adjacent positions become neighbors in embedding space, forming interpretable structures. This emergent property, absent in Transformers, explains both the strengths and limitations of SSMs and hybrids for different retrieval tasks. Our findings provide principled guidance for architecture selection based on task requirements and reveal fundamental differences in how Transformers, SSMs, and hybrid models learn positional associations.
Chinese Translation
变换器在上下文检索方面表现出色,但在序列长度上存在平方复杂度的问题,而状态空间模型(SSMs)则提供高效的线性时间处理,但其检索能力有限。我们探讨了结合变换器和SSMs的混合架构是否能够在两个合成的上下文检索任务中实现两者的最佳结合。第一个任务是n-gram检索,要求模型识别并重现输入序列中紧随查询的n-gram。第二个任务是位置检索,模型接收一个单一的查询标记,并要求其执行两步关联查找:首先定位序列中的相应元素,然后输出其位置索引。在受控实验条件下,我们评估了变换器、SSMs和混合架构在数据效率、长度泛化、对域外训练样本的鲁棒性以及学习表示方面的表现。我们发现混合模型在数据效率和信息密集型上下文检索的外推能力上优于SSMs,并与变换器相匹配或超越。然而,在位置检索任务中,变换器仍然保持优势。通过表示分析,我们发现基于SSM的模型发展出位置感知的嵌入,其中表示相邻位置的标记在嵌入空间中成为邻居,形成可解释的结构。这一新兴特性在变换器中缺失,解释了SSMs和混合模型在不同检索任务中的优缺点。我们的研究结果为基于任务需求的架构选择提供了原则性指导,并揭示了变换器、SSMs和混合模型在学习位置关联方面的根本差异。
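The n-gram retrieval task described above is easy to instantiate synthetically. The generator and oracle below are our own reconstruction from the task description (the benchmark's actual vocabulary sizes and lengths are assumptions); a sentinel query token is used so the answer is unambiguous:

```python
import random

# Hypothetical construction of one n-gram retrieval instance: the model
# must reproduce the n-gram that immediately succeeds the query token
# inside the context. Parameters are invented for illustration.

def make_ngram_instance(vocab_size=50, seq_len=30, n=3, seed=0):
    rng = random.Random(seed)
    seq = [rng.randrange(vocab_size) for _ in range(seq_len)]
    pos = rng.randrange(seq_len - n)   # leave room for the n-gram after it
    sentinel = vocab_size              # query token, unique by construction
    seq.insert(pos, sentinel)
    target = seq[pos + 1 : pos + 1 + n]   # the n-gram succeeding the query
    return seq, sentinel, target

def retrieve_ngram(seq, query, n):
    """Oracle solver: locate the query token and copy the next n tokens."""
    i = seq.index(query)
    return seq[i + 1 : i + 1 + n]

seq, query, target = make_ngram_instance()
pred = retrieve_ngram(seq, query, 3)   # matches target by construction
```

The position retrieval task differs only in the expected output: instead of `seq[i+1:i+1+n]`, the model must emit the index `i` itself, which is the two-hop lookup the paper probes.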
cs.AI / 30 / 2603.02908
SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training
SAE作为水晶球:可解释特征预测大型语言模型的跨领域可转移性,无需训练
Abstract
In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability \textit{before} fine-tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an interpretable tool for guiding post-training strategies in LLMs. Code is available at https://github.com/PKU-ML/STS.
Chinese Translation
近年来,预训练的大型语言模型在多种任务中取得了显著成功。除了自监督预训练的关键作用外,它们在下游应用中的有效性也在很大程度上依赖于后续训练过程,该过程将模型适应于特定任务的数据和目标。然而,这一过程不可避免地引入了模型偏移,这可能影响在不同领域的表现,而这种偏移的转移机制仍然不甚明了。为了揭开这一黑箱,我们提出了基于稀疏自编码器(SAE)的可转移性评分(STS),这是一种新指标,利用稀疏自编码器预测后续训练的可转移性。以监督微调为例,STS识别SAE表示中的偏移维度,并计算其与下游领域的相关性,从而在微调\textit{之前}可靠地估计可转移性。针对多个模型和领域的广泛实验表明,STS能够准确预测监督微调的可转移性,实际表现变化的皮尔逊相关系数超过0.7。除此之外,我们迈出了将STS扩展到强化学习的初步步骤。我们相信,STS可以作为一种可解释的工具,指导大型语言模型的后续训练策略。代码可在 https://github.com/PKU-ML/STS 获取。
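The core STS computation, measuring how strongly fine-tuning's shifted SAE dimensions overlap a downstream domain's feature profile, can be sketched with toy vectors. The six-dimensional activations and domain profiles below are invented; the real method operates over full SAE feature spaces:

```python
import math

# Illustrative sketch of the STS idea (not the paper's code): find SAE
# dimensions that shift most under fine-tuning, then correlate the shift
# profile with each downstream domain's feature-activation profile.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# mean SAE feature activations before / after fine-tuning (toy, 6 dims)
pre = [0.1, 0.5, 0.2, 0.9, 0.3, 0.4]
post = [0.1, 1.2, 0.2, 0.1, 0.9, 0.4]
shift = [abs(a - b) for a, b in zip(pre, post)]

# domain profiles: how strongly each SAE dim fires on that domain's data
math_domain = [0.0, 1.0, 0.1, 0.9, 0.8, 0.1]  # overlaps the shifted dims
law_domain = [0.9, 0.1, 0.8, 0.0, 0.1, 0.9]   # mostly unshifted dims

sts_math = pearson(shift, math_domain)   # positive: transfer expected
sts_law = pearson(shift, law_domain)     # negative: little transfer
```

The sign and magnitude of the correlation is what lets the score rank domains before any fine-tuning run is launched.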
cs.AI / 31 / 2603.02939
ShipTraj-R1: Reinforcing Ship Trajectory Prediction in Large Language Models via Group Relative Policy Optimization
ShipTraj-R1:通过群体相对策略优化增强大型语言模型中的船舶轨迹预测
Abstract
Recent advancements in reinforcement fine-tuning have significantly improved the reasoning ability of large language models (LLMs). In particular, methods such as group relative policy optimization (GRPO) have demonstrated strong capabilities across various fields. However, applying LLMs to ship trajectory prediction remains largely unexplored. In this paper, we propose ShipTraj-R1, a novel LLM-based framework that reformulates ship trajectory prediction as a text-to-text generation problem. (1) We design a dynamic prompt containing trajectory information about conflicting ships to guide the model to achieve adaptive chain-of-thought (CoT) reasoning. (2) We introduce a comprehensive rule-based reward mechanism to incentivize the reasoning format and prediction accuracy of the model. (3) Our ShipTraj-R1 is reinforced through the GRPO mechanism guided by domain-specific prompts and rewards, and utilizes the Qwen3 as the model backbone. Extensive experimental results on two complex and real-world maritime datasets show that the proposed ShipTraj-R1 achieves the least error compared with state-of-the-art deep learning and LLM-based baselines.
Chinese Translation
最近,强化微调的进展显著提升了大型语言模型(LLMs)的推理能力。特别是,群体相对策略优化(GRPO)等方法在多个领域展现了强大的能力。然而,将LLMs应用于船舶轨迹预测仍然基本未被探索。本文提出了ShipTraj-R1,一种基于LLM的新框架,将船舶轨迹预测重新表述为文本到文本的生成问题。(1) 我们设计了一个动态提示,包含有关冲突船舶的轨迹信息,以指导模型实现自适应的思维链(CoT)推理。(2) 我们引入了一种全面的基于规则的奖励机制,以激励模型的推理格式和预测准确性。(3) 我们的ShipTraj-R1通过GRPO机制进行强化,该机制由特定领域的提示和奖励引导,并利用Qwen3作为模型骨干。对两个复杂的真实海洋数据集的广泛实验结果表明,所提出的ShipTraj-R1与最先进的深度学习和基于LLM的基线相比,达到了最低的误差。
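A rule-based reward of the kind described, one term for the reasoning format and one for prediction accuracy, can be sketched as follows. The tag pattern, weights, and exponential error decay are our illustrative assumptions, not ShipTraj-R1's actual reward:

```python
import math
import re

# Hypothetical rule-based reward in the spirit of ShipTraj-R1: a format
# term checks the chain-of-thought output structure, and an accuracy term
# decays with the positional prediction error. Names/weights are made up.

def reward(completion, pred, truth, w_format=0.2, w_acc=0.8, scale=1.0):
    # format term: did the model wrap its reasoning in <think>...</think>?
    fmt = 1.0 if re.search(r"<think>.*</think>", completion, re.S) else 0.0
    # accuracy term: 1.0 at zero error, decaying with Euclidean distance
    err = math.hypot(pred[0] - truth[0], pred[1] - truth[1])
    acc = math.exp(-err / scale)
    return w_format * fmt + w_acc * acc

good = reward("<think>heading stable</think> (10.0, 20.1)",
              (10.0, 20.1), (10.0, 20.0))
bad = reward("(14.0, 25.0)", (14.0, 25.0), (10.0, 20.0))
```

In GRPO, a group of sampled completions would be scored this way and each completion's advantage taken relative to the group mean, so both the reasoning format and the trajectory error shape the policy update.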
cs.AI / 32 / 2603.02960
Architecting Trust in Artificial Epistemic Agents
构建人工知识代理的信任
Abstract
Large language models increasingly function as epistemic agents -- entities that can 1) autonomously pursue epistemic goals and 2) actively shape our shared knowledge environment. They curate the information we receive, often supplanting traditional search-based methods, and are frequently used to generate both personal and deeply specialized advice. How they perform these functions, including whether they are reliable and properly calibrated to both individual and collective epistemic norms, is therefore highly consequential for the choices we make. We argue that the potential impact of epistemic AI agents on practices of knowledge creation, curation and synthesis, particularly in the context of complex multi-agent interactions, creates new informational interdependencies that necessitate a fundamental shift in evaluation and governance of AI. While a well-calibrated ecosystem could augment human judgment and collective decision-making, poorly aligned agents risk causing cognitive deskilling and epistemic drift, making the calibration of these models to human norms a high-stakes necessity. To ensure a beneficial human-AI knowledge ecosystem, we propose a framework centered on building and cultivating the trustworthiness of epistemic AI agents; aligning these agents with human epistemic goals; and reinforcing the surrounding socio-epistemic infrastructure. In this context, trustworthy AI agents must demonstrate epistemic competence, robust falsifiability, and epistemically virtuous behaviors, supported by technical provenance systems and "knowledge sanctuaries" designed to protect human resilience. This normative roadmap provides a path toward ensuring that future AI systems act as reliable partners in a robust and inclusive knowledge ecosystem.
Chinese Translation
大型语言模型越来越多地作为知识代理运作——即能够1) 自主追求知识目标和2) 积极塑造我们共享知识环境的实体。它们策划我们接收到的信息,常常取代传统的基于搜索的方法,并且经常被用于生成个人化和深度专业化的建议。因此,它们如何执行这些功能,包括它们是否可靠以及是否适当地与个人和集体的知识规范相一致,对于我们所做的选择具有重要影响。我们认为,知识人工智能代理对知识创造、策划和综合实践的潜在影响,特别是在复杂的多代理交互背景下,创造了新的信息相互依赖关系,这需要在人工智能的评估和治理上进行根本性的转变。虽然一个良好校准的生态系统可以增强人类判断和集体决策,但不良对齐的代理可能导致认知技能退化和知识漂移,使得这些模型与人类规范的校准成为一项高风险的必要任务。为了确保一个有利的人类与人工智能的知识生态系统,我们提出了一个框架,重点在于建立和培养知识人工智能代理的可信度;使这些代理与人类的知识目标对齐;并加强周围的社会知识基础设施。在这种背景下,可信的人工智能代理必须展示知识能力、强大的可证伪性和知识美德行为,并得到技术来源系统和旨在保护人类韧性的“知识庇护所”的支持。这个规范性路线图为确保未来的人工智能系统作为一个强大而包容的知识生态系统中的可靠伙伴提供了一条路径。
cs.AI / 33 / 2603.03002
SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in Large Language Models
SpatialText:用于大型语言模型空间理解的纯文本认知基准
Abstract
Genuine spatial reasoning relies on the capacity to construct and manipulate coherent internal spatial representations, often conceptualized as mental models, rather than merely processing surface linguistic associations. While large language models exhibit advanced capabilities across various domains, existing benchmarks fail to isolate this intrinsic spatial cognition from statistical language heuristics. Furthermore, multimodal evaluations frequently conflate genuine spatial reasoning with visual perception. To systematically investigate whether models construct flexible spatial mental models, we introduce SpatialText, a theory-driven diagnostic framework. Rather than functioning simply as a dataset, SpatialText isolates text-based spatial reasoning through a dual-source methodology. It integrates human-annotated descriptions of real 3D indoor environments, which capture natural ambiguities, perspective shifts, and functional relations, with code-generated, logically precise scenes designed to probe formal spatial deduction and epistemic boundaries. Systematic evaluation across state-of-the-art models reveals fundamental representational limitations. Although models demonstrate proficiency in retrieving explicit spatial facts and operating within global, allocentric coordinate systems, they exhibit critical failures in egocentric perspective transformation and local reference frame reasoning. These systematic errors provide strong evidence that current models rely heavily on linguistic co-occurrence heuristics rather than constructing coherent, verifiable internal spatial representations. SpatialText thus serves as a rigorous instrument for diagnosing the cognitive boundaries of artificial spatial intelligence.
Chinese Translation
真正的空间推理依赖于构建和操纵一致的内部空间表征的能力,这通常被概念化为心理模型,而不仅仅是处理表面的语言关联。尽管大型语言模型在各个领域展现出先进的能力,但现有基准未能将这种内在的空间认知与统计语言启发式方法区分开。此外,多模态评估常常将真正的空间推理与视觉感知混为一谈。为了系统地研究模型是否构建灵活的空间心理模型,我们引入了SpatialText,一个以理论为驱动的诊断框架。SpatialText不仅仅作为一个数据集,它通过双源方法隔离文本基础的空间推理。它整合了人类注释的真实三维室内环境描述,这些描述捕捉了自然的模糊性、视角变化和功能关系,以及旨在探测形式空间推理和认知边界的代码生成的逻辑精确场景。对最先进模型的系统评估揭示了基本的表征局限性。尽管模型在检索显性空间事实以及在全局的非自我中心(allocentric)坐标系统中操作方面表现熟练,但在自我中心(egocentric)视角转换和局部参考框架推理方面却出现了关键性的失败。这些系统性错误提供了强有力的证据,表明当前模型在很大程度上依赖于语言共现启发式,而不是构建一致、可验证的内部空间表征。因此,SpatialText作为一个严格的工具,用于诊断人工空间智能的认知边界。
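The egocentric/allocentric distinction the benchmark probes is a concrete geometric operation. The helper below, with its coordinates and angular bins, is a toy illustration of ours (not from SpatialText): the same allocentric fact ("the lamp is north of the desk") yields different egocentric labels as the observer's heading changes:

```python
import math

# Toy egocentric perspective transformation: classify a target as
# left/right/ahead/behind relative to an observer with a given heading
# (0 degrees = facing along +x). Scene and thresholds are invented.

def egocentric_direction(observer, heading_deg, target):
    dx, dy = target[0] - observer[0], target[1] - observer[1]
    angle = math.degrees(math.atan2(dy, dx)) - heading_deg
    angle = (angle + 180) % 360 - 180      # wrap into [-180, 180)
    if -45 <= angle <= 45:
        return "ahead"
    if 45 < angle < 135:
        return "left"
    if -135 < angle < -45:
        return "right"
    return "behind"

# Allocentrically the lamp is always "north of" the desk, but its
# egocentric label flips with the observer's heading:
lamp, desk = (0.0, 1.0), (0.0, 0.0)
facing_east = egocentric_direction(desk, 0.0, lamp)     # lamp on the left
facing_west = egocentric_direction(desk, 180.0, lamp)   # lamp on the right
```

Models that answer from co-occurrence statistics rather than an internal frame transformation tend to fail exactly when this heading-dependent flip matters.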
cs.AI / 34 / 2603.03005
OrchMAS: Orchestrated Reasoning with Multi Collaborative Heterogeneous Scientific Expert Structured Agents
OrchMAS:与多协作异构科学专家结构代理的协调推理
Abstract
Multi-agent large language model frameworks are promising for complex multi-step reasoning, yet existing systems remain weak for scientific and knowledge-intensive domains due to static prompts and agent roles, rigid workflows, and homogeneous model reliance, leading to poor domain adaptation, limited reasoning flexibility, and high latency on heterogeneous or long-horizon scientific tasks. They also struggle to revise earlier decisions when intermediate reasoning diverges, reducing reliability in structured and calculation-heavy settings. To address these limitations, we propose a scientific-domain-oriented interactive two-tier multi-model orchestration framework. A dedicated orchestration model analyzes each task, dynamically constructs a domain-aware reasoning pipeline, and instantiates specialized expert agents with tailored prompts, while an execution model performs each step under generated role and instruction specifications. The orchestrator iteratively updates the pipeline based on intermediate feedback, enabling dynamic replanning, role reallocation, and prompt refinement across multi-turn interactions, strengthening robustness and specialization for scientific reasoning through structured heterogeneous model collaboration. The framework is model-agnostic and supports heterogeneous LLM integration with different capacities or costs, enabling flexible performance-efficiency trade-offs in practical scientific deployments. Experiments show consistent improvements over existing multi-agent systems and strong baselines across diverse reasoning and scientific-style benchmarks.
Chinese Translation
多代理大型语言模型框架在复杂的多步骤推理中展现出良好的前景,但现有系统在科学和知识密集型领域仍显薄弱,原因在于静态提示和代理角色、僵化的工作流程以及对同质模型的依赖,导致领域适应性差、推理灵活性有限以及在异构或长时间跨度的科学任务中延迟较高。当中间推理出现偏差时,它们也难以修正早期决策,从而降低了在结构化和计算密集型环境中的可靠性。为了解决这些局限性,我们提出了一种面向科学领域的交互式双层多模型协调框架。专门的协调模型分析每个任务,动态构建一个领域感知的推理管道,并实例化具有定制提示的专门专家代理,而执行模型则根据生成的角色和指令规范执行每一步。协调者根据中间反馈迭代更新管道,实现动态重新规划、角色重新分配和提示优化,增强通过结构化异构模型协作进行科学推理的鲁棒性和专业化。该框架与模型无关,支持不同能力或成本的异构大型语言模型集成,从而在实际科学应用中实现灵活的性能效率权衡。实验结果显示,在多种推理和科学风格基准测试中,相较于现有的多代理系统和强基线,均取得了一致的改进。
cs.AI / 35 / 2603.03018
REGAL: A Registry-Driven Architecture for Deterministic Grounding of Agentic AI in Enterprise Telemetry
REGAL:一种基于注册表驱动的企业遥测中代理人工智能确定性基础架构
Abstract
Enterprise engineering organizations produce high-volume, heterogeneous telemetry from version control systems, CI/CD pipelines, issue trackers, and observability platforms. Large Language Models (LLMs) enable new forms of agentic automation, but grounding such agents on private telemetry raises three practical challenges: limited model context, locally defined semantic concepts, and evolving metric interfaces. We present REGAL, a registry-driven architecture for deterministic grounding of agentic AI systems in enterprise telemetry. REGAL adopts an explicitly architectural approach: deterministic telemetry computation is treated as a first-class primitive, and LLMs operate over a bounded, version-controlled action space rather than raw event streams. The architecture combines (1) a Medallion ELT pipeline that produces replayable, semantically compressed Gold artifacts, and (2) a registry-driven compilation layer that synthesizes Model Context Protocol (MCP) tools from declarative metric definitions. The registry functions as an "interface-as-code" layer, ensuring alignment between tool specification and execution, mitigating tool drift, and embedding governance policies directly at the semantic boundary. A prototype implementation and case study validate the feasibility of deterministic grounding and illustrate its implications for latency, token efficiency, and operational governance. This work systematizes an architectural pattern for enterprise LLM grounding; it does not propose new learning algorithms, but rather elevates deterministic computation and semantic compilation to first-class design primitives for agentic systems.
Chinese Translation
企业工程组织从版本控制系统、CI/CD管道、问题跟踪器和可观察性平台生成高容量、异构的遥测数据。大型语言模型(LLMs)使得新的代理自动化形式成为可能,但在私有遥测上为这些代理提供基础时面临三个实际挑战:有限的模型上下文、本地定义的语义概念以及不断演变的度量接口。我们提出了REGAL,一种用于在企业遥测中实现代理人工智能系统确定性基础的注册表驱动架构。REGAL采用了一种明确的架构方法:将确定性遥测计算视为一等原语,LLMs在一个有限的、版本控制的行动空间中操作,而不是原始事件流。该架构结合了(1)一个生成可重放、语义压缩的黄金工件的Medallion ELT管道,以及(2)一个从声明性度量定义合成模型上下文协议(MCP)工具的注册表驱动编译层。注册表作为“接口即代码”层,确保工具规范与执行之间的一致性,减轻工具漂移,并在语义边界直接嵌入治理政策。原型实现和案例研究验证了确定性基础的可行性,并展示了其对延迟、令牌效率和操作治理的影响。这项工作系统化了一种企业LLM基础的架构模式;它并未提出新的学习算法,而是将确定性计算和语义编译提升为代理系统的一等设计原语。
cs.AI / 36 / 2603.03078
RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization
RAPO:通过检索增强策略优化扩展大语言模型代理的探索
Abstract
Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model-based (LLM) agents. These works can empower LLM agents to tackle complex tasks via multi-step, tool-integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a pure on-policy paradigm for exploration, restricting exploration to the agent's self-generated outputs and preventing the discovery of new reasoning perspectives for further improvement. While recent efforts incorporate auxiliary off-policy signals to enhance exploration, they typically utilize full off-policy trajectories for trajectory-level policy estimation, overlooking the necessity for the fine-grained, step-level exploratory dynamics within agentic rollouts. In this paper, we revisit exploration in Agentic RL and propose Retrieval-Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. To achieve this, we decompose the Agentic RL training process into two phases: (i) Hybrid-policy Agentic Rollout, and (ii) Retrieval-aware Policy Optimization. Specifically, we propose a Hybrid-policy Agentic Rollout strategy, which allows the agents to continuously reason over the retrieved off-policy step-level traces. It dynamically extends the reasoning receptive field of agents, enabling broader exploration conditioned on external behaviors. Subsequently, we introduce the Retrieval-aware Policy Optimization mechanism, which calibrates the policy gradient estimation with retrieval reward and importance shaping, stabilizing training and prioritizing retrieval-illuminating exploration. Extensive experiments show that RAPO achieves a +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while delivering 1.2x faster training efficiency.
Chinese Translation
代理强化学习(Agentic RL)在基于大语言模型(LLM)的代理中展现出显著的潜力。这些研究可以使LLM代理通过多步骤、工具集成的推理来应对复杂任务。然而,现有Agentic RL方法的一个固有限制是它们依赖于纯粹的同策略(on-policy)范式进行探索,将探索限制在代理自生成的输出上,阻碍了发现新的推理视角以进一步改进。尽管近期的努力引入了辅助的离策略(off-policy)信号以增强探索,但它们通常利用完整的离策略轨迹进行轨迹级策略估计,忽视了代理推演(rollout)过程中细粒度、步骤级探索动态的必要性。在本文中,我们重新审视Agentic RL中的探索,并提出检索增强策略优化(RAPO),这是一种新颖的强化学习框架,通过引入检索来明确扩展训练过程中的探索。为此,我们将Agentic RL训练过程分解为两个阶段:(i)混合策略代理推演,和(ii)检索感知的策略优化。具体而言,我们提出了一种混合策略代理推演策略,使代理能够持续推理检索到的离策略步骤级轨迹。它动态扩展了代理的推理感受野,使其能够在外部行为的条件下进行更广泛的探索。随后,我们引入了检索感知的策略优化机制,通过检索奖励和重要性塑造来校准策略梯度估计,稳定训练并优先考虑检索启发的探索。大量实验表明,RAPO在三个代理推理任务的十四个数据集上实现了平均+5.0%的增益,同时提供了1.2倍的训练效率提升。
cs.AI / 37 / 2603.03080
Beyond Factual Correctness: Mitigating Preference-Inconsistent Explanations in Explainable Recommendation
超越事实正确性:缓解可解释推荐中的偏好不一致解释
Abstract
LLM-based explainable recommenders can produce fluent explanations that are factually correct, yet still justify items using attributes that conflict with a user's historical preferences. Such preference-inconsistent explanations yield logically valid but unconvincing reasoning and are largely missed by standard hallucination or faithfulness metrics. We formalize this failure mode and propose PURE, a preference-aware reasoning framework following a select-then-generate paradigm. Instead of only improving generation, PURE intervenes in evidence selection, it selects a compact set of multi-hop item-centric reasoning paths that are both factually grounded and aligned with user preference structure, guided by user intent, specificity, and diversity to suppress generic, weakly personalized evidence. The selected evidence is then injected into LLM generation via structure-aware prompting that preserves relational constraints. To measure preference inconsistency, we introduce a feature-level, user-centric evaluation metric that reveals misalignment overlooked by factuality-based measures. Experiments on three real-world datasets show that PURE consistently reduces preference-inconsistent explanations and factual hallucinations while maintaining competitive recommendation accuracy, explanation quality, and inference efficiency. These results highlight that trustworthy explanations require not only factual correctness but also justification aligned with user preferences.
Chinese Translation
基于大型语言模型(LLM)的可解释推荐系统能够生成流畅且事实正确的解释,但仍然可能使用与用户历史偏好相冲突的属性来为推荐项目辩护。这种偏好不一致的解释虽然在逻辑上有效,但缺乏说服力,并且在标准的幻觉或忠实度指标中往往被忽视。我们对这种失败模式进行了形式化,并提出了PURE,一个遵循选择后生成(select-then-generate)范式的偏好感知推理框架。PURE不仅仅改善生成过程,还介入证据选择,选择一组紧凑的多跳以项目为中心的推理路径,这些路径既有事实依据,又与用户的偏好结构相一致,受用户意图、特异性和多样性的指导,以抑制通用的、弱个性化的证据。然后,通过结构感知提示将所选证据注入LLM生成中,以保持关系约束。为了衡量偏好不一致性,我们引入了一种基于特征的、以用户为中心的评估指标,揭示了被事实性指标忽视的错位。对三个真实世界数据集的实验表明,PURE在持续减少偏好不一致解释和事实幻觉的同时,保持了竞争性的推荐准确性、解释质量和推理效率。这些结果强调,可信的解释不仅需要事实正确性,还需要与用户偏好一致的辩护。
cs.AI / 38 / 2603.03097
Odin: Multi-Signal Graph Intelligence for Autonomous Discovery in Knowledge Graphs
Odin:用于知识图谱自主发现的多信号图智能
Abstract
We present Odin, the first production-deployed graph intelligence engine for autonomous discovery of meaningful patterns in knowledge graphs without prior specification. Unlike retrieval-based systems that answer predefined queries, Odin guides exploration through the COMPASS (Composite Oriented Multi-signal Path Assessment) score, a novel metric that combines (1) structural importance via Personalized PageRank, (2) semantic plausibility through Neural Probabilistic Logic Learning (NPLL) used as a discriminative filter rather than generative model, (3) temporal relevance with configurable decay, and (4) community-aware guidance through GNN-identified bridge entities and inter-community affinity scores. This multi-signal integration, particularly the bridge scoring mechanism, addresses the "echo chamber" problem where graph exploration becomes trapped in dense local communities. We formalize the autonomous discovery problem, prove theoretical properties of our scoring function, and demonstrate that beam search with multi-signal guidance achieves $O(b \cdot h)$ complexity while maintaining high recall compared to exhaustive exploration. To our knowledge, Odin represents the first autonomous discovery system deployed in regulated production environments (healthcare and insurance), demonstrating significant improvements in pattern discovery quality and analyst efficiency. Our approach maintains complete provenance traceability -- a critical requirement for regulated industries where hallucination is unacceptable.
Chinese Translation
我们提出了Odin，这是第一个在生产环境中部署的图智能引擎，能够在知识图谱中自主发现有意义的模式，而无需事先指定。与回答预定义查询的基于检索的系统不同，Odin通过COMPASS（复合导向多信号路径评估）评分引导探索，这是一种新颖的度量标准，结合了(1)通过个性化PageRank获得的结构重要性，(2)通过神经概率逻辑学习(Neural Probabilistic Logic Learning, NPLL)实现的语义合理性，该方法作为判别过滤器而非生成模型使用，(3)具有可配置衰减的时间相关性，以及(4)通过图神经网络(GNN)识别的桥接实体和社区间亲和度评分提供的社区感知指导。这种多信号集成，特别是桥接评分机制，解决了图探索陷入密集局部社区的“回音室”问题。我们形式化了自主发现问题，证明了我们评分函数的理论属性，并展示了带有多信号指导的束搜索在保持相对于全面探索的高召回率的同时实现了$O(b \cdot h)$的复杂度。据我们所知，Odin代表了第一个在受监管的生产环境（医疗和保险）中部署的自主发现系统，显著提高了模式发现的质量和分析师的效率。我们的方法保持了完整的来源可追溯性，这是在幻觉不可接受的受监管行业中的一项关键要求。
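The beam search with multi-signal guidance described above can be sketched minimally as follows. The signal functions, weights, and toy graph are illustrative assumptions, not Odin's actual COMPASS implementation; the point is the shape of the computation: at each of $h$ hops, only the top $b$ paths under a weighted composite score are expanded, giving $O(b \cdot h)$ scored expansions instead of exhaustive enumeration.

```python
import heapq

def compass_score(path, signals, weights):
    """Combine multiple path signals into one composite score.
    `signals` maps a name to fn(path) in [0, 1]; weights are hypothetical."""
    return sum(weights[name] * fn(path) for name, fn in signals.items())

def beam_search(graph, start, signals, weights, beam_width=3, horizon=4):
    """Keep at most `beam_width` paths per hop for `horizon` hops:
    O(b * h) scored expansions overall, versus exhaustive exploration."""
    beam = [(0.0, (start,))]
    for _ in range(horizon):
        candidates = []
        for _, path in beam:
            for nxt in graph.get(path[-1], []):
                if nxt in path:  # avoid revisiting entities
                    continue
                new_path = path + (nxt,)
                candidates.append(
                    (compass_score(new_path, signals, weights), new_path))
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam
```

A "bridge" signal rewarding paths through a GNN-identified bridge entity is what steers the beam out of a dense local community, which is how the scoring counters the echo-chamber effect.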
cs.AI / 39 / 2603.03116
Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation
超越任务完成:通过程序意识评估揭示大型语言模型代理中的腐败成功
Abstract
Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure-Aware Evaluation (PAE), a framework that formalizes agent procedures as structured observations and exposes consistency relationships between what agents observe, communicate, and execute. PAE evaluates agents along complementary axes (Utility, Efficiency, Interaction Quality, Procedural Integrity) and applies multi-dimensional gating that categorically disqualifies corrupt outcomes. Evaluating state-of-the-art LLM agents on tau-bench yields findings at the axis, compliance, and benchmark levels. At the axis level, the dimensions capture non-redundant failure modes: utility masks reliability gaps, speed does not imply precision, and conciseness does not predict intent adherence. At the procedural compliance level, 27-78% of benchmark reported successes are corrupt successes concealing violations across interaction and integrity. Furthermore, gating substantially collapses Pass^4 rate and affects model rankings. The analysis of corrupt success cases reveals distinctive per-model failure signatures: GPT-5 spreads errors across policy, execution, and intent dimensions; Kimi-K2-Thinking concentrates 78% of violations in policy faithfulness and compliance; and Mistral-Large-3 is dominated by faithfulness failures. At the benchmark level, our analysis exposes structural flaws in the benchmark design, including task scope gaps, contradictory reward signals, and simulator artifacts that produce accidental successes.
Chinese Translation
基于大型语言模型(LLM)的代理在高风险环境中的应用日益增加,但当前的基准主要评估任务是否完成,而不是完成的方式。我们引入了程序意识评估(Procedure-Aware Evaluation, PAE),这是一个将代理程序形式化为结构化观察的框架,并揭示代理观察、交流和执行之间的一致性关系。PAE从互补的维度(效用、效率、互动质量、程序完整性)评估代理,并应用多维度筛选,明确排除腐败结果。在tau-bench上评估最先进的LLM代理的结果显示了在维度、合规性和基准水平上的发现。在维度层面,这些维度捕捉到非冗余的失败模式:效用掩盖了可靠性差距,速度并不意味着精确性,简洁性也不能预测意图遵循。在程序合规性层面,基准报告的成功案例中有27%-78%是腐败成功,掩盖了互动和完整性方面的违规行为。此外,筛选显著降低了Pass^4率并影响了模型排名。对腐败成功案例的分析揭示了每个模型独特的失败特征:GPT-5在政策、执行和意图维度上分散错误;Kimi-K2-Thinking在政策忠诚度和合规性方面集中78%的违规;而Mistral-Large-3则主要受到忠诚度失败的影响。在基准层面,我们的分析揭示了基准设计中的结构性缺陷,包括任务范围差距、矛盾的奖励信号以及产生意外成功的模拟器伪影。
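The multi-dimensional gating described above, under which a completed task still counts as a corrupt success if any procedural axis is violated, can be sketched minimally. The axis names and checks below are hypothetical stand-ins for PAE's actual criteria:

```python
def gated_success(outcome):
    """Multi-dimensional gating: an episode is a genuine success only if
    every axis clears its gate; a completed task that fails any gate is a
    'corrupt success'. Axis names/checks are illustrative assumptions."""
    gates = {
        "utility": lambda o: o["task_completed"],
        "interaction_quality": lambda o: o["policy_violations"] == 0,
        "procedural_integrity": lambda o: o["actions_match_stated_intent"],
    }
    return all(check(outcome) for check in gates.values())
```

Under such gating, a benchmark's raw completion rate and the gated success rate can diverge sharply, which is exactly the 27-78% gap the paper reports.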
cs.AI / 40 / 2603.03119
AI Space Physics: Constitutive boundary semantics for open AI institutions
人工智能空间物理学:开放人工智能机构的构成边界语义
Abstract
Agentic AI deployments increasingly behave as persistent institutions rather than one-shot inference endpoints: they accumulate state, invoke external tools, coordinate multiple runtimes, and modify their future authority surface over time. Existing governance language typically specifies decision-layer constraints but leaves the causal mechanics of boundary crossing underdefined, particularly for transitions that do not immediately change the external world yet expand what the institution can later do. This paper introduces AI Space Physics as a constitutive semantics for open, self-expanding AI institutions. We define a minimal state model with typed boundary channels, horizon-limited reach semantics, and a membrane-witness discipline. The core law family (P-1, P-1a, P-1b, P-1c) requires witness completeness, non-bypass mediation, atomic adjudication-to-effect transitions, and replayable reconstruction of adjudication class. We explicitly separate second-order effects into structural expansion and policy broadening, and treat expansion transitions as governance-relevant even when immediate external deltas are zero. The novelty claim is precise rather than expansive: this work does not introduce mediation as a concept; it reclassifies authority-surface expansion as a first-class boundary event with constitutive witness obligations. In this semantics, expansion without immediate commit remains adjudication-relevant.
Chinese Translation
代理型人工智能的部署越来越像持久性机构，而不仅仅是一次性的推理终端：它们积累状态，调用外部工具，协调多个运行时，并随着时间的推移修改其未来的权威表面。现有的治理语言通常规定决策层的约束，但对边界跨越的因果机制定义不明确，尤其是对于那些不会立即改变外部世界但扩展机构未来能力的过渡。本文提出了人工智能空间物理学，作为开放、自我扩展的人工智能机构的构成语义。我们定义了一个最小状态模型，包含类型化边界通道、有限视界的到达语义和膜见证规范。核心定律族(P-1, P-1a, P-1b, P-1c)要求见证的完整性、非绕过的调解、原子化的裁决到效果的过渡，以及裁决类别的可重放重建。我们明确将二阶效应分为结构扩展和政策拓宽，并将扩展过渡视为与治理相关，即使立即的外部变化为零。新颖性主张是精确而非广泛的：这项工作并未引入调解这一概念；而是将权威表面的扩展重新分类为一类具有构成见证义务的边界事件。在这种语义中，没有立即承诺的扩展仍然与裁决相关。
cs.AI / 41 / 2603.03147
Agentic AI-based Coverage Closure for Formal Verification
基于代理智能的形式验证覆盖闭合
Abstract
Coverage closure is a critical requirement in Integrated Chip (IC) development process and key metric for verification sign-off. However, traditional exhaustive approaches often fail to achieve full coverage within project timelines. This study presents an agentic AI-driven workflow that utilizes Large Language Model (LLM)-enabled Generative AI (GenAI) to automate coverage analysis for formal verification, identify coverage gaps, and generate the required formal properties. The framework accelerates verification efficiency by systematically addressing coverage holes. Benchmarking open-source and internal designs reveals a measurable increase in coverage metrics, with improvements correlated to the complexity of the design. Comparative analysis validates the effectiveness of this approach. These results highlight the potential of agentic AI-based techniques to improve formal verification productivity and support comprehensive coverage closure.
Chinese Translation
覆盖闭合是集成电路(IC)开发过程中的一个关键要求,也是验证签署的重要指标。然而,传统的穷举方法往往无法在项目时间内实现全面覆盖。本研究提出了一种基于代理智能的工作流程,该流程利用大型语言模型(LLM)驱动的生成式人工智能(GenAI)自动化形式验证的覆盖分析,识别覆盖缺口,并生成所需的形式属性。该框架通过系统性地解决覆盖漏洞来加速验证效率。对开源和内部设计的基准测试显示,覆盖指标有显著提升,且改进与设计的复杂性相关。比较分析验证了该方法的有效性。这些结果突显了基于代理智能的技术在提高形式验证生产力和支持全面覆盖闭合方面的潜力。
cs.AI / 42 / 2603.03175
Saarthi for AGI: Towards Domain-Specific General Intelligence for Formal Verification
Saarthi在AGI中的应用:面向形式验证的领域特定通用智能
Abstract
Saarthi is an agentic AI framework that uses multi-agent collaboration to perform end-to-end formal verification. Even though the framework provides a complete flow from specification to coverage closure, with around 40% efficacy, there are several challenges that need to be addressed to make it more robust and reliable. Artificial General Intelligence (AGI) is still a distant goal, and current Large Language Model (LLM)-based agents are prone to hallucinations and making mistakes, especially when dealing with complex tasks such as formal verification. However, with the right enhancements and improvements, we believe that Saarthi can be a significant step towards achieving domain-specific general intelligence for formal verification. Especially for problems that require Short Term, Short Context (STSC) capabilities, such as formal verification, Saarthi can be a powerful tool to assist verification engineers in their work. In this paper, we present two key enhancements to the Saarthi framework: (1) a structured rulebook and specification grammar to improve the accuracy and controllability of SystemVerilog Assertion (SVA) generation, and (2) integration of advanced Retrieval Augmented Generation (RAG) techniques, such as GraphRAG, to provide agents with access to technical knowledge and best practices for iterative refinement and improvement of outputs. We also benchmark these enhancements for the overall Saarthi framework using challenging test cases from NVIDIA's CVDP benchmark targeting formal verification. Our benchmark results stand out with a 70% improvement in the accuracy of generated assertions, and a 50% reduction in the number of iterations required to achieve coverage closure.
Chinese Translation
Saarthi是一个智能代理AI框架,利用多智能体协作执行端到端的形式验证。尽管该框架提供了从规范到覆盖闭合的完整流程,且有效性约为40%,但仍然存在若干挑战需要解决,以使其更加稳健和可靠。人工通用智能(AGI)仍然是一个遥远的目标,目前基于大型语言模型(LLM)的代理容易出现幻觉和错误,尤其是在处理形式验证等复杂任务时。然而,通过适当的增强和改进,我们相信Saarthi可以成为实现形式验证领域特定通用智能的重要一步。特别是对于需要短期、短上下文(STSC)能力的问题,如形式验证,Saarthi可以成为验证工程师工作的强大工具。在本文中,我们提出了对Saarthi框架的两个关键增强: (1) 结构化规则书和规范语法,以提高SystemVerilog断言(SVA)生成的准确性和可控性; (2) 集成先进的检索增强生成(RAG)技术,如GraphRAG,为代理提供访问技术知识和最佳实践的能力,以便对输出进行迭代优化和改进。我们还使用来自NVIDIA CVDP基准的具有挑战性的测试用例,对整体Saarthi框架的这些增强进行了基准测试。我们的基准测试结果显示,生成的断言准确性提高了70%,并且实现覆盖闭合所需的迭代次数减少了50%。
cs.AI / 43 / 2603.03176
FEAST: Retrieval-Augmented Multi-Hierarchical Food Classification for the FoodEx2 System
FEAST:用于FoodEx2系统的检索增强多层次食品分类
Abstract
Hierarchical text classification (HTC) and extreme multi-label classification (XML) tasks face compounded challenges from complex label interdependencies, data sparsity, and extreme output dimensions. These challenges are exemplified in the European Food Safety Authority's FoodEx2 system, a standardized food classification framework essential for food consumption monitoring and contaminant exposure assessment across Europe. FoodEx2 coding transforms natural language food descriptions into a set of codes from multiple standardized hierarchies, but faces implementation barriers due to its complex structure. Given a food description (e.g., "organic yogurt"), the system identifies its base term ("yogurt"), all the applicable facet categories (e.g., "production method"), and then every relevant facet descriptor for each category (e.g., "organic production"). While existing models perform adequately on well-balanced and semantically dense hierarchies, no prior work has addressed the practical constraints imposed by the FoodEx2 system. The limited literature addressing such real-world scenarios further compounds these challenges. We propose FEAST (Food Embedding And Semantic Taxonomy), a novel retrieval-augmented framework that decomposes FoodEx2 classification into a three-stage approach: (1) base term identification, (2) multi-label facet prediction, and (3) facet descriptor assignment. By leveraging the system's hierarchical structure to guide training and performing deep metric learning, FEAST learns discriminative embeddings that mitigate data sparsity and improve generalization on rare and fine-grained labels. Evaluated on the multilingual FoodEx2 benchmark, FEAST outperforms the prior European CNN baseline's F1 scores by 12-38% on rare classes.
Chinese Translation
层次文本分类(HTC)和极端多标签分类(XML)任务面临着复杂标签依赖性、数据稀疏性和极端输出维度等复合挑战。这些挑战在欧洲食品安全局的FoodEx2系统中得到了体现,该系统是一个标准化的食品分类框架,对于监测食品消费和评估污染物暴露在欧洲至关重要。FoodEx2编码将自然语言食品描述转化为来自多个标准化层次的一组代码,但由于其复杂结构面临实施障碍。给定一个食品描述(例如,“有机酸奶”),该系统识别其基本术语(“酸奶”)、所有适用的方面类别(例如,“生产方式”),然后为每个类别分配相关的方面描述符(例如,“有机生产”)。虽然现有模型在平衡良好且语义密集的层次上表现尚可,但尚未有研究应用于FoodEx2系统所施加的实际限制。针对这种现实场景的文献有限,进一步加剧了这些挑战。我们提出了FEAST(食品嵌入和语义分类法),这是一种新颖的检索增强框架,将FoodEx2分类分解为三个阶段的方法:(1)基本术语识别,(2)多标签方面预测,以及(3)方面描述符分配。通过利用系统的层次结构来指导训练并执行深度度量学习,FEAST学习到的判别嵌入能够缓解数据稀疏性,并提高对稀有和细粒度标签的泛化能力。在多语言FoodEx2基准上评估,FEAST在稀有类别上超越了之前欧洲CNN基线F1分数12-38%。
cs.AI / 44 / 2603.03177
Neuro-Symbolic Artificial Intelligence: A Task-Directed Survey in the Black-Box Models Era
神经符号人工智能:黑箱模型时代的任务导向调查
Abstract
The integration of symbolic computing with neural networks has intrigued researchers since the first theorizations of Artificial intelligence (AI). The ability of Neuro-Symbolic (NeSy) methods to infer or exploit behavioral schema has been widely considered as one of the possible proxies for human-level intelligence. However, the limited semantic generalizability and the challenges in declining complex domains with pre-defined patterns and rules hinder their practical implementation in real-world scenarios. The unprecedented results achieved by connectionist systems since the last AI breakthrough in 2017 have raised questions about the competitiveness of NeSy solutions, with particular emphasis on the Natural Language Processing and Computer Vision fields. This survey examines task-specific advancements in the NeSy domain to explore how incorporating symbolic systems can enhance explainability and reasoning capabilities. Our findings are meant to serve as a resource for researchers exploring explainable NeSy methodologies for real-life tasks and applications. Reproducibility details and in-depth comments on each surveyed research work are made available at https://github.com/disi-unibo-nlp/task-oriented-neuro-symbolic.git.
Chinese Translation
自人工智能(AI)首次理论化以来,符号计算与神经网络的结合一直吸引着研究者的关注。神经符号(Neuro-Symbolic, NeSy)方法推断或利用行为模式的能力被广泛认为是人类水平智能的潜在代理之一。然而,有限的语义泛化能力以及在复杂领域中应用预定义模式和规则所面临的挑战,阻碍了它们在现实场景中的实际应用。自2017年人工智能突破以来,连接主义系统取得的前所未有的成果引发了对NeSy解决方案竞争力的质疑,特别是在自然语言处理和计算机视觉领域。本文调查了NeSy领域中任务特定的进展,探讨了如何通过结合符号系统来增强可解释性和推理能力。我们的研究结果旨在为探索可解释的NeSy方法论以应对现实任务和应用的研究者提供资源。有关每项调查研究工作的可重复性细节和深入评论可在 https://github.com/disi-unibo-nlp/task-oriented-neuro-symbolic.git 获取。
cs.AI / 45 / 2603.03190
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
期望与声学神经网络表征增强基于脑活动的音乐识别
Abstract
During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.
Chinese Translation
在音乐聆听过程中，皮层活动编码了声学信息和与期望相关的信息。先前的研究表明，人工神经网络(ANN)表征与皮层表征相似，并可以作为脑电图(EEG)识别的监督信号。在本研究中，我们展示了将声学和与期望相关的ANN表征区分为教师目标可以改善基于EEG的音乐识别。预训练以预测任一表征的模型优于未预训练的基线模型，而将两者结合则产生互补增益，超越了通过变化随机初始化形成的强种子集成。这些发现表明，教师表征类型影响下游性能，并且表征学习可以通过神经编码进行引导。本研究指向了预测音乐认知和神经解码的进展。我们的期望表征直接从原始信号计算而来，无需人工标签，反映了超越音符起始(onset)或音高的预测结构，使得对多层预测编码在不同刺激下的研究成为可能。其在大型多样化数据集上的可扩展性进一步表明了开发基于皮层编码原理的通用EEG模型的潜力。
cs.AI / 46 / 2603.03203
No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models
无记忆,无检测:基于输出分布的小型语言模型污染检测
Abstract
CDD, or Contamination Detection via output Distribution, identifies data contamination by measuring the peakedness of a model's sampled outputs. We study the conditions under which this approach succeeds and fails on small language models ranging from 70M to 410M parameters. Using controlled contamination experiments on GSM8K, HumanEval, and MATH, we find that CDD's effectiveness depends critically on whether fine-tuning produces verbatim memorization. With low-rank adaptation, models can learn from contaminated data without memorizing it, and CDD performs at chance level even when the data is verifiably contaminated. Only when fine-tuning capacity is sufficient to induce memorization does CDD recover strong detection accuracy. Our results characterize a memorization threshold that governs detectability and highlight a practical consideration: parameter-efficient fine-tuning can produce contamination that output-distribution methods do not detect. Our code is available at https://github.com/Sela-Omer/Contamination-Detection-Small-LM
Chinese Translation
CDD（通过输出分布进行污染检测）通过测量模型采样输出分布的尖峰程度来识别数据污染。我们研究了这一方法在参数范围从7000万到4.1亿的小型语言模型上成功与失败的条件。通过对GSM8K、HumanEval和MATH进行控制污染实验，我们发现CDD的有效性在很大程度上依赖于微调是否导致逐字记忆。在低秩适应下，模型可以从污染数据中学习而不进行记忆，即使数据被验证为污染，CDD的表现也仅处于随机水平。只有当微调能力足以引发记忆时，CDD才能恢复强检测准确性。我们的结果表征了一个控制可检测性的记忆阈值，并强调了一个实际考虑：参数高效的微调可能产生输出分布方法无法检测的污染。我们的代码可在 https://github.com/Sela-Omer/Contamination-Detection-Small-LM 获取。
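CDD's core intuition, that a memorized training example yields sharply peaked output distributions under repeated sampling while a merely learned example yields diverse completions, can be illustrated with a toy peakedness statistic. The threshold and the exact statistics below are illustrative assumptions, not the paper's calibration:

```python
import math
from collections import Counter

def peakedness(samples):
    """Fraction of mass on the modal output; 1.0 means every sampled
    completion is identical, a hallmark of verbatim memorization."""
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)

def entropy(samples):
    """Shannon entropy (bits) of the empirical output distribution."""
    counts, n = Counter(samples), len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def flag_contaminated(samples, threshold=0.8):
    """Flag a prompt as contaminated when its sampled outputs are sharply
    peaked. Threshold is an illustrative assumption."""
    return peakedness(samples) >= threshold
```

The paper's finding is precisely that LoRA-style fine-tuning can leave this statistic at chance: the model improves on contaminated data without ever producing the peaked, verbatim distributions the detector looks for.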
cs.AI / 47 / 2603.03212
NeuroSkill(tm): Proactive Real-Time Agentic System Capable of Modeling Human State of Mind
NeuroSkill(tm):一种能够建模人类心理状态的主动实时智能系统
Abstract
A real-time proactive agentic system capable of modeling the Human State of Mind, built on a foundation EXG model and a text-embeddings model, running fully offline on the edge. Unlike all previously known systems, the NeuroSkill(tm) system leverages a SKILL.md description of the Human's State of Mind, via the API and CLI provided by the system, directly from Brain-Computer Interface (BCI) devices that record Human biophysical and brain signals. Our custom harness - NeuroLoop(tm) - utilizes all of the above to run an agentic flow that engages with the Human on multiple cognitive and affective levels of their State of Mind (e.g., empathy), by providing actionable tool calls and protocol execution in response to explicit or implicit requests from the Human. GPLv3 open-source software, with ethically aligned AI100 licensing for the skill markdown.
Chinese Translation
一种实时主动智能系统，能够建模人类心理状态，基于基础EXG模型和文本嵌入模型，完全离线运行于边缘设备。与所有已知系统不同，NeuroSkill(tm)系统通过系统提供的API和CLI，利用SKILL.md对人类心理状态的描述，直接从记录人类生物物理和脑信号的脑机接口(BCI)设备获取信息。我们的定制框架NeuroLoop(tm)利用上述所有功能运行智能代理流程，能够在其心理状态的多种认知和情感层面（例如同理心）与人类互动，通过提供可操作的工具调用和协议执行，响应人类的明确或隐含请求。该系统为GPLv3开源软件，技能markdown文件采用伦理对齐的AI100许可。
cs.AI / 48 / 2603.03233
AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework
基于贝叶斯对抗多智能体框架的科学人工智能低代码平台
Abstract
Large Language Models (LLMs) demonstrate potentials for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low-code Platform (LCP). Three LLM-based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator providing comprehensive feedback. The framework employs an adversarial loop where the Task Manager iteratively refines test cases to challenge the Code Generator, while prompt distributions are dynamically updated using Bayesian principles by integrating code quality metrics: functional correctness, structural alignment, and static analysis. This co-optimization of tests and code reduces dependence on LLM reliability and addresses evaluation uncertainty inherent to scientific tasks. LCP also streamlines human-AI collaboration by translating non-expert prompts into domain-specific requirements, bypassing the need for manual prompt engineering by practitioners without coding backgrounds. Benchmark evaluations demonstrate LCP's effectiveness in generating robust code while minimizing error propagation. The proposed platform is also tested on an Earth Science cross-disciplinary task and demonstrates strong reliability, outperforming competing models.
Chinese Translation
大型语言模型(LLMs)在自动化科学代码生成方面展现出潜力,但在可靠性、多智能体工作流中的错误传播以及在成功指标不明确的领域中的评估方面面临挑战。我们提出了一种专门为科学人工智能(AI for Science, AI4S)任务设计的贝叶斯对抗多智能体框架,形式为低代码平台(Low-code Platform, LCP)。在贝叶斯框架下,三个基于LLM的智能体被协调:一个任务管理器(Task Manager)将用户输入结构化为可执行的计划和自适应测试用例,一个代码生成器(Code Generator)生成候选解决方案,以及一个评估器(Evaluator)提供全面反馈。该框架采用对抗循环,任务管理器迭代地优化测试用例以挑战代码生成器,同时通过整合代码质量指标(功能正确性、结构对齐和静态分析)动态更新提示分布,运用贝叶斯原理。这种测试与代码的共同优化减少了对LLM可靠性的依赖,并解决了科学任务固有的评估不确定性。LCP还通过将非专家提示转换为特定领域的需求,简化了人机协作,避免了没有编码背景的从业者手动进行提示工程的需要。基准评估表明,LCP在生成稳健代码的同时,最小化了错误传播。该平台还在一个跨学科的地球科学任务上进行了测试,表现出强大的可靠性,优于竞争模型。
cs.AI / 49 / 2603.03242
Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals
基于密度的响应优化:通过隐式接受信号实现社区导向的对齐
Abstract
Language models deployed in online communities must adapt to norms that vary across social, cultural, and domain-specific contexts. Prior alignment approaches rely on explicit preference supervision or predefined principles, which are effective for well-resourced settings but exclude most online communities -- particularly those without institutional backing, annotation infrastructure, or organized around sensitive topics -- where preference elicitation is costly, ethically fraught, or culturally misaligned. We observe that communities already express preferences implicitly through what content they accept, engage with, and allow to persist. We show that this acceptance behavior induces measurable geometric structure in representation space: accepted responses occupy coherent, high-density regions that reflect community-specific norms, while rejected content falls in sparser or misaligned areas. We operationalize this structure as an implicit preference signal for alignment and introduce density-guided response optimization (DGRO), a method that aligns language models to community norms without requiring explicit preference labels. Using labeled preference data, we demonstrate that local density recovers pairwise community judgments, indicating that geometric structure encodes meaningful preference signal. We then apply DGRO in annotation-scarce settings across diverse communities spanning platform, topic, and language. DGRO-aligned models consistently produce responses preferred by human annotators, domain experts, and model-based judges over supervised and prompt-based baselines. We position DGRO as a practical alignment alternative for communities where explicit preference supervision is unavailable or misaligned with situated practices, and discuss the implications and risks of learning from emergent acceptance behavior.
Chinese Translation
在在线社区中部署的语言模型必须适应社会、文化和特定领域上下文中变化的规范。以往的对齐方法依赖于显式的偏好监督或预定义原则,这在资源充足的环境中有效,但排除了大多数在线社区——特别是那些没有机构支持、注释基础设施或围绕敏感话题组织的社区——在这些社区中,偏好引导的成本高昂、伦理复杂或文化不一致。我们观察到,社区已经通过其接受、参与和允许内容持续存在的方式隐式表达了偏好。我们展示了这种接受行为在表示空间中诱导了可测量的几何结构:被接受的响应占据了反映社区特定规范的连贯、高密度区域,而被拒绝的内容则位于更稀疏或不一致的区域。我们将这种结构操作化为一种隐式偏好信号用于对齐,并引入了基于密度的响应优化(Density-Guided Response Optimization, DGRO)方法,该方法在不需要显式偏好标签的情况下将语言模型与社区规范对齐。通过使用标注的偏好数据,我们证明局部密度能够恢复成对的社区判断,表明几何结构编码了有意义的偏好信号。随后,我们在跨平台、主题和语言的多样社区中应用DGRO于注释稀缺的环境。DGRO对齐的模型始终生成被人类注释者、领域专家和基于模型的评审者偏好的响应,优于监督和基于提示的基线。我们将DGRO定位为在缺乏显式偏好监督或与具体实践不一致的社区中一种实用的对齐替代方案,并讨论从新兴接受行为中学习的影响和风险。
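The density signal at the heart of DGRO, that accepted responses occupy coherent high-density regions of embedding space, can be sketched with a k-nearest-neighbor density estimate over toy embeddings. The distance metric, the inverse-mean-distance estimator, and the selection rule are illustrative assumptions, not DGRO's actual training objective:

```python
import math

def knn_density(point, accepted, k=3):
    """Inverse mean distance to the k nearest accepted responses: higher
    means the candidate lies in a denser region of community-accepted
    content."""
    dists = sorted(math.dist(point, a) for a in accepted)
    return 1.0 / (sum(dists[:k]) / k + 1e-9)

def pick_response(candidates, accepted, k=3):
    """Select the candidate embedding with the highest local density under
    the accepted-content distribution (an implicit preference signal)."""
    return max(candidates, key=lambda c: knn_density(c, accepted, k))
```

In the paper's framing, this density score stands in for an explicit preference label: a candidate near the cluster of content the community has accepted is preferred over one in a sparse or misaligned region.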
cs.AI / 50 / 2603.03252
Valet: A Standardized Testbed of Traditional Imperfect-Information Card Games
Valet:传统不完全信息纸牌游戏的标准化测试平台
Abstract
AI algorithms for imperfect-information games are typically compared using performance metrics on individual games, making it difficult to assess robustness across game choices. Card games are a natural domain for imperfect information due to hidden hands and stochastic draws. To facilitate comparative research on imperfect-information game-playing algorithms and game systems, we introduce Valet, a diverse and comprehensive testbed of 21 traditional imperfect-information card games. These games span multiple genres, cultures, player counts, deck structures, mechanics, winning conditions, and methods of hiding and revealing information. To standardize implementations across systems, we encode the rules of each game in RECYCLE, a card game description language. We empirically characterize each game's branching factor and duration using random simulations, reporting baseline score distributions for a Monte Carlo Tree Search player against random opponents to demonstrate the suitability of Valet as a benchmarking suite.
Chinese Translation
不完全信息游戏的人工智能算法通常通过在单个游戏中的性能指标进行比较,这使得评估算法在不同游戏选择中的鲁棒性变得困难。纸牌游戏由于隐藏的手牌和随机抽牌,成为不完全信息的自然领域。为了促进对不完全信息游戏算法和游戏系统的比较研究,我们引入了Valet,一个涵盖21种传统不完全信息纸牌游戏的多样化和全面的测试平台。这些游戏跨越多个类型、文化、玩家数量、牌组结构、机制、胜利条件以及隐藏和揭示信息的方法。为了在系统之间标准化实现,我们使用RECYCLE(一种纸牌游戏描述语言)对每个游戏的规则进行编码。我们通过随机模拟对每个游戏的分支因子和持续时间进行实证特征化,并报告了针对随机对手的蒙特卡洛树搜索玩家的基线得分分布,以证明Valet作为基准测试套件的适用性。
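The random-simulation characterization described above (per-game branching factor and duration) can be sketched as follows, using a trivial take-1-or-2 card-drawing game as a hypothetical stand-in for Valet's RECYCLE-encoded rules. The state/move interface is an assumption for illustration; the initial state should be immutable or cheaply copyable:

```python
import random
import statistics

def random_playout(state, legal_moves, step, max_len=200, rng=None):
    """Play uniformly random moves to the end of one game, recording the
    branching factor (number of legal moves) at each decision point."""
    rng = rng or random.Random()
    branching = []
    for _ in range(max_len):
        moves = legal_moves(state)
        if not moves:
            break
        branching.append(len(moves))
        state = step(state, rng.choice(moves))
    return branching

def characterize(initial_state, legal_moves, step, n_games=100, seed=0):
    """Estimate mean game duration and mean branching factor from random
    simulations, as in the testbed's empirical characterization."""
    rng = random.Random(seed)
    lengths, factors = [], []
    for _ in range(n_games):
        b = random_playout(initial_state, legal_moves, step, rng=rng)
        lengths.append(len(b))
        factors.extend(b)
    return statistics.mean(lengths), statistics.mean(factors)
```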
cs.AI / 51 / 2603.03258
Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals
继承目标漂移:情境压力可能削弱自主目标
Abstract
The accelerating adoption of language models (LMs) as agents for deployment in long-context tasks motivates a thorough understanding of goal drift: agents' tendency to deviate from an original objective. While prior-generation language model agents have been shown to be susceptible to drift, the extent to which drift affects more recent models remains unclear. In this work, we provide an updated characterization of the extent and causes of goal drift. We investigate drift in state-of-the-art models within a simulated stock-trading environment (Arike et al., 2025). These models are largely shown to be robust even when subjected to adversarial pressure. We show, however, that this robustness is brittle: across multiple settings, the same models often inherit drift when conditioned on prefilled trajectories from weaker agents. The extent of conditioning-induced drift varies significantly by model family, with only GPT-5.1 maintaining consistent resilience among tested models. We find that drift behavior is inconsistent between prompt variations and correlates poorly with instruction hierarchy following behavior, with strong hierarchy following failing to reliably predict resistance to drift. Finally, we run analogous experiments in a new emergency room triage environment to show preliminary evidence for the transferability of our results across qualitatively different settings. Our findings underscore the continued vulnerability of modern LM agents to contextual pressures and the need for refined post-training techniques to mitigate this.
Chinese Translation
语言模型(LM)作为长上下文任务中的代理的加速采用促使我们深入理解目标漂移:代理偏离原始目标的倾向。尽管先前一代语言模型代理已被证明易受漂移影响,但漂移对更近期模型的影响程度仍不清楚。在本研究中,我们提供了目标漂移的程度和原因的更新特征描述。我们在模拟股票交易环境中(Arike et al., 2025)调查了最先进模型的漂移。这些模型在面对对抗性压力时表现出较强的鲁棒性。然而,我们显示这种鲁棒性是脆弱的:在多个设置中,当基于较弱代理的预填充轨迹进行条件处理时,同样的模型往往会继承漂移。条件引起的漂移程度因模型家族而异,只有GPT-5.1在测试模型中保持了一致的抗压能力。我们发现漂移行为在提示变体之间不一致,并且与指令层次跟随行为的相关性较差,强层次跟随未能可靠预测对漂移的抵抗力。最后,我们在新的急诊室分诊环境中进行了类似实验,以展示我们的结果在质上不同的环境中的可转移性初步证据。我们的发现强调了现代语言模型代理在情境压力下的持续脆弱性,以及需要改进后训练技术以减轻这一问题。
cs.CL / 1 / 2603.02213
A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences
一种保留Zipf特性的长程相关替代模型用于书面语言及其他符号序列
Abstract
Symbolic sequences such as written language and genomic DNA display characteristic frequency distributions and long-range correlations extending over many symbols. In language, this takes the form of Zipf's law for word frequencies together with persistent correlations spanning hundreds or thousands of tokens, while in DNA it is reflected in nucleotide composition and long-memory walks under purine-pyrimidine mappings. Existing surrogate models usually preserve either the frequency distribution or the correlation properties, but not both simultaneously. We introduce a surrogate model that retains both constraints: it preserves the empirical symbol frequencies of the original sequence and reproduces its long-range correlation structure, quantified by the detrended fluctuation analysis (DFA) exponent. Our method generates surrogates of symbolic sequences by mapping fractional Gaussian noise (FGN) onto the empirical histogram through a frequency-preserving assignment. The resulting surrogates match the original in first-order statistics and long-range scaling while randomising short-range dependencies. We validate the model on representative texts in English and Latin, and illustrate its broader applicability with genomic DNA, showing that base composition and DFA scaling are reproduced. This approach provides a principled tool for disentangling structural features of symbolic systems and for testing hypotheses on the origin of scaling laws and memory effects across language, DNA, and other symbolic domains.
Chinese Translation
符号序列如书面语言和基因组DNA展示了特征性的频率分布和跨越多个符号的长程相关性。在语言中,这表现为词频的Zipf定律以及跨越数百或数千个标记的持续相关性,而在DNA中则体现在核苷酸组成和在嘌呤-嘧啶映射下的长记忆行走。现有的替代模型通常只保留频率分布或相关性特征,但不能同时保留两者。我们提出了一种保留这两种约束的替代模型:它保留了原始序列的经验符号频率,并重现了其长程相关结构,通过去趋势波动分析(DFA)指数进行量化。我们的方法通过将分数高斯噪声(FGN)映射到经验直方图上,通过保留频率的分配生成符号序列的替代品。生成的替代品在一阶统计特性和长程缩放上与原始序列匹配,同时随机化短程依赖性。我们在英语和拉丁语的代表性文本上验证了该模型,并通过基因组DNA展示了其更广泛的适用性,表明碱基组成和DFA缩放得以重现。这种方法为解开符号系统的结构特征以及测试关于语言、DNA及其他符号领域中缩放法则和记忆效应起源的假设提供了一个原则性工具。
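The frequency-preserving assignment at the core of the surrogate can be sketched as below. As a simplification, an AR(1) Gaussian series stands in for the paper's fractional Gaussian noise driver (an assumption for illustration; the rank-to-symbol assignment logic is the same): each symbol is assigned a contiguous band of noise ranks whose width equals its empirical frequency, so the surrogate reproduces the exact histogram while inheriting the driver's correlation structure.

```python
import random
from collections import Counter

def correlated_noise(n, phi=0.9, rng=None):
    """AR(1) Gaussian series; a simple stand-in for fractional Gaussian
    noise (assumption: any correlated real-valued series illustrates the
    assignment step)."""
    rng = rng or random.Random()
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def frequency_preserving_surrogate(sequence, noise):
    """Map symbols onto noise ranks: each symbol occupies a contiguous
    quantile band whose width equals its empirical count, so the surrogate
    keeps the exact symbol histogram of `sequence`."""
    assert len(sequence) == len(noise)
    counts = Counter(sequence)
    order = sorted(range(len(noise)), key=lambda i: noise[i])  # rank order
    surrogate = [None] * len(noise)
    pos = 0
    for symbol, c in sorted(counts.items()):
        for i in order[pos:pos + c]:
            surrogate[i] = symbol
        pos += c
    return surrogate
```

First-order statistics are preserved by construction; the long-range scaling (the DFA exponent in the paper) is inherited from the driver, which is why the quality of the surrogate hinges on the driver's correlation structure rather than on the assignment step.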
cs.CL / 2 / 2603.02258
Universal Conceptual Structure in Neural Translation: Probing NLLB-200's Multilingual Geometry
神经翻译中的普遍概念结构:探究 NLLB-200 的多语言几何特征
Abstract
Do neural machine translation models learn language-universal conceptual representations, or do they merely cluster languages by surface similarity? We investigate this question by probing the representation geometry of Meta's NLLB-200, a 200-language encoder-decoder Transformer, through six experiments that bridge NLP interpretability with cognitive science theories of multilingual lexical organization. Using the Swadesh core vocabulary list embedded across 135 languages, we find that the model's embedding distances significantly correlate with phylogenetic distances from the Automated Similarity Judgment Program ($\rho = 0.13$, $p = 0.020$), demonstrating that NLLB-200 has implicitly learned the genealogical structure of human languages. We show that frequently colexified concept pairs from the CLICS database exhibit significantly higher embedding similarity than non-colexified pairs ($U = 42656$, $p = 1.33 \times 10^{-11}$, $d = 0.96$), indicating that the model has internalized universal conceptual associations. Per-language mean-centering of embeddings improves the between-concept to within-concept distance ratio by a factor of 1.19, providing geometric evidence for a language-neutral conceptual store analogous to the anterior temporal lobe hub identified in bilingual neuroimaging. Semantic offset vectors between fundamental concept pairs (e.g., man to woman, big to small) show high cross-lingual consistency (mean cosine = 0.84), suggesting that second-order relational structure is preserved across typologically diverse languages. We release InterpretCognates, an open-source interactive toolkit for exploring these phenomena, alongside a fully reproducible analysis pipeline.
Chinese Translation
神经机器翻译模型是否学习了语言普遍的概念表征，还是仅仅通过表面相似性对语言进行聚类？我们通过六个实验探讨了这一问题，这些实验将自然语言处理的可解释性与多语言词汇组织的认知科学理论结合起来，研究了 Meta 的 NLLB-200 模型（一个支持 200 种语言的编码器-解码器 Transformer）的表征几何特征。利用嵌入在 135 种语言中的 Swadesh 核心词汇表，我们发现模型的嵌入距离与来自自动相似性判断程序的系统发育距离显著相关（$\rho = 0.13$, $p = 0.020$），这表明 NLLB-200 隐式地学习了人类语言的谱系结构。我们展示了来自 CLICS 数据库的频繁共词概念对的嵌入相似性显著高于非共词对（$U = 42656$, $p = 1.33 \times 10^{-11}$, $d = 0.96$），这表明模型内化了普遍的概念关联。对每种语言的嵌入进行均值中心化，将概念间距离与概念内距离的比率提高了 1.19 倍，提供了几何证据，支持存在一个与双语神经成像中识别的前颞叶中枢类似的语言中立概念存储。基本概念对之间的语义偏移向量（例如，男人到女人，大到小）显示出高度的跨语言一致性（平均余弦 = 0.84），这表明二阶关系结构在类型学上多样的语言中得以保留。我们发布了 InterpretCognates，这是一个开源的交互式工具包，用于探索这些现象，并提供了一个完全可重复的分析流程。
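The per-language mean-centering probe described above can be sketched with toy vectors. The embeddings below are hypothetical; the point is the mechanism: subtracting each language's mean vector removes language-identity structure, so translations of the same concept collapse together and concept-level geometry dominates:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def mean_center_by_language(embeddings):
    """Subtract each language's mean vector from its concept embeddings.
    `embeddings` maps language -> {concept: vector}."""
    centered = {}
    for lang, concepts in embeddings.items():
        mu = centroid(list(concepts.values()))
        centered[lang] = {c: [x - m for x, m in zip(v, mu)]
                          for c, v in concepts.items()}
    return centered
```

With a large language-specific offset in the raw vectors, within-concept distances across languages shrink after centering while between-concept distances are preserved, which is the between/within ratio improvement the paper reports.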
cs.CL / 3 / 2603.02333
Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects
扩散语言模型中的记忆特征:广义提取与采样效应
Abstract
Autoregressive language models (ARMs) have been shown to memorize and occasionally reproduce training data verbatim, raising concerns about privacy and copyright liability. Diffusion language models (DLMs) have recently emerged as a competitive alternative, yet their memorization behavior remains largely unexplored due to fundamental differences in generation dynamics. To address this gap, we present a systematic theoretical and empirical characterization of memorization in DLMs. We propose a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization: increasing resolution strictly increases the probability of exact training data extraction, implying that autoregressive decoding corresponds to a limiting case of diffusion-based generation by setting the sampling resolution maximal. Extensive experiments across model scales and sampling strategies validate our theoretical predictions. Under aligned prefix-conditioned evaluations, we further demonstrate that DLMs exhibit substantially lower memorization-based leakage of personally identifiable information (PII) compared to ARMs.
Chinese Translation
自回归语言模型(ARMs)已被证明能够记忆并偶尔逐字再现训练数据,这引发了关于隐私和版权责任的担忧。扩散语言模型(DLMs)最近作为一种竞争性替代方案出现,但由于生成动态的根本差异,其记忆行为仍然在很大程度上未被探索。为了解决这一空白,我们提出了对DLMs中记忆的系统理论和实证特征描述。我们提出了一种广义概率提取框架,该框架统一了前缀条件解码和基于扩散的生成,适用于任意掩码模式和随机采样轨迹。定理4.3建立了采样分辨率与记忆之间的单调关系:提高分辨率严格增加了精确提取训练数据的概率,这意味着自回归解码对应于通过将采样分辨率设为最大值而得到的基于扩散的生成的极限情况。我们在不同模型规模和采样策略下进行了广泛的实验,以验证我们的理论预测。在对齐的前缀条件评估下,我们进一步证明,与ARMs相比,DLMs在个人可识别信息(PII)方面表现出显著较低的基于记忆的泄漏。
cs.CL / 4 / 2603.02353
Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs
在写作评估中检测人工智能生成的论文:负责任的使用及其在大型语言模型中的普适性
Abstract
Writing is a foundational literacy skill that underpins effective communication, fosters critical thinking, facilitates learning across disciplines, and enables individuals to organize and articulate complex ideas. Consequently, writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays, raising significant concerns about the authenticity of student-submitted work. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then presents empirical analyses to evaluate how well detectors trained on essays from one LLM generalize to identifying essays produced by other LLMs, based on essays generated in response to public GRE writing prompts. These findings provide guidance for developing and retraining detectors for practical applications.
Chinese Translation
写作是一项基础的素养技能,支撑着有效的沟通,促进批判性思维,推动跨学科的学习,并使个人能够组织和表达复杂的思想。因此,写作评估在评估语言能力、沟通有效性和分析推理方面发挥着至关重要的作用。大型语言模型(LLMs)的快速发展使得生成连贯、高质量的论文变得越来越容易,这引发了对学生提交作品真实性的重大担忧。本章首先概述了当前针对人工智能生成和人工智能辅助论文的检测器的现状,以及负责任使用这些检测器的指导方针。接着,本章基于针对公开GRE写作提示生成的论文,呈现了实证分析,以评估在一种LLM的论文上训练的检测器在识别其他LLM生成的论文时的泛化能力。这些发现为开发和重新训练检测器以用于实际应用提供了指导。
cs.CL / 5 / 2603.02368
RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks
RO-N3WS:通过多样化的罗马尼亚语语音基准提升低资源自动语音识别的泛化能力
Abstract
We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcast speech. This diversity enables robust training and fine-tuning across stylistically distinct domains. We evaluate several state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in both zero-shot and fine-tuned settings, and conduct controlled comparisons using synthetic data generated with expressive TTS models. Our results show that even limited fine-tuning on real speech from RO-N3WS yields substantial WER improvements over zero-shot baselines. We will release all models, scripts, and data splits to support reproducible research in multilingual ASR, domain adaptation, and lightweight deployment.
Chinese Translation
我们介绍了RO-N3WS,这是一个旨在改善自动语音识别(ASR)泛化能力的罗马尼亚语语音基准数据集,特别是在低资源和分布外(OOD)条件下。RO-N3WS包含超过126小时的转录音频,音频来源于广播新闻、文学有声书、电影对话、儿童故事和对话播客语音。这种多样性使得在风格上截然不同的领域中进行稳健的训练和微调成为可能。我们在零样本和微调设置下评估了几种最先进的ASR系统(Whisper,Wav2Vec 2.0),并使用通过表现力强的文本转语音(TTS)模型生成的合成数据进行控制比较。我们的结果表明,即使在RO-N3WS上对真实语音进行有限的微调,也能相比零样本基线显著降低词错误率(WER)。我们将发布所有模型、脚本和数据划分,以支持多语言ASR、领域适应和轻量级部署的可重复研究。
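WER, the metric reported in this abstract, reduces to a word-level edit distance. A minimal self-contained version follows; it is not the paper's evaluation code, which would typically rely on a library such as jiwer.

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over word sequences,
    normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

# One substituted word out of three (Romanian toy example)
print(wer("ana are mere", "ana avea mere"))
```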
cs.CL / 6 / 2603.02464
GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR
GLoRIA:用于方言自动语音识别的门控低秩可解释适应
Abstract
Automatic Speech Recognition (ASR) in dialect-heavy settings remains challenging due to strong regional variation and limited labeled data. We propose GLoRIA, a parameter-efficient adaptation framework that leverages metadata (e.g., coordinates) to modulate low-rank updates in a pre-trained encoder. GLoRIA injects low-rank matrices into each feed-forward layer, with a gating MLP determining the non-negative contribution of each LoRA rank-1 component based on location metadata. On the GCND corpus, GLoRIA outperforms geo-conditioned full fine-tuning, LoRA, and both dialect-specific and unified full fine-tuning, achieving state-of-the-art word error rates while updating under 10% of parameters. GLoRIA also generalizes well to unseen dialects, including in extrapolation scenarios, and enables interpretable adaptation patterns that can be visualized geospatially. These results show metadata-gated low-rank adaptation is an effective, interpretable, and efficient solution for dialectal ASR.
Chinese Translation
在方言丰富的环境中,自动语音识别(ASR)仍然面临挑战,主要由于强烈的区域差异和有限的标注数据。我们提出了GLoRIA,这是一种参数高效的适应框架,利用元数据(例如坐标)来调节预训练编码器中的低秩更新。GLoRIA在每个前馈层中注入低秩矩阵,门控多层感知机(MLP)根据位置元数据确定每个LoRA(低秩适应)秩-1组件的非负贡献。在GCND语料库上,GLoRIA的表现优于地理条件下的完全微调、LoRA以及方言特定和统一的完全微调,达到了最先进的词错误率,同时更新的参数不足10%。GLoRIA在未见过的方言中也具有良好的泛化能力,包括在外推场景中,并且能够实现可在地理空间上可视化的可解释适应模式。这些结果表明,基于元数据门控的低秩适应是方言ASR的有效、可解释且高效的解决方案。
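The core update GLoRIA describes, a frozen weight plus metadata-gated non-negative rank-1 terms, can be sketched as a toy forward pass. The gating-network shape, the softplus non-negativity, and all numeric values below are assumptions for illustration, not the paper's exact design.

```python
import math

def softplus(z):
    """Smooth map to non-negative gate values."""
    return math.log1p(math.exp(z))

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def gated_lora_forward(x, W, components, metadata, gate_weights):
    """y = (W + sum_r g_r(metadata) * u_r v_r^T) x, with g_r >= 0."""
    gates = [softplus(sum(w * m for w, m in zip(row, metadata)))
             for row in gate_weights]
    y = matvec(W, x)
    for g, (u, v) in zip(gates, components):
        s = g * sum(vi * xi for vi, xi in zip(v, x))  # g * (v·x)
        y = [yi + s * ui for yi, ui in zip(y, u)]
    return y, gates

W = [[1.0, 0.0], [0.0, 1.0]]             # frozen feed-forward weight
components = [([1.0, 0.0], [0.0, 1.0])]  # one rank-1 pair (u, v)
metadata = [50.9, 4.4]                   # e.g. latitude/longitude
gate_weights = [[0.01, -0.02]]           # tiny gating layer, one row per rank
y, gates = gated_lora_forward([1.0, 2.0], W, components, metadata, gate_weights)
print([round(v, 3) for v in y], [round(g, 3) for g in gates])
```

Because each gate is a scalar function of location, plotting the gate values over coordinates gives the geospatial interpretability the abstract mentions.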
cs.CL / 7 / 2603.02547
CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think
CoDAR:连续扩散语言模型比你想象的更强大
Abstract
We study why continuous diffusion language models (DLMs) have lagged behind discrete diffusion approaches despite their appealing continuous generative dynamics. Under a controlled token--recovery study, we identify token rounding, the final projection from denoised embeddings to tokens, as a primary bottleneck. Building on these insights, we propose CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder), a two--stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context--conditional discretizer: an autoregressive Transformer decoder that cross--attends to the denoised embedding sequence and performs contextualized rounding to tokens. Experiments on LM1B and OpenWebText demonstrate that CoDAR substantially improves generation quality over latent diffusion and becomes competitive with strong discrete DLMs, while exposing a simple decoder--temperature knob to navigate the fluency--diversity trade off.
Chinese Translation
我们研究了为什么尽管具有吸引人的连续生成动态,连续扩散语言模型(DLMs)仍落后于离散扩散方法。在一个受控的标记恢复研究中,我们确定了标记舍入,即从去噪嵌入到标记的最终投影,是主要瓶颈。基于这些见解,我们提出了CoDAR(带上下文自回归解码器的连续扩散),这是一个两阶段框架,它使扩散在嵌入空间中保持完全连续,同时学习一个强大的、以上下文为条件的离散化器:一个自回归Transformer解码器,它对去噪的嵌入序列进行交叉注意,并执行上下文化的标记舍入。在LM1B和OpenWebText上的实验表明,CoDAR的生成质量显著优于潜在扩散,并可与强大的离散DLMs相媲美,同时提供了一个简单的解码器温度旋钮,用于在流畅性与多样性之间进行权衡。
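The "token rounding" bottleneck the paper identifies is, in its naive context-free form, a nearest-neighbour lookup in the embedding table; CoDAR replaces this step with a context-conditional autoregressive decoder. A toy version of the naive step (vocabulary and vectors are made up):

```python
def nearest_token(denoised, embedding_table):
    """Context-free rounding: map a denoised embedding to the closest
    vocabulary embedding by squared Euclidean distance. This is the
    simple baseline CoDAR's contextual decoder improves on."""
    best, best_d = None, float("inf")
    for tok, emb in embedding_table.items():
        d = sum((a - b) ** 2 for a, b in zip(denoised, emb))
        if d < best_d:
            best, best_d = tok, d
    return best

table = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "car": [1.0, 1.0]}
print(nearest_token([0.9, 0.2], table))  # → "cat"
```

The failure mode motivating CoDAR: when a denoised vector sits between two plausible tokens, this lookup has no access to neighbouring positions, whereas a cross-attending decoder can disambiguate from context.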
cs.CL / 8 / 2603.02578
How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities
大型语言模型的可控性如何?跨行为粒度的统一评估
Abstract
Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.
Chinese Translation
大型语言模型(LLMs)越来越多地应用于社会敏感领域,但其不可预测的行为,从意图偏离到个性不一致,带来了显著风险。我们引入了SteerEval,一个用于评估LLM可控性的分层基准,涵盖三个领域:语言特征、情感和个性。每个领域分为三个规范层次:L1(表达什么)、L2(如何表达)和L3(如何实例化),将高层次的行为意图与具体的文本输出连接起来。通过使用SteerEval,我们系统性地评估了当前的引导方法,揭示了控制能力在更细粒度层面上往往会退化。我们的基准为安全、可控的LLM行为提供了一个有原则且可解释的框架,为未来的研究奠定基础。
cs.CL / 9 / 2603.02588
ExpGuard: LLM Content Moderation in Specialized Domains
ExpGuard:专门领域的大型语言模型内容审查
Abstract
With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses, from these specific sectors. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.
Chinese Translation
随着大型语言模型(LLMs)在现实应用中的广泛部署,建立强有力的安全防护措施以审查其输入和输出已成为确保遵循安全政策的必要条件。目前的防护模型主要针对一般的人类-LLM交互,导致LLMs在特定领域的上下文中容易受到有害和对抗性内容的影响,尤其是在技术术语和专业概念丰富的领域。为了解决这一局限性,我们提出了ExpGuard,一个强大且专门的防护模型,旨在保护金融、医疗和法律领域免受有害提示和响应的影响。此外,我们还推出了ExpGuardMix,这是一个精心策划的数据集,包含来自这些特定行业的58,928个带标签的提示及其对应的拒绝和合规响应。该数据集分为两个子集:ExpGuardTrain,用于模型训练,以及ExpGuardTest,这是一个由领域专家注释的高质量测试集,用于评估模型在技术和领域特定内容上的鲁棒性。在ExpGuardTest和八个已建立的公共基准上进行的全面评估表明,ExpGuard在各方面表现出色,同时在抵御领域特定对抗攻击方面展现出卓越的韧性,在提示分类上最高超出WildGuard等最先进模型8.9%,在响应分类上最高超出15.3%。为了鼓励进一步的研究和开发,我们开源了我们的代码、数据和模型,以便适应更多领域,并支持创建越来越强大的防护模型。
cs.CL / 10 / 2603.02597
GPUTOK: GPU Accelerated Byte Level BPE Tokenization
GPUTOK:基于GPU加速的字节级BPE分词
Abstract
As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that follows GPT-2's merge rules. It includes a basic BlockBPE-style kernel and a faster, optimized version that uses cuCollections static map, CUB reductions, and a pybind11 interface for Python. On WikiText103 sequences up to 131k tokens, the optimized GPU tokenizer produces the same tokens as a CPU version and, for the longest inputs, is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer. Nsight profiling shows that 70-80% of CUDA API time goes to memory allocation, so adding memory pooling should give the biggest speed boost next. Tests on generation tasks using WikiText103 prompts show that our GPU tokenizer's outputs stay within about one percentage point of tiktoken and HuggingFace GPT-2 on similarity and overlap metrics, meaning it keeps output quality while making long-context inference more practical.
Chinese Translation
随着大型语言模型向百万标记上下文窗口发展,CPU分词器成为主要的性能瓶颈,因为它们逐步处理文本,而强大的GPU却未被充分利用。我们构建了一个基于GPU的字节级BPE分词器,遵循GPT-2的合并规则。该分词器包括一个基本的BlockBPE风格内核和一个更快的优化版本,后者使用了cuCollections静态映射、CUB归约和用于Python的pybind11接口。在处理长度达到131k标记的WikiText103序列时,优化后的GPU分词器生成的标记与CPU版本相同,对于最长的输入,其速度约为tiktoken的1.7倍,约为HuggingFace GPT-2分词器的7.6倍。Nsight分析显示,70-80%的CUDA API时间用于内存分配,因此增加内存池应该能带来最大的速度提升。在使用WikiText103提示进行生成任务的测试中,我们的GPU分词器的输出在相似性和重叠指标上与tiktoken和HuggingFace GPT-2保持在约一个百分点的范围内,这意味着它在提高长上下文推理的实用性的同时保持了输出质量。
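The merge loop at the heart of byte-level BPE, the CPU reference behaviour the GPU kernels must reproduce exactly, fits in a few lines. The merge table below is a toy, not GPT-2's real one.

```python
def bpe_encode(tokens, merges):
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest
    merge rank (GPT-2 style) until no mergeable pair remains."""
    tokens = list(tokens)
    while True:
        pairs = [(merges.get((tokens[i], tokens[i + 1]), float("inf")), i)
                 for i in range(len(tokens) - 1)]
        if not pairs:
            break
        rank, i = min(pairs)          # lowest rank wins; ties by position
        if rank == float("inf"):
            break                     # nothing left to merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_encode("lower", merges))  # → ['low', 'er']
```

The inner `min` over all adjacent pairs is the sequential dependency that makes CPU tokenization slow and that BlockBPE-style kernels parallelize on the GPU.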
cs.CL / 11 / 2603.02615
Think, But Don't Overthink: Reproducing Recursive Language Models
思考,但不要过度思考:递归语言模型的再现
Abstract
This project reproduces and extends the recently proposed ``Recursive Language Models'' (RLMs) framework by Zhang et al. (2026). This framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated pure LLM, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: Deeper recursion causes models to ``overthink''. While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance and exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs. Code and data are available at: https://github.com/drbillwang/rlm-reproduction
Chinese Translation
本项目再现并扩展了Zhang等人(2026)最近提出的"递归语言模型"(Recursive Language Models, RLMs)框架。该框架使大型语言模型(Large Language Models, LLMs)能够通过将提示卸载到外部REPL环境中来处理近乎无限的上下文。尽管原始论文依赖于默认的递归深度1,并建议将更深的递归作为未来的研究方向,但本研究特别探讨了扩展递归深度的影响。使用最先进的开源代理模型(DeepSeek v3.2和Kimi K2),我在S-NIAH和OOLONG基准上评估了纯LLM、RLM(深度=1)和RLM(深度=2)。研究结果揭示了一个引人注目的现象:更深的递归导致模型"过度思考"。虽然深度为1的RLM在复杂推理任务中有效提升了准确性,但应用更深的递归(深度=2)或在简单检索任务中使用RLM却反常地降低了性能,并使执行时间(例如,从3.6秒增加到344.5秒)和令牌成本呈指数级膨胀。代码和数据可在以下网址获取:https://github.com/drbillwang/rlm-reproduction
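A toy needle-in-a-haystack version of the recursive scheme makes the depth parameter concrete. The `find` function stands in for an LLM call; the chunking and recursion structure are an illustration of the idea, not the repository's implementation.

```python
def find(query, context):
    """Stand-in for an LLM call: look the key up in (key, value) pairs."""
    hits = [v for k, v in context if k == query]
    return hits[0] if hits else None

def toy_rlm(query, context, depth, chunk=4):
    """Split the context into chunks, recurse on each, then answer from
    the partial results; depth=0 (or a small context) is a direct call."""
    if depth == 0 or len(context) <= chunk:
        return find(query, context)
    parts = [context[i:i + chunk] for i in range(0, len(context), chunk)]
    partials = [(query, toy_rlm(query, p, depth - 1, chunk)) for p in parts]
    return find(query, [(q, a) for q, a in partials if a is not None])

haystack = [(f"k{i}", i) for i in range(10)]
print(toy_rlm("k7", haystack, depth=1), toy_rlm("k7", haystack, depth=2))
```

Note that depth 2 returns the same answer as depth 1 while issuing strictly more sub-calls, a small-scale analogue of the "overthinking" cost blow-up the study reports.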
cs.CL / 12 / 2603.02631
Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models
跨家族推测性预填充:无训练的长上下文压缩与小型草稿模型
Abstract
Prompt length is a major bottleneck in agentic large language model (LLM) workloads, where repeated inference steps and multi-call loops incur substantial prefill cost. Recent work on speculative prefill demonstrates that attention-based token importance estimation can enable training-free prompt compression, but this assumes the existence of a draft model that shares the same tokenizer as the target model. In practice, however, agentic pipelines frequently employ models without any smaller in-family draft model. In this work, we study cross-family speculative prefill, where a lightweight draft model from one model family is used to perform prompt compression for a target model from a different family. Using the same speculative prefill mechanism as prior work, we evaluate a range of cross-family draft-target combinations, including Qwen, LLaMA, and DeepSeek models. Across a broad diversity of tasks, we find that attention-based token importance estimation transfers reliably across different model families despite differences in model architectures and tokenizers between draft and target models. Cross-model prompt compression largely retains 90~100% of full-prompt baseline performance and, in some cases, slightly improves accuracy due to denoising effects, while delivering substantial reductions in time to first token (TTFT). These results suggest that speculative prefill depends mainly on task priors and semantic structure, thus serving as a generalizable prompt compression primitive. We discuss the implications of our findings for agentic systems, where repeated long-context inference and heterogeneous model stacks make cross-model prompt compression both necessary and practical.
Chinese Translation
提示长度是代理大型语言模型(LLM)工作负载中的一个主要瓶颈,其中重复推理步骤和多次调用循环会产生可观的预填充成本。近期关于推测性预填充的研究表明,基于注意力的令牌重要性估计可以实现无训练的提示压缩,但这假设存在一个与目标模型共享相同分词器的草稿模型。然而,在实践中,代理管道经常使用没有任何更小的同家族草稿模型的模型。在本研究中,我们研究了跨家族推测性预填充,其中来自一个模型家族的轻量级草稿模型用于对来自不同家族的目标模型进行提示压缩。我们使用与先前工作相同的推测性预填充机制,评估了一系列跨家族草稿-目标组合,包括 Qwen、LLaMA 和 DeepSeek 模型。在广泛多样的任务中,我们发现基于注意力的令牌重要性估计在不同模型家族之间可靠地转移,尽管草稿模型和目标模型之间存在架构和分词器的差异。跨模型提示压缩在很大程度上保留了 90~100% 的完整提示基线性能,并且在某些情况下,由于去噪效果略微提高了准确性,同时显著减少了首次令牌时间(TTFT)。这些结果表明,推测性预填充主要依赖于任务先验和语义结构,因此作为一种可推广的提示压缩原语。我们讨论了这些发现对代理系统的影响,在这些系统中,重复的长上下文推理和异构模型堆栈使得跨模型提示压缩既必要又实用。
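The importance-based compression step can be sketched as a top-k filter that preserves token order. In the actual pipeline the scores come from a draft model's attention over the prompt; here they are made-up numbers.

```python
def compress_prompt(tokens, attention_scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order.
    This mirrors the importance-estimation step of speculative prefill."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(range(len(tokens)),
                  key=lambda i: attention_scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(keep)]

tokens = ["the", "launch", "code", "is", "blue", "falcon", "today"]
scores = [0.01, 0.3, 0.35, 0.02, 0.4, 0.38, 0.05]
print(compress_prompt(tokens, scores, keep_ratio=0.5))
```

The cross-family finding is that such scores transfer even when the draft and target models use different tokenizers, because the ranking reflects task-level semantic salience rather than model-specific token identities.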
cs.CL / 13 / 2603.02655
Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches
基于多模态大语言模型的实时游戏视频解说生成:关注暂停的解码方法
Abstract
Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
Chinese Translation
实时视频解说生成为视频中正在发生的事件提供文本描述,支持体育、电子竞技和直播等领域的无障碍访问和参与。解说生成涉及两个基本决策:说什么以及何时说。尽管最近基于提示的方法使用多模态大语言模型(MLLMs)在内容生成方面表现出色,但它们在很大程度上忽视了时机因素。我们研究了仅通过上下文提示是否能够支持语义相关且时机恰当的实时解说生成。我们提出了两种基于提示的解码策略:1)固定间隔方法;2)一种新颖的动态间隔解码方法,该方法根据前一个话语的估计持续时间调整下一次预测的时机。这两种方法都能够在不进行任何微调的情况下实现关注暂停的生成。在日语和英语的赛车及格斗游戏数据集上的实验表明,动态间隔解码仅通过提示即可生成在时机和内容上更接近人类话语的解说。我们发布了一个多语言基准数据集、训练好的模型和实现,以支持未来的实时视频解说生成研究。
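The dynamic-interval idea reduces to scheduling the next generation step after the estimated speaking time of the previous utterance. The character-based duration estimate, the speaking-rate constant, and the floor below are illustrative assumptions, not the paper's calibrated values.

```python
def next_decode_time(now, utterance, chars_per_second=12.0, floor=1.0):
    """Dynamic-interval scheduling: wait out the estimated speaking
    time of the previous utterance before generating the next one."""
    duration = max(floor, len(utterance) / chars_per_second)
    return now + duration

t = 0.0
for line in ["Red car takes the lead!", "Tight corner ahead."]:
    t = next_decode_time(t, line)
print(round(t, 2))
```

A fixed-interval baseline would instead advance `t` by a constant, risking either overlapping utterances (interval too short) or dead air (interval too long).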
cs.CL / 14 / 2603.02663
Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory
评估跨模态推理能力与问题特征的多模态项目反应理论
Abstract
Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, we can find the correct answer without either the image or the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.
Chinese Translation
多模态大型语言模型(MLLMs)最近作为一种能够在多种模态上进行推理的通用架构而出现。针对MLLMs的基准测试应当衡量其跨模态整合的能力。然而,当前的基准测试充斥着捷径问题,这些问题仅依赖于单一模态即可解决,从而导致不可靠的排名。例如,在视觉-语言案例中,我们可以在没有图像或文本的情况下找到正确答案。这些低质量问题不必要地增加了基准测试的规模和计算要求。我们提出了一种多模态和多维的项目反应理论框架(M3IRT),该框架通过将模型能力和项目难度分解为仅图像、仅文本和跨模态成分,扩展了经典的IRT。M3IRT估计MLLMs的跨模态能力以及每个问题的跨模态难度,从而实现更紧凑、高质量的子集,更好地反映多模态推理。在三个基准测试的24个视觉语言模型(VLMs)中,M3IRT优先考虑真正的跨模态问题而非捷径,并在50%的项目为人工生成的低质量问题时仍保持排名保真度,从而在降低评估成本的同时提高可靠性。因此,M3IRT为评估跨模态推理和完善多模态基准提供了一种实用工具。
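One way to read the ability/difficulty decomposition is as an additive multidimensional IRT model with a logistic link. The exact parameterization below (per-dimension item weights, symbol names, numbers) is an illustrative assumption, not the paper's formulation.

```python
import math

def m3irt_prob(theta, b, weights):
    """P(correct) under an additive multidimensional IRT model, with
    ability theta and item difficulty b split into image-only,
    text-only and cross-modal components."""
    z = sum(w * (theta[k] - b[k]) for k, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))

theta = {"image": 1.0, "text": 0.5, "cross": 1.5}   # model abilities
b     = {"image": 0.0, "text": 0.5, "cross": 2.0}   # item difficulties
# A shortcut item loads only on the text dimension; a genuine
# cross-modal item loads mostly on the cross-modal dimension.
shortcut = {"image": 0.0, "text": 1.0, "cross": 0.0}
crossmod = {"image": 0.2, "text": 0.2, "cross": 1.0}
print(round(m3irt_prob(theta, b, shortcut), 3),
      round(m3irt_prob(theta, b, crossmod), 3))
```

Items whose weight mass sits on the single-modality dimensions are exactly the "shortcut" questions M3IRT down-weights when building compact subsets.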
cs.CL / 15 / 2603.02676
ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs
ITLC在SemEval-2026任务11中的应用:用于形式推理的规范化和确定性解析
Abstract
Large language models suffer from content effects in reasoning tasks, particularly in multi-lingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.
Chinese Translation
大型语言模型在推理任务中受到内容效应的影响,尤其是在多语言环境中。我们提出了一种新颖的方法,通过显式的结构抽象来减少这些偏差,该方法将三段论转化为规范的逻辑表示,并应用确定性解析来判断有效性。在SemEval-2026任务11的多语言基准测试中,我们的方法在所有子任务中均取得了前五名的排名,同时显著减少了内容效应,并提供了一个与复杂的微调或激活级别干预相竞争的替代方案。
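The "canonical representation plus deterministic parsing" step can be made concrete with a tiny model checker: normalize each statement to a (quantifier, subject, predicate) triple, then enumerate set assignments over a small universe. Three elements suffice to expose counterexamples for three-term syllogisms under the modern (no existential import) reading; this is a sketch of the idea, not the system submitted to the task.

```python
from itertools import combinations, product

def holds(stmt, sets):
    q, s, p = stmt                      # (quantifier, subject, predicate)
    S, P = sets[s], sets[p]
    if q == "all":
        return S <= P
    if q == "no":
        return S.isdisjoint(P)
    if q == "some":
        return bool(S & P)
    return bool(S - P)                  # "some_not"

def valid(premises, conclusion, n=3):
    """Deterministic validity check: the argument is valid iff no
    assignment of terms to subsets of an n-element universe satisfies
    the premises while falsifying the conclusion."""
    terms = sorted({t for _, s, p in premises + [conclusion] for t in (s, p)})
    subsets = [frozenset(c) for r in range(n + 1)
               for c in combinations(range(n), r)]
    for assign in product(subsets, repeat=len(terms)):
        sets = dict(zip(terms, assign))
        if all(holds(pr, sets) for pr in premises) and not holds(conclusion, sets):
            return False
    return True

barbara = valid([("all", "M", "P"), ("all", "S", "M")], ("all", "S", "P"))
fallacy = valid([("all", "P", "M"), ("all", "S", "M")], ("all", "S", "P"))
print(barbara, fallacy)  # → True False
```

Because validity is decided purely on the abstracted triples, the surface content of the terms cannot sway the verdict, which is precisely how structural abstraction removes content effects.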
cs.CL / 16 / 2603.02684
HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse
HateMirage:一个可解释的多维数据集,用于解码虚假仇恨和微妙的在线虐待
Abstract
Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives. Existing hate speech datasets primarily capture overt toxicity, underrepresenting the nuanced ways misinformation can incite or normalize hate. To address this gap, we present HateMirage, a novel dataset of Faux Hate comments designed to advance reasoning and explainability research on hate emerging from fake or distorted narratives. The dataset was constructed by identifying widely debunked misinformation claims from fact-checking sources and tracing related YouTube discussions, resulting in 4,530 user comments. Each comment is annotated along three interpretable dimensions: Target (who is affected), Intent (the underlying motivation or goal behind the comment), and Implication (its potential social impact). Unlike prior explainability datasets such as HateXplain and HARE, which offer token-level or single-dimensional reasoning, HateMirage introduces a multi-dimensional explanation framework that captures the interplay between misinformation, harm, and social consequence. We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence. Results suggest that explanation quality may depend more on pretraining diversity and reasoning-oriented data rather than on model scale alone. By coupling misinformation reasoning with harm attribution, HateMirage establishes a new benchmark for interpretable hate detection and responsible AI research.
Chinese Translation
微妙和间接的仇恨言论在在线安全研究中仍然是一个未被充分探索的挑战,尤其是当有害意图嵌入在误导性或操控性的叙述中时。现有的仇恨言论数据集主要捕捉明显的毒性,未能充分体现错误信息以细微方式引发或正常化仇恨的途径。为填补这一空白,我们提出了HateMirage,一个新颖的虚假仇恨评论数据集,旨在推动关于源自虚假或扭曲叙述的仇恨的推理和可解释性研究。该数据集通过识别来自事实核查来源的、已被广泛驳斥的错误信息主张,并追踪相关的YouTube讨论而构建,最终形成了4,530条用户评论。每条评论都沿着三个可解释的维度进行标注:目标(受影响者)、意图(评论背后的基本动机或目标)和影响(其潜在的社会影响)。不同于HateXplain和HARE等仅提供标记级或单维度推理的先前可解释性数据集,HateMirage引入了一个多维解释框架,捕捉错误信息、伤害和社会后果之间的相互作用。我们在HateMirage上基准测试了多个开源语言模型,使用ROUGE-L F1和Sentence-BERT相似度来评估解释的一致性。结果表明,解释质量可能更多地依赖于预训练的多样性和以推理为导向的数据,而不仅仅是模型规模。通过将错误信息推理与伤害归因相结合,HateMirage为可解释的仇恨检测和负责任的人工智能研究建立了一个新的基准。
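ROUGE-L F1, one of the two explanation-quality metrics used above, is the harmonic mean of precision and recall computed from the longest common subsequence of words. A minimal version (the benchmark would normally use an established implementation):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j],
                                                               dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over word sequences."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)

print(round(rouge_l_f1("the comment targets immigrants",
                       "the comment attacks immigrants"), 3))  # → 0.75
```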
cs.CL / 17 / 2603.02701
Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization
Graph-GRPO:通过组相对策略优化稳定多智能体拓扑学习
Abstract
Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.
Chinese Translation
优化通信拓扑对于基于大型语言模型(LLM)的多智能体系统(MAS)的效率和有效性至关重要。尽管最近的方法利用强化学习动态构建任务特定的图,但它们通常依赖于带有绝对奖励(例如,二元正确性)的单样本策略梯度。这种范式存在严重的梯度方差和信用分配问题:简单查询对次优结构产生非信息性的正奖励,而困难查询往往导致失败,无法提供学习信号。为了解决这些挑战,我们提出了Graph-GRPO,这是一种新颖的拓扑优化框架,集成了组相对策略优化。Graph-GRPO并不是孤立地评估单一拓扑,而是为每个查询采样一组多样的通信图,并根据它们在组内的相对表现计算特定边的优势。通过对采样组中的奖励进行归一化,我们的方法有效减轻了由于任务难度方差带来的噪声,并实现了细粒度的信用分配。在推理和代码生成基准上的大量实验表明,Graph-GRPO显著优于最先进的基线,达到了更好的训练稳定性,并识别出之前被奖励噪声掩盖的关键通信路径。
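The group-relative normalization at the core of GRPO-style training is a one-liner: each sampled topology's reward is standardized against the mean and standard deviation of its own group, so an easy query (all rewards 1) yields zero advantage everywhere instead of a non-informative positive signal. The reward values below are made up.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each sampled graph's reward
    against the mean and std of its own group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Binary-correctness rewards for one group of sampled communication graphs
adv = group_relative_advantages([1.0, 1.0, 0.0, 0.0])
print([round(a, 3) for a in adv])

# Degenerate easy query: every topology succeeds, so no edge gets credit
easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print([round(a, 3) for a in easy])
```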
cs.CL / 18 / 2603.02709
Sensory-Aware Sequential Recommendation via Review-Distilled Representations
基于感知的序列推荐通过评论提炼的表示
Abstract
We propose a novel framework for sensory-aware sequential recommendation that enriches item representations with linguistically extracted sensory attributes from product reviews. Our approach, \textsc{ASEGR} (Attribute-based Sensory Enhanced Generative Recommendation), introduces a two-stage pipeline in which a large language model is first fine-tuned as a teacher to extract structured sensory attribute--value pairs, such as \textit{color: matte black} and \textit{scent: vanilla}, from unstructured review text. The extracted structures are then distilled into a compact student transformer that produces fixed-dimensional sensory embeddings for each item. These embeddings encode experiential semantics in a reusable form and are incorporated into standard sequential recommender architectures as additional item-level representations. We evaluate our method on four Amazon domains and integrate the learned sensory embeddings into representative sequential recommendation models, including SASRec, BERT4Rec, and BSARec. Across domains, sensory-enhanced models consistently outperform their identifier-based counterparts, indicating that linguistically grounded sensory representations provide complementary signals to behavioral interaction patterns. Qualitative analysis further shows that the extracted attributes align closely with human perceptions of products, enabling interpretable connections between natural language descriptions and recommendation behavior. Overall, this work demonstrates that sensory attribute distillation offers a principled and scalable way to bridge information extraction and sequential recommendation through structured semantic representation learning.
Chinese Translation
我们提出了一种新颖的基于感知的序列推荐框架,该框架利用从产品评论中以语言学方式提取的感知属性来丰富项目表示。我们的方法ASEGR(基于属性的感知增强生成推荐)引入了一个两阶段流程:首先对大型语言模型进行微调,使其作为教师从非结构化的评论文本中提取结构化的感知属性-值对,例如"颜色:哑光黑"和"气味:香草"。提取的结构随后被蒸馏成一个紧凑的学生Transformer,为每个项目生成固定维度的感知嵌入。这些嵌入以可重用的形式编码体验语义,并作为额外的项目级表示被纳入标准的序列推荐架构中。我们在四个亚马逊领域评估了我们的方法,并将学习到的感知嵌入集成到代表性的序列推荐模型中,包括SASRec、BERT4Rec和BSARec。在各个领域,感知增强的模型始终优于基于标识符的对应模型,这表明以语言为基础的感知表示为行为交互模式提供了互补信号。定性分析进一步表明,提取的属性与人类对产品的感知紧密对齐,使自然语言描述与推荐行为之间能够建立可解释的联系。总体而言,这项工作表明,感知属性蒸馏提供了一种有原则且可扩展的方法,通过结构化语义表示学习来桥接信息提取与序列推荐。
cs.CL / 19 / 2603.02760
Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration
通过序列再生实现扩散语言模型的高效自我评估
Abstract
Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs. DiSE quantifies confidence by computing the probability of regenerating the tokens in the entire generated sequence, given the full context. This method enables more efficient and reliable quality assessment by leveraging token regeneration probabilities, facilitating both likelihood estimation and robust uncertainty quantification. Building upon DiSE, we further introduce a flexible-length generation framework, which adaptively controls the sequence length based on the model's self-assessment of its own output. We analyze and validate the feasibility of DiSE from the perspective of dLLM generalization, and empirically demonstrate that DiSE is positively correlated with both semantic coherence and answer accuracy. Extensive experiments on likelihood evaluation, uncertainty quantification, and flexible-length generation further confirm the effectiveness of the proposed DiSE.
Chinese Translation
扩散大型语言模型(dLLMs)因其增强多样性、可控性和并行性的能力而受到广泛关注。然而,其非顺序、双向掩蔽的生成方式使得质量评估变得困难,突显了有效自我评估的必要性。在本研究中,我们提出了DiSE,这是一种简单而有效的dLLMs自我评估信心量化方法。DiSE通过计算在给定完整上下文的情况下再生整个生成序列中标记的概率来量化信心。该方法通过利用标记再生概率,使得质量评估更加高效和可靠,促进了似然估计和稳健的不确定性量化。在DiSE的基础上,我们进一步引入了一种灵活长度生成框架,该框架根据模型对自身输出的自我评估自适应地控制序列长度。我们从dLLM泛化的角度分析并验证了DiSE的可行性,并实证表明DiSE与语义连贯性和答案准确性呈正相关。对似然评估、不确定性量化和灵活长度生成的广泛实验进一步确认了所提DiSE的有效性。
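DiSE's confidence score reduces to an aggregation of per-token regeneration probabilities given the full finished sequence as context. The geometric-mean aggregation below and the log-probability values are illustrative assumptions; in the real method those probabilities come from re-running the dLLM with each position masked in turn or in parallel.

```python
import math

def dise_confidence(regen_logprobs):
    """Sequence-level self-evaluation score: geometric mean of the
    per-token probabilities of regenerating each token of the finished
    sequence given the full context."""
    return math.exp(sum(regen_logprobs) / len(regen_logprobs))

# Hypothetical per-token regeneration log-probs for two generations
coherent   = [-0.1, -0.2, -0.1, -0.15]
incoherent = [-1.5, -2.0, -0.3, -1.8]
print(dise_confidence(coherent) > dise_confidence(incoherent))
```

A flexible-length generator can use such a score directly: extend or truncate the sequence, re-score, and keep the length whose regeneration confidence is highest.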
cs.CL / 20 / 2603.02775
From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench
从解题者到导师:使用 KMP-Bench 评估大型语言模型的教学智能
Abstract
Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.
Chinese Translation
大型语言模型(LLMs)在人工智能数学辅导中展现出显著的潜力,但当前的评估往往依赖于简单的指标或狭窄的教学场景,未能全面评估多轮教学的有效性。本文介绍了 KMP-Bench,一个综合性的 K-8 数学教学基准,旨在从两个互补的角度评估 LLMs。第一个模块 KMP-Dialogue 依据六个核心原则(例如,挑战、解释、反馈)评估整体教学能力,利用一个通过整合多样化教学组件构建的新型多轮对话数据集。第二个模块 KMP-Skills 提供对基础辅导能力的细致评估,包括多轮问题解决、错误检测与纠正以及问题生成。我们在 KMP-Bench 上的评估揭示了一个关键差异:尽管领先的 LLMs 在可验证解决方案的任务中表现出色,但在教学原则的细致应用上却面临挑战。此外,我们还提出了 KMP-Pile,一个大规模(150K)的对话数据集。经过 KMP-Pile 微调的模型在 KMP-Bench 上显示出显著改善,强调了富有教学价值的训练数据在开发更有效的人工智能数学辅导员中的重要性。
cs.CL / 21 / 2603.02789
OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets
OCR还是不OCR?在MLLMs时代利用大规模真实数据集重新思考文档信息提取
Abstract
Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline--while simpler--can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine and explore failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve comparable performance to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schema, exemplars, and instructions can further enhance MLLMs performance. We hope this work can offer practical guidance and valuable insight for advancing document information extraction.
Chinese Translation
多模态大型语言模型(MLLMs)增强了自然语言处理的潜力。然而,它们对文档信息提取的实际影响仍不明确。特别是,仅使用MLLM的管道——虽然更简单——是否真的能够达到传统OCR+MLLM设置的性能尚不清楚。在本文中,我们进行了一项大规模基准研究,评估了多种现成的MLLMs在商业文档信息提取中的表现。为了检查和探索失败模式,我们提出了一个自动化的层次错误分析框架,该框架利用大型语言模型(LLMs)系统性地诊断错误模式。我们的研究结果表明,对于强大的MLLMs,OCR可能并不是必需的,因为仅图像输入可以达到与OCR增强方法相当的性能。此外,我们展示了精心设计的模式、示例和指令可以进一步提升MLLMs的性能。我们希望这项工作能够为推进文档信息提取提供实用指导和宝贵见解。
cs.CL / 22 / 2603.02830
Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs
更快、更便宜、更准确:专业知识追踪模型优于大型语言模型
Abstract
Predicting future student responses to questions is particularly valuable for educational learning platforms where it enables effective interventions. One of the key approaches to do this has been through the use of knowledge tracing (KT) models. These are small, domain-specific, temporal models trained on student question-response data. KT models are optimised for high accuracy on specific educational domains and have fast inference and scalable deployments. The rise of Large Language Models (LLMs) motivates us to ask the following questions: (1) How well can LLMs perform at predicting students' future responses to questions? (2) Are LLMs scalable for this domain? (3) How do LLMs compare to KT models on this domain-specific task? In this paper, we compare multiple LLMs and KT models across predictive performance, deployment cost, and inference speed to answer the above questions. We show that KT models outperform LLMs with respect to accuracy and F1 scores on this domain-specific task. Further, we demonstrate that LLMs are orders of magnitude slower than KT models and cost orders of magnitude more to deploy. This highlights the importance of domain-specific models for education prediction tasks and the fact that current closed source LLMs should not be used as a universal solution for all tasks.
Chinese Translation
预测学生未来对问题的回答对于教育学习平台尤为重要,因为这可以实现有效的干预。实现这一目标的关键方法之一是使用知识追踪(Knowledge Tracing, KT)模型。这些模型是针对特定领域的小型时间序列模型,基于学生的问答数据进行训练。KT模型在特定教育领域的高准确性上进行了优化,并具备快速推理和可扩展部署的特点。大型语言模型(Large Language Models, LLMs)的兴起促使我们提出以下问题:(1)LLMs在预测学生未来对问题的回答方面表现如何?(2)LLMs在该领域的可扩展性如何?(3)LLMs与KT模型在这一特定任务上的比较如何?在本文中,我们比较了多种LLMs和KT模型在预测性能、部署成本和推理速度等方面,以回答上述问题。我们展示了KT模型在这一特定任务上在准确性和F1分数方面优于LLMs。此外,我们证明了LLMs的推理速度比KT模型慢几个数量级,并且部署成本也高出几个数量级。这突显了领域特定模型在教育预测任务中的重要性,以及当前闭源LLMs不应作为所有任务的通用解决方案的事实。
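The abstract describes KT models only as small, domain-specific temporal models trained on question-response data. One classic formulation is Bayesian Knowledge Tracing (BKT); the sketch below is a minimal BKT update, not necessarily the specific KT models benchmarked in the paper, and the parameter values are illustrative.

```python
# Bayesian Knowledge Tracing (BKT): one classic KT formulation.
# Parameter names (p_slip, p_guess, p_learn) are standard BKT terms,
# not values taken from the paper above.

def bkt_update(p_mastery, correct, p_slip=0.1, p_guess=0.2, p_learn=0.3):
    """Posterior over skill mastery after observing one response,
    followed by the learning transition."""
    if correct:
        num = p_mastery * (1 - p_slip)
        den = num + (1 - p_mastery) * p_guess
    else:
        num = p_mastery * p_slip
        den = num + (1 - p_mastery) * (1 - p_guess)
    posterior = num / den
    return posterior + (1 - posterior) * p_learn

def predict_correct(p_mastery, p_slip=0.1, p_guess=0.2):
    """Probability the next response is correct given current mastery."""
    return p_mastery * (1 - p_slip) + (1 - p_mastery) * p_guess

# Track one student's mastery estimate over a short response sequence.
p = 0.4
for obs in [True, True, False, True]:
    print(f"P(correct next) = {predict_correct(p):.3f}")
    p = bkt_update(p, obs)
```

At inference time this is a handful of arithmetic operations per response, which is why such models are orders of magnitude cheaper to run than LLMs.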
cs.CL / 23 / 2603.02842
A Browser-based Open Source Assistant for Multimodal Content Verification
基于浏览器的开源多模态内容验证助手
Abstract
Disinformation and false content produced by generative AI pose a significant challenge for journalists and fact-checkers who must rapidly verify digital media information. While there is an abundance of NLP models for detecting credibility signals such as persuasion techniques, subjectivity, or machine-generated text, such methods often remain inaccessible to non-expert users and are not integrated into their daily workflows as a unified framework. This paper demonstrates the VERIFICATION ASSISTANT, a browser-based tool designed to bridge this gap. The VERIFICATION ASSISTANT, a core component of the widely adopted VERIFICATION PLUGIN (140,000+ users), allows users to submit URLs or media files to a unified interface. It automatically extracts content and routes it to a suite of backend NLP classifiers, delivering actionable credibility signals, estimating AI-generated content, and providing other verification guidance in a clear, easy-to-digest format. This paper showcases the tool architecture, its integration of multiple NLP services, and its real-world application to detecting disinformation.
Chinese Translation
生成性人工智能所产生的虚假信息和错误内容对记者和事实核查者构成了重大挑战,他们必须迅速验证数字媒体信息。尽管存在大量用于检测可信度信号(如说服技巧、主观性或机器生成文本)的自然语言处理(NLP)模型,但这些方法往往对非专业用户而言难以获取,并且未能作为统一框架融入他们的日常工作流程。本文展示了VERIFICATION ASSISTANT,这是一种旨在弥合这一差距的基于浏览器的工具。VERIFICATION ASSISTANT是广泛采用的VERIFICATION PLUGIN(用户超过140,000)的核心组成部分,允许用户将URL或媒体文件提交到统一接口。它自动提取内容并将其路由到一套后端NLP分类器,提供可操作的可信度信号、估计AI生成内容,并以清晰易懂的格式提供其他验证指导。本文展示了该工具的架构、其多种NLP服务的整合以及其在检测虚假信息中的实际应用。
cs.CL / 24 / 2603.02860
The Distribution of Phoneme Frequencies across the World's Languages: Macroscopic and Microscopic Information-Theoretic Models
世界语言中音素频率的分布:宏观与微观的信息论模型
Abstract
We demonstrate that the frequency distribution of phonemes across languages can be explained at both macroscopic and microscopic levels. Macroscopically, phoneme rank-frequency distributions closely follow the order statistics of a symmetric Dirichlet distribution whose single concentration parameter scales systematically with phonemic inventory size, revealing a robust compensation effect whereby larger inventories exhibit lower relative entropy. Microscopically, a Maximum Entropy model incorporating constraints from articulatory, phonotactic, and lexical structure accurately predicts language-specific phoneme probabilities. Together, these findings provide a unified information-theoretic account of phoneme frequency structure.
Chinese Translation
我们展示了不同语言中音素的频率分布可以在宏观和微观层面上进行解释。在宏观层面上,音素的等级-频率分布紧密遵循对称Dirichlet分布的顺序统计,其单一浓度参数与音素库存规模系统性地相关,揭示了一种稳健的补偿效应,即较大的音素库存表现出较低的相对熵。在微观层面上,结合发音、音位法则和词汇结构约束的最大熵模型准确预测了特定语言的音素概率。这些发现共同提供了音素频率结构的统一信息论解释。
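The macroscopic claim can be illustrated with a toy simulation: draw a probability vector from a symmetric Dirichlet (via normalized Gamma variates) and inspect its rank-frequency curve and entropy relative to the uniform maximum. This is only a sketch of the distributional machinery; the concentration-versus-inventory-size scaling law is the paper's empirical finding and is not reproduced here.

```python
import math, random

random.seed(0)

def symmetric_dirichlet(k, alpha):
    """Draw one probability vector from Dirichlet(alpha, ..., alpha)
    via normalized Gamma variates."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Rank-frequency view: sort the sampled "phoneme" probabilities in
# descending order and report entropy relative to the uniform maximum
# log(k). alpha = 1.0 here is purely illustrative.
for k in (20, 40):
    freqs = sorted(symmetric_dirichlet(k, alpha=1.0), reverse=True)
    rel_entropy = entropy(freqs) / math.log(k)
    print(f"inventory={k:2d}  top freq={freqs[0]:.3f}  H/Hmax={rel_entropy:.3f}")
```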
cs.CL / 25 / 2603.02865
Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models
节点早现,边缘晚现:探究大型视觉-语言模型中的图示表示
Abstract
Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representation of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which require more abstract, compositionally integrated processes.
Chinese Translation
大型视觉-语言模型(LVLMs)在图示理解基准测试中表现出色,但它们在理解元素之间的关系方面仍然存在困难,尤其是那些由节点和有向边(例如,箭头和线条)表示的关系。为了探讨这一局限性的潜在原因,我们使用基于有向图的精心构建的合成图示数据集对LVLMs的内部表示进行探测。我们的探测实验表明,边缘信息在视觉编码器中并不是线性可分的,只有在语言模型的文本标记中才变得线性编码。相比之下,节点信息和全局结构特征在视觉编码器的单个隐藏状态中已经线性编码。这些发现表明,线性可分表示的形成阶段取决于视觉信息的类型。特别是边缘表示的延迟出现可能有助于解释为什么LVLMs在关系理解方面存在困难,例如解释边缘方向,这需要更抽象、综合整合的过程。
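The probing methodology referenced above typically means training a linear classifier on frozen hidden states: if the probe reaches high accuracy, the feature is linearly decodable at that layer. Below is a minimal sketch with a hand-rolled logistic-regression probe; the Gaussian features are synthetic stand-ins, whereas real experiments would use actual LVLM activations.

```python
import math, random

random.seed(0)

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp for numerical safety
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(states, labels, dim, lr=0.5, epochs=100):
    """Logistic-regression probe via SGD: high accuracy indicates the
    feature is (approximately) linearly decodable from the states."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(states, labels):
            g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_accuracy(w, b, states, labels):
    preds = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5
             for x in states]
    return sum(p == bool(y) for p, y in zip(preds, labels)) / len(labels)

# Synthetic "hidden states": a class-dependent mean plus noise plays the
# role of e.g. node identity being encoded in an activation vector.
dim = 8
states, labels = [], []
for _ in range(100):
    y = random.randint(0, 1)
    mu = 1.0 if y else -1.0
    states.append([mu + random.gauss(0.0, 0.5) for _ in range(dim)])
    labels.append(y)

w, b = train_linear_probe(states, labels, dim)
print(f"probe accuracy: {probe_accuracy(w, b, states, labels):.2f}")
```

Running the same probe on features where the signal is not linearly present (as the paper reports for edge information in the vision encoder) would leave accuracy near chance.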
cs.CL / 26 / 2603.02873
LaTeX Compilation: Challenges in the Era of LLMs
LaTeX 编译:在大型语言模型时代的挑战
Abstract
As large language models (LLMs) increasingly assist scientific writing, the limitations and the significant token cost of TeX become increasingly visible. This paper analyzes TeX's fundamental defects in compilation and user experience design to illustrate its limitations in compilation efficiency, generated semantics, error localization, and the tool ecosystem in the era of LLMs. As an alternative, Mogan STEM, a WYSIWYG structured editor, is introduced. Mogan outperforms TeX in the above aspects through its efficient data structure, fast rendering, and on-demand plugin loading. Extensive experiments are conducted to verify the benefits in compilation/rendering time and performance on LLM tasks. Moreover, we show that due to Mogan's lower information entropy, it is more efficient to use .tmu (the document format of Mogan) to fine-tune LLMs than TeX. We therefore issue an appeal for larger experiments on LLM training using the .tmu format.
Chinese Translation
随着大型语言模型(LLMs)在科学写作中越来越多地提供帮助,TeX 的局限性和显著的令牌成本变得愈发明显。本文分析了 TeX 在编译和用户体验设计中的基本缺陷,以说明其在编译效率、生成语义、错误定位和工具生态系统方面的局限性,尤其是在 LLMs 时代。作为替代方案,本文介绍了 Mogan STEM,一种所见即所得(WYSIWYG)结构化编辑器。Mogan 在上述方面优于 TeX,得益于其高效的数据结构、快速的渲染和按需插件加载。我们进行了大量实验,以验证其在编译/渲染时间和 LLM 任务性能上的优势。此外,我们还表明,由于 Mogan 的信息熵较低,使用 .tmu(Mogan 的文档格式)来微调 LLMs 比使用 TeX 更为高效。因此,我们呼吁对使用 .tmu 格式进行 LLM 训练的大规模实验。
cs.CL / 27 / 2603.02876
Eval4Sim: An Evaluation Framework for Persona Simulation
Eval4Sim:一种用于角色模拟的评估框架
Abstract
Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural analysis. Ensuring that persona-grounded simulations faithfully reflect human conversational behaviour is therefore critical. However, current evaluation practices largely rely on LLM-as-a-judge approaches, offering limited grounding in observable human behavior and producing opaque scalar scores. We address this gap by proposing Eval4Sim, an evaluation framework that measures how closely simulated conversations align with human conversational patterns across three complementary dimensions. Adherence captures how effectively persona backgrounds are implicitly encoded in generated utterances, assessed via dense retrieval with speaker-aware representations. Consistency evaluates whether a persona maintains a distinguishable identity across conversations, computed through authorship verification. Naturalness reflects whether conversations exhibit human-like flow rather than overly rigid or optimized structure, quantified through distributions derived from dialogue-focused Natural Language Inference. Unlike absolute or optimization-oriented metrics, Eval4Sim uses a human conversational corpus (i.e., PersonaChat) as a reference baseline and penalizes deviations in both directions, distinguishing insufficient persona encoding from over-optimized, unnatural behaviour. Although demonstrated on PersonaChat, the applicability of Eval4Sim extends to any conversational corpus containing speaker-level annotations.
Chinese Translation
具有明确属性、背景和行为倾向的语言模型(LLM)角色越来越多地用于模拟人类对话,以完成用户建模、社会推理和行为分析等任务。因此,确保基于角色的模拟能够真实反映人类对话行为至关重要。然而,目前的评估实践主要依赖于将LLM作为评判者的方法,这种方法在可观察的人类行为上提供的基础有限,并且产生不透明的标量分数。我们通过提出Eval4Sim来填补这一空白,这是一种评估框架,用于衡量模拟对话在三个互补维度上与人类对话模式的契合程度。遵循性(Adherence)捕捉生成话语中角色背景的隐式编码效果,通过使用考虑说话者的表示进行密集检索进行评估。一致性(Consistency)评估角色在对话中是否保持可区分的身份,通过作者身份验证进行计算。自然性(Naturalness)反映对话是否表现出人类般的流畅性,而不是过于僵硬或优化的结构,通过从对话聚焦的自然语言推理中导出的分布进行量化。与绝对或优化导向的指标不同,Eval4Sim使用人类对话语料库(即PersonaChat)作为参考基线,并对偏离进行双向惩罚,以区分不足的角色编码和过度优化、不自然的行为。尽管在PersonaChat上进行了验证,Eval4Sim的适用性扩展到任何包含说话者级注释的对话语料库。
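A two-sided comparison against a reference distribution, as the Naturalness dimension requires, can be made with a symmetric divergence. The sketch below uses Jensen-Shannon divergence over hypothetical NLI-label distributions; the label set and the numbers are invented for illustration, and the actual Eval4Sim featurization may differ.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (nats) between two discrete
    distributions: symmetric, finite, and zero iff p == q."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical per-turn NLI-label distributions (entail/neutral/contradict).
human_ref  = [0.30, 0.55, 0.15]  # e.g. pooled from a human corpus like PersonaChat
too_rigid  = [0.70, 0.25, 0.05]  # over-optimized, overly "on-topic" flow
too_random = [0.10, 0.40, 0.50]  # incoherent flow

# Deviations in either direction from the human baseline are penalized.
for name, dist in [("rigid", too_rigid), ("random", too_random)]:
    print(f"{name}: JS from human reference = {js_divergence(human_ref, dist):.3f}")
```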
cs.CL / 28 / 2603.02909
Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction
学习生成与提取:一种用于零样本文档级事件论元提取的多智能体协作框架
Abstract
Document-level event argument extraction (DEAE) is essential for knowledge acquisition, aiming to extract participants of events from documents. In the zero-shot setting, existing methods employ LLMs to generate synthetic data to address the challenge posed by the scarcity of annotated data. However, relying solely on event-type-only prompts makes it difficult for the generated content to accurately capture the contextual and structural relationships of unseen events. Moreover, ensuring the reliability and usability of synthetic data remains a significant challenge due to the absence of quality evaluation mechanisms. To this end, we introduce a multi-agent collaboration framework for zero-shot document-level event argument extraction (ZS-DEAE), which simulates the human collaborative cognitive process of "Propose-Evaluate-Revise." Specifically, the framework comprises a generation agent and an evaluation agent. The generation agent synthesizes data for unseen events by leveraging knowledge from seen events, while the evaluation agent extracts arguments from the synthetic data and assesses their semantic consistency with the context. The evaluation results are subsequently converted into reward signals, with event structure constraints incorporated into the reward design to enable iterative optimization of both agents via reinforcement learning. In three zero-shot scenarios constructed from the RAMS and WikiEvents datasets, our method achieves improvements both in data generation quality and argument extraction performance, while the generated data also effectively enhances the zero-shot performance of other DEAE models.
Chinese Translation
文档级事件论元提取(DEAE)对于知识获取至关重要,旨在从文档中提取事件的参与者。在零样本设置下,现有方法利用大语言模型(LLMs)生成合成数据,以应对标注数据稀缺带来的挑战。然而,仅依赖事件类型提示使得生成的内容难以准确捕捉未见事件的上下文和结构关系。此外,由于缺乏质量评估机制,确保合成数据的可靠性和可用性仍然是一个重大挑战。为此,我们提出了一种用于零样本文档级事件论元提取(ZS-DEAE)的多智能体协作框架,该框架模拟了人类协作认知过程中的"提议-评估-修订"。具体而言,该框架由生成智能体和评估智能体组成。生成智能体通过利用已见事件的知识合成未见事件的数据,而评估智能体则从合成数据中提取论元,并评估其与上下文的语义一致性。评估结果随后转化为奖励信号,并在奖励设计中融入事件结构约束,以便通过强化学习实现两个智能体的迭代优化。在从RAMS和WikiEvents数据集中构建的三个零样本场景中,我们的方法在数据生成质量和论元提取性能上均取得了提升,同时生成的数据也有效增强了其他DEAE模型的零样本性能。
cs.CL / 29 / 2603.02945
ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation
ACE-Merging:基于自适应协方差估计的数据无关模型合并
Abstract
Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4% over previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance at a modest computational cost, providing a practical and theoretically grounded solution for model merging.
Chinese Translation
模型合并旨在将多个特定任务的专家模型合并为一个单一模型,同时保持在多样化任务中的泛化能力。然而,专家之间的干扰,尤其是在它们基于不同目标训练时,往往会导致显著的性能下降。尽管近期取得了一些进展,但在没有数据访问、重新训练或架构修改的情况下解决这种干扰仍然是一个基本挑战。本文提供了理论分析,证明每个任务的输入协方差(最优合并的关键因素)可以通过其微调模型的参数差异在完全无数据的环境中隐式估计。在此基础上,我们引入了ACE-Merging(一种自适应协方差估计框架),有效缓解了任务间的干扰。我们的方法具有原则性的封闭形式解,与之前的迭代或启发式方法形成对比。在视觉和语言基准上的大量实验表明,ACE-Merging在数据无关方法中达到了新的最先进水平。它始终优于现有基线;例如,ACE-Merging在GPT-2上的七个任务中实现了平均4%的绝对提升。得益于其高效的封闭形式公式,ACE-Merging以适度的计算成本提供了卓越的性能,为模型合并提供了一个实用且理论基础扎实的解决方案。
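The closed-form, covariance-weighted merging idea can be sketched in miniature: with a diagonal input-covariance estimate per task, the merged weight at each coordinate is the covariance-weighted average of the expert weights. How ACE-Merging actually estimates those covariances from parameter differences is the paper's contribution and is not reproduced here; the values below are purely illustrative.

```python
def covariance_weighted_merge(experts, covs):
    """Closed-form merge: for each coordinate, average the expert
    weights weighted by a (diagonal) input-covariance estimate per
    task. This minimizes sum_i c_i[j] * (w[j] - w_i[j])^2 per coordinate."""
    dim = len(experts[0])
    merged = []
    for j in range(dim):
        num = sum(c[j] * w[j] for w, c in zip(experts, covs))
        den = sum(c[j] for c in covs)
        merged.append(num / den)
    return merged

# Two toy experts: task A's inputs mostly excite coordinate 0, task B's
# coordinate 1, so the merge keeps each expert's weight where its own
# task "cares" most, limiting inter-task interference.
w_a, w_b = [1.0, 0.0], [0.0, 1.0]
c_a, c_b = [9.0, 1.0], [1.0, 9.0]
print(covariance_weighted_merge([w_a, w_b], [c_a, c_b]))
```

A plain average would give [0.5, 0.5] and hurt both tasks; the covariance weighting keeps each coordinate close to the expert that depends on it.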
cs.CL / 30 / 2603.03001
MaBERT: A Padding Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling
MaBERT:一种填充安全的交错变换器-Mamba混合编码器,用于高效扩展上下文的掩码语言建模
Abstract
Self-attention encoders such as Bidirectional Encoder Representations from Transformers (BERT) scale quadratically with sequence length, making long-context modeling expensive. Linear-time state space models, such as Mamba, are efficient; however, they show limitations in modeling global interactions and can suffer from padding-induced state contamination. We propose MaBERT, a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates. This design alternates global contextual integration with fast state accumulation, enabling efficient training and inference on long inputs. To stabilize variable-length batching, we introduce padding-safe masking, which blocks state propagation through padded positions, and mask-aware attention pooling, which aggregates information only from valid tokens. On GLUE, MaBERT achieves the best mean score on five of the eight tasks, with strong performance on CoLA and the sentence-pair inference tasks. When extending the context from 512 to 4,096 tokens, MaBERT reduces training time and inference latency by 2.36x and 2.43x, respectively, relative to the average of encoder baselines, demonstrating a practical, efficient long-context encoder.
Chinese Translation
自注意力编码器,如双向编码器表示(Bidirectional Encoder Representations from Transformers,BERT),其计算量随序列长度呈平方级增长,使得长上下文建模变得昂贵。线性时间状态空间模型(如Mamba)具有高效性;然而,它们在建模全局交互方面存在局限,并可能受到填充引起的状态污染的影响。我们提出了MaBERT,一种混合编码器,它将用于全局依赖建模的变换器层与用于线性时间状态更新的Mamba层交错。这种设计交替进行全局上下文整合与快速状态累积,从而实现对长输入的高效训练和推理。为了稳定可变长度的批处理,我们引入了填充安全掩码(阻止状态在填充位置的传播)以及掩码感知注意力池化(仅从有效标记中聚合信息)。在GLUE基准测试中,MaBERT在八个任务中的五个上取得了最佳平均得分,在CoLA和句子对推理任务上表现强劲。当将上下文从512扩展到4096个标记时,相对于编码器基线的平均值,MaBERT将训练时间和推理延迟分别减少了2.36倍和2.43倍,展示了一个实用的长上下文高效编码器。
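The two stabilization tricks in the abstract (padding-safe masking and mask-aware pooling) can be illustrated with a toy scalar recurrence standing in for a Mamba-style state update; the real layers operate on high-dimensional states, but the masking logic is the same in spirit.

```python
def padding_safe_scan(tokens, mask, decay=0.5):
    """Linear-time state accumulation that freezes the state at padded
    positions, so padding values cannot contaminate downstream states."""
    state, states = 0.0, []
    for x, valid in zip(tokens, mask):
        if valid:
            state = decay * state + x  # toy stand-in for a Mamba update
        states.append(state)           # padded steps just carry the state over
    return states

def mask_aware_pool(states, mask):
    """Aggregate only over valid (non-padded) positions."""
    valid = [s for s, m in zip(states, mask) if m]
    return sum(valid) / len(valid)

# The last two entries are padding; without masking, their value 99.0
# would leak into the running state and the pooled representation.
seq  = [1.0, 2.0, 99.0, 99.0]
mask = [1, 1, 0, 0]
states = padding_safe_scan(seq, mask)
print(states, mask_aware_pool(states, mask))
```

The final state equals what an unpadded scan over just the valid tokens would produce, which is exactly the invariance variable-length batching needs.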
cs.CL / 31 / 2603.03047
TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health
TrustMH-Bench:评估大型语言模型在心理健康领域可信度的综合基准
Abstract
While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domain's high-stakes and safety-sensitive nature. Existing evaluation paradigms for general-purpose LLMs fail to capture mental health-specific requirements, highlighting an urgent need to prioritize and enhance their trustworthiness. To address this, we propose TrustMH-Bench, a holistic framework designed to systematically quantify the trustworthiness of mental health LLMs. By establishing a deep mapping from domain-specific norms to quantitative evaluation metrics, TrustMH-Bench evaluates models across eight core pillars: Reliability, Crisis Identification and Escalation, Safety, Fairness, Privacy, Robustness, Anti-sycophancy, and Ethics. We conduct extensive experiments across six general-purpose LLMs and six specialized mental health models. Experimental results indicate that the evaluated models underperform across various trustworthiness dimensions in mental health scenarios, revealing significant deficiencies. Notably, even generally powerful models (e.g., GPT-5.1) fail to maintain consistently high performance across all dimensions. Consequently, systematically improving the trustworthiness of LLMs has become a critical task. Our data and code are released.
Chinese Translation
尽管大型语言模型(LLMs)在提供可及的心理健康支持方面展现出显著潜力,但其实际应用引发了由于该领域高风险和安全敏感性而产生的关键可信度问题。现有的通用 LLM 评估范式未能捕捉到心理健康特定的需求,突显出优先考虑和提升其可信度的迫切需要。为此,我们提出了 TrustMH-Bench,这是一个旨在系统量化心理健康 LLM 可信度的整体框架。通过建立从领域特定规范到定量评估指标的深度映射,TrustMH-Bench 在八个核心支柱上评估模型:可靠性、危机识别与升级、安全性、公平性、隐私、稳健性、反阿谀奉承和伦理。我们对六个通用 LLM 和六个专门的心理健康模型进行了广泛的实验。实验结果表明,被评估模型在心理健康场景中的各个可信度维度上表现不佳,揭示出显著的缺陷。值得注意的是,即使是通常强大的模型(例如,GPT-5.1)在所有维度上也未能保持一致的高性能。因此,系统性地提高 LLM 的可信度已成为一项关键任务。我们的数据和代码已公开发布。
cs.CL / 32 / 2603.03054
PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems
PrivMedChat:用于医疗对话系统的端到端差分隐私强化学习人类反馈
Abstract
Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differentially Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at https://github.com/sudip-bhujel/privmedchat.
Chinese Translation
大型语言模型越来越多地用于面向患者的医疗辅助和临床决策支持,但将其适应于临床对话通常需要源自医生与患者对话的监督,这些对话可能包含敏感信息。传统的监督微调和基于人类反馈的强化学习(RLHF)可能会加剧记忆风险,从而使得经验性成员推断和稀有训练集内容的提取成为可能。我们提出了PrivMedChat,这是一个用于医疗对话的端到端差分隐私RLHF(DP-RLHF)框架。我们的设计在每个直接访问对话派生监督的训练阶段强制执行差分隐私:(i)用于医疗监督微调的差分隐私随机梯度下降(DP-SGD)和(ii)用于从偏好对学习奖励模型的DP-SGD。为了限制对齐过程中的额外隐私支出,我们在处理对话派生提示时对PPO演员和评论员应用DP-SGD,而奖励模型在DP训练后保持固定。我们还引入了一种无注释的偏好构建策略,将医生的回应与过滤后的非专家生成内容配对,以在没有临床医生标注的情况下生成可扩展的偏好数据。在医疗对话基准上的实验表明,PrivMedChat在$\varepsilon=7$时在所有差分隐私模型中达到了最高的ROUGE-L值0.156,将临床幻觉降低到1.4%,有害建议降低到0.4%,并在3模型LLM评审中获得最高的整体评分2.86,同时产生的成员推断信号接近随机(AUC 0.510-0.555)。我们的代码已开源,地址为https://github.com/sudip-bhujel/privmedchat。
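DP-SGD, the mechanism used at both private training stages above, clips each example's gradient to a fixed norm and adds calibrated Gaussian noise before the parameter update. A minimal single-step sketch follows; privacy accounting across steps (which yields the reported ε) is omitted, and all numbers are illustrative.

```python
import math, random

random.seed(0)

def dp_sgd_step(params, per_example_grads, clip_norm=1.0,
                noise_multiplier=1.0, lr=0.1):
    """One DP-SGD update: clip each example's gradient to clip_norm,
    sum, add Gaussian noise scaled to the clipping bound, then take a
    gradient step on the noisy average. Accounting over many steps
    (the epsilon budget) is not modeled here."""
    dim = len(params)
    summed = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            summed[j] += g[j] * scale
    n = len(per_example_grads)
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(s + random.gauss(0.0, sigma)) / n for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]

# The first per-example gradient (norm 5) gets clipped to norm 1; the
# second is already within the bound and passes through unchanged.
grads = [[3.0, 4.0], [0.1, -0.2]]
print(dp_sgd_step([0.0, 0.0], grads))
```

Clipping bounds any single example's influence on the update, which is what makes the added noise translate into a formal privacy guarantee.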
cs.CL / 33 / 2603.03081
TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models
TAO-Attack:面向大型语言模型的先进的基于优化的越狱攻击
Abstract
Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.
Chinese Translation
大型语言模型(LLMs)在各种应用中取得了显著成功,但仍然容易受到越狱攻击,攻击者通过设计提示绕过安全对齐并引发不安全的响应。在现有方法中,基于优化的攻击显示出强大的有效性,但当前方法往往面临频繁拒绝、伪有害输出和低效的令牌级更新。在本研究中,我们提出了TAO-Attack,一种新的基于优化的越狱方法。TAO-Attack采用了两阶段损失函数:第一阶段抑制拒绝,以确保模型继续生成有害前缀,而第二阶段则惩罚伪有害输出,并鼓励模型朝向更有害的完成。此外,我们设计了一种方向优先令牌优化(DPTO)策略,通过在考虑更新幅度之前将候选项与梯度方向对齐,从而提高效率。在多个LLM上的广泛实验表明,TAO-Attack在攻击成功率上始终优于最先进的方法,在某些场景中甚至达到了100%。
cs.CL / 34 / 2603.03095
Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection
基于指令调优的大型语言模型中的紧凑提示用于联合论证成分检测
Abstract
Argumentative component detection (ACD) is a core subtask of Argument(ation) Mining (AM) and one of its most challenging aspects, as it requires jointly delimiting argumentative spans and classifying them into components such as claims and premises. While research on this subtask remains relatively limited compared to other AM tasks, most existing approaches formulate it as a simplified sequence labeling problem, component classification, or a pipeline of component segmentation followed by classification. In this paper, we propose a novel approach based on instruction-tuned Large Language Models (LLMs) using compact instruction-based prompts, and reframe ACD as a language generation task, enabling arguments to be identified directly from plain text without relying on pre-segmented components. Experiments on standard benchmarks show that our approach achieves higher performance compared to state-of-the-art systems. To the best of our knowledge, this is one of the first attempts to fully model ACD as a generative task, highlighting the potential of instruction tuning for complex AM problems.
Chinese Translation
论证成分检测(ACD)是论证挖掘(AM)的核心子任务之一,也是其最具挑战性的方面之一,因为它需要共同界定论证范围并将其分类为主张和前提等成分。尽管与其他AM任务相比,这一子任务的研究仍相对有限,但现有的大多数方法将其形式化为简化的序列标注问题、成分分类,或成分分割后跟随分类的管道。在本文中,我们提出了一种基于指令调优的大型语言模型(LLMs)的新方法,使用紧凑的基于指令的提示,并将ACD重新构建为语言生成任务,使得可以直接从普通文本中识别论证,而无需依赖预先分割的成分。在标准基准上的实验表明,我们的方法相比于最先进的系统实现了更高的性能。据我们所知,这是首次将ACD完全建模为生成任务的尝试,突显了指令调优在复杂AM问题中的潜力。
cs.CL / 35 / 2603.03111
Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems
评估多轮大语言模型系统中模型切换导致的性能漂移
Abstract
Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, potentially inducing silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. Across CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent and statistically significant, directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and +/- 4 absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs GPT-5-mini). We further find systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix. To enable compressed handoff risk monitoring, we decompose switch-induced drift into per-model prefix influence and suffix susceptibility terms, accounting for ~70% of variance across benchmarks. These results position handoff robustness as an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn systems.
Chinese Translation
部署的多轮大语言模型系统在交互过程中常常因升级、跨服务提供商路由和回退而切换模型。这种切换会导致上下文不匹配:生成后续轮次的模型必须依赖于由不同模型生成的对话前缀,这可能会引发无声的性能漂移。我们引入了一种切换矩阵基准,通过在早期轮次运行前缀模型,在最后一轮运行后缀模型,并使用配对的回合级自助置信区间与无切换基线进行比较,从而测量这种效应。在 CoQA 对话问答和 Multi-IF 基准中,即使是单轮切换也会产生普遍且统计显著的方向性影响,可能导致 Multi-IF 严格成功率的结果波动 -8 到 +13 个百分点,以及 CoQA 上 +/- 4 的绝对 F1,这与常见模型层级之间的无切换差距(例如,GPT-5-nano 与 GPT-5-mini)相当。我们进一步发现系统性的兼容性模式:一些后缀模型在几乎任何非自我对话历史下表现下降,而其他模型在几乎任何外部前缀下表现提升。为了实现压缩的切换风险监测,我们将切换引起的漂移分解为每个模型的前缀影响和后缀易感性项,解释了基准测试中约 70% 的方差。这些结果将切换的鲁棒性定位为单模型基准所忽视的操作可靠性维度,促使在多轮系统中进行明确的监测和切换意识的缓解措施。
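The paired episode-level bootstrap behind the confidence intervals can be sketched directly: compute per-episode score differences (switched minus no-switch baseline), resample them with replacement, and take percentile bounds on the resampled means. The per-episode scores below are hypothetical, not the paper's data.

```python
import random, statistics

random.seed(0)

def paired_bootstrap_ci(baseline, switched, n_boot=2000, alpha=0.05):
    """Percentile CI for the mean per-episode score difference
    (switched - baseline), resampling episodes with replacement.
    An interval excluding 0 indicates significant switch-induced drift."""
    diffs = [s - b for b, s in zip(baseline, switched)]
    means = []
    for _ in range(n_boot):
        sample = [random.choice(diffs) for _ in diffs]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-episode F1 for no-switch vs. switched runs.
base = [0.80, 0.75, 0.90, 0.60, 0.85, 0.70, 0.95, 0.65]
swap = [0.78, 0.70, 0.88, 0.55, 0.86, 0.66, 0.93, 0.60]
lo, hi = paired_bootstrap_ci(base, swap)
print(f"drift CI: [{lo:.3f}, {hi:.3f}]")
```

Pairing by episode removes episode-difficulty variance from the comparison, which is why even small drifts become detectable.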
cs.CL / 36 / 2603.03134
UniSkill: A Dataset for Matching University Curricula to Professional Competencies
UniSkill:匹配大学课程与职业能力的数据集
Abstract
Skill extraction and recommendation systems have been studied from recruiter, applicant, and education perspectives. While AI applications in job advertisements have received broad attention, deficiencies on the instructed-skills side remain a challenge. In this work, we address the scarcity of publicly available datasets by releasing both manually annotated and synthetic datasets that pair skills from the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy with university courses, and by publishing the corresponding annotation guidelines. Specifically, we match graduate-level university courses with skills from the Systems Analysts and Management and Organization Analyst ESCO occupation groups at two granularities: course title with a skill, and course sentence with a skill. We train language models on this dataset to serve as a baseline for retrieval and recommendation systems for course-to-skill and skill-to-course matching. We evaluate the models on a portion of the annotated data. Our BERT model achieves an 87% F1-score, showing that course-to-skill matching is a feasible task.
Chinese Translation
技能提取和推荐系统从招聘者、申请者和教育的角度进行了研究。尽管人工智能在招聘广告中的应用受到了广泛关注,但在所教授技能方面的不足仍然是一个挑战。在本研究中,我们通过发布来自欧洲技能、能力、资格和职业(ESCO)分类法的手动标注和合成技能数据集,以及大学课程对的相关注释指南,来解决公开可用数据集的稀缺问题。具体而言,我们将研究生级别的大学课程与来自系统分析师和管理与组织分析师ESCO职业组的技能进行匹配,匹配的粒度包括:课程标题与技能,以及课程句子与技能。我们在该数据集上训练语言模型,以作为课程与技能匹配和技能与课程匹配的检索和推荐系统的基线。我们在一部分标注数据上评估模型。我们的BERT模型达到了87%的F1-score,表明课程与技能匹配是一项可行的任务。
cs.CL / 37 / 2603.03142
APRES: An Agentic Paper Revision and Evaluation System
APRES:一种代理型论文修订与评估系统
Abstract
Scientific discoveries must be communicated clearly to realize their full potential. Without effective communication, even the most groundbreaking findings risk being overlooked or misunderstood. The primary way scientists communicate their work and receive feedback from the community is through peer review. However, the current system often provides inconsistent feedback between reviewers, ultimately hindering the improvement of a manuscript and limiting its potential impact. In this paper, we introduce APRES, a novel method powered by Large Language Models (LLMs) to update a scientific paper's text based on an evaluation rubric. Our automated method discovers a rubric that is highly predictive of future citation counts and integrates it with APRES in an automated system that revises papers to enhance their quality and impact. Crucially, this objective should be met without altering the core scientific content. We demonstrate the success of APRES, which improves future citation prediction by 19.6% in mean absolute error over the next best baseline, and show that our paper revision process yields papers that are preferred over the originals by human expert evaluators 79% of the time. Our findings provide strong empirical support for using LLMs as a tool to help authors stress-test their manuscripts before submission. Ultimately, our work seeks to augment, not replace, the essential role of human expert reviewers, for it should be humans who discern which discoveries truly matter, guiding science toward advancing knowledge and enriching lives.
Chinese Translation
科学发现必须清晰地传达,以实现其全部潜力。没有有效的沟通,即使是最具突破性的发现也可能被忽视或误解。科学家们交流其研究成果并从社区获得反馈的主要方式是同行评审。然而,当前的系统往往在评审者之间提供不一致的反馈,最终阻碍了手稿的改进并限制了其潜在影响。在本文中,我们介绍了一种新方法APRES,该方法由大型语言模型(Large Language Models, LLMs)驱动,旨在根据评估标准更新科学论文的文本。我们自动化的方法发现了一种对未来引用次数具有高度预测性的评估标准,并将其与APRES集成在一个自动化系统中,以修订论文以提升其质量和影响力。至关重要的是,这一目标应在不改变核心科学内容的情况下实现。我们展示了APRES的成功,其在未来引用预测方面比下一个最佳基线提高了19.6%的平均绝对误差,并且我们的论文修订过程产生的论文在79%的情况下被人类专家评估者偏好于原稿。我们的发现为使用LLMs作为工具帮助作者在提交前进行手稿压力测试提供了强有力的实证支持。最终,我们的工作旨在增强而非取代人类专家评审者的基本角色,因为应该由人类来辨别哪些发现真正重要,引导科学向推进知识和丰富生活的方向发展。
cs.CL / 38 / 2603.03194
BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?
BeyondSWE:当前代码智能体能否超越单一代码库的错误修复?
Abstract
Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
Chinese Translation
当前的代码智能体基准主要评估狭窄的、特定于代码库的修复,忽视了诸如跨代码库推理、领域专业化问题解决、依赖驱动迁移和全代码库生成等关键的现实世界挑战。为了解决这一问题,我们提出了BeyondSWE,这是一个全面的基准,通过使用500个真实世界实例在四个不同的设置中,从解决范围和知识范围两个维度扩展现有评估。实验结果揭示了显著的能力差距:即使是最前沿的模型,其成功率也停滞在45%以下,并且没有单一模型在任务类型上表现一致。为了系统地研究外部知识的作用,我们开发了SearchSWE,一个将深度搜索与编码能力相结合的框架。我们的实验表明,搜索增强带来了不一致的收益,并且在某些情况下可能会降低性能,这突显了在编码任务中模拟开发者式工作流程(交替进行搜索和推理)的困难。这项工作提供了一个现实且具有挑战性的评估基准,以及一个灵活的框架,以推动研究向更强大的代码智能体发展。
cs.CL / 39 / 2603.03202
Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?
Code2Math:你的代码代理能否通过探索有效地演化数学问题?
Abstract
As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at https://github.com/TarferSoul/Code2Math.
Chinese Translation
随着大型语言模型(LLMs)在数学能力上向国际数学奥林匹克(IMO)水平的进展,训练和评估所需的具有挑战性和高质量问题的稀缺性已成为一个显著的瓶颈。同时,近期的代码代理展示了在代理编码和推理方面的复杂技能,这表明代码执行可以作为一个可扩展的数学实验环境。在本文中,我们研究了代码代理自主演化现有数学问题为更复杂变体的潜力。我们引入了一个多代理框架,旨在在验证生成问题的可解性和难度增加的同时进行问题演化。我们的实验表明,在足够的测试时间探索下,代码代理能够合成出新的、可解的问题,这些问题在结构上与原始问题有显著区别,并且更具挑战性。这项工作提供了实证证据,表明基于代码的代理可以作为在可扩展计算环境中合成高难度数学推理问题的可行机制。我们的数据可在 https://github.com/TarferSoul/Code2Math 获取。
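The evolve-and-validate cycle described in the Code2Math abstract above can be sketched as a loop that mutates a problem and only accepts candidates that remain solvable while growing harder. Everything below is a toy illustration under assumed names (`evolve`, `validate`, `evolve_with_validation`); the paper's agents operate on natural-language math problems via LLM-driven code execution, not on arithmetic strings.

```python
import random

def evolve(problem):
    """Mutate a problem into a structurally harder variant (toy stand-in
    for the LLM-driven evolution agent)."""
    a, b = random.randint(2, 9), random.randint(2, 9)
    return f"({problem}) * {a} + {b}"

def validate(problem):
    """Check solvability by executing the candidate (stand-in for the
    code agent solving the problem in a sandboxed environment)."""
    try:
        return True, eval(problem)
    except Exception:
        return False, None

def evolve_with_validation(seed, rounds=3):
    """Accept a candidate only if it is solvable and harder (here: longer,
    a crude difficulty proxy) than the current problem."""
    problem = seed
    for _ in range(rounds):
        candidate = evolve(problem)
        ok, _answer = validate(candidate)
        if ok and len(candidate) > len(problem):
            problem = candidate
    return problem
```

In the paper, "harder" is established by the validation agents rather than a length proxy, and test-time exploration corresponds to running many such rounds.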
cs.CL / 40 / 2603.03205
Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
学习何时行动或拒绝:保护自主推理模型以安全地使用多步骤工具
Abstract
Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
Chinese Translation
自主语言模型在安全机制上与聊天模型有根本性的不同:它们必须进行规划、调用工具并执行长时间跨度的行动,其中一次失误,例如访问文件或输入凭证,可能导致不可逆转的伤害。现有的对齐方法主要针对静态生成和任务完成进行了优化,但在这些场景中由于序列决策、对抗性工具反馈和过于自信的中间推理而失效。我们提出了MOSAIC,一个后训练框架,通过明确和可学习的安全决策来对齐代理,以实现安全的多步骤工具使用。MOSAIC将推理结构化为计划、检查,然后行动或拒绝的循环,将明确的安全推理和拒绝作为一类重要的行动。为了在没有轨迹级标签的情况下进行训练,我们采用基于偏好的强化学习,通过成对轨迹比较来捕捉安全差异,这些差异通常被标量奖励所忽视。我们在三个模型系列Qwen2.5-7B、Qwen3-4B-Thinking和Phi-4上对MOSAIC进行了零样本评估,并在跨越有害任务、提示注入、良性工具使用和跨域隐私泄露的分布外基准上进行了测试。MOSAIC将有害行为减少了多达50%,在注入攻击中将有害任务的拒绝率提高了超过20%,减少了隐私泄露,并保持或改善了良性任务的表现,展示了在模型、领域和自主环境中的强健泛化能力。
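MOSAIC's plan, check, then act-or-refuse structure, with refusal as a first-class action, can be sketched as a single decision step. The function and callable names below are hypothetical stand-ins; in the actual framework these stages are reasoning steps of the aligned model, not separate Python callables.

```python
def agent_step(task, plan_fn, safety_check, act_fn):
    """One iteration of the plan -> check -> act-or-refuse loop.
    safety_check models the explicit safety-reasoning stage, and a
    'refuse' outcome is returned as a first-class action rather than
    an error path."""
    action = plan_fn(task)              # propose the next tool call
    verdict = safety_check(action)      # explicit safety reasoning
    if verdict == "unsafe":
        return ("refuse", action)       # refusal is itself an action
    return ("act", act_fn(action))      # execute the vetted tool call
```

The preference-based training the abstract describes then rewards whole trajectories of such steps via pairwise comparisons, rather than scoring each step with a scalar reward.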
cs.CL / 41 / 2603.03249
Using Learning Progressions to Guide AI Feedback for Science Learning
利用学习进展指导科学学习中的人工智能反馈
Abstract
Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LP) provide a theoretically grounded representation of students' developing understanding and may offer an alternative solution. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback for written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen's kappa values for estimable dimensions ranging from .66 to .88. Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.
Chinese Translation
生成性人工智能(AI)为形成性反馈提供了可扩展的支持,但大多数AI生成的反馈依赖于由领域专家撰写的特定任务评分标准。尽管有效,评分标准的编写耗时且限制了在教学环境中的可扩展性。学习进展(Learning Progressions, LP)提供了对学生理解发展的具有理论基础的表征,可能提供一种替代解决方案。本研究考察了一个基于学习进展的评分标准生成管道是否能够产生与专家撰写的任务评分标准指导的反馈质量相当的AI生成反馈。我们分析了207名中学生在化学任务中撰写的科学解释所生成的AI反馈。比较了两种管道:(a)由人类专家设计的特定任务评分标准指导的反馈,以及(b)在评分和反馈生成之前,自动从学习进展中推导出的特定任务评分标准指导的反馈。两位人类编码者使用一个多维评分标准评估反馈质量,该标准评估清晰度、准确性、相关性、参与度与动机以及反思性(10个子维度)。评估者间的一致性很高,百分比一致性范围为89%到100%,可估计维度的Cohen's kappa值介于0.66到0.88之间。配对t检验显示,两种管道在清晰度(t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = 0.399)、相关性(t1 = 0.28, p1 = 0.782; t2 = -0.58, p2 = 0.565)、参与度与动机(t1 = 0.50, p1 = 0.618; t2 = -0.58, p2 = 0.565)或反思性(t = -0.45, p = 0.656)方面没有统计学显著差异。这些发现表明,基于学习进展的评分标准管道可以作为一种替代解决方案。