cs.RO / 1 / 2602.21259
Cross domain Persistent Monitoring for Hybrid Aerial Underwater Vehicles
跨域持久监测的混合空中水下车辆
Abstract
Hybrid Unmanned Aerial Underwater Vehicles (HUAUVs) have emerged as platforms capable of operating in both aerial and underwater environments, enabling applications such as inspection, mapping, search, and rescue in challenging scenarios. However, the development of novel methodologies poses significant challenges due to the distinct dynamics and constraints of the air and water domains. In this work, we address persistent monitoring tasks for HUAUVs by combining Deep Reinforcement Learning (DRL) and Transfer Learning to enable cross-domain adaptability. Our approach employs a shared DRL architecture trained on Lidar sensor data (in air) and Sonar data (underwater), demonstrating the feasibility of a unified policy for both environments. We further show that the methodology presents promising results, taking into account the uncertainty of the environment and the dynamics of multiple mobile targets. The proposed framework lays the groundwork for scalable autonomous persistent monitoring solutions based on DRL for hybrid aerial-underwater vehicles.
Chinese Translation
混合无人空中水下车辆(HUAUVs)作为能够在空中和水下环境中操作的平台,已成为在复杂场景中进行检查、制图、搜索和救援等应用的关键。然而,由于空气和水域的动态特性和约束条件的不同,开发新方法面临重大挑战。在本研究中,我们通过结合深度强化学习(Deep Reinforcement Learning, DRL)和迁移学习(Transfer Learning),提出了针对HUAUVs的持久监测任务,以实现跨域适应性。我们的方法采用了一个共享的DRL架构,该架构在激光雷达(Lidar)传感器数据(空中)和声纳(Sonar)数据(水下)上进行训练,展示了在这两种环境中统一策略的可行性。我们进一步表明,该方法在考虑环境的不确定性和多个移动目标的动态特性时,呈现出良好的结果。所提出的框架为基于DRL的混合空中水下车辆可扩展的自主持久监测解决方案奠定了基础。
cs.RO / 2 / 2602.21266
Dual-Branch INS/GNSS Fusion with Inequality and Equality Constraints
带有不等式和等式约束的双分支INS/GNSS融合
Abstract
Reliable vehicle navigation in urban environments remains a challenging problem due to frequent satellite signal blockages caused by tall buildings and complex infrastructure. While fusing inertial readings with satellite positioning in an extended Kalman filter provides short-term navigation continuity, low-cost inertial sensors suffer from rapid error accumulation during prolonged outages. Existing information aiding approaches, such as the non-holonomic constraint, impose rigid equality assumptions on vehicle motion that may be violated under dynamic urban driving conditions, limiting their robustness precisely when aiding is most needed. In this paper, we propose a dual-branch information aiding framework that fuses equality and inequality motion constraints through a variance-weighted scheme, requiring only a software modification to an existing navigation filter with no additional sensors or hardware. The proposed method is evaluated on four publicly available urban datasets featuring various inertial sensors, road conditions, and dynamics, covering a total duration of 4.3 hours of recorded data. Under full GNSS availability, the method reduces vertical position error by 16.7% and improves altitude accuracy by 50.1% over the standard non-holonomic constraint. Under GNSS-denied conditions, vertical drift is reduced by 24.2% and altitude accuracy improves by 20.2%. These results demonstrate that replacing hard motion equality assumptions with physically motivated inequality bounds is a practical and cost-free strategy for improving navigation resilience, continuity, and drift robustness without relying on additional sensors, map data, or learned models.
Chinese Translation
在城市环境中,可靠的车辆导航仍然是一个具有挑战性的问题,因为高楼大厦和复杂基础设施导致卫星信号频繁被阻挡。尽管在扩展卡尔曼滤波器中将惯性读数与卫星定位融合可以提供短期导航连续性,但低成本的惯性传感器在长时间失效期间会迅速积累误差。现有的信息辅助方法,如非完整约束,对车辆运动施加了严格的等式假设,而在动态城市驾驶条件下,这些假设可能会被违反,从而限制了它们的鲁棒性,恰恰在最需要辅助时表现不佳。本文提出了一种双分支信息辅助框架,通过方差加权方案融合等式和不等式运动约束,仅需对现有导航滤波器进行软件修改,而无需额外的传感器或硬件。所提方法在四个公开可用的城市数据集上进行了评估,这些数据集涵盖了各种惯性传感器、道路条件和动态,记录数据总时长为4.3小时。在完全GNSS可用的情况下,该方法将垂直位置误差降低了16.7%,并将高度精度提高了50.1%,相较于标准的非完整约束。在GNSS不可用的条件下,垂直漂移减少了24.2%,高度精度提高了20.2%。这些结果表明,用物理动机驱动的不等式界限替代严格的运动等式假设是一种实用且无成本的策略,可以在不依赖额外传感器、地图数据或学习模型的情况下,提高导航的韧性、连续性和漂移鲁棒性。
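A minimal sketch of what a variance-weighted fusion of an equality branch and an inequality branch could look like for a single lateral-velocity state. The abstract does not specify the scheme, so the inverse-variance weighting, the clamp-based projection, and all numbers below are assumptions for illustration only, not the authors' implementation.

```python
def clamp(value, bound):
    """Project a value onto the interval [-bound, bound] (inequality branch)."""
    return max(-bound, min(bound, value))


def inverse_variance_fuse(x_a, var_a, x_b, var_b):
    """Fuse two scalar estimates, weighting each by the inverse of its variance."""
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    fused = (w_a * x_a + w_b * x_b) / (w_a + w_b)
    fused_var = 1.0 / (w_a + w_b)
    return fused, fused_var


# Hypothetical numbers: the filter currently predicts 0.8 m/s of lateral velocity.
predicted_lateral_velocity = 0.8

# Equality branch: the non-holonomic constraint asserts zero lateral velocity.
eq_value, eq_var = 0.0, 0.04

# Inequality branch: lateral velocity is only assumed to stay within +/-0.5 m/s.
ineq_value, ineq_var = clamp(predicted_lateral_velocity, 0.5), 0.01

fused_value, fused_var = inverse_variance_fuse(eq_value, eq_var, ineq_value, ineq_var)
```

In this toy setting the branch with the smaller variance dominates, so the fused pseudo-measurement sits closer to the clamped value than to zero.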
cs.RO / 3 / 2602.21302
Learning Deformable Object Manipulation Using Task-Level Iterative Learning Control
基于任务级迭代学习控制的可变形物体操控学习
Abstract
Dynamic manipulation of deformable objects is challenging for humans and robots because such objects have infinite degrees of freedom and exhibit underactuated dynamics. We introduce a Task-Level Iterative Learning Control method for dynamic manipulation of deformable objects. We demonstrate this method on a non-planar rope manipulation task called the flying knot. Using a single human demonstration and a simplified rope model, the method learns directly on hardware without reliance on large amounts of demonstration data or massive amounts of simulation. At each iteration, the algorithm constructs a local inverse model of the robot and rope by solving a quadratic program to propagate task-space errors into action updates. We evaluate performance across 7 different kinds of ropes, including chain, latex surgical tubing, and braided and twisted ropes, ranging in thickness from 7 to 25 mm and in linear density from 0.013 to 0.5 kg/m. Learning achieves a 100% success rate within 10 trials on all ropes. Furthermore, the method can successfully transfer between most rope types in approximately 2-5 trials. https://flying-knots.github.io
Chinese Translation
可变形物体的动态操控对人类和机器人来说都是一项挑战,因为它们具有无限的自由度并表现出欠驱动的动态特性。我们提出了一种任务级迭代学习控制方法,用于可变形物体的动态操控。我们在一个名为飞结的非平面绳索操控任务上演示了该方法。该方法利用单一的人类示范和简化的绳索模型,直接在硬件上学习,而无需依赖大量的示范数据或巨量的仿真。在每次迭代中,算法通过求解二次规划构建机器人的局部逆模型和绳索模型,以将任务空间误差传播到动作更新中。我们在7种不同类型的绳索上评估了性能,包括链条、乳胶手术管、编织绳和扭绳,厚度范围为7-25毫米,密度为0.013-0.5 kg/m。学习在所有绳索上在10次试验内达到了100%的成功率。此外,该方法能够在大多数绳索类型之间成功转移,约需2-5次试验。
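The trial-to-trial update described in 2602.21302 can be illustrated with a scalar toy problem. This is a sketch under stated assumptions: the plant `true_task_outcome`, the model gain, and the damping term are all hypothetical, and the one-variable damped least-squares step stands in for the quadratic program mentioned in the abstract.

```python
def true_task_outcome(action):
    """Stand-in for a hardware trial; unknown to the learner (hypothetical plant)."""
    return 1.7 * action + 0.3


def ilc_update(action, error, model_gain, damping=0.1):
    """One trial-to-trial update.

    Minimizes (error - model_gain * delta)^2 + damping * delta^2 over delta,
    which has the closed-form solution below (a one-variable quadratic program).
    """
    delta = model_gain * error / (model_gain ** 2 + damping)
    return action + delta


target = 2.0
action = 0.0
model_gain = 1.0  # deliberately inaccurate simplified model
errors = []
for trial in range(10):
    error = target - true_task_outcome(action)
    errors.append(abs(error))
    action = ilc_update(action, error, model_gain)
```

Even though the model gain is wrong, the task-space error shrinks from trial to trial because the update only needs the model to point in roughly the right direction.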
cs.RO / 4 / 2602.21316
Unified Complementarity-Based Contact Modeling and Planning for Soft Robots
基于统一互补性的软机器人接触建模与规划
Abstract
Soft robots were introduced in large part to enable safe, adaptive interaction with the environment, and this interaction relies fundamentally on contact. However, modeling and planning contact-rich interactions for soft robots remain challenging: dense contact candidates along the body create redundant constraints and rank-deficient LCPs, while the disparity between high stiffness and low friction introduces severe ill-conditioning. Existing approaches rely on problem-specific approximations or penalty-based treatments. This letter presents a unified complementarity-based framework for soft-robot contact modeling and planning that brings contact modeling, manipulation, and planning into a unified, physically consistent formulation. We develop a robust Linear Complementarity Problem (LCP) model tailored to discretized soft robots and address these challenges with a three-stage conditioning pipeline: inertial rank selection to remove redundant contacts, Ruiz equilibration to correct scale disparity and ill-conditioning, and lightweight Tikhonov regularization on normal blocks. Building on the same formulation, we introduce a kinematically guided warm-start strategy that enables dynamic trajectory optimization through contact using Mathematical Programs with Complementarity Constraints (MPCC) and demonstrate its effectiveness on contact-rich ball manipulation tasks. In conclusion, CUSP provides a new foundation for unifying contact modeling, simulation, and planning in soft robotics.
Chinese Translation
软机器人的引入在很大程度上是为了实现与环境的安全、自适应交互,而这种交互基本上依赖于接触。然而,为软机器人建模和规划接触丰富的交互仍然具有挑战性:沿着身体的密集接触候选者会产生冗余约束和秩缺失的线性互补问题(LCP),而高刚度与低摩擦之间的差异则引入了严重的病态性。现有方法依赖于特定问题的近似或基于惩罚的处理。本文提出了一种基于统一互补性的软机器人接触建模与规划框架,将接触建模、操作和规划整合为统一的、物理上自洽的公式。我们开发了一种针对离散化软机器人的稳健线性互补问题(LCP)模型,并通过三阶段条件化流程解决这些挑战:惯性秩选择以去除冗余接触,Ruiz平衡以纠正尺度差异和病态性,以及对法向块的轻量级Tikhonov正则化。在相同的公式基础上,我们引入了一种运动学引导的热启动策略,使得通过接触实现动态轨迹优化成为可能,采用互补约束的数学规划(MPCC),并在接触丰富的球体操作任务中展示了其有效性。总之,CUSP为软机器人中的接触建模、仿真和规划提供了新的统一基础。
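Two of the conditioning stages named in 2602.21316, Ruiz equilibration and Tikhonov regularization, can be sketched on a small dense matrix. This is an illustrative sketch only: the 2x2 matrix is hypothetical, the regularization is applied to the whole diagonal rather than to normal blocks, and the rank-selection stage is omitted.

```python
import math


def ruiz_equilibrate(matrix, iterations=20):
    """Scale rows and columns so every row and column has max-abs entry near 1.

    Returns the scaled matrix plus the accumulated row and column scalings,
    so that scaled[i][j] == row_scale[i] * matrix[i][j] * col_scale[j].
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    scaled = [row[:] for row in matrix]
    row_scale = [1.0] * n_rows
    col_scale = [1.0] * n_cols
    for _ in range(iterations):
        row_norms = [max(abs(v) for v in row) for row in scaled]
        col_norms = [max(abs(scaled[i][j]) for i in range(n_rows)) for j in range(n_cols)]
        # Zero rows or columns are left unscaled to avoid division by zero.
        r = [1.0 / math.sqrt(x) if x > 0 else 1.0 for x in row_norms]
        c = [1.0 / math.sqrt(x) if x > 0 else 1.0 for x in col_norms]
        for i in range(n_rows):
            for j in range(n_cols):
                scaled[i][j] *= r[i] * c[j]
        row_scale = [a * b for a, b in zip(row_scale, r)]
        col_scale = [a * b for a, b in zip(col_scale, c)]
    return scaled, row_scale, col_scale


def add_tikhonov(matrix, epsilon):
    """Add epsilon to the diagonal (regularizes a rank-deficient square matrix)."""
    return [[v + (epsilon if i == j else 0.0) for j, v in enumerate(row)]
            for i, row in enumerate(matrix)]


# Hypothetical badly scaled matrix: stiff entries next to small friction-like entries.
A = [[1.0e4, 1.0], [1.0, 1.0e-2]]
A_scaled, row_scale, col_scale = ruiz_equilibrate(A)
A_regularized = add_tikhonov(A_scaled, 1.0e-6)
```

After equilibration the entries span one order of magnitude instead of six, which is the kind of scale correction the abstract attributes to this stage.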
cs.RO / 5 / 2602.21331
CableRobotGraphSim: A Graph Neural Network for Modeling Partially Observable Cable-Driven Robot Dynamics
CableRobotGraphSim:一种用于建模部分可观测电缆驱动机器人动态的图神经网络
Abstract
General-purpose simulators have accelerated the development of robots. Traditional simulators based on first principles, however, typically require full-state observability or depend on parameter search for system identification. This work presents CableRobotGraphSim, a novel Graph Neural Network (GNN) model for cable-driven robots that aims to address shortcomings of prior simulation solutions. By representing cable-driven robots as graphs, with the rigid bodies as nodes and the cables and contacts as edges, this model can quickly and accurately match the properties of other simulation models and real robots, while ingesting only partially observable inputs. Accompanying the GNN model is a sim-and-real co-training procedure that promotes generalization and robustness to noisy real data. This model is further integrated with a Model Predictive Path Integral (MPPI) controller for closed-loop navigation, which showcases the model's speed and accuracy.
Chinese Translation
通用模拟器加速了机器人技术的发展。然而,基于第一原理的传统模拟器通常需要完全的状态可观测性或依赖参数搜索进行系统识别。本研究提出了 CableRobotGraphSim,一种新颖的图神经网络(GNN)模型,旨在解决先前模拟解决方案的不足。通过将电缆驱动机器人表示为图,其中刚体作为节点,电缆和接触作为边,该模型能够快速且准确地匹配其他模拟模型和真实机器人的特性,同时仅需输入部分可观测的数据。与GNN模型相伴的是一种模拟与真实数据共同训练的程序,促进了对噪声真实数据的泛化和鲁棒性。该模型进一步与模型预测路径积分(MPPI)控制器集成,用于闭环导航,展示了模型的速度和准确性。
cs.RO / 6 / 2602.21366
Environment-Aware Learning of Smooth GNSS Covariance Dynamics for Autonomous Racing
环境感知的平滑GNSS协方差动态学习用于自主赛车
Abstract
Ensuring accurate and stable state estimation is a challenging task crucial to safety-critical domains such as high-speed autonomous racing, where measurement uncertainty must be both adaptive to the environment and temporally smooth for control. In this work, we develop a learning-based framework, LACE, capable of directly modeling the temporal dynamics of GNSS measurement covariance. We model the covariance evolution as an exponentially stable dynamical system where a deep neural network (DNN) learns to predict the system's process noise from environmental features through an attention mechanism. By using contraction-based stability and systematically imposing spectral constraints, we formally provide guarantees of exponential stability and smoothness for the resulting covariance dynamics. We validate our approach on an AV-24 autonomous racecar, demonstrating improved localization performance and smoother covariance estimates in challenging, GNSS-degraded environments. Our results highlight the promise of dynamically modeling the perceived uncertainty in state estimation problems that are tightly coupled with control sensitivity.
Chinese Translation
确保准确和稳定的状态估计是一项具有挑战性的任务,对于高速自主赛车等安全关键领域至关重要,在这些领域中,测量不确定性必须适应环境并在时间上保持平滑以便于控制。在本研究中,我们开发了一个基于学习的框架LACE,能够直接建模GNSS测量协方差的时间动态。我们将协方差演变建模为一个指数稳定的动态系统,其中深度神经网络(DNN)通过注意机制从环境特征中学习预测系统的过程噪声。通过使用收缩型稳定性并系统性地施加谱约束,我们正式提供了对所得到的协方差动态的指数稳定性和平滑性的保证。我们在AV-24自主赛车上验证了我们的方法,展示了在具有挑战性的GNSS退化环境中改进的定位性能和更平滑的协方差估计。我们的结果突显了在与控制敏感性紧密耦合的状态估计问题中动态建模感知不确定性的前景。
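The idea of an exponentially stable covariance model driven by a learned input (2602.21366) can be illustrated with a scalar recursion. This is a sketch under assumptions: `predicted_noise` is a hypothetical stand-in for the network, the feature values are made up, and the scalar decay factor plays the role of the spectral constraint; the paper's actual parameterization is not given in the abstract.

```python
def predicted_noise(environment_feature):
    """Hypothetical stand-in for the learned network: maps a feature to a variance input."""
    return 0.05 + 0.5 * environment_feature


def step_covariance(variance, environment_feature, decay=0.9):
    """Advance the scalar covariance state by one step.

    With 0 <= decay < 1 the map is a contraction in `variance`, so two
    trajectories driven by the same inputs approach each other geometrically.
    """
    return decay * variance + (1.0 - decay) * predicted_noise(environment_feature)


features = [0.0, 0.0, 1.0, 1.0, 1.0, 0.2, 0.2, 0.2, 0.2, 0.2]  # e.g. sky-occlusion score
traj_a, traj_b = [0.01], [4.0]  # two different initial variances
for f in features:
    traj_a.append(step_covariance(traj_a[-1], f))
    traj_b.append(step_covariance(traj_b[-1], f))

gap_start = abs(traj_a[0] - traj_b[0])
gap_end = abs(traj_a[-1] - traj_b[-1])
```

The gap between the two trajectories shrinks by the decay factor at every step regardless of the inputs, and a jump in the feature moves the variance only gradually, which is the smoothness property the abstract emphasizes.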
cs.RO / 7 / 2602.21389
Autonomous Sea Turtle Robot for Marine Fieldwork
用于海洋实地工作的自主海龟机器人
Abstract
Autonomous robots can transform how we observe marine ecosystems, but close-range operation in reefs and other cluttered habitats remains difficult. Vehicles must maneuver safely near animals and fragile structures while coping with currents, variable illumination and limited sensing. Previous approaches simplify these problems by leveraging soft materials and bioinspired swimming designs, but such platforms remain limited in terms of deployable autonomy. Here we present a sea turtle-inspired autonomous underwater robot that closes the gap between bioinspired locomotion and field-ready autonomy through a tightly integrated, vision-driven control stack. The robot combines robust depth-heading stabilization with obstacle avoidance and target-centric control, enabling it to track and interact with moving objects in complex terrain. We validate the robot in controlled pool experiments and in a live coral reef exhibit at the New England Aquarium, demonstrating stable operation and reliable tracking of fast-moving marine animals and human divers. To the best of our knowledge, this is the first integrated biomimetic robotic system, combining novel hardware, control, and field experiments, deployed to track and monitor real marine animals in their natural environment. During off-tether experiments, we demonstrate safe navigation around obstacles (91% success rate in the aquarium exhibit) and introduce a low-compute onboard tracking mode. Together, these results establish a practical route toward soft-rigid hybrid, bioinspired underwater robots capable of minimally disruptive exploration and close-range monitoring in sensitive ecosystems.
Chinese Translation
自主机器人可以改变我们观察海洋生态系统的方式,但在珊瑚礁和其他杂乱栖息地进行近距离操作仍然困难。车辆必须在动物和脆弱结构附近安全机动,同时应对水流、光照变化和有限的感知能力。以往的方法通过利用柔性材料和仿生游泳设计来简化这些问题,但此类平台在可部署的自主性方面仍然有限。在此,我们提出了一种受海龟启发的自主水下机器人,通过紧密集成的视觉驱动控制系统,弥合了仿生运动与实地自主性之间的差距。该机器人结合了稳健的深度-航向稳定性、障碍物规避和以目标为中心的控制,使其能够在复杂地形中跟踪和与移动物体互动。我们在受控水池实验和新英格兰水族馆的活珊瑚礁展览中验证了该机器人,展示了其稳定的操作和对快速移动的海洋动物及潜水员的可靠跟踪。据我们所知,这是第一个集成的仿生机器人系统,结合了新颖的硬件、控制和实地实验,旨在跟踪和监测自然环境中的真实海洋动物。在无缆实验中,我们展示了在障碍物周围安全导航(在水族馆展览中的成功率为91%),并引入了一种低计算量的机载跟踪模式。这些结果共同确立了一条实用的路线,朝着能够在敏感生态系统中进行最小干扰探索和近距离监测的软-刚性混合仿生水下机器人迈进。
cs.RO / 8 / 2602.21418
Event-Driven On-Sensor Locomotion Mode Recognition Using a Shank-Mounted IMU with Embedded Machine Learning for Exoskeleton Control
基于事件驱动的传感器内运动模式识别:使用嵌入式机器学习的胫部安装IMU进行外骨骼控制
Abstract
This work presents a wearable human activity recognition (HAR) system that performs real-time inference directly inside a shank-mounted inertial measurement unit (IMU) to support low-latency control of a lower-limb exoskeleton. Unlike conventional approaches that continuously stream raw inertial data to a microcontroller for classification, the proposed system executes activity recognition at the sensor level using the embedded Machine Learning Core (MLC) of the STMicroelectronics LSM6DSV16X IMU, allowing the host microcontroller to remain in a low-power state and read only the recognized activity label from IMU registers. While the system generalizes to multiple human activities, this paper focuses on three representative locomotion modes - stance, level walking, and stair ascent - using data collected from adult participants. A lightweight decision-tree model was configured and deployed for on-sensor execution using ST MEMS Studio, enabling continuous operation without custom machine learning code on the microcontroller. During operation, the IMU asserts an interrupt when motion or a new classification is detected; the microcontroller wakes, reads the MLC output registers, and forwards the inferred mode to the exoskeleton controller. This interrupt-driven, on-sensor inference architecture reduces computation and communication overhead while preserving battery energy and improving robustness in distinguishing level walking from stair ascent for torque-assist control.
Chinese Translation
本研究提出了一种可穿戴的人体活动识别(HAR)系统,该系统能够在胫部安装的惯性测量单元(IMU)内部实时推断,以支持下肢外骨骼的低延迟控制。与传统方法不断将原始惯性数据流传输到微控制器进行分类不同,所提出的系统利用意法半导体(STMicroelectronics)LSM6DSV16X IMU的嵌入式机器学习核心(MLC)在传感器级别执行活动识别,从而使主微控制器保持低功耗状态,仅从IMU寄存器中读取识别到的活动标签。尽管该系统能够推广到多种人类活动,但本文重点关注三种代表性的运动模式——站立、平地行走和楼梯上升——使用从成年参与者收集的数据。配置并部署了一个轻量级决策树模型,以便在传感器上执行,使用ST MEMS Studio,使得在微控制器上无需自定义机器学习代码即可实现持续运行。在操作过程中,当检测到运动或新的分类时,IMU会发出中断信号;微控制器被唤醒,读取MLC输出寄存器,并将推断的模式转发给外骨骼控制器。这种基于中断的传感器内推断架构减少了计算和通信开销,同时保持电池能量,提高了在扭矩辅助控制中区分平地行走与楼梯上升的鲁棒性。
cs.RO / 9 / 2602.21445
VLA Knows Its Limits
VLA 知道自己的极限
Abstract
Action chunking has recently emerged as a standard practice in flow-based Vision-Language-Action (VLA) models. However, the effect and choice of the execution horizon - the number of actions to be executed from each predicted chunk - remain underexplored. In this work, we first show that varying the execution horizon leads to substantial performance deviations, with performance initially improving and then declining as the horizon increases. To uncover the reasons, we analyze the cross- and self-attention weights in flow-based VLAs and reveal two key phenomena: (i) intra-chunk actions attend invariantly to vision-language tokens, limiting adaptability to environmental changes; and (ii) the initial and terminal action tokens serve as stable anchors, forming latent centers around which intermediate actions are organized. Motivated by these insights, we interpret action self-attention weights as a proxy for the model's predictive limit and propose AutoHorizon, the first test-time method that dynamically estimates the execution horizon for each predicted action chunk to adapt to changing perceptual conditions. Across simulated and real-world robotic manipulation tasks, AutoHorizon is performant, incurs negligible computational overhead, and generalizes across diverse tasks and flow-based models.
Chinese Translation
动作分块最近已成为基于流的视觉-语言-动作(VLA)模型中的标准实践。然而,执行视野的影响和选择——即从每个预测分块中执行的动作数量——仍然未被充分探索。在本研究中,我们首先展示了变化的执行视野会导致显著的性能偏差,随着视野的增加,性能最初提高然后下降。为了揭示原因,我们分析了基于流的 VLA 中的交叉和自注意力权重,并揭示了两个关键现象:(i)分块内的动作对视觉-语言标记的关注是恒定的,限制了对环境变化的适应性;(ii)初始和终止动作标记作为稳定的锚点,形成潜在中心,围绕其组织中间动作。基于这些见解,我们将动作自注意力权重解释为模型预测极限的代理,并提出了 AutoHorizon,这是首个在测试时动态估计每个预测动作分块的执行视野的方法,以适应变化的感知条件。在模拟和真实世界的机器人操作任务中,AutoHorizon 表现出色,计算开销微乎其微,并且在多样化的任务和基于流的模型中具有良好的泛化能力。
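To make the term "execution horizon" in 2602.21445 concrete, the sketch below runs a chunked policy with a fixed horizon and counts how often the model is queried. `predict_chunk` is a hypothetical placeholder for a VLA policy; the attention-based horizon estimate that AutoHorizon proposes is not reproduced here.

```python
def predict_chunk(observation, chunk_size):
    """Hypothetical stand-in for a VLA policy: returns `chunk_size` actions."""
    return [(observation, k) for k in range(chunk_size)]


def rollout(total_steps, chunk_size, execution_horizon):
    """Execute only the first `execution_horizon` actions of each chunk, then re-predict."""
    if not 1 <= execution_horizon <= chunk_size:
        raise ValueError("execution_horizon must lie in [1, chunk_size]")
    executed, model_calls, step = [], 0, 0
    while step < total_steps:
        chunk = predict_chunk(observation=step, chunk_size=chunk_size)
        model_calls += 1
        for action in chunk[:execution_horizon]:
            if step >= total_steps:
                break
            executed.append(action)
            step += 1
    return executed, model_calls


_, calls_short = rollout(total_steps=40, chunk_size=8, execution_horizon=2)
_, calls_long = rollout(total_steps=40, chunk_size=8, execution_horizon=8)
```

A short horizon re-observes the scene often at the cost of more model calls, while a long horizon commits to stale predictions; the abstract's finding is that performance peaks somewhere between these extremes.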
cs.RO / 10 / 2602.21450
Constructive Vector Fields for Path Following in Fully-Actuated Systems on Matrix Lie Groups
矩阵李群上全驱动系统路径跟踪的构造性向量场
Abstract
This paper presents a novel vector field strategy for controlling fully-actuated systems on connected matrix Lie groups, ensuring convergence to and traversal along a curve defined on the group. Our approach generalizes our previous work (Rezende et al., 2022) and reduces to it when considering the Lie group of translations in Euclidean space. Since the proofs in Rezende et al. (2022) rely on key properties such as the orthogonality between the convergent and traversal components, we extend these results by leveraging Lie group properties. These properties also allow the control input to be non-redundant, meaning it matches the dimension of the Lie group, rather than the potentially larger dimension of the space in which the group is embedded. This can lead to more practical control inputs in certain scenarios. A particularly notable application of our strategy is in controlling systems on SE(3) -- in this case, the non-redundant input corresponds to the object's mechanical twist -- making it well-suited for controlling objects that can move and rotate freely, such as omnidirectional drones. In this case, we provide an efficient algorithm to compute the vector field. We experimentally validate the proposed method using a robotic manipulator to demonstrate its effectiveness.
Chinese Translation
本文提出了一种新颖的向量场策略,用于控制连接矩阵李群上的全驱动系统,确保系统收敛到并沿着定义在该群上的曲线进行遍历。我们的方法推广了我们之前的工作(Rezende et al., 2022),并在考虑欧几里得空间中的平移李群时简化为该工作。由于Rezende et al.(2022)中的证明依赖于收敛分量和遍历分量之间的正交性等关键性质,我们通过利用李群的性质扩展了这些结果。这些性质还允许控制输入是非冗余的,意味着它的维度与李群的维度相匹配,而不是与嵌入该群的空间的潜在更大维度相匹配。这在某些情况下可以导致更实用的控制输入。我们策略的一个特别显著的应用是在SE(3)上的系统控制——在这种情况下,非冗余输入对应于物体的机械扭转——使其非常适合控制可以自由移动和旋转的物体,如全向无人机。在这种情况下,我们提供了一种高效的算法来计算向量场。我们使用机器人操纵器对所提出的方法进行了实验验证,以展示其有效性。
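The decomposition into a convergent and a traversal component mentioned in 2602.21450 can be illustrated in the simplest Euclidean setting, a circle in the plane. This is a sketch of the general idea under stated assumptions (unit-speed tangent, proportional gain on distance); it is not the Lie-group construction of the paper.

```python
import math


def circle_vector_field(x, y, radius=1.0, gain=2.0):
    """Return (convergent, traversal) components at point (x, y), away from the origin."""
    r = math.hypot(x, y)
    if r == 0.0:
        raise ValueError("field is undefined at the circle's center")
    radial = (x / r, y / r)       # unit vector pointing away from the center
    tangent = (-y / r, x / r)     # unit vector along counterclockwise traversal
    distance = r - radius         # signed distance to the circle
    convergent = (-gain * distance * radial[0], -gain * distance * radial[1])
    traversal = tangent
    return convergent, traversal


conv, trav = circle_vector_field(2.0, 1.0)
dot = conv[0] * trav[0] + conv[1] * trav[1]
velocity = (conv[0] + trav[0], conv[1] + trav[1])
```

The two components are orthogonal by construction, which is the property the abstract says the earlier proofs rely on; on the circle itself the convergent part vanishes and only traversal remains.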
cs.RO / 11 / 2602.21531
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
LiLo-VLA:通过链接对象中心策略实现组合性长时间操作
Abstract
General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end-to-end approaches. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%. Project page: https://yy-gx.github.io/LiLo-VLA/.
Chinese Translation
通用机器人必须掌握长时间操作,这被定义为在非结构化环境中涉及多个运动结构变化的任务(例如,连接或断开物体)。尽管视觉-语言-动作(VLA)模型提供了掌握多样化原子技能的潜力,但它们在对这些技能进行序列组合时面临组合复杂性的问题,并且由于对环境的敏感性而容易出现级联故障。为了解决这些挑战,我们提出了LiLo-VLA(链接局部VLA),这是一个模块化框架,能够在从未接受过训练的情况下对新颖的长时间任务进行零样本泛化。我们的方法将运输与交互解耦:一个到达模块处理全局运动,而一个交互模块利用对象中心的VLA来处理感兴趣的孤立物体,确保对无关视觉特征的鲁棒性和对空间配置的不变性。关键是,这种模块化使得通过动态重新规划和技能重用实现鲁棒的故障恢复,从而有效减轻了端到端方法中常见的级联错误。我们引入了一个包含21个任务的仿真基准,涵盖两个具有挑战性的套件:LIBERO-Long++和Ultra-Long。在这些仿真中,LiLo-VLA实现了69%的平均成功率,分别比Pi0.5高出41%和比OpenVLA-OFT高出67%。此外,在8个长时间任务的现实世界评估中,LiLo-VLA的平均成功率达到了85%。项目页面:https://yy-gx.github.io/LiLo-VLA/
cs.RO / 12 / 2602.21583
Learning Agile and Robust Omnidirectional Aerial Motion on Overactuated Tiltable-Quadrotors
学习灵活且稳健的全向空中运动控制在过驱动可倾式四旋翼上的应用
Abstract
Tilt-rotor aerial robots enable omnidirectional maneuvering through thrust vectoring, but introduce significant control challenges due to the strong coupling between joint and rotor dynamics. While model-based controllers can achieve high motion accuracy under nominal conditions, their robustness and responsiveness often degrade in the presence of disturbances and modeling uncertainties. This work investigates reinforcement learning for omnidirectional aerial motion control on over-actuated tiltable quadrotors that prioritizes robustness and agility. We present a learning-based control framework that enables efficient acquisition of coordinated rotor-joint behaviors for reaching target poses in the SE(3) space. To achieve reliable sim-to-real transfer while preserving motion accuracy, we integrate system identification with minimal and physically consistent domain randomization. Compared with a state-of-the-art NMPC controller, the proposed method achieves comparable six-degree-of-freedom pose tracking accuracy, while demonstrating superior robustness and generalization across diverse tasks, enabling zero-shot deployment on real hardware.
Chinese Translation
倾转旋翼空中机器人通过推力矢量实现全向机动,但由于关节和旋翼动力学之间的强耦合,带来了显著的控制挑战。虽然基于模型的控制器在正常条件下能够实现高运动精度,但在存在干扰和建模不确定性的情况下,其稳健性和响应性往往会降低。本研究探讨了强化学习在过驱动可倾式四旋翼上的全向空中运动控制中的应用,重点关注稳健性和灵活性。我们提出了一种基于学习的控制框架,使得能够高效获取协调的旋翼-关节行为,以达到SE(3)空间中的目标姿态。为了实现可靠的仿真到现实的迁移,同时保持运动精度,我们将系统识别与最小化和物理一致的领域随机化相结合。与最先进的非线性模型预测控制(NMPC)控制器相比,所提方法实现了可比的六自由度姿态跟踪精度,同时在多样化任务中展现出更优的稳健性和泛化能力,使得在真实硬件上实现零样本部署成为可能。
cs.RO / 13 / 2602.21595
SPOC: Safety-Aware Planning Under Partial Observability And Physical Constraints
SPOC:考虑安全的部分可观测性与物理约束下的规划
Abstract
Embodied Task Planning with large language models faces safety challenges in real-world environments, where partial observability and physical constraints must be respected. Existing benchmarks often overlook these critical factors, limiting their ability to evaluate both feasibility and safety. We introduce SPOC, a benchmark for safety-aware embodied task planning, which integrates strict partial observability, physical constraints, step-by-step planning, and goal-condition-based evaluation. Covering diverse household hazards such as fire, fluid, injury, object damage, and pollution, SPOC enables rigorous assessment through both state and constraint-based online metrics. Experiments with state-of-the-art LLMs reveal that current models struggle to ensure safety-aware planning, particularly under implicit constraints. Code and dataset are available at https://github.com/khm159/SPOC
Chinese Translation
具身任务规划与大型语言模型在现实环境中面临安全挑战,必须遵循部分可观测性和物理约束。现有基准测试常常忽视这些关键因素,限制了其评估可行性和安全性的能力。我们提出了SPOC,一个针对安全意识的具身任务规划基准,集成了严格的部分可观测性、物理约束、逐步规划和基于目标条件的评估。SPOC涵盖了多种家庭危险,如火灾、液体、伤害、物体损坏和污染,通过状态和基于约束的在线指标进行严格评估。与最先进的LLM进行的实验表明,当前模型在确保安全意识规划方面面临挑战,尤其是在隐性约束下。代码和数据集可在https://github.com/khm159/SPOC获取。
cs.RO / 14 / 2602.21599
Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control
迭代闭环运动合成以提升类人控制的能力
Abstract
Physics-based humanoid control relies on training with motion datasets that have diverse data distributions. However, the fixed difficulty distribution of datasets limits the performance ceiling of the trained control policies. Additionally, the method of acquiring high-quality data through professional motion capture systems is constrained by costs, making it difficult to achieve large-scale scalability. To address these issues, we propose a closed-loop, iterative framework for automated motion data generation. It can generate high-quality motion data with rich action semantics, including martial arts, dance, combat, sports, gymnastics, and more. Furthermore, our framework enables difficulty iteration of policies and data through physical metrics and objective evaluations, allowing the trained tracker to break through its original difficulty limits. On the PHC single-primitive tracker, using only approximately 1/10 of the AMASS dataset size, the average failure rate on the test set (2201 clips) is reduced by 45% compared to the baseline. Finally, we conduct comprehensive ablation and comparative experiments to highlight the rationality and advantages of our framework.
Chinese Translation
基于物理的类人控制依赖于具有多样化数据分布的运动数据集进行训练。然而,数据集固定的难度分布限制了训练控制策略的性能上限。此外,通过专业运动捕捉系统获取高质量数据的方法受到成本的限制,使得实现大规模扩展变得困难。为了解决这些问题,我们提出了一种闭环自动运动数据生成和迭代框架。该框架能够生成具有丰富动作语义的高质量运动数据,包括武术、舞蹈、战斗、体育、体操等。此外,我们的框架通过物理指标和客观评估实现策略和数据的难度迭代,使得训练的跟踪器能够突破其原有的难度限制。在PHC单原语跟踪器上,仅使用约1/10的AMASS数据集大小,测试集(2201个片段)的平均失败率相比基线降低了45%。最后,我们进行了全面的消融和对比实验,以突出我们框架的合理性和优势。
cs.RO / 15 / 2602.21612
Jumping Control for a Quadrupedal Wheeled-Legged Robot via NMPC and DE Optimization
基于非线性模型预测控制和差分进化优化的四足轮腿机器人跳跃控制
Abstract
Quadrupedal wheeled-legged robots combine the advantages of legged and wheeled locomotion to achieve superior mobility, but executing dynamic jumps remains a significant challenge due to the additional degrees of freedom introduced by wheeled legs. This paper develops a mini-sized wheeled-legged robot for agile motion and presents a novel motion control framework that integrates the Nonlinear Model Predictive Control (NMPC) for locomotion and the Differential Evolution (DE) based trajectory optimization for jumping in quadrupedal wheeled-legged robots. The proposed controller utilizes wheel motion and locomotion to enhance jumping performance, achieving versatile maneuvers such as vertical jumping, forward jumping, and backflips. Extensive simulations and real-world experiments validate the effectiveness of the framework, demonstrating a forward jump over a 0.12 m obstacle and a vertical jump reaching 0.5 m.
Chinese Translation
四足轮腿机器人结合了腿部和轮式运动的优势,实现了卓越的机动性,但由于轮腿引入的额外自由度,执行动态跳跃仍然是一个重大挑战。本文开发了一种迷你型轮腿机器人以实现灵活运动,并提出了一种新颖的运动控制框架,该框架集成了用于运动的非线性模型预测控制(NMPC)和基于差分进化(DE)的跳跃轨迹优化。所提出的控制器利用轮子运动和行走运动来增强跳跃性能,实现了多种灵活的动作,如垂直跳跃、向前跳跃和后空翻。大量的仿真和实地实验验证了该框架的有效性,展示了在0.12米障碍物上方的向前跳跃和达到0.5米的垂直跳跃。
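As a rough illustration of Differential Evolution applied to a jump parameter (2602.21612), the sketch below searches for a vertical takeoff speed that reaches the 0.5 m apex reported in the abstract. Everything else is an assumption: the point-mass ballistic model, the single decision variable, the cost function, and the DE settings are chosen for brevity and do not reflect the paper's trajectory parameterization.

```python
import random

GRAVITY = 9.81  # m/s^2


def apex_height(takeoff_speed):
    """Ballistic apex height for a vertical takeoff speed (point-mass assumption)."""
    return takeoff_speed ** 2 / (2.0 * GRAVITY)


def cost(takeoff_speed, target_height=0.5):
    """Squared error between the ballistic apex and the target height."""
    return (apex_height(takeoff_speed) - target_height) ** 2


def differential_evolution(objective, low, high, pop_size=20, generations=100,
                           mutation=0.6, seed=0):
    """DE/rand/1 for a single bounded decision variable with greedy selection."""
    rng = random.Random(seed)
    population = [rng.uniform(low, high) for _ in range(pop_size)]
    for _ in range(generations):
        for i in range(pop_size):
            others = [p for j, p in enumerate(population) if j != i]
            a, b, c = rng.sample(others, 3)
            # Mutant = base member plus a scaled difference, clipped to the bounds.
            trial = min(high, max(low, a + mutation * (b - c)))
            if objective(trial) <= objective(population[i]):
                population[i] = trial
    return min(population, key=objective)


best_speed = differential_evolution(cost, low=0.0, high=10.0)
```

With one decision variable the crossover step of DE is trivial and is therefore omitted; a multi-parameter trajectory would add binomial crossover between the mutant and the current member.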
cs.RO / 16 / 2602.21622
ADM-DP: Adaptive Dynamic Modality Diffusion Policy through Vision-Tactile-Graph Fusion for Multi-Agent Manipulation
ADM-DP:通过视觉-触觉-图融合的自适应动态模态扩散策略用于多智能体操控
Abstract
Multi-agent robotic manipulation remains challenging due to the combined demands of coordination, grasp stability, and collision avoidance in shared workspaces. To address these challenges, we propose the Adaptive Dynamic Modality Diffusion Policy (ADM-DP), a framework that integrates vision, tactile, and graph-based (multi-agent pose) modalities for coordinated control. ADM-DP introduces four key innovations. First, an enhanced visual encoder merges RGB and point-cloud features via Feature-wise Linear Modulation (FiLM) to enrich perception. Second, a tactile-guided grasping strategy uses Force-Sensitive Resistor (FSR) feedback to detect insufficient contact and trigger corrective grasp refinement, improving grasp stability. Third, a graph-based collision encoder leverages shared tool center point (TCP) positions of multiple agents as structured kinematic context to maintain spatial awareness and reduce inter-agent interference. Fourth, an Adaptive Modality Attention Mechanism (AMAM) dynamically re-weights modalities according to task context, enabling flexible fusion. For scalability and modularity, a decoupled training paradigm is employed in which agents learn independent policies while sharing spatial information. This maintains low interdependence between agents while retaining collective awareness. Across seven multi-agent tasks, ADM-DP achieves 12-25% performance gains over state-of-the-art baselines. Ablation studies show the greatest improvements in tasks requiring multiple sensory modalities, validating our adaptive fusion strategy and demonstrating its robustness for diverse manipulation scenarios.
Chinese Translation
多智能体机器人操控因协调、抓取稳定性和共享工作空间中的碰撞避免等多重需求而面临挑战。为了解决这些问题,我们提出了自适应动态模态扩散策略(Adaptive Dynamic Modality Diffusion Policy,ADM-DP),该框架整合了视觉、触觉和基于图的(多智能体姿态)模态以实现协调控制。ADM-DP引入了四项关键创新。首先,增强的视觉编码器通过特征线性调制(Feature-wise Linear Modulation,FiLM)融合RGB和点云特征,以丰富感知。其次,触觉引导的抓取策略利用力敏电阻(Force-Sensitive Resistor,FSR)反馈来检测接触不足并触发纠正抓取的细化,从而提高抓取稳定性。第三,基于图的碰撞编码器利用多个智能体的共享工具中心点(Tool Center Point,TCP)位置作为结构化运动学上下文,以维持空间意识并减少智能体间的干扰。第四,自适应模态注意机制(Adaptive Modality Attention Mechanism,AMAM)根据任务上下文动态重新加权模态,实现灵活融合。为了实现可扩展性和模块化,采用了一个解耦训练范式,其中智能体学习独立策略,同时共享空间信息。这在保持智能体间低依赖性的同时,保留了集体意识。在七个多智能体任务中,ADM-DP相较于最先进的基线实现了12-25%的性能提升。消融研究表明,在需要多种感官模态的任务中,性能提升最为显著,验证了我们的自适应融合策略,并展示了其在多样化操控场景中的鲁棒性。
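Two building blocks named in 2602.21622, FiLM conditioning and attention-style re-weighting of modalities, can be sketched with plain lists. The embeddings, the relevance scores, and the softmax-weighted sum are illustrative assumptions; in the paper these quantities would be produced by learned networks whose structure the abstract does not detail.

```python
import math


def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each feature channel."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def softmax(scores):
    """Normalize scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(embeddings, relevance_scores):
    """Weighted sum of equal-length modality embeddings using softmax weights."""
    names = list(embeddings)
    weights = softmax([relevance_scores[name] for name in names])
    dim = len(embeddings[names[0]])
    fused = [sum(w * embeddings[name][d] for w, name in zip(weights, names))
             for d in range(dim)]
    return fused, dict(zip(names, weights))


# Hypothetical 2-D embeddings; real ones would come from learned encoders.
rgb = [1.0, 2.0]
modulated_rgb = film(rgb, gamma=[0.5, 2.0], beta=[0.1, -1.0])  # conditioned on point cloud
embeddings = {"vision": modulated_rgb, "tactile": [0.0, 1.0], "graph": [1.0, 1.0]}
scores = {"vision": 0.0, "tactile": 2.0, "graph": 0.0}  # e.g. a contact-heavy phase
fused, weights = fuse_modalities(embeddings, scores)
```

Raising the tactile score shifts most of the weight onto the tactile embedding, which mirrors the context-dependent re-weighting the abstract describes.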
cs.RO / 17 / 2602.21625
Tacmap: Bridging the Tactile Sim-to-Real Gap via Geometry-Consistent Penetration Depth Map
Tacmap:通过几何一致的穿透深度图弥合触觉模拟与现实之间的差距
Abstract
Vision-Based Tactile Sensors (VBTS) are essential for achieving dexterous robotic manipulation, yet the tactile sim-to-real gap remains a fundamental bottleneck. Current tactile simulations suffer from a persistent dilemma: simplified geometric projections lack physical authenticity, while high-fidelity Finite Element Methods (FEM) are too computationally prohibitive for large-scale reinforcement learning. In this work, we present Tacmap, a high-fidelity, computationally efficient tactile simulation framework anchored in volumetric penetration depth. Our key insight is to bridge the tactile sim-to-real gap by unifying both domains through a shared deform map representation. Specifically, we compute 3D intersection volumes as depth maps in simulation, while in the real world, we employ an automated data-collection rig to learn a robust mapping from raw tactile images to ground-truth depth maps. By aligning simulation and real-world in this unified geometric space, Tacmap minimizes domain shift while maintaining physical consistency. Quantitative evaluations across diverse contact scenarios demonstrate that Tacmap's deform maps closely mirror real-world measurements. Moreover, we validate the utility of Tacmap through an in-hand rotation task, where a policy trained exclusively in simulation achieves zero-shot transfer to a physical robot.
Chinese Translation
基于视觉的触觉传感器(VBTS)对于实现灵巧的机器人操作至关重要,但触觉模拟与现实之间的差距仍然是一个根本性瓶颈。目前的触觉模拟面临着一个持续的困境:简化的几何投影缺乏物理真实性,而高保真度的有限元方法(FEM)在大规模强化学习中计算成本过高。在本研究中,我们提出了Tacmap,一个基于体积穿透深度的高保真、计算高效的触觉模拟框架。我们的关键见解是通过共享的变形图表示将触觉模拟与现实统一,从而弥合两者之间的差距。具体而言,我们在模拟中计算3D交集体积作为深度图,而在现实世界中,我们使用自动化数据采集装置学习从原始触觉图像到真实深度图的稳健映射。通过在这一统一的几何空间中对齐模拟与现实,Tacmap最小化了领域转移,同时保持物理一致性。在多种接触场景下的定量评估表明,Tacmap的变形图与现实世界的测量结果高度吻合。此外,我们通过一个手中旋转任务验证了Tacmap的实用性,其中仅在模拟中训练的策略实现了对物理机器人的零样本迁移。
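A penetration depth map of the kind 2602.21625 builds on can be sketched for the simplest contact geometry, a sphere pressed into a flat gel surface. The grid size, units, and the sphere-on-plane setup are assumptions for illustration; the paper computes intersection volumes for general geometry inside a simulator.

```python
import math


def penetration_depth_map(center_x, center_y, center_z, radius, size=5, pixel=1.0):
    """Depth of a sphere below the undeformed gel surface (the plane z = 0), per pixel.

    Pixels are sampled on a size x size grid centered at the origin; entries are
    zero where the sphere does not reach below the surface.
    """
    half = (size - 1) / 2.0
    depth = []
    for row in range(size):
        y = (row - half) * pixel
        line = []
        for col in range(size):
            x = (col - half) * pixel
            d2 = (x - center_x) ** 2 + (y - center_y) ** 2
            if d2 >= radius ** 2:
                line.append(0.0)  # pixel lies outside the sphere's footprint
                continue
            lowest_z = center_z - math.sqrt(radius ** 2 - d2)  # sphere's lower surface
            line.append(max(0.0, -lowest_z))
        depth.append(line)
    return depth


# Hypothetical contact: a sphere of radius 3 mm pressed 1 mm into the gel (units: mm).
depth_map = penetration_depth_map(0.0, 0.0, center_z=2.0, radius=3.0)
```

The resulting grid is deepest under the sphere's center and falls to zero where the sphere stays above the surface, giving a geometry-only image that a real sensor's output could be mapped onto.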
cs.RO / 18 / 2602.21633
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
自我修正的视觉-语言-动作(VLA):通过稀疏世界想象进行在线动作精炼
Abstract
Standard vision-language-action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling, lacking explicit mechanisms for self-improvement. To solve these problems, we propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads to forecast current task progress and future trajectory trends, thereby constraining the policy to encode short-term physical evolution. Then we introduce the online action refinement module to reshape progress-dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines, alongside a 14% gain in real-world experiments. Code is available at https://github.com/Kisaragi0/SC-VLA.
Chinese Translation
标准的视觉-语言-动作(VLA)模型依赖于拟合统计数据先验,这限制了它们对潜在物理动态的稳健理解。强化学习通过探索增强了物理基础,但通常依赖于与智能体内部状态隔离的外部奖励信号。世界动作模型作为一种有前景的范式,整合了想象与控制,以实现预测性规划。然而,它们依赖于隐式上下文建模,缺乏自我改进的显式机制。为了解决这些问题,我们提出了自我修正的视觉-语言-动作(SC-VLA),通过稀疏想象内在地指导动作精炼,从而实现自我改进。我们首先通过整合辅助预测头设计稀疏世界想象,以预测当前任务进展和未来轨迹趋势,从而限制策略对短期物理演变的编码。然后,我们引入在线动作精炼模块,以重塑依赖进展的密集奖励,根据预测的稀疏未来状态调整轨迹方向。在来自仿真基准和现实世界环境的挑战性机器人操作任务上的评估表明,SC-VLA实现了最先进的性能,任务吞吐量最高,步骤减少16%,成功率比表现最好的基线高出9%,同时在现实世界实验中获得了14%的提升。代码可在 https://github.com/Kisaragi0/SC-VLA 获取。
cs.RO / 19 / 2602.21644
DAGS-SLAM: Dynamic-Aware 3DGS SLAM via Spatiotemporal Motion Probability and Uncertainty-Aware Scheduling
DAGS-SLAM:基于时空运动概率和不确定性感知调度的动态感知3D高斯点云SLAM
Abstract
Mobile robots and IoT devices demand real-time localization and dense reconstruction under tight compute and energy budgets. While 3D Gaussian Splatting (3DGS) enables efficient dense SLAM, dynamic objects and occlusions still degrade tracking and mapping. Existing dynamic 3DGS-SLAM often relies on heavy optical flow and per-frame segmentation, which is costly for mobile deployment and brittle under challenging illumination. We present DAGS-SLAM, a dynamic-aware 3DGS-SLAM system that maintains a spatiotemporal motion probability (MP) state per Gaussian and triggers semantics on demand via an uncertainty-aware scheduler. DAGS-SLAM fuses lightweight YOLO instance priors with geometric cues to estimate and temporally update MP, propagates MP to the front-end for dynamic-aware correspondence selection, and suppresses dynamic artifacts in the back-end via MP-guided optimization. Experiments on public dynamic RGB-D benchmarks show improved reconstruction and robust tracking while sustaining real-time throughput on a commodity GPU, demonstrating a practical speed-accuracy tradeoff with reduced semantic invocations toward mobile deployment.
Chinese Translation
移动机器人和物联网设备在严格的计算和能量预算下需要实时定位和密集重建。虽然3D高斯点云(3DGS)能够实现高效的密集SLAM,但动态物体和遮挡仍然会降低跟踪和映射的效果。现有的动态3DGS-SLAM通常依赖于复杂的光流和逐帧分割,这在移动部署中代价高昂,并且在复杂光照条件下表现脆弱。我们提出了DAGS-SLAM,一种动态感知的3DGS-SLAM系统,它为每个高斯保持一个时空运动概率(MP)状态,并通过不确定性感知调度器按需触发语义。DAGS-SLAM将轻量级YOLO实例先验与几何线索融合,以估计和时间更新MP,将MP传播到前端以进行动态感知的对应选择,并通过MP引导的优化在后端抑制动态伪影。在公共动态RGB-D基准测试上的实验表明,DAGS-SLAM在保持实时吞吐量的同时,改善了重建效果和鲁棒跟踪,展示了在移动部署中实现速度与精度的实用权衡,并减少了语义调用。
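Illustrative sketch for the DAGS-SLAM entry above: the abstract describes a per-Gaussian motion probability (MP) that fuses instance priors with geometric cues and a scheduler that triggers semantics only when needed. The abstract does not give the update rule, so the log-odds fusion, the entropy-based trigger, and every name and constant below (`update_motion_probability`, `should_run_segmentation`, `residual_scale`, `decay`, `entropy_threshold`) are assumptions chosen for illustration, not the authors' method.

```python
import math


def _clamp(p, eps=1e-6):
    # Keep probabilities away from 0 and 1 so the logit stays finite.
    return min(max(p, eps), 1.0 - eps)


def _logit(p):
    p = _clamp(p)
    return math.log(p / (1.0 - p))


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def update_motion_probability(mp_prev, semantic_prior, geometric_residual,
                              residual_scale=0.05, decay=0.8):
    """Fuse the previous MP, a semantic prior, and a geometric cue in log-odds space.

    All three inputs and both constants are hypothetical stand-ins.
    """
    # Map the reprojection/depth residual to a probability (0.5 at residual_scale).
    geometric_p = _sigmoid((geometric_residual - residual_scale) / residual_scale)
    fused = (decay * _logit(mp_prev)
             + _logit(semantic_prior)
             + _logit(geometric_p))
    return _sigmoid(fused)


def binary_entropy(p):
    p = _clamp(p)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def should_run_segmentation(motion_probabilities, entropy_threshold=0.5):
    """Trigger the (expensive) semantic model only when the MP state is uncertain."""
    mean_entropy = sum(binary_entropy(p) for p in motion_probabilities) / len(motion_probabilities)
    return mean_entropy > entropy_threshold


static_mp = update_motion_probability(0.5, semantic_prior=0.2, geometric_residual=0.0)
dynamic_mp = update_motion_probability(0.5, semantic_prior=0.9, geometric_residual=0.2)
```

Working in log-odds keeps the temporal update additive and lets a confident cue dominate an uncertain one; a real system would need calibrated priors and residual statistics.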
cs.RO / 20 / 2602.21666
Biomechanical Comparisons Reveal Divergence of Human and Humanoid Gaits
生物力学比较揭示人类与类人步态的差异
Abstract
It remains challenging to achieve human-like locomotion in legged robots due to fundamental discrepancies between biological and mechanical structures. Although imitation learning has emerged as a promising approach for generating natural robotic movements, simply replicating joint angle trajectories fails to capture the underlying principles of human motion. This study proposes a Gait Divergence Analysis Framework (GDAF), a unified biomechanical evaluation framework that systematically quantifies kinematic and kinetic discrepancies between humans and bipedal robots. We apply GDAF to systematically compare human and humanoid locomotion across 28 walking speeds. To enable reproducible analysis, we collect and release a speed-continuous humanoid locomotion dataset from a state-of-the-art humanoid controller. We further provide an open-source implementation of GDAF, including analysis, visualization, and MuJoCo-based tools, enabling quantitative, interpretable, and reproducible biomechanical analysis of humanoid locomotion. Results demonstrate that despite visually human-like motion generated by modern humanoid controllers, significant biomechanical divergence persists across speeds. Robots exhibit systematic deviations in gait symmetry, energy distribution, and joint coordination, indicating that substantial room remains for improving the biomechanical fidelity and energetic efficiency of humanoid locomotion. This work provides a quantitative benchmark for evaluating humanoid locomotion and offers data and versatile tools to support the development of more human-like and energetically efficient locomotion controllers. The data and code will be made publicly available upon acceptance of the paper.
Chinese Translation
由于生物结构与机械结构之间的根本差异,实现类人步态的腿部机器人仍然面临挑战。尽管模仿学习已成为生成自然机器人运动的有前景的方法,但仅仅复制关节角度轨迹无法捕捉人类运动的基本原理。本研究提出了一种步态差异分析框架(Gait Divergence Analysis Framework, GDAF),这是一个统一的生物力学评估框架,系统地量化人类与双足机器人之间的运动学和动力学差异。我们应用GDAF系统地比较了28种行走速度下的人类与类人步态。为了实现可重复的分析,我们收集并发布了来自先进类人控制器的速度连续类人步态数据集。我们进一步提供了GDAF的开源实现,包括分析、可视化和基于MuJoCo的工具,使得类人步态的定量、可解释和可重复的生物力学分析成为可能。结果表明,尽管现代类人控制器生成的运动在视觉上类似于人类,但在不同速度下仍然存在显著的生物力学差异。机器人在步态对称性、能量分配和关节协调方面表现出系统性偏差,表明在提高类人步态的生物力学逼真性和能量效率方面仍有很大的改进空间。本研究为评估类人步态提供了定量基准,并提供数据和多功能工具以支持开发更类人和能量高效的步态控制器。数据和代码将在论文接受后公开。
cs.RO / 21 / 2602.21670
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
基于层次化大语言模型的多智能体框架及其提示优化在多机器人任务规划中的应用
Abstract
Multi-robot task planning requires decomposing natural-language instructions into executable actions for heterogeneous robot teams. Conventional Planning Domain Definition Language (PDDL) planners provide rigorous guarantees but struggle to handle ambiguous or long-horizon missions, while large language models (LLMs) can interpret instructions and propose plans but may hallucinate or produce infeasible actions. We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner. When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy. In addition, meta-prompts are learned and shared across agents within the same layer, enabling efficient prompt optimization in multi-agent settings. On the MAT-THOR benchmark, our planner achieves success rates of 0.95 on compound tasks, 0.84 on complex tasks, and 0.60 on vague tasks, improving over the previous state-of-the-art LaMMA-P by 2, 7, and 15 percentage points respectively. An ablation study shows that the hierarchical structure, prompt optimization, and meta-prompt sharing contribute roughly +59, +37, and +4 percentage points to the overall success rate.
Chinese Translation
多机器人任务规划需要将自然语言指令分解为异构机器人团队可执行的动作。传统的规划领域定义语言(PDDL)规划器提供严格的保证,但在处理模糊或长时间跨度的任务时表现不佳,而大型语言模型(LLMs)能够理解指令并提出计划,但可能会产生幻觉或生成不可行的动作。我们提出了一种基于层次化多智能体的LLM规划器,并结合提示优化:上层负责任务分解并将其分配给下层智能体,下层智能体生成PDDL问题,由经典规划器解决。当计划失败时,系统应用受TextGrad启发的文本梯度更新来优化每个智能体的提示,从而提高规划的准确性。此外,元提示在同一层内的智能体之间学习和共享,使得在多智能体环境中实现高效的提示优化。在MAT-THOR基准测试中,我们的规划器在复合任务上取得了0.95的成功率,在复杂任务上为0.84,在模糊任务上为0.60,分别比之前的最先进技术LaMMA-P提高了2、7和15个百分点。消融研究表明,层次结构、提示优化和元提示共享对整体成功率的贡献分别约为+59、+37和+4个百分点。
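Worked arithmetic for the planner entry above: the abstract reports success rates and the improvement over LaMMA-P in percentage points, so the baseline rates can be backed out by subtraction. This assumes the stated gains are absolute percentage points on the same success-rate metric; the derived baseline values are inferred here, not quoted from the paper.

```python
# Success rates reported in the abstract for the proposed planner.
reported = {"compound": 0.95, "complex": 0.84, "vague": 0.60}

# Improvements over LaMMA-P in percentage points, as stated in the abstract.
gain_points = {"compound": 2, "complex": 7, "vague": 15}

# Implied baseline = reported rate minus the absolute gain (assumption: same metric).
implied_baseline = {
    task: round(reported[task] - gain_points[task] / 100.0, 2)
    for task in reported
}

for task, rate in implied_baseline.items():
    print(f"{task}: implied LaMMA-P success rate {rate:.2f}")
```

Under that reading, the largest gain falls on the vague tasks, where the implied baseline is lowest.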
cs.RO / 22 / 2602.21682
SunnyParking: Multi-Shot Trajectory Generation and Motion State Awareness for Human-like Parking
SunnyParking:人类驾驶风格的多次轨迹生成与运动状态感知
Abstract
Autonomous parking fundamentally differs from on-road driving due to its frequent direction changes and complex maneuvering requirements. However, existing End-to-End (E2E) planning methods often simplify the parking task into a geometric path regression problem, neglecting explicit modeling of the vehicle's kinematic state. This "dimensionality deficiency" easily leads to physically infeasible trajectories and deviates from real human driving behavior, particularly at critical gear-shift points in multi-shot parking scenarios. In this paper, we propose SunnyParking, a novel dual-branch E2E architecture that achieves motion state awareness by jointly predicting spatial trajectories and discrete motion state sequences (e.g., forward/reverse). Additionally, we introduce a Fourier feature-based representation of target parking slots to overcome the resolution limitations of traditional bird's-eye view (BEV) approaches, enabling high-precision target interactions. Experimental results demonstrate that our framework generates more robust and human-like trajectories in complex multi-shot parking scenarios, while significantly improving gear-shift point localization accuracy compared to state-of-the-art methods. We open-source a new parking dataset of the CARLA simulator, specifically designed to evaluate full prediction capabilities under complex maneuvers.
Chinese Translation
自主停车与道路驾驶在本质上存在显著差异,主要体现在频繁的方向变化和复杂的操控要求。然而,现有的端到端(End-to-End, E2E)规划方法往往将停车任务简化为几何路径回归问题,忽视了对车辆运动状态的明确建模。这种“维度缺失”容易导致物理上不可行的轨迹,并偏离真实的人类驾驶行为,尤其是在多次停车场景中的关键换挡点。本文提出了SunnyParking,一种新颖的双分支E2E架构,通过联合预测空间轨迹和离散运动状态序列(例如,前进/倒退)来实现运动状态感知。此外,我们引入了一种基于傅里叶特征的目标停车位表示,以克服传统鸟瞰视图(Bird's-Eye View, BEV)方法的分辨率限制,从而实现高精度的目标交互。实验结果表明,我们的框架在复杂的多次停车场景中生成了更稳健且更具人类驾驶特征的轨迹,同时显著提高了换挡点定位的准确性,相较于最先进的方法。我们开源了一个新的CARLA模拟器停车数据集,专门设计用于评估在复杂操控下的全预测能力。
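Illustrative sketch for the SunnyParking entry above: the abstract mentions a Fourier feature-based representation of the target parking slot to avoid the resolution limit of a rasterized BEV grid. The abstract does not say which slot quantities are encoded or which frequencies are used, so the corner-based input, the power-of-two frequency bands, and the names `fourier_features` and `encode_slot` are assumptions.

```python
import math


def fourier_features(value, num_bands=4):
    """Standard sinusoidal encoding of one scalar at frequencies 2^k * pi."""
    features = []
    for k in range(num_bands):
        frequency = (2.0 ** k) * math.pi
        features.append(math.sin(frequency * value))
        features.append(math.cos(frequency * value))
    return features


def encode_slot(corners, num_bands=4):
    """Encode slot corners given as (x, y) pairs normalized to [-1, 1] (hypothetical input)."""
    encoded = []
    for x, y in corners:
        encoded.extend(fourier_features(x, num_bands))
        encoded.extend(fourier_features(y, num_bands))
    return encoded


# Two slots that differ by a tiny offset still get distinct continuous encodings,
# whereas a coarse grid could place them in the same cell.
slot_a = encode_slot([(0.500, 0.2), (0.700, 0.2), (0.700, 0.6), (0.500, 0.6)])
slot_b = encode_slot([(0.501, 0.2), (0.701, 0.2), (0.701, 0.6), (0.501, 0.6)])
```

Higher bands respond to small coordinate changes, which is why such encodings are commonly used when sub-cell precision matters.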
cs.RO / 23 / 2602.21684
Primary-Fine Decoupling for Action Generation in Robotic Imitation
机器人模仿中的动作生成的初级-精细解耦
Abstract
Multi-modal distribution in robotic manipulation action sequences poses critical challenges for imitation learning. To this end, existing approaches often model the action space as either a discrete set of tokens or a continuous, latent-variable distribution. However, both approaches present trade-offs: methods that discretize actions into tokens lose fine-grained action variations, while methods that generate continuous actions in a single stage tend to produce unstable mode transitions. To address these limitations, we propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage framework that decouples coarse action consistency from fine-grained variations. First, we compress action chunks into a small set of discrete modes, enabling a lightweight policy to select consistent coarse modes and avoid mode bouncing. Second, a mode-conditioned MeanFlow policy is learned to generate high-fidelity continuous actions. Theoretically, we prove PF-DAG's two-stage design achieves a strictly lower MSE bound than single-stage generative policies. Empirically, PF-DAG outperforms state-of-the-art baselines across 56 tasks from Adroit, DexArt, and MetaWorld benchmarks. It further generalizes to real-world tactile dexterous manipulation tasks. Our work demonstrates that explicit mode-level decoupling enables both robust multi-modal modeling and reactive closed-loop control for robotic manipulation.
Chinese Translation
机器人操作动作序列中的多模态分布对模仿学习提出了关键挑战。为此,现有方法通常将动作空间建模为离散的标记集合或连续的潜变量分布。然而,这两种方法都存在权衡:一些方法将动作离散化为标记,从而失去了细粒度的动作变化,而其他方法在单一阶段生成连续动作时往往会产生不稳定的模式转换。为了解决这些局限性,我们提出了动作生成的初级-精细解耦(PF-DAG),这是一个两阶段框架,将粗糙动作一致性与细粒度变化解耦。首先,我们将动作块压缩为一小组离散模式,使轻量级策略能够选择一致的粗糙模式并避免模式跳跃。其次,学习一种基于模式条件的MeanFlow策略,以生成高保真连续动作。从理论上讲,我们证明了PF-DAG的两阶段设计实现了比单阶段生成策略严格更低的均方误差(MSE)界限。从经验上看,PF-DAG在Adroit、DexArt和MetaWorld基准的56个任务中超越了最先进的基线。它进一步推广到现实世界的触觉灵巧操作任务。我们的工作表明,显式的模式级解耦能够实现机器人操作的稳健多模态建模和反应式闭环控制。
cs.RO / 24 / 2602.21691
Trajectory Generation with Endpoint Regulation and Momentum-Aware Dynamics for Visually Impaired Scenarios
针对视觉障碍场景的端点调节和动量感知动态的轨迹生成
Abstract
Trajectory generation for visually impaired scenarios requires smooth and temporally consistent states in structured, low-speed dynamic environments. However, traditional jerk-based heuristic trajectory sampling with independent segment generation and conventional smoothness penalties often leads to unstable terminal behavior and state discontinuities under frequent regeneration. This paper proposes a trajectory generation approach that integrates endpoint regulation to stabilize terminal states within each segment and momentum-aware dynamics to regularize the evolution of velocity and acceleration for segment consistency. Endpoint regulation is incorporated into trajectory sampling to stabilize terminal behavior, while momentum-aware dynamics enforce consistent velocity and acceleration evolution across consecutive trajectory segments. Experimental results demonstrate reduced acceleration peaks and lower jerk levels with decreased dispersion, smoother velocity and acceleration profiles, more stable endpoint distributions, and fewer infeasible trajectory candidates compared with a baseline planner.
Chinese Translation
针对视觉障碍场景的轨迹生成需要在结构化的低速动态环境中保持平滑和时间一致的状态。然而,传统的基于冲击的启发式轨迹采样方法采用独立的段生成和常规的平滑性惩罚,往往导致在频繁重新生成下终端行为不稳定和状态不连续。本文提出了一种轨迹生成方法,集成了端点调节以稳定每个段内的终端状态,以及动量感知动态以规范速度和加速度的演变,从而确保段间一致性。端点调节被纳入轨迹采样中,以稳定终端行为,而动量感知动态则强制连续轨迹段之间的一致速度和加速度演变。实验结果表明,与基线规划器相比,所提出的方法在加速度峰值和冲击水平上均有所降低,离散度减少,速度和加速度曲线更加平滑,终端分布更稳定,且不合规轨迹候选数量更少。
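Illustrative sketch for the trajectory-generation entry above: the abstract contrasts independently generated segments with segments whose velocity and acceleration evolve consistently across regeneration. The snippet below is not the paper's endpoint regulation or momentum-aware dynamics; it only shows the textbook quintic-polynomial segment with full boundary conditions, where a regenerated segment starts from the previous segment's position, velocity, and acceleration. Function names and the numeric scenario are assumptions.

```python
def quintic_coeffs(p0, v0, a0, p1, v1, a1, T):
    """Coefficients c0..c5 of p(t) = sum(c_i * t^i) matching state at t=0 and t=T."""
    c0, c1, c2 = p0, v0, a0 / 2.0
    c3 = (20.0 * (p1 - p0) - (8.0 * v1 + 12.0 * v0) * T - (3.0 * a0 - a1) * T ** 2) / (2.0 * T ** 3)
    c4 = (30.0 * (p0 - p1) + (14.0 * v1 + 16.0 * v0) * T + (3.0 * a0 - 2.0 * a1) * T ** 2) / (2.0 * T ** 4)
    c5 = (12.0 * (p1 - p0) - 6.0 * (v1 + v0) * T - (a0 - a1) * T ** 2) / (2.0 * T ** 5)
    return [c0, c1, c2, c3, c4, c5]


def evaluate(coeffs, t):
    """Return (position, velocity, acceleration) of the polynomial at time t."""
    position = sum(c * t ** i for i, c in enumerate(coeffs))
    velocity = sum(i * c * t ** (i - 1) for i, c in enumerate(coeffs) if i >= 1)
    acceleration = sum(i * (i - 1) * c * t ** (i - 2) for i, c in enumerate(coeffs) if i >= 2)
    return position, velocity, acceleration


# First segment: rest-to-rest from 0 m to 1 m in 2 s.
first_segment = quintic_coeffs(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0)

# Regenerate halfway through: the new segment inherits the current full state,
# so position, velocity, and acceleration stay continuous at the switch.
switch_state = evaluate(first_segment, 1.0)
second_segment = quintic_coeffs(*switch_state, 2.0, 0.0, 0.0, 2.0)
start_of_second = evaluate(second_segment, 0.0)
end_of_second = evaluate(second_segment, 2.0)
```

Starting each regenerated segment from only the current position, instead of the full state, is one way the state discontinuities mentioned in the abstract can arise.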
cs.RO / 25 / 2602.21696
Dual-Regime Hybrid Aerodynamic Modeling of Winged Blimps With Neural Mixing
带翼飞艇的双重模式混合气动建模与神经网络混合
Abstract
Winged blimps operate across distinct aerodynamic regimes that cannot be adequately captured by a single model. At high speeds and small angles of attack, their dynamics exhibit strong coupling between lift and attitude, resembling fixed-wing aircraft behavior. At low speeds or large angles of attack, viscous effects and flow separation dominate, leading to drag-driven and damping-dominated dynamics. Accurately representing transitions between these regimes remains a fundamental challenge. This paper presents a hybrid aerodynamic modeling framework that integrates a fixed-wing Aerodynamic Coupling Model (ACM) and a Generalized Drag Model (GDM) using a learned neural network mixer with explicit physics-based regularization. The mixer enables smooth transitions between regimes while retaining explicit, physics-based aerodynamic representation. Model parameters are identified through a structured three-phase pipeline tailored for hybrid aerodynamic modeling. The proposed approach is validated on the RGBlimp platform through a large-scale experimental campaign comprising 1,320 real-world flight trajectories across 330 thruster and moving mass configurations, spanning a wide range of speeds and angles of attack. Experimental results demonstrate that the proposed hybrid model consistently outperforms single-model and predefined-mixer baselines, establishing a practical and robust aerodynamic modeling solution for winged blimps.
Chinese Translation
带翼飞艇在不同的气动模式下运行,这些模式无法通过单一模型充分捕捉。在高速度和小攻角下,它们的动力学表现出升力与姿态之间的强耦合,类似于固定翼飞机的行为。在低速度或大攻角下,粘性效应和流动分离主导,导致以阻力驱动和阻尼为主的动力学。准确表示这些模式之间的过渡仍然是一个基本挑战。本文提出了一种混合气动建模框架,该框架结合了固定翼气动耦合模型(Aerodynamic Coupling Model, ACM)和广义阻力模型(Generalized Drag Model, GDM),并使用具有显式物理基础正则化的学习型神经网络混合器。该混合器能够在模式之间实现平滑过渡,同时保留显式的、基于物理的气动表征。模型参数通过为混合气动建模量身定制的结构化三阶段流程进行识别。所提方法在RGBlimp平台上进行了验证,开展了一项大规模实验活动,涵盖了1,320条真实飞行轨迹,涉及330种推进器和移动质量配置,涵盖了广泛的速度和攻角范围。实验结果表明,所提混合模型在性能上始终优于单一模型和预定义混合器基线,建立了一种实用且稳健的带翼飞艇气动建模解决方案。
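Illustrative sketch for the winged-blimp entry above: the abstract describes blending a fixed-wing style coupling model (ACM) with a drag-dominated model (GDM) through a learned mixer. The snippet shows only the blending idea with a single hand-set logistic gate standing in for the neural mixer. The scalar force models, all coefficients, and the names `acm_force`, `gdm_force`, `mixer_weight`, and `hybrid_force` are assumptions, and the physics-based regularization mentioned in the abstract is not modeled.

```python
import math


def acm_force(speed, alpha, rho=1.2, area=0.5, cl_alpha=3.0):
    # Fixed-wing style normal force: lift linear in angle of attack (hypothetical coefficients).
    return 0.5 * rho * speed ** 2 * area * cl_alpha * alpha


def gdm_force(speed, alpha, rho=1.2, area=0.5, cd=1.2):
    # Drag-dominated normal force: quadratic drag on the cross-flow velocity component.
    normal_speed = speed * math.sin(alpha)
    return 0.5 * rho * area * cd * normal_speed * abs(normal_speed)


def mixer_weight(speed, alpha, w_speed=4.0, w_alpha=-6.0, bias=-1.0):
    # Stand-in for the learned mixer: one logistic unit with hand-set weights.
    return 1.0 / (1.0 + math.exp(-(w_speed * speed + w_alpha * abs(alpha) + bias)))


def hybrid_force(speed, alpha):
    # Convex combination keeps the output between the two physics models.
    w = mixer_weight(speed, alpha)
    return w * acm_force(speed, alpha) + (1.0 - w) * gdm_force(speed, alpha)


fast_small_alpha = mixer_weight(2.0, 0.05)   # close to 1: ACM dominates
slow_large_alpha = mixer_weight(0.2, 1.0)    # close to 0: GDM dominates
```

Because the gate is smooth in speed and angle of attack, the blended force changes continuously between regimes instead of switching abruptly.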
cs.RO / 26 / 2602.21723
LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations
LessMimic:使用统一距离场表示的长时间人形交互
Abstract
Humanoid robots that autonomously interact with physical environments over extended horizons represent a central goal of embodied intelligence. Existing approaches rely on reference motions or task-specific rewards, tightly coupling policies to particular object geometries and precluding multi-skill generalization within a single framework. A unified interaction representation enabling reference-free inference, geometric generalization, and long-horizon skill composition within one policy remains an open challenge. Here we show that a Distance Field (DF) provides such a representation: LessMimic conditions a single whole-body policy on DF-derived geometric cues--surface distances, gradients, and velocity decompositions--removing the need for motion references, with interaction latents encoded via a Variational Auto-Encoder (VAE) and post-trained using Adversarial Interaction Priors (AIP) under Reinforcement Learning (RL). Through DAgger-style distillation that aligns DF latents with egocentric depth features, LessMimic further transfers seamlessly to vision-only deployment without motion capture (MoCap) infrastructure. A single LessMimic policy achieves 80--100% success across object scales from 0.4x to 1.6x on PickUp and SitStand where baselines degrade sharply, attains 62.1% success on 5-task-instance trajectories, and remains viable up to 40 sequentially composed tasks. By grounding interaction in local geometry rather than demonstrations, LessMimic offers a scalable path toward humanoid robots that generalize, compose skills, and recover from failures in unstructured environments.
Chinese Translation
自主与物理环境进行长时间交互的人形机器人是具身智能的核心目标。现有方法依赖于参考动作或特定任务的奖励,这使得策略与特定物体几何形状紧密耦合,从而阻碍了在单一框架内的多技能泛化。实现一种统一的交互表示,能够在一个策略中支持无参考推理、几何泛化和长时间技能组合,仍然是一个未解决的挑战。在此,我们展示了距离场(Distance Field, DF)提供了这样的表示:LessMimic 将单一的全身策略条件化于 DF 派生的几何线索——表面距离、梯度和速度分解——消除了对运动参考的需求,交互潜变量通过变分自编码器(Variational Auto-Encoder, VAE)进行编码,并在强化学习(Reinforcement Learning, RL)下使用对抗交互先验(Adversarial Interaction Priors, AIP)进行后训练。通过 DAgger 风格的蒸馏,将 DF 潜变量与自我中心深度特征对齐,LessMimic 进一步无缝转移到仅依赖视觉的部署,无需运动捕捉(MoCap)基础设施。单一的 LessMimic 策略在 PickUp 和 SitStand 任务中实现了 80% 至 100% 的成功率,适用于物体尺度从 0.4x 到 1.6x 的范围,在 5 个任务实例轨迹上达到了 62.1% 的成功率,并在最多 40 个连续组合任务中保持可行性。通过将交互基于局部几何而非演示,LessMimic 提供了一条可扩展的路径,朝着能够在非结构化环境中泛化、组合技能并从失败中恢复的人形机器人迈进。
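Illustrative sketch for the LessMimic entry above: the abstract lists three DF-derived cues (surface distance, gradient, and velocity decomposition). The snippet computes them for the simplest analytic case, a sphere, to show what such cues look like. The sphere object, the function names, and the sample values are assumptions; the paper's actual distance-field construction is not described in the abstract.

```python
import math


def sphere_distance_field(point, center, radius):
    """Signed distance to a sphere surface and its gradient (unit vector pointing away)."""
    offset = [p - c for p, c in zip(point, center)]
    norm = math.sqrt(sum(o * o for o in offset))
    distance = norm - radius
    gradient = [o / norm for o in offset]  # undefined exactly at the center
    return distance, gradient


def decompose_velocity(velocity, gradient):
    """Split a velocity into components along and tangent to the distance gradient."""
    approach_speed = sum(v * g for v, g in zip(velocity, gradient))  # negative means approaching
    normal_part = [approach_speed * g for g in gradient]
    tangential_part = [v - n for v, n in zip(velocity, normal_part)]
    return normal_part, tangential_part


hand_position = (2.0, 0.0, 0.0)
hand_velocity = (-1.0, 0.5, 0.0)

distance, gradient = sphere_distance_field(hand_position, (0.0, 0.0, 0.0), radius=0.5)
normal_part, tangential_part = decompose_velocity(hand_velocity, gradient)

# Scaling the object (here the radius) changes the distance value but not the
# form of the cues, which is the intuition behind geometry-conditioned policies.
scaled_distance, _ = sphere_distance_field(hand_position, (0.0, 0.0, 0.0), radius=0.8)
```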
cs.RO / 27 / 2602.21736
Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild
联合对齐潜在动作:迈向可扩展的野外视觉-语言-动作预训练
Abstract
Despite progress, Vision-Language-Action models (VLAs) are limited by a scarcity of large-scale, diverse robot data. While human manipulation videos offer a rich alternative, existing methods are forced to choose between small, precisely labeled datasets and vast in-the-wild footage with unreliable hand tracking labels. We present JALA, a pretraining framework that learns Jointly-Aligned Latent Actions. JALA bypasses full visual dynamic reconstruction and instead learns a predictive action embedding aligned with both inverse dynamics and real actions. This yields a transition-aware, behavior-centric latent space for learning from heterogeneous human data. We scale this approach with UniHand-Mix, a 7.5M video corpus (>2,000 hours) blending laboratory and in-the-wild footage. Experiments demonstrate that JALA generates more realistic hand motions in both controlled and unconstrained scenarios, significantly improving downstream robot manipulation performance in both simulation and real-world tasks. These results indicate that jointly-aligned latent actions offer a scalable pathway for VLA pretraining from human data.
Chinese Translation
尽管取得了一定进展,视觉-语言-动作模型(VLA)仍受到大规模、多样化机器人数据稀缺的限制。虽然人类操作视频提供了丰富的替代方案,但现有方法不得不在小型、精确标注的数据集与大量野外视频(其手部跟踪标签不可靠)之间做出选择。我们提出了JALA,一个学习联合对齐潜在动作的预训练框架。JALA绕过了完整的视觉动态重建,而是学习与逆动态和真实动作对齐的预测性动作嵌入。这为从异质人类数据中学习提供了一个过渡感知的、以行为为中心的潜在空间。我们通过UniHand-Mix扩展了这一方法,构建了一个750万视频语料库(超过2000小时),融合了实验室和野外视频。实验表明,JALA在受控和非受控场景中生成了更为真实的手部动作,显著提升了机器人操作在模拟和现实任务中的下游性能。这些结果表明,联合对齐的潜在动作为从人类数据中进行VLA预训练提供了一条可扩展的路径。
cs.RO / 28 / 2602.21783
Therapist-Robot-Patient Physical Interaction is Worth a Thousand Words: Enabling Intuitive Therapist Guidance via Remote Haptic Control
治疗师-机器人-患者的物理互动胜过千言万语:通过远程触觉控制实现直观的治疗师指导
Abstract
Robotic systems can enhance the amount and repeatability of physically guided motor training. Yet their real-world adoption is limited, partly due to non-intuitive trainer/therapist-trainee/patient interactions. To address this gap, we present a haptic teleoperation system for trainers to remotely guide and monitor the movements of a trainee wearing an arm exoskeleton. The trainer can physically interact with the exoskeleton through a commercial handheld haptic device via virtual contact points at the exoskeleton's elbow and wrist, allowing intuitive guidance. Thirty-two participants tested the system in a trainer-trainee paradigm, comparing our haptic demonstration system with conventional visual demonstration in guiding trainees in executing arm poses. Quantitative analyses showed that haptic demonstration significantly reduced movement completion time and improved smoothness, while speech analysis using large language models for automated transcription and categorization of verbal commands revealed fewer verbal instructions. The haptic demonstration did not result in higher reported mental and physical effort by trainers compared to the visual demonstration, while trainers reported greater competence and trainees lower physical demand. These findings support the feasibility of our proposed interface for effective remote human-robot physical interaction. Future work should assess its usability and efficacy for clinical populations in restoring clinicians' sense of agency during robot-assisted therapy.
Chinese Translation
机器人系统可以增强物理指导的运动训练的数量和重复性。然而,它们在现实世界中的应用受到限制,部分原因是训练者/治疗师与受训者/患者之间的互动不够直观。为了解决这一问题,我们提出了一种触觉远程操作系统,使训练者能够远程指导和监控穿戴臂部外骨骼的受训者的动作。训练者可以通过商业手持触觉设备与外骨骼的肘部和手腕的虚拟接触点进行物理互动,从而实现直观的指导。三十二名参与者在训练者-受训者范式中测试了该系统,比较了我们的触觉演示系统与传统的视觉演示在指导受训者执行臂部姿势方面的效果。定量分析显示,触觉演示显著减少了动作完成时间并提高了动作的流畅性,而使用大型语言模型进行的语音分析在自动转录和分类口头指令时显示出较少的口头指令。与视觉演示相比,触觉演示并未导致训练者报告更高的心理和身体努力,而训练者报告的能力感更强,受训者的身体需求则较低。这些发现支持了我们提出的接口在有效的远程人机物理互动中的可行性。未来的工作应评估其在临床人群中恢复临床医生在机器人辅助治疗中的自主感的可用性和有效性。
cs.RO / 29 / 2602.21811
DexRepNet++: Learning Dexterous Robotic Manipulation with Geometric and Spatial Hand-Object Representations
DexRepNet++:通过几何和空间手-物体表示学习灵巧机器人操作
Abstract
Robotic dexterous manipulation is a challenging problem due to the high degrees of freedom (DoFs) and complex contacts of multi-fingered robotic hands. Many existing deep reinforcement learning (DRL) based methods aim at improving sample efficiency in high-dimensional output action spaces. However, existing works often overlook the role of representations in achieving generalization of a manipulation policy in the complex input space during the hand-object interaction. In this paper, we propose DexRep, a novel hand-object interaction representation to capture object surface features and spatial relations between hands and objects for dexterous manipulation skill learning. Based on DexRep, policies are learned for three dexterous manipulation tasks, namely grasping, in-hand reorientation, and bimanual handover, and extensive experiments are conducted to verify their effectiveness. In simulation, for grasping, the policy learned with 40 objects achieves a success rate of 87.9% on more than 5000 unseen objects of diverse categories, significantly surpassing existing work trained with thousands of objects; for the in-hand reorientation and handover tasks, the policies also boost the success rates and other metrics of existing hand-object representations by 20% to 40%. The grasp policies with DexRep are deployed to the real world under multi-camera and single-camera setups and demonstrate a small sim-to-real gap.
Chinese Translation
机器人灵巧操作是一个具有挑战性的问题,因为多指机器人手具有高自由度(DoFs)和复杂的接触。许多现有的基于深度强化学习(DRL)的方法旨在提高在高维输出动作空间中的样本效率。然而,现有研究往往忽视了在手-物体交互过程中,表示在实现操作策略泛化中的作用。本文提出了DexRep,一种新颖的手-物体交互表示,用于捕捉物体表面特征以及手与物体之间的空间关系,以便学习灵巧操作技能。基于DexRep,针对三个灵巧操作任务(即抓取、手内重定向和双手交接)学习了策略,并进行了广泛的实验以验证其有效性。在仿真中,对于抓取任务,使用40个物体学习的策略在超过5000个未见过的多样化类别物体上达到了87.9%的成功率,显著超越了现有使用数千个物体训练的工作;对于手内重定向和交接任务,这些策略也将现有手-物体表示的成功率和其他指标提高了20%到40%。使用DexRep的抓取策略在多摄像头和单摄像头设置下部署到真实世界,展示了较小的仿真到现实的差距。
cs.RO / 30 / 2602.21816
Self-Curriculum Model-based Reinforcement Learning for Shape Control of Deformable Linear Objects
基于自我课程模型的强化学习在可变形线性物体形状控制中的应用
Abstract
Precise shape control of Deformable Linear Objects (DLOs) is crucial in robotic applications such as industrial and medical fields. However, existing methods face challenges in handling complex large deformation tasks, especially those involving opposite curvatures, and lack efficiency and precision. To address this, we propose a two-stage framework combining Reinforcement Learning (RL) and online visual servoing. In the large-deformation stage, a model-based reinforcement learning approach using an ensemble of dynamics models is introduced to significantly improve sample efficiency. Additionally, we design a self-curriculum goal generation mechanism that dynamically selects intermediate-difficulty goals with high diversity through imagined evaluations, thereby optimizing the policy learning process. In the small-deformation stage, a Jacobian-based visual servo controller is deployed to ensure high-precision convergence. Simulation results show that the proposed method enables efficient policy learning and significantly outperforms mainstream baselines in shape control success rate and precision. Furthermore, the framework effectively transfers the policy trained in simulation to real-world tasks with zero-shot adaptation. It successfully completes all 30 cases with diverse initial and target shapes across DLOs of different sizes and materials. The project website is available at: https://anonymous.4open.science/w/sc-mbrl-dlo-EB48/
Chinese Translation
可变形线性物体(DLOs)的精确形状控制在工业和医疗等机器人应用中至关重要。然而,现有方法在处理复杂的大变形任务时面临挑战,尤其是涉及相反曲率的情况,并且缺乏效率和精度。为此,我们提出了一种结合强化学习(RL)和在线视觉伺服的两阶段框架。在大变形阶段,采用基于模型的强化学习方法,利用一组动力学模型显著提高样本效率。此外,我们设计了一种自我课程目标生成机制,通过想象评估动态选择具有高多样性的中等难度目标,从而优化策略学习过程。在小变形阶段,部署了一种基于雅可比矩阵的视觉伺服控制器,以确保高精度收敛。仿真结果表明,所提出的方法能够实现高效的策略学习,并在形状控制的成功率和精度上显著超越主流基准。此外,该框架有效地将训练好的策略从仿真转移到现实任务中,实现零样本适应。它成功完成了30个案例,涵盖不同大小和材料的DLOs的多样初始和目标形状。项目网站可访问:https://anonymous.4open.science/w/sc-mbrl-dlo-EB48/
cs.RO / 31 / 2602.21899
Enhancing Cellular-enabled Collaborative Robots Planning through GNSS data for SAR Scenarios
通过GNSS数据增强基于蜂窝网络的协作机器人在搜救场景中的规划
Abstract
Cellular-enabled collaborative robots are becoming paramount in Search-and-Rescue (SAR) and emergency response. Crucially dependent on resilient mobile network connectivity, they serve as invaluable assets for tasks like rapid victim localization and the exploration of hazardous, otherwise unreachable areas. However, their reliance on battery power and the need for persistent, low-latency communication limit operational time and mobility. To address this, and considering the evolving capabilities of 5G/6G networks, we propose a novel SAR framework that includes Mission Planning and Mission Execution phases and that optimizes robot deployment. By considering parameters such as the exploration area size, terrain elevation, robot fleet size, communication-influenced energy profiles, desired exploration rate, and target response time, our framework determines the minimum number of robots required and their optimal paths to ensure effective coverage and timely data backhaul over mobile networks. Our results demonstrate the trade-offs between number of robots, explored area, and response time for wheeled and quadruped robots. Further, we quantify the impact of terrain elevation data on mission time and energy consumption, showing the benefits of incorporating real-world environmental factors that might also affect mobile signal propagation and connectivity into SAR planning. This framework provides critical insights for leveraging next-generation mobile networks to enhance autonomous SAR operations.
Chinese Translation
基于蜂窝网络的协作机器人在搜救(SAR)和应急响应中变得至关重要。这些机器人依赖于稳健的移动网络连接,成为快速定位受害者和探索危险、难以到达区域等任务的宝贵资产。然而,它们对电池电量的依赖以及对持续低延迟通信的需求限制了其操作时间和机动性。为了解决这一问题,并考虑到5G/6G网络的不断发展能力,我们提出了一种新颖的SAR框架,该框架包括任务规划和任务执行阶段,并优化机器人部署。通过考虑探索区域大小、地形高度、机器人队伍规模、受通信影响的能量特征、期望的探索速率和目标响应时间等参数,我们的框架确定了所需的最小机器人数量及其最佳路径,以确保有效覆盖和及时的数据回传。我们的结果展示了轮式和四足机器人在机器人数量、探索区域和响应时间之间的权衡。此外,我们量化了地形高度数据对任务时间和能耗的影响,显示了将可能影响移动信号传播和连接的现实环境因素纳入SAR规划的好处。该框架为利用下一代移动网络增强自主搜救操作提供了重要的见解。
cs.RO / 32 / 2602.21967
Dream-SLAM: Dreaming the Unseen for Active SLAM in Dynamic Environments
梦境SLAM:在动态环境中为主动SLAM梦见未见之物
Abstract
In addition to the core tasks of simultaneous localization and mapping (SLAM), active SLAM involves generating robot actions that enable effective and efficient exploration of unknown environments. However, existing active SLAM pipelines are limited by three main factors. First, they inherit the restrictions of the underlying SLAM modules that they may be using. Second, their motion planning strategies are typically shortsighted and lack long-term vision. Third, most approaches struggle to handle dynamic scenes. To address these limitations, we propose a novel monocular active SLAM method, Dream-SLAM, which is based on dreaming cross-spatio-temporal images and semantically plausible structures of partially observed dynamic environments. The generated cross-spatio-temporal images are fused with real observations to mitigate noise and data incompleteness, leading to more accurate camera pose estimation and a more coherent 3D scene representation. Furthermore, we integrate dreamed and observed scene structures to enable long-horizon planning, producing farsighted trajectories that promote efficient and thorough exploration. Extensive experiments on both public and self-collected datasets demonstrate that Dream-SLAM outperforms state-of-the-art methods in localization accuracy, mapping quality, and exploration efficiency. Source code will be publicly available upon paper acceptance.
Chinese Translation
除了同时定位与地图构建(SLAM)的核心任务外,主动SLAM还涉及生成机器人动作,以有效且高效地探索未知环境。然而,现有的主动SLAM流程受到三个主要因素的限制。首先,它们继承了所使用的基础SLAM模块的限制。其次,它们的运动规划策略通常目光短浅,缺乏长期视野。第三,大多数方法在处理动态场景时面临困难。为了解决这些限制,我们提出了一种新颖的单目主动SLAM方法——梦境SLAM,该方法基于梦见跨时空图像和部分观察到的动态环境的语义合理结构。生成的跨时空图像与真实观测数据融合,以减轻噪声和数据不完整性,从而实现更准确的相机位姿估计和更连贯的3D场景表示。此外,我们整合了梦见的和观察到的场景结构,以实现长远规划,生成有远见的轨迹,促进高效而彻底的探索。在公共数据集和自收集数据集上的大量实验表明,梦境SLAM在定位精度、地图质量和探索效率方面优于最先进的方法。源代码将在论文接受后公开。
cs.RO / 33 / 2602.21983
Humanizing Robot Gaze Shifts: A Framework for Natural Gaze Shifts in Humanoid Robots
人性化机器人视线转移:类人机器人自然视线转移的框架
Abstract
Leveraging auditory and visual feedback for attention reorientation is essential for natural gaze shifts in social interaction. However, enabling humanoid robots to perform natural and context-appropriate gaze shifts in unconstrained human--robot interaction (HRI) remains challenging, as it requires the coupling of cognitive attention mechanisms and biomimetic motion generation. In this work, we propose the Robot Gaze-Shift (RGS) framework, which integrates these two components into a unified pipeline. First, RGS employs a vision--language model (VLM)-based gaze reasoning pipeline to infer context-appropriate gaze targets from multimodal interaction cues, ensuring consistency with human gaze-orienting regularities. Second, RGS introduces a conditional Vector Quantized-Variational Autoencoder (VQ-VAE) model for eye--head coordinated gaze-shift motion generation, producing diverse and human-like gaze-shift behaviors. Experiments validate that RGS effectively replicates human-like target selection and generates realistic, diverse gaze-shift motions.
Chinese Translation
利用听觉和视觉反馈进行注意力重新定向对于社交互动中的自然视线转移至关重要。然而,使类人机器人在不受限制的人机交互(HRI)中执行自然且符合上下文的视线转移仍然具有挑战性,因为这需要认知注意机制与仿生运动生成的结合。在本研究中,我们提出了机器人视线转移(Robot Gaze-Shift, RGS)框架,将这两个组件整合为一个统一的流程。首先,RGS采用基于视觉-语言模型(Vision-Language Model, VLM)的视线推理流程,从多模态交互线索中推断出符合上下文的视线目标,确保与人类视线定向规律的一致性。其次,RGS引入了一种条件向量量化变分自编码器(Conditional Vector Quantized-Variational Autoencoder, VQ-VAE)模型,用于眼-头协调的视线转移运动生成,产生多样化且类人化的视线转移行为。实验验证了RGS有效地复制了类人目标选择,并生成了真实且多样的视线转移运动。
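Illustrative sketch for the RGS entry above: the abstract names a conditional VQ-VAE for gaze-shift motion generation. The defining step of any VQ-VAE is snapping a continuous latent to its nearest codebook entry, which is all the snippet shows. The two-dimensional toy codebook and the function name `quantize` are assumptions, and the conditioning and decoder used in the paper are not modeled.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent, codebook):
    """Return (index, code) of the codebook vector nearest to the latent."""
    index = min(range(len(codebook)), key=lambda i: squared_distance(latent, codebook[i]))
    return index, codebook[index]


# Toy codebook: each entry could stand for one discrete gaze-shift motion primitive.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
index, code = quantize([0.9, 1.2], codebook)
```

Sampling different code indices for the same condition is what lets a quantized model produce diverse motions for one target.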
cs.RO / 34 / 2602.22001
Are Foundation Models the Route to Full-Stack Transfer in Robotics?
基础模型是实现机器人全栈迁移的途径吗?
Abstract
In humans and robots alike, transfer learning occurs at different levels of abstraction, from high-level linguistic transfer to low-level transfer of motor skills. In this article, we provide an overview of the impact that foundation models and transformer networks have had on these different levels, bringing robots closer than ever to "full-stack transfer". Considering LLMs, VLMs and VLAs from a robotic transfer learning perspective allows us to highlight recurring concepts for transfer, beyond specific implementations. We also consider the challenges of data collection and transfer benchmarks for robotics in the age of foundation models. Are foundation models the route to full-stack transfer in robotics? Our expectation is that they will certainly stay on this route as a key technology.
Chinese Translation
在人类和机器人中,迁移学习发生在不同的抽象层次,从高层次的语言迁移到低层次的运动技能迁移。本文概述了基础模型和变换器网络对这些不同层次的影响,使机器人比以往任何时候都更接近于“全栈迁移”。从机器人迁移学习的角度考虑大型语言模型(LLMs)、视觉语言模型(VLMs)和视觉-语言-动作模型(VLAs),使我们能够突出迁移的反复出现的概念,而不仅仅是特定的实现。我们还考虑了在基础模型时代,数据收集和迁移基准对机器人技术所带来的挑战。基础模型是实现机器人全栈迁移的途径吗?我们的预期是,它们无疑将作为关键技术继续沿着这条道路前行。
cs.RO / 35 / 2602.22006
Parallel Continuous-Time Relative Localization with Augmented Clamped Non-Uniform B-Splines
基于增强夹紧非均匀B样条的并行连续时间相对定位
Abstract
Accurate relative localization is critical for multi-robot cooperation. In robot swarms, measurements from different robots arrive asynchronously and with clock time-offsets. Although Continuous-Time (CT) formulations have proved effective for handling asynchronous measurements in single-robot SLAM and calibration, extending CT methods to multi-robot settings faces great challenges in achieving high-accuracy, low-latency, and high-frequency performance. In particular, existing CT methods suffer from the inherent query-time delay of unclamped B-splines and high computational cost. This paper proposes CT-RIO, a novel Continuous-Time Relative-Inertial Odometry framework. We employ Clamped Non-Uniform B-splines (C-NUBS) to represent robot states for the first time, eliminating the query-time delay. We further augment C-NUBS with closed-form extension and shrinkage operations that preserve the spline shape, making it suitable for online estimation and enabling flexible knot management. This flexibility leads to a knot-keyknot strategy, which supports spline extension at high frequency while retaining sparse keyknots for adaptive relative-motion modeling. We then formulate a sliding-window relative localization problem that operates purely on relative kinematics and inter-robot constraints. To meet the demanding computation required at swarm scale, we decompose the tightly-coupled optimization into robot-wise sub-problems and solve them in parallel using incremental asynchronous block coordinate descent. Extensive experiments show that CT-RIO converges from time-offsets as large as 263 ms to sub-millisecond within 3 s, and achieves RMSEs of 0.046 m and 1.8°. It consistently outperforms state-of-the-art methods, with improvements of up to 60% under high-speed motion.
Chinese Translation
准确的相对定位对于多机器人协作至关重要。在机器人群体中,不同机器人的测量数据异步到达,并且存在时钟时间偏移。尽管连续时间(Continuous-Time, CT)方法在单机器人SLAM和标定中已被证明能够有效处理异步测量,但将CT方法扩展到多机器人环境面临着实现高精度、低延迟和高频率性能的重大挑战。尤其是,现有的CT方法受到未夹紧B样条固有查询时间延迟和高计算成本的影响。本文提出了CT-RIO,一个新颖的连续时间相对惯性里程计框架。我们首次采用夹紧非均匀B样条(Clamped Non-Uniform B-splines, C-NUBS)来表示机器人状态,从而消除了查询时间延迟。我们进一步增强C-NUBS,增加了保持样条形状的封闭形式扩展和收缩操作,使其适合在线估计并实现灵活的节点管理。这种灵活性引出了节点-关键节点策略的概念,支持在高频率下进行样条扩展,同时保留稀疏的关键节点以适应相对运动建模。接着,我们构建了一个滑动窗口相对定位问题,该问题纯粹基于相对运动学和机器人间约束。为了满足群体规模所需的高计算需求,我们将紧耦合优化分解为机器人级子问题,并使用增量异步块坐标下降法并行求解。大量实验表明,CT-RIO能够在高达263毫秒的时间偏移下,在3秒内收敛至亚毫秒级,并实现0.046米和1.8度的均方根误差(RMSE)。在高速运动下,其性能始终优于最先进的方法,提升幅度高达60%。
cs.RO / 36 / 2602.22010
World Guidance: World Modeling in Condition Space for Action Generation
世界引导:条件空间中的世界建模用于动作生成
Abstract
Leveraging future observation modeling to facilitate action generation presents a promising avenue for enhancing the capabilities of Vision-Language-Action (VLA) models. However, existing approaches struggle to strike a balance between maintaining efficient, predictable future representations and preserving sufficient fine-grained information to guide precise action generation. To address this limitation, we propose WoG (World Guidance), a framework that maps future observations into compact conditions by injecting them into the action inference pipeline. The VLA is then trained to simultaneously predict these compressed conditions alongside future actions, thereby achieving effective world modeling within the condition space for action inference. We demonstrate that modeling and predicting this condition space not only facilitates fine-grained action generation but also exhibits superior generalization capabilities. Moreover, it learns effectively from substantial human manipulation videos. Extensive experiments across both simulation and real-world environments validate that our method significantly outperforms existing methods based on future prediction. Project page is available at: https://selen-suyue.github.io/WoGNet/
Chinese Translation
利用未来观察建模来促进动作生成为增强视觉-语言-动作(Vision-Language-Action, VLA)模型的能力提供了一条有前景的途径。然而,现有方法在保持高效、可预测的未来表示与保留足够的细粒度信息以指导精确的动作生成之间难以取得平衡。为了解决这一限制,我们提出了WoG(世界引导)框架,该框架通过将未来观察注入动作推理管道,将其映射为紧凑的条件。然后,VLA被训练以同时预测这些压缩条件和未来动作,从而在条件空间内实现有效的世界建模以进行动作推理。我们证明,建模和预测这一条件空间不仅促进了细粒度的动作生成,而且展现了卓越的泛化能力。此外,它能够有效地从大量人类操控视频中学习。大量在模拟和真实环境中的实验验证了我们的方法显著优于基于未来预测的现有方法。项目页面可访问:https://selen-suyue.github.io/WoGNet/
cs.RO / 37 / 2602.22056
FlowCorrect: Efficient Interactive Correction of Generative Flow Policies for Robotic Manipulation
FlowCorrect:用于机器人操作的生成流策略的高效交互修正
Abstract
Generative manipulation policies can fail catastrophically under deployment-time distribution shift, yet many failures are near-misses: the robot reaches almost-correct poses and would succeed with a small corrective motion. We present FlowCorrect, a deployment-time correction framework that converts near-miss failures into successes using sparse human nudges, without full policy retraining. During execution, a human provides brief corrective pose nudges via a lightweight VR interface. FlowCorrect uses these sparse corrections to locally adapt the policy, improving actions without retraining the backbone while preserving the model's performance on previously learned scenarios. We evaluate on a real-world robot across three tabletop tasks: pick-and-place, pouring, and cup uprighting. With a low correction budget, FlowCorrect improves success on hard cases by 85% while preserving performance on previously solved scenarios. The results show that FlowCorrect learns from very few demonstrations and enables fast, sample-efficient, incremental human-in-the-loop correction of generative visuomotor policies at deployment time in real-world robotics.
Chinese Translation
生成操作策略在部署时的分布转移下可能会发生灾难性失败,但许多失败实际上是接近成功的:机器人几乎达到正确的姿态,只需进行小幅度的修正动作即可成功。我们提出了FlowCorrect,这是一种部署时修正框架,通过稀疏的人为提示将接近失败转化为成功,而无需完全重新训练策略。在执行过程中,用户通过轻量级的虚拟现实界面提供简短的姿态修正提示。FlowCorrect利用这些稀疏的修正来局部调整策略,在不重新训练主干模型的情况下改善动作,同时保持在先前学习场景上的模型性能。我们在一台真实机器人上评估了三个桌面任务:抓取与放置、倒液和杯子竖立。在低修正预算下,FlowCorrect在困难案例上的成功率提高了85%,同时保持了在先前解决场景上的性能。结果清楚地表明,FlowCorrect仅通过极少的示范进行学习,并能够在真实机器人领域实现快速且样本高效的增量人机交互修正生成视觉运动策略。
cs.RO / 38 / 2602.22088
Force Policy: Learning Hybrid Force-Position Control Policy under Interaction Frame for Contact-Rich Manipulation
力策略:在交互框架下学习混合力-位置控制策略以应对接触丰富的操作
Abstract
Contact-rich manipulation demands human-like integration of perception and force feedback: vision should guide task progress, while high-frequency interaction control must stabilize contact under uncertainty. Existing learning-based policies often entangle these roles in a monolithic network, trading off global generalization against stable local refinement, while control-centric approaches typically assume a known task structure or learn only controller parameters rather than the structure itself. In this paper, we formalize a physically grounded interaction frame, an instantaneous local basis that decouples force regulation from motion execution, and propose a method to recover it from demonstrations. Based on this, we address both issues by proposing Force Policy, a global-local vision-force policy in which a global policy guides free-space actions using vision, and upon contact, a high-frequency local policy with force feedback estimates the interaction frame and executes hybrid force-position control for stable interaction. Real-world experiments across diverse contact-rich tasks show consistent gains over strong baselines, with more robust contact establishment, more accurate force regulation, and reliable generalization to novel objects with varied geometries and physical properties, ultimately improving both contact stability and execution quality. Project page: https://force-policy.github.io/
Chinese Translation
接触丰富的操作要求人类般的感知与力反馈的整合:视觉应引导任务进展,而高频交互控制必须在不确定性下稳定接触。现有的基于学习的策略通常将这些角色纠缠在一个整体网络中,权衡全局泛化与稳定局部细化,而以控制为中心的方法通常假设已知的任务结构,或仅学习控制器参数而非结构本身。在本文中,我们形式化了一个物理基础的交互框架,这是一种瞬时局部基础,将力调节与运动执行解耦,并提出了一种从示范中恢复该框架的方法。在此基础上,我们通过提出力策略(Force Policy)来解决这两个问题,这是一种全局-局部的视觉-力策略,其中全局策略使用视觉引导自由空间动作,而在接触时,高频局部策略结合力反馈估计交互框架并执行混合力-位置控制以实现稳定交互。针对多种接触丰富任务的实际实验表明,与强基线相比,力策略在接触建立的鲁棒性、力调节的准确性以及对具有不同几何形状和物理特性的全新物体的可靠泛化方面均表现出一致的提升,最终改善了接触稳定性和执行质量。项目页面:https://force-policy.github.io/
cs.RO / 39 / 2602.22100
Behavioral Cloning for Robotic Connector Assembly: An Empirical Study
机器人连接器组装的行为克隆:一项实证研究
Abstract
Automating the assembly of wire harnesses is challenging in automotive, electrical cabinet, and aircraft production, particularly due to deformable cables and a high variance in connector geometries. In addition, connectors must be inserted with limited force to avoid damage, while their poses can vary significantly. While humans can do this task intuitively by combining visual and haptic feedback, programming an industrial robot for such a task in an adaptable manner remains difficult. This work presents an empirical study investigating the suitability of behavioral cloning for learning an action prediction model for connector insertion that fuses force-torque sensing with a fixed position camera. We compare several network architectures and other design choices using a dataset of up to 300 successful human demonstrations collected via teleoperation of a UR5e robot with a SpaceMouse under varying connector poses. The resulting system is then evaluated against five different connector geometries under varying connector poses, achieving an overall insertion success rate of over 90 %.
Chinese Translation
在汽车、电气柜和飞机生产中,线束的组装自动化面临挑战,特别是由于可变形电缆和连接器几何形状的高度变化。此外,连接器必须以有限的力插入以避免损坏,而其姿态可能会有显著变化。虽然人类可以通过结合视觉和触觉反馈直观地完成这一任务,但以适应性方式为工业机器人编程以执行此类任务仍然困难。本文呈现了一项实证研究,探讨行为克隆在学习连接器插入的动作预测模型中的适用性,该模型融合了力-扭矩传感和固定位置摄像头。我们使用通过遥操作UR5e机器人与SpaceMouse在不同连接器姿态下收集的多达300个成功人类演示的数据集,比较了几种网络架构和其他设计选择。最终系统在不同连接器几何形状和变化的连接器姿态下进行评估,整体插入成功率超过90%。
cs.RO / 40 / 2602.22118
System Design of the Ultra Mobility Vehicle: A Driving, Balancing, and Jumping Bicycle Robot
超移动车辆的系统设计:一种驱动、平衡和跳跃的自行车机器人
Bokser, Benjamin, Gonzalez, Daniel, Singh, Surya, Preston, Aaron, Bahner, Alex, Wollschläger, Annika, Ilvonen, Arianna, Eckert-Erdheim, Asa, Khadke, Ashwin, Hammoud, Bilal, Molinaro, Dean, Jenelten, Fabian, Mayne, Henry, Choset, Howie, Bogoslavskyi, Igor, Tinman, Itic, Tigue, James, Preisig, Jan, Zheng, Kaiyu, Sharma, Kenny, Ang, Kim, Lee, Laura, Margolese, Liana, Lin, Nicole, Frias, Oscar, Drews, Paul, Boggavarapu, Ravi, Burnham, Rick, Zapolsky, Samuel, Kim, Sangbae, Biddlestone, Scott, Mayorga, Sean, Fahmi, Shamel, McCollum, Tyler, Dimitrov, Velin, Moyne, William, Chen, Yu-Ming, Farshidian, Farbod, Hutter, Marco, Perry, David, Rizzi, Al, Nelson, Gabe
Abstract
Trials cyclists and mountain bike riders can hop, jump, balance, and drive on one or both wheels. This versatility allows them to achieve speed and energy-efficiency on smooth terrain and agility over rough terrain. Inspired by these athletes, we present the design and control of a robotic platform, Ultra Mobility Vehicle (UMV), which combines a bicycle and a reaction mass to move dynamically with minimal actuated degrees of freedom. We employ a simulation-driven design optimization process to synthesize a spatial linkage topology with a focus on vertical jump height and momentum-based balancing on a single wheel contact. Using a constrained Reinforcement Learning (RL) framework, we demonstrate zero-shot transfer of diverse athletic behaviors, including track-stands, jumps, wheelies, rear wheel hopping, and front flips. This 23.5 kg robot is capable of high speeds (8 m/s) and jumping on and over large obstacles (1 m tall, or 130% of the robot's nominal height).
Chinese Translation
试验骑行者和山地自行车骑士能够在一个或两个轮子上跳跃、平衡和驾驶。这种多功能性使他们能够在平坦地形上实现速度和能量效率,并在崎岖地形上展现灵活性。受到这些运动员的启发,我们提出了一种机器人平台的设计与控制——超移动车辆(Ultra Mobility Vehicle, UMV),该平台结合了自行车和反应质量,以最小的驱动自由度动态移动。我们采用基于仿真的设计优化过程,合成了一种空间连杆拓扑,重点关注垂直跳跃高度和基于动量的单轮接触平衡。利用约束强化学习(Reinforcement Learning, RL)框架,我们展示了多种运动行为的零样本迁移,包括静止骑行、跳跃、单轮骑行、后轮跳跃和前空翻。该机器人重23.5千克,能够以高达8米/秒的速度行驶,并能够在大型障碍物上跳跃(高度为1米,或机器人标称高度的130%)。
cs.RO / 41 / 2602.22154
Position-Based Flocking for Persistent Alignment without Velocity Sensing
基于位置的群聚模型实现持久对齐而无需速度感知
Abstract
Coordinated collective motion in bird flocks and fish schools inspires algorithms for cohesive swarm robotics. This paper presents a position-based flocking model that achieves persistent velocity alignment without velocity sensing. By approximating relative velocity differences from changes between current and initial relative positions and incorporating a time- and density-dependent alignment gain with a non-zero minimum threshold to maintain persistent alignment, the model sustains coherent collective motion over extended periods. Simulations with a collective of 50 agents demonstrate that the position-based flocking model attains faster and more sustained directional alignment and results in more compact formations than a velocity-alignment-based baseline. This position-based flocking model is particularly well-suited for real-world robotic swarms, where velocity measurements are unreliable, noisy, or unavailable. Experimental results using a team of nine real wheeled mobile robots are also presented.
Chinese Translation
鸟群和鱼群中的协调集体运动激发了凝聚性群体机器人算法的研究。本文提出了一种基于位置的群聚模型,该模型在没有速度感知的情况下实现持久的速度对齐。通过从当前和初始相对位置的变化中近似相对速度差异,并结合一个时间和密度依赖的对齐增益以及一个非零的最小阈值以维持持久对齐,该模型能够在较长时间内维持一致的集体运动。对50个体的集体进行的仿真实验表明,基于位置的群聚模型在实现更快且更持久的方向对齐方面表现优于基于速度对齐的基线模型,并且形成了更紧凑的队形。该基于位置的群聚模型特别适合于现实世界中的机器人群体,在这些环境中,速度测量往往不可靠、噪声大或不可用。本文还展示了使用九个真实轮式移动机器人团队的实验结果。
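The position-based flocking abstract above describes two ingredients: a relative-velocity estimate built from the change between current and initial relative positions, and an alignment gain that depends on time and density but never drops below a non-zero floor. The following is a minimal sketch of that idea, not the paper's controller. The gain formula, the parameter names (`k0`, `decay`, `k_min`) and the 2D setting are all assumptions made for illustration.

```python
def estimate_relative_velocity(rel_now, rel_initial, elapsed):
    """Average relative velocity over [0, elapsed], from positions only.

    rel_now and rel_initial are the neighbor's position relative to this
    agent at the current time and at time 0. No velocity sensor is used.
    """
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    return tuple((now - start) / elapsed for now, start in zip(rel_now, rel_initial))


def alignment_gain(elapsed, neighbor_count, k0=1.0, decay=0.1, k_min=0.2):
    """Hypothetical gain: shrinks with time and local density, floored at k_min."""
    raw = k0 / ((1.0 + decay * elapsed) * max(1, neighbor_count))
    return max(k_min, raw)


def alignment_term(rel_now_list, rel_initial_list, elapsed):
    """2D alignment input: gain times the summed relative-velocity estimates."""
    estimates = [
        estimate_relative_velocity(now, start, elapsed)
        for now, start in zip(rel_now_list, rel_initial_list)
    ]
    if not estimates:
        return (0.0, 0.0)
    gain = alignment_gain(elapsed, len(estimates))
    return tuple(gain * sum(v[d] for v in estimates) for d in range(2))
```

The floor `k_min` is what keeps the alignment term active at long horizons: without it, a purely decaying gain would let the agents drift out of alignment.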
cs.CV / 1 / 2602.21273
StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
故事裁缝:一种零样本的多主体丰富动作视觉叙事生成管道
Abstract
Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. Compared with baseline methods, experiments show that CLIP-T improves by up to 10-15%, with DreamSim lower than strong baselines, while CLIP-I stays in a visually acceptable, competitive range. With matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.
Chinese Translation
在不进行微调的情况下生成多帧、丰富动作的视觉叙事面临三重紧张:动作文本的忠实性、主体身份的保真性以及跨帧背景的连续性。我们提出了故事裁缝(StoryTailor),这是一种在单个 RTX 4090(24 GB)上运行的零样本管道,能够从长叙事提示、每个主体的参考和定位框中生成时间上连贯、身份保留的图像序列。系统由三个协同模块驱动:高斯中心注意力(Gaussian-Centered Attention, GCA)动态聚焦于每个主体的核心,并缓解定位框的重叠;动作增强奇异值重加权(Action-Boost Singular Value Reweighting, AB-SVR)在文本嵌入空间中放大与动作相关的方向;选择性遗忘缓存(Selective Forgetting Cache, SFC)保留可转移的背景线索,遗忘非必要的历史信息,并选择性地呈现保留的线索以建立跨场景的语义联系。与基线方法相比,实验表明 CLIP-T 提高了 10-15%,而 DreamSim 低于强基线,CLIP-I 则保持在视觉上可接受的竞争范围内。在 24 GB GPU 上匹配分辨率和步骤的情况下,推理速度快于 FluxKontext。从定性上看,故事裁缝提供了富有表现力的交互和不断演变但又稳定的场景。
cs.CV / 2 / 2602.21333
HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
HorizonForge:使用任意轨迹和任意车辆进行驾驶场景编辑
Abstract
Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose HorizonSuite, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that the Gaussian-Mesh representation delivers substantially higher fidelity than alternative 3D representations, and that temporal priors from video diffusion are essential for coherent synthesis. Combining these findings, HorizonForge establishes a simple yet powerful paradigm for photorealistic, controllable driving simulation, achieving an 83.4% user-preference gain and a 25.19% FID improvement over the second-best state-of-the-art method. Project page: https://horizonforge.github.io/
Chinese Translation
可控的驾驶场景生成对于实现逼真且可扩展的自动驾驶仿真至关重要,但现有方法在同时实现照片级真实感和精确控制方面存在困难。我们提出了HorizonForge,一个统一框架,将场景重建为可编辑的高斯点云(Gaussian Splats)和网格(Meshes),使得精细的3D操作和基于语言的车辆插入成为可能。编辑通过一种噪声感知的视频扩散过程进行渲染,该过程强制执行空间和时间一致性,在一次前向传递中生成多样的场景变体,而无需逐轨迹优化。为了标准化评估,我们进一步提出了HorizonSuite,一个涵盖自我(ego)和代理(agent)级编辑任务(如轨迹修改和物体操作)的综合基准。大量实验表明,高斯-网格表示的保真度显著高于其他3D表示,并且视频扩散中的时间先验对于一致的合成至关重要。结合这些发现,HorizonForge建立了一个简单而强大的范式,用于逼真且可控的驾驶仿真,用户偏好提升达83.4%,FID(Fréchet Inception Distance)相较于第二最佳的最先进方法提高了25.19%。项目页面:https://horizonforge.github.io/
cs.CV / 3 / 2602.21341
Scaling View Synthesis Transformers
可扩展视图合成变换器
Abstract
Geometry-free view synthesis transformers have recently achieved state-of-the-art performance in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. Yet the factors governing their scaling with compute remain unclear. We present a systematic study of scaling laws for view synthesis transformers and derive design principles for training compute-optimal NVS models. Contrary to prior findings, we show that encoder-decoder architectures can be compute-optimal; we trace earlier negative results to suboptimal architectural choices and comparisons across unequal training compute budgets. Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.
Chinese Translation
无几何信息的视图合成变换器最近在新颖视图合成(Novel View Synthesis, NVS)中取得了最先进的性能,超越了依赖显式几何建模的传统方法。然而,影响其计算扩展性的因素仍不清楚。我们对视图合成变换器的扩展规律进行了系统研究,并推导出训练计算最优的NVS模型的设计原则。与之前的发现相反,我们表明编码器-解码器架构可以是计算最优的;我们将早期的负面结果追溯到次优的架构选择和不平等的训练计算预算比较。在多个计算水平上,我们展示了我们称之为可扩展视图合成模型(Scalable View Synthesis Model, SVSM)的编码器-解码器架构,其扩展效果与仅解码器模型同样有效,达到了更优的性能-计算帕累托前沿,并在实际NVS基准测试中超越了之前的最先进水平,同时显著减少了训练计算。
cs.CV / 4 / 2602.21365
Towards Controllable Video Synthesis of Routine and Rare OR Events
朝着可控的常规与稀有手术室事件视频合成
Abstract
Purpose: Curating large-scale datasets of operating room (OR) workflow, encompassing rare, safety-critical, or atypical events, remains operationally and ethically challenging. This data bottleneck complicates the development of ambient intelligence for detecting, understanding, and mitigating rare or safety-critical events in the OR. Methods: This work presents an OR video diffusion framework that enables controlled synthesis of rare and safety-critical events. The framework integrates a geometric abstraction module, a conditioning module, and a fine-tuned diffusion model to first transform OR scenes into abstract geometric representations, then condition the synthesis process, and finally generate realistic OR event videos. Using this framework, we also curate a synthetic dataset to train and validate AI models for detecting near-misses of sterile-field violations. Results: In synthesizing routine OR events, our method outperforms off-the-shelf video diffusion baselines, achieving lower FVD/LPIPS and higher SSIM/PSNR on both in-domain and out-of-domain datasets. Through qualitative results, we illustrate its capacity for controlled video synthesis of counterfactual events. An AI model trained and validated on the generated synthetic data achieved a recall of 70.13% in detecting near safety-critical events. Finally, we conduct an ablation study to quantify performance gains from key design choices. Conclusion: Our solution enables controlled synthesis of routine and rare OR events from abstract geometric representations. Beyond demonstrating its capability to generate rare and safety-critical scenarios, we show its potential to support the development of ambient intelligence models.
Chinese Translation
目的:策划大规模的手术室(OR)工作流程数据集,包括稀有、安全关键或非典型事件,仍然在操作和伦理上面临挑战。这一数据瓶颈使得开发用于检测、理解和减轻手术室中稀有或安全关键事件的环境智能变得复杂。方法:本研究提出了一种手术室视频扩散框架,能够控制性地合成稀有和安全关键事件。该框架集成了几何抽象模块、条件模块和经过微调的扩散模型,首先将手术室场景转化为抽象的几何表示,然后对合成过程进行条件设置,最后生成逼真的手术室事件视频。利用该框架,我们还策划了一个合成数据集,以训练和验证用于检测无菌区域违规近失事件的人工智能模型。结果:在合成常规手术室事件时,我们的方法优于现成的视频扩散基线,在领域内和领域外数据集中均实现了更低的FVD/LPIPS和更高的SSIM/PSNR。通过定性结果,我们展示了其在反事实事件的可控视频合成方面的能力。一个在生成的合成数据上训练和验证的人工智能模型在检测近安全关键事件时达到了70.13%的召回率。最后,我们进行了一项消融研究,以量化关键设计选择带来的性能提升。结论:我们的解决方案能够从抽象几何表示中可控地合成常规和稀有的手术室事件。除了展示其生成稀有和安全关键场景的能力外,我们还展示了其支持环境智能模型开发的潜力。
cs.CV / 5 / 2602.21395
Momentum Memory for Knowledge Distillation in Computational Pathology
计算病理学中的知识蒸馏动量记忆
Abstract
Multimodal learning that integrates genomics and histopathology has shown strong potential in cancer diagnosis, yet its clinical translation is hindered by the limited availability of paired histology-genomics data. Knowledge distillation (KD) offers a practical solution by transferring genomic supervision into histopathology models, enabling accurate inference using histology alone. However, existing KD methods rely on batch-local alignment, which introduces instability due to limited within-batch comparisons and ultimately degrades performance. To address these limitations, we propose Momentum Memory Knowledge Distillation (MoMKD), a cross-modal distillation framework driven by a momentum-updated memory. This memory aggregates genomic and histopathology information across batches, effectively enlarging the supervisory context available to each mini-batch. Furthermore, we decouple the gradients of the genomics and histology branches, preventing genomic signals from dominating histology feature learning during training and eliminating the modality-gap issue at inference time. Extensive experiments on the TCGA-BRCA benchmark (HER2, PR, and ODX classification tasks) and an independent in-house testing dataset demonstrate that MoMKD consistently outperforms state-of-the-art MIL and multimodal KD baselines, delivering strong performance and generalization under histology-only inference. Overall, MoMKD establishes a robust and generalizable knowledge distillation paradigm for computational pathology.
Chinese Translation
整合基因组学与组织病理学的多模态学习在癌症诊断中展现出强大的潜力,但其临床转化受到配对组织学-基因组数据有限性的阻碍。知识蒸馏(Knowledge Distillation, KD)通过将基因组监督转移到组织病理学模型中,提供了一种实用的解决方案,使得仅使用组织学数据即可进行准确推断。然而,现有的KD方法依赖于批量局部对齐,这由于批内比较有限而引入不稳定性,最终导致性能下降。为了解决这些限制,我们提出了动量记忆知识蒸馏(Momentum Memory Knowledge Distillation, MoMKD),这是一种由动量更新的记忆驱动的跨模态蒸馏框架。该记忆在各个批次之间聚合基因组和组织病理学信息,有效扩大了每个小批次可用的监督上下文。此外,我们解耦了基因组和组织学分支的梯度,防止基因组信号在训练期间主导组织学特征学习,并消除了推断时的模态差距问题。在TCGA-BRCA基准(HER2、PR和ODX分类任务)及独立的内部测试数据集上的广泛实验表明,MoMKD始终优于最先进的多实例学习(MIL)和多模态KD基线,在仅使用组织学推断时表现出强大的性能和泛化能力。总体而言,MoMKD为计算病理学建立了一个稳健且可泛化的知识蒸馏范式。
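The MoMKD abstract above replaces batch-local alignment with a memory that is updated by momentum across batches. A minimal sketch of that update rule follows. It shows only the exponential-moving-average mechanism and is not the authors' implementation: the class name `MomentumMemory`, the per-key slot layout and the default momentum value are hypothetical.

```python
class MomentumMemory:
    """Keeps one running feature vector per key, updated across mini-batches."""

    def __init__(self, dim, momentum=0.9):
        if not 0.0 <= momentum < 1.0:
            raise ValueError("momentum must be in [0, 1)")
        self.dim = dim
        self.momentum = momentum
        self.slots = {}  # key (e.g. a class label) -> list of floats

    def update(self, key, batch_features):
        """Blend the mean of the current batch into the stored vector."""
        count = len(batch_features)
        if count == 0:
            raise ValueError("batch_features must not be empty")
        mean = [sum(f[d] for f in batch_features) / count for d in range(self.dim)]
        if key not in self.slots:
            # First batch for this key: initialize with the batch mean.
            self.slots[key] = mean
        else:
            m = self.momentum
            self.slots[key] = [
                m * old + (1.0 - m) * new for old, new in zip(self.slots[key], mean)
            ]
        return self.slots[key]
```

Because each stored vector summarizes many earlier batches, a loss computed against it sees more context than the handful of samples in the current mini-batch, which is the instability the abstract attributes to batch-local alignment.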
cs.CV / 6 / 2602.21397
MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation
MMLoP:用于高效视觉-语言适应的多模态低秩提示
Abstract
Prompt learning has become a dominant paradigm for adapting vision-language models (VLMs) such as CLIP to downstream tasks without modifying pretrained weights. While extending prompts to both vision and text encoders across multiple transformer layers significantly boosts performance, it dramatically increases the number of trainable parameters, with state-of-the-art methods requiring millions of parameters and abandoning the parameter efficiency that makes prompt tuning attractive. In this work, we propose MMLoP (Multi-Modal Low-Rank Prompting), a framework that achieves deep multi-modal prompting with only 11.5K trainable parameters, comparable to early text-only methods like CoOp. MMLoP parameterizes vision and text prompts at each transformer layer through a low-rank factorization, which serves as an implicit regularizer against overfitting on few-shot training data. To further close the accuracy gap with state-of-the-art methods, we introduce three complementary components: a self-regulating consistency loss that anchors prompted representations to frozen zero-shot CLIP features at both the feature and logit levels, a uniform drift correction that removes the global embedding shift induced by prompt tuning to preserve class-discriminative structure, and a shared up-projection that couples vision and text prompts through a common low-rank factor to enforce cross-modal alignment. Extensive experiments across three benchmarks and 11 diverse datasets demonstrate that MMLoP achieves a highly favorable accuracy-efficiency tradeoff, outperforming the majority of existing methods including those with orders of magnitude more parameters, while achieving a harmonic mean of 79.70% on base-to-novel generalization.
Chinese Translation
提示学习已成为适应视觉-语言模型(VLMs)如CLIP到下游任务的主流范式,而无需修改预训练权重。虽然将提示扩展到多个变换器层的视觉和文本编码器显著提升了性能,但这也大幅增加了可训练参数的数量,最先进的方法需要数百万个参数,从而放弃了提示调优所具备的参数效率。在本研究中,我们提出了MMLoP(Multi-Modal Low-Rank Prompting),这是一个仅使用11.5K可训练参数实现深度多模态提示的框架,其参数量与早期的仅文本方法如CoOp相当。MMLoP通过低秩分解在每个变换器层对视觉和文本提示进行参数化,这作为一种隐式正则化器,防止在少样本训练数据上过拟合。为了进一步缩小与最先进方法之间的准确性差距,我们引入了三个互补组件:自我调节的一致性损失,将提示表示锚定到冻结的零样本CLIP特征,在特征和logit层面均如此;均匀漂移修正,消除由提示调优引起的全局嵌入偏移,以保持类别区分结构;以及共享上投影,通过共同的低秩因子将视觉和文本提示耦合,以强化跨模态对齐。在三个基准和11个多样化数据集上的广泛实验表明,MMLoP实现了高度有利的准确性-效率权衡,超越了大多数现有方法,包括那些参数数量级更高的方法,同时在基础到新颖的泛化上达到了79.70%的调和平均值。
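The MMLoP abstract above attributes its small parameter budget to a low-rank factorization of the per-layer prompts, with an up-projection shared between the vision and text branches. The arithmetic can be sketched as below. This is an illustration under stated assumptions, not the paper's configuration: the layer count, token count, embedding width and rank are hypothetical, both modalities are assumed to share one embedding width, and the sketch is not meant to reproduce the reported 11.5K figure.

```python
def matmul(a, b):
    """Plain nested-list matrix product, enough for a small illustration."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def full_prompt_params(layers, tokens, dim):
    """Independent (tokens x dim) prompts per layer, for vision and text."""
    return 2 * layers * tokens * dim


def low_rank_prompt_params(layers, tokens, dim, rank):
    """Per-layer (tokens x rank) factors for each modality, plus one shared
    (rank x dim) up-projection used by both modalities."""
    return 2 * layers * tokens * rank + rank * dim


# A prompt is recovered as down_factor @ shared_up.
vision_down = [[1.0, 0.0], [0.0, 2.0]]          # tokens x rank
shared_up = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # rank x dim
vision_prompt = matmul(vision_down, shared_up)  # tokens x dim
```

With the hypothetical values `layers=9, tokens=4, dim=512, rank=2`, the full parameterization needs 36,864 values and the low-rank one needs 1,168. Since both modalities multiply by the same `shared_up`, their prompts are confined to one common rank-`rank` subspace, which is one way to read the cross-modal coupling the abstract describes.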
cs.CV / 7 / 2602.21402
FlowFixer: Towards Detail-Preserving Subject-Driven Generation
FlowFixer:面向细节保留的主题驱动生成
Abstract
We present FlowFixer, a refinement framework for subject-driven generation (SDG) that restores fine details lost during generation caused by changes in scale and perspective of a subject. FlowFixer proposes direct image-to-image translation from visual references, avoiding ambiguities in language prompts. To enable image-to-image training, we introduce a one-step denoising scheme to generate self-supervised training data, which automatically removes high-frequency details while preserving global structure, effectively simulating real-world SDG errors. We further propose a keypoint matching-based metric to properly assess fidelity in details beyond semantic similarities usually measured by CLIP or DINO. Experimental results demonstrate that FlowFixer outperforms state-of-the-art SDG methods in both qualitative and quantitative evaluations, setting a new benchmark for high-fidelity subject-driven generation.
Chinese Translation
我们提出了FlowFixer,这是一个用于主题驱动生成(SDG)的精细化框架,旨在恢复因主题的尺度和视角变化而在生成过程中丢失的细节。FlowFixer提出了从视觉参考进行直接的图像到图像转换,避免了语言提示中的歧义。为了实现图像到图像的训练,我们引入了一种一步去噪方案,以生成自监督训练数据,该方案能够自动去除高频细节,同时保留全局结构,有效模拟现实世界中的SDG错误。我们进一步提出了一种基于关键点匹配的度量标准,以适当地评估超出通常由CLIP或DINO测量的语义相似性之外的细节保真度。实验结果表明,FlowFixer在定性和定量评估中均优于最先进的SDG方法,为高保真主题驱动生成设定了新的基准。
cs.CV / 8 / 2602.21406
Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation
探索用于开放词汇零样本动作分割的视觉-语言模型
Abstract
Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding.
Chinese Translation
时间动作分割(TAS)需要将视频划分为动作片段,但活动的广泛空间和替代划分使得收集全面的数据集变得不可行。现有方法仍然局限于封闭词汇和固定标签集。在本研究中,我们利用视觉-语言模型(VLMs)强大的零样本能力,探索了开放词汇零样本时间动作分割(OVTAS)这一尚未充分研究的问题。我们提出了一种无训练的管道,遵循分类分割的设计:帧-动作嵌入相似性(FAES)将视频帧与候选动作标签进行匹配,而相似性矩阵时间分割(SMTS)则强制执行时间一致性。除了提出OVTAS外,我们还对14种不同的VLMs进行了系统研究,首次广泛分析了它们在开放词汇动作分割中的适用性。在标准基准上的实验表明,OVTAS在没有特定任务监督的情况下取得了良好的结果,突显了VLMs在结构化时间理解中的潜力。
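The OVTAS abstract above describes a training-free, two-stage design: frames are matched to candidate labels by embedding similarity (FAES), then temporal consistency is enforced (SMTS). The sketch below shows the general shape of such a pipeline. The first stage is ordinary cosine similarity. The second stage is a Viterbi-style decoder with a fixed switch penalty, used here as a stand-in: the abstract does not specify how SMTS works, so `smooth_labels` and `switch_penalty` are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(frame_embs, label_embs):
    """Rows are frames, columns are candidate action labels."""
    return [[cosine(f, l) for l in label_embs] for f in frame_embs]


def smooth_labels(sim, switch_penalty=0.5):
    """Pick one label per frame, maximizing summed similarity minus a
    fixed penalty for every change of label between consecutive frames."""
    n_labels = len(sim[0])
    score = list(sim[0])
    back = []
    for row in sim[1:]:
        prev_best = max(range(n_labels), key=lambda k: score[k])
        new_score, pointers = [], []
        for k in range(n_labels):
            stay = score[k]
            switch = score[prev_best] - switch_penalty
            if stay >= switch:
                new_score.append(stay + row[k])
                pointers.append(k)
            else:
                new_score.append(switch + row[k])
                pointers.append(prev_best)
        score = new_score
        back.append(pointers)
    # Backtrack from the best final label.
    label = max(range(n_labels), key=lambda k: score[k])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    path.reverse()
    return path
```

On a similarity matrix where a single frame briefly favors the wrong label, per-frame argmax produces a spurious one-frame segment, while the penalized decoder keeps the surrounding label and still switches at the real boundary.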
cs.CV / 9 / 2602.21416
WildSVG: Towards Reliable SVG Generation Under Real-World Conditions
WildSVG:在真实世界条件下实现可靠的SVG生成
Abstract
We introduce the task of SVG extraction, which consists of translating specific visual inputs from an image into scalable vector graphics. Existing multimodal models achieve strong results when generating SVGs from clean renderings or textual descriptions, but they fall short in real-world scenarios where natural images introduce noise, clutter, and domain shifts. A central challenge in this direction is the lack of suitable benchmarks. To address this need, we introduce the WildSVG Benchmark, formed by two complementary datasets: Natural WildSVG, built from real images containing company logos paired with their SVG annotations, and Synthetic WildSVG, which blends complex SVG renderings into real scenes to simulate difficult conditions. Together, these resources provide the first foundation for systematic benchmarking of SVG extraction. We benchmark state-of-the-art multimodal models and find that current approaches perform well below what is needed for reliable SVG extraction in real scenarios. Nonetheless, iterative refinement methods point to a promising path forward, and model capabilities are steadily improving.
Chinese Translation
我们引入了SVG提取任务,该任务旨在将图像中的特定视觉输入转换为可缩放矢量图形(SVG)。现有的多模态模型在从干净的渲染图或文本描述生成SVG时取得了良好的效果,但在真实场景中,由于自然图像引入了噪声、杂乱和领域转移,这些模型的表现却不尽如人意。解决这一问题的一个核心挑战是缺乏合适的基准测试。为此,我们引入了WildSVG基准,由两个互补的数据集组成:Natural WildSVG,该数据集由包含公司标志的真实图像及其SVG注释构成;以及Synthetic WildSVG,该数据集将复杂的SVG渲染与真实场景融合,以模拟困难条件。这些资源共同为系统化的SVG提取基准测试提供了首个基础。我们对最先进的多模态模型进行了基准测试,发现当前的方法在真实场景中进行可靠的SVG提取时的表现远低于所需水平。然而,迭代精炼方法指向了一条有希望的前进道路,模型的能力也在稳步提升。
cs.CV / 10 / 2602.21421
ECHOSAT: Estimating Canopy Height Over Space And Time
ECHOSAT:跨时间和空间估算冠层高度
Abstract
Forest monitoring is critical for climate change mitigation. However, existing global tree height maps provide only static snapshots and do not capture temporal forest dynamics, which are essential for accurate carbon accounting. We introduce ECHOSAT, a global and temporally consistent tree height map at 10 m resolution spanning multiple years. To this end, we resort to multi-sensor satellite data to train a specialized vision transformer model, which performs pixel-level temporal regression. A self-supervised growth loss regularizes the predictions to follow growth curves that are in line with natural tree development, including gradual height increases over time, but also abrupt declines due to forest loss events such as fires. Our experimental evaluation shows that our model improves state-of-the-art accuracies in the context of single-year predictions. We also provide the first global-scale height map that accurately quantifies tree growth and disturbances over time. We expect ECHOSAT to advance global efforts in carbon monitoring and disturbance assessment. The maps can be accessed at https://github.com/ai4forest/echosat.
Chinese Translation
森林监测对于气候变化缓解至关重要。然而,现有的全球树高地图仅提供静态快照,无法捕捉时间上的森林动态,而这些动态对于准确的碳核算至关重要。我们提出了ECHOSAT,一个全球范围内、时间一致的树高地图,分辨率为10米,涵盖多个年份。为此,我们利用多传感器卫星数据训练了一个专门的视觉变换器模型,该模型执行像素级的时间回归。自监督的生长损失正则化预测,使其遵循与自然树木生长相符的生长曲线,包括随时间逐渐增加的高度,以及由于森林损失事件(如火灾)导致的突发下降。我们的实验评估表明,在单年预测的背景下,我们的模型提高了最先进的准确性。我们还提供了第一个全球尺度的高度地图,准确量化了树木的生长和扰动。我们期待ECHOSAT能够推动全球碳监测和扰动评估的工作。这些地图可以在 https://github.com/ai4forest/echosat 访问。
cs.CV / 11 / 2602.21425
Automating Timed Up and Go Phase Segmentation and Gait Analysis via the tugturn Markerless 3D Pipeline
通过 tugturn 无标记 3D 流程自动化定时起立行走阶段分割和步态分析
Abstract
Instrumented Timed Up and Go (TUG) analysis can support clinical and research decision-making, but robust and reproducible markerless pipelines are still limited. We present tugturn.py, a Python-based workflow for 3D markerless TUG processing that combines phase segmentation, gait-event detection, spatiotemporal metrics, intersegmental coordination, and dynamic stability analysis. The pipeline uses spatial thresholds to segment each trial into stand, first gait, turning, second gait, and sit phases, and applies a relative-distance strategy to detect heel-strike and toe-off events within valid gait windows. In addition to conventional kinematics, tugturn provides Vector Coding outputs and Extrapolated Center of Mass (XCoM)-based metrics. The software is configured through TOML files and produces reproducible artifacts, including HTML reports, CSV tables, and quality-assurance visual outputs. A complete runnable example is provided with test data and command-line instructions. This manuscript describes the implementation, outputs, and reproducibility workflow of tugturn as a focused software contribution for markerless biomechanical TUG analysis.
Chinese Translation
仪器化的定时起立行走(TUG)分析可以支持临床和研究决策,但稳健且可重复的无标记流程仍然有限。我们提出了 tugturn.py,这是一种基于 Python 的 3D 无标记 TUG 处理工作流程,结合了阶段分割、步态事件检测、时空指标、段间协调和动态稳定性分析。该流程使用空间阈值将每个试验分割为站立、第一次步态、转身、第二次步态和坐下阶段,并应用相对距离策略在有效步态窗口内检测足跟着地和足尖离地事件。除了传统的运动学,tugturn 还提供向量编码输出和基于外推质心(XCoM)的指标。该软件通过 TOML 文件进行配置,并生成可复现的产出物,包括 HTML 报告、CSV 表格和质量保证可视化输出。提供了一个完整的可运行示例,包括测试数据和命令行说明。本文描述了 tugturn 的实现、输出和可重复性工作流程,作为无标记生物力学 TUG 分析的专注软件贡献。
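The tugturn abstract above mentions Extrapolated Center of Mass (XCoM)-based metrics without giving a formula. For orientation, the standard inverted-pendulum definition is written out below. Whether tugturn uses exactly this form, and how it chooses the pendulum length $l$, is an assumption here and is not stated in the abstract.

```latex
\mathrm{XCoM} = x_{\mathrm{CoM}} + \frac{\dot{x}_{\mathrm{CoM}}}{\omega_0},
\qquad
\omega_0 = \sqrt{\frac{g}{l}},
\qquad
\mathrm{MoS} = x_{\mathrm{BoS}} - \mathrm{XCoM}
```

Here $x_{\mathrm{CoM}}$ and $\dot{x}_{\mathrm{CoM}}$ are the center-of-mass position and velocity along the axis of interest, $g$ is gravitational acceleration, $l$ is the equivalent pendulum length, and $x_{\mathrm{BoS}}$ is the boundary of the base of support. A positive margin of stability (MoS) means the extrapolated center of mass is still inside the base of support.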
cs.CV / 12 / 2602.21428
PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models
PSF-Med:测量和解释医学视觉语言模型中的释义敏感性
Abstract
Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, which raises deployment risks. We introduce Paraphrase Sensitivity Failure (PSF)-Med, a benchmark of 19,748 chest X-ray questions paired with about 92,000 meaning-preserving paraphrases across MIMIC-CXR and PadChest. Across six medical VLMs, we measure yes/no flips for the same image and find flip rates from 8% to 58%. However, a low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yes-minus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by a relative 31% with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.
Chinese Translation
医学视觉语言模型(VLMs)在临床医生重新表述相同问题时可能会改变其答案,这引发了部署风险。我们引入了释义敏感性失败(Paraphrase Sensitivity Failure,PSF)-Med,这是一个包含19,748个胸部X光问题的基准,配对约92,000个保留意义的释义,数据来源于MIMIC-CXR和PadChest。在六个医学VLM中,我们测量了同一图像的“是/否”翻转情况,发现翻转率从8%到58%不等。然而,低翻转率并不意味着视觉基础:仅基于文本的基准显示,一些模型即使在去除图像后仍保持一致,表明它们依赖于语言先验。为了研究一个模型中的机制,我们将GemmaScope 2稀疏自编码器(Sparse Autoencoders,SAEs)应用于MedGemma 4B,并分析了FlipBank,一个经过精心挑选的158个翻转案例的集合。我们在第17层识别出一个稀疏特征,该特征与提示框架相关,并预测决策边际的变化。在因果修补中,去除该特征的贡献平均恢复了45%的“是-否”对数边际,并完全逆转了15%的翻转。基于这一发现,我们展示了在推理过程中固定识别特征可以将翻转率降低31%,同时仅造成1.3个百分点的准确性损失,并减少了对文本先验的依赖。这些结果表明,仅靠翻转率是不够的;稳健性评估应同时测试释义稳定性和图像依赖性。
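To make the flip-rate metric in the abstract concrete, here is a minimal sketch. The abstract does not give the exact definition, so this assumes a flip is any paraphrase whose yes/no answer differs from the answer to the original question on the same image; the function name and data layout are hypothetical.

```python
def flip_rate(original_answers, paraphrase_answers):
    """Fraction of paraphrases whose answer differs from the original.

    original_answers: dict mapping question id -> "yes" or "no"
    paraphrase_answers: dict mapping question id -> list of "yes"/"no"
    (Assumed layout; the benchmark's real format is not specified here.)
    """
    flips = 0
    total = 0
    for qid, base_answer in original_answers.items():
        for answer in paraphrase_answers.get(qid, []):
            total += 1
            if answer != base_answer:
                flips += 1
    return flips / total if total else 0.0


if __name__ == "__main__":
    originals = {"q1": "yes", "q2": "no"}
    paraphrases = {"q1": ["yes", "no"], "q2": ["no", "no"]}
    print(flip_rate(originals, paraphrases))
```

Running the same function on answers produced with the image removed gives a text-only consistency check, which is the kind of comparison the abstract uses to separate stability from visual grounding.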
cs.CV / 13 / 2602.21435
Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking
通过交错分析-草拟思维实现理解与生成的协同
Abstract
Unified Vision-Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. However, existing approaches largely focus on architectural unification while overlooking the need for explicit interaction between the two capabilities during task solving. As a result, current models treat understanding and generation as parallel skills rather than synergistic processes. To achieve real synergy, we introduce the interleaved Analyzing-Drafting problem-solving loop (AD-Loop), a new thinking paradigm that dynamically alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy. To train this mechanism, we design a two-stage strategy: supervised learning on interleaved thought data to initialize alternation, followed by reinforcement learning to promote adaptive and autonomous control. Extensive experiments demonstrate that AD-Loop consistently improves performance across standard benchmarks for both understanding and generation, with strong transferability to various UVLM architectures. Visual analyses further validate the effectiveness of implicit visual thoughts. These results highlight AD-Loop as a principled and broadly applicable strategy for synergizing comprehension and creation. The project page is at https://sqwu.top/AD-Loop.
Chinese Translation
统一视觉-语言模型(Unified Vision-Language Models, UVLMs)旨在通过在单一框架内支持理解与生成,推动多模态学习的发展。然而,现有的方法主要集中在架构的统一上,忽视了在任务解决过程中这两种能力之间显式互动的必要性。因此,当前模型将理解和生成视为并行技能,而非协同过程。为了实现真正的协同,我们引入了交错分析-草拟问题解决循环(Analyzing-Drafting Loop, AD-Loop),这是一种新的思维范式,动态交替进行分析和草拟操作。通过将文本思维与视觉思维交错,AD-Loop使模型能够迭代地优化理解和输出,促进真正的协同。为了训练这一机制,我们设计了一个两阶段策略:首先在交错思维数据上进行监督学习以初始化交替,然后通过强化学习促进自适应和自主控制。大量实验表明,AD-Loop在理解和生成的标准基准测试中持续提高性能,并对各种UVLM架构具有良好的迁移性。视觉分析进一步验证了隐式视觉思维的有效性。这些结果突显了AD-Loop作为一种原则性和广泛适用的策略,以实现理解与创造的协同。项目页面可访问 https://sqwu.top/AD-Loop。
cs.CV / 14 / 2602.21452
Adversarial Robustness of Deep Learning-Based Thyroid Nodule Segmentation in Ultrasound
基于深度学习的超声甲状腺结节分割的对抗鲁棒性
Abstract
Introduction: Deep learning-based segmentation models are increasingly integrated into clinical imaging workflows, yet their robustness to adversarial perturbations remains incompletely characterized, particularly for ultrasound images. We evaluated adversarial attacks and inference-time defenses for thyroid nodule segmentation in B-mode ultrasound. Methods: Two black-box adversarial attacks were developed: (1) Structured Speckle Amplification Attack (SSAA), which injects boundary-targeted noise, and (2) Frequency-Domain Ultrasound Attack (FDUA), which applies bandpass-filtered phase perturbations in the Fourier domain. Three inference-time mitigations were evaluated on adversarial images: randomized preprocessing with test-time augmentation, deterministic input denoising, and stochastic ensemble inference with consistency-aware aggregation. Experiments were conducted on a U-Net segmentation model trained on cine-clips from a database of 192 thyroid nodules. Results: The baseline model achieved a mean Dice similarity coefficient (DSC) of 0.76 (SD 0.20) on unperturbed images. SSAA reduced DSC by 0.29 (SD 0.20) while maintaining high visual similarity (SSIM = 0.94). FDUA resulted in a smaller DSC reduction of 0.11 (SD 0.09) with lower visual fidelity (SSIM = 0.82). Against SSAA, all three defenses significantly improved DSC after correction, with deterministic denoising showing the largest recovery (+0.10, p < 0.001), followed by randomized preprocessing (+0.09, p < 0.001), and stochastic ensemble inference (+0.08, p = 0.002). No defense achieved statistically significant improvement against FDUA. Conclusion: Spatial-domain adversarial perturbations in ultrasound segmentation showed partial mitigation with input preprocessing, whereas frequency-domain perturbations were not mitigated by the defenses, highlighting modality-specific challenges in adversarial robustness evaluation.
Chinese Translation
引言:基于深度学习的分割模型越来越多地集成到临床影像工作流程中,但它们对对抗扰动的鲁棒性仍未完全表征,特别是在超声图像中。我们评估了针对B模式超声甲状腺结节分割的对抗攻击和推理时防御措施。方法:开发了两种黑箱对抗攻击:(1) 结构性散斑增强攻击(Structured Speckle Amplification Attack, SSAA),该攻击注入边界目标噪声;(2) 频域超声攻击(Frequency-Domain Ultrasound Attack, FDUA),该攻击在傅里叶域应用带通滤波相位扰动。对对抗图像评估了三种推理时的缓解措施:随机预处理与测试时增强、确定性输入去噪和具有一致性感知聚合的随机集成推理。实验在一个基于192个甲状腺结节数据库的cine-clips上训练的U-Net分割模型上进行。结果:基线模型在未扰动图像上实现了平均Dice相似系数(Dice Similarity Coefficient, DSC)为0.76(标准差0.20)。SSAA将DSC降低了0.29(标准差0.20),同时保持了高视觉相似性(结构相似性指数SSIM = 0.94)。FDUA导致DSC降低幅度较小,为0.11(标准差0.09),且视觉保真度较低(SSIM = 0.82)。针对SSAA,所有三种防御措施在修正后显著提高了DSC,其中确定性去噪显示出最大的恢复(+0.10,p < 0.001),其次是随机预处理(+0.09,p < 0.001)和随机集成推理(+0.08,p = 0.002)。对FDUA没有防御措施实现统计学上显著的改善。结论:超声分割中的空间域对抗扰动通过输入预处理显示出部分缓解,而频域扰动未能通过防御措施缓解,突显了对抗鲁棒性评估中的模态特定挑战。
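The Dice similarity coefficient (DSC) reported above is the standard overlap measure DSC = 2|A ∩ B| / (|A| + |B|). The sketch below computes it for flattened binary masks. Returning 1.0 when both masks are empty is a convention chosen here, not something stated in the abstract.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity coefficient for two flat binary masks (0/1 values)."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    # Convention (assumption): two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    clean = dice_coefficient([1, 1, 1, 0], [1, 1, 1, 0])
    attacked = dice_coefficient([1, 0, 0, 0], [1, 1, 1, 0])
    print("DSC drop under perturbation:", clean - attacked)
```

The DSC reductions quoted in the abstract (for example 0.29 for SSAA) are differences of this kind between unperturbed and perturbed predictions, averaged over the test set.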
cs.CV / 15 / 2602.21473
Automatic Map Density Selection for Locally-Performant Visual Place Recognition
自动地图密度选择用于局部高效的视觉地点识别
Abstract
A key challenge in translating Visual Place Recognition (VPR) from the lab to long-term deployment is ensuring a priori that a system can meet user-specified performance requirements across different parts of an environment, rather than just on average globally. A critical mechanism for controlling local VPR performance is the density of the reference mapping database, yet this factor is largely neglected in existing work, where benchmark datasets with fixed, engineering-driven (sensors, storage, GPS frequency) sampling densities are typically used. In this paper, we propose a dynamic VPR mapping approach that uses pairs of reference traverses from the target environment to automatically select an appropriate map density to satisfy two user-defined requirements: (1) a target Local Recall@1 level, and (2) the proportion of the operational environment over which this requirement must be met or exceeded, which we term the Recall Achievement Rate (RAR). Our approach is based on the hypothesis that match patterns between multiple reference traverses, evaluated across different map densities, can be modelled to predict the density required to meet these performance targets on unseen deployment data. Through extensive experiments across multiple VPR methods and the Nordland and Oxford RobotCar benchmarks, we show that our system consistently achieves or exceeds the specified local recall level over at least the user-specified proportion of the environment. Comparisons with alternative baselines demonstrate that our approach reliably selects the correct operating point in map density, avoiding unnecessary over-densification. Finally, ablation studies and analysis evaluate sensitivity to reference map choice and local space definitions, and reveal that conventional global Recall@1 is a poor predictor of the often more operationally meaningful RAR metric.
Chinese Translation
将视觉地点识别(Visual Place Recognition, VPR)从实验室转向长期部署的一个关键挑战是事先确保系统能够在环境的不同部分满足用户指定的性能要求,而不仅仅是达到全局平均水平。控制局部 VPR 性能的一个关键机制是参考映射数据库的密度,但这一因素在现有研究中往往被忽视,通常使用固定的、工程驱动的(传感器、存储、GPS 频率)采样密度的基准数据集。在本文中,我们提出了一种动态 VPR 映射方法,该方法利用来自目标环境的参考遍历对自动选择合适的地图密度,以满足两个用户定义的要求:(1)目标 Local Recall@1 水平,以及(2)必须满足或超过该要求的操作环境的比例,我们称之为 Recall Achievement Rate (RAR)。我们的方法基于这样的假设:在不同地图密度下评估的多个参考遍历之间的匹配模式可以被建模,以预测在未见过的部署数据上满足这些性能目标所需的密度。通过在多个 VPR 方法以及 Nordland 和 Oxford RobotCar 基准上的广泛实验,我们展示了我们的系统在至少用户指定的环境比例上始终达到或超过指定的局部召回水平。与其他基线的比较表明,我们的方法可靠地选择了正确的地图密度操作点,避免了不必要的过度密集化。最后,消融研究和分析评估了对参考地图选择和局部空间定义的敏感性,并揭示了传统的全局 Recall@1 是一个较差的预测指标,无法有效反映通常更具操作意义的 RAR 指标。
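The Recall Achievement Rate (RAR) can be read as the fraction of local regions whose Local Recall@1 meets the target. The sketch below computes it and then picks the sparsest candidate density that satisfies both user requirements. This is only an illustration under assumptions: the paper predicts the required density from reference-traverse match patterns, whereas this code selects from already-measured per-region recalls, and all names are hypothetical.

```python
def recall_achievement_rate(local_recalls, target_recall):
    """Fraction of local regions whose Recall@1 meets or exceeds the target."""
    met = sum(1 for recall in local_recalls if recall >= target_recall)
    return met / len(local_recalls)


def select_map_density(candidates, target_recall, target_rar):
    """Return the sparsest density whose RAR meets the target, else None.

    candidates: list of (density, local_recalls) pairs, where density is
    any comparable number such as reference frames per metre (assumed unit).
    """
    for density, local_recalls in sorted(candidates, key=lambda c: c[0]):
        if recall_achievement_rate(local_recalls, target_recall) >= target_rar:
            return density
    return None


if __name__ == "__main__":
    measured = [
        (1, [0.50, 0.90, 0.95, 0.60]),
        (2, [0.90, 0.92, 0.95, 0.70]),
        (4, [0.95, 0.97, 0.99, 0.91]),
    ]
    print(select_map_density(measured, target_recall=0.9, target_rar=0.75))
```

Scanning from sparse to dense reflects the abstract's goal of avoiding unnecessary over-densification: the first density that satisfies both targets is returned.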
cs.CV / 16 / 2602.21484
Unified Unsupervised and Sparsely-Supervised 3D Object Detection by Semantic Pseudo-Labeling and Prototype Learning
通过语义伪标签和原型学习实现统一的无监督和稀疏监督3D目标检测
Abstract
3D object detection is essential for autonomous driving and robotic perception, yet its reliance on large-scale manually annotated data limits scalability and adaptability. To reduce annotation dependency, unsupervised and sparsely-supervised paradigms have emerged. However, they face intertwined challenges: low-quality pseudo-labels, unstable feature mining, and a lack of a unified training framework. This paper proposes SPL, a unified training framework for both Unsupervised and Sparsely-Supervised 3D Object Detection via Semantic Pseudo-labeling and prototype Learning. SPL first generates high-quality pseudo-labels by integrating image semantics, point cloud geometry, and temporal cues, producing both 3D bounding boxes for dense objects and 3D point labels for sparse ones. These pseudo-labels are not used directly but as probabilistic priors within a novel, multi-stage prototype learning strategy. This strategy stabilizes feature representation learning through memory-based initialization and momentum-based prototype updating, effectively mining features from both labeled and unlabeled data. Extensive experiments on KITTI and nuScenes datasets demonstrate that SPL significantly outperforms state-of-the-art methods in both settings. Our work provides a robust and generalizable solution for learning 3D object detectors with minimal or no manual annotations.
Chinese Translation
3D目标检测对于自动驾驶和机器人感知至关重要,但其对大规模手动标注数据的依赖限制了可扩展性和适应性。为减少对标注的依赖,无监督和稀疏监督范式应运而生。然而,它们面临着交织的挑战:低质量的伪标签、不稳定的特征挖掘以及缺乏统一的训练框架。本文提出了SPL,一个通过语义伪标签和原型学习实现无监督和稀疏监督3D目标检测的统一训练框架。SPL首先通过整合图像语义、点云几何和时间线索生成高质量的伪标签,产生稠密物体的3D边界框和稀疏物体的3D点标签。这些伪标签并不是直接使用,而是作为一种概率先验,在一种新颖的多阶段原型学习策略中使用。该策略通过基于记忆的初始化和基于动量的原型更新来稳定特征表示学习,有效地从标注和未标注数据中挖掘特征。在KITTI和nuScenes数据集上的大量实验表明,SPL在这两种设置下显著优于最先进的方法。我们的工作为以最少或无需手动标注学习3D目标检测器提供了一个稳健且可推广的解决方案。
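The abstract mentions memory-based initialization and momentum-based prototype updating. A common form of such an update is an exponential moving average, p ← m·p + (1 − m)·f. The sketch below assumes that form, and assumes initialization by averaging a feature memory bank. SPL's actual update rule and hyperparameters are not given in the abstract, so treat the names and the default momentum as placeholders.

```python
def init_prototype(memory_bank):
    """Initialise a class prototype as the mean of stored feature vectors."""
    count = len(memory_bank)
    dim = len(memory_bank[0])
    return [sum(vec[i] for vec in memory_bank) / count for i in range(dim)]


def update_prototype(prototype, feature, momentum=0.9):
    """EMA update (assumed form): p <- m * p + (1 - m) * f."""
    return [momentum * p + (1.0 - momentum) * f
            for p, f in zip(prototype, feature)]


if __name__ == "__main__":
    proto = init_prototype([[0.0, 2.0], [2.0, 4.0]])
    for new_feature in ([1.0, 1.0], [3.0, 5.0]):
        proto = update_prototype(proto, new_feature, momentum=0.9)
    print(proto)
```

A high momentum keeps the prototype stable against individual noisy pseudo-labelled features, which matches the stabilizing role the abstract assigns to this step.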
cs.CV / 17 / 2602.21497
See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs
看见它,说出它,分类:一种迭代的无训练框架用于视觉基础的多模态推理在大型视觉语言模型中的应用
Abstract
Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. In contrast, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.
Chinese Translation
最近的大型视觉语言模型(LVLMs)通过生成长链推理(CoT)响应展示了令人印象深刻的推理能力。然而,在多模态环境中的CoT推理对视觉幻觉传播高度敏感:一旦中间推理步骤与视觉证据不一致,后续步骤即使在逻辑上有效也可能导致错误的最终答案。现有解决方案试图通过强化学习(RL)训练模型以“用图像思考”来缓解这一问题。尽管这些方法有效,但它们成本高、特定于模型且难以跨架构推广。与之不同,我们提出了一种轻量级方法,绕过RL训练,提供一个迭代的、无训练的、即插即用的视觉基础多模态推理框架。我们的关键思想是在测试时用视觉证据监督每个推理步骤,确保每个解码的标记都有相应视觉线索的支持。具体而言,我们构建了一个文本视觉证据池,指导模型的推理生成。当现有证据不足时,视觉决策模块根据正在进行的推理上下文动态提取图像中额外相关的证据,扩展证据池,直到模型获得足够的视觉确定性以终止推理并生成最终答案。在多个LVLM骨干网络和基准上的大量实验表明我们的方法的有效性。我们的方法在TreeBench上实现了16.5%-29.5%的提升,在RH-Bench上获得了13.7%的RH-AUC增益,显著降低了幻觉率,同时在没有额外训练的情况下提高了推理准确性。
cs.CV / 18 / 2602.21499
Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow
Easy3E:通过校正体素流进行前馈式3D资产编辑
Abstract
Existing 3D editing methods rely on computationally intensive scene-by-scene iterative optimization and suffer from multi-view inconsistency. We propose an effective and fully feedforward 3D editing framework based on the TRELLIS generative backbone, capable of modifying 3D models from a single editing view. Our framework addresses two key issues: adapting training-free 2D editing to structured 3D representations, and overcoming the bottleneck of appearance fidelity in compressed 3D features. To ensure geometric consistency, we introduce Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that achieves globally consistent 3D deformation in a single pass. To restore high-fidelity details, we develop a normal-guided single to multi-view generation module as an external appearance prior, successfully recovering high-frequency textures. Experiments demonstrate that our method enables fast, globally consistent, and high-fidelity 3D model editing.
Chinese Translation
现有的3D编辑方法依赖于计算密集型的逐场景迭代优化,并且存在多视图不一致的问题。我们提出了一种有效的完全前馈式3D编辑框架,基于TRELLIS生成骨干网络,能够从单一编辑视图修改3D模型。我们的框架解决了两个关键问题:将无训练的2D编辑适配到结构化的3D表示,以及克服压缩3D特征中外观保真度的瓶颈。为了确保几何一致性,我们引入了Voxel FlowEdit,这是一种在稀疏体素潜在空间中的编辑驱动流,能够在单次传递中实现全局一致的3D变形。为了恢复高保真细节,我们开发了一个以法线为指导的单视图到多视图生成模块,作为外部外观先验,成功恢复高频纹理。实验表明,我们的方法能够实现快速、全局一致且高保真的3D模型编辑。
cs.CV / 19 / 2602.21503
AHAN: Asymmetric Hierarchical Attention Network for Identical Twin Face Verification
AHAN:用于同卵双胞胎人脸验证的非对称层次注意力网络
Abstract
Identical twin face verification represents an extreme fine-grained recognition challenge where even state-of-the-art systems fail due to overwhelming genetic similarity. Current face recognition methods achieve over 99.8% accuracy on standard benchmarks but drop dramatically to 88.9% when distinguishing identical twins, exposing critical vulnerabilities in biometric security systems. The difficulty lies in learning features that capture subtle, non-genetic variations that uniquely identify individuals. We propose the Asymmetric Hierarchical Attention Network (AHAN), a novel architecture specifically designed for this challenge through multi-granularity facial analysis. AHAN introduces a Hierarchical Cross-Attention (HCA) module that performs multi-scale analysis on semantic facial regions, enabling specialized processing at optimal resolutions. We further propose a Facial Asymmetry Attention Module (FAAM) that learns unique biometric signatures by computing cross-attention between left and right facial halves, capturing subtle asymmetric patterns that differ even between twins. To ensure the network learns truly individuating features, we introduce Twin-Aware Pair-Wise Cross-Attention (TA-PWCA), a training-only regularization strategy that uses each subject's own twin as the hardest possible distractor. Extensive experiments on the ND_TWIN dataset demonstrate that AHAN achieves 92.3% twin verification accuracy, a 3.4 percentage-point improvement over state-of-the-art methods.
Chinese Translation
同卵双胞胎人脸验证代表了一项极具挑战性的细粒度识别任务,甚至最先进的系统也因遗传相似性过高而失败。目前的人脸识别方法在标准基准测试中达到了超过99.8%的准确率,但在区分同卵双胞胎时急剧下降至88.9%,暴露了生物识别安全系统的关键漏洞。困难在于学习能够捕捉微妙的非遗传变异的特征,这些变异能够唯一地识别个体。我们提出了非对称层次注意力网络(AHAN),这是一种专门为此挑战设计的新型架构,通过多粒度面部分析来实现。AHAN引入了层次交叉注意力(HCA)模块,对语义面部区域进行多尺度分析,从而在最佳分辨率下实现专门化处理。我们进一步提出了面部不对称注意力模块(FAAM),通过计算左右面部半边之间的交叉注意力来学习独特的生物特征签名,捕捉即使在双胞胎之间也有所不同的微妙不对称模式。为了确保网络学习到真正的个体特征,我们引入了双胞胎感知成对交叉注意力(TA-PWCA),这是一种仅在训练中使用的正则化策略,利用每个受试者自己的双胞胎作为最具挑战性的干扰项。在ND_TWIN数据集上的大量实验表明,AHAN达到了92.3%的双胞胎验证准确率,比最先进的方法提高了3.4个百分点。
cs.CV / 20 / 2602.21517
Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agentic Learning
我应该信任哪个工具响应?具备工具专业知识的胸部X光代理与多模态代理学习
Abstract
AI agents with tool-use capabilities show promise for integrating the domain expertise of various tools. In the medical field, however, tools are usually AI models that are inherently error-prone and can produce contradictory responses. Existing research on medical agents lacks sufficient understanding of the tools' realistic reliability and thus cannot effectively resolve tool conflicts. To address this gap, this paper introduces a framework that enables an agent to interact with tools and empirically learn their practical trustworthiness across different types of multimodal queries via agentic learning. As a concrete instantiation, we focus on chest X-ray analysis and present a tool-expertise-aware chest X-ray agent (TEA-CXA). When tool outputs disagree, the agent experimentally accepts or rejects multimodal tool results, receives rewards, and learns which tool to trust for each query type. Importantly, TEA-CXA extends existing codebases for reinforcement learning with multi-turn tool-calling that focus on textual inputs, to support multimodal contexts effectively. In addition, we enhance the codebase for medical use scenarios by supporting multiple tool calls in one turn, parallel tool inference, and multi-image accommodation within a single user query. Our code framework is applicable to general medical research on multi-turn tool-calling reinforcement learning in multimodal settings. Experiments show that TEA-CXA outperforms the state-of-the-art methods and a comprehensive set of baselines. Code will be released.
Chinese Translation
具备工具使用能力的人工智能代理在整合各种工具的领域专业知识方面展现出良好的前景。然而,在医学领域,工具通常是固有易出错的人工智能模型,可能会产生矛盾的响应。现有关于医学代理的研究对工具的实际可靠性缺乏足够的理解,因此无法有效解决工具之间的冲突。为了解决这一问题,本文提出了一个框架,使代理能够与工具互动,并通过代理学习在不同类型的多模态查询中实证学习工具的实际可信度。作为具体实例,我们专注于胸部X光分析,并提出了一种具备工具专业知识的胸部X光代理(TEA-CXA)。当工具输出不一致时,代理会实验性地接受或拒绝多模态工具结果,获得奖励,并学习在每种查询类型中信任哪个工具。重要的是,TEA-CXA扩展了现有的强化学习代码库,支持多轮工具调用,重点关注文本输入,以有效支持多模态上下文。此外,我们通过支持在一次调用中进行多个工具调用、并行工具推理以及在单个用户查询中容纳多幅图像,增强了医学使用场景的代码库。我们的代码框架适用于多模态环境中多轮工具调用强化学习的一般医学研究。实验表明,TEA-CXA的表现优于最先进的方法和一系列基线。代码将会发布。
cs.CV / 21 / 2602.21535
Pseudo-View Enhancement via Confidence Fusion for Unposed Sparse-View Reconstruction
通过置信度融合的伪视图增强用于无姿态稀疏视图重建
Abstract
3D scene reconstruction under unposed sparse viewpoints is a highly challenging yet practically important problem, especially in outdoor scenes due to complex lighting and scale variation. With extremely limited input views, directly using a diffusion model to synthesize pseudo frames introduces implausible geometry, which harms the final reconstruction quality. To address these issues, we propose a novel framework for sparse-view outdoor reconstruction that achieves high-quality results through bidirectional pseudo frame restoration and scene perception Gaussian management. Specifically, we introduce a bidirectional pseudo frame restoration method that restores missing content by diffusion-based synthesis guided by adjacent frames with a lightweight pseudo-view deblur model and confidence mask inference algorithm. Then we propose a scene perception Gaussian management strategy that optimizes Gaussians based on joint depth-density information. These designs significantly enhance reconstruction completeness, suppress floating artifacts, and improve overall geometric consistency under extreme view sparsity. Experiments on outdoor benchmarks demonstrate substantial gains over existing methods in both fidelity and stability.
Chinese Translation
在无姿态稀疏视点下进行三维场景重建是一个极具挑战性但又具有实际重要性的问题,尤其是在户外场景中,由于复杂的光照和尺度变化。在输入视图极为有限的情况下,直接利用扩散模型合成伪帧会引入不合理的几何形状,从而损害最终的重建质量。为了解决这些问题,我们提出了一种新颖的稀疏视图户外重建框架,通过双向伪帧恢复和场景感知高斯管理实现高质量的重建结果。具体而言,我们引入了一种双向伪帧恢复方法,通过基于扩散的合成,在轻量级伪视图去模糊模型和置信度掩模推理算法的指导下恢复缺失内容。然后,我们提出了一种场景感知高斯管理策略,基于联合深度-密度信息优化高斯分布。这些设计显著增强了重建的完整性,抑制了浮动伪影,并在极端视图稀疏条件下改善了整体几何一致性。在户外基准测试中的实验结果表明,与现有方法相比,在保真度和稳定性方面有显著提升。
cs.CV / 22 / 2602.21536
IHF-Harmony: Multi-Modality Magnetic Resonance Images Harmonization using Invertible Hierarchy Flow Model
IHF-Harmony:使用可逆层次流模型进行多模态磁共振图像的协调
Abstract
Retrospective MRI harmonization is limited by poor scalability across modalities and reliance on traveling subject datasets. To address these challenges, we introduce IHF-Harmony, a unified invertible hierarchy flow framework for multi-modality harmonization using unpaired data. By decomposing the translation process into reversible feature transformations, IHF-Harmony guarantees bijective mapping and lossless reconstruction to prevent anatomical distortion. Specifically, an invertible hierarchy flow (IHF) performs hierarchical subtractive coupling to progressively remove artefact-related features, while an artefact-aware normalization (AAN) employs anatomy-fixed feature modulation to accurately transfer target characteristics. Combined with anatomy and artefact consistency loss objectives, IHF-Harmony achieves high-fidelity harmonization that retains source anatomy. Experiments across multiple MRI modalities demonstrate that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance, facilitating robust harmonization for large-scale multi-site imaging studies. Code will be released upon acceptance.
Chinese Translation
回顾性的磁共振成像(MRI)协调受到跨模态可扩展性差和依赖于旅行受试者数据集的限制。为了解决这些挑战,我们提出了IHF-Harmony,这是一种统一的可逆层次流框架,用于使用未配对数据进行多模态协调。通过将转换过程分解为可逆特征变换,IHF-Harmony保证了双射映射和无损重建,以防止解剖结构失真。具体而言,可逆层次流(IHF)执行层次减法耦合,逐步去除与伪影相关的特征,而伪影感知归一化(AAN)则采用固定解剖特征调制,准确传递目标特征。结合解剖和伪影一致性损失目标,IHF-Harmony实现了高保真度的协调,保留了源解剖结构。在多个MRI模态的实验中,IHF-Harmony在解剖保真度和下游任务性能方面均优于现有方法,为大规模多中心成像研究提供了可靠的协调。代码将在接受后发布。
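The bijective mapping and lossless reconstruction claimed for the invertible flow can be illustrated with a single subtractive coupling step: one half of the features passes through unchanged and determines a shift that is subtracted from the other half, so the step can be undone exactly by adding the same shift back. This is a generic coupling-layer sketch, not IHF-Harmony's architecture. The `shift_fn` below is an arbitrary stand-in for a learned network.

```python
def shift_fn(features):
    """Stand-in for a learned network; it does not need to be invertible."""
    return [0.5 * v * v + 1.0 for v in features]


def coupling_forward(x1, x2):
    """Subtractive coupling: y1 = x1, y2 = x2 - shift_fn(x1)."""
    shift = shift_fn(x1)
    return x1, [b - s for b, s in zip(x2, shift)]


def coupling_inverse(y1, y2):
    """Exact inverse: x1 = y1, x2 = y2 + shift_fn(y1)."""
    shift = shift_fn(y1)
    return y1, [b + s for b, s in zip(y2, shift)]


if __name__ == "__main__":
    x1, x2 = [1.0, 2.0], [3.0, 4.0]
    y1, y2 = coupling_forward(x1, x2)
    print("forward:", y1, y2)
    print("recovered:", coupling_inverse(y1, y2))
```

Because `x1` is available unchanged on both sides, the inverse recomputes the same shift, which is why the mapping is invertible regardless of how complex `shift_fn` is.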
cs.CV / 23 / 2602.21539
VasGuideNet: Vascular Topology-Guided Couinaud Liver Segmentation with Structural Contrastive Loss
VasGuideNet:基于血管拓扑引导的Couinaud肝脏分割与结构对比损失
Abstract
Accurate Couinaud liver segmentation is critical for preoperative surgical planning and tumor localization. However, existing methods primarily rely on image intensity and spatial location cues, without explicitly modeling vascular topology. As a result, they often produce indistinct boundaries near vessels and show limited generalization under anatomical variability. We propose VasGuideNet, the first Couinaud segmentation framework explicitly guided by vascular topology. Specifically, skeletonized vessels, Euclidean distance transform (EDT)-derived geometry, and k-nearest neighbor (kNN) connectivity are encoded into topology features using Graph Convolutional Networks (GCNs). These features are then injected into a 3D encoder-decoder backbone via a cross-attention fusion module. To further improve inter-class separability and anatomical consistency, we introduce a Structural Contrastive Loss (SCL) with a global memory bank. On Task08_HepaticVessel and our private LASSD dataset, VasGuideNet achieves Dice scores of 83.68% and 76.65% with RVDs of 1.68 and 7.08, respectively. It consistently outperforms representative baselines including UNETR, Swin UNETR, and G-UNETR++, delivering higher Dice/mIoU and lower RVD across datasets, demonstrating its effectiveness for anatomically consistent segmentation. Code is available at https://github.com/Qacket/VasGuideNet.git.
Chinese Translation
准确的Couinaud肝脏分割对于术前手术规划和肿瘤定位至关重要。然而,现有方法主要依赖图像强度和空间位置线索,而未明确建模血管拓扑。因此,它们在血管附近往往产生模糊的边界,并且在解剖变异下表现出有限的泛化能力。我们提出了VasGuideNet,这是第一个明确以血管拓扑为引导的Couinaud分割框架。具体而言,骨架化的血管、欧几里得距离变换(Euclidean Distance Transform, EDT)衍生的几何特征和k近邻(k-nearest neighbor, kNN)连通性被编码为拓扑特征,使用图卷积网络(Graph Convolutional Networks, GCNs)进行处理。这些特征随后通过交叉注意力融合模块注入到3D编码器-解码器主干网络中。为了进一步提高类别间的可分离性和解剖一致性,我们引入了一种具有全局记忆库的结构对比损失(Structural Contrastive Loss, SCL)。在Task08_HepaticVessel和我们的私有LASSD数据集上,VasGuideNet分别达到了83.68%和76.65%的Dice分数,RVD分别为1.68和7.08。它在多个数据集上始终优于代表性基线,包括UNETR、Swin UNETR和G-UNETR++,在Dice/mIoU和RVD方面均表现出更高的性能,证明了其在解剖一致性分割中的有效性。代码可在https://github.com/Qacket/VasGuideNet.git获取。
cs.CV / 24 / 2602.21552
Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
将视觉几何先验推广至稀疏高斯占用预测
Abstract
Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65× faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released at https://github.com/JuIvyy/GPOcc.
Chinese Translation
准确的三维场景理解对于具身智能至关重要,而占用预测作为推理对象和自由空间的关键任务日益受到关注。现有方法主要依赖深度先验(例如,DepthAnything),但对三维线索的利用有限,限制了性能和泛化能力。最近,视觉几何模型如VGGT在提供丰富的三维先验方面表现出强大的能力,但与单目深度基础模型类似,它们仍然仅在可见表面层面操作,而非体积内部,这促使我们探索如何更有效地利用这些日益强大的几何先验进行三维占用预测。我们提出了GPOcc,一个利用可推广的视觉几何先验(GPs)进行单目占用预测的框架。我们的方法沿着相机光线将表面点向内延伸,以生成体积样本,这些样本被表示为高斯原语,用于概率占用推断。为了处理流式输入,我们进一步设计了一种无训练的增量更新策略,将每帧的高斯融合为统一的全局表示。在Occ-ScanNet和EmbodiedOcc-ScanNet上的实验表明显著提升:在单目设置下,GPOcc的mIoU提高了+9.99,在流式设置下提高了+11.79,超越了之前的最先进水平。在相同的深度先验下,其运行速度提高了2.65倍,同时实现了+6.73的mIoU。这些结果突出表明GPOcc更有效和高效地利用了几何先验。代码将发布在https://github.com/JuIvyy/GPOcc。
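The step of extending surface points inward along camera rays can be pictured geometrically: take the unit direction from the camera centre to a visible surface point and place extra samples at increasing distances behind that point. The sketch below shows only this geometry. The offsets, the function name, and the idea of using fixed offsets are assumptions, and how GPOcc turns the samples into Gaussian primitives is not covered.

```python
import math


def extend_inward(camera_origin, surface_point, offsets):
    """Sample points behind a surface point along its camera ray.

    offsets: distances (same unit as the coordinates) measured from the
    surface point, moving away from the camera. Values are hypothetical.
    """
    direction = [s - c for s, c in zip(surface_point, camera_origin)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("surface point coincides with the camera origin")
    unit = [d / norm for d in direction]
    return [[s + t * u for s, u in zip(surface_point, unit)] for t in offsets]


if __name__ == "__main__":
    samples = extend_inward([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.5, 1.0])
    for point in samples:
        print(point)
```

Each sample lies on the same ray as the observed surface point, so the visible surface supplies candidates for the occluded volume behind it, which is the gap between surface-level priors and volumetric occupancy that the abstract describes.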
cs.CV / 25 / 2602.21581
MultiAnimate: Pose-Guided Image Animation Made Extensible
MultiAnimate:可扩展的姿态引导图像动画
Abstract
Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components, Identifier Assigner and Identifier Adapter, which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.
Chinese Translation
姿态引导的人物图像动画旨在合成由一系列姿态驱动的参考角色的真实视频。尽管基于扩散的方法取得了显著成功,但大多数现有方法仅限于单角色动画。我们观察到,简单地将这些方法扩展到多角色场景往往会导致角色之间的身份混淆和不合理的遮挡。为了解决这些挑战,本文提出了一种基于现代扩散变换器(Diffusion Transformers, DiTs)的可扩展多角色图像动画框架,用于视频生成。我们的框架核心引入了两个新组件——身份分配器(Identifier Assigner)和身份适配器(Identifier Adapter),它们协同捕捉每个人的位置信息和人际间的空间关系。该基于掩码的方案以及可扩展的训练策略,不仅增强了灵活性,还使得模型能够推广到训练期间未见过的更多角色场景。值得注意的是,我们的模型在仅使用双角色数据集进行训练的情况下,能够推广到多角色动画,同时保持与单角色案例的兼容性。大量实验表明,我们的方法在多角色图像动画中达到了最先进的性能,超越了现有的基于扩散的方法基准。
cs.CV / 26 / 2602.21589
SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction
SEF-MAP:用于鲁棒多模态高清地图预测的子空间分解专家融合
Abstract
High-definition (HD) maps are essential for autonomous driving, yet multi-modal fusion often suffers from inconsistency between camera and LiDAR modalities, leading to performance degradation under low-light conditions, occlusions, or sparse point clouds. To address this, we propose SEF-MAP, a Subspace-Expert Fusion framework for robust multimodal HD map prediction. The key idea is to explicitly disentangle BEV features into four semantic subspaces: LiDAR-private, Image-private, Shared, and Interaction. Each subspace is assigned a dedicated expert, thereby preserving modality-specific cues while capturing cross-modal consensus. To adaptively combine expert outputs, we introduce an uncertainty-aware gating mechanism at the BEV-cell level, where unreliable experts are down-weighted based on predictive variance, complemented by a usage balance regularizer to prevent expert collapse. To enhance robustness in degraded conditions and promote role specialization, we further propose distribution-aware masking: during training, modality-drop scenarios are simulated using EMA-statistical surrogate features, and a specialization loss enforces distinct behaviors of private, shared, and interaction experts across complete and masked inputs. Experiments on nuScenes and Argoverse2 benchmarks demonstrate that SEF-MAP achieves state-of-the-art performance, surpassing prior methods by +4.2% and +4.8% in mAP, respectively. SEF-MAP provides a robust and effective solution for multi-modal HD map prediction under diverse and degraded conditions.
Chinese Translation
高清(HD)地图对于自动驾驶至关重要,但多模态融合常常受到相机和激光雷达(LiDAR)模态之间不一致性的影响,导致在低光照条件、遮挡或稀疏点云下性能下降。为了解决这个问题,我们提出了SEF-MAP,一个用于鲁棒多模态高清地图预测的子空间专家融合框架。其核心思想是将鸟瞰视图(BEV)特征明确地解耦为四个语义子空间:LiDAR专有、图像专有、共享和交互。每个子空间都分配一个专门的专家,从而在捕捉跨模态共识的同时保留模态特定线索。为了自适应地结合专家输出,我们在BEV单元级引入了一种基于不确定性的门控机制,其中不可靠的专家根据预测方差被降低权重,并辅以使用平衡正则化器以防止专家崩溃。为了增强在恶劣条件下的鲁棒性并促进角色专业化,我们进一步提出了分布感知掩蔽:在训练过程中,使用EMA统计替代特征模拟模态丢失场景,并通过专业化损失强制私有、共享和交互专家在完整和掩蔽输入之间表现出不同的行为。在nuScenes和Argoverse2基准上的实验表明,SEF-MAP实现了最先进的性能,分别比之前的方法提高了+4.2%和+4.8%的mAP。SEF-MAP为在多样化和恶劣条件下的多模态高清地图预测提供了一个鲁棒有效的解决方案。
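One simple way to down-weight unreliable experts based on predictive variance, as the abstract describes, is a softmax over negative variances at each BEV cell. The sketch below uses that form for a single cell with scalar expert outputs. The softmax form, the temperature parameter, and the function names are assumptions, since the abstract does not specify SEF-MAP's gating function.

```python
import math


def uncertainty_gate(variances, temperature=1.0):
    """Gate weights from predictive variances (assumed softmax form).

    Lower variance gives a larger weight; weights sum to 1.
    """
    logits = [-v / temperature for v in variances]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_cell(expert_outputs, variances, temperature=1.0):
    """Variance-weighted combination of expert outputs for one BEV cell."""
    weights = uncertainty_gate(variances, temperature)
    return sum(w * o for w, o in zip(weights, expert_outputs))


if __name__ == "__main__":
    # Four experts: LiDAR-private, Image-private, Shared, Interaction.
    outputs = [0.9, 0.2, 0.7, 0.6]
    variances = [0.05, 1.50, 0.10, 0.30]  # image expert unreliable, e.g. low light
    print(uncertainty_gate(variances))
    print(fuse_cell(outputs, variances))
```

In this toy case the high-variance image expert receives the smallest weight, mimicking the low-light scenario mentioned in the abstract. The usage balance regularizer that prevents expert collapse is a training-time term and is not sketched here.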
cs.CV / 27 / 2602.21591
CADC: Content Adaptive Diffusion-Based Generative Image Compression
CADC:基于内容自适应扩散的生成图像压缩
Abstract
Diffusion-based generative image compression has demonstrated remarkable potential for achieving realistic reconstruction at ultra-low bitrates. The key to unlocking this potential lies in making the entire compression process content-adaptive, ensuring that the encoder's representation and the decoder's generative prior are dynamically aligned with the semantic and structural characteristics of the input image. However, existing methods suffer from three critical limitations that prevent effective content adaptation. First, isotropic quantization applies a uniform quantization step, failing to adapt to the spatially varying complexity of image content and creating a misalignment with the diffusion model's noise-dependent prior. Second, the information concentration bottleneck -- arising from the dimensional mismatch between the high-dimensional noisy latent and the diffusion decoder's fixed input -- prevents the model from adaptively preserving essential semantic information in the primary channels. Third, existing textual conditioning strategies either need significant textual bitrate overhead or rely on generic, content-agnostic textual prompts, thereby failing to provide adaptive semantic guidance efficiently. To overcome these limitations, we propose a content-adaptive diffusion-based image codec with three technical innovations: 1) an Uncertainty-Guided Adaptive Quantization method that learns spatial uncertainty maps to adaptively align quantization distortion with content characteristics; 2) an Auxiliary Decoder-Guided Information Concentration method that uses a lightweight auxiliary decoder to enforce content-aware information preservation in the primary latent channels; and 3) a Bitrate-Free Adaptive Textual Conditioning method that derives content-aware textual descriptions from the auxiliary reconstructed image, enabling semantic guidance without bitrate cost.
Chinese Translation
基于扩散的生成图像压缩在超低比特率下实现真实重建方面展现了显著的潜力。解锁这一潜力的关键在于使整个压缩过程具有内容自适应性,确保编码器的表示和解码器的生成先验与输入图像的语义和结构特征动态对齐。然而,现有方法存在三项关键限制,阻碍了有效的内容自适应。首先,各向同性量化采用统一的量化步长,未能适应图像内容的空间变化复杂性,从而导致与扩散模型的噪声依赖先验之间的错位。其次,信息集中瓶颈——源于高维噪声潜变量与扩散解码器固定输入之间的维度不匹配——阻碍了模型在主要通道中自适应地保留重要的语义信息。第三,现有的文本条件策略要么需要显著的文本比特率开销,要么依赖于通用的、与内容无关的文本提示,因此未能有效提供自适应的语义指导。为克服这些限制,我们提出了一种基于内容自适应的扩散图像编解码器,具有三项技术创新:1) 不确定性引导的自适应量化方法,通过学习空间不确定性图来自适应地将量化失真与内容特征对齐;2) 辅助解码器引导的信息集中方法,利用轻量级辅助解码器在主要潜变量通道中强制执行内容感知的信息保留;3) 无比特率自适应文本条件方法,从辅助重建图像中推导内容感知的文本描述,实现无需比特率成本的语义指导。
cs.CV / 28 / 2602.21596
A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers
扩散变换器条件嵌入中的隐含语义瓶颈
Abstract
Diffusion Transformers have achieved state-of-the-art performance in class-conditional and multimodal generation, yet the structure of their learned conditional embeddings remains poorly understood. In this work, we present the first systematic study of these embeddings and uncover a notable redundancy: class-conditioned embeddings exhibit extreme angular similarity, exceeding 99% on ImageNet-1K, while continuous-condition tasks such as pose-guided image generation and video-to-audio generation reach over 99.9%. We further find that semantic information is concentrated in a small subset of dimensions, with head dimensions carrying the dominant signal and tail dimensions contributing minimally. By pruning low-magnitude dimensions (removing up to two-thirds of the embedding space), we show that generation quality and fidelity remain largely unaffected, and in some cases improve. These results reveal a semantic bottleneck in Transformer-based diffusion models, providing new insights into how semantics are encoded and suggesting opportunities for more efficient conditioning mechanisms.
Chinese Translation
扩散变换器在类别条件和多模态生成方面已达到最先进的性能,但其学习到的条件嵌入的结构仍然不够清楚。在本研究中,我们首次对这些嵌入进行了系统研究,并揭示了一个显著的冗余现象:类别条件嵌入表现出极高的角度相似性,在 ImageNet-1K 上超过 99%,而连续条件任务如姿态引导的图像生成和视频到音频生成则超过 99.9%。我们进一步发现,语义信息集中在一个小的维度子集上,头部维度承载了主要信号,而尾部维度的贡献则微乎其微。通过修剪低幅度维度——去除多达三分之二的嵌入空间——我们展示了生成质量和保真度基本不受影响,并在某些情况下有所改善。这些结果揭示了基于变换器的扩散模型中的语义瓶颈,为语义如何被编码提供了新的见解,并暗示了更高效的条件机制的机会。
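The two measurements named in the abstract above, angular similarity between conditional embeddings and pruning of low-magnitude dimensions, can be sketched as follows. The abstract does not say whether pruning is applied per embedding or with a shared mask, so the per-embedding top-k rule, the function names, and the toy vectors are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prune_low_magnitude(embedding, keep_fraction):
    """Zero every dimension except the largest-magnitude ones.

    Assumption: the kept set is chosen per embedding by absolute value.
    """
    k = max(1, round(len(embedding) * keep_fraction))
    order = sorted(range(len(embedding)),
                   key=lambda i: abs(embedding[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(embedding)]

# Two toy embeddings that share large "head" dimensions and differ
# only in small "tail" dimensions are almost parallel.
class_a = [10.0, 10.0, 0.1, -0.1]
class_b = [10.0, 10.0, -0.1, 0.1]
similarity = cosine_similarity(class_a, class_b)

# Keep one third of the dimensions, i.e. remove two thirds.
pruned = prune_low_magnitude([5.0, 0.1, -4.0, 0.2, 0.05, 3.0], 1 / 3)
```

In this toy case the similarity exceeds 0.99 even though the two vectors disagree in sign on every tail dimension, which is the kind of redundancy the abstract reports.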
cs.CV / 29 / 2602.21613
Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
基于MRI的颅内肿瘤诊断虚拟活检
Abstract
Deep intracranial tumors situated in eloquent brain regions controlling vital functions present critical diagnostic challenges. Clinical practice has shifted toward stereotactic biopsy for pathological confirmation before treatment. Yet biopsy carries inherent risks of hemorrhage and neurological deficits and struggles with sampling bias due to tumor spatial heterogeneity, because pathological changes are typically region-selective rather than tumor-wide. Therefore, advancing non-invasive MRI-based pathology prediction is essential for holistic tumor assessment and modern clinical decision-making. The primary challenge lies in data scarcity: low tumor incidence requires long collection cycles, and annotation demands biopsy-verified pathology from neurosurgical experts. Additionally, tiny lesion volumes lacking segmentation masks cause critical features to be overwhelmed by background noise. To address these challenges, we construct the ICT-MRI dataset - the first public biopsy-verified benchmark with 249 cases across four categories. We propose a Virtual Biopsy framework comprising: MRI-Processor for standardization; Tumor-Localizer employing vision-language models for coarse-to-fine localization via weak supervision; and Adaptive-Diagnoser with a Masked Channel Attention mechanism fusing local discriminative features with global contexts. Experiments demonstrate over 90% accuracy, outperforming baselines by more than 20%.
Chinese Translation
位于控制重要功能的脑区的深部颅内肿瘤在诊断上面临重大挑战。临床实践已转向立体定向活检以在治疗前进行病理确认。然而,活检本身存在出血和神经功能缺损的固有风险,并且由于肿瘤的空间异质性,活检在采样时容易受到偏倚,因为病理变化通常是区域选择性的,而非肿瘤全局性的。因此,推进基于MRI的非侵入性病理预测对于全面评估肿瘤和现代临床决策至关重要。主要挑战在于数据稀缺:低肿瘤发生率需要较长的收集周期,而标注则要求神经外科专家提供经过活检验证的病理信息。此外,缺乏分割掩膜的小病灶体积使得关键特征容易被背景噪声淹没。为了解决这些挑战,我们构建了ICT-MRI数据集——第一个公开的经过活检验证的基准数据集,包含249个案例,分为四个类别。我们提出了一个虚拟活检框架,包括:用于标准化的MRI处理器;利用视觉-语言模型进行粗到细定位的肿瘤定位器,通过弱监督实现;以及具有掩蔽通道注意机制的自适应诊断器,将局部判别特征与全局上下文融合。实验结果表明,准确率超过90%,比基线提高了20%以上。
cs.CV / 30 / 2602.21627
Tokenizing Semantic Segmentation with RLE
使用 RLE 的语义分割标记化
Abstract
This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks and then train a modified version of Pix2Seq [p2s] to output these RLE tokens through autoregression. We propose novel tokenization strategies to compress the length of the token sequence to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our proposed models on two datasets to show that they are competitive with the state of the art in spite of being bottlenecked by our limited computational resources.
Chinese Translation
本文提出了一种新的统一方法,通过使用语言建模将图像和视频中的语义分割输出为离散标记序列。我们使用游程编码(RLE)对分割掩膜进行离散化,然后训练一个修改版的 Pix2Seq [p2s],通过自回归输出这些 RLE 标记。我们提出了新颖的标记化策略,以压缩标记序列的长度,使得将该方法扩展到视频成为可能。我们还展示了如何将实例信息纳入标记化过程,以执行全景分割。我们在两个数据集上评估了我们提出的模型,结果表明尽管受到有限计算资源的瓶颈影响,但它们在性能上与最先进的技术具有竞争力。
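To make the run length encoding step concrete, here is a minimal sketch that turns a small semantic mask into (start, length, class) runs and back. The row-major flattening, the triple layout, and the function names are assumptions for illustration; the paper's actual token vocabulary and its compression strategies are not described in the abstract.

```python
def rle_encode(mask):
    """Encode a 2D mask (0 = background) as (start, length, class) runs.

    The mask is flattened in row-major order, so a run may continue
    from the end of one row into the start of the next.
    """
    flat = [value for row in mask for value in row]
    runs = []
    i = 0
    while i < len(flat):
        if flat[i] == 0:
            i += 1
            continue
        j = i
        while j < len(flat) and flat[j] == flat[i]:
            j += 1
        runs.append((i, j - i, flat[i]))
        i = j
    return runs

def rle_decode(runs, height, width):
    """Rebuild the 2D mask from (start, length, class) runs."""
    flat = [0] * (height * width)
    for start, length, cls in runs:
        for k in range(start, start + length):
            flat[k] = cls
    return [flat[r * width:(r + 1) * width] for r in range(height)]

mask = [[0, 1, 1, 0],
        [2, 2, 0, 0]]
runs = rle_encode(mask)        # [(1, 2, 1), (4, 2, 2)]
restored = rle_decode(runs, 2, 4)
```

Each run contributes a fixed number of integers, so the sequence length grows with the number of runs rather than with the number of pixels, which is why shortening the run sequence matters when extending the approach to video.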
cs.CV / 31 / 2602.21631
UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling
UniHand:一种用于多样化受控4D手部运动建模的统一模型
Abstract
Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.
Chinese Translation
手部运动在人际互动中扮演着核心角色,但建模真实的4D手部运动(即随时间变化的3D手部姿态序列)仍然具有挑战性。该领域的研究通常分为两个任务:(1)估计方法从视觉观察中重建精确的运动,但在手部遮挡或缺失的情况下往往失败;(2)生成方法通过利用多模态结构输入下的生成先验,专注于合成手部姿态并填补不完整序列中的运动。然而,这种分离不仅限制了在实践中经常出现的异构条件信号的有效利用,也阻碍了两个任务之间的知识转移。我们提出了UniHand,一个统一的基于扩散的框架,将估计和生成都表述为条件运动合成。UniHand通过将结构信号嵌入共享潜在空间,利用联合变分自编码器整合异构输入,从而对齐如MANO参数和2D骨架等条件。视觉观察通过一个冻结的视觉主干进行编码,而专用的手部感知器直接从图像特征中提取手部特定线索,消除了复杂的检测和裁剪流程的需求。然后,潜在扩散模型从这些多样的条件中合成一致的运动序列。在多个基准上的广泛实验表明,UniHand提供了稳健且准确的手部运动建模,在严重遮挡和时间上不完整的输入下保持性能。
cs.CV / 32 / 2602.21636
Axial-Centric Cross-Plane Attention for 3D Medical Image Classification
以轴心为中心的跨平面注意力机制用于三维医学图像分类
Abstract
Clinicians commonly interpret three-dimensional (3D) medical images, such as computed tomography (CT) scans, using multiple anatomical planes rather than as a single volumetric representation. In this multi-planar approach, the axial plane typically serves as the primary acquisition and diagnostic reference, while the coronal and sagittal planes provide complementary spatial information to increase diagnostic confidence. However, many existing 3D deep learning methods either process volumetric data holistically or assign equal importance to all planes, failing to reflect the axial-centric clinical interpretation workflow. To address this gap, we propose an axial-centric cross-plane attention architecture for 3D medical image classification that captures the inherent asymmetric dependencies between different anatomical planes. Our architecture incorporates MedDINOv3, a medical vision foundation model pretrained via self-supervised learning on large-scale axial CT images, as a frozen feature extractor for the axial, coronal, and sagittal planes. RICA blocks and intra-plane transformer encoders capture plane-specific positional and contextual information within each anatomical plane, while axial-centric cross-plane transformer encoders condition axial features on complementary information from auxiliary planes. Experimental results on six datasets from the MedMNIST3D benchmark demonstrate that the proposed architecture consistently outperforms existing 3D and multi-plane models in terms of accuracy and AUC. Ablation studies further confirm the importance of axial-centric query-key-value allocation and directional cross-plane fusion. These results highlight the importance of aligning architectural design with clinical interpretation workflows for robust and data-efficient 3D medical image analysis.
Chinese Translation
临床医生通常通过多个解剖平面而非单一体积表示来解读三维(3D)医学图像,如计算机断层扫描(CT)图像。在这种多平面方法中,轴向平面通常作为主要的获取和诊断参考,而冠状面和矢状面则提供互补的空间信息,以提高诊断信心。然而,许多现有的3D深度学习方法要么整体处理体积数据,要么对所有平面赋予相同的重要性,未能反映以轴心为中心的临床解释工作流程。为了解决这一问题,我们提出了一种以轴心为中心的跨平面注意力架构,用于3D医学图像分类,旨在捕捉不同解剖平面之间固有的非对称依赖关系。我们的架构结合了MedDINOv3,这是一个通过自监督学习在大规模轴向CT图像上预训练的医学视觉基础模型,作为轴向、冠状面和矢状面的冻结特征提取器。RICA模块和平面内变换编码器捕捉每个解剖平面内特定的位置信息和上下文信息,而以轴心为中心的跨平面变换编码器则基于来自辅助平面的互补信息调节轴向特征。在MedMNIST3D基准的六个数据集上的实验结果表明,所提出的架构在准确性和AUC方面始终优于现有的3D和多平面模型。消融研究进一步确认了以轴心为中心的查询-键-值分配和方向性跨平面融合的重要性。这些结果强调了将架构设计与临床解释工作流程对齐的重要性,以实现稳健且数据高效的3D医学图像分析。
cs.CV / 33 / 2602.21637
CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis
CARE:一种具有自适应区域建模的分子引导基础模型,用于全切片图像分析
Abstract
Foundation models have recently achieved impressive success in computational pathology, demonstrating strong generalization across diverse histopathology tasks. However, existing models overlook the heterogeneous and non-uniform organization of pathological regions of interest (ROIs) because they rely on natural image backbones not tailored for tissue morphology. Consequently, they often fail to capture the coherent tissue architecture beyond isolated patches, limiting interpretability and clinical relevance. To address these challenges, we present Cross-modal Adaptive Region Encoder (CARE), a foundation model for pathology that automatically partitions WSIs into several morphologically relevant regions. Specifically, CARE employs a two-stage pretraining strategy: (1) a self-supervised unimodal pretraining stage that learns morphological representations from 34,277 whole-slide images (WSIs) without segmentation annotations, and (2) a cross-modal alignment stage that leverages RNA and protein profiles to refine the construction and representation of adaptive regions. This molecular guidance enables CARE to identify biologically relevant patterns and generate irregular yet coherent tissue regions, selecting the most representative area as ROI. CARE supports a broad range of pathology-related tasks, using either the ROI feature or the slide-level feature obtained by aggregating adaptive regions. Based on only one-tenth of the pretraining data typically used by mainstream foundation models, CARE achieves superior average performance across 33 downstream benchmarks, including morphological classification, molecular prediction, and survival analysis, and outperforms other foundation model baselines overall.
Chinese Translation
基础模型最近在计算病理学中取得了显著成功,展示了在多样化的组织病理学任务中的强大泛化能力。然而,现有模型忽视了病理感兴趣区域(ROIs)的异质性和非均匀组织,因为它们依赖于未针对组织形态学进行调整的自然图像骨干网络。因此,它们往往无法捕捉到超越孤立斑块的连贯组织结构,限制了可解释性和临床相关性。为了解决这些挑战,我们提出了跨模态自适应区域编码器(CARE),这是一个用于病理学的基础模型,能够自动将全切片图像(WSIs)划分为若干形态学相关区域。具体而言,CARE采用了两阶段的预训练策略:(1)一个自监督的单模态预训练阶段,从34,277幅全切片图像中学习形态学表示,而无需分割注释;(2)一个跨模态对齐阶段,利用RNA和蛋白质谱来优化自适应区域的构建和表示。这种分子引导使CARE能够识别生物学相关模式,并生成不规则但连贯的组织区域,选择最具代表性的区域作为ROIs。CARE支持广泛的病理相关任务,使用自适应区域聚合得到的ROIs特征或切片级特征。基于仅为主流基础模型通常使用的预训练数据的十分之一,CARE在33个下游基准测试中实现了优越的平均性能,包括形态分类、分子预测和生存分析,并在整体上超越了其他基础模型基线。
cs.CV / 34 / 2602.21645
Lie Flow: Video Dynamic Fields Modeling and Predicting with Lie Algebra as Geometric Physics Principle
李群流动:基于李代数的几何物理原理的视频动态场建模与预测
Abstract
Modeling 4D scenes requires capturing both spatial structure and temporal motion, which is challenging due to the need for physically consistent representations of complex rigid and non-rigid motions. Existing approaches mainly rely on translational displacements, which struggle to represent rotations and articulated transformations, often leading to spatial inconsistency and physically implausible motion. LieFlow is a dynamic radiance representation framework that explicitly models motion within the SE(3) Lie group, enabling coherent learning of translation and rotation in a unified geometric space. The SE(3) transformation field enforces physically inspired constraints to maintain motion continuity and geometric consistency. The evaluation includes a synthetic dataset with rigid-body trajectories and two real-world datasets capturing complex motion under natural lighting and occlusions. Across all datasets, LieFlow consistently improves view-synthesis fidelity, temporal coherence, and physical realism over NeRF-based baselines. These results confirm that SE(3)-based motion modeling offers a robust and physically grounded framework for representing dynamic 4D scenes.
Chinese Translation
建模四维场景需要捕捉空间结构和时间运动,这一过程具有挑战性,因为需要对复杂的刚性和非刚性运动进行物理一致性的表示。现有的方法主要依赖于平移位移,这在表示旋转和关节变换时存在困难,常常导致空间不一致和物理上不合理的运动。LieFlow 是一个动态辐射表示框架,它明确地在 SE(3) 李群内建模运动,从而在统一的几何空间中实现平移和旋转的连贯学习。SE(3) 变换场施加了物理启发的约束,以保持运动的连续性和几何一致性。评估包括一个具有刚体轨迹的合成数据集和两个在自然光照和遮挡下捕捉复杂运动的真实世界数据集。在所有数据集中,LieFlow 在视图合成保真度、时间一致性和物理现实性方面始终优于基于 NeRF 的基线。这些结果确认了基于 SE(3) 的运动建模为表示动态四维场景提供了一个稳健且物理基础的框架。
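As background for the SE(3) formulation, the sketch below implements the standard exponential map from a Lie algebra element (a twist made of an angular part `omega` and a linear part `v`) to a rotation matrix and translation vector. This is the textbook closed form, not LieFlow's network or parameterization, which the abstract does not specify; the function names are illustrative.

```python
import math

def skew(w):
    """Return the 3x3 skew-symmetric matrix K such that K u = w x u."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def se3_exp(omega, v):
    """Map a twist (omega, v) in se(3) to a rotation R and translation t."""
    theta = math.sqrt(sum(c * c for c in omega))
    K = skew(omega)
    K2 = matmul3(K, K)
    if theta < 1e-8:
        # Leading Taylor terms avoid division by zero near the identity.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Rodrigues formula for the rotation, and the matrix V that couples
    # rotation into the translation: t = V v.
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)]
         for i in range(3)]
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)]
         for i in range(3)]
    t = [sum(V[i][j] * v[j] for j in range(3)) for i in range(3)]
    return R, t

# A quarter turn about the z-axis combined with a unit shift along z.
R, t = se3_exp((0.0, 0.0, math.pi / 2), (0.0, 0.0, 1.0))
```

Because rotation and translation come out of a single six-dimensional vector, a field that predicts twists yields rigid motions by construction, in contrast to a purely translational displacement field.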
cs.CV / 35 / 2602.21655
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
CCCaption:用于完整且正确图像描述的双重奖励强化学习
Abstract
Image captioning remains a fundamental task for vision-language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.
Chinese Translation
图像描述仍然是视觉语言理解的一个基本任务,但真实标签的监督仍主要依赖于人工标注的参考。由于人工标注反映了主观偏好和专业知识,真实标签往往是不完整的,甚至是错误的,这限制了描述模型的能力。我们认为,描述质量应通过两个客观方面进行评估:完整性(描述是否涵盖所有显著的视觉事实?)和正确性(描述是否与图像真实相符?)。为此,我们提出了CCCaption:一个双重奖励强化学习框架,配备专门的微调语料库,明确优化这些属性以生成完整且正确的描述(Complete and Correct Captions)。为了确保完整性,我们使用多样的视觉语言模型(LVLMs)将图像解构为一组视觉查询,并奖励能够回答更多这些查询的描述,同时采用动态查询采样策略以提高训练效率。为了确保正确性,我们通过验证从描述分解中得出的子描述查询的真实性来惩罚包含虚假信息的描述。我们的对称双重奖励优化共同最大化完整性和正确性,引导模型生成更好地满足这些客观标准的描述。在标准描述基准上的广泛实验显示了一致的改进,提供了一条超越人工标注模仿的描述模型训练的原则性路径。
cs.CV / 36 / 2602.21657
Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis
跟随诊断轨迹:视觉认知引导的胸部X光诊断协作网络
Abstract
Computer-aided diagnosis (CAD) has significantly advanced automated chest X-ray diagnosis but remains isolated from clinical workflows and lacks reliable decision support and interpretability. Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists. However, the absence of interactive tools seamlessly embedded within diagnostic routines impedes collaboration, while the semantic gap between radiologists' decision-making patterns and model representations further limits clinical adoption. To overcome these limitations, we propose a visual cognition-guided collaborative network (VCC-Net) to achieve the cooperative diagnostic paradigm. VCC-Net centers on visual cognition (VC) and employs clinically compatible interfaces, such as eye-tracking or the mouse, to capture radiologists' visual search traces and attention patterns during diagnosis. VCC-Net employs VC as a spatial cognition guide, learning hierarchical visual search strategies to localize diagnostically key regions. A cognition-graph co-editing module subsequently integrates radiologist VC with model inference to construct a disease-aware graph. The module captures dependencies among anatomical regions and aligns model representations with VC-driven features, mitigating radiologist bias and facilitating complementary, transparent decision-making. Experiments on the public datasets SIIM-ACR and EGD-CXR and on the self-constructed TB-Mouse dataset achieved classification accuracies of 88.40%, 85.05%, and 92.41%, respectively. The attention maps produced by VCC-Net exhibit strong concordance with radiologists' gaze distributions, demonstrating a mutual reinforcement of radiologist and model inference. The code is available at https://github.com/IPMI-NWU/VCC-Net.
Chinese Translation
计算机辅助诊断(CAD)在自动化胸部X光诊断方面取得了显著进展,但仍然与临床工作流程相隔离,缺乏可靠的决策支持和可解释性。人机协作旨在通过整合可控放射科医生的行为来增强诊断模型的可靠性。然而,缺乏无缝嵌入诊断流程的交互工具阻碍了协作,而放射科医生决策模式与模型表示之间的语义差距进一步限制了临床应用。为克服这些限制,我们提出了一种视觉认知引导的协作网络(VCC-Net),以实现协作诊断范式。VCC-Net以视觉认知(VC)为中心,采用临床兼容的接口,如眼动追踪或鼠标,捕捉放射科医生在诊断过程中的视觉搜索轨迹和注意力模式。VCC-Net将VC作为空间认知引导,学习分层视觉搜索策略以定位诊断关键区域。随后,一个认知图共同编辑模块将放射科医生的VC与模型推理整合,以构建一个疾病感知图。该模块捕捉解剖区域之间的依赖关系,并将模型表示与VC驱动的特征对齐,从而减轻放射科医生的偏见,促进互补的透明决策。对公共数据集SIIM-ACR、EGD-CXR和自建的TB-Mouse数据集的实验分别达到了88.40%、85.05%和92.41%的分类准确率。VCC-Net生成的注意力图与放射科医生的注视分布表现出强一致性,展示了放射科医生与模型推理之间的相互强化。代码可在https://github.com/IPMI-NWU/VCC-Net获取。
cs.CV / 37 / 2602.21662
HybridINR-PCGC: Hybrid Lossless Point Cloud Geometry Compression Bridging Pretrained Model and Implicit Neural Representation
HybridINR-PCGC:桥接预训练模型与隐式神经表示的混合无损点云几何压缩
Abstract
Learning-based point cloud compression outperforms handcrafted codecs. However, methods built on pretrained models, which rely on end-to-end training and are expected to generalize to all potential samples, suffer from training data dependency. Implicit neural representation (INR) based methods are distribution-agnostic and more robust, but they require time-consuming online training and suffer from the bitstream overhead of the overfitted model. To address these limitations, we propose HybridINR-PCGC, a novel hybrid framework that bridges the pretrained model and INR. Our framework retains distribution-agnostic properties while leveraging a pretrained network to accelerate convergence and reduce model overhead. It consists of two parts: the Pretrained Prior Network (PPN) and the Distribution Agnostic Refiner (DAR). We leverage the PPN, designed for fast inference and stable performance, to generate a robust prior for accelerating the DAR's convergence. The DAR is decomposed into a base layer and an enhancement layer, and only the enhancement layer needs to be packed into the bitstream. Finally, we propose a supervised model compression module to further supervise and minimize the bitrate of the enhancement layer parameters. Experimental results show that HybridINR-PCGC achieves a significantly improved compression rate and encoding efficiency. Specifically, our method achieves a Bpp reduction of approximately 20.43% compared to G-PCC on 8iVFB. In the challenging out-of-distribution scenario Cat1B, our method achieves a Bpp reduction of approximately 57.85% compared to UniPCGC. Our method also exhibits a superior time-rate trade-off, achieving an average Bpp reduction of 15.193% relative to LINR-PCGC on 8iVFB.
Chinese Translation
基于学习的点云压缩在性能上优于手工编码器。然而,基于预训练的方法依赖于端到端训练,并期望能够推广到所有潜在样本,这使其受到训练数据的限制。基于隐式神经表示(INR)的方法对分布不敏感且更具鲁棒性,但它们需要耗时的在线训练,并受到过拟合模型带来的比特流开销。为了解决这些限制,我们提出了HybridINR-PCGC,这是一种新颖的混合框架,连接了预训练模型和INR。我们的框架保留了对分布不敏感的特性,同时利用预训练网络加速收敛并减少模型开销,框架由两个部分组成:预训练先验网络(PPN)和分布无关细化器(DAR)。我们利用PPN,该网络设计用于快速推理和稳定性能,以生成强大的先验,从而加速DAR的收敛。DAR被分解为基础层和增强层,只有增强层需要打包到比特流中。最后,我们提出了一种监督模型压缩模块,以进一步监督并最小化增强层参数的比特率。基于实验结果,HybridINR-PCGC实现了显著提高的压缩率和编码效率。具体而言,我们的方法在8iVFB上相较于G-PCC实现了约20.43%的Bpp减少。在具有挑战性的分布外场景Cat1B中,我们的方法相较于UniPCGC实现了约57.85%的Bpp减少。此外,我们的方法展现了优越的时间-比特率权衡,在8iVFB上相较于LINR-PCGC实现了平均15.193%的Bpp减少。
cs.CV / 38 / 2602.21667
Send Less, Perceive More: Masked Quantized Point Cloud Communication for Loss-Tolerant Collaborative Perception
减少传输,增强感知:面向容错协同感知的掩蔽量化点云通信
Abstract
Collaborative perception allows connected vehicles to overcome occlusions and limited viewpoints by sharing sensory information. However, existing approaches struggle to achieve high accuracy under strict bandwidth constraints and remain highly vulnerable to random transmission packet loss. We introduce QPoint2Comm, a quantized point-cloud communication framework that dramatically reduces bandwidth while preserving high-fidelity 3D information. Instead of transmitting intermediate features, QPoint2Comm directly communicates quantized point-cloud indices using a shared codebook, enabling efficient reconstruction with lower bandwidth than feature-based methods. To ensure robustness to possible communication packet loss, we employ a masked training strategy that simulates random packet loss, allowing the model to maintain strong performance even under severe transmission failures. In addition, a cascade attention fusion module is proposed to enhance multi-vehicle information integration. Extensive experiments on both simulated and real-world datasets demonstrate that QPoint2Comm sets a new state of the art in accuracy, communication efficiency, and resilience to packet loss.
Chinese Translation
协同感知使得联网车辆能够通过共享传感信息克服遮挡和有限视角的限制。然而,现有方法在严格的带宽限制下难以实现高精度,并且对随机传输数据包丢失高度敏感。我们提出了QPoint2Comm,一个量化点云通信框架,显著减少带宽同时保留高保真度的3D信息。QPoint2Comm不再传输中间特征,而是直接使用共享代码本传递量化的点云索引,从而在带宽低于基于特征的方法的情况下实现高效重建。为了确保对可能的通信数据包丢失的鲁棒性,我们采用了一种掩蔽训练策略,模拟随机数据包丢失,使模型在严重传输故障下仍能保持强劲的性能。此外,我们提出了级联注意力融合模块,以增强多车辆信息的整合。对模拟和真实世界数据集的广泛实验表明,QPoint2Comm在准确性、通信效率和对数据包丢失的韧性方面设定了新的技术标准。
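The two ideas in the QPoint2Comm abstract, sending codebook indices instead of features and training against simulated packet loss, can be sketched in a few lines. The nearest-codeword rule, the per-index drop model, and all names below are assumptions for illustration; the abstract does not describe how the codebook is learned or how packets are formed.

```python
import random

def quantize(points, codebook):
    """Map each point to the index of its nearest codeword (squared L2)."""
    def dist2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(codebook)), key=lambda k: dist2(p, codebook[k]))
            for p in points]

def drop_packets(indices, loss_rate, rng):
    """Simulate packet loss: each index is replaced by None with
    probability loss_rate (assumed independent per index)."""
    return [None if rng.random() < loss_rate else i for i in indices]

def reconstruct(received, codebook):
    """Look up the codewords for every index that arrived."""
    return [codebook[i] for i in received if i is not None]

codebook = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
points = [(0.1, 0.0, 0.0), (0.9, 1.0, 1.2)]
indices = quantize(points, codebook)              # [0, 1]
received = drop_packets(indices, 0.5, random.Random(0))
recovered = reconstruct(received, codebook)
```

Only small integers cross the channel, and both sender and receiver hold the same codebook. Applying `drop_packets` during training is one simple way to expose a downstream model to incomplete inputs, in the spirit of the masked training strategy the abstract describes.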
cs.CV / 39 / 2602.21668
Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
基于运动感知高斯分组的动态场景时空预测
Abstract
Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution. We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation. MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations. Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution. Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability. Our project page is available at https://slime0519.github.io/mogaf
Chinese Translation
动态场景的预测仍然是计算机视觉中的一个基本挑战,因为有限的观察使得捕捉连贯的物体级运动和长期时间演变变得困难。我们提出了运动组感知高斯预测(MoGaF),这是一个基于4D高斯点云表示的长期场景外推框架。MoGaF引入了运动感知高斯分组和组内优化,以在刚性和非刚性区域之间强制执行物理一致的运动,从而产生空间上连贯的动态表示。利用这种结构化的时空表示,轻量级预测模块能够预测未来的运动,实现逼真且时间稳定的场景演变。在合成和真实世界数据集上的实验表明,MoGaF在渲染质量、运动合理性和长期预测稳定性方面始终优于现有基线。我们的项目页面可访问 https://slime0519.github.io/mogaf
cs.CV / 40 / 2602.21698
E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought
E-comIQ-ZH:一个与人类对齐的数据集和基准,用于细粒度评估带有思维链的电子商务海报
Abstract
Generative AI is widely used to create commercial posters. However, rapid advances in generation have outpaced automated quality assessment. Existing models emphasize generic aesthetics or low-level distortions and lack the functional criteria required for e-commerce design. Assessment is especially challenging for Chinese content, where complex characters often produce subtle but critical textual artifacts that are overlooked by existing methods. To address this, we introduce E-comIQ-ZH, a framework for evaluating Chinese e-commerce posters. We build E-comIQ-18k, the first dataset to feature multi-dimensional scores and expert-calibrated Chain-of-Thought (CoT) rationales. Using this dataset, we train E-comIQ-M, a specialized evaluation model that aligns with human expert judgment. Our framework enables E-comIQ-Bench, the first automated and scalable benchmark for the generation of Chinese e-commerce posters. Extensive experiments show that E-comIQ-M aligns more closely with expert standards and enables scalable automated assessment of e-commerce posters. All datasets, models, and evaluation tools will be released to support future research in this area. Code will be available at https://github.com/4mm7/E-comIQ-ZH.
Chinese Translation
生成性人工智能广泛用于创建商业海报。然而,生成技术的快速进步已经超越了自动化质量评估的能力。现有模型强调通用美学或低级失真,缺乏电子商务设计所需的功能性标准。对于中文内容而言,尤其具有挑战性,因为复杂字符常常产生微妙但关键的文本伪影,而这些伪影在现有方法中被忽视。为了解决这个问题,我们引入了E-comIQ-ZH,一个用于评估中文电子商务海报的框架。我们构建了第一个数据集E-comIQ-18k,具有多维评分和专家校准的思维链(Chain of Thought, CoT)推理。利用该数据集,我们训练了E-comIQ-M,一个与人类专家判断对齐的专用评估模型。我们的框架使得E-comIQ-Bench成为第一个自动化和可扩展的中文电子商务海报生成基准。大量实验表明,我们的E-comIQ-M与专家标准的对齐程度更高,并能够实现电子商务海报的可扩展自动评估。所有数据集、模型和评估工具将被发布,以支持该领域未来的研究。代码将可在https://github.com/4mm7/E-comIQ-ZH获取。
cs.CV / 41 / 2602.21699
SF3D-RGB: Scene Flow Estimation from Monocular Camera and Sparse LiDAR
SF3D-RGB:基于单目相机和稀疏LiDAR的场景流估计
Abstract
Scene flow estimation is an extremely important task in computer vision to support the perception of dynamic changes in the scene. For robust scene flow, learning-based approaches have recently achieved impressive results using either image-based or LiDAR-based modalities. However, these methods have tended to focus on the use of a single modality. To address this limitation, we present a deep learning architecture, SF3D-RGB, that enables sparse scene flow estimation using 2D monocular images and 3D point clouds (e.g., acquired by LiDAR) as inputs. Our architecture is an end-to-end model that first encodes information from each modality into features and fuses them together. Then, the fused features enhance a graph matching module for better and more robust mapping matrix computation to generate an initial scene flow. Finally, a residual scene flow module further refines the initial scene flow. Our model is designed to strike a balance between accuracy and efficiency. Furthermore, experiments show that our proposed method outperforms single-modality methods and achieves better scene flow accuracy on real-world datasets while using fewer parameters compared to other state-of-the-art methods with fusion.
Chinese Translation
场景流估计是计算机视觉中一项极其重要的任务,旨在支持对场景动态变化的感知。为了实现稳健的场景流,基于学习的方法最近在使用图像或LiDAR模态方面取得了显著成果。然而,这些方法往往集中于单一模态的使用。为了解决这些问题,我们提出了一种深度学习架构SF3D-RGB,该架构能够使用2D单目图像和3D点云(例如,通过LiDAR获取)作为输入进行稀疏场景流估计。我们的架构是一个端到端模型,首先将每种模态的信息编码为特征,并将其融合在一起。然后,融合后的特征增强了图匹配模块,以更好地计算映射矩阵,从而生成初始场景流。最后,残差场景流模块进一步细化初始场景流。我们的模型旨在实现准确性与效率之间的平衡。此外,实验表明,我们提出的方法在真实世界数据集上超越了单模态方法,并在使用更少参数的情况下,相较于其他最先进的融合方法,达到了更好的场景流准确性。
cs.CV / 42 / 2602.21703
Brain Tumor Segmentation with Special Emphasis on the Non-Enhancing Brain Tumor Compartment
脑肿瘤分割,特别关注非增强脑肿瘤区
Abstract
A U-Net based deep learning architecture is designed to segment brain tumors as they appear on various MRI modalities. Special emphasis is placed on the non-enhancing tumor compartment. The latter is no longer considered in recent brain tumor segmentation challenges such as the MICCAI challenges. However, it is considered to be indicative of the survival time of the patient as well as of areas of further tumor growth. Hence it is essential to have a means of automatically delineating its extent within the tumor.
Chinese Translation
设计了一种基于U-Net的深度学习架构,用于对各种MRI模态下的脑肿瘤进行分割。特别强调非增强肿瘤区。后者在最近的脑肿瘤分割挑战(如MICCAI挑战)中并未被考虑。然而,它被认为与患者的生存时间以及进一步肿瘤生长的区域相关。因此,自动描绘其在肿瘤内的扩展是至关重要的。
cs.CV / 43 / 2602.21704
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
动态多模态激活引导用于大型视觉-语言模型的幻觉缓解
Abstract
Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems. Through in-depth analysis of LVLM activation patterns, we reveal two key findings: 1) truthfulness and visual perception capabilities predominantly engage different subsets of attention heads within the model architecture; and 2) truthfulness steering vectors vary significantly across different semantic contexts. Based on these observations, we propose Dynamic Multimodal Activation Steering, a training-free approach for hallucination mitigation. Our method constructs a semantic-based truthfulness steering vector database and computes visual perception steering vectors, enabling context-aware interventions during inference by dynamically selecting the most relevant steering vectors based on input semantic similarity and applying them to the most influential attention heads. We conduct comprehensive experiments across multiple models and datasets, demonstrating that our approach significantly enhances model performance, outperforming existing state-of-the-art methods.
Chinese Translation
大型视觉-语言模型(LVLMs)在视觉-语言任务上表现出色,但在幻觉问题上存在困难。通过对LVLM激活模式的深入分析,我们揭示了两个关键发现:1)真实性和视觉感知能力主要涉及模型架构中不同子集的注意力头;2)真实性引导向量在不同语义上下文中变化显著。基于这些观察,我们提出了动态多模态激活引导,这是一种无需训练的幻觉缓解方法。我们的方法构建了一个基于语义的真实性引导向量数据库,并计算视觉感知引导向量,使得在推理过程中能够根据输入的语义相似性动态选择最相关的引导向量,并将其应用于最具影响力的注意力头。我们在多个模型和数据集上进行了全面实验,证明我们的方法显著提升了模型性能,超越了现有的最先进方法。
cs.CV / 44 / 2602.21706
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
SurGo-R1:手术视频中操作区域的上下文推理基准与建模
Abstract
Minimally invasive surgery has dramatically improved patient operative outcomes, yet identifying safe operative zones remains challenging in critical phases, requiring surgeons to integrate visual cues, procedural phase, and anatomical context under high cognitive load. Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning. We introduce ResGo, a benchmark of laparoscopic frames annotated with Go Zone bounding boxes and clinician-authored rationales covering phase, exposure quality reasoning, next action and risk reminder. We introduce evaluation metrics that treat correct grounding under incorrect phase as failures, revealing that most vision-language models cannot handle such tasks and perform poorly. We then present SurGo-R1, a model optimized via RLHF with a multi-turn phase-then-go architecture where the model first identifies the surgical phase, then generates reasoning and Go Zone coordinates conditioned on that context. On unseen procedures, SurGo-R1 achieves 76.6% phase accuracy, 32.7 mIoU, and 54.8% hardcore accuracy, a 6.6$\times$ improvement over the mainstream generalist VLMs. Code, model and benchmark will be available at https://github.com/jinlab-imvr/SurGo-R1
Chinese Translation
微创手术显著改善了患者的手术结果,但在关键阶段识别安全操作区域仍然具有挑战性,这要求外科医生在高认知负荷下整合视觉线索、程序阶段和解剖背景。现有的人工智能系统提供二元安全验证或静态检测,忽视了术中推理的阶段依赖性。我们引入了ResGo,这是一个标注有Go Zone边界框和临床医生撰写的涵盖阶段、暴露质量推理、下一步行动和风险提醒的理由的腹腔镜帧基准。我们提出了评估指标,将在错误阶段下的正确定位视为失败,揭示大多数视觉-语言模型无法处理此类任务且表现不佳。随后,我们提出了SurGo-R1,这是一个通过强化学习人类反馈(RLHF)优化的模型,采用多轮阶段-再行动架构,模型首先识别手术阶段,然后根据该上下文生成推理和Go Zone坐标。在未见过的手术中,SurGo-R1实现了76.6%的阶段准确率、32.7 mIoU和54.8%的硬核准确率,相较于主流通用视觉-语言模型提高了6.6倍。代码、模型和基准将可在https://github.com/jinlab-imvr/SurGo-R1获取。
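The evaluation idea in the SurGo-R1 abstract, counting a well-localized Go Zone as a failure when the predicted phase is wrong, can be sketched as a phase-gated accuracy. This is a minimal illustration, not the benchmark's actual metric: the function names, the sample dictionary layout, and the 0.5 IoU threshold are all assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def phase_gated_accuracy(samples, iou_threshold=0.5):
    """Fraction of samples with the correct phase AND a well-localized box.

    The threshold is a hypothetical choice; the abstract does not state one.
    """
    hits = 0
    for s in samples:
        phase_ok = s["pred_phase"] == s["true_phase"]
        box_ok = iou(s["pred_box"], s["true_box"]) >= iou_threshold
        if phase_ok and box_ok:
            hits += 1
    return hits / len(samples)


# Toy data: phase labels and boxes are invented for illustration.
samples = [
    # correct phase, perfect box: success
    {"pred_phase": "dissection", "true_phase": "dissection",
     "pred_box": (0, 0, 10, 10), "true_box": (0, 0, 10, 10)},
    # perfect box but wrong phase: counted as a failure
    {"pred_phase": "clipping", "true_phase": "dissection",
     "pred_box": (0, 0, 10, 10), "true_box": (0, 0, 10, 10)},
    # correct phase, box misses entirely: failure
    {"pred_phase": "dissection", "true_phase": "dissection",
     "pred_box": (0, 0, 10, 10), "true_box": (20, 20, 30, 30)},
    # correct phase, IoU = 0.8: success
    {"pred_phase": "dissection", "true_phase": "dissection",
     "pred_box": (0, 0, 10, 10), "true_box": (0, 0, 10, 8)},
]
score = phase_gated_accuracy(samples)
```

Gating on the phase means a model cannot score by localizing a plausible region while misreading the procedural context, which is the failure mode the abstract highlights.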
cs.CV / 45 / 2602.21709
Assessing airborne laser scanning and aerial photogrammetry for deep learning-based stand delineation
评估基于深度学习的林分划分中的航空激光扫描与航空摄影测量
Abstract
Accurate forest stand delineation is essential for forest inventory and management but remains a largely manual and subjective process. A recent study has shown that deep learning can produce stand delineations comparable to those of expert interpreters when combining aerial imagery and airborne laser scanning (ALS) data. However, temporal misalignment between data sources limits operational scalability. Canopy height models (CHMs) derived from digital aerial photogrammetry (DAP) offer better temporal alignment but may smooth the canopy surface and canopy gaps, raising the question of whether they can reliably replace ALS-derived CHMs. Similarly, the inclusion of a digital terrain model (DTM) has been suggested to improve delineation performance, but has remained untested in published literature. Using expert-delineated forest stands as reference data, we assessed a U-Net-based semantic segmentation framework with municipality-level cross-validation across six municipalities in southeastern Norway. We compared multispectral aerial imagery combined with (i) an ALS-derived CHM, (ii) a DAP-derived CHM, and (iii) a DAP-derived CHM in combination with a DTM. Results showed comparable performance across all data combinations, with overall accuracy between 0.90 and 0.91. Agreement between model predictions was substantially larger than agreement with the reference data, highlighting both model consistency and the inherent subjectivity of stand delineation. The similar performance of DAP-CHMs, despite the reduced structural detail, and the lack of improvement from adding the DTM indicate that the framework is resilient to variations in input data. These findings indicate that large datasets for deep learning-based stand delineations can be assembled using projects including temporally aligned ALS data and DAP point clouds.
Chinese Translation
准确的森林林分划分对于森林清查和管理至关重要,但仍然主要是一个手动和主观的过程。最近的研究表明,当结合航空影像和航空激光扫描(ALS)数据时,深度学习可以产生与专家解读者相当的林分划分。然而,数据源之间的时间不对齐限制了其操作的可扩展性。基于数字摄影测量(DAP)生成的冠层高度模型(CHMs)提供了更好的时间对齐,但可能会平滑冠层表面和冠层间隙,这引发了它们是否可以可靠替代基于ALS的CHMs的疑问。同样,建议加入数字地形模型(DTM)以提高划分性能,但在已发表的文献中尚未进行测试。我们使用专家划分的森林林分作为参考数据,评估了基于U-Net的语义分割框架,并在挪威东南部六个市的市级交叉验证中进行测试。我们比较了结合(i) ALS生成的CHM,(ii) DAP生成的CHM,以及(iii) DAP生成的CHM与DTM结合的多光谱航空影像。结果显示所有数据组合的性能相当,整体准确率达到0.90-0.91。模型预测之间的一致性显著高于与参考数据的一致性,突显了模型的一致性及林分划分的固有主观性。尽管结构细节减少,DAP-CHMs的相似性能以及DTM未能改善的结果表明,该框架对输入数据的变化具有韧性。这些发现表明,可以通过包括时间对齐的ALS数据和DAP点云的项目组装大型数据集,以实现基于深度学习的林分划分。
cs.CV / 46 / 2602.21712
Innovative Tooth Segmentation Using Hierarchical Features and Bidirectional Sequence Modeling
基于层次特征和双向序列建模的创新牙齿分割
Abstract
Tooth image segmentation is a cornerstone of dental digitization. However, traditional image encoders relying on fixed-resolution feature maps often lead to discontinuous segmentation and poor discrimination between target regions and background, due to insufficient modeling of environmental and global context. Moreover, transformer-based self-attention introduces substantial computational overhead because of its quadratic complexity (O(n^2)), making it inefficient for high-resolution dental images. To address these challenges, we introduce a three-stage encoder with hierarchical feature representation to capture scale-adaptive information in dental images. By jointly leveraging low-level details and high-level semantics through cross-scale feature fusion, the model effectively preserves fine structural information while maintaining strong contextual awareness. Furthermore, a bidirectional sequence modeling strategy is incorporated to enhance global spatial context understanding without incurring high computational cost. We validate our method on two dental datasets, with experimental results demonstrating its superiority over existing approaches. On the OralVision dataset, our model achieves a 1.1% improvement in mean intersection over union (mIoU).
Chinese Translation
牙齿图像分割是牙科数字化的基石。然而,依赖于固定分辨率特征图的传统图像编码器常常导致分割不连续,并且在目标区域与背景之间的区分能力较差,这主要是由于对环境和全局上下文建模不足。此外,基于变换器的自注意力机制由于其平方复杂度(O(n^2))引入了大量计算开销,使其在处理高分辨率牙科图像时效率低下。为了解决这些挑战,我们提出了一种三阶段编码器,采用层次特征表示来捕捉牙科图像中的尺度自适应信息。通过跨尺度特征融合共同利用低层次细节和高层次语义,该模型有效地保留了细微的结构信息,同时保持了强大的上下文意识。此外,我们引入了一种双向序列建模策略,以增强对全局空间上下文的理解,而不增加高计算成本。我们在两个牙科数据集上验证了我们的方法,实验结果表明其优于现有方法。在OralVision数据集上,我们的模型在平均交并比(mIoU)上提高了1.1%。
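To make the quadratic-complexity argument in the tooth segmentation abstract concrete, the arithmetic below counts attention score entries for a patch-tokenized image and compares them with a two-direction sequential scan. The 16-pixel patch size and the linear cost of the bidirectional scan are assumptions for illustration; the abstract specifies neither.

```python
def token_count(height, width, patch=16):
    """Number of tokens when an image is split into patch x patch tiles."""
    return (height // patch) * (width // patch)


def attention_score_entries(height, width, patch=16):
    """Self-attention compares every token with every token: n^2 entries."""
    n = token_count(height, width, patch)
    return n * n


def bidirectional_scan_steps(height, width, patch=16):
    """Assumed linear-time scan, run once forward and once backward: 2n steps."""
    return 2 * token_count(height, width, patch)


small = attention_score_entries(1024, 1024)   # 4096 tokens
large = attention_score_entries(2048, 2048)   # 16384 tokens
scan_small = bidirectional_scan_steps(1024, 1024)
scan_large = bidirectional_scan_steps(2048, 2048)
```

Doubling the image side length quadruples the token count, so the attention matrix grows 16-fold while the assumed scan cost grows only 4-fold.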
cs.CV / 47 / 2602.21716
TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection
TranX-Adapter:在多模态大语言模型中桥接伪影与语义以增强AI生成图像检测的鲁棒性
Abstract
Rapid advances in AI-generated image (AIGI) technology enable highly realistic synthesis, threatening public information integrity and security. Recent studies have demonstrated that incorporating texture-level artifact features alongside semantic features into multimodal large language models (MLLMs) can enhance their AIGI detection capability. However, our preliminary analyses reveal that artifact features exhibit high intra-feature similarity, leading to an almost uniform attention map after the softmax operation. This phenomenon causes attention dilution, thereby hindering effective fusion between semantic and artifact features. To overcome this limitation, we propose a lightweight fusion adapter, TranX-Adapter, which integrates a Task-aware Optimal-Transport Fusion that leverages the Jensen-Shannon divergence between artifact and semantic prediction probabilities as a cost matrix to transfer artifact information into semantic features, and an X-Fusion that employs cross-attention to transfer semantic information into artifact features. Experiments with several advanced MLLMs on standard AIGI detection benchmarks show that our TranX-Adapter brings consistent and significant improvements (up to +6% accuracy).
Chinese Translation
AI生成图像(AIGI)技术的快速进展使得高度逼真的合成成为可能,威胁到公共信息的完整性和安全性。近期研究表明,将纹理级别的伪影特征与语义特征结合到多模态大语言模型(MLLMs)中,可以增强其AIGI检测能力。然而,我们的初步分析显示,伪影特征表现出较高的特征内部相似性,导致在softmax操作后几乎形成均匀的注意力图。这一现象导致注意力稀释,从而妨碍了语义特征与伪影特征之间的有效融合。为克服这一限制,我们提出了一种轻量级融合适配器TranX-Adapter,该适配器集成了任务感知的最优传输融合(Task-aware Optimal-Transport Fusion),利用伪影与语义预测概率之间的詹森-香农散度作为成本矩阵,将伪影信息转移到语义特征中;同时采用交叉注意力的X-Fusion将语义信息转移到伪影特征中。在多个先进的MLLMs上进行的标准AIGI检测基准实验表明,我们的TranX-Adapter带来了持续且显著的提升(准确率提高最多可达6%)。
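Two ingredients named in the TranX-Adapter abstract can be illustrated numerically: the near-uniform softmax that results from highly similar scores (attention dilution), and a cost matrix built from Jensen-Shannon divergences. This is a sketch under stated assumptions, not the paper's implementation: the score values, the toy probability vectors, and the helper names are invented, and the optimal-transport solver itself is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_cost_matrix(artifact_probs, semantic_probs):
    """cost[i][j] = JS divergence between artifact i and semantic j."""
    return [[js_divergence(a, s) for s in semantic_probs]
            for a in artifact_probs]


# Nearly identical scores give an almost flat attention distribution.
diluted = softmax([5.00, 5.01, 4.99, 5.00])

# Hypothetical real/fake prediction probabilities per token.
artifact_probs = [[0.9, 0.1], [0.2, 0.8]]
semantic_probs = [[0.9, 0.1], [0.5, 0.5]]
cost = js_cost_matrix(artifact_probs, semantic_probs)
```

Pairs whose predictions agree get a cost near zero, so a transport plan built on this matrix would favor moving artifact information toward semantically consistent tokens.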
cs.CV / 48 / 2602.21735
SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning
SigVLP:用于自监督CT体积自适应表示学习的Sigmoid体积-语言预训练
Abstract
Large-scale, volumetric medical imaging datasets typically aggregate scans from different vendors and devices, resulting in highly variable resolution, slice thicknesses, and numbers of slices per study. Consequently, training representation models usually requires cropping or interpolating along the z-axis to obtain fixed-size blocks, which inevitably causes information loss. We propose a new training approach to overcome this limitation. Instead of absolute position embeddings, we interpret volumes as sequences of 3D chunks and adopt Rotary Position Embeddings, allowing us to treat the z-axis as an unconstrained temporal dimension. Building on this idea, we introduce a new vision-language model: SigVLP. In SigVLP, we implement Rotary Position Embedding as the positional encoding method, which is applied directly within the attention operation, generating input-conditioned sine and cosine weights on the fly. This design ensures consistent alignment between query and key projections and adapts to any input size. To allow for variable input size during training, we sample Computed Tomography volumes in chunks and pair them with localized organ-wise textual observations. Compared to using entire reports for conditioning, chunkwise alignment provides finer-grained supervision, enabling the model to establish stronger correlations between the text and volume representations, thereby improving the precision of text-to-volume alignment. Our models are trained with the Muon optimizer and evaluated on a diverse set of downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval tasks.
Chinese Translation
大规模体积医学影像数据集通常汇聚来自不同供应商和设备的扫描,导致分辨率、切片厚度和每项研究的切片数量高度可变。因此,训练表示模型通常需要沿z轴裁剪或插值以获得固定大小的块,这不可避免地导致信息损失。我们提出了一种新的训练方法来克服这一限制。我们将体积视为3D块的序列,并采用旋转位置嵌入(Rotary Position Embeddings),使我们能够将z轴视为不受限制的时间维度。在此基础上,我们引入了一种新的视觉-语言模型:SigVLP。在SigVLP中,我们将旋转位置嵌入作为位置编码方法,直接应用于注意力操作中,实时生成输入条件的正弦和余弦权重。该设计确保查询和键投影之间的一致对齐,并适应任何输入大小。为了在训练过程中允许可变输入大小,我们以块的形式对计算机断层扫描(Computed Tomography)体积进行采样,并将其与局部器官的文本观察配对。与使用完整报告进行条件处理相比,块状对齐提供了更细粒度的监督,使模型能够在文本和体积表示之间建立更强的关联,从而提高文本到体积对齐的精度。我们的模型使用Muon优化器进行训练,并在一系列多样的下游任务上进行评估,包括零-shot异常检测、器官分类、分割和检索任务。
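The property that lets Rotary Position Embeddings handle a variable number of chunks along the z-axis is that the query-key dot product depends only on the relative offset between positions. The sketch below shows the standard rotary formulation on plain Python lists; it is not SigVLP's code, and the vector values and the base of 10000 are illustrative assumptions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    `vec` must have even length; pair k uses frequency base^(-2k/dim).
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (4 chunks apart) at two different absolute positions.
near = dot(rope(q, 3), rope(k, 7))
far = dot(rope(q, 103), rope(k, 107))
```

Because `near` and `far` agree up to floating-point error, a model trained on short chunk sequences sees the same attention geometry when a longer scan shifts every chunk to a larger absolute index.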
cs.CV / 49 / 2602.21740
Structure-to-Image: Zero-Shot Depth Estimation in Colonoscopy via High-Fidelity Sim-to-Real Adaptation
结构到图像:通过高保真模拟到真实适应实现结肠镜检查中的零样本深度估计
Abstract
Monocular depth estimation (MDE) for colonoscopy is hampered by the domain gap between simulated and real-world images. Existing image-to-image translation methods, which use depth as a posterior constraint, often produce structural distortions and specular highlights by failing to balance realism with structure consistency. To address this, we propose a Structure-to-Image paradigm that transforms the depth map from a passive constraint into an active generative foundation. We are the first to introduce phase congruency to colonoscopic domain adaptation and design a cross-level structure constraint to co-optimize geometric structures and fine-grained details like vascular textures. In zero-shot evaluations conducted on a publicly available phantom dataset, the MDE model that was fine-tuned on our generated data achieved a maximum reduction of 44.18% in RMSE compared to competing methods. Our code is available at https://github.com/YyangJJuan/PC-S2I.git.
Chinese Translation
结肠镜检查的单目深度估计(MDE)受到模拟图像与真实世界图像之间领域差距的限制。现有的图像到图像翻译方法使用深度作为后验约束,往往由于未能平衡现实主义与结构一致性而产生结构扭曲和高光反射。为了解决这一问题,我们提出了一种结构到图像(Structure-to-Image)范式,将深度图从被动约束转变为主动生成基础。我们首次将相位一致性引入结肠镜领域适应,并设计了一种跨层结构约束,以共同优化几何结构和细致的纹理细节,如血管纹理。在对公开可用的假体数据集进行的零样本评估中,基于我们生成的数据微调的MDE模型在均方根误差(RMSE)上与竞争方法相比实现了最大44.18%的降低。我们的代码可在 https://github.com/YyangJJuan/PC-S2I.git 获取。
cs.CV / 50 / 2602.21743
Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization
通过难度感知的组归一化增强多模态大语言模型的推理能力
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: the instability of std-based normalization, which is easily distorted by extreme samples with nearly positive or negative rewards. Unlike pure-text LLMs, multimodal models are particularly sensitive to such distortions, as both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std within each group. Our approach preserves GRPO's intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.
Chinese Translation
可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)和组相对策略优化(Group Relative Policy Optimization, GRPO)显著提升了大型语言模型的推理能力。然而,将这些方法扩展到多模态环境面临一个关键挑战:基于标准差的归一化的不稳定性,这种不稳定性容易受到极端样本(几乎为正或负奖励)的扭曲。与纯文本的大语言模型不同,多模态模型对这种扭曲特别敏感,因为感知错误和推理错误都会影响它们的响应。为了解决这个问题,我们通过感知复杂性(通过视觉熵测量)和推理不确定性(通过模型置信度捕获)来对每个样本进行难度表征。在此基础上,我们提出了难度感知的组归一化(Durian),该方法根据难度水平重新分组样本,并在每个组内共享标准差。我们的方法保持了GRPO的组内区分,同时消除了对极端案例的敏感性,从而在多个多模态推理基准测试中实现了显著的性能提升。
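The instability that the Durian abstract attributes to std-based normalization can be seen in a few lines. The sketch computes GRPO-style advantages for a group where almost every response is rewarded, then repeats the computation with a standard deviation shared across groups of similar difficulty. How the paper shares the std is not described in the abstract, so the pooled-variance rule below is an assumption, as are the reward values and function names.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def grpo_advantages(rewards, eps=1e-6):
    """Per-group normalization: (r - mean) / std within one prompt's samples."""
    mu = mean(rewards)
    std = math.sqrt(variance(rewards))
    return [(r - mu) / (std + eps) for r in rewards]


def shared_std_advantages(groups, eps=1e-6):
    """Center each group on its own mean, but divide by a pooled std.

    `groups` holds reward lists assumed to fall in one difficulty bucket.
    """
    pooled_std = math.sqrt(mean([variance(g) for g in groups]))
    return [[(r - mean(g)) / (pooled_std + eps) for r in g] for g in groups]


skewed = [1, 1, 1, 1, 1, 1, 1, 0]     # nearly all-positive rewards
balanced = [1, 0, 1, 0, 1, 0, 1, 0]   # evenly split rewards

per_group = grpo_advantages(skewed)
shared = shared_std_advantages([skewed, balanced])
```

In the skewed group the single failure receives an advantage of about -2.65 under per-group normalization, because the small std inflates it. Pooling the std with the balanced group shrinks that magnitude while keeping the ordering of samples inside each group.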
cs.CV / 51 / 2602.21754
LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
LiREC-Net:一种无目标且基于学习的激光雷达、RGB和事件校准网络
Abstract
Advanced autonomous systems rely on multi-sensor fusion for safer and more robust perception. To enable effective fusion, calibrating directly from natural driving scenes (i.e., target-free) with high accuracy is crucial for precise multi-sensor alignment. Existing learning-based calibration methods are typically designed for only a single pair of sensor modalities (i.e., a bi-modal setup). Unlike these methods, we propose LiREC-Net, a target-free, learning-based calibration network that jointly calibrates multiple sensor modality pairs, including LiDAR, RGB, and event data, within a unified framework. To reduce redundant computation and improve efficiency, we introduce a shared LiDAR representation that leverages features from both its 3D nature and projected depth map, ensuring better consistency across modalities. Trained and evaluated on established datasets, such as KITTI and DSEC, our LiREC-Net achieves performance competitive with bi-modal models and sets a new strong baseline for the tri-modal use case.
Chinese Translation
先进的自主系统依赖于多传感器融合以实现更安全和更稳健的感知。为了实现有效的融合,从自然驾驶场景中直接进行高精度校准(即无目标)对于精确的多传感器对齐至关重要。现有的基于学习的校准方法通常仅针对单一对传感器模态(即双模态设置)进行设计。与这些方法不同,我们提出了LiREC-Net,这是一种无目标的基于学习的校准网络,能够在统一框架内联合校准多个传感器模态对,包括激光雷达(LiDAR)、RGB和事件数据。为了减少冗余计算并提高效率,我们引入了一种共享的激光雷达表示,利用其三维特性和投影深度图的特征,确保跨模态的一致性。经过在KITTI和DSEC等已建立数据集上的训练和评估,我们的LiREC-Net在性能上与双模态模型具有竞争力,并为三模态使用案例设定了新的强基线。
cs.CV / 52 / 2602.21760
Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
基于条件引导调度的混合数据管道并行加速扩散
Abstract
Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Moreover, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve substantial acceleration proportional to the number of GPUs. Therefore, we propose a hybrid parallelism framework that combines a novel data parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves $2.31\times$ and $2.07\times$ latency reductions on SDXL and SD3, respectively, using two NVIDIA RTX 3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings. Code is available at https://github.com/kaist-dmlab/Hybridiff.
Chinese Translation
扩散模型在高保真图像、视频和音频生成方面取得了显著进展,但推理仍然计算开销较大。然而,当前基于分布式并行的扩散加速方法存在明显的生成伪影,并未能实现与GPU数量成比例的显著加速。因此,我们提出了一种混合并行框架,将一种新颖的数据并行策略(基于条件的分区)与一种最优的管道调度方法(自适应并行切换)相结合,以减少生成延迟并在条件扩散模型中实现高生成质量。关键思想是(i) 利用条件和无条件去噪路径作为新的数据分区视角,以及(ii) 根据这两条路径之间的去噪差异自适应地启用最优管道并行。我们的框架在使用两块NVIDIA RTX 3090 GPU时,分别在SDXL和SD3上实现了$2.31\times$和$2.07\times$的延迟减少,同时保持图像质量。该结果证实了我们的方法在基于U-Net的扩散模型和基于DiT的流匹配架构中的普适性。我们的方案在高分辨率合成设置下的加速性能也优于现有方法。代码可在https://github.com/kaist-dmlab/Hybridiff获取。
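The data-partitioning view in the Hybridiff abstract rests on the fact that, in guided sampling, the conditional and unconditional denoising passes do not depend on each other within a step. The sketch below assumes the standard classifier-free guidance combination and uses two worker threads as stand-ins for two GPUs. The toy denoisers, the `prompt_strength` parameter, and the guidance scale are hypothetical, and the adaptive pipeline switching described in the abstract is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def denoise_conditional(x, prompt_strength):
    """Toy stand-in for the prompt-conditioned noise prediction."""
    return [0.9 * v + prompt_strength for v in x]


def denoise_unconditional(x):
    """Toy stand-in for the unconditional noise prediction."""
    return [0.9 * v for v in x]


def guided_step(x, prompt_strength, guidance_scale=7.5):
    """Run both paths concurrently, then merge them with the guidance rule."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond = pool.submit(denoise_conditional, x, prompt_strength)
        uncond = pool.submit(denoise_unconditional, x)
        eps_c, eps_u = cond.result(), uncond.result()
    # Classifier-free guidance: eps = eps_u + w * (eps_c - eps_u)
    return [u + guidance_scale * (c - u) for c, u in zip(eps_c, eps_u)]


result = guided_step([1.0, 2.0], prompt_strength=0.1, guidance_scale=2.0)
```

The only synchronization point is the merge at the end of the step, which is why splitting by path needs less communication than splitting one forward pass across devices.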
cs.CV / 53 / 2602.21762
SAPNet++: Evolving Point-Prompted Instance Segmentation with Semantic and Spatial Awareness
SAPNet++:具备语义和空间意识的点提示实例分割演进
Abstract
Single-point annotation is increasingly prominent in visual tasks as a way to reduce labeling cost. However, it challenges tasks requiring high precision, such as the point-prompted instance segmentation (PPIS) task, which aims to estimate precise masks using single-point prompts to train a segmentation network. Due to the constraints of point annotations, granularity ambiguity and boundary uncertainty arise: the difficulty of distinguishing between different levels of detail (e.g., whole object vs. parts) and the challenge of precisely delineating object boundaries. Previous works have usually inherited the paradigm of mask generation along with proposal selection to achieve PPIS. However, proposal selection relies solely on category information, failing to resolve the ambiguity of different granularity. Furthermore, mask generators offer only finite discrete solutions that often deviate from actual masks, particularly at boundaries. To address these issues, we propose the Semantic-Aware Point-Prompted Instance Segmentation Network (SAPNet). It integrates Point Distance Guidance and Box Mining Strategy to tackle group and local issues caused by the point's granularity ambiguity. Additionally, we incorporate completeness scores within proposals to add spatial granularity awareness, enhancing multiple instance learning (MIL) in proposal selection, termed S-MIL. The Multi-level Affinity Refinement conveys pixel and semantic clues, narrowing boundary uncertainty during mask refinement. These modules culminate in SAPNet++, mitigating the point prompt's granularity ambiguity and boundary uncertainty and significantly improving segmentation performance. Extensive experiments on four challenging datasets validate the effectiveness of our methods, highlighting the potential to advance PPIS.
Chinese Translation
单点标注在视觉任务中越来越受到重视,以降低标注成本。然而,这对需要高精度的任务构成挑战,例如点提示实例分割(PPIS)任务,其目标是利用单点提示来估计精确的掩膜,以训练分割网络。由于点标注的限制,颗粒度模糊和边界不确定性使得区分不同细节层次(例如,整体对象与部分)变得困难,并且精确勾勒对象边界也面临挑战。以往的研究通常继承了掩膜生成与提议选择的范式来实现PPIS。然而,提议选择仅依赖于类别信息,未能解决不同颗粒度的模糊性。此外,掩膜生成器仅提供有限的离散解决方案,往往偏离实际掩膜,尤其是在边界处。为了解决这些问题,我们提出了语义感知点提示实例分割网络(SAPNet)。该网络集成了点距离引导和框挖掘策略,以解决由点的颗粒度模糊性引起的群体和局部问题。此外,我们在提议中加入了完整性评分,以增强空间颗粒度意识,提升提议选择中的多实例学习(MIL),称为S-MIL。多级亲和力细化传递像素和语义线索,在掩膜细化过程中缩小边界不确定性。这些模块汇聚成SAPNet++,缓解了点提示的颗粒度模糊性和边界不确定性,显著提高了分割性能。在四个具有挑战性的数据集上进行的广泛实验验证了我们方法的有效性,突显了推动PPIS的潜力。
cs.CV / 54 / 2602.21778
From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors
从静态到动态:基于物理知识的图像编辑与潜在转移先验
Abstract
Instruction-based image editing has achieved remarkable success in semantic alignment, yet state-of-the-art models frequently fail to render physically plausible results when editing involves complex causal dynamics, such as refraction or material deformation. We attribute this limitation to the dominant paradigm that treats editing as a discrete mapping between image pairs, which provides only boundary conditions and leaves transition dynamics underspecified. To address this, we reformulate physics-aware editing as predictive physical state transitions and introduce PhysicTran38K, a large-scale video-based dataset comprising 38K transition trajectories across five physical domains, constructed via a two-stage filtering and constraint-aware annotation pipeline. Building on this supervision, we propose PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. It combines a frozen Qwen2.5-VL for physically grounded reasoning with learnable transition queries that provide timestep-adaptive visual guidance to a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state-of-the-art for open-source methods, while remaining competitive with leading proprietary models.
Chinese Translation
基于指令的图像编辑在语义对齐方面取得了显著成功,但最先进的模型在涉及复杂因果动态(如折射或材料变形)的编辑时,常常无法生成物理上合理的结果。我们将这一局限归因于主流范式将编辑视为图像对之间的离散映射,这种方法仅提供边界条件,未能充分指定转移动态。为了解决这一问题,我们将基于物理知识的编辑重新表述为预测物理状态转移,并引入PhysicTran38K,这是一个大规模基于视频的数据集,包含来自五个物理领域的38K转移轨迹,构建过程采用了两阶段过滤和约束感知注释管道。在此监督基础上,我们提出了PhysicEdit,一个端到端框架,配备文本-视觉双重思维机制。该框架结合了冻结的Qwen2.5-VL用于物理基础推理,以及可学习的转移查询,为扩散骨干网络提供时间步自适应的视觉指导。实验表明,PhysicEdit在物理真实感方面比Qwen-Image-Edit提高了5.9%,在知识基础编辑方面提高了10.1%,为开源方法设定了新的最先进水平,同时在竞争中与领先的专有模型保持竞争力。
cs.CV / 55 / 2602.21779
Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
超越静态伪影:视频深度伪造推理的法医学基准在视觉语言模型中的应用
Abstract
Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve advanced performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of our key design choices, confirming that FAQ is the driving force behind the temporal reasoning capabilities of these VLMs.
Chinese Translation
当前用于深度伪造检测的视觉语言模型(VLMs)在识别空间伪影方面表现出色,但忽视了一个关键维度:视频伪造中的时间不一致性。将VLMs适应于对这些动态线索进行推理仍然是一个独特的挑战。为了解决这一问题,我们提出了法医学问答(Forensic Answer-Questioning, FAQ),这是一个将时间深度伪造分析表述为多项选择任务的大规模基准。FAQ引入了一个三级层次结构,以逐步评估并赋予VLMs法医学能力:(1)面部感知,测试识别静态视觉伪影的能力;(2)时间深度伪造定位,要求在帧之间定位动态伪造伪影;(3)法医学推理,挑战模型综合证据以得出最终的真实性裁决。我们在FAQ上评估了一系列VLMs,并生成了相应的指令调优集FAQ-IT。大量实验表明,在FAQ-IT上微调的模型在领域内和跨数据集检测基准上都实现了先进的性能。消融研究进一步验证了我们关键设计选择的影响,确认FAQ是推动这些VLMs时间推理能力的动力。
cs.CV / 56 / 2602.21780
XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression
XStreamVGGT:极高内存效率的流式视觉几何基础变换器与KV缓存压缩
Abstract
Learning-based 3D visual geometry models have significantly advanced with the advent of large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention to deliver robust and efficient streaming 3D reconstruction. However, it suffers from unbounded growth in the Key-Value (KV) cache due to the massive influx of vision tokens from multi-image and long-video inputs, leading to increased memory consumption and inference latency as input frames accumulate. This ultimately limits its scalability for long-horizon applications. To address this gap, we propose XStreamVGGT, a tuning-free approach that seamlessly integrates pruning and quantization to systematically compress the KV cache, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs generated from multi-frame inputs are initially pruned to conform to a fixed KV memory budget using an efficient token-importance identification mechanism that maintains full compatibility with high-performance attention kernels (e.g., FlashAttention). Additionally, leveraging the inherent distribution patterns of KV tensors, we apply dimension-adaptive KV quantization within the pruning pipeline to further minimize memory overhead while preserving numerical accuracy. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42$\times$ and accelerating inference by 5.48$\times$, enabling practical and scalable streaming 3D applications. The code is available at https://github.com/ywh187/XStreamVGGT/.
Chinese Translation
基于学习的3D视觉几何模型随着大规模变换器的出现而显著进步。在这些模型中,StreamVGGT利用逐帧因果注意力实现了稳健且高效的流式3D重建。然而,由于来自多图像和长视频输入的视觉标记的大量涌入,它在键值(KV)缓存中面临无限增长的问题,导致随着输入帧的累积,内存消耗和推理延迟增加。这最终限制了其在长时间应用中的可扩展性。为了解决这一问题,我们提出了XStreamVGGT,这是一种无调优的方法,能够无缝地将剪枝和量化集成,以系统地压缩KV缓存,从而实现极高内存效率的流式推理。具体而言,来自多帧输入生成的冗余KV最初通过一种高效的标记重要性识别机制进行剪枝,以符合固定的KV内存预算,同时保持与高性能注意力核(例如,FlashAttention)的完全兼容。此外,利用KV张量的内在分布模式,我们在剪枝流程中应用维度自适应KV量化,以进一步最小化内存开销,同时保持数值准确性。广泛的评估表明,XStreamVGGT在显著减少内存使用4.42倍和加速推理5.48倍的同时,性能降级几乎可以忽略不计,从而使流式3D应用变得实用且可扩展。代码可在https://github.com/ywh187/XStreamVGGT/获取。
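The two compression steps named in the XStreamVGGT abstract, pruning the KV cache to a fixed budget and then quantizing what remains, can be sketched on plain lists. This is an illustration under assumptions, not the paper's method: the importance scores are given rather than computed, and the per-vector symmetric int8 scheme stands in for the dimension-adaptive quantization the abstract mentions.

```python
def prune_kv(keys, values, importance, budget):
    """Keep the `budget` highest-importance entries, preserving token order."""
    ranked = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [keys[i] for i in keep], [values[i] for i in keep]


def quantize_int8(vec):
    """Symmetric per-vector quantization to integers in [-127, 127]."""
    scale = max(abs(v) for v in vec) / 127 or 1.0  # avoid a zero scale
    return [round(v / scale) for v in vec], scale


def dequantize(q, scale):
    return [x * scale for x in q]


# Toy cache of four tokens; importance values are hypothetical.
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [3.0, 1.0]]
values = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
importance = [0.1, 0.9, 0.4, 0.7]

kept_keys, kept_values = prune_kv(keys, values, importance, budget=2)
q, scale = quantize_int8(kept_keys[1])
restored = dequantize(q, scale)
```

With a fixed budget the cache size no longer grows with the number of frames, and rounding to the nearest level bounds the reconstruction error of each element by half the quantization step.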
cs.CV / 57 / 2602.21810
GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry
GeoMotion:通过潜在四维几何重新思考运动分割
Abstract
Motion segmentation in dynamic scenes is highly challenging, as conventional methods heavily rely on estimating camera poses and point correspondences from inherently noisy motion cues. Existing statistical inference or iterative optimization techniques struggle to mitigate the cumulative errors in multi-stage pipelines, which often leads to limited performance or high computational cost. In contrast, we propose a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation. Our key insight is to bypass explicit correspondence estimation and instead let the model learn to implicitly disentangle object and camera motion. Supported by recent advances in 4D scene geometry reconstruction (e.g., $\pi^3$), the proposed method leverages reliable camera poses and rich spatial-temporal priors, which ensure stable training and robust inference for the model. Extensive experiments demonstrate that by eliminating complex pre-processing and iterative refinement, our approach achieves state-of-the-art motion segmentation performance with high efficiency. The code is available at https://github.com/zjutcvg/GeoMotion.
Chinese Translation
在动态场景中,运动分割具有很高的挑战性,因为传统方法严重依赖于从固有噪声运动线索中估计相机姿态和点对应关系。现有的统计推断或迭代优化技术在多阶段管道中难以减轻累积误差,往往导致性能有限或计算成本高。相比之下,我们提出了一种完全基于学习的方法,通过注意力机制直接从潜在特征表示中推断移动物体,从而实现端到端的前馈运动分割。我们的关键见解是绕过显式的对应关系估计,而是让模型学习隐式地解开物体和相机运动。得益于最近在四维场景几何重建(例如,$\pi^3$)方面的进展,所提出的方法利用可靠的相机姿态和丰富的时空先验,确保模型的稳定训练和鲁棒推断。大量实验表明,通过消除复杂的预处理和迭代优化,我们的方法以高效的方式实现了最先进的运动分割性能。代码可在以下链接获取:https://github.com/zjutcvg/GeoMotion。
cs.CV / 58 / 2602.21818
SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model
SkyReels-V4:多模态视频-音频生成、修复与编辑模型
Chen, Guibin, Lin, Dixuan, Yang, Jiangping, Zhang, Youqiang, Fei, Zhengcong, Li, Debang, Chen, Sheng, Ao, Chaofeng, Pang, Nuo, Wang, Yiming, Dou, Yikun, Chen, Zheng, Fan, Mingyuan, Li, Tuanhui, Chang, Mingshan, Zhang, Hao, Sun, Xiaopeng, Xu, Jingtao, Xie, Yuqiang, Wang, Jiahua, Xu, Zhiheng, Xiong, Weiming, Jin, Yuzhe, Gu, Baoxuan, Mao, Binjie, Yu, Yunjie, He, Jujie, Feng, Yuhao, Tu, Shiwen, Wang, Chaojie, Yan, Rui, Shen, Wei, Wu, Jingchen, Zhao, Peng, Zhong, Xuanyue, Liu, Zhuangzhuang, Wang, Kaifei, Zhang, Fuxiang, Xu, Weikai, Liu, Wenyan, Zhang, Binglu, Shen, Yu, Xiong, Tianhui, Peng, Bin, Zeng, Liang, Song, Xuchen, Guo, Haoxiang, Wang, Peiyu, Zhou, Yahui
Abstract
SkyReels V4 is a unified multi-modal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while sharing a powerful text encoder based on a Multimodal Large Language Model (MMLM). SkyReels V4 accepts rich multi-modal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multi-modal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel-concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing, under a single interface, and naturally extends to vision-referenced inpainting and editing via multi-modal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multi-modal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.
Chinese Translation
SkyReels V4 是一个统一的多模态视频基础模型,用于联合视频音频生成、修复和编辑。该模型采用双流多模态扩散变换器(Multimodal Diffusion Transformer, MMDiT)架构,其中一条分支合成视频,另一条生成时间对齐的音频,同时共享基于多模态大语言模型(Multimodal Large Language Models, MMLM)的强大文本编码器。SkyReels V4 接受丰富的多模态指令,包括文本、图像、视频片段、掩码和音频参考。通过将 MMLM 的多模态指令跟随能力与视频分支 MMDiT 的上下文学习相结合,该模型能够在复杂条件下注入细粒度的视觉指导,而音频分支 MMDiT 同时利用音频参考来指导声音生成。在视频方面,我们采用通道拼接的形式,统一了多种修复风格任务,如图像到视频、视频扩展和视频编辑,所有任务都在一个接口下进行,并通过多模态提示自然扩展到视觉参考的修复和编辑。SkyReels V4 支持高达 1080p 的分辨率、32 FPS 和 15 秒的时长,实现高保真、多镜头、电影级别的视频生成,并同步音频。为了使这种高分辨率、长时长的生成在计算上可行,我们引入了一种效率策略:联合生成低分辨率的完整序列和高分辨率的关键帧,随后使用专门的超分辨率和帧插值模型。根据我们的了解,SkyReels V4 是第一个同时支持多模态输入、联合视频音频生成以及统一处理生成、修复和编辑的视频基础模型,同时在电影级分辨率和时长下保持强大的效率和质量。
cs.CV / 59 / 2602.21819
SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance
SemVideo:通过层次语义引导重建您观看的内容
Abstract
Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions. To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.
Chinese Translation
从脑活动中重建动态视觉体验为探索人类视觉感知的神经机制提供了一个引人注目的途径。尽管基于功能性磁共振成像(fMRI)的图像重建取得了显著进展,但将这一成功扩展到视频重建仍然是一个重大挑战。目前的fMRI到视频重建方法普遍面临两个主要缺陷:(i)在帧之间显著物体的视觉表征不一致,导致外观不匹配;(ii)时间一致性差,导致运动错位或突兀的帧过渡。为了解决这些局限性,我们提出了SemVideo,这是一种由层次语义信息引导的新型fMRI到视频重建框架。SemVideo的核心是SemMiner,一个层次引导模块,从原始视频刺激中构建三个层次的语义线索:静态锚点描述、运动导向叙述和整体摘要。利用这种语义引导,SemVideo包含三个关键组件:一个语义对齐解码器(Semantic Alignment Decoder),用于将fMRI信号与来自SemMiner的CLIP风格嵌入对齐;一个运动适应解码器(Motion Adaptation Decoder),使用新颖的三方注意力融合架构重建动态运动模式;以及一个条件视频渲染器(Conditional Video Render),利用层次语义引导进行视频重建。在CC2017和HCP数据集上进行的实验表明,SemVideo在语义对齐和时间一致性方面均表现出优越的性能,刷新了fMRI到视频重建的最先进水平(state-of-the-art)。
cs.CV / 60 / 2602.21820
Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps
通过光-几何交互映射实现联合阴影生成与重光照
Abstract
We propose Light-Geometry Interaction (LGI) maps, a novel representation that encodes light-aware occlusion from monocular depth. Unlike ray tracing, which requires full 3D reconstruction, LGI captures essential light-shadow interactions reliably and accurately, computed from off-the-shelf 2.5D depth map predictions. LGI explicitly ties illumination direction to geometry, providing a physics-inspired prior that constrains generative models. Without such a prior, these models often produce floating shadows, inconsistent illumination, and implausible shadow geometry. Building on this representation, we propose a unified pipeline for joint shadow generation and relighting - unlike prior methods that treat them as disjoint tasks - capturing the intrinsic coupling of illumination and shadowing essential for modeling indirect effects. By embedding LGI into a bridge-matching generative backbone, we reduce ambiguity and enforce physically consistent light-shadow reasoning. To enable effective training, we curated the first large-scale benchmark dataset for joint shadow and relighting, covering reflections, transparency, and complex interreflections. Experiments show significant gains in realism and consistency across synthetic and real images. LGI thus bridges geometry-inspired rendering with generative modeling, enabling efficient, physically consistent shadow generation and relighting.
Chinese Translation
我们提出了光-几何交互(Light-Geometry Interaction, LGI)映射,这是一种新颖的表示方法,能够从单目深度图中编码光感知的遮挡。与需要完整三维重建的光线追踪不同,LGI 可靠且准确地捕捉了关键的光影交互,这些交互是从现成的 2.5D 深度图预测中计算得出的。LGI 明确将照明方向与几何形状联系起来,提供了一种受物理启发的先验,约束生成模型。没有这样的先验,这些模型往往会产生漂浮阴影、不一致的照明和不合理的阴影几何形状。在此基础上,我们提出了一种统一的管道,用于联合阴影生成和重光照——与之前将其视为分离任务的方法不同——捕捉照明与阴影之间的内在耦合,这对于建模间接效果至关重要。通过将 LGI 嵌入到桥接匹配生成骨干网络中,我们减少了歧义,并强制执行物理一致的光影推理。为了实现有效的训练,我们策划了第一个大规模的联合阴影与重光照基准数据集,涵盖了反射、透明度和复杂的相互反射。实验结果显示,在合成图像和真实图像中,现实感和一致性都有显著提升。因此,LGI 将受几何启发的渲染与生成建模相结合,实现了高效、物理一致的阴影生成与重光照。
cs.CV / 61 / 2602.21829
StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles
StoryMovie:一个用于视觉故事与电影剧本和字幕语义对齐的数据集
Abstract
Visual storytelling models that correctly ground entities in images may still hallucinate semantic relationships, generating incorrect dialogue attribution, character interactions, or emotional states. We introduce StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching. Our alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Using this aligned content, we generate stories that maintain visual grounding tags while incorporating authentic character names, dialogue, and relationship dynamics. We fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. Evaluation using DeepSeek V3 as judge shows that Storyteller3 achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment. Compared to Storyteller, trained without script grounding, Storyteller3 achieves 48.5% versus 38.0%, confirming that semantic alignment progressively improves dialogue attribution beyond visual grounding alone.
Chinese Translation
视觉叙事模型虽然能够正确地将实体与图像对接,但仍可能会产生语义关系的幻觉,从而生成不正确的对话归属、角色互动或情感状态。我们引入了StoryMovie,这是一个包含1,757个故事的数据集,通过最长公共子序列(LCS)匹配与电影剧本和字幕对齐。我们的对齐流程将剧本对话与字幕时间戳同步,使得通过将剧本中的角色名称与字幕中的时间位置链接,实现对话归属。利用这些对齐内容,我们生成保持视觉对接标签的故事,同时融入真实的角色名称、对话和关系动态。我们在该数据集上微调了Qwen Storyteller3,基于之前在视觉对接和实体再识别方面的工作。使用DeepSeek V3进行评估显示,Storyteller3在字幕对齐方面的胜率达到了89.9%,相较于基础版Qwen2.5-VL 7B。与未经过剧本对接训练的Storyteller相比,Storyteller3的表现为48.5%对38.0%,确认了语义对齐在对话归属方面的逐步改善,超越了仅依赖视觉对接的效果。
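The LCS matching step mentioned in the StoryMovie abstract can be illustrated with a minimal sketch. The abstract does not give the pipeline's details, so the word-level tokenization, the normalization by the longer line, the `threshold` value, and the helper names `tokens`, `lcs_similarity`, and `align` are all assumptions made for illustration, not the authors' implementation.

```python
import string


def tokens(text):
    """Lowercase word tokens with surrounding punctuation stripped."""
    stripped = (w.strip(string.punctuation) for w in text.lower().split())
    return [t for t in stripped if t]


def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(script_line, subtitle_line):
    """LCS length normalized by the longer line, giving a score in [0, 1]."""
    a, b = tokens(script_line), tokens(subtitle_line)
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def align(script, subtitles, threshold=0.6):
    """Link each (speaker, line) to the best-matching (timestamp, text) subtitle.

    The threshold is a hypothetical cut-off; lines without a good match are skipped.
    """
    aligned = []
    if not subtitles:
        return aligned
    for speaker, line in script:
        best = max(subtitles, key=lambda s: lcs_similarity(line, s[1]))
        if lcs_similarity(line, best[1]) >= threshold:
            aligned.append((speaker, best[0], best[1]))
    return aligned
```

Once a script line is matched, the speaker name from the screenplay and the timestamp from the subtitle sit in one record, which is the kind of link the abstract describes as enabling dialogue attribution.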
cs.CV / 62 / 2602.21835
UniVBench: Towards Unified Evaluation for Video Foundation Models
UniVBench:迈向视频基础模型的统一评估
Abstract
Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework, making them a central direction for next-generation multimodal systems. However, existing evaluation benchmarks remain fragmented and limited in scope, as they each target a single task, rely on task-specific metrics, and typically use short or simple video clips. As a result, they do not capture the unified capabilities that these models are designed to deliver. To address this gap, we introduce UniVBench, a benchmark purpose-built for evaluating video foundation models across four core abilities: video understanding, video generation, video editing, and a newly proposed task, video reconstruction, which assesses how faithfully a model can reproduce video content it has encountered. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse and multi-shot videos, each paired with detailed captions, multi-format editing instructions, and reference images. All videos are human-created and carefully validated, offering richer cinematic information than prior benchmarks. In addition, we develop a unified agentic evaluation system (UniV-Eval) that standardizes prompting, instruction parsing, and scoring across all tasks, enabling fair, scalable, and reproducible comparisons of unified video models. By grounding evaluation in instruction-based multi-shot video tasks, UniVBench provides the first framework for measuring the integrated capabilities that video foundation models aim to achieve. Extensive human annotations ensure our evaluation aligns with human judgment, enabling rigorous assessment and accelerating progress toward robust video intelligence.
Chinese Translation
视频基础模型旨在将视频理解、生成、编辑和指令跟随整合在一个框架内,使其成为下一代多模态系统的核心方向。然而,现有的评估基准仍然碎片化且范围有限,因为它们各自针对单一任务,依赖于特定任务的指标,并通常使用短小或简单的视频片段。因此,它们未能捕捉到这些模型所设计的统一能力。为了解决这一问题,我们引入了UniVBench,这是一个专门构建的基准,用于评估视频基础模型在四个核心能力上的表现:视频理解、视频生成、视频编辑,以及一个新提出的任务——视频重建,该任务评估模型能够多忠实地再现其所遇到的视频内容。我们的基准通过整合200个高质量、多样化和多镜头的视频,显著扩展了评估的复杂性,每个视频都配有详细的说明、各种格式的编辑指令和参考图像。所有视频均为人工创作并经过仔细验证,提供了比以往基准更丰富的电影信息。此外,我们开发了一个统一的代理评估系统(UniV-Eval),标准化了所有任务的提示、指令解析和评分,能够实现统一视频模型的公平、可扩展和可重复的比较。通过将评估基于指令驱动的多镜头视频任务,UniVBench提供了一个测量视频基础模型所期望实现的综合能力的首个框架。大量的人类注释确保我们的评估与人类判断一致,从而实现严格的评估并加速向强大视频智能的进展。
cs.CV / 63 / 2602.21849
Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking
Meta-FC:具有特征一致性的元学习用于鲁棒且可泛化的水印技术
Abstract
Deep learning-based watermarking has made remarkable progress in recent years. To achieve robustness against various distortions, current methods commonly adopt a training strategy where a single random distortion (SRD) is chosen as the noise layer in each training batch. However, the SRD strategy treats distortions independently within each batch, neglecting the inherent relationships among different types of distortions and causing optimization conflicts across batches. As a result, the robustness and generalizability of the watermarking model are limited. To address this issue, we propose a novel training strategy that enhances robustness and generalization via meta-learning with feature consistency (Meta-FC). Specifically, we randomly sample multiple distortions from the noise pool to construct a meta-training task, while holding out one distortion as a simulated "unknown" distortion for the meta-testing phase. Through meta-learning, the model is encouraged to identify and utilize neurons that exhibit stable activations across different types of distortions, mitigating the optimization conflicts caused by the random sampling of diverse distortions in each batch. To further promote the transformation of stable activations into distortion-invariant representations, we introduce a feature consistency loss that constrains the decoded features of the same image subjected to different distortions to remain consistent. Extensive experiments demonstrate that, compared to the SRD training strategy, Meta-FC improves the robustness and generalization of various watermarking models by an average of 1.59%, 4.71%, and 2.38% under high-intensity, combined, and unknown distortions, respectively.
Chinese Translation
基于深度学习的水印技术近年来取得了显著进展。为了实现对各种失真的鲁棒性,目前的方法通常采用一种训练策略,即在每个训练批次中选择一个单一随机失真(single random distortion, SRD)作为噪声层。然而,SRD策略在每个批次中独立处理失真,忽视了不同类型失真之间的内在关系,导致批次间的优化冲突。因此,水印模型的鲁棒性和泛化能力受到限制。为了解决这个问题,我们提出了一种新颖的训练策略,通过元学习(meta-learning)与特征一致性(feature consistency)相结合(Meta-FC),增强鲁棒性和泛化能力。具体而言,我们从噪声池中随机抽取多个失真来构建元训练任务,同时保留一个失真作为模拟的“未知”失真用于元测试阶段。通过元学习,模型被鼓励识别和利用在不同类型失真中表现出稳定激活的神经元,从而减轻每个批次中随机采样多样失真所导致的优化冲突。为了进一步促进稳定激活向失真不变表示的转化,我们引入了一种特征一致性损失,约束同一图像在不同失真下解码的特征保持一致。大量实验表明,与SRD训练策略相比,Meta-FC在高强度、组合和未知失真下,平均提高了各种水印模型的鲁棒性和泛化能力,分别提高了1.59%、4.71%和2.38%。
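Two ideas from the Meta-FC abstract lend themselves to a small sketch: splitting sampled distortions into a meta-training set plus one held-out "unknown" distortion, and penalizing disagreement between features decoded from the same image under different distortions. The abstract does not state the exact loss, so the mean-squared-deviation form below is an assumption, and the names `sample_meta_task` and `feature_consistency_loss` are hypothetical.

```python
import random


def sample_meta_task(noise_pool, n_train, rng):
    """Draw n_train distortions for meta-training and one more for meta-testing.

    Returns (train_distortions, held_out_distortion); the held-out one plays
    the role of the simulated "unknown" distortion.
    """
    chosen = rng.sample(noise_pool, n_train + 1)
    return chosen[:-1], chosen[-1]


def feature_consistency_loss(features):
    """Mean squared deviation of each feature vector from their average.

    features: list of equal-length lists, one decoded feature vector per
    distortion applied to the same watermarked image. The loss is zero only
    when all vectors coincide. This form is an illustrative assumption.
    """
    k, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / k for d in range(dim)]
    total = sum((f[d] - mean[d]) ** 2 for f in features for d in range(dim))
    return total / (k * dim)
```

In a real training loop the feature vectors would come from the watermark decoder and the loss would be added to the usual message-recovery objective; those parts are omitted here.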
cs.CV / 64 / 2602.21855
Understanding Annotation Error Propagation and Learning an Adaptive Policy for Expert Intervention in Barrett's Video Segmentation
理解注释错误传播及学习自适应策略以便在巴雷特视频分割中进行专家干预
Abstract
Accurate annotation of endoscopic videos is essential yet time-consuming, particularly for challenging datasets such as dysplasia in Barrett's esophagus, where the affected regions are irregular and lack clear boundaries. Semi-automatic tools like Segment Anything Model 2 (SAM2) can ease this process by propagating annotations across frames, but small errors often accumulate and reduce accuracy, requiring expert review and correction. To address this, we systematically study how annotation errors propagate across different prompt types, namely masks, boxes, and points, and propose Learning-to-Re-Prompt (L2RP), a cost-aware framework that learns when and where to seek expert input. By tuning a human-cost parameter, our method balances annotation effort and segmentation accuracy. Experiments on a private Barrett's dysplasia dataset and the public SUN-SEG benchmark demonstrate improved temporal consistency and superior performance over baseline strategies.
Chinese Translation
内窥镜视频的准确注释至关重要,但耗时较长,尤其是在巴雷特食管的发育不良等具有挑战性的数据集中,受影响区域不规则且缺乏清晰边界。像Segment Anything Model 2 (SAM2)这样的半自动工具可以通过在帧之间传播注释来简化这一过程,但小错误往往会积累并降低准确性,需专家进行审核和修正。为了解决这个问题,我们系统地研究了注释错误如何在不同提示类型(即掩码、框和点)之间传播,并提出了学习重新提示(Learning-to-Re-Prompt, L2RP),这是一个成本感知框架,能够学习何时何地寻求专家输入。通过调整人力成本参数,我们的方法在注释工作量和分割准确性之间取得了平衡。在一个私有的巴雷特发育不良数据集和公共的SUN-SEG基准上进行的实验表明,时间一致性得到了改善,并且在性能上优于基线策略。
cs.CV / 65 / 2602.21864
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
DynamicGTR:利用图拓扑表示偏好提升图问答中的视觉语言模型能力
Abstract
Vision-Language Models (VLMs) have emerged as versatile solutions for zero-shot question answering (QA) across various domains. However, enabling VLMs to effectively comprehend structured graphs and perform accurate, efficient QA remains challenging. Existing approaches typically rely on one single graph topology representation (GTR), such as fixed-style visual images or unified text descriptions. This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries. To address this, we propose the DynamicGTR framework, which dynamically selects the optimal GTR for each query during inference, thereby enhancing the zero-shot graph QA capabilities of VLMs with a customizable accuracy and brevity trade-off. Extensive experiments show that DynamicGTR not only improves VLM-based graph algorithm QA performance but also successfully transfers the experience trained from synthetic graph algorithm tasks to real-world applications like link prediction and node classification, without any additional training. Additionally, DynamicGTR demonstrates strong transferability across tasks, domains, and models, suggesting its potential as a flexible solution for broad graph scenarios.
Chinese Translation
视觉语言模型(VLMs)已成为在各个领域进行零样本问答(QA)的多功能解决方案。然而,使VLMs有效理解结构化图形并进行准确、高效的问答仍然面临挑战。现有方法通常依赖于单一的图拓扑表示(GTR),如固定风格的视觉图像或统一的文本描述。这种“一刀切”的策略往往忽视了模型特定和任务特定的偏好,导致对图相关查询的回答不准确或过于冗长。为了解决这个问题,我们提出了DynamicGTR框架,该框架在推理过程中动态选择每个查询的最佳GTR,从而增强VLMs在零样本图问答中的能力,实现可定制的准确性与简洁性的权衡。大量实验表明,DynamicGTR不仅提升了基于VLM的图算法问答性能,还成功地将从合成图算法任务中训练的经验转移到链接预测和节点分类等实际应用中,而无需额外训练。此外,DynamicGTR在任务、领域和模型之间表现出强大的可迁移性,表明其作为广泛图场景灵活解决方案的潜力。
cs.CV / 66 / 2602.21873
GFPL: Generative Federated Prototype Learning for Resource-Constrained and Data-Imbalanced Vision Task
GFPL:针对资源受限和数据不平衡视觉任务的生成式联邦原型学习
Abstract
Federated learning (FL) facilitates the secure utilization of decentralized images, advancing applications in medical image recognition and autonomous driving. However, conventional FL faces two critical challenges in real-world deployment: ineffective knowledge fusion caused by model updates biased toward majority-class features, and prohibitive communication overhead due to frequent transmissions of high-dimensional model parameters. Inspired by the human brain's efficiency in knowledge integration, we propose a novel Generative Federated Prototype Learning (GFPL) framework to address these issues. Within this framework, a prototype generation method based on Gaussian Mixture Model (GMM) captures the statistical information of class-wise features, while a prototype aggregation strategy using Bhattacharyya distance effectively fuses semantically similar knowledge across clients. In addition, these fused prototypes are leveraged to generate pseudo-features, thereby mitigating feature distribution imbalance across clients. To further enhance feature alignment during local training, we devise a dual-classifier architecture, optimized via a hybrid loss combining Dot Regression and Cross-Entropy. Extensive experiments on benchmarks show that GFPL improves model accuracy by 3.6% under imbalanced data settings while maintaining low communication cost.
Chinese Translation
联邦学习(FL)促进了去中心化图像的安全利用,推动了医学图像识别和自动驾驶等应用的发展。然而,传统的联邦学习在实际部署中面临两个关键挑战:由于模型更新偏向于多数类特征,导致知识融合效率低下,以及由于频繁传输高维模型参数而造成的通信开销过大。受到人脑在知识整合方面高效性的启发,我们提出了一种新颖的生成式联邦原型学习(GFPL)框架,以解决这些问题。在该框架内,基于高斯混合模型(GMM)的原型生成方法捕捉了类别特征的统计信息,而使用Bhattacharyya距离的原型聚合策略有效地融合了客户端之间语义相似的知识。此外,这些融合的原型被用于生成伪特征,从而缓解客户端之间的特征分布不平衡。为了进一步增强本地训练中的特征对齐,我们设计了一种双分类器架构,通过结合点回归和交叉熵的混合损失进行优化。在基准测试上的大量实验表明,GFPL在不平衡数据设置下提高了模型准确性3.6%,同时保持了低通信成本。
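The Bhattacharyya distance used for prototype aggregation in GFPL has a closed form between two Gaussians. The sketch below assumes diagonal covariances (one variance per feature dimension), which the abstract does not specify; the function name `bhattacharyya_diag` is hypothetical and this is not the authors' aggregation code.

```python
import math


def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Per dimension, with v = (v1 + v2) / 2:
        (m1 - m2)^2 / (8 v)  +  0.5 * ln( v / sqrt(v1 * v2) )
    The first term measures mean separation, the second variance mismatch.
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)
        dist += (m1 - m2) ** 2 / (8.0 * v)
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))
    return dist
```

A small distance between two clients' prototype components suggests they describe similar features, so a server could plausibly merge them; how GFPL turns distances into fusion weights is not stated in the abstract.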
cs.CV / 67 / 2602.21877
How to Take a Memorable Picture? Empowering Users with Actionable Feedback
如何拍摄一张令人难忘的照片?赋能用户以可操作的反馈
Abstract
Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image's likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo's memorability. We introduce the task of Memorability Feedback (MemFeed), where an automated model should provide actionable, human-interpretable guidance to users with the goal of enhancing an image's future recall. We also present MemCoach, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., "emphasize facial expression," "bring the subject forward"). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model's internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce MemBench, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators.
Chinese Translation
图像的可记忆性,即一幅图像被记住的可能性,传统上在计算机视觉领域被研究为一种被动预测任务,模型回归一个标量分数,或者通过生成方法改变视觉输入以提高图像被记住的可能性。然而,这些范式都未能在拍摄时支持用户,而此时关键问题是如何提高照片的可记忆性。我们引入了可记忆性反馈(Memorability Feedback,MemFeed)这一任务,其中自动化模型应提供可操作的、易于人类理解的指导,以增强图像未来的回忆。我们还提出了MemCoach,这是第一个旨在提供具体建议以改善可记忆性的自然语言方法(例如,“强调面部表情”,“将主体向前移动”)。我们的方法基于多模态大型语言模型(Multimodal Large Language Models,MLLMs),无需训练,并采用教师-学生引导策略,使模型内部激活与从教师模型中学习到的更具可记忆性的模式对齐,教师模型是根据从最不记忆到最记忆的样本逐步进展的。为了对这一新任务进行系统评估,我们进一步引入了MemBench,一个新的基准,包含带有注释的可记忆性分数的序列对齐照片拍摄。我们的实验考虑了多个MLLMs,展示了MemCoach的有效性,显示出在多个零样本模型上性能的一致提升。结果表明,可记忆性不仅可以被预测,还可以被教授和指导,将重点从单纯的预测转向为人类创作者提供可操作的反馈。
cs.CV / 68 / 2602.21893
EndoDDC: Learning Sparse to Dense Reconstruction for Endoscopic Robotic Navigation via Diffusion Depth Completion
EndoDDC:通过扩散深度补全学习内窥镜机器人导航的稀疏到密集重建
Abstract
Accurate depth estimation plays a critical role in the navigation of endoscopic surgical robots, forming the foundation for 3D reconstruction and safe instrument guidance. Fine-tuning pretrained models heavily relies on endoscopic surgical datasets with precise depth annotations. While existing self-supervised depth estimation techniques eliminate the need for accurate depth annotations, their performance degrades in environments with weak textures and variable lighting, leading to sparse reconstruction with invalid depth estimation. Depth completion using sparse depth maps can mitigate these issues and improve accuracy. Despite the advances in depth completion techniques in general fields, their application in endoscopy remains limited. To overcome these limitations, we propose EndoDDC, an endoscopy depth completion method that integrates images, sparse depth information with depth gradient features, and optimizes depth maps through a diffusion model, addressing the issues of weak texture and light reflection in endoscopic environments. Extensive experiments on two publicly available endoscopy datasets show that our approach outperforms state-of-the-art models in both depth accuracy and robustness. This demonstrates the potential of our method to reduce visual errors in complex endoscopic environments. Our code will be released at https://github.com/yinheng-lin/EndoDDC.
Chinese Translation
准确的深度估计在内窥镜手术机器人的导航中发挥着至关重要的作用,是三维重建和安全工具引导的基础。微调预训练模型在很大程度上依赖于具有精确深度注释的内窥镜手术数据集。尽管现有的自监督深度估计技术消除了对准确深度注释的需求,但在纹理较弱和光照变化的环境中,其性能会下降,导致稀疏重建和无效的深度估计。利用稀疏深度图进行深度补全可以缓解这些问题并提高准确性。尽管深度补全技术在一般领域已有所进展,但在内窥镜领域的应用仍然有限。为了解决这些局限性,我们提出了EndoDDC,一种内窥镜深度补全方法,该方法整合了图像、稀疏深度信息和深度梯度特征,并通过扩散模型优化深度图,解决了内窥镜环境中纹理弱和光反射的问题。在两个公开可用的内窥镜数据集上的大量实验表明,我们的方法在深度准确性和鲁棒性方面优于最先进的模型。这证明了我们的方法在复杂内窥镜环境中减少视觉错误的潜力。我们的代码将发布在 https://github.com/yinheng-lin/EndoDDC。
cs.CV / 69 / 2602.21904
UNet-Based Keypoint Regression for 3D Cone Localization in Autonomous Racing
基于UNet的关键点回归用于自主赛车中的3D锥体定位
Abstract
Accurate cone localization in 3D space is essential in autonomous racing for precise navigation around the track. Approaches that rely on traditional computer vision algorithms are sensitive to environmental variations, and neural networks are often trained on limited data and are infeasible to run in real time. We present a UNet-based neural network for keypoint detection on cones, leveraging the largest custom-labeled dataset we have assembled. Our approach enables accurate cone position estimation and the potential for color prediction. Our model achieves substantial improvements in keypoint accuracy over conventional methods. Furthermore, we leverage our predicted keypoints in the perception pipeline and evaluate the end-to-end autonomous system. Our results show high-quality performance across all metrics, highlighting the effectiveness of this approach and its potential for adoption in competitive autonomous racing systems.
Chinese Translation
在自主赛车中,准确的3D空间锥体定位对于精确导航绕过赛道至关重要。依赖传统计算机视觉算法的方法对环境变化敏感,而神经网络通常在有限数据上训练,且在实时运行时不可行。我们提出了一种基于UNet的神经网络,用于锥体的关键点检测,利用我们组装的最大自定义标记数据集。我们的方法能够实现准确的锥体位置估计,并具有颜色预测的潜力。我们的模型在关键点准确性方面相较于传统方法取得了显著提升。此外,我们在感知管道中利用预测的关键点,并评估端到端的自主系统。我们的结果在所有指标上显示出高质量的性能,突显了该方法的有效性及其在竞争性自主赛车系统中应用的潜力。
cs.CV / 70 / 2602.21905
TIRAuxCloud: A Thermal Infrared Dataset for Day and Night Cloud Detection
TIRAuxCloud:一个用于昼夜云检测的热红外数据集
Abstract
Clouds are a major obstacle in Earth observation, limiting the usability and reliability of critical remote sensing applications such as fire disaster response, urban heat island monitoring, and snow and ice cover mapping. Therefore, the ability to detect clouds 24/7 is of paramount importance. While visible and near-infrared bands are effective for daytime cloud detection, their dependence on solar illumination makes them unsuitable for nighttime monitoring. In contrast, thermal infrared (TIR) imagery plays a crucial role in detecting clouds at night, when sunlight is absent. Due to their generally lower temperatures, clouds emit distinct thermal signatures that are detectable in TIR bands. Despite this, accurate nighttime cloud detection remains challenging due to limited spectral information and the typically lower spatial resolution of TIR imagery. To address these challenges, we present TIRAuxCloud, a multi-modal dataset centered around thermal spectral data to facilitate cloud segmentation under both daytime and nighttime conditions. The dataset comprises a unique combination of multispectral data (TIR, optical, and near-infrared bands) from Landsat and VIIRS, aligned with auxiliary information layers. Elevation, land cover, meteorological variables, and cloud-free reference images are included to help reduce surface-cloud ambiguity and cloud formation uncertainty. To overcome the scarcity of manual cloud labels, we include a large set of samples with automated cloud masks and a smaller manually annotated subset to further evaluate and improve models. Comprehensive benchmarks are presented to establish performance baselines through supervised and transfer learning, demonstrating the dataset's value in advancing the development of innovative methods for daytime and nighttime cloud detection.
Chinese Translation
云层是地球观测中的主要障碍,限制了火灾应急响应、城市热岛监测以及雪冰覆盖制图等关键遥感应用的可用性和可靠性。因此,全天候云检测能力至关重要。虽然可见光和近红外波段在白天云检测中有效,但它们对太阳光照的依赖使其不适合夜间监测。相比之下,热红外(TIR)图像在夜间(阳光缺失时)云检测中发挥着关键作用。由于云层通常具有较低的温度,云层在TIR波段中发出独特的热特征。然而,由于光谱信息有限以及TIR图像通常较低的空间分辨率,准确的夜间云检测仍然具有挑战性。为了解决这些问题,我们提出了TIRAuxCloud,一个以热光谱数据为中心的多模态数据集,以促进昼夜条件下的云分割。该数据集包含来自Landsat和VIIRS的多光谱数据(TIR、光学和近红外波段)与辅助信息层的独特组合。数据集中包括地形、土地覆盖、气象变量和无云参考图像,以帮助减少表面与云层的模糊性和云层形成的不确定性。为了克服人工云标签稀缺的问题,我们包括了一大批带有自动云掩膜的样本以及一个较小的手动标注子集,以进一步评估和改进模型。我们提供了全面的基准测试,通过监督学习和迁移学习建立性能基线,展示了该数据集在推动昼夜云检测创新方法发展中的价值。
cs.CV / 71 / 2602.21915
Protein Graph Neural Networks for Heterogeneous Cryo-EM Reconstruction
用于异质性冷冻电子显微镜重建的蛋白质图神经网络
Abstract
We present a geometry-aware method for heterogeneous single-particle cryogenic electron microscopy (cryo-EM) reconstruction that predicts atomic backbone conformations. To incorporate protein-structure priors, we represent the backbone as a graph and use a graph neural network (GNN) autodecoder that maps per-image latent variables to 3D displacements of a template conformation. The objective combines a data-discrepancy term based on a differentiable cryo-EM forward model with geometric regularization, and it supports unknown orientations via ellipsoidal support lifting (ESL) pose estimation. On synthetic datasets derived from molecular dynamics trajectories, the proposed GNN achieves higher accuracy compared to a multilayer perceptron (MLP) of comparable size, highlighting the benefits of a geometry-informed inductive bias.
Chinese Translation
我们提出了一种几何感知的方法,用于异质性单颗粒冷冻电子显微镜(cryo-EM)重建,该方法预测原子骨架构象。为了结合蛋白质结构先验,我们将骨架表示为图,并使用图神经网络(GNN)自解码器,将每幅图像的潜变量映射到模板构象的三维位移。该目标结合了基于可微分的cryo-EM前向模型的数据差异项和几何正则化,并通过椭球支持提升(ESL)姿态估计支持未知方向。在从分子动力学轨迹派生的合成数据集上,所提出的GNN相比于同等规模的多层感知器(MLP)实现了更高的准确性,突显了几何信息引导的归纳偏置的优势。
cs.CV / 72 / 2602.21917
Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration
扫描簇,而非像素:一种以簇为中心的高效超高清图像恢复范式
Abstract
Ultra-High-Definition (UHD) image restoration is trapped in a scalability crisis: existing models, bound to pixel-wise operations, demand unsustainable computation. While state space models (SSMs) like Mamba promise linear complexity, their pixel-serial scanning remains a fundamental bottleneck for the millions of pixels in UHD content. We ask: must we process every pixel to understand the image? This paper introduces C$^2$SSM, a visual state space model that breaks this taboo by shifting from pixel-serial to cluster-serial scanning. Our core discovery is that the rich feature distribution of a UHD image can be distilled into a sparse set of semantic centroids via a neural-parameterized mixture model. C$^2$SSM leverages this to reformulate global modeling into a novel dual-path process: it scans and reasons over a handful of cluster centers, then diffuses the global context back to all pixels through a principled similarity distribution, all while a lightweight modulator preserves fine details. This cluster-centric paradigm achieves a decisive leap in efficiency, slashing computational costs while establishing new state-of-the-art results across five UHD restoration tasks. More than a solution, C$^2$SSM charts a new course for efficient large-scale vision: scan clusters, not pixels.
Chinese Translation
超高清(UHD)图像恢复面临可扩展性危机:现有模型依赖于逐像素操作,导致计算需求不可持续。尽管状态空间模型(SSMs)如 Mamba 承诺线性复杂度,但其逐像素扫描仍然是处理UHD内容中数百万像素的根本瓶颈。我们提出疑问:理解图像是否必须处理每一个像素?本文介绍了 C$^2$SSM,一种视觉状态空间模型,通过从逐像素扫描转变为逐簇扫描,打破了这一禁忌。我们的核心发现是,UHD图像的丰富特征分布可以通过神经参数化混合模型提炼为一组稀疏的语义质心。C$^2$SSM 利用这一点将全局建模重新构建为一种新颖的双路径过程:它对少量簇中心进行扫描和推理,然后通过一个有原则的相似性分布将全局上下文扩散回所有像素,同时一个轻量调制器保留细节。该以簇为中心的范式在效率上实现了决定性飞跃,大幅降低计算成本,同时在五个UHD恢复任务中建立了新的最先进结果。C$^2$SSM不仅是一个解决方案,更为高效的大规模视觉开辟了一条新路径:扫描簇,而非像素。
cs.CV / 73 / 2602.21929
Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
几何作为上下文:在场景一致的视频生成中调节显式3D与几何上下文
Abstract
Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency, or iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediary outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce "geometry-as-context". It iteratively completes the following steps using an autoregressive camera-controlled video generation model: (1) estimates the geometry of the current view necessary for 3D reconstruction, and (2) simulates and restores novel view images rendered from the 3D scene. Under this multi-task framework, we develop the camera gated attention module to enhance the model's capability to effectively leverage camera poses. During the training phase, text contexts are utilized to ascertain whether geometric or RGB images should be generated. To ensure that the model can generate RGB-only outputs during inference, the geometry context is randomly dropped from the interleaved text-image-geometry training sequence. The method has been tested on scene video generation with one-direction and back-and-forth trajectories. The results show its superiority over previous approaches in maintaining scene consistency and camera control.
Chinese Translation
场景一致的视频生成旨在基于相机轨迹创建探索3D场景的视频。以往的方法依赖于具有外部记忆的视频生成模型以保持一致性,或迭代的3D重建和修复,这在推理过程中由于中间输出不正确、不可微分的过程以及分离模型而累积错误。为克服这些局限性,我们提出了“几何作为上下文”。该方法使用自回归相机控制的视频生成模型迭代完成以下步骤:(1)估计当前视图所需的几何形状,以进行3D重建;(2)模拟和恢复由3D场景渲染的新视图图像。在这一多任务框架下,我们开发了相机门控注意力模块,以增强模型有效利用相机姿态的能力。在训练阶段,利用文本上下文来确定应生成几何图像还是RGB图像。为了确保模型在推理过程中能够生成仅包含RGB的输出,几何上下文在交错的文本-图像-几何训练序列中随机丢弃。该方法已在单向和往返轨迹的场景视频生成中进行了测试。结果表明,其在保持场景一致性和相机控制方面优于以往的方法。
cs.CV / 74 / 2602.21935
A Framework for Cross-Domain Generalization in Coronary Artery Calcium Scoring Across Gated and Non-Gated Computed Tomography
跨域冠状动脉钙评分的框架:针对有电图门控和无电图门控计算机断层扫描的研究
Abstract
Coronary artery calcium (CAC) scoring is a key predictor of cardiovascular risk, but it relies on ECG-gated CT scans, restricting its use to specialized cardiac imaging settings. We introduce an automated framework for CAC detection and lesion-specific Agatston scoring that operates across both gated and non-gated CT scans. At its core is CARD-ViT, a self-supervised Vision Transformer trained exclusively on gated CT data using DINO. Without any non-gated training data, our framework achieves 0.707 accuracy and a Cohen's kappa of 0.528 on the Stanford non-gated dataset, matching models trained directly on non-gated scans. On gated test sets, the framework achieves 0.910 accuracy with Cohen's kappa scores of 0.871 and 0.874 across independent datasets, demonstrating robust risk stratification. These results demonstrate the feasibility of cross-domain CAC scoring from gated to non-gated domains, supporting scalable cardiovascular screening in routine chest imaging without additional scans or annotations.
Chinese Translation
冠状动脉钙(CAC)评分是心血管风险的关键预测指标,但其依赖于电图门控CT扫描,限制了其在专门心脏影像学环境中的应用。我们提出了一种自动化框架,用于CAC检测和特定病变的Agatston评分,能够在有电图门控和无电图门控CT扫描中运行。该框架的核心是CARD-ViT,一种自监督视觉变换器,专门在有电图门控CT数据上使用DINO进行训练。在没有任何无电图门控训练数据的情况下,我们的框架在斯坦福无电图门控数据集上实现了0.707的准确率和0.528的Cohen's kappa,与直接在无电图门控扫描上训练的模型相匹配。在有电图门控测试集上,该框架实现了0.910的准确率,Cohen's kappa得分在独立数据集上分别为0.871和0.874,展示了强大的风险分层能力。这些结果证明了从有电图门控到无电图门控领域进行跨域CAC评分的可行性,支持在常规胸部影像学中进行可扩展的心血管筛查,而无需额外的扫描或注释。
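For readers unfamiliar with Agatston scoring, the following minimal sketch shows the standard per-lesion weighting: each calcified region at or above 130 HU contributes its area multiplied by a density weight of 1 to 4 set by its peak attenuation. The input format (a list of `(area_mm2, peak_hu)` pairs, one per lesion per slice), the 1 mm² minimum-area filter, and the function names are assumptions for illustration; lesion detection itself, which the paper handles with CARD-ViT, is not shown.

```python
def density_weight(peak_hu):
    """Agatston density weight from a lesion's peak attenuation in HU."""
    if peak_hu < 130:
        return 0  # below the calcium threshold, contributes nothing
    if peak_hu < 200:
        return 1
    if peak_hu < 300:
        return 2
    if peak_hu < 400:
        return 3
    return 4


def agatston_score(lesions, min_area_mm2=1.0):
    """Sum of area * density weight over all lesions on all slices.

    lesions: iterable of (area_mm2, peak_hu) pairs. Regions smaller than
    min_area_mm2 are ignored as likely noise (an assumed cut-off).
    """
    return sum(
        area * density_weight(hu)
        for area, hu in lesions
        if area >= min_area_mm2
    )
```

The total score is then binned into risk categories, and agreement between predicted and reference categories is what metrics such as accuracy and Cohen's kappa summarize.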
cs.CV / 75 / 2602.21942
Directed Ordinal Diffusion Regularization for Progression-Aware Diabetic Retinopathy Grading
面向进展意识的糖尿病视网膜病变分级的定向序数扩散正则化
Abstract
Diabetic Retinopathy (DR) progresses as a continuous and irreversible deterioration of the retina, following a well-defined clinical trajectory from mild to severe stages. However, most existing ordinal regression approaches model DR severity as a set of static, symmetric ranks, capturing relative order while ignoring the inherent unidirectional nature of disease progression. As a result, the learned feature representations may violate biological plausibility, allowing implausible proximity between non-consecutive stages or even reverse transitions. To bridge this gap, we propose Directed Ordinal Diffusion Regularization (D-ODR), which explicitly models the feature space as a directed flow by constructing a progression-constrained directed graph that strictly enforces forward disease evolution. By performing multi-scale diffusion on this directed structure, D-ODR imposes penalties on score inversions along valid progression paths, thereby effectively preventing the model from learning biologically inconsistent reverse transitions. This mechanism aligns the feature representation with the natural trajectory of DR worsening. Extensive experiments demonstrate that D-ODR yields superior grading performance compared to state-of-the-art ordinal regression and DR-specific grading methods, offering a more clinically reliable assessment of disease severity. Our code is available on https://github.com/HovChen/D-ODR.
Chinese Translation
糖尿病视网膜病变(Diabetic Retinopathy, DR)作为视网膜的持续且不可逆的恶化过程,遵循从轻度到重度阶段的明确临床轨迹。然而,大多数现有的序数回归方法将DR的严重程度建模为一组静态的对称等级,捕捉相对顺序的同时忽视了疾病进展的固有单向性。因此,学习到的特征表示可能违反生物学合理性,导致非连续阶段之间产生不合理的接近性,甚至出现逆向转变。为了解决这一问题,我们提出了定向序数扩散正则化(Directed Ordinal Diffusion Regularization, D-ODR),该方法通过构建一个受进展约束的定向图,明确将特征空间建模为定向流,严格执行疾病的前向演变。通过在这一定向结构上进行多尺度扩散,D-ODR对有效进展路径上的评分反转施加惩罚,从而有效防止模型学习生物学上不一致的逆向转变。该机制使特征表示与DR恶化的自然轨迹保持一致。大量实验表明,D-ODR在分级性能上优于最先进的序数回归和DR特定的分级方法,提供了更具临床可靠性的疾病严重性评估。我们的代码可在 https://github.com/HovChen/D-ODR 获取。
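The idea of penalizing score inversions along forward progression paths can be illustrated with a toy example. This is not the D-ODR formulation, which the abstract describes only at a high level: the edge rule (connect a sample to every sample exactly one grade more severe), the use of repeated matrix products to reach multi-hop pairs, the `decay` factor, and all function names are assumptions chosen to keep the sketch small.

```python
def build_forward_edges(grades):
    """Adjacency matrix with an edge i -> j when grade[j] == grade[i] + 1."""
    n = len(grades)
    return [
        [1.0 if grades[j] == grades[i] + 1 else 0.0 for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain square-matrix product, used to follow paths one hop further."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def inversion_penalty(scores, grades, hops=2, decay=0.5):
    """Penalize pairs where an earlier stage scores higher than a later one.

    For each hop count up to `hops`, every pair (i, j) joined by a forward
    path of that length adds max(0, scores[i] - scores[j]), down-weighted
    by decay ** (hop - 1). Backward pairs are never connected, so only
    violations of forward progression are penalized.
    """
    n = len(scores)
    step = build_forward_edges(grades)
    reach, weight, penalty = step, 1.0, 0.0
    for _ in range(hops):
        for i in range(n):
            for j in range(n):
                if reach[i][j] > 0 and scores[i] > scores[j]:
                    penalty += weight * (scores[i] - scores[j])
        reach = matmul(reach, step)
        weight *= decay
    return penalty
```

With correctly ordered scores the penalty is zero; reversing the scores makes every forward path contribute, which is the behaviour a directed regularizer of this kind is meant to discourage.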
cs.CV / 76 / 2602.21943
Mobile-Ready Automated Triage of Diabetic Retinopathy Using Digital Fundus Images
基于数字眼底图像的移动端自动化糖尿病视网膜病变分诊
Abstract
Diabetic Retinopathy (DR) is a major cause of vision impairment worldwide. However, manual diagnosis is often time-consuming and prone to errors, leading to delays in screening. This paper presents a lightweight automated deep learning framework for efficient assessment of DR severity from digital fundus images. We use a MobileNetV3 architecture with a Consistent Rank Logits (CORAL) head to model the ordered progression of disease while maintaining computational efficiency for resource-constrained environments. The model is trained and validated on a combined dataset of APTOS 2019 and IDRiD images using a preprocessing pipeline including circular cropping and illumination normalization. Extensive experiments including 3-fold cross-validation and ablation studies demonstrate strong performance. The model achieves a Quadratic Weighted Kappa (QWK) score of 0.9019 and an accuracy of 80.03 percent. Additionally, we address real-world deployment challenges through model calibration to reduce overconfidence and optimization for mobile devices. The proposed system provides a scalable and practical tool for early-stage diabetic retinopathy screening.
Chinese Translation
糖尿病视网膜病变(Diabetic Retinopathy, DR)是全球视力障碍的主要原因。然而,手动诊断通常耗时且容易出错,导致筛查延迟。本文提出了一种轻量级的自动化深度学习框架,用于高效评估数字眼底图像中的DR严重程度。我们采用MobileNetV3架构,并结合一致性排名逻辑(Consistent Rank Logits, CORAL)头部,以建模疾病的有序进展,同时保持在资源受限环境中的计算效率。该模型在APTOS 2019和IDRiD图像的组合数据集上进行训练和验证,使用包括圆形裁剪和光照标准化在内的预处理管道。大量实验,包括3折交叉验证和消融研究,展示了模型的强大性能。该模型实现了0.9019的二次加权Kappa(Quadratic Weighted Kappa, QWK)分数和80.03%的准确率。此外,我们通过模型校准来减少过度自信,并针对移动设备进行优化,以应对实际部署中的挑战。所提出的系统为早期糖尿病视网膜病变筛查提供了可扩展且实用的工具。
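As a rough illustration of how a CORAL-style head can model ordered DR grades, the sketch below encodes a grade as K-1 cumulative binary targets and decodes a grade by counting how many thresholds are passed. It is a minimal pure-Python sketch under stated assumptions: five grades (0 to 4), a 0.5 decision threshold, and the helper names `encode_rank`, `coral_logits`, and `decode_rank` are all hypothetical and not taken from the paper.

```python
import math

NUM_GRADES = 5  # assumption: DR grades 0..4


def encode_rank(grade, num_grades=NUM_GRADES):
    """Turn an ordinal grade into K-1 binary targets: target k is 1 if grade > k."""
    return [1 if grade > k else 0 for k in range(num_grades - 1)]


def coral_logits(shared_score, biases):
    """CORAL shares one score across all thresholds and adds one bias per threshold."""
    return [shared_score + b for b in biases]


def decode_rank(logits, threshold=0.5):
    """Predict a grade by counting the 'grade > k' probabilities above the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return sum(1 for p in probs if p > threshold)
```

Because every threshold shares the same score and only the biases differ, the predicted probabilities stay rank-consistent as long as the biases are ordered, which is the property the CORAL name refers to.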
cs.CV / 77 / 2602.21944
Learning to Fuse and Reconstruct Multi-View Graphs for Diabetic Retinopathy Grading
学习融合与重建多视角图以进行糖尿病视网膜病变分级
Abstract
Diabetic retinopathy (DR) is one of the leading causes of vision loss worldwide, making early and accurate DR grading critical for timely intervention. Recent clinical practices leverage multi-view fundus images for DR detection with a wide coverage of the field of view (FOV), motivating deep learning methods to explore the potential of multi-view learning for DR grading. However, existing methods often overlook the inter-view correlations when fusing multi-view fundus images, failing to fully exploit the inherent consistency across views originating from the same patient. In this work, we present MVGFDR, an end-to-end Multi-View Graph Fusion framework for DR grading. Different from existing methods that directly fuse visual features from multiple views, MVGFDR is equipped with a novel Multi-View Graph Fusion (MVGF) module to explicitly disentangle the shared and view-specific visual features. Specifically, MVGF comprises three key components: (1) Multi-view Graph Initialization, which constructs visual graphs via residual-guided connections and employs Discrete Cosine Transform (DCT) coefficients as frequency-domain anchors; (2) Multi-view Graph Fusion, which integrates selective nodes across multi-view graphs based on frequency-domain relevance to capture complementary view-specific information; and (3) Masked Cross-view Reconstruction, which leverages masked reconstruction of shared information across views to facilitate view-invariant representation learning. Extensive experimental results on MFIDDR, by far the largest multi-view fundus image dataset, demonstrate the superiority of our proposed approach over existing state-of-the-art approaches in diabetic retinopathy grading.
Chinese Translation
糖尿病视网膜病变(DR)是全球视力丧失的主要原因之一,因此早期和准确的DR分级对于及时干预至关重要。最近的临床实践利用多视角眼底图像进行DR检测,具有广泛的视野覆盖(FOV),这促使深度学习方法探索多视角学习在DR分级中的潜力。然而,现有方法在融合多视角眼底图像时往往忽视了视角间的关联,未能充分利用来自同一患者的视角间固有的一致性。在本研究中,我们提出了MVGFDR,一个端到端的多视角图融合框架用于DR分级。与现有方法直接融合来自多个视角的视觉特征不同,MVGFDR配备了一个新颖的多视角图融合(MVGF)模块,以明确区分共享和视角特定的视觉特征。具体而言,MVGF包含三个关键组件:(1)多视角图初始化,通过残差引导连接构建视觉图,并采用离散余弦变换(DCT)系数作为频域锚点;(2)多视角图融合,根据频域相关性整合多视角图中的选择节点,以捕获互补的视角特定信息;(3)掩码跨视角重建,利用跨视角共享信息的掩码重建来促进视角不变表示学习。在目前为止最大的多视角眼底图像数据集MFIDDR上的广泛实验结果表明,我们提出的方法在糖尿病视网膜病变分级方面优于现有的最先进方法。
cs.CV / 78 / 2602.21952
MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving
MindDriver:引入渐进式多模态推理用于自动驾驶
Abstract
Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. However, Chain-of-Thought (CoT), the reasoning strategy most widely used with VLMs, faces critical challenges. Existing textual CoT leaves a large gap between the text semantic space and the trajectory physical space. Although a recent approach replaces text with future images as the CoT process, it lacks clear planning-oriented objective guidance for generating images with accurate scene evolution. To address these issues, we propose MindDriver, a progressive multimodal reasoning framework that enables VLMs to imitate human-like progressive thinking for autonomous driving. MindDriver proceeds through semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline to generate aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method to optimize the alignment through progressive high-level reward-based learning. MindDriver demonstrates superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluation. Code is available at https://github.com/hotdogcheesewhite/MindDriver.
Chinese Translation
视觉-语言模型(VLM)展现出强大的推理能力,显示出在端到端自动驾驶系统中的潜力。链式思维(CoT)作为VLM广泛使用的推理策略,正面临重大挑战。现有的文本CoT在文本语义空间与轨迹物理空间之间存在较大差距。尽管近期的方法利用未来图像替代文本作为CoT过程,但缺乏明确的面向规划的目标指导,以生成具有准确场景演变的图像。为了解决这些问题,我们创新性地提出了MindDriver,一个渐进式多模态推理框架,使VLM能够模仿人类的渐进式思维以实现自动驾驶。MindDriver展示了语义理解、语义到物理空间的想象以及物理空间轨迹规划。为了在MindDriver中实现对齐的推理过程,我们开发了一种反馈引导的自动数据标注管道,以生成对齐的多模态推理训练数据。此外,我们开发了一种渐进式强化微调方法,通过基于渐进式高层奖励的学习来优化对齐。MindDriver在nuScenes开环和Bench2Drive闭环评估中表现出色。代码可在 https://github.com/hotdogcheesewhite/MindDriver 获取。
cs.CV / 79 / 2602.21956
Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation
高分辨率文本丰富图像翻译中的全球-局部双重感知方法
Abstract
Text Image Machine Translation (TIMT) aims to translate text embedded in images from a source language into a target language, requiring synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs), struggle with high-resolution text-rich images due to cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details. Moreover, to realize this dual-perception paradigm, we construct GLoD, a large-scale text-rich TIMT dataset comprising 510K high-resolution global-local image-text pairs covering diverse real-world scenarios. Extensive experiments demonstrate that GLoTran substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution and text-rich conditions.
Chinese Translation
文本图像机器翻译(TIMT)旨在将源语言中嵌入图像的文本翻译为目标语言,这需要视觉感知与语言理解的协同整合。现有的TIMT方法,无论是级联管道还是端到端的多模态大型语言模型(MLLMs),在处理高分辨率文本丰富图像时面临挑战,主要由于布局杂乱、多样的字体和非文本干扰,导致文本遗漏、语义漂移和上下文不一致。为了解决这些问题,我们提出了GLoTran,一种基于MLLM的TIMT的全球-局部双重视觉感知框架。GLoTran在指令引导的对齐策略下,将低分辨率的全局图像与多尺度区域级文本图像切片相结合,使得MLLM能够保持场景级上下文一致性,同时忠实捕捉细粒度的文本细节。此外,为了实现这一双重感知范式,我们构建了GLoD,一个大规模文本丰富的TIMT数据集,包含510K高分辨率的全球-局部图像-文本对,涵盖多样的现实场景。大量实验表明,GLoTran在翻译的完整性和准确性上显著优于最先进的MLLM,提供了一种在高分辨率和文本丰富条件下进行细粒度TIMT的新范式。
cs.CV / 80 / 2602.21963
Global-Aware Edge Prioritization for Pose Graph Initialization
全球感知的边缘优先级排序用于姿态图初始化
Abstract
The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming state-of-the-art retrieval methods on ambiguous scenes. The code and trained models are available at https://github.com/weitong8591/global_edge_prior.
Chinese Translation
姿态图是运动重建(Structure-from-Motion, SfM)的核心组成部分,其中图像作为节点,边缘编码相对姿态。由于几何验证的成本较高,SfM 流水线将姿态图限制为稀疏的候选边集,使得初始化变得至关重要。现有方法依赖于图像检索将每个图像连接到其 $k$ 个最近邻,独立处理每对图像,忽视了全局一致性。我们通过边缘优先级排序的概念来解决这一局限性,根据其在 SfM 中的效用对候选边进行排名。我们的方法包含三个组成部分:(1)一个通过 SfM 派生的监督训练的图神经网络(GNN),用于预测全局一致的边缘可靠性;(2)基于多最小生成树的姿态图构建,受这些排名的指导;(3)连接性感知的评分调制,强化弱区域并减少图的直径。这种全球信息驱动的初始化产生了更可靠和紧凑的姿态图,提高了稀疏和高速环境下的重建精度,并在模糊场景中超越了现有的检索方法。代码和训练模型可在 https://github.com/weitong8591/global_edge_prior 获取。
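To make the second component more concrete, here is a minimal sketch of building a sparse pose graph from ranked candidate edges by taking the union of several edge-disjoint spanning forests. It rests on assumptions that the abstract does not confirm: edges are `(score, u, v)` tuples, a higher score means a more reliable edge (equivalent to a minimum spanning tree over the cost `1 - score`), and each new tree is built only from edges not yet used. The function names are hypothetical and this is not the authors' implementation.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def spanning_forest(num_nodes, ranked_edges):
    """Kruskal-style pass over edges sorted best-first; returns the accepted edges."""
    parent = list(range(num_nodes))
    chosen = []
    for score, u, v in ranked_edges:
        root_u, root_v = find(parent, u), find(parent, v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            chosen.append((score, u, v))
    return chosen


def multi_spanning_tree_graph(num_nodes, scored_edges, num_trees=2):
    """Union of up to num_trees edge-disjoint spanning forests, taken in rank order."""
    remaining = sorted(scored_edges, reverse=True)  # highest score first
    selected = []
    for _ in range(num_trees):
        forest = spanning_forest(num_nodes, remaining)
        if not forest:
            break
        selected.extend(forest)
        used = set(forest)
        remaining = [edge for edge in remaining if edge not in used]
    return selected
```

With `num_trees=1` this reduces to a single spanning tree, the sparsest connected graph; each additional tree adds redundancy, so the number of trees trades verification cost against robustness to wrong edges.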
cs.CV / 81 / 2602.21977
When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters
当 LoRA 背叛时:通过伪装为良性适配器对文本到图像模型进行后门攻击
Abstract
Low-Rank Adaptation (LoRA) has emerged as a leading technique for efficiently fine-tuning text-to-image diffusion models, and its widespread adoption on open-source platforms has fostered a vibrant culture of model sharing and customization. However, the same modular and plug-and-play flexibility that makes LoRA appealing also introduces a broader attack surface. To highlight this risk, we propose Masquerade-LoRA (MasqLoRA), the first systematic attack framework that leverages an independent LoRA module as the attack vehicle to stealthily inject malicious behavior into text-to-image diffusion models. MasqLoRA operates by freezing the base model parameters and updating only the low-rank adapter weights using a small number of "trigger word-target image" pairs. This enables the attacker to train a standalone backdoor LoRA module that embeds a hidden cross-modal mapping: when the module is loaded and a specific textual trigger is provided, the model produces a predefined visual output; otherwise, it behaves indistinguishably from the benign model, ensuring the stealthiness of the attack. Experimental results demonstrate that MasqLoRA can be trained with minimal resource overhead and achieves a high attack success rate of 99.8%. MasqLoRA reveals a severe and unique threat in the AI supply chain, underscoring the urgent need for dedicated defense mechanisms for the LoRA-centric sharing ecosystem.
Chinese Translation
低秩适配(LoRA)已成为高效微调文本到图像扩散模型的领先技术,其在开源平台上的广泛应用促进了模型共享和定制的活跃文化。然而,正是这种模块化和即插即用的灵活性使得 LoRA 更具吸引力,同时也引入了更广泛的攻击面。为了突出这一风险,我们提出了 Masquerade-LoRA(MasqLoRA),这是第一个系统性攻击框架,利用独立的 LoRA 模块作为攻击载体,悄然将恶意行为注入文本到图像的扩散模型中。MasqLoRA 通过冻结基础模型参数,仅使用少量的“触发词-目标图像”对来更新低秩适配器权重。这使得攻击者能够训练一个独立的后门 LoRA 模块,该模块嵌入了一个隐藏的跨模态映射:当加载该模块并提供特定的文本触发时,模型会生成预定义的视觉输出;否则,它的行为与良性模型无异,从而确保攻击的隐蔽性。实验结果表明,MasqLoRA 可以以最小的资源开销进行训练,并且达到了 99.8% 的高攻击成功率。MasqLoRA 揭示了 AI 供应链中的一种严重且独特的威胁,强调了对 LoRA 共享生态系统专门防御机制的迫切需求。
cs.CV / 82 / 2602.21987
PatchDenoiser: Parameter-efficient multi-scale patch learning and fusion denoiser for medical images
PatchDenoiser:一种高效的多尺度补丁学习与融合去噪器用于医学图像
Abstract
Medical images are essential for diagnosis, treatment planning, and research, but their quality is often degraded by noise from low-dose acquisition, patient motion, or scanner limitations, affecting both clinical interpretation and downstream analysis. Traditional filtering approaches often over-smooth and lose fine anatomical details, while deep learning methods, including CNNs, GANs, and transformers, may struggle to preserve such details or require large, computationally expensive models, limiting clinical practicality. We propose PatchDenoiser, a lightweight, energy-efficient multi-scale patch-based denoising framework. It decomposes denoising into local texture extraction and global context aggregation, fused via a spatially aware patch fusion strategy. This design enables effective noise suppression while preserving fine structural and anatomical details. PatchDenoiser is ultra-lightweight, with far fewer parameters and lower computational complexity than CNN-, GAN-, and transformer-based denoisers. On the 2016 Mayo Low-Dose CT dataset, PatchDenoiser consistently outperforms state-of-the-art CNN- and GAN-based methods in PSNR and SSIM. It is robust to variations in slice thickness, reconstruction kernels, and HU windows, generalizes across scanners without fine-tuning, and reduces parameters by ~9x and energy consumption per inference by ~27x compared with conventional CNN denoisers. PatchDenoiser thus provides a practical, scalable, and computationally efficient solution for medical image denoising, balancing performance, robustness, and clinical deployability.
Chinese Translation
医学图像对于诊断、治疗规划和研究至关重要,但其质量常常受到低剂量采集、患者运动或扫描仪限制等噪声的影响,从而影响临床解读和后续分析。传统的过滤方法往往过度平滑,导致细微解剖细节的丢失,而深度学习方法,包括卷积神经网络(CNN)、生成对抗网络(GAN)和变换器(transformers),可能难以保留这些细节,或需要大型、计算开销高的模型,限制了其临床实用性。我们提出了PatchDenoiser,一种轻量级、能效高的多尺度基于补丁的去噪框架。该框架将去噪过程分解为局部纹理提取和全局上下文聚合,通过空间感知的补丁融合策略进行融合。这一设计能够有效抑制噪声,同时保留细微的结构和解剖细节。PatchDenoiser超轻量级,其参数远少于基于CNN、GAN和变换器的去噪器,计算复杂度也更低。在2016年梅奥低剂量CT数据集上,PatchDenoiser在峰值信噪比(PSNR)和结构相似性指数(SSIM)方面始终优于最先进的基于CNN和GAN的方法。它对切片厚度、重建核和HU窗口的变化具有鲁棒性,能够在不同扫描仪间泛化而无需微调,并且与传统的CNN去噪器相比,参数减少约9倍,推理时的能耗减少约27倍。因此,PatchDenoiser为医学图像去噪提供了一种实用、可扩展且计算高效的解决方案,平衡了性能、鲁棒性和临床可部署性。
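For readers unfamiliar with the PSNR metric reported above, the following is a minimal pure-Python sketch of the standard definition, 10·log10(MAX² / MSE). It assumes images are given as lists of rows with intensities scaled to [0, `max_value`]; the function name `psnr` and this input layout are illustrative choices, not part of the PatchDenoiser code.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    squared_errors = [
        (r - e) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for r, e in zip(ref_row, est_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For CT data the choice of `max_value` depends on the HU window used for evaluation, which is one reason PSNR values are only comparable under the same windowing.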
cs.CV / 83 / 2602.21992
PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning
PanoEnv:通过强化学习探索全景环境中的三维空间智能
Abstract
360-degree panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, with only 49.34% overall accuracy and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward that incorporates five geometry-aware strategies such as distance tolerance and spatial consistency. A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (true/false and multiple choice), and Stage 2 fine-tunes on mixed open-ended data to improve generalization. Our 7B model achieves new state-of-the-art performance, improving overall accuracy to 52.93% (+3.59%) and open-ended accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic evaluation scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.
Chinese Translation
360度全景图像在虚拟现实、自动驾驶和机器人技术中越来越多地用于整体场景理解。然而,当前的视觉语言模型(VLMs)在处理等距投影(Equirectangular Projection, ERP)图像时,由于几何失真和有限的三维监督,难以进行三维空间推理。我们提出了PanoEnv,一个基于合成三维环境构建的大规模视觉问答(VQA)基准,包含14.8K个问题,涵盖五个类别(例如,相对位置、体积比较),并基于准确的三维注释(包括深度、分割和边界框)。对14个最先进的VLM进行基准测试显示其三维理解能力有限,整体准确率仅为49.34%,开放式(OE)问题的准确率为8.36%。为了增强三维推理能力,我们提出了一种基于群体相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习后训练框架,该框架采用基于真实值的奖励,结合了五种几何感知策略,如距离容忍和空间一致性。两阶段课程进一步减轻了灾难性遗忘:第一阶段在结构化任务(真/假和多项选择)上进行训练,第二阶段在混合开放式数据上进行微调,以提高泛化能力。我们的7B模型实现了新的最先进性能,将整体准确率提高到52.93%(+3.59%),开放式准确率提高到14.83%,同时保持结构化任务的表现。它还在语义评估中获得了最高分(Q-Score 6.24,P-Score 5.95),超越了32B模型。这些结果表明,PanoEnv-QA及我们的基于课程的强化学习框架有效地在VLM中灌输三维空间智能,以实现全方位感知。
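The core of GRPO is that each sampled answer is scored relative to the other answers in its own group rather than against a learned value function. The sketch below shows that normalization together with a toy tolerance-based reward. The reward function `distance_reward` and its 10% relative tolerance are hypothetical stand-ins for the "distance tolerance" strategy named in the abstract, whose actual definition is not given here.

```python
import statistics


def distance_reward(predicted, target, tolerance=0.1):
    """Hypothetical reward: 1 if the relative error is within the tolerance, else 0."""
    if target == 0:
        return 1.0 if abs(predicted) <= tolerance else 0.0
    return 1.0 if abs(predicted - target) / abs(target) <= tolerance else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Answers that beat their group average receive positive advantages and the rest negative ones, so a group in which every answer earns the same reward contributes no learning signal.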
cs.CV / 84 / 2602.22013
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
RobustVisRAG:在视觉退化下的因果意识视觉基础检索增强生成
Abstract
Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.
Chinese Translation
基于视觉的检索增强生成(VisRAG)利用视觉-语言模型(VLMs)联合检索相关的视觉文档,并基于多模态证据生成有根据的答案。然而,现有的VisRAG模型在视觉输入遭受模糊、噪声、低光或阴影等失真时,性能会下降,其中语义和退化因素在预训练的视觉编码器中交织在一起,导致检索和生成阶段的错误。为了解决这一限制,我们提出了RobustVisRAG,一种因果引导的双路径框架,旨在提高VisRAG的鲁棒性,同时保持效率和零样本泛化能力。RobustVisRAG使用非因果路径通过单向注意力捕捉退化信号,并使用因果路径学习由这些信号引导的纯化语义。结合提出的非因果失真建模和因果语义对齐目标,该框架强制在语义和退化之间进行明确的分离,使得在具有挑战性的视觉条件下实现稳定的检索和生成。为了在现实条件下评估鲁棒性,我们引入了Distortion-VisRAG数据集,这是一个大规模基准,包含七个领域的合成和真实世界退化文档,具有12种合成和5种真实失真类型,全面反映实际视觉退化。实验结果表明,RobustVisRAG在真实世界退化下的检索、生成和端到端性能分别提高了7.35%、6.35%和12.40%,同时在干净输入上保持了可比的准确性。
cs.CV / 85 / 2602.22025
Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments
Olbedo:用于大规模户外环境的反照率和阴影航空数据集
Abstract
Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and understanding large-scale environments, but progress has been limited by the lack of real-world datasets with reliable albedo and shading supervision. We introduce Olbedo, a large-scale aerial dataset for outdoor albedo-shading decomposition in the wild. Olbedo contains 5,664 UAV images captured across four landscape types, multiple years, and diverse illumination conditions. Each view is accompanied by multi-view consistent albedo and shading maps, metric depth, surface normals, sun and sky shading components, camera poses, and, for recent flights, measured HDR sky domes. These annotations are derived from an inverse-rendering refinement pipeline over multi-view stereo reconstructions and calibrated sky illumination, together with per-pixel confidence masks. We demonstrate that Olbedo enables state-of-the-art diffusion-based IID models, originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on Olbedo significantly improves single-view outdoor albedo prediction on the MatrixCity benchmark. We further illustrate applications of Olbedo-trained models to multi-view consistent relighting of 3D assets, material editing, and scene change analysis for urban digital twins. We release the dataset, baseline models, and an evaluation protocol to support future research in outdoor intrinsic decomposition and illumination-aware aerial vision.
Chinese Translation
户外场景的内在图像分解(IID)对于重新照明、编辑和理解大规模环境至关重要,但由于缺乏具有可靠反照率和阴影监督的真实世界数据集,进展受到限制。我们介绍了Olbedo,这是一个用于户外反照率-阴影分解的大规模航空数据集。Olbedo包含5664张无人机拍摄的图像,覆盖四种地形类型、多个年份和多样的光照条件。每个视图都配有多视图一致的反照率和阴影图、度量深度、表面法线、太阳和天空阴影成分、相机姿态,以及对于最近的飞行,测量的HDR天空穹顶。这些注释是通过对多视图立体重建和校准天空光照进行逆渲染精细化处理而得出的,并附有每像素的置信度掩码。我们展示了Olbedo使得最先进的基于扩散的IID模型(最初在合成室内数据上训练)能够推广到真实的户外图像:在Olbedo上进行微调显著提高了在MatrixCity基准上的单视图户外反照率预测。我们进一步展示了使用Olbedo训练的模型在3D资产的多视图一致重新照明、材料编辑和城市数字双胞胎的场景变化分析中的应用。我们发布了该数据集、基线模型和评估协议,以支持未来在户外内在分解和光照感知航空视觉方面的研究。
cs.CV / 86 / 2602.22026
RGB-Event HyperGraph Prompt for Kilometer Marker Recognition based on Pre-trained Foundation Models
基于预训练基础模型的公里标识识别的RGB-事件超图提示
Abstract
Metro trains often operate in highly complex environments, characterized by illumination variations, high-speed motion, and adverse weather conditions. These factors pose significant challenges for visual perception systems, especially those relying solely on conventional RGB cameras. To tackle these difficulties, we explore the integration of event cameras into the perception system, leveraging their advantages in low-light conditions, high-speed scenarios, and low power consumption. Specifically, we focus on Kilometer Marker Recognition (KMR), a critical task for autonomous metro localization under GNSS-denied conditions. In this context, we propose a robust baseline method based on a pre-trained RGB OCR foundation model, enhanced through multi-modal adaptation. Furthermore, we construct the first large-scale RGB-Event dataset, EvMetro5K, containing 5,599 pairs of synchronized RGB-Event samples, split into 4,479 training and 1,120 testing samples. Extensive experiments on EvMetro5K and other widely used benchmarks demonstrate the effectiveness of our approach for KMR. Both the dataset and source code will be released at https://github.com/Event-AHU/EvMetro5K_benchmark.
Chinese Translation
地铁列车通常在高度复杂的环境中运行,这些环境的特点是光照变化、高速运动和恶劣天气条件。这些因素对视觉感知系统提出了重大挑战,尤其是那些仅依赖传统RGB摄像头的系统。为了解决这些困难,我们探索将事件摄像头集成到感知系统中,利用其在低光照条件、高速场景和低功耗方面的优势。具体而言,我们关注公里标识识别(KMR),这是在GNSS失效条件下进行自主地铁定位的关键任务。在此背景下,我们提出了一种基于预训练RGB OCR基础模型的稳健基线方法,并通过多模态适应进行增强。此外,我们构建了第一个大规模RGB-事件数据集EvMetro5K,包含5,599对同步的RGB-事件样本,分为4,479个训练样本和1,120个测试样本。在EvMetro5K及其他广泛使用的基准上进行的广泛实验表明了我们方法在KMR任务中的有效性。数据集和源代码将发布在https://github.com/Event-AHU/EvMetro5K_benchmark
cs.CV / 87 / 2602.22033
RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking
RT-RMOT:一种用于RGB-热成像的参考多目标跟踪的数据集与框架
Abstract
Referring Multi-Object Tracking has attracted increasing attention due to its human-friendly interactive characteristics, yet it exhibits limitations in low-visibility conditions, such as nighttime, smoke, and other challenging scenarios. To overcome this limitation, we propose a new RGB-Thermal RMOT task, named RT-RMOT, which aims to fuse RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking. To promote research on RT-RMOT, we construct the first Referring Multi-Object Tracking dataset under RGB-Thermal modality, named RefRT. It contains 388 language descriptions, 1,250 tracked targets, and 166,147 Language-RGB-Thermal (L-RGB-T) triplets. Furthermore, we propose RTrack, a framework built upon a multimodal large language model (MLLM) that integrates RGB, thermal, and textual features. Since the initial framework still leaves room for improvement, we introduce a Group Sequence Policy Optimization (GSPO) strategy to further exploit the model's potential. To alleviate training instability during RL fine-tuning, we introduce a Clipped Advantage Scaling (CAS) strategy to suppress gradient explosion. In addition, we design Structured Output Reward and Comprehensive Detection Reward to balance exploration and exploitation, thereby improving the completeness and accuracy of target perception. Extensive experiments on the RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.
Chinese Translation
参考多目标跟踪因其人性化的交互特性而受到越来越多的关注,但在低能见度条件下(如夜间、烟雾等挑战性场景)存在局限性。为了解决这一问题,我们提出了一项新的RGB-热成像RMOT任务,命名为RT-RMOT,旨在融合RGB外观特征与热成像模态的光照鲁棒性,以实现全天候的参考多目标跟踪。为了促进RT-RMOT的研究,我们构建了第一个RGB-热成像模态下的参考多目标跟踪数据集,命名为RefRT。该数据集包含388个语言描述、1,250个跟踪目标和166,147个语言-RGB-热成像(L-RGB-T)三元组。此外,我们提出了RTrack,一个基于多模态大型语言模型(MLLM)的框架,集成了RGB、热成像和文本特征。由于初始框架仍有改进空间,我们引入了一种群体序列策略优化(GSPO)策略,以进一步挖掘模型的潜力。为缓解强化学习微调过程中的训练不稳定性,我们引入了一种剪切优势缩放(CAS)策略,以抑制梯度爆炸。此外,我们设计了结构化输出奖励和综合检测奖励,以平衡探索与利用,从而提高目标感知的完整性和准确性。在RefRT数据集上进行的广泛实验验证了所提出的RTrack框架的有效性。
cs.CV / 88 / 2602.22049
SPGen: Stochastic scanpath generation for paintings using unsupervised domain adaptation
SPGen:基于无监督领域适应的绘画随机注视路径生成
Abstract
Understanding human visual attention is key to preserving cultural heritage. We introduce SPGen, a novel deep learning model to predict scanpaths, the sequences of eye movements made when viewers observe paintings. Our architecture uses a Fully Convolutional Neural Network (FCNN) with differentiable fixation selection and learnable Gaussian priors to simulate natural viewing biases. To address the domain gap between photographs and artworks, we employ unsupervised domain adaptation via a gradient reversal layer, allowing the model to transfer knowledge from natural scenes to paintings. Furthermore, a random noise sampler models the inherent stochasticity of eye-tracking data. Extensive testing shows that SPGen outperforms existing methods, offering a powerful tool to analyze gaze behavior and advance the preservation and appreciation of artistic treasures.
Chinese Translation
理解人类视觉注意力对于保护文化遗产至关重要。我们介绍了SPGen,一个新颖的深度学习模型,用于预测注视路径,即观众观察绘画时眼动的顺序。我们的架构使用全卷积神经网络(Fully Convolutional Neural Network, FCNN),结合可微分的注视选择和可学习的高斯先验,以模拟自然观看偏好。为了解决照片与艺术作品之间的领域差距,我们通过梯度反转层采用无监督领域适应,使模型能够将自然场景的知识转移到绘画上。此外,随机噪声采样器模拟了眼动数据固有的随机性。广泛的测试表明,SPGen的表现优于现有方法,为分析注视行为提供了强大的工具,并推动了艺术珍品的保护与欣赏。
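The gradient reversal layer used for domain adaptation is simple to state: it is the identity in the forward pass and flips the sign of the gradient in the backward pass. The sketch below spells this out with explicit `forward` and `backward` methods instead of an autograd framework. The class name, the fixed strength `lam`, and the list-based interface are illustrative assumptions and do not reflect SPGen's actual code.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # assumption: a fixed reversal strength

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_from_domain_classifier):
        # The feature extractor receives the negated gradient, which pushes it
        # toward features the domain classifier cannot tell apart.
        return [-self.lam * g for g in grad_from_domain_classifier]
```

In a real training loop the same effect is usually obtained with a custom autograd function, so that the domain classifier is trained normally while the shared encoder is trained adversarially against it.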
cs.CV / 89 / 2602.22052
AutoSew: A Geometric Approach to Stitching Prediction with Graph Neural Networks
AutoSew:一种基于几何的图神经网络缝合预测方法
Abstract
Automating garment assembly from sewing patterns remains a significant challenge due to the lack of standardized annotation protocols and the frequent absence of semantic cues. Existing methods often rely on panel labels or handcrafted heuristics, which limit their applicability to real-world, non-conforming patterns. We present AutoSew, a fully automatic, geometry-based approach for predicting stitch correspondences directly from 2D pattern contours. AutoSew formulates the problem as a graph matching task, leveraging a Graph Neural Network to capture local and global geometric context and employing a differentiable optimal transport solver to infer stitching relationships, including multi-edge connections. To support this task, we update the GarmentCodeData dataset, modifying over 18k patterns with realistic multi-edge annotations that reflect industrial assembly scenarios. AutoSew achieves a 96% F1-score and successfully assembles 73.3% of test garments without error, outperforming existing methods while relying solely on geometric input. Our results demonstrate that geometry alone can robustly guide stitching prediction, enabling scalable garment assembly without manual input.
Chinese Translation
从缝纫图案自动化服装组装仍然是一个重大挑战,主要由于缺乏标准化的注释协议以及语义线索的频繁缺失。现有方法通常依赖于面板标签或手工设计的启发式规则,这限制了它们在现实世界中不符合标准的图案上的适用性。我们提出了AutoSew,这是一种完全自动化的、基于几何的方法,能够直接从二维图案轮廓预测缝合对应关系。AutoSew将该问题表述为图匹配任务,利用图神经网络捕捉局部和全局几何上下文,并采用可微分的最优传输求解器推断缝合关系,包括多边连接。为了支持这一任务,我们更新了GarmentCodeData数据集,修改了超过18,000个图案,添加了反映工业组装场景的真实多边注释。AutoSew实现了96%的F1分数,并成功无误地组装了73.3%的测试服装,超越了现有方法,同时仅依赖于几何输入。我们的结果表明,仅凭几何信息就能有效指导缝合预测,从而实现可扩展的服装组装,无需人工输入。
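The abstract does not say which differentiable optimal transport solver is used. A common choice is Sinkhorn iteration on an entropy-regularized problem, so the sketch below uses that as an assumption. It takes a matrix of pairwise matching costs between pattern edges and returns a soft assignment whose rows and columns each carry equal mass. The uniform marginals, the regularization strength `epsilon`, and the function name `sinkhorn` are illustrative choices.

```python
import math


def sinkhorn(cost, epsilon=0.1, num_iters=200):
    """Entropy-regularized optimal transport plan between two uniform distributions."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m  # assumption: uniform marginals
    u, v = [1.0] * n, [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Every step is a smooth operation, so gradients can flow from the resulting plan back into whatever network produced the costs. Taking the row-wise argmax of the plan yields a hard one-to-one matching, while keeping several high-mass entries per row is one possible way to represent multi-edge connections.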
cs.CV / 90 / 2602.22059
NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
NESTOR:一种基于嵌套混合专家(MoE)的神经算子用于大规模偏微分方程(PDE)预训练
Abstract
Neural operators have emerged as an efficient paradigm for solving PDEs, overcoming the limitations of traditional numerical methods and significantly improving computational efficiency. However, due to the diversity and complexity of PDE systems, existing neural operators typically rely on a single network architecture, which limits their capacity to fully capture heterogeneous features and complex system dependencies. This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. In particular, the image-level MoE is designed to capture global dependencies, while the token-level Sub-MoE focuses on local dependencies. Our model can selectively activate the most suitable expert networks for a given input, thereby enhancing generalization and transferability. We conduct large-scale pre-training on twelve PDE datasets from diverse sources and successfully transfer the model to downstream tasks. Extensive experiments demonstrate the effectiveness of our approach.
Chinese Translation
神经算子已成为解决偏微分方程(PDE)的有效范式,克服了传统数值方法的局限性,并显著提高了计算效率。然而,由于PDE系统的多样性和复杂性,现有的神经算子通常依赖于单一的网络架构,这限制了它们充分捕捉异构特征和复杂系统依赖关系的能力。这一限制对基于神经算子的PDE大规模预训练构成了瓶颈。为了解决这些挑战,我们提出了一种基于嵌套混合专家(MoE)框架的大规模PDE预训练神经算子。具体而言,图像级的MoE旨在捕捉全局依赖关系,而令牌级的子MoE则专注于局部依赖关系。我们的模型可以根据给定输入选择性地激活最合适的专家网络,从而增强泛化能力和迁移能力。我们在来自不同来源的十二个PDE数据集上进行了大规模预训练,并成功将模型迁移到下游任务。大量实验表明了我们方法的有效性。
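The selective activation of experts described above is typically implemented with top-k gating: a router scores all experts, only the k best are evaluated, and their outputs are mixed with renormalized weights. The following sketch shows that mechanism for scalar inputs. It assumes the gate logits are supplied by the caller (in a real model they would come from a learned router), and the names `softmax` and `top_k_moe` are hypothetical rather than taken from NESTOR.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_moe(x, experts, gate_logits, k=2):
    """Route x to the k highest-scoring experts and mix their outputs."""
    probs = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    output = sum(probs[i] / norm * experts[i](x) for i in top)
    return output, top
```

Nesting follows naturally from this interface: an expert passed to an outer `top_k_moe` can itself be a function that calls an inner `top_k_moe`, which loosely mirrors the image-level MoE wrapping token-level Sub-MoEs.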
cs.CV / 91 / 2602.22073
AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting
AdaSpot:在关键位置进行支出解析以实现精确事件检测
Abstract
Precise Event Spotting aims to localize fast-paced actions or events in videos with high temporal precision, a key task for applications in sports analytics, robotics, and autonomous systems. Existing methods typically process all frames uniformly, overlooking the inherent spatio-temporal redundancy in video data. This leads to redundant computation on non-informative regions while limiting overall efficiency. To remain tractable, they often spatially downsample inputs, losing fine-grained details crucial for precise localization. To address these limitations, we propose AdaSpot, a simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing. The selection is performed via an unsupervised, task-aware strategy that maintains spatio-temporal consistency across frames and avoids the training instability of learnable alternatives. This design preserves essential fine-grained visual cues with a marginal computational overhead compared to low-resolution-only baselines, while remaining far more efficient than uniform high-resolution processing. Experiments on standard PES benchmarks demonstrate that AdaSpot achieves state-of-the-art performance under strict evaluation metrics (e.g., +3.96 and +2.26 mAP@0 frames on Tennis and FineDiving), while also maintaining strong results under looser metrics. Code is available at https://github.com/arturxe2/AdaSpot.
Chinese Translation
精确事件检测旨在以高时间精度定位视频中的快速动作或事件,这是体育分析、机器人技术和自主系统等应用中的关键任务。现有方法通常对所有帧进行统一处理,忽视了视频数据中固有的时空冗余。这导致在非信息区域上进行冗余计算,同时限制了整体效率。为了保持可处理性,它们通常会对输入进行空间下采样,从而丢失对精确定位至关重要的细粒度细节。为了解决这些局限性,我们提出了AdaSpot,一个简单而有效的框架,处理低分辨率视频以提取全局任务相关特征,同时自适应地选择每帧中最具信息性的感兴趣区域进行高分辨率处理。选择过程通过一种无监督的、任务感知的策略进行,该策略保持帧间的时空一致性,并避免可学习替代方案的训练不稳定性。与仅使用低分辨率的基线相比,这种设计仅以很小的额外计算开销保留了必要的细粒度视觉线索,同时效率远高于统一的高分辨率处理。标准PES基准测试的实验表明,AdaSpot在严格评估指标下实现了最先进的性能(例如,网球和精细潜水在mAP@0帧上分别提高了+3.96和+2.26),同时在较宽松的指标下也保持了强劲的结果。代码可在 https://github.com/arturxe2/AdaSpot 获取。
cs.CV / 92 / 2602.22091
Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
学习驾驶是一份免费的礼物:来自未摆姿势的野外视频的大规模无标签自主预训练
Abstract
Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling our model, LFG, to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.
Chinese Translation
在线可获得的自我中心驾驶视频为自主驾驶提供了丰富的视觉数据来源,但由于缺乏注释,难以学习同时捕捉语义结构和三维几何的表征。最近在大型前馈空间模型方面的进展表明,点图和自我运动可以在单次前向传播中推断出来,这为可扩展的驾驶感知提供了一个有前景的方向。因此,我们提出了一种无标签的、教师引导的框架,直接从未摆姿势的视频中学习自主驾驶表征。与以往主要关注帧间一致性的自监督方法不同,我们认为安全和反应灵敏的驾驶在很大程度上依赖于时间上下文。为此,我们利用一种配备轻量级自回归模块的前馈架构,使用多模态监督信号进行训练,指导模型共同预测当前和未来的点图、相机姿态、语义分割和运动掩码。多模态教师提供序列级伪监督,使得LFG能够从原始YouTube视频中学习统一的伪4D表征,而无需姿态、标签或激光雷达。所得到的编码器不仅在NAVSIM基准上有效转移到下游自主驾驶规划,超越了仅使用单个单目相机的多摄像头和激光雷达基线,而且在一系列语义、几何和定性运动预测任务中也表现出色。这些几何和运动感知特征使LFG成为自主驾驶的一个引人注目的视频中心基础模型。
cs.CV / 93 / 2602.22092
Overview of the CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification
CXR-LT 2026 挑战概述:多中心长尾与零样本胸部 X 光分类
Abstract
Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from single institutions, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT 2026 challenge. This third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. The challenge defines two core tasks: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. We report the results of the top-performing teams, evaluating them via mean Average Precision (mAP), AUROC, and F1-score. The winning solutions achieved an mAP of 0.5854 on Task 1 and 0.4315 on Task 2, demonstrating that large-scale vision-language pre-training significantly mitigates the performance drop typically associated with zero-shot diagnosis.
Chinese Translation
胸部 X 光(CXR)解读受到病理分布长尾特性和临床环境开放世界性质的影响。现有基准通常依赖于单一机构的封闭集类别,未能捕捉到罕见疾病的流行程度或新发现的出现。为了解决这一问题,我们提出了 CXR-LT 2026 挑战。该基准的第三次迭代引入了一个多中心数据集,包含来自 PadChest 和 NIH 胸部 X 光数据集的超过 145,000 张图像。挑战定义了两个核心任务:(1)在 30 个已知类别上进行稳健的多标签分类,以及(2)对 6 个未见(分布外)罕见疾病类别进行开放世界泛化。我们报告了表现最佳团队的结果,通过平均精度均值(mAP)、受试者工作特征曲线下面积(AUROC)和 F1 分数进行评估。获胜解决方案在任务 1 上实现了 0.5854 的 mAP,在任务 2 上实现了 0.4315,表明大规模视觉-语言预训练显著减轻了与零样本诊断通常相关的性能下降。
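The mAP metric used to rank the challenge entries can be illustrated with a minimal Python sketch. The function names `average_precision` and `mean_average_precision` and the convention of returning 0.0 for a class without positives are assumptions made for illustration; the official challenge scoring code may handle ties and empty classes differently.

```python
def average_precision(labels, scores):
    """AP for one class: mean of the precision at the rank of each positive."""
    # Rank samples from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a class with no positive samples contributes 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(labels_per_class, scores_per_class):
    """mAP: unweighted mean of the per-class AP values."""
    aps = [
        average_precision(labels, scores)
        for labels, scores in zip(labels_per_class, scores_per_class)
    ]
    return sum(aps) / len(aps)
```

Because every class carries equal weight in the mean, rare classes influence the score as much as common ones, which is one reason mAP is a natural fit for long-tailed multi-label settings.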
cs.CV / 94 / 2602.22096
WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation
WeatherCity:可控多天气转换的城市场景重建
Abstract
Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation. Meanwhile, image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over the weather effects. To address these limitations, we propose WeatherCity, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with a content consistency optimization, ensuring coherent modeling across different weather conditions. Additionally, we design a physics-driven model that simulates dynamic weather effects through particles and motion patterns. Extensive experiments on multiple datasets and various scenes demonstrate that WeatherCity achieves flexible controllability, high fidelity, and temporal consistency in 4D reconstruction and weather editing. Our framework not only enables fine-grained control over weather conditions (e.g., light rain and heavy snow) but also supports object-level manipulation within the scene.
Chinese Translation
可编辑的高保真4D场景对于自动驾驶至关重要,因为它们可以应用于端到端训练和闭环仿真。然而,现有的重建方法主要限于复制观察到的场景,缺乏多样化天气模拟的能力。同时,图像级天气编辑方法往往会引入场景伪影,并对天气效果的可控性较差。为了解决这些局限性,我们提出了WeatherCity,一个用于4D城市场景重建和天气编辑的新框架。具体而言,我们利用文本引导的图像编辑模型,实现图像天气背景的灵活编辑。为了解决多天气建模的挑战,我们引入了一种基于共享场景特征和专用天气解码器的新型天气高斯表示。这种表示通过内容一致性优化进一步增强,确保在不同天气条件下的建模一致性。此外,我们设计了一个基于物理驱动的模型,通过粒子和运动模式模拟动态天气效果。在多个数据集和各种场景上的大量实验表明,WeatherCity在4D重建和天气编辑中实现了灵活的可控性、高保真度和时间一致性。我们的框架不仅能够对天气条件(例如小雨和大雪)进行精细控制,还支持场景内的对象级操作。
cs.CV / 95 / 2602.22098
Brain3D: Brain Report Automation via Inflated Vision Transformers in 3D
Brain3D:基于膨胀视觉变换器的三维脑部报告自动化
Abstract
Current medical vision-language models (VLMs) process volumetric brain MRI using 2D slice-based approximations, fragmenting the spatial context required for accurate neuroradiological interpretation. We developed \textbf{Brain3D}, a staged vision-language framework for automated radiology report generation from 3D brain tumor MRI. Our approach inflates a pretrained 2D medical encoder into a native 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Unlike generalist 3D medical VLMs, \textbf{Brain3D} is tailored to neuroradiology, where hemispheric laterality, tumor infiltration patterns, and anatomical localization are critical. Evaluated on 468 subjects (BraTS pathological cases plus healthy controls), our model achieves a Clinical Pathology F1 of 0.951 versus 0.413 for a strong 2D baseline while maintaining perfect specificity on healthy scans. The staged alignment proves essential: contrastive grounding establishes visual-textual correspondence, projector warmup stabilizes conditioning, and LoRA adaptation shifts output from verbose captions to structured clinical reports. Our code is publicly available for transparency and reproducibility.
Chinese Translation
当前的医学视觉语言模型(VLMs)使用基于二维切片的近似方法处理体积脑部MRI,导致空间上下文的碎片化,这对于准确的神经放射学解读至关重要。我们开发了\textbf{Brain3D},这是一个分阶段的视觉语言框架,用于从三维脑肿瘤MRI自动生成放射学报告。我们的方法将预训练的二维医学编码器膨胀为原生三维架构,并通过三个阶段逐步与因果语言模型对齐:对比基础、监督投影器预热和基于LoRA的语言专业化。与通用的三维医学VLMs不同,\textbf{Brain3D}专门针对神经放射学,其中半球侧重性、肿瘤浸润模式和解剖定位至关重要。在468个受试者(BraTS病理案例加健康对照)上进行评估,我们的模型实现了0.951的临床病理F1分数,而强大的二维基线为0.413,同时在健康扫描中保持了完美的特异性。分阶段的对齐被证明是至关重要的:对比基础建立了视觉-文本对应关系,投影器预热稳定了条件,而LoRA适应则将输出从冗长的标题转变为结构化的临床报告。
cs.CV / 96 / 2602.22120
GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models
GeoDiv:测量文本到图像模型地理多样性的框架
Abstract
Text-to-image (T2I) models are rapidly gaining popularity, yet their outputs often lack geographical diversity, reinforce stereotypes, and misrepresent regions. Given their broad reach, it is critical to rigorously evaluate how these models portray the world. Existing diversity metrics either rely on curated datasets or focus on surface-level visual similarity, limiting interpretability. We introduce GeoDiv, a framework leveraging large language and vision-language models to assess geographical diversity along two complementary axes: the Socio-Economic Visual Index (SEVI), capturing economic and condition-related cues, and the Visual Diversity Index (VDI), measuring variation in primary entities and backgrounds. Applied to images generated by models such as Stable Diffusion and FLUX.1-dev across $10$ entities and $16$ countries, GeoDiv reveals a consistent lack of diversity and identifies fine-grained attributes where models default to biased portrayals. Strikingly, depictions of countries like India, Nigeria, and Colombia are disproportionately impoverished and worn, reflecting underlying socio-economic biases. These results highlight the need for greater geographical nuance in generative models. GeoDiv provides the first systematic, interpretable framework for measuring such biases, marking a step toward fairer and more inclusive generative systems. Project page: https://abhipsabasu.github.io/geodiv
Chinese Translation
文本到图像(T2I)模型正在迅速普及,但其输出往往缺乏地理多样性,强化刻板印象,并且对地区的表现存在误导。鉴于其广泛的影响力,严谨评估这些模型如何描绘世界至关重要。现有的多样性指标要么依赖于经过筛选的数据集,要么关注表面视觉相似性,限制了可解释性。我们提出了GeoDiv,一个利用大型语言模型和视觉-语言模型来评估地理多样性的框架,沿着两个互补的轴线进行评估:社会经济视觉指数(Socio-Economic Visual Index, SEVI),捕捉经济和条件相关的线索,以及视觉多样性指数(Visual Diversity Index, VDI),测量主要实体和背景的变化。GeoDiv应用于由如Stable Diffusion和FLUX.1-dev等模型生成的图像,涵盖$10$个实体和$16$个国家,揭示了一种持续的多样性缺失,并识别出模型在偏见表现上默认的细微属性。值得注意的是,印度、尼日利亚和哥伦比亚等国的描绘往往显得贫困和破旧,反映了潜在的社会经济偏见。这些结果突显了生成模型中对地理细微差别的更大需求。GeoDiv提供了第一个系统的、可解释的框架来测量此类偏见,标志着朝着更公平和更具包容性的生成系统迈出了一步。项目页面:https://abhipsabasu.github.io/geodiv
cs.CV / 97 / 2602.22142
WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs
WeaveTime:在视频大语言模型中将早期帧流入新兴记忆
Abstract
Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness, in which it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective, our Streaming Order Perception enhancement, that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into existing Video-LLMs without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available. Project Page: https://zhangyl4.github.io/publications/weavetime/
Chinese Translation
最近在多模态大型语言模型方面的进展大大提高了视觉理解和推理能力,但其二次注意力机制和离线训练协议使其不适合于帧按顺序到达且未来观察不可获取的流式环境。我们诊断了当前视频大语言模型的一个核心限制,即时间无关性(Time-Agnosticism),在这种情况下,视频被视为无序的证据集合而不是因果有序的序列,这导致了流式处理中的两个失败:时间顺序模糊性,即模型无法遵循或推理正确的时间顺序,以及过去-当前焦点盲点,即模型无法区分当前观察与累积历史。我们提出了WeaveTime,这是一个简单、高效且与模型无关的框架,首先教授顺序,然后利用顺序。我们引入了一种轻量级的时间重建目标——我们的流式顺序感知增强(Streaming Order Perception enhancement),它在最小微调和无需专门流式数据的情况下,赋予了顺序感知的表示。在推理时,一个过去-当前动态焦点缓存(Past-Current Dynamic Focus Cache)执行不确定性触发的粗到细检索,仅在需要时扩展历史。WeaveTime可以无架构更改地集成到现有的视频大语言模型中,在代表性的流式基准测试中提供了一致的提升,提高了准确性,同时减少了延迟。这些结果确立了WeaveTime作为在严格在线、时间因果约束下实现时间感知流式视频大语言模型的实用路径。代码和权重将公开提供。项目页面:https://zhangyl4.github.io/publications/weavetime/
cs.CV / 98 / 2602.22143
MedTri: A Platform for Structured Medical Report Normalization to Enhance Vision-Language Pretraining
MedTri:一个用于结构化医学报告标准化的平台,以增强视觉-语言预训练
Abstract
Medical vision-language pretraining increasingly relies on medical reports as large-scale supervisory signals; however, raw reports often exhibit substantial stylistic heterogeneity, variable length, and a considerable amount of image-irrelevant content. Although text normalization is frequently adopted as a preprocessing step in prior work, its design principles and empirical impact on vision-language pretraining have not yet been examined systematically. In this study, we present MedTri, a deployable normalization framework for medical vision-language pretraining that converts free-text reports into a unified [Anatomical Entity: Radiologic Description + Diagnosis Category] triplet. This structured, anatomy-grounded normalization preserves essential morphological and spatial information while removing stylistic noise and image-irrelevant content, providing consistent and image-grounded textual supervision at scale. Across multiple datasets spanning both X-ray and computed tomography (CT) modalities, we demonstrate that structured, anatomy-grounded text normalization is an important factor in medical vision-language pretraining quality, yielding consistent improvements over raw reports and existing normalization baselines. In addition, we illustrate how this normalization can easily support modular text-level augmentation strategies, including knowledge enrichment and anatomy-grounded counterfactual supervision, which provide complementary gains in robustness and generalization without altering the core normalization process. Together, our results position structured text normalization as a critical and generalizable preprocessing component for medical vision-language learning, while MedTri provides this normalization platform. Code and data will be released at https://github.com/Arturia-Pendragon-Iris/MedTri.
Chinese Translation
医学视觉-语言预训练日益依赖医学报告作为大规模监督信号;然而,原始报告往往表现出显著的风格异质性、可变长度以及大量与图像无关的内容。尽管文本标准化在以往的研究中常被作为预处理步骤,但其设计原则及对视觉-语言预训练的实证影响仍然缺乏系统的研究。在本研究中,我们提出了MedTri,一个可部署的医学视觉-语言预训练标准化框架,旨在将自由文本报告转换为统一的[解剖实体:放射学描述 + 诊断类别]三元组。这种结构化、基于解剖学的标准化保留了重要的形态学和空间信息,同时去除了风格噪声和与图像无关的内容,从而在大规模上提供一致且基于图像的文本监督。在涵盖X光和计算机断层扫描(CT)两种模式的多个数据集上,我们证明了结构化、基于解剖学的文本标准化是医学视觉-语言预训练质量的重要因素,相较于原始报告和现有标准化基线,取得了一致的改善。此外,我们展示了这种标准化如何轻松支持模块化的文本级增强策略,包括知识丰富化和基于解剖学的反事实监督,这些策略在不改变核心标准化过程的情况下,提供了在稳健性和泛化能力上的互补提升。综合来看,我们的结果将结构化文本标准化定位为医学视觉-语言学习中的一个关键且可推广的预处理组件,而MedTri则提供了这一标准化平台。代码和数据将发布在 https://github.com/Arturia-Pendragon-Iris/MedTri。
cs.CV / 99 / 2602.22144
NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
NoLan:通过动态抑制语言先验来减轻大型视觉-语言模型中的对象幻觉
Abstract
Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: which component of the LVLM pipeline primarily contributes to object hallucinations, the vision encoder that perceives visual information or the language decoder that generates text responses? In this work, we address this question by designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training-free framework, No-Language-Hallucination Decoding (NoLan), which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: https://github.com/lingfengren/NoLan.
Chinese Translation
对象幻觉是大型视觉-语言模型(LVLMs)中的一个关键问题,其输出包括输入图像中不存在的对象。由此现象引发了一个自然的问题:LVLM管道的哪个组件主要导致了对象幻觉?是用于感知视觉信息的视觉编码器,还是用于生成文本响应的语言解码器?在本研究中,我们努力通过设计一个系统实验来分析视觉编码器和语言解码器在幻觉生成中的作用,从而回答这个问题。我们的观察表明,对象幻觉主要与语言解码器的强先验相关。基于这一发现,我们提出了一个简单且无需训练的框架——无语言幻觉解码(No-Language-Hallucination Decoding),简称NoLan,该框架通过动态抑制语言先验来优化输出分布,抑制程度根据多模态输入与仅文本输入之间的输出分布差异进行调节。实验结果表明,NoLan在不同任务的多种LVLM上有效减少了对象幻觉。例如,NoLan在POPE上取得了显著的改进,分别将LLaVA-1.5 7B和Qwen-VL 7B的准确率提升了高达6.45和7.21。代码已公开发布在:https://github.com/lingfengren/NoLan。
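The decoding idea in this abstract, adjusting the multimodal output distribution using its difference from the text-only one, can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the function `suppress_language_prior`, the parameter `base_alpha`, and the rule that scales the suppression strength by the KL divergence between the two distributions are all assumptions, since the abstract does not give the exact modulation formula.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(log_p, log_q):
    """KL(p || q) computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def suppress_language_prior(mm_logits, text_logits, base_alpha=1.0):
    """Contrast multimodal logits against text-only logits.

    Hypothetical modulation rule: the suppression weight shrinks as the two
    distributions diverge, on the assumption that a small divergence means
    the language prior dominates the prediction.
    """
    log_mm = log_softmax(mm_logits)
    log_text = log_softmax(text_logits)
    alpha = base_alpha / (1.0 + kl_divergence(log_mm, log_text))
    adjusted = [lm + alpha * (lm - lt) for lm, lt in zip(log_mm, log_text)]
    return [math.exp(x) for x in log_softmax(adjusted)]
```

Tokens that the text-only pass already favors are pushed down, while tokens supported mainly by the image are pushed up; when both passes agree exactly, the multimodal distribution is returned unchanged.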
cs.CV / 100 / 2602.22150
CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation
CoLoGen:概念-定位二元性渐进学习的统一图像生成
Abstract
Unified conditional image generation remains difficult because different tasks depend on fundamentally different internal representations. Some require conceptual understanding for semantic synthesis, while others rely on localization cues for spatial precision. Forcing these heterogeneous tasks to share a single representation leads to concept-localization representational conflict. To address this issue, we propose CoLoGen, a unified diffusion framework that progressively learns and reconciles this concept-localization duality. CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks. Central to this process is the Progressive Representation Weaving (PRW) module, which dynamically routes features to specialized experts and stably integrates their outputs across stages. Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance, offering a principled representational perspective for unified image generation.
Chinese Translation
统一的条件图像生成仍然面临困难,因为不同任务依赖于根本不同的内部表示。有些任务需要概念理解以进行语义合成,而另一些则依赖于定位线索以实现空间精确性。强迫这些异构任务共享单一表示会导致概念-定位表示冲突。为了解决这个问题,我们提出了CoLoGen,一个统一的扩散框架,它渐进地学习和调和这种概念-定位二元性。CoLoGen采用分阶段的课程,首先建立核心的概念和定位能力,然后将其适应于多样的视觉条件,最后为复杂的指令驱动任务精炼它们的协同作用。这个过程的核心是渐进表示编织(Progressive Representation Weaving, PRW)模块,它动态地将特征路由到专业专家,并在各个阶段稳定地整合它们的输出。在编辑、可控生成和定制生成的实验中,CoLoGen表现出竞争力或优越的性能,为统一图像生成提供了一个原则性的表示视角。
cs.CV / 101 / 2602.22159
CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness
CASR:一种针对任意大规模超分辨率的稳健循环框架,具有分布对齐和自相似性意识
Abstract
Arbitrary-scale image super-resolution (ASISR) remains fundamentally limited by cross-scale distribution shift: once the inference scale leaves the training range, noise, blur, and artifacts accumulate sharply. We revisit this challenge from a cross-scale distribution transition perspective and propose CASR, a simple yet highly efficient cyclic SR framework that reformulates ultra-magnification as a sequence of in-distribution scale transitions. This design ensures stable inference at arbitrary scales while requiring only a single model. CASR tackles two major bottlenecks: distribution drift across iterations and patch-wise diffusion inconsistencies. The proposed SDAM module aligns structural distributions via superpixel aggregation, preventing error accumulation, while the SARM module restores high-frequency textures by enforcing autocorrelation and embedding LR self-similarity priors. Despite using only a single model, our approach significantly reduces distribution drift, preserves long-range texture consistency, and achieves superior generalization even at extreme magnification.
Chinese Translation
任意尺度超分辨率(ASISR)在根本上受到跨尺度分布偏移的限制:一旦推断尺度超出训练范围,噪声、模糊和伪影会急剧增加。我们从跨尺度分布转变的角度重新审视这一挑战,并提出CASR,一个简单但高效的循环超分辨率框架,将超放大重新表述为一系列在分布内的尺度转变。该设计确保在任意尺度下的稳定推断,同时仅需一个模型。CASR解决了两个主要瓶颈:迭代过程中的分布漂移和块级扩散不一致。所提出的SDAM模块通过超像素聚合对结构分布进行对齐,防止误差累积,而SARM模块通过强制自相关和嵌入低分辨率自相似先验来恢复高频纹理。尽管仅使用一个模型,我们的方法显著减少了分布漂移,保持了长距离纹理一致性,并在极端放大情况下实现了优越的泛化能力。
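The reformulation of ultra-magnification as a sequence of in-distribution scale transitions can be illustrated with a small scheduling sketch in Python. The function `cyclic_scale_schedule`, the parameter `max_step_scale`, and the choice to split the target factor into equal per-pass factors are assumptions for illustration; the abstract does not say how CASR picks the per-cycle scale.

```python
def cyclic_scale_schedule(target_scale, max_step_scale):
    """Split a large magnification into passes that stay in the trained range.

    Returns a list of per-pass scale factors whose product equals
    target_scale and where each factor is at most max_step_scale.
    """
    if target_scale < 1 or max_step_scale <= 1:
        raise ValueError("need target_scale >= 1 and max_step_scale > 1")
    if target_scale == 1:
        return []
    # Smallest number of passes such that max_step_scale ** n covers the target.
    n = 1
    while max_step_scale ** n < target_scale:
        n += 1
    # Assumption: use equal factors, i.e. the n-th root of the target scale.
    step = target_scale ** (1.0 / n)
    return [step] * n
```

For example, a 30x target with a model trained up to 4x would be handled in three passes of roughly 3.1x each, so every individual pass stays inside the training range and the same model is reused each time.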
cs.CV / 102 / 2602.22176
Mixed Magnification Aggregation for Generalizable Region-Level Representations in Computational Pathology
用于计算病理学中可泛化区域级表示的混合放大聚合
Abstract
In recent years, a standard computational pathology workflow has emerged where whole slide images are cropped into tiles, these tiles are processed using a foundation model, and task-specific models are built using the resulting representations. At least 15 different foundation models have been proposed, and the vast majority are trained exclusively with tiles using the 20$\times$ magnification. However, it is well known that certain histologic features can only be discerned with larger context windows, requiring a pathologist to zoom in and out when analyzing a whole slide image. Furthermore, creating 224$\times$224 pixel crops at 20$\times$ leads to a large number of tiles per slide, which can be gigapixel in size. To more accurately capture multi-resolution features and investigate the possibility of reducing the number of representations per slide, we propose a region-level mixing encoder. Our approach jointly fuses image tile representations of a mixed magnification foundation model using a masked embedding modeling pretraining step. We explore a design space for pretraining the proposed mixed-magnification region aggregators and evaluate our models on transfer to biomarker prediction tasks representing various cancer types. Results demonstrate cancer-dependent improvements in predictive performance, highlighting the importance of spatial context and understanding.
Chinese Translation
近年来,标准的计算病理学工作流程逐渐形成,其中整个切片图像被裁剪为小块,这些小块使用基础模型进行处理,并基于生成的表示构建特定任务模型。已经提出了至少15种不同的基础模型,绝大多数模型仅使用20$\times$放大率的小块进行训练。然而,众所周知,某些组织学特征只能在更大的上下文窗口中辨别,并且在分析整个切片图像时需要病理学家进行放大和缩小。此外,在20$\times$放大率下创建224$\times$224像素的裁剪会导致每个切片产生大量小块,可能达到千兆像素的大小。为了更准确地捕捉多分辨率特征并探讨减少每个切片表示数量的可能性,我们提出了一种区域级混合编码器。我们的方法通过一个掩码嵌入建模的预训练步骤,联合融合混合放大基础模型的小块图像表示。我们探索了预训练所提议的混合放大区域聚合器的设计空间,并在代表各种癌症类型的生物标志物预测任务上评估我们的模型。结果表明,预测性能在癌症依赖性上有所改善,突显了空间上下文和理解的重要性。
cs.CV / 103 / 2602.22197
Off-The-Shelf Image-to-Image Models Are All You Need To Defeat Image Protection Schemes
现成的图像到图像模型足以击败图像保护方案
Abstract
Advances in Generative AI (GenAI) have led to the development of various protection strategies to prevent the unauthorized use of images. These methods rely on adding imperceptible protective perturbations to images to thwart misuse such as style mimicry or deepfake manipulations. Although previous attacks on these protections required specialized, purpose-built methods, we demonstrate that this is no longer necessary. We show that off-the-shelf image-to-image GenAI models can be repurposed as generic "denoisers" using a simple text prompt, effectively removing a wide range of protective perturbations. Across 8 case studies spanning 6 diverse protection schemes, our general-purpose attack not only circumvents these defenses but also outperforms existing specialized attacks while preserving the image's utility for the adversary. Our findings reveal a critical and widespread vulnerability in the current landscape of image protection, indicating that many schemes provide a false sense of security. We stress the urgent need to develop robust defenses and establish that any future protection mechanism must be benchmarked against attacks from off-the-shelf GenAI models. Code is available in this repository: https://github.com/mlsecviswanath/img2imgdenoiser
Chinese Translation
生成性人工智能(GenAI)的进步促使了各种保护策略的发展,以防止图像的未经授权使用。这些方法依赖于向图像添加不可察觉的保护扰动,以阻止诸如风格模仿或深度伪造操作等滥用行为。尽管之前对这些保护措施的攻击需要专门的、定制的方法,但我们证明这已不再必要。我们展示了现成的图像到图像GenAI模型可以通过简单的文本提示重新用于通用的“去噪器”,有效去除各种保护扰动。在涵盖6种不同保护方案的8个案例研究中,我们的通用攻击不仅绕过了这些防御措施,还在保持图像对对手的效用的同时超越了现有的专门攻击。我们的发现揭示了当前图像保护领域中的一个关键和普遍的脆弱性,表明许多方案提供了一种虚假的安全感。我们强调迫切需要开发强健的防御措施,并确立任何未来的保护机制必须针对现成的GenAI模型的攻击进行基准测试。代码可在此存储库中获取:https://github.com/mlsecviswanath/img2imgdenoiser
cs.CV / 104 / 2602.22208
Solaris: Building a Multiplayer Video World Model in Minecraft
Solaris:在Minecraft中构建多人视频世界模型
Abstract
Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized capture of videos and actions. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.
Chinese Translation
现有的基于动作的视频生成模型(视频世界模型)仅限于单一代理的视角,无法捕捉真实环境中多代理的交互。我们提出了Solaris,一个模拟一致多视角观察的多人视频世界模型。为此,我们开发了一种多人数据系统,旨在对Minecraft等视频游戏进行稳健、连续和自动化的数据收集。与之前为单人设置构建的平台不同,我们的系统支持协调的多代理交互和同步的视频+动作捕捉。通过该系统,我们收集了1264万帧多人视频,并提出了一个用于评估多人移动、记忆、基础、建造和视角一致性的框架。我们使用一个分阶段的管道训练Solaris,该管道逐步从单人游戏过渡到多人建模,结合了双向、因果和自我强制训练。在最后阶段,我们引入了Checkpointed Self Forcing,这是一种内存高效的自我强制变体,能够支持更长时间的教师。结果表明,我们的架构和训练设计优于现有基准。通过开源我们的系统和模型,我们希望为新一代多代理世界模型奠定基础。
cs.CV / 105 / 2602.22209
WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos
WHOLE:从自我中心视频中提取的世界基础手-物体运动
Abstract
Egocentric manipulation videos are highly challenging due to severe occlusions during interactions and frequent object entries and exits from the camera view as the person moves. Current methods typically focus on recovering either hand or object pose in isolation, but both struggle during interactions and fail to handle out-of-sight cases. Moreover, their independent predictions often lead to inconsistent hand-object relations. We introduce WHOLE, a method that holistically reconstructs hand and object motion in world space from egocentric videos given object templates. Our key insight is to learn a generative prior over hand-object motion to jointly reason about their interactions. At test time, the pretrained prior is guided to generate trajectories that conform to the video observations. This joint generative reconstruction substantially outperforms approaches that process hands and objects separately followed by post-processing. WHOLE achieves state-of-the-art performance on hand motion estimation, 6D object pose estimation, and their relative interaction reconstruction. Project website: https://judyye.github.io/whole-www
Chinese Translation
自我中心的操作视频由于交互过程中的严重遮挡以及随着人物移动而频繁出现和消失的物体,具有很高的挑战性。目前的方法通常专注于单独恢复手或物体的姿态,但在交互过程中都面临困难,并且无法处理视野外的情况。此外,它们的独立预测往往导致手-物体关系不一致。我们提出了WHOLE,一种从自我中心视频中根据物体模板整体重建手和物体运动的方法。我们的关键见解是学习手-物体运动的生成先验,以共同推理它们的交互。在测试时,预训练的先验被引导生成符合视频观察的轨迹。这种联合生成重建显著优于分别处理手和物体后再进行后处理的方法。WHOLE在手运动估计、6D物体姿态估计及其相对交互重建方面达到了最先进的性能。项目网站:https://judyye.github.io/whole-www
cs.CV / 106 / 2602.22212
Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
Neu-PiG:用于长序列快速动态表面重建的神经预条件网格
Abstract
Temporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast deformation optimization method based on a novel preconditioned latent-grid encoding that distributes spatial features parameterized on the position and normal direction of a keyframe surface. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multilayer perceptron (MLP). To achieve high-fidelity, drift-free surface reconstructions in seconds, we employ Sobolev preconditioning during gradient-based training of the latent space, completely avoiding the need for any explicit correspondences or further priors. Experiments across diverse human and animal datasets demonstrate that Neu-PiG outperforms state-of-the-art approaches, offering both superior accuracy and scalability to long sequences while running at least 60x faster than existing training-free methods and achieving inference speeds on the same order as heavy pretrained models.
Chinese Translation
从非结构化点云数据中对动态三维物体进行时间一致的表面重建仍然具有挑战性,尤其是在非常长的序列中。现有的方法要么逐步优化变形,面临漂移风险并需要较长的运行时间,要么依赖于复杂的学习模型,这些模型需要特定类别的训练。我们提出了Neu-PiG,这是一种基于新颖的预条件潜在网格编码的快速变形优化方法,该方法在关键帧表面的位置信息和法向方向上分布空间特征。我们的方法将各个时间步的整体变形编码到一个多分辨率的潜在网格中,该网格由单个关键帧的参考表面的位置信息和法向方向参数化。然后,这种潜在表示被增强以进行时间调制,并通过轻量级多层感知器(MLP)解码为每帧的6自由度变形。为了在几秒内实现高保真、无漂移的表面重建,我们在潜在空间的基于梯度的训练过程中采用了Sobolev预条件,从而完全避免了任何显式对应关系或进一步的先验条件。针对多样的人类和动物数据集的实验表明,Neu-PiG在准确性和可扩展性方面均优于最先进的方法,运行速度至少比现有的无训练方法快60倍,并且在推理速度上与重型预训练模型处于同一数量级。
cs.CL / 1 / 2602.21212
Disaster Question Answering with LoRA Efficiency and Accurate End Position
基于LoRA效率和准确终点的灾难问答系统
Abstract
Natural disasters such as earthquakes, torrential rainfall, floods, and volcanic eruptions occur with extremely low frequency and affect limited geographic areas. When individuals face disaster situations, they often experience confusion and lack the domain-specific knowledge and experience necessary to determine appropriate responses and actions. While disaster information is continuously updated, even when utilizing RAG search and large language models for inquiries, obtaining relevant domain knowledge about natural disasters and experiences similar to one's specific situation is not guaranteed. When hallucinations are included in disaster question answering, artificial misinformation may spread and exacerbate confusion. This work introduces a disaster-focused question answering system based on Japanese disaster situations and response experiences. Utilizing the cl-tohoku/bert-base-japanese-v3 + Bi-LSTM + Enhanced Position Heads architecture with LoRA efficiency optimization, we achieved 70.4% End Position accuracy with only 5.7% of the total parameters (6.7M/117M). Experimental results demonstrate that the combination of Japanese BERT-base optimization and Bi-LSTM contextual understanding achieves accuracy levels suitable for real disaster response scenarios, attaining a 0.885 Span F1 score. Future challenges include: establishing natural disaster Q&A benchmark datasets, fine-tuning foundation models with disaster knowledge, developing lightweight and power-efficient edge AI Disaster Q&A applications for situations with insufficient power and communication during disasters, and addressing disaster knowledge base updates and continual learning capabilities.
Chinese Translation
自然灾害如地震、暴雨、洪水和火山喷发发生频率极低,并且影响的地理区域有限。当人们面临灾难情况时,常常会感到困惑,缺乏必要的领域特定知识和经验来确定适当的应对措施和行动。尽管灾难信息不断更新,即使在利用RAG搜索和大型语言模型进行查询时,获得与自然灾害相关的领域知识和与特定情况相似的经验也并不保证。当灾难问答中包含幻觉时,人工错误信息可能会传播并加剧混乱。本文介绍了一种基于日本灾难情况和应对经验的灾难专注问答系统。通过利用cl-tohoku/bert-base-japanese-v3 + Bi-LSTM + 增强位置头架构,并进行LoRA效率优化,我们在仅使用总参数的5.7%(6.7M/117M)的情况下,实现了70.4%的终点准确率。实验结果表明,结合日本BERT-base优化和Bi-LSTM上下文理解,达到了适用于真实灾难响应场景的准确水平,获得了0.885的Span F1分数。未来的挑战包括:建立自然灾害问答基准数据集,微调具有灾难知识的基础模型,开发适用于在灾难中电力和通信不足情况下的轻量化和节能边缘AI灾难问答应用,以及解决灾难知识库更新和持续学习能力的问题。
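The parameter-efficiency figure quoted in this abstract (6.7M trainable out of 117M total) can be checked with a short Python sketch, alongside the standard count of parameters that a LoRA adapter adds to one weight matrix. The helper names are hypothetical, and the rank of 8 and the 768-dimensional layer used in the example are illustrative assumptions, not values reported in the abstract.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(trainable_params, total_params):
    """Share of the model that is actually updated during fine-tuning."""
    return trainable_params / total_params


# Figures taken from the abstract: 6.7M trainable out of 117M total parameters.
fraction = trainable_fraction(6.7e6, 117e6)
percent = round(fraction * 100, 1)

# Illustrative assumption: a rank-8 adapter on a 768 x 768 projection.
example_adapter = lora_param_count(768, 768, 8)
```

The ratio works out to about 5.7%, matching the abstract, and the adapter count shows why low-rank updates are cheap: a rank-8 adapter on a 768 x 768 matrix adds 12,288 parameters, against 589,824 in the full matrix.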
cs.CL / 2 / 2602.21215
Inference-time Alignment via Sparse Junction Steering
稀疏交汇引导下的推理时对齐
Abstract
Token-level steering has emerged as a pivotal approach for inference-time alignment, enabling fine-grained control over large language models by modulating their output distributions without parameter updates. While effective, existing methods rely on dense intervention at every decoding step. This persistent manipulation not only incurs substantial computational overhead but also risks compromising generation quality by excessively drifting from the model's intrinsic distribution. In this work, we show that dense intervention is unnecessary and propose Sparse Inference-time Alignment (SIA), which performs sparse junction steering by intervening only at critical decision points along the generation trajectory. Our key insight is that high-entropy junctions mark pivotal decision points in the generation trajectory and are particularly susceptible to misalignment, indicating the need to introduce alignment-related reward signals at these points. Extensive experiments across different model families and alignment objectives show that steering only 20% to 80% of tokens achieves superior alignment-efficiency trade-offs. For strong base models such as Qwen3, intervening on as few as 20% of tokens matches or even surpasses heavily post-trained instruct models. This sparsity enables stronger guidance while better preserving the model's native distribution, integrates seamlessly with search-based methods such as Best-of-N, and reduces computational cost by up to 6x.
Chinese Translation
令牌级引导已成为推理时对齐的关键方法,通过调节大型语言模型的输出分布而无需更新参数,从而实现对其的细粒度控制。尽管有效,现有方法在每个解码步骤都依赖于密集干预。这种持续的操控不仅带来了可观的计算开销,还可能因过度偏离模型的内在分布而影响生成质量。在本研究中,我们展示了密集干预并非必要,并提出了稀疏推理时对齐(Sparse Inference time Alignment, SIA),该方法仅在生成轨迹中的关键决策点进行干预,从而实现稀疏交汇引导。我们的关键见解是,高熵交汇点标志着生成轨迹中的关键决策点,并且特别容易出现不对齐,表明在这些点引入与对齐相关的奖励信号的必要性。针对不同模型系列和对齐目标的广泛实验表明,仅对20%到80%的令牌进行引导即可实现更优的对齐效率权衡。对于强大的基础模型,如Qwen3,仅对20%的令牌进行干预就能匹配甚至超越经过大量后训练的指令模型。这种稀疏性在更好地保持模型的原生分布的同时,提供了更强的引导,与基于搜索的方法(如Best-of-N)无缝集成,并将计算成本最多降低6倍。
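The idea of intervening only at high-entropy junctions can be illustrated with a minimal sketch. The abstract does not specify the steering rule, so the entropy threshold, the `exp(beta * reward)` reweighting, and the names `sparse_steer`, `threshold`, and `beta` are all assumptions made for illustration, not the SIA implementation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sparse_steer(probs, reward, threshold=1.0, beta=1.0):
    """Steer only at high-entropy steps (hypothetical rule).

    probs  -- base model next-token probabilities
    reward -- per-token alignment reward (assumed to be given)
    Returns (distribution, intervened_flag).
    """
    if entropy(probs) <= threshold:
        # Low entropy: the model is confident, so leave it untouched.
        return list(probs), False
    # High entropy: reweight by exp(beta * reward) and renormalize.
    weights = [p * math.exp(beta * r) for p, r in zip(probs, reward)]
    total = sum(weights)
    return [w / total for w in weights], True

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
reward = [0.0, 0.0, 2.0, 0.0]
kept, touched_confident = sparse_steer(confident, reward)
steered, touched_uncertain = sparse_steer(uncertain, reward)
```

In this toy setup the confident step passes through unchanged, while the uniform step is tilted toward the rewarded token. The fraction of steps that cross the threshold plays the role of the 20% to 80% intervention rate mentioned in the abstract.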
cs.CL / 3 / 2602.21216
EQ-5D Classification Using Biomedical Entity-Enriched Pre-trained Language Models and Multiple Instance Learning
基于生物医学实体增强的预训练语言模型和多实例学习的EQ-5D分类
Abstract
The EQ-5D (EuroQol 5-Dimensions) is a standardized instrument for the evaluation of health-related quality of life. In health economics, systematic literature reviews (SLRs) depend on the correct identification of publications that use the EQ-5D, but manual screening of large volumes of scientific literature is time-consuming, error-prone, and inconsistent. In this study, we investigate fine-tuning of general-purpose (BERT) and domain-specific (SciBERT, BioBERT) pre-trained language models (PLMs), enriched with biomedical entity information extracted through scispaCy models for each statement, to improve EQ-5D detection from abstracts. We conduct nine experimental setups, including combining three scispaCy models with three PLMs, and evaluate their performance at both the sentence and study levels. Furthermore, we explore a Multiple Instance Learning (MIL) approach with attention pooling to aggregate sentence-level information into study-level predictions, where each abstract is represented as a bag of enriched sentences (by scispaCy). The findings indicate consistent improvements in F1-scores (reaching 0.82) and nearly perfect recall at the study-level, significantly exceeding classical bag-of-words baselines and recently reported PLM baselines. These results show that entity enrichment significantly improves domain adaptation and model generalization, enabling more accurate automated screening in systematic reviews.
Chinese Translation
EQ-5D(欧洲质量生活五维量表)是一种用于评估健康相关生活质量的标准化工具。在健康经济学中,系统文献综述(SLRs)依赖于正确识别使用EQ-5D的出版物,但手动筛选大量科学文献既耗时又容易出错且不一致。在本研究中,我们探讨了对通用(BERT)和特定领域(SciBERT, BioBERT)预训练语言模型(PLMs)的微调,这些模型通过scispaCy模型提取的生物医学实体信息来增强每个陈述,以提高从摘要中检测EQ-5D的能力。我们进行了九个实验设置,包括将三个scispaCy模型与三个PLMs结合,并在句子和研究水平上评估它们的性能。此外,我们探索了一种带有注意力池化的多实例学习(MIL)方法,将句子级信息聚合为研究级预测,其中每个摘要被表示为一个由scispaCy增强的句子包。研究结果表明,F1分数(达到0.82)在研究级别上有一致的提升,召回率几乎完美,显著超过经典的词袋基线和最近报告的PLM基线。这些结果表明,实体增强显著改善了领域适应性和模型泛化能力,使系统评价中的自动筛选更加准确。
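Attention pooling over a bag of sentences can be sketched in a few lines. This is a simplified illustration, assuming each sentence is already encoded as a vector and that attention scores come from a single hypothetical scoring vector `scorer`. The abstract does not describe the attention head, and published MIL attention layers typically add a non-linearity and learned projections.

```python
import math

def attention_pool(sentence_vectors, scorer):
    """Aggregate sentence vectors into one study-level vector.

    sentence_vectors -- list of equal-length lists (one per sentence)
    scorer           -- hypothetical learned vector producing a score per sentence
    Returns (pooled_vector, attention_weights).
    """
    scores = [sum(w * x for w, x in zip(scorer, vec)) for vec in sentence_vectors]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sentence_vectors[0])
    pooled = [sum(a * vec[d] for a, vec in zip(weights, sentence_vectors))
              for d in range(dim)]
    return pooled, weights

bag = [[1.0, 0.0], [0.0, 1.0]]
uniform_pooled, uniform_weights = attention_pool(bag, [0.0, 0.0])
focused_pooled, focused_weights = attention_pool(bag, [10.0, 0.0])
```

The attention weights also indicate which sentences drove a study-level prediction, which is one reason attention pooling is a common choice for MIL.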
cs.CL / 4 / 2602.21217
Applied Sociolinguistic AI for Community Development (ASA-CD): A New Scientific Paradigm for Linguistically-Grounded Social Intervention
社区发展应用社会语言学人工智能(ASA-CD):一种基于语言的社会干预的新科学范式
Abstract
This paper establishes Applied Sociolinguistic AI for Community Development (ASA-CD) as a novel scientific paradigm for addressing community challenges through linguistically grounded, AI-enabled intervention. ASA-CD introduces three key contributions: (1) linguistic biomarkers as computational indicators of discursive fragmentation; (2) development-aligned natural language processing (NLP), an AI optimisation paradigm prioritising collective outcomes; and (3) a standardised five-phase protocol for discursive intervention. A proof-of-concept study, incorporating real-world and synthetic corpora, demonstrates systematic associations between exclusionary language and negative sentiment and simulates intervention-based improvements. ASA-CD provides a unified methodological, ethical and empirical framework for scalable, value-aligned AI in the service of community empowerment.
Chinese Translation
本文确立了社区发展应用社会语言学人工智能(ASA-CD)作为一种新颖的科学范式,通过基于语言的、人工智能驱动的干预来应对社区挑战。ASA-CD 提出了三个关键贡献:(1)作为话语碎片化计算指标的语言生物标志;(2)与发展相一致的自然语言处理(NLP),一种优先考虑集体成果的人工智能优化范式;(3)用于话语干预的标准化五阶段协议。一项概念验证研究结合了真实和合成语料,展示了排斥性语言与负面情绪之间的系统关联,并模拟了基于干预的改善。ASA-CD 提供了一个统一的方法论、伦理和实证框架,以便在社区赋权服务中实现可扩展的、与价值一致的人工智能。
cs.CL / 5 / 2602.21218
EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors
EPSVec:通过数据集向量实现高效且私密的合成数据生成
Abstract
High-quality data is essential for modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic data offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. However, existing private text generation methods are severely inefficient: they are data-intensive, computationally slow, and often require large private corpora or batch sizes to achieve usable quality. We introduce EPSVec, a differentially-private lightweight alternative that steers LLM generation using *dataset vectors*--directions in activation space that capture the distributional gap between private data and public priors. EPSVec extracts and sanitizes steering vectors just once and then performs standard decoding. This decouples the privacy budget from generation, enabling arbitrarily many synthetic samples without additional privacy cost and yielding strong fidelity even in low-data regimes. Furthermore, we enhance our method by utilizing pretrained (base) models and introducing fixed-shot prompting to boost generation diversity and fidelity. Our experiments demonstrate that EPSVec outperforms existing baselines in distributional alignment and downstream utility, particularly in low-data regimes, while significantly reducing computational overhead.
Chinese Translation
高质量数据对于现代机器学习至关重要,但许多有价值的语料库是敏感的,无法自由共享。合成数据为下游开发提供了一个实用的替代方案,而大型语言模型(LLMs)已成为生成合成数据的强大引擎。然而,现有的私密文本生成方法效率极低:它们数据密集、计算缓慢,并且通常需要大量的私有语料库或批量大小才能达到可用的质量。我们提出了EPSVec,这是一种差分隐私的轻量级替代方案,通过*数据集向量*引导LLM生成——这些向量是在激活空间中的方向,捕捉私有数据与公共先验之间的分布差距。EPSVec仅提取和清理一次引导向量,然后执行标准解码。这将隐私预算与生成解耦,使得可以在没有额外隐私成本的情况下生成任意数量的合成样本,并在低数据环境下仍能保持强大的保真度。此外,我们通过利用预训练(基础)模型和引入固定样本提示来增强我们的方法,以提升生成的多样性和保真度。我们的实验表明,EPSVec在分布对齐和下游效用方面优于现有基线,特别是在低数据环境中,同时显著降低了计算开销。
cs.CL / 6 / 2602.21219
Reasoning-Based Personalized Generation for Users with Sparse Data
基于推理的稀疏数据用户个性化生成
Abstract
Large Language Model (LLM) personalization holds great promise for tailoring responses by leveraging personal context and history. However, real-world users usually possess sparse interaction histories with limited personal context, such as cold-start users in social platforms and newly registered customers in online E-commerce platforms, compromising the LLM-based personalized generation. To address this challenge, we introduce GraSPer (Graph-based Sparse Personalized Reasoning), a novel framework for enhancing personalized text generation under sparse context. GraSPer first augments user context by predicting items that the user would likely interact with in the future. With reasoning alignment, it then generates texts for these interactions to enrich the augmented context. In the end, it generates personalized outputs conditioned on both the real and synthetic histories, ensuring alignment with user style and preferences. Extensive experiments on three benchmark personalized generation datasets show that GraSPer achieves significant performance gain, substantially improving personalization in sparse user context settings.
Chinese Translation
大型语言模型(LLM)个性化在利用个人背景和历史定制响应方面具有巨大潜力。然而,现实世界中的用户通常拥有稀疏的交互历史和有限的个人背景,例如社交平台中的冷启动用户和在线电子商务平台中新注册的客户,这影响了基于LLM的个性化生成。为了解决这一挑战,我们提出了GraSPer(基于图的稀疏个性化推理),这是一个在稀疏背景下增强个性化文本生成的新框架。GraSPer首先通过预测用户未来可能交互的项目来增强用户背景。然后,通过推理对齐,它为这些交互生成文本,以丰富增强后的背景。最后,它基于真实和合成历史生成个性化输出,确保与用户风格和偏好的对齐。在三个基准个性化生成数据集上的广泛实验表明,GraSPer实现了显著的性能提升,显著改善了稀疏用户背景下的个性化效果。
cs.CL / 7 / 2602.21220
Field-Theoretic Memory for AI Agents: Continuous Dynamics for Context Preservation
面向人工智能代理的场论记忆:用于上下文保留的连续动态
Abstract
We present a memory system for AI agents that treats stored information as continuous fields governed by partial differential equations rather than discrete entries in a database. The approach draws from classical field theory: memories diffuse through semantic space, decay thermodynamically based on importance, and interact through field coupling in multi-agent scenarios. We evaluate the system on two established long-context benchmarks: LoCoMo (ACL 2024) with 300-turn conversations across 35 sessions, and LongMemEval (ICLR 2025) testing multi-session reasoning over 500+ turns. On LongMemEval, the field-theoretic approach achieves significant improvements: +116% F1 on multi-session reasoning (p<0.01, d= 3.06), +43.8% on temporal reasoning (p<0.001, d= 9.21), and +27.8% retrieval recall on knowledge updates (p<0.001, d= 5.00). Multi-agent experiments show near-perfect collective intelligence (>99.8%) through field coupling. Code is available at github.com/rotalabs/rotalabs-fieldmem.
Chinese Translation
我们提出了一种面向人工智能代理的记忆系统,该系统将存储的信息视为由偏微分方程支配的连续场,而非数据库中的离散条目。该方法借鉴了经典场论:记忆在语义空间中扩散,基于重要性热力学衰减,并在多代理场景中通过场耦合进行交互。我们在两个已建立的长上下文基准上评估该系统:LoCoMo(ACL 2024),包含35个会话的300轮对话,以及LongMemEval(ICLR 2025),测试500轮以上的多会话推理。在LongMemEval上,场论方法取得了显著的改进:在多会话推理上F1提高116%(p<0.01,d=3.06),在时间推理上提高43.8%(p<0.001,d=9.21),在知识更新的检索召回率上提高27.8%(p<0.001,d=5.00)。多代理实验显示,通过场耦合实现了近乎完美的集体智能(>99.8%)。代码可在github.com/rotalabs/rotalabs-fieldmem获取。
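To make "memories diffuse and decay" concrete, here is a minimal sketch of one explicit Euler step of a 1-D diffusion-decay equation, d(phi)/dt = D * laplacian(phi) - lambda * phi. The 1-D grid, the scalar `decay` rate, the reflecting boundaries, and all parameter values are assumptions for illustration. The abstract does not give the actual equations, and the released code may differ.

```python
def field_step(field, diffusion=0.1, decay=0.05, dt=1.0):
    """Advance a 1-D memory field by one explicit Euler step.

    field     -- list of activation values over a discretized semantic axis
    diffusion -- assumed diffusion coefficient D
    decay     -- assumed decay rate lambda (importance-dependent in the paper)
    """
    n = len(field)
    updated = []
    for i in range(n):
        left = field[i - 1] if i > 0 else field[i]        # reflecting boundary
        right = field[i + 1] if i < n - 1 else field[i]   # reflecting boundary
        laplacian = left - 2.0 * field[i] + right
        updated.append(field[i] + dt * (diffusion * laplacian - decay * field[i]))
    return updated

memory = [0.0, 0.0, 1.0, 0.0, 0.0]   # a single stored item
after = field_step(memory)
```

After one step the stored peak has spread to its neighbours and the total activation has shrunk by the decay factor. An importance-weighted variant would make `decay` smaller for important entries.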
cs.CL / 8 / 2602.21222
Task-Aware LoRA Adapter Composition via Similarity Retrieval in Vector Databases
基于相似性检索的任务感知 LoRA 适配器组合
Abstract
Parameter-efficient fine-tuning methods like LoRA have enabled task-specific adaptation of large language models, but efficiently composing multiple specialized adapters for unseen tasks remains challenging. We present a novel framework for dynamic LoRA adapter composition that leverages similarity retrieval in vector databases to enable zero-shot generalization across diverse NLP tasks. Our approach constructs a task-aware vector database by embedding training examples from 22 datasets spanning commonsense reasoning, question answering, natural language inference, and sentiment analysis. At inference time, we retrieve the most similar training examples, compute task similarity distributions via nucleus sampling, and dynamically merge relevant LoRA adapters using retrieval-weighted fusion strategies. We evaluated four merging methods (Linear, Concatenation, TIES, and Magnitude Prune), demonstrating that our dataset-centric retrieval approach often matches or exceeds the performance of individually fine-tuned task-specific adapters. Notably, Linear merging achieves 70.95% on PIQA and 77.62% on RTE, substantially outperforming single-task baselines (46% and 52%, respectively). Our framework requires no additional retriever training, operates with frozen embeddings, and enables efficient, interpretable adapter composition. These results suggest that retrieval-based dynamic merging offers a promising direction for scalable, parameter-efficient multitask learning without requiring full model retraining for each new task.
Chinese Translation
像 LoRA 这样的参数高效微调方法使得大型语言模型能够针对特定任务进行适应,但高效组合多个专门适配器以应对未见任务仍然具有挑战性。我们提出了一种新颖的动态 LoRA 适配器组合框架,该框架利用向量数据库中的相似性检索,实现了在多样化自然语言处理(NLP)任务中的零样本泛化。我们的方法通过嵌入来自 22 个数据集的训练示例,构建了一个任务感知的向量数据库,这些数据集涵盖了常识推理、问答、自然语言推理和情感分析。在推理时,我们检索最相似的训练示例,通过核采样计算任务相似性分布,并使用检索加权融合策略动态合并相关的 LoRA 适配器。我们评估了四种合并方法:线性(Linear)、连接(Concatenation)、TIES 和幅度修剪(Magnitude Prune),结果表明我们的数据集中心检索方法在性能上往往与单独微调的任务特定适配器相匹配或超越。值得注意的是,线性合并在 PIQA 上达到了 70.95%,在 RTE 上达到了 77.62%,显著优于单任务基线(分别为 46% 和 52%)。我们的框架无需额外的检索器训练,使用冻结的嵌入进行操作,并实现了高效、可解释的适配器组合。这些结果表明,基于检索的动态合并为可扩展、参数高效的多任务学习提供了一个有前景的方向,而无需为每个新任务进行完整的模型重训练。
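The retrieve, weight, and merge pipeline described above might be sketched as follows. This is a toy illustration under stated assumptions: retrieved examples are represented only by their source-dataset labels, "nucleus sampling" is read as top-p truncation of the label distribution, and adapters are flat lists of numbers. The function names `task_weights` and `linear_merge` are hypothetical.

```python
from collections import Counter

def task_weights(retrieved_labels, top_p=0.8):
    """Turn retrieved-example labels into merge weights via top-p truncation."""
    counts = Counter(retrieved_labels)
    total = sum(counts.values())
    ranked = sorted(((n / total, name) for name, n in counts.items()), reverse=True)
    kept, mass = [], 0.0
    for prob, name in ranked:
        kept.append((name, prob))
        mass += prob
        if mass >= top_p:   # stop once the nucleus covers top_p of the mass
            break
    return {name: prob / mass for name, prob in kept}

def linear_merge(adapters, weights):
    """Weighted sum of adapter parameters (the 'Linear' merging strategy)."""
    size = len(next(iter(adapters.values())))
    merged = [0.0] * size
    for name, weight in weights.items():
        for i, value in enumerate(adapters[name]):
            merged[i] += weight * value
    return merged

labels = ["piqa"] * 6 + ["rte"] * 3 + ["sst2"]          # ten retrieved neighbours
adapters = {"piqa": [1.0, 0.0], "rte": [0.0, 1.0], "sst2": [5.0, 5.0]}
weights = task_weights(labels)
merged = linear_merge(adapters, weights)
```

With these inputs the rarely retrieved `sst2` adapter falls outside the nucleus and does not contribute, which is the kind of filtering a retrieval-weighted scheme is meant to provide.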
cs.CL / 9 / 2602.21223
Measuring Pragmatic Influence in Large Language Model Instructions
测量大型语言模型指令中的语用影响
Abstract
It is not only what we ask large language models (LLMs) to do that matters, but also how we prompt. Phrases like "This is urgent" or "As your supervisor" can shift model behavior without altering task content. We study this effect as pragmatic framing, contextual cues that shape directive interpretation rather than task specification. While prior work exploits such cues for prompt optimization or probes them as security vulnerabilities, pragmatic framing itself has not been treated as a measurable property of instruction following. Measuring this influence systematically remains challenging, requiring controlled isolation of framing cues. We introduce a framework with three novel components: directive-framing decomposition separating framing context from task specification; a taxonomy organizing 400 instantiations of framing into 13 strategies across 4 mechanism clusters; and priority-based measurement that quantifies influence through observable shifts in directive prioritization. Across five LLMs of different families and sizes, influence mechanisms cause consistent and structured shifts in directive prioritization, moving models from baseline impartiality toward favoring the framed directive. This work establishes pragmatic framing as a measurable and predictable factor in instruction-following systems.
Chinese Translation
我们要求大型语言模型(LLMs)执行的任务不仅仅取决于我们所问的内容,还取决于我们的提示方式。诸如“这很紧急”或“作为你的主管”这样的短语可以在不改变任务内容的情况下改变模型的行为。我们将这种效应研究为语用框架,即塑造指令解释而非任务规范的上下文线索。虽然之前的研究利用这些线索进行提示优化或将其作为安全漏洞进行探讨,但语用框架本身尚未被视为指令遵循的可测量属性。系统地测量这种影响仍然具有挑战性,需要对框架线索进行控制隔离。我们引入了一个包含三个新组件的框架:指令框架分解,将框架上下文与任务规范分离;一个将400个框架实例组织成13种策略和4个机制集群的分类法;以及基于优先级的测量,通过可观察的指令优先级变化量化影响。在五个不同家族和规模的LLMs中,影响机制导致指令优先级出现一致且结构化的变化,使模型从基线的公正性向偏向框架指令转变。这项工作确立了语用框架作为指令遵循系统中一个可测量和可预测的因素。
cs.CL / 10 / 2602.21224
Make Every Draft Count: Hidden State based Speculative Decoding
让每个草稿都发挥作用:基于隐状态的推测解码
Abstract
Speculative decoding has emerged as a pivotal technique to accelerate LLM inference by employing a lightweight draft model to generate candidate tokens that are subsequently verified by the target model in parallel. However, while this paradigm successfully increases the arithmetic intensity of memory-bound inference, it causes significant compute inefficiency: the majority of draft tokens fail verification and are discarded, wasting computation. Motivated by the goal of reclaiming this wasted computation, we propose a novel system that transforms discarded drafts into reusable tokens. Our key insight is to perform auto-regressive prediction at the hidden-state level and to postpone integrating token information until after the hidden states are generated, so that the draft hidden states are not contaminated by incorrect tokens, enabling hidden-state reuse. To implement such a system, we first introduce a draft model architecture based on auto-regressive hidden states, which preserves richer semantics than token-based drafters to facilitate draft repurposing. Second, we design an efficient token information injection mechanism that leverages our specialized draft model to construct high-quality draft token trees and enables resampling tokens from verification failures. Third, we eliminate the overhead hidden in our design to further maximize hardware utilization. We conducted extensive evaluations against various baselines, demonstrating up to a 3.3x speedup over standard speculative decoding.
Chinese Translation
推测解码已成为加速大规模语言模型(LLM)推理的关键技术,通过采用轻量级的草稿模型生成候选标记,并由目标模型并行验证。然而,尽管这一范式成功提高了内存限制推理的算力密度,但它导致了显著的计算效率低下:大多数草稿标记未能通过验证而被丢弃,造成计算浪费。基于回收这些浪费计算的目标,我们提出了一种新颖的系统,将被丢弃的草稿转化为可重用的标记。我们的关键见解是,在隐状态层面进行自回归预测,并推迟在隐状态生成后整合标记信息,从而使草稿隐状态不受错误标记的污染,允许隐状态的重用。为了实现这样的系统,首先我们介绍了一种基于自回归隐状态的草稿模型架构,它比基于标记的草稿生成器保留了更丰富的语义,以促进草稿的再利用。其次,我们设计了一种高效的标记信息注入机制,利用我们的专用草稿模型构建高质量的草稿标记树,并能够从验证失败中重新采样标记。第三,我们消除了设计中的开销,以进一步最大化硬件利用率。我们进行了广泛的评估,与各种基线进行比较,证明了与标准推测解码相比,速度提升高达3.3倍。
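The waste that motivates this work is easiest to see in the standard draft-then-verify loop. The sketch below shows ordinary greedy verification only, not the hidden-state method proposed in the paper. Token IDs are plain integers, and `target` is assumed to hold the target model's own choice at each draft position plus one extra position.

```python
def verify_draft(draft, target):
    """Greedy verification for speculative decoding.

    draft  -- tokens proposed by the draft model
    target -- target model's token at each draft position, plus one bonus token
    Returns (accepted_tokens, number_of_discarded_draft_tokens).
    """
    accepted = []
    for i, token in enumerate(draft):
        if token != target[i]:
            accepted.append(target[i])        # replace the first mismatch
            return accepted, len(draft) - i   # this and later drafts are discarded
        accepted.append(token)
    accepted.append(target[len(draft)])       # all matched: keep the bonus token
    return accepted, 0

partial, wasted = verify_draft([5, 7, 9, 2], [5, 7, 1, 4, 8])
full, none_wasted = verify_draft([5, 7], [5, 7, 3])
```

In the first call two of four draft tokens are thrown away after the mismatch. That discarded work is what the abstract proposes to reuse by drafting in hidden-state space.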
cs.CL / 11 / 2602.21225
Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal
与架构无关的文档理解课程学习:来自文本和多模态的实证证据
Abstract
We investigate whether progressive data scheduling -- a curriculum learning strategy that incrementally increases training data exposure (33\%$\rightarrow$67\%$\rightarrow$100\%) -- yields consistent efficiency gains across architecturally distinct document understanding models. By evaluating BERT (text-only, 110M parameters) and LayoutLMv3 (multimodal, 126M parameters) on the FUNSD and CORD benchmarks, we establish that this schedule reduces wall-clock training time by approximately 33\%, commensurate with the reduction from 10.0 to 6.67 effective epoch-equivalents of data. To isolate curriculum effects from compute reduction, we introduce matched-compute baselines (Standard-7) that control for total gradient updates. On the FUNSD dataset, the curriculum significantly outperforms the matched-compute baseline for BERT ($\Delta$F1 = +0.023, $p=0.022$, $d_z=3.83$), constituting evidence for a genuine scheduling benefit in capacity-constrained models. In contrast, no analogous benefit is observed for LayoutLMv3 ($p=0.621$), whose multimodal representations provide sufficient inductive bias. On the CORD dataset, all conditions converge to equivalent F1 scores ($\geq$0.947) irrespective of scheduling, indicating a performance ceiling. Schedule ablations comparing progressive, two-phase, reverse, and random pacing confirm that the efficiency gain derives from reduced data volume rather than ordering. Taken together, these findings demonstrate that progressive scheduling is a reliable compute-reduction strategy across model families, with curriculum-specific benefits contingent on the interaction between model capacity and task complexity.
Chinese Translation
我们研究了渐进式数据调度——一种逐步增加训练数据暴露的课程学习策略(33\%$\rightarrow$67\%$\rightarrow$100\%)——是否在不同架构的文档理解模型中产生一致的效率提升。通过在FUNSD和CORD基准上评估BERT(仅文本,110M参数)和LayoutLMv3(多模态,126M参数),我们确定该调度将实际训练时间减少了约33\%,与数据有效轮次从10.0减少到6.67相对应。为了将课程效果与计算减少分离,我们引入了匹配计算基线(Standard-7),以控制总梯度更新。在FUNSD数据集上,课程学习显著优于BERT的匹配计算基线($\Delta$F1 = +0.023,$p=0.022$,$d_z=3.83$),这为容量受限模型中的真实调度效益提供了证据。相比之下,LayoutLMv3没有观察到类似的效益($p=0.621$),其多模态表示提供了足够的归纳偏置。在CORD数据集上,所有条件的F1分数趋于相等($\geq$0.947),无论调度如何,表明存在性能上限。通过比较渐进式、两阶段、反向和随机节奏的调度消融实验确认,效率提升源于数据量的减少而非顺序。综合来看,这些发现表明,渐进式调度是跨模型系列的可靠计算减少策略,而课程特定的效益则依赖于模型容量与任务复杂性之间的相互作用。
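The two epoch-equivalent figures in the abstract, 6.67 and 10.0, can be reproduced with a short calculation. This sketch assumes a 10-epoch run split into three equal-length phases and reads the 33% and 67% stages as one third and two thirds of the data. The abstract does not state the phase lengths, so treat this as one consistent reading rather than the paper's exact setup.

```python
def epoch_equivalents(fractions, epochs_per_phase):
    """Total data seen, measured in passes over the full training set."""
    return sum(fraction * epochs_per_phase for fraction in fractions)

total_epochs = 10                       # assumed run length
schedule = [1 / 3, 2 / 3, 1.0]          # data fraction exposed in each phase
curriculum = epoch_equivalents(schedule, total_epochs / len(schedule))
baseline = epoch_equivalents([1.0], total_epochs)
saving = 1 - curriculum / baseline      # relative reduction in data processed
```

Under these assumptions `curriculum` is about 6.67, `baseline` is 10, and `saving` is about one third, matching the roughly 33% wall-clock reduction reported.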
cs.CL / 12 / 2602.21226
IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions
IslamicLegalBench:评估大型语言模型在1200年伊斯兰多元法律传统中的知识与推理能力
Abstract
As millions of Muslims turn to LLMs like GPT, Claude, and DeepSeek for religious guidance, a critical question arises: Can these AI systems reliably reason about Islamic law? We introduce IslamicLegalBench, the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence, with 718 instances covering 13 tasks of varying complexity. Evaluation of nine state-of-the-art models reveals major limitations: the best model achieves only 68% correctness with 21% hallucination, while several models fall below 35% correctness and exceed 55% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%. Moderate-complexity tasks requiring exact knowledge show the highest errors, whereas high-complexity tasks display apparent competence through semantic reasoning. False premise detection indicates risky sycophancy, with 6 of 9 models accepting misleading assumptions at rates above 40%. These results highlight that prompt-based methods cannot compensate for missing foundational knowledge. IslamicLegalBench offers the first systematic framework to evaluate Islamic legal reasoning in AI, revealing critical gaps in tools increasingly relied on for spiritual guidance.
Chinese Translation
随着数百万穆斯林转向GPT、Claude和DeepSeek等大型语言模型寻求宗教指导,一个关键问题浮现:这些人工智能系统能否可靠地推理伊斯兰法律?我们推出了IslamicLegalBench,这是第一个评估大型语言模型在七个伊斯兰法学派中的表现的基准,涵盖718个实例和13个不同复杂度的任务。对九个最先进模型的评估揭示了重大局限性:最佳模型的正确率仅为68%,而幻觉率高达21%,而多个模型的正确率低于35%,幻觉率超过55%。少量示例提示的效果微乎其微,仅有2个模型的表现提高了超过1%。需要准确知识的中等复杂度任务显示出最高的错误率,而高复杂度任务则通过语义推理表现出明显的能力。虚假前提检测表明存在风险的迎合现象,9个模型中有6个在接受误导性假设时的比例超过40%。这些结果突显了基于提示的方法无法弥补缺失的基础知识。IslamicLegalBench提供了评估人工智能中伊斯兰法律推理的第一个系统框架,揭示了在日益依赖的精神指导工具中存在的关键缺口。
cs.CL / 13 / 2602.21227
Budget-Aware Agentic Routing via Boundary-Guided Training
基于边界引导训练的预算感知自主路由
Abstract
As large language models (LLMs) evolve into autonomous agents that execute long-horizon workflows, invoking a high-capability model at every step becomes economically unsustainable. While model routing is effective for single-turn queries, agentic routing is a sequential, path-dependent problem: early mistakes compound, feedback often arrives only at the end of the episode, and deployments often demand strict per-task spending limits. We propose Budget-Aware Agentic Routing, which selects between a cheap and an expensive model at each step to optimize the cost-success frontier and to operate under strict per-task budgets. We propose Boundary-Guided Training, which leverages two boundary policies (always-small vs. always-large) to build a difficulty taxonomy and to anchor learning under sparse rewards. Our approach warm-starts with boundary-guided SFT data synthesis via stratified sampling of cost-efficient trajectories, then applies Boundary-Guided Policy Optimization (BoPO), combining boundary-relative rewards with a reference-guided advantage to avoid degenerate cheap-failure solutions. Experimental results show that our method improves the efficiency frontier, matching strong routing baselines at substantially lower cost while demonstrating generalization to strict inference-time budget constraints. Overall, our work establishes a foundational framework for agentic routing, shifting the paradigm from static model selection to dynamic, budget-aware sequential decision-making.
Chinese Translation
随着大型语言模型(LLMs)发展为能够执行长时间工作流的自主代理,在每一步调用高能力模型变得经济上不可持续。虽然模型路由对于单轮查询有效,但自主路由是一个顺序的、路径依赖的问题:早期的错误会累积,反馈通常在剧集结束时才会出现,而部署往往要求严格的每任务支出限制。我们提出了预算感知自主路由,该方法在每一步选择廉价模型和昂贵模型之间,以优化成本-成功边界,并在严格的每任务预算下运行。我们提出了边界引导训练,利用两种边界策略(始终小型 vs. 始终大型)构建难度分类法,并在稀疏奖励下锚定学习。我们的方法通过对成本高效轨迹的分层采样进行边界引导的SFT数据合成来热启动,然后应用边界引导策略优化(BoPO),结合边界相对奖励和参考引导优势,以避免退化的廉价失败解决方案。实验结果表明,我们的方法提高了效率边界,在显著较低的成本下与强大的路由基线相匹配,同时展示了对严格推理时间预算约束的泛化能力。总体而言,我们的工作为自主路由建立了一个基础框架,将范式从静态模型选择转变为动态的、预算感知的顺序决策。
cs.CL / 14 / 2602.21228
ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following
ImpRIF:更强的隐式推理促进更好的复杂指令遵循
Abstract
As applications of large language models (LLMs) become increasingly complex, the demand for robust complex instruction following capabilities is growing accordingly. We argue that a thorough understanding of the instruction itself, especially the latent reasoning structure embedded between the lines, is crucial for improving instruction following. We therefore target complex instructions that involve implicit reasoning, intricate logical relations, and multi-constraint dependencies. We propose ImpRIF, a method to enhance LLMs' understanding of implicit reasoning instructions, thereby improving their ability to follow complex instructions. We formalize such instructions as verifiable reasoning graphs, enabling programmatic verification and graph-driven chain-of-thought reasoning. Based on this formulation, we synthesize large-scale single- and multi-turn data, propose fine-tuning with graph reasoning, and apply reinforcement learning to explicitly train models to reason along the graph. On five complex instruction following benchmarks, our models substantially outperform their base models. These results demonstrate that enhancing implicit reasoning capabilities can significantly improve complex instruction following. This project will be open-sourced in the near future.
Chinese Translation
随着大型语言模型(LLMs)应用的日益复杂,对稳健的复杂指令遵循能力的需求也在不断增长。我们认为,深入理解指令本身,尤其是潜在的推理结构(latent reasoning structure)在字里行间所蕴含的内容,对于提升指令遵循能力至关重要。因此,我们针对涉及隐式推理、复杂逻辑关系和多约束依赖的复杂指令。我们提出了ImpRIF,一种增强LLMs对隐式推理指令理解的方法,从而提高其遵循复杂指令的能力。我们将此类指令形式化为可验证的推理图(verifiable reasoning graphs),以实现程序化验证和图驱动的思维链推理。基于这一形式化,我们合成了大规模的单轮和多轮数据,提出了基于图推理的微调方法,并应用强化学习(reinforcement learning)明确训练模型沿图进行推理。在五个复杂指令遵循基准测试中,我们的模型显著超越了其基础模型。这些结果表明,增强隐式推理能力可以显著改善复杂指令的遵循能力。该项目将在不久的将来开源。
cs.CL / 15 / 2602.21230
TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents
TRACE:面向轨迹的深度研究代理综合评估
Abstract
The evaluation of Deep Research Agents is a critical challenge, as conventional outcome-based metrics fail to capture the nuances of their complex reasoning. Current evaluation faces two primary challenges: 1) a reliance on singular metrics like Pass@1, creating a "high-score illusion" that ignores the quality, efficiency, and soundness of the reasoning process; and 2) the failure of static benchmarks to quantify crucial attributes like robustness and latent capability. To address these gaps, we introduce TRACE (Trajectory-Aware Comprehensive Evaluation), a framework that holistically assesses the entire problem-solving trajectory. To counter the "high-score illusion", we propose a Hierarchical Trajectory Utility Function that quantifies process efficiency and cognitive quality, including evidence grounding, alongside accuracy. To measure deeper attributes, TRACE introduces a Scaffolded Capability Assessment protocol, quantifying an agent's latent ability by determining the minimum guidance needed for success. Our contributions include the TRACE framework, its novel metrics, and the accompanying DeepResearch-Bench with controllable complexity. Experiments show TRACE delivers a granular ranking that uncovers critical trade-offs between agent accuracy, efficiency, and robustness entirely missed by singular metrics.
Chinese Translation
深度研究代理的评估是一个关键挑战,因为传统的基于结果的指标无法捕捉其复杂推理的细微差别。目前的评估面临两个主要挑战:1)依赖于单一指标,如 Pass@1,造成了忽视推理过程的质量、效率和合理性的“高分幻觉”;2)静态基准未能量化诸如鲁棒性和潜在能力等关键属性。为了解决这些问题,我们提出了 TRACE(面向轨迹的综合评估)框架,该框架全面评估整个问题解决轨迹。为了应对“高分幻觉”,我们提出了一种层次化轨迹效用函数,该函数量化了过程效率和认知质量,包括证据基础,以及准确性。为了测量更深层次的属性,TRACE 引入了一种支架能力评估协议,通过确定成功所需的最小指导量来量化代理的潜在能力。我们的贡献包括 TRACE 框架、其新颖的指标,以及配套的具有可控复杂性的 DeepResearch-Bench。实验表明,TRACE 提供了细致的排名,揭示了代理的准确性、效率和鲁棒性之间的关键权衡,这些在单一指标中完全被忽视。
cs.CL / 16 / 2602.21257
Structured Prompt Language: Declarative Context Management for LLMs
结构化提示语言:大型语言模型的声明性上下文管理
Abstract
We present SPL (Structured Prompt Language), a declarative SQL-inspired language that treats large language models as generative knowledge bases and their context windows as constrained resources. SPL provides explicit WITH BUDGET/LIMIT token management, an automatic query optimizer, EXPLAIN transparency analogous to SQL's EXPLAIN ANALYZE, and native integration of retrieval-augmented generation (RAG) and persistent memory in a single declarative framework. SPL-flow extends SPL into resilient agentic pipelines with a three-tier provider fallback strategy (Ollama -> OpenRouter -> self-healing retry) fully transparent to the .spl script. Five extensions demonstrate the paradigm's breadth: (1) Text2SPL (multilingual NL->SPL translation); (2) Mixture-of-Models (MoM) routing that dispatches each PROMPT to a domain-specialist model at runtime; (3) Logical Chunking, an intelligent strategy for documents exceeding a single context window--expressed naturally through SPL's existing CTE syntax with no new constructs, decomposing a large query into a Map-Reduce pipeline that reduces attention cost from O(N^2) to O(N^2/k) and runs identically on cloud (parallel) or local hardware (sequential); (4) SPL-flow, a declarative agentic orchestration layer with resilient three-tier provider fallback; and (5) BENCHMARK for parallel multi-model comparison with automatic winner persistence. We provide a formal EBNF grammar, two pip-installable Python packages (spl-llm, spl-flow), and comparison against Prompty, DSPy, and LMQL. SPL reduces prompt boilerplate by 65% on average, surfaces a 68x cost spread across model tiers as a pre-execution signal, and runs the identical .spl script at $0.002 on OpenRouter or at zero marginal cost on a local Ollama instance--without modification.
Chinese Translation
我们提出了SPL(结构化提示语言),这是一种受SQL启发的声明性语言,将大型语言模型视为生成知识库,并将其上下文窗口视为受限资源。SPL提供了明确的WITH BUDGET/LIMIT令牌管理、自动查询优化器、类似于SQL的EXPLAIN ANALYZE的EXPLAIN透明性,以及在单一声明性框架中原生集成检索增强生成(RAG)和持久内存。SPL-flow将SPL扩展为具有弹性的代理管道,采用三层提供者回退策略(Ollama -> OpenRouter -> 自我修复重试),对.spl脚本完全透明。五个扩展展示了该范式的广度:(1)Text2SPL(多语言NL->SPL翻译);(2)模型混合(MoM)路由,在运行时将每个PROMPT分派给领域专家模型;(3)逻辑分块,一种智能策略,适用于超出单个上下文窗口的文档——通过SPL现有的CTE语法自然表达,无需新构造,将大型查询分解为一个Map-Reduce管道,将注意力成本从O(N^2)降低到O(N^2/k),并在云(并行)或本地硬件(顺序)上以相同方式运行;(4)SPL-flow,一个具有弹性三层提供者回退的声明性代理编排层;(5)BENCHMARK用于并行多模型比较,并具有自动赢家持久性。我们提供了正式的EBNF语法、两个可通过pip安装的Python包(spl-llm,spl-flow),以及与Prompty、DSPy和LMQL的比较。SPL平均减少了65%的提示样板,显示出在模型层之间68倍的成本差异作为执行前信号,并在OpenRouter上以$0.002的成本运行相同的.spl脚本,或在本地Ollama实例上以零边际成本运行——无需修改。
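The claimed drop in attention cost from O(N^2) to O(N^2/k) follows from simple arithmetic: k chunks of N/k tokens each cost k * (N/k)^2 = N^2/k. The sketch below checks this with concrete numbers. It assumes equal-sized chunks, counts only pairwise attention within each chunk, and ignores the cost of the reduce step. The function names are hypothetical and not part of the `spl-llm` package.

```python
def full_attention_cost(n_tokens):
    """Pairwise attention cost for one pass over the whole document."""
    return n_tokens ** 2

def chunked_attention_cost(n_tokens, k_chunks):
    """Cost when the document is split into k equal chunks (map step only)."""
    chunk = n_tokens / k_chunks
    return k_chunks * chunk ** 2

n, k = 8000, 8
whole = full_attention_cost(n)            # 64,000,000 pairwise interactions
split = chunked_attention_cost(n, k)      # 8 chunks of 1,000 tokens each
ratio = whole / split
```

The ratio equals k, so the saving grows linearly with the number of chunks. Note that the chunks no longer attend to each other, which is the trade-off the Map-Reduce formulation accepts.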
cs.CL / 17 / 2602.21262
Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models
在影响下:量化大型语言模型中的说服力与警觉性
Abstract
With increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, it is important to understand the risks they introduce as advisors. To be useful advisors, LLMs must sift through large amounts of content, written with both benevolent and malicious intent, and then use this information to convince a user to take a specific action. This involves two social capacities: vigilance (the ability to determine which information to use, and which to discard) and persuasion (synthesizing the available evidence to make a convincing argument). While existing work has investigated these capacities in isolation, there has been little prior investigation of how these capacities may be linked. Here, we use a simple multi-turn puzzle-solving game, Sokoban, to study LLMs' abilities to persuade and be rationally vigilant towards other LLM agents. We find that puzzle-solving performance, persuasive capability, and vigilance are dissociable capacities in LLMs. Performing well on the game does not automatically mean a model can detect when it is being misled, even if the possibility of deception is explicitly mentioned. However, LLMs do consistently modulate their token use, using fewer tokens to reason when advice is benevolent and more when it is malicious, even if they are still persuaded to take actions leading them to failure. To our knowledge, our work presents the first investigation of the relationship between persuasion, vigilance, and task performance in LLMs, and suggests that monitoring all three independently will be critical for future work in AI safety.
Chinese Translation
随着大型语言模型(LLMs)在高风险人类决策领域的日益融入,理解它们作为顾问所带来的风险变得尤为重要。为了成为有效的顾问,LLMs必须从大量内容中筛选信息,这些内容既包含善意的意图,也包含恶意的意图,并利用这些信息说服用户采取特定行动。这涉及到两种社会能力:警觉性(判断使用哪些信息,丢弃哪些信息的能力)和说服力(综合现有证据以提出有说服力的论点)。尽管现有研究已孤立地探讨了这些能力,但关于这些能力如何相互关联的研究却很少。在此,我们使用一个简单的多轮解谜游戏——推箱子(Sokoban),来研究LLMs在说服和理性警觉方面对其他LLM代理的能力。我们发现,解谜表现、说服能力和警觉性在LLMs中是可分离的能力。在游戏中表现良好并不意味着模型能够检测到何时被误导,即使欺骗的可能性在提示中被明确提及。然而,LLMs确实会持续调节其标记使用,当建议是善意时使用较少的标记进行推理,而在建议是恶意时则使用更多的标记,即使它们仍然被说服采取导致失败的行动。根据我们的了解,我们的研究首次探讨了LLMs中说服力、警觉性与任务表现之间的关系,并建议独立监测这三者对于未来的人工智能安全工作至关重要。
cs.CL / 18 / 2602.21265
ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
ToolMATH:一个用于现实长时间跨度多工具推理的数学工具基准
Abstract
We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability. ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents, helping identify the control mechanisms required for robustness. ToolMATH contains roughly 8k questions and 12k tools; we provide an additional hard set, ToolMATHHard, with questions and tools. Our evaluation reveals that the key failure factor is the inability to reason, which leads to an accumulation of errors in intermediate results that constrains later decisions. Tool-list redundancy does not simply add noise; it amplifies small early deviations into irreversible execution drift. The benchmark highlights that when the intended capability is missing, distractor tools can sometimes serve as partial substitutes in solution paths, yet they can also mislead models into ungrounded tool trajectories. Finally, comparisons between tool-use protocols emphasize that improvements come less from local action selection and more from long-range plan coherence and disciplined use of observations.
Chinese Translation
我们介绍了 ToolMATH,这是一个以数学为基础的基准,评估在现实多工具环境中增强工具的语言模型,其中输出依赖于调用模式指定的工具并维持多步骤执行。它将数学问题转化为一个可控的、可检查正确性的基准,配备工具集,从而系统性地评估模型在(1)大型重叠工具目录和(2)缺乏预期能力下的可靠性。ToolMATH 提供了可操作的故障模式诊断证据,帮助识别增强工具代理所需的控制机制以确保稳健性。ToolMATH 大致包含 8000 个问题和 12000 个工具;我们还提供了一个额外的困难集 ToolMATHHard,包含问题和工具。我们的评估揭示了关键的失败因素是由于推理能力不足,导致中间结果错误的积累并限制后续决策。工具列表的冗余不仅仅增加了噪声,而是将小的早期偏差放大为不可逆的执行漂移。基准强调,当缺乏预期能力时,干扰工具有时可以作为解决路径中的部分替代品,但它们也可能误导模型进入无基础的工具轨迹。最后,工具使用协议之间的比较强调,改进更多来自于长远计划的一致性和对观察的严格使用,而非局部行动选择。
cs.CL / 19 / 2602.21346
Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
加权对齐DPO:一种改进安全对齐的原则性推理方法
Abstract
Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.
Chinese Translation
近期在对齐技术方面的进展,如监督微调(Supervised Fine-Tuning, SFT)、基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)和直接偏好优化(Direct Preference Optimization, DPO),提升了大型语言模型(Large Language Models, LLMs)的安全性。然而,这些LLMs仍然容易受到越狱攻击,这些攻击通过间接或欺骗性措辞掩盖有害意图。通过因果干预,我们实证表明,这种脆弱性源于缺乏深层推理的浅层对齐机制,通常在没有真正理解为何有害的情况下拒绝有害提示。为了减轻这种脆弱性,我们建议通过关注推理的后期训练来增强对齐。我们构建并发布了一个新颖的思维链(Chain-of-Thought, CoT)微调数据集,该数据集包含了以效用为导向和安全关键的提示,并附有逐步推理。对该数据集进行微调鼓励模型基于推理产生原则性的拒绝,超越了标准SFT基准。此外,受CoT微调中失败模式的启发,我们引入了加权对齐DPO(Alignment-Weighted DPO),通过对推理和最终答案部分分配不同的偏好权重,针对输出中最有问题的部分。这比传统DPO产生更细粒度、针对性的更新,并提高了对多种越狱策略的鲁棒性。在多个安全性和效用基准上的广泛实验表明,我们的方法在保持整体模型效用的同时,持续提高了对齐的鲁棒性。
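The abstract above describes Alignment-Weighted DPO only in words (different preference weights for the reasoning and final-answer segments), so the following is a minimal sketch under stated assumptions rather than the paper's loss. It assumes the per-segment log-probabilities are already summed, that the two segment margins are combined linearly, and that `w_reasoning`, `w_answer`, and `beta` are hypothetical parameter names introduced here for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def segment_margin(policy_chosen, ref_chosen, policy_rejected, ref_rejected):
    # Implicit-reward margin of one segment:
    # (chosen log-ratio) minus (rejected log-ratio).
    return (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)


def weighted_dpo_loss(reasoning, answer, w_reasoning=1.0, w_answer=1.0, beta=0.1):
    """Segment-weighted DPO loss for one preference pair.

    `reasoning` and `answer` are 4-tuples of summed token log-probs:
    (policy_chosen, ref_chosen, policy_rejected, ref_rejected).
    ASSUMPTION: the linear combination of segment margins is illustrative;
    the abstract does not specify how the weights enter the objective.
    """
    margin = (w_reasoning * segment_margin(*reasoning)
              + w_answer * segment_margin(*answer))
    return -log_sigmoid(beta * margin)
```

With both weights set to 1 the expression reduces to the standard DPO loss on the whole sequence, since the summed log-probabilities of the two segments add up to the sequence log-probability.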
cs.CL / 20 / 2602.21374
Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
小型语言模型在低资源语言中进行隐私保护的临床信息提取
Abstract
Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP). This study evaluates a two-step pipeline combining Aya-expanse-8B as a Persian-to-English translation model with five open-source small language models (SLMs) -- Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Qwen2.5-1.5B-Instruct, and Gemma-3-1B-it -- for binary extraction of 13 clinical features from 1,221 anonymized Persian transcripts collected at a cancer palliative care call center. Using a few-shot prompting strategy without fine-tuning, models were assessed on macro-averaged F1-score, Matthews Correlation Coefficient (MCC), sensitivity, and specificity to account for class imbalance. Qwen2.5-7B-Instruct achieved the highest overall performance (median macro-F1: 0.899; MCC: 0.797), while Gemma-3-1B-it showed the weakest results. Larger models (7B--8B parameters) consistently outperformed smaller counterparts in sensitivity and MCC. A bilingual analysis of Aya-expanse-8B revealed that translating Persian transcripts to English improved sensitivity, reduced missing outputs, and boosted metrics robust to class imbalance, though at the cost of slightly lower specificity and precision. Feature-level results showed reliable extraction of physiological symptoms across most models, whereas psychological complaints, administrative requests, and complex somatic features remained challenging. These findings establish a practical, privacy-preserving blueprint for deploying open-source SLMs in multilingual clinical NLP settings with limited infrastructure and annotation resources, and highlight the importance of jointly optimizing model scale and input language strategy for sensitive healthcare applications.
Chinese Translation
从低资源语言的医疗转录文本中提取临床信息仍然是医疗自然语言处理(NLP)中的一项重大挑战。本研究评估了一种两步流程,该流程将 Aya-expanse-8B 作为波斯语到英语的翻译模型,与五个开源小型语言模型(SLMs)结合使用——Qwen2.5-7B-Instruct、Llama-3.1-8B-Instruct、Llama-3.2-3B-Instruct、Qwen2.5-1.5B-Instruct 和 Gemma-3-1B-it——用于从 1,221 份在癌症姑息治疗呼叫中心收集的匿名波斯语转录文本中提取 13 个临床特征的二元提取。采用无需微调的少量提示策略,对模型在宏平均 F1 分数、马修斯相关系数(MCC)、灵敏度和特异性等指标上进行了评估,以考虑类别不平衡。Qwen2.5-7B-Instruct 达到了最高的整体表现(中位数宏 F1: 0.899; MCC: 0.797),而 Gemma-3-1B-it 的结果最弱。较大的模型(7B--8B 参数)在灵敏度和 MCC 上始终优于较小的模型。对 Aya-expanse-8B 的双语分析表明,将波斯语转录文本翻译成英语提高了灵敏度,减少了缺失输出,并增强了对类别不平衡具有鲁棒性的指标,尽管在特异性和精确度上略有下降。特征级结果显示大多数模型在生理症状的提取上表现可靠,而心理投诉、行政请求和复杂的躯体特征仍然具有挑战性。这些发现为在基础设施和注释资源有限的多语言临床 NLP 环境中部署开源 SLMs 建立了一个实用的、隐私保护的蓝图,并强调了在敏感医疗应用中共同优化模型规模和输入语言策略的重要性。
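The study above reports macro-averaged F1 and MCC "to account for class imbalance". A small sketch can show why those metrics are chosen over accuracy. The confusion counts in the usage note are made up for illustration and are not taken from the paper.

```python
import math


def f1(tp: int, fp: int, fn: int) -> float:
    # F1 of one class from its true positives, false positives, false negatives.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1_binary(tp: int, fp: int, fn: int, tn: int) -> float:
    # Macro-F1 for a binary feature: unweighted mean of the F1 of the
    # positive class and the F1 of the negative class.
    positive = f1(tp, fp, fn)
    negative = f1(tn, fn, fp)  # roles of the two classes swapped
    return (positive + negative) / 2


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    # Matthews Correlation Coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For a hypothetical rare feature with 5 positive and 95 negative transcripts, a model that always answers "absent" has `tp=0, fp=0, fn=5, tn=95`: accuracy is 0.95, yet `macro_f1_binary` is about 0.487 and `mcc` is 0, which exposes the failure that accuracy hides.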
cs.CL / 21 / 2602.21377
Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages
超越子符:一种适用于低资源和形态复杂语言的丰富字符嵌入
Abstract
Tokenization and sub-tokenization based models like word2vec, BERT and the GPTs are the state-of-the-art in natural language processing. Typically, these approaches have limitations with respect to their input representation. They fail to fully capture orthographic similarities and morphological variations, especially in highly inflected and under-resourced languages. To mitigate this problem, we propose to compute word vectors directly from character strings, integrating both semantic and syntactic information. We denote this transformer-based approach Rich Character Embeddings (RCE). Furthermore, we propose a hybrid model that combines transformer and convolutional mechanisms. Both vector representations can be used as a drop-in replacement for dictionary- and subtoken-based word embeddings in existing model architectures. They have the potential to improve performance for both large context-based language models like BERT and small models like word2vec for under-resourced and morphologically rich languages. We evaluate our approach on various tasks such as SWAG, declension prediction for inflected languages, and metaphor and chiasmus detection for various languages. Our experiments show that it outperforms traditional token-based approaches on limited data using OddOneOut and TopK metrics.
Chinese Translation
基于标记化和子标记化的模型,如 word2vec、BERT 和 GPT 系列,是自然语言处理领域的最新技术。然而,这些方法在输入表示方面存在局限性,无法充分捕捉正字法相似性和形态变化,尤其是在高度屈折和资源匮乏的语言中。为了解决这一问题,我们提出直接从字符字符串计算词向量的方法,整合语义和句法信息。我们将这种基于变换器的方法称为丰富字符嵌入(Rich Character Embeddings, RCE)。此外,我们还提出了一种结合变换器和卷积机制的混合模型。这两种向量表示可以作为现有模型架构中基于字典和子标记的词嵌入的替代方案,具有提升在资源匮乏和形态丰富语言中表现的潜力,适用于大型基于上下文的语言模型(如 BERT)和小型模型(如 word2vec)。我们在多个任务上评估了我们的方法,包括 SWAG、屈折语言的词尾预测、隐喻和对称修辞检测等。实验结果表明,在有限数据上,我们的方法在 OddOneOut 和 TopK 指标下优于传统的基于标记的方法。
cs.CL / 22 / 2602.21379
MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
MrBERT:通过词汇、领域和维度适应的现代多语言编码器
Abstract
We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code. Through targeted adaptation, this model family achieves state-of-the-art results on Catalan- and Spanish-specific tasks, while establishing robust performance across specialized biomedical and legal domains. To bridge the gap between research and production, we incorporate Matryoshka Representation Learning (MRL), enabling flexible vector sizing that significantly reduces inference and storage costs. Ultimately, the MrBERT family demonstrates that modern encoder architectures can be optimized for both localized linguistic excellence and efficient, high-stakes domain specialization. We open source the complete model family on Huggingface.
Chinese Translation
我们介绍了MrBERT,一个基于ModernBERT架构的150M-300M参数编码器系列,经过35种语言和代码的预训练。通过有针对性的适应,这个模型系列在加泰罗尼亚语和西班牙语特定任务上实现了最先进的结果,同时在专业的生物医学和法律领域也展现出强大的性能。为了缩小研究与生产之间的差距,我们引入了Matryoshka Representation Learning (MRL),使得向量尺寸灵活调整,从而显著降低推理和存储成本。最终,MrBERT系列展示了现代编码器架构可以在本地语言卓越性和高风险领域专业化之间进行优化。我们将在Huggingface上开源完整的模型系列。
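The "flexible vector sizing" that the abstract attributes to Matryoshka Representation Learning can be sketched as follows. The sketch assumes the usual MRL usage pattern, where a prefix of the full embedding is kept and re-normalized before similarity search. The helper names and the toy 4-dimensional vectors are hypothetical and are not taken from the MrBERT release.

```python
import math


def truncate_and_normalize(vec, dim):
    # Keep the first `dim` coordinates and rescale them to unit length.
    prefix = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm else prefix


def cosine(a, b):
    # Cosine similarity of two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy "full" embeddings (illustrative values only).
query = [0.9, 0.1, 0.3, -0.2]
doc = [0.8, 0.2, -0.1, 0.4]

small_query = truncate_and_normalize(query, 2)
small_doc = truncate_and_normalize(doc, 2)
similarity_small = cosine(small_query, small_doc)
```

Storing the 2-dimensional prefixes here halves the index size. How much retrieval quality survives truncation depends on the trained model and is not something this sketch can show.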
cs.CL / 23 / 2602.21461
VecGlypher: Unified Vector Glyph Generation with Language Models
VecGlypher:基于语言模型的统一向量字形生成
Abstract
Vector glyphs are the atomic units of digital typography, yet most learning-based pipelines still depend on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates high-fidelity vector glyphs directly from text descriptions or image exemplars. Given a style prompt, optional reference glyph images, and a target character, VecGlypher autoregressively emits SVG path tokens, avoiding raster intermediates and producing editable, watertight outlines in one pass. A typography-aware data and training recipe makes this possible: (i) a large-scale continuation stage on 39K noisy Envato fonts to master SVG syntax and long-horizon geometry, followed by (ii) post-training on 2.5K expert-annotated Google Fonts with descriptive tags and exemplars to align language and imagery with geometry; preprocessing normalizes coordinate frames, canonicalizes paths, de-duplicates families, and quantizes coordinates for stable long-sequence decoding. On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches a state-of-the-art performance, with marked gains over DeepVecFont-v2 and DualVector. Ablations show that model scale and the two-stage recipe are critical and that absolute-coordinate serialization yields the best geometry. VecGlypher lowers the barrier to font creation by letting users design with words or exemplars, and provides a scalable foundation for future multimodal design tools.
Chinese Translation
向量字形是数字排版的基本单元,但大多数基于学习的流程仍依赖于精心策划的示例表和光栅到向量的后处理,这限制了可访问性和可编辑性。我们介绍了VecGlypher,这是一种单一的多模态语言模型,可以直接从文本描述或图像示例生成高保真度的向量字形。给定样式提示、可选的参考字形图像和目标字符,VecGlypher以自回归方式发出SVG路径标记,避免了光栅中间过程,并在一次传递中生成可编辑的、密闭的轮廓。一个关注排版的数据和训练方案使这一切成为可能:(i)在39K嘈杂的Envato字体上进行的大规模延续阶段,以掌握SVG语法和长距离几何,随后是(ii)在2.5K专家注释的Google Fonts上进行的后训练,配有描述性标签和示例,以将语言和图像与几何对齐;预处理规范化坐标框架,标准化路径,去重字体系列,并量化坐标以实现稳定的长序列解码。在跨系列的OOD评估中,VecGlypher在仅文本生成方面显著优于通用的LLM和专门的向量字体基线,而图像参考生成则达到了最先进的性能,相较于DeepVecFont-v2和DualVector有显著提升。消融实验表明,模型规模和两阶段方案至关重要,绝对坐标序列化产生最佳几何。VecGlypher通过让用户使用文字或示例进行设计,降低了字体创作的门槛,并为未来的多模态设计工具提供了可扩展的基础。
cs.CL / 24 / 2602.21485
Evaluating the Usage of African-American Vernacular English in Large Language Models
评估大型语言模型中非裔美国人方言英语的使用情况
Abstract
In AI, most evaluations of natural language understanding tasks are conducted in standardized dialects such as Standard American English (SAE). In this work, we investigate how accurately large language models (LLMs) represent African American Vernacular English (AAVE). We analyze three LLMs to compare their usage of AAVE to the usage of humans who natively speak AAVE. We first analyzed interviews from the Corpus of Regional African American Language and TwitterAAE to identify the typical contexts where people use AAVE grammatical features such as ain't. We then prompted the LLMs to produce text in AAVE and compared the model-generated text to human usage patterns. We find that, in many cases, there are substantial differences between AAVE usage in LLMs and humans: LLMs usually underuse and misuse grammatical features characteristic of AAVE. Furthermore, through sentiment analysis and manual inspection, we found that the models replicated stereotypes about African Americans. These results highlight the need for more diversity in training data and the incorporation of fairness methods to mitigate the perpetuation of stereotypes.
Chinese Translation
在人工智能领域,大多数自然语言理解任务的评估都是在标准化方言中进行的,例如标准美式英语(Standard American English, SAE)。在本研究中,我们探讨了大型语言模型(Large Language Models, LLMs)在多大程度上准确地表现非裔美国人方言英语(African American Vernacular English, AAVE)。我们分析了三种LLM,以比较它们对AAVE的使用与以AAVE为母语的人的使用情况。我们首先分析了来自区域非裔美国人语言语料库(Corpus of Regional African American Language)和TwitterAAE的访谈,以识别人们使用AAVE语法特征(如 ain't)的典型语境。随后,我们提示LLM生成AAVE文本,并将模型生成的文本与人类使用模式进行了比较。我们发现,在许多情况下,LLM中AAVE的使用与人类存在显著差异:LLM通常低估和误用AAVE特有的语法特征。此外,通过情感分析和人工检查,我们发现模型复制了关于非裔美国人的刻板印象。这些结果突显了在训练数据中需要更多多样性以及纳入公平性方法以减轻刻板印象延续的必要性。
cs.CL / 25 / 2602.21543
Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
通过多向平行文本对齐增强多语言嵌入
Abstract
Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, fine-tuning the mE5 model on a small dataset with multi-way parallelism significantly improves bitext mining compared to fine-tuning without it, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.
Chinese Translation
多语言预训练通常缺乏明确的对齐信号,导致表示空间中的跨语言对齐效果不佳。在本研究中,我们展示了使用多种语言的多向平行语料库对标准预训练模型进行跨语言对齐训练,可以显著改善自然语言理解(NLU)任务中的多语言和跨语言表示。我们构建了一个多向平行数据集,使用现成的神经机器翻译(NMT)模型对英语文本进行翻译,涵盖六种目标语言,并通过对比学习实现了强大的跨语言对齐。这在MTEB基准测试中对XLM-Roberta和多语言BERT基础模型的多个任务中,显著提升了已见和未见语言的性能。与以英语为中心的(En-X)双语平行数据相比,使用多向平行语料库进行对比训练在双语文本挖掘(21.3%)、语义相似性(5.3%)和分类(28.4%)等方面取得了显著提升,其中X是从多个目标语言的池中抽样而来。此外,在小型数据集上对mE5模型进行多向平行微调,相较于不使用多向平行性的情况,显著改善了双语文本挖掘,强调了即使对于已经预训练的高质量句子嵌入模型,多向跨语言监督的重要性。
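The contrast between multi-way and English-centric supervision described above can be illustrated with a small InfoNCE-style sketch. The abstract only states that contrastive learning is used, so the exact objective, the `temperature` value, and the function names below are assumptions made for illustration.

```python
import math
from itertools import permutations


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchor, positive, negatives, temperature=0.05):
    # Cross-entropy of picking the positive among positive + negatives.
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


def multiway_loss(translations, negatives, temperature=0.05):
    # `translations`: embeddings of ONE sentence in several languages.
    # Every ordered pair of languages is used as (anchor, positive),
    # not only the pairs that involve English.
    pairs = list(permutations(translations, 2))
    return sum(info_nce(a, p, negatives, temperature) for a, p in pairs) / len(pairs)
```

With English plus six target languages, one multi-way tuple yields 7 × 6 = 42 ordered pairs, whereas En-X data for the same sentence yields only the pairs that contain English. That denser supervision is one plausible reading of why the multi-way setup helps.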
cs.CL / 26 / 2602.21608
MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification
MixSarc:用于隐含意义识别的孟加拉语-英语代码混合语料库
Abstract
Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce. Existing sentiment and sarcasm models largely focus on monolingual English or high-resource languages and struggle with transliteration variation, cultural references, and intra-sentential language switching. To address this gap, we introduce MixSarc, the first publicly available Bangla-English code-mixed corpus for implicit meaning identification. The dataset contains 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity. We construct the corpus through targeted social media collection, systematic filtering, and multi-annotator validation. We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting. Results show strong performance on humor detection but substantial degradation on sarcasm, offense, and vulgarity due to class imbalance and pragmatic complexity. Zero-shot models achieve competitive micro-F1 scores but low exact match accuracy. Further analysis reveals that over 42% of negative sentiment instances in an external dataset exhibit sarcastic characteristics. MixSarc provides a foundational resource for culturally aware NLP and supports more reliable multi-label modeling in code-mixed environments.
Chinese Translation
孟加拉语-英语代码混合在南亚社交媒体上广泛存在,但在这一背景下用于隐含意义识别的资源仍然稀缺。现有的情感和讽刺模型主要集中于单语英语或高资源语言,且在音译变体、文化参考和句内语言切换方面面临挑战。为了解决这一问题,我们推出了MixSarc,这是第一个公开可用的孟加拉语-英语代码混合语料库,用于隐含意义识别。该数据集包含9,087个手动标注的句子,标记了幽默、讽刺、冒犯和粗俗性。我们通过有针对性的社交媒体收集、系统过滤和多标注者验证构建了该语料库。我们对基于变换器的模型进行了基准测试,并在结构化提示下评估了零样本大型语言模型。结果显示在幽默检测上表现良好,但在讽刺、冒犯和粗俗性方面由于类别不平衡和语用复杂性而显著下降。零样本模型在微F1分数上表现竞争力,但准确匹配率较低。进一步分析显示,在一个外部数据集中,超过42%的负面情感实例表现出讽刺特征。MixSarc为文化敏感的自然语言处理提供了基础资源,并支持在代码混合环境中更可靠的多标签建模。
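The finding above, competitive micro-F1 but low exact match accuracy, follows from how the two multi-label metrics are defined, as this sketch shows. The four gold/predicted label sets are toy data invented for illustration and are not drawn from MixSarc.

```python
def micro_f1(gold, pred):
    # Pool true positives, false positives, false negatives over all
    # sentences and labels, then compute a single F1.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def exact_match(gold, pred):
    # A sentence counts only if its whole label set is predicted correctly.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy example: the model finds one correct label per sentence
# but misses the second label whenever there is one.
gold = [{"humor", "sarcasm"}, {"sarcasm", "offensive"}, {"humor"}, {"vulgar", "offensive"}]
pred = [{"humor"}, {"sarcasm"}, {"humor"}, {"offensive"}]

micro = micro_f1(gold, pred)     # 8 / 11, roughly 0.727
exact = exact_match(gold, pred)  # 0.25
```

Partial credit keeps micro-F1 high, while exact match penalizes every sentence with a single missed label, so the gap widens as sentences carry more co-occurring labels.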
cs.CL / 27 / 2602.21619
When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning
多即是少:视觉空间推理中的空间与常识信息的系统分析
Abstract
Visual spatial reasoning (VSR) remains challenging for modern vision-language models (VLMs), despite advances in multimodal architectures. A common strategy is to inject additional information at inference time, such as explicit spatial cues, external commonsense knowledge, or chain-of-thought (CoT) reasoning instructions. However, it remains unclear when such information genuinely improves reasoning and when it introduces noise. In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks. We examine (i) the type and number of spatial contexts, (ii) the amount and relevance of injected commonsense knowledge, and (iii) the interaction between spatial grounding and CoT prompting. Our results reveal a consistent pattern: more information does not necessarily yield better reasoning. Targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and CoT prompting improves accuracy only when spatial grounding is sufficiently precise. These findings highlight the importance of selective, task-aligned information injection and provide practical guidance for designing reliable multimodal reasoning pipelines.
Chinese Translation
尽管多模态架构取得了进展,视觉空间推理(VSR)仍然对现代视觉-语言模型(VLMs)构成挑战。一种常见策略是在推理时注入额外信息,例如显式空间线索、外部常识知识或思维链(CoT)推理指令。然而,目前尚不清楚何时这些信息真正改善推理,何时又引入噪声。本文对三种代表性VLMs和两个公共基准进行假设驱动的VSR信息注入分析。我们考察了(i)空间上下文的类型和数量,(ii)注入的常识知识的数量和相关性,以及(iii)空间基础与CoT提示之间的互动。我们的结果揭示了一个一致的模式:更多的信息并不一定带来更好的推理。针对性的单一空间线索优于多上下文聚合,过多或相关性较弱的常识知识会降低性能,而只有当空间基础足够精确时,CoT提示才能提高准确性。这些发现强调了选择性、任务对齐的信息注入的重要性,并为设计可靠的多模态推理管道提供了实用指导。
cs.CL / 28 / 2602.21628
RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
RuCL:基于分层评分标准的多模态大语言模型推理课程学习
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubric-based approaches offer fine-grained supervision signals, they suffer from high computational costs of instance-level generation and inefficient training dynamics caused by treating all rubrics as equally learnable. In this paper, we propose Stratified Rubric-based Curriculum Learning (RuCL), a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model's competence. By dynamically adjusting rubric weights during training, RuCL guides the model from mastering foundational perception to tackling advanced logical reasoning. Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
Chinese Translation
可验证奖励的强化学习(RLVR)已成为增强多模态大语言模型(MLLMs)推理能力的主要范式。然而,单靠结果监督可能导致奖励黑客行为,模型学习到虚假的推理模式以满足最终答案检查。尽管近期基于评分标准的方法提供了细粒度的监督信号,但它们在实例级生成的高计算成本和将所有评分标准视为同等可学习所导致的低效训练动态方面存在不足。在本文中,我们提出了基于分层评分标准的课程学习(RuCL),这是一个通过将重点从数据选择转向奖励设计来重新构建课程学习的新框架。RuCL生成适用于广泛应用的通用评分标准,并根据模型的能力对其进行分层。通过在训练过程中动态调整评分标准的权重,RuCL引导模型从掌握基础感知到应对高级逻辑推理。在各种视觉推理基准上的广泛实验表明,RuCL相较于Qwen2.5-VL-7B模型平均提升了7.83%,达到了60.06%的最新准确率。
cs.CL / 29 / 2602.21638
Multi-dimensional Assessment and Explainable Feedback for Counselor Responses to Client Resistance in Text-based Counseling with LLMs
基于大型语言模型的文本咨询中对客户抵抗的多维评估与可解释反馈
Abstract
Effectively addressing client resistance is a sophisticated clinical skill in psychological counseling, yet practitioners often lack timely and scalable supervisory feedback to refine their approaches. Although current NLP research has examined overall counseling quality and general therapeutic skills, it fails to provide granular evaluations of high-stakes moments where clients exhibit resistance. In this work, we present a comprehensive pipeline for the multi-dimensional evaluation of human counselors' interventions specifically targeting client resistance in text-based therapy. We introduce a theory-driven framework that decomposes counselor responses into four distinct communication mechanisms. Leveraging this framework, we curate and share an expert-annotated dataset of real-world counseling excerpts, pairing counselor-client interactions with professional ratings and explanatory rationales. Using this data, we perform full-parameter instruction tuning on a Llama-3.1-8B-Instruct backbone to model fine-grained evaluative judgments of response quality and to generate the explanations underlying them. Experimental results show that our approach can effectively distinguish the quality of different communication mechanisms (77-81% F1), substantially outperforming GPT-4o and Claude-3.5-Sonnet (45-59% F1). Moreover, the model produces high-quality explanations that closely align with expert references and receive near-ceiling ratings from human experts (2.8-2.9/3.0). A controlled experiment with 43 counselors further confirms that receiving this AI-generated feedback significantly improves counselors' ability to respond effectively to client resistance.
Chinese Translation
有效应对客户抵抗是心理咨询中的一项复杂临床技能,但从业者往往缺乏及时且可扩展的监督反馈来完善其方法。尽管当前的自然语言处理(NLP)研究已经考察了整体咨询质量和一般治疗技能,但未能提供对客户表现出抵抗的关键时刻的细致评估。在本研究中,我们提出了一种全面的流程,用于对人类咨询师在文本治疗中针对客户抵抗的干预进行多维评估。我们引入了一个理论驱动的框架,将咨询师的回应分解为四种不同的沟通机制。利用该框架,我们整理并分享了一个专家注释的数据集,包含真实世界的咨询摘录,将咨询师与客户的互动与专业评分及解释性理由相结合。使用这些数据,我们在 Llama-3.1-8B-Instruct 主干上进行全参数指令调优,以建模对回应质量的细致评估判断并生成相应的解释。实验结果表明,我们的方法能够有效区分不同沟通机制的质量(F1 值为 77-81%),显著优于 GPT-4o 和 Claude-3.5-Sonnet(F1 值为 45-59%)。此外,该模型生成的高质量解释与专家参考高度一致,并获得人类专家接近满分的评分(2.8-2.9/3.0)。对 43 名咨询师进行的对照实验进一步确认,接受这些 AI 生成的反馈显著提高了咨询师有效应对客户抵抗的能力。
cs.CL / 30 / 2602.21646
Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
可扩展的多语言多模态机器翻译与语音-文本融合
Abstract
Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text pairs. The speech modality overcomes this limitation due to its natural alignment with text and the abundance of existing speech datasets, which enable scalable language coverage. In this paper, we propose a Speech-guided Machine Translation (SMT) framework that integrates speech and text as fused inputs into an MLLM to improve translation quality. To mitigate reliance on low-resource data, we introduce a Self-Evolution Mechanism. The core components of this framework include a text-to-speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech samples and iteratively optimizing itself using positive samples. Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results. Furthermore, on general machine translation datasets, particularly FLORES-200, it achieves average state-of-the-art performance in 108 translation directions. Ablation studies on CoVoST-2 confirm that differences between synthetic and authentic speech have negligible impact on translation quality. The code and models are released at https://github.com/yxduir/LLM-SRT.
Chinese Translation
多模态大型语言模型(MLLMs)通过整合多模态信息在提升翻译性能方面取得了显著成功。然而,现有研究主要集中在图像引导的方法上,其适用性受到多语言图像-文本对稀缺的限制。语音模态克服了这一限制,因为其与文本的自然对齐以及现有语音数据集的丰富性,使得语言覆盖具有可扩展性。在本文中,我们提出了一种语音引导的机器翻译(SMT)框架,将语音和文本作为融合输入整合到MLLM中,以提高翻译质量。为了减少对低资源数据的依赖,我们引入了一种自我进化机制。该框架的核心组件包括一个文本到语音模型,负责生成合成语音,以及一个能够对合成语音样本进行分类并利用正样本迭代优化自身的MLLM。实验结果表明,我们的框架在Multi30K多模态机器翻译基准上超越了所有现有方法,达到了新的最先进结果。此外,在一般机器翻译数据集上,特别是在FLORES-200上,它在108个翻译方向上实现了平均最先进的性能。对CoVoST-2的消融研究确认,合成语音与真实语音之间的差异对翻译质量的影响微乎其微。代码和模型已发布在https://github.com/yxduir/LLM-SRT。
cs.CL / 31 / 2602.21647
Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration
降低低资源S2TT中的结构噪声:优化的尼泊尔语-英语级联管道及标点恢复
Abstract
This paper presents and evaluates an optimized cascaded Nepali speech-to-English text translation (S2TT) system, focusing on mitigating structural noise introduced by Automatic Speech Recognition (ASR). We first establish highly proficient ASR and NMT components: a Wav2Vec2-XLS-R-300m model achieved a state-of-the-art 2.72% CER on OpenSLR-54, and a multi-stage fine-tuned MarianMT model reached a 28.32 BLEU score on the FLORES-200 benchmark. We empirically investigate the influence of punctuation loss, demonstrating that unpunctuated ASR output significantly degrades translation quality, causing a massive 20.7% relative BLEU drop on the FLORES benchmark. To overcome this, we propose and evaluate an intermediate Punctuation Restoration Module (PRM). The final S2TT pipeline was tested across three configurations on a custom dataset. The optimal configuration, which applied the PRM directly to ASR output, achieved a 4.90 BLEU point gain over the direct ASR-to-NMT baseline (BLEU 36.38 vs. 31.48). This improvement was validated by human assessment, which confirmed the optimized pipeline's superior Adequacy (3.673) and Fluency (3.804). This work validates that targeted punctuation restoration is the most effective intervention for mitigating structural noise in the Nepali S2TT pipeline. It establishes an optimized baseline and demonstrates a critical architectural insight for developing cascaded speech translation systems for similar low-resource languages.
Chinese Translation
本文提出并评估了一种优化的尼泊尔语语音到英语文本翻译(S2TT)系统,重点在于减轻自动语音识别(ASR)引入的结构噪声。我们首先建立了高效的ASR和神经机器翻译(NMT)组件:Wav2Vec2-XLS-R-300m模型在OpenSLR-54上达到了先进的2.72%字符错误率(CER),而经过多阶段微调的MarianMT模型在FLORES-200基准上达到了28.32的BLEU分数。我们实证研究了标点缺失的影响,证明无标点的ASR输出显著降低了翻译质量,在FLORES基准上造成了20.7%的相对BLEU下降。为了解决这个问题,我们提出并评估了一个中间的标点恢复模块(PRM)。最终的S2TT管道在一个自定义数据集上进行了三种配置的测试。最佳配置是将PRM直接应用于ASR输出,与直接的ASR到NMT基线相比,获得了4.90的BLEU分数提升(BLEU 36.38对比31.48)。这一改进得到了人类评估的验证,确认了优化管道在充分性(Adequacy)和流畅性(Fluency)方面的优越性(3.673和3.804)。这项工作验证了针对性的标点恢复是减轻尼泊尔语S2TT管道中结构噪声的最有效干预措施。它建立了一个优化的基线,并展示了为类似低资源语言开发级联语音翻译系统的重要架构见解。
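The character error rate and the BLEU gain quoted above can be made concrete with a short sketch. CER is computed here as Levenshtein distance over Unicode code points divided by the reference length, which is an assumption: the paper may normalize Devanagari text or count grapheme clusters differently. The BLEU figures are the ones stated in the abstract.

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Levenshtein distance with a rolling row (insert, delete, substitute = 1).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    # Character error rate relative to the reference length.
    return edit_distance(ref, hyp) / len(ref)


# BLEU scores reported in the abstract for the custom dataset.
baseline_bleu, prm_bleu = 31.48, 36.38
absolute_gain = round(prm_bleu - baseline_bleu, 2)  # 4.9 BLEU points
relative_gain = (prm_bleu - baseline_bleu) / baseline_bleu
```

The absolute gain matches the 4.90 points stated in the abstract, and relative to the 31.48 baseline it corresponds to an improvement of roughly 15.6%.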
cs.CL / 32 / 2602.21652
Sparsity Induction for Accurate Post-Training Pruning of Large Language Models
大语言模型准确后训练剪枝的稀疏性诱导
Abstract
Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency. Post-training sparsity (PTS), which reduces model cost by removing weights from dense networks, is an effective approach. However, native dense matrices lack high sparsity, making existing approaches that directly remove weights disrupt model states, resulting in unsatisfactory performance recovery even with post-tuning. We propose Sparsity Induction, which promotes models toward higher sparsity at both distribution and feature levels before pruning, to push the limits of PTS. At the distribution level, we enhance distributional sparsity through mathematically equivalent scaling transformations, which are fully absorbable and incur no extra parameters or inference-time overhead. At the feature level, we introduce Spectral Norm Loss to promote feature sparsity from a low-rank perspective. Experiments across diverse model architectures and tasks demonstrate that our method further enhances sparsity-friendliness, achieving superior pruning performance over existing approaches.
Chinese Translation
大语言模型在文本生成方面展现了强大的能力,但其日益增长的参数规模在计算和内存效率上带来了挑战。后训练稀疏性(Post-training sparsity, PTS)通过从稠密网络中移除权重来降低模型成本,是一种有效的方法。然而,原生的稠密矩阵缺乏高稀疏性,使得现有直接移除权重的方法会破坏模型状态,即使经过后调优,性能恢复也不尽如人意。我们提出了稀疏性诱导(Sparsity Induction),旨在在剪枝之前在分布和特征层面上促进模型朝向更高的稀疏性,以推动PTS的极限。在分布层面,我们通过数学等效的缩放变换增强分布稀疏性,这些变换是完全可吸收的,并且不增加额外的参数或推理时间开销。在特征层面,我们引入了谱范数损失(Spectral Norm Loss),从低秩的角度促进特征稀疏性。针对多种模型架构和任务的实验表明,我们的方法进一步增强了稀疏友好性,在剪枝性能上优于现有方法。
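The "mathematically equivalent scaling transformations" that are "fully absorbable" can be illustrated on two stacked linear maps. This is only a sketch of the equivalence itself: it assumes no nonlinearity between the two layers, uses hand-picked positive scales `s`, and does not reproduce how the paper chooses scales to raise sparsity.

```python
def matvec(W, x):
    # Dense matrix-vector product.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def scale_columns(W, s):
    # W @ diag(s): multiply input channel j of this layer by s[j].
    return [[w * sj for w, sj in zip(row, s)] for row in W]


def scale_rows_inverse(W_prev, s):
    # diag(1/s) @ W_prev: divide output channel i of the previous layer
    # by s[i], so the scaling is absorbed into existing weights.
    return [[w / si for w in row] for row, si in zip(W_prev, s)]


W1 = [[1.0, 2.0], [3.0, 4.0]]    # previous layer
W2 = [[0.5, -1.0], [2.0, 0.1]]   # layer to be pruned
x = [1.0, -1.0]
s = [4.0, 0.25]                  # hypothetical per-channel scales

original = matvec(W2, matvec(W1, x))
rescaled = matvec(scale_columns(W2, s), matvec(scale_rows_inverse(W1, s), x))
```

Both paths compute the same output because W2 · diag(s) · diag(1/s) · W1 = W2 · W1, so the rescaled network needs no extra parameters or inference-time work, while the magnitude distribution of the weights that a pruning criterion sees has changed.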
cs.CL / 33 / 2602.21669
DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation
DWA-KD:双空间加权与时间扭曲对齐用于跨标记器知识蒸馏
Abstract
Knowledge Distillation (KD) has emerged as a crucial technique for compressing Large Language Models (LLMs). Although existing cross-tokenizer KD methods have made notable progress, their effectiveness remains constrained by suboptimal alignment across sequence and vocabulary levels. To address these limitations, we introduce Dual-Space Weighting and Time-Warped Alignment (DWA-KD), a novel cross-tokenizer distillation framework that enhances token-wise distillation through dual-space entropy-based weighting and achieves precise sequence-level alignment by leveraging both lexical and semantic information. At the token level, DWA-KD maps teacher representations into the student space and vice versa, performing dual-space KD via Kullback-Leibler divergence (KL). The process is modulated by dual-space weights that up-weight tokens where the student is uncertain and the teacher is confident, thereby focusing learning on informative tokens rather than treating all positions equally. At the sequence level, DWA-KD applies Soft Dynamic Time Warping (Soft-DTW) to both the embedding and final hidden-state layers, enabling robust alignment of lexical and contextual semantics between teacher and student sequences. Extensive experiments across diverse NLP benchmarks demonstrate that DWA-KD outperforms state-of-the-art KD baselines, while ablation studies confirm the complementary contributions of entropy-based token weighting and embedding and final hidden state layer Soft-DTW alignment.
Chinese Translation
知识蒸馏(Knowledge Distillation, KD)已成为压缩大型语言模型(Large Language Models, LLMs)的关键技术。尽管现有的跨标记器KD方法取得了显著进展,但其有效性仍受到序列和词汇层面不理想对齐的限制。为了解决这些局限性,我们提出了双空间加权与时间扭曲对齐(Dual-Space Weighting and Time-Warped Alignment, DWA-KD),这是一种新颖的跨标记器蒸馏框架,通过基于双空间熵的加权增强了逐标记蒸馏,并通过利用词汇和语义信息实现了精确的序列级对齐。在标记级别,DWA-KD将教师表示映射到学生空间,反之亦然,通过Kullback-Leibler散度(KL)执行双空间KD。该过程由双空间权重调节,增强学生不确定而教师自信的标记,从而将学习重点放在信息丰富的标记上,而不是平等对待所有位置。在序列级别,DWA-KD对嵌入层和最终隐藏状态层应用软动态时间扭曲(Soft Dynamic Time Warping, Soft-DTW),使教师和学生序列之间的词汇和上下文语义能够稳健对齐。在多种自然语言处理基准上的广泛实验表明,DWA-KD超越了最先进的KD基线,而消融研究确认了基于熵的标记加权以及嵌入和最终隐藏状态层Soft-DTW对齐的互补贡献。
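The sequence-level alignment above relies on Soft-DTW, which can be sketched with the standard recurrence. The squared Euclidean cost and the smoothing value `gamma` are assumptions made here, and the sketch covers only the alignment score, not DWA-KD's dual-space weighting.

```python
import math


def softmin(values, gamma):
    # Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably.
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma=1.0):
    # x, y: sequences of equal-dimension vectors, possibly of different length
    # (as happens when teacher and student tokenizers split text differently).
    n, m = len(x), len(y)
    inf = float("inf")
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum((a - b) ** 2 for a, b in zip(x[i - 1], y[j - 1]))
            R[i][j] = cost + softmin((R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]), gamma)
    return R[n][m]
```

Because the smooth minimum never exceeds the hard minimum, the value is at most the classical DTW cost and can be slightly negative for perfectly alignable sequences. Unlike hard DTW it is differentiable in the inputs, which is what makes it usable as a training loss.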
cs.CL / 34 / 2602.21720
Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
使用强化学习评估递归数字系统中规律性与可学习性之间的关系
Abstract
Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular. Following prior work that relates cross-linguistic tendencies to biases in learning, we ask whether regular systems are common because regularity facilitates learning. Adopting methods from the Reinforcement Learning literature, we confirm that highly regular human(-like) systems are easier to learn than unattested but possible irregular systems. This asymmetry emerges under the natural assumption that recursive numeral systems are designed for generalisation from limited data to represent all integers exactly. We also find that the influence of regularity on learnability is absent for unnatural, highly irregular systems, whose learnability is influenced instead by signal length, suggesting that different pressures may influence learnability differently in different parts of the space of possible numeral systems. Our results contribute to the body of work linking learnability to cross-linguistic prevalence.
Chinese Translation
人类递归数字系统(即诸如英语十进制数字等计数系统)与许多其他语法系统一样,具有高度的规律性。在之前的研究中,跨语言的倾向与学习中的偏差相关,我们提出一个问题:规律性是否使得规律系统更为普遍,因为规律性促进了学习。采用来自强化学习文献的方法,我们确认高度规律的人类(类)系统比未被证实但可能的非规律系统更易于学习。这种不对称性在一个自然假设下出现,即递归数字系统是为了从有限数据中进行概括,以准确表示所有整数。我们还发现,对于不自然的高度非规律系统,规律性对可学习性的影响缺失,其可学习性反而受到信号长度的影响,这表明不同的压力可能在不同的可能数字系统空间中以不同方式影响可学习性。我们的结果为将可学习性与跨语言普遍性联系起来的研究提供了贡献。
cs.CL / 35 / 2602.21728
Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
探索图谱:通过路径优化奖励建模激励大型语言模型在知识图谱上的自主探索
Abstract
The reasoning process of Large Language Models (LLMs) is often plagued by hallucinations and missing facts in question-answering tasks. A promising solution is to ground LLMs' answers in verifiable knowledge sources, such as Knowledge Graphs (KGs). Prevailing KG-enhanced methods typically constrain LLM reasoning either by enforcing rules during generation or by imitating paths from a fixed set of demonstrations. However, they naturally confine the reasoning patterns of LLMs within the scope of prior experience or fine-tuning data, limiting their generalizability to out-of-distribution graph reasoning problems. To tackle this problem, in this paper, we propose Explore-on-Graph (EoG), a novel framework that encourages LLMs to autonomously explore a more diverse reasoning space on KGs. To incentivize exploration and discovery of novel reasoning paths, we propose to introduce reinforcement learning during training, whose reward is the correctness of the reasoning paths' final answers. To enhance the efficiency and meaningfulness of the exploration, we propose to incorporate path information as additional reward signals to refine the exploration process and reduce futile efforts. Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but even closed-source LLMs.
Chinese Translation
大型语言模型(LLMs)的推理过程常常受到幻觉和缺失事实的困扰,尤其在问答任务中。一个有前景的解决方案是将LLMs的答案基于可验证的知识来源,如知识图谱(KGs)。现有的KG增强方法通常通过在生成过程中强制执行规则或模仿固定示例集中的路径来限制LLM的推理。然而,这些方法自然地将LLM的推理模式限制在先前经验或微调数据的范围内,从而限制了它们在分布外图推理问题上的泛化能力。为了解决这个问题,本文提出了探索图谱(Explore-on-Graph, EoG),一个新的框架,鼓励LLMs在知识图谱上自主探索更广泛的推理空间。为了激励探索和发现新的推理路径,我们建议在训练过程中引入强化学习,其奖励是推理路径最终答案的正确性。为了提高探索的效率和意义,我们建议将路径信息作为额外的奖励信号,以优化探索过程并减少无效努力。在五个KGQA基准数据集上的大量实验表明,尽我们所知,我们的方法达到了最先进的性能,超越了开源和闭源的LLMs。
cs.CL / 36 / 2602.21741
Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
鲁棒的长形式孟加拉语语音处理:自动语音识别与说话人分离
Abstract
We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. For ASR we achieve a best private Word Error Rate (WER) of 0.37738 and public WER of 0.36137, combining a BengaliAI fine-tuned Whisper medium model with Demucs source separation for vocal isolation, silence-boundary chunking, and carefully tuned generation hyperparameters. For speaker diarization we reach a best private Diarization Error Rate (DER) of 0.27671 and public DER of 0.20936 by replacing the default segmentation model inside the pyannote.audio pipeline with a Bengali-fine-tuned variant, pairing it with wespeaker-voxceleb-resnet34-LM embeddings and centroid-based agglomerative clustering. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.
Chinese Translation
我们描述了提交给Kaggle的DL Sprint 4.0比赛的孟加拉语长形式语音识别(ASR)和说话人分离的端到端系统。孟加拉语在这两项任务中都面临着重大挑战:丰富的音素库存、显著的方言变异、与英语的频繁代码混合,以及大规模标注语料库的相对匮乏。在ASR方面,我们实现了最佳的私有词错误率(WER)为0.37738,公共WER为0.36137,结合了经过孟加拉AI微调的Whisper中型模型与Demucs源分离技术,用于声源隔离、静音边界分块,以及精心调整的生成超参数。在说话人分离方面,我们通过将pyannote.audio管道中的默认分割模型替换为经过孟加拉语微调的变体,搭配wespeaker-voxceleb-resnet34-LM嵌入和基于质心的聚合聚类,达到了最佳的私有分离错误率(DER)为0.27671,公共DER为0.20936。我们的实验表明,特定领域的分割组件微调、声源分离和自然静音感知分块是低资源孟加拉语语音处理中最具影响力的设计选择。
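For readers unfamiliar with the ASR metric quoted above, the following sketch computes Word Error Rate in the standard way: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It assumes plain whitespace tokenization. Any text normalization the competition applies before scoring is not described in the abstract and is not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 0.377 corresponds to roughly 38 word errors per 100 reference words.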
cs.CL / 37 / 2602.21763
Improving Implicit Discourse Relation Recognition with Natural Language Explanations from LLMs
利用大型语言模型的自然语言解释提升隐含语篇关系识别
Abstract
Implicit Discourse Relation Recognition (IDRR) remains a challenging task due to the requirement for deep semantic understanding in the absence of explicit discourse markers. A further limitation is that existing methods only predict relations without providing any supporting explanations. Recent advances in large language models (LLMs) have shown strong reasoning capabilities in both deep language understanding and natural language explanation generation. In this work, we propose a simple yet effective approach to distill the reasoning capabilities of LLMs into lightweight IDRR models to improve both performance and interpretability. Specifically, we first prompt an LLM to generate explanations for each training instance conditioned on its gold label. Then, we introduce a novel classification-generation framework that jointly performs relation prediction and explanation generation, and train it with the additional supervision of LLM-generated explanations. Our framework is plug-and-play, enabling easy integration with most existing IDRR models. Experimental results on PDTB demonstrate that our approach significantly improves IDRR performance, while human evaluation further confirms that the generated explanations enhance model interpretability. Furthermore, we validate the generality of our approach on sentiment classification and natural language inference.
Chinese Translation
隐含语篇关系识别(IDRR)仍然是一项具有挑战性的任务,因为在缺乏明确语篇标记的情况下,需要深层次的语义理解。现有方法的另一个局限性在于,它们仅预测关系而不提供任何支持性解释。最近,大型语言模型(LLMs)的进展显示出在深层语言理解和自然语言解释生成方面的强大推理能力。在本研究中,我们提出了一种简单而有效的方法,将LLMs的推理能力提炼到轻量级的IDRR模型中,以提高性能和可解释性。具体而言,我们首先提示LLM为每个训练实例生成基于其真实标签的解释。然后,我们引入了一种新颖的分类-生成框架,该框架联合执行关系预测和解释生成,并通过LLM生成的解释进行额外的监督训练。我们的框架具有即插即用的特性,便于与大多数现有的IDRR模型轻松集成。在PDTB上的实验结果表明,我们的方法显著提高了IDRR的性能,而人类评估进一步确认生成的解释增强了模型的可解释性。此外,我们还验证了我们方法在情感分类和自然语言推理中的通用性。
cs.CL / 38 / 2602.21786
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
D-COT:小型语言模型高效推理的有序思维链学习
Abstract
Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as for fact-checking and for multi-perspective exploration -- as auxiliary scaffolding during training. By optimizing the CoT trajectory, D-CoT suppresses reasoning drift and simultaneously achieves token reduction and performance improvement. We demonstrate the efficacy of our approach on Qwen3-8B: with only 5,000 training samples, D-CoT significantly boosts accuracy on GPQA-diamond by 9.9% and MMLU-Pro (0-shot) by 9.1%, while drastically reducing computational costs. Furthermore, we confirm that the model internalizes this disciplined thought structure, maintaining high performance even without explicit control tags during inference.
Chinese Translation
从大型语言模型(LLMs)中提取的思维链(CoT)蒸馏常常导致小型语言模型(SLMs)出现“过度思考”,从而导致性能下降和过多的标记消耗。在本研究中,我们提出了有序思维链(D-CoT),这是一个新颖的框架,通过使用控制标签(如 用于事实核查和 用于多角度探索)作为训练过程中的辅助支架,强制执行结构化推理过程。通过优化思维链轨迹,D-CoT 抑制了推理漂移,同时实现了标记减少和性能提升。我们在 Qwen3-8B 上展示了我们方法的有效性:仅使用 5,000 个训练样本,D-CoT 在 GPQA-diamond 上将准确率提高了 9.9%,在 MMLU-Pro(0-shot)上提高了 9.1%,同时大幅降低了计算成本。此外,我们确认模型内化了这种有序思维结构,即使在推理过程中没有显式控制标签,仍能保持高性能。
cs.CL / 39 / 2602.21854
FewMMBench: A Benchmark for Multimodal Few-Shot Learning
FewMMBench:多模态少样本学习基准测试
Abstract
As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting. Covering a diverse suite of multimodal understanding tasks, from attribute recognition to temporal reasoning, FewMMBench enables systematic analysis across task types, model families, and prompting strategies. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning. Retrieval-based demonstrations and increased context size also yield limited gains. These results highlight FewMMBench as a rigorous testbed for diagnosing and advancing few-shot capabilities in multimodal LLMs. The data is available at: https://huggingface.co/datasets/mustafaa/FewMMBench
Chinese Translation
随着多模态大型语言模型(MLLMs)在处理交错的图像-文本数据方面的进展,评估它们的少样本学习能力仍然是一个开放的挑战。本文介绍了FewMMBench,这是一个综合基准,旨在评估MLLMs在少样本条件下的表现,重点关注上下文学习(In-Context Learning, ICL)和思维链(Chain-of-Thought, CoT)提示。FewMMBench涵盖了一系列多模态理解任务,从属性识别到时间推理,能够在任务类型、模型系列和提示策略之间进行系统分析。我们评估了来自六个模型系列的26个开放权重的MLLMs,在零样本、少样本和CoT增强的少样本设置下进行测试。我们的研究结果表明,经过指令调优的模型在零样本表现上表现强劲,但在增加额外示例或CoT推理时收益有限,甚至可能出现退步。基于检索的示例和增加的上下文大小也仅带来有限的提升。这些结果突显了FewMMBench作为一个严格的测试平台,用于诊断和推动多模态大型语言模型的少样本能力。数据可在以下链接获取:https://huggingface.co/datasets/mustafaa/FewMMBench
cs.CL / 40 / 2602.21862
Personalized Graph-Empowered Large Language Model for Proactive Information Access
个性化图谱增强的大型语言模型用于主动信息获取
Abstract
Since individuals may struggle to recall all life details and often confuse events, establishing a system to assist users in recalling forgotten experiences is essential. While numerous studies have proposed memory recall systems, these primarily rely on deep learning techniques that require extensive training and often face data scarcity due to the limited availability of personal lifelogs. As lifelogs grow over time, systems must also adapt quickly to newly accumulated data. Recently, large language models (LLMs) have demonstrated remarkable capabilities across various tasks, making them promising for personalized applications. In this work, we present a framework that leverages LLMs for proactive information access, integrating personal knowledge graphs to enhance the detection of access needs through a refined decision-making process. Our framework offers high flexibility, enabling the replacement of base models and the modification of fact retrieval methods for continuous improvement. Experimental results demonstrate that our approach effectively identifies forgotten events, supporting users in recalling past experiences more efficiently.
Chinese Translation
由于个体可能难以回忆起生活中的所有细节,并且常常混淆事件,因此建立一个系统以帮助用户回忆遗忘的经历是至关重要的。虽然许多研究提出了记忆回忆系统,但这些系统主要依赖于深度学习技术,这些技术需要大量的训练,并且由于个人生活日志的有限可用性,常常面临数据稀缺的问题。随着生活日志随时间的增长,系统也必须迅速适应新积累的数据。最近,大型语言模型(LLMs)在各种任务中展示了显著的能力,使其在个性化应用中具有良好的前景。在本研究中,我们提出了一个框架,利用LLMs实现主动信息获取,整合个人知识图谱以通过精细的决策过程增强访问需求的检测。我们的框架提供了高度的灵活性,能够替换基础模型并修改事实检索方法以实现持续改进。实验结果表明,我们的方法有效识别遗忘的事件,支持用户更高效地回忆过去的经历。
cs.CL / 41 / 2602.21887
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection
ExpLang:通过策略内思维语言选择改善大型语言模型推理中的探索与利用
Abstract
Current large reasoning models (LRMs) have shown strong ability on challenging tasks after reinforcement learning (RL) based post-training. However, previous work mainly focuses on English reasoning in expectation of the strongest performance, despite the demonstrated potential advantage of multilingual thinking, as well as the requirement for native thinking traces by global users. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking language selection to improve exploration and exploitation during RL with the use of multiple languages. The results show that our method steadily outperforms English-only training with the same training budget, while showing high thinking language compliance for both seen and unseen languages. Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged non-English advantage. The method is orthogonal to most RL algorithms and opens up a new perspective on using multilinguality to improve LRMs.
Chinese Translation
当前的大型推理模型(LRMs)在强化学习(RL)后训练后在具有挑战性的任务上表现出强大的能力。然而,之前的研究主要集中在英语推理上,以期获得最佳性能,尽管多语言思维的潜在优势以及全球用户对本土思维痕迹的需求已被证明。在本文中,我们提出了ExpLang,一种新颖的LLM后训练流程,通过策略内思维语言选择来改善在使用多种语言进行RL时的探索与利用。结果表明,我们的方法在相同的训练预算下稳步超越仅使用英语的训练,同时对已见和未见语言表现出高的思维语言合规性。分析表明,通过在RL过程中将策略内思维语言选择作为一种行动,ExpLang有效地扩展了RL探索空间,增加了多样化的语言偏好,并利用非英语优势改善了RL的利用结果。该方法与大多数RL算法是正交的,为利用多语言性改善LRMs开辟了新的视角。
cs.CL / 42 / 2602.21933
Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text
小胜大胜:比较大型语言模型与领域微调模型在混合印地语文本中的讽刺检测
Abstract
Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and low-resource linguistic availability. This study compares four large language models, Llama 3.1, Mistral, Gemma 3, and Phi-4, with a fine-tuned DistilBERT model for sarcasm detection in code-mixed Hinglish text. The results indicate that the smaller, sequentially fine-tuned DistilBERT model achieved the highest overall accuracy of 84%, outperforming all of the LLMs in zero-shot and few-shot setups while using only a minimal amount of LLM-generated code-mixed data for fine-tuning. These findings indicate that domain-adaptive fine-tuning of smaller transformer-based models may significantly improve sarcasm detection over general LLM inference in low-resource, data-scarce settings.
Chinese Translation
在多语言和混合语言环境中进行讽刺检测仍然是自然语言处理模型面临的一项挑战,原因在于结构变异、非正式表达以及低资源语言的可用性。本研究比较了四个大型语言模型:Llama 3.1、Mistral、Gemma 3 和 Phi-4,以及一个微调后的 DistilBERT 模型在混合印地语文本中的讽刺检测效果。结果表明,较小的、经过顺序微调的 DistilBERT 模型在零样本和少样本设置中实现了最高的整体准确率84%,超越了所有大型语言模型,并且使用了最少的用于微调的 LLM 生成的混合数据。这些发现表明,在低资源和数据稀缺的环境中,针对特定领域的微调较小的基于变换器的模型可能显著提高讽刺检测的效果,优于通用大型语言模型的推理。
cs.CL / 43 / 2602.21941
MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents
MERRY:角色扮演代理的多模态情感与角色一致性的语义解耦评估
Abstract
Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions. However, existing studies still rely on purely textual benchmarks to evaluate the text responses of MRPAs, while delegating the assessment of their multimodal expressions solely to modality-synthesis metrics. This evaluation paradigm, on the one hand, entangles semantic assessment with modality generation, leading to ambiguous error attribution, and on the other hand remains constrained by the heavy reliance on human judgment. To this end, we propose MERRY, a semantically decoupled evaluation framework for assessing Multimodal Emotional and Role consistencies of Role-playing agents. This framework introduces five refined metrics for emotional consistency (EC) and three for role consistency (RC). Notably, we transform the traditional subjective scoring approach into a novel bidirectional-evidence-finding task, significantly improving the human agreement of LLM-as-Judge evaluations. Based on MERRY, we conduct extensive evaluations. Our empirical results primarily reveal that: (1) training on synthetic datasets tends to reduce emotional consistency, whereas training on real-world datasets improves it; (2) existing models suffer from emotional templatization and simplification, exhibiting a positive bias and a performance bottleneck on fine-grained negative emotions; (3) simple prompting strengthens weak models but constrains strong ones, while simple fine-tuning suffers from poor role generalization. Code and dataset are available.
Chinese Translation
多模态角色扮演代理(MRPAs)因其能够提供更具沉浸感的多模态情感交互而受到越来越多的关注。然而,现有研究仍依赖纯文本基准来评估MRPAs的文本响应,同时将其多模态表达的评估完全委托给模态合成指标。这种评估范式一方面将语义评估与模态生成纠缠在一起,导致模糊的错误归因;另一方面则受到对人类判断的高度依赖的限制。为此,我们提出了MERRY,一个用于评估角色扮演代理的多模态情感与角色一致性的语义解耦评估框架。该框架引入了五个用于情感一致性(EC)和三个用于角色一致性(RC)的精细化指标。值得注意的是,我们将传统的主观评分方法转变为一种新颖的双向证据发现任务,显著提高了LLM作为评判者评估的人类一致性。基于MERRY,我们进行了广泛的评估。我们的实证结果主要揭示:(1)在合成数据集上训练往往会降低情感一致性,而在真实世界数据集上训练则会改善它;(2)现有模型存在情感模板化和简化的问题,在细粒度负面情感中表现出积极偏差和性能瓶颈;(3)简单的提示方法增强了弱模型,但限制了强模型,而简单的微调方法则在角色泛化方面表现不佳。代码和数据集已公开。
cs.CL / 44 / 2602.21947
Large Language Models are Algorithmically Blind
大型语言模型在算法上是盲目的
Abstract
Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. We address this limitation using causal discovery as a testbed and evaluate eight frontier LLMs against ground truth derived from large-scale algorithm executions and find systematic, near-total failure. Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing and the marginal above-random performance of the best model is most consistent with benchmark memorization rather than principled reasoning. We term this failure algorithmic blindness and argue it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.
Chinese Translation
大型语言模型(LLMs)展现了卓越的知识广度,但它们对计算过程的推理能力仍然缺乏深入理解。弥补这一差距对于依赖LLMs指导算法选择和部署的从业者至关重要。我们通过因果发现作为测试平台,评估了八个前沿LLMs与基于大规模算法执行得出的真实结果,并发现系统性、几乎完全的失败。这些模型产生的范围远大于真实的置信区间,但在大多数情况下仍未能包含真实的算法均值;大多数模型的表现甚至不如随机猜测,而最佳模型的边际表现略高于随机的情况更符合基准记忆而非原则性推理。我们将这种失败称为算法盲目,并认为它反映了关于算法的声明性知识与经过校准的过程预测之间的根本差距。
cs.CL / 45 / 2602.21950
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
MEDSYN:针对多模态大型语言模型在复杂临床案例中的多证据合成基准测试
Abstract
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity. We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case. Mirroring clinical workflow, we evaluate 18 MLLMs on differential diagnosis (DDx) generation and final diagnosis (FDx) selection. While top models often match or even outperform human experts on DDx generation, all MLLMs exhibit a much larger DDx--FDx performance gap compared to expert clinicians, indicating a failure mode in synthesis of heterogeneous CE types. Ablations attribute this failure to (i) overreliance on less discriminative textual CE (e.g., medical history) and (ii) a cross-modal CE utilization gap. We introduce Evidence Sensitivity to quantify the latter and show that a smaller gap correlates with higher diagnostic accuracy. Finally, we demonstrate how it can be used to guide interventions to improve model performance. We will open-source our benchmark and code.
Chinese Translation
多模态大型语言模型(MLLMs)在医学应用中展现出巨大的潜力,但现有基准未能充分捕捉现实世界的临床复杂性。我们推出了MEDSYN,这是一个多语言、多模态的基准,涵盖了每个案例中多达7种不同的视觉临床证据(CE)类型。我们模拟临床工作流程,对18个MLLMs进行差异诊断(DDx)生成和最终诊断(FDx)选择的评估。尽管顶尖模型在DDx生成上往往与人类专家相匹配甚至超越,但所有MLLMs在DDx与FDx之间的表现差距明显大于专家临床医生,表明在合成异质CE类型时存在失败模式。消融实验将这一失败归因于(i)对较少区分性的文本CE(例如,病史)的过度依赖,以及(ii)跨模态CE利用差距。我们引入了证据敏感性(Evidence Sensitivity)来量化后者,并显示较小的差距与更高的诊断准确性相关。最后,我们展示了如何利用这一指标指导干预措施以提高模型性能。我们将开源我们的基准和代码。
cs.CL / 46 / 2602.21951
RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning
RADAR:基于对齐表示的推理作为区分用于基于大语言模型的知识图谱推理
Abstract
Knowledge graph reasoning (KGR) infers missing facts, with recent advances increasingly harnessing the semantic priors and reasoning abilities of Large Language Models (LLMs). However, prevailing generative paradigms are prone to memorizing surface-level co-occurrences rather than learning genuine relational semantics, limiting out-of-distribution generalization. To address this, we propose RADAR, which reformulates KGR from generative pattern matching to discriminative relational reasoning. We recast KGR as discriminative entity selection, where reinforcement learning enforces relative entity separability beyond token-likelihood imitation. Leveraging this separability, inference operates directly in representation space, ensuring consistency with the discriminative optimization and bypassing generation-induced hallucinations. Across four benchmarks, RADAR achieves 5-6% relative gains on link prediction and triple classification over strong LLM baselines, while increasing task-relevant mutual information in intermediate representations by 62.9%, indicating more robust and transferable relational reasoning.
Chinese Translation
知识图谱推理(KGR)用于推断缺失的事实,近期的进展越来越多地利用了大语言模型(LLMs)的语义先验和推理能力。然而,现有的生成范式容易记忆表层共现,而不是学习真正的关系语义,这限制了其在分布外的泛化能力。为了解决这一问题,我们提出了RADAR,它将KGR从生成模式匹配重新构造为区分性关系推理。我们将KGR重新表述为区分性实体选择,其中强化学习在超越标记似然模仿的基础上强制执行相对实体的可分性。利用这种可分性,推理直接在表示空间中进行,确保与区分性优化的一致性,并绕过生成引起的幻觉。在四个基准测试中,RADAR在链接预测和三元组分类上相较于强大的LLM基线实现了5-6%的相对提升,同时在中间表示中增加了62.9%的任务相关互信息,表明其具有更强的鲁棒性和可转移的关系推理能力。
cs.CL / 47 / 2602.21978
CxMP: A Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models
CxMP:评估语言模型构式理解的语言最小对基准
Abstract
Recent work has examined language models from a linguistic perspective to better understand how they acquire language. Most existing benchmarks focus on judging grammatical acceptability, whereas the ability to interpret meanings conveyed by grammatical forms has received much less attention. We introduce the Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models (CxMP), a benchmark grounded in Construction Grammar that treats form-meaning pairings, or constructions, as fundamental linguistic units. CxMP evaluates whether models can interpret the semantic relations implied by constructions, using a controlled minimal-pair design across nine construction types, including the let-alone, caused motion, and ditransitive constructions. Our results show that while syntactic competence emerges early, constructional understanding develops more gradually and remains limited even in large language models (LLMs). CxMP thus reveals persistent gaps in how language models integrate form and meaning, providing a framework for studying constructional understanding and learning trajectories in language models.
Chinese Translation
近期的研究从语言学的角度考察语言模型,以更好地理解它们如何习得语言。现有的大多数基准主要集中在判断语法可接受性,而对语法形式所传达的意义的解读能力则关注较少。我们引入了评估语言模型构式理解的语言最小对基准(CxMP),该基准基于构式语法,将形式-意义配对或构式视为基本的语言单位。CxMP 评估模型是否能够解读构式所暗示的语义关系,采用了在九种构式类型(包括 let-alone、引起运动和双及物构式)中的受控最小对设计。我们的结果表明,尽管句法能力较早显现,但构式理解的发展则更为渐进,即便在大型语言模型(LLMs)中也仍然有限。因此,CxMP 揭示了语言模型在形式与意义整合方面的持续差距,为研究语言模型的构式理解和学习轨迹提供了框架。
cs.CL / 48 / 2602.22014
A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT
更健康模型的多样性饮食:以法语ModernBERT为例
Abstract
Diversity has been gaining interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets, which are driven by size rather than by diversity. This calls for an investigation of the impact of diversity on ModernBERT pre-training. We do so in this study, with the express intent of reducing pre-training dataset size while retaining at least comparable performance. We compare diversity-driven sampling algorithms in order to select the best one. We find that on some tasks diversity-driven sampling yields a gain of 10 points relative to randomly sampled pre-training data of commensurate size. We also see that a model pre-trained for 483h on a diversity-driven dataset of 150M tokens can yield performance commensurate with that of a model pre-trained for 1,775h on a randomly sampled dataset of 2.4B tokens.
Chinese Translation
近年来,多样性在自然语言处理(NLP)领域引起了越来越多的关注。与此同时,现代的变换器模型如ModernBERT使用非常大的预训练数据集,这些数据集的构建主要依赖于规模而非多样性。这促使我们对多样性对ModernBERT预训练的影响进行研究。本研究的明确意图是减少预训练数据集的规模,同时保持至少相当的性能。我们比较了基于多样性的采样算法,以选择最佳方案。我们发现,在某些任务中,基于多样性的采样相较于随机采样的预训练数据可以提高10个点的性能。我们还观察到,一个在150M标记的基于多样性的数据集上预训练了483小时的模型,其性能可以与一个在2.4B标记的随机驱动数据集上预训练了1,775小时的模型相媲美。
cs.CL / 49 / 2602.22045
DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain
DLT-Corpus:面向分布式账本技术领域的大规模文本集合
Abstract
We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate DLT-Corpus' utility by analyzing technology emergence patterns and market-innovation correlations. Findings reveal that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grow independently of market fluctuations, tracking overall market expansion in a virtuous cycle where research precedes and enables economic growth that funds further innovation. We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model achieving 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code.
Chinese Translation
我们介绍了DLT-Corpus,这是迄今为止针对分布式账本技术(DLT)研究的最大领域特定文本集合:包含来自2212万份文档的29.8亿个词元,涵盖科学文献(37440篇出版物)、美国专利商标局(USPTO)专利(49023项申请)和社交媒体(2200万条帖子)。现有的自然语言处理(NLP)资源主要集中在加密货币价格预测和智能合约上,尽管该领域的市场资本化约为3万亿美元且技术快速演变,但领域特定语言仍未得到充分探索。我们通过分析技术出现模式和市场创新相关性来展示DLT-Corpus的实用性。研究结果表明,技术通常先在科学文献中出现,然后才进入专利和社交媒体,遵循传统的技术转移模式。尽管在加密寒冬期间,社交媒体情绪仍然极为乐观,但科学和专利活动独立于市场波动而增长,跟踪整体市场扩张,形成一个良性循环,其中研究先于并促进经济增长,从而为进一步创新提供资金。我们公开发布完整的DLT-Corpus;LedgerBERT,一个领域适应模型,在DLT特定的命名实体识别(NER)任务上比BERT-base提高了23%;以及所有相关工具和代码。
cs.CL / 50 / 2602.22072
Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
理解人工心智理论:大型语言模型中的扰动任务与推理
Abstract
Theory of Mind (ToM) refers to an agent's ability to model the internal states of others. Contributing to the debate whether large language models (LLMs) exhibit genuine ToM capabilities, our study investigates their ToM robustness using perturbations on false-belief tasks and examines the potential of Chain-of-Thought prompting (CoT) to enhance performance and explain the LLM's decision. We introduce a handcrafted, richly annotated ToM dataset, including classic and perturbed false belief tasks, the corresponding spaces of valid reasoning chains for correct task completion, subsequent reasoning faithfulness, task solutions, and propose metrics to evaluate reasoning chain correctness and to what extent final answers are faithful to reasoning traces of the generated CoT. We show a steep drop in ToM capabilities under task perturbation for all evaluated LLMs, questioning the notion of any robust form of ToM being present. While CoT prompting improves the ToM performance overall in a faithful manner, it surprisingly degrades accuracy for some perturbation classes, indicating that selective application is necessary.
Chinese Translation
心智理论(Theory of Mind, ToM)指的是一个代理建模他人内在状态的能力。本研究旨在探讨大型语言模型(Large Language Models, LLMs)是否具备真正的心智理论能力,通过对虚假信念任务的扰动来检验其心智理论的稳健性,并考察链式思维提示(Chain-of-Thought prompting, CoT)在提升性能和解释LLM决策中的潜力。我们引入了一个手工制作、注释丰富的心智理论数据集,包括经典和扰动的虚假信念任务、相应的有效推理链空间以完成正确任务、后续推理的可信度、任务解决方案,并提出评估推理链正确性及最终答案在多大程度上忠实于生成的CoT推理轨迹的指标。我们发现,在任务扰动下,所有评估的LLM的心智理论能力均显著下降,这质疑了任何稳健形式的心智理论存在的观点。尽管CoT提示在整体上以忠实的方式改善了心智理论表现,但对于某些扰动类别,其准确性却意外下降,这表明需要选择性地应用。
cs.CL / 51 / 2602.22090
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
基于信心驱动的多尺度模型选择用于成本效益推理
Abstract
Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks, with larger models performing better but at higher computational costs. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates. By assessing a model's confidence in handling the task and response accuracy, tasks that are likely to be solved correctly are retained, while more uncertain or complex cases are delegated to a larger model, ensuring reliability while minimizing computation. Specifically, we evaluate a model's likelihood of knowing the correct answer and the probability that its response is accurate. Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%. When applied to GPT-4o API calls, it reduces token usage by approximately 60%, further improving cost efficiency. These findings indicate the potential of confidence-based model selection to enhance real-world LLM deployment, particularly in resource-constrained settings such as edge devices and commercial API applications.
Chinese Translation
大型语言模型(LLMs)在各种自然语言任务的推理中引发了革命,虽然更大的模型表现更好,但计算成本也更高。我们提出了一种基于信心的策略,动态选择最合适的模型,依据信心估计进行决策。通过评估模型在处理任务时的信心及其响应准确性,保留那些可能正确解决的任务,而将更不确定或复杂的案例委托给更大的模型,从而确保可靠性并最小化计算成本。具体而言,我们评估模型知道正确答案的可能性以及其响应准确的概率。在大规模多任务语言理解(MMLU)基准上的实验表明,我们的方法在实现与最大模型相当的准确性的同时,计算成本降低了20%到40%。在应用于GPT-4o API调用时,令令牌使用量减少了约60%,进一步提高了成本效益。这些发现表明,基于信心的模型选择有潜力增强大型语言模型在现实世界中的部署,特别是在边缘设备和商业API应用等资源受限的环境中。
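The delegation idea in this abstract can be sketched as a two-stage cascade. This is an illustration under stated assumptions, not the paper's procedure. The small model is assumed to return two calibrated estimates alongside its answer (`p_know`, the likelihood that it knows the answer, and `p_correct`, the probability that its response is accurate). The single shared threshold, the `min` combination rule, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interfaces: the small model returns (answer, p_know, p_correct);
# the large model returns only an answer.
SmallModel = Callable[[str], Tuple[str, float, float]]
LargeModel = Callable[[str], str]


@dataclass
class Routed:
    answer: str
    used_model: str  # "small" or "large"


def route(question: str, small: SmallModel, large: LargeModel,
          threshold: float = 0.8) -> Routed:
    """Keep the small model's answer only when both estimates clear the threshold."""
    answer, p_know, p_correct = small(question)
    if min(p_know, p_correct) >= threshold:
        return Routed(answer, "small")
    # Uncertain or complex case: delegate to the larger, costlier model.
    return Routed(large(question), "large")
```

Raising the threshold sends more questions to the large model, trading cost for reliability. The 20% to 40% savings reported above would depend on where that operating point is set.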
cs.CL / 52 / 2602.22125
IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages
IndicIFEval:一个针对14种印度语言的可验证指令遵循评估基准
Abstract
Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers. We introduce IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions. It comprises around 800 human-verified examples per language spread across two complementary subsets: IndicIFEval-Ground, translated prompts from IFEval (Zhou et al., 2023) carefully localized for Indic contexts, and IndicIFEval-Ground, synthetically generated instructions grounded in native Indic content. We conduct a comprehensive evaluation of major open-weight and proprietary models spanning both reasoning and non-reasoning models. While models maintain strong adherence to formatting constraints, they struggle significantly with lexical and cross-lingual tasks -- and despite progress in high-resource languages, instruction-following across the broader Indic family lags significantly behind English. We release IndicIFEval and its evaluation scripts to support progress on multilingual constrained generation (http://github.com/ai4bharat/IndicIFEval).
Chinese Translation
指令遵循基准主要集中于英语,这为数亿印度语言使用者留下了重要的评估空白。我们推出了IndicIFEval,这是一个评估大型语言模型(LLMs)在14种印度语言中受限生成能力的基准,使用自动可验证的基于规则的指令。该基准包含每种语言约800个经过人工验证的示例,分为两个互补的子集:IndicIFEval-Ground,来自IFEval(Zhou et al., 2023)的翻译提示,经过精心本地化以适应印度语境;以及IndicIFEval-Ground,基于本土印度内容合成生成的指令。我们对主要的开放权重和专有模型进行了全面评估,涵盖了推理和非推理模型。尽管模型在格式约束方面表现出强大的遵循能力,但在词汇和跨语言任务上却显著挣扎——尽管在高资源语言上有所进展,但在更广泛的印度语言家族中,指令遵循的能力仍显著落后于英语。我们发布了IndicIFEval及其评估脚本,以支持多语言受限生成的进展(http://github.com/ai4bharat/IndicIFEval)。
cs.CL / 53 / 2602.22157
Dynamic Personality Adaptation in Large Language Models via State Machines
通过状态机实现大型语言模型的动态个性适应
Abstract
The inability of Large Language Models (LLMs) to modulate their personality expression in response to evolving dialogue dynamics hinders their performance in complex, interactive contexts. We propose a model-agnostic framework for dynamic personality simulation that employs state machines to represent latent personality states, where transition probabilities are dynamically adapted to the conversational context. Part of our architecture is a modular pipeline for continuous personality scoring that evaluates dialogues along latent axes while remaining agnostic to the specific personality models, their dimensions, transition mechanisms, or LLMs used. These scores function as dynamic state variables that systematically reconfigure the system prompt, steering behavioral alignment throughout the interaction. We evaluate this framework by operationalizing the Interpersonal Circumplex (IPC) in a medical education setting. Results demonstrate that the system successfully adapts its personality state to user inputs and also influences user behavior, thereby facilitating de-escalation training. Notably, the scoring pipeline maintains comparable precision even when utilizing lightweight, fine-tuned classifiers instead of large-scale LLMs. This work demonstrates the feasibility of modular, personality-adaptive architectures for education, customer support, and broader human-computer interaction.
Chinese Translation
大型语言模型(LLMs)在对不断变化的对话动态调节其个性表达方面的能力不足,限制了其在复杂互动环境中的表现。我们提出了一种与模型无关的动态个性模拟框架,该框架利用状态机表示潜在个性状态,并根据对话上下文动态调整转移概率。我们架构的一部分是一个模块化的连续个性评分管道,该管道沿潜在轴评估对话,同时对具体的个性模型、其维度、转移机制或使用的LLMs保持无关。这些评分作为动态状态变量,系统地重新配置系统提示,引导整个互动过程中的行为一致性。我们通过在医学教育环境中操作化人际圆周(IPC)来评估该框架。结果表明,该系统成功地根据用户输入调整其个性状态,同时也影响用户行为,从而促进去激化训练。值得注意的是,即使使用轻量级的微调分类器而非大规模LLMs,评分管道仍保持了可比的精度。这项工作展示了模块化、个性适应架构在教育、客户支持和更广泛的人机交互中的可行性。
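Illustrative sketch: the mechanism in this abstract (a state machine whose transition probabilities depend on the conversational context, with the current state rewriting the system prompt) can be pictured roughly as below. This is a minimal sketch under stated assumptions, not the authors' implementation. The three states, the BASE_LOGITS table, the single affiliation_score in [-1, 1], the gain parameter, and the prompt wording are all hypothetical.

```python
import math
import random

# Hypothetical latent personality states (the paper's actual states are not given here).
STATES = ["hostile", "neutral", "friendly"]

# Hypothetical context-free transition preferences, expressed as logits.
BASE_LOGITS = {
    "hostile": {"hostile": 1.0, "neutral": 0.0, "friendly": -1.0},
    "neutral": {"hostile": 0.0, "neutral": 1.0, "friendly": 0.0},
    "friendly": {"hostile": -1.0, "neutral": 0.0, "friendly": 1.0},
}

# Direction in which a positive score pushes each target state.
STATE_BIAS = {"hostile": -1.0, "neutral": 0.0, "friendly": 1.0}


def transition_probs(state: str, affiliation_score: float, gain: float = 2.0) -> dict:
    """Softmax over base logits shifted by the dialogue score (assumed in [-1, 1])."""
    logits = [
        BASE_LOGITS[state][target] + gain * affiliation_score * STATE_BIAS[target]
        for target in STATES
    ]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {target: e / total for target, e in zip(STATES, exps)}


def step(state: str, affiliation_score: float, rng: random.Random) -> str:
    """Sample the next state from the context-adapted transition distribution."""
    probs = transition_probs(state, affiliation_score)
    return rng.choices(STATES, weights=[probs[s] for s in STATES], k=1)[0]


def system_prompt(state: str) -> str:
    """Reconfigure the system prompt from the current state (wording is illustrative)."""
    return f"You are a simulated patient. Current interpersonal stance: {state}."


if __name__ == "__main__":
    rng = random.Random(0)
    state = "hostile"
    for score in [0.8, 0.9, 0.7]:  # scores a separate scoring pipeline might produce
        state = step(state, score, rng)
        print(system_prompt(state))
```

Keeping the scorer, the transition rule, and the prompt template as separate functions mirrors the modularity the abstract emphasizes: any one of them could be swapped without touching the others.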
cs.CL / 54 / 2602.22175
DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs
DySCO:用于长上下文语言模型的动态注意力缩放解码
Abstract
Understanding and reasoning over long contexts is a crucial capability for language models (LMs). Although recent models support increasingly long context windows, their accuracy often deteriorates as input length grows. In practice, models often struggle to keep attention aligned with the most relevant context throughout decoding. In this work, we propose DySCO, a novel decoding algorithm for improving long-context reasoning. DySCO leverages retrieval heads--a subset of attention heads specialized for long-context retrieval--to identify task-relevant tokens at each decoding step and explicitly up-weight them. By doing so, DySCO dynamically adjusts attention during generation to better utilize relevant context. The method is training-free and can be applied directly to any off-the-shelf LM. Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modest additional compute. Further analysis highlights the importance of both dynamic attention rescaling and retrieval-head-guided selection for the effectiveness of the method, while providing interpretability insights into decoding-time attention behavior. Our code is available at https://github.com/princeton-pli/DySCO.
Chinese Translation
理解和推理长上下文是语言模型(LMs)的一项关键能力。尽管近期模型支持越来越长的上下文窗口,但随着输入长度的增加,其准确性往往会下降。在实际应用中,模型常常难以在解码过程中保持与最相关上下文的注意力对齐。在本研究中,我们提出了DySCO,一种用于改善长上下文推理的新型解码算法。DySCO利用检索头——一组专门用于长上下文检索的注意力头——在每个解码步骤中识别与任务相关的标记,并显式地对其加权。通过这样做,DySCO在生成过程中动态调整注意力,以更好地利用相关上下文。该方法无需训练,可以直接应用于任何现成的语言模型。在多个指令调优和推理模型上,DySCO在具有挑战性的长上下文推理基准测试中持续提升性能,在128K上下文长度下,MRCR和LongBenchV2的相对增益高达25%,且仅需适度的额外计算。进一步的分析强调了动态注意力重缩放和检索头引导选择对方法有效性的关键作用,同时提供了对解码时注意力行为的可解释性洞察。我们的代码可在 https://github.com/princeton-pli/DySCO 获取。
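Illustrative sketch: the two ingredients named in this abstract, retrieval-head-guided token selection and attention up-weighting, can be pictured for a single decoding step as below. This is a toy sketch on plain Python lists and is not the DySCO code. The function names, the top_k and scale parameters, and the choice to average the retrieval heads' distributions are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(retrieval_head_logits, top_k):
    """Average the retrieval heads' attention distributions and
    return the indices of the top_k most attended context tokens."""
    n_tokens = len(retrieval_head_logits[0])
    dists = [softmax(head) for head in retrieval_head_logits]
    mean = [sum(d[i] for d in dists) / len(dists) for i in range(n_tokens)]
    return sorted(range(n_tokens), key=lambda i: mean[i], reverse=True)[:top_k]


def rescale_attention(head_logits, selected, scale):
    """Multiply the unnormalised weight of the selected tokens by `scale`
    (equivalently, add log(scale) to their logits) and renormalise."""
    chosen = set(selected)
    bonus = math.log(scale)
    boosted = [x + (bonus if i in chosen else 0.0) for i, x in enumerate(head_logits)]
    return softmax(boosted)


if __name__ == "__main__":
    # Two hypothetical retrieval heads attending over four context tokens.
    retrieval = [[0.0, 3.0, 0.0, 1.0], [0.0, 2.0, 0.0, 2.0]]
    selected = select_tokens(retrieval, top_k=2)
    # An ordinary head with uniform attention before rescaling.
    print(selected, rescale_attention([0.0, 0.0, 0.0, 0.0], selected, scale=2.0))
```

Adding a constant to selected logits before the softmax changes no model weights, which is consistent with the abstract's description of the method as training-free.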
cs.CL / 55 / 2602.22182
LiCQA: A Lightweight Complex Question Answering System
LiCQA:一种轻量级复杂问题回答系统
Abstract
Over the last twenty years, significant progress has been made in designing and implementing Question Answering (QA) systems. However, addressing complex questions, the answers to which are spread across multiple documents, remains a challenging problem. Recent QA systems that are designed to handle complex questions work either on the basis of knowledge graphs, or utilise contemporary neural models that are expensive to train, in terms of both computational resources and the volume of training data required. In this paper, we present LiCQA, an unsupervised question answering model that works primarily on the basis of corpus evidence. We empirically compare the effectiveness and efficiency of LiCQA with two recently presented QA systems, which are based on different underlying principles. The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.
Chinese Translation
在过去的二十年中,问答(QA)系统的设计和实施取得了显著进展。然而,处理复杂问题(其答案分散在多个文档中)仍然是一个具有挑战性的问题。最近设计用于处理复杂问题的问答系统要么基于知识图谱,要么利用当代神经模型,这些模型在训练时需要消耗大量的计算资源和训练数据。在本文中,我们提出了LiCQA,一种主要基于语料证据的无监督问答模型。我们通过实证比较LiCQA与两种基于不同基础原理的最新问答系统的有效性和效率。实验结果表明,LiCQA在基准数据上显著优于这两种最先进的系统,并且在延迟方面有显著降低。
cs.CL / 56 / 2602.22193
Improving Parametric Knowledge Access in Reasoning Language Models
改善推理语言模型中的参数知识访问
Abstract
We study reasoning for accessing world knowledge stored in a language model's parameters. For example, recalling that Canberra is Australia's capital may benefit from thinking through major cities and the concept of purpose-built capitals. While reasoning language models are trained via reinforcement learning to produce reasoning traces on tasks such as mathematics, they may not reason well for accessing their own world knowledge. We first find that models do not generate their best world knowledge reasoning by default: adding a simple "think step-by-step" cue demonstrates statistically significant improvement in knowledge recall but not math. Motivated by this, we propose training models to reason over their parametric knowledge using world-knowledge question answering as a verifiable reward. After reinforcement learning on TriviaQA (+9.9%), performance also improves on Natural Questions, HotpotQA, SimpleQA, and StrategyQA by 4.2%, 2.1%, 0.6%, and 3.0%, respectively. Reasoning models are under-optimized for parametric knowledge access, but can be easily trained to reason better.
Chinese Translation
我们研究了如何通过语言模型的参数访问存储的世界知识进行推理。例如,回忆起堪培拉是澳大利亚的首都可能有助于通过思考主要城市和专门建造的首都的概念来实现。尽管推理语言模型通过强化学习在数学等任务上生成推理轨迹,但它们在访问自身的世界知识时可能推理不佳。我们首先发现,模型默认情况下并不会生成最佳的世界知识推理:添加一个简单的“逐步思考”提示在知识回忆上显示出统计显著的改善,但在数学上则没有。受到此启发,我们提出训练模型通过世界知识问答作为可验证的奖励来推理其参数知识。在TriviaQA上经过强化学习后,性能提升了9.9%,同时在Natural Questions、HotpotQA、SimpleQA和StrategyQA上的表现也分别提高了4.2%、2.1%、0.6%和3.0%。推理模型在参数知识访问方面尚未得到充分优化,但可以很容易地训练以提高推理能力。
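Illustrative sketch: using "world-knowledge question answering as a verifiable reward" suggests a reward that can be computed by string comparison against known answers. The code below is one plausible shape for such a reward, written as a sketch under stated assumptions and not as the paper's reward function. The "Answer:" marker, the normalization rules, and all function names are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def extract_answer(completion: str, marker: str = "Answer:") -> str:
    """Return the text after the last marker, or an empty string if the marker is absent.
    The marker convention is an assumption made for this sketch."""
    _, sep, tail = completion.rpartition(marker)
    return tail.strip() if sep else ""


def reward(completion: str, gold_aliases) -> float:
    """Binary reward: 1.0 if the normalized final answer matches any gold alias."""
    prediction = normalize(extract_answer(completion))
    golds = {normalize(alias) for alias in gold_aliases}
    return 1.0 if prediction and prediction in golds else 0.0


if __name__ == "__main__":
    trace = (
        "Sydney and Melbourne are the largest cities, "
        "but the capital was purpose-built.\nAnswer: Canberra."
    )
    print(reward(trace, ["Canberra"]))
```

Only the final answer is scored, so the reasoning trace before the marker is left unconstrained; that is the usual design when the goal is to let reinforcement learning shape the reasoning itself.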
cs.CL / 57 / 2602.22200
SumTablets: A Transliteration Dataset of Sumerian Tablets
SumTablets:苏美尔泥板音译数据集
Abstract
Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script. Thanks to visionary digital Assyriology projects such as ETCSL, CDLI, and Oracc, a large number of Sumerian transliterations have been published online, and these data are well-structured for a variety of search and analysis tasks. However, the absence of a comprehensive, accessible dataset pairing transliterations with a digital representation of the tablet's cuneiform glyphs has prevented the application of modern Natural Language Processing (NLP) methods to the task of Sumerian transliteration. To address this gap, we present SumTablets, a dataset pairing Unicode representations of 91,606 Sumerian cuneiform tablets (totaling 6,970,407 glyphs) with the associated transliterations published by Oracc. We construct SumTablets by first preprocessing and standardizing the Oracc transliterations before mapping each reading back to the Unicode representation of the source glyph. Further, we retain parallel structural information (e.g., surfaces, newlines, broken segments) through the use of special tokens. We release SumTablets as a Hugging Face Dataset (CC BY 4.0) and open source data preparation code via GitHub. Additionally, we leverage SumTablets to implement and evaluate two transliteration baselines: (1) weighted sampling from a glyph's possible readings, and (2) fine-tuning an autoregressive language model. Our fine-tuned language model achieves an average transliteration character-level F-score (chrF) of 97.55, demonstrating the immediate potential of transformer-based transliteration models in allowing experts to rapidly verify generated transliterations rather than manually transliterating tablets one-by-one.
Chinese Translation
苏美尔音译是一种传统系统,用于用拉丁字母表示学者对泥板的解读。得益于ETCSL、CDLI和Oracc等前瞻性数字亚述学项目,已经有大量苏美尔音译在网上发布,这些数据结构良好,适用于多种搜索和分析任务。然而,缺乏一个全面且易于访问的数据集,将音译与泥板的楔形文字数字表示相结合,阻碍了现代自然语言处理(NLP)方法在苏美尔音译任务中的应用。为了解决这一问题,我们提出了SumTablets,一个将91,606个苏美尔楔形泥板的Unicode表示(总计6,970,407个字形)与Oracc发布的相关音译配对的数据集。我们通过首先对Oracc音译进行预处理和标准化,然后将每个读音映射回源字形的Unicode表示来构建SumTablets。此外,我们通过使用特殊标记保留平行结构信息(例如,表面、新行、断裂段)。我们将SumTablets作为Hugging Face数据集(CC BY 4.0)发布,并通过GitHub开源数据准备代码。此外,我们利用SumTablets实现并评估了两个音译基线:(1)从字形的可能读音中进行加权抽样,以及(2)微调自回归语言模型。我们微调的语言模型在音译字符级F-score(chrF)上达到了97.55的平均值,展示了基于变换器的音译模型在允许专家快速验证生成的音译而不是逐个手动音译泥板方面的直接潜力。
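Illustrative sketch: the first baseline in this abstract, weighted sampling from a glyph's possible readings, can be pictured as below. This is a minimal sketch under stated assumptions, not the released SumTablets code. The function names, the "<unk>" placeholder, and the toy training pairs (with invented counts) are hypothetical; the sketch also ignores context, which is what makes it a baseline.

```python
import random
from collections import Counter, defaultdict


def fit_reading_counts(pairs):
    """Count how often each glyph was transliterated with each reading."""
    counts = defaultdict(Counter)
    for glyph, reading in pairs:
        counts[glyph][reading] += 1
    return counts


def transliterate(glyphs, counts, rng):
    """Sample one reading per glyph, weighted by its training frequency.
    Glyphs never seen in training map to the placeholder "<unk>"."""
    output = []
    for glyph in glyphs:
        options = counts.get(glyph)
        if not options:
            output.append("<unk>")
            continue
        readings = list(options)
        weights = [options[r] for r in readings]
        output.append(rng.choices(readings, weights=weights, k=1)[0])
    return output


if __name__ == "__main__":
    # Toy (glyph, reading) pairs; the frequencies are invented for illustration.
    train = [("𒀭", "an"), ("𒀭", "an"), ("𒀭", "dingir"), ("𒆠", "ki")]
    counts = fit_reading_counts(train)
    print(transliterate(["𒀭", "𒆠"], counts, random.Random(0)))
```

Because each glyph is sampled independently, this baseline cannot resolve readings that depend on neighbouring signs, which is the gap a fine-tuned autoregressive model is positioned to close.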
cs.CL / 58 / 2602.22207
Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
翻译中的恢复:自动化基准和数据集翻译的高效管道
Abstract
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
Chinese Translation
多语言大型语言模型(LLM)评估的可靠性目前受到翻译基准质量不一致的影响。现有资源往往存在语义漂移和上下文丢失的问题,这可能导致误导性的性能指标。在本研究中,我们提出了一个完全自动化的框架,旨在通过实现可扩展的高质量数据集和基准翻译来应对这些挑战。我们证明,适应测试时计算规模策略,特别是通用自我改进(Universal Self-Improvement, USI)和我们提出的多轮排名方法T-RANK,相较于传统管道能够显著提高输出质量。我们的框架确保基准在本地化过程中保留其原始任务结构和语言细微差别。我们将这种方法应用于将流行的基准和数据集翻译成八种东欧和南欧语言(乌克兰语、保加利亚语、斯洛伐克语、罗马尼亚语、立陶宛语、爱沙尼亚语、土耳其语、希腊语)。使用基于参考的指标和LLM作为评判者的评估表明,我们的翻译超越了现有资源,从而导致更准确的下游模型评估。我们发布了该框架和改进后的基准,以促进稳健和可重复的多语言人工智能开发。