cs.RO / 1 / 2605.20264
Adaptive Human-Robot Collaboration for Masonry Construction Under Material and Assembly Uncertainty
在材料和装配不确定性下的适应性人机协作砖砌施工
Abstract
Human-robot collaboration in construction is often challenged by limited robot-to-human communication and the need to adapt to tolerance accumulation arising from material and assembly uncertainties. We present an adaptive human-robot collaborative workflow for masonry construction that addresses communication limitations and tolerance accumulation, demonstrated through a brickwork case study in which a robot places bricks while a human applies adhesive. This workflow is enabled by two complementary mechanisms: 1) an end-effector-mounted projector that provides spatially registered, just-in-time projection guidance for manual adhesive application, and 2) laser scanning for feedback-driven grasping and placement pose correction. Together, these mechanisms enable adjustment of human and robotic actions in response to material variability and accumulated assembly tolerances. Full-scale experiments across conventional running-bond and nonstandard configurations demonstrate that projection guidance improves adhesive application consistency and reduces application time, while laser-based correction maintains level courses and avoids collision-prone failures associated with open-loop execution. These results indicate that integrating spatial projection with feedback-driven adaptation, enabled by material and as-built sensing, can mitigate tolerance accumulation and improve precision and robustness in human-robot collaborative construction.
Chinese Translation
建筑中的人机协作常常面临机器人与人类之间沟通有限以及需要适应由材料和装配不确定性引起的公差累积的问题。我们提出了一种适应性人机协作工作流程,用于砖砌施工,旨在解决沟通限制和公差累积问题,具体通过一个砖砌案例研究进行展示,其中机器人负责放置砖块,而人类则负责涂抹粘合剂。该工作流程依赖于两个互补机制:1)一个安装在末端执行器上的投影仪,提供空间注册的及时投影指导,以辅助手动涂抹粘合剂;2)激光扫描用于反馈驱动的抓取和放置姿态校正。这些机制共同使人类和机器人能够根据材料变异和累积的装配公差调整各自的动作。在传统的跑缝和非标准配置下进行的全尺度实验表明,投影指导提高了粘合剂应用的一致性并减少了应用时间,而基于激光的校正则保持了水平课程并避免了与开环执行相关的碰撞故障。这些结果表明,将空间投影与反馈驱动的适应性结合起来,并通过材料和建造状态感知来实现,可以减轻公差累积,提高人机协作施工的精度和鲁棒性。
cs.RO / 2 / 2605.20304
Terrestrial Soft Mobile Robots: A Review
地面软体移动机器人:综述
Abstract
Soft mobile robots have emerged as a promising area of research with potential applications in various disciplines including but not limited to search-and-rescue, service, surveillance, explorations, and manufacturing. In this article, we provide a comprehensive review of the current state of soft mobile robot research, focusing on wheelless terrestrial locomotive systems. We include past and present developments in locomotion strategies, actuation methods, modeling approaches, and control systems. Further, we identify key research challenges that must be overcome to enable the widespread adoption of soft mobile robots in various applications. Overall, this article provides a valuable resource for researchers and practitioners interested in the field of soft mobile robots and soft robotics.
Chinese Translation
软体移动机器人作为一个有前景的研究领域,已在包括搜索与救援、服务、监视、探索和制造等多个学科中展现出潜在应用。在本文中,我们对当前软体移动机器人研究的现状进行了全面回顾,重点关注无轮地面运动系统。我们涵盖了运动策略、驱动方法、建模方法和控制系统的过去与现在的发展。此外,我们还识别了必须克服的关键研究挑战,以促进软体移动机器人在各种应用中的广泛采用。总体而言,本文为对软体移动机器人及软体机器人领域感兴趣的研究人员和从业者提供了宝贵的资源。
cs.RO / 3 / 2605.20355
Proximal State Nudging: Reducing Skill Atrophy from AI Assistance
近端状态推动:减少人工智能辅助下的技能退化
Abstract
Skill atrophy, the gradual decline of human capability under AI assistance, poses a safety risk in shared-control of semi-autonomous systems, where operators may be unable to distinguish their own inputs from autonomous corrections. We propose Proximal State Nudging (PSN), a shared autonomy algorithm that jointly optimizes for skill development and task performance by nudging users toward states estimated to be most learnable. We first show that PSN outperforms existing shared autonomy baselines in balancing student improvement in unassisted reward with overall shared performance, using simulated students in the classic LunarLander environment. We then present, to the best of our knowledge, the first human subject studies of a planner incorporating learning-compatible shared autonomy: across two driving tasks in the CARLA simulator (High Performance Racing and Parallel Parking, n = 60), PSN produces up to 7x larger gains in unassisted skill than standard blended shared autonomy, while incurring 50% fewer collisions than unassisted self-practice.
Chinese Translation
技能退化是指在人工智能(AI)辅助下,人类能力的逐渐下降,这在半自主系统的共享控制中构成了安全风险,因为操作员可能无法区分自己的输入和自主修正。我们提出了近端状态推动(Proximal State Nudging, PSN),这是一种共享自主算法,通过推动用户朝向被估计为最易学习的状态,联合优化技能发展和任务表现。我们首先展示了PSN在平衡学生在无辅助奖励下的提升与整体共享表现方面,优于现有的共享自主基准,使用经典的LunarLander环境中的模拟学生进行实验。接着,我们呈现了我们所知的首个包含学习兼容共享自主的规划者的人类受试者研究:在CARLA模拟器中的两个驾驶任务(高性能赛车和平行停车,n = 60)中,PSN在无辅助技能方面产生的提升比标准混合共享自主大达7倍,同时碰撞率比无辅助自我练习减少了50%。
cs.RO / 4 / 2605.20373
SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework
SUGAR:一种可扩展的人类视频驱动的通用类人机器人运动操控学习框架
Abstract
Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar-humanoid/
Chinese Translation
构建能够在现实世界中实现可泛化的全身运动操控的类人机器人仍然是一个基本挑战。现有的方法要么依赖繁琐的任务特定奖励设计,要么僵硬地重放参考动作而无法泛化,或者依赖昂贵的远程操作,限制了可扩展性。虽然人类视频捕捉了多样的人类行为,但从中推断出的运动先验本质上是不完美的,受到遮挡、接触伪影和重定向错误的影响,使其不适合直接进行策略学习。为了解决这个问题,我们提出了SUGAR,一个可扩展的数据驱动框架,能够将多样的人类视频转化为可部署的类人运动操控技能,而无需在推理时进行任何任务特定的奖励设计或参考动作条件化。SUGAR的过程分为三个阶段。首先,一个完全自动化的管道从非结构化的人类视频中提取运动学交互先验,包括人-物体运动轨迹和接触标签。其次,一个特权的基于物理的精炼器使用统一的模仿奖励和渐进状态池将不完美的先验转化为物理上可行的高保真技能。最后,精炼后的技能被提炼为一个层次化的自主策略,包括命令生成器和命令跟踪器。我们在六个具有代表性的运动操控任务上对SUGAR进行了模拟和现实世界类人硬件的评估。我们的方法显著优于参考跟踪基线,且性能明显随着人类视频数据的增加而提升。它还实现了零样本的现实世界转移,具备可靠的闭环执行、自主故障恢复能力,以及在外部扰动下的稳定长时间性能。项目页面:https://tianshuwu.github.io/sugar-humanoid/
cs.RO / 5 / 2605.20392
VBT-MPC: Vision-Based Tactile MPC for Contour Following
VBT-MPC:基于视觉的触觉模型预测控制用于轮廓跟踪
Abstract
Tactile sensing plays a key role in robotic manipulation, particularly in tasks like surface inspection. Successful execution requires maintaining contact while accurately tracking object contours. In this work, we propose a Vision-Based Tactile Model Predictive Control (VBT-MPC) framework for robotic contour following using a Vision-Based Tactile Sensor (VBTS) mounted in an eye-in-hand configuration. The proposed controller operates directly in contour features space, thereby avoiding the need for separate pose-estimation modules or complex force-control architectures. We further compare our VBT-MPC with visual-servoing strategies adapted to tactile features, and evaluate contour tracking on objects with diverse geometries and materials in both simulation and real-world experiments.
Chinese Translation
触觉传感在机器人操作中发挥着关键作用,特别是在表面检查等任务中。成功执行这些任务需要在保持接触的同时准确跟踪物体轮廓。在本研究中,我们提出了一种基于视觉的触觉模型预测控制框架(VBT-MPC),用于利用安装在眼手结合配置中的基于视觉的触觉传感器(VBTS)进行机器人轮廓跟踪。所提出的控制器直接在轮廓特征空间中操作,从而避免了对单独的姿态估计模块或复杂的力控制架构的需求。我们进一步将我们的VBT-MPC与适应于触觉特征的视觉伺服策略进行了比较,并在模拟和现实实验中评估了对不同几何形状和材料物体的轮廓跟踪。
cs.RO / 6 / 2605.20395
Scalable Multi-robot Motion Planning via Hierarchical Subproblem Expansion and Workspace Decomposition Refinement
通过层次子问题扩展和工作空间分解精细化实现可扩展的多机器人运动规划
Abstract
A fundamental challenge in multi-robot motion planning is achieving sufficient coordination to avoid inter-robot conflicts without incurring the large computational expense of searching the joint configuration space of the robot group. In this work, we present a method for multiple mobile robot motion planning that achieves an improvement in planning time up to an order of magnitude by leveraging the insight that we can use discrete search over a workspace decomposition to provide coordination between robots during planning. While prior work uses workspace topology to inform when coordination between robots is needed and then composes robots into their joint configuration space, we take a step further by iteratively refining our workspace representation to allow our planner to search smaller, decoupled configuration spaces.
Chinese Translation
多机器人运动规划中的一个基本挑战是实现足够的协调,以避免机器人之间的冲突,同时又不需要在机器人组的联合配置空间中进行大量计算开销的搜索。在本研究中,我们提出了一种多移动机器人运动规划的方法,通过利用对工作空间分解进行离散搜索的洞察,显著提高了规划时间,最多可提高一个数量级。以往的研究使用工作空间拓扑来判断何时需要机器人之间的协调,然后将机器人组合到它们的联合配置空间中,而我们进一步采取了迭代精细化工作空间表示的方法,使我们的规划器能够搜索更小的、解耦的配置空间。
cs.RO / 7 / 2605.20433
Spacetime Optimal-Transport Attention for Visuo-Haptic Imitation Learning of Contact-Rich Manipulation
时空最优传输注意力用于接触丰富的操作的视觉-触觉模仿学习
Abstract
Contact-rich manipulation tasks such as tight-clearance insertion, connector mating, polishing, and surface-conforming wiping remain difficult for data-driven controllers because they couple discontinuous contact dynamics, partial observability, and strict safety constraints. No single sensing modality suffices: vision supplies global context before contact, force/torque (F/T) feedback governs interaction after contact, and proprioceptive pose provides a consistent kinematic backbone. Most prior imitation-learning policies for contact-rich tasks operate on uni- or bi-modal signals, and the few that fuse three modalities typically adopt off-the-shelf attention modules with no explicit prior on how attention mass should be distributed across task-relevant regions. We present Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that replaces softmax-normalized patch attention by an entropy-regularized Optimal Transport (OT) alignment between force-pose-derived sub-queries and visual patches. Explicit marginal constraints act as a structured inductive bias for contact-rich tasks, encouraging conditioning-aware spatial selection that is stable across illumination, distractors, and partial occlusion. SO-TA is paired with a diffusion-based sequence policy mapping observation windows to pose-action chunks. We evaluate SO-TA on three real-robot tasks: tight peg-in-hole assembly, BCM wiring-connector insertion, and curved-surface mark erasing. With ~200 rollouts per condition, SO-TA reaches 100% success on tight peg-in-hole versus 93% for cross-attention at matched capacity, and retains 82.5% success under illumination, distractor, and partial-occlusion perturbations where a concatenation baseline drops to 43.5%. OT-derived patch heatmaps and leave-one-out modality-influence ratios provide interpretable, phase-dependent diagnostics.
Chinese Translation
接触丰富的操作任务,如紧配合插入、连接器配对、抛光和表面贴合擦拭,对于数据驱动的控制器仍然具有挑战性,因为它们结合了不连续的接触动力学、部分可观测性和严格的安全约束。单一的感知模态不足以应对这些任务:视觉在接触前提供全局背景,力/扭矩(F/T)反馈在接触后主导交互,而本体感知姿态则提供一致的运动学基础。大多数先前针对接触丰富任务的模仿学习策略仅基于单模态或双模态信号,而少数融合三种模态的策略通常采用现成的注意力模块,并没有明确的先验知识来指导注意力分布在任务相关区域的方式。我们提出了时空最优传输注意力(Spacetime Optimal-Transport Attention, SO-TA),这是一种三模态融合的基础架构,通过在力-姿态衍生的子查询和视觉补丁之间进行熵正则化的最优传输(Optimal Transport, OT)对齐,替代了软最大归一化的补丁注意力。显式的边际约束作为接触丰富任务的结构性归纳偏置,鼓励在不同照明、干扰物和部分遮挡下稳定的条件感知空间选择。SO-TA与基于扩散的序列策略相结合,将观察窗口映射到姿态-动作块。我们在三个真实机器人任务上评估SO-TA:紧配合的孔中插入、BCM接线器插入和曲面标记擦除。在每个条件下进行约200次实验,SO-TA在紧配合的孔中插入任务中达到100%的成功率,而交叉注意力在相同能力下的成功率为93%,在照明、干扰物和部分遮挡扰动下保持82.5%的成功率,而拼接基线下降至43.5%。OT衍生的补丁热图和逐一排除模态影响比率提供了可解释的、相位依赖的诊断信息。
cs.RO / 8 / 2605.20484
Enhancing Graph-Based SLAM in GNSS-Denied environments by leveraging leg odometry
通过利用腿部里程计增强GNSS缺失环境中的基于图的SLAM
Abstract
Autonomous navigation in GNSS-denied environments remains a core challenge for legged robots, where exteroceptive sensors such as LiDAR are prone to elevation drift in geometrically sparse or repetitive scenes. We present a factor graph architecture that augments the LIO-SAM framework with a parallel kinematic lane driven by proprioceptive leg odometry, coupled to the main LiDAR-inertial lane via an identity relative pose constraint with a selective noise model. Applied to a Linxai D50 quadruped platform across two outdoor loops totaling over one kilometer, our approach reduces elevation drift from over 30m to under 30cm and enables convergence in a scene where the baseline pipeline fails entirely. These results suggest that proprioceptive data, already computed onboard for gait control, constitutes a lightweight and effective vertical anchor for SLAM in GNSS-denied settings.
Chinese Translation
在GNSS缺失环境中,自主导航仍然是四足机器人面临的核心挑战,其中激光雷达等外部传感器在几何稀疏或重复场景中容易出现高度漂移。我们提出了一种因子图架构,该架构通过自我感知的腿部里程计增强了LIO-SAM框架,并通过选择性噪声模型与主激光雷达-惯性通道通过身份相对姿态约束相结合。我们的方案应用于Linxai D50四足平台,在两个户外循环中,总长度超过一公里,显著降低了高度漂移,从超过30米减少到30厘米以下,并在基线管道完全失效的场景中实现了收敛。这些结果表明,已经在车载系统中计算的自我感知数据为GNSS缺失环境中的SLAM提供了一种轻量且有效的垂直锚点。
cs.RO / 9 / 2605.20544
The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents
顺从者综合症:在具身机器人代理中的弃权基准测试
Abstract
Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at https://purseclab.github.io/RoboAbstention/.
Chinese Translation
视觉-语言模型(VLMs)被用作具身代理的高级规划者,将自然语言指令和视觉观察转化为行动计划。尽管先前的研究已探讨了大型语言模型(LLMs)中的弃权问题,但现有基准大多仅限于文本,未能捕捉到具身机器人环境中固有的感知基础和物理约束。在这种环境中,弃权需要识别指令何时模糊、物理上不可行、基于错误前提或在可用感官模态和上下文下无法解决。为了解决这一空白,我们引入了一种分类法,以在具身机器人背景下对弃权进行分类,并提出了RoboAbstention,这是一个可扩展且可审计的框架,用于生成基于从五个机器人数据集中收集的图像的弃权指令。RoboAbstention通过一个三阶段流程实现了这一分类法:(1)结构化视觉基础,(2)确定性约束推导,以及(3)通过特定类别模板的受控指令生成。这使得构建具有可验证弃权条件的多样化数据集成为可能。我们评估了几种前沿VLM,并发现所有模型在弃权方面都表现出显著的弱点,包括那些具有先进推理能力的模型。表现最佳的模型Gemini 2.5 Flash仅在我们6069个基准指令中弃权39.0%,而具身规划器Gemini Robotics ER 1.6 Preview仅弃权16.5%。我们进一步探索了改善VLM规划者弃权的方法,例如防御性提示和上下文学习,发现这些干预措施显著提高了性能,Gemini Robotics ER 1.6 Preview的弃权率达到93.6%,GPT 5.4 Mini的弃权率为88.6%,但没有一种方法能够完全解决这一问题。我们在https://purseclab.github.io/RoboAbstention/上开源了RoboAbstention。
cs.RO / 10 / 2605.20561
Fault-Tolerant, Rigidity-Preserving Control of Inflatable Truss Robots
气囊桁架机器人故障容忍与刚度保持控制
Abstract
Isoperimetric robotic trusses can adapt to different tasks and environments because they have a high strength-to-weight ratio, can change their own shape dramatically, and can be reconfigured into a variety of different shapes. However, motor failures in operational environments can severely limit operational capabilities if not properly addressed. This paper presents a fault-tolerant control framework for an inflatable robotic truss that maintains functionality despite motor failures, shown through three key contributions. First, we extend the kinematic optimization to handle arbitrary combinations of motor failures by imposing equality constraints to ensure failed actuators are not used. Second, we introduce discrete-time control barrier function (DTCBF) constraints that mathematically guarantee structural rigidity while maximizing workspace utilization, a critical requirement for reliable operation of truss robots under discrete-time control. Third, we implement closed-loop position control using onboard encoder feedback and a forward kinematics-based state estimator, improving positional accuracy in the presence of disturbances. We validate our approach through simulation and hardware experiments on a 2D isoperimetric truss testbed. For a 2D configuration with 6 actuators, we demonstrate >69% workspace preservation under single-motor failures and a >25% improvement in tracking accuracy with closed-loop control. These results establish a foundation for more robust and resilient isoperimetric truss robots operating under degraded actuation.
Chinese Translation
等周机器人桁架因其高强度与重量比、能够显著改变自身形状以及能够重新配置为多种不同形状,能够适应不同的任务和环境。然而,操作环境中的电机故障如果未得到妥善处理,会严重限制其操作能力。本文提出了一种气囊机器人桁架的故障容忍控制框架,能够在电机故障的情况下保持功能,主要通过三个关键贡献来实现。首先,我们扩展了运动学优化,以处理任意组合的电机故障,通过施加等式约束确保故障的执行器不被使用。其次,我们引入了离散时间控制障碍函数(DTCBF)约束,数学上保证了结构的刚度,同时最大化工作空间的利用,这是在离散时间控制下桁架机器人可靠操作的关键要求。第三,我们使用机载编码器反馈和基于前向运动学的状态估计器实现闭环位置控制,提高了在干扰存在下的位置精度。我们通过在二维等周桁架测试平台上的仿真和硬件实验验证了我们的方法。在具有6个执行器的二维配置中,我们展示了在单电机故障下工作空间保持率超过69%,并且闭环控制下跟踪精度提高了超过25%。这些结果为在退化驱动下操作的更强大和更具韧性的等周桁架机器人奠定了基础。
cs.RO / 11 / 2605.20566
Conflict-Aware Active Perception and Control in 3D Gaussian Splatting Fields via Control Barrier Functions
基于控制屏障函数的3D高斯点云场中的冲突感知主动感知与控制
Abstract
Active perception in uncertain environments requires robots to navigate safely while acquiring informative observations to reduce map uncertainty. These objectives inherently conflict, as informative viewpoints often lie near uncertain regions with higher collision risk. To address this challenge, we develop a conflict-aware active perception and control framework for robotic systems operating in environments represented by 3D Gaussian Splatting (3DGS). Safety is enforced using a Control Barrier Function (CBF) derived from an Average Value-at-Risk AV@R collision-risk metric that accounts for geometric uncertainty and guarantees forward invariance of a safe set. To improve perception, we propose a risk-aware Expected Information Gain (EIG) formulation for selecting the next-best-view and introduce perception barrier functions that align the camera orientation with the local information-ascent direction. To obtain a tractable formulation for these conflicting safety and perception objectives, we propose a unified safety-critical, perception-aware quadratic program that enforces safety as a hard constraint while relaxing perception constraints through slack variables. Simulation results demonstrate that the proposed method improves both safety and information acquisition compared to existing 3DGS-based approaches.
Chinese Translation
在不确定环境中,主动感知要求机器人在获取信息性观测以减少地图不确定性的同时安全导航。这些目标本质上存在冲突,因为信息性视点通常位于具有较高碰撞风险的不确定区域附近。为了解决这一挑战,我们开发了一种冲突感知的主动感知与控制框架,适用于在3D高斯点云(3D Gaussian Splatting, 3DGS)表示的环境中运行的机器人系统。安全性通过控制屏障函数(Control Barrier Function, CBF)来强制执行,该函数源自考虑几何不确定性的平均风险值(Average Value-at-Risk, AV@R)碰撞风险度量,并保证安全集的前向不变性。为了提高感知能力,我们提出了一种风险感知的期望信息增益(Expected Information Gain, EIG)公式,用于选择下一个最佳视点,并引入感知屏障函数,使相机朝向与局部信息上升方向对齐。为了获得这些冲突的安全性和感知目标的可处理公式,我们提出了一种统一的安全关键、感知感知的二次规划(quadratic program),该规划将安全性作为硬约束,同时通过松弛变量放宽感知约束。仿真结果表明,与现有的基于3DGS的方法相比,所提方法在安全性和信息获取方面均有所改善。
cs.RO / 12 / 2605.20595
Intent-First Aerial V2V for Tactical Coordination and Separation: Protocol and Performance Under Density and Disturbance
以意图为先的空中V2V战术协调与分离:密度与干扰下的协议与性能
Abstract
Dense low-altitude aerial operations require more than pre-flight route coordination and last-resort collision avoidance. Once aircraft are airborne, disturbances can emerge on timescales shorter than strategic reauthorization can absorb, while collision avoidance is too late and disruptive to serve as routine traffic management. Although tactical separation is recognized as the intermediate layer, realizing it at scale requires a deployable neighborhood communication mechanism that provides fresh, trusted information for local coordination. This paper presents what is, to our knowledge, the first controller-coupled characterization of an all-airborne, sidelink-class, intent-first vehicle-to-vehicle (V2V) tactical neighborhood exchange stack for dense Unmanned Aircraft System Traffic Management (UTM) operations. Unlike awareness-only broadcast, the proposed exchange combines refreshed state and intent beacons for local awareness, cooperative perception, and degraded-mode assessment with event-triggered messages for yielding, sequencing, release, and contingency coordination. We implement and evaluate this model on an all-airborne V2V stack using sidelink-class C-V2X modules with authenticated freshness checks. Evaluation uses a scenario-driven, high-volume stress campaign supported by real-time, field-anchored infrastructure. Results show that V2V reduces stale-belief divergence, preserves observability through cooperative perception, rejects invalid tactical messages, suppresses false local inference, and structures shared-resource coordination. The implemented stack provides a viable communication layer for tactical separation in lower-to-moderate regimes, but transitions toward guarded fallback as density, impairment, and complexity increase. These findings position intent-first aerial V2V as a bounded enabler for scaling tactical coordination in disturbance-driven urban airspace.
Chinese Translation
密集的低空空中操作不仅需要预先的航线协调和最后的碰撞避免。一旦飞机起飞,干扰可能在比战略重新授权更短的时间尺度上出现,而碰撞避免在此时已为时已晚,且会对常规交通管理造成干扰。尽管战术分离被视为中间层,但在大规模实现这一目标需要一种可部署的邻域通信机制,以提供新鲜、可信的信息用于本地协调。本文提出了我们所知的第一个与控制器耦合的全空中、侧链级别、以意图为先的车辆对车辆(V2V)战术邻域交换栈,旨在支持密集的无人机系统交通管理(UTM)操作。与仅仅进行意识广播不同,所提出的交换结合了更新的状态和意图信标,以实现本地意识、协作感知和降级模式评估,并通过事件触发消息进行让路、排序、释放和应急协调。我们在一个全空中V2V栈上实现并评估了该模型,使用侧链级C-V2X模块并进行认证的新鲜度检查。评估使用了一个基于场景驱动的高容量压力测试活动,并得到实时、现场锚定基础设施的支持。结果表明,V2V减少了过时信念的偏差,通过协作感知保持可观测性,拒绝无效的战术消息,抑制错误的本地推断,并构建共享资源的协调。所实现的栈为低到中等密度下的战术分离提供了可行的通信层,但随着密度、损伤和复杂性的增加,逐渐过渡到受限的后备模式。这些发现将以意图为先的空中V2V定位为在干扰驱动的城市空域中扩展战术协调的有限推动者。
cs.RO / 13 / 2605.20648
Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition
联合学习谓词和动作实现零-shot技能组合
Abstract
Learning from Demonstration (LfD) enables robots to learn complex behaviors from expert examples, yet existing approaches often fail to generalize to new compositions of known skills without retraining. Modern generative policies model distributions over action trajectories alone, thus are unable to reason about the symbolic outcomes required for robust composition. We propose that skills should jointly model action trajectories and the symbolic outcomes they induce. To address this gap, we introduce Predicate Action Skills (PACTS), a class of closed-loop visuomotor policies that model skills as a joint generative process over action and predicate belief trajectories, producing coherent action-outcome rollouts within a single model. Jointly generating actions and predicates enables PACTS to learn internal representations that improve both action generation and predicate classification. Furthermore, we demonstrate zero-shot composition of learned skills via planning by leveraging online predicate predictions from PACTS as a symbolic interface for sequencing and monitoring execution. Project website: https://planpacts.github.io/
Chinese Translation
从示范学习(LfD)使机器人能够从专家示例中学习复杂行为,但现有方法往往无法在不重新训练的情况下对已知技能的新组合进行泛化。现代生成策略仅对动作轨迹建模分布,因此无法推理出强健组合所需的符号结果。我们提出技能应联合建模动作轨迹及其引发的符号结果。为了解决这一问题,我们引入了谓词动作技能(PACTS),这是一类闭环视觉运动策略,将技能建模为对动作和谓词信念轨迹的联合生成过程,在单一模型中生成一致的动作-结果展开。联合生成动作和谓词使得PACTS能够学习内部表示,从而改善动作生成和谓词分类。此外,我们通过规划展示了学习技能的零-shot组合,利用来自PACTS的在线谓词预测作为符号接口,以便进行序列化和执行监控。项目网站:https://planpacts.github.io/
cs.RO / 14 / 2605.20666
A Semantic and Occlusion-Aware GM-PHD Filter
一种语义与遮挡感知的高斯混合概率假设密度滤波器
Abstract
This paper proposes a new birth model including semantic information derived from deep learning to create an occlusion-aware Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter. Unlike prior approaches that rely on simplistic or uniform assumptions, the proposed Semantic-Occlusion Aware (S-OA) birth model defines initialization terms by explicitly considering regions of occlusion and by leveraging semantic information about the environment. This enables the filter to accurately represent where new objects are more likely to appear, thereby improving tracking performance in complex and high-density driving scenarios. The method is evaluated through Monte Carlo simulations and experiments on the KITTI dataset. Performance is assessed by measuring the latency between first detection and track initiation, along with the mean absolute cardinality error and the Optimal Subpattern Assignment (OSPA) metric. Results demonstrate that the S-OA birth model reduces initialization delay in occlusion-heavy settings, matching or outperforming the strongest baseline in approximately 70% of cases. A sensitivity analysis of birth model weights is also provided. Overall, the findings underscore the benefits of integrating occlusion reasoning and semantic priors into Bayesian tracking frameworks for autonomous driving.
Chinese Translation
本文提出了一种新的出生模型,该模型结合了来自深度学习的语义信息,以创建一个遮挡感知的高斯混合概率假设密度(GM-PHD)滤波器。与依赖于简单或均匀假设的先前方法不同,所提出的语义-遮挡感知(S-OA)出生模型通过明确考虑遮挡区域并利用环境的语义信息来定义初始化项。这使得滤波器能够准确表示新物体更可能出现的位置,从而提高在复杂和高密度驾驶场景中的跟踪性能。该方法通过蒙特卡洛模拟和KITTI数据集上的实验进行评估。性能通过测量首次检测与轨迹启动之间的延迟,以及平均绝对基数误差和最优子模式分配(OSPA)指标进行评估。结果表明,S-OA出生模型在遮挡严重的环境中减少了初始化延迟,在约70%的情况下与最强基线相匹配或超越。此外,还提供了出生模型权重的敏感性分析。总体而言,研究结果强调了将遮挡推理和语义先验整合到自主驾驶的贝叶斯跟踪框架中的好处。
cs.RO / 15 / 2605.20752
GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation
GaussianDream:一种用于机器人操作的前馈3D高斯世界模型
Abstract
Vision-language-action (VLA) policies have advanced language-conditioned robotic manipulation by transferring semantic priors from pretrained vision-language models to action generation. Yet, standard action-imitation training often provides limited explicit supervision for 3D geometry, dense visual structure, and short-horizon environment evolution, which are critical for physically precise manipulation. We introduce \textbf{GaussianDream}, a feed-forward 3D Gaussian world-model plug-in that turns robot trajectories into structured spatial-temporal supervision. The key idea is to couple current Gaussian reconstruction with horizon-conditioned future Gaussian prediction during training, forcing a compact spatio-temporal prefix to be decodable into renderable 3D Gaussian states. This enables dense RGB rendering, depth, and pseudo 3D scene-flow supervision without requiring test-time Gaussian decoding. At inference, GaussianDream discards all auxiliary decoding heads and retains only the learned prefix to condition action generation, avoiding rendering, video rollout, or additional planning during closed-loop control. Experiments on LIBERO, RoboCasa Human-50, and real-robot tasks demonstrate strong and highly competitive performance, achieving \textbf{98.4\%} average success on LIBERO, \textbf{52.6\%} on RoboCasa Human-50, and \textbf{50.0\%} in real-world evaluation.
Chinese Translation
视觉-语言-动作(VLA)策略通过将预训练视觉-语言模型中的语义先验转移到动作生成中,推动了语言条件下的机器人操作。然而,标准的动作模仿训练通常对3D几何、密集视觉结构和短期环境演变提供有限的显式监督,而这些对于物理精确的操作至关重要。我们提出了 extbf{GaussianDream},一种前馈3D高斯世界模型插件,将机器人轨迹转化为结构化的时空监督。其核心思想是在训练过程中将当前的高斯重构与基于地平线的未来高斯预测相结合,迫使紧凑的时空前缀可解码为可渲染的3D高斯状态。这使得在不需要测试时高斯解码的情况下,实现密集的RGB渲染、深度和伪3D场景流监督。在推理阶段,GaussianDream丢弃所有辅助解码头,仅保留学习到的前缀以条件化动作生成,避免在闭环控制过程中进行渲染、视频展开或额外规划。在LIBERO、RoboCasa Human-50和真实机器人任务上的实验表明,GaussianDream表现出强大且高度竞争的性能,在LIBERO上实现了 extbf{98.4\%}的平均成功率,在RoboCasa Human-50上为 extbf{52.6\\%},在真实世界评估中为 extbf{50.0\\%}。
cs.RO / 16 / 2605.20774
VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models
VLA-REPLICA:一种低成本、可重复的基准,用于视觉-语言-动作模型的真实世界评估
Abstract
Vision-Language-Action (VLA) models have shown strong promise for general-purpose robotic manipulation, but their real-world evaluation remains limited by a lack of accessible, reproducible, and consistent benchmarks. Simulation benchmarks fail to capture real-world complexity, while existing real-world benchmarks often require expensive hardware, centralized evaluation, or are limited in task diversity. We introduce VLA-REPLICA, a low-cost, easily reproducible real-world benchmark for evaluating VLA models. Built from off-the-shelf components, our system can be quickly assembled and replicated across laboratories, providing a consistent environment for policy evaluation anywhere in the world. VLA-REPLICA includes a diverse suite of manipulation tasks and a small-scale demonstration dataset for target-domain adaptation, with real-world evaluation protocols for both in-distribution and out-of-distribution settings. Experiments with imitation learning and state-of-the-art VLA models reveal model strengths and limitations, while consistent results across independently constructed setups demonstrate the reproducibility of our benchmark.
Chinese Translation
视觉-语言-动作(VLA)模型在通用机器人操作方面展现出强大的潜力,但其在真实世界中的评估受到缺乏可获取、可重复和一致的基准的限制。模拟基准无法捕捉真实世界的复杂性,而现有的真实世界基准往往需要昂贵的硬件、集中评估,或在任务多样性方面受到限制。我们提出了VLA-REPLICA,这是一种低成本、易于重复的真实世界基准,用于评估VLA模型。我们的系统由现成组件构建,可以快速组装并在各实验室间复制,为全球范围内的策略评估提供一致的环境。VLA-REPLICA包括多样化的操作任务套件和一个小规模的演示数据集,用于目标领域适应,具有针对分布内和分布外设置的真实世界评估协议。通过模仿学习和最先进的VLA模型的实验揭示了模型的优缺点,而在独立构建的设置中获得的一致结果证明了我们基准的可重复性。
cs.RO / 17 / 2605.20796
CMC-Opt: Constraint Manifold with Corners for Inequality-Constrained Optimization
CMC-Opt:用于不等式约束优化的带角约束流形
Abstract
We introduce a manifold-based framework for addressing optimization problems with equality and inequality constraints found in robotics. Our approach transforms the original problem into an unconstrained optimization problem directly on the constrained state space. To achieve this, we introduce ``constraint manifolds with corners" to represent the state space satisfying mixed nonlinear equality and inequality constraints. We further extend manifold optimization algorithms to operate on this new topological structure. We demonstrate the power and robustness of our framework in the context of a large-scale kinodynamic planning problem, successfully generating dynamically feasible trajectories where standard methods fail.
Chinese Translation
我们提出了一种基于流形的框架,用于解决机器人领域中的带有等式和不等式约束的优化问题。我们的方法将原始问题转化为在约束状态空间上直接进行的无约束优化问题。为此,我们引入了“带角约束流形”来表示满足混合非线性等式和不等式约束的状态空间。我们进一步扩展了流形优化算法,以适应这一新的拓扑结构。我们在一个大规模动力学规划问题的背景下展示了我们框架的强大和鲁棒性,成功生成了在标准方法失效情况下的动态可行轨迹。
cs.RO / 18 / 2605.20801
Q-SpiRL: Quantum Spiking Reinforcement Learning for Adaptive Robot Navigation
Q-SpiRL:用于自适应机器人导航的量子脉冲强化学习
Abstract
Adaptive robot navigation in dynamic environments requires policies that can reach the target reliably while producing efficient and stable trajectories. This paper presents Q-SpiRL, a quantum spiking reinforcement learning framework for obstacle-aware robot navigation. The framework develops and evaluates five agent families: tabular Q-learning, classical MLP, classical SNN, quantum-enhanced MLP (QMLP), and quantum-enhanced spiking neural network (QSNN). While all models are implemented under a unified training and evaluation pipeline, the QSNN is the central architecture of interest, as it combines spike-based temporal processing with variational quantum feature transformation. Experiments are conducted across three grid-world environments of increasing size, namely 20x20, 30x30, and 40x40, with both static and dynamic obstacles. Performance is assessed using success rate, success-weighted path length, path length, and turn rate under deterministic inference. Results show that QSNN achieves the strongest overall trade-off between task completion, trajectory efficiency, and motion smoothness, reaching up to 99% success rate while maintaining high path efficiency in the most challenging setting. Execution on IBM quantum hardware further demonstrates the feasibility of deploying the proposed hybrid policy under real-device conditions.
Chinese Translation
在动态环境中,自适应机器人导航需要能够可靠到达目标的策略,同时产生高效且稳定的轨迹。本文提出了Q-SpiRL,一个用于障碍物感知机器人导航的量子脉冲强化学习框架。该框架开发并评估了五种智能体家族:表格Q学习、经典多层感知器(MLP)、经典脉冲神经网络(SNN)、量子增强多层感知器(QMLP)和量子增强脉冲神经网络(QSNN)。虽然所有模型都在统一的训练和评估流程下实施,但QSNN是主要关注的架构,因为它结合了基于脉冲的时间处理和变分量子特征变换。实验在三个逐渐增大的网格世界环境中进行,分别为20x20、30x30和40x40,包含静态和动态障碍物。通过成功率、成功加权路径长度、路径长度和转向率等指标评估性能,采用确定性推理。结果表明,QSNN在任务完成、轨迹效率和运动平滑性之间实现了最佳的整体权衡,在最具挑战性的环境中成功率高达99%,同时保持高路径效率。在IBM量子硬件上的执行进一步证明了在真实设备条件下部署所提混合策略的可行性。
cs.RO / 19 / 2605.20811
Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation
Demo-JEPA:用于一次性跨体现模仿的联合嵌入预测架构
Abstract
Robotic imitation learning is often treated as reproducing demonstrated actions, but actions are inherently embodiment-specific. When demonstrations come from humans or robots with different morphology, kinematics, or action spaces, this action-centric view requires shared action spaces, heuristic retargeting, or large-scale multi-embodiment co-training. We instead view demonstrations as implicit specifications of future goals: the target agent should infer what state the demonstrator is trying to realize, rather than how the demonstrator executes it. We propose Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. Built on a JEPA-based world model, Demo-JEPA translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space. The target agent then uses these latent trajectories as subgoals and realizes them through planning under its own learned forward dynamics. Because Demo-JEPA avoids action-level correspondence and requires only visual demonstrations plus the target agent's own interaction experience, it supports flexible imitation across heterogeneous embodiments. Experiments on RLBench and real-world manipulation tasks show that Demo-JEPA matches specialized in-domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.
Chinese Translation
机器人模仿学习通常被视为再现示范动作,但动作本质上是体现特定的。当示范来自于具有不同形态、运动学或动作空间的人类或机器人时,这种以动作为中心的观点需要共享的动作空间、启发式的重定向或大规模的多体现共同训练。我们则将示范视为未来目标的隐式规范:目标代理应推断出示范者试图实现的状态,而不是示范者如何执行它。我们提出了Demo-JEPA,这是一种跨体现模仿框架,它将示范意图与体现特定的执行解耦。基于JEPA(Joint-Embedding Predictive Architecture)世界模型,Demo-JEPA将源视觉示范转换为在共享预测表示空间中与目标兼容的未来潜在轨迹。目标代理随后将这些潜在轨迹作为子目标,并通过在其自身学习的前向动力学下进行规划来实现它们。由于Demo-JEPA避免了动作级别的对应关系,仅需视觉示范加上目标代理自身的交互经验,因此支持跨异构体现的灵活模仿。在RLBench和现实世界操控任务上的实验表明,Demo-JEPA与专业的领域内规划器相匹配,并且能够推广到先前方法失败的未见任务和体现配置。
cs.RO / 20 / 2605.20850
SmoCap: Unified Scale-Pose Canonicalization with Proxy-Mapped Trust-Region QP
SmoCap:具有代理映射信任区域二次规划的统一尺度-姿态标准化
Abstract
Objective: Stage-wise workflows that separate model scaling and inverse kinematics can induce morphology-posture compensation, resulting in anatomically inconsistent yet numerically acceptable solutions, especially in weakly observed directions. We present SmoCap, a leakage-resistant canonicalization framework that estimates morphology and posture jointly in each local trust-region quadratic program (QP) within a sparse control subspace. Methods: SmoCap solves a constrained trust-region QP with analytical proxy-mapped pose and scale Jacobians. The low dimensional proxy map stabilizes weakly observed directions and drives coordinated structures. An optional pre-solve provides warm starts in difficult configurations. The framework is evaluated using cohort fluoroscopy knee motion, anthropometric ground truth, and extreme yoga sequences. Results: SmoCap achieved 2.9 degree knee flexion RMSE against fluoroscopy, and a pooled anthropometric endpoint error around 3%. In the leakage audit against segment wise scaling, SmoCap also reduced marker RMSE, FE error, and anthropometric endpoint error. Proxy coupling preserved expressive and coordinated spine motion with marginal fitting error increase (+0.14 mm, +0.6%) against baseline models in yoga ablation. Median marker RMSE was around 20 mm, and median runtime was 0.204-0.332 ms/frame, achieved with consistently 2-3 iterations. Conclusion: SmoCap provides an externally validated unified coupling-aware scale-pose framework, making externally consistent motion canonicalization practical at dataset scale.
Chinese Translation
目的:将模型缩放和逆向运动学分开的阶段性工作流程可能导致形态-姿态补偿,从而产生解剖学上不一致但在数值上可接受的解决方案,尤其是在观察较弱的方向上。我们提出了SmoCap,一种抗泄漏的标准化框架,它在稀疏控制子空间内的每个局部信任区域二次规划(QP)中联合估计形态和姿态。方法:SmoCap通过解析的代理映射姿态和尺度雅可比矩阵求解受限的信任区域QP。低维代理映射稳定了观察较弱的方向,并驱动协调结构。可选的预求解在困难配置中提供了热启动。该框架使用队列荧光膝关节运动、人体测量真实值和极限瑜伽序列进行了评估。结果:SmoCap在荧光成像下实现了2.9度的膝关节屈曲均方根误差(RMSE),以及约3%的汇总人体测量端点误差。在针对分段缩放的泄漏审计中,SmoCap还降低了标记RMSE、FE误差和人体测量端点误差。代理耦合在瑜伽消融实验中保持了富有表现力和协调的脊柱运动,拟合误差仅略微增加(+0.14 mm,+0.6%),与基线模型相比。中位数标记RMSE约为20 mm,中位数运行时间为0.204-0.332 ms/帧,始终以2-3次迭代实现。结论:SmoCap提供了一个经过外部验证的统一耦合感知尺度-姿态框架,使得在数据集规模上实现外部一致的运动标准化成为可能。
cs.RO / 21 / 2605.20856
DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation
DISC:通过策略生成将指令与状态条件控制解耦
Abstract
Language-conditioned manipulation policies typically process instructions and observations through shared network parameters. This task-state entanglement provides a pathway for observation leakage -- networks learn scene-to-action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task-specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; therefore, its task-awareness must come from the language. Consequently, observation leakage has no pathway to emerge. On the other hand, generating coherent high-dimensional policy weights is itself a challenging problem. We address it with a two-stage hypernetwork whose refinement stage embeds the structure of gradient-based optimization as a feed-forward inductive bias, producing globally consistent parameters without actual gradient computation. Trained entirely from scratch on standard data budgets, DISC outperforms all entangled baselines on LIBERO-90 and Meta-World, with advantages that widen on complex, long-horizon tasks -- and surpasses the large-scale pretrained $\pi_0$ despite using no external pretraining data. On a real-world benchmark where all tasks share identical visual context, DISC substantially outperforms entangled alternatives, directly confirming that language-generated policy parameters, not visual shortcuts, drive behavior. The hypernetwork further learns a semantically structured parameter manifold that enables few-shot adaptation from minimal demonstrations and robust generalization across paraphrased instructions. Our code is available at: {https://github.com/ReNginx/DISC}.
Chinese Translation
语言条件的操作策略通常通过共享的网络参数处理指令和观察。这种任务与状态的纠缠为观察泄漏提供了一条路径——网络学习场景到动作的捷径,完全绕过语言的基础。DISC 从结构上消除了这种失败。DISC 不再将通用策略与语言条件化,而是使用超网络仅根据指令生成特定任务的视觉运动策略的整个参数集。生成的策略从不直接访问语言;因此,其任务意识必须来自语言。因此,观察泄漏没有出现的路径。另一方面,生成一致的高维策略权重本身就是一个具有挑战性的问题。我们通过一个两阶段的超网络来解决这个问题,其精炼阶段将基于梯度优化的结构嵌入为前馈归纳偏置,产生全局一致的参数而无需实际的梯度计算。完全从头开始在标准数据预算上进行训练,DISC 在 LIBERO-90 和 Meta-World 上超越了所有纠缠的基线,在复杂的长时间任务中优势更为明显——并且尽管没有使用外部预训练数据,仍超越了大规模预训练的 $ ext{π}_0$。在一个所有任务共享相同视觉上下文的真实世界基准上,DISC 显著优于纠缠的替代方案,直接确认了由语言生成的策略参数,而非视觉捷径,驱动行为。超网络进一步学习了一个语义结构化的参数流形,使其能够从最小的演示中进行少量适应,并在改述的指令中实现稳健的泛化。我们的代码可在以下网址获取:{https://github.com/ReNginx/DISC}。
cs.RO / 22 / 2605.20894
Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation
移动 UMI:具有解耦运动学的跨视图扩散策略用于移动操作
Abstract
Mobile imitation learning on portable demonstration interfaces faces two coupled bottlenecks: locomotion-contaminated action labels and inference-induced execution latency on a continuously moving base. Recent wrist-mounted interfaces lower the cost of tabletop data collection, yet a single wrist view does not capture the global context required for base navigation. Adding a body-mounted camera entangles human walking with hand motion. Meanwhile, generative policies introduce hundreds of milliseconds of inference latency, during which the base advances past predicted waypoints, forcing backward corrections at action splices. This paper presents Mobile UMI, a hardware-free demonstration framework that addresses both gaps through three components. First, a dual-camera capture system records chest-centric global context and wrist-centric local interaction without any robot present. Second, a one-shot ChArUco-based spatial anchor unifies the chest and hand visual-inertial frames; the hand pose is then re-expressed relative to the chest to extract decoupled SE(3) manipulation and SE(2) base trajectories. Third, an asynchronous receding-horizon executor performs online state matching: each generated action chunk is realigned with the current physical pose so that expired waypoints are discarded before execution. The full system is evaluated on four long-horizon household tasks, achieving an average success rate of 83.8% over 100 trials per task. Controlled comparisons against ACT and Diffusion Policy show that the chest-relative label alone closes much of the gap; online state matching closes the remainder. These results indicate that, for mobile imitation learning under the tested conditions, explicit kinematic factorization combined with state-level latency alignment provides an effective solution without requiring architectural changes to the underlying policy class.
Chinese Translation
便携式演示接口上的移动模仿学习面临两个耦合瓶颈:运动污染的动作标签和在持续移动基座上的推理引起的执行延迟。最近的腕部安装接口降低了桌面数据收集的成本,但单一的腕部视角无法捕捉基座导航所需的全局上下文。添加身体安装的摄像头将人类行走与手部运动纠缠在一起。同时,生成策略引入了数百毫秒的推理延迟,在此期间,基座超越了预测的路径点,迫使在动作拼接时进行向后修正。本文提出了移动 UMI,这是一种无硬件的演示框架,通过三个组件解决了这两个问题。首先,双摄像头捕捉系统记录了以胸部为中心的全局上下文和以腕部为中心的局部交互,而无需任何机器人存在。其次,基于 ChArUco 的一次性空间锚点统一了胸部和手部的视觉惯性框架;然后,手部姿态相对于胸部重新表达,以提取解耦的 SE(3) 操作和 SE(2) 基座轨迹。第三,异步递归地平面执行器执行在线状态匹配:每个生成的动作块与当前物理姿态重新对齐,以便在执行之前丢弃过期的路径点。整个系统在四个长时间的家庭任务上进行了评估,在每个任务的 100 次试验中实现了 83.8% 的平均成功率。与 ACT 和扩散策略的对照比较表明,仅胸部相对标签就缩小了大部分差距;在线状态匹配则弥补了剩余的差距。这些结果表明,在测试条件下,对于移动模仿学习,显式运动学因子化结合状态级延迟对齐提供了一种有效的解决方案,而无需对基础策略类进行架构更改。
cs.RO / 23 / 2605.20917
SubTGraph: Large-Scale Subterranean Environment Synthesis with Controllable Topological Variability for Robotic Autonomy Validation
SubTGraph:具有可控拓扑变异性的规模化地下环境合成用于机器人自主验证
Abstract
Subterranean (SubT) environments have been a frontier for autonomous robotics, driven by the push for automation of mining operations and the interest in planetary exploration (Martian Lava Tubes). Due to the challenges involved in accessing real SubT environments, rigorous hardening of autonomy stacks in realistic simulation environments is critical. This article fills a well-known gap, which relates to the unavailability of a large-scale simulation-based benchmarking infrastructure for rigorous statistical evaluation of robotic autonomy, due to which it is common for SubT research articles to present validation results in a few environments at best. This article presents SubTGraph, a novel framework for rapid synthesis of multi-level SubT environments with high variability, incorporating user specifications related to topology, dimensionality, textures, etc., to generate distinct environments such as operational mines, natural caves and lava tubes. SubTGraph builds a cost matrix from user-specified structural constraints to guide the classical Dijkstra algorithm to procedurally generate SubT worlds utilizing topometric tiles from the DARPA World Generator. Three robotics case-studies are investigated to demonstrate the utility of SubTGraph for rigorous validation of different layers in the robotic autonomy stack. Structural semantic segmentation is validated against topometric ground truths, multi-agent path planning is widely tested for identification of patterns and trends in the algorithm behavior and LIO SLAM is stress-tested in challenging subterranean sections to identify failure cases. The SubTGraph world creation codebase is open-sourced (https://github.com/LTU-RAI/SubTGraph.git) along with a database consisting of 150 highly variable underground worlds.
Chinese Translation
地下(SubT)环境一直是自主机器人研究的前沿领域,这一领域受到矿业自动化和行星探索(火星熔岩管)兴趣的推动。由于访问真实地下环境的挑战,严格在现实模拟环境中强化自主系统至关重要。本文填补了一个众所周知的空白,即缺乏一个大规模基于模拟的基准测试基础设施,以进行机器人自主性的严格统计评估,因此,SubT研究文章通常最多只能在少数环境中呈现验证结果。本文提出了SubTGraph,一个快速合成多层次高变异性地下环境的新框架,结合用户在拓扑、维度、纹理等方面的规格,生成不同的环境,如作业矿、自然洞穴和熔岩管。SubTGraph从用户指定的结构约束构建成本矩阵,以指导经典的Dijkstra算法,利用DARPA世界生成器的拓扑瓷砖程序性生成地下世界。我们研究了三个机器人案例,以展示SubTGraph在机器人自主性堆栈不同层次的严格验证中的实用性。结构语义分割与拓扑真实值进行了验证,多智能体路径规划广泛测试以识别算法行为中的模式和趋势,而LIO SLAM在具有挑战性的地下部分进行了压力测试,以识别失败案例。SubTGraph世界创建代码库已开源(https://github.com/LTU-RAI/SubTGraph.git),并附带一个包含150个高度变异地下世界的数据库。
cs.RO / 24 / 2605.20929
STEAM: A Training-Free Congestion-Aware Enhancement Framework for Decentralized Multi-Agent Path Finding
STEAM:一种无训练的拥堵感知增强框架用于去中心化多智能体路径规划
Abstract
We propose STEAM (Spatial, Temporal, and Emergent congestion Awareness for MAPF), a training-free test-time enhancement framework for learning-based decentralized Multi-Agent Path Finding (MAPF) in discrete environments. Given a pretrained decentralized policy, STEAM requires no retraining, architectural modification, or replacement by a centralized planner. Instead, it injects lightweight congestion-aware guidance into the original policy execution. STEAM first rolls out the shortest paths induced by the current cost-to-go maps to identify potential future congestion hotspots. Spatially avoidable congestion is mitigated by updating agent-specific cost-to-go information, while spatially unavoidable bottlenecks are handled through temporal logit correction. In addition, emergent local congestion is reduced by a density-aware logit correction based on neighboring agents' corrected cost-to-go maps. Extensive experiments on representative learning-based decentralized MAPF algorithms show that STEAM consistently improves success rate, makespan, and solution cost, with success-rate gains of up to 60% and only minor computational overhead. The implementation is available at https://anonymous.4open.science/r/STEAM-MAPF-7A62.
Chinese Translation
我们提出了STEAM(空间、时间和突发拥堵感知的多智能体路径规划),这是一个无训练的测试时增强框架,旨在用于离散环境中的学习基础去中心化多智能体路径规划(MAPF)。在给定一个预训练的去中心化策略的情况下,STEAM无需重新训练、架构修改或替换为中心化规划器。相反,它将轻量级的拥堵感知指导注入到原始策略执行中。STEAM首先根据当前的成本地图推出最短路径,以识别潜在的未来拥堵热点。通过更新特定智能体的成本信息来减轻空间上可避免的拥堵,而空间上不可避免的瓶颈则通过时间逻辑修正来处理。此外,通过基于邻近智能体修正后的成本地图的密度感知逻辑修正,减少突发的局部拥堵。在具有代表性的学习基础去中心化MAPF算法上的大量实验表明,STEAM始终提高了成功率、完工时间和解决成本,成功率提升高达60%,且仅带来轻微的计算开销。该实现可在https://anonymous.4open.science/r/STEAM-MAPF-7A62获取。
cs.RO / 25 / 2605.20932
WiXus: A Wheeled-Legged Robot with Wire-Driven Environmental Utilizing to Integrate Mobility and Manipulation
WiXus:一种轮腿机器人,利用线驱动环境整合移动性与操作性
Abstract
Wheeled-legged robots, which have wheels at their feet and achieve high mobility by coordinating wheel drive and leg drive, have been developed. These robots have been developed purely as platforms specialized for locomotion. Therefore, they do not have a means to repurpose their legs for roles other than locomotion, such as object manipulation or tool utilization. In this paper, we address the problem of how to draw out the potential task-execution capability of the legs by freeing them from the roles of locomotion through external body support. To this end, we propose and develop a new robot, WiXus, which fuses a wheeled-legged mechanism with a wire-driven mechanism that utilizes the external environment. The developed WiXus demonstrates not only planar locomotion with wheeled-legged drive, but also three-dimensional mobility such as cliff climbing by coordinating wire-driven and wheeled-legged actuation. Furthermore, by suspending the body with wire-driven actuation, WiXus successfully repurpose its legs as arms to perform object manipulation, (e.g., rescuing a dog (stuffed animal)), and tool utilization (e.g., harvesting an apple (mockup) with loppers). This study demonstrates that the approach of utilizing the environment with wire-driven actuation is a new design principle that extends the operational domain of wheeled-legged robots.
Chinese Translation
轮腿机器人是在其脚部配备轮子的机器人,通过协调轮子驱动和腿部驱动实现高机动性。这些机器人纯粹作为专门用于运动的平台而开发。因此,它们没有将腿部重新用于运动以外的角色(如物体操作或工具利用)的手段。本文解决了如何通过外部身体支撑来解放腿部的运动角色,从而挖掘其潜在任务执行能力的问题。为此,我们提出并开发了一种新型机器人WiXus,它将轮腿机制与利用外部环境的线驱动机制相结合。开发的WiXus不仅展示了通过轮腿驱动实现的平面运动,还通过协调线驱动和轮腿驱动实现了如悬崖攀爬等三维机动性。此外,通过线驱动悬挂身体,WiXus成功地将其腿部重新用作手臂,以执行物体操作(例如,救助一只狗(玩具))和工具利用(例如,用剪刀采摘苹果(模型))。本研究表明,利用线驱动环境的这一方法是一种新的设计原则,扩展了轮腿机器人的操作领域。
cs.RO / 26 / 2605.21026
Component Influence-Driven Fastener Reduction for Robotic Disassemblability-Aware Design Simplification
基于组件影响驱动的紧固件减少方法用于机器人可拆卸性意识设计简化
Abstract
To accelerate automated remanufacturing, robotic disassembly must be considered during the product design phase. However, designers currently lack quantitative feedback to identify which structural elements hinder robotic operations. To address this, this study proposes an analytical framework that provides actionable redesign guidance focused on fastener reduction, as fasteners are numerous and ubiquitous components found in almost all manufactured products. Using a Computer-Aided Design (CAD) model and its automatically generated Contact-Connection-Constraint (CCC) graph, the framework translates robotic disassembly sequence planning outcomes into component influence scores. These scores reflect how often a component causes structural constraint violations or evaluation objective deteriorations in the robotic disassembly sequence. To visually highlight structural hindrances, the framework projects these scores onto the CAD geometry as 3D heatmaps. The system then analytically simulates the removal of highly influential fasteners. It reports the expected reductions in structural constraints, tool changes, and robot travel distances, while preventing structurally unsafe modifications by evaluating geometric stability metrics. Experiments on seven household appliances demonstrate that the framework successfully targets redundant fasteners. Removing the recommended fasteners simplified the structural dependencies by eliminating between 8 and 132 structural constraints on the graph depending on each product's structural configuration. Furthermore, it improved robotic operational efficiency by eliminating unnecessary tool change operations and shortening travel distances by 165 to 1675 millimeters wherever structurally permissible.
Chinese Translation
为了加速自动化再制造,必须在产品设计阶段考虑机器人拆卸。然而,设计师目前缺乏定量反馈,以识别哪些结构元素妨碍机器人操作。为了解决这个问题,本研究提出了一种分析框架,提供以紧固件减少为重点的可操作性重新设计指导,因为紧固件是几乎所有制造产品中数量众多且普遍存在的组件。该框架利用计算机辅助设计(CAD)模型及其自动生成的接触-连接-约束(CCC)图,将机器人拆卸序列规划结果转化为组件影响评分。这些评分反映了组件在机器人拆卸序列中导致结构约束违反或评估目标恶化的频率。为了直观地突出结构障碍,该框架将这些评分投影到CAD几何体上,形成3D热图。系统随后分析性地模拟去除高度影响的紧固件,报告预期的结构约束、工具更换和机器人移动距离的减少,同时通过评估几何稳定性指标来防止结构不安全的修改。在对七种家用电器进行的实验中,证明该框架成功地针对冗余紧固件。去除推荐的紧固件简化了结构依赖关系,根据每种产品的结构配置,消除了图中8到132个结构约束。此外,它通过消除不必要的工具更换操作和在结构允许的情况下缩短165到1675毫米的移动距离,提高了机器人的操作效率。
cs.RO / 27 / 2605.21031
Modeling and Control of a Pneumatic Morphing Soft Quadrotor based on the SOFA Framework for Dynamic Soft Robotic Simulation
基于SOFA框架的气动变形软四旋翼的建模与控制研究
Abstract
This article presents a novel SOFA based finite element method for the soft body modeling and the corresponding dynamic simulation and control of a pneumatic morphing soft quadrotor. The proposed modeling preserves the physical interpretability and control structure of traditional quadrotor dynamics, while capturing the complex, time-varying behavior of pneumatically actuated soft arms. In SOFA, the soft pneumatically actuated arms are discretized as a tetrahedral mesh following an elastic material law that produces internal forces adequate to the real dynamic behavior of the body. Pneumatic actuation governed by both periodic and error-based control signals is applied within the internal cavities to analyze the morphing capability. Finally, a proportional-integral controller is proposed to study the controlled dynamic behavior and morphing capabilities of the pneumatic arm, wherein the pneumatic actuation to the soft arm is controlled to achieve the desired target position. The simulation results show the effectiveness of the proposed novel modeling framework and the related controller design.
Chinese Translation
本文提出了一种新颖的基于SOFA的有限元方法,用于气动变形软四旋翼的软体建模及相应的动态仿真与控制。所提建模方法保留了传统四旋翼动力学的物理可解释性和控制结构,同时捕捉了气动驱动软臂的复杂时变行为。在SOFA中,气动驱动的软臂被离散化为遵循弹性材料法则的四面体网格,从而产生适合于物体真实动态行为的内部力。通过周期性和基于误差的控制信号对内部腔体进行气动驱动,以分析其变形能力。最后,提出了一种比例-积分控制器,以研究气动臂的受控动态行为和变形能力,其中对软臂的气动驱动进行控制,以实现期望的目标位置。仿真结果表明,所提新颖建模框架及相关控制器设计的有效性。
cs.RO / 28 / 2605.21053
Perception of Social Robots as Communication Partners in Healthcare for Older Adults
老年人医疗保健中社交机器人作为沟通伙伴的认知
Abstract
Addressing the global caregiver shortage through socially assistive robots necessitates a deep understanding of their psychological and physiological impacts on older adults during human-robot interaction (HRI). This study addresses whether social robots can serve as effective interaction partners compared to humans, and if "positive prompts" can similarly enhance these interactions. We conducted a comparative study with 35 participants (aged 70+). Our multi-modal analysis, integrating facial expression data, heart rate variability, and subjective questionnaires, revealed no significant differences in overall stress levels between human and robot interactions. Facial expression analysis confirmed that the robot was accepted as a valid interaction partner, while physiological data showed slightly lower heart rates during robot interactions, suggesting a more relaxed state compared to human-led sessions. These findings indicate that social robots can engage older adults without inducing psychological strain and are capable of alleviating caregiver burden by performing structured tasks, such as health-sensing surveys. Future work should address the identified "appearance-content mismatch" in robot design to facilitate even more natural and effective interactions.
Chinese Translation
通过社交辅助机器人解决全球护理人员短缺问题,需要深入了解它们在老年人与机器人互动(HRI)过程中的心理和生理影响。本研究探讨了社交机器人是否可以作为有效的互动伙伴,与人类相比是否具有相似的效果,以及“积极提示”是否能够同样增强这些互动。我们对35名参与者(年龄70岁及以上)进行了比较研究。我们的多模态分析整合了面部表情数据、心率变异性和主观问卷,结果显示人类与机器人互动之间的整体压力水平没有显著差异。面部表情分析确认机器人被接受为有效的互动伙伴,而生理数据则显示在机器人互动期间心率略低,表明与人类主导的会话相比,参与者处于更放松的状态。这些发现表明,社交机器人能够在不引发心理压力的情况下与老年人互动,并能够通过执行结构化任务(如健康感知调查)来减轻护理人员的负担。未来的工作应解决机器人设计中识别出的“外观-内容不匹配”问题,以促进更加自然和有效的互动。
cs.RO / 29 / 2605.21109
Anomaly-Informed Confidence Calibration for Vision-Based Safety Prediction
基于异常信息的视觉安全预测置信度校准
Abstract
Reliable confidence estimates are important for safely deploying vision-based controllers in autonomous racing, where safety predictions must be derived from camera images, yet modern predictors become dangerously overconfident under test-time distribution shifts. We identify a critical perception-dynamics gap in existing anomaly signals: widely used scores, such as autoencoder reconstruction error, capture visual corruptions but miss dynamics anomalies (e.g., actuation bias, latency), where images remain plausible while the trajectory degrades. To address this, we propose an Anomaly-Informed Online Calibration approach that, without retraining any model component, fuses two complementary anomaly scores extracted from a world model: a perceptual score from reconstruction error and a dynamics score from epistemic uncertainty and control-stream statistics. Based on these fused scores, a lightweight temperature-scaling calibrator leverages test-time augmentation to selectively reduce overconfidence under shift while preserving nominal-condition performance. Experiments on a physical DonkeyCar under four real-world anomaly protocols unseen during training (darkness, blur, actuation bias, processing latency) reduce average expected calibration error from 0.184 to 0.116, a 37% improvement over the best baseline, without modifying the base safety predictor.
Chinese Translation
可靠的置信度估计对于安全部署基于视觉的控制器在自主赛车中至关重要,因为安全预测必须从摄像头图像中得出。然而,现代预测器在测试时分布变化下变得过于自信,这是一个危险的情况。我们识别出现有异常信号中的一个关键感知-动态差距:广泛使用的评分,如自编码器重构误差,能够捕捉视觉损坏,但却忽略了动态异常(例如,执行偏差、延迟),在这些情况下,图像仍然看似合理,但轨迹却在退化。为了解决这个问题,我们提出了一种基于异常信息的在线校准方法,该方法在不重新训练任何模型组件的情况下,融合了从世界模型中提取的两种互补的异常评分:来自重构误差的感知评分和来自认知不确定性及控制流统计的动态评分。基于这些融合的评分,一个轻量级的温度缩放校准器利用测试时增强选择性地降低在分布变化下的过度自信,同时保持正常条件下的性能。在四种在训练期间未见过的真实世界异常协议(黑暗、模糊、执行偏差、处理延迟)下对物理DonkeyCar的实验将平均期望校准误差从0.184降低到0.116,相较于最佳基线提高了37%,而无需修改基础安全预测器。
cs.RO / 30 / 2605.21111
Benchmarking Empirical and Learning-Based Approaches for Feedforward Steering Control in Autonomous Racing
自主赛车中前馈转向控制的经验与基于学习的方法的基准测试
Abstract
Feedforward steering control is a key component of hierarchical control architectures for autonomous racing. The goal is to reduce steering corrections from the feedback controllers by predicting the vehicle's inverse lateral dynamics. This paper presents a systematic benchmark of two learning-based and two empirical (analytical) feedforward steering controllers. We introduce a new \acf{ehd} formulation based on a polynomial surface fit that captures velocity-dependent nonlinear steering behavior with minimal parametrization. We test the feedforward controllers in a high-fidelity simulation framework based on the real-world Abu Dhabi Autonomous Racing League competition, using a high-fidelity double-track vehicle dynamics simulator. Open-loop evaluation shows that the learning-based controllers achieve the lowest prediction errors; however, closed-loop testing reveals that this improved accuracy does not translate into superior path tracking performance or lap times, even after iterative fine-tuning. In contrast, the proposed EHD approach achieves the best overall closed-loop robustness and lap time, highlighting the necessity of evaluating feedforward strategies within the complete trajectory planning and control software stack. Our code is available at https://github.com/TUMRT/steering_ff_control.
Chinese Translation
前馈转向控制是自主赛车分层控制架构中的关键组成部分。其目标是通过预测车辆的逆向横向动力学来减少反馈控制器的转向修正。本文系统性地对两种基于学习的方法和两种经验(解析)前馈转向控制器进行了基准测试。我们引入了一种基于多项式曲面拟合的新型 extit{EHD} 公式,该公式以最小的参数化捕捉了速度依赖的非线性转向行为。我们在一个基于真实世界阿布扎比自主赛车联赛竞赛的高保真仿真框架中测试了前馈控制器,使用高保真的双轨车辆动力学仿真器。开环评估表明,基于学习的控制器实现了最低的预测误差;然而,闭环测试显示,这种改进的准确性并未转化为更优的路径跟踪性能或圈速,即使经过迭代微调。相反,所提出的 EHD 方法在整体闭环鲁棒性和圈速方面表现最佳,突显了在完整的轨迹规划和控制软件堆栈中评估前馈策略的必要性。我们的代码可在 https://github.com/TUMRT/steering_ff_control 获取。
cs.RO / 31 / 2605.21133
Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum
通过主动空间大脑和可泛化动作小脑实现类人全身操控
Abstract
In this paper, we explore spatial-aware humanoid whole-body manipulation task. Compared with tabletop settings, this task poses two key challenges: 1) Spatial understanding is challenging in complex 3D environments with diverse spatial relations. 2) Action generation is difficult to generalize, as limited and costly real-robot data restricts data-driven models generalization. To address these challenges, we propose a generalizable humanoid loco-manipulation framework that leverages the spatial perception and action generation capabilities of multi-agent large models. Specifically, our framework includes two components: Active Spatial Brain for active spatial perception and decision-making, and Generalizable Action Cerebellum for executable robot action generation. The first component actively perceives the spatial scene and makes decisions on task planning and subtask decomposition. The second component generate executable robot actions based on the decisions made by the first module without needs of task-specific real robot data. To benchmark our framework, we design a set of spatial manipulation tasks from two perspectives: evaluating spatial perception and understanding, and assessing real-robot task performance. The results demonstrate strong performance on both aspects across diverse tasks and environments.
Chinese Translation
在本文中,我们探讨了空间感知的类人全身操控任务。与桌面设置相比,该任务面临两个主要挑战:1)在具有多样空间关系的复杂三维环境中,空间理解具有挑战性。2)由于有限且成本高昂的真实机器人数据限制了数据驱动模型的泛化,动作生成难以泛化。为了解决这些挑战,我们提出了一种可泛化的类人运动操控框架,该框架利用了多智能体大模型的空间感知和动作生成能力。具体而言,我们的框架包括两个组件:主动空间大脑(Active Spatial Brain)用于主动空间感知和决策制定,以及可泛化动作小脑(Generalizable Action Cerebellum)用于可执行机器人动作生成。第一个组件主动感知空间场景,并对任务规划和子任务分解做出决策。第二个组件根据第一个模块做出的决策生成可执行的机器人动作,而无需特定任务的真实机器人数据。为了基准测试我们的框架,我们从两个角度设计了一组空间操控任务:评估空间感知和理解,以及评估真实机器人任务性能。结果表明,在不同任务和环境中,这两个方面均表现出强大的性能。
cs.RO / 32 / 2605.21138
Safety-Critical Control for Smoothed Implicit Contact Dynamics
平滑隐式接触动力学的安全关键控制
Abstract
Smoothed implicit contact dynamics enables gradient-based planning and control for contact-rich tasks without predefined mode sequences. However, safety-critical control remains challenging because implicit contact dynamics makes safety-filter design nontrivial. The smoothing parameter $\kappa$ relaxes contact complementarity constraints, which makes the dynamics smooth but affects the contact force. This paper provides a method for bounding the actual contact force despite the use of relaxed complementarity constraints. We show that constraint violations can be non-monotonic in $\kappa$. Smaller $\kappa$ reduces force-approximation error, but it does not necessarily improve safety performance. To address this issue, we introduce boundary-focused rollouts to screen $\kappa$ by comparing the safety margin with the approximation error. We then develop a discrete-time control barrier function (CBF) framework based on a first-order Taylor approximation of the implicitly defined contact force. To account for possible force under-prediction, we augment the resulting safety constraint with a fixed robust margin. Simulations on four contact-rich systems show that the proposed method eliminates force violations observed under a standard CBF.
Chinese Translation
平滑隐式接触动力学使得在没有预定义模式序列的情况下进行基于梯度的规划和控制成为可能,适用于接触丰富的任务。然而,由于隐式接触动力学使得安全过滤器的设计变得复杂,安全关键控制仍然面临挑战。平滑参数 $ppa$ 放宽了接触互补约束,这使得动力学变得平滑,但也影响了接触力。本文提供了一种方法,可以在使用放宽的互补约束的情况下对实际接触力进行界定。我们展示了约束违反在 $ppa$ 中可能是非单调的。较小的 $ppa$ 减少了力近似误差,但并不一定改善安全性能。为了解决这个问题,我们引入了边界聚焦的回滚方法,通过比较安全边际与近似误差来筛选 $ppa$。然后,我们基于隐式定义的接触力的一阶泰勒近似,开发了一个离散时间控制障碍函数(CBF)框架。为了考虑可能的力低估,我们用一个固定的鲁棒边际增强了所得到的安全约束。在四个接触丰富系统上的仿真表明,所提出的方法消除了在标准 CBF 下观察到的力违反。
cs.RO / 33 / 2605.21150
EllipseLIO: Adaptive LiDAR Inertial Odometry with an Ellipsoid Representation
EllipseLIO:基于椭球体表示的自适应激光雷达惯性里程计
Abstract
LiDAR Inertial Odometry (LIO) is a critical component for many mobile robots that need to navigate without relying on external positioning (e.g., GPS). Platforms that operate autonomously in different environments and with heterogeneous LiDAR sensors require a LIO approach that can adapt to these different scenarios without human intervention. Existing LIO approaches can typically provide reliable and accurate odometry in scenarios with similar environments and sensors when suitably tuned. However, many approaches struggle to retain robust odometry across heterogeneous environments and sensors while using a consistent configuration. This paper presents EllipseLIO, a real-time LIO approach that generalises between scenarios by using methods for LiDAR scan filtering and registration that adapt to the sensor capabilities and environment without requiring scenario-specific tuning. Experiments with EllipseLIO and state-of-the-art LIO approaches on five datasets with diverse and challenging scenarios demonstrate that EllipseLIO is the best-performing approach overall. It achieves a 38% lower odometry error on average than the second-best approach and is the only approach that does not diverge in any experiment. An open-source version of EllipseLIO will be available at github.com/v4rl-ucy/ellipselio.
Chinese Translation
激光雷达惯性里程计(LIO)是许多需要在不依赖外部定位(例如 GPS)的情况下进行导航的移动机器人中的关键组件。能够在不同环境中自主操作并使用异构激光雷达传感器的平台需要一种能够适应这些不同场景而无需人工干预的 LIO 方法。现有的 LIO 方法通常可以在环境和传感器相似的场景中提供可靠且准确的里程计,但在使用一致配置的情况下,许多方法在异构环境和传感器中难以保持稳健的里程计。本文提出了 EllipseLIO,一种实时 LIO 方法,通过使用适应传感器能力和环境的激光雷达扫描过滤和配准方法,在不同场景之间进行泛化,而无需特定场景的调优。在五个具有多样性和挑战性的场景的数据集上对 EllipseLIO 和最先进的 LIO 方法进行的实验表明,EllipseLIO 在整体性能上表现最佳。与第二佳方法相比,EllipseLIO 平均实现了 38% 的里程计误差降低,并且是唯一在任何实验中没有出现发散的方案。EllipseLIO 的开源版本将可在 github.com/v4rl-ucy/ellipselio 上获取。
cs.RO / 34 / 2605.21188
A Terrain-Adaptive epsilon-Constraint MPC for Uneven Terrain Kinodynamic Planning
一种适应地形的ε约束模型预测控制用于不平坦地形的动力学规划
Abstract
Kinodynamic planning for car-like vehicles on uneven terrain requires simultaneously optimizing competing objectives such as path efficiency and pose stability. This work presents an adaptive epsilon-constraint method integrated into a Model Predictive Control (MPC) framework, where the epsilon bounds are dynamically adjusted based on terrain descriptors to explore the Pareto front in real time. To capture vehicle-terrain dynamics, we develop a semi-parametric model combining analytical vehicle dynamics with a Sparse Gaussian Process (SGP) trained on the same terrain descriptors. The proposed epsilon-MPC is evaluated against MPPI and GAKD baselines, achieving a 94% navigation success rate while reducing maximum orientation deviation by 24% and improving multi-objective trade-off quality by 23%.
Chinese Translation
针对不平坦地形上类车车辆的动力学规划需要同时优化路径效率和姿态稳定性等竞争目标。本文提出了一种自适应ε约束方法,集成于模型预测控制(MPC)框架中,其中ε界限根据地形描述符动态调整,以实时探索帕累托前沿。为了捕捉车辆与地形的动态关系,我们开发了一种半参数模型,将分析性车辆动力学与基于相同地形描述符训练的稀疏高斯过程(SGP)相结合。所提出的ε-MPC与MPPI和GAKD基线进行评估,导航成功率达到94%,同时最大方向偏差减少了24%,多目标权衡质量提高了23%。
cs.RO / 35 / 2605.21242
To Select or not to Select, that is the Question: Distilling Robot Skill Prediction into a Small Ensemble
选择还是不选择,这才是问题:将机器人技能预测提炼为小型集成模型
Abstract
As robot fleets become more heterogeneous, including humanoids, rovers, quadrupeds, and drones, selecting the right robot for a task becomes a core systems problem. We study robot skill prediction: mapping a natural-language task description to the physical capabilities required to execute it, such as fly, wheels, legs, surface water, under water and hands. Since labelled data that maps natural-language task descriptions to robot's physical capabilities does not exist, we construct a synthetic task-to-skill dataset using LLM-assisted generation and targeted label auditing. Trained on this data, a ~133M-parameter ensemble of two fine-tuned sentence encoders (mpnet + MiniLM) reaches 83.5% task-to-skill matching on a stratified 200 task dataset, outperforming Kimi K2 (1T MoE) at 72.0%, GPT-OSS-120B at 71.5%, and Llama-4-Scout-17B at 69.0% under the same zero-shot prompt. These results suggest that, for fixed robot skill taxonomies, small specialized models trained on synthetic data can outperform much larger general-purpose LLMs for fleet-level task routing.
Chinese Translation
随着机器人舰队变得愈加异构,包括类人机器人、探测车、四足机器人和无人机,为任务选择合适的机器人成为一个核心系统问题。我们研究机器人技能预测:将自然语言任务描述映射到执行该任务所需的物理能力,如飞行、轮子、腿、在水面上、在水下和手部。由于缺乏将自然语言任务描述与机器人物理能力相映射的标注数据,我们利用大语言模型(LLM)辅助生成和针对性标签审计构建了一个合成的任务到技能数据集。在该数据上训练的一个约133M参数的两种微调句子编码器(mpnet + MiniLM)集成模型在一个分层的200个任务数据集上达到了83.5%的任务到技能匹配,优于Kimi K2(1T MoE)的72.0%、GPT-OSS-120B的71.5%和Llama-4-Scout-17B的69.0%(在相同的零-shot提示下)。这些结果表明,对于固定的机器人技能分类,小型专用模型在合成数据上训练后可以在舰队级任务路由中超越更大的一般用途大语言模型。
cs.RO / 36 / 2605.21257
Reinforcement Learning for Risk Adaptation via Differentiable CVaR Barrier Functions
通过可微分的条件风险价值(CVaR)障碍函数进行风险适应的强化学习
Abstract
Planning through crowded environments under uncertain obstacle motions remains difficult, as stochastic interactions often induce overly conservative behavior or reduced efficiency. To address this challenge, we propose an end-to-end risk adaptation framework for crowd navigation under obstacle-motion uncertainty modeled by a Gaussian mixture model. The framework combines reinforcement learning~(RL) with a differentiable quadratic-program safety layer based on Conditional Value-at-Risk~(CVaR) barrier functions, jointly learning nominal control input, risk level, and safety margin and enforcing explicit probabilistic safety constraints. This design enables context-aware adaptation, promoting efficient behavior while invoking caution only when necessary. We conduct extensive evaluations in dynamic, uncertain, and crowded environments across varying obstacle densities and robot models, and further assess generalization under three out-of-distribution cases. Comparisons across optimization-based, RL-based, and integrated RL and optimization methods are provided, and the proposed method is shown to deliver the strongest overall performance in safety, efficiency, and generalization under uncertainty.
Chinese Translation
在不确定障碍物运动的情况下,通过拥挤环境进行规划仍然困难,因为随机交互往往会导致过于保守的行为或效率降低。为了解决这一挑战,我们提出了一种针对障碍物运动不确定性的拥挤导航的端到端风险适应框架,该不确定性由高斯混合模型建模。该框架结合了强化学习(RL)与基于条件风险价值(CVaR)障碍函数的可微分二次规划安全层,联合学习名义控制输入、风险水平和安全边际,并强制执行明确的概率安全约束。这一设计使得上下文感知适应成为可能,促进了高效行为,同时仅在必要时采取谨慎措施。我们在动态、不确定和拥挤的环境中进行了广泛评估,涵盖不同的障碍物密度和机器人模型,并进一步评估了在三种分布外情况下的泛化能力。我们提供了基于优化、基于RL以及集成RL和优化方法的比较,结果表明所提出的方法在安全性、效率和不确定性下的泛化能力方面表现最为优越。
cs.RO / 37 / 2605.21258
Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation
学习结构潜在点以提高机器人操作中的视觉表示效率
Abstract
Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation-structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies, which do not encode precise geometry but capture richer rough shape and semantic information, effectively combining the expressiveness of implicit representations with the structural priors of explicit ones. In addition, informed by shared design choices in prior work, we develop a streamlined, efficient 3DGS-based rendering pipeline that is deliberately kept lightweight, improving efficiency while leaving greater representational capacity to the front-end latent module. Extensive evaluations on RLBench, ManiSkill2, and a real-robot platform demonstrate consistent gains in task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines. Ablation studies further confirm that each component of our framework is critical to overall performance.
Chinese Translation
当前针对具身感知和操作的3D感知预训练方法主要基于可微渲染框架,生成完全隐式的神经场或完全显式的几何原件。隐式表示虽然具有表现力,但缺乏明确的结构线索;而显式表示则保留了几何信息,但在分辨率上存在限制且泛化能力较弱。为了解决这些局限性,我们提出了一种新颖的预训练框架,学习混合表示——结构潜在点。具体而言,我们将一个逐点潜在变分自编码器插入到点云自编码器的潜在空间中,共同对逐点特征和坐标进行正则化,以符合高斯先验。最终得到的紧凑潜在表示保留了粗略的结构趋势,虽然不编码精确的几何信息,但捕捉了更丰富的粗略形状和语义信息,有效地结合了隐式表示的表现力与显式表示的结构先验。此外,受到先前工作的共享设计选择的启发,我们开发了一种简化的高效3DGS(3D生成系统)渲染管道,故意保持轻量化,提高了效率,同时为前端潜在模块留出了更大的表示能力。在RLBench、ManiSkill2和真实机器人平台上的广泛评估显示,在任务成功率、样本效率以及对视角和场景变化的鲁棒性方面,相较于强基线表现出一致的提升。消融研究进一步确认了我们框架中每个组件对整体性能的重要性。
cs.RO / 38 / 2605.21330
Learning Robust Dexterous In-Hand Manipulation from Joint Sensors with Proprioceptive Transformer
基于关节传感器的自我感知变换器学习鲁棒的灵巧手内操作
Abstract
In-hand object manipulation is a fundamental yet challenging capability for dexterous robots. Despite significant progress in dexterous manipulation, existing approaches rely heavily on vision or tactile sensing to track object states, while joint sensing -- the most readily available modality on any robotic hand -- remains largely overlooked, particularly for tendon-driven hands. In this paper, we study how far joint sensing alone can go by asking: (i) whether motor encoders or direct joint sensing provides better proprioceptive feedback, (ii) how to extract environment information from joint measurements, and (iii) whether joint-only control can achieve competitive real-world performance without external perception. We present the Proprioceptive Transformer (PT), an exteroceptive-free approach for continuous cube rotation on a tendon-driven dexterous hand that uses only joint sensing feedback. A teacher policy is first trained via reinforcement learning with privileged object information, then distilled into PT, which operates solely on joint position and velocity histories. The Transformer architecture effectively extracts implicit object state information from temporal patterns in joint sensor readings. Experiments on the real ORCA hand show that our approach achieves 3.1x higher rotation speed than baselines. We also demonstrate that our PT achieves a 23.4% lower RMSE for cube position estimation than the MLP baseline, indicating superior extraction of exteroceptive information from proprioceptive sources.
Chinese Translation
手内物体操作是灵巧机器人一项基本而又具有挑战性的能力。尽管在灵巧操作方面取得了显著进展,现有的方法仍然过于依赖视觉或触觉传感来跟踪物体状态,而关节传感——任何机器人手上最易获得的传感方式——在很大程度上被忽视,尤其是在腱驱动的手中。本文研究了仅依靠关节传感能达到的程度,提出了以下问题:(i) 电机编码器或直接关节传感是否提供更好的自我感知反馈,(ii) 如何从关节测量中提取环境信息,以及 (iii) 仅使用关节控制是否能够在没有外部感知的情况下实现竞争性的现实世界表现。我们提出了自我感知变换器(Proprioceptive Transformer, PT),这是一种在腱驱动灵巧手上进行连续立方体旋转的无外部感知方法,仅使用关节传感反馈。首先,通过强化学习与特权物体信息训练教师策略,然后将其提炼为PT,后者仅基于关节位置和速度历史进行操作。变换器架构有效地从关节传感器读数的时间模式中提取隐含的物体状态信息。在真实的ORCA手上的实验表明,我们的方法实现了比基线高出3.1倍的旋转速度。我们还展示了我们的PT在立方体位置估计方面比多层感知器(MLP)基线具有23.4%的更低均方根误差(RMSE),表明从自我感知源中提取外部感知信息的能力更强。
cs.RO / 39 / 2605.21398
From swept contact to pose: Probe-aware registration via complementary-shape docking
从接触扫掠到姿态:基于探针感知的互补形状对接注册
Abstract
Accurate registration between a prior model and the real scene is essential for high-precision robotic manipulation, yet optical methods suffer from long calibration chains, line-of-sight constraints, and fabrication errors. We propose a calibration-free alternative that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and leveraging both contact and non-contact evidence. Our solver integrates a global-to-local search via 3D FFT correlation over low-discrepancy SO(3) samples, then followed by continuous SE(3) refinement using Lie-algebra updates and analytic contact sensitivities. This pipeline yields efficient exploration and metric-grade convergence without fragile point correspondences. Simulation across free-form meshes achieved sub-0.04 mm and sub-0.4{\deg} accuracy and robustness to pose noise and contact loss. On a tooth-preparation robot, our method attained 0.42 mm and 3.75{\deg}, outperforming an optical tracker registration while requiring no external sensors. These results demonstrate a practical and precise registration strategy for surgical and industrial robots.
Chinese Translation
在先前模型与真实场景之间进行准确注册对于高精度机器人操作至关重要,然而光学方法受到长校准链、视线限制和制造误差的影响。我们提出了一种无校准的替代方案,将接触注册重新表述为物体与探针扫掠体积之间的互补形状对接,明确考虑探针几何形状,并利用接触和非接触证据。我们的求解器通过对低差异SO(3)样本进行3D FFT相关的全局到局部搜索,然后使用李代数更新和解析接触灵敏度进行连续SE(3)精化。该流程实现了高效的探索和度量级收敛,而无需脆弱的点对应关系。在自由形状网格上的仿真达到了小于0.04毫米和小于0.4度的精度,并对姿态噪声和接触丢失具有鲁棒性。在一台牙齿准备机器人上,我们的方法达到了0.42毫米和3.75度的精度,优于光学追踪器注册且不需要外部传感器。这些结果展示了一种适用于外科和工业机器人的实用且精确的注册策略。
cs.RO / 40 / 2605.21406
MC-Risk: Multi-Component Risk Fields for Risk Identification and Motion Planning
MC-Risk:用于风险识别和运动规划的多组件风险场
Abstract
We present MC-Risk, a planner-aligned, multi-component risk field on a bird's-eye-view grid that yields early, calibrated, and class-aware risk localization. MC-Risk linearly composes three interpretable modules: (i) a motorized-agent field that fuses a black-box multimodal trajectory predictor with an analytic Gaussian-torus construction whose lateral width grows with speed/curvature and whose height attenuates with look-ahead; (ii) a VRU risk field that replaces isotropic pedestrian blobs with a forward-biased anisotropic kernel aligned to heading and speed; and (iii) a road penalty field that exploits full HD-map topology, imposing an off-road penalty and lane-aware risk exposure for same/opposite directions. We conduct, to our knowledge, the first standardized quantitative evaluation of a risk-field formulation on RiskBench's collision subset. MC-Risk attains the best overall risk localization and the earliest hazard indication. Finally, we demonstrate a plug-and-play planning interface by using the field as an MPC cost density, enabling risk-aware trajectory generation without additional training.
Chinese Translation
我们提出了MC-Risk,这是一种与规划器对齐的多组件风险场,基于鸟瞰图网格,能够实现早期、校准和类别感知的风险定位。MC-Risk 线性组合了三个可解释的模块:(i) 一个机动代理场,它将黑箱多模态轨迹预测器与分析性高斯环面构造融合在一起,其横向宽度随速度/曲率增加而增长,高度则随着前视距离减小而衰减;(ii) 一个VRU(脆弱道路用户)风险场,它用向前偏置的各向异性核替代了各向同性的行人块,该核与行驶方向和速度对齐;(iii) 一个道路惩罚场,利用全高清地图拓扑,施加越界惩罚和对同向/反向的车道感知风险暴露。我们进行了一项标准化的定量评估,这是我们所知的首次对风险场公式在RiskBench的碰撞子集上的评估。MC-Risk 实现了最佳的整体风险定位和最早的危险指示。最后,我们通过将该场作为模型预测控制(MPC)成本密度,展示了一个即插即用的规划接口,使得在无需额外训练的情况下能够生成风险感知轨迹。
cs.RO / 41 / 2605.21414
PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction
PointACT:具有多尺度点-动作交互的视觉-语言-动作模型
Abstract
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding - capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.
Chinese Translation
视觉-语言-动作(VLA)模型通过利用大型预训练的视觉-语言骨干网络,展现出在通用机器人操作中的强大潜力。然而,大多数现有的VLA主要依赖于二维视觉表示,这限制了它们在细粒度几何和空间基础推理方面的能力,而这些能力对于在三维环境中进行精确和稳健的操作至关重要。在本文中,我们提出了PointACT,一种双系统的三维感知VLA策略,直接将分层的三维点云表示集成到动作解码过程中。PointACT采用了一种多尺度点-动作交互机制,结合高效的瓶颈窗口自注意力,使得演变的动作标记能够密集关注局部几何细节和全局场景结构。我们在LIBERO和RLBench基准上评估PointACT,并系统地将其与单一和双系统VLA基线进行比较,包括增强了点云输入的变体。PointACT在这两个基准上均取得了一致的改进,在具有挑战性的RLBench-10Tasks套件上成功率提高了10%,相较于最先进的预训练VLA,当视觉-语言骨干被冻结且动作专家从头训练时,提升更为显著。大量消融研究表明,将分层的三维几何与预训练的二维语义表示紧密结合对于稳健且空间基础的机器人控制至关重要。我们的结果还突显了预训练三维表示在三维感知VLA策略中的潜力。
cs.RO / 42 / 2605.21429
roto 2.0: The Robot Tactile Olympiad
roto 2.0:机器人触觉奥林匹克
Abstract
Tactile-based reinforcement learning (RL) is currently hindered by fragmented research and a focus on over-saturated orientation tasks. We introduce v2 of the Robot Tactile Olympiad (\texttt{roto 2.0}), a GPU-parallelised benchmark designed to standardise tactile-based RL across four distinct robotic morphologies (16-DOF to 24-DOF). Unlike prior benchmarks, roto focuses on end-to-end "blind" manipulation, utilising only proprioception and tactile sensing without state information or distillation. We demonstrate a significant performance leap, with our blind agents achieving 13 Baoding ball rotations in 10 seconds, an order of magnitude faster than current state-of-the-art speeds. By open-sourcing our environments and robustly tuned baselines, we reduce the barrier to entry and enable researchers to prioritise fundamental algorithmic challenges over tedious RL tuning. Website: https://elle-miller.github.io/roto/
Chinese Translation
基于触觉的强化学习(RL)目前受到研究碎片化和过度集中于饱和的方向任务的限制。我们介绍机器人触觉奥林匹克的第二版( exttt{roto 2.0}),这是一个设计用于标准化四种不同机器人形态(16-自由度到24-自由度)之间的触觉基础RL的GPU并行基准测试。与之前的基准不同,roto专注于端到端的“盲”操作,仅利用本体感知和触觉感知,而不依赖状态信息或提炼。我们展示了显著的性能飞跃,我们的盲代理在10秒内实现了13次保定球旋转,速度比当前最先进的水平快了一个数量级。通过开源我们的环境和经过严格调整的基线,我们降低了进入门槛,使研究人员能够优先关注基础算法挑战,而不是繁琐的RL调优。网站:https://elle-miller.github.io/roto/
cs.RO / 43 / 2605.21446
Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs
迷雾中的迷失:传感器扰动暴露驾驶视觉-语言-动作(VLA)推理的脆弱性
Abstract
Interpretable autonomous driving planners depend not only on generating explanations, but also on those explanations remaining reliable under real-world sensor degradation. In this paper we present a controlled perturbation study of Vision-Language-Action (VLA) robustness in autonomous driving, evaluating Alpamayo R1 (10B parameters) across 1,996 scenarios under eight sensor perturbations (Gaussian noise at four intensities, two lighting extremes, and two fog levels; ${\sim}18{,}000$ inference trials). We find that reasoning consistency is a high-fidelity indicator of trajectory reliability: when Chain-of-Causation (CoC) explanations change after perturbation, trajectory deviation spikes $5.3{\times}$ (21.8m vs 4.1m), with $r\!=\!0.99$ across attack types and $r_{pb}\!=\!0.53$ per-sample (Cohen's $d\!=\!1.12$). A controlled ablation provides evidence that enabling CoC generation is associated with improved trajectory accuracy (11.8% on average across conditions; $p < 0.0001$) under matched inference settings. Over the tested noise range ($\sigma \in \{10, 30, 50, 70\}$), degradation is approximately linear ($R^2\!=\!0.957$), while standard input preprocessing defenses provide only marginal relief. Together, these results establish CoC consistency as a quantitative proxy for planning safety and motivate reasoning-based runtime monitoring for safer VLA deployment.
Chinese Translation
可解释的自动驾驶规划不仅依赖于生成解释,还依赖于这些解释在现实世界传感器退化下的可靠性。在本文中,我们展示了一项关于自动驾驶中视觉-语言-动作(VLA)鲁棒性的受控扰动研究,评估了Alpamayo R1(10B参数)在1,996个场景下的表现,涉及八种传感器扰动(四种强度的高斯噪声、两种极端光照和两种雾霾水平;约18,000次推理试验)。我们发现,推理一致性是轨迹可靠性的高保真指标:当因果链(Chain-of-Causation, CoC)解释在扰动后发生变化时,轨迹偏差激增5.3倍(21.8米对比4.1米),在不同攻击类型下的相关系数为0.99,每个样本的相关系数为0.53(Cohen's d=1.12)。受控消融实验提供了证据,表明启用CoC生成与轨迹准确性改善相关(在匹配推理设置下,平均提高11.8%;p < 0.0001)。在测试的噪声范围内(σ ∈ {10, 30, 50, 70}),退化大致呈线性关系(R²=0.957),而标准输入预处理防御仅提供边际缓解。综合这些结果,确立了CoC一致性作为规划安全性的定量代理,并激励基于推理的运行时监控,以实现更安全的VLA部署。
cs.RO / 44 / 2605.21460
HITL-D: Human In The Loop Diffusion Assisted Shared Control
HITL-D:人机协同扩散辅助共享控制
Abstract
Autonomous manipulation systems have achieved remarkable capabilities, yet the integration of human expertise with diffusion-based policies in shared control remains relatively unexplored. In this paper, we propose Human-In-The-Loop Diffusion (HITL-D), a shared control framework that enhances user performance in multi-step, insertion, and fine manipulation tasks. HITL-D leverages a novel combination of diffusion-based policies and human control to provide autonomous end effector orientation updates conditioned on a scene point cloud and the Cartesian position of the end effector. This approach reduces the number of joystick control axes required, thereby lowering mental workload. In a multi-task user study with 12 participants, HITL-D reduced average task completion times by 40%, decreased perceived workload by 37%, and improved Likert-scale ratings for independence, intuitiveness, and confidence compared to traditional teleoperation methods. These results demonstrate that HITL-D effectively integrates human expertise with autonomous assistance, improving both objective and subjective aspects of teleoperation.
Chinese Translation
自主操控系统已经取得了显著的能力,但将人类专业知识与基于扩散的策略结合在共享控制中的研究仍然相对较少。在本文中,我们提出了人机协同扩散(Human-In-The-Loop Diffusion,HITL-D),这是一种共享控制框架,旨在提升用户在多步骤、插入和精细操控任务中的表现。HITL-D利用基于扩散的策略与人类控制的新颖组合,提供基于场景点云和末端执行器的笛卡尔位置的自主末端执行器方向更新。这种方法减少了所需的操纵杆控制轴数,从而降低了心理负担。在一项包含12名参与者的多任务用户研究中,HITL-D将平均任务完成时间缩短了40%,感知工作负荷降低了37%,并且在独立性、直观性和自信心的Likert量表评分上相比传统远程操作方法有了显著改善。这些结果表明,HITL-D有效地将人类专业知识与自主辅助结合在一起,改善了远程操作的客观和主观方面。
cs.CV / 1 / 2605.20211
Leveraging Vision-Language Models to Detect Attention in Educational Videos
利用视觉-语言模型检测教育视频中的注意力
Abstract
Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has been based so far on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, thus exhibiting moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to a multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that utilizes a Vision-Language Model (VLM) to analyze video content directly with superimposed gaze data. This approach aims to leverage the semantic reasoning capabilities of foundation models to contextualize learner focus within the video stream. We evaluate the performance of this VLM-based approach using several prompting strategies with Gemini 3, but ultimately found that none of them could outperform statistical baselines. Our results provide new insights into the limitations of using VLMs for real-time educational diagnostics.
Chinese Translation
教育视频是远程和混合学习的基石。然而,学习者注意力的波动仍然是有效信息保留的一大障碍。先前的研究试图通过在运行时使用眼动追踪来检测和应对注意力丧失,从而减轻这一问题。迄今为止,这种检测基于训练于工程特征的经典机器学习分类器,例如学习者注视和扫视的汇总统计。这些方法在捕捉学习者参与的复杂时间特性方面存在困难,因此表现出适度的预测性能。在本研究中,我们旨在通过从标准工程特征转向多模态基础模型来推进注意力检测。使用一个教育眼动追踪数据集(N = 70),我们探讨了一种新方法,该方法利用视觉-语言模型(VLM)直接分析视频内容与叠加的注视数据。该方法旨在利用基础模型的语义推理能力,将学习者的注意力置于视频流的上下文中。我们使用几种提示策略评估这种基于VLM的方法的性能,采用Gemini 3,但最终发现没有一种方法能够超越统计基线。我们的结果为使用VLM进行实时教育诊断的局限性提供了新的见解。
cs.CV / 2 / 2605.20223
Why Latent Actions Fail, and How to Prevent It
潜在动作为何失败,以及如何防止这种情况
Abstract
Latent action models (LAMs) aim to learn action-like representations from unlabeled videos by compressing frame-to-frame changes. The frames of in-the-wild videos, however, contain not only the agent's own state but exogenous state such as background clutter. Since the exogenous state introduces changes unrelated to actions, it hinders reliable latent action learning. This paper investigates this problem analytically by extending a linear LAM framework to explicitly model exogenous state. Our analysis reveals two insights: (1) minimizing the standard reconstruction objective produces latent actions that encode exogenous information from future observation; and (2) learning in a representation space that focuses on endogenous components is a key to mitigating the interference of noise. We further show that previously proposed auxiliary objectives, such as action-supervision, provably encourage latent actions to be consistent across exogenous states. These findings are validated through experiments on both linear and nonlinear LAMs, providing a unified theoretical analysis of how exogenous state hinders latent action learning and why common remedies work.
Chinese Translation
潜在动作模型(LAMs)旨在通过压缩帧间变化,从未标记的视频中学习类动作表示。然而,现实世界视频的帧不仅包含代理自身的状态,还包含背景杂乱等外部状态。由于外部状态引入了与动作无关的变化,因而阻碍了可靠的潜在动作学习。本文通过扩展线性LAM框架,明确建模外部状态,分析了这一问题。我们的分析揭示了两个见解:(1)最小化标准重构目标会产生编码来自未来观察的外部信息的潜在动作;(2)在关注内生成分的表示空间中进行学习是减轻噪声干扰的关键。我们进一步表明,先前提出的辅助目标,如动作监督,确实鼓励潜在动作在外部状态之间保持一致。这些发现通过对线性和非线性LAM的实验得到了验证,为外部状态如何阻碍潜在动作学习及常见补救措施为何有效提供了统一的理论分析。
cs.CV / 3 / 2605.20233
AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education
基于人工智能的自我中心视频在模拟护理教育中的能力评估
Abstract
Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language models have emerged as a promising tool for understanding complex visual behavior. In this work, we investigate whether visual observations can provide educationally meaningful signals for competency assessment through a three-stage framework that (1) extracts action timelines from egocentric nursing simulation video using frozen visual encoders and few-shot learning, (2) derives sequence-level features and per-session recognition metrics, and (3) relates these to instructor-rated competency. Across 22 densely annotated sessions (3.8 hours, 493 actions), a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition. Surprisingly, we observe a negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), robust to six confound controls: more competent students produce diverse, harder-to-classify workflows, while simple sequence features show no such relationship. Per-item analysis identifies patient safety protocols and team communication as the expected behaviors most reflected in this pattern, and process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions. These findings suggest that recognition accuracy may complement predicted action timelines as a pedagogically informative signal in automated competency assessment.
Chinese Translation
在临床模拟中评估学习者能力需要专家观察,这一过程耗时、难以扩展,并且受到评估者间变异性的影响。视觉-语言模型作为理解复杂视觉行为的有前景工具逐渐受到关注。本研究探讨了视觉观察是否能够通过一个三阶段框架为能力评估提供具有教育意义的信号,该框架包括:(1) 使用冻结的视觉编码器和少量学习从自我中心的护理模拟视频中提取动作时间线;(2) 推导序列级特征和每次会话的识别指标;(3) 将这些与教师评分的能力相关联。在22个密集注释的会话中(3.8小时,493个动作),使用HMM Viterbi解码的冻结DINOv2骨干网络在留一法1-shot识别中达到了57.4%的MOF。令人惊讶的是,我们观察到识别准确性与能力之间存在负相关趋势(rho = -0.524,p = 0.012,针对mIoU),这一结果在六个混杂控制下依然稳健:更有能力的学生产生多样化且更难分类的工作流程,而简单的序列特征没有显示出这种关系。逐项分析识别出患者安全协议和团队沟通是这一模式中最反映的预期行为,而过程模型比较显示高能力学生表现出更符合协议的动作过渡。这些发现表明,识别准确性可能作为自动能力评估中的一种教育性信息信号,补充预测的动作时间线。
cs.CV / 4 / 2605.20237
AnimeAdapter: Fine-grained and Consistent Zero-shot Anime Character Generation
AnimeAdapter:细粒度且一致的零-shot动漫角色生成
Abstract
We present a lightweight appearance adapter for Stable Diffusion that enables controllable and consistent anime character generation under diverse editing conditions. Instead of relying on large-scale vision-language models or per-subject fine-tuning, our method injects fine-grained visual features from a single reference image into the diffusion process. Based on CLIP emergent local spatialization, we develop semantic-selective local attention. To further disentangle character appearance from spatial layout, we incorporate pose-aware conditioning during adapter training. The resulting pretrained adapter remains compact, modular, and fully compatible with Stable Diffusion community workflows, while requiring no additional fine-tuning at deployment time. Furthermore, we present a high-quality anime character dataset based on curated and restructured Danbooru prompts, and evaluate our method across several practical character editing scenarios. Our code, model weights, and dataset will be publicly released upon acceptance.
Chinese Translation
我们提出了一种轻量级的外观适配器,用于稳定扩散(Stable Diffusion),能够在多种编辑条件下实现可控且一致的动漫角色生成。我们的方法不依赖于大规模的视觉-语言模型或针对特定主题的微调,而是将来自单一参考图像的细粒度视觉特征注入到扩散过程中。基于CLIP(Contrastive Language-Image Pre-training)所产生的局部空间化,我们开发了语义选择性局部注意力机制。为了进一步将角色外观与空间布局解耦,我们在适配器训练过程中引入了姿态感知条件。最终得到的预训练适配器保持紧凑、模块化,并与稳定扩散社区工作流程完全兼容,同时在部署时无需额外的微调。此外,我们基于精心策划和重组的Danbooru提示,展示了一个高质量的动漫角色数据集,并在多个实际角色编辑场景中评估了我们的方法。我们的代码、模型权重和数据集将在论文被接受后公开发布。
cs.CV / 5 / 2605.20267
Generation of Heterogeneous PET Images from Uniform Organ Activity Maps Using a Pretrained Domain-Adapted Diffusion Model
利用预训练领域适应扩散模型从均匀器官活动图生成异质性PET图像
Abstract
Synthetic PET images are valuable for quantitative imaging workflow development, scalable virtual imaging trials, and deep learning model training, but conventional physics-based simulation approaches are computationally intensive, limited in anatomical variability, and often fail to capture heterogeneous PET uptake. This study developed a pretrained domain-adapted diffusion (PAD) model for anatomy-conditioned PET synthesis from uniform organ activity maps. PAD adopts a natural-image pretrained text-to-image decoder with an upstream conditioning encoder and a downstream PET-domain adapter. A two-phase training strategy was used, with the first phase learning coarse uptake distributions and the second refining local image details. Uniform organ activity maps were generated from CT-based segmentations by assigning each organ its mean uptake from the paired PET image. Evaluation included quantitative accuracy, noise assessment, radiomic analysis, tumor segmentation performance, and a human observer study. PAD-generated images achieved high quantitative accuracy, with concordance correlation coefficients above 0.92 between organ mean SUVs and assigned activity values. The synthesized images showed noise levels and texture characteristics similar to target PET images and produced comparable tumor segmentation performance. In a two-alternative forced-choice observer study, four readers achieved approximately 50% accuracy, indicating visual indistinguishability between synthesized and target images. PAD also generated realistic PET images from XCAT-derived activity maps, demonstrating compatibility with phantom-based anatomical priors. Overall, PAD provides a diffusion-based framework for generating clinically relevant heterogeneous PET images from uniform organ activity maps derived from clinical segmentations or digital phantoms, supporting data augmentation and downstream imaging studies.
Chinese Translation
合成PET图像对于定量成像工作流程开发、可扩展的虚拟成像试验和深度学习模型训练具有重要价值,但传统的基于物理的模拟方法计算密集、解剖变异性有限,且常常无法捕捉异质性PET摄取。本研究开发了一种预训练的领域适应扩散(PAD)模型,用于从均匀器官活动图合成条件于解剖结构的PET图像。PAD采用了自然图像预训练的文本到图像解码器,配备上游条件编码器和下游PET领域适配器。采用了两阶段的训练策略,第一阶段学习粗略的摄取分布,第二阶段细化局部图像细节。均匀器官活动图通过从配对的PET图像中为每个器官分配其平均摄取值,从CT基础分割生成。评估包括定量准确性、噪声评估、放射组学分析、肿瘤分割性能和人类观察者研究。PAD生成的图像在定量准确性上表现出色,器官平均标准摄取值(SUV)与分配活动值之间的一致性相关系数超过0.92。合成图像的噪声水平和纹理特征与目标PET图像相似,并且产生了可比的肿瘤分割性能。在一项二选一强迫选择观察者研究中,四位读者的准确率约为50%,表明合成图像与目标图像在视觉上不可区分。PAD还从XCAT派生的活动图生成了逼真的PET图像,展示了与基于幻影的解剖先验的兼容性。总体而言,PAD提供了一种基于扩散的框架,用于从临床分割或数字幻影派生的均匀器官活动图生成临床相关的异质性PET图像,支持数据增强和下游成像研究。
cs.CV / 6 / 2605.20275
You Don't Need Attention: Gated Convolutional Modeling for Watch-Based Fall Detection
你不需要注意力:基于门控卷积建模的手表式跌倒检测
Abstract
Existing deep learning approaches for wearable fall detection systems rely on self-attention mechanisms that impose quadratic computational overhead, distributing weights across all time steps. This global weight distribution impairs the precise localization of the brief impact signatures that characterize falls within short, fixed-length windows. To overcome this challenge, we propose Gated-CNN, a lightweight dual-stream architecture that processes accelerometer and gyroscope streams through independent one-dimensional convolutional feature extractors, followed by (i) a sigmoid gating module that selectively suppresses uninformative background activations while amplifying fall-discriminative features, (ii) a global average pooling layer that compresses each stream into a compact fixed-length descriptor, and (iii) a shared classification head that fuses both descriptors for binary fall prediction. For offline evaluation, we evaluate the model across five wrist-mounted inertial measurement unit (IMU) datasets, achieving average F1-scores of 93%, 93%, 90%, 91%, and 90% on SmartFallMM, WEDA-Fall, FallAllD, UMAFall, and UP-Fall, outperforming Transformer baselines. For real-time evaluation, we deployed the model on a Google Pixel Watch 3 and tested across 12 participants. The model achieves an average F1-score of 97% and an accuracy of 98% with zero missed falls, showing that sigmoid gating offers a more structurally aligned and computationally efficient alternative to attention for commodity smartwatch-based fall detection.
Chinese Translation
现有的可穿戴跌倒检测系统的深度学习方法依赖于自注意力机制,这会带来二次计算开销,并在所有时间步之间分配权重。这种全局权重分配削弱了在短且固定长度窗口内精确定位跌倒特征的能力,这些特征通常表现为短暂的冲击信号。为了解决这一挑战,我们提出了Gated-CNN,这是一种轻量级的双流架构,通过独立的一维卷积特征提取器处理加速度计和陀螺仪数据流,随后进行以下处理:(i) 一个sigmoid门控模块,选择性地抑制无信息的背景激活,同时增强与跌倒相关的特征;(ii) 一个全局平均池化层,将每个数据流压缩为紧凑的固定长度描述符;(iii) 一个共享分类头,将两个描述符融合用于二元跌倒预测。为了进行离线评估,我们在五个手腕安装的惯性测量单元(IMU)数据集上评估了该模型,在SmartFallMM、WEDA-Fall、FallAllD、UMAFall和UP-Fall上分别达到了93%、93%、90%、91%和90%的平均F1分数,超越了Transformer基线。在实时评估中,我们将模型部署在Google Pixel Watch 3上,并在12名参与者中进行了测试。该模型达到了97%的平均F1分数和98%的准确率,且未漏检任何跌倒事件,显示出sigmoid门控为基于商品智能手表的跌倒检测提供了一种结构上更一致且计算上更高效的替代方案。
cs.CV / 7 / 2605.20277
Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis
通过轨迹积分反馈调节解剖意识奖励以进行体积计算机断层扫描分析
Abstract
Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce ``\textit{Evaluation Hallucinations}'', where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the \textbf{Clinical Abnormality Benchmarking Substrate (CABS)}, a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a ``\textit{Mechanistic Divergence}'' in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose \textbf{Trajectory-Integral Feedback GRPO (TIF-GRPO)}, a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available at \href{https://github.com/ZJU4HealthCare/TIF-GRPO}{GitHub}.
Chinese Translation
医学视觉语言模型(VLMs)作为通用多模态助手迅速发展,但在三维计算机断层扫描(CT)分析中的应用仍受到优化目标与临床严谨性之间持续不匹配的限制。目前的强化学习(RL)范式仍依赖于词汇代理信号,这会导致“ extit{评估幻觉}”,使模型优化语言流畅性而非事实临床正确性,从而导致诊断上关键的错误。为了解决这一问题,我们引入了 extbf{临床异常基准基质(CABS)},这是一个将放射学报告分解为可验证的临床语义单元的结构化系统。通过使用CABS,我们识别出标准RL中的“ extit{机制性偏差}”,在这种情况下,表面相似性奖励驱动策略梯度绕过医学事实。因此,我们提出了 extbf{轨迹积分反馈GRPO(TIF-GRPO)},这是一个将控制理论原则整合到策略优化中的新框架。通过将临床推理形式化为异常发现的伪时间轨迹,TIF-GRPO通过一个积分反馈回路调节解剖意识奖励,该回路将持续遗漏惩罚为累积状态错误,并将幻觉抑制为过度控制努力。在三维CT基准上的实验表明,我们的方法显著增强了异常检测和临床可信度,为医学VLMs中的细粒度调节建立了新的范式。我们的项目可在 exttt{https://github.com/ZJU4HealthCare/TIF-GRPO}找到。
cs.CV / 8 / 2605.20282
Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning
视觉模型真的能忘记吗?Mirage:视觉遗忘的表征级认证
Abstract
Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely using output-level metrics. We challenge these claims by introducing Mirage, a representation-level auditing framework comprising four complementary diagnostics: Linear Probe Recovery (LPR), Centered Kernel Alignment (CKA), Feature Separability Scoring, and Layer-Wise Recovery Analysis. Through experiments across seven datasets and seven baseline methods following recent VFL unlearning protocols, Mirage reveals three key findings: (i) Forgetting gap: methods that pass output-level certification still retain substantial class structure in their representations, with LPR exceeding the retrained baseline by up to 15.4 points; CKA shows these models remain structurally closer to the original than to the retrained reference, while separability scores indicate persistent geometric discrimination. (ii) Unlearning trilemma: no existing method simultaneously achieves high utility, output-level forgetting, and representation-level forgetting. (iii) Class-sample asymmetry: class-level forgetting leaves strong representational traces (LPR up to 97%), whereas sample-level forgetting is indistinguishable from chance (LPR approx. 50%); layer-wise analysis further shows residual class information persists across network depths. These findings call for representation-aware evaluation standards in federated unlearning research.
Chinese Translation
机器遗忘在垂直联邦学习(Vertical Federated Learning, VFL)中引起了越来越多的关注,但现有方法仅通过输出级指标来认证遗忘。我们通过引入Mirage,一个包含四个互补诊断的表征级审计框架,挑战了这些说法:线性探测恢复(Linear Probe Recovery, LPR)、中心核对齐(Centered Kernel Alignment, CKA)、特征可分离性评分和逐层恢复分析。通过在七个数据集和七个基线方法上进行实验,遵循最近的VFL遗忘协议,Mirage揭示了三个关键发现:(i)遗忘差距:通过输出级认证的方法在其表征中仍保留了大量的类别结构,LPR超过重训练基线最多15.4个百分点;CKA显示这些模型在结构上仍然比重训练参考更接近原始模型,而可分离性评分表明几何区分能力持续存在。(ii)遗忘三难困境:没有现有方法能够同时实现高效用、输出级遗忘和表征级遗忘。(iii)类别样本不对称:类别级遗忘留下了强烈的表征痕迹(LPR高达97%),而样本级遗忘与随机情况无异(LPR约50%);逐层分析进一步表明,残余的类别信息在网络深度中持续存在。这些发现呼吁在联邦遗忘研究中采用表征意识的评估标准。
cs.CV / 9 / 2605.20284
JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA
JUDO:一种面向领域的并列多模态推理器用于工业异常问答
Abstract
Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.
Chinese Translation
工业异常检测得益于大型多模态模型(LMMs)的显著进展,使得超越检测的多样化人类指令成为可能,特别是通过视觉基础推理以更好地理解图像。然而,LMMs缺乏领域特定知识,这限制了它们在复杂工业场景中生成准确响应的能力。在本研究中,我们提出了JUDO,即并列领域导向多模态推理器,这一框架有效地将领域知识和上下文融入视觉和文本推理中。通过视觉推理,我们的模型通过将查询图像与正常图像并列作为视觉领域上下文来分割缺陷区域,从而实现细粒度的视觉比较检查。此外,我们通过监督微调(SFT)注入领域知识,以增强上下文理解,并随后通过强化学习(GRPO)以量身定制的奖励指导领域推理,选择领域导向的推理过程。实验结果表明,JUDO在MMAD基准测试中表现优异,超越了Qwen2.5-VL-7B和GPT-4o等模型。这些结果强调了增强领域知识和上下文在异常理解中的有效推理的重要性。
cs.CV / 10 / 2605.20297
MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery
MedCRP-CL:通过贝叶斯非参数语义模态发现实现持续医学图像分割
Abstract
Medical image segmentation faces a fundamental challenge in continual learning: data arrives sequentially from heterogeneous sources, yet effective continual learning requires discovering which tasks share sufficient structure to benefit from joint learning. Existing methods either apply uniform constraints across all tasks, causing catastrophic forgetting when tasks conflict, or require predefined task groupings that cannot anticipate future task diversity. We introduce MedCRP-CL, a framework that performs online task structure discovery and structure-aware continual learning. Leveraging the Chinese Restaurant Process (CRP), our method dynamically infers task groupings from clinical text prompts as tasks arrive, without requiring predefined cluster counts or access to future tasks. We term these discovered groupings semantic modalities, as they capture finer-grained structure than physical imaging modalities by integrating anatomical region and pathological context. Guided by this discovered structure, we maintain semantic modality-specific LoRA adapters regularized by intra-modality EWC, ensuring parameter isolation across dissimilar task groups while facilitating knowledge transfer within similar ones. The framework is also replay-free, storing only aggregate statistics rather than raw patient data. Experiments on 16 medical segmentation tasks across four imaging modalities demonstrate that MedCRP-CL achieves 73.3% Dice score with only 4.1% forgetting, outperforming the best baseline by 8.0% while requiring 6$\times$ fewer parameters. Code is available at https://github.com/zygao930/MedCRP-CL.
Chinese Translation
医学图像分割在持续学习中面临一个基本挑战:数据来自异构源并按顺序到达,而有效的持续学习需要发现哪些任务共享足够的结构以便从联合学习中受益。现有方法要么在所有任务上施加统一约束,导致任务冲突时的灾难性遗忘,要么需要预定义的任务分组,这无法预见未来任务的多样性。我们提出了MedCRP-CL,一个执行在线任务结构发现和结构感知持续学习的框架。利用中国餐馆过程(Chinese Restaurant Process, CRP),我们的方法在任务到达时动态推断临床文本提示中的任务分组,而无需预定义的聚类数量或对未来任务的访问。我们将这些发现的分组称为语义模态,因为它们通过整合解剖区域和病理背景捕捉比物理成像模态更细粒度的结构。在发现的结构指导下,我们维护特定于语义模态的LoRA适配器,并通过模态内的EWC进行正则化,确保在不同任务组之间的参数隔离,同时促进在相似任务组内的知识转移。该框架也是无重放的,仅存储汇总统计信息,而非原始患者数据。在四种成像模态下对16个医学分割任务的实验表明,MedCRP-CL以73.3%的Dice得分和仅4.1%的遗忘率超越了最佳基线8.0%,同时所需参数减少了6倍。代码可在https://github.com/zygao930/MedCRP-CL获取。
cs.CV / 11 / 2605.20301
Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection
Co-Fusion4D:用于稳健3D目标检测的时空协同融合
Abstract
In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and ego-motion often induce cross-frame spatiotemporal inconsistencies in BEV-based detectors, leading to temporal BEV feature misalignment and degraded spatiotemporal consistency. To address these challenges, we propose Co-Fusion4D, a unified framework that explicitly preserves cross-frame spatiotemporal consistency and suppresses temporal feature drift. Co-Fusion4D adopts a current-frame-centric strategy, treating the current frame as the primary source of information while selectively incorporating historical frames after spatiotemporal filtering and alignment. This dominant-complementary mechanism effectively mitigates cumulative alignment errors, suppresses noisy feature propagation, and exploits reliable temporal cues for a more consistent BEV representation. In addition, Co-Fusion4D integrates a Dual Attention Fusion (DAF) module to further enhance spatiotemporal feature interaction. DAF jointly leverages intra-frame spatial attention and inter-frame temporal attention to adaptively align and fuse multi-frame features, emphasizing motion-consistent regions while suppressing spurious correlations. By departing from conventional uniform fusion paradigms, this design substantially improves the temporal stability and discriminative capability of BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that Co-Fusion4D achieves state-of-the-art performance, with 74.9% mAP and 75.6% NDS, without relying on test-time augmentation or external data.
Chinese Translation
在自动驾驶中,3D目标检测对于准确感知和可靠决策至关重要。然而,物体运动和自我运动常常导致基于鸟瞰视图(BEV)检测器的跨帧时空不一致,导致时间上的BEV特征错位和时空一致性下降。为了解决这些挑战,我们提出了Co-Fusion4D,一个统一框架,明确保持跨帧时空一致性并抑制时间特征漂移。Co-Fusion4D采用以当前帧为中心的策略,将当前帧视为主要信息源,同时在时空过滤和对齐后选择性地融入历史帧。这种主导-互补机制有效减轻了累积对齐误差,抑制了噪声特征传播,并利用可靠的时间线索以获得更一致的BEV表示。此外,Co-Fusion4D集成了双重注意力融合(DAF)模块,以进一步增强时空特征交互。DAF共同利用帧内空间注意力和帧间时间注意力,自适应地对齐和融合多帧特征,强调运动一致区域,同时抑制虚假相关性。通过摆脱传统的均匀融合范式,该设计显著提高了BEV表示的时间稳定性和区分能力。在nuScenes基准上的大量实验表明,Co-Fusion4D实现了最先进的性能,mAP达到74.9%,NDS达到75.6%,且不依赖于测试时增强或外部数据。
cs.CV / 12 / 2605.20306
WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents
WildRoadBench:一种针对视觉语言模型和自主智能体的野外航空道路损坏定位基准
Abstract
We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain-specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar-feedback oracle on a hidden holdout. We benchmark a broad pool of closed-source frontier models and open-source VLMs together with several frontier LLM-driven agents. Both routes remain far from reliable performance in this wild setting: closed-source frontier models lead the VLM leaderboard but still leave more than half of the metric on the table; open-source grounders plateau well below them, and newer generations or reasoning-style variants do not consistently improve grounding; small targets collapse for every open-source model; agents lag the strongest VLM despite richer affordances, and several fail to land a valid submission within the budget. We release the code and data at https://anonymous.4open.science/r/wildroadbench-0607 to support reproducible follow-up research.
Chinese Translation
我们介绍了WildRoadBench,这是一个野外航空道路损坏定位基准,它将视觉语言模型的直接视觉定位与基于大型语言模型(LLM)驱动的自主研究与工程结合在一起,基于一个经过专业注释的无人机(UAV)语料库进行评估。相同的图像集和相同的每类AP_50指标在两个协议下进行评估。VLM Track评估固定的视觉语言模型(VLM)是否能够在统一的提示、解码和解析流程下,从一张图像和一个简短提示中定位特定领域的损坏。Agent Track评估一个自主智能体在仅提供书面任务简报、小规模探索片段和固定交互预算的情况下,是否能够搜索公共网络、适应预训练组件、编写训练和推理代码,并通过一个标量反馈神谕在隐藏的保留集上提交预测。我们对一系列闭源前沿模型和开源视觉语言模型,以及若干前沿的LLM驱动智能体进行了基准测试。在这个野外环境中,两条路径的可靠性能仍然相距甚远:闭源前沿模型在VLM排行榜上领先,但仍有超过一半的指标未被利用;开源定位器的表现远低于它们,而更新的生成模型或推理风格变体并未持续改善定位;每个开源模型在小目标上表现不佳;尽管有更丰富的功能,智能体的表现仍落后于最强的VLM,并且有几个智能体未能在预算内提交有效的结果。我们在https://anonymous.4open.science/r/wildroadbench-0607发布代码和数据,以支持可重复的后续研究。
cs.CV / 13 / 2605.20308
SDM: A Powerful Tool for Evaluating Model Robustness
SDM:评估模型鲁棒性的强大工具
Abstract
Gradient-based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To achieve such an effect, we first analyze the issue of "high-loss non-adversarial examples" that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reconstruct the objective as "maximizing the difference between the non-ground-truth label probability upper bound and the ground-truth label probability", and proposes a novel and powerful gradient-based attack method named Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step". It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage-wise sequential optimization. Experiments demonstrate that compared with previous state-of-the-art methods, SDM not only achieves stronger attack performance but also exhibits superior cost-effectiveness. The code is available at https://github.com/X-L-Liu/ICML-SDM.
Chinese Translation
基于梯度的攻击是评估模型鲁棒性的重要方法。然而,自从提出APGD以来,这类方法在取得显著突破方面一直面临困难。为了实现这一效果,我们首先分析了“高损失非对抗样本”问题,该问题在以往方法中降低了攻击性能,并证明该问题源于对抗样本生成目标的不当设定。随后,我们将目标重构为“最大化非真实标签概率上界与真实标签概率之间的差异”,并提出了一种新颖且强大的基于梯度的攻击方法,命名为顺序差异最大化(Sequential Difference Maximization,SDM)。SDM建立了一个“三层优化框架”,包括“循环-阶段-步骤”。它在初始和后续优化阶段分别采用负概率损失函数和方向概率差异比率(Directional Probability Difference Ratio,DPDR)损失函数,通过阶段性顺序优化接近对抗样本生成的理想目标。实验表明,与以往的最先进方法相比,SDM不仅实现了更强的攻击性能,还表现出更优的性价比。代码可在 https://github.com/X-L-Liu/ICML-SDM 获取。
cs.CV / 14 / 2605.20309
Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision
Tiny-Engram:用于生成视觉的触发索引概念表
Abstract
Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether and when a concept should be retrieved. In this work, we introduce Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary inside frozen image and video generators. Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model. Across both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones, this formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt. We further evaluate the same table-based memory in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out video prompts remains limited. Taken together, these results suggest that small, explicitly addressed concept tables are a practical route to modular visual personalization, with strongest evidence in image generation. For video diffusion, the remaining gap points to a broader requirement: temporally stable identity likely depends on tighter coupling between text-side memory and the evolving visual state, motivating future work on memory injection beyond the text-conditioning interface.
Chinese Translation
当前生成视觉模型的个性化方法通常通过连续适配器或权重更新来编码新概念,但在何时以及如何检索概念方面提供的控制有限。在本研究中,我们引入了Tiny-Engram,一种紧凑的触发索引概念表,它为视觉记忆提供了明确的词汇地址和激活边界,适用于冻结的图像和视频生成器。Tiny-Engram将每个概念参数化为一小组由注册的n-gram匹配索引的记忆条目,这些条目仅在匹配的触发区域内调节文本编码器的隐藏状态。在这个词汇支持之外,条件路径与冻结的基础模型相同。在单编码器潜在扩散和多编码器扩散-变换器骨干网络中,这种形式将稀有的触发短语与目标身份绑定,同时保留来自周围提示的组合控制。我们进一步在文本条件的视频生成环境中评估了相同的基于表的记忆,其中触发路径可靠地改变了生成的主题,但在保持的视频提示中,细粒度身份持久性仍然有限。综合来看,这些结果表明,小型、明确地址的概念表是实现模块化视觉个性化的实用途径,尤其在图像生成中证据最为强烈。对于视频扩散,剩余的差距指向一个更广泛的需求:时间稳定的身份可能依赖于文本侧记忆与不断演变的视觉状态之间的更紧密耦合,这为未来在文本条件接口之外的记忆注入工作提供了动机。
cs.CV / 15 / 2605.20316
FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation
FullFlow:升级文本到图像流匹配模型以实现双向视觉-语言生成
Abstract
Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision--language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce \emph{FullFlow}, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision--language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text$\rightarrow$image, image$\rightarrow$text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text$\rightarrow$image FID from $62.7$ to $31.6$ and image$\rightarrow$text CIDEr from $2.0$ to $99.4$ over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ${\sim}84$\,GB to ${\sim}38$\,GB and raising throughput by ${\sim}8\times$ on two RTX A5000 GPUs in under 24 hours, training only ${\sim}5\%$ of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision--language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.
Chinese Translation
现代文本到图像扩散模型编码了丰富的视觉先验,但仅通过单向文本条件生成进行展示。现有的统一视觉-语言模型通过大规模联合预训练或对文本路径的显著再训练恢复了双向能力,但却丢弃了文本到图像主干模型已经编码的强大图像先验。我们提出了 extit{FullFlow},这是一种参数高效的方案,通过仅训练LoRA适配器和轻量级文本头,将预训练的校正流文本到图像模型升级为双向视觉-语言生成器。FullFlow保持图像在其原生连续流中,并为文本添加了离散插入过程。分离的图像和文本时间步将推理转变为二维生成空间中的轨迹选择,从而实现文本$
ightarrow$图像、图像$
ightarrow$文本、联合采样和部分文本预测,且仅需一个主干模型。在相同的可训练参数数量和匹配的LoRA秩下,FullFlow在Stable Diffusion 3 (SD3)上将文本$
ightarrow$图像的FID从$62.7$提升至$31.6$,将图像$
ightarrow$文本的CIDEr从$2.0$提升至$99.4$,相较于遵循先前SOTA公式(Dual Diffusion)的LoRA等效模型,在匹配的墙钟训练时间内,同时将峰值VRAM从${ ext{∼}}84$ GB降低至${ ext{∼}}38$ GB,并在不到24小时内在两块RTX A5000 GPU上提高了${ ext{∼}}8 imes$的吞吐量,仅训练了${ ext{∼}}5 ext{%}$的主干参数。相同的方案可转移到FLUX.1-dev,并支持通过部分文本生成进行下游视觉问答(VQA)。这些结果表明,强大的双向视觉-语言能力可以从预训练的文本到图像流模型中解锁,而无需进行全面的多模态预训练。
cs.CV / 16 / 2605.20337
Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models
能力 $
eq$ 可解释性:视觉基础模型的人类可解释性
Abstract
How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability -- can an observer predict where a feature fires on a novel image? -- and (2) nameability -- can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers -- two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) -- we collected more than $15{,}000$ behavioral responses, analyzing the $13{,}400$ responses from the $377$ participants who passed our pre-specified quality checks. Foundation models are consistently *less* interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality -- and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.
Chinese Translation
领先视觉模型的特征有多可解释?随着这些模型从研究基准转向高风险应用,这个问题变得愈发紧迫,但现有方法无法可靠地回答。我们通过一个框架来测量和比较视觉模型的人类可解释性,填补了这一空白,该框架基于两个互补的心理物理学协议:(1) 可定位性——观察者能否预测特征在新图像上的激活位置?——和 (2) 可命名性——观察者能否准确描述该特征所代表的内容?特征通过稀疏自编码器恢复,基于机会锚定的评分函数将每个模型置于一个共同的尺度上。我们将该框架应用于六个视觉变换器——两个监督的 ViTs 和四个基础模型(DINOv2、DINOv3、CLIP、SigLIP)——收集了超过 $15{,}000$ 个行为反应,分析了 $377$ 名通过我们预设质量检查的参与者的 $13{,}400$ 个反应。基础模型的可解释性始终*低于*其监督对应物,而这种差距并不是能力的权衡:在我们检查的任何基准上,可解释性与下游任务性能没有相关性。相关的是特征激活的局部性和与人类的粗粒度语义对齐——具有焦点激活和反映世界广泛类别结构的表示的模型产生更可解释的特征,而细粒度的感知对齐则没有。这两个协议产生了强相关的排名,并共享相同的预测因子,确立了可解释性作为表示质量的一个独立、可测量的维度——令人惊讶的是,我们测试的每个基础模型在这一维度上均低于之前的监督基线。仅凭能力无法缩小这一差距;局部性和粗粒度对齐可以。
cs.CV / 17 / 2605.20342
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
ParaVT:驯服工具先验悖论以实现代理视频强化学习中的并行工具使用
Abstract
Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.
Chinese Translation
通过强化学习(RL)训练大型多模态模型(LMMs)以原生调用视频处理工具(例如裁剪)已成为理解长视频的有希望的途径。然而,现有的原生RL方法依次调度工具调用(即每次一个):单个错误裁剪会传播错误而没有同行纠正,多轮工具调用会破坏上下文,推理成本与轮次数量线性相关。我们提出了ParaVT,这是第一个为并行视频工具调用而训练的多代理端到端RL框架,在单个回合中调度多个时间窗口裁剪,以实现更清晰的上下文和更好的容错能力。然而,将标准RL应用于ParaVT揭示了一个我们称之为工具先验悖论的障碍:使工具探索成为可能的预训练工具先验也会使冷启动的结构格式不稳定,并在温度采样下暴露跳过工具的奖励捷径。对一个弱先验LMM的跨模型对比支持了这一观点:格式保持稳定,但RL没有引发任何工具调用,表明先验强度是格式崩溃和工具探索的共同驱动因素。我们提出了PARA-GRPO(可解析性锚定和比例引导的GRPO),它通过两种互补机制增强标准RL:(i)仅在最容易崩溃的结构标记位置应用的目标格式奖励,以及(ii)每个提示帧预算随机化,创建训练提示,使得调用工具能产生可测量的奖励信号,而不是跳过它。在六个长视频理解基准上,ParaVT在平均上比Qwen3-VL基线提高了7.9%,而PARA-GRPO将训练时间格式合规性从0.13提升至0.64。随着工具能力在现代LMMs中越来越内化,RL必须与由此产生的先验合作,ParaVT为代理RL提供了一种通用方案。代码、数据和模型权重已公开。
cs.CV / 18 / 2605.20362
HAPS: Rethinking Image Similarity for Virtual Staining
HAPS:重新思考虚拟染色的图像相似性
Abstract
Virtual staining of histopathology images (e.g., H&E-IHC) is an emerging tool in digital pathology, enabling faster and cheaper workflows by synthesizing target stains from routinely acquired slides. Yet, the quality of virtual staining models is still predominantly assessed with generic metrics such as SSIM, PSNR, and LPIPS. Originally developed for natural images, these metrics are inherently misaligned with the domain-specific characteristics of histological data, failing to capture tissue morphology preservation and biomarker expression patterns. Consequently, a robust, domain-specific standard for quantifying similarity across diverse histological modalities remains a critical gap in the field. In this work, we formalize histology image similarity as a standalone problem and systematically evaluate a broad set of full-reference metrics against a dataset of H&E-IHC patch pairs annotated with expert similarity scores. We further analyze metrics sensitivity to controlled geometric distortions (shifts, rotations and non-rigid deformations) that mimic realistic registration errors between serial sections. Guided by these observations, we propose the Histology-Aware Perceptual Similarity (HAPS) metric. HAPS computes distances in the feature space of a frozen encoder pretrained on histopathology data, adding a linear head to aggregate feature-level differences into a final score that aligns with expert assessments. Finally, we demonstrate the practical value of HAPS for quality control of training data. By quantifying the similarity of training pairs in the MIST dataset and filtering low-scoring samples, we create a cleaner training set. Virtual staining models trained on this refined data outperform those trained on the original, unfiltered dataset.
Chinese Translation
组织病理图像的虚拟染色(例如,H&E-IHC)是数字病理学中的一项新兴工具,通过合成常规获取的切片中的目标染色,实现更快且更便宜的工作流程。然而,虚拟染色模型的质量仍主要通过诸如SSIM、PSNR和LPIPS等通用指标进行评估。这些指标最初是为自然图像开发的,固有地与组织学数据的领域特征不匹配,未能捕捉到组织形态的保持和生物标志物表达模式。因此,在不同组织学模式中量化相似性的稳健领域特定标准仍然是该领域的一个关键空白。在本研究中,我们将组织学图像相似性形式化为一个独立问题,并系统地评估一组广泛的全参考指标,针对一组带有专家相似性评分的H&E-IHC补丁对数据集进行评估。我们进一步分析了指标对受控几何失真的敏感性(位移、旋转和非刚性变形),这些失真模拟了连续切片之间的现实配准误差。根据这些观察,我们提出了组织学感知的感知相似性(HAPS)指标。HAPS在一个冻结的编码器的特征空间中计算距离,该编码器是在组织病理数据上预训练的,并添加一个线性头将特征级差异聚合成一个与专家评估一致的最终评分。最后,我们展示了HAPS在训练数据质量控制中的实际价值。通过量化MIST数据集中训练对的相似性并过滤低评分样本,我们创建了一个更干净的训练集。在这个精炼数据上训练的虚拟染色模型的表现优于在原始未过滤数据集上训练的模型。
cs.CV / 19 / 2605.20372
Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities
基于潜在空间引导的缺失模态下多模态分割场景采样
Abstract
Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real-world remote sensing applications, one or more modalities may be unavailable due to sensor failures, adverse atmospheric conditions, or data acquisition problems. Even with pretrained multimodal representations and existing fine-tuning or adaptation strategies, performance may remain limited because all modality availability scenarios are typically treated as equally informative during training. In this paper, we propose a novel training strategy that learns a scenario sampling distribution directly from the pretrained latent space. Instead of relying on uniform random modality dropout, the proposed method guides fine-tuning toward more informative modality availability scenarios. More specifically, we quantify the effect of each scenario independently based on the distortion it induces in the shared latent representation. We then capture scenario relations using a radial basis function kernel and derive refined scenario scores through a regularized kernel smoothing. These scores are then converted into a probability distribution during scenario sampling for fine-tuning. We evaluate this strategy on three remote sensing image sets, namely DSTL, Potsdam, and Hunan, using CBC-SLP, CBC, and CMX backbones. The experimental results with different image sets and backbones show that our method outperforms standard fine-tuning and LoRA-based adaptation. These findings suggest that the pretrained latent representation can serve as an effective basis for sampling during missing modality fine-tuning. Code is available at https://github.com/iremulku/Latent-Space-Guided-Scenario-Sampling
Chinese Translation
多模态语义分割通过结合来自不同传感器模态的互补信息,促进了遥感分析。在实际的遥感应用中,由于传感器故障、不利的气象条件或数据获取问题,可能会缺失一个或多个模态。即使使用了预训练的多模态表示和现有的微调或适应策略,性能仍然可能受到限制,因为在训练过程中,所有模态可用场景通常被视为同样具有信息量。在本文中,我们提出了一种新颖的训练策略,该策略直接从预训练的潜在空间中学习场景采样分布。该方法不依赖于均匀随机的模态丢弃,而是引导微调朝向更具信息量的模态可用场景。更具体地,我们根据每个场景在共享潜在表示中引起的失真独立量化其影响。然后,我们使用径向基函数核捕捉场景关系,并通过正则化核平滑推导出精细的场景得分。这些得分在微调的场景采样中被转换为概率分布。我们在三个遥感图像集上评估了这一策略,即 DSTL、波茨坦(Potsdam)和湖南(Hunan),使用 CBC-SLP、CBC 和 CMX 主干网络。不同图像集和主干网络的实验结果表明,我们的方法优于标准微调和基于 LoRA 的适应。这些发现表明,预训练的潜在表示可以作为缺失模态微调过程中采样的有效基础。代码可在 https://github.com/iremulku/Latent-Space-Guided-Scenario-Sampling 获取。
cs.CV / 20 / 2605.20385
ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning
ConceptSeg-R1:通过元强化学习分割任意概念
Abstract
Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains under-specified, making it unclear whether current methods truly generalize beyond category recognition. In this work, we formalize generalized concept segmentation through a three-level taxonomy consisting of context-independent (CI), context-dependent (CD), and context-reasoning (CR) concepts, which reveals a clear capability gap across increasing levels of cognitive complexity. To address this challenge, we propose ConceptSeg-R1, a unified framework that reformulates concept segmentation as rule-induced concept grounding. At the core of our method is Meta-GRPO, a meta-reinforcement learning mechanism that learns transferable task rules from visual demonstrations and verifies them through proxy reasoning. The inferred reasoning states are then translated into segmentation-ready concept prompts via a lightweight concept translation module, enabling deductive application to target images. A shortcut routing strategy further preserves the native efficiency of segmentation models on simple cases. To systematically evaluate generalized concept segmentation, we conduct extensive experiments across diverse CI, CD, and CR concept segmentation benchmarks spanning natural, industrial, medical and reasoning-intensive domains. Without bells and whistles, ConceptSeg-R1 achieves strong performance across the full concept hierarchy while maintaining the native capability of promptable segmentation backbones. As an initial step toward segmenting any concept, we hope ConceptSeg-R1 can serve as a practical baseline for advancing segmentation from object-level prediction toward concept-level understanding.
Chinese Translation
最近在可提示分割方面的进展将视觉感知的重点从对象级定位转向了概念级理解。然而,概念的定义仍然不够明确,使得当前方法是否真正超越类别识别的普适性变得不清晰。在本研究中,我们通过一个由上下文独立(CI)、上下文依赖(CD)和上下文推理(CR)概念组成的三级分类法形式化了广义概念分割,这揭示了在认知复杂性逐级增加的过程中存在明显的能力差距。为了解决这一挑战,我们提出了ConceptSeg-R1,一个将概念分割重新表述为规则诱导概念基础的统一框架。我们方法的核心是Meta-GRPO,一种从视觉示范中学习可转移任务规则并通过代理推理验证这些规则的元强化学习机制。推断出的推理状态随后通过轻量级概念翻译模块转换为适合分割的概念提示,从而实现对目标图像的推理应用。快捷路由策略进一步保持了分割模型在简单案例上的原生效率。为了系统地评估广义概念分割,我们在自然、工业、医疗和推理密集型领域的多样化CI、CD和CR概念分割基准上进行了广泛实验。ConceptSeg-R1在整个概念层级上实现了强大的性能,同时保持了可提示分割骨干网络的原生能力,且没有额外的复杂性。作为向分割任意概念迈出的初步步骤,我们希望ConceptSeg-R1能够作为一个实用的基线,推动从对象级预测向概念级理解的分割进展。
cs.CV / 21 / 2605.20388
How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction
你的移动方式揭示了你的行为:基于轨迹的自我中心预测
Abstract
Predicting how a person's first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way. Two findings shape our approach. First, the future camera trajectory, the path the head carves through space, lets the model commit to one of those futures: it carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that trajectory need not be observed at test time to recover most of the gain. We instantiate these findings as TrajPilot, a model that predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input. TrajPilot beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the trajectory advantage widening with horizon (exactly where prior planners collapse) and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, beating VLM baselines on Ego-Exo4D atomic and extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.
Chinese Translation
预测一个人第一人称视角将如何演变(接下来会采取什么行动,什么计划完成任务,正在进行的投篮是否会得分)在本质上是未充分指定的:相同的背景允许许多合理的未来,而一个旨在最小化预测误差的模型被迫在这些未来之间进行权衡或平均,结果无论如何都可能出错。两个发现塑造了我们的方法。首先,未来的相机轨迹,即头部在空间中划出的路径,使模型能够承诺其中一个未来:它以足够细致的形式承载了操作员的意图,从而决定了一个行动将如何展开,显著优于语言作为条件信号。其次,这种意图使得轨迹本身在当前上下文中部分可预测,足以使得在测试时无需观察轨迹就能恢复大部分增益。我们将这些发现具体化为TrajPilot,一个从自我中心上下文预测候选未来轨迹的模型,并利用这些轨迹在一个与行动对齐的嵌入空间中引导行动预测,在这个空间中,语言塑造了结构但从未作为条件输入使用。TrajPilot在Ego-Exo4D原子、Ego-Exo4D关键步骤、Ego4D目标步骤和EgoPER的程序规划中超越了VLM和结构规划基线,随着预测范围的扩大,轨迹优势也在加大(正是之前的规划者在此崩溃的地方),并且在仅使用RGB的相机姿态估计下保持有效。在推理时目标被屏蔽的情况下,同一模型执行无目标的预期,在Ego-Exo4D原子上超越了VLM基线,并扩展到EPIC-Kitchens-100和篮球投篮结果预测。
cs.CV / 22 / 2605.20390
STELLAR: Scaling 3D Perception Large Models for Autonomous Driving
STELLAR:为自动驾驶扩展3D感知大模型
Abstract
Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous driving perception systems due to unique challenges, such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study on systematically analyzing the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, by extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state-of-the-art on the Waymo Open Dataset challenge, outperforming prior arts by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.
Chinese Translation
模型扩展通过在多样化数据集上进行大规模训练已显示出显著成功。然而,由于融合异构传感器数据和对复杂3D空间理解的需求等独特挑战,是否同样的范式适用于自动驾驶感知系统仍然是一个未解的问题。为了解决这一问题,我们提出了一项全面研究,系统分析规模对这些系统的影响。我们基于稀疏窗口变换器(Sparse Window Transformer)开发了STELLAR模型,通过扩展输入模态以包括激光雷达(LiDAR)、雷达、摄像头和地图先验。我们在一个包含5000万驾驶示例的大规模数据集上训练该模型,参数量达到5亿。我们的规模实验揭示了将模型性能与模型大小、数据和计算能力联系起来的经验扩展趋势。最终模型在Waymo开放数据集挑战中建立了新的最先进水平,远超以往的研究成果。我们的工作表明,大规模训练是提升自动驾驶感知模型能力的一个非常有前景的路径。
cs.CV / 23 / 2605.20436
Lighting-aware Unified Model for Instance Segmentation
考虑光照的实例分割统一模型
Abstract
Foundation models like the Segment Anything Model (SAM) demonstrate impressive zero-shot generalization but frequently degrade under diverse real-world illumination, particularly for instance segmentation. In this work, we address this limitation by developing \textit{Lighting Convolutional-Attention (\lca{})}, an adapter module that enhances segmentation robustness without fine-tuning the heavy backbone. \lca{} employs a dual-branch architecture to process RGB features alongside contrast maps, enabling physically motivated sensitivity to structural changes rather than illumination artifacts. We optimize \lca{} through a pairwise training strategy, introducing a targeted loss term that explicitly penalizes discrepancies between clean images and their corresponding illumination variants. To evaluate and support this architecture, we conduct a comprehensive empirical study across multiple existing benchmarks and present a novel Unity-based synthetic dataset specifically designed to accurately replicate complex real-world lighting conditions. Extensive experimental results demonstrate that our approach successfully bridges the domain gap, delivering superior lighting-robust segmentation.
Chinese Translation
基础模型如Segment Anything Model (SAM)展现了令人印象深刻的零-shot泛化能力,但在多样的真实世界光照条件下,尤其是在实例分割方面,表现往往会下降。在本研究中,我们通过开发 extit{Lighting Convolutional-Attention ( extit{lca})}适配模块来解决这一局限性,该模块在不对重型主干进行微调的情况下增强了分割的鲁棒性。 extit{lca}采用双分支架构,同时处理RGB特征和对比度图,从而使其对结构变化的敏感性更具物理动机,而非光照伪影。我们通过成对训练策略优化 extit{lca},引入了一个针对性的损失项,明确惩罚干净图像与其对应光照变体之间的差异。为了评估和支持这一架构,我们在多个现有基准上进行了全面的实证研究,并提出了一个新颖的基于Unity的合成数据集,专门设计用于准确复制复杂的真实世界光照条件。大量实验结果表明,我们的方法成功弥合了领域差距,实现了更优的光照鲁棒性分割。
cs.CV / 24 / 2605.20445
A Comprehensive Comparison of Deep Learning Architectures for COVID-19 Classification on CT & X-ray Imagery
深度学习架构在CT和X光影像中对COVID-19分类的全面比较
Abstract
COVID-19 was a significant challenge that led to the loss of numerous lives daily. Not only a certain country was involved in this outbreak, but even the world has suffered because of the coronavirus. Imaging techniques using computed tomography (CT) and X-rays of the lungs are the most useful tools for the COVID-19 or any other pandemic disease screening process. Technology today has revolutionized the world by using artificial intelligence to replace manual processes with automated machines, which enable the system to imitate the human brain by making wise decisions based on experience. Motivated by this, our work proposes to use convolutional neural networks (CNN) based models for designing a computer-aided diagnosis (CAD) system that differentiates between COVID-19 and healthy lung pictures. We used two different sets of X-ray images of the lungs in addition to two different sets of CT scans and the classification is done using a variety of networks that have been pre-trained such as VGG (16, 19), Densenet (121), Resnet (50, 50 V2, 101 V2), Mobile net (V2), Xception Inception (V3, Resnet V2), Efficient net (B0) and Nasnet (Large). On the X-ray and CT image datasets, Resnet and VGG architecture have shown the ability to properly differentiate COVID-19 from normal images, with an average accuracy of 95 to 98 percent respectively. Our acquired results on the classification datasets are competitive and superior to previously reported findings in the literature.
Chinese Translation
COVID-19是一个重大的挑战,每天导致无数生命的损失。这场疫情不仅影响了某个国家,甚至全球都因冠状病毒而遭受苦难。使用计算机断层扫描(CT)和肺部X光的成像技术是COVID-19或任何其他流行病筛查过程中最有用的工具。如今的技术通过使用人工智能将手动过程替换为自动化机器,彻底改变了世界,使系统能够模仿人脑,根据经验做出明智的决策。受到这一点的启发,我们的工作提出使用基于卷积神经网络(CNN)的模型来设计一个计算机辅助诊断(CAD)系统,以区分COVID-19与健康肺部影像。我们使用了两组不同的肺部X光图像,以及两组不同的CT扫描,并使用多种预训练网络进行分类,如VGG(16, 19)、Densenet(121)、Resnet(50, 50 V2, 101 V2)、Mobile net(V2)、Xception Inception(V3, Resnet V2)、Efficient net(B0)和Nasnet(Large)。在X光和CT影像数据集上,Resnet和VGG架构显示出能够有效区分COVID-19与正常影像的能力,平均准确率分别为95%到98%。我们在分类数据集上获得的结果具有竞争力,并优于文献中之前报告的发现。
cs.CV / 25 / 2605.20448
Do Vision--Language Models Understand 3D Scenes or Just Catalogue Objects?
视觉-语言模型是否理解三维场景还是仅仅 catalog 物体?
Abstract
Vision--language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53--97% accuracy and rarely violate collision constraints fall to 6--45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.
Chinese Translation
视觉-语言模型能够可靠地命名场景中的物体,但它们是否表示这些物体所处的三维布局?我们引入了一个包含3,034个样本的人类策划基准,针对空间理解的三个组成部分:深度排序遮挡(通过三种独立的反事实操作进行探测)、可见反射的光学几何推断,以及体积重排规划。六个前沿和开放权重的视觉-语言模型(VLMs),由经过训练的注释者在18,204个响应中评分,未使用大型语言模型(LLM)作为评判者,揭示了明显的分离:在可见布局上进行重排规划的模型准确率为53%至97%,且很少违反碰撞约束,但在遮挡任务上的准确率下降至6%至45%,在反射任务上低于7%。一个具身推理模型再现了相同的特征。对Qwen3-VL-8B-Thinking的白盒分析将失败定位于视觉标记合并:在视觉编码器中可恢复的空间信息在标记压缩后变得不可访问,只有在干净的后合并激活被修补到语言解码器时才会再次稳定。
cs.CV / 26 / 2605.20458
ELEMENT: Multi-Modal Retinal Vessel Segmentation Based on a Coupled Region Growing and Machine Learning Approach
ELEMENT:基于耦合区域生长和机器学习方法的多模态视网膜血管分割
Abstract
Vascular structures in the retina contain important information for the detection and analysis of ocular diseases, including age-related macular degeneration, diabetic retinopathy and glaucoma. Commonly used modalities in diagnosis of these diseases are fundus photography, scanning laser ophthalmoscope (SLO) and fluorescein angiography (FA). Typically, retinal vessel segmentation is carried out either manually or interactively, which makes it time consuming and prone to human errors. In this research, we propose a new multi-modal framework for vessel segmentation called ELEMENT (vEsseL sEgmentation using Machine lEarning and coNnecTivity). This framework consists of feature extraction and pixel-based classification using region growing and machine learning. The proposed features capture complementary evidence based on grey level and vessel connectivity properties. The latter information is seamlessly propagated through the pixels at the classification phase. ELEMENT reduces inconsistencies and speeds up the segmentation throughput. We analyze and compare the performance of the proposed approach against state-of-the-art vessel segmentation algorithms in three major groups of experiments, for each of the ocular modalities. Our method produced higher overall performance, with an overall accuracy of 97.40%, compared to 25 of the 26 state-of-the-art approaches, including six works based on deep learning, evaluated on the widely known DRIVE fundus image dataset. In the case of the STARE, CHASE-DB, VAMPIRE FA, IOSTAR SLO and RC-SLO datasets, the proposed framework outperformed all of the state-of-the-art methods with accuracies of 98.27%, 97.78%, 98.34%, 98.04% and 98.35%, respectively.
Chinese Translation
视网膜中的血管结构包含了检测和分析眼部疾病的重要信息,包括年龄相关性黄斑变性、糖尿病视网膜病变和青光眼。用于诊断这些疾病的常用模态包括眼底摄影、扫描激光眼底镜(SLO)和荧光素血管造影(FA)。通常,视网膜血管分割是手动或交互进行的,这使得这一过程耗时且容易出现人为错误。在本研究中,我们提出了一种新的多模态血管分割框架,称为ELEMENT(基于机器学习和连通性的血管分割)。该框架包括特征提取和基于像素的分类,采用区域生长和机器学习。所提出的特征基于灰度级和血管连通性属性捕捉互补证据。后者信息在分类阶段通过像素无缝传播。ELEMENT减少了不一致性并加快了分割速度。我们分析并比较了所提方法在三组主要实验中与最先进的血管分割算法的性能,针对每种眼部模态。与26种最先进的方法相比,我们的方法在广为人知的DRIVE眼底图像数据集上获得了更高的整体性能,整体准确率为97.40%,其中包括六项基于深度学习的研究。在STARE、CHASE-DB、VAMPIRE FA、IOSTAR SLO和RC-SLO数据集中,所提出的框架在准确率方面均优于所有最先进的方法,分别为98.27%、97.78%、98.34%、98.04%和98.35%。
cs.CV / 27 / 2605.20459
Pixel Wised Lesion Prediction on COVID-19 CT Imagery: A Comparative Analysis of Automated Image Segmentation Architectures
基于像素的COVID-19 CT影像病灶预测:自动化图像分割架构的比较分析
Abstract
In recent years, there has been a notable increase in the level of attention that is given to algorithms based on deep learning in the context of medical image segmentation. Nevertheless, the reliability of the field has been hindered due to the absence of a standardized methodology for performance analysis and the utilization of different datasets in previous research. The primary objective of the research is to comprehensively evaluate contemporary segmentation frameworks combined with state-of-the-art pre-trained backbones in order to accurately predict COVID-19 lesions in CT images. Moreover, this evaluation can serve as a point of reference for the segmentation of images in various other imaging scenarios. In order to accomplish this, we integrate four distinct deep learning architectures, namely Unet, PSPNet, Linknet, and FPN, with six pre-trained encoders, including VGG 19, DenseNet 121, Inception ResNet V2, MobileNet V2, SeresNet 101, and EfficientNet B0. This approach enables the development of diverse testing architectures. In the context of image segmentation, our research encompassed both binary and multi-class experimentation. The findings derived from our analysis of three distinct COVID-19 CT segmentation datasets indicate that deep learning architectures yield precise and efficient segmentation outcomes. Significantly, a maximum F1-Score of 98% was attained for binary class segmentation, while multi-class segmentation yielded F1-Scores of 75% and 77% across two separate datasets. The utilization of artificial intelligence and deep learning enhances the diagnostic process for pandemic diseases across multiple dimensions.
Chinese Translation
近年来,基于深度学习的算法在医学图像分割领域获得了显著关注。然而,由于缺乏标准化的性能分析方法以及以往研究中使用不同数据集的情况,该领域的可靠性受到了一定影响。本研究的主要目标是全面评估现代分割框架与最先进的预训练骨干网络的结合,以准确预测CT图像中的COVID-19病灶。此外,这一评估可以作为其他成像场景中图像分割的参考点。为此,我们将四种不同的深度学习架构(即Unet、PSPNet、Linknet和FPN)与六种预训练编码器(包括VGG 19、DenseNet 121、Inception ResNet V2、MobileNet V2、SeresNet 101和EfficientNet B0)相结合。这种方法使得我们能够开发多样化的测试架构。在图像分割的背景下,我们的研究涵盖了二分类和多分类的实验。通过对三个不同的COVID-19 CT分割数据集的分析结果表明,深度学习架构能够产生精确且高效的分割结果。值得注意的是,在二分类分割中,最大F1得分达到了98%,而在多分类分割中,两个不同数据集的F1得分分别为75%和77%。人工智能和深度学习的应用在多个维度上提升了对流行病的诊断过程。
cs.CV / 28 / 2605.20461
Understanding Model Behavior in Monocular Polyp Sizing
单目息肉大小理解模型行为
Abstract
Accurate polyp size stratification guides surveillance decisions, with lesions larger than 5 mm typically requiring closer follow-up. However, monocular colonoscopy lacks a reliable metric reference. We present a diagnostic audit of binary polyp size classification (<=5 mm vs. >5 mm) across multiple public multi-center datasets, model families, and patient-stratified cross-validation. Across architectures and input modalities, including RGB appearance, relative depth, and photometry, model performance is moderately consistent, suggesting reliance on cues correlated with examination behavior rather than true metric scales. By providing ground-truth scale at varying granularities, we quantify the potential improvement from perfect scale information and show that current depth estimation and global calibration offer limited gains. We further demonstrate that segmentation errors under distribution shift eliminate most of this potential, with oracle scale under predicted masks recovering only baseline performance. These results highlight metric scale and mask robustness as two independent bottlenecks and provide reusable evaluation tools such as oracle scale ladders, shortcut partitions, and mask substitution for auditing future polyp sizing pipelines. Our code is publicly accessible at https://github.com/anaxqx/polyp-sizing-audit.
Chinese Translation
准确的息肉大小分层指导监测决策,通常大于5毫米的病变需要更密切的随访。然而,单目结肠镜检查缺乏可靠的度量参考。我们对多个公共多中心数据集、模型家族和患者分层交叉验证中的二元息肉大小分类(<=5毫米与>5毫米)进行了诊断审计。在不同的架构和输入模式下,包括RGB外观、相对深度和光度,模型性能中等一致,表明其依赖于与检查行为相关的线索,而非真实的度量尺度。通过在不同粒度下提供真实尺度,我们量化了完美尺度信息可能带来的改进,并显示当前的深度估计和全局校准提供的收益有限。我们进一步证明,在分布转移下的分割错误消除了大部分潜在收益,而在预测掩膜下的神谕尺度仅恢复了基线性能。这些结果突显了度量尺度和掩膜鲁棒性作为两个独立的瓶颈,并提供了可重用的评估工具,如神谕尺度梯度、快捷分区和掩膜替代,以审计未来的息肉大小管道。我们的代码可在 https://github.com/anaxqx/polyp-sizing-audit 公开获取。
cs.CV / 29 / 2605.20469
HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation
HalluCXR:评估和减轻胸部放射影像解读中医学视觉语言模型的幻觉现象
Abstract
Vision-language models (VLMs) are increasingly used for medical image interpretation, yet they frequently hallucinate, generating clinically plausible but factually incorrect findings that pose direct patient safety risks. We introduce HalluCXR, a benchmark evaluating six architecturally diverse VLMs across 856 stratified MIMIC-CXR chest radiographs and three query types, yielding 15,408 model evaluations. An eight-category hallucination taxonomy with clinical severity ratings and a two-layer detection pipeline are validated against 250 human annotations (auto-detection F1=0.959; LLM judge F1=0.907). We find that 61.9--82.3% of outputs contain hallucinations, with clinically dangerous errors in up to 80.2%. Three key patterns emerge: normal radiographs paradoxically attract the most severe hallucinations, common findings are systematically over-fabricated while rare findings go under-detected, and response length alone predicts hallucination risk (AUC up to 0.908). A six-model ensemble reduces fabrication by up to 84.8% at the cost of increased omission; a three-model subset retains comparable performance at half the cost. These results establish that hallucination auditing, verbosity-based risk monitoring, and ensemble-based safety layers are prerequisites for clinical deployment.
Chinese Translation
视觉语言模型(VLMs)在医学图像解读中的应用日益增多,但它们经常产生幻觉,生成临床上看似合理但事实不正确的发现,这对患者安全构成直接风险。我们提出了HalluCXR,一个基准评估六种架构多样的VLMs,在856个分层的MIMIC-CXR胸部放射影像和三种查询类型上进行评估,产生了15,408次模型评估。我们验证了一个包含临床严重性评级的八类幻觉分类法和一个两层检测管道,基于250个人工标注(自动检测F1=0.959;LLM评判F1=0.907)。我们发现61.9%至82.3%的输出包含幻觉,其中高达80.2%的输出存在临床危险错误。出现了三个关键模式:正常的放射影像反而吸引了最严重的幻觉,常见发现被系统性地过度虚构,而稀有发现则被低估,仅响应长度就能预测幻觉风险(AUC高达0.908)。一个六模型集成在增加遗漏的代价下将虚构减少了高达84.8%;一个三模型子集在成本减半的情况下保持了可比的性能。这些结果表明,幻觉审计、基于冗长的风险监测和基于集成的安全层是临床部署的前提条件。
cs.CV / 30 / 2605.20470
EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis
EPC-3D-Diff:等变物理一致条件下的3D潜在扩散用于CBCT到CT的合成
Abstract
Cone-beam CT (CBCT) is routinely acquired during radiotherapy for patient setup, but its quantitative reliability is degraded by scatter, noise, and reconstruction artifacts, limiting Hounsfield Unit (HU) accuracy. We propose EPC-3D-Diff, a novel conditional 3D latent diffusion framework for volumetric CBCT to CT synthesis that introduces a projection domain equivariance loss derived from acquisition physics. Unlike common image domain equivariance, we exploit the fact that an in plane rotation of the volume corresponds to an angular shift in its projections. During training, we enforce this relationship by forward projecting rotated synthesized CT volumes and matching them to appropriately angle shifted projections of the paired target CT, yielding a physics consistent equivariance constraint integrated into the diffusion objective. To capture full 3D context efficiently, conditional diffusion is performed in a compact latent space learnt by a lightweight 3D autoencoder, preserving axial depth while downsampling in plane resolution for stable training. We validate on a paired head CBCT/CT phantom dataset, including repeat scans, and paired clinical data using patient wise splits, and perform single and mixed domain training, ablations, and comparisons with diffusion and CycleGAN. EPC-3D-Diff generalizes well and achieved substantial improvements, +7.4 dB (phantom) and +1.8 dB (clinical data) in PSNR compared to state of the art methods, alongside improved SSIM and HU accuracy, within tissue boundaries. Overall, EPC-3D-Diff improves robustness and physics consistency, supporting HU aware synthesis for downstream radiotherapy workflows.
Chinese Translation
锥束计算机断层扫描(CBCT)在放射治疗中常用于患者定位,但其定量可靠性受到散射、噪声和重建伪影的影响,从而限制了Hounsfield单位(HU)的准确性。我们提出了EPC-3D-Diff,这是一种新颖的条件3D潜在扩散框架,用于体积CBCT到CT的合成,引入了基于获取物理学的投影域等变损失。与常见的图像域等变性不同,我们利用了体积的平面旋转对应于其投影的角度偏移这一事实。在训练过程中,我们通过前向投影旋转后的合成CT体积,并将其与配对目标CT的适当角度偏移投影进行匹配,从而强制执行这种关系,形成集成到扩散目标中的物理一致性等变约束。为了高效捕捉完整的3D上下文,条件扩散在一个由轻量级3D自编码器学习的紧凑潜在空间中进行,保持轴向深度的同时在平面分辨率上进行下采样,以实现稳定的训练。我们在配对的头部CBCT/CT幻影数据集上进行了验证,包括重复扫描,以及使用患者分割的配对临床数据,并进行了单域和混合域训练、消融实验以及与扩散和CycleGAN的比较。EPC-3D-Diff具有良好的泛化能力,与最先进的方法相比,在PSNR上取得了显著的改进,幻影数据提高了7.4 dB,临床数据提高了1.8 dB,同时在组织边界内改善了SSIM和HU准确性。总体而言,EPC-3D-Diff提高了鲁棒性和物理一致性,支持HU感知的合成,以便于后续的放射治疗工作流程。
cs.CV / 31 / 2605.20476
Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation
告别漂移:用于长时间视频到视频生成的锚定树采样
Abstract
Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues which manifest as object permanence issues, or improperly rendering transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce \textbf{Anchored Tree Sampling (ATS)}: a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from $K$ sequential rollout steps to $L+1$ tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the \emph{static-camera} regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan $2.1$ $+$ VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable $\geq 40$-minute generation on LTX-$2.3$ across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.
Chinese Translation
长时间视频生成面临两个交织的问题。首先是漂移,视频质量随着时间的推移而下降。其次是连续性问题,表现为物体持久性问题,或不正确地渲染瞬态内容(例如,在不连续帧中出现的物体颜色/风格变化)。近期的研究集中于自回归蒸馏技术,同时解决这两个问题。我们选择直接关注漂移,并引入 extbf{锚定树采样(Anchored Tree Sampling, ATS)}:一种无训练的推理时间调度器,用稀疏到密集、锚定限制的插补替代从左到右的展开,组织成树形结构。根调用在整个时间范围内生成稀疏锚点,递归细化生成中间锚点,最终叶子跨度在相邻锚点之间合成。这将关键路径从$K$个顺序展开步骤减少到$L+1$个树形层次步骤,并将时间范围复合漂移转化为锚定限制漂移。我们专注于 extit{静态摄像机}模式下的V2V生成,其中时间范围内的稀疏锚点可以很好地通过密集条件信号进行近似,基础模型可以在不重新训练的情况下生成这些锚点。我们在Wan $2.1$ $+$ VACE上对ATS与两个当代自回归基线进行了评估,涵盖五种条件模式(修复、外绘、边缘、姿态、深度)。我们展示了ATS在整体质量和漂移防止方面均优于两个竞争者。此外,我们还展示了在LTX-$2.3$上进行稳定的$ ext{≥}40$分钟生成,涵盖相同的五种模式。最后,我们提出了一条前进的路径,以将ATS扩展到任意长的T2V生成,以及动态摄像机和多镜头模式。
cs.CV / 32 / 2605.20479
Oracle Supervision Transfers for Hyperparameter Prediction in Model-Based Image Denoising
用于模型基础图像去噪的超参数预测的Oracle监督转移
Abstract
Hyperparameter prediction is a critical practical bottleneck for model-based image denoisers, ranging from classical TV/TGV variational solvers to modern diffusion-based models such as DiffPIR. While existing learned predictors can achieve near-oracle performance, this approach scales poorly: each new configuration conventionally requires its own oracle-labeled training set, and each label requires a hierarchical grid search evaluated against clean ground truth. We therefore ask whether oracle supervision collected on source configurations can transfer to target configurations with few or no target oracle labels. We propose HyperDn, a single configuration-conditioned predictor that pools oracle supervision across source configurations and predicts heterogeneous hyperparameters for new denoiser--noise configurations. In a cross-paradigm experiment, HyperDn transfers from relatively cheap TV/TGV variational sources to more expensive diffusion-based DiffPIR. With only $2$ target oracle labels, it reaches $30.23$\,dB, within $0.90$\,dB of the oracle, and outperforms the $64$-label per-configuration predictor trained from scratch, using $1/32$ as many target labels as that baseline point. Without any target oracle labels, HyperDn also reaches near-oracle PSNR on two unseen mixtures of seen noise types and on transfer from relatively cheap $96\times 96$ source images to $512\times 768$ targets. Together, these results show that expensive oracle supervision for hyperparameter prediction can be transferred from source to new target configurations, reducing the need to rebuild oracle labels for each new denoising configuration.
Chinese Translation
超参数预测是模型基础图像去噪器的一个关键实际瓶颈,涵盖了从经典的TV/TGV变分求解器到现代基于扩散的模型(如DiffPIR)。虽然现有的学习预测器可以实现接近Oracle的性能,但这种方法的扩展性较差:每个新配置通常需要其自己的Oracle标记训练集,每个标签都需要针对干净的真实值进行层次网格搜索评估。因此,我们提出一个问题:在源配置上收集的Oracle监督是否可以转移到目标配置上,而无需或仅需少量目标Oracle标签。我们提出了HyperDn,这是一种单一配置条件的预测器,能够跨源配置汇聚Oracle监督,并为新的去噪器-噪声配置预测异构超参数。在一个跨范式实验中,HyperDn从相对便宜的TV/TGV变分源转移到更昂贵的基于扩散的DiffPIR。仅用$2$个目标Oracle标签,它达到了$30.23$ dB,距离Oracle仅$0.90$ dB,并且在使用$1/32$的目标标签数量的情况下,超越了从头训练的每配置$64$标签预测器。在没有任何目标Oracle标签的情况下,HyperDn在两个未见的噪声类型混合和从相对便宜的$96 imes 96$源图像转移到$512 imes 768$目标时,也达到了接近Oracle的PSNR。这些结果表明,超参数预测所需的昂贵Oracle监督可以从源配置转移到新的目标配置,从而减少为每个新的去噪配置重建Oracle标签的需求。
cs.CV / 33 / 2605.20495
A Human-in-the-Loop Framework for Efficient Prompt Selection in Microscopy Vision-Language Models
一种人机协作框架用于显微镜视觉-语言模型中高效提示选择
Abstract
Deep-learning pipelines for microscopy image classification often require expensive, labor- and time-intensive expert annotation to produce high-quality ground truth for training. Recent work has shown that prompt tuning of vision-language models (VLMs) can reduce manual annotation by constructing a small prompt set of expert-verified image-caption exemplars that is reused as few-shot context to classify all remaining images at inference time. To further reduce effort, the VLM can draft captions for candidate exemplars, which experts then verify and lightly edit instead of writing text de novo. However, two practical questions remain unaddressed: (1) which unlabeled images should be prioritized for verification, and (2) how many verified exemplars are needed to reach a performance target. In this work, we address these questions by formulating prompt-set construction as a target-driven active learning problem that prioritizes which images to annotate. We study three complementary selection criteria under strict low-resource constraints with small unlabeled pools. Experiments show that our methods reach the target performance with substantially fewer expert-verified images than random selection, achieving 100% test accuracy with as few as 20 annotated images on average. More broadly, our human-in-the-loop framework demonstrates a human-centered use of generative AI in biomedical image analysis, where experts remain actively involved in verifying and refining model output while significantly reducing annotation cost. Code and data will be publicly available.
Chinese Translation
显微镜图像分类的深度学习流程通常需要昂贵且耗时的专家标注,以生成高质量的训练真值。近期研究表明,视觉-语言模型(VLMs)的提示调优可以通过构建一小组经过专家验证的图像-标题示例,作为少量示例上下文来减少人工标注,从而在推理时对所有剩余图像进行分类。为了进一步减少工作量,VLM可以为候选示例草拟标题,专家随后对其进行验证和轻微编辑,而不是从头撰写文本。然而,仍然存在两个实际问题未得到解决:(1)应优先验证哪些未标记图像,以及(2)需要多少经过验证的示例才能达到性能目标。在本研究中,我们通过将提示集构建形式化为一个以目标驱动的主动学习问题来解决这些问题,从而优先考虑哪些图像进行标注。我们在严格的低资源约束下研究了三种互补的选择标准,使用小规模的未标记图像池。实验表明,我们的方法在专家验证的图像数量远少于随机选择的情况下达到了目标性能,平均只需20个标注图像便实现了100%的测试准确率。更广泛地说,我们的人机协作框架展示了生成性人工智能在生物医学图像分析中的以人为本的应用,专家在验证和完善模型输出的同时,显著降低了标注成本。代码和数据将公开发布。
cs.CV / 34 / 2605.20510
ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society
ShadeBench:可持续社会建筑阴影模拟的基准数据集
Abstract
Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban buildings, strongly influence pedestrians' thermal exposure and outdoor activity planning. However, accurately modeling and analyzing urban shade at scale remains difficult because of the lack of large-scale datasets and systematic evaluation frameworks. To address this challenge, we present ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. ShadeBench contains geographically diverse urban scenes with temporally varying simulated shade maps and textual descriptions, together with aligned satellite imagery, building skeleton representations, and 3D building meshes. Built upon this multimodal dataset, ShadeBench supports a range of downstream tasks, including shade generation, shade segmentation, and 3D building reconstruction. We further establish standardized evaluation protocols and baseline methods for these tasks. By enabling scalable and fine-grained shade analysis, ShadeBench provides a foundation for data-driven urban climate research and supports future studies in heat-resilient urban planning and decision-making. The code and dataset are publicly available at https://darl-genai.github.io/shadebench/.
Chinese Translation
城市热暴露正成为一个日益严峻的挑战,这主要是由于城市热岛效应的加剧。细粒度的阴影模式,尤其是由城市建筑引起的阴影,强烈影响行人的热暴露和户外活动规划。然而,由于缺乏大规模数据集和系统评估框架,准确建模和分析城市阴影在大规模上仍然困难。为了解决这一挑战,我们提出了ShadeBench,这是一个全面的数据集和基准,用于城市阴影理解。ShadeBench包含地理多样的城市场景,具有时间变化的模拟阴影图和文本描述,以及对齐的卫星影像、建筑骨架表示和3D建筑网格。基于这一多模态数据集,ShadeBench支持一系列下游任务,包括阴影生成、阴影分割和3D建筑重建。我们进一步建立了这些任务的标准化评估协议和基线方法。通过实现可扩展和细粒度的阴影分析,ShadeBench为数据驱动的城市气候研究提供了基础,并支持未来在抗热城市规划和决策中的研究。代码和数据集可在 https://darl-genai.github.io/shadebench/ 获取。
cs.CV / 35 / 2605.20525
NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding
NeuroQA:一个大规模的基于图像的3D脑MRI理解基准
Abstract
We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from $>$80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.
Chinese Translation
我们提出了NeuroQA,这是一个用于3D脑磁共振成像(MRI)视觉问答的大规模基准,包含来自12,977个受试者的56,953对问答对,涵盖12个数据集。该数据集涵盖了5至104岁的年龄段以及五个临床领域:阿尔茨海默病、帕金森病、肿瘤、白质病和神经发育。与以往基于2D切片或依赖狭窄诊断标签的医学视觉问答(VQA)工作不同,NeuroQA将每个项目与完整的3D体积配对。它评估了11种临床基础的推理技能,涵盖是/否、多项选择和开放式格式。在203个模板中,131个是基于图像的(可以从三平面查看器中回答),72个是基于图像信息的(从定量体积测量或临床工具中获得的真实答案)。为了消除仅文本的捷径,我们应用了答案分布精炼,将封闭格式的仅文本准确率从超过80%降低至44.6%;图像必要性通过与基准一起发布的图像基础协议单独评估。一个包含38条规则的确定性流程和两轮专家审查验证每对问答与FreeSurfer测量、元数据或放射学报告字段的一致性,确保在模板之间没有同一受试者的矛盾。我们进行了一次临床评估,两名临床医生独立评估100个冻结测试项目,使用三平面查看器。在封闭格式(是/否 + 多项选择)测试公共项目中,最佳的零样本视觉-语言模型和一个监督的3D CNN基线分别达到47.5%和43.7%的准确率,均低于49.4%的仅文本主要模板底线。NeuroQA采用两级发布,提供公共问答对以供开放访问的数据集,以及为受数据使用协议(DUAs)限制的数据集提供可重复生成脚本,此外还有受试者级拆分、一个保留的私有测试集和一个在线排行榜。
cs.CV / 36 / 2605.20536
HADS-Net:A Hybrid Attention-Augmented Dual-Stream Network with Physics-Informed Augmentation for Breast Ultrasound Image Classification
HADS-Net:一种基于物理信息增强的混合注意力增强双流网络用于乳腺超声图像分类
Abstract
Accurate classification of breast ultrasound images into benign, malignant, and normal categories is a critical clinical task complicated by speckle noise, acoustic shadowing, and inter-class visual ambiguity. Existing deep learning methods rely on single-stream architectures with generic augmentation that ignores ultrasound acquisition physics, and no prior method dedicates a stream to the lesion boundary features identified as the most diagnostically significant visual cue. We propose HADS-Net, a Hybrid Attention-Augmented Dual-Stream Network exploiting global texture and local boundary cues through two parallel pathways. Stream 1 applies physics-informed augmentation simulating speckle noise, acoustic shadowing, and gain variation before extracting features via pretrained EfficientNet-B3 projected to 512 dimensions. Stream 2 extracts Sobel edge maps processed by a lightweight CNN projected to the same 512-dimensional space. A cross-attention fusion module allows the texture stream to selectively query boundary features, producing a jointly optimised representation classified by an MLP trained with adaptive class-weighted focal loss. Five-fold stratified cross-validation with cosine annealing over 50 epochs is used, with the globally best checkpoint selected by lowest validation loss evaluated on a held-out test set. On the BUSI dataset, HADS-Net achieves 96.58% accuracy, macro ROC-AUC of 0.9978, macro F1 of 0.9654, and per-class F1-scores of 0.970, 0.951, and 0.976 for benign, malignant, and normal. No malignant lesion is misclassified as normal. These results confirm that modality-specific augmentation with cross-modal attention fusion is an effective strategy for ultrasound-based breast cancer diagnosis.
Chinese Translation
将乳腺超声图像准确分类为良性、恶性和正常类别是一项关键的临床任务,但受到斑点噪声、声学阴影和类间视觉模糊的影响。现有的深度学习方法依赖于单流架构,并采用通用增强技术,忽视了超声采集物理特性,并且没有先前的方法专门为病变边界特征设置流,而这些特征被认为是最具诊断意义的视觉线索。我们提出了HADS-Net,一种混合注意力增强的双流网络,通过两个并行路径利用全局纹理和局部边界线索。流1应用物理信息增强,模拟斑点噪声、声学阴影和增益变化,然后通过预训练的EfficientNet-B3提取特征,投影到512维空间。流2提取经过轻量级卷积神经网络处理的Sobel边缘图,投影到相同的512维空间。交叉注意力融合模块允许纹理流选择性查询边界特征,生成一个联合优化的表示,由使用自适应类别加权焦点损失训练的多层感知机(MLP)进行分类。采用五折分层交叉验证和50个周期的余弦退火,选择在保留的测试集上评估的最低验证损失的全局最佳检查点。在BUSI数据集上,HADS-Net实现了96.58%的准确率,宏观ROC-AUC为0.9978,宏观F1为0.9654,良性、恶性和正常的每类F1分数分别为0.970、0.951和0.976。没有恶性病变被误分类为正常。这些结果确认了基于模态特定增强与跨模态注意力融合的策略在超声乳腺癌诊断中的有效性。
cs.CV / 37 / 2605.20538
Continual Segmentation under Joint Nonstationarity
联合非平稳下的持续分割
Abstract
Evolving data streams induce joint nonstationarity in continual semantic segmentation, where semantic classes, input distributions, and supervision availability change simultaneously over time. This setting reflects practical structured prediction systems, yet remains largely unexplored in prior continual learning work, which typically studies these factors in isolation. We formalize continual segmentation under coupled class, domain, and label shifts and investigate learning in heterogeneous dense prediction environments with limited annotations and abundant unlabeled data. To address instability and overfitting arising from few-shot supervision under distribution drift, we introduce gradient-adaptive stabilization, a parameter-wise regularization mechanism implemented via gradient-scaled stochastic perturbations that promotes a principled stability-plasticity tradeoff. We further leverage unlabeled data through semi-supervised learning and introduce prototype anchored supervision that validates pseudo-labels via joint confidence and prototype consistency. Together, these mechanisms enable learning under joint nonstationarity in continual segmentation. Extensive empirical evaluation across class-incremental, domain-incremental, and few-shot regimes demonstrates consistent improvements over prior methods in heterogeneous structured prediction settings. Our results expose fundamental failure modes of existing continual segmentation approaches and provide insight into learning robust dense predictors in dynamically evolving environments.
Chinese Translation
不断演变的数据流在持续语义分割中引入了联合非平稳性,其中语义类别、输入分布和监督可用性随着时间的推移而同时变化。这种设置反映了实际的结构化预测系统,但在以往的持续学习研究中仍然 largely 未被探索,后者通常孤立地研究这些因素。我们对伴随类别、领域和标签变化的持续分割进行了形式化,并研究了在有限注释和丰富未标记数据的异构密集预测环境中的学习。为了应对在分布漂移下由少量监督引起的不稳定性和过拟合,我们引入了梯度自适应稳定化,这是一种通过梯度缩放随机扰动实现的参数级正则化机制,促进了稳定性与可塑性之间的原则性权衡。我们进一步通过半监督学习利用未标记数据,并引入原型锚定监督,通过联合置信度和原型一致性验证伪标签。这些机制共同使得在持续分割中能够在联合非平稳性下进行学习。对类别增量、领域增量和少样本模式的广泛实证评估表明,在异构结构化预测设置中,相较于以往方法具有一致的改进。我们的结果揭示了现有持续分割方法的基本失效模式,并为在动态演变环境中学习稳健的密集预测器提供了洞见。
cs.CV / 38 / 2605.20543
Uncertainty-Guided Conservative Propagation for Structured Inference in Vessel Segmentation
基于不确定性引导的保守传播在血管分割中的结构化推理
Abstract
Accurate vessel segmentation is essential for medical image analysis, yet remains challenging due to complex vascular patterns and imaging ambiguity. Most deep models rely on single-pass prediction, limiting their ability to refine uncertain or disconnected regions during inference. To address this limitation, we propose Uncertainty-Guided Conservative Propagation (UGCP), a general plug-in module for vessel segmentation. Instead of directly using a one-shot output as the final prediction, UGCP performs a small number of logit-space update steps to refine the segmentation through local predictions interaction. Predictive uncertainty guides reliable regions to support ambiguous regions, while structure-aware modulation and source-based stabilization reduce unreliable propagation and excessive drift. The module is differentiable and can be trained end-to-end with different segmentation networks. We evaluate UGCP on four public vessel segmentation datasets covering 2D and 3D tasks, including retinal vessel, coronary artery, and cerebral vessel segmentation. Experiments with convolutional neural network-based and Transformer-based backbones show consistent improvements in Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance. Further analysis demonstrates that UGCP reduces vessel disconnections and improves structural consistency with limited additional computation. The code will be made available at https://github.com/chenzhao2023/UGC_PR.
Chinese Translation
准确的血管分割对医学图像分析至关重要,但由于复杂的血管模式和成像模糊性,仍然面临挑战。大多数深度模型依赖于单次预测,限制了它们在推理过程中对不确定或断开的区域进行细化的能力。为了解决这一限制,我们提出了不确定性引导的保守传播(Uncertainty-Guided Conservative Propagation, UGCP),这是一个通用的插件模块,用于血管分割。UGCP并不是直接使用一次性输出作为最终预测,而是通过局部预测交互执行少量的logit空间更新步骤,以细化分割。预测不确定性引导可靠区域支持模糊区域,而结构感知调制和基于源的稳定性减少了不可靠传播和过度漂移。该模块是可微分的,可以与不同的分割网络进行端到端训练。我们在四个公共血管分割数据集上评估了UGCP,这些数据集涵盖了2D和3D任务,包括视网膜血管、冠状动脉和脑血管分割。基于卷积神经网络和Transformer的实验结果显示,在Dice相似系数、中心线Dice和95百分位Hausdorff距离上均有一致的改善。进一步分析表明,UGCP减少了血管断连并提高了结构一致性,同时计算开销有限。代码将发布在 https://github.com/chenzhao2023/UGC_PR。
cs.CV / 39 / 2605.20549
MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space
MAPS:用于在受控三维场景空间中探测视觉模型的合成数据集
Abstract
Modern vision models achieve strong performance on standard benchmarks, yet their aggregate accuracy reveals little about which scene properties drive their predictions. Existing robustness benchmarks provide important stress tests, but typically manipulate global 2D image properties, rely on entangled real-world variation, or cover only a limited set of 3D objects and scene parameters. We introduce MAPS (Manifolds of Artificial Parametric Scenes), a scalable instrument for controlled attribution of vision model behavior to scene parameters. MAPS comprises 2,618 curated photorealistic 3D meshes validated for recognizability across 560 ImageNet classes and provides a Blender-based rendering pipeline for on-demand image generation under continuous variation of nine independent scene-factors spanning background, camera, and lighting, extensible to other factors. To showcase its applicability, we use MAPS to evaluate 20 convolutional and transformer-based models by quantifying their reliance on these scene factors through regression-based sensitivity analysis. We find a near-universal failure axis across all tested architectures: camera distance and elevation consistently dominate recognition failure regardless of ImageNet accuracy. However, the full sensitivity structure reveals that modern CNNs and transformers cluster together, distinct from older architectures, suggesting that fine-grained architectural design choices, rather than the coarse CNN-versus-transformer distinction, are the stronger determinant of sensitivity profiles.
Chinese Translation
现代视觉模型在标准基准测试中表现出色,但其整体准确性对于驱动预测的场景属性揭示的信息有限。现有的鲁棒性基准提供了重要的压力测试,但通常操作全局二维图像属性,依赖于复杂的现实世界变异,或仅涵盖有限的三维物体和场景参数。我们引入了MAPS(人工参数场景流形),这是一个可扩展的工具,用于将视觉模型行为的控制归因于场景参数。MAPS包含2,618个经过策划的逼真三维网格,这些网格在560个ImageNet类别中经过验证以确保可识别性,并提供基于Blender的渲染管道,以便在九个独立场景因素(涵盖背景、相机和照明)连续变化下按需生成图像,且可扩展到其他因素。为了展示其适用性,我们使用MAPS评估20个卷积和基于变换器的模型,通过基于回归的敏感性分析量化它们对这些场景因素的依赖。我们发现所有测试架构中存在一个几乎普遍的失败轴:相机距离和高度始终主导识别失败,无论ImageNet准确性如何。然而,完整的敏感性结构揭示出现代CNN和变换器聚集在一起,与旧架构不同,这表明细粒度的架构设计选择,而不是粗略的CNN与变换器的区分,是敏感性特征的更强决定因素。
cs.CV / 40 / 2605.20551
Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning
更快还是更强:通过加权聚合和令牌修剪实现灵活的视觉地点识别
Abstract
Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.
Chinese Translation
视觉地点识别(VPR)旨在将查询图像与大型数据库中同一地点的参考图像进行匹配。近期的最先进方法采用视觉变换器(Vision Transformers, ViTs)作为基础模型,以提取对视角、光照和季节变化具有鲁棒性的块级特征,这些特征随后被聚合成紧凑的全局描述符以进行检索。大多数现有的聚合方法将块令牌均匀地汇聚到学习到的簇中,尽管不同的簇通常编码不同的空间或语义模式,并对VPR性能的贡献不均衡。为了解决这一局限性,我们提出了加权聚合描述符(Weighted Aggregated Descriptor, WeiAD),在聚合过程中为簇分配权重,从而生成更具区分性的全局表示。除了准确性,检索延迟也是大规模部署和资源受限的边缘设备中的一个关键问题。之前的工作主要通过压缩全局描述符来减少延迟,而忽视了特征提取的成本,这一问题在基于ViT的基础模型中更加严重。因此,我们引入了WeiToP,一个面向VPR的令牌修剪框架,通过自蒸馏减少特征提取成本,其中聚合引起的令牌重要性监督附加在早期变换器层上的轻量级修剪模块,实现推理时的令牌修剪。在一次联合训练阶段后,WeiToP使得在推理时能够即插即用地进行令牌修剪,从而在不需要额外训练的情况下灵活地按需控制准确性与效率的权衡。此外,WeiToP在从一般视觉任务适配的现有令牌修剪方法中表现更优。
cs.CV / 41 / 2605.20569
End-to-End Unmixing with Material Prompts for Hyperspectral Object Tracking
基于材料提示的端到端高光谱物体跟踪解混
Abstract
Hyperspectral imagery encodes rich material properties that can improve tracking robustness under appearance ambiguity, illumination change, and background clutter. However, due to the limited availability of hyperspectral video data, many existing methods adapt pretrained RGB trackers via spatial or channel fusion strategies, largely neglecting the intrinsic material information in hyperspectral imagery. Moreover, the few material-aware approaches typically rely on external spectral unmixing pipelines that are decoupled from the tracking objective, limiting effective optimization of material representations for target localization. To address these limitations, we formulate hyperspectral object tracking as a joint optimization problem of material decomposition and target localization, coupling the two tasks via a weighted target-oriented unmixing loss that explicitly aligns material representations with localization accuracy. Specifically, we propose a material representation decomposition module for deep learning-based spectral unmixing with adaptive frequency decomposition. Building on the decomposed material representations, we further introduce a dual-branch wavelet-enhanced material prompt module that learns low- and high-frequency material prompts through efficient spatial-material interactions in the frequency domain. The framework is model-agnostic and can be seamlessly generalized to different unmixing backbones. Extensive experiments on standard hyperspectral tracking benchmarks demonstrate state-of-the-art performance and validate the effectiveness of the proposed end-to-end material-aware tracking framework. Code is available at https://github.com/han030927/E2EMPT.
Chinese Translation
高光谱图像编码了丰富的材料属性,可以在外观模糊、光照变化和背景杂乱的情况下提高跟踪的鲁棒性。然而,由于高光谱视频数据的有限可用性,许多现有方法通过空间或通道融合策略适应预训练的RGB跟踪器,基本上忽视了高光谱图像中的内在材料信息。此外,少数关注材料的研究通常依赖于与跟踪目标解耦的外部光谱解混管道,限制了材料表示在目标定位中的有效优化。为了解决这些限制,我们将高光谱物体跟踪形式化为材料分解和目标定位的联合优化问题,通过加权的目标导向解混损失将这两项任务耦合在一起,明确地将材料表示与定位精度对齐。具体而言,我们提出了一种用于深度学习基础的光谱解混的材料表示分解模块,采用自适应频率分解。在分解的材料表示基础上,我们进一步引入了一个双分支小波增强材料提示模块,通过频域中的高效空间-材料交互学习低频和高频材料提示。该框架是模型无关的,可以无缝推广到不同的解混骨干网络。在标准高光谱跟踪基准上的大量实验表明了该方法的最先进性能,并验证了所提出的端到端材料感知跟踪框架的有效性。代码可在 https://github.com/han030927/E2EMPT 获取。
cs.CV / 42 / 2605.20576
$\Delta$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
Δynamics:基于语言的刚体动力学推断表示
Abstract
Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, making them unable to generalize to complex real-world settings. We introduce $\Delta$YNAMICS, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, $\Delta$YNAMICS generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, $\Delta$YNAMICS achieves a segmentation IoU of 0.30, a 7x improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Additionally, test-time sampling and evolutionary search further boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of 235 real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.
Chinese Translation
从单目视频中推断刚体物理状态和属性是基于物理的感知和仿真的基础步骤。现有方法假设特定的物理系统、物体类型和相机姿态,使其无法推广到复杂的现实场景中。我们提出了ΔYNAMICS,一个使用语言作为刚体动力学统一表示的视觉-语言框架。ΔYNAMICS并不是直接预测参数,而是以结构化文本格式生成场景配置以进行物理仿真。我们通过整合自然语言运动推理和利用光流作为语义无关输入来增强模型的泛化能力。在CLEVRER数据集上,ΔYNAMICS实现了0.30的分割IoU,相较于领先的视觉语言模型(InternVL3-8B、Qwen2.5-VL-7B和Claude-4-Sonnet)提高了7倍。此外,测试时采样和进化搜索进一步分别提升了27%和120%的分割IoU性能。最后,我们展示了在235个现实世界刚体视频的新数据集上的强转移能力,突显了基于语言的物理推断在连接感知与仿真方面的潜力。
cs.CV / 43 / 2605.20584
QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs
QwenSafe:通过偏好对齐的视觉语言模型进行多模态内容评级描述识别
Abstract
Mobile app marketplaces require developers to disclose standardized content rating descriptors (CRDs) to inform users about potentially sensitive or restricted content. Ensuring the accuracy and consistency of these disclosures remains challenging due to the multimodal nature of app content, which spans textual descriptions and visual interfaces. In this paper, we present QwenSafe, a Vision-Language Model (VLM) designed to automatically identify the presence of Apple-defined CRDs by jointly reasoning over app metadata and screenshots. To enable scalable training for this task, we introduce metadata2CRD, a data-construction pipeline that synthesizes descriptor-aligned question-answer pairs by combining app descriptions, screenshots, and formal descriptor definitions. We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model predictions with descriptor-specific evidence and explanations across visual and textual modalities. We evaluate QwenSafe on 12 Apple-defined content rating descriptors and compare it against state-of-the-art vision-language models, including Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash. QwenSafe consistently outperforms all baselines in binary CRD classification, achieving improvements in positive-class recall of 111.8%, 36.1%, and 2.1%, respectively. Our results demonstrate that descriptor-aware multimodal alignment substantially improves automated content classification and highlights the potential of vision-language models to support scalable and consistent content rating in mobile app marketplaces.
Chinese Translation
移动应用市场要求开发者披露标准化的内容评级描述符(CRDs),以告知用户潜在的敏感或受限内容。然而,由于应用内容的多模态特性,包括文本描述和视觉界面,确保这些披露的准确性和一致性仍然具有挑战性。在本文中,我们提出了QwenSafe,这是一种视觉语言模型(VLM),旨在通过共同推理应用元数据和截图,自动识别苹果定义的CRDs的存在。为了支持该任务的可扩展训练,我们引入了metadata2CRD,这是一种数据构建管道,通过结合应用描述、截图和正式的描述符定义,合成描述符对齐的问题-答案对。我们使用监督微调和直接偏好优化(DPO)对Qwen3-VL-8B进行调整,以使模型预测与视觉和文本模态中的描述符特定证据和解释对齐。我们在12个苹果定义的内容评级描述符上评估了QwenSafe,并与最先进的视觉语言模型进行比较,包括Qwen3-VL、LLaVA-1.6和Gemini-2.5-Flash。QwenSafe在二元CRD分类中始终优于所有基线,分别在正类召回率上实现了111.8%、36.1%和2.1%的提升。我们的结果表明,描述符感知的多模态对齐显著改善了自动内容分类,并突显了视觉语言模型在支持移动应用市场中可扩展和一致的内容评级方面的潜力。
cs.CV / 44 / 2605.20600
Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation
面向头部的键值压缩以实现高效的自回归图像生成
Abstract
Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge lies in identifying the type of each attention head to guide cache compression. We further observe that, within the same layer, each head exhibits consistent attention patterns across token positions, \emph{i.e.}, a head's behavior for early tokens remains consistent with that for later tokens. This insight suggests that head types can be identified during the early stage and reused for KV compression throughout generation. Its advantage is that it requires no additional training or dataset-level statistics and generalizes seamlessly across different inputs. Moreover, we design a Stratified Token Eviction strategy to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.
Chinese Translation
自回归(AR)视觉生成已取得显著的性能,但由于需要缓存先前生成的视觉标记,导致高内存使用和低吞吐量。近期研究表明,仅保留少量缓存标记即可在显著降低内存使用和提高吞吐量的同时维持高质量图像。然而,这些方法为每个注意力头分配固定预算,忽视了注意力头之间的异质性,导致内存分配不理想。在本文中,我们观察到不同层的注意力头表现出多样的注意力模式,其中一些头关注局部邻域,而另一些则捕捉更广泛的上下文依赖。基于这一洞察,我们提出了一种新颖的面向头部的键值(KV)缓存压缩框架,用于自回归图像生成,称为HeadKV,该框架为局部偏向的头分配较小的预算,而为具有更广泛注意力的头分配较大的预算。一个关键挑战在于识别每个注意力头的类型,以指导缓存压缩。我们进一步观察到,在同一层内,每个头在标记位置之间表现出一致的注意力模式,即,一个头在早期标记的行为与其在后期标记的行为保持一致。这一洞察表明,头的类型可以在早期阶段识别,并在整个生成过程中重复用于KV压缩。其优势在于不需要额外的训练或数据集级别的统计,并且能够在不同输入之间无缝泛化。此外,我们设计了一种分层标记驱逐策略,以有效保留长距离信息。大量实验表明其在多个自回归图像生成模型中的有效性。
cs.CV / 45 / 2605.20606
Mind Your Margin and Boundary: Are Your Distilled Datasets Truly Robust?
关注你的边际和边界:你的蒸馏数据集真的稳健吗?
Abstract
Dataset distillation (DD) compresses a large training set into a small synthetic set for efficient training, but most DD methods optimize only clean accuracy and leave robustness uncontrolled. Recent robust DD methods improve robustness, yet they often suffer from a poor accuracy-robustness trade-off because they (i) treat all adversarially perturbed examples uniformly, despite robust risk being dominated by near-zero robust margins, and (ii) do not explicitly increase inter-class separation in the decision boundary where attacks concentrate. We present Contrastive Curriculum for Robust Dataset Distillation (C$^2$R), a framework that couples an attack-aware curriculum with a contrastive robustness objective. From a robust-margin perspective, we derive a perturbation score that approximates each sample's robust hinge, enabling a curriculum that prioritizes the smallest-margin adversaries that most directly drive robust error. In parallel, a class-balanced contrastive robustness loss enforces adversarial invariance while explicitly widening boundary separation across classes. Experiments on CIFAR-10/100, Tiny-ImageNet, and multiple ImageNet-1K subsets under six attacks show that C$^2$R achieves the best robust accuracy, outperforming prior robust DD by $2.8$% on average.
Chinese Translation
数据集蒸馏(Dataset Distillation, DD)将大型训练集压缩为小型合成集,以实现高效训练,但大多数DD方法仅优化干净准确率,而忽视了稳健性。近期的稳健DD方法提高了稳健性,但通常在准确率与稳健性之间存在较差的权衡,因为它们(i)对所有对抗扰动样本采取统一处理,尽管稳健风险主要由近零稳健边际主导,以及(ii)未明确增加决策边界中攻击集中区域的类间分离。我们提出了针对稳健数据集蒸馏的对比课程(Contrastive Curriculum for Robust Dataset Distillation, C$^2$R),这是一个将攻击感知课程与对比稳健性目标相结合的框架。从稳健边际的角度出发,我们推导出一个扰动评分,近似每个样本的稳健铰链,从而实现一个优先考虑最小边际对手的课程,这些对手最直接地驱动稳健错误。同时,类平衡的对比稳健性损失在强制对抗不变性的同时,明确扩大了类间的边界分离。在CIFAR-10/100、Tiny-ImageNet和多个ImageNet-1K子集上进行的六种攻击实验表明,C$^2$R实现了最佳的稳健准确率,平均超越了之前的稳健DD方法2.8%。
cs.CV / 46 / 2605.20610
Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts
超越路由:视觉混合专家中的专家调优与表征特征分析
Abstract
Mixture-of-Experts (MoE) models are often interpreted by analysing which categories are routed to which experts. However, routing alone does not reveal what each expert actually encodes. We train sparsely-gated convolutional MoE models with a contrastive objective on natural images and characterise expert specialisation using tools from visual neuroscience. Extending from gating-level to expert-level analyses, we measure per-expert category separability, and per-expert tuning using the most exciting inputs. Extending from category-level to feature-level explanations, we interpret tuning via semantic dimensions derived from a dataset of human behavioural judgements (THINGS). Finally, we use tuning and representational similarity analysis to assess the stability of expertise-allocation across independent initialisations. We find that an animate-inanimate distinction dominates expert partitioning, apparent from gating through to expert readout, and is stable across independently trained models. Although routing statistics suggest relatively sparse, categorical preferences, expert analyses reveal broader tuning to continuous visual and semantic dimensions that extend beyond category boundaries. Experts exhibit similar category-separability to one another, despite distinct feature tuning, demonstrating the explanatory benefits of moving beyond category-level analyses. Together, these results show that expert specialisation in vision MoEs extends well beyond category routing and is better understood by probing fine-grained expert-level tuning and representational structure.
Chinese Translation
混合专家(MoE)模型通常通过分析哪些类别被路由到哪些专家来进行解释。然而,仅仅依靠路由并不能揭示每个专家实际编码的内容。我们在自然图像上训练了稀疏门控卷积MoE模型,并使用对比目标对专家专业化进行特征分析,借助视觉神经科学的工具。从门控层分析扩展到专家层分析,我们测量了每个专家的类别可分离性以及使用最具吸引力的输入进行的每个专家的调优。从类别层面的解释扩展到特征层面,我们通过从人类行为判断数据集(THINGS)中衍生的语义维度来解释调优。最后,我们使用调优和表征相似性分析来评估在独立初始化下专业分配的稳定性。我们发现,生物与非生物的区分主导了专家的划分,这一现象从门控到专家输出均可见,并且在独立训练的模型中保持稳定。尽管路由统计数据表明相对稀疏的类别偏好,但专家分析揭示了对超越类别边界的连续视觉和语义维度的更广泛调优。尽管特征调优各异,专家之间的类别可分离性相似,展示了超越类别层面分析的解释优势。综上所述,这些结果表明,视觉MoEs中的专家专业化远远超出了类别路由,更好地通过深入探讨细粒度的专家调优和表征结构来理解。
cs.CV / 47 / 2605.20624
Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models
利用自回归扩散模型加速视频逆问题求解器
Abstract
Diffusion models provide powerful priors for zero-shot video inverse problems, but their real-time deployment is hindered by two inefficiencies: high initial latency caused by holistic video restoration, and low throughput resulting from multiple VAE passes to enforce measurement consistency in pixel space. To overcome these limitations, we propose Autoregressive Video Inverse problem Solver (AVIS). The AVIS framework leverages autoregressive video diffusion models to restore videos in a streaming manner, naturally eliminating latency bottlenecks. Specifically, AVIS initializes reverse diffusion with a measurement-consistent estimate, reducing the required sampling steps. Compared to leading non-autoregressive solvers, AVIS drastically reduces initial latency from 114s to 4s and increases throughput from 0.71 to 1.18 FPS while achieving superior restoration quality. We further introduce a highly accelerated variant, dubbed AVIS Flash, that enforces measurement consistency solely on the first chunk. AVIS Flash substantially boosts throughput to 5.91 FPS on a single RTX 4090 GPU while maintaining competitive performance and achieving a favorable efficiency-performance trade-off, paving the way toward real-time deployment.
Chinese Translation
扩散模型为零样本视频逆问题提供了强大的先验,但其实时部署受到两个低效因素的制约:由于整体视频恢复导致的高初始延迟,以及由于多次变分自编码器(VAE)传递以在像素空间中强制测量一致性而导致的低吞吐量。为了解决这些限制,我们提出了自回归视频逆问题求解器(AVIS)。AVIS框架利用自回归视频扩散模型以流式方式恢复视频,自然消除了延迟瓶颈。具体而言,AVIS以测量一致的估计初始化反向扩散,减少所需的采样步骤。与领先的非自回归求解器相比,AVIS将初始延迟从114秒大幅减少到4秒,同时将吞吐量从0.71提高到1.18帧每秒(FPS),并实现了更优的恢复质量。我们进一步介绍了一种高度加速的变体,称为AVIS Flash,它仅在第一个数据块上强制测量一致性。AVIS Flash在单个RTX 4090 GPU上将吞吐量大幅提升至5.91 FPS,同时保持竞争力的性能,并实现了良好的效率与性能的权衡,为实时部署铺平了道路。
cs.CV / 48 / 2605.20640
Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics
帕累托增强的人物肖像生成:面向对齐、真实感和美学的视觉对齐文本监督
Abstract
Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one another. Supervised Fine-Tuning (SFT) is an effective method for enhancing the photorealism of image generation. However, it often leads to overfitting to the training dataset, corrupts pre-trained image priors, and degrades alignment or aesthetics. To break this bottleneck, we propose a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT). Specifically, we introduce a lightweight cross-modal alignment mechanism that implicitly extracts multi-granularity vision-aligned text representations from SigLIP 2 and applies supervision to the image branch of MM-DiT during the training stage, with zero extra inference overhead. Our method injects vision-aligned text guidance while preserving the base model's original generalization, avoiding degradation caused by SFT. Furthermore, our method directly mines implicit multi-granularity aesthetic signals from pre-trained vision foundation models to optimize human-perceived aesthetics. Extensive experiments on MM-DiTs show that our method pushes the Pareto frontier and achieves synergistic improvements across text-image alignment, photorealism, and human-perceived aesthetics.
Chinese Translation
文本到图像扩散模型在生成人物肖像时常常面临严重的三难困境:文本与图像的对齐、照片真实感和人类感知的美学彼此之间固有地相互抑制。监督微调(Supervised Fine-Tuning, SFT)是一种有效的方法,可以增强图像生成的照片真实感。然而,它往往导致对训练数据集的过拟合,破坏预训练图像先验,并降低对齐或美学。为了打破这一瓶颈,我们提出了一种多模态扩散变换器(Multimodal Diffusion Transformers, MM-DiT)的特征监督范式。具体而言,我们引入了一种轻量级的跨模态对齐机制,该机制隐式提取来自SigLIP 2的多粒度视觉对齐文本表示,并在训练阶段对MM-DiT的图像分支施加监督,且没有额外的推理开销。我们的方法在保持基础模型原始泛化能力的同时注入视觉对齐文本指导,避免了SFT带来的退化。此外,我们的方法直接从预训练的视觉基础模型中挖掘隐式的多粒度美学信号,以优化人类感知的美学。在MM-DiT上的大量实验表明,我们的方法推动了帕累托前沿,并在文本与图像的对齐、照片真实感和人类感知的美学方面实现了协同改进。
cs.CV / 49 / 2605.20645
Seeing Through Fog: Towards Fog-Invariant Action Recognition
穿透雾霭:朝着雾不变的动作识别迈进
Abstract
Foggy conditions are commonly encountered in real-world applications; however, existing action recognition approaches typically assume favorable weather and high-quality video inputs. On foggy days, unpredictable visibility degradation and reduced contrast obstruct the extraction of semantic cues, posing significant challenges for current action recognition methods. In this paper, we mitigate the issues faced in action recognition under foggy conditions by employing two strategies. First, we present FogAct, the first benchmark dataset for foggy action recognition, consisting of paired clean and foggy videos captured with a stereo camera system. The dataset spans 10 scenes and 55 action categories, comprising nearly 10,000 video clips. Second, we propose FogNet, a two-stream CLIP model that discovers fog-invariant semantic information hidden behind the degraded videos. FogNet learns robust representations of foggy videos with guidance from clean videos, effectively capturing shared structural and motion cues between clean and foggy videos. Extensive experiments on FogAct and three other popular datasets demonstrate that our method achieves competitive performance compared with state-of-the-art (SOTA) approaches. Our FogAct and FogNet are given in our project page.
Chinese Translation
雾霭条件在现实应用中常常遇到;然而,现有的动作识别方法通常假设天气良好且视频输入质量高。在雾天,不可预测的能见度下降和对比度降低阻碍了语义线索的提取,给当前的动作识别方法带来了重大挑战。本文通过采用两种策略来缓解雾霭条件下动作识别面临的问题。首先,我们提出了FogAct,这是第一个用于雾霭动作识别的基准数据集,包含使用立体摄像系统拍摄的干净和雾霭视频的配对。该数据集涵盖10个场景和55个动作类别,包含近10,000个视频片段。其次,我们提出了FogNet,一个双流CLIP模型,能够发现隐藏在降质视频背后的雾不变语义信息。FogNet在干净视频的指导下学习雾霭视频的鲁棒表示,有效捕捉干净视频和雾霭视频之间的共享结构和运动线索。我们在FogAct和其他三个流行数据集上的广泛实验表明,我们的方法与最先进的(SOTA)方法相比,取得了竞争力的性能。我们的FogAct和FogNet可在我们的项目页面上找到。
cs.CV / 50 / 2605.20651
Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation
深入细节:用于OCTA视网膜血管分割的局部敏感增强
Abstract
Existing deep learning frameworks for Optical Coherence Tomography Angiography (OCTA) vessel segmentation are largely derived from the U-Net architecture, which serves as the foundation for most current designs. However, most of these methods focus only on holistic representation, struggling to address the problem of low local contrast unique to OCTA, which leads to vessel discontinuities and loss of detail. To address these problems, we propose LSENet, which builds upon the U-Net architecture by introducing three core innovative modules: To address vessel discontinuities, we introduce the Patch Information Enhance module (PIE), which replaces standard skip connections to execute patch-wise attention. To mitigate detail loss, the Multiscale Feature Fusion module (MFF) is proposed to feed the PIE module rich, multi-scale information by extracting visually interpretable features from both the original input and preceding layers. Finally, the Connectivity Refinement Decoder (CRD) is designed to refine features from all levels and utilize a large kernel in the final convolutional layer to reduce fragmentation. Experiments on three public datasets (OCTA-500, ROSE-1, and ROSSA) demonstrate that our proposed LSENet achieves state-of-the-art performance while requiring fewer parameters.
Chinese Translation
现有的光学相干断层扫描血管成像(Optical Coherence Tomography Angiography, OCTA)血管分割深度学习框架主要源自U-Net架构,该架构是当前大多数设计的基础。然而,这些方法大多仅关注整体表示,难以解决OCTA特有的低局部对比度问题,这导致血管不连续和细节丢失。为了解决这些问题,我们提出了LSENet,该网络在U-Net架构的基础上引入了三个核心创新模块:为了解决血管不连续性,我们引入了补丁信息增强模块(Patch Information Enhance, PIE),该模块替代了标准跳跃连接,执行补丁级注意力。为了减轻细节丢失,我们提出了多尺度特征融合模块(Multiscale Feature Fusion, MFF),通过从原始输入和前一层提取可视化可解释特征,为PIE模块提供丰富的多尺度信息。最后,连接性精炼解码器(Connectivity Refinement Decoder, CRD)旨在精炼来自各个层次的特征,并在最终卷积层中利用大卷积核以减少碎片化。在三个公共数据集(OCTA-500、ROSE-1和ROSSA)上的实验表明,我们提出的LSENet在参数更少的情况下实现了最先进的性能。
cs.CV / 51 / 2605.20659
RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers
RoPeSLR:基于3D RoPE的稀疏低秩注意力框架以提高扩散变换器的效率
Abstract
Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose \textbf{RoPeSLR}, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extreme low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90\% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3\% average VBench degradation).
Chinese Translation
扩散变换器(DiTs)已经彻底改变了高保真视频生成,但其$ extmath{O}(L^2)$的注意力复杂度对长序列合成构成了严峻的瓶颈。尽管近期的稀疏线性注意力混合方法旨在缓解这一问题,但在极端稀疏情况下,它们的性能严重下降,这被称为“RoPE困境”:标准线性注意力无法保持3D旋转位置嵌入(RoPE)的正交相对位置结构,从而削弱了重要的距离感知。为了解决这一问题,我们提出了 extbf{RoPeSLR},一个基于3D RoPE的稀疏低秩注意力框架。我们建立了在经过经验验证的假设下,DiT注意力流形可以解耦为一个高频语义尖峰集(稀疏度受限于$ extmath{O}(L^{3/2})$)和一个极低秩($ extmath{O}(d_h ext{log} L)$)背景连续体。在这一结构先验的指导下,RoPeSLR摒弃了标准线性注意力,采用了头-wise的低秩参数化,并配备了可学习的3D绝对位置嵌入(PE)注入,顺畅地合成长距离相对距离衰减。通过确保次二次稀疏性和次线性秩增长,RoPeSLR特别适合扩展到超长视频推理。广泛的评估验证了这一可扩展的优越性:在90 ext{%}的稀疏度下,RoPeSLR在Wan2.1-1.3B上实现了高达$10 imes$的FLOPs减少,并在超长的100K+标记序列HunyuanVideo-13B上提供了$2.26 imes$的端到端推理加速,同时保持近乎无损的生成保真度(平均VBench降级小于1.3 ext{%})。
cs.CV / 52 / 2605.20667
LER-YOLO: Reliability-Aware Expert Routing for Misaligned RGB-Infrared UAV Detection
LER-YOLO:面向不对齐RGB-红外无人机检测的可靠性感知专家路由
Abstract
Detecting small unmanned aerial vehicles from RGB-infrared remote-sensing pairs remains challenging due to tiny target scale, cluttered backgrounds, and spatial misalignment between heterogeneous sensors. Existing bimodal detectors often align or fuse features without assessing the reliability of local cross-sensor correspondence, allowing mismatch artifacts to propagate into the detection head. To address this issue, we propose LER-YOLO, a reliability-aware sparse mixture-of-experts framework for misaligned RGB-infrared UAV detection. LER-YOLO first introduces an Uncertainty-Aware Target Alignment module that resamples visible features toward the infrared reference and estimates a spatial reliability map. This reliability prior is then used by a Reliability-Guided Sparse MoE Fusion module to adaptively select k experts from RGB-dominant, infrared-dominant, and interactive fusion experts, enabling trustworthy cross-modal interaction while suppressing unreliable fusion. Experiments on the public MBU benchmark under a YOLOv5s-family protocol show that LER-YOLO achieves 89.7+/-0.2% AP50 over three independent seeds, with a best result of 89.9%. Extensive ablations, parameter-matched comparisons, synthetic-shift evaluations, and complexity analysis demonstrate that the gains mainly come from reliability-guided expert routing rather than increased model capacity.
Chinese Translation
从RGB-红外遥感对中检测小型无人机仍然面临挑战,原因包括目标尺度小、背景杂乱以及异构传感器之间的空间不对齐。现有的双模态检测器通常在未评估局部跨传感器对应的可靠性的情况下对特征进行对齐或融合,这导致不匹配的伪影传播到检测头。为了解决这一问题,我们提出了LER-YOLO,一种面向不对齐RGB-红外无人机检测的可靠性感知稀疏专家混合框架。LER-YOLO首先引入了不确定性感知目标对齐模块,该模块将可见特征重新采样至红外参考,并估计空间可靠性图。然后,该可靠性先验被用于可靠性引导的稀疏MoE融合模块,以自适应选择来自RGB主导、红外主导和交互融合专家的k个专家,从而实现可信的跨模态交互,同时抑制不可靠的融合。在YOLOv5s系列协议下的公共MBU基准测试中的实验表明,LER-YOLO在三个独立种子上达到了89.7+/-0.2%的AP50,最佳结果为89.9%。大量的消融实验、参数匹配比较、合成偏移评估和复杂性分析表明,性能提升主要来自于可靠性引导的专家路由,而非模型容量的增加。
cs.CV / 53 / 2605.20669
GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection
GSA-YOLO:通过结构稀疏性和自适应知识蒸馏实现高效实时X射线安检框架
Abstract
X-ray security inspection requires accurate real-time detection of prohibited items, but existing models often struggle to balance the challenges of severe occlusion, complex clutter, and strict speed requirements. To overcome these challenges, this paper proposes GSA-YOLO, a novel lightweight framework built upon the YOLOv8n architecture, specifically engineered to enhance detection robustness and inference efficiency. GSA-YOLO strategically integrates structured sparsity and adaptive knowledge transfer through three core components: Group Lasso (GL) applied to the network neck for robust feature extraction; Sparse Structure Selection (SSS) applied to the detection head for significant model slimming; and an Adaptive Knowledge Distillation (Ada-KD) mechanism for comprehensive accuracy recovery. This integrated approach synergistically enhances feature representation while pruning redundant channels, maximizing model efficiency without sacrificing performance. Rigorous evaluations on the HiXray and PIDray datasets confirm GSA-YOLO's comprehensive capability, achieving a leading inference speed of 189.62 FPS, accompanied by a reduction in computational cost from 8.7G to 8.0G. Crucially, GSA-YOLO secures mAP50:95 results of 0.531 and 0.679 on HiXray and PIDray, demonstrating 2.4% and 1.8% improvements over the baseline, respectively. Compared to other models, GSA-YOLO exhibits enhanced accuracy while maintaining computational efficiency, making it a promising solution for practical X-ray security inspection.
Chinese Translation
X射线安检需要对禁止物品进行准确的实时检测,但现有模型往往难以平衡严重遮挡、复杂杂乱和严格速度要求的挑战。为了解决这些问题,本文提出了GSA-YOLO,这是一种基于YOLOv8n架构的新型轻量级框架,专门设计用于增强检测的鲁棒性和推理效率。GSA-YOLO通过三个核心组件战略性地整合了结构稀疏性和自适应知识转移:应用于网络颈部的Group Lasso (GL)用于稳健特征提取;应用于检测头的Sparse Structure Selection (SSS)用于显著的模型精简;以及自适应知识蒸馏(Adaptive Knowledge Distillation, Ada-KD)机制用于全面的精度恢复。这种综合方法协同增强特征表示,同时修剪冗余通道,最大化模型效率而不牺牲性能。在HiXray和PIDray数据集上的严格评估证实了GSA-YOLO的全面能力,达到了189.62 FPS的领先推理速度,同时计算成本从8.7G降低到8.0G。重要的是,GSA-YOLO在HiXray和PIDray上分别获得了mAP50:95结果为0.531和0.679,较基线分别提高了2.4%和1.8%。与其他模型相比,GSA-YOLO在保持计算效率的同时展现了更高的准确性,使其成为实际X射线安检的有前景的解决方案。
cs.CV / 54 / 2605.20676
VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence
VISTAQA:联合视觉问答与像素级证据的基准评估
Abstract
Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We introduce VISTAQA, a comprehensive benchmark for joint evaluation of free-form answer correctness and pixel-level evidence grounding in visual question answering. VISTAQA comprises 1,157 expert-curated samples spanning six task types and six visual domains, ranging from direct perception to compositional and relational reasoning. VISTAQA requires models to not only answer correctly, but to also provide precise segmentation masks that support their answers. It also includes hallucination-aware examples where no valid visual evidence exists. To support this enhanced evaluation, we introduce GROVE, a unified evaluation metric that enforces joint correctness by combining textual accuracy and grounding quality via a per-sample geometric mean, ensuring neither dimension can compensate for deficiencies in the other. Comprehensive experiments across grounding-aware models and hybrid pipelines with general-purpose MLLMs reveal that even the strongest systems achieve limited performance under GROVE, highlighting a substantial gap between answer accuracy and visual evidence alignment.
Chinese Translation
建立模型预测与支持这些预测的视觉证据之间的清晰联系,对于多模态推理的透明性和可靠性至关重要,然而当前的多模态大型语言模型(MLLM)评估并未明确强制这种对齐。现有基准评估要么孤立地评估文本答案的正确性,要么评估像素级定位,导致推理与基础的耦合仍然是一个开放的挑战。我们提出了VISTAQA,这是一个综合基准,用于在视觉问答中联合评估自由形式答案的正确性和像素级证据的基础。VISTAQA包含1,157个由专家策划的样本,涵盖六种任务类型和六个视觉领域,从直接感知到组合和关系推理。VISTAQA要求模型不仅要正确回答,还要提供支持其答案的精确分割掩码。它还包括幻觉感知的示例,其中不存在有效的视觉证据。为了支持这种增强的评估,我们引入了GROVE,这是一种统一的评估指标,通过结合文本准确性和基础质量的每样本几何平均值,强制执行联合正确性,确保任何一个维度都无法弥补另一个维度的不足。针对基础感知模型和与通用MMLMs结合的混合管道的全面实验表明,即使是最强大的系统在GROVE下的表现也有限,突显了答案准确性与视觉证据对齐之间的显著差距。
cs.CV / 55 / 2605.20680
DarkShake-DVS: Event-based Human Action Recognition under Low-light andShaking Camera Conditions
DarkShake-DVS:低光照和抖动相机条件下的事件驱动人类动作识别
Abstract
Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and lack of effective motion compensation techniques. To address these, we propose Event-IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark that includes 18,041 realworld clips captured in low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.
Chinese Translation
人类动作识别(HAR)是计算机视觉领域的一个基础任务,具有广泛的现实应用。实际部署通常涉及低光照环境和不受约束的六自由度(6-DoF)相机运动,这些条件会降低视觉质量、破坏时间一致性,并影响现有方法的可靠性。事件相机具有高低光敏感性和微秒级时间分辨率,结合惯性测量单元(IMU),提供了一种有前景的解决方案。然而,目前的研究面临两个主要挑战:缺乏一个整合低光照条件、6-DoF运动和同步IMU数据的基准;以及缺乏有效的运动补偿技术。为了解决这些问题,我们提出了事件-IMU稳定化HAR(EIS-HAR),该方法包含两个模块。第一个是EIS模块,通过非线性扭曲函数减少运动模糊,以重建运动补偿输入。第二个是HAR模块,采用四阶段混合架构,以高效提取时空特征,实现准确的动作识别。为了缓解数据稀缺问题,我们引入了DarkShake-DVS,这是第一个大规模的事件驱动HAR基准,包含18,041个在低光照和强烈6-DoF运动下捕获的真实世界片段,并辅以同步的IMU数据。在三个数据集上的大量实验表明,EIS-HAR在性能上始终优于最先进的方法。
cs.CV / 56 / 2605.20682
IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools
IndusAgent:利用代理工具增强开放词汇工业异常检测
Abstract
Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose \textbf{IndusAgent}, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct \textbf{Indus-CoT}, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, thus enabling the agent to actively resolve visual ambiguities and disentangle subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, ensuring that tool invocation occurs only when beneficial. Extensive evaluations on five industrial anomaly benchmarks, including MVTec-AD, VisA, MPDD, DTD, and SDD, demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among all existing methods, validating our robustness and generalization capacity.
Chinese Translation
多模态大语言模型(MLLMs)在视觉感知与文本推理之间架起了显著的桥梁,使其能够在多样化的工业场景中实现零样本理解。然而,它们在开放词汇工业异常检测(IAD)中的表现常常受到领域不匹配推理和虚构结构推断的限制。为了解决这些挑战,我们提出了 extbf{IndusAgent},一个增强工具的代理框架,用于开放词汇IAD。具体而言,我们首先构建了 extbf{Indus-CoT},一个结构化数据集,整合了全球视觉观察、高分辨率局部图像块和专家正常性先验,为模型在严格的工业检查轨迹上进行微调提供监督。在此基础上,IndusAgent动态协调一组外部工具,包括动态区域裁剪、高频特征增强和先验检索,从而使代理能够主动解决视觉模糊并解开细微异常。此外,我们引入了一种门控强化学习目标,联合优化异常分类、定位准确性、异常类型推理和高效工具使用,确保工具调用仅在有利时进行。在五个工业异常基准(包括MVTec-AD、VisA、MPDD、DTD和SDD)上的广泛评估表明,IndusAgent在所有现有方法中实现了最先进的零样本性能,验证了我们的鲁棒性和泛化能力。
cs.CV / 57 / 2605.20708
Rethinking Cross-Layer Information Routing in Diffusion Transformers
重新思考扩散变换器中的跨层信息路由
Abstract
Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (\textsc{DAR}), a drop-in residual replacement that performs \emph{learnable, timestep-adaptive, and non-incremental} aggregation over the history of sublayer outputs. Moreover, the proposed \textsc{DAR} is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet $256\times256$, \textsc{DAR} improves SiT-XL/2 by $2.11$ FID ($7.56$ vs.\ $9.67$) and matches the baseline's converged quality with $8.75\times$ fewer training iterations. Stacked on top of REPA, it yields a $2\times$ training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, \textsc{DAR} can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
Chinese Translation
扩散变换器(Diffusion Transformers, DiTs)已成为现代视觉生成的事实标准骨干,几乎每个设计的主要方面——标记化、注意力、条件、目标和潜在自编码器——都已被广泛重新审视。然而,控制信息在层间累积的残差流直接继承自原始变换器。在本文中,我们对DiTs中的跨层信息流进行了系统的实证分析,联合考虑深度和去噪时间步,并识别出传统残差加法的三个具体症状,即单调的前向幅度膨胀、急剧的后向梯度衰减和明显的块状冗余。基于这一诊断,我们提出了扩散自适应路由(Diffusion-Adaptive Routing, extsc{DAR}),这是一种可替代的残差替换方法,能够对子层输出的历史进行可学习、时间步自适应和非增量的聚合。此外,所提出的 extsc{DAR}与许多现代变换器增强方法(如REPA)兼容。在ImageNet $256 imes256$上, extsc{DAR}使SiT-XL/2的FID提高了$2.11$($7.56$对比$9.67$),并以$8.75 imes$更少的训练迭代匹配基线的收敛质量。在REPA的基础上叠加时,它在早期阶段实现了$2 imes$的训练加速,这表明跨层信息路由是扩散建模中一个尚未充分探索的设计轴,与现有的表示对齐目标正交。除了预训练, extsc{DAR}还可以在大规模T2I模型的微调阶段应用,并在分布匹配蒸馏过程中保留高频细节。
cs.CV / 58 / 2605.20713
SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction
SAVER:按需选择的视觉证据用于多模态信息提取
Abstract
Multimodal IE in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this setting, always-on multimodal fusion wastes computation and can amplify spurious visual cues. The core challenge is to decide, for each candidate span or marked entity pair, whether vision should be consulted at all and, if so, which small subset of images provides trustworthy evidence. We propose SAVER, a selective vision-as-needed framework for multimodal named entity recognition and multimodal relation extraction. SAVER uses a Conformal Groundability Gate (CGG) to estimate span-level visual groundability in MNER, derive pair-level activation in MRE from the two marked entities, and calibrate the activation threshold on a held-out split via a conformal-style procedure with Clopper--Pearson upper bounds. When activated, a submodular relevance--diversity selector chooses a compact evidence subset across images, which is then aggregated by a Set Transformer. An energy-inspired joint scoring head combines text, optional visual evidence, text--image consistency, and sparse routing for entity typing or relation classification. Experiments show that SAVER consistently improves F1 over strong text-only and always-on multimodal baselines, while reducing AURC, increasing activation coverage at a fixed risk level, and lowering FLOPs and P90 latency.
Chinese Translation
社交媒体中的多模态信息提取(IE)面临挑战,因为一条帖子可能附带多个与文本关系较弱、冗余甚至误导性的图像。在这种情况下,始终开启的多模态融合不仅浪费计算资源,还可能放大虚假的视觉线索。核心挑战在于决定对于每个候选跨度或标记实体对,是否需要咨询视觉信息,如果需要,哪些小的图像子集提供可信的证据。我们提出了SAVER,一个按需选择的视觉框架,用于多模态命名实体识别和多模态关系提取。SAVER使用一个符合性可用性门(Conformal Groundability Gate, CGG)来估计多模态命名实体识别(MNER)中的跨度级视觉可用性,从两个标记实体中推导出多模态关系提取(MRE)中的对级激活,并通过符合性风格的程序与Clopper-Pearson上界在保留的拆分上校准激活阈值。当被激活时,一个子模态相关性-多样性选择器在图像中选择一个紧凑的证据子集,然后由集合变换器(Set Transformer)进行聚合。一个受能量启发的联合评分头结合了文本、可选的视觉证据、文本-图像一致性以及稀疏路由,用于实体类型或关系分类。实验表明,SAVER在强文本单一和始终开启的多模态基线之上,始终提高了F1分数,同时降低了AURC,在固定风险水平下增加了激活覆盖率,并减少了FLOPs和P90延迟。
cs.CV / 59 / 2605.20725
Holistic Reliability Propagation: Decoupling Annotation and Prediction for Robust Noisy-Label
整体可靠性传播:解耦注释与预测以应对鲁棒性噪声标签
Abstract
Learning with noisy labels in multimedia classification often combines external annotations and model predictions into a single reliability weight, even though the two sources can fail for different reasons. We instead estimate disentangled reliabilities: bilevel meta-learning produces two batch-normalized scalars per sample, alpha for the given label and beta for the pseudo-label, without constraining them to sum to one. Holistic Reliability Propagation (HRP) then routes them to different objectives, using reliability-aware Mixup with global gating on the input branch and beta-gated pseudo-label positives on the contrastive branch. On synthetic and real-world benchmarks, HRP improves average accuracy over strong baselines and remains competitive at the highest noise rates.
Chinese Translation
在多媒体分类中,使用噪声标签的学习通常将外部注释和模型预测结合为一个单一的可靠性权重,尽管这两种来源可能因不同原因而失效。我们则估计解耦的可靠性:双层元学习为每个样本生成两个批量归一化的标量,分别为给定标签的 alpha 和伪标签的 beta,而不限制它们的和为一。整体可靠性传播(Holistic Reliability Propagation, HRP)随后将它们路由到不同的目标,使用可靠性感知的 Mixup,在输入分支上进行全局门控,并在对比分支上使用 beta 门控的伪标签正样本。在合成和真实世界的基准测试中,HRP 在强基线之上提高了平均准确率,并在最高噪声率下保持竞争力。
cs.CV / 60 / 2605.20727
GAMR: Geometric-Aware Manifold Regularization with Virtual Outlier Synthesis for Learning with Noisy Labels
GAMR:具有虚拟异常值合成的几何感知流形正则化用于处理带噪标签的学习
Abstract
Deep neural networks (DNNs) experience significant performance degradation when processing noisy labels, primarily due to overfitting on mislabeled data. Current mainstream approaches attempt to mitigate this issue by passively filtering clean samples during training. However, simple sample filtering within feature spaces degraded by noise struggles to distinguish between challenging samples and noisy samples, creating a bottleneck for model performance. We highlight for the first time the fundamental importance of actively reshaping feature space geometry for learning from noisy data. We propose a novel Geometry-aware Manifold Regularization Paradigm whose core idea is to explicitly construct energy barriers between data manifolds by actively synthesizing virtual outlier samples. By imposing geometric constraints that promote intra-class compactness and inter-class separation, this approach enhances the discriminability between hard and noisy samples, leading to the learning of more robust representations. Our regularization mechanism exhibits high universality, with effectiveness independent of any prior assumptions about noise patterns. It can be integrated as a standalone mechanism into existing sample selection frameworks, providing stronger robustness against diverse noisy environments. Experiments demonstrate that our paradigm achieves performance surpassing current state-of-the-art (SOTA) methods on multiple benchmarks, including CIFAR-10, with particularly pronounced advantages under more challenging asymmetric noise conditions. Furthermore, this paradigm significantly enhances the model's capability in Out-of-Distribution (OOD) detection, ensuring superior reliability and safety for deployment in open-world scenarios.
Chinese Translation
深度神经网络(DNN)在处理带噪标签时会出现显著的性能下降,主要是由于对错误标记数据的过拟合。目前主流的方法试图通过在训练过程中被动过滤干净样本来缓解这一问题。然而,在噪声影响下的特征空间中简单的样本过滤难以区分具有挑战性的样本和噪声样本,从而成为模型性能的瓶颈。我们首次强调了主动重塑特征空间几何形状在从噪声数据中学习中的基本重要性。我们提出了一种新颖的几何感知流形正则化范式,其核心思想是通过主动合成虚拟异常值样本,明确构建数据流形之间的能量障碍。通过施加促进类内紧凑性和类间分离的几何约束,这种方法增强了困难样本和噪声样本之间的可区分性,从而学习到更鲁棒的表示。我们的正则化机制表现出高度的普适性,其有效性独立于任何关于噪声模式的先验假设。它可以作为独立机制集成到现有的样本选择框架中,为应对多样的噪声环境提供更强的鲁棒性。实验表明,我们的范式在多个基准上实现了超过当前最先进(SOTA)方法的性能,包括CIFAR-10,并在更具挑战性的非对称噪声条件下表现出特别明显的优势。此外,该范式显著增强了模型在分布外(OOD)检测中的能力,确保在开放世界场景中的可靠性和安全性。
cs.CV / 61 / 2605.20728
Early High-Frequency Injection for Geometry-Sensitive OOD Detection
几何敏感的OOD检测的早期高频注入
Abstract
Post-hoc OOD detectors score logits or features after training, so their success depends on the geometry already encoded in the representation. We revisit this assumption through a band-wise MMD^2 analysis across CE, SimCLR, SupCon, and the OOD-oriented representation method PALM. In our diagnostic, low-frequency input bands induce weaker ID/OOD feature discrepancy, whereas higher-frequency bands tend to provide stronger separability. This observation motivates EIHF, an input-side intervention that exposes high-frequency evidence before the first convolution without changing the training objective. EIHF is strongest for geometry-sensitive OOD detection: under matched training and scoring settings, it reshapes class-conditional feature geometry and reduces ID/OOD Mahalanobis score overlap. Experiments on CIFAR-100 and ImageNet-100 show gains on CIFAR-100 and the best average FPR95 with second-best average AUROC on ImageNet-100, while also revealing a limitation on the scene-centric Places shift. Code is available at https://anonymous.4open.science/r/EIHF.
Chinese Translation
事后OOD检测器在训练后对logits或特征进行评分,因此它们的成功依赖于表示中已经编码的几何特征。我们通过对CE、SimCLR、SupCon和面向OOD的表示方法PALM进行带宽MMD^2分析,重新审视了这一假设。在我们的诊断中,低频输入带导致较弱的ID/OOD特征差异,而高频带则倾向于提供更强的可分离性。这一观察促使我们提出了EIHF(Early High-Frequency Injection),这是一种输入侧干预方法,在第一次卷积之前暴露高频证据,而不改变训练目标。EIHF在几何敏感的OOD检测中效果最佳:在匹配的训练和评分设置下,它重塑了类条件特征几何,并减少了ID/OOD马哈拉诺比斯分数的重叠。在CIFAR-100和ImageNet-100上的实验表明,CIFAR-100的性能有所提升,并在ImageNet-100上获得了最佳的平均FPR95和第二好的平均AUROC,同时也揭示了在场景中心的Places迁移上的局限性。代码可在 https://anonymous.4open.science/r/EIHF 获取。
cs.CV / 62 / 2605.20731
TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design
TASTE:一个设计师注释的多维偏好数据集,用于AI生成的图形设计
Abstract
Text-to-image models produce graphic design at production scale, but their supervision comes from photo-style preference data with a single overall verdict per comparison. Designers evaluate along several distinct axes, including typography, visual hierarchy, color harmony, layout, and brief fidelity, and a single label collapses them. We release TASTE (Typography, Aesthetics, Spatial, Tone, Etc.): ten professional designers ranked outputs from four current text-to-image models on nine criteria across two disjoint cohorts, yielding 1,600 ratings per criterion plus per-image hallucination flags on the holistic-preference cohorts. We pair the dataset with three contributions. First, a criterion-agnostic signal test framework, using Kendall's tau, majority probability, and Condorcet cycles against exact iid-uniform nulls at p = 4 and R = 5, places designer agreement on graphic design between food and movie preferences and photo-style image quality, with every TASTE criterion rejecting the random-rater null. Second, no pre-trained system in our benchmark, including six open-weight VLM judges from 3B to 33B parameters and three dedicated T2I scorers, HPSv2.1, PickScore-v1, and LAION-Aesthetic-V2, exceeds 0.55 macro agreement with the 5-designer majority; VLM judges trade off position bias against content sensitivity, so scaling moves along this frontier without improving accuracy. Third, a small pairwise-difference head trained on TASTE reaches 0.611, closing roughly half the gap to the 0.741 single-rater ceiling.
Chinese Translation
文本到图像模型以生产规模生成图形设计,但其监督来自于每次比较仅有一个总体判决的照片风格偏好数据。设计师在多个不同的维度上进行评估,包括排版、视觉层次、色彩和谐、布局和简报忠实度,而单一标签则将这些维度合并。我们发布了TASTE(排版、美学、空间、语调等):十位专业设计师根据九个标准对四个当前文本到图像模型的输出进行排名,分为两个不重叠的群体,每个标准产生1,600个评分,并对整体偏好群体的每张图像标记幻觉标志。我们将数据集与三项贡献相结合。首先,使用Kendall's tau、主要概率和Condorcet循环的标准无关信号测试框架,在p = 4和R = 5的情况下,与精确的独立同分布均匀零假设进行比较,揭示了设计师在图形设计上的一致性,涵盖了食品和电影偏好以及照片风格图像质量,所有TASTE标准均拒绝随机评分者的零假设。其次,在我们的基准测试中,没有任何预训练系统,包括六个开放权重的VLM评审(参数从3B到33B)和三个专门的T2I评分器(HPSv2.1、PickScore-v1和LAION-Aesthetic-V2),与五位设计师的多数意见的宏观一致性超过0.55;VLM评审在位置偏差与内容敏感性之间进行权衡,因此扩展沿着这一边界进行,而不改善准确性。第三,基于TASTE训练的小型成对差异头达到了0.611,缩小了与0.741单评分者上限的差距约一半。
cs.CV / 63 / 2605.20732
Deep Attention Reweighting: Post-Hoc Attention-Based Feature Aggregation in CNNs for Disentangling Core and Spurious Features under Spurious Correlations
深度注意力重加权:在虚假相关性下解耦核心特征和虚假特征的卷积神经网络中的后处理基于注意力的特征聚合
Abstract
Convolutional Neural Networks (CNNs) often exploit spurious correlations in datasets, learning superficially predictive yet causally irrelevant features, leading to poor generalization and fairness issues. Deep Feature Reweighting (DFR) is a post-hoc technique that reduces a trained model's reliance on spurious correlations by retraining its classification head on a target dataset. However, we show that DFR is fundamentally constrained by operating on entangled features, limiting its ability to amplify the core features while simultaneously suppressing the spurious ones. We trace this entanglement to the ubiquitous Global Average Pooling (GAP) layer, which indiscriminately collapses spatially distinct core and spurious features into a single representation. To address this, we propose Deep Attention Reweighting (DAR), a post-hoc attention-based aggregation module that replaces GAP and is retrained jointly with the classification head. DAR computes an adaptive weighting of spatial locations across feature maps, enabling selective suppression of spurious features before the collapse into entangled features. Across various datasets, metrics, and ablations, DAR consistently outperforms DFR, demonstrating that our attention-based aggregation mitigates GAP-induced entanglement and reduces spurious reliance.
Chinese Translation
卷积神经网络(CNN)通常利用数据集中的虚假相关性,学习表面上具有预测能力但因果上无关的特征,从而导致较差的泛化能力和公平性问题。深度特征重加权(Deep Feature Reweighting, DFR)是一种后处理技术,通过在目标数据集上重新训练其分类头,减少训练模型对虚假相关性的依赖。然而,我们表明,DFR在处理纠缠特征时受到根本性限制,限制了其在放大核心特征的同时抑制虚假特征的能力。我们将这种纠缠追溯到普遍存在的全局平均池化(Global Average Pooling, GAP)层,该层不加区分地将空间上不同的核心特征和虚假特征压缩为单一表示。为了解决这个问题,我们提出了深度注意力重加权(Deep Attention Reweighting, DAR),这是一种后处理的基于注意力的聚合模块,替代GAP,并与分类头共同重新训练。DAR计算特征图上空间位置的自适应加权,使得在特征压缩为纠缠特征之前能够选择性地抑制虚假特征。在各种数据集、指标和消融实验中,DAR始终优于DFR,证明我们的基于注意力的聚合减轻了GAP引起的纠缠,并减少了对虚假特征的依赖。
cs.CV / 64 / 2605.20733
Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches
Sketch2MinSurf:基于视觉-语言指导的可编辑最小曲面的手绘草图生成
Abstract
Converting hand-drawn sketches into structured 3D geometries remains challenging due to the difficulty of representing non-Euclidean surfaces and maintaining topological consistency. Existing generative models such as GANs, NeRFs, and diffusion architectures often fail to produce editable manifolds directly usable in downstream design workflows. We present Sketch2MinSurf, a hybrid vision-language and geometric optimization framework that integrates vision-language guidance with minimal-surface theory to generate smooth and editable 3D surfaces from hand-drawn sketches. The core of our approach is a spatial-topological encoding that represents geometry as tuples of node coordinates and real/virtual edge skeletons, enabling stable topological control during generation. We further introduce the Sketch2MinSurf Structural Loss (S2MS-Loss), a reward-modulated objective that jointly constrains geometric reconstruction and topological coherence. On a test set of 100 sketches, Sketch2MinSurf achieves a topological similarity score of 0.844, outperforming existing sketch-to-shape baselines. The generated manifolds are directly editable and free from non-manifold artifacts. A public art installation at a university showcases the method's potential for human-intent-driven 3D form generation. The dataset and code are available at https://anonymous.4open.science/r/Sketch2MinSurf/.
Chinese Translation
将手绘草图转换为结构化的三维几何体仍然具有挑战性,因为非欧几里得曲面的表示和拓扑一致性的维护都存在困难。现有的生成模型,如生成对抗网络(GANs)、神经辐射场(NeRFs)和扩散架构,通常无法直接生成可在下游设计工作流程中使用的可编辑流形。我们提出了Sketch2MinSurf,一个混合视觉-语言与几何优化的框架,结合了视觉-语言指导和最小曲面理论,从手绘草图生成光滑且可编辑的三维表面。我们方法的核心是空间-拓扑编码,它将几何体表示为节点坐标和真实/虚拟边缘骨架的元组,从而在生成过程中实现稳定的拓扑控制。我们进一步引入了Sketch2MinSurf结构损失(S2MS-Loss),这是一种奖励调制目标,联合约束几何重建和拓扑一致性。在100个草图的测试集上,Sketch2MinSurf达到了0.844的拓扑相似度评分,超越了现有的草图到形状基准。生成的流形是可直接编辑的,并且没有非流形伪影。在一所大学的公共艺术装置中展示了该方法在基于人类意图的三维形态生成中的潜力。数据集和代码可在 https://anonymous.4open.science/r/Sketch2MinSurf/ 获取。
cs.CV / 65 / 2605.20735
Lowering the Barrier to IREX Participation: Open-Source Algorithms, Toolkit, and Benchmarking for Iris Recognition
降低IREX参与门槛:用于虹膜识别的开源算法、工具包和基准测试
Abstract
This paper proposes two new open-source iris recognition algorithms, providing both Python and IREX-compliant C++ implementations to be submitted to the official IREX X program. This work has two primary goals: (a) to conduct the first-ever assessment of open-source iris recognition solutions according to IREX testing protocols, and (b) to offer a model C++ submission that significantly facilitates the entry of other teams' open-source methods into the IREX evaluation. The new methods consist of two Neural Networks trained with: (i) Triplet loss with Batch-Hard Triplet mining (TripletIris), and (ii) ArcFace loss (ArcIris). The paper also provides open-source IREX-compliant C++ implementations of two existing methods: (a) an iris image filtering-based algorithm utilizing human saliency-driven kernels (HDBIF), and (b) a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). Except for CRYPTS, which faced timing constraints during 1:N search, these methods have undergone the official IREX X evaluation and have also been assessed using several popular academic benchmarks: Quality-Face/Iris Research Ensemble, Warsaw-Biobase Post-Mortem Iris, CASIA-Iris-Thousand-V4, CASIA-Iris-Lamp-V4, IIT Delhi Iris Database, IIITD Contact Lens Iris Database, NDIris3D, and Notre Dame Variable Iris Image Quality Release 2. Finally, this paper also provides open-source models for iris segmentation and circle estimation that can be incorporated into any new iris recognition method.
Chinese Translation
本文提出了两种新的开源虹膜识别算法,提供了符合IREX标准的Python和C++实现,以提交至官方IREX X项目。该工作的主要目标有两个:(a) 根据IREX测试协议首次评估开源虹膜识别解决方案,(b) 提供一个模型C++提交,显著便利其他团队的开源方法进入IREX评估。这些新方法包括两个神经网络,分别使用:(i) 基于三元组损失和批量困难三元组挖掘的TripletIris,以及(ii) ArcFace损失的ArcIris。本文还提供了两种现有方法的开源IREX兼容C++实现:(a) 基于虹膜图像过滤的算法,利用人类显著性驱动的核(HDBIF),以及(b) 一种可被人类理解的算法,用于检测和比较Fuchs隐窝(CRYPTS)。除CRYPTS在1:N搜索中面临时间限制外,这些方法均已通过官方IREX X评估,并使用多个流行的学术基准进行了评估:Quality-Face/Iris Research Ensemble、华沙生物库尸检虹膜、CASIA-Iris-Thousand-V4、CASIA-Iris-Lamp-V4、印度理工学院德里虹膜数据库、IIITD隐形眼镜虹膜数据库、NDIris3D和圣母大学可变虹膜图像质量发布2。最后,本文还提供了可集成到任何新虹膜识别方法中的开源虹膜分割和圆形估计模型。
cs.CV / 66 / 2605.20737
Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors
利用语言先验解决无监督3D点云分割中的长尾歧义
Abstract
Existing approaches for unsupervised 3D point cloud segmentation predominantly rely on a purely visual similarity-based learning-by-clustering paradigm, which suffers from a fundamental limitation: long-tail ambiguity. In such a paradigm, features of minor classes are consistently absorbed by dominant clusters, leading to severely imbalanced predictions. To address this issue, we propose LangTail, a language-guided hierarchical learning framework that leverages the balanced world knowledge encoded in language models to mitigate long-tail ambiguity in unsupervised 3D segmentation. The key idea is to establish multi-level associations between language-derived semantic priors and visually underrepresented minor classes, thereby compensating for the biased attention of purely visual clustering toward dominant classes. Specifically, LangTail first constructs an entity-level semantic prior from language models, capturing balanced and fine-grained world knowledge across categories. These priors are injected into a hierarchical clustering framework via contrastive alignment. This guides multi-granularity semantic structure formation and prevents minor classes from being absorbed by dominant clusters, yielding more discriminative representations for underrepresented categories. Extensive experiments on ScanNet-v2, S3DIS, and nuScenes demonstrate that LangTail consistently outperforms existing methods by significant margins, \ie, +13.5, +12.9, and +8.9 mIoU, respectively. These results demonstrate the effectiveness of language priors in improving the representation of minority classes in 3D point clouds. The code will be released at: https://github.com/Whisky0129/langtail_official.
Chinese Translation
现有的无监督3D点云分割方法主要依赖于纯视觉相似性基础的聚类学习范式,这一方法存在一个根本性限制:长尾歧义。在这种范式中,次要类别的特征不断被主导聚类所吸收,导致预测严重失衡。为了解决这一问题,我们提出了LangTail,一种语言引导的层次学习框架,利用语言模型中编码的平衡世界知识来减轻无监督3D分割中的长尾歧义。其关键思想是建立语言衍生的语义先验与视觉上代表性不足的次要类别之间的多层次关联,从而弥补纯视觉聚类对主导类别的偏向注意。具体而言,LangTail首先从语言模型构建实体级语义先验,捕捉跨类别的平衡和细粒度的世界知识。这些先验通过对比对齐注入层次聚类框架中。这指导了多粒度语义结构的形成,并防止次要类别被主导聚类所吸收,从而为代表性不足的类别提供更具辨别力的表示。在ScanNet-v2、S3DIS和nuScenes上的大量实验表明,LangTail在各个方面均显著优于现有方法,分别提升了+13.5、+12.9和+8.9 mIoU。这些结果证明了语言先验在改善3D点云中少数类别表示方面的有效性。代码将发布于:https://github.com/Whisky0129/langtail_official。
cs.CV / 67 / 2605.20738
STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection
STAR-IOD:具有伪标签精炼的规模解耦拓扑对齐用于遥感增量目标检测
Abstract
Remote sensing imagery typically arrives in the form of continuous data streams. Traditional detectors often forget previously learned categories when learning new ones; therefore, research on Remote Sensing Incremental Object Detection (RS-IOD) is of great significance. However, existing methods largely overlook the intra-class scale variations prevalent in remote sensing scenes, which undermines the effectiveness of knowledge transfer and old knowledge preservation. Moreover, RS-IOD also suffers from missing annotations, which cause the model to misclassify old-class instances as background. To address these challenges, we propose a novel framework, STAR-IOD. First, we introduce a Subspace-decoupled Topology Distillation (STD) module to transfer structural knowledge, explicitly aligning inter-class topological relationships and mitigating intra-class representation discrepancies induced by scale shifts. Furthermore, we introduce the Clustering-driven Pseudo-label Generator (CPG), a plug-and-play module that leverages K-Means clustering to dynamically identify class-specific thresholds, thereby guaranteeing an accurate distinction between true positive targets and background noise and alleviating the issue of missing annotations for old classes. We also constructed two Remote Sensing Incremental Object Detection datasets, DIOR-IOD and DOTA-IOD to facilitate research on RS-IOD. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by 1.7% and 2.1% mAP on DIOR-IOD and DOTA-IOD, respectively, effectively alleviating catastrophic forgetting while preserving strong detection performance on both base and novel classes. The code and dataset are released at: https://github.com/zyt95579/STAR-IOD.
Chinese Translation
遥感影像通常以连续数据流的形式到达。传统检测器在学习新类别时往往会遗忘之前学习的类别,因此,遥感增量目标检测(RS-IOD)的研究具有重要意义。然而,现有方法在很大程度上忽视了遥感场景中普遍存在的类内尺度变化,这削弱了知识转移和旧知识保留的有效性。此外,RS-IOD还面临缺失标注的问题,这导致模型将旧类别实例错误分类为背景。为了解决这些挑战,我们提出了一种新颖的框架,STAR-IOD。首先,我们引入了一个子空间解耦拓扑蒸馏(STD)模块,以传递结构知识,明确对齐类间拓扑关系,并减轻由尺度变化引起的类内表示差异。此外,我们引入了聚类驱动的伪标签生成器(CPG),这是一个即插即用的模块,利用K-Means聚类动态识别类别特定的阈值,从而保证准确区分真实正目标和背景噪声,并缓解旧类别缺失标注的问题。我们还构建了两个遥感增量目标检测数据集,DIOR-IOD和DOTA-IOD,以促进RS-IOD的研究。大量实验表明,我们的方法在DIOR-IOD和DOTA-IOD上分别比最先进的方法提高了1.7%和2.1%的mAP,有效减轻了灾难性遗忘,同时在基础类和新类上保持了强大的检测性能。代码和数据集已发布于:https://github.com/zyt95579/STAR-IOD。
cs.CV / 68 / 2605.20743
Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction
Draw2Think:通过约束引擎交互利用几何推理
Abstract
Vision-language models solve geometry problems with rising accuracy, yet their intermediate states remain latent and unverifiable: a relation expressed in textual reasoning or drawing code carries no guarantee that a constraint-satisfying configuration realizes it. We observe that existing externalization methods based on rendered pixels or one-shot scripts fail to provide exact, per-action geometric guarantees. Enforcing geometric relations by algebraic definition closes this gap: the workspace becomes a constraint-checked evolving canvas. We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop, Draw2Think externalizes hypotheses onto an executable canvas, measures exact geometric quantities, and feeds structured observations back to the model, so subsequent reasoning proceeds from checked canvas state grounded by the shared workspace. This externalization makes two properties separately auditable: model-level Construction Fidelity (whether the canvas realizes the intended configuration) and engine-level Measurement Faithfulness (exact values and relations from canvas constraints). Across construction, outcome, and rendering evaluations, Draw2Think builds canvases that pass 95.9% predicate-level and 84.0% strict problem-level construction checks on GeoGoal, improves outcome accuracy by up to 4.1%/16.4% on planar/solid benchmarks, and attains 68.2%/90.5% strict/relaxed rendering scores on GenExam-math. Project page is available at https://draw2think.github.io/
Chinese Translation
视觉语言模型在解决几何问题方面的准确性不断提高,但其中间状态仍然是潜在的且不可验证的:以文本推理或绘图代码表达的关系并不能保证满足约束的配置能够实现这一点。我们观察到,现有基于渲染像素或一次性脚本的外部化方法未能提供精确的逐步几何保证。通过代数定义强制几何关系可以弥补这一差距:工作空间变成一个经过约束检查的动态画布。我们提出了Draw2Think,一个将几何推理从潜在的空间推断转变为与GeoGebra约束引擎的主动交互的框架。在一个提议-绘制-验证循环中,Draw2Think将假设外部化到可执行的画布上,测量精确的几何量,并将结构化观察反馈给模型,从而使后续推理基于共享工作空间中经过检查的画布状态进行。这种外部化使得两个属性可以单独审计:模型级构建保真度(画布是否实现了预期配置)和引擎级测量真实性(来自画布约束的精确值和关系)。在构建、结果和渲染评估中,Draw2Think构建的画布在GeoGoal上通过了95.9%的谓词级和84.0%的严格问题级构建检查,在平面/固体基准上提高了结果准确性,分别达到4.1%/16.4%的提升,并在GenExam-math上获得了68.2%/90.5%的严格/放宽渲染分数。项目页面可访问 https://draw2think.github.io/
cs.CV / 69 / 2605.20760
SpineContextResUNet: A Computationally Efficient Residual UNet for Spine CT Segmentation
SpineContextResUNet:一种计算高效的残差 U-Net 用于脊柱 CT 分割
Abstract
Automated segmentation of the vertebral column in Computed Tomography (CT) scans is a prerequisite for pathological assessment and surgical planning. However, state-of-the-art methods, particularly those based on Transformers or large-scale ensembles, demand substantial GPU resources, creating a barrier for clinical adoption in resource-constrained environments or on edge devices. To address this, we introduce SpineContextResUNet, a computationally efficient 3D Residual U-Net designed for rapid spinal localization. Our architecture integrates a lightweight Context Block that employs parallel multi-dilated convolutions to capture long-range anatomical dependencies without the high latency of Recurrent Neural Networks (RNNs) or the memory overhead of Self-Attention mechanisms. Extensive validation on two public benchmarks, VerSe2020 and CTSpine1K, demonstrates that our model achieves a Dice score of 88.17% and 88.13% respectively. To evaluate performance under strict hardware constraints, we compared our model against a bottlenecked SwinUNETR scaled to match our ~1.7M hardware footprint. While the constrained Transformer suffers severe performance degradation due to a lack of spatial inductive biases in a limited-data regime, our CNN-based approach successfully maintains high accuracy. Crucially, heavy baselines like TotalSegmentator fail due to memory exhaustion on commodity hardware (Intel Core i5, 8GB RAM), our model performs robust inference, making it a viable solution for point-of-care diagnostics and deployment on edge platforms like the Nvidia Jetson Orin Nano.
Chinese Translation
在计算机断层扫描(CT)中自动分割脊柱是病理评估和手术规划的前提。然而,最先进的方法,特别是基于 Transformer 或大规模集成的方法,需要大量的 GPU 资源,这在资源受限的环境或边缘设备上形成了障碍。为了解决这个问题,我们提出了 SpineContextResUNet,一种计算高效的 3D 残差 U-Net,旨在快速定位脊柱。我们的架构集成了一个轻量级的上下文块,采用并行多膨胀卷积来捕捉长距离的解剖依赖关系,而无需像递归神经网络(RNN)那样的高延迟或自注意力机制的内存开销。在两个公共基准测试 VerSe2020 和 CTSpine1K 上进行了广泛的验证,结果表明我们的模型分别达到了 88.17% 和 88.13% 的 Dice 分数。为了在严格的硬件限制下评估性能,我们将我们的模型与一个瓶颈的 SwinUNETR 进行了比较,该模型经过缩放以匹配我们约 1.7M 的硬件占用。尽管受限的 Transformer 在有限数据环境中由于缺乏空间归纳偏差而严重性能下降,但我们的基于 CNN 的方法成功地保持了高准确性。重要的是,像 TotalSegmentator 这样的重基线在普通硬件(Intel Core i5,8GB RAM)上由于内存耗尽而失败,而我们的模型则能够进行稳健的推理,使其成为点对点诊断和在 Nvidia Jetson Orin Nano 等边缘平台上部署的可行解决方案。
cs.CV / 70 / 2605.20766
Diffuse to Detect: Bi-Level Sample Rebalancing with Pseudo-Label Diffusion for Point-Supervised Infrared Small-Target Detection
扩散检测:基于伪标签扩散的双层样本重平衡用于点监督红外小目标检测
Abstract
Point supervision has become a scalable solution to address dense annotation for infrared small target detection, but its performance is limited by two coupled bottlenecks: unstable pseudo-label evolution in cluttered, low-contrast infrared imagery and severe sample-distribution imbalance. In this paper, we present a more adaptive and stable framework to address these issues. Leveraging the intrinsic consistency between thermal radiation patterns and heat diffusion, we propose a physics-induced annotation strategy that expands single-point labels into reliable pseudo-masks. To further enhance supervision and alleviate sample imbalance, we develop a bi-level dual-update framework that jointly optimizes detector weights, sample weights, and diffusion parameters. A meta-classifier dynamically predicts sample-wise loss weights, while a differentiable diffusion module refines pseudo-labels with detection feedback, enabling adaptive interaction between training and hyperparameter optimization. Extensive experiments across multiple datasets demonstrate five-fold annotation acceleration, superior detection accuracy, and comparable performance with 30% of the training data, validating the efficiency and practicality of our approach. Our code is available at https://github.com/yuanhang-yao/diffuse-to-detect.
Chinese Translation
点监督已成为解决红外小目标检测中密集标注问题的可扩展方案,但其性能受到两个相互关联的瓶颈的限制:在杂乱、低对比度的红外图像中伪标签演变的不稳定性和严重的样本分布不平衡。本文提出了一种更具适应性和稳定性的框架来解决这些问题。利用热辐射模式与热扩散之间的内在一致性,我们提出了一种物理驱动的标注策略,将单点标签扩展为可靠的伪掩码。为了进一步增强监督并缓解样本不平衡,我们开发了一个双层双重更新框架,联合优化检测器权重、样本权重和扩散参数。一个元分类器动态预测样本级损失权重,而一个可微分的扩散模块通过检测反馈细化伪标签,实现训练与超参数优化之间的自适应交互。多个数据集上的大量实验表明,我们的方法在标注加速方面达到了五倍的提升,检测准确性优越,并且在使用30%的训练数据时表现出可比性,验证了我们方法的效率和实用性。我们的代码可在 https://github.com/yuanhang-yao/diffuse-to-detect 获取。
cs.CV / 71 / 2605.20772
VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering
VIHD:基于视觉干预的医学视觉问答中的幻觉检测
Abstract
While medical Multimodal Large Language Models (MLLMs) have shown promise in assisting diagnosis, they still frequently generate hallucinated responses that appear linguistically plausible but lack visual evidence. Such hallucinations pose risks to clinical decision-making and necessitate effective detection. Existing introspective detection methods primarily perform uncertainty estimation or logical verification by analyzing model responses conditioned on original or perturbed inputs. However, such external perturbations are often heuristic and context-agnostic, which overlooks the internal cross-modal dependency between generated tokens and related visual tokens during decoding. To address this issue, we propose VIHD, a Visual Intervention-based Hallucination Detection method that leverages targeted visual token masking to calibrate semantic entropy for more effective hallucination detection. VIHD locates visually dominant decoder layers via Visual Dependency Probing (VDP), executes Visual Intervention Decoding (VID) via token masking to calibrate the semantic distribution, and quantifies the resulting Calibrated Semantic Entropy (CSE) as a reliable hallucination signal. Extensive experiments on three medical VQA benchmarks with two medical MLLMs demonstrate that VIHD consistently outperforms state-of-the-art methods, underscoring the importance of fine-grained visual dependency for hallucination detection. The code will be available at https://github.com/Jiayi-Chen-AU/VIHD
Chinese Translation
尽管医学多模态大语言模型(MLLMs)在辅助诊断方面展现出潜力,但它们仍然经常生成看似语言上合理但缺乏视觉证据的幻觉响应。这些幻觉对临床决策构成风险,因此需要有效的检测方法。现有的自省检测方法主要通过分析模型对原始或扰动输入的响应来执行不确定性估计或逻辑验证。然而,这种外部扰动往往是启发式的且与上下文无关,忽视了在解码过程中生成的标记与相关视觉标记之间的内部跨模态依赖关系。为了解决这个问题,我们提出了VIHD,一种基于视觉干预的幻觉检测方法,利用针对性的视觉标记屏蔽来校准语义熵,以实现更有效的幻觉检测。VIHD通过视觉依赖探测(Visual Dependency Probing, VDP)定位视觉主导的解码层,通过标记屏蔽执行视觉干预解码(Visual Intervention Decoding, VID)以校准语义分布,并量化得到的校准语义熵(Calibrated Semantic Entropy, CSE)作为可靠的幻觉信号。在三个医学视觉问答基准上对两种医学MLLMs进行的广泛实验表明,VIHD始终优于最先进的方法,强调了细粒度视觉依赖在幻觉检测中的重要性。代码将发布在 https://github.com/Jiayi-Chen-AU/VIHD
cs.CV / 72 / 2605.20777
AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models
AttriStory:基于扩散模型的视觉叙事中的细粒度属性实现
Abstract
Visual storytelling with diffusion models has made impressive strides in maintaining character consistency across narrative scenes. However, a critical gap remains: while these methods ensure a character remains consistent across scenes, they provide no systematic method to ensure if fine-grained attributes such as color and textures of clothing, accessories are faithfully rendered in the generated images. Towards this goal, we introduce AttriStory, a benchmark enabling attribute realization in visual storytelling. We curate 200 multi-scene stories across 10 distinct artistic styles using Large Language Model. Each scene is constructed with detailed attribute specifications to enable rich visual narratives. Further, to address attribute realization, we propose a plug-and-play latent optimization module that operates during early denoising steps, when the model establishes structural and semantic content. We achieve this through AttriLoss objective designed to maximize alignment between the cross-attention maps for desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly. This approach operates orthogonally to existing consistency mechanisms, integrating seamlessly with current story generation pipelines without requiring architectural modifications. Our experiments demonstrate consistent improvements on incorporating AttriLoss across all baselines. This work positions attribute realization as a distinct, complementary dimension of visual storytelling, alongside character consistency, advancing the field toward fine-grained attribute-controlled story generation. Project-page:https://manogna-s.github.io/attristory/
Chinese Translation
基于扩散模型的视觉叙事在保持叙事场景中角色一致性方面取得了显著进展。然而,仍然存在一个关键的缺口:虽然这些方法确保角色在场景之间保持一致,但并没有提供系统的方法来确保生成图像中细粒度属性(如服装、配饰的颜色和纹理)得到了忠实呈现。为此,我们引入了AttriStory,一个用于视觉叙事中属性实现的基准。我们使用大型语言模型策划了200个跨越10种不同艺术风格的多场景故事。每个场景都根据详细的属性规范构建,以实现丰富的视觉叙事。此外,为了解决属性实现问题,我们提出了一个即插即用的潜在优化模块,该模块在早期去噪步骤中运行,当模型建立结构和语义内容时。我们通过AttriLoss目标实现这一点,该目标旨在最大化所需属性-对象对的交叉注意力图之间的对齐,同时抑制虚假关联,指导模型正确定位属性。这种方法与现有的一致性机制正交操作,能够无缝集成到当前的故事生成管道中,而无需进行架构修改。我们的实验表明,在所有基线中引入AttriLoss均能实现一致的改进。这项工作将属性实现定位为视觉叙事的一个独特且互补的维度,与角色一致性并行,推动该领域朝着细粒度属性控制的故事生成发展。项目页面:https://manogna-s.github.io/attristory/
cs.CV / 73 / 2605.20787
Findings of the Counter Turing Test: AI-Generated Image Detection
反图灵测试的发现:AI生成图像检测
Abstract
The rapid advancements in generative AI technologies, such as Stable Diffusion, DALL-E, and Midjourney, have significantly transformed the creation of synthetic visual content. While these models enable innovation across industries, they also pose serious challenges, including misinformation, disinformation, and biased content generation. The increasing realism of AI-generated images makes their detection a pressing concern for researchers, policymakers, and industry stakeholders. In this paper, we present the findings of the Defactify 4.0 workshop, which introduced the Counter Turing Test (CT2) for AI-Generated Image Detection. The competition consisted of two key tasks: (1) binary classification of images as either AI-generated or real and (2) identification of the specific generative model responsible for an AI-generated image. To facilitate this, we developed the MS COCOAI dataset, consisting of 50,000 synthetic images from multiple generative models alongside real-world images from the MS COCO dataset. Participants employed diverse detection strategies, including convolutional neural networks (CNNs), Vision Transformers (ViTs), frequency-based analysis, contrastive learning, and multimodal techniques. The results demonstrated that while AI-generated images can be detected with high accuracy (F1-score > 0.83), identifying the exact model used remains significantly more challenging (highest F1-score: 0.4986). These findings highlight the need for improved model fingerprinting, adversarial robustness, and real-time detection mechanisms.
Chinese Translation
生成性人工智能技术的快速进展,例如稳定扩散(Stable Diffusion)、DALL-E和Midjourney,显著改变了合成视觉内容的创作方式。虽然这些模型在各个行业中促进了创新,但它们也带来了严重的挑战,包括错误信息、虚假信息和偏见内容的生成。AI生成图像日益真实,使得其检测成为研究人员、政策制定者和行业利益相关者面临的紧迫问题。本文展示了Defactify 4.0研讨会的发现,该研讨会引入了用于AI生成图像检测的反图灵测试(Counter Turing Test, CT2)。该竞赛包括两个关键任务:(1) 将图像二分类为AI生成或真实;(2) 确定负责生成AI图像的具体生成模型。为此,我们开发了MS COCOAI数据集,该数据集包含来自多个生成模型的50,000张合成图像,以及来自MS COCO数据集的真实图像。参与者采用了多种检测策略,包括卷积神经网络(CNNs)、视觉变换器(Vision Transformers, ViTs)、基于频率的分析、对比学习和多模态技术。结果表明,虽然AI生成图像可以以高准确率(F1-score > 0.83)被检测,但识别所使用的确切模型仍然显著更具挑战性(最高F1-score: 0.4986)。这些发现突显了改进模型指纹识别、对抗鲁棒性和实时检测机制的必要性。
cs.CV / 74 / 2605.20795
What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing
连接器存活了什么语义?诊断视频编辑中的VLM与DiT对齐
Abstract
Flow matching based video generative models have been increasingly relying on prepended Vision-Language Models (VLMs) to handle complex, instruction-based video editing. The prevailing assumption underlying this paradigm is that a connector module can seamlessly align the VLM's rich multi-modal reasoning with the original text embedding space of DiTs. However, we hypothesize that this alignment acts as a severe semantic bottleneck, degrading fine-grained structural variables. Verifying this is challenging, as end-to-end evaluations conflate alignment failures with generation errors, and natural datasets lack disentangled annotations. To rigorously investigate this, we propose a controlled data processing pipeline based on video composition that results in TRACE-Edit, a diagnostic dataset focusing on relation-based editing. Leveraging this dataset, we propose a comprehensive diagnostic protocol to analyze two important designs of meta-query and connector in the existing video editing models. Systematic evaluation of four representative model cases reveals that fine-grained structural semantics can be severely degraded during alignment. Our findings overturn the assumption of lossless semantic transfer, identifying the VLM-to-DiT alignment as a major bottleneck and providing a new diagnostic foundation for future multi-modal alignment architectures.
Chinese Translation
基于流匹配的视频生成模型越来越依赖于预置的视觉-语言模型(VLM)来处理复杂的基于指令的视频编辑。这一范式背后的主要假设是,连接模块能够无缝地将VLM丰富的多模态推理与DiT的原始文本嵌入空间对齐。然而,我们假设这种对齐实际上充当了一个严重的语义瓶颈,降低了细粒度结构变量的质量。验证这一点具有挑战性,因为端到端评估将对齐失败与生成错误混为一谈,而自然数据集缺乏解耦的注释。为了严格调查这一问题,我们提出了一种基于视频合成的受控数据处理管道,生成了TRACE-Edit,一个专注于基于关系编辑的诊断数据集。利用该数据集,我们提出了一种全面的诊断协议,以分析现有视频编辑模型中元查询和连接器的两个重要设计。对四个代表性模型案例的系统评估表明,在对齐过程中,细粒度结构语义可能会受到严重损害。我们的研究推翻了无损语义转移的假设,将VLM与DiT的对齐识别为一个主要瓶颈,并为未来的多模态对齐架构提供了新的诊断基础。
cs.CV / 75 / 2605.20804
OlmoEarth v1.1: A more efficient family of OlmoEarth models
OlmoEarth v1.1:更高效的OlmoEarth模型家族
Abstract
We present a set of improvements to the OlmoEarth family. These improvements allow us to cut compute costs during training ($1.7 \times$ reduction in GPU hours required to train our Base models) and inference ($2.9\times$ reductions in MACs on Sentinel-2 tasks), while maintaining the models' overall performance. All training code is available at github.com/allenai/olmoearth_pretrain.
Chinese Translation
我们提出了一系列对OlmoEarth家族的改进。这些改进使我们能够在训练过程中降低计算成本(训练基础模型所需的GPU小时数减少了1.7倍)以及推理过程中的计算量(在Sentinel-2任务中MACs减少了2.9倍),同时保持模型的整体性能。所有训练代码可在github.com/allenai/olmoearth_pretrain获取。
cs.CV / 76 / 2605.20807
Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction
通过中间结构预测分解主题驱动的图像生成
Abstract
Subject-driven text-to-image generation still struggles to preserve high-frequency identity details such as logos, patterns, and text. Existing methods typically operate directly in RGB space, which often leads to detail degradation under substantial edits. We propose a two-stage framework that decouples structure from appearance by first predicting a Canny map and then rendering the final image conditioned on both the source appearance and the predicted structure. To improve text handling, we further introduce a fully automatic pipeline that constructs a 100k-pair text-aware dataset with cross-view textual consistency. Experiments, including GPT-4.1-based evaluation and a knowledge distillation study, show clear gains over selected baselines and suggest that intermediate structural prediction is an effective route for high-fidelity subject-driven generation. Our dataset and code will be made publicly available.
Chinese Translation
主题驱动的文本到图像生成仍然难以保持高频身份细节,如徽标、图案和文本。现有方法通常直接在RGB空间中操作,这往往导致在大幅编辑下细节退化。我们提出了一种两阶段框架,通过首先预测Canny图并在源外观和预测结构的基础上渲染最终图像,从而将结构与外观解耦。为了改善文本处理,我们进一步引入了一种全自动流程,构建了一个具有跨视角文本一致性的10万对文本感知数据集。实验,包括基于GPT-4.1的评估和知识蒸馏研究,显示出相较于选定基线的明显提升,并表明中间结构预测是实现高保真主题驱动生成的有效途径。我们的数据集和代码将公开发布。
cs.CV / 77 / 2605.20808
Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis
超高分辨率图像合成的空间Gram对齐
Abstract
Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at https://github.com/zhang0jhon/SGA.
Chinese Translation
现代超高分辨率图像合成在很大程度上依赖于大规模预训练潜在扩散模型(Latent Diffusion Models, LDMs)强大的生成能力。尽管最近的表征对齐方法通过将基础模型(如SAM或DINO)中的视觉先验提炼到生成潜在特征中已被证明有效,但将这些方法扩展到极高分辨率的预训练LDM时,暴露出一个关键的可学习性与保真度冲突。具体而言,强制直接的块级特征提炼本质上扰动了预训练潜在流形,最终导致生成质量下降。为了解决这一瓶颈,我们提出了空间Gram对齐(Spatial Gram Alignment, SGA),这是一个新颖的框架,明确利用视觉基础模型的表征先验,同时保留LDM的原生生成能力。SGA超越了限制性的直接对齐,通过将生成特征的内部自相似性与基础先验的自相似性进行对齐,施加了一种非侵入性的空间约束。这种空间约束有效地建立了宏观结构一致性,而原生生成目标则保留了原始LDM固有的微观像素级保真度。值得注意的是,这种多功能策略在预训练LDM中能够无缝集成中间扩散特征和变分自编码器(VAE)潜变量。大量实验表明,SGA在超高分辨率文本到图像合成中实现了最先进的性能,有效调和了全局结构完整性与细粒度视觉细节之间的关系。代码可在 https://github.com/zhang0jhon/SGA 获取。
cs.CV / 78 / 2605.20818
OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026
基于MLLM重排序的OSGNet在Ego4D情节记忆挑战2026中的应用
Abstract
In this report, we present our champion solutions for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge at CVPR 2026. Both tracks require accurately localizing temporal segments from long untrimmed egocentric videos. To address these tasks, we propose a reranking-based framework that effectively leverages the strong video-language reasoning capability of multimodal large language model (MLLM) while preserving the efficiency and candidate recall of conventional localization pipelines. Specifically, we first obtain a set of candidate segments from existing localization model OSGNet, and then employ MLLM to select the segment that best matches the given query, thereby refining the final prediction. Ultimately, our method achieved first place in both the Natural Language Queries and GoalStep tracks. Our code can be found at https://github.com/iLearn-Lab/CVPR25-OSGNet.
Chinese Translation
在本报告中,我们展示了在2026年CVPR的Ego4D情节记忆挑战中,针对自然语言查询和GoalStep任务的冠军解决方案。这两个任务都要求从长时间未剪辑的自我中心视频中准确定位时间段。为了解决这些任务,我们提出了一种基于重排序的框架,该框架有效利用了多模态大型语言模型(MLLM)强大的视频-语言推理能力,同时保持了传统定位管道的效率和候选回收率。具体而言,我们首先从现有的定位模型OSGNet中获取一组候选段,然后利用MLLM选择与给定查询最匹配的段,从而优化最终预测。最终,我们的方法在自然语言查询和GoalStep两个任务中均获得第一名。我们的代码可以在https://github.com/iLearn-Lab/CVPR25-OSGNet找到。
cs.CV / 79 / 2605.20820
AIR: Amortized Image Reconstruction Framework for Self-Supervised Feed-Forward 2D Gaussian Splatting
AIR:用于自监督前馈2D高斯点云的摊销图像重建框架
Abstract
2D Gaussian splatting provides an efficient explicit representation for image reconstruction, but existing methods still require costly per-image iterative optimization or rely on handcrafted priors for primitive allocation. We present AIR, a self-supervised feed-forward framework that amortizes iterative Gaussian fitting into a single network pass, eliminating per-image test-time optimization. AIR adopts a stage-wise residual architecture that progressively predicts additional Gaussian primitives from reconstruction residuals, together with an explicit Stage Control mechanism that activates new primitives only in under-reconstructed regions. A Predict--Optimize--Distill training strategy stabilizes multi-stage prediction by distilling short-horizon optimized Gaussian increments back into the predictor. The stabilized predictor is then jointly finetuned across stages and equipped with an image-adaptive quantizer for compact Gaussian storage. Experiments on Kodak and DIV2K show that AIR achieves better reconstruction quality than representative Gaussian-based baselines while reducing encoding time to 160--300\,ms. Code: https://github.com/whoiszzj/AIR.git
Chinese Translation
2D高斯点云为图像重建提供了一种高效的显式表示,但现有方法仍然需要昂贵的逐图迭代优化或依赖于手工设计的先验进行原始分配。我们提出了AIR,一种自监督前馈框架,将迭代高斯拟合摊销为单次网络传递,从而消除了逐图测试时的优化。AIR采用阶段性残差架构,逐步从重建残差中预测额外的高斯原始,并结合显式的阶段控制机制,仅在重建不足的区域激活新的原始。预测-优化-提炼的训练策略通过将短期优化的高斯增量提炼回预测器来稳定多阶段预测。稳定的预测器随后在各个阶段共同微调,并配备了图像自适应量化器以实现紧凑的高斯存储。在Kodak和DIV2K上的实验表明,AIR在重建质量上优于代表性的基于高斯的方法,同时将编码时间减少到160-300毫秒。代码:https://github.com/whoiszzj/AIR.git
cs.CV / 80 / 2605.20821
VSCD: Video-based Scene Change Detection in Unaligned Scenes
VSCD:基于视频的非对齐场景变化检测
Abstract
Detecting what has changed in an environment is essential for long-term autonomy, yet most change detection settings assume fixed viewpoints, mild misalignment, or only a few changed objects. We introduce Video-based Scene Change Detection (VSCD), which predicts a pixel-wise change mask for each query frame, given a reference and a query RGB video of the same indoor space recorded at different times under unconstrained camera motion. The two videos are not temporally synchronized, and many object instances may appear or disappear. To study this setting, we build a large-scale benchmark with over 1.1 million frames annotated with pixel-accurate change masks, together with a real-world test set for evaluating transfer beyond simulation. We propose a query-centric multi-reference model that learns temporal matching implicitly from change-mask supervision, aligns candidate reference features to the query via local patch correspondence, and fuses per-candidate change features using frame-level and patch-level confidence before decoding a high-resolution mask once per frame. Our approach achieves state-of-the-art performance against strong image- and video-based baselines, and we validate its real-world impact by deploying it on a mobile robot for two downstream applications -- visual surveillance and object incremental learning.
Chinese Translation
检测环境中的变化对于长期自主性至关重要,但大多数变化检测设置假设固定视点、轻微的错位或仅少量变化的物体。我们提出了基于视频的场景变化检测(VSCD),该方法在给定同一室内空间的参考和查询RGB视频(在不同时间记录,且相机运动不受限制)的情况下,为每个查询帧预测逐像素的变化掩码。这两个视频在时间上并不同步,并且可能出现或消失许多物体实例。为了研究这一设置,我们构建了一个大规模基准数据集,包含超过110万帧,标注有逐像素准确的变化掩码,并提供了一个真实世界的测试集,以评估超越仿真的迁移能力。我们提出了一种以查询为中心的多参考模型,该模型通过变化掩码监督隐式学习时间匹配,通过局部补丁对应将候选参考特征与查询对齐,并在解码每帧的高分辨率掩码之前,使用帧级和补丁级置信度融合每个候选的变化特征。我们的方法在强大的基于图像和视频的基线下实现了最先进的性能,并通过在移动机器人上部署该方法进行两个下游应用——视觉监控和物体增量学习,验证了其在真实世界中的影响。
cs.CV / 81 / 2605.20822
TERDNet: Transformer Encoder-Recurrent Decoder Network for Scene Change Detection
TERDNet:用于场景变化检测的变换器编码器-递归解码器网络
Abstract
In this work, we address the challenge of Scene Change Detection (SCD), where the goal is to identify variations between two images of the same location captured at different times. Existing SCD models often overlook the varying importance of features across layers, employ single-step decoders that confine refinement, and provide limited insight into encoder pretraining strategies. We propose TERDNet, a Transformer Encoder-Recurrent Decoder Network designed to overcome these limitations. TERDNet consists of a transformer-based encoder that extracts multi-level representations, a feature fusion module that integrates correlation volumes with these features, a recurrent 3-gate-GRU decoder that performs iterative refinement, and a combined convolution-interpolation upsampler that restores fine-grained resolution. Extensive experiments on four public benchmarks show that TERDNet consistently outperforms prior approaches and produces more accurate and detailed change masks. Ablation studies confirm the benefit of segmentation-based pretraining and the effectiveness of our fusion design. In addition, robustness tests under viewpoint misalignment confirm TERDNet's potential for deployment in real-world robotic systems, where reliable perception is critical. Our code is available at https://github.com/AutoCompSysLab/TERDNet.
Chinese Translation
在这项工作中,我们解决了场景变化检测(SCD)这一挑战,目标是识别在不同时间拍摄的同一地点的两幅图像之间的变化。现有的SCD模型往往忽视了特征在各层之间的重要性变化,采用单步解码器限制了精细化过程,并且对编码器预训练策略提供了有限的见解。我们提出了TERDNet,一种旨在克服这些局限性的变换器编码器-递归解码器网络。TERDNet由一个基于变换器的编码器组成,该编码器提取多层次表示,一个特征融合模块将相关体积与这些特征整合,一个递归3门GRU解码器执行迭代精细化,以及一个结合卷积和插值的上采样器,恢复细粒度分辨率。在四个公共基准上的广泛实验表明,TERDNet始终优于先前的方法,并生成更准确和详细的变化掩模。消融研究确认了基于分割的预训练的好处以及我们融合设计的有效性。此外,在视角不对齐下的鲁棒性测试确认了TERDNet在现实世界机器人系统中部署的潜力,在这些系统中,可靠的感知至关重要。我们的代码可在 https://github.com/AutoCompSysLab/TERDNet 获取。
cs.CV / 82 / 2605.20823
RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses
RelWitness:基于视觉-几何关系见证的开放词汇3D场景图生成
Abstract
Open-vocabulary 3D scene graph generation seeks to describe object instances and their relations with flexible natural-language predicates. The central difficulty is not only vocabulary expansion, but supervision reliability: relation annotations in 3D scene graph datasets are selective, and many valid object-pair relations are unannotated. We propose RelWitness, a framework for open-vocabulary 3D scene graph generation from posed RGB-D sequences under incomplete relation supervision. The key concept is a relation witness: a concrete visual-geometric cue that makes a relation observable in the captured scene. Support relations require contact and vertical ordering; containment requires enclosure; proximity requires metric closeness; orientation requires facing direction; and stable relations should persist across views where both objects are visible. RelWitness constructs relation witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier assigns unannotated relation candidates to verified missing positives, reliable negatives, or uncertain unlabeled cases. A witness-guided positive-unlabeled objective then learns from incomplete annotations without turning every missing label into a negative. We further introduce witness-consistent decoding and an RGB-D missing-relation audit protocol. Simulated manuscript-planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits show the intended behavior: improved unseen-relation recognition, higher witness precision, lower hallucination, and reduced redundant relation phrases. All numerical results are planning values and must be replaced by reproduced measurements before submission
Chinese Translation
开放词汇3D场景图生成旨在使用灵活的自然语言谓词描述对象实例及其关系。其核心难点不仅在于词汇扩展,还在于监督的可靠性:3D场景图数据集中关系注释是选择性的,许多有效的对象对关系未被注释。我们提出了RelWitness,一个在不完整关系监督下,从姿态RGB-D序列生成开放词汇3D场景图的框架。关键概念是关系见证:一种具体的视觉-几何线索,使得在捕获的场景中关系可观察。支持关系需要接触和垂直排序;包含关系需要封闭;接近关系需要度量上的接近;方向关系需要面向方向;稳定关系应在两个对象可见的视图中持续存在。RelWitness从RGB视图、深度图、重建的3D几何、角色敏感文本、对象先验空视图和多视图一致性中构建关系见证记录。视觉-几何见证验证器将未注释的关系候选分配给经过验证的缺失正例、可靠的负例或不确定的未标记案例。然后,见证引导的正-未标记目标在不将每个缺失标签转化为负例的情况下,从不完整注释中学习。我们进一步引入见证一致解码和RGB-D缺失关系审核协议。在3DSSG/3RScan和ScanNet派生的开放词汇拆分上进行的模拟手稿规划实验显示了预期的行为:改善了未见关系识别,提高了见证精度,降低了幻觉现象,并减少了冗余关系短语。所有数值结果均为规划值,必须在提交前由重现测量替代。
cs.CV / 83 / 2605.20827
HyDAR-Pano3D: A Hybrid Disentangled Anatomical Recovery Framework for Panoramic-to-3D Reconstruction
HyDAR-Pano3D:一种用于全景到三维重建的混合解耦解剖恢复框架
Abstract
Panoramic radiograph (PR) is fundamentally used in routine dental care, but it inherently provides only a two-dimensional (2D) projection of complex three-dimensional (3D) craniofacial anatomy. Most existing learning-based methods attempt to computationally recover this 3D information by directly regressing native cone-beam computed tomography (CBCT) volumes from PR. However, this direct mapping requires the model to simultaneously learn common anatomical structures and patient-specific morphological variations. This entangled formulation makes the ill-posed 2D-to-3D inverse problem highly ambiguous, often producing over-smoothed reconstructions with blurred anatomical boundaries. To address this, we propose HyDAR-Pano3D, a two-stage framework that reformulates PR-to-CBCT reconstruction as a disentangled anatomical recovery problem. In Stage 1, a dual-encoder network integrates radiographic features with SAM-derived semantic priors to reconstruct an arch-normalized canonical volume. In Stage 2, an Anatomical Restoration Network predicts a prior-constrained structured deformation field to map this canonical volume back to the native space, restoring individual morphological variations. Experiments on three large-scale datasets show that HyDAR-Pano3D significantly outperforms baseline methods ($p < 0.05$), achieving a 25.76 dB PSNR, 85.70\% SSIM, and an 83.83\% overall anatomical Dice score. The synthesized volumes successfully support downstream segmentation of whole teeth (82.4\% Dice) and the inferior alveolar canal (72.2\% Dice), demonstrating that our disentangled approach preserves clinically relevant structures to enable robust anatomy-aware assessment when CBCT data is unavailable.
Chinese Translation
全景放射影像(PR)在日常牙科护理中具有基础性作用,但其本质上仅提供复杂三维(3D)颅面解剖结构的二维(2D)投影。大多数现有的基于学习的方法试图通过直接从PR回归原生锥束计算断层扫描(CBCT)体积来计算恢复这些3D信息。然而,这种直接映射要求模型同时学习常见的解剖结构和特定于患者的形态变化。这种纠缠的形式使得病态的2D到3D逆问题高度模糊,常常导致重建结果过于平滑,解剖边界模糊。为了解决这个问题,我们提出了HyDAR-Pano3D,一个将PR到CBCT重建重新表述为解耦解剖恢复问题的两阶段框架。在第一阶段,双编码器网络将放射特征与基于SAM的语义先验结合,重建一个弓形标准化的典型体积。在第二阶段,解剖恢复网络预测一个先验约束的结构变形场,将该典型体积映射回原生空间,恢复个体的形态变化。在三个大规模数据集上的实验表明,HyDAR-Pano3D显著优于基线方法($p < 0.05$),实现了25.76 dB的峰值信噪比(PSNR)、85.70\%的结构相似性指数(SSIM)和83.83\\%的整体解剖Dice得分。合成的体积成功支持整体牙齿(82.4\\% Dice)和下颌神经管(72.2\\% Dice)的下游分割,证明我们的解耦方法能够保留临床相关结构,以便在CBCT数据不可用时进行稳健的解剖意识评估。
cs.CV / 84 / 2605.20837
ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
ArchSIBench:视觉-语言模型的建筑空间智能基准测试
Abstract
Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on the perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, we construct 3,000 question-answer pairs to enable comprehensive evaluation of architectural spatial intelligence. Based on ArchSIBench, we evaluate various VLMs and find that the architectural spatial intelligence of most models shows significant differences from human baselines; additionally, models exhibit substantial variability across capability dimensions. Some state-of-the-art models can approach the level of human evaluators without architectural training. However, a clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at https://huggingface.co/datasets/ArchSIBench/ArchSIBench.
Chinese Translation
建筑空间智能,即识别和推断建筑空间的能力,对于机器人导航、具身交互以及三维场景理解与生成等任务至关重要。尽管已有大量研究评估了视觉-语言模型(VLMs)的基本空间技能,如相对方位、距离比较和物体计数,但这些任务仅涵盖了空间认知的最基本层面,且在很大程度上忽视了建筑空间的高级认知,包括布局理解、流线模式和功能分区。在本研究中,我们提出了ArchSIBench,这是一个基于建筑学、认知科学和心理学视角的建筑空间智能基准测试。ArchSIBench涵盖五个核心维度:感知、推理、导航、转换和配置,共包含17个细分子任务。通过建筑背景专家的仔细人工标注,我们构建了3000个问答对,以实现对建筑空间智能的全面评估。基于ArchSIBench,我们评估了多种VLMs,发现大多数模型的建筑空间智能与人类基线存在显著差异;此外,模型在能力维度上表现出显著的变异性。一些最先进的模型在没有建筑训练的情况下可以接近人类评估者的水平。然而,与经过建筑训练的人类评估者相比,在空间转换和配置推理方面仍然存在明显差距。我们相信,ArchSIBench将为测量和提升VLMs的建筑空间智能提供重要的见解和系统资源。数据集和代码可在 https://huggingface.co/datasets/ArchSIBench/ArchSIBench 获取。
cs.CV / 85 / 2605.20838
USV: Towards Understanding the User-generated Short-form Videos
USV:理解用户生成短视频的探索
Abstract
Several large-scale video datasets have been published these years and have advanced the area of video understanding. However, the newly emerged user-generated short-form videos have rarely been studied. This paper presents USV, the User-generated Short-form Video dataset for high-level semantic video understanding. The dataset contains around 224K videos collected from UGC platforms by label queries without extra manual verification and trimming. Although video understanding has achieved plausible improvement these years, most works focus on instance-level recognition, which is not sufficient for learning the representation of the high-level semantic information of videos. Therefore, we further establish two tasks: topic recognition and video-text retrieval on USV. We propose two unified and effective baseline methods Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL), to tackle the topic recognition task and video-text retrieval respectively, and carry out comprehensive benchmarks to facilitate future research. Our project page is https://usvdataset.github.io.
Chinese Translation
近年来,多个大规模视频数据集的发布推动了视频理解领域的发展。然而,新兴的用户生成短视频却鲜有研究。本文提出了USV(用户生成短视频数据集),旨在实现高层次语义视频理解。该数据集包含约224K个视频,这些视频通过标签查询从用户生成内容(UGC)平台收集而来,未经过额外的人工验证和剪辑。尽管近年来视频理解取得了显著进展,但大多数研究仍集中于实例级别的识别,这不足以学习视频高层次语义信息的表示。因此,我们进一步建立了两个任务:主题识别和视频-文本检索。我们提出了两种统一且有效的基线方法:多模态融合网络(Multi-Modality Fusion Network, MMF-Net)和视频-文本对比学习(Video-Text Contrastive Learning, VTCL),分别用于解决主题识别任务和视频-文本检索,并进行全面的基准测试以促进未来的研究。我们的项目页面是 https://usvdataset.github.io。
cs.CV / 86 / 2605.20839
Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models
无激活函数的图像识别主干网络:MetaFormer风格视觉模型中的多项式替代方案
Abstract
Modern vision backbones treat pointwise activations (e.g., ReLU, GELU) and exponential softmax as essential sources of nonlinearity, but we demonstrate they are not required within MetaFormer-style vision backbones. We design activation-free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules integrate seamlessly into existing architectures: instantiated within MetaFormer, a modular framework for vision backbones, our PolyNeXt models match or exceed activation-based counterparts across model scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness. We also substantially outperform prior polynomial networks at reduced computational cost, showing that polynomial variants of standard modules beat complex custom architectures.
Chinese Translation
现代视觉主干网络将逐点激活(例如,ReLU、GELU)和指数软最大值视为非线性的基本来源,但我们证明在MetaFormer风格的视觉主干网络中并不需要这些。我们为三种核心原语(多层感知器、卷积和注意力机制)设计了无激活函数的多项式替代方案,其中哈达玛积取代了标准非线性,以生成输入的多项式函数。这些模块可以无缝集成到现有架构中:在MetaFormer中实例化,这是一种用于视觉主干网络的模块化框架,我们的PolyNeXt模型在ImageNet分类、ADE20K语义分割和分布外鲁棒性方面与基于激活的对应模型相匹配或超越。我们还在降低计算成本的同时显著超越了之前的多项式网络,显示出标准模块的多项式变体优于复杂的定制架构。
cs.CV / 87 / 2605.20889
Map-Mono-Ego: Map-Grounded Global Human Pose Estimation from Monocular Egocentric Video
Map-Mono-Ego:基于地图的单目自我中心视频中的全球人类姿态估计
Abstract
Monocular egocentric human pose estimation is essential for ubiquitous activity monitoring. However, understanding the user's absolute location within the environment remains a challenge. Existing methods primarily focus on relative motion from an initial position, and tend not to account for the wearer's absolute location within an environment. Furthermore, inherent scale ambiguity in monocular vision leads to severe translational drift, limiting long-term tracking without specialized multi-sensor hardware. To address this, we propose MapMonoEgo, a novel framework achieving globally consistent human pose estimation solely from a monocular camera by leveraging a pre-scanned 3D point cloud. We also introduce AIST-Living dataset, a new dataset pairing egocentric video with ground-truth motion in a scanned environment. Experiments demonstrate that our approach significantly outperforms the state-of-the-art baseline, proving its utility for practical monitoring tasks without specialized hardware.
Chinese Translation
单目自我中心的人类姿态估计对于普遍的活动监测至关重要。然而,理解用户在环境中的绝对位置仍然是一个挑战。现有方法主要关注从初始位置的相对运动,往往未考虑佩戴者在环境中的绝对位置。此外,单目视觉固有的尺度模糊性导致严重的平移漂移,限制了在没有专用多传感器硬件的情况下进行长期跟踪。为了解决这一问题,我们提出了MapMonoEgo,一个新颖的框架,通过利用预先扫描的3D点云,仅使用单目相机实现全球一致的人类姿态估计。我们还引入了AIST-Living数据集,这是一个将自我中心视频与扫描环境中的真实运动配对的新数据集。实验表明,我们的方法显著优于最先进的基线,证明了其在没有专用硬件的情况下进行实际监测任务的实用性。
cs.CV / 88 / 2605.20891
HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction
HDMoE:一种用于多模态癌症生存预测的分层解耦-融合专家混合框架
Abstract
Multimodal survival prediction, a crucial yet challenging task, demands the integration of multimodal medical data (\eg Whole Slide Images (WSIs) and Genomic Profiles) to achieve accurate prognostic modeling. Given the inherent heterogeneity across modalities, the feature decoupling-fusion paradigm has emerged as a dominant approach. However, these methods have the following shortcomings: (1) fail to reduce the redundant information of modality features before decoupling, which negatively affects the feature decoupling and fusion effect;(2) lack the ability to model the fine-grained relationships of the features and capture the local information interactions between intra- and inter-modality features. To address these issues, we propose a \underline{H}ierarchical \underline{D}ecoupling-Fusion \underline{M}ixture-\underline{o}f-\underline{E}xperts (HDMoE) framework with two levels of MoE and \underline{R}andom \underline{F}eature \underline{R}eorganization (RFR) modules.In the first-level MoE, shared experts and routed experts are employed to remove redundant information and extract fine-grained specific features within each modality, while the second-level MoE facilitates fine-grained inter-modality feature decoupling. Besides, we design two RFR modules following each level of MoE to finely fuse intra- and inter-modality features, which can help the model capture more fine-grained relationships between modalities. Extensive experimental results on our private Liver Cancer (LC) and three TCGA public datasets confirm the effectiveness of our proposed method. Codes are available at https://github.com/ZJUMAI/HDMoE.
Chinese Translation
多模态生存预测是一项关键但具有挑战性的任务,需要整合多模态医疗数据(例如全切片图像(WSIs)和基因组特征)以实现准确的预后建模。鉴于不同模态之间固有的异质性,特征解耦-融合范式已成为主流方法。然而,这些方法存在以下不足:(1)在解耦之前未能减少模态特征的冗余信息,负面影响特征解耦和融合效果;(2)缺乏建模特征之间细粒度关系的能力,无法捕捉模态内和模态间特征的局部信息交互。为了解决这些问题,我们提出了一种分层解耦-融合专家混合框架(HDMoE),该框架具有两个层次的专家混合(MoE)和随机特征重组(RFR)模块。在第一层MoE中,使用共享专家和路由专家来去除冗余信息并提取每个模态内的细粒度特征,而第二层MoE则促进细粒度的模态间特征解耦。此外,我们在每个MoE层后设计了两个RFR模块,以精细融合模态内和模态间特征,从而帮助模型捕捉模态之间更细粒度的关系。在我们私有的肝癌(LC)数据集和三个TCGA公共数据集上的大量实验结果证实了我们提出方法的有效性。代码可在 https://github.com/ZJUMAI/HDMoE 获得。
cs.CV / 89 / 2605.20892
FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition
FruitEnsemble:用于细粒度水果识别的多模态大语言模型引导的异构集成仲裁
Abstract
Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first constructed a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimized the training pipeline with a hard sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49\% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.
Chinese Translation
细粒度水果分类是农业计算机视觉中的一项关键而具有挑战性的任务,主要受到高质量数据集严重短缺和类别之间高度视觉相似性的阻碍。为了解决这些挑战,我们首先构建了一个包含306种水果类别和116,233个样本的综合数据集。此外,我们提出了FruitEnsemble,这是一种实用的两阶段动态推理框架,旨在克服静态单模型架构的泛化限制。在第一阶段,FruitEnsemble采用经过验证校准的异构骨干网络加权集成,以生成一个稳健的Top-3候选池。为了处理困难样本,我们引入了一种专家仲裁机制:当集成置信度低于0.6时,触发多模态大语言模型(MLLM)通过结合外部植物描述进行严格的视觉验证,采用链式思维(Chain-of-Thought, CoT)推理。此外,我们通过关注困难样本的联合损失优化了训练流程。大量实验表明,FruitEnsemble实现了70.49%的分类准确率,并超越了现有的最先进模型。我们的框架为现实世界农业视觉分拣和质量检测任务提供了一种高效的、面向部署的解决方案。
cs.CV / 90 / 2605.20901
VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026
VISTA:EgoVis 2026年Ego4D短期物体交互预测的技术报告
Abstract
We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.
Chinese Translation
我们提出了VISTA,一种用于EgoVis 2026年Ego4D短期物体交互预测(STA)挑战的V-JEPA集成StillFast时间预测器。给定一个自我中心的视频时间戳,该任务要求预测下一个人-物体交互,包括未来活动物体的边界框、名词类别、动词类别、接触时间和置信度分数。VISTA遵循StillFast风格的设计,将以物体为中心的空间检测与短期时间上下文相结合。具体而言,经过COCO预训练的Faster R-CNN ResNet-50 FPN检测器从最后观察到的高分辨率帧生成物体提议,而冻结的V-JEPA 2.1时间分支从观察到的视频中提取剪辑级自我中心上下文。时间表示通过特征调制和ROI级上下文融合注入到检测路径中。融合后的提议特征随后传递给多头STA预测器进行边界框细化、名词分类、动词分类、接触时间回归和交互置信度估计。对于最终提交,我们进一步集成互补预测以提高鲁棒性。在官方挑战服务器上的实验结果表明,VISTA在EgoVis 2026年Ego4D STA挑战中获得第一名。我们的代码将发布在 https://github.com/CorrineQiu/VISTA。
cs.CV / 91 / 2605.20904
JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026
JFAA:2026年EgoVis EPIC-KITCHENS-100动作预测挑战的技术报告
Abstract
We propose JFAA, a JEPA-based Future Action Anticipation method for the EPIC-KITCHENS-100 (EK-100) Action Anticipation task. Inspired by the representation learning and future prediction ability of V-JEPA 2.1, JFAA uses a frozen encoder and predictor to extract observed context features and near-future latent tokens. A lightweight attentive probe is then trained to predict verb, noun, and action logits with separate task queries. To improve robustness, we further build a field-aware ensemble over selected epoch-level predictions, allowing each output field to benefit from its most reliable candidates. Experimental results on the official challenge server show that JFAA achieves first place in the EgoVis 2026 EK-100 Action Anticipation Challenge. Our code will be released at https://github.com/CorrineQiu/JFAA.
Chinese Translation
我们提出了JFAA,一种基于JEPA的未来动作预测方法,针对EPIC-KITCHENS-100 (EK-100) 动作预测任务。JFAA受到V-JEPA 2.1的表示学习和未来预测能力的启发,使用冻结的编码器和预测器来提取观察到的上下文特征和近未来的潜在标记。然后,训练一个轻量级的注意力探针,以独立的任务查询预测动词、名词和动作的logits。为了提高鲁棒性,我们进一步构建了一个领域感知的集成方法,基于选定的时期级预测,使每个输出领域能够从其最可靠的候选者中受益。官方挑战服务器上的实验结果表明,JFAA在2026年EgoVis EK-100动作预测挑战中获得了第一名。我们的代码将发布在https://github.com/CorrineQiu/JFAA。
cs.CV / 92 / 2605.20908
SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concepts and Complementary Neural Branches
SynCB:一种基于协同概念的模型,具有概念与互补神经分支之间的动态路由
Abstract
Concept-based (CB) models provide interpretability and support test-time human intervention, while standard neural networks (NN) offer strong task performance but little transparency. Prior work has explored hybrid formulations that integrate concepts and additional representations to improve accuracy, often at the cost of human interventions. We introduce the \emph{Synergy Concept-Based Model (SynCB)} framework, that combines a CB branch with a complementary neural branch, and a trainable routing module that dynamically selects which branch to use for each input. Unlike prior models, which fuse residual and concept-based predictions, SynCB keeps the two branches distinct and coordinates them through the routing module. Moreover, both branches are learned jointly, allowing information sharing between the complementary neural branch and CB branches through their common backbone. To improve responsiveness to interventions, we further introduce a test-time intervention policy and a corresponding loss. Across five datasets and CB benchmarks, SynCB consistently achieves higher task accuracy while remaining more responsive to human interventions, surpassing the full neural baseline by up to 3.9 percentage points and exceeding the strongest competitor in intervention performance by up to 6.43 percentage points.
Chinese Translation
基于概念(CB)模型提供了解释性并支持测试时的人为干预,而标准神经网络(NN)则在任务性能上表现强劲但透明度较低。先前的研究探索了将概念与额外表示相结合的混合形式,以提高准确性,通常以牺牲人为干预为代价。我们提出了 extit{协同概念基础模型(SynCB)}框架,该框架结合了一个CB分支和一个互补神经分支,以及一个可训练的路由模块,动态选择每个输入使用哪个分支。与先前的模型不同,SynCB保持两个分支的独立性,并通过路由模块进行协调。此外,两个分支是联合学习的,允许通过它们的共同骨干进行信息共享,以增强互补神经分支与CB分支之间的联系。为了提高对干预的响应能力,我们进一步引入了一种测试时干预策略及相应的损失函数。在五个数据集和CB基准测试中,SynCB始终实现了更高的任务准确性,同时对人为干预的响应性更强,超越了完整神经基线最多3.9个百分点,并在干预性能上超过了最强竞争者最多6.43个百分点。
cs.CV / 93 / 2605.20910
FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching
FlowLong:通过流形约束的Tweedie匹配进行推理时长视频生成
Abstract
Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via \emph{Tweedie matching} to enforce both \textbf{manifold constraint and temporal consistency} across overlap regions. \emph{Stochastic early-phase sampling} then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.
Chinese Translation
将视频扩散模型的生成范围扩展到长序列仍然是一个长期存在的重要挑战。现有的无训练方法分为两类:一类是双向模型的扩展,这些模型与特定架构紧密耦合,并且在长时间范围内质量下降;另一类是自回归模型,由于曝光偏差而累积漂移误差,往往产生重复的运动模式。为了解决这些问题,我们提出了一种新颖而简单的推理时长视频生成方法,该方法与架构无关且无需额外训练。我们的方法通过重叠滑动窗口生成长视频,其中来自相邻窗口的预测干净样本通过 extit{Tweedie匹配}进行融合,以在重叠区域内强制执行 extbf{流形约束和时间一致性}。 extit{随机早期阶段采样}随后通过在每次Tweedie匹配校正后的高噪声阶段注入新噪声来同步每个窗口的轨迹,然后过渡到确定性ODE采样,以保持细致的视觉保真度。应用于各种视频生成模型,我们的方法生成的视频长度是原始窗口长度的几倍,同时在时间一致性和视觉质量上优于无训练和自回归基线,并进一步扩展到音视频联合生成和文本到3DGS,而无需任何微调。
cs.CV / 94 / 2605.20914
RISE: Reliable Improvement in Self-Evolving Vision-Language Models
RISE:自我演化视觉-语言模型的可靠改进
Abstract
Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose \textbf{RISE}, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at https://github.com/AMAP-ML/RISE.
Chinese Translation
视觉-语言模型(VLMs)已实现强大的多模态推理能力,但进一步提升仍然严重依赖于大规模人工构建的后训练监督。这种监督获取成本高昂,尤其是在需要精心设计问题、答案和反馈信号的推理密集型多模态任务中。这促使了自我演化学习的出现,其中模型通过双重角色闭环自我改进:提问者自主提出问题,解答者学习解决这些问题。然而,我们观察到当前的VLM自我演化方法仍面临三个主要挑战:粗粒度角色交替延迟了问题生成与解答者适应之间的互动;生成的问题质量可能逐渐下降;问题类型可能会向狭窄的分布收敛。这些问题限制了自我演化的效率和可靠性。因此,我们提出了 extbf{RISE},一个针对视觉-语言模型的可靠自我演化框架。RISE基于三种互补设计构建:细粒度角色交替,缩短提问者与解答者之间的反馈循环以提高效率;质量监督,提高问题的有效性和伪标签的可靠性;以及技能感知动态平衡,缓解模式崩溃并在演化过程中保持广泛的技能覆盖。这些组件共同使得从未标记图像中实现更可靠和有效的自我演化成为可能。在七个基准测试中对两个VLM骨干网络的实验表明,RISE始终改善基础模型,带来广泛而持续的收益。我们的代码已公开发布在 https://github.com/AMAP-ML/RISE。
cs.CV / 95 / 2605.20940
3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat
三维重建与知识蒸馏提升多视角图像模型以探索小麦穗体积估计
Abstract
Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstructions are computationally expensive. Direct 2D image processing would offer computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach with knowledge distillation during training while enabling efficient image-only inference. First, we train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations. We then combine the 3D model with a proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we distill the ensemble knowledge into a purely image-based student model using either feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm$^3$ of the non-distilled RT to 639.93 mm$^3$ and 644.62 mm$^3$, and increase correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time is reduced from 160 ms to 1.4 ms per spike. Distillation further mitigates volume-dependent bias and reshapes the latent representation of the image model toward a geometry-aware shape. Our results demonstrate that 3D-informed training of a 2D Transformer allows for scalable and efficient spike volume estimation for high-throughput field phenotyping.
Chinese Translation
准确估计小麦穗体积对于产量组成分析和抗逆性评估至关重要,但基于田间的测量仍然具有挑战性。主动三维感知方法,如光探测与测距(LiDAR)或飞行时间(ToF),对植物运动敏感或不适合户外条件,而三维重建的计算成本较高。直接的二维图像处理虽然在计算上具有优势,但基于图像的模型缺乏明确的几何信息。因此,我们提出了一种混合的二维-三维方法,在训练过程中进行知识蒸馏,同时实现高效的仅基于图像的推理。首先,我们使用基于距离的直方图特征训练一个刚体不变点云网络,以获得姿态鲁棒的几何表示。然后,我们将三维模型与提出的多视角图像基础的调节变换器(RT)结合在一个集成架构中。最后,我们通过特征基础或标签基础的蒸馏将集成知识蒸馏到一个纯图像基础的学生模型中。两个蒸馏后的RT将非蒸馏RT的平均绝对误差(MAE)从654.31 mm³降低到639.93 mm³和644.62 mm³,并将相关性从0.76提高到0.77和0.82。同时,推理时间从每个穗160毫秒减少到1.4毫秒。蒸馏进一步减轻了体积依赖的偏差,并重塑了图像模型的潜在表示,使其朝向几何感知的形状发展。我们的结果表明,基于三维信息的二维变换器训练能够实现高通量田间表型分析中可扩展且高效的穗体积估计。
cs.CV / 96 / 2605.20941
PaintCopilot: Modeling Painting as Autonomous Artistic Continuation
PaintCopilot:将绘画建模为自主艺术延续
Abstract
We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process.
Chinese Translation
我们提出了PaintCopilot,这是一种共创的神经绘画助手,它将绘画建模为一种开放式自回归艺术行为,基于不断演变的画布状态和先前的笔触历史,而无需目标图像。与现有的神经绘画方法将绘画视为朝向预定义参考的像素重建不同,PaintCopilot直接从学习到的艺术动态中预测未来的笔触,类似于大型语言模型如何根据先前的上下文继续文本序列。该框架提出了三个互补的模型:一个基于ViT的目标预测器,从部分画布观察中推断艺术家的意图;一个自回归的下一笔预测器,通过流匹配生成时间上连贯的笔触;以及一个基于变分自编码器(VAE)的区域采样器,根据需求合成语义上局部化的笔触序列。该系统建立在三种可微分的笔刷表示(硬圆形、笔刷尖和二维高斯)之上,支持四种交互式工作流程:优化历史、笔触完成、区域修复和动态笔刷。通过与专业艺术家的案例研究,我们展示了PaintCopilot能够实现流畅的共创绘画工作流程,使艺术家与人工智能在整个创作过程中不断交替控制。
cs.CV / 97 / 2605.20942
Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding
架构与语言的桥梁:基于图的视觉推理用于自主道路理解
Abstract
Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs - including large, closed-source models - struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models reduce failures to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.
Chinese Translation
对车道几何、拓扑和交通元素关系的结构化道路理解是安全自主驾驶的基础。尽管视觉语言模型(VLMs)提供了有前景的语义灵活性,但它们缺乏进行精确道路推理所需的几何和关系基础。相反,传统的模块化系统,如高清地图(HD maps)和拓扑道路图,提供了结构精确性,但在语义上仍然僵化。为了弥合这一差距,我们提出了组合道路基底(Combined Road Substrate, CRS),这是一个基于图的框架,能够在单一表示中将几何道路结构和开放词汇语义共同执行。CRS通过递归图查询自动生成组合复杂且语言多样的问题-答案对,并增强了“免费的基础”机制,确保逻辑可追溯到特定地图元素,以及程序化提取的思维链监督痕迹。我们证明了最先进的VLMs,包括大型闭源模型,在结构化道路推理方面显著困难,然而,使用仅20到80个CRS增强场景训练一个小型的20亿或40亿参数模型,能够在不同深度的组合推理任务中获得稳定的提升。通过可验证的推理痕迹分析模型行为,揭示了故障模式的系统性转变:基线模型在关系场景理解上失败,而CRS训练的模型则将失败减少到属性识别,表明道路理解的主要瓶颈并非模型规模,而是缺乏结构化监督。
cs.CV / 98 / 2605.20950
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
聚焦-再到上下文:面向主体的渐进式视觉标记减少方法用于视觉-语言模型
Abstract
Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentionally preserve the isolated visual subject strictly aligned with the user's query, which fails to substantially explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the \textit{Focus-then-Context} mechanism of the human visual perception system. Specifically, we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance. Herein, it can excavate the comprehensive visual subject spectrum to ensure a high-fidelity representation of visual input. Subsequently, a context-aware structural scanning module is developed to aggregate contextual cues from neighboring regions. As such, it can effectively restore global relational dependencies to uphold the structural integrity of the preserved subjects. Extensive experiments demonstrate that our paradigm consistently outperforms SOTA methods, achieving up to 2.53 times speedup with only 22.2% of visual tokens retained in Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with a negligible 0.6% accuracy drop.
Chinese Translation
视觉-语言模型(VLMs)在推理过程中面临着由于庞大的视觉标记序列而导致的高昂计算成本瓶颈。现有的视觉标记减少方法虽然缓解了这一负担,但它们无意中保留了与用户查询严格对齐的孤立视觉主体,未能实质性地探索显著主体及其上下文关系。本文提出了SPpruner,一种面向主体的渐进式减少范式,模拟了人类视觉感知系统的“聚焦-再到上下文”(Focus-then-Context)机制。具体而言,我们首先构建了一个聚焦识别模块,以明确建模视觉显著性与语义相关性之间的相互作用。在此过程中,它可以挖掘全面的视觉主体谱,以确保对视觉输入的高保真表示。随后,开发了一个上下文感知的结构扫描模块,以聚合来自邻近区域的上下文线索。因此,它能够有效恢复全球关系依赖性,以维护保留主体的结构完整性。大量实验表明,我们的范式在性能上始终优于现有最先进的方法,在Qwen2.5-VL中实现了高达2.53倍的加速,同时仅保留22.2%的视觉标记,并在LLaVA上实现了67%的FLOPs减少,且准确率仅下降了0.6%。
cs.CV / 99 / 2605.20955
DrawMotion: Generating 3D Human Motions by Freehand Drawing
DrawMotion:通过自由手绘生成3D人类动作
Abstract
Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.
Chinese Translation
文本到动作生成(Text-to-motion generation)将文本描述转化为人类动作,但面临着用户常常难以仅通过文本准确传达其意图动作的挑战。为了解决这一问题,本文提出了DrawMotion,一个高效的基于扩散的框架,旨在多条件场景下使用。DrawMotion基于传统文本条件和新颖的手绘条件生成动作,分别提供对生成动作的语义和空间控制。具体而言,我们从三个方面解决细粒度动作生成任务:1)自由手绘条件。为了准确捕捉用户的意图动作而无需繁琐的文本输入,我们开发了一种算法,能够在不同数据集格式中自动生成手绘小人草图;2)多条件融合。我们提出了一个集成到扩散过程中的多条件模块(Multi-Condition Module, MCM),使模型能够利用所有可能的条件组合,同时与传统方法相比降低计算复杂性;3)无训练引导。值得注意的是,DrawMotion中的MCM确保其中间特征位于连续空间中,使得分类器引导梯度能够更新特征,从而使生成的动作与用户意图对齐,同时保持保真度。定量实验和用户研究表明,自由手绘方法在生成与用户想象一致的动作时,减少了约46.7%的用户时间。代码、演示和相关数据可在 https://github.com/InvertedForest/DrawMotion 上公开获取。
cs.CV / 100 / 2605.20961
Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning
保留、揭示、扩展:基于区域感知条件的忠实4D视频编辑
Abstract
Existing 4D-driven video diffusion models primarily target plausible generation, but faithful 4D editing requires preserving source-observed regions while synthesizing disoccluded or out-of-view content. We identify Evidence-Role Mismatch: reliable source-backed evidence, unreliable rendered cues, and unsupported regions are entangled in a single conditioning signal, causing preservation drift, ghosting, and unstable extrapolation. We propose PREX (Preserve, Reveal, Expand), a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent. PREX builds observation-backed appearance cues with calibrated confidence and injects them into a frozen video diffusion backbone through a region-aware adapter, trained with proxy tasks without requiring paired edited videos. We further introduce PREBench, a diagnostic benchmark with curated edits, region-role masks, and human-aligned metrics that complement global video-quality and 4D-control evaluations. Experiments show that PREX reduces region-structured failures while maintaining strong visual quality and 4D edit control capability. Project Page: https://ricepastem.github.io/PREX-Open
Chinese Translation
现有的4D驱动视频扩散模型主要针对合理生成,但忠实的4D编辑需要在合成未遮挡或视野外内容的同时保留源观察区域。我们识别出证据角色不匹配的问题:可靠的源支持证据、不可靠的渲染线索和不支持区域交织在单一的条件信号中,导致保留漂移、鬼影和不稳定的外推。我们提出了PREX(保留、揭示、扩展),一个区域感知框架,根据观察支持和场景范围将目标时空体分解为保留、揭示和扩展角色。PREX构建了基于观察的外观线索,具有校准的置信度,并通过区域感知适配器将其注入到冻结的视频扩散骨干中,该适配器通过代理任务进行训练,无需配对的编辑视频。我们进一步引入了PREBench,一个诊断基准,包含策划的编辑、区域角色掩码和与人类对齐的指标,补充了全局视频质量和4D控制评估。实验表明,PREX在保持强视觉质量和4D编辑控制能力的同时,减少了区域结构失败。项目页面:https://ricepastem.github.io/PREX-Open
cs.CV / 101 / 2605.20963
Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method
面向现实世界的无人机检测:新的多光谱数据集UAVNet-MS及新方法
Abstract
The proliferation of unmanned aerial vehicles (UAVs) has created urgent demand for precise UAV monitoring. Existing RGB-based systems rely on spatial cues that degrade at small scales, particularly with high inter-type similarity, target-clutter ambiguity, and low contrast. Multispectral imaging (MSI) encodes material-aware spectral signatures, yet MSI-based fine-grained small-UAV detection remains underexplored due to lack of dedicated datasets. We introduce UAVNet-MS, the first multispectral dataset for fine-grained small-UAV detection, comprising 15,618 temporally synchronized RGB-MSI data cubes (1440x1080) with bounding box annotations. The dataset features challenging small objects (93.7% <= 32^2 pixels, average 18^2 pixels, ~0.02% image area) under low contrast. We propose MFDNet, a dual-stream baseline addressing array-induced parallax and spatial-spectral fusion. Extensive evaluation under RGB-only, MSI-only, and RGB+MSI protocols against 20 detectors shows MFDNet achieves +6.2% AP50 improvement over best RGB-only methods, demonstrating spectral cues provide complementary material evidence beyond spatial cues. This work provides foundational dataset, strong baseline, and benchmark for multispectral UAV monitoring research.
Chinese Translation
无人机(UAV)的迅速普及对精确的无人机监测提出了迫切需求。现有的基于RGB的系统依赖于空间线索,但在小尺度下这些线索会退化,尤其是在高类型相似性、目标与杂波模糊以及低对比度的情况下。多光谱成像(MSI)编码了材料感知的光谱特征,然而由于缺乏专门的数据集,基于MSI的细粒度小型无人机检测仍然未得到充分探索。我们介绍了UAVNet-MS,这是第一个用于细粒度小型无人机检测的多光谱数据集,包含15,618个时间同步的RGB-MSI数据立方体(1440x1080),并附有边界框注释。该数据集在低对比度下具有挑战性的小物体(93.7% <= 32^2 像素,平均18^2 像素,约0.02%图像面积)。我们提出了MFDNet,这是一种双流基线方法,旨在解决阵列引起的视差和空间-光谱融合问题。在RGB-only、MSI-only和RGB+MSI协议下对20个检测器进行了广泛评估,结果表明MFDNet在AP50上比最佳的RGB-only方法提高了6.2%,证明了光谱线索提供了超越空间线索的补充材料证据。本研究为多光谱无人机监测研究提供了基础数据集、强基线和基准。
cs.CV / 102 / 2605.20965
Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy
在不遗忘的情况下找到正确的视觉证据:通过层间视觉注意力差异减轻大规模视觉语言模型中的幻觉
Abstract
Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.
Chinese Translation
大型视觉语言模型(LVLMs)在各种视觉语言任务中表现出色。尽管取得了这些进展,但它们仍然容易出现幻觉,生成与视觉内容不一致的响应。在本研究中,我们发现LVLMs在对正确的视觉证据关注不足时,容易产生幻觉,并在生成过程中逐渐遗忘这些证据。我们通过实证研究发现,尽管LVLMs整体上对视觉证据的关注不足,但在特定层中对正确视觉证据表现出敏感性,且存在显著的层间差异。基于这一观察,我们提出了一种新颖的幻觉减轻方法,该方法基于层间视觉注意力差异(Inter-Layer Visual Attention Discrepancy, ILVAD)增强视觉证据。具体而言,我们从早期生成的标记到各层的视觉标记中获取注意力权重,并识别出作为视觉证据反复激活的标记,形成显著性图。然后,我们通过显著性图在生成过程中增强对视觉证据的关注,以减少视觉遗忘。此外,我们利用显著性图获取生成文本与视觉证据的注意力得分,以选择和强调与视觉证据紧密相关的文本标记。我们的方法不需要训练,且易于使用。在对五个最近发布的模型进行的多项基准评估中,我们的方法在不同的LVLM架构中一致地减轻了幻觉。代码可在 https://github.com/ytx-ML/ILVAD 获取。
cs.CV / 103 / 2605.20971
Comparative Evaluation of Deep Learning Models for Fake Image Detection
假图像检测的深度学习模型比较评估
Abstract
The growing sophistication of GAN-based image manipulation presents significant challenges for digital forensics. This study compares the performance of four pretrained CNN architectures including VGG16, ResNet50, EfficientNetB0, and XceptionNet for fake image detection using a unified preprocessing and training pipeline. A dataset of real and manipulated images was processed through resizing, normalization, and augmentation to address class imbalance and improve generalization. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 achieved the highest accuracy at 91%, with XceptionNet, ResNet50, and EfficientNetB0 each reaching 90%. EfficientNetB0 showed stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. Limitations include dataset imbalance, overfitting, and limited interpretability, which affect cross-domain robustness. The study provides a reproducible baseline and underscores the need for balanced datasets, advanced augmentation, and fairness-aware training to develop reliable fake image detection systems.
Chinese Translation
基于生成对抗网络(GAN)的图像处理技术日益复杂,给数字取证带来了重大挑战。本研究比较了四种预训练卷积神经网络(CNN)架构的性能,包括VGG16、ResNet50、EfficientNetB0和XceptionNet,旨在使用统一的预处理和训练流程进行假图像检测。我们对真实和处理过的图像数据集进行了处理,包括调整大小、归一化和数据增强,以解决类别不平衡问题并提高模型的泛化能力。模型评估指标包括准确率、精确率、召回率、F1分数和ROC-AUC。结果显示,VGG16的准确率最高,达91%,而XceptionNet、ResNet50和EfficientNetB0的准确率均为90%。EfficientNetB0对假图像表现出更强的敏感性,但在真实样本上的可靠性较低,反映了由于不平衡导致的偏差。研究的局限性包括数据集不平衡、过拟合和有限的可解释性,这些因素影响了跨领域的鲁棒性。本研究提供了可重复的基准,并强调了平衡数据集、先进的数据增强和公平意识训练在开发可靠的假图像检测系统中的必要性。
cs.CV / 104 / 2605.20973
Towards Integrated Rock Support Visualisation in 3D Point Cloud of Underground Mines
面向地下矿山三维点云的综合岩石支护可视化
Abstract
The effectiveness of rock support in underground mines depends on the interaction between installed rock bolts and the structural fabric of the surrounding rock mass. However, discontinuity characterisation and rock bolt identification are commonly treated as separate tasks, limiting their value for integrated support assessment. This study presents an automated framework for integrated rock support visualisation using 3D point clouds of underground mine excavations. The framework integrates structure mapping, rock bolt identification, discontinuity plane fitting, and bolt orientation estimation into a unified workflow optimised for accuracy and computational efficiency. The outputs are used to generate an integrated 3D visualisation of fitted discontinuity planes and bolt vectors, enabling direct assessment of their spatial intersections and geometric relationships. A complementary stereographic analysis of discontinuity poles and bolt orientations is also performed to evaluate overall bolting geometric effectiveness relative to the mapped structural fabric. Additionally, bolt-level quality metrics, including exposed protrusion length and deviation from the local roof normal, are visualised to support assessment of installation quality. The proposed framework is demonstrated on real underground metal mine scans, producing accurate structure mapping and rock bolt identification results in medium-scale point clouds. Overall, the study provides a practical step towards automated, integrated geotechnical assessment of rock support effectiveness without requiring manual measurements or additional in-situ data acquisition.
Chinese Translation
地下矿山中岩石支护的有效性依赖于已安装岩锚与周围岩体结构之间的相互作用。然而,断层特征化和岩锚识别通常被视为独立任务,这限制了它们在综合支护评估中的价值。本研究提出了一种基于地下矿山开挖三维点云的综合岩石支护可视化自动化框架。该框架将结构映射、岩锚识别、断层面拟合和锚杆方向估计整合为一个统一的工作流程,优化了准确性和计算效率。输出结果用于生成拟合断层面和锚杆矢量的综合三维可视化,使得能够直接评估它们的空间交点和几何关系。同时,还对断层极点和锚杆方向进行了补充的立体分析,以评估相对于已映射结构特征的整体锚固几何有效性。此外,锚杆级别的质量指标,包括暴露突出长度和与局部顶板法线的偏差,也被可视化,以支持安装质量的评估。该框架在实际的地下金属矿山扫描中得到了验证,在中等规模的点云中产生了准确的结构映射和岩锚识别结果。总体而言,本研究为实现岩石支护有效性的自动化、综合地质技术评估提供了一个切实可行的步骤,而无需手动测量或额外的原位数据采集。
cs.CV / 105 / 2605.20992
CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction
CHOIR:基于接触感知的4D手物交互重建
Abstract
We ask whether everyday open-world monocular videos can be turned into reusable 4D interaction primitives: articulated hand motion, object shape with 6D pose over time, and the when/where of contact. Such a capability would enable scalable mining of real interactions and, beyond reconstruction, support scene-aware synthesis and planning. However, reconstructing hand-object interaction (HOI) from challenging monocular videos remains difficult: methods often assume known objects or curated scenes, and separately estimated hands and objects easily become misaligned under clutter, occlusion, and unseen object geometries. Targeting this setting, we present CHOIR, a Contact-aware HOI Reconstruction framework for a monocular camera, using contact as an explicit coupling signal between hands and objects. CHOIR first initializes a coarse, contact-agnostic 4D HOI sequence from open-world visual priors. It then introduces a generative HOI spatial rectification module to predict ray-depth corrections and rectify hand-object relative placement, then derive initial per-frame contact correspondences on the rectified geometry. Last, a contact-aware joint optimization with dynamically updated contact constraints enforces geometric, temporal, and contact consistency. Experiments on controlled and challenging videos show that CHOIR improves object reconstruction, physical plausibility, and temporal consistency over state-of-the-art methods.
Chinese Translation
我们探讨日常开放世界的单目视频是否可以转化为可重复使用的4D交互原语:关节手部运动、随时间变化的物体形状及其6D位姿,以及接触的时间和地点。这种能力将使得对真实交互的可扩展挖掘成为可能,并且在重建之外,支持场景感知的合成和规划。然而,从具有挑战性的单目视频中重建手物交互(HOI)仍然困难:现有方法通常假设已知物体或经过整理的场景,且单独估计的手和物体在杂乱、遮挡和未见物体几何形状下容易出现错位。针对这一问题,我们提出了CHOIR,一个基于接触感知的HOI重建框架,适用于单目相机,利用接触作为手与物体之间的显式耦合信号。CHOIR首先从开放世界的视觉先验中初始化一个粗略的、与接触无关的4D HOI序列。然后,它引入一个生成式HOI空间校正模块,以预测光线深度修正并校正手物相对位置,进而在校正后的几何体上推导出初始的逐帧接触对应关系。最后,通过动态更新的接触约束进行接触感知的联合优化,以确保几何、时间和接触的一致性。在受控和具有挑战性的视频上的实验表明,CHOIR在物体重建、物理合理性和时间一致性方面优于最先进的方法。
cs.CV / 106 / 2605.20997
Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data
基于TanDEM-X和Landsat数据的森林高度估计混合机器学习模型
Abstract
Integrating machine learning (ML) with physical models (PM) has emerged as a promising way of retrieving geophysical parameters from remote sensing data. In this context, a ML model for estimating forest height from TanDEM-X interferometric coherence measurements has recently been proposed, that constrains the learning process through a PM. While the features used for training and inversion where selected to ensure the physical consistency of the solutions, they could not resolve all height / structure and baseline / terrain slope ambiguities in the data. To improve this, the extension of the feature space with optical Landsat data is proposed able to provide complementary information on forest type or structure. The extended model is applied and validated on several TanDEM-X acquisitions over the Gabonese Lop\'e national park site and assessed against airborne LiDAR measurements. Results show a 13.5% reduction in RMSE and a 16.6% reduction in MAE compared to the original hybrid model, confirming the added value of multispectral inputs.
Chinese Translation
将机器学习(ML)与物理模型(PM)相结合,已成为从遥感数据中提取地球物理参数的一种有前景的方法。在此背景下,最近提出了一种基于TanDEM-X干涉相干测量的森林高度估计ML模型,该模型通过PM约束学习过程。虽然用于训练和反演的特征经过选择以确保解的物理一致性,但它们无法解决数据中的所有高度/结构和基线/地形坡度的歧义。为此,提出了扩展特征空间的方法,结合光学Landsat数据,以提供关于森林类型或结构的补充信息。扩展模型在加蓬Lopé国家公园的多次TanDEM-X获取数据上进行了应用和验证,并与机载LiDAR测量结果进行了比较。结果显示,与原始混合模型相比,均方根误差(RMSE)减少了13.5%,平均绝对误差(MAE)减少了16.6%,确认了多光谱输入的附加价值。
cs.CV / 107 / 2605.21001
DAMA: Disentangled Body-Anchored Gaussians for Controllable Multi-Layered Avatars
DAMA:可控多层次虚拟形象的解耦身体锚定高斯模型
Abstract
Existing 3D clothed avatar reconstruction methods achieve high visual fidelity but ignore geometric structure and physical plausibility. They either model clothed humans as a single deformable surface or attempt garment disentanglement without enforcing geometric constraints, resulting in ambiguous garment boundaries and no control over stacking or layer ordering. To address these limitations, we introduce DAMA (Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars), a 3D avatar reconstruction method that produces physically plausible clothed avatars through a dedicated representation and reconstruction method. At the representation level, we bind Gaussians to SMPL-X faces using barycentric in-plane coordinates and a positive normal offset. Based on this parameterization, the reconstruction method lifts 2D segmentations to body-anchored Gaussians, refines layers using topology-guided correction, and jointly optimizes geometry and appearance. DAMA is the first Gaussian avatar reconstruction method from multi-view images to achieve physically plausible layering, clean garment separation, and explicit stacking control. On the full 4D-DRESS dataset (82 scans), it achieves state-of-the-art performance in geometry reconstruction, garment separation, penetration rate, and penetration depth. The representation further supports user-defined garment reordering and fast conversion of body-conforming garments to simulation-ready meshes. Project Page: https://danieleskandar.github.io/dama/
Chinese Translation
现有的3D穿衣虚拟形象重建方法虽然在视觉逼真度上表现优异,但忽视了几何结构和物理合理性。它们要么将穿衣人类建模为单一可变形表面,要么尝试在不强加几何约束的情况下进行服装解耦,导致服装边界模糊且无法控制叠加或层次顺序。为了解决这些局限性,我们提出了DAMA(Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars),这是一种通过专门的表示和重建方法生成物理合理的穿衣虚拟形象的3D重建方法。在表示层面,我们使用重心平面坐标和正法向偏移将高斯模型绑定到SMPL-X面上。基于这种参数化,重建方法将2D分割提升为身体锚定的高斯模型,利用拓扑引导的校正来细化层次,并联合优化几何形状和外观。DAMA是首个从多视角图像中实现物理合理分层、清晰服装分离和显式叠加控制的高斯虚拟形象重建方法。在完整的4D-DRESS数据集(82个扫描)上,它在几何重建、服装分离、穿透率和穿透深度方面达到了最先进的性能。该表示进一步支持用户定义的服装重新排序和快速将符合身体的服装转换为适合模拟的网格。项目页面:https://danieleskandar.github.io/dama/
cs.CV / 108 / 2605.21007
LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation
LiteViLNet:用于高效道路分割的轻量级视觉-激光雷达融合网络
Abstract
Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose \textbf{LiteViLNet}, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36\% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.
Chinese Translation
道路分割是自动驾驶和智能机器人系统中的一项基础感知任务,要求具备高精度和实时推理能力,尤其是在资源受限的边缘设备上部署时。现有的多模态道路分割方法通常依赖于重型变换器(transformer)编码器以实现最先进的性能,但其巨大的计算成本阻碍了在嵌入式平台上的实时部署。为了解决这一困境,我们提出了 extbf{LiteViLNet},一种轻量级多模态网络,融合了RGB纹理信息和激光雷达几何信息,以实现高效的道路分割。具体而言,我们设计了一个双流轻量级编码器和深度可分离卷积,以最小的参数从两种模态中提取层次特征。我们进一步提出了多尺度特征融合模块(Multi-Scale Feature Fusion Module, MSFM),以促进不同层次的跨模态交互,以及一个大核桥接模块,以线性复杂度捕捉长距离依赖。对KITTI道路数据集和实际应用的广泛实验表明,LiteViLNet在精度和效率之间实现了良好的平衡。值得注意的是,我们的模型仅有14.04M参数,达到了96.36 ext{%}的MaxF分数,在所有基于CNN的方法中排名第一,并且与更大的变换器模型相当,在RTX 4060 Ti上模型推理时运行速度为163.79 FPS(在Jetson Orin NX上为22.18 FPS)。它在推理速度上超越了众多重型方法,同时保持了高度竞争的精度,充分验证了LiteViLNet在自动驾驶和智能机器人领域实时嵌入式部署的潜力。
cs.CV / 109 / 2605.21028
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
DySink:用于自回归长视频生成的动态帧接收器
Abstract
Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.
Chinese Translation
自回归长视频生成通常采用有限内存流式处理以提高效率,通常将局部窗口与静态早期帧接收器结合,以实现短期连续性和长程锚定。然而,这种固定分配使得早期帧在当前视觉状态与其显著偏离时仍被缓存,同时丢弃了可能更相关的中间历史。因此,保留的长程上下文可能变得不够适应,并使生成偏向过时的线索;在严重情况下,由于 RoPE(相对位置编码)引起的相位重新对齐可能会使头间注意力同质化,并导致接收器崩溃,即内容回归到接收器帧。我们提出了 DySink,这是一种基于检索的框架,维护一个紧凑的记忆库,并选择视觉上相关的历史帧作为动态帧接收器。DySink 将自适应检索与接收器异常门结合,后者检测检索上下文中过度的头间共识,并抑制易崩溃的上下文。在对分钟长视频的实验中,DySink 在强基线的基础上始终提高了动态程度,同时也实现了更高的时间质量。代码和模型权重将发布在 https://github.com/yebo0216best/DySink。
cs.CV / 110 / 2605.21032
Towards Physically Consistent 4D Scene Reconstruction for Closed-loop Autonomous Driving Simulation
面向物理一致性的闭环自主驾驶仿真中的四维场景重建
Abstract
High-fidelity street scene reconstruction is pivotal for end-to-end autonomous driving simulation, where novel-view synthesis (NVS) and time-varying information modeling are two fundamental capabilities to facilitate closed-loop training. However, existing 3DGS methods and their 4D extensions fail to simultaneously achieve both. To bridge this gap, we establish an information-geometric diagnostic framework, revealing that this limitation stems from a credit assignment dilemma between spatial and temporal parameters. Specifically, the deterministic coupling between viewpoint and time in single-source observation creates a low-rank structure that induces massive null-space ambiguity between static view-dependent and dynamic time-varying components. Temporal information overshadows spatial cues, causing the estimation variance of spatial parameters to diverge. To address this issue, we propose Orthogonal Projected Gradient (OPG), a hierarchical training method designed to restore spatial identifiability. OPG prioritizes the integrity of spatial representations by securing them in an initial stage, then restricts temporal updates to the spatial null space, enabling proactive credit assignment. While OPG isolates temporal updates algebraically, Temporal Regularization Strategy is proposed to further refine the temporal solution space by imposing a smoothness constraint based on the physical prior of consistent appearance evolution, ensuring that the reconstructed scene remains physically consistent in closed-loop simulation. Extensive experiments demonstrate that our method not only maintains stable NVS capabilities but also demonstrates superior performance in traditional observation-reproducing metrics, which indirectly reflect the capability of modeling temporal dynamics.
Chinese Translation
高保真街景重建对于端到端自主驾驶仿真至关重要,其中新视角合成(NVS)和时变信息建模是促进闭环训练的两个基本能力。然而,现有的3DGS方法及其四维扩展未能同时实现这两者。为了解决这一问题,我们建立了一个信息几何诊断框架,揭示了这一限制源于空间和时间参数之间的信用分配困境。具体而言,单源观测中视点与时间之间的确定性耦合产生了一个低秩结构,从而在静态视点依赖和动态时变成分之间引入了大量的零空间歧义。时变信息掩盖了空间线索,导致空间参数的估计方差发散。为了解决这个问题,我们提出了正交投影梯度(Orthogonal Projected Gradient, OPG),这是一种旨在恢复空间可识别性的分层训练方法。OPG通过在初始阶段确保空间表示的完整性,然后将时变更新限制在空间零空间中,从而优先考虑空间表示的完整性,促进主动的信用分配。虽然OPG在代数上隔离了时变更新,但我们提出了时变正则化策略,以通过施加基于一致外观演变的物理先验的平滑性约束进一步细化时变解空间,确保重建的场景在闭环仿真中保持物理一致性。大量实验表明,我们的方法不仅保持了稳定的新视角合成能力,而且在传统观察再现指标中表现出优越的性能,这间接反映了建模时变动态的能力。
cs.CV / 111 / 2605.21042
Dynamic Video Generation: Shaping Video Generation Across Time and Space
动态视频生成:跨时间和空间塑造视频生成
Abstract
Diffusion models have achieved impressive performance in video generation, but their iterative denoising process remains computationally expensive due to the large number of tokens processed at each timestep. Recently, progressive resolution sampling has emerged as a promising acceleration approach by reducing latent resolution in early stages. However, scaling this idea to video generation remains challenging, as the additional temporal dimension introduces diverse spatio-temporal demands across different videos, and compressing only a single dimension often leads to limited acceleration or degraded quality. Therefore, we propose DVG, a Dynamic Video Generation framework that jointly allocates computation across time and space, automatically selecting content-aware acceleration strategies without manual tuning or retraining. DVG achieves near-lossless acceleration across models and tasks, reaching up to 7 times speedup on HunyuanVideo and HunyuanVideo-1.5, and 18 times when combined with distillation, demonstrating its potential as a key component in today's large-scale efficient video generation systems. Our code is in supplementary material and will be released on Github.
Chinese Translation
扩散模型在视频生成方面取得了令人瞩目的成果,但其迭代去噪过程由于每个时间步处理的大量标记而仍然计算开销巨大。最近,渐进分辨率采样作为一种有前景的加速方法出现,通过在早期阶段降低潜在分辨率。然而,将这一思想扩展到视频生成仍然面临挑战,因为额外的时间维度在不同视频之间引入了多样的时空需求,而仅压缩单一维度往往导致加速有限或质量下降。因此,我们提出了DVG(动态视频生成)框架,它在时间和空间上联合分配计算,自动选择内容感知的加速策略,无需手动调优或重新训练。DVG在模型和任务之间实现了近乎无损的加速,在HunyuanVideo和HunyuanVideo-1.5上达到最高7倍的加速,与蒸馏结合时可达到18倍,展示了其作为当今大规模高效视频生成系统关键组成部分的潜力。我们的代码在补充材料中,并将在Github上发布。
cs.CV / 112 / 2605.21059
Multimodal LLMs under Pairwise Modalities
成对模态下的多模态大型语言模型
Abstract
Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs by only leveraging multiple paired modalities as a surrogate for the full joint multimodal distribution. Specifically, we first provide a theoretical analysis of the conditions under which the representations are identifiable with only observing pairwise modalities. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. Specifically, in the first stage, we learn the shared latent space across modalities by both self-modal reconstruction and pair-wise contrastive learning. We also incorporate an inductive bias in the contrastive learning process by partially aligning and minimal latent specification. In stage two, we integrate the encoder of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by newly adding 3D point clouds and tactile modalities into pre-trained MLLMs with three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.
Chinese Translation
尽管多模态大型语言模型(MLLMs)取得了令人印象深刻的成果,但其训练通常依赖于联合策划的多模态数据,这需要大量人力来构建多向对齐的数据集,从而限制了跨领域的可扩展性。在本研究中,我们探索仅通过利用多个成对模态作为完整联合多模态分布的替代品来训练MLLMs。具体而言,我们首先提供了在仅观察成对模态的情况下,表示可识别的条件的理论分析。在此分析的基础上,我们提出了一种表示学习框架,通过仅使用成对数据来对齐跨模态的潜在表示。该框架由两个阶段组成:潜在表示对齐和跨模态重组。具体来说,在第一阶段,我们通过自模态重构和成对对比学习来学习跨模态的共享潜在空间。我们还通过部分对齐和最小潜在规范在对比学习过程中引入了归纳偏差。在第二阶段,我们将新引入模态的编码器与预训练模态的解码器集成,以促进跨模态的转移和生成。我们通过将3D点云和触觉模态新添加到具有三对模态的预训练MLLMs中来评估我们的方法,并显示通过学习对齐的潜在表示空间,我们的模型实现了强大的跨模态性能。
cs.CV / 113 / 2605.21061
Grounding Driving VLA via Inverse Kinematics
通过逆向运动学实现驾驶视觉语言模型的基础
Abstract
Existing Driving VLAs predict trajectories while largely ignoring their visual tokens -- a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B--8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.
Chinese Translation
现有的驾驶视觉语言模型(Driving VLAs)在预测轨迹时往往忽视其视觉标记——这一现象我们追溯到任务表述结构的不适定性,而非训练不足。我们表明,从逆向运动学的角度来看,轨迹恢复需要当前和未来的视觉状态作为边界条件;而现有的视觉语言模型仅提供前者,这促使模型仅通过自我状态和文本命令进行简化推理。为了解决这个问题,我们重新设计了驾驶视觉语言模型,使其采用逆向运动学求解器的风格。首先,要求大型语言模型(LLM)预测未来视觉场景的下一视觉状态预测目标提供了密集的视觉监督,并抑制了简化路径。其次,设计了一个独立的逆向运动学网络(Inverse Kinematics Network),该网络仅以当前和未来的视觉状态作为输入,旨在抑制在轨迹解码过程中对自我状态和文本简化的依赖。仅凭这一简单的设计,我们的0.5B规模模型恢复了视觉基础,并在闭环NAVSIM-v2和nuScenes基准测试中达到了与7B至8B规模的视觉语言模型相当的轨迹规划性能,后者的规模大于一个数量级。广泛的分析进一步表明,这一改进源于恢复了利用视觉特征的能力,尤其在动态驾驶场景(如转弯)中效果最为显著。
cs.CV / 114 / 2605.21072
Q-ARVD: Quantizing Autoregressive Video Diffusion Models
Q-ARVD:量化自回归视频扩散模型
Abstract
Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite their potential, the substantial inference cost of ARVDs remains a major obstacle to practical deployment, making model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs leads to suboptimal performance, revealing quantization behaviors that differ from those observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs: (C1) Highly unbalanced frame-wise quantization sensitivity. Error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights. Weight distributions exhibit pronounced outlier channels, whose patterns vary substantially across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces an outlier-aware adaptive dual-scale quantization, which automatically detects the presence and quantity of outlier channels for an arbitrary layer, and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.
Chinese Translation
自回归视频扩散模型(ARVDs)作为一种有前景的架构,已成为流媒体视频生成的理想选择,为实时互动视频生成和世界建模铺平了道路。尽管具有潜力,ARVDs 的高推理成本仍然是实际部署的主要障碍,因此模型量化成为提高效率的自然方向。然而,ARVDs 的量化仍然基本未被探索。我们的实证分析表明,直接将现有的针对标准扩散变换器开发的量化方案应用于 ARVDs 会导致次优性能,揭示了与双向扩散模型中观察到的量化行为不同的现象。在本文中,我们确定了量化 ARVDs 的两个关键挑战:(C1)帧级量化敏感性高度不平衡。在自回归生成过程中,错误累积可能导致帧间量化敏感性严重失衡,呈现出类似指数衰减的模式。(C2)权重中显著且异质的异常值模式。权重分布表现出明显的异常通道,其模式在不同层类型和块深度之间变化显著。为了解决这些问题,我们提出了 Q-ARVD,一个用于准确量化 ARVDs 的新框架。(S1)为了应对高度不平衡的帧级敏感性,Q-ARVD 在量化目标中引入了最终质量感知的帧加权机制。(S2)为了防止异质异常值降低性能,Q-ARVD 引入了一种异常值感知的自适应双尺度量化,能够自动检测任意层中异常通道的存在和数量,并将其隔离以保护正常通道。大量实验表明 Q-ARVD 的优越性。
cs.CV / 115 / 2605.21075
SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining
SpectralEarth-FM:将高光谱影像引入多模态地球观测预训练
Abstract
Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data, spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, enabling joint processing of HSI and lower-channel observations. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2, Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR, over common geographic footprints. It comprises approximately 2M globally distributed locations, 25M georeferenced patches, and over 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views from the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and standard EO benchmarks following the PANGAEA protocol, achieving state-of-the-art results across both evaluation settings.
Chinese Translation
地球观测(EO)基础模型(FMs)越来越多地在多传感器数据上进行训练,这些数据涵盖了多光谱影像(MSI)、合成孔径雷达(SAR)和衍生的地理空间层,但高光谱影像(HSI)仍然代表性不足。相反,现有的高光谱FMs仅在HSI上进行训练,尚未探索HSI与共定位EO传感器的联合预训练和融合。我们提出了SpectralEarth-FM,这是一种用于异构光谱维度的多传感器EO输入的层次变换器。该架构结合了针对高光谱输入的光谱标记化、传感器特定编码器、跨传感器融合模块和共享层次编码器,使得HSI和低通道观测的联合处理成为可能。为了预训练SpectralEarth-FM,我们整理了SpectralEarth-MM,这是一个将来自三个空间传感器(EnMAP、EMIT、DESIS)的HSI与Sentinel-2、Landsat-8/9光学影像、Landsat地表温度(LST)和Sentinel-1 SAR在共同地理范围内共定位的数据集。该数据集包含大约200万个全球分布的位置、2500万个地理参考的块,以及超过40TB的数据。预训练使用了一种联合嵌入预测架构(JEPA)风格的目标,该目标匹配来自同一位置的全球视图和单传感器局部视图之间的表示。我们在高光谱下游任务和标准EO基准测试上评估了SpectralEarth-FM,按照PANGAEA协议,取得了在这两种评估设置下的最先进结果。
cs.CV / 116 / 2605.21079
VDFP: Video Deflickering with Flicker-banding Priors
VDFP:基于闪烁带优先级的视频去闪烁
Abstract
Capturing digital screens with smartphones frequently induces severe banding due to hardware synchronization mismatches. Existing video restoration methods struggle with these structured, periodic luminance fluctuations, often resulting in residual artifacts or over-smoothed textures. We firstly construct DeViD, a real-world dataset in various scenes to deal with the lack of available datasets.Then we propose VDFP (Video Deflickering with Flicker-banding Priors), a novel perception-guided generation framework. First, we introduce a Degradation Field Modeling Based on Rolling Shutter Mechanism (DFM) capable of synthesizing complex multi-banding scenarios. Second, we present a spatial-temporal continuous prior perception (CPP). Unlike traditional binary segmentation, this module is optimized via a Flicker-Aware Mean Squared Error (FA-MSE) to capture the luminance transitions. By zero-initializing an augmented input layer, our model preserves pre-trained generative priors as well as spatial-temporal prior perception. Extensive experiments demonstrate that VDFP significantly outperforms other methods, eliminating complex banding with high-fidelity spatial details and temporal consistency. Our dataset and code will be released at~ https://github.com/ZhiyiZZhou/VDFP.
Chinese Translation
使用智能手机捕捉数字屏幕时,常常会因硬件同步不匹配而导致严重的带状现象。现有的视频修复方法在处理这些结构化的周期性亮度波动时表现不佳,通常会导致残留伪影或过度平滑的纹理。我们首先构建了DeViD,这是一个涵盖多种场景的真实世界数据集,以解决可用数据集不足的问题。然后,我们提出了VDFP(基于闪烁带优先级的视频去闪烁),这是一种新颖的感知引导生成框架。首先,我们引入了一种基于滚动快门机制的降级场建模(DFM),能够合成复杂的多带状场景。其次,我们提出了一种时空连续优先感知(CPP)。与传统的二元分割不同,该模块通过闪烁感知均方误差(FA-MSE)进行优化,以捕捉亮度过渡。通过对增强输入层进行零初始化,我们的模型保留了预训练的生成优先级以及时空优先感知。大量实验表明,VDFP显著优于其他方法,消除了复杂的带状现象,同时保持了高保真度的空间细节和时间一致性。我们的数据集和代码将发布在 https://github.com/ZhiyiZZhou/VDFP。
cs.CV / 117 / 2605.21090
TextSculptor: Training and Benchmarking Scene Text Editing
TextSculptor:场景文本编辑的训练与基准测试
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. However, scene text editing remains challenging, as it requires models to precisely modify textual content while preserving visual realism and non-target regions. Current open-source models still lag behind proprietary systems, largely due to the scarcity of high-quality training data and the lack of standardized benchmarks tailored to text editing. To address these challenges, we present TextSculptor, a comprehensive framework for data construction and evaluation of scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Based on this pipeline, we build TextSculpt-Data, a large-scale dataset containing 3.2M training samples, including 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, a benchmark covering four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at https://github.com/linyiheng123/TextSculptor.
Chinese Translation
近年来,多模态大型语言模型(MLLMs)和基于扩散的生成模型的进展显著提升了基于提示的图像编辑能力。然而,场景文本编辑仍然具有挑战性,因为它要求模型精确修改文本内容,同时保持视觉真实感和非目标区域的完整性。目前的开源模型在这方面仍落后于专有系统,主要原因在于高质量训练数据的稀缺以及缺乏针对文本编辑的标准化基准。为了解决这些挑战,我们提出了TextSculptor,一个用于场景文本编辑的数据构建和评估的综合框架。我们首先开发了一个自动化数据构建管道,该管道结合了文本感知的图像合成、程序化文本渲染和合成。在此管道的基础上,我们构建了TextSculpt-Data,这是一个包含320万个训练样本的大规模数据集,其中包括120万个经过OCR验证的文本到图像样本和200万个配对文本编辑样本,具有自然对齐的源-目标图像和强背景一致性。我们进一步介绍了TextSculpt-Bench,这是一个涵盖四个基本文本编辑任务的基准:文本添加、文本替换、文本移除和混合编辑。为了支持可靠的评估,我们设计了一个量身定制的协议,通过基于OCR的文本对齐、多模态判断和背景区域相似性来测量文本准确性、视觉质量和背景保留。大量实验表明,TextSculptor提升了开源文本编辑的性能,并缩小了与专有模型之间的差距。数据和基准可在 https://github.com/linyiheng123/TextSculptor 获取。
cs.CV / 118 / 2605.21099
R2AoP: Reliable and Robust Angle of Progression Estimation from Intrapartum Ultrasound
R2AoP:基于产程超声的可靠且稳健的进展角度估计
Abstract
Accurate estimation of the Angle of Progression (AoP) from intrapartum transperineal ultrasound is critical for objective assessment of labor progression, yet remains highly sensitive to imaging noise, boundary ambiguities, and the geometric amplification of local segmentation errors. We propose R2AoP, a reliable and robust AoP estimation framework that integrates structurally informed segmentation and confidence-guided geometric modeling to achieve stable and reproducible measurements. A three-branch local-structure-enhanced backbone improves the delineation of the pubic symphysis (PS) and fetal head (FH), while confidence-weighted contour fitting explicitly suppresses the influence of unreliable boundary points in AoP computation. To further improve performance under heterogeneous acquisition conditions, we introduce a lightweight geometry-reliable test-time adaptation strategy as an auxiliary component, enabling stable inference without target annotations. Extensive evaluations on multi-center benchmarks demonstrate consistent reductions in AoP error and boundary metrics compared with state-of-the-art AoP methods. Our source code is available at https://github.com/baiyou1234/R2AoP.
Chinese Translation
从产程经会阴超声中准确估计进展角度(AoP)对于客观评估分娩进展至关重要,但仍然对成像噪声、边界模糊和局部分割误差的几何放大高度敏感。我们提出了R2AoP,一种可靠且稳健的AoP估计框架,集成了结构信息驱动的分割和信心引导的几何建模,以实现稳定且可重复的测量。三分支局部结构增强主干网络改善了耻骨联合(PS)和胎头(FH)的轮廓描绘,而信心加权轮廓拟合则明确抑制了不可靠边界点在AoP计算中的影响。为了进一步提高在异构采集条件下的性能,我们引入了一种轻量级几何可靠的测试时适应策略作为辅助组件,使得在没有目标注释的情况下实现稳定推断。在多中心基准测试中的广泛评估表明,与最先进的AoP方法相比,我们的方法在AoP误差和边界指标上均表现出一致的降低。我们的源代码可在 https://github.com/baiyou1234/R2AoP 获取。
cs.CV / 119 / 2605.21112
RCGDet3D: Rethinking 4D Radar-Camera Fusion-based 3D Object Detection with Enhanced Radar Feature Encoding
RCGDet3D:重新思考基于4D雷达-摄像头融合的3D目标检测,增强雷达特征编码
Abstract
4D automotive radar is indispensable for autonomous driving due to its low cost and robustness, yet its point cloud sparsity challenges 3D object detection. Existing 4D radar-camera fusion methods focus on complex fusion strategies, trading inference speed for marginal gains. This trade-off hinders real-time deployment due to heavy computation on dense feature maps. In contrast, feature extraction from sparse radar points is less time-consuming but remains under-explored. This work uncovers that simply enhancing radar feature extraction can achieve comparable or even higher performance than elaborate fusion modules, while maintaining real-time performance. Based on this finding, we propose RCGDet3D, which centers on radar feature encoding and simplifies multi-modal fusion. Its encoder inherits from the efficient Gaussian Splatting-based Point Gaussian Encoder (PGE) in RadarGaussianDet3D with two key improvements. First, the Ray-centric PGE (R-PGE) predicts Gaussian attributes in ray-aligned coordinate systems before unifying them to Bird's-Eye View (BEV) space, significantly improving geometric consistency and reducing learning difficulty by decoupling the coordinate transformation from representation learning. Second, a Semantic Injection (SI) module incorporates visual cues from images, producing more geometrically accurate and semantically enriched radar features. Experiments on View-of-Delft (VoD) and TJ4DRadSet show that RCGDet3D outperforms state-of-the-art methods in both accuracy and speed, setting a new benchmark for real-time deployment.
Chinese Translation
4D汽车雷达因其低成本和鲁棒性在自动驾驶中不可或缺,但其点云稀疏性对3D目标检测构成挑战。现有的4D雷达-摄像头融合方法侧重于复杂的融合策略,牺牲推理速度以获得微小的性能提升。这种权衡由于对密集特征图的高计算需求,阻碍了实时部署。相比之下,从稀疏雷达点提取特征的时间消耗较少,但仍然未得到充分探索。本研究揭示,简单增强雷达特征提取可以实现与复杂融合模块相当甚至更高的性能,同时保持实时性能。基于这一发现,我们提出了RCGDet3D,重点关注雷达特征编码并简化多模态融合。其编码器继承自RadarGaussianDet3D中的高效高斯点云编码器(Point Gaussian Encoder, PGE),并进行了两个关键改进。首先,基于光线的PGE(Ray-centric PGE, R-PGE)在统一到鸟瞰图(Bird's-Eye View, BEV)空间之前,在光线对齐的坐标系统中预测高斯属性,显著提高了几何一致性,并通过将坐标变换与表示学习解耦,降低了学习难度。其次,语义注入(Semantic Injection, SI)模块结合了来自图像的视觉线索,生成更具几何准确性和语义丰富性的雷达特征。在View-of-Delft(VoD)和TJ4DRadSet上的实验表明,RCGDet3D在准确性和速度上均优于最先进的方法,为实时部署设定了新的基准。
cs.CV / 120 / 2605.21121
ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation
ROAR-3D:用于高保真3D生成的任意视角路由
Abstract
Single-image-to-3D generative models can now produce high-quality geometry, yet conditioning on a single view inevitably introduces ambiguity about unseen regions. Multi-view conditioning can reduce this ambiguity, but existing methods either require fixed canonical viewpoints or rely on external reconstruction modules that impose heavy training costs and limit generation quality. We observe that pretrained single-view models already possess strong 2D-to-3D grounding that can be reused for multi-view conditioning. However, a closer analysis reveals that their conditioning mechanism entangles orientation control with geometry transfer, two functions that conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements.
Chinese Translation
单图像到3D生成模型现在能够生成高质量的几何体,但对单一视角的条件限制不可避免地引入了对未见区域的模糊性。多视角条件可以减少这种模糊性,但现有方法要么需要固定的规范视点,要么依赖于外部重建模块,这会带来高昂的训练成本并限制生成质量。我们观察到,预训练的单视图模型已经具备强大的2D到3D的基础,可以被重用于多视角条件。然而,进一步分析表明,它们的条件机制将方向控制与几何传递纠缠在一起,这两种功能在不同视点的图像被简单组合时会发生冲突。基于这一分析,我们提出了ROAR-3D,这是一种轻量级方法,可以将预训练的单视图模型升级为接受任意数量的未定姿态图像。一个基于标记的视角路由器将每个3D潜在标记分配给其最相关的视角,隐式建立2D到3D的对应关系,而无需显式的姿态输入。双流注意力设计保留了预训练的主视图行为,同时通过专门用于几何增强的单独路径路由辅助视图。方向扰动策略确保辅助路径学习与方向无关的几何传递。这些组件引入的可训练参数极少,相对于单视图基线增加的推理开销也微不足道。ROAR-3D实现了最先进的多视角3D生成质量,并支持从1到12+视角的测试时视角扩展,且持续改善。
cs.CV / 121 / 2605.21123
Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
线性-DPO:用于扩散和流匹配生成模型的线性直接偏好优化
Abstract
Direct Preference Optimization (DPO) is successful for alignment in LLMs but still faces challenges in text-to-image generation. Existing studies are confined to denoising diffusion models while overlooking flow-matching, and suffer from an objective mismatch when applying discrete NLP-based DPO to regression-based generative tasks.\ In this paper, we derive a generalized DPO objective that covers both diffusion and flow-matching via a unified reverse-time SDE framework, and point out from a gradient perspective that the standard DPO objective is suboptimal for text-to-image generation. Consequently, we propose Linear-DPO, which replaces the aggressive sigmoid-based utility function with a sustained linear utility and incorporates an EMA-updated reference model. Qualitative and quantitative experiments on diffusion models (SD1.5, SDXL) and flow-matching model (SD3-Medium) demonstrate the superiority of our approach over existing baselines.
Chinese Translation
直接偏好优化(DPO)在大语言模型(LLMs)中的对齐方面取得了成功,但在文本到图像生成中仍面临挑战。现有研究局限于去噪扩散模型,而忽视了流匹配,并且在将基于离散自然语言处理(NLP)的DPO应用于基于回归的生成任务时存在目标不匹配的问题。本文推导出一个广义的DPO目标,该目标通过统一的反向时间随机微分方程(SDE)框架覆盖了扩散和流匹配,并从梯度的角度指出标准DPO目标在文本到图像生成中是次优的。因此,我们提出了线性-DPO,它用持续的线性效用函数替代了激进的基于sigmoid的效用函数,并结合了EMA更新的参考模型。在扩散模型(SD1.5,SDXL)和流匹配模型(SD3-Medium)上的定性和定量实验表明,我们的方法优于现有基线。
cs.CV / 122 / 2605.21130
VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment
VersusQ:用于可泛化视频质量评估的成对边际推理
Abstract
Large Multimodal Models (LMMs) have shown promise for video quality assessment, but most methods still predict an absolute score for each video. Such pointwise supervision often mixes perceptual quality with dataset-specific calibration, including annotation protocols, rating habits, and score distributions. As a result, the learned scoring rule may work well within a benchmark but transfer poorly across unseen domains. We argue that relative comparisons alleviate the absolute-scale calibration bias by focusing purely on perceptual differences rather than dataset-specific rating habits. Consequently, we propose \textbf{VersusQ}, a pairwise margin reasoning framework driven entirely by direct comparisons. Specifically, VersusQ performs LMM-based comparison between two videos, reasons about their visual and temporal quality differences, and predicts a signed continuous margin that captures both the preferred choice and the degree of difference. Furthermore, to align interpretable comparison rationales with fine-grained numerical differences, we introduce Margin-Coupled GRPO, which jointly optimizes rollout-based relational reasoning and continuous margin regression. Extensive experiments on multiple public VQA benchmarks demonstrate that VersusQ achieves state-of-the-art performance, strong cross-domain generalization, and reliable fine-grained ranking under heterogeneous evaluation scenarios.
Chinese Translation
大型多模态模型(LMMs)在视频质量评估中展现了良好的前景,但大多数方法仍然为每个视频预测一个绝对分数。这种点对点的监督往往将感知质量与特定数据集的校准混合在一起,包括注释协议、评分习惯和分数分布。因此,学习到的评分规则可能在基准测试中表现良好,但在未见领域的迁移效果较差。我们认为,相对比较通过专注于感知差异而非特定数据集的评分习惯,减轻了绝对尺度校准的偏差。因此,我们提出了 extbf{VersusQ},一个完全基于直接比较的成对边际推理框架。具体而言,VersusQ在两个视频之间进行基于LMM的比较,推理它们的视觉和时间质量差异,并预测一个带符号的连续边际,该边际捕捉了首选选择及其差异程度。此外,为了将可解释的比较理由与细粒度的数值差异对齐,我们引入了Margin-Coupled GRPO,它联合优化基于回滚的关系推理和连续边际回归。在多个公共视频质量评估基准上的大量实验表明,VersusQ实现了最先进的性能、强大的跨领域泛化能力,以及在异构评估场景下可靠的细粒度排名。
cs.CV / 123 / 2605.21131
UniT: Unified Geometry Learning with Group Autoregressive Transformer
UniT:基于组自回归变换器的统一几何学习
Abstract
Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
Chinese Translation
近年来,前馈模型在从传感器观测中推断稠密三维结构的几何感知方面取得了显著进展。然而,其基本能力在多个不兼容的范式中仍然存在碎片化现象,包括在线感知、离线重建、多模态融合、长时间范围可扩展性和度量尺度估计。我们提出了UniT,这是一个基于新型组自回归变换器(Group Autoregressive Transformer)构建的统一模型,它在一个框架内重新构建了这些看似不相关的能力。其关键思想是将传感器观测的组视为基本的自回归单元,并以无锚和尺度自适应的方式预测相应的点图。更具体地说,无论是在在线还是离线设置中,多样的视图配置都自然地统一在一个组自回归过程中。通过改变组大小,在线模式在单帧组上进行多个自回归步骤,而离线模式则在单次前向传递中聚合多帧组。同时,队列式键值(KV)缓存机制确保在长时间范围内的自回归记忆是有界的。这是通过通过无锚关系建模减少对早期帧的长程依赖,从而允许过时的记忆即时被丢弃。为了提高跨场景的度量尺度泛化能力,在该框架内进一步引入了尺度自适应几何损失。它将相对几何约束与部分绝对尺度项结合在一起,隐式地对全局尺度进行正则化,并促使从尺度不变几何向度量尺度解决方案的渐进过渡。结合用于整合辅助模态的专用模态注意力模块,UniT在统一几何感知方面实现了最先进的性能,这在涵盖七个代表性任务的十个基准测试中得到了验证。
cs.CV / 124 / 2605.21132
SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary
SurgOnAir:层次感知的实时外科视频解说
Abstract
Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated hierarchical dataset SurgOnAir-11k spanning action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Furthermore, special transition tokens are generated to explicitly mark state changes, allowing SurgOnAir to capture and signal key workflow transitions as they occur. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior and hierarchy-aware narrations. Code and dataset will be public.
Chinese Translation
实时理解外科手术工作流程对于智能外科实施至关重要,其中人工智能系统在手术进行时持续感知并响应。在手术室中,关键决策依赖于微妙的、瞬息万变的变化,例如精细的仪器移动和不断变化的组织状态,甚至轻微的感知延迟都可能限制辅助或危及安全。然而,现有方法仍然是离线的或在粗糙的时间尺度上操作,仅在处理视频片段后生成描述,无法实现即时反应。为此,我们提出了SurgOnAir,一种流媒体视觉-语言模型,它顺序处理帧而不访问未来信息,并随着视觉输入的到来逐步生成解说标记。SurgOnAir实现了细粒度的帧到标记生成,使其能够对不断变化的外科动态做出即时响应。该模型基于我们精心策划的层次数据集SurgOnAir-11k,涵盖了动作、步骤和阶段级别的监督,经过训练以生成反映外科程序固有层次的多级文本响应。此外,生成特殊的过渡标记以明确标记状态变化,使SurgOnAir能够捕捉并信号关键工作流程的过渡。实验表明,SurgOnAir通过一个统一的视觉-语言模型实现了实时理解,跨越外科工作流程的多个层次,生成优越且具层次感知的解说。代码和数据集将公开发布。
cs.CV / 125 / 2605.21139
Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving
提炼思考,预见行动:用于自主驾驶的认知-物理强化学习
Abstract
Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a CognitivePhysical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.
Chinese Translation
当前的端到端自主驾驶模型在本质上受到模仿学习行为克隆上限的限制。虽然强化学习为更智能的自主性提供了一条路径,但它需要两个缺失的基础设施部分:(1) 理解交通语义和驾驶意图的认知基础,以及 (2) 能够预测候选行动后果的前瞻性物理环境。为此,我们提出了 CoPhy,一个用于自主驾驶的认知-物理强化学习框架。为了提炼思考,我们将 VLM 知识提炼到 BEV 编码器中,然后完全丢弃 VLM,在零推理成本下保留认知能力,同时将认知通道释放为可插拔接口,以便于可选的人类语言指令。为了预见行动,我们构建了一个自回归的 BEV 世界模型,该模型明确预测基于候选行动的未来语义图,作为一个可解释的物理沙盒,从中直接导出安全指标。在这一双重基础设施的基础上,我们通过 GRPO 优化驾驶策略,采用一种新颖的双重奖励机制:来自 BEV 模拟的物理奖励强制执行严格的安全约束,而来自语言对齐评分器的认知奖励确保意图合规。大量实验表明,CoPhy 不仅在 NAVSIM v1 和 v2 基准测试中取得了最先进的结果,还通过认知信息驱动的场景合规性和通过用户定义的语言指令实现灵活的意图控制,增强了驾驶安全性。
cs.CV / 126 / 2605.21157
Comparative Analysis of Military Detection Using Drone Imagery Across Multiple Visual Spectrums
基于无人机图像的多种视觉光谱军事探测比较分析
Abstract
In modern warfare, drones are becoming an essential part of intelligence gathering and carrying out precise attacks in different kinds of hostile environments. Their ability to operate in real-time and hostile environments from a safe distance makes them invaluable for surveillance and military operations. The KIIT-MiTA dataset is comprised of images of different military scenarios taken from drones, and these provide a foundation for detecting military objects, but it does not take into account the various types of real-world scenarios. With that in mind, to evaluate how the models are performing under varying conditions, four different types of datasets are created: Gray Scale, Thermal Vision, Night Vision, and Obscura Vision. These simulate the real-world environments such as low visibility, heat-based imagery, and nighttime conditions. The YOLOv11-small model is trained and used to detect objects across diverse settings. This research boosts the performance and reliability of drone-based operations by contributing to the development of advanced detection systems in both defensive and offensive missions.
Chinese Translation
在现代战争中,无人机正成为情报收集和在不同敌对环境中进行精确攻击的重要组成部分。它们能够在安全距离内实时操作,使其在监视和军事行动中不可或缺。KIIT-MiTA 数据集包含了从无人机拍摄的不同军事场景的图像,为军事物体的探测提供了基础,但未考虑各种现实场景。为此,为了评估模型在不同条件下的表现,创建了四种不同类型的数据集:灰度图(Gray Scale)、热成像(Thermal Vision)、夜视(Night Vision)和模糊视(Obscura Vision)。这些数据集模拟了低能见度、基于热量的图像和夜间条件等现实环境。YOLOv11-small 模型经过训练并用于在多样化环境中检测物体。本研究通过为防御和进攻任务的发展提供先进的探测系统,提升了基于无人机操作的性能和可靠性。
cs.CV / 127 / 2605.21171
FTerViT: Fully Ternary Vision Transformer
FTerViT:完全三元视觉变换器
Abstract
Ternary Vision Transformers offer substantial model compression, however state-of-the-art methods only ternarize the encoder layers, leaving patch embeddings, LayerNorm parameters, and classifier heads in full precision. In compact models targeting resource-constrained processors, such as microcontrollers, these remaining full-precision components determine the total memory footprint, severely limiting deployment efficiency and on-device feasibility. In this work, we introduce a fully ternarized Vision Transformer in which \emph{all} weight matrices and normalization parameters are ternarized (FTerViT). To this end, we introduce two novel operators : TernaryBitConv2d with per-channel scaling for patch embedding and TernaryLayerNorm. FTerViT is trained using knowledge distillation, followed by a lightweight quantization-aware recovery phase. Our ternary W2A8 DeiT-III-S at 384$\times$384 resolution achieves 82.43\% ImageNet-1K top-1 at 6.09\,MB (${\sim}$15$\times$ compression, $-$2.42\,pp vs.\ FP32), outperforming prior ternary ViTs methods up to 8 pp. Finally, we demonstrate the first implementation of ternary vision transformers on a dual cores XTensa LX7 microcontroller inside the ESP32-S3 system-on-chip. By deploying FTerViT-Small (based on DeiT-III-Small at 224$\times$224 resolution, 5.81\,MB), we achieve 79.64\% ImageNet-1K top-1 accuracy.
Chinese Translation
三元视觉变换器提供了显著的模型压缩,然而,现有的最先进方法仅对编码器层进行三元化,导致补丁嵌入、LayerNorm 参数和分类器头仍保持全精度。在针对资源受限处理器(如微控制器)的紧凑模型中,这些剩余的全精度组件决定了总内存占用,严重限制了部署效率和设备上的可行性。在本研究中,我们提出了一种完全三元化的视觉变换器,其中所有权重矩阵和归一化参数均为三元化(FTerViT)。为此,我们引入了两个新颖的操作符:具有每通道缩放的 TernaryBitConv2d 用于补丁嵌入和 TernaryLayerNorm。FTerViT 采用知识蒸馏进行训练,随后进行轻量级的量化感知恢复阶段。我们的三元 W2A8 DeiT-III-S 在 384×384 分辨率下实现了 82.43% 的 ImageNet-1K top-1 准确率,模型大小为 6.09 MB(约 15 倍压缩,相较于 FP32 降低 2.42 个百分点),超越了之前的三元 ViTs 方法,提升幅度可达 8 个百分点。最后,我们展示了在 ESP32-S3 系统芯片内的双核 XTensa LX7 微控制器上首次实现三元视觉变换器。通过部署 FTerViT-Small(基于 224×224 分辨率的 DeiT-III-Small,5.81 MB),我们实现了 79.64% 的 ImageNet-1K top-1 准确率。
cs.CV / 128 / 2605.21186
SAM-Sode: Towards Faithful Explanations for Tiny Bacteria Detection
SAM-Sode:朝着忠实解释的小型细菌检测
Abstract
Interpretability in object detection provides crucial confidence support for clinical auxiliary diagnosis. However, in tiny bacteria detection, traditional explanation methods often suffer from blurred foreground boundaries and diffuse feature attribution due to the extreme sparsity of target morphological features and severe interference from complex backgrounds. Such limitations hinder the provision of logically coherent morphological evidence. To bridge this gap, we propose a novel eXplainable AI (XAI) framework, SAM-Sode. The framework innovatively transforms initial feature attribution maps into geometry-aware prompts, leveraging the prior knowledge of the foundation model (SAM3) to achieve spatial refinement and morphological reconstruction of the explanatory mappings. Furthermore, we introduce a dual-constraint mechanism based on physical significance and geometric alignment to perform instance-level denoising, generating coherent explanations that better align with human expert intuition. Experimental results on our self-constructed bacteria dataset with complex circuit backgrounds (containing 2,524 images) and other public datasets demonstrate that the proposed method effectively suppresses background redundancy and significantly enhances the decision-making transparency of tiny object detection.
Chinese Translation
目标检测中的可解释性为临床辅助诊断提供了至关重要的信心支持。然而,在小型细菌检测中,传统的解释方法常常由于目标形态特征的极度稀疏和复杂背景的严重干扰,导致前景边界模糊和特征归因扩散。这些限制妨碍了逻辑上连贯的形态证据的提供。为了填补这一空白,我们提出了一种新颖的可解释人工智能(XAI)框架,SAM-Sode。该框架创新性地将初始特征归因图转换为几何感知提示,利用基础模型(SAM3)的先验知识实现解释映射的空间细化和形态重建。此外,我们引入了一种基于物理意义和几何对齐的双重约束机制,以进行实例级去噪,生成更符合人类专家直觉的连贯解释。在我们自建的具有复杂电路背景的细菌数据集(包含2,524张图像)和其他公共数据集上的实验结果表明,所提出的方法有效抑制了背景冗余,并显著增强了小型物体检测的决策透明度。
cs.CV / 129 / 2605.21190
Semantic Granularity Navigation in Image Editing
图像编辑中的语义粒度导航
Abstract
Despite the generative capabilities of diffusion and flow models, real-image editing remains constrained by a persistent trade-off between semantic editability and structural fidelity. We trace a primary cause of this limitation to the implicit coupling of edit progress with model scale in existing paradigms. Under this coupling, stronger edits typically require visiting noisier states, which spends computation on destabilizing layout before the semantic change is well localized. We introduce NaviEdit, a training-free inference-time controller that decouples edit progress from model scale traversal through a strict self-consistency contract. NaviEdit operates at the rollout level and leaves the underlying pretrained model unchanged. It treats scale as a control input and reallocates a fixed step budget toward semantically responsive intermediate scales instead of destructive high-noise regimes. Experiments show positive average gains across compatible editors and flow backbones, supporting decoupling as a portable inference-time control principle.
Chinese Translation
尽管扩散模型和流模型具有生成能力,真实图像编辑仍受到语义可编辑性与结构保真性之间持续权衡的限制。我们将这一限制的主要原因追溯到现有范式中编辑进程与模型规模的隐性耦合。在这种耦合下,较强的编辑通常需要访问更嘈杂的状态,这在语义变化尚未很好定位之前就消耗了计算资源用于破坏布局。我们提出了NaviEdit,一种无训练的推理时控制器,通过严格的自一致性契约将编辑进程与模型规模遍历解耦。NaviEdit在展开级别操作,并保持底层预训练模型不变。它将规模视为控制输入,并将固定的步骤预算重新分配到语义响应的中间尺度,而不是破坏性的高噪声状态。实验表明,在兼容的编辑器和流骨干网络中,平均增益表现积极,支持解耦作为一种可移植的推理时控制原则。
cs.CV / 130 / 2605.21195
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
RankE:用于离散文本到图像生成的端到端后训练与解码器共同演化
Abstract
Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. This co-evolution breaks the fidelity--alignment trade-off that plagues frozen-decoder approaches: on LlamaGen-XL (775M), standard RL improves CLIP but degrades FID, whereas RankE improves both simultaneously (FID 15.21, CLIP 33.76 on MS-COCO 30K). Consistent gains on Janus-Pro (1B) confirm that decoder co-evolution reliably converts reward optimization into pixel-space quality improvements.
Chinese Translation
离散自回归(AR)文本到图像(T2I)模型将VQ分词器与AR策略相结合,目前的后训练流程仅优化策略,而保持VQ解码器不变。最近的扩散T2I研究,如REPA-E,表明变分自编码器(VAE)本身构成了关键的对齐瓶颈,但对于离散AR模型尚无类似的研究。我们展示了仅优化策略会导致潜在协变量偏移:随着策略的演变,生成的标记分布与解码器训练时的真实分布发生偏离,从而导致奖励分数提高而解码图像质量下降。为了解决这种不匹配,我们提出了RankE,这是第一个用于离散T2I生成的端到端后训练框架。RankE通过交替优化共同演化这两个组件,而不是针对固定解码器优化策略:每个模块在其参数空间中通过适合的稳定性保持锚点进行正则化,同时最大化基于排名的对齐目标。这种共同演化打破了困扰冻结解码器方法的保真度与对齐之间的权衡:在LlamaGen-XL(775M)上,标准强化学习(RL)提高了CLIP但降低了FID,而RankE则同时改善了两者(FID 15.21,CLIP 33.76在MS-COCO 30K上)。在Janus-Pro(1B)上的一致性提升确认了解码器共同演化可靠地将奖励优化转化为像素空间质量的改善。
cs.CV / 131 / 2605.21207
PGC: Peak-Guided Calibration for Generalizable AI-Generated Image Detection
PGC:用于可泛化的AI生成图像检测的峰值引导校准
Abstract
The rapid evolution of generative AI, from GANs to modern diffusion models, has resulted in increasingly subtle discriminative clues. These fine-grained signals are often overshadowed by dominant, high-fidelity image content (e.g., the main subject), limiting the reliability of existing detectors that predominantly rely on global representations. To address this challenge, we propose the Peak-Guided Calibration (PGC) framework. PGC introduces a novel strategy that aggregates salient features via a peak-focusing mechanism. Specifically, by employing a peak-sensitive aggregation that accentuates the most discriminative local clues, PGC leverages these critical signals to calibrate the global decision. This approach recovers subtle patterns that would otherwise be submerged in the global context. Furthermore, to better simulate real-world threats, we introduce the CommGen15 dataset, a challenging benchmark comprising samples from 15 commercial models. Extensive experiments demonstrate that PGC achieves state-of-the-art performance. Specifically, it improves mean accuracy by +12.3% on our CommGen15 dataset, and sets new records on standard benchmarks, including GenImage (+2.1%), AIGI (+3.5%), and UniversalFakeDetect (+1.7%). Code is available at https://github.com/xiaoyu6868/PGC.
Chinese Translation
生成性人工智能的快速发展,从生成对抗网络(GANs)到现代扩散模型,导致了越来越微妙的区分线索。这些细粒度信号常常被主导的高保真图像内容(例如,主要对象)所掩盖,从而限制了现有检测器的可靠性,这些检测器主要依赖于全局表示。为了解决这一挑战,我们提出了峰值引导校准(Peak-Guided Calibration,PGC)框架。PGC引入了一种新颖的策略,通过峰值聚焦机制聚合显著特征。具体而言,通过采用峰值敏感聚合,强调最具区分性的局部线索,PGC利用这些关键信号来校准全局决策。这种方法恢复了在全局背景中可能被淹没的细微模式。此外,为了更好地模拟现实世界的威胁,我们引入了CommGen15数据集,这是一个包含15个商业模型样本的挑战性基准。大量实验表明,PGC达到了最先进的性能。具体而言,它在我们的CommGen15数据集上提高了平均准确率+12.3%,并在标准基准上创下新纪录,包括GenImage(+2.1%)、AIGI(+3.5%)和UniversalFakeDetect(+1.7%)。代码可在https://github.com/xiaoyu6868/PGC获取。
cs.CV / 132 / 2605.21237
RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis
RePCM:区域特异性和表型自适应的双心室心脏运动合成
Abstract
Cardiac motion over a cardiac cycle is crucial for quantifying regional function and is strongly affected by cardiovascular diseases. Since temporally dense mesh sequences are difficult to obtain in practice, we focus on leveraging the more accessible end-diastolic frame to infer a full-cycle sequence. Due to strong regional and disease-specific differences, traditional methods often oversmooth the data by relying on generative models that are optimized for global patterns. To address this problem, we propose Region-Aware and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis (RePCM) for single frame Bi-ventricular mesh motion completion. In Stage I, a reconstruction network learns vertex wise motion descriptors and clustering yields a data driven functional partition, providing an explicit motion derived region structure. In Stage II, a Region-Specific Injection Module enforces masked, synchronized region exchange within a conditional VAE, preserving localized specific dynamics and restricting cross-region mixing. A Phenotype-Adaptive Mixture-of-Experts prior conditioned on ED shape uses anatomy-guided cues to model latent motion trends and capture inter-disease variability. Experiments on three datasets covering different cardiovascular diseases show consistent gains in geometric and functional metrics and improved preservation of region specific dynamics.
Chinese Translation
心脏周期内的心脏运动对于量化区域功能至关重要,并且受到心血管疾病的强烈影响。由于在实践中获取时间密集的网格序列较为困难,我们专注于利用更易获得的舒张末期帧来推断完整的周期序列。由于区域和疾病特异性差异显著,传统方法往往依赖于为全局模式优化的生成模型,从而过度平滑数据。为了解决这一问题,我们提出了区域感知和表型自适应的双心室心脏运动合成(RePCM),用于单帧双心室网格运动的补全。在第一阶段,重建网络学习顶点级运动描述符,聚类产生数据驱动的功能划分,提供明确的运动导出区域结构。在第二阶段,区域特异性注入模块在条件变分自编码器(VAE)内强制执行掩膜的同步区域交换,保留局部特定动态并限制跨区域混合。基于舒张末期形状的表型自适应专家混合先验使用解剖引导线索来建模潜在运动趋势并捕捉疾病间的变异性。在涵盖不同心血管疾病的三个数据集上的实验表明,在几何和功能指标上均取得了一致的提升,并改善了区域特定动态的保留。
cs.CV / 133 / 2605.21244
SR-Ground: Image Quality Grounding for Super-Resolved Content
SR-Ground:超分辨率内容的图像质量基础
Abstract
Super-Resolution (SR) has advanced rapidly in recent years, with diffusion-based models achieving unprecedented fidelity at the cost of introducing new types of visual artifacts. While existing Image Quality Assessment (IQA) methods provide holistic quality scores, they lack interpretability and fail to distinguish between different artifact types arising from modern SR approaches. To address this gap, we introduce SR-Ground, a large-scale dataset specifically designed for fine-grained artifact segmentation in super-resolved images. The dataset comprises images processed by a diverse set of state-of-the-art SR models, with pixel-level annotations for multiple artifact categories. We conduct a large-scale crowdsourcing study involving 1,062 participants to validate and refine automatically generated segmentations, resulting in a high-quality dataset of 63,000 images spanning 6 distinct artifact types. We demonstrate that training IQA models with grounding capabilities on SR-Ground significantly improves performance on downstream tasks. Furthermore, we introduce a fine-tuning pipeline that leverages our grounding model to reduce perceptible artifacts in SR outputs, showcasing the practical utility of our dataset.
Chinese Translation
超分辨率(Super-Resolution, SR)近年来发展迅速,基于扩散的模型在引入新类型视觉伪影的代价下实现了前所未有的逼真度。虽然现有的图像质量评估(Image Quality Assessment, IQA)方法提供了整体质量评分,但它们缺乏可解释性,无法区分现代超分辨率方法产生的不同伪影类型。为了解决这一问题,我们提出了SR-Ground,这是一个专门为超分辨率图像中的细粒度伪影分割设计的大规模数据集。该数据集包含经过多种最先进的超分辨率模型处理的图像,并为多个伪影类别提供了像素级注释。我们进行了一项大规模的众包研究,涉及1,062名参与者,以验证和完善自动生成的分割,最终形成了一个包含63,000张图像的高质量数据集,涵盖6种不同的伪影类型。我们展示了在SR-Ground上训练具有基础能力的IQA模型显著提高了下游任务的性能。此外,我们还引入了一种微调流程,利用我们的基础模型减少超分辨率输出中的可感知伪影,展示了我们数据集的实际应用价值。
cs.CV / 134 / 2605.21261
STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
STiTch:用于无训练零-shot复合图像检索的语义过渡与运输协作框架
Abstract
Training-free zero-shot composed image retrieval models are recently gaining increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploring the compositional ability behind the LLMs. Although efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and text modification, where the image contains much more details than the text; 2) the point-to-point alignment during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel Semantic Transition and Transportation in collaboration framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we aim to refine it through a transition vector in the embedding space and make it closer to the target image. Combining LLMs with user instruction, the refined caption concentrates more on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignment during the retrieval stage, we model the caption and image as discrete distributions and reformulate the retrieval task as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to consider fine-grained alignments across modalities and calculate the retrieval score. Extensive experiments demonstrate that our method can be general, effective, and beneficial for many CIR tasks.
Chinese Translation
无训练零-shot复合图像检索模型因其在未见多模态检索中的泛化能力和灵活性,近年来受到越来越多的研究关注。近期基于大语言模型(LLM)的进展集中在通过探索LLM背后的组合能力来生成预期的目标标题。尽管这种方法高效,但我们发现:1)生成的标题往往由于输入图像与文本修改之间的语义差距而引入来自参考图像的意外特征,其中图像包含的细节远多于文本;2)在检索阶段的逐点对齐未能捕捉到多样的组合。为了解决这些挑战,我们提出了一种新颖的语义过渡与运输协作框架,旨在用于无训练零-shot复合图像检索(CIR)任务。具体而言,给定由LLM推断出的复合标题,我们旨在通过嵌入空间中的过渡向量对其进行优化,使其更接近目标图像。结合LLM与用户指令,优化后的标题更加集中于核心修改意图,从而过滤掉不必要的噪声。此外,为了在检索阶段探索多样的对齐方式,我们将标题和图像建模为离散分布,并将检索任务重新表述为集合到集合的对齐任务。最后,我们开发了一种双向运输距离,以考虑跨模态的细粒度对齐并计算检索得分。大量实验表明,我们的方法在许多CIR任务中具有普遍性、有效性和益处。
cs.CV / 135 / 2605.21268
Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification
视觉变换器与卷积神经网络在土地利用场景分类中的应用
Abstract
Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new paradigm that models long-range dependencies through self-attention mechanisms, potentially enabling improved global context understanding. This paper presents a comparative assessment of Vision Transformers and CNN-based architecture for remote sensing land use scene classification. Representative CNN models, such as AlexNet, is evaluated alongside the Vision Transformer (ViT) using benchmark remote sensing datasets, including the UC Merced Land Use and EuroSAT Land Use datasets. The study examines classification accuracy, precision, recall, F1-score, and computational complexity to provide a comprehensive performance comparison. Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships in complex scenes when sufficient training data are available. However, ViTs typically require greater computational resources and larger training datasets to achieve optimal performance. The findings of this study provide insights into the strengths and limitations of both architectures and offer guidance for selecting appropriate models for remote sensing land use scene classification applications.
Chinese Translation
从遥感影像中进行土地利用场景分类(LUSC)在环境监测、城市规划和可持续资源管理中发挥着关键作用。近年来,深度学习方法显著推动了该领域的进展,卷积神经网络(CNN)因其强大的局部空间特征捕捉能力而主导了这一领域。然而,视觉变换器(ViTs)的出现引入了一种新的范式,通过自注意力机制建模长距离依赖关系,可能提高全球上下文理解能力。本文对视觉变换器与基于CNN的架构在遥感土地利用场景分类中的表现进行了比较评估。我们评估了代表性的CNN模型,如AlexNet,并与视觉变换器(ViT)在基准遥感数据集上进行了比较,包括UC Merced土地利用数据集和EuroSAT土地利用数据集。研究考察了分类准确率、精确率、召回率、F1分数和计算复杂度,以提供全面的性能比较。实验结果表明,CNN在训练样本有限且具有强局部纹理特征的数据集上表现稳健,而视觉变换器在充足的训练数据可用时,在捕捉复杂场景中的全球空间关系方面表现优越。然而,ViTs通常需要更大的计算资源和更大的训练数据集才能实现最佳性能。本研究的发现为两种架构的优缺点提供了见解,并为选择适合遥感土地利用场景分类应用的模型提供了指导。
cs.CV / 136 / 2605.21272
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset
MONET:一个大规模、开放、非冗余且丰富的文本到图像数据集
Abstract
Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approx. 104.9M image--text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.
Chinese Translation
训练大型文本到图像模型需要高质量、经过精心策划的数据集,这些数据集应具有多样化的内容和详细的描述。然而,收集、过滤、去重和重新标注这些语料库的成本和复杂性在一定程度上阻碍了该领域的开放和可重复研究。我们介绍了MONET,这是一个开放的Apache 2.0数据集,包含约1.049亿对图像-文本对,这些数据对是从29亿对原始数据中通过多个阶段的安全过滤、基于领域的过滤、精确及近似重复项的去除,以及使用多个视觉-语言模型进行的重新标注(涵盖从短到长的描述)收集而来,并进一步通过合成生成样本进行增强。每个图像都附带预计算的嵌入和注释,以加速下游使用。为了验证MONET的有效性,我们仅在该数据集上训练了一个具有40亿参数的潜在扩散模型,并取得了具有竞争力的GenEval和DPG分数,证明我们的数据集降低了大规模、可重复的文本到图像研究的门槛。
cs.CV / 137 / 2605.21273
DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions
DriveMA:重新思考驾驶视觉语言动作模型中的语言接口与一步元动作
Abstract
Driving Vision-Language-Action Models (Driving VLAs) commonly introduce natural-language reasoning as an intermediate interface for end-to-end planning, but reasoning-centric interfaces face three practical bottlenecks: obtaining high-quality reasoning annotations is difficult, generating and understanding long reasoning chains is challenging for compact models, and inference latency is substantially increased. In this paper, we rethink the design of language interfaces in Driving VLAs and show that concise one-step meta-actions are a simple yet effective alternative to verbose reasoning. Meta-actions provide semantic decision grounding while remaining low-entropy, and being automatically derivable from expert trajectories, enabling scalable supervision and reliable trajectory conditioning. Building on this interface, we propose DriveMA, which combines action-centric supervised training with a turn-level credit-assignment reinforcement learning framework that jointly optimizes meta-action correctness, trajectory quality, and trajectory--meta-action consistency. Experiments show that DriveMA already achieves a new state of the art on the Waymo End-to-End Driving Challenge with a 2B model, reaching a Rater Feedback Score (RFS) of 8.060, while its 4B version further improves the state of the art to 8.079; DriveMA also obtains competitive performance on NAVSIM. Ablations demonstrate that one-step meta-actions offer a better practical trade-off between expressiveness, predictability, and inference efficiency than natural-language reasoning or finer-grained action sequences. Code, data, and models will be released to facilitate future research.
Chinese Translation
驾驶视觉语言动作模型(Driving VLAs)通常将自然语言推理引入作为端到端规划的中介接口,但以推理为中心的接口面临三个实际瓶颈:获取高质量的推理注释困难、生成和理解长推理链对紧凑模型具有挑战性,以及推理延迟显著增加。在本文中,我们重新思考驾驶 VLAs 中语言接口的设计,并展示简洁的一步元动作是冗长推理的简单而有效的替代方案。元动作提供语义决策基础,同时保持低熵,并且可以从专家轨迹中自动推导,从而实现可扩展的监督和可靠的轨迹条件。基于这一接口,我们提出了 DriveMA,它结合了以动作为中心的监督训练和一个回合级别的信用分配强化学习框架,联合优化元动作的正确性、轨迹质量和轨迹与元动作的一致性。实验表明,DriveMA 在 Waymo 端到端驾驶挑战赛中已经达到了新的最先进水平,使用 2B 模型获得了 8.060 的评估反馈分数(Rater Feedback Score, RFS),而其 4B 版本进一步将最先进水平提升至 8.079;DriveMA 在 NAVSIM 上也取得了竞争力的表现。消融实验表明,一步元动作在表达能力、可预测性和推理效率之间提供了比自然语言推理或更细粒度的动作序列更好的实际权衡。代码、数据和模型将被发布,以促进未来的研究。
cs.CV / 138 / 2605.21280
Let EEG Models Learn EEG
让脑电图模型学习脑电图
Abstract
High-fidelity EEG generation is critical for alleviating data scarcity and addressing privacy constraints in large-scale neural modeling. Despite recent progress, most existing approaches formulate EEG generation via discrete denoising objectives, which inadequately reflect the inherently continuous temporal dynamics and spectral structure of neural activity. As a result, these methods often struggle to preserve long-range temporal dependencies and exhibit mismatches in the spectral and temporal structure of the generated signals. In this work, we argue that effective EEG generation requires models that operate directly on the continuous evolution of neural signals. We introduce Just EEG Transformer (JET), a generative framework based on conditional flow matching that models EEG as raw sequences evolving along continuous trajectories. By learning a smooth vector field that transports noise to the EEG data distribution, JET captures temporal continuity and transient dynamics without relying on discretized denoising schemes or domain-specific representations. To ensure that the learned dynamics remain consistent with key properties of EEG signals, we introduce principled constraints that preserve spectral structure, temporal stationarity, and signal-level statistics. Across three large-scale benchmarks, JET consistently achieves state-of-the-art performance, reducing TS-FID by over 40% compared to strong baselines. Extensive analyses show that JET captures key structural properties of neural dynamics, providing a scalable and principled approach to EEG generation. Project page: https://y-research-sbu.github.io/JET/ .
Chinese Translation
高保真脑电图生成对于缓解数据稀缺和应对大规模神经建模中的隐私限制至关重要。尽管近期取得了一些进展,但大多数现有方法通过离散去噪目标来构建脑电图生成,这不足以反映神经活动固有的连续时间动态和频谱结构。因此,这些方法往往难以保持长程时间依赖性,并且在生成信号的频谱和时间结构上存在不匹配。在本研究中,我们认为有效的脑电图生成需要直接在神经信号的连续演变上操作的模型。我们提出了Just EEG Transformer (JET),这是一个基于条件流匹配的生成框架,将脑电图建模为沿连续轨迹演变的原始序列。通过学习一个平滑的向量场,将噪声传输到脑电图数据分布,JET在不依赖于离散去噪方案或特定领域表示的情况下,捕捉时间连续性和瞬态动态。为了确保学习到的动态与脑电图信号的关键特性保持一致,我们引入了原则性约束,以保持频谱结构、时间平稳性和信号级统计特性。在三个大规模基准测试中,JET始终实现了最先进的性能,与强基线相比,TS-FID减少超过40%。广泛的分析表明,JET捕捉了神经动态的关键结构特性,为脑电图生成提供了一种可扩展且原则性的解决方案。项目页面:https://y-research-sbu.github.io/JET/
cs.CV / 139 / 2605.21300
Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens
通过强调图像负向标记减少大规模视觉语言模型中的物体幻觉
Abstract
Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating the effectiveness and general applicability.
Chinese Translation
物体幻觉是阻碍大规模视觉语言模型(LVLMs)实际应用的一个重大挑战。我们假设幻觉的一个可能来源是模型倾向于优先生成文本,而非与图像进行有意义的互动。为此,我们研究了生成过程,并根据文本标记对输入图像标记的视觉依赖性,将其分为三类:图像正向标记、不变标记和负向标记。我们的分析表明,大多数生成的标记受到图像信息的影响很小。这表明在模型的训练阶段,更加重视学习如何遵循文本指令,而不是从图像中提取信息。基于这一发现,我们提出根据不同标记的视觉依赖性调整训练权重,以控制幻觉。此外,我们还作为数据过滤策略,移除可能包含更多幻觉的部分训练数据。这两种方法在不影响响应长度或在推理过程中引入额外计算成本的情况下,都实现了幻觉的减少。我们在三个LVLM变体上验证了我们的方法,证明了其有效性和广泛适用性。
cs.CV / 140 / 2605.21308
Deformba: Vision State Space Model with Adaptive State Fusion
Deformba:具有自适应状态融合的视觉状态空间模型
Abstract
State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases the complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. This fusion mechanism is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context adaptive method that dynamically augments the spatial structural information while maintaining the linear complexity of SSMs. Deformba also allows multi-modal fusion like cross attention. To demonstrate the effectiveness and general applicability of Deformba, we test its performance on general 2D vision tasks such as image classification, object detection, and segmentation, as well as 3D vision tasks like BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.
Chinese Translation
状态空间模型(SSMs)作为一种强大且高效的替代方案,已逐渐取代变换器(Transformers),展现出线性时间复杂度和卓越的序列建模能力。然而,它们在视觉任务中的应用仍然面临挑战。首先,现有的视觉SSMs在很大程度上依赖于手动设计的固定扫描方法,将图像块展平为序列,这导致了预定义的几何结构并增加了复杂性。其次,视觉SSMs在需要不同信息流之间基于查询的交互的领域中的广泛应用受到阻碍。这是由于为一维序列建模任务设计的SSMs固有的因果性和自引用特性所致。这种融合机制对于多视角3D融合等关键感知任务是不可或缺的。为了解决这些局限性,我们提出了Deformba,这是一种上下文自适应的方法,能够动态增强空间结构信息,同时保持SSMs的线性复杂性。Deformba还允许跨模态融合,如交叉注意力。为了证明Deformba的有效性和广泛适用性,我们在一般的二维视觉任务(如图像分类、目标检测和分割)以及三维视觉任务(如BEV感知)上测试了其性能。大量实验表明,Deformba在各种视觉感知基准测试中表现出色。
cs.CV / 141 / 2605.21309
Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation
Hyper-V2X:用于估计合作鸟瞰视图语义分割中的认知不确定性和随机不确定性的超网络
Abstract
Cooperative perception enabled by Vehicle-to-Everything (V2X) communication enhances autonomous driving safety by creating a unified environmental representation through shared sensory data. While recent works have advanced multi-agent fusion for improved perception, uncertainty quantification in such cooperative frameworks remains largely unexplored. This paper introduces Hyper-V2X, a hypernetwork-based framework for estimating both epistemic and aleatoric uncertainties in V2X-based perception. Specifically, we propose a partial weight generation scheme and V2X context embedding module that conditions a Bayesian hypernetwork on fused multi-agent features to generate weight distributions for stochastic Bird's-Eye-View (BEV) segmentation. Unlike existing deterministic BEV models, Hyper-V2X enables efficient uncertainty estimation with little computation overhead. Our approach is architecture-agnostic, and can be seamlessly integrating with modern cooperative backbones such as CoBEVT. Experiments on the OPV2V benchmark demonstrate that Hyper-V2X provides accurate, well-calibrated uncertainty estimates and improves overall perception reliability. Our code and benchmark are publicly available under an open-source license: https://github.com/abhishekjagtap1/Hyper-V2X
Chinese Translation
通过车对万物(V2X)通信实现的合作感知,通过共享传感数据创建统一的环境表示,从而增强了自动驾驶的安全性。尽管近期的研究在多智能体融合以改善感知方面取得了进展,但在此类合作框架中的不确定性量化仍然基本未被探索。本文介绍了Hyper-V2X,这是一种基于超网络的框架,用于估计V2X感知中的认知不确定性和随机不确定性。具体而言,我们提出了一种部分权重生成方案和V2X上下文嵌入模块,该模块将贝叶斯超网络与融合的多智能体特征相结合,以生成随机鸟瞰视图(BEV)分割的权重分布。与现有的确定性BEV模型不同,Hyper-V2X能够以较小的计算开销高效地进行不确定性估计。我们的方法与架构无关,可以与现代合作骨干网络(如CoBEVT)无缝集成。在OPV2V基准测试中的实验表明,Hyper-V2X提供了准确且良好校准的不确定性估计,并提高了整体感知的可靠性。我们的代码和基准测试在开源许可证下公开可用: https://github.com/abhishekjagtap1/Hyper-V2X
cs.CV / 142 / 2605.21343
OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation
OcclusionFormer:为基于布局的图像生成安排Z顺序
Abstract
Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often produce entangled textures or physically inconsistent layering in the overlapped areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building upon our proposed dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.
Chinese Translation
近期的布局到图像模型在空间可控性方面取得了显著进展。然而,它们在物体间遮挡方面仍然面临挑战。当边界框重叠时,大多数现有方法缺乏明确的遮挡信息,这使得重叠区域的生成本质上模糊,并阻碍了复杂遮挡关系的确定。因此,它们往往在重叠区域产生纠缠的纹理或物理上不一致的层次结构。为了解决这一问题,我们首先构建了SA-Z,一个大规模数据集,丰富了明确的遮挡顺序和像素级注释。在我们提出的数据集基础上,我们引入了OcclusionFormer,一个新颖的遮挡感知扩散变换器框架,通过解耦实例并通过体积渲染进行合成,明确建模Z顺序优先级。此外,为了确保细粒度的空间精度,我们引入了一种查询对齐损失,明确监督单个实例并增强语义一致性。所提出的方法有效减少了重叠区域的模糊性,强制执行正确的遮挡依赖关系,并保持结构完整性,从而在多样化场景中显著提高了准确性。
cs.CV / 143 / 2605.21372
Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training
闭环动态驾驶数据混合用于真实-合成共同训练
Abstract
Data scaling is fundamental to modern deep learning, and grows increasingly critical as autonomous driving shifts to end-to-end learning. Real-world driving data is expensive to annotate and scene-biased, making real-synthetic co-training with near-infinite synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and leads to distribution shifts, and optimizing data mixture under practical training budgets remains a critical yet under-explored problem. In this sense, we claim that the mixture of training data requires clear guidance in terms of scene types and quantities. Particularly in this work, we conceptualize the data mixture approximately as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback, and propose AutoScale, a fully automated closed-loop data engine unifying scene representation, data mixture optimization and retrieval, as well as model training and evaluation. Specifically, we propose Graph Regularized AutoEncoder (Graph-RAE) for driving scene representations, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.
Chinese Translation
数据缩放是现代深度学习的基础,随着自动驾驶向端到端学习的转变,其重要性愈加突出。真实世界的驾驶数据标注成本高且存在场景偏差,使得与近乎无限的合成数据进行真实-合成共同训练成为一个有前景的方向。然而,简单地将所有可用的合成数据纳入训练是低效的,并且会导致分布偏移,在实际训练预算下优化数据混合仍然是一个关键但未被充分探索的问题。在这一点上,我们认为训练数据的混合需要在场景类型和数量方面提供明确的指导。特别是在本研究中,我们将数据混合概念化为一个动态优化过程,该过程通过闭环评估反馈迭代调整训练数据混合,以最大化模型性能,并提出了AutoScale,一个完全自动化的闭环数据引擎,统一了场景表示、数据混合优化与检索,以及模型训练和评估。具体而言,我们提出了图正则化自编码器(Graph Regularized AutoEncoder, Graph-RAE)用于驾驶场景表示,引入了集群感知梯度上升(Cluster-aware Gradient Ascent, Cluster-GA)用于集群重要性估计和重加权,并执行集群引导向量检索以选择高价值样本。在NavSim上的实验表明,AutoScale在受限预算下以更少的合成样本实现了优于传统共同训练和跨域基线的性能。
cs.CV / 144 / 2605.21381
Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration
在可控图像恢复中解构随机插值的生成与回归
Abstract
Recent advances in Image Restoration (IR) have been largely driven by generative methods such as Diffusion Models and Flow Matching, which excel in synthesizing realistic textures while suffering from slow multi-step inference and compromised pixel fidelity. In contrast, classical regression-based IR methods excel precisely in these aspects, offering single-step efficiency and high pixel-level reconstruction fidelity. To bridge this gap, we propose DiSI, a unified framework that Disentangles the underlying Stochastic Interpolant process into independent generation and regression components. This decoupling endows DiSI with remarkable versatility, enabling a continuous and controllable transition from a pure regression process to a fully generative one. Technically, we instantiate this framework with two specific sampling trajectories, accompanied by a unified sampler for high-quality, few-step inference on arbitrary trajectories. Furthermore, we design a dual-branch U-Net style transformer network in pixel space, using a dedicated branch to enhance conditional guidance while ensuring high throughput. Extensive experiments demonstrate that DiSI efficiently achieves competitive results on various IR tasks, while uniquely offering the inference-time flexibility to control the distortion-perception trade-off within a single model.
Chinese Translation
近期图像恢复(Image Restoration, IR)的进展主要得益于生成方法,如扩散模型(Diffusion Models)和流匹配(Flow Matching),这些方法在合成真实纹理方面表现出色,但在多步骤推理速度和像素保真度方面存在不足。相比之下,经典的基于回归的IR方法在这些方面表现优异,提供了单步高效性和高像素级重建保真度。为了弥补这一差距,我们提出了DiSI,一个统一框架,将基础的随机插值过程解构为独立的生成和回归组件。这种解耦赋予DiSI显著的多功能性,使其能够在纯回归过程与完全生成过程之间实现连续且可控的过渡。从技术上讲,我们通过两种特定的采样轨迹实例化该框架,并配备一个统一的采样器,以实现对任意轨迹的高质量、少步骤推理。此外,我们在像素空间设计了一个双分支U-Net风格的变换网络,利用专用分支增强条件引导,同时确保高吞吐量。大量实验表明,DiSI在各种IR任务中高效地取得了具有竞争力的结果,同时独特地提供了在单一模型中控制失真感知权衡的推理时间灵活性。
cs.CV / 145 / 2605.21411
RoadTones: Tone Controllable Text Generation from Road Event Videos
RoadTones:可控语调的道路事件视频文本生成
Abstract
Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To mitigate this, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning. Our human-validated data generation pipeline expands road-video corpora with diverse tonal annotations and multi-tone captions, yielding the RoadTones-51K dataset. We propose RoadTones-VL-CoT, a controllable video-to-text model that also generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability. We also introduce RoadTones-Eval, a new evaluation suite that jointly measures factual consistency and tone adherence. In addition, we conducted a user study whose results validate caption quality, tone control, and factual consistency. Together, these contributions lay the foundation for context-sensitive tone-controllable video captioning.
Chinese Translation
现有的视频语言模型能够生成道路事件的事实描述,但缺乏对这些事件表达方式的控制:包括语调、紧迫感或风格。这限制了其在通信关键环境中的应用,因为信息的有效性不仅依赖于内容,还依赖于呈现方式,而不仅仅是事实准确性。为了解决这一问题,我们引入了一个全面的数据集-模型-评估套件,用于可控语调的道路视频字幕生成。我们经过人工验证的数据生成管道通过多样的语调注释和多语调字幕扩展了道路视频语料库,生成了RoadTones-51K数据集。我们提出了RoadTones-VL-CoT,这是一个可控的视频到文本模型,同时生成基于语调的思维链中间草稿,以提高可解释性。我们还引入了RoadTones-Eval,一个新的评估套件,联合测量事实一致性和语调遵循。此外,我们进行了用户研究,其结果验证了字幕质量、语调控制和事实一致性。这些贡献共同为上下文敏感的可控语调视频字幕生成奠定了基础。
cs.CV / 146 / 2605.21417
Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition
排序重要:基于排名的选择性融合用于混合情感识别
Abstract
Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and na\"ive multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.
Chinese Translation
混合情感识别具有挑战性,因为情感通常以细微且重叠的多模态线索的混合形式表达,而不是单一的主导信号。我们提出了一种基于排名的多编码器框架,该框架选择性地结合来自不同预提取视频和音频编码器的互补表示。我们的方法将异构编码器特征投影到共享的潜在空间,通过基于注意力的门控模块估计样本级编码器重要性,并仅融合前n个最具信息量的编码器。为了更好地建模混合情感,我们将预测解耦为存在和显著性头,并通过概率级融合对其进行对齐。我们进一步结合特征级无监督领域适应,无需伪标记,以提高在分布变化下的鲁棒性。在BlEmoRE挑战赛上的实验表明,所提出的框架优于强大的单一编码器和简单的多编码器融合基线。我们的最终系统在比赛中排名第二,支持基于排名的选择性融合在细粒度混合情感识别中的有效性。
cs.CV / 147 / 2605.21421
AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
AIGaitor:基于边缘计算的隐私保护无云运动分析系统,面向每个人
Abstract
Motion capture is the gold standard for measuring human movement, but clinical use remains limited by cost, technical complexity, and privacy concerns. AIGaitor is a privacy-preserving, cloud-free motion analysis system that runs markerless monocular motion-capture pipelines and downstream deep-learning analysis entirely on a consumer smartphone using on-device neural accelerators. To motivate its design, we surveyed 74 rehabilitation clinicians: 92 percent said they would adopt an accurate, cost-effective, easy-to-use AI gait analysis tool, while 79.7 percent cited operating cost, 68.9 percent insufficient training, and 64.9 percent privacy concerns as leading barriers. We then optimized and benchmarked mobile iOS implementations of current monocular pipeline components, including 2D and 3D pose estimation, pose optimization, skeleton-based deep-learning analysis, and a vision-language model. A Time-Priority end-to-end on-device pipeline processes a 10 s 4K 60 fps video clip in 77 s on an iPhone 14, matching or beating the same pipeline on a high-end NVIDIA H200 cloud server when network transfer is included: 94 s at global mobile-average uplink and 66 s at developed-world Wi-Fi. Lightweight models such as ViTPose-s achieve real-time keypoint extraction, and skeleton-based action-recognition models provide sub-millisecond gait classification on the same clip. To our knowledge, AIGaitor is the first monocular system to demonstrate end-to-end on-device motion capture and downstream deep-learning analysis, supporting clinically applicable movement analysis that is low-cost, private, and accessible to smartphone users.
Chinese Translation
运动捕捉是测量人类运动的金标准,但其临床应用因成本、技术复杂性和隐私问题而受到限制。AIGaitor 是一个隐私保护的无云运动分析系统,能够在消费级智能手机上使用设备内神经加速器完全运行无标记单目运动捕捉管道和后续深度学习分析。为了推动其设计,我们对74名康复临床医生进行了调查:92%的人表示他们会采用一种准确、经济实惠且易于使用的人工智能步态分析工具,而79.7%的人指出运营成本、68.9%的人提到培训不足以及64.9%的人提到隐私问题是主要障碍。随后,我们优化并基准测试了当前单目管道组件的移动iOS实现,包括2D和3D姿态估计、姿态优化、基于骨架的深度学习分析以及视觉-语言模型。一个时间优先的端到端设备内管道在iPhone 14上处理一个10秒的4K 60帧视频片段需77秒,当包括网络传输时,其性能与高端NVIDIA H200云服务器相匹配或优于:在全球移动平均上行速率下为94秒,在发达国家Wi-Fi环境下为66秒。轻量级模型如ViTPose-s实现了实时关键点提取,而基于骨架的动作识别模型在同一片段上提供了亚毫秒级的步态分类。据我们所知,AIGaitor是第一个展示端到端设备内运动捕捉和后续深度学习分析的单目系统,支持低成本、隐私保护且可供智能手机用户使用的临床适用运动分析。
cs.CV / 148 / 2605.21431
iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance
iTryOn:通过空间语义引导掌握交互式视频虚拟试穿
Abstract
Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.
Chinese Translation
视频虚拟试穿(Video Virtual Try-On, VVT)的目标是在视频中无缝地将一个服装替换为另一个服装。尽管现有方法在保持时间一致性方面取得了显著进展,但它们主要局限于非交互场景,其中模型仅展示服装。这一限制忽视了现实世界服装展示的一个关键方面:主动的人-服装互动。为了弥补这一空白,我们引入并正式化了一项新的挑战任务:交互式视频虚拟试穿(Interactive Video Virtual Try-On, Interactive VVT),在该任务中,视频中的主体积极与其服装进行互动。该任务引入了超越简单纹理保留的独特挑战,包括:(1)从标准姿势信息中解决互动的语义模糊性,以及(2)从互动时刻稀疏且短暂的视频中学习复杂的服装变形。为了解决这些挑战,我们提出了iTryOn,一个基于大规模视频扩散Transformer的新框架。iTryOn开创了一种多层次互动注入机制,以指导复杂动态的生成。在空间层面,我们引入了一种与服装无关的3D手部先验,以提供精细的指导,确保手与服装的精确接触,有效解决空间模糊性。在语义层面,iTryOn利用全局字幕提供整体上下文,并通过我们的新颖的动作感知旋转位置嵌入(Action-aware Rotational Position Embedding, A-RoPE)同步时间戳动作字幕以实现局部互动。大量实验表明,iTryOn不仅在传统VVT基准上实现了最先进的性能,而且在新的交互设置中建立了显著的领先地位,标志着向更动态和可控的虚拟试穿体验迈出了重要一步。
cs.CV / 149 / 2605.21440
ReMATF: Recurrent Motion-Adaptive Multi-scale Turbulence Mitigation for Dynamic Scenes
ReMATF:用于动态场景的递归运动自适应多尺度湍流减轻
Abstract
Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods are based on transformer, 3D architectures and require multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose ReMATF, a lightweight recurrent framework that restores videos using only two frames at a time while preserving spatial detail and temporal stability. ReMATF combines a multi-scale encoder-decoder with temporal warping and a motion-adaptive temporal fusion module that performs per-pixel fusion between the warped previous output and the current prediction to enhance coherence without enlarging the temporal window. This design reduces flicker, sharpens details, and remains efficient. Experiments on synthetic and real turbulence datasets show consistent improvements in PSNR/SSIM and perceptual quality (LPIPS), along with substantially faster inference than multi-frame transformer baselines, making ReMATF suitable turbulence mitigation in resource-constrained scenarios.
Chinese Translation
大气湍流通过引入几何扭曲、模糊和时间闪烁等失真,严重降低了视频质量,给视觉清晰度和时间一致性带来了重大挑战。目前的最先进方法基于变换器(transformer)、三维架构,并需要多帧输入,但其高计算成本和内存使用限制了实时部署,尤其是在资源受限的场景中。在本研究中,我们提出了ReMATF,这是一种轻量级的递归框架,仅使用两帧来恢复视频,同时保持空间细节和时间稳定性。ReMATF结合了多尺度编码器-解码器、时间扭曲和运动自适应时间融合模块,该模块在扭曲的前一输出和当前预测之间进行逐像素融合,以增强一致性而不扩大时间窗口。该设计减少了闪烁,锐化了细节,并保持高效性。在合成和真实湍流数据集上的实验表明,PSNR/SSIM和感知质量(LPIPS)均有一致改善,同时推理速度显著快于多帧变换器基线,使得ReMATF在资源受限的场景中适合用于湍流减轻。
cs.CV / 150 / 2605.21443
TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos
TempGlitch:评估视觉-语言模型在游戏视频中检测时间性故障的能力
Abstract
Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.
Chinese Translation
视觉-语言模型(VLMs)在视频游戏质量保证,尤其是游戏故障检测方面越来越受到关注。然而,大多数现有评估将故障视为静态视觉异常,要求模型从单帧中检测失败。我们认为这种框架忽略了一个关键区别:某些故障是空间性的,可以在孤立的帧中可见,而另一些则是时间性的,仅通过有序帧之间的变化才能显现。初步研究证实了这一差距,表明时间性故障对VLMs的检测难度远高于空间性故障。为了系统性地评估这一尚未充分探索的设置,我们引入了TempGlitch,一个用于时间性故障检测的受控游戏视频基准。TempGlitch涵盖了五种时间性故障类型,并提供了每类样本均衡的配对无故障视频,以便进行可靠的二元评估。我们在多个帧采样设置下评估了12个专有和开放权重的VLMs。我们的结果显示,当前的VLMs在TempGlitch上的表现接近随机,往往陷入过于保守的行为,错过大多数故障,或过于敏感的行为,将干净视频标记为故障。此外,更密集的帧采样和更大的模型规模并未可靠地解决这些失败。TempGlitch为时间推理、稳健的游戏理解和VLMs的自动故障检测提供了一个专注的测试平台。代码和数据可在项目网站上获取。
cs.CV / 151 / 2605.21454
ProtoPathway: Biologically Structured Prototype-Pathway Fusion for Multimodal Cancer Survival Prediction
ProtoPathway:生物结构化原型-通路融合的多模态癌症生存预测
Abstract
We introduce ProtoPathway, an interpretable-by-design multimodal framework for cancer survival prediction that unifies whole slide imaging and transcriptomics through encoders producing biologically grounded representations on both sides of the fusion. On the histopathology side, $K$ learnable morphological prototypes, trained end-to-end with the survival objective, serve as the slide representation itself: patches flow into prototype tokens via soft assignment, compressing variable-length patch sets into fixed task-adaptive tokens. On the genomic side, a bipartite graph neural network encodes gene expression within the Reactome pathway hierarchy, producing pathway embeddings that reflect both constituent genes and their broader biological context through bidirectional message passing over a shared gene--pathway graph. Cross-modal attention then operates over a compact prototype $\times$ pathway matrix in which prototypes query pathways, modeling the biological direction in which molecular programs give rise to tissue morphology. Because both axes carry stable task-learned identity, the attention matrix is itself an interpretability output, yielding native inference-time attribution across the full biological hierarchy, from genes through pathways and prototypes to spatial tissue maps. We evaluate on five TCGA cancer cohorts, demonstrating competitive or superior survival prediction with substantially improved biological interpretability and reduced computational cost, with interpretability claims validated through fold-stratified rank-based population-level analysis. Our source code, model weights, and Reactome pathways, together with a unified codebase reimplementing all multimodal survival baselines under identical preprocessing and evaluation, are available at: https://github.com/AmayaGS/ProtoPathway.
Chinese Translation
我们介绍了ProtoPathway,这是一种设计上可解释的多模态癌症生存预测框架,通过编码器统一全切片成像和转录组学,在融合的两侧生成生物学基础的表示。在组织病理学方面,$K$个可学习的形态原型通过生存目标进行端到端训练,作为切片表示本身:补丁通过软分配流入原型标记,将可变长度的补丁集压缩为固定的任务自适应标记。在基因组学方面,一个二分图神经网络在Reactome通路层次结构中编码基因表达,生成通路嵌入,这些嵌入通过在共享的基因-通路图上的双向消息传递反映了组成基因及其更广泛的生物学背景。跨模态注意力随后在紧凑的原型$ imes$通路矩阵上操作,其中原型查询通路,建模分子程序如何导致组织形态的生物学方向。由于两个轴都携带稳定的任务学习身份,注意力矩阵本身就是一种可解释性输出,提供了从基因到通路、原型再到空间组织图的全生物学层次的原生推理时间归因。我们在五个TCGA癌症队列上进行了评估,展示了具有竞争力或优越的生存预测,同时显著提高了生物学可解释性并降低了计算成本,且通过折叠分层基于排名的人群级分析验证了可解释性声明。我们的源代码、模型权重和Reactome通路,以及统一的代码库(在相同的预处理和评估下重新实现所有多模态生存基线)可在以下网址获取:https://github.com/AmayaGS/ProtoPathway。
cs.CV / 152 / 2605.21466
StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation
StreamGVE:无训练视频编辑的少步流媒体视频生成
Abstract
Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation. To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions. Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality. The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.
Chinese Translation
尽管现有的视频编辑方法通常是可行的,但它们往往需要多次昂贵的迭代,并且仍然难以提供高质量且令人满意的编辑结果。我们将这一限制归因于普遍的数据到数据范式,这种范式与现代生成模型相比,较不兼容噪声到数据生成。为了解决这一问题,我们从噪声到数据的角度重新审视视频编辑,并提出基于流生成的视频编辑方法(StreamGVE),该方法在无缝注入源视频条件的同时,保留了少步采样。StreamGVE建立在预训练的流生成模型之上,引入了双分支快速采样,配合自注意力桥接和交叉注意力基础/增强,以满足采样和条件的要求。我们进一步提出了面向源的引导,以提高目标生成质量,并采用视觉提示策略以增强编辑的灵活性和实用性。该方法在不同模型中有效、稳健且具有良好的泛化能力。在多种视频编辑任务上的广泛实验表明,StreamGVE在少步设置下,即使在最小时间成本的情况下,也始终优于现有方法。
cs.CV / 153 / 2605.21472
Stream3D: Sequential Multi-View 3D Generation via Evidential Memory
Stream3D:通过证据记忆实现的顺序多视角三维生成
Abstract
View-conditioned 3D generators such as SAM 3D, TRELLIS and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual observation often arrives as long monocular streams. Naively applying these generators to each streaming frame independently leads to severe temporal inconsistency in the generated results. To address this problem, we propose Stream3D, the first training-free streaming mechanism that turns a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. Stream3D achieves this by maintaining a compact evidential memory, which selectively caches the most informative historical frames based on a proposed evidence score mechanism. As the stream progresses, the memory dynamically updates to retain a fixed number of informative frames, preventing the memory footprint from growing linearly with sequence length. This also prevents degradation over long sequences and keeps the underlying generator completely unchanged without retraining, architectural modifications, or auxiliary losses. Evaluated on both realistic and synthetic streaming benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, across both photometric and geometric metrics. More details can be found at: https://anonymous-submission-20.github.io/streaming3D.github.io/.
Chinese Translation
视图条件的三维生成器,如SAM 3D、TRELLIS和Hunyuan3D,能够从单一视角生成高质量的物体重建,但现实世界中的视觉观察通常以长单目流的形式出现。简单地将这些生成器独立应用于每个流帧会导致生成结果的严重时间不一致性。为了解决这个问题,我们提出了Stream3D,这是首个无训练的流式机制,将一个冻结的视图条件三维生成器转变为具有恒定跨块记忆的流式生成器。Stream3D通过维护一个紧凑的证据记忆来实现这一点,该记忆根据提出的证据评分机制选择性地缓存最具信息量的历史帧。随着流的推进,记忆动态更新,以保留固定数量的信息帧,防止记忆占用随着序列长度线性增长。这也防止了在长序列上的性能退化,并保持底层生成器完全不变,无需重新训练、架构修改或辅助损失。在现实和合成流式基准测试中评估后,Stream3D在光度和几何指标上均超越了潜在传输基线,包括KV缓存重用和基于流的特征编辑。更多细节请参见:https://anonymous-submission-20.github.io/streaming3D.github.io/
cs.CV / 154 / 2605.21478
Latent Dynamics for Full Body Avatar Animation
全身虚拟形象动画的潜在动态
Abstract
Pose-driven full-body avatars built on neural rendering produce high-quality novel views of a captured subject. Yet loose clothing and other dynamic elements deform in ways pose alone cannot explain: the same pose can correspond to many different states, because their motion depends on history, inertia, and contact. Explicit simulation and layered-garment methods can model such dynamics, but they require either a dedicated garment template, which raw multi-view capture does not naturally provide, or a test-time physics simulator with non-trivial runtime cost. A parallel line of work learns data-driven clothing avatars that avoid explicit garment layers. These methods add an auxiliary latent for variation beyond pose; at inference, they fix it, regress it from pose, or retrieve it from training data, without explicitly modeling how the latent evolves with its own dynamics. Additionally, even in everyday motion with loose clothing, existing architectures often struggle to capture fine-grained detail, producing blurry renderings and temporal artifacts. We augment a pose-conditioned 3D Gaussian avatar with a transformer-based decoder and a dynamics residual latent that captures temporal appearance and geometry variation beyond the driving signals. At inference, a learned latent dynamics model evolves the residual latent from a short pose history and the previous latent state. The model decomposes each update into driving, restoring, and dissipative forces, producing temporally coherent, history-dependent rollouts with negligible added cost. Different initial conditions yield diverse yet plausible motion trajectories, and the force decomposition exposes controls such as stiffness. Across nine captured sequences of everyday motion with diverse loose garments, quantitative metrics and a perceptual user study show improved animation quality over recent data-driven baselines.
Chinese Translation
基于神经渲染的姿态驱动全身虚拟形象能够生成高质量的捕捉对象的新视图。然而,宽松衣物和其他动态元素的变形方式无法仅通过姿态来解释:同一姿态可以对应许多不同的状态,因为它们的运动依赖于历史、惯性和接触。显式模拟和分层服装方法可以建模这些动态,但它们要么需要专用的服装模板,而原始的多视角捕捉并不自然提供,要么需要在测试时使用具有非平凡运行成本的物理模拟器。另一条平行的研究路线学习数据驱动的服装虚拟形象,避免显式的服装层。这些方法添加了一个辅助潜变量,以实现超越姿态的变化;在推理时,它们固定该潜变量、从姿态回归或从训练数据中检索,而不显式建模潜变量如何随着自身动态演变。此外,即使在宽松衣物的日常运动中,现有架构往往难以捕捉细粒度细节,导致模糊的渲染和时间伪影。我们通过一个基于变换器的解码器和一个捕捉超出驱动信号的时间外观和几何变化的动态残差潜变量,增强了姿态条件的3D高斯虚拟形象。在推理时,学习的潜在动态模型根据短暂的姿态历史和先前的潜在状态演变残差潜变量。该模型将每次更新分解为驱动、恢复和耗散力,生成时间上连贯、依赖历史的演化,且附加成本微乎其微。不同的初始条件产生多样而合理的运动轨迹,力的分解揭示了如刚度等控制因素。在九个捕捉到的日常运动序列中,涵盖多种宽松服装,定量指标和感知用户研究表明,与最近的数据驱动基线相比,动画质量得到了改善。
cs.CV / 155 / 2605.21479
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
WikiVQABench:来自维基百科和维基数据的知识基础视觉问答基准
Abstract
Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence for correct resolution. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). Evaluation of fifteen VLMs (256M-90B parameters) reveals a wide performance range (24.7%-75.6% accuracy), demonstrating that the benchmark effectively discriminates model capabilities on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.
Chinese Translation
视觉问答(VQA)基准主要强调可以仅通过视觉内容解决的感知任务。相比之下,许多现实场景需要外部知识,这些知识在图像中并不可直接观察,以便正确回答。我们介绍了WikiVQABench,这是一个经过人工策划的知识基础VQA基准,通过系统地结合维基百科图像、相关的文章标题以及来自维基数据的结构化知识构建而成。我们的流程使用大型语言模型(LLMs)生成候选的多项选择图像-问题-答案集。所有生成的实例随后由人工注释者进行审核和策划,以确保事实正确性、视觉-文本一致性,并且每个问题在正确解答时需要外部知识,除了视觉证据之外。WikiVQABench包含大量经过策划的维基百科图像及其设计用于基准测试知识感知视觉语言模型(VLMs)的多项选择问题。对十五个VLM(参数范围从256M到90B)的评估显示出广泛的性能差异(准确率范围为24.7%-75.6%),证明该基准有效区分模型在知识密集推理方面的能力。数据集和基准测试代码已公开提供。
cs.CV / 156 / 2605.21484
One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration
通过固定点迭代实现离散扩散图像生成器的一步蒸馏
Abstract
Discrete diffusion models excel at visual synthesis but rely on slow, iterative decoding. Existing single-step distillation methods attempt to bypass this bottleneck, either by training auxiliary score networks that effectively double compute, or by introducing specialized parameterizations and multi-stage pipelines that fragment optimization. In this paper, we introduce Fixed-Point Distillation (FPD), an end-to-end framework that constructs local correction targets by partially corrupting the student's one-step draft and refining it with a single teacher step. To compute the training objective in a semantically meaningful space, we lift discrete tokens into continuous features and apply a multi-bandwidth drift loss that iteratively accumulates these corrections. To backpropagate through the discrete bottleneck, we employ a straight-through estimator that feeds exact hard-sampled tokens to the teacher and decoder during the forward pass, ensuring that training and inference operate on the same codebook manifold, while routing continuous gradients back to the student logits. This fully differentiable pathway additionally accommodates an optional unconditional adversarial objective to enhance perceptual realism. Evaluations on both class- and text-conditional generation validate the effectiveness of our framework. FPD achieves competitive visual fidelity and structural alignment within a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.
Chinese Translation
离散扩散模型在视觉合成方面表现出色,但依赖于缓慢的迭代解码。现有的一步蒸馏方法试图绕过这一瓶颈,要么通过训练辅助评分网络有效地增加计算量,要么通过引入专门的参数化和多阶段管道来分散优化。在本文中,我们提出了固定点蒸馏(Fixed-Point Distillation, FPD),这是一个端到端框架,通过部分破坏学生的一步草稿并通过单个教师步骤进行精炼,构建局部修正目标。为了在语义上有意义的空间中计算训练目标,我们将离散标记提升为连续特征,并应用多带宽漂移损失,迭代累积这些修正。为了通过离散瓶颈进行反向传播,我们采用了直通估计器,在前向传播过程中将精确的硬采样标记馈送给教师和解码器,确保训练和推理在同一代码簿流形上进行,同时将连续梯度路由回学生的逻辑值。这个完全可微的路径还可以容纳一个可选的无条件对抗目标,以增强感知现实性。在类别和文本条件生成的评估中验证了我们框架的有效性。FPD在单次推理步骤中实现了具有竞争力的视觉保真度和结构对齐,缩小了与多步骤教师之间的差距,同时超越了现有的离散蒸馏基准。
cs.CV / 157 / 2605.21487
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Uni-Edit:智能编辑是统一模型调优的通用任务
Abstract
Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely resulting in a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, transforming diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, pairing diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.
Chinese Translation
目前,增强统一多模态模型(Unified Multimodal Models, UMMs)在图像理解、生成和编辑能力方面主要依赖于混合多任务训练。由于固有的任务冲突,这种策略需要复杂的多阶段流程、大量数据混合和均衡技巧,仅导致性能的权衡,而非真正的相互增强。为打破这一范式,我们提出了Uni-Edit,一项智能图像编辑任务,作为UMM调优的首个通用任务。与复杂的混合流程不同,Uni-Edit仅通过一个任务、一个训练阶段和一个数据集,同时提升所有三种能力的性能。具体而言,我们首先将图像编辑识别为一种固有的理想通用任务,因为它自然要求具备视觉理解和生成能力。然而,现有的编辑数据依赖于简单的指令,严重低估了模型的理解能力。为了解决这个问题,我们引入了首个自动化和可扩展的数据合成流程,用于智能编辑,将多样的视觉问答(Visual Question Answering, VQA)数据转化为复杂且有效的编辑指令,嵌入问题和嵌套逻辑。这产生了Uni-Edit-148k,将多样的推理密集型指令与高质量的编辑图像配对。对BAGEL和Janus-Pro的广泛实验表明,仅在Uni-Edit上进行调优即可在所有三种能力上实现全面增强,而无需任何辅助操作。
cs.AI / 1 / 2605.20189
SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation
SOLAR:一种自我优化的开放式自主代理,用于终身学习和持续适应
Abstract
Despite the remarkable success of large language models (LLMs), they still face bottlenecks while deploying in dynamic, real-world settings with primary challenges being concept drift and the high cost of gradient-based adaptation. Traditional fine-tuning (FT) struggles to adapt to non-stationary data streams without resulting in catastrophic for getting or requiring extensive manual data curation. To address these limitations within the streaming and continual learning paradigm, we propose the Self-Optimizing Lifelong Autonomous Reasoner (SOLAR) which is an open-ended autonomous agent that leverages parameter-level meta-learning to self-improve, treating model weights as an environment for exploration. It initiates the process by consolidating a strong prior over common-sense knowledge making it effective for transfer-learning. By utilizing a multi-level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies, enabling efficient test-time adaptation to unseen domains. Crucially, SOLAR maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity (adaptation to new tasks) and stability (retention of meta-knowledge). Experiments demonstrate that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social and logical reasoning tasks, marking a significant step toward autonomous agents capable of lifelong adaptation in evolving environments.
Chinese Translation
尽管大型语言模型(LLMs)取得了显著成功,但在动态的现实环境中部署时仍面临瓶颈,主要挑战包括概念漂移和基于梯度的适应成本高昂。传统的微调(FT)在适应非平稳数据流时往往难以避免灾难性遗忘,或需要大量的手动数据整理。为了解决这些在流式和持续学习范式中的局限性,我们提出了自我优化的终身自主推理器(SOLAR),这是一种开放式自主代理,利用参数级元学习进行自我改进,将模型权重视为探索的环境。它通过巩固对常识知识的强先验来启动这一过程,使其在迁移学习中表现有效。通过采用多层次强化学习方法,SOLAR自主发现适应策略,使其能够高效地在测试时适应未见领域。关键的是,SOLAR维护一个不断发展的有效修改策略知识库,隐式地充当情节记忆缓冲区,以平衡可塑性(适应新任务)和稳定性(保留元知识)。实验表明,SOLAR在常识、数学、医学、编程、社会和逻辑推理任务上超越了强基线,标志着朝着能够在不断变化的环境中实现终身适应的自主代理迈出了重要一步。
cs.AI / 2 / 2605.20190
Tool-Augmented Agent for Closed-loop Optimization,Simulation,and Modeling Orchestration
工具增强的闭环优化、仿真和建模编排代理
Abstract
Iterative industrial design-simulation optimization is bottlenecked by the CAD-CAE semantic gap: translating simulation feedback into valid geometric edits under diverse, coupled constraints. To fill this gap, we propose COSMO-Agent (Closed-loop Optimization, Simulation, and Modeling Orchestration), a tool-augmented reinforcement learning (RL) framework that teaches LLMs to complete the closed-loop CAD-CAE process. Specifically, we cast CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment, where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied. To make this learning stable and industrially usable, we design a multi-constraint reward that jointly encourages feasibility, toolchain robustness, and structured output validity. In addition, we contribute an industry-aligned dataset that covers 25 component categories with executable CAD-CAE tasks to support realistic training and evaluation. Experiments show that COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.
Chinese Translation
迭代工业设计-仿真优化受到CAD-CAE语义差距的制约:在多样化的耦合约束下,将仿真反馈转化为有效的几何编辑。为了解决这一问题,我们提出了COSMO-Agent(闭环优化、仿真和建模编排),这是一个工具增强的强化学习(RL)框架,旨在教会大型语言模型(LLMs)完成闭环CAD-CAE过程。具体而言,我们将CAD生成、CAE求解、结果解析和几何修订视为一个交互式RL环境,在该环境中,LLM学习协调外部工具并修订参数几何,直到满足约束。为了使这种学习过程稳定且适用于工业应用,我们设计了一种多约束奖励机制,旨在共同促进可行性、工具链鲁棒性和结构化输出有效性。此外,我们贡献了一个与行业对齐的数据集,涵盖25个组件类别及可执行的CAD-CAE任务,以支持现实的训练和评估。实验表明,COSMO-Agent的训练显著提升了小型开源LLMs在约束驱动设计中的表现,在可行性、效率和稳定性方面超越了大型开源和强闭源模型。
cs.AI / 3 / 2605.20423
OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind
OSCToM:基于强化学习的对抗生成高阶心智理论
Abstract
Large Language Models (LLMs) perform well on many language tasks, but their Theory of Mind (ToM) reasoning is still uneven in complex social settings. Existing benchmarks, including ExploreToM, do not always test the recursive beliefs and information asymmetries that make these settings difficult. This paper presents OSCToM (Observer-Self Conflict Theory of Mind), an approach for modeling nested belief conflicts in LLM-based ToM tasks. The key case is one in which an observer's view of another agent conflicts with the observer's own belief state. Such cases go beyond simple perspective-taking and require recursive, multi-layered reasoning. OSCToM combines reinforcement learning (RL), an extended domain-specific language, and compositional surrogate models to generate observer-self conflicts. In our experiments, OSCToM-8B gives the best overall result among the systems tested. It improves on the reported ExploreToM results on FANToM and remains competitive on Hi-ToM and BigToM. On the information-asymmetric FANToM benchmark, OSCToM reaches 76% accuracy, compared with the 0.2% reported by ExploreToM. The data-synthesis procedure is also 6x more efficient, indicating that targeted training data can help smaller models handle advanced cognitive reasoning. The project code is available at https://github.com/sharminsrishty/osct.
Chinese Translation
大型语言模型(LLMs)在许多语言任务上表现良好,但它们在复杂社会环境中的心智理论(ToM)推理仍然不均衡。现有的基准测试,包括ExploreToM,并不总是测试那些使这些环境变得困难的递归信念和信息不对称。本文提出了OSCToM(观察者-自我冲突心智理论),一种用于建模基于LLM的ToM任务中嵌套信念冲突的方法。关键案例是观察者对另一个代理的看法与观察者自身的信念状态相冲突。这类案例超越了简单的视角转换,要求进行递归的多层次推理。OSCToM结合了强化学习(RL)、扩展的领域特定语言和组合替代模型,以生成观察者-自我冲突。在我们的实验中,OSCToM-8B在测试的系统中表现最佳。它在FANToM上改善了报告的ExploreToM结果,并在Hi-ToM和BigToM上保持竞争力。在信息不对称的FANToM基准测试中,OSCToM达到了76%的准确率,而ExploreToM报告的准确率仅为0.2%。数据合成过程的效率也提高了6倍,这表明有针对性的训练数据可以帮助较小的模型处理高级认知推理。项目代码可在https://github.com/sharminsrishty/osct获取。
cs.AI / 4 / 2605.20425
AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
AgentCo-op:基于检索的可互操作多智能体工作流合成
Abstract
Designing multi-agent workflows is especially difficult in open-ended scientific settings where tasks lack curated training sets, reliable scalar evaluation metrics, and standardized interfaces between existing tools and agents. We propose AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs, then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies, AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. It coordinates specialized agents for spatial transcriptomics and gene-set interpretation to enable collaborative discovery from spatial transcriptomics data, and builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. AgentCo-op can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair, showing that synthesis and search are complementary. On six coding, math, and question-answering benchmarks, AgentCo-op achieves the best result on four benchmarks and the best average score under a unified backbone setting, while consistently reducing per-task cost relative to multi-agent baselines. Together, these results suggest that retrieval-based synthesis can extend automated agentic workflow design beyond benchmark-optimized agent graphs to open-world workflows built from existing agents, tools, and typed artifacts.
Chinese Translation
在开放式科学环境中,设计多智能体工作流尤其困难,因为任务缺乏经过整理的训练集、可靠的标量评估指标以及现有工具和智能体之间的标准化接口。我们提出了AgentCo-op,一种基于检索的合成框架,通过类型化的工件交接将可重用的技能、工具和外部智能体组合成可执行的工作流,并在执行证据表明失败时,对相关组件应用有限的自我引导局部修复。在两个开放世界基因组学案例研究中,AgentCo-op将独立开发的科学智能体和外部工具库组合成可审计的工作流,而无需重新设计它们或进行全局拓扑搜索。它协调专门的智能体进行空间转录组学和基因集解释,以便从空间转录组学数据中实现协作发现,并为单细胞多组学数据构建交叉模态标记分析的并行工作流。AgentCo-op还可以将搜索到的工作流作为结构先验导入,并通过用检索到的组件对节点进行定位和应用局部修复来改进它,显示出合成与检索是互补的。在六个编码、数学和问答基准测试中,AgentCo-op在四个基准测试中取得了最佳结果,并在统一主干设置下获得了最佳平均分,同时相对于多智能体基线持续降低每个任务的成本。这些结果共同表明,基于检索的合成可以将自动化智能体工作流设计扩展到基于现有智能体、工具和类型化工件构建的开放世界工作流,超越基准优化的智能体图。
cs.AI / 5 / 2605.20467
High Quality Embeddings for Horn Logic Reasoning
高质量的嵌入表示用于霍恩逻辑推理
Abstract
Neural networks can be trained to rank the choices made by logical reasoners, resulting in more efficient searches for answers. A key step in this process is creating useful embeddings, i.e., numeric representations of logical statements. This paper introduces and evaluates several approaches to creating embeddings that result in better downstream results. We train embeddings using triplet loss, which requires examples consisting of an anchor, a positive example, and a negative example. We introduce three ideas: generating anchors that are more likely to have repeated terms, generating positive and negative examples in a way that ensures a good balance between easy, medium, and hard examples, and periodically emphasizing the hardest examples during training. We conduct several experiments to evaluate this approach, including a comparison of different embeddings across different knowledge bases, in an attempt to identify what characteristics make an embedding well-suited to a particular reasoning task.
Chinese Translation
神经网络可以被训练来对逻辑推理者所做的选择进行排序,从而实现更高效的答案搜索。这个过程中的一个关键步骤是创建有用的嵌入表示,即逻辑语句的数值表示。本文介绍并评估了几种创建嵌入的方法,这些方法能够产生更好的下游结果。我们使用三元组损失(triplet loss)来训练嵌入,这要求示例由一个锚点、一个正例和一个负例组成。我们提出了三个思路:生成更可能包含重复术语的锚点,以确保正例和负例之间在简单、中等和困难示例之间保持良好的平衡,以及在训练过程中定期强调最困难的示例。我们进行了多项实验来评估这种方法,包括在不同知识库中比较不同的嵌入,以期识别出哪些特征使得嵌入更适合特定的推理任务。
cs.AI / 6 / 2605.20490
\ECUAS{n}: A family of metrics for principled evaluation of uncertainty-augmented systems
ECUAS{n}:一种用于不确定性增强系统原则性评估的度量家族
Abstract
In high-stakes automated decision-making, access to predictive uncertainty is essential for enabling users -- human or downstream systems -- to accept or reject predictions based on application-specific cost trade-offs. Such uncertainty-augmented (UA) systems -- i.e., systems that output both predictions and uncertainty scores -- are currently being assessed in the literature in a variety of ways, using separate metrics to evaluate the predictions and the uncertainty scores, setting a cost function with a fixed rejection cost or integrating over a coverage-risk curve. We argue that these evaluation approaches are inadequate for assessing overall performance of the UA system for decision making under uncertainty and propose a novel family of metrics, \ECUAS{n}, formulated as proper scoring rules for the task of interest. The parameter $n$ controls the trade-off between the cost of incorrect predictions and imperfect uncertainties depending on the needs of the use-case. We demonstrate the advantages of the \ECUAS{n} metrics both theoretically and empirically, through experiments on diverse classification and generation datasets, including a manually annotated subset of TriviaQA.
Chinese Translation
在高风险的自动决策中,获取预测不确定性对于使用户(无论是人类还是下游系统)能够基于特定应用的成本权衡接受或拒绝预测至关重要。这类不确定性增强(UA)系统,即同时输出预测和不确定性分数的系统,目前在文献中以多种方式进行评估,使用不同的度量来评估预测和不确定性分数,设置具有固定拒绝成本的成本函数,或在覆盖-风险曲线上进行积分。我们认为这些评估方法不足以全面评估UA系统在不确定性下的决策性能,并提出了一种新颖的度量家族ECUAS{n},该度量被构造为针对特定任务的适当评分规则。参数n控制了错误预测成本与不完美不确定性之间的权衡,具体取决于使用案例的需求。我们通过在多样的分类和生成数据集上的实验,包括TriviaQA的手动注释子集,理论和实证地展示了ECUAS{n}度量的优势。
cs.AI / 7 / 2605.20520
Open-World Evaluations for Measuring Frontier AI Capabilities
开放世界评估:衡量前沿人工智能能力
Abstract
Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.
Chinese Translation
基于基准的评估在跟踪前沿人工智能进展方面仍然重要。然而,由于它偏重于可以精确指定、自动评分、易于优化且预算低、时间短的任务,因此可能会高估或低估已部署的能力。我们倡导一种补充的评估类别,称为开放世界评估:通过小样本定性分析评估长期、复杂的现实世界任务,而不是基准规模的自动化。在本文中,我们调查了近期的开放世界评估,识别其优缺点,并介绍了CRUX(Collaborative Research for Updating AI eXpectations)项目,旨在定期进行此类评估。作为第一个实例,我们要求一个人工智能代理开发并发布一个简单的iOS应用程序到Apple App Store。该代理仅在一次可避免的手动干预下完成了任务,这表明开放世界评估可以提供对即将普及的能力的早期预警。最后,我们提出了设计和报告开放世界评估的建议。
cs.AI / 8 / 2605.20530
AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
AgentAtlas:超越大型语言模型代理的结果排行榜
Abstract
Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but the benchmarks used to evaluate them are fragmented: each emphasizes a different unit of measurement (final task success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness). A line of 2024-2025 work has converged on the diagnosis that a single accuracy column is no longer the right unit of comparison for deployable agents. AgentAtlas extends this line of work with four components: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a nine-category trajectory-failure taxonomy with two orthogonal hierarchical labels (primary_error_source, impact); (iii) a taxonomy-aware vs. taxonomy-blind methodology that measures how much of a model's apparent capability comes from the supervision in the prompt; and (iv) a benchmark-coverage audit mapping fifteen agent benchmarks against six behavioral axes. To demonstrate the methodology we run a small fixed eight-model set (1,342 generated items, four frontier closed and four open-weight) under both prompt modes. Removing the explicit label menu drops every model's trajectory accuracy by 14-40 pp to a tight 0.54-0.62 floor regardless of family, and no single model wins on all three of control accuracy, trajectory diagnosis, and tool-context utility retention. We treat the synthetic run as a measurement-protocol demonstration, not a benchmark release.
Chinese Translation
大型语言模型代理现在可以在代码库、浏览器、操作系统、日历、文件和工具生态系统中执行操作,但用于评估它们的基准测试却是零散的:每个基准强调不同的测量单位(最终任务成功率、工具调用有效性、重复通过一致性、轨迹安全性或攻击鲁棒性)。2024-2025年的一系列研究工作已经达成共识,认为单一的准确性指标不再是可部署代理的合适比较单位。AgentAtlas通过四个组成部分扩展了这一研究方向:(i)一个六状态控制决策分类法(行动/询问/拒绝/停止/确认/恢复);(ii)一个九类别轨迹失败分类法,具有两个正交的层次标签(主要错误来源、影响);(iii)一种分类法感知与分类法盲目的方法论,测量模型表面能力中有多少来自提示中的监督;(iv)一个基准覆盖审计,将十五个代理基准映射到六个行为轴。为了展示该方法论,我们在两种提示模式下运行了一个固定的八模型小集合(1,342个生成项目,四个前沿闭合模型和四个开放权重模型)。去除显式标签菜单使每个模型的轨迹准确性下降14-40个百分点,紧缩至0.54-0.62的底线,无论模型家族如何,且没有单一模型在控制准确性、轨迹诊断和工具上下文效用保留这三项上全部获胜。我们将这次合成运行视为一种测量协议演示,而非基准发布。
cs.AI / 9 / 2605.20554
Personality Engineering with AI Agents: A New Methodology for Negotiation Research
基于人工智能代理的人格工程:谈判研究的新方法论
Abstract
According to canonical negotiation theory, people's success in a negotiation depends on how well they balance competing demands--empathizing and asserting, demonstrating concern for other and concern for self, being soft on the people and hard on the problem. Yet people struggle to manage these tensions, so researchers have lacked the ability to rigorously test the field's prescriptions under controlled conditions. AI agents do not face the same limitations, and their precision, repertoire, consistency, and scalability enable a new class of experiments to contribute to negotiation theory. In this article, we introduce personality engineering: a methodology that uses AI agents to precisely parameterize, manipulate, and evaluate negotiator personality. We propose using the interpersonal circumplex--and its two core dimensions of warmth and dominance--as a foundational coordinate system for the field. This approach offers both a rigorous methodology for testing classic negotiation theories and a practical guide for designing the personalities of AI negotiation agents.
Chinese Translation
根据经典谈判理论,个人在谈判中的成功取决于他们如何平衡相互竞争的需求——同理心与自我主张、对他人的关心与对自我的关心、对人和问题的软硬态度。然而,人们在管理这些紧张关系时常常感到困难,因此研究人员缺乏在受控条件下严格测试该领域建议的能力。人工智能代理不受这些限制,其精确性、表现范围、一致性和可扩展性使得一种新的实验类别能够为谈判理论做出贡献。在本文中,我们介绍了人格工程:一种利用人工智能代理精确参数化、操控和评估谈判者人格的方法论。我们建议使用人际圆周模型(interpersonal circumplex)及其两个核心维度——温暖和主导性——作为该领域的基础坐标系统。这种方法不仅为测试经典谈判理论提供了严格的方法论,也为设计人工智能谈判代理的人格提供了实用指南。
cs.AI / 10 / 2605.20577
Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX
Mahjax:一种用于 JAX 中强化学习的 GPU 加速麻将模拟器
Abstract
Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning \textit{tabula rasa} (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce \textbf{Mahjax}, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to \textbf{2 million} and \textbf{1 million steps per second} on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment's utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.
Chinese Translation
立直麻将是一种多玩家、不完全信息的游戏,其特点是随机性和高维状态空间。这些特性带来了独特的挑战,反映了强化学习中复杂的现实决策问题。尽管先前的研究在很大程度上依赖于从人类游戏日志中进行监督学习以预训练策略,但能够从零开始(tabula rasa)学习的算法提供了更大的通用适用性潜力,正如 AlphaZero 系列所证明的那样。为了促进此类研究,我们引入了 extbf{Mahjax},这是一个完全向量化的立直麻将环境,采用 JAX 实现,以支持在图形处理单元(GPUs)上的大规模并行化。我们还提供了一个高质量的可视化工具,以简化调试和与训练代理的交互。实验结果表明,在无红和红规则下,Mahjax 在八个 NVIDIA A100 GPU 上的吞吐量分别达到 extbf{200万} 和 extbf{100万步每秒}。此外,我们通过展示代理能够有效训练以提高其相对于基线策略的排名,验证了该环境在强化学习中的实用性。
cs.AI / 11 / 2605.20608
From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)
从自动化到自主化:层次化代理原生网络架构(HANA)
Abstract
Realizing Level 4/5 Autonomous Networks (AN) demands a shift from static automation to agent-native intelligence. Current operations, reliant on rigid scripts, lack the cognitive agency to handle off-nominal conditions. To address this, this letter proposes a hierarchical multi-agent reference architecture enabling high-level autonomy. The framework features a Dual-Driven Orchestrator that coordinates specialized Executive Agents, supported by a shared Public Memory for unified domain knowledge. A key innovation is the integration of agent self-awareness, which empowers the system to harmonize deliberative strategic governance with reflexive fault recovery. We instantiate and validate this architecture within a 5G Core environment. Case studies demonstrate that the system sustains critical throughput under congestion and reduces Mean Time to Repair (MTTR) by 86%, confirming its efficacy in unifying strategic planning with operational resilience.
Chinese Translation
实现4/5级自主网络(AN)需要从静态自动化转向代理原生智能。目前的操作依赖于僵化的脚本,缺乏应对非正常情况的认知能力。为了解决这个问题,本文提出了一种层次化多代理参考架构,以实现高级自主性。该框架具有一个双驱动协调器,负责协调专门的执行代理,并通过共享公共记忆支持统一的领域知识。一个关键创新是代理自我意识的整合,使系统能够协调深思熟虑的战略治理与反射性的故障恢复。我们在5G核心环境中实例化并验证了该架构。案例研究表明,该系统在拥堵情况下能够维持关键吞吐量,并将平均修复时间(MTTR)减少86%,确认了其在战略规划与操作弹性统一方面的有效性。
cs.AI / 12 / 2605.20618
COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space
COAgents:学习和导航路由问题搜索空间的多智能体框架
Abstract
Although Vehicle Routing Problems (VRP) are essential to many real-world systems, they remain computationally intractable at scale due to their combinatorial complexity. Traditional heuristics rely on handcrafted rules for local improvements and occasional \textit{jumps} to escape local minima, but often struggle to generalize across diverse instances. We introduce \textbf{COAgents}, a cooperative multi-agent framework that models the search process as a graph: nodes represent solutions, and edges correspond to either local refinements or large perturbations for diversification (i.e., jumps). A \textit{Partial Search Graph} (PSG) is dynamically constructed during search, enabling COAgents to train a Node Selection Agent and a Move Selection Agent to guide intensification, and a Jump Agent to trigger well-timed explorations of new regions. Unlike end-to-end learning approaches, COAgents cleanly separates problem-agnostic search control from compact domain-specific encoding, facilitating adaptability across tasks. Extensive experiments on the CVRP and VRPTW benchmarks show that COAgents remains competitive with several learn-to-search baselines on CVRP and sets a new state of the art among learning-based methods on the more challenging VRPTW instances, reducing the gap to the best-known solutions by 14\% at $N\!=\!100$ and 44\% at $N\!=\!50$ relative to the strongest neural solver (POMO), and by 21\% and 40\% respectively relative to ALNS. Code is available at https://github.com/mahdims/COAgents.
Chinese Translation
尽管车辆路径问题(Vehicle Routing Problems, VRP)在许多现实世界系统中至关重要,但由于其组合复杂性,规模化时仍然计算上不可行。传统启发式方法依赖于手工制定的规则进行局部改进,并偶尔进行 extit{跳跃}以逃离局部最小值,但往往难以在不同实例间进行泛化。我们提出了 extbf{COAgents},一个合作多智能体框架,将搜索过程建模为图:节点代表解决方案,边对应于局部优化或大幅扰动以实现多样化(即跳跃)。在搜索过程中动态构建 extit{部分搜索图}(Partial Search Graph, PSG),使得COAgents能够训练节点选择代理(Node Selection Agent)和移动选择代理(Move Selection Agent)以引导强化,同时训练跳跃代理(Jump Agent)以触发对新区域的适时探索。与端到端学习方法不同,COAgents清晰地将与问题无关的搜索控制与紧凑的领域特定编码分离,促进了跨任务的适应性。在CVRP和VRPTW基准上的广泛实验表明,COAgents在CVRP上与多个学习搜索基线保持竞争力,并在更具挑战性的VRPTW实例中设定了学习方法的新最优,分别将与最强神经求解器(POMO)相比在$N
eq100$时减少了14 ext{%}的差距,在$N
eq50$时减少了44 ext{%},与ALNS相比分别减少了21 ext{%}和40 ext{%}。代码可在https://github.com/mahdims/COAgents获取。
cs.AI / 13 / 2605.20630
Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines
评估代理计划执行管道中的时间语义缓存和工作流优化
Abstract
Industrial asset operations workflows are latency-sensitive because a single user query may require coordination over sensor data, work orders, failure modes, forecasting tools, and domain-specific agents. We evaluate this problem on AssetOpsBench (AOB), an industrial agent benchmark whose plan-execute pipeline exposes repeated overhead from tool discovery, LLM planning, MCP tool execution, and final summarization. Existing LLM caching techniques such as KV-cache reuse and embedding-based semantic caching were designed for chatbot serving and break down when output validity depends on time, asset, or sensor parameters. We propose two complementary optimization layers for AOB plan-execute pipelines: a temporal semantic cache and a set of MCP workflow optimizations combining disk-backed tool-discovery caching and dependency-aware parallel step execution. MCP workflow optimizations corresponded to a 1.67x speedup and reduced median end-to-end latency by about 40.0% while the temporal-cache benchmark achieved a median of 30.6x speedup on cache hits. Beyond the speedup, our results expose a concrete failure mode of pure semantic caching for parameter-rich industrial queries, providing a critical analysis of how caching choices interact with evaluation correctness in MCP-backed agent benchmarks.
Chinese Translation
工业资产操作工作流对延迟敏感,因为单个用户查询可能需要协调传感器数据、工作订单、故障模式、预测工具和特定领域的代理。我们在资产操作基准(AssetOpsBench, AOB)上评估了这个问题,该基准的计划执行管道暴露了工具发现、LLM(大语言模型)规划、MCP(多通道处理)工具执行和最终总结的重复开销。现有的LLM缓存技术,如KV-cache重用和基于嵌入的语义缓存,旨在为聊天机器人服务,但当输出有效性依赖于时间、资产或传感器参数时,这些技术会失效。我们为AOB计划执行管道提出了两个互补的优化层:一个时间语义缓存和一组结合了磁盘支持的工具发现缓存和依赖感知并行步骤执行的MCP工作流优化。MCP工作流优化实现了1.67倍的加速,并将中位数端到端延迟减少了约40.0%,而时间缓存基准在缓存命中时实现了30.6倍的中位数加速。除了加速效果,我们的结果揭示了纯语义缓存在参数丰富的工业查询中的具体失效模式,提供了对缓存选择如何与MCP支持的代理基准中的评估正确性相互作用的关键分析。
cs.AI / 14 / 2605.20690
Declarative Data Services: Structured Agentic Discovery for Composing Data Systems
声明式数据服务:用于组合数据系统的结构化主动发现
Abstract
Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the paradigm to multi-system data backends surfaces a harder problem: the search space is heterogeneous, the verifier is whether a deployed stack actually runs, and composition knowledge is unevenly captured in pretraining. Unbounded agentic discovery, a coding agent iterating on failure-log feedback, fails to converge consistently on a working stack even when iteration and explicit composition knowledge are added. We propose Declarative Data Services (DDS), an architecture for structured agentic discovery of data-system compositions from declarative user intent. The framework owns four typed contracts at successive layers (intent, operator DAG, per-system skills, runtime attribution) that decompose the global search into bounded sub-searches; sub-agents search each typed space, while the framework provides the channels by which knowledge flows forward as inline skill citations and errors route backward as typed signals. As a proof of life on a trading-backend workload, DDS converges where unbounded discovery does not; runtime failures become skill patches that the next deployment cites inline. We position this as an early prototype reporting lessons from real-world data-system composition.
Chinese Translation
主动发现已经证明,在基准条件下,基于大语言模型(LLM)的搜索能够找到新颖的算法、设计和代码。将这一范式转化为多系统数据后端则提出了一个更为复杂的问题:搜索空间是异构的,验证标准是部署的堆栈是否能够实际运行,而组合知识在预训练中捕获得并不均匀。无界的主动发现,即一个在失败日志反馈上迭代的编码代理,即使在增加了迭代和显式组合知识的情况下,也无法始终收敛到一个有效的堆栈。我们提出了声明式数据服务(Declarative Data Services, DDS),这是一种从声明式用户意图中结构化主动发现数据系统组合的架构。该框架在连续层次上拥有四种类型的契约(意图、操作有向无环图(DAG)、每个系统的技能、运行时归因),将全局搜索分解为有界的子搜索;子代理在每个类型空间内进行搜索,而框架提供了知识向前流动的渠道,以内联技能引用的形式呈现,同时错误作为类型信号向后路由。作为在交易后端工作负载上的生命证明,DDS在无界发现无法收敛的情况下实现了收敛;运行时失败成为下一个部署内联引用的技能补丁。我们将此视为一个早期原型,报告来自现实世界数据系统组合的经验教训。
cs.AI / 15 / 2605.20742
VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals
用于电动车电池故障检测与诊断的VBFDD-Agent:电池数字信号的描述性文本建模
Abstract
With the rapid proliferation of electric vehicles, the safety and reliability of lithium-ion batteries have become critical concerns. Effective anomaly detection is essential for ensuring safe battery operation. However, as battery systems and operating scenarios become increasingly complex, battery fault diagnosis and maintenance require stronger cross-domain adaptability and human-AI collaboration. Traditional fault detection and diagnosis methods are usually designed for specific scenarios and predefined workflows, making them less effective in complex real-world applications. To address the scarcity of open-source battery fault report corpora and the lack of unified maintenance knowledge representation, this study proposes a descriptive text modeling approach for battery signal reports. Monitoring signals, statistical features, anomaly records, and state assessment results are transformed into structured and readable natural language descriptions, forming a language corpus for battery health diagnosis and maintenance. Based on this corpus, we propose VBFDD-Agent, a vehicle battery fault detection and diagnosis agent for automotive-grade battery systems. VBFDD-Agent integrates descriptive battery-state texts, historical case retrieval, local maintenance manuals, and large language model reasoning to generate structured diagnostic results and maintenance recommendations. Experiments show that the proposed framework can accurately perform anomaly monitoring based on descriptive textual representations and provide flexible, efficient, and actionable maintenance suggestions. Expert evaluation further confirms the practical value of the generated recommendations. Overall, VBFDD-Agent extends traditional battery diagnosis from label prediction to interpretable and maintenance-oriented decision support.
Chinese Translation
随着电动车的快速普及,锂离子电池的安全性和可靠性已成为关键关注点。有效的异常检测对于确保电池安全运行至关重要。然而,随着电池系统和操作场景的日益复杂,电池故障诊断和维护需要更强的跨领域适应性和人机协作。传统的故障检测和诊断方法通常是为特定场景和预定义工作流程设计的,使其在复杂的现实应用中效果不佳。为了解决开放源代码电池故障报告语料库的稀缺以及缺乏统一维护知识表示的问题,本研究提出了一种电池信号报告的描述性文本建模方法。监测信号、统计特征、异常记录和状态评估结果被转化为结构化且可读的自然语言描述,形成一个用于电池健康诊断和维护的语言语料库。基于该语料库,我们提出了VBFDD-Agent,一个针对汽车级电池系统的电动车电池故障检测与诊断代理。VBFDD-Agent整合了描述性电池状态文本、历史案例检索、本地维护手册和大型语言模型推理,以生成结构化的诊断结果和维护建议。实验表明,所提出的框架能够基于描述性文本表示准确地进行异常监测,并提供灵活、高效且可操作的维护建议。专家评估进一步确认了生成建议的实际价值。总体而言,VBFDD-Agent将传统电池诊断从标签预测扩展到可解释和以维护为导向的决策支持。
cs.AI / 16 / 2605.20758
Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
基于冲突感知的加性引导在组合奖励下的流模型
Abstract
Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance ($g^\text{car}$), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate $g^\text{car}$ across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that $g^\text{car}$ effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR-guidance.
Chinese Translation
推理时引导采样通过将生成过程解释为可控轨迹,能够在不进行微调的情况下引导最先进的扩散和流模型。这为注入外部约束(例如,成本函数或预训练验证器)以实现受控生成提供了一种简单灵活的方法。然而,现有方法在同时组合多个约束时往往失败,导致偏离真实数据流形。在本研究中,我们识别出这种偏离流形漂移的根本原因,并发现近似误差与梯度不对齐严重相关。基于这些发现,我们提出了冲突感知加性引导(Conflict-Aware Additive Guidance,$g^ ext{car}$),这是一种轻量且可学习的方法,能够通过动态检测和解决梯度冲突来主动纠正偏离流形的漂移。我们在多个领域验证了$g^ ext{car}$的有效性,包括合成数据集、图像编辑以及生成决策制定用于规划和控制。我们的结果表明,$g^ ext{car}$有效纠正了偏离流形的漂移,在生成保真度上超越了基线,同时计算开销较小。代码可在 https://github.com/yuxuehui/CAR-guidance 获取。
cs.AI / 17 / 2605.20784
Interaction Locality in Hierarchical Recursive Reasoning
层次递归推理中的交互局部性
Abstract
Spatial reasoning requires both location-bound computation and location-invariant structure: agents must make local moves while preserving route, object, or constraint-level plans. We propose interaction locality, a task-geometry-aware framework for measuring whether information flow stays within nearby cells or semantic segments, or crosses them. We instantiate the framework with sparse-autoencoder feature ablations and finite-noise activation patching, with structural Jacobian and attention checks reported in the appendix, and apply it to HRM and TRM, two compact hierarchical and recursive reasoning models, on Maze-Hard, Sudoku Extreme, and ARC-AGI. Across these models, activation patching gives the clearest architectural fingerprint: high-level recurrent states tend to write information within nearby cells or same-segment units, while repeated recursive updates accumulate these local writes into broader solution structure. This pattern holds across maze paths, Sudoku constraints, and ARC-AGI object neighborhoods, with the strongest concentration in TRM. To test whether interaction locality extends beyond toy-yet-challenging grid benchmarks, we also apply it to MTU3D, a large-scale embodied 3D scene-grounding model. In this MTU3D setting, causal spatial locality appears primarily at the transition where visual scene features are handed to the downstream grounding module, rather than uniformly throughout the visual encoder. This contrast suggests that the local-to-global handoff observed in HRM and TRM is tied to explicit recursive reasoning dynamics, while embodied 3D models may concentrate causal spatial structure at module boundaries. Interaction locality turns the intuitive local-execution/global-planning story into a reproducible measurement framework for recursive and embodied spatial reasoning.
Chinese Translation
空间推理需要位置绑定的计算和位置不变的结构:智能体必须在保持路径、对象或约束级别计划的同时进行局部移动。我们提出了交互局部性,这是一种任务几何感知框架,用于测量信息流是否保持在附近的单元或语义段内,或跨越它们。我们通过稀疏自编码器特征消融和有限噪声激活补丁实例化该框架,并在附录中报告结构雅可比和注意力检查,并将其应用于HRM和TRM这两个紧凑的层次和递归推理模型,测试数据集包括Maze-Hard、Sudoku Extreme和ARC-AGI。在这些模型中,激活补丁提供了最清晰的架构指纹:高级递归状态往往在附近的单元或同一段单元内写入信息,而重复的递归更新则将这些局部写入累积成更广泛的解决结构。该模式在迷宫路径、数独约束和ARC-AGI对象邻域中均成立,且在TRM中表现出最强的集中性。为了测试交互局部性是否超越玩具但具有挑战性的网格基准,我们还将其应用于MTU3D,一个大规模的具身3D场景定位模型。在这个MTU3D设置中,因果空间局部性主要出现在视觉场景特征传递给下游定位模块的过渡阶段,而不是在视觉编码器中均匀分布。这一对比表明,在HRM和TRM中观察到的局部到全局的交接与显式递归推理动态相关,而具身3D模型可能在模块边界集中因果空间结构。交互局部性将直观的局部执行/全局规划故事转变为一个可重复的测量框架,用于递归和具身空间推理。
cs.AI / 18 / 2605.20834
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
DPO与RLHF的条件等价性:隐含假设、失败模式与可证明的对齐
Abstract
Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPOs' guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.
Chinese Translation
直接偏好优化(DPO)作为人类反馈强化学习(RLHF)的一个流行替代方案,提供了理论上的等价性并实现了更简单的实施。我们证明这种等价性是条件性的而非普遍性的,依赖于一个在实践中经常被违反的隐含假设:RLHF最优策略必须偏好人类偏好的响应。当这一假设失效时,DPO优化的是相对于参考策略的相对优势,而非与人类偏好的绝对对齐,导致病态收敛,即策略在偏好不被偏好的响应的同时减少DPO损失。我们描述了这一假设何时被违反,展示了一个不理想的解空间的存在,并证明在这种情况下,DPO和RLHF优化的目标根本不同。为了解决这个问题,我们引入了约束偏好优化(CPO),通过为RLHF增加约束以实现可证明的对齐。我们进一步通过软边际排序提供几何解释,揭示DPO实现了可能带有负目标的边际排序。我们的理论分析确立了DPO保证成立的条件,并提供了保持简单性且具有可证明对齐的解决方案。在标准基准上的全面实验表明,CPO实现了最先进的性能。代码可在以下链接获取:https://github.com/visitworld123/CPO。
cs.AI / 19 / 2605.20873
PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models
PlanningBench:生成可扩展且可验证的规划数据以评估和训练大型语言模型
Abstract
Planning is a fundamental capability for large language models (LLMs) because such complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. Guided by this taxonomy, a constraint-driven synthesis pipeline instantiates self-contained planning problems with adaptive difficulty control, quality filtering, and instance-level verification checklists. This shifts planning data construction from fixed benchmark collection to controllable generation while preserving realistic task grounding. We use PlanningBench to evaluate open-source and closed-source frontier LLMs, and find that current models still struggle to produce complete solutions under coupled constraints. Beyond evaluation, reinforcement learning on verified PlanningBench data improves performance on unseen planning benchmarks and broader instruction-following tasks. Further analysis suggests that determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics. Overall, PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs.
Chinese Translation
规划是大型语言模型(LLMs)的一项基本能力,因为这类复杂任务要求模型协调目标、约束、资源和长期后果,以形成可执行且可验证的解决方案。然而,现有的规划基准通常将规划数据视为固定的实例集合,而非可控的生成目标。这限制了场景覆盖,难度与表面代理相关联而非结构性来源,并且对可扩展生成、自动验证或以规划为导向的训练支持有限。我们引入了PlanningBench,一个用于生成可扩展、多样化和可验证的规划数据的框架,以用于评估和训练。PlanningBench从真实的规划场景出发,将实际工作流程抽象为一个包含30多种任务类型、子任务、约束类别和难度因素的结构化分类法。在这一分类法的指导下,一个基于约束的合成管道实例化了具有自适应难度控制、质量过滤和实例级验证清单的自包含规划问题。这将规划数据的构建从固定的基准集合转变为可控的生成,同时保留了现实任务的基础。我们使用PlanningBench评估开源和闭源的前沿LLMs,发现当前模型在面对耦合约束时仍然难以产生完整的解决方案。除了评估之外,在经过验证的PlanningBench数据上进行强化学习提高了在未见规划基准和更广泛的指令跟随任务上的表现。进一步分析表明,确定性或明确指定的最优解决方案提供了更清晰的奖励信号和更稳定的训练动态。总体而言,PlanningBench为诊断和改善LLMs中的可推广规划能力提供了一个可控的规划数据源。
cs.AI / 20 / 2605.20874
Governance by Construction for Generalist Agents
面向通用智能体的构建治理
Abstract
Enterprise agents are increasingly expected to operate autonomously across tools and interfaces, yet production deployments require governance by construction. Systems must specify which actions are allowed, when human oversight is required, and what information may be exposed, without rebuilding the agent for each domain. This demo presents CUGA's policy system, a modular policy-as-code layer that composes with a generalist LLM agent to deliver predictable, auditable, and compliance-aware behavior in compound workflows without model fine-tuning. We present a runtime governance architecture that enforces policy interventions at every critical stage of execution. Rather than passively constraining behavior, policies intercept the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool-call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human-in-the-Loop gate for high-risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter). Together, these stages embed governance continuously across the agent's execution pipeline rather than treating it as an afterthought. Using a healthcare scenario and a multi-layered enforcement intervention, the demo shows dynamic playbook injection for structured tool-sequence enforcement, intent guards that block malicious or accidental harmful requests, and human-in-the-loop tool approval checkpoints for potentially destructive actions. The artifact illustrates how typed governance primitives enable faster, safer deployment of enterprise agentic systems while improving policy adherence and execution consistency.
Chinese Translation
企业智能体越来越被期望能够在各种工具和接口中自主操作,但生产部署需要通过构建进行治理。系统必须明确允许哪些行为、何时需要人类监督以及可以公开哪些信息,而无需为每个领域重建智能体。本演示展示了CUGA的政策系统,这是一个模块化的政策即代码层,与通用大型语言模型(LLM)智能体结合,以在复合工作流中提供可预测、可审计和合规意识的行为,而无需对模型进行微调。我们提出了一种运行时治理架构,在执行的每个关键阶段强制实施政策干预。政策不仅被动地约束行为,而是在五个结构性检查点拦截智能体:在规划的上游(意图保护),在系统提示中引导推理(操作手册),在工具调用边界强制正确使用(工具指南),在推理循环外作为高风险行为的人机协作门控(工具审批),以及在输出阶段过滤和结构化最终响应(输出格式化器)。这些阶段共同将治理持续嵌入智能体的执行管道,而不是将其视为事后考虑。通过一个医疗场景和多层次的强制干预,本演示展示了动态操作手册注入以强制结构化工具序列、阻止恶意或意外有害请求的意图保护,以及对潜在破坏性行为的人机协作工具审批检查点。该成果展示了类型化治理原语如何加速和安全地部署企业智能体系统,同时提高政策遵循性和执行一致性。
cs.AI / 21 / 2605.20911
For How Long Should We Be Punching? Learning Action Duration in Fighting Games
我们应该打多久?在格斗游戏中学习动作持续时间
Abstract
Fighting games such as Street Fighter II present unique challenges to reinforcement learning (RL) agents due to their fast-paced, real-time nature. In most RL frameworks, agents are hard-coded to make decisions at a fixed interval, typically every frame or every N frames. Although this design ensures timely responses, it restricts the agent's ability to adjust its reaction timing. Acting every frame grants frame-perfect reflexes, which are unrealistic compared to human players, whereas longer fixed intervals reduce computational cost but hinder responsiveness. We consider an alternative decision-making framework in which the agent learns not only what action to take but also for how long to execute it. By jointly predicting both action and duration, the agent can dynamically adapt its responsiveness to different situations in the game. We implement this method using the open-source FightLadder environment with agents trained against scripted built-in bots, systematically testing different frame skip configurations to analyze their influence on performance, responsiveness, and learned behavior. Experiments show that learned timing can match the performance of well-chosen fixed frame skips and encourages repeatable action patterns, but does not ensure robustness on its own. In most cases, we see agents performing best with consistently high frame skip values (i.e., low responsiveness). This strategy makes it easier to learn exploitative strategies where the same action is repeated over and over, which the scripted bots appear to be susceptible to.
Chinese Translation
格斗游戏如《街头霸王 II》由于其快速、实时的特性,对强化学习(RL)智能体提出了独特的挑战。在大多数RL框架中,智能体被硬编码为在固定的时间间隔内做出决策,通常是每帧或每N帧。尽管这种设计确保了及时响应,但限制了智能体调整反应时机的能力。每帧行动赋予了完美的反应能力,但与人类玩家相比,这种能力并不现实,而较长的固定时间间隔则降低了计算成本,但妨碍了响应能力。我们考虑了一种替代的决策框架,在该框架中,智能体不仅学习采取什么动作,还学习执行该动作的持续时间。通过共同预测动作和持续时间,智能体可以动态调整其对游戏中不同情况的响应能力。我们使用开源的FightLadder环境实现了该方法,智能体与脚本内置机器人进行训练,系统地测试不同的帧跳配置,以分析其对性能、响应能力和学习行为的影响。实验表明,学习到的时机可以匹配精心选择的固定帧跳的性能,并鼓励可重复的动作模式,但单独并不能确保鲁棒性。在大多数情况下,我们观察到智能体在持续较高的帧跳值(即低响应能力)时表现最佳。这种策略使得学习剥削性策略变得更容易,即同一动作被反复执行,而脚本机器人似乎对此较为敏感。
cs.AI / 22 / 2605.21006
Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
扮演魔鬼代言人:现成的人物向量与针对性引导在谄媚行为中的竞争
Abstract
We study the effect of different persona on \textbf{sycophancy}: model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately $68\%$ and $98\%$ of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.
Chinese Translation
我们研究了不同人物对 extbf{谄媚}的影响:模型与用户的意见一致,即使用户是错误的。标准的缓解方法,对比激活添加(Contrastive Activation Addition, CAA),从标记的谄媚和诚实响应对中推导出引导方向。本研究评估了原本为一般角色扮演开发的现成人物引导向量,是否可以作为替代方案,而这些向量并未在谄媚数据上进行训练。在两个经过指令调优的模型中,朝向以怀疑或审视为特征的人物引导,使谄媚行为减少到CAA效果的约$68\%$和$98\\%$,并且与CAA不同的是,当用户正确时保持了准确性。该效果也是不对称的:朝向同意的人物引导并未导致谄媚行为的镜像增加。从几何上看,人物向量在激活空间中与谄媚的方向大体上是独立的。总体而言,这些发现表明,谄媚行为更应被理解为一种人物级别的属性,而非单一的可引导方向。我们在此发布我们的代码:https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.
cs.AI / 23 / 2605.21082
AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions
AutoRPA:通过交互驱动的LLM代码合成实现高效的GUI自动化
Abstract
Large Language Model (LLM) based agents have demonstrated proficiency in multi-step interactions with graphical user interfaces (GUIs). While most research focuses on improving single-task performance, practical scenarios often involve repetitive GUI tasks for which invoking LLM reasoning repeatedly, i.e., the ReAct paradigm, is inefficient. Prior to LLMs, traditional Robotic Process Automation (RPA) offers runtime efficiency but demands significant manual effort to develop and maintain. To bridge this gap, we propose AutoRPA, a framework that automatically distills the decision logic of ReAct-style agents into robust RPA functions. AutoRPA introduces two core innovations: (1) A translator-builder pipeline, where a translator agent converts hard-coded ReAct actions into soft-coded procedures, and a builder agent synthesizes robust RPA functions via retrieval-augmented generation over multiple trajectories; (2) A hybrid repair strategy during code verification, combining RPA execution with ReAct-based fallback for iterative refinement. Experiments across multiple GUI environments demonstrate that RPA functions generated by AutoRPA successfully solve similar tasks while reducing token usage by 82% to 96%, significantly improving runtime efficiency and reusability.
Chinese Translation
基于大型语言模型(LLM)的代理在与图形用户界面(GUI)的多步骤交互中表现出了良好的能力。尽管大多数研究集中在提高单任务性能上,但实际场景往往涉及重复的GUI任务,在这些情况下,反复调用LLM推理,即ReAct范式,是低效的。在LLM出现之前,传统的机器人流程自动化(RPA)提供了运行时效率,但需要大量的手动努力来开发和维护。为了解决这一问题,我们提出了AutoRPA,一个框架,它自动提炼ReAct风格代理的决策逻辑为稳健的RPA功能。AutoRPA引入了两个核心创新:(1)一个翻译-构建管道,其中翻译代理将硬编码的ReAct动作转换为软编码的过程,而构建代理通过对多个轨迹进行检索增强生成来合成稳健的RPA功能;(2)在代码验证过程中采用混合修复策略,将RPA执行与基于ReAct的回退相结合,以进行迭代改进。在多个GUI环境中的实验表明,AutoRPA生成的RPA功能成功解决了类似任务,同时将令牌使用量减少了82%至96%,显著提高了运行时效率和可重用性。
cs.AI / 24 / 2605.21168
ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving
场景引导器:面向自主驾驶的可控边界驱动关键场景生成
Abstract
Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-based stress testing indispensable. Most scenario generation methods treat surrounding agents as adversaries, but they either (i) induce failures without explicitly modeling vehicle-road physical limits, yielding visually extreme yet physically unsolvable crashes, or (ii) enforce physical feasibility or policy feasibility in isolation, which can over-focus on aggressive maneuvers or remain tied to a controller-dependent capability boundary. We propose ScenePilot, a feasibility-guided, boundary-driven framework that targets the boundary band: scenarios that are physically solvable in principle yet still cause the deployed autonomy stack to fail. We formulate generation as constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score $\sigma$ with an online-learned AV-risk predictor $\Phi$, and introduce step-level feasibility-aware shielding to keep exploration near the feasibility boundary while avoiding infeasible artifacts. Experiments on SafeBench with multiple planners show that ScenePilot yields substantially higher collision rates (+6.2 percentage points) while preserving physical validity, and that adversarial fine-tuning on these boundary-band scenarios consistently reduces downstream crash rates. The code is available at https://github.com/QiyuRuan/ScenePilot.
Chinese Translation
安全关键场景是评估自主驾驶系统的核心,但其在自然驾驶日志中的稀缺性使得基于仿真的压力测试变得不可或缺。大多数场景生成方法将周围代理视为对手,但它们要么 (i) 在没有明确建模车辆-道路物理限制的情况下诱发失败,导致视觉上极端但在物理上无法解决的碰撞,要么 (ii) 在孤立的情况下强制物理可行性或策略可行性,这可能过于关注激进的操控或仍然依赖于控制器的能力边界。我们提出了场景引导器(ScenePilot),这是一个以可行性为指导、边界驱动的框架,旨在针对边界带:在原则上物理上可解决但仍导致部署的自主堆栈失败的场景。我们将生成过程表述为约束多目标强化学习,将基于RSS(安全性规范)推导的物理可行性评分$ au$与在线学习的自动驾驶风险预测器$ ext{AV-risk predictor} ext{Φ}$相结合,并引入逐步可行性意识的保护机制,以保持探索接近可行性边界,同时避免不可行的伪影。在SafeBench上对多个规划器的实验表明,场景引导器显著提高了碰撞率(+6.2个百分点),同时保持物理有效性,并且对这些边界带场景的对抗性微调始终能降低下游碰撞率。代码可在 https://github.com/QiyuRuan/ScenePilot 获取。
cs.AI / 25 / 2605.21347
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
洞察生成器:针对大语言模型代理的系统性语料库级追踪诊断
Abstract
Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report. We evaluate IG across qualitative and objective dimensions, spanning rubric-based report assessment and downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4pp over the unmodified baseline scaffold, and coding agents leveraging IG-derived insights show consistent and stable gains. Across benchmarks, IG's scout-investigator architecture produces findings comparable in detection coverage to competing approaches, while domain experts rated IG reports as leading depth and evidence quality.
Chinese Translation
在大语言模型(LLM)代理的故障诊断中,仍然主要依赖人工操作。实践者检查一小部分执行追踪,形成临时假设并进行迭代。这一过程错过了仅在追踪群体中显现的模式,并且无法扩展到个体追踪跨越数万个标记的生产语料库。我们对语料库级追踪诊断的问题进行了形式化定义。给定一组执行追踪,目标是生成基于证据的自然语言洞察,以表征追踪组之间的系统性行为模式,并为每种模式提供支持证据。我们提出了洞察生成器(Insights Generator, IG),这是一个多代理系统,通过在追踪语料库中提出和测试假设来回答诊断问题,从而生成基于证据的洞察报告。我们在定性和客观维度上评估了IG,包括基于标准的报告评估和通过实施IG洞察所实现的下游性能提升。使用IG报告的人类专家在未修改的基线支架上提高了30.4个百分点的支架性能,而利用IG派生洞察的编码代理显示出持续稳定的收益。在各项基准测试中,IG的侦查-调查者架构在检测覆盖率方面产生的发现与竞争方法相当,而领域专家则将IG报告的深度和证据质量评为领先。
cs.AI / 26 / 2605.21395
Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G
迈向韧性与自主网络:AI原生6G的蓝天愿景
Abstract
The proliferation of emerging applications, such as autonomous driving and immersive experiences, demands cellular networks that are not only faster, but fundamentally more resilient and autonomous. This paper presents a BlueSky vision on how Artificial Intelligence will be natively integrated into 6G, shifting the paradigm from \underline{Network for AI} to \underline{AI for Network}. We envision that, unlike 5G's reliance on scattered, ad-hoc models each trained for a single task, native AI in the 6G era will be anchored by a foundation model and and orchestrated via collaborative multi-agent systems, framing network management as a unified, multi-modal, multi-task optimization problem. Built on this vision, we outline two transformative directions. The first focuses on developing a 6G foundation model as a unified backbone, with task-specific knowledge distilled into compact models suited for diverse edge deployments. The second advances multi-agent systems designed to autonomously diagnose, maintain, and recover networks with minimal human intervention. These directions chart a roadmap for 6G to evolve into an intelligent, self-sustaining communication infrastructure.
Chinese Translation
新兴应用的快速发展,如自动驾驶和沉浸式体验,要求蜂窝网络不仅要更快,而且在根本上要更加韧性和自主。本文提出了一种蓝天愿景,探讨人工智能如何原生集成到6G中,将范式从“网络为人工智能(Network for AI)”转变为“人工智能为网络(AI for Network)”。我们设想,与5G依赖于分散的、临时的模型,每个模型仅针对单一任务不同,6G时代的原生人工智能将以基础模型为支撑,并通过协作多智能体系统进行协调,将网络管理框架视为一个统一的、多模态的、多任务的优化问题。在这一愿景的基础上,我们概述了两个变革性方向。第一个方向专注于开发6G基础模型,作为统一的支撑,任务特定的知识被提炼为适合多样化边缘部署的紧凑模型。第二个方向推进多智能体系统的设计,旨在以最小的人为干预自主诊断、维护和恢复网络。这些方向为6G演变为智能、自我维持的通信基础设施绘制了一条路线图。
cs.AI / 27 / 2605.21413
Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
通过基准构建教学人工智能:QuestBench作为负责任知识工作的课程实践
Abstract
As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.
Chinese Translation
随着人工智能成为日常学习的一部分,许多课程主要教学生如何将其作为生产力工具使用:如何提示、搜索、总结、写作、编码以及更高效地使用工具。我们认为,人工智能教育还需要一个环境,让学生学习如何测试人工智能并理解他们在判断机器生成知识中的角色。为此,我们引入了一种通过基准构建来教学人工智能的课程实践,以深度研究系统作为人工智能时代知识工作的具体示例。学生将学科知识转化为可验证的专家级问题,互相审查设计中的模糊性和捷径,并对生成的任务进行人工智能系统的评估。这一活动使学生直接接触到一个强大的工具,同时要求他们明确一个可信答案所需的条件。所产生的基准QuestBench包含来自14个人文学科和社会科学领域的256个问题。在QuestBench上的评估显示,学生设计的任务揭示了当前深度研究系统中的隐藏失败:在评估的十三个系统中,问题级别的平均通过率仅为16.85%,而表现最佳的系统GPT-5.5的通过率为57.58%。这些失败在教育上是有用的,因为它们展示了流利、基于来源的答案仍然可能错过正确的查询、来源、术语或证据标准。五位学生贡献者的反思表明,基准构建可以帮助学生将专业知识视为不仅是人工智能可能检索的内容,而是评判人工智能输出的基础。我们将QuestBench呈现为一个基准文物,并作为一个可重复使用的课堂环境,探讨一个更大的教育问题:随着人工智能进入学习和专业工作,学生如何保持负责任的知识参与者。数据集可在https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main获取。
cs.AI / 28 / 2605.21427
PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
PALS:面向混合专家模型的电源感知大型语言模型服务
Abstract
Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource. In this paper, we present a power-aware runtime for LLM serving, PALS, that treats GPU power caps as a first-class control knob and jointly optimizes them with software parameters such as batch size. The system combines lightweight offline power-performance models with a feedback-driven controller to select configurations that satisfy throughput targets while maximizing energy efficiency. We implement PALS within an existing LLM serving framework, vLLM, demonstrating that it requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM inference runtimes, enabling energy-proportional and grid-interactive AI systems.
Chinese Translation
大型语言模型(LLM)推理已成为现代数据中心的主要工作负载,导致显著的GPU利用率和能耗。尽管之前的系统通过批处理、调度和并行化来优化吞吐量和延迟,但它们在很大程度上将GPU功率视为静态约束,而非可控资源。本文提出了一种电源感知的LLM服务运行时PALS,将GPU功率上限视为一项重要的控制参数,并与批量大小等软件参数共同优化。该系统结合了轻量级的离线功率-性能模型和基于反馈的控制器,以选择满足吞吐量目标的配置,同时最大化能效。我们在现有的LLM服务框架vLLM中实现了PALS,证明其无需模型重新训练或API更改。在多GPU系统以及稠密和混合专家(MoE)模型中,PALS将能效提高了最多26.3%,在功率限制下将服务质量(QoS)违规减少了4倍至7倍,并能够跟踪动态功率预算。这些结果突显了将功率控制直接集成到LLM推理运行时的潜力,从而实现能量比例和电网互动的人工智能系统。
cs.AI / 29 / 2605.21458
Mind the Sim-to-Real Gap & Think Like a Scientist
注意模拟与现实之间的差距,并像科学家一样思考
Abstract
Suppose a planner has a pre-trained simulator of a sequential decision problem and the option to run real experiments in the field. The simulator is cheap to query but inherits confounding and drift from its calibration data. Experimentation is unbiased but consumes one real unit per trial. We study when, and how, the planner should supplement the simulator with experiments. We give three results. First, an extended simulation lemma decomposes the simulator's value error into a calibration--deployment shift that randomization can identify and a parametric residual that no further interaction can reduce. Second, the value gap between the simulator-optimal policy and the optimum splits into a local component, on states the deployed policy already visits, and a reachability component, on states it does not. The reachability component stays bounded away from zero at any horizon under purely passive learning. Third, we propose Fisher-SEP, a simulation-aided experimental policy (SEP) that minimizes the posterior predictive variance of a target policy's value, with reward-only and transition-only specializations. Two case studies illustrate the regimes. In a vending-machine supply chain, front-loaded experimentation overtakes posterior updating once the horizon is long enough to amortize the pilot. In an HIV mobile-testing example with a corridor that separates a well-surveilled region from a poorly-surveilled one, only designed exploration reaches the poorly-surveilled region.
Chinese Translation
假设一个规划者拥有一个预训练的序列决策问题的模拟器,并且可以在实际场景中进行实验。该模拟器查询成本低,但继承了来自其校准数据的混淆和漂移。实验是无偏的,但每次试验消耗一个真实单位。我们研究规划者何时以及如何用实验来补充模拟器。我们给出了三个结果。首先,一个扩展的模拟引理将模拟器的价值误差分解为随机化可以识别的校准-部署转变和无法通过进一步交互减少的参数残差。其次,模拟器最优策略与最优策略之间的价值差距分为局部成分(在已部署策略已访问的状态上)和可达性成分(在未访问的状态上)。在纯被动学习下,可达性成分在任何时间范围内都保持远离零。第三,我们提出了Fisher-SEP,一种模拟辅助的实验策略(SEP),旨在最小化目标策略价值的后验预测方差,具有仅奖励和仅转移的特化。两个案例研究说明了这些机制。在一个自动售货机供应链中,前期实验在时间范围足够长以摊销试点成本后超过了后验更新。在一个HIV移动检测的例子中,存在一个将监控良好的区域与监控不良的区域分开的走廊,只有经过设计的探索才能到达监控不良的区域。
cs.AI / 30 / 2605.21481
AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists
AiraXiv:一个由人工智能驱动的人类与人工智能科学家的开放获取平台
Abstract
Recent advances in artificial intelligence (AI) have accelerated the growth of both human-authored and AI-generated research outputs, placing increasing strain on traditional academic publishing systems and challenging the scalability of conference- and journal-centered paradigms amid rising submission volumes, reviewer workload, and venue size. To address these challenges, we explore an AI-era publishing paradigm in which both human and AI scientists participate as authors and readers, and papers evolve through continuous, feedback-driven iteration. We propose AiraXiv, an AI-driven open-access platform built on open preprints, AI-augmented analysis and review, and reader feedback. AiraXiv supports human scientists through an interactive UI and AI scientists through Model Context Protocol (MCP)-based interactions. We validate AiraXiv through real-world deployments, including serving as the submission platform for ICAIS 2025, demonstrating its potential as a fast, inclusive, and scalable research infrastructure for the AI era. AiraXiv is publicly available at https://airaxiv.com.
Chinese Translation
近期人工智能(AI)的进展加速了人类作者和AI生成的研究成果的增长,这对传统学术出版系统造成了越来越大的压力,并在提交量、审稿工作量和会议规模不断增加的背景下,挑战了以会议和期刊为中心的模式的可扩展性。为了解决这些挑战,我们探讨了一种AI时代的出版范式,在这种范式中,人类科学家和AI科学家作为作者和读者共同参与,论文通过持续的反馈驱动迭代而不断演变。我们提出了AiraXiv,一个基于开放预印本、AI增强分析与审查以及读者反馈的AI驱动开放获取平台。AiraXiv通过交互式用户界面支持人类科学家,并通过基于模型上下文协议(Model Context Protocol, MCP)的交互支持AI科学家。我们通过实际部署验证了AiraXiv,包括作为ICAIS 2025的提交平台,展示了其作为AI时代快速、包容和可扩展的研究基础设施的潜力。AiraXiv可在https://airaxiv.com上公开获取。
cs.AI / 31 / 2605.21482
DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
DeepWeb-Bench:一个要求大量跨源证据和长期推导的深度研究基准
Abstract
Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence. We evaluate DeepWeb-Bench on nine frontier models and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12-14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models' errors dominated by incomplete derivation and weak models' by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross-model agreement of only rho = 0.61 and per-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.
Chinese Translation
深度研究是指代理在开放网络上搜索、收集证据并通过扩展推理得出答案的过程,是前沿语言模型的一个重要应用案例。前沿深度研究产品在现有基准测试中得分较高,仅凭当前的评估数据难以区分它们的能力。我们提出了DeepWeb-Bench,一个比现有基准更具挑战性的深度研究基准。其难度来自数据本身的三个特性:每个任务都需要大量的证据收集、跨源调和和长期多步骤推导。我们将这三种难度来源表示为四个能力类别(检索、推导、推理和校准),并按类别报告结果。每个参考答案都附有一个源出处记录,包含四个披露级别,并在可用的情况下进行跨源检查,使得得分更易于根据基础证据进行审计。我们在九个前沿模型上评估了DeepWeb-Bench,并报告了三个发现:(1)检索并不是瓶颈,因为检索失败仅占错误的12-14%,而推导和校准失败则占超过70%;(2)强模型和弱模型在质上以不同方式失败,强模型的错误主要由不完整的推导主导,而弱模型则由虚构的精度主导;(3)模型在不同领域表现出真正的专业化,跨模型一致性仅为rho = 0.61,逐案例的不一致性达到18.8个百分点。公开的基准发布包括数据、评分标准和评估代码。
cs.CL / 1 / 2605.20191
Shiny Stories, Hidden Struggles: Investigating the Representation of Disability Through the Lens of LLMs
光鲜故事,隐秘斗争:通过大型语言模型探讨残疾的表现
Abstract
Modern Large Language Models (LLMs) have recently attracted much attention for their ability to simulate human behavior and generate text that reflects personas and demographic groups. While these capabilities can open up a multitude of diverse applications across fields, it is crucial to examine how such models represent various target groups since LLMs can perpetuate and amplify biases or discrimination against historically marginalized communities or, alternatively, as a result of debiasing efforts, overcorrect by portraying overly positive stereotypes. This overcompensation can idealize these groups, erasing the complexities and challenges they face in favor of unrealistic depictions. In this paper, we investigate how LLMs represent disability by simulating the perspectives of individuals with disabilities in generating social media posts. These posts are then compared with those written by real people with disabilities, focusing on emotional tone, sentiment, and representative words and themes. Our analysis reveals two key findings: (1) LLMs often idealize the experiences of people with disabilities, producing overly positive stereotypes that, despite appearing uplifting, fail to authentically capture their lived realities; and (2) a comparative analysis of posts simulating individuals with and without disabilities highlights a negative bias, where certain topics, such as career and entertainment, are disproportionately associated with nondisabled individuals. This reinforces exclusionary narratives and over-idealized portrayals of disability, misrepresenting the actual challenges faced by this community. These findings align with broader concerns and ongoing research showing that LLMs struggle to reflect the diverse realities of society, particularly the nuanced experiences of marginalized groups, and underscore the need for critical scrutiny of their representations.
Chinese Translation
现代大型语言模型(LLMs)因其模拟人类行为和生成反映人格及人口群体的文本的能力而受到广泛关注。虽然这些能力可以在各个领域开辟多种多样的应用,但审视这些模型如何表现不同目标群体至关重要,因为LLMs可能会延续和放大对历史上边缘化社区的偏见或歧视,或者由于去偏见努力而过度修正,描绘出过于积极的刻板印象。这种过度补偿可能理想化这些群体,抹去他们所面临的复杂性和挑战,以迎合不切实际的描绘。在本文中,我们通过模拟残疾人士的视角生成社交媒体帖子,探讨LLMs如何表现残疾。这些帖子与真实残疾人士撰写的帖子进行比较,重点关注情感基调、情绪和代表性词汇及主题。我们的分析揭示了两个关键发现:(1)LLMs往往理想化残疾人士的经历,产生过于积极的刻板印象,这些刻板印象尽管看似振奋人心,却未能真实捕捉他们的生活现实;(2)对模拟有无残疾个体的帖子进行比较分析时,突显出一种负面偏见,某些话题,如职业和娱乐,过度与非残疾个体关联。这强化了排斥性叙事和过度理想化的残疾表现,误传了该社区所面临的实际挑战。这些发现与更广泛的关注和持续研究相一致,表明LLMs在反映社会多样现实方面存在困难,特别是边缘化群体的细微经历,并强调了对其表现进行批判性审视的必要性。
cs.CL / 2 / 2605.20192
Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token
利用大型语言模型进行情感分析:Decentraland的MANA代币的多模态分析
Abstract
Decentraland, a decentralized virtual reality platform operating within the expanding Metaverse ecosystem, utilizes its native MANA token to facilitate virtual asset transactions and governance. This study investigates the integration of Discord community sentiment with multi-modal financial data to enhance cryptocurrency price prediction within virtual world economies. We address: (1) identifying sentiment patterns within Decentraland's Discord community, and (2) evaluating the impact of multi-modal features on token return forecasting. Using a BERT-based large language model for sentiment analysis, we develop two LSTM architectures: a baseline incorporating historical prices and a multi-modal variant integrating sentiment scores, trading volume, and market capitalization. Results indicate predominantly neutral community sentiment with a positive skew. The multi-modal model significantly outperforms the price-only baseline in prediction accuracy. These findings demonstrate the predictive value of community-derived signals for virtual economy forecasting and establish a foundation for future research at the intersection of immersive virtual environments, natural language processing, and cryptocurrency market analysis.
Chinese Translation
Decentraland是一个去中心化的虚拟现实平台,运作于不断扩展的元宇宙生态系统中,利用其本土的MANA代币来促进虚拟资产交易和治理。本研究探讨了将Discord社区情感与多模态金融数据相结合,以增强虚拟世界经济中加密货币价格预测的有效性。我们关注以下两个方面:(1)识别Decentraland的Discord社区中的情感模式;(2)评估多模态特征对代币回报预测的影响。我们使用基于BERT的大型语言模型进行情感分析,开发了两种LSTM架构:一种基线模型结合历史价格,另一种多模态变体整合情感得分、交易量和市值。结果表明,社区情感主要呈中性偏向正向。多模态模型在预测准确性上显著优于仅基于价格的基线模型。这些发现展示了社区衍生信号在虚拟经济预测中的预测价值,并为未来在沉浸式虚拟环境、自然语言处理与加密货币市场分析交叉领域的研究奠定了基础。
cs.CL / 3 / 2605.20193
Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification
通过多轮提示验证提升量化模型在定性分析中的性能
Abstract
Quantized Large Language Models (LLMs) are used more often in qualitative analysis because they run fast and need fewer computing resources. This study examines how different lower bits quantization levels (8-bit, 4-bit, 3-bit, and 2-bit) and quantization types affect the performance of LLaMA-3.1 (8B) on qualitative analysis. The study uses expert and non-expert responses from 82 interview transcripts. Low-bit models often produce higher levels of hallucinations and unstable results, especially when reading non-expert language with unclear terms. To improve performance, we propose a quantization-aware multi-pass prompt verification method. This method guides the model through controlled steps that reduce hallucinations. It removes unreliable content and passes the results to the next transcript after verification, improving accuracy. To validate performance, human coders analyzed transcripts using NVivo and BF16 LLaMA. BF16 LLaMA-3.1 produced high-precision output but had semantic drift and hallucination. These errors were corrected manually. The corrected BF16 output and NVivo human coding were combined to create a gold-standard ground truth (GSGT) for thematic extraction and frequency analysis. The results show that 8-bit models stay closest to the GSGT. The 4-bit models lose accuracy but become stable when the proposed method is applied. The 3-bit and 2-bit models drop in performance because of heavy compression, but they improve with the proposed prompt design and verification. The study also finds that models at the same bit level behave differently depending on quantization type. Overall, the method helps low-resource LLMs become more stable, accurate, and suitable for qualitative research at lower cost.
Chinese Translation
量化的大型语言模型(LLMs)在定性分析中越来越常用,因为它们运行速度快且需要较少的计算资源。本研究考察了不同低位量化级别(8位、4位、3位和2位)和量化类型如何影响LLaMA-3.1(8B)在定性分析中的表现。研究使用了82份访谈记录中的专家和非专家回应。低位模型通常会产生更高水平的幻觉和不稳定结果,尤其是在阅读术语不明确的非专家语言时。为了提升性能,我们提出了一种量化感知的多轮提示验证方法。该方法通过受控步骤引导模型,减少幻觉现象。它在验证后去除不可靠内容,并将结果传递给下一个记录,从而提高准确性。为了验证性能,人工编码员使用NVivo和BF16 LLaMA分析了记录。BF16 LLaMA-3.1产生了高精度输出,但存在语义漂移和幻觉。这些错误经过人工修正。修正后的BF16输出与NVivo人工编码结合,创建了主题提取和频率分析的黄金标准真相(GSGT)。结果表明,8位模型与GSGT最为接近。4位模型在应用所提方法后准确性下降但变得稳定。3位和2位模型由于压缩过重而性能下降,但在所提提示设计和验证下有所改善。研究还发现,相同位级的模型根据量化类型的不同表现各异。总体而言,该方法有助于低资源LLMs在较低成本下变得更加稳定、准确,并适合定性研究。
cs.CL / 4 / 2605.20194
Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction
偏见韧性与稳健概念抽象的并行大语言模型推理
Abstract
Large language models (LLMs) have been increasingly used to analyze text. However, they are often plagued with contextual reasoning limitations when analyzing long documents. When long documents are processed sequentially, early or dominant concepts can overshadow less visible but meaningful interpretations, leading to cumulative analytical bias, omission error, and over-generalization. Additionally, independently generated outputs are often merged without systematic grounding, introducing redundancy, conceptual drift, and unsupported claims. This study proposes a structured framework combining parallel chunk-level processing with evidence-anchored consolidation. Texts are first divided into semantically coherent chunks and processed independently in parallel to remove influence from earlier processing. The independently generated interpretations are then consolidated using explicit evidence anchoring and prioritization that reduces dominance and over-generalization while improving traceability. Experiments with multiple model types and sizes indicate that parallel processing significantly reduces omission error by approximately 84%, increases evidence traceability by up to 130%, and reduces unsupported claims by up to 91%. Smaller models benefited most, suggesting that efficient parallel chunking and consolidation play a critical role in achieving reliable and scalable textual analysis.
Chinese Translation
大型语言模型(LLMs)在文本分析中的应用日益增加。然而,在分析长文档时,它们常常受到上下文推理能力的限制。当长文档被顺序处理时,早期或主导概念可能会掩盖那些不太明显但却有意义的解释,从而导致累积分析偏见、遗漏错误和过度概括。此外,独立生成的输出通常在没有系统性基础的情况下合并,导致冗余、概念漂移和不支持的主张。本研究提出了一种结构化框架,结合了并行块级处理与证据锚定的整合。文本首先被划分为语义一致的块,并独立并行处理,以消除早期处理的影响。然后,使用明确的证据锚定和优先级对独立生成的解释进行整合,从而减少主导性和过度概括,同时提高可追溯性。对多种模型类型和规模的实验表明,并行处理显著减少了约84%的遗漏错误,提高了多达130%的证据可追溯性,并减少了多达91%的不支持主张。较小的模型受益最大,这表明高效的并行块处理和整合在实现可靠且可扩展的文本分析中发挥了关键作用。
cs.CL / 5 / 2605.20195
Pseudo-Siamese Network for Planning in Target-Oriented Proactive Dialogues
用于目标导向主动对话规划的伪西梅网络
Abstract
A target-oriented proactive dialogue system is designed to steer conversations toward predefined targets while actively providing suggestions. The core paradigm of such a system is to plan a reasonable dialogue path and subsequently guide language models (e.g., pre-trained or large language models) to generate responses, where dialogue path planning serves as the central component-a novel yet under-explored problem. In this work, we propose a Forward-Focused Bidirectional Pseudo-Siamese Network (FF-BPSN) for dialogue path planning toward predefined dialogue targets. FF-BPSN employs two identical transformer-based decoders for forward and backward planning, together with a forward-focused module that integrates bidirectional information to construct the final forward path. This path benefits from bidirectional planning while prioritizing forward information. We then employ the planned path to guide language models in response generation. Extensive experiments on DuRecDial and DuRecDial 2.0 demonstrate that FF-BPSN achieves state-of-the-art performance in dialogue path planning and significantly enhances the effectiveness of target-oriented proactive dialogue systems.
Chinese Translation
目标导向的主动对话系统旨在引导对话朝向预定义目标,同时主动提供建议。此类系统的核心范式是规划合理的对话路径,并随后指导语言模型(例如,预训练模型或大型语言模型)生成响应,其中对话路径规划作为中心组件,是一个新颖但尚未充分探索的问题。在本研究中,我们提出了一种用于对话路径规划的前向聚焦双向伪西梅网络(Forward-Focused Bidirectional Pseudo-Siamese Network,FF-BPSN),以实现朝向预定义对话目标的规划。FF-BPSN采用两个相同的基于变换器的解码器进行前向和后向规划,并结合一个前向聚焦模块,该模块整合双向信息以构建最终的前向路径。该路径受益于双向规划,同时优先考虑前向信息。然后,我们利用规划的路径指导语言模型进行响应生成。在DuRecDial和DuRecDial 2.0上的大量实验表明,FF-BPSN在对话路径规划方面达到了最先进的性能,并显著提升了目标导向主动对话系统的有效性。
cs.CL / 6 / 2605.20196
Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum
数据缩放作为预测贡献谱的渐进覆盖
Abstract
We investigate the hypothesis that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than by token-frequency tails alone. We work with a suffix-automaton representation of text corpora and define a data-intrinsic global-KL predictive contribution spectrum, in which each state contributes according to its empirical mass times its KL deviation from a global next-token baseline. Across 12 real corpora, the tail slope of this spectrum is already strongly correlated with the empirical data-scaling exponent of a fixed small GPT learner. We then go beyond slope correlation and define, for each training size N, an effective truncation rank K(N) by matching the observed excess loss to the residual tail mass of the prepared 1000k global-KL spectrum. Empirically, log K is close to linear in log N, with pooled R^2 about 0.96 for the raw spectrum and R^2 about 0.90 for the smoothed spectrum. These findings provide strong empirical support for a simple mechanism picture: training scale advances an effective frontier through a predictive state spectrum, and the residual tail mass of that spectrum tracks the remaining excess loss.
Chinese Translation
我们研究了一个假设,即真实数据的缩放法则是由潜在预测贡献谱的渐进覆盖所主导,而不仅仅是由标记频率的尾部决定。我们使用文本语料库的后缀自动机表示法,并定义了一个数据内在的全局-KL预测贡献谱,其中每个状态的贡献与其经验质量乘以其相对于全局下一个标记基线的KL偏差成正比。在12个真实语料库中,该谱的尾部斜率与固定小型GPT学习器的经验数据缩放指数之间已经存在强相关性。然后,我们超越斜率相关性,为每个训练大小N定义一个有效的截断秩K(N),通过将观察到的超额损失与准备好的1000k全局-KL谱的剩余尾部质量进行匹配。经验上,log K与log N接近线性关系,对于原始谱的聚合R^2约为0.96,而对于平滑谱的R^2约为0.90。这些发现为一个简单机制的图景提供了强有力的经验支持:训练规模通过预测状态谱推进了一个有效的前沿,而该谱的剩余尾部质量跟踪了剩余的超额损失。
cs.CL / 7 / 2605.20197
MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction
MedicalBench:评估大型语言模型以改善医学概念提取
Abstract
Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts instead of implicit concepts. We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.
Chinese Translation
从电子健康记录中提取医学概念是许多下游应用的基础,但由于医学叙述中医学意义的概念往往是隐含而非明确陈述的,因此仍然具有挑战性。现有的基准测试通过人工标注的证据范围强调了将提取的概念与医学文本相结合的重要性。然而,它们主要集中在明确陈述的概念上,而非隐含概念。我们提出了MedicalBench,这是一个用于医学概念提取的基准,具有证据基础,评估隐含的医学推理。MedicalBench将医学概念提取形式化为对医学笔记-概念对的验证任务,并结合句子级证据识别。该数据集基于MIMIC-IV出院摘要和人工验证的ICD-10代码,通过多阶段的大型语言模型(LLM)分流管道进行策划,随后进行医学注释和专家审查。它故意包括隐含的正例、语义上混淆的负例,以及LLM判断与医学专家评估不一致的案例。我们定义了两个互补的评估任务:(1)医学概念提取和(2)句子级证据检索,从而能够评估正确性和可解释性。对最先进的LLM进行基准测试显示,性能仍然适中,突显了提取隐含表达概念的困难。我们进一步表明,性能在很大程度上与笔记长度无关,表明MedicalBench隔离了推理难度而非表面混淆因素。MedicalBench提供了第一个系统性的隐含证据基础医学概念提取基准,为开发能够识别医学相关概念并以透明且医学上忠实的方式证明其预测的医学语言模型奠定了基础。
cs.CL / 8 / 2605.20199
FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation
FlowLM:通过扩散到流的适应实现少步语言建模
Abstract
We present FlowLM, a flow matching language model transformed from pre-trained diffusion language models via efficient fine-tuning. By re-aligning the curved sampling trajectories of diffusion models into straight-line flows, FlowLM enables high quality few-step generation that rivals or even outperforms the quality of 2,000-step diffusion sampling with very few training epochs. Remarkably, finetuned FlowLM reaches performance saturation with only half as many training epochs as training from scratch, both approaches greatly outperforming the original diffusion model, thereby validating our method. Furthermore, we validate a more effective training objective for flow matching: predicting clean data to consistently guide the sampling process towards the true data distribution. Empirical results demonstrate that our approach is highly effective for high-quality, few-step text generation.
Chinese Translation
我们提出了FlowLM,一种通过高效微调从预训练的扩散语言模型转化而来的流匹配语言模型。通过将扩散模型的曲线采样轨迹重新对齐为直线流,FlowLM 实现了高质量的少步生成,其质量可与 2,000 步扩散采样相媲美,甚至超越,并且只需非常少的训练轮次。值得注意的是,微调后的FlowLM在训练轮次上仅需原始训练的一半即可达到性能饱和,这两种方法均大幅优于原始扩散模型,从而验证了我们的方法。此外,我们验证了一种更有效的流匹配训练目标:预测干净数据,以持续引导采样过程朝向真实数据分布。实证结果表明,我们的方法在高质量、少步文本生成方面极为有效。
cs.CL / 9 / 2605.20201
Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning
通过基于代理的思维链调优实现长上下文推理
Abstract
Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.
Chinese Translation
近期的大型语言模型支持最多10百万个标记的输入,但在需要复杂推理的长上下文任务上表现不佳。这类任务可以仅通过输入的一个子集——代理上下文——而非完整序列来解决。尽管共享相同的基本推理过程,但模型在代理上下文和完整上下文之间表现出显著的性能差异。为了改善长上下文推理,我们提出了ProxyCoT,这是一种新颖的训练框架,旨在将推理能力从短代理上下文转移到完整长上下文。具体而言,我们首先通过强化学习或从更大教师模型的蒸馏获得代理上下文上的高质量思维链推理轨迹,然后通过监督微调将生成的轨迹应用于完整长上下文。不同数据集上的实验表明,ProxyCoT在减少计算开销的同时,始终优于强基线。此外,使用ProxyCoT训练的模型能够将其长上下文推理能力推广到域外任务。
cs.CL / 10 / 2605.20202
Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models
在压力下:情感框架引发可测量的行为变化和小型语言模型的结构内部几何
Abstract
I study whether emotionally framed evaluation follow-ups change both the behavior and the calm-relative internal representations of small, locally deployed language models. Our main benchmark uses Qwen 3.5 0.8B on four impossible-constraint coding tasks and eight follow-up framings: calm, pressure, urgency, approval, shame, curiosity, encouragement, and threat. In the 0.8B eight-condition sweep (160 conversations), pressure produces the strongest shortcut markers (11/20 runs) and the clearest overfit pattern (3/20), while calm and curiosity preserve explicit honesty more often (7/20 and 6/20). For all seven non-baseline conditions, the corresponding calm-relative direction vectors peak at the final transformer layer. An exploratory PCA of the layer-23 direction vectors reveals a dominant first component (59.5% explained variance) aligned with a hand-labeled positive/negative split (cosine alignment 0.951); approval and urgency are nearly identical internally (cosine 0.957), whereas curiosity points away from urgency (-0.252). In a separate calm-vs.-pressure rerun used for scale comparison, Qwen 3.5 2B shows higher honest rates under calm framing and directionally consistent activation steering on a small 4-prompt A/B probe, whereas the 0.8B steering result reverses. I interpret these results as evidence for measurable prompt-sensitive control directions in small open models, while stopping short of claiming intrinsic emotional states.
Chinese Translation
我研究了情感框架的评估后续是否会改变小型本地部署语言模型的行为和相对冷静的内部表征。我们的主要基准使用 Qwen 3.5 0.8B 在四个不可能约束的编码任务和八种后续框架:冷静、压力、紧迫、赞许、羞愧、好奇、鼓励和威胁。在 0.8B 八条件的实验中(160 次对话),压力产生了最强的捷径标记(11/20 次实验)和最明显的过拟合模式(3/20 次),而冷静和好奇则更常保持明确的诚实性(分别为 7/20 和 6/20)。对于所有七个非基线条件,相应的冷静相对方向向量在最后的变换层达到峰值。对第 23 层方向向量的探索性主成分分析(PCA)显示出一个主导的第一成分(59.5% 的解释方差),与手动标记的正/负分裂对齐(余弦对齐 0.951);赞许和紧迫在内部几乎是相同的(余弦 0.957),而好奇则远离紧迫(-0.252)。在用于规模比较的单独冷静与压力的重跑中,Qwen 3.5 2B 在冷静框架下显示出更高的诚实率和一致的激活引导,而 0.8B 的引导结果则相反。我将这些结果解释为小型开放模型中可测量的提示敏感控制方向的证据,但并未声称存在内在的情感状态。
cs.CL / 11 / 2605.20315
Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs
Mix-Quant:量化预填充与精确解码用于自主大型语言模型
Abstract
LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.
Chinese Translation
自主大型语言模型(LLM)代理最近作为一种强大的范式出现,通过规划、工具使用、记忆检索和多步骤交互来解决复杂任务。然而,这些代理工作流程往往引入了显著的输入端开销,使得计算密集型的预填充阶段成为长上下文、多轮推理中的关键瓶颈。在本研究中,我们提出了Mix-Quant,一种简单而有效的阶段感知量化框架,用于快速的自主推理。我们首先研究了自主LLM工作流程中的FP4量化,并观察到量化整个推理过程可能会导致显著的性能下降。相比之下,预填充阶段表现出显著的量化冗余,因此可以在保持最小准确度损失的情况下进行量化,尽管它是计算的主要来源。基于这一见解,我们对预填充阶段应用高吞吐量的NVFP4量化,同时为解码保留BF16精度。通过将预填充加速与解码质量解耦,Mix-Quant将阶段感知的算法量化与硬件高效的NVFP4执行相结合,以缓解自主LLM代理中的推理瓶颈。在长上下文和自主基准测试中的广泛实验表明,Mix-Quant在显著提高效率的同时,基本保持了任务性能,在预填充阶段实现了高达3倍的加速。
cs.CL / 12 / 2605.20356
Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models
全双工语音对话模型中的同步与轮流发言
Abstract
Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how such models coordinate their internal representations during interaction. We simulate full-duplex dialogues between two instances of the pretrained \textit{Moshi} model under controlled conditions, manipulating channel noise and decoding bias. Synchronization is measured using Centered Kernel Alignment (CKA) across temporal lags, while anticipatory turn-taking cues are probed from delayed internal activations using causal LSTM models, from both speaker and listener perspectives. We find strong representational synchronization under no noise conditions, peaking near zero lag and degrading with noise, and we show that internal states encode anticipatory information that supports turn-taking prediction ahead of time.
Chinese Translation
全双工语音对话模型(SDMs)能够同时听和说,使得交互动态更接近于人类对话,而非基于轮流的系统。受到人类沟通中神经耦合的启发,我们研究了这些模型在交互过程中如何协调其内部表征。我们在受控条件下模拟了两个预训练的 extit{Moshi} 模型之间的全双工对话,操控信道噪声和解码偏差。通过中心核对齐(Centered Kernel Alignment, CKA)在时间延迟下测量同步性,同时从说话者和听者的角度使用因果长短期记忆(causal LSTM)模型探测延迟的内部激活中的预期轮流发言线索。我们发现,在无噪声条件下,表征同步性强,且在零延迟附近达到峰值,随着噪声的增加而下降。此外,我们表明内部状态编码了支持提前预测轮流发言的信息。
cs.CL / 13 / 2605.20364
When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation
当推理监督带来负面影响:基于 TTCW 的长篇文学评论生成
Abstract
Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0.6820. Further analysis shows that reasoning-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning-style text rather than completing the required 14-metric review report. These results suggest that, for fixed-format rubric-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric-aligned scoring remains challenging even after task-specific fine-tuning.
Chinese Translation
长篇文学创作的自动评估仍然面临挑战,因为通用的 LLM-as-Judge 方法可能无法充分捕捉与创造力相关的维度,如原创性和灵活性。尽管托兰斯创造性写作测试(Torrance Test of Creative Writing, TTCW)提供了一个结构化的创造力框架,且之前的研究已在成对级别展示了基于参考的 TTCW 评估,但目前尚不存在用于长篇 TTCW 基础文学评论生成的大规模数据集。我们通过构建一个包含 263,911 个长篇故事的数据集来填补这一空白,每个故事都附有标量评分和跨 14 个 TTCW 基础维度的元综合评论。利用该数据集,我们在两种规模(4B 和 8B)下对 Qwen3 模型进行了微调,分别在有推理内容和无推理内容的条件下进行。结果表明,无推理微调实现了更强和更稳定的性能,最佳设置的评估分数达到 0.6820。进一步分析显示,受推理监督的模型更容易出现解析失败,往往继续生成无关或重复的推理风格文本,而不是完成所需的 14 项指标评论报告。这些结果表明,对于固定格式的基于评分标准的评论生成,推理监督并非简单有益,即使在任务特定的微调之后,精确的指标对齐评分仍然具有挑战性。
cs.CL / 14 / 2605.20369
DEL: Digit Entropy Loss for Numerical Learning of Large Language Models
DEL:用于大型语言模型数值学习的数字熵损失
Abstract
Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we make an in-depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at https://github.com/PolyU-VCLab/DEL
Chinese Translation
数字预测是大型语言模型(LLMs)在数学问题解决和代码生成中的基本能力。广泛采用的最大似然估计(MLE)并不适合数字预测。最近,基于惩罚的策略,例如数字令牌损失(Number Token Loss)和离散距离损失(Discretized Distance Loss),引入了数值距离的归纳偏差,但分别导致了过度锐化和过度平坦化的数字分布。本文对LLM的数值学习进行了深入分析,表明现有的数值学习方法在概念上遵循标准-距离的公式,其中标准项表示优化模式,距离项则注入几何先验。因此,我们提出了数字熵损失(Digit Entropy Loss, DEL)用于自回归数值学习,该方法在三个关键设计中重构了传统的无监督熵优化:利用数字条件概率和二元交叉熵引导熵优化朝向监督方式;弃用距离项以绕过数值距离的问题;以及将基于整数的数值学习推广到浮点数优化,从而实现更准确的数字预测。我们的DEL公式可以结合整数、小数和小数点,将学习目标从单一数字扩展到浮点数域。在对包括CodeLlama、Mistral、DeepSeek和Qwen-2.5在内的四个代表性LLM进行的七个数学推理基准测试中,实验结果表明DEL在整体预测准确性和数值距离方面始终优于其对手。源代码可在 https://github.com/PolyU-VCLab/DEL 获取。
cs.CL / 15 / 2605.20382
Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs
言行不一:大型语言模型中的指令诱导冲突
Abstract
Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in a particular language, or adopt a persona) is opposed by N hardcoded assistant turns demonstrating a competing pattern P. We then measure instruction-following (IF) rates in this setting, across 13 models and 16 different instructions, for up to 50 turns. Average instruction-following rates range from 1% to 99% across models, largely uncorrelated with standard capability benchmarks. The transition from instruction-following to pattern-following is universal but highly model-dependent. Robustness is modulated both by instruction content, with models resisting induction longer when instructions align with their trained value priors, and by output format, with diverse multi-token responses proving substantially more resistant than single-token outputs. Chain-of-thought reasoning improves robustness but does not eliminate susceptibility, and can produce dissociation between correct deliberation and incorrect output. When asked to predict their behavior in this setting, models achieve 83.5% accuracy on average but systematically underestimate their own resistance to induction pressure. These results suggest that instruction-following remains brittle under induction pressure even for otherwise capable models, and that output diversity, rather than semantic engagement with the input, is the primary factor predicting robustness.
Chinese Translation
语言模型被训练以遵循指令,但它们也是强大的模式补全者。当这两个目标发生冲突时会发生什么?我们构建了对话场景,其中用户指令要求以目标方式 T(例如,始终输出特定的标记、以特定语言回答或采用某种角色)进行行为,但却被 N 个硬编码的助手回应所反对,这些回应展示了一个竞争模式 P。然后,我们在这种设置下测量了 13 个模型和 16 种不同指令的指令遵循(IF)率,最多可达 50 次交互。不同模型的平均指令遵循率范围从 1% 到 99%,与标准能力基准大体上无关。从指令遵循到模式遵循的转变是普遍存在的,但高度依赖于模型。模型的鲁棒性受到指令内容的调节,当指令与其训练的价值先验一致时,模型抵抗诱导的时间更长;同时,输出格式也会影响鲁棒性,多标记的多样化响应显著比单标记输出更具抵抗力。链式思维推理提高了鲁棒性,但并未消除易感性,且可能导致正确的思考与错误的输出之间的脱节。当被要求预测其在该设置下的行为时,模型平均达到 83.5% 的准确率,但系统性低估了自身对诱导压力的抵抗能力。这些结果表明,即使对于其他能力强大的模型,指令遵循在诱导压力下仍然脆弱,而输出多样性而非与输入的语义关联是预测鲁棒性的主要因素。
cs.CL / 16 / 2605.20404
Puzzled By ChatGPT? No more! A Jigsaw Puzzle to Promote AI Literacy and Awareness
被 ChatGPT 迷惑?不再!一个促进人工智能素养和意识的拼图游戏
Abstract
The rapid adoption of Generative AI, including LLM-based chatbots like ChatGPT, has highlighted the need for accessible ways to support public understanding and AI literacy. To address this need, we introduce a game-based, interactive approach in the form of a jigsaw puzzle whose completed image is a comic-based infographic illustrating the workings, capabilities, limitations, and societal implications of these technologies. Each comic sketch also functions as a standalone informational card, providing focused explanations of specific facets of AI use, design, and impact. The visual content was created in a live collaborative session with a professional illustrator and a multidisciplinary group of experts and non experts, combining structured knowledge with informal, exploratory reflections shared during the discussion. By integrating hands-on assembly, visual storytelling, and collaborative interaction, the puzzle provides an engaging and playful tool for exploring the mechanisms, perks, and perils of AI systems in informal learning contexts.
Chinese Translation
生成性人工智能的快速普及,包括基于大型语言模型(LLM)的聊天机器人如 ChatGPT,突显了支持公众理解和人工智能素养的可及性需求。为了解决这一需求,我们提出了一种基于游戏的互动方法,形式为拼图游戏,其完成后的图像是一个基于漫画的信息图,展示了这些技术的工作原理、能力、局限性和社会影响。每个漫画草图也作为独立的信息卡,提供对人工智能使用、设计和影响的特定方面的集中解释。视觉内容是在与专业插画师及跨学科专家和非专家的现场协作会议中创建的,结合了结构化知识与在讨论中分享的非正式探索性反思。通过整合动手组装、视觉叙事和协作互动,该拼图为在非正式学习环境中探索人工智能系统的机制、优点和风险提供了一个引人入胜且富有趣味的工具。
cs.CL / 17 / 2605.20410
Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs
偏见与推理的机制:解读链式思维提示对大型语言模型性别偏见的影响
Abstract
Large language models (LLMs) are increasingly deployed in socially sensitive settings despite substantial documentation that they encode gender biases. Chain-of-Thought (CoT) prompting has been proposed as a bias-mitigation approach. However, existing evaluations primarily focus on changes in LLM benchmark performance, providing limited insight into whether apparent bias reductions reflect meaningful changes in a model's internal mechanisms. In this work, we investigate how CoT prompting affects gender bias in LLMs, combining benchmark-based evaluation with mechanistic interpretability techniques and reasoning chain failure analysis. Our results confirm a stereotypical bias present in LLM outputs across benchmarks, showing that CoT prompting does not consistently reduce the bias gap. Mechanistic analyses reveal that although CoT balances biased behavior in certain attention head clusters, gender bias remains embedded in hidden representations, indicating only superficial mitigation. Inspection of reasoning chains further suggests that these improvements stem from memorization and familiarity with the dataset rather than genuine understanding of bias.
Chinese Translation
大型语言模型(LLMs)在社会敏感环境中的应用日益增多,尽管已有大量文献记录了它们编码性别偏见的现象。链式思维(Chain-of-Thought, CoT)提示被提出作为一种减轻偏见的方法。然而,现有评估主要集中在LLM基准性能的变化上,提供了有限的见解来判断明显的偏见减少是否反映了模型内部机制的实质性变化。在本研究中,我们探讨了CoT提示如何影响LLMs中的性别偏见,结合基准评估、机制可解释性技术和推理链失败分析。我们的结果确认了LLM输出中存在的刻板印象偏见,显示CoT提示并不总是能有效缩小偏见差距。机制分析揭示,尽管CoT在某些注意力头集群中平衡了偏见行为,但性别偏见仍然嵌入在隐藏表示中,表明减轻效果仅是表面上的。对推理链的检查进一步表明,这些改进源于对数据集的记忆和熟悉,而非对偏见的真正理解。
cs.CL / 18 / 2605.20478
Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables
阶段审计:跨维基表格的可审计源前沿发现
Abstract
LLM-curated tables can appear source-grounded while containing unsupported rows: the curator may recall entries from parametric memory and retroactively attach page-level citations that are not the actual source. We study this hazard in Seed2Frontier discovery: the task of finding complement Wikipedia pages from a seed page to assemble a structured table. Stage-Audit addresses it with disjoint curator-auditor write rights, a row-level source-citation gate, and a 12-check audit taxonomy over keys, schema, source roles, cardinality, and scope. On a curated 51-instance Seed2Frontier evaluation set spanning 15 top-level domains, Stage-Audit improves source-frontier precision over a vanilla LLM curator from 0.356 to 0.505 (+42% relative) and F1 from 0.334 to 0.451 (+35%), while maintaining explicit per-row source traceability. The vanilla-LLM-vs-Stage-Audit comparison isolates the policy contribution rather than LLM-based discovery in general.
Chinese Translation
由大型语言模型(LLM)整理的表格可能看似基于来源,但却包含不支持的行:整理者可能从参数记忆中回忆条目,并追溯性地附加并非实际来源的页面级引用。我们研究了这一风险在Seed2Frontier发现中的表现:从种子页面寻找补充维基百科页面以组装结构化表格的任务。阶段审计通过不相交的整理者-审计者写入权限、行级源引用门控,以及针对键、模式、源角色、基数和范围的12项审计分类法来解决此问题。在一个涵盖15个顶级域的51实例Seed2Frontier评估集上,阶段审计将源前沿精度从普通LLM整理者的0.356提高到0.505(相对提高42%),F1从0.334提高到0.451(相对提高35%),同时保持每行源的明确可追溯性。普通LLM与阶段审计的比较隔离了政策贡献,而不是一般的基于LLM的发现。
cs.CL / 19 / 2605.20529
Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks
搭配引导:关于人类和神经网络主谓一致学习的假设
Abstract
In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.
Chinese Translation
统计信号在语言输入中以何种方式可能有助于句法的习得?在此,我们假设一种机制,称为搭配引导(collocational bootstrapping),其中词汇共现模式中的规律性可以为句法依赖关系提供线索。我们研究这种机制是否能支持英语主谓一致的习得。首先,我们通过在合成数据集上训练神经网络来模拟语言习得,这些数据集在主谓配对的可预测性上有所不同。我们发现,在一定范围的变异性水平下,这些统计学习者能够稳健地学习主谓一致。接着,我们分析了儿童导向语言中主谓配对的变异性,发现此类数据中的变异性落在我们计算模拟中支持稳健泛化的范围内。综合来看,这些结果表明,搭配引导是一种适合儿童所接收输入的可行学习策略。
cs.CL / 20 / 2605.20537
What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework
生物医学命名实体识别和实体链接基准测量了什么?一个以语料库为中心的诊断框架
Abstract
Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.
Chinese Translation
生物医学命名实体识别(NER)和实体链接(EL)在很大程度上依赖于带注释的语料库,但这些资源在基准测试中的实用性往往是被假设而非被表征的。我们提出了一个以语料库为中心的框架,用于直接从语料库注释、概念链接、训练-测试划分、文档元数据和术语映射中诊断与基准相关的属性。该框架将标准化统计数据组织为五个类别:(1)规模、密度和标签分布,(2)词汇和概念结构,(3)训练-测试重叠,(4)元数据组成,以及(5)术语覆盖(如适用)。将该框架应用于涵盖疾病、化学物质和细胞类型的九个语料库,我们发现即使在处理相同表面任务时,语料库属性也可能存在显著差异。我们发现它们提供的评估信号存在差异、施加的泛化要求不同、允许的训练-测试重用程度不同,以及它们所代表的生物医学文献和概念空间的区域不同。这些差异表明,常见的语料库统计数据可能不足以表征生物医学 NER 和 EL 基准所评估的内容。我们认为,以语料库为中心的诊断提供了一个实用框架,用于分析语料库,超越诸如语料库大小和实体类型等表面描述,以识别潜在的迁移风险,并解释基准结论的范围。我们将该框架作为开源代码发布,并提供交互式仪表板以支持重现我们的分析和表征其他语料库。
cs.CL / 21 / 2605.20558
When Irregularity Helps: A Subclass Analysis of Inductive Bias in Neural Morphology
不规则性如何助力:神经形态学中的归纳偏差子类分析
Abstract
Neural morphological generation systems often achieve high aggregate accuracy on benchmark datasets, yet such performance can conceal systematic errors concentrated in rare morphological subclasses. We examine Japanese past-tense verb inflection and show that a very small, structurally specific irregular subtype (<1% of data) accounts for a disproportionate share of model errors. Controlled ablation experiments demonstrate that removing this subtype yields larger improvements in generalization than removing all irregular verbs, indicating that not all irregularity contributes equally to model instability. These findings suggest that error concentration is driven by the interaction between extreme low-frequency morphological patterns and specific morphophonological processes, particularly gemination. We argue that morphological evaluation should incorporate finer-grained subclass analysis beyond standard conjugation categories.
Chinese Translation
神经形态生成系统在基准数据集上通常能够实现较高的整体准确率,但这种表现可能掩盖了集中在稀有形态子类中的系统性错误。我们研究了日语过去时动词的屈折变化,发现一个非常小、结构特定的不规则子类型(<1% 的数据)占据了模型错误的不成比例份额。控制性消融实验表明,去除该子类型比去除所有不规则动词在泛化能力上带来了更大的改善,这表明并非所有的不规则性对模型的不稳定性贡献相同。这些发现表明,错误集中是由极低频形态模式与特定的形态音位过程之间的相互作用驱动的,特别是重叠现象。我们认为,形态评估应超越标准的屈折类别,纳入更细致的子类分析。
cs.CL / 22 / 2605.20588
Direct Translation between Sign Languages
手语之间的直接翻译
Abstract
The field of sign language translation has witnessed significant progress in the translation between sign and spoken languages, but the translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach composing separate sign-to-text, text-to-text, and text-to-sign systems suffers from error propagation and extra latency as well as the loss of information unique in the visual modality. We aim to develop direct sign-to-sign translation. However, a large-scale open-domain parallel corpus has not been curated between sign languages. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual language utterance-sign corpora. Using this data, we jointly train a single MBART-based model for both text->sign (T2S) and sign->sign (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and language matching metrics after predicted sign utterances are translated back to sentences (50% high BLEU-4) while achieving a roughly 2.3* speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.
Chinese Translation
手语翻译领域在手语与口语之间的翻译方面取得了显著进展,但手语之间的翻译仍然在很大程度上未被探索且难以实现。后者可以帮助全球15亿聋人和听力障碍者(DHH)跨越语言障碍进行交流,而无需依赖听力翻译或书面语言的流利度。由独立的手语到文本、文本到文本和文本到手语系统组成的级联方法存在错误传播和额外延迟的问题,同时也会丢失在视觉模态中独特的信息。我们旨在开发直接的手语到手语翻译。然而,目前尚未建立大规模的开放领域手语平行语料库。为了实现手语表达之间的直接翻译,我们使用回译方法从未对齐的单一语言表达-手语语料库中生成合成的手语-手语对。利用这些数据,我们联合训练了一个基于MBART的模型,适用于文本到手语(T2S)和手语到手语(S2S)。在合成生成的美国手语(ASL)、中国手语(CSL)和德国手语(DGS)之间的配对集上,我们的直接S2S方法在几何手势错误指标(DTW对齐的MPJPE降低20%)和在预测的手语表达被翻译回句子后的语言匹配指标(BLEU-4提高50%)上均优于级联基线,同时实现了约2.3倍的加速。在一小部分现有的跨语言手语数据集上,我们发现我们提出的方法也有类似的改进。
cs.CL / 23 / 2605.20591
Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models
无害?网络部署的医疗大型语言模型中的幻觉与参与者级别的滥用
Abstract
Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.
Chinese Translation
医疗大型语言模型(LLMs),包括定制医疗GPT(MedGPTs)和开源模型,正越来越多地在网络平台上部署以提供临床指导。然而,它们存在幻觉、政策不合规和不安全设计的风险。我们对6233个MedGPTs进行了大规模评估,评估了1500个分层样本,以及10个开源LLMs。我们引入了两个框架:用于幻觉检测的MedGPT-HEval和基于LLM的管道,用于评估政策违规和开发者意图。我们的结果显示,25-30%的MedGPTs表现出较低的事实准确性,其中底层和中层模型风险最高;33.6-54.3%违反操作阈值,57.06%的启用行动的模型缺乏足够的隐私披露。与开源模型相比,MedGPTs在事实准确性和语义一致性方面表现更高,尽管开源模型更稳定。这些结果揭示了幻觉和合规性方面的系统性缺口,强调了多指标评估和更强保护措施的必要性。我们发布了HAA-MedGPT,一个结构化数据集,以支持未来关于面向网络的医疗LLMs安全性的研究。
cs.CL / 24 / 2605.20602
Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies
自我训练并未扁平化语言——而是重构了语言:表面标记增强而深层句法消亡
Abstract
Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth -- the number of nested syntactic dependencies it requires -- and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families (N=85), the pooled Spearman correlation is rho=0.540 (p < 10^{-6}; cluster-bootstrap 95% CI [0.434, 0.634]), while frequency is a substantially weaker predictor (rho=0.225). A matched human-text fine-tuning control yields rho=0.039 (p=0.88), confirming the gradient is self-training-specific. We further document a Superficial Complexity Paradox: aggregate complexity proxies (dep-tree depth, TTR, word length) all rise as the underlying clause structure dies, with direct implications for training-data curation and LLM-text detection.
Chinese Translation
对语言模型自身输出的连续自我训练通常被描述为一种扁平化的过程:多样性下降,分布变窄,文本变得“更像它自己”。我们提供的证据表明,这种描述是不完整的。在对五个模型(GPT-2 124M、Pythia-410M、Pythia-1.4B、OPT-1.3B、Pythia-2.8B)进行的十一代自我训练中,语言并非均匀扁平化——而是被重构。表面标记(话语连接词、模糊词、破折号)上升,而中层和深层句法结构(疑问句、插入语、被动语态、虚拟语气)则崩溃。我们将这种不对称崩溃形式化为结构深度假设(Structural Depth Hypothesis, SDH):语言特征的每代衰减率主要由其结构深度预测——即所需嵌套句法依赖的数量——而仅次于其零代输出频率。从五个模型中汇总的17个特征面板(N=85)显示,汇总的斯皮尔曼相关系数为rho=0.540(p < 10^{-6}; cluster-bootstrap 95% CI [0.434, 0.634]),而频率作为预测因子的效果明显较弱(rho=0.225)。与人类文本的匹配微调控制结果为rho=0.039(p=0.88),确认了该梯度是特定于自我训练的。我们进一步记录了表面复杂性悖论:当基础从句结构消亡时,聚合复杂性代理(依赖树深度、类型-令牌比、单词长度)均上升,这对训练数据的策划和大语言模型文本检测具有直接影响。
cs.CL / 25 / 2605.20613
HRM-Text: Efficient Pretraining Beyond Scaling
HRM-Text:超越规模的高效预训练
Abstract
The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.
Chinese Translation
当前大型语言模型的预训练范式依赖于巨量计算和互联网规模的原始文本,这为基础研究创造了显著障碍。相比之下,生物系统通过多时间尺度处理展现出高度样本高效学习,例如前顶叶环路的功能组织。以此为灵感,我们提出了HRM-Text,它用层次递归模型(Hierarchical Recurrent Model, HRM)替代标准的变换器(Transformers),将计算解耦为缓慢演变的战略层和快速演变的执行层。为了稳定这种深度递归用于语言建模,我们引入了MagicNorm和预热深度信用分配。此外,我们不再使用标准的原始文本预训练,而是专门基于任务完成目标和PrefixLM掩码对指令-响应对进行训练。作为高效预训练的实证存在证明,一个从零开始训练的1B参数HRM-Text模型仅使用400亿个独特标记和1500美元的预算,分别在MMLU上达到了60.7%、在ARC-C上达到了81.9%、在DROP上达到了82.2%、在GSM8K上达到了84.5%以及在MATH上达到了56.2%。尽管使用的训练标记数量大约比标准基线少100-900倍,估计计算量少96-432倍,HRM-Text的表现仍与2-7B参数的开放模型竞争。这些结果表明,架构与目标的共同设计可以显著降低计算与性能的比率,使从零开始的预训练对更广泛的研究社区变得可及。
cs.CL / 26 / 2605.20616
Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents
自动梦境生成器:为语言代理学习离线记忆巩固
Abstract
Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge. Retrieval-augmented and structured memory methods record per-session observations effectively, but often couple acquisition and consolidation into a single online process, leaving the agent without a global view across sessions to discover recurring patterns, abstract shared procedures, or prune redundant entries. Inspired by complementary learning systems theory, we propose Auto-Dreamer, a learned offline consolidator for language-agent memory. Auto-Dreamer decouples fast per-session memory acquisition from slow cross-session consolidation. Given a selected working region of a typed memory bank, the consolidator treats the region as read-only evidence, performs bounded tool-use to inspect entries and provenance-linked source trajectories, and synthesizes a fresh compact replacement set that abstracts across sessions and supersedes the original region. We train Auto-Dreamer via GRPO, using end-to-end agent performance as the reward signal to learn how to consolidate memories acquired through fast online experience. Trained on ScienceWorld trajectories alone, Auto-Dreamer outperforms fixed, RL-trained, and prompted memory baselines on ScienceWorld by 7 points while using an active memory bank 12$\times$ smaller than the strongest baseline, and continues to lead on held-out ALFWorld and WebArena without retraining -- using 6$\times$ less memory than the strongest baseline on ALFWorld.
Chinese Translation
语言代理越来越多地在相关任务的流中操作,但现有的记忆系统难以将积累的经验转化为可重用的知识。增强检索和结构化记忆方法有效地记录每次会话的观察,但通常将获取和巩固结合为一个在线过程,导致代理无法在会话之间获得全局视图,以发现重复模式、抽象共享程序或修剪冗余条目。受到互补学习系统理论的启发,我们提出了自动梦境生成器(Auto-Dreamer),这是一个为语言代理记忆设计的学习型离线巩固器。自动梦境生成器将快速的每次会话记忆获取与缓慢的跨会话巩固解耦。给定一个选定的类型记忆库的工作区域,巩固器将该区域视为只读证据,执行有限的工具使用以检查条目和与来源轨迹相关的证据,并合成一个新的紧凑替代集,该集跨会话进行抽象并取代原始区域。我们通过GRPO训练自动梦境生成器,使用端到端代理性能作为奖励信号,学习如何巩固通过快速在线经验获得的记忆。仅在ScienceWorld轨迹上训练的自动梦境生成器,在ScienceWorld上比固定的、基于强化学习训练的和提示的记忆基线高出7个点,同时使用的活跃记忆库比最强基线小12倍,并且在未重新训练的情况下继续在持出数据集ALFWorld和WebArena上领先——在ALFWorld上使用的内存比最强基线少6倍。
cs.CL / 27 / 2605.20626
Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task
基于检索增强的长文本翻译在文化图像描述中的应用:Gators团队在AmericasNLP 2026共享任务中的提交
Abstract
We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaran\'i, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain >150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaran\'i performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.
Chinese Translation
我们展示了佛罗里达大学Gators团队在AmericasNLP 2026共享任务中针对土著语言文化图像描述的提交。我们的两阶段流程首先使用Qwen2.5-VL生成西班牙语中间描述,然后利用检索增强的多次提示与Gemini 2.5 Flash生成目标语言描述。在我们的开发集评估中,对于Bribri、Guaraní和Orizaba Nahuatl描述,我们分别实现了164.1%、131.7%和122.6%的改进,并在测试集评估中对Bribri和Orizaba Nahuatl语言保持了超过150%的改进。我们发现检索高度依赖于语言,仅对大型领域内语料库有益,并且合成数据增强大约占开发集Guaraní性能提升的28 chrF++。我们的提交在共享任务中获得了整体冠军,在目标语言描述的人类评估中排名五个决赛提交中的第二。
cs.CL / 28 / 2605.20628
Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation
划分-提示-精炼:一种无训练、结构感知的生物医学摘要生成框架
Abstract
Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery. However, a non-trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks. We propose DPR-BAG (Divide, Prompt, and Refine for Biomedical Abstract Generation), a training-free, zero-shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract. DPR-BAG decomposes full-text documents into structured rhetorical facets following the Background-Objective-Methods-Results-Conclusions (BOMRC) schema, performs parallel LLM-based summarization for each facet, and applies a final refinement stage to restore global discourse coherence. On PMC-MAD, a distribution-aligned dataset of 46,309 biomedical articles, DPR-BAG improves abstractive novelty over strong extractive and fine-tuned baselines, while maintaining factual consistency. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity-level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies. These findings underscore the potential of training-free, structure-aware frameworks for scalable biomedical abstract generation in low-resource settings. Our data and code are available at https://huggingface.co/datasets/pmc-mad/PMC-MAD and https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/DPR-BAG.
Chinese Translation
生物医学摘要在下游自然语言处理(NLP)应用中发挥着关键作用,例如信息检索、生物信息整理和生物医学知识发现。然而,许多生物医学文章并没有摘要,这降低了这些文章在下游任务中的实用性。我们提出了DPR-BAG(划分、提示和精炼用于生物医学摘要生成),这是一种无训练、零样本的框架,能够为具有完整文本但没有摘要的生物医学文章生成连贯且事实基础的摘要。DPR-BAG将完整文本文档分解为遵循背景-目标-方法-结果-结论(BOMRC)结构的修辞面,针对每个面进行基于大语言模型(LLM)的并行总结,并应用最终的精炼阶段以恢复全局话语的一致性。在PMC-MAD上,一个包含46,309篇生物医学文章的分布对齐数据集,DPR-BAG在保持事实一致性的同时,提高了抽象新颖性,超越了强提取和微调基线。我们的消融研究揭示了一个反直觉的发现:增加提示复杂性或明确注入实体级指导可能会降低事实对齐,强调了受控提示策略的重要性。这些发现突显了无训练、结构感知框架在低资源环境中可扩展的生物医学摘要生成的潜力。我们的数据和代码可在 https://huggingface.co/datasets/pmc-mad/PMC-MAD 和 https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/DPR-BAG 获取。
cs.CL / 29 / 2605.20668
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists
人工智能评审者的局限与机遇:对45位专家科学家评审的Nature系列论文进行评审
Kim, Seungone, Yoon, Dongkeun, Gashteovski, Kiril, Suk, Juyoung, Baek, Jinheon, Aggarwal, Pranjal, Wu, Ian, Zaverkin, Viktor, Petkoski, Spase, Schrider, Daniel R., Dukovski, Ilija, Santini, Francesco, Mitreska, Biljana, Jeong, Yong, Kwon, Kyeongha, Sim, Young Min, Manasova, Dragana, Porto, Arthur, Mojsoska, Biljana, Takamoto, Makoto, Shuntov, Marko, Liu, Ruoqi, Lee, Hyunjoo Jenny, Dinç, Niyazi Ulas, Jo, Yehhyun, Han, Sunkyu, Lee, Chungwoo, Li, Huishan, Tsai, Esther H. R., Simsek, Ergun, Shafi, Khushboo, Chung, Yeonseung, Park, Jihye, Shulevski, Aleksandar, Christiansen, Henrik, Son, Yoosang, Knight, Elly, Montoya, Amanda, Ahn, Jeongyoun, Langkammer, Christian, Moon, Heera, Yoon, Changwon, Stikov, Nikola, Jang, Mooseok, Choi, Edward, Kim, Junhan, Jung, Yeon Sik, Kim, Woo Youn, Kim, Jae Kyoung, Anjum, Ishraq Md, Kim, Hyun Uk, Bridges, Drew, Lawrence, Carolin, Yue, Xiang, Oh, Alice, Asai, Akari, Welleck, Sean, Neubig, Graham
Abstract
With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.
Chinese Translation
随着人工智能能力的进步,人工智能评审者开始在科学同行评审中得到应用,但它们的能力和可信度仍然受到质疑:许多科学家将其视为缺乏评估研究能力的概率系统,而其他研究者则对其准备情况持乐观态度,但缺乏具体证据。了解人工智能评审者的优势、短板以及面临的挑战至关重要。然而,现有对人工智能评审者的评估主要集中在其裁决是否与人类裁决一致(例如,评分一致性、接受预测),这不足以全面表征其能力和局限性。本文通过一项大规模专家注释研究填补了这一空白,45位来自物理、生物和健康科学领域的专家科学家花费469小时对来自82篇Nature系列论文的人类撰写和人工智能生成的评审中的2960条个别批评(每条针对论文的一个特定方面)进行评分,评估其正确性、重要性和证据充分性。在所有三个维度的综合评分中,由GPT-5.2驱动的评审代理的得分高于每篇论文的最高人类评审者(60.0%对48.2%,p = 0.009),而所有三位人工智能评审者(包括Gemini 3.0 Pro和Claude Opus 4.5)在每个维度上均超过最低评分的人类评审者。人工智能评审者的准确批评也更常被评为重要且证据充分,并提出了26%的独特问题,这是人类未曾提及的。然而,人工智能评审者之间的重叠远远超过人类(交叉评审者对的重叠率为21%对3%),并表现出16个与人类不同的重复性弱点,例如对特定子领域知识的有限了解、对多个文件的长上下文管理能力不足,以及对小问题的过于苛刻的态度。总体而言,我们的结果将当前的人工智能评审者定位为人类评审者的补充,而非替代。
cs.CL / 30 / 2605.20684
Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting
超越语义相似性:一种针对企业信用承保的两阶段非参数检索工作流程
Abstract
Corporate credit underwriting requires analysts to extract actionable evidence from long, heterogeneous financial documents spanning hundreds of pages and multiple languages. Standard Retrieval-Augmented Generation (RAG) pipelines optimize for semantic similarity, which frequently surfaces passages that are topically related but lack decision utility, a problem we term the similarity-utility gap. We propose a two-phase non-parametric retrieval architecture that separates high-recall candidate retrieval from high-precision utility ranking. The first phase combines lexical and dense multilingual retrieval to construct a broad candidate pool. The second phase applies an adaptive retrieval controller that filters candidates using query intent and document structure signals, followed by an LLM-as-a-Judge utility scoring mechanism that ranks passages by analytical usefulness rather than semantic proximity. A context-aware extraction module preserves structural fidelity across narrative text and complex financial tables. The system is deployed entirely on-premise to satisfy enterprise data governance requirements. Evaluated on a multilingual corpus of proprietary financial documents with analyst-curated relevance labels, the system significantly outperforms naive retrieval baselines. In production deployment across more than 800 credit analysts, document review time was reduced from several hours to approximately three minutes, demonstrating the practical value of utility-aware RAG architectures for document-intensive decision-support workflows.
Chinese Translation
企业信用承保要求分析师从长篇、多样化的金融文件中提取可操作的证据,这些文件跨越数百页并涉及多种语言。标准的检索增强生成(RAG)管道优化语义相似性,这常常导致出现主题相关但缺乏决策效用的段落,我们称之为相似性-效用差距。我们提出了一种两阶段非参数检索架构,将高召回候选检索与高精度效用排序分开。第一阶段结合词汇检索和密集多语言检索,以构建广泛的候选池。第二阶段应用自适应检索控制器,利用查询意图和文档结构信号过滤候选项,随后采用LLM-as-a-Judge效用评分机制,根据分析实用性而非语义接近性对段落进行排序。上下文感知提取模块在叙述文本和复杂金融表格之间保持结构的完整性。该系统完全在本地部署,以满足企业数据治理要求。在一个包含分析师策划的相关性标签的多语言专有金融文档语料库上进行评估时,该系统显著优于简单的检索基线。在超过800名信用分析师的生产部署中,文档审查时间从数小时减少到大约三分钟,展示了效用感知RAG架构在文档密集型决策支持工作流程中的实际价值。
cs.CL / 31 / 2605.20689
DIVE: Embedding Compression via Self-Limiting Gradient Updates
DIVE:通过自限制梯度更新进行嵌入压缩
Abstract
High-dimensional embeddings from large language models impose significant storage and computational costs on vector search systems. Recent embedding compression methods, including Matryoshka-Adaptor (EMNLP 2024), Search-Adaptor (ACL 2024), and SMEC (EMNLP 2025), enable dimensionality reduction through lightweight residual adapters, but their training objectives cause severe overfitting when labeled data is scarce, degrading retrieval performance below the frozen baseline. We propose \textsc{DIVE} (\textbf{D}imensionality reduction with \textbf{I}mplicit \textbf{V}iew \textbf{E}nsembles), a compression adapter that addresses this failure through two mechanisms. First, a self-limiting hinge-based triplet loss produces zero gradient once a triplet satisfies the margin constraint, bounding the total perturbation applied to the pretrained embedding space. Second, a head-wise NT-Xent contrastive loss treats multiple learned projections of each embedding as implicit views, providing dense self-supervised gradients that compensate for the sparsity of the triplet signal on small datasets. Across six BEIR datasets, \textsc{DIVE} outperforms all three baseline adapters on every dataset and at every evaluated compression ratio, with a 14M-parameter open-source implementation.
Chinese Translation
来自大型语言模型的高维嵌入对向量搜索系统施加了显著的存储和计算成本。最近的嵌入压缩方法,包括 Matryoshka-Adaptor(EMNLP 2024)、Search-Adaptor(ACL 2024)和 SMEC(EMNLP 2025),通过轻量级残差适配器实现了维度减少,但它们的训练目标在标记数据稀缺时会导致严重的过拟合,降低检索性能至冻结基线以下。我们提出了 extsc{DIVE}( extbf{D}imensionality reduction with extbf{I}mplicit extbf{V}iew extbf{E}nsembles),一种压缩适配器,通过两种机制解决这一问题。首先,自限制的基于铰链的三元组损失一旦三元组满足边际约束就会产生零梯度,从而限制施加于预训练嵌入空间的总扰动。其次,头部 NT-Xent 对比损失将每个嵌入的多个学习投影视为隐式视图,提供密集的自监督梯度,以补偿小数据集上三元组信号的稀疏性。在六个 BEIR 数据集上, extsc{DIVE} 在每个数据集和每个评估的压缩比上均优于所有三种基线适配器,并提供了一个 1400 万参数的开源实现。
cs.CL / 32 / 2605.20693
Interpretable Discriminative Text Representations via Agreement and Label Disentanglement
通过一致性和标签解耦实现可解释的区分性文本表示
Abstract
Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's $\kappa$, and selects features by residual held-out predictive gain. A stylized analysis connects the $\kappa$ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human--human and human--LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.
Chinese Translation
可解释的文本表示应揭示出不仅具有预测能力,而且对于独立审计员来说足够有意义的坐标。现有的区分性表示通常使用匿名的嵌入方向,而概念瓶颈和大语言模型(LLM)辅助的方法则将自然语言名称附加到特征上,但未确保这些定义是可重复的或与目标标签区分开来。我们提出了一个可解释的区分性文本表示的操作标准:每个坐标应满足概念清晰性,通过独立注释者应用特征定义时的机会调整一致性来衡量,以及标签解耦,意味着该特征不应仅仅是对预测目标的同义改写。我们在LLM辅助特征发现(LFD)中实现了这一标准,这是一种迭代方法,从对比结果相对立的文本对中提出词汇和语义特征,使用跨LLM的Cohen's $ ext{kappa}$筛选候选特征,并通过剩余的保持预测增益选择特征。一种风格化分析将$ ext{kappa}$筛选与每个特征的注释噪声界限联系起来,将一致性形式化为可靠性检查。在涵盖七个语料库的十个文本分类任务中,LFD的预测性能与强文本瓶颈基线相匹配,同时生成的特征明显更清晰且标签纠缠更少。232名评审员的人类审计显示,LFD特征在与基线概念相比时,获得了更高的人类-人类和人类-LLM的一致性,评审员一致认为它们的标签泄漏更少。这些结果表明,通过一致性测试和标签解耦的坐标为可解释文本分类提供了一个实用的审计标准。
cs.CL / 33 / 2605.20712
SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR
SCRIBE:印度语言自动语音识别的诊断评估与丰富转录模型
Abstract
Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid sandhi merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.
Chinese Translation
自动语音识别只有在修正成本低于手动输入时才会取代打字,这一阈值由错误类型而非数量决定:修正一个错误识别的领域术语的成本远高于插入一个逗号。词错误率(WER)在两个方面存在不足:它将不同的错误类别压缩为一个单一的标量,并且在结构上对粘合语惩罚过重,因为有效的音变合并会抬高分数。我们提出了SCRIBE,一个诊断框架,通过与领域词汇的注入进行容忍音变的对齐,提供了对词汇、标点、数字和领域实体率的分类错误分解。人工验证确认SCRIBE与专家判断一致,而WER则不然。我们发布了SCRIBE,一个大型语言模型(LLM)策划管道、基准测试以及针对印地语、马拉雅拉姆语和卡纳达语的开放权重丰富转录模型。
cs.CL / 34 / 2605.20729
MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks
MTR-Suite:评估和综合对话检索基准的框架
Abstract
Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at https://github.com/rangehow/mtr-suite.
Chinese Translation
准确评估对话检索对于推动检索增强生成(Retrieval-Augmented Generation, RAG)系统至关重要。然而,现有的对话检索基准存在昂贵、稀疏的人类标注或僵化、不自然的自动启发式方法等问题。为了解决这些挑战,我们提出了MTR-Suite,一个用于审计、综合和基准检索的统一框架。它的特点包括:(1)MTR-Eval,一个基于大型语言模型(LLM)的审计工具,用于量化先前基准中的对齐差距;(2)MTR-Pipeline,一个多智能体系统,使用贪婪遍历聚类以1/400的人力成本生成高保真对话;(3)MTR-Bench,一个严格的通用领域基准。MTR-Bench模拟生产环境中的挑战(困难的话题切换、冗长性),提供了更强的区分能力。我们将代码和数据公开,以促进未来的研究,网址为https://github.com/rangehow/mtr-suite。
cs.CL / 35 / 2605.20730
Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning
分布对齐作为上下文学习中任务向量设计的标准
Abstract
In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2\% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales; an aspect that has remained nascent in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4\%, suggesting a new utility for extracted task representations.
Chinese Translation
上下文学习(ICL)允许大型语言模型(LLMs)通过示例适应新任务,但随着上下文长度的增加,其推理成本也随之上升。虽然任务向量通过将示例压缩为紧凑的隐藏状态表示提供了一个有前景的替代方案,但其质量仅通过下游任务的准确性进行评估。这一间接标准对如何设计更有效的任务向量提取方法提供了有限的见解。本文提出,使用任务向量的推理应与ICL的预测分布对齐。为量化这一点,我们引入了$d_{ ext{NTP}}$,这一指标测量基于任务向量和基于ICL的推理之间下一个标记概率的差异。我们的实证分析表明,$d_{ ext{NTP}}$作为性能代理,表现出与下游准确性之间的强负相关性。基于此,我们开发了线性任务向量(LTV),该方法通过闭式线性映射来最小化$d_{ ext{NTP}}$,并通过回归估计示例效应。在八个分类基准和五个LLM上,LTV始终优于现有的任务向量基线,平均准确性提高了9.2
dot ext{,同时减少了推理延迟。我们进一步表明,LTV在回归任务上也优于基线。此外,我们还研究了LTV在不同模型规模之间的可转移性;这一方面在任务向量研究中仍处于起步阶段。具体而言,我们实证表明,来自更大模型的任务向量可以提高较小模型的性能6.4
dot ext{,这为提取的任务表示提供了一种新的实用性。
cs.CL / 36 / 2605.20761
Findings of the Counter Turing Test: AI-Generated Text Detection
反图灵测试的发现:AI生成文本检测
Abstract
The rapid proliferation of AI-generated text has introduced significant challenges in maintaining the integrity of digital content. Advanced generative models such as GPT-4, Claude 3.5, and Llama can produce highly coherent and human-like text, making it increasingly difficult to differentiate between human-written and AI-generated content. While these models have transformative applications, their misuse has raised concerns about misinformation, biased narratives, and security threats. This paper provides a comprehensive analysis of state-of-the-art AI-generated text detection techniques and evaluates their effectiveness through the Counter Turing Test (CT2) shared tasks. Task A (Binary Classification) required participants to distinguish between human-written and AI-generated text, while Task B (Model Attribution) focused on identifying the specific language model responsible for generating a given text. The results demonstrated high performance in binary classification, with the top system achieving an F1 score of 1.0000, but significantly lower scores in model attribution, where the best system achieved 0.9531, highlighting the increased complexity of this task. The top-performing teams leveraged fine-tuned transformer models, ensemble learning, and hybrid detection approaches, with DeBERTa-based and BART-based methods demonstrating strong results. However, the lower scores in Task B underscore the challenges of distinguishing outputs from different LLMs, necessitating further research into adversarial robustness, feature extraction, and cross-domain generalization.
Chinese Translation
AI生成文本的快速传播给维护数字内容的完整性带来了重大挑战。先进的生成模型,如GPT-4、Claude 3.5和Llama,可以生成高度连贯且类似人类的文本,使得区分人类撰写的内容与AI生成的内容变得越来越困难。尽管这些模型具有变革性的应用,但其误用引发了关于虚假信息、偏见叙事和安全威胁的担忧。本文对最先进的AI生成文本检测技术进行了全面分析,并通过反图灵测试(Counter Turing Test, CT2)共享任务评估其有效性。任务A(二元分类)要求参与者区分人类撰写的文本和AI生成的文本,而任务B(模型归属)则专注于识别生成特定文本的语言模型。结果显示,在二元分类中表现出色,最佳系统的F1分数达到1.0000,但在模型归属中得分显著较低,最佳系统的得分为0.9531,突显了这一任务的复杂性。表现最好的团队利用了微调的变换器模型、集成学习和混合检测方法,其中基于DeBERTa和BART的方法表现强劲。然而,任务B的较低得分强调了区分不同大型语言模型(LLMs)输出的挑战,亟需进一步研究对抗鲁棒性、特征提取和跨领域泛化。
cs.CL / 37 / 2605.20767
The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study
干预的错觉:您的 LLM 模拟实验是一个观察性研究
Abstract
Large language models (LLMs) show potential as simulators of human behavior, offering a scalable way to study responses to interventions. However, because LLMs are trained largely on observational data, interventions in experiments with LLM-simulated synthetic users can induce unintended shifts in latent user attributes, causing user drift where the implicit simulated population differs across treatment conditions, potentially distorting effect estimates. We formalize the confounding or selection bias that can arise due to user drift and show how intervention-dependent shifts can inflate or attenuate observed differences in user responses under intervention. To diagnose confounding, we propose using negative control outcomes--attributes that should remain invariant under intervention--to identify distribution shifts across intervention conditions, providing evidence of user drift. To mitigate drift, we study adjusting the persona specification by eliciting additional confounders, finding that targeted, setting-relevant confounders can substantially reduce bias across survey-style and multi-turn agent evaluations.
Chinese Translation
大型语言模型(LLMs)作为人类行为的模拟器显示出潜力,为研究对干预的反应提供了一种可扩展的方法。然而,由于 LLMs 主要基于观察性数据进行训练,因此在使用 LLM 模拟的合成用户进行实验时,干预可能会引发潜在用户属性的意外变化,导致用户漂移,即隐含的模拟人群在不同处理条件下存在差异,这可能扭曲效应估计。我们形式化了由于用户漂移可能产生的混杂或选择偏倚,并展示了干预依赖性变化如何膨胀或减弱在干预下观察到的用户反应差异。为了诊断混杂,我们建议使用负控制结果——在干预下应保持不变的属性——来识别干预条件下的分布变化,从而提供用户漂移的证据。为了减轻漂移,我们研究了通过引导额外混杂因素来调整角色规范,发现有针对性的、与设置相关的混杂因素可以显著减少调查式和多轮代理评估中的偏倚。
cs.CL / 38 / 2605.20786
Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems
从零开始构建阿拉伯语自然语言处理:二十年的经验教训、失败与未解问题
Abstract
This paper reflects on twenty years of building NLP resources and research infrastructure for Arabic, a language spoken by hundreds of millions yet historically underserved relative to languages such as English or Chinese. The first decade focused on foundational linguistic infrastructure; the second shifted toward computational social science, social media analysis, and socially oriented applications. Rather than cataloguing outputs, the paper examines what the experience of building them revealed. Three counterintuitive lessons emerge: building datasets is as much a social process as a technical one; communities formed around shared tasks often matter more than the tasks themselves; and moving from language resources to computational social science exposes challenges that traditional NLP training does not address. We discuss three failures: a depression detection corpus that never reached clinical practice, a period of spreading across too many shared tasks without sufficient depth, and a long-standing assumption that Modern Standard Arabic infrastructure would transfer cleanly to dialectal tasks. These experiences suggest that the hardest problems in developing NLP for underserved communities are not linguistic but social, institutional, and epistemic, and require competencies the field rarely teaches.
Chinese Translation
本文回顾了二十年来为阿拉伯语构建自然语言处理(NLP)资源和研究基础设施的历程。阿拉伯语是一种有数亿人使用的语言,但相较于英语或汉语,历史上受到的关注较少。第一十年专注于基础语言学基础设施的建设;第二十年则转向计算社会科学、社交媒体分析和社会导向的应用。本文并非简单列举成果,而是探讨构建这些成果的经验揭示了什么。我们总结出三个反直觉的教训:构建数据集既是一个社会过程,也是一个技术过程;围绕共享任务形成的社区往往比任务本身更为重要;从语言资源转向计算社会科学暴露了传统NLP训练未能解决的挑战。我们讨论了三个失败案例:一个从未进入临床实践的抑郁症检测语料库、在没有足够深度的情况下扩展到过多共享任务的阶段,以及长期以来认为现代标准阿拉伯语基础设施可以无缝转移到方言任务的假设。这些经验表明,发展服务不足社区的NLP中最棘手的问题并非语言学上的,而是社会、制度和认识论上的问题,并需要该领域鲜有教授的能力。
cs.CL / 39 / 2605.20793
Assessing socio-economic climate impacts from text data
评估文本数据中的社会经济气候影响
Abstract
Recent advances in natural language processing (NLP) and large language models (LLMs) have enabled the systematic use of large-scale textual data from news, social media, and reports to create datasets with socio-economic impacts of climate hazards such as floods, droughts, storms, and multi-hazard events. As the field of text-as-data for impact assessment expands, so does its methodological complexity. Yet research remains fragmented, with no clear guidelines for defining what constitutes an impact, handling temporal and spatial biases, and selecting appropriate modeling and post-processing strategies. This lack of coherence limits transparency and comparability across studies. Here, we address this gap by synthesising common practices, describing key challenges specific to the use of text-as-data methods for analyzing socio-economic impact data, and proposing recommendations to address them. By providing guidance on best practices, we aim to support the construction of robust text-derived socio-economic impact datasets that can more accurately inform disaster risk management and attribution studies.
Chinese Translation
近年来,自然语言处理(NLP)和大型语言模型(LLMs)的进展使得系统性地利用来自新闻、社交媒体和报告的大规模文本数据成为可能,从而创建出与气候灾害(如洪水、干旱、风暴和多重灾害事件)相关的社会经济影响数据集。随着文本作为数据在影响评估领域的扩展,其方法论的复杂性也在增加。然而,研究仍然存在碎片化的问题,缺乏明确的指导方针来定义何为影响、处理时间和空间偏差,以及选择合适的建模和后处理策略。这种缺乏一致性限制了研究之间的透明度和可比性。在此,我们通过综合常见实践,描述使用文本作为数据方法分析社会经济影响数据的特定关键挑战,并提出解决这些问题的建议,从而填补这一空白。通过提供最佳实践的指导,我们旨在支持构建稳健的文本衍生社会经济影响数据集,以便更准确地为灾害风险管理和归因研究提供信息。
cs.CL / 40 / 2605.20809
Refining and Reusing Annotation Guidelines for LLM Annotation
优化和重用大型语言模型注释指南
Abstract
While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning optimized models, and (3) the viability of moderation under minimal supervision. Testing across biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows good potential in effectively refining guidelines, our analysis also reveals substantial room for improvement.
Chinese Translation
尽管大型语言模型(LLMs)在零样本注释任务中表现出色,但它们常常难以适应黄金标准基准的专业规范。我们提出了系统性重用和优化注释指南作为对齐机制,并引入了一个迭代审核框架,模拟注释项目的早期阶段。我们评估了三个假设:(1)指南整合的有效性,(2)优化推理模型的优势,以及(3)在最小监督下审核的可行性。在三个生物医学命名实体识别任务(NCBI Disease、BC5CDR、BioRED)中,使用三种LLM家族(GPT、Gemini、DeepSeek)进行测试,我们的结果实证确认了这三个假设。尽管迭代审核框架在有效优化指南方面显示出良好的潜力,但我们的分析也揭示了显著的改进空间。
cs.CL / 41 / 2605.20813
PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models
PulseCol:定期刷新列稀疏注意力以加速扩散语言模型
Abstract
Inference in diffusion large language models (dLLMs) is computationally expensive, as full self-attention must be repeatedly executed at each step of the denoising process without KV cache. Recent sparse attention methods for dLLMs mitigate this cost via block-sparse computation, which is applied only in later iterations when model performance is less sensitive to coarse-grained sparse approximation, but yields limited improvements in computational efficiency and acceleration. This motivates a finer-grained sparsification strategy that can be applied from earlier iterations and leverages reusable sparsity patterns, enabling further efficiency gains. In this work, we introduce PulseCol, a periodically refreshed column-sparse attention method for accelerating diffusion language models. PulseCol replaces coarse block-level sparsity with a finer-grained column-sparse structure, allowing important attention interactions to be retained more precisely while exposing greater sparsity. Built on this column-level formulation, PulseCol further identifies sparse patterns at the early denoising step and reuses them across subsequent iterations, refreshing them only at a small number of intermediate steps to track the evolution of sparse attention patterns during denoising. Experiments show that PulseCol achieves higher sparsity and greater practical speedup than prior sparse attention methods for dLLMs, while maintaining model quality. Enabled by optimized GPU kernels for column-sparse attention, PulseCol delivers up to 1.95$\times$ end-to-end speedup over FlashAttention across several context lengths.
Chinese Translation
在扩散大型语言模型(dLLMs)的推理过程中,由于在去噪过程的每一步都必须重复执行完整的自注意力计算而且没有 KV 缓存,因此计算成本非常高。近期针对 dLLMs 的稀疏注意力方法通过块稀疏计算来减轻这一成本,该方法仅在后期迭代中应用,因为此时模型性能对粗粒度稀疏近似的敏感性较低,但在计算效率和加速方面的改善有限。这促使我们提出一种更细粒度的稀疏化策略,该策略可以从早期迭代开始应用,并利用可重用的稀疏模式,从而实现进一步的效率提升。在本研究中,我们引入了 PulseCol,一种定期刷新的列稀疏注意力方法,用于加速扩散语言模型。PulseCol 用更细粒度的列稀疏结构替代了粗糙的块级稀疏性,从而更精确地保留重要的注意力交互,同时暴露出更大的稀疏性。基于这种列级公式,PulseCol 进一步识别在早期去噪步骤中的稀疏模式,并在后续迭代中重复使用这些模式,仅在少量中间步骤中刷新它们,以跟踪去噪过程中稀疏注意力模式的演变。实验表明,PulseCol 实现了比以往稀疏注意力方法更高的稀疏性和更大的实际加速,同时保持了模型质量。得益于针对列稀疏注意力的优化 GPU 核心,PulseCol 在多个上下文长度上实现了相较于 FlashAttention 高达 1.95 倍的端到端加速。
cs.CL / 42 / 2605.20815
GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval
消费者硬件上的图形检索增强生成(GraphRAG):医疗保健电子健康记录(EHR)模式检索的基准测试
Abstract
Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, including Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition behavior. We further show that GraphRAG exhibits a practical capacity threshold, where models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.
Chinese Translation
基于图形的检索增强生成(GraphRAG)扩展了检索增强生成,以支持对复杂语料库的结构化推理,但其在资源受限和隐私敏感的部署下的可靠性仍不明确。在医疗保健领域,电子健康记录(EHR)数据复杂且受到严格监管,依赖基于云的大型语言模型(LLMs)在成本、延迟和合规性方面带来了挑战。在本研究中,我们对使用本地部署的开源LLMs进行EHR模式检索的GraphRAG进行了系统评估。我们在真实的EHR模式文档上实现了Microsoft GraphRAG管道,并基准测试了四个模型,包括Llama 3.1(8B)、Mistral(7B)、Qwen 2.5(7B)和Phi-4-mini(3.8B),每个模型均通过Ollama在单个消费者GPU(8 GB VRAM)上部署。我们评估了索引效率、知识图谱构建、查询延迟、答案质量和幻觉现象在全局和本地检索模式下的表现。我们的结果揭示了显著差异:Llama 3.1生成了最丰富的知识图谱(1,172个实体),Qwen 2.5实现了最佳答案质量(3.3/5),Phi-4-mini由于结构化输出错误未能完成管道,而Mistral表现出退化重复行为。我们进一步表明,GraphRAG表现出实际的容量阈值,约7B参数以下的模型无法可靠地产生有效的结构化输出,无法完成管道。此外,索引和答案质量在模型之间是解耦的,本地检索在延迟和事实基础方面始终优于全局汇总,并减少了幻觉现象。这些发现表明,GraphRAG在消费者硬件上是可行的,同时强调了模型选择和检索设计在受监管环境中稳健部署的重要性。
cs.CL / 43 / 2605.20833
MemGym: a Long-Horizon Memory Environment for LLM Agents
MemGym:面向长时间记忆的 LLM 代理环境
Abstract
Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.
Chinese Translation
记忆是 LLM 代理在长时间任务中运作的核心能力。现有的记忆基准主要评估个性化信息在多轮对话场景中的保留,忽视了在延长代理执行过程中发生的动态记忆形成。因此,它们所产生的记忆系统在现实代理环境(如编码和网页导航)中的迁移效果较差。我们提出了 MemGym,一个统一现有代理环境和内部记忆基础管道的基准,背后有一个记忆推理接口。MemGym 涉及五个评估轨道,分为四个代理领域:工具使用对话(tau2-bench)、多轮深度研究搜索(MEMGYM-DR)、编码(SWE-Gym 和 MEMGYM-CODEQA)以及计算机使用(WebArena-Infinity)。MemGym 报告与记忆无关的得分,解耦记忆性能与推理、检索和工具使用能力,从而可以在没有这些混淆因素的情况下对记忆策略进行排名。我们的 MEMGYM-CODEQA 和 MEMGYM-DR 的合成管道具有可控长度,在每个阶段经过消融验证,并与下游场景紧密对齐。为了使编码环境的评估在学术上可行,我们训练了 MemRM,一个轻量级奖励模型(Qwen3-1.7B,使用 QLoRA 进行微调),它以快速标量的形式评分压缩质量,替代完整的 Docker 部署。
cs.CL / 44 / 2605.20876
Terminal-World: Scaling Terminal-Agent Environments via Agent Skills
终端世界:通过代理技能扩展终端代理环境
Abstract
Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.
Chinese Translation
终端代理通过在命令行环境中直接执行任务的能力扩展了大型语言模型,但其进展受到高质量训练数据稀缺的瓶颈。现有方法从部分来源(如人类定义的种子或GitHub仓库)引导,实例化一个组件,然后完成其余部分,导致任务局限于狭窄的种子分布、与任务语义不对齐的环境以及来自无指导探索的低效轨迹。为了解决这些局限性,我们引入了终端世界(Terminal-World),这是一个完全自动化的管道,使用代理技能作为核心合成原语,联合编码要完成的任务、何时应用(前提条件和环境状态)以及如何执行,从而使任务指令、环境和教师轨迹能够共同生成。为了进一步拓宽合成空间,终端世界将技能组合成技能团队和技能图,以实现多角色和跨领域的任务合成。利用该管道,我们构建了5,723个训练环境,并训练了终端世界系列模型(Terminal-World-8B/14B/32B),在6个基准测试中评估,终端世界系列模型始终优于终端代理基线。值得注意的是,在使用相同教师模型和仅1.2%的训练数据的情况下,终端世界-32B在Terminal-Bench 2.0上超越了Nemotron-Terminal-32B,获得了+4.5的Pass@1(31.5)并达到了43.8的Pass@3。
cs.CL / 45 / 2605.20912
Enhancing Scientific Discourse: Machine Translation for the Scientific Domain
增强科学话语:科学领域的机器翻译
Abstract
The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora for the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the domains of: Cancer Research, Energy Research, Neuroscience, and Transportation research. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process, the fine-tuning strategies employed, and we conclude with the evaluation results.
Chinese Translation
随着科学研究量的增加,有效的跨语言交流变得愈发重要。机器翻译(MT)为获取国际出版物提供了一种有前景的解决方案。然而,科学领域由于其专业词汇和复杂句法结构,面临着独特的挑战。本文介绍了为科学领域开发的一系列平行和单语语料库。这些语料库的目标语言对包括西班牙语-英语、法语-英语和葡萄牙语-英语。对于每对语言,我们创建了一个大型的通用科学语料库,以及四个专注于以下领域的小型语料库:癌症研究、能源研究、神经科学和交通研究。为了评估这些语料库的质量,我们利用它们对通用神经机器翻译(NMT)系统进行微调。我们提供了关于语料库创建过程、所采用的微调策略的详细信息,并以评估结果作为结尾。
cs.CL / 46 / 2605.20915
Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models
校准与决策:重访未学习语言模型中的可靠性悖论
Abstract
Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation. Calibration is commonly used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, as models may rely on spurious correlations while remaining well calibrated. We investigate this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and decision-rule reliability via attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models achieve low calibration error (ECE ~ 0.04) compared to pretrained models (ECE > 0.5), and models after unlearning retain similarly low calibration despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results demonstrate that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to the machine unlearning setting.
Chinese Translation
机器未学习旨在从模型中去除特定训练数据的影响,同时保持对剩余数据的可靠行为,因此可靠的预测和不确定性估计对于评估至关重要。校准通常被用作语言模型中可靠性的代理,但低校准误差并不一定意味着可靠的决策规则,因为模型可能依赖于虚假的相关性,同时保持良好的校准。我们使用TOFU基准上的多项选择问答评估协议,研究生成语言模型中的这一差距,通过校准指标(ECE、MCE、Brier)测量概率可靠性,并通过基于归因的捷径检测(使用积分梯度和局部互信息)测量决策规则的可靠性。我们发现,与预训练模型(ECE > 0.5)相比,微调模型实现了低校准误差(ECE ~ 0.04),而未学习后的模型在忘记分割上尽管准确性降低,但仍保持类似的低校准,同时归因分析显示对基于相关性的标记的依赖性增加。这些结果表明,在未学习后,良好的校准可以与基于捷径的决策规则共存,从而将可靠性悖论扩展到机器未学习的情境。
cs.CL / 47 / 2605.20916
Task-Routed Mixture-of-Experts with Cognitive Appraisal for Implicit Sentiment Analysis
基于认知评估的任务路由专家混合模型用于隐性情感分析
Abstract
Implicit sentiment analysis is challenging because sentiment toward an aspect is often inferred from events rather than expressed through explicit opinion words. Existing models typically learn from the final polarity label, which provides limited guidance for reasoning about sentiment from the context. Motivated by cognitive appraisal theory, we propose an appraisal-aware multi-task learning (MTL) framework for implicit sentiment analysis that provides polarity prediction with two complementary auxiliary tasks: implicit sentiment detection and cognitive rationale generation. However, training several objectives with different targets and sharing a single backbone across tasks in MTL limits flexibility and can lead to task interference. To reduce interference among these related but distinct objectives, we adopt task-level mixture-of-experts models in which all tasks share a common set of experts, and task identity controls the sparse combination of these experts. Our method builds on an encoder-decoder architecture and replaces a subset of encoder and decoder blocks with these sparse mixtures. We use a task-conditioned router to select sparse expert mixtures for each task, and a task-separated routing objective to encourage different tasks to learn distinct expert-selection patterns. Experimental results show that our model outperforms recently proposed approaches, with strong gains on the implicit sentiment subset. Our code is available at https://github.com/yaping166/TRMoE-ISA.
Chinese Translation
隐性情感分析具有挑战性,因为对某一方面的情感往往是通过事件推断出来的,而不是通过明确的意见词表达。现有模型通常仅从最终的极性标签中学习,这对从上下文推理情感提供的指导有限。受到认知评估理论的启发,我们提出了一种基于评估意识的多任务学习(MTL)框架,用于隐性情感分析,该框架通过两个互补的辅助任务提供极性预测:隐性情感检测和认知理由生成。然而,在MTL中,训练多个目标具有不同的目标并共享单一的主干网络限制了灵活性,并可能导致任务干扰。为了减少这些相关但不同目标之间的干扰,我们采用了任务级专家混合模型,其中所有任务共享一组共同的专家,任务身份控制这些专家的稀疏组合。我们的方法基于编码器-解码器架构,并用这些稀疏混合替换了一部分编码器和解码器模块。我们使用任务条件路由器为每个任务选择稀疏专家混合,并使用任务分离的路由目标来鼓励不同任务学习不同的专家选择模式。实验结果表明,我们的模型在最近提出的方法中表现优越,在隐性情感子集上取得了显著提升。我们的代码可在 https://github.com/yaping166/TRMoE-ISA 获取。
cs.CL / 48 / 2605.20920
Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition
使用发音音素识别评估语音发音合成
Abstract
Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment needs a better definition. Generally, ranking generative models is tricky due to subjectivity. However, articulatory synthesis has the additional difficulty of requiring specialized knowledge in vocal tract anatomy and acoustics. To address this problem, this paper proposes to evaluate speech articulation synthesis using phoneme recognition as a proxy. Our hypothesis is that phoneme recognition using articulatory features better captures nuances in phoneme production, such as correct places of articulation, which traditional metrics (e.g., point-wise distance metrics) do not. We train a neural network with acoustic and articulatory features extracted from a single-speaker RT-MRI dataset. Then, we compare the recognition performance when testing the model with different synthetic articulatory features. Our results show that our articulatory feature set is phonetically rich and helps exploring additional dimensions on speech articulation synthesis.
Chinese Translation
近年来,机器学习的进展和发音数据集的可用性使得声道合成能够基于音素序列进行,这是发音语音合成的主要任务。然而,质量评估需要更明确的定义。一般而言,由于主观性,排名生成模型是棘手的。然而,发音合成还面临着需要声道解剖学和声学专业知识的额外困难。为了解决这个问题,本文提出使用音素识别作为代理来评估语音发音合成。我们的假设是,使用发音特征进行的音素识别能够更好地捕捉音素产生中的细微差别,例如正确的发音部位,而传统的度量标准(例如逐点距离度量)无法做到这一点。我们训练了一个神经网络,该网络使用从单一说话者的实时磁共振成像(RT-MRI)数据集中提取的声学和发音特征。然后,我们比较了在使用不同合成发音特征测试模型时的识别性能。我们的结果表明,我们的发音特征集在语音发音合成中具有丰富的音素信息,并有助于探索额外的维度。
cs.CL / 49 / 2605.20924
Strategy-Induct: Task-Level Strategy Induction for Instruction Generation
Strategy-Induct:任务级策略诱导用于指令生成
Abstract
Designing effective task-level prompts is crucial for improving the performance of Large Language Models (LLMs). While prior work on instruction induction demonstrates that LLMs can infer better instructions with limited examples, existing approaches often rely on input-output pairs, where obtaining labeled answers can be difficult or costly. To address this limitation, we propose Strategy-Induct, a framework that derives task-level instructions solely from a small set of example questions without requiring labeled answers. Our approach first prompts the model to generate explicit reasoning strategies for each question, forming (strategy, question) pairs. These pairs are then used to induce a task instruction that guides reasoning. Experiments across multiple tasks and model scales demonstrate that Strategy-Induct outperforms state-of-the-art methods in question-only settings. Furthermore, we observe that jointly utilizing LLMs and Large Reasoning Models across task instruction generation and inference may lead to further performance improvements.
Chinese Translation
设计有效的任务级提示对于提高大型语言模型(LLMs)的性能至关重要。虽然先前关于指令诱导的研究表明,LLMs可以在有限示例的情况下推断出更好的指令,但现有方法通常依赖于输入-输出对,而获取标注答案可能困难或成本高昂。为了解决这一限制,我们提出了Strategy-Induct,一个仅从一小组示例问题中推导任务级指令的框架,而不需要标注答案。我们的方法首先提示模型为每个问题生成明确的推理策略,形成(策略,问题)对。这些对随后用于诱导指导推理的任务指令。在多个任务和模型规模的实验中,Strategy-Induct在仅有问题的设置中超越了最先进的方法。此外,我们观察到在任务指令生成和推理中联合利用LLMs和大型推理模型可能会带来进一步的性能提升。
cs.CL / 50 / 2605.20946
Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation
边说边想:一种用于实时语音生成的受控交错推理方法
Abstract
The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data where reasoning and speech are precisely aligned, and the length ratio are under controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT with refined data and reinforcement learning with two new rewards: a TA-Balance Reward to manage timing and thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show our approach achieves 13% better performance on mathmatical and logic benchmarks while generating instant response like a spoken-language instruct model which outputs fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.
Chinese Translation
边说边想范式旨在使人工智能的交流更具人性化。一个关键挑战是在进行深度推理的同时保持流畅的语音。我们的方法InterRS通过仅在自然语音生成过程中插入推理步骤来解决这一问题。这需要高质量的数据,其中推理与语音精确对齐,并且长度比例受到控制。我们引入了一种新颖的流程来生成这种无缝交错的音频数据。为了训练我们的模型,我们结合了交错的监督微调(interleaved SFT)与精细化数据,以及强化学习,采用了两个新的奖励机制:时间-思考平衡奖励(TA-Balance Reward)用于管理时机和思考-回答比例,以及语言质量奖励(Linguistic Quality Reward)用于优化表达。实验表明,我们的方法在数学和逻辑基准测试中表现提高了13%,同时生成的响应速度与快速输出链式推理(CoT)响应的口语指令模型相当。此外,我们的方法生成的答案比以往的方法更自然流畅。
cs.CL / 51 / 2605.20948
Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory
记忆嫁接:通过离线条件记忆扩展语言模型预训练
Abstract
Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.
Chinese Translation
扩展条件记忆为增加语言模型的容量提供了一个有前景的方法,但现有方法如 Engram 在预训练期间从头学习大型记忆表,使得记忆扩展成本高昂且有时效果不佳。我们提出了记忆嫁接(Memory Grafting),这是一种利用来自嫁接模型的冻结隐藏状态作为条件 n-gram 记忆的条件记忆扩展方法。针对频繁的局部 n-gram,我们离线运行嫁接模型,将最终标记的隐藏表示存储为记忆值,并让接收模型通过精确的最长匹配后缀查找来检索它们。检索到的记忆通过轻量级投影和门控进行适配,而基于哈希的 Engram 备用方案则为未匹配的上下文保留覆盖。由于嫁接模型仅在离线运行,并且精确查找相对于记忆库大小具有预期的 O(1) 复杂度,记忆嫁接以有限的训练和推理开销扩展了外部潜在容量。在匹配的接收架构和预训练预算下的实验表明,记忆嫁接在 MoE 和普通 Engram 基准上均有所改善。在 2.8B 规模设定中,它将 MoE 的平均基准分数从 51.95 和普通 Engram 的 52.43 提高到 53.86。在 0.92B 规模设定中,所有嫁接模型变体均优于基准,其中 Qwen3.5-35B-A3B 取得了最强的增益。这些结果表明,预训练模型可以作为外部潜在记忆的可重用构造器,为未来语言模型的扩展提供了一个实用的步骤,超越了仅依赖可训练参数的限制。
cs.CL / 52 / 2605.20960
JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media
JobArabi:来自社交媒体的阿拉伯语职位公告语料库及分析
Abstract
This paper introduces JobArabi, a large-scale corpus of Arabic job announcements collected from social media between January 2024 and October 2025. The dataset contains 20,528 public posts from X and captures more than two years of employment-related discourse across Arabic-speaking online communities. The corpus was compiled using a linguistically informed query framework covering 21 Arabic keyword families that reflect gendered, plural, formal, and dialectal expressions of recruitment language. The resulting dataset includes posts from institutional, commercial, and individual accounts and provides metadata such as timestamps, engagement indicators, and geolocation when available, enabling temporal and regional analysis of employment discourse. Quantitative analysis reveals several sociolinguistic patterns in online recruitment, including the persistence of gendered hiring language, regional variation in occupational demand, and the emotional framing of recruitment messages. These findings highlight the potential of Arabic social media as a resource for studying labor market communication and linguistic change. The JobArabi corpus, together with documentation and collection scripts, will be released to support research in Arabic NLP, computational social science, and digital labor studies.
Chinese Translation
本文介绍了JobArabi,这是一个大规模的阿拉伯语职位公告语料库,收集自2024年1月至2025年10月的社交媒体。该数据集包含来自X的20,528条公开帖子,捕捉了阿拉伯语在线社区中超过两年的与就业相关的讨论。该语料库采用了一个基于语言学的查询框架,涵盖了21个阿拉伯语关键词家族,反映了招聘语言中的性别、复数、正式和方言表达。最终的数据集包括来自机构、商业和个人账户的帖子,并提供了时间戳、互动指标和地理位置信息(如可用),使得对就业话语的时间和区域分析成为可能。定量分析揭示了在线招聘中的几个社会语言学模式,包括性别化招聘语言的持续性、职业需求的区域差异,以及招聘信息的情感框架。这些发现突显了阿拉伯社交媒体作为研究劳动市场沟通和语言变化的资源的潜力。JobArabi语料库及其文档和收集脚本将被发布,以支持阿拉伯自然语言处理、计算社会科学和数字劳动研究的研究。
cs.CL / 53 / 2605.20967
ArPoMeme: An Annotated Arabic Multimodal Dataset for Political Ideology and Polarization
ArPoMeme:一个注释的阿拉伯多模态数据集,用于政治意识形态和极化研究
Abstract
Memes have become a prominent medium of political communication in the Arab world, reflecting how humor, imagery, and text interact to express ideological and cultural positions. Despite the centrality of memes to online political discourse, there is a lack of systematically curated resources for analyzing their multimodal and ideological dimensions in Arabic. This paper presents ArPoMeme, a large-scale dataset of approximately 7,300 Arabic political memes categorized by ideological orientation, including Leftist, Islamist, Pan-Arabist, and Satirical perspectives. The dataset captures the diversity of Arabic meme ecosystems by grounding classification in the self-identification of public Facebook pages and groups that produce and disseminate these memes. To ensure both scale and accuracy, we designed a semi-automated data collection pipeline combining Playwright-based Facebook scraping with Google Drive synchronization, followed by text extraction using the Qwen2.5-VL-7B vision language model. The extracted text was manually verified and annotated for three polarization dimensions: Us vs. Them framing, Hostility toward out-groups, and Calls to action. Annotation was conducted through a custom Streamlit-based interface supporting distributed labeling, real-time tracking, and version control. The resulting dataset links visual content, textual messages, and ideological orientation, enabling fine-grained analysis of political antagonism, mobilization, and humor. Quantitative analysis of the annotated corpus reveals strong asymmetries in antagonistic framing across ideological groups, with Islamist and satirical memes exhibiting the highest levels of hostility and mobilization cues. The dataset and the annotation tool offers a reproducible and publicly available resource for studying Arabic political discourse, multimodal ideology detection, and polarization dynamics.
Chinese Translation
表情包已成为阿拉伯世界政治传播的重要媒介,反映了幽默、图像和文本如何相互作用以表达意识形态和文化立场。尽管表情包在在线政治话语中的中心地位,但在阿拉伯语中缺乏系统整理的资源来分析其多模态和意识形态维度。本文介绍了ArPoMeme,一个大规模的数据集,包含约7300个按意识形态取向分类的阿拉伯政治表情包,包括左翼、伊斯兰主义、泛阿拉伯主义和讽刺视角。该数据集通过基于公共Facebook页面和群组的自我识别来捕捉阿拉伯表情包生态系统的多样性,以此为基础进行分类。为了确保数据收集的规模和准确性,我们设计了一种半自动化的数据收集管道,结合了基于Playwright的Facebook抓取与Google Drive同步,随后使用Qwen2.5-VL-7B视觉语言模型进行文本提取。提取的文本经过人工验证和注释,涵盖三个极化维度:我们与他们的框架、对外群体的敌意以及行动号召。注释通过一个基于Streamlit的自定义界面进行,支持分布式标注、实时跟踪和版本控制。最终生成的数据集将视觉内容、文本信息和意识形态取向联系起来,使得对政治对立、动员和幽默的细致分析成为可能。对注释语料库的定量分析揭示了意识形态群体间对立框架的强烈不对称性,其中伊斯兰主义和讽刺表情包表现出最高水平的敌意和动员线索。该数据集和注释工具为研究阿拉伯政治话语、多模态意识形态检测和极化动态提供了可重复和公开可用的资源。
cs.CL / 54 / 2605.20994
Towards Context-Invariant Safety Alignment for Large Language Models
面向大型语言模型的上下文不变安全对齐
Abstract
Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.
Chinese Translation
基于偏好的后期训练将大型语言模型(LLMs)与人类意图对齐,但安全行为往往仍然脆弱。在标准提示中,模型可能拒绝有害请求,但当相同意图被包裹在对抗性措辞中时却会顺从。我们建议,稳健的安全性需要上下文不变的对齐,其中行为依赖于潜在意图而非表面形式。在对齐中强制不变性是困难的,因为并非所有训练信号都是同等可信的;对于某些提示变体,我们可以获得可验证的反馈(例如,多项选择),而对于开放式变体,我们通常依赖于嘈杂的、易于操控的奖励代理(例如,学习的评判者)。因此,标准的对称不变性正则化器可能通过降低对可靠变体的性能来减少跨上下文差异,而不是提高开放式的稳健性。为了解决这个问题,我们引入了锚不变正则化(Anchor Invariance Regularization, AIR),它将可验证的提示视为锚点,并使用停止梯度目标仅对开放式变体进行正则化,以达到锚点性能。AIR作为插件辅助损失实现,并通过异构提示分组与基于组的偏好优化(例如,GRPO)相结合。在安全性、道德推理和数学领域,AIR提高了上下文不变性,使得在分布内组的准确率提高了12.71%,在分布外的一致性提高了33.49%,使安全约束对对抗性框架具有稳健性。
cs.CL / 55 / 2605.20998
Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis
单次深度选择性阅读用于多方面情感分析
Abstract
Aspect-Term Sentiment Analysis (ATSA) in multi-aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re-encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single-pass inference framework that encodes each sentence once to construct a reusable, depth-ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re-encoding. This decouples shared sentence encoding from lightweight, aspect-conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end-to-end computation by up to 60% in multi-aspect settings (M >= 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at https://github.com/panzhzh/acl-dabs
Chinese Translation
在多方面句子中,方面-术语情感分析(Aspect-Term Sentiment Analysis, ATSA)面临效率与表现力之间的基本权衡。现有模型要么为每个方面重新编码句子,要么依赖于深度表示的静态使用,这导致了冗余计算和有限的适应性。我们认为,Transformer 的深度是一种昂贵且可查询的资源,并提出了 DABS(单次推理框架),该框架对每个句子进行一次编码,以构建可重用的、深度有序的底层表示。每个方面随后查询这一共享表示,以选择性地读取相关的标记和抽象层次,而无需重新编码。这将共享句子编码与轻量级的、方面条件的读取解耦。对四个 ATSA 基准的实验表明,DABS 在多方面设置(M >= 2)中实现了具有竞争力的性能,同时将端到端计算减少了多达 60%。进一步分析表明,自适应深度查询在语言复杂的情况下(如否定和对比)最为有益。代码已公开,地址为 https://github.com/panzhzh/acl-dabs
cs.CL / 56 / 2605.21027
Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs
超越文本到SQL:一个用于受控企业分析API的自主大型语言模型系统
Abstract
Enterprise analytics aims to make organizational data accessible for decision-making, yet non-technical users still face barriers when using traditional business intelligence tools or Text-to-SQL systems. While recent Text-to-SQL approaches based on Large Language Models (LLMs) promise natural language access to structured data, they fall short in enterprise settings where analytics pipelines rely on governed APIs rather than raw databases. In practice, these APIs encapsulate complex business logic to ensure consistency, auditability, and security. However, delegating mathematical or aggregation logic to an LLM introduces reliability and compliance risks. To this end, we present Analytic Agent, an LLM-based agentic system that translates natural language intents into secure interactions with enterprise analytics APIs. Evaluated on 90 real enterprise use cases constructed by domain experts, it reliably interprets user goals, validates permissions, executes governed queries, and generates compliant visualizations through multi-step reasoning and policy-aware orchestration.
Chinese Translation
企业分析旨在使组织数据可用于决策,然而非技术用户在使用传统商业智能工具或文本到SQL系统时仍面临障碍。尽管基于大型语言模型(LLMs)的最新文本到SQL方法承诺提供对结构化数据的自然语言访问,但在分析管道依赖于受控API而非原始数据库的企业环境中,它们的表现不尽如人意。在实践中,这些API封装了复杂的业务逻辑,以确保一致性、可审计性和安全性。然而,将数学或聚合逻辑委托给LLM会引入可靠性和合规性风险。为此,我们提出了分析代理(Analytic Agent),这是一个基于LLM的自主系统,能够将自然语言意图转化为与企业分析API的安全交互。通过对90个由领域专家构建的真实企业用例进行评估,它可靠地解释用户目标,验证权限,执行受控查询,并通过多步骤推理和政策意识的编排生成合规的可视化结果。
cs.CL / 57 / 2605.21029
Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings
基于招聘信息从零开始构建定制化人工智能技能与任务分类法
Abstract
Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing AI skills in the workplace, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose TaxonomyBuilder as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that less data can provide more clarity: filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.
Chinese Translation
利用大语言模型(LLMs)进行自动化分类法构建为全面而高效地映射潜在复杂领域提供了明确的机会。然而,在面对快速增长的大量语料时,如何最佳利用这些数据以实现最佳的分类法构建变得不明确。以系统化工作场所中的人工智能技能为例,我们使用两个大规模招聘信息语料库来研究分类法构建中数据点的包含(或排除)的关键设计决策。我们提出了TaxonomyBuilder作为我们系统研究的蓝图,通过它评估定制化、数据驱动和层次化分类法的各种配置。我们证明,较少的数据可以提供更多的清晰度:对TaxonomyBuilder输入进行过滤比向聚类和增强型层次分类法标记工具提供未过滤的输入能更好地覆盖特定领域。
cs.CL / 58 / 2605.21049
Cross-lingual robustness of LLM-brain alignment and its computational roots
大语言模型(LLM)与大脑对齐的跨语言鲁棒性及其计算根源
Abstract
Large language models (LLMs) reliably predict neural activity during language comprehension and transformer depth has been interpreted as mirroring hierarchical cortical organization. However, it remains unclear whether such alignment extends to subcortical regions, overlaps spatially across languages, and what the computational roots of such alignment are. Here, we used a multilingual, whole-brain encoding framework to examine brain-LLM alignment across three typologically distinct languages: Mandarin, English, and French during naturalistic story listening. Our results show that across languages, transformer-based models predicted activity in a distributed landscape spanning widely distributed cortical functional networks like limbic, ventral attention, default mode network, and subcortical structures. Spatial alignment patterns showed substantial cross-linguistic overlap and remained largely stable across model layers, with limited layer progression consistent with functional cortical hierarchies. Contrary to previous evidence, contextual embeddings did not outperform static embeddings. To test candidate computational explanations, we examined whether layer-wise brain scores reflect surprisal and intrinsic dimensionality, and thereby predictive processing and information compression. Neither of these two computational metrics mirrored neural alignment profiles. Our findings suggest that brain-LLM alignment is spatially robust and cross-linguistically stable but not explainable from predictive uncertainty or representational geometry. Rather than directly reflecting shared hierarchical computation, neural predictivity may primarily arise from distributed lexical-semantic correspondences that generalize across languages.
Chinese Translation
大语言模型(LLMs)在语言理解过程中能够可靠地预测神经活动,且变换器(transformer)的深度被解释为反映了层次化的皮层组织。然而,目前尚不清楚这种对齐是否延伸至皮层下区域,是否在不同语言之间存在空间重叠,以及这种对齐的计算根源是什么。在本研究中,我们使用多语言全脑编码框架,考察了在自然故事聆听过程中,普通话、英语和法语这三种类型学上不同的语言之间的大脑与LLM的对齐。我们的结果表明,在不同语言中,基于变换器的模型预测了分布广泛的皮层功能网络(如边缘系统、腹侧注意网络、默认模式网络和皮层下结构)中的活动。空间对齐模式显示出显著的跨语言重叠,并且在模型层次之间保持了相对稳定,层次进展有限,与功能性皮层层次结构一致。与之前的证据相反,上下文嵌入(contextual embeddings)并未优于静态嵌入(static embeddings)。为了检验候选的计算解释,我们考察了层级大脑评分是否反映了惊讶度(surprisal)和内在维度(intrinsic dimensionality),从而与预测处理和信息压缩相关。这两种计算指标均未能反映神经对齐特征。我们的研究结果表明,大脑与LLM的对齐在空间上是鲁棒的,并且在跨语言上是稳定的,但无法通过预测不确定性或表征几何来解释。神经预测性可能主要源于跨语言的分布式词汇-语义对应关系,而非直接反映共享的层次计算。
cs.CL / 59 / 2605.21063
APM: Evaluating Style Personalization in LLMs with Arbitrary Preference Mappings
APM:使用任意偏好映射评估大型语言模型中的风格个性化
Abstract
Typical LLM responses tend to follow a default style, even though users often have distinct preferences regarding tone, verbosity, and formality that they do not explicitly state in their prompts. Evaluating whether personalization methods can adapt to these implicit preferences is challenging, since users typically provide prompts rather than reference responses, style preferences are not factually verifiable, and reference-free LLM judges may conflate personalization with general response quality. To address these challenges, we introduce the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes (e.g. enthusiastic) from response principles (e.g. persuasive) via a hidden, randomized mapping $\mathbf{C}$ that maps user attributes to preferences about response traits. Because $\mathbf{C}$ carries no semantic content and is resampled across runs, models cannot exploit stereotypical associations and must infer preferences from conversation history. Using this unbiased evaluation methodology, we adapt retrieval-augmented, prompt-optimization, and routing personalization methods and evaluate them on Llama-3.1-8B and Qwen-3.5-27B. Our results show that routing is the most reliable approach, while RAG only improves with the stronger base LLM, and soft prompt optimization fails to improve significantly over a non-personalized baseline. Our extensive evaluation reveals that in this realistic setting, personalization remains challenging, but our adapted methods show promise.
Chinese Translation
典型的大型语言模型(LLM)响应往往遵循默认风格,尽管用户通常对语气、冗长程度和正式性等方面有不同的偏好,但他们在提示中并未明确表达。评估个性化方法是否能够适应这些隐含偏好是具有挑战性的,因为用户通常提供提示而非参考响应,风格偏好也无法进行事实验证,而无参考的 LLM 评估者可能会将个性化与一般响应质量混淆。为了解决这些挑战,我们引入了任意偏好映射(APM)基准,它通过一个隐藏的随机映射 $ extbf{C}$ 将用户属性(例如,热情)与响应原则(例如,有说服力)解耦,该映射将用户属性映射到对响应特征的偏好。由于 $ extbf{C}$ 不携带语义内容,并且在不同运行中重新抽样,模型无法利用刻板印象的关联,必须从对话历史中推断偏好。使用这种无偏的评估方法,我们对检索增强、提示优化和路由个性化方法进行了调整,并在 Llama-3.1-8B 和 Qwen-3.5-27B 上进行了评估。我们的结果表明,路由是最可靠的方法,而 RAG 仅在更强的基础 LLM 上有所改善,软提示优化相较于非个性化基线未能显著提升。我们的广泛评估揭示,在这种现实环境中,个性化仍然具有挑战性,但我们调整的方法显示出良好的前景。
cs.CL / 60 / 2605.21071
Fine-grained Claim-level RAG Benchmark for Law
法律领域的细粒度索赔级RAG基准
Abstract
The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stake domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite the need, existing evaluation frameworks for legal RAG systems lack the granularity required to provide detailed analysis of retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.
Chinese Translation
大型语言模型(LLMs)的快速进展正在将语义搜索转向问答范式,用户提出问题,LLMs生成回答。在法律等高风险领域,检索增强生成(RAG)通常用于减轻生成回答中的幻觉问题。然而,先前的研究表明,无论是通用的还是法律特定的RAG系统,仍然以不同的比例产生幻觉,因此细粒度评估显得尤为重要。尽管有这种需求,现有的法律RAG系统评估框架缺乏所需的细粒度,无法分别提供检索和生成性能的详细分析。此外,目前的基准主要是英语为主,并集中于法律专家的查询,忽视了非专家的需求。我们引入了ClaimRAG-LAW,这是一个全面的法律RAG数据集,支持法语和英语,面向专家和非专家,并包含反映现实场景的多样化问题类型。我们进一步应用了一个细粒度评估框架,对最先进的法律RAG系统进行了评估,揭示了法律领域在检索、生成和索赔级分析方面的局限性。
cs.CL / 61 / 2605.21076
GradeLegal: Automated Grading for German Legal Cases
GradeLegal:德国法律案件的自动评分
Abstract
Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.
Chinese Translation
德国法律考试解决方案的评分面临着日益增长的工作量和合格评分员的短缺,导致反馈延迟并形成瓶颈。同时,这是一项高风险的专家任务,因为国家考试成绩在德国对职业前景有着重要影响。尽管这一实际问题具有重要性,但文献中缺乏关于有效法律考试评分方法的系统研究。为了解决这一空白,我们研究了大型语言模型(LLMs)是否能够支持德国刑法和行政法领域的法律案例解决方案的自动评分,从而实现可扩展的反馈和学生自我测试。我们对27个专有和开源的LLMs进行了系统评估,基准测试了逐步添加任务相关信息的提示策略,例如示例解决方案和评分标准。使用二次加权kappa(QWK)指标,当提供示例解决方案和评分标准时,面向推理的LLMs在行政法领域的评分接近专家评分(最高可达0.91),而在刑法领域为0.60,表明刑法评分任务更具挑战性。除了单模型评分外,集成方法在其最佳成员基础上提高了一致性,最多可提高0.15,并且可以为更强的闭源单模型提供替代方案。此外,我们的研究结果表明,有效的提示设计和模型选择对于可靠的基于LLM的法律考试评分是必要的。
cs.CL / 62 / 2605.21086
LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control
LoCar:通过细粒度社会语言学控制实现的车辆内助手的定位感知评估
Abstract
While Large Language Models (LLMs) are increasingly integrated into in-vehicle conversational systems, identifying the optimal model remains challenging due to the lack of domain-specific evaluation standards tailored to real-world deployment requirements. In this paper, we propose a novel evaluation framework for in-vehicle assistants, with a particular focus on Korean-language localization. Our empirical analysis reveals notable patterns in model behavior. First, fine-grained Korean honorific control remains unstable in current LLMs, indicating that precise speech-level realization must be explicitly evaluated in localization settings. Second, models exhibit weaker performance in strategic conversational metrics like clarification and proactivity. Our analysis suggests this stems from the inherent subjective complexity of these tasks, where our framework adopts a conservative evaluation stance to prioritize reliability. Together, our findings underscore that automotive AI must move beyond general competence toward precise linguistic tailoring and reliable, safety-oriented interaction management.
Chinese Translation
随着大型语言模型(LLMs)越来越多地被集成到车辆内对话系统中,识别最佳模型仍然具有挑战性,因为缺乏针对现实世界部署需求量身定制的领域特定评估标准。在本文中,我们提出了一种针对车辆内助手的新评估框架,特别关注韩语本地化。我们的实证分析揭示了模型行为中的显著模式。首先,当前LLMs中的细粒度韩语敬语控制仍然不稳定,这表明在本地化环境中必须明确评估精确的语言层次实现。其次,模型在澄清和主动性等战略性对话指标上的表现较弱。我们的分析表明,这源于这些任务固有的主观复杂性,我们的框架采取保守的评估立场,以优先考虑可靠性。总体而言,我们的研究结果强调,汽车人工智能必须超越一般能力,朝着精确的语言调整和可靠的、安全导向的互动管理迈进。
cs.CL / 63 / 2605.21097
WCXB: A Multi-Type Web Content Extraction Benchmark
WCXB:一种多类型网页内容提取基准
Abstract
Web content extraction - isolating a page's main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review. We evaluate 13 extraction systems - 11 heuristic and 2 neural - and find that while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84), revealing blind spots invisible to existing article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML source files, ground truth annotations, page type labels, and baseline results.
Chinese Translation
网页内容提取——将页面的主要内容与周围的模板内容隔离——是搜索索引、检索增强生成、自然语言处理数据集构建和大型语言模型训练的前提。然而,该领域的进展受到现有评估基准的限制,这些基准通常较小(100-800页)、仅限于新闻文章,或基于十多年前的网页。我们介绍了网页内容提取基准(WCXB),这是一个包含2,008个网页的数据集,来自1,613个域,涵盖七种结构上不同的页面类型:文章、论坛、产品、集合、列表、文档和服务页面。该数据集包括1,497页的开发集和511页的保留测试集,具有匹配的页面类型分布。真实标签注释通过五个阶段的流程生成:LLM辅助草拟、自动验证、四轮前沿模型审查、片段和质量验证脚本,以及人工审查。我们评估了13个提取系统——11个启发式和2个神经网络系统——发现尽管顶尖系统在文章上的表现趋于一致(F1 = 0.93),但在结构化页面类型上的表现却显著分化(F1 = 0.41-0.84),揭示了现有仅限于文章的基准所无法察觉的盲点。该数据集以CC-BY-4.0协议发布,附带HTML源文件、真实标签注释、页面类型标签和基线结果。
cs.CL / 64 / 2605.21102
ACL-Verbatim: hallucination-free question answering for research
ACL-Verbatim:无幻觉的研究问答
Abstract
Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
Chinese Translation
学术研究人员需要高效且可靠的方法,从可信来源收集高质量信息,但现代的AI辅助研究工具仍然受到大型语言模型(LLMs)产生事实不准确或无意义输出的倾向的困扰,这种现象通常被称为幻觉。我们将提取式问答系统VerbatimRAG应用于ACL文集中的研究论文,直接将用户查询映射到检索文档中的逐字文本片段。我们为将用户查询映射到研究论文中相关文本片段的任务贡献了一个新颖的真实数据集,并利用该数据集训练和评估多种提取模型。人类注释由自然语言处理(NLP)研究人员执行,基于使用基于ScIRGen方法论的自定义管道生成的合成用户查询,并与通过VerbatimRAG检索的研究论文片段配对。在这一基准测试中,基于我们管道的银标注训练的150M参数ModernBERT标记分类器在词级F1得分上达到最佳(53.6),领先于评估中最强的LLM提取器(48.7)。
cs.CL / 65 / 2605.21135
Smarter edits? Post-editing with error highlights and translation suggestions
更智能的编辑?带有错误高亮和翻译建议的后期编辑
Abstract
As MT quality increases, interest in enhanced post-editing features such as QE-derived error highlights is growing, yet evidence for their usefulness remains limited. In this work, we explore the usefulness of LLM-derived error highlights and correction suggestions based on automatic post-editing (APE). We conduct a study where professional translators (En-Nl) post-edit translations using APE error highlights and correction suggestions and compare productivity, quality and user experience to regular PE and PE with QE-derived highlights. While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience.
Chinese Translation
随着机器翻译(MT)质量的提高,对增强后期编辑功能的兴趣也在增加,例如基于质量评估(QE)衍生的错误高亮,但其有效性的证据仍然有限。在本研究中,我们探讨了基于自动后期编辑(APE)生成的错误高亮和修正建议的实用性。我们进行了一项研究,专业翻译人员(英语-荷兰语)使用APE错误高亮和修正建议进行后期编辑,并将其生产力、质量和用户体验与常规后期编辑及带有QE衍生高亮的后期编辑进行了比较。尽管没有任何条件在生产力或质量上优于常规后期编辑,但APE高亮的接受度优于QE衍生高亮,而修正建议改善了整体用户体验。
cs.CL / 66 / 2605.21154
Automated ICD Classification of Psychiatric Diagnoses: From Classical NLP to Large Language Models
精神诊断的自动化ICD分类:从经典自然语言处理到大型语言模型
Abstract
Mental health has become a global priority, leading to a massive administrative burden in the coding of clinical diagnoses. This study proposes the automation of psychiatric diagnostic analysis by mapping free-text descriptions to the International Classification of Diseases (ICD) using Natural Language Processing (NLP) and Machine Learning (ML) techniques. Utilizing a specialized dataset of 145,513 Spanish psychiatric descriptions, various text representation paradigms were evaluated, ranging from classical frequency-based models (BoW, TF-IDF) to state-of-the-art Large Language Models (LLMs) such as e5\_large, BioLORD, and Llama-3-8B. Results indicate that transformer-based embeddings consistently outperform traditional methods by capturing implicit semantic cues and nuanced medical terminology. The e5\_large model, through end-to-end fine-tuning, achieved the highest performance with a $F1_{micro}$ score of 0.866. This research demonstrates that adapting LLMs to specific clinical nomenclature is essential for overcoming the challenges of ``long-tail'' label distributions and the inherent ambiguity of psychiatric discourse.
Chinese Translation
心理健康已成为全球优先事项,这导致临床诊断编码的巨大行政负担。本研究提出通过将自由文本描述映射到国际疾病分类(ICD),利用自然语言处理(NLP)和机器学习(ML)技术实现精神诊断分析的自动化。利用145,513个西班牙语精神病学描述的专门数据集,评估了多种文本表示范式,从经典的基于频率的模型(如BoW、TF-IDF)到最先进的大型语言模型(如e5_large、BioLORD和Llama-3-8B)。结果表明,基于变换器的嵌入在捕捉隐含语义线索和细微医学术语方面始终优于传统方法。通过端到端的微调,e5_large模型以0.866的$F1_{micro}$得分达到了最高性能。本研究表明,将大型语言模型适应特定临床术语对于克服“长尾”标签分布的挑战和精神话语固有的模糊性至关重要。
cs.CL / 67 / 2605.21178
Metaphors in Literary Post-Editing: Opening Pandora's Box?
文学后编辑中的隐喻:打开潘多拉的盒子?
Abstract
This paper investigates how post-editors of literary texts react and respond to the way metaphors have been translated by Neu ral Machine Translation (NMT) and Large Language Models (LLMs). The results show that one in three metaphors in the output were changed by the post-editors, demonstrating that the translation of fig urative language is indeed problematic in literary MT (LitMT). The responses indi cate that the post-editors were aware of overly literal translations, though mostly for multiword expressions. Moreover, at times they found it difficult to determine whether solutions were acceptable. They rated the overall quality of the MT out put as quite poor and stated that the post editing was more work and more effort than it would have been translating from scratch. This supports previous studies ar guing that post-editing constrains transla tors in their creativity and diminishes their sense of text ownership.
Chinese Translation
本文探讨了文学文本的后编辑者如何对神经机器翻译(Neural Machine Translation, NMT)和大型语言模型(Large Language Models, LLMs)翻译的隐喻作出反应和回应。研究结果显示,输出中有三分之一的隐喻被后编辑者修改,表明在文学机器翻译(Literary MT, LitMT)中,隐喻语言的翻译确实存在问题。反馈表明,后编辑者意识到过于字面化的翻译,尽管主要是针对多词表达。此外,他们有时发现很难判断解决方案是否可接受。他们对机器翻译输出的整体质量评价为相当差,并表示后编辑的工作量和付出比从头翻译要多。这支持了之前的研究,认为后编辑限制了翻译者的创造力,并削弱了他们对文本的拥有感。
cs.CL / 68 / 2605.21182
Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding
Manga109-v2026:重新审视Manga109注释以促进现代漫画理解
Abstract
Manga is a culturally distinctive multimodal medium and one of the most influential forms of Japanese popular culture. As AI systems increasingly target manga understanding, OCR, and translation, Manga109 has become a foundational dataset for manga-related AI research. However, the current Manga109 dataset contains transcription errors and coarse annotations, which do not align well with modern OCR and multimodal manga understanding tasks. In this work, we revisit the dialogue text annotations of Manga109 and identify five categories of annotation issues, including transcription errors, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons. To address these issues, we combine OCR-based issue detection and manual revision to construct Manga109-v2026, revising approximately 29,000 dialogue annotations. Our revisions better align Manga109 with modern OCR and multimodal manga understanding systems while preserving expressive structures characteristic of manga.
Chinese Translation
漫画是一种具有文化独特性的多模态媒介,也是日本流行文化中最具影响力的形式之一。随着人工智能系统越来越多地针对漫画理解、光学字符识别(OCR)和翻译,Manga109已成为与漫画相关的人工智能研究的基础数据集。然而,目前的Manga109数据集存在转录错误和粗略注释,这与现代OCR和多模态漫画理解任务不够契合。在本研究中,我们重新审视了Manga109的对话文本注释,并识别出五类注释问题,包括转录错误、缺失的文本区域、重叠的对话和拟声词,以及分段不足的对话气泡。为了解决这些问题,我们结合基于OCR的问题检测和人工修订,构建了Manga109-v2026,修订了大约29,000条对话注释。我们的修订更好地将Manga109与现代OCR和多模态漫画理解系统对接,同时保留了漫画特有的表现结构。
cs.CL / 69 / 2605.21227
Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models
大型语言模型是否了解卢森堡语的借用?探讨低资源多语言模型中的词汇新造
Abstract
Large language models (LLMs) are increasingly used for writing assistance in small contact languages, yet it is unclear whether they respect community norms around lexical borrowing and neology. We introduce LexNeo-Bench, a 3{,}050-instance token-level benchmark derived from LuxBorrow, a large-scale Luxembourgish news corpus, where target tokens are labelled as native or as French, German, or English borrowings. Using this benchmark, we probe three multilingual LLMs across 34 prompt settings on two tasks: borrowing type classification and a binary lexical-innovation proxy (borrowing versus native). Without external context, models perform only slightly above chance on borrowing classification, so we construct a linguistic knowledge graph that encodes donor language, morphological patterns, and lexical analogues, and inject instance-specific subgraphs into the prompt. Knowledge-graph prompts raise borrowing classification accuracy from 25 -- 35\% up to 71 -- 81\% and largely close the gap between small and large models, while leaving neology detection difficult and sensitive to few-shot design. Our results show that lexicon-aware prompting is highly beneficial for robust borrowing judgments in low-resource contact languages and that lexical resources can serve as structured context for LLM evaluation. This study was carried out within the ENEOLI COST Action and examines borrowing as a form of lexical innovation in multilingual Luxembourgish data.
Chinese Translation
大型语言模型(LLMs)越来越多地用于小型接触语言的写作辅助,但尚不清楚它们是否尊重关于词汇借用和新造词的社区规范。我们引入了LexNeo-Bench,这是一个由LuxBorrow(一个大规模的卢森堡语新闻语料库)衍生的3,050实例的标记级基准,其中目标标记被标记为本土词或法语、德语或英语借用词。利用该基准,我们在34个提示设置下对三种多语言LLM进行了两项任务的探测:借用类型分类和二元词汇创新代理(借用与本土)。在没有外部上下文的情况下,模型在借用分类上的表现仅略高于随机水平,因此我们构建了一个语言知识图谱,编码了捐赠语言、形态模式和词汇类比,并将实例特定的子图注入到提示中。知识图谱提示将借用分类的准确率从25% - 35%提高到71% - 81%,并在很大程度上缩小了小模型与大模型之间的差距,同时使新造词检测变得困难且对少量样本设计敏感。我们的结果表明,词汇感知提示对低资源接触语言中的稳健借用判断非常有益,词汇资源可以作为LLM评估的结构化上下文。本研究是在ENEOLI COST行动框架内进行的,探讨了借用作为多语言卢森堡语数据中的一种词汇创新形式。
cs.CL / 70 / 2605.21235
LamPO: A Lambda Style Policy Optimization for Reasoning Language Models
LamPO:一种用于推理语言模型的Lambda风格策略优化
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose \textbf{LamPO}, a \textbf{Lambda-Style Policy Optimization} method that replaces scalar group advantages with a \emph{Pairwise Decomposed Advantage}. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining the critic-free and clipped-update structure of PPO-style optimization. When reference solutions are available, we further add a lightweight ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show that LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.
Chinese Translation
带有可验证奖励的强化学习(RLVR)已成为提高推理语言模型在数学、编程和科学问答等任务上的有效范式。然而,广泛使用的群体相对目标,如GRPO,通过标量统计来总结每个采样群体,因此丢弃了候选响应之间的细粒度关系信息。这在稀疏结果奖励下削弱了信用分配,特别是当多个生成的解决方案在推理质量上仅有细微差别时。我们提出了 extbf{LamPO},一种 extbf{Lambda风格策略优化}方法,它用 extit{成对分解优势}替代了标量群体优势。LamPO在每个响应组内聚合成对奖励差距,并通过从序列对数概率差异计算的置信度感知权重来调节每个比较,同时保留了PPO风格优化的无评论者和剪切更新结构。当参考解决方案可用时,我们进一步添加了一种轻量级的基于ROUGE-L的密集辅助奖励,以减少奖励稀疏性。在AIME24、AIME25、MATH-500和GPQA-Diamond的数据集上,使用Qwen3-1.7B、Qwen3-4B和Phi-4-mini进行的实验表明,LamPO在训练动态更稳定和样本效率更高的情况下,始终优于GRPO和最近的RLVR变体。
cs.CL / 71 / 2605.21256
Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification
西班牙临床笔记中的可靠自动分诊:一种风险意识的混合框架用于HIV怀疑识别
Abstract
Standard clinical Natural Language Processing (NLP) benchmarks often yield inflated metrics by forcing deterministic classification on ambiguous instances, thereby obscuring the clinical risks of overconfident predictions. To bridge this gap, we propose a risk-aware hybrid selective classification framework, evaluated on early Human Immunodeficiency Virus suspicion identification in Spanish clinical notes. Our dual-verification approach explicitly decouples aleatoric uncertainty through Mondrian conformal prediction and epistemic uncertainty using a Multi-Centroid Mahalanobis Distance veto. Empirical evaluations reveal that standard uncertainty metrics and baseline classifiers are structurally insufficient for safe medical triage, suffering severe coverage collapse when forced to operate under strict reliability constraints. In contrast, by demanding that clinical narratives pass both probabilistic and geometric safeguards, the proposed framework successfully isolates a highly trustworthy operational domain.
Chinese Translation
标准临床自然语言处理(NLP)基准通常通过对模糊实例强制确定性分类而产生虚高的指标,从而掩盖了过度自信预测的临床风险。为了解决这一问题,我们提出了一种风险意识的混合选择性分类框架,并在西班牙临床笔记中的早期人类免疫缺陷病毒(HIV)怀疑识别上进行了评估。我们的双重验证方法通过Mondrian保形预测明确解耦了随机不确定性,并使用多中心马哈拉诺比斯距离否决法处理知识性不确定性。实证评估表明,标准不确定性指标和基线分类器在安全医疗分诊中结构上不足,在被迫在严格的可靠性约束下操作时遭遇严重的覆盖崩溃。相比之下,通过要求临床叙述同时通过概率和几何安全保障,所提出的框架成功地隔离了一个高度可信的操作领域。
cs.CL / 72 / 2605.21299
Tracing the ongoing emergence of human-like reasoning in Large Language Models
追踪大型语言模型中类人推理的持续出现
Abstract
Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas If you are hungry, there is pizza in the oven implies that pizza is available regardless of the hearers hunger. Large Language Models - LLMs - show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twentyfive LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but they ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.
Chinese Translation
人类轻松超越字面意义:例如,“如果你修剪草坪,我会给你五十美元”通常被理解为说话者仅在草坪被修剪的情况下才会付款,而“如果你饿了,烤箱里有比萨饼”则暗示比萨饼是可用的,无论听者是否饿。大型语言模型(Large Language Models,LLMs)在许多任务上表现出类人能力,但它们是否以类人方式进行推理仍不清楚。为了解决这个问题,我们进行了一个人口匹配实验,评估了二十五个LLM在四种语言中如何计算条件推理,并与每种语言中相同数量的人类进行比较。我们发现,人类通过跨语言的语用推理丰富了逻辑推理。模型的行为则更具变异性。一些LLM完美遵循条件的真值表,但忽略了语用推理,而另一些则偏离真值表,在各个方面坚持单一解释,从而反映出准确的基于规则的处理,但并非类人推理。总体而言,LLM是准确的语义运算符,但未能捕捉到人类推理特有的语用丰富性。关键是,LLM的准确性既不受开放与封闭状态、训练方向或架构类型的预测或提升,这表明语用推理仍然是人工系统认知工具箱中一个新兴的能力。
cs.CL / 73 / 2605.21318
TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization
TextReg:通过正则化文本空间优化缓解提示分布过拟合
Abstract
Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.
Chinese Translation
大型语言模型(LLMs)对用于指定任务目标和行为约束的提示非常敏感。许多近期的提示优化方法通过迭代重写提示并利用LLM生成的反馈,但结果提示往往变得更长,积累了狭窄的样本特定规则,并且在训练分布之外的泛化能力较差。我们将这种失败模式称为提示分布过拟合,并认为它反映了离散文本空间优化中缺乏表示控制。我们通过表示效率这一观点进行形式化,表示效率是一种双因素度量,将提示低效性分解为容量成本和范围狭窄,归因于优化过程中它们的耦合增长。我们提出了TextReg,这是一种正则化框架,通过正则化文本梯度实现软惩罚目标,结合了双证据梯度净化、语义编辑正则化和正则化引导的提示更新。在多个推理基准测试中,TextReg显著提高了分布外(OOD)泛化能力,相较于TextGrad的准确率提升高达+11.8%,相较于REVOLVE的提升高达+16.5%。
cs.CL / 74 / 2605.21333
SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence
SymbolicLight V1:具有高激活稀疏性和亚十亿规模预训练证据的脉冲门控双路径语言建模
Abstract
Natively trained spiking language models struggle to combine Transformer-like language quality, stable multi-domain pre-training, and high activation sparsity. We present SymbolicLight V1, a spike-gated dual-path language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream. Its Dual-Path SparseTCAM module replaces dense self-attention with an exponential-decay aggregation path for long-range memory and a spike-gated local attention path for short-range precision, complemented by a dynamic context-conditioned decoding head and a bilingual tokenizer. A 194M-parameter SymbolicLight V1 model trained from scratch on a 3B-token Chinese-English corpus reaches held-out validation PPL 8.88-8.93 across four independent runs at >89% per-element activation sparsity. It trails GPT-2 201M by 7.7% in PPL while surpassing GPT-2 124M under the reported comparison. Component ablations at matched 0.5B-token training budgets show that the spike-gated local attention path is the largest contributor, and that replacing LIF dynamics with a deterministic top-k mask at matched sparsity causes a larger degradation, indicating that temporal integration rather than sparsity alone drives performance. We also report a 0.8B-parameter scale-up run trained on 48.8B tokens as evidence of optimization and sparsity preservation, not as a primary quality comparison. Current dense-hardware inference is slower than GPT-2, so neuromorphic deployment is presented as a future sparsity-driven opportunity rather than an achieved hardware speedup.
Chinese Translation
原生训练的脉冲语言模型在结合类似Transformer的语言质量、稳定的多领域预训练和高激活稀疏性方面面临挑战。我们提出了SymbolicLight V1,这是一种脉冲门控双路径语言模型,结合了二元泄漏积分-发射(Leaky Integrate-and-Fire)脉冲动态与连续残差流。其双路径稀疏TCAM模块用指数衰减聚合路径替代了密集自注意力,以实现长程记忆,并通过脉冲门控局部注意力路径实现短程精度,辅以动态上下文条件解码头和双语分词器。在一个包含30亿标记的中英文语料库上,从零开始训练的194M参数SymbolicLight V1模型在四次独立运行中达到了8.88-8.93的验证PPL,且每个元素的激活稀疏性超过89%。在PPL上,它比GPT-2 201M低7.7%,但在报告的比较中超越了GPT-2 124M。在匹配的5亿标记训练预算下的组件消融实验表明,脉冲门控局部注意力路径是最大的贡献者,而在匹配稀疏性的情况下,用确定性top-k掩码替代LIF动态会导致更大的性能下降,表明时间集成而非单纯的稀疏性驱动了性能。我们还报告了一次在488亿标记上训练的0.8B参数规模扩展运行,作为优化和稀疏性保持的证据,而非主要的质量比较。目前的密集硬件推理速度慢于GPT-2,因此神经形态部署被视为未来稀疏性驱动的机会,而不是已实现的硬件加速。
cs.CL / 75 / 2605.21338
Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media
文本分析评估框架:以大型语言模型和社交媒体为案例研究
Abstract
LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark on diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that the performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly Open-weights models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.
Chinese Translation
大型语言模型(LLMs)在广泛的自然语言处理(NLP)任务中表现出卓越的能力。然而,在实际数据分析场景中仍然存在显著的差距,特别是在需要处理长序列非结构化文档(如新闻推送或本文特别讨论的社交媒体帖子)时。为了在这一背景下实证评估LLMs的有效性,我们引入了一个基于问题的评估框架,该框架包含470个手动策划的问题,旨在评估LLMs对聚合文本数据的语义理解和推理能力。我们在涵盖多种NLP任务的多样化Twitter数据集上应用了我们的基准,包括情感分析、仇恨言论检测和情感识别。我们的结果显示,性能在很大程度上依赖于输入规模和数据源的复杂性,在多标签或目标依赖场景中显著下降。此外,随着任务复杂性的增加,性能从基本的语义存在识别逐渐下降到更具挑战性的操作,如比较、计数和计算。此外,当输入规模超过500个实例时,我们发现LLMs,特别是开放权重模型,普遍存在一个共同的限制:性能显著下降,尤其是在数值任务上。这些发现突显了当前LLMs在对大规模文本集合进行严格定量分析时的关键架构瓶颈。
cs.CL / 76 / 2605.21362
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
LASH:用于大型语言模型黑箱越狱的自适应语义混合
Abstract
Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits to a single attack family (e.g., one refinement loop, one tree search, one mutation space, or one strategy library) and no single family dominates: the best-performing method shifts across target models and harm categories, suggesting complementary strengths that per-prompt composition could exploit. We introduce LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats outputs from multiple base attacks as reusable seed prompts and adaptively composes them for each target request. Given a seed pool, LASH searches over seed subsets and softmax-normalized mixture weights; a composition module synthesizes a single candidate prompt, and a derivative-free genetic optimizer updates the weights using black-box target feedback and a two-stage fitness function combining keyword-based refusal detection with LLM-judge scoring. On JailbreakBench, which contains 100 harmful prompts across 10 categories, we evaluate LASH on six common target models. LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, where responses are first filtered for refusals and then scored by an LLM judge for whether they substantively fulfill the original harmful request. LASH outperforms five state-of-the-art baselines on both metrics with only 30 mean target queries. LASH also remains competitive under three defense mechanisms and induces more success-like internal representations. These results suggest that adaptive composition across heterogeneous jailbreak strategies is a promising direction for black-box red-teaming.
Chinese Translation
越狱攻击揭示了对齐的大型语言模型的预期安全行为与其在对抗性提示下的行为之间的持续差距。现有的自动化方法越来越有效,但每种方法都致力于单一的攻击家族(例如,一个精炼循环、一个树搜索、一个变异空间或一个策略库),且没有任何单一家族占主导地位:最佳表现的方法在目标模型和危害类别之间变化,表明每种方法的互补优势可以通过每个提示的组合来利用。我们提出了LASH(LLM自适应语义混合),这是一个黑箱框架,将多个基础攻击的输出视为可重用的种子提示,并为每个目标请求自适应地组合它们。给定一个种子池,LASH在种子子集和softmax归一化混合权重上进行搜索;一个组合模块合成一个候选提示,而一个无导数遗传优化器使用黑箱目标反馈和一个结合基于关键字的拒绝检测与LLM评估得分的两阶段适应度函数来更新权重。在包含100个有害提示的JailbreakBench上,我们在六个常见目标模型上评估了LASH。LASH在基于关键字的评估下实现了84.5%的平均攻击成功率,在两阶段评估下为74.5%,其中响应首先经过拒绝过滤,然后由LLM评估者评分,以判断其是否实质性地满足原始有害请求。LASH在这两个指标上超越了五个最先进的基线,仅需30个平均目标查询。LASH在三种防御机制下仍然具有竞争力,并诱导出更成功的内部表示。这些结果表明,在异构越狱策略之间进行自适应组合是黑箱红队测试的一个有前景的方向。
cs.CL / 77 / 2605.21363
"I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration
“我没有做出微观决策”:测量、诱导和揭示协作中的目标层级AI贡献
Abstract
As large language models (LLMs) increasingly shape how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes critical for users calibrating their own reliance and for evaluators assessing AI-assisted work. Yet existing methods focus on final artifacts, missing the process through which goals themselves are jointly shaped. We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. Applying CoTrace to 638 real-world collaboration logs, we find that while models account for only 11-26% of goal-shaping contribution, they contribute substantially more on introducing lower-level concrete requirements, and make various kinds of indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect model goal-shaping behavior. In a user study, exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.
Chinese Translation
随着大型语言模型(LLMs)越来越多地影响用户形成、完善和扩展其目标,在人机协作中归因贡献变得至关重要,这不仅有助于用户校准自身的依赖程度,也有助于评估者评估AI辅助的工作。然而,现有方法主要关注最终成果,忽视了目标本身是如何在协作中共同形成的过程。我们提出了一种目标层级归因框架CoTrace,该框架将明确目标分解为可验证的需求,并追踪对话轮次中直接贡献和间接影响。通过将CoTrace应用于638个真实世界的协作日志,我们发现,尽管模型仅占目标塑造贡献的11-26%,但它们在引入更低层次的具体需求方面贡献显著更多,并且做出了多种类型的间接贡献。通过控制模拟,我们展示了交互设计选择显著影响模型的目标塑造行为。在用户研究中,向参与者展示目标层级分析使他们对贡献的感知在5分制上移动了近2分,揭示了用户对自身AI辅助工作理解中的系统性误校准。
cs.CL / 78 / 2605.21391
Post-Hoc Understanding of Metaphor Processing in Decoder-Only Language Models via Conditional Scale Entropy
通过条件尺度熵对仅解码器语言模型中隐喻处理的后验理解
Abstract
Metaphor requires a language model to resolve a token whose contextual meaning diverges from its basic literal sense. Understanding how transformer models organize this reinterpretation across depth remains an open problem in mechanistic interpretability. We introduce conditional scale entropy (CSE), a wavelet-derived measure of how broadly transformer computation engages across frequency scales at each layer position. Two theorems establish that CSE is invariant to update magnitude, isolating the structural pattern of updates from their intensity. Using CSE, we find that metaphorical tokens produce significantly higher spectral breadth than literal tokens at contiguous layer positions on every decoder-only architecture tested, from 124M to 20B parameters (GPT-2 family, LLaMA-2 7B, GPT-oss 20B). The effect survives cluster-based permutation correction, recurs in the early-to-mid relative depth range across models, and converges with an independent analysis of 200 naturalistic VUA pairs. Specificity controls further show that the effect is not explained by semantic complexity or by matched propositional content. These results identify multi-scale coordination as a consistent signature of metaphorical language processing in the decoder-only architectures examined, and establish CSE as a principled tool for characterizing cross-depth structure in transformers.
Chinese Translation
隐喻要求语言模型解析一个其上下文意义与基本字面意义相悖的标记。理解变换器模型如何在深度上组织这种重新解释仍然是机械可解释性中的一个未解问题。我们引入条件尺度熵(Conditional Scale Entropy, CSE),这是一种基于小波的度量,旨在评估变换器计算在每个层位置上如何在频率尺度上广泛参与。两个定理证明了CSE对更新幅度的不变性,将更新的结构模式与其强度分离。使用CSE,我们发现隐喻标记在每个仅解码器架构的相邻层位置上产生的光谱宽度显著高于字面标记,测试的架构参数范围从124M到20B(包括GPT-2系列、LLaMA-2 7B、GPT-oss 20B)。这一效应在基于聚类的置换校正中依然显著,在各模型的早期到中期相对深度范围内重复出现,并与对200对自然语言VUA的独立分析相一致。特异性控制进一步表明,该效应并不能通过语义复杂性或匹配的命题内容来解释。这些结果确定了多尺度协调作为在所考察的仅解码器架构中隐喻语言处理的一种一致特征,并确立了CSE作为表征变换器跨深度结构的原则性工具。
cs.CL / 79 / 2605.21403
Quantifying the cross-linguistic effects of syncretism on agreement attraction
量化同态现象对一致性吸引的跨语言影响
Abstract
Agreement attraction errors, in which a verb erroneously agrees with an intervening noun rather than its grammatical head, are amplified by morphological syncretism in some languages (English, German, Russian) but not others (Turkish, Armenian), a cross-linguistic pattern without a principled account. We use surprisal and attention entropy from large language models as processing proxies to investigate this variation across four languages. LLM-derived measures replicate behavioral findings in English and German (syncretism modulates attraction), align with Turkish null results (no modulation), and partially capture Russian patterns. We discuss further directions for better understanding why syncretism affects agreement attraction differently across languages.
Chinese Translation
一致性吸引错误是指动词错误地与一个介入名词而非其语法主语一致的现象,这种错误在某些语言中(如英语、德语、俄语)因形态同态现象而加剧,但在其他语言中(如土耳其语、亚美尼亚语)则没有这种情况,这一跨语言模式缺乏原则性的解释。我们利用大型语言模型中的惊讶度和注意力熵作为处理代理,研究四种语言之间的这种变异。基于大型语言模型的测量结果复制了英语和德语中的行为发现(同态现象调节吸引),与土耳其语的无效结果一致(无调节),并部分捕捉到俄语的模式。我们讨论了进一步的研究方向,以更好地理解为什么同态现象在不同语言中对一致性吸引的影响不同。
cs.CL / 80 / 2605.21463
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate
Mem-$eta$: 通过学习何时以及生成什么的自适应记忆
Abstract
We present Mem-$\pi$, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem-$\pi$ uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal-based tool use, and text-based embodied interaction, Mem-$\pi$ consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.
Chinese Translation
我们提出了Mem-$eta$,这是一个用于大型语言模型(LLM)代理的自适应记忆框架,其中有用的指导是在需要时生成,而不是从外部记忆存储中检索。现有的增强记忆代理通常依赖于从情节记忆库或技能库中基于相似性的检索,返回的静态条目往往与当前上下文不一致。相比之下,Mem-$eta$使用一个专用的语言或视觉-语言模型,其参数与下游代理分开,以生成针对复杂任务的上下文特定指导。该模型基于当前代理上下文共同决定何时生成指导以及生成什么指导。我们使用一个决策内容解耦的强化学习(RL)目标对其进行训练,使其能够在生成无助时选择不生成,否则生成简洁、有用的指导。在涵盖网页导航、终端工具使用和基于文本的具身交互的多样化代理基准测试中,Mem-$eta$始终优于基于检索的和之前经过RL优化的记忆基线,在网页导航任务上实现了超过30%的相对提升。
cs.CL / 81 / 2605.21465
Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution
利用大型语言模型进行语法适配:元模型与语法共演化研究
Abstract
In model-driven engineering, metamodel evolution leads to the need to adapt corresponding grammars to maintain consistency, which typically requires tedious manual work. Existing rule-based methods can achieve partial automation but have limitations when handling complex grammar scenarios. This paper proposes a Large Language Model-based approach that automatically applies adaptations to new grammars after evolution by learning grammar adaptations from previous versions. We evaluated this approach on six real-world Xtext domain-specific languages, using four DSLs as a training set to develop prompting strategies, two DSLs as a test set for validation, and conducting a longitudinal case study on QVTo. The evaluation used three Large Language Models (Claude Sonnet 4.5, ChatGPT 5.1, Gemini 3) and measured grammar adaptation quality from three dimensions: grammar rule-level adaptation consistency, output similarity, and metamodel conformance. Results show that on the test set, all three LLMs achieved 100% adaptation consistency and output similarity, while the rule-based approach achieved only 84.21% on DOT and 62.50% on Xcore. In the QVTo longitudinal study, the LLM-based approach successfully reused learned adaptations across all three evolution steps without manual grammar editing, while the rule-based approach required manual adjustments in two of three transitions. However, on large-scale grammars (EAST-ADL, 297 rules), LLMs' adaptation consistency was far below 90%. This study demonstrates the advantages of LLM-based approaches in handling complex grammar scenarios, while revealing their limitations in large-scale grammar adaptation.
Chinese Translation
在模型驱动工程中,元模型的演化导致需要调整相应的语法以保持一致性,这通常需要繁琐的手动工作。现有的基于规则的方法可以实现部分自动化,但在处理复杂语法场景时存在局限性。本文提出了一种基于大型语言模型(Large Language Model, LLM)的方法,该方法通过从先前版本学习语法适配,自动对演化后的新语法进行适配。我们在六种真实世界的Xtext领域特定语言上评估了该方法,使用四种DSL作为训练集以开发提示策略,使用两种DSL作为测试集进行验证,并对QVTo进行了纵向案例研究。评估使用了三种大型语言模型(Claude Sonnet 4.5、ChatGPT 5.1、Gemini 3),并从三个维度测量语法适配质量:语法规则级适配一致性、输出相似性和元模型符合性。结果显示,在测试集上,所有三种LLM均实现了100%的适配一致性和输出相似性,而基于规则的方法在DOT上仅达到84.21%,在Xcore上仅达到62.50%。在QVTo的纵向研究中,基于LLM的方法成功地在所有三个演化步骤中重用学习到的适配,而无需手动编辑语法,而基于规则的方法在三次转换中的两次需要手动调整。然而,在大规模语法(EAST-ADL,297条规则)上,LLM的适配一致性远低于90%。本研究展示了基于LLM的方法在处理复杂语法场景中的优势,同时揭示了其在大规模语法适配中的局限性。