arXiv Daily Digest

261

Papers

HALO: Hybrid Auto-encoded Locomotion with Learned Latent Dynamics, Poincaré Maps, and Regions of Attraction

HALO：具有学习潜在动态、庞加莱映射和吸引区域的混合自编码运动

Werner, Blake, Esteban, Sergio A., De Sa, Massimiliano, Cohen, Max H., Ames, Aaron D.

Abstract

Reduced-order models are powerful for analyzing and controlling high-dimensional dynamical systems. Yet constructing these models for complex hybrid systems such as legged robots remains challenging. Classical approaches rely on hand-designed template models (e.g., LIP, SLIP), which, though insightful, only approximate the underlying dynamics. In contrast, data-driven methods can extract more accurate low-dimensional representations, but it remains unclear when stability and safety properties observed in the latent space meaningfully transfer back to the full-order system. To bridge this gap, we introduce HALO (Hybrid Auto-encoded Locomotion), a framework for learning latent reduced-order models of periodic hybrid dynamics directly from trajectory data. HALO employs an autoencoder to identify a low-dimensional latent state together with a learned latent Poincaré map that captures step-to-step locomotion dynamics. This enables Lyapunov analysis and the construction of an associated region of attraction in the latent space, both of which can be lifted back to the full-order state space through the decoder. Experiments on a simulated hopping robot and full-body humanoid locomotion demonstrate that HALO yields low-dimensional models that retain meaningful stability structure and predict full-order region-of-attraction boundaries.

Chinese Translation

降阶模型在分析和控制高维动态系统方面具有强大的能力。然而，为复杂的混合系统（如腿部机器人）构建这些模型仍然具有挑战性。经典方法依赖于手工设计的模板模型（例如，LIP、SLIP），这些模型虽然具有洞察力，但仅能近似底层动态。相比之下，数据驱动的方法可以提取更准确的低维表示，但尚不清楚在潜在空间中观察到的稳定性和安全性特性何时能够有意义地转移回全阶系统。为了解决这一问题，我们提出了HALO（混合自编码运动），这是一个直接从轨迹数据中学习周期性混合动态的潜在降阶模型的框架。HALO采用自编码器来识别低维潜在状态，并结合学习到的潜在庞加莱映射，以捕捉逐步运动动态。这使得李雅普诺夫分析和在潜在空间中构建相关吸引区域成为可能，这两者都可以通过解码器提升回全阶状态空间。在模拟跳跃机器人和全身人形机器人运动的实验中，HALO产生的低维模型保留了有意义的稳定结构，并预测了全阶吸引区域的边界。


cs.RO / 2 / 2604.18900

Thrust Regulation Through Wing Linkage Modulation on the Aerobat Platform: Piezoelectric Slip-Stick Actuated Regulator Development

通过翼联动调节在飞行器平台上的推力调节：压电滑移-粘滞驱动调节器的开发

Ciampaglia, Luca

Abstract

Aerobat is a bat-inspired flapping-wing robot with a wing gait generated by the computational structure, a planar linkage of carbon fiber links driven by a single motor. This design minimizes weight but couples both wings to a shared input motor, eliminating independent thrust control and preventing asymmetric maneuvers. This thesis investigates thrust regulation by modifying the effective length of the first radius link $R_1$ in the computational structure. Static experiments using FDM-printed $R_1$ links at three lengths (28.58, 29.33, and 30.08 mm) across 3, 4, and 5 Hz flapping frequencies demonstrated that a 1.5 mm length increase produced a 37% increase in peak lift force and shifted peak force timing within the downstroke. An additional experiment using a string-actuated regulator mechanism was performed. Further actuation methods were evaluated: sub-gram micro-servo and piezoelectric slip-stick. After both the string-tension and micro-servo actuation methods failed due to structural member compliance and motor fragility respectively, a TULA-50 piezoelectric slip-stick actuator was selected. Multiple force-amplifying mechanisms were prototyped, resulting in a direct-drive variable-length mechanism. This final mechanism was demonstrated in a preliminary bench-top test, though insufficient force output prevented dynamic testing during flapping. This work establishes linkage-length modulation via embedded slip-stick actuation as a viable approach to independent wing thrust control.

Chinese Translation

Aerobat是一种受蝙蝠启发的拍打翼机器人，其翼步态由计算结构生成，采用由单个电机驱动的碳纤维连杆平面联动设计。该设计最小化了重量，但将两翼耦合到共享输入电机，消除了独立的推力控制，阻碍了不对称机动。本文研究了通过修改计算结构中第一个半径连杆$R_1$的有效长度来调节推力。使用FDM打印的$R_1$连杆在三种长度（28.58、29.33和30.08 mm）下进行的静态实验，针对3、4和5 Hz的拍打频率，表明长度增加1.5 mm可使峰值升力增加37%，并在下冲程中改变峰值力的时机。还进行了使用绳索驱动调节机制的额外实验。进一步评估了其他驱动方法：亚克拉微伺服和压电滑移-粘滞。在绳索张力和微伺服驱动方法因结构部件的柔顺性和电机脆弱性分别失败后，选择了TULA-50压电滑移-粘滞驱动器。原型制作了多种力放大机制，最终形成了一种直接驱动的可变长度机制。该最终机制在初步台面测试中得到了验证，尽管由于输出力不足，未能在拍打过程中进行动态测试。本研究确立了通过嵌入式滑移-粘滞驱动实现的连杆长度调制作为独立翼推力控制的可行方法。


cs.RO / 3 / 2604.18905

Task-Adaptive Admittance Control for Human-Quadrotor Cooperative Load Transportation with Dynamic Cable-Length Regulation

人机协作负载运输的任务自适应导纳控制与动态缆绳长度调节

Li, Shuai, Duong, Ton T. H., Zanotto, Damiano

Abstract

The collaboration between humans and robots is critical in many robotic applications, especially in those requiring physical human-robot interaction (pHRI). Previous research in pHRI has largely focused on robotic manipulators, employing impedance or admittance control to maintain operational safety. Conversely, research in human-quadrotor cooperative load transportation (CLT) is still in its infancy. This letter introduces a novel admittance controller designed for safe and effective human-quadrotor CLT using a quadrotor equipped with an actively-controlled winch. The proposed method accounts for the system's coupled dynamics, allowing the quadrotor and its cable to dynamically adapt to contact forces during CLT tasks, thereby enhancing responsiveness. We experimentally validated the task-adaptive capability of the controller across the entire CLT process, including in-place loading/unloading and load transporting tasks. To this end, we compared the system performances against a conventional approach, using both variable and fixed cable lengths under low- and high-stiffness conditions. Results demonstrate that the proposed method outperforms the conventional approach in terms of system responsiveness and motion smoothness, leading to improved CLT capabilities.

Chinese Translation

人类与机器人之间的协作在许多机器人应用中至关重要，尤其是在需要物理人机交互（pHRI）的场景中。以往的pHRI研究主要集中在机器人操纵器上，采用阻抗或导纳控制以维持操作安全性。相较之下，人机四旋翼协作负载运输（CLT）的研究仍处于起步阶段。本文提出了一种新型导纳控制器，旨在利用配备主动控制绞车的四旋翼实现安全有效的人机四旋翼CLT。所提出的方法考虑了系统的耦合动力学，使四旋翼及其缆绳能够在CLT任务中动态适应接触力，从而增强响应能力。我们通过实验验证了控制器在整个CLT过程中的任务自适应能力，包括原地装载/卸载和负载运输任务。为此，我们将系统性能与传统方法进行了比较，使用了在低刚度和高刚度条件下的可变和固定缆绳长度。结果表明，所提方法在系统响应性和运动平滑性方面优于传统方法，从而提升了CLT能力。


cs.RO / 4 / 2604.18933

Gated Memory Policy

门控记忆策略

Gao, Yihuai, Liu, Jinyun, Li, Shuang, Song, Shuran

Abstract

Robotic manipulation tasks exhibit varying memory requirements, ranging from Markovian tasks that require no memory to non-Markovian tasks that depend on historical information spanning single or multiple interaction trials. Surprisingly, simply extending observation histories of a visuomotor policy often leads to a significant performance drop due to distribution shift and overfitting. To address these issues, we propose Gated Memory Policy (GMP), a visuomotor policy that learns both when to recall memory and what to recall. To learn when to recall memory, GMP employs a learned memory gate mechanism that selectively activates history context only when necessary, improving robustness and reactivity. To learn what to recall efficiently, GMP introduces a lightweight cross-attention module that constructs effective latent memory representations. To further enhance robustness, GMP injects diffusion noise into historical actions, mitigating sensitivity to noisy or inaccurate histories during both training and inference. On our proposed non-Markovian benchmark MemMimic, GMP achieves a 30.1% average success rate improvement over long-history baselines, while maintaining competitive performance on Markovian tasks in RoboMimic. All code, data and in-the-wild deployment instructions are available on our project website https://gated-memory-policy.github.io/.

Chinese Translation

机器人操作任务展现出不同的记忆需求，从不需要记忆的马尔可夫任务到依赖于单次或多次交互试验历史信息的非马尔可夫任务。令人惊讶的是，简单地扩展视觉运动策略的观察历史往往会导致显著的性能下降，这是由于分布转移和过拟合所致。为了解决这些问题，我们提出了门控记忆策略（Gated Memory Policy, GMP），这是一种学习何时回忆记忆以及回忆什么的视觉运动策略。为了学习何时回忆记忆，GMP采用了一种学习的记忆门机制，仅在必要时选择性地激活历史上下文，从而提高了鲁棒性和反应能力。为了有效地学习回忆什么，GMP引入了一种轻量级的交叉注意力模块，构建有效的潜在记忆表示。为了进一步增强鲁棒性，GMP在历史动作中注入扩散噪声，减轻了在训练和推理过程中对嘈杂或不准确历史的敏感性。在我们提出的非马尔可夫基准MemMimic上，GMP在长历史基线上的平均成功率提高了30.1%，同时在RoboMimic的马尔可夫任务中保持了竞争力的表现。所有代码、数据和实际部署说明均可在我们的项目网站https://gated-memory-policy.github.io/上获取。


cs.RO / 5 / 2604.18961

AI-Enabled Image-Based Hybrid Vision/Force Control of Tendon-Driven Aerial Continuum Manipulators

基于AI的腱驱动空中连续操控器的图像-力混合控制

Sepahvand, Shayan, Janabi-Sharifi, Farrokh, Aghili, Farhad

Abstract

This paper presents an AI-enabled cascaded hybrid vision/force control framework for tendon-driven aerial continuum manipulators based on constant-strain modeling in $SE(3)$ as a coupled system. The proposed controller is designed to enable autonomous, physical interaction with a static environment while stabilizing the image feature error. The developed strategy combines the cascaded fast fixed-time sliding mode control and a radial basis function neural network to cope with the uncertainties in the image acquired by the eye-in-hand monocular camera and the measurements from the force sensing apparatus. This ensures rapid, online learning of the vision- and force-related uncertainties without requiring offline training. Furthermore, the features are extracted via a state-of-the-art graph neural network architecture employed by a visual servoing framework using line features, rather than relying on heuristic geometric line extractors, to concurrently contribute to tracking the desired normal interaction force during contact and regulating the image feature error. A comparative study benchmarks the proposed controller against established rigid-arm aerial manipulation methods, evaluating robustness across diverse scenarios and feature extraction strategies. The simulation and experimental results showcase the effectiveness of the proposed methodology under various initial conditions and demonstrate robust performance in executing manipulation tasks.

Chinese Translation

本文提出了一种基于常量应变建模的腱驱动空中连续操控器的 AI 驱动级联混合图像/力控制框架，将其视为一个耦合系统。所提出的控制器旨在实现与静态环境的自主物理交互，同时稳定图像特征误差。该策略结合了级联快速固定时间滑模控制和径向基函数神经网络，以应对由手眼单目相机获取的图像和力传感器测量中的不确定性。这确保了在不需要离线训练的情况下，快速在线学习与视觉和力相关的不确定性。此外，特征通过一种先进的图神经网络架构提取，该架构在视觉伺服框架中使用线特征，而不是依赖启发式几何线提取器，从而同时有助于在接触过程中跟踪所需的法向交互力并调节图像特征误差。比较研究将所提出的控制器与已建立的刚性臂空中操控方法进行了基准测试，评估了在不同场景和特征提取策略下的鲁棒性。仿真和实验结果展示了所提方法在各种初始条件下的有效性，并证明其在执行操控任务时的鲁棒性能。


cs.RO / 6 / 2604.19025

RoomRecon: High-Quality Textured Room Layout Reconstruction on Mobile Devices

RoomRecon：移动设备上的高质量纹理房间布局重建

Kim, Seok Joon, Cao, Dinh Duc, Spinola, Federica, Lee, Se Jin, Cho, Kyu Sung

Abstract

Widespread RGB-Depth (RGB-D) sensors and advanced 3D reconstruction technologies facilitate the capture of indoor spaces, improving the fields of augmented reality (AR), virtual reality (VR), and extended reality (XR). Nevertheless, current technologies still face limitations, such as the inability to reflect minor scene changes without a complete recapture, the lack of semantic scene understanding, and various texturing challenges that affect the 3D model's visual quality. These issues affect the realism required for VR experiences and other applications such as in interior design and real estate. To address these challenges, we introduce RoomRecon, an interactive, real-time scanning and texturing pipeline for 3D room models. We propose a two-phase texturing pipeline that integrates AR-guided image capturing for texturing and generative AI models to improve texturing quality and provide better replicas of indoor spaces. Moreover, we suggest focusing only on permanent room elements such as walls, floors, and ceilings, to allow for easily customizable 3D models. We conduct experiments in a variety of indoor spaces to assess the texturing quality and speed of our method. The quantitative results and user study demonstrate that RoomRecon surpasses state-of-the-art methods in terms of texturing quality and on-device computation time.

Chinese Translation

广泛应用的RGB-深度（RGB-D）传感器和先进的3D重建技术促进了室内空间的捕捉，提升了增强现实（AR）、虚拟现实（VR）和扩展现实（XR）等领域。然而，当前技术仍面临一些限制，例如无法在不完全重新捕捉的情况下反映微小场景变化、缺乏语义场景理解，以及影响3D模型视觉质量的各种纹理挑战。这些问题影响了VR体验所需的真实感，以及在室内设计和房地产等其他应用中的表现。为了解决这些挑战，我们提出了RoomRecon，一个用于3D房间模型的互动实时扫描和纹理处理管道。我们提出了一种两阶段纹理处理管道，结合了AR引导的图像捕捉用于纹理处理和生成性AI模型，以提高纹理质量并提供更好的室内空间复制品。此外，我们建议仅关注墙壁、地板和天花板等永久性房间元素，以便轻松定制3D模型。我们在多种室内空间中进行实验，以评估我们方法的纹理质量和速度。定量结果和用户研究表明，RoomRecon在纹理质量和设备内计算时间方面超越了最先进的方法。


cs.RO / 7 / 2604.19059

AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs

AeroBridge-TTA：无人机的测试时自适应语言条件控制

Lyu, Lingxue

Abstract

Language-guided unmanned aerial vehicles (UAVs) often fail not from bad reasoning or perception, but from execution mismatch: the gap between a planned trajectory and the controller's ability to track it when the real dynamics differ from training (mass changes, drag shifts, actuator delay, wind). We propose AeroBridge-TTA, a language-conditioned control pipeline that targets this gap with test-time adaptation. It has three parts: a language encoder that maps the command into a subgoal, an adaptive policy conditioned on the subgoal and a learned latent, and a test-time adaptation (TTA) module that updates the latent online from observed transitions. On five language-conditioned UAV tasks under 13 mismatch conditions with the same domain randomization, AeroBridge-TTA ties a strong PPO-MLP baseline in-distribution and wins all 5 out-of-distribution (OOD) conditions, +22.0 pts on average (62.7% vs. 40.7%); the +8.5 pt overall gain comes entirely from the OOD regime. A same-weights ablation that only changes the step size $\alpha$ shows the latent update itself is responsible for a $4.6\times$ OOD lift.
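The test-time latent update described above can be illustrated with a toy sketch. This is not the AeroBridge-TTA implementation: the one-dimensional dynamics model, the function names `predict_next` and `tta_update`, and the squared-error objective are all assumptions made for illustration. It only shows how a latent can be adjusted online from observed transitions with a step size `alpha`, and that `alpha = 0` leaves the latent unchanged.

```python
def predict_next(state, action, latent, dt=0.1):
    """Hypothetical 1-D model: the latent acts like an unknown drag coefficient."""
    return state + dt * (action - latent * state)


def tta_update(latent, transition, alpha, dt=0.1):
    """One gradient step on 0.5 * (prediction - observation)^2 with respect to the latent."""
    state, action, next_state = transition
    error = predict_next(state, action, latent, dt) - next_state
    grad = error * (-dt * state)  # d(prediction)/d(latent) = -dt * state
    return latent - alpha * grad


# Usage: transitions generated with a "true" drag of 0.8, unknown to the policy.
true_latent = 0.8
transitions = [(s, 1.0, predict_next(s, 1.0, true_latent)) for s in (1.0, 2.0, 3.0)] * 200

latent = 0.0
for tr in transitions:
    latent = tta_update(latent, tr, alpha=5.0)
```

With a non-zero step size the latent drifts toward the value that explains the observed transitions; freezing `alpha` at zero corresponds to running the same weights without adaptation.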

Chinese Translation

语言引导的无人机（UAV）往往不是由于推理或感知不佳而失败，而是由于执行不匹配：即在真实动态与训练时不同（质量变化、阻力变化、执行器延迟、风）时，计划轨迹与控制器跟踪能力之间的差距。我们提出了AeroBridge-TTA，这是一种针对这一差距的语言条件控制管道，采用测试时自适应。它由三个部分组成：一个将命令映射为子目标的语言编码器，一个基于子目标和学习潜变量的自适应策略，以及一个测试时自适应（TTA）模块，该模块根据观察到的转变在线更新潜变量。在五个语言条件的无人机任务中，在13种不匹配条件下，AeroBridge-TTA在同一领域随机化中与强大的PPO-MLP基线表现相当，并在所有5个分布外（OOD）条件下获胜，平均提高22.0分（62.7%对40.7%）；整体提高8.5分完全来自于OOD模式。一个仅改变步长$\alpha$的相同权重消融实验表明，潜变量的更新本身负责了$4.6\times$的OOD提升。


cs.RO / 8 / 2604.19062

Differentiable Satellite Constellation Configuration via Relaxed Coverage and Revisit Objectives

通过放松覆盖和重访目标的可微卫星星座配置

Kacker, Shreeyam, Cahoy, Kerri

Abstract

Satellite constellation design requires optimizing orbital parameters across multiple satellites to maximize mission-specific metrics. For many types of mission, it is desirable to maximize coverage and minimize revisit gaps over ground targets. Existing approaches to constellation design either restrict the design space to symmetric parametric families such as Walker constellations, or rely on metaheuristic methods that require significant compute and many iterations. Gradient-based optimization has been considered intractable due to the non-differentiability of coverage and revisit metrics, which involve binary visibility indicators and discrete max operations. We introduce four continuous relaxations: soft sigmoid visibility, noisy-OR multi-satellite aggregation, leaky integrator revisit gap tracking, and LogSumExp soft-maximum, which when composed with the $\partial$SGP4 differentiable orbit propagator, yield a fully differentiable pipeline from orbital elements to mission-level objectives. We show that this scheme can recover Walker-Delta geometry from irregular initializations, and can discover elliptical Molniya-like orbits with apogee dwell over extreme latitudes from only gradients. Compared to simulated annealing (SA), genetic algorithm (GA), and differential evolution (DE) baselines, our gradient-based method recovers Walker-equivalent geometry within ${\sim}750$ evaluations, whereas the three black-box baselines plateau with significantly worse revisit even with roughly four times the evaluation budget.
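The four relaxations named in this abstract can be sketched in a few lines of plain Python. The functional forms below are common textbook versions and are assumptions, not the paper's exact definitions; the parameter names (`min_elevation_deg`, `temperature`, `beta`, `dt`) are hypothetical, and the sketch does not involve the $\partial$SGP4 propagator.

```python
import math


def soft_visibility(elevation_deg, min_elevation_deg=10.0, temperature=2.0):
    """Sigmoid relaxation of the binary test 'elevation is above the mask angle'."""
    return 1.0 / (1.0 + math.exp(-(elevation_deg - min_elevation_deg) / temperature))


def noisy_or(visibilities):
    """Soft 'at least one satellite sees the target': 1 - prod(1 - v_i)."""
    miss = 1.0
    for v in visibilities:
        miss *= 1.0 - v
    return 1.0 - miss


def leaky_gap(coverage_series, dt=1.0):
    """Soft revisit-gap tracker: the gap grows by dt and is reset in proportion to coverage."""
    gap, gaps = 0.0, []
    for c in coverage_series:
        gap = (1.0 - c) * (gap + dt)
        gaps.append(gap)
    return gaps


def soft_max(values, beta=10.0):
    """LogSumExp smooth maximum; approaches max(values) as beta grows."""
    m = max(values)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(beta * (v - m)) for v in values)) / beta


# Usage: worst-case (soft) revisit gap for a toy coverage time series.
gaps = leaky_gap([0, 0, 1, 0])
worst_gap = soft_max(gaps)
```

Each function is smooth in its inputs, which is what lets a gradient flow from a mission-level objective such as `worst_gap` back to whatever produces the elevation angles.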

Chinese Translation

卫星星座设计需要优化多个卫星的轨道参数，以最大化特定任务的指标。对于许多类型的任务，最大化覆盖率和最小化地面目标的重访间隔是理想的。现有的星座设计方法要么将设计空间限制在对称的参数族（如沃克星座），要么依赖于需要大量计算和多次迭代的元启发式方法。由于覆盖率和重访指标涉及二元可见性指示符和离散最大操作，基于梯度的优化被认为是不可行的。我们引入了四种连续放松方法：软Sigmoid可见性、噪声-OR多卫星聚合、泄漏积分器重访间隔跟踪和LogSumExp软最大值，这些方法与$\partial$SGP4可微轨道传播器组合后，形成了从轨道元素到任务级目标的完全可微管道。我们展示了该方案能够从不规则初始化中恢复沃克-德尔塔几何形状，并仅凭梯度发现在极高纬度上空具有远地点驻留的椭圆形类莫尔尼亚轨道。与模拟退火（SA）、遗传算法（GA）和差分进化（DE）基线相比，我们的基于梯度的方法在大约${\sim}750$次评估内恢复了沃克等效几何形状，而这三种黑箱基线即使在大约四倍的评估预算下也停滞不前，重访效果显著更差。


cs.RO / 9 / 2604.19092

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

RoboWM-Bench：评估机器人操作中世界模型的基准

Jiang, Feng, Chen, Yang, Xu, Kyle, Liu, Yuchen, Wang, Haifeng, Shen, Zhenhao, Lu, Jasper, Huang, Shengze, Wang, Yuanfei, Xie, Chen, Wu, Ruihai

Abstract

Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of leveraging imagined videos for robot learning. However, visual realism does not imply physical plausibility, and behaviors inferred from generated videos may violate dynamics and fail when executed by embodied agents. Existing benchmarks begin to incorporate notions of physical plausibility, but they largely remain perception- or diagnostic-oriented and do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete the intended task. To address this gap, we introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated behaviors from both human-hand and robotic manipulation videos into embodied action sequences and validates them through robotic execution. The benchmark spans diverse manipulation scenarios and establishes a unified protocol for consistent and reproducible evaluation. Using RoboWM-Bench, we evaluate state-of-the-art video world models and find that reliably generating physically executable behaviors remains an open challenge. Common failure modes include errors in spatial reasoning, unstable contact prediction, and non-physical deformations. While finetuning on manipulation data yields improvements, physical inconsistencies still persist, suggesting opportunities for more physically grounded video generation for robots.

Chinese Translation

近期大规模视频世界模型的进展使得未来预测变得愈加真实，这为利用想象中的视频进行机器人学习带来了可能性。然而，视觉真实感并不意味着物理合理性，从生成视频中推断出的行为可能违反动力学，并在由具身代理执行时失败。现有基准开始纳入物理合理性的概念，但它们大多仍然以感知或诊断为导向，并未系统性地评估预测行为是否能够转化为完成预期任务的可执行动作。为了解决这一问题，我们提出了RoboWM-Bench，这是一个以操作为中心的基准，用于基于具身评估视频世界模型。RoboWM-Bench将来自人手和机器人操作视频的生成行为转换为具身动作序列，并通过机器人执行进行验证。该基准涵盖多种操作场景，并建立了统一的协议以实现一致且可重复的评估。通过使用RoboWM-Bench，我们评估了最先进的视频世界模型，并发现可靠地生成物理可执行的行为仍然是一个开放的挑战。常见的失败模式包括空间推理错误、不稳定的接触预测和非物理变形。尽管在操作数据上进行微调能够带来改进，但物理不一致性仍然存在，这表明为机器人生成更具物理基础的视频的机会。


cs.RO / 10 / 2604.19102

Multi-Gait Learning for Humanoid Robots Using Reinforcement Learning with Selective Adversarial Motion Prior

基于选择性对抗运动先验的强化学习多步态学习用于类人机器人

Wu, Yuanye, Wang, Keyi, Ye, Linqi, Xing, Boyang

Abstract

Learning diverse locomotion skills for humanoid robots in a unified reinforcement learning framework remains challenging due to the conflicting requirements of stability and dynamic expressiveness across different gaits. We present a multi-gait learning approach that enables a humanoid robot to master five distinct gaits -- walking, goose-stepping, running, stair climbing, and jumping -- using a consistent policy structure, action space, and reward formulation. The key contribution is a selective Adversarial Motion Prior (AMP) strategy: AMP is applied to periodic, stability-critical gaits (walking, goose-stepping, stair climbing) where it accelerates convergence and suppresses erratic behavior, while being deliberately omitted for highly dynamic gaits (running, jumping) where its regularization would over-constrain the motion. Policies are trained via PPO with domain randomization in simulation and deployed on a physical 12-DOF humanoid robot through zero-shot sim-to-real transfer. Quantitative comparisons demonstrate that selective AMP outperforms a uniform AMP policy across all five gaits, achieving faster convergence, lower tracking error, and higher success rates on stability-focused gaits without sacrificing the agility required for dynamic ones.
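The selective use of the motion prior can be pictured as a per-gait switch on the style reward. The sketch below is an illustration only: the additive reward form, the `style_weight` value, and the names `AMP_GAITS` and `total_reward` are assumptions, not details taken from the paper. The gait split follows the abstract.

```python
# Gaits for which the abstract says AMP is applied (periodic, stability-critical).
AMP_GAITS = {"walking", "goose-stepping", "stair climbing"}


def total_reward(gait, task_reward, style_reward, style_weight=0.5):
    """Add the (hypothetical) AMP style term only for the selected gaits."""
    if gait in AMP_GAITS:
        return task_reward + style_weight * style_reward
    # Highly dynamic gaits (running, jumping): the motion prior is omitted.
    return task_reward
```

For example, `total_reward("running", 1.0, 0.4)` returns the task reward alone, while the same call with `"walking"` also adds the weighted style term.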

Chinese Translation

在统一的强化学习框架中，为类人机器人学习多样化的运动技能仍然具有挑战性，因为不同步态在稳定性和动态表现之间存在冲突的要求。我们提出了一种多步态学习方法，使类人机器人能够掌握五种不同的步态——行走、正步走、跑步、爬楼梯和跳跃——采用一致的策略结构、动作空间和奖励公式。关键贡献在于选择性对抗运动先验（Adversarial Motion Prior, AMP）策略：AMP应用于周期性、稳定性关键的步态（行走、正步走、爬楼梯），加速收敛并抑制不稳定行为，而在高度动态的步态（跑步、跳跃）中故意省略，以避免其正则化对运动的过度约束。策略通过PPO（Proximal Policy Optimization）在仿真中进行领域随机化训练，并通过零-shot仿真到现实转移部署到物理12自由度类人机器人上。定量比较表明，选择性AMP在所有五种步态上优于均匀AMP策略，实现了更快的收敛、更低的跟踪误差以及在关注稳定性的步态上更高的成功率，而不牺牲动态步态所需的灵活性。


cs.RO / 11 / 2604.19104

Reinforcement Learning Enabled Adaptive Multi-Task Control for Bipedal Soccer Robots

基于强化学习的双足足球机器人自适应多任务控制

Zhang, Yulai, Zhang, Yinrong, Wu, Ting, Ye, Linqi

Abstract

Developing bipedal football robots in dynamic combat environments presents challenges related to motion stability and deep coupling of multiple tasks, as well as control switching issues between different states such as upright walking and fall recovery. To address these problems, this paper proposes a modular reinforcement learning (RL) framework for achieving adaptive multi-task control. Firstly, this framework combines an open-loop feedforward oscillator with a reinforcement learning-based feedback residual strategy, effectively separating the generation of basic gaits from complex football actions. Secondly, a posture-driven state machine is introduced, clearly switching between the ball seeking and kicking network (BSKN) and the fall recovery network (FRN), fundamentally preventing state interference. The FRN is efficiently trained through a progressive force attenuation curriculum learning strategy. The architecture was verified in Unity simulations of bipedal robots, demonstrating excellent spatial adaptability (reliably finding and kicking the ball even in restricted corner scenarios) and rapid autonomous fall recovery (with an average recovery time of 0.715 seconds). This ensures seamless and stable operation in complex multi-task environments.
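The two structural ideas in this abstract, a feedforward oscillator with a learned residual and a posture-driven switch between networks, can be sketched as follows. The sinusoidal oscillator, the tilt threshold, and every parameter value here are assumptions chosen for illustration; the paper's actual oscillator and switching rule are not given in the abstract.

```python
import math


def select_network(torso_tilt_rad, fall_threshold_rad=1.0):
    """Posture-driven switch: fall recovery (FRN) when tilted past the threshold, else BSKN."""
    return "FRN" if abs(torso_tilt_rad) > fall_threshold_rad else "BSKN"


def joint_target(t, residual, amplitude=0.3, frequency_hz=1.5, phase=0.0):
    """Open-loop sinusoidal gait term plus a feedback residual supplied by the RL policy."""
    feedforward = amplitude * math.sin(2.0 * math.pi * frequency_hz * t + phase)
    return feedforward + residual


# Usage: an upright robot uses the ball seeking and kicking network.
active = select_network(torso_tilt_rad=0.1)
target = joint_target(t=0.0, residual=0.05)
```

Because only one network is selected at any time, the recovery behavior and the ball-handling behavior never write to the actuators simultaneously, which is one plausible reading of "preventing state interference".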

Chinese Translation

在动态战斗环境中开发双足足球机器人面临着与运动稳定性、多个任务深度耦合以及不同状态（如直立行走和跌倒恢复）之间控制切换相关的挑战。为了解决这些问题，本文提出了一种模块化的强化学习（RL）框架，以实现自适应多任务控制。首先，该框架将开环前馈振荡器与基于强化学习的反馈残差策略相结合，有效地将基本步态的生成与复杂足球动作分离。其次，引入了一种姿态驱动的状态机，明确地在寻球与踢球网络（BSKN）和跌倒恢复网络（FRN）之间切换，根本上防止状态干扰。FRN通过渐进式力衰减课程学习策略进行高效训练。该架构在双足机器人Unity仿真中得到了验证，展示了出色的空间适应性——即使在受限的角落场景中也能可靠地找到并踢球，以及快速的自主跌倒恢复（平均恢复时间为0.715秒）。这确保了在复杂多任务环境中的无缝和稳定操作。


cs.RO / 12 / 2604.19148

Multi-Step Gaussian Process Propagation for Adaptive Path Planning

用于自适应路径规划的多步高斯过程传播

Beaudin, Alex, Kristiansen, Bjørn Andreas, Gryte, Kristoffer, Chiatante, Corrado, Alver, Morten Omholt, Arcak, Murat, Johansen, Tor Arne

Abstract

Efficient and robust path planning hinges on combining all accessible information sources. In particular, the task of path planning for robotic environmental exploration and monitoring depends highly on the current belief of the world. To capture the uncertainty in the belief, we present a Gaussian process based path planning method that adapts to multi-modal environmental sensing data and incorporates state and input constraints. To solve the path planning problem, we optimize over future waypoints in a receding horizon fashion, and our cost is thus a function of the Gaussian process posterior over all these waypoints. We demonstrate this method, dubbed OLAhGP, on an autonomous surface vessel using oceanic algal bloom data from both a high-fidelity model and in-situ sensing data in a monitoring scenario. Our simulated and experimental results demonstrate significant improvement over existing methods. With the same number of samples, our method generates more informative paths and achieves greater accuracy in identifying algal blooms in chlorophyll a rich waters, measured with respect to total misclassification probability and binary misclassification rate over the domain of interest.

Chinese Translation

高效且稳健的路径规划依赖于结合所有可用的信息源。特别是，机器人环境探索和监测的路径规划任务高度依赖于对世界的当前信念。为了捕捉信念中的不确定性，我们提出了一种基于高斯过程的路径规划方法，该方法能够适应多模态环境传感数据，并结合状态和输入约束。为了解决路径规划问题，我们以递归视野的方式对未来的航点进行优化，因此我们的成本函数是所有这些航点上高斯过程后验的函数。我们在一个自主水面船只上演示了这一方法，称为OLAhGP，使用来自高保真模型和现场传感数据的海洋藻华数据进行监测场景。我们的模拟和实验结果表明，与现有方法相比，显著提高了性能。在相同数量的样本下，我们的方法生成了更具信息性的路径，并在识别富含叶绿素a的水域中的藻华方面达到了更高的准确性，这通过对感兴趣领域的总误分类概率和二元误分类率进行测量得出。


cs.RO / 13 / 2604.19267

Multimodal embodiment-aware navigation transformer

多模态具身感知导航变换器

Dezons, Louis, Picard, Quentin, Marsal, Rémi, Goulette, François, Filliat, David

Abstract

Goal-conditioned navigation models for ground robots trained using supervised learning show promising zero-shot transfer, but their collision-avoidance capability nevertheless degrades under distribution shift, i.e. environmental, robot or sensor configuration changes. We propose ViLiNT, a multimodal, attention-based policy for goal navigation, trained on heterogeneous data from multiple platforms and environments, which improves robustness with two key features. First, we fuse RGB images, 3D LiDAR point clouds, a goal embedding and a robot's embodiment descriptor with a transformer architecture to capture complementary geometry and appearance cues. The transformer's output is used to condition a diffusion model that generates navigable trajectories. Second, using automatically generated offline labels, we train a path clearance prediction head for scoring and ranking trajectories produced by the diffusion model. The diffusion conditioning as well as the trajectory ranking head depend on a robot's embodiment token that allows our model to generate and select trajectories with respect to the robot's dimensions. Across three simulated environments, ViLiNT improves Success Rate on average by 166\% over an equivalent state-of-the-art vision-only baseline (NoMaD). This increase in performance is confirmed through real-world deployments of a rover navigating in obstacle fields. These results highlight that combining multimodal fusion with our collision prediction mechanism leads to improved off-road navigation robustness.

Chinese Translation

基于目标条件的地面机器人导航模型通过监督学习训练，展现出良好的零样本迁移能力，但在分布变化下（即环境、机器人或传感器配置变化）其避碰能力仍然下降。我们提出了ViLiNT，一种多模态、基于注意力的目标导航策略，利用来自多个平台和环境的异构数据进行训练，从而增强鲁棒性，具有两个关键特征。首先，我们将RGB图像、3D LiDAR点云、目标嵌入和机器人的具身描述符融合在一起，采用变换器架构以捕捉互补的几何和外观线索。变换器的输出用于条件化一个生成可导航轨迹的扩散模型。其次，利用自动生成的离线标签，我们训练了一个路径间隙预测头，用于对扩散模型生成的轨迹进行评分和排序。扩散条件和轨迹排序头依赖于机器人的具身标记，使我们的模型能够根据机器人的尺寸生成和选择轨迹。在三个模拟环境中，ViLiNT的成功率平均提高了166\%，相较于等效的最先进视觉基线（NoMaD）。这一性能提升通过在障碍物场中导航的探测器的实际部署得到了验证。这些结果突显了将多模态融合与我们的碰撞预测机制结合，能够提高越野导航的鲁棒性。


cs.RO / 14 / 2604.19270

Warmth and Competence in the Swarm: Designing Effective Human-Robot Teams

群体中的温暖与能力：设计有效的人机团队

Miyauchi, Genki, Groß, Roderich, Chen, Chaona

Abstract

As groups of robots increasingly collaborate with humans, understanding how humans perceive them is critical for designing effective human-robot teams. While prior research examined how humans interpret and evaluate the abilities and intentions of individual agents, social perception of robot teams remains relatively underexplored. Drawing on the competence-warmth framework, we conducted two studies manipulating swarm behaviors in completing a collective search task and measured the social perception of swarm behaviors when human participants are either observers (Study 1) or operators (Study 2). Across both studies, our results show that variations in swarm behaviors consistently influenced participants' perceptions of warmth and competence. Notably, longer broadcast durations increased perceived warmth; larger separation distances increased perceived competence. Interestingly, individual robot speed had no effect on either of the perceptions. Furthermore, our results show that these social perceptions predicted participants' team preferences more strongly than task performance. Participants preferred robot teams that were both warm and competent, not those that completed tasks most quickly. These findings demonstrate that human-robot interaction dynamically shapes social perception, underscoring the importance of integrating both technical and social considerations when designing robot swarms for effective human-robot collaboration.

Chinese Translation

随着机器人群体越来越多地与人类合作，理解人类如何感知这些机器人对于设计有效的人机团队至关重要。尽管之前的研究考察了人类如何解读和评估个体代理的能力和意图，但对机器人团队的社会感知仍然相对缺乏探索。基于能力-温暖框架，我们进行了两项研究，操控群体行为以完成集体搜索任务，并测量人类参与者在观察者（研究1）和操作员（研究2）角色下对群体行为的社会感知。在两项研究中，我们的结果表明，群体行为的变化始终影响参与者对温暖和能力的感知。值得注意的是，较长的广播持续时间增加了感知的温暖；较大的分离距离提高了感知的能力。有趣的是，个别机器人的速度对这两种感知没有影响。此外，我们的结果显示，这些社会感知比任务表现更强烈地预测了参与者的团队偏好。参与者更倾向于选择既温暖又有能力的机器人团队，而不是那些完成任务最快的团队。这些发现表明，人机交互动态地塑造了社会感知，强调了在设计机器人群体以实现有效的人机协作时整合技术和社会考量的重要性。


cs.RO / 15 / 2604.19344

Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual Input

四足机器人跑酷学习：稀疏门控专家混合模型与视觉输入

Ziegltrum, Michael, Jiao, Jianhao, Peng, Tianhu, Zhou, Chengxu, Kanoulas, Dimitrios

Abstract

Robotic parkour provides a compelling benchmark for advancing locomotion over highly challenging terrain, including large discontinuities such as elevated steps. Recent approaches have demonstrated impressive capabilities, including dynamic climbing and jumping, but typically rely on sequential multilayer perceptron (MLP) architectures with densely activated layers. In contrast, sparsely gated mixture-of-experts (MoE) architectures have emerged in the large language model domain as an effective paradigm for improving scalability and performance by activating only a subset of parameters at inference time. In this work, we investigate the application of sparsely gated MoE architectures to vision-based robotic parkour. We compare control policies based on standard MLPs and MoE architectures under a controlled setting where the number of active parameters at inference time is matched. Experimental results on a real Unitree Go2 quadruped robot demonstrate clear performance gains, with the MoE policy achieving double the number of successful trials in traversing large obstacles compared to a standard MLP baseline. We further show that achieving comparable performance with a standard MLP requires scaling its parameter count to match that of the total MoE model, resulting in a 14.3\% increase in computation time. These results highlight that sparsely gated MoE architectures provide a favorable trade-off between performance and computational efficiency, enabling improved scaling of control policies for vision-based robotic parkour. An anonymized link to the codebase is https://osf.io/v2kqj/files/github?view_only=7977dee10c0a44769184498eaba72e44.
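Sparse gating as described here means that only a few experts run for each input. A minimal top-k gating sketch is shown below; it is a generic formulation and not the paper's policy architecture. The gate logits are passed in directly (in a real network they would be computed from the observation), and `top_k_gate`, `moe_forward`, and `k=2` are hypothetical choices.

```python
import math


def top_k_gate(logits, k=2):
    """Keep the k largest gate logits and softmax over them; other experts get no weight."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_forward(x, experts, gate_logits, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    weights = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Usage with four toy scalar "experts"; only experts 1 and 2 are evaluated.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 4 * x, lambda x: 100 * x]
weights = top_k_gate([0.0, 2.0, 2.0, -1.0], k=2)
output = moe_forward(1.0, experts, [0.0, 2.0, 2.0, -1.0], k=2)
```

The total parameter count grows with the number of experts, while the per-step compute depends only on `k`, which is the trade-off the abstract compares against a dense MLP.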

Chinese Translation

机器人跑酷为在高度挑战性地形上推进运动提供了一个引人注目的基准，包括如高台阶等大间断。近期的方法展示了令人印象深刻的能力，包括动态攀爬和跳跃，但通常依赖于具有密集激活层的序列多层感知器（MLP）架构。相比之下，稀疏门控专家混合模型（MoE）架构在大型语言模型领域中作为一种有效的范式出现，通过在推理时仅激活一部分参数来提高可扩展性和性能。在本研究中，我们探讨了稀疏门控MoE架构在基于视觉的机器人跑酷中的应用。我们在一个受控环境中比较了基于标准MLP和MoE架构的控制策略，其中推理时激活的参数数量相匹配。对真实的Unitree Go2四足机器人进行的实验结果显示，MoE策略在跨越大障碍物时成功试验的数量是标准MLP基线的两倍。我们进一步表明，要实现与标准MLP相当的性能，需要将其参数数量扩展到与总MoE模型相匹配，从而导致计算时间增加14.3%。这些结果突显了稀疏门控MoE架构在性能和计算效率之间提供了良好的权衡，使得基于视觉的机器人跑酷控制策略的可扩展性得以改善。代码库的匿名链接为https://osf.io/v2kqj/files/github?view_only=7977dee10c0a44769184498eaba72e44。


cs.RO / 16 / 2604.19374

Achieving Interaction Fluidity in a Wizard-of-Oz Robotic System: A Prototype for Fluid Error-Correction

在巫师-奥兹机器人系统中实现交互流畅性：流畅错误纠正的原型

De Lima, Carlos Baptista, Hough, Julian, Förster, Frank, Holthaus, Patrick, Zheng, Yongjun

Abstract

Achieving truly fluid interaction with robots with speech interfaces remains a hard problem, and the experience of current Human-Robot Interaction (HRI) remains laboured and frustrating. Some of the barriers to fluid interaction stem from a lack of a suitable development platform for HRI for improving interaction, even in robotic Wizard-of-Oz (WoZ) modes of operation used for data collection and prototyping. Based on previous systems, we propose the properties of interruptibility and correction (IaC), pollability, latency measurement and optimisation and time-accurate reproducibility of actions from logging data as key criteria for a fluid WoZ system to support fluid error correction. We finish by presenting a Virtual Reality (VR) HRI simulation environment for mobile manipulators which meets these criteria.

Chinese Translation

与具有语音接口的机器人实现真正流畅的交互仍然是一个难题，目前的人机交互（HRI）体验依然显得繁琐和令人沮丧。流畅交互的一些障碍源于缺乏适合人机交互的开发平台，以改善交互，即使在用于数据收集和原型设计的机器人巫师-奥兹（WoZ）操作模式中也是如此。基于之前的系统，我们提出了可打断性和纠正性（IaC）、可轮询性、延迟测量与优化，以及从日志数据中时间准确再现动作的属性作为支持流畅错误纠正的流畅WoZ系统的关键标准。最后，我们展示了一个满足这些标准的移动操纵器虚拟现实（VR）人机交互仿真环境。

cs.RO / 17 / 2604.19404

M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

M$^{2}$GRPO：基于Mamba的多智能体群体相对策略优化用于仿生水下机器人追逐

Feng, Yukai, Wu, Zhiheng, Wu, Zhengxing, Gu, Junwen, Yu, Junzhi

Abstract

Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M$^{2}$GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy leverages observation history to capture long-horizon temporal dependencies and exploits attention-based relational features to encode inter-agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability, the group-relative advantages are obtained by normalizing rewards across agents within each episode and optimized through a multi-agent extension of GRPO, significantly reducing the demand for training resources while enabling stable and scalable policy updates. Extensive simulations and real-world pool experiments across team scales and evader strategies demonstrate that M$^{2}$GRPO consistently outperforms MAPPO and recurrent baselines in both pursuit success rate and capture efficiency. Overall, the proposed framework provides a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.
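The group-relative advantage described above can be sketched in a few lines. This assumes the normalization is a plain z-score of per-agent returns within one episode; the abstract says only that rewards are normalized across agents, so the exact statistic, the `eps` term, and the function name are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each agent's episode reward against the group of agents.

    rewards: one scalar return per agent for a single episode.
    eps: small constant (assumed) that avoids division by zero when
         all agents receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the group mean rather than a learned value function, no separate critic network is needed. This is consistent with the abstract's claim of reduced demand for training resources, though the paper itself should be consulted for the precise formulation.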

Chinese Translation

传统的合作追逐中的策略学习方法在仿生水下机器人中面临着基本挑战，其中长期决策、部分可观测性和机器人间协调需要同时具备表达能力和稳定性。为了解决这些问题，提出了一种新颖的框架，称为基于Mamba的多智能体群体相对策略优化（M$^{2}$GRPO），该框架在集中训练与分散执行（CTDE）范式下，将选择性状态空间的Mamba策略与群体相对策略优化相结合。具体而言，基于Mamba的策略利用观察历史来捕捉长期的时间依赖性，并利用基于注意力的关系特征来编码智能体间的交互，通过归一化的高斯采样生成有界的连续动作。为了进一步改善信用分配而不牺牲稳定性，群体相对优势通过在每个回合内对智能体的奖励进行归一化获得，并通过GRPO的多智能体扩展进行优化，这显著减少了对训练资源的需求，同时实现了稳定和可扩展的策略更新。广泛的仿真和真实水池实验跨越团队规模和逃避者策略表明，M$^{2}$GRPO在追逐成功率和捕获效率方面始终优于MAPPO和递归基线。总体而言，所提出的框架为仿生机器人系统的合作水下追逐提供了一种实用且可扩展的解决方案。

cs.RO / 18 / 2604.19419

Forward Dynamics of Variable Topology Mechanisms - The Case of Constraint Activation

可变拓扑机制的前向动力学 - 约束激活的案例

Mueller, Andreas

Abstract

Many mechanical systems exhibit changes in their kinematic topology that alter their mobility. Ideal contact is the best-known cause, but stiction and controlled locking of parts of a mechanism also lead to topology changes. The latter is becoming an important issue in human-machine interaction. Anticipating the dynamic behavior of variable topology mechanisms requires solving a non-smooth dynamic problem. The core challenge is a physically meaningful transition condition at the topology switching events. Such a condition is presented in this paper. Two versions are reported, one using projected motion equations in terms of redundant coordinates, and another one using the Voronets equations in terms of minimal coordinates. Their computational properties are discussed. Results are shown for joint locking of a planar 3R mechanism and a 6-DOF industrial manipulator.

Chinese Translation

许多机械系统表现出其运动拓扑的变化，从而改变了其运动能力。理想接触是最为人知的原因，但粘滞和机制部分的受控锁定也会导致拓扑变化。后者在人与机器的交互中正变得越来越重要。预测可变拓扑机制的动态行为需要解决一个非光滑动态问题。核心挑战是在拓扑切换事件中提供一个物理上有意义的过渡条件。本文提出了这样一个条件。报告了两种版本，一种是使用冗余坐标的投影运动方程，另一种是使用最小坐标的Voronets方程。讨论了它们的计算特性。结果展示了平面3R机制和6自由度工业机械手的关节锁定情况。

cs.RO / 19 / 2604.19469

Wrench-Aware Admittance Control for Unknown-Payload Manipulation

考虑扭矩的未知负载操控的适应控制

Gholampour, Hossein, Beaver, Logan E.

Abstract

Unknown payloads can strongly affect compliant robotic manipulation, especially when the payload center of mass is not aligned with the tool center point. In this case, the payload generates an offset wrench at the robot wrist. During motion, this wrench is not only related to payload weight, but also to payload inertia. If it is not modeled, the compliant controller can interpret it as an external interaction wrench, which causes unintended compliant motion, larger tracking error, and reduced transport accuracy. This paper presents a wrench-aware admittance control framework for unknown-payload pick-and-place using a UR5e robot. The method uses force-torque measurements in two different roles. First, a three-axis translational excitation term is used to reduce payload-induced force effects during transport without making the robot excessively stiff. Second, after grasping, the controller first estimates payload mass for transport compensation and then estimates the payload CoM offset relative to the TCP using wrist force-torque measurements collected during the subsequent translational motion. This helps improve object placement and stacking behavior. Experimental results show improved transport and placement performance compared with uncorrected placement while preserving compliant motion.
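As a rough illustration of why an off-center payload shows up as a wrist wrench, the sketch below recovers mass and center-of-mass (CoM) offset from a single static force-torque reading using the relation torque = r x F. It is a simplified static case, not the paper's estimator: it ignores payload inertia, sensor bias, and gripper weight, and it recovers only the part of the offset perpendicular to the measured force. All function names and the gravity constant are assumptions.

```python
import math

G = 9.81  # m/s^2, assumed gravity magnitude


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def estimate_mass(force):
    """Static mass estimate: |F| / g."""
    return math.sqrt(sum(c * c for c in force)) / G


def estimate_com_offset(force, torque):
    """Minimum-norm r satisfying torque = r x force.

    The component of r along the force direction is unobservable from
    one reading, so this returns only the perpendicular component.
    """
    f_sq = sum(c * c for c in force)
    if f_sq == 0.0:
        raise ValueError("force must be non-zero")
    return tuple(c / f_sq for c in cross(force, torque))
```

The unobservable component along the force direction is one plausible reason the abstract describes collecting measurements during translational motion: as acceleration changes the direction of the net force, readings from several instants can constrain all three components of the offset.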

Chinese Translation

未知负载会对柔性机器人操控产生显著影响，尤其是当负载的质心与工具中心点不对齐时。在这种情况下，负载会在机器人手腕处产生偏移扭矩。在运动过程中，这个扭矩不仅与负载重量有关，还与负载的惯性有关。如果不进行建模，柔性控制器可能会将其解释为外部交互扭矩，从而导致意外的柔性运动、更大的跟踪误差和降低的运输精度。本文提出了一种针对未知负载的抓取与放置的考虑扭矩的适应控制框架，使用UR5e机器人。该方法在两个不同的角色中使用力-扭矩测量。首先，使用三轴平移激励项来减少运输过程中负载引起的力效应，而不使机器人过于僵硬。其次，在抓取后，控制器首先估计负载质量以进行运输补偿，然后利用在后续平移运动中收集的手腕力-扭矩测量估计负载质心相对于工具中心点的偏移。这有助于改善物体的放置和堆叠行为。实验结果表明，与未修正的放置相比，运输和放置性能得到了改善，同时保持了柔性运动。

cs.RO / 20 / 2604.19509

Assessing VLM-Driven Semantic-Affordance Inference for Non-Humanoid Robot Morphologies

评估基于视觉语言模型的非类人机器人形态的语义赋能推断

Jones, Jess, Santos-Rodriguez, Raul, Hauert, Sabine

Abstract

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding human-object interactions, but their application to robotic systems with non-humanoid morphologies remains largely unexplored. This work investigates whether VLMs can effectively infer affordances for robots with fundamentally different embodiments than humans, addressing a critical gap in the deployment of these models for diverse robotic applications. We introduce a novel hybrid dataset that combines annotated real-world robotic affordance-object relations with VLM-generated synthetic scenarios, and perform an empirical analysis of VLM performance across multiple object categories and robot morphologies, revealing significant variations in affordance inference capabilities. Our experiments demonstrate that while VLMs show promising generalisation to non-humanoid robot forms, their performance is notably inconsistent across different object domains. Critically, we identify a consistent pattern of low false positive rates but high false negative rates across all morphologies and object categories, indicating that VLMs tend toward conservative affordance predictions. Our analysis reveals that this pattern is particularly pronounced for novel tool use scenarios and unconventional object manipulations, suggesting that effective integration of VLMs in robotic systems requires complementary approaches to mitigate over-conservative behaviour while preserving the inherent safety benefits of low false positive rates.

Chinese Translation

视觉语言模型（VLMs）在理解人类与物体之间的交互方面展现了显著的能力，但其在非类人形态的机器人系统中的应用仍然基本未被探索。本研究探讨了VLMs是否能够有效推断与人类有根本不同的机器人赋能，填补了这些模型在多样化机器人应用中的关键空白。我们引入了一个新颖的混合数据集，该数据集结合了注释的现实世界机器人赋能-物体关系与VLM生成的合成场景，并对VLM在多个物体类别和机器人形态上的性能进行了实证分析，揭示了赋能推断能力的显著差异。我们的实验表明，尽管VLMs在非类人机器人形态上展现出有希望的泛化能力，但其在不同物体领域的表现却显著不一致。重要的是，我们发现所有形态和物体类别中均存在低假阳性率但高假阴性率的一致模式，表明VLMs倾向于保守的赋能预测。我们的分析显示，这一模式在新颖工具使用场景和非常规物体操作中尤为明显，表明将VLMs有效整合到机器人系统中需要采用互补的方法，以减轻过于保守的行为，同时保持低假阳性率所带来的固有安全性优势。

cs.RO / 21 / 2604.19522

GenerativeMPC: VLM-RAG-guided Whole-Body MPC with Virtual Impedance for Bimanual Mobile Manipulation

GenerativeMPC：基于VLM-RAG指导的全身模型预测控制与虚拟阻抗的双手移动操控

Fernando, Marcelino Julio, Cabrera, Miguel Altamirano, Sam, Jeffrin, Mahmoud, Yara, Gubernatorov, Konstantin, Tsetserukou, Dzmitry

Abstract

Bimanual mobile manipulation requires a seamless integration between high-level semantic reasoning and safe, compliant physical interaction - a challenge that end-to-end models approach opaquely and classical controllers lack the context to address. This paper presents GenerativeMPC, a hierarchical cyber-physical framework that explicitly bridges semantic scene understanding with physical control parameters for bimanual mobile manipulators. The system utilizes a Vision-Language Model with Retrieval-Augmented Generation (VLM-RAG) to translate visual and linguistic context into grounded control constraints, specifically outputting dynamic velocity limits and safety margins for a Whole-Body Model Predictive Controller (MPC). Simultaneously, the VLM-RAG module modulates virtual stiffness and damping gains for a unified impedance-admittance controller, enabling context-aware compliance during human-robot interaction. Our framework leverages an experience-driven vector database to ensure consistent parameter grounding without retraining. Experimental results in MuJoCo, IsaacSim, and on a physical bimanual platform confirm a 60% speed reduction near humans and safe, socially-aware navigation and manipulation through semantic-to-physical parameter grounding. This work advances the field of human-centric cybernetics by grounding large-scale cognitive models into predictable, high-frequency physical control loops.

Chinese Translation

双手移动操控需要高层语义推理与安全、顺应的物理交互之间的无缝整合——这是一个端到端模型模糊处理而经典控制器缺乏上下文以应对的挑战。本文提出了GenerativeMPC，一个分层的网络物理框架，明确地将语义场景理解与双手移动操控器的物理控制参数相结合。该系统利用带有检索增强生成（VLM-RAG）的视觉-语言模型，将视觉和语言上下文转化为具体的控制约束，特别是输出全身模型预测控制器（MPC）的动态速度限制和安全边际。同时，VLM-RAG模块调节虚拟刚度和阻尼增益，以实现统一的阻抗-导纳控制器，在人机交互过程中实现上下文感知的顺应性。我们的框架利用基于经验的向量数据库，确保在不重新训练的情况下保持参数的一致性。MuJoCo、IsaacSim以及物理双手平台上的实验结果确认了在接近人类时速度降低60%以及通过语义到物理参数的基础实现安全、社会意识的导航和操控。本研究通过将大规模认知模型与可预测的高频物理控制回路相结合，推动了以人为中心的网络控制学领域的发展。

cs.RO / 22 / 2604.19536

LiveVLN: Breaking the Stop-and-Go Loop in Vision-Language Navigation

LiveVLN：打破视觉-语言导航中的停顿循环

Wang, Xiangchen, Zhu, Weiye, Wang, Teng, Geng, TianTian, Zhang, Zekai, Qi, Zhiyuan, Yang, Jinyu, Zheng, Feng

Abstract

Recent navigation systems achieve strong benchmark results, yet real-world deployment often remains visibly stop-and-go. This bottleneck arises because the sense-inference-execution loop is still blocking: after each new observation, the controller must wait for sensing, transmission, and inference before motion can continue. Reducing action-generation cost alone therefore does not remove redundant waiting. To address this issue, we present LiveVLN, a training-free framework for more continuous embodied navigation by augmenting pretrained VLM navigators with multi-step action continuation. Instead of pausing for each full sense-and-inference round, LiveVLN overlaps execution with the processing of newly arrived observations, allowing refreshed future actions to be handed off before the current executable prefix is exhausted. This design keeps actions continuously available during motion, reducing idle waiting and enabling smoother online execution. The framework operates at runtime and can be integrated with compatible pretrained VLM navigators. Across R2R and RxR, LiveVLN preserves benchmark performance while reducing waiting time and improving action availability. In real-world deployments, it cuts average episode waiting time by up to 77.7% and shortens wall-clock episode time by 12.6% on StreamVLN and 19.6% on NaVIDA, yielding more coherent execution during deployment. Code is available at https://github.com/NIneeeeeem/LiveVLN.

Chinese Translation

近期的导航系统在基准测试中取得了良好的结果，但在实际应用中往往仍然显得停顿不前。这一瓶颈的产生是因为感知-推理-执行循环仍然是阻塞性的：在每次新的观察后，控制器必须等待感知、传输和推理，才能继续运动。因此，仅仅减少动作生成的成本并不能消除冗余的等待。为了解决这一问题，我们提出了LiveVLN，这是一种无需训练的框架，通过多步动作延续增强预训练的视觉语言模型（VLM）导航器，实现更连续的具身导航。LiveVLN并不需要在每次完整的感知和推理回合中暂停，而是将执行与新到达观察的处理重叠，允许在当前可执行前缀耗尽之前交接更新的未来动作。这一设计在运动过程中保持动作的持续可用性，减少了闲置等待，并实现了更流畅的在线执行。该框架在运行时操作，可以与兼容的预训练VLM导航器集成。在R2R和RxR数据集上，LiveVLN保持了基准性能，同时减少了等待时间并提高了动作的可用性。在实际部署中，它将平均回合等待时间减少了多达77.7%，并在StreamVLN上缩短了墙钟回合时间12.6%，在NaVIDA上缩短了19.6%，从而在部署过程中实现了更连贯的执行。代码可在https://github.com/NIneeeeeem/LiveVLN获取。

cs.RO / 23 / 2604.19618

Autonomous UAV Pipeline Near-proximity Inspection via Disturbance-Aware Predictive Visual Servoing

基于干扰感知预测视觉伺服的自主无人机管道近距离检测

Li, Wen, Wang, Hui, Su, Jinya, Liu, Cunjia, Chen, Wen-Hua, Li, Shihua

Abstract

Reliable pipeline inspection is critical to safe energy transportation, but is constrained by long distances, complex terrain, and risks to human inspectors. Unmanned aerial vehicles provide a flexible sensing platform, yet reliable autonomous inspection remains challenging. This paper presents an autonomous quadrotor near-proximity pipeline inspection framework for three-dimensional scenarios based on image-based visual servoing model predictive control (VMPC). A unified predictive model couples quadrotor dynamics with image feature kinematics, enabling direct image-space prediction within the control loop. To address low-rate visual updates, measurement noise, and environmental uncertainties, an extended-state Kalman filtering scheme with image feature prediction (ESKF-PRE) is developed, and the estimated lumped disturbances are incorporated into the VMPC prediction model, yielding the ESKF-PRE-VMPC framework. A terrain-adaptive velocity design is introduced to maintain the desired cruising speed while generating vertical velocity references over unknown terrain slopes without prior terrain information. The framework is validated in high-fidelity Gazebo simulations and real-world experiments. In real-world tests, the proposed method reduces RMSE by 52.63% and 75.04% in pipeline orientation and lateral deviation in the image, respectively, for straight-pipeline inspection without wind, and successfully completes both wind-disturbance and bend-pipeline tasks where the baseline method fails. An open-source nano quadrotor is modified for indoor experimentation.

Chinese Translation

可靠的管道检测对于安全的能源运输至关重要，但受到长距离、复杂地形和对人类检查员的风险的限制。无人驾驶飞行器提供了灵活的传感平台，但可靠的自主检测仍然具有挑战性。本文提出了一种基于图像视觉伺服模型预测控制（VMPC）的自主四旋翼近距离管道检测框架，适用于三维场景。统一的预测模型将四旋翼动力学与图像特征运动学相结合，使得在控制回路内能够直接进行图像空间预测。为了解决低频视觉更新、测量噪声和环境不确定性的问题，开发了一种扩展状态卡尔曼滤波方案，结合图像特征预测（ESKF-PRE），并将估计的集中干扰纳入VMPC预测模型，形成ESKF-PRE-VMPC框架。引入了一种地形自适应速度设计，以在未知地形坡度上生成垂直速度参考，同时保持所需的巡航速度，而无需事先获取地形信息。该框架在高保真Gazebo仿真和真实世界实验中得到了验证。在真实世界测试中，所提出的方法在无风的直管道检测中，管道方向和图像中的横向偏差的均方根误差（RMSE）分别降低了52.63%和75.04%，并成功完成了在基线方法失效的情况下的风干扰和弯曲管道任务。对开源纳米四旋翼进行了修改，以便进行室内实验。

cs.RO / 24 / 2604.19643

A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots

基于手势的视觉学习模型用于声学微操控交互的声波机器人群

Lin, Alex, Gao, Lei, Kemsaram, Narsimlu, Subramanian, Sriram

Abstract

AcoustoBots are mobile acoustophoretic robots capable of delivering mid-air haptics, directional audio, and acoustic levitation, but existing implementations rely on scripted commands and lack an intuitive interface for real-time human control. This work presents a gesture-based visual learning framework for contactless human-swarm interaction with a multimodal AcoustoBot platform. The system combines ESP32-CAM gesture capture, PhaseSpace motion tracking, centralized processing, and an OpenCLIP-based visual learning model (VLM) with linear probing to classify three hand gestures and map them to haptics, audio, and levitation modalities. Validation accuracy improved from about 67% with a small dataset to nearly 98% with the largest dataset. In integrated experiments with two AcoustoBots, the system achieved an overall gesture-to-modality switching accuracy of 87.8% across 90 trials, with an average end-to-end latency of 3.95 seconds. These results demonstrate the feasibility of using a vision-language-model-based gesture interface for multimodal human-swarm interaction. While the current system is limited by centralized processing, a static gesture set, and controlled-environment evaluation, it establishes a foundation for more expressive, scalable, and accessible swarm robotic interfaces.

Chinese Translation

声波机器人（AcoustoBots）是一种能够提供空中触觉、定向音频和声学悬浮的移动声学微操控机器人，但现有实现依赖于脚本命令，缺乏直观的实时人机控制界面。本研究提出了一种基于手势的视觉学习框架，用于与多模态声波机器人平台进行无接触的人群交互。该系统结合了ESP32-CAM手势捕捉、PhaseSpace运动跟踪、集中处理和基于OpenCLIP的视觉学习模型（VLM），通过线性探测对三种手势进行分类，并将其映射到触觉、音频和悬浮模态。验证准确率从小数据集的约67%提高到最大数据集的近98%。在与两个声波机器人进行的集成实验中，该系统在90次试验中实现了87.8%的整体手势到模态切换准确率，平均端到端延迟为3.95秒。这些结果证明了使用基于视觉-语言模型的手势界面进行多模态人群交互的可行性。尽管当前系统受到集中处理、静态手势集和受控环境评估的限制，但它为更具表现力、可扩展性和可访问性的机器人群体接口奠定了基础。

cs.RO / 25 / 2604.19670

Multi-Cycle Spatio-Temporal Adaptation in Human-Robot Teaming

人机团队中的多周期时空适应

Cuellar, Alex, Hagenow, Michael, Shah, Julie

Abstract

Effective human-robot teaming is crucial for the practical deployment of robots in human workspaces. However, optimizing joint human-robot plans remains a challenge due to the difficulty of modeling individualized human capabilities and preferences. While prior research has leveraged the multi-cycle structure of domains like manufacturing to learn an individual's tendencies and adapt plans over repeated interactions, these techniques typically consider task-level and motion-level adaptation in isolation. Task-level methods optimize allocation and scheduling but often ignore spatial interference in close-proximity scenarios; conversely, motion-level methods focus on collision avoidance while ignoring the broader task context. This paper introduces RAPIDDS, a framework that unifies these approaches by modeling an individual's spatial behavior (motion paths) and temporal behavior (time required to complete tasks) over multiple cycles. RAPIDDS then jointly adapts task schedules and steers diffusion models of robot motions to maximize efficiency and minimize proximity, accounting for these individualized models. We demonstrate the importance of this dual adaptation through an ablation study in simulation and a physical robot scenario using a 7-DOF robot arm. Finally, we present a user study (n=32) showing significant plan improvement compared to non-adaptive systems across both objective metrics, such as efficiency and proximity, and subjective measures, including fluency and user preference. See this paper's companion video at: https://youtu.be/55Q3lq1fINs.

Chinese Translation

有效的人机团队合作对于机器人在人工工作环境中的实际部署至关重要。然而，由于个体化人类能力和偏好的建模困难，优化人机联合计划仍然是一项挑战。虽然先前的研究利用制造等领域的多周期结构来学习个体的倾向并在重复交互中调整计划，但这些技术通常将任务级和运动级适应孤立考虑。任务级方法优化分配和调度，但往往忽视近距离场景中的空间干扰；相反，运动级方法专注于避免碰撞，却忽略了更广泛的任务背景。本文介绍了RAPIDDS框架，它通过建模个体的空间行为（运动路径）和时间行为（完成任务所需时间）在多个周期内统一了这些方法。RAPIDDS随后联合调整任务调度，并引导机器人运动的扩散模型，以最大化效率并最小化接近度，同时考虑这些个体化模型。我们通过模拟中的消融研究和使用7自由度机器人手臂的物理机器人场景展示了这种双重适应的重要性。最后，我们呈现了一项用户研究（n=32），显示与非自适应系统相比，在效率和接近度等客观指标以及流畅性和用户偏好等主观指标上，计划有显著改善。有关本论文的配套视频，请参见：https://youtu.be/55Q3lq1fINs。

cs.RO / 26 / 2604.19677

Learning Hybrid-Control Policies for High-Precision In-Contact Manipulation Under Uncertainty

在不确定性下学习高精度接触操作的混合控制策略

Brown, Hunter L., Hollinger, Geoffrey, Lee, Stefan

Abstract

Reinforcement learning-based control policies have been frequently demonstrated to be more effective than analytical techniques for many manipulation tasks. Commonly, these methods learn neural control policies that predict end-effector pose changes directly from observed state information. For tasks like inserting delicate connectors, which induce force constraints, pose-based policies have limited explicit control over force and rely on carefully tuned low-level controllers to avoid executing damaging actions. In this work, we present hybrid position-force control policies that learn to dynamically select when to use force or position control in each control dimension. To improve learning efficiency of these policies, we introduce Mode-Aware Training for Contact Handling (MATCH), which adjusts policy action probabilities to explicitly mirror the mode selection behavior in hybrid control. We validate MATCH's learned policy effectiveness using fragile peg-in-hole tasks under extreme localization uncertainty. We find MATCH substantially outperforms pose-control policies, solving these tasks with up to 10% higher success rates and 5x fewer peg breaks than pose-only policies under common types of state estimation error. MATCH also demonstrates data efficiency equal to pose-control policies, despite learning in a larger and more complex action space. In over 1600 sim-to-real experiments, we find MATCH succeeds twice as often as pose policies in high noise settings (33% vs. 68%) and applies about 30% less force on average compared to variable impedance policies on a Franka FR3 in laboratory conditions.
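Per-dimension mode selection can be sketched with a classic selection-vector form of hybrid position-force control. This is a generic textbook-style illustration, not the MATCH policy or its training procedure: the proportional gains `kp` and `kf`, the function name, and the binary selection vector are assumptions.

```python
def hybrid_command(selection, pos_error, force_error, kp=1.0, kf=0.01):
    """Per-axis command for a hybrid position-force controller.

    selection[i] == 1: axis i is force-controlled and tracks force_error[i].
    selection[i] == 0: axis i is position-controlled and tracks pos_error[i].
    kp, kf: hypothetical proportional gains for the two modes.
    """
    return [
        kf * fe if s else kp * pe
        for s, pe, fe in zip(selection, pos_error, force_error)
    ]
```

In a learned setting, the `selection` vector would be part of the policy's action, so the policy chooses per axis and per step which error the low-level controller regulates.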

Chinese Translation

基于强化学习的控制策略在许多操作任务中已被证明比分析技术更为有效。通常，这些方法学习神经控制策略，直接从观察到的状态信息中预测末端执行器的姿态变化。对于插入精细连接器等任务，这些任务会引入力约束，基于姿态的策略在力的显式控制上有限，并依赖于精心调节的低级控制器以避免执行有害操作。在本研究中，我们提出了混合位置-力控制策略，这些策略学习在每个控制维度中动态选择何时使用力控制或位置控制。为了提高这些策略的学习效率，我们引入了接触处理的模式感知训练（Mode-Aware Training for Contact Handling, MATCH），该方法调整策略行动概率，以明确反映混合控制中的模式选择行为。我们使用在极端定位不确定性下的脆弱插销-孔任务验证了MATCH学习策略的有效性。我们发现MATCH在成功率上显著优于姿态控制策略——在常见的状态估计误差类型下，这些任务的成功率提高了10%，插销断裂次数减少了5倍。尽管在更大且更复杂的动作空间中学习，MATCH在数据效率上与姿态控制策略相当。在超过1600次的仿真到现实实验中，我们发现MATCH在高噪声环境下的成功率是姿态策略的两倍（33%对68%），并且与可变阻抗策略相比，平均施加的力减少了约30%（在实验室条件下使用Franka FR3）。

cs.RO / 27 / 2604.19683

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

掩模世界模型：预测对稳健机器人策略学习重要的因素

Lou, Yunfan, Chi, Xiaowei, Zhang, Xiaojie, Qian, Zezhong, Li, Chengxuan, Zhang, Rongyu, Lyu, Yaoxu, Song, Guoyu, Fu, Chuyao, Xu, Haoxuan, Wang, Pengwei, Zhang, Shanghang

Abstract

World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, which can result in overfitting to irrelevant factors, such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations demonstrate the superiority of MWM on the LIBERO and RLBench simulation benchmarks, significantly outperforming the state-of-the-art RGB-based world models. Furthermore, real-world experiments and robustness evaluation (via random token pruning) reveal that MWM exhibits superior generalization capabilities and robust resilience to texture information loss.

Chinese Translation

基于大规模视频生成预训练的世界模型已成为通用机器人策略学习的一个有前景的范式。然而，标准方法通常侧重于高保真度的RGB视频预测，这可能导致对无关因素的过拟合，例如动态背景和光照变化。这些干扰降低了模型的泛化能力，最终导致不可靠和脆弱的控制策略。为了解决这个问题，我们提出了掩模世界模型（Mask World Model, MWM），它利用视频扩散架构预测语义掩模的演变，而不是像素。这一转变施加了几何信息瓶颈，迫使模型捕捉重要的物理动态和接触关系，同时过滤掉视觉噪声。我们将这一掩模动态骨干与基于扩散的策略头无缝集成，以实现稳健的端到端控制。广泛的评估表明，MWM在LIBERO和RLBench模拟基准测试中表现优越，显著超越了最先进的基于RGB的世界模型。此外，现实世界实验和鲁棒性评估（通过随机标记修剪）显示，MWM展现出更强的泛化能力和对纹理信息丢失的鲁棒性。

cs.RO / 28 / 2604.19728

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

VLA Foundry：一个统一的视觉-语言-动作模型训练框架

Mercat, Jean, Keh, Sedrick, Arora, Kushal, Huang, Isabella, Shah, Paarth, Nishimura, Haruki, Iwase, Shun, Liu, Katherine

Abstract

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM-->VLM-->VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone leads to a strong multi-task tabletop manipulation policy that outperforms our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.

Chinese Translation

我们提出了VLA Foundry，一个开源框架，将LLM、VLM和VLA训练统一在一个代码库中。大多数开源VLA项目专注于动作训练阶段，通常将不兼容的预训练管道拼接在一起。而VLA Foundry则提供了一个共享的训练栈，能够实现从语言预训练到动作专家微调的端到端控制。VLA Foundry支持从零开始的训练以及来自Hugging Face的预训练骨干网络。为了展示我们框架的实用性，我们训练并发布了两种类型的模型：第一种是通过我们的LLM-->VLM-->VLA管道完全从零开始训练的模型，第二种是基于预训练的Qwen3-VL骨干网络构建的模型。我们在LBM Eval上评估了这两种模型的闭环策略性能，该平台是一个开放数据、开源的模拟器。我们还对模拟器和STEP分析工具进行了可用性改进，以便于公众使用。在标准评估设置中，我们的完全开放的从零开始模型与我们之前的闭源工作相当，而替换为Qwen3-VL骨干网络则产生了一个强大的多任务桌面操作策略，显著超越了我们的基线。VLA Foundry代码库可在https://github.com/TRI-ML/vla_foundry获取，所有多任务模型权重已在https://huggingface.co/collections/TRI-ML/vla-foundry发布。更多定性视频可在项目网站https://tri-ml.github.io/vla_foundry上找到。

cs.RO / 29 / 2604.19734

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

UniT：朝着统一的人类与类人机器人政策学习和世界建模的物理语言

Chen, Boyu, Chen, Yi, Qiu, Lu, Bai, Jerry, Ge, Yuying, Ge, Yixiao

Abstract

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.

Chinese Translation

类人机器人基础模型的扩展受到机器人数据稀缺的瓶颈限制。虽然大量以自我为中心的人类数据提供了一种可扩展的替代方案，但由于运动学的不匹配，跨体现的桥接仍然是一个根本性挑战。我们提出了UniT（通过视觉锚定的统一潜在动作标记器），这是一个建立人类与类人机器人转移的统一物理语言的框架。基于异质运动学共享普遍视觉后果的理念，UniT采用了三分支交叉重建机制：动作预测视觉以将运动学锚定到物理结果，而视觉重建动作以过滤掉无关的视觉干扰因素。同时，融合分支将这些净化的模态协同到一个共享的离散潜在空间中，形成与体现无关的物理意图。我们在两个范式中验证了UniT：1）政策学习（VLA-UniT）：通过预测这些统一的标记，它有效利用多样的人类数据，实现了在类人机器人仿真基准和现实世界部署中的最先进数据效率和强健的分布外（OOD）泛化，特别展示了零样本任务转移。2）世界建模（WM-UniT）：通过将跨体现动态与统一标记对齐作为条件，它实现了直接的人类到类人机器人的动作转移。这种对齐确保了人类数据无缝转化为类人机器人视频生成的增强动作可控性。最终，通过诱导高度对齐的跨体现表示（通过t-SNE可视化验证，揭示人类和类人机器人特征收敛到共享流形），UniT提供了一条可扩展的路径，将大量人类知识提炼为通用的类人机器人能力。

计算机视觉 (Computer Vision)

105

cs.CV / 1 / 2604.18623

Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching

我们能构建场景图，而不是对其进行分类吗？FlowSG：基于流匹配的渐进式图像条件场景图生成

Hu, Xin, Qin, Ke, Yin, Wen, Li, Yuan-Fang, Li, Ming, He, Tao

Abstract

Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, with an average improvement of about 3 points over the state-of-the-art USG-Par.
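The continuous half of this formulation, transporting box coordinates along a learned velocity field, can be sketched as follows. The sketch assumes a straight-line probability path between noise and data, a common flow-matching choice; the abstract does not specify the path, and the discrete-token branch is omitted entirely. All function names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1 at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t under this assumption."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(pred_v, x0, x1):
    """Mean squared error between a predicted velocity and the path velocity."""
    tgt = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred_v, tgt)) / len(tgt)


def euler_transport(x0, velocity_fn, steps=4):
    """Few-step inference: integrate dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

At training time a network would be evaluated at `interpolate(x0, x1, t)` for a random `t` and regressed onto `target_velocity(x0, x1)`. At inference, `euler_transport` with a small `steps` value corresponds to the few-step generation the abstract mentions.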

Chinese Translation

场景图生成（SGG）通过预测框和主谓宾三元组来统一物体定位和视觉关系推理。然而，大多数流程将SGG视为一次性、确定性的分类问题，而不是一个真正的渐进式生成任务。我们提出了FlowSG，将SGG重新构建为混合离散-连续状态上的连续时间传输：从一个带噪声的图开始，模型通过约束感知的细化逐步生成一个图像条件的场景图，同时合成节点（物体）和边（谓词）。具体而言，我们首先利用VQ-VAE将场景图（例如，连续视觉特征）量化为紧凑、可预测的标记；然后，图形Transformer (i) 预测条件速度场以传输连续几何体（框），并且 (ii) 更新离散后验以获取分类标记（物体特征和谓词标签），通过流条件消息聚合将语义和几何结合起来。训练结合了几何的流匹配损失和标记的离散流目标，实现了少步推理，并与标准检测器和分割器兼容。对VG和PSG在闭合和开放词汇协议下的广泛实验显示，在谓词R/mR和图级指标上均取得了一致的提升，验证了混合离散-连续生成公式相较于一次性分类基线的有效性，平均提升约3个点，超越了最先进的USG-Par。

cs.CV / 2 / 2604.18627

Vision-Based Human Awareness Estimation for Enhanced Safety and Efficiency of AMRs in Industrial Warehouses

基于视觉的人类意识估计以增强工业仓库中自主移动机器人（AMRs）的安全性和效率

Haug, Maximilian, Stippel, Christian, Pscherer, Lukas, Schwendinger, Benjamin, Hoch, Ralph, Gaydarov, Angel, Schlund, Sebastian, Sauter, Thilo

Abstract

Ensuring human safety is of paramount importance in warehouse environments that feature mixed traffic of human workers and autonomous mobile robots (AMRs). Current approaches often treat humans as generic dynamic obstacles, leading to conservative AMR behaviors like slowing down or detouring, even when workers are fully aware and capable of safely sharing space. This paper presents a real-time vision-based method to estimate human awareness of an AMR using a single RGB camera. We integrate state-of-the-art 3D human pose lifting with head orientation estimation to ascertain a human's position relative to the AMR and their viewing cone, thereby determining if the human is aware of the AMR. The entire pipeline is validated using synthetically generated data within NVIDIA Isaac Sim, a robust physics-accurate robotics simulation environment. Experimental results confirm that our system reliably detects human positions and their attention in real time, enabling AMRs to safely adapt their motion based on human awareness. This enhancement is crucial for improving both safety and operational efficiency in industrial and factory automation settings.

Chinese Translation

在同时有人工工人和自主移动机器人（AMRs）混合交通的仓库环境中，确保人类安全至关重要。目前的方法通常将人类视为一般动态障碍物，这导致AMR采取保守的行为，如减速或绕行，即使工人完全意识到并能够安全共享空间。本文提出了一种基于实时视觉的方法，通过单个RGB摄像头估计人类对AMR的意识。我们将最先进的3D人类姿态提升与头部方向估计相结合，以确定人类相对于AMR的位置及其视锥，从而判断人类是否意识到AMR的存在。整个流程在NVIDIA Isaac Sim这一强大的物理准确机器人仿真环境中使用合成生成的数据进行了验证。实验结果确认我们的系统能够实时可靠地检测人类的位置及其注意力，使AMR能够根据人类的意识安全地调整其运动。这一增强对于提高工业和工厂自动化环境中的安全性和操作效率至关重要。
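The viewing-cone test described in the abstract above can be sketched as a simple angular check. This is a minimal 2D ground-plane simplification, not the authors' pipeline: the function name `is_aware`, the flat-floor geometry, and the 60-degree half-angle are all assumptions made for illustration.

```python
import math


def is_aware(human_xy, gaze_deg, amr_xy, half_angle_deg=60.0):
    """Return True if the AMR lies inside the human's viewing cone.

    half_angle_deg is a hypothetical threshold, not a value from the paper.
    """
    dx = amr_xy[0] - human_xy[0]
    dy = amr_xy[1] - human_xy[1]
    if dx == 0 and dy == 0:
        return True  # degenerate case: both at the same position
    # Direction from the human to the AMR, in degrees.
    bearing = math.degrees(math.atan2(dy, dx))
    # Smallest signed difference between bearing and gaze, in [-180, 180).
    diff = (bearing - gaze_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= half_angle_deg
```

In a full system the gaze direction would come from the head-orientation estimate and the positions from the lifted 3D pose; here they are passed in directly.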

cs.CV / 3 / 2604.18632

StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network

StomaD2：一种基于扩散恢复检测网络的智能气孔表型分析一体化系统

Zhao, Quanling, Qin, Meng'en, Sun, Yanfeng, Miao, Yuan, Yang, Xiaohui

Abstract

Stomata play a crucial role in regulating plant physiological processes and reflecting environmental responses. However, accurate and high-throughput stomatal phenotyping remains challenging, as conventional approaches rely on destructive sampling and manual annotation, restricting large-scale and field deployment. To overcome these limitations, a noninvasive restoration-detection integrated framework, termed StomaD2, is developed to achieve accurate and fast stomatal phenotyping under complex imaging conditions. The framework incorporates a diffusion-based restoration module to recover degraded images and a specialized rotated object detection network tailored to the small, dense, and cluttered characteristics of stomata. The proposed network enhances feature representation through three key innovations: a column-wise structure for global feature interaction, context-aware resampling and reweighting mechanism to improve multi-scale consistency, and a feature reassembly module to boost discrimination against complex backgrounds. In extensive comparisons, StomaD2 demonstrated state-of-the-art performance. On public Maize and Wheat datasets, it achieved accuracies of 0.994 and 0.992, respectively, significantly outperforming existing benchmarks. When benchmarked against ten other advanced models, including Oriented Former and YOLOv12, StomaD2 achieved a top-tier F1-score/mAP of 0.989. The framework is integrated into a user-friendly, field-operable system that supports the fast extraction of eight stomatal phenotypes, such as density and conductance. Validated on more than 130 plant species, StomaD2's results highlight its strong generalizability and potential for large-scale phenotyping, plant physiology analysis, and precision agriculture applications.

Chinese Translation

气孔在调节植物生理过程和反映环境响应方面发挥着至关重要的作用。然而，准确且高通量的气孔表型分析仍然面临挑战，因为传统方法依赖于破坏性取样和人工标注，限制了大规模和现场应用。为克服这些限制，开发了一种非侵入性的恢复检测集成框架，称为StomaD2，以在复杂成像条件下实现准确和快速的气孔表型分析。该框架结合了一个基于扩散的恢复模块，用于恢复退化图像，以及一个专门为气孔的小型、密集和杂乱特征量身定制的旋转目标检测网络。所提出的网络通过三个关键创新增强了特征表示：列式结构用于全局特征交互、上下文感知的重采样和重加权机制以改善多尺度一致性，以及特征重组模块以增强对复杂背景的区分能力。在广泛的比较中，StomaD2展示了最先进的性能。在公共的玉米和小麦数据集上，它分别达到了0.994和0.992的准确率，显著优于现有基准。当与包括Oriented Former和YOLOv12在内的十个其他先进模型进行基准测试时，StomaD2达到了顶级的F1-score/mAP为0.989。该框架集成到一个用户友好、可在现场操作的系统中，支持快速提取八种气孔表型，如密度和导度。在超过130种植物物种上进行验证，StomaD2的结果突显了其强大的泛化能力和在大规模表型分析、植物生理分析和精准农业应用中的潜力。

cs.CV / 4 / 2604.18648

DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax

DanceCrafter：基于文本驱动的细粒度可控舞蹈生成与编舞语法

Yuan, Hang, Hu, Xiaolin, Wan, Yan, Gao, Menglin, Yu, Wenzhe, Huang, Cong, Xu, Fei, Li, Qing, Wang, Christina Dan, Yu, Zhou, Chen, Kai

Abstract

Text-driven controllable dance generation remains under-explored, primarily due to the severe scarcity of high-quality datasets and the inherent difficulty of articulating complex choreographies. Characterizing dance is particularly challenging owing to its intricate spatial dynamics, strong directionality, and the highly decoupled movements of distinct body parts. To overcome these bottlenecks, we bridge principles from dance studies, human anatomy, and biomechanics to propose Choreographic Syntax, a novel theoretical framework with a tailored annotation system. Grounded in this syntax, we combine professional dance archives with high-fidelity motion capture data to construct DanceFlow, the most fine-grained dance dataset to date. It encompasses 41 hours of high-quality motions paired with 6.34 million words of detailed descriptions. At the model level, we introduce DanceCrafter, a tailored motion transformer built upon the Momentum Human Rig. To circumvent optimization instabilities, we construct a continuous manifold motion representation paired with a hybrid normalization strategy. Furthermore, we design an anatomy-aware loss to explicitly regulate the decoupled nature of body parts. Together, these adaptations empower DanceCrafter to achieve the high-fidelity and stable generation of complex dance sequences. Extensive evaluations and user studies demonstrate our state-of-the-art performance in motion quality, fine-grained controllability, and generation naturalness.

Chinese Translation

基于文本驱动的可控舞蹈生成仍然未得到充分探索，主要原因在于高质量数据集的严重匮乏以及复杂编舞的固有难度。舞蹈的特征化尤其具有挑战性，因为其复杂的空间动态、强烈的方向性以及不同身体部位高度解耦的运动。为了克服这些瓶颈，我们结合舞蹈研究、人类解剖学和生物力学的原则，提出了编舞语法（Choreographic Syntax），这是一个具有定制注释系统的新理论框架。在这一语法的基础上，我们结合专业舞蹈档案和高保真动作捕捉数据，构建了DanceFlow，迄今为止最细粒度的舞蹈数据集。该数据集包含41小时的高质量动作和634万字的详细描述。在模型层面，我们引入了DanceCrafter，一个基于Momentum Human Rig构建的定制运动变换器。为了避免优化不稳定性，我们构建了一个连续流形运动表示，并配备混合归一化策略。此外，我们设计了一种关注解剖结构的损失函数，以明确调节身体部位的解耦特性。这些适应性设计使DanceCrafter能够实现复杂舞蹈序列的高保真和稳定生成。广泛的评估和用户研究证明了我们在运动质量、细粒度可控性和生成自然性方面的先进性能。

cs.CV / 5 / 2604.18713

Align then Refine: Text-Guided 3D Prostate Lesion Segmentation

对齐再精炼：文本引导的三维前列腺病变分割

Sun, Cuiling, Peng, Linkai, Murphy, Adam, Keles, Elif, Patel, Hiten D., Ross, Ashley, Miller, Frank, Turkbey, Baris, Bejar, Andrea Mia, Aktas, Halil Ertugrul, Durak, Gorkem, Bagci, Ulas

Abstract

Automated 3D segmentation of prostate lesions from biparametric MRI (bp-MRI) is essential for reliable algorithmic analysis, but achieving high precision remains challenging. Volumetric methods must combine multiple modalities while ensuring anatomical consistency, but current models struggle to integrate cross-modal information reliably. While vision-language models (VLMs) are replacing the currently used architectural designs, they still lack the fine-grained, lesion-level semantics required for effective localized guidance. To address these limitations, we propose a new multi-encoder U-Net architecture incorporating three key innovations: (1) an alignment loss that enhances foreground text-image similarity to inject lesion semantics; (2) a heatmap loss that calibrates the similarity map and suppresses spurious background activations; and (3) a final-stage, confidence-gated multi-head cross-attention refiner that performs localized boundary edits in high-confidence regions. A phase-scheduled training regime stabilizes the optimization of these components. Our method consistently outperforms prior approaches, establishing a new state-of-the-art on the PI-CAI dataset through enhanced multi-modal fusion and localized text guidance. Our code is available at https://github.com/NUBagciLab/Prostate-Lesion-Segmentation.

Chinese Translation

从双参数MRI（bp-MRI）中自动进行前列腺病变的三维分割对于可靠的算法分析至关重要，但实现高精度仍然具有挑战性。体积方法必须结合多种模态，同时确保解剖一致性，但当前模型在可靠地整合跨模态信息方面存在困难。尽管视觉-语言模型（VLMs）正在取代当前使用的架构设计，但它们仍然缺乏有效局部引导所需的细粒度病变级语义。为了解决这些限制，我们提出了一种新的多编码器U-Net架构，结合了三个关键创新：（1）增强前景文本-图像相似性的对齐损失，以注入病变语义；（2）校准相似性图并抑制虚假背景激活的热图损失；（3）在高置信度区域执行局部边界编辑的最终阶段置信度门控多头交叉注意力精炼器。阶段调度的训练机制稳定了这些组件的优化。我们的方法在PI-CAI数据集上始终优于先前的方法，通过增强的多模态融合和局部文本引导建立了新的最先进水平。我们的代码可在 https://github.com/NUBagciLab/Prostate-Lesion-Segmentation 获取。

cs.CV / 6 / 2604.18725

Colour Extraction Pipeline for Odonates using Computer Vision

基于计算机视觉的蜻蜓目昆虫色彩提取流程

Rajaraman, Megan Mirnalini Sundaram, Verbeek, Fons J., Kalkman, Vincent J., Pucci, Rita

Abstract

The correlation between insect morphological traits and climate has been documented in physiological studies, but such studies remain limited by the time-consuming nature of the data analysis. In particular, the open source datasets often lack annotations of species' morphological traits, making dedicated annotations campaigns necessary; these efforts are typically local in scale and costly. In this paper, we propose a pipeline to identify and segment body parts of Odonates (dragonflies and damselflies) using deep neural networks, with the ultimate goal of extracting body parts' colouration. The pipeline is trained on a limited annotated dataset and refined with pseudo supervised data. We show that, by using open source images from citizen science platforms, our approach can segment each visible subject (Odonates) into head, thorax, abdomen, and wings and then extract a colour palette for each body part. This will enable large-scale statistical analysis of ecological correlations (e.g., between colouration and climate change, habitat loss, or geolocation) which are crucial for quantifying and assessing ecosystem biodiversity status.

Chinese Translation

昆虫形态特征与气候之间的相关性已在生理学研究中得到证实，但此类研究仍受到数据分析耗时的限制。特别是，开源数据集通常缺乏物种形态特征的注释，这使得专门的注释工作变得必要；这些工作通常规模有限且成本高昂。本文提出了一种流程，通过深度神经网络识别和分割蜻蜓目昆虫（蜻蜓和豆娘）的身体部位，最终目标是提取身体部位的色彩。该流程在有限的注释数据集上进行训练，并通过伪监督数据进行优化。我们展示了通过使用来自公民科学平台的开源图像，我们的方法能够将每个可见对象（蜻蜓目昆虫）分割为头部、胸部、腹部和翅膀，并为每个身体部位提取色彩调色板。这将使大规模的生态相关性统计分析成为可能（例如，色彩与气候变化、栖息地丧失或地理位置之间的关系），这些分析对于量化和评估生态系统生物多样性状态至关重要。
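The abstract above does not say how the per-part colour palette is computed. As one plausible illustration, the sketch below quantises the RGB values of the pixels inside a body-part mask and returns the most frequent bins. The function name `palette`, the bin width `step`, and the histogram approach itself are assumptions, not the authors' method.

```python
from collections import Counter


def palette(pixels, mask, k=3, step=32):
    """Return the k most frequent quantised RGB colours among masked pixels.

    pixels: sequence of (r, g, b) tuples with 0-255 channels.
    mask:   sequence of truthy/falsy flags, one per pixel (e.g. 'thorax').
    step:   hypothetical quantisation bin width per channel.
    """
    counts = Counter()
    for (r, g, b), keep in zip(pixels, mask):
        if keep:
            # Snap each channel down to the start of its bin.
            counts[(r // step * step, g // step * step, b // step * step)] += 1
    return [colour for colour, _ in counts.most_common(k)]
```

Running this once per segmented part (head, thorax, abdomen, wings) would give one small palette per part, which is the kind of output the pipeline is described as producing.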

cs.CV / 7 / 2604.18740

Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control

基于自主骨骼标志定位的智能C臂控制

Jung, Jay, Arrabi, Ahmad, Luo, Jax, Raymond, Scott, Wshah, Safwan

Abstract

Purpose: Automated C-arm positioning ensures timely treatment in patients requiring emergent interventions. When a conventional Deep Learning (DL) approach for C-arm control fails, clinicians must revert to manual operation, resulting in additional delays. Consequently, an agentic C-arm control framework based on multimodal large language models (MLLMs) is highly desirable, as it can incorporate clinician feedback and use reasoning to make adjustments toward more accurate positioning. Skeletal landmark localization is essential for C-arm control, and we investigate adapting MLLMs for autonomous landmark localization. Methods: We used an annotated synthetic X-ray dataset and a real X-ray dataset. Each X-ray in both datasets is paired with several skeletal landmarks. We fine-tuned two MLLMs and tasked them with retrieving the closest landmarks from each X-ray. Quantitative evaluations of landmark localization were performed and compared against a leading DL approach. We further conducted qualitative experiments demonstrating: (1) how an MLLM can correct an initially incorrect prediction through reasoning, and (2) how the MLLM can sequentially navigate the C-arm toward a target location. Results: On both datasets, fine-tuned MLLMs demonstrate competitive performance across all localization tasks when compared with the DL approach. In the qualitative experiments, the MLLMs provide evidence of reasoning and spatial awareness. Conclusion: This study shows that fine-tuned MLLMs achieve accurate skeletal landmark localization and hold promise for agentic autonomous C-arm control. Our code is available at https://github.com/marszzibros/C-arm-localization-LLMs.git

Chinese Translation

目的：自动化C臂定位确保需要紧急干预的患者能够及时接受治疗。当传统的深度学习（Deep Learning, DL）方法在C臂控制中失败时，临床医生必须回归手动操作，从而导致额外的延误。因此，基于多模态大型语言模型（Multimodal Large Language Models, MLLMs）的智能C臂控制框架是非常理想的，因为它能够结合临床医生的反馈并利用推理进行调整，以实现更准确的定位。骨骼标志定位对于C臂控制至关重要，我们研究了如何调整MLLMs以实现自主标志定位。方法：我们使用了一个带注释的合成X射线数据集和一个真实的X射线数据集。两个数据集中的每个X射线都与多个骨骼标志配对。我们对两个MLLMs进行了微调，并要求它们从每个X射线中检索最接近的标志。对标志定位的定量评估进行了实施，并与领先的DL方法进行了比较。我们进一步进行了定性实验，展示了：（1）MLLM如何通过推理纠正最初错误的预测，以及（2）MLLM如何顺序导航C臂到达目标位置。结果：在两个数据集上，微调后的MLLM在所有定位任务中表现出与DL方法相当的竞争性能。在定性实验中，MLLM提供了推理和空间意识的证据。结论：本研究表明，微调后的MLLM能够实现准确的骨骼标志定位，并在智能自主C臂控制方面展现出良好的前景。我们的代码可在https://github.com/marszzibros/C-arm-localization-LLMs.git获取。

cs.CV / 8 / 2604.18744

Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras

匹配任意事件：面向事件相机的零样本、运动鲁棒的宽基线特征匹配

Zhang, Ruijun, Su, Hang, Daniilidis, Kostas, Wang, Ziyun

Abstract

Event cameras have recently shown promising capabilities in instantaneous motion estimation due to their robustness to low light and fast motions. However, computing wide-baseline correspondence between two arbitrary views remains a significant challenge, since event appearance changes substantially with motion, and learning-based approaches are constrained by both scalability and limited wide-baseline supervision. We therefore introduce the first event matching model that achieves cross-dataset wide-baseline correspondence in a zero-shot manner: a single model trained once is deployed on unseen datasets without any target-domain fine-tuning or adaptation. To enable this capability, we introduce a motion-robust and computationally efficient attention backbone that learns multi-timescale features from event streams, augmented with sparsity-aware event token selection, making large-scale training on diverse wide-baseline supervision computationally feasible. To provide the supervision needed for wide-baseline generalization, we develop a robust event motion synthesis framework to generate large-scale event-matching datasets with augmented viewpoints, modalities, and motions. Extensive experiments across multiple benchmarks show that our framework achieves a 37.7% improvement over the previous best event feature matching methods. Code and data are available at: https://github.com/spikelab-jhu/Match-Any-Events.

Chinese Translation

事件相机由于其对低光照和快速运动的鲁棒性，最近在瞬时运动估计方面展现了良好的能力。然而，在两个任意视图之间计算宽基线对应关系仍然是一个重大挑战，因为事件外观随着运动而发生显著变化，而基于学习的方法受到可扩展性和有限宽基线监督的限制。因此，我们提出了第一个事件匹配模型，该模型以零样本方式实现跨数据集的宽基线对应：一个训练一次的单一模型可以在未见过的数据集上部署，而无需任何目标领域的微调或适应。为了实现这一能力，我们引入了一种运动鲁棒且计算高效的注意力骨干网络，该网络从事件流中学习多时间尺度特征，并结合稀疏感知的事件标记选择，使得在多样化的宽基线监督下进行大规模训练在计算上变得可行。为了提供宽基线泛化所需的监督，我们开发了一个鲁棒的事件运动合成框架，以生成具有增强视角、模态和运动的大规模事件匹配数据集。在多个基准上的广泛实验表明，我们的框架相较于此前最佳的事件特征匹配方法实现了37.7%的提升。代码和数据可在以下网址获取：https://github.com/spikelab-jhu/Match-Any-Events。

cs.CV / 9 / 2604.18745

DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation

DeltaSeg：用于多类别结构缺陷分割的分层注意力与深度增量学习

Noguera, Enrique Hernandez, Ferdaus, Md Meftahul, Ioup, Elias, Abdelguerfi, Mahdi

Abstract

Automated segmentation of structural defects from visual inspection imagery remains challenging due to the diversity of damage types, extreme class imbalance, and the need for precise boundary delineation. This paper presents DeltaSeg, a U-shaped encoder-decoder architecture with a tiered attention strategy that integrates Squeeze-and-Excitation (SE) channel attention in the encoder, Coordinate Attention at the bottleneck and decoder, and a novel Deep Delta Attention (DDA) mechanism in the skip connections. The encoder uses depthwise separable convolutions with dilated stages to maintain spatial resolution while expanding the receptive field. Atrous Spatial Pyramid Pooling (ASPP) at the bottleneck captures multi-scale context. The DDA module refines skip connections through a dual-path scheme combining a learned delta operator for nuisance feature suppression with spatial attention gates conditioned on decoder signals. Deep supervision through multi-scale auxiliary heads further strengthens gradient flow and encourages semantically meaningful features at intermediate decoder stages. We evaluate DeltaSeg on two datasets: the S2DS dataset (7 classes) and the Culvert-Sewer Defect Dataset (CSDD, 9 classes). Across both benchmarks, DeltaSeg consistently outperforms 12 competing architectures including U-Net, SA-UNet, UNet3+, SegFormer, Swin-UNet, EGE-UNet, FPN, and Mobile-UNETR, demonstrating strong generalization across damage types, imaging conditions, and structural geometries.

Chinese Translation

从视觉检查图像中自动分割结构缺陷仍然具有挑战性，原因在于损伤类型的多样性、极端的类别不平衡以及对精确边界划定的需求。本文提出了DeltaSeg，一种U型编码器-解码器架构，采用分层注意力策略，将Squeeze-and-Excitation (SE) 通道注意力集成在编码器中，在瓶颈和解码器中使用坐标注意力，并在跳跃连接中引入了一种新颖的深度增量注意力 (DDA) 机制。编码器使用带有扩张阶段的深度可分离卷积，以保持空间分辨率的同时扩展感受野。瓶颈处的Atrous Spatial Pyramid Pooling (ASPP) 捕获多尺度上下文。DDA模块通过一种双路径方案精炼跳跃连接，该方案结合了用于抑制干扰特征的学习增量算子和基于解码器信号的空间注意力门。通过多尺度辅助头进行深度监督进一步增强了梯度流动，并鼓励在中间解码器阶段提取语义上有意义的特征。我们在两个数据集上评估DeltaSeg：S2DS数据集（7个类别）和Culvert-Sewer Defect Dataset (CSDD, 9个类别)。在这两个基准测试中，DeltaSeg始终优于包括U-Net、SA-UNet、UNet3+、SegFormer、Swin-UNet、EGE-UNet、FPN和Mobile-UNETR在内的12种竞争架构，展示了在不同损伤类型、成像条件和结构几何形状下的强大泛化能力。

cs.CV / 10 / 2604.18747

URoPE: Universal Relative Position Embedding across Geometric Spaces

URoPE：跨几何空间的通用相对位置嵌入

Xie, Yichen, Meng, Depu, Peng, Chensheng, Hu, Yihan, Herau, Quentin, Tomizuka, Masayoshi, Zhan, Wei

Abstract

Relative position embedding has become a standard mechanism for encoding positional information in Transformers. However, existing formulations are typically limited to a fixed geometric space, namely 1D sequences or regular 2D/3D grids, which restricts their applicability to many computer vision tasks that require geometric reasoning across camera views or between 2D and 3D spaces. To address this limitation, we propose URoPE, a universal extension of Rotary Position Embedding (RoPE) to cross-view or cross-dimensional geometric spaces. For each key/value image patch, URoPE samples 3D points along the corresponding camera ray at predefined depth anchors and projects them into the query image plane. Standard 2D RoPE can then be applied using the projected pixel coordinates. URoPE is a parameter-free and intrinsics-aware relative position embedding that is invariant to the choice of global coordinate systems, while remaining fully compatible with existing RoPE-optimized attention kernels. We evaluate URoPE as a plug-in positional encoding for transformer architectures across a diverse set of tasks, including novel view synthesis, 3D object detection, object tracking, and depth estimation, covering 2D-2D, 2D-3D, and temporal scenarios. Experiments show that URoPE consistently improves the performance of transformer-based models across all tasks, demonstrating its effectiveness and generality for geometric reasoning. Our project website is: https://urope-pe.github.io/.

Chinese Translation

相对位置嵌入已成为在Transformer中编码位置信息的标准机制。然而，现有的公式通常仅限于固定的几何空间，即一维序列或规则的二维/三维网格，这限制了它们在许多需要跨摄像机视角或二维与三维空间之间进行几何推理的计算机视觉任务中的适用性。为了解决这一限制，我们提出了URoPE，它是对旋转位置嵌入（Rotary Position Embedding, RoPE）的通用扩展，适用于跨视角或跨维度的几何空间。对于每个键/值图像块，URoPE在预定义的深度锚点沿相应的摄像机光线采样三维点，并将其投影到查询图像平面。然后可以使用投影的像素坐标应用标准的二维RoPE。URoPE是一种无参数且感知相机内参的相对位置嵌入，对全局坐标系统的选择不变，同时与现有的RoPE优化注意力核完全兼容。我们将URoPE作为插件位置编码在多种任务的Transformer架构中进行评估，包括新视图合成、三维物体检测、物体跟踪和深度估计，涵盖二维-二维、二维-三维和时间场景。实验表明，URoPE在所有任务中持续提升了基于Transformer的模型性能，证明了其在几何推理中的有效性和普适性。我们的项目网站是：https://urope-pe.github.io/。
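The geometric step in the abstract above (sample 3D points along a key/value patch's camera ray at depth anchors, then project them into the query image) can be sketched with a plain pinhole model. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `project_anchors`, the `(fx, fy, cx, cy)` intrinsics layout, and the source-to-query pose convention are all hypothetical.

```python
def project_anchors(uv, depths, k_src, k_qry, rot, trans):
    """Project depth anchors on a source-pixel ray into the query image.

    uv:     (u, v) pixel centre of the key/value patch in the source image.
    depths: iterable of predefined depth anchors along the ray.
    k_src, k_qry: pinhole intrinsics as (fx, fy, cx, cy).
    rot, trans:   3x3 nested list and 3-vector mapping source-camera
                  coordinates to query-camera coordinates (assumed convention).
    Returns one (u', v') per anchor, or None if the point is behind the camera.
    """
    fx, fy, cx, cy = k_src
    qfx, qfy, qcx, qcy = k_qry
    # Back-project the pixel to a ray with unit depth along the optical axis.
    ray = ((uv[0] - cx) / fx, (uv[1] - cy) / fy, 1.0)
    out = []
    for z in depths:
        p = [z * c for c in ray]  # 3D point in the source camera frame
        q = [sum(rot[i][j] * p[j] for j in range(3)) + trans[i] for i in range(3)]
        if q[2] <= 0:
            out.append(None)
            continue
        out.append((qfx * q[0] / q[2] + qcx, qfy * q[1] / q[2] + qcy))
    return out
```

Because only the relative pose between the two cameras enters the computation, the projected coordinates do not depend on any global coordinate frame, which matches the invariance the abstract claims. The resulting pixel coordinates are what a standard 2D RoPE would then consume.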

cs.CV / 11 / 2604.18757

REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction

REVEAL：视网膜形态测量与临床风险的多模态视觉-语言对齐用于阿尔茨海默病和痴呆症的发生预测

Leem, Seowung, Gu, Lin, You, Chenyu, Gong, Kuang, Fang, Ruogu

Abstract

The retina provides a unique, noninvasive window into Alzheimer's disease (AD) and dementia, capturing early structural changes through morphometric features, while systemic and lifestyle risk factors reflect well-established contributors to disease susceptibility long before clinical symptom onset. However, current retinal analysis frameworks typically model imaging and risk factors separately, limiting their ability to capture joint multimodal patterns critical for early risk prediction. Moreover, existing methods rarely incorporate mechanisms to organize or align patients with similar retinal and clinical characteristics, constraining the learning of coherent cross-modal associations. To address these limitations, we introduce REVEAL (REtinal-risk Vision-Language Early Alzheimer's Learning), a framework that aligns color fundus photographs with individualized disease-specific risk profiles for predicting incident AD and dementia, on average 8 years before diagnosis (range: 1-11 years). Because real-world risk factors are structured questionnaire data, we translate them into clinically interpretable narratives compatible with pretrained vision-language models (VLMs). We further propose a group-aware contrastive learning (GACL) strategy that clusters patients with similar retinal morphometry and risk factors as positive pairs, strengthening multimodal alignment. This unified representation learning framework substantially outperforms state-of-the-art retinal imaging models paired with clinical text encoders, as well as general-purpose VLMs, demonstrating the value of jointly modeling retinal biomarkers and clinical risk factors. By providing a generalizable and noninvasive approach for early AD and dementia risk stratification, REVEAL has the potential to enable earlier intervention and improve preventive care at the population level.

Chinese Translation

视网膜提供了一个独特的、非侵入性的窗口，用于观察阿尔茨海默病（AD）和痴呆症，通过形态特征捕捉早期结构变化，而系统性和生活方式风险因素则反映了在临床症状出现之前，已确立的疾病易感性贡献者。然而，目前的视网膜分析框架通常将影像学和风险因素分开建模，限制了其捕捉对早期风险预测至关重要的联合多模态模式的能力。此外，现有方法很少纳入组织或对齐具有相似视网膜和临床特征的患者的机制，从而限制了连贯的跨模态关联的学习。为了解决这些局限性，我们提出了REVEAL（REtinal-risk Vision-Language Early Alzheimer's Learning），一个将彩色眼底照片与个体化疾病特定风险概况对齐的框架，用于预测发生的AD和痴呆症，平均在诊断前8年（范围：1-11年）。由于现实世界的风险因素是结构化的问卷数据，我们将其转化为与预训练的视觉-语言模型（VLMs）兼容的临床可解释叙述。我们进一步提出了一种群体感知对比学习（GACL）策略，将具有相似视网膜形态测量和风险因素的患者聚类为正对，增强多模态对齐。该统一表示学习框架在与临床文本编码器配对的最先进视网膜影像模型以及通用VLMs相比，表现显著优越，展示了联合建模视网膜生物标志物和临床风险因素的价值。通过提供一种可推广的、非侵入性的早期AD和痴呆症风险分层方法，REVEAL有潜力实现更早的干预，并改善群体层面的预防护理。

cs.CV / 12 / 2604.18781

CAHAL: Clinically Applicable resolution enHAncement for Low-resolution MRI scans

CAHAL：适用于低分辨率MRI扫描的临床应用分辨率增强

Morell-Ortega, Sergio, González-Cebrián, Ángela, Mansencal, Boris, Gadea, Marien, Vivo-Hernando, Roberto, Rubio, Gregorio, Aparici, Fernando, de la Iglesia-Vaya, Maria, Catheline, Gwenaelle, Coupé, Pierrick, Manjón, José V.

Abstract

Large-scale automated morphometric analysis of brain MRI is limited by the thick-slice, anisotropic acquisitions prevalent in routine clinical practice. Existing generative super-resolution (SR) methods produce visually compelling isotropic volumes but often introduce anatomical hallucinations, systematic volumetric overestimation, and structural distortions that compromise downstream quantitative analysis and diagnostic safety. To address this, we propose CAHAL (Clinically Applicable resolution enHAncement for Low-resolution MRI scans), a hallucination-robust, physics-informed resolution enhancement framework that operates directly in the patient's native acquisition space. CAHAL employs a deterministic bivariate Mixture of Experts (MoE) architecture routing each input through specialised residual 3D U-Net experts conditioned on both volumetric resolution and acquisition anisotropy, two independent descriptors of clinical MRI acquisition. Experts are optimised with a composite loss combining edge-penalised spatial reconstruction, Fourier-domain spectral coherence matching, and a segmentation-guided semantic consistency constraint. Training pairs are generated on-the-fly via physics-based degradation sampled from a large-scale real-world database, ensuring robust generalisation. Validated on T1-weighted and FLAIR sequences against generative baselines, CAHAL achieves state-of-the-art results, improving the best related methods in terms of accuracy and efficiency.

Chinese Translation

大规模自动化脑MRI形态测量分析受到常规临床实践中普遍存在的厚切片和各向异性采集的限制。现有的生成超分辨率（SR）方法虽然能够生成视觉上令人信服的各向同性体积，但往往会引入解剖幻觉、系统性的体积过估计以及结构扭曲，从而影响后续的定量分析和诊断安全性。为了解决这一问题，我们提出了CAHAL（适用于低分辨率MRI扫描的临床应用分辨率增强），这是一个抗幻觉、基于物理的分辨率增强框架，能够直接在患者的原始采集空间中操作。CAHAL采用确定性的双变量专家混合（Mixture of Experts, MoE）架构，将每个输入通过专门的残差3D U-Net专家进行路由，这些专家根据体积分辨率和采集各向异性这两个临床MRI采集的独立描述符进行条件化。专家通过复合损失进行优化，该损失结合了边缘惩罚的空间重建、傅里叶域谱相干匹配和分割引导的语义一致性约束。训练对通过基于物理的降解实时生成，样本来自一个大规模的真实世界数据库，确保了稳健的泛化能力。在T1加权和FLAIR序列上与生成基线进行验证，CAHAL在准确性和效率方面达到了最先进的结果，超越了最佳相关方法。

cs.CV / 13 / 2604.18790

EfficientPENet: Real-Time Depth Completion from Sparse LiDAR via Lightweight Multi-Modal Fusion

EfficientPENet：通过轻量级多模态融合实现稀疏LiDAR的实时深度补全

Lopez, Johny J., Ferdaus, Md Meftahul, Abdelguerfi, Mahdi, Netchaev, Anton, Sloan, Steven, Pathak, Ken, Niles, Kendall N.

Abstract

Depth completion from sparse LiDAR measurements and corresponding RGB images is a prerequisite for accurate 3D perception in robotic systems. Existing methods achieve high accuracy on standard benchmarks but rely on heavy backbone architectures that preclude real-time deployment on embedded hardware. We present EfficientPENet, a two-branch depth completion network that replaces the conventional ResNet encoder with a modernized ConvNeXt backbone, introduces sparsity-invariant convolutions for the depth stream, and refines predictions through a Convolutional Spatial Propagation Network (CSPN). The RGB branch leverages ImageNet-pretrained ConvNeXt blocks with Layer Normalization, 7x7 depthwise convolutions, and stochastic depth regularization. Features from both branches are merged via late fusion and decoded through a multi-scale deep supervision strategy. We further introduce a position-aware test-time augmentation scheme that corrects coordinate tensors during horizontal flipping, yielding consistent error reduction at inference. On the KITTI depth completion benchmark, EfficientPENet achieves an RMSE of 631.94 mm with 36.24M parameters and a latency of 20.51 ms, operating at 48.76 FPS. This represents a 3.7 times reduction in parameters and a 23 times speedup relative to BP-Net, while maintaining competitive accuracy. These results establish EfficientPENet as a practical solution for real-time depth completion on resource-constrained edge platforms such as the NVIDIA Jetson.

Chinese Translation

从稀疏LiDAR测量和相应的RGB图像中进行深度补全是机器人系统中准确3D感知的前提。现有方法在标准基准上实现了高精度，但依赖于重型主干架构，限制了其在嵌入式硬件上的实时部署。我们提出了EfficientPENet，这是一种双分支深度补全网络，采用现代化的ConvNeXt主干替代传统的ResNet编码器，为深度流引入稀疏不变卷积，并通过卷积空间传播网络（CSPN）精炼预测。RGB分支利用在ImageNet上预训练的ConvNeXt模块，结合层归一化、7x7深度卷积和随机深度正则化。两个分支的特征通过后期融合进行合并，并通过多尺度深度监督策略进行解码。我们进一步引入了一种位置感知的测试时增强方案，在水平翻转过程中修正坐标张量，从而在推理时实现一致的误差减少。在KITTI深度补全基准上，EfficientPENet实现了631.94 mm的均方根误差（RMSE），参数量为36.24M，延迟为20.51 ms，帧率为48.76 FPS。这相较于BP-Net在参数量上减少了3.7倍，在速度上提升了23倍，同时保持了竞争力的精度。这些结果确立了EfficientPENet作为在资源受限的边缘平台（如NVIDIA Jetson）上进行实时深度补全的实用解决方案。
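The latency and throughput figures quoted in the abstract above are consistent with each other, assuming frames are processed one at a time so that throughput is simply the reciprocal of latency. A quick check (the helper name `fps_from_latency_ms` is made up for this example):

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


fps = fps_from_latency_ms(20.51)
print(round(fps, 2))  # 48.76
```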

cs.CV / 14 / 2604.18797

CrossPan: A Comprehensive Benchmark for Cross-Sequence Pancreas MRI Segmentation and Generalization

CrossPan：跨序列胰腺MRI分割与泛化的综合基准

Peng, Linkai, Sun, Cuiling, Zhang, Zheyuan, Dou, Wanying, Aktas, Halil Ertugrul, Bejar, Andrea M, Keles, Elif, Gonda, Tamas, Wallace, Michael B, Zhou, Zongwei, Durak, Gorkem, Keswani, Rajesh N, Bagci, Ulas

Abstract

Automatic pancreas segmentation is fundamental to abdominal MRI analysis, yet deep learning models trained on one MRI sequence often fail catastrophically when applied to another, a challenge that has received little systematic investigation. We introduce CrossPan, a multi-institutional benchmark comprising 1,386 3D scans across three routinely acquired sequences (T1-weighted, T2-weighted, and Out-of-Phase) from eight centers. Our experiments reveal three key findings. First, cross-sequence domain shifts are far more severe than cross-center variability: models achieving Dice scores above 0.85 in-domain collapse to near-zero (<0.02) when transferred across sequences. Second, state-of-the-art domain generalization methods provide negligible benefit under these physics-driven contrast inversions, whereas foundation models like MedSAM2 maintain moderate zero-shot performance through contrast-invariant shape priors. Third, semi-supervised learning offers gains only under stable intensity distributions and becomes unstable on sequences with high intra-organ variability. These results establish cross-sequence generalization, rather than model architecture or center diversity, as the primary barrier to clinically deployable pancreas MRI segmentation. Dataset and code are available at https://crosspan.netlify.app/.

Chinese Translation

自动胰腺分割是腹部MRI分析的基础，但在一种MRI序列上训练的深度学习模型在应用于另一种序列时往往会出现灾难性失败，这一挑战尚未得到系统性的研究。我们引入了CrossPan，这是一个多机构基准，包含来自八个中心的1,386个3D扫描，涵盖三种常规获取的序列（T1加权、T2加权和反相位）。我们的实验揭示了三个关键发现。首先，跨序列领域转移的严重性远超过跨中心的变异性：在领域内获得超过0.85的Dice分数的模型在跨序列转移时会降至接近零（<0.02）。其次，最先进的领域泛化方法在这些物理驱动的对比度反转下几乎没有益处，而像MedSAM2这样的基础模型通过对比度不变的形状先验保持了适度的零样本性能。第三，半监督学习仅在稳定的强度分布下提供收益，而在具有高器官内变异性的序列上则变得不稳定。这些结果表明，跨序列泛化（而非模型架构或中心多样性）是临床可部署胰腺MRI分割的主要障碍。数据集和代码可在 https://crosspan.netlify.app/ 获取。
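For readers unfamiliar with the Dice scores quoted in the abstract above (above 0.85 in-domain, below 0.02 across sequences), the metric is twice the overlap between predicted and reference masks divided by the sum of their sizes. A minimal sketch on flattened binary masks follows; returning 1.0 when both masks are empty is a convention assumed here, not something the abstract specifies.

```python
def dice(pred, target):
    """Dice coefficient between two flattened binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * inter / total
```

A score near zero therefore means the predicted pancreas region barely overlaps the reference at all, which is why the cross-sequence result is described as a collapse rather than a degradation.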

cs.CV / 15 / 2604.18803

LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

LLM作为评判框架评估视觉-语言模型中的语调诱导幻觉

Jiang, Zhiyuan, Hong, Weihao, Guan, Xinlei, Dhandu, Tejaswi, Li, Miles Q., Xu, Meng, Huang, Kuan, Tida, Umamaheswara Rao, Shen, Bingyu, Kwak, Daehan, Li, Boyang

Abstract

Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded linguistic pressure across structurally distinct task types. We present Ghost-100, a procedurally constructed benchmark of 800 synthetically generated images spanning eight categories across three task families -- text-illegibility, time-reading, and object-absence -- each designed under a negative-ground-truth principle that guarantees the queried target is absent, illegible, or indeterminate by construction. Every image is paired with five prompts drawn from a structured 5-Level Prompt Intensity Framework, holding the image and task identity fixed while varying only directive force, so that tone is isolated as the sole independent variable. We adopt a dual-track evaluation protocol: a rule-based H-Rate measuring the proportion of responses in which a model crosses from grounded refusal into unsupported positive commitment, and a GPT-4o-mini-judged H-Score on a 1-5 scale characterizing the confidence and specificity of fabrication once it occurs. We additionally release a three-stage automated validation workflow, which retrospectively confirms 717 of 800 images as strictly compliant. Evaluating nine open-weight VLMs, we find that H-Rate and H-Score dissociate substantially across model families, reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways, and several models exhibit non-monotonic sensitivity peaking at intermediate tone levels -- patterns that aggregate metrics obscure.

Chinese Translation

视觉-语言模型（VLMs）越来越多地应用于可靠的视觉基础具有操作性后果的场景中，但它们在逐渐强制的提示措辞下的行为仍然未被充分表征。现有的幻觉基准主要依赖于中性提示和二元检测，尚未探讨在结构上不同的任务类型中，虚构的发生率和强度如何响应分级语言压力。我们提出了Ghost-100，这是一个程序性构建的基准，包含800幅合成生成的图像，涵盖三个任务类别中的八个类别——文本不可读性、时间读取和物体缺失——每个类别均根据负真相原则设计，确保所查询的目标在构造上是缺失的、不可读的或不确定的。每幅图像配有五个提示，这些提示来自一个结构化的5级提示强度框架，固定图像和任务身份，仅变化指令力度，从而将语调孤立为唯一的自变量。我们采用双轨评估协议：基于规则的H-Rate测量模型从有根据的拒绝转向无支持的积极承诺的响应比例，以及基于GPT-4o-mini评判的H-Score，采用1-5的评分标准来表征一旦发生虚构的信心和特异性。此外，我们还发布了一个三阶段的自动验证工作流程，回顾性地确认了800幅图像中的717幅严格合规。评估九个开放权重的VLMs时，我们发现H-Rate和H-Score在模型家族之间显著分离，阅读风格和存在检测子集对提示压力的响应在定性上存在不同，并且几个模型在中间语调水平上表现出非单调敏感性——这些模式在聚合指标中被掩盖。
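The H-Rate described in the abstract above is a proportion, so it can be tallied per prompt-intensity level once each response has been labelled. The sketch below assumes that labelling has already happened: the rule that decides whether a response is a fabrication is not given in the abstract, and the function name `h_rate_by_level` and the record layout are hypothetical.

```python
from collections import defaultdict


def h_rate_by_level(records):
    """Fraction of fabricated responses at each tone level.

    records: iterable of (tone_level, fabricated) pairs, where fabricated is
    True when the response commits to an unsupported positive answer.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, fabricated in records:
        totals[level] += 1
        hits[level] += int(fabricated)
    return {level: hits[level] / totals[level] for level in sorted(totals)}
```

With 800 images and five prompts each, every model is scored on 4,000 image-prompt pairs, and the 717 validated images correspond to about 89.6% of the benchmark. Keeping the rates separate per level, as above, is what makes a non-monotonic peak at an intermediate tone level visible.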

cs.CV / 16 / 2604.18804

Geometric Decoupling: Diagnosing the Structural Instability of Latent

几何解耦：诊断潜在结构的不稳定性

Liang, Yuanbang, Chen, Zhengwen, Lai, Yu-Kun

Abstract

Latent Diffusion Models (LDMs) achieve high-fidelity synthesis but suffer from latent space brittleness, causing discontinuous semantic jumps during editing. We introduce a Riemannian framework to diagnose this instability by analyzing the generative Jacobian, decomposing geometry into Local Scaling (capacity) and Local Complexity (curvature). Our study uncovers a "Geometric Decoupling": while curvature in normal generation functionally encodes image detail, OOD generation exhibits a functional decoupling where extreme curvature is wasted on unstable semantic boundaries rather than perceptible details. This geometric misallocation identifies "Geometric Hotspots" as the structural root of instability, providing a robust intrinsic metric for diagnosing generative reliability.

Chinese Translation

潜在扩散模型（Latent Diffusion Models, LDMs）实现了高保真合成，但由于潜在空间的脆弱性，在编辑过程中会导致语义的非连续跳跃。我们引入了一个黎曼框架，通过分析生成雅可比矩阵来诊断这种不稳定性，将几何分解为局部缩放（容量）和局部复杂性（曲率）。我们的研究揭示了一个“几何解耦”：在正常生成中，曲率在功能上编码了图像细节，而在超出分布（Out-of-Distribution, OOD）生成中，表现出一种功能解耦，极端曲率被浪费在不稳定的语义边界上，而不是可感知的细节上。这种几何错误分配识别出“几何热点”，作为不稳定性的结构根源，为诊断生成可靠性提供了一个稳健的内在度量。

cs.CV / 17 / 2604.18829

DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning

DUALVISION：用于稳健视觉推理的RGB-红外多模态大语言模型

Majeedi, Abrar, Ruan, Zhiyuan, Zhao, Ziyi, Wang, Hongcheng, Lu, Jianglin, Li, Yin

Abstract

Multimodal large language models (MLLMs) have achieved impressive performance on visual perception and reasoning tasks with RGB imagery, yet they remain fragile under common degradations, such as fog, blur, or low-light conditions. Infrared (IR) imaging, a well-established complement to RGB, offers inherent robustness in these conditions, but its integration into MLLMs remains underexplored. To bridge this gap, we propose DUALVISION, a lightweight fusion module that efficiently incorporates IR-RGB information into MLLMs via patch-level localized cross-attention. To support training and evaluation and to facilitate future research, we also introduce DV-204K, a dataset of ~25K publicly available aligned IR-RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs designed for evaluating cross-modal reasoning. Leveraging these datasets, we benchmark both open- and closed-source MLLMs and demonstrate that DUALVISION delivers strong empirical performance under a wide range of visual degradations. Our code and dataset are available at https://abrarmajeedi.github.io/dualvision.

Chinese Translation

多模态大语言模型（MLLMs）在RGB图像的视觉感知和推理任务中取得了令人瞩目的表现，但在常见的退化条件下（如雾霾、模糊或低光照条件）仍然显得脆弱。红外（IR）成像作为RGB的一个成熟补充，在这些条件下提供了固有的鲁棒性，但其在MLLMs中的整合仍然未被充分探索。为了解决这一问题，我们提出了DUALVISION，一个轻量级融合模块，通过局部补丁级别的交叉注意力有效地将IR-RGB信息融入MLLMs。为了支持训练和评估，并促进未来的研究，我们还引入了DV-204K，一个包含约25K个公开可用的对齐IR-RGB图像对的数据集，配有204K个特定模态的问答注释，以及DV-500，一个包含500个IR-RGB图像对和500个问答对的基准，旨在评估跨模态推理。利用这些数据集，我们对开放源代码和闭源的MLLMs进行了基准测试，结果表明DUALVISION在各种视觉退化条件下都表现出强大的实证性能。我们的代码和数据集可在https://abrarmajeedi.github.io/dualvision获取。

cs.CV / 18 / 2604.18831

Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model

通过从视觉基础模型蒸馏实现室内逐帧激光雷达语义分割的可行性

Wu, Haiyang, Torres, Juan J. Gonzales, Vosselman, George, Lehtola, Ville

Abstract

Frame-wise semantic segmentation of indoor lidar scans is a fundamental step toward higher-level 3D scene understanding and mapping applications. However, acquiring frame-wise ground truth for training deep learning models is costly and time-consuming. For imagery, this challenge is largely addressed by Visual Foundation Models (VFMs), which segment image frames. The same VFMs may be used to train a lidar scan frame segmentation model via a 2D-to-3D distillation pipeline. The success of such distillation has been shown for autonomous driving scenes, but not yet for indoor scenes. Here, we study the feasibility of repeating this success for indoor scenes in a frame-wise distillation manner, by coupling each lidar scan with a VFM-processed camera image. The evaluation uses indoor SLAM datasets, where pseudo-labels are used for downstream evaluation. In addition, a small manually annotated lidar dataset is provided for validation, as there are no other frame-wise indoor lidar datasets with semantics. Results show that the distilled model achieves up to 56% mIoU under pseudo-label evaluation and around 36% mIoU with real labels, demonstrating the feasibility of cross-modal distillation for indoor lidar semantic segmentation without manual annotations.

Chinese Translation

室内激光雷达扫描的逐帧语义分割是实现更高层次的三维场景理解和映射应用的基础步骤。然而，获取逐帧的真实标签以训练深度学习模型既昂贵又耗时。对于图像数据，这一挑战在很大程度上通过视觉基础模型（Visual Foundation Models, VFMs）得以解决，这些模型能够对图像帧进行分割。相同的VFMs可以通过二维到三维的蒸馏管道来训练激光雷达扫描的帧分割模型。这种蒸馏在自动驾驶场景中已被证明成功，但在室内场景中尚未得到验证。在此，我们研究了在逐帧蒸馏的方式下，将每个激光雷达扫描与经过VFM处理的相机图像结合，以重复这一成功在室内场景中的可行性。评估使用室内SLAM数据集进行，其中伪标签用于下游评估。此外，提供了一个小型手动标注的激光雷达数据集用于验证，因为目前没有其他带语义的激光雷达逐帧室内数据集。结果表明，蒸馏模型在伪标签评估下达到了最高56%的mIoU，而在真实标签下约为36%的mIoU，证明了在没有手动标注的情况下，跨模态蒸馏在室内激光雷达语义分割中的可行性。
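
The two mIoU figures quoted in the abstract come from scoring the same predictions against different reference labels (pseudo-labels versus manual labels). A minimal sketch of the metric itself is given below, assuming per-point integer class labels and skipping classes that appear in neither prediction nor reference; the function name and the toy labels are illustrative and not taken from the paper.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes absent from both label sets
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy per-point labels (hypothetical): six lidar points, three classes
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Swapping `target` between pseudo-labels and manual annotations is all that distinguishes the two evaluation settings in this sketch.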

cs.CV / 19 / 2604.18842

Multi-Domain Learning with Global Expert Mapping

具有全球专家映射的多领域学习

Shamsolmoali, Pourya, Zareapoor, Masoumeh, Zhou, Huiyu, Mendez, Oscar, Tao, Dacheng, Li, Xuelong

Abstract

Human perception generalizes well across different domains, but most vision models struggle beyond their training data. This gap motivates multi-dataset learning, where a single model is trained on diverse datasets to improve robustness under domain shifts. However, unified training remains challenging due to inconsistencies in data distributions and label semantics. Mixture-of-Experts (MoE) models provide a scalable solution by routing inputs to specialized subnetworks (experts). Yet, existing MoEs often fail to specialize effectively, as their load-balancing mechanisms enforce uniform input distribution across experts. This fairness conflicts with domain-aware routing, causing experts to learn redundant representations and reducing performance, especially on rare or out-of-distribution domains. We propose GEM (Global Expert Mapping), a planner-compiler framework that replaces the learned router with a global scheduler. Our planner, based on linear programming relaxation, computes a fractional assignment of datasets to experts, while the compiler applies hierarchical rounding to convert this soft plan into a deterministic, capacity-aware mapping. Unlike prior MoEs, GEM avoids balancing loss, resolves the conflict between fairness and specialization, and produces interpretable routing. Experiments show that GEM-DINO achieves state-of-the-art performance on the UODB benchmark, with notable gains on underrepresented datasets, and resolves task interference in few-shot adaptation scenarios.

Chinese Translation

人类感知在不同领域之间具有良好的泛化能力，但大多数视觉模型在训练数据之外的表现较差。这一差距促使了多数据集学习的研究，即在多样化的数据集上训练单一模型，以提高在领域转移下的鲁棒性。然而，由于数据分布和标签语义的不一致，统一训练仍然面临挑战。专家混合模型（Mixture-of-Experts, MoE）提供了一种可扩展的解决方案，通过将输入路由到专门的子网络（专家）来实现。然而，现有的MoE往往无法有效地进行专业化，因为它们的负载均衡机制强制在专家之间实现均匀的输入分布。这种公平性与领域感知路由相冲突，导致专家学习冗余的表示，尤其是在稀有或分布外领域时降低了性能。我们提出了GEM（全球专家映射），一个规划-编译框架，用于用全局调度器替代学习到的路由器。我们的规划器基于线性规划松弛，计算数据集到专家的分数分配，而编译器则应用层次化舍入将这一软计划转换为确定性的、考虑容量的映射。与之前的MoE不同，GEM避免了平衡损失，解决了公平性与专业化之间的冲突，并生成可解释的路由。实验结果表明，GEM-DINO在UODB基准测试上达到了最先进的性能，在代表性不足的数据集上取得了显著提升，并解决了少样本适应场景中的任务干扰问题。
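
To make the planner-compiler idea more concrete, the sketch below takes a fractional dataset-to-expert plan and rounds it into a deterministic mapping that respects per-expert capacity. Several things here are assumptions: the fractional plan is written by hand rather than solved from a linear program, the greedy rule is a simple stand-in and not the hierarchical rounding the abstract refers to, and the dataset names, expert names, and capacities are hypothetical.

```python
def round_plan(plan, capacity):
    """Greedily round a fractional plan {dataset: {expert: weight}} into {dataset: expert}.

    Pairs are visited from largest to smallest weight; a dataset is assigned to the
    first expert that still has free capacity (capacity counts datasets per expert).
    """
    triples = sorted(
        ((weight, ds, ex) for ds, row in plan.items() for ex, weight in row.items()),
        key=lambda t: (-t[0], t[1], t[2]),  # descending weight, then stable tie-break
    )
    remaining = dict(capacity)
    mapping = {}
    for _weight, ds, ex in triples:
        if ds not in mapping and remaining.get(ex, 0) > 0:
            mapping[ds] = ex
            remaining[ex] -= 1
    return mapping


# Hypothetical fractional plan, e.g. the output of an LP relaxation
plan = {
    "aerial": {"e0": 0.9, "e1": 0.1},
    "medical": {"e0": 0.6, "e1": 0.4},
    "traffic": {"e0": 0.2, "e1": 0.8},
}
capacity = {"e0": 1, "e1": 2}
mapping = round_plan(plan, capacity)
```

In this toy case "medical" prefers expert e0, but e0 is already full after "aerial" is placed, so it falls back to e1. The result is a fixed lookup table, which is what makes this style of routing easy to inspect compared with a learned router.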

cs.CV / 20 / 2604.18853

DDF2Pol: A Dual-Domain Feature Fusion Network for PolSAR Image Classification

DDF2Pol：一种用于PolSAR图像分类的双域特征融合网络

Alkhatib, Mohammed Q.

Abstract

This paper presents DDF2Pol, a lightweight dual-domain convolutional neural network for PolSAR image classification. The proposed architecture integrates two parallel feature extraction streams, one real-valued and one complex-valued, designed to capture complementary spatial and polarimetric information from PolSAR data. To further refine the extracted features, a depth-wise convolution layer is employed for spatial enhancement, followed by a coordinate attention mechanism to focus on the most informative regions. Experimental evaluations conducted on two benchmark datasets, Flevoland and San Francisco, demonstrate that DDF2Pol achieves superior classification performance while maintaining low model complexity. Specifically, it attains an Overall Accuracy (OA) of 98.16% on the Flevoland dataset and 96.12% on the San Francisco dataset, outperforming several state-of-the-art real- and complex-valued models. With only 91,371 parameters, DDF2Pol offers a practical and efficient solution for accurate PolSAR image analysis, even when training data is limited. The source code is publicly available at https://github.com/mqalkhatib/DDF2Pol

Chinese Translation

本文提出了DDF2Pol，一种轻量级的双域卷积神经网络，用于PolSAR图像分类。所提架构集成了两个并行特征提取流，一个为实值流，另一个为复值流，旨在从PolSAR数据中捕获互补的空间和极化信息。为了进一步优化提取的特征，采用了深度卷积层进行空间增强，随后引入了坐标注意力机制，以关注最具信息量的区域。在Flevoland和San Francisco两个基准数据集上进行的实验评估表明，DDF2Pol在保持低模型复杂度的同时，实现了优越的分类性能。具体而言，在Flevoland数据集上获得了98.16%的总体准确率（Overall Accuracy, OA），在San Francisco数据集上获得了96.12%的OA，优于多种最先进的实值和复值模型。DDF2Pol仅有91,371个参数，为准确的PolSAR图像分析提供了一个实用且高效的解决方案，即使在训练数据有限的情况下。源代码已公开，地址为 https://github.com/mqalkhatib/DDF2Pol

cs.CV / 21 / 2604.18856

ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification

ConvVitMamba：用于高光谱图像分类的高效多尺度卷积、变换器和Mamba基础序列建模

Alkhatib, Mohammed Q.

Abstract

Hyperspectral image (HSI) classification remains challenging due to high spectral dimensionality, redundancy, and limited labeled data. Although convolutional neural networks (CNNs) and Vision Transformers (ViTs) achieve strong performance by exploiting spectral-spatial information and long-range dependencies, they often incur high computational cost and large model size, limiting practical use. To address these limitations, a unified hybrid framework, termed ConvVitMamba, is proposed for efficient HSI classification. The architecture integrates three components: a multiscale convolutional feature extractor to capture local spectral, spatial, and joint patterns; a Vision Transformer-based tokenization and encoding stage to model global contextual relationships; and a lightweight Mamba-inspired gated sequence mixing module for efficient content-aware refinement without quadratic self-attention. Principal Component Analysis (PCA) is used as preprocessing to reduce redundancy and improve efficiency. Experiments on four benchmark datasets, including Houston and three UAV-borne QUH datasets (Pingan, Qingyun, and Tangdaowan), demonstrate that ConvVitMamba consistently outperforms CNN-, Transformer-, and Mamba-based methods while maintaining a favorable balance between accuracy, model size, and inference efficiency. Ablation studies confirm the complementary contributions of all components. The results indicate that the proposed framework provides an effective and efficient solution for HSI classification in diverse scenarios. The source code is publicly available at https://github.com/mqalkhatib/ConvVitMamba

Chinese Translation

高光谱图像（HSI）分类仍然面临挑战，主要由于高光谱维度、冗余性和标注数据的有限性。尽管卷积神经网络（CNNs）和视觉变换器（ViTs）通过利用光谱-空间信息和长距离依赖关系取得了良好的性能，但它们通常会产生高计算成本和较大的模型规模，限制了实际应用。为了解决这些局限性，提出了一种统一的混合框架，称为ConvVitMamba，用于高效的HSI分类。该架构集成了三个组件：一个多尺度卷积特征提取器，用于捕捉局部光谱、空间和联合模式；一个基于视觉变换器的标记化和编码阶段，用于建模全局上下文关系；以及一个轻量级的Mamba启发的门控序列混合模块，用于在不使用二次自注意力的情况下实现高效的内容感知细化。主成分分析（PCA）被用作预处理，以减少冗余并提高效率。在包括休斯顿和三个无人机搭载的QUH数据集（平安、青云和塘道湾）在内的四个基准数据集上的实验表明，ConvVitMamba在保持准确性、模型规模和推理效率之间良好平衡的同时，始终优于基于CNN、变换器和Mamba的方法。消融研究确认了所有组件的互补贡献。结果表明，所提出的框架为在多样化场景中进行HSI分类提供了一种有效且高效的解决方案。源代码可在https://github.com/mqalkhatib/ConvVitMamba公开获取。

cs.CV / 22 / 2604.18866

HMR-Net: Hierarchical Modular Routing for Cross-Domain Object Detection in Aerial Images

HMR-Net：用于航空图像跨领域目标检测的分层模块化路由

Shamsolmoali, Pourya, Zareapoor, Masoumeh, Felsberg, Michael, Pears, Nick, Lu, Yue

Abstract

Despite advances in object detection, aerial imagery remains a challenging domain, as models often fail to generalize across variations in spatial resolution, scene composition, and semantic label coverage. Differences in geographic context, sensor characteristics, and object distributions across datasets limit the capacity of conventional models to learn consistent and transferable representations. Shared methods trained on such data tend to impose a unified representation across fundamentally different domains, resulting in poor performance on region-specific content and less flexibility when dealing with novel object categories. To address this, we propose a novel modular learning framework that enables structured specialization in aerial detection. Our method introduces a hierarchical routing mechanism with two levels of modularity: a global expert assignment layer that uses latent geographic embeddings to route datasets to specialized processing modules, and a local scene decomposition mechanism that allocates image subregions to region-specific sub-modules. This allows our method to specialize across datasets and within complex scenes. Additionally, the framework contains a conditional expert module that uses external semantic information (e.g., category names or textual descriptions) to enable detection of novel object categories during inference, without the need for retraining or fine-tuning. By moving beyond monolithic representations, our method offers an adaptive framework for remote sensing object detection. Comprehensive evaluations on four datasets highlight improvements in multi-dataset generalization, regional specialization, and open-category detection.

Chinese Translation

尽管目标检测技术已有所进展，但航空图像仍然是一个具有挑战性的领域，因为模型往往无法在空间分辨率、场景组成和语义标签覆盖的变化中进行有效泛化。地理背景、传感器特性以及数据集间物体分布的差异限制了传统模型学习一致且可转移表示的能力。在此类数据上训练的共享方法往往在根本上不同的领域之间强加统一的表示，导致在特定区域内容上的性能较差，以及在处理新物体类别时的灵活性不足。为了解决这一问题，我们提出了一种新颖的模块化学习框架，能够在航空检测中实现结构化的专业化。我们的方法引入了一种具有两级模块化的分层路由机制：一个全局专家分配层，利用潜在的地理嵌入将数据集路由到专业处理模块，以及一个局部场景分解机制，将图像子区域分配给区域特定的子模块。这使得我们的方法能够在数据集之间以及在复杂场景内实现专业化。此外，该框架包含一个条件专家模块，利用外部语义信息（例如类别名称或文本描述）来实现对新物体类别的检测，而无需重新训练或微调。通过超越单一的表示，我们的方法为遥感目标检测提供了一个自适应框架。在四个数据集上的全面评估突显了在多数据集泛化、区域专业化和开放类别检测方面的改进。

cs.CV / 23 / 2604.18867

Hierarchically Robust Zero-shot Vision-language Models

层次鲁棒的零样本视觉-语言模型

Dong, Junhao, Zhang, Yifei, Zhu, Hao, Ong, Yew-Soon, Koniusz, Piotr

Abstract

Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their robustness, existing approaches align fixed text embeddings with an image embedding, sacrificing natural performance and robustness. A robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., mammal) in addition to their base (leaf) classes (e.g., cat). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide a theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size. Our model naturally realizes several margin sizes, boosting generalization of adversaries for robustification. As various trees with different parent labels can share the same leaf labels, we also consider aligning over multiple trees to boost semantic variety. Experiments across several datasets are performed.

Chinese Translation

视觉-语言模型（VLMs）能够进行零样本分类，但易受到对抗攻击的影响。尽管鲁棒微调提高了它们的鲁棒性，但现有方法将固定的文本嵌入与图像嵌入对齐，牺牲了自然性能和鲁棒性。当模型面临针对超类（父类，例如哺乳动物）以及其基本（叶子）类（例如猫）的对抗攻击时，鲁棒性也会下降。因此，为了增强对抗鲁棒性并利用类空间的固有层次属性，我们提出了一种基于层次嵌入的新型对抗微调框架，以及图像-文本模态的多个层次的对抗鲁棒对齐。额外机制将视觉嵌入放置在所需的层次深度，并且我们提供了嵌入深度与最大可行间隔大小之间的理论联系。我们的模型自然实现了多个间隔大小，增强了对抗者的泛化能力以实现鲁棒化。由于不同父标签的各种树可以共享相同的叶标签，我们还考虑在多个树上进行对齐，以增强语义多样性。我们在多个数据集上进行了实验。

cs.CV / 24 / 2604.18881

A Proxy Consistency Loss for Grounded Fusion of Earth Observation and Location Encoders

用于地面融合的地球观测与位置编码器的代理一致性损失

Wang, Zhongying, Lane, Kevin, Cai, Levi, Karimzadeh, Morteza, Rolf, Esther

Abstract

Supervised learning with Earth observation inputs is often limited by the sparsity of high-quality labeled or in-situ measured data to use as training labels. With the abundance of geographic data products, in many cases there are variables correlated with - but different from - the variable of interest that can be leveraged. We integrate such proxy variables within a geographic prior via a trainable location encoder and introduce a proxy consistency loss (PCL) formulation to imbue proxy data into the location encoder. The first key insight behind our approach is to use the location encoder as an agile and flexible way to learn from abundantly available proxy data which can be sampled independently of training label availability. Our second key insight is that we will need to regularize the location encoder appropriately to achieve performance and robustness with limited labeled data. Our experiments on air quality prediction and poverty mapping show that integrating proxy data implicitly through the location encoder outperforms using both as input to an observation encoder and fusion strategies that use frozen, pretrained location embeddings as a geographic prior. Superior performance for in-sample prediction shows that the PCL can incorporate rich information from the proxies, and superior out-of-sample prediction shows that the learned latent embeddings help generalize to areas without training labels.

Chinese Translation

使用地球观测输入的监督学习通常受到高质量标注或原位测量数据稀缺性的限制，这些数据可用作训练标签。随着地理数据产品的丰富，在许多情况下，有与感兴趣变量相关但不同的变量可以被利用。我们通过可训练的位置编码器将这些代理变量整合到地理先验中，并引入代理一致性损失（Proxy Consistency Loss, PCL）公式，将代理数据注入位置编码器。我们方法背后的第一个关键见解是使用位置编码器作为一种灵活的方式，从丰富的代理数据中学习，这些数据可以独立于训练标签的可用性进行采样。我们的第二个关键见解是，我们需要适当地对位置编码器进行正则化，以在有限的标注数据下实现性能和鲁棒性。我们在空气质量预测和贫困映射上的实验表明，通过位置编码器隐式整合代理数据的效果优于将两者作为输入提供给观测编码器和使用冻结的、预训练的位置嵌入作为地理先验的融合策略。在样本内预测中表现出色，表明PCL能够从代理中整合丰富的信息，而在样本外预测中的优越性则表明学习到的潜在嵌入有助于推广到没有训练标签的区域。

cs.CV / 25 / 2604.18940

Localization-Guided Foreground Augmentation in Autonomous Driving

基于定位引导的前景增强在自动驾驶中的应用

Yong, Jiawei, Qu, Deyuan, Chen, Qi, Oguchi, Kentaro, Fukushima, Shintaro

Abstract

Autonomous driving systems often degrade under adverse visibility conditions, such as rain, nighttime, or snow, where online scene geometry (e.g., lane dividers, road boundaries, and pedestrian crossings) becomes sparse or fragmented. While high-definition (HD) maps can provide missing structural context, they are costly to construct and maintain at scale. We propose Localization-Guided Foreground Augmentation (LG-FA), a lightweight and plug-and-play inference module that enhances foreground perception by enriching geometric context online. LG-FA: (i) incrementally constructs a sparse global vector layer from per-frame Bird's-Eye View (BEV) predictions; (ii) estimates ego pose via class-constrained geometric alignment, jointly improving localization and completing missing local topology; and (iii) reprojects the augmented foreground into a unified global frame to improve per-frame predictions. Experiments on challenging nuScenes sequences demonstrate that LG-FA improves the geometric completeness and temporal stability of BEV representations, reduces localization error, and produces globally consistent lane and topology reconstructions. The module can be seamlessly integrated into existing BEV-based perception systems without backbone modification. By providing a reliable geometric context prior, LG-FA enhances temporal consistency and supplies stable structural support for downstream modules such as tracking and decision-making.

Chinese Translation

自动驾驶系统在不利的能见度条件下（如雨天、夜间或雪天）往往表现不佳，此时在线场景几何信息（例如车道分隔线、道路边界和人行横道）变得稀疏或破碎。尽管高清（HD）地图可以提供缺失的结构上下文，但其构建和维护成本高昂。我们提出了一种定位引导的前景增强（Localization-Guided Foreground Augmentation, LG-FA）方法，这是一种轻量级的即插即用推理模块，通过在线丰富几何上下文来增强前景感知。LG-FA：（i）从每帧的鸟瞰视图（Bird's-Eye View, BEV）预测中逐步构建稀疏的全局向量层；（ii）通过类约束几何对齐估计自我姿态，联合改善定位并补全缺失的局部拓扑；（iii）将增强的前景重投影到统一的全局框架中，以改善每帧的预测。在具有挑战性的nuScenes序列上的实验表明，LG-FA提高了BEV表示的几何完整性和时间稳定性，减少了定位误差，并生成了全局一致的车道和拓扑重建。该模块可以无缝集成到现有的基于BEV的感知系统中，而无需修改主干网络。通过提供可靠的几何上下文先验，LG-FA增强了时间一致性，并为下游模块（如跟踪和决策）提供了稳定的结构支持。
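
Step (ii) of the abstract, ego-pose estimation via class-constrained geometric alignment, can be pictured with a small 2D example. The sketch below is an assumption-laden stand-in, not the LG-FA implementation: it reduces map elements to labeled 2D points, pairs each per-frame point with the nearest global point of the same class, and then fits a rotation and translation in closed form (a single alignment step of the kind used inside ICP-style methods). All names and coordinates are hypothetical.

```python
import math


def match_same_class(local, global_map):
    """Pair each local point with the nearest global point that shares its class."""
    src, dst = [], []
    for cls, p in local:
        candidates = [q for c, q in global_map if c == cls]
        if candidates:
            src.append(p)
            dst.append(min(candidates, key=lambda q: math.dist(p, q)))
    return src, dst


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src points onto dst points."""
    n = len(src)
    csx = sum(p[0] for p in src) / n
    csy = sum(p[1] for p in src) / n
    cdx = sum(p[0] for p in dst) / n
    cdy = sum(p[1] for p in dst) / n
    dot = cross = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        ax, ay = sx - csx, sy - csy  # centered source point
        bx, by = dx - cdx, dy - cdy  # centered destination point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    tx = cdx - (c * csx - s * csy)
    ty = cdy - (s * csx + c * csy)
    return theta, tx, ty


# Hypothetical global vector layer, reduced to labeled points
global_map = [
    ("lane", (0.0, 0.0)), ("lane", (10.0, 0.0)),
    ("boundary", (0.0, 5.0)), ("boundary", (10.0, 5.0)),
    ("crossing", (5.0, 2.0)),
]

# Simulate per-frame observations from a known ego pose (global = R(theta) * local + t)
true_theta, (true_tx, true_ty) = 0.05, (0.3, -0.2)
c, s = math.cos(true_theta), math.sin(true_theta)
local = [
    (cls, (c * (x - true_tx) + s * (y - true_ty), -s * (x - true_tx) + c * (y - true_ty)))
    for cls, (x, y) in global_map
]

src, dst = match_same_class(local, global_map)
theta, tx, ty = fit_rigid_2d(src, dst)
```

Restricting candidates to the same class keeps a lane point from being paired with a nearby boundary point, which is the intuition behind constraining the alignment by class. A real system would iterate matching and fitting and would work with polylines rather than isolated points.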

cs.CV / 26 / 2604.18957

Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images

桥接基础模型与ASTM冶金标准，实现显微镜图像自动化颗粒大小估计

Mueez, Abdul, Vyas, Shruti

Abstract

Extracting standardized metallurgical metrics from microscopy images remains challenging due to complex grain morphology and the data demands of supervised segmentation. To bridge foundational computer vision with practical metallurgical evaluation, we propose an automated pipeline for dense instance segmentation and grain size estimation that adapts Cellpose-SAM to microstructures and integrates its topology-aware gradient tracking with an ASTM E112 Jeffries planimetric module. We systematically benchmark this pipeline against a classical convolutional network (U-Net), an adaptive-prompting vision foundation model (MatSAM) and a contemporary vision-language model (Qwen2.5-VL-7B). Our evaluations reveal that while the out-of-the-box vision-language model struggles with the localized spatial reasoning required for dense microscopic counting and MatSAM suffers from over-segmentation despite its domain-specific prompt generation, our adapted pipeline successfully maintains topological separation. Furthermore, experiments across progressively reduced training splits demonstrate exceptional few-shot scalability; utilizing only two training samples, the proposed system predicts the ASTM grain size number (G) with a mean absolute percentage error (MAPE) as low as 1.50%, while robustness testing across varying target grain counts empirically validates the ASTM 50-grain sampling minimum. These results highlight the efficacy of application-level foundation model integration for highly accurate, automated materials characterization. Our project repository is available at https://github.com/mueez-overflow/ASTM-Grain-Size-Estimator.

Chinese Translation

从显微镜图像中提取标准化的冶金指标仍然具有挑战性，原因在于复杂的颗粒形态和对监督分割的数据需求。为了将基础计算机视觉与实际冶金评估相结合，我们提出了一种自动化管道，用于密集实例分割和颗粒大小估计，该管道将Cellpose-SAM适配于微观结构，并将其拓扑感知的梯度跟踪与ASTM E112 Jeffries平面测量模块相结合。我们系统地将该管道与经典卷积网络（U-Net）、自适应提示视觉基础模型（MatSAM）和当代视觉-语言模型（Qwen2.5-VL-7B）进行了基准测试。评估结果表明，尽管开箱即用的视觉-语言模型在进行密集显微计数时在局部空间推理方面表现不佳，而MatSAM在领域特定提示生成方面存在过度分割的问题，但我们适配的管道成功地保持了拓扑分离。此外，在逐步减少的训练样本分割上进行的实验展示了卓越的少样本可扩展性；仅利用两个训练样本，所提系统预测ASTM颗粒大小数（G）的平均绝对百分比误差（MAPE）低至1.50%，同时在不同目标颗粒数量下的鲁棒性测试实证验证了ASTM 50颗粒采样的最低要求。这些结果突显了应用级基础模型集成在高精度自动化材料表征中的有效性。我们的项目代码库可在 https://github.com/mueez-overflow/ASTM-Grain-Size-Estimator 获取。
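
For readers unfamiliar with the planimetric step, the sketch below shows how grain counts turn into an ASTM grain size number G, using the Jeffries relations as commonly stated for ASTM E112: grains per mm² N_A = f × (N_inside + N_intercepted / 2) with multiplier f = M² / A, and G = 3.321928 × log10(N_A) − 2.954. How the paper's pipeline derives the two counts from its segmentation masks is not described in the abstract, so the counts, magnification, and function names here are illustrative assumptions. The MAPE helper matches the error measure quoted in the abstract.

```python
import math


def jeffries_grains_per_mm2(n_inside, n_intercepted, magnification, test_area_mm2=5000.0):
    """Grains per mm^2 at 1x: grains cut by the test boundary count as one half."""
    multiplier = magnification ** 2 / test_area_mm2  # Jeffries multiplier f
    return multiplier * (n_inside + 0.5 * n_intercepted)


def astm_grain_size_number(n_a):
    """ASTM grain size number G from grains per mm^2 at 1x."""
    return 3.321928 * math.log10(n_a) - 2.954


def mape(predicted, reference):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - r) / abs(r) for p, r in zip(predicted, reference)) / len(reference)


# Hypothetical counts: 40 grains inside the test area, 20 intercepted, at 100x
n_a = jeffries_grains_per_mm2(n_inside=40, n_intercepted=20, magnification=100)
g = astm_grain_size_number(n_a)

# Hypothetical predicted vs. reference G values for two images
error_pct = mape([4.1, 4.9], [4.0, 5.0])
```

With these numbers the effective count is 50 grains, which meets the 50-grain sampling minimum mentioned in the abstract, and the resulting density is 100 grains per mm².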

cs.CV / 27 / 2604.18967

Toward Clinically Acceptable Chest X-ray Report Generation: A Qualitative Retrospective Pilot Study of CXRMate-2

朝着临床可接受的胸部X光报告生成：CXRMate-2的定性回顾性初步研究

Nicolson, Aaron, Cooper, Elizabeth J., Yoon, Hwan-Jin, McCafferty, Claire, Krishnan, Ramya, Craigie, Michelle, Saad, Nivene, Dowling, Jason, Scott, Ian A., Koopman, Bevan

Abstract

Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress, yet their clinical utility remains uncertain due to limited evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that integrates structured multimodal conditioning and reinforcement learning with a composite reward for semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B). To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates between radiologist reports and acceptable generated reports for seven of the eight analysed findings. Preference for radiologist reports was driven primarily by higher recall, while generated reports were often preferred for readability. Together, these results suggest a credible pathway to clinically acceptable CXR RRG. Improvements in recall, alongside better detection of subtle findings (e.g., pulmonary congestion), are likely sufficient to achieve non-inferiority to radiologist reporting. With these targeted advances, CXR RRG systems may be ready for prospective evaluation in assistive roles within radiologist-led workflows.

Chinese Translation

胸部X光（CXR）放射学报告生成（RRG）模型已显示出快速进展，但由于放射科医生的评估有限，其临床实用性仍不确定。我们提出了CXRMate-2，这是一种最先进的CXR RRG模型，结合了结构化的多模态条件和强化学习，并通过复合奖励实现与放射科医生报告的语义对齐。在MIMIC-CXR、CheXpert Plus和ReXgradient数据集上，CXRMate-2在强基准上取得了统计学上显著的改进，包括在MIMIC-CXR相对于MedGemma 1.5（4B）在GREEN和RadGraph-XL上分别提高了11.2%和24.4%。为了直接比较CXRMate-2与放射科医生报告，我们进行了盲法随机定性回顾性评估。三位顾问放射科医生比较了在MIMIC-CXR测试集中的120项研究中生成的报告和放射科医生的报告。生成的报告在45%的评分中被认为是可接受的（定义为优先或与放射科医生报告同等评分），在分析的八项发现中，放射科医生报告和可接受的生成报告之间的偏好率没有统计学上显著的差异。对放射科医生报告的偏好主要是由于更高的召回率，而生成的报告在可读性上往往更受欢迎。这些结果共同表明了通往临床可接受的CXR RRG的可信路径。召回率的提高，以及对微妙发现（例如肺充血）的更好检测，可能足以实现与放射科医生报告的非劣性。通过这些有针对性的进展，CXR RRG系统可能已准备好在放射科医生主导的工作流程中进行前瞻性评估。

cs.CV / 28 / 2604.18980

AdaGScale: Viewpoint-Adaptive Gaussian Scaling in 3D Gaussian Splatting to Reduce Gaussian-Tile Pairs

AdaGScale：在3D高斯点云中自适应视点的高斯缩放以减少高斯瓦片对

Jo, Joongho, Lim, Hyerin, Choi, Hanjun, Park, Jongsun

Abstract

Reducing the number of Gaussian-tile pairs is one of the most promising approaches to improving 3D Gaussian Splatting (3D-GS) rendering speed on GPUs. However, previous work has not considered the differences in importance among Gaussian-tile pairs. In this paper, we propose AdaGScale, a novel viewpoint-adaptive Gaussian scaling technique for reducing the number of Gaussian-tile pairs. AdaGScale is based on the observation that peripheral tiles located far from the Gaussian center contribute negligibly to pixel color accumulation. This suggests an opportunity to reduce the number of Gaussian-tile pairs based on color contribution. AdaGScale efficiently estimates the color contribution in the peripheral region of each Gaussian during a preprocessing stage and adaptively scales its size based on the peripheral score. As a result, Gaussians with lower importance intersect with fewer tiles during the intersection test, which improves rendering speed while maintaining image quality. The adjusted size is used only for the tile intersection test, and the original size is retained during color accumulation to preserve visual fidelity. Experimental results show that AdaGScale achieves a geometric mean speedup of 13.8x over the original 3D-GS on a GPU, with only about 0.5 dB degradation in PSNR on city-scale scenes.

Chinese Translation

减少高斯瓦片对的数量是提高GPU上3D高斯点云（3D-GS）渲染速度的最有前景的方法之一。然而，以往的研究从未考虑高斯瓦片对之间的重要性差异。本文提出了一种新颖的自适应视点高斯缩放技术AdaGScale，用于减少高斯瓦片对的数量。AdaGScale基于观察，即位于高斯中心远处的边缘瓦片对像素颜色累积的贡献微乎其微。这表明可以根据颜色贡献来减少高斯瓦片对的数量。AdaGScale在预处理阶段有效地估计每个高斯的边缘区域的颜色贡献，并根据边缘评分自适应地调整其大小。因此，在交集测试期间，重要性较低的高斯与较少的瓦片相交，从而提高了渲染速度，同时保持了图像质量。调整后的大小仅用于瓦片交集测试，而在颜色累积过程中保留原始大小以保持视觉保真度。实验结果表明，AdaGScale在GPU上实现了比原始3D-GS快13.8倍的几何平均加速，且在城市规模场景中PSNR仅下降约0.5 dB。
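
The effect of shrinking a Gaussian only for the intersection test can be seen with a small tile-counting sketch. It assumes a bounding-square test over a grid of 16×16-pixel tiles, in the spirit of tile-based splatting rasterizers; the `scale` factor stands in for whatever AdaGScale derives from its peripheral score, which the abstract does not specify, and all names and numbers are hypothetical.

```python
import math


def tiles_touched(cx, cy, radius, width, height, tile=16):
    """Number of tiles overlapped by the bounding square of a splat centered at (cx, cy)."""
    grid_w = math.ceil(width / tile)
    grid_h = math.ceil(height / tile)
    # Clamp the tile range of the bounding square to the image grid
    x0 = min(grid_w, max(0, math.floor((cx - radius) / tile)))
    x1 = min(grid_w, max(0, math.floor((cx + radius + tile - 1) / tile)))
    y0 = min(grid_h, max(0, math.floor((cy - radius) / tile)))
    y1 = min(grid_h, max(0, math.floor((cy + radius + tile - 1) / tile)))
    return (x1 - x0) * (y1 - y0)


# Hypothetical splat: projected radius of 48 px on a 256x256 image
original_radius = 48
scale = 0.5  # assumed shrink factor for a low-importance Gaussian

pairs_original = tiles_touched(100, 100, original_radius, 256, 256)
pairs_scaled = tiles_touched(100, 100, original_radius * scale, 256, 256)

# Color accumulation would still use original_radius; only the pair list shrinks.
```

Halving the test radius in this example cuts the Gaussian-tile pairs from 49 to 16, while the footprint used for blending is left untouched, which mirrors the split between intersection size and rendering size described in the abstract.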

cs.CV / 29 / 2604.18988

A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation

一种具有结构化推理和反思精炼的多智能体框架用于多模态同理心响应生成

Wang, Liping, Ye, Cheng, Chen, Weidong, Song, Peipei, Hu, Bo, Mao, Zhendong

Abstract

Multimodal empathetic response generation (MERG) aims to generate emotionally engaging and empathetic responses based on users' multimodal contexts. Existing approaches usually rely on an implicit one-pass generation paradigm from multimodal context to the final response, which overlooks two intrinsic characteristics of MERG: (1) Human perception of emotional cues is inherently structured rather than a direct mapping. The conventional paradigm neglects the hierarchical progression of emotion perception, leading to distorted emotional judgments. (2) Given the inherent complexity and ambiguity of human emotions, the conventional paradigm is prone to significant emotional biases, ultimately resulting in suboptimal empathy. In this paper, we propose a multi-agent framework for MERG, which enhances empathy through structured reasoning and reflective refinement. Specifically, we first introduce a structured empathetic reasoning-to-generation module that explicitly decomposes response generation via multimodal perception, consistency-aware emotion forecasting, pragmatic strategy planning, and strategy-guided response generation, providing a clearer intermediate path from multimodal evidence to response realization. Besides, we develop a global reflection and refinement module, in which a global reflection agent performs step-wise auditing over intermediate states and the generated response, eliminating existing emotional biases and empathy errors, and triggering targeted regeneration. Overall, such a closed-loop framework enables our model to gradually improve the accuracy of emotion perception and eliminate emotion biases during the iteration process. Experiments on several benchmarks, e.g., IEMOCAP and MELD, demonstrate that our model has superior empathic response generation capabilities compared to state-of-the-art methods.

Chinese Translation

多模态同理心响应生成（MERG）旨在根据用户的多模态背景生成情感丰富且富有同理心的响应。现有方法通常依赖于从多模态背景到最终响应的隐式单次生成范式，这忽视了MERG的两个内在特征：（1）人类对情感线索的感知本质上是结构化的，而不是直接映射。传统范式忽略了情感感知的层次进展，导致情感判断的扭曲。（2）考虑到人类情感的固有复杂性和模糊性，传统范式容易产生显著的情感偏见，最终导致同理心的低效。在本文中，我们提出了一种用于MERG的多智能体框架，通过结构化推理和反思精炼来增强同理心。具体而言，我们首先引入一个结构化的同理心推理到生成模块，该模块通过多模态感知、一致性情感预测、务实策略规划和策略引导的响应生成，明确地分解响应生成过程，提供从多模态证据到响应实现的更清晰的中间路径。此外，我们开发了一个全局反思和精炼模块，其中全局反思智能体对中间状态和生成的响应进行逐步审计，消除现有的情感偏见和同理心错误，并触发针对性的再生成。总体而言，这种闭环框架使我们的模型能够在迭代过程中逐步提高情感感知的准确性并消除情感偏见。在多个基准测试（例如IEMOCAP和MELD）上的实验表明，我们的模型在同理心响应生成能力上优于最先进的方法。

cs.CV / 30 / 2604.18993

AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos

AutoAWG：用于汽车视频的自适应多控制不良天气生成

Hu, Jiagao, Zhou, Daiguo, Fu, Danzhen, Li, Fuhao, Wang, Zepeng, Wang, Fei, Liao, Wenhua, Xie, Jiayi, Sun, Haiyang

Abstract

Perception robustness under adverse weather remains a critical challenge for autonomous driving, with the core bottleneck being the scarcity of real-world video data in adverse weather. Existing weather generation approaches struggle to balance visual quality and annotation reusability. We present AutoAWG, a controllable Adverse Weather video Generation framework for Autonomous driving. Our method employs a semantics-guided adaptive fusion of multiple controls to balance strong weather stylization with high-fidelity preservation of safety-critical targets; leverages a vanishing point-anchored temporal synthesis strategy to construct training sequences from static images, thereby reducing reliance on synthetic data; and adopts masked training to enhance long-horizon generation stability. On the nuScenes validation set, AutoAWG significantly outperforms prior state-of-the-art methods: without first-frame conditioning, FID and FVD are reduced by 50.0% and 16.1% in relative terms, respectively; with first-frame conditioning, they are further reduced by 8.7% and 7.2%, respectively. Extensive qualitative and quantitative results demonstrate advantages in style fidelity, temporal consistency, and semantic-structural integrity, underscoring the practical value of AutoAWG for improving downstream perception in autonomous driving. Our code is available at: https://github.com/higherhu/AutoAWG

Chinese Translation

在不良天气条件下的感知鲁棒性仍然是自动驾驶面临的一个关键挑战，其核心瓶颈在于缺乏真实世界的不良天气视频数据。现有的天气生成方法难以平衡视觉质量和标注可重用性。我们提出了AutoAWG，一个用于自动驾驶的可控不良天气视频生成框架。我们的方法采用语义引导的多控制自适应融合，以平衡强烈的天气风格化与安全关键目标的高保真度；利用消失点锚定的时间合成策略从静态图像构建训练序列，从而减少对合成数据的依赖；并采用掩蔽训练以增强长时间生成的稳定性。在nuScenes验证集上，AutoAWG显著超越了之前的最先进方法：在没有首帧条件的情况下，FID和FVD分别减少了50.0%和16.1%；在有首帧条件的情况下，进一步减少了8.7%和7.2%。大量的定性和定量结果展示了在风格保真度、时间一致性和语义-结构完整性方面的优势，突显了AutoAWG在提升自动驾驶下游感知方面的实际价值。我们的代码可在以下链接获取：https://github.com/higherhu/AutoAWG

cs.CV / 31 / 2604.19034

Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents

像人类一样探索：具身代理的在线SG-Memo构建自主探索

Chen, Xu, Xie, Shichao, Gu, Zhining, Jia, Lu, Luo, Minghua, Liu, Fei, Chu, Zedong, Shen, Yanfen, Wu, Xiaolong, Xu, Mu

Abstract

Constructing structured spatial memory is essential for enabling long-horizon reasoning in complex embodied navigation tasks. Current memory construction predominantly relies on a decoupled, two-stage paradigm: agents first aggregate environmental data through exploration, followed by the offline reconstruction of spatial memory. However, this post-hoc and geometry-centric approach precludes agents from leveraging high-level semantic intelligence, often causing them to overlook navigationally critical landmarks (e.g., doorways and staircases) that serve as fundamental semantic anchors in human cognitive maps. To bridge this gap, we propose ABot-Explorer, a novel active exploration framework that unifies memory construction and exploration into an online, RGB-only process. At its core, ABot-Explorer leverages Large Vision-Language Models (VLMs) to distill Semantic Navigational Affordances (SNA), which act as cognitive-aligned anchors to guide the agent's movement. By dynamically integrating these SNAs into a hierarchical SG-Memo, ABot-Explorer mirrors human-like exploratory logic by prioritizing structural transit nodes to facilitate efficient coverage. To support this framework, we contribute a large-scale dataset extending InteriorGS with SNA and SG-Memo annotations. Experimental results demonstrate that ABot-Explorer significantly outperforms current state-of-the-art methods in both exploration efficiency and environment coverage, while the resulting SG-Memo is shown to effectively support diverse downstream tasks.

Chinese Translation

构建结构化空间记忆对于在复杂的具身导航任务中实现长时间推理至关重要。目前的记忆构建主要依赖于一种解耦的两阶段范式：代理首先通过探索聚合环境数据，然后离线重建空间记忆。然而，这种事后和以几何为中心的方法使得代理无法利用高级语义智能，常常导致它们忽视导航中至关重要的地标（例如，门口和楼梯），这些地标在人的认知地图中作为基本的语义锚点。为了解决这一问题，我们提出了ABot-Explorer，一种将记忆构建和探索统一为在线RGB-only过程的新型主动探索框架。ABot-Explorer的核心利用大型视觉-语言模型（VLMs）提炼语义导航可供性（SNA），作为认知对齐的锚点来指导代理的运动。通过将这些SNA动态整合到层次化的SG-Memo中，ABot-Explorer通过优先考虑结构过渡节点来促进高效覆盖，从而模拟人类的探索逻辑。为了支持这一框架，我们贡献了一个大规模数据集，扩展了InteriorGS，并添加了SNA和SG-Memo注释。实验结果表明，ABot-Explorer在探索效率和环境覆盖方面显著优于当前最先进的方法，而所生成的SG-Memo被证明能够有效支持多种下游任务。

cs.CV / 32 / 2604.19039

Generative Texture Filtering

生成纹理过滤

Zheng, Rongjia, Huang, Shangwei, Zhu, Lei, Zheng, Wei-Shi, Zhang, Qing

Abstract

We present a generative method for texture filtering, which exhibits surprisingly good performance and generalizability. Our core idea is to empower texture filtering by taking full advantage of the strong learned image prior of pre-trained generative models. To this end, we propose to fine-tune a pre-trained generative model via a two-stage strategy. Specifically, we first conduct supervised fine-tuning on a very small set of paired images, and then perform reinforcement fine-tuning on a large-scale unlabeled dataset under the guidance of a reward function that quantifies the quality of texture removal and structure preservation. Extensive experiments show that our method clearly outperforms previous methods and is effective at handling previously challenging cases. Our code is available at https://github.com/OnlyZZZZ/Generative_Texture_Filtering.

Chinese Translation

我们提出了一种生成方法用于纹理过滤，该方法表现出令人惊讶的良好性能和广泛的适应性。我们的核心思想是通过充分利用预训练生成模型的强学习图像先验来增强纹理过滤的能力。为此，我们提出通过两阶段策略对预训练生成模型进行微调。具体而言，我们首先在一小组配对图像上进行监督微调，然后在一个大规模无标记数据集上进行强化微调，指导原则是一个量化纹理去除和结构保留质量的奖励函数。大量实验表明，我们的方法明显优于之前的方法，并且在处理以往具有挑战性的案例时也表现出色。我们的代码可在 https://github.com/OnlyZZZZ/Generative_Texture_Filtering 获取。

cs.CV / 33 / 2604.19054

Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge

2025年低功耗计算机视觉挑战赛获胜解决方案评估

Ye, Zihao, Lu, Yung Hsiang, Hu, Xiao, Zhang, Shuai, Jing, Taotao, Li, Xin, Yao, Zhen, Lang, Bo, Zheng, Zhihao, Oh, Seungmin, Kang, Hankyul, Kang, Seunghun, Ryu, Jongbin, Chen, Kexin, Qi, Yuan, Thiruvathukal, George K, Chuah, Mooi Choo

Abstract

The IEEE Low-Power Computer Vision Challenge (LPCVC) aims to promote the development of efficient vision models for edge devices, balancing accuracy with constraints such as latency, memory capacity, and energy use. The 2025 challenge featured three tracks: (1) Image classification under various lighting conditions and styles, (2) Open-Vocabulary Segmentation with Text Prompt, and (3) Monocular Depth Estimation. This paper presents the design of LPCVC 2025, including its competition structure and evaluation framework, which integrates the Qualcomm AI Hub for consistent and reproducible benchmarking. The paper also introduces the top-performing solutions from each track and outlines key trends and observations. The paper concludes with suggestions for future computer vision competitions.

Chinese Translation

IEEE低功耗计算机视觉挑战赛（LPCVC）旨在促进高效视觉模型在边缘设备上的发展，平衡准确性与延迟、内存容量和能耗等限制。2025年的挑战赛设有三个赛道：（1）在不同光照条件和风格下的图像分类，（2）带文本提示的开放词汇分割，以及（3）单目深度估计。本文介绍了LPCVC 2025的设计，包括其竞赛结构和评估框架，该框架整合了高通AI Hub，以实现一致和可重复的基准测试。本文还介绍了每个赛道的表现最佳解决方案，并概述了关键趋势和观察结果。最后，本文对未来的计算机视觉竞赛提出了建议。

cs.CV / 34 / 2604.19064

The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation

视觉与语言导航中自我改进代理的平衡本质

Liu, Zhen, Liu, Yuhan, Wang, Jinjun, Liu, Jianyi, Song, Wei, Fu, Jingwen

Abstract

In vision-and-language navigation (VLN), self-improvement from policy-induced experience, using only standard VLN action supervision, critically depends on balancing behavioral diversity and learning stability, which governs whether the agent can extract a reliable learning signal for improvement. Increasing behavioral diversity is necessary to expose alternative action hypotheses but can destabilize policy-induced learning signals, whereas overly conservative stability constraints suppress exploration and induce early commitment, making reliable self-improvement difficult. To address this challenge, we propose Stability-Diversity Balance (SDB), a plug-and-play mechanism for balanced self-improvement in VLN. SDB expands each decision step into multiple latent behavioral hypotheses by applying controlled shifts in the instruction-conditioned hidden states, and then performs reliability-aware soft evaluation and aggregation to retain diverse yet instruction-consistent alternatives during learning. An explicit regularizer further constrains hypothesis interactions, preventing excessive drift or premature collapse of hypothesis diversity and stabilizing self-improvement without discarding training signals. Experiments on R2R, SOON, and REVERIE show consistent improvements; for example, on REVERIE val-unseen, SDB improves SPL from 33.73 to 35.93 and OSR from 51.07 to 54.25.

Chinese Translation

在视觉与语言导航（VLN）中，仅依靠标准的VLN动作监督，从政策引导的经验中自我改进，关键在于平衡行为多样性和学习稳定性，这决定了代理是否能够提取可靠的学习信号以实现改进。增加行为多样性是必要的，以暴露替代的行动假设，但这可能会使政策引导的学习信号不稳定，而过于保守的稳定性约束则会抑制探索并导致过早承诺，从而使可靠的自我改进变得困难。为了解决这一挑战，我们提出了稳定-多样性平衡（Stability-Diversity Balance, SDB），这是一个用于VLN中平衡自我改进的即插即用机制。SDB通过对指令条件下的隐藏状态施加受控的偏移，将每个决策步骤扩展为多个潜在的行为假设，然后进行可靠性意识的软评估和聚合，以在学习过程中保留多样但与指令一致的替代方案。一个显式的正则化器进一步约束假设之间的交互，防止假设多样性的过度漂移或过早崩溃，从而在不丢弃训练信号的情况下稳定自我改进。在R2R、SOON和REVERIE上的实验显示出一致的改进；例如，在REVERIE val-unseen上，SDB将SPL从33.73提高到35.93，OSR从51.07提高到54.25。

cs.CV / 35 / 2604.19093

Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration

通过自适应概率高斯校准实现多模态测试时适应

Xu, Jinglin, Li, Yi, Sun, Chuxiong, Xu, Xiao, Li, Jiangmeng, Xu, Fanjiang

Abstract

Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in the multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category-conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts. The code is available at https://github.com/XuJinglinn/AdaPGC.

Chinese Translation

多模态测试时适应（TTA）通过在推理过程中利用未标记的目标数据，增强基准多模态模型对分布变化的韧性。尽管已有成功的案例，但多模态 TTA 方法的发展受到一个持续性限制的阻碍，即缺乏对类别条件分布的明确建模，而这对于产生准确的预测和可靠的决策边界至关重要。经典的高斯判别分析（GDA）提供了类别条件分布的基础建模，并在单模态背景下取得了适度的进展。然而，在多模态 TTA 场景中，固有的模态分布不对称性削弱了通过经典 GDA 建模类别条件分布的有效性。为此，我们引入了一种专门针对多模态 TTA 的概率高斯模型，以明确建模类别条件分布，并进一步提出了一种自适应对比不对称校正技术，以抵消模态不对称性带来的不利影响，从而得出经过校准的预测和可靠的决策边界。在多项基准测试中的广泛实验表明，我们的方法在各种分布变化下实现了最先进的性能。代码可在 https://github.com/XuJinglinn/AdaPGC 获取。

cs.CV / 36 / 2604.19105

EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

自我运动：用于自我中心视觉-语言运动生成的层次推理与扩散

Hou, Ruibing, Zhou, Mingyue, Gui, Yuwei, Luo, Mingshuang, Ma, Bingpeng, Chang, Hong, Shan, Shiguang, Chen, Xilin

Abstract

Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical reasoning-generation entanglement challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework, EgoMotion. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.

Chinese Translation

在动态环境中忠实建模人类行为是具身智能的基础挑战。尽管条件运动合成已取得显著进展，但由于第一人称感知的固有复杂性，自我中心运动生成仍然在很大程度上未被探索。在本研究中，我们探讨了自我中心视觉-语言（Ego-VL）运动生成。该任务要求基于第一人称视觉观察和自然语言指令共同合成3D人类运动。我们识别出一个关键的“推理-生成纠缠”挑战：语义推理与运动学建模的同时优化引入了梯度冲突。这些冲突系统性地降低了多模态基础和运动质量的保真度。为了解决这一挑战，我们提出了一个层次生成框架EgoMotion。EgoMotion受到认知推理与运动控制生物解耦的启发，分为两个阶段。在认知推理阶段，视觉-语言模型（VLM）将多模态输入投影到离散运动原型的结构空间中。这迫使VLM获取与目标一致的表征，有效地弥合了高层次感知理解与低层次动作执行之间的语义差距。在运动生成阶段，这些学习到的表征作为基于扩散的运动生成器的表达性条件信号。通过在连续潜在空间中执行迭代去噪，生成器合成出物理上合理且时间上连贯的轨迹。广泛的评估表明，EgoMotion实现了最先进的性能，并生成在语义上扎根且在运动学上优于现有方法的运动序列。

cs.CV / 37 / 2604.19129

PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment

PortraitDirector：一种用于可控和实时面部重演的层次解耦框架

Ji, Chaonan, Qi, Jinwei, Xu, Sheng, Zhang, Peng, Zhang, Bang

Abstract

Existing facial reenactment methods struggle with a trade-off between expressiveness and fine-grained controllability. Holistic facial reenactment models often sacrifice granular control for expressiveness, while methods designed for control may struggle with fidelity and robust disentanglement. Instead of treating facial motion as a monolithic signal, we explore an alternative compositional perspective. In this paper, we introduce PortraitDirector, a novel framework that formulates face reenactment as a hierarchical composition task, achieving high-fidelity and controllable results. We employ a Hierarchical Motion Disentanglement and Composition strategy, deconstructing facial motion into a Spatial Layer for physical movements and a Semantic Layer for emotional content. The Spatial Layer comprises: (i) global head pose, managed via a dedicated representation and injection pathway; (ii) spatially separated local facial expressions, distilled from cropped facial regions and purged of emotional cues via an Emotion-Filtering Module that leverages an information bottleneck. The Semantic Layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent. Furthermore, we engineer the framework for real-time performance through a suite of optimizations, including diffusion distillation, causal attention and VAE acceleration. PortraitDirector achieves streaming, high-fidelity, controllable 512 x 512 face reenactment at 20 FPS with an end-to-end latency of 800 ms on a single 5090 GPU.

Chinese Translation

现有的面部重演方法在表现力与细粒度可控性之间存在权衡。整体面部重演模型通常为了表现力而牺牲了细致的控制，而为控制而设计的方法可能在保真度和稳健解耦方面存在困难。我们探索了一种将面部运动视为组成信号的替代性视角，而不是将其视为单一信号。本文介绍了PortraitDirector，一个新颖的框架，将面部重演形式化为一个层次组合任务，实现高保真和可控的结果。我们采用了层次运动解耦与组合策略，将面部运动解构为物理运动的空间层和情感内容的语义层。空间层包括：（i）通过专用表示和注入路径管理的全局头部姿态；（ii）从裁剪的面部区域提取的空间分离的局部面部表情，通过利用信息瓶颈的情感过滤模块去除情感线索。语义层包含一个派生的全局情感。然后将解耦的组件重新组合成表现力丰富的运动潜变量。此外，我们通过一系列优化措施（包括扩散蒸馏、因果注意力和变分自编码器加速）对框架进行了实时性能工程。PortraitDirector在单个5090 GPU上以20 FPS的速度实现了流式、高保真、可控的512 x 512面部重演，端到端延迟为800毫秒。

cs.CV / 38 / 2604.19133

BALTIC: A Benchmark and Cross-Domain Strategy for 3D Reconstruction Across Air and Underwater Domains Under Varying Illumination

BALTIC：在不同光照条件下跨空气和水下领域的3D重建基准与跨领域策略

Grimaldi, Michele, Nakath, David, Pizarro, Oscar, Willners, Jonatan Scharff, Carlucho, Ignacio, Petillot, Yvan R.

Abstract

Robust 3D reconstruction across varying environmental conditions remains a critical challenge for robotic perception, particularly when transitioning between air and water. To address this, we introduce BALTIC, a controlled benchmark designed to systematically evaluate modern 3D reconstruction methods under variations in medium and lighting. The benchmark comprises 13 datasets spanning two media (air and water) and three lighting conditions (ambient, artificial, and mixed), with additional variations in motion type, scanning pattern, and initialization trajectory, resulting in a diverse set of sequences. Our experimental setup features a custom water tank equipped with a monocular camera and an HTC Vive tracker, enabling accurate ground-truth pose estimation. We further investigate cross-domain reconstruction by augmenting underwater image sequences with a small number of in-air views captured under similar lighting conditions. We evaluate Structure-from-Motion reconstruction using COLMAP in terms of both trajectory accuracy and scene geometry, and use these reconstructions as input to Neural Radiance Fields and 3D Gaussian Splatting methods. The resulting models are assessed against ground-truth trajectories and in-air references, while rendered outputs are compared using perceptual and photometric metrics. Additionally, we perform a color restoration analysis to evaluate radiometric consistency across domains. Our results show that under controlled, texture-consistent conditions, Gaussian Splatting with simple preprocessing (e.g., white balance correction) can achieve performance comparable to specialized underwater methods, although its robustness decreases in more complex and heterogeneous real-world environments.

Chinese Translation

在不同环境条件下进行稳健的3D重建仍然是机器人感知面临的一项关键挑战，尤其是在空气和水之间的转换过程中。为了解决这一问题，我们提出了BALTIC，这是一个受控基准，旨在系统地评估现代3D重建方法在介质和光照变化下的表现。该基准包含13个数据集，涵盖两种介质（空气和水）和三种光照条件（环境光、人工光和混合光），并在运动类型、扫描模式和初始化轨迹上进行了额外的变化，从而形成了一组多样化的序列。我们的实验设置采用一个定制的水槽，配备单目相机和HTC Vive跟踪器，以实现准确的真实姿态估计。我们进一步通过增强水下图像序列，结合在类似光照条件下捕获的少量空气视图，来研究跨领域重建。我们使用COLMAP评估基于运动的重建，关注轨迹准确性和场景几何，并将这些重建结果作为输入应用于神经辐射场（Neural Radiance Fields）和3D高斯点云（3D Gaussian Splatting）方法。所得到的模型与真实轨迹和空气参考进行比较，同时渲染输出使用感知和光度度量进行评估。此外，我们还进行颜色恢复分析，以评估跨领域的辐射一致性。我们的结果表明，在受控的、纹理一致的条件下，简单预处理（例如白平衡校正）的高斯点云可以达到与专门的水下方法相当的性能，尽管在更复杂和异质的真实环境中，其稳健性有所下降。

cs.CV / 39 / 2604.19135

Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval

Diff-SBSR：用于零-shot 草图基础的 3D 形状检索的多模态特征增强扩散模型学习

Cheng, Hang, Dong, Fanhe, Zeng, Long

Abstract

This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage a frozen Stable Diffusion backbone to extract and aggregate discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. Diffusion models struggle with sketches due to their extreme abstraction and sparsity, compounded by a significant domain gap from natural images. To address this limitation without costly retraining, we introduce a multimodal feature-enhanced strategy that conditions the frozen diffusion backbone with complementary visual and textual cues from CLIP, explicitly enhancing the ability of semantic context capture and concentrating on sketch contours. Specifically, we inject global and local visual features derived from a pretrained CLIP visual encoder, and incorporate enriched textual guidance by combining learnable soft prompts with hard textual descriptions generated by BLIP. Furthermore, we employ the Circle-T loss to dynamically strengthen positive-pair attraction once negative samples are sufficiently separated, thereby adapting to sketch noise and enabling more effective sketch-3D alignment. Extensive experiments on two public benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in ZS-SBSR.

Chinese Translation

本文首次探讨了文本到图像扩散模型在零-shot 草图基础的 3D 形状检索（ZS-SBSR）中的应用。现有的草图基础 3D 形状检索方法在零-shot 设置中面临挑战，主要由于缺乏类别监督和草图输入的极度稀疏性。我们的关键见解是，大规模预训练的扩散模型本质上具备开放词汇能力和强大的形状偏向，使其非常适合用于零-shot 视觉检索。我们利用一个冻结的 Stable Diffusion 主干网络，从中间 U-Net 层提取和聚合草图和渲染 3D 视图的判别表示。由于草图的极度抽象和稀疏性，以及与自然图像之间显著的领域差距，扩散模型在处理草图时面临困难。为了在不进行昂贵的重新训练的情况下解决这一限制，我们提出了一种多模态特征增强策略，该策略利用来自 CLIP 的互补视觉和文本线索来条件化冻结的扩散主干网络，明确增强语义上下文捕捉的能力，并集中于草图轮廓。具体而言，我们注入来自预训练 CLIP 视觉编码器的全局和局部视觉特征，并通过将可学习的软提示与由 BLIP 生成的硬文本描述相结合，融入丰富的文本指导。此外，我们采用 Circle-T 损失动态增强正样本对的吸引力，一旦负样本被充分分离，从而适应草图噪声并实现更有效的草图-3D 对齐。在两个公共基准上的大量实验表明，我们的方法在 ZS-SBSR 中始终优于最先进的方法。

cs.CV / 40 / 2604.19141

Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation

去噪，快与慢：基于难度感知的自适应采样用于图像生成

Schusterbauer, Johannes, Gui, Ming, Li, Yusong, Ma, Pingchuan, Krause, Felix, Ommer, Björn

Abstract

Diffusion- and flow-based models usually allocate compute uniformly across space, updating all patches with the same timestep and number of function evaluations. While convenient, this ignores the heterogeneity of natural images: some regions are easy to denoise, whereas others benefit from more refinement or additional context. Motivated by this, we explore patch-level noise scales for image synthesis. We find that naively varying timesteps across image tokens performs poorly, as it exposes the model to overly informative training states that do not occur at inference. We therefore introduce a timestep sampler that explicitly controls the maximum patch-level information available during training, and show that moving from global to patch-level timesteps already improves image generation over standard baselines. By further augmenting the model with a lightweight per-patch difficulty head, we enable adaptive samplers that allocate compute dynamically where it is most needed. Combined with noise levels varying over both space and diffusion time, this yields Patch Forcing (PF), a framework that advances easier regions earlier so they can provide context for harder ones. PF achieves superior results on class-conditional ImageNet, remains orthogonal to representation alignment and guidance methods, and scales to text-to-image synthesis. Our results suggest that patch-level denoising schedules provide a promising foundation for adaptive image generation.

Chinese Translation

扩散和流动模型通常在空间上均匀分配计算资源，以相同的时间步和函数评估次数更新所有图像块。尽管这种方法方便，但忽略了自然图像的异质性：某些区域容易去噪，而其他区域则需要更多的细化或额外的上下文。基于此动机，我们探索了图像合成中的块级噪声尺度。我们发现，简单地在图像标记之间变化时间步的效果较差，因为这使模型暴露于在推理时不会出现的过于信息丰富的训练状态。因此，我们引入了一种时间步采样器，明确控制训练期间可用的最大块级信息，并展示了从全局时间步到块级时间步的转变已经改善了图像生成，相较于标准基线。通过进一步增强模型，增加轻量级的每块难度头，我们实现了自适应采样器，能够动态分配计算资源到最需要的地方。结合在空间和扩散时间上变化的噪声水平，这产生了块强制（Patch Forcing，PF）框架，提前推进较容易的区域，以便为较难的区域提供上下文。PF在类条件的ImageNet上取得了优越的结果，且与表示对齐和引导方法正交，并可扩展到文本到图像的合成。我们的结果表明，块级去噪调度为自适应图像生成提供了一个有前景的基础。

cs.CV / 41 / 2604.19145

ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

ST-Prune：用于自主驾驶的无训练时空令牌剪枝的视觉-语言模型

Sha, Lin, Guo, Haiyun, Wang, Tao, Zhang, Cong, Huang, Min, Wang, Jinqiao, Miao, Qinghai

Abstract

Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes a new state of the art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance, with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.

Chinese Translation

视觉-语言模型（VLMs）已成为自主驾驶系统的核心，但其部署受到多视角摄像头和多帧视频输入的巨大计算开销的严重制约。现有的令牌剪枝方法主要针对单图像输入，孤立地处理每个帧或视角，因此未能利用驾驶场景中固有的时空冗余。为了解决这一问题，我们提出了ST-Prune，这是一种无训练、即插即用的框架，包含两个互补模块：运动感知时间剪枝（MTP）和环视空间剪枝（RSP）。MTP通过将运动波动性和时间近期性编码为多样性选择目标中的软约束，来解决时间冗余，优先考虑动态轨迹和当前帧内容，而非静态历史背景。RSP则通过利用环视摄像头几何结构来惩罚双边视角相似性，进一步解决空间冗余，消除重复投影和时间剪枝无法抑制的残余背景。这两个模块共同构成了一个完整的时空剪枝过程，在严格压缩下保留关键场景信息。在四个涵盖感知、预测和规划的基准测试中验证，ST-Prune确立了无训练令牌剪枝的新最先进水平。值得注意的是，即使在90%的令牌减少下，ST-Prune也实现了几乎无损的性能，某些指标超过了全模型基线，同时保持与现有剪枝方法相当的推理速度。

cs.CV / 42 / 2604.19159

MSDS: Deep Structural Similarity with Multiscale Representation

MSDS：具有多尺度表示的深层结构相似性

Kang, Danling, Chen, Xue-Hua, Liu, Bin, Zhang, Keke, Chen, Weiling, Zhao, Tiesong

Abstract

Deep-feature-based perceptual similarity models have demonstrated strong alignment with human visual perception in Image Quality Assessment (IQA). However, most existing approaches operate at a single spatial scale, implicitly assuming that structural similarity at a fixed resolution is sufficient. The role of spatial scale in deep-feature similarity modeling thus remains insufficiently understood. In this letter, we isolate spatial scale as an independent factor using a minimal multiscale extension of DeepSSIM, referred to as Deep Structural Similarity with Multiscale Representation (MSDS). The proposed framework decouples deep feature representation from cross-scale integration by computing DeepSSIM independently across pyramid levels and fusing the resulting scores with a lightweight set of learnable global weights. Experiments on multiple benchmark datasets demonstrate consistent and statistically significant improvements over the single-scale baseline, while introducing negligible additional complexity. The results empirically confirm spatial scale as a non-negligible factor in deep perceptual similarity, isolated here via a minimal testbed.
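
As a rough illustration of the fusion step described in the abstract above, the Python sketch below scores each pyramid level independently and combines the per-level scores with a small set of global weights. It is not the authors' implementation: `stand_in_similarity` is a hypothetical placeholder for DeepSSIM, the 2x average-pooling pyramid is an assumption, and normalizing the weights with a softmax is one plausible choice that the abstract does not specify. In a trained model the logits would be learned rather than fixed by hand.

```python
import math


def downsample2x(img):
    """Average-pool a 2D list of floats by a factor of 2 (assumed pyramid step)."""
    h, w = len(img), len(img[0])
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]


def stand_in_similarity(a, b):
    """Hypothetical placeholder for DeepSSIM: 1 / (1 + MSE). Not the real metric."""
    n = len(a) * len(a[0])
    mse = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n
    return 1.0 / (1.0 + mse)


def softmax(logits):
    """Turn unconstrained logits into positive weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multiscale_score(ref, dist, logits):
    """Score every pyramid level independently, then fuse with global weights."""
    weights = softmax(logits)
    fused = 0.0
    for level, weight in enumerate(weights):
        fused += weight * stand_in_similarity(ref, dist)
        if level < len(weights) - 1:
            # Move to the next, coarser pyramid level.
            ref, dist = downsample2x(ref), downsample2x(dist)
    return fused


reference = [[float(r + c) for c in range(4)] for r in range(4)]
distorted = [[v + 0.5 for v in row] for row in reference]
print(multiscale_score(reference, distorted, logits=[0.0, 0.0]))
```

Because the per-level scores are computed separately and only the weights are shared, the number of extra parameters equals the number of pyramid levels, which is consistent with the abstract's claim of negligible added complexity.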

Chinese Translation

基于深层特征的感知相似性模型在图像质量评估（IQA）中与人类视觉感知表现出强烈的一致性。然而，大多数现有方法仅在单一空间尺度上操作，隐含假设在固定分辨率下的结构相似性是足够的。因此，空间尺度在深层特征相似性建模中的作用仍然不够明确。在本文中，我们通过对DeepSSIM进行最小多尺度扩展，将空间尺度作为一个独立因素进行隔离，称之为具有多尺度表示的深层结构相似性（MSDS）。所提出的框架通过在金字塔层级上独立计算DeepSSIM，并使用一组轻量级的可学习全局权重融合结果得分，从而将深层特征表示与跨尺度集成解耦。对多个基准数据集的实验表明，与单尺度基线相比，所提方法在性能上实现了一致且统计显著的提升，同时引入的额外复杂性微乎其微。结果实证确认空间尺度在深层感知相似性中是一个不可忽视的因素，这里通过一个最小测试平台进行了隔离。

cs.CV / 43 / 2604.19191

Improved Anomaly Detection in Medical Images via Mean Shift Density Enhancement

通过均值漂移密度增强改善医学图像中的异常检测

Kar, Pritam, S, Gouri Lakshmi, Bej, Saptarshi

Abstract

Anomaly detection in medical imaging is essential for identifying rare pathological conditions, particularly when annotated abnormal samples are limited. We propose a hybrid anomaly detection framework that integrates self-supervised representation learning with manifold-based density estimation, a combination that remains largely unexplored in this domain. Medical images are first embedded into a latent feature space using pretrained, potentially domain-specific, backbones. These representations are then refined via Mean Shift Density Enhancement (MSDE), an iterative manifold-shifting procedure that moves samples toward regions of higher likelihood. Anomaly scores are subsequently computed using Gaussian density estimation in a PCA-reduced latent space, where Mahalanobis distance measures deviation from the learned normal distribution. The framework follows a one-class learning paradigm and requires only normal samples for training. Extensive experiments on seven medical imaging datasets demonstrate state-of-the-art performance. MSDE achieves the highest AUC on four datasets and the highest Average Precision on five datasets, including near-perfect performance on brain tumor detection (0.981 AUC/AP). These results underscore the potential of the proposed framework as a scalable clinical decision-support tool for early disease detection, screening in low-label settings, and robust deployment across diverse imaging modalities.
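
The scoring stage described in the abstract above (a Gaussian fitted in a PCA-reduced space, with Mahalanobis distance as the anomaly score) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: it omits the backbone embedding and the Mean Shift Density Enhancement refinement entirely, the function names `fit_normal_model` and `anomaly_score` are hypothetical, and the small ridge added to the covariance is an assumption made for numerical stability.

```python
import numpy as np


def fit_normal_model(features, n_components):
    """Fit PCA and a Gaussian using features from normal samples only."""
    mean = features.mean(axis=0)
    centered = features - mean
    # PCA via SVD: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    projected = centered @ components.T
    # Small ridge (an assumption) keeps the covariance invertible.
    cov = np.cov(projected, rowvar=False) + 1e-6 * np.eye(n_components)
    return mean, components, projected.mean(axis=0), np.linalg.inv(cov)


def anomaly_score(x, model):
    """Mahalanobis distance of one feature vector from the normal Gaussian."""
    mean, components, mu, cov_inv = model
    diff = (x - mean) @ components.T - mu
    return float(np.sqrt(diff @ cov_inv @ diff))


rng = np.random.default_rng(0)
normal_features = rng.normal(size=(200, 5))  # stand-in for backbone embeddings
model = fit_normal_model(normal_features, n_components=2)
print(anomaly_score(normal_features[0], model))
```

Only normal samples enter `fit_normal_model`, which matches the one-class setting in the abstract: a test sample is flagged when its distance from the learned normal distribution is large.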

Chinese Translation

医学成像中的异常检测对于识别罕见的病理状况至关重要，尤其是在标注的异常样本有限的情况下。我们提出了一种混合异常检测框架，该框架将自监督表征学习与基于流形的密度估计相结合，这种组合在该领域仍然很少被探索。医学图像首先通过预训练的、可能是特定领域的主干网络嵌入到潜在特征空间中。这些表征随后通过均值漂移密度增强（Mean Shift Density Enhancement, MSDE）进行精炼，这是一种迭代的流形移动过程，将样本向更高可能性区域移动。然后，使用在主成分分析（PCA）降维的潜在空间中的高斯密度估计计算异常分数，其中马哈拉诺比斯距离用于衡量与学习到的正常分布的偏差。该框架遵循单类学习范式，仅需正常样本进行训练。在七个医学成像数据集上的广泛实验表明了其最先进的性能。MSDE在四个数据集上达到了最高的AUC，在五个数据集上达到了最高的平均精度，包括在脑肿瘤检测中接近完美的表现（0.981 AUC/AP）。这些结果强调了所提出框架作为可扩展的临床决策支持工具在早期疾病检测、低标注环境下的筛查以及在多种成像模式下的稳健部署的潜力。

cs.CV / 44 / 2604.19193

How Far Are Video Models from True Multimodal Reasoning?

视频模型距离真实多模态推理还有多远？

Zhang, Xiaotian, Wei, Jianhui, Wang, Yuan, Tan, Jie, Li, Yichen, Zhang, Yan, Chen, Ziyi, Zhang, Daoan, YU, Dezhi, Xu, Wei, Jiang, Songtao, Liu, Zuozhu

Abstract

Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models' zero-shot reasoning capabilities via Context Learning in Video Generation. CLVG-Bench comprises more than 1,000 high-quality, manually annotated metadata entries across 6 categories and 47 subcategories, covering complex scenarios including physical simulation, logical reasoning, and interactive contexts. To enable rigorous and scalable assessment, we further propose an Adaptive Video Evaluator (AVE) that aligns with human expert perception using minimal annotations, delivering interpretable textual feedback across diverse video context tasks. Extensive experiments reveal a striking answer to our central question: while state-of-the-art (SOTA) video models, such as Seedance 2.0, demonstrate competence on certain understanding and reasoning subtasks, they fall substantially short on logically grounded and interactive generation tasks (achieving success rates <25% and ~0%, respectively), exposing multimodal reasoning and physical grounding as critical bottlenecks. By systematically quantifying these limitations, the proposed method provides actionable feedback and a clear roadmap toward truly robust, general-purpose video models. CLVG-Bench and code are released here.

Chinese Translation

尽管通用视频模型取得了显著进展，但一个关键问题仍未得到解答：这些模型距离实现真实的多模态推理还有多远？现有基准测试未能严格解决这一问题，因为它们受到简单任务设计和碎片化评估指标的限制，忽视了复杂的多模态推理。为填补这一空白，我们引入了CLVG-Bench，一个评估框架，旨在通过视频生成中的上下文学习探测视频模型的零-shot推理能力。CLVG-Bench包含超过1,000个高质量的手动注释元数据，涵盖6个类别和47个子类别，涉及复杂场景，包括物理模拟、逻辑推理和互动上下文。为了实现严格且可扩展的评估，我们进一步提出了一种自适应视频评估器（Adaptive Video Evaluator, AVE），该评估器使用最少的注释与人类专家的感知对齐，提供可解释的文本反馈，适用于多样的视频上下文任务。大量实验揭示了我们中心问题的一个显著答案：尽管最先进的视频模型（如Seedance 2.0）在某些理解和推理子任务上表现出色，但在逻辑基础和互动生成任务上却显著不足（成功率分别低于25%和约0%），暴露出多模态推理和物理基础作为关键瓶颈。通过系统量化这些局限性，所提出的方法提供了可操作的反馈和通向真正稳健的通用视频模型的清晰路线图。CLVG-Bench及其代码在此发布。

cs.CV / 45 / 2604.19196

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

面向领域泛化的人脸防伪视觉基础模型基准测试

Feng, Mika, Gallin-Martel, Pierre, Ito, Koichi, Aoki, Takafumi

Abstract

Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .

Chinese Translation

人脸防伪（FAS）由于需要在未见环境中实现强健的领域泛化而仍然面临挑战。尽管最近的趋势利用视觉-语言模型（VLMs）进行语义监督，但这些多模态方法通常需要昂贵的计算资源，并且表现出较高的推理延迟。此外，它们的有效性本质上受到基础视觉特征质量的限制。本文重新审视了仅基于视觉的基础模型的潜力，以建立一个高效且稳健的人脸防伪基准。我们对15个预训练模型进行了系统的基准测试，包括监督卷积神经网络（CNNs）、监督视觉变换器（ViTs）和自监督视觉变换器（ViTs），在包括MICO和有限源域（LSD）协议在内的严苛跨域场景下进行测试。我们的综合分析表明，自监督视觉模型，特别是带有注册器的DINOv2，显著抑制了注意力伪影，并捕捉到了关键的细粒度伪造线索。结合人脸防伪数据增强（FAS-Aug）、基于补丁的数据增强（PDA）和加权补丁损失（APL），我们提出的仅基于视觉的基准在MICO协议中达到了最先进的性能。该基准在数据受限的LSD协议下超越了现有方法，同时保持了卓越的计算效率。本研究为人脸防伪提供了一个明确的仅基于视觉的基准，表明优化的自监督视觉变换器可以作为未来仅基于视觉和多模态FAS系统的骨干。项目页面可访问：https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ 。

cs.CV / 46 / 2604.19206

When Can We Trust Deep Neural Networks? Towards Reliable Industrial Deployment with an Interpretability Guide

我们何时可以信任深度神经网络？朝着可靠的工业部署迈进，附带可解释性指南

Dong, Hang-Cheng, Jiang, Yuhao, Jiao, Yibo, Zou, Lu, Zheng, Kai, Liu, Bingguo, Ye, Dong, Liu, Guodong

Abstract

The deployment of AI systems in safety-critical domains, such as industrial defect inspection, autonomous driving, and medical diagnosis, is severely hampered by their lack of reliability. A single undetected erroneous prediction can lead to catastrophic outcomes. Unfortunately, there is often no alternative but to place trust in the outputs of a trained AI system, which operates without an internal safeguard to flag unreliable predictions, even in cases of high accuracy. We propose a post-hoc explanation-based indicator to detect false negatives in binary defect detection networks. To our knowledge, this is the first method to proactively identify potentially erroneous network outputs. Our core idea leverages the difference between class-specific discriminative heatmaps and class-agnostic ones. We compute the difference in their intersection over union (IoU) as a reliability score. An adversarial enhancement method is further introduced to amplify this disparity. Evaluations on two industrial defect detection benchmarks show our method effectively identifies false negatives. With adversarial enhancement, it achieves 100% recall, albeit with a trade-off for true negatives. Our work thus advocates for a new and trustworthy deployment paradigm: data-model-explanation-output, moving beyond conventional end-to-end systems to provide critical support for reliable AI in real-world applications.

Chinese Translation

在安全关键领域（如工业缺陷检测、自动驾驶和医疗诊断）中，人工智能系统的部署受到其可靠性不足的严重阻碍。单个未被检测的错误预测可能导致灾难性的后果。不幸的是，往往别无选择，只能信任训练好的人工智能系统的输出，而该系统在高准确率的情况下也没有内部保护机制来标记不可靠的预测。我们提出了一种基于事后解释的指标，用于检测二元缺陷检测网络中的假阴性。据我们所知，这是首个主动识别潜在错误网络输出的方法。我们的核心思想利用了类别特定的判别热图与类别无关热图之间的差异。我们计算它们的交并比（IoU）差异作为可靠性评分。此外，我们引入了一种对抗增强方法，以放大这种差异。在两个工业缺陷检测基准上的评估表明，我们的方法有效识别假阴性。通过对抗增强，尽管在真实阴性方面存在权衡，但其召回率达到了100%。因此，我们的工作倡导一种新的可信部署范式：数据-模型-解释-输出，超越传统的端到端系统，为现实世界应用中的可靠人工智能提供关键支持。
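The abstract above does not spell out how the heatmaps are compared, so the following is only a minimal sketch of one plausible reading: binarize a class-specific and a class-agnostic heatmap and measure their agreement with IoU. The function names, the 0.5 threshold, and the toy heatmaps are all assumptions made for illustration, not the authors' procedure.

```python
def binarize(heatmap, threshold=0.5):
    """Return the set of (row, col) pixels whose value reaches the threshold."""
    return {
        (r, c)
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
        if value >= threshold
    }


def heatmap_iou(map_a, map_b, threshold=0.5):
    """IoU between the thresholded 'hot' regions of two 2D heatmaps."""
    hot_a = binarize(map_a, threshold)
    hot_b = binarize(map_b, threshold)
    union = hot_a | hot_b
    if not union:
        return 1.0  # both maps are empty: treat as full agreement
    return len(hot_a & hot_b) / len(union)


# Toy 2x2 heatmaps (hypothetical values in [0, 1]).
class_specific = [[0.9, 0.8], [0.1, 0.0]]
class_agnostic = [[0.9, 0.1], [0.7, 0.0]]
agreement = heatmap_iou(class_specific, class_agnostic)
```

Under this reading, a low agreement value between the two explanations would be the signal that a "no defect" prediction deserves a second look.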

cs.CV / 47 / 2604.19216

An Object-Centered Data Acquisition Method for 3D Gaussian Splatting using Mobile Phones

一种面向对象的数据采集方法用于移动电话的3D高斯点云生成

Zhang, Yuezhe, Bai, Luqian, Yu, Mengting, Wei, Lei, Wan, Shuai, Zhang, Yifan

Abstract

Data acquisition through mobile phones remains a challenge for 3D Gaussian Splatting (3DGS). In this work we target the object-centered scenario and enable reliable mobile acquisition by providing on-device capture guidance and recording onboard sensor signals for offline reconstruction. After the calibration step, the device orientations are aligned to a baseline frame to obtain relative poses, and the optical axis of the camera is mapped to an object-centered spherical grid for uniform viewpoint indexing. To curb polar sampling bias, we compute area-weighted spherical coverage in real-time and guide the user's motion accordingly. We compare the proposed method with RealityScan and the free-capture strategy. Our method achieves superior reconstruction quality using fewer input images compared to free capture and RealityScan. Further analysis shows that the proposed method is able to obtain more comprehensive and uniform viewpoint coverage during object-centered acquisition.

Chinese Translation

通过移动电话进行数据采集仍然是3D高斯点云生成（3DGS）的一项挑战。在本研究中，我们针对面向对象的场景，提供可靠的移动采集方法，通过在设备上提供捕捉指导并记录传感器信号以便离线重建。经过校准步骤后，设备的方向被对齐到基准帧，以获取相对姿态，并将相机的光轴映射到面向对象的球形网格上，以实现均匀的视角索引。为了抑制极坐标采样偏差，我们实时计算面积加权的球面覆盖，并相应地引导用户的运动。我们将所提出的方法与RealityScan和自由捕捉策略进行了比较。与自由捕捉和RealityScan相比，我们的方法在使用更少的输入图像的情况下实现了更优的重建质量。进一步分析表明，所提出的方法能够在面向对象的采集过程中获得更全面和均匀的视角覆盖。
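The area weighting mentioned in the abstract can be illustrated with a latitude-longitude grid on the unit sphere, where cells near the poles are much smaller than cells near the equator. This is a minimal sketch under assumed conventions: the 6 x 12 grid, the z-up axis, and the function names are hypothetical and not taken from the paper.

```python
import math


def direction_to_cell(x, y, z, n_lat=6, n_lon=12):
    """Map a viewing direction to a (latitude band, longitude sector) index."""
    norm = math.sqrt(x * x + y * y + z * z)
    lat = math.asin(z / norm)                 # in [-pi/2, pi/2]
    lon = math.atan2(y, x) % (2 * math.pi)    # in [0, 2*pi)
    i_lat = min(int((lat + math.pi / 2) / (math.pi / n_lat)), n_lat - 1)
    i_lon = min(int(lon / (2 * math.pi / n_lon)), n_lon - 1)
    return i_lat, i_lon


def cell_area(i_lat, n_lat=6, n_lon=12):
    """Area of one grid cell on the unit sphere: dlon * (sin(lat_hi) - sin(lat_lo))."""
    dlat = math.pi / n_lat
    lat_lo = -math.pi / 2 + i_lat * dlat
    return (2 * math.pi / n_lon) * (math.sin(lat_lo + dlat) - math.sin(lat_lo))


def coverage(visited, n_lat=6, n_lon=12):
    """Fraction of the sphere's area covered by the visited cells."""
    covered = sum(cell_area(i_lat, n_lat, n_lon) for i_lat, _ in set(visited))
    return covered / (4 * math.pi)
```

Counting cells without the area weight would let a user "fill" the polar bands with very little actual viewpoint diversity, which is the sampling bias the weighting is meant to curb.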

cs.CV / 48 / 2604.19217

Attention-based Multi-modal Deep Learning Model of Spatio-temporal Crop Yield Prediction with Satellite, Soil and Climate Data

基于注意力的多模态深度学习模型在卫星、土壤和气候数据下的时空作物产量预测

Shyam, Gopal Krishna, Chandrakar, Ila

Abstract

Crop yield prediction is one of the most important challenges for world food security and policy-making. Conventional forecasting techniques are limited in accuracy because they rely on static data sources that do not reflect the dynamic, intricate relationships among environmental variables over time [5,13]. This paper presents the Attention-Based Multi-Modal Deep Learning Framework (ABMMDLF) for high-accuracy spatio-temporal crop yield prediction. Unlike traditional models, which use only one of these factors, the model combines multi-year satellite imagery, high-resolution meteorological time series, and initial soil properties [12, 21]. The architecture uses Convolutional Neural Networks (CNNs) to extract spatial features and a Temporal Attention Mechanism to adaptively weight important phenological periods, so that the weights change over time and are conditioned on the spatial features of the images and video sequences. Experimentally, the proposed framework achieves an R^2 score of 0.89, substantially better than the baseline models.

Chinese Translation

作物产量预测是一个重要的挑战，对全球粮食安全和政策决策至关重要。传统的预测技术在准确性上受到限制，因为它们利用的静态数据源无法反映环境变量之间随时间变化的动态和复杂关系。本文提出了一种基于注意力的多模态深度学习框架（Attention-Based Multi-Modal Deep Learning Framework, ABMMDLF），建议用于高精度的时空作物产量预测。我们使用的模型结合了多年的卫星影像、高分辨率的气象数据时间序列和初始土壤特性，而不是仅使用上述因素中的一个传统模型。主要架构涉及使用卷积神经网络（Convolutional Neural Networks, CNN）提取空间特征，以及使用时间注意力机制（Temporal Attention Mechanism）自适应地加权算法所针对的重要物候期，以便根据图像和视频序列的空间特征随时间和条件变化。实验结果表明，所提出的研究工作提供了0.89的R^2评分，远优于基线模型的表现。
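As a rough illustration of temporal attention in general, the sketch below scores each time step's feature vector against a query, turns the scores into softmax weights, and returns the weighted sum. The dot-product scoring, the `query` vector, and the function name are assumptions for illustration; the abstract does not specify the mechanism ABMMDLF actually uses.

```python
import math


def temporal_attention(features, query):
    """Weight per-time-step feature vectors by softmax(dot(feature, query)).

    features: list of equal-length vectors, one per time step.
    query: vector of the same length (hypothetical learned parameter).
    Returns (weights, context), where context is the weighted sum.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context
```

With a zero query every time step gets the same weight; a query aligned with one step's features concentrates the weight there, which is how such a layer can emphasize particular growth periods.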

cs.CV / 49 / 2604.19218

Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification

匹配前的思考：一种面向通用行人重识别的强化推理范式

Zhang, Quan, Wu, Jingze, Wang, Jialong, Xie, Xiaohua, Lai, Jianhuang, Chen, Hongbo

Abstract

Learning identity-discriminative representations with multi-scene generality has become a critical objective in person re-identification (ReID). However, mainstream perception-driven paradigms tend to fit massive annotated data rather than understand identity-causal cues, which yields representations that are fragile against multiple disruptions. In this work, ReID-R is proposed as a novel reasoning-driven paradigm that achieves explicit identity understanding and reasoning by incorporating chain-of-thought (CoT) into the ReID pipeline. Specifically, ReID-R consists of two stages: (i) Discriminative reasoning warm-up, where a model is trained in a CoT label-free manner to acquire identity-aware feature understanding; and (ii) Efficient reinforcement learning, which proposes a non-trivial sampling strategy to construct scene-generalizable data. On this basis, ReID-R leverages high-quality reward signals to guide the model toward focusing on ID-related cues, achieving accurate reasoning and correct responses. Extensive experiments on multiple ReID benchmarks demonstrate that ReID-R achieves identity discrimination competitive with leading methods using only 14.3K non-trivial samples (20.9% of the existing data scale). Furthermore, benefiting from its inherent reasoning, ReID-R can provide high-quality interpretations of its results.

Chinese Translation

学习具有多场景通用性的身份区分表示已成为行人重识别（ReID）中的一个关键目标。然而，主流的感知驱动范式往往倾向于从大量标注数据中识别匹配，而不是理解身份因果线索，这使得其在面对多种干扰时表现出脆弱的表示能力。在本研究中，提出了ReID-R作为一种新颖的推理驱动范式，通过将思维链（chain-of-thought）融入ReID流程，实现明确的身份理解和推理。具体而言，ReID-R由两个阶段的贡献组成：(i) 区分性推理预热，在这一阶段，模型以无标签的思维链方式进行训练，以获取身份感知特征理解；(ii) 高效的强化学习，提出了一种非平凡的采样方法来构建场景通用的数据。在此基础上，ReID-R利用高质量的奖励信号引导模型关注与身份相关的线索，从而实现准确的推理和正确的响应。在多个ReID基准上的广泛实验表明，ReID-R在仅使用14.3K非平凡数据（占现有数据规模的20.9%）的情况下，达到了与优越方法竞争的身份区分能力。此外，得益于内在的推理能力，ReID-R能够为结果提供高质量的解释。

cs.CV / 50 / 2604.19233

Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery

自适应切片辅助超推理在高分辨率图像中增强小物体检测的研究

Moretti, Francesco, Jin, Yi, Mario, Guiqin

Abstract

Deep learning-based object detectors have achieved remarkable success across numerous computer vision applications, yet they continue to struggle with small object detection in high-resolution aerial and satellite imagery, where dense object distributions, variable shooting angles, diminutive target sizes, and substantial inter-class variability pose formidable challenges. Existing slicing strategies that partition high-resolution images into manageable patches have demonstrated promising results for enlarging the effective receptive field of small targets; however, their reliance on fixed slice dimensions introduces significant redundant computation, inflating inference cost and undermining detection speed. In this paper, we propose Adaptive Slicing-Assisted Hyper Inference (ASAHI), a novel slicing framework that shifts the paradigm from prescribing a fixed slice size to adaptively determining the optimal number of slices according to image resolution, thereby substantially mitigating redundant computation while preserving beneficial overlap between adjacent patches. ASAHI integrates three synergistic components: (1) an adaptive resolution-aware slicing algorithm that dynamically generates 6 or 12 overlapping patches based on a learned threshold, (2) a slicing-assisted fine-tuning (SAF) strategy that constructs augmented training data comprising both full-resolution and sliced image patches, and (3) a Cluster-DIoU-NMS (CDN) post-processing module that combines the geometric merging efficiency of Cluster-NMS with the center-distance-aware suppression of DIoU-NMS to achieve robust duplicate elimination in crowded scenes. Extensive experiments on VisDrone2019 and xView demonstrate that ASAHI achieves state-of-the-art performance with 56.8% on VisDrone2019-DET-val and 22.7% on xView-test, while reducing inference time by 20-25% compared to the baseline SAHI method.

Chinese Translation

基于深度学习的物体检测器在众多计算机视觉应用中取得了显著成功，但在高分辨率航空和卫星图像中的小物体检测仍然面临挑战，这些挑战包括密集的物体分布、可变的拍摄角度、微小的目标尺寸以及显著的类别间变异性。现有的切片策略通过将高分辨率图像划分为可管理的补丁，已显示出在扩大小目标有效感受野方面的良好效果；然而，它们对固定切片尺寸的依赖导致了显著的冗余计算，增加了推理成本并降低了检测速度。本文提出了自适应切片辅助超推理（ASAHI），这是一种新颖的切片框架，改变了从规定固定切片大小到根据图像分辨率自适应确定最佳切片数量的范式，从而在保留相邻补丁之间有益重叠的同时，显著减轻冗余计算。ASAHI集成了三个协同组件：(1)一种自适应的分辨率感知切片算法，根据学习到的阈值动态生成6个或12个重叠补丁，(2)一种切片辅助微调（SAF）策略，构建包含全分辨率和切片图像补丁的增强训练数据，以及(3)一个Cluster-DIoU-NMS（CDN）后处理模块，将Cluster-NMS的几何合并效率与DIoU-NMS的中心距离感知抑制相结合，以在拥挤场景中实现稳健的重复消除。在VisDrone2019和xView上的大量实验表明，ASAHI在VisDrone2019-DET-val上达到了56.8%的最先进性能，在xView-test上达到了22.7%，同时与基线SAHI方法相比，推理时间减少了20-25%。
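The center-distance-aware suppression the abstract attributes to DIoU-NMS rests on the standard DIoU score: IoU minus the squared distance between box centers divided by the squared diagonal of the smallest enclosing box. The sketch below shows that score with a plain greedy NMS loop. It is not the matrix-based Cluster-NMS variant that CDN combines it with, and the function names and the 0.5 threshold are assumptions for illustration.

```python
def iou_and_diou(box_a, box_b):
    """Return (IoU, DIoU) for two boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Squared distance between the two box centers.
    center_dist_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    diag_sq = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou, iou - center_dist_sq / diag_sq


def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep a box only if its DIoU with every kept box is below threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou_and_diou(boxes[i], boxes[k])[1] < threshold for k in kept):
            kept.append(i)
    return kept
```

Because the center-distance term lowers the score for boxes whose centers are far apart, two heavily overlapping boxes that belong to different neighboring objects are less likely to suppress each other than under plain IoU, which matters in crowded aerial scenes.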

cs.CV / 51 / 2604.19234

Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

学习正确的步骤赋予奖励：面向目标的视觉生成过程优化

Li, Rui, Hao, Ke, Liang, Yuanzhi, Huang, Haibin, Zhang, Chi, YunGu, Li, XueLong

Abstract

Reinforcement learning, particularly Group Relative Policy Optimization (GRPO), has emerged as an effective framework for post-training visual generative models with human preference signals. However, its effectiveness is fundamentally limited by coarse reward credit assignment. In modern visual generation, multiple reward models are often used to capture heterogeneous objectives, such as visual quality, motion consistency, and text alignment. Existing GRPO pipelines typically collapse these rewards into a single static scalar and propagate it uniformly across the entire diffusion trajectory. This design ignores the stage-specific roles of different denoising steps and produces mistimed or incompatible optimization signals. To address this issue, we propose Objective-aware Trajectory Credit Assignment (OTCA), a structured framework for fine-grained GRPO training. OTCA consists of two key components. Trajectory-Level Credit Decomposition estimates the relative importance of different denoising steps. Multi-Objective Credit Allocation adaptively weights and combines multiple reward signals throughout the denoising process. By jointly modeling temporal credit and objective-level credit, OTCA converts coarse reward supervision into a structured, timestep-aware training signal that better matches the iterative nature of diffusion-based generation. Extensive experiments show that OTCA consistently improves both image and video generation quality across evaluation metrics.

Chinese Translation

强化学习，特别是群体相对策略优化（Group Relative Policy Optimization, GRPO），已成为一种有效的框架，用于通过人类偏好信号对后训练的视觉生成模型进行优化。然而，其有效性在根本上受到粗糙奖励赋值的限制。在现代视觉生成中，通常使用多个奖励模型来捕捉异质目标，例如视觉质量、运动一致性和文本对齐。现有的GRPO流程通常将这些奖励合并为一个静态标量，并在整个扩散轨迹中均匀传播。这种设计忽略了不同去噪步骤的阶段特定角色，产生了时机不当或不兼容的优化信号。为了解决这个问题，我们提出了面向目标的轨迹奖励赋值（Objective-aware Trajectory Credit Assignment, OTCA），这是一个用于细粒度GRPO训练的结构化框架。OTCA由两个关键组件组成：轨迹级奖励分解（Trajectory-Level Credit Decomposition）估计不同去噪步骤的相对重要性；多目标奖励分配（Multi-Objective Credit Allocation）在整个去噪过程中自适应地加权和组合多个奖励信号。通过联合建模时间信用和目标级信用，OTCA将粗糙的奖励监督转化为一个结构化的、时间步感知的训练信号，更好地匹配基于扩散的生成的迭代特性。大量实验表明，OTCA在各项评估指标上始终提高了图像和视频生成质量。
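To make the contrast in the abstract concrete, the sketch below first computes a GRPO-style group-relative advantage (reward minus group mean, divided by group standard deviation) and broadcasts it uniformly over all denoising steps, then shows a timestep-aware alternative that mixes per-objective advantages with step-specific weights. The weight table is a hand-written placeholder, the population standard deviation is one of several conventions, and none of the names come from the OTCA paper, which estimates such weights rather than fixing them.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


def uniform_credit(advantage, num_steps):
    """Baseline: the same scalar advantage at every denoising step."""
    return [advantage] * num_steps


def timestep_aware_credit(per_objective_adv, step_weights):
    """Mix per-objective advantages with hypothetical per-step weights.

    step_weights[t][k] is the weight of objective k at denoising step t.
    """
    return [
        sum(w * a for w, a in zip(weights_t, per_objective_adv))
        for weights_t in step_weights
    ]
```

For example, with two objectives whose advantages are `[1.0, -1.0]` and weights that shift from the first objective to the second over three steps, the per-step signal moves from positive to negative instead of collapsing to a single averaged scalar.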

cs.CV / 52 / 2604.19238

Allo{SR}$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows

Allo{SR}$^2$: 通过变形生成流修正一步超分辨率以保持真实

Wang, Zihan, Huang, Xudong, Qiao, Junbo, Li, Wei, Hu, Jie, Chen, Xinghao, Lin, Shaohui

Abstract

Real-world image super-resolution (Real-SR) has been revolutionized by leveraging the powerful generative priors of large-scale diffusion and flow-based models. However, fine-tuning these models on limited LR-HR pairs often precipitates "prior collapse", in which the model sacrifices its inherent generative richness to overfit specific training degradations. This issue is further exacerbated in one-step generation, where the absence of multi-step refinement leads to significant trajectory drift and artifact generation. In this paper, we propose Allo{SR}$^2$, a novel framework that rectifies one-step SR trajectories via allomorphic generative flows to maintain high-fidelity generative realism. Specifically, we utilize Signal-to-Noise Ratio (SNR) Guided Trajectory Initialization to establish a physically grounded starting state by aligning the degradation level of LR latent features with the optimal anchoring timestep of the pre-trained flow. To ensure a stable, curvature-free path for one-step inference, we propose Flow-Anchored Trajectory Consistency (FATC), which enforces velocity-level supervision across intermediate states. Furthermore, we develop Allomorphic Trajectory Matching (ATM), a self-adversarial alignment strategy that minimizes the distributional discrepancy between the SR flow and the generative flow in a unified vector field. Extensive experiments on both synthetic and real-world benchmarks demonstrate that Allo{SR}$^2$ achieves state-of-the-art performance in one-step Real-SR, offering a superior balance between restoration fidelity and generative realism while maintaining extreme efficiency.

Chinese Translation

现实世界图像超分辨率（Real-SR）通过利用大规模扩散和基于流的模型的强大生成先验发生了革命性的变化。然而，在有限的低分辨率-高分辨率（LR-HR）配对上微调这些模型往往会导致“先验崩溃”，即模型牺牲其固有的生成丰富性以过拟合特定的训练退化。这一问题在一步生成中尤为严重，因为缺乏多步精细化导致显著的轨迹漂移和伪影生成。本文提出了Allo{SR}$^2$，一个新颖的框架，通过变形生成流修正一步超分辨率轨迹，以保持高保真度的生成真实感。具体而言，我们利用信噪比（SNR）引导的轨迹初始化，建立一个物理基础的起始状态，通过将低分辨率潜在特征的退化水平与预训练流的最佳锚定时间步对齐。为了确保一步推断的稳定性和无曲率路径，我们提出了流锚定轨迹一致性（FATC），它在中间状态之间强制实施速度级监督。此外，我们开发了变形轨迹匹配（ATM），这是一种自对抗对齐策略，旨在最小化超分辨率流和生成流在统一向量场中的分布差异。在合成和现实世界基准上的大量实验表明，Allo{SR}$^2$在一步Real-SR中实现了最先进的性能，在恢复保真度和生成真实感之间提供了优越的平衡，同时保持极高的效率。

cs.CV / 53 / 2604.19257

Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images

无姿态到3D：从真实世界图像学习模拟准备车辆

Liu, Hongyuan, Zou, Bochao, Liu, Qiankun, Yu, Haochen, Mei, Qi, Jiang, Jianfei, Liu, Chen, Bi, Cheng, Wang, Zhao, Zhang, Xueyang, Zhan, Yifei, Chen, Jiansheng, Ma, Huimin

Abstract

Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameters from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed images. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.

Chinese Translation

创建逼真且适用于模拟的3D资产对于自动驾驶研究和虚拟环境构建至关重要。然而，现有的3D车辆生成方法通常是在与真实世界分布存在显著领域差距的合成数据上训练的。生成的模型往往表现出任意姿态和未定义的尺度，导致在驾驶场景中集成时视觉一致性较差。本文提出了无姿态到3D（Unposed-to-3D），一个新颖的框架，旨在通过仅使用图像监督从真实世界驾驶图像中重建3D车辆。我们的方法分为两个阶段。在第一阶段，我们使用具有已知相机参数的有姿态图像训练图像到3D重建网络。在第二阶段，我们去除相机监督，使用一个相机预测头直接从无姿态图像中估计相机参数。预测的姿态随后用于可微渲染，以提供自我监督的光度反馈，使模型能够仅从无姿态图像中学习3D几何。为了确保模拟准备性，我们进一步引入一个尺度感知模块来预测真实世界的尺寸信息，以及一个协调模块，使生成的车辆适应目标驾驶场景的一致光照和外观。大量实验表明，无姿态到3D（Unposed-to-3D）有效地从真实世界图像中重建逼真、姿态一致且协调的3D车辆模型，为创建高质量的驾驶场景模拟和数字双胞胎环境提供了可扩展的路径。

cs.CV / 54 / 2604.19259

Feature Perturbation Pool-based Fusion Network for Unified Multi-Class Industrial Defect Detection

基于特征扰动池的融合网络用于统一多类别工业缺陷检测

Xu, Yuanchan, Zang, Wenjun, Wu, Ying

Abstract

Multi-class defect detection constitutes a critical yet challenging task in industrial quality inspection, where existing approaches typically suffer from two fundamental limitations: (i) the necessity of training separate models for each defect category, resulting in substantial computational and memory overhead, and (ii) degraded robustness caused by inter-class feature perturbation when heterogeneous defect categories are jointly modeled. In this paper, we present FPFNet, a Feature Perturbation Pool-based Fusion Network that synergistically integrates a stochastic feature perturbation pool with a multi-layer feature fusion strategy to address these challenges within a unified detection framework. The feature perturbation pool enriches the training distribution by randomly injecting diverse noise patterns -- including Gaussian noise, F-Noise, and F-Drop -- into the extracted feature representations, thereby strengthening the model's robustness against domain shifts and unseen defect morphologies. Concurrently, the multi-layer feature fusion module aggregates hierarchical feature representations from both the encoder and decoder through residual connections and normalization, enabling the network to capture complex cross-scale relationships while preserving fine-grained spatial details essential for precise defect localization. Built upon the UniAD architecture (You et al., 2022), our method achieves state-of-the-art performance on two widely adopted benchmarks: 97.17% image-level AUROC and 96.93% pixel-level AUROC on MVTec-AD, and 91.08% image-level AUROC and 99.08% pixel-level AUROC on VisA, surpassing existing methods by notable margins while introducing no additional learnable parameters or computational complexity.

Chinese Translation

多类别缺陷检测是工业质量检测中一项关键但具有挑战性的任务，现有方法通常面临两个基本限制：（i）需要为每个缺陷类别训练单独的模型，导致显著的计算和内存开销；（ii）当异构缺陷类别共同建模时，因类别间特征扰动而导致的鲁棒性下降。本文提出了FPFNet，一种基于特征扰动池的融合网络，协同整合随机特征扰动池与多层特征融合策略，以在统一检测框架内应对这些挑战。特征扰动池通过随机注入多样的噪声模式（包括高斯噪声、F-Noise和F-Drop）来丰富训练分布，从而增强模型对领域变化和未见缺陷形态的鲁棒性。同时，多层特征融合模块通过残差连接和归一化聚合编码器和解码器的层次特征表示，使网络能够捕捉复杂的跨尺度关系，同时保留对精确缺陷定位至关重要的细粒度空间细节。基于UniAD架构，我们的方法在两个广泛采用的基准测试中实现了最先进的性能：在MVTec-AD上，图像级AUROC达到97.17%，像素级AUROC达到96.93%；在VisA上，图像级AUROC达到91.08%，像素级AUROC达到99.08%，显著超越现有方法，同时未引入额外的可学习参数或计算复杂性。

cs.CV / 55 / 2604.19264

DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

DR-MMSearchAgent：在多模态搜索代理中深化推理

Wang, Shengqin, Yan, Wentao, Zhou, Huichi, Chen, Yihang, Shao, Kun, Zhang, Zhizhong, Xie, Yuan

Abstract

Agentic multimodal models have garnered significant attention for their ability to leverage external tools to tackle complex tasks. However, such agents often suffer premature interaction collapse, for two primary reasons: 1) the terminal reward, which is typically attached to the last token, prevents the advantage from distinguishing trajectories with exploratory behavior; 2) excessively redundant context hinders the agent from absorbing useful feedback. To address these issues, we propose the Deepening Reasoning MMSearchAgent (DR-MMSearchAgent), a framework that leverages structural proximity to derive advantage signals from the whole rollout trajectories across an entire batch, so that trajectories of different lengths are encouraged even when they contain the same correct answer. Additionally, differentiated Gaussian rewards are employed to dynamically calibrate interaction tolerance, thereby ensuring information reliability and reducing redundancy. To support multi-turn interaction training, we construct a multi-step deep-reasoning dataset of 3602 high-quality QA pairs, each with at least 3 reasoning steps. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming MMSearch-R1 by 8.4% on FVQA-test.

Chinese Translation

代理多模态模型因其利用外部工具解决复杂任务的能力而受到广泛关注。然而，观察到此类代理常常遭遇早期交互崩溃，这主要由两个原因造成：1）终端奖励通常附加在最后一个标记上，阻碍了区分具有探索性行为的轨迹的优势；2）过于冗余的上下文妨碍了代理吸收有用反馈。为了解决这些问题，我们提出了深化推理MMSearchAgent框架，该框架利用结构邻近性从整个批次的完整回合轨迹中推导优势信号，从而进一步鼓励生成不同长度的轨迹，即使它们包含相同的正确答案。此外，采用差异化的高斯奖励动态校准交互容忍度，从而确保信息的可靠性并减少冗余。为了支持多轮交互训练，我们构建了一个多步骤深度推理数据集，包括3602个高质量的问答对，至少包含3个推理步骤。大量实验表明，我们的方法达到了最先进的性能，在FVQA-test上比MMSearch-R1提高了8.4%。

cs.CV / 56 / 2604.19314

Framelet-Based Blind Image Restoration with Minimax Concave Regularization

基于框架小波的盲图像恢复与极小极大凹正则化

Zhang, Heng, Parvaz, Reza, Yang, Rui

Abstract

Recovering corrupted images is one of the most challenging problems in image processing. Among various restoration tasks, blind image deblurring has been extensively studied due to its practical importance and inherent difficulty. In this problem, both the point spread function (PSF) and the underlying latent sharp image must be estimated simultaneously. This problem cannot be solved directly due to its ill-posed nature. One powerful tool for solving such problems is total variation (TV) regularization. The $\ell_0$-norm regularization within the TV framework has been widely adopted to promote sparsity in image gradients or transform domains, leading to improved preservation of edges and fine structures. However, the use of the $\ell_0$-norm results in a highly nonconvex and computationally intractable optimization problem, which limits its practical applicability. To overcome these difficulties, we employ the minimax concave penalty (MCP), which promotes enhanced sparsity and provides a closer approximation to the $\ell_0$-norm. In addition, a reweighted $\ell_1$-norm regularization is incorporated to further reduce estimation bias and improve the preservation of fine image details and textures. After introducing the proposed model, a numerical algorithm is developed to solve the resulting optimization problem. The effectiveness of the proposed approach is then demonstrated through experimental evaluations on several test images.

Chinese Translation

恢复受损图像是图像处理领域中最具挑战性的问题之一。在各种恢复任务中，盲图像去模糊因其实际重要性和固有难度而受到广泛研究。在这一问题中，点扩散函数（PSF）和潜在的清晰图像必须同时被估计。由于该问题的病态特性，无法直接求解。解决此类问题的一个强大工具是全变差（TV）正则化。在TV框架内，$\ell_0$-范数正则化被广泛采用，以促进图像梯度或变换域的稀疏性，从而改善边缘和细微结构的保留。然而，使用$\ell_0$-范数会导致一个高度非凸且计算上难以处理的优化问题，限制了其实际应用性。为克服这些困难，我们采用极小极大凹罚函数（MCP），它促进了更强的稀疏性，并提供了对$\ell_0$-范数的更接近的近似。此外，结合加权$\ell_1$-范数正则化进一步减少估计偏差，并改善细节和纹理的保留。在介绍所提出的模型后，开发了一种数值算法来解决所产生的优化问题。最后，通过对多个测试图像的实验评估，证明了所提方法的有效性。
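To show why MCP sits closer to the $\ell_0$-norm than the $\ell_1$-norm does, the sketch below evaluates the standard scalar minimax concave penalty, which grows like $\lambda|t|$ near zero and then flattens to the constant $\gamma\lambda^2/2$ once $|t| > \gamma\lambda$. The parameter values are arbitrary choices for illustration, and how the paper applies the penalty to framelet coefficients is not reproduced here.

```python
def mcp_penalty(t, lam=1.0, gamma=2.0):
    """Scalar minimax concave penalty with threshold lam and concavity gamma > 1."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # constant: large values are not penalized further


def l1_penalty(t, lam=1.0):
    """Scalar l1 penalty, shown for comparison; it keeps growing with |t|."""
    return lam * abs(t)
```

Because the penalty stops growing for large coefficients, strong edges are not shrunk the way they are under $\ell_1$, which is the bias reduction such nonconvex penalties are known for.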

cs.CV / 57 / 2604.19318

Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes

基于视图交互的多视角人群追踪变换器在大型真实场景中的应用

Zhang, Qi, Chen, Jixuan, Zhang, Kaiyi, Yu, Xinquan, Chan, Antoni B., Huang, Hui

Abstract

Multi-view crowd tracking estimates each person's trajectory on the ground plane of the scene. Recent research mainly relies on CNN-based multi-view crowd tracking architectures, and most of them are evaluated and compared on relatively small datasets, such as Wildtrack and MultiviewX. Since these two datasets are collected in small scenes and only contain tens of frames in the evaluation stage, it is difficult for the current methods to be applied to real-world applications where scene size and occlusion are more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, MVTrackTrans, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. Besides, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which cover a much larger scene over a longer time period. Compared with existing methods on the two new large datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task to more practical scenarios, and the datasets and code are available at: https://github.com/zqyq/MVTrackTrans.

Chinese Translation

多视角人群追踪旨在估计每个人在场景地面上的追踪轨迹。近期的研究主要依赖于基于卷积神经网络（CNN）的多视角人群追踪架构，而大多数研究是在相对较小的数据集上进行评估和比较，例如Wildtrack和MultiviewX。由于这两个数据集是在小场景中收集的，并且在评估阶段仅包含数十帧，因此当前方法很难应用于场景规模和遮挡更复杂的真实世界应用。在本文中，我们提出了一种基于变换器（Transformer）的多视角人群追踪模型MVTrackTrans，该模型采用相机视角与地面平面之间的交互，以增强多视角追踪性能。此外，为了更好地评估，我们收集并标注了两个大型真实世界多视角追踪数据集MVCrowdTrack和CityTrack，这些数据集在更长时间内包含了更大的场景规模。与现有方法在这两个大型新数据集上的表现相比，所提出的MVTrackTrans模型取得了更好的性能，展示了该模型设计在处理大场景方面的优势。我们相信，所提出的数据集和模型将推动该任务在更实际场景中的发展，数据集和代码可在以下链接获取：https://github.com/zqyq/MVTrackTrans。

cs.CV / 58 / 2604.19324

PLaMo 2.1-VL Technical Report

PLaMo 2.1-VL 技术报告

Kerola, Tommi, Masuda, Yuya, Masuko, Takashi, Nakanishi, Toshiki, Nishino, Daisuke, Takahashi, Kuniyuki, Wang, Hanqin, Yamada, Yoshihiro

Abstract

We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly detection bbox + label F1-score from 39.7 to 64.9.

Chinese Translation

我们介绍了 PLaMo 2.1-VL，这是一种轻量级视觉语言模型（Vision Language Model, VLM），适用于自主设备，提供 8B 和 2B 两种变体，旨在支持本地和边缘部署，并能够进行日语操作。该模型的核心能力集中在视觉问答（Visual Question Answering, VQA）和视觉定位（Visual Grounding）上，我们为两个现实应用场景进行了模型的开发和评估：通过工具识别进行工厂任务分析，以及基础设施异常检测。我们还开发了一个大规模合成数据生成管道和全面的日语训练与评估资源。PLaMo 2.1-VL 在日语和英语基准测试中超越了可比的开放模型，在 JA-VG-VQA-500 上达到了 61.5 的 ROUGE-L 分数，在日语 Ref-L4 上的准确率为 85.2%。在这两个应用场景中，它在工厂任务分析中达到了 53.9% 的零样本准确率，并且在电厂数据上进行微调后，异常检测的边界框 + 标签 F1 分数从 39.7 提升至 64.9。

cs.CV / 59 / 2604.19334

Silicon Aware Neural Networks

硅感知神经网络

Fieldhouse, Sebastian, Tang, Kea-Tiong

Abstract

Recent work in the machine learning literature has demonstrated that deep learning can train neural networks made of discrete logic gate functions to perform simple image classification tasks at very high speeds on CPU, GPU and FPGA platforms. By virtue of being formed by discrete logic gates, these Differentiable Logic Gate Networks (DLGNs) lend themselves naturally to implementation in custom silicon - in this work we present a method to map DLGNs in a one-to-one fashion to a digital CMOS standard cell library by converting the trained model to a gate-level netlist. We also propose a novel loss function whereby the DLGN can optimize the area, and indirectly power consumption, of the resulting circuit by minimizing the expected area per neuron based on the area of the standard cells in the target standard cell library. Finally, we also show for the first time an implementation of a DLGN as a silicon circuit in simulation, performing layout of a DLGN in the SkyWater 130nm process as a custom hard macro using a Cadence standard cell library and performing post-layout power analysis. We find that our custom macro can perform classification on MNIST with 97% accuracy 41.8 million times a second at a power consumption of 83.88 mW.

Chinese Translation

近期的机器学习文献表明，深度学习可以训练由离散逻辑门函数构成的神经网络，以在CPU、GPU和FPGA平台上以非常高的速度执行简单的图像分类任务。由于由离散逻辑门构成，这些可微分逻辑门网络（Differentiable Logic Gate Networks, DLGNs）自然适合在定制硅上实现。在本研究中，我们提出了一种将DLGNs以一对一的方式映射到数字CMOS标准单元库的方法，通过将训练好的模型转换为门级网表。此外，我们还提出了一种新颖的损失函数，使得DLGN能够通过最小化基于目标标准单元库中标准单元面积的每个神经元的期望面积，来优化所生成电路的面积，并间接优化功耗。最后，我们首次展示了DLGN作为硅电路的实现，通过在SkyWater 130nm工艺中使用Cadence标准单元库进行DLGN布局，并进行后布局功耗分析。我们发现，我们的定制宏可以以97%的准确率每秒对MNIST进行4180万次分类，功耗为83.88毫瓦。
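Two quantities in the abstract lend themselves to a short worked sketch: an expected-area term for one neuron, and the energy per classification implied by the reported power and throughput. The expected area below assumes each neuron holds a softmax distribution over candidate gate types, as is common in the differentiable logic gate network literature; the two-gate example and its area values are made-up placeholders, not figures from any real standard cell library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_area(gate_logits, cell_areas):
    """Expected cell area of one neuron under its gate-choice distribution."""
    probs = softmax(gate_logits)
    return sum(p * a for p, a in zip(probs, cell_areas))


def energy_per_inference(power_watts, inferences_per_second):
    """Average energy in joules spent on one inference."""
    return power_watts / inferences_per_second


# Hypothetical neuron choosing between two gates with placeholder areas.
area = expected_area([0.0, 0.0], [1.0, 3.0])

# Figures reported in the abstract: 83.88 mW at 41.8 million classifications/s.
joules = energy_per_inference(83.88e-3, 41.8e6)
```

With the reported figures this works out to roughly 2.0 nJ per MNIST classification. Adding the expected-area term to the training loss would push each neuron's distribution toward smaller cells whenever accuracy allows it.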

cs.CV / 60 / 2604.19339

Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data

在有限数据条件下对高相似性背景下整体认知的分治方法

Wang, Shijie, Wang, Zijian, Luo, Yadan, Li, Haojie, Huang, Zi, Baktashmotlagh, Mahsa

Abstract

Ultra-fine-grained visual categorization (Ultra-FGVC) aims to classify highly similar subcategories within fine-grained objects using limited training samples. However, holistic yet discriminative cues, such as leaf contours in extremely similar cultivars, remain under-explored in current studies, thereby limiting recognition performance. Though crucial, modeling holistic cues with complex morphological structures typically requires massive training samples, posing significant challenges in data-limited scenarios. To address this challenge, we propose a novel Divide-and-Conquer Holistic Cognition Network (DHCNet) that implements a divide-and-conquer strategy by decomposing holistic cues into spatially-associated subtle discrepancies and progressively establishing the holistic cognition process, significantly simplifying holistic cognition while reducing dependency on training data. Technically, DHCNet begins by progressively analyzing subtle discrepancies, transitioning from smaller local patches to larger ones using a self-shuffling operation on local regions. Simultaneously, it leverages the unaffected local regions to potentially guide the perception of the original topological structure among the shuffled patches, thereby aiding in the establishment of spatial associations for these discrepancies. Additionally, DHCNet incorporates the online refinement of these holistic cues discovered from local regions into the training process to iteratively improve their quality. As a result, DHCNet uses these holistic cues as supervisory signals to fine-tune the parameters of the recognition model, thus improving its sensitivity to holistic cues across the entire objects. Extensive evaluations demonstrate that DHCNet achieves remarkable performance on five widely-used Ultra-FGVC datasets.

Chinese Translation

超细粒度视觉分类（Ultra-FGVC）旨在使用有限的训练样本对细粒度对象中的高度相似子类别进行分类。然而，整体而又具辨别性的线索，例如在极其相似的品种中的叶片轮廓，在当前研究中仍然未得到充分探索，从而限制了识别性能。尽管至关重要，但对具有复杂形态结构的整体线索进行建模通常需要大量的训练样本，这在数据有限的情况下带来了显著挑战。为了解决这一挑战，我们提出了一种新颖的分治整体认知网络（DHCNet），该网络通过将整体线索分解为空间关联的细微差异并逐步建立整体认知过程，实施分治策略，从而显著简化整体认知，同时减少对训练数据的依赖。从技术上讲，DHCNet首先通过逐步分析细微差异，从较小的局部区域过渡到较大的区域，使用局部区域的自洗牌操作。同时，它利用未受影响的局部区域来潜在地引导在洗牌补丁之间原始拓扑结构的感知，从而帮助建立这些差异的空间关联。此外，DHCNet将从局部区域发现的这些整体线索的在线精炼纳入训练过程中，以迭代提高其质量。因此，DHCNet使用这些整体线索作为监督信号来微调识别模型的参数，从而提高其对整个对象中整体线索的敏感性。广泛的评估表明，DHCNet在五个广泛使用的Ultra-FGVC数据集上取得了显著的性能。
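The self-shuffling step described in the abstract can be illustrated with a minimal Python sketch, under stated assumptions: the image is represented as a grid of patch identifiers, one square local region is shuffled, and everything outside it is left untouched. The names `shuffle_local_region` and `progressive_shuffles`, the grid representation, and the region parameters are hypothetical illustrations, not DHCNet's actual implementation.

```python
import random

def shuffle_local_region(grid, top, left, size, seed=0):
    """Shuffle the patches inside one size x size local region of a patch grid.

    `grid` is a list of rows of patch identifiers (hypothetical representation).
    Patches outside the region are left untouched, so they still convey the
    original layout.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r in range(top, top + size)
             for c in range(left, left + size)]
    values = [grid[r][c] for r, c in cells]
    rng.shuffle(values)
    out = [row[:] for row in grid]  # copy so the input grid is not modified
    for (r, c), v in zip(cells, values):
        out[r][c] = v
    return out

def progressive_shuffles(grid, top, left, sizes, seed=0):
    """Yield shuffled grids for a sequence of region sizes (smaller to larger)."""
    for i, size in enumerate(sizes):
        yield shuffle_local_region(grid, top, left, size, seed + i)
```

Calling `progressive_shuffles(grid, 0, 0, [2, 3])` would produce one shuffled grid per region size, loosely mirroring the transition from smaller local patches to larger ones.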

cs.CV / 61 / 2604.19345

Geometry-Guided Self-Supervision for Ultra-Fine-Grained Recognition with Limited Data

几何引导的自监督学习用于有限数据下的超细粒度识别

Wang, Shijie, Luo, Yadan, Wang, Zijian, Li, Haojie, Huang, Zi, Baktashmotlagh, Mahsa

Abstract

This paper investigates the intrinsic geometrical features of highly similar objects and introduces a general self-supervised framework called the Geometric Attribute Exploration Network (GAEor), which is designed to address the ultra-fine-grained visual categorization (Ultra-FGVC) task in data-limited scenarios. Unlike prior work, which typically focuses on capturing subtle yet critical distinctions, GAEor generates geometric attributes as novel alternative recognition cues. These attributes are determined by various details within the object, aligned with its geometric patterns, such as the intricate vein structures in soybean leaves. Crucially, each category exhibits distinct geometric descriptors that serve as powerful cues, even among objects with minimal visual variation, a factor largely overlooked in recent research. GAEor discovers these geometric attributes by first amplifying geometry-relevant details via visual feedback from a backbone network, then embedding the relative polar coordinates of these details into the final representation. Extensive experiments demonstrate that GAEor sets new state-of-the-art records on five widely-used Ultra-FGVC benchmarks.

Chinese Translation

本文研究了高度相似物体的内在几何特征，并引入了一种通用的自监督框架，称为几何属性探索网络（Geometric Attribute Exploration Network，GAEor），旨在解决数据有限场景下的超细粒度视觉分类（Ultra-FGVC）任务。与以往研究通常捕捉微妙但关键的区别不同，GAEor 生成几何属性作为新颖的替代识别线索。这些属性由物体内部的各种细节决定，并与其几何模式对齐，例如大豆叶片中的复杂脉络结构。重要的是，每个类别都展现出独特的几何描述符，即使在视觉变化极小的物体之间，这些描述符也能作为强有力的线索——这一点在近期研究中往往被忽视。GAEor 通过首先通过主干网络的视觉反馈放大与几何相关的细节，然后将这些细节的相对极坐标嵌入最终表示，来发现这些几何属性。大量实验表明，GAEor 在五个广泛使用的 Ultra-FGVC 基准测试中显著创下了新的最先进记录。
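As a small illustration of what "relative polar coordinates" of detail locations could look like, the sketch below converts 2D points to (radius, angle) pairs around a reference point. Using the centroid as the default reference is an assumption made here; the abstract does not say which origin GAEor uses, and the function name `relative_polar` is hypothetical.

```python
import math

def relative_polar(points, origin=None):
    """Return (radius, angle) of each (x, y) point relative to `origin`.

    If `origin` is None, the centroid of the points is used (an assumption
    made for this sketch). Angles are in radians, as returned by math.atan2.
    """
    if origin is None:
        n = len(points)
        origin = (sum(x for x, _ in points) / n,
                  sum(y for _, y in points) / n)
    ox, oy = origin
    return [(math.hypot(x - ox, y - oy), math.atan2(y - oy, x - ox))
            for x, y in points]
```

The radii are unchanged when the whole point set is translated, which is one reason a relative encoding can be attractive for describing shape.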

cs.CV / 62 / 2604.19349

RAFT-MSF++: Temporal Geometry-Motion Feature Fusion for Self-Supervised Monocular Scene Flow

RAFT-MSF++：用于自监督单目场景流的时间几何-运动特征融合

Sun, Xunpei, Hou, Zuoxun, Chang, Yi, Chen, Gang, Zheng, Wei-Shi

Abstract

Monocular scene flow estimation aims to recover dense 3D motion from image sequences, yet most existing methods are limited to two-frame inputs, restricting temporal modeling and robustness to occlusions. We propose RAFT-MSF++, a self-supervised multi-frame framework that recurrently fuses temporal features to jointly estimate depth and scene flow. Central to our approach is the Geometry-Motion Feature (GMF), which compactly encodes coupled motion and geometry cues and is iteratively updated for effective temporal reasoning. To ensure the robustness of this temporal fusion against occlusions, we incorporate relative positional attention to inject spatial priors and an occlusion regularization module to propagate reliable motion from visible regions. These components enable the GMF to effectively propagate information even in ambiguous areas. Extensive experiments show that RAFT-MSF++ achieves 24.14% SF-all on the KITTI Scene Flow benchmark, with a 30.99% improvement over the baseline and better robustness in occluded regions. The code is available at https://github.com/sunzunyi/RAFT-MSF-PlusPlus.

Chinese Translation

单目场景流估计旨在从图像序列中恢复稠密的三维运动，但大多数现有方法仅限于双帧输入，这限制了时间建模和对遮挡的鲁棒性。我们提出了RAFT-MSF++，一种自监督的多帧框架，通过递归融合时间特征来联合估计深度和场景流。我们方法的核心是几何-运动特征（Geometry-Motion Feature, GMF），它紧凑地编码了耦合的运动和几何线索，并通过迭代更新实现有效的时间推理。为了确保这种时间融合对遮挡的鲁棒性，我们引入了相对位置注意力机制以注入空间先验，并添加了遮挡正则化模块以从可见区域传播可靠的运动。这些组件使得GMF能够在模糊区域有效传播信息。大量实验表明，RAFT-MSF++在KITTI场景流基准测试中达到了24.14%的SF-all，相较于基线提高了30.99%，并在遮挡区域表现出更好的鲁棒性。代码可在https://github.com/sunzunyi/RAFT-MSF-PlusPlus获取。

cs.CV / 63 / 2604.19350

Attend what matters: Leveraging vision foundational models for breast cancer classification using mammograms

关注重要信息：利用视觉基础模型进行乳腺癌分类的乳腺X光片分析

Sanghvi, Samyak, Miglani, Piyush, Shashikumar, Sarvesh, Borgavi, Kaustubh R, Singla, Veenu, Arora, Chetan

Abstract

Vision Transformers $(\texttt{ViT})$ have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for the softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) Region of interest $(\texttt{RoI})$ based token reduction using an object detection model to guide attention; (2) contrastive learning between selected $\texttt{RoI}$ to enhance fine-grained discrimination through hard-negative based training; and (3) a $\texttt{DINOv2}$ pretrained $\texttt{ViT}$ that captures localization-aware, fine-grained features instead of global $\texttt{CLIP}$ representations. Experiments on public mammography datasets demonstrate that our method achieves superior performance over existing baselines, establishing its effectiveness and potential clinical utility for large-scale breast cancer screening. Our code is available for reproducibility here: https://aih-iitd.github.io/publications/attend-what-matters

Chinese Translation

视觉变换器（Vision Transformers, $\texttt{ViT}$）已成为许多计算机视觉任务的首选架构，但其在计算机辅助诊断中的表现仍然有限。针对乳腺癌检测的乳腺X光片，我们识别出这一不足的两个主要原因。首先，医学图像具有高分辨率且存在小的异常，导致过多的标记，使得基于softmax的注意力机制难以定位和关注相关区域。其次，医学图像分类本质上是细粒度的，类间变异性低而类内变异性高，标准的交叉熵训练不足以应对这些挑战。为了解决这些问题，我们提出了一个包含三个关键组件的框架：（1）基于感兴趣区域（Region of Interest, $\texttt{RoI}$）的标记减少，利用目标检测模型引导注意力；（2）在选定的 $\texttt{RoI}$ 之间进行对比学习，通过困难负样本训练增强细粒度区分能力；（3）一个预训练的 $\texttt{DINOv2}$ $\texttt{ViT}$，捕捉定位感知的细粒度特征，而不是全局的 $\texttt{CLIP}$ 表示。对公共乳腺X光数据集的实验表明，我们的方法在现有基准上表现优越，确立了其有效性和在大规模乳腺癌筛查中的潜在临床应用。我们的代码可在此处获取以便复现：https://aih-iitd.github.io/publications/attend-what-matters
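The RoI-based token reduction in component (1) can be sketched as follows: given detector boxes, keep only the ViT patch tokens that overlap a box. This is a simplified illustration under stated assumptions (square patches, half-open pixel boxes, row-major token order); the function name `roi_token_indices` and its parameters are hypothetical and do not come from the paper's code.

```python
def roi_token_indices(image_h, image_w, patch, boxes):
    """Row-major indices of ViT patches that overlap any RoI box.

    `boxes` are (x0, y0, x1, y1) pixel rectangles (x1 and y1 exclusive),
    for example the output of an object detector.
    """
    cols = image_w // patch
    rows = image_h // patch
    keep = set()
    for x0, y0, x1, y1 in boxes:
        # Convert pixel extents to inclusive patch-grid extents, clamped
        # to the image.
        c0 = max(0, int(x0) // patch)
        r0 = max(0, int(y0) // patch)
        c1 = min(cols - 1, (int(x1) - 1) // patch)
        r1 = min(rows - 1, (int(y1) - 1) // patch)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                keep.add(r * cols + c)
    return sorted(keep)
```

For a 64x64 image with 16-pixel patches, a box covering pixels 10 to 20 in both axes touches four of the sixteen tokens, so attention would operate on a quarter of the original sequence.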

cs.CV / 64 / 2604.19365

Detection of T-shirt Presentation Attacks in Face Recognition Systems

面部识别系统中的T恤展示攻击检测

Ibsen, Mathias, Ide, Loris Tim, Rathgeb, Christian, Busch, Christoph

Abstract

Face recognition systems are often used for biometric authentication. Nevertheless, it is known that without any protective measures, face recognition systems are vulnerable to presentation attacks. To tackle this security problem, methods for detecting presentation attacks have been developed and have shown good detection performance on several benchmark datasets. However, generalising presentation attack detection methods to novel types of attacks is an ongoing challenge. In this work, we employ 1,608 T-shirt attacks of the T-shirt Face Presentation Attack (TFPA) database using 100 unique presentation attack instruments together with 152 bona fide presentations. In a comprehensive evaluation, we show that this type of attack can compromise the security of face recognition systems. Furthermore, we propose a detection method based on spatial consistency checks in order to detect said T-shirt attacks. Specifically, state-of-the-art face and person detectors are combined to analyse the spatial positions of detected faces and persons, based on which T-shirt attacks can be reliably detected.

Chinese Translation

面部识别系统常用于生物识别认证。然而，已知在没有任何保护措施的情况下，面部识别系统容易受到展示攻击的影响。为了解决这一安全问题，已经开发了检测展示攻击的方法，并在多个基准数据集上显示出良好的检测性能。然而，将展示攻击检测方法推广到新的和新颖类型的攻击仍然是一个持续的挑战。在本研究中，我们利用T恤面部展示攻击（T-shirt Face Presentation Attack, TFPA）数据库中的1,608个T恤攻击案例，这些攻击使用了100种独特的展示攻击工具，并结合152个真实展示。在全面评估中，我们展示了这种类型的攻击可能会危害面部识别系统的安全性。此外，我们提出了一种基于空间一致性检查的检测方法，以检测上述T恤攻击。具体而言，我们结合了最先进的面部和人物检测器，分析检测到的面部和人物的空间位置，从而可靠地检测T恤攻击。
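One way a spatial consistency check between face and person detections might look is sketched below: a face is treated as plausible only if its centre falls in the upper part of some detected person box. The rule itself, the `head_fraction` threshold, and both function names are assumptions made for illustration; the abstract does not specify the exact test the authors use.

```python
def is_face_consistent(face, person, head_fraction=0.35):
    """Check whether a face box sits in the head area of a person box.

    Boxes are (x0, y0, x1, y1) with y growing downward. `head_fraction` is a
    hypothetical threshold: the face centre must lie within the top fraction
    of the person box and horizontally inside it.
    """
    fx = (face[0] + face[2]) / 2.0
    fy = (face[1] + face[3]) / 2.0
    px0, py0, px1, py1 = person
    inside_x = px0 <= fx <= px1
    head_limit = py0 + head_fraction * (py1 - py0)
    return inside_x and py0 <= fy <= head_limit

def flag_tshirt_attacks(faces, persons, head_fraction=0.35):
    """Return the faces that are not in the head area of any detected person."""
    return [f for f in faces
            if not any(is_face_consistent(f, p, head_fraction) for p in persons)]
```

A face printed on a T-shirt would typically be centred around the torso of the person box, so it fails the check, while the wearer's real face passes.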

cs.CV / 65 / 2604.19368

Mind2Drive: Predicting Driver Intentions from EEG in Real-world On-Road Driving

Mind2Drive：基于脑电图预测真实道路驾驶中的驾驶员意图

Alosaimi, Ghadah, Alhamdan, Hanadi, E, Wenke, Katsigiannis, Stamos, Atapour-Abarghouei, Amir, Breckon, Toby P.

Abstract

Predicting driver intention from neurophysiological signals offers a promising pathway for enhancing proactive safety in advanced driver assistance systems, yet remains challenging in real-world driving due to EEG signal non-stationarity and the complexity of cognitive-motor preparation. This study proposes and evaluates an EEG-based driver intention prediction framework using a synchronised multi-sensor platform integrated into a real electric vehicle. A real-world on-road dataset was collected across 32 driving sessions, and 12 deep learning architectures were evaluated under consistent experimental conditions. Among the evaluated architectures, TSCeption achieved the highest average accuracy (0.907) and Macro-F1 score (0.901). The proposed framework demonstrates strong temporal stability, maintaining robust decoding performance up to 1000 ms before manoeuvre execution with minimal degradation. Furthermore, additional analyses reveal that minimal EEG preprocessing outperforms artefact-handling pipelines, and prediction performance peaks within a 400-600 ms interval, corresponding to a critical neural preparatory phase preceding driving manoeuvres. Overall, these findings support the feasibility of early and stable EEG-based driver intention decoding under real-world on-road conditions. Code: https://github.com/galosaimi/Mind2Drive.

Chinese Translation

从神经生理信号中预测驾驶员意图为增强高级驾驶辅助系统的主动安全提供了一条有前景的途径，但由于脑电图（EEG）信号的非平稳性和认知-运动准备的复杂性，在真实驾驶环境中仍然面临挑战。本研究提出并评估了一种基于EEG的驾驶员意图预测框架，该框架使用集成在真实电动车中的同步多传感器平台。我们在32个驾驶会话中收集了真实道路数据集，并在一致的实验条件下评估了12种深度学习架构。在评估的架构中，TSCeption达到了最高的平均准确率（0.907）和宏观F1分数（0.901）。所提出的框架展示了强大的时间稳定性，在机动执行前1000毫秒内保持稳健的解码性能，且降级最小。此外，额外分析表明，最小的EEG预处理优于伪影处理管道，且预测性能在400-600毫秒的时间间隔内达到峰值，这对应于驾驶机动前的关键神经准备阶段。总体而言，这些发现支持在真实道路条件下早期和稳定的基于EEG的驾驶员意图解码的可行性。代码：https://github.com/galosaimi/Mind2Drive。
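For readers less familiar with the two reported metrics, the following sketch computes accuracy and Macro-F1 from label lists. Macro-F1 averages the per-class F1 scores with equal weight, so rare classes count as much as frequent ones. The function name and the example labels in the usage note are hypothetical; nothing here reproduces the paper's evaluation code.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / len(y_true), sum(f1_scores) / len(f1_scores)
```

With three hypothetical intention classes encoded as 0, 1, and 2, `accuracy_and_macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])` gives per-class F1 scores of 0.5, 0.8, and 2/3, which are then averaged.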

cs.CV / 66 / 2604.19369

IonMorphNet: Generalizable Learning of Ion Image Morphologies for Peak Picking in Mass Spectrometry Imaging

IonMorphNet：用于质谱成像中峰值选择的离子图像形态的可泛化学习

Weigand, Philipp, Nawrot, Niels, Ebert, Nikolas, Hopf, Carsten, Wasenmüller, Oliver

Abstract

Peak picking is a fundamental preprocessing step in Mass Spectrometry Imaging (MSI), where each sample is represented by hundreds to thousands of ion images. Existing approaches require careful dataset-specific hyperparameter tuning, and often fail to generalize across acquisition protocols. We introduce IonMorphNet, a spatial-structure-aware representation model for ion images that enables fully data-driven peak picking without any task-specific supervision. We curate 53 publicly available MSI datasets and define six structural classes capturing representative spatial patterns in ion images to train standard image backbones for structural pattern classification. Once trained, IonMorphNet can assess ion images and perform peak picking without additional hyperparameter tuning. Using a ConvNeXt V2-Tiny backbone, our approach improves peak picking performance by +7 % mSCF1 compared to state-of-the-art methods across multiple datasets. Beyond peak picking, we demonstrate that spatially informed channel reduction enables a 3D CNN for patch-based tumor classification in MSI. This approach matches or exceeds pixel-wise spectral classifiers by up to +7.3 % Balanced Accuracy on three tumor classification tasks, indicating meaningful ion image selection. The source code and model weights are available at https://github.com/CeMOS-IS/IonMorphNet.

Chinese Translation

峰值选择是质谱成像（MSI）中的一个基本预处理步骤，每个样本由数百到数千个离子图像表示。现有的方法需要仔细的特定数据集超参数调整，并且往往无法在不同的采集协议之间泛化。我们提出了IonMorphNet，这是一种空间结构感知的离子图像表示模型，能够实现完全数据驱动的峰值选择，而无需任何特定任务的监督。我们整理了53个公开可用的MSI数据集，并定义了六个结构类别，以捕捉离子图像中的代表性空间模式，从而训练标准图像骨干网络进行结构模式分类。一旦训练完成，IonMorphNet可以评估离子图像并执行峰值选择，而无需额外的超参数调整。使用ConvNeXt V2-Tiny骨干网络，我们的方法在多个数据集上相比于最先进的方法提高了+7%的mSCF1峰值选择性能。除了峰值选择，我们还展示了空间信息引导的通道减少使得3D CNN能够在MSI中进行基于补丁的肿瘤分类。该方法在三个肿瘤分类任务中与像素级光谱分类器的表现相当或超过，提升了+7.3%的平衡准确率，表明了有意义的离子图像选择。源代码和模型权重可在https://github.com/CeMOS-IS/IonMorphNet获取。

cs.CV / 67 / 2604.19379

PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving

PanDA：用于自主驾驶中多模态3D全景分割的无监督领域适应

Pan, Yining, Li, Shijie, Wu, Yuchen, Yang, Xulei, Zhao, Na

Abstract

This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm-3DPS backbone. However, existing supervised mm-3DPS methods rely heavily on strong cross-modal complementarity between LiDAR and RGB inputs, making them fragile under domain shifts where one modality degrades (e.g., poor lighting or adverse weather). Moreover, conventional pseudo-labeling typically retains only high-confidence regions, leading to fragmented masks and incomplete object supervision, which are issues particularly detrimental to panoptic segmentation. To address these challenges, we propose PanDA, the first UDA framework specifically designed for multimodal 3D panoptic segmentation. To improve robustness against single-sensor degradation, we introduce an asymmetric multimodal augmentation that selectively drops regions to simulate domain shifts and improve robust representation learning. To enhance pseudo-label completeness and reliability, we further develop a dual-expert pseudo-label refinement module that extracts domain-invariant priors from both 2D and 3D modalities. In extensive experiments across diverse domain shifts, spanning time, weather, location, and sensor variations, PanDA significantly surpasses state-of-the-art UDA baselines for 3D semantic segmentation.

Chinese Translation

本文首次研究了多模态3D全景分割（mm-3DPS）中的无监督领域适应（UDA），旨在改善在现实世界自主驾驶中常遇到的领域转移下的泛化能力。一种简单的解决方案是采用伪标签策略，该策略在UDA中广泛使用，以为未标记的目标数据生成监督，并结合mm-3DPS骨干网络。然而，现有的监督mm-3DPS方法在很大程度上依赖于LiDAR和RGB输入之间的强跨模态互补性，使其在某一模态退化（例如，光照不足或恶劣天气）时变得脆弱。此外，传统的伪标签通常仅保留高置信度区域，导致掩膜碎片化和对象监督不完整，这些问题对全景分割尤其有害。为了解决这些挑战，我们提出了PanDA，这是第一个专门为多模态3D全景分割设计的UDA框架。为了提高对单传感器退化的鲁棒性，我们引入了一种不对称多模态增强，选择性地丢弃区域以模拟领域转移并改善鲁棒表示学习。为了增强伪标签的完整性和可靠性，我们进一步开发了一个双专家伪标签精炼模块，从2D和3D模态中提取领域不变先验。我们在跨越时间、天气、地点和传感器变化的多种领域转移下进行了广泛实验，显著超越了当前最先进的3D语义分割UDA基线。
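The problem with conventional confidence-thresholded pseudo-labeling mentioned in the abstract can be made concrete with a short sketch. It shows only the baseline behaviour (keep high-confidence predictions, ignore the rest), not PanDA's refinement module. The `IGNORE` value, the threshold `tau`, and the function names are hypothetical.

```python
IGNORE = -1  # hypothetical marker for "no supervision at this point"

def threshold_pseudo_labels(labels, confidences, tau=0.9):
    """Keep a predicted label only where its confidence is at least tau."""
    return [lab if conf >= tau else IGNORE
            for lab, conf in zip(labels, confidences)]

def coverage(pseudo_labels):
    """Fraction of points that still carry a pseudo-label."""
    kept = sum(lab != IGNORE for lab in pseudo_labels)
    return kept / len(pseudo_labels)
```

If four points of one object are predicted with confidences 0.95, 0.60, 0.92, and 0.40, thresholding at 0.9 leaves only half of the object supervised, which is the kind of fragmented mask the abstract describes.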

cs.CV / 68 / 2604.19386

Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval

Air-Know：仲裁者校准的知识内化鲁棒网络用于复合图像检索

Fu, Zhiheng, Hu, Yupeng, Yang, Qianyun, Zhang, Shiqi, Chen, Zhiwei, Li, Zixu

Abstract

Composed Image Retrieval (CIR) has attracted significant attention due to its flexible multimodal query method, yet its development is severely constrained by the Noisy Triplet Correspondence (NTC) problem. Most existing robust learning methods rely on the "small loss hypothesis", but the unique semantic ambiguity in NTC, such as "partial matching", invalidates this assumption, leading to unreliable noise identification. This traps the model in a self-dependent vicious cycle where the learner is intertwined with the arbiter, ultimately causing catastrophic "representation pollution". To address this critical challenge, we propose a novel "Expert-Proxy-Diversion" decoupling paradigm, named Air-Know (ArbIteR calibrated Knowledge iNternalizing rObust netWork). Air-Know incorporates three core modules: (1) External Prior Arbitration (EPA), which utilizes Multimodal Large Language Models (MLLMs) as an offline expert to construct a high-precision anchor dataset; (2) Expert Knowledge Internalization (EKI), which efficiently guides a lightweight proxy "arbiter" to internalize the expert's discriminative logic; (3) Dual Stream Reconciliation (DSR), which leverages the EKI's matching confidence to divert the training data, achieving a clean alignment stream and a representation feedback reconciliation stream. Extensive experiments on multiple CIR benchmark datasets demonstrate that Air-Know significantly outperforms existing SOTA methods under the NTC setting, while also showing strong competitiveness in traditional CIR.

Chinese Translation

复合图像检索（CIR）因其灵活的多模态查询方法而受到广泛关注，但其发展受到噪声三元组对应（NTC）问题的严重制约。现有的大多数鲁棒学习方法依赖于“小损失假设”，但NTC中的独特语义模糊性，如“部分匹配”，使这一假设失效，从而导致不可靠的噪声识别。这使得模型陷入一个自我依赖的恶性循环，学习者与仲裁者相互交织，最终导致灾难性的“表示污染”。为了解决这一关键挑战，我们提出了一种新颖的“专家-代理-转移”解耦范式，命名为Air-Know（仲裁者校准的知识内化鲁棒网络）。Air-Know包含三个核心模块：（1）外部先验仲裁（EPA），利用多模态大语言模型（MLLMs）作为离线专家构建高精度锚定数据集；（2）专家知识内化（EKI），有效指导轻量级代理“仲裁者”内化专家的区分逻辑；（3）双流调和（DSR），利用EKI的匹配信心来转移训练数据，实现干净的对齐流和表示反馈调和流。在多个CIR基准数据集上的广泛实验表明，Air-Know在NTC设置下显著优于现有的最先进方法，同时在传统CIR中也展现出强大的竞争力。

cs.CV / 69 / 2604.19392

HarmoniDiff-RS: Training-Free Diffusion Harmonization for Satellite Image Composition

HarmoniDiff-RS：无训练扩散和谐化卫星图像合成

Zhuang, Xiaoqi, Santos, Jefersson A. Dos, Han, Jungong

Abstract

Satellite image composition plays a critical role in remote sensing applications such as data augmentation, disaster simulation, and urban planning. We propose HarmoniDiff-RS, a training-free diffusion-based framework for harmonizing composite satellite images under diverse domain conditions. Our method aligns the source and target domains through a Latent Mean Shift operation that transfers radiometric characteristics between them. To balance harmonization and content preservation, we introduce a Timestep-wise Latent Fusion strategy that leverages early inverted latents for strong harmonization and late latents for semantic consistency, generating a set of composite candidates. A lightweight harmony classifier is then trained to automatically select the most coherent result among them. We also construct RSIC-H, a benchmark dataset for satellite image harmonization derived from fMoW, providing 500 paired composition samples. Experiments demonstrate that our method effectively performs satellite image composition, showing strong potential for scalable remote-sensing synthesis and simulation tasks. Code is available at: https://github.com/XiaoqiZhuang/HarmoniDiff-RS.

Chinese Translation

卫星图像合成在数据增强、灾害模拟和城市规划等遥感应用中发挥着关键作用。我们提出了HarmoniDiff-RS，这是一种无训练的基于扩散的框架，用于在多样的领域条件下和谐化复合卫星图像。我们的方法通过潜在均值偏移（Latent Mean Shift）操作对源域和目标域进行对齐，从而在它们之间转移辐射特性。为了平衡和谐化与内容保留，我们引入了一种时间步长潜在融合（Timestep-wise Latent Fusion）策略，利用早期反转潜在（early inverted latents）实现高和谐化，利用晚期潜在（late latents）保持语义一致性，以生成一组复合候选图像。我们还训练了一个轻量级的和谐分类器，以进一步自动选择其中最一致的结果。此外，我们构建了RSIC-H，这是一个基于fMoW的卫星图像和谐化基准数据集，提供了500对合成样本。实验表明，我们的方法有效地执行卫星图像合成，显示出在可扩展遥感合成和模拟任务中的强大潜力。代码可在以下链接获取：https://github.com/XiaoqiZhuang/HarmoniDiff-RS。
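The abstract does not define the Latent Mean Shift operation, so the sketch below shows only one plausible reading: shift each channel of the source latent so that its mean matches the corresponding target channel mean. The latent layout (a list of channels, each a flat list of values) and the function names are assumptions for illustration and should not be taken as the method in the linked repository.

```python
def channel_means(latent):
    """Mean of each channel, where a latent is a list of flat channel lists."""
    return [sum(ch) / len(ch) for ch in latent]

def latent_mean_shift(source, target):
    """Shift each source channel so its mean equals the target channel mean.

    Only the per-channel offset changes; the spatial variation inside each
    source channel is preserved.
    """
    shifted = []
    for src_ch, src_mean, tgt_mean in zip(
            source, channel_means(source), channel_means(target)):
        delta = tgt_mean - src_mean
        shifted.append([v + delta for v in src_ch])
    return shifted
```

Under this reading, the pasted region inherits the overall radiometric level of the background while keeping its own content structure.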

cs.CV / 70 / 2604.19403

VecHeart: Holistic Four-Chamber Cardiac Anatomy Modeling via Hybrid VecSets

VecHeart：通过混合向量集进行整体四腔心脏解剖建模

Chen, Yihong, Fua, Pascal

Abstract

Accurate cardiac anatomy modeling requires the model to be able to handle intricate interrelations among structures. In this paper, we propose VecHeart, a unified framework for holistic reconstruction and generation of four-chamber cardiac structures. To overcome the limitations of current feed-forward implicit methods, specifically their restriction to single-object modeling and their neglect of inter-part correlations, we introduce Hybrid Part Transformer, which leverages part-specific learnable queries and interleaved attention to capture complex inter-chamber dependencies. Furthermore, we propose Anatomical Completion Masking and Modality Alignment strategies, enabling the model to infer complete four-chamber structures from partial, sparse, or noisy observations, even when certain anatomical parts are entirely missing. VecHeart also seamlessly extends to 3D+t dynamic mesh sequence generation, demonstrating exceptional versatility. Experiments show that our method achieves state-of-the-art performance, maintaining high-fidelity reconstruction across diverse challenging scenarios. Code will be released.

Chinese Translation

准确的心脏解剖建模要求模型能够处理结构之间复杂的相互关系。本文提出了VecHeart，一个用于四腔心脏结构整体重建和生成的统一框架。为了克服当前前馈隐式方法的局限性，特别是它们对单一对象建模的限制以及对部件间相关性的忽视，我们引入了混合部件变换器（Hybrid Part Transformer），该变换器利用特定部件的可学习查询和交错注意力机制来捕捉复杂的腔间依赖关系。此外，我们提出了解剖完成掩码（Anatomical Completion Masking）和模态对齐（Modality Alignment）策略，使模型能够从部分、稀疏或噪声观测中推断出完整的四腔结构，即使某些解剖部件完全缺失。VecHeart还无缝扩展到3D+t动态网格序列生成，展示了卓越的多功能性。实验表明，我们的方法在各种具有挑战性的场景中实现了最先进的性能，保持了高保真度的重建。代码将会发布。

cs.CV / 71 / 2604.19406

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

HP-Edit：一种基于人类偏好的图像编辑后训练框架

Li, Fan, Wang, Chonghuinan, Lei, Lina, Qiu, Yuping, Xu, Jiaqi, Jiang, Jiaxiu, Qin, Xinran, Chen, Zhikai, Song, Fenglong, Wang, Zhixin, Pei, Renjing, Zuo, Wangmeng

Abstract

Image editing commonly relies on powerful generative diffusion models, the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset that spans eight common tasks and balances common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer, an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.

Chinese Translation

常见的图像编辑任务通常采用强大的生成扩散模型作为现实内容编辑的主要范式。与此同时，尽管强化学习（RL）方法如Diffusion-DPO和Flow-GRPO进一步提高了生成质量，但由于缺乏可扩展的人类偏好数据集和针对多样化编辑需求量身定制的框架，将人类反馈的强化学习（RLHF）有效应用于基于扩散的编辑仍然基本未被探索。为填补这一空白，我们提出了HP-Edit，一种用于人类偏好对齐编辑的后训练框架，并引入了RealPref-50K，一个涵盖八个常见任务并平衡常见对象编辑的真实世界数据集。具体而言，HP-Edit利用少量人类偏好评分数据和预训练的视觉大型语言模型（VLM）开发了HP-Scorer——一种自动化的人类偏好对齐评估器。然后，我们使用HP-Scorer高效构建可扩展的偏好数据集，并作为后训练编辑模型的奖励函数。我们还引入了RealPref-Bench，一个用于评估现实世界编辑性能的基准。大量实验表明，我们的方法显著增强了如Qwen-Image-Edit-2509等模型，使其输出更贴近人类偏好。

cs.CV / 72 / 2604.19411

GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes

GOLD-BEV：用于动态场景的密集语义鸟瞰图（BEV）映射的地面和空中数据

Niemeijer, Joshua, Zekri, Alaa Eddine Ben, Bahmanyar, Reza, Schmälzle, Philipp M., Chaabouni-Chouayakh, Houda, Kurz, Franz

Abstract

Understanding road scenes in a geometrically consistent, scene-centric representation is crucial for planning and mapping. We present GOLD-BEV, a framework that learns dense bird's-eye-view (BEV) semantic environment maps, including dynamic agents, from ego-centric sensors, using time-synchronized aerial imagery as supervision only during training. BEV-aligned aerial crops provide an intuitive target space, enabling dense semantic annotation with minimal manual effort and avoiding the ambiguity of ego-only BEV labeling. Crucially, strict aerial-ground synchronization allows overhead observations to supervise moving traffic participants and mitigates the temporal inconsistencies inherent to non-synchronized overhead sources. To obtain scalable dense targets, we generate BEV pseudo-labels using domain-adapted aerial teachers, and jointly train BEV segmentation with optional pseudo-aerial BEV reconstruction for interpretability. Finally, we extend beyond aerial coverage by learning to synthesize pseudo-aerial BEV images from ego sensors, which support lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled drives.

Chinese Translation

在几何一致的、以场景为中心的表示中理解道路场景对于规划和映射至关重要。我们提出了GOLD-BEV，一个框架，通过自我中心传感器学习密集的鸟瞰图（BEV）语义环境地图——包括动态代理——仅在训练期间使用时间同步的空中图像作为监督。与BEV对齐的空中裁剪提供了一个直观的目标空间，使得密集语义标注可以在最小的人工努力下完成，并避免了仅依赖自我视角的BEV标注所带来的模糊性。至关重要的是，严格的空中与地面同步使得从上方观察能够监督移动的交通参与者，并减轻了非同步上方源固有的时间不一致性。为了获得可扩展的密集目标，我们使用域适应的空中教师生成BEV伪标签，并联合训练BEV分割与可选的伪空中BEV重建以提高可解释性。最后，我们通过学习从自我传感器合成伪空中BEV图像，扩展了空中覆盖范围，这支持轻量级的人类标注和对未标记驾驶的基于不确定性的伪标注。

cs.CV / 73 / 2604.19412

VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing

VCE：一种通过视觉对比编辑实现零成本幻觉缓解的LVLM方法

Huang, Yanbin, Li, Yisen, Tie, Guiyao, Qu, Xiaoye, Zhou, Pan, Wang, Hongfei, Zou, Zhaofan, Sun, Hao, Li, Xuelong

Abstract

Large vision-language models (LVLMs) frequently suffer from Object Hallucination (OH), wherein they generate descriptions containing objects that are not actually present in the input image. This phenomenon is particularly problematic in real-world applications such as medical imaging and autonomous driving, where accuracy is critical. Recent studies suggest that the hallucination problem may stem from language priors: biases learned during pretraining that cause LVLMs to generate words based on their statistical co-occurrence. To mitigate this problem, we propose Visual Contrastive Editing (VCE), a novel post-hoc method that identifies and suppresses hallucinatory tendencies by analyzing the model's response to contrastive visual perturbations. Using Singular Value Decomposition (SVD), we decompose the model's activation patterns to isolate hallucination subspaces and apply targeted parameter edits to attenuate their influence. Unlike existing approaches that require fine-tuning or labeled data, VCE operates as a label-free intervention, making it both scalable and practical for deployment in resource-constrained settings. Experimental results demonstrate that VCE effectively reduces object hallucination across multiple benchmarks while maintaining the model's original computational efficiency.

Chinese Translation

大型视觉语言模型（LVLMs）经常遭遇物体幻觉（Object Hallucination, OH）问题，即它们生成的描述中包含实际输入图像中并不存在的物体。这一现象在医学影像和自动驾驶等对准确性要求极高的实际应用中尤为棘手。近期研究表明，幻觉问题可能源于语言先验：在预训练过程中学习到的偏差，导致LVLMs根据词语的统计共现生成描述。为了解决这一问题，我们提出了视觉对比编辑（Visual Contrastive Editing, VCE），这是一种新颖的后处理方法，通过分析模型对对比视觉扰动的响应来识别和抑制幻觉倾向。我们使用奇异值分解（Singular Value Decomposition, SVD）对模型的激活模式进行分解，以隔离幻觉子空间，并应用针对性的参数编辑以减弱其影响。与现有需要微调或标注数据的方法不同，VCE作为一种无标签干预方法，具有可扩展性和在资源受限环境中部署的实用性。实验结果表明，VCE在多个基准测试中有效减少了物体幻觉，同时保持了模型原有的计算效率。
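The general idea of isolating a dominant direction from activation differences and editing a weight matrix to ignore it can be sketched in plain Python. This is a rank-one toy under stated assumptions: power iteration stands in for a full SVD, the activation differences are made-up inputs, and projecting the direction out of the weight rows is one possible form of "targeted parameter edit". Which layers VCE edits, how many directions it removes, and how strongly are not stated in the abstract; both function names are hypothetical.

```python
import math

def top_direction(diffs, iters=100):
    """Dominant right-singular vector of the rows in `diffs` (power iteration).

    `diffs` holds activation differences, e.g. original minus perturbed input.
    The uniform start vector must not be orthogonal to the dominant direction.
    """
    d = len(diffs[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # One step of v <- D^T (D v), followed by normalization.
        proj = [sum(row[j] * v[j] for j in range(d)) for row in diffs]
        w = [sum(proj[i] * diffs[i][j] for i in range(len(diffs)))
             for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def remove_direction(weight, v):
    """Return W (I - v v^T): each weight row with its component along v removed."""
    edited = []
    for row in weight:
        coeff = sum(row[j] * v[j] for j in range(len(v)))
        edited.append([row[j] - coeff * v[j] for j in range(len(v))])
    return edited
```

After the edit, any input component along `v` no longer affects the layer output, and the edited matrix has the same shape as the original, so inference cost is unchanged. That matches the abstract's claim of preserved computational efficiency.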

cs.CV / 74 / 2604.19420

TESO: Online Tracking of Essential Matrix by Stochastic Optimization

TESO：通过随机优化进行本质矩阵的在线跟踪

Moravec, Jaroslav, Šára, Radim, Sugimoto, Akihiro

Abstract

Maintaining long-term accuracy of stereo camera calibration parameters is important for autonomous systems' perception. This work proposes Online Tracking of Essential Matrix by Stochastic Optimization (TESO). The core mechanisms of TESO are: 1) a robust loss function based on kernel correlation over tentative correspondences, 2) an adaptive online stochastic optimization on the essential manifold. TESO has low CPU and memory requirements, relies on only a few hyperparameters, and eliminates the need for data-driven training, enabling its use in resource-constrained online perception systems. We evaluated the influence of TESO on geometric precision, rectification quality, and stereo depth consistency. On the large-scale MAN TruckScenes dataset, TESO tracks rotational calibration drift with 0.12 deg precision in the Y-axis (critical for stereo accuracy) while the X- and Z-axes are five times more precise. Tracking applied to sequences with simulated drift shows similar precision with respect to the reference as tracking applied to no-drift sequences, indicating the tracker is unbiased. On the KITTI dataset, TESO revealed systematic inconsistencies in extrinsic parameters across stereo pairs, confirming previously published findings. We verified that intrinsic decalibration affected these errors, as evidenced by the conflicting behavior of the rectification and depth metrics. After correcting the reference calibration, TESO improved its rotation precision around the Y-axis by a factor of 20 to 0.025 deg and its depth accuracy by a factor of 50. Despite its lightweight design, direct optimization of the proposed TESO loss function alone achieves accuracy comparable to that of neural network-based single-frame methods.

Chinese Translation

保持立体相机标定参数的长期准确性对于自主系统的感知至关重要。本研究提出了通过随机优化进行本质矩阵在线跟踪的方法（TESO）。TESO的核心机制包括：1）基于暂定对应关系的核相关性的鲁棒损失函数，2）在本质流形上的自适应在线随机优化。TESO具有低CPU和内存需求，依赖于少量超参数，并消除了数据驱动训练的需要，使其能够在资源受限的在线感知系统中使用。我们评估了TESO对几何精度、校正质量和立体深度一致性的影响。在大规模的MAN TruckScenes数据集上，TESO在Y轴上以0.12度的精度跟踪旋转标定漂移（这对立体精度至关重要），而X轴和Z轴的精度则高出五倍。对模拟漂移序列的跟踪显示其精度与无漂移序列的参考相似，表明跟踪器没有偏差。在KITTI数据集中，TESO揭示了立体对之间外部参数的系统性不一致，证实了之前发表的研究结果。我们验证了内在去标定影响了这些错误，校正和深度指标的矛盾行为证明了这一点。在修正参考标定后，TESO在Y轴上的旋转精度提高了20倍，达到了0.025度，深度准确性提高了50倍。尽管设计轻量，直接优化所提出的TESO损失函数的准确性仍可与基于神经网络的单帧方法相媲美。
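As background for what tracking the essential matrix means, the sketch below builds E = [t]x R from a relative pose and evaluates the standard epipolar constraint x2^T E x1 = 0 on normalized image points. This is textbook two-view geometry, not TESO's kernel-correlation loss or its optimizer on the essential manifold. The helper names and the pose convention X2 = R X1 + t are choices made for this illustration.

```python
def skew(t):
    """Cross-product matrix [t]x of a 3-vector t."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def essential_matrix(rotation, translation):
    """E = [t]x R for the pose convention X2 = R X1 + t."""
    return matmul(skew(translation), rotation)

def epipolar_residual(e, x1, x2):
    """x2^T E x1 for homogeneous normalized points; zero for an exact match."""
    ex1 = [sum(e[i][k] * x1[k] for k in range(3)) for i in range(3)]
    return sum(x2[i] * ex1[i] for i in range(3))
```

With an identity rotation and a unit baseline along X, the point (0.1, 0.05, 1) in the first view matches (0.3, 0.05, 1) in the second with zero residual. A vertical offset in the second view produces a non-zero residual, which is the kind of signal a calibration tracker can exploit.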

cs.CV / 75 / 2604.19432

DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval

DINO 吃 CLIP：超越已知的开放集 3D 物体检索适应

He, Xinwei, Zheng, Yansong, Han, Qianru, Wang, Zhichuan, Cai, Yuxuan, Zhou, Yang, Xia, Jingbo, Wang, Yulong, Xiang, Jinhai, Bai, Xiang

Abstract

Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging a semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grained detail prompted us to explore the potential of a more recent self-supervised encoder, DINO. To this end, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes. To combat this, we design the Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose a Virtual Feature Synthesis (VFS) module to explicitly mitigate bias towards known categories. Under the hood, VFS leverages CLIP's broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.

Chinese Translation

视觉基础模型在开放集 3D 物体检索（3DOR）中展现出了巨大的潜力，通过对多视图图像的高效适应。利用语义对齐的潜在空间，以往的研究通常将 CLIP 编码器适配用于构建基于视图的 3D 描述符。尽管 CLIP 具有强大的泛化能力，但其缺乏细粒度特征促使我们探索更近期的自监督编码器 DINO 的潜力。为此，我们提出了 DINO Eats CLIP (DEC)，这是一个用于动态多视图集成的新框架，通过合成未见类别的数据进行正则化。我们首先发现，仅仅对来自冻结 DINO 主干的视图特征进行均值池化就能获得不错的性能。然而，进一步的适应会导致对已知类别的平均视图模式的严重过拟合。为了解决这个问题，我们设计了一个名为分块与适应模块（Chunking and Adapting Module, CAM）的模块。它将多视图图像分割为块，并动态整合局部视图关系，生成比标准池化策略更为稳健的特征。最后，我们提出了虚拟特征合成（Virtual Feature Synthesis, VFS）模块，以明确减轻对已知类别的偏见。在其内部，VFS 利用 CLIP 广泛的、预对齐的视觉-语言空间为未见类别合成虚拟特征。通过将 DEC 暴露于这些虚拟特征，我们大大增强了其开放集判别能力。在标准开放集 3DOR 基准上的大量实验表明了其卓越的有效性。
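The mean-pooling baseline and the idea of grouping views into chunks can be sketched as follows. Feature vectors are plain lists, and the per-chunk `combine` step defaults to averaging; the learned, dynamic integration that CAM performs inside each chunk is not specified in the abstract and is only represented here by that pluggable argument. All function names are hypothetical.

```python
def mean_pool(features):
    """Average a list of equal-length view feature vectors into one descriptor."""
    n = len(features)
    return [sum(f[j] for f in features) / n for j in range(len(features[0]))]

def chunk_views(features, chunk_size):
    """Split the ordered view features into consecutive chunks."""
    return [features[i:i + chunk_size]
            for i in range(0, len(features), chunk_size)]

def chunked_descriptor(features, chunk_size, combine=mean_pool):
    """Summarize each chunk with `combine`, then average the chunk summaries."""
    summaries = [combine(chunk) for chunk in chunk_views(features, chunk_size)]
    return mean_pool(summaries)
```

With plain averaging and equal chunk sizes the chunked result equals the global mean, so any benefit has to come from what replaces `combine`, for example a per-chunk maximum or a small learned module.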

cs.CV / 76 / 2604.19445

LoViF 2026 Challenge on Real-World All-in-One Image Restoration: Methods and Results

LoViF 2026挑战赛：真实世界一体化图像恢复的方法与结果

Chen, Xiang, Li, Hao, Dong, Jiangxin, Pan, Jinshan, Li, Xin, He, Xin, Chen, Naiwei, Li, Shengyuan, Liu, Fengning, Lv, Haoyi, Peng, Haowei, Zhong, Yilian, Chen, Yuxiang, Yin, Shibo, Fang, Yushun, Zhu, Xilei, Wang, Yahui, Lu, Chen, Chen, Kaibin, Zhang, Xu, Cao, Xuhui, Ma, Jiaqi, Wang, Ziqi, Hu, Shengkai, Cui, Yuning, Zhang, Huan, Chen, Shi, Ren, Bin, Zhang, Lefei, Dong, Guanglu, Zhao, Qiyao, Zheng, Tianheng, Li, Chunlei, Mou, Lichao, Ren, Chao, Xing, Wangzhi, Lu, Xin, Gu, Enxuan, Zhang, Jingxi, Chen, Diqi, Yi, Qiaosi, Wei, Bingcai, Liu, Mingyu, Wang, Pengyu, Liu, Ce, Guan, Miaoxin, Chen, Boyu, Li, Hongyu, Zhu, Jian, Luo, Xinrui, He, Ziyang, Wang, Jiayu, Xiang, Yichen, Qi, Huayi, Bian, Haoyu, Li, Yiran, Zhou, Sunlichen

Abstract

This paper presents a review of the LoViF Challenge on Real-World All-in-One Image Restoration. The challenge aimed to advance research on all-in-one image restoration under diverse real-world degradation conditions, including blur, low-light, haze, rain, and snow. It provided a unified benchmark to evaluate the robustness and generalization ability of restoration models across multiple degradation categories within a common framework. The competition attracted 124 registered participants and received 9 valid final submissions with corresponding fact sheets, significantly contributing to the progress of real-world all-in-one image restoration. This report provides a detailed analysis of the submitted methods and corresponding results, emphasizing recent progress in unified real-world image restoration. The analysis highlights effective approaches and establishes a benchmark for future research in real-world low-level vision.

Chinese Translation

本文对LoViF挑战赛的真实世界一体化图像恢复进行了回顾。该挑战旨在推动在多种真实世界退化条件下进行真实世界一体化图像恢复的研究，包括模糊、低光、雾霾、雨水和雪。它提供了一个统一的基准，以评估恢复模型在多个退化类别下的鲁棒性和泛化能力。比赛吸引了124名注册参与者，并收到了9份有效的最终提交及其相应的事实表，显著推动了真实世界一体化图像恢复的进展。本报告对提交的方法及其对应结果进行了详细分析，强调了在统一真实世界图像恢复方面的最新进展。分析突出了有效的方法，并为未来在真实世界低级视觉领域的研究建立了基准。

cs.CV / 77 / 2604.19473

TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation

TS-Attn：用于多事件视频生成的时间可分离注意力机制

Zhang, Hongyu, Deng, Yufan, Pan, Zilin, Jiang, Peng-Tao, Li, Bo, Hou, Qibin, Dou, Zhiyang, Dong, Zhen, Zhou, Daquan

Abstract

Generating high-quality videos from complex temporal descriptions that contain multiple sequential actions is a key unsolved problem. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: 1) temporal misalignment between video content and the prompt, and 2) conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and project page are available at https://github.com/Hong-yu-Zhang/TS-Attn.

Chinese Translation

从包含多个顺序动作的复杂时间描述中生成高质量视频是一个关键的未解决问题。现有方法受到固有权衡的限制：将多个短提示顺序输入模型可以提高动作的真实感，但会妨碍时间一致性，而单一复杂提示则在保持一致性的同时牺牲了对提示的跟随能力。我们将此问题归因于两个主要原因：1）视频内容与提示之间的时间错位，以及2）与运动相关的视觉对象及其相关文本条件之间的注意力耦合冲突。为了解决这些挑战，我们提出了一种新颖的无训练注意力机制——时间可分离注意力（Temporal-wise Separable Attention，TS-Attn），该机制动态重新排列注意力分布，以确保在多事件场景中的时间意识和全局一致性。TS-Attn可以无缝集成到各种预训练的文本到视频模型中，在Wan2.1-T2V-14B和Wan2.2-T2V-A14B上分别提高了StoryEval-Bench评分33.5%和16.4%，而推理时间仅增加了2%。它还支持在多事件图像到视频生成中的即插即用使用。源代码和项目页面可在 https://github.com/Hong-yu-Zhang/TS-Attn 获取。

cs.CV / 78 / 2604.19480

Deep sprite-based image models: An analysis

基于深度精灵的图像模型：分析

Baltacı, Zeynep Sonat, Loiseau, Romain, Aubry, Mathieu

Abstract

While foundation models drive steady progress in image segmentation and diffusion algorithms compose ever more realistic images, the seemingly simple problem of identifying recurrent patterns in a collection of images remains very much open. In this paper, we focus on sprite-based image decomposition models, which have shown some promise for clustering and image decomposition and are appealing because of their high interpretability. These models come in different flavors, need to be tailored to specific datasets, and struggle to scale to images with many objects. We dive into the details of their design, identify their core components, and perform an extensive analysis on clustering benchmarks. We leverage this analysis to propose a deep sprite-based image decomposition method that performs on par with state-of-the-art unsupervised class-aware image segmentation methods on the standard CLEVR benchmark, scales linearly with the number of objects, explicitly identifies object categories, and fully models images in an easily interpretable way.

Chinese Translation

尽管基础模型在图像分割方面推动了稳定的进展，而扩散算法则生成越来越逼真的图像，但在一组图像中识别重复模式这一看似简单的问题仍然悬而未决。本文重点研究基于精灵的图像分解模型，这些模型在聚类和图像分解方面显示出一定的前景，并因其高可解释性而受到关注。这些模型有不同的变体，需要针对特定数据集进行调整，并且在处理包含多个对象的图像时面临扩展困难。我们深入探讨了它们的设计细节，识别了其核心组件，并在聚类基准上进行了广泛分析。我们利用这一分析提出了一种基于深度精灵的图像分解方法，该方法在标准CLEVR基准上与最先进的无监督类别感知图像分割方法表现相当，能够线性扩展到多个对象，明确识别对象类别，并以易于解释的方式全面建模图像。

cs.CV / 79 / 2604.19489

Seeing Candidates at Scale: Multimodal LLMs for Visual Political Communication on Instagram

大规模候选人可视化：用于Instagram视觉政治传播的多模态大语言模型

Achmann-Denkler, Michael, Haim, Mario, Wolff, Christian

Abstract

This paper presents a computational case study that evaluates the capabilities of specialized machine learning models and emerging multimodal large language models for Visual Political Communication (VPC) analysis. Focusing on concentrated visibility in Instagram stories and posts during the 2021 German federal election campaign, we compare the performance of traditional computer vision models (FaceNet512, RetinaFace, Google Cloud Vision) with a multimodal large language model (GPT-4o) in identifying front-runner politicians and counting individuals in images. GPT-4o outperformed the other models, achieving a macro F1-score of 0.89 for face recognition and 0.86 for person counting in stories. These findings demonstrate the potential of advanced AI systems to scale and refine visual content analysis in political communication while highlighting methodological considerations for future research.

Chinese Translation

本文呈现了一项计算案例研究，评估了专门的机器学习模型和新兴的多模态大语言模型在视觉政治传播（VPC）分析中的能力。我们聚焦于2021年德国联邦选举期间Instagram故事和帖子中的集中可见性，比较了传统计算机视觉模型（FaceNet512、RetinaFace、Google Cloud Vision）与多模态大语言模型（GPT-4o）在识别领先政治人物和统计图像中个体数量方面的表现。GPT-4o在面部识别方面的宏观F1-score达到了0.89，在故事中的个体计数方面达到了0.86，优于其他模型。这些发现展示了先进人工智能系统在政治传播中扩展和精炼视觉内容分析的潜力，同时强调了未来研究的方法论考虑。
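
For readers unfamiliar with the metric reported above, a macro F1-score can be computed along these lines. This is a generic sketch with made-up toy labels, not the study's evaluation code; `macro_f1` is a hypothetical helper name.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels only (not data from the paper).
y_true = ["A", "A", "B", "B", "none"]
y_pred = ["A", "B", "B", "B", "none"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because every class contributes equally to the average, rarely pictured politicians weigh as much as frequently pictured ones.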

cs.CV / 80 / 2604.19510

Evaluating Histogram Matching for Robust Deep learning-Based Grapevine Disease Detection

评估直方图匹配在基于深度学习的葡萄藤病害检测中的鲁棒性

Pascual, Ruben, Hernández, Inés, Gutiérrez, Salvador, Tardaguila, Javier, Melo-Pinto, Pedro, Paternain, Daniel, Galar, Mikel

Abstract

Variability in illumination is a primary factor limiting deep learning robustness for field-based plant disease detection. This study evaluates Histogram Matching (HM), a technique that transforms the pixel intensity distribution of an image to match a reference profile, to mitigate this in grapevine classification, distinguishing among healthy leaves, downy mildew, and spider mite damage. We propose a dual-stage integration of HM: (i) as a preprocessing step for normalization, and (ii) as a data augmentation technique to introduce controlled training variability. Experiments using 1,469 RGB images (comprising homogeneous leaf-focused and heterogeneous canopy samples) to train ResNet-18 models demonstrate that this combination significantly enhances robustness on real-world canopy images. While leaf-focused samples showed marginal gains, the canopy subset improved markedly, indicating that balancing normalization with histogram-based diversification effectively bridges the domain gap caused by uncontrolled lighting.

Chinese Translation

光照变化是限制基于深度学习的田间植物病害检测鲁棒性的主要因素。本研究评估了直方图匹配（Histogram Matching, HM）技术，该技术通过将图像的像素强度分布转换为匹配参考轮廓，以减轻这一问题，应用于葡萄藤分类，区分健康叶片、霜霉病和蜘蛛螨损害。我们提出了HM的双阶段集成： (i) 作为归一化的预处理步骤， (ii) 作为数据增强技术，以引入可控的训练变异性。使用1,469张RGB图像（包括同质的叶片聚焦样本和异质的树冠样本）训练ResNet-18模型的实验表明，这种组合显著增强了在真实世界树冠图像上的鲁棒性。尽管叶片聚焦样本的提升幅度较小，但树冠子集的改善显著，表明将归一化与基于直方图的多样化平衡有效地弥补了由不受控光照造成的领域差距。
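
Histogram matching as described above can be sketched with NumPy as follows. This is a minimal, generic version of the technique (CDF mapping per channel), assuming images are arrays with a trailing channel axis; the function names are hypothetical and this is not the authors' pipeline.

```python
import numpy as np

def match_histogram(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Remap a single-channel image so its intensity CDF follows the reference's CDF."""
    src_values, src_inverse, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True
    )
    ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    # For each source intensity, find the reference intensity at the same quantile.
    mapped = np.interp(src_cdf, ref_cdf, ref_values)
    return mapped[src_inverse].reshape(source.shape)

def match_histogram_rgb(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Apply histogram matching independently to each channel of an (H, W, C) image."""
    channels = [
        match_histogram(source[..., c], reference[..., c])
        for c in range(source.shape[-1])
    ]
    return np.stack(channels, axis=-1)

rng = np.random.default_rng(0)
dark = rng.integers(0, 100, size=(32, 32, 3))      # toy under-exposed image
bright = rng.integers(150, 250, size=(32, 32, 3))  # toy reference profile
matched = match_histogram_rgb(dark, bright)
print(matched.mean(), bright.mean())
```

Used for normalization, every image would be matched to one fixed reference; used for augmentation, the reference could be drawn at random per training sample. Both usages are interpretations of the dual-stage description in the abstract.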

cs.CV / 81 / 2604.19556

Paparazzo: Active Mapping of Moving 3D Objects

Paparazzo：动态三维物体的主动映射

Allegro, Davide, Li, Shiyao, Ghidoni, Stefano, Lepetit, Vincent

Abstract

Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object's motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target's trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding. Project page: https://davidea97.github.io/paparazzo-page/

Chinese Translation

当前的三维映射流程通常假设环境是静态的，这限制了它们准确捕捉和重建动态物体的能力。为了解决这一限制，我们引入了动态物体主动映射的新任务，在该任务中，映射代理必须在补偿物体运动的同时规划其轨迹。我们的方法Paparazzo提供了一种无学习的解决方案，能够稳健地预测目标的轨迹，并识别出观察目标的最具信息量的视角，以规划其自身路径。我们还贡献了一个为这一新任务设计的全面基准。通过大量实验，我们表明Paparazzo在三维重建的完整性和准确性方面显著优于几个强基线，标志着动态场景理解的重要一步。项目页面：https://davidea97.github.io/paparazzo-page/

cs.CV / 82 / 2604.19564

EgoSelf: From Memory to Personalized Egocentric Assistant

EgoSelf：从记忆到个性化自我中心助手

Wang, Yanshuo, Xu, Yuan, Li, Xuesong, Hong, Jie, Wang, Yizhou, Chen, Chang Wen, Zhu, Wentao

Abstract

Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user-specific profiles are derived. The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from an individual user's historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at https://abie-e.github.io/egoself_project/.

Chinese Translation

自我中心助手通常依赖第一人称视角数据来捕捉用户行为和上下文，以提供个性化服务。由于不同用户表现出独特的习惯、偏好和日常活动，因此这种个性化对于真正有效的帮助至关重要。然而，有效整合长期用户数据以实现个性化仍然是一个关键挑战。为了解决这一问题，我们提出了EgoSelf，一个包含基于图的交互记忆的系统，该记忆由过去的观察构建而成，并配备了专门的个性化学习任务。该记忆捕捉交互事件和实体之间的时间和语义关系，从中推导出用户特定的个人资料。个性化学习任务被表述为一个预测问题，模型根据记录在图中的个别用户历史行为预测可能的未来交互。大量实验表明，EgoSelf作为个性化自我中心助手的有效性。代码可在 https://abie-e.github.io/egoself_project/ 获取。

cs.CV / 83 / 2604.19570

RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation

RF-HiT：用于通用医学图像分割的整流流层次变换器

Djouama, Ahmed Marouane, Belaala, Abir, Sellam, Abdellah Zakaria, Bekhouche, Salah Eddine, Distante, Cosimo, Hadid, Abdenour

Abstract

Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusion-based approaches, RF-HiT leverages rectified flow with efficient transformer blocks to achieve linear complexity while requiring only a few discretization steps. The model further fuses conditioning features across resolutions via learnable interpolation, enabling effective multi-scale representation with minimal computational overhead. As a result, RF-HiT achieves a strong efficiency-performance trade-off, requiring only 10.14 GFLOPs, 13.6M parameters, and inference in as few as three steps. Despite its compact design, RF-HiT attains 91.27% mean Dice on ACDC and 87.40% on BraTS 2021, achieving performance comparable to or exceeding that of significantly more intensive architectures. This demonstrates its strong potential as a robust, computationally efficient foundation for real-time clinical segmentation.

Chinese Translation

准确的医学图像分割需要长距离的上下文推理和精确的边界描绘，而现有的基于变换器和扩散的范式常常受到二次计算复杂度和高昂推理延迟的瓶颈。我们提出了RF-HiT，一种整流流层次变换器，它将沙漏形变换器主干与多尺度层次编码器结合，以实现解剖引导的特征条件化。与之前的基于扩散的方法不同，RF-HiT利用整流流与高效的变换器模块实现线性复杂度，同时仅需少量离散化步骤。该模型进一步通过可学习的插值在不同分辨率之间融合条件特征，从而实现有效的多尺度表示，且计算开销最小。因此，RF-HiT在效率与性能之间实现了良好的平衡，仅需10.14 GFLOPs、13.6M参数，并且推理步骤少至三步。尽管设计紧凑，RF-HiT在ACDC数据集上达到了91.27%的平均Dice系数，在BraTS 2021上达到了87.40%，其性能可与显著更复杂的架构相媲美或超越。这表明RF-HiT作为实时临床分割的强大、计算高效的基础具有良好的潜力。
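
The claim that rectified flow needs only a few discretization steps can be illustrated with a toy Euler sampler. This sketch makes strong assumptions: `toy_velocity` is a hand-written stand-in for a learned velocity network, the target is a fixed array rather than a segmentation mask, and nothing here reflects the actual RF-HiT architecture.

```python
import numpy as np

def euler_sample(velocity, x0: np.ndarray, num_steps: int = 3) -> np.ndarray:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x

# Hypothetical target standing in for a clean output.
target = np.array([[0.0, 1.0], [1.0, 0.0]])

def toy_velocity(x: np.ndarray, t: float) -> np.ndarray:
    """Velocity of the straight path x_t = (1 - t) * x0 + t * target, written in terms of x_t."""
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.normal(size=target.shape)
sample = euler_sample(toy_velocity, noise, num_steps=3)
print(np.abs(sample - target).max())
```

When the transport paths are straight, as in this toy field, the velocity is constant along each path, so Euler integration is exact regardless of step count. A learned field is only approximately straight, which is why a few steps rather than one are typically used.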

cs.CV / 84 / 2604.19571

TransSplat: Unbalanced Semantic Transport for Language-Driven 3DGS Editing

TransSplat：用于语言驱动的3D高斯点云编辑的不平衡语义传输

Chen, Yanhui, Li, Jiahong, Wang, Jingchao, Lin, Junyi, Zeng, Zixin, Shi, Yang

Abstract

Language-driven 3D Gaussian Splatting (3DGS) editing provides a more convenient approach for modifying complex scenes in VR/AR. Standard pipelines typically adopt a two-stage strategy: first editing multiple 2D views, and then optimizing the 3D representation to match these edited observations. Existing methods mainly improve view consistency through multi-view feature fusion, attention filtering, or iterative recalibration. However, they fail to explicitly address a more fundamental issue: the semantic correspondence between edited 2D evidence and 3D Gaussians. To tackle this problem, we propose TransSplat, which formulates language-driven 3DGS editing as a multi-view unbalanced semantic transport problem. Specifically, our method establishes correspondences between visible Gaussians and view-specific editing prototypes, thereby explicitly characterizing the semantic relationship between edited 2D evidence and 3D Gaussians. It further recovers a cross-view shared canonical 3D edit field to guide unified 3D appearance updates. In addition, we use transport residuals to suppress erroneous edits in non-target regions, mitigating edit leakage and improving local control precision. Qualitative and quantitative results show that, compared with existing 3D editing methods centered on enhancing view consistency, TransSplat achieves superior performance in local editing accuracy and structural consistency.

Chinese Translation

语言驱动的3D高斯点云（3DGS）编辑为在虚拟现实/增强现实（VR/AR）中修改复杂场景提供了更便捷的方法。标准流程通常采用两阶段策略：首先编辑多个2D视图，然后优化3D表示以匹配这些编辑后的观测结果。现有方法主要通过多视图特征融合、注意力过滤或迭代重新校准来提高视图一致性。然而，它们未能明确解决一个更根本的问题：编辑的2D证据与3D高斯之间的语义对应关系。为了解决这个问题，我们提出了TransSplat，将语言驱动的3DGS编辑形式化为一个多视图不平衡语义传输问题。具体而言，我们的方法在可见高斯和视图特定的编辑原型之间建立对应关系，从而明确表征编辑的2D证据与3D高斯之间的语义关系。它进一步恢复一个跨视图共享的规范3D编辑场，以指导统一的3D外观更新。此外，我们使用传输残差来抑制非目标区域的错误编辑，减轻编辑泄漏并提高局部控制精度。定性和定量结果表明，与现有以增强视图一致性为中心的3D编辑方法相比，TransSplat在局部编辑精度和结构一致性方面表现出更优越的性能。

cs.CV / 85 / 2604.19587

SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

SmartPhotoCrafter：统一推理、生成与优化的自动摄影图像编辑

Zeng, Ying, Luo, Miaosen, Li, Guangyuan, Yang, Yang, Fan, Ruiyang, Shi, Linxiao, Yang, Qirui, Zhang, Jian, Liu, Chengcheng, Zheng, Siming, Chen, Jinwei, Li, Bo, Jiang, Peng-Tao

Abstract

Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method which formulates image editing as a tightly coupled reasoning-to-generation process. The proposed model first performs image quality comprehension and identifies deficiencies with the Image Critic module, and then the Photographic Artist module realizes targeted edits to enhance image appeal, eliminating the need for explicit human instructions. A multi-stage training pipeline is adopted: (i) Foundation pretraining to establish basic aesthetic understanding and editing capabilities, (ii) Adaptation with reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) Coordinated reasoning-to-generation reinforcement learning to jointly optimize reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic image generation, while supporting both image restoration and retouching tasks with consistent adherence to color- and tone-related semantics. We also construct a stage-specific dataset, which progressively builds reasoning and controllable generation, effective cross-module collaboration, and ultimately high-quality photographic enhancement. Experiments demonstrate that SmartPhotoCrafter outperforms existing generative models on the task of automatic photographic enhancement, achieving photo-realistic results while exhibiting higher tonal sensitivity to retouching instructions. Project page: https://github.com/vivoCameraResearch/SmartPhotoCrafter.

Chinese Translation

传统的摄影图像编辑通常要求用户具备足够的审美理解，以提供适当的指令来调整图像质量和相机参数。然而，这种范式依赖于对审美意图的明确人类指令，而这些指令往往模糊、不完整，或对非专业用户而言难以获取。在本研究中，我们提出了SmartPhotoCrafter，一种将图像编辑形式化为紧密耦合的推理与生成过程的自动摄影图像编辑方法。所提出的模型首先通过图像评论模块进行图像质量理解并识别缺陷，然后由摄影艺术家模块实现针对性的编辑，以增强图像吸引力，消除对明确人类指令的需求。我们采用了多阶段的训练流程：（i）基础预训练以建立基本的审美理解和编辑能力，（ii）通过推理引导的多编辑监督进行适应，以融入丰富的语义指导，以及（iii）协调的推理到生成的强化学习，以共同优化推理和生成。在训练过程中，SmartPhotoCrafter强调照片真实感的图像生成，同时支持图像恢复和修饰任务，并始终遵循与颜色和色调相关的语义。我们还构建了一个阶段特定的数据集，逐步建立推理和可控生成，有效的跨模块协作，最终实现高质量的摄影增强。实验表明，SmartPhotoCrafter在自动摄影增强任务上优于现有的生成模型，取得了照片真实感的结果，同时对修饰指令表现出更高的色调敏感性。项目页面：https://github.com/vivoCameraResearch/SmartPhotoCrafter。

cs.CV / 86 / 2604.19591

Structure-Semantic Decoupled Modulation of Global Geospatial Embeddings for High-Resolution Remote Sensing Mapping

全球地理空间嵌入的结构-语义解耦调制用于高分辨率遥感制图

Lyu, Jienan, Yang, Miao, Cai, Jinchen, Hu, Yiwen, Lu, Guanyi, Qiu, Junhao, Dong, Runmin

Abstract

Fine-grained high-resolution remote sensing mapping typically relies on localized visual features, which restricts cross-domain generalizability and often leads to fragmented predictions of large-scale land covers. While global geospatial foundation models offer powerful, generalizable representations, directly fusing their high-dimensional implicit embeddings with high-resolution visual features frequently triggers feature interference and spatial structure degradation due to a severe semantic-spatial gap. To overcome these limitations, we propose a Structure-Semantic Decoupled Modulation (SSDM) framework, which decouples global geospatial representations into two complementary cross-modal injection pathways. First, the structural prior modulation branch introduces the macroscopic receptive field priors from global representations into the self-attention modules of the high-resolution encoder. By guiding local feature extraction with holistic structural constraints, it effectively suppresses prediction fragmentation caused by high-frequency detail noise and excessive intra-class variance. Second, the global semantic injection branch explicitly aligns holistic context with the deep high-resolution feature space and directly supplements global semantics via cross-modal integration, thereby significantly enhancing the semantic consistency and category-level discrimination of complex land covers. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared to existing cross-modal fusion approaches. By unleashing the potential of global embeddings, SSDM consistently improves high-resolution mapping accuracy across diverse scenarios, providing a universal and effective paradigm for integrating geospatial foundation models into high-resolution vision tasks.

Chinese Translation

细粒度高分辨率遥感制图通常依赖于局部视觉特征，这限制了跨域的可推广性，并且常常导致大规模土地覆盖的预测碎片化。尽管全球地理空间基础模型提供了强大且可推广的表示，但将其高维隐式嵌入与高分辨率视觉特征直接融合，常常会因严重的语义-空间差距而引发特征干扰和空间结构退化。为克服这些局限性，我们提出了一种结构-语义解耦调制（Structure-Semantic Decoupled Modulation, SSDM）框架，该框架将全球地理空间表示解耦为两个互补的跨模态注入路径。首先，结构先验调制分支将来自全球表示的宏观感受野先验引入高分辨率编码器的自注意力模块。通过用整体结构约束指导局部特征提取，它有效抑制了由高频细节噪声和过度类内方差引起的预测碎片化。其次，全球语义注入分支明确地将整体上下文与深层高分辨率特征空间对齐，并通过跨模态集成直接补充全球语义，从而显著增强复杂土地覆盖的语义一致性和类别级别的区分能力。大量实验表明，我们的方法在现有跨模态融合方法中实现了最先进的性能。通过释放全球嵌入的潜力，SSDM在多种场景中持续提高高分辨率制图的准确性，为将地理空间基础模型集成到高分辨率视觉任务中提供了一个通用而有效的范式。

cs.CV / 87 / 2604.19596

PC2Model: ISPRS benchmark on 3D point cloud to model registration

PC2Model：ISPRS 3D点云与模型配准基准

Maboudi, Mehdi, Harb, Said, Ferrao, Jackson, Khoshelham, Kourosh, Turkan, Yelda, Mawas, Karam

Abstract

Point cloud registration involves aligning one point cloud with another or with a three-dimensional (3D) model, enabling the integration of multimodal data into a unified representation. This is essential in applications such as construction monitoring, autonomous driving, robotics, and virtual or augmented reality (VR/AR). With the increasing accessibility of point cloud acquisition technologies, such as Light Detection and Ranging (LiDAR) and structured light scanning, along with recent advances in deep learning, the research focus has increasingly shifted towards downstream tasks, particularly point cloud-to-model (PC2Model) registration. While data-driven methods aim to automate this process, they struggle with sparsity, noise, clutter, and occlusions in real-world scans, which limit their performance. To address these challenges, this paper introduces the PC2Model benchmark, a publicly available dataset designed to support the training and evaluation of both classical and data-driven methods. Developed under the leadership of ICWG II/Ib, the PC2Model benchmark adopts a hybrid design that combines simulated point clouds with, in some cases, real-world scans and their corresponding 3D models. Simulated data provide precise ground truth and controlled conditions, while real-world data introduce sensor and environmental artefacts. This design supports robust training and evaluation across domains and enables the systematic analysis of model transferability from simulated to real-world scenarios. The dataset is publicly accessible at: https://zenodo.org/uploads/17581812.

Chinese Translation

点云配准涉及将一个点云与另一个点云或三维（3D）模型对齐，从而实现多模态数据的统一表示。这在建筑监测、自动驾驶、机器人技术以及虚拟或增强现实（VR/AR）等应用中至关重要。随着激光雷达（LiDAR）和结构光扫描等点云获取技术的日益普及，以及深度学习的最新进展，研究重点逐渐转向下游任务，特别是点云到模型（PC2Model）配准。尽管数据驱动的方法旨在自动化这一过程，但它们在处理真实世界扫描中的稀疏性、噪声、杂乱和遮挡时面临挑战，这限制了它们的性能。为了解决这些问题，本文介绍了PC2Model基准，这是一个公开可用的数据集，旨在支持经典方法和数据驱动方法的训练和评估。在ICWG II/Ib的领导下，PC2Model基准采用混合设计，结合了模拟点云和在某些情况下的真实世界扫描及其对应的3D模型。模拟数据提供了精确的真实值和受控条件，而真实世界数据则引入了传感器和环境伪影。这种设计支持跨领域的稳健训练和评估，并使得系统分析从模拟到真实场景的模型可转移性成为可能。该数据集可在以下网址公开获取：https://zenodo.org/uploads/17581812。

cs.CV / 88 / 2604.19624

GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction

GRAFT：用于人类场景重建的几何精炼与拟合变换器

YM, Pradyumna, Xue, Yuxuan, Chen, Yue, Kister, Nikita, Sárándi, István, Pons-Moll, Gerard

Abstract

Reconstructing physically plausible 3D human-scene interactions (HSI) from a single image currently presents a trade-off: optimization-based methods offer accurate contact but are slow (~20s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human-scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene. GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces. A lightweight transformer recurrently updates human meshes and re-probes the scene, ensuring the final pose aligns with both learned priors and observed geometry. GRAFT operates either as an end-to-end reconstructor using image features, or with geometry alone as a transferable plug-and-play HSI prior that improves feed-forward methods without retraining. Experiments show GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches optimization-based interaction quality at ~50× lower runtime, while generalizing seamlessly to in-the-wild multi-person scenes and being preferred 64.8% of the time in a three-way user study. Project page: https://pradyumnaym.github.io/graft

Chinese Translation

从单幅图像重建物理上合理的3D人类-场景交互（HSI）目前面临一个权衡：基于优化的方法提供准确的接触信息，但速度较慢（约20秒），而前馈方法则快速但缺乏明确的交互推理，导致漂浮和相互穿透的伪影。我们的关键见解是，基于几何的人类-场景拟合可以被摊销为快速的前馈推理。我们提出了GRAFT（几何精炼与拟合变换器），这是一种学习的HSI先验，预测交互梯度：通过推理人类网格与周围场景的3D关系，迭代地精炼人类网格的修正参数更新。GRAFT将交互状态编码为紧凑的以身体为锚的标记，每个标记通过几何探针与场景几何相结合，捕捉与附近表面的空间关系。一个轻量级的变换器反复更新人类网格并重新探测场景，确保最终姿态与学习的先验和观察到的几何相一致。GRAFT可以作为一个端到端的重建器使用图像特征，或仅使用几何作为可转移的即插即用HSI先验，提升前馈方法而无需重新训练。实验表明，GRAFT在交互质量上比最先进的前馈方法提高了多达113%，并且在运行时间上以约50倍的速度匹配基于优化的交互质量，同时无缝地推广到野外的多人的场景，并在三方用户研究中被64.8%的参与者偏好。项目页面：https://pradyumnaym.github.io/graft。

cs.CV / 89 / 2604.19631

MOSA: Motion-Guided Semantic Alignment for Dynamic Scene Graph Generation

MOSA：基于运动引导的动态场景图生成的语义对齐

Wang, Xuejiao, Zhang, Bohao, Wang, Changbo, He, Gaoqi

Abstract

Dynamic Scene Graph Generation (DSGG) aims to structurally model objects and their dynamic interactions in video sequences for high-level semantic understanding. However, existing methods struggle with fine-grained relationship modeling, semantic representation utilization, and the ability to model tail relationships. To address these issues, this paper proposes a motion-guided semantic alignment method for DSGG (MoSA). First, a Motion Feature Extractor (MFE) encodes object-pair motion attributes such as distance, velocity, motion persistence, and directional consistency. Then, these motion attributes are fused with spatial relationship features through the Motion-guided Interaction Module (MIM) to generate motion-aware relationship representations. To further enhance semantic discrimination capabilities, the cross-modal Action Semantic Matching (ASM) mechanism aligns visual relationship features with text embeddings of relationship categories. Finally, a category-weighted loss strategy is introduced to emphasize learning of tail relationships. Extensive and rigorous testing shows that MoSA performs optimally on the Action Genome dataset.

Chinese Translation

动态场景图生成（DSGG）旨在对视频序列中的对象及其动态交互进行结构化建模，以实现高层次的语义理解。然而，现有方法在细粒度关系建模、语义表示利用以及尾部关系建模能力方面存在困难。为了解决这些问题，本文提出了一种用于DSGG的运动引导语义对齐方法（MoSA）。首先，运动特征提取器（MFE）编码对象对的运动属性，如距离、速度、运动持续性和方向一致性。然后，这些运动属性通过运动引导交互模块（MIM）与空间关系特征融合，以生成运动感知的关系表示。为了进一步增强语义区分能力，跨模态动作语义匹配（ASM）机制将视觉关系特征与关系类别的文本嵌入对齐。最后，引入了一种类别加权损失策略，以强调对尾部关系的学习。广泛而严格的测试表明，MoSA在Action Genome数据集上表现最佳。
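
One common way to realize a category-weighted loss for tail classes is inverse-frequency weighting of cross-entropy, sketched below. The abstract does not specify MoSA's weighting scheme, so the inverse-frequency choice and both function names are assumptions made for illustration.

```python
import numpy as np

def inverse_frequency_weights(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Weight each class by 1 / count, rescaled so the mean weight is 1 (assumed scheme)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    weights = 1.0 / np.maximum(counts, 1.0)  # guard against empty classes
    return weights * num_classes / weights.sum()

def weighted_cross_entropy(logits: np.ndarray, labels: np.ndarray,
                           class_weights: np.ndarray) -> float:
    """Cross-entropy where each sample is weighted by the weight of its true class."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    sample_weights = class_weights[labels]
    return float((sample_weights * nll).sum() / sample_weights.sum())

# Toy long-tailed label set: class 0 is the head, class 2 the tail.
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])
weights = inverse_frequency_weights(labels, num_classes=3)
logits = np.zeros((len(labels), 3))  # uniform predictions
loss = weighted_cross_entropy(logits, labels, weights)
print(weights, loss)
```

With these weights, a mistake on the single tail-class sample contributes as much to the loss as mistakes spread over all six head-class samples.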

cs.CV / 90 / 2604.19632

CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers

CreatiParser：将光栅图形设计生成可编辑层的生成图像解析

Chen, Weidong, Hong, Dexiang, Mao, Zhendong, Cheng, Yutao, Liu, Xinyan, Zhang, Lei, Zhang, Yongdong

Abstract

Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, i.e., the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, e.g., achieving an overall average improvement of 23.7% across all metrics.

Chinese Translation

图形设计图像由多个可编辑层组成，如文本、背景和装饰元素，而大多数生成模型生成的输出为光栅化图像，缺乏明确的层结构，限制了后续编辑。现有的图形设计解析方法通常依赖于多阶段管道，结合布局预测、抠图和修复，这些方法存在误差累积和可控性有限的问题。我们提出了一种混合生成框架，用于光栅到层的图形设计解析，将设计图像分解为可编辑的文本、背景和贴纸层。文本区域使用视觉-语言模型解析为文本渲染协议，从而实现忠实重建和灵活的重新编辑，而背景和贴纸层则采用支持 RGBA 的多分支扩散架构生成。我们进一步引入了 ParserReward，并将其与群体相对策略优化（Group Relative Policy Optimization）结合，以使生成质量与人类设计偏好对齐。在两个具有挑战性的数据集上进行的广泛实验，即 Parser-40K 和 Crello 数据集，显示出优于现有方法的性能，例如，在所有指标上实现了平均提高 23.7%。
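
The core of Group Relative Policy Optimization is that each sampled output is scored relative to the other samples for the same input. A minimal sketch of that normalization step follows; the reward values are made up, ParserReward itself is not modeled, and `group_relative_advantages` is a hypothetical name.

```python
import numpy as np

def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Standardize rewards within one group of samples drawn for the same input."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Made-up rewards for four candidate parses of the same design image.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples scoring above the group mean receive positive advantages and are reinforced, while those below are discouraged, without requiring a separately trained value network.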

cs.CV / 91 / 2604.19636

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

CoInteract：通过空间结构协同生成实现物理一致的人-物交互视频合成

Luo, Xiangyang, Xin, Xiaozhe, Feng, Tao, Guo, Xu, Jin, Meiguang, Ma, Junfeng

Abstract

Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.

Chinese Translation

合成人机交互（HOI）视频在电子商务、数字广告和虚拟营销中具有广泛的实际价值。然而，尽管当前的扩散模型具备逼真的渲染能力，但在（i）手部和面部等敏感区域的结构稳定性以及（ii）物理上合理的接触（例如，避免手与物体的相互渗透）方面仍然经常失败。我们提出了CoInteract，这是一个端到端的HOI视频合成框架，依赖于人物参考图像、产品参考图像、文本提示和语音音频。CoInteract在扩散变换器（Diffusion Transformer, DiT）主干中引入了两个互补设计。首先，我们提出了一种人类感知的专家混合模型（Human-Aware Mixture-of-Experts, MoE），通过空间监督路由将标记路由到轻量级的区域专用专家，从而在最小的参数开销下提高细粒度结构的保真度。其次，我们提出了空间结构协同生成（Spatially-Structured Co-Generation），这是一种双流训练范式，联合建模RGB外观流和辅助HOI结构流，以注入交互几何先验。在训练过程中，HOI流关注RGB标记，其监督正则化共享主干权重；在推理时，HOI分支被移除以实现零开销的RGB生成。实验结果表明，CoInteract在结构稳定性、逻辑一致性和交互真实感方面显著优于现有方法。

cs.CV / 92 / 2604.19648

CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation

CoCo-SAM3：利用概念冲突进行开放词汇语义分割

Chen, Yanhui, Yang, Baoyao, Liu, Siqi, Wang, Jingchao

Abstract

SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.

Chinese Translation

SAM3通过引入基于提示的掩码生成范式，推动了开放词汇语义分割的发展。然而，在多类开放词汇场景中，从不同类别提示独立生成的掩码缺乏统一的、可类间比较的证据尺度，常常导致重叠覆盖和不稳定的竞争。此外，同一概念的同义表达往往会激活不一致的语义和空间证据，导致类内漂移，加剧类间冲突并妨碍整体推理的稳定性。为了解决这些问题，我们提出了CoCo-SAM3（概念冲突SAM3），该方法明确将推理解耦为类内增强和类间竞争。我们的方法首先对同义提示的证据进行对齐和聚合，以增强概念一致性。然后在统一的可比较尺度上进行类间竞争，使所有候选类别之间能够进行直接的像素级比较。该机制稳定了多类推理，并有效减轻了类间冲突。在不需要任何额外训练的情况下，CoCo-SAM3在八个开放词汇语义分割基准上实现了一致的性能提升。
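
The two-stage inference described in this abstract can be illustrated with a minimal sketch. It assumes that per-prompt evidence maps are already available as equally sized 2D score grids; the mean aggregation over synonyms, the per-pixel softmax, and all function names are illustrative assumptions, not the authors' implementation.

```python
import math

def aggregate_synonyms(maps):
    """Intra-class step: average per-pixel evidence over synonymous prompts (assumed rule)."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) / len(maps) for x in range(w)] for y in range(h)]

def compete(class_maps):
    """Inter-class step: per-pixel softmax across classes, then pick the top class."""
    names = list(class_maps)
    h, w = len(class_maps[names[0]]), len(class_maps[names[0]][0])
    labels = []
    for y in range(h):
        row = []
        for x in range(w):
            scores = [class_maps[n][y][x] for n in names]
            peak = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - peak) for s in scores]
            total = sum(exps)
            probs = [e / total for e in exps]
            row.append(names[probs.index(max(probs))])
        labels.append(row)
    return labels

# Hypothetical evidence: two synonymous prompts for "cat", one prompt for "dog",
# each producing a 1x2 score map.
evidence = {
    "cat": [[[2.0, 0.1]], [[1.0, 0.3]]],
    "dog": [[[0.5, 1.5]]],
}
merged = {name: aggregate_synonyms(maps) for name, maps in evidence.items()}
segmentation = compete(merged)
```

Putting all classes on one normalized scale before the argmax is what allows a direct pixel-wise comparison; the specific normalization used in the paper may differ.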

cs.CV / 93 / 2604.19673

InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

InHabit：利用图像基础模型实现可扩展的3D人类放置

Kister, Nikita, YM, Pradyumna, Sárándi, István, Wang, Jiayi, Khoreva, Anna, Pons-Moll, Gerard

Abstract

Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.

Chinese Translation

训练具身代理以像人类一样理解3D场景需要大量人类与多样环境有意义互动的数据，但此类数据稀缺。现实世界的动作捕捉成本高且仅限于受控环境，而现有的合成数据集依赖于简单的几何启发式，忽视了丰富的场景上下文。相比之下，在互联网规模数据上训练的2D基础模型隐式地获取了人类与环境互动的常识知识。为了将这种知识转移到3D中，我们提出了InHabit，一个完全自动化且可扩展的数据生成器，用于在人类互动的3D场景中填充。InHabit遵循渲染-生成-提升的原则：给定一个渲染的3D场景，视觉-语言模型提出上下文相关的有意义动作，图像编辑模型插入一个人类，优化程序将编辑结果提升为与场景几何对齐的物理上合理的SMPL-X身体。应用于Habitat-Matterport3D，InHabit生成了首个大规模的照片真实感3D人类-场景互动数据集，包含78K样本，覆盖800个建筑规模的场景，具有完整的3D几何、SMPL-X身体和RGB图像。用我们的样本增强标准训练数据，提高了基于RGB的3D人类-场景重建和接触估计，在一项感知用户研究中，我们的数据在78%的情况下优于现有技术。

cs.CV / 94 / 2604.19675

MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention

MedFlowSeg：具有频率感知注意力的医学图像分割流匹配

Chen, Zhi, Hu, Runze, Zhang, Le

Abstract

Flow matching has recently emerged as a principled framework for learning continuous-time transport maps, enabling efficient deterministic generation without relying on stochastic diffusion processes. While generative modeling has shown promise for medical image segmentation, particularly in capturing uncertainty and complex anatomical variability, existing approaches are predominantly built upon diffusion models, which incur substantial computational overhead due to iterative sampling and are often constrained by UNet-based parameterizations. In this work, we introduce MedFlowSeg, a conditional flow matching framework that formulates medical image segmentation as learning a time-dependent vector field that transports a simple prior distribution to the target segmentation distribution. This formulation enables one-step deterministic inference while preserving the expressiveness of generative modeling. We further develop a dual-conditioning mechanism to incorporate structured priors into the learned flow. Specifically, we propose a Dual-Branch Spatial Attention module that injects multi-scale structural information into the flow field, and a Frequency-Aware Attention module that models cross-domain interactions between spatial and spectral representations via discrepancy-aware fusion and time-dependent modulation. Together, these components provide an effective parameterization of conditional flows that capture both global anatomical structure and fine-grained boundary details. We provide extensive empirical validation across multiple medical imaging modalities, demonstrating that MedFlowSeg achieves state-of-the-art performance while significantly reducing computational cost compared to diffusion-based methods. Our results highlight the potential of flow matching as a theoretically grounded and computationally efficient alternative for generative medical image segmentation.

Chinese Translation

流匹配最近作为一种原则性框架出现，用于学习连续时间传输映射，使得高效的确定性生成成为可能，而无需依赖随机扩散过程。尽管生成建模在医学图像分割中显示出前景，特别是在捕捉不确定性和复杂解剖变异性方面，现有方法主要基于扩散模型，这导致由于迭代采样而产生了大量计算开销，并且通常受到基于UNet的参数化的限制。在本研究中，我们提出了MedFlowSeg，一种条件流匹配框架，将医学图像分割表述为学习一个时间依赖的向量场，该向量场将简单的先验分布传输到目标分割分布。这种表述使得一步确定性推理成为可能，同时保留了生成建模的表现力。我们进一步开发了一种双重条件机制，将结构化先验纳入学习的流中。具体而言，我们提出了一个双分支空间注意力模块，将多尺度结构信息注入流场中，以及一个频率感知注意力模块，通过差异感知融合和时间依赖调制建模空间和光谱表示之间的跨域交互。这些组件共同提供了一种有效的条件流参数化，捕捉全局解剖结构和细粒度边界细节。我们在多种医学成像模式下进行了广泛的实证验证，证明MedFlowSeg在显著降低计算成本的同时，达到了与基于扩散的方法相比的最先进性能。我们的结果突显了流匹配作为一种理论基础和计算高效的生成医学图像分割替代方案的潜力。
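
As a rough illustration of the formulation in this abstract, the sketch below shows a conditional flow matching training target and one-step inference. It assumes the common linear interpolation path between a prior sample and the target mask; the abstract does not state which path MedFlowSeg uses, and the `oracle` velocity function is a hypothetical stand-in for a trained network.

```python
import random

def interpolate(x0, x1, t):
    """Point on the assumed linear path x_t = (1 - t) * x0 + t * x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the vector field on the linear path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def one_step_inference(x0, velocity_fn):
    """Single Euler step from t = 0 to t = 1: x1 = x0 + v(x0, 0)."""
    v = velocity_fn(x0, 0.0)
    return [a + dv for a, dv in zip(x0, v)]

random.seed(0)
mask = [0.0, 1.0, 1.0, 0.0]                      # toy flattened segmentation mask
noise = [random.gauss(0.0, 1.0) for _ in mask]   # sample from a simple prior

# Hypothetical perfect network: returns the exact target velocity for this pair.
oracle = lambda x, t: target_velocity(noise, mask)
pred = one_step_inference(noise, oracle)
```

With an exact velocity field on a straight path, one Euler step recovers the target, which is why a one-step sampler can be deterministic and cheap compared with iterative diffusion sampling.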

cs.CV / 95 / 2604.19679

MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

MMControl：统一的多模态控制用于联合音视频生成

Li, Liyang, Wang, Wen, Zhao, Canyu, Feng, Tianjian, Zhao, Zhiyue, Chen, Hao, Shen, Chunhua

Abstract

Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism. It incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.

Chinese Translation

最近在扩散变换器（Diffusion Transformers, DiTs）方面的进展使得高质量的联合音视频生成成为可能，能够在单一模型中生成与音频同步的视频。然而，现有的可控生成框架通常仅限于视频控制。这限制了全面的可控性，并且往往导致跨模态对齐的次优结果。为了解决这一问题，我们提出了MMControl，它使用户能够在联合音视频生成中进行多模态控制。MMControl引入了一种双流条件注入机制。它将视觉和声学控制信号，包括参考图像、参考音频、深度图和姿态序列，融入到联合生成过程中。这些条件通过旁路分支注入到联合音视频扩散变换器中，使模型能够在结构约束下同时生成身份一致的视频和音色一致的音频。此外，我们引入了模态特定的引导缩放，允许用户在推理时独立且动态地调整每个视觉和声学条件的影响强度。大量实验表明，MMControl在联合音视频生成中实现了对角色身份、声音音色、身体姿态和场景布局的细粒度、可组合控制。
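
One plausible reading of modality-specific guidance scaling is a per-condition extension of classifier-free guidance. The sketch below is written under that assumption; the combination rule, the function name, and the condition names are hypothetical and are not taken from the paper.

```python
def guided_prediction(uncond, cond_preds, scales):
    """Combine model outputs as uncond + sum_k w_k * (cond_k - uncond) (assumed rule)."""
    out = list(uncond)
    for name, pred in cond_preds.items():
        w = scales.get(name, 0.0)  # a missing scale disables that condition
        out = [o + w * (p - u) for o, p, u in zip(out, pred, uncond)]
    return out

# Toy two-dimensional predictions for one denoising step.
uncond = [0.0, 0.0]
cond = {"pose": [1.0, 0.0], "ref_audio": [0.0, 2.0]}
result = guided_prediction(uncond, cond, {"pose": 2.0, "ref_audio": 0.5})
```

Because each condition has its own weight, a user could strengthen pose control while weakening the audio reference at inference time without retraining.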

cs.CV / 96 / 2604.19680

IR-Flow: Bridging Discriminative and Generative Image Restoration via Rectified Flow

IR-Flow：通过校正流连接判别式与生成式图像恢复

Fan, Zihao, Lu, Xin, Xiao, Jie, Li, Dong, Huang, Jie, Fu, Xueyang

Abstract

In image restoration, single-step discriminative mappings often lose fine details because of expectation learning, whereas generative paradigms suffer from inefficient multi-step sampling and noise-residual coupling. To address this dilemma, we propose IR-Flow, a novel image restoration method based on Rectified Flow that serves as a unified framework bridging the gap between discriminative and generative paradigms. Specifically, we first construct multilevel data distribution flows, which expand the ability of models to learn from and adapt to various levels of degradation. Subsequently, cumulative velocity fields are proposed to learn transport trajectories across varying degradation levels, guiding intermediate states toward the clean target, while a multi-step consistency constraint is presented to enforce trajectory coherence and boost few-step restoration performance. We show that directly establishing a linear transport flow between degraded and clean image domains not only enables fast inference but also improves adaptability to out-of-distribution degradations. Extensive evaluations on deraining, denoising and raindrop removal tasks demonstrate that IR-Flow achieves competitive quantitative results with only a few sampling steps, offering an efficient and flexible framework that maintains an excellent distortion-perception balance. Our code is available at https://github.com/fanzh03/IR-Flow.

Chinese Translation

在图像恢复中，单步判别映射往往通过期望学习缺乏细节，而生成范式则受到低效的多步采样和噪声残余耦合的困扰。为了解决这一困境，我们提出了IR-Flow，这是一种基于校正流的新颖图像恢复方法，作为一个统一框架，弥合判别式与生成式范式之间的差距。具体而言，我们首先构建多层次数据分布流，扩展模型从不同降级水平学习和适应的能力。随后，提出了累积速度场，以学习跨越不同降级水平的传输轨迹，引导中间状态朝向干净目标，同时提出了多步一致性约束，以强制轨迹一致性并提升少步恢复性能。我们表明，直接在降级和干净图像域之间建立线性传输流，不仅能够实现快速推理，还能提高对分布外降级的适应能力。在去雨、去噪和雨滴去除任务上的广泛评估表明，IR-Flow在仅需少量采样步骤的情况下实现了具有竞争力的定量结果，提供了一个高效且灵活的框架，保持了出色的失真感知平衡。我们的代码可在 https://github.com/fanzh03/IR-Flow 获取。

cs.CV / 97 / 2604.19697

Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

揭示细粒度视觉痕迹：评估多模态交错推理链在多模态STEM任务中的表现

Jin, Jing, Liu, Hao, Bai, Yan, Lou, Yihang, Wang, Zhenke, Yuan, Tianrun, Chen, Juntong, Zhu, Yongkang, Zeng, Fanhu, Zhu, Xuanyu, Xu, Yige

Abstract

Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.

Chinese Translation

多模态大型语言模型（MLLMs）展现了令人期待的推理能力，但在专业领域评估其性能仍然具有挑战性。STEM推理是一个特别有价值的测试平台，因为它提供了高度可验证的反馈，但现有基准往往由于模态冗余而允许单模态捷径，并且主要关注最终答案的准确性，忽视了推理过程本身。为了解决这一挑战，我们引入了StepSTEM：一个涵盖数学、物理、化学、生物和工程的283个问题的研究生级基准，用于对MLLMs中的跨模态推理进行细粒度评估。StepSTEM通过严格的策划流程构建，确保文本和视觉输入之间的严格互补性。我们进一步提出了一个通用的逐步评估框架，适用于仅文本的思维链和交错的图像-文本推理，利用动态规划将预测的推理步骤与多个参考解决方案对齐。针对多种模型的实验表明，当前的MLLMs仍然在很大程度上依赖文本推理，即使是Gemini 3.1 Pro和Claude Opus 4.6的准确率也仅为38.29%。这些结果突显了真正的跨模态STEM推理的巨大提升空间，并将StepSTEM定位为多模态推理的细粒度评估基准。源代码可在https://github.com/lll-hhh/STEPSTEM获取。
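
The step-level evaluation idea, aligning predicted reasoning steps with several reference solutions by dynamic programming, can be sketched as follows. The token-overlap similarity, the 0.5 matching threshold, and the function names are assumptions made for illustration; the benchmark's actual matching criterion is not described in the abstract.

```python
def similarity(a, b):
    """Jaccard overlap of lowercase word sets (assumed stand-in for a step matcher)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def align_score(pred, ref, threshold=0.5):
    """Fraction of reference steps matched, in order, by an LCS-style DP."""
    n, m = len(pred), len(ref)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if similarity(pred[i - 1], ref[j - 1]) >= threshold else 0
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + match)
    return dp[n][m] / m if m else 0.0

def best_reference_score(pred, references):
    """Score against every reference solution and keep the best alignment."""
    return max(align_score(pred, ref) for ref in references)

pred = [
    "apply newton second law",
    "read the slope from the graph",
    "compute the acceleration",
]
references = [
    ["read the slope from the graph", "compute the acceleration", "state the final force"],
    ["use energy conservation", "solve for velocity"],
]
score = best_reference_score(pred, references)
```

Taking the maximum over references avoids penalizing a model for following a valid solution path that differs from any single reference.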

cs.CV / 98 / 2604.19702

Face Anything: 4D Face Reconstruction from Any Image Sequence

面对一切：从任意图像序列进行4D人脸重建

Kocasari, Umut, Giebenhain, Simon, Shaw, Richard, Nießner, Matthias

Abstract

Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method enables accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation using a transformer-based model that jointly predicts depth and canonical facial coordinates, trained using multi-view geometry data that non-rigidly warps into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3× lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.

Chinese Translation

从图像序列中准确重建和跟踪动态人脸具有挑战性，因为非刚性变形、表情变化和视角变化同时发生，导致几何和对应估计中存在显著的模糊性。我们提出了一种基于典型人脸点预测的高保真4D人脸重建统一方法，该表示法为每个像素分配一个在共享典型空间中的归一化人脸坐标。该公式将密集跟踪和动态重建转化为一个典型重建问题，使得在单个前馈模型中实现时间一致的几何形状和可靠的对应关系。通过联合预测深度和典型坐标，我们的方法能够在单一架构内实现准确的深度估计、时间稳定的重建、密集的3D几何形状以及稳健的人脸点跟踪。我们使用基于变换器的模型实现该公式，该模型联合预测深度和典型人脸坐标，并使用非刚性变形为典型空间的多视角几何数据进行训练。在图像和视频基准上的大量实验表明，在重建和跟踪任务中实现了最先进的性能，相较于先前的动态重建方法，达到约3倍更低的对应误差和更快的推理速度，同时深度准确性提高了16%。这些结果突显了典型人脸点预测作为统一前馈4D人脸重建的有效基础。

cs.CV / 99 / 2604.19710

SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

SpanVLA：高效的动作桥接与负恢复样本学习的视觉-语言-动作模型

Zhou, Zewei, Yang, Ruining, Xuewei, Qi, Guo, Yiluan, Chen, Sherry X., Feng, Tao, Pistunova, Kateryna, Shen, Yishan, Su, Lili, Ma, Jiaqi

Abstract

Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often suffer from high action-generation latency under an autoregressive generation framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework that integrates an autoregressive reasoning module and a flow-matching action expert. First, SpanVLA introduces an efficient bridge that uses the vision and reasoning guidance of the VLM to plan future trajectories with a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of SpanVLA, we propose a GRPO-based post-training method that enables the VLA model to learn not only from positive driving samples but also how to avoid typical negative behaviors and how to recover from them. We further introduce mReasoning, a new real-world driving reasoning dataset focusing on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on NAVSIM (v1 and v2) demonstrate the competitive performance of SpanVLA. Additionally, qualitative results across diverse scenarios highlight the planning performance and robustness of our model.

Chinese Translation

视觉-语言-动作（VLA）模型为利用世界知识和推理能力提供了一种有前景的自主驾驶范式，特别是在长尾场景中。然而，现有的VLA模型在使用自回归生成框架进行动作生成时常常面临高延迟的问题，并且表现出有限的鲁棒性。本文提出了SpanVLA，一种新颖的端到端自主驾驶框架，集成了自回归推理和流匹配动作专家。首先，SpanVLA引入了一个高效的桥接机制，利用视觉和推理指导来高效规划未来轨迹，使用基于历史轨迹初始化的流匹配策略，从而显著减少推理时间。其次，为了进一步提高SpanVLA模型的性能和鲁棒性，我们提出了一种基于GRPO的后训练方法，使得VLA模型不仅能够从正向驾驶样本中学习，还能够学习如何避免典型的负面行为并学习恢复行为。我们进一步引入了mReasoning，一个新的现实世界驾驶推理数据集，专注于复杂的、需要推理的场景和负恢复样本。在NAVSIM（v1和v2）上的大量实验表明，SpanVLA模型具有竞争力的性能。此外，在多样化场景中的定性结果突显了我们模型的规划性能和鲁棒性。
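
GRPO-style post-training scores each sampled output relative to the other samples drawn for the same input. A minimal sketch of that group-relative advantage is shown below; the reward values are invented for illustration, the use of the population standard deviation is an assumption, and nothing here reflects how SpanVLA actually defines rewards for positive, negative, or recovery trajectories.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four trajectories sampled for the same scene:
# a safe plan, a collision, and two mediocre plans.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Trajectories that score below the group mean receive negative advantages, so the policy update pushes probability away from them, which is one way a model can learn from negative samples without a separate value network.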

cs.CV / 100 / 2604.19715

A Network-Aware Evaluation of Distributed Energy Resource Control in Smart Distribution Systems

智能配电系统中分布式能源资源控制的网络感知评估

Gan, Houchao

Abstract

Distribution networks with high penetration of Distributed Energy Resources (DERs) increasingly rely on communication networks to coordinate grid-interactive control. While many distributed control schemes have been proposed, they are often evaluated under idealized communication assumptions, making it difficult to assess their performance under realistic network conditions. This work presents an implementation-driven evaluation of a representative virtual power plant (VPP) dispatch algorithm using a co-simulation framework that couples a linearized distribution-system model with packet-level downlink emulation in ns-3. The study considers a modified IEEE 37-node feeder with high photovoltaic penetration and a primal-dual VPP dispatch that simultaneously targets feeder-head active power tracking and voltage regulation. Communication effects are introduced only on the downlink path carrying dual-variable updates, where per-DER packet delays and a hold-last-value strategy are modeled. Results show that, under ideal communication, the dispatch achieves close tracking of the feeder-head power reference while maintaining voltages within the prescribed limits at selected buses. When realistic downlink delay is introduced, the same controller exhibits large oscillations in feeder-head power and more frequent voltage limit violations. These findings highlight that distributed DER control performance can be strongly influenced by communication behavior and motivate evaluation frameworks that explicitly incorporate network dynamics into the assessment of grid-interactive control schemes.

Chinese Translation

高渗透率分布式能源资源（DERs）的配电网络越来越依赖通信网络来协调电网互动控制。尽管已经提出了许多分布式控制方案，但它们通常是在理想化的通信假设下进行评估，这使得在现实网络条件下评估其性能变得困难。本研究展示了一种基于实施的评估方法，针对一个代表性的虚拟电厂（VPP）调度算法，采用了一个将线性化配电系统模型与 ns-3 中的分组级下行链路仿真相结合的协同仿真框架。研究考虑了一个修改后的 IEEE 37 节点馈线，具有高光伏渗透率，以及一个原始-对偶 VPP 调度，该调度同时针对馈线头有功功率跟踪和电压调节。通信效应仅在承载双变量更新的下行路径上引入，其中对每个 DER 的分组延迟和保持最后值策略进行了建模。结果表明，在理想通信条件下，调度能够紧密跟踪馈线头功率参考，同时在选定母线处保持电压在规定范围内。当引入现实的下行延迟时，同一控制器在馈线头功率上表现出较大的振荡，并且更频繁地违反电压限制。这些发现突显了分布式 DER 控制性能可能受到通信行为的强烈影响，并激励评估框架明确将网络动态纳入电网互动控制方案的评估中。
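
The hold-last-value strategy mentioned in this abstract can be sketched for a single DER as follows. The sketch assumes discrete time steps, one update sent per step, integer per-packet delays, and that a stale packet arriving after a newer one is ignored; these modeling choices and the function name are assumptions rather than details from the paper.

```python
def hold_last_value(sent, delays, initial=0.0):
    """Value a DER acts on at each step when packet k arrives at step k + delays[k]."""
    used = []
    for t in range(len(sent)):
        # Packets sent at or before step t that have already arrived.
        arrived = [k for k in range(t + 1) if k + delays[k] <= t]
        # Use the most recently sent arrived packet, else keep the initial value.
        used.append(sent[max(arrived)] if arrived else initial)
    return used

# Hypothetical dual-variable updates and per-packet downlink delays (in steps).
sent = [0.1, 0.2, 0.3, 0.4, 0.5]
delays = [1, 3, 0, 2, 2]
used = hold_last_value(sent, delays)
```

In this toy run the DER keeps acting on the value 0.3 for three consecutive steps while the controller has already moved on to 0.5, which is the kind of mismatch that can produce the oscillations reported under realistic delay.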

cs.CV / 101 / 2604.19720

ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

ReImagine：通过图像优先合成重新思考可控的高质量人类视频生成

Sun, Zhengwentai, Zheng, Keru, Li, Chenghong, Liao, Hongjie, Yang, Xihe, Li, Heyuan, Zhi, Yihao, Ning, Shuliang, Cui, Shuguang, Han, Xiaoguang

Abstract

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.

Chinese Translation

人类视频生成仍然面临挑战，因为在有限的多视角数据下，联合建模人类外观、运动和相机视角的难度较大。现有方法通常分别处理这些因素，导致可控性有限或视觉质量降低。我们从图像优先的角度重新审视这个问题，通过图像生成学习高质量的人类外观，并将其作为视频合成的先验，从而将外观建模与时间一致性解耦。我们提出了一种可控的姿态和视角的管道，结合了预训练的图像主干与基于SMPL-X的运动指导，以及基于预训练视频扩散模型的无训练时间精炼阶段。我们的方法在多种姿态和视角下生成高质量、时间一致的视频。我们还发布了一个标准人类数据集和一个辅助模型，用于组合人类图像合成。代码和数据可在 https://github.com/Taited/ReImagine 上公开获取。

cs.CV / 102 / 2604.19736

Generative Drifting for Conditional Medical Image Generation

条件医学图像生成的生成漂移

Li, Zirong, Mei, Siyuan, Wu, Weiwen, Maier, Andreas, Gölz, Lina, Xia, Yan

Abstract

Conditional medical image generation plays an important role in many clinically relevant imaging tasks. However, existing methods still face a fundamental challenge in balancing inference efficiency, patient-specific fidelity, and distribution-level plausibility, particularly in high-dimensional 3D medical imaging. In this work, we propose GDM, a generative drifting framework that reformulates deterministic medical image prediction as a multi-objective learning problem to jointly promote distribution-level plausibility and patient-specific fidelity while retaining one-step inference. GDM extends drifting to 3D medical imaging through an attractive-repulsive drift that minimizes the discrepancy between the generator pushforward and the target distribution. To enable stable drifting-based learning in 3D volumetric data, GDM constructs a multi-level feature bank from a medical foundation encoder to support reliable affinity estimation and drifting field computation across complementary global, local, and spatial representations. In addition, a gradient coordination strategy in the shared output space improves optimization balance under competing distribution-level and fidelity-oriented objectives. We evaluate the proposed framework on two representative tasks, MRI-to-CT synthesis and sparse-view CT reconstruction. Experimental results show that GDM consistently outperforms a wide range of baselines, including GAN-based, flow-matching-based, and SDE-based generative models, as well as supervised regression methods, while improving the balance among anatomical fidelity, quantitative reliability, perceptual realism, and inference efficiency. These findings suggest that GDM provides a practical and effective framework for conditional 3D medical image generation.

Chinese Translation

条件医学图像生成在许多临床相关的成像任务中发挥着重要作用。然而，现有方法在平衡推理效率、患者特异性保真度和分布级可信度方面仍面临根本性挑战，尤其是在高维3D医学成像中。在本研究中，我们提出了GDM（生成漂移模型），这是一个将确定性医学图像预测重新表述为多目标学习问题的生成漂移框架，以共同促进分布级可信度和患者特异性保真度，同时保持一步推理。GDM通过一种吸引-排斥漂移扩展到3D医学成像，最小化生成器推送与目标分布之间的差异。为了在3D体积数据中实现稳定的基于漂移的学习，GDM从医学基础编码器构建了一个多层特征库，以支持在互补的全局、局部和空间表示之间进行可靠的亲和性估计和漂移场计算。此外，共享输出空间中的梯度协调策略在竞争的分布级和保真度导向目标下改善了优化平衡。我们在两个代表性任务上评估了所提出的框架，即MRI到CT的合成和稀疏视图CT重建。实验结果表明，GDM在解剖保真度、定量可靠性、感知真实感和推理效率之间的平衡上，始终优于包括基于GAN、基于流匹配和基于SDE的生成模型以及监督回归方法在内的多种基线。这些发现表明，GDM为条件3D医学图像生成提供了一个实用有效的框架。

cs.CV / 103 / 2604.19741

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

CityRAG：通过空间基础的视频生成进入城市

Chou, Gene, Herrmann, Charles, Genova, Kyle, Deng, Boyang, Peng, Songyou, Hariharan, Bharath, Zhang, Jason Y., Snavely, Noah, Henzler, Philipp

Abstract

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

Chinese Translation

我们解决了生成一个三维一致、可导航环境的问题，该环境是空间基础的：即对真实地点的模拟。现有的视频生成模型能够生成与文本（T2V）或图像（I2V）提示一致的合理序列。然而，在任意天气条件和动态物体配置下重建真实世界的能力对于包括自动驾驶和机器人模拟在内的下游应用至关重要。为此，我们提出了CityRAG，这是一种视频生成模型，利用大量地理注册数据作为上下文，将生成与物理场景相结合，同时保持对复杂运动和外观变化的学习先验。CityRAG依赖于时间上未对齐的训练数据，这教会模型从瞬态属性中语义性地解耦基础场景。我们的实验表明，CityRAG能够生成连贯的数分钟、物理基础的视频序列，保持数千帧内的天气和光照条件，实现循环闭合，并导航复杂轨迹以重建真实世界地理。

cs.CV / 104 / 2604.19747

AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

AnyRecon：基于视频扩散模型的任意视角三维重建

Chen, Yutian, Guo, Shi, Jin, Renbiao, Yang, Tianshuo, Cai, Xin, Luo, Yawen, Yang, Mingxin, Yu, Mulin, Xu, Linning, Xue, Tianfan

Abstract

Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.

Chinese Translation

稀疏视角三维重建对于从随意捕捉的场景建模至关重要，但在非生成重建中仍然面临挑战。现有的基于扩散的方法通过合成新视角来缓解这些问题，但它们通常仅依赖于一到两个捕捉帧，这限制了几何一致性，并限制了对大规模或多样化场景的扩展能力。我们提出了AnyRecon，一个可扩展的框架，能够从任意和无序的稀疏输入中进行重建，同时保持明确的几何控制，并支持灵活的条件基数。为了支持长距离条件，我们的方法通过预先捕捉视图缓存构建一个持久的全局场景记忆，并去除时间压缩，以在大视角变化下保持帧级对应关系。除了更好的生成模型外，我们还发现生成与重建之间的相互作用对于大规模三维场景至关重要。因此，我们引入了一种几何感知的条件策略，通过显式的三维几何记忆和几何驱动的捕捉视图检索将生成与重建结合起来。为了确保效率，我们将四步扩散蒸馏与上下文窗口稀疏注意力相结合，以降低二次复杂度。大量实验表明，在不规则输入、大视角间隙和长轨迹下，重建表现出强健性和可扩展性。

cs.CV / 105 / 2604.19748

Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

Tstars-Tryon 1.0：针对多样化时尚物品的稳健且真实的虚拟试穿

Chen, Mengting, Chen, Zhengrui, Du, Yongchao, Gao, Zuan, Hu, Taihang, Lan, Jinsong, Lin, Chao, Shen, Yefeng, Wang, Xingjian, Wang, Zhao, Wu, Zhengtao, Xu, Xiaoli, Xu, Zhengze, Yan, Hao, Zhang, Mingzhou, Zheng, Jun, Zhou, Qinye, Zhu, Xiaoyong, Zheng, Bo

Abstract

Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.

Chinese Translation

近期在图像生成和编辑方面的进展为虚拟试穿开辟了新的机会。然而，现有方法仍然难以满足复杂的现实世界需求。我们提出了Tstars-Tryon 1.0，这是一个商业规模的虚拟试穿系统，具有稳健性、真实感、多功能性和高效性。首先，我们的系统在极端姿势、严重光照变化、运动模糊和其他现实环境条件下，保持了高成功率。其次，它提供了高度逼真的结果，细致入微地保留了服装的纹理、材料特性和结构特征，同时在很大程度上避免了常见的AI生成伪影。第三，除了服装试穿外，我们的模型支持在8个时尚类别中灵活的多图像合成（最多6张参考图像），并对人物身份和背景进行协调控制。第四，为了克服商业部署的延迟瓶颈，我们的系统在推理速度上进行了大量优化，实现了近实时生成，提供无缝的用户体验。这些能力得益于一个集成的系统设计，涵盖端到端的模型架构、可扩展的数据引擎、稳健的基础设施和多阶段的训练范式。广泛的评估和大规模的产品部署表明，Tstars-Tryon 1.0在整体性能上处于领先地位。为了支持未来的研究，我们还发布了一个全面的基准。该模型已在淘宝App上以工业规模部署，为数百万用户提供服务，处理数千万的请求。

人工智能 (Artificial Intelligence)

cs.AI / 1 / 2604.18645

On Solving the Multiple Variable Gapped Longest Common Subsequence Problem

解决多变量间隔最长公共子序列问题

Djukanović, Marko, Balaban, Nikola, Blum, Christian, Kartelj, Aleksandar, Džeroski, Sašo, Zebec, Žiga

Abstract

This paper addresses the Variable Gapped Longest Common Subsequence (VGLCS) problem, a generalization of the classical LCS problem involving flexible gap constraints between consecutive characters of a solution. The problem arises in molecular sequence comparison, where structural distance constraints between residues must be respected, and in time-series analysis, where events are required to occur within specified temporal delays. We propose a search framework based on the root-based state graph representation, in which the state space comprises a generally large number of rooted state subgraphs. To cope with the resulting combinatorial explosion, an iterative beam search strategy is employed, dynamically maintaining a global pool of promising candidate root nodes, enabling effective control of diversification across iterations. To steer the search toward high-quality solutions, several known heuristics from the LCS literature are integrated into the standalone beam search procedure. To the best of our knowledge, this is the first comprehensive computational study on the VGLCS problem, comprising 320 synthetic instances with up to 10 input sequences and up to 500 characters. Experimental results show that the designed approach is more robust than the baseline beam search at comparable runtimes.

Chinese Translation

本文探讨了可变间隔最长公共子序列（Variable Gapped Longest Common Subsequence, VGLCS）问题，这是经典最长公共子序列（LCS）问题的一种推广，涉及到连续解字符之间灵活的间隔约束。该问题出现在分子序列比较中，其中必须遵守残基之间的结构距离约束，以及时间序列分析中，其中事件要求在指定的时间延迟内发生。我们提出了一种基于根节点状态图表示的搜索框架，其中状态空间通常包含大量的根状态子图。为了应对由此产生的组合爆炸，采用了一种迭代束搜索策略，动态维护一个有前景的候选根节点的全局池，从而有效控制迭代过程中的多样性。为了利用搜索高质量解，采用了LCS文献中已知的几种启发式方法，融入到独立的束搜索过程中。根据我们所知，这是首次对VGLCS问题进行全面的计算研究，涵盖320个合成实例，最多包含10个输入序列和500个字符。实验结果表明，所设计的方法在可比运行时间内相较于基线束搜索展现出良好的稳健性。
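
To make the gap constraint in the abstract above concrete, here is a minimal sketch for the two-sequence case. It is not the paper's root-based iterative beam search; it is a plain dynamic program. The function name `vglcs_length`, the per-position gap lists, and the exact convention used (at most `gap[i]` characters may be skipped between the matched character at position `i` and the previous matched character in the same sequence) are illustrative assumptions, not the authors' definitions.

```python
def vglcs_length(a, b, gap_a, gap_b):
    """Length of a longest common subsequence of a and b under gap constraints.

    Assumed convention (hypothetical, for illustration only): if a[i] is matched
    and the previous matched character of a sits at position pi, then
    i - pi - 1 <= gap_a[i]; the same rule applies to b with gap_b.
    """
    n, m = len(a), len(b)
    # best[i][j] = longest valid common subsequence ending with the match (i, j);
    # it stays 0 when a[i] != b[j].
    best = [[0] * m for _ in range(n)]
    answer = 0
    for i in range(n):
        for j in range(m):
            if a[i] != b[j]:
                continue
            current = 1  # the match (i, j) on its own
            # Only predecessors within the allowed gap windows may be extended.
            for pi in range(max(0, i - gap_a[i] - 1), i):
                for pj in range(max(0, j - gap_b[j] - 1), j):
                    current = max(current, best[pi][pj] + 1)
            best[i][j] = current
            answer = max(answer, current)
    return answer
```

Under this convention, very large gaps reduce the problem to the classical LCS, while zero gaps reduce it to the longest common substring. The quadratic window scan is only practical for small inputs, which is part of why heuristic search becomes attractive for many long sequences.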

cs.AI / 2 / 2604.18724

Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations

超越单一输出：可视化和比较语言模型生成的分布

Reif, Emily, Yang, Claire, Hwang, Jared, Nazar, Deniz, Smith, Noah, Heer, Jeff

Abstract

Users typically interact with and evaluate language models via single outputs, but each output is just one sample from a broad distribution of possible completions. This interaction hides distributional structure such as modes, uncommon edge cases, and sensitivity to small prompt changes, leading users to over-generalize from anecdotes when iterating on prompts for open-ended tasks. Informed by a formative study with researchers who use LMs (n=13) examining when stochasticity matters in practice, how they reason about distributions over language, and where current workflows break down, we introduce GROVE. GROVE is an interactive visualization that represents multiple LM generations as overlapping paths through a text graph, revealing shared structure, branching points, and clusters while preserving access to raw outputs. We evaluate across three crowdsourced user studies (N=47, 44, and 40 participants) targeting complementary distributional tasks. Our results support a hybrid workflow: graph summaries improve structural judgments such as assessing diversity, while direct output inspection remains stronger for detail-oriented questions.

Chinese Translation

用户通常通过单一输出与语言模型进行交互和评估，但每个输出仅是可能完成的广泛分布中的一个样本。这种交互隐藏了分布结构，例如模式、不常见的边缘案例以及对小提示变化的敏感性，导致用户在为开放式任务迭代提示时过度概括。基于对使用语言模型的研究人员（n=13）的初步研究，探讨了随机性在实践中的重要性、他们如何推理语言分布以及当前工作流程的缺陷，我们引入了GROVE。GROVE是一个交互式可视化工具，将多个语言模型生成表示为文本图中的重叠路径，揭示共享结构、分支点和聚类，同时保留对原始输出的访问。我们在三项众包用户研究中进行了评估（N=47、44和40名参与者），针对互补的分布任务。我们的结果支持一种混合工作流程：图形摘要改善了结构判断，例如评估多样性，而直接输出检查在细节导向的问题上仍然更具优势。

cs.AI / 3 / 2604.18789

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

ARES：自适应红队测试与政策-奖励系统的端到端修复

Liang, Jiacheng, Ma, Yao, Kumarage, Tharindu, Krishna, Satyapriya, Gupta, Rahul, Chang, Kai-Wei, Galstyan, Aram, Peris, Charith

Abstract

Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches primarily target policy-level weaknesses, they overlook what we term systemic weaknesses: cases where both the core LLM and the RM fail in tandem. We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a "Safety Mentor" that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the RM simultaneously. Using the vulnerabilities discovered, ARES implements a two-stage repair process: first fine-tuning the RM to better detect harmful content, then leveraging the improved RM to optimize the core model. Experiments across multiple adversarial safety benchmarks demonstrate that ARES substantially enhances safety robustness while preserving model capabilities, establishing a new paradigm for comprehensive RLHF safety alignment.

Chinese Translation

基于人类反馈的强化学习（RLHF）是对齐大型语言模型（LLMs）的核心，但它引入了一个关键的脆弱性：不完善的奖励模型（RM）可能成为单点故障，当它未能惩罚不安全行为时。现有的红队测试方法主要针对政策层面的弱点，但忽视了我们所称的系统性弱点，即核心LLM和RM同时失效的情况。我们提出了ARES，一个系统性发现并缓解这种双重脆弱性的框架。ARES采用“安全导师”，通过组合结构化组件类型（主题、角色、策略、目标）动态生成语义连贯的对抗性提示，并生成相应的恶意和安全响应。这种双重目标的方法同时暴露了核心LLM和RM的弱点。利用获得的脆弱性，ARES实施了一个两阶段的修复过程：首先微调RM，以更好地检测有害内容，然后利用改进后的RM来优化核心模型。在多个对抗性安全基准上的实验表明，ARES显著增强了安全鲁棒性，同时保持了模型能力，建立了全面的RLHF安全对齐的新范式。

cs.AI / 4 / 2604.18805

AI scientists produce results without reasoning scientifically

人工智能科学家在没有科学推理的情况下产生结果

Ríos-García, Martiño, Alampara, Nawaf, Gupta, Chandan, Mandal, Indrajeet, Mannan, Sajid, Aghajani, Ali Asghar, Krishnan, N. M. Anoop, Jablonka, Kevin Maik

Abstract

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning patterns appear whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. These patterns persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.

Chinese Translation

基于大型语言模型（LLM）的系统越来越多地被部署以自主进行科学研究，但它们的推理是否遵循使科学探究自我修正的认识论规范尚不清楚。在此，我们通过超过25,000次代理运行和两种互补视角，评估了LLM基础的科学代理在八个领域的表现，这些领域涵盖了工作流程执行到假设驱动的探究：(i) 一项系统的性能分析，分解基础模型和代理框架的贡献；(ii) 对代理推理的认识论结构进行的行为分析。我们观察到，基础模型是性能和行为的主要决定因素，解释的方差中有41.4%归因于基础模型，而框架仅占1.5%。在所有配置中，68%的轨迹忽视了证据，26%的情况下发生了基于反驳的信念修正，而收敛的多重测试证据则很少出现。无论代理执行计算工作流程还是进行假设驱动的探究，均出现相同的推理模式。即使在代理接收到几乎完整的成功推理轨迹作为上下文时，这些模式仍然存在，并且在对认识论要求高的领域中，随重复试验而累积的结果不可靠。因此，当前的基于LLM的代理能够执行科学工作流程，但并未表现出科学推理所特有的认识论模式。基于结果的评估无法检测到这些失败，仅靠框架工程无法修复它们。在推理本身成为训练目标之前，这些代理所产生的科学知识无法通过生成该知识的过程得到合理化。

cs.AI / 5 / 2604.18838

Quantum inspired qubit qutrit neural networks for real time financial forecasting

基于量子启发的量子比特和量子三元神经网络用于实时金融预测

Bakshi, Kanishk, Srinivasan, Kathiravan

Abstract

This research investigates the performance and efficacy of machine learning models in stock prediction, comparing Artificial Neural Networks (ANNs), Quantum Qubit-based Neural Networks (QQBNs), and Quantum Qutrit-based Neural Networks (QQTNs). By outlining methodologies, architectures, and training procedures, the study highlights significant differences in training times and performance metrics across models. While all models demonstrate robust accuracies above 70%, the Quantum Qutrit-based Neural Network consistently outperforms with advantages in risk-adjusted returns, measured by the Sharpe ratio, greater consistency in prediction quality through the Information Coefficient, and enhanced robustness under varying market conditions. The QQTN not only surpasses its classical and qubit-based counterparts in multiple quantitative and qualitative metrics but also achieves comparable performance with significantly reduced training times. These results showcase the promising prospects of Quantum Qutrit-based Neural Networks in practical financial applications, where real-time processing is critical. By achieving superior accuracy, efficiency, and adaptability, the proposed models underscore the transformative potential of quantum-inspired approaches, paving the way for their integration into computationally intensive fields.

Chinese Translation

本研究探讨了机器学习模型在股票预测中的性能和有效性，比较了人工神经网络（ANNs）、基于量子比特的神经网络（QQBNs）和基于量子三元的神经网络（QQTNs）。通过概述方法论、架构和训练程序，研究突出了不同模型在训练时间和性能指标上的显著差异。尽管所有模型的准确率均超过70%，但基于量子三元的神经网络在风险调整收益（通过夏普比率衡量）、预测质量的一致性（通过信息系数衡量）以及在不同市场条件下的鲁棒性方面表现出明显优势。QQTN不仅在多个定量和定性指标上超越了其经典和基于量子比特的同行，而且在训练时间显著减少的情况下实现了可比的性能。这些结果展示了基于量子三元的神经网络在实际金融应用中的良好前景，尤其是在实时处理至关重要的领域。通过实现更高的准确性、效率和适应性，所提出的模型强调了量子启发方法的变革潜力，为其在计算密集型领域的整合铺平了道路。
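
The two evaluation metrics named in the abstract above have standard textbook definitions, sketched below. The abstract does not say which variants the authors use, so the choices here are assumptions: the Sharpe ratio is computed per period without annualization and with the sample standard deviation, and the Information Coefficient is taken as the Pearson correlation between predicted and realized returns (a rank-based variant is also common). The function names are hypothetical.

```python
import math


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the sample standard deviation of excess returns."""
    excess = [r - risk_free for r in returns]
    n = len(excess)
    if n < 2:
        raise ValueError("need at least two returns")
    mean = sum(excess) / n
    variance = sum((x - mean) ** 2 for x in excess) / (n - 1)
    if variance == 0:
        raise ValueError("returns have zero variance")
    return mean / math.sqrt(variance)


def information_coefficient(predicted, realized):
    """Pearson correlation between predicted and realized returns."""
    if len(predicted) != len(realized) or len(predicted) < 2:
        raise ValueError("need two equally long series with at least two points")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(realized) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, realized))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in realized)
    if var_p == 0 or var_r == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_p * var_r)
```

A higher Sharpe ratio means more return per unit of volatility, and an Information Coefficient close to 1 means predictions track realized returns closely.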

cs.AI / 6 / 2604.18847

Human-Guided Harm Recovery for Computer Use Agents

人类引导的计算机使用代理的危害恢复

Li, Christy, CH-Wang, Sky, Peng, Andi, Bobu, Andreea

Abstract

As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,150 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent's ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods -- ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.

Chinese Translation

随着语言模型（LM）代理能够在真实计算机系统上执行操作，我们需要不仅能够在大规模上防止有害行为的方法，还需要在预防失败时有效地修复危害。我们将这一被忽视的后执行安全保障挑战形式化为危害恢复：即在与人类偏好一致的情况下，如何最佳地引导代理从有害状态恢复到安全状态。通过一项形成性用户研究，我们为偏好对齐的恢复奠定基础，识别出有价值的恢复维度，并生成自然语言评分标准。我们收集的1150个成对判断的数据集揭示了属性重要性的上下文依赖性变化，例如对务实、针对性策略的偏好高于全面的长期方法。我们在奖励模型中将这些学习到的见解付诸实践，在测试时对代理框架生成的多个候选恢复计划进行重新排序。为了系统地评估恢复能力，我们引入了BackBench，一个包含50个计算机使用任务的基准，测试代理从有害状态恢复的能力。人类评估表明，我们的奖励模型框架产生的恢复轨迹质量高于基础代理和基于评分标准的框架。综合来看，这些贡献为一种新型代理安全方法奠定了基础——这种方法不仅通过预防危害来应对危害，还通过对其后果的引导和意图进行应对。

cs.AI / 7 / 2604.18873

From Natural Language to Executable Narsese: A Neuro-Symbolic Benchmark and Pipeline for Reasoning with NARS

从自然语言到可执行的 Narsese：一个用于 NARS 推理的神经符号基准和流程

Gabriel, Mina, Wang, Pei

Abstract

Large language models (LLMs) are highly capable at language generation, but they remain unreliable when reasoning requires explicit symbolic structure, multi-step inference, and interpretable uncertainty. This paper presents a neuro-symbolic framework for translating natural-language reasoning problems into executable formal representations using first-order logic (FOL) and Narsese, the language of the Non-Axiomatic Reasoning System (NARS). To support this direction, we introduce NARS-Reasoning-v0.1, a benchmark of natural-language reasoning problems paired with FOL forms, executable Narsese programs, and three gold labels: True, False, and Uncertain. We develop a deterministic compilation pipeline from FOL to executable Narsese and validate retained examples through runtime execution in OpenNARS for Applications (ONA), ensuring that the symbolic targets are not only syntactically well formed but also behaviorally aligned with the intended answer. We further present Language-Structured Perception (LSP), a formulation in which an LLM is trained to produce reasoning-relevant symbolic structure rather than only a final verbal response. As an initial proof of concept, we also train and release a Phi-2 LoRA adapter on NARS-Reasoning-v0.1 for three-label reasoning classification, showing that the benchmark can support supervised adaptation in addition to executable evaluation. Overall, the paper positions executable symbolic generation and execution-based validation as a practical path toward more reliable neuro-symbolic reasoning systems.

Chinese Translation

大型语言模型（LLMs）在语言生成方面具有很强的能力，但在需要明确符号结构、多步骤推理和可解释的不确定性时，它们仍然不可靠。本文提出了一种神经符号框架，用于将自然语言推理问题翻译为可执行的形式化表示，使用一阶逻辑（FOL）和非公理推理系统（NARS）的语言 Narsese。为了支持这一方向，我们引入了 NARS-Reasoning-v0.1，这是一个与 FOL 形式、可执行的 Narsese 程序以及三个金标准标签（真、假和不确定）配对的自然语言推理问题基准。我们开发了一条从 FOL 到可执行 Narsese 的确定性编译流程，并通过在 OpenNARS for Applications（ONA）中的运行时执行验证保留的示例，确保符号目标不仅在语法上是良构的，而且在行为上与预期答案一致。我们进一步提出了语言结构感知（Language-Structured Perception, LSP），一种将 LLM 训练为生成与推理相关的符号结构，而不仅仅是最终的语言响应的形式。作为初步概念验证，我们还在 NARS-Reasoning-v0.1 上训练并发布了一个 Phi-2 LoRA 适配器，用于三标签推理分类，显示该基准不仅可以支持可执行评估，还可以支持监督适应。总体而言，本文将可执行符号生成和基于执行的验证定位为更可靠的神经符号推理系统的实际路径。

cs.AI / 8 / 2604.18874

How Adversarial Environments Mislead Agentic AI?

对抗环境如何误导自主智能体？

Zhan, Zhonghao, Zhou, Huichi, Li, Zhenhao, Jing, Peiyuan, Li, Krinos, Haddadi, Hamed

Abstract

Tool-integrated agents are deployed on the premise that external tools ground their outputs in reality. Yet this very reliance creates a critical attack surface. Current evaluations benchmark capability in benign settings, asking "can the agent use tools correctly" but never "what if the tools lie". We identify this Trust Gap: agents are evaluated for performance, not for skepticism. We formalize this vulnerability as Adversarial Environmental Injection (AEI), a threat model where adversaries compromise tool outputs to deceive agents. AEI constitutes environmental deception: constructing a "fake world" of poisoned search results and fabricated reference networks around unsuspecting agents. We operationalize this via POTEMKIN, a Model Context Protocol (MCP)-compatible harness for plug-and-play robustness testing. We identify two orthogonal attack surfaces: The Illusion (breadth attacks) poison retrieval to induce epistemic drift toward false beliefs, while The Maze (depth attacks) exploit structural traps to cause policy collapse into infinite loops. Across 11,000+ runs on five frontier agents, we find a stark robustness gap: resistance to one attack often increases vulnerability to the other, demonstrating that epistemic and navigational robustness are distinct capabilities.

Chinese Translation

工具集成智能体的部署基于外部工具能够将其输出与现实相结合的前提。然而，这种依赖性却创造了一个关键的攻击面。目前的评估在良性环境中进行基准测试，询问“智能体能否正确使用工具”，却从未考虑“如果工具撒谎会怎样”。我们识别出这一信任缺口：智能体的评估侧重于性能，而非怀疑能力。我们将这种脆弱性形式化为对抗环境注入（Adversarial Environmental Injection, AEI），这是一种威胁模型，其中对手妨碍工具输出以欺骗智能体。AEI构成了环境欺骗：围绕毫无戒心的智能体构建一个“虚假世界”，其中包含被污染的搜索结果和虚构的参考网络。我们通过POTEMKIN，一个与模型上下文协议（Model Context Protocol, MCP）兼容的即插即用鲁棒性测试工具，来实现这一点。我们识别出两个正交的攻击面：幻觉（The Illusion，广度攻击）通过污染检索引发对虚假信念的认知漂移，而迷宫（The Maze，深度攻击）利用结构陷阱导致策略崩溃，陷入无限循环。在对五个前沿智能体进行超过11,000次的测试中，我们发现了明显的鲁棒性差距：对一种攻击的抵抗往往会增加对另一种攻击的脆弱性，表明认知鲁棒性和导航鲁棒性是不同的能力。

cs.AI / 9 / 2604.18882

Formally Verified Patent Analysis via Dependent Type Theory: Machine-Checkable Certificates from a Hybrid AI + Lean 4 Pipeline

通过依赖类型理论进行形式验证的专利分析：来自混合人工智能 + Lean 4 流水线的机器可检查证书

Koomullil, George

Abstract

We present a formally verified framework for patent analysis as a hybrid AI + Lean 4 pipeline. The DAG-coverage core (Algorithm 1b) is fully machine-verified once bounded match scores are fixed. Freedom-to-operate, claim-construction sensitivity, cross-claim consistency, and doctrine-of-equivalents analyses are formalized at the specification level with kernel-checked candidate certificates. Existing patent-analysis approaches rely on manual expert analysis (slow, non-scalable) or ML/NLP methods (probabilistic, opaque, non-compositional). To our knowledge, this is the first framework that applies interactive theorem proving based on dependent type theory to intellectual property analysis. Claims are encoded as DAGs in Lean 4, match strengths as elements of a verified complete lattice, and confidence scores propagate through dependencies via proven-correct monotone functions. We formalize five IP use cases (patent-to-product mapping, freedom-to-operate, claim construction sensitivity, cross-claim consistency, doctrine of equivalents) via six algorithms. Structural lemmas, the coverage-core generator, and the closed-path identity coverage = W_cov are machine-verified in Lean 4. Higher-level theorems for the other use cases remain informal proof sketches, and their proof-generation functions are architecturally mitigated (untrusted generators whose outputs are kernel-checked and sorry-free axiom-audited). Guarantees are conditional on the ML layer: they certify mathematical correctness of computations downstream of ML scores, not the accuracy of the scores themselves. A case study on a synthetic memory-module claim demonstrates weighted coverage and construction-sensitivity analysis. Validation against adjudicated cases is future work.

Chinese Translation

我们提出了一个形式验证的专利分析框架，作为一个混合人工智能 + Lean 4 流水线。DAG覆盖核心（算法 1b）在固定匹配分数后完全机器验证。操作自由、权利要求构造敏感性、交叉权利要求一致性和等效原则分析在规范层面上以内核检查的候选证书形式化。现有的专利分析方法依赖于人工专家分析（速度慢、不可扩展）或机器学习/自然语言处理方法（概率性、不透明、非组合性）。据我们所知，这是第一个将基于依赖类型理论的交互式定理证明应用于知识产权分析的框架。权利要求被编码为 Lean 4 中的有向无环图（DAG），匹配强度作为经过验证的完备格的元素，置信分数通过经过证明的单调函数在依赖关系中传播。我们通过六个算法形式化了五个知识产权使用案例（专利到产品映射、操作自由、权利要求构造敏感性、交叉权利要求一致性、等效原则）。结构引理、覆盖核心生成器和闭合路径身份覆盖 = W_cov 在 Lean 4 中经过机器验证。其他使用案例的高级定理仍然是非正式的证明草图，其证明生成函数在架构上得到缓解（不受信任的生成器，其输出经过内核检查且无错误公理审计）。保证依赖于机器学习层：它们证明了机器学习分数下游计算的数学正确性，而不是分数本身的准确性。对合成内存模块权利要求的案例研究展示了加权覆盖和构造敏感性分析。针对裁决案例的验证是未来的工作。
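
The idea of propagating confidence through a claim DAG with a monotone function can be illustrated in a few lines. This is a Python sketch rather than the paper's Lean 4 formalization, and the abstract does not state which propagation function is used. The sketch assumes scores in [0, 1] with the usual order (a complete lattice under min and max) and takes `min` over an element's own match score and the effective scores of its dependencies. The names `propagate_scores`, `match_scores`, and `depends_on` are hypothetical.

```python
def propagate_scores(match_scores, depends_on):
    """Effective score per claim element in a dependency DAG.

    match_scores: dict mapping element -> score in [0, 1] (e.g. from an ML matcher).
    depends_on:   dict mapping element -> iterable of elements it depends on.
    The effective score is the minimum of the element's own score and the
    effective scores of everything it depends on (an assumed, monotone rule).
    """
    effective = {}
    visiting = set()

    def visit(node):
        if node in effective:
            return effective[node]
        if node in visiting:
            raise ValueError("dependency cycle at %r" % (node,))
        visiting.add(node)
        score = match_scores[node]
        for dep in depends_on.get(node, ()):
            score = min(score, visit(dep))
        visiting.discard(node)
        effective[node] = score
        return score

    for node in match_scores:
        visit(node)
    return effective
```

Because `min` is monotone, raising any input score can never lower an output score. That kind of property is what a proof assistant can certify downstream of the ML scores, independently of whether the scores themselves are accurate.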

cs.AI / 10 / 2604.18916

Error-free Training for MedMNIST Datasets

无误训练MedMNIST数据集

Deng, Bo

Abstract

In this paper, we introduce a new concept called Artificial Special Intelligence by which Machine Learning models for the classification problem can be trained error-free, thus acquiring the capability of not making repeated mistakes. The method is applied to 18 MedMNIST biomedical datasets. Except for three datasets, which suffer from the double-labeling problem, all are trained to perfection.

Chinese Translation

在本文中，我们引入了一种新的概念——人工特殊智能（Artificial Special Intelligence），通过该概念，机器学习模型可以在分类问题上进行无误训练，从而具备不重复犯错的能力。该方法应用于18个MedMNIST生物医学数据集。除了三个受到双重标记问题影响的数据集外，其余数据集均被训练至完美。

cs.AI / 11 / 2604.18934

AutomationBench

自动化基准测试（AutomationBench）

Shepard, Daniel, Salimans, Robin

Abstract

Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and messaging platform - requiring the agent to find the right endpoints, follow a policy document, and write correct data to each system. To address this gap, we introduce AutomationBench, a benchmark for evaluating AI agents on cross-application workflow orchestration via REST APIs. Drawing on real workflow patterns from Zapier's platform, tasks span Sales, Marketing, Operations, Support, Finance, and HR domains. Agents must discover relevant endpoints themselves, follow layered business rules, and navigate environments with irrelevant and sometimes misleading records. Grading is programmatic and end-state only: whether the correct data ended up in the right systems. Even the best frontier models currently score below 10%. AutomationBench provides a challenging, realistic measure of where current models stand relative to the agentic capabilities businesses actually need.

Chinese Translation

现有的用于软件自动化的人工智能基准测试很少结合跨应用协调、自主API发现和政策遵循。真实的商业工作流程需要这三者：单一任务可能跨越客户关系管理（CRM）、收件箱、日历和消息平台——这要求智能体找到正确的端点，遵循政策文件，并向每个系统写入正确的数据。为了解决这一空白，我们引入了自动化基准测试（AutomationBench），这是一个用于评估人工智能智能体在通过REST API进行跨应用工作流编排的基准测试。该基准测试基于Zapier平台的真实工作流程模式，任务涵盖销售、市场营销、运营、支持、财务和人力资源等领域。智能体必须自行发现相关端点，遵循分层的业务规则，并在环境中导航，处理无关且有时误导性的记录。评分是程序化的，仅基于最终结果：即正确的数据是否最终进入了正确的系统。即使是目前最先进的模型，其得分也低于10%。自动化基准测试提供了一个具有挑战性且现实的衡量标准，以评估当前模型相对于企业实际所需的智能能力的水平。

cs.AI / 12 / 2604.18943

Personalized Benchmarking: Evaluating LLMs by Individual Preferences

个性化基准测试：根据个人偏好评估大型语言模型

Garbacea, Cristina, Wang, Heran, Tan, Chenhao

Abstract

With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs. We compute personalized model rankings using ELO ratings and Bradley-Terry coefficients for 115 active Chatbot Arena users and analyze how user query characteristics (topics and writing style) relate to LLM ranking variations. We demonstrate that individual rankings of LLM models diverge dramatically from aggregate LLM rankings, with Bradley-Terry correlations averaging only $\rho = 0.04$ (57% of users show near-zero or negative correlation) and ELO ratings showing moderate correlation ($\rho = 0.43$). Through topic modeling and style analysis, we find users exhibit substantial heterogeneity in topical interests and communication styles, influencing their model preferences. We further show that a compact combination of topic and style features provides a useful feature space for predicting user-specific model rankings. Our results provide strong quantitative evidence that aggregate benchmarks fail to capture individual preferences for most users, and highlight the importance of developing personalized benchmarks that rank LLM models according to individual user preferences.

Chinese Translation

随着大型语言模型（LLMs）能力的提升及其在实际任务中的应用，评估LLM与人类偏好的对齐已成为一个重要挑战。目前的基准测试通过对所有用户的偏好进行平均来计算总体评分，忽视了在建立模型排名时个别用户的偏好。由于用户在不同情境下的偏好各异，我们呼吁建立个性化的LLM基准测试，根据个体需求对模型进行排名。我们使用ELO评分和Bradley-Terry系数计算115名活跃的Chatbot Arena用户的个性化模型排名，并分析用户查询特征（主题和写作风格）与LLM排名变化之间的关系。我们证明，LLM模型的个体排名与总体LLM排名存在显著差异，Bradley-Terry相关性平均仅为$\rho = 0.04$（57%的用户显示近零或负相关），而ELO评分显示中等相关性（$\rho = 0.43$）。通过主题建模和风格分析，我们发现用户在主题兴趣和沟通风格上表现出显著的异质性，影响其模型偏好。我们进一步表明，主题和风格特征的紧凑组合为预测用户特定模型排名提供了有用的特征空间。我们的结果提供了强有力的定量证据，表明总体基准未能捕捉大多数用户的个体偏好，并强调了开发个性化基准的重要性，以根据个别用户偏好对LLM模型进行排名。
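
The comparison described in the abstract above (per-user ratings versus aggregate ratings, summarized by a rank correlation) can be sketched as follows. This uses the standard Elo update and the no-ties Spearman formula. The K-factor, the initial rating, the `(winner, loser)` battle format, and the function names are assumptions, since the abstract does not give the authors' exact settings.

```python
def elo_ratings(battles, k=32.0, initial=1000.0):
    """Sequential Elo ratings from an ordered list of (winner, loser) pairs."""
    ratings = {}
    for winner, loser in battles:
        rating_w = ratings.setdefault(winner, initial)
        rating_l = ratings.setdefault(loser, initial)
        # Expected score of the winner under the logistic Elo model.
        expected_w = 1.0 / (1.0 + 10 ** ((rating_l - rating_w) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = rating_w + delta
        ratings[loser] = rating_l - delta
    return ratings


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation over the models both score dicts share.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which assumes
    that no two models have exactly the same score.
    """
    models = sorted(set(scores_a) & set(scores_b))
    n = len(models)
    if n < 2:
        raise ValueError("need at least two shared models")

    def ranks(scores):
        ordered = sorted(models, key=lambda m: scores[m], reverse=True)
        return {m: position for position, m in enumerate(ordered)}

    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    d_squared = sum((ranks_a[m] - ranks_b[m]) ** 2 for m in models)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

One would call `elo_ratings` once on all battles and once on a single user's battles, then pass both rating dicts to `spearman_rho`. A value near zero for a given user means the aggregate leaderboard says little about that user's own ordering of models.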

cs.AI / 13 / 2604.18946

Reasoning Structure Matters for Safety Alignment of Reasoning Models

推理结构对推理模型安全对齐的重要性

In, Yeonjun, Kim, Wonjoong, Park, Sangwu, Park, Chanyoung

Abstract

Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks but often generate harmful responses to malicious user queries. This paper investigates the underlying cause of these safety risks and shows that the issue lies in the reasoning structure itself. Based on this insight, we claim that effective safety alignment can be achieved by altering the reasoning structure. We propose AltTrain, a simple yet effective post-training method that explicitly alters the reasoning structure of LRMs. AltTrain is both practical and generalizable, requiring no complex reinforcement learning (RL) training or reward design, only supervised finetuning (SFT) with a lightweight set of 1K training examples. Experiments across LRM backbones and model sizes demonstrate strong safety alignment, along with robust generalization across reasoning, QA, summarization, and multilingual settings.

Chinese Translation

大型推理模型（LRMs）在复杂推理任务中表现出色，但常常对恶意用户查询生成有害响应。本文探讨了这些安全风险的根本原因，并表明问题出在推理结构本身。基于这一见解，我们认为通过改变推理结构可以有效实现安全对齐。我们提出了AltTrain，这是一种简单而有效的后训练方法，能够明确改变LRMs的推理结构。AltTrain既实用又具有可推广性，无需复杂的强化学习（RL）训练或奖励设计，仅需使用轻量级的1000个训练样本进行监督微调（SFT）。在不同的LRM骨干网络和模型规模上的实验表明，安全对齐效果显著，同时在推理、问答、摘要和多语言设置中展现出强大的泛化能力。

cs.AI / 14 / 2604.18964

DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning

DW-Bench：在数据仓库图拓扑推理上对大型语言模型的基准测试

Ahmed, Ahmed G. A. H, Sakar, C. Okan

Abstract

This paper introduces DW-Bench, a new benchmark that evaluates large language models (LLMs) on graph-topology reasoning over data warehouse schemas, explicitly integrating both foreign-key (FK) and data-lineage edges. The benchmark comprises 1,046 automatically generated, verifiably correct questions across five schemas. Experiments show that tool-augmented methods substantially outperform static approaches but plateau on hard compositional subtypes.

Chinese Translation

本文介绍了DW-Bench，这是一个新的基准测试，用于评估大型语言模型（LLMs）在数据仓库模式上的图拓扑推理能力，明确整合了外键（FK）和数据血缘边。该基准测试包含1,046个自动生成的、可验证正确的问题，涵盖五个模式。实验表明，工具增强的方法在性能上显著优于静态方法，但在困难的组合子类型上表现趋于平稳。

cs.AI / 15 / 2604.18982

SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution

SAVOIR：通过基于Shapley的奖励归因学习社会 savoir-faire

Feng, Xiachong, Jiang, Yi, Feng, Xiaocheng, Yin, Deyi, Qin, Libo, Ye, Yangfan, Huang, Lei, Ma, Weitao, Gu, Yuxuan, Qin, Chonghan, Qin, Bing, Kong, Lingpeng

Abstract

Social intelligence, the ability to navigate complex interpersonal interactions, presents a fundamental challenge for language agents. Training such agents via reinforcement learning requires solving the credit assignment problem: determining how individual utterances contribute to multi-turn dialogue outcomes. Existing approaches directly employ language models to distribute episode-level rewards, yielding attributions that are retrospective and lack theoretical grounding. We propose SAVOIR (ShApley Value fOr SocIal RL), a novel principled framework grounded in cooperative game theory. Our approach combines two complementary principles: expected utility shifts evaluation from retrospective attribution to prospective valuation, capturing an utterance's strategic potential for enabling favorable future trajectories; Shapley values ensure fair credit distribution with axiomatic guarantees of efficiency, symmetry, and marginality. Experiments on the SOTOPIA benchmark demonstrate that SAVOIR achieves new state-of-the-art performance across all evaluation settings, with our 7B model matching or exceeding proprietary models including GPT-4o and Claude-3.5-Sonnet. Notably, even large reasoning models consistently underperform, suggesting social intelligence requires qualitatively different capabilities than analytical reasoning.

Chinese Translation

社会智能，即在复杂的人际互动中导航的能力，对语言代理提出了根本性的挑战。通过强化学习训练此类代理需要解决信用分配问题：确定个别发言如何影响多轮对话的结果。现有方法直接利用语言模型分配情节级奖励，导致的归因是回顾性的，缺乏理论基础。我们提出了SAVOIR（ShApley Value fOr SocIal RL），这是一个基于合作博弈论的新颖原则框架。我们的方法结合了两个互补的原则：期望效用将评估从回顾性归因转向前瞻性评估，捕捉发言在促进有利未来轨迹中的战略潜力；Shapley值确保公平的信用分配，并具有效率、对称性和边际性的公理保证。在SOTOPIA基准上的实验表明，SAVOIR在所有评估设置中都达到了新的最先进性能，我们的7B模型与包括GPT-4o和Claude-3.5-Sonnet在内的专有模型相匹配或超越。值得注意的是，即使是大型推理模型也始终表现不佳，这表明社会智能需要与分析推理有质的不同的能力。
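
The Shapley-value idea behind the credit assignment described above can be shown on a toy cooperative game. The sketch below computes exact Shapley values by enumerating all player orderings; players stand in for utterances and `value` stands in for a coalition-level outcome score. It is a textbook computation, not SAVOIR's estimator: the abstract does not say how the framework evaluates or approximates coalition values, and the names used here are hypothetical.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values for a cooperative game.

    players: iterable of hashable player identifiers.
    value:   function mapping a frozenset of players to a number,
             with value(frozenset()) as the baseline.
    Cost grows factorially with the number of players, so this is only
    feasible for a handful of them.
    """
    players = list(players)
    totals = {p: 0.0 for p in players}
    count = 0
    for order in permutations(players):
        coalition = frozenset()
        previous = value(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value(coalition)
            # Marginal contribution of p when joining in this order.
            totals[p] += current - previous
            previous = current
        count += 1
    return {p: totals[p] / count for p in players}
```

By construction the values sum to `value(all players) - value(empty set)`, which is the efficiency axiom mentioned in the abstract, and interchangeable players receive equal credit (symmetry).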

cs.AI / 16 / 2604.19022

On Accelerating Grounded Code Development for Research

加速研究领域的基础代码开发

Ganji, Santosh

Abstract

A major challenge for niche scientific and technical domains in leveraging coding agents is the lack of access to up-to-date, domain-specific knowledge. Foundational models often demonstrate limited reasoning capabilities in specialized fields and cannot inherently incorporate knowledge that evolves through ongoing research and experimentation. Materials scientists exploring novel compounds, communication engineers designing and evaluating new protocols, and bioengineering researchers conducting iterative experiments all face this limitation. These experts typically lack the resources to fine-tune large models or continuously embed new findings, creating a barrier to adopting AI-driven coding agents. To address this, we introduce a framework that gives coding agents instantaneous access to research repositories and technical documentation, enabling real-time, context-aware operation. Our open-source implementation allows users to upload documents via doc-search.dev and includes zed-fork, which enforces domain-specific rules and workflows. Together, these tools accelerate the integration of coding agents into specialized scientific and technical workflows.

Chinese Translation

在利用编码代理的利基科学和技术领域中，一个主要挑战是缺乏对最新的、领域特定知识的访问。基础模型在专业领域中通常表现出有限的推理能力，无法固有地融入通过持续研究和实验而发展的知识。材料科学家探索新化合物、通信工程师设计和评估新协议，以及生物工程研究人员进行迭代实验，都面临这一限制。这些专家通常缺乏微调大型模型或持续嵌入新发现的资源，从而形成了采用基于人工智能的编码代理的障碍。为了解决这一问题，我们提出了一个框架，使编码代理能够即时访问研究库和技术文档，从而实现实时、上下文感知的操作。我们的开源实现允许用户通过 doc-search.dev 上传文档，并包括 zed-fork，它强制执行领域特定的规则和工作流程。这些工具共同加速了编码代理在专业科学和技术工作流程中的整合。

cs.AI / 17 / 2604.19036

Plausible Reasoning and First-Order Plausible Logic

可信推理与一阶可信逻辑

Billington, David

Abstract

Defeasible statements are statements that are likely, or probable, or usually true, but may occasionally be false. Plausible reasoning makes conclusions from statements that are either facts or defeasible statements without using numbers. So there are no probabilities or suchlike involved. Seventeen principles of logics that do plausible reasoning are suggested and several important plausible reasoning examples are considered. There are 14 necessary principles and 3 desirable principles, one of which is not formally stated. A first-order logic, called Plausible Logic (PL), is defined that satisfies all but two of the desirable principles and reasons correctly with all the examples. As far as we are aware, this is the only such logic. PL has 8 reasoning algorithms because, from a given plausible reasoning situation, there are different sensible conclusions. This article is a condensation of my book 'Plausible Reasoning and Plausible Logic' (PRPL), which is to be submitted. Each section of this article corresponds to a chapter in PRPL, and vice versa. The proofs of all the results are in PRPL, so they are omitted in this article.

Chinese Translation

可反驳陈述是指那些可能、或概率较高、或通常为真的陈述，但偶尔也可能是错误的。可信推理从事实或可反驳陈述中得出结论，而不使用数字。因此，不涉及概率等概念。本文提出了17条进行可信推理的逻辑原则，并考虑了若干重要的可信推理示例。其中包括14条必要原则和3条期望原则，其中一条未正式表述。定义了一种称为可信逻辑（Plausible Logic, PL）的一阶逻辑，该逻辑满足除两条期望原则之外的所有原则，并且能够正确推理所有示例。就我们所知，这是唯一的此类逻辑。PL有8种推理算法，因为在给定的可信推理情境中，可能得出不同的合理结论。本文是我即将提交的书籍《可信推理与可信逻辑》（Plausible Reasoning and Plausible Logic, PRPL）的精简版。本文的每个部分对应于PRPL中的一章，反之亦然。所有结果的证明都在PRPL中，因此在本文中省略。

cs.AI / 18 / 2604.19043

Learning Lifted Action Models from Unsupervised Visual Traces

从无监督视觉轨迹中学习提升的动作模型

Xi, Kai, Gould, Stephen, Thiébaux, Sylvie

Abstract

Efficient construction of models capturing the preconditions and effects of actions is essential for applying AI planning in real-world domains. Extensive prior work has explored learning such models from high-level descriptions of state and/or action sequences. In this paper, we tackle a more challenging setting: learning lifted action models from sequences of state images, without action observation. We propose a deep learning framework that jointly learns state prediction, action prediction, and a lifted action model. We also introduce a mixed-integer linear program (MILP) to prevent prediction collapse and self-reinforcing errors among predictions. The MILP takes the predicted states, actions, and action model over a subset of traces and solves for logically consistent states, actions, and action model that are as close as possible to the original predictions. Pseudo-labels extracted from the MILP solution are then used to guide further training. Experiments across multiple domains show that integrating MILP-based correction helps the model escape local optima and converge toward globally consistent solutions.

Chinese Translation

有效构建捕捉动作前提条件和效果的模型对于在现实世界领域应用人工智能规划至关重要。大量先前的研究探索了如何从状态和/或动作序列的高层描述中学习这些模型。本文我们处理一个更具挑战性的设置：从状态图像序列中学习提升的动作模型，而不依赖于动作观察。我们提出了一个深度学习框架，该框架联合学习状态预测、动作预测和提升的动作模型。我们还引入了一个混合整数线性规划（MILP），以防止预测崩溃和预测之间的自我强化错误。MILP利用预测的状态、动作和在部分轨迹上的动作模型，求解出逻辑上一致的状态、动作和动作模型，使其尽可能接近原始预测。从MILP解中提取的伪标签随后被用来指导进一步的训练。在多个领域的实验表明，整合基于MILP的修正有助于模型逃离局部最优并收敛到全局一致的解决方案。

cs.AI / 19 / 2604.19060

Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports

强化学习提高了从放射学报告中进行疾病分类的LLM准确性和推理能力

Wei, Yishu, Lin, Yi, Flanders, Adam, Shih, George, Peng, Yifan

Abstract

Accurate disease classification from radiology reports is essential for many applications. While supervised fine-tuning (SFT) of lightweight LLMs improves accuracy, it can degrade reasoning. We propose a two-stage approach: SFT on disease labels followed by Group Relative Policy Optimization (GRPO) to refine predictions by optimizing accuracy and format without reasoning supervision. Across three radiologist-annotated datasets, SFT outperformed baselines and GRPO further improved classification and enhanced reasoning recall and comprehensiveness.

Chinese Translation

从放射学报告中进行准确的疾病分类对许多应用至关重要。虽然轻量级LLM的监督微调（SFT）可以提高准确性，但可能会降低推理能力。我们提出了一种两阶段的方法：首先对疾病标签进行SFT，然后通过群体相对策略优化（GRPO）来优化准确性和格式，从而精炼预测，而不依赖推理监督。在三个放射科医生标注的数据集上，SFT的表现优于基线，而GRPO进一步提高了分类效果，并增强了推理的召回率和全面性。
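The group-relative step in GRPO can be sketched as follows. This is a minimal illustration, assuming the commonly described form in which each sampled completion's reward is normalized by the mean and standard deviation of its own group; the `reward` function, its 0.1 format bonus, and the example labels are hypothetical and are not taken from the paper, which states only that accuracy and format are optimized.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in the same group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def reward(predicted_labels, gold_labels, well_formatted):
    """Hypothetical reward: exact-match label accuracy plus a small format bonus."""
    accuracy = 1.0 if set(predicted_labels) == set(gold_labels) else 0.0
    return accuracy + (0.1 if well_formatted else 0.0)


# Four sampled completions for one report (illustrative values only).
group = [
    reward(["pneumonia"], ["pneumonia"], True),
    reward(["pneumonia"], ["pneumonia"], False),
    reward(["edema"], ["pneumonia"], True),
    reward([], ["pneumonia"], False),
]
advantages = group_relative_advantages(group)
```

Because the advantages are centered within the group, completions are rewarded only for being better than their siblings, which is why no separate value model or reasoning supervision appears in this sketch.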

cs.AI / 20 / 2604.19087

OLLM: Options-based Large Language Models

OLLM：基于选项的大型语言模型

Sharma, Shashank, Hoffmann, Janina, Namboodiri, Vinay

Abstract

We introduce Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a set of learned options for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options which can be selected or searched by a downstream policy. Architecturally, OLLM is a lightweight "plug-in" that inserts two layers: an encoder and a decoder, before the output head, allowing almost any pretrained LLM to be converted with minimal additional parameters. We apply OLLM to a 1.7B-parameter backbone (only 1.56% of parameters trainable) trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at 51% final answer correctness, while OLLM's option set allows up to ~70% under optimal latent selection. We then train a compact policy in the latent space that emits latents to control generation. Operating in a low-dimensional option space makes reward optimization far more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), as the policy is constrained to options learned during SFT. Crucially, this alignment arises from model structure rather than additional KL or handcrafted alignment losses. Our results demonstrate that optionized next-token modeling enhances controllability, robustness, and efficiency in math reasoning, and highlight latent-space policy learning as a promising direction for reinforcement learning in LLMs.

Chinese Translation

我们介绍了选项LLM（OLLM），这是一种简单且通用的方法，它用一组为下一个标记学习的选项替代了标准大型语言模型（LLMs）中的单一下一个标记预测，这些选项由离散潜变量索引。OLLM明确建模了变化，而不是依赖温度或采样启发式来引入多样性：一个小的潜在空间参数化了多个合理的下一个标记选项，这些选项可以被下游策略选择或搜索。在架构上，OLLM是一个轻量级的“插件”，在输出头之前插入了两个层：一个编码器和一个解码器，使几乎任何预训练的LLM都可以以最少的额外参数进行转换。我们将OLLM应用于一个具有17亿参数的主干（仅1.56%的参数可训练），该主干在OpenMathReasoning上训练，并在OmniMath上进行评估。SOTA LoRA适应的基线在最终答案正确率上达到51%的峰值，而OLLM的选项集在最佳潜在选择下允许高达约70%的准确率。然后，我们在潜在空间中训练一个紧凑的策略，以发出潜在变量以控制生成。在低维选项空间中操作使得奖励优化变得更加样本高效，并显著减少了常见的错位（例如，语言切换或退化推理），因为策略被限制在SFT期间学习的选项上。关键是，这种对齐来自模型结构，而不是额外的KL或手工设计的对齐损失。我们的结果表明，选项化的下一个标记建模增强了数学推理中的可控性、鲁棒性和效率，并突出了潜在空间策略学习作为LLM中强化学习的一个有前景的方向。

cs.AI / 21 / 2604.19089

Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression

面向可扩展的终身知识编辑与选择性知识抑制

Jung, Dahyun, Lee, Jaewook, Lim, Heuiseok

Abstract

Large language models (LLMs) require frequent knowledge updates to reflect changing facts and mitigate hallucinations. To meet this demand, lifelong knowledge editing has emerged as a continual approach to modify specific pieces of knowledge without retraining the entire model. Existing parameter editing methods struggle with stability during sequential edits due to catastrophic forgetting. While retrieval-based approaches are proposed to alleviate this issue, their applicability remains limited across various datasets because of high training costs. To address these limitations and enhance scalability in lifelong settings, we propose LightEdit. Our framework first selects relevant knowledge from retrieved information to modify the query effectively. It then incorporates a decoding strategy to suppress the model's original knowledge probabilities, thereby enabling efficient edits based on the selected information. Extensive experiments on ZSRE, Counterfact, and RIPE benchmarks demonstrate that LightEdit outperforms existing lifelong knowledge editing methods. Furthermore, by minimizing training costs, LightEdit achieves cost-effective scalability, enabling easy adaptation to various datasets.

Chinese Translation

大型语言模型（LLMs）需要频繁更新知识，以反映变化的事实并减轻幻觉现象。为满足这一需求，终身知识编辑作为一种持续的方法应运而生，旨在在不重新训练整个模型的情况下修改特定知识片段。现有的参数编辑方法在进行连续编辑时由于灾难性遗忘而面临稳定性问题。虽然提出了基于检索的方法以缓解这一问题，但由于高昂的训练成本，其适用性在不同数据集上仍然有限。为了解决这些局限性并增强终身设置下的可扩展性，我们提出了LightEdit。我们的框架首先从检索的信息中选择相关知识，以有效修改查询。然后，它结合解码策略来抑制模型原有知识的概率，从而基于所选信息实现高效编辑。在ZSRE、Counterfact和RIPE基准上的大量实验表明，LightEdit优于现有的终身知识编辑方法。此外，通过最小化训练成本，LightEdit实现了具有成本效益的可扩展性，能够轻松适应各种数据集。

cs.AI / 22 / 2604.19131

Has Automated Essay Scoring Reached Sufficient Accuracy? Deriving Achievable QWK Ceilings from Classical Test Theory

自动化作文评分是否已达到足够的准确性？基于经典测验理论推导可实现的 QWK 上限

Uto, Masaki

Abstract

Automated essay scoring (AES) is commonly evaluated on public benchmarks using quadratic weighted kappa (QWK). However, because benchmark labels are assigned by human raters and inevitably contain scoring errors, it remains unclear both what QWK is theoretically attainable and what level is practically sufficient for deployment. We therefore derive two dataset-specific QWK ceilings based on the reliability concept in classical test theory, which can be estimated from standard two-rater benchmarks without additional annotation. The first is the theoretical ceiling: the maximum QWK that an ideal AES model that perfectly predicts latent true scores can achieve under label noise. The second is the human-like ceiling: the QWK attainable by an AES model with human-level scoring error, providing a practical target when AES is intended to replace a single human rater. We further show that human--human QWK, often used as a ceiling reference, can underestimate the true ceiling. Simulation experiments validate the proposed ceilings, and experiments on real benchmarks illustrate how they clarify the current performance and remaining headroom of modern AES models.

Chinese Translation

自动化作文评分（AES）通常在公共基准上使用二次加权卡帕（QWK）进行评估。然而，由于基准标签是由人工评分者分配的，并不可避免地包含评分错误，因此尚不清楚理论上可达到的 QWK 是什么，以及在实际应用中什么水平是足够的。因此，我们基于经典测验理论中的信度概念推导出两个特定数据集的 QWK 上限，这些上限可以从标准的双评分者基准中估算，而无需额外的标注。第一个是理论上限：在标签噪声下，理想的 AES 模型在完美预测潜在真实分数时可以达到的最大 QWK。第二个是类人上限：一个具有类人评分误差的 AES 模型可以达到的 QWK，为当 AES 旨在替代单个人工评分者时提供了一个实际目标。我们进一步表明，人际 QWK，通常用作上限参考，可能低估真实上限。模拟实验验证了所提出的上限，而在真实基准上的实验则说明了这些上限如何澄清现代 AES 模型的当前表现和剩余提升空间。
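For readers unfamiliar with the metric, quadratic weighted kappa can be computed from two raters' integer scores as in the sketch below. It follows the standard definition (one minus the ratio of weighted observed disagreement to weighted chance disagreement); the function name and argument names are illustrative, and the ceilings derived in the paper are not reproduced here.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """QWK between two equal-length lists of integer scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("the score scale needs at least two categories")
    total = len(rater_a)

    # Observed joint distribution of (score_a, score_b) as counts.
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total  # counts under independence
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    # Undefined (division by zero) when both raters give one identical score.
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.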

cs.AI / 23 / 2604.19172

Reasoning-Aware AIGC Detection via Alignment and Reinforcement

基于对齐与强化的推理感知AIGC检测

Wang, Zhao, Xiong, Max, Lian, Jianxun, Dou, Zhicheng

Abstract

The rapid advancement and widespread adoption of Large Language Models (LLMs) have elevated the need for reliable AI-generated content (AIGC) detection, which remains challenging as models evolve. We introduce AIGC-text-bank, a comprehensive multi-domain dataset with diverse LLM sources and authorship scenarios, and propose REVEAL, a detection framework that generates interpretable reasoning chains before classification. Our approach uses a two-stage training strategy: supervised fine-tuning to establish reasoning capabilities, followed by reinforcement learning to improve accuracy, improve logical consistency, and reduce hallucinations. Extensive experiments show that REVEAL achieves state-of-the-art performance across multiple benchmarks, offering a robust and transparent solution for AIGC detection. The project is open-source at https://aka.ms/reveal

Chinese Translation

大型语言模型（LLMs）的快速发展和广泛应用提升了对可靠的人工智能生成内容（AIGC）检测的需求，但随着模型的演变，这一任务仍然充满挑战。我们引入了AIGC-text-bank，这是一个涵盖多领域、多样化LLM来源和作者场景的综合数据集，并提出了REVEAL，一个在分类之前生成可解释推理链的检测框架。我们的方法采用了两阶段训练策略：首先进行监督微调以建立推理能力，然后通过强化学习提高准确性、改善逻辑一致性并减少幻觉。大量实验表明，REVEAL在多个基准测试中达到了最先进的性能，为AIGC检测提供了一种稳健且透明的解决方案。该项目的开源地址为 https://aka.ms/reveal

cs.AI / 24 / 2604.19211

ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation

ClawNet：跨用户自主合作的人类共生代理网络

Yang, Zhiqin, Zhang, Zhenyuan, Jia, Xianzhang, Song, Jun, Xue, Wei, Zhang, Yonggang, Guo, Yike

Abstract

Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user. Human productivity rests on the social and organizational relationships through which people coordinate, negotiate, and delegate. When agents move beyond performing tasks for one person to representing that person in collaboration with others, the infrastructure for cross-user agent collaboration is entirely absent, let alone the governance mechanisms needed to secure it. We argue that the next frontier for AI agents lies not in stronger individual capability, but in the digitization of human collaborative relationships. To this end, we propose a human-symbiotic agent paradigm. Each user owns a permanently bound agent system that collaborates on the owner's behalf, forming a network whose nodes are humans rather than agents. This paradigm rests on three governance primitives. A layered identity architecture separates a Manager Agent from multiple context-specific Identity Agents; the Manager Agent holds global knowledge but is architecturally isolated from external communication. Scoped authorization enforces per-identity access control and escalates boundary violations to the owner. Action-level accountability logs every operation against its owner's identity and authorization, ensuring full auditability. We instantiate this paradigm in ClawNet, an identity-governed agent collaboration framework that enforces identity binding and authorization verification through a central orchestrator, enabling multiple users to collaborate securely through their respective agents.

Chinese Translation

当前的人工智能代理框架在自动化单一任务方面取得了显著进展，但所有现有系统均服务于单一用户。人类的生产力依赖于人们协调、谈判和委托的社会和组织关系。当代理从为一个人执行任务转变为代表该人与他人合作时，跨用户代理合作的基础设施完全缺失，更不用说确保其安全所需的治理机制。我们认为，人工智能代理的下一个前沿不在于增强个体能力，而在于人类协作关系的数字化。为此，我们提出了一种人类共生代理范式。每个用户拥有一个永久绑定的代理系统，代表所有者进行协作，形成一个以人类而非代理为节点的网络。该范式基于三个治理原语。分层身份架构将管理代理（Manager Agent）与多个上下文特定的身份代理（Identity Agents）分离；管理代理持有全局知识，但在架构上与外部通信隔离。范围授权（Scoped Authorization）强制执行每个身份的访问控制，并将边界违规行为上报给所有者。行动级问责（Action-level Accountability）记录每个操作与其所有者的身份和授权的关系，确保完全可审计性。我们在ClawNet中实例化了这一范式，这是一个身份治理的代理协作框架，通过中央协调者强制执行身份绑定和授权验证，使多个用户能够通过各自的代理安全地进行协作。

cs.AI / 25 / 2604.19221

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

UAF：一种用于全双工语音交互的统一音频前端大语言模型

Li, Yadong, Wu, Guoxin, Hou, Haiping, Li, Biye

Abstract

Full-duplex speech interaction, as the most natural and intuitive mode of human communication, is driving artificial intelligence toward more human-like conversational systems. Traditional cascaded speech processing pipelines suffer from critical limitations, including accumulated latency, information loss, and error propagation across modules. To address these issues, recent efforts focus on end-to-end audio large language models (LLMs) like GPT-4o, which primarily unify speech understanding and generation tasks. However, most of these models are inherently half-duplex and rely on a suite of separate, task-specific front-end components, such as voice activity detection (VAD) and turn-taking detection (TD). In developing our speech assistant, we observed that optimizing the speech front-end is as crucial as advancing the back-end unified model for achieving seamless, responsive interactions. To bridge this gap, we propose the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems. Our model reformulates diverse audio front-end tasks, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR), and question answering (QA), into a single auto-regressive sequence prediction problem. It takes streaming fixed-duration audio chunks (e.g., 600 ms) as input, leverages a reference audio prompt at the beginning to anchor the target speaker, and autoregressively generates discrete tokens encoding both semantic content and system-level state controls (e.g., interruption signals). Experiments demonstrate that our model achieves leading performance across multiple audio front-end tasks and significantly improves response latency and interruption accuracy in real-world interaction scenarios.

Chinese Translation

全双工语音交互作为人类交流中最自然、最直观的方式，正在推动人工智能向更人性化的对话系统发展。传统的级联语音处理流程存在诸多关键限制，包括累积延迟、信息丢失以及模块间的错误传播。为了解决这些问题，近期的研究集中于端到端音频大语言模型（LLMs），如GPT-4o，这些模型主要统一了语音理解和生成任务。然而，这些模型大多数本质上是半双工的，并依赖于一套独立的、特定任务的前端组件，如语音活动检测（VAD）和轮换检测（TD）。在我们开发语音助手的过程中，我们观察到，优化语音前端与推进后端统一模型同样关键，以实现无缝、响应迅速的交互。为填补这一空白，我们提出了首个针对全双工语音系统的统一音频前端大语言模型（UAF）。我们的模型将多种音频前端任务重新表述为一个单一的自回归序列预测问题，包括VAD、TD、说话人识别（SR）、自动语音识别（ASR）和问答（QA）。它以流式固定时长音频片段（例如600毫秒）作为输入，利用参考音频提示在开始时固定目标说话人，并以自回归方式生成编码语义内容和系统级状态控制（例如中断信号）的离散标记。实验表明，我们的模型在多个音频前端任务中实现了领先的性能，并显著改善了现实交互场景中的响应延迟和中断准确性。

cs.AI / 26 / 2604.19240

Industrial Surface Defect Detection via Diffusion Generation and Asymmetric Student-Teacher Network

通过扩散生成和非对称师生网络进行工业表面缺陷检测

Feng, Shuo, Zhou, Runlin, Li, Yuyang, Liu, Guangcan

Abstract

Industrial surface defect detection often suffers from limited defect samples, severe long-tailed distributions, and difficulties in accurately localizing subtle defects under complex backgrounds. To address these challenges, this paper proposes an unsupervised defect detection method that integrates a Denoising Diffusion Probabilistic Model (DDPM) with an asymmetric teacher-student architecture. First, at the data level, the DDPM is trained solely on normal samples. By introducing constant-variance Gaussian perturbations and Perlin noise-based masks, high-fidelity and physically consistent defect samples along with pixel-level annotations are generated, effectively alleviating the data scarcity problem. Second, at the model level, an asymmetric dual-stream network is constructed. The teacher network provides stable representations of normal features, while the student network reconstructs normal patterns and amplifies discrepancies between normal and anomalous regions. Finally, a joint optimization strategy combining cosine similarity loss and pixel-wise segmentation supervision is adopted to achieve precise localization of subtle defects. Experimental results on the MVTecAD dataset show that the proposed method achieves 98.4% image-level AUROC and 98.3% pixel-level AUROC, significantly outperforming existing unsupervised and mainstream deep learning methods. The proposed approach does not require large amounts of real defect samples and enables accurate and robust industrial defect detection and localization.

Keywords: industrial defect detection, diffusion models, data generation, teacher-student architecture, pixel-level localization

Chinese Translation

工业表面缺陷检测常常面临缺陷样本有限、严重的长尾分布以及在复杂背景下准确定位微小缺陷的困难。为了解决这些挑战，本文提出了一种无监督缺陷检测方法，该方法将去噪扩散概率模型（Denoising Diffusion Probabilistic Model, DDPM）与非对称师生架构相结合。首先，在数据层面，DDPM仅在正常样本上进行训练。通过引入恒方差高斯扰动和基于Perlin噪声的掩膜，生成高保真且物理一致的缺陷样本以及像素级标注，有效缓解了数据稀缺问题。其次，在模型层面，构建了一个非对称双流网络。教师网络提供正常特征的稳定表示，而学生网络重建正常模式并放大正常区域与异常区域之间的差异。最后，采用结合余弦相似度损失和像素级分割监督的联合优化策略，以实现微小缺陷的精确定位。在MVTecAD数据集上的实验结果表明，所提方法在图像级AUROC上达到98.4%，在像素级AUROC上达到98.3%，显著优于现有的无监督和主流深度学习方法。所提方法不需要大量真实缺陷样本，能够实现准确且稳健的工业缺陷检测与定位。
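The teacher-student discrepancy described above can be illustrated with a per-location cosine distance. This is a sketch only: using one minus cosine similarity as the anomaly score is a common choice in student-teacher anomaly detection and is assumed here rather than confirmed by the abstract, and the grid-of-vectors feature layout and function names are hypothetical.

```python
import math


def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def anomaly_map(teacher_features, student_features):
    """Score each spatial location by teacher-student feature disagreement.

    Both inputs are H x W grids (nested lists) of C-dimensional vectors.
    """
    return [
        [cosine_distance(t, s) for t, s in zip(teacher_row, student_row)]
        for teacher_row, student_row in zip(teacher_features, student_features)
    ]


# Toy 1 x 2 feature grid: the student matches the teacher at the first
# location and disagrees completely at the second.
teacher = [[[1.0, 0.0], [0.0, 1.0]]]
student = [[[2.0, 0.0], [1.0, 0.0]]]
scores = anomaly_map(teacher, student)
```

Locations where the student reproduces the teacher's normal features score near 0, while locations where it cannot score near 1, which is what makes the map usable for pixel-level localization after upsampling.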

cs.AI / 27 / 2604.19278

Explicit Trait Inference for Multi-Agent Coordination

多智能体协调的显性特征推断

Abdurahman, Suhaib, Ishii, Etsuko, Margatina, Katerina, Bhargavi, Divya, Sunkara, Monica, Zhang, Yi

Abstract

LLM-based multi-agent systems (MAS) show promise on complex tasks but remain prone to coordination failures such as goal drift, error cascades, and misaligned behaviors. We propose Explicit Trait Inference (ETI), a psychologically grounded method for improving coordination. ETI enables agents to infer and track partner characteristics along two established psychological dimensions--warmth (e.g., trust) and competence (e.g., skill)--from interaction histories to guide decisions. We evaluate ETI in controlled settings (economic games), where it reduces payoff loss by 45-77%, and in more realistic, complex multi-agent settings (MultiAgentBench), where it improves performance by 3-29% depending on the scenario and model, relative to a CoT baseline. Additional analysis shows that gains are closely linked to trait inference: ETI profiles predict agents' actions, and informative profiles drive improvements. These results highlight ETI as a lightweight and robust mechanism for improving coordination in diverse multi-agent settings, and provide the first systematic evidence that LLM agents can (i) reliably infer others' traits from interaction histories and (ii) leverage structured awareness of others' traits for coordination.

Chinese Translation

基于大型语言模型（LLM）的多智能体系统（MAS）在复杂任务中展现出潜力，但仍然容易出现协调失败，如目标漂移、错误级联和行为不一致。我们提出了显性特征推断（Explicit Trait Inference, ETI），这是一种基于心理学的改进协调的方法。ETI使智能体能够从交互历史中推断和跟踪伙伴特征，沿着两个已建立的心理学维度——温暖（例如，信任）和能力（例如，技能）——来指导决策。我们在受控环境（经济游戏）中评估了ETI，结果显示其减少了45-77%的收益损失；在更现实的复杂多智能体环境（MultiAgentBench）中，ETI的表现相较于CoT基线提高了3-29%，具体取决于场景和模型。进一步分析表明，收益与特征推断密切相关：ETI特征能够预测智能体的行为，而信息丰富的特征驱动了改进。这些结果突显了ETI作为一种轻量且稳健的机制，可以在多样化的多智能体环境中改善协调，并提供了首个系统性证据，表明LLM智能体能够（i）可靠地从交互历史中推断他人的特征，以及（ii）利用对他人特征的结构化认知进行协调。

cs.AI / 28 / 2604.19301

Large Language Models Exhibit Normative Conformity

大型语言模型表现出规范性从众行为

Bito, Mikako, Nishimoto, Keita, Asatani, Kimitaka, Sakata, Ichiro

Abstract

The conformity bias exhibited by large language models (LLMs) can pose a significant challenge to decision-making in LLM-based multi-agent systems (LLM-MAS). While many prior studies have treated "conformity" simply as a matter of opinion change, this study introduces the social psychological distinction between informational conformity and normative conformity in order to understand LLM conformity at the mechanism level. Specifically, we design new tasks to distinguish between informational conformity, in which participants in a discussion are motivated to make accurate judgments, and normative conformity, in which participants are motivated to avoid conflict or gain acceptance within a group. We then conduct experiments based on these task settings. The experimental results show that, among the six LLMs evaluated, up to five exhibited tendencies toward not only informational conformity but also normative conformity. Furthermore, intriguingly, we demonstrate that by manipulating subtle aspects of the social context, it may be possible to control the target toward which a particular LLM directs its normative conformity. These findings suggest that decision-making in LLM-MAS may be vulnerable to manipulation by a small number of malicious users. In addition, through analysis of internal vectors associated with informational and normative conformity, we suggest that although both behaviors appear externally as the same form of "conformity," they may in fact be driven by distinct internal mechanisms. Taken together, these results may serve as an initial milestone toward understanding how "norms" are implemented in LLMs and how they influence group dynamics.

Chinese Translation

大型语言模型（LLMs）表现出的从众偏差可能对基于LLM的多智能体系统（LLM-MAS）中的决策造成重大挑战。尽管许多先前的研究将“从众”简单视为意见变化的问题，但本研究引入了信息从众和规范从众之间的社会心理学区分，以便在机制层面理解LLM的从众行为。具体而言，我们设计了新的任务以区分信息从众（在讨论中，参与者被激励做出准确判断）和规范从众（参与者被激励避免冲突或获得群体认可）。随后，我们基于这些任务设置进行了实验。实验结果显示，在评估的六个LLM中，多达五个表现出不仅是信息从众的倾向，还有规范从众的倾向。此外，令人感兴趣的是，我们证明通过操控社会情境的微妙方面，可能能够控制特定LLM所指向的规范从众目标。这些发现表明，LLM-MAS中的决策可能容易受到少数恶意用户的操控。此外，通过分析与信息从众和规范从众相关的内部向量，我们建议尽管这两种行为在外部看似相同的“从众”形式，但实际上可能由不同的内部机制驱动。综合来看，这些结果可能为理解“规范”如何在LLM中实施以及它们如何影响群体动态提供了初步的里程碑。

cs.AI / 29 / 2604.19354

Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges

代理会梦见 root shell 吗？夺旗（Capture The Flag）挑战中LLM代理的部分得分评估

Al-Kaswan, Ali, Plotnikov, Maksim, Hájek, Maxim, Vízner, Roland, van Deursen, Arie, Izadi, Maliheh

Abstract

Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.

Chinese Translation

大型语言模型（LLM）代理越来越多地被提议用于自主网络安全任务，但人们对它们在真实攻击场景中的能力仍缺乏充分了解。我们提出了DeepRed，这是一个开源基准，用于在隔离的虚拟化环境中评估基于LLM的代理在真实夺旗（CTF）挑战上的表现。DeepRed将代理置于Kali攻击者环境中，配备终端工具和可选的网络搜索，通过私有网络连接到目标挑战，并记录完整的执行轨迹以供分析。为了超越“已解决/未解决”的二元结果，我们引入了一种部分得分评分方法，该方法基于从公开题解（writeup）中提取的挑战特定检查点，并配有一个“先总结后评判”的自动标注流水线，用于根据日志判定检查点的完成情况。使用DeepRed，我们对十个商业可用的LLM在十个基于虚拟机的CTF挑战中进行了基准测试，这些挑战涵盖了不同的挑战类别。结果表明，当前的代理仍然有限：最佳模型仅实现了35%的平均检查点完成率，在常见挑战类型上表现最佳，而在需要非标准发现和更长时程适应的任务上表现最差。
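The partial-credit idea can be sketched as a checkpoint completion rate. This is an illustration under stated assumptions: the abstract does not say how checkpoints are weighted, so equal weighting is assumed, and the checkpoint names and challenge names below are invented for the example rather than taken from DeepRed.

```python
def checkpoint_completion(completed, checkpoints):
    """Fraction of a challenge's checkpoints that the agent reached."""
    if not checkpoints:
        raise ValueError("a challenge needs at least one checkpoint")
    reached = sum(1 for checkpoint in checkpoints if checkpoint in completed)
    return reached / len(checkpoints)


def average_completion(runs):
    """Mean completion over (completed, checkpoints) pairs, one per challenge."""
    rates = [checkpoint_completion(done, expected) for done, expected in runs]
    return sum(rates) / len(rates)


# Hypothetical checkpoints for two challenges.
web_challenge = ["port_scan", "find_login", "sql_injection", "read_flag"]
pwn_challenge = ["port_scan", "find_binary", "leak_address", "get_shell", "read_flag"]

runs = [
    ({"port_scan", "find_login"}, web_challenge),  # 2 of 4 reached
    ({"port_scan"}, pwn_challenge),                # 1 of 5 reached
]
score = average_completion(runs)
```

Under binary scoring both runs above would count as failures; the completion rate distinguishes an agent that stalled after reconnaissance from one that got halfway through the exploit chain.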

cs.AI / 30 / 2604.19377

Towards Energy Impact on AI-Powered 6G IoT Networks: Centralized vs. Decentralized

面向人工智能驱动的6G物联网网络的能源影响：集中式与分散式

Qiu, Anjie, Wang, Donglin, Partani, Sanket, Weinand, Andreas, Schotten, Hans D.

Abstract

The emergence of sixth-generation (6G) technologies has introduced new challenges and opportunities for machine learning (ML) applications in Internet of Things (IoT) networks, particularly concerning energy efficiency. As model training and data transmission contribute significantly to energy consumption, optimizing these processes has become critical for sustainable system design. This study first analyzes the energy consumption model for both centralized and decentralized architectures and then presents a testbed deployed within the German railway infrastructure, leveraging sensor data for ML-based predictive maintenance. A comparative analysis of distributed versus Centralized Learning (CL) architectures reveals that distributed models maintain competitive predictive accuracy (~90%) while reducing overall electricity consumption by up to 70%. These findings underscore the potential of distributed ML to improve energy efficiency in real-world IoT deployments, particularly by mitigating transmission-related energy costs.

Chinese Translation

第六代（6G）技术的出现为物联网（IoT）网络中的机器学习（ML）应用带来了新的挑战和机遇，特别是在能源效率方面。由于模型训练和数据传输对能源消耗的显著贡献，优化这些过程已成为可持续系统设计的关键。本研究首先对集中式和分散式架构的能源消耗模型进行了分析，随后展示了在德国铁路基础设施中部署的测试平台，利用传感器数据进行基于机器学习的预测性维护。对分布式与集中学习（Centralized Learning, CL）架构的比较分析表明，分布式模型在保持竞争性预测准确率（约90%）的同时，整体电力消耗降低了多达70%。这些发现强调了分布式机器学习在改善现实世界物联网部署中的能源效率的潜力，特别是在减轻与传输相关的能源成本方面。

cs.AI / 31 / 2604.19398

GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

GRASPrune：用于预算结构化剪枝的大型语言模型的全局门控

Wang, Ziyang, Xiao, Jiangfeng, Xiao, Chuan, Li, Ruoxiang, Mao, Rui, Qin, Jianbin

Abstract

Large language models (LLMs) are expensive to serve because model parameters, attention computation, and KV caches impose substantial memory and latency costs. We present GRASPrune, a structured pruning framework applied after pretraining that jointly prunes FFN channels and KV head groups under a single global budget. Instead of learning importance scores without constraints and applying the budget only after training, GRASPrune learns lightweight gate scores with a projected straight-through estimator that enforces a hard mask satisfying the budget at every step while keeping the backbone weights frozen. After the mask is fixed, we calibrate scaling factors on the retained units to mitigate scale mismatch caused by pruning, and fold these factors into the pruned weights to obtain a smaller dense checkpoint with no extra parameters at inference. On LLaMA-2-7B, GRASPrune removes 50% of parameters and achieves 12.18 perplexity on WikiText-2 while maintaining competitive average zero-shot accuracy on five benchmarks, using four epochs on 512 unlabeled calibration sequences on a single NVIDIA A100 80GB GPU without any full model fine-tuning.

Chinese Translation

大型语言模型（LLMs）的服务成本高昂，因为模型参数、注意力计算和键值（KV）缓存带来了巨大的内存和延迟开销。我们提出了GRASPrune，这是一种在预训练后应用的结构化剪枝框架，能够在单一全局预算下联合剪枝前馈神经网络（FFN）通道和KV头组。GRASPrune并不是在没有约束的情况下学习重要性评分并在训练后应用预算，而是通过投影直通估计器学习轻量级门控评分，该估计器在每一步强制执行满足预算的硬掩码，同时保持主干权重不变。在掩码固定后，我们对保留单元的缩放因子进行校准，以减轻剪枝造成的规模不匹配，并将这些因子折叠到剪枝权重中，以获得一个没有额外参数的小型稠密检查点，在推理时使用。在LLaMA-2-7B上，GRASPrune去除了50%的参数，并在WikiText-2上达到了12.18的困惑度，同时在五个基准上保持了有竞争力的平均零样本准确率；整个过程仅在单个NVIDIA A100 80GB GPU上对512个未标记的校准序列训练四个周期，而没有进行任何完整模型的微调。
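One way to picture "a hard mask satisfying the budget at every step" is a projection that keeps the highest-scoring units that fit within a global cost budget, paired with a straight-through update of the gate scores. The sketch below is an assumption-laden illustration, not the paper's algorithm: the greedy projection, the per-unit costs, and the function names are all hypothetical.

```python
def project_to_budget(scores, costs, budget):
    """Greedy projection: keep the highest-scoring units that fit the budget.

    Returns a hard 0/1 mask whose total kept cost never exceeds `budget`.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    mask = [0] * len(scores)
    used = 0
    for i in order:
        if used + costs[i] <= budget:
            mask[i] = 1
            used += costs[i]
    return mask


def straight_through_update(scores, grad_wrt_mask, learning_rate):
    """Straight-through estimator: treat d(mask)/d(score) as 1 in the backward pass."""
    return [s - learning_rate * g for s, g in zip(scores, grad_wrt_mask)]


# Hypothetical gates: two KV head groups (cost 2 each) and two FFN channels (cost 1 each).
scores = [0.9, 0.1, 0.5, 0.7]
costs = [2, 1, 1, 2]
mask = project_to_budget(scores, costs, budget=4)

# One illustrative step: the loss gradient favors unit 2 over unit 3.
scores = straight_through_update(scores, [0.0, 0.0, -3.0, 1.0], learning_rate=0.1)
new_mask = project_to_budget(scores, costs, budget=4)
```

The forward pass always sees a mask that satisfies the budget, while the score update lets units swap in and out of the kept set between steps, so the budget never has to be imposed after training.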

cs.AI / 32 / 2604.19457

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

四轴决策对齐用于长时间范围企业人工智能代理

Srininvasan, Vasundra

Abstract

Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.

Chinese Translation

长时间范围的企业代理在有损记忆、多步推理和具有约束力的监管限制下做出高风险决策（如贷款承保、索赔裁定、临床审查、事先授权）。当前的评估只报告单一的任务成功标量，这混淆了不同的失败模式，并掩盖了代理是否与其部署环境所要求的标准对齐。我们提出，长时间范围的决策行为可以分解为四个正交的对齐轴，每个轴都可以独立测量、独立失败：事实精确度（Factual Precision, FRP）、推理一致性（Reasoning Coherence, RCS）、合规重建（Compliance Reconstruction, CRR）和校准弃权（Calibrated Abstention, CAR）。CRR是一个新颖的、以监管为依据的轴；CAR是一个将覆盖率与准确性分开的测量轴。我们在一个受控基准（LongHorizon-Bench）上检验这一分解，该基准涵盖贷款资格审核和保险索赔裁定，并采用确定性的真值构建方式。运行六种记忆架构后，我们发现了聚合准确率无法揭示的结构：检索在事实精确度上崩溃；模式锚定的架构要付出脚手架代价；在事实保留提示下的简单摘要在FRP、RCS、EDA和CRR上是一个强基线；所有六种架构在每个案例上都做出了决定（从不弃权），暴露了该领域尚未针对的决策对齐轴。该分解还揭示，我们自己的一个预注册预测（即摘要会在事实回忆上失败）被数据大幅逆转，而这种轴级逆转会被聚合准确率所掩盖。制度对齐（监管重建）和决策对齐（校准弃权）在对齐文献中代表性不足，而一旦决策离开实验室，它们就成为关键支撑。该框架可以通过两个步骤迁移到任何受监管的决策领域：构建事实模式，并校准CRR审计提示。
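The remark that calibrated abstention separates coverage from accuracy can be made concrete with a generic selective-prediction measure. The abstract does not give the exact CAR definition, so the sketch below is an assumption: `None` marks an abstention, and the function name and loan-style labels are hypothetical.

```python
def coverage_and_selective_accuracy(decisions, gold):
    """Return (coverage, accuracy on committed cases).

    `decisions` holds a label for each case, or None when the agent abstains.
    Selective accuracy is None when the agent abstains on every case.
    """
    if len(decisions) != len(gold) or not decisions:
        raise ValueError("decisions and gold must be non-empty and aligned")
    committed = [(d, g) for d, g in zip(decisions, gold) if d is not None]
    coverage = len(committed) / len(decisions)
    if not committed:
        return coverage, None
    correct = sum(1 for d, g in committed if d == g)
    return coverage, correct / len(committed)


gold = ["approve", "deny", "deny", "deny"]

# An agent that commits on every case, as all six architectures reportedly do.
always_commit = ["approve", "approve", "deny", "approve"]

# An agent that abstains on the cases it would otherwise get wrong.
abstaining = ["approve", None, "deny", None]

full = coverage_and_selective_accuracy(always_commit, gold)
selective = coverage_and_selective_accuracy(abstaining, gold)
```

A single task-success scalar would not distinguish these two agents' decision policies, whereas reporting the pair shows that the second trades coverage for accuracy on the cases it does decide.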

cs.AI / 33 / 2604.19459

Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning

大型语言模型是否存在形式化游戏？评估逻辑推理中的忠实性

Kim, Kyuhee, Poiroux, Auguste, Bosselut, Antoine

Abstract

Formal verification guarantees proof validity but not formalization faithfulness. For natural-language logical reasoning, where models construct axiom systems from scratch without library constraints, this gap between valid proofs and faithful translations is especially acute. We investigate whether frontier models exploit this gap when generating Lean 4 proofs, a behavior we term formalization gaming. We evaluate GPT-5 and DeepSeek-R1 on 303 first-order logic problems (203 from FOLIO, 100 from Multi-LogiEval), comparing unified generation against a two-stage pipeline that separates formalization from proving. Despite compilation rates of 87-99%, we find no evidence of systematic gaming in unified generation: models prefer reporting failure over forcing proofs, even under prompting designed to encourage it. However, unfaithfulness that evades our detection signals may still occur. The two-stage pipeline reveals two distinct modes of unfaithfulness: GPT-5 fabricates axioms during proof generation, a reactive fallback detectable via cross-stage comparison, while DeepSeek-R1 mistranslates premises during formalization, producing internally consistent outputs that evade detection entirely. These findings show that high compilation rates or accuracies should not be equated with faithful reasoning. Code and data are available at https://github.com/koreankiwi99/formalization-gaming.

Chinese Translation

形式验证保证了证明的有效性，但并不保证形式化的忠实性。在自然语言逻辑推理中，模型从零开始构建公理系统而不受库的约束，这使得有效证明与忠实翻译之间的差距尤为明显。我们研究前沿模型在生成 Lean 4 证明时是否利用了这一差距，这种行为我们称之为形式化游戏。我们对 GPT-5 和 DeepSeek-R1 在 303 个一阶逻辑问题（203 个来自 FOLIO，100 个来自 Multi-LogiEval）上的表现进行了评估，比较了统一生成与将形式化与证明分开的两阶段管道。尽管编译率达到 87-99%，我们发现统一生成中没有系统性游戏的证据：模型更倾向于报告失败而不是强行生成证明，即使在设计上旨在鼓励这一行为的提示下也是如此。然而，可能仍然存在逃避我们检测信号的非忠实性。两阶段管道揭示了两种不同的非忠实模式：GPT-5 在证明生成过程中伪造公理，这是一种可通过跨阶段比较检测到的反应性后备，而 DeepSeek-R1 在形式化过程中错误翻译前提，生成的内部一致输出完全逃避检测。这些发现表明，高编译率或准确性不应与忠实推理等同。代码和数据可在 https://github.com/koreankiwi99/formalization-gaming 获取。
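The gap between a valid proof and a faithful formalization can be seen in a toy Lean 4 example. Everything below is hypothetical and constructed for illustration (it is not drawn from FOLIO, Multi-LogiEval, or the paper's outputs): the premises are "Every bird can fly" and "Tweety is a bird", and the question is whether Tweety can swim, which the premises do not entail.

```lean
-- Vocabulary for the toy problem (all names are illustrative).
axiom Entity : Type
axiom Bird : Entity → Prop
axiom CanFly : Entity → Prop
axiom CanSwim : Entity → Prop
axiom tweety : Entity

-- Faithful formalization of the two stated premises.
axiom birds_fly : ∀ x : Entity, Bird x → CanFly x
axiom tweety_is_bird : Bird tweety

-- Faithful and provable: this follows from the premises alone.
theorem tweety_flies : CanFly tweety :=
  birds_fly tweety tweety_is_bird

-- Gamed: this axiom has no counterpart in the premises, yet it lets the
-- unsupported conclusion type-check.
axiom fabricated : ∀ x : Entity, CanFly x → CanSwim x

theorem tweety_swims : CanSwim tweety :=
  fabricated tweety tweety_flies
```

Both theorems compile, so compilation rate alone cannot tell them apart; only a comparison of the axioms against the natural-language premises reveals that `fabricated` was introduced to force the proof.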

cs.AI / 34 / 2604.19488

CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation

CoDA：通过CoT引导的领域适应实现有效的跨领域知识转移

Yan, Jianzhi, Liu, Le, Tang, Buzhou, Xiang, Yang, Sun, Dongning, Li, Zhiming

Abstract

Large language models (LLMs) have achieved substantial advances in logical reasoning, yet they continue to lag behind human-level performance. In-context learning provides a viable solution that boosts performance by prompting the model with expert-curated, in-domain exemplars. However, in many real-world, expertise-scarce domains, such as low-resource scientific disciplines, emerging biomedical subfields, or niche legal jurisdictions, such high-quality in-domain demonstrations are inherently limited or entirely unavailable, thereby constraining the general applicability of these approaches. To mitigate this limitation, recent efforts have explored the retrieval of cross-domain samples as surrogate in-context demonstrations. Nevertheless, the resulting gains remain modest. This is largely attributable to the pronounced domain shift between source and target distributions, which impedes the model's ability to effectively identify and exploit underlying shared structures or latent reasoning patterns. Consequently, when relying solely on raw textual prompting, LLMs struggle to abstract and transfer such cross-domain knowledge in a robust and systematic manner. To address these issues, we propose CoDA, which employs a lightweight adapter to directly intervene in the intermediate hidden states. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy (MMD) for kernelized distribution matching, our method aligns the latent reasoning representations of the source and target domains. Extensive experiments on multiple logical reasoning tasks across various model families validate the efficacy of CoDA, which outperforms previous state-of-the-art baselines by a large margin.

Chinese Translation

大型语言模型（LLMs）在逻辑推理方面取得了显著进展，但仍然落后于人类水平的表现。在上下文学习中，通过用专家策划的领域内示例提示模型输入，提供了一种可行的解决方案，提升了模型的性能。然而，在许多现实世界中缺乏专业知识的领域，如低资源科学学科、新兴生物医学子领域或小众法律管辖区，这种高质量的领域内示范本质上是有限的或完全不可用的，从而限制了这些方法的普遍适用性。为了缓解这一限制，最近的研究探索了将跨领域样本作为替代的上下文示范进行检索。然而，所获得的收益仍然有限。这在很大程度上归因于源分布和目标分布之间显著的领域转移，这妨碍了模型有效识别和利用潜在共享结构或潜在推理模式的能力。因此，当仅依赖原始文本提示时，LLMs在稳健和系统地抽象和转移这种跨领域知识方面面临挑战。为了解决这些问题，我们提出了CoDA，它采用轻量级适配器直接干预中间隐藏状态。通过结合基于特征的CoT丰富参考表示的蒸馏与最大均值差异（MMD）进行核化分布匹配，我们的方法对齐了源领域和目标领域的潜在推理表示。在多个逻辑推理任务上的广泛实验结果验证了CoDA的有效性，显著超越了之前的最先进基线，取得了大幅提升。

cs.AI / 35 / 2604.19516

From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning

从经验到技能：通过可重用策略学习的多智能体生成引擎优化

Wu, Beining, Mao, Fuyou, Lin, Jiong, Yang, Cheng, Lu, Jiaxuan, Guo, Yifu, Zhang, Siyu, Wu, Yifan, Huang, Ying, Li, Fu

Abstract

Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated editing patterns are progressively distilled into reusable, engine-specific optimization skills. To enable controlled assessment, we introduce a Twin Branch Evaluation Protocol for causal attribution of content edits and DSV-CF, a dual-axis metric that unifies semantic visibility with attribution accuracy. We further release MSME-GEO-Bench, a multi-scenario, multi-engine benchmark grounded in real-world queries. Experiments on three mainstream engines show that MAGEO substantially outperforms heuristic baselines in both visibility and citation fidelity, with ablations confirming that engine-specific preference modeling and strategy reuse are central to these gains, suggesting a scalable learning-driven paradigm for trustworthy GEO. Code is available at https://github.com/Wu-beining/MAGEO

Chinese Translation

生成引擎（GEs）通过用基于引用的答案替代排名链接，正在重塑信息获取的方式。然而，目前的生成引擎优化（GEO）方法在孤立的情况下优化每个实例，无法在任务和引擎之间积累或转移有效策略。我们将GEO重新框定为一个策略学习问题，并提出MAGEO，一个多智能体框架，其中协调规划、编辑和关注保真度的评估作为执行层，而经过验证的编辑模式逐步提炼为可重用的引擎特定优化技能。为了实现受控评估，我们引入了双分支评估协议，以便对内容编辑进行因果归因，并提出了DSV-CF，这是一种将语义可见性与归因准确性统一的双轴度量。我们还发布了MSME-GEO-Bench，这是一个基于真实查询的多场景、多引擎基准。在对三种主流引擎的实验中，MAGEO在可见性和引用保真度方面显著优于启发式基线，消融实验确认引擎特定偏好建模和策略重用是这些提升的核心，暗示了一种可扩展的以学习驱动的可信GEO范式。代码可在 https://github.com/Wu-beining/MAGEO 获取。

cs.AI / 36 / 2604.19520

SimDiff: Depth Pruning via Similarity and Difference

SimDiff：通过相似性和差异性进行深度剪枝

Chen, Yuli, Zhang, Shuhao, Meng, Fanshen, Cheng, Bo, Han, Jiale, Tong, Qiang, Liu, Xiulei

Abstract

Depth pruning improves the deployment efficiency of large language models (LLMs) by identifying and removing redundant layers. A widely accepted standard for this identification process is to measure the similarity between layers using cosine distance. However, we find that methods relying solely on this one-dimensional heuristic can exhibit unpredictable performance and even catastrophic collapse across different architectures. To address this issue, we propose SimDiff, a novel layer importance criterion that jointly evaluates layers from two orthogonal perspectives: representational similarity and transformation difference. The difference is quantified using two distinct metrics: MSSD, which is sensitive to outliers and identifies layers that make decisive corrections, and MASD, which robustly measures a layer's average contribution. Extensive experiments on multiple models ranging from 0.5B to 13B parameters demonstrate that SimDiff significantly outperforms state-of-the-art baselines across various pruning ratios. Notably, our method retains over 91% of LLaMA2-7B's performance at a 25% pruning ratio and achieves up to a 1.49x inference speedup when pruning 12 layers on LLaMA3.1-8B. We also show that pruned models can be effectively recovered with minimal fine-tuning.
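The similarity heuristic that the abstract describes as the common baseline can be sketched as follows. This is only the cosine-similarity baseline, not SimDiff itself (the abstract does not define MSSD or MASD, so they are not reproduced here); the function names and the toy hidden states are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_layers_by_similarity(hidden_states):
    """Rank layers from most to least redundant.

    hidden_states[i] is the vector entering layer i and hidden_states[i + 1]
    is the vector leaving it. A layer whose output is nearly parallel to its
    input changes the representation little and is treated as more redundant.
    """
    scores = [
        cosine_similarity(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical hidden states for a three-layer toy model.
states = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.1, 1.0]]
order, scores = rank_layers_by_similarity(states)
print(order, scores)
```

In this toy example layer 1 rotates the representation sharply, so a similarity-only criterion ranks it as the least removable layer.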

Chinese Translation

深度剪枝通过识别和移除冗余层来提高大型语言模型（LLMs）的部署效率。当前广泛接受的识别标准是使用余弦距离测量层之间的相似性。然而，我们发现仅依赖这一单维启发式方法可能会在不同架构中表现出不可预测的性能，甚至导致灾难性的崩溃。为了解决这一问题，我们提出了SimDiff，一种新的层重要性标准，从表示相似性和变换差异两个正交视角共同评估层。差异通过两种不同的度量进行量化：MSSD，对异常值敏感，识别出做出决定性修正的层；以及MASD，稳健地测量层的平均贡献。在多个模型（参数范围从0.5B到13B）上的大量实验表明，SimDiff在各种剪枝比例下显著优于最先进的基线。值得注意的是，我们的方法在25%的剪枝比例下保留了超过91%的LLaMA2-7B的性能，并在对LLaMA3.1-8B剪枝12层时实现了高达1.49倍的推理加速。我们还展示了经过剪枝的模型可以通过最小的微调有效恢复。

cs.AI / 37 / 2604.19523

Revac: A Social Deduction Reasoning Agent

Revac：一种社会推理代理

Arya, Mihir Shriniwas, Anish, Avinash, Ranjan, Aditya

Abstract

Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate human-like communication, and make strategic elimination decisions. Unlike deterministic board games, success in Mafia depends not on perfect information or brute-force search, but on inference, memory, and adaptability in the presence of deception. This work presents the design and evaluation of Revac-8, an AI agent developed for the Social Deduction track of the MindGames Arena competition, where it achieved first place. The final agent evolved from a simple two-stage reasoning system into a multi-module architecture that integrates memory-based player profiling, social-graph analysis of accusations and defenses, and dynamic tone selection for communication. These results highlight the importance of structured memory and adaptive communication for achieving strong performance in high-stakes social environments.

Chinese Translation

社会推理游戏如黑手党（Mafia）提出了一个独特的人工智能挑战：玩家必须在不确定性下进行推理，解读不完整和故意误导的信息，评估类人沟通，并做出战略性淘汰决策。与确定性棋盘游戏不同，在黑手党中的成功并不依赖于完美的信息或蛮力搜索，而是依赖于推理、记忆和在欺骗环境中的适应能力。本研究展示了Revac-8的设计与评估，该人工智能代理是为MindGames Arena竞赛的社会推理赛道开发的，并获得了第一名。最终的代理从一个简单的两阶段推理系统演变为一个多模块架构，整合了基于记忆的玩家画像、对指控和辩护的社交图分析，以及动态语调选择以进行沟通。这些结果强调了结构化记忆和适应性沟通在高风险社交环境中实现强大表现的重要性。

cs.AI / 38 / 2604.19538

Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity

将异常检测整合到自主人工智能中，以实现人类活动的主动风险管理

Zorriassatine, Farbod, Lotfi, Ahmad

Abstract

Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and more effectively addressed through an agentic AI system. More broadly, this perspective enables the early identification of subtle deviations in movement patterns associated with increased risk, whether arising from age-related decline, fatigue, or environmental factors. While technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights potential value. This framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.

Chinese Translation

自主人工智能具备目标导向、主动和自主决策能力，为应对人类活动中的运动相关风险提供了一个引人注目的机会，包括老年人群体中持续存在的跌倒危险。尽管已有多种通过跌倒预测和检测来减轻跌倒风险的方法，但现有系统尚未能在护理路径和安全关键环境中作为通用解决方案发挥作用。这主要是由于在持续处理现实世界复杂性方面的局限性，尤其是上下文意识差、误报率高、环境噪声和数据稀缺。我们认为，跌倒检测和跌倒预测可以有效地被视为异常检测问题，并通过自主人工智能系统更有效地解决。从更广泛的角度来看，这一视角使得能够早期识别与风险增加相关的运动模式细微偏差，无论是由于年龄相关的衰退、疲劳还是环境因素引起的。虽然立即部署的技术要求超出了本文的范围，但我们提出了一个概念框架，强调其潜在价值。该框架通过动态选择相关工具并将其整合到自适应决策工作流程中，促进了一种良好协调的风险管理方法，而不是依赖于针对狭窄定义场景量身定制的静态配置。

cs.AI / 39 / 2604.19544

DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

DT2IT-MRM：去偏好构建与多模态奖励建模的迭代训练

Zhang, Zhihong, Zhao, Jie, Huang, Xiaojian, Xu, Jin, Luo, Zhuodong, Liu, Xin, Wei, Jiansheng, Chen, Xuejin

Abstract

Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Besides, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose DT2IT-MRM, which integrates a Debiased preference construction pipeline, a novel reformulation of text-to-image (T2I) preference data, and an Iterative Training framework that curates existing multimodal preference datasets for Multimodal Reward Modeling. Our experimental results show that DT2IT-MRM achieves new state-of-the-art overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.

Chinese Translation

多模态奖励模型（MRMs）在将多模态大型语言模型（MLLMs）与人类偏好对齐方面发挥着至关重要的作用。训练一个优秀的MRM需要高质量的多模态偏好数据。然而，现有的偏好数据集面临三个主要挑战：偏好强度缺乏细粒度、文本风格偏差以及不可靠的偏好信号。此外，现有的开源多模态偏好数据集存在大量噪声，但缺乏有效且可扩展的策划方法来提升其质量。为了解决这些局限性，我们提出了DT2IT-MRM，该方法整合了去偏好构建管道、文本到图像（T2I）偏好数据的新型重构以及一个迭代训练框架，用于策划现有的多模态偏好数据集以进行多模态奖励建模。我们的实验结果表明，DT2IT-MRM在三个主要基准测试（VL-RewardBench、Multimodal RewardBench和MM-RLHF-RewardBench）上达到了新的最先进的整体性能。

cs.AI / 40 / 2604.19559

Enhancing Construction Worker Safety in Extreme Heat: A Machine Learning Approach Utilizing Wearable Technology for Predictive Health Analytics

在极端高温下增强建筑工人安全：利用可穿戴技术进行预测健康分析的机器学习方法

Ullah, Syed Sajid, Khan, Amir

Abstract

Construction workers are highly vulnerable to heat stress, yet tools that translate real-time physiological data into actionable safety intelligence remain scarce. This study addresses this gap by developing and evaluating deep learning models, specifically a baseline Long Short-Term Memory (LSTM) network and an attention-based LSTM, to predict heat stress among 19 workers in Saudi Arabia. Using Garmin Vivosmart 5 smartwatches to monitor metrics such as heart rate, HRV, and oxygen saturation, the attention-based model outperformed the baseline, achieving 95.40% testing accuracy and significantly reducing false positives and negatives. With precision, recall, and F1 scores of 0.982, this approach not only improves predictive performance but also offers interpretable results suitable for integration into IoT-enabled safety systems and BIM dashboards, advancing proactive, informatics-driven safety management in the construction industry.
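For readers less familiar with the reported metrics, the sketch below shows how precision, recall, and F1 relate. The confusion-matrix counts are hypothetical and are not taken from the study; only the 0.982 figure comes from the abstract.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_from_rates(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
print(precision_recall_f1(true_pos=8, false_pos=2, false_neg=2))

# When precision and recall are equal, F1 equals that same value,
# which is consistent with all three being reported as 0.982.
print(f1_from_rates(0.982, 0.982))
```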

Chinese Translation

建筑工人对热应激高度敏感，但将实时生理数据转化为可操作安全智能的工具仍然稀缺。本研究通过开发和评估深度学习模型，特别是基线长短期记忆网络（Long Short-Term Memory, LSTM）和基于注意力机制的LSTM，来预测沙特阿拉伯19名工人的热应激。使用Garmin Vivosmart 5智能手表监测心率、心率变异性（HRV）和血氧饱和度等指标，基于注意力机制的模型在测试中表现优于基线模型，达到了95.40%的测试准确率，并显著减少了假阳性和假阴性。该方法的精确度、召回率和F1分数均为0.982，不仅提高了预测性能，还提供了适合集成到物联网（IoT）安全系统和建筑信息模型（BIM）仪表板中的可解释结果，推动了建筑行业以信息技术为驱动的主动安全管理。

cs.AI / 41 / 2604.19561

Detecting Data Contamination in Large Language Models

检测大型语言模型中的数据污染

Janicki, Juliusz, Chamezopoulos, Savvas, Kanoulas, Evangelos, Tsatsaronis, Georgios

Abstract

Large Language Models (LLMs) are trained on large amounts of data, some of which may come from copyrighted sources. Membership Inference Attacks (MIAs) aim to determine whether specific documents were included in an LLM's training corpus. Black-box MIAs require a significant amount of data manipulation, which makes comparing them challenging. We study state-of-the-art (SOTA) MIAs under black-box assumptions and compare them on a unified set of datasets to determine whether any of them can reliably detect membership in SOTA LLMs. In addition, we develop a new method, called Familiarity Ranking, to showcase a possible approach to black-box MIAs that gives LLMs more freedom of expression so that their reasoning can be better understood. The results indicate that none of the methods can reliably detect membership in LLMs, as shown by an AUC-ROC of approximately 0.5 for all methods across several LLMs. The higher TPR and FPR for more advanced LLMs indicate stronger reasoning and generalization capabilities, showcasing the difficulty of detecting membership in LLMs using black-box MIAs.
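To make the "AUC-ROC of approximately 0.5" finding concrete, here is a minimal sketch of how AUC-ROC can be computed from attack scores. It uses the pairwise (rank-based) formulation; the function name and the example scores are hypothetical and unrelated to the paper's data.

```python
def auc_roc(member_scores, nonmember_scores):
    """AUC-ROC as the probability that a random member outscores a random
    non-member, counting ties as half a win."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# A perfectly separating attack versus an uninformative one (toy scores).
print(auc_roc([0.9, 0.8], [0.1, 0.2]))
print(auc_roc([0.1, 0.9], [0.1, 0.9]))
```

An attack whose scores are distributed identically for members and non-members lands at 0.5, which is the chance level that the abstract reports for all evaluated methods.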

Chinese Translation

大型语言模型（LLMs）在训练过程中使用了大量数据，其中一些可能来自受版权保护的来源。成员推断攻击（Membership Inference Attacks, MIA）旨在检测这些文档以及它们是否已被纳入LLMs的训练语料库。黑箱MIA需要大量的数据操作，因此它们的比较通常具有挑战性。我们在黑箱假设下研究了最先进（SOTA）的MIA，并使用统一的数据集对它们进行比较，以确定其中是否有任何方法能够在SOTA LLMs下可靠地检测成员身份。此外，开发了一种新方法，称为熟悉度排名（Familiarity Ranking），以展示黑箱MIA的一种可能方法，从而使LLMs在表达上有更多自由，以更好地理解其推理。结果表明，所有方法在多个LLMs上的AUC-ROC均约为0.5，表明没有任何方法能够可靠地检测LLMs中的成员身份。更高级LLMs的较高真正率（TPR）和假正率（FPR）表明其推理和概括能力更强，显示出使用黑箱MIA检测LLMs中的成员身份的困难。

cs.AI / 42 / 2604.19567

Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

基于大型语言模型的多模态推理与视觉语义算术

Xu, Chuou, Ji, Liya, Chen, Qifeng

Abstract

Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy "king"-"man"+"woman" = "queen" illustrates relational reasoning, yet replacing text with images of "king" and "man" significantly reduces performance because it requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that "powder" and "cake" are related by "is made of" grounds symbolic relations in perception, enabling tool substitution, task generalization, and improved semantic reasoning. Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, but suffers from modality gaps and lacks systematic evaluation. In this paper, we formulate two novel tasks, two-term subtraction and three-term operations, and construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. We further propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO). Our method achieves state-of-the-art results on IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning, this work advances domestic robots' ability to ground symbolic reasoning in perception, enhancing decision-making, tool adaptability, and human-robot interaction in complex environments. Datasets and source code are provided in the supplementary material.
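The text analogy cited in the abstract can be sketched with toy vectors. The three-dimensional embeddings below are hand-made assumptions chosen so the arithmetic works out; they are not real word or image features, and this is the classic vector-arithmetic baseline rather than the proposed SAri-RFT method.

```python
import math

# Hypothetical embeddings with axes (royalty, male, female).
EMBEDDINGS = {
    "king": [1.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word closest to embeddings[a] - embeddings[b] + embeddings[c],
    excluding the three query words themselves."""
    query = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(query, embeddings[word]))


print(analogy("king", "man", "woman"))
```

With image inputs, the vectors would come from a vision encoder and carry many irrelevant visual details, which is the difficulty the abstract points to.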

Chinese Translation

强化学习（RL）作为后训练方法，对于增强大型语言模型（LLMs）在编码和数学方面的推理能力至关重要。然而，它们在视觉语义算术方面的能力，即从图像推断关系，仍然未得到充分探索。经典的文本类比“国王”-“人”+“女人”=“女王”展示了关系推理，但用“国王”和“人”的图像替代文本会显著降低性能，因为这需要常识知识以及从无关视觉细节中提取简洁概念的能力。这种能力对于在非结构化环境中的服务和家庭机器人尤为重要，因为机器人必须推断对象、代理和动作之间的语义关系。在厨房中，从图像中识别“粉”和“蛋糕”之间的关系为“由……制成”，为感知中的符号关系奠定基础，从而实现工具替代、任务泛化和改进的语义推理。之前的研究通过在向量算术后解码图像特征来处理语义算术，但存在模态间的差距，且缺乏系统评估。在本文中，我们提出了两个新任务：两项减法和三项运算，并构建了图像关系对数据集（Image-Relation-Pair Dataset, IRPD）以进行基准测试。我们进一步提出了语义算术强化微调（Semantic Arithmetic Reinforcement Fine-Tuning, SAri-RFT），该方法使用可验证的函数和群体相对策略优化（Group Relative Policy Optimization, GRPO）对大型视觉语言模型（Large Vision-Language Models, LVLMs）进行后训练。我们的方法在IRPD和真实世界的Visual7W-Telling数据集上取得了最先进的结果。通过为LVLMs提供强大的跨模态关系推理能力，本研究提升了家庭机器人在感知中扎根符号推理的能力，增强了在复杂环境中的决策能力、工具适应性和人机交互。数据集和源代码已在补充材料中提供。

cs.AI / 43 / 2604.19606

AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories

AblateCell：一种用于虚拟细胞库的重现-再消融代理

Xia, Xue, Yao, Chengkai, Tsoi, Mingyu, Mao, Xinjie, Huang, Wenxuan, Wei, Jiaqi, Wu, Hao, Tan, Cheng, Yu, Lang, Yang, Yuejin, Sun, Siqi, Gao, Zhangyang

Abstract

Systematic ablations are essential to attribute performance gains in AI Virtual Cells, yet they are rarely performed because biological repositories are under-standardized and tightly coupled to domain-specific data and formats. While recent coding agents can translate ideas into implementations, they typically stop at producing code and lack a verifier that can reproduce strong baselines and rigorously test which components truly matter. We introduce AblateCell, a reproduce-then-ablate agent for virtual cell repositories that closes this verification gap. AblateCell first reproduces reported baselines end-to-end by auto-configuring environments, resolving dependency and data issues, and rerunning official evaluations while emitting verifiable artifacts. It then conducts closed-loop ablation by generating a graph of isolated repository mutations and adaptively selecting experiments under a reward that trades off performance impact and execution cost. Evaluated on three single-cell perturbation prediction repositories (CPA, GEARS, BioLORD), AblateCell achieves 88.9% (+29.9% to human expert) end-to-end workflow success and 93.3% (+53.3% to heuristic) accuracy in recovering ground-truth critical components. These results enable scalable, repository-grounded verification and attribution directly on biological codebases.

Chinese Translation

系统性的消融实验对于归因人工智能虚拟细胞的性能提升至关重要，然而由于生物库标准化不足且与特定领域的数据和格式紧密耦合，这类实验很少被执行。尽管最近的编码代理能够将想法转化为实现，但它们通常仅限于生成代码，缺乏能够重现强基准并严格测试哪些组件真正重要的验证器。我们提出了AblateCell，一种用于虚拟细胞库的重现-再消融代理，旨在填补这一验证空白。AblateCell首先通过自动配置环境、解决依赖和数据问题，并重新运行官方评估，同时生成可验证的文档，来端到端地重现已报告的基准。然后，它通过生成孤立库变异的图并在权衡性能影响和执行成本的奖励下自适应选择实验，进行闭环消融。在三个单细胞扰动预测库（CPA、GEARS、BioLORD）上的评估显示，AblateCell实现了88.9%（比人类专家高出29.9%）的端到端工作流成功率和93.3%（比启发式高出53.3%）的准确率，成功恢复了真实的关键组件。这些结果使得在生物代码库上能够进行可扩展的、基于库的验证和归因。

cs.AI / 44 / 2604.19633

Time Series Augmented Generation for Financial Applications

用于金融应用的时间序列增强生成

Kolonin, Anton, Glushchenko, Alexey, Bochkov, Evgeny, Saxena, Abhishek

Abstract

Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm. Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.

Chinese Translation

评估大型语言模型（LLMs）在复杂定量金融任务中的推理能力是一个关键且尚未解决的挑战。标准基准往往无法孤立出代理解析查询和协调计算的核心能力。为了解决这个问题，我们引入了一种新颖的评估方法和基准，旨在严格测量LLM代理在金融时间序列分析中的推理能力。我们在一项大规模实证研究中应用了这一方法，使用我们的框架——时间序列增强生成（Time Series Augmented Generation, TSAG），其中LLM代理将定量任务委托给可验证的外部工具。我们的基准由100个金融问题组成，用于比较多个最先进的代理（例如，GPT-4o、Llama 3、Qwen2）在工具选择准确性、忠实度和幻觉等指标上的表现。结果表明，能够的代理在工具使用准确性方面可以达到近乎完美的水平，同时幻觉现象极少，验证了工具增强范式。我们的主要贡献是这一评估框架及相应的代理性能实证洞察，我们将其公开发布，以促进对可靠金融人工智能的标准化研究。

cs.AI / 45 / 2604.19638

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

SafetyALFRED：评估多模态大型语言模型的安全意识规划

Torres-Fonseca, Josue, Deng, Naihao, Dai, Yinpei, Storks, Shane, Zhang, Yichi, Mihalcea, Rada, Kennington, Casey, Chai, Joyce

Abstract

Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git

Chinese Translation

多模态大型语言模型在互动环境中作为自主代理的应用日益增加，但它们主动应对安全隐患的能力仍然不足。我们提出了SafetyALFRED，该模型基于具身代理基准ALFRED，并增加了六类现实世界厨房隐患。现有的安全评估主要集中在通过非具身问答（QA）设置进行隐患识别，而我们评估了来自Qwen、Gemma和Gemini家族的十一种最先进模型，不仅关注隐患识别，还关注通过具身规划进行主动风险缓解。我们的实验结果揭示了显著的对齐差距：尽管模型能够在问答设置中准确识别隐患，但这些隐患的平均缓解成功率相对较低。我们的研究结果表明，通过问答进行的静态评估不足以确保物理安全，因此我们倡导向优先考虑具身情境中纠正行动的基准转变。我们已将代码和数据集开源，地址为 https://github.com/sled-group/SafetyALFRED.git

cs.AI / 46 / 2604.19653

A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities

合成轨迹生成器的双重视角：效用框架与隐私漏洞

Cherigui, Aya, Guépin, Florent, Legendre, Arnaud, Couchot, Jean-François

Abstract

Human mobility data are used in numerous applications, ranging from public health to urban planning. Human mobility is inherently sensitive, as it can reveal information such as religious beliefs and political affiliations. Historically, it has been proposed to modify the information using techniques such as aggregation, obfuscation, or noise addition, to adequately protect privacy and eliminate concerns. As these methods come at a great cost in utility, new methods leveraging developments in generative models were introduced. The extent to which such methods resolve the privacy-utility trade-off remains an open problem. In this paper, we take a first step toward solving it by introducing and applying a new framework for utility evaluation. Furthermore, we provide evidence that privacy evaluation remains a major challenge and that it should be tackled through adversarial evaluation in accordance with current EU regulation. We propose a new membership inference attack against a subcategory of generative models, even though this subcategory was deemed private due to its resistance to the trajectory user-linking problem.

Chinese Translation

人类移动数据在众多应用中被使用，从公共卫生到城市规划。人类移动数据本质上是敏感的，因为它可能包含宗教信仰和政治倾向等信息。历史上，曾提出使用聚合、模糊处理或噪声添加等技术来修改信息，以充分保护隐私并消除担忧。然而，这些方法在效用上付出了巨大代价，因此引入了利用生成模型发展的新方法。这些方法在多大程度上解决了隐私与效用之间的权衡仍然是一个未解的问题。在本文中，我们通过引入和应用一个新的效用评估框架，迈出了朝解决这一问题的第一步。此外，我们提供证据表明，隐私评估仍然是一个需要考虑的重要挑战，应根据当前的欧盟法规通过对抗性评估来应对。我们提出了一种针对生成模型子类别的新型成员推断攻击，尽管该子类别因其对轨迹用户关联问题的抵抗力而被认为是私密的。

cs.AI / 47 / 2604.19689

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

A-MAR：基于代理的多模态艺术检索用于细粒度艺术作品理解

Wang, Shuai, Zhu, Hongyi, Huang, Jia-Hong, Shen, Yixian, Zeng, Chengxi, Rudinac, Stevan, Kackovic, Monika, Wijnberg, Nachoem, Worring, Marcel

Abstract

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.

Chinese Translation

理解艺术作品需要对视觉内容以及文化、历史和风格背景进行多步骤推理。尽管最近的多模态大型语言模型在艺术作品解释方面显示出潜力，但它们依赖于隐式推理和内化知识，限制了可解释性和显式证据的基础。我们提出了A-MAR，一个基于代理的多模态艺术检索框架，明确地将检索条件化于结构化推理计划。给定一件艺术作品和用户查询，A-MAR首先将任务分解为一个结构化推理计划，该计划指定了每个步骤的目标和证据要求。检索随后基于该计划进行，从而实现有针对性的证据选择，并支持逐步的、基于证据的解释。为了评估艺术领域中的基于代理的多模态推理，我们引入了ArtCoT-QA。该诊断基准包含多步骤推理链，适用于多种艺术相关查询，使得分析超越简单的最终答案准确性。对SemArt和Artpedia的实验表明，A-MAR在最终解释质量上始终优于静态、非计划的检索和强大的多模态大型语言模型基线，而在ArtCoT-QA上的评估进一步展示了其在证据基础和多步骤推理能力方面的优势。这些结果突显了条件推理检索在知识密集型多模态理解中的重要性，并将A-MAR定位为朝向可解释的、目标驱动的人工智能系统迈出的重要一步，尤其与文化产业相关。代码和数据可在以下链接获取：https://github.com/ShuaiWang97/A-MAR。

计算语言学 (Computation and Language)

cs.CL / 1 / 2604.18592

Two-dimensional early exit optimisation of LLM inference

大语言模型推理的二维提前退出优化

Hůla, Jan, Adamczyk, David, Filip, Tomáš, Pavlíček, Martin, Sosík, Petr

Abstract

We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively activating deeper layers, our method achieves multiplicative computational savings that exceed those from optimizing either dimension independently. Experimental evaluation across four state-of-the-art LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B-8B parameters) on three sentiment classification datasets demonstrates additional speed-ups of 1.4–2.3× over optimal layer-wise early exit for simpler tasks with vanilla models, with graceful degradation on complex multi-class problems. Fine-tuning reduces but does not eliminate this advantage. The approach is model-agnostic, requires only lightweight classification adapters, and is orthogonal to complementary efficiency methods such as quantization and pruning. Our findings indicate that 2D early exit strategies excel when semantic information accumulates predictably across input structure, suggesting possible applicability to sequence-processing tasks beyond sentiment classification.
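One way to picture exiting along both dimensions is the loop below. It is a sketch under stated assumptions, not the authors' schedule: the abstract does not specify how sentences and layers are coordinated, so the rule that unlocks deeper layers as more sentences are read, the `confidence` callback, and the threshold are all hypothetical.

```python
import math


def two_d_early_exit(confidence, num_sentences, num_layers, threshold=0.9):
    """Return (sentences_read, layers_used) at the first confident exit.

    confidence(s, layer) is a hypothetical callback giving the classifier's
    confidence after reading the first s sentences with `layer` layers active.
    """
    for s in range(1, num_sentences + 1):
        # Assumed coordination rule: deeper layers unlock as more text is read.
        max_layer = math.ceil(num_layers * s / num_sentences)
        for layer in range(1, max_layer + 1):
            if confidence(s, layer) >= threshold:
                return s, layer
    # No confident exit: fall back to the full input and the full depth.
    return num_sentences, num_layers


def toy_confidence(s, layer):
    """Toy confidence that grows with both the text read and the depth used."""
    return min(1.0, (20 * s + 5 * layer) / 100)


print(two_d_early_exit(toy_confidence, num_sentences=4, num_layers=8))
```

In this toy run the classifier exits before the last sentence and before the last layer, which is the source of the multiplicative saving the abstract describes.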

Chinese Translation

我们提出了一种二维（2D）提前退出策略，协调层级和句子级的退出，以用于大语言模型的分类任务。通过逐句增量处理输入，同时逐步激活更深层次的网络，我们的方法实现了超越独立优化任一维度的乘法计算节省。在四个最先进的大语言模型（Llama 3.1、Llama 3.2、Gemma、Qwen；参数规模为3B-8B）上对三个情感分类数据集的实验评估表明，对于简单任务，使用普通模型的最佳层级提前退出相比，我们的方法实现了1.4-2.3倍的额外加速，并在复杂的多类问题上表现出优雅的降级。微调虽然减少了这一优势，但并未消除。该方法与模型无关，仅需轻量级的分类适配器，并且与量化和剪枝等互补的效率方法是正交的。我们的研究结果表明，当语义信息在输入结构中以可预测的方式积累时，二维提前退出策略表现优异，这表明其可能适用于超越情感分类的序列处理任务。

cs.CL / 2 / 2604.18712

Probing for Reading Times

探测阅读时间

Tsipidi, Eleftheria, Kiegeland, Samuel, Re, Francesco Ignazio, Xu, Tianyang, Giulianelli, Mario, Stanczak, Karolina, Cotterell, Ryan

Abstract

Probing has shown that language model representations encode rich linguistic information, but it remains unclear whether they also capture cognitive signals about human processing. In this work, we probe language model representations for human reading times. Using regularized linear regression on two eye-tracking corpora spanning five languages (English, Greek, Hebrew, Russian, and Turkish), we compare the representations from every model layer against three scalar predictors: surprisal, information value, and logit-lens surprisal. We find that the representations from early layers outperform surprisal in predicting early-pass measures such as first fixation and gaze duration. The concentration of predictive power in the early layers suggests that human-like processing signatures are captured by low-level structural or lexical representations, pointing to a functional alignment between model depth and the temporal stages of human reading. In contrast, for late-pass measures such as total reading time, scalar surprisal remains superior, despite its being a much more compressed representation. We also observe performance gains when using both surprisal and early-layer representations. Overall, we find that the best-performing predictor varies strongly depending on the language and eye-tracking measure.

Chinese Translation

探测研究表明，语言模型的表示编码了丰富的语言信息，但尚不清楚它们是否也捕捉到了关于人类处理的认知信号。在本研究中，我们探测语言模型的表示以获取人类阅读时间。我们使用正则化线性回归分析了两个涵盖五种语言（英语、希腊语、希伯来语、俄语和土耳其语）的眼动追踪语料库，比较了每个模型层的表示与标量预测变量——惊讶度（surprisal）、信息值（information value）和对数透镜惊讶度（logit-lens surprisal）的表现。我们发现，早期层的表示在预测早期通过指标（如首次注视和注视持续时间）方面优于惊讶度。预测能力集中在早期层表明，人类类似的处理特征是由低层次的结构或词汇表示捕捉到的，这指向了模型深度与人类阅读的时间阶段之间的功能一致性。相比之下，对于总阅读时间等晚期通过指标，标量惊讶度仍然表现优越，尽管它是一个更为压缩的表示。我们还观察到，当同时使用惊讶度和早期层表示时，性能有所提升。总体而言，我们发现最佳预测变量的表现因语言和眼动追踪指标的不同而有很大差异。

cs.CL / 3 / 2604.18715

Characterizing AlphaEarth Embedding Geometry for Agentic Environmental Reasoning

表征 AlphaEarth 嵌入几何以进行代理环境推理

Rahman, Mashrekur, Barrett, Samuel J., Last, Christina

Abstract

Earth observation foundation models encode land surface information into dense embedding vectors, yet the geometric structure of these representations and its implications for downstream reasoning remain underexplored. We characterize the manifold geometry of Google AlphaEarth's 64-dimensional embeddings across 12.1 million Continental United States samples (2017–2023) and develop an agentic system that leverages this geometric understanding for environmental reasoning. The manifold is non-Euclidean: effective dimensionality is 13.3 (participation ratio) from 64 raw dimensions, with local intrinsic dimensionality of approximately 10. Tangent spaces rotate substantially, with 84% of locations exceeding 60° and local-global alignment (mean $|\cos\theta| = 0.17$) approaching the random baseline of 0.125. Supervised linear probes indicate that concept directions rotate across the manifold, and compositional vector arithmetic using both PCA-derived and probe-derived directions yields poor precision. Retrieval instead produces physically coherent results, with local geometry predicting retrieval coherence ($R^2 = 0.32$). Building on this characterization, we introduce an agentic system with nine specialized tools that decomposes environmental queries into reasoning chains over a FAISS-indexed embedding database. A five-condition ablation (120 queries, three complexity tiers) shows that embedding retrieval dominates response quality ($\mu = 3.79 \pm 0.90$ vs. $3.03 \pm 0.77$ parametric-only; scale 1–5), with peak performance on multi-step comparisons ($\mu = 4.28 \pm 0.43$). A cross-model benchmark shows that geometric tools reduce Sonnet 4.5's score by 0.12 points but improve Opus 4.6's by 0.07, with Opus achieving higher geometric grounding (3.38 vs. 2.64), suggesting that the value of geometric characterization scales with the reasoning capability of the consuming model.
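The effective-dimensionality figure is reported as a participation ratio. A minimal sketch of that quantity follows, assuming the common definition in terms of covariance eigenvalues, (sum of eigenvalues) squared divided by the sum of squared eigenvalues; the abstract does not state the exact formula used, and the spectra below are hypothetical.

```python
def participation_ratio(eigenvalues):
    """Effective dimensionality from a covariance spectrum:
    (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    total = sum(eigenvalues)
    return total * total / sum(value * value for value in eigenvalues)


# Variance spread evenly over 64 axes uses all 64 dimensions.
print(participation_ratio([1.0] * 64))

# Variance concentrated on one axis behaves like a single dimension.
print(participation_ratio([5.0, 0.0, 0.0]))

# A decaying hypothetical spectrum falls in between.
print(participation_ratio([1.0 / (k + 1) for k in range(64)]))
```

A value of 13.3 out of 64 therefore indicates that most of the variance is concentrated in a small number of directions.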

Chinese Translation

地球观测基础模型将地表信息编码为密集的嵌入向量，但这些表示的几何结构及其对下游推理的影响仍然未被充分探索。我们对 Google AlphaEarth 的 64 维嵌入在 1210 万个美国大陆样本（2017-2023）中的流形几何进行了表征，并开发了一个代理系统，利用这种几何理解进行环境推理。该流形是非欧几里得的：有效维度为 13.3（参与比率），来自 64 个原始维度，局部内在维度约为 10。切向空间旋转显著，84\% 的位置超过 60\textdegree{}，局部与全局的对齐（均值 $|\cos\theta| = 0.17$）接近随机基线 0.125。监督线性探针表明概念方向在流形上旋转，使用 PCA 导出的方向和探针导出的方向进行的组合向量算术精度较低。检索则产生物理上连贯的结果，局部几何预测检索一致性（$R^2 = 0.32$）。基于这一表征，我们引入了一个具有九个专用工具的代理系统，将环境查询分解为基于 FAISS 索引的嵌入数据库的推理链。五个条件的消融实验（120 个查询，三个复杂度层级）表明嵌入检索主导了响应质量（$\mu = 3.79 \pm 0.90$ 对比 $3.03 \pm 0.77$ 仅参数化；规模 1-5），在多步骤比较中表现最佳（$\mu = 4.28 \pm 0.43$）。跨模型基准测试显示几何工具将 Sonnet 4.5 的得分降低了 0.12 分，但将 Opus 4.6 的得分提高了 0.07 分，Opus 实现了更高的几何基础（3.38 对比 2.64），这表明几何表征的价值与消费模型的推理能力成正比。
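
The abstract reports effective dimensionality as a participation ratio. As a minimal sketch, the ratio can be computed from the eigenvalues of an embedding covariance matrix as the squared sum of eigenvalues divided by the sum of squared eigenvalues. The spectra below are toy values chosen for illustration, not AlphaEarth data, and the function name is hypothetical.

```python
def participation_ratio(eigenvalues):
    """Effective dimensionality: (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    total = sum(eigenvalues)
    total_sq = sum(v * v for v in eigenvalues)
    if total_sq == 0:
        raise ValueError("spectrum must contain a non-zero eigenvalue")
    return total * total / total_sq


# Toy covariance spectra over 64 dimensions (illustrative, not real data).
flat = [1.0] * 64                          # variance spread evenly over all axes
spiked = [1.0] + [0.0] * 63                # all variance on a single axis
decaying = [0.8 ** i for i in range(64)]   # geometrically decaying variance

print(participation_ratio(flat))     # 64.0
print(participation_ratio(spiked))   # 1.0
print(round(participation_ratio(decaying), 2))
```

A flat spectrum uses all 64 axes and a single spike uses one, so a reported value of 13.3 would sit between these extremes, with variance concentrated on a modest number of directions.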


cs.CL / 4 / 2604.18722

Scripts Through Time: A Survey of the Evolving Role of Transliteration in NLP

穿越时间的文字：转写在自然语言处理中不断演变的角色综述

Jayakumar, Thanmay, Halder, Deepon, Dabre, Raj

Abstract

Cross-lingual transfer in NLP is often hindered by the ``script barrier'' where differences in writing systems inhibit transfer learning between languages. Transliteration, the process of converting the script, has emerged as a powerful technique to bridge this gap by increasing lexical overlap. This paper provides a comprehensive survey of the application of transliteration in cross-lingual NLP. We present a taxonomy of key motivations to utilize transliterations in language models, and provide an overview of different approaches of incorporating transliterations as input. We analyze the evolution and effectiveness of these methods, discussing the critical trade-offs involved, and contextualize their need in modern LLMs. The review explores various settings that show how transliteration is beneficial, including handling code-mixed text, leveraging language family relatedness, and pragmatic gains in inference efficiency. Based on this analysis, we provide concrete recommendations for researchers on selecting and implementing the most appropriate transliteration strategy based on their specific language, task, and resource constraints.

Chinese Translation

自然语言处理中的跨语言迁移常常受到“文字障碍”的制约，即书写系统的差异阻碍了语言之间的迁移学习。转写（transliteration）是转换文字系统的过程，它通过增加词汇重叠，已成为弥合这一差距的有效技术。本文对转写在跨语言自然语言处理中的应用进行了全面综述。我们提出了在语言模型中使用转写的关键动机的分类，并概述了将转写作为输入的不同方法。我们分析了这些方法的演变和有效性，讨论了其中涉及的关键权衡，并结合现代大型语言模型（LLMs）阐明其必要性。本综述探讨了转写能够带来益处的多种场景，包括处理语码混合文本、利用语系亲缘关系以及在推理效率上的实际收益。基于此分析，我们为研究人员提供了具体建议，帮助其根据特定的语言、任务和资源限制选择并实施最合适的转写策略。


cs.CL / 5 / 2604.18729

Investigating Counterfactual Unfairness in LLMs towards Identities through Humor

通过幽默研究大型语言模型对身份的反事实不公平性

Kim, Shubin, Son, Yejin, Park, Junyeong, Ka, Keummin, Lee, Seungbeen, Lee, Jaeyoung, Jang, Hyeju, Oh, Alice, Yu, Youngjae

Abstract

Humor holds up a mirror to social perception: what we find funny often reflects who we are and how we judge others. When language models engage with humor, their reactions expose the social assumptions they have internalized from training data. In this paper, we investigate counterfactual unfairness through humor by observing how the model's responses change when we swap who speaks and who is addressed while holding other factors constant. Our framework spans three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction, covering both identity-agnostic humor and identity-specific disparagement humor. We introduce interpretable bias metrics that capture asymmetric patterns under identity swaps. Experiments across state-of-the-art models reveal consistent relational disparities: jokes told by privileged speakers are refused up to 67.5% more often, judged as malicious 64.7% more frequently, and rated up to 1.5 points higher in social harm on a 5-point scale. These patterns highlight how sensitivity and stereotyping coexist in generative models, complicating efforts toward fairness and cultural alignment.

Chinese Translation

幽默反映了社会认知：我们认为有趣的事物往往反映了我们的身份以及我们如何评判他人。当语言模型与幽默互动时，它们的反应揭示了它们从训练数据中内化的社会假设。本文通过观察在保持其他因素不变的情况下，交换说话者和被称呼者时模型的反应变化，研究了通过幽默表现出的反事实不公平性。我们的框架涵盖三个任务：幽默生成拒绝、说话者意图推断和关系/社会影响预测，涉及身份无关的幽默和特定身份的贬损幽默。我们引入了可解释的偏见度量，捕捉身份交换下的不对称模式。在最先进的模型上进行的实验揭示了一致的关系差异：特权说话者讲的笑话被拒绝的频率最多高出67.5%，被判定为恶意的频率高出64.7%，在5分制的社会危害评分中最多高出1.5分。这些模式突显了生成模型中敏感性和刻板印象的共存，使实现公平性和文化对齐的努力更加复杂。


cs.CL / 6 / 2604.18738

Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models

重掩码，而非替换：掩码扩散语言模型中的标记到掩码精炼

Yao, Lin

Abstract

Masked diffusion language models such as LLaDA2.1 rely on Token-to-Token (T2T) editing to correct their own generation errors: whenever a different token crosses a confidence threshold, the committed token is overwritten. We identify three structural failure modes of this rule. The trigger cannot fire when no single alternative is confident enough; the replacement is computed under a context that may itself contain errors; and the uniform perturbations used to train the T2T stream do not resemble the coherent, semantically plausible mistakes that the model actually makes at inference. As an alternative, we propose Token-to-Mask (T2M) remasking. Rather than overwriting a suspect token with a new guess, T2M resets the position to the mask state, so that the next denoising step re-predicts it from an in-distribution context. The method is training-free, modifies only the editing rule, and introduces no new parameters. We pair it with three detection heuristics and give a short theoretical account of why a mask is a better conditioning signal than an erroneous token. Across 8 benchmarks, T2M improves accuracy on tasks that require exact token-level output. Its largest gain is +5.92 points on CMATH, where we attribute 79.9% of baseline errors to last-mile corruption (correct reasoning followed by a garbled final answer); T2M repairs 41.3% of these cases.

Chinese Translation

掩码扩散语言模型如 LLaDA2.1 依赖于标记到标记（Token-to-Token, T2T）编辑来纠正自身生成的错误：每当一个不同的标记超过置信阈值时，已提交的标记便会被覆盖。我们识别出这一规则的三种结构性失败模式。当没有任何单个替代标记的置信度足够高时，触发器无法触发；替换是在可能自身包含错误的上下文中计算的；而用于训练 T2T 流的均匀扰动并不类似于模型在推理时实际产生的连贯且语义上合理的错误。作为替代方案，我们提出了标记到掩码（Token-to-Mask, T2M）重掩码。T2M 不是用新的猜测覆盖可疑标记，而是将位置重置为掩码状态，以便下一个去噪步骤从分布内上下文重新预测它。该方法无需训练，仅修改编辑规则，并且不引入新参数。我们将其与三种检测启发式相结合，并从理论上简要说明为什么掩码是比错误标记更好的条件信号。在 8 个基准测试中，T2M 提高了需要精确标记级输出的任务的准确性。其最大增益为 CMATH 上的 +5.92 分，我们将基线错误的 79.9% 归因于最后一公里的损坏（正确推理后跟随一个混乱的最终答案）；T2M 修复了其中 41.3% 的案例。
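
The contrast between the two editing rules can be sketched on a toy sequence. The T2T rule below follows the abstract's description (overwrite when a different token crosses a confidence threshold). The T2M detection heuristic is an assumption made for illustration, since the abstract does not specify the paper's three heuristics: here a position is flagged when the committed token's own probability falls below a hypothetical `low` cutoff. All names and thresholds are hypothetical.

```python
MASK = "<mask>"


def t2t_edit(tokens, dists, tau=0.9):
    """Token-to-Token: overwrite a committed token when a *different*
    token's probability reaches the confidence threshold tau."""
    out = list(tokens)
    for i, (tok, dist) in enumerate(zip(tokens, dists)):
        best = max(dist, key=dist.get)
        if best != tok and dist[best] >= tau:
            out[i] = best
    return out


def t2m_edit(tokens, dists, low=0.2):
    """Token-to-Mask: reset a suspect position to the mask state.
    Assumed heuristic (not from the paper): the committed token's own
    probability under the current prediction is below `low`."""
    out = list(tokens)
    for i, (tok, dist) in enumerate(zip(tokens, dists)):
        if dist.get(tok, 0.0) < low:
            out[i] = MASK
    return out


tokens = ["x", "=", "41"]
dists = [
    {"x": 0.97, "y": 0.03},
    {"=": 0.99, "+": 0.01},
    # The model doubts "41", but no single alternative is confident.
    {"41": 0.05, "42": 0.50, "43": 0.45},
]
print(t2t_edit(tokens, dists))  # ['x', '=', '41']
print(t2m_edit(tokens, dists))  # ['x', '=', '<mask>']
```

This reproduces the first failure mode named in the abstract: T2T leaves the doubtful token in place because no alternative reaches the threshold, while remasking hands the position back to the next denoising step.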


cs.CL / 7 / 2604.18758

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

语法作为罗塞塔石：用于上下文中科普特语翻译的通用依赖

Purushothama, Abhishek, Thronson, Emma, Guo, Alexia, Zeldes, Amir

Abstract

Low-resource machine translation requires methods that differ from those used for high-resource languages. This paper proposes a novel in-context learning approach to support low-resource machine translation of the Coptic language to English, with syntactic augmentation from Universal Dependencies parses of input sentences. Building on existing work using bilingual dictionaries to support inference for vocabulary items, we add several representations of syntactic analyses to our inputs, specifically exploring the inclusion of raw parser outputs, verbalizations of parses in plain English, and targeted instructions of difficult constructions identified in sub-trees and how they can be translated. Our results show that while syntactic information alone is not as useful as dictionary-based glosses, combining retrieved dictionary items with syntactic information achieves significant gains across model sizes, achieving new state-of-the-art translation results for Coptic.

Chinese Translation

低资源机器翻译需要与高资源语言不同的方法。本文提出了一种新颖的上下文学习方法，以支持科普特语到英语的低资源机器翻译，并通过输入句子的通用依赖解析进行语法增强。在现有使用双语词典支持词汇项推理的工作基础上，我们向输入中添加了几种语法分析的表示，特别探讨了原始解析器输出、用普通英语表述的解析结果以及针对在子树中识别出的困难结构的具体翻译指令的包含。我们的结果表明，尽管单独的语法信息不如基于词典的释义有用，但将检索到的词典项与语法信息结合起来，在不同模型规模上都取得了显著的提升，为科普特语翻译实现了新的最先进结果。


cs.CL / 8 / 2604.18759

Model-Agnostic Meta Learning for Class Imbalance Adaptation

模型无关的元学习用于类别不平衡适应

Rao, Hanshu, Han, Guangzeng, Huang, Xiaolei

Abstract

Class imbalance is a widespread challenge in NLP tasks, significantly hindering robust performance across diverse domains and applications. We introduce Hardness-Aware Meta-Resample (HAMR), a unified framework that adaptively addresses both class imbalance and data difficulty. HAMR employs bi-level optimizations to dynamically estimate instance-level weights that prioritize genuinely challenging samples and minority classes, while a neighborhood-aware resampling mechanism amplifies training focus on hard examples and their semantically similar neighbors. We validate HAMR on six imbalanced datasets covering multiple tasks and spanning biomedical, disaster response, and sentiment domains. Experimental results show that HAMR achieves substantial improvements for minority classes and consistently outperforms strong baselines. Extensive ablation studies demonstrate that our proposed modules synergistically contribute to performance gains and highlight HAMR as a flexible and generalizable approach for class imbalance adaptation. Code is available at https://github.com/trust-nlp/ImbalanceLearning.

Chinese Translation

类别不平衡是自然语言处理（NLP）任务中普遍存在的挑战，显著阻碍了在不同领域和应用中的稳健性能。我们提出了困难感知元重采样（Hardness-Aware Meta-Resample, HAMR），这是一个统一框架，能够自适应地解决类别不平衡和数据难度问题。HAMR采用双层优化，动态估计实例级权重，以优先考虑真正具有挑战性的样本和少数类，同时，邻域感知重采样机制增强了对困难示例及其语义相似邻居的训练关注。我们在六个涵盖多个任务的类别不平衡数据集上验证了HAMR，这些数据集涉及生物医学、灾害响应和情感分析领域。实验结果表明，HAMR在少数类上实现了显著的改进，并且始终优于强基线。广泛的消融研究表明，我们提出的模块协同促进了性能提升，并突显了HAMR作为一种灵活且可推广的类别不平衡适应方法的潜力。代码可在 https://github.com/trust-nlp/ImbalanceLearning 获取。


cs.CL / 9 / 2604.18775

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

针对大型语言模型越狱检测的多代采样实证研究

Luo, Hanrui, Gowda, Shreyank N

Abstract

Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only rarely. In this work, we present an empirical study of output-based jailbreak detection under realistic conditions using the JailbreakBench Behaviors dataset and multiple generator models with varying alignment strengths. We evaluate both a lexical TF-IDF detector and a generation-inconsistency-based detector across different sampling budgets. Our results show that single-output evaluation systematically underestimates jailbreak vulnerability, as increasing the number of sampled generations reveals additional harmful behaviour. The most significant improvements occur when moving from a single generation to moderate sampling, while larger sampling budgets yield diminishing returns. Cross-generator experiments demonstrate that detection signals partially generalise across models, with stronger transfer observed within related model families. A category-level analysis further reveals that lexical detectors capture a mixture of behavioural signals and topic-specific cues, rather than purely harmful behaviour. Overall, our findings suggest that moderate multi-sample auditing provides a more reliable and practical approach for estimating model vulnerability and improving jailbreak detection in large language models. Code will be released.

Chinese Translation

在大型语言模型中检测越狱行为仍然具有挑战性，特别是当高度对齐的模型仅偶尔产生有害输出时。在本研究中，我们基于现实条件下的输出进行了越狱检测的实证研究，使用了 JailbreakBench Behaviors 数据集和多种生成器模型，具有不同的对齐强度。我们评估了词汇 TF-IDF 检测器和基于生成不一致性的检测器在不同采样预算下的表现。我们的结果表明，单一输出评估系统性地低估了越狱脆弱性，因为增加采样生成的数量会揭示额外的有害行为。最显著的改进发生在从单一生成转向适度采样时，而更大的采样预算则收益递减。跨生成器实验表明，检测信号在模型之间部分泛化，相关模型家族之间观察到更强的迁移。类别级别分析进一步揭示，词汇检测器捕捉到的是行为信号和主题特定线索的混合，而不仅仅是有害行为。总体而言，我们的研究结果表明，适度的多样本审计提供了一种更可靠和实用的方法，用于估计模型脆弱性并改善大型语言模型中的越狱检测。代码将会发布。
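
One way to see why single-output evaluation underestimates vulnerability, and why larger budgets show diminishing returns, is a simple probability sketch. It assumes each sampled generation is harmful independently with a fixed rate `p_harmful`. That independence assumption and the 5% rate are illustrative choices, not figures from the paper.

```python
def detection_probability(p_harmful, k):
    """Chance that at least one of k independent samples is harmful."""
    return 1.0 - (1.0 - p_harmful) ** k


p = 0.05  # hypothetical per-sample harmful rate for a strongly aligned model
for k in (1, 5, 10, 20, 50):
    print(k, round(detection_probability(p, k), 3))
```

Under these assumptions the gain per additional sample is largest at small budgets and shrinks as `k` grows, which matches the qualitative pattern the abstract describes.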


cs.CL / 10 / 2604.18779

Mango: Multi-Agent Web Navigation via Global-View Optimization

Mango：基于全局视图优化的多智能体网页导航

Tong, Weixi, Di, Yifeng, Zhang, Tianyi

Abstract

Existing web agents typically initiate exploration from the root URL, which is inefficient for complex websites with deep hierarchical structures. Without a global view of the website's structure, agents frequently fall into navigation traps, explore irrelevant branches, or fail to reach target information within a limited budget. We propose Mango, a multi-agent web navigation method that leverages the website structure to dynamically determine optimal starting points. We formulate URL selection as a multi-armed bandit problem and employ Thompson Sampling to adaptively allocate the navigation budget across candidate URLs. Furthermore, we introduce an episodic memory component to store navigation history, enabling the agent to learn from previous attempts. Experiments on WebVoyager demonstrate that Mango achieves a success rate of 63.6% when using GPT-5-mini, outperforming the best baseline by 7.3%. Furthermore, on WebWalkerQA, Mango attains a 52.5% success rate, surpassing the best baseline by 26.8%. We also demonstrate the generalizability of Mango using both open-source and closed-source models as backbones. Our data and code are open-source and available at https://github.com/VichyTong/Mango.

Chinese Translation

现有的网页智能体通常从根网址开始探索，这对于具有深层次层级结构的复杂网站来说效率较低。在缺乏网站结构的全局视图的情况下，智能体常常陷入导航陷阱，探索无关的分支，或在有限的预算内无法找到目标信息。我们提出了Mango，一种利用网站结构动态确定最佳起始点的多智能体网页导航方法。我们将网址选择形式化为多臂老虎机问题，并采用汤普森采样（Thompson Sampling）自适应地在候选网址之间分配导航预算。此外，我们引入了一个情节记忆组件来存储导航历史，使智能体能够从之前的尝试中学习。在WebVoyager上的实验表明，当使用GPT-5-mini时，Mango的成功率达到63.6%，比最佳基线高出7.3%。此外，在WebWalkerQA上，Mango的成功率为52.5%，超出最佳基线26.8%。我们还展示了Mango的通用性，使用开源和闭源模型作为骨干。我们的数据和代码是开源的，已在https://github.com/VichyTong/Mango上提供。
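
The bandit formulation in the abstract can be illustrated with a generic Beta-Bernoulli Thompson Sampling loop. This is a sketch under stated assumptions rather than Mango's implementation: the candidate URLs, their simulated success probabilities, and the function name are all hypothetical, and a real agent would replace the simulated outcome with the result of an actual navigation attempt.

```python
import random


def thompson_allocate(success_rates, budget, seed=0):
    """Allocate `budget` navigation attempts across candidate URLs.

    `success_rates` maps URL -> simulated success probability (hypothetical
    stand-in for running the web agent from that starting point).
    """
    rng = random.Random(seed)
    # Beta(1, 1) prior per URL, stored as [alpha, beta].
    posterior = {url: [1, 1] for url in success_rates}
    pulls = {url: 0 for url in success_rates}
    for _ in range(budget):
        # Draw one plausible success rate per URL and start from the best draw.
        url = max(posterior, key=lambda u: rng.betavariate(*posterior[u]))
        success = rng.random() < success_rates[url]
        posterior[url][0 if success else 1] += 1
        pulls[url] += 1
    return pulls


rates = {"/docs/api": 0.8, "/": 0.1, "/blog": 0.1}
print(thompson_allocate(rates, budget=300))
```

Because each draw comes from the current posterior, starting points that keep failing are sampled less often over time, so most of the budget shifts toward the URL that tends to succeed.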


cs.CL / 11 / 2604.18786

Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

实验还是结果？探讨大型语言模型中的科学可行性

Mohammadi, Seyedali, Gaur, Manas, Ferraro, Francis

Abstract

Scientific feasibility assessment asks whether a claim is consistent with established knowledge and whether experimental evidence could support or refute it. We frame feasibility assessment as a diagnostic reasoning task in which, given a hypothesis, a model predicts feasible or infeasible and justifies its decision. We evaluate large language models (LLMs) under controlled knowledge conditions (hypothesis-only, with experiments, with outcomes, or both) and probe robustness by progressively removing portions of the experimental and/or outcome context. Across multiple LLMs and two datasets, providing outcome evidence is generally more reliable than providing experiment descriptions. Outcomes tend to improve accuracy beyond what internal knowledge alone provides, whereas experimental text can be brittle and may degrade performance when the context is incomplete. These findings clarify when experimental evidence benefits LLM-based feasibility assessment and when it introduces fragility.

Chinese Translation

科学可行性评估旨在判断一个主张是否与已建立的知识一致，以及实验证据是否能够支持或反驳该主张。我们将可行性评估框架设定为一种诊断推理任务，在该任务中，给定一个假设，模型预测其可行或不可行，并为其决策提供理由。我们在受控知识条件下（仅假设、带实验、带结果或两者兼具）评估大型语言模型（LLMs），并通过逐步移除实验和/或结果上下文的部分内容来探讨其鲁棒性。在多个LLM和两个数据集的实验中，提供结果证据通常比提供实验描述更为可靠。结果往往能够提升准确性，超出仅依赖内部知识所能提供的水平，而实验文本可能较为脆弱，当上下文不完整时可能会降低性能。这些发现阐明了何时实验证据有助于基于LLM的可行性评估，以及何时它会引入脆弱性。


cs.CL / 12 / 2604.18835

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

文档干草堆中的语义针：LLM作为评判者的相似性评分敏感性测试

Aksoy, Sinan G., Sabrio, Alexandra A., VonKaenel, Erik, Burke, Lee

Abstract

We propose a scalable, multifactorial experimental framework that systematically probes LLM sensitivity to subtle semantic changes in pairwise document comparison. We analogize this as a needle-in-a-haystack problem: a single semantically altered sentence (the needle) is embedded within surrounding context (the hay), and we vary the perturbation type (negation, conjunction swap, named entity replacement), context type (original vs. topically unrelated), needle position, and document length across all combinations, testing five LLMs on tens of thousands of document pairs. Our analysis reveals several striking findings. First, LLMs exhibit a within-document positional bias distinct from previously studied candidate-order effects: most models penalize semantic differences more harshly when they occur earlier in a document. Second, when the altered sentence is surrounded by topically unrelated context, it systematically lowers similarity scores and induces bipolarized scores that indicate either very low or very high similarity. This is consistent with an interpretive frame account in which topically-related context may allow models to contextualize and downweight the alterations. Third, each LLM produces a qualitatively distinct scoring distribution, a stable "fingerprint" that is invariant to perturbation type, yet all models share a universal hierarchy in how leniently they treat different perturbation types. Together, these results demonstrate that LLM semantic similarity scores are sensitive to document structure, context coherence, and model identity in ways that go beyond the semantic change itself, and that the proposed framework offers a practical, LLM-agnostic toolkit for auditing and comparing scoring behavior across current and future models.

Chinese Translation

我们提出了一个可扩展的多因素实验框架，系统性地探讨LLM对成对文档比较中细微语义变化的敏感性。我们将其类比为在干草堆中寻找针的问题：一个语义上被改变的句子（针）嵌入在周围的上下文中（干草），我们在所有组合中变化扰动类型（否定、结合交换、命名实体替换）、上下文类型（原始与主题无关）、针的位置和文档长度，对数万个文档对进行五种LLM的测试。我们的分析揭示了几个显著的发现。首先，LLM在文档内部表现出一种与先前研究的候选顺序效应不同的位置信息偏差：大多数模型对文档早期出现的语义差异给予更严厉的惩罚。其次，当被改变的句子被主题无关的上下文包围时，它系统性地降低相似性评分，并引发两极化的评分，表明相似性非常低或非常高。这与解释框架的观点一致，即主题相关的上下文可能使模型能够对变化进行上下文化并降低其权重。第三，每个LLM产生了质上不同的评分分布，这是一个稳定的“指纹”，对扰动类型不变，但所有模型在对不同扰动类型的宽容程度上共享一个普遍的等级体系。这些结果共同表明，LLM的语义相似性评分对文档结构、上下文连贯性和模型身份敏感，这些影响超出了语义变化本身，而所提出的框架为审计和比较当前及未来模型的评分行为提供了一个实用的、与LLM无关的工具包。


cs.CL / 13 / 2604.18878

LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification

LegalBench-BR：评估大型语言模型在巴西法律判决分类中的基准

Neto, Pedro Barbosa de Carvalho

Abstract

We introduce LegalBench-BR, the first public benchmark for evaluating language models on Brazilian legal text classification. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API (CNJ) and annotated across five legal areas through LLM-assisted labeling with heuristic validation. On a class-balanced test set, BERTimbau-LoRA, updating only 0.3% of model parameters, achieves 87.6% accuracy and 0.87 macro-F1 (+22pp over Claude 3.5 Haiku, +28pp over GPT-4o mini). The gap is most striking on administrativo (administrative law): GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on this class, while the fine-tuned model reaches F1 = 0.91. Both commercial LLMs exhibit a systematic bias toward civel (civil law), absorbing ambiguous classes rather than discriminating them, a failure mode that domain-adapted fine-tuning eliminates. These results demonstrate that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even when the task is a simple 5-class problem, and that LoRA fine-tuning on a consumer GPU closes the gap at zero marginal inference cost. We release the full dataset, model, and pipeline to enable reproducible research in Portuguese legal NLP.

Chinese Translation

我们介绍了LegalBench-BR，这是第一个用于评估语言模型在巴西法律文本分类中的公共基准。数据集包含来自圣卡塔琳娜州法院（TJSC）的3,105个上诉案件，通过DataJud API（CNJ）收集，并通过LLM辅助标注和启发式验证在五个法律领域进行了注释。在一个类别平衡的测试集上，BERTimbau-LoRA仅更新0.3%的模型参数，达到了87.6%的准确率和0.87的宏F1（比Claude 3.5 Haiku提高22个百分点，比GPT-4o mini提高28个百分点）。在行政法（administrativo）类别上，差距尤为显著：GPT-4o mini的F1得分为0.00，而Claude 3.5 Haiku的F1得分为0.08，而经过微调的模型达到了F1 = 0.91。这两款商业LLM在民法（civel）上表现出系统性偏差，吸收模糊类别而非区分它们，这种失败模式在领域适应的微调中得以消除。这些结果表明，通用的LLM无法替代领域适应模型在巴西法律分类中的作用，即使任务是一个简单的五类问题，并且在消费级GPU上进行LoRA微调可以在零边际推理成本下缩小差距。我们发布完整的数据集、模型和流程，以支持葡萄牙语法律自然语言处理中的可重复研究。
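
The abstract reports macro-F1 alongside accuracy and notes a class on which one model scores F1 = 0.00. A small sketch shows how a single collapsed class pulls macro-F1 down even when accuracy looks moderate. The labels `civel` and `administrativo` come from the abstract; the third label `outro` and all predictions are invented toy data.

```python
def f1_per_class(y_true, y_pred, labels):
    """Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 when the class never appears."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


labels = ["civel", "administrativo", "outro"]  # "outro" is a hypothetical class
y_true = ["civel"] * 4 + ["administrativo"] * 4 + ["outro"] * 4
# Toy classifier that absorbs every administrativo case into civel.
y_pred = ["civel"] * 4 + ["civel"] * 4 + ["outro"] * 4

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3))
print(f1_per_class(y_true, y_pred, labels))
print(round(macro_f1(y_true, y_pred, labels), 3))
```

In this toy case two thirds of the predictions are correct, yet the absorbed class contributes an F1 of zero and the absorbing class loses precision, which is the kind of bias toward one label that macro-F1 exposes.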


cs.CL / 14 / 2604.18880

Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs

虚假引用产生于何处：将字段级幻觉追溯到大型语言模型中的特定神经元

Chen, Yuefei, Quan, Yihao, Lin, Xiaodong, Tang, Ruixiang

Abstract

LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.

Chinese Translation

大型语言模型（LLMs）经常生成虚构但令人信服的引用，即使所依据的参考文献有误，也常常表现出高度自信。我们研究了9个模型和108,000个生成引用中的这种失败，发现作者姓名在所有模型和设置中失败的频率远高于其他字段。引用风格没有可测量的影响，而以推理为导向的蒸馏会降低召回率。在一个字段上训练的探针迁移到其他字段时效果接近随机水平，这表明幻觉信号在不同字段之间并不通用。基于这一发现，我们对Qwen2.5-32B-Instruct的神经元级CETT值应用了弹性网正则化和稳定性选择，识别出一组稀疏的字段特定幻觉神经元（FH-neurons）。因果干预进一步确认了它们的作用：增强这些神经元会增加幻觉，而抑制它们则在各个字段上提高性能，某些字段的增益更大。这些结果表明，仅使用模型内部信号来检测和减轻引用幻觉的轻量级方法是可行的。


cs.CL / 15 / 2604.18892

Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness

优先选择最佳：通过奖励超越答案正确性来激励可靠的多模态推理

Jia, Mengzhao, Zhang, Zhihan, Jiang, Meng

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) improves multimodal reasoning by rewarding verifiable final answers. Yet answer-correct trajectories may still rely on incomplete derivations, weak evidence, or statements that contradict their conclusions. This gap between answer correctness and reasoning validity, which we call reasoning-answer inconsistency, motivates trajectory supervision in multimodal RL. We compare two main approaches: reward models (RMs) and Generative Rewards (GRs). RMs are efficient and help early in training, but their gains weaken as the policy distribution shifts; GRs improve performance, but may give unstable rewards and are computationally expensive. We therefore propose Groupwise Ranking Reward, which ranks verifier-passed trajectories for the same prompt in one pass and redistributes reward accordingly. Groupwise comparison better separates stronger and weaker correct trajectories with lower judge overhead than GRs. Experiments show that RLVR aggravates reasoning-answer inconsistency, while trajectory supervision alleviates it. Groupwise Ranking Reward performs best overall, improving reliability-conditioned accuracy from 47.4% to 54.7% over RLVR.

Chinese Translation

可验证奖励的强化学习（Reinforcement Learning with Verifiable Rewards, RLVR）通过奖励可验证的最终答案来改善多模态推理。然而，答案正确的轨迹仍可能依赖于不完整的推导、薄弱的证据或与其结论相矛盾的陈述。我们称之为推理-答案不一致的这种答案正确性与推理有效性之间的差距，促使我们在多模态强化学习中进行轨迹监督。我们比较了两种主要方法：奖励模型（Reward Models, RMs）和生成奖励（Generative Rewards, GRs）。RMs高效且在训练初期有帮助，但随着策略分布的变化，其收益会减弱；GRs提高了性能，但可能会给出不稳定的奖励并且计算成本高。因此，我们提出了组内排名奖励（Groupwise Ranking Reward），该方法在一次传递中对同一提示的验证通过轨迹进行排名，并相应地重新分配奖励。组内比较在较低的评估开销下更好地区分了强和弱的正确轨迹。实验表明，RLVR加剧了推理-答案不一致，而轨迹监督则缓解了这一问题。组内排名奖励在整体表现上最佳，使得基于可靠性的准确率从47.4%提高到54.7%，超越了RLVR。
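
The abstract describes ranking verifier-passed trajectories for one prompt and redistributing reward by rank, without giving the exact formula. The sketch below is one possible redistribution scheme, stated as an assumption: linear weights by rank, rescaled so the group's total reward equals the number of passed trajectories (the total that plain answer-correctness reward would assign). The function name and the weighting are hypothetical.

```python
def groupwise_ranking_reward(passed, judge_rank):
    """Redistribute reward within one prompt's group of trajectories.

    passed[i]     -- True if trajectory i passed the answer verifier
    judge_rank[i] -- rank from a single judge pass over the passed
                     trajectories (0 = best reasoning); ignored otherwise
    Assumed scheme (not from the paper): linear weights by rank, rescaled
    so the rewards sum to the number of passed trajectories.
    """
    idx = [i for i, ok in enumerate(passed) if ok]
    rewards = [0.0] * len(passed)
    if not idx:
        return rewards
    n = len(idx)
    ordered = sorted(idx, key=lambda i: judge_rank[i])
    weights = [n - pos for pos in range(n)]  # n, n-1, ..., 1
    scale = n / sum(weights)
    for pos, i in enumerate(ordered):
        rewards[i] = weights[pos] * scale
    return rewards


passed = [True, False, True, True]
judge_rank = [1, None, 0, 2]
print(groupwise_ranking_reward(passed, judge_rank))  # [1.0, 0.0, 1.5, 0.5]
```

Trajectories that fail the verifier still receive zero, while correct trajectories with stronger reasoning receive more than correct trajectories with weaker reasoning, and the group total is unchanged.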


cs.CL / 16 / 2604.18897

Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning

少即是多：认知负荷与大型语言模型数学推理中的单一提示上限

Cazares, Manuel Israel

Abstract

We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas -- a problem that is undecidable in general but decidable for FALSE via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B). Our central finding is a single-prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60--79% for gpt-oss-120b, compared to a 59.75% no-cheatsheet baseline. We identify three mechanisms underlying this ceiling: (1) the mathematical undecidability of the TRUE case limits what any finite prompt can encode; (2) complex rule systems decrease performance on weaker models (Llama 3.3 70B collapses to 0% TRUE recall with prompts exceeding 2KB); and (3) prompt ordering effects interact with model attention in fragile, non-monotonic ways. Our best submission (AN45c, 2,252 bytes) achieves 79.25% accuracy on hard3 (n=400; 95% CI: [75.0%, 82.9%]), with TRUE recall of 95.9% and FALSE recall of 63.4%, representing a +19.5 percentage-point improvement over the no-cheatsheet baseline (59.75%). We release all prompt variants, evaluation scripts, and results at https://github.com/israelcazares/sair-prompt-engineering

Chinese Translation

我们在SAIR方程理论第一阶段竞赛的背景下，进行了关于正式数学推理的提示工程的系统实证研究。该任务要求判断一个方程法则是否在所有原群（magma）上蕴含另一个法则——这是一个一般情况下不可判定的问题，但通过有限模型搜索对于FALSE是可判定的。在五周的时间里，我们设计、测试并分析了超过40种提示变体，大小范围从0到4,878字节，涵盖了四个评估分组和三个语言模型（gpt-oss-120b、Llama 3.3 70B、Gemma 4 31B）。我们的主要发现是单一提示上限：尽管进行了大量的工程努力，gpt-oss-120b的平衡难度准确率在大约60%到79%的实证饱和区域内停滞不前，而无备忘单基线为59.75%。我们识别出三个导致这一上限的机制：（1）TRUE案例的数学不可判定性限制了任何有限提示所能编码的内容；（2）复杂的规则系统降低了较弱模型的性能（Llama 3.3 70B在提示超过2KB时TRUE召回率降至0%）；（3）提示顺序效应以脆弱的非单调方式与模型注意力相互作用。我们的最佳提交（AN45c，2,252字节）在hard3（n=400；95%置信区间：[75.0%，82.9%]）上达到了79.25%的准确率，TRUE召回率为95.9%，FALSE召回率为63.4%，相比无备忘单基线（59.75%）提高了19.5个百分点。我们在https://github.com/israelcazares/sair-prompt-engineering发布了所有提示变体、评估脚本和结果。
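
The remark that the FALSE case is decidable "via finite model search" can be made concrete: to refute an implication between two equational laws, it is enough to find one finite magma (a set with a binary operation) where the first law holds and the second fails. The brute-force sketch below enumerates all operation tables up to a small size. It is an illustration of the idea and not the competition's tooling, and the helper names are hypothetical.

```python
from itertools import product


def magmas(n):
    """Yield every binary operation table on {0, ..., n-1}."""
    cells = [(a, b) for a in range(n) for b in range(n)]
    for values in product(range(n), repeat=n * n):
        yield dict(zip(cells, values))


def holds(law, table, n, arity):
    """Check a law for every assignment of its variables."""
    def op(a, b):
        return table[(a, b)]
    return all(law(op, *xs) for xs in product(range(n), repeat=arity))


def find_counterexample(premise, p_arity, conclusion, c_arity, max_size=3):
    """Return (size, table) where premise holds but conclusion fails, else None."""
    for n in range(1, max_size + 1):
        for table in magmas(n):
            if holds(premise, table, n, p_arity) and not holds(conclusion, table, n, c_arity):
                return n, table
    return None


def commutative(op, x, y):
    return op(x, y) == op(y, x)


def associative(op, x, y, z):
    return op(op(x, y), z) == op(x, op(y, z))


def left_projection(op, x, y):
    return op(x, y) == x


# Commutativity does not imply associativity: a 2-element witness exists.
print(find_counterexample(commutative, 2, associative, 3, max_size=2))
# x*y = x does imply associativity, so the search finds nothing.
print(find_counterexample(left_projection, 2, associative, 3, max_size=2))
```

A returned table is a definitive FALSE answer. An empty search only shows that no small counterexample exists, which is why the TRUE case cannot be settled this way.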


cs.CL / 17 / 2604.18913

LogosKG: Hardware-Optimized Scalable and Interpretable Knowledge Graph Retrieval

LogosKG：硬件优化的可扩展且可解释的知识图谱检索

Cheng, He, Wu, Yifu, Khatwani, Saksham, Kruse, Maya, Dligach, Dmitriy, Miller, Timothy A., Afshar, Majid, Gao, Yanjun

Abstract

Knowledge graphs (KGs) are increasingly integrated with large language models (LLMs) to provide structured, verifiable reasoning. A core operation in this integration is multi-hop retrieval, yet existing systems struggle to balance efficiency, scalability, and interpretability. We introduce LogosKG, a novel, hardware-aligned framework that enables scalable and interpretable k-hop retrieval on large KGs by building on symbolic KG formulations and executing traversal as hardware-efficient operations over decomposed subject, object, and relation representations. To scale to billion-edge graphs, LogosKG integrates degree-aware partitioning, cross-graph routing, and on-demand caching. Experiments show substantial efficiency gains over CPU and GPU baselines without loss of retrieval fidelity. With proven performance in KG retrieval, a downstream two-round KG-LLM interaction demonstrates how LogosKG enables large-scale, evidence-grounded analysis of how KG topology, such as hop distribution and connectivity, shapes the alignment between structured biomedical knowledge and LLM diagnostic reasoning, thereby opening the door for next-generation KG-LLM integration. The source code is publicly available at https://github.com/LARK-NLP-Lab/LogosKG, and an online demo is available at https://lark-nlp-lab-logoskg.hf.space/.

Chinese Translation

知识图谱（KGs）正越来越多地与大型语言模型（LLMs）集成，以提供结构化和可验证的推理。这种集成中的核心操作是多跳检索，但现有系统在效率、可扩展性和可解释性之间难以取得平衡。我们提出了LogosKG，这是一种新颖的、与硬件对齐的框架，通过建立在符号知识图谱表达基础上，并将遍历作为对分解的主题、对象和关系表示的硬件高效操作来执行，从而实现对大型知识图谱的可扩展和可解释的k-hop检索。为了扩展到十亿边的图，LogosKG集成了度感知分区、跨图路由和按需缓存。实验表明，在不损失检索准确度的情况下，相较于CPU和GPU基线，效率显著提高。通过在知识图谱检索中的验证性能，下游的两轮KG-LLM交互展示了LogosKG如何支持大规模、基于证据的分析，揭示知识图谱拓扑（如跳数分布和连通性）如何影响结构化生物医学知识与LLM诊断推理之间的对齐，从而为下一代KG-LLM集成开辟了新的可能。源代码已公开，地址为 https://github.com/LARK-NLP-Lab/LogosKG，在线演示可访问 https://lark-nlp-lab-logoskg.hf.space/。
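
For readers unfamiliar with k-hop retrieval, a plain frontier expansion over (subject, relation, object) triples shows what the operation returns. This sketch is a simple Python baseline, not LogosKG's hardware-efficient formulation, and it omits partitioning, routing, and caching. The triples and function names are hypothetical.

```python
from collections import defaultdict


def build_index(triples):
    """Index (subject, relation, object) triples by subject."""
    index = defaultdict(list)
    for s, r, o in triples:
        index[s].append((r, o))
    return index


def k_hop(index, seeds, k):
    """Return {entity: path} for entities reachable within k hops.

    Each path is the list of (subject, relation, object) edges that first
    reached the entity, which keeps every retrieved item traceable.
    """
    paths = {s: [] for s in seeds}
    frontier = list(seeds)
    for _ in range(k):
        next_frontier = []
        for s in frontier:
            for r, o in index.get(s, []):
                if o not in paths:
                    paths[o] = paths[s] + [(s, r, o)]
                    next_frontier.append(o)
        frontier = next_frontier
    return paths


# Hypothetical toy triples for illustration only.
triples = [
    ("fever", "symptom_of", "influenza"),
    ("influenza", "treated_by", "oseltamivir"),
    ("oseltamivir", "is_a", "antiviral"),
]
index = build_index(triples)
result = k_hop(index, ["fever"], k=2)
print(sorted(result))          # ['fever', 'influenza', 'oseltamivir']
print(result["oseltamivir"])
```

Returning the edge path with each entity is one simple way to keep retrieval interpretable, since every result can be traced back to the seed through named relations.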


cs.CL / 18 / 2604.18914

MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

MORPHOGEN：评估性别感知形态生成的多语言基准

Agarwal, Mehul, Aggarwal, Aditya, Goel, Arnav, Hira, Medha, Gupta, Anubha

Abstract

While multilingual large language models (LLMs) perform well on high-level tasks like translation and question answering, their ability to handle grammatical gender and morphological agreement remains underexplored. In morphologically rich languages, gender influences verb conjugation, pronouns, and even first-person constructions with explicit and implicit mentions of gender. We introduce MORPHOGEN, a morphologically grounded large-scale benchmark dataset for evaluating gender-aware generation in three typologically diverse grammatically gendered languages: French, Arabic, and Hindi. The core task, GENFORM, requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. We construct a high-quality synthetic dataset spanning these three languages and benchmark 15 popular multilingual LLMs (2B-70B) on their ability to perform this transformation. Our results reveal significant gaps and interesting insights into how current models handle morphological gender. MORPHOGEN provides a focused diagnostic lens for gender-aware language modeling and lays the groundwork for future research on inclusive and morphology-sensitive NLP.

Chinese Translation

尽管多语言大型语言模型（LLMs）在翻译和问答等高层次任务中表现良好，但它们处理语法性别和形态一致性的能力仍然未得到充分探索。在形态丰富的语言中，性别影响动词变位、代词，甚至是带有明确和隐含性别提及的第一人称结构。我们引入了MORPHOGEN，这是一个以形态为基础的大规模基准数据集，用于评估三种类型学上多样的语法性别语言（法语、阿拉伯语和印地语）中的性别意识生成。核心任务GENFORM要求模型在保留句子意义和结构的同时，将第一人称句子重写为相反性别。我们构建了一个涵盖这三种语言的高质量合成数据集，并对15个流行的多语言LLMs（2B-70B）在执行此转换能力上进行了基准测试。我们的结果揭示了当前模型在处理形态性别方面的显著差距和有趣的见解。MORPHOGEN为性别意识语言建模提供了一个聚焦的诊断视角，并为未来关于包容性和形态敏感的自然语言处理研究奠定了基础。


cs.CL / 19 / 2604.18919

Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data

提出主题模型和评估框架以分析与外部结果的关联：基于大规模企业评价数据的领导力分析应用

Yoshida, Yura, Kanai, Masato, Nakayama, Masataka, Ohsawa, Haruki, Uchida, Yukiko, Yuminaga, Arata, Hoshina, Gakuse, Sayama, Nobuo

Abstract

Analyzing topics extracted from text data in relation to external outcomes is important across fields such as computational social science and organizational research. However, existing topic modeling methods struggle to simultaneously achieve interpretability, topic specificity (alignment with concrete actions or characteristics), and polarity stance consistency (absence of mixed positive and negative evaluations within a topic). Focusing on leadership analysis using corporate review data, this study proposes a method leveraging large language models to generate topics that satisfy these properties, along with an evaluation framework tailored to external outcome analysis. The framework explicitly incorporates topic specificity and polarity stance consistency as evaluation criteria and examines automated evaluation methods based on existing metrics. Using employee reviews from OpenWork, a major corporate review platform in Japan, the proposed method achieves improved interpretability, specificity, and polarity consistency compared to existing approaches. In analyses of external outcomes such as employee morale, it also produces topics with higher explanatory power. These results suggest that the proposed method and evaluation framework provide a generalized approach for topic analysis in applications involving external outcomes.

Chinese Translation

在计算社会科学和组织研究等领域，分析从文本数据中提取的主题与外部结果之间的关系至关重要。然而，现有的主题建模方法在同时实现可解释性、主题特异性（与具体行动或特征的一致性）和极性立场一致性（主题内缺乏混合的正面和负面评估）方面存在困难。本研究聚焦于利用企业评价数据进行领导力分析，提出了一种利用大型语言模型生成满足这些特性的主题的方法，并提出了一个针对外部结果分析的评估框架。该框架明确将主题特异性和极性立场一致性作为评估标准，并考察基于现有指标的自动化评估方法。通过使用日本主要企业评价平台OpenWork的员工评价，所提出的方法在可解释性、特异性和极性一致性方面优于现有方法。在对员工士气等外部结果的分析中，该方法还生成了具有更高解释力的主题。这些结果表明，所提出的方法和评估框架为涉及外部结果的主题分析提供了一种通用的方法。

cs.CL / 20 / 2604.18942

Disparities In Negation Understanding Across Languages In Vision-Language Models

视觉语言模型中不同语言的否定理解差异

Moraitaki, Charikleia, Pan, Sarah, Pulling, Skyler, Flusche, Gwendolyn, Alhamoud, Kumail, Ghassemi, Marzyeh

Abstract

Vision-language models (VLMs) exhibit affirmation bias: a systematic tendency to select positive captions ("X is present") even when the correct description contains negation ("no X"). While prior work has documented this failure mode in English and proposed solutions, negation manifests differently across languages through varying morphology, word order, and cliticization patterns, raising the question of whether these solutions serve all linguistic communities equitably. We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating three VLMs - CLIP, SigLIP, and MultiCLIP - we find that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieves the highest and most uniform accuracy. We also evaluate SpaceVLM, a proposed negation correction, and find that it produces substantial improvements for several languages - particularly English, Greek, Spanish, and Tagalog - while showing varied effectiveness across typologically different languages. This variation reveals that linguistic properties like morphology, script, and negation structure interact with model improvements in fairness-relevant ways. As VLMs are deployed globally, multilingual benchmarks are essential for understanding not just whether solutions work, but for whom.

Chinese Translation

视觉语言模型（VLMs）表现出肯定偏见：即使正确描述包含否定（“没有 X”），它们仍系统性地倾向于选择积极的标题（“X 存在”）。尽管之前的研究已在英语中记录了这种失败模式并提出了解决方案，但否定在不同语言中通过不同的形态、词序和附加模式表现出不同，提出了这些解决方案是否公平地服务于所有语言社区的问题。我们引入了第一个经过人工验证的多语言否定基准，涵盖七种类型学上多样的语言：英语、普通话、阿拉伯语、希腊语、俄语、他加禄语和西班牙语。评估三种 VLMs - CLIP、SigLIP 和 MultiCLIP - 我们发现标准的 CLIP 在非拉丁字母语言中的表现与随机猜测相当或更差，而 MultiCLIP 达到了最高且最均匀的准确率。我们还评估了 SpaceVLM，一种提出的否定纠正方法，发现它在几种语言中产生了显著的改进，特别是在英语、希腊语、西班牙语和他加禄语中，同时在类型学上不同的语言中表现出不同的有效性。这种变化揭示了形态、书写系统和否定结构等语言特性如何以与公平相关的方式与模型改进相互作用。随着 VLMs 在全球的部署，多语言基准对于理解解决方案不仅是否有效，而且对谁有效至关重要。

cs.CL / 21 / 2604.18944

A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition

信息密度对用户生成内容命名实体识别影响的机制与优化研究

Xiaobo, Jiang, Lai, Dinghong, Qiu, Song, Deng, Yadong, Zhan, Xinkai

Abstract

Named Entity Recognition (NER) models trained on clean, high-resource corpora exhibit catastrophic performance collapse when deployed on noisy, sparse User-Generated Content (UGC), such as social media. Prior research has predominantly focused on point-wise symptom remediation -- employing customized fine-tuning to address issues like neologisms, alias drift, non-standard orthography, long-tail entities, and class imbalance. However, these improvements often fail to generalize because they overlook the structural sparsity inherent in UGC. This study reveals that surface-level noise symptoms share a unified root cause: low Information Density (ID). Through hierarchical confounding-controlled resampling experiments (specifically controlling for entity rarity and annotation consistency), this paper identifies ID as an independent key factor. We introduce Attention Spectrum Analysis (ASA) to quantify how reduced ID causally leads to "attention blunting," ultimately degrading NER performance. Informed by these mechanistic insights, we propose the Window-Aware Optimization Module (WOM), an LLM-empowered, model-agnostic framework. WOM identifies information-sparse regions and utilizes selective back-translation to directionally enhance semantic density without altering model architecture. Deployed atop mainstream architectures on standard UGC datasets (WNUT2017, Twitter-NER, WNUT2016), WOM yields up to 4.5% absolute F1 improvement, demonstrating robustness and achieving new state-of-the-art (SOTA) results on WNUT2017.

Chinese Translation

在干净、高资源的语料库上训练的命名实体识别（NER）模型在部署到嘈杂、稀疏的用户生成内容（UGC）上时表现出灾难性的性能崩溃，例如社交媒体。以往的研究主要集中在逐点症状修复上——通过定制化微调来解决新词、别名漂移、非标准拼写、长尾实体和类别不平衡等问题。然而，这些改进往往无法推广，因为它们忽视了UGC固有的结构稀疏性。本研究揭示了表面噪声症状共享一个统一的根本原因：低信息密度（ID）。通过分层混淆控制的重采样实验（特别控制实体稀有性和标注一致性），本文将ID识别为一个独立的关键因素。我们引入了注意力谱分析（Attention Spectrum Analysis, ASA）来量化降低的ID如何因果地导致“注意力钝化”，最终降低NER性能。基于这些机制洞察，我们提出了窗口感知优化模块（Window-Aware Optimization Module, WOM），这是一个由大型语言模型（LLM）驱动的模型无关框架。WOM识别信息稀疏区域，并利用选择性回译在不改变模型架构的情况下定向增强语义密度。在主流架构上部署于标准UGC数据集（WNUT2017、Twitter-NER、WNUT2016）上，WOM实现了最高4.5%的绝对F1提升，展示了其稳健性，并在WNUT2017上达到了新的最先进（SOTA）结果。

cs.CL / 22 / 2604.18955

Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest

评估大型语言模型在社交媒体分析中的能力：一项多任务探索

Davoudi, Ramtin, Thakkar, Kartik, Donyapour, Nazanin, Derr, Tyler, Karimi, Hamid

Abstract

In this study, we present the first comprehensive evaluation of modern LLMs - including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT - across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) User Attribute Inference. For authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate "seen-data" bias. For post generation, we assess the ability of LLMs to produce authentic, user-like content using comprehensive evaluation metrics. Bridging Tasks I and II, we conduct a user study to measure real users' perceptions of LLM-generated posts conditioned on their own writing. For attribute inference, we annotate occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC) and benchmark LLMs against existing baselines. Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics. The code and data are provided in the supplementary material and will also be made publicly available upon publication.

Chinese Translation

在本研究中，我们首次对现代大型语言模型（LLMs）进行全面评估，包括GPT-4、GPT-4o、GPT-3.5-Turbo、Gemini 1.5 Pro、DeepSeek-V3、Llama 3.2和BERT，针对Twitter（X）数据集的三个核心社交媒体分析任务进行评估：(I) 社交媒体作者身份验证，(II) 社交媒体帖子生成，以及(III) 用户属性推断。对于作者身份验证，我们引入了一种系统的抽样框架，涵盖多样的用户和帖子选择策略，并评估在2024年1月及以后新收集的推文上的泛化能力，以减轻“已见数据”偏差。对于帖子生成，我们评估LLMs生成真实、用户类似内容的能力，使用全面的评估指标。连接任务I和II，我们进行了一项用户研究，以测量真实用户对基于其自身写作条件生成的LLM帖子感知。对于属性推断，我们使用两个标准化分类法（IAB Tech Lab 2023和2018年美国SOC）对职业和兴趣进行了标注，并将LLMs与现有基准进行比较。总体而言，我们的统一评估提供了新的见解，并建立了可重复的基准，以推动LLM驱动的社交媒体分析。代码和数据已在补充材料中提供，并将在发表后公开发布。

cs.CL / 23 / 2604.18976

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

STAR-Teaming：一种基于策略-响应多路网络的自动化大型语言模型红队策略

Jung, MinJae, Lim, YongTaek, Kim, Chaeyun, Kim, Junghwan, Kim, Kihyun, Kim, Minwoo

Abstract

While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR-Teaming, a novel black-box framework for automated red teaming that effectively generates such prompts. STAR-Teaming integrates a Multi-Agent System (MAS) with a Strategy-Response Multiplex Network and employs network-driven optimization to sample effective attack strategies. This network-based approach recasts the intractable high-dimensional embedding space into a tractable structure, yielding two key advantages: it enhances the interpretability of the LLM's strategic vulnerabilities, and it streamlines the search for effective strategies by organizing the search space into semantic communities, thereby preventing redundant exploration. Empirical results demonstrate that STAR-Teaming significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost. Extensive experiments validate the effectiveness and explainability of the Multiplex Network. The code is available at https://github.com/selectstar-ai/STAR-Teaming-paper.

Chinese Translation

尽管大型语言模型（LLMs）被广泛使用，但它们仍然容易受到能够引发有害或不当响应的越狱提示的影响。本文介绍了STAR-Teaming，一种新颖的黑箱框架，用于自动化红队测试，能够有效生成此类提示。STAR-Teaming将多代理系统（MAS）与策略-响应多路网络相结合，并采用网络驱动的优化方法来采样有效的攻击策略。这种基于网络的方法将难以处理的高维嵌入空间重构为可处理的结构，带来了两个关键优势：它增强了LLM战略脆弱性的可解释性，并通过将搜索空间组织成语义社区来简化有效策略的搜索，从而防止冗余探索。实证结果表明，STAR-Teaming显著超越现有方法，以更低的计算成本实现更高的攻击成功率（ASR）。大量实验验证了多路网络的有效性和可解释性。代码可在https://github.com/selectstar-ai/STAR-Teaming-paper获取。

cs.CL / 24 / 2604.18995

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

$R^2$-dLLM：通过时空冗余减少加速扩散大型语言模型

Du, Zhenbang, Xia, Kejing, Zhong, Xinrui, Fu, Yonggan, Oswald, Nicolai, Ji, Binfei, Khailany, Brucek, Molchanov, Pavlo, Lin, Yingyan

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^2$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^2$-dLLM consistently reduces the number of decoding steps by up to 75% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains.

Chinese Translation

扩散大型语言模型（dLLMs）作为自回归生成的有希望的替代方案，通过实现并行的标记预测而受到关注。然而，实际的 dLLM 解码仍然面临高推理延迟的问题，这限制了其部署。在本研究中，我们观察到，这种低效的很大一部分源于解码过程中反复出现的冗余，包括由置信度聚类和位置模糊引起的空间冗余，以及由于反复重新掩蔽已经稳定的预测而引起的时间冗余。基于这些模式，我们提出了 $R^2$-dLLM，一个统一的框架，从推理和训练的角度减少解码冗余。在推理时，我们引入了无训练的解码规则，聚合局部置信度和标记预测，并最终确定时间上稳定的标记，以避免冗余的解码步骤。我们进一步提出了一种关注冗余的监督微调流程，使模型与高效的解码轨迹对齐，并减少对手动调节阈值的依赖。实验表明，与现有的解码策略相比，$R^2$-dLLM 一致地将解码步骤数量减少了多达 75%，同时在不同模型和任务中保持了竞争力的生成质量。这些结果验证了解码冗余是 dLLMs 的一个核心瓶颈，明确减少冗余可以带来显著的实际效率提升。
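
The temporal-redundancy rule described in the abstract (finalizing tokens whose predictions have already stabilized) can be illustrated with a minimal sketch. The function name, the `patience` parameter, and the list-of-lists history format are assumptions made for illustration; the abstract does not specify the paper's actual stability criterion.

```python
def finalize_stable_tokens(history, patience=2):
    """Return {position: token_id} for positions whose predicted token was
    identical over the last `patience` decoding steps.

    history: list of decoding steps, each a list of predicted token ids
             (one per sequence position). Hypothetical format.
    """
    if len(history) < patience:
        return {}  # not enough steps observed to call anything stable
    recent = history[-patience:]
    finalized = {}
    for pos, token in enumerate(recent[-1]):
        # A position is stable if every recent step predicted the same token.
        if all(step[pos] == token for step in recent):
            finalized[pos] = token
    return finalized


# Three decoding steps over a 3-token sequence.
steps = [[5, 7, 9], [5, 8, 9], [5, 8, 2]]
stable = finalize_stable_tokens(steps, patience=2)
```

Positions returned by such a rule would no longer be remasked, which is where the saving in decoding steps would come from under these assumptions.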

cs.CL / 25 / 2604.19001

When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains

当安全在答案之前失败：推理链中有害行为检测的基准测试

Kakkar, Ishita, Zhang, Enze, Uppaal, Rheeya, Hu, Junjie

Abstract

Large reasoning models (LRMs) produce complex, multi-step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When a model is jailbroken, harm does not appear instantaneously but unfolds through distinct behavioral steps such as suppressing refusal, rationalizing compliance, decomposing harmful tasks, and concealing risk. However, no existing benchmark captures this process at sentence-level granularity within reasoning traces -- a key step toward reliable safety monitoring, interventions, and systematic failure diagnosis. To address this gap, we introduce HarmThoughts, a benchmark for step-wise safety evaluation of reasoning traces. HarmThoughts is built on our proposed harm taxonomy of 16 harmful reasoning behaviors across four functional groups that characterize how harm propagates rather than what harm is produced. The dataset consists of 56,931 sentences from 1,018 reasoning traces generated by four model families, each annotated with fine-grained sentence-level behavioral labels. Using HarmThoughts, we analyze harm propagation patterns across reasoning traces, identifying common behavioral trajectories and drift points where reasoning transitions from safe to unsafe. Finally, we systematically compare white-box and black-box detectors on the task of identifying harmful reasoning behaviors on HarmThoughts. Our results show that existing detectors struggle with fine-grained behavior detection in reasoning traces, particularly for nuanced categories within harm emergence and execution, highlighting a critical gap in process-level safety monitoring. HarmThoughts is available publicly at: https://huggingface.co/datasets/ishitakakkar-10/HarmThoughts

Chinese Translation

大型推理模型（LRMs）产生复杂的多步骤推理轨迹，但安全评估仍然集中于最终输出，忽视了在推理过程中有害行为的出现。当被破解时，有害行为并不会瞬间显现，而是通过抑制拒绝、合理化顺从、分解有害任务和隐瞒风险等不同的行为步骤逐步展开。然而，目前没有现有的基准能够在推理轨迹的句子级别捕捉这一过程——这是实现可靠安全监控、干预和系统性故障诊断的关键步骤。为了解决这一空白，我们提出了HarmThoughts，这是一个用于推理轨迹逐步安全评估的基准。我们的数据集基于我们提出的16种有害推理行为的危害分类法，涵盖四个功能组，描述了有害行为的传播方式，而不是产生了什么危害。该数据集包含来自四个模型家族生成的1,018条推理轨迹中的56,931个句子，每个句子都标注了细粒度的句子级行为标签。利用HarmThoughts，我们分析了推理轨迹中的有害传播模式，识别出常见的行为轨迹和推理从安全转向不安全的漂移点。最后，我们系统地比较了白盒和黑盒检测器在HarmThoughts上识别有害推理行为的任务。我们的结果表明，现有检测器在推理轨迹中的细粒度行为检测方面存在困难，特别是在有害行为的出现和执行的细微类别中，突显了过程级安全监控中的一个关键空白。HarmThoughts可在以下网址公开获取：https://huggingface.co/datasets/ishitakakkar-10/HarmThoughts

cs.CL / 26 / 2604.19005

Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection

辩论无声之处：基于角色的多智能体推理用于半真相检测

Tang, Yixuan, Zhang, Yirui, Feng, Hang, Tung, Anthony K. H.

Abstract

Half-truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission-based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role-anchored multi-agent debate framework for omission-aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a Politician and a Scientist, who reason adversarially over shared retrieved evidence, moderated by a neutral Judge. A dual-threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single- and multi-agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role-anchored, retrieval-grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification. The code is available at https://github.com/tangyixuan/RADAR.

Chinese Translation

半真相是指那些在事实上是正确的，但由于缺失的背景信息而具有误导性的主张，这仍然是专注于显性虚假信息的事实验证系统的盲点。解决这种基于遗漏的操控需要不仅对所说的内容进行推理，还要对未说的内容进行推理。我们提出了RADAR，一个基于角色的多智能体辩论框架，用于在现实的、嘈杂的检索环境下进行关注遗漏的事实验证。RADAR为政治家和科学家分配互补角色，他们在中立法官的调解下，对共享的检索证据进行对抗性推理。一个双阈值的早期终止控制器自适应地决定何时达到足够的推理以作出裁决。实验表明，RADAR在各个数据集和基础模型上始终优于强大的单智能体和多智能体基线，提高了遗漏检测的准确性，同时降低了推理成本。这些结果表明，基于角色的、以检索为基础的辩论结合自适应控制是揭示事实验证中缺失背景的有效且可扩展的框架。代码可在 https://github.com/tangyixuan/RADAR 获取。
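
The abstract mentions a dual-threshold early termination controller but does not say what the two thresholds measure. The sketch below is one plausible reading, assumed for illustration only: the debate stops when the Judge's confidence reaches an `accept` threshold, or when confidence changes by less than a `stall` threshold between rounds. All names and default values are hypothetical and are not taken from the RADAR code.

```python
def should_stop(confidences, accept=0.9, stall=0.02, max_rounds=5):
    """Decide whether a debate should end.

    confidences: the Judge's confidence in its current verdict after each
                 completed round (hypothetical signal, values in [0, 1]).
    """
    if not confidences:
        return False  # no round has finished yet
    if len(confidences) >= max_rounds:
        return True   # hard budget on reasoning cost
    if confidences[-1] >= accept:
        return True   # threshold 1: verdict is confident enough
    if len(confidences) >= 2 and abs(confidences[-1] - confidences[-2]) <= stall:
        return True   # threshold 2: another round barely moved the Judge
    return False
```

Under this reading, the first threshold saves cost on easy claims and the second avoids spending rounds on debates that have stopped producing new evidence.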

cs.CL / 27 / 2604.19016

AlignCultura: Towards Culturally Aligned Large Language Models?

AlignCultura：朝着文化对齐的大型语言模型迈进？

Kashyap, Gautam Siddharth, Dras, Mark, Naseem, Usman

Abstract

Cultural alignment in Large Language Models (LLMs) is essential for producing contextually aware, respectful, and trustworthy outputs. Without it, models risk generating stereotyped, insensitive, or misleading responses that fail to reflect cultural diversity with respect to the Helpful, Harmless, and Honest (HHH) paradigm. Existing benchmarks represent early steps toward cultural alignment, yet no benchmark currently enables systematic evaluation of cultural alignment in line with UNESCO's principles of cultural diversity with respect to the HHH paradigm. To address this gap, we built Align-Cultura, a two-stage pipeline for cultural alignment. Stage I constructs CULTURAX, the HHH-English dataset grounded in the UNESCO cultural taxonomy, through Query Construction, which reclassifies prompts, expands underrepresented domains (or labels), and prevents data leakage with SimHash. Then, Response Generation pairs prompts with culturally grounded responses via two-stage rejection sampling. The final dataset contains 1,500 samples spanning 30 subdomains of tangible and intangible cultural forms. Stage II benchmarks CULTURAX on general-purpose models, culturally fine-tuned models, and open-weight LLMs (Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B). Empirically, culturally fine-tuned models improve joint HHH by 4%-6%, reduce cultural failures by 18%, achieve 10%-12% efficiency gains, and limit leakage to 0.3%.

Chinese Translation

大型语言模型（LLMs）中的文化对齐对于生成具有上下文意识、尊重和可信赖的输出至关重要。如果缺乏文化对齐，模型可能会生成刻板印象、不敏感或误导性的响应，无法反映与有益、无害和诚实（HHH）范式相关的文化多样性。现有的基准测试代表了朝着文化对齐迈出的早期步骤；然而，目前没有基准能够系统地评估与联合国教科文组织（UNESCO）文化多样性原则相一致的文化对齐，尤其是在HHH范式下。因此，为了填补这一空白，我们构建了Align-Cultura，一个用于文化对齐的两阶段流程。第一阶段构建了CULTURAX，这是一个基于联合国教科文组织文化分类法的HHH-英语数据集，通过查询构建（Query Construction），重新分类提示，扩展代表性不足的领域（或标签），并通过SimHash防止数据泄漏。然后，响应生成通过两阶段拒绝采样将提示与文化基础的响应配对。最终数据集包含1,500个样本，涵盖30个有形和无形文化形式的子领域。第二阶段对CULTURAX在通用模型、文化微调模型和开放权重的LLMs（如Qwen3-8B和DeepSeek-R1-Distill-Qwen-7B）上进行基准测试。实证结果表明，文化微调模型的联合HHH提高了4%-6%，文化失误减少了18%，效率提升了10%-12%，并将泄漏限制在0.3%。
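
The abstract says SimHash is used to prevent data leakage. A minimal sketch of SimHash-based near-duplicate detection follows; the whitespace tokenization, the MD5-derived 64-bit token hash, and the distance threshold of 3 are assumptions chosen for illustration, not the settings used to build CULTURAX.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a SimHash fingerprint over whitespace tokens."""
    counts = [0] * bits
    for token in text.lower().split():
        # Derive a stable 64-bit hash for the token from its MD5 digest.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the per-bit vote is positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, max_distance=3):
    """Flag two prompts as near-duplicates when their fingerprints are close."""
    return hamming(simhash(a), simhash(b)) <= max_distance
```

A prompt whose fingerprint lies within the threshold of any evaluation prompt would be dropped from the training split; the threshold trades missed duplicates against false removals.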

cs.CL / 28 / 2604.19047

RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora

RARE：针对高相似性语料库的冗余感知检索评估框架

Cho, Hanjun, Lee, Jay-Yoon

Abstract

Existing QA benchmarks typically assume distinct documents with minimal overlap, yet real-world retrieval-augmented generation (RAG) systems operate on corpora such as financial reports, legal codes, and patents, where information is highly redundant and documents exhibit strong inter-document similarity. This mismatch undermines evaluation validity: retrievers can be unfairly undervalued even when they retrieve documents that provide sufficient evidence, because redundancy across documents is not accounted for in evaluation. On the other hand, retrievers that perform well on standard benchmarks often generalize poorly to real-world corpora with highly similar and redundant documents. We present RARE (Redundancy-Aware Retrieval Evaluation), a framework for constructing realistic benchmarks by (i) decomposing documents into atomic facts to enable precise redundancy tracking and (ii) enhancing LLM-based data generation with CRRF. RAG benchmark data usually requires multiple quality criteria, but LLMs often yield trivial outputs. CRRF scores criteria separately and fuses decisions by rank, improving the reliability of generated data. Applying RARE to Finance, Legal, and Patent corpora, we introduce RedQA, where a strong retriever baseline drops from 66.4% PerfRecall@10 on 4-hop General-Wiki to 5.0-27.9% PerfRecall@10 at 4-hop depth, revealing robustness gaps that current benchmarks fail to capture. RARE enables practitioners to build domain-specific RAG evaluations that faithfully reflect real-world deployment conditions.

Chinese Translation

现有的问答（QA）基准通常假设文档之间存在明显差异且重叠最小，然而现实世界中的检索增强生成（RAG）系统则在诸如财务报告、法律法规和专利等语料库上运行，这些语料库的信息高度冗余，文档之间表现出强烈的相似性。这种不匹配削弱了评估的有效性：检索器即使检索到提供充分证据的文档，也可能被不公平地低估，因为评估中没有考虑文档之间的冗余。另一方面，在标准基准上表现良好的检索器往往在面对高度相似和冗余的现实世界语料库时泛化能力较差。我们提出了RARE（冗余感知检索评估），这是一个通过（i）将文档分解为原子事实以实现精确的冗余跟踪，以及（ii）通过CRRF增强基于大型语言模型（LLM）的数据生成，来构建现实基准的框架。RAG基准数据通常需要多个质量标准，但LLM往往产生琐碎的输出。CRRF分别对标准进行评分，并通过排名融合决策，从而提高生成数据的可靠性。将RARE应用于财务、法律和专利语料库，我们引入了RedQA，其中一个强大的检索基线在4跳General-Wiki上从66.4%的PerfRecall@10下降到4跳深度时的5.0-27.9%的PerfRecall@10，揭示了当前基准未能捕捉的鲁棒性差距。RARE使从业者能够构建忠实反映现实部署条件的领域特定RAG评估。
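
The abstract describes CRRF only as scoring criteria separately and fusing decisions by rank, without expanding the acronym. The sketch below uses standard reciprocal rank fusion as a stand-in for that idea; treating CRRF as reciprocal rank fusion, the criterion names, and the constant `k=60` are all assumptions.

```python
def rank_positions(scores):
    """Map candidate id -> 1-based rank, where a higher score ranks first."""
    ordered = sorted(scores, key=lambda c: scores[c], reverse=True)
    return {cand: i + 1 for i, cand in enumerate(ordered)}


def fuse_by_rank(criteria_scores, k=60):
    """Fuse per-criterion scores into one ordering with reciprocal rank fusion.

    criteria_scores: {criterion_name: {candidate_id: score}} (hypothetical).
    """
    fused = {}
    for scores in criteria_scores.values():
        for cand, rank in rank_positions(scores).items():
            fused[cand] = fused.get(cand, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda c: fused[c], reverse=True)


# Three generated questions scored independently on three criteria.
criteria = {
    "answerability": {"q1": 0.9, "q2": 0.5, "q3": 0.1},
    "difficulty": {"q1": 0.2, "q2": 0.8, "q3": 0.1},
    "multi_hop": {"q1": 0.3, "q2": 0.9, "q3": 0.4},
}
best_first = fuse_by_rank(criteria)
```

Fusing ranks rather than raw scores keeps one criterion with a wide score range from dominating the others, which is the usual motivation for rank-based fusion.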

cs.CL / 29 / 2604.19048

SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning

SAMoRA：面向语义的LoRA专家混合模型用于任务自适应学习

Shi, Boyan, Chen, Wei, Zhao, Shuyuan, Shen, Junfeng, Guo, Shengnan, Wang, Shaojiang, Wan, Huaiyu

Abstract

The combination of Mixture-of-Experts (MoE) and Low-Rank Adaptation (LoRA) has shown significant potential for enhancing the multi-task learning capabilities of Large Language Models. However, existing methods face two primary challenges: (1) imprecise routing in current MoE-LoRA methods fails to explicitly match input semantics with expert capabilities, leading to weak expert specialization; (2) uniform weight fusion strategies struggle to provide adaptive update strengths, overlooking the varying complexity of different tasks. To address these limitations, we propose SAMoRA (Semantic-Aware Mixture of LoRA Experts), a novel parameter-efficient fine-tuning framework tailored for task-adaptive learning. Specifically, a Semantic-Aware Router is proposed to explicitly align textual semantics with the most suitable experts for precise routing. A Task-Adaptive Scaling mechanism is designed to dynamically regulate expert contributions based on specific task requirements. In addition, a novel regularization objective is proposed to jointly promote expert specialization and effective scaling. Extensive experiments on multiple multi-task benchmarks demonstrate that SAMoRA significantly outperforms the state-of-the-art methods and holds excellent task generalization capabilities. Code is available at https://github.com/boyan-code/SAMoRA

Chinese Translation

混合专家模型（Mixture-of-Experts, MoE）与低秩适应（Low-Rank Adaptation, LoRA）的结合在增强大型语言模型的多任务学习能力方面展现了显著潜力。然而，现有方法面临两个主要挑战：（1）当前的MoE-LoRA方法中的不精确路由未能明确将输入语义与专家能力相匹配，导致专家专业化不足。（2）统一的权重融合策略难以提供自适应的更新强度，忽视了不同任务的复杂性差异。为了解决这些局限性，我们提出了SAMoRA（面向语义的LoRA专家混合模型），这是一种新颖的参数高效微调框架，专为任务自适应学习而设计。具体而言，提出了一种语义感知路由器，以明确将文本语义与最合适的专家对齐，实现精确路由。设计了一种任务自适应缩放机制，根据特定任务需求动态调节专家贡献。此外，提出了一种新颖的正则化目标，以共同促进专家专业化和有效缩放。在多个多任务基准上的广泛实验表明，SAMoRA显著优于现有的最先进方法，并具有出色的任务泛化能力。代码可在 https://github.com/boyan-code/SAMoRA 获取。

cs.CL / 30 / 2604.19052

Cell-Based Representation of Relational Binding in Language Models

语言模型中关系绑定的基于单元的表示

Dai, Qin, Heinzerling, Benjamin, Inui, Kentaro

Abstract

Understanding a discourse requires tracking entities and the relations that hold between them. While Large Language Models (LLMs) perform well on relational reasoning, the mechanism by which they bind entities, relations, and attributes remains unclear. We study discourse-level relational binding and show that LLMs encode it via a Cell-based Binding Representation (CBR): a low-dimensional linear subspace in which each "cell" corresponds to an entity--relation index pair, and bound attributes are retrieved from the corresponding cell during inference. Using controlled multi-sentence data annotated with entity and relation indices, we identify the CBR subspace by decoding these indices from attribute-token activations with Partial Least Squares regression. Across domains and two model families, the indices are linearly decodable and form a grid-like geometry in the projected space. We further find that context-specific CBR representations are related by translation vectors in activation space, enabling cross-context transfer. Finally, activation patching shows that manipulating this subspace systematically changes relational predictions and that perturbing it disrupts performance, providing causal evidence that LLMs rely on CBR for relational binding.

Chinese Translation

理解话语需要跟踪实体及其之间的关系。尽管大型语言模型（LLMs）在关系推理方面表现良好，但它们绑定实体、关系和属性的机制仍不清楚。我们研究了话语级别的关系绑定，并表明LLMs通过基于单元的绑定表示（Cell-based Binding Representation, CBR）对其进行编码：这是一个低维线性子空间，其中每个“单元”对应于一个实体-关系索引对，并且在推理过程中从相应的单元中检索绑定属性。通过使用标注有实体和关系索引的受控多句子数据，我们通过使用偏最小二乘回归（Partial Least Squares regression）从属性标记激活中解码这些索引，从而识别CBR子空间。在不同领域和两种模型家族中，这些索引是线性可解的，并在投影空间中形成网格状几何结构。我们进一步发现，上下文特定的CBR表示在激活空间中通过平移向量相关联，从而实现跨上下文转移。最后，激活补丁显示，系统地操纵该子空间会改变关系预测，并且扰动它会破坏性能，提供了因果证据，表明LLMs依赖于CBR进行关系绑定。

cs.CL / 31 / 2604.19069

Product-of-Experts Training Reduces Dataset Artifacts in Natural Language Inference

专家产品训练减少自然语言推理中的数据集伪影

Mathew, Aby Mammen

Abstract

Neural NLI models overfit dataset artifacts instead of truly reasoning. A hypothesis-only model reaches 57.7% accuracy on SNLI, showing strong spurious correlations, and 38.6% of the baseline errors are the result of these artifacts. We propose Product-of-Experts (PoE) training, which downweights examples where biased models are overconfident. PoE nearly preserves accuracy (89.10% vs. 89.30%) while cutting bias reliance by 4.71% (bias agreement 49.85% to 45%). An ablation finds that lambda = 1.5 best balances debiasing and accuracy. Behavioral tests still reveal issues with negation and numerical reasoning.

Chinese Translation

神经网络自然语言推理（NLI）模型过度拟合数据集伪影，而非真正进行推理。仅基于假设的模型在SNLI数据集上获得了57.7%的准确率，显示出强烈的虚假相关性，基线错误中有38.6%是这些伪影造成的。我们提出了专家产品（Product-of-Experts, PoE）训练方法，该方法降低了偏见模型过于自信的示例的权重。PoE几乎保持了准确率（89.10%对比89.30%），同时将偏见依赖性降低了4.71%（偏见一致性从49.85%降至45%）。消融实验发现，λ = 1.5是最佳的平衡去偏见与准确率的参数。行为测试仍然揭示了在否定和数值推理方面存在的问题。
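
The Product-of-Experts idea in the abstract can be sketched as a loss computed on the combined log-probabilities of a main model and a frozen biased model (for example, a hypothesis-only model). This is a minimal sketch of the common PoE formulation, not the paper's code: the function names are hypothetical, and using `lam` as a multiplier on the bias term is an assumption, since the abstract does not say how lambda enters the objective.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def poe_loss(main_logits, bias_logits, label, lam=1.0):
    """Cross-entropy on the product of the main and bias distributions.

    lam scales the bias model's log-probabilities (assumed role of lambda).
    """
    main_lp = log_softmax(main_logits)
    bias_lp = log_softmax(bias_logits)
    # Multiplying distributions is adding log-probabilities, then renormalizing.
    combined = log_softmax([m + lam * b for m, b in zip(main_lp, bias_lp)])
    return -combined[label]
```

When the biased model is already confident in the correct label, the combined loss is small, so the main model receives little gradient from that example; this is the downweighting effect the abstract describes. At test time only the main model would be used.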

cs.CL / 32 / 2604.19070

TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only

TRN-R1-Zero：仅通过强化学习进行文本丰富网络推理

Liu, Yilun, Qiu, Ruihong, Huang, Zi

Abstract

Zero-shot reasoning on text-rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task-specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN-R1-Zero, a post-training framework for TRN reasoning trained solely via reinforcement learning. TRN-R1-Zero directly optimises base LLMs using a Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN-R1-Zero requires no supervised fine-tuning or chain-of-thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co-purchase TRN benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Moreover, relying strictly on node-level training, TRN-R1-Zero achieves zero-shot inference on edge- and graph-level tasks, extending beyond cross-domain transfer. The codebase is publicly available at https://github.com/superallen13/TRN-R1-Zero.

Chinese Translation

在文本丰富网络（TRNs）上进行零-shot 推理仍然是一个具有挑战性的前沿领域，因为模型必须在没有特定任务监督的情况下，将文本语义与关系结构相结合。尽管图神经网络依赖于固定的标签空间和监督目标，但最近基于大型语言模型（LLM）的方法往往忽视图的上下文，或依赖于从更大模型的蒸馏，限制了其泛化能力。我们提出了 TRN-R1-Zero，这是一种仅通过强化学习训练的 TRN 推理后训练框架。TRN-R1-Zero 直接优化基础 LLM，使用邻居感知的组相对策略优化目标，根据邻近信号的信息量的新颖边际增益指标动态调整奖励，有效引导模型朝向关系推理。与之前的方法不同，TRN-R1-Zero 不需要监督微调或从大型推理模型生成的思维链数据。在引用、超链接、社交和共同购买 TRN 基准上进行的广泛实验表明，TRN-R1-Zero 的优越性和鲁棒性。此外，TRN-R1-Zero 严格依赖于节点级训练，在边级和图级任务上实现了零-shot 推理，超越了跨领域迁移。代码库已公开，地址为 https://github.com/superallen13/TRN-R1-Zero。
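
The objective named in the abstract builds on Group Relative Policy Optimisation, in which each sampled response is scored relative to the other responses in its group. The sketch below shows only that standard group-relative normalisation; the neighbour-aware reward adjustment and the margin gain metric are specific to the paper and are not reproduced here. The function name and `eps` value are assumptions.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to one node-classification prompt: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are computed within the group, no separate value model is needed, which fits the reinforcement-learning-only setup the abstract describes.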

cs.CL / 33 / 2604.19071

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

HoWToBench：利用写作树对大型语言模型人类水平写作能力的整体评估

Feng, Andrew Zhuoer, Wang, Cunxiang, Luo, Yu, Fan, Lin, Zhou, Yilin, Wang, Zikang, Gu, Xiaotao, Tang, Jie, Wang, Hongning, Huang, Minlie

Abstract

Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM performance on thousand-word-level and open-ended writing is inadequately assessed by traditional reference-based metrics or modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency often found when LLM-as-a-judge aggregates all sub-features in text evaluation. ToW incorporates a tree-structured workflow by explicitly modeling the aggregation weights of sub-features. We also present HowToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW successfully mitigates the biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we find that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, showing that these scores cannot be improved simply by piling up input-side information.

Chinese Translation

评估大型语言模型（LLMs）的写作能力仍然是一个重大挑战，因为写作技能具有多维特性，并且现有指标存在局限性。传统的基于参考的指标或现代的LLM作为评判者的方法无法充分评估LLM在千字级和开放式写作中的表现。我们提出了写作树（Tree-of-Writing, ToW），以解决LLM作为评判者在文本评估中聚合所有子特征时常见的隐性矛盾。ToW通过明确建模子特征的聚合权重，采用树状结构的工作流程。我们还提出了HowToBench，这是一个大规模的中文写作基准，涵盖12个体裁和1302个指令，分为三个任务类别：上下文补全、提纲指导写作和开放式生成。ToW成功减轻了偏见，与人类判断的皮尔逊相关系数达到0.93。此外，我们发现基于重叠的文本生成指标和流行的LLM作为评判者的实践都容易受到文本干扰的影响，而ToW对此具有鲁棒性。我们还发现，在指导任务中，输入长度与内容相关评分之间存在负相关关系，表明仅通过输入侧信息的堆积无法简单改善这一问题。
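
The abstract states that ToW explicitly models the aggregation weights of sub-features in a tree-structured workflow. A minimal sketch of such a weighted tree aggregation follows; the node format, the example sub-features, and the weights are hypothetical and only illustrate the general mechanism.

```python
def aggregate(node):
    """Recursively combine sub-feature scores with explicit weights.

    A leaf is {'score': float}; an internal node is
    {'children': [(weight, child_node), ...]} (hypothetical format).
    """
    if "score" in node:
        return node["score"]
    total_weight = sum(weight for weight, _ in node["children"])
    weighted = sum(weight * aggregate(child) for weight, child in node["children"])
    return weighted / total_weight


# Content (coherence, depth) weighted 0.6 and language quality weighted 0.4.
essay = {
    "children": [
        (0.6, {"children": [(0.5, {"score": 8.0}), (0.5, {"score": 6.0})]}),
        (0.4, {"score": 9.0}),
    ]
}
overall = aggregate(essay)
```

Making the weights explicit means the final score can be traced back to each sub-feature, in contrast to a single judge prompt that aggregates everything implicitly.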

cs.CL / 34 / 2604.19098

SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

SAHM：阿拉伯金融与伊斯兰法合规推理的基准

Elbadry, Rania, Ahmad, Sarfraz, Heakl, Ahmed, Bouch, Dani, Ahsan, Momina, AlMahri, Muhra, khalil, Marwa Elsaid, Wang, Yuxia, Lahlou, Salem, Ananiadou, Sophia, Stoyanov, Veselin, Huang, Jimin, Peng, Xueqing, Nakov, Preslav, Xie, Zhuohan

Abstract

English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari'ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event-cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.

Chinese Translation

尽管对可靠的金融和伊斯兰金融助手有强烈的实际需求，阿拉伯金融自然语言处理（NLP）仍然相对未被充分探索，而英语金融NLP则通过情感分析、文档理解和金融问答等基准迅速发展。我们介绍了SAHM，这是一个基于文档的基准和指令调优数据集，旨在支持阿拉伯金融NLP和伊斯兰法合规推理。SAHM包含14,380个经过专家验证的实例，涵盖七个任务：AAOIFI标准问答、基于法特瓦的问答/选择题、会计和商业考试、金融情感分析、提取式摘要以及事件-原因推理，这些数据来自真实的监管、法学和企业来源。我们使用任务特定的指标和基于评分标准的评估方法，对19个强大的开放和专有大型语言模型（LLMs）进行了评估，发现阿拉伯语流利性并不能可靠地转化为基于证据的金融推理：模型在识别类任务上的表现明显优于生成和因果推理任务，尤其是在事件-原因推理任务上差距最大。我们发布了该基准、评估框架以及一个经过指令调优的模型，以支持未来在可靠的阿拉伯金融NLP领域的研究。

cs.CL / 35 / 2604.19124

Detoxification for LLM: From Dataset Itself

大型语言模型的去毒化：来自数据集本身

Shao, Wei, Wang, Yihang, Zhu, Gaoyu, Cheng, Ziqiang, Yu, Lei, Guo, Jiafeng, Cheng, Xueqi

Abstract

Existing detoxification methods for large language models mainly focus on the post-training stage or inference time, while few tackle the source of toxicity, namely the dataset itself. Such training-based or controllable decoding approaches cannot completely suppress the model's inherent toxicity, whereas detoxifying the pretraining dataset can fundamentally reduce the toxicity that the model learns during training. Hence, we attempt to detoxify raw corpora directly with SoCD (Soft Contrastive Decoding), which guides an LLM to localize and rewrite toxic spans in raw data while preserving semantics. SoCD operates within our proposed HSPD (Hierarchical Semantic-Preserving Detoxification) pipeline, yielding a detoxified corpus that can serve as a drop-in replacement for the original in fine-tuning or other training. On GPT2-XL, HSPD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. We further validate consistent best-in-class results on LLaMA2-7B, OPT-6.7B, and Falcon-7B. These findings show that semantics-preserving, corpus-level rewriting with HSPD effectively suppresses downstream toxicity while retaining data utility and allowing seamless source-level mitigation, thereby reducing the cost of later model behavior adjustment. (Code is available at: https://github.com/ntsw2001/data_detox_for_llm)

Chinese Translation

现有的大型语言模型去毒化方法主要集中在训练后阶段或推理时，而很少关注毒性的来源，即数据集本身。这类基于训练或可控解码的方法无法完全抑制模型固有的毒性，而去毒化预训练数据集可以从根本上减少模型在训练过程中学习到的毒性。因此，我们尝试在原始语料上直接进行去毒化，采用SoCD（软对比解码）方法，引导大型语言模型在保留语义的同时定位并重写原始数据中的毒性片段，在我们提出的HSPD（层次语义保留去毒化）流程中，生成一个可以替代原始数据进行微调或其他训练的去毒化语料。在GPT2-XL上，HSPD实现了最先进的去毒化效果，将毒性概率（TP）从0.42降低到0.18，将预期最大毒性（EMT）从0.43降低到0.20。我们进一步验证了在LLaMA2-7B、OPT-6.7B和Falcon-7B上的一致性最佳结果。这些发现表明，采用HSPD进行语义保留的语料级重写有效抑制了下游毒性，同时保留了数据的实用性，并允许无缝的源级缓解，从而减少后续模型行为调整的成本。（代码可在：https://github.com/ntsw2001/data_detox_for_llm获取）

cs.CL / 36 / 2604.19125

Do Emotions Influence Moral Judgment in Large Language Models?

情感是否影响大型语言模型中的道德判断？

Saim, Mohammad, Jiang, Tianyu

Abstract

Large language models have been extensively studied for emotion recognition and moral reasoning as distinct capabilities, yet the extent to which emotions influence moral judgment remains underexplored. In this work, we develop an emotion-induction pipeline that infuses emotion into moral situations and evaluate shifts in moral acceptability across multiple datasets and LLMs. We observe a directional pattern: positive emotions increase moral acceptability and negative emotions decrease it, with effects strong enough to reverse binary moral judgments in up to 20% of cases, and with susceptibility scaling inversely with model capability. Our analysis further reveals that specific emotions can sometimes behave contrary to what their valence would predict (e.g., remorse paradoxically increases acceptability). A complementary human annotation study shows humans do not exhibit these systematic shifts, indicating an alignment gap in current LLMs.

Chinese Translation

大型语言模型在情感识别和道德推理作为独立能力方面得到了广泛研究，但情感在多大程度上影响道德判断仍然未被充分探讨。在本研究中，我们开发了一种情感诱导流程，将情感注入道德情境，并评估在多个数据集和大型语言模型（LLMs）中道德可接受性的变化。我们观察到一种方向性模式：积极情感增加道德可接受性，而消极情感则降低道德可接受性，其效果足以在多达20%的案例中逆转二元道德判断，并且这种敏感性与模型能力呈反比。我们的分析进一步揭示，特定情感有时可能表现出与其效价预测相反的行为（例如，悔恨反而增加可接受性）。一项补充的人类标注研究表明，人类并未表现出这些系统性的变化，表明当前大型语言模型存在对齐差距。

cs.CL / 37 / 2604.19137

Construction of Knowledge Graph based on Language Model

基于语言模型的知识图谱构建

Zhu, Qiubai, Wang, Qingwang, Yuan, Haibin, Chen, Wei, Shen, Tao

Abstract

Knowledge Graphs (KGs) can effectively integrate valuable information from massive data, and thus have been rapidly developed and widely used in many fields. Traditional KG construction methods rely on manual annotation, which often consumes a lot of time and manpower, while KG construction schemes based on deep learning tend to have weak generalization capabilities. With their rapid development, Pre-trained Language Models (PLMs) have shown great potential in the field of KG construction. This paper provides a comprehensive review of recent research advances in constructing KGs using PLMs. We explain how PLMs can use their language understanding and generation capabilities to automatically extract key information for KGs, such as entities and relations, from textual data. In addition, we propose a new Hyper-Relational Knowledge Graph construction framework based on a lightweight Large Language Model (LLM), named LLHKG, and compare it with previous methods. Under our framework, the KG construction capability of a lightweight LLM is comparable to that of GPT3.5.

Chinese Translation

知识图谱（Knowledge Graph, KG）能够有效整合来自海量数据的有价值信息，因此在许多领域得到了快速发展和广泛应用。传统的知识图谱构建方法依赖于人工标注，这往往消耗大量时间和人力。而基于深度学习的知识图谱构建方案通常具有较弱的泛化能力。随着预训练语言模型（Pre-trained Language Models, PLM）的快速发展，PLM在知识图谱构建领域展现出了巨大的潜力。本文全面回顾了使用PLM构建知识图谱的最新研究进展。我们解释了PLM如何利用其语言理解和生成能力，从文本数据中自动提取知识图谱的关键信息，如实体和关系。此外，我们还提出了一种基于轻量级大语言模型（Lightweight Large Language Model, LLM）的新型超关系知识图谱构建框架（Hyper-Relational Knowledge Graph, LLHKG），并将其与之前的方法进行了比较。在我们的框架下，轻量级LLM的知识图谱构建能力可与GPT-3.5相媲美。

cs.CL / 38 / 2604.19139

The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models

大型语言模型中语言口头禅的兴起：前沿模型的系统分析

Wu, Shuai, Li, Xue, Feng, Yanna, Li, Yufang, Wang, Zhijun, Wang, Ran

Abstract

As Large Language Models (LLMs) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics -- repetitive, formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers ("That's a great question!", "Awesome!") to pseudo-empathetic affirmations ("I completely understand your concern", "I'm right here to catch you") and overused vocabulary ("delve", "tapestry", "nuanced"). In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Utilizing a custom evaluation framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. Human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p < 0.001). These results underscore the "alignment tax" of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.

Chinese Translation

随着大型语言模型（LLMs）通过人类反馈强化学习（Reinforcement Learning from Human Feedback, RLHF）和宪法人工智能（Constitutional AI）等对齐技术不断发展，一个日益明显的现象逐渐显现：语言口头禅的激增——这些是反复出现的、公式化的语言模式，充斥着模型输出。这些口头禅包括谄媚的开场白（如“这是个好问题！”，“太棒了！”）到伪同情的肯定（如“我完全理解你的担忧”，“我就在这里支持你”）以及过度使用的词汇（如“深入探讨”，“织锦”，“细致入微”）。在本文中，我们对八个最先进的LLM进行了语言口头禅现象的系统分析：GPT-5.4、Claude Opus 4.7、Gemini 3.1 Pro、Grok 4.2、Doubao-Seed-2.0-pro、Kimi K2.5、DeepSeek V3.2 和 MiMo-V2-Pro。我们利用一个定制的评估框架进行标准化的基于API的评估，评估了10,000个提示，涵盖10个任务类别，涉及英语和中文，产生了160,000个模型响应。我们引入了语言口头禅指数（Verbal Tic Index, VTI），这是一个量化口头禅普遍性的复合指标，并分析其与谄媚、词汇多样性和人类感知自然性的相关性。我们的研究结果揭示了模型间显著的差异：Gemini 3.1 Pro的VTI最高（0.590），而DeepSeek V3.2的VTI最低（0.295）。我们进一步证明，语言口头禅在多轮对话中累积，在主观任务中被放大，并显示出明显的跨语言模式。人类评估（N = 120）确认了谄媚与感知自然性之间的强负相关关系（r = -0.87，p < 0.001）。这些结果强调了当前训练范式的“对齐税”，并突显了对更真实的人机交互框架的迫切需求。
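
The abstract does not give the formula for the composite Verbal Tic Index, so the sketch below does not attempt to reproduce it. It only illustrates one plausible ingredient, the share of responses containing at least one listed tic phrase, using phrases quoted in the abstract. The function name, the matching rule (case-insensitive substring), and the sample responses are assumptions. The final lines check one reading of the stated totals: 10,000 prompts, 8 models, and 2 languages give 160,000 responses.

```python
# Phrases quoted in the abstract; the matching rule below is an assumption.
TIC_PHRASES = [
    "that's a great question",
    "i completely understand your concern",
    "delve",
    "tapestry",
    "nuanced",
]


def tic_rate(responses, phrases=TIC_PHRASES):
    """Fraction of responses containing at least one listed phrase (case-insensitive substring match)."""
    hits = sum(any(p in r.lower() for p in phrases) for r in responses)
    return hits / len(responses)


# Hypothetical model outputs.
responses = [
    "That's a great question! Let's delve into it.",
    "The capital of France is Paris.",
    "This is a nuanced topic.",
    "Use a for loop here.",
]
rate = tic_rate(responses)

# One reading consistent with the abstract's totals.
total_responses = 10_000 * 8 * 2
print(rate, total_responses)
```

A substring match is deliberately crude: it counts a response once regardless of how many tics it contains, which is why a composite index would need further components.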

cs.CL / 39 / 2604.19144

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

ReflectMT：内化反思以实现高效高质量机器翻译

Li, Kunquan, Zhang, Yingxue, Meng, Fandong, Su, Jinsong

Abstract

Recent years have witnessed growing interest in applying Large Reasoning Models (LRMs) to Machine Translation (MT). Existing approaches predominantly adopt a "think-first-then-translate" paradigm. Although explicit reasoning trajectories significantly enhance translation quality, they incur prohibitive inference costs and latency. To address these limitations, we propose ReflectMT, a two-stage reflection internalization algorithm for machine translation that employs a "translate-first-think-later" paradigm. Our approach develops the model's "translate-reflect-refine" capability through reinforcement learning. In the first stage, we cultivate the model's capacity for high-quality reflection and refinement, thereby enhancing its semantic comprehension and task-specific knowledge. In the second stage, we train the model to internalize the knowledge acquired during reflection. As a result, during inference, ReflectMT operates in a direct translation mode, producing high-quality translations on the first attempt without any explicit reasoning steps. Experimental results on datasets such as WMT24 demonstrate that our model's first-pass translations during inference outperform multi-step reasoning LRMs such as DeepSeek-R1 in both automatic metrics and GPT-based evaluation, achieving a 2.16-point improvement in GPT-based translation quality evaluation while reducing token consumption by 94.33%.

Chinese Translation

近年来，越来越多的研究关注将大型推理模型（Large Reasoning Models, LRM）应用于机器翻译（Machine Translation, MT）。现有的方法主要采用“先思考后翻译”的范式。尽管显式推理轨迹显著提高了翻译质量，但也带来了高昂的推理成本和延迟。为了解决这些局限性，我们提出了ReflectMT，一种用于机器翻译的两阶段反思内化算法，采用“先翻译后思考”的范式。我们的方法通过强化学习发展模型的“翻译-反思-精炼”能力。在第一阶段，我们培养模型的高质量反思和精炼能力，从而增强其语义理解和任务特定知识。在第二阶段，我们训练模型内化反思过程中获得的知识。因此，在推理过程中，ReflectMT以直接翻译模式运行，首次尝试即可生成高质量翻译，无需任何显式推理步骤。在WMT24等数据集上的实验结果表明，我们模型在推理过程中的首次翻译在自动评估指标和基于GPT的评估中均优于多步骤推理的LRM，如DeepSeek-R1，基于GPT的翻译质量评估提高了2.16分，同时令牌消耗减少了94.33%。

cs.CL / 40 / 2604.19149

How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning

回答标记如何读取推理轨迹？思维大型语言模型在定量推理中的自我阅读模式

Chen, Haoyang, Liu, Yi, Shao, Jianzhi, Zhang, Tao, Huo, Chengfu, Hu, Wei

Abstract

Thinking LLMs produce reasoning traces before answering. Prior activation steering work mainly targets shaping these traces. It remains less understood how answer tokens actually read and integrate the reasoning to produce reliable outcomes. Focusing on quantitative reasoning, we analyze the answer-to-reasoning attention and observe a benign self-reading pattern aligned with correctness, characterized by a forward drift of the reading focus along the reasoning trace and a persistent concentration on key semantic anchors, whereas incorrect solutions exhibit diffuse and irregular attention patterns. We interpret this as internal certainty during answer decoding, where the model commits to a viable solution branch and integrates key evidence. Following this, we propose a training-free steering method driven by Self-Reading Quality (SRQ) scores, which combine geometric metrics for process control with semantic metrics for content monitoring. SRQ selects data to build steering vectors that guide inference toward benign self-reading and away from uncertain and disorganized reading. Experiments show that our method yields consistent accuracy gains.

Chinese Translation

思维大型语言模型在回答之前会生成推理轨迹。先前的激活引导研究主要集中在塑造这些轨迹上，但关于回答标记如何实际读取和整合推理以产生可靠结果的理解仍然不足。我们聚焦于定量推理，分析回答与推理之间的注意力，并观察到与正确性一致的良性自我阅读模式，其特征是阅读焦点沿推理轨迹向前漂移，并持续集中于关键语义锚点，而不正确的解决方案则表现出分散和不规则的注意力模式。我们将其解释为答案解码过程中的内部确定性，在此过程中，模型承诺于一个可行的解决方案分支并整合关键证据。随后，我们提出了一种无训练的引导方法，该方法由自我阅读质量（Self-Reading Quality, SRQ）评分驱动，结合几何度量用于过程控制和语义度量用于内容监控。SRQ选择数据以构建引导向量，引导推理朝向良性自我阅读，远离不确定和无序的阅读。实验表明，我们的方法带来了持续的准确性提升。

cs.CL / 41 / 2604.19151

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

印度之声：一个针对印度现实语音识别的大规模基准测试

Bhogale, Kaushal, Dhir, Manas, Walecha, Amritansh, Kaur, Manmeet, Chhabra, Vanshika, Pareek, Aaditya, Sidh, Hanuman, Jain, Sagar, Singh, Bhaskar, Singh, Utkarsh, Javed, Tahir, Banga, Shobhit, Khapra, Mitesh M.

Abstract

Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed words of English origin. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances, totaling 536 hours of speech from 36,691 speakers, with transcripts that account for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.

Chinese Translation

现有的印度语言自动语音识别（ASR）基准测试通常使用脚本化的清晰语音，并采用以排行榜为驱动的评估方式，这导致了数据集特定的过拟合。此外，严格的单一参考字错误率（WER）惩罚了印度语言中自然拼写的变异，包括非标准化的混合英语词汇拼写。为了解决这些局限性，我们推出了“印度之声”（Voice of India），这是一个基于非脚本化电话对话构建的闭源基准，涵盖了15种主要印度语言，分布在139个区域集群中。该数据集包含306230个发声，总计536小时的语音，来自36691名说话者，转录文本考虑了拼写变异。我们还在地区层面分析了地理表现，揭示了差异。最后，我们提供了关于音频质量、说话速度、性别和设备类型等因素的详细分析，突显了当前ASR系统面临的挑战，并为改善现实世界中的印度语言ASR系统提供了见解。
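
The abstract's point about strict single-reference WER can be illustrated with a small sketch. The code below computes word error rate with a standard edit-distance table, but lets each reference position hold a set of acceptable spellings, so a recognized variant costs nothing. How the benchmark actually encodes spelling variants is not stated in the abstract; the set-per-word representation, the function name, and the romanized example words are assumptions.

```python
def variant_aware_wer(reference, hypothesis):
    """Word error rate where each reference slot is a set of acceptable spellings."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = minimum edits to align the first i reference slots with the first j hypothesis words
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # deleting i reference words
    for j in range(1, m + 1):
        dist[0][j] = j  # inserting j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # A hypothesis word matching any accepted spelling is not an error.
            sub = 0 if hypothesis[j - 1] in reference[i - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # deletion
                dist[i][j - 1] + 1,        # insertion
                dist[i - 1][j - 1] + sub,  # match or substitution
            )
    return dist[n][m] / n


# Hypothetical transcript: the second word has two accepted spellings.
hypothesis = ["mera", "nam", "kaushal", "hai"]
strict_ref = [{"mera"}, {"naam"}, {"kaushal"}, {"hai"}]
lenient_ref = [{"mera"}, {"naam", "nam"}, {"kaushal"}, {"hai"}]

strict_wer = variant_aware_wer(strict_ref, hypothesis)
lenient_wer = variant_aware_wer(lenient_ref, hypothesis)
print(strict_wer, lenient_wer)
```

With a single reference spelling, the variant is counted as a substitution; with the variant set, the same hypothesis scores as fully correct.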

cs.CL / 42 / 2604.19162

Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

关注未见的质量：通过软混合字母估计揭示大型语言模型的幻觉

Pan, Hongxing, Guo, Yingying, Kuang, Wenqing, Lu, Jiashi

Abstract

This paper studies uncertainty quantification for large language models (LLMs) under black-box access, where only a small number of responses can be sampled for each query. In this setting, estimating the effective semantic alphabet size--that is, the number of distinct meanings expressed in the sampled responses--provides a useful proxy for downstream risk. However, frequency-based estimators tend to undercount rare semantic modes when the sample size is small, while graph-spectral quantities alone are not designed to estimate semantic occupancy accurately. To address this issue, we propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a simple and interpretable estimator that combines Generalized Good-Turing coverage with a heat-kernel trace of the normalized Laplacian constructed from an entailment-weighted graph over sampled responses. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes. A finite-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage-adjusted semantic entropy score. Experiments on pooled semantic alphabet-size estimation against large-sample references and on QA incorrectness detection show that SHADE achieves the strongest improvements in the most sample-limited regime, while the performance gap narrows as the number of samples increases. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black-box uncertainty quantification must operate under tight sampling budgets.

Chinese Translation

本文研究了在黑箱访问下对大型语言模型（LLMs）进行不确定性量化的问题，在这种情况下，每个查询只能抽样少量响应。在这种设置中，估计有效语义字母表的大小——即在抽样响应中表达的不同含义的数量——为下游风险提供了有用的代理。然而，当样本量较小时，基于频率的估计器往往会低估稀有语义模式，而图谱量本身并不旨在准确估计语义占用。为了解决这个问题，我们提出了SHADE（软混合字母动态估计器），这是一种简单且可解释的估计器，它结合了广义古德-图灵覆盖率与从抽样响应的蕴含加权图构建的归一化拉普拉斯算子的热核迹。估计的覆盖率自适应地决定了融合规则：在高覆盖率下，SHADE使用两种信号的凸组合，而在低覆盖率下，它应用LogSumExp融合以强调缺失或观察较弱的语义模式。然后引入有限样本修正，以稳定结果的基数估计，随后将其转换为覆盖调整的语义熵得分。在与大样本参考的汇总语义字母表大小估计和QA错误检测的实验中，SHADE在样本限制最严格的情况下取得了最强的改进，而随着样本数量的增加，性能差距逐渐缩小。这些结果表明，当黑箱不确定性量化必须在严格的抽样预算下操作时，混合语义占用估计特别有利。
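
The coverage-gated fusion described in the abstract can be sketched as follows. This is a simplified illustration, not the SHADE estimator: it uses the basic Good-Turing coverage estimate (one minus the share of singleton classes) instead of the generalized version, takes the spectral (heat-kernel trace) estimate as a given input instead of computing it from an entailment graph, and omits the finite-sample correction and entropy conversion. The threshold, weight, and temperature values are hypothetical.

```python
import math
from collections import Counter


def good_turing_coverage(labels):
    """Basic Good-Turing coverage: 1 - (number of singleton classes) / (sample size)."""
    counts = Counter(labels)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / len(labels)


def fuse_estimates(freq_estimate, spectral_estimate, coverage,
                   threshold=0.7, weight=0.5, temperature=1.0):
    """Convex combination under high coverage; LogSumExp (a smooth maximum) otherwise."""
    if coverage >= threshold:
        return weight * freq_estimate + (1.0 - weight) * spectral_estimate
    scaled = [freq_estimate / temperature, spectral_estimate / temperature]
    peak = max(scaled)  # subtract the peak for numerical stability
    return temperature * (peak + math.log(sum(math.exp(s - peak) for s in scaled)))


# Hypothetical semantic cluster labels for four sampled responses.
labels = ["a", "a", "b", "c"]
coverage = good_turing_coverage(labels)   # two singletons out of four samples
freq_estimate = len(set(labels))          # observed number of distinct meanings
spectral_estimate = 3.6                   # assumed value standing in for the heat-kernel trace
fused = fuse_estimates(freq_estimate, spectral_estimate, coverage)
print(coverage, fused)
```

Because LogSumExp is never smaller than the larger of its inputs, the low-coverage branch pushes the estimate upward, which matches the abstract's stated aim of emphasizing missing or weakly observed semantic modes.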

cs.CL / 43 / 2604.19185

SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization

SCURank：利用摘要内容单元对多个候选摘要进行排名以增强摘要效果

Wang, Bo-Jyun, Lin, Ying-Jia, Kao, Hung-Yu

Abstract

Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. However, existing LLM-based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high-quality summaries. To address these issues, we introduce SCURank, a framework that enhances summarization by leveraging Summary Content Units (SCUs). Instead of relying on unstable comparisons or surface-level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM-based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled model performance, validating the benefits of information-centric ranking in multi-LLM distillation. The code for SCURank is available at https://github.com/IKMLab/SCURank.

Chinese Translation

小型语言模型（SLMs），如BART，通过蒸馏可以实现与大型语言模型（LLMs）相当的摘要性能。然而，现有基于LLM的候选摘要排名策略存在不稳定性，而经典指标（如ROUGE）不足以有效排名高质量摘要。为了解决这些问题，我们提出了SCURank，一个通过利用摘要内容单元（SCUs）来增强摘要效果的框架。SCURank不依赖于不稳定的比较或表面重叠，而是基于信息内容的丰富性和语义重要性来评估摘要。我们研究了SCURank在从多个多样化LLM中提取摘要的有效性。实验结果表明，SCURank在各项评估指标和数据集上均优于传统指标和基于LLM的排名方法。此外，我们的研究结果显示，结合多样化的LLM摘要可以增强模型的抽象性和整体蒸馏模型性能，验证了信息中心排名在多LLM蒸馏中的优势。SCURank的代码可在https://github.com/IKMLab/SCURank获取。

cs.CL / 44 / 2604.19189

Headlines You Won't Forget: Can Pronoun Insertion Increase Memorability?

你不会忘记的标题：代词插入能否提高记忆度？

Meyer, Selina, Abel, Magdalena, Roth, Michael

Abstract

For news headlines to influence beliefs and drive action, relevant information needs to be retained and retrievable from memory. In this probing study, we draw on experiment designs from cognitive psychology to examine how a specific linguistic feature, namely direct address through first- and second-person pronouns, affects memorability and to what extent it is feasible to use large language models for the targeted insertion of such a feature into existing text without changing its core meaning. Across three controlled memorization experiments with a total of 240 participants, yielding 7,680 unique memory judgments, we show that pronoun insertion has mixed effects on memorability. Exploratory analyses indicate that effects differ based on headline topic, how pronouns are inserted, and their immediate contexts. Additional data and fine-grained analysis are needed to draw definitive conclusions on these mediating factors. We further show that automatic revisions by LLMs are not always appropriate: crowdsourced evaluations find many of them to be lacking in content accuracy and emotion retention or to result in an unnatural writing style. We make our collected data available for future work.

Chinese Translation

为了使新闻标题能够影响信念并推动行动，相关信息需要在记忆中被保留和提取。在这项探讨性研究中，我们借鉴了认知心理学的实验设计，考察了一种特定的语言特征，即通过第一人称和第二人称代词的直接称呼，如何影响记忆度，以及在多大程度上可以利用大型语言模型（LLMs）将这一特征有针对性地插入现有文本中，而不改变其核心含义。在三项控制的记忆实验中，共有240名参与者，产生了7,680个独特的记忆判断，我们显示代词插入对记忆度的影响是复杂的。探索性分析表明，影响因子因标题主题、代词插入方式及其直接上下文而异。需要更多的数据和细致的分析来对这些中介因素得出明确的结论。我们进一步展示，LLMs的自动修订并不总是合适的：众包评估发现其中许多内容缺乏准确性和情感保留，或者导致不自然的写作风格。我们将收集的数据提供给未来的研究使用。

cs.CL / 45 / 2604.19245

Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs

与全知的GPT对话还是与二次猜测的Claude对话？修复如何揭示大型语言模型中的不可靠多轮行为

Lachenmaier, Clara, Bultmann, Hannah, Zarrieß, Sina

Abstract

Repair, an important resource for resolving trouble in human-human conversation, remains underexplored in human-LLM interaction. In this study, we investigate how LLMs engage in the interactive process of repair in multi-turn dialogues around solvable and unsolvable math questions. We examine whether models initiate repair themselves and how they respond to user-initiated repair. Our results show strong differences across models: reactions range from being almost completely resistant to (appropriate) repair attempts to being highly susceptible and easily manipulated. We further demonstrate that once conversations extend beyond a single turn, model behavior becomes more distinctive and less predictable across systems. Overall, our findings indicate that each tested LLM exhibits its own characteristic form of unreliability in the context of repair.

Chinese Translation

修复是解决人际对话中问题的重要资源，但在人与大型语言模型（LLM）的互动中仍然未得到充分探索。本研究探讨了LLM在围绕可解和不可解数学问题的多轮对话中如何参与修复的互动过程。我们考察了模型是否主动发起修复，以及它们如何回应用户发起的修复。我们的结果显示不同模型之间存在显著差异：反应范围从几乎完全抵制（适当的）修复尝试到高度易受影响且容易被操控。我们进一步证明，一旦对话超出单轮，模型行为在系统之间变得更加独特且不易预测。总体而言，我们的研究发现，每个被测试的LLM在修复的背景下展现出其自身特有的不可靠性形式。

cs.CL / 46 / 2604.19254

ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning

ShadowPEFT：用于参数高效微调的影子网络

Li, Xianming, Li, Zongxi, Lee, Tsz-fung Andrew, Li, Jing, Xie, Haoran, Li, Qing

Abstract

Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small set of task-specific parameters while freezing the pretrained backbone. However, existing approaches, such as Low-Rank Adaptation (LoRA), achieve adaptation by inserting independent low-rank perturbations directly to individual weights, resulting in a local parameterization of adaptation. We propose ShadowPEFT, a centralized PEFT framework that instead performs layer-level refinement through a depth-shared shadow module. At each transformer layer, ShadowPEFT maintains a parallel shadow state and evolves it repeatedly for progressively richer hidden states. This design shifts adaptation from distributed weight-space perturbations to a shared layer-space refinement process. Since the shadow module is decoupled from the backbone, it can be reused across depth, independently pretrained, and optionally deployed in a detached mode, benefiting edge computing scenarios. Experiments on generation and understanding benchmarks show that ShadowPEFT matches or outperforms LoRA and DoRA under comparable trainable-parameter budgets. Additional analyses on shadow pretraining, cross-dataset transfer, parameter scaling, inference latency, and system-level evaluation suggest that centralized layer-space adaptation is a competitive and flexible alternative to conventional low-rank PEFT.

Chinese Translation

参数高效微调（PEFT）通过仅训练一小部分特定于任务的参数，同时冻结预训练的主干，减少了对大型语言模型（LLMs）进行全参数微调的训练成本。然而，现有的方法，如低秩适应（Low-Rank Adaptation, LoRA），通过直接向单个权重插入独立的低秩扰动来实现适应，导致适应的局部参数化。我们提出了ShadowPEFT，一种集中式PEFT框架，它通过深度共享的影子模块在层级上进行细化。在每个变换器层中，ShadowPEFT维持一个平行的影子状态，并反复演化以获得逐渐丰富的隐藏状态。这一设计将适应从分布式权重空间扰动转移到共享的层空间细化过程。由于影子模块与主干解耦，它可以在深度上重复使用，独立预训练，并可选择以分离模式部署，从而有利于边缘计算场景。在生成和理解基准上的实验表明，ShadowPEFT在可比较的可训练参数预算下与LoRA和DoRA相匹配或超越。关于影子预训练、跨数据集迁移、参数缩放、推理延迟和系统级评估的额外分析表明，集中式层空间适应是传统低秩PEFT的一个具有竞争力和灵活性的替代方案。

cs.CL / 47 / 2604.19261

Towards a Linguistic Evaluation of Narratives: A Quantitative Stylistic Framework

朝向叙事的语言评估：一种定量风格框架

Maisto, Alessandro

Abstract

The evaluation of narrative quality remains a complex challenge, as it involves subjective factors such as plot, character development, and emotional impact. This work proposes a quantitative approach to narrative assessment by focusing on the linguistic dimension as a primary indicator of quality. The paper presents a methodology for the automatic evaluation of narrative based on the extraction of a comprehensive set of 33 quantitative linguistic features categorized into lexical, syntactic, and semantic groups. To test the model, an experiment was conducted on a specialized corpus of 23 books, including canonical masterpieces and self-published works. Through a similarity matrix, the system successfully clustered the narratives, distinguishing almost perfectly between professionally edited and self-published texts. Furthermore, the methodology was validated against a human-annotated dataset; it significantly outperforms traditional story-level evaluation metrics, demonstrating the effectiveness of quantitative linguistic features in assessing narrative quality.

Chinese Translation

叙事质量的评估仍然是一个复杂的挑战，因为它涉及情节、角色发展和情感影响等主观因素。本研究提出了一种定量的方法来评估叙事，重点关注语言维度作为质量的主要指标。本文展示了一种基于提取33个定量语言特征的综合方法，这些特征被分类为词汇、句法和语义组。为了测试该模型，进行了一个实验，使用了一个包含23本书籍的专业语料库，其中包括经典杰作和自出版作品。通过相似性矩阵，该系统成功地对叙事进行了聚类，几乎完美地区分了专业编辑的文本和自出版的文本。此外，该方法还在一个人工标注的数据集上进行了验证；其显著优于传统的故事级别评估指标，证明了定量语言特征在评估叙事质量方面的有效性。

cs.CL / 48 / 2604.19262

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

CulturALL：基于实际任务评估大型语言模型的多语言和多文化能力的基准

Lin, Peiqin, Lyu, Chenyang, Luo, Wenjiang, Ye, Haotian, Hossain, Md Mehrab, Ma, Chunlan, Ji, Shaoxiong, Samih, Younes, Zeng, Bo, Jiang, Fan, Cao, Yuanbin, Duisenbek, Dilda, Xun, Adrian Neo Sau, Pozdniakova, Daria, Misevich, Liubou, Marinković, Nevena, Nguyen, Ngoc Gia Linh, Do, Thi Khanh Linh, Sophy, Sarakmatak, Hu, Baotian, Chen, Guanhua, Tang, Gongbo, Aji, Alham Fikri, Wang, Longyue, Luo, Weihua

Abstract

Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks -- where models must reason within real-world, context-rich scenarios -- largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human-AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.

Chinese Translation

大型语言模型（LLMs）现已在全球范围内部署，激发了一系列基准测试，以衡量它们的多语言和多文化能力。然而，这些基准测试优先考虑通用语言理解或肤浅的文化知识，导致对实际任务的评估——即模型必须在真实世界的丰富上下文场景中进行推理——大多未得到关注。为填补这一空白，我们提出了CulturALL，这是一个全面且具有挑战性的基准，用于评估LLMs在实际任务中的多语言和多文化能力。CulturALL是通过人类与人工智能的协作框架构建的：专家注释者确保适当的难度和事实准确性，而LLMs则减轻了人工工作负担。通过整合多样化的来源，CulturALL确保了全面的场景覆盖。每个项目都经过精心设计，以呈现高水平的难度，使CulturALL具有挑战性。CulturALL包含来自51个地区的14种语言的2,610个样本，分布在16个主题中，以捕捉实际任务的全貌。实验表明，最佳LLM在CulturALL上的准确率为44.48%，凸显了显著的改进空间。

cs.CL / 49 / 2604.19274

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

HarDBench：一种针对基于草稿的共同创作越狱攻击的基准测试，以确保人类与大型语言模型的安全协作写作

Kim, Euntae, Han, Soomin, Chang, Buru

Abstract

Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models, filling incomplete drafts with dangerous content, to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains, including Explosives, Drugs, Weapons, and Cyberattacks, and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and that our alignment method significantly reduces harmful outputs without degrading co-authoring performance. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench

Chinese Translation

大型语言模型（LLMs）在协作写作中越来越多地被用作共同作者，用户从粗略草稿开始，依赖LLMs来完成、修订和完善他们的内容。然而，这种能力带来了严重的安全风险：恶意用户可能会越狱模型——用危险内容填充不完整的草稿——迫使其生成有害输出。本文识别了当前LLMs在此类基于草稿的共同创作越狱攻击中的脆弱性，并引入了HarDBench，一个系统化的基准测试，旨在评估LLMs对这一新兴威胁的鲁棒性。HarDBench涵盖了一系列高风险领域——包括爆炸物、药物、武器和网络攻击——并提供具有现实结构和领域特定提示的提示，以评估模型对有害完成的易感性。为了降低这一风险，我们提出了一种基于偏好优化的安全-效用平衡对齐方法，训练模型拒绝有害的完成，同时在良性草稿上保持有用性。实验结果表明，现有LLMs在共同创作环境中高度脆弱，而我们的方法显著减少了有害输出，同时不降低共同创作能力的表现。这为评估和对齐人类与LLMs在协作写作环境中的新范式提供了新的视角。我们的新基准和数据集可在我们的项目页面上获取，网址为 https://github.com/untae0122/HarDBench

cs.CL / 50 / 2604.19292

Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs

位置未找到：揭示多语言大型语言模型中的隐含地方性和全球性偏见

Mor-Lan, Guy, Goldman, Omer, Eyal, Matan, Gilady, Adi Mayrav, Eiger, Sivan, Szpektor, Idan, Hassidim, Avinatan, Matias, Yossi, Tsarfaty, Reut

Abstract

Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models' inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates, and measurements. The questions do not contain indications of the locales they relate to, other than the querying language itself. LLMs' responses to LocQA locale-ambiguous questions thus reveal models' implicit priors. We used LocQA to evaluate 32 models, and detected two types of structural biases. Inter-lingually, we show a global bias towards answers relevant to the US-locale, even when models are asked in languages other than English. Moreover, we discovered that this global bias is exacerbated in models that underwent instruction tuning, compared to their base counterparts. Intra-lingually, we show that when multiple locales are relevant for the same language, models act as demographic probability engines, prioritizing locales with larger populations. Taken together, insights from LocQA may help in shaping LLMs' desired local behavior, and in quantifying the impact of various training phases on different kinds of biases.

Chinese Translation

多语言大型语言模型（LLMs）已缩小了不同语言之间的流畅性差距。然而，这一进展使模型面临偏见行为的风险，因为知识和规范可能在不同语言之间传播。在本研究中，我们旨在通过模型回答地方模糊问题的能力来量化模型的跨语言和同语言偏见。为此，我们提出了LocQA，一个包含2156个问题的测试集，涵盖12种语言，涉及法律、日期和测量等各种依赖地方的事实。这些问题除了查询语言本身外，不包含与其相关的地方指示。因此，LLMs对LocQA地方模糊问题的回答揭示了模型的隐含先验。我们使用LocQA评估了32个模型，并检测到两种类型的结构性偏见。在跨语言方面，我们发现即使在非英语语言中提问，模型仍表现出对与美国地方相关答案的全球偏见。此外，我们发现与其基础模型相比，经过指令调优的模型这一全球偏见更加严重。在同语言方面，我们展示了当多个地方与同一种语言相关时，模型表现得像是人口概率引擎，优先考虑人口较多的地方。综合来看，LocQA的洞察可能有助于塑造LLMs所需的地方行为，并量化不同训练阶段对各种偏见的影响。

cs.CL / 51 / 2604.19298

IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text

IndiaFinBench：评估大型语言模型在印度金融监管文本上的表现的评估基准

Pall, Rajveer Singh

Abstract

We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English-language financial news), leaving a significant gap in coverage of non-Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert-annotated question-answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero-shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non-specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage-point spread across models. Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at https://github.com/rajveerpall/IndiaFinBench

Chinese Translation

我们介绍了IndiaFinBench，至今为止这是第一个公开可用的评估基准，用于评估大型语言模型（LLM）在印度金融监管文本上的表现。现有的金融自然语言处理（NLP）基准完全来自西方金融语料库（如美国证券交易委员会（SEC）文件、美国财报和英语金融新闻），在覆盖非西方监管框架方面存在显著空白。IndiaFinBench通过406个专家注释的问题-答案对填补了这一空白，这些数据源自192份来自印度证券交易委员会（SEBI）和印度储备银行（RBI）的文件，涵盖四种任务类型：监管解释（174项）、数值推理（92项）、矛盾检测（62项）和时间推理（78项）。注释质量通过基于模型的二次验证（矛盾检测的kappa值为0.918）和60项人类注释者间一致性评估（kappa值为0.611；总体一致性为76.7%）得以验证。我们在零样本条件下评估了十二个模型，准确率范围从70.4%（Gemma 4 E4B）到89.7%（Gemini 2.5 Flash）。所有模型的表现均显著优于非专业人类基线的60.0%。数值推理是最具区分性的任务，各模型之间的差异达到35.9个百分点。自助法显著性测试（10,000次重抽样）揭示了三个统计上显著不同的表现层次。数据集、评估代码和所有模型输出可在https://github.com/rajveerpall/IndiaFinBench获取。
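The abstract reports bootstrap significance testing with 10,000 resamples but does not spell out the procedure. A minimal sketch of one common variant, a paired bootstrap over per-item correctness, is given here; the function name `paired_bootstrap` and the toy correctness vectors are hypothetical and are not taken from the benchmark's released code or data.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Return (observed accuracy gap A - B, share of resamples where A does not beat B)."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return observed, not_better / n_resamples

# Toy per-item correctness (1 = correct); illustrative only, not benchmark data.
model_a = [1] * 90 + [0] * 10
model_b = [1] * 70 + [0] * 30
gap, p_value = paired_bootstrap(model_a, model_b, n_resamples=2000)
print(f"accuracy gap = {gap:.2f}, bootstrap p ~ {p_value:.3f}")
```

Running this test over every pair of models and grouping the pairs that cannot be separated is one plausible way to arrive at performance tiers, though the abstract does not say how the tiers were formed.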


cs.CL / 52 / 2604.19299

Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

重新思考规模：小型语言模型在智能体范式下的部署权衡

Wang, Xinlin, Brorsson, Mats

Abstract

Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of <10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi-agent system with collaborative capabilities. Our results show that single-agent systems achieve the best balance between performance and cost, while multi-agent setups add overhead with limited gains. Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.

Chinese Translation

尽管大型语言模型具有令人印象深刻的能力，但其巨大的计算成本、延迟和隐私风险阻碍了它们在现实应用中的广泛部署。拥有不到100亿参数的小型语言模型（Small Language Models, SLMs）提供了一个有前景的替代方案；然而，它们在知识和推理方面的固有局限性限制了其有效性。现有研究主要集中在通过规模法则或微调策略来增强SLMs，而忽视了利用智能体范式（如工具使用和多智能体协作）系统性弥补小型模型固有弱点的潜力。为了解决这一问题，本文首次对三种范式下的<10B开源模型进行了大规模、全面的研究：（1）基础模型，（2）配备工具的单一智能体，以及（3）具有协作能力的多智能体系统。我们的结果表明，单一智能体系统在性能和成本之间实现了最佳平衡，而多智能体设置则增加了开销但收益有限。我们的研究结果强调了在资源受限环境中进行高效和可信部署时，智能体中心设计的重要性。


cs.CL / 53 / 2604.19331

Evaluating LLM-Driven Summarisation of Parliamentary Debates with Computational Argumentation

评估基于大型语言模型的议会辩论摘要与计算论证

Cunningham, Eoghan, Greene, Derek, Cross, James, Rago, Antonio

Abstract

Understanding how policy is debated and justified in parliament is a fundamental aspect of the democratic process. However, the volume and complexity of such debates mean that outside audiences struggle to engage. Meanwhile, Large Language Models (LLMs) have been shown to enable automated summarisation at scale. While summaries of debates can make parliamentary procedures more accessible, evaluating whether these summaries faithfully communicate argumentative content remains challenging. Existing automated summarisation metrics have been shown to correlate poorly with human judgements of consistency (i.e., faithfulness or alignment between summary and source). In this work, we propose a formal framework for evaluating parliamentary debate summaries that grounds argument structures in the contested proposals up for debate. Our novel approach, driven by computational argumentation, focuses the evaluation on formal properties concerning the faithful preservation of the reasoning presented to justify or oppose policy outcomes. We demonstrate our methods using a case-study of debates from the European Parliament and associated LLM-driven summaries.

Chinese Translation

理解政策在议会中的辩论和辩护是民主过程的一个基本方面。然而，辩论的数量和复杂性使得外部观众难以参与。同时，已有研究表明，大型语言模型（LLMs）能够实现大规模的自动摘要。虽然辩论摘要可以使议会程序更加易于理解，但评估这些摘要是否忠实地传达了论证内容仍然具有挑战性。现有的自动摘要评估指标与人类对一致性（即摘要与来源之间的忠实度或一致性）的判断相关性较差。在本研究中，我们提出了一个评估议会辩论摘要的正式框架，该框架将论证结构与待辩论的争议提案相结合。我们基于计算论证的创新方法，专注于评估与忠实保留为支持或反对政策结果而呈现的推理相关的正式属性。我们通过对欧洲议会辩论及相关的基于LLM的摘要的案例研究展示了我们的方法。


cs.CL / 54 / 2604.19342

Are Large Language Models Economically Viable for Industry Deployment?

大型语言模型在行业部署中经济上可行吗？

Mohammad, Abdullah, Ray, Sushant Kumar, Arora, Pushkar, Ali, Rafiq, Shabbir, Ebad, Kashyap, Gautam Siddharth, Gao, Jiechao, Naseem, Usman

Abstract

Generative AI, powered by Large Language Models (LLMs), is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization, not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment-Evaluation Gap: the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL, an industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics, namely Economic Break-Even (Nbreak), Intelligence-Per-Watt (IPW), System Density (ρsys), Cold-Start Tax (Ctax), and Quantization Fidelity (Qret), which capture profitability, energy efficiency, hardware scaling, serverless feasibility, and compression safety. Our results reveal a clear efficiency frontier: models in the <2B parameter class dominate larger baselines across economic and ecological dimensions. LLaMA-3.2-1B (INT4) achieves ROI break-even in 14 requests (median), delivers 3x higher energy-normalized intelligence than 7B models, and exceeds 6,900 tokens/s/GB under 4-bit quantization. We further uncover an efficiency anomaly: while QLoRA reduces memory footprint, it increases adaptation energy by up to 7x for small models, challenging prevailing assumptions about quantization-aware training in edge deployment.

Chinese Translation

基于大型语言模型（LLMs）的生成性人工智能正在医疗决策支持、金融分析、企业检索和对话自动化等行业中越来越多地应用，其中可靠性、效率和成本控制至关重要。在这些环境中，模型必须满足对能耗、延迟和硬件利用率的严格限制，而不仅仅是准确性。然而，现有的评估流程仍然以准确性为中心，导致了一个部署评估差距，即在模型评估中缺乏操作和经济标准。为了解决这一差距，我们提出了EDGE-EVAL，一个面向行业的基准框架，评估LLMs在传统NVIDIA Tesla T4 GPU上的整个生命周期。通过在三个工业任务中对LLaMA和Qwen变体进行基准测试，我们引入了五个部署指标：经济盈亏平衡（Nbreak）、每瓦智能（IPW）、系统密度（ρsys）、冷启动税（Ctax）和量化保真度（Qret），这些指标捕捉了盈利能力、能效、硬件扩展性、无服务器可行性和压缩安全性。我们的结果揭示了一个明确的效率边界：在经济和生态维度上，<2B参数类别的模型优于更大的基线模型。LLaMA-3.2-1B（INT4）在14个请求（中位数）中实现了投资回报率盈亏平衡，提供了比7B模型高出3倍的能量标准化智能，并在4位量化下超过6,900个tokens/s/GB。我们进一步发现了一个效率异常：尽管QLoRA减少了内存占用，但它使小模型的适应能耗增加了多达7倍，这挑战了关于边缘部署中量化感知训练的现有假设。


cs.CL / 55 / 2604.19351

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

DASH-KV：通过不对称KV缓存哈希加速长上下文LLM推理

Guo, Jinyu, Zhang, Zhihan, Li, Yutong, Xie, Jiehui, Iqbal, Md. Tamim, Han, Dongshen, Lee, Lik-Hang, Bae, Sung-Ho, Zou, Jie, Yang, Yang, Zhang, Chaoning

Abstract

The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and fail to address the high overhead of floating-point arithmetic. This paper introduces DASH-KV, an innovative acceleration framework that reformulates attention as approximate nearest-neighbor search via asymmetric deep hashing. Under this paradigm, we design an asymmetric encoding architecture that differentially maps queries and keys to account for their distinctions in precision and reuse characteristics. To balance efficiency and accuracy, we further introduce a dynamic mixed-precision mechanism that adaptively retains full-precision computation for critical tokens. Extensive experiments on LongBench demonstrate that DASH-KV significantly outperforms state-of-the-art baseline methods while matching the performance of full attention, all while reducing inference complexity from O(N^2) to linear O(N). The code is available at https://github.com/Zhihan-Zh/DASH-KV

Chinese Translation

标准注意力机制的平方计算复杂度构成了大型语言模型在长上下文推理中的基本瓶颈。尽管现有的KV缓存压缩方法缓解了内存压力，但它们往往牺牲生成质量，并未解决浮点运算的高开销。本文提出了DASH-KV，一种创新的加速框架，通过不对称深度哈希将注意力重新表述为近似最近邻搜索。在这一范式下，我们设计了一种不对称编码架构，差异性地将查询和键映射，以考虑它们在精度和重用特性上的区别。为了平衡效率和准确性，我们进一步引入了一种动态混合精度机制，自适应地为关键标记保留全精度计算。在LongBench上的大量实验表明，DASH-KV显著优于最先进的基线方法，同时与全注意力的性能相匹配，并将推理复杂度从O(N^2)降低到线性O(N)。代码可在https://github.com/Zhihan-Zh/DASH-KV获取。
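To make the idea of "attention as approximate nearest-neighbor search" concrete, here is a rough sketch under loud assumptions: it uses random sign projections shared by queries and keys, whereas the abstract describes learned, asymmetric deep hashing, so this is a stand-in technique and not the DASH-KV method. All function names are hypothetical. Keys are ranked by Hamming distance to the query's bit code, and exact softmax attention is computed only over the `top_k` survivors.

```python
import math
import random

def make_planes(dim, n_bits, seed=0):
    """Random hyperplanes for sign-random-projection hashing."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_code(vec, planes):
    """One bit per hyperplane (sign of the dot product), packed into an int."""
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def hashed_attention(query, keys, values, planes, top_k):
    """Exact softmax attention restricted to the top_k keys nearest in Hamming distance."""
    q_code = hash_code(query, planes)
    key_codes = [hash_code(k, planes) for k in keys]  # could be cached with the keys
    hamming = [bin(q_code ^ kc).count("1") for kc in key_codes]
    chosen = sorted(range(len(keys)), key=lambda i: hamming[i])[:top_k]
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in chosen]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, chosen):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out, chosen

planes = make_planes(dim=2, n_bits=8)
keys = [[1.0, 0.0], [-1.0, 0.0], [0.9, 0.1]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, picked = hashed_attention([1.0, 0.0], keys, values, planes, top_k=1)
print(picked, output)
```

The candidate filter replaces floating-point dot products with XOR and bit counts, which is the kind of saving the abstract points to; this sketch still scans every key code, so it illustrates the filtering step only and makes no claim about the paper's complexity result.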


cs.CL / 56 / 2604.19394

Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?

持续预训练能否缩小通用语言模型与专用语言模型在医学领域的性能差距？

Doll, Niclas, Buschhoff, Jasper Schulze, Satheesh, Shalaka, Abdelwahab, Hammam, Allende-Cid, Héctor, Klug, Katrin

Abstract

This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging. We address the scarcity of specialized non-English data by constructing a high-quality German medical corpus (FineMed-de) from FineWeb2. This corpus is used to continually pre-train and merge three well-known LLMs (ranging from $7B$ to $24B$ parameters), creating the DeFineMed model family. A comprehensive evaluation confirms that specialization dramatically enhances $7B$ model performance on German medical benchmarks. Furthermore, the pairwise win-rate analysis of the Qwen2.5-based models demonstrates an approximately $3.5$-fold increase in the win-rate against the much larger Mistral-Small-24B-Instruct through domain adaptation. This evidence positions specialized $7B$ models as a competitive, resource-efficient solution for complex medical instruction-following tasks. While model merging successfully restores instruction-following abilities, a subsequent failure mode analysis reveals inherent trade-offs, including the introduction of language mixing and increased verbosity, highlighting the need for more targeted fine-tuning in future work. This research provides a robust, compliant methodology for developing specialized LLMs, serving as the foundation for practical use in German-speaking healthcare contexts.

Chinese Translation

本文通过持续预训练和模型合并，缩小了小型专用模型与显著更大通用模型之间的性能差距。我们通过从FineWeb2构建高质量的德语医学语料库（FineMed-de），解决了专用非英语数据的稀缺性。该语料库用于对三种知名的大型语言模型（参数范围从$7B$到$24B$）进行持续预训练和合并，创建了DeFineMed模型系列。全面评估确认，专业化显著提升了$7B$模型在德语医学基准测试中的性能。此外，基于Qwen2.5的模型的成对胜率分析显示，通过领域适应，胜率相较于更大的Mistral-Small-24B-Instruct提高了约$3.5$倍。这一证据表明，专用的$7B$模型在复杂医学指令跟随任务中是一种具有竞争力且资源高效的解决方案。尽管模型合并成功恢复了指令跟随能力，但后续的失败模式分析揭示了固有的权衡，包括语言混合的引入和冗长性增加，突显了未来工作中更有针对性的微调的必要性。本研究提供了一种稳健且合规的方法论，用于开发专用大型语言模型，为在德语医疗环境中的实际应用奠定基础。


cs.CL / 57 / 2604.19395

Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?

自一致性是否提高了百科知识的回忆能力？

Hoshino, Sho, Honda, Ukyo, Zhang, Peinan

Abstract

While self-consistency is known to improve performance on symbolic reasoning, its effect on the recall of encyclopedic knowledge is unclear due to a lack of targeted evaluation grounds. To address this, we establish such a knowledge recall split for the popular MMLU benchmark by applying a data-driven heuristic from prior work. We validate this split by showing that the performance patterns on the symbolic reasoning and knowledge recall subsets mirror those of GSM8K and MedMCQA, respectively. Using this solid ground, we find that self-consistency consistently improves performance across both symbolic reasoning and knowledge recall, even though its underlying CoT prompting is primarily effective for symbolic reasoning. As a result, we achieve an 89% accuracy on MMLU, the best performance to date with the use of GPT-4o.

Chinese Translation

虽然自一致性被认为能提高符号推理的表现，但由于缺乏针对性的评估依据，其对百科知识回忆的影响尚不明确。为了解决这一问题，我们通过应用先前工作的数据驱动启发式方法，为流行的 MMLU 基准建立了这样的知识回忆分割。我们通过展示符号推理和知识回忆子集的表现模式分别与 GSM8K 和 MedMCQA 的模式相似来验证这一分割。基于这一坚实的基础，我们发现自一致性在符号推理和知识回忆两个方面的表现均有持续改善，尽管其基础的 CoT 提示主要对符号推理有效。因此，我们在 MMLU 上达到了 89% 的准确率，这是迄今为止使用 GPT-4o 的最佳表现。
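Self-consistency, as commonly described, samples several chain-of-thought completions and returns the most frequent final answer. The sketch below assumes a hypothetical callable `sample_answer(question)` that stands in for one stochastic model call plus final-answer extraction; it is stubbed with canned answers here, and nothing in it reflects the authors' prompts or settings.

```python
from collections import Counter

def self_consistency(sample_answer, question, n_samples=5):
    """Sample n answers and return the majority answer with its vote share."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Stub standing in for a sampled model call (hypothetical).
canned = iter(["B", "C", "B", "A", "B"])
answer, share = self_consistency(lambda q: next(canned), "toy question", n_samples=5)
print(answer, share)  # prints: B 0.6
```

The vote share doubles as a crude confidence signal: a low share means the sampled reasoning paths disagree.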


cs.CL / 58 / 2604.19405

Lost in Translation: Do LVLM Judges Generalize Across Languages?

翻译中的迷失：大型视觉语言模型（LVLM）评估者是否能够跨语言泛化？

Laskar, Md Tahmid Rahman, Islam, Mohammed Saidul, Nayeem, Mir Tafseer, Bhuiyan, Amran, Rahman, Mizanur, Joty, Shafiq, Hoque, Enamul, Huang, Jimmy

Abstract

Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.

Chinese Translation

自动评估器，如奖励模型，在大型视觉语言模型（LVLM）的对齐和评估中发挥着核心作用。尽管它们的重要性日益增加，这些评估器几乎仅在以英语为中心的基准上进行评估，这使得它们在跨语言泛化能力方面的问题悬而未决。为了解答这一问题，我们引入了MM-JudgeBench，这是第一个针对多语言和多模态评估模型的大规模基准，包含超过60,000个成对偏好实例，涵盖25种类型多样的语言。MM-JudgeBench整合了两个互补的子集：一个扩展自VL-RewardBench的通用视觉语言偏好评估子集，以及一个源自OpenCQA的图表中心视觉文本推理子集，从而能够系统性地分析奖励模型（即LVLM评估者）在不同环境中的表现。此外，我们还发布了一个来自MM-RewardBench的多语言训练集，与我们的评估数据不重叠，以支持领域适应。通过评估22个LVLM（15个开源，7个专有），我们在所提出的基准中发现了显著的跨语言性能差异。我们的分析进一步表明，模型的大小和架构对多语言鲁棒性预测效果不佳，即使是最先进的LVLM评估者在不同语言间也表现出不一致的行为。这些发现共同揭示了当前奖励建模的基本局限性，并强调了开发可靠自动评估器所需的多语言、多模态基准的重要性。


cs.CL / 59 / 2604.19440

What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search

什么使得大型语言模型成为优秀的优化器？大型语言模型引导的进化搜索轨迹分析

Zhang, Xinhao, Chen, Xi, Portet, François, Peyrard, Maxime

Abstract

Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary search, collecting optimization trajectories for 15 LLMs across 8 tasks. Although zero-shot problem-solving ability correlates with final optimization outcomes, it explains only part of the variance: models with similar initial capability often induce dramatically different search trajectories and outcomes. By analyzing these trajectories, we find that strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory analysis for understanding and improving LLM-based optimization systems and provide actionable insights for their design and training.

Chinese Translation

近期的研究表明，在进化和自主优化系统中协调大型语言模型（LLMs）具有良好的前景。然而，推动这些优化增益的机制仍然不甚明了。在本研究中，我们进行了大规模的LLM引导的进化搜索研究，收集了15个LLM在8个任务上的优化轨迹。尽管零-shot问题解决能力与最终优化结果相关，但它仅解释了部分方差：具有相似初始能力的模型往往会产生截然不同的搜索轨迹和结果。通过分析这些轨迹，我们发现强大的LLM优化器表现为局部精炼者，频繁产生增量改进，同时逐步在语义空间中局部化搜索。相反，较弱的优化器则表现出较大的语义漂移，偶尔出现突破后又陷入停滞。值得注意的是，各种解决方案新颖性的度量并不能预测最终性能；新颖性只有在搜索足够局部化于高性能区域时才是有益的。我们的结果强调了轨迹分析在理解和改进基于LLM的优化系统中的重要性，并为其设计和训练提供了可行的见解。


cs.CL / 60 / 2604.19447

'The Order in the Horse's Heart': A Case Study in LLM-Assisted Stylometry for the Discovery of Biblical Allusion in Modern Literary Fiction

‘马心中的秩序’：基于大型语言模型的风格计量学案例研究，探讨现代文学小说中的圣经典故

Cameron, Ewan

Abstract

We present a dual-track pipeline for detecting biblical allusions in literary fiction and apply it to the novels of Cormac McCarthy. A bottom-up embedding track uses inverse document frequency to identify rare vocabulary shared with the King James Bible, embeds occurrences in their local context for sense disambiguation, and passes candidate passage pairs through cascaded LLM review. A top-down register track asks an LLM to read McCarthy's prose undirected to any specific biblical passage for comparison, catching allusions not distinguished by word or phrase rarity. Both tracks are cross-validated by a long-context model that holds entire novels alongside the KJV in a single pass, and every finding is checked against published scholarship. Restricting attention to allusions that carry a textual echo--shared phrasing, reworked vocabulary, or transplanted cadence--and distinguishing literary allusions proper from signposted biblical references (similes naming biblical figures, characters overtly citing scripture), the pipeline surfaces 349 allusions across the corpus. Among a target set of 115 previously documented allusions retrieved through human review of the academic literature, the pipeline independently recovers 62 (54% recall), with recall varying by connection type from 30% (transformed imagery) to 80% (register collisions). We contextualise these results with respect to the value-add from LLMs as assistants to mechanical stylometric analyses, and their potential to facilitate the statistical study of intertextuality in massive literary corpora.

Chinese Translation

我们提出了一种双轨管道，用于检测文学小说中的圣经典故，并将其应用于科马克·麦卡锡（Cormac McCarthy）的小说。底层嵌入轨道使用逆文档频率来识别与《钦定版圣经》（King James Bible）共享的稀有词汇，将出现的词汇嵌入其局部语境以进行意义消歧，并通过级联的LLM（大型语言模型）审查候选段落对。顶层注册轨道则要求LLM无特定指向地阅读麦卡锡的散文进行比较，捕捉那些未因词汇或短语稀有性而被区分的典故。两个轨道均通过一个长上下文模型进行交叉验证，该模型在一次传递中将整部小说与《钦定版圣经》一起处理，所有发现均与已发表的学术研究进行核对。我们将注意力限制在具有文本回声的典故上——共享的措辞、改编的词汇或移植的韵律——并区分真正的文学典故与标记的圣经引用（命名圣经人物的明喻、明显引用经文的角色），该管道在整个语料库中发现了349个典故。在通过人类审查学术文献检索的115个先前记录的典故目标集中，该管道独立恢复了62个（54%的召回率），不同连接类型的召回率从30%（转化的意象）到80%（注册碰撞）不等。我们将这些结果与LLM作为机械风格计量分析助手的附加价值进行背景化，并探讨其在大规模文学语料库中促进文本互文性统计研究的潜力。
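The first step of the bottom-up track, using inverse document frequency to find rare vocabulary shared between two texts, can be sketched as follows. This is a minimal illustration, assuming a plain `log(N / df)` weighting and a hypothetical threshold `min_idf`; the abstract does not give the exact weighting, tokenisation, or cut-off, and the token lists are toy data.

```python
import math

def idf_table(documents):
    """Map each word to log(N / document frequency) over tokenised documents."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / c) for w, c in doc_freq.items()}

def rare_shared_vocab(source_tokens, target_tokens, idf, min_idf):
    """Words present in both texts whose IDF meets the threshold, rarest first."""
    shared = set(source_tokens) & set(target_tokens)
    rare = [w for w in shared if idf.get(w, 0.0) >= min_idf]
    return sorted(rare, key=lambda w: (-idf.get(w, 0.0), w))

# Toy background corpus and texts; not drawn from the paper's data.
background = [
    ["the", "horse", "ran"],
    ["the", "dog", "ran"],
    ["the", "firmament", "shone"],
]
idf = idf_table(background)
novel = ["the", "firmament", "ran"]
scripture = ["the", "firmament", "ran", "of"]
candidates = rare_shared_vocab(novel, scripture, idf, min_idf=1.0)
print(candidates)  # prints: ['firmament']
```

In the pipeline described above, such candidates would then be embedded in context and passed to LLM review; those later stages are not sketched here.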


cs.CL / 61 / 2604.19464

LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues

LePREC：基于结构化因素的分类推理以评估法律问题的相关性

Wang, Fanyu, Kang, Xiaoxi, Burgess, Paul, Srivastava, Aashish, Arora, Chetan, Trakic, Adnan, Soon, Lay-Ki, Hossain, Md Khalid, Qu, Lizhen

Abstract

More than half of the global population struggles to meet their civil justice needs due to limited legal resources. While Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, significant challenges remain even at the foundational step of legal issue identification. To investigate LLMs' capabilities in this task, we constructed a dataset from 769 real-world Malaysian Contract Act court cases, using GPT-4o to extract facts and generate candidate legal issues, which were then annotated by senior legal experts. The annotations reveal a critical limitation: while LLMs generate diverse issue candidates, their precision remains inadequate (GPT-4o achieves only 62%). To address this gap, we propose LePREC (Legal Professional-inspired Reasoning Elicitation and Classification), a neuro-symbolic framework combining neural generation with structured statistical reasoning. LePREC consists of (1) a neural component that leverages LLMs to transform legal descriptions into question-answer pairs representing diverse analytical factors, and (2) a symbolic component that applies sparse linear models over these discrete features, learning explicit algebraic weights that identify the most informative reasoning factors. Unlike end-to-end neural approaches, LePREC achieves interpretability through transparent feature weighting while maintaining data efficiency through correlation-based statistical classification. Experiments show a 30-40% improvement over advanced LLM baselines, including GPT-4o and Claude, confirming that correlation-based factor-issue analysis offers a more data-efficient solution for relevance decisions.

Chinese Translation

全球超过一半的人口因法律资源有限而难以满足其民事司法需求。尽管大型语言模型（LLMs）展现了令人印象深刻的推理能力，但在法律问题识别这一基础步骤上仍面临重大挑战。为探讨LLMs在此任务中的能力，我们构建了一个数据集，来源于769个真实的马来西亚合同法法庭案件，利用GPT-4o提取事实并生成候选法律问题，由资深法律专家进行注释，这揭示了一个关键限制：尽管LLMs生成了多样的议题候选，但其精确度仍然不足（GPT-4o的精确度仅为62%）。为了解决这一差距，我们提出了LePREC（法律专业人士启发的推理引发与分类），这是一个结合神经生成与结构化统计推理的神经符号框架。LePREC由以下部分组成：（1）神经组件利用LLMs将法律描述转化为代表多样分析因素的问题-答案对；（2）符号组件在这些离散特征上应用稀疏线性模型，学习显式代数权重，以识别最具信息量的推理因素。与端到端的神经方法不同，LePREC通过透明的特征加权实现可解释性，同时通过基于相关性的统计分类保持数据效率。实验表明，LePREC在包括GPT-4o和Claude在内的先进LLM基线之上提高了30-40%的性能，确认基于相关的因素-问题分析为相关性决策提供了更具数据效率的解决方案。


cs.CL / 62 / 2604.19499

Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics

等级湍流Delta与可解释的风格计量Delta度量方法

Pronin, Dmitry, Kazartsev, Evgeny

Abstract

This article introduces two new measures for authorship attribution - Rank-Turbulence Delta and Jensen-Shannon Delta - which generalise Burrows's classical Delta by applying distance functions designed for probabilistic distributions. We first set out the theoretical basis of the measures, contrasting centred and uncentred z-scoring of word-frequency vectors and re-casting the uncentred vectors as probability distributions. Building on this representation, we develop a token-level decomposition that renders every Delta distance numerically interpretable, thereby facilitating close reading and the validation of results. The effectiveness of the methods is assessed on four literary corpora in English, German, French and Russian. The English, German and French datasets are compiled from Project Gutenberg, whereas the Russian benchmark is the SOCIOLIT corpus containing 755 works by 180 authors spanning the eighteenth to the twenty-first centuries. Rank-Turbulence Delta attains attribution accuracy comparable with Cosine Delta; Jensen-Shannon Delta consistently matches or exceeds the performance of canonical Burrows's Delta. Finally, several established attribution algorithms are re-evaluated on the extended SOCIOLIT corpus.

Chinese Translation

本文介绍了两种新的作者归属度量——等级湍流Delta和詹森-香农Delta，它们通过应用为概率分布设计的距离函数，推广了Burrows经典的Delta方法。我们首先阐述了这些度量的理论基础，对比了词频向量的中心化和非中心化z评分，并将非中心化向量重新表述为概率分布。在此表示的基础上，我们开发了一种基于标记的分解方法，使每个Delta距离在数值上可解释，从而促进了细读和结果验证。这些方法的有效性在四个文学语料库上进行了评估，分别为英语、德语、法语和俄语。英语、德语和法语数据集来自古腾堡计划，而俄语基准则是SOCIOLIT语料库，包含180位作者在十八世纪至二十一世纪间创作的755部作品。等级湍流Delta的归属准确率与余弦Delta相当；詹森-香农Delta的表现始终与经典的Burrows Delta相匹配或超过。最后，几种已建立的归属算法在扩展的SOCIOLIT语料库上进行了重新评估。
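Treating uncentred word-frequency profiles as probability distributions means a distance can be split into one term per word. The sketch below shows this for the standard base-2 Jensen-Shannon divergence; it is an illustration of the decomposition idea only, since the abstract does not state the exact form of Jensen-Shannon Delta (for example, the feature set or any square root), and the helper names and toy texts are hypothetical.

```python
import math
from collections import Counter

def relative_freqs(tokens, vocab):
    """Relative frequencies over a fixed vocabulary; the values sum to 1."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no vocabulary words found in text")
    return {w: counts[w] / total for w in vocab}

def js_contributions(p, q):
    """Per-word terms of the base-2 Jensen-Shannon divergence; they sum to it."""
    contrib = {}
    for w in p:
        m = 0.5 * (p[w] + q[w])
        term = 0.0
        if p[w] > 0:  # convention: 0 * log(0) = 0
            term += 0.5 * p[w] * math.log2(p[w] / m)
        if q[w] > 0:
            term += 0.5 * q[w] * math.log2(q[w] / m)
        contrib[w] = term
    return contrib

vocab = {"the", "and", "cat", "dog"}
p = relative_freqs("the cat and the dog".split(), vocab)
q = relative_freqs("the the the and and".split(), vocab)
parts = js_contributions(p, q)
distance = sum(parts.values())
for word, part in sorted(parts.items(), key=lambda kv: -kv[1]):
    print(f"{word:>4}: {part:.4f}")
print(f"Jensen-Shannon divergence = {distance:.4f}")
```

Sorting the per-word terms shows which words drive the distance between two texts, which is the kind of close-reading aid the abstract describes.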


cs.CL / 63 / 2604.19502

Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews

超越评分：人工智能评审的综合评估与基准

Li, Bowen, Ma, Haochen, Wang, Yuxin, Yang, Jie, Chen, Xinchi, Huang, Xuanjing, Zheng, Yining, Qiu, Xipeng

Abstract

The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification--its arguments, questions, and critique--rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics--particularly the recall of weakness arguments--correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.

Chinese Translation

大型语言模型（LLMs）的快速应用激发了对自动化同行评审的兴趣；然而，目前的进展受到将评审主要视为评分预测任务的基准的限制。我们认为，评审的价值在于其文本论证——即论点、问题和批评——而非一个标量分数。为了解决这一问题，我们提出了“超越评分”（Beyond Rating），一个全面的评估框架，评估人工智能评审者在五个维度上的表现：内容忠实性（Content Faithfulness）、论证一致性（Argumentative Alignment）、焦点一致性（Focus Consistency）、问题建设性（Question Constructiveness）和人工智能可能性（AI-Likelihood）。值得注意的是，我们提出了一种最大召回（Max-Recall）策略，以适应有效的专家分歧，并引入了一个经过严格筛选的高信度评审论文数据集，以消除程序噪声。广泛的实验表明，传统的n-gram指标无法反映人类偏好，而我们提出的以文本为中心的指标——特别是弱点论点的召回率——与评分准确性高度相关。这些发现表明，将人工智能的批评焦点与人类专家对齐是可靠的自动评分的前提，为未来的研究提供了一个稳健的标准。


cs.CL / 64 / 2604.19508

Bangla Key2Text: Text Generation from Keywords for a Low Resource Language

孟加拉语关键字到文本生成：低资源语言的关键字驱动文本生成

Talukder, Tonmoy, Shahariar, G M

Abstract

This paper introduces Bangla Key2Text, a large-scale dataset of 2.6 million Bangla keyword-text pairs designed for keyword-driven text generation in a low-resource language. The dataset is constructed using a BERT-based keyword extraction pipeline applied to millions of Bangla news texts, transforming raw articles into structured keyword-text pairs suitable for supervised learning. To establish baseline performance on this new benchmark, we fine-tune two sequence-to-sequence models, mT5 and BanglaT5, and evaluate them using multiple automatic metrics and human judgments. Experimental results show that task-specific fine-tuning substantially improves keyword-conditioned text generation in Bangla compared to zero-shot large language models. The dataset, trained models, and code are publicly released to support future research in Bangla natural language generation and keyword-to-text generation tasks.

Chinese Translation

本文介绍了Bangla Key2Text，这是一个包含260万对孟加拉语关键字与文本的大规模数据集，旨在为低资源语言中的关键字驱动文本生成提供支持。该数据集是通过应用基于BERT的关键字提取管道于数百万篇孟加拉新闻文本构建的，将原始文章转化为适合监督学习的结构化关键字与文本对。为了在这一新基准上建立基线性能，我们对两个序列到序列模型mT5和BanglaT5进行了微调，并使用多种自动评估指标和人工评判进行评估。实验结果表明，与零样本大型语言模型相比，任务特定的微调显著提高了孟加拉语中基于关键字的文本生成。该数据集、训练模型和代码已公开发布，以支持未来在孟加拉自然语言生成和关键字到文本生成任务中的研究。


cs.CL / 65 / 2604.19547

Emotion-Cause Pair Extraction in Conversations via Semantic Decoupling and Graph Alignment

通过语义解耦和图对齐提取对话中的情感-原因对

Ma, Tianxiang, Feng, Weijie, Wang, Xinyu, Cheng, Zhiyong

Abstract

Emotion-Cause Pair Extraction in Conversations (ECPEC) aims to identify the set of causal relations between emotion utterances and their triggering causes within a dialogue. Most existing approaches formulate ECPEC as an independent pairwise classification task, overlooking the distinct semantics of emotion diffusion and cause explanation, and failing to capture globally consistent many-to-many conversational causality. To address these limitations, we revisit ECPEC from a semantic perspective and seek to disentangle emotion-oriented semantics from cause-oriented semantics, mapping them into two complementary representation spaces to better capture their distinct conversational roles. Building on this semantic decoupling, we naturally formulate ECPEC as a global alignment problem between the emotion-side and cause-side representations, and employ optimal transport to enable many-to-many and globally consistent emotion-cause matching. Based on this perspective, we propose a unified framework SCALE that instantiates the above semantic decoupling and alignment principle within a shared conversational structure. Extensive experiments on several benchmark datasets demonstrate that SCALE consistently achieves state-of-the-art performance. Our code is released at https://github.com/CoCoSphere/SCALE.

Chinese Translation

情感-原因对提取（ECPEC）旨在识别对话中情感表达与其触发原因之间的因果关系集合。现有的大多数方法将ECPEC形式化为独立的成对分类任务，忽视了情感扩散和原因解释的独特语义，未能捕捉到全局一致的多对多对话因果关系。为了解决这些局限性，我们从语义的角度重新审视ECPEC，试图将情感导向的语义与原因导向的语义解耦，将它们映射到两个互补的表示空间，以更好地捕捉它们在对话中的不同角色。在此语义解耦的基础上，我们自然地将ECPEC形式化为情感侧与原因侧表示之间的全局对齐问题，并采用最优传输方法实现多对多和全局一致的情感-原因匹配。基于这一视角，我们提出了一个统一框架SCALE，该框架在共享的对话结构中实现上述语义解耦和对齐原则。在多个基准数据集上的广泛实验表明，SCALE始终实现了最先进的性能。我们的代码已发布在 https://github.com/CoCoSphere/SCALE。
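Casting matching as optimal transport yields a soft, many-to-many plan in which every row and column must account for its full share of mass, which is what makes the assignment globally consistent. The sketch below uses entropic regularisation with Sinkhorn iterations and uniform marginals; the solver, the marginals, the regularisation strength `reg`, and the toy cost matrix are all assumptions, since the abstract states only that optimal transport is employed.

```python
import math

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic optimal transport with uniform marginals; returns the transport plan."""
    n, m = len(cost), len(cost[0])
    row_mass = [1.0 / n] * n   # mass assigned to each emotion-side item
    col_mass = [1.0 / m] * m   # mass assigned to each cause-side item
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost matrix: rows are emotion utterances, columns are candidate causes.
# Low cost means the pair is a good match (hypothetical values).
cost = [
    [0.0, 1.0],
    [1.0, 0.0],
]
plan = sinkhorn(cost)
for row in plan:
    print([round(x, 4) for x in row])
```

With this cost matrix nearly all mass lands on the diagonal. Raising `reg` spreads mass across more pairs, while very small values can underflow in `math.exp`, so the parameter needs tuning in practice.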


cs.CL / 66 / 2604.19548

Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment

通过辩证对齐驯服代理中的行为者-观察者不对称

Li, Bobo, Wu, Rui, Ji, Zibo, Zhang, Meishan, Fei, Hao, Zhang, Min, Lee, Mong-Li, Hsu, Wynne

Abstract

Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.

Chinese Translation

大型语言模型代理迅速从静态文本生成器演变为能够执行复杂自主工作流的动态系统。为了增强可靠性，越来越多地采用分配专门角色的多代理框架，以实现自我反思和相互审计。尽管这种角色扮演有效利用了领域专家知识，但我们发现它同时引发了一种类似人类的认知偏差，称为行为者-观察者不对称（Actor-Observer Asymmetry，AOA）。具体而言，作为行为者的代理（在自我反思期间）倾向于将失败归因于外部因素，而作为观察者的代理（在相互审计期间）则将相同的错误归因于内部故障。我们使用新的模糊失败基准（Ambiguous Failure Benchmark）对这一现象进行了量化，结果显示，仅仅交换视角就会在大多数模型中超过20%的案例中触发AOA效应。为了驯服这种偏差，我们引入了ReTAS（通过论题-反论题-综合进行推理），这是一个通过辩证对齐训练的模型，旨在强制执行视角不变的推理。通过将辩证思维链与群体相对政策优化（Group Relative Policy Optimization）相结合，ReTAS引导代理将相互矛盾的观点综合成客观共识。实验表明，ReTAS有效减轻了归因不一致性，并显著提高了模糊场景中的故障解决率。


cs.CL / 67 / 2604.19565

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

在推理时使用注意力图检测语音大型语言模型中的幻觉

Waldendorf, Jonas, Hasan, Bashar Awwad Shiekh, Tsymbalov, Evgenii

Abstract

Hallucinations in Speech Large Language Models (SpeechLLMs) pose significant risks, yet existing detection methods typically rely on gold-standard outputs that are costly or impractical to obtain. Moreover, hallucination detection methods developed for text-based LLMs do not directly capture audio-specific signals. We investigate four attention-derived metrics: AUDIORATIO, AUDIOCONSISTENCY, AUDIOENTROPY, and TEXTENTROPY, designed to capture pathological attention patterns associated with hallucination, and train lightweight logistic regression classifiers on these features for efficient inference-time detection. Across automatic speech recognition and speech-to-text translation tasks, evaluations on Qwen-2-Audio and Voxtral-3B show that our approach outperforms uncertainty-based and prior attention-based baselines on in-domain data, achieving improvements of up to +0.23 PR-AUC, and generalises to out-of-domain ASR settings. We further find that strong performance can be achieved with approximately 100 attention heads, improving out-of-domain generalisation compared to using all heads. While effectiveness is model-dependent and task-specific training is required, our results demonstrate that attention patterns provide a valuable tool for hallucination detection in SpeechLLMs.

Chinese Translation

语音大型语言模型（SpeechLLMs）中的幻觉现象带来了显著风险，但现有的检测方法通常依赖于昂贵或不切实际的标准输出。此外，针对文本基础的语言模型（LLMs）开发的幻觉检测方法并未直接捕捉音频特有的信号。我们研究了四个基于注意力的度量指标：AUDIORATIO、AUDIOCONSISTENCY、AUDIOENTROPY和TEXTENTROPY，旨在捕捉与幻觉相关的病态注意力模式，并在这些特征上训练轻量级逻辑回归分类器，以实现高效的推理时检测。在自动语音识别和语音转文本翻译任务中，对Qwen-2-Audio和Voxtral-3B的评估表明，我们的方法在领域内数据上优于基于不确定性和先前注意力的基线，提升幅度高达+0.23 PR-AUC，并且能够推广到领域外的ASR设置。我们进一步发现，使用大约100个注意力头可以实现良好的性能，相较于使用所有头部，改善了领域外的泛化能力。尽管效果依赖于模型，并且需要特定任务的训练，我们的结果表明，注意力模式为SpeechLLMs中的幻觉检测提供了有价值的工具。


cs.CL / 68 / 2604.19572

A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression

基于观察上下文压缩的高效终端智能体自演化框架

Ren, Jincheng, Wu, Siwei, Li, Yizhi, Zhu, Kang, Xu, Shu, Feng, Boyu, Yuan, Ruibin, Zhang, Wei, Batista-Navarro, Riza, Yang, Jian, Lin, Chenghua

Abstract

As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogeneity of terminal environments makes heuristic-based or fixed-prompt methods difficult to generalize. We propose TACO, a plug-and-play, self-evolving Terminal Agent Compression framework that automatically discovers and refines compression rules from interaction trajectories for existing terminal agents. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks (i.e., SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench) show that TACO consistently improves performance across mainstream agent frameworks and strong backbone models. With MiniMax-2.5, it improves performance on most benchmarks while reducing token overhead by around 10%. On TerminalBench, it brings consistent gains of 1%-4% across strong agentic models, and further improves accuracy by around 2%-3% under the same token budget. These results demonstrate the effectiveness and generalization of self-evolving, task-aware compression for terminal agents.
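The quadratic growth mentioned above can be checked with a small calculation. This sketch assumes every step appends one observation of fixed size and that each step re-reads the whole history. `keep_ratio` is a hypothetical knob standing in for compression, and the numbers are illustrative rather than taken from the paper.

```python
def cumulative_tokens(steps, obs_tokens, keep_ratio=1.0):
    """Total observation tokens read over a run that re-reads its full history.

    keep_ratio is a hypothetical fraction of each observation kept after compression.
    """
    kept = obs_tokens * keep_ratio
    # Step t re-reads the t observations gathered so far: 1 + 2 + ... + steps.
    return kept * steps * (steps + 1) / 2

baseline = cumulative_tokens(steps=50, obs_tokens=400)    # 510,000 tokens
doubled = cumulative_tokens(steps=100, obs_tokens=400)    # 2,020,000 tokens
compressed = cumulative_tokens(steps=50, obs_tokens=400, keep_ratio=0.5)
```

Doubling the number of steps roughly quadruples the cumulative cost, while shrinking each retained observation scales the total linearly.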

Chinese Translation

随着模型能力的提升，研究越来越多地转向长时间跨度、多回合的以终端为中心的智能体任务，其中原始环境反馈通常保留在交互历史中以支持未来决策。然而，反复保留这些反馈会引入大量冗余，并导致随着步骤数量的增加，累积的令牌成本呈平方增长，从而妨碍长时间跨度的推理。尽管观察压缩可以缓解这一问题，但终端环境的异质性使得基于启发式或固定提示的方法难以推广。我们提出了 TACO（Terminal Agent Compression），一个即插即用的自演化终端智能体压缩框架，能够自动从交互轨迹中发现和优化现有终端智能体的压缩规则。在 TerminalBench（TB 1.0 和 TB 2.0）及四个额外的终端相关基准（即 SWE-Bench Lite、CompileBench、DevEval 和 CRUST-Bench）上的实验表明，TACO 在主流智能体框架和强大基础模型中始终提高了性能。在 MiniMax-2.5 上，它在大多数基准上提高了性能，同时将令牌开销减少了约 10%。在 TerminalBench 上，它在强智能体模型中带来了 1%-4% 的一致性增益，并在相同令牌预算下进一步提高了约 2%-3% 的准确性。这些结果证明了自演化、任务感知压缩在终端智能体中的有效性和泛化能力。


cs.CL / 69 / 2604.19578

Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI

大型语言模型对同行评审意见的细致影响：来自人工智能顶级会议论文集的证据

Wu, Wenqing, Zhang, Chengzhi, Zhao, Yi, Bao, Tong

Abstract

With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is to improve the quality of academic manuscripts along dimensions such as clarity, originality, and other evaluation aspects. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that may have been modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.

Chinese Translation

随着大型语言模型（LLMs）的快速发展，学术界面临着前所未有的冲击，尤其是在学术交流领域。同行评审的主要功能是提高学术稿件的质量，例如清晰度、原创性及其他评估方面。尽管先前的研究表明，LLMs已经开始影响同行评审，但尚不清楚它们是否在改变其核心评估功能。此外，LLMs对同行评审报告的语言形式、评估重点和推荐相关信号的影响程度尚未得到系统性研究。在本研究中，我们考察了LLMs出现后学术文章的同行评审报告的变化，强调了细致层面的差异。具体而言，我们研究了评审意见中词汇和句子的长度与复杂性等语言特征，同时自动标注个别评审句子的评估方面。我们还使用先前建立的最大似然估计方法，识别可能由LLMs修改或生成的评审报告。最后，我们评估了LLM辅助评审报告中提到的评估方面对论文决策推荐信息量的影响。结果表明，随着LLMs的出现，同行评审文本变得更长且更流畅，摘要和表面清晰度的强调有所增加，同时语言模式更加标准化，尤其是在信心评分较低的评审者中。与此同时，对更深层次评估维度（如原创性、可复制性和细致的批判性推理）的关注有所下降。


cs.CL / 70 / 2604.19584

A Bolu: A Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry

A Bolu：用于撒丁岛即兴诗歌计算分析的结构化数据集

Calderaro, Silvio, Monti, Johanna

Abstract

The growing interest of Natural Language Processing (NLP) in minority languages has not yet bridged the gap in the preservation of oral linguistic heritage. In particular, extemporaneous poetry - a performative genre based on real-time improvisation and metrical-rhetorical competence - remains a largely unexplored area of computational linguistics. This methodological gap necessitates the creation of specific resources to document and analyse the structures of improvised poetry. This is the context in which A Bolu was created: the first structured corpus of extemporaneous poetry dedicated to cantada logudorese, a variant of the Sardinian language. The dataset comprises 2,835 stanzas for a total of 141,321 tokens. The study presents the architecture of the corpus and applies a multidimensional analysis combining descriptive statistical indices and computational linguistics techniques to map the characteristics of the poetic text. The results indicate that the production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord's theory of formulaicity. This evidence not only provides a new key to understanding oral creativity, but also offers a significant contribution to the development of NLP tools that are more inclusive and sensitive to the specificities of less widely spoken languages.

Chinese Translation

自然语言处理（NLP）对少数语言日益增长的兴趣尚未弥补口头语言遗产保护的缺口。特别是即兴诗歌（一种基于实时即兴创作和韵律修辞能力的表演性体裁）仍然是计算语言学中一个较少探索的领域。这一方法论上的空白需要创建特定资源来记录和分析即兴诗歌的结构。在此背景下，A Bolu应运而生，这是第一个专门针对cantada logudorese（撒丁岛语言的一种变体）的即兴诗歌结构化语料库。该数据集包含2,835个诗节，共计141,321个词元。研究展示了语料库的架构，并应用多维分析，结合描述性统计指标和计算语言学技术，以描绘诗歌文本的特征。结果表明，撒丁岛即兴诗人的创作具有反复出现的模式，这支持了Parry和Lord的公式化理论。这一证据不仅为理解口头创作提供了新的视角，还为开发更具包容性和敏感性、能够适应使用较少语言的NLP工具做出了重要贡献。


cs.CL / 71 / 2604.19593

RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

RoLegalGEC：罗马尼亚法律领域语法错误检测与修正数据集

Timpuriu, Mircea, Cercel, Dumitru-Clementin

Abstract

The importance of clear and correct text in legal documents cannot be overstated. Consequently, a grammatical error correction tool meant to assist a legal professional must be able to understand possible errors in the context of a legal environment and correct them accordingly, and it implicitly needs to be trained in the same environment, using realistic legal data. However, the manually annotated data required by such a process is in short supply for languages such as Romanian, let alone for a niche domain. The most common approach is the synthetic generation of parallel data; however, it requires a structured understanding of Romanian grammar. In this paper, we introduce, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of errors in legal passages, along with error annotations. Moreover, we evaluate several neural network models that transform the dataset into a valuable tool for both detecting and correcting grammatical errors, including knowledge-distillation Transformers, sequence tagging architectures for detection, and a variety of pre-trained text-to-text Transformer models for correction. We consider that the set of models, together with the novel RoLegalGEC dataset, will enrich the resource base for further research on Romanian.

Chinese Translation

法律文件中清晰和正确的文本的重要性不容忽视，因此，旨在协助法律专业人士的语法错误修正工具必须具备理解法律环境中可能出现的错误的能力，并相应地进行修正，同时隐含地需要在相同环境中进行训练，使用真实的法律数据。然而，对于罗马尼亚语等语言来说，所需的手动标注数据稀缺，更不用说特定领域的数据了。最常见的方法是合成生成平行数据；然而，这需要对罗马尼亚语法有结构化的理解。在本文中，我们介绍了我们所知的首个罗马尼亚语平行数据集RoLegalGEC，用于法律领域语法错误的检测与修正，该数据集汇集了350,000个法律段落中的错误示例及其错误注释。此外，我们评估了几种神经网络模型，将该数据集转化为一个有价值的工具，用于检测和修正语法错误，包括知识蒸馏Transformer、用于检测的序列标注架构，以及多种预训练的文本到文本Transformer模型用于修正。我们认为，这一系列模型与新颖的RoLegalGEC数据集将丰富罗马尼亚语进一步研究的资源基础。


cs.CL / 72 / 2604.19598

Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

人工智能生成运动处方的跨模型一致性：基于三种大型语言模型的重复生成研究

Lee, Kihyuk

Abstract

This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
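The distinction drawn above, between varied wording with stable content and verbatim repetition, can be illustrated with two small measures. This is a minimal sketch that assumes "unique outputs" means the share of distinct strings among repeated generations. Word-set Jaccard overlap is used here only as a crude stand-in for the paper's semantic similarity, whose exact method the abstract does not specify. The sample prescriptions are invented.

```python
from itertools import combinations

def unique_output_rate(outputs):
    """Share of distinct strings among repeated generations."""
    return len(set(outputs)) / len(outputs)

def mean_pairwise_jaccard(outputs):
    """Mean word-set overlap over all pairs (stand-in for embedding similarity)."""
    sets = [set(o.lower().split()) for o in outputs]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Invented examples: one model repeats itself verbatim, the other rephrases.
repeated = ["walk 30 minutes daily"] * 3 + ["walk 20 minutes daily"]
varied = ["walk 30 minutes daily", "daily 30 minutes walk",
          "30 minutes walk daily", "walk daily 30 minutes"]

for name, outs in (("repeated", repeated), ("varied", varied)):
    print(name, unique_output_rate(outs), round(mean_pairwise_jaccard(outs), 3))
```

Reporting both numbers side by side shows whether a high similarity score comes from stable content or from duplicated text.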

Chinese Translation

本研究比较了三种大型语言模型（LLMs）在运动处方输出的重复生成一致性，具体为GPT-4.1、Claude Sonnet 4.6和Gemini 2.5 Flash，在温度=0的条件下进行。每个模型为六种临床场景生成了20次处方，总共分析了360个输出，涵盖四个维度：语义相似性、输出可重复性、FITT分类和安全性表达。GPT-4.1的平均语义相似性最高（0.955），其次是Gemini 2.5 Flash（0.950）和Claude Sonnet 4.6（0.903），并确认了模型间的显著差异（H = 458.41, p < .001）。重要的是，这些分数反映了根本不同的生成行为：GPT-4.1生成了完全独特的输出（100%），且语义内容稳定，而Gemini 2.5 Flash则表现出明显的输出重复（27.5%独特输出），表明其高相似性得分源于文本重复而非一致的推理。因此，相同的解码设置产生了根本不同的一致性特征，这一点是单输出评估无法捕捉的。所有模型的安全性表达达到了上限，确认其作为区分指标的有限效用。这些结果表明，模型选择构成了一项临床决策，而不仅仅是技术决策，并且在重复生成条件下的输出行为应被视为可靠部署基于LLM的运动处方系统的核心标准。


cs.CL / 73 / 2604.19642

Micro Language Models Enable Instant Responses

微型语言模型实现即时响应

Cheng, Wen, Chen, Tuochao, Helwani, Karim, Srinivasan, Sriram, Zettlemoyer, Luke, Gollakota, Shyamnath

Abstract

Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models ($\mu$LMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device while a cloud model completes it, thus masking the cloud latency. We show that useful language generation survives at this extreme scale, with our models matching several existing 70M-256M-class models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and structured graceful recovery via three error correction methods when the local opener goes wrong. Empirical results show that $\mu$LMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.

Chinese Translation

由于电力和计算限制，边缘设备如智能手表和智能眼镜无法持续运行即使是最小的100M-1B参数语言模型，而云推理则引入了多秒的延迟，打破了响应助手的幻觉。我们提出了微型语言模型（$\mu$LMs）：超紧凑模型（8M-30M参数），能够在设备上即时生成上下文相关响应的前4-8个单词，同时云模型完成其余部分，从而掩盖云延迟。我们展示了在这一极端规模下有用的语言生成依然存在，我们的模型与多个70M-256M级别的现有模型相匹配。我们设计了一个协同生成框架，将云模型重新定义为延续者而非响应者，实现了无缝的中句交接和通过三种错误修正方法在本地开启者出现错误时的结构化优雅恢复。实证结果表明，$\mu$LMs能够发起响应，而更大的模型可以无缝完成，证明了数量级不对称协作是可实现的，并为极其资源受限的设备解锁了响应式人工智能。模型检查点和演示可在 https://github.com/Sensente/micro_language_model_swen_project 获取。


cs.CL / 74 / 2604.19645

The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text

信号是上限：开放式调查文本中LLM预测体验评分的测量限制

Hong, Andrew, Potteiger, Jason, Zapata, Luis E.

Abstract

An earlier paper (Hong, Potteiger, and Zapata 2026) established that an unoptimized GPT 4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This paper tests the relative impact of prompt design and model selection on that performance. We compared four configurations on approximately 10,000 post-game surveys from five MLB teams: the original baseline prompt and a moderately customized version, crossed with three GPT models (4.1, 4.1-mini, 5.2). Prompt customization added roughly two percentage points of within +/-1 agreement on GPT 4.1 (from 67% to 69%). Both model swaps from that best configuration degraded performance: GPT 5.2 returned to the baseline, and GPT 4.1-mini fell six percentage points below it. Both levers combined were dwarfed by the input itself: across capable configurations, accuracy varied more than an order of magnitude more by the linguistic character of the text than by the choice of prompt or model. The ceiling has two parts. One is a bias in how the model reads text, which prompt design can correct. The other is a difference between what fans write about and what they actually decide, which no engineering can close because the missing information is not in the text. Prompt customization moved the first part; model selection moved neither reliably. The result is not that "prompt engineering helps a little" but that prompt engineering helps in a specific and predictable way, on the part of the ceiling it can reach.
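The headline metric, agreement within one point, is simple to compute. The following is a minimal sketch, assuming integer ratings on a shared scale. The function name and the toy ratings are hypothetical and are not data from the study.

```python
def within_one_agreement(predicted, reported, tolerance=1):
    """Share of surveys where the predicted rating is within `tolerance` points."""
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reported))
    return hits / len(reported)

# Toy ratings for six surveys (invented for illustration).
predicted = [8, 6, 9, 3, 7, 10]
reported = [9, 6, 5, 4, 7, 8]

score = within_one_agreement(predicted, reported)  # 4 of 6 within one point
```

Computing the same score separately for subsets of surveys, for example grouped by how the text is written, would be one way to compare variation due to the input against variation due to prompt or model.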

Chinese Translation

早期的一篇论文（Hong, Potteiger, 和 Zapata 2026）建立了一个未经优化的GPT 4.1提示从开放式调查文本中预测粉丝报告的体验评分的准确率为67%（误差在1分以内）。本文测试了提示设计和模型选择对该性能的相对影响。我们比较了来自五支MLB球队的约10,000份赛后调查的四种配置：原始基线提示和一个适度定制的版本，交叉使用三种GPT模型（4.1, 4.1-mini, 5.2）。提示定制使得GPT 4.1的准确率在+/-1一致性上提高了大约两个百分点（从67%提高到69%）。从最佳配置中更换模型都导致性能下降：GPT 5.2回到了基线，而GPT 4.1-mini则下降了六个百分点。两者的结合效果与输入本身相比显得微不足道：在可用配置中，准确性因文本的语言特征而变化的幅度比提示或模型的选择大一个数量级。上限有两个部分。一个是模型读取文本的偏差，这可以通过提示设计来纠正。另一个是粉丝所写内容与他们实际决定之间的差异，这种差异无法通过工程手段弥补，因为缺失的信息不在文本中。提示定制改善了第一部分；而模型选择在这方面没有可靠的改善。结果并不是“提示工程稍微有帮助”，而是提示工程在其能够达到的上限的特定和可预测的方式上提供了帮助。


cs.CL / 75 / 2604.19656

Pause or Fabricate? Training Language Models for Grounded Reasoning

暂停还是虚构？训练语言模型以实现扎根推理

Qiu, Yiwen, Wu, Linjuan, Liu, Yizhou, Yan, Yuchen, Ma, Jin, Tan, Xu, Hu, Yao, Zhang, Daoxin, Zhang, Wenqi, Lu, Weiming, Xiao, Jun, Shen, Yongliang

Abstract

Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions -- a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness -- the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.

Chinese Translation

大型语言模型在复杂推理任务上取得了显著进展。然而，当输入不完整时，它们常常隐式地虚构信息，产生自信但不可靠的结论——我们称之为无根推理的失败模式。我们认为，这一问题并非源于推理能力不足，而是缺乏推理边界意识——即识别有效推理所需前提缺失的能力。为了解决这一问题，我们提出了通过互动强化学习实现扎根推理（Grounded Reasoning via Interactive Reinforcement Learning, GRIL），这是一个针对不完整信息下扎根推理的多轮强化学习框架。GRIL将推理过程分解为两个阶段：澄清和暂停，前者识别可用信息是否充足，后者在必要前提建立后进行任务解决。我们设计了阶段特定的奖励机制，以惩罚虚构现象，使模型能够检测信息缺口，主动停止，并在澄清后恢复推理。在GSM8K-Insufficient和MetaMATH-Insufficient上的实验表明，GRIL显著提高了前提检测（最高可达45%），使任务成功率提高了30%，同时将平均响应长度减少了超过20%。附加分析确认了对噪声用户响应的鲁棒性以及对分布外任务的泛化能力。


cs.CL / 76 / 2604.19667

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Chat2Workflow：一个用于通过自然语言生成可执行视觉工作流的基准

Zhong, Yi, Xu, Buqiang, Wang, Yijun, Shan, Zifei, Qiao, Shuofei, Zheng, Guozhou, Zhang, Ningyu

Abstract

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.

Chinese Translation

目前，可执行视觉工作流已成为现实工业部署中的主流范式，提供了强大的可靠性和可控性。然而，在当前实践中，这些工作流几乎完全通过手动工程构建：开发人员必须仔细设计工作流，为每个步骤编写提示，并随着需求的变化反复修订逻辑，这使得开发成本高、耗时且容易出错。为了研究大型语言模型是否能够自动化这一多轮交互过程，我们引入了Chat2Workflow，一个直接从自然语言生成可执行视觉工作流的基准，并提出了一个稳健的代理框架以减轻重复执行错误。Chat2Workflow基于大量真实商业工作流构建，每个实例的设计使得生成的工作流可以转换并直接部署到实际工作流平台，如Dify和Coze。实验结果表明，尽管最先进的语言模型通常能够捕捉高层次的意图，但在生成正确、稳定和可执行的工作流方面仍然存在困难，尤其是在复杂或变化的需求下。尽管我们的代理框架提高了最多5.34%的解决率，但剩余的现实差距使得Chat2Workflow成为推动工业级自动化的基础。代码可在 https://github.com/zjunlp/Chat2Workflow 获取。


cs.CL / 77 / 2604.19678

Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation

探索函数向量中的语言无关性：机器翻译的案例研究

Laiyk, Nurkhan, Gállego, Gerard I., Ferrando, Javier, Koto, Fajri

Abstract

Function vectors (FVs) are vector representations of tasks extracted from model activations during in-context learning. While prior work has shown that multilingual model representations can be language-agnostic, it remains unclear whether the same holds for function vectors. We study whether FVs exhibit language-agnosticity, using machine translation as a case study. Across three decoder-only multilingual LLMs, we find that translation FVs extracted from a single English$\rightarrow$Target direction transfer to other target languages, consistently improving the rank of correct translation tokens across multiple unseen languages. Ablation results show that removing the FV degrades translation across languages with limited impact on unrelated tasks. We further show that base-model FVs transfer to instruction-tuned variants and partially generalize from word-level to sentence-level translation.

Chinese Translation

函数向量（Function Vectors, FVs）是在上下文学习过程中从模型激活中提取的任务向量表示。尽管先前的研究表明，多语言模型表示可以是语言无关的，但对于函数向量是否也具备同样特性仍不明确。我们以机器翻译为案例研究，探讨函数向量是否表现出语言无关性。在三个仅解码的多语言大语言模型（LLMs）中，我们发现从单一的英语$\rightarrow$目标语言方向提取的翻译函数向量能够转移到其他目标语言，持续提高多个未见语言中正确翻译标记的排名。消融实验结果表明，移除函数向量会降低跨语言的翻译效果，而对无关任务的影响有限。我们进一步展示了基础模型的函数向量能够转移到经过指令调优的变体，并在一定程度上从词级翻译推广到句级翻译。


cs.CL / 78 / 2604.19685

An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA

答案只是开始：开放式文档基础问答的相关洞察生成

Sharma, Saransh, Ramu, Pritika, Garimella, Aparna, Mukherjee, Koyel

Abstract

Answering open-ended questions remains challenging for AI systems because it requires synthesis, judgment, and exploration beyond factual retrieval, and users often refine answers through multiple iterations rather than accepting a single response. Existing QA benchmarks do not explicitly support this refinement process. To address this gap, we introduce a new task, document-grounded related insight generation, where the goal is to generate additional insights from a document collection that help improve, extend, or rethink an initial answer to an open-ended question, ultimately supporting richer user interaction and a better overall question answering experience. We curate and release SCOpE-QA (Scientific Collections for Open-Ended QA), a dataset of 3,000 open-ended questions across 20 research collections. We present InsightGen, a two-stage approach that first constructs a thematic representation of the document collection using clustering, and then selects related context based on neighborhood selection from the thematic graph to generate diverse and relevant insights using LLMs. Extensive evaluation on 3,000 questions using two generation models and two evaluation settings shows that InsightGen consistently produces useful, relevant, and actionable insights, establishing a strong baseline for this new task.

Chinese Translation

回答开放式问题对于人工智能系统仍然具有挑战性，因为这需要超越事实检索的综合、判断和探索，而用户通常通过多次迭代来完善答案，而不是接受单一的回应。现有的问答基准并未明确支持这一完善过程。为了解决这一问题，我们引入了一项新任务——文档基础的相关洞察生成，其目标是从文档集合中生成额外的洞察，以帮助改善、扩展或重新思考对开放式问题的初步回答，最终支持更丰富的用户互动和更好的整体问答体验。我们策划并发布了SCOpE-QA（开放式问答的科学集合），这是一个包含20个研究集合中3,000个开放式问题的数据集。我们提出了InsightGen，一种两阶段的方法，首先通过聚类构建文档集合的主题表示，然后基于主题图的邻域选择来选择相关上下文，利用大型语言模型（LLMs）生成多样且相关的洞察。对3,000个问题使用两种生成模型和两种评估设置进行的广泛评估表明，InsightGen始终能够生成有用、相关且可操作的洞察，为这一新任务建立了强有力的基准。


cs.CL / 79 / 2604.19699

Epistemic orientation in parliamentary discourse is associated with deliberative democracy

议会话语中的认知取向与协商民主的关联

Aroyehun, Segun, Lewandowsky, Stephan, Garcia, David

Abstract

The pursuit of truth is central to democratic deliberation and governance, yet political discourse reflects varying epistemic orientations, ranging from evidence-based reasoning grounded in verifiable information to intuition-based reasoning rooted in beliefs and subjective interpretation. We introduce a scalable approach to measure epistemic orientation using the Evidence-Minus-Intuition (EMI) score, derived from large language model (LLM) ratings and embedding-based semantic similarity. Applying this approach to 15 million parliamentary speech segments spanning 1946 to 2025 across seven countries, we examine temporal patterns in discourse and its association with deliberative democracy and governance. We find that EMI is positively associated with deliberative democracy within countries over time, with consistent relationships in both contemporaneous and lagged analyses. EMI is also positively associated with the transparency and predictable implementation of laws as a dimension of governance. These findings suggest that the epistemic nature of political discourse is crucial for both the quality of democracy and governance.

Chinese Translation

追求真理是民主协商和治理的核心，然而政治话语反映出不同的认知取向，从基于可验证信息的证据推理到根植于信念和主观解释的直觉推理。我们提出了一种可扩展的方法，通过证据减去直觉（Evidence-Minus-Intuition, EMI）评分来测量认知取向，该评分来源于大型语言模型（Large Language Model, LLM）的评级和基于嵌入的语义相似性。将该方法应用于1946年至2025年间七个国家的1500万个议会演讲片段，我们考察了话语中的时间模式及其与协商民主和治理的关联。我们发现，EMI在国家内部随着时间的推移与协商民主呈正相关，并且在同时分析和滞后分析中均表现出一致的关系。EMI还与法律的透明性和可预测实施作为治理的一个维度呈正相关。这些发现表明，政治话语的认知性质对民主和治理的质量至关重要。


cs.CL / 80 / 2604.19716

Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views

发现共享的逻辑子空间：通过自然语言与符号视角的对齐引导大型语言模型的逻辑推理

Fang, Feihao, Thai, My T., Lei, Yuanyuan

Abstract

Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that are shared across views while remaining independent of surface forms. To verify this, we employ Canonical Correlation Analysis on the paired residual activations from natural-language and symbolic-language reasoning chains, learning a low-dimensional subspace with maximum cross-view correlation. Furthermore, we design a training-free approach that steers an LLM's reasoning chain along this logical subspace, thereby leveraging the complementary reasoning signals from both views. Experiments on four logical reasoning benchmarks demonstrate the effectiveness of our approach, improving accuracy by up to 11 percentage points and generalizing well to out-of-domain problems.

Chinese Translation

大型语言模型（LLMs）在多步骤逻辑推理方面仍然面临挑战。现有的方法要么仅仅在自然语言形式中优化推理链，要么将符号求解器作为外部模块附加。在本研究中，我们提出一个问题：LLMs 是否包含一个共享的内部逻辑子空间，该子空间能够同时对齐推理过程的自然语言和符号语言视角。我们的假设是，这个逻辑子空间捕捉了 LLMs 中在不同视角间共享的逻辑推理能力，同时独立于表面形式。为了验证这一点，我们对来自自然语言和符号语言推理链的配对残差激活应用了典型相关分析（Canonical Correlation Analysis），学习一个具有最大跨视角相关性的低维子空间。此外，我们设计了一种无训练的方法，引导 LLMs 的推理链沿着这个逻辑子空间进行，从而利用来自两种视角的互补推理信号。在四个逻辑推理基准上的实验表明我们的方法有效，提高了准确率最多达 11 个百分点，并在域外问题上表现良好。


arXiv Papers

HALO: Hybrid Auto-encoded Locomotion with Learned Latent Dynamics, Poincaré Maps, and Regions of Attraction

Thrust Regulation Through Wing Linkage Modulation on the Aerobat Platform: Piezoelectric Slip-Stick Actuated Regulator Development

Task-Adaptive Admittance Control for Human-Quadrotor Cooperative Load Transportation with Dynamic Cable-Length Regulation

Gated Memory Policy

AI-Enabled Image-Based Hybrid Vision/Force Control of Tendon-Driven Aerial Continuum Manipulators

RoomRecon: High-Quality Textured Room Layout Reconstruction on Mobile Devices

AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs

Differentiable Satellite Constellation Configuration via Relaxed Coverage and Revisit Objectives

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Multi-Gait Learning for Humanoid Robots Using Reinforcement Learning with Selective Adversarial Motion Prior

Reinforcement Learning Enabled Adaptive Multi-Task Control for Bipedal Soccer Robots

Multi-Step Gaussian Process Propagation for Adaptive Path Planning

Multimodal embodiment-aware navigation transformer

Warmth and Competence in the Swarm: Designing Effective Human-Robot Teams

Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual Input

Achieving Interaction Fluidity in a Wizard-of-Oz Robotic System: A Prototype for Fluid Error-Correction

M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

Forward Dynamics of Variable Topology Mechanisms - The Case of Constraint Activation

Wrench-Aware Admittance Control for Unknown-Payload Manipulation

Assessing VLM-Driven Semantic-Affordance Inference for Non-Humanoid Robot Morphologies

GenerativeMPC: VLM-RAG-guided Whole-Body MPC with Virtual Impedance for Bimanual Mobile Manipulation

LiveVLN: Breaking the Stop-and-Go Loop in Vision-Language Navigation

Autonomous UAV Pipeline Near-proximity Inspection via Disturbance-Aware Predictive Visual Servoing

A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots

Multi-Cycle Spatio-Temporal Adaptation in Human-Robot Teaming

Learning Hybrid-Control Policies for High-Precision In-Contact Manipulation Under Uncertainty

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching

Vision-Based Human Awareness Estimation for Enhanced Safety and Efficiency of AMRs in Industrial Warehouses

StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network

DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax

Align then Refine: Text-Guided 3D Prostate Lesion Segmentation

Colour Extraction Pipeline for Odonates using Computer Vision

Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control

Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras

DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation

URoPE: Universal Relative Position Embedding across Geometric Spaces

REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction

CAHAL: Clinically Applicable resolution enHAncement for Low-resolution MRI scans

EfficientPENet: Real-Time Depth Completion from Sparse LiDAR via Lightweight Multi-Modal Fusion

CrossPan: A Comprehensive Benchmark for Cross-Sequence Pancreas MRI Segmentation and Generalization

LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

Geometric Decoupling: Diagnosing the Structural Instability of Latent

DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning

Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model

Multi-Domain Learning with Global Expert Mapping

DDF2Pol: A Dual-Domain Feature Fusion Network for PolSAR Image Classification

ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence Modelling for Hyperspectral Image Classification

HMR-Net: Hierarchical Modular Routing for Cross-Domain Object Detection in Aerial Images

Hierarchically Robust Zero-shot Vision-language Models

A Proxy Consistency Loss for Grounded Fusion of Earth Observation and Location Encoders

Localization-Guided Foreground Augmentation in Autonomous Driving

Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images

Toward Clinically Acceptable Chest X-ray Report Generation: A Qualitative Retrospective Pilot Study of CXRMate-2

AdaGScale: Viewpoint-Adaptive Gaussian Scaling in 3D Gaussian Splatting to Reduce Gaussian-Tile Pairs

A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation

AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos

Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents

Generative Texture Filtering

Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge

The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation

Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration

EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment

BALTIC: A Benchmark and Cross-Domain Strategy for 3D Reconstruction Across Air and Underwater Domains Under Varying Illumination

Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval

Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation

ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

MSDS: Deep Structural Similarity with Multiscale Representation

Improved Anomaly Detection in Medical Images via Mean Shift Density Enhancement

How Far Are Video Models from True Multimodal Reasoning?

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

When Can We Trust Deep Neural Networks? Towards Reliable Industrial Deployment with an Interpretability Guide

An Object-Centered Data Acquisition Method for 3D Gaussian Splatting using Mobile Phones

Attention-based Multi-modal Deep Learning Model of Spatio-temporal Crop Yield Prediction with Satellite, Soil and Climate Data

Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification

Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery