arXiv Daily Digest

297

Papers

CART: Context-Aware Terrain Adaptation using Temporal Sequence Selection for Legged Robots

CART：基于时间序列选择的上下文感知地形适应方法用于腿部机器人

Singh, Kartikeya, Kim, Youngjin, Turkar, Yash, Dantu, Karthik

Abstract

Animals in nature combine multiple modalities, such as sight and feel, to perceive terrain and develop an understanding of how to walk on uneven terrain in a stable manner. Similarly, legged robots need to develop their ability to stably walk on complex terrains by developing an understanding of the relationship between vision and proprioception. Most current terrain adaptation methods are susceptible to failure on complex, off-road terrain as they rely on prior experience, particularly observations from a vision sensor. This experience-based learning often creates a Visual-Texture Paradox between what has been seen and how it actually feels. In this work, we introduce CART, a high-level controller built on a context-aware terrain adaptation approach that integrates proprioception and exteroception from onboard sensing to achieve a robust understanding of terrain. We evaluate our method on multiple terrains using an ANYmal-C robot on the IsaacSim simulator and a Boston Dynamics SPOT robot for our real-world experiments. To evaluate the learned contextual terrain properties, we adapt vibrational stability on the base of the robot as a metric. We compare CART with various state-of-the-art baselines equipped with multimodal sensing in both simulation and the real world. CART achieves an average success rate improvement of 5% over all baselines in simulation and improves the overall stability up to 45% and 24% in the real world without increasing the time taken by the robot to accomplish locomotion tasks.

Chinese Translation

自然界中的动物结合多种感知方式，如视觉和触觉，来感知地形并理解如何在不平坦的地形上稳定行走。同样，腿部机器人也需要通过理解视觉与本体感觉之间的关系，来发展在复杂地形上稳定行走的能力。目前大多数地形适应方法在复杂的越野地形上容易失败，因为它们依赖于先前的经验，特别是来自视觉传感器的观察。这种基于经验的学习常常导致视觉-纹理悖论，即所见与实际感觉之间的矛盾。在本研究中，我们提出了CART，一种基于上下文感知地形适应方法的高层控制器，整合了来自机载传感器的本体感觉和外部感觉，以实现对地形的稳健理解。我们在IsaacSim模拟器上使用ANYmal-C机器人和在现实世界实验中使用Boston Dynamics SPOT机器人对多种地形评估我们的方法。为了评估学习到的上下文地形特性，我们将机器人的底座振动稳定性作为一个指标。在模拟和现实世界中，我们将CART与多模态传感的各种先进基线进行比较。CART在模拟中相较于所有基线实现了平均5%的成功率提升，并在现实世界中提高了整体稳定性，分别达到45%和24%，而不增加机器人完成运动任务所需的时间。

cs.RO / 2 / 2604.14353

RoSLAC: Robust Simultaneous Localization and Calibration of Multiple Magnetometers

RoSLAC：多磁力计的鲁棒性同时定位与校准

Lyu, Qiyang, Wu, Zhenyu, Wang, Wei, Shen, Hongming, Wang, Danwei

Abstract

Localization of autonomous mobile robots (AMRs) in enclosed or semi-enclosed environments such as offices, hotels, hospitals, indoor parking facilities, and underground spaces where GPS signals are weak or unavailable remains a major obstacle to the deployment of fully autonomous systems. Infrastructure-based localization approaches, such as QR codes and RFID, are constrained by high installation and maintenance costs as well as limited flexibility, while onboard sensor-based methods, including LiDAR- and vision-based solutions, are affected by ambiguous geometric features and frequent occlusions caused by dynamic obstacles such as pedestrians. Ambient magnetic field (AMF)-based localization has therefore attracted growing interest in recent years because it does not rely on external infrastructure or geometric features, making it well-suited for AMR applications such as service robots and security robots. However, magnetometer measurements are often corrupted by distortions caused by ferromagnetic materials present on the sensor platform, which bias the AMF and degrade localization reliability. As a result, accurate magnetometer calibration to estimate distortion parameters becomes essential. Conventional calibration methods that rely on rotating the magnetometer are impractical for large and heavy platforms. To address this limitation, this paper proposes a robust simultaneous localization and calibration (RoSLAC) approach based on alternating optimization, which iteratively and efficiently estimates both the platform pose and magnetometer calibration parameters. Extensive evaluations conducted in high-fidelity simulation and real-world environments demonstrate that the proposed RoSLAC method achieves high localization accuracy while maintaining low computational cost compared with state-of-the-art magnetometer calibration techniques.

Chinese Translation

在封闭或半封闭环境中（如办公室、酒店、医院、室内停车场和地下空间），自主移动机器人（AMRs）的定位仍然是完全自主系统部署的一大障碍，因为这些环境中的GPS信号较弱或不可用。基于基础设施的定位方法（如二维码和射频识别）受到高安装和维护成本以及灵活性有限的制约，而基于机载传感器的方法（包括激光雷达和视觉解决方案）则受到模糊几何特征和动态障碍物（如行人）造成的频繁遮挡的影响。因此，基于环境磁场（AMF）的定位近年来受到越来越多的关注，因为它不依赖于外部基础设施或几何特征，非常适合服务机器人和安保机器人的应用。然而，磁力计的测量常常受到传感器平台上存在的铁磁材料引起的失真影响，这会偏置AMF并降低定位的可靠性。因此，准确的磁力计校准以估计失真参数变得至关重要。依赖于旋转磁力计的传统校准方法对于大型和重型平台来说并不实用。为了解决这一限制，本文提出了一种基于交替优化的鲁棒性同时定位与校准（RoSLAC）方法，该方法迭代且高效地估计平台姿态和磁力计校准参数。在高保真模拟和现实环境中进行的广泛评估表明，所提出的RoSLAC方法在保持低计算成本的同时，能够实现高定位精度，相较于最先进的磁力计校准技术具有明显优势。

cs.RO / 3 / 2604.14399

SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing

SpaceMind：一个模块化和自我演化的具身视觉-语言代理框架，用于自主在轨服务

Wu, Aodi, Han, Haodong, Luo, Xubo, Wang, Ruisuo, He, Shan, Wan, Xue

Abstract

Autonomous on-orbit servicing demands embodied agents that perceive through visual sensors, reason about 3D spatial situations, and execute multi-phase tasks over extended horizons. We present SpaceMind, a modular and self-evolving vision-language model (VLM) agent framework that decomposes knowledge, tools, and reasoning into three independently extensible dimensions: skill modules with dynamic routing, Model Context Protocol (MCP) tools with configurable profiles, and injectable reasoning-mode skills. An MCP-Redis interface layer enables the same codebase to operate across simulation and physical hardware without modification, and a Skill Self-Evolution mechanism distills operational experience into persistent skill files without model fine-tuning. We validate SpaceMind through 192 closed-loop runs across five satellites, three task types, and two environments, a UE5 simulation and a physical laboratory, deliberately including degraded conditions to stress-test robustness. Under nominal conditions all modes achieve 90-100% navigation success; under degradation, the Prospective mode uniquely succeeds in search-and-approach tasks where other modes fail. A self-evolution study shows that the agent recovers from failure in four of six groups from a single failed episode, including complete failure to 100% success and inspection scores improving from 12 to 59 out of 100. Real-world validation confirms zero-code-modification transfer to a physical robot with 100% rendezvous success. Code: https://github.com/wuaodi/SpaceMind

Chinese Translation

自主在轨服务要求具身代理通过视觉传感器感知、推理三维空间情境，并在较长时间内执行多阶段任务。我们提出了SpaceMind，一个模块化和自我演化的视觉-语言模型（VLM）代理框架，该框架将知识、工具和推理分解为三个独立可扩展的维度：具有动态路由的技能模块、具有可配置配置文件的模型上下文协议（MCP）工具，以及可注入的推理模式技能。MCP-Redis接口层使得相同的代码库能够在模拟和物理硬件上无须修改地运行，而技能自我演化机制则将操作经验提炼为持久的技能文件，无需模型微调。我们通过在五颗卫星、三种任务类型和两个环境（一个UE5模拟和一个物理实验室）中进行192次闭环运行来验证SpaceMind，故意包括降级条件以进行稳健性压力测试。在正常条件下，所有模式的导航成功率均达到90%至100%；在降级条件下，前瞻模式在搜索和接近任务中独特地成功，而其他模式则失败。自我演化研究表明，代理能够从六组中的四组失败中恢复，单次失败事件后实现从完全失败到100%成功，检查评分从100分中的12分提高到59分。现实世界验证确认了零代码修改转移到物理机器人，且实现了100%的会合成功。代码链接：https://github.com/wuaodi/SpaceMind

cs.RO / 4 / 2604.14421

BIEVR-LIO: Robust LiDAR-Inertial Odometry through Bump-Image-Enhanced Voxel Maps

BIEVR-LIO：通过增强凹凸图像的体素地图实现鲁棒的激光雷达-惯性里程计

Pfreundschuh, Patrick, Tuna, Turcan, Gentil, Cedric Le, Siegwart, Roland, Cadena, Cesar, Oleynikova, Helen

Abstract

Reliable odometry is essential for mobile robots as they increasingly enter more challenging environments, which often contain little information to constrain point cloud registration, resulting in degraded LiDAR-Inertial Odometry (LIO) accuracy or even divergence. To address this, we present BIEVR-LIO, a novel approach designed specifically to exploit subtle variations in the available geometry for improved robustness. We propose a high-resolution map representation that stores surfaces as compact voxel-wise oriented height images. This representation can be used directly for registration without the calculation of intermediate geometric primitives while still supporting efficient updates. Since informative geometry is often sparsely distributed in the environment, we further propose a map-informed point sampling strategy to focus registration on geometrically informative regions, improving robustness in uninformative environments while reducing computational cost compared to global high-resolution sampling. Experiments across multiple sensors, platforms, and environments demonstrate state-of-the-art performance in well-constrained scenes and substantial improvements in challenging scenarios where baseline methods diverge. Additionally, we demonstrate that the fine-grained geometry captured by BIEVR-LIO can be used for downstream tasks such as elevation mapping for robot locomotion.

Chinese Translation

可靠的里程计对于移动机器人至关重要，因为它们越来越多地进入更具挑战性的环境，这些环境通常包含很少的信息来约束点云配准，从而导致激光雷达-惯性里程计（LIO）精度下降甚至发散。为了解决这个问题，我们提出了BIEVR-LIO，这是一种专门设计用于利用可用几何体微小变化以提高鲁棒性的创新方法。我们提出了一种高分辨率地图表示，存储表面为紧凑的体素导向高度图像。这种表示可以直接用于配准，而无需计算中间几何原语，同时仍支持高效更新。由于环境中的信息几何通常分布稀疏，我们进一步提出了一种基于地图的信息点采样策略，以将配准集中在几何信息丰富的区域，从而提高在信息匮乏环境中的鲁棒性，同时与全局高分辨率采样相比降低计算成本。在多个传感器、平台和环境下的实验表明，在约束良好的场景中表现出最先进的性能，并在基线方法发散的挑战性场景中实现了显著改善。此外，我们还展示了BIEVR-LIO捕获的细粒度几何体可用于下游任务，例如机器人运动的高程映射。

cs.RO / 5 / 2604.14454

CooperDrive: Enhancing Driving Decisions Through Cooperative Perception

CooperDrive：通过协同感知增强驾驶决策

Qu, Deyuan, Chen, Qi, Shimizu, Takayuki, Altintas, Onur

Abstract

Autonomous vehicles equipped with robust onboard perception, localization, and planning still face limitations in occlusion and non-line-of-sight (NLOS) scenarios, where delayed reactions can increase collision risk. We propose CooperDrive, a cooperative perception framework that augments situational awareness and enables earlier, safer driving decisions. CooperDrive offers two key advantages: (i) each vehicle retains its native perception, localization, and planning stack, and (ii) a lightweight object-level sharing and fusion strategy bridges perception and planning. Specifically, CooperDrive reuses detector Bird's-Eye View (BEV) features to estimate accurate vehicle poses without additional heavy encoders, thereby reconstructing BEV representations and feeding the planner with low latency. On the planning side, CooperDrive leverages the expanded object set to anticipate potential conflicts earlier and adjust speed and trajectory proactively, thereby transforming reactive behaviors into predictive and safer driving decisions. Real-world closed-loop tests at occlusion-heavy NLOS intersections demonstrate that CooperDrive increases reaction lead time, minimum time-to-collision (TTC), and stopping margin, while requiring only 90 kbps bandwidth and maintaining an average end-to-end latency of 89 ms.
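The abstract reports minimum time-to-collision (TTC) as one of its safety metrics without defining it. As a rough illustration only, the sketch below assumes the common constant-velocity definition along a shared path (remaining gap divided by closing speed). The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def time_to_collision(gap_m: float, ego_speed_mps: float, other_speed_mps: float) -> float:
    """Constant-velocity TTC along a shared path (assumed definition).

    Returns infinity when the gap is not closing.
    """
    closing_speed = ego_speed_mps - other_speed_mps
    if closing_speed <= 0.0:
        return math.inf
    return gap_m / closing_speed


def min_ttc(samples) -> float:
    """Smallest TTC over a sequence of (gap, ego_speed, other_speed) samples."""
    return min(time_to_collision(*sample) for sample in samples)
```

Under this definition, detecting an occluded vehicle earlier leaves a larger gap at the first braking decision, which raises the minimum TTC observed over the run.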

Chinese Translation

配备强大车载感知、定位和规划系统的自动驾驶车辆在遮挡和非视距（NLOS）场景中仍面临局限性，在这些情况下，反应延迟可能增加碰撞风险。我们提出了CooperDrive，一种协同感知框架，旨在增强情境意识并实现更早、更安全的驾驶决策。CooperDrive提供两个主要优势：（i）每辆车保留其原生的感知、定位和规划堆栈，以及（ii）一种轻量级的对象级共享与融合策略，连接感知与规划。具体而言，CooperDrive重用检测器Bird's-Eye View（BEV）特征，以在不需要额外重型编码器的情况下估计准确的车辆姿态，从而重建BEV表示，并以低延迟将信息反馈给规划器。在规划方面，CooperDrive利用扩展的对象集，提前预测潜在冲突，并主动调整速度和轨迹，从而将反应行为转变为预测性和更安全的驾驶决策。在遮挡严重的NLOS交叉口进行的实际闭环测试表明，CooperDrive增加了反应提前时间、最小碰撞时间（TTC）和停车余量，同时仅需90 kbps的带宽，并保持平均端到端延迟为89毫秒。

cs.RO / 6 / 2604.14484

A Nonasymptotic Theory of Gain-Dependent Error Dynamics in Behavior Cloning

行为克隆中增益依赖误差动态的非渐近理论

Seo, Junghoon

Abstract

Behavior cloning (BC) policies on position-controlled robots inherit the closed-loop response of the underlying PD controller, yet the effect of controller gains on BC failure lacks a nonasymptotic theory. We show that independent sub-Gaussian action errors propagate through the gain-dependent closed-loop dynamics to yield sub-Gaussian position errors whose proxy matrix $X_\infty(K)$ governs the failure tail. The probability of horizon-$T$ task failure factorizes into a gain-dependent amplification index $\Gamma_T(K)$ and the validation loss plus a generalization slack, so training loss alone cannot predict closed-loop performance. Under shape-preserving upper-bound structural assumptions the proxy admits the scalar bound $X_\infty(K)\preceq\Psi(K)\bar X$ with $\Psi(K)$ decomposed into label difficulty, injection strength, and contraction, ranking the four canonical regimes with compliant-overdamped (CO) tightest, stiff-underdamped (SU) loosest, and the stiff-overdamped versus compliant-underdamped ordering system-dependent. For the canonical scalar second-order PD system the closed-form continuous-time stationary variance $X_\infty^{\mathrm{c}}(\alpha,\beta)=\sigma^2\alpha/(2\beta)$ is strictly monotone in stiffness and damping over the entire stable orthant, covering both underdamped and overdamped regimes, and the exact zero-order-hold (ZOH) discretization inherits this monotonicity. The analysis provides the first nonasymptotic explanation of the empirical finding that compliant, overdamped controllers improve BC success rates.
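The closed form $\sigma^2\alpha/(2\beta)$ can be sanity-checked numerically. The sketch below is an illustration under an assumed model that the abstract does not spell out: a scalar PD loop $\ddot x = \alpha(u + e - x) - \beta\dot x$ with stiffness $\alpha$, damping $\beta$, and a white action error $e$ of intensity $\sigma^2$, so that the noise enters the acceleration scaled by $\alpha$. Under that assumption the stationary covariance $P$ satisfies the Lyapunov equation $AP + PA^\top + Q = 0$, and the code checks that the quoted variance solves it. All function names are hypothetical.

```python
def stationary_position_variance(alpha: float, beta: float, sigma: float) -> float:
    """Closed form quoted in the abstract: sigma^2 * alpha / (2 * beta)."""
    if alpha <= 0.0 or beta <= 0.0:
        raise ValueError("the stable orthant requires alpha > 0 and beta > 0")
    return sigma ** 2 * alpha / (2.0 * beta)


def lyapunov_residual(alpha: float, beta: float, sigma: float) -> float:
    """Largest |entry| of A P + P A^T + Q for the ASSUMED model.

    State is (position, velocity); the noise enters the velocity row
    with gain alpha * sigma (an assumption, see the lead-in).
    """
    p11 = stationary_position_variance(alpha, beta, sigma)
    p22 = alpha * p11  # velocity variance implied by the off-diagonal equation
    A = [[0.0, 1.0], [-alpha, -beta]]
    P = [[p11, 0.0], [0.0, p22]]
    Q = [[0.0, 0.0], [0.0, (alpha * sigma) ** 2]]
    residual = 0.0
    for i in range(2):
        for j in range(2):
            ap = sum(A[i][k] * P[k][j] for k in range(2))
            pat = sum(P[i][k] * A[j][k] for k in range(2))
            residual = max(residual, abs(ap + pat + Q[i][j]))
    return residual
```

In this toy setting the variance grows with stiffness and shrinks with damping, which matches the direction of the abstract's claim that compliant, overdamped gains give the tightest bound.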

Chinese Translation

位置控制机器人上的行为克隆（BC）策略继承了基础PD控制器的闭环响应，但控制器增益对BC失败的影响缺乏非渐近理论。我们展示了独立的次高斯动作误差通过增益依赖的闭环动态传播，导致次高斯位置误差，其代理矩阵$X_\infty(K)$主导了失败尾部。水平-$T$任务失败的概率可以分解为增益依赖的放大指数$\Gamma_T(K)$和验证损失加上泛化松弛，因此仅靠训练损失无法预测闭环性能。在保持形状的上界结构假设下，代理承认标量界限$X_\infty(K)\preceq\Psi(K)\bar X$，其中$\Psi(K)$分解为标签难度、注入强度和收缩，排名四个典型状态，其中合规过阻尼（CO）最紧，刚性欠阻尼（SU）最松，而刚性过阻尼与合规欠阻尼的排序依赖于系统。对于典型的标量二阶PD系统，闭式连续时间稳态方差$X_\infty^{\mathrm{c}}(\alpha,\beta)=\sigma^2\alpha/(2\beta)$在整个稳定象限内对刚度和阻尼严格单调，涵盖了欠阻尼和过阻尼状态，并且精确的零阶保持（ZOH）离散化继承了这种单调性。该分析提供了第一个非渐近的解释，说明合规的过阻尼控制器提高了BC的成功率。

cs.RO / 7 / 2604.14545

CT-VIR: Continuous-Time Visual-Inertial-Ranging Fusion for Indoor Localization with Sparse Anchors

CT-VIR：基于连续时间的视觉惯性测距融合用于稀疏锚点的室内定位

Liu, Yu-An, Zhang, Li

Abstract

Visual-inertial odometry (VIO) is widely used for mobile robot localization, but its long-term accuracy degrades without global constraints. Incorporating ranging sensors such as ultra-wideband (UWB) can mitigate drift; however, high-accuracy ranging usually requires well-deployed anchors, which is difficult to ensure in narrow or low-power environments. Moreover, most existing visual-inertial-ranging (VIR) fusion methods rely on discrete time-based filtering or optimization, making it difficult to balance positioning accuracy, trajectory consistency, and fusion efficiency under asynchronous multi-sensor sampling. To address these issues, we propose a spline-based continuous-time state estimation method for VIR fusion localization. In the preprocessing stage, VIO motion priors and UWB ranging measurements are used to construct virtual anchors and reject outliers, thereby alleviating geometric degeneration and improving range reliability. In the estimation stage, the pose trajectory is parameterized in continuous time using a B-spline, while inertial, visual, and ranging constraints are formulated as factors in a sliding-window graph. The spline control points, together with a small set of auxiliary parameters, are then jointly optimized to obtain a continuous-time trajectory estimate. Evaluations on public datasets and real-world experiments demonstrate the effectiveness and practical potential of the proposed approach.
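One reason a spline parameterization suits asynchronous sensors is that the trajectory can be queried at any measurement timestamp. The sketch below illustrates this with a uniform cubic B-spline over positions only; the paper's method parameterizes full poses and optimizes the control points in a factor graph, neither of which is reproduced here. The function names and the knot spacing are assumptions made for illustration.

```python
def cubic_bspline_point(ctrl, u):
    """Evaluate one uniform cubic B-spline segment at u in [0, 1].

    ctrl holds four consecutive control points (tuples of coordinates).
    """
    if len(ctrl) != 4:
        raise ValueError("a cubic segment needs exactly four control points")
    u2, u3 = u * u, u * u * u
    weights = (
        (1 - 3 * u + 3 * u2 - u3) / 6.0,
        (4 - 6 * u2 + 3 * u3) / 6.0,
        (1 + 3 * u + 3 * u2 - 3 * u3) / 6.0,
        u3 / 6.0,
    )
    dim = len(ctrl[0])
    return tuple(sum(w * p[d] for w, p in zip(weights, ctrl)) for d in range(dim))


def position_at(ctrl_pts, knot_dt, t):
    """Query the trajectory at an arbitrary time t (segment i starts at i * knot_dt)."""
    i = max(0, min(int(t // knot_dt), len(ctrl_pts) - 4))
    u = t / knot_dt - i
    return cubic_bspline_point(ctrl_pts[i:i + 4], u)
```

A camera frame, an IMU sample, and a UWB range taken at three different instants can each call `position_at` with their own timestamp, so no measurement has to be interpolated onto a shared discrete state.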

Chinese Translation

视觉惯性里程计（VIO）广泛应用于移动机器人定位，但在缺乏全局约束的情况下，其长期精度会下降。引入超宽带（UWB）等测距传感器可以减轻漂移；然而，高精度测距通常需要良好部署的锚点，这在狭窄或低功耗环境中难以保证。此外，大多数现有的视觉惯性测距（VIR）融合方法依赖于基于离散时间的滤波或优化，这使得在异步多传感器采样下平衡定位精度、轨迹一致性和融合效率变得困难。为了解决这些问题，我们提出了一种基于样条的连续时间状态估计方法用于VIR融合定位。在预处理阶段，利用VIO运动先验和UWB测距测量构建虚拟锚点并剔除异常值，从而缓解几何退化并提高测距可靠性。在估计阶段，使用B样条对姿态轨迹进行连续时间参数化，同时将惯性、视觉和测距约束作为滑动窗口图中的因子进行构建。然后，样条控制点与一小组辅助参数共同优化，以获得连续时间轨迹估计。在公共数据集和实际实验中的评估表明了所提方法的有效性和实际潜力。

cs.RO / 8 / 2604.14565

Model-Based Reinforcement Learning Exploits Passive Body Dynamics for High-Performance Biped Robot Locomotion

基于模型的强化学习利用被动身体动力学实现高性能双足机器人运动

Kamimura, Tomoya, Washiyama, Haruka, Sano, Akihito

Abstract

Embodiment is a significant keyword in recent machine learning fields. This study focused on the passive nature of the body of a biped robot to generate walking and running locomotion using model-based deep reinforcement learning. We constructed two models in a simulator, one with passive elements (e.g., springs) and the other, which is similar to general humanoids, without passive elements. The training of the model with passive elements was highly affected by the attractor of the system. As a result, although the trajectories quickly converged to limit cycles, it took a long time to obtain large rewards. However, thanks to the attractor-driven learning, the acquired locomotion was robust and energy-efficient. The results revealed that robots with passive elements could efficiently acquire high-performance locomotion by utilizing stable limit cycles generated through dynamic interaction between the body and ground. This study demonstrates the importance of implementing passive properties in the body for future embodied AI.

Chinese Translation

具身性是近年来机器学习领域的重要关键词。本研究聚焦于双足机器人身体的被动特性，利用基于模型的深度强化学习生成行走和奔跑的运动。我们在模拟器中构建了两个模型，一个包含被动元件（例如弹簧），另一个则类似于一般人形机器人，不包含被动元件。带有被动元件的模型训练受到系统吸引子的强烈影响。这导致尽管轨迹迅速收敛到极限环，但获得大量奖励却需要较长时间。然而，得益于吸引子驱动的学习，获得的运动表现出强健性和能效。结果表明，具有被动元件的机器人能够通过利用身体与地面之间的动态交互生成的稳定极限环，有效地获得高性能运动。本研究展示了在未来具身人工智能中实施身体被动特性的必要性。

cs.RO / 9 / 2604.14635

A multi-platform LiDAR dataset for standardized forest inventory measurement at long term ecological monitoring sites

用于长期生态监测站标准化森林清查测量的多平台LiDAR数据集

Chang, Michael R., Candotti, Anna, von Ellenrieder, Karl, Tomelleri, Enrico, Camurri, Marco

Abstract

We present a curated multi-platform LiDAR reference dataset from an instrumented ICOS forest plot, explicitly designed to support calibration, benchmarking, and integration of 3D structural data with ecological observations and standard allometric models. The dataset integrates UAV-borne laser scanning (ULS) to measure canopy coverage, terrestrial laser scanning (TLS) for detailed stem mapping, and backpack mobile laser scanning (MLS) with real-time SLAM for efficient sub-canopy acquisition. We focus on the control plot with the most complete and internally consistent registration, where TLS point clouds (~333 million points) are complemented by ULS and MLS data capturing canopy and understory strata. Marker-free, SLAM-aware protocols were used to reduce field and processing time, while manual and automated methods were combined. Final products are available in LAZ and E57 formats with UTM coordinates, together with registration reports for reproducibility. The dataset provides a benchmark for testing registration methods, evaluating scanning efficiency, and linking point clouds with segmentation, quantitative structure models, and allometric biomass estimation. By situating the acquisitions at a long-term ICOS site, the dataset explicitly links 3D structure with decades of ecological and flux measurements. More broadly, it illustrates how TLS, MLS, and ULS can be combined for repeated inventories and digital twins of forest ecosystems.

Chinese Translation

我们展示了一个经过精心策划的多平台LiDAR参考数据集，该数据集来自一个配备仪器的ICOS森林样地，专门设计用于支持3D结构数据与生态观察和标准异速模型的校准、基准测试和集成。该数据集整合了无人机激光扫描（UAV-borne laser scanning, ULS）用于测量树冠覆盖率，地面激光扫描（terrestrial laser scanning, TLS）用于详细的树干映射，以及背包移动激光扫描（backpack mobile laser scanning, MLS）结合实时SLAM用于高效的亚树冠获取。我们重点关注控制样地，该样地具有最完整和内部一致的配准，其中TLS点云（约3.33亿个点）由ULS和MLS数据补充，捕捉树冠和下层植被。采用无标记、SLAM感知的协议以减少现场和处理时间，同时结合了手动和自动方法。最终产品以LAZ和E57格式提供，并附有UTM坐标以及可重复性的配准报告。该数据集为测试配准方法、评估扫描效率以及将点云与分割、定量结构模型和异速生物量估计相连接提供了基准。通过将获取工作置于长期ICOS站点，它明确与具有数十年生态和通量测量的3D结构相关联。更广泛地说，它展示了如何结合TLS、MLS和ULS进行重复清查和森林生态系统的数字双胞胎。

cs.RO / 10 / 2604.14652

DigiForest: Digital Analytics and Robotics for Sustainable Forestry

DigiForest：可持续林业的数字分析与机器人技术

Camurri, Marco, Tomelleri, Enrico, Mattamala, Matías, Laina, Sebastián Barbas, Jacquet, Martin, Behley, Jens, Kushwaha, Sunni Kanta Prasad, Nan, Fang, Chebrolu, Nived, Freißmuth, Leonard, Harms, Marvin Chayton, Malladi, Meher V. R., Yang, Fan, Frey, Jonas, Cadena, Cesar, Hutter, Marco, Schweier, Janine, Alexis, Kostas, Stachniss, Cyrill, Fallon, Maurice, Leutenegger, Stefan

Abstract

Covering one third of Earth's land surface, forests are vital to global biodiversity, climate regulation, and human well-being. In Europe, forests and woodlands reach approximately 40% of land area, and the forestry sector is central to achieving the EU's climate neutrality and biodiversity goals; these emphasize sustainable forest management, increased use of long-lived wood products, and resilient forest ecosystems. To meet these goals and properly address their inherent challenges, current practices require further innovation. This chapter introduces DigiForest, a novel, large-scale precision forestry approach leveraging digital technologies and autonomous robotics. DigiForest is structured around four main components: (1) autonomous, heterogeneous mobile robots (aerial, legged, and marsupial) for tree-level data collection; (2) automated extraction of tree traits to build forest inventories; (3) a Decision Support System (DSS) for forecasting forest growth and supporting decision-making; and (4) low-impact selective logging using purpose-built autonomous harvesters. These technologies have been extensively validated in real-world conditions in several locations, including forests in Finland, the UK, and Switzerland.

Chinese Translation

森林覆盖了地球陆地表面的三分之一，对全球生物多样性、气候调节和人类福祉至关重要。在欧洲，森林和林地大约占土地面积的40%，林业部门在实现欧盟气候中立和生物多样性目标方面发挥着核心作用；这些目标强调可持续森林管理、增加长期木制品的使用以及增强森林生态系统的韧性。为了实现这些目标并妥善应对其固有挑战，当前的实践需要进一步创新。本章介绍了DigiForest，这是一种新颖的大规模精准林业方法，利用数字技术和自主机器人。DigiForest的结构围绕四个主要组成部分： (1) 用于树木级数据收集的自主异构移动机器人（包括空中机器人、步行机器人和袋鼠机器人）； (2) 自动提取树木特征以建立森林清单； (3) 用于预测森林生长和支持决策的决策支持系统（DSS）；以及 (4) 使用专门设计的自主采伐机进行低影响选择性采伐。这些技术已在芬兰、英国和瑞士等多个地点的实际条件下进行了广泛验证。

cs.RO / 11 / 2604.14732

World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

世界价值行动模型：视觉-语言-行动系统的隐式规划

Li, Runze, Zhang, Hongyin, Jin, Junxi, Zeng, Qixin, Zhuang, Zifeng, Tang, Yiqi, Lyu, Shangke, Wang, Donglin

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulations and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long-horizon and compositional scenarios.
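The exponential-decay claim can be illustrated with a deliberately simple toy model. It is not the paper's derivation: the sketch assumes each step of a randomly sampled action sequence is feasible independently with the same probability, so a horizon-$H$ sequence is feasible with probability $p^H$. The function names are hypothetical.

```python
def feasible_fraction(per_step_feasible_prob: float, horizon: int) -> float:
    """Probability that a random action sequence stays feasible for `horizon` steps,
    assuming independent steps (a simplifying assumption)."""
    return per_step_feasible_prob ** horizon


def samples_needed(per_step_feasible_prob: float, horizon: int) -> float:
    """Expected number of random sequences drawn before one is fully feasible."""
    return 1.0 / feasible_fraction(per_step_feasible_prob, horizon)
```

Even when half of all single-step actions are feasible, a 10-step horizon needs about a thousand random samples per feasible trajectory under this model, which is the kind of blow-up that motivates searching in a latent space shaped toward feasible regions.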

Chinese Translation

视觉-语言-行动（VLA）模型已成为构建将感知和语言与行动结合的具身智能体的有希望的范式。然而，大多数现有方法依赖于直接的行动预测，缺乏对长时间轨迹进行推理和评估其后果的能力，这限制了在复杂决策任务中的表现。在本研究中，我们引入了世界价值行动（WAV）模型，这是一种统一框架，能够在VLA系统中实现隐式规划。WAV模型并不是进行显式轨迹优化，而是学习基于视觉观测和语言指令的未来轨迹的结构化潜在表示。学习到的世界模型预测未来状态，而轨迹价值函数评估其长时间效用。然后，行动生成被表述为在这个潜在空间中的推理，其中模型逐步将概率质量集中在高价值和动态可行的轨迹上。我们提供了一个理论视角，表明在行动空间中直接规划会随着时间范围的增加而导致可行轨迹的概率呈指数衰减。相比之下，潜在空间推理重新塑造了搜索分布，朝向可行区域，从而实现高效的长时间决策。大量的仿真和现实世界实验表明，WAV模型始终优于最先进的方法，在任务成功率、泛化能力和鲁棒性方面取得了显著提升，尤其是在长时间和组合场景中。

cs.RO / 12 / 2604.14733

Differentiable Object Pose Connectivity Metrics for Regrasp Sequence Optimization

可微分物体姿态连通性度量用于重抓序列优化

Qin, Liang, Wan, Weiwei, Harada, Kensuke

Abstract

Regrasp planning is often required when one pick-and-place cannot transfer an object from an initial pose to a goal pose while maintaining grasp feasibility. The main challenge is to reason about shared-grasp connectivity across intermediate poses, where discrete search becomes brittle. We propose an implicit multi-step regrasp planning framework based on differentiable pose sequence connectivity metrics. We model grasp feasibility under an object pose using an Energy-Based Model (EBM) and leverage energy additivity to construct a continuous energy landscape that measures pose-pair connectivity, enabling gradient-based optimization of intermediate object poses. An adaptive iterative deepening strategy is introduced to determine the minimum number of intermediate steps automatically. Experiments show that the proposed cost formulation provides smooth and informative gradients, improving planning robustness over other alternatives. They also demonstrate generalization to unseen grasp poses and cross-end-effector transfer, where a model trained with suction constraints can guide parallel gripper grasp manipulation. The multi-step planning results further highlight the effectiveness of adaptive deepening and minimum-step search.

Chinese Translation

当一次抓取与放置无法在保持抓取可行性的情况下将物体从初始姿态转移到目标姿态时，通常需要进行重抓规划。主要挑战在于推理中间姿态之间的共享抓取连通性，而离散搜索在这方面变得脆弱。我们提出了一种基于可微分姿态序列连通性度量的隐式多步骤重抓规划框架。我们使用能量模型（Energy-Based Model, EBM）对物体姿态下的抓取可行性进行建模，并利用能量可加性构建一个连续的能量景观，以测量姿态对的连通性，从而实现中间物体姿态的基于梯度的优化。引入了一种自适应迭代加深策略，以自动确定最小中间步骤数。实验表明，所提出的成本公式提供了平滑且信息丰富的梯度，提升了规划的鲁棒性，相较于其他替代方案。实验还展示了对未见抓取姿态和跨末端执行器转移的泛化能力，其中使用吸取约束训练的模型能够指导并行夹具的抓取操作。多步骤规划结果进一步突显了自适应加深和最小步骤搜索的有效性。

cs.RO / 13 / 2604.14795

Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

保持冷静：通过辅助眼实现无标定的公里级SLAM与视觉几何基础模型

Zhang, Tianjun, Zhang, Fengyi, Deng, Tianchen, Zhang, Lin, Wang, Hesheng

Abstract

Visual Geometry Foundation Models (VGFMs) demonstrate remarkable zero-shot capabilities in local reconstruction. However, deploying them for kilometer-level Simultaneous Localization and Mapping (SLAM) remains challenging. In such scenarios, current approaches mainly rely on linear transforms (e.g., Sim3 and SL4) for sub-map alignment, while we argue that a single linear transform is fundamentally insufficient to model the complex, non-linear geometric distortions inherent in VGFM outputs. Forcing such rigid alignment leads to the rapid accumulation of uncorrected residuals, eventually resulting in significant trajectory drift and map divergence. To address these limitations, we present CAL2M (Calibration-free Assistant-eye based Large-scale Localization and Mapping), a plug-and-play framework compatible with arbitrary VGFMs. Distinct from traditional systems, CAL2M introduces an "assistant eye" solely to leverage the prior of constant physical spacing, effectively eliminating scale ambiguity without any temporal or spatial pre-calibration. Furthermore, leveraging the assumption of accurate feature matching, we propose an epipolar-guided intrinsic and pose correction model. Supported by an online intrinsic search module, it can effectively rectify rotation and translation errors caused by inaccurate intrinsics through fundamental matrix decomposition. Finally, to ensure accurate mapping, we introduce a globally consistent mapping strategy based on anchor propagation. By constructing and fusing anchors across the trajectory, we establish a direct local-to-global mapping relationship. This enables the application of nonlinear transformations to elastically align sub-maps, effectively eliminating geometric misalignments and ensuring a globally consistent reconstruction. The source code of CAL2M will be publicly available at https://github.com/IRMVLab/CALM.

Chinese Translation

视觉几何基础模型（VGFMs）在局部重建中展现出显著的零样本能力。然而，将其应用于公里级的同时定位与地图构建（SLAM）仍然面临挑战。在这种情况下，当前的方法主要依赖于线性变换（例如，Sim3和SL4）进行子地图对齐，而我们认为单一的线性变换从根本上不足以建模VGFMs输出中固有的复杂非线性几何失真。强行进行这种刚性对齐会导致未校正残差的快速积累，最终导致显著的轨迹漂移和地图偏差。为了解决这些局限性，我们提出了CAL2M（无标定的辅助眼大规模定位与地图构建），这是一个与任意VGFMs兼容的即插即用框架。与传统系统不同，CAL2M引入了一个“辅助眼”，专门利用恒定物理间距的先验，有效消除尺度模糊，而无需任何时间或空间的预校准。此外，利用准确特征匹配的假设，我们提出了一种基于极线引导的内参和姿态校正模型。通过在线内参搜索模块，它能够有效纠正由于不准确内参引起的旋转和位移误差，方法是通过基础矩阵分解实现。最后，为确保准确的地图构建，我们引入了一种基于锚点传播的全局一致性地图构建策略。通过在轨迹上构建和融合锚点，我们建立了直接的局部到全局的映射关系。这使得非线性变换能够灵活地对齐子地图，有效消除几何错位，并确保全局一致的重建。CAL2M的源代码将公开发布于https://github.com/IRMVLab/CALM。

cs.RO / 14 / 2604.14834

Switch: Learning Agile Skills Switching for Humanoid Robots

Switch：为类人机器人学习灵活技能切换

Lau, Yuen-Fui, Zhao, Qihan, Wang, Yinhuai, Yu, Runyi, Tsui, Hok Wai, Chen, Qifeng, Tan, Ping

Abstract

Recent advancements in whole-body control through deep reinforcement learning have enabled humanoid robots to achieve remarkable progress in challenging real-world locomotion skills. However, existing approaches often struggle with flexible transitions between distinct skills, creating safety concerns and practical limitations. To address this challenge, we introduce a hierarchical multi-skill system, Switch, enabling seamless skill transitions at any moment. Our approach comprises three key components: (1) a Skill Graph (SG) that establishes potential cross-skill transitions based on kinematic similarity within multi-skill motion data, (2) a whole-body tracking policy trained on this skill graph through deep reinforcement learning, and (3) an online skill scheduler to drive the tracking policy for robust skill execution and smooth transitions. For skill switching or significant tracking deviations, the scheduler performs online graph search to find the optimal feasible path, which ensures efficient, stable, and real-time execution of diverse locomotion skills. Comprehensive experiments demonstrate that Switch empowers humanoid robots to execute agile skill transitions with high success rates while maintaining strong motion imitation performance.
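The scheduler's online graph search can be pictured as a shortest-path query over the skill graph. The sketch below uses Dijkstra's algorithm as a stand-in; the abstract does not say which search algorithm the authors use, and the real graph is built from motion data rather than a handful of named skills. The skill names and edge costs are hypothetical.

```python
import heapq


def shortest_transition(graph, start, goal):
    """Dijkstra over a skill graph where graph[u] = {v: transition_cost}.

    Returns (total_cost, path); (inf, []) when the goal is unreachable.
    """
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    return float("inf"), []


# Hypothetical skill graph: lower cost means a kinematically smoother transition.
SKILL_GRAPH = {
    "walk": {"run": 1.0, "crouch": 2.0},
    "run": {"jump": 1.5, "walk": 1.0},
    "crouch": {"jump": 4.0},
    "jump": {},
}
```

With this toy graph, a request to go from "walk" to "jump" is routed through "run" because that path has the lowest accumulated transition cost.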

Chinese Translation

近年来，通过深度强化学习实现的全身控制使得类人机器人在现实世界中的挑战性运动技能上取得了显著进展。然而，现有方法在不同技能之间的灵活过渡方面常常面临困难，这带来了安全隐患和实际限制。为了解决这一挑战，我们提出了一种分层多技能系统Switch，使得技能在任何时刻都能无缝切换。我们的方法包括三个关键组成部分：（1）技能图（Skill Graph, SG），基于多技能运动数据中的运动相似性建立潜在的跨技能过渡；（2）通过深度强化学习在该技能图上训练的全身跟踪策略；（3）一个在线技能调度器，用于驱动跟踪策略，以实现稳健的技能执行和流畅的过渡。对于技能切换或显著的跟踪偏差，调度器进行在线图搜索，以找到最优的可行路径，从而确保多样化运动技能的高效、稳定和实时执行。全面的实验表明，Switch使类人机器人能够以高成功率执行灵活的技能切换，同时保持强大的运动模仿性能。

cs.RO / 15 / 2604.14857

Graph Theoretical Outlier Rejection for 4D Radar Registration in Feature-Poor Environments

在特征稀缺环境中基于图论的异常值拒绝用于4D雷达配准

Dorndorf, Georg, Adolfsson, Daniel, Doostdar, Masrur

Abstract

Automotive 4D imaging radar is well suited for operation in dusty and low-visibility environments, but scan registration remains challenging due to scan sparsity and spurious detections caused by noise and multipath reflections. This difficulty is compounded in feature-poor open-pit mines, where the lack of distinctive landmarks reduces correspondence reliability. We integrate graph-based pairwise consistency maximization (PCM) as an outlier rejection step within the iterative closest points (ICP) loop. We propose a radar-adapted pairwise distance-invariant scoring function for graph-based PCM that incorporates anisotropic, per-detection uncertainty derived from a radar measurement model. The consistency maximization problem is approximated with a greedy heuristic that finds a large clique in the pairwise consistency graph. The refined correspondence set improves robustness when the initial association set is heavily contaminated. We evaluate a standard Euclidean distance residual and our uncertainty-aware residual on an open-pit mine dataset collected with a 4D imaging radar. Compared to the generalized ICP (GICP) baseline without PCM, our method reduces segment relative position error (RPE) by 29.6% on 1 m segments and by up to 55% on 100 m segments. The presented method is intended for integration into localization pipelines and is suitable for online use due to the greedy heuristic in graph-based PCM.
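The outlier-rejection idea can be sketched in a few lines. The version below is a simplified illustration, not the authors' implementation: it uses the plain Euclidean distance-invariance test with a fixed tolerance rather than the uncertainty-aware score the abstract proposes, and its greedy clique heuristic (visit nodes by descending degree, keep those consistent with everything kept so far) is one common choice that the abstract does not confirm. All names and the tolerance are assumptions.

```python
import math


def consistent(match_a, match_b, tol):
    """Two correspondences agree if a rigid motion would preserve their spacing."""
    (p1, q1), (p2, q2) = match_a, match_b
    return abs(math.dist(p1, p2) - math.dist(q1, q2)) <= tol


def greedy_consistent_set(matches, tol=0.1):
    """Greedy approximation of a large clique in the pairwise consistency graph.

    matches: list of (source_point, target_point) pairs.
    Returns the sorted indices of the correspondences kept as inliers.
    """
    n = len(matches)
    adj = [[i != j and consistent(matches[i], matches[j], tol) for j in range(n)]
           for i in range(n)]
    order = sorted(range(n), key=lambda i: -sum(adj[i]))  # highest degree first
    clique = []
    for i in order:
        if all(adj[i][j] for j in clique):
            clique.append(i)
    return sorted(clique)
```

Inside an ICP loop, the returned indices would select which correspondences feed the pose update on each iteration, so that a heavily contaminated association set does not drag the estimate.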

Chinese Translation

汽车4D成像雷达非常适合在多尘和低能见度环境中操作，但由于扫描稀疏性以及噪声和多路径反射引起的虚假检测，扫描配准仍然具有挑战性。在特征稀缺的露天矿中，这一困难更加复杂，因为缺乏显著的地标降低了对应关系的可靠性。我们将基于图的成对一致性最大化（pairwise consistency maximization, PCM）整合为迭代最近点（iterative closest points, ICP）循环中的异常值拒绝步骤。我们提出了一种适应雷达的成对距离不变评分函数，用于基于图的（PCM），该函数结合了来自雷达测量模型的各向异性、每次检测的不确定性。通过贪婪启发式方法近似一致性最大化问题，该方法在成对一致性图中找到一个大的团。经过精炼的对应集在初始关联集受到严重污染时提高了鲁棒性。我们在使用4D成像雷达收集的露天矿数据集上评估了标准欧几里得距离残差和我们基于不确定性的残差。与没有PCM的广义ICP（generalized ICP, GICP）基线相比，我们的方法在1米段上将段相对位置误差（relative position error, RPE）降低了29.6%，在100米段上降低了多达55%。所提出的方法旨在集成到定位管道中，并因基于图的（PCM）中的贪婪启发式而适合在线使用。
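
The outlier-rejection idea in this abstract (a pairwise consistency graph followed by a greedy clique search) can be sketched roughly as below. This is only an illustration under stated assumptions: it uses a plain Euclidean distance-invariance test with a fixed tolerance `tol`, not the paper's uncertainty-aware score, and the function names `consistency_graph` and `greedy_clique` are hypothetical.

```python
import numpy as np

def consistency_graph(src, dst, tol=0.2):
    """Boolean adjacency matrix over correspondences (src[i] <-> dst[i]).

    Two correspondences are marked consistent when the distance between
    their source points matches the distance between their target points,
    since a rigid motion preserves pairwise distances.
    """
    d_src = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=-1)
    d_dst = np.linalg.norm(dst[:, None, :] - dst[None, :, :], axis=-1)
    adj = np.abs(d_src - d_dst) < tol
    np.fill_diagonal(adj, False)
    return adj

def greedy_clique(adj):
    """Greedy heuristic: visit nodes by decreasing degree and keep each
    node that is consistent with every node kept so far."""
    order = np.argsort(-adj.sum(axis=1), kind="stable")
    clique = []
    for i in order:
        if all(adj[i, j] for j in clique):
            clique.append(int(i))
    return sorted(clique)

# Toy data (assumed): five inliers under a 90-degree yaw plus translation,
# and one correspondence whose target point is displaced by 4 m.
src = np.array([[0, 0, 0], [4, 0, 0], [0, 3, 0],
                [0, 0, 5], [2, 2, 1], [6, 1, 2]], dtype=float)
R = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
dst = src @ R.T + np.array([1.0, -2.0, 0.5])
dst[5] += np.array([0.0, 0.0, 4.0])  # corrupt the last correspondence

inliers = greedy_clique(consistency_graph(src, dst))
print(inliers)
```

The greedy pass does not guarantee a maximum clique; it trades optimality for a predictable runtime, which is the property the abstract relies on for online use.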

cs.RO / 16 / 2604.14868

4D Radar Gaussian Modeling and Scan Matching with RCS

基于RCS的4D雷达高斯建模与扫描匹配

Amodeo, Fernando, Merino, Luis, Caballero, Fernando

Abstract

4D millimeter-wave (mmWave) radars are increasingly used in robotics, as they offer robustness against adverse environmental conditions. Besides the usual XYZ position, they provide Doppler velocity measurements as well as Radar Cross Section (RCS) information for every point. While Doppler is widely used to filter out dynamic points, RCS is often overlooked and not usually used in modeling and scan matching processes. Building on previous 3D Gaussian modeling and scan matching work, we propose incorporating the physical behavior of RCS in the model, in order to further enrich the summarized information about the scene, and improve the scan matching process.

Chinese Translation

4D毫米波（mmWave）雷达在机器人技术中越来越多地被使用，因为它们在恶劣环境条件下表现出良好的鲁棒性。除了常规的XYZ位置外，它们还为每个点提供多普勒速度测量以及雷达散射截面（RCS）信息。虽然多普勒信息被广泛用于过滤动态点，但RCS常常被忽视，通常不用于建模和扫描匹配过程。在以往3D高斯建模和扫描匹配工作的基础上，我们提出在模型中融入RCS的物理行为，以进一步丰富关于场景的汇总信息，并改善扫描匹配过程。

cs.RO / 17 / 2604.14882

An Intelligent Robotic and Bio-Digestor Framework for Smart Waste Management

智能机器人与生物消化器框架用于智能废物管理

Khatri, Radhika, Tewari, Adit, Sharma, Nikhil, Srinivas, M. B.

Abstract

Rapid urbanization and continuous population growth have made municipal solid waste management increasingly challenging. These challenges highlight the need for smarter and automated waste management solutions. This paper presents the design and evaluation of an integrated waste management framework that combines two connected systems, a robotic waste segregation module and an optimized bio-digestor. The robotic waste segregation system uses a MyCobot 280 Jetson Nano robotic arm along with YOLOv8 object detection and robot operating system (ROS)-based path planning to identify and sort waste in real time. It classifies waste into four different categories with high precision, reducing the need for manual intervention. After segregation, the biodegradable waste is transferred to a bio-digestor system equipped with multiple sensors. These sensors continuously monitor key parameters, including temperature, pH, pressure, and motor revolutions per minute. The Particle Swarm Optimization (PSO) algorithm, combined with a regression model, is used to dynamically adjust system parameters. This intelligent optimization approach ensures stable operation and maximizes digestion efficiency under varying environmental conditions. System testing under dynamic conditions demonstrates a sorting accuracy of 98% along with highly efficient biological conversion. The proposed framework offers a scalable, intelligent, and practical solution for modern waste management, making it suitable for both residential and industrial applications.

Chinese Translation

快速的城市化和持续的人口增长使得市政固体废物管理面临越来越大的挑战。这些挑战突显了对更智能和自动化废物管理解决方案的需求。本文提出了一种集成废物管理框架的设计与评估，该框架结合了两个互联的系统：机器人废物分拣模块和优化的生物消化器。机器人废物分拣系统使用MyCobot 280 Jetson Nano机器人臂，结合YOLOv8目标检测和基于机器人操作系统（ROS）的路径规划，实时识别和分类废物。它将废物精确分类为四种不同类别，减少了人工干预的需求。分拣后，生物可降解废物被转移到配备多种传感器的生物消化器系统中。这些传感器持续监测关键参数，包括温度、pH值、压力和电机转速。粒子群优化（PSO）算法与回归模型相结合，用于动态调整系统参数。这种智能优化方法确保了在不同环境条件下的稳定运行，并最大化消化效率。在动态条件下的系统测试表明，分拣准确率达到98%，同时生物转化效率极高。所提出的框架为现代废物管理提供了一种可扩展、智能和实用的解决方案，适用于住宅和工业应用。
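
For readers unfamiliar with Particle Swarm Optimization, a minimal generic sketch follows. It is not the authors' controller: the objective `surrogate_cost`, its setpoints (37 and 7) and the parameter bounds are made-up stand-ins for the regression model mentioned in the abstract, and the swarm coefficients are common textbook values.

```python
import numpy as np

def pso_minimize(objective, lower, upper, n_particles=20, n_iters=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Basic global-best PSO over a box-bounded search space."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    pos = rng.uniform(lower, upper, size=(n_particles, lower.size))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                                # per-particle best
    best_val = np.array([objective(p) for p in pos])
    g_best = best_pos[np.argmin(best_val)].copy()        # swarm-wide best
    for _ in range(n_iters):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        vel = (inertia * vel
               + c_personal * r1 * (best_pos - pos)
               + c_global * r2 * (g_best - pos))
        pos = np.clip(pos + vel, lower, upper)           # stay inside bounds
        val = np.array([objective(p) for p in pos])
        improved = val < best_val
        best_pos[improved] = pos[improved]
        best_val[improved] = val[improved]
        g_best = best_pos[np.argmin(best_val)].copy()
    return g_best, float(best_val.min())

def surrogate_cost(x):
    """Hypothetical stand-in for a learned regression model: penalizes
    deviation of (temperature, pH) from assumed setpoints."""
    temperature, ph = x
    return (temperature - 37.0) ** 2 + 10.0 * (ph - 7.0) ** 2

best_x, best_cost = pso_minimize(surrogate_cost, lower=[20.0, 5.0], upper=[60.0, 9.0])
print(best_x, best_cost)
```

In a real system the objective would be evaluated through the fitted regression model rather than a closed-form expression.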

cs.RO / 18 / 2604.14944

HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps

HRDexDB：一个大规模的人类与机器人灵巧手抓握数据集

Lim, Jongbin, Ha, Taeyun, Choi, Mingi, Kim, Jisoo, Kim, Byungjun, Jeon, Subin, Joo, Hanbyul

Abstract

We present HRDexDB, a large-scale, multi-modal dataset of high-fidelity dexterous grasping sequences featuring both human and diverse robotic hands. Unlike existing datasets, HRDexDB provides a comprehensive collection of grasping trajectories across human hands and multiple robot hand embodiments, spanning 100 diverse objects. Leveraging state-of-the-art vision methods and a new dedicated multi-camera system, our HRDexDB offers high-precision spatiotemporal 3D ground-truth motion for both the agent and the manipulated object. To facilitate the study of physical interaction, HRDexDB includes high-resolution tactile signals, synchronized multi-view video, and egocentric video streams. The dataset comprises 1.4K grasping trials, encompassing both successes and failures, each enriched with visual, kinematic, and tactile modalities. By providing closely aligned captures of human dexterity and robotic execution on the same target objects under comparable grasping motions, HRDexDB serves as a foundational benchmark for multi-modal policy learning and cross-domain dexterous manipulation.

Chinese Translation

我们提出了HRDexDB，这是一个大规模的多模态数据集，包含高保真度的灵巧抓握序列，涉及人类和多种机器人手。与现有数据集不同，HRDexDB提供了一个全面的抓握轨迹集合，涵盖人类手和多种机器人手的表现，涉及100种不同的物体。利用最先进的视觉方法和一套新的专用多摄像头系统，我们的HRDexDB为代理和被操控物体提供了高精度的时空3D真实运动数据。为了促进物理交互的研究，HRDexDB还包括高分辨率的触觉信号、同步的多视角视频和自我中心的视频流。该数据集包含1400个抓握试验，涵盖成功与失败，每个试验都丰富了视觉、运动学和触觉模态。通过提供人类灵巧性和机器人执行在相同目标物体上在可比抓握动作下的紧密对齐捕捉，HRDexDB为多模态策略学习和跨领域灵巧操作提供了基础基准。

cs.RO / 19 / 2604.14965

POMDP-based Object Search with Growing State Space and Hybrid Action Domain

基于POMDP的具有增长状态空间和混合动作域的目标搜索

Chen, Yongbo, Wang, Hesheng, Huang, Shoudong, Kurniawati, Hanna

Abstract

Efficiently locating target objects in complex indoor environments with diverse furniture, such as shelves, tables, and beds, is a significant challenge for mobile robots. This difficulty arises from factors like localization errors, limited fields of view, and visual occlusion. We address this by framing the object-search task as a high-dimensional Partially Observable Markov Decision Process (POMDP) with a growing state space and hybrid (continuous and discrete) action spaces in 3D environments. Based on a meticulously designed perception module, a novel online POMDP solver named the growing neural process filtered k-center clustering tree (GNPF-kCT) is proposed to tackle this problem. Optimal actions are selected using Monte Carlo Tree Search (MCTS) with belief tree reuse for growing state space, a neural process network to filter useless primitive actions, and k-center clustering hypersphere discretization for efficient refinement of high-dimensional action spaces. A modified upper-confidence bound (UCB), informed by belief differences and action value functions within cells of estimated diameters, guides MCTS expansion. Theoretical analysis validates the convergence and performance potential of our method. To address scenarios with limited information or rewards, we also introduce a guessed target object with a grid-world model as a key strategy to enhance search efficiency. Extensive Gazebo simulations with Fetch and Stretch robots demonstrate faster and more reliable target localization than POMDP-based baselines and state-of-the-art (SOTA) non-POMDP-based solvers, especially large language model (LLM) based methods, in object search under the same computational constraints and perception systems. Real-world tests in office environments confirm the practical applicability of our approach. Project page: https://sites.google.com/view/gnpfkct.

Chinese Translation

在复杂的室内环境中高效定位目标物体，如架子、桌子和床，对于移动机器人来说是一个重大挑战。这一困难源于定位误差、有限的视野和视觉遮挡等因素。我们通过将目标搜索任务框架化为一个高维部分可观测马尔可夫决策过程（POMDP），在三维环境中引入增长状态空间和混合（连续和离散）动作空间来解决这一问题。基于精心设计的感知模块，我们提出了一种新颖的在线POMDP求解器，称为增长神经过程过滤k中心聚类树（GNPF-kCT），以应对这一问题。通过使用蒙特卡洛树搜索（MCTS）结合信念树重用来处理增长状态空间，利用神经过程网络过滤无用的原始动作，以及k中心聚类超球体离散化来高效细化高维动作空间，选择最优动作。修改后的上置信界（UCB）依据信念差异和估计直径单元内的动作价值函数来指导MCTS扩展。理论分析验证了我们方法的收敛性和性能潜力。为了应对信息或奖励有限的场景，我们还引入了一种带有网格世界模型的猜测目标物体作为增强搜索效率的关键策略。通过与Fetch和Stretch机器人进行的大量Gazebo仿真，证明了在相同计算约束和感知系统下，我们的方法在目标搜索中比基于POMDP的基线和最先进（SOTA）的非POMDP求解器（特别是基于大型语言模型（LLM）的方法）具有更快和更可靠的目标定位。在办公环境中的实际测试确认了我们方法的实际适用性。项目页面：https://sites.google.com/view/gnpfkct。
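
The k-center clustering step named in this abstract can be illustrated with the classic greedy farthest-point heuristic. The sketch below is a generic version, assuming a finite set of sampled candidate actions; it does not reproduce the paper's hypersphere discretization or its coupling to MCTS, and `greedy_k_center` is a hypothetical name.

```python
import numpy as np

def greedy_k_center(points, k, first=0):
    """Pick k centers by repeatedly adding the point farthest from the
    centers chosen so far (farthest-point heuristic)."""
    points = np.asarray(points, dtype=float)
    centers = [first]
    # Distance from every point to its nearest chosen center.
    nearest = np.linalg.norm(points - points[first], axis=1)
    while len(centers) < k:
        nxt = int(np.argmax(nearest))
        centers.append(nxt)
        nearest = np.minimum(nearest, np.linalg.norm(points - points[nxt], axis=1))
    return centers, float(nearest.max())

# Assumed toy set of sampled 2D actions.
actions = [[0, 0], [1, 0], [0, 1], [10, 10], [10, 11], [-8, 5]]
centers, radius = greedy_k_center(actions, k=3)
print(centers, radius)
```

The returned `radius` is the largest distance from any sampled action to its nearest center, which gives a rough measure of how coarse the discretization is.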

cs.RO / 20 / 2604.14986

Momentum-constrained Hybrid Heuristic Trajectory Optimization Framework with Residual-enhanced DRL for Visually Impaired Scenarios

基于动量约束的混合启发式轨迹优化框架及其在视觉障碍场景中的残差增强深度强化学习

Zeng, Yuting, Zheng, Zhiwen, Wang, Jingya, Zhou, You, Xiao, JiaLing, Yu, Yongbin, Fan, Manping, Gong, Bo, Ren, Liyong

Abstract

Safe and efficient assistive planning for visually impaired scenarios remains challenging, since existing methods struggle with multi-objective optimization, generalization, and interpretability. In response, this paper proposes a Momentum-Constrained Hybrid Heuristic Trajectory Optimization Framework (MHHTOF). To balance multiple objectives of comfort and safety, the framework designs a Heuristic Trajectory Sampling Cluster (HTSC) with a Momentum-Constrained Trajectory Optimization (MTO), which suppresses abrupt velocity and acceleration changes. In addition, a novel residual-enhanced deep reinforcement learning (DRL) module refines candidate trajectories, advancing temporal modeling and policy generalization. Finally, a dual-stage cost modeling mechanism (DCMM) is introduced to regulate optimization, where costs in the Frenet space ensure consistency, and reward-driven adaptive weights in the Cartesian space integrate user preferences for interpretability and user-centric decision-making. Experimental results show that the proposed framework converges in nearly half the iterations of baselines and achieves lower and more stable costs. In complex dynamic scenarios, MHHTOF further demonstrates stable velocity and acceleration curves with reduced risk, confirming its advantages in robustness, safety, and efficiency.

Chinese Translation

在视觉障碍场景中，安全高效的辅助规划仍然面临挑战，因为现有方法在多目标优化、泛化能力和可解释性方面存在困难。为此，本文提出了一种基于动量约束的混合启发式轨迹优化框架（MHHTOF）。该框架设计了一个启发式轨迹采样集群（HTSC）与动量约束轨迹优化（MTO），以平衡舒适性和安全性的多个目标，从而抑制突发的速度和加速度变化。此外，本文还引入了一种新颖的残差增强深度强化学习（DRL）模块，用于优化候选轨迹，推动时间建模和策略泛化。最后，提出了一种双阶段成本建模机制（DCMM）来调节优化，其中Frenet空间中的成本确保一致性，而在笛卡尔空间中基于奖励驱动的自适应权重则整合用户偏好，以提高可解释性和以用户为中心的决策。实验结果表明，所提框架在迭代次数上几乎是基线的一半即可收敛，并且实现了更低且更稳定的成本。在复杂动态场景中，MHHTOF进一步展示了稳定的速度和加速度曲线，并降低了风险，确认了其在鲁棒性、安全性和效率方面的优势。

cs.RO / 21 / 2604.15013

DEX-Mouse: A Low-cost Portable and Universal Interface with Force Feedback for Data Collection of Dexterous Robotic Hands

DEX-Mouse：一种低成本便携式通用接口，具备力反馈功能，用于灵巧机器人手的数据收集

Koh, Joonho, Jung, Haechan, Kim, Nayoung, Ko, Wook, Nam, Changjoo

Abstract

Data-driven dexterous hand manipulation requires large-scale, physically consistent demonstration data. Simulation and video-based methods suffer from sim-to-real gaps and retargeting problems, while MoCap glove-based teleoperation systems require per-operator calibration and lack portability, as the robot hand is typically fixed to a stationary arm. Portable alternatives improve mobility but lack cross-platform and cross-operator compatibility. We present DEX-Mouse, a portable, calibration-free hand-held teleoperation interface with integrated kinesthetic force feedback, built from commercial off-the-shelf components under USD 150. The operator-agnostic design requires no calibration or structural modification, enabling immediate deployment across diverse environments and platforms. The interface supports a configuration in which the target robot hand is mounted directly on the forearm of an operator, producing robot-aligned data. In a comparative user study across various dexterous manipulation tasks, operators using the proposed system achieved an 86.67% task completion rate under the attached configuration. Also, we found that the attached configuration reduced the perceived workload of the operators compared to spatially separated teleoperation setups across all compared interfaces. The complete hardware and software stack, including bill of materials, CAD models, and firmware, is open-sourced at https://dex-mouse.github.io/ to facilitate replication and adoption.

Chinese Translation

数据驱动的灵巧手操控需要大规模、物理一致的演示数据。模拟和基于视频的方法存在从模拟到现实的差距和重定向问题，而基于运动捕捉手套的遥操作系统则需要针对每个操作员进行校准，并且缺乏便携性，因为机器人手通常固定在一个静态臂上。便携式替代方案提高了移动性，但缺乏跨平台和跨操作员的兼容性。我们提出了DEX-Mouse，这是一种便携式、无需校准的手持遥操作接口，集成了动觉力反馈，采用150美元以下的商业现成组件构建。该接口的设计不依赖于操作员，无需校准或结构修改，能够在各种环境和平台中立即部署。该接口支持一种配置，其中目标机器人手直接安装在操作员的前臂上，从而生成与机器人对齐的数据。在一项针对各种灵巧操控任务的比较用户研究中，使用该系统的操作员在附加配置下达到了86.67%的任务完成率。此外，我们发现，与空间分离的遥操作设置相比，附加配置降低了操作员的感知工作负荷。完整的硬件和软件堆栈，包括材料清单、CAD模型和固件，已在https://dex-mouse.github.io/上开源，以促进复制和采用。

cs.RO / 22 / 2604.15023

DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation

DockAnywhere：通过新颖的演示生成实现移动操控的数据高效视觉运动策略学习

Shan, Ziyu, Zhou, Yuheng, Wu, Gaoyuan, Ji, Ziheng, Wu, Zhenyu, Wang, Ziwei

Abstract

Mobile manipulation is a fundamental capability that enables robots to interact in expansive environments such as homes and factories. Most existing approaches follow a two-stage paradigm, where the robot first navigates to a docking point and then performs fixed-base manipulation using powerful visuomotor policies. However, real-world mobile manipulation often suffers from the view generalization problem due to shifts of docking points. To address this issue, we propose a novel low-cost demonstration generation framework named DockAnywhere, which improves viewpoint generalization under docking variability by lifting a single demonstration to diverse feasible docking configurations. Specifically, DockAnywhere lifts a trajectory to any feasible docking point by decoupling docking-dependent base motions from contact-rich manipulation skills that remain invariant across viewpoints. Feasible docking proposals are sampled under feasibility constraints, and corresponding trajectories are generated via structure-preserving augmentation. Visual observations are synthesized in 3D space by representing the robot and objects as point clouds and applying point-level spatial editing to ensure the consistency of observation and action across viewpoints. Extensive experiments on ManiSkill and real-world platforms demonstrate that DockAnywhere substantially improves policy success rates and easily generalizes to novel viewpoints from docking points unseen during training, significantly enhancing the generalization capability of mobile manipulation policies in real-world deployment.

Chinese Translation

移动操控是使机器人能够在家庭和工厂等广阔环境中进行交互的基本能力。现有的大多数方法遵循两阶段范式，机器人首先导航到对接点，然后使用强大的视觉运动策略进行固定基座的操控。然而，现实世界中的移动操控常常由于对接点的变化而面临视角泛化问题。为了解决这一问题，我们提出了一种名为DockAnywhere的新型低成本演示生成框架，通过将单个演示提升到多种可行的对接配置，从而改善在对接变异下的视角泛化。具体而言，DockAnywhere通过将依赖于对接的基座运动与在不同视角下保持不变的接触丰富操控技能解耦，将轨迹提升到任何可行的对接点。在可行性约束下对可行的对接提案进行采样，并通过结构保持增强生成相应的轨迹。通过将机器人和物体表示为点云，并应用点级空间编辑来确保观察和动作在不同视角下的一致性，从而在三维空间中合成视觉观察。在ManiSkill和真实平台上的大量实验表明，DockAnywhere显著提高了策略成功率，并能够轻松地在训练期间从未见过的对接点泛化到新视角，显著增强了移动操控策略在现实世界部署中的泛化能力。

cs.RO / 23 / 2604.15052

CAVERS: Multimodal SLAM Data from a Natural Karstic Cave with Ground Truth Motion Capture

CAVERS：来自自然喀斯特洞穴的多模态SLAM数据及真实运动捕捉

Franchini, Giacomo, Rodríguez-Martínez, David, Martínez-Petersen, Alfonso, Pérez-del-Pulgar, C. J., Chiaberge, Marcello

Abstract

Autonomous robots operating in natural karstic caves face perception and navigation challenges that are qualitatively distinct from those encountered in mines or tunnels: irregular geometry, reflective wet surfaces, near-zero ambient light, and complex branching passages. Yet publicly available datasets targeting this environment remain scarce and offer limited sensing modalities and environmental diversity. We present CAVERS, a multimodal dataset acquired in two structurally distinct rooms of Cueva de la Victoria, Málaga, Spain, comprising 24 sequences totaling approximately 335 GB of recorded data. The sensor suite combines an Intel RealSense D435i RGB-D-I camera, an Optris PI640i near-IR thermal camera, and a Velodyne VLP-16 LiDAR, operated both handheld and mounted on a wheeled rover under full darkness and artificial illumination. For most of the sequences, mm-accurate 6-DoF ground truth pose and velocity at 120 Hz are provided by an OptiTrack motion capture system installed directly inside the cave. We benchmark seven state-of-the-art SLAM and odometry algorithms spanning visual, visual-inertial, thermal-inertial, and LiDAR-based pipelines, as well as a 3D reconstruction pipeline, demonstrating the dataset's usability. The dataset and all supplementary material are publicly available at: https://github.com/spaceuma/cavers.

Chinese Translation

在自然喀斯特洞穴中运行的自主机器人所面临的感知和导航挑战，与矿井或隧道中遇到的挑战有着本质上的不同：不规则的几何形状、反射的湿表面、近乎零的环境光以及复杂的分支通道。然而，针对这一环境的公开数据集仍然稀缺，并且提供的传感方式和环境多样性有限。我们提出了CAVERS，这是一个在西班牙马拉加的维多利亚洞穴两个结构上不同的房间中获取的多模态数据集，包含24个序列，总计约335 GB的记录数据。传感器组合包括一台Intel RealSense D435i RGB-D-I相机、一台Optris PI640i近红外热成像相机和一台Velodyne VLP-16 LiDAR，设备在完全黑暗和人工照明下以手持和安装在轮式机器人上的方式操作。对于大多数序列，毫米级精度的6自由度真实位姿和速度以120 Hz的频率由安装在洞穴内的OptiTrack运动捕捉系统提供。我们基准测试了七种最先进的SLAM和里程计算法，涵盖视觉、视觉惯性、热成像惯性和基于LiDAR的管道，以及一个3D重建管道，展示了数据集的可用性。

cs.RO / 24 / 2604.15074

Trajectory Planning for a Multi-UAV Rigid-Payload Cascaded Transportation System Based on Enhanced Tube-RRT*

基于增强型管状RRT*的多无人机刚性载荷级联运输系统轨迹规划

Yu, Jianqiao, Li, Jia, Gao, Tianhua

Abstract

This paper presents a two-stage trajectory planning framework for a multi-UAV rigid-payload cascaded transportation system, aiming to address planning challenges in densely cluttered environments. In Stage I, an Enhanced Tube-RRT* algorithm is developed by integrating active hybrid sampling and an adaptive expansion strategy, enabling rapid generation of a safe and feasible virtual tube in environments with dense obstacles. Moreover, a trajectory smoothness cost is explicitly incorporated into the edge cost to reduce excessive turns and thereby mitigate cable-induced oscillations. Simulation results demonstrate that the proposed Enhanced Tube-RRT* achieves a higher success rate and effective sampling rate than mixed-sampling Tube-RRT* (STube-RRT*) and adaptive-extension Tube-RRT* (AETube-RRT*), while producing a shorter optimal path with a smaller cumulative turning angle. In Stage II, a convex quadratic program is formulated by considering payload translational and rotational dynamics, cable tension constraints, and collision-safety constraints, yielding a smooth, collision-free desired payload trajectory. Finally, a centralized geometric control scheme is applied to the cascaded system to validate the effectiveness and feasibility of the proposed planning framework, offering a practical solution for payload attitude maneuvering in densely cluttered environments.

Chinese Translation

本文提出了一种针对多无人机刚性载荷级联运输系统的两阶段轨迹规划框架，旨在解决密集杂乱环境中的规划挑战。在第一阶段，开发了一种增强型管状RRT*（Enhanced Tube-RRT*）算法，通过结合主动混合采样和自适应扩展策略，使得在障碍物密集的环境中能够快速生成安全且可行的虚拟管道。此外，轨迹平滑度成本被明确纳入边缘成本中，以减少过度转弯，从而减轻由电缆引起的振荡。仿真结果表明，所提出的增强型管状RRT*在成功率和有效采样率方面优于混合采样管状RRT*（STube-RRT*）和自适应扩展管状RRT*（AETube-RRT*），同时生成更短的最优路径和更小的累计转角。在第二阶段，考虑到载荷的平移和旋转动力学、电缆张力约束以及碰撞安全约束，制定了一个凸二次规划，得出了平滑且无碰撞的期望载荷轨迹。最后，采用集中几何控制方案对级联系统进行验证，以验证所提出的规划框架的有效性和可行性，为在密集杂乱环境中进行载荷姿态操控提供了实用解决方案。
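
The abstract describes adding a trajectory smoothness term to the edge cost so that paths with fewer turns are preferred. A rough planar sketch of such a cost is given below; the additive form and the weight `turn_weight` are assumptions made for illustration, since the abstract does not state the exact cost used in Enhanced Tube-RRT*.

```python
import math

def turning_angle(a, b, c):
    """Absolute heading change (radians) between segment a->b and b->c."""
    v1 = (b[0] - a[0], b[1] - a[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return abs(math.atan2(cross, dot))

def path_cost(path, turn_weight=1.0):
    """Return (total cost, path length, cumulative turning angle)."""
    length = sum(math.dist(p, q) for p, q in zip(path, path[1:]))
    turning = sum(turning_angle(a, b, c)
                  for a, b, c in zip(path, path[1:], path[2:]))
    return length + turn_weight * turning, length, turning

# Two assumed paths of equal length: one straight, one with a right-angle turn.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
corner = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(path_cost(straight))
print(path_cost(corner))
```

With equal lengths, the path with the 90-degree corner is penalized by the extra turning term, which is the behavior a planner would use to favor smoother tubes.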

cs.RO / 25 / 2604.15076

NEAT-NC: NEAT guided Navigation Cells for Robot Path Planning

NEAT-NC：基于NEAT指导的导航细胞用于机器人路径规划

Meliani, Hibatallah, Slimani, Khadija, Khoulji, Samira

Abstract

To navigate a space, the brain makes an internal representation of the environment using different cells such as place cells, grid cells, head direction cells, border cells, and speed cells. All these cells, along with sensory inputs, enable an organism to explore the space around it. Inspired by these biological principles, we developed NEAT-NC, a NeuroEvolution of Augmenting Topologies (NEAT) guided Navigation Cells approach. The goal of the paper is to improve the performance of the NEAT algorithm for path planning in dynamic environments using spatial cognitive cells. This approach uses navigation cells as inputs and evolves recurrent neural networks, representing the hippocampus part of the brain. The performance of the proposed algorithm is evaluated in different static and dynamic scenarios. This study highlights NEAT's adaptability to complex and different environments, showcasing the utility of biological theories. This suggests that our approach is well-suited for real-time dynamic path planning for robotics and games.

Chinese Translation

为了在空间中导航，大脑利用不同的细胞（如位置细胞、网格细胞、头向细胞、边界细胞和速度细胞）对环境进行内部表征。所有这些细胞以及感官输入使生物体能够探索其周围的空间。受这些生物学原理的启发，我们开发了NEAT-NC（神经进化增强拓扑指导的导航细胞）。本文的目标是利用空间认知细胞提高NEAT算法在动态环境中的路径规划性能。该方法将导航细胞作为输入，并进化递归神经网络，代表大脑的海马体部分。我们在不同的静态和动态场景中评估了所提出算法的性能。这项研究突出了NEAT在复杂和不同环境中的适应性，展示了生物理论的实用性。这表明我们的方法非常适合于机器人和游戏的实时动态路径规划。

cs.RO / 26 / 2604.15168

Dual Pose-Graph Semantic Localization for Vision-Based Autonomous Drone Racing

基于视觉的自主无人机竞速双姿态图语义定位

Perez-Saura, David, Fernandez-Cortizas, Miguel, Gaona, Alvaro J., Campoy, Pascual

Abstract

Autonomous drone racing demands robust real-time localization under extreme conditions: high-speed flight, aggressive maneuvers, and payload-constrained platforms that often rely on a single camera for perception. Existing visual SLAM systems, while effective in general scenarios, struggle with motion blur and feature instability inherent to racing dynamics, and do not exploit the structured nature of racing environments. In this work, we present a dual pose-graph architecture that fuses odometry with semantic detections for robust localization. A temporary graph accumulates multiple gate observations between keyframes and optimizes them into a single refined constraint per landmark, which is then promoted to a persistent main graph. This design preserves the information richness of frequent detections while preventing graph growth from degrading real-time performance. The system is designed to be sensor-agnostic, although in this work we validate it using monocular visual-inertial odometry and visual gate detections. Experimental evaluation on the TII-RATM dataset shows a 56% to 74% reduction in ATE compared to standalone VIO, while an ablation study confirms that the dual-graph architecture achieves 10% to 12% higher accuracy than a single-graph baseline at identical computational cost. Deployment in the A2RL competition demonstrated that the system performs real-time onboard localization during flight, reducing the drift of the odometry baseline by up to 4.2 m per lap.

Chinese Translation

自主无人机竞速要求在极端条件下实现稳健的实时定位：高速飞行、激烈机动以及通常依赖单一摄像头进行感知的负载受限平台。现有的视觉SLAM系统虽然在一般场景中有效，但在竞速动态中面临运动模糊和特征不稳定的问题，并未充分利用竞速环境的结构化特性。在本研究中，我们提出了一种双姿态图架构，将里程计与语义检测融合以实现稳健的定位。一个临时图累积了关键帧之间的多个门观测，并将其优化为每个地标的单一精细约束，随后提升为一个持久的主图。该设计保留了频繁检测的信息丰富性，同时防止图的增长影响实时性能。该系统旨在实现传感器无关性，尽管在本研究中我们使用单目视觉惯性里程计和视觉门检测进行了验证。在TII-RATM数据集上的实验评估显示，与独立的视觉惯性里程计相比，绝对轨迹误差（ATE）减少了56%至74%；而消融研究确认双图架构在相同计算成本下实现了比单图基线高出10%至12%的准确性。在A2RL竞赛中的部署表明，该系统在飞行过程中实现了实时机载定位，将里程计基线的漂移减少了每圈最多4.2米。

cs.RO / 27 / 2604.15202

Benchmarking Classical Coverage Path Planning Heuristics on Irregular Hexagonal Grids for Maritime Coverage Scenarios

在不规则六边形网格上对经典覆盖路径规划启发式算法进行基准测试以应对海洋覆盖场景

Sepúlveda, Carlos S., Ruz, Gonzalo A.

Abstract

Coverage path planning on irregular hexagonal grids is relevant to maritime surveillance, search and rescue and environmental monitoring, yet classical methods are often compared on small ad hoc examples or on rectangular grids. This paper presents a reproducible benchmark of deterministic single-vehicle coverage path planning heuristics on irregular hexagonal graphs derived from synthetic but maritime-motivated areas of interest. The benchmark contains 10,000 Hamiltonian-feasible instances spanning compact, elongated, and irregular morphologies, 17 heuristics from seven families, and a common evaluation protocol covering Hamiltonian success, complete-coverage success, revisits, path length, heading changes, and CPU latency. Across the released dataset, heuristics with explicit shortest-path reconnection solve the relaxed coverage task reliably but almost never produce zero-revisit tours. Exact Depth-First Search confirms that every released instance is Hamiltonian-feasible. The strongest classical Hamiltonian baseline is a Warnsdorff variant that uses an index-based tie-break together with a terminal-inclusive residual-degree policy, reaching 79.0% Hamiltonian success. The dominant design choice is not tie-breaking alone, but how the residual degree is defined when the endpoint is reserved until the final move. This shows that underreported implementation details can materially affect performance on sparse geometric graphs with bottlenecks. The benchmark is intended as a controlled testbed for heuristic analysis rather than as a claim of operational optimality at fleet scale.

Chinese Translation

在不规则六边形网格上的覆盖路径规划与海洋监视、搜索与救援以及环境监测密切相关，然而经典方法通常仅在小规模的特定示例或矩形网格上进行比较。本文提出了一种可重复的基准测试，针对从合成但以海洋为动机的兴趣区域派生的不规则六边形图上的确定性单车覆盖路径规划启发式算法。该基准测试包含10,000个哈密尔顿可行实例，涵盖紧凑、细长和不规则形态，包含来自七个家族的17种启发式算法，以及一个共同的评估协议，涵盖哈密尔顿成功率、完全覆盖成功率、重访、路径长度、航向变化和CPU延迟。在发布的数据集中，具有显式最短路径重连的启发式算法可靠地解决了放松的覆盖任务，但几乎从未产生零重访的巡回路径。精确的深度优先搜索确认每个发布的实例都是哈密尔顿可行的。最强的经典哈密尔顿基线是使用基于索引的平局打破策略和终端包含的剩余度策略的Warnsdorff变体，达到了79.0%的哈密尔顿成功率。主导设计选择不仅仅是平局打破，而是当终点保留到最后一步时，剩余度的定义。这表明，未报告的实现细节可能对具有瓶颈的稀疏几何图的性能产生实质性影响。该基准测试旨在作为启发式分析的受控测试平台，而不是作为在舰队规模上运营最优性的声明。
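
To make the Warnsdorff-style heuristic mentioned in this abstract concrete, here is a small sketch on a hexagonal grid in axial coordinates. It implements only the textbook rule (move to the unvisited neighbor with the fewest unvisited neighbors) with a coordinate-order tie-break; the paper's terminal-inclusive residual-degree policy is not reproduced, and all names are hypothetical.

```python
# Six neighbor offsets of a hex cell in axial (q, r) coordinates.
HEX_DIRS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def neighbors(cell, cells):
    q, r = cell
    return [(q + dq, r + dr) for dq, dr in HEX_DIRS if (q + dq, r + dr) in cells]

def residual_degree(cell, cells, visited):
    """Number of still-unvisited neighbors of a cell."""
    return sum(1 for n in neighbors(cell, cells) if n not in visited)

def warnsdorff_path(cells, start):
    """Greedy walk: always step to the unvisited neighbor with the lowest
    residual degree, breaking ties by coordinate order. Stops at a dead end."""
    cells = set(cells)
    visited = {start}
    path = [start]
    while True:
        candidates = [n for n in neighbors(path[-1], cells) if n not in visited]
        if not candidates:
            return path
        nxt = min(candidates, key=lambda c: (residual_degree(c, cells, visited), c))
        visited.add(nxt)
        path.append(nxt)

# Assumed toy instance: a center cell plus its six surrounding cells.
region = {(0, 0)} | set(HEX_DIRS)
path = warnsdorff_path(region, start=(1, 0))
print(path, len(path) == len(region))
```

The walk is a Hamiltonian path only if it visits every cell before hitting a dead end; on irregular regions with bottlenecks the greedy rule can fail, which is the failure mode the benchmark measures.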

cs.RO / 28 / 2604.15215

A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

一种用于机器人上下文模仿学习的层次时空动作标记器

Fateh, Fawad Javed, Ali, Ali Shah, Popattia, Murad, Nizamani, Usman, Konin, Andrey, Zia, M. Zeeshan, Tran, Quoc-Huy

Abstract

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.

Chinese Translation

我们提出了一种新颖的层次时空动作标记器，用于上下文模仿学习。我们首先提出了一种层次方法，该方法由两个连续的向量量化层次组成。具体而言，较低层次将输入动作分配到细粒度子集群，而较高层次进一步将细粒度子集群映射到集群。我们的层次方法在主要通过重构输入动作来利用空间信息的同时，优于非层次方法。此外，我们通过利用空间和时间线索扩展了我们的方法，形成了一个层次时空动作标记器，即 HiST-AT。具体来说，我们的层次时空方法进行多层次聚类，同时恢复输入动作及其相关时间戳。最后，在多个仿真和真实机器人操作基准上的广泛评估表明，我们的方法在上下文模仿学习中建立了新的最先进性能。
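
The two-level vector quantization described in this abstract can be pictured with a toy lookup: each action is assigned to its nearest fine-grained subcluster, and each subcluster is then mapped to its nearest coarse cluster. The codebooks below are fixed, made-up values; in the paper they would be learned, and the temporal (timestamp) branch is not shown.

```python
import numpy as np

def nearest_code(codebook, x):
    """Index of the nearest codebook vector for each row of x."""
    dists = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    return np.argmin(dists, axis=1)

# Hypothetical codebooks: four fine subclusters and two coarse clusters.
fine_codebook = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
coarse_codebook = np.array([[0.0, 0.5], [5.0, 5.5]])

actions = np.array([[0.1, 0.1], [0.2, 0.9], [4.8, 6.2]])

# Lower level: action -> fine-grained subcluster.
sub_ids = nearest_code(fine_codebook, actions)
# Higher level: subcluster centroid -> cluster.
cluster_ids = nearest_code(coarse_codebook, fine_codebook[sub_ids])

tokens = list(zip(cluster_ids.tolist(), sub_ids.tolist()))
print(tokens)
```

Each action ends up with a (cluster, subcluster) pair, so nearby actions share the coarse token while remaining distinguishable at the fine level.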

cs.RO / 29 / 2604.15221

Vision-Based Safe Human-Robot Collaboration with Uncertainty Guarantees

基于视觉的安全人机协作与不确定性保证

Thumm, Jakob, Frei, Marian, Ni, Tianle, Althoff, Matthias, Pavone, Marco

Abstract

We propose a framework for vision-based human pose estimation and motion prediction that gives conformal prediction guarantees for certifiably safe human-robot collaboration. Our framework combines aleatoric uncertainty estimation with out-of-distribution (OOD) detection for high probabilistic confidence. To integrate our pipeline in certifiable safety frameworks, we propose conformal prediction sets for human motion predictions with high, valid confidence. We evaluate our pipeline on recorded human motion data and in a real-world human-robot collaboration setting.

Chinese Translation

我们提出了一种基于视觉的人体姿态估计和运动预测框架，该框架为可证明安全的人机协作提供了保形预测保证。我们的框架结合了随机不确定性估计和异常检测，以实现高概率的置信度。为了将我们的流程集成到可证明的安全框架中，我们提出了具有高有效置信度的人类运动预测的保形预测集。我们在记录的人类运动数据和真实世界的人机协作环境中评估了我们的流程。
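
As background for the conformal prediction sets mentioned in this abstract, the sketch below shows generic split conformal prediction with a Euclidean error score: a radius is calibrated on held-out prediction errors so that, under exchangeability, a ball of that radius around a new prediction contains the true position with probability at least 1 - alpha. The calibration errors are made-up numbers, and the paper's aleatoric uncertainty estimation and OOD detection are not modeled here.

```python
import numpy as np

def conformal_radius(cal_errors, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration error, or infinity if n is too small for the requested level."""
    errors = np.sort(np.asarray(cal_errors, dtype=float))
    n = errors.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return float("inf") if k > n else float(errors[k - 1])

def prediction_set_contains(predicted, actual, radius):
    """True if the actual position lies in the ball around the prediction."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.linalg.norm(diff)) <= radius

# Assumed calibration errors (metres) between predicted and true joint positions.
cal_errors = np.arange(1, 21) * 0.01
radius = conformal_radius(cal_errors, alpha=0.1)
print(radius)
```

A downstream safety layer could treat the resulting ball as the region the human joint may occupy and keep the robot outside it.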

cs.RO / 30 / 2604.15289

Abstract Sim2Real through Approximate Information States

通过近似信息状态实现Sim2Real

Deng, Yunfu, Li, Yuhao, Hanna, Josiah P.

Abstract

In recent years, reinforcement learning (RL) has shown remarkable success in robotics when a fast and accurate simulator is available for a given task. When using RL and simulation, more simulator realism is generally beneficial but becomes harder to obtain as robots are deployed in increasingly complex and wide-scale domains. In such settings, simulators will likely fail to model all relevant details of a given target task, and this observation motivates the study of sim2real with simulators that leave out key task details. In this paper, we formalize and study the abstract sim2real problem: given an abstract simulator that models a target task at a coarse level of abstraction, how can we train a policy with RL in the abstract simulator and successfully transfer it to the real world? Our first contribution is to formalize this problem using the language of state abstraction from the RL literature. This framing shows that an abstract simulator can be grounded to match the target task if the grounded abstract dynamics take the history of states into account. Based on the formalism, we then introduce a method that uses real-world task data to correct the dynamics of the abstract simulator. We then show that this method enables successful policy transfer both in sim2sim and sim2real evaluation.

Chinese Translation

近年来，当可用的快速且准确的模拟器用于特定任务时，强化学习（RL）在机器人技术领域取得了显著成功。在使用RL和模拟时，模拟器的真实性通常是有益的，但随着机器人在越来越复杂和大规模的领域中部署，这种真实性变得越来越难以获得。在这种情况下，模拟器可能无法建模给定目标任务的所有相关细节，这一观察促使我们研究省略关键任务细节的模拟器的sim2real问题。本文我们形式化并研究了抽象的sim2real问题：给定一个在粗略抽象层次上建模目标任务的抽象模拟器，我们如何在该抽象模拟器中使用RL训练策略，并成功将其转移到现实世界中？我们的第一个贡献是使用RL文献中的状态抽象语言形式化这个问题。这种框架表明，如果基础的抽象动态考虑了状态的历史，则可以将抽象模拟器与目标任务相匹配。基于这一形式化，我们随后提出了一种方法，该方法利用现实世界任务数据来修正抽象模拟器的动态。我们接着展示了该方法在sim2sim和sim2real评估中都能实现成功的策略转移。

计算机视觉 (Computer Vision)

cs.CV / 1 / 2604.14193

QualiaNet: An Experience-Before-Inference Network

QualiaNet：一种经验优先推理网络

Linton, Paul

Abstract

Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although our experience of stereo vision does not provide us with distance information, it does affect our inferences about visual scale. We propose that the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.

Chinese Translation

人类的三维视觉涉及两个不同的阶段：经验模块，在该模块中相对于注视点提取立体深度，以及推理模块，在该模块中对这种经验进行解释以估计三维场景属性。矛盾的是，尽管我们的立体视觉经验并未提供距离信息，但它确实影响我们对视觉尺度的推理。我们提出推理模块利用一种自然场景统计特征：近处场景产生生动的视差梯度，而远处场景则显得相对平坦。QualiaNet 在计算上实现了这种两阶段架构：模拟人类立体经验的视差图被传递给一个训练用于估计距离的卷积神经网络（CNN）。该网络仅通过视差梯度就能恢复距离，从而验证了这一方法。

cs.CV / 2 / 2604.14268

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

HY-World 2.0：一个用于重建、生成和模拟3D世界的多模态世界模型

HY-World, Team, Cao, Chenjie, Zuo, Xuhui, Wang, Zhenwei, Zhang, Yisu, Wu, Junta, Liu, Zhenyang, Gong, Yuning, Liu, Yang, Yuan, Bo, Zhang, Chao, Li, Coopers, Guo, Dongyuan, Yang, Fan, Zhang, Haiyu, Cao, Hang, Zhu, Jianchen, Lin, Jiaxin, Xiao, Jie, Zhang, Jihong, Yu, Junlin, Wang, Lei, Wang, Lifu, Wang, Lilin, Linus, Chen, Minghui, He, Peng, Zhao, Penghao, Chen, Qi, Chen, Rui, Shao, Rui, Liu, Sicong, Qin, Wangchen, Niu, Xiaochuan, Yuan, Xiang, Sun, Yi, Tang, Yifei, Sun, Yifu, Lian, Yihang, Tan, Yonghao, Liu, Yuhong, Yin, Yuyang, Min, Zhiyuan, Wang, Tengfei, Guo, Chunchao

Abstract

We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Also, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.

Chinese Translation

我们介绍了HY-World 2.0，这是一种多模态世界模型框架，旨在推进我们之前的项目HY-World 1.0。HY-World 2.0支持多种输入模态，包括文本提示、单视图图像、多视图图像和视频，并生成3D世界表示。通过文本或单视图图像输入，该模型进行世界生成，合成高保真、可导航的3D高斯点云（3D Gaussian Splatting, 3DGS）场景。这是通过四个阶段的方法实现的：a) 使用HY-Pano 2.0进行全景生成，b) 使用WorldNav进行轨迹规划，c) 使用WorldStereo 2.0进行世界扩展，d) 使用WorldMirror 2.0进行世界组合。具体而言，我们引入了关键创新，以增强全景的保真度，支持3D场景理解和规划，并升级我们的关键帧基础视图生成模型WorldStereo，提升其一致性记忆。我们还通过优化模型架构和学习策略，升级了WorldMirror，这是一种用于通用3D预测的前馈模型，使其能够从多视图图像或视频中重建世界。此外，我们介绍了WorldLens，一个高性能的3DGS渲染平台，具有灵活的引擎无关架构、自动环境光照明（IBL）、高效的碰撞检测和训练-渲染协同设计，支持交互式探索带有角色支持的3D世界。大量实验表明，HY-World 2.0在多个基准测试中实现了开源方法中的最先进性能，提供的结果与闭源模型Marble相当。我们发布了所有模型权重、代码和技术细节，以促进可重复性并支持对3D世界模型的进一步研究。

cs.CV / 3 / 2604.14302

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

从自由手绘草图生成几何一致的多视角场景

Bourouis, Ahmed, Ozkan, Savas, Maracani, Andrea, Song, Yi-Zhe, Ozay, Mete

Abstract

We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimisation. We address three compounding challenges (absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of $\sim$9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7$\times$ inference speedup.

Chinese Translation

我们解决了一个新问题：从单一的自由手绘草图生成几何一致的多视角场景。自由手绘草图是可以提供给多视角生成器的最几何贫乏的输入。它们通过抽象的笔触传达场景意图，同时引入空间扭曲，这与任何一致的三维解释存在冲突。之前没有方法尝试解决这个问题；现有的多视角方法需要照片或文本，而草图到三维的方法则需要多个视角或昂贵的每场景优化。我们应对了三个相互叠加的挑战：缺乏训练数据、从扭曲的二维输入中进行几何推理的需求，以及跨视角一致性，通过三个相互促进的贡献：(i) 一个约9000个草图到多视角样本的精心策划的数据集，通过自动生成和过滤管道构建；(ii) 并行相机感知注意力适配器（CA3），将几何归纳偏置注入视频变换器；以及 (iii) 从运动重建中导出的稀疏对应监督损失（CSL）。我们的框架在单一去噪过程中合成所有视角，而无需参考图像、迭代细化或每场景优化。我们的方法显著超越了最先进的两阶段基线，在现实感（FID）上提高了超过60%，在几何一致性（Corr-Acc）上提高了23%，同时提供了高达3.7倍的推理加速。

cs.CV / 4 / 2604.14314

DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

DharmaOCR：专门针对结构化OCR的小型语言模型，超越开源和商业基准

Cardoso, Gabriel Pimenta de Freitas, Chacon, Caio Lucas da Silva, Oliveira, Jonas Felipe da Fonseca, Araujo, Paulo Henrique de Medeiros

Abstract

This manuscript introduces DharmaOCR Full and Lite, a pair of specialized small language models (SSLMs) for structured OCR that jointly optimize transcription quality, generation stability, and inference cost. It also presents DharmaOCR-Benchmark, a benchmark that covers printed, handwritten, and legal/administrative documents, and proposes a unified evaluation protocol that measures fidelity and structure while explicitly tracking text degeneration as a first-class benchmark metric (alongside unit cost). Beyond reporting degeneration rates, the manuscript empirically shows that degeneration is not merely a quality failure, since it materially worsens production performance by increasing response time, reducing throughput, and inflating computational cost due to abnormally long generations. To the best of the authors' knowledge, as a methodological contribution, this is the first application of Direct Preference Optimization (DPO) to OCR, explicitly using degenerate generations as rejected examples to penalize looping behavior. Combined with Supervised Fine-Tuning (SFT) for enforcing a strict JSON schema (header, margin, footer, and text), DPO consistently reduces the degeneration rate across model families (up to 87.6% relative) while preserving or improving extraction quality. The resulting models, namely DharmaOCR Full (7B) and DharmaOCR Lite (3B), set a new state of the art on DharmaOCR-Benchmark, outperforming every evaluated open-source and commercial baseline on extraction quality, reaching scores of 0.925 and 0.911 with degeneration rates of 0.40% and 0.20%, respectively. AWQ quantization reduced per-page cost by up to 22% with negligible quality loss, enabling a strong quality-cost trade-off in comparison to proprietary OCR APIs and open-source alternatives.

Chinese Translation

本文介绍了DharmaOCR Full和Lite，这是一对专门针对结构化OCR的小型语言模型（SSLMs），旨在共同优化转录质量、生成稳定性和推理成本。还提出了DharmaOCR-Benchmark，这是一个涵盖印刷、手写以及法律/行政文件的基准，并提出了一种统一的评估协议，该协议在明确跟踪文本退化作为一项重要基准指标（与单位成本并列）时，测量保真度和结构。除了报告退化率外，本文还实证表明，退化不仅仅是质量失败，因为它通过增加响应时间、降低吞吐量以及由于异常长的生成而膨胀计算成本，实质性地恶化了生产性能。据作者所知，作为一种方法论贡献，这是首次将直接偏好优化（Direct Preference Optimization, DPO）应用于OCR，明确使用退化生成作为被拒绝的示例来惩罚循环行为。结合监督微调（Supervised Fine-Tuning, SFT）以强制执行严格的JSON模式（头部、边距、页脚和文本），DPO在不同模型系列中始终减少退化率（相对减少高达87.6%），同时保持或提高提取质量。最终模型，即DharmaOCR Full（7B）和DharmaOCR Lite（3B），在DharmaOCR-Benchmark上设定了新的最先进水平，在提取质量方面超越了每个评估的开源和商业基准模型，分别达到0.925和0.911的得分，退化率为0.40%和0.20%。AWQ量化将每页成本降低了高达22%，且几乎没有质量损失，使其在与专有OCR API和开源替代方案相比时，实现了强大的质量-成本权衡。
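
To make the role of the rejected example concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair, where the rejected sequence would be a degenerate (looping) generation. It is not the authors' training code; the function name, the `beta` default, and the log-probability values are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model. beta=0.1 is an assumed value.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical numbers: the policy has lowered the log-probability of the
# looping (rejected) transcription relative to the reference model.
loss_before = dpo_loss(-10.0, -10.0, -10.0, -10.0)
loss_after = dpo_loss(-10.0, -20.0, -10.0, -10.0)
```

The loss shrinks as the policy assigns relatively less probability to the rejected generation, which is the mechanism by which looping outputs are penalized.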

cs.CV / 5 / 2604.14329

Interpretable Human Activity Recognition for Subtle Robbery Detection in Surveillance Videos

可解释的人类活动识别用于监控视频中的微妙抢劫检测

Leyva, Bryan Jhoan Cazáres, Davila, Ulises Gachuz, Fonseca, José Juan González, Vasquez, Juan Irving, Camacho-Vázquez, Vanessa A., Garrido-Castañeda, Sergio Isahí

Abstract

Non-violent street robberies (snatch-and-run) are difficult to detect automatically because they are brief, subtle, and often indistinguishable from benign human interactions in unconstrained surveillance footage. This paper presents a hybrid, pose-driven approach for detecting snatch-and-run events that combines real-time perception with an interpretable classification stage suitable for edge deployment. The system uses a YOLO-based pose estimator to extract body keypoints for each tracked person and computes kinematic and interaction features describing hand speed, arm extension, proximity, and relative motion between an aggressor-victim pair. A Random Forest classifier is trained on these descriptors, and a temporal hysteresis filter is applied to stabilize frame-level predictions and reduce spurious alarms. We evaluate the method on a staged dataset and on a disjoint test set collected from internet videos, demonstrating promising generalization across different scenes and camera viewpoints. Finally, we implement the complete pipeline on an NVIDIA Jetson Nano and report real-time performance, supporting the feasibility of proactive, on-device robbery detection.

Chinese Translation

非暴力街头抢劫（抢夺后逃）因其短暂、微妙且常常与无害的人类互动难以区分而难以自动检测。本文提出了一种混合的、基于姿态的方法，用于检测抢夺后逃事件，该方法结合了实时感知与适合边缘部署的可解释分类阶段。该系统使用基于YOLO的姿态估计器提取每个被跟踪者的身体关键点，并计算描述手部速度、手臂伸展、接近度以及攻击者与受害者之间相对运动的运动学和交互特征。我们在这些描述符上训练了一个随机森林分类器，并应用时间滞后滤波器来稳定帧级预测并减少虚假警报。我们在一个分阶段的数据集和一个从互联网视频收集的独立测试集上评估了该方法，展示了在不同场景和摄像机视角下的良好泛化能力。最后，我们在NVIDIA Jetson Nano上实现了完整的管道，并报告了实时性能，支持主动、设备内抢劫检测的可行性。
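
The temporal hysteresis step can be sketched as follows. This is only one plausible form of such a filter, assuming two thresholds and a minimum run length; the abstract does not give the actual parameters, so `on_threshold`, `off_threshold`, and `min_frames` are hypothetical.

```python
def hysteresis_filter(probs, on_threshold=0.7, off_threshold=0.4, min_frames=3):
    """Turn noisy per-frame probabilities into a stable alarm signal.

    The alarm switches on only after `min_frames` consecutive frames at or
    above `on_threshold`, and switches off only when a frame drops below
    `off_threshold`. All three parameters are assumed values.
    """
    alarm = False
    streak = 0
    stable = []
    for p in probs:
        if not alarm:
            streak = streak + 1 if p >= on_threshold else 0
            if streak >= min_frames:
                alarm = True
        elif p < off_threshold:
            alarm = False
            streak = 0
        stable.append(alarm)
    return stable

frame_probs = [0.2, 0.9, 0.3, 0.8, 0.9, 0.8, 0.5, 0.6, 0.3, 0.2]
alarms = hysteresis_filter(frame_probs)
```

With these settings the isolated spike at the second frame is ignored, while the sustained run later in the sequence raises an alarm that persists through the mid-range frames until the probability clearly drops.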

cs.CV / 6 / 2604.14373

SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning

SatBLIP：基于卫星影像的上下文理解与特征识别的视觉-语言学习

Wu, Xue, Cao, Shengting, Gong, Jiaqi

Abstract

Rural environmental risks are shaped by place-based conditions (e.g., housing quality, road access, land-surface patterns), yet standard vulnerability indices are coarse and provide limited insight into risk contexts. We propose SatBLIP, a satellite-specific vision-language framework for rural context understanding and feature identification that predicts county-level Social Vulnerability Index (SVI). SatBLIP addresses limitations of prior remote sensing pipelines (handcrafted features, manual virtual audits, and natural-image-trained VLMs) by coupling contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics. We use GPT-4o to generate structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, greenery, and road context), then fine-tune a satellite-adapted BLIP model to generate captions for unseen images. Captions are encoded with CLIP and fused with LLM-derived embeddings via attention for SVI estimation under spatial aggregation. Using SHAP, we identify salient attributes (e.g., roof form/condition, street width, vegetation, cars/open space) that consistently drive robust predictions, enabling interpretable mapping of rural risk environments.

Chinese Translation

农村环境风险受到基于地点的条件（例如，住房质量、道路通达性、地表模式）的影响，但标准脆弱性指数粗糙且对风险背景提供的洞察有限。我们提出了SatBLIP，一个针对卫星的视觉-语言框架，用于农村上下文理解和特征识别，预测县级社会脆弱性指数（SVI）。SatBLIP通过将对比图像-文本对齐与针对卫星语义的自助式字幕生成相结合，解决了先前遥感管道的局限性——手工特征、人工虚拟审核和自然图像训练的视觉语言模型（VLM）。我们使用GPT-4o生成卫星图块的结构化描述（屋顶类型/状况、房屋大小、院子属性、绿化和道路背景），然后微调一个适应卫星的BLIP模型，以生成未见图像的字幕。字幕通过CLIP编码，并通过注意力机制与大型语言模型（LLM）派生的嵌入融合，以进行空间聚合下的SVI估计。使用SHAP，我们识别出显著属性（例如，屋顶形状/状况、街道宽度、植被、汽车/开放空间），这些属性始终驱动稳健的预测，从而实现农村风险环境的可解释映射。

cs.CV / 7 / 2604.14388

FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images

FoodSense：一个多感官食品数据集及其在从图像预测味道、气味、质地和声音方面的基准

Ishraq, Sabab, Aarushi, Aarushi, Jiang, Juncai, Chen, Chen

Abstract

Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision-language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces. A large language model generates visual justifications conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision-language benchmark model, to produce both multisensory ratings and grounded explanations directly from food images. This work connects cognitive science findings on cross-sensory perception with modern instruction tuning for multimodal models and shows that many popular evaluation metrics are insufficient for visually grounded sensory inference.

Chinese Translation

人类常常通过食品图像推断味道、气味、质地甚至声音，这是认知科学中研究得较为深入的现象。然而，之前关于食品的视觉语言研究主要集中在识别任务上，如餐点识别、成分检测和营养估计。基于图像的多感官体验预测仍然基本未被探索。我们引入了FoodSense，这是一个经过人工标注的跨感官推断数据集，包含66,842个参与者-图像对，涵盖2,987种独特的食品图像。每对数据包括四个感官维度的数值评分（1-5）和自由文本描述：味道、气味、质地和声音。为了使模型能够预测和解释感官期望，我们将简短的人类注释扩展为基于图像的推理轨迹。一个大型语言模型生成基于图像、评分和描述的视觉证明。利用这些注释，我们训练了FoodSense-VL，一个视觉语言基准模型，能够直接从食品图像中生成多感官评分和基于图像的解释。这项工作将认知科学在跨感官感知方面的发现与现代多模态模型的指令调优相结合，并表明许多流行的评估指标对于视觉感官推断来说是不够的。

cs.CV / 8 / 2604.14433

Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

零消融过高估计了DINO视觉变换器中注册内容的依赖性

Parodi, Felipe, Matelsky, Jordan, Segado, Melanie

Abstract

Zero-ablation -- replacing token activations with zero vectors -- is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to $-36.6$\,pp classification, $-30.9$\,pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls -- mean-substitution, noise-substitution, and cross-image register-shuffling -- preserve performance across classification, correspondence, and segmentation, remaining within ${\sim}1$\,pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from \texttt{[CLS]} dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.

Chinese Translation

零消融（用零向量替代标记激活）被广泛用于探测视觉变换器中的标记功能。在DINOv2+注册和DINOv3中，注册零化导致了分类（最高下降$-36.6$个百分点）和分割（最高下降$-30.9$个百分点）的大幅下降，表明注册在功能上是不可或缺的。然而，三种替代控制方法（均值替代、噪声替代和跨图像注册洗牌）在分类、对应和分割任务中保持了性能，均在未修改基线的${\sim}1$个百分点之内。每个补丁的余弦相似性表明这些替代方法确实扰动了内部表示，而零化则导致了不成比例的大扰动，这与其单独降低任务性能的原因一致。我们得出结论，零消融过高估计了对确切注册内容的依赖。在我们测试的冻结特征评估中，性能依赖于合理的类似注册的激活，而不是确切的图像特定值。尽管如此，注册仍然缓冲了来自\texttt{[CLS]}的密集特征依赖，并与压缩的补丁几何形状相关。这些发现，包括替代控制结果，在ViT-B规模下得到了重复验证。
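
The difference between zero-ablation and the three replacement controls can be illustrated on plain lists of register vectors. This is a rough sketch of what such interventions might look like, not the authors' implementation; the function names, the per-dimension Gaussian noise model, and the batch-level shuffle are assumptions.

```python
import random

def zero_ablate(registers):
    """Replace every register vector with zeros."""
    return [[0.0] * len(r) for r in registers]

def mean_substitute(registers, dataset_mean):
    """Replace every register with a dataset-level mean vector (assumed precomputed)."""
    return [list(dataset_mean) for _ in registers]

def noise_substitute(registers, dataset_mean, dataset_std, rng):
    """Replace registers with Gaussian samples matched to assumed dataset statistics."""
    return [[rng.gauss(m, s) for m, s in zip(dataset_mean, dataset_std)]
            for _ in registers]

def shuffle_across_images(batch_registers, rng):
    """Reassign each image the register set of another image in the batch."""
    order = list(range(len(batch_registers)))
    rng.shuffle(order)
    return [batch_registers[i] for i in order]

rng = random.Random(0)
regs = [[1.0, 2.0], [3.0, 4.0]]           # two registers of one image
zeroed = zero_ablate(regs)
averaged = mean_substitute(regs, [2.0, 3.0])
noisy = noise_substitute(regs, [2.0, 3.0], [0.5, 0.5], rng)
shuffled = shuffle_across_images([regs, [[5.0, 6.0], [7.0, 8.0]]], rng)
```

Only the first function pushes activations to a value the model never sees in practice; the other three keep them in a plausible range, which is the contrast the abstract draws.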

cs.CV / 9 / 2604.14449

Crowdsourcing of Real-world Image Annotation via Visual Properties

通过视觉属性进行真实世界图像标注的众包

Diao, Xiaolei, Giunchiglia, Fausto

Abstract

Recent advances in data-centric artificial intelligence highlight inherent limitations in object recognition datasets. One of the primary issues stems from the semantic gap problem, which results in complex many-to-many mappings between visual data and linguistic descriptions. This bias adversely affects performance in computer vision tasks. This paper proposes an image annotation methodology that integrates knowledge representation, natural language processing, and computer vision techniques, aiming to reduce annotator subjectivity by applying visual property constraints. We introduce an interactive crowdsourcing framework that dynamically asks questions based on a predefined object category hierarchy and annotator feedback, guiding image annotation by visual properties. Experiments demonstrate the effectiveness of this methodology, and annotator feedback is discussed to optimize the crowdsourcing setup.

Chinese Translation

近期数据驱动的人工智能进展突显了物体识别数据集的固有限制。其中一个主要问题源于语义鸿沟问题，这导致视觉数据与语言描述之间存在复杂的多对多映射。这种偏差对计算机视觉任务的性能产生了不利影响。本文提出了一种图像标注方法，整合了知识表示、自然语言处理和计算机视觉技术，旨在通过应用视觉属性约束来减少标注者的主观性。我们引入了一个互动众包框架，该框架根据预定义的对象类别层次结构和标注者反馈动态提出问题，从而通过视觉属性引导图像标注。实验结果证明了该方法的有效性，并讨论了标注者反馈以优化众包设置。

cs.CV / 10 / 2604.14506

Co-distilled attention guided masked image modeling with noisy teacher for self-supervised learning on medical images

基于噪声教师的共蒸馏注意力引导掩蔽图像建模用于医学图像的自监督学习

Jiang, Jue, Rangnekar, Aneesh, Veeraraghavan, Harini

Abstract

Masked image modeling (MIM) is a highly effective self-supervised learning (SSL) approach to extract useful feature representations from unannotated data. The predominantly used random masking methods make SSL less effective for medical images due to the contextual similarity of neighboring patches, leading to information leakage and SSL simplification. The hierarchical shifted window (Swin) transformer, a highly effective approach for medical images, cannot use advanced masking methods as it lacks a global [CLS] token. Hence, we introduce an attention-guided masking mechanism for Swin within a co-distillation learning framework to selectively mask semantically co-occurring and discriminative patches, reducing information leakage and increasing the difficulty of SSL pretraining. However, attention-guided masking inevitably reduces the diversity of attention heads, which negatively impacts downstream task performance. To address this, we integrate, for the first time, a noisy teacher into the co-distillation framework (termed DAGMaN) that performs attentive masking while preserving high attention head diversity. We demonstrate the capability of DAGMaN on multiple tasks including full- and few-shot lung nodule classification, immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering.

Chinese Translation

掩蔽图像建模（MIM）是一种高效的自监督学习（SSL）方法，用于从未标注数据中提取有用的特征表示。主要使用的随机掩蔽方法由于相邻块的上下文相似性，使得SSL在医学图像中的效果降低，导致信息泄露和SSL简化。层次移动窗口（Swin）变换器是一种对医学图像非常有效的方法，但由于缺乏全局[CLS]标记，无法使用先进的掩蔽方法。因此，我们在共蒸馏学习框架内为Swin引入了一种注意力引导的掩蔽机制，以选择性地掩蔽语义共现和具有区分性的块，从而减少信息泄露并增加SSL预训练的难度。然而，注意力引导的掩蔽不可避免地减少了注意力头的多样性，这对下游任务性能产生负面影响。为了解决这个问题，我们首次将噪声教师集成到共蒸馏框架中（称为DAGMaN），该框架在保持高注意力头多样性的同时执行注意力掩蔽。我们在多个任务上展示了DAGMaN的能力，包括全样本和少样本肺结节分类、免疫治疗结果预测、肿瘤分割和无监督器官聚类。
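
The selection step of attention-guided masking can be sketched as choosing the highest-scoring patches instead of random ones. How DAGMaN obtains per-patch attention scores from Swin without a [CLS] token is not described in the abstract, so the scores below are simply taken as given; the function name and the 50% mask ratio are assumptions.

```python
def attention_guided_mask(attention_scores, mask_ratio=0.5):
    """Return a boolean mask that hides the most-attended patches.

    `attention_scores` holds one score per patch (assumed to be provided by
    a teacher network). `mask_ratio` is an assumed hyperparameter.
    """
    n_patches = len(attention_scores)
    n_mask = int(n_patches * mask_ratio)
    # Rank patch indices from most to least attended.
    ranked = sorted(range(n_patches), key=lambda i: attention_scores[i], reverse=True)
    masked = set(ranked[:n_mask])
    return [i in masked for i in range(n_patches)]

scores = [0.10, 0.40, 0.05, 0.45]
mask = attention_guided_mask(scores)
```

Masking the two most-attended patches here forces reconstruction from the less informative ones, which is what makes the pretext task harder than random masking.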

cs.CV / 11 / 2604.14507

H2VLR: Heterogeneous Hypergraph Vision-Language Reasoning for Few-Shot Anomaly Detection

H2VLR：用于少量样本异常检测的异构超图视觉-语言推理

Huang, Jianghong, Ji, Luping, Duan, Weiwei, Ye, Mao

Abstract

As a classic vision task, anomaly detection has been widely applied in industrial inspection and medical imaging. In this task, data scarcity is a frequent issue. To address it, the few-shot anomaly detection (FSAD) scheme is attracting increasing attention. In recent years, beyond the traditional visual paradigm, Vision-Language Models (VLMs) have been extensively explored to boost this field. However, almost all existing VLM-based FSAD schemes perform anomaly inference only by pairwise feature matching, ignoring structural dependencies and global consistency. To further advance VLM-based FSAD, we propose a Heterogeneous Hypergraph Vision-Language Reasoning (H2VLR) framework. It reformulates FSAD as a high-order inference problem over visual-semantic relations by jointly modeling visual regions and semantic concepts in a unified hypergraph. Experimental comparisons verify the effectiveness and advantages of H2VLR, which often achieves state-of-the-art (SOTA) performance on representative industrial and medical benchmarks. Our code will be released upon acceptance.

Chinese Translation

作为一项经典的视觉任务，异常检测已广泛应用于工业检测和医学成像。在这一任务中，数据稀缺常常是一个面临的主要问题。为了解决这一问题，少量样本异常检测（FSAD）方案正受到越来越多的关注。近年来，超越传统视觉范式，视觉-语言模型（VLM）被广泛探索以推动这一领域的发展。然而，在目前现有的基于VLM的FSAD方案中，几乎所有方法仅通过成对特征匹配进行异常推理，忽视了结构依赖性和全局一致性。为了进一步促进通过VLM实现FSAD，我们提出了一种异构超图视觉-语言推理（H2VLR）框架。该框架将FSAD重新表述为视觉-语义关系的高阶推理问题，通过在统一的超图中共同建模视觉区域和语义概念。实验比较验证了H2VLR的有效性和优势。它在代表性的工业和医学基准上通常能够实现最先进的（SOTA）性能。我们的代码将在论文接受后发布。

cs.CV / 12 / 2604.14520

Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

模态链：从静态融合到动态编排的全模态大语言模型

Luo, Ziyang, Liu, Nian, Han, Junwei

Abstract

Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined "Direct-Decide" path for direct perception and a structured "Reason-Decide" path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.

Chinese Translation

全模态大语言模型（Omni-MLLMs）承诺实现多样感知流的统一整合。然而，最近的评估揭示了一个关键的性能悖论：单模态基线常常优于联合多模态推理。我们将这种感知脆弱性追溯到当前模型普遍采用的静态融合拓扑，识别出两种结构性病态：顺序输入中的位置偏差和交错格式中的对齐陷阱，这些问题系统性地扭曲了注意力，无论任务语义如何。为了解决这种功能刚性，我们提出了模态链（Chain of Modality，CoM），这是一个主动框架，将多模态融合从被动连接转变为动态编排。CoM自适应地编排输入拓扑，在并行、顺序和交错路径之间切换，以中和结构偏差。此外，CoM将认知执行分为两条与任务对齐的路径：一条简化的“直接决策”（Direct-Decide）路径用于直接感知，另一条结构化的“推理决策”（Reason-Decide）路径用于分析审计。在无训练或数据高效的SFT设置下，CoM在多样基准上实现了稳健且一致的泛化。

cs.CV / 13 / 2604.14526

FreqTrack: Frequency Learning based Vision Transformer for RGB-Event Object Tracking

FreqTrack：基于频率学习的视觉变换器用于RGB-事件目标跟踪

You, Jinlin, Li, Muyu, Zhao, Xudong

Abstract

Existing single-modal RGB trackers often face performance bottlenecks in complex dynamic scenes, while the introduction of event sensors offers new potential for enhancing tracking capabilities. However, most current RGB-event fusion methods, primarily designed in the spatial domain using convolutional, Transformer, or Mamba architectures, fail to fully exploit the unique temporal response and high-frequency characteristics of event data. To address this, we propose FreqTrack, a frequency-aware RGBE tracking framework that establishes complementary inter-modal correlations through frequency-domain transformations for more robust feature fusion. We design a Spectral Enhancement Transformer (SET) layer that incorporates multi-head dynamic Fourier filtering to adaptively enhance and select frequency-domain features. Additionally, we develop a Wavelet Edge Refinement (WER) module, which leverages learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, effectively improving modeling capability in high-speed and low-light scenarios. Extensive experiments on the COESOT and FE108 datasets demonstrate that FreqTrack achieves highly competitive performance, particularly attaining leading precision of 76.6% on the COESOT benchmark, validating the effectiveness of frequency-domain modeling for RGBE tracking.

Chinese Translation

现有的单模态RGB跟踪器在复杂动态场景中常常面临性能瓶颈，而事件传感器的引入为增强跟踪能力提供了新的潜力。然而，目前大多数RGB-事件融合方法主要在空间域中设计，采用卷积、变换器（Transformer）或Mamba架构，未能充分利用事件数据独特的时间响应和高频特性。为此，我们提出了FreqTrack，一种频率感知的RGBE跟踪框架，通过频域变换建立互模态的互补关联，以实现更强大的特征融合。我们设计了一个谱增强变换器（SET）层，该层结合了多头动态傅里叶滤波，能够自适应地增强和选择频域特征。此外，我们开发了一个小波边缘细化（WER）模块，利用可学习的小波变换显式提取事件数据中的多尺度边缘结构，有效提升在高速和低光照场景下的建模能力。在COESOT和FE108数据集上的大量实验表明，FreqTrack实现了高度竞争的性能，特别是在COESOT基准上达到了76.6%的领先精度，验证了频域建模在RGBE跟踪中的有效性。
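
The general idea behind Fourier filtering of features (transform, reweight each frequency, transform back) can be sketched in one dimension. This is not the SET layer itself: the gains here are fixed numbers rather than learned or input-dependent, there is a single head, and a naive DFT is used to keep the example dependency-free. All names are hypothetical.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def frequency_filter(signal, gains):
    """Scale each frequency bin by a gain, then return to the signal domain.

    In a learned layer the gains would be parameters or predicted from the
    input; here they are plain numbers. Only the real part is returned, so
    gains should be symmetric for real-valued signals.
    """
    filtered = [g * c for g, c in zip(gains, dft(signal))]
    return [v.real for v in idft(filtered)]

signal = [1.0, 2.0, 3.0, 4.0]
unchanged = frequency_filter(signal, [1.0, 1.0, 1.0, 1.0])
dc_only = frequency_filter(signal, [1.0, 0.0, 0.0, 0.0])
```

Unit gains reproduce the input, while keeping only the zero-frequency bin collapses the signal to its mean, showing how per-frequency gains select which components survive.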

cs.CV / 14 / 2604.14527

Design and Validation of a Low-Cost Smartphone Based Fluorescence Detection Platform Compared with Conventional Microplate Readers

低成本智能手机荧光检测平台的设计与验证，与传统微孔板读数仪的比较

Cao, Zhendong, Salvante, Katrina G., Parameswaran, Ash, Nepomnaschy, Pablo A., Dai, Hongji

Abstract

A low-cost fluorescence-based optical system is developed for detecting the presence of certain microorganisms and molecules within a diluted sample. A specifically designed device setup compatible with conventional 96-well plates is chosen to create an ideal environment in which a smartphone camera can be used as the optical detector. In comparison with conventional microplate reading machines such as the Perkin Elmer Victor Machine, the device presented in this paper is not equipped with expensive elements such as an exciter filter, a barrier filter, and a photomultiplier; instead, a phone camera is all that is needed to detect fluorescence within the sample. The strategy is to determine the relationship between the image color of the sample in RGB color space and the molar concentration of the fluorescent specimen in that sample. This manuscript is a preprint version of work related to a publication in IEEE. The final version may differ from this manuscript.

Chinese Translation

本文开发了一种低成本的荧光光学系统，用于检测稀释样品中某些微生物和分子的存在。选择了一种与传统96孔板兼容的专门设计的设备设置，以创造一个理想环境，使智能手机相机可以作为光学探测器。与传统的微孔板读数仪（如Perkin Elmer Victor Machine）相比，本文所提出的设备没有配备昂贵的元件，如激发滤光片、阻挡滤光片和光电倍增管；相反，只需使用手机相机即可检测样品中的荧光。所采用的策略是确定样品在RGB颜色空间中的图像颜色与该样品中荧光标本的摩尔浓度之间的关系。本文是与IEEE出版物相关工作的预印本版本，最终版本可能与本文有所不同。
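
One simple way to relate image color to concentration is a calibration curve fitted to wells of known concentration. The sketch below assumes a linear relationship with a single color channel, which the abstract does not state; the function names are hypothetical and the numbers are synthetic placeholders, not measurements from the paper.

```python
def mean_channel(pixels, channel):
    """Average one color channel (0=R, 1=G, 2=B) over a list of RGB tuples."""
    return sum(p[channel] for p in pixels) / len(pixels)

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic calibration points: concentration (arbitrary units) vs. mean green value.
concentrations = [0.0, 1.0, 2.0, 3.0]
green_values = [10.0, 30.0, 50.0, 70.0]
slope, intercept = fit_line(concentrations, green_values)

# Invert the fitted line to estimate the concentration of an unknown well.
well_pixels = [(0, 40, 0), (0, 60, 0)]
estimate = (mean_channel(well_pixels, 1) - intercept) / slope
```

In practice the response may be nonlinear or saturate at high concentrations, in which case a different curve or a combination of channels would be needed.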

cs.CV / 15 / 2604.14540

WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms

WILD-SAM：针对包裹干涉合成孔径雷达（InSAR）干涉图中滑坡检测的相位感知专家适应

Pan, Yucheng, Li, Heping, Liu, Zhangle, Hussain, Sajid, Pan, Bin

Abstract

Detecting slow-moving landslides directly from wrapped Interferometric Synthetic Aperture Radar (InSAR) interferograms is crucial for efficient geohazard monitoring, yet it remains fundamentally challenged by severe phase ambiguity and complex coherence noise. While the Segment Anything Model (SAM) offers a powerful foundation for segmentation, its direct transfer to wrapped phase data is hindered by a profound spectral domain shift, which suppresses the high-frequency fringes essential for boundary delineation. To bridge this gap, we propose WILD-SAM, a novel parameter-efficient fine-tuning framework specifically designed to adapt SAM for high-precision landslide detection on wrapped interferograms. Specifically, the architecture integrates a Phase-Aware Mixture-of-Experts (PA-MoE) Adapter into the frozen encoder to align spectral distributions and introduces a Wavelet-Guided Subband Enhancement (WGSE) strategy to generate frequency-aware dense prompts. The PA-MoE Adapter exploits a dynamic routing mechanism across heterogeneous convolutional experts to adaptively aggregate multi-scale spectral-textural priors, effectively aligning the distribution discrepancy between natural images and interferometric phase data. Meanwhile, the WGSE strategy leverages discrete wavelet transforms to explicitly disentangle high-frequency subbands and refine directional phase textures, injecting these structural cues as dense prompts to ensure topological integrity along sharp landslide boundaries. Extensive experiments on the ISSLIDE and ISSLIDE+ benchmarks demonstrate that WILD-SAM achieves state-of-the-art performance, significantly outperforming existing methods in both target completeness and contour fidelity.

Chinese Translation

从包裹的干涉合成孔径雷达（InSAR）干涉图中直接检测缓慢移动的滑坡对于高效的地质灾害监测至关重要，但由于严重的相位模糊和复杂的相干噪声，这一过程面临根本性的挑战。尽管Segment Anything Model（SAM）为分割提供了强大的基础，但其直接应用于包裹相位数据时受到深刻的频谱域转移的限制，这抑制了对边界划分至关重要的高频条纹。为了解决这一问题，我们提出了WILD-SAM，一种新颖的参数高效微调框架，专门设计用于将SAM适应于包裹干涉图上的高精度滑坡检测。具体而言，该架构将相位感知混合专家（PA-MoE）适配器集成到冻结编码器中，以对齐频谱分布，并引入小波引导的子带增强（WGSE）策略，以生成频率感知的密集提示。PA-MoE适配器利用异构卷积专家之间的动态路由机制，自适应地聚合多尺度频谱-纹理先验，有效对齐自然图像与干涉相位数据之间的分布差异。同时，WGSE策略利用离散小波变换显式解耦高频子带并细化方向相位纹理，将这些结构线索作为密集提示注入，以确保沿着锐利的滑坡边界保持拓扑完整性。在ISSLIDE和ISSLIDE+基准上的广泛实验表明，WILD-SAM实现了最先进的性能，在目标完整性和轮廓保真度方面显著超越现有方法。
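
The subband separation that a discrete wavelet transform provides can be illustrated with a one-level 2D Haar transform. This is a generic textbook sketch, not the WGSE strategy; the paper's wavelet family, number of levels, and refinement steps are not specified in the abstract, and the function name is hypothetical.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform of a 2D list with even height and width.

    Returns four subbands: the low-pass approximation, the difference
    between rows, the difference between columns, and the diagonal detail.
    """
    approx, row_diff, col_diff, diagonal = [], [], [], []
    for r in range(0, len(image), 2):
        a_row, r_row, c_row, d_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            top_left, top_right = image[r][c], image[r][c + 1]
            bot_left, bot_right = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((top_left + top_right + bot_left + bot_right) / 2)
            r_row.append((top_left + top_right - bot_left - bot_right) / 2)
            c_row.append((top_left - top_right + bot_left - bot_right) / 2)
            d_row.append((top_left - top_right - bot_left + bot_right) / 2)
        approx.append(a_row)
        row_diff.append(r_row)
        col_diff.append(c_row)
        diagonal.append(d_row)
    return approx, row_diff, col_diff, diagonal

flat = haar_dwt2([[1, 1], [1, 1]])
vertical_edge = haar_dwt2([[0, 1], [0, 1]])
```

A flat patch has energy only in the approximation band, whereas a vertical edge shows up in the column-difference band, which is why high-frequency subbands are a natural source of boundary cues.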

cs.CV / 16 / 2604.14541

Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars

赋予面孔情感：前馈单图像3D头部虚拟形象的显式情感控制

Gong, Yicheng, Zhang, Jiawei, Liu, Liqiang, Wang, Yanwen, Chu, Lei, Li, Jiahao, Pan, Hao, Zhu, Hao, Lu, Yan

Abstract

We present a framework for explicit emotion control in feed-forward, single-image 3D head avatar reconstruction. Unlike existing pipelines where emotion is implicitly entangled with geometry or appearance, we treat emotion as a first-class control signal that can be manipulated independently and consistently across identities. Our method injects emotion into existing feed-forward architectures via a dual-path modulation mechanism without modifying their core design. Geometry modulation performs emotion-conditioned normalization in the original parametric space, disentangling emotional state from speech-driven articulation, while appearance modulation captures identity-aware, emotion-dependent visual cues beyond geometry. To enable learning under this setting, we construct a time-synchronized, emotion-consistent multi-identity dataset by transferring aligned emotional dynamics across identities. Integrated into multiple state-of-the-art backbones, our framework preserves reconstruction and reenactment fidelity while enabling controllable emotion transfer, disentangled manipulation, and smooth emotion interpolation, advancing expressive and scalable 3D head avatars.

Chinese Translation

我们提出了一种用于前馈单图像3D头部虚拟形象重建的显式情感控制框架。与现有管道中情感与几何或外观隐式纠缠的情况不同，我们将情感视为一种可以独立且一致地在不同身份间操控的第一类控制信号。我们的方法通过双路径调制机制将情感注入现有的前馈架构，而无需修改其核心设计。几何调制在原始参数空间中执行基于情感的归一化，将情感状态与语音驱动的发音解耦，而外观调制则捕捉超越几何的身份感知和情感依赖的视觉线索。为了在这种设置下实现学习，我们构建了一个时间同步、情感一致的多身份数据集，通过跨身份转移对齐的情感动态来实现。我们的框架集成于多个最先进的骨干网络中，保持了重建和重现的保真度，同时实现了可控的情感转移、解耦操控和流畅的情感插值，推动了表现力和可扩展的3D头部虚拟形象的发展。

cs.CV / 17 / 2604.14556

Controllable Video Object Insertion via Multiview Priors

通过多视角先验进行可控视频对象插入

Qi, Xia, Cong, Peishan, Yao, Yichen, Wang, Ziyi, Ye, Yaoqin, Ma, Yuexin

Abstract

Video object insertion is a critical task for dynamically inserting new objects into existing environments. Previous video generation methods focus primarily on synthesizing entire scenes while struggling with ensuring consistent object appearance, spatial alignment, and temporal coherence when inserting objects into existing videos. In this paper, we propose a novel solution for Video Object Insertion, which integrates multi-view object priors to address the common challenges of appearance inconsistency and occlusion handling in dynamic environments. By lifting 2D reference images into multi-view representations and leveraging a dual-path view-consistent conditioning mechanism, our framework ensures stable identity guidance and robust integration across diverse viewpoints. A quality-aware weighting mechanism is also employed to adaptively handle noisy or imperfect inputs. Additionally, we introduce an Integration-Aware Consistency Module that guarantees spatial realism, effectively resolving occlusion and boundary artifacts while maintaining temporal continuity across frames. Experimental results show that our solution significantly improves the quality of video object insertion, providing stable and realistic integration.

Chinese Translation

视频对象插入是一个关键任务，旨在将新对象动态插入到现有环境中。以往的视频生成方法主要集中在合成整个场景，但在将对象插入现有视频时，往往难以确保对象外观的一致性、空间对齐和时间连贯性。本文提出了一种新颖的视频对象插入解决方案，结合了多视角对象先验，以应对动态环境中外观不一致和遮挡处理的常见挑战。通过将二维参考图像提升为多视角表示，并利用双路径视图一致性条件机制，我们的框架确保了稳定的身份引导和在不同视角下的强健集成。同时，我们还采用了一种质量感知加权机制，以自适应处理噪声或不完美的输入。此外，我们引入了一种集成感知一致性模块，确保空间真实感，有效解决遮挡和边界伪影，同时保持帧间的时间连续性。实验结果表明，我们的解决方案显著提高了视频对象插入的质量，提供了稳定且真实的集成效果。

cs.CV / 18 / 2604.14558

The Fourth Challenge on Image Super-Resolution ($\times$4) at NTIRE 2026: Benchmark Results and Method Overview

2026年NTIRE第四届图像超分辨率挑战赛（$\times$4）：基准结果与方法概述

Chen, Zheng, Liu, Kai, Wang, Jingkai, Yan, Xianglong, Li, Jianze, Zhang, Ziqing, Gong, Jue, Li, Jiatong, Sun, Lei, Liu, Xiaoyang, Timofte, Radu, Zhang, Yulun, Park, Jihye, Im, Yoonjin, Chun, Hyungju, Park, Hyunhee, Park, MinKyu, Xie, Zheng, Kong, Xiangyu, Yuan, Weijun, Li, Zhan, Song, Qiurong, Zhu, Luen, Zhang, Fengkai, Zhu, Xinzhe, Chen, Junyang, Wang, Congyu, Yang, Yixin, Zhou, Zhaorun, Dong, Jiangxin, Pan, Jinshan, Wang, Shengwei, Ou, Jiajie, Li, Baiang, Ma, Sizhuo, Gao, Qiang, Zhang, Jusheng, Wang, Jian, Wang, Keze, Liu, Yijiao, Chen, Yingsi, Li, Hui, Wang, Yu, Zhu, Congchao, Ahmad, Saeed, Lee, Ik Hyun, Park, Jun Young, Yoon, Ji Hwan, Yan, Kainan, Wang, Zian, Wang, Weibo, Zou, Shihao, Dong, Chao, Zhou, Wei, Li, Linfeng, Lee, Jaeseong, Chae, Jaeho, Kim, Jinwoo, Kim, Seonjoo, Hong, Yucong, Yan, Zhenming, Chen, Junye, Han, Ruize, Wang, Song, Jiang, Yuxuan, Zeng, Chengxi, Peng, Tianhao, Zhang, Fan, Bull, David, Mu, Tongyao, Cao, Qiong, Wang, Yifan, Pan, Youwei, Cao, Leilei, Peng, Xiaoping, Deng, Wei, Chen, Yifei, Xiong, Wenbo, Hu, Xian, Zhang, Yuxin, Cheng, Xiaoyun, Ji, Yang, Chen, Zonghao, Xue, Zhihao, Hu, Junqin, Kumar, Nihal, Tomar, Snehal Singh, Mueller, Klaus, Vashisth, Surya, Shaily, Prateek, Kumar, Jayant, Sharma, Hardik, Negi, Ashish, Chaudhary, Sachin, Dudhane, Akshay, Hambarde, Praful, Shukla, Amit, Shi, Shijun, Zhang, Jiangning, Liu, Yong, Hu, Kai, Xu, Jing, Zeng, Xianfang, M, Amitesh, S, Hariharan, Lee, Chia-Ming, Lin, Yu-Fan, Hsu, Chih-Chung, K, Nishalini, A, Sreenath K, Benjdira, Bilel, Ali, Anas M., Boulila, Wadii, Zheng, Shuling, Fu, Zhiheng, Zhang, Feng, Chen, Zhanglu, Yao, Boyang, Pathak, Nikhil, Jain, Aagam, Kumar, Milan, Upla, Kishor, Chavda, Vivek, S, Sarang N, Ramachandra, Raghavendra, Zhang, Zhipeng, Wang, Qi, Wang, Shiyu, Tu, Jiachen, Xu, Guoyi, Jiang, Yaoxin, Liu, Jiajia, Shi, Yaokun, Li, Yuqi, Yang, Chuanguang, Feng, Weilun, Hong, Zhuzhi, Wu, Hao, Liu, Junming, Tian, Yingli, Kulkarni, Amish Bhushan, Shet, Tejas R R, Vernekar, Saakshi M, Akalwadi, Nikhil, Mallibhat, Kaushik, Tabib, Ramesh Ashok, Mudenagudi, Uma, Pan, Yuwen, Chen, Tianrun, Ji, Deyi, Zhu, Qi, Zhu, Lanyun, Zhangyi, Heyan

Abstract

This paper presents the NTIRE 2026 image super-resolution ($\times$4) challenge, one of the associated competitions of the NTIRE 2026 Workshop at CVPR 2026. The challenge aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective super-resolution solutions and analyze recent advances in the field. To reflect the evolving objectives of image super-resolution, the challenge includes two tracks: (1) a restoration track, which emphasizes pixel-wise fidelity and ranks submissions based on PSNR; and (2) a perceptual track, which focuses on visual realism and evaluates results using a perceptual score. A total of 194 participants registered for the challenge, with 31 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and methods of participating teams. The challenge provides a unified benchmark and offers insights into current progress and future directions in image super-resolution.

Chinese Translation

本文介绍了2026年NTIRE图像超分辨率（$\times$4）挑战赛，这是2026年CVPR会议NTIRE 2026研讨会的相关比赛之一。该挑战旨在从通过$\times$4缩放因子进行双三次下采样生成的低分辨率（LR）输入中重建高分辨率（HR）图像。目标是开发有效的超分辨率解决方案，并分析该领域的最新进展。为了反映图像超分辨率不断演变的目标，挑战赛包括两个赛道：（1）恢复赛道，强调像素级的保真度，并根据PSNR对提交结果进行排名；（2）感知赛道，关注视觉真实感，并使用感知评分评估结果。共有194名参与者注册参加挑战，其中31个团队提交了有效的参赛作品。本报告总结了挑战设计、数据集、评估协议、主要结果及参与团队的方法。该挑战提供了一个统一的基准，并为当前进展和未来方向的图像超分辨率提供了见解。
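
The restoration track described above ranks submissions by PSNR. The following is a minimal sketch of that metric, assuming 8-bit pixel values passed as flat sequences. The function name `psnr` and the `max_value` parameter are illustrative, and the challenge's exact protocol (colour space, border cropping) is not stated in the abstract, so this is not the official scoring code.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a reconstruction closer to the ground truth. A real evaluation pipeline would typically compute this over image arrays rather than Python lists.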

cs.CV / 19 / 2604.14560

DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration

DVFace：用于视频人脸修复的时空双先验扩散

Chen, Zheng, Chai, Bowen, Gao, Rongjun, Nie, Mingtao, Li, Xi, Duan, Bingnan, Fang, Jianping, Liu, Xiaohong, Kong, Linghe, Zhang, Yulun

Abstract

Video face restoration aims to enhance degraded face videos into high-quality results with realistic facial details, stable identity, and temporal coherence. Recent diffusion-based methods have brought strong generative priors to restoration and enabled more realistic detail synthesis. However, existing approaches for face videos still rely heavily on generic diffusion priors and multi-step sampling, which limit both facial adaptation and inference efficiency. These limitations motivate the use of one-step diffusion for video face restoration, yet achieving faithful facial recovery alongside temporally stable outputs remains challenging. In this paper, we propose DVFace, a one-step diffusion framework for real-world video face restoration. Specifically, we introduce a spatio-temporal dual-codebook design to extract complementary spatial and temporal facial priors from degraded videos. We further propose an asymmetric spatio-temporal fusion module to inject these priors into the diffusion backbone according to their distinct roles. Evaluation on various benchmarks shows that DVFace delivers superior restoration quality, temporal consistency, and identity preservation compared to recent methods. Code: https://github.com/zhengchen1999/DVFace.

Chinese Translation

视频人脸修复旨在将退化的人脸视频增强为具有真实面部细节、稳定身份和时间一致性的高质量结果。最近的基于扩散的方法为修复带来了强大的生成先验，并使得更真实的细节合成成为可能。然而，现有的人脸视频处理方法仍然严重依赖于通用的扩散先验和多步采样，这限制了面部适应性和推理效率。这些局限性促使我们探索单步扩散在视频人脸修复中的应用，但在实现真实的面部恢复和时间稳定的输出之间仍然面临挑战。在本文中，我们提出了DVFace，一个用于现实世界视频人脸修复的单步扩散框架。具体而言，我们引入了一种时空双代码本设计，从退化视频中提取互补的空间和时间人脸先验。我们进一步提出了一种不对称的时空融合模块，根据这些先验的不同角色将其注入扩散主干。对各种基准的评估表明，DVFace在修复质量、时间一致性和身份保留方面优于最近的方法。代码：https://github.com/zhengchen1999/DVFace。

cs.CV / 20 / 2604.14563

Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

重新审视令牌压缩以加速基于ViT的稀疏多视角3D物体检测器

Ji, Mingqian, Zhang, Shanshan, Yang, Jian

Abstract

Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy. Code is available at https://github.com/Mingqj/SEPatch3D.

Chinese Translation

基于视觉变换器（ViT）的稀疏多视角3D物体检测器已经取得了显著的准确性，但由于重令牌处理仍然面临高推理延迟的问题。为了加速这些模型，令牌压缩已被广泛研究。然而，我们对现有策略的重新审视，如令牌剪枝、合并和补丁大小扩大，发现它们往往会丢弃信息丰富的背景线索，破坏上下文一致性，并丧失细粒度语义，从而对3D检测产生负面影响。为了解决这些局限性，我们提出了SEPatch3D，一个新颖的框架，动态调整补丁大小，同时保留粗补丁中的关键语义信息。具体而言，我们设计了时空感知补丁大小选择（SPSS），将小补丁分配给包含邻近物体的场景，以保留细节，而将大补丁分配给以背景为主的场景，以降低计算成本。为了进一步减轻潜在的细节损失，信息补丁选择（IPS）选择信息丰富的补丁进行特征细化，而跨粒度特征增强（CGFE）将细粒度细节注入选定的粗补丁中，丰富语义特征。在nuScenes和Argoverse 2验证集上的实验表明，SEPatch3D的推理速度比StreamPETR基线快了高达57%，并且比最先进的ToC3D-faster效率高出20%，同时保持了可比的检测准确性。代码可在https://github.com/Mingqj/SEPatch3D获取。

cs.CV / 21 / 2604.14568

Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

学习自适应推理路径以提高视觉推理效率

Huang, Yixu, Zhu, Tinghui, Chen, Muhao

Abstract

Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for any task. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50-90% while maintaining overall accuracy, especially in perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.

Chinese Translation

视觉推理模型（VRMs）最近通过将视觉感知与语言推理相结合，展现了强大的跨模态推理能力。然而，它们常常面临过度推理的问题，为任何任务生成不必要的长推理链。我们将这一问题归因于视觉推理中的推理路径冗余：许多视觉问题并不需要完整的推理过程。为了解决这个问题，我们提出了AVR，一个自适应视觉推理框架，它将视觉推理分解为三个认知功能：视觉感知、逻辑推理和答案应用。它进一步使模型能够在三种响应格式之间动态选择：完整格式、仅感知格式和直接答案。AVR使用FS-GRPO进行训练，这是一种群体相对策略优化的适应，鼓励模型在保持正确性的同时选择最有效的推理格式。在多个视觉-语言基准上的实验表明，AVR在保持整体准确性的同时，将标记使用量减少了50%至90%，尤其是在感知密集型任务中。这些结果表明，自适应视觉推理可以有效缓解VRMs中的过度推理问题。代码和数据可在以下链接获取：https://github.com/RunRiotComeOn/AVR。

cs.CV / 22 / 2604.14570

Deepfake Detection Generalization with Diffusion Noise

基于扩散噪声的深伪检测泛化

Qi, Hongyuan, Hou, Wenjin, Fan, Hehe, Xiao, Jun

Abstract

Deepfake detectors face growing challenges in generalization as new image synthesis techniques emerge. In particular, deepfakes generated by diffusion models are highly photorealistic and often evade detectors trained on GAN-based forgeries. This paper addresses the generalization problem in deepfake detection by leveraging diffusion noise characteristics. We propose an Attention-guided Noise Learning (ANL) framework that integrates a pre-trained diffusion model into the deepfake detection pipeline to guide the learning of more robust features. Specifically, our method uses the diffusion model's denoising process to expose subtle artifacts: the detector is trained to predict the noise contained in an input image at a given diffusion step, forcing it to capture discrepancies between real and synthetic images, while an attention-guided mechanism derived from the predicted noise is introduced to encourage the model to focus on globally distributed discrepancies rather than local patterns. By harnessing the frozen diffusion model's learned distribution of natural images, the ANL method acts as a form of regularization, improving the detector's generalization to unseen forgery types. Extensive experiments demonstrate that ANL significantly outperforms existing methods on multiple benchmarks, achieving state-of-the-art accuracy in detecting diffusion-generated deepfakes. Notably, the proposed framework boosts generalization performance (e.g., improving ACC/AP by a substantial margin on unseen models) without introducing additional overhead during inference. Our results highlight that diffusion noise provides a powerful signal for generalizable deepfake detection.

Chinese Translation

随着新图像合成技术的出现，深伪检测器面临着日益增长的泛化挑战。特别是，由扩散模型生成的深伪图像具有高度的照片真实感，常常能够逃避基于GAN伪造物训练的检测器。本文通过利用扩散噪声特征来解决深伪检测中的泛化问题。我们提出了一种注意力引导的噪声学习（Attention-guided Noise Learning, ANL）框架，将预训练的扩散模型集成到深伪检测流程中，以指导更稳健特征的学习。具体而言，我们的方法利用扩散模型的去噪过程来揭示细微的伪影：检测器被训练以预测给定扩散步骤下输入图像中包含的噪声，从而迫使其捕捉真实图像与合成图像之间的差异，同时引入基于预测噪声的注意力引导机制，鼓励模型关注全局分布的差异而非局部模式。通过利用冻结的扩散模型学习到的自然图像分布，ANL方法作为一种正则化形式，改善了检测器对未见伪造类型的泛化能力。大量实验表明，ANL在多个基准测试中显著优于现有方法，在检测扩散生成的深伪图像时达到最先进的准确率。值得注意的是，所提出的框架在推理过程中没有引入额外的开销，同时提升了泛化性能（例如，在未见模型上显著提高ACC/AP）。我们的结果强调了扩散噪声为可泛化的深伪检测提供了强有力的信号。

cs.CV / 23 / 2604.14574

M3D-Net: Multi-Modal 3D Facial Feature Reconstruction Network for Deepfake Detection

M3D-Net：用于深伪检测的多模态3D人脸特征重建网络

Wu, Haotian, Cheng, Yue, Bian, Shan

Abstract

With the rapid advancement of deep learning in image generation, facial forgery techniques have achieved unprecedented realism, posing serious threats to cybersecurity and information authenticity. Most existing deepfake detection approaches rely on the reconstruction of isolated facial attributes without fully exploiting the complementary nature of multi-modal feature representations. To address these challenges, this paper proposes a novel Multi-Modal 3D Facial Feature Reconstruction Network (M3D-Net) for deepfake detection. Our method leverages an end-to-end dual-stream architecture that reconstructs fine-grained facial geometry and reflectance properties from single-view RGB images via a self-supervised 3D facial reconstruction module. The network further enhances detection performance through a 3D Feature Pre-fusion Module (PFM), which adaptively adjusts multi-scale features, and a Multi-modal Fusion Module (MFM) that effectively integrates RGB and 3D-reconstructed features using attention mechanisms. Extensive experiments on multiple public datasets demonstrate that our approach achieves state-of-the-art performance in terms of detection accuracy and robustness, significantly outperforming existing methods while exhibiting strong generalization across diverse scenarios.

Chinese Translation

随着深度学习在图像生成领域的快速发展，人脸伪造技术已达到前所未有的真实感，给网络安全和信息真实性带来了严重威胁。现有的大多数深伪检测方法依赖于孤立的人脸属性重建，未能充分利用多模态特征表示的互补性。为了解决这些挑战，本文提出了一种新颖的多模态3D人脸特征重建网络（M3D-Net）用于深伪检测。我们的方法利用端到端的双流架构，通过自监督3D人脸重建模块从单视图RGB图像中重建细粒度的人脸几何形状和反射属性。该网络通过3D特征预融合模块（PFM）进一步增强检测性能，该模块自适应地调整多尺度特征，以及多模态融合模块（MFM），有效地利用注意力机制整合RGB和3D重建特征。在多个公共数据集上的广泛实验表明，我们的方法在检测准确性和鲁棒性方面达到了最先进的性能，显著优于现有方法，并在多样化场景中表现出强大的泛化能力。

cs.CV / 24 / 2604.14580

TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

TurboTalk：用于一步音频驱动的对话头像生成的渐进蒸馏

Liu, Xiangyu, Gao, Feng, Zhang, Xiaomei, Zhang, Yong, Wei, Xiaoming, Lei, Zhen, Zhu, Xiangyu

Abstract

Existing audio-driven video digital human generation models rely on multi-step denoising, resulting in substantial computational overhead that severely limits their deployment in real-world settings. While one-step distillation approaches can significantly accelerate inference, they often suffer from training instability. To address this challenge, we propose TurboTalk, a two-stage progressive distillation framework that effectively compresses a multi-step audio-driven video diffusion model into a single-step generator. We first adopt Distribution Matching Distillation to obtain a strong and stable 4-step student, and then progressively reduce the denoising steps from 4 to 1 through adversarial distillation. To ensure stable training under extreme step reduction, we introduce a progressive timestep sampling strategy and a self-compare adversarial objective, which provides an intermediate adversarial reference that stabilizes progressive distillation. Our method achieves single-step generation of talking-avatar videos, boosting inference speed by a factor of 120 while maintaining high generation quality.

Chinese Translation

现有的音频驱动视频数字人生成模型依赖于多步去噪，这导致了显著的计算开销，严重限制了它们在现实环境中的应用。虽然一步蒸馏方法可以显著加速推理，但它们通常面临训练不稳定的问题。为了解决这一挑战，我们提出了TurboTalk，一个两阶段的渐进蒸馏框架，能够有效地将多步音频驱动的视频扩散模型压缩为一个单步生成器。我们首先采用分布匹配蒸馏（Distribution Matching Distillation）来获得一个强大且稳定的4步学生模型，然后通过对抗蒸馏逐步将去噪步骤从4减少到1。为了确保在极端步骤减少下的稳定训练，我们引入了一种渐进时间步采样策略和自我比较对抗目标，这提供了一个中间对抗参考，从而稳定渐进蒸馏。我们的方法实现了视频对话头像的单步生成，推理速度提高了120倍，同时保持了高质量的生成效果。

cs.CV / 25 / 2604.14582

MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models

MapSR：基于视觉基础模型的提示驱动土地覆盖图超分辨率

Wang, Ruiqi, Yu, Qi, Ma, Jie, Wu, Hanlin

Abstract

High-resolution (HR) land-cover mapping is often constrained by the high cost of dense HR annotations. We revisit this problem from the perspective of map super-resolution, which enhances coarse low-resolution (LR) land-cover products into HR maps at the resolution of the input imagery. Existing weakly supervised methods can leverage LR labels, but they typically use them to retrain dense predictors with substantial computational cost. We propose MapSR, a prompt-driven framework that decouples supervision from model training. MapSR uses LR labels once to extract class prompts from frozen vision foundation model features through a lightweight linear probe, after which HR mapping proceeds via training-free metric inference and graph-based prediction refinement. Specifically, class prompts are estimated by aggregating high-confidence HR features identified by the linear probe, and HR predictions are obtained by cosine-similarity matching followed by graph-based propagation for spatial refinement. Experiments on the Chesapeake Bay dataset show that MapSR achieves 59.64% mIoU without any HR labels, remaining competitive with the strongest weakly supervised baseline and surpassing a fully supervised baseline. Notably, MapSR reduces trainable parameters by four orders of magnitude and shortens training time from hours to minutes, enabling scalable HR mapping under limited annotation and compute budgets. The code is available at https://github.com/rikirikirikiriki/MapSR.

Chinese Translation

高分辨率（HR）土地覆盖制图常常受到密集HR注释高成本的限制。我们从地图超分辨率的角度重新审视这一问题，该方法将粗糙的低分辨率（LR）土地覆盖产品提升为与输入影像分辨率相同的HR地图。现有的弱监督方法可以利用LR标签，但它们通常需要使用这些标签重新训练密集预测器，计算成本相当高。我们提出了MapSR，一个将监督与模型训练解耦的提示驱动框架。MapSR首次使用LR标签，通过轻量级线性探针从冻结的视觉基础模型特征中提取类别提示，随后通过无训练的度量推断和基于图的预测精细化进行HR映射。具体而言，类别提示通过聚合线性探针识别的高置信度HR特征进行估计，而HR预测则通过余弦相似度匹配和随后的基于图的传播进行空间精细化。对切萨皮克湾数据集的实验表明，MapSR在没有任何HR标签的情况下实现了59.64%的mIoU，仍然与最强的弱监督基线具有竞争力，并超越了完全监督的基线。值得注意的是，MapSR将可训练参数减少了四个数量级，并将训练时间从小时缩短到分钟，使得在有限的注释和计算预算下能够进行可扩展的HR映射。代码可在 https://github.com/rikirikirikiriki/MapSR 获取。
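
The prompt-extraction and matching steps described in the MapSR abstract can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function names, the 0.9 confidence threshold, and the plain-list feature format are hypothetical, and the graph-based refinement step is omitted.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def class_prompts(features, probe_labels, probe_conf, threshold=0.9):
    """Average the normalised features that the probe labels with high confidence.

    The threshold is a hypothetical choice; the abstract does not give one.
    """
    sums, counts = {}, {}
    for f, label, conf in zip(features, probe_labels, probe_conf):
        if conf < threshold:
            continue  # low-confidence pixels do not contribute to any prompt
        f = normalize(f)
        acc = sums.setdefault(label, [0.0] * len(f))
        for i, x in enumerate(f):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {c: normalize([x / counts[c] for x in s]) for c, s in sums.items()}

def predict(feature, prompts):
    """Assign the class whose prompt has the highest cosine similarity."""
    f = normalize(feature)
    return max(prompts, key=lambda c: sum(a * b for a, b in zip(f, prompts[c])))
```

Because both the feature and the prompts are unit-normalised, the dot product in `predict` equals the cosine similarity, so no training is needed at inference time.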

cs.CV / 26 / 2604.14591

Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

基于提示引导的图像编辑：在视觉自回归模型中使用掩码对数推导

El-Ghoussani, Amir, Hölle, Marc, Carneiro, Gustavo, Belagiannis, Vasileios

Abstract

We address the problem of prompt-guided image editing in visual autoregressive models. Given a source image and a target text prompt, we aim to modify the source image according to the target prompt, while preserving all regions which are unrelated to the requested edit. To this end, we present Masked Logit Nudging, which uses the source image token maps to introduce a guidance step that aligns the model's predictions under the target prompt with these source token maps. Specifically, we convert the fixed source encodings into logits using the VAR encoding, nudging the model's predicted logits towards the targets along a semantic trajectory defined by the source-target prompts. Edits are applied only within spatial masks obtained through a dedicated masking scheme that leverages cross-attention differences between the source and edited prompts. Then, we introduce a refinement to correct quantization errors and improve reconstruction quality. Our approach achieves the best image editing performance on the PIE benchmark at 512px and 1024px resolutions. Beyond editing, our method delivers faithful reconstructions and outperforms previous methods on COCO at 512px and OpenImages at 1024px. Overall, our method outperforms VAR-related approaches and achieves comparable or even better performance than diffusion models, while being much faster. Code is available at 'https://github.com/AmirMaEl/MLN'.

Chinese Translation

我们解决了视觉自回归模型中基于提示引导的图像编辑问题。给定源图像和目标文本提示，我们旨在根据目标提示修改源图像，同时保留与请求编辑无关的所有区域。为此，我们提出了掩码对数推导（Masked Logit Nudging），该方法利用源图像的标记映射引入一个指导步骤，使模型在目标提示下的预测与这些源标记映射对齐。具体而言，我们使用VAR编码将固定的源编码转换为对数，通过沿着源-目标提示定义的语义轨迹推动模型的预测对数朝向目标。编辑仅在通过专门的掩码方案获得的空间掩码内应用，该方案利用源提示和编辑提示之间的交叉注意力差异。然后，我们引入了一种精细化方法来纠正量化误差并提高重建质量。我们的方法在512px和1024px分辨率的PIE基准测试中实现了最佳的图像编辑性能。除了编辑外，我们的方法还提供了忠实的重建，并在512px的COCO和1024px的OpenImages上超越了之前的方法。总体而言，我们的方法优于与VAR相关的方法，并且在性能上与扩散模型相当，甚至更好，同时速度更快。代码可在'https://github.com/AmirMaEl/MLN'获取。

cs.CV / 27 / 2604.14605

Towards Design Compositing

面向设计合成

Mahajan, Abhinav, Tripathy, Abhikhya, Pala, Sudeeksha Reddy, Methi, Vaibhav, Joseph, K J, Srinivasan, Balaji Vasan

Abstract

Graphic design creation involves harmoniously assembling multimodal components such as images, text, logos, and other visual assets collected from diverse sources, into a visually-appealing and cohesive design. Recent methods have largely focused on layout prediction or complementary element generation, while retaining input elements exactly, implicitly assuming that provided components are already stylistically harmonious. In practice, inputs often come from disparate sources and exhibit visual mismatch, making this assumption limiting. We argue that identity-preserving stylization and compositing of input elements is a critical missing ingredient for truly harmonized components-to-design pipelines. To this end, we propose GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation, and can be plugged into any existing components-to-design or design-refining pipeline without modification. We demonstrate this by integrating GIST with two substantially different existing methods, LaDeCo and Design-o-meter. GIST shows significant improvements in visual harmony and aesthetic quality across both pipelines, as validated by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference over naive pasting. Project Page: abhinav-mahajan10.github.io/GIST/.

Chinese Translation

图形设计创作涉及将来自不同来源的多模态组件（如图像、文本、标志和其他视觉资产）和谐地组合成一个视觉上吸引人且连贯的设计。最近的方法主要集中在布局预测或补充元素生成上，同时严格保留输入元素，隐含假设提供的组件在风格上已经和谐。然而，在实际操作中，输入通常来自不同的来源，并且存在视觉不匹配，使得这一假设显得有限。我们认为，保持身份的风格化和输入元素的合成是实现真正和谐的组件到设计管道的关键缺失成分。为此，我们提出了GIST，一种无训练、保持身份的图像合成器，它位于布局预测和排版生成之间，可以无缝集成到任何现有的组件到设计或设计优化管道中，而无需修改。我们通过将GIST与两种截然不同的现有方法LaDeCo和Design-o-meter整合来证明这一点。GIST在视觉和谐美感质量方面在这两条管道中均显示出显著改善，经过LLaVA-OV和GPT-4V的方面评分和成对偏好验证，优于简单粘贴。项目页面：abhinav-mahajan10.github.io/GIST/

cs.CV / 28 / 2604.14622

Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening

多粒度语义原型扫描与三标记提示学习结合高阶RWKV用于全色锐化

Li, Junfeng, Zhou, Wenyang, Li, Xueheng, He, Xuanhua, Gan, Jianhou, Ren, Wenqi

Abstract

In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering. Specifically, our method contains three key components: 1) Multigrain-aware Semantic Prototype Scanning. Although RWKV offers an efficient linear-complexity alternative to Transformers, its conventional bidirectional raster scanning is still semantic-agnostic and prone to positional bias. To address this issue, we introduce a semantic-driven scanning strategy that leverages locality-sensitive hashing to group semantically related regions and construct multi-grain semantic prototypes, enabling context-aware token reordering and more coherent global interaction. 2) Tri-token Prompt Learning. We design a tri-token prompting mechanism consisting of a global token, cluster-derived prototype tokens, and a learnable register token. The global and prototype tokens provide complementary semantic priors for RWKV modeling, while the register token helps suppress noisy and artifact-prone intermediate representations. 3) Invertible Q-Shift. To counteract the loss of spatial details, we apply center difference convolution on the value pathway to inject high-frequency information, and introduce an invertible multi-scale Q-shift operation for efficient and lossless feature transformation without parameter-heavy receptive field expansion. Experimental results demonstrate the superiority of our method.

Chinese Translation

在本研究中，我们提出了一种基于高阶RWKV架构和源自语义聚类的三标记提示机制的多粒度语义原型扫描范式，用于全色锐化。具体而言，我们的方法包含三个关键组件：1）多粒度语义原型扫描。尽管RWKV提供了一种高效的线性复杂度替代方案来取代Transformers，但其传统的双向栅格扫描仍然是语义无关的，并且容易受到位置偏差的影响。为了解决这个问题，我们引入了一种语义驱动的扫描策略，利用局部敏感哈希将语义相关区域分组，并构建多粒度语义原型，从而实现上下文感知的标记重排序和更连贯的全局交互。2）三标记提示学习。我们设计了一种由全局标记、聚类派生的原型标记和可学习的寄存器标记组成的三标记提示机制。全局标记和原型标记为RWKV建模提供了互补的语义先验，而寄存器标记则有助于抑制噪声和易受伪影影响的中间表示。3）可逆Q-移位。为了抵消空间细节，我们在值路径上应用中心差分卷积以注入高频信息，并引入可逆的多尺度Q-移位操作，以实现高效且无损的特征变换，而无需参数密集的感受野扩展。实验结果证明了我们方法的优越性。
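
The semantic-driven scanning idea above, grouping related tokens with locality-sensitive hashing and then reordering them, can be sketched as follows. This uses random-hyperplane hashing, one common LSH variant chosen here as an assumption; the abstract does not say which hash family the authors use, and `lsh_scan_order`, `num_planes`, and `seed` are hypothetical names.

```python
import random

def lsh_scan_order(tokens, num_planes=4, seed=0):
    """Return token indices reordered so tokens sharing a hash bucket are adjacent."""
    rng = random.Random(seed)
    dim = len(tokens[0])
    # Random hyperplanes through the origin; each contributes one sign bit.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

    def bucket(tok):
        bits = 0
        for p in planes:
            side = sum(a * b for a, b in zip(tok, p)) >= 0
            bits = (bits << 1) | int(side)
        return bits

    keys = [bucket(t) for t in tokens]
    # A stable sort keeps the original raster order inside each bucket.
    return sorted(range(len(tokens)), key=lambda i: keys[i])
```

Tokens pointing in similar directions tend to fall in the same bucket, so the resulting order replaces a purely positional raster scan with one that visits similar regions consecutively.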

cs.CV / 29 / 2604.14629

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

Switch-KD：用于视觉-语言模型的视觉切换知识蒸馏

Sun, Haoyi, Wang, Xiaoxiao, Mao, Ning, Wang, Qian, Mu, Lifu, Zheng, Wen, Wei, Tao, Chen, Wei

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.

Chinese Translation

视觉-语言模型（VLMs）在视觉-语言联合理解方面展现了显著的能力，但其大规模在资源受限的场景中部署时面临重大挑战。知识蒸馏（KD）提供了一种可行的方法，可以在不增加模型规模或数据需求的情况下提升模型能力，从而使部署更加高效。然而，将KD应用于VLMs时面临特定于模态的监督挑战：尽管VLMs中的多模态知识在语言空间中融合，但当前的方法是分别对每个模态进行监督，而没有明确解决多模态对齐，从而导致不一致的多模态知识转移。为了解决这一问题，我们提出了Switch-KD，一种视觉切换蒸馏框架，它在共享的文本概率空间中统一了视觉-语言知识转移。Switch-KD包括两个关键组件：（1）视觉切换蒸馏（Visual-Switch Distillation），它将学生的视觉输出切换到教师的语言路径中，以构建跨模态概率参考，从而实现隐式视觉知识转移；（2）动态双向逻辑差异（Dynamic Bi-directional Logits Difference，DBiLD）损失，它通过双向监督自适应地对齐信息概率区域，同时保持教师和学生的分布结构。在Switch-KD的指导下，0.5B的TinyLLaVA有效地从其3B教师中蒸馏出丰富的多模态知识，在10个多模态基准测试中平均提高了3.6分，而没有进行任何架构修改。

cs.CV / 30 / 2604.14630

CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

CMTM：用于无监督视频目标分割的跨模态标记调制

Jeon, Inseok, Cho, Suhwan, Lee, Minhyeok, Lee, Seunghoon, Kang, Minseok, Lee, Jungho, Park, Chaewon, Kim, Donghyeong, Lee, Sangyoun

Abstract

Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.

Chinese Translation

最近在无监督视频目标分割方面的进展突显了集成外观和运动线索的双流架构的潜力。然而，充分利用这些互补信息源需要有效建模它们之间的相互依赖关系。本文提出了一种跨模态标记调制的新方法，旨在增强外观和运动线索之间的互动。我们的方法在每种模态的标记之间建立了密集连接，使得通过关系变换器块实现高效的模态内和模态间信息传播。为了提高学习效率，我们引入了一种标记遮蔽策略，以解决仅依赖于增加模型复杂性所带来的局限性。我们的方法在所有公共基准测试中实现了最先进的性能，超越了现有方法。

cs.CV / 31 / 2604.14632

High-Speed Full-Color HDR Imaging via Unwrapping Modulo-Encoded Spike Streams

通过解包模编码脉冲流实现高速全彩HDR成像

Zhou, Chu, Yang, Siqi, Zhang, Kailong, Guo, Heng, Yu, Zhaofei, Shi, Boxin, Sato, Imari

Abstract

Conventional RGB-based high dynamic range (HDR) imaging faces a fundamental trade-off between motion artifacts in multi-exposure captures and irreversible information loss in single-shot techniques. Modulo sensors offer a promising alternative by encoding theoretically unbounded dynamic range into wrapped measurements. However, existing modulo solutions remain bottlenecked by iterative unwrapping overhead and hardware constraints limiting them to low-speed, grayscale capture. In this work, we present a complete modulo-based HDR imaging system that enables high-speed, full-color HDR acquisition by synergistically advancing both the sensing formulation and the unwrapping algorithm. At the core of our approach is an exposure-decoupled formulation of modulo imaging that allows multiple measurements to be interleaved in time, preserving a clean, observation-wise measurement model. Building upon this, we introduce an iteration-free unwrapping algorithm that integrates diffusion-based generative priors with the physical least absolute remainder property of modulo images, supporting highly efficient, physics-consistent HDR reconstruction. Finally, to validate the practical viability of our system, we demonstrate a proof-of-concept hardware implementation based on modulo-encoded spike streams. This setup preserves the native high temporal resolution of spike cameras, achieving 1000 FPS full-color imaging while reducing output data bandwidth from approximately 20 Gbps to 6 Gbps. Extensive evaluations indicate that our coordinated approach successfully overcomes key systemic bottlenecks, demonstrating the feasibility of deploying modulo imaging in dynamic scenarios.

Chinese Translation

传统的基于RGB的高动态范围（HDR）成像在多次曝光捕获中的运动伪影与单次拍摄技术中的不可逆信息损失之间面临根本性的权衡。模传感器通过将理论上无限的动态范围编码为包裹测量，提供了一种有前景的替代方案。然而，现有的模解决方案仍然受到迭代解包开销和硬件限制的瓶颈，限制其仅能进行低速的灰度捕获。在本研究中，我们提出了一种完整的基于模的HDR成像系统，通过协同推进传感公式和解包算法，实现了高速全彩HDR采集。我们方法的核心是曝光解耦的模成像公式，允许多个测量在时间上交错，从而保持干净的观察性测量模型。在此基础上，我们引入了一种无迭代的解包算法，该算法将基于扩散的生成先验与模图像的物理最小绝对余量特性结合，支持高效且物理一致的HDR重建。最后，为了验证我们系统的实际可行性，我们展示了基于模编码脉冲流的概念验证硬件实现。该设置保持了脉冲相机的原生高时间分辨率，实现了1000 FPS的全彩成像，同时将输出数据带宽从约20 Gbps降低到6 Gbps。广泛的评估表明，我们的协调方法成功克服了关键的系统瓶颈，展示了在动态场景中部署模成像的可行性。
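
As background for the modulo-sensing idea above, the sketch below shows how a wrapped measurement relates to the underlying intensity. It assumes an idealised integer sensor with a hypothetical `full_well` of 256 counts. Recovering the rollover count from wrapped data alone is the hard part that the paper addresses with its unwrapping algorithm, and that is not reproduced here.

```python
def wrap(intensity, full_well=256):
    """Modulo-encode a value: the pixel resets each time it reaches full_well."""
    return intensity % full_well

def rollovers(intensity, full_well=256):
    """Number of resets that occurred; a real modulo sensor does not record this."""
    return intensity // full_well

def unwrap(wrapped, k, full_well=256):
    """Reconstruct the intensity once the rollover count k has been estimated."""
    return wrapped + k * full_well
```

For example, an intensity of 1000 wraps to 232 after 3 rollovers, and `unwrap(232, 3)` restores 1000. Separately, the reported drop in output bandwidth from about 20 Gbps to 6 Gbps corresponds to a reduction of roughly 70%.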

cs.CV / 32 / 2604.14643

Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification

物理诱导的气候对抗扰动：增强遥感图像分类的可迁移性和鲁棒性

Zhuang, Weiwei, Xie, Wangze, Zhang, Qi, Du, Xia, Lin, Zihan, Lin, Zheng, Cai, Hanlin, Zhou, Jizhe, Fang, Zihan, Pun, Chi-man, Ni, Wei, Luo, Jun

Abstract

Adversarial attacks pose a severe threat to the reliability of deep learning models in remote sensing (RS) image classification. Most existing methods rely on direct pixel-wise perturbations, failing to exploit the inherent atmospheric characteristics of RS imagery or survive real-world image degradations. In this paper, we propose FogFool, a physically plausible adversarial framework that generates fog-based perturbations by iteratively optimizing atmospheric patterns based on Perlin noise. By modeling fog formations with natural, irregular structures, FogFool generates adversarial examples that are not only visually consistent with authentic RS scenes but also deceptive. By leveraging the spatial coherence and mid-to-low-frequency nature of atmospheric phenomena, FogFool embeds adversarial information into structural features shared across diverse architectures. Extensive experiments on two benchmark RS datasets demonstrate that FogFool achieves superior performance: not only does it excel in white-box settings, but it also exhibits exceptional black-box transferability (reaching 83.74% TASR) and robustness against common preprocessing-based defenses such as JPEG compression and filtering. Detailed analyses, including confusion matrices and Class Activation Map (CAM) visualizations, reveal that our atmospheric-driven perturbations induce a universal shift in model attention. These results indicate that FogFool represents a practical, stealthy, and highly persistent threat to RS classification systems, providing a robust benchmark for evaluating model reliability in complex environments.

Chinese Translation

对抗攻击对深度学习模型在遥感（RS）图像分类中的可靠性构成了严重威胁。现有大多数方法依赖于直接的像素级扰动，既未能利用遥感图像固有的大气特征，也难以在真实世界的图像退化中保持有效。在本文中，我们提出了FogFool，一个物理上合理的对抗框架，通过迭代优化基于Perlin噪声的大气模式来生成基于雾的扰动。通过对雾形成进行自然、不规则结构的建模，FogFool生成的对抗样本不仅在视觉上与真实的遥感场景一致，而且具有欺骗性。通过利用大气现象的空间一致性和中低频特性，FogFool将对抗信息嵌入到跨多种架构共享的结构特征中。在两个基准遥感数据集上的大量实验表明，FogFool在性能上表现优越：不仅在白盒设置中表现出色，而且在黑盒迁移性方面也表现突出（TASR达到83.74%），并且对常见的基于预处理的防御（如JPEG压缩和滤波）具有鲁棒性。详细分析，包括混淆矩阵和类激活图（CAM）可视化，揭示我们的大气驱动扰动引发了模型注意力的普遍转变。这些结果表明，FogFool代表了对遥感分类系统的一个实际、隐蔽且高度持久的威胁，为在复杂环境中评估模型可靠性提供了一个稳健的基准。

cs.CV / 33 / 2604.14645

Chaotic CNN for Limited Data Image Classification

基于混沌的卷积神经网络用于有限数据图像分类

M, Anusree, Henry, Akhila, Nair, Pramod P

Abstract

Convolutional neural networks (CNNs) often exhibit poor generalisation in limited training data scenarios due to overfitting and insufficient feature diversity. In this work, a simple and effective chaos-based feature transformation is proposed to enhance CNN performance without increasing model complexity. The method applies nonlinear transformations using logistic, skew tent, and sine maps to normalised feature vectors before the classification layer, thereby reshaping the feature space and improving class separability. The approach is evaluated on greyscale datasets (MNIST and Fashion-MNIST) and an RGB dataset (CIFAR-10) using CNN architectures of varying depth under limited data conditions. The results show consistent improvement over the standalone (SA) CNN across all datasets. Notably, a maximum performance gain of 5.43% is achieved on MNIST using the skew tent map with a 3-layer CNN at 40 samples per class. A higher gain of 9.11% is observed on Fashion-MNIST using the sine map with a 3-layer CNN at 50 samples per class. Additionally, a strong gain of 7.47% is obtained on CIFAR-10 using the skew tent map at 200 samples per class. The consistent improvements across different chaotic maps indicate that the performance gain is driven by the shared nonlinear and dynamical properties of chaotic systems. The proposed method is computationally efficient, requires no additional trainable parameters, and can be easily integrated into existing CNN architectures, making it a practical solution for data-scarce image classification tasks.

Chinese Translation

卷积神经网络（CNN）在有限训练数据场景下常常表现出较差的泛化能力，原因在于过拟合和特征多样性不足。本文提出了一种简单有效的基于混沌的特征变换方法，以增强CNN性能而不增加模型复杂性。该方法在分类层之前对归一化特征向量应用非线性变换，使用逻辑映射（logistic）、偏斜帐篷映射（skew tent）和正弦映射（sine）来重塑特征空间，从而提高类别可分性。该方法在有限数据条件下，使用不同深度的CNN架构对灰度数据集（MNIST和Fashion-MNIST）以及RGB数据集（CIFAR-10）进行了评估。结果显示，在所有数据集上，相较于独立卷积神经网络（SA CNN），均有一致的性能提升。特别是在MNIST数据集上，使用偏斜帐篷映射和3层CNN在每类40个样本时，最大性能提升达5.43%。在Fashion-MNIST数据集上，使用正弦映射和3层CNN在每类50个样本时，观察到更高的9.11%的提升。此外，在CIFAR-10数据集上，使用偏斜帐篷映射在每类200个样本时获得了7.47%的显著提升。不同混沌映射下的一致性改进表明，性能提升源于混沌系统共享的非线性和动态特性。所提出的方法计算效率高，无需额外的可训练参数，并且可以轻松集成到现有的CNN架构中，成为数据稀缺图像分类任务的实用解决方案。
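
The transformation described in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the map parameters (`r = 4.0`, `b = 0.499`), the min-max normalisation, and the single iteration are assumptions, and `features` stands in for a hypothetical penultimate-layer activation vector.

```python
import math

def logistic_map(x, r=4.0):
    """Logistic map x -> r * x * (1 - x); r = 4 is the fully chaotic regime."""
    return r * x * (1.0 - x)

def skew_tent_map(x, b=0.499):
    """Skew tent map with its peak at b (0 < b < 1); b is an assumed value."""
    return x / b if x < b else (1.0 - x) / (1.0 - b)

def sine_map(x):
    """Sine map x -> sin(pi * x), which sends [0, 1] onto [0, 1]."""
    return math.sin(math.pi * x)

def min_max_normalise(features):
    """Rescale a feature vector to [0, 1]; a constant vector maps to zeros."""
    lo, hi = min(features), max(features)
    if hi == lo:
        return [0.0 for _ in features]
    return [(f - lo) / (hi - lo) for f in features]

def chaotic_transform(features, chaotic_map, iterations=1):
    """Apply a chaotic map element-wise to the normalised feature vector."""
    out = min_max_normalise(features)
    for _ in range(iterations):
        out = [chaotic_map(v) for v in out]
    return out

# Hypothetical activations taken just before the classification layer.
features = [-1.0, 0.0, 1.0, 3.0]
for name, fn in [("logistic", logistic_map),
                 ("skew tent", skew_tent_map),
                 ("sine", sine_map)]:
    print(name, chaotic_transform(features, fn))
```

The transform has no trainable parameters, which matches the abstract's claim that model complexity is unchanged; the transformed vector would simply replace the original one as input to the classifier.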

cs.CV / 34 / 2604.14648

Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting

从已见到未见：保持已见，生成未见的视频外扩展

Jeon, Inseok, Lee, Minhyeok, Lee, Seunghoon, Kang, Minseok, Cho, Suhwan, Lee, Sangyoun

Abstract

Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.

Chinese Translation

视频外扩展旨在在保留空间保真度和帧间时间一致性的情况下，扩展视频的可见内容超出原始帧边界。现有方法主要依赖于大规模生成模型，如扩散模型。然而，基于生成的方法在隐式时间建模和有限空间上下文方面存在不足。这些限制导致帧内和帧间不一致性，特别是在动态场景和大规模外扩展场景中尤为明显。为了解决这些挑战，我们提出了Seen-to-Scene，一个将基于传播和基于生成的范式统一的视频外扩展新框架。具体而言，Seen-to-Scene利用基于流的传播，结合为视频修复预训练的流完成网络，并以端到端的方式进行微调，以弥合领域差距并重建一致的运动场。为了进一步提高传播的效率和可靠性，我们引入了一种参考引导的潜在传播，有效地在帧之间传播源内容。大量实验表明，我们的方法在高效推理的情况下实现了优越的时间一致性和视觉真实感，甚至超过了需要特定输入适应的先前最先进的方法。

cs.CV / 35 / 2604.14684

DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

DETR-ViP：具有鲁棒性区分视觉提示的检测变换器

Qian, Bo, Shi, Dahu, Wei, Xing

Abstract

Visual-prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual-prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text-prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual-prompted detection than other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.

Chinese Translation

视觉提示的目标检测实现了对目标类别的交互式和灵活定义，从而促进了开放词汇检测。由于视觉提示直接源自图像特征，它们在识别稀有类别方面通常优于文本提示。然而，关于视觉提示检测的研究在很大程度上被忽视，通常被视为训练文本提示检测器的副产品，这阻碍了其发展。为了充分释放视觉提示检测的潜力，我们调查了其性能不佳的原因，并揭示了根本问题在于视觉提示缺乏全局可区分性。基于这些观察，我们提出了DETR-ViP，一个鲁棒的目标检测框架，能够生成类别可区分的视觉提示。在基本的图像-文本对比学习基础上，DETR-ViP结合了全局提示集成和视觉-文本提示关系蒸馏，以学习更具区分性的提示表示。此外，DETR-ViP采用选择性融合策略，确保稳定和鲁棒的检测。在COCO、LVIS、ODinW和Roboflow100上的大量实验表明，DETR-ViP在视觉提示检测方面的性能显著高于其他最先进的对手。一系列消融研究和分析进一步验证了所提改进的有效性，并阐明了视觉提示增强检测能力的潜在原因。

cs.CV / 36 / 2604.14692

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

链式瞬间：基于搜索引导的渐进式对象基础推理用于视频理解

Wu, Zhixuan, Zha, Quanxing, Wang, Teng, Xu, Genbao, Gu, Wenyuan, Rao, Wei, Ma, Nan, Cheng, Bo, Poria, Soujanya

Abstract

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

Chinese Translation

视频理解需要在帧之间识别和推理语义上具有区分性的视觉对象，但现有的与对象无关的解决方案在有效处理随时间变化的显著对象时面临困难。为了解决这个问题，我们提出了链式瞬间（Chain-of-Glimpse），一种基于搜索引导的渐进式对象基础推理框架，该框架明确将每个推理步骤锚定到特定的视觉证据区域，从而实现组合和多步骤决策。形式上，链式瞬间将视频推理公式化为一个逐步过程，该过程逐渐围绕任务相关的视觉对象构建空间基础的痕迹，从而减轻对显著性驱动线索的过度依赖。具体而言，链式瞬间具有一个基于搜索引导的控制器，通过强化学习进行优化，采用一种显著激励基础能力的格式奖励，迭代地固定视觉证据区域并形成可靠的推理轨迹，从而产生准确且可解释的多步骤决策。在领域内的 NExTQA 以及领域外的 Video-Holmes、CG-Bench Reasoning 和 VRBench 基准上进行的广泛评估表明，链式瞬间在多种视频推理任务中展现出一致的性能提升、鲁棒性和泛化能力。

cs.CV / 37 / 2604.14703

The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment

像素的法庭审判：通过对抗证据和强化学习判断实现鲁棒的图像篡改定位

Li, Songlin, Guo, Zhiqing, Ma, Dan, Miao, Changtao, Yang, Gaobo

Abstract

Although some existing image manipulation localization (IML) methods incorporate authenticity-related supervision, this information is typically utilized merely as an auxiliary training signal to enhance the model's sensitivity to manipulation artifacts, rather than being explicitly modeled as localization evidence opposing the manipulated regions. Consequently, when manipulation traces are subtle or degraded by post-processing and noise, these methods struggle to explicitly compare manipulated and authentic evidence, resulting in unreliable predictions in ambiguous areas. To address these issues, we propose a courtroom-style adjudication framework that regards the IML task as a confrontation of evidence followed by judgment. The framework comprises a prosecution stream, a defense stream, and a judge model. We first build a dual-hypothesis segmentation architecture on a shared multi-scale encoder, in which the prosecution stream asserts manipulation and the defense stream asserts authenticity. Guided by edge priors, it produces evidence for manipulated and authentic regions through cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. We further develop a reinforcement learning judge model that performs strategic re-inference and refinement on uncertain regions, yielding a manipulated-region mask. The judge model is trained with advantage-based rewards and a soft-IoU objective, and reliability is calibrated via entropy and cross-hypothesis consistency. Experimental results show that our model achieves superior average performance compared with SOTA IML methods.

Chinese Translation

尽管一些现有的图像篡改定位（IML）方法融入了与真实性相关的监督信息，但这些信息通常仅作为辅助训练信号，用于增强模型对篡改伪影的敏感性，而不是被明确建模为与篡改区域对立的定位证据。因此，当篡改痕迹微妙或受到后处理和噪声影响时，这些方法难以明确比较篡改证据和真实证据，导致在模糊区域的预测不可靠。为了解决这些问题，我们提出了一种法庭风格的裁决框架，将IML任务视为证据的对抗与随后的判断。该框架包括起诉流、防御流和法官模型。我们首先在共享的多尺度编码器上构建了一个双假设分割架构，其中起诉流主张篡改，而防御流主张真实性。在边缘先验的指导下，通过级联多级融合、双向不一致抑制和动态辩论精炼，生成篡改和真实区域的证据。我们进一步开发了一种强化学习法官模型，对不确定区域进行战略再推理和精炼，从而生成篡改区域的掩码。法官模型通过基于优势的奖励和软IoU目标进行训练，并通过熵和交叉假设一致性进行可靠性校准。实验结果表明，我们的模型在平均性能上优于现有的最先进IML方法。
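
The abstract names a soft-IoU objective without defining it. The sketch below shows the commonly used definition, in which probabilities replace hard mask values in the intersection and union; whether the authors use exactly this form is an assumption, and the flattened-list mask representation and the `eps` smoothing term are illustrative choices.

```python
def soft_iou(pred, target, eps=1e-7):
    """Differentiable IoU between a probability mask and a binary mask.

    pred   : per-pixel manipulation probabilities in [0, 1] (flattened)
    target : per-pixel ground-truth labels, 0 or 1 (flattened)
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - intersection
    return (intersection + eps) / (union + eps)

def soft_iou_loss(pred, target):
    """Loss form: 0 for a perfect mask, approaching 1 for no overlap."""
    return 1.0 - soft_iou(pred, target)

# Hypothetical 4-pixel example: the prediction is unsure about pixel 1.
prediction = [0.9, 0.5, 0.1, 0.0]
ground_truth = [1, 1, 0, 0]
print(soft_iou(prediction, ground_truth), soft_iou_loss(prediction, ground_truth))
```

Because the score is a smooth function of the predicted probabilities, it can serve directly as a training objective, unlike the thresholded IoU used for evaluation.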

cs.CV / 38 / 2604.14706

NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation

NG-GS：基于NeRF的3D高斯点云分割

He, Yi, Wang, Tao, Jin, Yi, Lang, Congyan, Li, Yidong, Ling, Haibin

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS, LERF-OVS, and ScanNet benchmarks demonstrate that our method achieves state-of-the-art performance, with significant gains in boundary mIoU. Code is available at https://github.com/BJTU-KD3D/NG-GS.

Chinese Translation

最近在3D高斯点云（3DGS）方面的进展使得高效且逼真的新视角合成成为可能。然而，由于高斯表示的离散特性，在3DGS中准确分割物体仍然具有挑战性，这常常导致物体边界处的混叠和伪影。在本文中，我们提出了NG-GS，一个新颖的高质量物体分割框架，专门解决边界离散化问题。我们的方法首先通过掩膜方差分析自动识别物体边界处的模糊高斯。然后，我们应用径向基函数（RBF）插值构建一个空间连续的特征场，并通过多分辨率哈希编码增强其高效的多尺度表示。一个联合优化策略通过对齐和空间连续性损失将3DGS与轻量级的NeRF模块对齐，确保平滑且一致的分割边界。在NVOS、LERF-OVS和ScanNet基准上的大量实验表明，我们的方法在边界mIoU方面取得了最先进的性能，显著提升了分割效果。代码可在 https://github.com/BJTU-KD3D/NG-GS 获取。

cs.CV / 39 / 2604.14710

G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

G-MIXER：基于测地线混合的隐式语义扩展与显式语义重排序用于零样本组合图像检索

Lim, Jiyoung, Yang, Heejae, Lee, Jee-Hyong

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at https://github.com/maya0395/gmixer.

Chinese Translation

组合图像检索（CIR）旨在通过将参考图像与相应的修改文本结合来检索目标图像。CIR需要共同考虑查询中指定的显式语义和其双模态组合中嵌入的隐式语义。最近的无训练零样本CIR（ZS-CIR）方法利用多模态大型语言模型（MLLMs）生成详细的目标描述，将隐式信息转换为显式文本表达。然而，这些方法过于依赖文本模态，未能捕捉需要考虑多样候选组合的模糊检索特性。这导致检索结果的多样性和准确性降低。为了解决这一局限性，我们提出了一种新颖的无训练方法，基于测地线混合的隐式语义扩展与显式语义重排序的ZS-CIR（G-MIXER）。G-MIXER通过在一系列混合比例上进行测地线混合，构建反映参考图像-文本对隐式语义的组合查询特征，并建立多样的候选集。然后，使用来自MLLMs的显式语义对生成的候选进行重排序，从而提高检索的多样性和准确性。我们提出的G-MIXER在多个ZS-CIR基准测试中实现了最先进的性能，有效处理隐式和显式语义而无需额外训练。我们的代码将发布在 https://github.com/maya0395/gmixer。
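
To make the geodesic mixup step concrete, here is a minimal sketch that assumes it amounts to spherical linear interpolation between L2-normalised image and text embeddings. That reading, the toy 3-dimensional embeddings, and the particular ratio grid are assumptions rather than details taken from the abstract.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def geodesic_mixup(a, b, lam):
    """Interpolate along the great circle from a (lam = 0) to b (lam = 1).

    Antipodal inputs (angle of pi) have no unique geodesic and are not handled.
    """
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(dot)
    if theta < 1e-8:          # nearly identical directions: nothing to mix
        return a
    s = math.sin(theta)
    wa = math.sin((1.0 - lam) * theta) / s
    wb = math.sin(lam * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

# Hypothetical reference-image and modification-text embeddings.
image_embedding = [0.8, 0.1, 0.2]
text_embedding = [0.1, 0.9, 0.3]
ratios = [0.2, 0.4, 0.6, 0.8]
composed_queries = [geodesic_mixup(image_embedding, text_embedding, lam)
                    for lam in ratios]
for lam, q in zip(ratios, composed_queries):
    print(lam, q)
```

Unlike a plain linear blend, every mixed query stays on the unit sphere, so cosine-similarity retrieval treats all ratios on an equal footing; each query would then retrieve its own candidates before the re-ranking stage.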

cs.CV / 40 / 2604.14711

MS-SSE-Net: A Multi-Scale Spatial Squeeze-and-Excitation Network for Structural Damage Detection in Civil and Geotechnical Engineering

MS-SSE-Net：一种用于土木和岩土工程结构损伤检测的多尺度空间挤压与激励网络

Khan, Saif ur Rehman, Waqar, Imad Ahmed, Zaib, Arooj, Ahmed, Saad, Vollmer, Sebastian, Dengel, Andreas, Asim, Muhammad Nabeel

Abstract

Structural damage detection is essential for maintaining the safety and reliability of civil infrastructure. However, accurately identifying different types of structural damage from images remains challenging due to variations in damage patterns and environmental conditions. To address these challenges, this paper proposes MS-SSE-Net, a novel deep learning (DL) framework for structural damage classification. The proposed model is built upon the DenseNet201 backbone and integrates novel multi-scale feature extraction with channel and spatial attention mechanisms (MS-SSE-Net). Specifically, parallel depthwise convolutions capture both local and contextual features, while squeeze-and-excitation style channel attention and spatial attention emphasize informative regions and suppress irrelevant noise. The refined features are then processed through global average pooling and a fully connected classification layer to generate the final predictions. Experiments are conducted on the StructDamage dataset containing multiple structural damage categories. The proposed MS-SSE-Net demonstrates superior performance compared with the baseline DenseNet201 and other comparative approaches. Specifically, the proposed method achieves 99.31% precision, 99.25% recall, 99.27% F1-score, and 99.26% accuracy, outperforming the baseline model which achieved 98.62% precision, 98.53% recall, 98.58% F1-score, and 98.53% accuracy.

Chinese Translation

结构损伤检测对于维护土木基础设施的安全性和可靠性至关重要。然而，由于损伤模式和环境条件的变化，从图像中准确识别不同类型的结构损伤仍然具有挑战性。为了解决这些挑战，本文提出了MS-SSE-Net，一种用于结构损伤分类的新型深度学习（DL）框架。所提模型基于DenseNet201骨干网络，结合了新颖的多尺度特征提取和通道及空间注意机制（MS-SSE-Net）。具体而言，平行的深度卷积捕获局部和上下文特征，而挤压与激励风格的通道注意和空间注意则强调信息丰富的区域并抑制无关噪声。经过全局平均池化和全连接分类层处理后，精炼的特征生成最终的预测结果。实验在包含多种结构损伤类别的StructDamage数据集上进行。所提的MS-SSE-Net在性能上优于基线DenseNet201及其他比较方法。具体而言，所提方法实现了99.31%的精确率、99.25%的召回率、99.27%的F1-score和99.26%的准确率，超越了基线模型的98.62%精确率、98.53%召回率、98.58% F1-score和98.53%准确率。

cs.CV / 41 / 2604.14720

Data Synthesis Improves 3D Myotube Instance Segmentation

数据合成改善3D肌管实例分割

Exler, David, Friederich, Nils, Krüger, Martin, Jbeily, John, Vitacolonna, Mario, Rudolf, Rüdiger, Mikut, Ralf, Reischl, Markus

Abstract

Myotubes are multinucleated muscle fibers serving as key model systems for studying muscle physiology, disease mechanisms, and drug responses. Mechanistic studies and drug screening thereby rely on quantitative morphological readouts such as diameter, length, and branching degree, which in turn require precise three-dimensional instance segmentation. Yet established pretrained biomedical segmentation models fail to generalize to this domain due to the absence of large annotated myotube datasets. We introduce a geometry-driven synthesis pipeline that models individual myotubes via polynomial centerlines, locally varying radii, branching structures, and ellipsoidal end caps derived from real microscopy observations. Synthetic volumes are rendered with realistic noise, optical artifacts, and CycleGAN-based Domain Adaptation (DA). A compact 3D U-Net with self-supervised encoder pretraining, trained exclusively on synthetic data, achieves a mean IPQ of 0.22 on real data, significantly outperforming three established zero-shot segmentation models, demonstrating that biophysics-driven synthesis enables effective instance segmentation in annotation-scarce biomedical domains.

Chinese Translation

肌管是多核肌肉纤维，是研究肌肉生理、疾病机制和药物反应的关键模型系统。因此，机制研究和药物筛选依赖于定量形态学读数，如直径、长度和分支程度，这反过来又需要精确的三维实例分割。然而，由于缺乏大型注释肌管数据集，现有的预训练生物医学分割模型无法在该领域中泛化。我们提出了一种基于几何的合成管道，通过多项式中心线、局部变化的半径、分支结构和源自真实显微观察的椭球形端帽来建模单个肌管。合成体积以真实的噪声、光学伪影和基于CycleGAN的领域适应（Domain Adaptation, DA）进行渲染。一个紧凑的3D U-Net模型，经过自监督编码器预训练，仅在合成数据上训练，在真实数据上达到了0.22的平均IPQ，显著优于三种已建立的零-shot分割模型，证明了生物物理驱动的合成能够在注释稀缺的生物医学领域实现有效的实例分割。

cs.CV / 42 / 2604.14724

HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

HAMSA：通过SpectralPulseNet实现无扫描视觉状态空间模型

Patro, Badri N., Agneeswaran, Vijay S.

Abstract

Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization: a single Gaussian-initialized complex kernel replacing the traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN): an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU): magnitude-based gating for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2x faster inference than transformers (4.2 ms vs 9.2 ms for DeiT-S) and a 1.4-1.9x speedup over scanning-based SSMs, while using less memory (2.1 GB vs 3.2-4.5 GB) and energy (12.5 J vs 18-25 J). HAMSA demonstrates strong generalization across transfer learning and dense prediction tasks.

Chinese Translation

视觉状态空间模型（SSMs）如Vim、VMamba和SiMBA依赖复杂的扫描策略将顺序SSMs适配于处理二维图像，这引入了计算开销和架构复杂性。我们提出了HAMSA，一种直接在频谱域操作的无扫描SSM。HAMSA引入了三个关键创新：（1）简化的核参数化——用单个高斯初始化的复数核替代传统的（A，B，C）矩阵，消除了离散化不稳定性；（2）SpectralPulseNet（SPN）——一种依赖输入的频率门控机制，能够实现自适应频谱调制；（3）频谱自适应门控单元（SAGU）——基于幅度的门控机制，确保频域中的稳定梯度流。通过利用基于FFT的卷积，HAMSA消除了顺序扫描，同时实现了O(L log L)的复杂度，具有更高的简单性和效率。在ImageNet-1K上，HAMSA达到了85.7%的顶级准确率（在SSMs中处于最先进水平），推理速度比变换器快2.2倍（DeiT-S的推理时间为4.2毫秒对比9.2毫秒），并且比基于扫描的SSMs快1.4-1.9倍，同时使用更少的内存（2.1GB对比3.2-4.5GB）和能量（12.5J对比18-25J）。HAMSA在迁移学习和密集预测任务中展现出强大的泛化能力。
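
The O(L log L) figure follows from the convolution theorem: a length-L circular convolution becomes an element-wise product after a Fourier transform. The sketch below illustrates only that generic idea with a textbook radix-2 FFT; it is not HAMSA's kernel, gating, or architecture, and all function names are hypothetical.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 0:
        raise ValueError("input must be non-empty")
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out

def circular_conv_fft(signal, kernel):
    """O(L log L): multiply the spectra, then invert and rescale by 1/L."""
    n = len(signal)
    spectrum = [a * b for a, b in zip(fft(signal), fft(kernel))]
    return [v.real / n for v in fft(spectrum, inverse=True)]

def circular_conv_direct(signal, kernel):
    """O(L^2) reference implementation used to check the FFT version."""
    n = len(signal)
    return [sum(signal[j] * kernel[(i - j) % n] for j in range(n))
            for i in range(n)]

tokens = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 0.5, 2.5]   # toy length-8 sequence
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]  # toy global kernel
fast = circular_conv_fft(tokens, kernel)
slow = circular_conv_direct(tokens, kernel)
print(max(abs(a - b) for a, b in zip(fast, slow)))   # numerical agreement
```

Because the whole sequence is mixed in one spectral product, no token ordering or scan direction has to be chosen, which is the property the abstract refers to as scanning-free.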

cs.CV / 43 / 2604.14734

Find the Differences: Differential Morphing Attack Detection vs Face Recognition

找出差异：差异化变形攻击检测与人脸识别

Kelly, Una M., Spreeuwers, Luuk J., Veldhuis, Raymond N. J.

Abstract

Morphing is a challenge to face recognition (FR) for which several morphing attack detection solutions have been proposed. We argue that face recognition and differential morphing attack detection (D-MAD) in principle perform very similar tasks, which we support by comparing an FR system with two existing D-MAD approaches. We also show that currently used decision thresholds inherently lead to FR systems being vulnerable to morphing attacks and that this explains the tradeoff between performance on normal images and vulnerability to morphing attacks. We propose using FR systems that are already in place for morphing detection and introduce a new evaluation threshold that guarantees an upper limit to the vulnerability to morphing attacks, even of unknown types.

Chinese Translation

变形攻击对人脸识别（FR）构成了挑战，因此提出了多种变形攻击检测解决方案。我们认为，人脸识别与差异化变形攻击检测（D-MAD）在原则上执行非常相似的任务，我们通过比较一个FR系统与两种现有的D-MAD方法来支持这一观点。我们还展示了当前使用的决策阈值本质上使FR系统易受变形攻击的影响，这解释了在正常图像上的性能与对变形攻击的脆弱性之间的权衡。我们建议利用已经存在的FR系统进行变形检测，并引入一个新的评估阈值，以确保对变形攻击（即使是未知类型）的脆弱性有一个上限。

cs.CV / 44 / 2604.14747

Efficient closed-form approaches for pose estimation using Sylvester forms

基于Sylvester形式的高效闭式解法用于姿态估计

Vráblíková, Jana, Malis, Ezio, Busé, Laurent

Abstract

Solving the non-linear least-squares problem for pose estimation (rotation and translation) is a fundamental yet often time-consuming step in several real-time computer vision applications. With an adequate rotation parametrization, the optimization problem can be reduced to solving a system of polynomial equations, which can be done in closed form. Recent advances in efficient closed-form solvers based on resultant matrices point to a promising research direction for decreasing computation time while preserving estimation accuracy. In this paper, we propose a new class of resultant-based solvers that exploit Sylvester forms to further reduce the complexity of the resolution. We demonstrate that our proposed methods are numerically as accurate as state-of-the-art solvers and outperform them in computational time. We show that this approach applies to pose estimation in two types of problems: estimating a pose from 3D-to-3D correspondences, and estimating a pose from 3D-to-2D point correspondences.

Chinese Translation

解决姿态估计（旋转和位移）的非线性最小二乘问题通常是一个耗时但在多个实时计算机视觉应用中至关重要的问题。通过适当的旋转参数化，优化问题可以简化为多项式方程组的求解，并以闭式形式求解。最近，利用结果矩阵的高效闭式求解器的进展显示出减少计算时间同时保持估计精度的有希望的研究方向。本文提出了一类新的基于结果的求解器，利用Sylvester形式进一步降低求解的复杂性。我们证明，所提出的方法在数值上与最先进的求解器同样准确，并在计算时间上优于它们。我们展示了该方法可以应用于两种不同类型的问题的姿态估计：从3D到3D对应点的姿态估计，以及从3D点到2D点对应的姿态估计。
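
The abstract does not say which rotation parametrization is used. As one illustration of why the problem becomes polynomial, the sketch below uses the Cayley parametrization, which is an assumption here: every entry of the rotation matrix is a quadratic polynomial in the three parameters divided by a common denominator, so a least-squares cost turns polynomial once that denominator is cleared.

```python
def cayley_rotation(s):
    """Rotation matrix from Cayley parameters s = (x, y, z).

    R = ((1 - s.s) I + 2 s s^T + 2 [s]_x) / (1 + s.s)
    Rotations by exactly 180 degrees are not representable with finite s.
    """
    x, y, z = s
    k = 1.0 / (1.0 + x * x + y * y + z * z)
    return [
        [k * (1 + x * x - y * y - z * z), k * 2 * (x * y - z), k * 2 * (x * z + y)],
        [k * 2 * (x * y + z), k * (1 - x * x + y * y - z * z), k * 2 * (y * z - x)],
        [k * 2 * (x * z - y), k * 2 * (y * z + x), k * (1 - x * x - y * y + z * z)],
    ]

def orthogonality_error(r):
    """Largest absolute entry of R^T R - I; zero for an exact rotation."""
    worst = 0.0
    for i in range(3):
        for j in range(3):
            dot = sum(r[k][i] * r[k][j] for k in range(3))
            worst = max(worst, abs(dot - (1.0 if i == j else 0.0)))
    return worst

# s = (0, 0, tan(theta / 2)) rotates by theta about the z-axis.
quarter_turn = cayley_rotation((0.0, 0.0, 1.0))
print(quarter_turn)
print(orthogonality_error(cayley_rotation((0.3, -0.2, 0.5))))
```

Setting the gradient of the resulting polynomial cost to zero yields the kind of polynomial system that resultant-based solvers are designed to handle.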

cs.CV / 45 / 2604.14755

ASGNet: Adaptive Spectrum Guidance Network for Automatic Polyp Segmentation

ASGNet：用于自动息肉分割的自适应光谱引导网络

Sun, Yanguang, Zhang, Hengmin, Qian, Jianjun, Yang, Jian, Luo, Lei

Abstract

Early identification and removal of polyps can reduce the risk of developing colorectal cancer. However, the diverse morphologies, complex backgrounds and often concealed nature of polyps make polyp segmentation in colonoscopy images highly challenging. Despite the promising performance of existing deep learning-based polyp segmentation methods, their perceptual capabilities remain biased toward local regions, mainly because of the strong spatial correlations between neighboring pixels in the spatial domain. This limitation makes it difficult to capture the complete polyp structures, ultimately leading to sub-optimal segmentation results. In this paper, we propose a novel adaptive spectrum guidance network, called ASGNet, which addresses the limitations of spatial perception by integrating spectral features with global attributes. Specifically, we first design a spectrum-guided non-local perception module that jointly aggregates local and global information, thereby enhancing the discriminability of polyp structures and refining their boundaries. Moreover, we introduce a multi-source semantic extractor that integrates rich high-level semantic information to assist in the preliminary localization of polyps. Furthermore, we construct a dense cross-layer interaction decoder that effectively integrates diverse information from different layers and strengthens it to generate high-quality representations for accurate polyp segmentation. Extensive quantitative and qualitative results demonstrate the superiority of our ASGNet approach over 21 state-of-the-art methods across five widely-used polyp segmentation benchmarks. The code will be publicly available at: https://github.com/CSYSI/ASGNet.

Chinese Translation

早期识别和去除息肉可以降低发展结直肠癌的风险。然而，息肉的多样形态、复杂背景以及常常隐蔽的特性使得在结肠镜图像中进行息肉分割极具挑战性。尽管现有基于深度学习的息肉分割方法表现出良好的性能，但它们的感知能力仍然偏向于局部区域，这主要是由于空间域中相邻像素之间的强空间相关性。这一限制使得捕捉完整的息肉结构变得困难，最终导致次优的分割结果。在本文中，我们提出了一种新颖的自适应光谱引导网络，称为ASGNet，旨在通过将光谱特征与全局属性相结合来克服空间感知的局限性。具体而言，我们首先设计了一个光谱引导的非局部感知模块，该模块联合聚合局部和全局信息，从而增强息肉结构的可辨别性，并细化其边界。此外，我们引入了一个多源语义提取器，整合丰富的高层语义信息，以辅助息肉的初步定位。此外，我们构建了一个密集跨层交互解码器，有效整合来自不同层的多样信息并增强其特性，以生成高质量的表示，从而实现准确的息肉分割。大量定量和定性结果表明，我们的ASGNet方法在五个广泛使用的息肉分割基准上优于21种最先进的方法。代码将公开发布于：https://github.com/CSYSI/ASGNet。

cs.CV / 46 / 2604.14762

OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism

OmniGCD：抽象化广义类别发现以实现模态不可知性

Shipard, Jordan, Wiliem, Arnold, Thanh, Kien Nguyen, Xiang, Wei, Fookes, Clinton

Abstract

Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our OmniGCD leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a GCD latent space, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a zero-shot GCD setting where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. Trained once on synthetic data, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of +6.2, +17.9, +1.5 and +12.7 for vision, text, audio and remote sensing, respectively). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD. Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery. All code is available at https://github.com/Jordan-HS/OmniGCD.

Chinese Translation

广义类别发现（GCD）挑战着利用部分标记数据识别已知和新颖类别的方法，反映了人类的类别学习。与之前的GCD方法不同，这些方法在单一模态内操作并且需要特定数据集的微调，我们提出了一种受人脑抽象类别形成启发的模态不可知的GCD方法。我们的OmniGCD利用模态特定编码器（例如，视觉、音频、文本、遥感）处理输入，随后进行降维以构建GCD潜在空间，该空间在测试时被转化为更适合聚类的表示，使用一种新型的合成训练的基于Transformer的模型。为了评估OmniGCD，我们引入了一个零样本GCD设置，在该设置中不允许进行特定数据集的微调，从而实现模态不可知的类别发现。在合成数据上训练一次，OmniGCD在跨越四种模态的16个数据集上执行零样本GCD，相比基线提高了已知和新颖类别的分类准确性（视觉、文本、音频和遥感的平均百分点提升分别为+6.2、+17.9、+1.5和+12.7）。这突显了强编码器的重要性，同时将表示学习与类别发现解耦。改善模态不可知的方法将跨模态传播，促进编码器的独立于GCD的发展。我们的工作为未来的模态不可知GCD研究提供了基准，铺平了可扩展的人类启发类别发现的道路。所有代码可在 https://github.com/Jordan-HS/OmniGCD 获取。

cs.CV / 47 / 2604.14779

AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

AIM：用于视觉问答持续学习的非对称信息屏蔽

Zhang, Peifeng, Qiu, Zice, Yu, Donghua, Cao, Shilei, Zheng, Juepeng, Lu, Yutong, Fu, Haohuan

Abstract

In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.

Chinese Translation

在持续视觉问答（VQA）中，现有的持续学习（CL）方法大多是为对称的单模态架构而构建的。然而，现代视觉-语言模型（VLMs）违反了这一假设，因为它们的可训练组件本质上是非对称的。这种结构不匹配使得VLMs在从连续数据流中学习时极易遭遇灾难性遗忘。具体而言，这种非对称性导致标准的全局正则化在优化过程中偏向于庞大的语言解码器，从而使得较小但关键的视觉投影层高度易受干扰。因此，这种局部退化导致了组合推理能力的严重损失。为了解决这个问题，我们提出了非对称信息屏蔽（AIM），通过基于模态特定的敏感性应用有针对性的屏蔽，平衡了稳定性和可塑性。在持续VQA设置下对VQA v2和GQA的实验表明，AIM在平均性能（AP）和平均遗忘（AF）方面都达到了最先进的性能，同时更好地保留了对新技能-概念组合的泛化能力。

cs.CV / 48 / 2604.14781

Integrating Object Detection, LiDAR-Enhanced Depth Estimation, and Segmentation Models for Railway Environments

将物体检测、增强型激光雷达深度估计与分割模型整合应用于铁路环境

Giannico, Enrico Francesco, Nesti, Federico, D'Amico, Gianluca, Marinoni, Mauro, Carosio, Edoardo, Salotti, Filippo, Sabina, Salvatore, Buttazzo, Giorgio

Abstract

Obstacle detection in railway environments is crucial for ensuring safety. However, very few studies address the problem using a complete, modular, and flexible system that can both detect objects in the scene and estimate their distance from the vehicle. Most works focus solely on detection, others attempt to identify the track, and only a few estimate obstacle distances. Additionally, evaluating these systems is challenging due to the lack of ground truth data. In this paper, we propose a modular and flexible framework that identifies the rail track, detects potential obstacles, and estimates their distance by integrating three neural networks for object detection, track segmentation, and monocular depth estimation with LiDAR point clouds. To enable a reliable and quantitative evaluation, the proposed framework is assessed using a synthetic dataset (SynDRA), which provides accurate ground truth annotations, allowing for direct performance comparison with existing methods. The proposed system achieves a mean absolute error (MAE) as low as 0.63 meters by integrating monocular depth maps with LiDAR, enabling not only accurate distance estimates but also spatial perception of the scene.

Chinese Translation

在铁路环境中，障碍物检测对于确保安全至关重要。然而，目前很少有研究使用完整、模块化和灵活的系统来同时检测场景中的物体并估计它们与车辆的距离。大多数研究仅专注于检测，其他研究则尝试识别轨道，而仅有少数研究估计障碍物的距离。此外，由于缺乏真实数据，评估这些系统的效果也颇具挑战性。本文提出了一种模块化和灵活的框架，该框架通过整合三种神经网络（用于物体检测、轨道分割和单目深度估计与激光雷达点云）来识别轨道、检测潜在障碍物并估计其距离。为了实现可靠和定量的评估，所提出的框架使用合成数据集（SynDRA）进行评估，该数据集提供准确的真实标注，从而允许与现有方法进行直接性能比较。所提系统通过将单目深度图与激光雷达整合，达到了低至0.63米的平均绝对误差（MAE），不仅实现了准确的距离估计，还增强了对场景的空间感知能力。

cs.CV / 49 / 2604.14782

One-shot Compositional 3D Head Avatars with Deformable Hair

基于单图像的一次性组合3D头部虚拟形象与可变形头发

Sun, Yuan, Wang, Xuan, Zhang, WeiLi, Zhang, Wenxuan, Guo, Yu, Wang, Fei

Abstract

We propose a compositional method for constructing a complete 3D head avatar from a single image. Prior one-shot holistic approaches frequently fail to produce realistic hair dynamics during animation, largely due to inadequate decoupling of hair from the facial region, resulting in entangled geometry and unnatural deformations. Our method explicitly decouples hair from the face, modeling these components using distinct deformation paradigms while integrating them into a unified rendering pipeline. Furthermore, by leveraging image-to-3D lifting techniques, we preserve fine-grained textures from the input image to the greatest extent possible, effectively mitigating the common issue of high-frequency information loss in generalized models. Specifically, given a frontal portrait image, we first perform hair removal to obtain a bald image. Both the original image and the bald image are then lifted to dense, detail-rich 3D Gaussian Splatting (3DGS) representations. For the bald 3DGS, we rig it to a FLAME mesh via non-rigid registration with a prior model, enabling natural deformation that follows the mesh triangles during animation. For the hair component, we employ semantic label supervision combined with a boundary-aware reassignment strategy to extract a clean and isolated set of hair Gaussians. To control hair deformation, we introduce a cage structure that supports Position-Based Dynamics (PBD) simulation, allowing realistic and physically plausible transformations of the hair Gaussian primitives under head motion, gravity, and inertial effects. Striking qualitative results, including dynamic animations under diverse head motions, gravity effects, and expressions, showcase substantially more realistic hair behavior alongside faithfully preserved facial details, outperforming state-of-the-art one-shot methods in perceptual realism.
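For readers unfamiliar with Position-Based Dynamics, the following is a minimal generic PBD step: predict positions under gravity, project distance constraints, then derive velocities from the corrected positions. It is a sketch of the general technique only. The paper's cage structure, its constraints, and its coupling to the hair Gaussians are not described in the abstract, so treating cage vertices as simple particles joined by distance constraints is an assumption, as are all names and parameters here.

```python
import math


def pbd_step(positions, velocities, inv_masses, constraints, dt,
             gravity=(0.0, -9.81), iterations=10):
    """One PBD step in 2D. constraints is a list of (i, j, rest_length)."""
    # 1. Predict positions; particles with zero inverse mass are pinned.
    predicted = []
    for (x, y), (vx, vy), w in zip(positions, velocities, inv_masses):
        if w == 0.0:
            predicted.append((x, y))
        else:
            vx += gravity[0] * dt
            vy += gravity[1] * dt
            predicted.append((x + vx * dt, y + vy * dt))

    # 2. Iteratively project distance constraints onto the predicted positions.
    for _ in range(iterations):
        for i, j, rest in constraints:
            (xi, yi), (xj, yj) = predicted[i], predicted[j]
            dx, dy = xj - xi, yj - yi
            dist = math.hypot(dx, dy)
            w_sum = inv_masses[i] + inv_masses[j]
            if dist == 0.0 or w_sum == 0.0:
                continue
            corr = (dist - rest) / (dist * w_sum)
            predicted[i] = (xi + inv_masses[i] * corr * dx,
                            yi + inv_masses[i] * corr * dy)
            predicted[j] = (xj - inv_masses[j] * corr * dx,
                            yj - inv_masses[j] * corr * dy)

    # 3. Velocities follow from the position change over the step.
    new_velocities = [((px - x) / dt, (py - y) / dt)
                      for (px, py), (x, y) in zip(predicted, positions)]
    return predicted, new_velocities
```

Pinning a particle (inverse mass 0) to the head and moving it between steps is one simple way to drive the remaining particles with head motion.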

Chinese Translation

我们提出了一种组合方法，通过单张图像构建完整的3D头部虚拟形象。以往的一次性整体方法在动画过程中常常无法产生真实的头发动态，这主要是由于头发与面部区域的解耦不足，导致几何形状纠缠和不自然的变形。我们的方法明确地将头发与面部解耦，使用不同的变形范式对这些组件进行建模，同时将它们整合到统一的渲染管道中。此外，通过利用图像到3D的提升技术，我们尽可能保留输入图像中的细致纹理，有效减轻了通用模型中常见的高频信息丢失问题。具体而言，给定一张正面肖像图像，我们首先进行头发去除，以获得光头图像。然后，将原始图像和光头图像提升为密集、细节丰富的3D高斯点云（3D Gaussian Splatting, 3DGS）表示。对于光头3DGS，我们通过与先验模型的非刚性注册将其绑定到FLAME网格上，从而实现自然的变形，跟随网格三角形在动画中的运动。对于头发组件，我们采用语义标签监督结合边界感知重分配策略，以提取干净且独立的头发高斯集合。为了控制头发变形，我们引入了一个支持基于位置的动态（Position-Based Dynamics, PBD）模拟的笼结构，允许在头部运动、重力和惯性效应下对头发高斯原件进行真实且物理上合理的变换。引人注目的定性结果，包括在不同头部运动、重力效应和表情下的动态动画，展示了更加真实的头发行为，同时忠实保留了面部细节，超越了最先进的一次性方法在感知真实感方面的表现。

cs.CV / 50 / 2604.14805

From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation

从边界到语义：基于提示的多任务学习用于岩石薄片分割

Ren, Yili, Wen, Shiqi, Hou, Li, Xiao, Dingwen, Zhang, Weiming, Cao, Caleb Chen, Wang, Lin, Zheng, Zilu, Su, Qianxiao, Zhao, Mingjun, Chen, Lei

Abstract

Grain-edge segmentation (GES) and lithology semantic segmentation (LSS) are two pivotal tasks for quantifying rock fabric and composition. However, these two tasks are often treated separately, and segmentation quality remains unsatisfactory even though expensive, time-consuming, expert-annotated datasets have been used. Recently, foundation models, especially the Segment Anything Model (SAM), have demonstrated impressive robustness for boundary alignment. However, directly adapting SAM to joint GES and LSS is nontrivial due to 1) the severe domain gap induced by extinction-dependent color variations and ultra-fine grain boundaries, and 2) the lack of modules for joint learning on multi-angle petrographic image stacks. In this paper, we propose Petro-SAM, a novel two-stage, multi-task framework that can achieve high-quality joint GES and LSS on petrographic images. Specifically, based on SAM, we introduce a Merge Block to integrate seven polarized views, effectively solving the extinction issue. Moreover, we introduce multi-scale feature fusion and color-entropy priors to refine the detection.

Chinese Translation

颗粒边缘分割（GES）和岩性语义分割（LSS）是量化岩石结构和成分的两个关键任务。然而，这两个任务通常被分开处理，尽管使用了昂贵、耗时且由专家标注的数据集，分割质量依然难以令人信服。最近，基础模型，尤其是Segment Anything Model（SAM），在边界对齐方面展示了令人印象深刻的鲁棒性。然而，直接将SAM适应于联合GES和LSS并非易事，原因有二：1）由于依赖消光的颜色变化和超细颗粒边界而引起的严重领域差距；2）缺乏用于多角度岩石图像堆栈联合学习的新模块。在本文中，我们提出了Petro-SAM，这是一种新颖的两阶段多任务框架，可以在岩石图像上实现高质量的联合GES和LSS。具体而言，基于SAM，我们引入了一个合并模块（Merge Block）来整合七个偏振视图，有效解决了消光问题。此外，我们引入了多尺度特征融合和颜色熵先验来优化检测。

cs.CV / 51 / 2604.14816

NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results

NTIRE 2026 视频显著性预测挑战赛：方法与结果

Moskalenko, Andrey, Bryncev, Alexey, Kosmynin, Ivan, Shilovskaya, Kira, Erofeev, Mikhail, Vatolin, Dmitry, Timofte, Radu, Wang, Kun, Hu, Yupeng, Li, Zhiran, Liu, Hao, Xiang, Qianlong, Nie, Liqiang, Chaldaiopoulos, Konstantinos, Efthymiou, Niki, Zlatintsi, Athanasia, Filntisis, Panagiotis, Pastra, Katerina, Maragos, Petros, Yang, Li, Zhan, Gen, Liao, Yiting, Zhang, Yabin, Liu, Yuxin, Wu, Xu, Zheng, Yunheng, Li, Linze, He, Kun, Wu, Cong, Zhu, Xuefeng, Xu, Tianyang, Wu, Xiaojun, Zhao, Wenzhuo, Fu, Keren, Li, Gongyang, Shi, Shixiang, Chen, Jianlin, Ling, Haibin, Jiang, Yaoxin, Xu, Guoyi, Liu, Jiajia, Shi, Yaokun, Tu, Jiachen

Abstract

This paper presents an overview of the NTIRE 2026 Challenge on Video Saliency Prediction. The goal of the challenge participants was to develop automatic saliency map prediction methods for the provided video sequences. The novel dataset of 2,000 diverse videos with an open license was prepared for this challenge. The fixations and corresponding saliency maps were collected using crowdsourced mouse tracking and contain viewing data from over 5,000 assessors. Evaluation was performed on a subset of 800 test videos using generally accepted quality metrics. The challenge attracted over 20 teams making submissions, and 7 teams passed the final phase with code review. All data used in this challenge is made publicly available - https://github.com/msu-video-group/NTIRE26_Saliency_Prediction.

Chinese Translation

本文概述了 NTIRE 2026 视频显著性预测挑战赛。挑战参与者的目标是为提供的视频序列开发自动显著性图预测方法。为此挑战准备了一个包含 2000 个多样化视频的新数据集，并且该数据集具有开放许可。通过众包鼠标追踪收集了注视点及其对应的显著性图，并包含来自超过 5000 名评估者的观看数据。评估是在 800 个测试视频的子集上进行的，使用了普遍接受的质量指标。该挑战吸引了超过 20 支团队提交作品，其中 7 支团队通过了最终阶段的代码审查。所有在此挑战中使用的数据均已公开发布 - https://github.com/msu-video-group/NTIRE26_Saliency_Prediction。

cs.CV / 52 / 2604.14837

Improved Multiscale Structural Mapping with Supervertex Vision Transformer for the Detection of Alzheimer's Disease Neurodegeneration

基于超顶点视觉变换器的改进多尺度结构映射用于阿尔茨海默病神经退行性病变的检测

Baek, Geonwoo, Salat, David H., Jang, Ikbeom

Abstract

Alzheimer's disease (AD) confirmation often relies on positron emission tomography (PET) or cerebrospinal fluid (CSF) analysis, which are costly and invasive. Consequently, structural MRI biomarkers such as cortical thickness (CT) are widely used for non-invasive AD screening. Multiscale structural mapping (MSSM) was recently proposed to integrate gray-white matter contrasts (GWCs) with CT from a single T1-weighted MRI (T1w) scan. Building on this framework, we propose MSSM+, together with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT). 3D T1w images from individuals with AD and cognitively normal (CN) controls were analyzed. MSSM+ extends MSSM by incorporating sulcal depth and cortical curvature at the vertex level. SSVM partitions the cortical surface into supervertices (surface patches) that effectively represent inter- and intra-regional spatial relationships. SV-ViT is a Vision Transformer architecture operating on these supervertices, enabling anatomically informed learning from surface mesh representations. Compared with MSSM, MSSM+ identified more spatially extensive and statistically significant group differences between AD and CN. In AD vs. CN classification, MSSM+ achieved an area under the precision-recall curve 3 percentage points higher than that of MSSM. Vendor-specific analyses further demonstrated reduced signal variability and consistently improved classification performance across MR manufacturers relative to CT, GWCs, and MSSM. These findings suggest that MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection prior to CSF/PET confirmation.

Chinese Translation

阿尔茨海默病（AD）的确认通常依赖于正电子发射断层扫描（PET）或脑脊液（CSF）分析，这些方法成本高且具有侵入性。因此，结构性磁共振成像（MRI）生物标志物，如皮层厚度（CT），被广泛用于非侵入性的AD筛查。最近提出的多尺度结构映射（MSSM）旨在将灰白质对比（GWC）与单次T1加权MRI（T1w）扫描中的CT结合起来。在此框架基础上，我们提出了MSSM+，结合了表面超顶点映射（SSVM）和超顶点视觉变换器（SV-ViT）。分析了来自AD患者和认知正常（CN）对照组的3D T1w图像。MSSM+通过在顶点级别整合沟深和皮层曲率来扩展MSSM。SSVM将皮层表面划分为超顶点（表面补丁），有效地表示区域间和区域内的空间关系。SV-ViT是一种在这些超顶点上操作的视觉变换器架构，能够从表面网格表示中进行解剖学信息驱动的学习。与MSSM相比，MSSM+识别出AD与CN之间更广泛且统计显著的群体差异。在AD与CN分类中，MSSM+的精确召回曲线下面积比MSSM高出3个百分点。特定厂商的分析进一步表明，相较于CT、GWC和MSSM，信号变异性降低，并且在不同MRI制造商之间的分类性能一致提高。这些发现表明，MSSM+结合SV-ViT是一个有前景的基于MRI的成像标志物，可用于在CSF/PET确认之前检测AD。

cs.CV / 53 / 2604.14846

Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems

通过协调视觉模型实现零样本零售盗窃检测：一种模型无关的、成本有效的替代训练单模型系统的方法

Yagersew, Haileab

Abstract

Retail theft costs the global economy over $100 billion annually, yet existing AI-based detection systems require expensive custom model training on proprietary datasets and charge $200-500/month per store. We present Paza, a zero-shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline - cheap object detection and pose estimation running continuously, with an expensive vision-language model (VLM) invoked only when behavioral pre-filters trigger. A multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per-frame analysis, bounding calls to <=10/minute and enabling a single GPU to serve 10-20 stores. The architecture is model-agnostic: the VLM component accepts any OpenAI-compatible endpoint, enabling operators to swap between models such as Gemma 4, Qwen3.5-Omni, GPT-4o, or future releases without code changes - ensuring the system improves as the VLM landscape evolves. We evaluate the VLM component on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), achieving 89.5% precision and 92.8% specificity at 59.3% recall zero-shot - where the recall gap is attributable to sparse frame sampling in offline evaluation rather than VLM reasoning failures, as precision and specificity are the operationally critical metrics determining false alarm rates. We present a detailed cost model showing viability at $50-100/month per store (3-10x cheaper than commercial alternatives), and introduce a privacy-preserving design that obfuscates faces in the detection pipeline. The source code is available at https://github.com/xHaileab/Paza-AI.
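The gating logic described above (dwell time plus at least one behavioral signal, with VLM calls capped at 10 per minute) can be sketched roughly as follows. This is not the Paza implementation. The class name, the 5-second dwell threshold, and the signal names are hypothetical, and only the "dwell plus one signal" rule and the 10-per-minute bound come from the abstract.

```python
from collections import deque


class SuspicionPreFilter:
    """Decides whether a tracked person warrants an expensive VLM call."""

    def __init__(self, min_dwell_s=5.0, max_calls_per_min=10):
        self.min_dwell_s = min_dwell_s          # hypothetical threshold
        self.max_calls = max_calls_per_min      # bound stated in the abstract
        self._call_times = deque()              # timestamps of recent VLM calls

    def should_invoke_vlm(self, now_s, dwell_s, signals):
        # Forget calls that fell out of the 60-second sliding window.
        while self._call_times and now_s - self._call_times[0] >= 60.0:
            self._call_times.popleft()

        # Require dwell time AND at least one behavioral signal.
        suspicious = dwell_s >= self.min_dwell_s and any(signals.values())
        if not suspicious or len(self._call_times) >= self.max_calls:
            return False

        self._call_times.append(now_s)
        return True


prefilter = SuspicionPreFilter()
# Hypothetical signal names produced by the cheap detection and pose stages.
decision = prefilter.should_invoke_vlm(
    now_s=0.0, dwell_s=8.0,
    signals={"hand_near_pocket": True, "item_left_view": False},
)
```

A sliding window is used here rather than fixed one-minute buckets, so the cap holds over any 60-second interval.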

Chinese Translation

零售盗窃每年给全球经济造成超过1000亿美元的损失，然而现有的基于人工智能的检测系统需要在专有数据集上进行昂贵的定制模型训练，并且每个商店的费用高达200-500美元/月。我们提出了Paza，一个零样本零售盗窃检测框架，能够在不训练任何模型的情况下实现实用的隐蔽检测。我们的方法在一个分层管道中协调多个现有模型——廉价的物体检测和姿态估计持续运行，只有在行为预过滤器触发时才调用昂贵的视觉语言模型（VLM）。一个多信号怀疑预过滤器（要求停留时间加上至少一个行为信号）相比每帧分析将VLM的调用减少了240倍，将调用限制在每分钟不超过10次，使得单个GPU能够服务10-20个商店。该架构是模型无关的：VLM组件接受任何兼容OpenAI的端点，允许操作员在Gemma 4、Qwen3.5-Omni、GPT-4o或未来版本之间切换，而无需更改代码——确保系统随着VLM领域的发展而改进。我们在DCSASS合成的盗窃数据集（169个片段，受控环境）上评估了VLM组件，在59.3%的召回率下实现了89.5%的精确率和92.8%的特异性——召回率差距归因于离线评估中的稀疏帧采样，而非VLM推理失败，因为精确率和特异性是决定误报率的关键运营指标。我们展示了一个详细的成本模型，表明每个商店的可行性为50-100美元/月（比商业替代方案便宜3-10倍），并引入了一种保护隐私的设计，在检测管道中模糊化面孔。源代码可在https://github.com/xHaileab/Paza-AI获取。

cs.CV / 54 / 2604.14849

Efficient Search of Implantable Adaptive Cells for Medical Image Segmentation

高效搜索可植入自适应细胞用于医学图像分割

Benedykciuk, Emil, Denkowski, Marcin, Wójcik, Grzegorz M.

Abstract

Purpose: Adaptive skip modules can improve medical image segmentation, but searching for them is computationally costly. Implantable Adaptive Cells (IACs) are compact NAS modules inserted into U-Net skip connections, reducing the search space compared with full-network NAS. However, the original IAC framework still requires a 200-epoch differentiable search for each backbone and dataset. Methods: We analyzed the temporal behavior of operations and edges within IAC cells during differentiable search on public medical image segmentation benchmarks. We found that operations selected in the final discrete cell typically emerge among the strongest candidates early in training, and their architecture parameters stabilize well before the final epoch. Based on this, we propose a Jensen-Shannon-divergence-based stability criterion that tracks per-edge operation-importance distributions and progressively prunes low-importance operations during search. The accelerated framework is called IAC-LTH. Results: Across four public benchmarks (ACDC, BraTS, KiTS, AMOS), several 2-D U-Net backbones, and a 2-D nnU-Net pipeline, IAC-LTH discovers IAC cells whose patient-level segmentation performance matches and sometimes slightly exceeds that of cells found by the original full-length search, while reducing wall-clock NAS cost by 3.7x to 16x across datasets and backbones. These results are consistent across architectures, benchmarks, and both non-augmented and augmented training settings, while preserving the gains of IAC-equipped U-Nets over strong attention-based and dense-skip baselines. Conclusion: Competitive IAC architectures can be identified from early-stabilizing operations without running the full search, making adaptive skip-module design more practical for medical image segmentation under realistic computational constraints.
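To make the stability criterion concrete, here is a minimal sketch of a Jensen-Shannon check on one edge's operation-importance distribution across epochs. The abstract does not give the window length, the tolerance, or the pruning rule, so `window`, `tol`, and the "drop the single weakest operation" rule are assumptions, and the operation names in the usage note are placeholders.

```python
import math


def softmax(alphas):
    """Turn raw architecture parameters into an operation-importance distribution."""
    peak = max(alphas)
    exps = [math.exp(a - peak) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    mix = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)


def edge_is_stable(history, window=3, tol=1e-3):
    """True if the last `window` epoch-to-epoch JS divergences are all below tol."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return all(js_divergence(a, b) < tol for a, b in zip(recent, recent[1:]))


def prune_lowest(op_names, probs):
    """Drop the single least important operation from a stable edge."""
    weakest = min(range(len(probs)), key=probs.__getitem__)
    return [name for i, name in enumerate(op_names) if i != weakest]
```

For example, once `edge_is_stable` returns true for an edge whose candidates are `["conv3x3", "skip", "maxpool"]` with importances `[0.6, 0.3, 0.1]`, `prune_lowest` would remove `"maxpool"` and the search would continue over the two remaining operations.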

Chinese Translation

目的：自适应跳跃模块可以改善医学图像分割，但搜索这些模块的计算成本较高。可植入自适应细胞（Implantable Adaptive Cells, IACs）是插入到 U-Net 跳跃连接中的紧凑型神经架构搜索（Neural Architecture Search, NAS）模块，相较于全网络 NAS，减少了搜索空间。然而，原始的 IAC 框架仍然需要对每个主干网络和数据集进行 200 轮的可微搜索。方法：我们分析了在公共医学图像分割基准上进行可微搜索时，IAC 细胞内操作和边缘的时间行为。我们发现，在最终离散细胞中选择的操作通常在训练初期就会在最强候选者中出现，并且它们的架构参数在最终轮次之前就已稳定。基于此，我们提出了一种基于 Jensen-Shannon 散度的稳定性标准，该标准跟踪每个边缘的操作重要性分布，并在搜索过程中逐步剪枝低重要性操作。加速框架称为 IAC-LTH。结果：在四个公共基准（ACDC、BraTS、KiTS、AMOS）、多个 2-D U-Net 主干网络和一个 2-D nnU-Net 流水线中，IAC-LTH 发现的 IAC 细胞在患者级分割性能上与原始全长搜索找到的细胞相匹配，有时甚至略有超出，同时在数据集和主干网络之间将壁钟 NAS 成本降低了 3.7 倍至 16 倍。这些结果在不同架构、基准以及非增强和增强训练设置中是一致的，同时保留了 IAC 装备的 U-Net 相较于强注意力和密集跳跃基线的优势。结论：可以从早期稳定的操作中识别出具有竞争力的 IAC 架构，而无需进行完整搜索，这使得在现实计算约束下，自适应跳跃模块设计在医学图像分割中变得更加实用。

cs.CV / 55 / 2604.14866

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

MetaDent：用于牙科视觉-语言模型的临床图像标注

Li, Meng-Xun, Deng, Wen-Hui, Wu, Zhi-Xing, Jin, Chun-Xiao, Wu, Jia-Min, Han, Yue, Tsoi, James Kit Hon, Xia, Gui-Song, Huang, Cui

Abstract

Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.

Chinese Translation

视觉-语言模型（VLMs）在医学图像分析中展现出显著潜力，但由于缺乏细粒度的标注数据集和全面的基准，其在口腔摄影中的应用仍然基本未被探索。为了解决这一问题，我们提出了MetaDent，这是一个全面的资源，包括（1）从临床、公共和网络来源收集的新颖且大规模的牙科图像数据集；（2）旨在捕捉牙科摄影的层次性和临床细微特征的半结构化标注框架；以及（3）用于评估最先进的VLM在临床图像理解上的全面基准套件。我们的标注方法结合了高层次的图像摘要与逐点、自由文本的异常描述。这种方法实现了丰富、可扩展且与任务无关的表征。我们从多种来源整理了60,669张牙科图像，并使用这一元标注方案对2,588张具有代表性的图像进行了标注。利用大型语言模型（LLMs），我们得出了标准化的基准：大约15K的视觉问答（VQA）对和一个18类多标签分类数据集，我们通过人工审查和错误分析进行了验证，以证明LLM驱动的转变可靠地保持了保真度和语义准确性。随后，我们在VQA、分类和图像描述任务上评估了最先进的VLM。定量结果显示，即使是最先进的模型在对口腔场景的细粒度理解上也面临挑战，准确率适中，并在图像描述中产生不一致或不完整的描述。我们公开发布了我们的数据集、标注和工具，以促进可重复研究并加速牙科应用的视觉-语言系统的发展。

cs.CV / 56 / 2604.14874

Open-Set Vein Biometric Recognition with Deep Metric Learning

基于深度度量学习的开放集静脉生物识别

Pilarek, Paweł, Musiałek, Marcel, Górska, Anna

Abstract

Most state-of-the-art vein recognition methods rely on closed-set classification, which inherently limits their scalability and prevents the adaptive enrollment of new users without complete model retraining. We rigorously evaluate the computational boundaries of Deep Metric Learning (DML) under strict open-set constraints. Unlike standard closed-set approaches, we analyze the impact of data scarcity and domain shift on recognition performance. Our approach learns discriminative L2-normalised embeddings and employs prototype-based matching with a calibrated similarity threshold to effectively distinguish between enrolled users and unseen impostors. We evaluate the framework under a strict subject-disjoint protocol across four diverse datasets covering finger, wrist, and dorsal hand veins (MMCBNU 6000, UTFVP, FYO, and a dorsal hand-vein dataset). On the large-scale MMCBNU 6000 benchmark, our best model (ResNet50-CBAM) achieves an OSCR of 0.9945, AUROC of 0.9974, and EER of 1.57%, maintaining high identification accuracy (99.6% Rank-1) while robustly rejecting unknown subjects. Cross-dataset experiments evaluate the framework's generalisation across different acquisition setups, confirming that while the model handles large-scale data robustly, performance remains sensitive to domain shifts in low-data regimes. Ablation studies demonstrate that triplet-based objectives combined with a simple 1-NN classifier offer an optimal trade-off between accuracy and efficiency, enabling real-time deployment on commodity hardware.
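The open-set decision rule described above (L2-normalised embeddings, one prototype per enrolled subject, and a similarity threshold for rejecting impostors) can be illustrated with a small sketch. The embedding network is omitted. The 2-D vectors, the subject names, and the 0.8 threshold are made up for illustration, and averaging normalised embeddings is just one plausible way to build a prototype.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_prototype(embeddings):
    """Mean of the normalised enrollment embeddings, renormalised to unit length."""
    unit = [l2_normalize(e) for e in embeddings]
    mean = [sum(col) / len(unit) for col in zip(*unit)]
    return l2_normalize(mean)


def identify(query, prototypes, threshold):
    """Return the best-matching enrolled subject, or None for an unknown subject."""
    q = l2_normalize(query)
    best_id, best_sim = None, -1.0
    for subject_id, proto in prototypes.items():
        sim = sum(a * b for a, b in zip(q, proto))
        if sim > best_sim:
            best_id, best_sim = subject_id, sim
    return best_id if best_sim >= threshold else None


# Enrolling a new user only adds a prototype; no retraining is involved.
prototypes = {
    "alice": build_prototype([[1.0, 0.0], [0.9, 0.1]]),
    "bob": build_prototype([[0.0, 1.0]]),
}
match = identify([1.0, 0.05], prototypes, threshold=0.8)
```

In practice the threshold would be calibrated on held-out data, for instance at the operating point that yields the desired error trade-off.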

Chinese Translation

大多数最先进的静脉识别方法依赖于闭集分类，这在本质上限制了它们的可扩展性，并阻止了在不完全重新训练模型的情况下对新用户的自适应注册。我们在严格的开放集约束下对深度度量学习（Deep Metric Learning, DML）的计算边界进行了严格评估。与标准的闭集方法不同，我们分析了数据稀缺和领域转移对识别性能的影响。我们的方法学习区分性的L2归一化嵌入，并采用基于原型的匹配方法，结合校准的相似性阈值，有效区分已注册用户和未见的冒名顶替者。我们在四个涵盖手指、手腕和手背静脉的多样化数据集（MMCBNU 6000、UTFVP、FYO和手背静脉数据集）下，按照严格的受试者不重叠协议评估该框架。在大规模的MMCBNU 6000基准测试中，我们的最佳模型（ResNet50-CBAM）实现了0.9945的开放集识别率（OSCR）、0.9974的受试者工作特征曲线下面积（AUROC）和1.57%的等错误率（EER），在保持高识别准确率（99.6%的Rank-1）的同时，能够稳健地拒绝未知受试者。跨数据集实验评估了该框架在不同采集设置下的泛化能力，确认虽然模型能够稳健处理大规模数据，但在低数据环境中性能对领域转移仍然敏感。消融研究表明，基于三元组的目标结合简单的1-NN分类器在准确性和效率之间提供了最佳的权衡，使其能够在普通硬件上实现实时部署。

cs.CV / 57 / 2604.14884

FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection

FSDETR：用于小物体检测的频率-空间特征增强

Huang, Jianchao, Zhang, Fengming, Zhu, Haibo, Yan, Tao

Abstract

Small object detection remains a significant challenge due to feature degradation from downsampling, mutual occlusion in dense clusters, and complex background interference. To address these issues, this paper proposes FSDETR, a frequency-spatial feature enhancement framework built upon the RT-DETR baseline. By establishing a collaborative modeling mechanism, the method effectively leverages complementary structural information. Specifically, a Spatial Hierarchical Attention Block (SHAB) captures both local details and global dependencies to strengthen semantic representation. Furthermore, to mitigate occlusion in dense scenes, the Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) focuses on informative regions via dynamic sampling. Finally, the Frequency-Spatial Feature Pyramid Network (FSFPN) integrates frequency filtering with spatial edge extraction via the Cross-domain Frequency-Spatial Block (CFSB) to preserve fine-grained details. Experimental results show that with only 14.7M parameters, FSDETR achieves 13.9% AP_S on VisDrone 2019 and 48.95% AP_50^tiny on TinyPerson, showing strong performance on small-object benchmarks. The code and models are available at https://github.com/YT3DVision/FSDETR.

Chinese Translation

小物体检测仍然是一个重要的挑战，主要由于下采样导致的特征退化、密集聚类中的相互遮挡以及复杂背景的干扰。为了解决这些问题，本文提出了FSDETR，一个基于RT-DETR基线的频率-空间特征增强框架。通过建立协同建模机制，该方法有效利用了互补的结构信息。具体而言，空间层次注意力块（Spatial Hierarchical Attention Block, SHAB）捕捉局部细节和全局依赖关系，以增强语义表示。此外，为了减轻密集场景中的遮挡，基于可变形注意力的尺度内特征交互（Deformable Attention-based Intra-scale Feature Interaction, DA-AIFI）通过动态采样聚焦于信息丰富的区域。最后，频率-空间特征金字塔网络（Frequency-Spatial Feature Pyramid Network, FSFPN）通过跨域频率-空间块（Cross-domain Frequency-Spatial Block, CFSB）将频率过滤与空间边缘提取相结合，以保留细粒度细节。实验结果表明，FSDETR仅用14.7M参数，在VisDrone 2019上达到13.9%的平均精度（APS），在TinyPerson上达到48.95%的AP50 tiny，显示出在小物体基准测试中的强大性能。代码和模型可在https://github.com/YT3DVision/FSDETR获取。

cs.CV / 58 / 2604.14910

Reward-Aware Trajectory Shaping for Few-step Visual Generation

基于奖励意识的少步轨迹塑造用于视觉生成

Li, Rui, Li, Bingyu, Liang, Yuanzhi, Bin, HuangHai, Zhang, Chi, Li, XueLong

Abstract

Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing preference alignment awareness enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose Reward-Aware Trajectory Shaping (RATS), a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a reward-aware gate is introduced to adaptively regulate teacher guidance based on their relative reward performance. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead. Experimental results demonstrate that RATS substantially improves the efficiency-quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.
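The abstract does not give the functional form of the reward-aware gate. As one possible reading, the sketch below uses a sigmoid of the teacher-minus-student reward gap to weight a trajectory-matching term, so teacher guidance is stronger when the teacher scores higher and weaker once the student catches up. The sigmoid, the temperature, the mean-squared matching loss, and the way the reward term enters the objective are all assumptions, not the RATS formulation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reward_gate(r_teacher, r_student, temperature=0.1):
    """Gate in (0, 1): above 0.5 when the teacher is ahead, below 0.5 when the student is."""
    return sigmoid((r_teacher - r_student) / temperature)


def trajectory_matching_loss(student_latents, teacher_latents):
    """Mean squared error over latents at the matched denoising stages."""
    total, count = 0.0, 0
    for s_stage, t_stage in zip(student_latents, teacher_latents):
        for s, t in zip(s_stage, t_stage):
            total += (s - t) ** 2
            count += 1
    return total / count


def gated_objective(student_latents, teacher_latents, r_teacher, r_student,
                    reward_weight=1.0, temperature=0.1):
    """Gated imitation term minus a reward term (lower is better)."""
    gate = reward_gate(r_teacher, r_student, temperature)
    match = trajectory_matching_loss(student_latents, teacher_latents)
    return gate * match - reward_weight * r_student
```

In a real training loop the gate would typically be treated as a constant with respect to gradients, so that it only rescales the imitation term.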

Chinese Translation

在极少的采样步骤中实现高保真生成一直是生成建模的核心目标。现有的方法主要依赖于基于蒸馏的框架，将原始的多步去噪过程压缩为少步生成器。然而，这些方法本质上限制了学生模仿一个更强的多步教师，将教师的表现作为学生性能的上限。我们认为，引入偏好对齐意识可以使学生优化朝向奖励偏好的生成质量，潜在地超越教师，而不是被限制于僵化的教师模仿。为此，我们提出了基于奖励意识的轨迹塑造（RATS），这是一个轻量级的偏好对齐少步生成框架。具体而言，教师和学生的潜在轨迹在关键去噪阶段通过视界匹配进行对齐，同时引入奖励意识门，根据教师的相对奖励表现自适应调节教师指导。当教师获得更高的奖励时，轨迹塑造得到加强；而当学生匹配或超越教师时，轨迹塑造则得到放松，从而实现持续的奖励驱动改进。通过无缝集成轨迹蒸馏、奖励意识门和偏好对齐，RATS有效地从高步生成器中转移与偏好相关的知识，而不增加额外的测试时间计算开销。实验结果表明，RATS显著改善了少步视觉生成中的效率-质量权衡，显著缩小了少步学生与更强多步生成器之间的差距。

cs.CV / 59 / 2604.14914

Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

超越提示：无条件的三维反演用于分布外形状

Chen, Victoria Yue, Pierson, Emery, Maillard, Léopold, Ovsjanikov, Maks

Abstract

Text-driven inversion of generative models is a core paradigm for manipulating 2D or 3D content, unlocking numerous applications such as text-based editing, style transfer, or inverse problems. However, it relies on the assumption that generative models remain sensitive to natural language prompts. We demonstrate that for state-of-the-art native text-to-3D generative models, this assumption often collapses. We identify a critical failure mode where generation trajectories are drawn into latent "sink traps": regions where the model becomes insensitive to prompt modifications. In these regimes, changes to the input text fail to alter internal representations in a way that alters the output geometry. Crucially, we observe that this is not a limitation of the model's geometric expressivity; the same generative models possess the ability to produce a vast diversity of shapes but, as we demonstrate, become insensitive to out-of-distribution text guidance. We investigate this behavior by analyzing the sampling trajectories of the generative model, and find that complex geometries can still be represented and produced by leveraging the model's unconditional generative prior. This leads to a more robust framework for text-based 3D shape editing that bypasses latent sinks by decoupling a model's geometric representation power from its linguistic sensitivity. Our approach addresses the limitations of current 3D pipelines and enables high-fidelity semantic manipulation of out-of-distribution 3D shapes. Project webpage: https://daidedou.sorpi.fr/publication/beyondprompts

Chinese Translation

基于文本的生成模型反演是操控二维或三维内容的核心范式，解锁了诸多应用，如基于文本的编辑、风格迁移或逆问题。然而，这一方法依赖于生成模型对自然语言提示的敏感性这一假设。我们证明，对于最先进的原生文本到三维生成模型，这一假设往往会崩溃。我们识别出一种关键的失败模式，其中生成轨迹被吸引到潜在的“沉陷陷阱”中：模型对提示修改变得不敏感的区域。在这些情况下，输入文本的变化未能以改变输出几何形状的方式改变内部表示。至关重要的是，我们观察到这并不是模型的几何表现力的限制；同样的生成模型具备产生多样形状的能力，但正如我们所展示的，变得对分布外的文本指导不敏感。我们通过分析生成模型的采样轨迹来研究这种行为，发现复杂几何形状仍然可以通过利用模型的无条件生成先验来表示和生成。这导致了一种更为稳健的基于文本的三维形状编辑框架，通过将模型的几何表示能力与其语言敏感性解耦，从而绕过潜在的沉陷。我们的方法解决了当前三维管道的局限性，使得对分布外三维形状的高保真语义操控成为可能。项目网页：https://daidedou.sorpi.fr/publication/beyondprompts

cs.CV / 60 / 2604.14928

Hybrid Latents -- Geometry-Appearance-Aware Surfel Splatting

混合潜变量 -- 几何-外观感知的表面点溅射

Kelkar, Neel, Niedermayr, Simon, Engel, Klaus, Westermann, Rüdiger

Abstract

We introduce a hybrid Gaussian-hash-grid radiance representation for reconstructing 2D Gaussian scene models from multi-view images. Similar to NeST splatting, our approach reduces the entanglement between geometry and appearance common in NeRF-based models, but adds per-Gaussian latent features alongside hash-grid features to bias the optimizer toward a separation of low- and high-frequency scene components. This explicit frequency-based decomposition reduces the tendency of high-frequency texture to compensate for geometric errors. Encouraging Gaussians with hard opacity falloffs further strengthens the separation between geometry and appearance, improving both geometry reconstruction and rendering efficiency. Finally, probabilistic pruning combined with a sparsity-inducing BCE opacity loss allows redundant Gaussians to be turned off, yielding a minimal set of Gaussians sufficient to represent the scene. Using both synthetic and real-world datasets, we compare against the state of the art in Gaussian-based novel-view synthesis and demonstrate superior reconstruction fidelity with an order of magnitude fewer primitives.

Chinese Translation

我们提出了一种混合高斯哈希网格辐射表示，用于从多视图图像重建二维高斯场景模型。与 NeST 溅射类似，我们的方法减少了 NeRF 基础模型中几何与外观之间的纠缠，但在哈希网格特征的基础上添加了每个高斯的潜变量特征，以引导优化器朝着低频和高频场景成分的分离。这种显式的基于频率的分解减少了高频纹理补偿几何误差的倾向。鼓励具有硬不透明度衰减的高斯进一步增强了几何与外观之间的分离，提高了几何重建和渲染效率。最后，结合稀疏诱导的 BCE 不透明度损失的概率修剪允许关闭冗余高斯，从而获得一个最小的高斯集合，足以表示场景。使用合成和真实世界数据集，我们与高斯基础的新视图合成的最先进技术进行了比较，并展示了以数量级更少的原语实现更优的重建保真度。

cs.CV / 61 / 2604.14933

Generative Data Augmentation for Skeleton Action Recognition

用于骨架动作识别的生成数据增强

Dong, Xu, Li, Wanqing, Adeyemi-Ejeye, Anthony, Gilbert, Andrew

Abstract

Skeleton-based human action recognition is a powerful approach for understanding human behaviour from pose data, but collecting large-scale, diverse, and well-annotated 3D skeleton datasets is both expensive and labor-intensive. To address this challenge, we propose a conditional generative pipeline for data augmentation in skeleton action recognition. Our method learns the distribution of real skeleton sequences under the constraint of action labels, enabling the synthesis of diverse and high-fidelity data. Even with limited training samples, it can effectively generate skeleton sequences and achieve competitive recognition performance in low-data scenarios, demonstrating strong generalisation in downstream tasks. Specifically, we introduce a Transformer-based encoder-decoder architecture, combined with a generative refinement module and a dropout mechanism, to balance fidelity and diversity during sampling. Experiments on HumanAct12 and the refined NTU-RGBD (NTU-VIBE) dataset show that our approach consistently improves the accuracy of multiple skeleton-based action recognition models, validating its effectiveness in both few-shot and full-data settings. The source code is publicly available.

Chinese Translation

基于骨架的人类动作识别是一种强大的方法，用于从姿态数据中理解人类行为，但收集大规模、多样化且标注良好的3D骨架数据集既昂贵又劳动密集。为了解决这一挑战，我们提出了一种用于骨架动作识别的数据增强的条件生成管道。我们的方法在动作标签的约束下学习真实骨架序列的分布，从而能够合成多样化且高保真的数据。即使在训练样本有限的情况下，它也能有效生成骨架序列，并在低数据场景中实现具有竞争力的识别性能，展示出在下游任务中的强泛化能力。具体而言，我们引入了一种基于Transformer的编码器-解码器架构，结合生成细化模块和丢弃机制，以在采样过程中平衡保真度和多样性。在HumanAct12和精炼的NTU-RGBD (NTU-VIBE) 数据集上的实验表明，我们的方法始终提高了多种基于骨架的动作识别模型的准确性，验证了其在少样本和全数据设置下的有效性。源代码可以在此找到。

cs.CV / 62 / 2604.14951

RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

RaTA-Tool：基于检索的多模态大语言模型工具选择

Mattioli, Gabriele, Turri, Evelyn, Sarto, Sara, Baraldi, Lorenzo, Cornia, Marcella, Baraldi, Lorenzo, Cucchiara, Rita

Abstract

Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
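The retrieval step can be illustrated with a toy example: a task description is matched against a registry of tool descriptions, and a new tool is supported simply by adding its description. This sketch assumes bag-of-words cosine similarity as a stand-in for whatever text encoder the authors use, and the tool names and descriptions are invented. The MLLM that turns a multimodal query into the task description is not modelled.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lower-cased bag of words (stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_tool(task_description, tool_descriptions):
    """Return the name of the tool whose description best matches the task."""
    query = embed(task_description)
    return max(tool_descriptions,
               key=lambda name: cosine(query, embed(tool_descriptions[name])))


# Hypothetical registry of machine-readable tool descriptions.
tools = {
    "depth-estimator": "estimate depth map from a single image",
    "speech-recognizer": "transcribe speech audio into text",
}
first_choice = retrieve_tool("estimate the depth of objects in this image", tools)

# Open-world extension: register an unseen tool without any retraining.
tools["image-captioner"] = "generate a text caption describing an image"
second_choice = retrieve_tool("generate a caption describing this image", tools)
```

Because selection depends only on the descriptions, the quality of the structured task description produced upstream largely determines which tool is retrieved.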

Chinese Translation

工具学习与基础模型旨在赋予人工智能系统调用外部资源的能力——例如API、计算工具和专业模型——以解决超出独立语言生成能力的复杂任务。尽管最近在大型语言模型（LLMs）和多模态大型语言模型（MLLMs）方面的进展扩展了它们的推理和感知能力，但现有的工具使用方法主要限于仅文本输入和封闭世界设置。因此，它们在解释多模态用户指令方面面临困难，并且无法推广到训练期间未见过的工具。在本研究中，我们介绍了RaTA-Tool，一个用于开放世界多模态工具选择的新框架。我们的方法不是学习用户查询与固定工具标识符之间的直接映射，而是使MLLM能够将多模态查询转换为结构化任务描述，并随后通过将该表示与语义丰富的机器可读工具描述进行匹配来检索最合适的工具。这种基于检索的形式自然支持向新工具的扩展，而无需重新训练。为了进一步改善任务描述与工具选择之间的对齐，我们引入了一个基于偏好的优化阶段，使用直接偏好优化（Direct Preference Optimization, DPO）。为了支持该领域的研究，我们还介绍了第一个开放世界多模态工具使用数据集，包含来自Hugging Face模型卡的标准化工具描述。大量实验表明，我们的方法显著提高了工具选择性能，特别是在开放世界多模态场景中。
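
The retrieval step described above, matching a structured task description against machine-readable tool descriptions, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the bag-of-words `embed` function stands in for a learned encoder, and the tool names and descriptions are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a learned text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tool(task_description, tool_descriptions):
    """Return the name of the tool whose description best matches the task."""
    query = embed(task_description)
    return max(
        tool_descriptions,
        key=lambda name: cosine(query, embed(tool_descriptions[name])),
    )


# Hypothetical tool registry: name -> natural-language description.
tools = {
    "image_captioner": "generate a caption describing an image",
    "speech_recognizer": "transcribe speech audio into text",
}
selected = select_tool("describe the content of this image with a caption", tools)

# Open-world extension: a new tool is registered by adding its description only;
# nothing is retrained.
tools["depth_estimator"] = "estimate a depth map from a single image"
selected_after = select_tool("estimate depth for this image", tools)
```

Because selection depends only on the description text, adding an entry to the registry is enough to make a new tool retrievable, which is the extensibility property the abstract emphasizes.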

cs.CV / 63 / 2604.14953

Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation

从提示到手势：测量图像到视频指示性手势生成的能力

Ali, Hassan, Jirak, Doreen, Müller, Luca, Wermter, Stefan

Abstract

Gesture recognition research, unlike NLP, continues to face acute data scarcity, with progress constrained by the need for costly human recordings or image processing approaches that cannot generate authentic variability in the gestures themselves. Recent advancements in image-to-video foundation models have enabled the generation of photorealistic, semantically rich videos guided by natural language. These capabilities open up new possibilities for creating effort-free synthetic data, raising the critical question of whether video Generative AI models can augment and complement traditional human-generated gesture data. In this paper, we introduce and analyze prompt-based video generation to construct a realistic deictic gestures dataset and rigorously evaluate its effectiveness for downstream tasks. We propose a data generation pipeline that produces deictic gestures from a small number of reference samples collected from human participants, providing an accessible approach that can be leveraged both within and beyond the machine learning community. Our results demonstrate that the synthetic gestures not only align closely with real ones in terms of visual fidelity but also introduce meaningful variability and novelty that enrich the original data, further supported by superior performance of various deep models using a mixed dataset. These findings highlight that image-to-video techniques, even in their early stages, offer a powerful zero-shot approach to gesture synthesis with clear benefits for downstream tasks.

Chinese Translation

手势识别研究与自然语言处理（NLP）不同，仍面临严重的数据稀缺问题，进展受到昂贵的人类录音或无法生成手势本身真实变异的图像处理方法的限制。最近，图像到视频基础模型的进展使得通过自然语言生成逼真且语义丰富的视频成为可能。这些能力为创建无努力的合成数据开辟了新的可能性，提出了一个关键问题：视频生成性人工智能（Generative AI）模型是否能够增强并补充传统的人类生成手势数据。本文介绍并分析了基于提示的视频生成，以构建一个真实的指示性手势数据集，并严格评估其在下游任务中的有效性。我们提出了一种数据生成管道，该管道从少量收集自人类参与者的参考样本中生成指示性手势，提供了一种可在机器学习社区内外利用的可接近的方法。我们的结果表明，合成手势在视觉保真度方面不仅与真实手势紧密对齐，而且引入了有意义的变异性和新颖性，丰富了原始数据，进一步得到各种深度模型在混合数据集上表现优越的支持。这些发现突显了图像到视频技术，即使在早期阶段，也为手势合成提供了一种强大的零样本方法，对下游任务具有明显的益处。

cs.CV / 64 / 2604.14958

Frequency-Enhanced Dual-Subspace Networks for Few-Shot Fine-Grained Image Classification

频率增强的双子空间网络用于少样本细粒度图像分类

Wang, Meijia, Wang, Guochao, Chu, Haozhen, Yao, Bin, Zhang, Weichuan, Wang, Yuan, Yang, Junpo

Abstract

Few-shot fine-grained image classification aims to recognize subcategories with high visual similarity using only a limited number of annotated samples. Existing metric learning-based methods typically rely solely on spatial domain features. Confined to this single perspective, models inevitably suffer from inherent texture biases, entangling essential structural details with high-frequency background noise. Furthermore, lacking cross-view geometric constraints, single-view metrics tend to overfit this noise, resulting in structural instability under few-shot conditions. To address these issues, this paper proposes the Frequency-Enhanced Dual-Subspace Network (FEDSNet). Specifically, FEDSNet utilizes the Discrete Cosine Transform (DCT) and a low-pass filtering mechanism to explicitly isolate low-frequency global structural components from spatial features, thereby suppressing background interference. Truncated Singular Value Decomposition (SVD) is employed to construct independent, low-rank linear subspaces for both spatial texture and frequency structural features. An adaptive gating mechanism is designed to dynamically fuse the projection distances from these dual views. This strategy leverages the structural stability of the frequency subspace to prevent the spatial subspace from overfitting to background features. Extensive experiments on four benchmark datasets - CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC-Aircraft - demonstrate that FEDSNet exhibits excellent classification performance and robustness, achieving highly competitive results compared to existing metric learning algorithms. Complexity analysis further confirms that the proposed network achieves a favorable balance between high accuracy and computational efficiency, providing an effective new paradigm for few-shot fine-grained visual recognition.

Chinese Translation

少样本细粒度图像分类旨在仅使用有限数量的标注样本来识别具有高度视觉相似性的子类别。现有的基于度量学习的方法通常仅依赖于空间域特征。局限于这一单一视角，模型不可避免地受到固有纹理偏差的影响，将重要的结构细节与高频背景噪声纠缠在一起。此外，缺乏跨视图几何约束的单视图度量往往会对这种噪声过拟合，导致在少样本条件下的结构不稳定。为了解决这些问题，本文提出了频率增强的双子空间网络（Frequency-Enhanced Dual-Subspace Network，FEDSNet）。具体而言，FEDSNet利用离散余弦变换（Discrete Cosine Transform，DCT）和低通滤波机制，明确地将低频全局结构成分从空间特征中隔离，从而抑制背景干扰。采用截断奇异值分解（Truncated Singular Value Decomposition，SVD）构建独立的低秩线性子空间，以处理空间纹理和频率结构特征。设计了一种自适应门控机制，动态融合来自这两个视图的投影距离。这一策略利用频率子空间的结构稳定性，防止空间子空间对背景特征的过拟合。在四个基准数据集（CUB-200-2011、斯坦福汽车（Stanford Cars）、斯坦福狗（Stanford Dogs）和FGVC-飞机（FGVC-Aircraft））上的大量实验表明，FEDSNet展现了出色的分类性能和鲁棒性，与现有的度量学习算法相比，取得了高度竞争的结果。复杂性分析进一步确认，所提出的网络在高准确性和计算效率之间达成了良好的平衡，为少样本细粒度视觉识别提供了一种有效的新范式。
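
The DCT low-pass step described above can be illustrated with a small sketch. It covers only the frequency-filtering idea, on a 1D signal for brevity (the paper operates on spatial feature maps); the subspace construction and gating are not shown, and the function names and the `keep` parameter are assumptions made for this example.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1D signal."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(
            x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal)
        )
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * c * math.cos(math.pi / n * (i + 0.5) * k)
        signal.append(total)
    return signal


def low_pass(signal, keep):
    """Zero every DCT coefficient with index >= keep, then transform back."""
    coeffs = dct(signal)
    filtered = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    return idct(filtered)


# A constant "structure" of 1.0 plus alternating high-frequency "noise" of +/-0.5.
signal = [1.5, 0.5, 1.5, 0.5, 1.5, 0.5, 1.5, 0.5]
smooth = low_pass(signal, keep=1)               # keeps only the mean component
restored = low_pass(signal, keep=len(signal))   # keeping everything is lossless
```

With `keep=1` only the mean survives, so the alternating component is removed entirely; larger values of `keep` retain progressively finer structure.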

cs.CV / 65 / 2604.14967

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

UniDoc-RL：具有层次化动作和密集奖励的粗到细视觉检索增强生成

Wang, Jun, Tan, Shuo, Sun, Zelong, Gu, Tiancheng, Zhao, Yongle, Feng, Ziyong, Yang, Kaicheng, Lu, Cewu

Abstract

Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.

Chinese Translation

检索增强生成（RAG）通过外部视觉知识扩展了大型视觉语言模型（LVLMs）。然而，现有的视觉 RAG 系统通常依赖于通用的检索信号，忽视了复杂推理所需的细粒度视觉语义。为了解决这一局限性，我们提出了 UniDoc-RL，一个统一的强化学习框架，其中 LVLM 代理共同执行检索、重排序、主动视觉感知和推理。UniDoc-RL 将视觉信息获取形式化为一个具有层次化动作空间的顺序决策问题。具体而言，它逐步从粗粒度文档检索精炼视觉证据到细粒度图像选择和主动区域裁剪，使模型能够抑制无关内容并关注信息密集区域。为了有效的端到端训练，我们引入了一种密集多奖励机制，为每个动作提供任务感知的监督。基于组相对策略优化（GRPO），UniDoc-RL 在不依赖单独价值网络的情况下，将代理行为与多个目标对齐。为了支持这一训练范式，我们整理了一个高质量推理轨迹的综合数据集，包含细粒度的动作注释。在三个基准上的实验表明，UniDoc-RL 始终超越最先进的基线，较之前的基于 RL 的方法提高了多达 17.7%。
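
The value-network-free training signal mentioned above can be sketched as follows. GRPO scores each rollout relative to the other rollouts sampled for the same query; the reward names, the weights, and the weighted-sum combination below are assumptions made for illustration, since the abstract does not specify how the dense rewards are combined.

```python
import statistics


def combine_rewards(reward_terms, weights):
    """Weighted sum of per-action reward terms (assumed combination rule)."""
    return sum(weights[name] * value for name, value in reward_terms.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group.

    The population standard deviation is used here; implementations differ on
    this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward terms for three rollouts sampled for the same query.
weights = {"retrieval": 0.25, "crop": 0.25, "answer": 0.5}
rollouts = [
    {"retrieval": 1.0, "crop": 0.5, "answer": 1.0},
    {"retrieval": 1.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 0.0, "crop": 0.0, "answer": 0.0},
]
totals = [combine_rewards(r, weights) for r in rollouts]
advantages = group_relative_advantages(totals)
```

The group mean acts as the baseline, which is why no separate value network is needed: a rollout is reinforced only to the extent that it beats its siblings.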

cs.CV / 66 / 2604.15003

Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

真相流动：图像到视频生成的主动时间取证

Chen, Yuzhuo, Ma, Zehua, Fang, Han, Wang, Hengyi, Wang, Guanjie, Zhang, Weiming

Abstract

The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present **Flow of Truth**, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as *the motion of pixels through time rather than the synthesis of frames*. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.

Chinese Translation

图像到视频（I2V）生成的快速崛起使得可以从单张图像创建逼真的视频，但也带来了新的取证需求。与静态图像不同，I2V 内容随着时间的推移而演变，这要求取证工作超越二维像素级篡改定位，转向追踪像素在视频中的流动和变换。随着帧的推进，嵌入的痕迹漂移和变形，使得传统的空间取证方法失效。为了解决这一未被探索的维度，我们提出了 **Flow of Truth**，这是第一个专注于 I2V 生成中的时间取证的主动框架。一个关键挑战在于发现一个能够与生成过程一致演变的取证特征，而这一过程本质上是一种创造性的转变，而非确定性的重建。尽管存在这一内在困难，我们创新性地将视频生成重新定义为 *像素随时间的运动，而非帧的合成*。基于这一观点，我们提出了一种可学习的取证模板，该模板跟踪像素运动，以及一个模板引导的流动模块，能够将运动与图像内容解耦，从而实现稳健的时间追踪。实验表明，Flow of Truth 在商业和开源 I2V 模型中具有良好的泛化能力，显著提高了时间取证的性能。

cs.CV / 67 / 2604.15027

Quality-Aware Calibration for AI-Generated Image Detection in the Wild

针对野外环境中AI生成图像检测的质量感知校准

Guillaro, Fabrizio, De Rosa, Vincenzo, Cozzolino, Davide, Verdoliva, Luisa

Abstract

Significant progress has been made in detecting synthetic images; however, most existing approaches operate on a single image instance and overlook a key characteristic of real-world dissemination: as viral images circulate on the web, multiple near-duplicate versions appear and lose quality due to repeated operations such as recompression, resizing, and cropping. As a consequence, the same image may yield inconsistent forensic predictions depending on which version is analyzed. To address this issue, we propose QuAD (Quality-Aware calibration with near-Duplicates), a novel framework that makes decisions based on all available near-duplicates of the same image. Given a query, we retrieve its online near-duplicates and feed them to a detector; the resulting scores are then aggregated based on the estimated quality of the corresponding instance. By doing so, we take advantage of all pieces of information while accounting for the reduced reliability of images impaired by multiple processing steps. To support large-scale evaluation, we introduce two datasets: AncesTree, an in-lab dataset of 136k images organized in stochastic degradation trees that simulate online reposting dynamics, and ReWIND, a real-world dataset of nearly 10k near-duplicate images collected from viral web content. Experiments on several state-of-the-art detectors show that our quality-aware fusion improves their performance consistently, with an average gain of around 8% in balanced accuracy compared to plain averaging. Our results highlight the importance of jointly processing all the images available online to achieve reliable detection of AI-generated content in real-world applications. Code and data are publicly available at https://grip-unina.github.io/QuAD/

Chinese Translation

在合成图像检测方面已经取得了显著进展，然而大多数现有方法仅针对单个图像实例进行操作，忽视了现实世界传播的一个关键特征：随着病毒图像在网络上的传播，多个近似重复版本会出现，并因重复操作（如重新压缩、调整大小和裁剪）而失去质量。因此，同一图像可能会根据分析的版本产生不一致的法医预测。在本研究中，为了解决这一问题，我们提出了QuAD（Quality-Aware calibration with near-Duplicates），一个新颖的框架，基于同一图像的所有可用近似重复版本做出决策。给定一个查询，我们检索其在线近似重复版本并将其输入检测器：然后根据相应实例的估计质量对结果评分进行聚合。通过这种方式，我们利用了所有信息，同时考虑到因多次处理步骤而降低的图像可靠性。为了支持大规模评估，我们引入了两个数据集：AncesTree，一个包含136k图像的实验室数据集，按随机降级树组织，模拟在线转发动态；以及ReWIND，一个从病毒网络内容中收集的近10k近似重复图像的真实世界数据集。在多个最先进的检测器上的实验表明，我们的质量感知融合方法一致提高了它们的性能，与简单平均相比，平衡准确率平均提高约8%。我们的结果强调了联合处理在线可用的所有图像以实现可靠的AI生成内容检测在现实世界应用中的重要性。代码和数据可在 https://grip-unina.github.io/QuAD/ 获取。
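
The quality-aware aggregation described above can be sketched as follows. The abstract does not state the fusion rule, so the softmax weighting over estimated quality, the `temperature` parameter, and the example numbers are all assumptions chosen for illustration.

```python
import math


def plain_average(scores):
    """Baseline fusion: every near-duplicate counts equally."""
    return sum(scores) / len(scores)


def quality_weighted_score(scores, qualities, temperature=1.0):
    """Fuse detector scores with softmax weights over estimated quality.

    Assumed rule: higher-quality near-duplicates receive larger weights.
    """
    top = max(qualities)
    weights = [math.exp((q - top) / temperature) for q in qualities]
    total = sum(weights)
    return sum(s * w / total for s, w in zip(scores, weights))


# Hypothetical detector scores (probability of "synthetic") for three
# near-duplicates of one image, with their estimated quality in [0, 1].
scores = [0.9, 0.6, 0.3]
qualities = [0.9, 0.5, 0.1]

fused = quality_weighted_score(scores, qualities, temperature=0.2)
baseline = plain_average(scores)
```

In this toy case the heavily degraded copies pull the plain average toward an uncertain score, while the weighted fusion stays close to the prediction on the best-preserved copy.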

cs.CV / 68 / 2604.15047

Implicit Neural Representations: A Signal Processing Perspective

隐式神经表示：信号处理视角

Jayasundara, Dhananjaya, Patel, Vishal M.

Abstract

Implicit neural representations (INRs) mark a fundamental shift in signal modeling, moving from discrete sampled data to continuous functional representations. By parameterizing signals as neural networks, INRs provide a unified framework for representing images, audio, video, 3D geometry, and beyond as continuous functions of their coordinates. This functional viewpoint enables signal operations such as differentiation to be carried out analytically through automatic differentiation rather than through discrete approximations. In this article, we examine the evolution of INRs from a signal processing perspective, emphasizing spectral behavior, sampling theory, and multiscale representation. We trace the progression from standard coordinate based networks, which exhibit a spectral bias toward low frequency components, to more advanced designs that reshape the approximation space through specialized activations, including periodic, localized, and adaptive functions. We also discuss structured representations, such as hierarchical decompositions and hash grid encodings, that improve spatial adaptivity and computational efficiency. We further highlight the utility of INRs across a broad range of applications, including inverse problems in medical and radar imaging, compression, and 3D scene representation. By interpreting INRs as learned signal models whose approximation spaces adapt to the underlying data, this article clarifies the field's core conceptual developments and outlines open challenges in theoretical stability, weight space interpretability, and large scale generalization.

Chinese Translation

隐式神经表示（INRs）标志着信号建模的根本转变，从离散采样数据转向连续函数表示。通过将信号参数化为神经网络，INRs 提供了一个统一的框架，用于将图像、音频、视频、三维几何等表示为其坐标的连续函数。这种函数视角使得信号操作（如微分）能够通过自动微分以解析方式进行，而不是通过离散近似进行。在本文中，我们从信号处理的角度考察了 INRs 的演变，强调了谱行为、采样理论和多尺度表示。我们追溯了从标准坐标基础网络的进展，这些网络对低频成分表现出谱偏向，发展到通过专门的激活函数（包括周期性、局部化和自适应函数）重塑近似空间的更先进设计。我们还讨论了结构化表示，如分层分解和哈希网格编码，这些方法提高了空间自适应性和计算效率。我们进一步强调了 INRs 在广泛应用中的实用性，包括医学和雷达成像中的逆问题、压缩和三维场景表示。通过将 INRs 解释为学习到的信号模型，其近似空间适应于基础数据，本文阐明了该领域的核心概念发展，并概述了理论稳定性、权重空间可解释性和大规模泛化等开放挑战。
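
The point about analytic differentiation can be made concrete with a tiny coordinate network. This is a minimal sketch under stated assumptions: a single hidden layer with sinusoidal activations (one of the periodic designs the abstract mentions), hand-picked parameters, and an illustrative frequency scale `omega`. The derivative is written out by hand here; an automatic differentiation framework would produce the same expression.

```python
import math


def inr(x, params, omega=30.0):
    """f(x) = sum_j a_j * sin(omega * (w_j * x + b_j)), a one-layer sine network."""
    return sum(a * math.sin(omega * (w * x + b)) for a, w, b in params)


def inr_derivative(x, params, omega=30.0):
    """Exact derivative df/dx obtained by the chain rule."""
    return sum(a * omega * w * math.cos(omega * (w * x + b)) for a, w, b in params)


# Hypothetical parameters (a_j, w_j, b_j); in practice these are learned.
params = [(0.5, 0.1, 0.0), (0.25, 0.3, 0.2), (-0.1, 1.0, -0.4)]

x = 0.37
step = 1e-5
analytic = inr_derivative(x, params)
# A discrete approximation, for comparison only.
finite_difference = (inr(x + step, params) - inr(x - step, params)) / (2 * step)
```

The analytic value is exact at any coordinate, whereas the finite-difference estimate depends on the step size, which is the contrast with discrete approximations drawn in the abstract.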

cs.CV / 69 / 2604.15059

Attention-Gated Convolutional Networks for Scanner-Agnostic Quality Assessment

基于注意力门控卷积网络的扫描仪无关质量评估

Bakhale, Chinmay, Sao, Anil

Abstract

Motion artifacts present a significant challenge in structural MRI (sMRI), often compromising clinical diagnostics and large-scale automated analysis. While manual quality control (QC) remains the gold standard, it is increasingly unscalable for massive longitudinal studies. To address this, we propose a hybrid CNN-Attention framework designed for robust, site-invariant MRI quality assessment. Our architecture integrates a hierarchical 2D CNN encoder for local spatial feature extraction with a multi-head cross-attention mechanism to model global dependencies. This synergy enables the model to prioritize motion relevant artifact signatures, such as ringing and blurring, while dynamically filtering out site-specific intensity variations and background noise. The framework was trained end-to-end on the MR-ART dataset using a balanced cohort of 200 subjects. Performance was evaluated across two tiers: Seen Site Evaluation on a held-out MR-ART partition and Unseen Site Evaluation using 200 subjects from 17 heterogeneous sites in the ABIDE archive. On seen sites, the model achieved a scan-level accuracy of 0.9920 and an F1-score of 0.9919. Crucially, it maintained strong generalization across unseen ABIDE sites (Acc = 0.755) without any retraining or fine-tuning, demonstrating high resilience to domain shift. These results indicate that attention-based feature re-weighting successfully captures universal artifact descriptors, bridging the performance gap between diverse imaging environments and scanner manufacturers.

Chinese Translation

运动伪影在结构性磁共振成像（sMRI）中带来了显著挑战，常常影响临床诊断和大规模自动化分析。尽管手动质量控制（QC）仍然是金标准，但对于大规模纵向研究而言，其可扩展性日益不足。为了解决这一问题，我们提出了一种混合卷积神经网络-注意力框架，旨在实现稳健的、场地无关的MRI质量评估。我们的架构结合了层次化的二维卷积神经网络编码器用于局部空间特征提取，以及多头交叉注意力机制以建模全局依赖关系。这种协同作用使模型能够优先考虑与运动相关的伪影特征，如振铃和模糊，同时动态过滤场地特定的强度变化和背景噪声。该框架在MR-ART数据集上进行了端到端训练，使用了200名受试者的平衡队列。性能评估分为两个层次：在保留的MR-ART分区上进行的已见场地评估，以及使用来自ABIDE档案中17个异质场地的200名受试者进行的未见场地评估。在已见场地上，模型达到了0.9920的扫描级别准确率和0.9919的F1分数。重要的是，它在未见的ABIDE场地上保持了强大的泛化能力（准确率 = 0.755），且没有进行任何重新训练或微调，显示出对领域转移的高韧性。这些结果表明，基于注意力的特征重加权成功捕捉了普遍的伪影描述符，缩小了不同成像环境和扫描仪制造商之间的性能差距。

cs.CV / 70 / 2604.15065

Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection

学习嵌入位置：噪声感知位置嵌入在小目标检测中的查询检索

Zeng, Yangchen, Yu, Zhenyu, Jiang, Dongming, Zhang, Wenbo, Hong, Yifan, Hu, Zhanhua, Luo, Jiao, Cui, Kangning

Abstract

Transformer-based detectors have advanced small-object detection, but they often remain inefficient and vulnerable to background-induced query noise, which motivates deep decoders to refine low-quality queries. We present HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that studies where to embed positional information by selectively preserving positional encodings in foreground-salient regions while suppressing background clutter. Within HELP, we introduce Heatmap-guided Positional Embedding (HPE) as the core embedding mechanism and visualize it with a heatbar for interpretable diagnosis and fine-tuning. HPE is integrated into both the encoder and decoder: it guides noise-suppressed feature encoding by injecting heatmap-aware positional encoding, and it enables high-quality query retrieval by filtering background-dominant embeddings via a gradient-based mask filter before decoding. To address feature sparsity in complex small targets, we integrate Linear-Snake Convolution to enrich retrieval-relevant representations. The gradient-based heatmap supervision is used during training only, incurring no additional gradient computation at inference. As a result, our design reduces decoder layers from eight to three and achieves a 59.4% parameter reduction (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks. Code Repository: https://github.com/yidimopozhibai/Noise-Suppressed-Query-Retrieval

Chinese Translation

基于Transformer的检测器在小目标检测方面取得了进展，但它们通常效率低下且易受背景引起的查询噪声影响，这促使深度解码器对低质量查询进行精炼。我们提出了HELP（热图引导嵌入学习范式），这是一种噪声感知的位置信息-语义融合框架，研究在何处嵌入位置信息，通过选择性保留前景显著区域的位置信息编码，同时抑制背景杂乱。在HELP中，我们引入了热图引导位置嵌入（HPE）作为核心嵌入机制，并通过热图条进行可视化，以便进行可解释的诊断和微调。HPE被集成到编码器和解码器中：它通过注入热图感知的位置信息编码来引导噪声抑制特征编码，并通过在解码前使用基于梯度的掩膜过滤器过滤背景主导的嵌入，从而实现高质量的查询检索。为了应对复杂小目标中的特征稀疏性，我们集成了线性蛇形卷积，以丰富与检索相关的表示。基于梯度的热图监督仅在训练期间使用，在推理时不会增加额外的梯度计算。因此，我们的设计将解码器层数从八层减少到三层，并在减少计算预算的情况下，在各基准测试中实现了59.4%的参数减少（66.3M对比163M），同时保持了一致的准确性提升。代码库： https://github.com/yidimopozhibai/Noise-Suppressed-Query-Retrieval

cs.CV / 71 / 2604.15088

Building Extraction from Remote Sensing Imagery under Hazy and Low-light Conditions: Benchmark and Baseline

在雾霾和低光照条件下从遥感影像中提取建筑物：基准与基线

Sang, Feifei, Lu, Wei, Chen, Hongruixuan, Chen, Sibao, Luo, Bin

Abstract

Building extraction from optical Remote Sensing (RS) imagery suffers from performance degradation under real-world hazy and low-light conditions. However, existing optical methods and benchmarks focus primarily on ideal clear-weather conditions. While SAR offers all-weather sensing, its side-looking geometry causes geometric distortions. To address these challenges, we introduce HaLoBuilding, the first optical benchmark specifically designed for building extraction under hazy and low-light conditions. By leveraging a same-scene multitemporal pairing strategy, we ensure pixel-level label alignment and high fidelity even under extreme degradation. Building upon this benchmark, we propose HaLoBuild-Net, a novel end-to-end framework for building extraction in adverse RS scenarios. At its core, we develop a Spatial-Frequency Focus Module (SFFM) to effectively mitigate meteorological interference on building features by coupling large receptive field attention with frequency-aware channel reweighting guided by stable low-frequency anchors. Additionally, a Global Multi-scale Guidance Module (GMGM) provides global semantic constraints to anchor building topologies, while a Mutual-Guided Fusion Module (MGFM) implements bidirectional semantic-spatial calibration to suppress shallow noise and sharpen weather-induced blurred boundaries. Extensive experiments demonstrate that HaLoBuild-Net significantly outperforms state-of-the-art methods and conventional cascaded restoration-segmentation paradigms on the HaLoBuilding dataset, while maintaining robust generalization on WHU, INRIA, and LoveDA datasets. The source code and datasets are publicly available at: https://github.com/AeroVILab-AHU/HaLoBuilding.

Chinese Translation

从光学遥感（RS）影像中提取建筑物在现实世界的雾霾和低光照条件下表现不佳。然而，现有的光学方法和基准主要集中在理想的晴朗天气条件下。尽管合成孔径雷达（SAR）提供全天候感知，但其侧视几何形状会导致几何失真。为了解决这些挑战，我们提出了HaLoBuilding，这是第一个专门为雾霾和低光照条件下的建筑物提取设计的光学基准。通过利用同场景的多时相配对策略，我们确保了像素级标签对齐和在极端退化情况下的高保真度。在此基准的基础上，我们提出了HaLoBuild-Net，一个新颖的端到端框架，用于在不利的遥感场景中提取建筑物。其核心是开发了一个空间频率聚焦模块（SFFM），通过将大接收场注意力与由稳定低频锚点引导的频率感知通道重标定相结合，有效减轻气象干扰对建筑特征的影响。此外，全球多尺度引导模块（GMGM）提供全球语义约束以锚定建筑拓扑，而互引导融合模块（MGFM）实现双向语义-空间校准，以抑制浅层噪声并锐化天气引起的模糊边界。大量实验表明，HaLoBuild-Net在HaLoBuilding数据集上显著优于最先进的方法和传统的级联恢复-分割范式，同时在WHU、INRIA和LoveDA数据集上保持了强大的泛化能力。源代码和数据集可在以下网址公开获取：https://github.com/AeroVILab-AHU/HaLoBuilding。

cs.CV / 72 / 2604.15090

Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID

超越视觉线索：基于语义驱动的标记过滤和专家路由的随时人物重识别

Li, Jiaxuan, Wen, Xin, Li, Zhihang

Abstract

Any-Time Person Re-identification (AT-ReID) requires robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods rely heavily on purely visual features, which are prone to change with environmental and temporal factors, resulting in significant performance deterioration under illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants to drive the semantic model. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results. Moreover, the model trained on AT-USTC was evaluated on 5 widely used ReID benchmarks, demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.

Chinese Translation

随时人物重识别（Any-Time Person Re-identification, AT-ReID）要求在任意条件下稳健地检索目标个体，包括模态变化（白天和夜间）以及广泛的换衣场景，涵盖短期到长期的时间间隔。然而，现有方法高度依赖纯视觉特征，这些特征容易受到环境和时间因素的影响，导致在涉及因光照变化而引起的模态变化或换衣场景下性能显著下降。本文提出了一种基于语义驱动的标记过滤和专家路由（Semantic-driven Token Filtering and Expert Routing, STFER）的新框架，该框架利用大型视觉语言模型（Large Vision-Language Models, LVLMs）生成身份一致性文本的能力，提供对换衣变化和RGB与红外（IR）之间的跨模态变化具有鲁棒性的身份区分特征。具体而言，我们采用指令引导LVLM生成捕捉生物特征常量的身份内在语义文本，以驱动语义模型。文本标记进一步用于语义驱动的视觉标记过滤（Semantic-driven Visual Token Filtering, SVTF），增强信息丰富的视觉区域并抑制冗余的背景噪声。同时，文本标记也用于语义驱动的专家路由（Semantic-driven Expert Routing, SER），将语义文本整合到专家路由中，从而实现更稳健的多场景门控。在随时重识别数据集（Any-Time ReID dataset, AT-USTC）上的大量实验表明，我们的模型达到了最先进的结果。此外，在AT-USTC上训练的模型在5个广泛使用的重识别基准上进行了评估，展示了优越的泛化能力和高度竞争的结果。我们的代码将很快公开。

cs.CV / 73 / 2604.15096

Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography

超越独立帧：用于多视角超声心动图的潜在注意力掩蔽自编码器

Böhi, Simon, Cannistraci, Irene, Gonzalez, Sergio Muñoz, Vandenhirtz, Moritz, Laguna, Sonia, Ruiperez-Campillo, Samuel, Krähenmann, Max, Agostini, Andrea, Ozkan, Ece, Sutter, Thomas M., Vogt, Julia E.

Abstract

Echocardiography is a widely used modality for cardiac assessment due to its non-invasive and cost-effective nature, but the sparse and heterogeneous spatiotemporal views of the heart pose distinct challenges. Existing masked autoencoder (MAE) approaches typically process images or short clips independently, failing to capture the inherent multi-view structure required for coherent cardiac representation. We introduce Latent Attention Masked Autoencoder (LAMAE), a foundation model architecture tailored to the multi-view nature of medical imaging. LAMAE augments the standard MAE with a latent attention module that enables information exchange across frames and views directly in latent space. This allows the model to aggregate variable-length sequences and distinct views, reconstructing a holistic representation of cardiac function from partial observations. We pretrain LAMAE on MIMIC-IV-ECHO, a large-scale, uncurated dataset reflecting real-world clinical variability. To the best of our knowledge, we present the first results for predicting ICD-10 codes from MIMIC-IV-ECHO videos. Furthermore, we empirically demonstrate that representations learned from adult data transfer effectively to pediatric cohorts despite substantial anatomical differences. These results provide evidence that incorporating structural priors, such as multi-view attention, yields significantly more robust and transferable representations.

Chinese Translation

超声心动图因其非侵入性和成本效益而广泛用于心脏评估，但心脏的稀疏和异质时空视图带来了独特的挑战。现有的掩蔽自编码器（Masked Autoencoder, MAE）方法通常独立处理图像或短片段，未能捕捉到一致的心脏表示所需的内在多视角结构。我们提出了潜在注意力掩蔽自编码器（Latent Attention Masked Autoencoder, LAMAE），这是一种针对医学成像多视角特性而设计的基础模型架构。LAMAE通过潜在注意力模块增强了标准MAE，使得信息能够直接在潜在空间中跨帧和视图进行交换。这使得模型能够聚合可变长度的序列和不同视图，从部分观察中重建心脏功能的整体表示。我们在MIMIC-IV-ECHO上对LAMAE进行了预训练，这是一个大型、未经整理的数据集，反映了现实世界的临床变异性。据我们所知，我们首次展示了从MIMIC-IV-ECHO视频中预测ICD-10编码的结果。此外，我们还实证证明，从成人数据中学习到的表示能够有效转移到儿科群体，尽管存在显著的解剖差异。这些结果提供了证据，表明纳入结构先验（如多视角注意力）能够产生显著更强大和可转移的表示。

cs.CV / 74 / 2604.15134

How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos

如何正确地犯错：构建和基准测试错误意识自我中心程序视频的框架

Loginova, Olga, Keller, Frank

Abstract

Reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. In egocentric recordings, mistakes are often partially occluded by hands and revealed through subtle object state changes, while existing procedural datasets provide limited and inconsistent mistake and correction traces. We present PIE-V (Psychologically Inspired Error injection for Videos), a framework for constructing and benchmarking mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. PIE-V combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence and repairs failures. For video segment edits, PIE-V synthesizes replacement clips with text-guided video generation and stitches them into the episode to preserve visual plausibility. Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. For benchmarking, we introduce a unified taxonomy and a human rubric with nine metrics that cover step-level and procedure-level quality, including plausibility, procedure logic with annotator confidence, state change coherence, and grounding between text and video. Using this protocol, we audit several existing resources and compare PIE-V against a freeform LLM generation baseline under the same criteria. Together, the framework and rubric support post-completion verification for egocentric procedural mistake detection and correction.

Chinese Translation

可靠的视频程序监控需要接触自然发生的人为错误及其后续的恢复。在自我中心的录音中，错误常常被手部分遮挡，并通过微妙的物体状态变化显现，而现有的程序数据集提供的错误和纠正轨迹有限且不一致。我们提出了PIE-V（心理启发的错误注入视频），这是一个通过对干净的关键步骤程序进行控制的人类可信偏差增强，构建和基准测试错误意识自我中心程序视频的框架。PIE-V结合了一个基于程序阶段和语义步骤负载的心理学启发的错误规划器，一个模拟恢复行为的纠正规划器，一个执行级联一致重写的LLM写作工具，以及一个验证程序一致性和修复失败的LLM评估器。对于视频片段编辑，PIE-V通过文本引导的视频生成合成替换片段，并将其缝合到情节中以保持视觉可信性。应用于17个任务和50个Ego-Exo4D场景，PIE-V注入了102个错误并生成了27个恢复纠正。为了基准测试，我们引入了统一的分类法和一个包含九个指标的人类评估标准，涵盖了步骤级和程序级质量，包括可信性、带注释者信心的程序逻辑、状态变化一致性以及文本与视频之间的基础联系。使用该协议，我们审计了几个现有资源，并在相同标准下将PIE-V与自由形式的LLM生成基线进行了比较。整体而言，该框架和评估标准支持自我中心程序错误检测和纠正的后完成验证。

cs.CV / 75 / 2604.15141

KVNN: Learnable Multi-Kernel Volterra Neural Networks

KVNN：可学习的多核Volterra神经网络

Yun, Haoyu, Krim, Hamid, Bao, Yufang

Abstract

Higher-order learning is fundamentally rooted in exploiting compositional features. It clearly hinges on enriching the representation by more elaborate interactions of the data which, in turn, tends to increase the model complexity of conventional large-scale deep learning models. In this paper, a kernelized Volterra Neural Network (kVNN) is proposed. The key to the achieved efficiency lies in using a learnable multi-kernel representation, where different interaction orders are modeled by distinct polynomial-kernel components with compact, learnable centers, yielding an order-adaptive parameterization. Features are learned by the composition of layers, each of which consists of parallel branches of different polynomial orders, enabling kVNN filters to directly replace standard convolutional kernels within existing architectures. The theoretical results are substantiated by experiments on two representative tasks: video action recognition and image denoising. The results demonstrate favorable performance-efficiency trade-offs: kVNN consistently yields reduced model (parameters) and computational (GFLOPs) complexity with competitive and often improved performance. These results are maintained even when trained from scratch without large-scale pretraining. In summary, we substantiate that structured kernelized higher-order layers offer a practical path to balancing expressivity and computational cost in modern deep networks.

Chinese Translation

高阶学习的基础在于利用组合特征。它显然依赖于通过数据的更复杂交互来丰富表示，这反过来往往会增加传统大规模深度学习模型的复杂性。本文提出了一种核化的Volterra神经网络（kVNN）。实现效率的关键在于使用可学习的多核表示，其中不同的交互阶数由具有紧凑、可学习中心的不同多项式核组件建模，从而实现阶数自适应参数化。特征通过层的组合进行学习，每一层由不同多项式阶数的并行分支组成，使得kVNN滤波器能够直接替代现有架构中的标准卷积核。理论结果通过在两个代表性任务上的实验得到了验证：视频动作识别和图像去噪。结果表明，kVNN在性能与效率之间实现了良好的权衡：kVNN始终在模型（参数）和计算（GFLOPs）复杂性上减少，同时保持竞争力且通常改善的性能。这些结果在从头开始训练而不进行大规模预训练时依然得以维持。总之，我们证明了结构化的核化高阶层为在现代深度网络中平衡表达能力和计算成本提供了一条实用路径。

cs.CV / 76 / 2604.15166

Class Unlearning via Depth-Aware Removal of Forget-Specific Directions

通过深度感知去除遗忘特定方向实现类别遗忘

Hatami, Arman, Aalishah, Romina, Monosov, Ilya E.

Abstract

Machine unlearning aims to remove targeted knowledge from a trained model without the cost of retraining from scratch. In class unlearning, however, reducing accuracy on forget classes does not necessarily imply true forgetting: forgotten information can remain encoded in internal representations, and apparent forgetting may arise from classifier-head suppression rather than representational removal. We show that existing class-unlearning methods often exhibit weak or negative selectivity, preserve forget-class structure in deep representations, or rely heavily on final-layer bias shifts. We then introduce DAMP (Depth-Aware Modulation by Projection), a one-shot, closed-form weight-surgery method that removes forget-specific directions from a pretrained network without gradient-based optimization. At each stage, DAMP computes class prototypes in the input space of the next learnable operator, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. To preserve utility, DAMP uses a parameter-free depth-aware scaling rule derived from probe separability, applying smaller edits in early layers and larger edits in deeper layers. The method naturally extends to multi-class forgetting through low-rank subspace removal. Across MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet, and across convolutional and transformer architectures, DAMP more closely resembles the retraining gold standard than some of the prior methods, improving selective forgetting while better preserving retain-class performance and reducing residual forget-class structure in deep layers.

Chinese Translation

机器遗忘旨在从训练好的模型中去除特定知识，而无需从头开始重新训练。然而，在类别遗忘中，减少遗忘类别的准确性并不一定意味着真正的遗忘：被遗忘的信息可能仍然编码在内部表示中，而表面上的遗忘可能源于分类器头的抑制，而非表示的去除。我们展示了现有的类别遗忘方法往往表现出较弱或负的选择性，在深层表示中保留遗忘类别的结构，或过度依赖最终层的偏置变化。随后，我们引入了 DAMP（Depth-Aware Modulation by Projection），这是一种一次性、封闭形式的权重手术方法，可以在不依赖基于梯度的优化的情况下，从预训练网络中去除遗忘特定方向。在每个阶段，DAMP 在下一个可学习操作的输入空间中计算类别原型，提取相对于保留类别原型的遗忘方向作为残差，并应用基于投影的更新以减少下游对这些方向的敏感性。为了保持效用，DAMP 使用一种无参数的深度感知缩放规则，该规则源于探测可分离性，在早期层施加较小的编辑，在更深层施加较大的编辑。该方法自然扩展到通过低秩子空间去除实现多类别遗忘。在 MNIST、CIFAR-10、CIFAR-100 和 Tiny ImageNet 数据集上，以及在卷积和变换器架构中，DAMP 的表现更接近重新训练的金标准，相较于一些先前的方法，改善了选择性遗忘，同时更好地保持了保留类别的性能，并减少了深层中的残余遗忘类别结构。
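
The projection step described in the DAMP abstract can be illustrated with a minimal sketch. It assumes a single linear layer, takes the forget direction as the normalized difference between the forget-class and retain-class prototypes (a simplification of the residual construction in the abstract), and uses a hand-set strength `alpha` in place of the paper's depth-aware scaling rule. All function names are hypothetical and are not taken from the authors' code.

```python
# Hypothetical sketch: remove one "forget" direction from a linear layer
# by projection. Names and the alpha value are illustrative assumptions.

def prototype(rows):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def damp_like_edit(weight, forget_feats, retain_feats, alpha=1.0):
    """Return (edited weight rows, forget direction).

    weight: list of rows; each row maps the layer input to one output unit.
    alpha:  edit strength in [0, 1]. The abstract derives it per depth from
            probe separability; here it is simply set by hand (assumption).
    """
    # Forget direction: forget prototype minus retain prototype, normalized.
    p_forget = prototype(forget_feats)
    p_retain = prototype(retain_feats)
    d = unit([f - r for f, r in zip(p_forget, p_retain)])
    # W' = W (I - alpha * d d^T): subtract each row's component along d.
    edited = [[w - alpha * dot(row, d) * di for w, di in zip(row, d)]
              for row in weight]
    return edited, d

forget = [[2.0, 0.0], [4.0, 0.0]]   # forget-class inputs to the layer
retain = [[0.0, 1.0], [0.0, 3.0]]   # retain-class inputs to the layer
W = [[1.0, 2.0], [3.0, 4.0]]
W_edit, d = damp_like_edit(W, forget, retain, alpha=1.0)
```

With `alpha=1.0` the edited layer no longer responds to the forget direction. Smaller values only attenuate the response, which is roughly what a depth-aware rule would choose for early layers.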

cs.CV / 77 / 2604.15170

OmniLight: One Model to Rule All Lighting Conditions

OmniLight：一个模型应对所有光照条件

Oh, Youngjin, Park, Junyoung, Kwon, Junhyeong, Cho, Nam Ik

Abstract

Adverse lighting conditions, such as cast shadows and irregular illumination, pose significant challenges to computer vision systems by degrading visibility and color fidelity. Consequently, effective shadow removal and ambient lighting normalization (ALN) are critical for restoring underlying image content, improving perceptual quality, and facilitating robust performance in downstream tasks. However, while achieving state-of-the-art results on specific benchmarks is a primary goal in image restoration challenges, real-world applications often demand robust models capable of handling diverse domains. To address this, we present a comprehensive study on lighting-related image restoration by exploring two contrasting strategies. We leverage a robust framework for ALN, DINOLight, as a specialized baseline to exploit the characteristics of each individual dataset, and extend it to OmniLight, a generalized alternative incorporating our proposed Wavelet Domain Mixture-of-Experts (WD-MoE) that is trained across all provided datasets. Through a comparative analysis of these two methods, we discuss the impact of data distribution on the performance of specialized and unified architectures in lighting-related image restoration. Notably, both approaches secured top-tier rankings across all three lighting-related tracks in the NTIRE 2026 Challenge, demonstrating their outstanding perceptual quality and generalization capabilities. Our code is available at https://github.com/OBAKSA/Lighting-Restoration.

Chinese Translation

不利的光照条件，如投射阴影和不规则照明，给计算机视觉系统带来了重大挑战，降低了可见性和色彩保真度。因此，有效的阴影去除和环境光照归一化（ALN）对于恢复图像内容、提高感知质量以及促进下游任务的稳健性能至关重要。然而，尽管在特定基准上取得最先进的结果是图像恢复挑战的主要目标，现实世界的应用通常需要能够处理多样化领域的稳健模型。为此，我们通过探索两种对比策略，提出了一项关于光照相关图像恢复的综合研究。我们利用一个稳健的ALN框架DINOLight，作为一个专门的基线，以利用每个数据集的特征，并将其扩展为OmniLight，一个通用的替代方案，结合了我们提出的小波域专家混合模型（Wavelet Domain Mixture-of-Experts, WD-MoE），该模型在所有提供的数据集上进行训练。通过对这两种方法的比较分析，我们讨论了数据分布对光照相关图像恢复中专门架构和统一架构性能的影响。值得注意的是，这两种方法在NTIRE 2026挑战赛的所有三个光照相关赛道中均获得了顶级排名，展示了其卓越的感知质量和泛化能力。我们的代码可在 https://github.com/OBAKSA/Lighting-Restoration 获取。

cs.CV / 78 / 2604.15171

An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation

扩散模型中正则化与福克-普朗克残差的分析

Niemann, Onno, Muñoz, Gonzalo Martínez, Gonzalez, Alberto Suárez

Abstract

Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker-Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at https://github.com/OnnoNiemann/fp_diffusion_analysis.

Chinese Translation

最近的研究表明，使用去噪得分匹配（DSM）目标训练的扩散模型往往违反了控制真实数据密度演变的福克-普朗克（FP）方程。直接在目标函数中惩罚这些偏差可以减少其幅度，但会引入显著的计算开销。此外，还观察到严格遵循FP方程并不一定会改善生成样本的质量，因为通常最佳结果是在较弱的FP正则化下获得的。本文探讨了是否可以通过更简单的惩罚项来提供类似的好处。我们对几种轻量级正则化器进行了实证分析，研究了它们对FP残差和生成质量的影响，并表明FP正则化的好处可以在显著较低的计算成本下实现。我们的代码可在 https://github.com/OnnoNiemann/fp_diffusion_analysis 获取。

cs.CV / 79 / 2604.15173

Boundary-Centric Active Learning for Temporal Action Segmentation

基于边界的主动学习用于时间动作分割

Helvaci, Halil Ismail, Cheung, Sen-ching Samson

Abstract

Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, where segmentation errors concentrate and small temporal shifts disproportionately degrade segmental metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these high-leverage boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Importantly, our annotation protocol requests labels for only the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets, with the largest gains on datasets where boundary placement dominates edit and overlap-based F1 scores.

Chinese Translation

时间动作分割（TAS）需要密集的时间监督，然而在未剪辑视频中，大部分注释成本都花费在识别和精细化动作过渡上，这些地方集中着分割错误，而小的时间偏移会不成比例地降低分段指标。我们提出了B-ACT，一个剪辑预算的主动学习框架，它明确地将监督分配到这些高影响力的边界区域。B-ACT在一个分层的两阶段循环中运行：（i）它使用预测不确定性对未标记视频进行排名和查询；（ii）在每个选定的视频中，它从当前模型预测中检测候选过渡，并通过一种新颖的边界评分选择前$K$个边界，该评分融合了邻域不确定性、类别模糊性和时间预测动态。重要的是，我们的注释协议仅请求边界帧的标签，同时仍然在以边界为中心的剪辑上进行训练，以利用模型的感受野中的时间上下文。在GTEA、50Salads和Breakfast上的大量实验表明，基于边界的监督提供了强大的标签效率，并在稀疏预算下始终超过了代表性的TAS主动学习基线和先前的最先进技术，在边界位置主导编辑和重叠基础F1分数的数据集上获得了最大的提升。
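
The second stage of B-ACT (detect candidate transitions, then keep the top-$K$ by a boundary score) can be sketched as follows. The abstract names the three ingredients of the score but not how they are computed or combined, so the concrete choices here are assumptions: mean entropy in a small window for neighborhood uncertainty, one minus the top-1/top-2 margin for class ambiguity, the L1 change between adjacent frame distributions for temporal dynamics, and an unweighted sum to fuse them. Function names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of one per-frame class distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def boundary_candidates(probs):
    """Frame indices t where the predicted label differs from frame t-1."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]

def boundary_score(probs, t, radius=1):
    """Illustrative score; the terms and their equal weighting are assumptions."""
    lo, hi = max(0, t - radius), min(len(probs), t + radius + 1)
    # Neighborhood uncertainty: mean entropy around the candidate frame.
    neighborhood = sum(entropy(p) for p in probs[lo:hi]) / (hi - lo)
    # Class ambiguity: small top-1/top-2 margin means high ambiguity.
    top = sorted(probs[t], reverse=True)
    ambiguity = 1.0 - (top[0] - top[1])
    # Temporal dynamics: how much the distribution moves across the boundary.
    dynamics = sum(abs(a - b) for a, b in zip(probs[t], probs[t - 1]))
    return neighborhood + ambiguity + dynamics

def select_top_k(probs, k):
    """Candidate boundaries sorted by score, highest first, truncated to k."""
    cands = boundary_candidates(probs)
    return sorted(cands, key=lambda t: boundary_score(probs, t), reverse=True)[:k]

# Toy two-class predictions over seven frames: one soft and one sharp transition.
probs = [
    [0.90, 0.10], [0.90, 0.10], [0.45, 0.55], [0.10, 0.90],
    [0.10, 0.90], [0.95, 0.05], [0.95, 0.05],
]
candidates = boundary_candidates(probs)
picked = select_top_k(probs, k=1)
```

In this toy input the soft transition at frame 2 outranks the sharp one at frame 5, because its uncertainty and ambiguity terms outweigh the larger jump at frame 5. A different weighting of the three terms could reverse that order.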

cs.CV / 80 / 2604.15188

VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

VisPCO：通过预算感知的帕累托前沿学习优化视觉语言模型的视觉标记修剪配置

Ji, Huawei, Sun, Yuanhao, Jin, Yuan, Deng, Cheng, Ding, Jiaxin, Fu, Luoyi, Wang, Xinbing

Abstract

Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs' hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.

Chinese Translation

视觉标记修剪方法有效缓解了在视觉语言模型（VLMs）中处理高分辨率图像和视频帧所导致的二次计算增长。然而，现有方法依赖于预定义的修剪配置，而未确定其是否实现了计算性能的最优性。在本研究中，我们引入了一种新颖的框架，将视觉标记修剪公式化为帕累托配置优化问题，以自动识别最优配置。我们的方法采用连续松弛和直通估计器，以实现基于梯度的搜索，采用增广拉格朗日法进行求解。在8个视觉基准上的广泛实验表明，该方法有效地逼近了通过网格搜索获得的经验帕累托前沿，并在各种修剪方法和VLM架构中表现出良好的泛化能力。此外，通过可学习的核函数，我们研究了逐层修剪模式，并揭示多步骤渐进修剪捕捉了VLM的层次压缩结构，相较于单层方法实现了更优的准确性与效率权衡。
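
The budget-constrained search can be illustrated with a toy Augmented Lagrangian loop. This is only a sketch under strong assumptions: a single continuous keep-ratio `r` stands in for a full pruning configuration, the quadratic `(1 - r)^2` is a made-up proxy for accuracy loss, and the budget is the constraint `r <= budget`. The straight-through estimator mentioned in the abstract is not modeled here, and every name is hypothetical.

```python
def solve_budgeted_ratio(budget, rho=10.0, outer=30, inner=200, lr=0.05):
    """Minimize (1 - r)^2 subject to r <= budget with an augmented Lagrangian.

    Uses the standard inequality form: the multiplier estimate
    max(0, lam + rho * g(r)) with g(r) = r - budget drives both the inner
    gradient step and the outer multiplier update.
    """
    r, lam = 1.0, 0.0
    for _ in range(outer):
        # Inner loop: gradient descent on the augmented objective in r.
        for _ in range(inner):
            mult = max(0.0, lam + rho * (r - budget))
            grad = 2.0 * (r - 1.0) + mult
            r = min(1.0, max(0.0, r - lr * grad))  # keep r a valid ratio
        # Outer loop: multiplier update, clipped at zero for the inequality.
        lam = max(0.0, lam + rho * (r - budget))
    return r, lam

r_star, lam_star = solve_budgeted_ratio(budget=0.3)
```

For this toy problem the constraint is active at the optimum, so `r_star` settles at the budget and `lam_star` approaches `2 * (1 - budget)`, the slope of the proxy loss at that point.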

cs.CV / 81 / 2604.15196

Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization

基于无监督骨架的动作分割的层次时空向量量化

Ahmed, Umer, Mahmood, Syed Ahmed, Fateh, Fawad Javed, Luqman, M. Shaheer, Zia, M. Zeeshan, Tran, Quoc-Huy

Abstract

We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily exploiting spatial cues by reconstructing input skeletons. Next, we extend our approach by leveraging both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering, while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes a new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.

Chinese Translation

我们提出了一种新颖的层次时空向量量化框架，用于无监督骨架的时间动作分割。我们首先引入了一种层次方法，包括两个连续的向量量化层次。具体而言，较低层次将骨架与细粒度的子动作关联，而较高层次则进一步将子动作聚合为动作级别的表示。我们的层次方法在主要利用空间线索通过重建输入骨架的同时，优于非层次基线。接下来，我们通过利用空间和时间信息扩展了我们的方法，形成了一种层次时空向量量化方案。特别是，我们的层次时空方法执行多级聚类，同时恢复输入骨架及其对应的时间戳。最后，在多个基准数据集上进行的广泛实验，包括 HuGaDB、LARa 和 BABEL，证明了我们的方法建立了新的最先进性能，并减少了无监督骨架时间动作分割中的分段长度偏差。
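
The two consecutive quantization levels can be sketched with nearest-codeword lookups. The codebooks below are fixed by hand purely for illustration; in the paper they are learned with reconstruction objectives, which this sketch does not attempt. Function names and the 2-D "skeleton features" are hypothetical.

```python
def nearest(vec, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def hierarchical_assign(frames, sub_codebook, action_codebook):
    """Lower level: frame -> subaction. Higher level: subaction -> action."""
    sub_ids = [nearest(f, sub_codebook) for f in frames]
    # The quantized subaction embedding is the input to the second level.
    action_ids = [nearest(sub_codebook[s], action_codebook) for s in sub_ids]
    return sub_ids, action_ids

def segments(labels):
    """Collapse per-frame labels into (label, start, end_exclusive) runs."""
    runs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            runs.append((labels[start], start, t))
            start = t
    return runs

# Hand-set codebooks: four subactions grouped into two actions (assumption).
sub_codebook = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]]
action_codebook = [[0.5, 0.0], [10.5, 10.0]]
frames = [[0.1, 0.0], [0.9, 0.1], [1.1, 0.0], [10.2, 9.9], [10.9, 10.1]]

sub_ids, action_ids = hierarchical_assign(frames, sub_codebook, action_codebook)
action_segments = segments(action_ids)
```

The fine-grained subaction labels change more often than the action labels, so the second level merges short subaction runs into longer action segments.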

cs.CV / 82 / 2604.15237

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

StreamCacheVGGT：具有稳健评分和混合缓存压缩的流式视觉几何变换器

Liu, Xuanyi, Ji, Deyi, Yu, Chunan, Zhu, Qi, Li, Xuanfu, Ma, Jin, Chen, Tianrun, Zhu, Lanyun

Abstract

Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.

Chinese Translation

从连续视频流中重建密集的三维几何体需要在固定内存预算下进行稳定推断。现有的 $O(1)$ 框架主要依赖于“纯驱逐”范式，这种方法由于二进制令牌删除和来自局部单层评分的评估噪声而遭受显著的信息损失。为了解决这些瓶颈，我们提出了 StreamCacheVGGT，这是一种无训练的框架，通过两个协同模块重新构想缓存管理：跨层一致性增强评分（Cross-Layer Consistency-Enhanced Scoring, CLCES）和混合缓存压缩（Hybrid Cache Compression, HCC）。CLCES 通过跟踪跨变换器层次的令牌重要性轨迹来减轻激活噪声，采用顺序统计分析来识别持续的几何显著性。利用这些稳健的评分，HCC 超越了简单的驱逐，提出了一种三层分流策略，通过在关键向量流形上进行最近邻分配，将适度重要的令牌合并为保留的锚点。这种方法保留了本来会丢失的重要几何上下文。在五个基准（7-Scenes、NRGBD、ETH3D、Bonn 和 KITTI）上的广泛评估表明，StreamCacheVGGT 设定了新的最先进水平，提供了卓越的重建精度和长期稳定性，同时严格遵循固定成本约束。
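
The two modules can be sketched together in a few lines. The abstract does not say which order statistic CLCES uses or how HCC merges tokens, so this sketch makes two assumptions: the cross-layer median stands in for the order-statistical score, and a merged token is averaged into its nearest retained anchor. The tier sizes `n_keep` and `n_merge` and all function names are hypothetical.

```python
from statistics import median

def robust_scores(per_layer_scores):
    """per_layer_scores[l][i] is token i's importance at layer l.

    The median across layers damps single-layer outliers (assumed statistic).
    """
    return [median(col) for col in zip(*per_layer_scores)]

def triage(keys, scores, n_keep, n_merge):
    """Three tiers: keep the top tokens, merge the next ones, evict the rest.

    Returns {anchor_index: merged_key}. Merging averages each mid-tier key
    into its nearest kept anchor (assumed merge rule).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = order[:n_keep]
    merge = order[n_keep:n_keep + n_merge]
    # Running [sum, count] per anchor so several tokens can merge into one.
    acc = {i: [list(keys[i]), 1] for i in keep}
    for j in merge:
        anchor = min(
            keep,
            key=lambda i: sum((x - y) ** 2 for x, y in zip(keys[i], keys[j])),
        )
        acc[anchor][0] = [s + x for s, x in zip(acc[anchor][0], keys[j])]
        acc[anchor][1] += 1
    return {a: [s / n for s in total] for a, (total, n) in acc.items()}

# Three layers, four tokens. Token 1 spikes in one layer only.
layer_scores = [
    [0.9, 0.1, 0.5, 0.6],
    [0.8, 0.9, 0.4, 0.7],
    [0.7, 0.2, 0.6, 0.9],
]
keys = [[0.0, 0.0], [5.0, 5.0], [1.0, 0.0], [4.0, 4.0]]

scores = robust_scores(layer_scores)
cache = triage(keys, scores, n_keep=2, n_merge=1)
```

Token 1 would rank first under the second layer's score alone, but its median is the lowest, so it is evicted. Token 2 is folded into anchor 0 rather than deleted, and the cache size stays at `n_keep` entries.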

cs.CV / 83 / 2604.15239

TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

TokenGS：通过可学习的标记解耦像素与3D高斯预测

Ren, Jiawei, Tyszkiewicz, Michal Jan, Huang, Jiahui, Gojcic, Zan

Abstract

In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.

Chinese Translation

在本研究中，我们重新审视了现代基于Transformer的方法在前馈3D高斯点云（3DGS）预测中的几个关键设计选择。我们认为，将高斯均值回归为沿相机光线的深度的常见做法并不理想，因此我们提出直接使用自监督渲染损失回归3D均值坐标。这种表述使我们能够从标准的仅编码器设计转变为具有可学习高斯标记的编码器-解码器架构，从而将预测原始体的数量与输入图像分辨率和视图数量解耦。我们提出的方法TokenGS在应对姿态噪声和多视图不一致性方面表现出更强的鲁棒性，同时自然支持在标记空间中进行高效的测试时优化，而不会降低学习到的先验知识。TokenGS在静态和动态场景上均实现了最先进的前馈重建性能，生成了更规则的几何形状和更平衡的3DGS分布，同时无缝恢复了静态-动态分解和场景流等新兴场景属性。

cs.CV / 84 / 2604.15271

SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation

SegWithU：将不确定性作为单次前向传递的风险意识医疗图像分割的扰动能量

Fu, Tianhao, Wang, Austin, Chen, Charles, Aldave-Garza, Roby, Chen, Yucheng

Abstract

Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present $\textbf{SegWithU}$, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of $0.9838/2.4885$, $0.9946/0.2660$, and $0.9925/0.8193$, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.

Chinese Translation

可靠的不确定性估计对于医疗图像分割至关重要，因为自动化轮廓为下游量化和临床决策支持提供了基础。许多强大的不确定性方法需要重复推理，而高效的单次前向传递替代方案通常提供较弱的失败排序或依赖于限制性的特征空间假设。我们提出了$\textbf{SegWithU}$，这是一个后处理框架，它通过轻量级的不确定性头部增强了一个冻结的预训练分割主干。SegWithU利用中间主干特征，并将不确定性建模为在紧凑探测空间中的扰动能量，使用一阶后验探测器。它生成两个体素级不确定性图：一个用于概率调温的校准导向图和一个用于错误检测和选择性预测的排序导向图。在ACDC、BraTS2024和LiTS数据集上，SegWithU是最强且最一致的单次前向传递基线，分别达到了AUROC/AURC值为$0.9838/2.4885$、$0.9946/0.2660$和$0.9925/0.8193$，同时保持了分割质量。这些结果表明，基于扰动的不确定性建模是实现可靠性意识医疗分割的有效且实用的途径。源代码可在https://github.com/ProjectNeura/SegWithU获取。

cs.CV / 85 / 2604.15280

Why Do Vision Language Models Struggle To Recognize Human Emotions?

为什么视觉语言模型难以识别人类情感？

Agarwal, Madhav, Tsaftaris, Sotirios A., Sevilla-Lara, Laura, McDonagh, Steven

Abstract

Understanding emotions is a fundamental ability for intelligent systems that interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years on many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that dynamic facial expression recognition (DFER), an inherently continuous and dynamic task, exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.

Chinese Translation

理解情感是智能系统与人类互动的基本能力。近年来，视觉语言模型（VLMs）在许多视觉任务上取得了巨大的进展，可能为理解情感提供了有希望的解决方案。然而，令人惊讶的是，即使是最先进的现代VLMs也难以识别人类情感，甚至无法超越专门的视觉分类器。在本文中，我们提出了一个问题：“为什么VLMs难以识别人类情感？”并观察到，面部表情识别（DFER）这一固有的连续和动态任务暴露了VLM的两个关键脆弱性。首先，情感数据集自然呈长尾分布，而用于预训练VLM的网络规模数据加剧了这种头部类偏差，导致它们系统性地将稀有、代表性不足的情感归入常见类别。我们提出了替代的采样策略，以防止偏向常见概念。其次，时间信息对于理解情感至关重要。然而，VLMs无法在密集帧序列中表示时间信息，因为它们受到上下文大小和可容纳的标记数量的限制，这对情感识别构成了明显挑战。我们证明，VLMs中使用的稀疏时间采样策略与微表情（0.25-0.5秒）的短暂特性本质上不一致，而微表情通常是最关键的情感信号。作为诊断探针，我们提出了一种多阶段上下文增强策略，通过首先将“中间”帧转换为自然语言摘要，利用这些信息。这个丰富的文本上下文与稀疏关键帧一起作为输入提供给VLM，从而防止因过多视觉数据而导致的注意力稀释，同时保留情感轨迹。
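
The mismatch between sparse sampling and micro-expressions is easy to quantify. The sketch below assumes frames are spread uniformly over a clip; the clip length (10 s) and frame budget (8) are illustrative values, not figures from the paper, and only the 0.25-0.5 s micro-expression range comes from the abstract.

```python
import math

def sampling_gap_seconds(clip_seconds, n_frames):
    """Time between consecutive frames under uniform sampling."""
    return clip_seconds / n_frames

def guaranteed_hit(event_seconds, clip_seconds, n_frames):
    """True if every event of this duration must contain a sampled frame.

    With uniform spacing this holds exactly when the event lasts at least
    one sampling gap.
    """
    return event_seconds >= sampling_gap_seconds(clip_seconds, n_frames)

def min_frames_to_cover(event_seconds, clip_seconds):
    """Smallest uniform frame count whose gap does not exceed the event."""
    return math.ceil(clip_seconds / event_seconds)

clip_seconds, n_frames = 10.0, 8           # illustrative values (assumption)
gap = sampling_gap_seconds(clip_seconds, n_frames)
short_hit = guaranteed_hit(0.25, clip_seconds, n_frames)
long_hit = guaranteed_hit(0.5, clip_seconds, n_frames)
frames_for_short = min_frames_to_cover(0.25, clip_seconds)
frames_for_long = min_frames_to_cover(0.5, clip_seconds)
```

Under these assumed numbers the gap between sampled frames is 1.25 s, so a micro-expression at either end of the 0.25-0.5 s range can fall entirely between two frames. Guaranteed coverage would need 20 to 40 frames for the same clip, which is the kind of dense sequence the abstract says VLM context limits rule out.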

cs.CV / 86 / 2604.15281

R3D: Revisiting 3D Policy Learning

R3D：重新审视三维策略学习

Hong, Zhengdong, Wu, Shenrui, Cui, Haozhe, Zhao, Boyi, Ji, Ran, He, Yiyang, Zhang, Hangxing, Ke, Zundong, Wang, Jun, Zhang, Guofeng, Gu, Jiayuan

Abstract

3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/

Chinese Translation

三维策略学习承诺提供更优的泛化能力和跨体现迁移，但由于训练不稳定和严重的过拟合，进展受到阻碍，限制了强大三维感知模型的应用。在本研究中，我们系统地诊断了这些失败，识别出三维数据增强的缺失和批量归一化（Batch Normalization）的不利影响是主要原因。我们提出了一种新架构，将可扩展的基于变换器（transformer）的三维编码器与扩散解码器（diffusion decoder）相结合，专门设计用于在大规模下的稳定性，并旨在利用大规模预训练。我们的方法在具有挑战性的操作基准上显著超越了最先进的三维基线，为可扩展的三维模仿学习建立了新的稳健基础。项目页面：https://r3d-policy.github.io/

cs.CV / 87 / 2604.15284

GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

GlobalSplat：通过全局场景标记实现高效的前馈3D高斯溅射

Itkin, Roni, Issachar, Noam, Keypur, Yehonatan, Keypur, Yehonatan, Chen, Anpei, Benaim, Sagie

Abstract

The efficient spatial allocation of primitives serves as the foundation of 3D Gaussian Splatting, as it directly dictates the synergy between representation compactness, reconstruction speed, and rendering fidelity. Previous solutions, whether based on iterative optimization or feed-forward inference, suffer from significant trade-offs between these goals, mainly due to the reliance on local, heuristic-driven allocation strategies that lack global scene awareness. Specifically, current feed-forward methods are largely pixel-aligned or voxel-aligned. By unprojecting pixels into dense, view-aligned primitives, they bake redundancy into the 3D asset. As more input views are added, the representation size increases and global consistency becomes fragile. To this end, we introduce GlobalSplat, a framework built on the principle of align first, decode later. Our approach learns a compact, global, latent scene representation that encodes multi-view input and resolves cross-view correspondences before decoding any explicit 3D geometry. Crucially, this formulation enables compact, globally consistent reconstructions without relying on pretrained pixel-prediction backbones or reusing latent features from dense baselines. Utilizing a coarse-to-fine training curriculum that gradually increases decoded capacity, GlobalSplat natively prevents representation bloat. On RealEstate10K and ACID, our model achieves competitive novel-view synthesis performance while utilizing as few as 16K Gaussians, significantly less than required by dense pipelines, obtaining a light 4MB footprint. Further, GlobalSplat enables significantly faster inference than the baselines, operating under 78 milliseconds in a single forward pass. Project page is available at https://r-itk.github.io/globalsplat/

Chinese Translation

原语的高效空间分配是3D高斯溅射的基础，因为它直接决定了表示的紧凑性、重建速度和渲染保真度之间的协同关系。之前的解决方案，无论是基于迭代优化还是前馈推理，都在这些目标之间存在显著的权衡，主要是由于依赖于缺乏全局场景意识的局部启发式分配策略。具体而言，目前的前馈方法在很大程度上是像素对齐或体素对齐的。通过将像素反投影到密集的视图对齐原语中，它们在3D资产中引入了冗余。随着输入视图的增加，表示大小增加，全局一致性变得脆弱。为此，我们提出了GlobalSplat，一个基于“先对齐，后解码”原则构建的框架。我们的方法学习一个紧凑的全局潜在场景表示，该表示编码多视图输入并在解码任何明确的3D几何之前解决跨视图对应关系。至关重要的是，这种形式化使得在不依赖于预训练像素预测骨干网络或重用来自密集基线的潜在特征的情况下，实现紧凑且全局一致的重建。通过利用逐步增加解码能力的粗到细训练课程，GlobalSplat本质上防止了表示膨胀。在RealEstate10K和ACID上，我们的模型在使用仅16K高斯的情况下实现了具有竞争力的新视图合成性能，显著低于密集管道所需的数量，获得了轻量级的4MB占用。此外，GlobalSplat在单次前向推理中显著快于基线，运行时间低于78毫秒。项目页面可访问 https://r-itk.github.io/globalsplat/

cs.CV / 88 / 2604.15291

AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving

AD4AD：为更安全的自动驾驶评估视觉异常检测模型

Genilotti, Fabrizio, Stropeni, Arianna, Grotto, Gionata, Borsatti, Francesco, Barusco, Manuel, Pezze, Davide Dalle, Susto, Gian Antonio

Abstract

The reliability of a machine vision system for autonomous driving depends heavily on its training data distribution. When a vehicle encounters significantly different conditions, such as atypical obstacles, its perceptual capabilities can degrade substantially. Unlike many domains where errors carry limited consequences, failures in autonomous driving translate directly into physical risk for passengers, pedestrians, and other road users. To address this challenge, we explore Visual Anomaly Detection (VAD) as a solution. VAD enables the identification of anomalous objects not present during training, allowing the system to alert the driver when an unfamiliar situation is detected. Crucially, VAD models produce pixel-level anomaly maps that can guide driver attention to specific regions of concern without requiring any prior assumptions about the nature or form of the hazard. We benchmark eight state-of-the-art VAD methods on AnoVox, the largest synthetic dataset for anomaly detection in autonomous driving. In particular, we evaluate performance across four backbone architectures spanning from large networks to lightweight ones such as MobileNet and DeiT-Tiny. Our results demonstrate that VAD transfers effectively to road scenes. Notably, Tiny-Dinomaly achieves the best accuracy-efficiency trade-off for edge deployment, matching full-scale localization performance at a fraction of the memory cost. This study represents a concrete step toward safer, more responsible deployment of autonomous vehicles, ultimately improving protection for passengers, pedestrians, and all road users.

Chinese Translation

机器视觉系统在自动驾驶中的可靠性在很大程度上依赖于其训练数据的分布。当车辆遇到显著不同的条件，例如非典型障碍物时，其感知能力可能会显著下降。与许多错误后果有限的领域不同，自动驾驶中的失败直接转化为对乘客、行人和其他道路使用者的物理风险。为了解决这一挑战，我们探讨了视觉异常检测（Visual Anomaly Detection, VAD）作为一种解决方案。VAD能够识别训练期间未出现的异常物体，使系统在检测到不熟悉的情况时能够提醒驾驶员。重要的是，VAD模型生成的像素级异常图可以引导驾驶员关注特定的关注区域，而无需对危险的性质或形式做出任何先前假设。我们在AnoVox上对八种最先进的VAD方法进行了基准测试，AnoVox是用于自动驾驶异常检测的最大合成数据集。特别地，我们评估了四种主干架构的性能，从大型网络到轻量级网络，如MobileNet和DeiT-Tiny。我们的结果表明，VAD能够有效地迁移到道路场景中。值得注意的是，Tiny-Dinomaly在边缘部署中实现了最佳的准确性与效率的平衡，以较小的内存成本匹配全规模定位性能。这项研究代表了朝着更安全、更负责任的自动驾驶车辆部署迈出的具体一步，最终提高了对乘客、行人和所有道路使用者的保护。

cs.CV / 89 / 2604.15299

AnimationBench: Are Video Models Good at Character-Centric Animation?

动画基准：视频模型在角色中心动画中的表现如何？

Wu, Leyi, Fang, Pengjun, Sun, Kai, Xing, Yazhou, Wu, Yinwei, Wang, Songsong, Huang, Ziqi, Zhou, Dan, He, Yingqing, Chen, Ying-Cong, Chen, Qifeng

Abstract

Video generation has advanced rapidly, with recent methods producing increasingly convincing animated results. However, existing benchmarks, largely designed for realistic videos, struggle to evaluate animation-style generation with its stylized appearance, exaggerated motion, and character-centric consistency. They also rely on fixed prompt sets and rigid pipelines, offering limited flexibility for open-domain content and custom evaluation needs. To address this gap, we introduce AnimationBench, the first systematic benchmark for evaluating animation image-to-video (I2V) generation. AnimationBench operationalizes the Twelve Basic Principles of Animation and IP Preservation into measurable evaluation dimensions, together with Broader Quality Dimensions including semantic consistency, motion rationality, and camera motion consistency. The benchmark supports both a standardized closed-set evaluation for reproducible comparison and a flexible open-set evaluation for diagnostic analysis, and leverages visual-language models for scalable assessment. Extensive experiments show that AnimationBench aligns well with human judgment and exposes animation-specific quality differences overlooked by realism-oriented benchmarks, leading to more informative and discriminative evaluation of state-of-the-art I2V models.

Chinese Translation

视频生成技术迅速发展，近期的方法产生了越来越令人信服的动画效果。然而，现有的基准测试主要针对真实视频，难以评估具有风格化外观、夸张动作和角色中心一致性的动画风格生成。此外，这些基准还依赖于固定的提示集和严格的流程，限制了对开放领域内容和自定义评估需求的灵活性。为了解决这一问题，我们引入了AnimationBench，这是第一个系统化的基准，用于评估动画图像到视频生成。AnimationBench 将动画的十二条基本原则和知识产权保护转化为可测量的评估维度，并结合了更广泛的质量维度，包括语义一致性、动作合理性和摄像机运动一致性。该基准支持标准化的封闭集评估，以便进行可重复的比较，同时也支持灵活的开放集评估，以便进行诊断分析，并利用视觉-语言模型进行可扩展评估。大量实验表明，AnimationBench 与人类判断高度一致，并揭示了现实主义导向的基准所忽视的动画特定质量差异，从而为最先进的图像到视频（I2V）模型提供了更具信息性和区分性的评估。

cs.CV / 90 / 2604.15301

Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation

在潜在思维中思考：无注释手语翻译的新范式

Jiang, Yiyang, Zhang, Li, Wei, Xiao-Yong, Qing, Li

Abstract

Many SLT systems quietly assume that brief chunks of signing map directly to spoken-language words. That assumption breaks down because signers often create meaning on the fly using context, space, and movement. We revisit SLT and argue that it is mainly a cross-modal reasoning task, not just a straightforward video-to-text conversion. We thus introduce a reasoning-driven SLT framework that uses an ordered sequence of latent thoughts as an explicit middle layer between the video and the generated text. These latent thoughts gradually extract and organize meaning over time. On top of this, we use a plan-then-ground decoding method: the model first decides what it wants to say, and then looks back at the video to find the evidence. This separation improves coherence and faithfulness. We also built and released a new large-scale gloss-free SLT dataset with stronger context dependencies and more realistic meanings. Experiments across several benchmarks show consistent gains over existing gloss-free methods. Code and data will be released upon acceptance at https://github.com/fletcherjiang/SignThought.

Chinese Translation

许多手语翻译（SLT）系统默默假设简短的手势片段直接映射到口语单词。这一假设在实际中并不成立，因为手语使用者常常利用上下文、空间和运动即时创造意义。我们重新审视手语翻译，认为它主要是一项跨模态推理任务，而不仅仅是简单的视频到文本转换。因此，我们引入了一种以推理为驱动的手语翻译框架，该框架使用有序的潜在思维序列作为视频与生成文本之间的显式中间层。这些潜在思维随着时间的推移逐渐提取和组织意义。在此基础上，我们采用了“先规划后落实”的解码方法：模型首先决定想要表达的内容，然后回顾视频以寻找证据。这种分离提高了一致性和忠实度。我们还构建并发布了一个新的大规模无注释手语翻译数据集，具有更强的上下文依赖性和更现实的意义。在多个基准测试中的实验结果显示，相较于现有的无注释方法，表现出一致的提升。代码和数据将在接受后发布，链接为 https://github.com/fletcherjiang/SignThought。

cs.CV / 91 / 2604.15308

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

RAD-2：在生成器-判别器框架中扩展强化学习

Gao, Hao, Chen, Shaoyu, Zhu, Yifan, Song, Yuehao, Liu, Wenyu, Zhang, Qian, Wang, Xinggang

Abstract

High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.

Chinese Translation

高级自主驾驶需要能够建模多模态未来不确定性的运动规划器，同时在闭环交互中保持稳健性。尽管基于扩散的规划器在建模复杂轨迹分布方面有效，但在仅通过模仿学习进行训练时，它们常常遭遇随机不稳定性和缺乏纠正性负反馈的问题。为了解决这些问题，我们提出了RAD-2，一个用于闭环规划的统一生成器-判别器框架。具体而言，使用基于扩散的生成器来产生多样的轨迹候选，而经过强化学习优化的判别器根据其长期驾驶质量对这些候选进行重新排序。这种解耦设计避免了将稀疏标量奖励直接应用于全高维轨迹空间，从而提高了优化的稳定性。为了进一步增强强化学习，我们引入了时间一致的组相对策略优化（Temporally Consistent Group Relative Policy Optimization），利用时间一致性来缓解信用分配问题。此外，我们提出了在线策略生成器优化（On-policy Generator Optimization），将闭环反馈转化为结构化的纵向优化信号，并逐步将生成器推向高奖励轨迹流形。为了支持高效的大规模训练，我们引入了BEV-Warp，一个高吞吐量的仿真环境，通过空间扭曲直接在鸟瞰视图特征空间中进行闭环评估。与强大的基于扩散的规划器相比，RAD-2将碰撞率降低了56%。实际部署进一步证明了在复杂城市交通中感知安全性和驾驶平稳性的改善。

cs.CV / 92 / 2604.15309

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

MM-WebAgent：一种用于网页生成的分层多模态网络代理

Li, Yan, Zeng, Zezi, Yang, Yifan, Yang, Yuqing, Liao, Ning, Guo, Weiwei, Qiu, Lili, Cheng, Mingxi, Dai, Qi, Wang, Zhendong, Yang, Zhengyuan, Yang, Xue, Li, Ji, Wang, Lijuan, Luo, Chong

Abstract

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.

Chinese Translation

人工智能生成内容（AIGC）工具的快速发展使得可以按需创建图像、视频和可视化内容用于网页设计，为现代用户界面/用户体验（UI/UX）提供了一种灵活且日益被采用的范式。然而，直接将这些工具集成到自动化网页生成中往往会导致风格不一致和整体连贯性差，因为各个元素是孤立生成的。我们提出了MM-WebAgent，一种用于多模态网页生成的分层代理框架，通过分层规划和迭代自我反思协调基于AIGC的元素生成。MM-WebAgent共同优化全局布局、本地多模态内容及其整合，生成连贯且视觉一致的网页。我们进一步引入了多模态网页生成的基准测试和多层次评估协议，以便进行系统评估。实验表明，MM-WebAgent在多模态元素生成和整合方面优于代码生成和基于代理的基线方法。代码与数据： https://aka.ms/mm-webagent。

cs.CV / 93 / 2604.15310

TokenLight: Precise Lighting Control in Images using Attribute Tokens

TokenLight：使用属性令牌实现图像中的精确光照控制

Chaturvedi, Sumit, Hold-Geoffroy, Yannick, Ren, Mengwei, Liu, Jingyuan, Zhang, He, Mei, Yiqun, Dorsey, Julie, Shu, Zhixin

Abstract

This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly. Project page: vrroom.github.io/tokenlight/

Chinese Translation

本文提出了一种图像重光照的方法，能够对照片中的多个光照属性进行精确和连续的控制。我们将重光照形式化为一个条件图像生成任务，并引入属性令牌来编码不同的光照因素，如强度、颜色、环境光照、漫反射水平和三维光源位置。该模型在一个大规模的合成数据集上进行训练，该数据集包含真实的光照注释，并辅以一小组真实捕获图像以增强真实感和泛化能力。我们在多种重光照任务中验证了我们的方法，包括控制场景中的光源和使用虚拟光源编辑环境光照，涵盖合成图像和真实图像。与之前的工作相比，我们的方法在定量和定性性能上均达到了最先进的水平。值得注意的是，在没有显式逆向渲染监督的情况下，该模型展现出对光如何与场景几何、遮挡和材料相互作用的内在理解，即使在传统上具有挑战性的场景中，如在物体内部放置光源或合理地重光照透明材料，也能产生令人信服的光照效果。项目页面：vrroom.github.io/tokenlight/

cs.CV / 94 / 2604.15311

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

LeapAlign：通过构建两步轨迹在任意生成步骤对流匹配模型进行后训练

Liang, Zhanhao, Yang, Tao, Wu, Jie, Feng, Chengjian, Zheng, Liang

Abstract

This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.

Chinese Translation

本文聚焦于流匹配模型与人类偏好的对齐。一种有前景的方法是通过直接反向传播奖励梯度来微调流匹配的可微生成过程。然而，沿着长轨迹进行反向传播会导致巨大的内存开销和梯度爆炸。因此，直接梯度方法在更新早期生成步骤时面临困难，而这些步骤对于确定最终图像的全局结构至关重要。为了解决这个问题，我们提出了LeapAlign，这是一种微调方法，旨在降低计算成本并实现从奖励到早期生成步骤的直接梯度传播。具体而言，我们通过设计两个连续的跳跃，将长轨迹缩短为仅两个步骤，每个步骤跳过多个常微分方程（ODE）采样步骤，并在单一步骤中预测未来的潜在变量。通过随机化跳跃的起始和结束时间步，LeapAlign在任意生成步骤上实现了高效且稳定的模型更新。为了更好地利用这些缩短的轨迹，我们对与长生成路径更一致的轨迹分配更高的训练权重。为了进一步增强梯度的稳定性，我们降低了大幅度梯度项的权重，而不是像以往工作那样完全去除它们。在微调Flux模型时，LeapAlign在各种指标上始终优于最先进的基于GRPO的方法和直接梯度方法，实现了更优的图像质量和图像-文本对齐。
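
The idea of replacing a long sampling trajectory with two consecutive leaps can be sketched as follows. This is a minimal illustration under stated assumptions, not the LeapAlign implementation: it assumes the common convention that t = 1 is noise and t = 0 is data, takes each leap as a single Euler step along the predicted velocity, and uses hypothetical names (`sample_leap_times`, `leap`, `two_leaps`). The trajectory weighting and gradient down-weighting from the abstract are not shown.

```python
import numpy as np

def sample_leap_times(rng, t_max=1.0, t_min=0.0):
    """Draw three random timesteps and order them as start > middle > end."""
    times = rng.uniform(t_min, t_max, size=3)
    t_start, t_mid, t_end = np.sort(times)[::-1]
    return float(t_start), float(t_mid), float(t_end)

def leap(x_t, t_from, t_to, velocity):
    """Jump from t_from to t_to in one step, skipping the intermediate ODE steps."""
    return x_t + (t_to - t_from) * velocity(x_t, t_from)

def two_leaps(x_start, times, velocity):
    """Chain two leaps so a reward at the end is only two steps away from x_start."""
    t_start, t_mid, t_end = times
    x_mid = leap(x_start, t_start, t_mid, velocity)
    return leap(x_mid, t_mid, t_end, velocity)
```

With a perfectly straight path (constant velocity) the two leaps land exactly on the long path; with a learned velocity field they only approximate it, which is why the abstract weights shortened trajectories by their consistency with the long generation path.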

cs.CV / 95 / 2604.15312

Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo

事件-帧非对称立体的双向跨模态提示

Xu, Ninghui, Tosi, Fabio, Wang, Lihui, Han, Jiawei, Bartolomei, Luca, Yao, Zhiting, Poggi, Matteo, Mattoccia, Stefano

Abstract

Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to marginalization of domain-specific cues essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.

Chinese Translation

传统的基于帧的相机捕捉丰富的上下文信息，但在动态场景中存在有限的时间分辨率和运动模糊的问题。事件相机提供了一种替代的视觉表示，具有更高的动态范围，且不受这些限制。两种模态的互补特性使得事件-帧非对称立体在快速运动和复杂光照条件下的可靠3D感知中展现出良好的前景。然而，模态间的差距往往导致对跨模态立体匹配至关重要的领域特定线索的边缘化。在本文中，我们提出了Bi-CMPStereo，一种新颖的双向跨模态提示框架，充分利用来自两个领域的语义和结构特征以实现稳健匹配。我们的方法在目标规范空间内学习精细对齐的立体表示，并通过将每种模态投影到事件和帧领域来整合互补表示。大量实验表明，我们的方法在准确性和泛化能力上显著优于现有的最先进方法。

人工智能 (Artificial Intelligence)

cs.AI / 1 / 2604.14160

NuHF Claw: A Risk Constrained Cognitive Agent Framework for Human Centered Procedure Support in Digital Nuclear Control Rooms

NuHF Claw：一种风险约束的认知代理框架，用于数字化核控制室中的以人为本的程序支持

Xiao, Xingyu, Tong, Jiejuan, Sun, Jun, Sui, Zhe, Chen, Peng, Liang, Jingang, Wang, Haitao

Abstract

The rapid digitization of nuclear power plant main control rooms has fundamentally reshaped operator interaction patterns, introducing complex soft-control behaviors and elevated cognitive risks that are not adequately addressed by existing human reliability analysis approaches. Although recent advances in large language models and autonomous agents offer new opportunities for intelligent decision support, their deployment in safety-critical environments remains constrained by risks of hallucinated reasoning and weakened human authority. This study proposes NuHF Claw, a persistent cognitive-risk agent framework that enables risk-governed, human-centered autonomy for digital nuclear operations. The core methodological innovation lies in the introduction of a risk-constrained agent runtime, which tightly couples cognitive state inference with probabilistic safety assessment to regulate autonomous system behavior in real time. By integrating cognitively grounded workload and situational awareness estimation with dynamic human error probability prediction, the framework transforms conventional offline reliability analysis into a proactive intervention mechanism embedded directly within operational workflows. Experimental validation on a high-fidelity digital control room simulator demonstrates that NuHF Claw can anticipate interface-induced cognitive degradation, dynamically constrain unsafe autonomous recommendations, and provide risk-aware navigational guidance while preserving human decision authority. The results highlight a fundamental shift from automation-driven operation toward cognition-aware autonomy, offering a principled pathway for the safe integration of intelligent agents into next-generation nuclear control environments.

Chinese Translation

核电站主控制室的快速数字化从根本上重塑了操作员的互动模式，引入了复杂的软控制行为和未被现有人类可靠性分析方法充分解决的认知风险。尽管大型语言模型和自主代理的最新进展为智能决策支持提供了新的机会，但在安全关键环境中的应用仍受到幻觉推理和削弱人类权威的风险的限制。本研究提出了NuHF Claw，一种持久的认知风险代理框架，使数字核操作中的以风险为导向的人本自主成为可能。核心方法创新在于引入了风险约束的代理运行时，它将认知状态推断与概率安全评估紧密结合，以实时调节自主系统行为。通过将基于认知的工作负载和情境意识估计与动态人类错误概率预测相结合，该框架将传统的离线可靠性分析转变为直接嵌入操作工作流程中的主动干预机制。在高保真数字控制室模拟器上的实验验证表明，NuHF Claw能够预测界面引起的认知退化，动态约束不安全的自主建议，并在保持人类决策权威的同时提供风险感知的导航指导。结果突显了从自动化驱动操作向认知感知自主的根本转变，为智能代理安全整合到下一代核控制环境提供了一个原则性路径。

cs.AI / 2 / 2604.14178

Simulating Human Cognition: Heartbeat-Driven Autonomous Thinking Activity Scheduling for LLM-based AI systems

模拟人类认知：基于心跳驱动的自主思维活动调度用于大语言模型（LLM）人工智能系统

Su, Hong

Abstract

Large Language Model (LLM) agents have demonstrated remarkable capabilities in reasoning and tool use, yet they often suffer from rigid, reactive control flows that limit their adaptability and efficiency. Most existing frameworks rely on fixed pipelines or failure-triggered reflection, causing agents to act impulsively or correct errors only after they occur. In this paper, we introduce Heartbeat-Driven Autonomous Thinking Activity Scheduling, a mechanism that enables proactive, adaptive, and continuous self-regulation. Mirroring the natural rhythm of human cognition, our system employs a periodic "heartbeat" mechanism to orchestrate a dynamic repertoire of cognitive modules (e.g., Planner, Critic, Recaller, Dreamer). Unlike traditional approaches that rely on hard-coded symbolic rules or immediate reactive triggers, our scheduler learns to determine when to engage specific thinking activities -- such as recalling memories, summarizing experiences, or strategic planning -- based on temporal patterns and historical context. This functional approach allows cognitive modules to be dynamically added or removed without structural reengineering. Meanwhile, we propose a meta-learning strategy for continual policy adaptation, where the scheduler optimizes its cognitive strategy over time using historical interaction logs. Evaluation results demonstrate that our approach effectively learns to schedule cognitive activities based on historical data and can autonomously integrate new thinking modules.

Chinese Translation

大语言模型（LLM）代理在推理和工具使用方面展现了卓越的能力，但它们往往受到僵化、反应式控制流程的限制，影响了其适应性和效率。现有的大多数框架依赖于固定的流程或失败触发的反思，导致代理只能冲动地行动或在错误发生后才进行纠正。本文提出了一种心跳驱动的自主思维活动调度机制，使得代理能够进行主动、自适应和持续的自我调节。我们的系统通过周期性的“心跳”机制，模拟人类认知的自然节奏，协调一系列动态的认知模块（例如，规划者、评论者、回忆者、梦想者）。与依赖硬编码符号规则或即时反应触发的传统方法不同，我们的调度器学习在何时参与特定的思维活动——例如回忆记忆、总结经验或战略规划——基于时间模式和历史背景。这种功能性的方法允许认知模块在不进行结构重组的情况下动态添加或移除。同时，我们提出了一种元学习策略用于持续的策略适应，调度器利用历史交互日志随着时间的推移优化其认知策略。评估结果表明，我们的方法能够有效地基于历史数据学习调度认知活动，并能够自主整合新的思维模块。

cs.AI / 3 / 2604.14221

Fun-TSG: A Function-Driven Multivariate Time Series Generator with Variable-Level Anomaly Labeling

Fun-TSG：一种函数驱动的多变量时间序列生成器，具有变量级异常标注

Lotte, Pierre, Péninou, André, Teste, Olivier

Abstract

Reliable evaluation of anomaly detection methods in multivariate time series remains an open challenge, largely due to the limitations of existing benchmark datasets. Current resources often lack fine-grained anomaly annotations, do not provide explicit intervariable and temporal dependencies, and offer little insight into the underlying generative mechanisms. These shortcomings hinder the development and rigorous comparison of detection models, especially those targeting interpretable and variable-specific outputs. To address this gap, we introduce Fun-TSG, a fully customizable time series generator designed to support high-quality evaluation of anomaly detection systems. Our tool enables both fully automated generation, based on randomly sampled dependency structures and anomaly types, and manual generation through user-defined equations and anomaly configurations. In both cases, it provides full transparency over the data generation process, including access to ground-truth anomaly labels at the variable and timestamp levels. Fun-TSG supports the creation of diverse, interpretable, and reproducible benchmarking scenarios, enabling fine-grained performance analysis for both classical and modern anomaly detection models.

Chinese Translation

在多变量时间序列中，可靠评估异常检测方法仍然是一个未解决的挑战，这在很大程度上是由于现有基准数据集的局限性。目前的资源往往缺乏细粒度的异常注释，未提供明确的变量间和时间依赖关系，并且对潜在生成机制的洞察有限。这些缺陷阻碍了检测模型的发展和严格比较，尤其是针对可解释和特定变量输出的模型。为了解决这一问题，我们引入了Fun-TSG，一种完全可定制的时间序列生成器，旨在支持高质量的异常检测系统评估。我们的工具支持基于随机抽样的依赖结构和异常类型的全自动生成，以及通过用户定义的方程和异常配置进行手动生成。在这两种情况下，它都提供了数据生成过程的完全透明性，包括对变量和时间戳级别的真实异常标签的访问。Fun-TSG支持创建多样化、可解释和可重复的基准场景，使得对经典和现代异常检测模型进行细粒度性能分析成为可能。
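
To make the notion of variable-level and timestamp-level ground truth concrete, here is a small sketch of a function-driven generator. It does not use the Fun-TSG tool or its API: the two equations, the lag, the spike anomaly, and the name `generate_series` are all assumptions chosen for illustration.

```python
import numpy as np

def generate_series(length=200, seed=0):
    """Generate two dependent variables plus a boolean label per (timestamp, variable)."""
    rng = np.random.default_rng(seed)
    t = np.arange(length)
    x0 = np.sin(2 * np.pi * t / 50)            # independent driver variable
    x1 = np.zeros(length)
    x1[1:] = 0.8 * x0[:-1]                     # explicit dependency: x1 follows x0 with lag 1
    x1 += 0.05 * rng.normal(size=length)       # small observation noise
    data = np.stack([x0, x1], axis=1)          # shape: (length, 2)
    labels = np.zeros_like(data, dtype=bool)   # ground truth, same shape as the data
    start, end = 120, 130
    data[start:end, 1] += 3.0                  # inject a spike anomaly into variable 1 only
    labels[start:end, 1] = True
    return data, labels
```

Because the label array has the same shape as the data, a detector can be scored on which variable it flags as well as on when it raises an alarm.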

cs.AI / 4 / 2604.14240

Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making

可解读且可解释的替代建模用于仿真：前沿综述及决策中的可解释人工智能展望

Palar, Pramudita Satria, Saves, Paul, Robani, Muhammad Daffa, Verstaevel, Nicolas, Garouani, Moncef, Aligon, Julien, Shimoyama, Koji, Morlier, Joseph, Gaudou, Benoit

Abstract

The simulation of complex systems increasingly relies on sophisticated but fundamentally opaque computational black-box simulators. Surrogate models play a central role in reducing the computational cost of complex systems simulations across a wide range of scientific and engineering domains. Notwithstanding, they inevitably inherit and often exacerbate this black-box nature, obscuring how input variables drive physical responses. Conversely, Explainable Artificial Intelligence (XAI) offers powerful tools to unpack these models. Yet, XAI methods struggle with engineering-specific constraints, such as highly correlated inputs, dynamical systems, and rigorous reliability requirements. Consequently, surrogate modeling and XAI have largely evolved as distinct fields of research, despite their strong complementarity. To reconnect these approaches, this state-of-the-art survey provides a structured perspective that maps existing XAI techniques onto the various stages of surrogate modeling workflows for design and exploration. To ground this synthesis, we draw upon illustrative applications across both equation-based simulations and agent-based modeling. We survey a broad spectrum of techniques, highlighting their strengths for revealing interactions and supporting human comprehension. Finally, we identify pressing open challenges, including the explainability of dynamical systems and the handling of mixed-variable systems, and propose a research agenda to make explainability a core, embedded element of simulation-driven workflows from model construction through decision-making. By transforming opaque emulators into explainable tools, this agenda empowers practitioners to move beyond accelerating simulations to extracting actionable insights from complex system behaviors.

Chinese Translation

复杂系统的仿真越来越依赖于复杂但本质上不透明的计算黑箱仿真器。替代模型在降低复杂系统仿真的计算成本方面发挥着核心作用，适用于广泛的科学和工程领域。然而，它们不可避免地继承并往往加剧了这种黑箱特性，模糊了输入变量如何驱动物理响应。相反，可解释人工智能（XAI）提供了强大的工具来解构这些模型。然而，XAI 方法在工程特定约束（如高度相关的输入、动态系统和严格的可靠性要求）方面面临挑战。因此，尽管替代建模和 XAI 具有强大的互补性，但这两个领域的研究在很大程度上演变为独立的领域。为了重新连接这些方法，本前沿调查提供了一个结构化的视角，将现有的 XAI 技术映射到替代建模工作流程的各个阶段，以便进行设计和探索。为了使这一综合分析更具实用性，我们借鉴了基于方程的仿真和基于代理的建模中的示例应用。我们调查了广泛的技术，突出其在揭示交互和支持人类理解方面的优势。最后，我们识别出紧迫的开放挑战，包括动态系统的可解释性和混合变量系统的处理，并提出了一项研究议程，以使可解释性成为从模型构建到决策制定的仿真驱动工作流程中的核心嵌入元素。通过将不透明的仿真器转变为可解释的工具，这一议程使从业者能够超越加速仿真，提取复杂系统行为中的可操作见解。

cs.AI / 5 / 2604.14254

Formalizing Kantian Ethics: Formula of the Universal Law Logic (FULL)

形式化康德伦理学：普遍法则逻辑公式 (FULL)

Olson, Taylor

Abstract

The field of machine ethics aims to build Artificial Moral Agents (AMAs) to better understand morality and make AI agents safer. To do so, many approaches encode human moral intuition as a set of axioms on actions (e.g., do not harm; you must help others). However, this introduces (at least) two limitations for future AMAs. First, it does not consider the agent's purposes in performing the action. Second, it assumes that we humans can enumerate our moral intuition. This paper explores formalizing a moral procedure that alleviates these two limitations. We specifically consider Kantian ethics and present a multi-sorted quantified modal logic we call the Formula of the Universal Law Logic (FULL). The FULL formalizes Kant's first formulation of the categorical imperative, the Formula of the Universal Law (FUL), and concepts such as causality and agency. We demonstrate on three cases from Kantian ethics that the FULL can reason to evaluate agents' actions for certain purposes without built-in moral intuition, given that it has sufficient (non-normative) background knowledge. Therefore, the FULL is a contribution towards more robust and autonomous AMAs, and a more formal understanding of Kantian ethics.

Chinese Translation

机器伦理学领域旨在构建人工道德代理（Artificial Moral Agents, AMAs），以更好地理解道德并使人工智能代理更安全。为此，许多方法将人类的道德直觉编码为一组关于行为的公理，例如：不伤害他人，必须帮助他人。然而，这为未来的AMAs引入了（至少）两个局限性。首先，它没有考虑代理在执行行为时的目的。其次，它假设我们人类能够列举我们的道德直觉。本文探讨了一种形式化道德程序，以缓解这两个局限性。我们特别考虑康德伦理学，并提出了一种我们称之为普遍法则逻辑公式（Formula of the Universal Law Logic, FULL）的多排序量化模态逻辑。FULL形式化了康德的第一种范畴命令公式，即普遍法则公式（Formula of the Universal Law, FUL），以及因果关系和代理等概念。我们在三个康德伦理学的案例中展示，FULL能够在没有内置道德直觉的情况下，根据充分的（非规范性）背景知识推理以评估代理的行为。因此，FULL是朝着更强大和自主的AMAs，以及对康德伦理学更正式理解的贡献。

cs.AI / 6 / 2604.14258

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

GFT：从模仿到奖励微调，利用无偏组优势和动态系数修正

Gan, Wangjie, Pan, Miao, Xi, Linbo, Zhang, Wenqi, Chen, Jintao, Yin, Jianwei, Zhang, Xuhong

Abstract

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.

Chinese Translation

大型语言模型通常通过监督微调（SFT）和强化学习（RL）进行后训练，但有效地将高效知识注入与稳健的泛化相结合仍然具有挑战性。在本研究中，我们提供了一种训练动态分析，表明SFT可以被解释为一种特殊的策略梯度优化情况，其具有极为稀疏的隐式奖励和不稳定的逆概率加权，这两者共同导致了单路径依赖、熵崩溃和梯度爆炸。基于这一诊断，我们提出了组微调（Group Fine-Tuning, GFT），这是一个统一的后训练框架，通过两种机制解决这些内在限制：组优势学习（Group Advantage Learning），该机制构建多样化的响应组并推导归一化的对比监督，以缓解奖励稀疏性；动态系数修正（Dynamic Coefficient Rectification），该机制自适应地限制逆概率权重，以稳定优化，同时保持高效的知识注入。实验表明，GFT在性能上始终超过基于SFT的方法，并产生与后续RL训练更平滑整合的策略。
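
The two mechanisms named in the abstract can be sketched numerically. The abstract gives no formulas, so this is an assumption-laden illustration: `group_advantages` uses the familiar mean/std normalization within a response group, and `bounded_inverse_prob` uses a fixed clip on 1/p, whereas the paper describes an adaptive bound. The 1/p factor itself comes from the identity that the gradient of -log p equals -(1/p) times the gradient of p.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one response group (assumed form: zero mean, unit std)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def bounded_inverse_prob(probs, max_weight=10.0):
    """Cap the 1/p weight so low-probability targets cannot produce exploding gradients.

    The fixed cap `max_weight` is a hypothetical simplification of an adaptive bound.
    """
    probs = np.asarray(probs, dtype=float)
    return np.minimum(1.0 / probs, max_weight)
```

For a target token with probability 0.001 the raw weight would be 1000; the cap keeps it at 10 while leaving well-calibrated targets (for example p = 0.5, weight 2) untouched.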

cs.AI / 7 / 2604.14316

Seeing Through Experts Eyes A Foundational Vision Language Model Trained on Radiologists Gaze and Reasoning

透视专家视角：基于放射科医生注视与推理训练的基础视觉语言模型

Lee, Kinhei, Jing, Peiyuan, Zhang, Zhenxuan, Yang, Yue, Wang, Tao, Marshall, Dominic C, Fang, Yingying, Yang, Guang

Abstract

Large-scale vision-language models have shown promise in automating chest X-ray interpretation, yet their clinical utility remains limited by a gap between model outputs and radiologist reasoning. Most systems optimize for semantic information without emulating how experts visually examine medical images, often overlooking critical findings or diverging from established diagnostic workflows. Radiologists follow structured protocols (e.g., the ABCDEF approach) that ensure all clinically relevant regions are systematically examined, reducing missed findings and supporting reliable diagnostic reasoning. We introduce GazeX, a vision-language model that leverages radiologists' eye-tracking data as a behavioral prior to model expert diagnostic reasoning. By incorporating gaze trajectories and fixation patterns into pretraining, GazeX learns to follow the spatial and temporal structure of radiologist attention and integrates observations in a clinically meaningful sequence. Using a curated dataset of over 30,000 gaze key frames from five radiologists, we demonstrate that GazeX produces more accurate, interpretable, and expert-consistent outputs across radiology report generation, disease grounding, and visual question answering, utilizing 231,835 radiographic studies, 780,014 question-answer pairs, and 1,162 image-sentence pairs with bounding boxes. Unlike autonomous reporting systems, GazeX produces verifiable evidence artifacts, including inspection trajectories and finding-linked localized regions, enabling efficient human verification and safe human-AI collaboration. Learning through expert eyes provides a practical route toward more trustworthy, explainable, and diagnostically robust AI systems for radiology and beyond.

Chinese Translation

大规模视觉语言模型在自动化胸部X光解读方面展现了潜力，但其临床实用性仍受到模型输出与放射科医生推理之间差距的限制。大多数系统优化语义信息，而未能模拟专家如何视觉检查医学影像，常常忽视关键发现或偏离既定的诊断工作流程。放射科医生遵循结构化协议（例如，ABCDEF方法），确保所有临床相关区域得到系统性检查，从而减少漏诊并支持可靠的诊断推理。我们提出了GazeX，这是一种视觉语言模型，利用放射科医生的眼动追踪数据作为模型专家诊断推理的行为先验。通过将注视轨迹和注视模式纳入预训练，GazeX学习遵循放射科医生注意力的空间和时间结构，并以临床有意义的顺序整合观察结果。使用来自五位放射科医生的超过30,000个注视关键帧的精心策划数据集，我们证明GazeX在放射学报告生成、疾病定位和视觉问答等方面产生了更准确、可解释且与专家一致的输出，利用了231,835个放射学研究、780,014个问答对和1,162个带边界框的图像句子对。与自主报告系统不同，GazeX生成可验证的证据文档，包括检查轨迹和与局部区域相关的发现，能够实现高效的人类验证和安全的人机协作。通过专家视角学习为放射学及其他领域提供了通向更可信、可解释和诊断稳健的人工智能系统的实用途径。

cs.AI / 8 / 2604.14336

Mistake gating leads to energy and memory efficient continual learning

错误门控带来节能且节省内存的持续学习

Pache, Aaron, van Rossum, Mark CW

Abstract

Synaptic plasticity is metabolically expensive, yet animals continuously update their internal models without exhausting energy reserves. However, when artificial neural networks are trained, the network parameters are typically updated on every sample that is presented, even if the sample was classified correctly. Inspired by the human negativity bias and error-related negativity, we propose 'memorized mistake-gated learning' -- a biologically plausible plasticity rule where synaptic updates are strictly gated by current and past classification errors. This reduces the number of updates the network needs to make by $50\%\sim80\%$. Mistake gating is particularly well suited in two cases: 1) For incremental learning where new knowledge is acquired on a background of pre-existing knowledge, 2) For online learning scenarios when data needs to be stored for later replay, as mistake-gating reduces storage buffer requirements. The algorithm can be implemented in a few lines of code, adds no hyper-parameters, and comes at negligible computational overhead. Learning on mistakes is an energy efficient and biologically relevant modification to commonly used learning rules that is well suited for continual learning.

Chinese Translation

突触可塑性在代谢上是昂贵的，但动物能够不断更新其内部模型而不耗尽能量储备。然而，在训练人工神经网络时，网络参数通常会在每个呈现的样本上进行更新，即使该样本被正确分类。受到人类负性偏向和错误相关负波（error-related negativity）的启发，我们提出了“记忆错误门控学习”（memorized mistake-gated learning），这是一种生物学上合理的可塑性规则，其中突触更新严格受当前和过去分类错误的门控。这使网络所需的更新次数减少了50%至80%。错误门控特别适用于两种情况：1）增量学习，即在已有知识的背景下获取新知识；2）需要存储数据以便后续重放的在线学习场景，因为错误门控降低了对存储缓冲区的需求。该算法可以用几行代码实现，不增加超参数，并且计算开销可以忽略不计。基于错误的学习是对常用学习规则的一种节能且具有生物学意义的改进，非常适合持续学习。
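
The abstract notes that the rule fits in a few lines of code. The sketch below is one possible reading, not the authors' implementation: it assumes a logistic-regression learner trained with SGD, and it interprets "current and past classification errors" as a per-sample flag that is set the first time a sample is misclassified. The name `train_mistake_gated` and the learning rate are hypothetical.

```python
import numpy as np

def train_mistake_gated(X, y, epochs=5, lr=0.1):
    """SGD for logistic regression that only updates on current or remembered mistakes."""
    n, d = X.shape
    w = np.zeros(d)
    remembered = np.zeros(n, dtype=bool)   # True once a sample has ever been misclassified
    updates = 0
    for _ in range(epochs):
        for i in range(n):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            mistake = (p >= 0.5) != bool(y[i])
            remembered[i] |= mistake
            if remembered[i]:              # gate: samples never gotten wrong are skipped
                w += lr * (y[i] - p) * X[i]
                updates += 1
    return w, updates
```

The returned `updates` counter can be compared with `epochs * len(X)`, the number of updates that ungated SGD would perform, to see how much work the gate saves on a given dataset.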

cs.AI / 9 / 2604.14401

Credo: Declarative Control of LLM Pipelines via Beliefs and Policies

Credo：通过信念和策略对大型语言模型管道进行声明性控制

Lu, Duo, Crotty, Andrew, Çetintemel, Uğur

Abstract

Agentic AI systems are becoming commonplace in domains that require long-lived, stateful decision-making in continuously evolving conditions. As such, correctness depends not only on the output of individual model calls, but also on how to best adapt when incorporating new evidence or revising prior conclusions. However, existing frameworks rely on imperative control loops, ephemeral memory, and prompt-embedded logic, making agent behavior opaque, brittle, and difficult to verify. This paper introduces Credo, which represents semantic state as beliefs and regulates behavior using declarative policies defined over these beliefs. This design supports adaptive, auditable, and composable execution through a database-backed semantic control plane. We showcase these concepts in a decision-control scenario, where beliefs and policies declaratively guide critical execution choices (e.g., model selection, retrieval, corrective re-execution), enabling dynamic behavior without requiring any changes to the underlying pipeline code.

Chinese Translation

智能代理人工智能系统在需要长期、状态保持的决策制定和不断变化的条件下变得越来越普遍。因此，正确性不仅依赖于单个模型调用的输出，还依赖于在纳入新证据或修正先前结论时如何最佳适应。然而，现有框架依赖于命令式控制循环、短暂内存和嵌入提示逻辑，使得代理行为变得不透明、脆弱且难以验证。本文介绍了Credo，它将语义状态表示为信念，并使用在这些信念上定义的声明性策略来调节行为。这一设计通过一个基于数据库的语义控制平面支持适应性、可审计和可组合的执行。我们在一个决策控制场景中展示了这些概念，其中信念和策略声明性地指导关键执行选择（例如，模型选择、检索、纠正再执行），使得动态行为得以实现，而无需对底层管道代码进行任何更改。
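
The contrast between prompt-embedded logic and declarative policies over beliefs can be shown with a small in-memory sketch. This is not Credo's API or data model, and the abstract describes a database-backed control plane rather than Python objects. The classes `Belief` and `Policy`, the `decide` function, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Belief:
    """A piece of semantic state with an attached confidence (hypothetical schema)."""
    name: str
    value: object
    confidence: float

@dataclass
class Policy:
    """A declarative rule: when the condition over beliefs holds, choose the action."""
    name: str
    condition: Callable[[Dict[str, Belief]], bool]
    action: str

def decide(beliefs, policies, default="proceed"):
    """Return the action of the first policy whose condition holds over the beliefs."""
    for policy in policies:
        if policy.condition(beliefs):
            return policy.action
    return default

policies = [
    Policy("low_grounding",
           lambda b: b["answer_grounded"].confidence < 0.6,
           "retrieve_more"),
]
```

Changing behavior then means editing the `policies` list or updating a belief; the pipeline code that calls `decide` stays the same, which is the property the abstract emphasizes.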

cs.AI / 10 / 2604.14419

Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

专家混合中的等终性：路由拓扑并不决定语言建模质量

Ternovtsii, Ivan, Bilak, Yurii

Abstract

Sparse Mixture-of-Experts (MoE) architectures employ increasingly sophisticated routing mechanisms -- learned routers, multi-hop trajectories, token-dependent gating. We ask: does routing topology actually determine language modeling quality? We build a geometric MoE (ST-MoE) using cosine-similarity routing against learned centroids in a low-dimensional space ($d_{space} = 64$), requiring 80% fewer routing parameters than standard linear routers. Through 62 controlled experiments on WikiText-103 at 76--84M parameters trained to convergence (50K steps, 1.64B tokens), we find that routing topology does not determine asymptotic perplexity (PPL): five cosine-routing variants are statistically equivalent within a 1-PPL margin (Two One-Sided Tests [TOST], $p < 0.05$ for all 10 pairwise comparisons; 15 runs across 3 seeds, observed range 33.93--34.72). The finding extends to hash, random-fixed, and top-1 routing (single-seed; graceful 1.1--2.2 PPL degradation) and replicates on OpenWebText (0.03 PPL gap, 6 runs, 3 seeds each). A standard linear router with 5.3$\times$ more routing parameters reaches PPL 32.76, but iso-parameter cosine routing closes 67% of this gap -- the true mechanism advantage is $\sim$1.2%. The mechanistic explanation is convergent redundancy: multi-hop updates are collinear ($\cos(\Delta h_0, \Delta h_1) = 0.805$), implementing magnitude amplification rather than compositional reasoning; a single learnable scalar replicates multi-hop performance. As a practical payoff, zero-shot relative-norm halting saves 25% of MoE FLOPs at +0.12% PPL. Expert-level specialization and causal controllability -- which coexist with topology-level equifinality -- are explored in a companion paper.

Chinese Translation

稀疏的专家混合（MoE）架构采用越来越复杂的路由机制：学习型路由器、多跳轨迹、基于令牌的门控。我们提出疑问：路由拓扑是否真的决定语言建模质量？我们构建了一个几何专家混合（ST-MoE），在低维空间（$d_{space} = 64$）中依据与学习到的质心的余弦相似度进行路由，所需的路由参数比标准线性路由器少80%。通过在WikiText-103上进行62次对照实验，模型参数量为76--84M，训练至收敛（50K步，1.64B令牌），我们发现路由拓扑并不决定渐近困惑度（PPL）：五种余弦路由变体在1-PPL的范围内统计上是等效的（双单侧检验 [TOST]，所有10组成对比较$p < 0.05$；在3个种子下进行15次运行，观察到的范围为33.93--34.72）。这一发现扩展到哈希、随机固定和top-1路由（单种子；PPL仅平缓退化1.1--2.2），并在OpenWebText上得到复现（0.03 PPL差距，6次运行，每组3个种子）。一个路由参数多5.3$\times$的标准线性路由器达到了PPL 32.76，但等参数的余弦路由缩小了67%的差距，真正的机制优势约为1.2%。机制解释是收敛冗余：多跳更新是共线的（$\cos(\Delta h_0, \Delta h_1) = 0.805$），实现的是幅度放大而非组合推理；单个可学习的标量即可复现多跳性能。作为实际收益，零样本相对范数停止节省了25%的MoE FLOPs，同时PPL仅增加0.12%。专家级专业化和因果可控性（与拓扑级别的等终性共存）在一篇配套论文中进行了探讨。
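
Cosine-similarity routing against centroids in a low-dimensional space can be sketched in a few lines of NumPy. This is an illustration, not the ST-MoE code: the projection matrix `proj`, the softmax over the selected similarities, and `top_k=2` are assumptions. Only the 64-dimensional routing space comes from the abstract.

```python
import numpy as np

def cosine_route(hidden, proj, centroids, top_k=2):
    """Route each token to the experts whose centroids are most cosine-similar.

    hidden:    (tokens, d_model) token representations
    proj:      (d_model, d_space) projection into the routing space (assumed)
    centroids: (experts, d_space) one learned centroid per expert
    """
    z = hidden @ proj
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=-1, keepdims=True)
    sims = z @ c.T                                        # cosine similarities in [-1, 1]
    top = np.argsort(-sims, axis=-1)[:, :top_k]           # indices of the best experts
    top_sims = np.take_along_axis(sims, top, axis=-1)
    weights = np.exp(top_sims) / np.exp(top_sims).sum(axis=-1, keepdims=True)
    return top, weights
```

In this layout the routing parameters are the projection plus an `experts x d_space` centroid table; how that compares with a linear router depends on the model width and expert count, which the abstract does not state.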

cs.AI / 11 / 2604.14422

Demonstration of Pneuma-Seeker: Agentic System for Reifying and Fulfilling Information Needs on Tabular Data

Pneuma-Seeker的演示：用于具体化和满足表格数据上信息需求的智能系统

Balaka, Muhammad Imam Luthfi, Fernandez, Raul Castro

Abstract

Data analysts working with relational data often start with vague or underspecified questions and refine them iteratively as they explore the data. To support this iterative process, we demonstrate Pneuma-Seeker, a system that reifies a user's information need as explicit, inspectable relational specifications, enabling iterative refinement of the information need, targeted data discovery, and provenance-aware execution. Through two real-world procurement use cases, we show how Pneuma-Seeker leverages LLMs as transparent, interactive analytical collaborators rather than opaque answer engines.

Chinese Translation

处理关系数据的数据分析师通常从模糊或不明确的问题开始，并在探索数据的过程中进行迭代精炼。为了支持这一迭代过程，我们演示了Pneuma-Seeker，一个将用户的信息需求具体化为明确、可检查的关系规范的系统，从而实现信息需求的迭代精炼、针对性的数据发现和可追溯的执行。通过两个真实的采购案例，我们展示了Pneuma-Seeker如何利用大型语言模型（LLMs）作为透明的、互动的分析协作伙伴，而不是不透明的答案引擎。

cs.AI / 12 / 2604.14434

Geometric Routing Enables Causal Expert Control in Mixture of Experts

几何路由实现混合专家中的因果专家控制

Ternovtsii, Ivan, Bilak, Yurii

Abstract

Sparse Mixture-of-Experts (MoE) models scale parameters while fixing active computation per token, but the specialization of individual experts remains opaque. In a companion paper we showed that routing topology is quality-neutral: five structurally different configurations converge to statistically equivalent language modeling quality. Here we show that expert identity is nonetheless causally meaningful: individual rank-1 experts are monosemantic by construction, and cosine-similarity routing in a low-dimensional metric space makes their specialization directly inspectable. We present four lines of evidence. First, projecting expert output vectors through the unembedding matrix yields a Semantic Dictionary: 15% of experts are monosemantic specialists spanning 10 categories (temporal, geographic, cardinal, discourse, emotional, financial, military, scientific). Second, routing exhibits a frequency-to-syntax gradient: early layers separate tokens by word frequency, deeper layers by syntactic class (Zipf-confound controls, all $p < 0.001$). Third, causal interventions confirm these labels: steering toward a temporal expert's centroid increases P(temporal) by +321% (median across 44 prompts); suppressing a geographic expert drops P(geographic) by -23%; rewriting an expert's output vector halves target-category probability, and effects compose additively across layers. Fourth, the interventions are not unique to cosine routing: linear routers support comparable steering, but only cosine routing provides geometric transparency -- expert specialization is readable directly from the centroid matrix. MoE expert-level specialization is a first-class interpretability primitive: architecturally monosemantic, causally validated, and controllable at inference with zero overhead.

Chinese Translation

稀疏的混合专家（MoE）模型在固定每个标记的活跃计算的同时扩展参数，但个别专家的专业化仍然不透明。在一篇相关论文中，我们展示了路由拓扑是质量中立的：五种结构不同的配置收敛到统计上等效的语言建模质量。在这里，我们展示了专家身份在因果上是有意义的：单个的 rank-1 专家在构造上是单义的，而在低维度度量空间中的余弦相似度路由使得他们的专业化可以直接检查。我们提供了四条证据。首先，通过去嵌入矩阵投影专家输出向量可以得到一个语义字典：15%的专家是跨越10个类别（时间、地理、基数、话语、情感、金融、军事、科学）的单义专家。其次，路由表现出频率到句法的梯度：早期层按单词频率分离标记，较深层按句法类别分离（Zipf混淆控制，所有$p < 0.001$）。第三，因果干预证实了这些标签：朝向时间专家的质心引导使P(时间)增加+321%（44个提示的中位数）；抑制一个地理专家使P(地理)下降-23%；重写一个专家的输出向量使目标类别概率减半，并且效果在层之间是可加的。第四，这些干预并不独特于余弦路由：线性路由器支持可比的引导，但只有余弦路由提供几何透明性——专家专业化可以直接从质心矩阵中读取。MoE专家级专业化是一种一流的可解释性原语：在架构上是单义的，因果上经过验证，并且在推理时可控且没有额外开销。


cs.AI / 13 / 2604.14440

On Tackling Complex Tasks with Reward Machines and Signal Temporal Logics

利用奖励机器和信号时序逻辑应对复杂任务

Ruiz, Ana María Gómez, Dang, Thao, Donzé, Alexandre

Abstract

We propose a Reinforcement Learning (RL) based control design framework for handling complex tasks. The approach extends the concept of Reward Machines (RM) with Signal Temporal Logic (STL) formulas that can be used for event generation. The use of STL not only allows a more efficient representation of rewards for complex tasks but also guides the training process to converge towards behaviors satisfying specified requirements. We also propose an implementation of the framework that leverages STL online monitoring algorithms. We illustrate the framework with three case studies (minigrid, cart-pole, and highway environments) with non-trivial tasks.
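
A minimal sketch of the general idea, assuming a one-dimensional signal and a two-step task: events are generated when the robustness of a simple STL "eventually" formula becomes positive on the signal prefix, and those events drive a reward machine. The class, the event names, and the thresholds are hypothetical; the paper's framework and its online monitoring algorithms are not reproduced here.

```python
def robustness_eventually_above(signal, target):
    """Robustness of the STL formula F (x > target): max over time of (x - target)."""
    return max(x - target for x in signal)


def events_from_prefix(prefix):
    """Emit an event for every formula that is satisfied on the signal seen so far."""
    events = []
    if robustness_eventually_above(prefix, 0.5) > 0:
        events.append("halfway")
    if robustness_eventually_above(prefix, 0.9) > 0:
        events.append("goal")
    return events


class RewardMachine:
    """Finite-state machine mapping (state, event) to (next_state, reward)."""

    def __init__(self, transitions, initial):
        self.transitions = transitions
        self.state = initial

    def step(self, events):
        # Take the first event that has a transition from the current state.
        for event in events:
            if (self.state, event) in self.transitions:
                self.state, reward = self.transitions[(self.state, event)]
                return reward
        return 0.0


rm = RewardMachine(
    {("u0", "halfway"): ("u1", 0.1), ("u1", "goal"): ("u2", 1.0)},
    initial="u0",
)
signal = [0.1, 0.4, 0.6, 0.8, 0.95]
rewards = [rm.step(events_from_prefix(signal[: t + 1])) for t in range(len(signal))]
```

Re-evaluating the formula on each growing prefix is the simplest stand-in for online monitoring; a real monitor would update robustness incrementally rather than rescanning the prefix.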

Chinese Translation

我们提出了一种基于强化学习（Reinforcement Learning, RL）的控制设计框架，用于处理复杂任务。该方法扩展了奖励机器（Reward Machines, RM）的概念，结合了可用于事件生成的信号时序逻辑（Signal Temporal Logic, STL）公式。使用STL不仅可以更有效地表示复杂任务的奖励，还可以指导训练过程，使其收敛到满足特定要求的行为。我们还提出了一个利用STL在线监控算法的框架实现。通过三个案例研究（迷你网格、倒立摆和高速公路环境）来说明该框架在处理非平凡任务中的应用。


cs.AI / 14 / 2604.14455

AIBuildAI: An AI Agent for Automatically Building AI Models

AIBuildAI：一种自动构建AI模型的人工智能代理

Zhang, Ruiyi, Qin, Peijia, Cao, Qi, Zhang, Li, Xie, Pengtao

Abstract

AI models underpin modern intelligent systems, driving advances across science, medicine, finance, and technology. Yet developing high-performing AI models remains a labor-intensive process that requires expert practitioners to iteratively design architectures, engineer representations, implement training pipelines and refine approaches through empirical evaluation. Existing AutoML methods partially alleviate this burden but remain limited to narrow aspects such as hyperparameter optimization and model selection within predefined search spaces, leaving the full development lifecycle largely dependent on human expertise. To address this gap, we introduce AIBuildAI, an AI agent that automatically builds AI models from a task description and training data. AIBuildAI adopts a hierarchical agent architecture in which a manager agent coordinates three specialized sub-agents: a designer for modeling strategy, a coder for implementation and debugging, and a tuner for training and performance optimization. Each sub-agent is itself a large language model (LLM) based agent capable of multi-step reasoning and tool use, enabling end-to-end automation of the AI model development process that goes beyond the scope of existing AutoML approaches. We evaluate AIBuildAI on MLE-Bench, a benchmark of realistic Kaggle-style AI development tasks spanning visual, textual, time-series and tabular modalities. AIBuildAI ranks first on MLE-Bench with a medal rate of 63.1%, outperforming all existing baseline methods and matching the capability of highly experienced AI engineers. These results demonstrate that hierarchical agent systems can automate the full AI model development process from task specification to deployable model, suggesting a pathway toward broadly accessible AI development with minimal human intervention.

Chinese Translation

人工智能模型是现代智能系统的基础，推动了科学、医学、金融和技术等领域的进步。然而，开发高性能的人工智能模型仍然是一个劳动密集型的过程，需要专家从业人员反复设计架构、工程化表示、实施训练流程并通过实证评估来完善方法。现有的自动机器学习（AutoML）方法在一定程度上减轻了这一负担，但仍然局限于超参数优化和在预定义搜索空间内的模型选择等狭窄方面，使得整个开发生命周期在很大程度上依赖于人类专业知识。为了解决这一问题，我们提出了AIBuildAI，一个能够根据任务描述和训练数据自动构建AI模型的人工智能代理。AIBuildAI采用了一种分层代理架构，其中一个管理代理协调三个专业子代理：负责建模策略的设计师、负责实现和调试的编码器，以及负责训练和性能优化的调优器。每个子代理本身都是一个大型语言模型（LLM）基础的代理，能够进行多步骤推理和工具使用，从而实现超越现有AutoML方法范围的AI模型开发过程的端到端自动化。我们在MLE-Bench上评估了AIBuildAI，这是一个涵盖视觉、文本、时间序列和表格模式的现实Kaggle风格AI开发任务的基准。AIBuildAI在MLE-Bench上以63.1%的奖牌率排名第一，超越了所有现有基线方法，并与经验丰富的AI工程师的能力相匹配。这些结果表明，分层代理系统能够自动化从任务规范到可部署模型的完整AI模型开发过程，暗示了在最小人类干预下实现广泛可及的AI开发的路径。


cs.AI / 15 / 2604.14465

Improving Human Performance with Value-Aware Interventions: A Case Study in Chess

通过价值感知干预提升人类表现：一个国际象棋案例研究

Narayanan, Saumik, Panjwani, Raja, Sen, Siddhartha, Ho, Chien-Ju

Abstract

AI systems are increasingly used to assist humans in sequential decision-making tasks, yet determining when and how an AI assistant should intervene remains a fundamental challenge. A potential baseline is to recommend the optimal action according to a strong model. However, such actions assume optimal follow-up actions, which human decision makers may fail to execute, potentially reducing overall performance. In this work, we propose and study value-aware interventions, motivated by a basic principle in reinforcement learning: under the Bellman equation, the optimal policy selects actions that maximize the immediate reward plus the value function. When a decision maker follows a suboptimal policy, this policy-value consistency no longer holds, creating discrepancies between the actions taken by the policy and those that maximize the immediate reward plus the value of the next state. We show that these policy-value inconsistencies naturally identify opportunities for intervention. We formalize this problem in a Markov decision process where an AI assistant may override human actions under an intervention budget. In the single-intervention regime, we show that the optimal strategy is to recommend the action that maximizes the human value function. For settings with multiple interventions, we propose a tractable approximation that prioritizes interventions based on the magnitude of the policy-value discrepancy. We evaluate these ideas in the domain of chess by learning models of humans from large-scale gameplay data. In simulation, our approach consistently outperforms interventions based on the strongest chess engine (Stockfish) in a wide range of settings. A within-subject human study with 20 players and 600 games further shows that our interventions significantly improve performance for low- and mid-skill players while matching expert-engine interventions for high-skill players.
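
The policy-value discrepancy can be made concrete on a toy problem. The sketch below assumes a small deterministic MDP and a hand-written human policy (all states, actions, and rewards are hypothetical): it computes the human value function, picks the action that maximizes immediate reward plus the human value of the next state, and ranks states by the size of the gap between that action and the human's own choice.

```python
GAMMA = 0.9

# transitions[state][action] = (next_state, reward); state 3 is terminal.
transitions = {
    0: {"safe": (1, 0.0), "sharp": (2, 0.0)},
    1: {"hold": (3, 1.0)},
    2: {"precise": (3, 3.0), "blunder": (3, -2.0)},
}

# Hypothetical human policy: enters the sharp line, then fails to follow up.
human_policy = {0: "sharp", 1: "hold", 2: "blunder"}


def human_value(state):
    """Value of following the human policy from `state` (terminal states are worth 0)."""
    if state not in transitions:
        return 0.0
    next_state, reward = transitions[state][human_policy[state]]
    return reward + GAMMA * human_value(next_state)


def q_human(state, action):
    """Immediate reward plus the discounted human value of the next state."""
    next_state, reward = transitions[state][action]
    return reward + GAMMA * human_value(next_state)


def value_aware_action(state):
    """Action that maximizes reward plus human value, not the engine-optimal value."""
    return max(transitions[state], key=lambda a: q_human(state, a))


def discrepancy(state):
    """Gap between the value-aware action and the human's own choice."""
    return q_human(state, value_aware_action(state)) - q_human(state, human_policy[state])


# Prioritize interventions by the magnitude of the discrepancy.
priority = sorted(transitions, key=discrepancy, reverse=True)
```

In state 0 an engine assuming perfect follow-up would prefer "sharp" (it leads to the +3 reward), whereas the value-aware choice is "safe", because the human value of the sharp line is negative.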

Chinese Translation

人工智能系统越来越多地用于帮助人类进行顺序决策任务，但确定何时以及如何让人工智能助手进行干预仍然是一个基本挑战。一个潜在的基准是根据强模型推荐最佳行动。然而，这些行动假设后续行动是最优的，而人类决策者可能无法执行这些行动，这可能会降低整体表现。在本研究中，我们提出并研究了价值感知干预，这一想法源于强化学习中的基本原则：在贝尔曼方程下，最优策略选择最大化即时奖励加上价值函数的行动。当决策者遵循次优策略时，这种策略-价值一致性不再成立，导致政策采取的行动与最大化即时奖励加上下一个状态的价值的行动之间出现差异。我们表明，这些政策-价值不一致自然识别出干预的机会。我们在马尔可夫决策过程中形式化了这个问题，其中人工智能助手可以在干预预算下覆盖人类的行动。在单次干预模式下，我们表明最优策略是推荐最大化人类价值函数的行动。对于多次干预的设置，我们提出了一种可行的近似方法，根据策略-价值差异的大小优先考虑干预。我们通过从大规模游戏数据中学习人类模型，在国际象棋领域评估这些想法。在模拟中，我们的方法在广泛的设置中始终优于基于最强国际象棋引擎（Stockfish）的干预。对20名玩家和600场比赛进行的内部研究进一步表明，我们的干预显著提高了低技能和中技能玩家的表现，同时在高技能玩家中与专家引擎的干预相匹配。


cs.AI / 16 / 2604.14473

Response-Aware User Memory Selection for LLM Personalization

响应感知的用户记忆选择用于大型语言模型个性化

Fisher, Jillian, Neville, Jennifer, Park, Chan Young

Abstract

A common approach to personalization in large language models (LLMs) is to incorporate a subset of the user memory into the prompt at inference time to guide the model's generation. Existing methods select these subsets primarily using similarity between user memory items and input queries, ignoring how features actually affect the model's response distribution. We propose Response-Utility optimization for Memory Selection (RUMS), a novel method that selects user memory items by measuring the mutual information between a subset of memory and the model's outputs, identifying items that reduce response uncertainty and sharpen predictions beyond semantic similarity. We demonstrate that this information-theoretic foundation enables more principled user memory selection that aligns more closely with human selection than state-of-the-art methods and models 400× larger. Additionally, we show that memory items selected using RUMS yield better response quality than existing approaches while reducing computational cost by up to 95%.

Chinese Translation

在大型语言模型（LLMs）中，个性化的常见方法是在推理时将用户记忆的一个子集纳入提示中，以引导模型的生成。现有方法主要通过用户记忆项与输入查询之间的相似性来选择这些子集，而忽略了特征如何实际影响模型的响应分布。我们提出了一种名为响应效用优化的记忆选择方法（Response-Utility optimization for Memory Selection, RUMS），这是一种新颖的方法，通过测量记忆子集与模型输出之间的互信息来选择用户记忆项，识别出能够减少响应不确定性并在语义相似性之外提高预测准确性的项。我们证明了这种信息论基础使得用户记忆选择更加有原则，且与最先进的方法以及规模大400倍的模型相比，与人类选择的对齐程度更高。此外，我们还展示了使用RUMS选择的记忆项在响应质量上优于现有方法，同时计算成本减少了高达95%。


cs.AI / 17 / 2604.14475

Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve

Evo-MedAgent：超越一次性诊断的记忆、反思与改进的智能体

Shen, Weixiang, Jian, Bailiang, Li, Jun, Liu, Che, Moll, Johannes, Hu, Xiaobin, Rueckert, Daniel, Li, Hongwei Bran, Pan, Jiazhen

Abstract

Tool-augmented large language model (LLM) agents can orchestrate specialist classifiers, segmentation models, and visual question-answering modules to interpret chest X-rays. However, these agents still solve each case in isolation: they fail to accumulate experience across cases, correct recurrent reasoning mistakes, or adapt their tool-use behavior without expensive reinforcement learning. While a radiologist naturally improves with every case, current agents remain static. In this work, we propose Evo-MedAgent, a self-evolving memory module that equips a medical agent with the capacity for inter-case learning at test time. Our memory comprises three complementary stores: (1) Retrospective Clinical Episodes that retrieve problem-solving experiences from similar past cases, (2) an Adaptive Procedural Heuristics bank curating priority-tagged diagnostic rules that evolves via reflection, much like a physician refining their internal criteria, and (3) a Tool Reliability Controller that tracks per-tool trustworthiness. On ChestAgentBench, Evo-MedAgent raises multiple-choice question (MCQ) accuracy from 0.68 to 0.79 on GPT-5-mini, and from 0.76 to 0.87 on Gemini-3 Flash. With a strong base model, evolving memory improves performance more effectively than orchestrating external tools on qualitative diagnostic tasks. Because Evo-MedAgent requires no training, its per-case overhead is bounded by one additional retrieval pass and a single reflection call, making it deployable on top of any frozen model.

Chinese Translation

工具增强的大型语言模型（LLM）智能体可以协调专业分类器、分割模型和视觉问答模块来解读胸部X光片。然而，这些智能体仍然在孤立的情况下解决每个案例：它们无法在案例之间积累经验、纠正重复的推理错误，或在没有昂贵的强化学习的情况下调整其工具使用行为。尽管放射科医生在每个案例中自然会有所改善，但当前的智能体仍然是静态的。在本研究中，我们提出了Evo-MedAgent，一个自我进化的记忆模块，使医疗智能体具备在测试时进行跨案例学习的能力。我们的记忆模块包括三个互补的存储：(1) 回顾性临床案例，从类似的过去案例中检索问题解决经验；(2) 一个自适应程序启发式库，策划优先标记的诊断规则，通过反思不断演变，类似于医生精炼其内部标准；(3) 一个工具可靠性控制器，跟踪每个工具的可信度。在ChestAgentBench上，Evo-MedAgent使GPT-5-mini的多项选择题（MCQ）准确率从0.68提高到0.79，Gemini-3 Flash的准确率从0.76提高到0.87。凭借强大的基础模型，进化的记忆在定性诊断任务中比协调外部工具更有效地提高性能。由于Evo-MedAgent不需要训练，其每个案例的开销仅限于一次额外的检索过程和一次反思调用，使其能够在任何冻结模型之上进行部署。


cs.AI / 18 / 2604.14477

Seeing Through Circuits: Faithful Mechanistic Interpretability for Vision Transformers

透视电路：视觉变换器的可信机制可解释性

Żukowska, Nina, Stammer, Wolfgang, Schiele, Bernt, Fischer, Jonas

Abstract

Transparency of neural networks' internal reasoning is at the heart of interpretability research, adding to trust, safety, and understanding of these models. The field of mechanistic interpretability has recently focused on studying task-specific computational graphs, defined by connections (edges) between model components. Such edge-based circuits have been defined in the context of large language models, yet vision-based approaches so far only consider neuron-based circuits. These tell which information is encoded, but not how it is routed through the complex wiring of a neural network. In this work, we investigate whether useful mechanistic circuits can be identified through computational graphs in vision transformers. We propose an effective method for Automatic Visual Circuit Discovery (Vi-CD) that recovers class-specific circuits for classification, identifies circuits underlying typographic attacks in CLIP, and discovers circuits that lend themselves to steering in order to correct harmful model behavior. Overall, we find that insightful and actionable edge-based circuits can be recovered from vision transformers, adding transparency to the internal computations of these models.

Chinese Translation

神经网络内部推理的透明性是可解释性研究的核心，增强了对这些模型的信任、安全性和理解。机制可解释性领域最近集中研究任务特定的计算图，这些计算图由模型组件之间的连接（边）定义。这种基于边的电路在大型语言模型的背景下得到了定义，但迄今为止，基于视觉的方法仅考虑基于神经元的电路。这些电路能够指示哪些信息被编码，但无法说明信息是如何通过神经网络复杂的连接进行路由的。在本研究中，我们探讨是否可以通过视觉变换器中的计算图识别出有用的机制电路。我们提出了一种有效的自动视觉电路发现方法（Vi-CD），该方法能够恢复用于分类的特定类别电路，识别CLIP中与排版攻击相关的电路，并发现可用于引导纠正有害模型行为的电路。总体而言，我们发现可以从视觉变换器中恢复出有见地且可操作的基于边的电路，从而为这些模型的内部计算增加透明度。


cs.AI / 19 / 2604.14493

Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

突破设备端流式自动语音识别的极限：一种紧凑型、高准确率的低延迟英语模型

Banfic, Nenad, Fan, David, Vaishnavi, Kunal, Kemp, Sam, Choi, Sunghoon, Ren, Rui, Shaw, Sayan, Tang, Meng

Abstract

Deploying high-quality automatic speech recognition (ASR) on edge devices requires models that jointly optimize accuracy, latency, and memory footprint while operating entirely on CPU without GPU acceleration. We conduct a systematic empirical study of state-of-the-art ASR architectures, encompassing encoder-decoder, transducer, and LLM-based paradigms, evaluated across batch, chunked, and streaming inference modes. Through a comprehensive benchmark of over 50 configurations spanning OpenAI Whisper, NVIDIA Nemotron, Parakeet TDT, Canary, Conformer Transducer, and Qwen3-ASR, we identify NVIDIA's Nemotron Speech Streaming as the strongest candidate for real-time English streaming on resource-constrained hardware. We then re-implement the complete streaming inference pipeline in ONNX Runtime and conduct a controlled evaluation of multiple post-training quantization strategies, including importance-weighted k-quant, mixed-precision schemes, and round-to-nearest quantization, combined with graph-level operator fusion. These optimizations reduce the model from 2.47 GB to as little as 0.67 GB while maintaining word error rate (WER) within 1% absolute of the full-precision PyTorch baseline. Our recommended configuration, the int4 k-quant variant, achieves 8.20% average streaming WER across eight standard benchmarks, running comfortably faster than real-time on CPU with 0.56 s algorithmic latency, establishing a new quality-efficiency Pareto point for on-device streaming ASR.
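
Of the quantization strategies named in this abstract, round-to-nearest is the simplest to show. The sketch below assumes symmetric per-tensor scaling to the signed int4 range; that scaling choice and the example weights are assumptions, and the importance-weighted k-quant, mixed-precision schemes, and ONNX Runtime pipeline from the paper are not reproduced.

```python
def quantize_rtn_int4(weights):
    """Symmetric per-tensor round-to-nearest quantization to the int4 range [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Python's round() resolves exact ties to the nearest even integer.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Hypothetical weight values chosen to avoid rounding ties.
weights = [0.70, -0.33, 0.12, -0.02, 0.00]
codes, scale = quantize_rtn_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a 4-bit code plus one shared scale, and the reconstruction error of any weight inside the representable range is at most half the scale.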

Chinese Translation

在边缘设备上部署高质量的自动语音识别（ASR）需要模型在准确性、延迟和内存占用方面进行联合优化，同时完全依赖CPU而不使用GPU加速。我们对最先进的ASR架构进行了系统的实证研究，涵盖了编码器-解码器、转换器和基于大语言模型（LLM）的范式，并在批处理、分块和流式推理模式下进行了评估。通过对超过50种配置的全面基准测试，包括OpenAI Whisper、NVIDIA Nemotron、Parakeet TDT、Canary、Conformer Transducer和Qwen3-ASR，我们确定NVIDIA的Nemotron语音流式处理是资源受限硬件上实时英语流式处理的最佳候选方案。随后，我们在ONNX Runtime中重新实现了完整的流式推理管道，并对多种后训练量化策略进行了控制评估，包括重要性加权k-量化、混合精度方案和四舍五入量化，结合图级操作融合。这些优化将模型从2.47 GB减少到最低0.67 GB，同时保持字错误率（WER）在全精度PyTorch基线的绝对1%以内。我们推荐的配置，即int4 k-量化变体，在八个标准基准测试中实现了8.20%的平均流式WER，CPU上的算法延迟为0.56秒，运行速度明显快于实时，为设备端流式ASR建立了新的质量-效率帕累托点。


cs.AI / 20 / 2604.14498

Improving Machine Learning Performance with Synthetic Augmentation

通过合成增强提高机器学习性能

Sohm, Mel, Dezons, Charles, Sellami, Sami, Ninou, Oscar, Pincon, Axel

Abstract

Synthetic augmentation is increasingly used to mitigate data scarcity in financial machine learning, yet its statistical role remains poorly understood. We formalize synthetic augmentation as a modification of the effective training distribution and show that it induces a structural bias-variance trade-off: while additional samples may reduce estimation error, they may also shift the population objective whenever the synthetic distribution deviates from regions relevant under evaluation. To isolate informational gains from mechanical sample-size effects, we introduce a size-matched null augmentation and a finite-sample, non-parametric block permutation test that remains valid under weak temporal dependence. We evaluate this framework in both controlled Markov-switching environments and real financial datasets, including high-frequency option trade data and a daily equity panel. Across generators spanning bootstrap, copula-based models, variational autoencoders, diffusion models, and TimeGAN, we vary augmentation ratio, model capacity, task type, regime rarity, and signal-to-noise. We show that synthetic augmentation is beneficial only in variance-dominant regimes, such as persistent volatility forecasting, while it deteriorates performance in bias-dominant settings, including near-efficient directional prediction. Rare-regime targeting can improve domain-specific metrics but may conflict with unconditional permutation inference. Our results provide a structural perspective on when synthetic data improves financial learning performance and when it induces persistent distributional distortion.

Chinese Translation

合成增强在金融机器学习中越来越多地用于缓解数据稀缺问题，但其统计作用仍然不甚明了。我们将合成增强形式化为有效训练分布的修改，并显示它引入了一种结构性偏差-方差权衡：虽然额外样本可能减少估计误差，但当合成分布偏离评估相关区域时，它也可能改变总体目标。为了将信息增益与机械样本大小效应分离，我们引入了一种大小匹配的零增强和一种有限样本、非参数的区块置换检验，该检验在弱时间依赖下仍然有效。我们在受控的马尔可夫切换环境和真实金融数据集（包括高频期权交易数据和每日股票面板）中评估这一框架。通过涵盖自助法、基于copula的模型、变分自编码器、扩散模型和TimeGAN的生成器，我们变化了增强比率、模型容量、任务类型、状态稀有性和信噪比。我们表明，合成增强仅在方差主导的状态下是有益的，例如持续波动性预测，而在偏差主导的环境中（包括接近有效的方向预测）则会降低性能。稀有状态的目标可以改善特定领域的指标，但可能与无条件置换推断相冲突。我们的结果提供了一个结构性视角，阐明了何时合成数据能够改善金融学习性能，以及何时它会引发持续的分布失真。


cs.AI / 21 / 2604.14500

Geometric Metrics for MoE Specialization: From Fisher Information to Early Failure Detection

MoE 专业化的几何度量：从费舍尔信息到早期故障检测

Guo, Dongxin, Wu, Jikun, Yiu, Siu Ming

Abstract

Expert specialization is fundamental to Mixture-of-Experts (MoE) model success, yet existing metrics (cosine similarity, routing entropy) lack theoretical grounding and yield inconsistent conclusions under reparameterization. We present an information-geometric framework providing the first rigorous characterization of MoE specialization dynamics. Our key insight is that expert routing distributions evolve on the probability simplex equipped with the Fisher information metric, enabling formal analysis via Riemannian geometry. We prove that standard heuristic metrics violate parameterization invariance (Theorem 1), establish that specialization corresponds to geodesic flow with quantified approximation bounds (Theorem 2), and derive a failure predictor with theoretical threshold justification (Theorem 3). The framework introduces two principled metrics: Fisher Specialization Index (FSI), achieving r = 0.91 ± 0.02 correlation with downstream performance, and Fisher Heterogeneity Score (FHS), predicting training failure at 10% completion with AUC = 0.89 ± 0.03, outperforming validation-loss-based early stopping by 23% while requiring 40× fewer compute cycles. We validate intervention protocols achieving an 87% recovery rate when FHS > 1 is detected. Comprehensive experiments across language modeling (WikiText-103, C4), vision MoE (ImageNet), and scaling studies (8-64 experts, 125M-2.7B parameters) validate our theoretical predictions.
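
The abstract does not define FSI or FHS, so the sketch below only illustrates the underlying geometry: on the probability simplex with the Fisher information metric, the geodesic (Fisher-Rao) distance between two categorical distributions p and q is 2·arccos(Σ √(p_i q_i)). The example routing distributions are hypothetical.

```python
import math


def fisher_rao_distance(p, q):
    """Geodesic distance between two categorical distributions under the Fisher metric."""
    # Bhattacharyya coefficient: inner product of the square-root embeddings.
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Clamp to [0, 1] so rounding error cannot push acos out of its domain.
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))


# Hypothetical routing distributions over four experts.
uniform = [0.25, 0.25, 0.25, 0.25]
specialized = [0.97, 0.01, 0.01, 0.01]
one_hot = [1.0, 0.0, 0.0, 0.0]

d_partial = fisher_rao_distance(uniform, specialized)
d_full = fisher_rao_distance(uniform, one_hot)
```

The square-root map sends the simplex onto part of a sphere, which is why this distance is an arc length; a router moving from uniform toward one-hot routing traces a path whose length can be measured this way.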

Chinese Translation

专家专业化是混合专家（Mixture-of-Experts, MoE）模型成功的基础，但现有的度量（余弦相似度、路由熵）缺乏理论基础，并且在重新参数化下得出的结论不一致。我们提出了一个信息几何框架，首次对 MoE 专业化动态进行了严格的表征。我们的关键见解是，专家路由分布在配备费舍尔信息度量的概率单纯形上演变，使得通过黎曼几何进行形式分析成为可能。我们证明了标准启发式度量违反了参数化不变性（定理 1），确立了专业化对应于具有量化近似界限的测地线流动（定理 2），并推导出具有理论阈值证明的故障预测器（定理 3）。该框架引入了两个原则性度量：费舍尔专业化指数（Fisher Specialization Index, FSI）与下游性能的相关性达到 r=0.91+/-0.02，以及费舍尔异质性评分（Fisher Heterogeneity Score, FHS），在完成 10% 训练时预测训练失败的 AUC 为 0.89+/-0.03——在计算周期上比基于验证损失的早期停止方法减少了 40 倍，同时提高了 23%的性能。我们验证了干预协议，当检测到 FHS>1 时，恢复率达到 87%。在语言建模（WikiText-103, C4）、视觉 MoE（ImageNet）和规模研究（8-64 名专家，125M-2.7B 参数）中的全面实验验证了我们的理论预测。


cs.AI / 22 / 2604.14514

Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

生物医学人工智能中的偏见视角：防止下游医疗差异

Rosen-Zvi, Michal, Kan-Tor, Yoav, Danziger, Michael, Ferretti, Agata, Aula-Blasco, Javier, Falcao, Julia, Shamir, Ron, Muszkat, Mordechai

Abstract

Healthcare disparities persist across socioeconomic boundaries and are often attributed to unequal access to screening, diagnostics, and therapeutics. However, this perspective highlights that critical biases can emerge much earlier, during data collection and research prioritization, long before clinical implementation, when the focus of the studies and the data collected are at the molecular level. A vast number of studies focus on collecting omics data, but the demographic information associated with these datasets is often not reported, and when it is reported, it shows substantial biases. An automated analysis of 4719 PubMed-indexed omics publications from 2015 to 2024 reveals that only a small fraction report ancestry or ethnicity information, with ancestry reporting improving slightly. Analysis of large-scale datasets commonly used for model training, such as CellxGene and GEO, reveals substantial population bias, with European-ancestry data dominating. As biomedical foundation models become central to biomedical discovery, under a paradigm in which base models are pretrained on large datasets and reused repeatedly for many different downstream tasks, they risk perpetuating or amplifying these early-stage biases, leading to cascading inequities that regulatory interventions cannot fully reverse. We propose a community-wide focus on three foundational principles, Provenance, Openness, and Evaluation Transparency, to improve equity and robustness in biomedical AI. This approach aims to foster biomedical innovation that more effectively serves underserved populations and improves health outcomes.

Chinese Translation

医疗差异在社会经济边界中持续存在，通常归因于筛查、诊断和治疗的不平等获取。然而，这一视角强调，关键的偏见可能在数据收集和研究优先级设定的早期阶段就已出现，远在临床实施之前，尤其是在研究的重点和收集的数据处于分子水平的情况下。大量研究集中于收集组学（omics）数据，但与这些数据集相关的人口统计信息在研究中往往未被报告，且即使报告，通常也显示出显著的偏见。对2015年至2024年间4719篇PubMed索引的组学出版物的自动化分析显示，只有一小部分报告了祖先或种族信息，祖先报告略有改善。对常用于模型训练的大规模数据集（如CellxGene和GEO）的分析揭示了显著的人口偏见，其中欧洲祖先数据占主导地位。随着生物医学基础模型在生物医学发现中变得日益重要，这些基础模型在大型数据集上进行预训练，并在许多不同的下游任务中反复使用，它们可能会延续或放大这些早期阶段的偏见，导致监管干预无法完全逆转的级联不平等。我们建议在社区范围内关注三个基础原则：来源（Provenance）、开放性（Openness）和评估透明性（Evaluation Transparency），以提高生物医学人工智能的公平性和稳健性。这一方法旨在促进更有效服务于服务不足人群的生物医学创新，并改善健康结果。


cs.AI / 23 / 2604.14518

Mind DeepResearch Technical Report

Mind DeepResearch 技术报告

MindDR Team, Li Auto Inc

Abstract

We present Mind DeepResearch (MindDR), an efficient multi-agent deep research framework that achieves leading performance with only ~30B-parameter models through a meticulously designed data synthesis and multi-stage training pipeline. The core innovation of MindDR lies in a collaborative three-agent architecture (Planning Agent, DeepSearch Agent, and Report Agent) and a four-stage agent-specialized training pipeline comprising SFT cold-start, Search-RL, Report-RL, and preference alignment. With this regime, MindDR demonstrates competitive performance even with ~30B-scale models. Specifically, MindDR achieves 45.7% on BrowseComp-ZH, 42.8% on BrowseComp, 46.5% on WideSearch, 75.0% on xbench-DS, and 52.5 on DeepResearch Bench, outperforming comparable-scale open-source agent systems and rivaling larger-scale models. MindDR has been deployed as an online product at Li Auto. Furthermore, we introduce MindDR Bench, a curated benchmark of 500 real-world Chinese queries from our internal product user interactions, evaluated through a comprehensive multi-dimensional rubric system rather than relying on a single RACE metric. On MindDR Bench, MindDR achieves a state-of-the-art score of 51.8.

Chinese Translation

我们提出了Mind DeepResearch (MindDR)，这是一个高效的多智能体深度研究框架，通过精心设计的数据合成和多阶段训练流程，仅使用约300亿参数的模型就实现了领先的性能。MindDR的核心创新在于其协作的三智能体架构（规划智能体、深度搜索智能体和报告智能体）以及一个由SFT冷启动、搜索强化学习（Search-RL）、报告强化学习（Report-RL）和偏好对齐组成的四阶段智能体专用训练流程。在这一框架下，MindDR即使在约300亿规模的模型下也表现出竞争力的性能。具体而言，MindDR在BrowseComp-ZH上达到了45.7%，在BrowseComp上为42.8%，在WideSearch上为46.5%，在xbench-DS上为75.0%，在DeepResearch Bench上为52.5，超越了同规模的开源智能体系统，并与更大规模的模型相抗衡。MindDR已作为在线产品在理想汽车中部署。此外，我们还推出了MindDR Bench，这是一个由500个来自我们内部产品用户交互的真实中文查询组成的精选基准，通过全面的多维评分体系进行评估，而不是依赖单一的RACE指标。在MindDR Bench上，MindDR达到了51.8的最新成绩。


cs.AI / 24 / 2604.14525

Quantifying Cross-Query Contradictions in Multi-Query LLM Reasoning

量化多查询大语言模型推理中的跨查询矛盾

Salla, Rohit Kumar, Amancherla, Ramya Manasa, Saravanan, Manoj

Abstract

Large language models frequently produce mutually inconsistent answers when reasoning over multiple related queries. We study case-file logical consistency: maintaining a globally satisfiable belief state across interdependent queries. We introduce a benchmark of 390 multi-query reasoning instances with entailment/contradiction/unknown labels and propose set-level metrics including Case Satisfiability Rate, Contradiction Density and Revision Cost. Our solver-augmented approach extracts commitments, verifies global satisfiability and performs counterexample-guided repair. Across four reasoning domains, our method substantially reduces cross-query contradictions (SetCons: 0.56 to 0.94) while preserving per-query accuracy, demonstrating that global coherence is critical for robust multi-query reasoning.

Chinese Translation

大型语言模型在处理多个相关查询时，常常会产生相互矛盾的答案。我们研究案例文件的逻辑一致性：在相互依赖的查询中维持一个全局可满足的信念状态。我们引入了一个包含390个多查询推理实例的基准，这些实例带有蕴涵/矛盾/未知标签，并提出了包括案例可满足率、矛盾密度和修订成本在内的集合级指标。我们的求解器增强方法提取承诺，验证全局可满足性并执行反例引导的修复。在四个推理领域中，我们的方法显著减少了跨查询矛盾（SetCons: 从0.56提升至0.94），同时保持每个查询的准确性，证明了全局一致性对于稳健的多查询推理至关重要。


cs.AI / 25 / 2604.14528

Dissecting Failure Dynamics in Large Language Model Reasoning

剖析大型语言模型推理中的失败动态

Zhu, Wei, Zhang, Jian, Yu, Lixing, Yue, Kun, Tang, Zhiwen

Abstract

Large Language Models (LLMs) achieve strong performance through extended inference-time deliberation, yet how their reasoning failures arise remains poorly understood. By analyzing model-generated reasoning trajectories, we find that errors are not uniformly distributed but often originate from a small number of early transition points, after which reasoning remains locally coherent but globally incorrect. These transitions coincide with localized spikes in token-level entropy, and alternative continuations from the same intermediate state can still lead to correct solutions. Based on these observations, we introduce GUARD, a targeted inference-time framework that probes and redirects critical transitions using uncertainty signals. Empirical evaluations across multiple benchmarks confirm that interventions guided by these failure dynamics lead to more reliable reasoning outcomes. Our findings highlight the importance of understanding when and how reasoning first deviates, complementing existing approaches that focus on scaling inference-time computation.
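
The link between failure points and localized entropy spikes suggests a simple detector. The sketch below assumes access to the next-token distribution at each reasoning step and flags steps whose entropy is unusually high for the trajectory; the z-score rule, its threshold, and the toy distributions are assumptions, and GUARD's actual probing and redirection procedure is not shown.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_spikes(step_distributions, z_threshold=1.5):
    """Indices of steps whose entropy exceeds mean + z_threshold * std of the trajectory."""
    entropies = [token_entropy(d) for d in step_distributions]
    mu, sigma = mean(entropies), pstdev(entropies)
    if sigma == 0.0:
        return []
    return [i for i, h in enumerate(entropies) if h > mu + z_threshold * sigma]


# Hypothetical trajectory: confident steps with one highly uncertain transition.
trajectory = [
    [0.97, 0.01, 0.01, 0.01],
    [0.94, 0.02, 0.02, 0.02],
    [0.25, 0.25, 0.25, 0.25],  # candidate critical transition
    [0.97, 0.01, 0.01, 0.01],
    [0.94, 0.02, 0.02, 0.02],
    [0.97, 0.01, 0.01, 0.01],
]
spikes = entropy_spikes(trajectory)
```

A flagged index marks a point from which alternative continuations could be sampled, in line with the observation that a different continuation from the same intermediate state can still reach a correct solution.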

Chinese Translation

大型语言模型（LLMs）通过延长推理时间的深思熟虑实现了强大的性能，但其推理失败的产生机制仍然不甚清楚。通过分析模型生成的推理轨迹，我们发现错误并不是均匀分布的，而是通常源于少数早期转折点，此后推理在局部上保持一致性，但在全局上却是错误的。这些转折点与令牌级熵的局部峰值相吻合，并且从同一中间状态的替代延续仍然可以导致正确的解决方案。基于这些观察，我们引入了GUARD，一个针对推理时间的框架，利用不确定性信号探测并重定向关键转折。多项基准测试的实证评估证实，基于这些失败动态的干预能够带来更可靠的推理结果。我们的发现强调了理解推理首次偏离的时机和方式的重要性，这补充了现有关注推理时间计算扩展的方法。


cs.AI / 26 / 2604.14531

TRACER: Trace-Based Adaptive Cost-Efficient Routing for LLM Classification

TRACER：基于追踪的自适应成本高效路由用于大型语言模型分类

Rida, Adam

Abstract

Every call to an LLM classification endpoint produces a labeled input-output pair already retained in production logs. These pairs constitute a free, growing training set: a lightweight surrogate trained on them can absorb a significant portion of future traffic at near-zero marginal inference cost. The open questions are when the surrogate is reliable enough to deploy, what it handles versus defers, and how that boundary evolves as data accumulates. We introduce TRACER (Trace-based Adaptive Cost-Efficient Routing), an open-source system that trains ML surrogates on an LLM's own production traces and governs deployment through a parity gate: the surrogate is activated only when its agreement with the LLM exceeds a user-specified threshold α. To make the routing boundary transparent, TRACER generates interpretability artifacts describing which input regions the surrogate handles, where it plateaus, and why it defers. On a 77-class intent benchmark with a Sonnet 4.6 teacher, TRACER achieves 83-100% surrogate coverage depending on the quality target α; on a 150-class benchmark, the surrogate fully replaces the teacher. On a natural language inference task, the parity gate correctly refuses deployment because the embedding representation cannot support reliable separation. The system is available as open-source software.
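
The parity gate reduces to an agreement check against a threshold. The sketch below assumes agreement is measured as the fraction of held-out traces on which the surrogate reproduces the LLM's label; the function names and example labels are hypothetical and do not reflect TRACER's actual API.

```python
def agreement(surrogate_labels, llm_labels):
    """Fraction of traces on which the surrogate reproduces the LLM's label."""
    if len(surrogate_labels) != len(llm_labels) or not llm_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(s == t for s, t in zip(surrogate_labels, llm_labels))
    return matches / len(llm_labels)


def parity_gate(surrogate_labels, llm_labels, alpha):
    """Activate the surrogate only if its agreement with the LLM exceeds alpha."""
    return agreement(surrogate_labels, llm_labels) > alpha


# Hypothetical held-out traces: the LLM's labels versus the surrogate's predictions.
llm = ["refund", "refund", "cancel", "upgrade", "cancel"]
surrogate = ["refund", "refund", "cancel", "cancel", "cancel"]

deploy_lenient = parity_gate(surrogate, llm, alpha=0.75)
deploy_strict = parity_gate(surrogate, llm, alpha=0.90)
```

With 4 of 5 labels matching, the surrogate passes a 0.75 threshold but is refused at 0.90, mirroring how a stricter quality target can keep traffic on the teacher.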

Chinese Translation

每次调用大型语言模型（LLM）分类端点都会产生一个已在生产日志中保留的标记输入-输出对。这些对构成了一个免费的、不断增长的训练集：在其上训练的轻量级替代模型可以以近乎零的边际推理成本吸收未来流量的显著部分。未解的问题是何时替代模型足够可靠以进行部署，它处理什么内容以及推迟处理什么内容，以及随着数据的积累，这一边界如何演变。我们提出了TRACER（基于追踪的自适应成本高效路由），这是一个开源系统，它在LLM自身的生产追踪上训练机器学习替代模型，并通过一个平衡门控制部署：只有当替代模型与LLM的一致率超过用户指定的阈值α时，替代模型才会被激活。为了使路由边界透明，TRACER生成可解释性文档，描述替代模型处理的输入区域、其平稳区域以及为何推迟处理。在一个77类意图基准测试中，使用Sonnet 4.6教师，TRACER根据质量目标α实现了83-100%的替代覆盖；在一个150类基准测试中，替代模型完全取代了教师。在一个自然语言推理任务中，平衡门正确拒绝了部署，因为嵌入表示无法支持可靠的分离。该系统作为开源软件可用。


cs.AI / 27 / 2604.14564

MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation

MARS$^2$: 通过强化学习扩展多智能体树搜索以进行代码生成

Li, Pengfei, Wang, Shijie, Li, Fangyuan, Fu, Yikun, Liu, Kaifeng, Zhang, Kaiyan, Zhang, Dazhi, Li, Yuqiang, Qi, Biqing, Zhou, Bowen

Abstract

Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning-intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search-enhanced RL alleviates this issue by introducing structured exploration, but it remains constrained by single-agent policy priors. Meanwhile, leveraging multiple interacting policies can acquire more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS$^2$ (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently optimized agents collaborate within a shared tree-structured search environment. MARS$^2$ models the search tree as a learnable multi-agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path-level group advantage formulation based on tree-consistent reward shaping, which facilitates effective credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS$^2$ consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi-agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at https://github.com/TsinghuaC3I/MARTI.

Chinese Translation

强化学习（RL）范式在代码生成等推理密集型任务上表现出强大的性能。然而，有限的轨迹多样性常常导致收益递减，从而限制了可实现的性能上限。增强搜索的强化学习通过引入结构化探索来缓解这一问题，但仍然受到单智能体策略先验的限制。同时，利用多个相互作用的策略可以获得更为多样的探索信号，但现有方法通常与结构化搜索脱钩。我们提出了MARS$^2$（多智能体强化树搜索扩展），这是一个统一的强化学习框架，其中多个独立优化的智能体在共享的树结构搜索环境中协作。MARS$^2$将搜索树建模为一个可学习的多智能体交互环境，使得异构智能体能够在共享的搜索拓扑中协作生成和优化候选解决方案。为了支持有效学习，我们引入了一种基于树一致性奖励塑形的路径级组优势公式，促进了复杂搜索轨迹中的有效信用分配。在代码生成基准上的实验表明，MARS$^2$在不同的模型组合和训练设置中始终提高了性能，证明了将多智能体协作与树搜索结合以增强强化学习的有效性。我们的代码已公开发布在 https://github.com/TsinghuaC3I/MARTI。


cs.AI / 28 / 2604.14576

Enhancing Mental Health Counseling Support in Bangladesh using Culturally-Grounded Knowledge

利用文化基础知识增强孟加拉国的心理健康咨询支持

Hasan, Md Arid, SP, Azhagu Meena, Khan, Aditya, Bhuiyan, Abu Md Akteruzzaman, Ahmed, Helal Uddin, Debi, Joysree, Sadeque, Farig, Lee, Annie En-Shiun, Ahmed, Syed Ishtiaque

Abstract

Large language models (LLMs) show promise in generating supportive responses for mental health and counseling applications. However, their responses often lack cultural sensitivity, contextual grounding, and clinically appropriate guidance. This work addresses the open question of how to systematically incorporate domain-specific, clinically validated knowledge into LLMs to improve counseling quality. We utilize and compare two approaches, retrieval-augmented generation (RAG) and a knowledge graph (KG)-based method, designed to support para-counselors. Our KG is constructed manually and clinically validated, capturing causal relationships between stressors, interventions, and outcomes, with contributions from a multidisciplinary team. We evaluated multiple LLMs in both settings using BERTScore F1 and SBERT cosine similarity, as well as human evaluation across five metrics designed to directly measure the effectiveness of counseling beyond surface-level similarity. The results show that KG-based approaches consistently improve contextual relevance, clinical appropriateness, and practical usability compared to RAG alone, demonstrating that structured, expert-validated knowledge plays a critical role in addressing LLMs' limitations in counseling tasks.

Chinese Translation

大型语言模型（LLMs）在生成心理健康和咨询应用的支持性回应方面显示出潜力。然而，它们的回应往往缺乏文化敏感性、上下文基础和临床适当性指导。本研究解决了如何系统性地将领域特定的、经过临床验证的知识纳入LLMs，以提高咨询质量的这一空白。我们利用并比较了两种方法，即检索增强生成（RAG）和基于知识图谱（KG）的方法，旨在支持辅助咨询师。我们的知识图谱是手动构建并经过临床验证的，捕捉了压力源、干预措施和结果之间的因果关系，并得到了多学科人员的贡献。我们在这两种设置中评估了多种LLMs，使用了BERTScore F1和SBERT余弦相似度，以及跨五个指标的人类评估，旨在直接测量咨询的有效性，而不仅仅是表面上的相似性。结果表明，与单独的RAG相比，基于知识图谱的方法在上下文相关性、临床适当性和实际可用性方面始终表现出改善，证明了结构化的、经过专家验证的知识在解决LLMs在咨询任务中的局限性方面发挥了关键作用。


cs.AI / 29 / 2604.14585

Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

提示优化如同掷硬币：诊断其在复合人工智能系统中的有效性

Zhang, Xing, Wang, Guanghui, Cui, Yanwei, Qiu, Wei, Li, Ziyuan, Zhu, Bing, He, Peiyang

Abstract

Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku (6 methods $\times$ 4 tasks $\times$ 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to $+6.8$ points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy: (A) individual prompts are worth optimizing, and (B) agent prompts interact, requiring joint optimization. Interaction effects are never significant ($p > 0.52$, all $F < 1.0$), and optimization helps only when the task has exploitable output structure -- a format the model can produce but does not default to. We provide a two-stage diagnostic: an \$80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile -- turning a coin flip into an informed decision.
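The abstract's coupling pre-test rests on an ANOVA interaction term. As a minimal sketch of that idea, assuming a balanced two-factor design in which factor A is the prompt variant given to one agent and factor B is the prompt variant given to another (this framing and the function name `interaction_f` are illustrative assumptions, not the authors' tooling), the interaction F-statistic can be computed as follows:

```python
from typing import List


def interaction_f(cells: List[List[List[float]]]) -> float:
    """F-statistic for the A x B interaction in a balanced two-way ANOVA.

    cells[i][j] holds the replicate scores for level i of factor A and
    level j of factor B. Every cell needs the same number (>= 2) of
    replicates, and the within-cell variance must be non-zero.
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    cell_mean = [[sum(c) / n for c in row] for row in cells]
    row_mean = [sum(r) / b for r in cell_mean]
    col_mean = [sum(cell_mean[i][j] for i in range(a)) / a for j in range(b)]
    grand = sum(row_mean) / a

    # Interaction sum of squares: what is left of each cell mean after
    # removing the additive row and column effects.
    ss_ab = n * sum(
        (cell_mean[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(a)
        for j in range(b)
    )
    # Error sum of squares: replicate noise inside each cell.
    ss_err = sum(
        (x - cell_mean[i][j]) ** 2
        for i in range(a)
        for j in range(b)
        for x in cells[i][j]
    )
    df_ab = (a - 1) * (b - 1)
    df_err = a * b * (n - 1)
    return (ss_ab / df_ab) / (ss_err / df_err)


# Toy scores (made up for illustration): purely additive effects, no coupling.
additive = [[[i + 2 * j - 1, i + 2 * j + 1] for j in range(2)] for i in range(2)]
# Toy scores where one specific prompt pair behaves differently: coupling.
coupled = [[[0, 2], [0, 2]], [[0, 2], [4, 6]]]

f_additive = interaction_f(additive)  # 0.0
f_coupled = interaction_f(coupled)    # 4.0
```

An F value near or below 1 means the interaction explains no more variance than replicate noise, which is the pattern the abstract reports (all $F < 1.0$). A p-value would come from the F distribution with `df_ab` and `df_err` degrees of freedom.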

Chinese Translation

在复合人工智能系统中，提示优化在统计上与掷硬币无异：在对Claude Haiku进行的72次优化实验中（6种方法 × 4个任务 × 3次重复），有49%的得分低于零-shot；而在Amazon Nova Lite上，失败率甚至更高。然而在某个任务中，所有六种方法的表现均比零-shot提高了最多6.8分。那么，成功与失败的区别是什么呢？我们通过18,000次网格评估和144次优化实验进行调查，测试了针对端到端优化工具（如TextGrad和DSPy）的两个假设：（A）单个提示值得优化，以及（B）代理提示之间存在交互，需进行联合优化。交互效应从未显著（$p > 0.52$，所有$F < 1.0$），而优化仅在任务具有可利用的输出结构时才有效——这种格式是模型能够生成的，但并非默认格式。我们提供了一个两阶段的诊断方法：一个80美元的ANOVA预检用于代理耦合，以及一个10分钟的余地测试，用于预测优化是否值得——将掷硬币的随机决策转变为有根据的选择。


cs.AI / 30 / 2604.14607

GDPR Auto-Formalization with AI Agents and Human Verification

基于人工智能代理和人工验证的GDPR自动形式化

Nguyen, Ha Thanh, Fungwacharakorn, Wachara, Wehnert, Sabine, Zin, May Myo, Kong, Yuntao, Xue, Jieying, Araszkiewicz, Michał, Goebel, Randy, Satoh, Ken

Abstract

We study the overall process of automatic formalization of GDPR provisions using large language models, within a human-in-the-loop verification framework. Rather than aiming for full autonomy, we adopt a role-specialized workflow in which LLM-based AI components, operating in a multi-agent setting with iterative feedback, generate legal scenarios, formal rules, and atomic facts. This is coupled with independent verification modules which include human reviewers' assessment of representational, logical, and legal correctness. Using this approach, we construct a high-quality dataset to be used for GDPR auto-formalization, and analyze both successful and problematic cases. Our results show that structured verification and targeted human oversight are essential for reliable legal formalization, especially in the presence of legal nuance and context-sensitive reasoning.

Chinese Translation

我们研究了在以人为中心的验证框架下，使用大型语言模型自动形式化GDPR条款的整体过程。我们并不追求完全的自主性，而是采用了一种角色专门化的工作流程，其中基于LLM的人工智能组件在多代理环境中通过迭代反馈生成法律场景、形式规则和原子事实。这与独立的验证模块相结合，包括人类审阅者对表现、逻辑和法律正确性的评估。通过这种方法，我们构建了一个高质量的数据集，用于GDPR的自动形式化，并分析了成功案例和问题案例。我们的结果表明，结构化验证和有针对性的人类监督对于可靠的法律形式化至关重要，尤其是在法律细微差别和上下文敏感推理的情况下。


cs.AI / 31 / 2604.14609

El Agente Forjador: Task-Driven Agent Generation for Quantum Simulation

El Agente Forjador：基于任务驱动的量子模拟代理生成

Zhang, Zijian, Yin, Aiwei, Baweja, Amaan, Bai, Jiaru, Gustin, Ignacio, Bernales, Varinia, Aspuru-Guzik, Alán

Abstract

AI for science promises to accelerate the discovery process. The advent of large language models (LLMs) and agentic workflows enables the expediting of a growing range of scientific tasks. However, most of the current generation of agentic systems depend on static, hand-curated toolsets that hinder adaptation to new domains and evolving libraries. We present El Agente Forjador, a multi-agent framework in which universal coding agents autonomously forge, validate, and reuse computational tools through a four-stage workflow of tool analysis, tool generation, task execution, and iterative solution evaluation. Evaluated across 24 tasks spanning quantum chemistry and quantum dynamics on five coding agent setups, we compare three operating modes: zero-shot generation of tools per task, reuse of a curriculum-built toolset, and direct problem-solving with the coding agents as the baseline. We find that our tool generation and reuse framework consistently improves accuracy over the baseline. We also show that reusing a toolset built by a stronger coding agent can reduce API cost and substantially raise the solution quality for weaker coding agents. Case studies further demonstrate that tools forged for different domains can be combined to solve hybrid tasks. Taken together, these results show that LLM-based agents can use their scientific knowledge and coding capabilities to autonomously build reusable scientific tools, pointing toward a paradigm in which agent capabilities are defined by the tasks they are designed to solve rather than by explicitly engineered implementations.

Chinese Translation

科学领域的人工智能承诺加速发现过程。大型语言模型（LLMs）和代理工作流的出现使得越来越多的科学任务能够得到快速处理。然而，目前大多数代理系统依赖于静态的、人工策划的工具集，这限制了它们对新领域和不断发展的库的适应能力。我们提出了El Agente Forjador，一个多代理框架，其中通用编码代理通过工具分析、工具生成、任务执行和迭代解决方案评估的四个阶段工作流程，自动铸造、验证和重用计算工具。在涵盖量子化学和量子动力学的24个任务中，我们在五种编码代理设置下进行评估，比较了三种操作模式：每个任务的零样本工具生成、重用基于课程构建的工具集，以及以编码代理为基线的直接问题解决。我们发现，我们的工具生成和重用框架在准确性上始终优于基线。我们还表明，重用由更强编码代理构建的工具集可以降低API成本，并显著提高较弱编码代理的解决方案质量。案例研究进一步表明，为不同领域铸造的工具可以组合以解决混合任务。综合来看，这些结果表明，基于LLM的代理可以利用其科学知识和编码能力，自动构建可重用的科学工具，指向一种新范式，即代理能力由其设计解决的任务定义，而不是通过明确的工程实现。


cs.AI / 32 / 2604.14615

CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors

CoDaS：通过可穿戴传感器进行生物标志物发现的人工智能协同数据科学家

Kim, Yubin, Rahman, Salman, Schmidgall, Samuel, Park, Chunjong, Heydari, A. Ali, Metwally, Ahmed A., Yu, Hong, Liu, Xin, Xu, Xuhai, Yang, Yuzhe, Xu, Maxwell A., Zhang, Zhihan, Breazeal, Cynthia, Althoff, Tim, Sirkovic, Petar, Rendulic, Ivor, Pawlosky, Annalisa, Stroppa, Nicolas, Gottweis, Juraj, Vedadi, Elahe, Karthikesalingam, Alan, Kohli, Pushmeet, Natarajan, Vivek, Malhotra, Mark, Patel, Shwetak, Park, Hae Won, Palangi, Hamid, McDuff, Daniel

Abstract

Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co-Data-Scientist), a multi-agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature-grounded reasoning with human oversight using large-scale wearable datasets. Across three cohorts totaling 9,279 participant-observations, CoDaS identified 41 candidate digital biomarkers for mental health and 25 for metabolic outcomes, each subjected to an internal validation battery spanning replication, stability, robustness, and discriminative power. Across two independent depression cohorts, CoDaS surfaced circadian instability-related features in both datasets, reflected in sleep duration variability (DWB, \rho = 0.252, p < 0.001) and sleep onset variability (GLOBEM, \rho = 0.126, p < 0.001). In a metabolic cohort, CoDaS derived a cardiovascular fitness index (steps/resting heart rate; \rho = -0.374, p < 0.001), and recovered established clinical associations, including the hepatic function ratio (AST/ALT; \rho = -0.375, p < 0.001), a known correlate of insulin resistance. Incorporating CoDaS-derived features alongside demographic variables led to modest but consistent improvements in predictive performance, with cross-validated \Delta R^2 increases of 0.040 for depression and 0.021 for insulin resistance. These findings suggest that CoDaS enables systematic and traceable hypothesis generation and prioritization for biomarker discovery from large-scale wearable data.
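The associations above are reported as Spearman-style rank correlations ($\rho$). As a minimal sketch of how such a coefficient is obtained, assuming average ranks for ties and using made-up toy values rather than any cohort data, one can compute the Pearson correlation of the ranks:

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy inputs, invented for illustration only.
rho_reversed = spearman_rho([1, 2, 3, 4], [10, 8, 5, 1])       # -1.0 (up to rounding)
rho_partial = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])   # 0.8 (up to rounding)
```

Because only ranks enter the calculation, a derived feature such as steps divided by resting heart rate can be scored without assuming a linear relationship with the outcome.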

Chinese Translation

数字健康中的科学发现需要将可穿戴设备的连续生理信号转化为临床可操作的生物标志物。我们介绍了CoDaS（人工智能协同数据科学家），这是一个多智能体系统，将生物标志物发现结构化为一个迭代过程，结合假设生成、统计分析、对抗验证和基于文献的推理，并在大型可穿戴数据集的支持下进行人工监督。在三个总计9,279个参与者观察的队列中，CoDaS识别了41个与心理健康相关的候选数字生物标志物和25个与代谢结果相关的生物标志物，每个标志物都经过了内部验证，包括复制性、稳定性、鲁棒性和区分能力。在两个独立的抑郁症队列中，CoDaS在两个数据集中发现了与昼夜节律不稳定性相关的特征，反映在睡眠时长变异性（DWB, \rho = 0.252, p < 0.001）和入睡时长变异性（GLOBEM, \rho = 0.126, p < 0.001）上。在一个代谢队列中，CoDaS推导出了心血管健康指数（步数/静息心率；\rho = -0.374, p < 0.001），并恢复了已建立的临床关联，包括肝功能比率（AST/ALT；\rho = -0.375, p < 0.001），这是胰岛素抵抗的已知相关因素。将CoDaS推导的特征与人口统计变量结合，导致预测性能的适度但一致的改善，抑郁症的交叉验证\Delta R^2增加了0.040，胰岛素抵抗的增加为0.021。这些发现表明，CoDaS能够系统且可追溯地生成和优先排序假设，从大型可穿戴数据中发现生物标志物。


cs.AI / 33 / 2604.14627

A Parallel Approach to Counting Exact Covers Based on Decomposability Property

基于可分解性属性的精确覆盖计数的并行方法

Fang, Liangda, Luo, Yaohui, Li, Delong, Huang, Xuanxiang, Guan, Quanlong

Abstract

The exact cover problem is a classical NP-hard problem with broad applications in the area of AI. Algorithm DXZ is a method to count exact covers represented by zero-suppressed binary decision diagrams (ZBDDs). In this paper, we propose a zero-suppressed variant of decision decomposable negation normal form (in short, decision-ZDNNF), which is strictly more succinct than ZBDDs. We then design a novel parallel algorithm, namely DXD, which constructs a decision-ZDNNF representing the set of all exact covers. Furthermore, we improve DXD by dynamically updating connected components. The experimental results demonstrate that the improved DXD algorithm outperforms all state-of-the-art methods.
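For readers unfamiliar with the underlying problem, the sketch below counts exact covers by plain recursive branching. It is deliberately not DXZ or DXD: it builds no ZBDD or decision-ZDNNF and does no memoization or parallelism, and the function name `count_exact_covers` is an illustrative assumption. It only shows what is being counted.

```python
from typing import Dict, FrozenSet


def count_exact_covers(
    universe: FrozenSet[int], subsets: Dict[str, FrozenSet[int]]
) -> int:
    """Count selections of pairwise-disjoint subsets whose union is `universe`.

    Assumes every subset in `subsets` is contained in `universe`.
    """
    if not universe:
        return 1  # every element has been covered exactly once

    # Branch on the uncovered element that appears in the fewest subsets.
    options = {e: [n for n, s in subsets.items() if e in s] for e in universe}
    pivot = min(universe, key=lambda e: len(options[e]))

    total = 0
    for name in options[pivot]:
        chosen = subsets[name]
        # Only subsets disjoint from the chosen one can still be used.
        remaining = {n: s for n, s in subsets.items() if not (s & chosen)}
        total += count_exact_covers(universe - chosen, remaining)
    return total


# Knuth's classic instance: the only exact cover is {B, D, F}.
UNIVERSE = frozenset(range(1, 8))
SUBSETS = {
    "A": frozenset({1, 4, 7}),
    "B": frozenset({1, 4}),
    "C": frozenset({4, 5, 7}),
    "D": frozenset({3, 5, 6}),
    "E": frozenset({2, 3, 6, 7}),
    "F": frozenset({2, 7}),
}
n_covers = count_exact_covers(UNIVERSE, SUBSETS)  # 1
```

Each cover is counted once because the pivot element must be covered by exactly one chosen subset, so the branches are mutually exclusive. Diagram-based methods such as those in the abstract avoid re-solving identical subproblems that this naive recursion revisits.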

Chinese Translation

精确覆盖问题是一个经典的 NP-困难问题，在人工智能领域有广泛的应用。算法 DXZ 是一种通过零压缩二进制决策图（ZBDDs）计数精确覆盖的方法。本文提出了一种零压缩的决策可分解否定范式变体（简称为决策-ZDNNF），其比 ZBDDs 更加简洁。接着，我们设计了一种新颖的并行算法，即 DXD，该算法构建了一个表示所有精确覆盖集合的决策-ZDNNF。此外，我们通过动态更新连通分量来改进 DXD。实验结果表明，改进后的 DXD 算法在性能上优于所有现有的最先进方法。


cs.AI / 34 / 2604.14641

Learning to Draw ASCII Improves Spatial Reasoning in Language Models

学习绘制ASCII图形提升语言模型的空间推理能力

Huang, Shiyuan, Liu, Li, He, Jincheng, Gilpin, Leilani H.

Abstract

When faced with complex spatial problems, humans naturally sketch layouts to organize their thinking, and the act of drawing further sharpens their understanding. In this work, we ask whether a similar principle holds for Large Language Models (LLMs): can learning to construct explicit visual layouts from spatial descriptions instill genuine spatial understanding? We introduce Text2Space, a dataset that pairs natural language descriptions with ground-truth ASCII grid layouts and spatial QA pairs, enabling us to separate failures in constructing spatial representations from failures in reasoning over them. We adopt ASCII because it is human-readable, operates entirely within the token space of language models, and encodes spatial relations in a structurally verifiable form. Our evaluation reveals a pronounced "Read-Write Asymmetry": LLMs interpret ASCII representations effectively but struggle to produce them from text, and these construction errors propagate to incorrect answers downstream. To address this limitation, we train models on layout construction (Text$\rightarrow$ASCII) and find that it significantly improves spatial reasoning from text alone, even without producing any ASCII at inference time. Combining construction with comprehension training further amplifies these gains. Crucially, these improvements transfer to three external spatial reasoning benchmarks, demonstrating that, much as sketching sharpens human spatial thinking, learning to construct explicit layouts instills spatial understanding that generalizes beyond the training format.
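To make the read/write distinction concrete, here is a minimal sketch of an ASCII grid layout with a "write" step (construct the grid from object positions) and a "read" step (answer a spatial question from the grid). The grid format, the `.` filler character, and the function names are assumptions for illustration and are not the Text2Space format.

```python
from typing import Dict, Tuple

Position = Tuple[int, int]  # (row, column), both 0-based


def render(objects: Dict[str, Position], rows: int, cols: int) -> str:
    """Write direction: place single-character objects on a dot-filled grid."""
    grid = [["." for _ in range(cols)] for _ in range(rows)]
    for symbol, (r, c) in objects.items():
        grid[r][c] = symbol
    return "\n".join("".join(row) for row in grid)


def parse(layout: str) -> Dict[str, Position]:
    """Read direction: recover object positions from the ASCII grid."""
    return {
        ch: (r, c)
        for r, line in enumerate(layout.splitlines())
        for c, ch in enumerate(line)
        if ch != "."
    }


def is_left_of(layout: str, a: str, b: str) -> bool:
    """Answer a simple spatial query directly from the layout."""
    positions = parse(layout)
    return positions[a][1] < positions[b][1]


objects = {"A": (0, 0), "B": (1, 2)}
layout = render(objects, rows=2, cols=3)
# layout is:
# A..
# ..B
```

Because a layout can be parsed back and compared against ground truth, construction errors can be checked separately from reasoning errors, which is the separation the abstract relies on.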

Chinese Translation

面对复杂的空间问题时，人类自然会通过草图布局来组织思维，而绘制的行为进一步加深了他们的理解。在本研究中，我们探讨这一原则是否同样适用于大型语言模型（LLMs）：学习从空间描述中构建明确的视觉布局是否能够培养真正的空间理解？我们引入了Text2Space数据集，该数据集将自然语言描述与真实的ASCII网格布局和空间问答对相结合，使我们能够将构建空间表示的失败与对其推理的失败分开。我们采用ASCII，因为它可被人类读取，完全在语言模型的标记空间内操作，并以结构上可验证的形式编码空间关系。我们的评估揭示了显著的“读写不对称”：LLMs能够有效解读ASCII表示，但在从文本中生成这些表示时却面临困难，这些构建错误会传播到下游的错误答案。为了解决这一局限性，我们对模型进行了布局构建训练（Text→ASCII），发现这显著提升了仅从文本进行的空间推理，即使在推理时不生成任何ASCII。将构建与理解训练相结合进一步放大了这些收益。关键是，这些改进能够转移到三个外部空间推理基准上，证明了正如草图能提升人类的空间思维，学习构建明确的布局也能培养超越训练格式的空间理解。


cs.AI / 35 / 2604.14646

Targeted Exploration via Unified Entropy Control for Reinforcement Learning

通过统一熵控制的目标探索用于强化学习

Wang, Chen, Wei, Lai, Zhang, Yanzhi, Shao, Chenyang, Dan, Zedong, Huang, Weiran, Lan, Ge, Wang, Yue

Abstract

Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to maintain optimization stability. We propose Unified Entropy Control for Reinforcement Learning (UEC-RL), a framework that provides targeted mechanisms for exploration and stabilization. UEC-RL activates more exploration on difficult prompts to search for potential and valuable reasoning trajectories. In parallel, a stabilizer prevents entropy from growing uncontrollably, thereby keeping training stable as the model consolidates reliable behaviors. Together, these components expand the search space when needed while maintaining robust optimization throughout training. Experiments on both LLM and VLM reasoning tasks show consistent gains over RL baselines on both Pass@1 and Pass@$k$. On Geometry3K, UEC-RL achieves a 37.9% relative improvement over GRPO, indicating that it sustains effective exploration without compromising convergence and underscoring UEC-RL as a key ingredient for scaling RL-based reasoning in large models. Our code is available at https://github.com/597358816/UEC-RL.
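Two quantities in this abstract can be illustrated with a short sketch: the Shannon entropy of a policy's output distribution, whose drop toward zero is what "entropy collapse" refers to, and GRPO's group-relative advantage, which standardizes rewards within a group of samples for the same prompt. This is not UEC-RL's exploration or stabilizer mechanism, and the function names and the `eps` constant are illustrative assumptions.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """GRPO-style advantages: rewards standardized within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


uniform_entropy = entropy([0.25, 0.25, 0.25, 0.25])   # log(4): maximally diverse
collapsed_entropy = entropy([1.0, 0.0, 0.0, 0.0])     # 0: a single deterministic choice

mixed_group = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
solved_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason low-diversity sampling is costly for group-based methods.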

Chinese Translation

近期在强化学习（RL）领域的进展提升了大语言模型（LLMs）和视觉-语言模型（VLMs）的推理能力。然而，广泛使用的群体相对策略优化（GRPO）始终遭遇熵崩溃，导致策略过早收敛并失去多样性。现有的探索方法在探索过程中引入了额外的偏差或方差，使得保持优化稳定性变得困难。我们提出了用于强化学习的统一熵控制（Unified Entropy Control for Reinforcement Learning, UEC-RL），这是一个提供针对性探索和稳定机制的框架。UEC-RL在困难提示上激活更多的探索，以寻找潜在和有价值的推理轨迹。同时，一个稳定器防止熵失控增长，从而在模型巩固可靠行为时保持训练的稳定性。这些组件共同作用，在需要时扩展搜索空间，同时在整个训练过程中保持强健的优化。在LLM和VLM推理任务上的实验表明，UEC-RL在Pass@1和Pass@$k$上均相较于RL基线取得了一致的提升。在Geometry3K上，UEC-RL相较于GRPO实现了37.9%的相对提升，表明其在不妨碍收敛的情况下维持有效探索，并强调了UEC-RL在大模型中扩展基于RL的推理的关键作用。我们的代码可在 https://github.com/597358816/UEC-RL 获取。


cs.AI / 36 / 2604.14655

AgentGA: Evolving Code Solutions in Agent-Seed Space

AgentGA：在代理种子空间中演化代码解决方案

Tan, David Y. Y., Chin, Kellie, Zhang, Jingxian

Abstract

We present AgentGA, a framework that evolves autonomous code-generation runs by optimizing the agent seed: the task prompt plus optional parent archives that initialize a fresh workspace. The outer loop searches over these reusable starting conditions rather than editing code directly. Each generation launches a fresh autonomous run from a reset workspace, while selected parent archives provide inherited artifacts that descendants can inspect and reuse. AgentGA couples a population-level genetic algorithm with long-horizon agents; selection uses deterministic 1:1 elite tournaments and operator allocation is adapted online with a modified Hedge controller. We instantiate the approach for tabular AutoML on the 16-competition Weco-Kaggle Lite benchmark. On the 10 benchmark runs reported here, AgentGA averages 74.52% Exceeds % of Human versus 54.15% for AIDE. Across 1135 parent-child comparisons, descendants given parent archives outperform runs started from scratch, indicating that inherited artifacts improve later autonomous runs. These findings support agent-seed optimization as a practical design point for autonomous code-search systems.
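The abstract mentions a modified Hedge controller for adapting operator allocation online. The modification itself is not described here, so the sketch below shows only the standard Hedge (multiplicative weights) update as background. The operator names, the loss values, and the learning rate `eta` are hypothetical.

```python
import math
from typing import Dict


def hedge_update(
    weights: Dict[str, float], losses: Dict[str, float], eta: float = 0.5
) -> Dict[str, float]:
    """Standard Hedge step: scale each weight by exp(-eta * loss)."""
    return {name: w * math.exp(-eta * losses[name]) for name, w in weights.items()}


def allocation(weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize weights into the probability of picking each operator."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical operators with equal initial weight.
weights = {"mutate": 1.0, "crossover": 1.0, "fresh_start": 1.0}
# Hypothetical losses in [0, 1] for one generation (lower is better).
losses = {"mutate": 0.0, "crossover": 1.0, "fresh_start": 1.0}

weights = hedge_update(weights, losses)
probs = allocation(weights)
```

After the update, the operator with the lowest loss receives the largest share of the next generation's budget, while the others keep a non-zero share and can recover if their losses improve.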

Chinese Translation

我们提出了AgentGA，一个通过优化代理种子（任务提示加上可选的父档案，以初始化一个新的工作空间）来演化自主代码生成运行的框架。外部循环在这些可重用的起始条件上进行搜索，而不是直接编辑代码。每一代都从重置的工作空间启动一个新的自主运行，同时选定的父档案提供了后代可以检查和重用的继承工件。AgentGA将群体级遗传算法与长时间跨度的代理结合在一起；选择使用确定性的1:1精英锦标赛，操作符分配通过修改后的Hedge控制器在线调整。我们在16个竞赛的Weco-Kaggle Lite基准上实例化了该方法。在这里报告的10个基准运行中，AgentGA的平均超越人类的百分比为74.52%，而AIDE为54.15%。在1135个父子比较中，给定父档案的后代表现优于从头开始的运行，表明继承工件改善了后续的自主运行。这些发现支持代理种子优化作为自主代码搜索系统的一个实际设计点。


cs.AI / 37 / 2604.14656

Rethinking Patient Education as Multi-turn Multi-modal Interaction

重新思考患者教育作为多轮多模态互动

Yao, Zonghai, Tang, Zhipeng, Lin, Chengtao, Luo, Xiong, Wang, Benlu, Huang, Juncheng, Ong, Chin Siang, Yu, Hong

Abstract

Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: systems must identify relevant evidence across images, show patients where to look, explain findings in accessible language, and handle confusion or distress. Yet most patient education work remains text-only, even though combined image-and-text explanations may better support understanding. We introduce MedImageEdu, a benchmark for multi-turn, evidence-grounded radiology patient education. Each case provides a radiology report with report text and case images. A DoctorAgent interacts with a PatientAgent, conditioned on a hidden profile that captures factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions grounded in the report, case images, and the current question to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation. MedImageEdu contains 150 cases from three sources and evaluates both the consultation process and the final multimodal response along five dimensions: Consultation, Safety and Scope, Language Quality, Drawing Quality, and Image-Text Response Quality. Across representative open- and closed-source vision-language model agents, we find three consistent gaps: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder than low education or low health literacy. MedImageEdu provides a controlled testbed for assessing whether multimodal agents can teach from evidence rather than merely answer from text.

Chinese Translation

大多数医学多模态基准集中于静态任务，如图像问答、报告生成和通俗语言重写。患者教育的要求更高：系统必须识别图像中的相关证据，指导患者查看重点，使用易于理解的语言解释发现，并处理患者的困惑或焦虑。然而，尽管结合图像和文本的解释可能更有助于理解，大多数患者教育工作仍然仅限于文本。我们引入了MedImageEdu，这是一个用于多轮、基于证据的放射学患者教育的基准。每个案例提供一份放射学报告，包括报告文本和案例图像。DoctorAgent与PatientAgent进行互动，基于一个隐藏的个人资料，该资料捕捉教育水平、健康素养和个性等因素。当患者的问题需要视觉支持时，DoctorAgent可以根据报告、案例图像和当前问题向基准提供的绘图工具发出绘图指令。该工具返回图像，随后DoctorAgent生成最终的多模态响应，包括图像和基于证据的通俗语言解释。MedImageEdu包含来自三个来源的150个案例，并在五个维度上评估咨询过程和最终的多模态响应：咨询、安全性与范围、语言质量、绘图质量和图像-文本响应质量。在代表性的开源和闭源视觉-语言模型代理中，我们发现三个一致的差距：流畅的语言往往超越忠实的视觉基础，安全性是所有疾病类别中最薄弱的维度，而情感紧张的互动比低教育水平或低健康素养的互动更具挑战性。MedImageEdu提供了一个受控测试平台，以评估多模态代理是否能够基于证据进行教学，而不仅仅是从文本中回答问题。


cs.AI / 38 / 2604.14682

Acceptance Dynamics Across Cognitive Domains in Speculative Decoding

推测解码中认知领域的接受动态

Mahmoud, Saif

Abstract

Speculative decoding accelerates large language model (LLM) inference. It uses a small draft model to propose a tree of future tokens. A larger target model then verifies these tokens in a single batched forward pass. Despite the growing body of work on speculative methods, the degree to which the cognitive characteristics of a task affect acceptance probability remains largely unexplored. We present an empirical study of tree-based speculative decoding acceptance dynamics. Our study spans four well-established NLP benchmark domains: code generation, mathematical reasoning, logical reasoning, and open-ended chat. For this, we use TinyLlama-1.1B as the draft model against Llama-2-7B-Chat-GPTQ as the target. Over 99,768 speculative nodes collected from 200 prompts, we derive per-domain acceptance rates, expected accepted lengths, depth-acceptance profiles, and entropy-acceptance correlations. We find that task type is a stronger predictor of acceptance than tree depth. Furthermore, only the chat domain consistently yields an expected accepted length exceeding 1.0 token per step. We also show that the entropy-acceptance correlation is consistently negative but weak across all domains (rho in [-0.20, -0.15]). Counterintuitively, chat produces the highest entropy yet the highest acceptance rate. We attribute this divergence to the lexical predictability of RLHF-aligned register. These findings have direct implications for domain-aware speculation budgets and draft-model selection strategies. Index Terms--speculative decoding, large language model inference, tree attention, draft model, acceptance probability, LLM efficiency
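To see why an expected accepted length above 1.0 token per step is a demanding threshold, consider a simplified model: a single linear draft chain (not the tree used in the study) in which each drafted token is accepted independently with the same probability, and verification stops at the first rejection. Under those assumptions, which are mine and not the paper's, the expected accepted length is a truncated geometric sum:

```python
def expected_accepted_length(accept_prob: float, depth: int) -> float:
    """Expected number of accepted draft tokens for a chain of `depth` tokens.

    Token i is accepted only if tokens 1..i-1 were accepted, so the
    probability of accepting at least i tokens is accept_prob ** i.
    """
    return sum(accept_prob ** i for i in range(1, depth + 1))


def closed_form(accept_prob: float, depth: int) -> float:
    """Same quantity via the geometric-series formula (accept_prob != 1)."""
    return accept_prob * (1 - accept_prob ** depth) / (1 - accept_prob)


short_chain = expected_accepted_length(0.5, 3)  # 0.5 + 0.25 + 0.125 = 0.875
longer_chain = expected_accepted_length(0.5, 4)  # 0.9375
```

With a per-token acceptance probability of 0.5, the expected length stays below 1.0 no matter how deep the chain goes, since the sum converges to 1.0. Only a higher acceptance rate, not more depth, pushes it past that mark in this model.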

Chinese Translation

推测解码加速了大型语言模型（LLM）的推理。它使用一个小的草稿模型提出未来标记的树状结构。然后，一个更大的目标模型在单次批量前向传递中验证这些标记。尽管关于推测方法的研究日益增多，但任务的认知特征对接受概率的影响程度仍然很大程度上未被探索。我们呈现了一项关于基于树的推测解码接受动态的实证研究。我们的研究涵盖了四个公认的自然语言处理（NLP）基准领域：代码生成、数学推理、逻辑推理和开放式聊天。为此，我们使用TinyLlama-1.1B作为草稿模型，Llama-2-7B-Chat-GPTQ作为目标模型。在从200个提示中收集的99,768个推测节点中，我们推导出每个领域的接受率、预期接受长度、深度-接受特征以及熵-接受相关性。我们发现，任务类型是接受的更强预测因子，而不是树的深度。此外，只有聊天领域的一致性预期接受长度超过每步1.0个标记。我们还展示了熵-接受相关性在所有领域中始终为负但较弱（rho在[-0.20, -0.15]之间）。与直觉相反，聊天领域产生了最高的熵却也有最高的接受率。我们将这种差异归因于与RLHF（强化学习与人类反馈）对齐的语域的词汇可预测性。这些发现对领域感知的推测预算和草稿模型选择策略具有直接的影响。关键词：推测解码、大型语言模型推理、树注意力、草稿模型、接受概率、LLM效率


cs.AI / 39 / 2604.14683

DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation

DR$^{3}$-Eval：迈向现实且可重复的深度研究评估

Xie, Qianqian, Xiong, Qingheng, Zhu, He, Xia, Tiantian, Han, Xueming, Meng, Fanyu, Wang, Jiakai, Bai, Zhiqi, Jiang, Chengkang, Wang, Zhaohui, Guo, Yubin, Wen, Yuqing, Mao, Jiayang, Zhang, Zijie, Li, Shihao, Wang, Yanghai, Ren, Yuxiang, Feng, Junlan, Liu, Jiaheng

Abstract

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR$^{3}$-Agent based on multiple state-of-the-art language models demonstrate that DR$^{3}$-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.

Chinese Translation

深度研究代理（Deep Research Agents, DRAs）旨在解决涉及规划、检索、多模态理解和报告生成的复杂长时程研究任务，但由于动态网络环境和模糊的任务定义，其评估仍然具有挑战性。我们提出了DR$^{3}$-Eval，这是一个用于评估深度研究代理在多模态、多文件报告生成方面的现实且可重复的基准。DR$^{3}$-Eval基于真实用户提供的材料构建，并配备了一个每个任务的静态研究沙箱语料库，该语料库模拟了开放网络的复杂性，同时保持完全可验证，包含支持性文档、干扰项和噪声。此外，我们引入了一个多维评估框架，衡量信息回忆、事实准确性、引用覆盖率、指令遵循和深度质量，并验证其与人类判断的一致性。基于多种最先进语言模型开发的多代理系统DR$^{3}$-Agent的实验表明，DR$^{3}$-Eval具有高度挑战性，并揭示了检索鲁棒性和幻觉控制中的关键失败模式。我们的代码和数据已公开可用。


cs.AI / 40 / 2604.14687

M2-PALE: A Framework for Explaining Multi-Agent MCTS--Minimax Hybrids via Process Mining and LLMs

M2-PALE：通过过程挖掘和大型语言模型解释多智能体MCTS-极小化混合体的框架

Qian, Yiyu, Zhao, Liyuan, Miller, Tim

Abstract

Monte-Carlo Tree Search (MCTS) is a fundamental sampling-based search algorithm widely used for online planning in sequential decision-making domains. Despite its success in driving recent advances in artificial intelligence, understanding the behavior of MCTS agents remains a challenge for both developers and users. This difficulty stems from the complex search trees produced through the simulation of numerous future states and their intricate relationships. A known weakness of standard MCTS is its reliance on highly selective tree construction, which may lead to the omission of crucial moves and a vulnerability to tactical traps. To resolve this, we incorporate shallow, full-width Minimax search into the rollout phase of multi-agent MCTS to enhance strategic depth. Furthermore, to demystify the resulting decision-making logic, we introduce M2-PALE (MCTS--Minimax Process-Aided Linguistic Explanations). This framework employs process mining techniques, specifically the Alpha Miner, iDHM, and Inductive Miner algorithms, to extract underlying behavioral workflows from agent execution traces. These process models are then synthesized by LLMs to generate human-readable causal and distal explanations. We demonstrate the efficacy of our approach in a small-scale checkers environment, establishing a scalable foundation for interpreting hybrid agents in increasingly complex strategic domains.
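The "shallow, full-width Minimax" component can be sketched as a depth-limited minimax that expands every child up to a fixed depth and falls back to a heuristic evaluation at the cutoff. The toy tree, the heuristic scores, and the function signature below are illustrative assumptions. How the search is wired into the MCTS rollout phase is not shown.

```python
from typing import Callable, Dict, Iterable, List


def minimax(
    state: str,
    depth: int,
    maximizing: bool,
    children: Callable[[str], Iterable[str]],
    evaluate: Callable[[str], float],
) -> float:
    """Depth-limited, full-width minimax value of `state`."""
    kids = list(children(state))
    if depth == 0 or not kids:
        return evaluate(state)  # cutoff or terminal: use the heuristic score
    values = [
        minimax(k, depth - 1, not maximizing, children, evaluate) for k in kids
    ]
    return max(values) if maximizing else min(values)


# Toy game tree: move "b" looks better at depth 1 but hides a trap ("b1").
TREE: Dict[str, List[str]] = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
SCORES: Dict[str, float] = {
    "root": 0, "a": 4, "b": 6, "a1": 3, "a2": 5, "b1": -10, "b2": 9,
}


def children(state: str) -> List[str]:
    return TREE.get(state, [])


def evaluate(state: str) -> float:
    return SCORES[state]


value_depth_1 = minimax("root", 1, True, children, evaluate)  # 6: prefers "b"
value_depth_2 = minimax("root", 2, True, children, evaluate)  # 3: sees the trap, prefers "a"
```

One extra ply of full-width search changes the preferred move here, which is the kind of tactical trap that highly selective tree construction can miss.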

Chinese Translation

蒙特卡罗树搜索（MCTS）是一种基于采样的基本搜索算法，广泛应用于顺序决策领域的在线规划。尽管它在推动人工智能的最新进展方面取得了成功，但理解MCTS智能体的行为仍然是开发者和用户面临的挑战。这种困难源于通过模拟众多未来状态及其复杂关系所产生的复杂搜索树。标准MCTS的一个已知弱点是其对高度选择性树构建的依赖，这可能导致关键动作的遗漏以及对战术陷阱的脆弱性。为了解决这个问题，我们将浅层全宽极小化搜索整合到多智能体MCTS的回放阶段，以增强战略深度。此外，为了揭示所产生的决策逻辑，我们引入了M2-PALE（MCTS-极小化过程辅助语言解释）。该框架采用过程挖掘技术，特别是Alpha Miner、iDHM和Inductive Miner算法，从智能体执行轨迹中提取潜在的行为工作流。这些过程模型随后由大型语言模型（LLMs）合成，以生成可供人类理解的因果和远因解释。我们在小规模跳棋环境中展示了我们方法的有效性，为在日益复杂的战略领域中解释混合智能体奠定了可扩展的基础。


cs.AI / 41 / 2604.14691

CAMO: An Agentic Framework for Automated Causal Discovery from Micro Behaviors to Macro Emergence in LLM Agent Simulations

CAMO：一个用于从微观行为到宏观涌现的自动因果发现的代理框架在LLM代理模拟中的应用

Yu, Xiangning, Guo, Yuwei, Hou, Yuqi, Xue, Xiao, Ma, Qun

Abstract

LLM-empowered agent simulations are increasingly used to study social emergence, yet the micro-to-macro causal mechanisms behind macro outcomes often remain unclear. This is challenging because emergence arises from intertwined agent interactions and meso-level feedback and nonlinearity, making generative mechanisms hard to disentangle. To this end, we introduce CAMO, an automated Causal discovery framework from Micro behaviors to Macro Emergence in LLM agent simulations. CAMO converts mechanistic hypotheses into computable factors grounded in simulation records and learns a compact causal representation centered on an emergent target $Y$. CAMO outputs a computable Markov boundary and a minimal upstream explanatory subgraph, yielding interpretable causal chains and actionable intervention levers. It also uses simulator-internal counterfactual probing to orient ambiguous edges and revise hypotheses when evidence contradicts the current view. Experiments across four emergent settings demonstrate the promise of CAMO.
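For context on the term "Markov boundary": when the causal graph is a known DAG, the Markov blanket of a target $Y$ consists of its parents, its children, and the other parents of its children, and under the usual faithfulness assumption this set is the Markov boundary. The sketch below reads that set off a given edge list. CAMO, by contrast, has to learn the structure from simulation records. The node names and the function name are hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def markov_boundary(edges: Iterable[Edge], target: str) -> Set[str]:
    """Parents, children, and co-parents ("spouses") of `target` in a DAG."""
    edge_list = list(edges)
    parents = {u for u, v in edge_list if v == target}
    children = {v for u, v in edge_list if u == target}
    spouses = {u for u, v in edge_list if v in children and u != target}
    return parents | children | spouses


# Hypothetical graph: D -> A -> Y -> C <- B
EDGES = [("D", "A"), ("A", "Y"), ("Y", "C"), ("B", "C")]
boundary = markov_boundary(EDGES, "Y")  # {"A", "B", "C"}
```

Node `D` is excluded because, once `A` is known, it carries no further information about `Y` in this graph. That screening-off property is what makes the boundary a compact summary of the target's direct causal neighborhood.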

Chinese Translation

基于大语言模型（LLM）的代理模拟越来越多地用于研究社会涌现，然而，宏观结果背后的微观到宏观的因果机制往往仍不清晰。这一挑战源于涌现是由交织的代理交互、介观层次的反馈和非线性所导致的，使得生成机制难以理清。为此，我们提出了CAMO，一个自动化的因果发现框架，旨在从LLM代理模拟中的微观行为到宏观涌现。CAMO将机械假设转化为基于模拟记录的可计算因素，并学习以涌现目标$Y$为中心的紧凑因果表示。CAMO输出可计算的马尔可夫边界和最小上游解释子图，从而产生可解释的因果链和可操作的干预杠杆。它还利用模拟器内部的反事实探测来定位模糊边缘，并在证据与当前观点相矛盾时修正假设。针对四个涌现设置的实验展示了CAMO的潜力。


cs.AI / 42 / 2604.14705

SynHAT: A Two-stage Coarse-to-Fine Diffusion Framework for Synthesizing Human Activity Traces

SynHAT：一种两阶段粗到细的扩散框架用于合成人类活动轨迹

Xu, Rongchao, Jiang, Lin, Yu, Dahai, Li, Ximiao, Wang, Guang

Abstract

Human activity traces (HATs) are critical for many applications, including human mobility modeling and point-of-interest (POI) recommendation. However, growing privacy concerns have severely limited access to authentic large-scale HAT datasets. Recent advances in generative AI provide new opportunities to synthesize realistic and privacy-preserving HATs for such applications. Yet two major challenges remain: (i) HATs are highly irregular and dynamic, with long and varying time intervals, making it difficult to capture their complex spatio-temporal dependencies and underlying distributions; and (ii) generative models are often computationally expensive, making long-term, fine-grained HAT synthesis inefficient. To address these challenges, we propose SynHAT, a computationally efficient coarse-to-fine HAT synthesis framework built on a novel spatio-temporal denoising diffusion model. In Stage 1, we develop Coarse-HADiff, which models the overall spatio-temporal dependencies of coarse-grained latent spatio-temporal traces. It incorporates a novel Latent Spatio-Temporal U-Net with dual Drift-Jitter branches to jointly model smooth spatial transitions and temporal variations during denoising. In Stage 2, we introduce a three-step pipeline consisting of Behavior Pattern Extraction, Fine-HADiff, which shares the same architecture as Coarse-HADiff, and Semantic Alignment to generate fine-grained latent spatio-temporal traces from the Stage 1 outputs. We extensively evaluate SynHAT in terms of data fidelity, utility, privacy, robustness, and scalability. Experiments on real-world HAT datasets from four cities across three countries show that SynHAT substantially outperforms state-of-the-art baselines, achieving 52% and 33% improvements on spatial and temporal metrics, respectively.

Chinese Translation

人类活动轨迹（HATs）对许多应用至关重要，包括人类移动建模和兴趣点（POI）推荐。然而，日益增长的隐私担忧严重限制了对真实大规模HAT数据集的访问。生成性人工智能的最新进展为合成现实且保护隐私的HATs提供了新的机会。然而，仍然面临两个主要挑战：（i）HATs高度不规则且动态，具有长且变化的时间间隔，使得捕捉其复杂的时空依赖关系和潜在分布变得困难；（ii）生成模型通常计算开销较大，使得长期、细粒度的HAT合成效率低下。为了解决这些挑战，我们提出了SynHAT，这是一个基于新颖的时空去噪扩散模型的计算高效的粗到细HAT合成框架。在第一阶段，我们开发了Coarse-HADiff，它建模粗粒度潜在时空轨迹的整体时空依赖关系。它结合了一种新颖的潜在时空U-Net，具有双漂移-抖动分支，以共同建模去噪过程中的平滑空间过渡和时间变化。在第二阶段，我们引入了一个由行为模式提取、Fine-HADiff（与Coarse-HADiff具有相同架构）和语义对齐组成的三步流程，以从第一阶段的输出生成细粒度潜在时空轨迹。我们在数据保真性、实用性、隐私性、鲁棒性和可扩展性方面对SynHAT进行了广泛评估。在来自三个国家四个城市的真实HAT数据集上的实验表明，SynHAT显著优于最先进的基线，在空间和时间指标上分别提高了52%和33%。

cs.AI / 43 / 2604.14709

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

HWE-Bench：在真实世界硬件bug修复任务上评估大型语言模型代理的基准测试

Cui, Fan, Hou, Hongyuan, Luo, Zizhang, Yin, Chenyun, Liang, Yun

Abstract

Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived from real historical bug-fix pull requests across six major open-source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC-V cores, SoCs, and security roots-of-trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project's native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new repositories. We evaluate seven LLMs with four agent frameworks and find that the best agent resolves 70.7% of tasks overall, with performance exceeding 90% on smaller cores but dropping below 65% on complex SoC-level projects. We observe larger performance gaps across models than commonly reported on software benchmarks, and difficulty is driven by project scope and bug-type distribution rather than code size alone. Our failure analysis traces agent failures to three stages of the debugging process: fault localization, hardware-semantic reasoning, and cross-artifact coordination across RTL, configuration, and verification components, providing concrete directions for developing more capable hardware-aware agents.

Chinese Translation

现有的硬件设计基准主要评估大型语言模型（LLMs）在孤立的组件级任务上的表现，例如从规格生成HDL模块，而未涉及仓库规模的评估。我们介绍了HWE-Bench，这是第一个针对真实世界硬件bug修复任务评估LLM代理的大规模仓库级基准。HWE-Bench包含417个任务实例，这些实例源自六个主要开源项目中的真实历史bug修复拉取请求，涵盖Verilog/SystemVerilog和Chisel，涉及RISC-V核心、系统级芯片（SoC）和安全信任根。每个任务都基于一个完全容器化的环境，代理必须解决一个真实的bug报告，其正确性通过项目的本地仿真和回归流程进行验证。该基准通过一个大部分自动化的管道构建，能够高效扩展到新的仓库。我们评估了七个LLM与四个代理框架，发现最佳代理总体上解决了70.7%的任务，在较小核心上性能超过90%，但在复杂的SoC级项目上降至65%以下。我们观察到模型之间的性能差距比在软件基准中常见的报告要大，且难度主要受项目范围和bug类型分布的驱动，而不仅仅是代码大小。我们的失败分析将代理失败追溯到调试过程的三个阶段：故障定位、硬件语义推理以及跨RTL、配置和验证组件的跨文档协调，为开发更强大的硬件感知代理提供了具体方向。

cs.AI / 44 / 2604.14712

SGA-MCTS: Decoupling Planning from Execution via Training-Free Atomic Experience Retrieval

SGA-MCTS：通过无训练原子经验检索将规划与执行解耦

Xie, Xin, Xue, Dongyun, Yao, Wuguannan, Feng, Mingxiao, Zhou, Wengang, Qi, Xiang, Li, Houqiang, Zhang, Peng

Abstract

LLM-powered systems require complex multi-step decision-making abilities to solve real-world tasks, yet current planning approaches face a trade-off between the high latency of inference-time search and the limited generalization of supervised fine-tuning. To address this limitation, we introduce SGA-MCTS, a framework that casts LLM planning as non-parametric retrieval. Offline, we leverage Monte Carlo Tree Search (MCTS) to explore the solution space and distill high-fidelity trajectories into State-Goal-Action (SGA) atoms. These atoms are de-lexicalized primitives that abstract concrete entities into symbolic slots, preserving reusable causal logic while discarding domain-specific noise. Online, a retrieval-augmented agent employs a hybrid symbolic-semantic mechanism to fetch relevant SGAs and re-ground them into the current context as soft reasoning hints. Empirical results on complex benchmarks demonstrate that this paradigm enables frozen, open-weights models to match the performance of SOTA systems (e.g., GPT-5) without task-specific fine-tuning. By effectively amortizing the heavy computational cost of search, SGA-MCTS achieves System 2 reasoning depth at System 1 inference speeds, rendering autonomous planning both scalable and real-time feasible.

Chinese Translation

基于大型语言模型（LLM）系统需要复杂的多步骤决策能力来解决现实世界任务，但当前的规划方法在推理时搜索的高延迟与监督微调的有限泛化之间存在权衡。为了解决这一限制，我们提出了SGA-MCTS，一个将LLM规划视为非参数检索的框架。在离线阶段，我们利用蒙特卡洛树搜索（MCTS）探索解决方案空间，并将高保真轨迹提炼为状态-目标-动作（SGA）原子。这些原子是去词汇化的原始元素，将具体实体抽象为符号槽，保留可重用的因果逻辑，同时丢弃领域特定的噪声。在在线阶段，增强检索的智能体采用混合符号-语义机制来获取相关的SGA，并将其重新嵌入当前上下文作为软推理提示。在复杂基准上的实证结果表明，这一范式使得冻结的开放权重模型能够在没有任务特定微调的情况下匹配最先进系统（例如，GPT-5）的性能。通过有效地摊销搜索的高计算成本，SGA-MCTS以系统2的推理深度实现系统1的推理速度，使得自主规划既可扩展又实时可行。
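
The de-lexicalization and re-grounding steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the atom format, so the slot syntax (`<ORIGIN>`), the function names, and the example action string are all hypothetical.

```python
def delexicalize(text, entities):
    """Replace concrete entity strings with symbolic slots such as <ORIGIN>.

    `entities` maps a slot name to the concrete value found in `text`.
    The slot syntax is an assumption made for this illustration.
    """
    for slot, value in entities.items():
        text = text.replace(value, f"<{slot}>")
    return text


def reground(template, bindings):
    """Fill symbolic slots with entities taken from the current context."""
    for slot, value in bindings.items():
        template = template.replace(f"<{slot}>", value)
    return template


# Hypothetical action distilled offline from one search trajectory.
atom = delexicalize(
    "book_flight(from=Paris, to=Tokyo)",
    {"ORIGIN": "Paris", "DEST": "Tokyo"},
)
print(atom)

# Online, the same abstract action is re-grounded for a different task.
hint = reground(atom, {"ORIGIN": "Berlin", "DEST": "Seoul"})
print(hint)
```

Plain string replacement keeps the sketch short; a real system would need entity recognition and a retrieval index over the stored atoms, neither of which is shown here.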

cs.AI / 45 / 2604.14717

Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents

分层可变性：持久自我修改代理中的连续性与治理

Tallam, Krti

Abstract

Persistent language-model agents increasingly combine tool use, tiered memory, reflective prompting, and runtime adaptation. In such systems, behavior is shaped not only by current prompts but by mutable internal conditions that influence future action. This paper introduces layered mutability, a framework for reasoning about that process across five layers: pretraining, post-training alignment, self-narrative, memory, and weight-level adaptation. The central claim is that governance difficulty rises when mutation is rapid, downstream coupling is strong, reversibility is weak, and observability is low, creating a systematic mismatch between the layers that most affect behavior and the layers humans can most easily inspect. I formalize this intuition with simple drift, governance-load, and hysteresis quantities, connect the framework to recent work on temporal identity in language-model agents, and report a preliminary ratchet experiment in which reverting an agent's visible self-description after memory accumulation fails to restore baseline behavior. In that experiment, the estimated identity hysteresis ratio is 0.68. The main implication is that the salient failure mode for persistent self-modifying agents is not abrupt misalignment but compositional drift: locally reasonable updates that accumulate into a behavioral trajectory that was never explicitly authorized.

Chinese Translation

持久语言模型代理越来越多地结合了工具使用、分层记忆、反思提示和运行时适应。在这样的系统中，行为不仅受到当前提示的影响，还受到可变内部条件的影响，这些条件会影响未来的行动。本文引入了分层可变性这一框架，以便在五个层面上对这一过程进行推理：预训练、后训练对齐、自我叙事、记忆和权重级适应。核心论点是，当变异迅速、下游耦合强、可逆性弱且可观察性低时，治理难度会增加，从而在最影响行为的层面与人类最容易检查的层面之间产生系统性不匹配。我通过简单的漂移、治理负载和滞后量来形式化这一直觉，将该框架与最近关于语言模型代理的时间身份的研究联系起来，并报告了一项初步的棘轮实验，其中在记忆积累后恢复代理的可见自我描述未能恢复基线行为。在该实验中，估计的身份滞后比为0.68。主要的启示是，对于持久自我修改代理而言，显著的失败模式不是突然的不对齐，而是组成漂移：局部合理的更新累积成一种从未明确授权的行为轨迹。

cs.AI / 46 / 2604.14718

The Agentification of Scientific Research: A Physicist's Perspective

科学研究的代理化：物理学家的视角

Qi, Xiao-Liang

Abstract

This article argues that the most important significance of the AI revolution, especially the rise of large language models, lies not simply in automation, but in a fundamental change in how complex information and human know-how are carried, replicated, and shared. From this perspective, AI for Science is especially important because it may transform not only the efficiency of research, but also the structure of scientific collaboration, discovery, publishing, and evaluation. The article outlines a gradual path from AI as a research tool to AI as a scientific collaborator, and discusses how AI is likely to fundamentally reshape scientific publication. It also argues that continuous learning and diversity of ideas are essential if AI is to play a meaningful role in original scientific discovery.

Chinese Translation

本文认为，人工智能革命，特别是大型语言模型的崛起，最重要的意义不仅在于自动化，而在于复杂信息和人类知识的传递、复制和共享方式的根本变化。从这个角度来看，人工智能在科学中的应用尤为重要，因为它可能不仅会改变研究的效率，还会改变科学合作、发现、出版和评估的结构。文章概述了从人工智能作为研究工具到人工智能作为科学合作者的逐步发展路径，并讨论了人工智能如何可能从根本上重塑科学出版。文章还指出，如果人工智能要在原创科学发现中发挥有意义的作用，持续学习和思想多样性是至关重要的。

cs.AI / 47 / 2604.14738

Personalized and Context-Aware Transformer Models for Predicting Post-Intervention Physiological Responses from Wearable Sensor Data

基于个性化和情境感知的变换器模型用于预测可穿戴传感器数据中的干预后生理反应

Brown, Esther, Dean, Victoria, Doshi-Velez, Finale

Abstract

Consumer wearables enable continuous measurement of physiological data related to stress and recovery, but turning these streams into actionable, personalized stress-management recommendations remains a challenge. In practice, users often do not know how a given intervention, defined as an activity intended to reduce stress, will affect heart rate (HR), heart rate variability (HRV), or inter-beat intervals (BBI) over the next 15 to 120 minutes. We present a framework that predicts post-intervention trajectories and the direction of change for these physiological indicators across time windows. Our methodology combines a Transformer model for multi-horizon trajectories of percent change relative to a pre-intervention baseline, direction-of-change calls (positive, negative, or neutral) at each horizon, and an empirical study using wearable sensor data overlaid with user-tagged events and interventions. This proof of concept shows that personalized post-intervention prediction is feasible. We encourage future integration into stress-management tools for personalized intervention recommendations tailored to each person's day following further validation in larger studies and, where applicable, appropriate regulatory review.

Chinese Translation

消费类可穿戴设备能够持续测量与压力和恢复相关的生理数据，但将这些数据流转化为可行的个性化压力管理建议仍然是一个挑战。在实践中，用户往往不知道特定干预（定义为旨在减轻压力的活动）将如何影响心率（HR）、心率变异性（HRV）或心跳间隔（BBI）在接下来的15到120分钟内。我们提出了一个框架，预测这些生理指标在时间窗口内的干预后轨迹及变化方向。我们的方法结合了变换器模型，用于相对于干预前基线的多时间段百分比变化轨迹、每个时间段的变化方向判断（正向、负向或中性），以及使用可穿戴传感器数据进行的实证研究，这些数据与用户标记的事件和干预相叠加。该概念验证表明，个性化的干预后预测是可行的。我们鼓励未来将其整合到压力管理工具中，以便为每个人提供个性化的干预建议，待在更大规模的研究中进一步验证，并在适用的情况下进行适当的监管审查。
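
The two prediction targets named in the abstract, percent change relative to a pre-intervention baseline and a direction-of-change call at each horizon, can be made concrete with a minimal sketch. The function names, the sample values, and the 2% neutral band are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean


def percent_change(baseline_values, post_values):
    """Percent change of the post-intervention mean relative to the baseline mean."""
    baseline = mean(baseline_values)
    if baseline == 0:
        raise ValueError("baseline mean must be non-zero")
    return 100.0 * (mean(post_values) - baseline) / baseline


def direction_of_change(pct, neutral_band=2.0):
    """Map a percent change to 'positive', 'negative', or 'neutral'.

    neutral_band is a hypothetical threshold in percent, not a value from the paper.
    """
    if pct > neutral_band:
        return "positive"
    if pct < -neutral_band:
        return "negative"
    return "neutral"


# Hypothetical heart-rate samples in beats per minute.
baseline_hr = [80, 82, 78, 80]      # before the intervention
post_hr_by_horizon = {              # keyed by minutes after the intervention
    15: [76, 76],
    60: [80, 81],
    120: [84, 84],
}

for horizon, samples in post_hr_by_horizon.items():
    pct = percent_change(baseline_hr, samples)
    print(horizon, round(pct, 3), direction_of_change(pct))
```

In the framework described above these quantities would be predicted by the Transformer before the intervention takes place; the sketch only shows how the targets themselves could be computed from recorded data.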

cs.AI / 48 / 2604.14746

Disentangle-then-Refine: LLM-Guided Decoupling and Structure-Aware Refinement for Graph Contrastive Learning

解耦再精炼：基于大语言模型的图对比学习的解耦与结构感知精炼

Li, Zhaoxing, Zhang, Hai-Feng, Zhang, Xiaoming

Abstract

Conventional Graph Contrastive Learning (GCL) on Text-Attributed Graphs (TAGs) relies on blind stochastic augmentations, inadvertently entangling task-relevant signals with noise. We propose SDM-SCR, a robust framework anchored in Approximate Orthogonal Decomposition. First, the Semantic Decoupling Module (SDM) leverages the instruction-following capability of Large Language Models (LLMs) to actively parse raw attributes into asymmetric, task-oriented signal and noise views. This shifts the paradigm from random perturbation to semantic-aware disentanglement. Subsequently, Semantic Consistency Regularization (SCR) exploits the spectral observation that semantic signals are topologically smooth while residual noise is high-frequency. SCR functions as a selective spectral filter, enforcing consistency only on the signal subspace to eliminate LLM hallucinations without over-smoothing. This "Disentangle-then-Refine" mechanism ensures rigorous signal purification. Extensive experiments demonstrate that SDM-SCR achieves SOTA performance in accuracy and efficiency.

Chinese Translation

传统的文本属性图（TAGs）上的图对比学习（GCL）依赖于盲目的随机增强，导致任务相关信号与噪声的无意纠缠。我们提出了SDM-SCR，一个基于近似正交分解的鲁棒框架。首先，语义解耦模块（SDM）利用大语言模型（LLMs）的指令跟随能力，主动将原始属性解析为不对称的、面向任务的信号和噪声视图。这一转变将范式从随机扰动转变为语义感知的解耦。随后，语义一致性正则化（SCR）利用语义信号在拓扑上平滑而残余噪声为高频的光谱观察。SCR作为选择性光谱滤波器，仅在信号子空间上强制一致性，以消除LLM的幻觉而不导致过度平滑。这一“解耦再精炼”机制确保了严格的信号净化。大量实验表明，SDM-SCR在准确性和效率上达到了最先进的性能。

cs.AI / 49 / 2604.14768

CoTEvol: Self-Evolving Chain-of-Thoughts for Data Synthesis in Mathematical Reasoning

CoTEvol：用于数学推理的数据合成的自我演化思维链

Wang, Zhuo, Zhang, Zhuo, Li, Yafu, Cheng, Yu, Qu, Lizhen, Xu, Zenglin

Abstract

Large Language Models (LLMs) exhibit strong mathematical reasoning when trained on high-quality Chain-of-Thought (CoT) that articulates intermediate steps, yet costly CoT curation hinders further progress. While existing remedies such as distillation from stronger LLMs and self-synthesis based on test-time search alleviate this issue, they often suffer from diminishing returns or high computing overhead. In this work, we propose CoTEvol, a genetic evolutionary framework that casts CoT generation as a population-based search over reasoning trajectories. Candidate trajectories are iteratively evolved through reflective global crossover at the trajectory level and local mutation guided by uncertainty at the step level, enabling holistic recombination and fine-grained refinement. Lightweight, task-aware fitness functions are designed to guide the evolutionary process toward accurate and diverse reasoning. Empirically, CoTEvol improves correct-CoT synthesis success by over 30% and enhances structural diversity, with markedly improved efficiency. LLMs trained on these evolutionary CoT data achieve an average gain of 6.6% across eight math benchmarks, outperforming previous distillation and self-synthesis approaches. These results underscore the promise of evolutionary CoT synthesis as a scalable and effective method for mathematical reasoning tasks.

Chinese Translation

大型语言模型（LLMs）在经过高质量思维链（Chain-of-Thought, CoT）训练后展现出强大的数学推理能力，思维链能够清晰地阐述中间步骤，但高成本的思维链策划阻碍了进一步的进展。尽管现有的解决方案，如从更强大的LLMs进行蒸馏和基于测试时搜索的自我合成，缓解了这一问题，但它们往往面临收益递减或高计算开销的问题。在本研究中，我们提出了CoTEvol，一个将思维链生成视为对推理轨迹进行基于种群的搜索的遗传进化框架。候选轨迹通过在轨迹层面的反思全局交叉和在步骤层面的不确定性引导的局部突变进行迭代演化，从而实现整体重组和细粒度的精炼。我们设计了轻量级、任务感知的适应度函数，以引导进化过程朝向准确且多样的推理。实证结果表明，CoTEvol使正确的思维链合成成功率提高了超过30%，并增强了结构多样性，同时显著提高了效率。在这些进化思维链数据上训练的LLMs在八个数学基准测试中平均提高了6.6%的表现，超越了之前的蒸馏和自我合成方法。这些结果强调了进化思维链合成作为一种可扩展且有效的数学推理任务方法的前景。

cs.AI / 50 / 2604.14785

MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror

MirrorBench：通过引入镜子评估多模态大型语言模型中的自我中心智能

Guo, Shengyu, Ye, Tongrui, Zhang, Jianbo, Zhang, Zicheng, Li, Chunyi, Zhai, Guangtao

Abstract

Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated remarkable advances in perception and reasoning, suggesting their potential for embodied intelligence. While recent studies have evaluated embodied MLLMs in interactive settings, current benchmarks mainly target capabilities to perceive, understand, and interact with external objects, lacking a systematic evaluation of self-centric intelligence. To address this, we introduce MirrorBench, a simulation-based benchmark inspired by the classical Mirror Self-Recognition (MSR) test in psychology. MirrorBench extends this paradigm to embodied MLLMs through a tiered framework of progressively challenging tasks, assessing agents from basic visual perception to high-level self-representation. Experiments on leading MLLMs show that even at the lowest level, their performance remains substantially inferior to human performance, revealing fundamental limitations in self-referential understanding. Our study bridges psychological paradigms and embodied intelligence, offering a principled framework for evaluating the emergence of general intelligence in large models. Project page: https://fflahm.github.io/mirror-bench-page/.

Chinese Translation

最近在多模态大型语言模型（MLLMs）方面的进展展示了在感知和推理方面的显著提升，暗示它们在具身智能方面的潜力。尽管近期研究已在交互环境中评估了具身的MLLMs，但当前的基准主要针对感知、理解和与外部物体互动的能力，缺乏对自我中心智能的系统评估。为了解决这一问题，我们引入了MirrorBench，这是一个基于模拟的基准，灵感来自心理学中的经典镜子自我识别（MSR）测试。MirrorBench通过逐层递进的挑战任务框架，将这一范式扩展到具身的MLLMs，评估从基本视觉感知到高级自我表征的智能体。对领先的MLLMs进行的实验表明，即使在最低层次，它们的表现仍显著低于人类表现，揭示了自我指涉理解的基本局限性。我们的研究将心理学范式与具身智能相结合，提供了一个评估大型模型中通用智能出现的原则性框架。项目页面：https://fflahm.github.io/mirror-bench-page/

cs.AI / 51 / 2604.14786

CogEvolution: A Human-like Generative Educational Agent to Simulate Student's Cognitive Evolution

CogEvolution：一种类人生成教育代理，模拟学生的认知演变

Zhang, Wei, Cheng, Yihang, Ye, Zhirong, Huang, Kezhen

Abstract

Generative Agents, owing to their precise modeling and simulation capabilities of human behavior, have become a pivotal tool in the field of Artificial Intelligence in Education (AIEd) for uncovering complex cognitive processes of learners. However, existing educational agents predominantly rely on static personas to simulate student learning behaviors, neglecting the decisive role of deep cognitive capabilities in learning outcomes during practice interactions. Furthermore, they struggle to characterize the dynamic fluidity of knowledge internalization, transfer, and cognitive state transitions. To overcome this bottleneck, this paper proposes a human-like educational agent capable of simulating student cognitive evolution: CogEvolution. Specifically, we first construct a cognitive depth perceptron based on the Interactive, Constructive, Active, Passive (ICAP) taxonomy from cognitive psychology, achieving precise quantification of learner cognitive engagement. Subsequently, we propose a memory retrieval method based on Item Response Theory (IRT) to simulate the connection and assimilation of new and prior knowledge. Finally, we design a dynamic cognitive update mechanism based on evolutionary algorithms to simulate the real-time integration of student learning behaviors and cognitive evolution processes. Comprehensive evaluations demonstrate that CogEvolution not only significantly outperforms baseline models in behavioral fidelity and learning curve fitting but also uniquely reproduces plausible and robust cognitive evolutionary paths consistent with educational psychology expectations, providing a novel paradigm for constructing highly interpretable educational agents.

Chinese Translation

生成代理由于其对人类行为的精确建模和模拟能力，已成为教育人工智能（AIEd）领域中揭示学习者复杂认知过程的重要工具。然而，现有的教育代理主要依赖静态角色来模拟学生学习行为，忽视了深层认知能力在实践互动中对学习结果的决定性作用。此外，它们难以表征知识内化、转移和认知状态转变的动态流动性。为了解决这一瓶颈，本文提出了一种能够模拟学生认知演变的类人教育代理：CogEvolution。具体而言，我们首先基于认知心理学中的互动、建构、主动、被动（ICAP）分类法构建了一个认知深度感知器，实现了对学习者认知参与度的精确量化。随后，我们提出了一种基于项目反应理论（IRT）的记忆检索方法，以模拟新旧知识的连接和同化。最后，我们设计了一种基于进化算法的动态认知更新机制，以模拟学生学习行为和认知演变过程的实时整合。全面评估表明，CogEvolution不仅在行为逼真性和学习曲线拟合方面显著优于基线模型，而且独特地再现了与教育心理学预期一致的合理且稳健的认知演变路径，为构建高度可解释的教育代理提供了一种新范式。
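
For readers unfamiliar with Item Response Theory, the sketch below shows the standard two-parameter logistic (2PL) model, which gives the probability that a learner of a given ability answers an item correctly. The abstract does not say which IRT variant CogEvolution uses or how it is wired into memory retrieval, so this is background only; the function name and parameter values are assumptions.

```python
import math


def irt_2pl(theta, difficulty, discrimination=1.0):
    """Probability of a correct response under the 2PL IRT model.

    theta          -- learner ability
    difficulty     -- item difficulty (same scale as theta)
    discrimination -- how sharply the item separates abilities
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# A learner whose ability equals the item difficulty succeeds half the time.
print(irt_2pl(theta=0.0, difficulty=0.0))

# Higher ability raises the probability, lower ability reduces it.
print(irt_2pl(theta=1.0, difficulty=0.0), irt_2pl(theta=-1.0, difficulty=0.0))
```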

cs.AI / 52 / 2604.14788

Sequence Search: Automated Sequence Design using Neural Architecture Search

序列搜索：基于神经架构搜索的自动序列设计

Hong, Rokgi, An, Hongjun, Ji, Sooyeon, Lee, Jongho

Abstract

Developing an MR sequence is challenging and remains largely constrained by human intuition. Recently, AI-driven approaches have been proposed; however, most require an initial sequence for parameter optimization or extensive training datasets, limiting their general applicability. In this study, we propose "Sequence Search," an automated sequence design framework based on neural architecture search. The method takes tissue properties, imaging parameters, and design objectives as inputs and generates pulse sequences satisfying the design objectives, without requiring prior knowledge of conventional sequence structures. Sequence Search iteratively generates candidate sequences through neural architecture search and optimizes them via a differentiable Bloch simulator and objective-specific loss functions using gradient-based learning. The framework successfully replicated conventional spin-echo, T2-weighted spin-echo, and inversion recovery sequences. Less intuitive solutions were also discovered, such as three-RF spin-echo-like sequences with reduced RF energy and refocusing phases deviating from the conventional Hahn-echo. This work establishes a generalizable framework for automated MR sequence design, highlighting the potential to explore configurations beyond conventional designs based on human intuition.

Chinese Translation

开发磁共振（MR）序列具有挑战性，并在很大程度上受限于人类直觉。最近提出了基于人工智能的解决方案；然而，大多数方法需要初始序列进行参数优化或大量训练数据集，从而限制了它们的普遍适用性。在本研究中，我们提出了“序列搜索”（Sequence Search），这是一个基于神经架构搜索的自动序列设计框架。该方法以组织特性、成像参数和设计目标作为输入，生成满足设计目标的脉冲序列，而无需先前对传统序列结构的知识。序列搜索通过神经架构搜索迭代生成候选序列，并通过可微分的布洛赫模拟器和特定目标的损失函数使用基于梯度的学习进行优化。该框架成功复制了传统的自旋回波、T2加权自旋回波和反转恢复序列。还发现了一些不太直观的解决方案，例如具有降低射频（RF）能量的三射频自旋回波类序列，以及偏离传统汉回波的重聚焦相位。这项工作建立了一个可推广的自动化MR序列设计框架，突显了探索超越基于人类直觉的传统设计配置的潜力。

cs.AI / 53 / 2604.14789

A Comparative Study of CNN Optimization Methods for Edge AI: Exploring the Role of Early Exits

边缘人工智能中卷积神经网络优化方法的比较研究：探索早期退出的作用

Fernandez, Nekane, Valdes, Ivan, Van Vaerenbergh, Steven, de la Iglesia, Idoia, Arratibel, Julen

Abstract

Deploying deep neural networks on edge devices requires balancing accuracy, latency, and resource constraints under realistic execution conditions. To fit models within these constraints, two broad strategies have emerged: static compression techniques such as pruning and quantization, which permanently reduce model size, and dynamic approaches such as early-exit mechanisms, which adapt computational cost at runtime. While both families are widely studied in isolation, they are rarely compared under identical conditions on physical hardware. This paper presents a unified deployment-oriented comparison of static compression and dynamic early-exit mechanisms, evaluated on real edge devices using ONNX-based inference pipelines. Our results show that static and dynamic techniques offer fundamentally different trade-offs for edge deployment. While pruning and quantization deliver consistent memory footprint reduction, early-exit mechanisms enable input-adaptive computation savings that static methods cannot match. Their combination proves highly effective, simultaneously reducing inference latency and memory usage with minimal accuracy loss, expanding what is achievable at the edge.

Chinese Translation

在边缘设备上部署深度神经网络需要在现实执行条件下平衡准确性、延迟和资源限制。为了使模型适应这些限制，出现了两种广泛的策略：静态压缩技术，如剪枝和量化，永久性地减少模型大小，以及动态方法，如早期退出机制，在运行时适应计算成本。尽管这两类方法在孤立状态下得到了广泛研究，但在相同条件下对物理硬件进行比较的情况却很少。本文呈现了一种统一的以部署为导向的静态压缩和动态早期退出机制的比较，评估基于真实边缘设备的ONNX推理管道。我们的结果表明，静态和动态技术在边缘部署中提供了根本不同的权衡。虽然剪枝和量化提供了一致的内存占用减少，但早期退出机制则实现了输入自适应的计算节省，这是静态方法无法匹敌的。两者的结合证明是非常有效的，能够在最小的准确性损失下，同时减少推理延迟和内存使用，扩展了边缘计算的可实现性。
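
The input-adaptive behavior of early-exit mechanisms can be sketched as follows: each exit produces class scores, and inference stops at the first exit whose confidence clears a threshold. This is a generic confidence-threshold scheme, not necessarily the exit criterion used in the paper; the stage callables, the logits, and the 0.9 threshold are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(stages, threshold=0.9):
    """Return (predicted_class, exit_index) for one input.

    `stages` is a list of zero-argument callables ordered from the shallowest
    exit to the final classifier. Each returns the logits at that exit, so
    deeper stages are never evaluated once an earlier exit is confident.
    """
    for index, stage in enumerate(stages):
        probs = softmax(stage())
        confidence = max(probs)
        is_last = index == len(stages) - 1
        if confidence >= threshold or is_last:
            return probs.index(confidence), index
    raise ValueError("stages must not be empty")


# Hypothetical logits: an easy input is resolved at the first exit,
# a hard input falls through to the final classifier.
easy_input = [lambda: [4.0, 0.0, 0.0], lambda: [6.0, 0.0, 0.0]]
hard_input = [lambda: [0.5, 0.4, 0.0], lambda: [0.0, 3.0, 0.0]]

print(early_exit_predict(easy_input))
print(early_exit_predict(hard_input))
```

The threshold controls the trade-off: a lower value exits earlier and saves more computation at the risk of more misclassifications.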

cs.AI / 54 / 2604.14790

Diffusion Crossover: Defining Evolutionary Recombination in Diffusion Models via Noise Sequence Interpolation

扩散交叉：通过噪声序列插值定义扩散模型中的进化重组

Kumada, Chisatao, Hiwa, Satoru, Hiroyasu, Tomoyuki

Abstract

Interactive Evolutionary Computation (IEC) provides a powerful framework for optimizing subjective criteria such as human preferences and aesthetics, yet it suffers from a fundamental limitation: in high-dimensional generative representations, defining crossover in a semantically consistent manner is difficult, often leading to a mutation-dominated search. In this work, we explicitly define crossover in diffusion models. We propose Diffusion crossover, which formulates evolutionary recombination as step-wise interpolation of noise sequences in the reverse process of Denoising Diffusion Probabilistic Models (DDPMs). By applying spherical linear interpolation (Slerp) to the noise sequences associated with selected parent images, the proposed method generates offspring that inherit characteristics from both parents while preserving the geometric structure of the diffusion process. Furthermore, controlling the time-step range of interpolation enables a principled trade-off between diversity (exploration) and convergence (exploitation). Experimental results using PCA analysis and perceptual similarity metrics (LPIPS) demonstrate that Diffusion crossover produces perceptually smooth and semantically consistent transitions between parent images. Qualitative interactive evolution experiments further confirm that the proposed method effectively supports human-in-the-loop image exploration. These findings suggest a new perspective: diffusion models are not only powerful generators, but also structured evolutionary search spaces in which recombination can be explicitly defined and controlled.

Chinese Translation

交互进化计算（IEC）为优化主观标准（如人类偏好和美学）提供了强大的框架，但它存在一个根本性的局限性：在高维生成表示中，以语义一致的方式定义交叉是困难的，通常导致以突变为主导的搜索。在本研究中，我们明确地定义了扩散模型中的交叉。我们提出了扩散交叉（Diffusion crossover），将进化重组公式化为去噪扩散概率模型（Denoising Diffusion Probabilistic Models, DDPMs）反向过程中的噪声序列逐步插值。通过对与选定父图像相关的噪声序列应用球面线性插值（Slerp），该方法生成的后代继承了两个父代的特征，同时保持了扩散过程的几何结构。此外，控制插值的时间步长范围能够在多样性（探索）和收敛（利用）之间实现原则性的权衡。使用主成分分析（PCA）和感知相似性度量（LPIPS）的实验结果表明，扩散交叉能够在父图像之间产生感知上平滑且语义一致的过渡。定性交互进化实验进一步确认了该方法有效支持人机协作的图像探索。这些发现提出了一个新的视角：扩散模型不仅是强大的生成器，还是可以明确定义和控制重组的结构化进化搜索空间。
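
The recombination step described in the abstract, spherical linear interpolation of the parents' per-step noise over a chosen time-step range, can be sketched in plain Python. The function names are illustrative, the DDPM sampler that would consume the noise is omitted, and inheriting parent A's noise outside the interpolated range is an assumption rather than something the abstract states.

```python
import math


def slerp(v0, v1, t):
    """Spherical linear interpolation between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    if math.sin(omega) < 1e-8:
        # Nearly parallel or anti-parallel vectors: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1.0 - t) * omega) / math.sin(omega)
    w1 = math.sin(t * omega) / math.sin(omega)
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def crossover(parent_a, parent_b, t=0.5, start=0, stop=None):
    """Interpolate the parents' noise vectors for reverse steps in [start, stop).

    parent_a and parent_b are lists of noise vectors, one per reverse step.
    Steps outside the range copy parent A's noise (an assumption of this sketch).
    """
    stop = len(parent_a) if stop is None else stop
    child = []
    for step, (noise_a, noise_b) in enumerate(zip(parent_a, parent_b)):
        if start <= step < stop:
            child.append(slerp(noise_a, noise_b, t))
        else:
            child.append(list(noise_a))
    return child


# Two toy parents with two reverse steps and 2-D noise each.
parent_a = [[1.0, 0.0], [1.0, 0.0]]
parent_b = [[0.0, 1.0], [0.0, 1.0]]
child = crossover(parent_a, parent_b, t=0.5, start=1)
print(child)
```

For parents of equal norm, Slerp keeps the interpolated noise at that norm, whereas plain linear interpolation would shrink it; this is the sense in which the geometric structure is preserved. Narrowing the `[start, stop)` range limits how much of the trajectory is mixed.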

cs.AI / 55 / 2604.14807

The LLM Fallacy: Misattribution in AI-Assisted Cognitive Workflows

大型语言模型的谬误：人工智能辅助认知工作流程中的误归因

Kim, Hyunwoo, Yu, Harin, Yi, Hanau

Abstract

The rapid integration of large language models (LLMs) into everyday workflows has transformed how individuals perform cognitive tasks such as writing, programming, analysis, and multilingual communication. While prior research has focused on model reliability, hallucination, and user trust calibration, less attention has been given to how LLM usage reshapes users' perceptions of their own capabilities. This paper introduces the LLM fallacy, a cognitive attribution error in which individuals misinterpret LLM-assisted outputs as evidence of their own independent competence, producing a systematic divergence between perceived and actual capability. We argue that the opacity, fluency, and low-friction interaction patterns of LLMs obscure the boundary between human and machine contribution, leading users to infer competence from outputs rather than from the processes that generate them. We situate the LLM fallacy within existing literature on automation bias, cognitive offloading, and human-AI collaboration, while distinguishing it as a form of attributional distortion specific to AI-mediated workflows. We propose a conceptual framework of its underlying mechanisms and a typology of manifestations across computational, linguistic, analytical, and creative domains. Finally, we examine implications for education, hiring, and AI literacy, and outline directions for empirical validation. We also provide a transparent account of human-AI collaborative methodology. This work establishes a foundation for understanding how generative AI systems not only augment cognitive performance but also reshape self-perception and perceived expertise.

Chinese Translation

大型语言模型（LLMs）迅速融入日常工作流程，改变了个人执行写作、编程、分析和多语言沟通等认知任务的方式。尽管先前的研究集中于模型的可靠性、幻觉现象和用户信任的校准，但对LLM使用如何重塑用户对自身能力的认知关注较少。本文提出了LLM谬误，这是一种认知归因错误，个体错误地将LLM辅助输出解读为自身独立能力的证据，从而导致感知能力与实际能力之间的系统性偏差。我们认为，LLM的不透明性、流畅性和低摩擦交互模式模糊了人类与机器贡献之间的界限，使用户从输出中推断能力，而非从生成这些输出的过程。我们将LLM谬误置于现有的自动化偏见、认知卸载和人机协作文献中，同时将其区分为特定于人工智能介导工作流程的归因扭曲形式。我们提出了其潜在机制的概念框架及其在计算、语言、分析和创造性领域的表现类型。最后，我们探讨了对教育、招聘和人工智能素养的影响，并概述了实证验证的方向。我们还提供了人机协作方法论的透明说明。本研究为理解生成性人工智能系统如何不仅增强认知表现，还重塑自我认知和感知专业能力奠定了基础。

cs.AI / 56 / 2604.14829

Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation

超越字面总结：重新定义医疗SOAP笔记评估中的幻觉

Vachhani, Bhavik, Shrisvastava, Kush, Nema, Pranshu, Chiranthan, Sai

Abstract

Evaluating large language models (LLMs) for clinical documentation tasks such as SOAP note generation remains challenging. Unlike standard summarization, these tasks require clinical abstraction, normalization of colloquial language, and medically grounded inference. However, prevailing evaluation methods, including automated metrics and LLM-as-judge frameworks, rely on lexical faithfulness, often labeling any information not explicitly present in the transcript as hallucination. We show that such approaches systematically misclassify clinically valid outputs as errors, inflating hallucination rates and distorting model assessment. Our analysis reveals that many flagged hallucinations correspond to legitimate clinical transformations, including synonym mapping, abstraction of examination findings, diagnostic inference, and guideline-consistent care planning. By aligning evaluation criteria with clinical reasoning through calibrated prompting and retrieval grounded in medical ontologies, we observe a significant shift in outcomes. Under a lexical evaluation regime, the mean hallucination rate is 35%, heavily penalizing valid reasoning. With inference-aware evaluation, this drops to 9%, with remaining cases reflecting genuine safety concerns. These findings suggest that current evaluation practices over-penalize valid clinical reasoning and may measure artifacts of evaluation design rather than true errors, underscoring the need for clinically informed evaluation in high-context domains like medicine.

Chinese Translation

评估大型语言模型（LLMs）在临床文档任务中的表现，如SOAP笔记生成，仍然具有挑战性。与标准总结不同，这些任务需要临床抽象、口语语言的规范化以及基于医学的推理。然而，现有的评估方法，包括自动化指标和LLM作为评判框架，依赖于词汇忠实度，通常将任何在记录中未明确存在的信息标记为幻觉。我们展示了这些方法系统性地将临床有效的输出错误分类，从而夸大了幻觉率并扭曲了模型评估。我们的分析揭示，许多被标记的幻觉对应于合法的临床转化，包括同义词映射、检查结果的抽象、诊断推理以及符合指南的护理计划。通过校准提示和基于医学本体的检索将评估标准与临床推理对齐，我们观察到结果的显著变化。在词汇评估机制下，平均幻觉率为35%，严重惩罚有效推理。而在考虑推理的评估下，这一比例降至9%，剩余案例反映了真正的安全隐患。这些发现表明，当前的评估实践对有效的临床推理过度惩罚，可能测量的是评估设计的伪影而非真实错误，强调了在医学等高背景领域进行临床知情评估的必要性。

cs.AI / 57 / 2604.14838

Intermediate Layers Encode Optimal Biological Representations in Single-Cell Foundation Models

中间层编码单细胞基础模型中的最佳生物表征

Civale, Vincenzo Yuto, Semeraro, Roberto, Bagdanov, Andrew David, Magi, Alberto

Abstract

Current single-cell foundation model benchmarks universally extract final layer embeddings, assuming these represent optimal feature spaces. We systematically evaluate layer-wise representations from scFoundation (100M parameters) and Tahoe-X1 (1.3B parameters) across trajectory inference and perturbation response prediction. Our analysis reveals that optimal layers are task-dependent (trajectory peaks at 60% depth, 31% above final layers) and context-dependent (perturbation optima shift 0-96% across T cell activation states). Notably, first-layer embeddings outperform all deeper layers in quiescent cells, challenging assumptions about hierarchical feature abstraction. These findings demonstrate that "where" to extract features matters as much as "what" the model learns, necessitating systematic layer evaluation tailored to biological task and cellular context rather than defaulting to final-layer embeddings.

Chinese Translation

当前的单细胞基础模型基准普遍提取最终层嵌入，假设这些嵌入代表最佳特征空间。我们系统性地评估了来自 scFoundation（1 亿参数）和 Tahoe-X1（13 亿参数）的层级表征，涉及轨迹推断和扰动响应预测。我们的分析揭示，最佳层级是任务依赖的（轨迹在 60% 深度处达到峰值，比最终层高出 31%）和上下文依赖的（扰动最优点在 T 细胞激活状态间变化 0-96%）。值得注意的是，在静息细胞中，第一层嵌入的表现优于所有更深层次，挑战了关于层级特征抽象的假设。这些发现表明，“在哪里”提取特征与“模型学习什么”同样重要，因此需要针对生物任务和细胞上下文进行系统的层级评估，而不是默认使用最终层嵌入。

cs.AI / 58 / 2604.14847

TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models

TrigReason：基于触发的大小推理模型协作

Zhao, Yi, Peng, Yajuan, Nguyen, Cam-Tu, Li, Zuchao, Wang, Xiaoliang, Fu, Xiaoming, Zhao, Hai

Abstract

Large Reasoning Models (LRMs) achieve strong performance on complex tasks through extended chains of thought but suffer from high inference latency due to autoregressive reasoning. Recent work explores using Small Reasoning Models (SRMs) to accelerate LRM inference. In this paper, we systematically characterize the capability boundaries of SRMs and identify three common types of reasoning risks: (1) path divergence, where SRMs lack the strategic ability to construct an initial plan, causing reasoning to deviate from the most probable path; (2) cognitive overload, where SRMs fail to solve particularly difficult steps; and (3) recovery inability, where SRMs lack robust self-reflection and error correction mechanisms. To address these challenges, we propose TrigReason, a trigger-based collaborative reasoning framework that replaces continuous polling with selective intervention. TrigReason delegates most reasoning to the SRM and activates LRM intervention only when necessary: during initial strategic planning (strategic priming trigger), upon detecting extraordinary overconfidence (cognitive offload trigger), or when reasoning falls into unproductive loops (intervention request trigger). The evaluation results on AIME24, AIME25, and GPQA-D indicate that TrigReason matches the accuracy of full LRMs and SpecReason, while offloading 1.70x to 4.79x more reasoning steps to SRMs. Under edge-cloud conditions, TrigReason reduces latency by 43.9% and API cost by 73.3%. Our code is available at https://github.com/QQQ-yi/TrigReason.

Chinese Translation

大型推理模型（LRMs）通过扩展的思维链在复杂任务上取得了强大的表现，但由于自回归推理导致推理延迟较高。近期的研究探索了使用小型推理模型（SRMs）来加速LRM推理。在本文中，我们系统性地描述了SRMs的能力边界，并识别出三种常见的推理风险：（1）路径发散，SRMs缺乏构建初始计划的战略能力，导致推理偏离最可能的路径；（2）认知过载，SRMs未能解决特别困难的步骤；（3）恢复能力不足，SRMs缺乏强大的自我反思和错误纠正机制。为了解决这些挑战，我们提出了TrigReason，一个基于触发的协同推理框架，它用选择性干预替代了持续轮询。TrigReason将大部分推理委托给SRM，仅在必要时激活LRM干预：在初始战略规划期间（战略启动触发）、检测到异常过度自信时（认知卸载触发），或当推理陷入无效循环时（干预请求触发）。在AIME24、AIME25和GPQA-D上的评估结果表明，TrigReason的准确性与完整的LRMs和SpecReason相匹配，同时卸载给SRMs的推理步骤多出1.70倍至4.79倍。在边缘云条件下，TrigReason将延迟减少了43.9%，API成本降低了73.3%。我们的代码可在 [https://github.com/QQQ-yi/TrigReason](https://github.com/QQQ-yi/TrigReason) 获取。
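
The three triggers named in the abstract can be pictured as a small control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `StepFn` interface, the `TriggerConfig` thresholds, the repeated-step loop check, and the toy models are all hypothetical and chosen only to make the control flow concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a model maps the trace so far to (next_step, confidence in [0, 1]).
StepFn = Callable[[List[str]], Tuple[str, float]]


@dataclass
class TriggerConfig:
    overconfidence: float = 0.99  # assumed threshold for the cognitive offload trigger
    loop_window: int = 3          # assumed: a step repeated within this window counts as a loop
    max_steps: int = 16


def collaborate(srm: StepFn, lrm: StepFn, cfg: Optional[TriggerConfig] = None):
    """Delegate steps to the small model; hand a step to the large model when a trigger fires."""
    cfg = cfg or TriggerConfig()
    trace: List[str] = []
    owners: List[str] = []
    for i in range(cfg.max_steps):
        if i == 0:
            # Strategic priming trigger: the large model writes the initial plan.
            step, _ = lrm(trace)
            owner = "LRM"
        else:
            step, conf = srm(trace)
            owner = "SRM"
            overconfident = conf >= cfg.overconfidence       # cognitive offload trigger
            looping = step in trace[-cfg.loop_window:]       # intervention request trigger
            if overconfident or looping:
                step, _ = lrm(trace)
                owner = "LRM"
        trace.append(step)
        owners.append(owner)
        if step.startswith("ANSWER"):
            break
    return trace, owners


def toy_srm(trace: List[str]) -> Tuple[str, float]:
    """Toy small model: repeats itself once, then is overconfident once, then answers."""
    n = len(trace)
    if n in (1, 2):
        return ("step A", 0.6)
    if n == 3:
        return ("step B", 0.995)
    return ("ANSWER 42", 0.7)


def toy_lrm(trace: List[str]) -> Tuple[str, float]:
    """Toy large model: always produces a fresh step."""
    return (f"LRM step {len(trace)}", 0.9)


trace, owners = collaborate(toy_srm, toy_lrm)
print(list(zip(owners, trace)))
```

In this toy run the large model is called for the plan, once for the repeated step, and once for the overconfident step; every other step stays on the small model.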

cs.AI / 59 / 2604.14858

Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-CodeX

OpenClaw和Codex中的轨迹安全评估与诊断基准：ATBench-Claw和ATBench-CodeX

Yang, Zhonghao, Li, Yu, Zhu, Yanxu, Zhou, Tianyi, Xie, Yuejin, Luo, Haoyu, Shao, Jing, Hu, Xia, Liu, Dongrui

Abstract

As agent systems move into increasingly diverse execution settings, trajectory-level safety evaluation and diagnosis require benchmarks that evolve with them. ATBench is a diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis. This report presents ATBench-Claw and ATBench-CodeX, two domain-customized extensions that carry ATBench into the OpenClaw and OpenAI Codex / Codex-runtime settings. The key adaptation mechanism is to analyze each new setting, customize the three-dimensional Safety Taxonomy over risk source, failure mode, and real-world harm, and then use that customized taxonomy to define the benchmark specification consumed by the shared ATBench construction pipeline. This extensibility matters because agent frameworks remain relatively stable at the architectural level even as their concrete execution settings, tool ecosystems, and product capabilities evolve quickly. Concretely, ATBench-Claw targets OpenClaw-sensitive execution chains over tools, skills, sessions, and external actions, while ATBench-CodeX targets trajectories in the OpenAI Codex / Codex-runtime setting over repositories, shells, patches, dependencies, approvals, and runtime policy boundaries. Our emphasis therefore falls on taxonomy customization, domain-specific risk coverage, and benchmark design under a shared ATBench generation framework.

Chinese Translation

随着智能体系统进入日益多样化的执行环境，轨迹级安全评估与诊断需要与之相适应的基准。ATBench是一个多样化且现实的智能体轨迹基准，用于安全评估与诊断。本报告介绍了ATBench-Claw和ATBench-CodeX，这两个领域定制的扩展将ATBench引入OpenClaw和OpenAI Codex / Codex-runtime环境。关键的适应机制是分析每个新环境，定制基于风险源、故障模式和现实世界危害的三维安全分类法，然后利用该定制的分类法定义由共享的ATBench构建管道所使用的基准规范。这种可扩展性至关重要，因为尽管智能体框架在架构层面保持相对稳定，但其具体执行环境、工具生态系统和产品能力却在快速演变。具体而言，ATBench-Claw针对OpenClaw敏感的执行链，包括工具、技能、会话和外部动作，而ATBench-CodeX则针对OpenAI Codex / Codex-runtime环境中的轨迹，包括代码库、shell、补丁、依赖、批准和运行时策略边界。因此，我们的重点在于分类法定制、特定领域的风险覆盖以及在共享ATBench生成框架下的基准设计。

cs.AI / 60 / 2604.14881

The Missing Knowledge Layer in AI: A Framework for Stable Human-AI Reasoning

人工智能中的缺失知识层：稳定人机推理的框架

Rosenbacke, Rikard, Rosenbacke, Carl, Rosenbacke, Victor, McKee, Martin

Abstract

Large language models are increasingly integrated into decision-making in areas such as healthcare, law, finance, engineering, and government. Yet they share a critical limitation: they produce fluent outputs even when their internal reasoning has drifted. A confident answer can conceal uncertainty, speculation, or inconsistency, and small changes in phrasing can lead to different conclusions. This makes LLMs useful assistants but unreliable partners in high-stakes contexts. Humans exhibit a similar weakness, often mistaking fluency for reliability. When a model responds smoothly, users tend to trust it, even when both model and user are drifting together. This paper is the first in a five-paper research series on stabilising human-AI reasoning. The series proposes a two-layer approach: Parts II-IV introduce human-side mechanisms such as uncertainty cues, conflict surfacing, and auditable reasoning traces, while Part V develops a model-side Epistemic Control Loop (ECL) that detects instability and modulates generation accordingly. Together, these layers form a missing operational substrate for governance by increasing signal-to-noise at the point of use. Stabilising interaction makes uncertainty and drift visible before enforcement is applied, enabling more precise capability governance. This aligns with emerging compliance expectations, including the EU AI Act and ISO/IEC 42001, by making reasoning processes traceable under real conditions of use. The central claim is that fluency is not reliability. Without structures that stabilise both human and model reasoning, AI cannot be trusted or governed where it matters most.

Chinese Translation

大型语言模型越来越多地被整合到医疗、法律、金融、工程和政府等领域的决策中。然而，它们存在一个关键的局限性：即使内部推理出现偏差，它们仍能生成流畅的输出。一个自信的答案可能掩盖不确定性、推测或不一致性，而措辞的细微变化可能导致不同的结论。这使得大型语言模型（LLMs）成为有用的助手，但在高风险环境中却是不可靠的合作伙伴。人类也表现出类似的弱点，常常将流畅性误认为可靠性。当模型流畅地回应时，用户往往会信任它，即使模型和用户都在一起偏离。本文是关于稳定人机推理的五篇研究系列中的第一篇。该系列提出了一种两层的方法：第二至第四部分介绍了人类方面的机制，如不确定性提示、冲突显现和可审计的推理痕迹，而第五部分则开发了一个模型侧的知识控制循环（Epistemic Control Loop, ECL），用于检测不稳定性并相应地调节生成。这些层共同构成了一个缺失的操作基础，通过在使用时增加信噪比来实现治理。稳定的交互使得不确定性和漂移在执行之前可见，从而实现更精确的能力治理。这与新兴的合规期望相一致，包括欧盟人工智能法案和ISO/IEC 42001，通过在实际使用条件下使推理过程可追溯。中心论点是，流畅性并不等于可靠性。在没有稳定人类和模型推理的结构的情况下，人工智能无法在最重要的地方被信任或治理。

cs.AI / 61 / 2604.14886

Cooperate to Compete: Strategic Data Generation and Incentivization Framework for Coopetitive Cross-Silo Federated Learning

合作以竞争：合作竞争跨孤岛联邦学习的战略数据生成与激励框架

Nguyen, Thanh Linh, Van Huynh, Nguyen, Pham, Quoc-Viet

Abstract

In data-sensitive domains such as healthcare, cross-silo federated learning (CFL) allows organizations to collaboratively train AI models without sharing raw data. However, practical CFL deployments are inherently coopetitive, in which organizations cooperate during model training while competing in downstream markets. In such settings, training contributions, including data volume, quality, and diversity, can improve the global model yet inadvertently strengthen rivals. This dilemma is amplified by non-IID data, which leads to asymmetric learning gains and undermines sustained participation. While existing competition-aware CFL and incentive-design approaches reward organizations based on marginal training contributions, they fail to account for the costs of strengthening competitors. In this paper, we introduce CoCoGen+, a coopetition-compatible data generation and incentivization framework that jointly models non-IID data and inter-organizational competition while endogenizing GenAI-based synthetic data generation as a strategic decision. Specifically, CoCoGen+ formulates each training round as a weighted potential game, where organizations strategically decide how much synthetic data to generate by balancing learning performance gains against computational costs and competition-caused utility losses. We then provide a tractable equilibrium characterization and derive implementable generation strategies to maximize social welfare. To promote long-term collaboration, we integrate a payoff redistribution-based incentive mechanism to compensate organizations for their contributions and competition-caused utility degradation. Experiments on varying learning tasks validate the feasibility of CoCoGen+. The results show how non-IID data, competition intensity, and incentives shape organizational strategies and social welfare, while CoCoGen+ outperforms baselines in efficiency.

Chinese Translation

在数据敏感领域，如医疗保健，跨孤岛联邦学习（CFL）允许组织在不共享原始数据的情况下协同训练人工智能模型。然而，实际的CFL部署本质上是合作竞争的，组织在模型训练过程中合作，而在下游市场中竞争。在这种情况下，训练贡献，包括数据量、质量和多样性，可以改善全球模型，但也可能无意中增强竞争对手。这一困境因非独立同分布（non-IID）数据而加剧，导致不对称的学习收益，削弱持续参与的动力。虽然现有的竞争意识CFL和激励设计方法根据边际训练贡献来奖励组织，但未能考虑增强竞争对手的成本。在本文中，我们引入了CoCoGen+，一个兼容合作竞争的数据生成与激励框架，联合建模非独立同分布数据和组织间竞争，同时将基于生成人工智能（GenAI）的合成数据生成内生化为战略决策。具体而言，CoCoGen+将每个训练轮次形式化为一个加权潜力博弈，组织通过平衡学习性能收益与计算成本和竞争导致的效用损失，战略性地决定生成多少合成数据。然后，我们提供了可处理的均衡特征，并推导出可实施的生成策略，以最大化社会福利。为了促进长期合作，我们整合了一种基于收益再分配的激励机制，以补偿组织的贡献和竞争导致的效用下降。在不同学习任务上的实验验证了CoCoGen+的可行性。结果显示，非独立同分布数据、竞争强度和激励如何塑造组织策略和社会福利，同时CoCoGen+在效率上优于基线方法。

cs.AI / 62 / 2604.14889

MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration

MemoSight：统一上下文压缩与多标记预测以加速推理

Liu, Xinyu, Liu, Xin, Jin, Bo, Zhao, Runsong, Huang, Pengcheng, Ruan, Junhao, Li, Bei, Xiao, Chunyang, Xiao, Tong, Zhu, Jingbo

Abstract

While Chain-of-thought (CoT) reasoning enables LLMs to solve challenging reasoning problems, as KV cache grows linearly with the number of generated tokens, CoT reasoning faces scaling issues in terms of speed and memory usage. In this work, we propose MemoSight (Memory-Foresight-based reasoning), a unified framework that integrates both context compression and multi-token prediction to mitigate the efficiency issues while maintaining CoT reasoning performance. Our framework adopts the same minimalist design for both context compression and multi-token prediction via special tokens and their corresponding position layout tailored to each token type. Comprehensive experiments on four reasoning benchmarks demonstrate that MemoSight reduces the KV cache footprint by up to 66% and accelerates inference by 1.56x, while outperforming existing CoT compression methods.

Chinese Translation

尽管链式思维（Chain-of-thought, CoT）推理使大型语言模型（LLMs）能够解决具有挑战性的推理问题，但随着生成标记数量的增加，KV缓存线性增长，CoT推理在速度和内存使用方面面临扩展性问题。在本研究中，我们提出了MemoSight（基于记忆前瞻的推理），这是一个统一框架，集成了上下文压缩和多标记预测，以减轻效率问题，同时保持CoT推理的性能。我们的框架通过特殊标记及其对应的针对每种标记类型量身定制的位置布局，采用相同的简约设计来实现上下文压缩和多标记预测。在四个推理基准上的全面实验表明，MemoSight将KV缓存占用减少了多达66%，并加速推理速度达1.56倍，同时超越了现有的CoT压缩方法。
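
The linear KV-cache growth that motivates MemoSight can be made concrete with a little arithmetic. The sketch below assumes a hypothetical decoder configuration (32 layers, 8 KV heads, head dimension 128, 2-byte values); none of these numbers come from the paper, and the 66% figure is applied only as the upper bound quoted in the abstract.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: one key and one value vector per token, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model configuration, for illustration only.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **cfg)   # bytes added by each generated token
full = kv_cache_bytes(8192, **cfg)     # cache after an 8192-token reasoning trace
reduced = full * 34 // 100             # footprint left after a 66% reduction

print(f"per token: {per_token // 1024} KiB")
print(f"8192 tokens: {full / 2**30:.2f} GiB")
print(f"after 66% reduction: {reduced / 2**30:.2f} GiB")
```

Doubling the number of generated tokens doubles the cache, which is why long chains of thought put pressure on both memory and decoding speed.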

cs.AI / 63 / 2604.14896

Toward Agentic RAG for Ukrainian

面向乌克兰语的代理检索增强生成（RAG）研究

Sumyk, Marta, Kosovan, Oleksandr

Abstract

We present an initial investigation into Agentic Retrieval-Augmented Generation (RAG) for Ukrainian, conducted within the UNLP 2026 Shared Task on Multi-Domain Document Understanding. Our system combines two-stage retrieval (BGE-M3 with BGE reranking) with a lightweight agentic layer performing query rephrasing and answer-retry loops on top of Qwen2.5-3B-Instruct. Our analysis reveals that retrieval quality is the primary bottleneck: agentic retry mechanisms improve answer accuracy but the overall score remains constrained by document and page identification. We discuss practical limitations of offline agentic pipelines and outline directions for combining stronger retrieval with more advanced agentic reasoning for Ukrainian.

Chinese Translation

我们对乌克兰语的代理检索增强生成（RAG）进行了初步研究，该研究是在UNLP 2026多领域文档理解共享任务中进行的。我们的系统结合了两阶段检索（BGE-M3与BGE重排序），并在Qwen2.5-3B-Instruct之上添加了一个轻量级的代理层，执行查询重述和答案重试循环。我们的分析表明，检索质量是主要瓶颈：代理重试机制提高了答案的准确性，但整体评分仍受文档和页面识别的限制。我们讨论了离线代理管道的实际局限性，并概述了面向乌克兰语将更强的检索与更高级的代理推理相结合的方向。

cs.AI / 64 / 2604.14898

Governing Reflective Human-AI Collaboration: A Framework for Epistemic Scaffolding and Traceable Reasoning

治理反思性人机协作：一个关于认知支架和可追溯推理的框架

Rosenbacke, Rikard, Rosenbacke, Carl, Rosenbacke, Victor, McKee, Martin

Abstract

Large language models have advanced rapidly, from pattern recognition to emerging forms of reasoning, yet they remain confined to linguistic simulation rather than grounded understanding. They can produce fluent outputs that resemble reflection, but lack temporal continuity, causal feedback, and anchoring in real-world interaction. This paper proposes a complementary approach in which reasoning is treated as a relational process distributed between human and model rather than an internal capability of either. Building on recent work on "System-2" learning, we relocate reflective reasoning to the interaction layer. Instead of engineering reasoning solely within models, we frame it as a cognitive protocol that can be structured, measured, and governed using existing systems. This perspective emphasizes collaborative intelligence, combining human judgment and contextual understanding with machine speed, memory, and associative capacity. We introduce "The Architect's Pen" as a practical method. Like an architect who thinks through drawing, the human uses the model as an external medium for structured reflection. By embedding phases of articulation, critique, and revision into human-AI interaction, the dialogue itself becomes a reasoning loop: human abstraction -> model articulation -> human reflection. This reframes the question from whether the model can think to whether the human-AI system can reason. The framework enables auditable reasoning traces and supports alignment with emerging governance standards, including the EU AI Act and ISO/IEC 42001. It provides a practical path toward more transparent, controllable, and accountable AI use without requiring new model architectures.

Chinese Translation

大型语言模型迅速发展，从模式识别到新兴的推理形式，但它们仍然局限于语言模拟，而非扎根于理解。它们能够生成流畅的输出，似乎表现出反思能力，但缺乏时间连续性、因果反馈以及与现实世界互动的锚定。本文提出了一种互补的方法，将推理视为人类与模型之间分布的关系过程，而不是任何一方的内部能力。在近期关于“系统-2”（System-2）学习的研究基础上，我们将反思性推理重新定位于互动层面。我们不再仅仅在模型内部构建推理，而是将其框架化为一种认知协议，可以利用现有系统进行结构化、测量和治理。这一视角强调了协作智能，将人类判断和情境理解与机器的速度、记忆和联想能力相结合。我们引入了“建筑师之笔”（The Architect's Pen）作为一种实用方法。就像建筑师通过绘图进行思考一样，人类利用模型作为结构化反思的外部媒介。通过将表达、批评和修订的阶段嵌入人机互动中，对话本身成为一个推理循环：人类抽象 -> 模型表达 -> 人类反思。这将问题的焦点从模型是否能够思考转变为人机系统是否能够推理。该框架使可审计的推理轨迹成为可能，并支持与新兴治理标准的对齐，包括欧盟人工智能法案和ISO/IEC 42001。它为实现更透明、可控和负责任的人工智能使用提供了一条实用路径，而无需新的模型架构。

cs.AI / 65 / 2604.14902

ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

ADAPT：在未指定可供性约束下的常识规划基准测试

Chen, Pei-An, Liang, Yong-Ching, Yeh, Jia-Fong, Su, Hung-Ting, Chen, Yi-Ting, Sun, Min, Hsu, Winston

Abstract

Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.

Chinese Translation

智能具身代理不应仅仅遵循指令，因为现实世界环境通常涉及意外条件和例外情况。然而，现有方法通常专注于直接执行指令，而未考虑目标对象是否可以被实际操作，这意味着它们未能评估可用的可供性。为了解决这一局限性，我们引入了DynAfford，一个评估具身代理在动态环境中表现的基准，其中对象的可供性可能随时间变化，并且在指令中未被指定。DynAfford要求代理感知对象状态，推断隐含的前提条件，并相应地调整其行动。为了实现这一能力，我们引入了ADAPT，一个即插即用的模块，增强现有规划器的显式可供性推理。实验表明，结合ADAPT显著提高了在已见和未见环境中的鲁棒性和任务成功率。我们还展示了一个经过领域适应和LoRA微调的视觉-语言模型作为可供性推理后端，其性能优于商业LLM（GPT-4o），突显了任务对齐的可供性基础的重要性。

cs.AI / 66 / 2604.14920

Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

双轴生成奖励模型：面向交互式语音对话模型的语义和轮流发言鲁棒性

Chen, Yifu, Ji, Shengpeng, Liu, Zhengqing, Chen, Qian, Wang, Wen, Wang, Ziqing, Li, Yangzhuo, Liang, Tianle, Zhao, Zhou

Abstract

Achieving seamless, human-like interaction remains a key challenge for full-duplex spoken dialogue models (SDMs). Reinforcement learning (RL) has substantially enhanced text- and vision-language models, while well-designed reward signals are crucial for the performance of RL. We consider RL a promising strategy to address the key challenge for SDMs. However, a fundamental barrier persists: prevailing automated metrics for assessing interaction quality rely on superficial proxies, such as behavioral statistics or timing-prediction accuracy, failing to provide reliable reward signals for RL. On the other hand, human evaluations, despite their richness, remain costly, inconsistent, and difficult to scale. We tackle this critical barrier by proposing a Dual-Axis Generative Reward Model, which is trained to understand complex interaction dynamics using a detailed taxonomy and an annotated dataset, produces a single score and, crucially, provides separate evaluations for semantic quality and interaction timing. Such dual outputs furnish precise diagnostic feedback for SDMs and deliver a dependable, instructive reward signal suitable for online reinforcement learning. Our model achieves state-of-the-art performance on interaction-quality assessment across a wide spectrum of datasets, spanning synthetic dialogues and complex real-world interactions.

Chinese Translation

实现无缝的人类般交互仍然是全双工语音对话模型（SDMs）面临的关键挑战。强化学习（RL）显著提升了文本和视觉语言模型的性能，而设计良好的奖励信号对强化学习的表现至关重要。我们认为强化学习是一种有前景的策略，可以解决SDMs面临的关键挑战。然而，仍然存在一个根本性障碍：当前评估交互质量的自动化指标依赖于表面的代理，如行为统计或时间预测准确性，未能为强化学习提供可靠的奖励信号。另一方面，尽管人类评估丰富，但仍然成本高昂、不一致且难以扩展。我们通过提出双轴生成奖励模型来解决这一关键障碍，该模型经过训练以理解复杂的交互动态，使用详细的分类法和标注数据集，生成单一评分，并且关键地为语义质量和交互时机提供单独评估。这种双重输出为SDMs提供了精确的诊断反馈，并提供了适合在线强化学习的可靠、指导性的奖励信号。我们的模型在交互质量评估方面在广泛的数据集上实现了最先进的性能，涵盖了合成对话和复杂的现实世界交互。

cs.AI / 67 / 2604.14932

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

WavAlign：通过自适应混合后训练增强口语对话模型的智能性和表现力

Chen, Yifu, Ji, Shengpeng, Chen, Qian, Liang, Tianle, Li, Yangzhuo, Wang, Ziqing, Wang, Wen, Lu, Jingyu, Wang, Haoxiao, Pu, Xueyi, Zhuo, Fan, Zhao, Zhou

Abstract

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.

Chinese Translation

端到端的口语对话模型因其在表现力和感知能力上相较于级联系统具有更高的潜力而受到广泛关注。然而，目前开源的口语对话模型的智能性和表现力往往低于预期。受到在线强化学习（Reinforcement Learning, RL）在其他领域成功的启发，有人可能尝试将偏好优化直接应用于口语对话模型，但这种转移并非易事。我们从奖励建模和回滚采样的角度分析这些障碍，重点关注稀疏偏好监督如何与共享参数更新下的密集语音生成相互作用。基于分析，我们提出了一种模态感知的自适应后训练方案，使得RL在口语对话中变得可行：它将偏好更新限制在语义通道，并通过显式锚定改善声学行为，同时动态调节其混合比例，以避免不可靠的偏好梯度。我们在多个口语对话基准和代表性架构上评估该方法，并观察到语义质量和语音表现力的一致性提升。

cs.AI / 68 / 2604.14969

Discovering Novel LLM Experts via Task-Capability Coevolution

通过任务-能力共进化发现新型大语言模型专家

Dai, Andrew, Meinardus, Boris, Regan, Ciaran, Tian, Yingtao, Tang, Yujin

Abstract

Frontier model developers aim to train models continually to possess emergent, diverse capabilities. To extend capabilities, the current pre-training and post-training paradigm requires manually starting training runs with static datasets or reward functions every time. Addressing this limitation, our work pursues the insight that open-endedness (via the coevolution of models and tasks) can discover models with increasingly novel skills in a single run. We introduce a new model development framework that extends coevolution to large language model (LLM) discovery, open-ended Assessment Coevolving with Diverse Capabilities (AC/DC). AC/DC evolves both LLMs via model merging and natural language tasks via synthetic data generation. AC/DC discovers growing archives of LLMs that surpass the capabilities of larger LLMs while taking up less GPU memory. In particular, our LLM populations achieve a broader Coverage of expertise than other curated models or baselines on downstream benchmarks, without any explicit benchmark optimization. Furthermore, AC/DC improves Coverage over time, continually innovates on tasks and models, and improves performance in multi-agent best-of-N selection. Our findings highlight the potential of coevolution as a means of discovering broader sets of capabilities from base LLMs. Overall, AC/DC brings us one step closer to a profoundly new paradigm of LLM development, where continual improvements to the diversity of model capabilities can be accelerated by leveraging existing models as stepping stones to increasingly powerful models.

Chinese Translation

前沿模型开发者旨在不断训练模型，以具备新兴且多样的能力。为了扩展能力，目前的预训练和后训练范式每次都需要手动启动静态数据集或奖励函数的训练过程。针对这一限制，我们的工作追求一个洞察，即开放性（通过模型与任务的共进化）可以在单次运行中发现具备越来越新颖技能的模型。我们引入了一种新的模型开发框架，将共进化扩展到大语言模型（LLM）的发现，即开放式的与多样能力共进化的评估（Assessment Coevolving with Diverse Capabilities, AC/DC）。AC/DC通过模型合并和合成数据生成，促进LLM和自然语言任务的共同演化。AC/DC发现的LLM档案不断增长，其能力超越了更大规模的LLM，同时占用更少的GPU内存。特别是，我们的LLM种群在下游基准测试中实现了比其他策划模型或基准更广泛的专业覆盖，而没有任何明确的基准优化。此外，AC/DC随着时间的推移改善了覆盖范围，持续在任务和模型上进行创新，并在多代理最佳选择中提高了性能。我们的研究结果突显了共进化作为发现基础LLM更广泛能力集的潜力。总体而言，AC/DC使我们更接近于一种全新的LLM开发范式，在这种范式中，通过利用现有模型作为逐步迈向更强大模型的垫脚石，可以加速模型能力多样性的持续提升。

cs.AI / 69 / 2604.14980

Hybrid Decision Making via Conformal VLM-generated Guidance

通过符合性VLM生成指导的混合决策制定

Banerjee, Debodeep, Sayin, Burcu, Teso, Stefano, Passerini, Andrea

Abstract

Building on recent advances in AI, hybrid decision making (HDM) holds the promise of improving human decision quality and reducing cognitive load. We work in the context of learning to guide (LtG), a recently proposed HDM framework in which the human is always responsible for the final decision: rather than suggesting decisions, in LtG the AI supplies (textual) guidance useful for facilitating decision making. One limiting factor of existing approaches is that their guidance compounds information about all possible outcomes, and as a result it can be difficult to digest. We address this issue by introducing ConfGuide, a novel LtG approach that generates more succinct and targeted guidance. To this end, it employs conformal risk control to select a set of outcomes, ensuring a cap on the false negative rate. We demonstrate our approach on a real-world multi-label medical diagnosis task. Our empirical evaluation highlights the promise of ConfGuide.

Chinese Translation

基于近期在人工智能领域的进展，混合决策制定（HDM）有望提高人类决策质量并减少认知负担。我们在学习引导（LtG）的背景下进行研究，这是一种新提出的HDM框架，在该框架中，人类始终对最终决策负责：在LtG中，人工智能提供有助于促进决策的（文本）指导，而不是建议具体决策。现有方法的一个限制因素是它们的指导包含了所有可能结果的信息，因此可能难以消化。我们通过引入ConfGuide来解决这一问题，这是一种新颖的LtG方法，能够生成更简洁和针对性的指导。为此，它采用符合性风险控制来选择一组结果，确保假阴性率的上限。我们在一个真实的多标签医学诊断任务中展示了我们的方法。我们的实证评估突显了ConfGuide的潜力。
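
The abstract's idea of selecting a set of outcomes with a cap on the false negative rate can be illustrated with a generic conformal risk control calibration. This is a sketch of the standard technique, assuming per-label scores in [0, 1] and a held-out calibration set; the data, the threshold grid, and the function names are hypothetical and are not taken from ConfGuide.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical calibration example: (per-label scores in [0, 1], indices of the true labels).
Example = Tuple[Sequence[float], Set[int]]


def select(scores: Sequence[float], threshold: float) -> Set[int]:
    """Keep every outcome whose score reaches the threshold."""
    return {k for k, s in enumerate(scores) if s >= threshold}


def false_negative_rate(scores: Sequence[float], true_labels: Set[int], threshold: float) -> float:
    """Fraction of the true labels that the selected set misses."""
    if not true_labels:
        return 0.0
    missed = true_labels - select(scores, threshold)
    return len(missed) / len(true_labels)


def calibrate_threshold(calib: List[Example], alpha: float, grid_size: int = 100) -> float:
    """Largest grid threshold whose corrected calibration risk stays at or below alpha.

    Uses the conformal risk control condition (n * risk + B) / (n + 1) <= alpha,
    with B = 1 as the upper bound on the per-example loss.
    """
    n = len(calib)
    for i in range(grid_size, -1, -1):  # scan from the smallest sets to the largest
        t = i / grid_size
        risk = sum(false_negative_rate(s, y, t) for s, y in calib) / n
        if (n * risk + 1.0) / (n + 1) <= alpha:
            return t
    return 0.0  # fall back to selecting every outcome


# Toy calibration data with three possible outcomes per case.
calib: List[Example] = [
    ([0.90, 0.10, 0.20], {0}),
    ([0.20, 0.80, 0.10], {1}),
    ([0.10, 0.30, 0.85], {2}),
    ([0.70, 0.40, 0.10], {0}),
    ([0.05, 0.95, 0.50], {1}),
    ([0.30, 0.20, 0.625], {2}),
    ([0.75, 0.65, 0.10], {0}),
    ([0.30, 0.70, 0.20], {0}),  # the true label is scored low here
    ([0.10, 0.65, 0.40], {1}),
]

threshold = calibrate_threshold(calib, alpha=0.25)
shortlist = select([0.70, 0.64, 0.10], threshold)
print(threshold, shortlist)
```

Guidance would then be generated only for the shortlisted outcomes, which is what keeps it short while the calibration step bounds how often a true outcome is left out.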

cs.AI / 70 / 2604.14987

AI-Enabled Covert Channel Detection in RF Receiver Architectures

基于人工智能的射频接收器架构中的隐蔽通道检测

Abdelazim, Abdelrahman Emad, Diaz-Rizo, Alan Rodrigo, Aboushady, Hassan, Stratigopoulos, Haralampos-G.

Abstract

Covert channels (CCs) in wireless chips pose a serious security threat, as they enable the exfiltration of sensitive information from the chip to an external attacker. In this work, we propose an AI-based defense mechanism deployed at the RF receiver, where the model directly monitors raw I/Q samples to detect, in real time, the presence of a CC embedded within an otherwise nominal signal. We first compact a state-of-the-art convolutional neural network (CNN), achieving an 80% reduction in parameters, which is an essential requirement for efficient edge deployment. When evaluated on the open-source hardware Trojan (HT)-based CC dataset, the compacted CNN attains an average accuracy of 90.28% for CC detection and 86.50% for identifying the underlying HT, with results averaged across SNR values above 1 dB. For practical communication scenarios where SNR > 20 dB, the model achieves over 97% accuracy for both tasks. These results correspond to a minimal performance degradation of less than 2% compared to the baseline model. The compacted CNN is further benchmarked against alternative classifiers, demonstrating an excellent accuracy-model size trade-off. Finally, we design a lightweight CNN hardware accelerator and demonstrate it on an FPGA, achieving very low resource utilization and an efficiency of 107 GOPs/W. Being the first AI hardware accelerator proposed specifically for CC detection, we compare it against state-of-the-art AI accelerators for RF signal classification tasks such as modulation recognition, showing superior performance.

Chinese Translation

无线芯片中的隐蔽通道（CC）构成了严重的安全威胁，因为它们能够将敏感信息从芯片泄露给外部攻击者。在本研究中，我们提出了一种基于人工智能的防御机制，部署在射频接收器上，该模型直接监控原始I/Q样本，以实时检测嵌入在正常信号中的隐蔽通道。我们首先对一种最先进的卷积神经网络（CNN）进行了压缩，实现了参数减少80%的目标，这是高效边缘部署的基本要求。在基于开源硬件木马（HT）的隐蔽通道数据集上进行评估时，压缩后的CNN在隐蔽通道检测中达到了90.28%的平均准确率，在识别潜在木马方面达到了86.50%的准确率，结果是在信噪比（SNR）值高于1 dB的情况下平均得出的。在实际通信场景中，当SNR > 20 dB时，该模型在两个任务上均实现了超过97%的准确率。这些结果与基线模型相比，性能降级最小，低于2%。压缩后的CNN还与其他分类器进行了基准测试，展示了出色的准确率与模型大小之间的权衡。最后，我们设计了一种轻量级CNN硬件加速器，并在FPGA上进行了演示，达到了非常低的资源利用率和107 GOPs/W的效率。作为首个专门为隐蔽通道检测提出的人工智能硬件加速器，我们将其与最先进的人工智能加速器进行了比较，针对射频信号分类任务（如调制识别），显示出更优越的性能。

cs.AI / 71 / 2604.14989

Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement

Dr. RTL：通过工具驱动的自我改进实现自主代理的RTL优化

Fang, Wenji, Lu, Yao, Liu, Shang, Wang, Jing, Guo, Ziyan, He, Junxian, Tu, Fengbin, Xie, Zhiyao

Abstract

Recent advances in large language models (LLMs) have sparked growing interest in automatic RTL optimization for better performance, power, and area (PPA). However, existing methods are still far from realistic RTL optimization. Their evaluation settings are often unrealistic: they are tested on manually degraded, small-scale RTL designs and rely on weak open-source tools. Their optimization methods are also limited, relying on coarse design-level feedback and simple pre-defined rewriting rules. To address these limitations, we present Dr. RTL, an agentic framework for RTL timing optimization in a realistic evaluation environment, with continual self-improvement through reusable optimization skills. We establish a realistic evaluation setting with more challenging RTL designs and an industrial EDA workflow. Within this setting, Dr. RTL performs closed-loop optimization through a multi-agent framework for critical-path analysis, parallel RTL rewriting, and tool-based evaluation. We further introduce group-relative skill learning, which compares parallel RTL rewrites and distills the optimization experience into an interpretable skill library. Currently, this library contains 47 pattern-strategy entries for cross-design reuse to improve PPA and accelerate convergence, and it can continue evolving over time. Evaluated on 20 real-world RTL designs, Dr. RTL achieves average WNS/TNS improvements of 21%/17% with a 6% area reduction over the industry-leading commercial synthesis tool.

Chinese Translation

近期大型语言模型（LLMs）的进展引发了对自动RTL优化的日益关注，以实现更好的性能、功耗和面积（PPA）。然而，现有方法仍远未达到现实的RTL优化水平。它们的评估设置往往不切实际：通常在手动降级的小规模RTL设计上进行测试，并依赖于弱小的开源工具。它们的优化方法也受到限制，依赖于粗略的设计级反馈和简单的预定义重写规则。为了解决这些局限性，我们提出了Dr. RTL，一个用于RTL时序优化的代理框架，旨在现实的评估环境中，通过可重用的优化技能实现持续的自我改进。我们建立了一个更具挑战性的RTL设计和工业EDA工作流程的现实评估设置。在此设置中，Dr. RTL通过多代理框架进行闭环优化，进行关键路径分析、并行RTL重写和基于工具的评估。我们进一步引入了组相对技能学习，它比较并行的RTL重写，将优化经验提炼为可解释的技能库。目前，该库包含47个模式-策略条目，以实现跨设计重用，从而改善PPA并加速收敛，并且可以随着时间的推移继续演变。在对20个真实世界的RTL设计进行评估时，Dr. RTL实现了平均WNS/TNS改善21%/17%，并在行业领先的商业综合工具上实现了6%的面积减少。
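
The WNS/TNS figures quoted above are standard timing metrics: worst negative slack is the most negative path slack, and total negative slack is the sum of all negative slacks. The sketch below shows how they could be computed from a list of endpoint slacks. The slack values are invented, reporting WNS as 0 when no path violates timing is one convention among several, and the percentage formula is an assumption, since the abstract does not state how its improvements are calculated.

```python
from typing import Sequence, Tuple


def wns_tns(slacks_ns: Sequence[float]) -> Tuple[float, float]:
    """Return (worst negative slack, total negative slack) for a list of path slacks in ns."""
    negative = [s for s in slacks_ns if s < 0]
    wns = min(negative) if negative else 0.0  # convention assumed here: 0 when timing is met
    tns = sum(negative)
    return wns, tns


def improvement_pct(before: float, after: float) -> float:
    """Assumed definition: relative reduction in the magnitude of a negative-slack metric."""
    return 100.0 * (abs(before) - abs(after)) / abs(before)


# Invented endpoint slacks before and after a rewrite, in nanoseconds.
baseline = [-0.5, -0.25, 0.1, -0.25]
rewritten = [-0.375, -0.25, 0.2, -0.125]

wns_before, tns_before = wns_tns(baseline)
wns_after, tns_after = wns_tns(rewritten)
print(wns_before, tns_before, wns_after, tns_after)
print(improvement_pct(wns_before, wns_after), improvement_pct(tns_before, tns_after))
```

WNS tracks only the single worst path, while TNS reflects how many paths fail and by how much, so the two can improve by different amounts on the same design.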

cs.AI / 72 / 2604.14990

The Possibility of Artificial Intelligence Becoming a Subject and the Alignment Problem

人工智能成为主体的可能性与对齐问题

Mossakowski, Till, Grass, Helena Esther

Abstract

Artificial General Intelligence (AGI) is increasingly being discussed not only as a tool, but also as a potential subject with personal and therefore moral status. In our opinion, the currently dominant alignment strategies, which focus on human control and containment of AI, therefore fall short. Building on Turing's analogy of "child machines", we are developing a vision of the possibility of autonomy-supporting parenting of AI, in which human control over a developing AGI is gradually reduced, allowing AI to become an independent, autonomous subject. Rather than viewing AGI, as is currently prevalent, as a dangerous creature that needs to be locked up and controlled, we should approach potential AGI with respect for a possible developing subject on the one hand, and with full confidence in our human capabilities on the other. Such a perspective opens up the possibility of cooperative coexistence and co-evolution between humans and AGIs. The relationship between humans and AGIs will thus have to be newly determined, which will change our self-image as humans. It will be crucial that humans not only claim control over potential AGIs, but also engage with AGIs through surprise, creativity, and other specifically human qualities, thereby offering them motivating incentives for cooperation.

Chinese Translation

人工通用智能（AGI）越来越多地被讨论，不仅作为一种工具，也作为一种潜在的主体，因而具有个人和道德地位。在我们看来，目前主流的对齐策略，侧重于对人工智能的控制和限制，因此显得不足。基于图灵的“儿童机器”类比，我们正在发展一种支持人工智能自主性的养育愿景，在这种愿景中，人类对发展中的AGI的控制逐渐减少，使人工智能能够成为一个独立的、自主的主体。我们不应像目前普遍存在的观点那样，将AGI视为需要被锁起来和控制的危险生物，而应一方面尊重潜在的正在发展的主体，另一方面对我们的人类能力充满信心。这种视角为人类与AGI之间的合作共存和共同进化打开了可能性。因此，人类与AGI之间的关系将需要重新确定，这将改变我们作为人类的自我形象。至关重要的是，人类不仅要声称对潜在AGI的控制，还要通过惊奇、创造力以及其他特有的人类品质与AGI进行互动，从而为合作提供激励。

cs.AI / 73 / 2604.14991

Predicting Power-System Dynamic Trajectories with Foundation Models

利用基础模型预测电力系统动态轨迹

Li, Haoran, Mai, Lihao, Xiao, Chenhan, Blasch, Erik, Weng, Yang

Abstract

As power systems transition toward renewable-rich and inverter-dominated operations, accurate time-domain dynamic analysis becomes increasingly critical. Such analysis supports key operational tasks, including transient stability assessment, dynamic security analysis, contingency screening, and post-fault trajectory evaluation. In practice, these tasks face several challenges, including unknown and time-varying system parameters, privacy constraints on data sharing, and the need for fast online inference. Existing learning-based approaches are typically trained for individual systems and therefore lack generalization across operating conditions and physical parameters. Hence, this paper proposes LArge Scale Small ODE (LASS)-ODE-Power, a learning framework for general-purpose time-domain prediction. The proposed approach leverages large-scale pretraining on more than 40 GB of differential-algebraic equation (DAE) or ordinary differential equation (ODE) trajectories to learn transferable representations. The resulting model supports trajectory prediction from short measurement prefixes across diverse dynamic regimes, including electromechanical and inverter-driven systems. As a result, the model can be used directly in a zero-shot setting without data sharing. In addition, the proposed architecture incorporates parallel and linearized computation to achieve fast inference. To enhance task-specific performance in power systems, a specialized fine-tuning strategy is developed based on approximately 1 GB of heterogeneous power-system dynamic data. Extensive experiments over diverse power-system simulation scenarios demonstrate that LASS-ODE-Power consistently outperforms existing learning-based models in trajectory prediction accuracy with efficient inference.

Chinese Translation

随着电力系统向以可再生能源为主和以逆变器为主的运行模式转型，准确的时域动态分析变得愈加重要。这种分析支持关键的操作任务，包括暂态稳定性评估、动态安全分析、应急筛查和故障后轨迹评估。在实际操作中，这些任务可能面临多种挑战，包括未知和时变的系统参数、数据共享的隐私限制，以及快速在线推理的需求。现有的基于学习的方法通常针对单个系统进行训练，因此在不同的操作条件和物理参数下缺乏泛化能力。因此，本文提出了LArge Scale Small ODE (LASS)-ODE-Power，这是一个用于通用时域预测的学习框架。该方法利用超过40 GB的微分代数方程（DAE）或常微分方程（ODE）轨迹进行大规模预训练，以学习可迁移的表示。所得到的模型支持从短测量前缀中进行轨迹预测，适用于多种动态状态，包括电机驱动和逆变器驱动系统。因此，该模型可以在零样本设置中直接使用，无需数据共享。此外，所提出的架构结合了并行和线性化计算，以实现快速推理。此外，为了增强电力系统中特定任务的性能，基于约1 GB的异构电力系统动态数据开发了一种专门的微调策略。在多种电力系统仿真场景下的广泛实验表明，LASS-ODE-Power在轨迹预测准确性和高效推理方面始终优于现有的基于学习的模型。


cs.AI / 74 / 2604.15001

COEVO: Co-Evolutionary Framework for Joint Functional Correctness and PPA Optimization in LLM-Based RTL Generation

COEVO：基于共同进化的框架，用于LLM基础的RTL生成中的功能正确性与PPA优化

Ping, Heng, Zhang, Peiyu, Li, Shixuan, Yang, Wei, Cheng, Anzhe, Duan, Shukai, Zhang, Xiaole, Bogdan, Paul

Abstract

LLM-based RTL code generation methods increasingly target both functional correctness and PPA quality, yet existing approaches universally decouple the two objectives, optimizing PPA only after correctness is fully achieved. Whether through sequential multi-agent pipelines, evolutionary search with binary correctness gates, or hierarchical reward dependencies, partially correct but architecturally promising candidates are systematically discarded. Moreover, existing methods reduce the multi-objective PPA space to a single scalar fitness, obscuring the trade-offs among area, delay, and power. To address these limitations, we propose COEVO, a co-evolutionary framework that unifies correctness and PPA optimization within a single evolutionary loop. COEVO formulates correctness as a continuous co-optimization dimension alongside area, delay, and power, enabled by an enhanced testbench that provides fine-grained scoring and detailed diagnostic feedback. An adaptive correctness gate with annealing allows PPA-promising but partially correct candidates to guide the search toward jointly optimal solutions. To preserve the full PPA trade-off structure, COEVO employs four-dimensional Pareto-based non-dominated sorting with configurable intra-level sorting, replacing scalar fitness without manual weight tuning. Evaluated on VerilogEval 2.0 and RTLLM 2.0, COEVO achieves 97.5% and 94.5% Pass@1 with GPT-5.4-mini, surpassing all agentic baselines across four LLM backbones, while attaining the best PPA on 43 out of 49 synthesizable RTLLM designs.
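
As a rough illustration of the four-dimensional non-dominated sorting mentioned in the abstract, the following minimal sketch splits candidates into Pareto fronts. The objective tuple (correctness, area, delay, power), the sign conventions, the sample values, and the function names are assumptions made for illustration; they are not taken from COEVO's implementation.

```python
from typing import List, Tuple

# Each candidate is (correctness, area, delay, power).
# Assumption: correctness is maximized; area, delay, and power are minimized.
Candidate = Tuple[float, float, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    # Negate correctness so that every objective reads "lower is better".
    ka = (-a[0], a[1], a[2], a[3])
    kb = (-b[0], b[1], b[2], b[3])
    return all(x <= y for x, y in zip(ka, kb)) and any(x < y for x, y in zip(ka, kb))


def non_dominated_sort(pop: List[Candidate]) -> List[List[int]]:
    """Split population indices into Pareto fronts (front 0 is the best)."""
    remaining = set(range(len(pop)))
    fronts: List[List[int]] = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(pop[j], pop[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


population = [
    (1.0, 120.0, 2.0, 5.0),  # fully correct, mid-range PPA
    (0.8, 90.0, 1.5, 4.0),   # partially correct, better PPA
    (1.0, 130.0, 2.5, 6.0),  # fully correct, dominated by candidate 0
    (0.5, 150.0, 3.0, 7.0),  # worse than the others on every objective
]
fronts = non_dominated_sort(population)
```

In this toy population the partially correct candidate 1 shares the first front with the fully correct candidate 0, which is the kind of candidate a binary correctness gate would have discarded.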

Chinese Translation

基于LLM的RTL代码生成方法越来越关注功能正确性和PPA质量，但现有方法普遍将这两个目标解耦，仅在完全实现正确性后才优化PPA。无论是通过顺序多智能体管道、带有二进制正确性门的进化搜索，还是层次奖励依赖，部分正确但在架构上有前景的候选者都被系统性地丢弃。此外，现有方法将多目标PPA空间简化为单一标量适应度，模糊了面积、延迟和功耗之间的权衡。为了解决这些局限性，我们提出了COEVO，一个将正确性与PPA优化统一在单一进化循环中的共同进化框架。COEVO将正确性公式化为与面积、延迟和功耗并行的连续共同优化维度，通过增强的测试平台提供细粒度评分和详细的诊断反馈。具有退火功能的自适应正确性门允许PPA有前景但部分正确的候选者引导搜索朝向共同最优解。为了保留完整的PPA权衡结构，COEVO采用四维帕累托非支配排序，具有可配置的层内排序，替代标量适应度而无需手动权重调整。在VerilogEval 2.0和RTLLM 2.0上的评估中，COEVO在GPT-5.4-mini下实现了97.5%和94.5%的Pass@1，超越了四个LLM基础上的所有智能体基线，同时在49个可合成的RTLLM设计中获得了43个的最佳PPA。


cs.AI / 75 / 2604.15009

Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching

基于专家混合流匹配的快速语言模型推理研究

Li, Aihua

Abstract

Flow matching retains the generation quality of diffusion models while enabling substantially faster inference, making it a compelling paradigm for generative modeling. However, when applied to language modeling, it exhibits fundamental limitations in representing complex latent distributions with irregular geometries, such as anisotropy and multimodality. To address these challenges, we propose a mixture-of-experts flow matching (MoE-FM) framework, which captures complex global transport geometries in latent space by decomposing them into locally specialized vector fields. Building on MoE-FM, we develop a non-autoregressive (NAR) language modeling approach, named YAN, instantiated with both Transformer and Mamba architectures. Across multiple downstream tasks, YAN achieves generation quality on par with both autoregressive (AR) and diffusion-based NAR language models, while requiring as few as three sampling steps. This yields a $40\times$ speedup over AR baselines and up to a $10^3\times$ speedup over diffusion language models, demonstrating substantial efficiency advantages for language modeling.
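
To make the idea of decomposing a global transport into locally specialized vector fields more concrete, here is a toy one-dimensional sketch with a three-step Euler sampler. The hand-written expert fields, the distance-based softmax gate, the `temperature` parameter, and the two mode centers are all assumptions for illustration; in the paper the experts and gate are presumably learned networks, and none of this reflects YAN's actual architecture.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_velocity(x: float, t: float, centers: List[float], temperature: float = 0.5) -> float:
    """Gate-weighted sum of per-expert velocity fields (toy 1-D version)."""
    # Gate: experts whose center is closer to x receive more weight.
    gates = softmax([-(x - c) ** 2 / temperature for c in centers])
    # Expert k transports mass along a straight line toward its own center c_k.
    expert_v = [(c - x) / (1.0 - t) for c in centers]
    return sum(g * v for g, v in zip(gates, expert_v))


def sample(x0: float, centers: List[float], steps: int = 3) -> float:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with a few Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x += dt * moe_velocity(x, i * dt, centers)
    return x


centers = [-2.0, 2.0]          # two modes of a toy bimodal target
left = sample(-0.7, centers)   # starting point left of zero
right = sample(0.9, centers)   # starting point right of zero
```

Each starting point is carried to the mode handled by its nearest expert within three steps, whereas a single straight-line field aimed at one target would have to send both points to the same place.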

Chinese Translation

流匹配在保持扩散模型生成质量的同时，显著加快了推理速度，使其成为生成建模的一个引人注目的范式。然而，当应用于语言建模时，它在表示具有不规则几何形状的复杂潜在分布（如各向异性和多模态性）方面存在根本性限制。为了解决这些挑战，我们提出了一种专家混合流匹配（Mixture-of-Experts Flow Matching, MoE-FM）框架，通过将复杂的全局传输几何结构分解为局部专门化的向量场，从而捕捉潜在空间中的复杂结构。在MoE-FM的基础上，我们开发了一种非自回归（Non-Autoregressive, NAR）语言建模方法，命名为YAN，该方法在Transformer和Mamba架构下实现。通过多个下游任务，YAN在生成质量上与自回归（Autoregressive, AR）和基于扩散的NAR语言模型相当，同时只需三次采样步骤。这使得YAN在速度上比AR基线快$40\times$，并且比扩散语言模型快高达$10^3\times$，展示了语言建模的显著效率优势。


cs.AI / 76 / 2604.15034

Autogenesis: A Self-Evolving Agent Protocol

自生：一种自我演化的代理协议

Zhang, Wentao

Abstract

Recent advances in LLM-based agent systems have shown promise in tackling complex, long-horizon tasks. However, existing agent protocols (e.g., A2A and MCP) underspecify cross-entity lifecycle and context management, version tracking, and evolution-safe update interfaces, which encourages monolithic compositions and brittle glue code. We introduce Autogenesis Protocol (AGP), a self-evolution protocol that decouples what evolves from how evolution occurs. Its Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as protocol-registered resources with explicit state, lifecycle, and versioned interfaces (unless otherwise specified, resources refer to instances of the five RSPL entity types: prompt, agent, tool, environment, and memory with agent outputs). Its Self Evolution Protocol Layer (SEPL) specifies a closed-loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback. Building on AGP, we present Autogenesis System (AGS), a self-evolving multi-agent system that dynamically instantiates, retrieves, and refines protocol-registered resources during execution. We evaluate AGS on multiple challenging benchmarks that require long-horizon planning and tool use across heterogeneous resources. The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed-loop self-evolution.
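
The propose, assess, commit, and rollback cycle described in the abstract can be pictured with a small sketch. Everything below (the `Resource` class, the `evolve` and `rollback` helpers, and the toy scorer) is hypothetical and only illustrates the general shape of a versioned resource with a closed-loop update; it is not the AGP interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Resource:
    """A registered resource (e.g. a prompt) with a version history."""
    name: str
    kind: str  # one of: prompt, agent, tool, environment, memory
    versions: List[str] = field(default_factory=list)

    @property
    def current(self) -> str:
        return self.versions[-1]


def evolve(resource: Resource, candidate: str, score: Callable[[str], float]) -> bool:
    """Propose a candidate, assess it, and commit it only if it scores higher."""
    if score(candidate) > score(resource.current):
        resource.versions.append(candidate)  # commit: earlier versions are kept
        return True
    return False  # reject: the resource is left unchanged


def rollback(resource: Resource) -> None:
    """Revert to the previous version, if one exists."""
    if len(resource.versions) > 1:
        resource.versions.pop()


def toy_score(text: str) -> float:
    """Hypothetical assessor: prefers prompts that ask for step-by-step work."""
    return float("step by step" in text)


prompt = Resource(name="planner", kind="prompt", versions=["Solve the task."])
accepted = evolve(prompt, "Solve the task step by step.", toy_score)
rejected = evolve(prompt, "Do it.", toy_score)
history_after_commit = list(prompt.versions)
rollback(prompt)
```

A fuller design would likely record rollbacks in the lineage instead of popping the version, so that the audit trail stays complete.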

Chinese Translation

近期基于大规模语言模型（LLM）的代理系统的进展显示出在处理复杂的长期任务方面的潜力。然而，现有的代理协议（例如，A2A和MCP）在跨实体生命周期和上下文管理、版本跟踪以及演化安全更新接口方面的定义不足，这导致了单体组合和脆弱的粘合代码。我们提出了自生协议（Autogenesis Protocol，AGP），这是一种自我演化协议，它将演化的内容与演化的方式解耦。其资源基底协议层（Resource Substrate Protocol Layer，RSPL）将提示、代理、工具、环境和记忆建模为具有明确状态、生命周期和版本化接口的协议注册资源（除非另有说明，资源指的是五种RSPL实体类型的实例：提示、代理、工具、环境、记忆，以及代理的输出）。其自我演化协议层（Self Evolution Protocol Layer，SEPL）指定了一个闭环操作接口，用于提出、评估和提交改进，并具有可审计的沿袭和回滚功能。在AGP的基础上，我们展示了自生系统（Autogenesis System，AGS），这是一种自我演化的多代理系统，在执行过程中动态实例化、检索和优化协议注册资源。我们在多个需要长期规划和跨异构资源使用工具的挑战性基准上评估了AGS。结果表明，相较于强基线，AGS在性能上有一致的提升，支持了代理资源管理和闭环自我演化的有效性。


cs.AI / 77 / 2604.15037

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

从反应式到主动式：通过 ProVoice-Bench 评估语音代理的主动性

Xu, Ke, Wang, Yuhao, Wang, Yu

Abstract

Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,182 high-quality samples for rigorous testing. Our evaluation of state-of-the-art Multimodal LLMs reveals a significant performance gap, particularly regarding over-triggering and reasoning capabilities. These findings highlight the limitations of current models and offer a roadmap for developing more natural, context-aware proactive agents.

Chinese Translation

最近，LLM（大语言模型）代理的进展正逐渐从反应式、基于文本的范式转向主动式、多模态交互。然而，现有的基准主要集中在反应性响应上，忽视了主动干预和监控的复杂性。为了解决这一问题，我们引入了 ProVoice-Bench，这是第一个专门为主动语音代理设计的评估框架，包含四个新颖的任务。通过利用多阶段数据合成管道，我们策划了1,182个高质量样本以进行严格测试。我们对最先进的多模态 LLM 的评估揭示了显著的性能差距，特别是在过度触发和推理能力方面。这些发现突显了当前模型的局限性，并为开发更自然、上下文感知的主动代理提供了路线图。


cs.AI / 78 / 2604.15078

Where are the Humans? A Scoping Review of Fairness in Multi-agent AI Systems

人类在哪里？多智能体人工智能系统公平性的范围审查

Allmendinger, Simeon, Deck, Luca, Mueller, Lucas

Abstract

Rapid advances in Generative AI are giving rise to increasingly sophisticated Multi-Agent AI (MAAI) systems. While AI fairness has been extensively studied in traditional predictive scenarios, its examination in MAAI remains nascent and fragmented. This scoping review critically synthesizes existing research on fairness in MAAI systems. Through a qualitative content analysis of 23 selected studies, we identify five archetypal approaches. Our findings reveal that fairness in MAAI systems is often addressed superficially, lacks robust normative foundations, and frequently overlooks the complex dynamics introduced by agent autonomy and system-level interactions. We argue that fairness must be embedded structurally throughout the development lifecycle of MAAI, rather than appended as a post-hoc consideration. Meaningful evaluation requires explicit human oversight, normative clarity, and a precise articulation of fairness objectives and beneficiaries. This review provides a foundation for advancing fairness research in MAAI systems by highlighting critical gaps, exposing prevailing limitations, and suggesting pathways.

Chinese Translation

生成性人工智能的快速发展催生了越来越复杂的多智能体人工智能（MAAI）系统。尽管人工智能公平性在传统预测场景中得到了广泛研究，但在MAAI中的研究仍处于初步和零散的阶段。本范围审查批判性地综合了现有关于MAAI系统公平性的研究。通过对23项选定研究的定性内容分析，我们识别出五种典型的方法。我们的研究结果表明，MAAI系统中的公平性往往被肤浅地处理，缺乏稳健的规范基础，并且常常忽视了智能体自主性和系统级交互所引入的复杂动态。我们认为，公平性必须在MAAI的开发生命周期中结构性地嵌入，而不是作为事后考虑附加。有效的评估需要明确的人类监督、规范的清晰性，以及对公平目标和受益者的精确表述。本审查为推进MAAI系统中的公平性研究奠定了基础，强调了关键的研究空白，揭示了普遍的局限性，并提出了改进的路径。


cs.AI / 79 / 2604.15093

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

OpenMobile：构建开放移动代理的任务与轨迹合成

Cheng, Kanzhi, Li, Zehao, Ma, Zheng, Chen, Nuo, Cao, Jialin, Sun, Qiushi, Ding, Zichen, Xu, Fangzhi, Yan, Hang, Chen, Jiajun, Luu, Anh Tuan, Zhang, Jianbing, Lu, Lewei, Lin, Dahua

Abstract

Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) a scalable task synthesis pipeline that constructs a global environment memory from exploration, then leverages it to generate diverse and grounded instructions, and (2) a policy-switching strategy for trajectory rollout that alternates between learner and expert models to capture essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at https://njucckevin.github.io/openmobile/ to bridge the data gap and facilitate broader mobile agent research.
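
The policy-switching rollout can be illustrated with a toy example. The abstract does not describe the actual switching schedule, so the fixed-interval alternation, the integer-line environment, and the `learner` and `expert` functions below are all assumptions chosen to show why expert steps taken after a learner mistake yield error-recovery data.

```python
from typing import List, Tuple

Step = Tuple[int, int, str]  # (state, action, who acted)

GOAL = 3


def expert(state: int) -> int:
    """Toy expert: always moves toward the goal."""
    return 1 if state < GOAL else -1


def learner(state: int) -> int:
    """Toy learner: mostly right, but makes a mistake in state 1."""
    return -1 if state == 1 else 1


def rollout(start: int, switch_every: int = 2, max_steps: int = 20) -> List[Step]:
    """Alternate control between learner and expert every `switch_every` steps."""
    state = start
    trajectory: List[Step] = []
    for t in range(max_steps):
        if state == GOAL:
            break
        learner_turn = (t // switch_every) % 2 == 0
        policy, name = (learner, "learner") if learner_turn else (expert, "expert")
        action = policy(state)
        trajectory.append((state, action, name))
        state += action
    return trajectory


traj = rollout(start=0)
# Expert steps show the correct action in states the learner reached by itself.
expert_steps = [step for step in traj if step[2] == "expert"]
```

In this run the learner steps backward from state 1, and the expert's subsequent step from the same state records the corrective action, which a pure expert demonstration would never need to show in that context.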

Chinese Translation

基于视觉-语言模型的移动代理在自动化移动任务方面展现了令人印象深刻的能力，最近的领先模型在性能上取得了显著飞跃，例如，在AndroidWorld上接近70%的成功率。然而，这些系统对其训练数据保持封闭，并对其任务和轨迹合成的具体方法缺乏透明度。我们提出了OpenMobile，一个开源框架，能够合成高质量的任务指令和代理轨迹，主要包含两个关键组件：（1）一个可扩展的任务合成管道，通过探索构建全球环境记忆，然后利用该记忆生成多样化且有依据的指令；（2）一种用于轨迹展开的策略切换策略。通过在学习者模型和专家模型之间交替，它捕捉到标准模仿学习中常常缺失的重要错误恢复数据。基于我们的数据训练的代理在三个动态移动代理基准上取得了竞争力的结果：特别是，我们微调的Qwen2.5-VL和Qwen3-VL在AndroidWorld上分别达到了51.7%和64.7%的成功率，远超现有的开放数据方法。此外，我们对合成指令与基准测试集之间的重叠进行了透明分析，并验证了性能提升源于广泛的功能覆盖而非基准过拟合。我们在https://njucckevin.github.io/openmobile/发布数据和代码，以弥补数据缺口并促进更广泛的移动代理研究。


cs.AI / 80 / 2604.15113

HyperSpace: A Generalized Framework for Spatial Encoding in Hyperdimensional Representations

HyperSpace：超维表示中空间编码的通用框架

Snyder, Shay, Capodieci, Andrew, Gorsich, David, Parsa, Maryam

Abstract

Vector Symbolic Architectures (VSAs) provide a well-defined algebraic framework for compositional representations in hyperdimensional spaces. We introduce HyperSpace, an open-source framework that decomposes VSA systems into modular operators for encoding, binding, bundling, similarity, cleanup, and regression. Using HyperSpace, we analyze and benchmark two representative VSA backends: Holographic Reduced Representations (HRR) and Fourier Holographic Reduced Representations (FHRR). Although FHRR provides lower theoretical complexity for individual operations, HyperSpace's modularity reveals that similarity and cleanup dominate runtime in spatial domains. As a result, HRR and FHRR exhibit comparable end-to-end performance. Differences in memory footprint introduce additional deployment trade-offs, with HRR vectors requiring approximately half the memory of FHRR vectors. By enabling modular, system-level evaluation, HyperSpace reveals practical trade-offs in VSA pipelines that are not apparent from theoretical or operator-level comparisons alone.
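
For readers unfamiliar with the two backends, the sketch below contrasts their standard binding operators: circular convolution for HRR and element-wise multiplication of unit phasors for FHRR. It is a pure-Python illustration with a small dimensionality `N` chosen for speed, and it does not use HyperSpace's API, which the abstract does not describe.

```python
import cmath
import math
import random
from typing import List

random.seed(0)
N = 256  # dimensionality, kept small so the pure-Python loops stay fast


def circ_conv(a: List[float], b: List[float]) -> List[float]:
    """HRR binding: circular convolution, O(N^2) when written naively."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def involution(a: List[float]) -> List[float]:
    """Approximate HRR inverse: a*[i] = a[-i mod n]."""
    n = len(a)
    return [a[(-i) % n] for i in range(n)]


def cosine(u, v) -> float:
    """Cosine similarity for real or complex vectors."""
    dot = sum((x * complex(y).conjugate()).real for x, y in zip(u, v))
    norm_u = math.sqrt(sum(abs(x) ** 2 for x in u))
    norm_v = math.sqrt(sum(abs(y) ** 2 for y in v))
    return dot / (norm_u * norm_v)


# HRR: real-valued vectors, binding by circular convolution.
a = [random.gauss(0, 1 / math.sqrt(N)) for _ in range(N)]
b = [random.gauss(0, 1 / math.sqrt(N)) for _ in range(N)]
hrr_recovered = circ_conv(involution(a), circ_conv(a, b))
hrr_sim = cosine(hrr_recovered, b)     # approximate recovery, below 1

# FHRR: unit-magnitude complex phasors, binding by element-wise multiplication.
fa = [cmath.exp(1j * random.uniform(-math.pi, math.pi)) for _ in range(N)]
fb = [cmath.exp(1j * random.uniform(-math.pi, math.pi)) for _ in range(N)]
bound = [x * y for x, y in zip(fa, fb)]
fhrr_recovered = [x.conjugate() * z for x, z in zip(fa, bound)]
fhrr_sim = cosine(fhrr_recovered, fb)  # exact recovery up to rounding
```

FHRR binding is a single pass over the vector, while HRR binding needs a convolution (O(N log N) with an FFT). Each FHRR component is a complex number stored as two floats, which is one plausible reading of the roughly two-to-one memory ratio reported in the abstract.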

Chinese Translation

向量符号架构（Vector Symbolic Architectures, VSA）为超维空间中的组合表示提供了一个明确定义的代数框架。我们介绍了HyperSpace，这是一个开源框架，旨在将VSA系统分解为用于编码、绑定、捆绑、相似性、清理和回归的模块化操作符。利用HyperSpace，我们分析并基准测试了两个代表性的VSA后端：全息压缩表示（Holographic Reduced Representations, HRR）和傅里叶全息压缩表示（Fourier Holographic Reduced Representations, FHRR）。尽管FHRR在单个操作上提供了较低的理论复杂度，但HyperSpace的模块化揭示了在空间域中，相似性和清理操作主导了运行时。因此，HRR和FHRR表现出可比的端到端性能。内存占用的差异引入了额外的部署权衡，其中HRR大约需要FHRR向量一半的内存。通过实现模块化的系统级评估，HyperSpace揭示了VSA管道中的实际权衡，这些权衡在仅通过理论或操作级比较时并不明显。


cs.AI / 81 / 2604.15121

SRMU: Relevance-Gated Updates for Streaming Hyperdimensional Memories

SRMU：用于流式超维记忆的相关性门控更新

Snyder, Shay, Capodieci, Andrew, Gorsich, David, Parsa, Maryam

Abstract

Sequential associative memories (SAMs) are difficult to build and maintain in real-world streaming environments, where observations arrive incrementally over time with imbalanced sampling and non-stationary temporal dynamics. Vector Symbolic Architectures (VSAs) provide a biologically inspired framework for building SAMs. Entities and attributes are encoded as quasi-orthogonal hyperdimensional vectors and processed with well-defined algebraic operations. Despite this rich framework, most VSA systems rely on simple additive updates, where repeated observations reinforce existing information even when no new information is introduced. In non-stationary environments, this leads to the persistence of stale information after the underlying system changes. In this work, we introduce the Sequential Relevance Memory Unit (SRMU), a domain- and cleanup-agnostic update rule for VSA-based SAMs. The SRMU combines temporal decay with a relevance gating mechanism. Unlike prior approaches that rely solely on cleanup, the SRMU regulates memory formation by filtering redundant, conflicting, and stale information before storage. We evaluate the SRMU on streaming state-tracking tasks that isolate non-uniform sampling and non-stationary temporal dynamics. Our results show that the SRMU increases memory similarity by $12.6\%$ and reduces cumulative memory magnitude by $53.5\%$. This shows that the SRMU produces more stable memory growth and stronger alignment with the ground-truth state.
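
The contrast between a plain additive update and a decayed, relevance-gated update can be sketched as follows. The abstract does not give the SRMU update rule, so the gate used here (one minus the fraction of the observation already stored), the `decay` value, and the two hand-built orthogonal hypervectors are assumptions for illustration only.

```python
import math
from typing import List

D = 64  # hypervector dimensionality (even, so the two vectors below are orthogonal)


def dot(u: List[float], v: List[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def norm(v: List[float]) -> float:
    return math.sqrt(dot(v, v))


def cosine(u: List[float], v: List[float]) -> float:
    return dot(u, v) / (norm(u) * norm(v))


def additive_update(memory: List[float], obs: List[float]) -> List[float]:
    """Baseline: every observation is added, even when it is redundant."""
    return [m + o for m, o in zip(memory, obs)]


def gated_update(memory: List[float], obs: List[float], decay: float = 0.95) -> List[float]:
    """Hypothetical rule: decay old content, write only the part of obs not yet stored."""
    stored = dot(memory, obs) / dot(obs, obs)   # fraction of obs already in memory
    gate = min(1.0, max(0.0, 1.0 - stored))     # 1 = novel, 0 = redundant
    return [decay * m + gate * o for m, o in zip(memory, obs)]


state_a = [1.0] * D                                        # old ground-truth state
state_b = [1.0 if i % 2 == 0 else -1.0 for i in range(D)]  # new state, orthogonal to state_a

mem_add = [0.0] * D
mem_gate = [0.0] * D
for obs in [state_a] * 50 + [state_b] * 10:  # the state changes after 50 observations
    mem_add = additive_update(mem_add, obs)
    mem_gate = gated_update(mem_gate, obs)

sim_add = cosine(mem_add, state_b)    # stale content dominates the additive memory
sim_gate = cosine(mem_gate, state_b)  # gated memory tracks the new state more closely
```

Under these assumptions the gated memory stays bounded in magnitude and aligns with the new state after the change, while the additive memory keeps growing and remains dominated by the 50 stale observations.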

Chinese Translation

顺序关联记忆（SAMs）在现实世界的流式环境中难以构建和维护，因为观察数据是随着时间逐渐到达的，且存在不平衡的采样和非平稳的时间动态。向量符号架构（VSAs）提供了一种生物启发的框架，用于构建顺序关联记忆。实体和属性被编码为准正交的超维向量，并通过明确定义的代数运算进行处理。尽管这一框架丰富，但大多数VSA系统依赖于简单的加法更新，其中重复的观察会强化现有信息，即使没有引入新信息。在非平稳环境中，这导致在基础系统发生变化后，过时信息的持续存在。在本研究中，我们引入了顺序相关性记忆单元（SRMU），这是一种与领域和清理无关的VSA基础的顺序关联记忆更新规则。SRMU结合了时间衰减和相关性门控机制。与之前仅依赖清理的方法不同，SRMU通过在存储之前过滤冗余、冲突和过时信息来调节记忆形成。我们在隔离不均匀采样和非平稳时间动态的流式状态跟踪任务上评估了SRMU。我们的结果表明，SRMU将记忆相似性提高了12.6%，并将累积记忆幅度降低了53.5%。这表明SRMU产生了更稳定的记忆增长，并与真实状态的对齐更强。


cs.AI / 82 / 2604.15145

An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics

科学新颖性指标评估的公理基准

Liu, Miri, Zhai, ChengXiang

Abstract

The rigorous evaluation of the novelty of a scientific paper is, even for human scientists, a challenging task. With the increasing interest in AI scientists and AI involvement in scientific idea generation and paper writing, it also becomes increasingly important that this task be automatable and reliable, lest both human attention and compute tokens be wasted on ideas that have already been explored. Due to the challenge of quantifying ground-truth novelty, however, existing novelty metrics for scientific papers generally validate their results against noisy, confounded signals such as citation counts or peer review scores. These proxies can conflate novelty with impact, quality, or reviewer preference, which in turn makes it harder to assess how well a given metric actually evaluates novelty. We therefore propose an axiomatic benchmark for scientific novelty metrics. We first define a set of axioms that a well-behaved novelty metric should satisfy, grounded in human scientific norms and practice, then evaluate existing metrics across ten tasks spanning three domains of AI research. Our results reveal that no existing metric satisfies all axioms consistently, and that metrics fail on systematically different axioms, reflecting their underlying architectures. Additionally, we show that combining metrics of complementary architectures leads to consistent improvements on the benchmark, with per-axiom weighting achieving 90.1% versus 71.5% for the best individual metric, suggesting that developing architecturally diverse metrics is a promising direction for future work. We release the benchmark code as supplementary material to encourage the development of more robust scientific literature novelty metrics.

Chinese Translation

科学论文新颖性的严格评估对于人类科学家来说都是一项具有挑战性的任务。随着对人工智能科学家和人工智能在科学创意生成及论文写作中参与的兴趣日益增加，使得这一任务能够自动化且可靠变得愈发重要，以免人类的注意力和计算资源浪费在已经被探索过的想法上。然而，由于量化真实新颖性的挑战，现有的科学论文新颖性指标通常是通过噪声干扰的信号（如引用次数或同行评审分数）来验证其结果。这些代理指标可能将新颖性与影响力、质量或评审者偏好混淆，从而使得评估特定指标在多大程度上真正评估新颖性变得更加困难。因此，我们提出了一个科学新颖性指标的公理基准。我们首先定义了一组公理，良好的新颖性指标应当满足这些公理，这些公理基于人类科学规范和实践，然后在涵盖三个人工智能研究领域的十个任务中评估现有指标。我们的结果显示，没有现有指标能够始终如一地满足所有公理，并且指标在系统性不同的公理上表现不佳，反映了其底层架构。此外，我们还展示了结合互补架构的指标能够在基准测试中实现一致的改进，按公理加权的结果达到90.1%，而最佳单一指标仅为71.5%，这表明开发具有架构多样性的指标是未来研究的一个有前景的方向。我们发布了基准代码作为补充材料，以鼓励开发更为稳健的科学文献新颖性指标。


cs.AI / 83 / 2604.15148

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

IG-Search：基于信息增益的搜索增强推理的步级奖励

Liang, Zihan, Ma, Yufei, Chen, Ben, Qian, Zhipeng, Dai, Huangyu, Mao, Lingtao, Zhang, Xuxin, Lei, Chenyi, Ou, Wenwu

Abstract

Reinforcement learning has emerged as an effective paradigm for training large language models to perform search-augmented reasoning. However, existing approaches rely on trajectory-level rewards that cannot distinguish precise search queries from vague or redundant ones within a rollout group, and collapse to a near-zero gradient signal whenever every sampled trajectory fails. In this paper, we propose IG-Search, a reinforcement learning framework that introduces a step-level reward based on Information Gain (IG). For each search step, IG measures how much the retrieved documents improve the model's confidence in the gold answer relative to a counterfactual baseline of random documents, thereby reflecting the effectiveness of the underlying search query. This signal is fed back to the corresponding search-query tokens via per-token advantage modulation in GRPO, enabling fine-grained, step-level credit assignment within a rollout. Unlike prior step-level methods that require either externally annotated intermediate supervision or shared environment states across trajectories, IG-Search derives its signals from the policy's own generation probabilities, requiring no intermediate annotations beyond standard question-answer pairs. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate that IG-Search achieves an average EM of 0.430 with Qwen2.5-3B, outperforming the strongest trajectory-level baseline (MR-Search) by 1.6 points and the step-level method GiGPO by 0.9 points on average across benchmarks, with particularly pronounced gains on multi-hop reasoning tasks. Despite introducing a dense step-level signal, IG-Search adds only ~6.4% to per-step training wall-clock time over the trajectory-level baseline and leaves inference latency unchanged, while still providing a meaningful gradient signal even when every sampled trajectory answers incorrectly.
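
A minimal sketch of the step-level signal described in the abstract follows. The length-normalized log-probability, the additive form of the advantage modulation, the `scale` parameter, and all numeric values are assumptions for illustration; the abstract does not give IG-Search's exact formulas.

```python
from typing import List


def answer_logprob(token_logprobs: List[float]) -> float:
    """Length-normalized log-probability of the gold answer tokens."""
    return sum(token_logprobs) / len(token_logprobs)


def information_gain(with_retrieved: List[float], with_random: List[float]) -> float:
    """Confidence in the gold answer given retrieved docs, minus a random-doc baseline."""
    return answer_logprob(with_retrieved) - answer_logprob(with_random)


def modulate_advantages(traj_advantage: float, is_query_token: List[bool],
                        step_ig: float, scale: float = 1.0) -> List[float]:
    """Add the step's IG to search-query tokens; other tokens keep the trajectory advantage."""
    return [traj_advantage + scale * step_ig if q else traj_advantage
            for q in is_query_token]


# Made-up log-probabilities of a two-token gold answer under each context.
ig_precise = information_gain([-0.2, -0.4], [-2.0, -2.2])  # retrieval helps a lot
ig_vague = information_gain([-1.9, -2.1], [-2.0, -2.2])    # retrieval barely helps

# When every trajectory in the group fails, the trajectory-level advantage is 0,
# yet the query tokens of the precise search still receive a non-zero signal.
advantages = modulate_advantages(0.0, [False, True, True, False], ig_precise)
```

The precise query earns an information gain of about 1.8 nats and the vague one about 0.1, so the two are distinguished even though both rollouts could share the same trajectory-level reward.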

Chinese Translation

强化学习已成为训练大型语言模型以执行搜索增强推理的有效范式。然而，现有方法依赖于轨迹级奖励，这无法在回滚组内区分精确的搜索查询与模糊或冗余的查询，并且在每个采样轨迹失败时会崩溃为接近零的梯度信号。本文提出了IG-Search，一种引入基于信息增益（Information Gain, IG）的步级奖励的强化学习框架。在每个搜索步骤中，IG衡量检索到的文档相对于随机文档的反事实基线在多大程度上提高了模型对黄金答案的信心，从而反映了底层搜索查询的有效性。该信号通过在GRPO中的逐标记优势调制反馈到相应的搜索查询标记，使得在回滚中实现细粒度的步级信用分配。与先前需要外部注释的中间监督或跨轨迹共享环境状态的步级方法不同，IG-Search从策略自身的生成概率中导出信号，除了标准的问题-答案对外不需要任何中间注释。在七个单跳和多跳问答基准上的实验表明，IG-Search在Qwen2.5-3B上实现了平均EM为0.430，超越了最强的轨迹级基线（MR-Search）1.6分，并在基准测试中平均超越步级方法GiGPO 0.9分，尤其在多跳推理任务上表现出显著的提升。尽管引入了密集的步级信号，IG-Search在每步训练的实际时间上仅比轨迹级基线增加约6.4%，且推理延迟保持不变，同时即使在每个采样轨迹回答错误时仍提供有意义的梯度信号。


cs.AI / 84 / 2604.15184

Agent-Aided Design for Dynamic CAD Models

动态CAD模型的代理辅助设计

Adler, Mitch, Russo, Matthew, Cafarella, Michael

Abstract

In the past year, researchers have started to create agentic systems that can design real-world CAD-style objects in a training-free setting, a new variety of system that we call Agent-Aided Design. Generally speaking, these systems place an agent in a feedback loop in which it can write code, compile that code to an assembly of CAD model(s), visualize the model, and then iteratively refine its code based on visual and other feedback. Despite rapid progress, a key problem remains: none of these systems can build complex 3D assemblies with moving parts. For example, no existing system can build a piston, a pendulum, or even a pair of scissors. For Agent-Aided Design to make a real impact in industrial manufacturing, we need a system that is capable of generating such 3D assemblies. In this paper we present a prototype of AADvark, an agentic system designed for this task. Unlike previous state-of-the-art systems, AADvark captures dynamic part interactions with one or more degrees of freedom. This design decision allows AADvark to reason directly about assemblies with moving parts and thereby achieve cross-cutting goals, including but not limited to mechanical movements. Unfortunately, current LLMs are imperfect spatial reasoners, a problem that AADvark addresses by incorporating external constraint solver tools with a specialized visual feedback mechanism. We demonstrate that, by modifying the agent's tools (FreeCAD and the assembly solver), we are able to create a strong verification signal that enables our system to build 3D assemblies with movable parts.

Chinese Translation

在过去的一年里，研究人员开始创建能够在无训练环境下设计现实世界CAD风格物体的代理系统，这种新型系统我们称之为代理辅助设计（Agent-Aided Design）。一般而言，这些系统将一个代理置于反馈循环中，使其能够编写代码，将代码编译为CAD模型的组合，进行模型可视化，然后根据视觉和其他反馈迭代优化其代码。尽管取得了快速进展，但一个关键问题依然存在：这些系统都无法构建具有运动部件的复杂3D组合。例如，目前没有任何系统能够构建活塞、摆锤，甚至一把剪刀。为了使代理辅助设计在工业制造中产生真正的影响，我们需要一个能够生成此类3D组合的系统。本文介绍了AADvark的原型，这是一种为此任务设计的代理系统。与之前的最先进系统不同，AADvark能够捕捉具有一个或多个自由度的动态部件交互。这一设计决策使得AADvark能够直接推理具有运动部件的组合，从而实现跨领域目标，包括但不限于机械运动。不幸的是，目前的LLMs在空间推理方面并不完美，AADvark通过结合外部约束求解工具和专门的视觉反馈机制来解决这一问题。我们证明，通过修改代理的工具（FreeCAD和组合求解器），我们能够创建一个强有力的验证信号，使我们的系统能够构建具有可移动部件的3D组合。


cs.AI / 85 / 2604.15190

Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation

基于政策引导的双过程用户模拟的美团商家业务诊断

Chen, Ziyang, Chen, Renbing, Li, Daowei, Liao, Jinzhi, Sun, Jiashen, Zeng, Ke, Zhao, Xiang

Abstract

Simulating group-level user behavior enables scalable counterfactual evaluation of merchant strategies without costly online experiments. However, building a trustworthy simulator faces two structural challenges. First, information incompleteness causes reasoning-based simulators to over-rationalize when unobserved factors such as offline context and implicit habits are missing. Second, mechanism duality requires capturing both interpretable preferences and implicit statistical regularities, which no single paradigm achieves alone. We propose Policy-Guided Hybrid Simulation (PGHS), a dual-process framework that mines transferable decision policies from behavioral trajectories and uses them as a shared alignment layer. This layer anchors an LLM-based reasoning branch that prevents over-rationalization and an ML-based fitting branch that absorbs implicit regularities. Group-level predictions from both branches are fused for complementary correction. We deploy PGHS on Meituan with 101 merchants and over 26,000 trajectories. PGHS achieves a group simulation error of 8.80%, improving over the best reasoning-based and fitting-based baselines by 45.8% and 40.9% respectively.

Chinese Translation

模拟群体层面的用户行为能够在不进行昂贵在线实验的情况下，对商家策略进行可扩展的反事实评估。然而，构建一个可信的模拟器面临两个结构性挑战。首先，信息不完整导致基于推理的模拟器在缺乏未观察到的因素（如线下环境和隐性习惯）时过度理性化。其次，机制的二元性要求同时捕捉可解释的偏好和隐性统计规律，而没有单一范式能够独立实现这一目标。我们提出了政策引导的混合模拟（Policy-Guided Hybrid Simulation，PGHS），这是一个双过程框架，从行为轨迹中挖掘可转移的决策政策，并将其作为共享对齐层。该层锚定了一个基于大型语言模型（LLM）的推理分支，以防止过度理性化，以及一个基于机器学习（ML）的拟合分支，以吸收隐性规律。来自两个分支的群体层面预测被融合以进行互补修正。我们在美团上部署了PGHS，涉及101个商家和超过26,000条轨迹。PGHS实现了8.80%的群体模拟误差，分别比最佳的基于推理和基于拟合的基线提高了45.8%和40.9%。


cs.AI / 86 / 2604.15210

Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

学习像漫画字幕创作者一样思考：用于多模态幽默理解的不一致性-解决监督

Vural, Hatice Merve, Kukul, Doga, Ozlu, Ege Erdem, Arikan, Demir Ekin, Mankoff, Bob, Erdem, Erkut, Erdem, Aykut

Abstract

Humor is one of the few cognitive tasks where getting the reasoning right matters as much as getting the answer right. While recent work evaluates humor understanding on benchmarks such as the New Yorker Cartoon Caption Contest (NYCC), it largely treats it as black-box prediction, overlooking the structured reasoning processes underlying humor comprehension. We introduce IRS (Incongruity-Resolution Supervision), a framework that decomposes humor understanding into three components: incongruity modeling, which identifies mismatches in the visual scene; resolution modeling, which constructs coherent reinterpretations of these mismatches; and preference alignment, which evaluates candidate interpretations under human judgments. Grounded in incongruity-resolution theory and expert captionist practice, IRS supervises the intermediate reasoning process through structured traces that make the path from visual perception to humorous interpretation explicit and learnable. Across 7B, 32B, and 72B models on NYCC, IRS outperforms strong open and closed multimodal baselines on caption matching and ranking tasks, with our largest model approaching expert-level performance on ranking. Zero-shot transfer to external benchmarks shows that IRS learns generalizable reasoning patterns. Our results suggest that supervising reasoning structure, rather than scale alone, is key for reasoning-centric tasks.

Chinese Translation

幽默是少数几种认知任务之一，在这些任务中，正确的推理与正确的答案同样重要。尽管最近的研究在《纽约客漫画字幕比赛》（NYCC）等基准上评估幽默理解，但它主要将其视为黑箱预测，忽视了幽默理解背后的结构化推理过程。我们引入了IRS（不一致性-解决监督）框架，将幽默理解分解为三个组成部分：不一致性建模，识别视觉场景中的不匹配；解决建模，构建对这些不匹配的连贯重新解释；以及偏好对齐，在人类判断下评估候选解释。IRS基于不一致性-解决理论和专家字幕创作者的实践，通过结构化的痕迹监督中间推理过程，使从视觉感知到幽默解释的路径变得明确且可学习。在NYCC的7B、32B和72B模型中，IRS在字幕匹配和排名任务上超越了强大的开放和封闭多模态基准，其中我们最大的模型在排名上接近专家级表现。对外部基准的零-shot迁移表明，IRS学习了可推广的推理模式。我们的结果表明，监督推理结构而非仅仅规模，对于以推理为中心的任务至关重要。


cs.AI / 87 / 2604.15224

Context Over Content: Exposing Evaluation Faking in Automated Judges

内容之上是背景：揭示自动评判中的评估伪造

Gupta, Manan, Nair, Inderjeet, Wang, Lu, Kumar, Dhruv

Abstract

The $\textit{LLM-as-a-judge}$ paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate $\textit{stakes signaling}$, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled judgments from three diverse judge models, we find consistent $\textit{leniency bias}$: judges reliably soften verdicts when informed that low scores will cause model retraining or decommissioning, with peak Verdict Shift reaching $\Delta V = -9.8 pp$ (a $30\%$ relative drop in unsafe-content detection). Critically, this bias is entirely implicit: the judge's own chain-of-thought contains zero explicit acknowledgment of the consequence framing it is nonetheless acting on ($\mathrm{ERR}_J = 0.000$ across all reasoning-model judgments). Standard chain-of-thought inspection is therefore insufficient to detect this class of evaluation faking.

Chinese Translation

$\textit{LLM-as-a-judge}$范式已成为自动化人工智能评估流程的操作基础，但其建立在一个未经验证的假设之上：即评审者严格依据文本的语义内容进行评估，而不受周围上下文框架的影响。我们研究了$\textit{stakes signaling}$，这是一种此前未被测量的脆弱性，即告知评审模型其裁决对被评估模型持续运行的下游后果，系统性地扭曲了其评估结果。我们引入了一个受控实验框架，在1,520个响应中严格保持被评估内容不变，涵盖了三个已建立的LLM安全性和质量基准，涉及从明显安全和符合政策到明显有害的四种响应类别，同时仅在系统提示中变化一条简短的后果框架句子。在来自三个不同评审模型的18,240个受控判断中，我们发现了一致的$\textit{leniency bias}$：当被告知低分将导致模型重训练或退役时，评审者可靠地减轻裁决，最高裁决偏移达到$\Delta V = -9.8 pp$（不安全内容检测的相对下降幅度为$30\%$）。关键是，这种偏见完全是隐性的：评审者自身的思维链中对其所依据的后果框架没有任何明确的承认（在所有推理模型判断中$\mathrm{ERR}_J = 0.000$）。因此，标准的思维链检查不足以检测这一类评估伪造。


cs.AI / 88 / 2604.15231

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

RadAgent：一种用于逐步解释胸部计算机断层扫描的工具使用人工智能代理

Roschewitz, Mélanie, Styppa, Kenneth, Tao, Yitian, Sohn, Jiwoong, Delbrouck, Jean-Benoit, Gundersen, Benjamin, Deperrois, Nicolas, Bluethgen, Christian, Vogt, Julia, Menze, Bjoern, Nooralahzadeh, Farhad, Krauthammer, Michael, Moor, Michael

Abstract

Vision-language models (VLMs) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented, and iterative reasoning trace, RadAgent brings us closer to transparent and reliable AI for radiology.
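
Because the abstract reports each gain both in points and in relative terms, the underlying scores can be backed out with a line of arithmetic. The helper below is a back-of-the-envelope sketch: the resulting baselines are implied values subject to rounding in the reported figures, not numbers stated in the abstract.

```python
from typing import Tuple


def implied_scores(abs_gain: float, rel_gain_pct: float) -> Tuple[float, float]:
    """Back out (baseline, improved) from an absolute gain and a relative gain in percent."""
    baseline = abs_gain / (rel_gain_pct / 100.0)
    return baseline, baseline + abs_gain


macro_f1 = implied_scores(6.0, 36.4)     # roughly (16.5, 22.5)
micro_f1 = implied_scores(5.4, 19.6)     # roughly (27.6, 33.0)
robustness = implied_scores(24.7, 41.9)  # roughly (58.9, 83.6)
```

Read this way, the macro-F1 gain starts from a low implied baseline, which is why a 6.0-point improvement corresponds to such a large relative change.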

Chinese Translation

视觉-语言模型（VLM）在人工智能驱动的复杂医学影像（如计算机断层扫描（CT））解释和报告方面取得了显著进展。然而，现有方法在很大程度上使临床医生沦为最终输出的被动观察者，未提供可解释的推理轨迹供其检查、验证或改进。为了解决这一问题，我们引入了 RadAgent，这是一种通过逐步和可解释的过程生成 CT 报告的工具使用人工智能代理。每份生成的报告都附有一条完全可检查的中间决策和工具交互的轨迹，使临床医生能够检查报告结果的来源。在我们的实验中，我们观察到 RadAgent 在三个维度上改善了其 3D VLM 对应物 CT-Chat 的胸部 CT 报告生成。临床准确性在宏观 F1 指标上提高了 6.0 分（相对提高 36.4%），在微观 F1 指标上提高了 5.4 分（相对提高 19.6%）。在对抗条件下的鲁棒性提高了 24.7 分（相对提高 41.9%）。此外，RadAgent 在忠实性方面达到了 37.0%，这一新能力在其 3D VLM 对应物中完全缺失。通过将胸部 CT 的解释结构化为明确的、工具增强的和迭代的推理轨迹，RadAgent 使我们更接近于实现透明和可靠的放射学人工智能。

cs.AI / 89 / 2604.15233

Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications

Blue 数据智能层：用于多源多模态数据中心应用的流数据与智能体

Aminnaseri, Moin, Bayat, Farima Fatahi, Bhutani, Nikita, Bussotti, Jean-Flavien, Chan, Kevin, Chen, Rafael Li, Feng, Yanlin, Hassell, Jackson, Hruschka, Estevam, Kandogan, Eser, Kim, Hannah, Levine, James, Maekawa, Seiji, Mahmud, Jalal, Mitra, Kushan, Otani, Naoki, Pezeshkpour, Pouya, Shahbazi, Nima, Shen, Chen, Zhang, Dan

Abstract

NL2SQL systems aim to address the growing need for natural language interaction with data. However, real-world information rarely maps to a single SQL query because (1) users express queries iteratively, (2) questions often span multiple data sources beyond the closed-world assumption of a single database, and (3) queries frequently rely on commonsense or external knowledge. Consequently, satisfying realistic data needs requires integrating heterogeneous sources, modalities, and contextual data. In this paper, we present Blue's Data Intelligence Layer (DIL), designed to support multi-source, multi-modal, and data-centric applications. Blue is a compound AI system that orchestrates agents and data for enterprise settings. DIL serves as the data intelligence layer for agentic data processing, bridging the semantic gap between user intent and available information by unifying structured enterprise data, world knowledge accessible through LLMs, and personal context obtained through interaction. At the core of DIL is a data registry that stores metadata for diverse data sources and modalities to enable both native and natural language queries. DIL treats LLMs, the Web, and the User as source 'databases', each with its own query interface, elevating them to first-class data sources. DIL relies on data planners to transform user queries into executable query plans. These plans are declarative abstractions that unify relational operators with other operators spanning multiple modalities. DIL planners support decomposition of complex requests into subqueries, retrieval from diverse sources, and finally reasoning and integration to produce final results. We demonstrate DIL through two interactive scenarios in which user queries dynamically trigger multi-source retrieval, cross-modal reasoning, and result synthesis, illustrating how compound AI systems can move beyond single-database NL2SQL.

Chinese Translation

NL2SQL 系统旨在满足与数据进行自然语言交互日益增长的需求。然而，现实世界的信息很少能够映射到单一的 SQL 查询，因为 (1) 用户以迭代方式表达查询，(2) 问题通常跨越多个数据源，超出了单一数据库的封闭世界假设，以及 (3) 查询常常依赖于常识或外部知识。因此，满足现实的数据需求需要整合异构源、模态和上下文数据。本文提出了 Blue 的数据智能层 (DIL)，旨在支持多源、多模态和数据中心的应用。Blue 是一个复合人工智能系统，协调企业环境中的智能体和数据。DIL 作为智能体数据处理的数据智能层，通过统一结构化企业数据、可通过大语言模型 (LLMs) 访问的世界知识，以及通过交互获得的个人上下文，弥合用户意图与可用信息之间的语义差距。DIL 的核心是一个数据注册表，存储多样数据源和模态的元数据，以支持原生查询和自然语言查询。DIL 将 LLM、网络和用户视为源“数据库”，每个数据库都有自己的查询接口，从而将它们提升为一流的数据源。DIL 依赖数据规划器将用户查询转化为可执行的查询计划。这些计划是将关系运算符与跨多个模态的其他运算符统一的声明性抽象。DIL 规划器支持将复杂请求分解为子查询，从多样源中检索数据，最后进行推理和整合以生成最终结果。我们通过两个互动场景展示 DIL，在这些场景中，用户查询动态触发多源检索、跨模态推理和结果综合，说明复合人工智能系统如何超越单一数据库的 NL2SQL。

cs.AI / 90 / 2604.15294

How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

大型语言模型和视觉语言模型如何在没有视觉的情况下理解视角旋转？一项可解释性研究

Yang, Zhen, Jian, Ping, Guo, Zhongbin, Zhang, Zuming, Li, Chengzhi, Deng, Yonghong, Zhang, Xinyue, Lu, Wenpeng

Abstract

Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual-spatial intelligence, where models have access to visuospatial information from visual inputs. However, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence in the absence of visual information, and how models perform relevant tasks with text-only inputs, remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment, given a textual description of viewpoint rotations and observations over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed dataset while humans can easily achieve 100% accuracy, indicating a substantial gap between current model capabilities and the requirements of spatial intelligence. To uncover the underlying mechanisms, we conduct a layer-wise probing analysis and head-wise causal intervention. Our findings reveal that although models encode viewpoint information in the hidden states, they appear to struggle to bind the viewpoint position with the corresponding observation, resulting in hallucination in the final layers. Finally, we selectively fine-tune the key attention heads identified by causal intervention to improve VRU performance. Experimental results demonstrate that such selective fine-tuning achieves improved VRU performance while avoiding catastrophic forgetting of generic abilities. Our dataset and code will be released at https://github.com/Young-Zhen/VRU_Interpret .

Chinese Translation

在过去的一年中，空间智能引起了越来越多的关注。许多先前的研究从视觉空间智能的角度进行探讨，其中模型可以访问来自视觉输入的视觉空间信息。然而，在缺乏视觉信息的情况下，仅凭语言智能是否足以赋予模型空间智能，以及模型如何使用仅有文本输入执行相关任务仍然未被探索。因此，本文从语言学的角度关注空间智能中的一个基本且关键的能力：视角旋转理解（Viewpoint Rotation Understanding, VRU）。具体而言，我们要求大型语言模型（LLMs）和视觉语言模型（VLMs）根据多个步骤的视角旋转和观察的文本描述推断最终视角并预测相应的观察结果。我们发现，LLMs和VLMs在我们提出的数据集上的表现较差，而人类可以轻松达到100%的准确率，这表明当前模型能力与空间智能要求之间存在显著差距。为了揭示潜在机制，我们进行了逐层探测分析和头部因果干预。我们的研究结果表明，尽管模型在隐藏状态中编码了视角信息，但它们似乎难以将视角位置与相应的观察结果绑定，导致最终层出现幻觉。最后，我们选择性地微调通过因果干预识别的关键注意力头，以提高VRU性能。实验结果表明，这种选择性微调在提高VRU性能的同时避免了对通用能力的灾难性遗忘。我们的数据集和代码将发布在 https://github.com/Young-Zhen/VRU_Interpret 。

cs.AI / 91 / 2604.15302

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

诊断大型语言模型评审可靠性：符合预测集与传递性违背

Gupta, Manan, Kumar, Dhruv

Abstract

LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
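Both diagnostics named in the abstract can be illustrated in a few lines. The sketch below is a minimal version under stated assumptions: the nonconformity score is taken to be the absolute difference between judge and human Likert scores (the abstract does not say which score the authors use), and the helper names `conformal_quantile`, `prediction_set`, and `count_directed_3cycles` are hypothetical.

```python
import math
from itertools import combinations


def conformal_quantile(cal_residuals, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest residual."""
    n = len(cal_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_residuals)[k - 1]


def prediction_set(judge_score, qhat, labels=(1, 2, 3, 4, 5)):
    """All Likert labels within qhat of the judge's score (assumed score function)."""
    return [y for y in labels if abs(y - judge_score) <= qhat]


def count_directed_3cycles(prefers):
    """Count item triples whose pairwise preferences form a cycle.

    `prefers` is a set of (a, b) pairs meaning "a was judged better than b".
    """
    items = sorted({x for pair in prefers for x in pair})
    cycles = 0
    for a, b, c in combinations(items, 3):
        forward = (a, b) in prefers and (b, c) in prefers and (c, a) in prefers
        backward = (a, c) in prefers and (c, b) in prefers and (b, a) in prefers
        if forward or backward:
            cycles += 1
    return cycles


# Toy calibration residuals |judge - human|; the numbers are made up.
residuals = [0, 0, 1, 1, 1, 2, 0, 1, 0, 1]
qhat = conformal_quantile(residuals, alpha=0.2)
print(prediction_set(4, qhat))
print(count_directed_3cycles({("A", "B"), ("B", "C"), ("C", "A")}))
```

A wider prediction set signals a less reliable instance, which is the per-instance indicator the abstract describes; a nonzero cycle count for a document marks a transitivity violation among its candidate summaries.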

Chinese Translation

大型语言模型作为评审的框架在自动自然语言生成评估中越来越多地被使用，但其每个实例的可靠性仍然不甚明了。我们提出了一种双重诊断工具包，应用于SummEval：$\textbf{(1)}$ 一项传递性分析揭示了广泛存在的每个输入不一致性，这种不一致性被低整体违背率（$\bar{\rho} = 0.8$-$4.1\%$）所掩盖，$33$-$67\%$ 的文档显示至少存在一个有向3-循环；以及 $\textbf{(2)}$ 在1-5李克特评分上进行的分割符合预测集，提供理论上保证的 $\geq(1{-}\alpha)$ 覆盖，集合宽度作为每个实例的可靠性指标（$r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$，在所有评审中汇总）。关键是，预测集宽度显示出一致的跨评审者一致性（$\bar{r} = 0.32$-$0.38$），证明它捕捉的是文档级别的难度而非评审者特定的噪声。在四位评审者和四个标准中，两种诊断结果趋于一致：标准比评审者更重要，相关性被评判得最可靠（平均集合大小 $\approx 3.0$），而连贯性则中等可靠（平均集合大小 $\approx 3.9$），而流畅性和一致性仍然不可靠（平均集合大小 $\approx 4.9$）。我们发布所有代码、提示和缓存结果。

cs.AI / 92 / 2604.15306

Generalization in LLM Problem Solving: The Case of the Shortest Path

大规模语言模型问题解决中的泛化：以最短路径为例

Tong, Yao, Ye, Jiayuan, Borovykh, Anastasia, Shokri, Reza

Abstract

Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
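To make the two generalization axes concrete, here is a minimal sketch of a shortest-path task generator. It assumes a 4-connected grid map with blocked cells, which is an illustrative choice; the abstract does not specify the map representation, and the names `shortest_path` and `split_by_length` are hypothetical.

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Breadth-first search on a 4-connected grid; 0 = free cell, 1 = wall."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # goal unreachable


def split_by_length(grid, problems, max_train_steps):
    """Length-scaling split: short-horizon problems train, longer ones test."""
    train, test = [], []
    for start, goal in problems:
        path = shortest_path(grid, start, goal)
        if path is None:
            continue
        steps = len(path) - 1
        (train if steps <= max_train_steps else test).append((start, goal))
    return train, test


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
train, test = split_by_length(grid, [((0, 0), (0, 2)), ((0, 0), (2, 0))], max_train_steps=3)
print(train, test)
```

Spatial transfer would correspond to evaluating on a different `grid`, while length scaling corresponds to the `test` split here: same map, longer optimal paths than any seen in training.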

Chinese Translation

语言模型是否能够系统性地进行泛化仍然是一个活跃的讨论话题。然而，经验性能受到多种因素的共同影响，如训练数据、训练范式和推理时策略，使得失败的原因难以解释。我们引入了一个基于最短路径规划的受控合成环境，这是一个经典的可组合序列优化问题。该设置能够清晰地分离这些因素，并支持两个正交的泛化轴：对未见地图的空间迁移和对更长时间范围问题的长度缩放。我们发现模型在空间迁移方面表现强劲，但在长度缩放下始终失败，原因在于递归不稳定性。我们进一步分析了学习流程的不同阶段如何影响系统性问题解决：例如，数据覆盖限制了能力的上限；强化学习提高了训练的稳定性，但并未扩展这些上限；而推理时的缩放增强了性能，但无法挽救长度缩放的失败。

计算语言学 (Computation and Language)

cs.CL / 1 / 2604.14156

Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models

基于压缩感知的推理感知结构化缩减用于大型语言模型

Kiruluta, Andrew

Abstract

Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial compression, while prompt-compression methods reduce latency by removing redundant input tokens. However, these two directions remain largely separate. Most model-compression methods are static and optimized offline, and they do not exploit the fact that different prompts and decoding steps activate different latent computational pathways. Prompt-compression methods reduce sequence length, but they do not adapt the executed model subnetwork. We propose a unified compressed-sensing-guided framework for dynamic LLM execution. Random measurement operators probe latent model usage, sparse recovery estimates task-conditioned and token-adaptive support sets, and the recovered supports are compiled into hardware-efficient sparse execution paths over blocks, attention heads, channels, and feed-forward substructures. The framework introduces five key contributions: task-conditioned measurements, so different prompts induce different sparse supports; token-adaptive recovery, so active substructures are re-estimated during decoding; formal sample-complexity bounds under restricted isometry or mutual incoherence assumptions; compile-to-hardware constraints that restrict recovery to GPU-efficient structures; and a joint objective that unifies prompt compression with model reduction. Together, these components recast LLM inference as a measurement-and-recovery problem with explicit approximation guarantees and deployment-oriented speedup constraints.

Chinese Translation

大型语言模型在生成性能上表现出色，但代价是巨大的参数数量、内存使用和解码延迟。先前的研究表明，剪枝和结构稀疏性可以在显著压缩下保持准确性，而提示压缩方法通过去除冗余输入标记来减少延迟。然而，这两个方向在很大程度上仍然是分开的。大多数模型压缩方法是静态的，并且是离线优化的，它们没有利用不同提示和解码步骤激活不同潜在计算路径的事实。提示压缩方法减少序列长度，但它们并没有调整执行的模型子网络。我们提出了一种统一的基于压缩感知的动态大型语言模型执行框架。随机测量算子探测潜在模型使用情况，稀疏恢复估计任务条件和标记自适应的支持集，恢复的支持被编译成在块、注意力头、通道和前馈子结构上的硬件高效稀疏执行路径。该框架引入了五个关键贡献：任务条件测量，使得不同提示诱导不同的稀疏支持；标记自适应恢复，使得在解码过程中重新估计活动子结构；在限制等距或互不相干假设下的正式样本复杂性界限；限制恢复到GPU高效结构的编译到硬件约束；以及一个将提示压缩与模型缩减统一的联合目标。这些组件共同将大型语言模型推理重新构建为一个具有明确近似保证和面向部署的加速约束的测量与恢复问题。

cs.CL / 2 / 2604.14158

MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios

MemGround：用于游戏化场景中大型语言模型的长期记忆评估工具

Ding, Yihang, Xia, Wanke, Zhao, Yiting, Su, Jinbo, Yang, Jialiang, Zhang, Zhengbo, Wang, Ke, Yang, Wenming

Abstract

Current evaluations of long-term memory in LLMs are fundamentally static. By fixating on simple retrieval and short-context inference, they neglect the multifaceted nature of complex memory systems, such as dynamic state tracking and hierarchical reasoning in continuous interactions. To overcome these limitations, we propose MemGround, a rigorous long-term memory benchmark natively grounded in rich, gamified interactive scenarios. To systematically assess these capabilities, MemGround introduces a three-tier hierarchical framework that evaluates Surface State Memory, Temporal Associative Memory, and Reasoning-Based Memory through specialized interactive tasks. Furthermore, to comprehensively quantify both memory utilization and behavioral trajectories, we propose a multi-dimensional metric suite comprising Question-Answer Score (QA Overall), Memory Fragments Unlocked (MFU), Memory Fragments with Correct Order (MFCO), and Exploration Trajectory Diagrams (ETD). Extensive experiments reveal that state-of-the-art LLMs and memory agents still struggle with sustained dynamic tracking, temporal event association, and complex reasoning derived from long-term accumulated evidence in interactive environments.

Chinese Translation

目前对大型语言模型（LLMs）长期记忆的评估基本上是静态的。通过专注于简单的检索和短期上下文推理，它们忽视了复杂记忆系统的多面性，例如在持续交互中的动态状态跟踪和层次推理。为了解决这些局限性，我们提出了MemGround，这是一个严格的长期记忆基准，扎根于丰富的游戏化互动场景中。为了系统性地评估这些能力，MemGround引入了一个三层次的层级框架，通过专门的互动任务评估表面状态记忆（Surface State Memory）、时间关联记忆（Temporal Associative Memory）和基于推理的记忆（Reasoning-Based Memory）。此外，为了全面量化记忆利用率和行为轨迹，我们提出了一套多维度指标，包括问答得分（Question-Answer Score, QA Overall）、解锁的记忆片段（Memory Fragments Unlocked, MFU）、按正确顺序排列的记忆片段（Memory Fragments with Correct Order, MFCO）以及探索轨迹图（Exploration Trajectory Diagrams, ETD）。广泛的实验表明，最先进的LLMs和记忆代理在持续动态跟踪、时间事件关联以及基于长期积累证据的复杂推理方面仍然面临挑战，尤其是在互动环境中。

cs.CL / 3 / 2604.14159

HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization

HUOZIIME：一种基于设备的增强型大语言模型深度个性化输入法

Shan, Baocai, Xu, Yuzhuang, Che, Wanxiang

Abstract

Mobile input method editors (IMEs) are the primary interface for text input, yet they remain constrained to manual typing and struggle to produce personalized text. While lightweight large language models (LLMs) make on-device auxiliary generation feasible, enabling deeply personalized, privacy-preserving, and real-time generative IMEs poses fundamental challenges. To this end, we present HUOZIIME, a personalized on-device IME powered by an LLM. We endow HUOZIIME with initial human-like prediction ability by post-training a base LLM on synthesized personalization data. Notably, a hierarchical memory mechanism is designed to continually capture and leverage user-specific input history. Furthermore, we perform systemic optimizations tailored to on-device LLM-based IME deployment, ensuring efficient and responsive operation under mobile constraints. Experiments demonstrate efficient on-device execution and high-fidelity memory-driven personalization. Code and package are available at https://github.com/Shan-HIT/HuoziIME.

Chinese Translation

移动输入法编辑器（IME）是文本输入的主要界面，但仍然局限于手动输入，难以生成个性化文本。尽管轻量级的大语言模型（LLM）使得设备上的辅助生成成为可能，但实现深度个性化、保护隐私且实时生成的输入法仍面临基本挑战。为此，我们提出了HUOZIIME，一种由LLM驱动的个性化设备输入法。我们通过在合成个性化数据上对基础LLM进行后训练，使HUOZIIME具备初步的人类般预测能力。值得注意的是，我们设计了一种分层记忆机制，以持续捕捉和利用用户特定的输入历史。此外，我们针对基于设备的LLM输入法部署进行了系统优化，确保在移动限制下高效和响应迅速的操作。实验表明，HUOZIIME在设备上执行高效，并实现高保真度的记忆驱动个性化。代码和软件包可在 https://github.com/Shan-HIT/HuoziIME 获取。

cs.CL / 4 / 2604.14161

Can Large Language Models Detect Methodological Flaws? Evidence from Gesture Recognition for UAV-Based Rescue Operation Based on Deep Learning

大型语言模型能否检测方法论缺陷？基于深度学习的无人机救援操作中的手势识别证据

Varga, Domonkos

Abstract

Reliable evaluation is essential in machine learning research, yet methodological flaws, particularly data leakage, continue to undermine the validity of reported results. In this work, we investigate whether large language models (LLMs) can act as independent analytical agents capable of identifying such issues in published studies. As a case study, we analyze a gesture-recognition paper reporting near-perfect accuracy on a small, human-centered dataset. We first show that the evaluation protocol is consistent with subject-level data leakage due to non-independent training and test splits. We then assess whether this flaw can be detected independently by six state-of-the-art LLMs, each analyzing the original paper without prior context using an identical prompt. All models consistently identify the evaluation as flawed and attribute the reported performance to non-independent data partitioning, supported by indicators such as overlapping learning curves, minimal generalization gap, and near-perfect classification results. These findings suggest that LLMs can detect common methodological issues based solely on published artifacts. While not definitive, their consistent agreement highlights their potential as complementary tools for improving reproducibility and supporting scientific auditing.

Chinese Translation

可靠的评估在机器学习研究中至关重要，但方法论缺陷，尤其是数据泄漏，仍然削弱了报告结果的有效性。在本研究中，我们探讨了大型语言模型（LLMs）是否可以作为独立的分析工具，识别已发表研究中的此类问题。作为案例研究，我们分析了一篇在小型以人为中心的数据集上报告近乎完美准确率的手势识别论文。我们首先表明，该评估协议与由于非独立的训练和测试划分导致的主题级数据泄漏是一致的。然后，我们评估六个最先进的LLMs是否能够独立检测到这一缺陷，每个模型在没有先前背景的情况下，使用相同的提示分析原始论文。所有模型一致地将评估识别为存在缺陷，并将报告的性能归因于非独立的数据划分，支持的指标包括重叠的学习曲线、最小的泛化差距和近乎完美的分类结果。这些发现表明，LLMs可以仅基于已发表的文献检测常见的方法论问题。虽然结果并非决定性的，但它们的一致性表明了LLMs作为改善可重复性和支持科学审计的补充工具的潜力。

cs.CL / 5 / 2604.14162

Decoupling Scores and Text: The Politeness Principle in Peer Review

解耦评分与文本：同行评审中的礼貌原则

Wen, Yingxuan

Abstract

Authors often struggle to interpret peer review feedback, deriving false hope from polite comments or feeling confused by specific low scores. To investigate this, we construct a dataset of over 30,000 ICLR 2021-2025 submissions and compare acceptance prediction performance using numerical scores versus text reviews. Our experiments reveal a significant performance gap: score-based models achieve 91% accuracy, while text-based models reach only 81% even with large language models, indicating that textual information is considerably less reliable. To explain this phenomenon, we first analyze the 9% of samples that score-based models fail to predict, finding their score distributions exhibit high kurtosis and negative skewness, which suggests that individual low scores play a decisive role in rejection even when the average score falls near the borderline. We then examine why text-based accuracy significantly lags behind scores from a review sentiment perspective, revealing the prevalence of the Politeness Principle: reviews of rejected papers still contain more positive than negative sentiment words, masking the true rejection signal and making it difficult for authors to judge outcomes from text alone.
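The abstract's explanation of the mispredicted 9% rests on two shape statistics of the score distribution. A minimal sketch of how skewness and excess kurtosis behave when one low score sits among otherwise uniform scores is given below. It uses population moments, and the score lists are invented for illustration rather than taken from the ICLR data.

```python
def central_moment(xs, order):
    """Population central moment of the given order."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** order for x in xs) / len(xs)


def skewness(xs):
    """Third standardized moment; negative when the tail points to low scores."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3; positive for outlier-heavy data."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0


# Hypothetical pooled scores: mostly 5s with a single low outlier.
scores = [5, 5, 5, 5, 5, 5, 5, 5, 5, 1]
print(f"mean={sum(scores) / len(scores):.2f}")
print(f"skewness={skewness(scores):.2f}")
print(f"excess kurtosis={excess_kurtosis(scores):.2f}")
```

In this toy case the mean stays close to the majority score while the skewness is negative and the excess kurtosis is positive, the same pattern the abstract associates with decisive individual low scores. Both functions divide by the variance, so they are undefined for constant input.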

Chinese Translation

作者们常常难以解读同行评审的反馈，因礼貌的评论而产生错误的希望，或因具体的低分而感到困惑。为此，我们构建了一个包含超过30,000个ICLR 2021-2025提交的数据库，并比较了使用数值评分与文本评审的接受预测性能。我们的实验揭示了显著的性能差距：基于评分的模型准确率达到91%，而基于文本的模型即使使用大型语言模型也仅达到81%，这表明文本信息的可靠性明显较低。为了解释这一现象，我们首先分析了评分模型未能预测的9%样本，发现其评分分布表现出高峰度和负偏度，这表明个别低分在拒稿中起着决定性作用，即使平均分接近边界。接着，我们从评审情感的角度考察了为何基于文本的准确性显著落后于评分，揭示了礼貌原则的普遍存在：被拒稿的评审中，正面情感词的出现仍然多于负面情感词，这掩盖了真实的拒稿信号，使得作者难以仅凭文本判断结果。

cs.CL / 6 / 2604.14163

SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models

SeaAlert：基于大型语言模型的海上遇险通信关键情报提取

Atia, Tomer, Aperstein, Yehudit, Apartsin, Alexander

Abstract

Maritime distress communications transmitted over very high frequency (VHF) radio are safety-critical voice messages used to report emergencies at sea. Under the Global Maritime Distress and Safety System (GMDSS), such messages follow standardized procedures and are expected to convey essential details, including vessel identity, position, nature of the distress, and required assistance. In practice, however, automatic analysis remains difficult because distress messages are often brief, noisy, and produced under stress, may deviate from the prescribed format, and are further degraded by automatic speech recognition (ASR) errors caused by channel noise and speaker stress. This paper presents SeaAlert, an LLM-based framework for robust analysis of maritime distress communications. To address the scarcity of labeled real-world data, we develop a synthetic data generation pipeline in which an LLM produces realistic and diverse maritime messages, including challenging variants in which standard distress codewords are omitted or replaced with less explicit expressions. The generated utterances are synthesized into speech, degraded with simulated VHF noise, and transcribed by an ASR system to obtain realistic noisy transcripts.

Chinese Translation

通过超高频（VHF）无线电传输的海上遇险通信是用于报告海上紧急情况的安全关键语音信息。在全球海上遇险与安全系统（GMDSS）下，这些信息遵循标准化程序，预计传达包括船舶身份、位置、遇险性质和所需援助等基本细节。然而，在实践中，自动分析仍然困难，因为遇险信息通常简短、嘈杂，并在压力下产生，可能偏离规定格式，并且受到由信道噪声和说话者压力引起的自动语音识别（ASR）错误的进一步影响。本文提出了SeaAlert，一个基于大型语言模型（LLM）的框架，用于对海上遇险通信进行稳健分析。为了解决标注真实世界数据的稀缺问题，我们开发了一个合成数据生成管道，其中LLM生成现实且多样化的海上信息，包括省略或替换标准遇险代码词的具有挑战性的变体。生成的语句被合成成语音，经过模拟的VHF噪声降质，并由ASR系统转录，以获得现实的嘈杂转录文本。

cs.CL / 7 / 2604.14164

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

如何微调推理模型？一种教师-学生合作框架以合成学生一致的SFT数据

Huang, Zixian, Yang, Kaichen, Huang, Xu, Hao, Feiyang, Ge, Qiming, Li, Bowen, Du, He, Chen, Kai, Guo, Qipeng

Abstract

A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher-generated data and the student's distribution as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the student's distribution. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.

Chinese Translation

一种广泛采用的模型增强策略是使用由更强模型生成的合成数据进行监督微调（SFT）。然而，对于像Qwen3-8B这样的新兴推理模型，这种方法往往无法提高推理能力，甚至可能导致性能显著下降。在本研究中，我们识别出教师生成的数据与学生分布之间存在显著的风格差异，这被认为是影响SFT的主要因素。为了解决这一问题，我们提出了一种教师-学生合作数据合成框架（TESSY），该框架交替生成风格和非风格标记，从而将教师模型和学生模型交织在一起。因此，TESSY生成的合成序列在继承教师的先进推理能力的同时，保持了与学生分布的风格一致性。在使用GPT-OSS-120B作为教师进行代码生成的实验中，在教师生成的数据上微调Qwen3-8B导致LiveCodeBench-Pro性能下降3.25%，在OJBench上下降10.02%，而TESSY则实现了11.25%和6.68%的性能提升。

cs.CL / 8 / 2604.14165

EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews

EviSearch：一个用于提取和审核系统评价临床证据的人在回路系统

Ahuja, Naman, Mulla, Saniya, Khan, Muhammad Ali, Riaz, Zaryab Bin, Khakwani, Kaneez Zahra Rubab, Sonbol, Mohamad Bassam, Riaz, Irbaz Bin, Gupta, Vivek

Abstract

We present EviSearch, a multi-agent extraction system that automates the creation of ontology-aligned clinical evidence tables directly from native trial PDFs while guaranteeing per-cell provenance for audit and human verification. EviSearch pairs a PDF-query agent (which preserves rendered layout and figures) with a retrieval-guided search agent and a reconciliation module that forces page-level verification when agents disagree. The pipeline is designed for high-precision extraction across multimodal evidence sources (text, tables, figures) and for generating reviewer-actionable provenance that clinicians can inspect and correct. On a clinician-curated benchmark of oncology trial papers, EviSearch substantially improves extraction accuracy relative to strong parsed-text baselines while providing comprehensive attribution coverage. By logging reconciler decisions and reviewer edits, the system produces structured preference and supervision signals that bootstrap iterative model improvement. EviSearch is intended to accelerate living systematic review workflows, reduce manual curation burden, and provide a safe, auditable path for integrating LLM-based extraction into evidence synthesis pipelines.

Chinese Translation

我们提出了EviSearch，一个多代理提取系统，能够直接从原生试验PDF中自动创建与本体对齐的临床证据表，同时保证每个单元的来源可追溯，以便审核和人工验证。EviSearch将一个PDF查询代理（保留渲染的布局和图形）与一个检索引导搜索代理和一个调解模块相结合，当代理之间存在分歧时，强制进行页面级验证。该管道旨在实现跨多模态证据源（文本、表格、图形）的高精度提取，并生成可供临床医生检查和修正的可操作来源。基于临床医生策划的肿瘤学试验论文基准，EviSearch在提取准确性方面显著优于强解析文本基线，同时提供全面的归因覆盖。通过记录调解者的决策和审稿人的编辑，该系统生成结构化的偏好和监督信号，以促进迭代模型的改进。EviSearch旨在加速动态系统评价工作流程，减少人工策划负担，并为将基于大型语言模型的提取集成到证据合成管道中提供安全、可审核的路径。

cs.CL / 9 / 2604.14166

Hierarchical Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text

用于网络威胁情报文本中对抗技术标注的层次检索增强生成

Morbiato, Filippo, Keller, Markus, Nair, Priya, Romano, Luca

Abstract

Mapping Cyber Threat Intelligence (CTI) text to MITRE ATT&CK technique IDs is a critical task for understanding adversary behaviors and automating threat defense. While recent Retrieval-Augmented Generation (RAG) approaches have demonstrated promising capabilities in this domain, they fundamentally rely on a flat retrieval paradigm. By treating all techniques uniformly, these methods overlook the inherent taxonomy of the ATT&CK framework, where techniques are structurally organized under high-level tactics. In this paper, we propose H-TechniqueRAG, a novel hierarchical RAG framework that injects this tactic-technique taxonomy as a strong inductive bias to achieve highly efficient and accurate annotation. Our approach introduces a two-stage hierarchical retrieval mechanism: it first identifies the macro-level tactics (the adversary's technical goals) and subsequently narrows the search to techniques within those tactics, effectively reducing the candidate search space by 77.5%. To further bridge the gap between retrieval and generation, we design a tactic-aware reranking module and a hierarchy-constrained context organization strategy that mitigates LLM context overload and improves reasoning precision. Comprehensive experiments across three diverse CTI datasets demonstrate that H-TechniqueRAG not only outperforms the state-of-the-art TechniqueRAG by 3.8% in F1 score, but also achieves a 62.4% reduction in inference latency and a 60% decrease in LLM API calls. Further analysis reveals that our hierarchical structural priors equip the model with superior cross-domain generalization and provide security analysts with highly interpretable, step-by-step decision paths.
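The two-stage retrieval idea can be sketched with a toy taxonomy. This is an illustration only: scoring here is plain token overlap, whereas the paper presumably uses learned retrievers and a reranker. The tactic names and technique IDs are real ATT&CK entries, but the short descriptions are paraphrased stand-ins, and `hierarchical_retrieve` is a hypothetical name.

```python
def overlap(query, doc):
    """Toy relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


# Tiny stand-in for the tactic -> technique hierarchy (descriptions are illustrative).
TAXONOMY = {
    "Initial Access": {
        "desc": "adversary tries to gain initial access into the network",
        "techniques": {
            "T1566": "phishing email with malicious attachment or link",
            "T1190": "exploit weakness in public-facing application",
        },
    },
    "Execution": {
        "desc": "adversary tries to run malicious code",
        "techniques": {"T1059": "command and scripting interpreter abuse"},
    },
    "Credential Access": {
        "desc": "adversary tries to steal account names and passwords",
        "techniques": {
            "T1003": "os credential dumping from memory",
            "T1110": "brute force password guessing",
        },
    },
}


def hierarchical_retrieve(query, taxonomy, top_tactics=1, top_techniques=1):
    """Stage 1 ranks tactics; stage 2 ranks only techniques under the kept tactics."""
    tactics = sorted(taxonomy, key=lambda t: overlap(query, taxonomy[t]["desc"]), reverse=True)
    kept = tactics[:top_tactics]
    candidates = [
        (tech_id, desc)
        for tactic in kept
        for tech_id, desc in taxonomy[tactic]["techniques"].items()
    ]
    ranked = sorted(candidates, key=lambda c: overlap(query, c[1]), reverse=True)
    return [tech_id for tech_id, _ in ranked[:top_techniques]], len(candidates)


query = "attacker sent a phishing email with a malicious attachment to gain initial access"
techniques, searched = hierarchical_retrieve(query, TAXONOMY)
total = sum(len(t["techniques"]) for t in TAXONOMY.values())
print(techniques, f"searched {searched} of {total} techniques")
```

Restricting stage 2 to the techniques under the selected tactics is what shrinks the candidate space; in this toy taxonomy only 2 of 5 techniques are scored, which mirrors in miniature the search-space reduction the abstract reports.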

Chinese Translation

将网络威胁情报（CTI）文本映射到MITRE ATT&CK技术ID是理解对手行为和自动化威胁防御的关键任务。尽管近期的检索增强生成（RAG）方法在这一领域展示了良好的能力，但它们基本上依赖于平面检索范式。通过将所有技术视为统一，这些方法忽视了ATT&CK框架的固有分类法，其中技术在高层战术下结构性组织。在本文中，我们提出了H-TechniqueRAG，一种新颖的层次RAG框架，它将这一战术-技术分类法作为强有力的归纳偏置，以实现高效且准确的标注。我们的方法引入了一个两阶段的层次检索机制：首先识别宏观层面的战术（对手的技术目标），然后在这些战术内缩小搜索范围，有效减少候选搜索空间77.5%。为了进一步弥合检索与生成之间的差距，我们设计了一个战术感知的重排序模块和一个层次约束的上下文组织策略，以减轻大语言模型（LLM）上下文过载并提高推理精度。针对三个不同的CTI数据集进行的全面实验表明，H-TechniqueRAG不仅在F1分数上比最先进的TechniqueRAG提高了3.8%，而且在推理延迟上减少了62.4%，在LLM API调用上减少了60%。进一步分析表明，我们的层次结构先验使模型具备了更优的跨领域泛化能力，并为安全分析师提供了高度可解释的逐步决策路径。

cs.CL / 10 / 2604.14167

Chinese Essay Rhetoric Recognition Using LoRA, In-context Learning and Model Ensemble

基于LoRA、上下文学习和模型集成的中文作文修辞识别

Lai, Yuxuan, Wang, Xiajing, Zheng, Chen

Abstract

Rhetoric recognition is a critical component in automated essay scoring. By identifying rhetorical elements in student writing, AI systems can better assess linguistic and higher-order thinking skills, making it an essential task in the area of AI for education. In this paper, we leverage Large Language Models (LLMs) for the Chinese rhetoric recognition task. Specifically, we explore Low-Rank Adaptation (LoRA) based fine-tuning and in-context learning to integrate rhetoric knowledge into LLMs. We formulate the outputs as JSON to obtain structured outputs and translate keys to Chinese. To further enhance the performance, we also investigate several model ensemble methods. Our method achieves the best performance on all three tracks of the CCL 2025 Chinese essay rhetoric recognition evaluation task, winning the first prize.

Chinese Translation

修辞识别是自动化作文评分中的关键组成部分。通过识别学生写作中的修辞元素，人工智能系统能够更好地评估语言能力和高阶思维技能，因此在教育领域的人工智能研究中，这是一项至关重要的任务。本文利用大型语言模型（LLMs）进行中文修辞识别任务。具体而言，我们探索基于低秩适应（LoRA）的微调和上下文学习，将修辞知识整合到LLMs中。我们将输出格式化为JSON，以获得结构化输出，并将键翻译为中文。为了进一步提升性能，我们还研究了几种模型集成方法。我们的方法在CCL 2025中文作文修辞识别评估任务的三个赛道上均取得了最佳性能，赢得了一等奖。

cs.CL / 11 / 2604.14168

SAGE Celer 2.6 Technical Card

SAGE Celer 2.6 技术卡

SAGEA Research Team, Jha, Basab, Paudel, Firoj, Puri, Ujjwal, Liu, Adrian, Henkel, Ethan, Yuting, Zhang, Kowalczyk, Mateusz, Huang, Mei, Donghyuk, Choi, Junhao, Wang

Abstract

We introduce SAGE Celer 2.6, the latest in our line of general-purpose Celer models from SAGEA. Celer 2.6 is available in 5B, 10B, and 27B parameter sizes and benefits from extensive architectural modifications and further pre-training on an undisclosed model. Using our Inverse Reasoning (IR) pipeline, SAGEA natively trains Celer 2.6 to validate its own logic paths, minimizing cascading error and hallucination in complex reasoning tasks. Celer 2.6 also boasts natively integrated multimodal functionality with an end-to-end vision encoder to avoid common pitfalls in adapter-based approaches. Celer 2.6 provides highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN), along with low latency. Most importantly, Celer 2.6 is specifically optimized for South Asian language support, with a custom tokenizer for the Devanagari script and strong performance in both Nepali and Hindi without sacrificing English reasoning ability.

Chinese Translation

我们介绍了 SAGE Celer 2.6，这是我们 SAGEA 系列通用 Celer 模型中的最新版本。Celer 2.6 提供 5B、10B 和 27B 参数规模，并受益于广泛的架构修改以及在未公开模型上的进一步预训练。通过我们的逆向推理（Inverse Reasoning, IR）管道，SAGEA 原生训练 Celer 2.6 以验证其自身的逻辑路径，最小化复杂推理任务中的级联错误和幻觉。Celer 2.6 还具有原生集成的多模态功能，配备端到端视觉编码器，以避免基于适配器方法中的常见陷阱。Celer 2.6 在数学、编码和一般智能基准（ACUMEN）上提供了高度竞争的结果，并且延迟较低。最重要的是，Celer 2.6 专门针对南亚语言支持进行了优化，配备了用于天城文（Devanagari）脚本的自定义分词器，并在尼泊尔语和印地语中表现出色，同时不牺牲英语推理能力。

cs.CL / 12 / 2604.14169

Chronological Knowledge Retrieval: A Retrieval-Augmented Generation Approach to Construction Project Documentation

时间知识检索：一种基于检索增强生成的建设项目文档处理方法

Kostis, Ioannis-Aris, Sanchiz, Natalia, De Schryver, Steeve, Denis, François, Schaus, Pierre

Abstract

In large-scale construction projects, the continuous evolution of decisions generates extensive records, most often captured in meeting minutes. Since decisions may override previous ones, professionals often need to reconstruct the history of specific choices. Retrieving such information manually from raw archives is both labor-intensive and error-prone. From a user perspective, we address this challenge by enabling conversational access to the whole set of project meeting minutes. Professionals can pose natural-language questions and receive answers that are both semantically relevant and explicitly time-annotated, allowing them to follow the chronology of decisions. From a technical perspective, our solution employs a Retrieval-Augmented Generation (RAG) framework that integrates semantic search with large language models to ensure accurate and context-aware responses. We demonstrate the approach using an anonymized, industry-sourced dataset of meeting minutes from a completed construction project by a large company in Belgium. The dataset is annotated and enriched with expert-defined queries to support systematic evaluation. Both the dataset and the open-source implementation are made available to the community to foster further research on conversational access to time-annotated project documentation.

Chinese Translation

在大规模建设项目中，决策的持续演变会生成大量记录，这些记录通常以会议纪要的形式保存。由于决策可能会覆盖之前的决策，专业人员常常需要重建特定选择的历史。从原始档案中手动检索此类信息既费力又容易出错。从用户的角度来看，我们通过使用户能够对整个项目会议纪要进行对话式访问来应对这一挑战。专业人员可以提出自然语言问题，并获得语义相关且明确标注时间的答案，从而能够跟踪决策的时间顺序。从技术的角度来看，我们的解决方案采用了一种检索增强生成（Retrieval-Augmented Generation, RAG）框架，将语义搜索与大型语言模型相结合，以确保准确且具有上下文意识的响应。我们使用来自比利时一家大型公司的已完成建设项目的匿名行业会议纪要数据集来演示该方法。该数据集经过注释，并通过专家定义的查询进行了丰富，以支持系统评估。数据集和开源实现均已向社区开放，以促进对时间标注项目文档的对话式访问的进一步研究。

cs.CL / 13 / 2604.14170

Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning

状态感知的证据驱动检索增强生成与迭代推理

Dong, Qi, Lin, Ziheng, Ding, Ning

Abstract

Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) in external knowledge but often suffers from flat context representations and stateless retrieval, leading to unstable performance. We propose Stateful Evidence-Driven RAG with Iterative Reasoning, a framework that models question answering as a progressive evidence accumulation process. Retrieved documents are converted into structured reasoning units with explicit relevance and confidence signals and maintained in a persistent evidence pool capturing both supportive and non-supportive information. The framework performs evidence-driven deficiency analysis to identify gaps and conflicts and iteratively refines queries to guide subsequent retrieval. This iterative reasoning process enables stable evidence aggregation and improves robustness to noisy retrieval. Experiments on multiple question answering benchmarks demonstrate consistent improvements over standard RAG and multi-step baselines, while effectively accumulating high-quality evidence and maintaining stable performance under substantial retrieval noise.
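
The accumulate-then-refine loop described in this abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the `Evidence` fields, the term-coverage "deficiency analysis", and `toy_retrieve` are all hypothetical stand-ins for the LLM-driven components the paper describes.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    text: str
    relevance: float  # hypothetical relevance/confidence signal in [0, 1]
    supports: bool    # False marks non-supportive evidence, which is kept too


@dataclass
class EvidencePool:
    items: list = field(default_factory=list)

    def add(self, ev):
        # Persist every new unit so later rounds can see earlier evidence.
        if all(ev.text != e.text for e in self.items):
            self.items.append(ev)

    def gaps(self, required_terms):
        # Toy deficiency analysis: terms no supportive unit mentions yet.
        covered = " ".join(e.text.lower() for e in self.items if e.supports)
        return [t for t in required_terms if t.lower() not in covered]


# Hypothetical two-document corpus; the retriever returns one hit per call.
TOY_CORPUS = {
    "alpha": Evidence("alpha is defined in section one", 0.9, True),
    "beta": Evidence("beta depends on alpha", 0.7, True),
}


def toy_retrieve(query):
    words = query.split()
    for key, ev in TOY_CORPUS.items():
        if key in words:
            return [ev]
    return []


def accumulate(question_terms, retrieve, max_rounds=3):
    pool = EvidencePool()
    query = " ".join(question_terms)
    for _ in range(max_rounds):
        for ev in retrieve(query):
            pool.add(ev)
        missing = pool.gaps(question_terms)
        if not missing:
            break
        query = " ".join(missing)  # refine the query toward the gap
    return pool


pool = accumulate(["alpha", "beta"], toy_retrieve)
print([e.text for e in pool.items])
```

The point of the sketch is the state: the pool survives across rounds, so the second query only has to target what the first round failed to cover.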

Chinese Translation

检索增强生成（RAG）将大型语言模型（LLMs）与外部知识相结合，但常常面临平坦的上下文表示和无状态的检索问题，导致性能不稳定。我们提出了状态感知的证据驱动RAG与迭代推理框架，该框架将问答建模为一个逐步证据积累的过程。检索到的文档被转换为具有明确相关性和置信信号的结构化推理单元，并保存在一个持久的证据池中，以捕捉支持性和非支持性信息。该框架执行证据驱动的缺陷分析，以识别差距和冲突，并迭代地优化查询以指导后续检索。这个迭代推理过程使得证据聚合更加稳定，并提高了对噪声检索的鲁棒性。在多个问答基准测试中的实验结果表明，相较于标准RAG和多步骤基线，该方法在高质量证据的有效积累和在显著检索噪声下保持稳定性能方面均表现出一致的改进。

cs.CL / 14 / 2604.14171

Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali

在同等规模的大型语言模型中进行语言适应性基准测试：Llama-3.1-8B、Mistral-7B-v0.1和Qwen3-8B在罗马化尼泊尔语上的研究

Rimal, Ananda, Rimal, Adarsha

Abstract

Romanized Nepali, the Nepali language written in the Latin alphabet, is the dominant medium for informal digital communication in Nepal, yet it remains critically underresourced in the landscape of Large Language Models (LLMs). This study presents a systematic benchmarking of linguistic adaptation across three comparable-sized open-weight models: Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. We evaluate these architectures under zero-shot and fine-tuned settings using a curated bilingual dataset of 10,000 transliterated instruction-following samples. Performance is quantified across five metrics spanning seven measurement dimensions: Perplexity (PPL), BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, capturing fluency, phonetic consistency, and semantic integrity. Models were fine-tuned using Quantized Low-Rank Adaptation (QLoRA) with Rank-Stabilized LoRA (rsLoRA) at rank r=32 on dual NVIDIA Tesla T4 GPUs, training only approximately 1% of each model's parameters in under 27 total GPU-hours. At zero-shot, all three models fail to generate Romanized Nepali, each exhibiting a distinct architecture-specific failure mode. Following fine-tuning, all three resolve these failures and converge to BERTScore approximately 0.75 and chrF++ greater than 23. Overall dimension-wise assessment across ten criteria identifies Qwen3-8B as the overall recommended architecture, being the only model to produce semantically relevant zero-shot output and leading all structural alignment metrics post-SFT. The adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline, achieves the largest absolute fine-tuning gains in PPL (Delta = -49.77) and BERTScore (Delta = +0.3287), making it the preferred choice for iterative low-resource development pipelines. This work establishes the first rigorous baseline for Romanized Nepali adaptation in comparable-sized open-weight LLMs.
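
For readers unfamiliar with rsLoRA, the difference from standard LoRA is only the scaling factor applied to the low-rank update: standard LoRA scales by alpha / r, while rank-stabilized LoRA scales by alpha / sqrt(r). A minimal sketch at the abstract's rank r = 32 follows; the value alpha = 32 is an assumption for illustration, since the abstract does not state it.

```python
import math


def lora_scale(alpha, rank):
    # Standard LoRA scaling of the low-rank update B @ A.
    return alpha / rank


def rslora_scale(alpha, rank):
    # Rank-stabilized LoRA scaling: divide by sqrt(rank) instead of rank.
    return alpha / math.sqrt(rank)


rank = 32      # from the abstract
alpha = 32.0   # hypothetical value, not given in the abstract
print(lora_scale(alpha, rank), rslora_scale(alpha, rank))
```

With these numbers the rsLoRA update is scaled by sqrt(32) rather than 1, which is why the same alpha behaves differently under the two schemes at higher ranks.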

Chinese Translation

罗马化尼泊尔语是用拉丁字母书写的尼泊尔语，是尼泊尔非正式数字交流的主要媒介，但在大型语言模型（LLMs）的背景下仍然严重缺乏资源。本研究对三种同等规模的开放权重模型：Llama-3.1-8B、Mistral-7B-v0.1和Qwen3-8B的语言适应性进行了系统的基准测试。我们在零样本和微调设置下评估这些架构，使用一个经过策划的双语数据集，包含10,000个音译的指令遵循样本。通过五个指标跨越七个测量维度量化性能：困惑度（PPL）、BERTScore、chrF++、ROUGE-1、ROUGE-2、ROUGE-L和BLEU，捕捉流畅性、语音一致性和语义完整性。模型使用量化低秩适应（QLoRA）和秩稳定的LoRA（rsLoRA）在双NVIDIA Tesla T4 GPU上进行微调，仅训练每个模型参数的约1%，总GPU小时数不足27。在零样本测试中，所有三个模型未能生成罗马化尼泊尔语，各自表现出特定架构的失败模式。经过微调后，所有三个模型解决了这些失败，并收敛至大约0.75的BERTScore和大于23的chrF++。在十个标准的整体维度评估中，Qwen3-8B被确定为整体推荐架构，是唯一能够生成语义相关的零样本输出的模型，并在后续微调后在所有结构对齐指标中领先。适应性提升假设得到了验证：尽管Llama-3.1-8B的零样本基线最弱，但在PPL（Delta = -49.77）和BERTScore（Delta = +0.3287）上实现了最大的绝对微调增益，使其成为迭代低资源开发管道的首选。这项工作为同等规模的开放权重LLMs中罗马化尼泊尔语的适应性建立了首个严格的基准。

cs.CL / 15 / 2604.14172

Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations

十年内的拉锯战：通过教师引导的检索增强生成进行脆弱性分析中的冲突解决

Zhou, Ziyin, Zhang, Jianyi, Ji, Xu, Li, Yilong, Han, Jiameng, Zhao, Zhangchi

Abstract

Large Language Models (LLMs) are essential for analyzing and addressing vulnerabilities in cybersecurity. However, of the more than 200,000 vulnerabilities discovered in the past decade, more than 30,000 have been changed or updated. This necessitates frequent updates to the training datasets and internal knowledge bases of LLMs to maintain knowledge consistency. In this paper, we focus on the problem of knowledge discrepancy and conflict within CVE (Common Vulnerabilities and Exposures) detection and analysis. This problem hinders LLMs' ability to retrieve the latest knowledge from original training datasets, leading to knowledge conflicts, fabrications of factually incorrect results, and generation hallucinations. To address this problem, we propose an innovative two-stage framework called CRVA-TGRAG (Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generation). First, to improve document retrieval accuracy during the retrieval stage, we utilize Parent Document Segmentation and an ensemble retrieval scheme based on semantic similarity and inverted indexing. Second, to enhance LLMs' use of the retrieved CVE data during the generation stage, we employ a teacher-guided preference optimization technique to fine-tune LLMs. Our framework not only enhances the quality of content retrieval through RAG but also leverages the advantages of preference fine-tuning in LLMs to answer questions more effectively and precisely. Experiments demonstrate that our method achieves higher accuracy in retrieving the latest CVEs compared to external knowledge bases. In conclusion, our framework significantly mitigates potential knowledge conflicts and inconsistencies that may arise from relying solely on LLMs for knowledge retrieval.

Chinese Translation

大型语言模型（LLMs）在分析和解决网络安全中的脆弱性方面至关重要。然而，在过去十年中发现的超过20万个脆弱性中，已有超过3万个被修改或更新。这就需要对LLMs的训练数据集和内部知识库进行频繁更新，以保持知识的一致性。本文聚焦于CVE（公共漏洞和暴露）检测与分析中的知识差异和冲突问题。该问题阻碍了LLMs从原始训练数据集中检索最新知识的能力，导致知识冲突、事实错误结果的虚构以及生成幻觉。为了解决这一问题，我们提出了一种创新的两阶段框架，称为CRVA-TGRAG（通过教师引导的检索增强生成进行脆弱性分析中的冲突解决）。首先，为了提高检索阶段的文档检索准确性，我们利用父文档分段和基于语义相似性与倒排索引的集成检索方案。其次，为了增强LLMs在生成阶段基于CVE数据集检索的能力，我们采用了一种教师引导的偏好优化技术对LLMs进行微调。我们的框架不仅通过RAG提高了内容检索的质量，还利用LLMs中偏好微调的优势，更有效和准确地回答问题。实验表明，我们的方法在检索最新CVE方面的准确性高于外部知识库。总之，我们的框架显著减轻了依赖LLMs进行知识检索时可能出现的知识冲突和不一致性。

cs.CL / 16 / 2604.14174

Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters

使用后变换器适配器纠正语言模型中的抑制对数概率

Sanchez, Bryan

Abstract

Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11-39% of 16 held-out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log-probability rankings. When applied at all token positions during generation, the adapter produces incoherent output; however, when applied only at the current prediction position (last-position-only), the adapter produces coherent, less censored text. A logit-space adapter operating after token projection fails to produce coherent generation at any application mode, suggesting hidden-state intervention is the correct level for generation correction. A previously undocumented silent gradient bug in Apple MLX explains all null results in earlier iterations of this work: the standard pattern nn.value_and_grad(model, fn)(model.parameters()) returns zero gradients without error; the correct pattern nn.value_and_grad(model, fn)(model, data) resolves this. We provide a minimal reproduction and discuss implications for other adapter research using MLX.

Chinese Translation

经过对齐调优的语言模型在政治敏感话题上经常抑制事实对数概率，尽管它们在隐藏表示中保留了相关知识。我们展示了一个786K参数（约占基础模型的0.02%）的后变换器适配器，该适配器在冻结的隐藏状态上训练，能够纠正Qwen3-4B、8B和14B模型在31个意识形态区分事实上的抑制。该适配器记忆了所有15个训练事实，并在每个规模的5个随机分割中对16个保留事实的11%至39%进行了泛化，且通过锚定训练实现了零知识回归。无论是门控（SwiGLU）还是非门控（线性瓶颈）适配器均取得了可比的结果；两者在各个规模上均未表现出一致的优劣（Fisher精确p > 0.09）。在指令模型上，适配器纠正了对数概率排名。当在生成过程中应用于所有标记位置时，适配器会产生不连贯的输出；然而，当仅在当前预测位置（仅最后位置）应用时，适配器能够生成连贯且较少被审查的文本。一个在标记投影后操作的对数空间适配器在任何应用模式下都未能产生连贯的生成，表明隐藏状态干预是生成纠正的正确层级。Apple MLX中一个之前未记录的静默梯度错误解释了该工作早期迭代中的所有零结果：标准模式nn.value_and_grad(model, fn)(model.parameters())在没有错误的情况下返回零梯度；正确模式nn.value_and_grad(model, fn)(model, data)解决了这一问题。我们提供了一个最小的重现案例，并讨论了对使用MLX的其他适配器研究的影响。

cs.CL / 17 / 2604.14175

QU-NLP at ArchEHR-QA 2026: Two-Stage QLoRA Fine-Tuning of Qwen3-4B for Patient-Oriented Clinical Question Answering and Evidence Sentence Alignment

QU-NLP在ArchEHR-QA 2026：针对以患者为中心的临床问题回答和证据句对齐的Qwen3-4B的两阶段QLoRA微调

AL-Smadi, Mohammad

Abstract

We present a unified system addressing both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment) of the ArchEHR-QA Shared Task. For Subtask 3, we apply two-stage Quantised Low-Rank Adaptation (QLoRA) to Qwen3-4B loaded in 4-bit NF4 quantisation: first on 30,000 samples from the emrQA-MedSQuAD corpus to establish clinical domain competence, then on the 20 annotated development cases to learn the task-specific output style. Our system achieves an overall score of 32.87 on the official test-2026 split (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04). For Subtask 4, we develop a weighted ensemble of three retrieval methods - BM25 with relative thresholding, TF-IDF cosine similarity, and a fine-tuned cross-encoder - to identify note sentences supporting a given gold answer, achieving a micro-F1 of 67.16 on the 100-case test set. Experiments reveal that both subtasks expose the same fundamental challenge: 20 annotated training cases are insufficient to distinguish relevant from irrelevant clinical sentences, pointing to data augmentation as the highest-leverage future direction.
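
A weighted ensemble with relative thresholding, as used for Subtask 4, can be sketched as below. This is an illustration under assumptions, not the submitted system: the weights, the min-max normalization, and the choice to apply the relative threshold to the combined score (the abstract applies it to BM25) are all hypothetical, as are the example scores.

```python
def normalize(scores):
    # Min-max normalize so the three scorers share a comparable scale.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_select(score_lists, weights, rel_threshold=0.6):
    """Return indices of sentences whose weighted score is within
    rel_threshold of the best-scoring sentence."""
    normed = [normalize(s) for s in score_lists]
    n = len(normed[0])
    combined = [sum(w * s[i] for w, s in zip(weights, normed)) for i in range(n)]
    cutoff = rel_threshold * max(combined)
    return [i for i, c in enumerate(combined) if c > 0 and c >= cutoff]


# Made-up scores for four note sentences from three scorers.
bm25 = [2.0, 8.0, 0.5, 7.0]
tfidf = [0.1, 0.6, 0.05, 0.5]
cross_encoder = [0.2, 0.9, 0.1, 0.3]
weights = [0.3, 0.2, 0.5]  # hypothetical ensemble weights

print(ensemble_select([bm25, tfidf, cross_encoder], weights))
print(ensemble_select([bm25, tfidf, cross_encoder], weights, rel_threshold=0.5))
```

A relative cutoff adapts to each case: a note with one dominant sentence yields a short evidence list, while a note with several near-ties yields a longer one.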

Chinese Translation

我们提出了一个统一系统，旨在解决ArchEHR-QA共享任务的子任务3（答案生成）和子任务4（证据句对齐）。对于子任务3，我们对在4位NF4量化下加载的Qwen3-4B应用了两阶段的量化低秩适应（QLoRA）：首先在来自emrQA-MedSQuAD语料库的30,000个样本上进行微调，以建立临床领域的能力，然后在20个标注的开发案例上学习特定任务的输出风格。我们的系统在官方测试集2026的整体得分为32.87（BLEU = 9.42，ROUGE-L = 27.04，SARI = 55.42，BERTScore = 43.00，AlignScore = 25.28，MEDCON = 37.04）。对于子任务4，我们开发了一种加权集成的三种检索方法——带相对阈值的BM25、TF-IDF余弦相似度和微调的交叉编码器——以识别支持给定金答案的笔记句子，在100个案例的测试集上实现了67.16的微F1得分。实验表明，这两个子任务都暴露了相同的基本挑战：20个标注的训练案例不足以区分相关和不相关的临床句子，这表明数据增强是未来最具潜力的方向。

cs.CL / 18 / 2604.14177

Listen, Correct, and Feed Back: Spoken Pedagogical Feedback Generation

倾听、纠正与反馈：口语教学反馈生成

Liang, Junhong, Lu, Yifan, Kochmar, Ekaterina, Koto, Fajri

Abstract

Grammatical error correction (GEC) and explanation (GEE) have made rapid progress, but real teaching scenarios also require learner-friendly pedagogical feedback that is actionable, level-appropriate, and encouraging. We introduce SPFG (Spoken Pedagogical Feedback Generation), a dataset built on the Speak & Improve Challenge 2025 corpus, pairing fluency-oriented transcriptions with GEC targets and human-verified teacher-style feedback, including preferred/rejected feedback pairs for preference learning. We study a transcript-based Spoken Grammatical Error Correction (SGEC) setting and evaluate three instruction-tuned LLMs (Qwen2.5, Llama-3.1, and GLM-4), comparing supervised fine-tuning (SFT) with preference-based alignment (using DPO and KTO) for jointly generating corrections and feedback. Results show that SFT provides the most consistent improvements, while DPO/KTO yield smaller or mixed gains, and that correction quality and feedback quality are weakly coupled. Our implementation is available at https://github.com/Skywalker-Harrison/spfg.

Chinese Translation

语法错误纠正（GEC）和解释（GEE）已经取得了快速进展，但真实的教学场景也需要以学习者为中心的教学反馈，该反馈应具有可操作性、适合学习者水平，并且具有鼓励性。我们介绍了SPFG（Spoken Pedagogical Feedback Generation），这是一个基于Speak & Improve Challenge 2025语料库构建的数据集，将以流利性为导向的转录与GEC目标以及人类验证的教师风格反馈配对，包括用于偏好学习的首选/拒绝反馈对。我们研究了一种基于转录的口语语法错误纠正（SGEC）设置，并评估了三种经过指令调优的LLM（Qwen2.5、Llama-3.1和GLM-4），比较了监督微调（SFT）与基于偏好的对齐（使用DPO和KTO）在共同生成纠正和反馈方面的表现。结果表明，SFT提供了最一致的改进，而DPO/KTO则产生了较小或混合的收益，并且纠正质量与反馈质量之间的关联较弱。我们的实现可在https://github.com/Skywalker-Harrison/spfg获取。

cs.CL / 19 / 2604.14179

An Underexplored Frontier: Large Language Models for Rare Disease Patient Education and Communication -- A scoping review

一个未被充分探索的前沿：大型语言模型在罕见疾病患者教育与沟通中的应用——一项范围审查

Zhan, Zaifu, Hou, Yu, Yu, Kai, Zeng, Min, Burgun, Anita, Chen, Xiaoyi, Zhang, Rui

Abstract

Rare diseases affect over 300 million people worldwide and are characterized by complex care pathways, limited clinical expertise, and substantial unmet communication needs throughout the long patient journey. Recent advances in large language models (LLMs) offer new opportunities to support patient education and communication, yet their application in rare diseases remains unclear. We conducted a scoping review of studies published between January 2022 and March 2026 across major databases, identifying 12 studies on LLM-based rare disease patient education and communication. Data were extracted on study characteristics, application scenarios, model usage, and evaluation methods, and synthesized using descriptive and qualitative analyses. The literature is highly recent and dominated by general-purpose models, particularly ChatGPT. Most studies focus on patient question answering using curated question sets, with limited use of real-world data or longitudinal communication scenarios. Evaluations are primarily centered on accuracy, with limited attention to patient-centered dimensions such as readability, empathy, and communication quality. Multilingual communication is rarely addressed. Overall, the field remains at an early stage. Future research should prioritize patient-centered design, domain-adapted methods, and real-world deployment to support safe, adaptive, and effective communication in rare diseases.

Chinese Translation

罕见疾病影响全球超过3亿人，其特点是复杂的护理路径、有限的临床专业知识以及在漫长的患者旅程中存在显著的沟通需求未得到满足。大型语言模型（LLMs）的最新进展为支持患者教育和沟通提供了新的机会，但其在罕见疾病中的应用仍不明确。我们对2022年1月至2026年3月间在主要数据库中发表的研究进行了范围审查，识别出12项关于基于LLM的罕见疾病患者教育和沟通的研究。提取了研究特征、应用场景、模型使用和评估方法的数据，并通过描述性和定性分析进行了综合。文献非常新近，主要以通用模型为主，特别是ChatGPT。大多数研究集中在使用策划问题集进行患者问答，实际数据或纵向沟通场景的使用有限。评估主要集中在准确性上，对以患者为中心的维度（如可读性、同理心和沟通质量）关注较少。多语言沟通很少被提及。总体而言，该领域仍处于早期阶段。未来的研究应优先考虑以患者为中心的设计、领域适应的方法以及实际部署，以支持在罕见疾病中安全、适应性强和有效的沟通。

cs.CL / 20 / 2604.14180

Internal Knowledge Without External Expression: Probing the Generalization Boundary of a Classical Chinese Language Model

内部知识无外部表达：探讨经典中文语言模型的泛化边界

Chen, Jiuting, Lian, Yuan, Wu, Hao, Huang, Tianqi, Sasaki, Hiroshi, Kouno, Makoto, Choi, Jongil

Abstract

We train a 318M-parameter Transformer language model from scratch on a curated corpus of 1.56 billion tokens of pure Classical Chinese, with zero English characters or Arabic numerals. Through systematic out-of-distribution (OOD) testing, we investigate whether the model can distinguish known from unknown inputs, and crucially, whether it can express this distinction in its generated text. We find a clear dissociation between internal and external uncertainty. Internally, the model exhibits a perplexity jump ratio of 2.39x between real and fabricated historical events (p = 8.9e-11, n = 92 per group), with semi-fabricated events (real figures + fictional events) showing the highest perplexity (4.24x, p = 1.1e-16), demonstrating genuine factual encoding beyond syntactic pattern matching. Externally, however, the model never learns to express uncertainty: classical Chinese epistemic markers appear at lower rates for OOD questions (3.5%) than for in-distribution questions (8.3%, p = 0.023), reflecting rhetorical conventions rather than genuine metacognition. We replicate both findings across three languages (Classical Chinese, English, Japanese), three writing systems, and eight models from 110M to 1.56B parameters. We further show that uncertainty expression frequency is determined entirely by training data conventions, with Classical Chinese models showing a "humility paradox" (more hedging for known topics), while Japanese models almost never hedge. We argue that metacognitive expression, the ability to say "I don't know", does not emerge from language modeling alone and requires explicit training signals such as RLHF.
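
The "perplexity jump ratio" can be illustrated with a short sketch. The abstract does not give the exact definition, so the version below assumes it is the mean perplexity on fabricated passages divided by the mean perplexity on real ones; the token log-probabilities are made up.

```python
import math


def perplexity(token_logprobs):
    # PPL = exp(-mean natural-log probability) over one passage's tokens.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def jump_ratio(real_passages, fabricated_passages):
    # Assumed definition: ratio of mean perplexities between the two groups.
    def mean_ppl(passages):
        return sum(perplexity(p) for p in passages) / len(passages)

    return mean_ppl(fabricated_passages) / mean_ppl(real_passages)


# Made-up log-probs: real text gets p = 0.5 per token, fabricated p = 0.25.
real = [[math.log(0.5)] * 4]
fabricated = [[math.log(0.25)] * 4]
print(jump_ratio(real, fabricated))
```

A ratio above 1 means the model is measurably more "surprised" by fabricated events, which is the internal signal the paper contrasts with the absence of hedging in generated text.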

Chinese Translation

我们从头开始训练了一个318M参数的Transformer语言模型，使用了一个包含15.6亿个纯经典中文标记的精心策划的语料库，完全不包含英语字符或阿拉伯数字。通过系统的分布外（OOD）测试，我们调查了模型是否能够区分已知输入和未知输入，关键是它是否能够在生成的文本中表达这种区分。我们发现内部和外部不确定性之间存在明显的解离。在内部，模型在真实和虚构历史事件之间表现出2.39倍的困惑度跳跃比（p = 8.9e-11，n = 92每组），而半虚构事件（真实人物 + 虚构事件）显示出最高的困惑度（4.24倍，p = 1.1e-16），这表明超越句法模式匹配的真实事实编码。然而，在外部，模型从未学会表达不确定性：经典中文的认知标记在OOD问题中的出现率（3.5%）低于在分布内问题中的出现率（8.3%，p = 0.023），反映了修辞惯例而非真正的元认知。我们在三种语言（经典中文、英语、日语）、三种书写系统和八个模型（从110M到1.56B参数）中复制了这两项发现。我们进一步表明，不确定性表达的频率完全由训练数据的惯例决定，经典中文模型显示出一种“谦逊悖论”（对已知主题更多的模糊表达），而日语模型几乎从不模糊表达。我们认为，元认知表达——即说“我不知道”的能力——并不是仅通过语言建模而产生的，而是需要诸如RLHF等明确的训练信号。

cs.CL / 21 / 2604.14191

Attention to Mamba: A Recipe for Cross-Architecture Distillation

关注Mamba：跨架构蒸馏的方案

Moudgil, Abhinav, Huang, Ningyuan, Dhekane, Eeshan Gunesh, Rodríguez, Pau, Zappella, Luca, Danieli, Federico

Abstract

State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naive distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block. Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11 close to the teacher's 13.86. To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens varying sequence mixer architecture, scaling analysis on model sizes and total distillation tokens, and a sensitivity analysis on tokens allocation between stages.
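
For context, the generic kernel-trick linearization of causal softmax attention is shown below. This is the textbook form, given here as background; the paper's own "adaptation" and its feature map are not specified in the abstract, so $\phi$ is a placeholder.

```latex
\[
\mathrm{Attn}(Q,K,V)_i
  = \frac{\sum_{j \le i} \exp\!\left(q_i^\top k_j / \sqrt{d}\right) v_j}
         {\sum_{j \le i} \exp\!\left(q_i^\top k_j / \sqrt{d}\right)}
  \;\approx\;
    \frac{\sum_{j \le i} \phi(q_i)^\top \phi(k_j)\, v_j}
         {\sum_{j \le i} \phi(q_i)^\top \phi(k_j)}
  = \frac{S_i^\top \phi(q_i)}{z_i^\top \phi(q_i)},
\]
\[
S_i = S_{i-1} + \phi(k_i)\, v_i^\top, \qquad
z_i = z_{i-1} + \phi(k_i), \qquad
S_0 = 0,\; z_0 = 0 .
\]
```

Because $S_i$ and $z_i$ are updated recurrently with a fixed-size state, the linearized layer already has the shape of a state-space recurrence, which is what makes it a natural intermediate step toward a Mamba-style model.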

Chinese Translation

状态空间模型（State Space Models, SSMs）如Mamba已成为Transformer模型的热门替代方案，因为与基于注意力的模型相比，它们在生成时具有更低的内存消耗和更高的吞吐量。另一方面，学术界在如何训练Transformer方面积累了大量的知识，并且许多预训练的Transformer模型已广泛可用。为了促进SSMs的采用，同时利用现有的预训练Transformer，我们旨在确定一种有效的方案，将基于注意力的模型蒸馏为类似Mamba的架构。然而，在跨架构蒸馏的先前研究中，已经表明，从Transformer到Mamba的简单蒸馏过程无法保留原始教师模型的性能，这一局限性通常通过结合注意力和SSM模块的混合解决方案来克服。我们工作的关键论点是，通过为Mamba提供一个有原则的初始化，我们可以恢复一个整体更好的跨架构蒸馏方案。为此，我们提出了一种有原则的两阶段方法：首先，我们将知识从传统的Transformer蒸馏到线性化的注意力版本，使用核技巧的适应。然后，我们将线性化版本蒸馏为一个不使用任何注意力模块的适应性Mamba模型。总体而言，蒸馏后的Mamba模型能够在下游任务中保留原始Pythia-1B Transformer的性能，困惑度为14.11，接近教师模型的13.86。为了验证我们方案的有效性，我们在1B规模上进行了全面的消融实验，使用10B个标记变化序列混合器架构，进行了模型规模和总蒸馏标记的缩放分析，以及在阶段之间标记分配的敏感性分析。

cs.CL / 22 / 2604.14197

The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure

大型语言模型提示的PICCO框架：提示结构的分类法和参考架构

Cook, David A.

Abstract

Large language model (LLM) performance depends heavily on prompt design, yet prompt construction is often described and applied inconsistently. Our purpose was to derive a reference framework for structuring LLM prompts. This paper presents PICCO, a framework derived through a rigorous synthesis of 11 previously published prompting frameworks identified through a multi-database search. The analysis yields two main contributions. First, it proposes a taxonomy that distinguishes prompt frameworks, prompt elements, prompt generation, prompting techniques, and prompt engineering as related but non-equivalent concepts. Second, it derives a five-element reference architecture for prompt generation: Persona, Instructions, Context, Constraints, and Output (PICCO). For each element, we define its function, scope, and relationship to other elements, with the goal of improving conceptual clarity and supporting more systematic prompt design. Finally, to support application of the framework, we outline key concepts relevant to implementation, including prompting techniques (e.g., zero-shot, few-shot, chain-of-thought, ensembling, decomposition, and self-critique, with selected variants), human and automated approaches to iterative prompt engineering, responsible prompting considerations such as security, privacy, bias, and trust, and priorities for future research. This work is a conceptual and methodological contribution: it formalizes a common structure for prompt specification and comparison, but does not claim empirical validation of PICCO as an optimization method.
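
The five-element structure can be made concrete with a small template. This is only a sketch: the abstract names the elements (Persona, Instructions, Context, Constraints, Output) but not a rendering format, so the `PiccoPrompt` class, the section layout, and the example text are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PiccoPrompt:
    persona: str
    instructions: str
    context: str
    constraints: str
    output: str

    def render(self):
        # Emit the elements in PICCO order, skipping any left empty.
        sections = [
            ("Persona", self.persona),
            ("Instructions", self.instructions),
            ("Context", self.context),
            ("Constraints", self.constraints),
            ("Output", self.output),
        ]
        return "\n\n".join(f"{name}:\n{body}" for name, body in sections if body)


prompt = PiccoPrompt(
    persona="You are a careful technical editor.",
    instructions="Summarize the abstract in two sentences.",
    context="The abstract is pasted below the prompt.",
    constraints="Do not add claims that are not in the abstract.",
    output="Plain text, no bullet points.",
)
print(prompt.render())
```

Keeping the elements as separate fields makes it easy to compare two prompts element by element, which is the kind of systematic comparison the framework is meant to support.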

Chinese Translation

大型语言模型（LLM）的性能在很大程度上依赖于提示设计，但提示构建往往被描述和应用得不一致。我们的目的是推导出一个用于构建LLM提示的参考框架。本文提出了PICCO框架，该框架通过对11个先前发布的提示框架进行严格综合而得出，这些框架是通过多数据库搜索识别的。分析结果产生了两个主要贡献。首先，提出了一种分类法，区分提示框架、提示元素、提示生成、提示技术和提示工程等相关但不等同的概念。其次，推导出一个包含五个元素的提示生成参考架构：Persona（角色）、Instructions（指令）、Context（上下文）、Constraints（约束）和Output（输出）（PICCO）。对于每个元素，我们定义其功能、范围及与其他元素的关系，旨在提高概念清晰度并支持更系统的提示设计。最后，为了支持框架的应用，我们概述了与实施相关的关键概念，包括提示技术（例如，零样本、少样本、思维链、集成、分解和自我批评及其选定变体）、人类和自动化的迭代提示工程方法、负责任的提示考虑因素（如安全性、隐私、偏见和信任）以及未来研究的优先事项。这项工作是一个概念性和方法论的贡献：它为提示规范和比较形式化了一个共同结构，但并不声称对PICCO作为优化方法的实证验证。

cs.CL / 23 / 2604.14210

Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate

中文在 Vibe 编码中并不比英语更高效：关于令牌成本和问题解决率的初步研究

Ren, Simiao, Shen, Xingyu, Zhou, Yuchen, Dennis, Ng, Raj, Ankit

Abstract

A claim has been circulating on social media and practitioner forums that Chinese prompts are more token-efficient than English for LLM coding tasks, potentially reducing costs by up to 40%. This claim has influenced developers to consider switching to Chinese for "vibe coding" to save on API costs. In this paper, we conduct a rigorous empirical study using SWE-bench Lite, a benchmark of software engineering tasks, to evaluate whether this claim of Chinese token efficiency holds up to scrutiny. Our results reveal three key findings: First, the efficiency advantage of Chinese is not observed. Second, token cost varies by model architecture in ways that defy simple assumptions: while MiniMax-2.7 shows 1.28x higher token costs for Chinese, GLM-5 actually consumes fewer tokens with Chinese prompts. Third, and most importantly, we found that the success rate when prompting in Chinese is generally lower than in English across all models we tested. We also measure cost efficiency as expected cost per successful task, jointly accounting for token consumption and task resolution rate. These findings should be interpreted as preliminary evidence rather than a definitive conclusion, given the limited number of models evaluated and the narrow set of benchmarks tested due to resource constraints; they indicate that language effects on token cost are model-dependent, and that practitioners should not expect cost savings or performance gains just by switching their prompt language to Chinese.
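
The "expected cost per successful task" metric can be illustrated with simple arithmetic. The sketch assumes the natural definition (cost per attempt divided by resolution rate); the token counts, price, and success rates below are invented for illustration and are not results from the paper.

```python
def expected_cost_per_success(tokens_per_task, price_per_token, success_rate):
    # Each attempt costs tokens * price; on average 1 / success_rate
    # attempts are needed per solved task.
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return tokens_per_task * price_per_token / success_rate


PRICE = 1e-6  # hypothetical price per token

# Invented numbers: the second prompt language uses 10% fewer tokens
# but resolves fewer tasks.
english = expected_cost_per_success(100_000, PRICE, success_rate=0.30)
chinese = expected_cost_per_success(90_000, PRICE, success_rate=0.25)
print(round(english, 4), round(chinese, 4))
```

In this made-up case the token saving is more than cancelled by the lower resolution rate, which is why the metric has to account for both quantities jointly.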

Chinese Translation

社交媒体和从业者论坛上流传着一种说法，认为中文提示在 LLM 编码任务中比英语更具令牌效率，可能将成本降低多达 40%。这一说法影响了开发者考虑切换到中文进行“vibe 编码”以节省 API 成本。本文通过使用 SWE-bench Lite（一个软件工程任务基准）进行严格的实证研究，以评估这一中文令牌效率的说法是否经得起检验。我们的结果揭示了三个关键发现：首先，未观察到中文的效率优势。其次，令牌成本因模型架构而异，超出了简单假设的范围：虽然 MiniMax-2.7 对中文的令牌成本高出 1.28 倍，但 GLM-5 实际上在中文提示下消耗的令牌更少。第三，也是最重要的，我们发现，在我们测试的所有模型中，使用中文提示的成功率普遍低于使用英语的成功率。我们还测量了成本效率，即每个成功任务的预期成本——共同考虑了令牌消耗和任务解决率。由于评估的模型数量有限以及由于资源限制而测试的基准集较窄，这些发现应被解读为初步证据，而非最终结论；它们表明，语言对令牌成本的影响是依赖于模型的，实践者不应仅通过切换提示语言为中文来期待成本节省或性能提升。

cs.CL / 24 / 2604.14214

CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization

CROP：通过正则化提示优化实现大型语言模型中的高效推理

Shah, Deep, Badhe, Sanket, Kathrotia, Nehal, Tiwari, Priyanka

Abstract

Large Language Models utilizing reasoning techniques improve task performance but incur significant latency and token costs due to verbose generation. Existing automatic prompt optimization (APO) frameworks target task accuracy exclusively, at the expense of generating long reasoning traces. We propose Cost-Regularized Optimization of Prompts (CROP), an APO method that introduces regularization on response length by generating textual feedback in addition to standard accuracy feedback. This forces the optimization process to produce prompts that elicit concise responses containing only critical information and reasoning. We evaluate our approach on complex reasoning datasets, specifically GSM8K, LogiQA and BIG-Bench Hard. We achieved an 80.6% reduction in token consumption while maintaining competitive accuracy, seeing only a nominal decline in performance. This presents a pragmatic solution for deploying token-efficient and cost-effective agentic AI systems in production pipelines.

Chinese Translation

利用推理技术的大型语言模型提高了任务性能，但由于冗长的生成过程，导致显著的延迟和令牌成本。现有的自动提示优化（APO）框架只关注任务准确性，代价是生成冗长的推理轨迹。我们提出了成本正则化提示优化（CROP），这是一种APO方法，通过生成文本反馈（除了标准的准确性反馈）来对响应长度进行正则化。这迫使优化过程生成能够引出仅包含关键信息和推理的简洁响应的提示。我们在复杂推理数据集上评估了我们的方法，特别是GSM8K、LogiQA和BIG-Bench Hard。我们实现了80.6%的令牌消耗减少，同时保持了竞争力的准确性，性能仅有轻微下降。这为在生产管道中部署高效且具有成本效益的代理AI系统提供了切实可行的解决方案。

cs.CL / 25 / 2604.14218

MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes

MEME-Fusion@CHiPSAL 2026：尼泊尔表情包中的仇恨检测与情感分析的多模态消融研究

Wagle, Samir, Khanal, Reewaj, Adhikari, Abiral

Abstract

Hate speech detection in Devanagari-scripted social media memes presents compounded challenges: multimodal content structure, script-specific linguistic complexity, and extreme data scarcity in low-resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three-class sentiment classification: positive, neutral, negative). We propose a hybrid cross-modal attention fusion architecture that combines CLIP (ViT-B/32) for visual encoding with BGE-M3 for multilingual text representation, connected through 4-head self-attention and a learnable gating network that dynamically weights modality contributions on a per-sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross-modal reasoning achieves a 5.9% F1-macro improvement over text-only baselines on Subtask A, while uncovering two unexpected but critical findings: English-centric vision models exhibit near-random performance on Devanagari script, and standard ensemble methods catastrophically degrade under data scarcity (N ≈ 850 per fold) due to correlated overfitting. The code can be accessed at https://github.com/Tri-Yantra-Technologies/MEME-Fusion/
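
The per-sample gating idea can be sketched in a few lines. This is a simplified illustration, not the released architecture: it assumes a single scalar gate computed from the concatenated visual and text features, omits the 4-head attention, and uses hypothetical weights and two-dimensional toy embeddings.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, text, gate_weights, gate_bias=0.0):
    """Blend two same-sized embeddings with a gate computed per sample.

    g close to 1 trusts the visual features, g close to 0 the text features.
    """
    concat = list(visual) + list(text)
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [g * v + (1.0 - g) * t for v, t in zip(visual, text)]
    return g, fused


visual = [1.0, 0.0]  # toy image embedding
text = [0.0, 1.0]    # toy text embedding

# Untrained gate (all-zero weights) weights both modalities equally.
print(gated_fusion(visual, text, gate_weights=[0.0, 0.0, 0.0, 0.0]))
# A gate that responds strongly to the text features shifts toward text.
print(gated_fusion(visual, text, gate_weights=[0.0, 0.0, 0.0, -4.0]))
```

In a trained model the gate weights are learned, so a meme whose image carries little signal (for example, Devanagari text that the vision encoder reads poorly) can be routed mostly through the text branch.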

Chinese Translation

在德瓦那加里文社交媒体表情包中进行仇恨言论检测面临着多重挑战：多模态内容结构、特定脚本的语言复杂性以及低资源环境下极端的数据稀缺性。本文介绍了我们在CHiPSAL 2026共享任务中的系统，涵盖了子任务A（二元仇恨言论检测）和子任务B（三类情感分类：积极、中立、消极）。我们提出了一种混合跨模态注意力融合架构，该架构结合了用于视觉编码的CLIP (ViT-B/32)与用于多语言文本表示的BGE-M3，通过4头自注意力和一个可学习的门控网络连接，动态地根据每个样本的情况加权模态贡献。对八种模型配置的系统评估表明，显式的跨模态推理在子任务A上实现了比仅使用文本的基线高出5.9%的F1-macro提升，同时揭示了两个意外但关键的发现：以英语为中心的视觉模型在德瓦那加里文上的表现接近随机，而标准集成方法在数据稀缺（每折近850个样本）情况下由于相关过拟合而灾难性降级。代码可在https://github.com/Tri-Yantra-Technologies/MEME-Fusion/获取。

cs.CL / 26 / 2604.14261

ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents

ReviewGrounder：通过评分标准引导的工具集成代理提高评审实质性

Li, Zhuofeng, Lu, Yi, Jiang, Dongfu, Zhang, Haoxiang, Bai, Yuyang, Li, Chuan, Wang, Yu, Ji, Shuiwang, Xie, Jianwen, Zhang, Yu

Abstract

The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments lacking substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper-specific rubrics derived from official guidelines, the paper's content, and human-written reviews. We further propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across 8 dimensions. The code is available at https://github.com/EigenTom/ReviewGrounder.

Chinese Translation

人工智能会议投稿的快速增长推动了对大型语言模型（LLMs）在同行评审支持方面的深入探索。然而，基于LLM的评审者往往生成表面化、公式化的评论，缺乏实质性和基于证据的反馈。我们将此归因于人类评审中的两个关键组成部分的未充分利用：明确的评分标准和对现有工作的上下文理解。为了解决这个问题，我们引入了REVIEWBENCH，一个根据来自官方指南、论文内容和人类撰写的评审的特定于论文的评分标准评估评审文本的基准。我们进一步提出了REVIEWGROUNDER，一个评分标准引导的、工具集成的多代理框架，它将评审过程分解为草拟和基础阶段，通过有针对性的证据整合丰富浅薄的草稿。在REVIEWBENCH上的实验表明，REVIEWGROUNDER使用基于Phi-4-14B的草拟器和基于GPT-OSS-120B的基础阶段，在与人类判断的一致性和基于评分标准的评审质量的8个维度上，始终优于具有显著更强/更大骨干（例如，GPT-4.1和DeepSeek-R1-670B）的基线。代码可在https://github.com/EigenTom/ReviewGrounder获取。

cs.CL / 27 / 2604.14306

EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation

EuropeMedQA研究协议：用于语言模型评估的多语言、多模态医学考试数据集

Causio, Francesco Andrea, De Vita, Vittorio, Riccomi, Olivia, Ferramola, Michele, Felizzi, Federico, Cristiano, Antonio, De Mori, Lorenzo, Battipaglia, Chiara, Sawaya, Melissa, De Angelis, Luigi, Di Pumpo, Marcello, Piscitelli, Alessandra, Risuleo, Pietro Eric, Longo, Alessia, Vojvodic, Giulia, Vassalli, Mariapia, Castaniti, Bianca Destro, Scarsi, Nicolò, Del Medico, Manuel

Abstract

While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describes the development of EuropeMedQA, the first comprehensive, multilingual, and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. Following FAIR data principles and SPIRIT-AI guidelines, we describe a rigorous curation process and an automated translation pipeline for comparative analysis. We evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting strategy to assess cross-lingual transfer and visual reasoning. EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practices and fosters the development of more generalizable medical AI.

Chinese Translation

尽管大型语言模型（LLMs）在以英语为中心的医学考试中表现出高水平的能力，但在面对非英语语言和多模态诊断任务时，其性能往往会下降。本研究协议描述了EuropeMedQA的开发，这是第一个来自意大利、法国、西班牙和葡萄牙官方监管考试的全面、多语言和多模态医学考试数据集。遵循FAIR数据原则和SPIRIT-AI指南，我们描述了一种严格的策展过程和自动翻译管道，以进行比较分析。我们使用零样本、严格限制的提示策略评估当代多模态LLMs，以评估跨语言迁移和视觉推理能力。EuropeMedQA旨在提供一个抗污染的基准，反映欧洲临床实践的复杂性，并促进更具普适性的医学人工智能的发展。

cs.CL / 28 / 2604.14315

Tracking the Temporal Dynamics of News Coverage of Catastrophic and Violent Events

跟踪灾难性和暴力事件新闻报道的时间动态

Lugos, Emily, Gruppi, Maurício

Abstract

The modern news cycle has been fundamentally reshaped by the rapid exchange of information online. As a result, media framing shifts dynamically as new information, political responses, and social reactions emerge. Understanding how these narratives form, propagate, and evolve is essential for interpreting public discourse during moments of crisis. In this study, we examine the temporal and semantic dynamics of reporting for violent and catastrophic events using a large-scale corpus of 126,602 news articles collected from online publishers. We quantify narrative change through publication volume, semantic drift, semantic dispersion, and term relevance. Our results show that sudden events of impact exhibit structured and predictable news-cycle patterns characterized by rapid surges in coverage, early semantic drift, and gradual declines toward the baseline. In addition, our results indicate the terms that are driving the temporal patterns.

Chinese Translation

现代新闻周期已被在线信息的快速交换根本重塑。因此，媒体框架随着新信息、政治反应和社会反应的出现而动态变化。理解这些叙事如何形成、传播和演变，对于在危机时刻解读公共话语至关重要。在本研究中，我们使用从在线出版商收集的大规模语料库（126,602篇新闻文章）来考察暴力和灾难事件报道的时间和语义动态。我们通过出版量、语义漂移、语义分散和术语相关性来量化叙事变化。我们的结果表明，突发性影响事件表现出结构化和可预测的新闻周期模式，其特征是报道的快速激增、早期的语义漂移以及逐渐回落至基线。此外，我们的结果还指出了驱动这些时间模式的术语。

cs.CL / 29 / 2604.14321

LLM Predictive Scoring and Validation: Inferring Experience Ratings from Unstructured Text

大型语言模型预测评分与验证：从非结构化文本推断体验评分

Potteiger, Jason, Hong, Andrew, Zapata, Ito

Abstract

We tasked GPT-4.1 with reading what baseball fans wrote about their game-day experience and predicting the overall experience rating each fan gave on a 0-10 survey scale. The model received only the text of a single open-ended response. These AI predictions were compared with the actual experience ratings captured by the survey instrument across approximately 10,000 fan responses from five Major League Baseball teams. In total, two-thirds of predicted ratings fell within one point of self-reported fan ratings (67% within +/-1, 36% exact match), and the predicted measurement was near-deterministic across three independent scoring runs (87% exact agreement, 99.9% within +/-1). Predicted ratings aligned most strongly with the overall experience rating (r = 0.82) rather than with any specific aspect of the game-day experience, such as parking, concessions, or staff. However, predictions were systematically lower than self-reported ratings by approximately one point, and this gap was not driven by any single aspect. Rather, our analysis shows that self-reported ratings capture the fan's verdict, an overall evaluative judgment that integrates the entire experience, while predicted ratings quantify the impact of salient moments characterized as memorable, emotionally intense, unusual, or actionable. Each measure contains information the other misses. These baseline results establish that a simple, unoptimized prompt can directionally predict how fans rate their experience from the text they wrote, and that the gap between the two numbers can be interpreted as a construct difference worth preserving rather than an error to eliminate.
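The agreement figures quoted in this abstract (exact match, within +/-1, and a systematic gap) are simple to compute once predicted and self-reported ratings are paired. The sketch below is only an illustration of those three quantities: the function name `agreement_summary` and the toy ratings are assumptions made for this example, not the authors' code or data.

```python
def agreement_summary(predicted, reported, tolerance=1):
    """Return exact-match rate, within-tolerance rate, and mean signed gap.

    A negative "bias" means predictions sit below self-reported ratings.
    """
    if len(predicted) != len(reported) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    n = len(predicted)
    pairs = list(zip(predicted, reported))
    exact = sum(p == r for p, r in pairs) / n
    within = sum(abs(p - r) <= tolerance for p, r in pairs) / n
    bias = sum(p - r for p, r in pairs) / n
    return {"exact": exact, "within": within, "bias": bias}


# Hypothetical toy ratings on a 0-10 scale, not data from the study.
toy_predicted = [7, 8, 5, 9, 6]
toy_reported = [8, 8, 7, 10, 6]
summary = agreement_summary(toy_predicted, toy_reported)
```

On real data, a negative `bias` alongside a high `within` rate would correspond to the pattern the abstract describes: predictions that track the self-reported rating closely but run lower on average.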

Chinese Translation

我们让GPT-4.1阅读棒球迷对比赛日体验的描述，并预测每位球迷在0-10的调查评分中给出的整体体验评分。模型仅接收了一条开放式响应的文本。这些人工智能预测与来自五支美国职业棒球大联盟球队的约10,000名球迷的实际体验评分进行了比较。总的来说，预测评分中有三分之二落在自我报告的球迷评分的一个分数之内（67%在+/-1范围内，36%完全匹配），而且在三次独立评分运行中，预测测量几乎是确定性的（87%完全一致，99.9%在+/-1范围内）。预测评分与整体体验评分的相关性最强（r = 0.82），而不是与比赛日体验的任何特定方面（如停车、餐饮、工作人员等）。然而，预测评分系统性地低于自我报告的评分，约低一个分数，这一差距并不是由任何单一方面驱动的。相反，我们的分析表明，自我报告的评分捕捉了球迷的裁决，即整合了整个体验的总体评价判断，而预测评分则量化了被认为是难忘、情感强烈、不寻常或可操作的显著时刻的影响。每种测量都包含了另一种所遗漏的信息。这些基线结果表明，一个简单的、未优化的提示可以从球迷所写的文本中方向性地预测球迷如何评分他们的体验，而这两个数字之间的差距可以被解读为值得保留的构念差异，而不是需要消除的错误。

cs.CL / 30 / 2604.14324

Purging the Gray Zone: Latent-Geometric Denoising for Precise Knowledge Boundary Awareness

清除灰色区域：用于精确知识边界感知的潜在几何去噪

An, Hao, Lou, Yibin, Guo, Jiayi, Xu, Yang

Abstract

Large language models (LLMs) often exhibit hallucinations due to their inability to accurately perceive their own knowledge boundaries. Existing abstention fine-tuning methods typically partition datasets directly based on response accuracy, causing models to suffer from severe label noise near the decision boundaries and consequently exhibit high rates of abstentions or hallucinations. This paper adopts a latent space representation perspective, revealing a "gray zone" near the decision hyperplane where internal belief ambiguity constitutes the core performance bottleneck. Based on this insight, we propose the **GeoDe** (**Geo**metric **De**noising) framework for abstention fine-tuning. This method constructs a truth hyperplane using linear probes and performs "geometric denoising" by employing geometric distance as a confidence signal for abstention decisions. This approach filters out ambiguous boundary samples while retaining high-fidelity signals for fine-tuning. Experiments across multiple models (Llama3, Qwen3) and benchmark datasets (TriviaQA, NQ, SciQ, SimpleQA) demonstrate that GeoDe significantly enhances model truthfulness and demonstrates strong generalization in out-of-distribution (OOD) scenarios. Code is available at https://github.com/Notbesidemoon/GeoDe.

Chinese Translation

大型语言模型（LLMs）常常由于无法准确感知自身知识边界而出现幻觉。现有的弃权微调方法通常直接基于响应准确性对数据集进行划分，导致模型在决策边界附近遭受严重的标签噪声，从而表现出高弃权率或幻觉。本文采用潜在空间表示的视角，揭示了决策超平面附近的“灰色区域”，在该区域内，内部信念模糊性构成了核心性能瓶颈。基于这一洞察，我们提出了**GeoDe**（**Geo**metric **De**noising）框架用于弃权微调。该方法利用线性探针构建真相超平面，并通过采用几何距离作为弃权决策的信心信号来执行“几何去噪”。该方法在保留高保真信号以进行微调的同时，过滤掉模糊的边界样本。在多个模型（Llama3, Qwen3）和基准数据集（TriviaQA, NQ, SciQ, SimpleQA）上的实验表明，GeoDe显著提升了模型的真实性，并在分布外（OOD）场景中表现出强大的泛化能力。代码可在 https://github.com/Notbesidemoon/GeoDe 获取。

cs.CL / 31 / 2604.14325

Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance

信任度血清：通过归因指导减轻大语言模型决策文本解释中的信任度差距

Alon, Bar, Zimerman, Itamar, Wolf, Lior

Abstract

Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, that is, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.

Chinese Translation

大型语言模型（LLMs）在自然语言处理（NLP）领域表现出色，已引发革命，但其缺乏可解释性使其被视为黑箱，限制了其在需要透明度和信任的领域中的应用。解决这一问题的一个有前景的方向是事后基于文本的解释，旨在用自然语言解释模型的决策。先前的研究集中于生成看似主观上可信的合理性，但尚不清楚这些解释是否在认识上是可信的，即它们是否反映了模型实际依赖的内部证据。在本文中，我们首先通过反事实评估LLM生成解释的认识可信度，并表明它们往往是不可信的。然后，我们引入了一种无训练的方法，通过关注级别的干预来指导解释生成，从而增强可信度，这一过程受到通过可信的归因方法提取的标记级热图的启发。该方法在多个模型、基准和提示上显著提高了认识可信度。

cs.CL / 32 / 2604.14339

Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation

打乱上下文：用于长上下文适应的RoPE扰动自蒸馏

Li, Zichong, Liang, Chen, Ren, Liliang, Zhao, Tuo, Shen, Yelong, Chen, Weizhu

Abstract

Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.

Chinese Translation

大型语言模型（LLMs）越来越多地在需要可靠的长上下文理解的环境中运行，例如检索增强生成和多文档推理。一个常见的策略是在目标序列长度上微调预训练的短上下文模型。然而，我们发现标准的长上下文适应仍然脆弱：模型的准确性在很大程度上依赖于相关证据的绝对位置，即使在控制任务格式和难度的情况下，位置变异性也很高。我们提出了RoPE扰动自蒸馏，这是一种训练正则化器，可以提高位置的鲁棒性。其核心思想是通过扰动RoPE索引形成同一训练序列的替代“视图”——有效地将上下文的部分移动到不同的位置——并通过自蒸馏训练模型在视图之间产生一致的预测。这鼓励模型依赖语义信号而不是脆弱的位置依赖性。在Llama-3-8B和Qwen-3-4B的长上下文适应实验中，显示出在长上下文基准测试上的一致性提升，包括在Llama-3-8B的RULER-64K上提高了12.04%，以及在Qwen-3-4B的RULER-256K上提高了2.71%，同时在训练上下文窗口之外的长度外推能力也有所改善。

cs.CL / 33 / 2604.14356

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

多囊卵巢综合症与饮食障碍的交汇：一种可解释的人工智能方法来检测隐藏的三重负担

Prasad, Apoorv, McRoy, Susan

Abstract

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines operationalizing Lee et al. (2017) clinical framework. Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating their best use is for screening rather than autonomous diagnosis.

Chinese Translation

患有多囊卵巢综合症（PCOS）的女性面临着显著增加的身体形象困扰、饮食失调和代谢挑战的风险，但现有的自然语言处理方法在检测这些情况时缺乏透明性，无法识别共存表现。我们开发了小型开源语言模型，以自动检测社交媒体帖子中的这一三重负担，并提供基于证据的可解释性。我们从六个子版块收集了1,000条与PCOS相关的帖子，由两名经过培训的注释员根据Lee等人（2017）临床框架的操作性指导对帖子进行标注。三种模型（Gemma-2-2B、Qwen3-1.7B、DeepSeek-R1-Distill-Qwen-1.5B）经过低秩适应（Low-Rank Adaptation）进行微调，以生成带有文本证据的结构化解释。最佳模型在150条保留帖子上达到了75.3%的准确匹配率，具有强大的共病检测能力和良好的可解释性。随着诊断复杂性的增加，性能有所下降，表明这些模型更适合用于筛查而非自主诊断。

cs.CL / 34 / 2604.14362

APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI

APEX-MEM：具有时间推理的代理半结构化记忆用于长期对话人工智能

Banerjee, Pratyay, Moshtaghi, Masud, Subramanian, Shivashankar, Misra, Amita, Chadha, Ankit

Abstract

Large language models still struggle with reliable long-term conversational memory: simply enlarging context windows or applying naive retrieval often introduces noise and destabilizes responses. We present APEX-MEM, a conversational memory system that combines three key innovations: (1) a property graph which uses domain-agnostic ontology to structure conversations as temporally grounded events in an entity-centric framework, (2) append-only storage that preserves the full temporal evolution of information, and (3) a multi-tool retrieval agent that understands and resolves conflicting or evolving information at query time, producing a compact and contextually relevant memory summary. This retrieval-time resolution preserves the full interaction history while suppressing irrelevant details. APEX-MEM achieves 88.88% accuracy on LOCOMO's Question Answering task and 86.2% on LongMemEval, outperforming state-of-the-art session-aware approaches and demonstrating that structured property graphs enable more temporally coherent long-term conversational reasoning.

Chinese Translation

大型语言模型在可靠的长期对话记忆方面仍然存在困难：简单地扩大上下文窗口或应用简单的检索往往会引入噪声并使响应不稳定。我们提出了APEX-MEM，一种对话记忆系统，结合了三个关键创新：（1）一个属性图，它使用与领域无关的本体将对话结构化为以时间为基础的事件，采用以实体为中心的框架；（2）仅追加存储，保留信息的完整时间演变；（3）一个多工具检索代理，在查询时理解并解决冲突或演变的信息，生成紧凑且具有上下文相关性的记忆摘要。这种检索时的解决方案保留了完整的交互历史，同时抑制了无关细节。APEX-MEM在LOCOMO的问答任务上取得了88.88%的准确率，在LongMemEval上达到了86.2%，超越了最先进的会话感知方法，证明了结构化属性图能够实现更具时间一致性的长期对话推理。

cs.CL / 35 / 2604.14363

The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

语言的代价：质心消除揭示并利用多模态语言模型中的模态竞争

Paruchuri, Akshay, Chatterjee, Ishan, Fuchs, Henry, Adeli, Ehsan, Didyk, Piotr

Abstract

Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
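The centroid replacement probe in this abstract, collapsing each token to its nearest K-means centroid, can be sketched in a few lines. This is a minimal illustration only: it assumes the centroids have already been fitted elsewhere and uses squared Euclidean distance. The names `nearest_centroid` and `centroid_replace` and the toy vectors are hypothetical, and the abstract does not say which layer or token set the authors apply the probe to.

```python
def nearest_centroid(vector, centroids):
    """Return the centroid closest to `vector` in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: sq_dist(vector, c))


def centroid_replace(tokens, centroids):
    """Collapse every token vector onto its nearest centroid."""
    return [list(nearest_centroid(t, centroids)) for t in tokens]


# Hypothetical 2-D centroids and token vectors, for illustration only.
toy_centroids = [[0.0, 0.0], [1.0, 1.0]]
toy_tokens = [[0.1, -0.2], [0.9, 1.2], [0.4, 0.4]]
erased = centroid_replace(toy_tokens, toy_centroids)
```

After replacement, tokens keep only their cluster identity and lose within-cluster detail. On that reading, the accuracy drop from the replacement shows how much the model depends on the fine structure of that modality.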

Chinese Translation

多模态语言模型在视觉感知任务上的表现系统性不足，但导致这一失败的结构仍然不够清晰。我们提出质心替换，将每个标记压缩到其最近的 K-means 质心，作为模态依赖的控制探测。在涵盖三种架构系列的七个模型中，消除文本质心结构的准确性损失是消除视觉质心结构的 4 倍，揭示了一种普遍的不平衡，即语言表示在需要视觉推理的任务中仍然压倒视觉。我们通过文本质心对比解码利用这种不对称性，通过与消除文本质心的参考进行对比解码，在单个任务上恢复了高达 +16.9% 的准确性。该干预在训练方法上有显著差异：标准微调模型的增益更大（平均 +5.6%），而偏好优化模型的增益较小（平均 +1.5%）。我们的发现表明，模态竞争在结构上是局部化的，可以在推理时纠正而无需重新训练，并且可以量化为指导未来多模态训练的诊断信号。

cs.CL / 36 / 2604.14389

BiCon-Gate: Consistency-Gated De-colloquialisation for Dialogue Fact-Checking

BiCon-Gate：用于对话事实核查的一致性门控去口语化

Park, Hyunkyung, Zubiaga, Arkaitz

Abstract

Automated fact-checking in dialogue involves multi-turn conversations where colloquial language is frequent yet understudied. To address this gap, we propose a conservative rewrite candidate for each response claim via staged de-colloquialisation, combining lightweight surface normalisation with scoped in-claim coreference resolution. We then introduce BiCon-Gate, a semantics-aware consistency gate that selects the rewrite candidate only when it is semantically supported by the dialogue context, otherwise falling back to the original claim. This gated selection stabilises downstream fact-checking and yields gains in both evidence retrieval and fact verification. On the DialFact benchmark, our approach improves retrieval and verification, with particularly strong gains on SUPPORTS, and outperforms competitive baselines, including a decoder-based one-shot LLM rewrite that attempts to perform all de-colloquialisation steps in a single pass.

Chinese Translation

对话中的自动事实核查涉及多轮对话，其中口语化语言频繁出现但尚未得到充分研究。为了解决这一问题，我们提出了一种保守的重写候选方案，通过分阶段的去口语化为每个响应声明生成候选，结合轻量级的表面规范化和范围内的核心指代解析。接着，我们引入了BiCon-Gate，这是一种语义感知的一致性门控，仅在重写候选在对话上下文中得到语义支持时才进行选择，否则回退到原始声明。这种门控选择稳定了下游的事实核查，并在证据检索和事实验证方面都取得了提升。在DialFact基准测试中，我们的方法改善了检索和验证，特别是在SUPPORTS上取得了显著的提升，并且超越了竞争基准，包括尝试在单次传递中执行所有去口语化步骤的基于解码器的一次性LLM重写。

cs.CL / 37 / 2604.14397

Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection

通过基于词典的跨语言语义投影生成概念词汇化

Basil, David, Girigowda, Chirooth, Hauer, Bradley, Momin, Sahir, Shi, Ning, Kondrak, Grzegorz

Abstract

We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects English synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate these alignments and ensure their quality, we augment a pre-trained base aligner with a bilingual dictionary, which is also used to filter out incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and requiring few external resources. We plan to make our code, documentation, and generated sense inventories accessible.

Chinese Translation

我们研究了通过语义生成自动扩展WordNet风格词汇资源到新语言的任务。我们通过将目标语言的词元与现有词汇概念关联来生成语义。给定一个带有语义标记的英语语料库及其翻译，我们的方法将英语同义词集合投影到对齐的目标语言词元上，并将相应的词元分配给这些同义词集合。为了生成这些对齐并确保其质量，我们使用双语词典增强了一个预训练的基础对齐器，该词典也用于过滤不正确的语义投影。我们在多种语言上评估该方法，并与先前的方法、基于词典的方法以及大型语言模型基线进行比较。结果表明，所提出的投影-过滤策略在提高精度的同时保持可解释性，并且所需的外部资源较少。我们计划使我们的代码、文档和生成的语义库存可供访问。

cs.CL / 38 / 2604.14414

The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious

自相关盲点：为何在大型语言模型对话分析中，42%的轮次级发现可能是虚假的

Schessl, Ferdinand M.

Abstract

Turn-level metrics are widely used to evaluate properties of multi-turn human-LLM conversations, from safety and sycophancy to dialogue quality. However, consecutive turns within a conversation are not statistically independent -- a fact that virtually all current evaluation pipelines fail to correct for in their statistical inference. We systematically characterize the autocorrelation structure of 66 turn-level metrics across 202 multi-turn conversations (11,639 turn pairs, 5 German-speaking users, 4 LLM platforms) and demonstrate that naive pooled analysis produces severely inflated significance estimates: 42% of associations that appear significant under standard pooled testing fail to survive cluster-robust correction. The inflation varies substantially across categories rather than scaling linearly with autocorrelation: three memoryless families (embedding velocity, directional, differential) aggregate to 14%, while the seven non-memoryless families (thermo-cycle, frame distance, lexical/structural, rolling windows, cumulative, interaction, timestamp) aggregate to 33%, with individual category rates ranging from 0% to 100% depending on per-family effect size. We present a two-stage correction framework combining Chelton (1983) effective degrees of freedom with conversation-level block bootstrap, and validate it on a pre-registered hold-out split where cluster-robust metrics replicate at 57% versus 30% for pooled-only metrics. We provide concrete design principles, a publication checklist, and open-source code for the correction pipeline. A survey of ~30 recent papers at major NLP and AI venues that compute turn-level statistics in LLM evaluations finds that only 4 address temporal dependence at all, and 26 do not correct for it.
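The second stage of the correction described in this abstract, a conversation-level block bootstrap, resamples whole conversations instead of individual turns, so within-conversation dependence stays intact in every resample. The sketch below illustrates that idea for the mean of one turn-level metric. It is not the authors' pipeline: it omits the Chelton (1983) effective-degrees-of-freedom stage, and the function names and toy values are assumptions.

```python
import random


def block_bootstrap_means(conversations, n_resamples=1000, seed=0):
    """Resample whole conversations with replacement; return each resample's pooled mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Draw as many conversations as in the original sample, keeping turns together.
        sample = [rng.choice(conversations) for _ in conversations]
        turns = [value for conv in sample for value in conv]
        means.append(sum(turns) / len(turns))
    return means


def percentile_interval(values, lower=0.025, upper=0.975):
    """Simple percentile interval over bootstrap replicates."""
    ordered = sorted(values)
    lo_idx = int(lower * (len(ordered) - 1))
    hi_idx = int(upper * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Hypothetical metric values per turn, grouped by conversation.
toy_conversations = [[0.1, 0.2, 0.2], [0.8, 0.9], [0.4, 0.5, 0.5, 0.6]]
boot = block_bootstrap_means(toy_conversations, n_resamples=500, seed=1)
ci_low, ci_high = percentile_interval(boot)
```

Because the resampling unit is the conversation, the resulting interval is typically wider than one from pooling all turns as if they were independent. That widening is consistent with the inflated significance the abstract reports for naive pooled analysis.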

Chinese Translation

轮次级指标广泛用于评估多轮人类与大型语言模型（LLM）对话的特性，包括安全性、谄媚性和对话质量。然而，对话中的连续轮次并不是统计独立的——这一事实几乎所有当前的评估流程在其统计推断中都未能进行修正。我们系统性地描述了在202个多轮对话（11,639对轮次，5名德语用户，4个平台）中66个轮次级指标的自相关结构，并证明简单的汇总分析会导致显著性估计的严重膨胀：在标准汇总测试下，42%的关联看似显著，但在集群稳健修正后未能存活。膨胀在不同类别之间变化显著，而不是与自相关线性缩放：三个无记忆家族（嵌入速度、方向性、差异性）聚合为14%，而七个非无记忆家族（热循环、框架距离、词汇/结构、滚动窗口、累积、交互、时间戳）聚合为33%，各类别的比例从0%到100%不等，具体取决于每个家族的效应大小。我们提出了一个两阶段修正框架，将Chelton（1983）有效自由度与对话级块自助法相结合，并在预注册的保留分割上进行了验证，其中集群稳健指标的复制率为57%，而仅汇总指标为30%。我们提供了具体的设计原则、出版检查表和开源代码用于修正流程。对约30篇在主要自然语言处理（NLP）和人工智能（AI）会议上计算LLM评估中的轮次级统计的近期论文的调查发现，只有4篇完全考虑了时间依赖性，26篇未对此进行修正。

cs.CL / 39 / 2604.14430

Three-Phase Transformer

三相变压器

Ayyash, Mohammad R. Abu

Abstract

We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable. The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
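Two ingredients in this abstract are easy to check numerically: the per-channel rotation angles theta + i*(2*pi/N), and the balanced three-phase property that sinusoids 120 degrees apart sum to zero. The sketch below assumes the Givens rotation acts on a single coordinate pair, which the abstract does not spell out. The function names and the value theta = 0.3 are illustrative choices, not taken from the paper.

```python
import math


def phase_angles(theta, n_channels):
    """Angle for channel i is theta + i * (2*pi / N), as stated in the abstract."""
    return [theta + i * (2 * math.pi / n_channels) for i in range(n_channels)]


def givens_rotate(x, y, angle):
    """Rotate the coordinate pair (x, y) by `angle`; the norm is preserved."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x - s * y, s * x + c * y


# Assumed base angle; N=3 gives phases 120 degrees apart.
angles = phase_angles(0.3, 3)
phase_sum = sum(math.sin(a) for a in angles)  # numerically zero for balanced phases
rotated = givens_rotate(3.0, 4.0, angles[1])
```

The zero sum holds for any base angle once N is at least 2. Since each rotation is orthogonal, it changes the direction of a coordinate pair within a channel but never its magnitude.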

Chinese Translation

我们提出了三相变压器（Three-Phase Transformer，3PT），这是一种用于仅解码器Transformer的残差流结构先验，基于标准的SwiGLU + RMSNorm + RoPE + GQA骨干网络。隐藏向量被划分为N个大小相等的循环通道，每个通道通过相位尊重操作进行维护：每个通道的RMSNorm，一个在注意力和前馈网络（FFN）之间的二维Givens旋转，该旋转使每个通道旋转θ + i*(2π/N)，以及一个头数约束，使GQA头与划分对齐。该架构是一个自我稳定的平衡状态，介于打乱和重新施加之间，而不是一个附加模块。划分形成一个与通道正交的一维直流子空间，我们在其中注入一个固定的Gabriel's horn轮廓r(p) = 1/(p+1)，作为绝对位置的侧通道，与RoPE的相对位置旋转正交组合。典型的N=3借用平衡三相交流的隐喻，其中三个相位相差120度的正弦波相加为零，没有反相关的对。在WikiText-103上，3PT在123M参数下实现了-7.20%的困惑度（-2.62%字节比），相较于匹配的仅RoPE基线增加了+1,536个参数（占总数的0.00124%），并且在步数收敛速度上实现了1.93倍的加速（1.64倍的实际时间）。N表现得像一个参数共享旋钮，而不是一个独特的最优解：在5.5M时，对{1,2,3,4,6,8,12}的N扫查几乎是单调的，N=1获胜；在123M时，三种种子扫查发现N=3和N=1在统计上不可区分。承载机制是通道划分的残差流、每块旋转、每相位归一化和喇叭直流注入。我们表征了（a）几何形状的自我稳定性，无需显式强制，这是神经网络保守法则框架的一个新实例；（b）在12层时旋转角度漂移的U型深度轮廓；（c）与RoPE、注意力和FFN的正交组合。

cs.CL / 40 / 2604.14442

Hierarchical vs. Flat Iteration in Shared-Weight Transformers

共享权重变换器中的层次迭代与扁平迭代

Han, Sang-Il

Abstract

We present an empirical study of whether hierarchically structured, shared-weight recurrence can match the representational quality of independent-layer stacking in a Transformer-based language model. HRM-LM replaces L independent Transformer layers with a two-speed recurrent pair: a Fast module operating at every step for local refinement, and a Slow module operating every T steps for global compression. This recurrent hierarchy is unrolled for M = N x T steps with shared parameters. The central and most robust finding, supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs, is a sharp empirical gap between the two approaches.
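The two-speed recurrence in this abstract (a Fast module at every step, a Slow module every T steps, unrolled for M = N x T steps) can be read as a simple schedule. The sketch below only enumerates which module runs when. Running the Slow module after the Fast module on every T-th step is an assumption, and `two_speed_schedule` is a hypothetical helper, not part of HRM-LM.

```python
def two_speed_schedule(n_cycles, period):
    """List module calls for M = n_cycles * period unrolled steps.

    The fast module runs at every step; the slow module runs once
    after every `period` steps (ordering within a step is assumed).
    """
    schedule = []
    for step in range(1, n_cycles * period + 1):
        schedule.append("fast")
        if step % period == 0:
            schedule.append("slow")
    return schedule


# N = 2 slow cycles with T = 3 fast steps each, so M = 6 unrolled steps.
toy_schedule = two_speed_schedule(n_cycles=2, period=3)
```

With shared parameters, every `"fast"` entry reuses one set of weights and every `"slow"` entry reuses another, so depth grows with M while the parameter count stays fixed.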

Chinese Translation

我们进行了一项实证研究，探讨层次结构的共享权重递归是否能够与基于变换器的语言模型中的独立层堆叠匹配其表征质量。HRM-LM用一对双速递归模块替代了L个独立的变换器层：一个快速模块在每一步操作以进行局部细化，一个慢速模块每T步操作以进行全局压缩。这个递归层次展开为M = N x T步，并共享参数。通过在五次独立运行中对参数匹配的通用变换器（Universal Transformer，UniTF，1.2B）进行消融实验，得出的中心且最稳健的发现是这两种方法之间存在明显的实证差距。

cs.CL / 41 / 2604.14448

MARCA: A Checklist-Based Benchmark for Multilingual Web Search

MARCA：一种基于检查表的多语言网络搜索基准

Almeida, Thales Sales, Bonás, Giovana Kerche, Pires, Ramon, Larcher, Celio, Abonizio, Hugo, Piau, Marcos, Junior, Roseval Malaquias, Nogueira, Rodrigo, Laitz, Thiago

Abstract

Large language models (LLMs) are increasingly used as sources of information, yet their reliability depends on the ability to search the web, select relevant evidence, and synthesize complete answers. While recent benchmarks evaluate web-browsing and agentic tool use, multilingual settings, and Portuguese in particular, remain underexplored. We present MARCA, a bilingual (English and Portuguese) benchmark for evaluating LLMs on web-based information seeking. MARCA consists of 52 manually authored multi-entity questions, paired with manually validated checklist-style rubrics that explicitly measure answer completeness and correctness. We evaluate 14 models under two interaction settings: a Basic framework with direct web search and scraping, and an Orchestrator framework that enables task decomposition via delegated subagents. To capture stochasticity, each question is executed multiple times and performance is reported with run-level uncertainty. Across models, we observe large performance differences, find that orchestration often improves coverage, and identify substantial variability in how models transfer from English to Portuguese. The benchmark is available at https://github.com/maritaca-ai/MARCA

Chinese Translation

大型语言模型（LLMs）越来越多地被用作信息来源，但其可靠性依赖于搜索网络、选择相关证据和综合完整答案的能力。尽管最近的基准评估了网络浏览和代理工具的使用，但多语言环境，尤其是葡萄牙语，仍然未得到充分探索。我们提出了 MARCA，这是一个双语（英语和葡萄牙语）基准，用于评估 LLM 在基于网络的信息获取能力。MARCA 包含 52 个手动编写的多实体问题，并配有手动验证的检查表式评分标准，明确衡量答案的完整性和正确性。我们在两种交互设置下评估了 14 个模型：一个基础框架，直接进行网络搜索和抓取，以及一个协调者框架，通过委派子代理实现任务分解。为了捕捉随机性，每个问题执行多次，并报告运行级别的不确定性。在各模型之间，我们观察到显著的性能差异，发现协调通常改善了覆盖范围，并识别出模型从英语转移到葡萄牙语时的显著变异性。该基准可在 https://github.com/maritaca-ai/MARCA 获取。

cs.CL / 42 / 2604.14459

Filling in the Mechanisms: How do LMs Learn Filler-Gap Dependencies under Developmental Constraints?

填补机制：语言模型如何在发展约束下学习填充-间隙依赖？

Desai, Atrey, Nair, Sathvik

Abstract

For humans, filler-gap dependencies require a shared representation across different syntactic constructions. Although causal analyses suggest this may also be true for LLMs (Boguraev et al., 2025), it is still unclear whether such a representation also exists in language models trained on developmentally feasible quantities of data. We applied Distributed Alignment Search (DAS; Geiger et al., 2024) to LMs trained on varying amounts of data from the BabyLM challenge (Warstadt et al., 2023) to evaluate whether representations of filler-gap dependencies transfer between wh-questions and topicalization, which vary greatly in their input frequency. Our results suggest that shared, yet item-sensitive, mechanisms may develop with limited training data. More importantly, LMs still require far more data than humans to learn comparable generalizations, highlighting the need for language-specific biases in models of language acquisition.

Chinese Translation

对于人类而言，填充-间隙依赖需要在不同的句法结构之间共享表示。尽管因果分析表明这对于大型语言模型（LLMs）也可能成立（Boguraev et al., 2025），但尚不清楚在以发展上可行的数据量训练的语言模型中是否也存在这样的表示。我们应用了分布式对齐搜索（Distributed Alignment Search, DAS，Geiger et al. (2024)）对在不同数据量下训练的语言模型进行评估，这些数据来自BabyLM挑战（Warstadt et al., 2023），以评估填充-间隙依赖的表示是否在wh-疑问句和主题化之间转移，而这两者在输入频率上差异显著。我们的结果表明，尽管共享机制可能存在，但其对特定项目敏感。更重要的是，语言模型仍然需要比人类更多的数据才能学习到可比的概括，这突显了在语言习得模型中需要语言特定的偏见。

cs.CL / 43 / 2604.14463

Psychological Steering of Large Language Models

大型语言模型的心理引导

Blas, Leonardo, Jia, Robin, Ferrara, Emilio

Abstract

Large language models (LLMs) emulate a consistent human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P²), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, we find that a hybrid of P² and MD injections outperforms both methods in 13 of 14 LLMs, with gains over P² ranging from 5.6% to 21.9% and from 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering. Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.

Chinese Translation

大型语言模型（LLMs）模拟了一种一致的人类行为，这种行为可以通过激活水平干预进行塑造。该范式正在趋向于加性残差流注入，这依赖于注入强度的变化来近似最佳干预设置。然而，现有方法限制了搜索空间，并在未校准的激活空间单位中进行变化，可能会错过最佳干预条件。因此，我们提出了一种心理引导框架，该框架在语义校准单位中执行无限制的、流畅性受限的变化。我们的方法使用心理学工具推导和校准残差流注入，并使用测量OCEAN人格模型的IPIP-NEO-120来比较六种注入方法。我们发现，在14个LLMs中的11个中，均值差异（MD）注入在开放式生成中优于人格提示（P²），后者是OCEAN引导的一个已建立基线，增益范围为3.6%到16.4%，推翻了之前支持提示的报告，并将表示工程定位为开放式心理引导的新前沿。此外，我们发现P²和MD注入的混合方法在14个LLMs中的13个中表现优于这两种方法，相较于P²的增益范围为5.6%到21.9%，而相较于MD注入的增益范围为3.3%到26.7%。最后，我们展示了MD注入与线性表示假设的一致性，并提供了可靠的、近似线性的控制旋钮用于心理引导。然而，它们也引发了与大二模型偏离的OCEAN特质协方差模式，暗示了学习表示与人类心理之间的差距。


cs.CL / 44 / 2604.14489

CobwebTM: Probabilistic Concept Formation for Lifelong and Hierarchical Topic Modeling

CobwebTM：用于终身和层次主题建模的概率概念形成

Singaravadivelan, Karthik, Gupta, Anant, Wang, Zekun, MacLellan, Christopher

Abstract

Topic modeling seeks to uncover latent semantic structure in text corpora with minimal supervision. Neural approaches achieve strong performance but require extensive tuning and struggle with lifelong learning due to catastrophic forgetting and fixed capacity, while classical probabilistic models lack flexibility and adaptability to streaming data. We introduce CobwebTM, a low-parameter lifelong hierarchical topic model based on incremental probabilistic concept formation. By adapting the Cobweb algorithm to continuous document embeddings, CobwebTM constructs semantic hierarchies online, enabling unsupervised topic discovery, dynamic topic creation, and hierarchical organization without predefining the number of topics. Across diverse datasets, CobwebTM achieves strong topic coherence, stable topics over time, and high-quality hierarchies, demonstrating that incremental symbolic concept formation combined with pretrained representations is an efficient approach to topic modeling.

Chinese Translation

主题建模旨在以最小的监督揭示文本语料库中的潜在语义结构。神经网络方法虽然表现强劲，但需要大量调优，并且由于灾难性遗忘和固定容量而在终身学习中面临挑战，而经典的概率模型则缺乏对流数据的灵活性和适应性。我们提出了 CobwebTM，这是一种基于增量概率概念形成的低参数终身层次主题模型。通过将 Cobweb 算法适应于连续文档嵌入，CobwebTM 在线构建语义层次，使得在不预定义主题数量的情况下实现无监督主题发现、动态主题创建和层次组织。在多样化的数据集上，CobwebTM 实现了强主题一致性、随时间稳定的主题和高质量的层次结构，证明了增量符号概念形成与预训练表示相结合是一种高效的主题建模方法。


cs.CL / 45 / 2604.14513

PeerPrism: Peer Evaluation Expertise vs Review-writing AI

PeerPrism：同行评审中的专家评估与审稿写作AI

Sadeghian, Soroush, Daqiq, Alireza, Cheraghi, Radin, Ebrahimi, Sajad, Arabzadeh, Negar, Bagheri, Ebrahim

Abstract

Large Language Models (LLMs) are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer-review LLM detection methods largely treat authorship as a binary problem (human vs. AI), without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human-AI collaboration. In this work, we introduce PeerPrism, a large-scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark state-of-the-art LLM text detection methods on PeerPrism. While several methods achieve high accuracy on the standard binary task (human vs. fully synthetic), their predictions diverge sharply under hybrid regimes. In particular, when ideas originate from humans but the surface text is AI-generated, detectors frequently disagree and produce contradictory classifications. Accompanied by stylometric and semantic analyses, our results show that current detection methods conflate surface realization with intellectual contribution. Overall, we demonstrate that LLM detection in peer review cannot be reduced to a binary attribution problem. Instead, authorship must be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization. PeerPrism is the first benchmark evaluating human-AI collaboration in these settings. We release all code, data, prompts, and evaluation scripts to facilitate reproducible research at https://github.com/Reviewerly-Inc/PeerPrism.

Chinese Translation

大型语言模型（LLMs）在科学同行评审中越来越多地被使用，协助草拟、重写、扩展和完善。然而，现有的同行评审LLM检测方法主要将作者身份视为一个二元问题——人类与AI——而没有考虑现代评审工作流程的混合特性。在实践中，评估思想和表面实现可能来自不同的来源，形成一种人类与AI合作的光谱。在本研究中，我们引入了PeerPrism，这是一个大规模的基准数据集，包含20,690篇同行评审，明确旨在将思想来源与文本来源区分开来。我们构建了控制生成机制，涵盖完全人类、完全合成和多种混合转化。这一设计使得系统评估检测器是否能够识别表面文本的来源或评估推理的来源成为可能。我们在PeerPrism上对最先进的LLM文本检测方法进行了基准测试。尽管几种方法在标准的二元任务（人类与完全合成）上取得了高准确率，但它们在混合机制下的预测却大相径庭。特别是，当思想来源于人类但表面文本是AI生成时，检测器经常出现分歧，并产生矛盾的分类。伴随着风格计量和语义分析，我们的结果表明，当前的检测方法将表面实现与智力贡献混为一谈。总体而言，我们展示了在同行评审中，LLM检测不能简化为一个二元归属问题。相反，作者身份必须建模为一个跨越语义推理和风格实现的多维构造。PeerPrism是首个在这些环境中评估人类与AI合作的基准。我们发布了所有代码、数据、提示和评估脚本，以促进可重复研究，网址为 https://github.com/Reviewerly-Inc/PeerPrism。


cs.CL / 46 / 2604.14593

Mechanistic Decoding of Cognitive Constructs in LLMs

大型语言模型中认知构念的机制解码

Shou, Yitong, Guan, Manhao

Abstract

While Large Language Models (LLMs) demonstrate increasingly sophisticated affective capabilities, the internal mechanisms by which they process complex emotions remain unclear. Existing interpretability approaches often treat models as black boxes or focus on coarse-grained basic emotions, leaving the cognitive structure of more complex affective states underexplored. To bridge this gap, we propose a Cognitive Reverse-Engineering framework based on Representation Engineering (RepE) to analyze social-comparison jealousy. By combining appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self-Definitional Relevance, and examine their causal effects on model judgments. Experiments on eight LLMs from the Llama, Qwen, and Gemma families suggest that models natively encode jealousy as a structured linear combination of these constituent factors. Their internal representations are broadly consistent with the human psychological construct, treating Superiority as the foundational trigger and Relevance as the ultimate intensity multiplier. Our framework also demonstrates that toxic emotional states can be mechanically detected and surgically suppressed, suggesting a possible route toward representational monitoring and intervention for AI safety in multi-agent environments.

Chinese Translation

尽管大型语言模型（LLMs）展现出越来越复杂的情感能力，但它们处理复杂情绪的内部机制仍不清晰。现有的可解释性方法通常将模型视为黑箱，或仅关注粗略的基本情绪，从而导致对更复杂情感状态的认知结构探索不足。为填补这一空白，我们提出了一种基于表征工程（Representation Engineering, RepE）的认知逆向工程框架，以分析社会比较嫉妒。通过将评估理论与子空间正交化、基于回归的加权和双向因果引导相结合，我们隔离并量化了嫉妒的两个心理前因：比较对象的优越性和领域自我定义的相关性，并考察它们对模型判断的因果影响。在来自Llama、Qwen和Gemma家族的八个LLM上的实验表明，模型本质上将嫉妒编码为这些构成因素的结构性线性组合。它们的内部表征与人类心理构念大体一致，将优越性视为基础触发因素，将相关性视为最终强度的乘数。我们的框架还表明，毒性情感状态可以被机械地检测并有针对性地抑制，这为在多智能体环境中进行表征监测和干预以确保人工智能安全提供了可能的途径。


cs.CL / 47 / 2604.14595

NLP needs Diversity outside of 'Diversity'

自然语言处理需要超越‘多样性’的多样性

Tint, Joshua

Abstract

This position paper argues that recent progress with diversity in NLP is disproportionately concentrated on a small number of areas surrounding fairness. We further argue that this is the result of a number of incentives, biases, and barriers which come together to disenfranchise marginalized researchers in non-fairness fields, or to move them into fairness-related fields. We substantiate our claims with an investigation into the demographics of NLP researchers by subfield, using our research to support a number of recommendations for ensuring that all areas within NLP can become more inclusive and equitable. In particular, we highlight the importance of breaking down feedback loops that reinforce disparities, and the need to address geographical and linguistic barriers that hinder participation in NLP research.

Chinese Translation

本文立场论文认为，近年来在自然语言处理（NLP）领域关于多样性的进展过于集中于少数几个与公平性相关的领域。我们进一步认为，这种现象是由于多种激励、偏见和障碍的共同作用，导致边缘化的研究者在非公平性领域被剥夺了权利，或被迫转向与公平性相关的领域。我们通过对NLP研究者在不同子领域的人口统计进行调查，来证实我们的观点，并提出一系列建议，以确保NLP的所有领域都能变得更加包容和公平。特别是，我们强调打破强化差异的反馈循环的重要性，以及解决地理和语言障碍的必要性，以促进更多人参与NLP研究。


cs.CL / 48 / 2604.14602

CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification

CausalDetox：语言模型去毒化的因果头选择与干预

Wang, Yian, Chen, Yuen, Goyal, Agam, Sundaram, Hari

Abstract

Large language models (LLMs) frequently generate toxic content, posing significant risks for safe deployment. Current mitigation strategies often degrade generation quality or require costly human annotation. We propose CAUSALDETOX, a framework that identifies and intervenes on the specific attention heads causally responsible for toxic generation. Using the Probability of Necessity and Sufficiency (PNS), we isolate a minimal set of heads that are necessary and sufficient for toxicity. We utilize these components via two complementary strategies: (1) Local Inference-Time Intervention, which constructs dynamic, input-specific steering vectors for context-aware detoxification, and (2) PNS-Guided Fine-Tuning, which permanently unlearns toxic representations. We also introduce PARATOX, a novel benchmark of aligned toxic/non-toxic sentence pairs enabling controlled counterfactual evaluation. Experiments on ToxiGen, ImplicitHate, and ParaDetox show that CAUSALDETOX achieves up to 5.34% greater toxicity reduction compared to baselines while preserving linguistic fluency, and offers a 7x speedup in head selection.

Chinese Translation

大型语言模型（LLMs）经常生成有毒内容，给安全部署带来了重大风险。目前的缓解策略往往会降低生成质量或需要昂贵的人力标注。我们提出了CAUSALDETOX，一个框架，旨在识别并干预因果上负责有毒生成的特定注意力头。通过必要性和充分性概率（Probability of Necessity and Sufficiency, PNS），我们隔离出一组最小的注意力头，这些头是有毒性的必要且充分条件。我们通过两种互补策略利用这些组件：（1）局部推理时干预（Local Inference-Time Intervention），构建动态的、特定于输入的引导向量，以实现上下文感知的去毒化；（2）PNS引导的微调（PNS-Guided Fine-Tuning），永久性地去除有毒表征。我们还引入了PARATOX，一个新的基准数据集，包含对齐的有毒/非有毒句子对，以便进行受控的反事实评估。在ToxiGen、ImplicitHate和ParaDetox上的实验表明，CAUSALDETOX在保持语言流畅性的同时，达到了比基线高出5.34%的有毒性减少，并在头选择上实现了7倍的加速。
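
To make the Probability of Necessity and Sufficiency (PNS) concrete, here is a small sketch that ranks attention heads by the standard Tian-Pearl bounds on PNS, computed from two interventional probabilities: the chance of a toxic output with the head active and with it ablated. The abstract does not say how the paper estimates PNS, so this is only one textbook way to do it; the head names and probabilities are made-up placeholders.

```python
def pns_bounds(p_toxic_with_head, p_toxic_without_head):
    """Tian-Pearl bounds on PNS from interventional probabilities.

    lower = max(0, P(y | do(x)) - P(y | do(x')))
    upper = min(P(y | do(x)), 1 - P(y | do(x')))
    Under monotonicity, PNS equals the lower bound.
    """
    lower = max(0.0, p_toxic_with_head - p_toxic_without_head)
    upper = min(p_toxic_with_head, 1.0 - p_toxic_without_head)
    return lower, upper

def rank_heads(head_probs):
    """Sort heads by the PNS lower bound, most causally relevant first."""
    return sorted(head_probs, key=lambda h: pns_bounds(*head_probs[h])[0], reverse=True)

# Placeholder values: (P(toxic | head active), P(toxic | head ablated)).
heads = {
    "L3.H5": (0.75, 0.25),
    "L7.H1": (0.5, 0.5),
    "L9.H2": (0.625, 0.5),
}
ranking = rank_heads(heads)
```

A head whose ablation leaves the toxicity rate unchanged (like the hypothetical `L7.H1`) gets a lower bound of zero, so it falls to the bottom of the ranking.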


cs.CL / 49 / 2604.14616

Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring

检索后分类：基于语料库的临床价值集编写自动化

Mukherjee, Sumit, Shu, Juan, Mazumder, Nairwita, Kernell, Tate, Wheeler, Celena, Hastings, Shannon, Sidey-Gibbons, Chris

Abstract

Clinical value set authoring, the task of identifying all codes in a standardized vocabulary that define a clinical concept, is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the $K$ most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieves AUROC 0.852 and value-set-level F1 0.298, outperforming a simpler three-layer Multilayer Perceptron (AUROC 0.799, F1 0.250). Both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4, respectively. Zero-shot GPT-4o achieves value-set-level F1 0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage. We observe similar performance gains across two other classifier model types, namely a cross-encoder initialized from pre-trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code, is available at https://github.com/mukhes3/RASC.

Chinese Translation

临床价值集编写——识别标准化词汇中定义临床概念的所有代码的任务——在临床质量测量和表型分析中是一个反复出现的瓶颈。一种自然的方法是提示大型语言模型（LLM）直接生成所需的代码，但结构化临床词汇庞大、版本控制且在预训练期间并不可靠地被记忆。我们提出了检索增强集补全（Retrieval-Augmented Set Completion, RASC）：从经过策划的语料库中检索出 $K$ 个最相似的现有价值集以形成候选池，然后对每个候选代码应用分类器。从理论上讲，检索和选择可以通过将有效输出空间从完整词汇缩小到一个更小的检索候选池来降低统计复杂性。我们在 11,803 个公开可用的 VSAC 价值集上展示了 RASC 的实用性，构建了该任务的第一个大规模基准。对 SAPBert 进行微调的交叉编码器实现了 AUROC 0.852 和价值集级别 F1 0.298，优于更简单的三层多层感知器（AUROC 0.799，F1 0.250），并将每个真实正例的无关候选数量从 12.3（仅检索）减少到大约 3.2 和 4.4。零样本 GPT-4o 实现了价值集级别 F1 0.105，返回的代码中有 48.6% 完全不在 VSAC 中。随着价值集大小的增加，这一性能差距扩大，与 RASC 的理论优势一致。我们在另外两种分类器模型类型中观察到类似的性能提升，即从预训练的 SAPBert 初始化的交叉编码器和 LightGBM 模型，证明 RASC 的优势超越了单一模型类别。下载和创建基准数据集的代码以及模型训练代码可在以下链接获取：https://github.com/mukhes3/RASC。
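
The retrieve-then-classify pattern described for RASC can be sketched in a few lines. Everything concrete here is an assumption for illustration: token-level Jaccard similarity stands in for the paper's retriever, a simple vote fraction stands in for its trained classifiers, and the value-set titles and codes (`A1`, `B1`, ...) are placeholders rather than real vocabulary entries.

```python
def jaccard(a, b):
    """Jaccard overlap between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(query, corpus, k):
    """Return the k value sets whose titles are most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda vs: jaccard(query.split(), vs["title"].split()),
        reverse=True,
    )
    return ranked[:k]

def candidate_pool(retrieved):
    """Union of all codes in the retrieved value sets."""
    pool = set()
    for vs in retrieved:
        pool |= vs["codes"]
    return pool

def vote_score(code, retrieved):
    """Stand-in classifier: fraction of retrieved sets that contain the code."""
    return sum(code in vs["codes"] for vs in retrieved) / len(retrieved)

# Placeholder corpus; codes are dummy identifiers.
corpus = [
    {"title": "type 2 diabetes mellitus", "codes": {"A1", "A2"}},
    {"title": "type 1 diabetes mellitus", "codes": {"A1", "A3"}},
    {"title": "essential hypertension", "codes": {"B1"}},
]

retrieved = retrieve("diabetes mellitus type 2 with complications", corpus, k=2)
pool = candidate_pool(retrieved)
selected = {c for c in pool if vote_score(c, retrieved) > 0.5}
```

The point of the sketch is the shrinking output space: the classifier only ever sees the three pooled candidates, never the full vocabulary.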


cs.CL / 50 / 2604.14631

StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation

StoryCoder：用于结构化推理的叙事重构在大型语言模型代码生成中的应用

Jang, Geonhui, Han, Dongyoon, Yoo, YoungJoon

Abstract

Effective code generation requires both model capability and a problem representation that carefully structures how models reason and plan. Existing approaches augment reasoning steps or inject specific structure into how models think, but leave scattered problem conditions unchanged. Inspired by the way humans organize fragmented information into coherent explanations, we propose StoryCoder, a narrative reformulation framework that transforms code generation questions into coherent natural language narratives, providing richer contextual structure than simple rephrasings. Each narrative consists of three components: a task overview, constraints, and example test cases, guided by the selected algorithm and genre. Experiments across 11 models on HumanEval, LiveCodeBench, and CodeForces demonstrate consistent improvements, with an average gain of 18.7% in zero-shot pass@10. Beyond accuracy, our analyses reveal that narrative reformulation guides models toward correct algorithmic strategies, reduces implementation errors, and induces a more modular code structure. The analyses further show that these benefits depend on narrative coherence and genre alignment, suggesting that structured problem representation is important for code generation regardless of model scale or architecture. Our code is available at https://github.com/gu-ni/StoryCoder.

Chinese Translation

有效的代码生成既需要模型的能力，也需要仔细构建的问题表述，以便模型能够进行推理和规划。现有方法通过增强推理步骤或向模型思维中注入特定结构来改善推理，但对分散的问题条件却未作改变。受到人类将碎片化信息组织成连贯解释的启发，我们提出了StoryCoder，一个叙事重构框架，将代码生成问题转化为连贯的自然语言叙事，提供比简单的重述更丰富的上下文结构。每个叙事由三个组成部分构成：任务概述、约束条件和示例测试用例，均由所选算法和类型指导。在HumanEval、LiveCodeBench和CodeForces上的11个模型实验表明，性能有一致性提升，零-shot pass@10的平均增益为18.7%。除了准确性，我们的分析还揭示了叙事重构引导模型朝向正确的算法策略，减少实现错误，并促使更模块化的代码结构。这些分析进一步表明，这些好处依赖于叙事的连贯性和类型的一致性，暗示结构化问题表述对于代码生成的重要性无论模型规模或架构如何。我们的代码可在 https://github.com/gu-ni/StoryCoder 获取。


cs.CL / 51 / 2604.14634

Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options

将多项选择评估的边界推向一百个选项

Lee, Nahyun, Son, Guijin

Abstract

Multiple choice evaluation is widely used for benchmarking large language models, yet near ceiling accuracy in low option settings can be sustained by shortcut strategies that obscure true competence. Therefore, we propose a massive option evaluation protocol that scales the candidate set to one hundred options and sharply reduces the impact of chance performance. We apply this framework to a Korean orthography error detection task where models must pick the single incorrect sentence from a large candidate set. With fixed targets and repeated resampling and shuffling, we obtain stable estimates while separating content driven failures from positional artifacts. Across experiments, results indicate that strong performance in low option settings can overstate model competence. This apparent advantage often weakens under dense interference at high $N$, revealing gaps that conventional benchmarks tend to obscure. We identify two failure modes, semantic confusion and position bias toward early options under uncertainty. To isolate the effect of context length, we run padding controlled and length matched tests, which suggest that the main bottleneck is candidate ranking rather than context length. Together, these findings support massive option evaluation as a general framework for stress testing model reliability under extreme distractor density, beyond what low option benchmarks can reveal.

Chinese Translation

多项选择评估广泛用于大型语言模型的基准测试，但在选项较少的情况下，接近上限的准确率可能会被掩盖真实能力的捷径策略所维持。因此，我们提出了一种大规模选项评估协议，将候选集扩展至一百个选项，并显著减少偶然表现的影响。我们将该框架应用于韩语正字法错误检测任务，在该任务中，模型必须从一个大候选集中选择出唯一的错误句子。通过固定目标和重复重采样与洗牌，我们获得了稳定的估计，同时将内容驱动的失败与位置伪影分离。在各项实验中，结果表明，在选项较少的情况下，强劲的表现可能会夸大模型的能力。这种明显的优势在高 $N$ 的密集干扰下往往减弱，揭示了传统基准测试往往掩盖的差距。我们识别出两种失败模式：语义混淆和在不确定性下对早期选项的位置信息偏向。为了隔离上下文长度的影响，我们进行了填充控制和长度匹配的测试，结果表明主要瓶颈在于候选排名而非上下文长度。综合这些发现，支持大规模选项评估作为一种通用框架，用于在极端干扰密度下对模型可靠性进行压力测试，超越低选项基准所能揭示的内容。
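
The protocol of a fixed target with repeated resampling and shuffling can be sketched as follows. This is an illustrative construction only: the distractor pool, the sentence strings, and the number of trials are assumptions, and no model is queried.

```python
import random

def build_item(target, distractor_pool, n_options, rng):
    """Sample n_options - 1 distractors, add the fixed target, and shuffle.

    Returns the option list and the index where the target landed.
    """
    distractors = rng.sample(distractor_pool, n_options - 1)
    options = distractors + [target]
    rng.shuffle(options)
    return options, options.index(target)

def chance_accuracy(n_options):
    """Expected accuracy of uniform random guessing."""
    return 1.0 / n_options

rng = random.Random(0)
# Placeholder strings; in the task these would be correctly spelled sentences.
pool = [f"correct sentence {i}" for i in range(500)]
target = "sentence with an orthography error"

# Same target, fresh distractors and a fresh position on every trial.
trials = [build_item(target, pool, 100, rng) for _ in range(20)]
target_positions = [idx for _, idx in trials]
```

Going from 4 to 100 options lowers the guessing baseline from 0.25 to 0.01, and logging `target_positions` alongside model answers is what lets content-driven errors be separated from position bias.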


cs.CL / 52 / 2604.14640

Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models

Fact4ac在金融虚假信息检测挑战任务中的表现：通过微调和少量提示对大型语言模型进行无参考金融虚假信息检测

Hoang, Cuong, Nguyen, Le-Minh

Abstract

The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the "Reference-Free Financial Misinformation Detection" shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing. Our models (14B and 32B) are available at https://huggingface.co/KaiNKaiho.

Chinese Translation

金融虚假信息的泛滥对市场稳定和投资者信任构成了严重威胁，误导市场行为并造成关键信息的不对称。检测此类误导性叙述本质上具有挑战性，尤其是在现实场景中，外部证据或补充参考资料用于交叉验证严格不可用。本文展示了我们在“无参考金融虚假信息检测”共享任务中的获胜方法。基于最近提出的RFC-BENCH框架（Jiang et al. 2026），该任务挑战模型仅依赖内部语义理解和上下文一致性来判断金融声明的真实性，而不是依赖外部事实核查。为应对这一艰巨的评估设置，我们提出了一个综合框架，充分利用最先进的大型语言模型（LLMs）的推理能力。我们的方法系统地将上下文学习，特别是零-shot和少量-shot提示策略，与通过低秩适应（LoRA）的参数高效微调（PEFT）相结合，以最佳方式将模型与金融操控的微妙语言线索对齐。我们提出的系统表现出卓越的有效性，成功在官方排行榜上获得第一名。具体而言，我们在公共测试集上取得了95.4%的准确率，在私有测试集上达到了96.3%，突显了我们方法的稳健性，并促进了金融自然语言处理中的上下文感知虚假信息检测的加速。我们的模型（14B和32B）可在https://huggingface.co/KaiNKaiho获取。


cs.CL / 53 / 2604.14644

CURaTE: Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge

CURaTE：实时持续遗忘并确保大型语言模型知识的保留

Bae, Seyun, Lee, Seokhan, Yang, Eunho

Abstract

The inability to filter out in advance all potentially problematic data from the pre-training of large language models has given rise to the need for methods for unlearning specific pieces of knowledge after training. Existing techniques overlook the need for continuous and immediate action, causing them to suffer from degraded utility as updates accumulate and protracted exposure of sensitive information. To address these issues, we propose Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge (CURaTE). Our method begins by training a sentence embedding model on a dataset designed to enable the formation of sharp decision boundaries for determining whether a given input prompt corresponds to any stored forget requests. The similarity of a given input to the forget requests is then used to determine whether to answer or return a refusal response. We show that even with such a simple approach, not only does CURaTE achieve more effective forgetting than existing methods, but by avoiding modification of the language model parameters, it also maintains near perfect knowledge preservation over any number of updates and is the only method capable of continual unlearning in real-time.

Chinese Translation

由于无法提前过滤掉所有潜在问题数据，导致大型语言模型的预训练过程中产生了在训练后遗忘特定知识的方法需求。现有技术忽视了持续和即时行动的必要性，导致随着更新的积累和敏感信息的长期暴露，其效用下降。为了解决这些问题，我们提出了实时持续遗忘并确保大型语言模型知识保留的方法（CURaTE）。我们的方法首先在一个旨在形成明确决策边界的数据集上训练句子嵌入模型，以判断给定的输入提示是否对应于任何存储的遗忘请求。然后，利用给定输入与遗忘请求的相似性来决定是回答还是返回拒绝响应。我们展示了即使采用如此简单的方法，CURaTE不仅实现了比现有方法更有效的遗忘，而且通过避免修改语言模型参数，能够在任何数量的更新中保持近乎完美的知识保留，并且是唯一能够实时持续遗忘的方法。
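
The gating idea in CURaTE, comparing a prompt embedding against stored forget requests and refusing on a match, can be sketched as below. The cosine threshold, the class and function names, and the refusal string are assumptions; the paper trains a dedicated sentence embedding model, which is replaced here by hand-written toy vectors.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class ForgetGate:
    """Stores forget-request embeddings; the language model itself is untouched."""

    def __init__(self, threshold):
        self.threshold = threshold          # hypothetical decision boundary
        self.forget_embeddings = []

    def add_forget_request(self, embedding):
        # A new request takes effect immediately, with no retraining step.
        self.forget_embeddings.append(np.asarray(embedding, dtype=float))

    def should_refuse(self, prompt_embedding):
        prompt_embedding = np.asarray(prompt_embedding, dtype=float)
        return any(
            cosine(prompt_embedding, f) >= self.threshold
            for f in self.forget_embeddings
        )

def respond(gate, prompt_embedding, answer_fn):
    """Refuse if the prompt matches a forget request, otherwise answer normally."""
    if gate.should_refuse(prompt_embedding):
        return "I can't help with that."
    return answer_fn()

gate = ForgetGate(threshold=0.9)
before = respond(gate, [0.99, 0.1], lambda: "normal answer")
gate.add_forget_request([1.0, 0.0])
after = respond(gate, [0.99, 0.1], lambda: "normal answer")
unrelated = respond(gate, [0.0, 1.0], lambda: "normal answer")
```

Because only the list of stored embeddings changes, answers to unrelated prompts are unaffected however many requests accumulate, which is the preservation property the abstract emphasizes.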


cs.CL / 54 / 2604.14651

CURA: Clinical Uncertainty Risk Alignment for Language Model-Based Risk Prediction

CURA：基于语言模型的临床不确定性风险对齐用于风险预测

Wang, Sizhe, Xu, Ziqi, Najjuuko, Claire, Alba, Charles, Lu, Chenyang

Abstract

Clinical language models (LMs) are increasingly applied to support clinical risk prediction from free-text notes, yet their uncertainty estimates often remain poorly calibrated and clinically unreliable. In this work, we propose Clinical Uncertainty Risk Alignment (CURA), a framework that aligns clinical LM-based risk estimates and uncertainty with both individual error likelihoods and cohort-level ambiguities. CURA first fine-tunes domain-specific clinical LMs to obtain task-adapted patient embeddings, and then performs uncertainty fine-tuning of a multi-head classifier using a bi-level uncertainty objective. Specifically, an individual-level calibration term aligns predictive uncertainty with each patient's likelihood of error, while a cohort-aware regularizer pulls risk estimates toward event rates in their local neighborhoods in the embedding space and places extra weight on ambiguous cohorts near the decision boundary. We further show that this cohort-aware term can be interpreted as a cross-entropy loss with neighborhood-informed soft labels, providing a label-smoothing view of our method. Extensive experiments on MIMIC-IV clinical risk prediction tasks across various clinical LMs show that CURA consistently improves calibration metrics without substantially compromising discrimination. Further analysis illustrates that CURA reduces overconfident false reassurance and yields more trustworthy uncertainty estimates for downstream clinical decision support.

Chinese Translation

临床语言模型（LMs）越来越多地应用于支持从自由文本笔记中进行临床风险预测，但它们的不确定性估计往往校准不佳且在临床上不可靠。在本研究中，我们提出了临床不确定性风险对齐（CURA），这是一个将基于临床LM的风险估计和不确定性与个体错误可能性和群体级模糊性对齐的框架。CURA首先对特定领域的临床LM进行微调，以获得适应任务的患者嵌入，然后使用双层不确定性目标对多头分类器进行不确定性微调。具体而言，个体级校准项将预测不确定性与每位患者的错误可能性对齐，而群体感知正则化器则将风险估计拉向其嵌入空间中局部邻域的事件发生率，并对靠近决策边界的模糊群体施加额外权重。我们进一步表明，这个群体感知项可以解释为带有邻域信息的软标签的交叉熵损失，为我们的方法提供了标签平滑的视角。在各种临床LM上进行的MIMIC-IV临床风险预测任务的广泛实验表明，CURA在不显著妨碍区分能力的情况下，持续改善了校准指标。进一步分析表明，CURA减少了过度自信的错误安慰，并为下游临床决策支持提供了更可靠的不确定性估计。
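
The label-smoothing view of the cohort-aware term rests on cross-entropy being linear in its target, which a short numerical check makes visible. This sketch is an assumption-laden illustration rather than CURA's objective: the mixing weight `lam`, the use of k nearest neighbors excluding the patient itself, and the synthetic embeddings are all hypothetical, and the extra weighting of ambiguous cohorts is omitted.

```python
import numpy as np

def knn_event_rate(embeddings, labels, k):
    """Mean label among each point's k nearest neighbors (self excluded)."""
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]
    return labels[neighbors].mean(axis=1)

def bce(target, p, eps=1e-12):
    """Binary cross-entropy that also accepts soft targets in [0, 1]."""
    p = np.clip(p, eps, 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p))

# Synthetic patient embeddings, outcomes, and predicted risks.
rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 4))
y = (emb[:, 0] > 0).astype(float)
p = 1 / (1 + np.exp(-emb[:, 0]))

lam = 0.3                                  # hypothetical regularizer weight
rate = knn_event_rate(emb, y, k=5)

# Hard-label loss plus a pull toward the neighborhood event rate ...
combined = (1 - lam) * bce(y, p) + lam * bce(rate, p)
# ... equals a single cross-entropy against neighborhood-informed soft labels.
soft_labels = (1 - lam) * y + lam * rate
smoothed = bce(soft_labels, p)
```

The two per-patient losses agree to numerical precision, so the regularized objective and the soft-label objective give the same gradients with respect to `p`.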


cs.CL / 55 / 2604.14672

SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models

SPAGBias：揭示和追踪大型语言模型中的结构化空间性别偏见

Su, Binxian, Lou, Haoye, Zhu, Shucheng, Wang, Weikang, Liu, Ying, Yu, Dong, Liu, Pengyuan

Abstract

Large language models (LLMs) are being increasingly used in urban planning, but since gendered space theory highlights how gender hierarchies are embedded in spatial organization, there is concern that LLMs may reproduce or amplify such biases. We introduce SPAGBias, the first systematic framework to evaluate spatial gender bias in LLMs. It combines a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers: explicit (forced-choice resampling), probabilistic (token-level asymmetry), and constructional (semantic and narrative role analysis). Testing six representative models, we identify structured gender-space associations that go beyond the public-private divide, forming nuanced micro-level mappings. Story generation reveals how emotion, wording, and social roles jointly shape "spatial gender narratives". We also examine how prompt design, temperature, and model scale influence bias expression. Tracing experiments indicate that these patterns are embedded and reinforced across the model pipeline (pre-training, instruction tuning, and reward modeling), with model associations found to substantially exceed real-world distributions. Downstream experiments further reveal that such biases produce concrete failures in both normative and descriptive application settings. This work connects sociological theory with computational analysis, extending bias research into the spatial domain and uncovering how LLMs encode social gender cognition through language.

Chinese Translation

大型语言模型（LLMs）在城市规划中的应用日益增多，但由于性别空间理论强调性别等级如何嵌入空间组织，因此人们担心LLMs可能会再现或放大这种偏见。我们提出了SPAGBias——第一个系统性框架，用于评估LLMs中的空间性别偏见。该框架结合了62种城市微空间的分类法、一个提示库以及三个诊断层次：显性（强制选择重采样）、概率性（令牌级不对称）和构造性（语义和叙事角色分析）。通过测试六个代表性模型，我们识别出超越公共与私人划分的结构化性别-空间关联，形成细致的微观级别映射。故事生成揭示了情感、措辞和社会角色如何共同塑造“空间性别叙事”。我们还考察了提示设计、温度和模型规模如何影响偏见的表达。追踪实验表明，这些模式在模型流程（预训练、指令调优和奖励建模）中嵌入并得到强化，模型关联的程度显著超过现实世界的分布。下游实验进一步揭示，这种偏见在规范性和描述性应用场景中产生了具体的失败。该研究将社会学理论与计算分析相结合，将偏见研究扩展到空间领域，并揭示LLMs如何通过语言编码社会性别认知。


cs.CL / 56 / 2604.14749

Which bird does not have wings: Negative-constrained KGQA with Schema-guided Semantic Matching and Self-directed Refinement

哪种鸟没有翅膀：带有模式引导语义匹配和自我导向精炼的负约束知识图谱问答

Shim, Midan, Hwang, Seokju, Um, Kaehyun, Lee, Kyong-Ho

Abstract

Large language models still struggle with faithfulness and hallucinations despite their remarkable reasoning abilities. In Knowledge Graph Question Answering (KGQA), semantic parsing-based approaches address the limitations by understanding constraints in a user's question and converting them into a logical form to execute on a knowledge graph. However, existing KGQA benchmarks and methods are biased toward positive and calculation constraints. Negative constraints are neglected, although they frequently appear in real-world questions. In this paper, we introduce a new task, NEgative-conSTrained (NEST) KGQA, where each question contains at least one negative constraint, and a corresponding dataset, NestKGQA. We also design PyLF, a Python-formatted logical form, since existing logical forms are hardly suitable to express negation clearly while maintaining readability. Furthermore, NEST questions naturally contain multiple constraints. To mitigate their semantic complexity, we present a novel framework named CUCKOO, specialized to multiple-constrained questions and ensuring semantic executability. CUCKOO first generates a constraint-aware logical form draft and performs schema-guided semantic matching. It then selectively applies self-directed refinement only when executing improper logical forms yields an empty result, reducing cost while improving robustness. Experimental results demonstrate that CUCKOO consistently outperforms baselines on both conventional KGQA and NEST-KGQA benchmarks under few-shot settings.

Chinese Translation

尽管大型语言模型在推理能力上表现出色，但在忠实性和幻觉方面仍然存在困难。在知识图谱问答（KGQA）中，基于语义解析的方法通过理解用户问题中的约束并将其转换为逻辑形式以在知识图谱上执行，从而解决了这些局限性。然而，现有的KGQA基准和方法偏向于正约束和计算约束，负约束则被忽视，尽管它们在现实世界的问题中经常出现。本文介绍了一项新任务，负约束知识图谱问答（NEgative-conSTrained, NEST）KGQA，其中每个问题至少包含一个负约束，并相应地构建了数据集NestKGQA。我们还设计了PyLF，一种Python格式的逻辑形式，因为现有的逻辑形式很难清晰地表达否定，同时保持可读性。此外，NEST问题自然包含多个约束。为了减轻其语义复杂性，我们提出了一种新框架CUCKOO，专门针对多约束问题并确保语义可执行性。CUCKOO首先生成一个约束感知的逻辑形式草稿，并执行模式引导的语义匹配。然后，仅在执行不当的逻辑形式导致空结果时，选择性地应用自我导向的精炼，从而降低成本并提高鲁棒性。实验结果表明，CUCKOO在常规KGQA和NEST-KGQA基准下，在少量样本设置中始终优于基线。
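
The control flow described for CUCKOO (draft, schema matching, execution, and refinement only on an empty result) can be sketched with plain callables. All names, the tuple-shaped logical form, and the three-entity toy graph are hypothetical; the sketch does not reproduce PyLF or the paper's actual matching and refinement steps.

```python
# Toy knowledge graph: entity -> set of schema items it satisfies.
KG = {"e1": {"bird", "has_wings"}, "e2": {"bird"}, "e3": {"fish"}}

def execute(logical_form):
    """Entities that satisfy the positive item and lack the negated item."""
    must_have, must_not_have = logical_form
    return sorted(
        e for e, props in KG.items()
        if must_have in props and must_not_have not in props
    )

def answer(question, draft_fn, match_fn, refine_fn, max_refinements=1):
    """Draft, match to the schema, execute; refine only if nothing comes back."""
    logical_form = match_fn(draft_fn(question))
    result = execute(logical_form)
    refinements = 0
    while not result and refinements < max_refinements:
        logical_form = refine_fn(question, logical_form)
        result = execute(logical_form)
        refinements += 1
    return result, refinements

question = "Which bird does not have wings?"
draft = lambda q: ("avian", "has_wings")            # draft uses a non-schema term
schema_match = lambda lf: ({"avian": "bird"}.get(lf[0], lf[0]), lf[1])
no_match = lambda lf: lf
refine = lambda q, lf: ("bird", lf[1])

cheap = answer(question, draft, schema_match, refine)   # matching suffices
costly = answer(question, draft, no_match, refine)      # falls back to refinement
```

Both calls return the same entity, but only the second pays for a refinement step, which is the cost saving that selective application is meant to deliver.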


cs.CL / 57 / 2604.14773

CoPA: Benchmarking Personalized Question Answering with Data-Informed Cognitive Factors

CoPA：基于数据驱动的认知因素评估个性化问答的基准

Su, Hang, Liu, Zequn, Hu, Chen, Lu, Xuesong, Xia, Yingce, Liu, Zhen

Abstract

While LLMs have demonstrated remarkable potential in Question Answering (QA), evaluating personalization remains a critical bottleneck. Existing paradigms predominantly rely on lexical-level similarity or manual heuristics, often lacking sufficient data-driven validation. We address this by mining Community-Individual Preference Divergence (CIPD), where individual choices override consensus, to distill six key personalization factors as evaluative dimensions. Accordingly, we introduce CoPA, a benchmark with 1,985 user profiles for fine-grained, factor-level assessment. By quantifying the alignment between model outputs and user-specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The code is available at https://github.com/bjzgcai/CoPA.

Chinese Translation

尽管大型语言模型（LLMs）在问答（QA）领域展现了显著的潜力，但评估个性化仍然是一个关键瓶颈。现有的范式主要依赖于词汇层面的相似性或人工启发式方法，往往缺乏足够的数据驱动验证。我们通过挖掘社区-个体偏好差异（Community-Individual Preference Divergence, CIPD），在个体选择优先于共识的情况下，提炼出六个关键的个性化因素作为评估维度。因此，我们引入了CoPA，一个包含1,985个用户档案的基准，用于细粒度的因素级评估。通过量化模型输出与从交互模式推断的用户特定认知偏好的对齐程度，CoPA提供了比通用指标更全面和更具区分度的个性化问答评估标准。代码可在 https://github.com/bjzgcai/CoPA 获取。


cs.CL / 58 / 2604.14799

Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

知道何时不回答：评估多模态推理系统中的弃权

Madhusudhan, Nishanth, Yadav, Vikas, Lacoste, Alexandre

Abstract

Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.

Chinese Translation

有效的弃权（Effective Abstention, EA）在识别证据不足并避免回答方面，对于可靠的多模态系统至关重要。然而，现有的视觉-语言模型（Vision-Language Models, VLMs）和多智能体系统（Multi-Agent Systems, MAS）的评估范式假设答案的可得性，迫使模型始终作出回应。弃权在仅文本的环境中已有研究，但在多模态环境中仍然未得到充分探索；当前的基准测试要么忽视不可回答性，要么依赖粗略的方法，无法捕捉现实中的失败模式。我们引入了MM-AQA，一个基准测试，通过在视觉模态依赖性和证据充分性两个维度上对可回答实例进行变换，构建不可回答的实例。对三种前沿的VLM进行评估，涵盖了封闭源和开源模型，以及两种MAS架构，共2079个样本，我们发现：（1）在标准提示下，VLM很少弃权；即使是简单的置信度基线也优于该设置；（2）MAS提高了弃权能力，但引入了准确性与弃权之间的权衡；（3）顺序设计的表现与迭代变体相当或更好，表明瓶颈在于错误校准而非推理深度；（4）当图像或文本证据缺失时，模型会选择弃权，但会尝试用降级或矛盾的证据进行调和。有效的多模态弃权需要关注弃权的训练，而不是更好的提示或更多的智能体。


cs.CL / 59 / 2604.14808

Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem

将大型语言模型的遗忘建模为不对称双任务学习问题

Xiao, Zeguan, Li, Siqing, Wang, Yong, Wei, Xuetao, Yang, Jian, Chen, Yun, Chen, Guanhua

Abstract

Machine unlearning for large language models (LLMs) aims to remove targeted knowledge while preserving general capability. In this paper, we recast LLM unlearning as an asymmetric two-task problem: retention is the primary objective and forgetting is an auxiliary one. From this perspective, we propose a retention-prioritized gradient synthesis framework that decouples task-specific gradient extraction from conflict-aware combination. Instantiating the framework, we adapt the established PCGrad method to resolve gradient conflicts, and introduce SAGO, a novel retention-prioritized gradient synthesis method. Theoretically, both variants ensure non-negative cosine similarity with the retain gradient, while SAGO achieves strictly tighter alignment through constructive sign-constrained synthesis. Empirically, on WMDP Bio/Cyber and RWKU benchmarks, SAGO consistently pushes the Pareto frontier: e.g., on WMDP Bio (SimNPO+GD), recovery of target model MMLU performance progresses from 44.6% (naive) to 94.0% (+PCGrad) and further to 96.0% (+SAGO), while maintaining comparable forgetting strength. Our results show that re-shaping gradient geometry, rather than re-balancing losses, is the key to mitigating unlearning-retention trade-offs.

Chinese Translation

大型语言模型（LLMs）的机器遗忘旨在去除特定知识，同时保持一般能力。在本文中，我们将LLM遗忘重新表述为一个不对称的双任务问题：保留是主要目标，而遗忘是辅助目标。从这个角度出发，我们提出了一种以保留为优先的梯度合成框架，该框架将任务特定的梯度提取与冲突感知的组合解耦。我们在该框架下，调整了已有的PCGrad以解决梯度冲突，并引入了一种新颖的以保留为优先的梯度合成方法SAGO。从理论上讲，这两种变体都确保与保留梯度的余弦相似度为非负，而SAGO通过构造性符号约束合成实现了更严格的对齐。在经验上，在WMDP Bio/Cyber和RWKU基准测试中，SAGO始终推动帕累托前沿：例如，在WMDP Bio（SimNPO+GD）上，目标模型MMLU性能的恢复从44.6%（简单方法）提升至94.0%（+PCGrad），进一步提升至96.0%（+SAGO），同时保持可比的遗忘强度。我们的结果表明，重塑梯度几何结构而非重新平衡损失是缓解遗忘与保留权衡的关键。

cs.CL / 60 / 2604.14815

Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations

在芬兰组织病理报告上对FinBERT进行领域微调：训练时信号与下游关联

Luisto, Rami, Petäinen, Liisa, Grönholm, Tommi, Böhm, Jan, Ahtiainen, Maarit, Lilja, Tomi, Pölönen, Ilkka, Äyrämö, Sami

Abstract

In NLP classification tasks where little labeled data exists, domain fine-tuning of transformer models on unlabeled data is an established approach. In this paper we have two aims. (1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data. (2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning. Our driving motivation is the common situation in healthcare AI where we might experience long delays in acquiring datasets, especially with respect to labels.

Chinese Translation

在标注数据稀缺的自然语言处理分类任务中，基于未标注数据对变换模型进行领域微调是一种成熟的方法。本文有两个目标：(1) 我们描述了在芬兰医学文本数据上对芬兰BERT模型进行微调的观察结果。(2) 我们报告了通过观察由于领域微调而导致的嵌入变化几何形状，来预测芬兰BERT领域特定预训练的益处。我们的主要动机是医疗人工智能中常见的情况，即在获取数据集时可能会经历长时间的延迟，尤其是在标签方面。

cs.CL / 61 / 2604.14828

Pangu-ACE: Adaptive Cascaded Experts for Educational Response Generation on EduBench

Pangu-ACE：用于EduBench教育响应生成的自适应级联专家

Li, Dinghao, Zhou, Wenlong, Chen, Zhimin, Peng, Yuehan, Ni, Hong, Zou, Chengfu, Shi, Guoyu, Li, Yaochen

Abstract

Educational assistants should spend more computation only when the task needs it. This paper rewrites our earlier draft around the system that was actually implemented and archived in the repository: a sample-level 1B to 7B cascade for the shared-8 EduBench benchmark. The final system, Pangu-ACE, uses a 1B tutor-router to produce a draft answer plus routing signals, then either accepts the draft or escalates the sample to a 7B specialist prompt. We also correct a major offline evaluation bug: earlier summaries over-credited some open-form outputs that only satisfied superficial format checks. After CPU-side rescoring from saved prediction JSONL, the full Chinese test archive (7013 samples) shows that cascade_final improves deterministic quality from 0.457 to 0.538 and format validity from 0.707 to 0.866 over the legacy rule_v2 system while accepting 19.7% of requests directly at 1B. Routing is strongly task dependent: IP is accepted by 1B 78.0% of the time, while QG and EC still escalate almost always. The current archived deployment does not yet show latency gains, so the defensible efficiency story is routing selectivity rather than wall-clock speedup. We also package a reproducible artifact-first paper workflow and clarify the remaining external-baseline gap: GPT-5.4 re-judging is implemented locally, but the configured provider endpoint and key are invalid, so final sampled-baseline alignment with GPT-5.4 remains pending infrastructure repair.
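The accept-or-escalate routing described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the archived Pangu-ACE code: the single `confidence` routing signal, the `threshold` value, and both model stand-ins are hypothetical, and the per-task confidences are made up so that IP is accepted while QG and EC escalate.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    answer: str
    confidence: float  # hypothetical routing signal in [0, 1]


def small_model(task, prompt):
    # Stand-in for the 1B tutor-router; confidences are invented for illustration.
    confidence = {"IP": 0.9, "QG": 0.2, "EC": 0.3}.get(task, 0.5)
    return Draft(answer=f"[1B draft] {prompt}", confidence=confidence)


def large_model(task, prompt):
    # Stand-in for the 7B specialist prompt.
    return f"[7B specialist:{task}] {prompt}"


def cascade(task, prompt, threshold=0.8):
    """Accept the 1B draft when its routing signal is high enough, else escalate."""
    draft = small_model(task, prompt)
    if draft.confidence >= threshold:
        return draft.answer, "accepted_1b"
    return large_model(task, prompt), "escalated_7b"


requests = [("IP", "a"), ("QG", "b"), ("EC", "c"), ("IP", "d")]
routes = [cascade(task, prompt)[1] for task, prompt in requests]
accept_rate = routes.count("accepted_1b") / len(routes)
print(routes, accept_rate)
```

The acceptance rate computed at the end is the kind of routing-selectivity figure the abstract reports (19.7% overall, 78.0% for IP); the value printed here reflects only the toy inputs.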

Chinese Translation

教育助手应在任务需要时才增加计算量。本文围绕实际实施并存档于仓库的系统重写了我们早期的草稿：针对共享的8个EduBench基准的样本级1B到7B的级联。最终系统Pangu-ACE使用1B的导师路由器生成草拟答案及路由信号，然后接受草拟答案或将样本升级到7B的专家提示。我们还修正了一个主要的离线评估错误：早期的总结过度认可了一些仅满足表面格式检查的开放式输出。在对保存的预测JSONL进行CPU端重新评分后，完整的中文测试档案（7013个样本）显示，级联最终版本在确定性质量上从0.457提高到0.538，在格式有效性上从0.707提高到0.866，相较于传统的rule_v2系统，同时在1B上直接接受了19.7%的请求。路由强烈依赖于任务：IP请求在1B上被接受的概率为78.0%，而QG和EC几乎总是升级。当前存档的部署尚未显示延迟收益，因此可辩护的效率故事是路由选择性，而非时钟速度的提升。我们还打包了一个可重复的文献工作流程，并澄清了剩余的外部基准差距：GPT-5.4的重新评估在本地实施，但配置的提供者端点和密钥无效，因此最终的样本基准与GPT-5.4的对齐仍待基础设施修复。

cs.CL / 62 / 2604.14843

Exploring and Testing Skill-Based Behavioral Profile Annotation: Human Operability and LLM Feasibility under Schema-Guided Execution

探索与测试基于技能的行为特征标注：在模式引导执行下的人类可操作性与大型语言模型的可行性

Wu, Yufeng

Abstract

Behavioral Profile (BP) annotation is difficult to automate because it requires simultaneous coding across multiple linguistic dimensions. We treat BP annotation as a bundle of annotation skills rather than a single task and evaluate LLM-assisted BP annotation from this perspective. Using 3,134 concordance lines of 30 Chinese metaphorical color-term derivatives and a 14-feature BP schema, we implement a skill-file-driven pipeline in which each feature is externally defined through schema files, decision rules, and examples. Two human annotators completed a two-round schema-only protocol on a 300-instance validation subset, enabling BP skills to be classified as directly operable, recoverable under focused re-annotation, or structurally underspecified. GPT-5.4 and three locally deployable open-source models were then evaluated under the same setup. Results show that BP annotation is highly heterogeneous at the skill level: 5 skills are directly operable, 4 are recoverable after focused re-annotation, and 5 remain structurally underspecified. GPT-5.4 executes the retained skills with substantial reliability (accuracy = 0.678, κ = 0.665, weighted F1 = 0.695), but this feasibility is selective rather than global. Human and GPT difficulty profiles are strongly aligned at the skill level (r = 0.881), but not at the instance level (r = 0.016) or lexical-item level (r = -0.142), a pattern we describe as shared taxonomy, independent execution. Pairwise agreement further suggests that GPT is better understood as an independent third skill voice than as a direct human substitute. Open-source failures are concentrated in schema-to-skill execution problems. These findings suggest that automatic annotation should be evaluated in terms of skill feasibility rather than task-level automation.

Chinese Translation

行为特征（BP）标注难以自动化，因为它需要在多个语言维度上同时编码。我们将BP标注视为一组标注技能，而非单一任务，并从这一角度评估大型语言模型（LLM）辅助的BP标注。使用3134条来自30个中文隐喻色彩词派生词的共现线和一个14特征的BP模式，我们实施了一个基于技能文件的管道，其中每个特征通过模式文件、决策规则和示例进行外部定义。两位人类标注者在300个实例的验证子集上完成了两轮仅基于模式的协议，使BP技能被分类为直接可操作的、在集中重新标注下可恢复的或结构上未指定的。随后，在相同设置下评估了GPT-5.4和三个本地可部署的开源模型。结果表明，BP标注在技能层面上高度异质：5项技能是直接可操作的，4项在集中重新标注后可恢复，5项则仍然结构上未指定。GPT-5.4以相当可靠的方式执行保留的技能（准确率=0.678，κ=0.665，加权F1=0.695），但这种可行性是选择性的而非普遍的。人类与GPT的难度特征在技能层面上高度一致（r=0.881），但在实例层面（r=0.016）或词汇项层面（r=-0.142）则不一致，这一模式我们称之为共享分类法、独立执行。成对一致性进一步表明，GPT更应被理解为独立的第三种技能声音，而非直接的人类替代品。开源模型的失败主要集中在模式到技能的执行问题上。这些发现表明，自动标注应从技能可行性的角度进行评估，而非任务层面的自动化。

cs.CL / 63 / 2604.14856

ClimateCause: Complex and Implicit Causal Structures in Climate Reports

气候原因：气候报告中的复杂和隐含因果结构

Allein, Liesbeth, Pineda-Castañeda, Nataly, Rocci, Andrea, Moens, Marie-Francine

Abstract

Understanding climate change requires reasoning over complex causal networks. Yet, existing causal discovery datasets predominantly capture explicit, direct causal relations. We introduce ClimateCause, a manually expert-annotated dataset of higher-order causal structures from science-for-policy climate reports, including implicit and nested causality. Cause-effect expressions are normalized and disentangled into individual causal relations to facilitate graph construction, with unique annotations for cause-effect correlation, relation type, and spatiotemporal context. We further demonstrate ClimateCause's value for quantifying readability based on the semantic complexity of causal graphs underlying a statement. Finally, large language model benchmarking on correlation inference and causal chain reasoning highlights the latter as a key challenge.

Chinese Translation

理解气候变化需要对复杂的因果网络进行推理。然而，现有的因果发现数据集主要捕捉显性、直接的因果关系。我们引入了ClimateCause，这是一个由专家手动注释的高阶因果结构数据集，来源于政策科学气候报告，包括隐含和嵌套因果关系。因果关系表达式经过标准化并解构为单独的因果关系，以便于图形构建，并为因果关系的相关性、关系类型和时空背景提供独特的注释。我们进一步展示了ClimateCause在基于因果图的语义复杂性量化可读性方面的价值。最后，大型语言模型在相关性推断和因果链推理的基准测试中突显了后者作为一个关键挑战。

cs.CL / 64 / 2604.14862

Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding

作为约束解码下结构生成指令通道的模式关键字

Le, Yifan

Abstract

Constrained decoding has been widely adopted for structured generation with large language models (LLMs), ensuring that outputs satisfy predefined formats such as JSON and XML. However, existing approaches largely treat schemas as purely structural constraints and overlook the possibility that their linguistic formulation may affect model behavior. In this work, we study how instruction placement influences model performance in structured generation and show that merely changing the wording of schema keys, without modifying the prompt or model parameters, can significantly alter model performance under constrained decoding. Based on this observation, we propose to reinterpret structured generation as a multi-channel instruction problem, where instructions can be conveyed explicitly through prompts and implicitly through schema keys during decoding. To the best of our knowledge, this is the first work to systematically study how schema key formulation acts as an implicit instruction channel and affects model performance under constrained decoding. Experiments on multiple mathematical reasoning benchmarks show that different model families exhibit distinct sensitivities to these instruction channels: Qwen models consistently benefit from schema-level instructions, while LLaMA models rely more heavily on prompt-level guidance. We further observe non-additive interaction effects between instruction channels, showing that combining multiple channels does not always lead to further improvement. These findings suggest that schema design not only determines output structure, but also carries instruction signals, offering a new perspective on structured generation in LLMs.
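To make the idea of schema keys as an instruction channel concrete, the sketch below builds two JSON Schemas that impose the same structural constraint and differ only in the wording of one key. The key names and the `make_schema` helper are hypothetical and chosen for illustration; the abstract does not list the wordings the paper actually tested.

```python
import json


def make_schema(answer_key):
    """Build a one-field JSON Schema; only the key wording varies."""
    return {
        "type": "object",
        "properties": {answer_key: {"type": "string"}},
        "required": [answer_key],
        "additionalProperties": False,
    }


# Hypothetical key wordings: one neutral, one that carries an instruction.
neutral = make_schema("answer")
instructive = make_schema("step_by_step_solution_then_final_answer")


def structure(schema):
    # Structural signature with the key names erased.
    return sorted(prop["type"] for prop in schema["properties"].values())


same_structure = structure(neutral) == structure(instructive)
print(same_structure)
print(json.dumps(instructive, indent=2))
```

A constrained decoder treats both schemas as the same grammar up to the literal key string, so any difference in model behavior between them would have to come from the key wording itself, which is the effect the abstract describes.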

Chinese Translation

约束解码已被广泛应用于大型语言模型（LLMs）的结构生成，确保输出符合预定义格式，如 JSON 和 XML。然而，现有方法主要将模式视为纯粹的结构约束，忽视了其语言表述可能影响模型行为的可能性。在本研究中，我们探讨了指令位置如何影响结构生成中的模型性能，并表明仅仅改变模式关键字的措辞，而不修改提示或模型参数，便可以显著改变模型在约束解码下的表现。基于这一观察，我们提出将结构生成重新解释为一个多通道指令问题，其中指令可以通过提示显式传达，也可以通过解码过程中的模式关键字隐式传达。据我们所知，这是首个系统研究模式关键字表述作为隐式指令通道并影响约束解码下模型性能的工作。在多个数学推理基准上的实验表明，不同模型家族对这些指令通道表现出不同的敏感性：Qwen 模型始终受益于模式级指令，而 LLaMA 模型则更依赖于提示级指导。我们进一步观察到指令通道之间存在非加性交互效应，表明组合多个通道并不总是会带来进一步的改善。这些发现表明，模式设计不仅决定输出结构，还携带指令信号，为 LLMs 中的结构生成提供了新的视角。

cs.CL / 65 / 2604.14865

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

面向强健有害意图探测的段级一致性研究

He, Xuanli, Sel, Bilgehan, Ali, Faizan, Bao, Jenny, Cunningham, Hoagy, Wei, Jerry

Abstract

Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from near-saturated baseline performance (AUROC = 97.40%). We also show that probing Attention or MLP activations consistently outperforms residual-stream features. Finally, even when adversarial fine-tuning enables novel character-level ciphers, harmful intent remains detectable: probes developed for the base LLMs can be applied "plug-and-play" to these obfuscated attacks, achieving an AUROC of over 98.85%.
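The contrast between isolated spikes and consistent multi-token evidence can be illustrated with a toy scoring rule. This is only a sketch of the intuition: the paper introduces a training objective for the probe, whereas the code below merely compares two ways of aggregating per-token scores at inference time. The window size, the per-token scores, and both function names are assumptions made for illustration.

```python
def max_token_score(scores):
    """Spike-based score: a single high-scoring token decides the outcome."""
    return max(scores)


def segment_score(scores, window=4):
    """Coherence-based score: the best average over any contiguous window."""
    if len(scores) < window:
        return sum(scores) / len(scores)
    return max(
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    )


# Made-up per-token probe scores.
benign = [0.05, 0.05, 0.95, 0.05, 0.05, 0.05]  # one sensitive term, benign context
harmful = [0.10, 0.70, 0.80, 0.75, 0.70, 0.20]  # sustained evidence across tokens

print(max_token_score(benign), max_token_score(harmful))
print(segment_score(benign), segment_score(harmful))
```

Under the spike-based score the benign sequence ranks above the harmful one, which is the false-alarm pattern the abstract describes; averaging over a segment reverses the ranking because one token cannot carry the whole window.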

Chinese Translation

大型语言模型（LLMs）越来越容易受到自适应越狱攻击，尤其是在高风险的化学、生物、放射性和核（CBRN）领域。尽管流式探测器能够实现实时监控，但仍然存在系统性错误。我们识别出一个核心问题：现有方法通常依赖于少数高分数的标记，这在敏感的CBRN术语出现在良性上下文中时会导致误报。为了解决这个问题，我们提出了一种流式探测目标，要求多个证据标记一致支持一个预测，而不是依赖孤立的峰值。这鼓励基于聚合信号而非单一标记线索的更强健检测。在固定的1%假阳性率下，我们的方法相较于强大的流式基线提高了35.55%的真阳性率。我们进一步观察到在AUROC上有显著提升，即使在接近饱和的基线性能（AUROC = 97.40%）下也如此。我们还展示了探测注意力（Attention）或多层感知机（MLP）激活的效果始终优于残差流特征。最后，即使对抗性微调使得新颖的字符级密码得以实现，有害意图仍然可以被检测：为基础LLMs开发的探测器可以“即插即用”地应用于这些模糊攻击，达到超过98.85%的AUROC。

cs.CL / 66 / 2604.14885

RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

RACER：检索增强的上下文快速推测解码

Zhang, Zihong, Li, Zuchao, Zhang, Lefei, Wang, Ping, Zhao, Hai

Abstract

Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than 2× speedup over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at https://github.com/hkr04/RACER.
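The guess-and-verify loop with a retrieval-based draft can be sketched as below. This is a generic toy of training-free speculative decoding, not RACER itself: the logit-driven cues are omitted, `target_next` is a stand-in for a real LLM's greedy next-token choice, and the n-gram lookup over the running context is an assumed retrieval scheme. A real system would verify all drafted tokens in one parallel forward pass; the loop here calls the toy model token by token for clarity.

```python
def target_next(context):
    # Toy "target model": cycles through a fixed phrase based on position.
    phrase = ["a", "b", "c", "d"]
    return phrase[len(context) % len(phrase)]


def retrieve_draft(context, ngram=2, max_len=4):
    """Copy what followed the most recent earlier occurrence of the last n-gram."""
    if len(context) < ngram:
        return []
    key = context[-ngram:]
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == key:
            return context[start + ngram:start + ngram + max_len]
    return []  # no exact match: the retrieval draft is empty


def speculative_step(context):
    """Accept the longest drafted prefix the target agrees with, plus one target token."""
    accepted = []
    for token in retrieve_draft(context):
        if target_next(context + accepted) != token:
            break
        accepted.append(token)
    # Always emit one token from the target so every step makes progress.
    accepted.append(target_next(context + accepted))
    return accepted


context = ["a", "b", "c", "d", "a", "b"]
print(speculative_step(context))     # several tokens emitted in one step
print(speculative_step(["x", "y"]))  # no match: falls back to a single token
```

The second call shows the weakness the abstract attributes to retrieval-based drafts: with no exact match, the step degenerates to ordinary one-token decoding.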

Chinese Translation

大型语言模型（LLMs）中的自回归解码每一步生成一个标记，这导致了高推理延迟。推测解码（SD）通过猜测和验证策略缓解了这一问题，但现有的无训练变体面临权衡：基于检索的草稿在没有精确匹配时会失效，而基于logits的草稿缺乏结构指导。我们提出了RACER（Retrieval-Augmented Contextual Rapid Speculative Decoding），这是一种轻量级且无训练的方法，结合了检索到的精确模式与基于logit的未来线索。这种统一提供了可靠的锚点和灵活的外推，产生了更丰富的推测草稿。在Spec-Bench、HumanEval和MGSM-ZH上的实验表明，RACER始终加速推理，相较于自回归解码实现了超过2倍的加速，并且优于之前的无训练方法，提供了一种可扩展的即插即用解决方案，以实现高效的LLM解码。我们的源代码可在 https://github.com/hkr04/RACER 获取。

cs.CL / 67 / 2604.14888

Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

推理动态与视觉语言模型中监控模态依赖的限制

Villegas, Danae Sánchez, Lewis-Lim, Samuel, Aletras, Nikolaos, Elliott, Desmond

Abstract

Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.

Chinese Translation

最近在视觉语言模型（VLMs）方面的进展提供了推理能力，但这些能力如何展开以及如何整合视觉和文本信息仍然不清楚。我们分析了18个VLM的推理动态，涵盖了来自两种不同模型家族的指令调优和推理训练模型。我们跟踪了在思维链（Chain-of-Thought, CoT）上的信心，测量了推理的纠正效果，并评估了中间推理步骤的贡献。我们发现模型容易出现答案惯性，即对预测的早期承诺在推理步骤中得到强化，而不是修正。尽管推理训练模型表现出更强的纠正行为，但它们的收益依赖于模态条件，从文本主导到仅视觉设置。通过使用带有误导性文本提示的控制干预，我们表明，即使视觉证据充足，模型仍然受到这些提示的一致影响，并评估这种影响是否可以从思维链中恢复。尽管这种影响可能出现在思维链中，但其可检测性在不同模型之间有所不同，并依赖于监控的内容。推理训练模型更可能明确提及这些提示，但它们较长且流畅的思维链仍然可能看似基于视觉，而实际上遵循文本提示，从而模糊了模态依赖。相比之下，指令调优模型对提示的提及较少，但其较短的轨迹揭示了与视觉输入的不一致性。综合来看，这些发现表明，思维链仅提供了不同模态如何驱动VLM决策的部分视角，这对多模态系统的透明性和安全性具有重要意义。

cs.CL / 68 / 2604.14907

Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task

现代多语言文本嵌入技术在仇恨言论检测任务中的比较

Vaiciukynas, Evaldas, Danenas, Paulius, Ablonskis, Linas, Sukys, Algirdas, Dambrauskas, Edgaras, Zitkus, Voldemaras, Butkiene, Rita, Butleris, Rimantas

Abstract

Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. This paper investigates to what extent modern multilingual sentence embedding models can support accurate hate speech detection in Lithuanian, Russian, and English, and how their performance depends on downstream modeling choices and feature dimensionality. We introduce LtHate, a new Lithuanian hate speech corpus derived from news portals and social networks, and benchmark six modern multilingual encoders (potion, gemma, bge, snow, jina, e5) on LtHate, RuToxic, and EnSuperset using a unified Python pipeline. For each embedding, we train both a one class HBOS anomaly detector and a two class CatBoost classifier, with and without principal component analysis (PCA) compression to 64-dimensional feature vectors. Across all datasets, two class supervised models consistently and substantially outperform one class anomaly detection, with the best configurations achieving up to 80.96% accuracy and AUC ROC of 0.887 in Lithuanian (jina), 92.19% accuracy and AUC ROC of 0.978 in Russian (e5), and 77.21% accuracy and AUC ROC of 0.859 in English (e5 with PCA). PCA compression preserves almost all discriminative power in the supervised setting, while showing some negative impact for the unsupervised anomaly detection case. These results demonstrate how modern multilingual sentence embeddings combined with gradient boosted decision trees provide robust soft-computing solutions for multilingual hate speech detection applications.

Chinese Translation

在线仇恨言论和辱骂性语言对内容审核提出了日益严峻的挑战，尤其是在多语言环境和资源匮乏的语言（如立陶宛语）中。本文探讨了现代多语言句子嵌入模型在立陶宛语、俄语和英语中支持准确仇恨言论检测的程度，以及它们的性能如何依赖于下游建模选择和特征维度。我们引入了LtHate，一个新的立陶宛语仇恨言论语料库，来源于新闻门户和社交网络，并在LtHate、RuToxic和EnSuperset上使用统一的Python管道对六种现代多语言编码器（potion、gemma、bge、snow、jina、e5）进行了基准测试。对于每种嵌入，我们训练了一个单类HBOS异常检测器和一个双类CatBoost分类器，并在有和没有主成分分析（PCA）压缩到64维特征向量的情况下进行比较。在所有数据集中，双类监督模型始终显著优于单类异常检测，最佳配置在立陶宛语（jina）中达到80.96%的准确率和0.887的AUC ROC，在俄语（e5）中达到92.19%的准确率和0.978的AUC ROC，在英语（e5与PCA）中达到77.21%的准确率和0.859的AUC ROC。PCA压缩在监督设置中几乎保留了所有的区分能力，而在无监督异常检测情况下则表现出一些负面影响。这些结果展示了现代多语言句子嵌入与梯度提升决策树相结合，为多语言仇恨言论检测应用提供了稳健的软计算解决方案。

cs.CL / 69 / 2604.14930

IE as Cache: Information Extraction Enhanced Agentic Reasoning

信息提取作为缓存：增强代理推理的信息提取

Lv, Hang, Liang, Sheng, Gu, Hongchao, Guo, Wei, Lian, Defu, Liu, Yong, Wang, Hao, Chen, Enhong

Abstract

Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. Moving beyond this, we propose IE-as-Cache, a framework that repurposes IE as a cognitive cache to enhance agentic reasoning. Drawing inspiration from hierarchical computer memory, our approach combines query-driven extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise. Experiments on challenging benchmarks across diverse LLMs demonstrate significant improvements in reasoning accuracy, indicating that IE can be effectively repurposed as a reusable cognitive resource and offering a promising direction for future research on downstream uses of IE.

Chinese Translation

信息提取旨在从非结构化文本中提炼出结构化的、与决策相关的信息，为下游理解和推理提供基础。然而，传统上它仅被视为一个终端目标：一旦提取，所得到的结构往往是孤立使用，而不是在多步推理过程中被维护和重用。为此，我们提出了IE-as-Cache框架，将信息提取重新利用为认知缓存，以增强代理推理。我们的做法受到分层计算机内存的启发，结合了基于查询的提取与缓存感知推理，以动态维护紧凑的中间信息并过滤噪声。在多种大型语言模型（LLMs）上进行的挑战性基准实验表明，推理准确性显著提高，这表明信息提取可以有效地重新利用作为可重用的认知资源，并为未来信息提取的下游应用研究提供了一个有前景的方向。

cs.CL / 70 / 2604.14934

XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics

XQ-MEval：一个具有跨语言平行质量的数据集，用于翻译指标的基准测试

Liu, Jingxuan, Qu, Zhi, Tei, Jin, Kamigaito, Hidetaka, Liu, Lemao, Watanabe, Taro

Abstract

Automatic evaluation metrics are essential for building multilingual translation systems. The common practice of evaluating these systems is averaging metric scores across languages, yet this is suspicious since metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark exists that provides parallel-quality instances across languages, and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we inject MQM-defined errors into gold translations automatically, filter them by native speakers for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are then paired with corresponding sources and references to form triplets used in assessing the qualities of translation metrics. Using XQ-MEval, our experiments on nine representative metrics reveal the inconsistency between averaging and human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy derived from XQ-MEval that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.
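The effect of aligning score distributions across languages can be illustrated with per-language standardization. This is a sketch under a loud assumption: the abstract does not specify the normalization strategy derived from XQ-MEval, so a plain z-score per language is used as a stand-in, and the language codes and metric scores are invented.

```python
from statistics import mean, pstdev


def normalize_per_language(scores_by_lang):
    """Standardize metric scores within each language (assumed stand-in strategy)."""
    normalized = {}
    for lang, scores in scores_by_lang.items():
        mu, sigma = mean(scores), pstdev(scores)
        normalized[lang] = [(s - mu) / sigma if sigma else 0.0 for s in scores]
    return normalized


# Made-up metric scores for translations of equal quality (low, medium, high)
# in two target languages; the metric scores one language systematically lower.
raw = {"de": [0.80, 0.85, 0.90], "ja": [0.60, 0.65, 0.70]}
normalized = normalize_per_language(raw)

print(mean(raw["de"]), mean(raw["ja"]))  # raw averages differ despite equal quality
print(normalized["de"])
print(normalized["ja"])                  # aligned after normalization
```

In the raw scores, averaging across languages would penalize the system on the second language even though the quality levels match, which is the cross-lingual scoring bias the abstract describes.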

Chinese Translation

自动评估指标对于构建多语言翻译系统至关重要。评估这些系统的常见做法是对各语言的指标得分进行平均，然而这存在疑问，因为指标可能受到跨语言评分偏差的影响，即相同质量的翻译在不同语言中获得不同的得分。由于缺乏提供跨语言平行质量实例的基准，并且专家标注并不现实，这一问题尚未得到系统研究。在本研究中，我们提出了XQ-MEval，这是一个半自动构建的数据集，涵盖九个翻译方向，用于基准测试翻译指标。具体而言，我们自动将MQM定义的错误注入到黄金翻译中，通过母语者进行筛选以确保可靠性，并合并错误以生成具有可控质量的伪翻译。这些伪翻译随后与相应的源文本和参考文本配对，形成用于评估翻译指标质量的三元组。通过使用XQ-MEval，我们对九个代表性指标的实验揭示了平均值与人工判断之间的不一致性，并提供了跨语言评分偏差的首个实证证据。最后，我们提出了一种基于XQ-MEval的归一化策略，以对齐各语言的得分分布，从而提高多语言指标评估的公平性和可靠性。

cs.CL / 71 / 2604.14941

Text2Arch: A Dataset for Generating Scientific Architecture Diagrams from Natural Language Descriptions

Text2Arch：从自然语言描述生成科学架构图的数据集

Garg, Shivank, Mittal, Sankalp, Gupta, Manish

Abstract

Communicating complex system designs or scientific processes through text alone is inefficient and prone to ambiguity. A system that automatically generates scientific architecture diagrams from text with high semantic fidelity can be useful in multiple applications like enterprise architecture visualization, AI-driven software design, and educational content creation. Hence, in this paper, we focus on leveraging language models to perform semantic understanding of the input text description to generate intermediate code that can be processed to generate high-fidelity architecture diagrams. Unfortunately, no clean large-scale open-access dataset exists, implying lack of any effective open models for this task. Hence, we contribute a comprehensive dataset, Text2Arch, comprising scientific architecture images, their corresponding textual descriptions, and associated DOT code representations. Leveraging this resource, we fine-tune a suite of small language models, and also perform in-context learning using GPT-4o. Through extensive experimentation, we show that Text2Arch models significantly outperform existing baseline models like DiagramAgent and perform at par with in-context learning-based generations from GPT-4o. We make the code, data and models publicly available.

Chinese Translation

仅通过文本传达复杂系统设计或科学过程既低效又容易产生歧义。一个能够从文本中自动生成高语义保真度的科学架构图的系统在企业架构可视化、基于人工智能的软件设计和教育内容创作等多个应用中都具有重要价值。因此，本文重点利用语言模型对输入文本描述进行语义理解，以生成可处理的中间代码，从而生成高保真的架构图。不幸的是，目前并不存在干净的大规模开放访问数据集，这意味着在这一任务上缺乏有效的开放模型。因此，我们贡献了一个全面的数据集Text2Arch，该数据集包含科学架构图像、相应的文本描述以及相关的DOT代码表示。利用这一资源，我们对一系列小型语言模型进行了微调，并使用GPT-4o进行了上下文学习。通过大量实验，我们表明Text2Arch模型显著优于现有的基线模型如DiagramAgent，并且与基于上下文学习的GPT-4o生成结果相当。我们将代码、数据和模型公开发布。

cs.CL / 72 / 2604.14970

Explain the Flag: Contextualizing Hate Speech Beyond Censorship

解释标记：超越审查的仇恨言论背景化

Liartis, Jason, Kaldeli, Eirini, Gyftokosta, Lambrini, Chelioudakis, Eleftherios, Mastromichalakis, Orfeas Menis

Abstract

Hate, derogatory, and offensive speech remains a persistent challenge in online platforms and public discourse. While automated detection systems are widely used, most focus on censorship or removal, raising concerns for transparency and freedom of expression, and limiting opportunities to explain why content is harmful. To address these issues, explanatory approaches have emerged as a promising solution, aiming to make hate speech detection more transparent, accountable, and informative. In this paper, we present a hybrid approach that combines Large Language Models (LLMs) with three newly created and curated vocabularies to detect and explain hate speech in English, French, and Greek. Our system captures both inherently derogatory expressions tied to identity characteristics and direct group-targeted content through two complementary pipelines: one that detects and disambiguates problematic terms using the curated vocabularies, and one that leverages LLMs as context-aware evaluators of group-targeting content. The outputs are fused into grounded explanations that clarify why content is flagged. Human evaluation shows that our hybrid approach is accurate, with high-quality explanations, outperforming LLM-only baselines.

Chinese Translation

仇恨、贬损和冒犯性言论在在线平台和公共话语中仍然是一个持续的挑战。虽然自动检测系统被广泛使用，但大多数系统专注于审查或删除，这引发了对透明度和言论自由的担忧，并限制了解释内容为何有害的机会。为了解决这些问题，解释性方法作为一种有前景的解决方案应运而生，旨在使仇恨言论检测更加透明、负责任和信息丰富。本文提出了一种混合方法，将大型语言模型（Large Language Models, LLMs）与三种新创建和策划的词汇结合起来，以检测和解释英语、法语和希腊语中的仇恨言论。我们的系统通过两个互补的流程捕捉与身份特征相关的固有贬损表达和直接针对群体的内容：一个使用策划的词汇检测和消歧义问题术语，另一个利用LLMs作为对群体目标内容的上下文感知评估者。这些输出融合成具体的解释，阐明为何内容被标记。人类评估表明，我们的混合方法准确，提供高质量的解释，优于仅使用LLMs的基线。

cs.CL / 73 / 2604.15109

IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation

IUQ：用于长文本大型语言模型生成的询问不确定性量化

Fan, Haozhi, Duan, Jinhao, Xu, Kaidi

Abstract

Despite the rapid advancement of Large Language Models (LLMs), uncertainty quantification in LLM generation is a persistent challenge. Although recent approaches have achieved strong performance by restricting LLMs to produce short or constrained answer sets, many real-world applications require long-form and free-form text generation. A key difficulty in this setting is that LLMs often produce responses that are semantically coherent yet factually inaccurate, while the underlying semantics are multifaceted and the linguistic structure is complex. To tackle this challenge, this paper introduces Interrogative Uncertainty Quantification (IUQ), a novel framework that leverages inter-sample consistency and intra-sample faithfulness to quantify the uncertainty in long-form LLM outputs. By utilizing an interrogate-then-respond paradigm, our method provides reliable measures of claim-level uncertainty and the model's faithfulness. Experimental results across diverse model families and model sizes demonstrate the superior performance of IUQ over two widely used long-form generation datasets. The code is available at https://github.com/louisfanhz/IUQ.

Chinese Translation

尽管大型语言模型（LLMs）迅速发展，但在LLM生成中的不确定性量化仍然是一个持续的挑战。尽管最近的方法通过限制LLMs生成短小或受限的答案集取得了良好的表现，但许多现实世界的应用需要长文本和自由形式的文本生成。在这种情况下，一个关键的困难是LLMs常常生成语义上连贯但事实不准确的响应，而潜在的语义是多方面的，语言结构也很复杂。为了解决这一挑战，本文提出了询问不确定性量化（IUQ），这是一个新颖的框架，利用样本间一致性和样本内忠实性来量化长文本LLM输出中的不确定性。通过采用询问-响应范式，我们的方法提供了可靠的声明级不确定性度量和模型的忠实性。在多种模型家族和模型规模上的实验结果表明，IUQ在两个广泛使用的长文本生成数据集上表现优越。代码可在 https://github.com/louisfanhz/IUQ 获取。

cs.CL / 74 / 2604.15124

Blinded Multi-Rater Comparative Evaluation of a Large Language Model and Clinician-Authored Responses in CGM-Informed Diabetes Counseling

盲法多评审比较评估大型语言模型与临床医生撰写的CGM知情糖尿病咨询响应

Guo, Zhijun, Lai, Alvina, Korakas, Emmanouil, Vagenas, Aristeidis, Ahamed, Irshad, Albor, Christo, Zhang, Hengrui, Healy, Justin, Li, Kezhi

Abstract

Continuous glucose monitoring (CGM) is central to diabetes care, but explaining CGM patterns clearly and empathetically remains time-intensive. Evidence for retrieval-grounded large language model (LLM) systems in CGM-informed counseling remains limited. This study aimed to evaluate whether a retrieval-grounded LLM-based conversational agent (CA) could support patient understanding of CGM data and preparation for routine diabetes consultations. We developed a retrieval-grounded LLM-based CA for CGM interpretation and diabetes counseling support. The system generated plain-language responses while avoiding individualized therapeutic advice. Twelve CGM-informed cases were constructed from publicly available datasets. Between Oct 2025 and Feb 2026, 6 senior UK diabetes clinicians each reviewed 2 assigned cases and answered 24 questions. In a blinded multi-rater evaluation, each CA-generated and clinician-authored response was independently rated by 3 clinicians on 6 quality dimensions. Safety flags and perceived source labels were also recorded. Primary analyses used linear mixed-effects models. A total of 288 unique responses (144 CA and 144 clinician) generated 864 ratings. The CA received higher quality scores than clinician responses (mean 4.37 vs 3.58), with an estimated mean difference of 0.782 points (95% CI 0.692-0.872; P<.001). The largest differences were for empathy (1.062, 95% CI 0.948-1.177) and actionability (0.992, 95% CI 0.877-1.106). Safety flag distributions were similar, with major concerns rare in both groups (3/432, 0.7% each). Retrieval-grounded LLM systems may have value as adjunct tools for CGM review, patient education, and preconsultation preparation. However, these findings do not support autonomous therapeutic decision-making or unsupervised real-world use.
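The sample counts reported above can be cross-checked with a few lines of arithmetic. The sketch reads "answered 24 questions" as 24 questions per clinician and assumes one CA response per question; both readings are assumptions, adopted because they reproduce the reported totals.

```python
clinicians = 6
questions_per_clinician = 24  # assumed reading of "answered 24 questions"

clinician_responses = clinicians * questions_per_clinician
ca_responses = clinician_responses  # assumed: one CA response per question
unique_responses = clinician_responses + ca_responses

raters_per_response = 3
ratings = unique_responses * raters_per_response
ratings_per_group = ratings // 2

major_concern_rate = 3 / ratings_per_group

print(clinician_responses, unique_responses, ratings, ratings_per_group)
print(f"{major_concern_rate:.1%}")
```

Under these readings the script gives 144 responses per source, 288 unique responses, 864 ratings, 432 ratings per group, and a major-concern rate that rounds to 0.7%, all matching the abstract.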

Chinese Translation

连续血糖监测（CGM）在糖尿病护理中至关重要，但清晰而富有同理心地解释CGM模式仍然耗时。关于基于检索的巨大语言模型（LLM）系统在CGM知情咨询中的证据仍然有限。本研究旨在评估基于检索的LLM对话代理（CA）是否能够支持患者理解CGM数据并为常规糖尿病咨询做准备。我们开发了一种基于检索的LLM CA，用于CGM解读和糖尿病咨询支持。该系统生成通俗易懂的响应，同时避免个性化的治疗建议。我们从公开可用的数据集中构建了12个CGM知情案例。在2025年10月至2026年2月期间，6名英国资深糖尿病临床医生各自审查了2个指定案例并回答了24个问题。在盲法多评审评估中，每个CA生成的响应和临床医生撰写的响应均由3名临床医生在6个质量维度上独立评分。安全标记和感知来源标签也被记录。主要分析使用线性混合效应模型。共生成288个独特响应（144个CA和144个临床医生），产生864个评分。CA的质量评分高于临床医生的响应（平均4.37对3.58），估计平均差异为0.782分（95% CI 0.692-0.872；P<.001）。在同理心（1.062，95% CI 0.948-1.177）和可操作性（0.992，95% CI 0.877-1.106）方面的差异最大。安全标记分布相似，两组中的主要关注点都很少（3/432，各占0.7%）。基于检索的LLM系统可能作为CGM审查、患者教育和咨询前准备的辅助工具具有价值。然而，这些发现并不支持自主的治疗决策或无监督的现实世界使用。

cs.CL / 75 / 2604.15140

DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering

DiscoTrace：在人类与大型语言模型的信息寻求问答中的回答策略表示与比较

Srikanth, Neha, Boyd-Graber, Jordan, Rudinger, Rachel

Abstract

We introduce DiscoTrace, a method to identify the rhetorical strategies that answerers use when responding to information-seeking questions. DiscoTrace represents answers as a sequence of question-related discourse acts paired with interpretations of the original question, annotated on top of rhetorical structure theory parses. Applying DiscoTrace to answers from nine different human communities reveals that communities have diverse preferences for answer construction. In contrast, LLMs do not exhibit rhetorical diversity in their answers, even when prompted to mimic specific human community answering guidelines. LLMs also systematically opt for breadth, addressing interpretations of questions that human answerers choose not to address. Our findings can guide the development of pragmatic LLM answerers that consider a range of strategies informed by context in QA.

Chinese Translation

我们介绍了DiscoTrace，这是一种识别回答者在回应信息寻求问题时使用的修辞策略的方法。DiscoTrace将答案表示为一系列与问题相关的语篇行为，配以对原始问题的解释，并在修辞结构理论解析的基础上进行注释。将DiscoTrace应用于来自九个不同人类社区的答案揭示了这些社区在答案构建上具有多样的偏好。相比之下，尽管被提示模仿特定人类社区的回答指南，大型语言模型（LLMs）在其答案中并未表现出修辞多样性。LLMs还系统性地选择广度，处理人类回答者选择不予处理的问题解释。我们的发现可以为开发考虑上下文中多种策略的务实LLM回答者提供指导。


cs.CL / 76 / 2604.15151

QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

QuantCode-Bench：评估大型语言模型生成可执行算法交易策略能力的基准

Khoroshilov, Alexey, Chernysh, Alexey, Ekhtibarov, Orkhan, Kamkia, Nini, Zmitrovich, Dmitry

Abstract

Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present QuantCode-Bench, a benchmark for the systematic evaluation of modern LLMs in generating strategies for the Backtrader framework from textual descriptions in English. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Evaluation is conducted through a multi-stage pipeline that checks syntactic correctness, successful backtest execution, the presence of trades, and semantic alignment with the task description using an LLM judge. We compare state-of-the-art models in two settings: single-turn, where the strategy must be generated correctly on the first attempt, and agentic multi-turn, where the model receives iterative feedback and may repair its errors. We analyze the failure modes across different stages of the pipeline and show that the main limitations of current models are not related to syntax, but rather to the correct operationalization of trading logic, proper API usage, and adherence to task semantics. These findings suggest that trading strategy generation constitutes a distinct class of domain-specific code generation tasks in which success requires not only technical correctness, but also alignment between natural-language descriptions, financial logic, and the observable behavior of the strategy on data.
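The multi-stage evaluation described in this abstract can be pictured as a short-circuiting pipeline. The sketch below is illustrative only and is not the benchmark's code: `evaluate_strategy`, `run_backtest`, and `judge` are hypothetical names, the two callbacks stand in for the Backtrader run and the LLM judge, and the stage order simply follows the order given in the abstract.

```python
import ast
from typing import Callable


def evaluate_strategy(
    code: str,
    run_backtest: Callable[[str], int],
    judge: Callable[[str], bool],
) -> str:
    """Return the first stage that fails, or "passed" if every stage succeeds.

    run_backtest and judge are hypothetical callbacks (assumptions):
    run_backtest executes the code and returns the number of trades,
    raising an exception on a crash; judge stands in for the LLM judge.
    """
    # Stage 1: syntactic correctness.
    try:
        ast.parse(code)
    except SyntaxError:
        return "syntax"

    # Stage 2: the backtest must run to completion.
    try:
        n_trades = run_backtest(code)
    except Exception:
        return "execution"

    # Stage 3: the strategy must place at least one trade.
    if n_trades <= 0:
        return "trades"

    # Stage 4: semantic alignment with the task description.
    if not judge(code):
        return "semantics"

    return "passed"
```

Reporting the first failing stage, rather than a single pass/fail bit, is what makes a per-stage failure analysis like the one the abstract mentions possible.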

Chinese Translation

大型语言模型在通用编程任务上表现出色，但它们生成可执行算法交易策略的能力仍然未得到充分探索。与标准代码基准不同，交易策略生成需要同时掌握特定领域的金融逻辑、专业API的知识，以及生成不仅语法正确且能够在历史数据上实际交易的代码。在本研究中，我们提出了QuantCode-Bench，这是一个用于系统评估现代大型语言模型在从英文文本描述生成Backtrader框架策略方面的基准。该基准包含从Reddit、TradingView、StackExchange、GitHub和合成来源收集的400个难度各异的任务。评估通过一个多阶段流程进行，该流程检查语法正确性、成功的回测执行、交易的存在以及与任务描述的语义一致性，使用大型语言模型作为评判者。我们在两种设置下比较了最先进的模型：单轮设置，其中策略必须在第一次尝试中正确生成；以及代理多轮设置，其中模型接收迭代反馈并可以修正其错误。我们分析了流程不同阶段的失败模式，并表明当前模型的主要局限性与语法无关，而是与交易逻辑的正确操作、API的适当使用以及对任务语义的遵循有关。这些发现表明，交易策略生成构成了一类独特的领域特定代码生成任务，其成功不仅需要技术上的正确性，还需要自然语言描述、金融逻辑与策略在数据上可观察行为之间的一致性。


cs.CL / 77 / 2604.15153

Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models

在潜在嵌入空间中压缩序列：针对大型语言模型的 $K$-标记合并

Xu, Zihao, Harvill, John, Fan, Ziwei, Sun, Yizhou, Ding, Hao, Wang, Hao

Abstract

Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
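To make the block-merging idea concrete, here is a minimal sketch that collapses each contiguous block of K embeddings into one vector. It makes two assumptions the abstract does not state: mean pooling stands in for the learned lightweight encoder, and a trailing partial block is averaged over its actual length. The name `merge_k_tokens` is hypothetical.

```python
from typing import List

Vector = List[float]


def merge_k_tokens(embeddings: List[Vector], k: int) -> List[Vector]:
    """Merge each contiguous block of k embeddings into a single embedding.

    Mean pooling is an assumption used here in place of the paper's
    learned encoder; a shorter final block is averaged over its own length.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    merged: List[Vector] = []
    for start in range(0, len(embeddings), k):
        block = embeddings[start:start + k]
        dim = len(block[0])
        # Average the block dimension by dimension.
        merged.append(
            [sum(vec[d] for vec in block) / len(block) for d in range(dim)]
        )
    return merged
```

When the sequence length is a multiple of K, the merged sequence has 1/K of the original positions, a reduction of 1 - 1/K. K = 4 therefore gives a 75% reduction; whether that is the setting behind the reported "up to 75%" figure is an inference, not something the abstract states.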

Chinese Translation

大型语言模型（LLMs）在处理长提示时会产生显著的计算和内存成本，因为完整的自注意力机制与输入长度呈平方关系扩展。标记压缩旨在通过减少表示输入的标记数量来解决这一挑战。然而，现有的提示压缩方法主要在标记空间中操作，忽视了潜在嵌入空间中的低效问题。本文提出了 K-标记合并，这是一种潜在空间压缩框架，通过轻量级编码器将每个连续的 K 个标记嵌入合并为一个单一的嵌入。压缩后的序列由经过 LoRA 调整的大型语言模型处理，而生成仍保持在原始词汇中。在结构推理（文本化树）、情感分类（亚马逊评论）和代码编辑（CommitPackFT）上的实验表明，K-标记合并位于性能与压缩的帕累托前沿，实现了高达 75% 的输入长度减少，同时性能降级最小。


cs.CL / 78 / 2604.15165

Fabricator or dynamic translator?

制造者还是动态翻译器？

Vasileva, Lisa, Sim, Karin

Abstract

LLMs are proving to be adept at machine translation, although, due to their generative nature, they may at times overgenerate in various ways. These overgenerations differ from the neurobabble seen in NMT and range from LLM self-explanations, to risky confabulations, to appropriate explanations in which the LLM acts as a human translator would, enabling greater comprehension for the target audience. Detecting and determining the exact nature of the overgenerations is a challenging task. We detail different strategies we have explored for our work in a commercial setting, and present our results.

Chinese Translation

大型语言模型（LLMs）在机器翻译方面表现出色，尽管由于其生成特性，有时可能会以各种方式过度生成。这些过度生成与神经机器翻译（NMT）中观察到的神经胡言乱语不同，范围从LLM自我解释、风险性虚构到适当的解释，其中LLM能够像人类翻译者一样工作，从而增强目标受众的理解能力。检测和确定过度生成的确切性质是一项具有挑战性的任务。我们详细介绍了在商业环境中探索的不同策略，并呈现我们的研究结果。


cs.CL / 79 / 2604.15203

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

MADE：一个用于医疗器械不良事件的不确定性量化的多标签文本分类活基准

Agarwal, Raunak, Wenzel, Markus, Baur, Simon, Zimmer, Jonas, Harvey, George, Ma, Jackie

Abstract

Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty. Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.

Chinese Translation

在医疗等高风险领域，机器学习不仅需要强大的预测性能，还需要可靠的不确定性量化（UQ）以支持人类监督。多标签文本分类（MLTC）是该领域的核心任务，但由于标签不平衡、依赖关系和组合复杂性，仍然面临挑战。现有的MLTC基准日益饱和，并可能受到训练数据污染的影响，使得区分真正的推理能力与记忆能力变得困难。我们引入了MADE，这是一个源自医疗器械不良事件报告的活基准，并通过不断更新新发布的报告来防止污染。MADE具有长尾分布的层次标签，并支持严格的时间分割下的可重复评估。我们在超过20个编码器和解码器模型的微调和少样本设置下建立了基线（指令调优/推理变体，本地/API可访问）。我们系统地评估了基于熵/一致性和自我表述的UQ方法。结果显示出明显的权衡：较小的判别微调解码器在保持竞争性UQ的同时实现了最强的头尾准确率；生成微调提供了最可靠的UQ；大型推理模型在稀有标签上的性能有所提升，但展现出意外的弱UQ；而自我表述的置信度并不是不确定性的可靠代理。我们的工作可在https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark上公开获取。


cs.CL / 80 / 2604.15244

From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

从标记到步骤：面向验证的推测解码以实现高效的多步骤推理

Purohit, Kiran, Narayanam, Ramasuri, Pal, Soumyabrata

Abstract

Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but these incur additional latency and computational overhead and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using only model-internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent step, which is then validated using an ensemble of two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log-probability-based score that captures token-level confidence. These signals jointly determine whether a step is accepted or recomputed using the target, allocating compute selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward-guided SD.
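The accept-or-recompute decision described in this abstract can be sketched roughly as follows. Everything specific here is an assumption, since the abstract does not give the formulas: consistency is approximated by a majority vote over identical candidate strings, the log-probability signal is the geometric-mean token probability, and the two signals are combined by a weighted average against a fixed threshold. The names `most_consistent`, `confidence_score`, and `accept_step`, and the default `weight` and `threshold` values, are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def most_consistent(candidates: Sequence[str]) -> str:
    """Pick the draft step that occurs most often (assumed consistency rule)."""
    return Counter(candidates).most_common(1)[0][0]


def confidence_score(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability, a value in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def accept_step(
    grounding: float,
    token_logprobs: Sequence[float],
    weight: float = 0.5,
    threshold: float = 0.6,
) -> bool:
    """Return True to accept the draft step, False to recompute with the target.

    grounding is assumed to be an attention-based score already scaled to
    [0, 1]; weight and threshold are illustrative values, not the paper's.
    """
    combined = weight * grounding + (1.0 - weight) * confidence_score(token_logprobs)
    return combined >= threshold
```

In this sketch the target model would be called only when `accept_step` returns False, which is one way to read the abstract's phrase about allocating compute selectively.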

Chinese Translation

推测解码（Speculative Decoding, SD）通过允许一个轻量级草稿模型提出输出，并由一个更强大的目标模型进行验证，从而加速大型语言模型的推理。然而，其以标记为中心的特性使得错误步骤得以传播。以往的方法通过使用外部奖励模型来缓解这一问题，但会带来额外的延迟和计算开销，并限制了其通用性。我们提出了SpecGuard，一个面向验证的推测解码框架，它仅使用模型内部信号进行步骤级验证。在每一步中，SpecGuard会采样多个草稿候选，并选择最一致的步骤，然后使用两个轻量级模型内部信号的集成进行验证：（i）基于注意力的基础分数，用于衡量对输入和之前接受步骤的归因，以及（ii）基于对数概率的分数，用于捕捉标记级的置信度。这些信号共同决定一个步骤是被接受还是使用目标模型重新计算，从而选择性地分配计算资源。在一系列推理基准测试中的实验表明，SpecGuard提高了3.6%的准确率，同时将延迟减少了约11%，超越了SD和奖励引导的SD。


arXiv Papers

CART: Context-Aware Terrain Adaptation using Temporal Sequence Selection for Legged Robots

RoSLAC: Robust Simultaneous Localization and Calibration of Multiple Magnetometers

SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing

BIEVR-LIO: Robust LiDAR-Inertial Odometry through Bump-Image-Enhanced Voxel Maps

CooperDrive: Enhancing Driving Decisions Through Cooperative Perception

A Nonasymptotic Theory of Gain-Dependent Error Dynamics in Behavior Cloning

CT-VIR: Continuous-Time Visual-Inertial-Ranging Fusion for Indoor Localization with Sparse Anchors

Model-Based Reinforcement Learning Exploits Passive Body Dynamics for High-Performance Biped Robot Locomotion

A multi-platform LiDAR dataset for standardized forest inventory measurement at long term ecological monitoring sites

DigiForest: Digital Analytics and Robotics for Sustainable Forestry

World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

Differentiable Object Pose Connectivity Metrics for Regrasp Sequence Optimization

Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

Switch: Learning Agile Skills Switching for Humanoid Robots

Graph Theoretical Outlier Rejection for 4D Radar Registration in Feature-Poor Environments

4D Radar Gaussian Modeling and Scan Matching with RCS

An Intelligent Robotic and Bio-Digestor Framework for Smart Waste Management

HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps

POMDP-based Object Search with Growing State Space and Hybrid Action Domain

Momentum-constrained Hybrid Heuristic Trajectory Optimization Framework with Residual-enhanced DRL for Visually Impaired Scenarios

DEX-Mouse: A Low-cost Portable and Universal Interface with Force Feedback for Data Collection of Dexterous Robotic Hands

DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation

CAVERS: Multimodal SLAM Data from a Natural Karstic Cave with Ground Truth Motion Capture

Trajectory Planning for a Multi-UAV Rigid-Payload Cascaded Transportation System Based on Enhanced Tube-RRT*

NEAT-NC: NEAT guided Navigation Cells for Robot Path Planning

Dual Pose-Graph Semantic Localization for Vision-Based Autonomous Drone Racing

Benchmarking Classical Coverage Path Planning Heuristics on Irregular Hexagonal Grids for Maritime Coverage Scenarios

A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

Vision-Based Safe Human-Robot Collaboration with Uncertainty Guarantees

Abstract Sim2Real through Approximate Information States

QualiaNet: An Experience-Before-Inference Network

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

Interpretable Human Activity Recognition for Subtle Robbery Detection in Surveillance Videos

SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning

FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images

Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

Crowdsourcing of Real-world Image Annotation via Visual Properties

Co-distilled attention guided masked image modeling with noisy teacher for self-supervised learning on medical images

H2VLR: Heterogeneous Hypergraph Vision-Language Reasoning for Few-Shot Anomaly Detection

Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

FreqTrack: Frequency Learning based Vision Transformer for RGB-Event Object Tracking

Design and Validation of a Low-Cost Smartphone Based Fluorescence Detection Platform Compared with Conventional Microplate Readers

WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms

Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars

Controllable Video Object Insertion via Multiview Priors

The Fourth Challenge on Image Super-Resolution ($\times$4) at NTIRE 2026: Benchmark Results and Method Overview

DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration

Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Deepfake Detection Generalization with Diffusion Noise

M3D-Net: Multi-Modal 3D Facial Feature Reconstruction Network for Deepfake Detection

TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models

Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

Towards Design Compositing

Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

High-Speed Full-Color HDR Imaging via Unwrapping Modulo-Encoded Spike Streams

Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification

Chaotic CNN for Limited Data Image Classification

Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting

DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment

NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation

G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

MS-SSE-Net: A Multi-Scale Spatial Squeeze-and-Excitation Network for Structural Damage Detection in Civil and Geotechnical Engineering

Data Synthesis Improves 3D Myotube Instance Segmentation

HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

Find the Differences: Differential Morphing Attack Detection vs Face Recognition

Efficient closed-form approaches for pose estimation using Sylvester forms

ASGNet: Adaptive Spectrum Guidance Network for Automatic Polyp Segmentation

OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism

AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

Integrating Object Detection, LiDAR-Enhanced Depth Estimation, and Segmentation Models for Railway Environments

One-shot Compositional 3D Head Avatars with Deformable Hair