arXiv Daily Digest

668

Papers

Spectral Kernel Dynamics via Maximum Caliber: Fixed Points, Geodesics, and Phase Transitions

通过最大能力的谱核动力学：固定点、测地线与相变

Das, Jnaneshwar

Abstract

We derive a closed-form geometric functional for kernel dynamics on finite graphs by applying the Maximum Caliber (MaxCal) variational principle to the spectral transfer function h(lambda) of the graph Laplacian eigenbasis. The main result is that the MaxCal stationarity condition decouples into N one-dimensional problems with explicit solution: h*(lambda_l) = h_0(lambda_l) exp(-1 - T_l[h*]), yielding self-consistent (fixed-point) kernels via exponential tilting (Corollary 1), log-linear Fisher-Rao geodesics (Corollary 2), a diagonal Hessian stability criterion (Corollary 3), and an l^2_+ isometry for the spectral kernel space (Proposition 3). The spectral entropy H[h_t] provides a computable O(N) early-warning signal for network-structural phase transitions (Remark 7). All claims are numerically verified on the path graph P_8 with a Gaussian mutual-information source, using the open-source kernelcal library. The framework is grounded in a structural analogy with Einstein's field equations, used as a guiding template rather than an established equivalence; explicit limits are stated in Section 6.

Chinese Translation

我们通过将最大能力（Maximum Caliber, MaxCal）变分原理应用于图拉普拉斯特征基的谱传递函数 h(lambda)，推导出有限图上的核动力学的封闭形式几何泛函。主要结果是，MaxCal的平稳性条件解耦为N个一维问题，具有明确的解：h*(lambda_l) = h_0(lambda_l) exp(-1 - T_l[h*])，通过指数倾斜（推论1）获得自洽（固定点）核，获得对数线性Fisher-Rao测地线（推论2）、对角Hessian稳定性标准（推论3），以及谱核空间的l^2_+等距性（命题3）。谱熵H[h_t]提供了一个可计算的O(N)早期预警信号，用于网络结构相变（备注7）。所有主张在具有高斯互信息源的路径图P_8上进行了数值验证，使用开源的kernelcal库。该框架基于与爱因斯坦场方程的结构类比，作为指导模板而非已建立的等价关系；明确的限制在第6节中说明。

View on arXiv Download PDF AI Translation

cs.RO / 2 / 2604.09800

Kinematics of continuum planar grasping

连续体平面抓取的运动学研究

Halder, Udit, Zambrano, Nicolas Echeverria, Li, Xincheng

Abstract

This paper presents an analytical framework to study the geometry arising when a soft continuum arm grasps a planar object. Both the arm centerline and the object boundary are modeled as smooth curves. The grasping problem is formulated as a kinematic boundary following problem, in which the object boundary acts as the arm's 'shadow curve'. This formulation leads to a set of reduced kinematic equations expressed in terms of relative geometric shape variables, with the arm curvature serving as the control input. An optimal control problem is formulated to determine feasible arm shapes that achieve optimal grasping configurations, and its solution is obtained using Pontryagin's Maximum Principle. Based on the resulting optimal grasp kinematics, a class of continuum grasp quality metrics is proposed using the algebraic properties of the associated continuum grasp map. Feedback control aspects in the dynamic setting are also discussed. The proposed methodology is illustrated through systematic numerical simulations.

Chinese Translation

本文提出了一个解析框架，用于研究软连续体机械臂抓取平面物体时产生的几何特性。机械臂的中心线和物体边界均被建模为光滑曲线。抓取问题被表述为一个运动学边界跟踪问题，其中物体边界作为机械臂的“影子曲线”。该表述导出了一组以相对几何形状变量表示的简化运动学方程，机械臂曲率作为控制输入。通过构建最优控制问题来确定实现最优抓取配置的可行机械臂形状，并利用Pontryagin极大值原理求解该问题。基于所得的最优抓取运动学，利用相关连续体抓取映射的代数性质，提出了一类连续体抓取质量度量指标。文中还讨论了动态环境下的反馈控制问题。所提方法通过系统的数值仿真进行了验证。

View on arXiv Download PDF AI Translation

cs.RO / 3 / 2604.09824

ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

ProGAL-VLA：通过前瞻性推理实现视觉-语言-动作模型中的基于实体的对齐

Darabi, Nastaran, Trivedi, Amit Ranjan

Abstract

Vision language action (VLA) models enable generalist robotic agents but often exhibit language ignorance, relying on visual shortcuts and remaining insensitive to instruction changes. We present Prospective Grounding and Alignment VLA (ProGAL-VLA), which constructs a 3D entity-centric graph (GSM), uses a slow planner to produce symbolic sub-goals, and aligns them with grounded entities via a Grounding Alignment Contrastive (GAC) loss. All actions are conditioned on a verified goal embedding $g_t$, whose attention entropy provides an intrinsic ambiguity signal. On LIBERO-Plus, ProGAL-VLA increases robustness under robot perturbations from 30.3 to 71.5 percent, reduces language ignorance by 3x-4x, and improves entity retrieval from 0.41 to 0.71 Recall@1. On the Custom Ambiguity Benchmark, it reaches AUROC 0.81 (vs. 0.52), AUPR 0.79, and raises clarification on ambiguous inputs from 0.09 to 0.81 without harming unambiguous success. The verification bottleneck increases mutual information of language-actions, the GAC loss imposes an entity-level InfoNCE bound, and attention entropy yields calibrated selective prediction, indicating that explicit verified grounding is an effective path toward instruction-sensitive, ambiguity-aware agents.

Chinese Translation

视觉语言动作（VLA）模型使通用机器人代理成为可能，但通常表现出对语言的忽视，依赖视觉捷径且对指令变化不敏感。我们提出了前瞻性基于实体的对齐视觉语言动作模型（ProGAL-VLA），该模型构建了一个以三维实体为中心的图结构（GSM），使用慢速规划器生成符号子目标，并通过基于实体的对齐对比损失（Grounding Alignment Contrastive, GAC）将子目标与实体进行对齐。所有动作均基于经过验证的目标嵌入$g_t$，其注意力熵提供了内在的歧义信号。在LIBERO-Plus数据集上，ProGAL-VLA在机器人扰动下的鲁棒性从30.3%提升至71.5%，语言忽视减少了3至4倍，实体检索的Recall@1从0.41提升至0.71。在自定义歧义基准测试中，其AUROC达到0.81（对比0.52），AUPR为0.79，且在歧义输入上的澄清率从0.09提升至0.81，同时不影响非歧义任务的成功率。验证瓶颈提高了语言与动作的互信息，GAC损失施加了实体级别的InfoNCE界限，注意力熵实现了校准的选择性预测，表明显式的验证基于实体的对齐是实现对指令敏感且具备歧义感知能力代理的有效路径。

View on arXiv Download PDF AI Translation

cs.RO / 4 / 2604.09829

Perception Is All You Need: A Neuroscience Framework for Low Cost Sensorless Gaze in HRI

感知就是一切：低成本无传感器注视在人机交互中的神经科学框架

Kadem, Mason

Abstract

Gaze-following in child-robot interaction improves attention, recall, and learning, but requires expensive platforms (\$30,000+), sensors, algorithms, and raises privacy concerns. We propose a framework that avoids sensors and computation entirely, instead relying on the human visual system's assumption of convexity to produce perceptual gaze-following between a robot and its viewer. Specifically, we motivate sub-dollar cardboard robot design that directly implements the brain's own gaze computation pipeline in reverse, making the viewer's perceptual system the robot's "actuator", with no sensors, no power, and no privacy concerns. We ground this framework in three converging lines of theoretical and empirical neuroscience evidence. Namely, the distributed face processing network that computes gaze direction via the superior temporal sulcus, the high-precision convexity prior that causes the brain to perceive concave faces as convex, and the predictive processing hierarchy in which top-down face knowledge overrides bottom-up depth signals. These mechanisms explain why a concave eye socket with a painted pupil produces the perception of mutual gaze from any viewing angle. We derive design constraints from perceptual science, present a sub-dollar open-template robot with parameterized interchangeable eye inserts, and identify boundary conditions (developmental, clinical, and geometric) that predict where the framework will succeed and where it will fail. If leveraged, two decades of HRI gaze findings become deliverable at population scale.

Chinese Translation

在儿童与机器人互动中，注视跟随能够提高注意力、记忆和学习，但需要昂贵的平台（超过30,000美元）、传感器和算法，并引发隐私问题。我们提出一个完全避免传感器和计算的框架，依赖于人类视觉系统对凸性假设的理解，以实现机器人与其观察者之间的感知注视跟随。具体而言，我们提出一种低于一美元的纸板机器人设计，直接反向实现大脑自身的注视计算流程，使观察者的感知系统成为机器人的“执行器”，无需传感器、无需电源，也无需担忧隐私问题。我们基于三条相互交汇的理论和实证神经科学证据来支撑这一框架。即，通过上颞沟计算注视方向的分布式面部处理网络、高精度的凸性先验使大脑将凹面脸部感知为凸面，以及预测处理层级中自上而下的面部知识覆盖自下而上的深度信号。这些机制解释了为什么一个凹陷的眼窝与涂有瞳孔的眼睛从任何观察角度都能产生相互注视的感知。我们从感知科学中推导出设计约束，展示了一种低于一美元的开放模板机器人，配备可参数化的可互换眼部插入物，并识别出预测框架成功与失败的边界条件（发展性、临床和几何）。如果得以利用，二十年来在人机交互中的注视研究成果将能够在更大范围内实现。

View on arXiv Download PDF AI Translation

cs.RO / 5 / 2604.09860

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

RoboLab：用于任务通用策略分析的高保真仿真基准

Yang, Xuning, Dagli, Rishit, Zook, Alex, Hadfield, Hugo, Goyal, Ankit, Birchfield, Stan, Ramos, Fabio, Tremblay, Jonathan

Abstract

The pursuit of general-purpose robotics has yielded impressive foundation models, yet simulation-based benchmarking remains a bottleneck due to rapid performance saturation and a lack of true generalization testing. Existing benchmarks often exhibit significant domain overlap between training and evaluation, trivializing success rates and obscuring insights into robustness. We introduce RoboLab, a simulation benchmarking framework designed to address these challenges. Concretely, our framework is designed to answer two questions: (1) to what extent can we understand the performance of a real-world policy by analyzing its behavior in simulation, and (2) which external factors most strongly affect that behavior under controlled perturbations. First, RoboLab enables human-authored and LLM-enabled generation of scenes and tasks in a robot- and policy-agnostic manner within a physically realistic and photorealistic simulation. With this, we propose the RoboLab-120 benchmark, consisting of 120 tasks categorized into three competency axes: visual, procedural, relational competency, across three difficulty levels. Second, we introduce a systematic analysis of real-world policies that quantify both their performance and the sensitivity of their behavior to controlled perturbations, indicating that high-fidelity simulation can serve as a proxy for analyzing performance and its dependence on external factors. Evaluation with RoboLab exposes significant performance gap in current state-of-the-art models. By providing granular metrics and a scalable toolset, RoboLab offers a scalable framework for evaluating the true generalization capabilities of task-generalist robotic policies.

Chinese Translation

通用机器人技术的追求催生了令人瞩目的基础模型，然而基于仿真的基准测试仍然是瓶颈，原因在于性能快速饱和且缺乏真正的泛化测试。现有基准通常在训练与评估之间存在显著的领域重叠，导致成功率被弱化且难以洞察鲁棒性。我们提出了RoboLab，一个旨在解决上述挑战的仿真基准框架。具体而言，该框架设计用于回答两个问题：（1）通过分析策略在仿真中的行为，我们在多大程度上能够理解其在现实世界中的表现；（2）在受控扰动下，哪些外部因素对该行为影响最大。首先，RoboLab支持以机器人和策略无关的方式，通过人工编写和大型语言模型（LLM）辅助生成物理真实且光照真实的仿真场景和任务。基于此，我们提出了RoboLab-120基准，包含120个任务，分为视觉能力、程序能力和关系能力三个能力轴，涵盖三个难度等级。其次，我们引入了对现实世界策略的系统性分析，量化其性能及其行为对受控扰动的敏感性，表明高保真仿真可作为分析性能及其对外部因素依赖性的代理。通过RoboLab的评估揭示了当前最先进模型存在显著的性能差距。通过提供细粒度指标和可扩展工具集，RoboLab为评估任务通用机器人策略的真实泛化能力提供了一个可扩展的框架。

View on arXiv Download PDF AI Translation

cs.RO / 6 / 2604.09938

CableTract: A Co-Designed Cable-Driven Field Robot for Low-Compaction, Off-Grid Capable Agriculture

CableTract：一种共同设计的低压实、离网农业驱动的电缆机器人

Yilmaz, Ozgur

Abstract

Conventional field operations spend most of their energy moving the tractor body, not the implement. Yet feasibility studies for novel agricultural vehicles rarely tie mechanics, energy harvest, draft, field geometry, economics, life-cycle CO2, and uncertainty quantification together on a single reproducible code path. This paper builds such a framework and applies it to CableTract, a two-module cable-driven field robot. A stationary Main Unit (winch + motor + battery + harvester module) (MU) and a lighter Anchor module (held by helical screw piles) tension a cable across a strip while a lightweight implement carriage rolls along it. The heavy bodies stay on the headland; only the carriage enters the field. The carriage runs a 10-implement library co-designed for the cable architecture. This co-design is the paper's central analytical lever. The framework is prototype-free. It chains a catenary cable model, a drivetrain efficiency chain, a stochastic draft model fitted to the co-designed library, an hourly solar + wind + battery simulator on six sites, a polygon coverage planner on a 50-field corpus, a contact-pressure compaction model, a discounted cash-flow economics engine with battery replacement and life-cycle CO2, and a global sensitivity analysis on 20 inputs. An operating-envelope sweep and an architectural-variant comparison close the loop. The full implementation is open source. Applied to the codesigned reference, the framework yields energy, compaction advantages and potential off-grid operation.

Chinese Translation

传统的田间作业大部分能量用于移动拖拉机本体，而非工具。然而，针对新型农业车辆的可行性研究很少将机械学、能量收集、牵引力、田间几何形状、经济学、生命周期二氧化碳排放和不确定性量化整合在一个可重复的代码路径上。本文构建了这样一个框架，并将其应用于CableTract，一个由两个模块组成的电缆驱动田间机器人。一个静态的主单元（绞盘 + 电动机 + 电池 + 收割机模块）（MU）和一个较轻的锚模块（通过螺旋桩固定）拉紧一条电缆，轻量级的工具车在其上滚动。重型设备停留在田边，只有工具车进入田间。工具车运行一个为电缆架构共同设计的10种工具库。这个共同设计是本文的核心分析杠杆。该框架不依赖于原型。它串联了一个悬链电缆模型、一个驱动链效率链、一个适配于共同设计库的随机牵引模型、一个在六个地点的每小时太阳能 + 风能 + 电池模拟器、一个在50个田地语料库上的多边形覆盖规划器、一个接触压力压实模型、一个带电池更换和生命周期二氧化碳的折现现金流经济引擎，以及对20个输入的全局敏感性分析。操作范围的扫描和建筑变体的比较闭合了循环。完整的实现是开源的。应用于共同设计的参考，该框架产生了能量和压实优势，并具有潜在的离网操作能力。

View on arXiv Download PDF AI Translation

cs.RO / 7 / 2604.09993

GPU-Accelerated Continuous-Time Successive Convexification for Contact-Implicit Legged Locomotion

基于GPU加速的连续时间连续凸化方法在隐式接触腿式运动中的应用

Buckner, Samuel C., Elango, Purnanand

Abstract

Contact-implicit trajectory optimization (CITO) enables the automatic discovery of contact sequences, but most methods rely on fine time discretization to capture all contact events accurately, which increases problem size and runtime while tying solution quality to grid resolution. We extend the recently proposed sequential convex programming (SCP) approach for trajectory optimization, continuous-time successive convexification (ct-SCvx), to CITO by introducing integral cross-complementarity constraints, which eliminate the risk of missing contact events between discretization nodes while preserving the flexibility of contact mode changes. The resulting framework, contact-implicit successive convexification (ci-SCvx), models full multibody dynamics in maximal coordinates, including stick-slip friction and partially elastic impacts. To handle complementarity constraints, we embed a backtracking homotopy scheme within SCP for reliable convergence. We implement this framework in a stand-alone Python software, leveraging JAX for GPU acceleration and a custom canonical-form parser for the convex subproblems of SCP to avoid the overhead of general-purpose modeling tools such as CVXPY. We demonstrate ci-SCvx on diverse legged-locomotion tasks. In particular, we validate the approach in MuJoCo with the Gymnasium HalfCheetah model against the MuJoCo MPC baseline, showing that a tracking simulation with the optimized torque profiles from ci-SCvx produces physically consistent trajectories with lesser energy consumption. We also show that the resulting software achieves faster solve times than existing state-of-the-art SCP implementations by over an order of magnitude, thereby demonstrating a practically important contribution to scalable real-time trajectory optimization.

Chinese Translation

隐式接触轨迹优化（Contact-Implicit Trajectory Optimization, CITO）能够自动发现接触序列，但大多数方法依赖于细粒度时间离散化以准确捕捉所有接触事件，这不仅增加了问题规模和运行时间，还使得解的质量受限于网格分辨率。我们将最近提出的轨迹优化顺序凸规划（Sequential Convex Programming, SCP）方法——连续时间连续凸化（continuous-time successive convexification, ct-SCvx）扩展到CITO，通过引入积分交叉互补约束，消除了在离散节点间遗漏接触事件的风险，同时保持了接触模式变化的灵活性。由此形成的框架——隐式接触连续凸化（contact-implicit successive convexification, ci-SCvx），在最大坐标系下建模完整多体动力学，包括粘滑摩擦和部分弹性碰撞。为处理互补约束，我们在SCP中嵌入了回溯同伦方案以确保收敛的可靠性。该框架以独立Python软件实现，利用JAX实现GPU加速，并采用定制的SCP凸子问题规范形式解析器，避免了使用CVXPY等通用建模工具带来的开销。我们在多种腿式运动任务中验证了ci-SCvx，特别是在MuJoCo环境下使用Gymnasium HalfCheetah模型与MuJoCo基线MPC进行对比，结果表明，利用ci-SCvx优化的扭矩轨迹进行跟踪仿真能够产生物理一致且能耗更低的轨迹。此外，所开发的软件在求解速度上较现有最先进的SCP实现快一个数量级以上，展示了其在可扩展实时轨迹优化中的实际重要贡献。

View on arXiv Download PDF AI Translation

cs.RO / 8 / 2604.10055

Vision-Language-Action Model, Robustness, Multi-modal Learning, Robot Manipulation

视觉-语言-动作模型、鲁棒性、多模态学习与机器人操作

Xie, Yuhan, Yan, Yuping, Zhao, Yunqi, Wang, Handing, Jin, Yaochu

Abstract

Despite their strong performance in embodied tasks, recent Vision-Language-Action (VLA) models remain highly fragile under multimodal perturbations, where visual corruption and linguistic noise jointly induce distribution shifts that degrade task-level execution. Existing robustness approaches typically rely on joint training with perturbed data, treating robustness as a static objective, which leads to conflicting optimization between robustness and task fidelity. In this work, we propose STRONG-VLA, a decoupled fine-tuning framework that explicitly separates robustness acquisition from task-aligned refinement. In Stage I, the model is exposed to a curriculum of multimodal perturbations with increasing difficulty, enabling progressive robustness learning under controlled distribution shifts. In Stage II, the model is re-aligned with clean task distributions to recover execution fidelity while preserving robustness. We further establish a comprehensive benchmark with 28 perturbation types spanning both textual and visual modalities, grounded in realistic sources of sensor noise, occlusion, and instruction corruption. Extensive experiments on the LIBERO benchmark show that STRONG-VLA consistently improves task success rates across multiple VLA architectures. On OpenVLA, our method achieves gains of up to 12.60% under seen perturbations and 7.77% under unseen perturbations. Notably, similar or larger improvements are observed on OpenVLA-OFT (+14.48% / +13.81%) and pi0 (+16.49% / +5.58%), demonstrating strong cross-architecture generalization. Real-world experiments on an AIRBOT robotic platform further validate its practical effectiveness. These results highlight the importance of decoupled optimization for multimodal robustness and establish STRONG-VLA as a simple yet principled framework for robust embodied control.

Chinese Translation

尽管在具身任务中表现优异，近期的视觉-语言-动作（Vision-Language-Action, VLA）模型在多模态扰动下依然高度脆弱，视觉损坏与语言噪声共同引发的分布偏移会降低任务执行效果。现有的鲁棒性方法通常依赖于带扰动数据的联合训练，将鲁棒性视为静态目标，导致鲁棒性与任务准确性之间的优化冲突。本文提出了STRONG-VLA，一种解耦的微调框架，明确区分鲁棒性获取与任务对齐的细化过程。第一阶段，模型接受难度逐步增加的多模态扰动课程训练，实现受控分布偏移下的渐进鲁棒性学习。第二阶段，模型重新与干净的任务分布对齐，以恢复执行准确性同时保持鲁棒性。我们进一步构建了涵盖28种扰动类型的综合基准，跨文本与视觉模态，基于真实传感器噪声、遮挡及指令损坏来源。大量在LIBERO基准上的实验表明，STRONG-VLA在多种VLA架构中持续提升任务成功率。在OpenVLA上，方法在已见扰动下提升最高达12.60%，未见扰动下提升7.77%。值得注意的是，在OpenVLA-OFT（+14.48% / +13.81%）和pi0（+16.49% / +5.58%）上也观察到相似甚至更大幅度的提升，展现出强大的跨架构泛化能力。基于AIRBOT机器人平台的真实环境实验进一步验证了其实用性。结果强调了解耦优化在多模态鲁棒性中的重要性，确立了STRONG-VLA作为一种简单而原则性强的鲁棒具身控制框架。

View on arXiv Download PDF AI Translation

cs.RO / 9 / 2604.10057

Natural Gradient Gaussian Approximation Filter on Lie Groups for Robot State Estimation

基于李群的自然梯度高斯近似滤波器用于机器人状态估计

Zhang, Tianyi, Cao, Wenhan, Liu, Chang, Lyu, Yao, Li, Shengbo Eben

Abstract

Accurate state estimation for robotic systems evolving on Lie group manifolds, such as legged robots, is a prerequisite for achieving agile control. However, this task is challenged by nonlinear observation models defined on curved manifolds, where existing filters rely on local linearization in the tangent space to handle such nonlinearity, leading to accumulated estimation errors. To address this limitation, we reformulate manifold filtering as a parameter optimization problem over a Gaussian-distributed increment variable, thereby avoiding linearization. Under this formulation, the increment can be mapped to the Lie group through the exponential operator, where it acts multiplicatively on the prior estimate to yield the posterior state. We further propose a natural gradient optimization scheme for solving this problem, whose iteration process leverages the Fisher information matrix of the increment variable to account for the curvature of the tangent space. This results in an iterative algorithm named the Natural Gradient Gaussian Approximation on Lie Groups (NANO-L) filter. Leveraging the perturbation model in Lie derivative, we prove that for the invariant observation model widely adopted in robotic localization tasks, the covariance update in NANO-L admits an exact closed-form solution, eliminating the need for iterative updates thus improving computational efficiency. Hardware experiments on a Unitree GO2 legged robot operating across different terrains demonstrate that NANO-L achieves approximately 40% lower estimation error than commonly used filters at a comparable computational cost.

Chinese Translation

对于在李群流形上演化的机器人系统（如腿式机器人），准确的状态估计是实现灵活控制的前提。然而，该任务面临定义在曲率流形上的非线性观测模型的挑战，现有滤波器依赖切空间的局部线性化来处理非线性，导致估计误差的累积。为解决此限制，我们将流形滤波重新表述为对高斯分布增量变量的参数优化问题，从而避免线性化。在此框架下，增量通过指数映射映射到李群，在该群上乘法作用于先验估计以得到后验状态。我们进一步提出了一种自然梯度优化方案来求解该问题，其迭代过程利用增量变量的费舍尔信息矩阵以考虑切空间的曲率。由此产生的迭代算法称为李群上的自然梯度高斯近似滤波器（Natural Gradient Gaussian Approximation on Lie Groups，NANO-L）。利用李导数中的扰动模型，我们证明了对于机器人定位任务中广泛采用的不变观测模型，NANO-L中的协方差更新具有精确的闭式解，消除了迭代更新的需求，从而提升了计算效率。在不同地形上运行的Unitree GO2腿式机器人硬件实验表明，NANO-L在相当的计算成本下，实现了约40%的估计误差降低，优于常用滤波器。

View on arXiv Download PDF AI Translation

cs.RO / 10 / 2604.10058

A Ray Intersection Algorithm for Fast Growth Distance Computation Between Convex Sets

一种用于快速计算凸集之间增长距离的射线交点算法

Thirugnanam, Akshay, Sreenath, Koushil

Abstract

In this paper, we discuss an efficient algorithm for computing the growth distance between two compact convex sets with representable support functions. The growth distance between two sets is the minimum scaling factor such that the sets intersect when scaled about some center points. Unlike the minimum distance between sets, the growth distance provides a unified measure for set intersection and separation. We first reduce the growth distance problem to an equivalent ray intersection problem on the Minkowski difference set. Then, we propose an algorithm to solve the ray intersection problem by iteratively constructing inner and outer polyhedral approximations of the Minkowski difference set. We show that our algorithm satisfies several key properties, such as primal and dual feasibility and monotone convergence. We provide extensive benchmark results for our algorithm and show that our open-source implementation achieves state-of-the-art performance across a wide variety of convex sets. Finally, we demonstrate robotics applications of our algorithm in motion planning and rigid-body simulation.

Chinese Translation

本文讨论了一种高效的算法，用于计算两个具有可表示支持函数的紧凑凸集之间的增长距离。两个集合之间的增长距离是指在某些中心点缩放时，使得集合相交的最小缩放因子。与集合之间的最小距离不同，增长距离为集合的交集和分离提供了一种统一的度量。我们首先将增长距离问题简化为在闵可夫斯基差集上的等效射线交点问题。然后，我们提出了一种通过迭代构建闵可夫斯基差集的内外多面体近似来解决射线交点问题的算法。我们证明了我们的算法满足多个关键属性，如原始和对偶可行性以及单调收敛性。我们为我们的算法提供了广泛的基准结果，并显示我们的开源实现能够在各种凸集上实现最先进的性能。最后，我们展示了该算法在运动规划和刚体仿真中的机器人应用。

View on arXiv Download PDF AI Translation

cs.RO / 11 / 2604.10165

MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

MoRI：用于长时间操作任务的强化学习与模仿学习专家混合模型

Xu, Yaohang, Ma, Lianjie, Zuo, Gewei, Zhang, Wentao, Ding, Han, Zhu, Lijun

Abstract

Reinforcement Learning (RL) and Imitation Learning (IL) are the standard frameworks for policy acquisition in manipulation. While IL offers efficient policy derivation, it suffers from compounding errors and distribution shift. Conversely, RL facilitates autonomous exploration but is frequently hindered by low sample efficiency and the high cost of trial and error. Since existing hybrid methods often struggle with complex tasks, we introduce Mixture of RL and IL Experts (MoRI). This system dynamically switches between IL and RL experts based on the variance of expert actions to handle coarse movements and fine-grained manipulations. MoRI employs an offline pre-training stage followed by online fine-tuning to accelerate convergence. To maintain exploration safety and minimize human intervention, the system applies IL-based regularization to the RL component. Evaluation across four complex real-world tasks shows that MoRI achieves an average success rate of 97.5% within 2 to 5 hours of fine-tuning. Compared to baseline RL algorithms, MoRI reduces human intervention by 85.8% and shortens convergence time by 21%, demonstrating its capability in robotic manipulation.

Chinese Translation

强化学习（Reinforcement Learning, RL）和模仿学习（Imitation Learning, IL）是操作任务中策略获取的标准框架。虽然模仿学习提供了高效的策略推导，但它容易受到累积误差和分布转移的影响。相反，强化学习促进了自主探索，但常常受到低样本效率和试错成本高的限制。由于现有的混合方法在复杂任务中常常表现不佳，我们提出了强化学习与模仿学习专家混合模型（Mixture of RL and IL Experts, MoRI）。该系统根据专家动作的方差动态切换IL和RL专家，以处理粗略运动和精细操作。MoRI采用离线预训练阶段，随后进行在线微调，以加速收敛。为了保持探索的安全性并最小化人类干预，该系统对RL组件应用基于IL的正则化。在四个复杂的现实任务中的评估显示，MoRI在2到5小时的微调内实现了97.5%的平均成功率。与基线RL算法相比，MoRI将人类干预减少了85.8%，并缩短了收敛时间21%，展示了其在机器人操作中的能力。

View on arXiv Download PDF AI Translation

cs.RO / 12 / 2604.10170

Device-Conditioned Neural Architecture Search for Efficient Robotic Manipulation

面向高效机器人操作的设备条件神经架构搜索

Wu, Yiming, Wang, Huan, Chen, Zhenghao, Yuan, Ge, Xu, Dong

Abstract

The growing complexity of visuomotor policies poses significant challenges for deployment with heterogeneous robotic hardware constraints. However, most existing model-efficient approaches for robotic manipulation are device- and model-specific, lack generalizability, and require time-consuming per-device optimization during the adaptation process. In this work, we propose a unified framework named \textbf{D}evice-\textbf{C}onditioned \textbf{Q}uantization-\textbf{F}or-\textbf{A}ll (DC-QFA) which amortizes deployment effort with the device-conditioned quantization-aware training and hardware-constrained architecture search. Specifically, we introduce a single supernet that spans a rich design space over network architectures and mixed-precision bit-widths. It is optimized with latency- and memory-aware regularization, guided by per-device lookup tables. With this supernet, for each target platform, we can perform a once-for-all lightweight search to select an optimal subnet without any per-device re-optimization, which enables more generalizable deployment across heterogeneous hardware, and substantially reduces deployment time. To improve long-horizon stability under low precision, we further introduce multi-step on-policy distillation to mitigate error accumulation during closed-loop execution. Extensive experiments on three representative policy backbones, such as DiffusionPolicy-T, MDT-V, and OpenVLA-OFT, demonstrate that our DC-QFA achieves $2\text{-}3\times$ acceleration on edge devices, consumer-grade GPUs, and cloud platforms, with negligible performance drop in task success. Real-world evaluations on an Inovo robot equipped with a force/torque sensor further validates that our low-bit DC-QFA policies maintain stable, contact-rich manipulation even under severe quantization.

Chinese Translation

视觉运动策略日益复杂，给异构机器人硬件约束下的部署带来了显著挑战。然而，大多数现有的机器人操作模型高效方法均针对特定设备和模型，缺乏泛化能力，并且在适配过程中需要耗时的逐设备优化。本文提出了一种统一框架，称为设备条件量化适用于所有（Device-Conditioned Quantization-For-All，DC-QFA），通过设备条件量化感知训练和硬件约束架构搜索来摊销部署工作。具体而言，我们引入了一个涵盖丰富设计空间的单一超网络，涵盖网络架构和混合精度位宽。该超网络通过延迟和内存感知正则化进行优化，并由逐设备查找表指导。借助该超网络，对于每个目标平台，我们可以执行一次性轻量级搜索以选择最优子网，无需任何逐设备的重新优化，从而实现跨异构硬件的更具泛化性的部署，并大幅缩短部署时间。为提升低精度下的长时稳定性，我们进一步引入多步在线策略蒸馏，以缓解闭环执行中的误差累积。在DiffusionPolicy-T、MDT-V和OpenVLA-OFT等三种代表性策略骨干上的大量实验表明，我们的DC-QFA在边缘设备、消费级GPU和云平台上实现了2至3倍的加速，且任务成功率性能几乎无损。基于配备力/力矩传感器的Inovo机器人进行的真实环境评估进一步验证了我们的低位宽DC-QFA策略即使在严重量化条件下，也能保持稳定且富有接触的操作能力。

View on arXiv Download PDF AI Translation

cs.RO / 13 / 2604.10213

ReaLiTy and LADS: A Unified Framework and Dataset Suite for LiDAR Adaptation Across Sensors and Adverse Weather Conditions

ReaLiTy 与 LADS：面向传感器与恶劣天气条件下 LiDAR 适应的统一框架与数据集套件

Anand, Vivek, Lohani, Bharat, Mishra, Rakesh, Pandey, Gaurav

Abstract

Reliable LiDAR perception requires robustness across sensors, environments, and adverse weather. However, existing datasets rarely provide physically consistent observations of the same scene under varying sensor configurations and weather conditions, limiting systematic analysis of domain shifts. This work presents ReaLiTy, a unified physics-informed framework that transforms LiDAR data to match target sensor specifications and weather conditions. The framework integrates physically grounded cues with a learning-based module to generate realistic intensity patterns, while a physics-based weather model introduces consistent geometric and radiometric degradations. Building on this framework, we introduce the LiDAR Adaptation Dataset Suite (LADS), a collection of physically consistent, transformation-ready point clouds with one-to-one correspondence to original datasets. Experiments demonstrate improved cross-domain consistency and realistic weather effects. ReaLiTy and LADS provide a reproducible foundation for studying LiDAR adaptation and simulation-driven perception in intelligent transportation systems.

Chinese Translation

可靠的 LiDAR 感知需要在不同传感器、环境及恶劣天气条件下具备鲁棒性。然而，现有数据集很少提供同一场景在不同传感器配置和天气条件下的物理一致观测，限制了对域迁移的系统性分析。本文提出 ReaLiTy，一种统一的物理驱动框架，用于将 LiDAR 数据转换以匹配目标传感器规格和天气条件。该框架结合了基于物理的线索与学习模块，生成逼真的强度模式，同时通过基于物理的天气模型引入一致的几何和辐射退化。基于此框架，我们引入了 LiDAR 适应数据集套件（LADS），该套件包含物理一致且可转换的点云数据，与原始数据集一一对应。实验表明该方法提升了跨域一致性及天气效果的真实性。ReaLiTy 与 LADS 为智能交通系统中 LiDAR 适应及基于仿真的感知研究提供了可复现的基础。

View on arXiv Download PDF AI Translation

cs.RO / 14 / 2604.10241

A Coordinate-Invariant Local Representation of Motion and Force Trajectories for Identification and Generalization Across Coordinate Systems

一种用于跨坐标系识别与泛化的运动与力轨迹坐标不变局部表示

Verduyn, Arno, Aertbeliën, Erwin, Vochten, Maxim, De Schutter, Joris

Abstract

Identifying the trajectories of rigid bodies and of interaction forces is essential for a wide range of tasks in robotics, biomechanics, and related domains. These tasks include trajectory segmentation, recognition, and prediction. For these tasks, a key challenge lies in achieving consistent results when the trajectory is expressed in different coordinate systems. A way to address this challenge is to utilize trajectory models that can generalize across coordinate systems. The focus of this paper is on such trajectory models obtained by transforming the trajectory into a coordinate-invariant representation. However, coordinate-invariant representations often suffer from sensitivity to measurement noise and the manifestation of singularities in the representation, where the representation is not uniquely defined. This paper aims to address this limitation by introducing the novel Dual-Upper-Triangular Invariant Representation (DUTIR), with improved robustness to singularities, along with its computational algorithm. The proposed representation is formulated at a level of abstraction that makes it applicable to both rigid-body trajectories and interaction-force trajectories, hence making it a versatile tool for robotics, biomechanics, and related domains.

Chinese Translation

识别刚体轨迹及交互力轨迹对于机器人学、生物力学及相关领域的众多任务至关重要，这些任务包括轨迹分割、识别与预测。实现这些任务的关键挑战在于当轨迹以不同坐标系表达时，如何获得一致的结果。解决该挑战的一种方法是采用能够跨坐标系泛化的轨迹模型。本文关注通过将轨迹转换为坐标不变表示而获得的此类轨迹模型。然而，坐标不变表示通常对测量噪声敏感，且存在奇异点问题，即表示不唯一。本文旨在通过引入新颖的双上三角不变表示（Dual-Upper-Triangular Invariant Representation，DUTIR）及其计算算法，提升对奇异性的鲁棒性。所提表示在抽象层面上的构建使其既适用于刚体轨迹，也适用于交互力轨迹，从而成为机器人学、生物力学及相关领域的多功能工具。

View on arXiv Download PDF AI Translation

cs.RO / 15 / 2604.10351

Trajectory-based actuator identification via differentiable simulation

基于轨迹的执行器识别方法：通过可微分仿真实现

Kovalev, Vyacheslav, Chaikovskaia, Ekaterina, Davydenko, Egor, Gorbachev, Roman

Abstract

Accurate actuation models are critical for bridging the gap between simulation and real robot behavior, yet obtaining high-fidelity actuator dynamics typically requires dedicated test stands and torque sensing. We present a trajectory-based actuator identification method that uses differentiable simulation to fit system-level actuator models from encoder motion alone. Identification is posed as a trajectory-matching problem: given commanded joint positions and measured joint angles and velocities, we optimize actuator and simulator parameters by backpropagating through the simulator, without torque sensors, current/voltage measurements, or access to embedded motor-control internals. The framework supports multiple model classes, ranging from compact structured parameterizations to neural actuator mappings, within a unified optimization pipeline. On held-out real-robot trajectories under identical commands, the proposed torque-sensor-free identification achieves much tighter trajectory alignment than a supervised stand-trained baseline dominated by steady-state data, reducing mean absolute position error from 14.20 mrad to as low as 7.54 mrad (1.88 times). Finally, we demonstrate downstream impact in a real-robot locomotion study: training policies with the refined actuator model increases travel distance by 46% and reduces rotational deviation by 75% relative to the baseline.

Chinese Translation

准确的执行器模型对于缩小仿真与真实机器人行为之间的差距至关重要，然而获得高保真度的执行器动力学通常需要专用的测试台和扭矩传感器。我们提出了一种基于轨迹的执行器识别方法，利用可微分仿真仅通过编码器运动数据拟合系统级执行器模型。识别过程被构建为轨迹匹配问题：在给定关节指令位置及测量的关节角度和速度的条件下，我们通过在仿真器中反向传播优化执行器及仿真参数，无需扭矩传感器、电流/电压测量或访问嵌入式电机控制内部信息。该框架支持多种模型类别，从紧凑的结构化参数化到神经执行器映射，均可在统一的优化流程中实现。在相同指令下的真实机器人保留轨迹测试中，所提无扭矩传感器识别方法相比以稳态数据为主的监督训练基线，实现了更紧密的轨迹对齐，将平均绝对位置误差从14.20毫弧度降低至最低7.54毫弧度（提升1.88倍）。最后，我们在真实机器人运动研究中展示了该方法的下游应用效果：利用精炼后的执行器模型训练的策略相比基线，行进距离提升了46%，旋转偏差减少了75%。

View on arXiv Download PDF AI Translation

cs.RO / 16 / 2604.10358

COSMIK-MPPI: Scaling Constrained Model Predictive Control to Collision Avoidance in Close-Proximity Dynamic Human Environments

COSMIK-MPPI：将受限模型预测控制扩展至近距离动态人类环境中的碰撞避免

Gursoy, Ege, Sabbah, Maxime, Haffemayer, Arthur, Santos, Joao Cavalcanti, Crestaz, Pietro Noah, Petrik, Vladimir, Mansard, Nicolas, Bonnet, Vincent

Abstract

Ensuring safe physical interaction between torque-controlled manipulators and humans is essential for deploying robots in everyday environments. Model Predictive Control (MPC) has emerged as a suitable framework thanks to its capacity to handle hard constraints, provide strong guarantees and zero-shot adaptability through predictive reasoning. However, Gradient-Based MPC (GB-MPC) solvers have demonstrated limited performance for collision avoidance in complex environments. Sampling-based approaches such as Model Predictive Path Integral (MPPI) control offer an alternative via stochastic rollouts, but enforcing safety via additive penalties is inherently fragile, as it provides no formal constraint satisfaction guarantees. We propose a collision avoidance framework called COSMIK-MPPI combining MPPI with the toolbox for human motion estimation RT-COSMIK and the Constraints-as-Terminations transcription, which enforces safety by treating constraint violations as terminal events, without relying on large penalty terms or explicit human motion prediction. The proposed approach is evaluated against state-of-the-art GB-MPC and vanilla MPPI in simulation and on a real manipulator arm. Results show that COSMIK-MPPI achieves a 100% task success rate with a constant computation time (22 ms), largely outperforming GB-MPC. In simulated infeasible scenarios, COSMIK-MPPI consistently generates collision-free trajectories, contrary to vanilla MPPI. These properties enabled safe execution of complex real-world human-robot interaction tasks in shared workspaces using an affordable markerless human motion estimator, demonstrating a robust, compliant, and practical solution for predictive collision avoidance (cf. results showcased at https://exquisite-parfait-ffa925.netlify.app)

Chinese Translation

确保扭矩控制的操作臂与人类之间的安全物理交互对于在日常环境中部署机器人至关重要。模型预测控制（Model Predictive Control, MPC）因其处理硬约束的能力、提供强有力的保证以及通过预测推理实现零样本适应性而成为一种合适的框架。然而，基于梯度的MPC（Gradient-Based MPC, GB-MPC）求解器在复杂环境中的碰撞避免表现有限。基于采样的方法，如模型预测路径积分（Model Predictive Path Integral, MPPI）控制，通过随机展开提供了另一种选择，但通过附加惩罚来强制执行安全性本质上是脆弱的，因为它没有提供正式的约束满足保证。我们提出了一种称为COSMIK-MPPI的碰撞避免框架，将MPPI与人类运动估计工具箱RT-COSMIK和约束作为终止转录结合起来，通过将约束违反视为终端事件来强制执行安全性，而不依赖于大规模惩罚项或明确的人类运动预测。所提方法在仿真和真实操作臂上与最先进的GB-MPC和普通MPPI进行了评估。结果表明，COSMIK-MPPI以恒定的计算时间（22毫秒）实现了100%的任务成功率，远远优于GB-MPC。在模拟的不可行场景中，COSMIK-MPPI始终生成无碰撞轨迹，而普通MPPI则不然。这些特性使得在共享工作空间中使用经济实惠的无标记人类运动估计器安全执行复杂的现实世界人机交互任务，展示了一种稳健、合规且实用的预测碰撞避免解决方案（参见展示结果：https://exquisite-parfait-ffa925.netlify.app）

View on arXiv Download PDF AI Translation

cs.RO / 17 / 2604.10432

AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

AnySlot：用于零-shot槽级放置的目标条件视觉-语言-动作策略

Hu, Zhaofeng, Zhou, Sifan, Zhang, Qinbo, Xu, Rongtao, Su, Qi, Liang, Ci-Jyun

Abstract

Vision-Language-Action (VLA) policies have emerged as a versatile paradigm for generalist robotic manipulation. However, precise object placement under compositional language instructions remains a major challenge for modern monolithic VLA policies. Slot-level tasks require both reliable slot grounding and sub-centimeter execution accuracy. To this end, we propose AnySlot, a framework that reduces compositional complexity by introducing an explicit spatial visual goal as an intermediate representation between language grounding and control. AnySlot turns language into an explicit visual goal by generating a scene marker, then executes this goal with a goal-conditioned VLA policy. This hierarchical design effectively decouples high-level slot selection from low-level execution, ensuring both semantic accuracy and spatial robustness. Furthermore, recognizing the lack of existing benchmarks for such precision-demanding tasks, we introduce SlotBench, a comprehensive simulation benchmark featuring nine task categories tailored to evaluate structured spatial reasoning in slot-level placement. Extensive experiments show that AnySlot significantly outperforms flat VLA baselines and previous modular grounding methods in zero-shot slot-level placement.

Chinese Translation

视觉-语言-动作（VLA）策略已成为通用机器人操作的多功能范式。然而，在组合语言指令下实现精确的物体放置仍然是现代单一VLA策略面临的主要挑战。槽级任务需要可靠的槽定位和亚厘米级的执行精度。为此，我们提出了AnySlot，一个通过引入明确的空间视觉目标作为语言定位与控制之间的中间表示来降低组合复杂性的框架。AnySlot通过生成场景标记将语言转化为明确的视觉目标，然后使用目标条件的VLA策略执行该目标。这种分层设计有效地将高层次的槽选择与低层次的执行解耦，确保语义准确性和空间鲁棒性。此外，鉴于现有的针对这种高精度任务的基准测试的缺乏，我们引入了SlotBench，一个全面的仿真基准，涵盖九个任务类别，旨在评估槽级放置中的结构化空间推理。大量实验表明，AnySlot在零-shot槽级放置中显著优于平面VLA基线和之前的模块化定位方法。

View on arXiv Download PDF AI Translation

cs.RO / 18 / 2604.10433

PRoID: Predicted Rate of Information Delivery in Multi-Robot Exploration and Relaying

PRoID：多机器人探索与中继中的信息传递预测速率

Kim, Seungchan, Baek, Seungjae, Corah, Micah, Best, Graeme, Moon, Brady, Scherer, Sebastian

Abstract

We address Multi-Robot Exploration and Relaying (MRER): a team of robots must explore an unknown environment and deliver acquired information to a fixed base station within a mission time limit. The central challenge is deciding when each robot should stop exploring and relay: this depends on what the robot is likely to find ahead, what information it uniquely holds, and whether immediate or future delivery is more valuable. Prior approaches either ignore the reporting requirement entirely or rely on fixed-schedule relay strategies that cannot adapt to environment structure, team composition, or mission progress. We introduce PRoID (Predicted Rate of Information Delivery), a relay criterion that uses learned map prediction to estimate each robot's future information gain along its planned path, accounting for what teammates are already relaying. PRoID triggers relay when immediate return yields higher information delivery per unit time. We further propose PRoID-Safe, a failure-aware extension that incorporates robot survival probability into the relay criterion, naturally biasing decisions toward earlier relay as failure risk grows. We evaluate on real-world indoor floor plan datasets and show that PRoID and PRoID-Safe outperform fixed-schedule baselines, with stronger relative gains in failure scenarios.

Chinese Translation

本文研究多机器人探索与中继（Multi-Robot Exploration and Relaying, MRER）问题：一组机器人必须在任务时间限制内探索未知环境并将获取的信息传递至固定基站。核心挑战在于决定每个机器人何时停止探索并进行中继，这取决于机器人前方可能发现的内容、其独有的信息以及即时传递与未来传递的价值权衡。以往方法要么完全忽略报告需求，要么依赖固定时间表的中继策略，无法适应环境结构、团队组成或任务进展。我们提出PRoID（Predicted Rate of Information Delivery），一种利用学习到的地图预测来估计机器人沿规划路径未来信息增益的中继准则，同时考虑队友已中继的信息。PRoID在即时返回能带来更高单位时间信息传递时触发中继。进一步地，我们提出PRoID-Safe，一种考虑机器人存活概率的失败感知扩展，将失败风险增加时的决策自然偏向更早中继。我们在真实室内平面图数据集上进行了评估，结果表明PRoID和PRoID-Safe均优于固定时间表基线方法，在失败场景中表现出更显著的相对提升。

View on arXiv Download PDF AI Translation

cs.RO / 19 / 2604.10533

VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions

VLN-NF：具备可行性意识的视觉与语言导航与虚假前提指令

Su, Hung-Ting, Wang, Ting-Jun, Yeh, Jia-Fong, Sun, Min, Hsu, Winston H.

Abstract

Conventional Vision-and-Language Navigation (VLN) benchmarks assume instructions are feasible and the referenced target exists, leaving agents ill-equipped to handle false-premise goals. We introduce VLN-NF, a benchmark with false-premise instructions where the target is absent from the specified room and agents must navigate, gather evidence through in-room exploration, and explicitly output NOT-FOUND. VLN-NF is constructed via a scalable pipeline that rewrites VLN instructions using an LLM and verifies target absence with a VLM, producing plausible yet factually incorrect goals. We further propose REV-SPL to jointly evaluate room reaching, exploration coverage, and decision correctness. To address this challenge, we present ROAM, a two-stage hybrid that combines supervised room-level navigation with LLM/VLM-driven in-room exploration guided by a free-space clearance prior. ROAM achieves the best REV-SPL among compared methods, while baselines often under-explore and terminate prematurely under unreliable instructions. VLN-NF project page can be found at https://vln-nf.github.io/.

Chinese Translation

传统的视觉与语言导航（VLN）基准假设指令是可行的，并且所引用的目标存在，这使得智能体在处理虚假前提目标时显得无能为力。我们引入了VLN-NF，这是一个具有虚假前提指令的基准，其中目标在指定房间中缺失，智能体必须通过房间内探索导航、收集证据，并明确输出NOT-FOUND。VLN-NF是通过一个可扩展的流程构建的，该流程使用大型语言模型（LLM）重写VLN指令，并通过视觉语言模型（VLM）验证目标的缺失，生成合理但事实不正确的目标。我们进一步提出了REV-SPL，以共同评估房间到达、探索覆盖率和决策正确性。为了解决这一挑战，我们提出了ROAM，这是一种两阶段混合方法，结合了监督的房间级导航与基于LLM/VLM驱动的房间内探索，后者由自由空间清理先验引导。ROAM在比较方法中实现了最佳的REV-SPL，而基线方法在不可靠的指令下往往探索不足并提前终止。VLN-NF项目页面可以在https://vln-nf.github.io/找到。

View on arXiv Download PDF AI Translation

cs.RO / 20 / 2604.10548

Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation

简单而稳定、快速且安全：通过高保真可微分仿真实现端到端控制

Li, Fanxing, Wang, Shengyang, Huang, Yuxiang, Sun, Fangyu, Yan, Yufei, Zou, Danping, Yu, Wenxian

Abstract

Obstacle avoidance is a fundamental vision-based task essential for enabling quadrotors to perform advanced applications. When planning the trajectory, existing approaches both on optimization and learning typically regard quadrotor as a point-mass model, giving path or velocity commands then tracking the commands by outer-loop controller. However, at high speeds, planned trajectories sometimes become dynamically infeasible in actual flight, which beyond the capacity of controller. In this paper, we propose a novel end-to-end policy that directly maps depth images to low-level bodyrate commands by reinforcement learning via differentiable simulation. The high-fidelity simulation in training after parameter identification significantly reduces all the gaps between training, simulation and real world. Analytical process by differentiable simulation provides accurate gradient to ensure efficiently training the low-level policy without expert guidance. The policy employs a lightweight and the most simple inference pipeline that runs without explicit mapping, backbone networks, primitives, recurrent structures, or backend controllers, nor curriculum or privileged guidance. By inferring low-level command directly to the hardware controller, the method enables full flight envelope control and avoids the dynamic-infeasible issue.Experimental results demonstrate that the proposed approach achieves the highest success rate and the lowest jerk among state-of-the-art baselines across multiple benchmarks. The policy also exhibits strong generalization, successfully deploying zero-shot in unseen, outdoor environments while reaching speeds of up to 7.5m/s as well as stably flying in the super-dense forest.

Chinese Translation

避障是基于视觉的基本任务，对于使四旋翼无人机执行高级应用至关重要。在轨迹规划时，现有的优化和学习方法通常将四旋翼视为点质量模型，先给出路径或速度指令，再通过外环控制器跟踪这些指令。然而，在高速飞行时，规划的轨迹有时在实际飞行中变得动力学上不可行，超出了控制器的能力范围。本文提出了一种新颖的端到端策略，通过可微分仿真中的强化学习，直接将深度图像映射到低级机体角速度（bodyrate）指令。经过参数识别后的高保真仿真显著缩小了训练、仿真与现实世界之间的所有差距。可微分仿真提供的解析过程能够准确计算梯度，确保在无专家指导下高效训练低级策略。该策略采用轻量且极简的推理流程，无需显式地图、骨干网络、基本动作、循环结构或后端控制器，也不依赖课程学习或特权指导。通过直接推断低级指令给硬件控制器，该方法实现了全飞行包络控制，避免了动力学不可行问题。实验结果表明，所提方法在多个基准测试中实现了最高的成功率和最低的加加速度（jerk），并表现出强大的泛化能力，能够零次迁移部署于未见过的户外环境，最高速度达到7.5米/秒，同时在超密集森林中稳定飞行。

View on arXiv Download PDF AI Translation

cs.RO / 21 / 2604.10579

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

AffordGen：基于可供性对应关系生成多样化示范以实现可泛化的物体操作

Zhang, Jiawei, Hu, Kaizhe, Huang, Yingqian, Ju, Yuanchen, Xue, Zhengrong, Xu, Huazhe

Abstract

Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often constrained by geometric variations due to limited data diversity. Leveraging powerful 3D generative models and vision foundation models (VFMs), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. This large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies trained with AffordGen achieve high success rates and enable zero-shot generalization to truly unseen objects, significantly improving data efficiency in robot learning.

Chinese Translation

尽管现代模仿学习方法在机器人操作领域取得了显著进展，但其性能常因数据多样性受限导致的几何变化而受到限制。本文提出的AffordGen框架利用强大的三维生成模型和视觉基础模型（VFMs），通过跨大规模三维网格的语义关键点对应关系生成新的机器人操作轨迹，从而克服了这一限制。该大规模、具备可供性意识的数据集随后用于训练鲁棒的闭环视觉运动策略，将可供性的语义泛化能力与端到端学习的反应鲁棒性相结合。仿真和真实环境中的实验表明，使用AffordGen训练的策略不仅取得了较高的成功率，还实现了对全新未见物体的零样本泛化，显著提升了机器人学习的数据效率。

View on arXiv Download PDF AI Translation

cs.RO / 22 / 2604.10593

MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

MonoEM-GS：单目期望最大化高斯散点SLAM

Kruzhkov, Evgenii, Behnke, Sven

Abstract

Feed-forward geometric foundation models can infer dense point clouds and camera motion directly from RGB streams, providing priors for monocular SLAM. However, their predictions are often view-dependent and noisy: geometry can vary across viewpoints and under image transformations, and local metric properties may drift between frames. We present MonoEM-GS, a monocular mapping pipeline that integrates such geometric predictions into a global Gaussian Splatting representation while explicitly addressing these inconsistencies. MonoEM-GS couples Gaussian Splatting with an Expectation--Maximization formulation to stabilize geometry, and employs ICP-based alignment for monocular pose estimation. Beyond geometry, MonoEM-GS parameterizes Gaussians with multi-modal features, enabling in-place open-set segmentation and other downstream queries directly on the reconstructed map. We evaluate MonoEM-GS on 7-Scenes, TUM RGB-D and Replica, and compare against recent baselines.

Chinese Translation

前馈几何基础模型能够直接从RGB流中推断密集点云和相机运动，为单目SLAM提供先验。然而，它们的预测往往依赖于视角且噪声较大：几何形状在不同视角和图像变换下可能会有所变化，局部度量属性在帧之间可能会漂移。我们提出了MonoEM-GS，一种单目映射管道，将这些几何预测整合到全局高斯散点表示中，同时明确解决这些不一致性。MonoEM-GS将高斯散点与期望最大化（Expectation-Maximization）公式结合，以稳定几何形状，并采用基于ICP的对齐方法进行单目姿态估计。除了几何形状，MonoEM-GS还使用多模态特征对高斯进行参数化，从而能够在重建的地图上直接进行开放集分割和其他下游查询。我们在7-Scenes、TUM RGB-D和Replica数据集上评估了MonoEM-GS，并与近期的基线进行了比较。

View on arXiv Download PDF AI Translation

cs.RO / 23 / 2604.10598

AWARE: Adaptive Whole-body Active Rotating Control for Enhanced LiDAR-Inertial Odometry under Human-in-the-Loop Interaction

AWARE：增强人机交互下的自适应全身主动旋转控制用于激光雷达惯性测程

Zhang, Yizhe, Li, Jianping, Yin, Liangliang, Dong, Zhen, Yang, Bisheng

Abstract

Human-in-the-loop (HITL) UAV operation is essential in complex and safety-critical aerial surveying environments, where human operators provide navigation intent while onboard autonomy must maintain accurate and robust state estimation. A key challenge in this setting is that resource-constrained UAV platforms are often limited to narrow-field-of-view LiDAR sensors. In geometrically degenerate or feature-sparse scenes, limited sensing coverage often weakens LiDAR Inertial Odometry (LIO)'s observability, causing drift accumulation, degraded geometric accuracy, and unstable state estimation, which directly compromise safe and effective HITL operation and the reliability of downstream surveying products. To overcome this limitation, we present AWARE, a bio-inspired whole-body active yawing framework that exploits the UAV's own rotational agility to extend the effective sensor horizon and improve LIO's observability without additional mechanical actuation. The core of AWARE is a differentiable Model Predictive Control (MPC) framework embedded in a Reinforcement Learning (RL) loop. It first identifies the viewing direction that maximizes information gain across the full yaw space, and a lightweight RL agent then adjusts the MPC cost weights online according to the current environmental context, enabling an adaptive balance between estimation accuracy and flight stability. A Safe Flight Corridor mechanism further ensures operational safety within this HITL paradigm by decoupling the operator's navigational intent from autonomous yaw optimization to enable safe and efficient cooperative control. We validate AWARE through extensive experiments in diverse simulated and real-world environments.

Chinese Translation

人机交互（HITL）无人机操作在复杂且安全关键的空中测绘环境中至关重要，在这些环境中，人类操作员提供导航意图，而机载自主系统必须保持准确且稳健的状态估计。在这种情况下，一个主要挑战是资源受限的无人机平台通常仅限于狭视场激光雷达传感器。在几何退化或特征稀疏的场景中，有限的感知覆盖往往削弱激光雷达惯性测程（LIO）的可观测性，导致漂移积累、几何精度下降和状态估计不稳定，这直接影响安全有效的HITL操作和下游测绘产品的可靠性。为克服这一限制，我们提出了AWARE，一种生物启发的全身主动偏航框架，利用无人机自身的旋转灵活性来扩展有效传感器视野并提高LIO的可观测性，而无需额外的机械驱动。AWARE的核心是一个可微分的模型预测控制（MPC）框架，嵌入在强化学习（RL）循环中。它首先识别出在整个偏航空间中最大化信息增益的视角方向，然后一个轻量级的RL代理根据当前环境上下文在线调整MPC成本权重，从而实现估计精度与飞行稳定性之间的自适应平衡。安全飞行走廊机制进一步确保了在这一HITL范式下的操作安全，通过将操作员的导航意图与自主偏航优化解耦，从而实现安全高效的协同控制。我们通过在多样化的模拟和现实环境中进行广泛实验来验证AWARE。

View on arXiv Download PDF AI Translation

cs.RO / 24 / 2604.10647

OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction

OmniUMI：通过人类对齐的多模态交互实现物理基础的机器人学习

Luo, Shaqi, Li, Yuanyuan, Hu, Youhao, Yu, Chenhao, Xu, Chaoran, Zhang, Jiachen, Yao, Guocai, Huang, Tiejun, He, Ran, Wang, Zhongyuan

Abstract

UMI-style interfaces enable scalable robot learning, but existing systems remain largely visuomotor, relying primarily on RGB observations and trajectory while providing only limited access to physical interaction signals. This becomes a fundamental limitation in contact-rich manipulation, where success depends on contact dynamics such as tactile interaction, internal grasping force, and external interaction wrench that are difficult to infer from vision alone. We present OmniUMI, a unified framework for physically grounded robot learning via human-aligned multimodal interaction. OmniUMI synchronously captures RGB, depth, trajectory, tactile sensing, internal grasping force, and external interaction wrench within a compact handheld system, while maintaining collection--deployment consistency through a shared embodiment design. To support human-aligned demonstration, OmniUMI provides dual-force feedback through bilateral gripper feedback and natural perception of external interaction wrench in the handheld embodiment. Built on this interface, we extend diffusion policy with visual, tactile, and force-related observations, and deploy the learned policy through impedance-based execution for unified regulation of motion and contact behavior. Experiments demonstrate reliable sensing and strong downstream performance on force-sensitive pick-and-place, interactive surface erasing, and tactile-informed selective release. Overall, OmniUMI combines physically grounded multimodal data acquisition with human-aligned interaction, providing a scalable foundation for learning contact-rich manipulation.

Chinese Translation

UMI风格的接口使得机器人学习具备可扩展性，但现有系统仍主要依赖于视觉运动，主要基于RGB观察和轨迹，同时仅提供有限的物理交互信号访问。这在接触丰富的操作中成为一个基本限制，因为成功依赖于接触动态，如触觉交互、内部抓取力和外部交互扭矩，这些仅通过视觉难以推断。我们提出了OmniUMI，一个通过人类对齐的多模态交互实现物理基础的机器人学习的统一框架。OmniUMI在一个紧凑的手持系统中同步捕捉RGB、深度、轨迹、触觉感知、内部抓取力和外部交互扭矩，同时通过共享的具身设计保持收集与部署的一致性。为了支持人类对齐的演示，OmniUMI通过双向夹持器反馈和手持具身中的外部交互扭矩的自然感知提供双重力反馈。在此接口的基础上，我们扩展了扩散策略，结合视觉、触觉和与力相关的观察，并通过基于阻抗的执行部署学习到的策略，以统一调节运动和接触行为。实验表明，在对力敏感的拾取与放置、交互表面擦除和触觉信息驱动的选择性释放等任务上，OmniUMI展现了可靠的感知能力和强大的下游性能。总体而言，OmniUMI结合了物理基础的多模态数据采集与人类对齐的交互，为学习接触丰富的操作提供了可扩展的基础。

View on arXiv Download PDF AI Translation

cs.RO / 25 / 2604.10677

LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment

LIDEA：通过隐式特征蒸馏和显式几何对齐实现人机模仿学习

Xu, Yifu, Lin, Bokai, Zhan, Xinyu, Fang, Hongjie, Li, Yong-Lu, Lu, Cewu, Yang, Lixin

Abstract

Scaling up robot learning is hindered by the scarcity of robotic demonstrations, whereas human videos offer a vast, untapped source of interaction data. However, bridging the embodiment gap between human hands and robot arms remains a critical challenge. Existing cross-embodiment transfer strategies typically rely on visual editing, but they often introduce visual artifacts due to intrinsic discrepancies in visual appearance and 3D geometry. To address these limitations, we introduce LIDEA (Implicit Feature Distillation and Explicit Geometric Alignment), an imitation learning framework in which policy learning benefits from human demonstrations. In the 2D visual domain, LIDEA employs a dual-stage transitive distillation pipeline that aligns human and robot representations in a shared latent space. In the 3D geometric domain, we propose an embodiment-agnostic alignment strategy that explicitly decouples embodiment from interaction geometry, ensuring consistent 3D-aware perception. Extensive experiments empirically validate LIDEA from two perspectives: data efficiency and OOD robustness. Results show that human data substitutes up to 80% of costly robot demonstrations, and the framework successfully transfers unseen patterns from human videos for out-of-distribution generalization.

Chinese Translation

机器人学习的规模扩大受到机器人演示稀缺的限制，而人类视频则提供了一个庞大且未被充分利用的交互数据源。然而，弥合人类手与机器人手臂之间的体现差距仍然是一个关键挑战。现有的跨体现转移策略通常依赖于视觉编辑，但由于视觉外观和三维几何的内在差异，它们往往会引入视觉伪影。为了解决这些局限性，我们提出了LIDEA（隐式特征蒸馏和显式几何对齐），这是一个模仿学习框架，其中策略学习受益于人类演示。在二维视觉领域，LIDEA采用双阶段的过渡蒸馏管道，将人类和机器人表示对齐到共享的潜在空间。在三维几何领域，我们提出了一种与体现无关的对齐策略，明确将体现与交互几何解耦，确保一致的三维感知。大量实验从数据效率和OOD（分布外）鲁棒性两个角度实证验证了LIDEA。结果表明，人类数据可以替代高达80%的昂贵机器人演示，并且该框架成功地将人类视频中的未见模式转移用于分布外泛化。

View on arXiv Download PDF AI Translation

cs.RO / 26 / 2604.10809

WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

WARPED：基于手腕视角渲染的机器人策略学习方法，源自第一人称人类示范

Freeman, Harry, Kim, Chung Hee, Kantor, George

Abstract

Recent advancements in learning from human demonstration have shown promising results in addressing the scalability and high cost of data collection required to train robust visuomotor policies. However, existing approaches are often constrained by a reliance on multiview camera setups, depth sensors, or custom hardware and are typically limited to policy execution from third-person or egocentric cameras. In this paper, we present WARPED, a framework designed to synthesize realistic wrist-view observations from human demonstration videos to facilitate the training of visuomotor policies using only monocular RGB data. With data collected from an egocentric RGB camera, our system leverages vision foundation models to initialize the interactive scene. A hand-object interaction pipeline is then employed to track the hand and manipulated object and retarget the trajectories to a robotic end-effector. Lastly, photo-realistic wrist-view observations are synthesized via Gaussian Splatting to directly train a robotic policy. We demonstrate that WARPED achieves success rates comparable to policies trained on teleoperated demonstration data for five tabletop manipulation tasks, while requiring 5-8x less data collection time.

Chinese Translation

近年来，从人类示范中学习的进展在解决训练鲁棒视觉运动策略所需的数据收集规模和高成本问题上展现出良好前景。然而，现有方法通常依赖多视角摄像头设置、深度传感器或定制硬件，且通常仅限于从第三人称或第一人称摄像头执行策略。本文提出了WARPED框架，旨在从人类示范视频中合成逼真的手腕视角观察，以便仅使用单目RGB数据训练视觉运动策略。利用从第一人称RGB摄像头采集的数据，系统借助视觉基础模型初始化交互场景，随后通过手-物体交互管线跟踪手部及操作物体，并将轨迹重定向至机器人末端执行器。最后，通过高斯散点渲染（Gaussian Splatting）合成逼真的手腕视角观察，直接用于机器人策略训练。实验证明，WARPED在五个桌面操作任务中取得的成功率与基于远程操作示范数据训练的策略相当，同时数据收集时间减少了5至8倍。

View on arXiv Download PDF AI Translation

cs.RO / 27 / 2604.10856

BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving

BridgeSim：揭示端到端自动驾驶中的开环-闭环差距（OL-CL Gap）

Zhao, Seth Z., Wang, Luobin, Ruan, Hongwei, Bao, Yuxin, Chen, Yilan, Leng, Ziyang, Ravichandran, Abhijit, He, Honglin, Zhou, Zewei, Han, Xu, Peri, Abhishek, Huang, Zhiyu, Desai, Pranav, Christensen, Henrik, Ma, Jiaqi, Zhou, Bolei

Abstract

Open-loop (OL) to closed-loop (CL) gap (OL-CL gap) exists when OL-pretrained policies scoring high in OL evaluations fail to transfer effectively in closed-loop (CL) deployment. In this paper, we unveil the root causes of this systemic failure and propose a practical remedy. Specifically, we demonstrate that OL policies suffer from Observational Domain Shift and Objective Mismatch. We show that while the former is largely recoverable with adaptation techniques, the latter creates a structural inability to model complex reactive behaviors, which forms the primary OL-CL gap. We find that a wide range of OL policies learn a biased Q-value estimator that neglects both the reactive nature of CL simulations and the temporal awareness needed to reduce compounding errors. To this end, we propose a Test-Time Adaptation (TTA) framework that calibrates observational shift, reduces state-action biases, and enforces temporal consistency. Extensive experiments show that TTA effectively mitigates planning biases and yields superior scaling dynamics than its baseline counterparts. Furthermore, our analysis highlights the existence of blind spots in standard OL evaluation protocols that fail to capture the realities of closed-loop deployment.

Chinese Translation

开环（OL）到闭环（CL）差距（OL-CL差距）指的是在开环评估中表现优异的OL预训练策略在闭环部署中无法有效迁移的现象。本文揭示了这一系统性失败的根本原因，并提出了一种实用的解决方案。具体而言，我们证明了OL策略存在观测域偏移（Observational Domain Shift）和目标不匹配（Objective Mismatch）问题。我们展示了前者通过适应技术在很大程度上是可恢复的，而后者则导致了结构性地无法建模复杂反应行为，这构成了主要的OL-CL差距。我们发现，大多数OL策略学习了一个有偏的Q值估计器，该估计器忽视了CL仿真的反应性特征以及减少累积误差所需的时间感知。为此，我们提出了一种测试时适应（Test-Time Adaptation，TTA）框架，用以校正观测偏移、减少状态-动作偏差并强化时间一致性。大量实验表明，TTA有效缓解了规划偏差，并展现出优于基线方法的扩展动态性能。此外，我们的分析还强调了标准OL评估协议中存在的盲点，这些盲点未能反映闭环部署的实际情况。

View on arXiv Download PDF AI Translation

cs.RO / 28 / 2604.10892

HECTOR: Human-centric Hierarchical Coordination and Supervision of Robotic Fleets under Continual Temporal Tasks

HECTOR：面向人类的机器人群体分层协调与监督框架，应用于持续性时序任务

Wang, Shen, Luo, Yinhang, Li, Jie, Guo, Meng

Abstract

Robotic fleets can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. However, it can be demanding or even impractical for an operator to directly control each robot. Thus, autonomy of the fleet and its online interaction with the operator are both essential, particularly in dynamic and partially unknown environments. The operator might need to add new tasks, cancel some tasks, change priorities and modify planning results. How to design the procedure for these interactions and efficient algorithms to fulfill these needs have been mostly neglected in the related literature. Thus, this work proposes a human-centric coordination and supervision scheme (HECTOR) for large-scale robotic fleets under continual and uncertain temporal tasks. It consists of three hierarchical layers: (I) the bidirectional and multimodal protocol of online human-fleet interaction, where the operator interacts with and supervises the whole fleet; (II) the rolling assignment of currently-known tasks to teams within a certain horizon, and (III) the dynamic coordination within a team given the detected subtasks during online execution. The overall mission can be as general as temporal logic formulas over collaborative actions. Such hierarchical structure allows human interaction and supervision at different granularities and triggering conditions, to both improve computational efficiency and reduce human effort. Extensive human-in-the-loop simulations are performed over heterogeneous fleets under various temporal tasks and environmental uncertainties.

Chinese Translation

机器人群体在并行协作时能够极大提升效率，例如在配送、监控、搜救等场景中。然而，操作员直接控制每一台机器人既繁琐又不切实际。因此，机器人群体的自主性及其与操作员的在线交互尤为关键，尤其是在动态且部分未知的环境中。操作员可能需要添加新任务、取消任务、调整优先级以及修改规划结果。如何设计满足这些交互需求的流程及高效算法，在相关文献中尚未得到充分关注。为此，本文提出了一种面向人类的分层协调与监督方案（HECTOR），用于大规模机器人群体执行持续且不确定的时序任务。该方案包含三层结构：(I) 双向多模态的在线人机交互协议，支持操作员对整个机器人群体的交互与监督；(II) 在一定时间视野内对当前已知任务进行滚动分配给各团队；(III) 在线执行过程中，基于检测到的子任务对团队内部进行动态协调。整体任务可用基于协作动作的时序逻辑公式来描述。该分层结构允许在不同粒度和触发条件下实现人机交互与监督，既提升计算效率，又减轻人类负担。通过大量人机闭环仿真，验证了该方法在异构机器人群体中应对多样时序任务及环境不确定性的有效性。

View on arXiv Download PDF AI Translation

cs.RO / 29 / 2604.10929

Ro-SLM: Onboard Small Language Models for Robot Task Planning and Operation Code Generation

Ro-SLM：用于机器人任务规划与操作代码生成的车载小型语言模型

Wang, Wenhao, Li, Yanyan, Jiao, Long, Yuan, Jiawei

Abstract

Recent advances in large language models (LLMs) provide robots with contextual reasoning abilities to comprehend human instructions. Yet, current LLM-enabled robots typically depend on cloud-based models or high-performance computing infrastructure, which limit their deployment on robots under unreliable internet environments or with constrained computational resources, such as UAVs and small ground vehicles. Thus, deploying fine-tuned small language models (SLMs) that support onboard deployment offers a promising alternative. This paper introduces Ro-SLM, a framework that enables reliable SLM-driven robot operation by distilling LLMs' knowledge and reasoning. Ro-SLM starts from dataset synthesis by leveraging LLMs to generate diverse task instructions, produce corresponding ground truth code with minimal human assistance, and augment instructions into real-world application scenarios. Ro-SLM is then fine-tuned with the dataset, in which LLM serves as a reward function to guide the training. Extensive experiments on UAV operation tasks demonstrate that Ro-SLM improves the performance of SLM from being incapable of supporting robotic task planning and code generation to achieving performance that approaches LLM.

Chinese Translation

近年来，大型语言模型（LLMs）的进步赋予机器人上下文推理能力，使其能够理解人类指令。然而，目前基于LLM的机器人通常依赖云端模型或高性能计算基础设施，这限制了其在网络不稳定或计算资源受限的机器人（如无人机和小型地面车辆）上的部署。因此，部署支持车载运行的微调小型语言模型（SLMs）成为一种有前景的替代方案。本文提出了Ro-SLM框架，通过蒸馏LLMs的知识与推理能力，实现基于SLM的可靠机器人操作。Ro-SLM首先利用LLMs合成数据集，生成多样化的任务指令，产出对应的真实代码，且仅需极少人工辅助，并将指令扩展至实际应用场景。随后，Ro-SLM基于该数据集进行微调训练，其中LLM作为奖励函数引导训练过程。在无人机操作任务上的大量实验表明，Ro-SLM显著提升了SLM的性能，使其从无法支持机器人任务规划与代码生成，提升至接近LLM的表现水平。

View on arXiv Download PDF AI Translation

cs.RO / 30 / 2604.10951

Fast-SegSim: Real-Time Open-Vocabulary Segmentation for Robotics in Simulation

Fast-SegSim：用于仿真中机器人实时开放词汇分割的框架

Yu, Xuan, Xie, Yuxuan, Zhai, Shichao, Ye, Shuhao, Xiong, Rong, Wang, Yue

Abstract

Open-vocabulary panoptic reconstruction is crucial for advanced robotics and simulation. However, existing 3D reconstruction methods, such as NeRF or Gaussian Splatting variants, often struggle to achieve the real-time inference frequency required by robotic control loops. Existing methods incur prohibitive latency when processing the high-dimensional features required for robust open-vocabulary segmentation. We propose Fast-SegSim, a novel, simple, and end-to-end framework built upon 2D Gaussian Splatting, designed to realize real-time, high-fidelity, and 3D-consistent open-vocabulary segmentation reconstruction. Our core contribution is a highly optimized rendering pipeline that specifically addresses the computational bottleneck of high-channel segmentation feature accumulation. We introduce two key optimizations: Precise Tile Intersection to reduce rasterization redundancy, and a novel Top-K Hard Selection strategy. This strategy leverages the geometric sparsity inherent in the 2D Gaussian representation to greatly simplify feature accumulation and alleviate bandwidth limitations, achieving render rates exceeding 40 FPS. Fast-SegSim provides critical value in robotic applications: it serves both as a high-frequency sensor input for simulation platforms like Gazebo, and its 3D-consistent outputs provide essential multi-view 'ground truth' labels for fine-tuning downstream perception tasks. We demonstrate this utility by using the generated labels to fine-tune the perception module in object goal navigation, successfully doubling the navigation success rate. Our superior rendering speed and practical utility underscore Fast-SegSim's potential to bridge the sim-to-real gap.

Chinese Translation

开放词汇全景重建对于先进的机器人技术和仿真至关重要。然而，现有的3D重建方法，如NeRF或高斯点云（Gaussian Splatting）变体，往往难以达到机器人控制循环所需的实时推理频率。在处理高维特征以实现稳健的开放词汇分割时，现有方法会产生过高的延迟。我们提出了Fast-SegSim，这是一种新颖、简单且端到端的框架，基于2D高斯点云构建，旨在实现实时、高保真且3D一致的开放词汇分割重建。我们的核心贡献是一个高度优化的渲染管道，专门解决高通道分割特征累积的计算瓶颈。我们引入了两个关键优化：精确的瓦片交集（Precise Tile Intersection）以减少光栅化冗余，以及一种新颖的Top-K硬选择策略（Top-K Hard Selection）。该策略利用2D高斯表示中固有的几何稀疏性，大大简化了特征累积并缓解了带宽限制，实现了超过40帧每秒的渲染速率。Fast-SegSim在机器人应用中提供了重要价值：它既作为仿真平台（如Gazebo）的高频传感器输入，又其3D一致的输出为下游感知任务的微调提供了必要的多视角“真实标签”。我们通过使用生成的标签微调目标导航中的感知模块，成功将导航成功率提高了一倍。我们卓越的渲染速度和实际效用突显了Fast-SegSim在缩小仿真与现实之间差距的潜力。

View on arXiv Download PDF AI Translation

cs.RO / 31 / 2604.10953

Diffusion Reinforcement Learning Based Online 3D Bin Packing Spatial Strategy Optimization

基于扩散强化学习的在线三维箱体装载空间策略优化

Han, Jie, Li, Tong, Xu, Qingyang, Song, Yong, Pang, Bao, Yuan, Xianfeng

Abstract

The online 3D bin packing problem is important in logistics, warehousing and intelligent manufacturing, with solutions shifting to deep reinforcement learning (DRL) which faces challenges like low sample efficiency. This paper proposes a diffusion reinforcement learning-based algorithm, using a Markov decision chain for packing modeling, height map-based state representation and a diffusion model-based actor network. Experiments show it significantly improves the average number of packed items compared to state-of-the-art DRL methods, with excellent application potential in complex online scenarios.

Chinese Translation

在线三维箱体装载问题在物流、仓储和智能制造中具有重要意义，解决方案正逐步转向深度强化学习（DRL），但面临样本效率低等挑战。本文提出了一种基于扩散强化学习的算法，采用马尔可夫决策链进行装载建模，基于高度图的状态表示以及基于扩散模型的策略网络。实验结果表明，该方法在平均装载数量上显著优于现有最先进的DRL方法，且在复杂的在线场景中具有良好的应用潜力。

View on arXiv Download PDF AI Translation

cs.RO / 32 / 2604.10962

ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

ScoRe-Flow：通过基于评分的强化学习实现流匹配的完整分布控制

Qiu, Xiaotian, Chen, Lukai, Li, Jinhao, Sun, Qi, Zhuo, Cheng, Dai, Guohao

Abstract

Flow Matching (FM) policies have emerged as an efficient backbone for robotic control, offering fast and expressive action generation that underpins recent large-scale embodied AI systems. However, FM policies trained via imitation learning inherit the limitations of demonstration data; surpassing suboptimal behaviors requires reinforcement learning (RL) fine-tuning. Recent methods convert deterministic flows into stochastic differential equations (SDEs) with learnable noise injection, enabling exploration and tractable likelihoods, but such noise-only control can compromise training efficiency when demonstrations already provide strong priors. We observe that modulating the drift via the score function, i.e., the gradient of log-density, steers exploration toward high-probability regions, improving stability. The score admits a closed-form expression from the velocity field, requiring no auxiliary networks. Based on this, we propose ScoRe-Flow, a score-based RL fine-tuning method that combines drift modulation with learned variance prediction to achieve decoupled control over the mean and variance of stochastic transitions. Experiments demonstrate that ScoRe-Flow achieves 2.4x faster convergence than flow-based SOTA on D4RL locomotion tasks and up to 5.4% higher success rates on Robomimic and Franka Kitchen manipulation tasks.

Chinese Translation

流匹配（FM）策略已成为机器人控制的高效支柱，提供快速且富有表现力的动作生成，支撑了近期大规模具身人工智能系统的发展。然而，通过模仿学习训练的FM策略继承了示范数据的局限性；超越次优行为需要强化学习（RL）微调。最近的方法将确定性流转换为具有可学习噪声注入的随机微分方程（SDE），从而实现探索和可处理的似然性，但仅依赖噪声的控制在示范已经提供强先验的情况下可能会影响训练效率。我们观察到，通过评分函数（即对数密度的梯度）调节漂移，可以将探索引导至高概率区域，从而提高稳定性。评分函数可以从速度场中得到封闭形式的表达，无需辅助网络。基于此，我们提出了ScoRe-Flow，一种基于评分的RL微调方法，结合漂移调节与学习的方差预测，实现对随机转移的均值和方差的解耦控制。实验表明，ScoRe-Flow在D4RL运动任务上实现了比基于流的最先进技术快2.4倍的收敛速度，并在Robomimic和Franka Kitchen操作任务上达到了高达5.4%的成功率提升。

View on arXiv Download PDF AI Translation

cs.RO / 33 / 2604.10982

{\Psi}-Map: Panoptic Surface Integrated Mapping Enables Real2Sim Transfer

{A8}-映射：全景表面集成映射实现真实到模拟的转移

Yu, Xuan, Xie, Yuxuan, Jiang, Changjian, Zhai, Shichao, Xiong, Rong, Zhang, Yu, Wang, Yue

Abstract

Open-vocabulary panoptic reconstruction is essential for advanced robotics perception and simulation. However, existing methods based on 3D Gaussian Splatting (3DGS) often struggle to simultaneously achieve geometric accuracy, coherent panoptic understanding, and real-time inference frequency in large-scale scenes. In this paper, we propose a comprehensive framework that integrates geometric reinforcement, end-to-end panoptic learning, and efficient rendering. First, to ensure physical realism in large-scale environments, we leverage LiDAR data to construct plane-constrained multimodal Gaussian Mixture Models (GMMs) and employ 2D Gaussian surfels as the map representation, enabling high-precision surface alignment and continuous geometric supervision. Building upon this, to overcome the error accumulation and cumbersome cross-frame association inherent in traditional multi-stage panoptic segmentation pipelines, we design a query-guided end-to-end learning architecture. By utilizing a local cross-attention mechanism within the view frustum, the system lifts 2D mask features directly into 3D space, achieving globally consistent panoptic understanding. Finally, addressing the computational bottlenecks caused by high-dimensional semantic features, we introduce Precise Tile Intersection and a Top-K Hard Selection strategy to optimize the rendering pipeline. Experimental results demonstrate that our system achieves superior geometric and panoptic reconstruction quality in large-scale scenes while maintaining an inference rate exceeding 40 FPS, meeting the real-time requirements of robotic control loops.

Chinese Translation

开放词汇的全景重建对于先进的机器人感知和模拟至关重要。然而，基于3D高斯点云（3D Gaussian Splatting，3DGS）的方法往往难以在大规模场景中同时实现几何准确性、一致的全景理解和实时推理频率。本文提出了一个综合框架，整合了几何增强、端到端的全景学习和高效渲染。首先，为了确保大规模环境中的物理真实感，我们利用激光雷达（LiDAR）数据构建平面约束的多模态高斯混合模型（Gaussian Mixture Models，GMMs），并采用2D高斯表面（2D Gaussian surfels）作为地图表示，从而实现高精度的表面对齐和连续的几何监督。在此基础上，为了克服传统多阶段全景分割管道中固有的误差累积和繁琐的跨帧关联问题，我们设计了一种查询引导的端到端学习架构。通过在视锥体内利用局部交叉注意机制，系统将2D掩模特征直接提升到3D空间，实现全局一致的全景理解。最后，为了解决高维语义特征带来的计算瓶颈，我们引入了精确的瓦片交集（Precise Tile Intersection）和Top-K硬选择策略，以优化渲染管道。实验结果表明，我们的系统在大规模场景中实现了优越的几何和全景重建质量，同时保持超过40 FPS的推理速率，满足机器人控制循环的实时需求。

View on arXiv Download PDF AI Translation

cs.RO / 34 / 2604.11020

Inferring World Belief States in Dynamic Real-World Environments

动态真实环境中世界信念状态的推断

Kolb, Jack, Garg, Aditya, Warner, Nikolai, Feigh, Karen M.

Abstract

We investigate estimating a human's world belief state using a robot's observations in a dynamic, 3D, and partially observable environment. The methods are grounded in mental model theory, which posits that human decision making, contextual reasoning, situation awareness, and behavior planning draw from an internal simulation or world belief state. When in teams, the mental model also includes a team model of each teammate's beliefs and capabilities, enabling fluent teamwork without the need for constant and explicit communication. In this work we replicate a core component of the team model by inferring a teammate's belief state, or level one situation awareness, as a human-robot team navigates a household environment. We evaluate our methods in a realistic simulation, extend to a real-world robot platform, and demonstrate a downstream application of the belief state through an active assistance semantic reasoning task.

Chinese Translation

我们研究了如何利用机器人在动态、三维且部分可观测环境中的观测数据来估计人类的世界信念状态。该方法基于心理模型理论，该理论认为人类的决策制定、情境推理、情境感知和行为规划均源自内部的模拟或世界信念状态。在团队协作中，心理模型还包括对每个队友信念和能力的团队模型，从而实现流畅的团队合作，无需持续且明确的沟通。在本研究中，我们通过推断队友的信念状态（即一级情境感知）来复现团队模型的核心组成部分，场景为人机团队在家庭环境中的导航。我们在逼真的仿真环境中评估了所提方法，并将其扩展至真实机器人平台，最后展示了基于信念状态的下游应用，即主动辅助语义推理任务。

View on arXiv Download PDF AI Translation

cs.RO / 35 / 2604.11028

Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Fragmentation

联邦单智能体机器人：无内部多智能体分裂的多机器人协同

Qin, Xue, Luan, Simin, See, John, Yang, Cong, Li, Zhijun

Abstract

As embodied robots move toward fleet-scale operation, multi-robot coordination is becoming a central systems challenge. Existing approaches often treat this as motivation for increasing internal multi-agent decomposition within each robot. We argue for a different principle: multi-robot coordination does not require intra-robot multi-agent fragmentation. Each robot should remain a single embodied agent with its own persistent runtime, local policy scope, capability state, and recovery authority, while coordination emerges through federation across robots at the fleet level. We present Federated Single-Agent Robotics (FSAR), a runtime architecture for multi-robot coordination built on single-agent robot runtimes. Each robot exposes a governed capability surface rather than an internally fragmented agent society. Fleet coordination is achieved through shared capability registries, cross-robot task delegation, policy-aware authority assignment, trust-scoped interaction, and layered recovery protocols. We formalize key coordination relations including authority delegation, inter-robot capability requests, local-versus-fleet recovery boundaries, and hierarchical human supervision, and describe a fleet runtime architecture supporting shared Embodied Capability Module (ECM) discovery, contract-aware cross-robot coordination, and fleet-level governance. We evaluate FSAR on representative multi-robot coordination scenarios against decomposition-heavy baselines. Results show statistically significant gains in governance locality (d=2.91, p<.001 vs. centralized control) and recovery containment (d=4.88, p<.001 vs. decomposition-heavy), while reducing authority conflicts and policy violations across all scenarios. Our results support the view that the path from embodied agents to embodied fleets is better served by federation across coherent robot runtimes than by fragmentation within them.

Chinese Translation

随着具身机器人向舰队规模运行迈进，多机器人协同正成为系统中的核心挑战。现有方法通常将此视为增加每个机器人内部多智能体分解的动因。我们提出一种不同的原则：多机器人协同不需要机器人内部的多智能体分裂。每个机器人应保持为单一具身智能体，拥有自身的持久运行时、本地策略范围、能力状态和恢复权限，而协同则通过舰队层面的机器人间联邦实现。我们提出了联邦单智能体机器人（Federated Single-Agent Robotics, FSAR），这是一种基于单智能体机器人运行时构建的多机器人协同运行时架构。每个机器人暴露受控的能力接口，而非内部分裂的智能体集合。舰队协同通过共享能力注册表、跨机器人任务委派、策略感知权限分配、信任范围内交互及分层恢复协议实现。我们形式化了关键协同关系，包括权限委托、机器人间能力请求、本地与舰队恢复边界及分层人类监督，并描述了支持共享具身能力模块（Embodied Capability Module, ECM）发现、合同感知跨机器人协同及舰队级治理的舰队运行时架构。我们在典型多机器人协同场景中对FSAR与重分解基线进行了评估。结果显示，在治理局部性（d=2.91，p<.001，较集中控制）和恢复隔离性（d=4.88，p<.001，较重分解）方面均有统计学显著提升，同时在所有场景中减少了权限冲突和策略违规。我们的结果支持这样一种观点：从具身智能体迈向具身舰队，更适合通过机器人运行时间的联邦实现，而非其内部的分裂。

View on arXiv Download PDF AI Translation

cs.RO / 36 / 2604.11090

Simulator Adaptation for Sim-to-Real Learning of Legged Locomotion via Proprioceptive Distribution Matching

通过本体感知分布匹配进行腿部运动的仿真适应以实现从仿真到现实的学习

Dao, Jeremy, Fern, Alan

Abstract

Simulation trained legged locomotion policies often exhibit performance loss on hardware due to dynamics discrepancies between the simulator and the real world, highlighting the need for approaches that adapt the simulator itself to better match hardware behavior. Prior work typically quantify these discrepancies through precise, time-aligned matching of joint and base trajectories. This process requires motion capture, privileged sensing, and carefully controlled initial conditions. We introduce a practical alternative based on proprioceptive distribution matching, which compares hardware and simulation rollouts as distributions of joint observations and actions, eliminating the need for time alignment or external sensing. Using this metric as a black-box objective, we explore adapting simulator dynamics through parameter identification, action-delta models, and residual actuator models. Our approach matches the parameter recovery and policy-performance gains of privileged state-matching baselines across extensive sim-to-sim ablations on the Go2 quadruped. Real-world experiments demonstrate substantial drift reduction using less than five minutes of hardware data, even for a challenging two-legged walking behavior. These results demonstrate that proprioceptive distribution matching provides a practical and effective route to simulator adaptation for sim-to-real transfer of learned legged locomotion.

Chinese Translation

经过仿真训练的腿部运动策略在硬件上往往表现出性能损失，这主要是由于仿真器与现实世界之间的动态差异，突显了需要适应仿真器本身以更好地匹配硬件行为的方法。以往的研究通常通过精确、时间对齐的关节和基座轨迹匹配来量化这些差异。这一过程需要运动捕捉、特权传感和精心控制的初始条件。我们提出了一种基于本体感知分布匹配的实用替代方案，该方案将硬件和仿真过程视为关节观察和动作的分布进行比较，从而消除了时间对齐或外部传感的需求。利用这一指标作为黑箱目标，我们探索通过参数识别、动作增量模型和残余执行器模型来适应仿真器动态。我们的方法在Go2四足机器人上进行的大规模仿真到仿真的消融实验中，匹配了特权状态匹配基线的参数恢复和策略性能提升。现实世界的实验表明，使用不到五分钟的硬件数据，即使对于具有挑战性的双足行走行为，也能显著减少漂移。这些结果表明，本体感知分布匹配为仿真适应提供了一条实用且有效的途径，以实现学习的腿部运动从仿真到现实的转移。

View on arXiv Download PDF AI Translation

cs.RO / 37 / 2604.11135

AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

AIM：基于空间价值图的意图感知统一世界动作建模

Fan, Liaoyuan, Xu, Zetian, Cao, Chen, Zhang, Wenyao, Yuan, Mingqi, Chen, Jiayu

Abstract

Pretrained video generation models provide strong priors for robot control, but existing unified world action models still struggle to decode reliable actions without substantial robot-specific training. We attribute this limitation to a structural mismatch: while video models capture how scenes evolve, action generation requires explicit reasoning about where to interact and the underlying manipulation intent. We introduce AIM, an intent-aware unified world action model that bridges this gap via an explicit spatial interface. Instead of decoding actions directly from future visual representations, AIM predicts an aligned spatial value map that encodes task-relevant interaction structure, enabling a control-oriented abstraction of future dynamics. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers architecture. It employs intent-causal attention to route future information to the action branch exclusively through the value representation. We further propose a self-distillation reinforcement learning stage that freezes the video and value branches and optimizes only the action head using dense rewards derived from projected value-map responses together with sparse task-level signals. To support training and evaluation, we construct a simulation dataset of 30K manipulation trajectories with synchronized multi-view observations, actions, and value-map annotations. Experiments on RoboTwin 2.0 benchmark show that AIM achieves a 94.0% average success rate, significantly outperforming prior unified world action baselines. Notably, the improvement is more pronounced in long-horizon and contact-sensitive manipulation tasks, demonstrating the effectiveness of explicit spatial-intent modeling as a bridge between visual world modeling and robot control.

Chinese Translation

预训练的视频生成模型为机器人控制提供了强有力的先验，但现有的统一世界动作模型在缺乏大量机器人特定训练的情况下，仍难以解码出可靠的动作。我们将这一限制归因于结构上的不匹配：视频模型捕捉场景的演变过程，而动作生成则需要明确推理交互位置及其背后的操作意图。我们提出了AIM，一种意图感知的统一世界动作模型，通过显式的空间接口弥合这一差距。AIM不直接从未来视觉表示中解码动作，而是预测一个对齐的空间价值图（spatial value map），该图编码了与任务相关的交互结构，从而实现对未来动态的面向控制的抽象。基于预训练的视频生成模型，AIM在共享的混合变换器（mixture-of-transformers）架构中联合建模未来观测和价值图。它采用意图因果注意力（intent-causal attention），将未来信息仅通过价值表示传递给动作分支。我们进一步提出了自蒸馏强化学习阶段，在该阶段冻结视频和价值分支，仅使用由投影价值图响应生成的密集奖励和稀疏任务级信号优化动作头。为支持训练和评估，我们构建了包含3万条操作轨迹的仿真数据集，数据集中同步包含多视角观测、动作和价值图注释。RoboTwin 2.0基准实验表明，AIM实现了94.0%的平均成功率，显著优于先前的统一世界动作基线。值得注意的是，在长时域和接触敏感的操作任务中，性能提升更为显著，证明了显式空间意图建模作为视觉世界建模与机器人控制之间桥梁的有效性。

View on arXiv Download PDF AI Translation

cs.RO / 38 / 2604.11138

ViserDex: Visual Sim-to-Real for Robust Dexterous In-hand Reorientation

ViserDex：用于稳健灵巧手中重定位的视觉仿真到现实

Bhardwaj, Arjun, Wilder-Smith, Maximum, Mittal, Mayank, Patil, Vaishakh, Hutter, Marco

Abstract

In-hand object reorientation requires precise estimation of the object pose to handle complex task dynamics. While RGB sensing offers rich semantic cues for pose tracking, existing solutions rely on multi-camera setups or costly ray tracing. We present a sim-to-real framework for monocular RGB in-hand reorientation that integrates 3D Gaussian Splatting (3DGS) to bridge the visual sim-to-real gap. Our key insight is performing domain randomization in the Gaussian representation space: by applying physically consistent, pre-rendering augmentations to 3D Gaussians, we generate photorealistic, randomized visual data for object pose estimation. The manipulation policy is trained using curriculum-based reinforcement learning with teacher-student distillation, enabling efficient learning of complex behaviors. Importantly, both perception and control models can be trained independently on consumer-grade hardware, eliminating the need for large compute clusters. Experiments show that the pose estimator trained with 3DGS data outperforms those trained using conventional rendering data in challenging visual environments. We validate the system on a physical multi-fingered hand equipped with an RGB camera, demonstrating robust reorientation of five diverse objects even under challenging lighting conditions. Our results highlight Gaussian splatting as a practical path for RGB-only dexterous manipulation. For videos of the hardware deployments and additional supplementary materials, please refer to the project website: https://rffr.leggedrobotics.com/works/viserdex/

Chinese Translation

手中物体重定位需要精确估计物体姿态以应对复杂的任务动态。虽然RGB传感器提供了丰富的语义线索用于姿态跟踪，但现有解决方案依赖于多摄像头设置或昂贵的光线追踪。我们提出了一种用于单目RGB手中重定位的仿真到现实框架，该框架集成了3D高斯点云（3D Gaussian Splatting, 3DGS），以弥合视觉仿真与现实之间的差距。我们的关键见解是在高斯表示空间中执行领域随机化：通过对3D高斯应用物理一致的预渲染增强，我们生成了用于物体姿态估计的逼真随机视觉数据。操控策略采用基于课程的强化学习与师生蒸馏进行训练，从而高效学习复杂行为。重要的是，感知和控制模型可以在消费级硬件上独立训练，消除了对大型计算集群的需求。实验表明，使用3DGS数据训练的姿态估计器在具有挑战性的视觉环境中优于使用传统渲染数据训练的估计器。我们在一台配备RGB摄像头的物理多指手上验证了该系统，展示了在具有挑战性的光照条件下对五种不同物体的稳健重定位。我们的结果突出了高斯点云作为仅使用RGB进行灵巧操控的实用路径。有关硬件部署的视频和其他补充材料，请访问项目网站：https://rffr.leggedrobotics.com/works/viserdex/

View on arXiv Download PDF AI Translation

cs.RO / 39 / 2604.11174

EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

EmbodiedGovBench：一个针对具身智能系统治理、恢复和升级安全的基准测试

Qin, Xue, Luan, Simin, See, John, Yang, Cong, Li, Zhijun

Abstract

Recent progress in embodied AI has produced a growing ecosystem of robot policies, foundation models, and modular runtimes. However, current evaluation remains dominated by task success metrics such as completion rate or manipulation accuracy. These metrics leave a critical gap: they do not measure whether embodied systems are governable -- whether they respect capability boundaries, enforce policies, recover safely, maintain audit trails, and respond to human oversight. We present EmbodiedGovBench, a benchmark for governance-oriented evaluation of embodied agent systems. Rather than asking only whether a robot can complete a task, EmbodiedGovBench evaluates whether the system remains controllable, policy-bounded, recoverable, auditable, and evolution-safe under realistic perturbations. The benchmark covers seven governance dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. We define a benchmark structure spanning single-robot and fleet settings, with scenario templates, perturbation operators, governance metrics, and baseline evaluation protocols. We describe how the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows. Our analysis suggests that embodied governance should become a first-class evaluation target. EmbodiedGovBench provides the initial measurement framework for that shift.

Chinese Translation

近年来，具身人工智能的发展催生了一个日益丰富的机器人政策、基础模型和模块化运行时生态系统。然而，目前的评估仍然主要依赖于任务成功指标，如完成率或操作准确性。这些指标存在一个关键缺口：它们并未衡量具身系统是否可治理——即它们是否遵循能力边界、执行政策、安全恢复、维护审计记录并对人类监督作出响应。我们提出了EmbodiedGovBench，这是一个针对具身智能体系统的治理导向评估基准。EmbodiedGovBench不仅关注机器人是否能够完成任务，还评估系统在现实扰动下是否保持可控、政策受限、可恢复、可审计以及升级安全。该基准涵盖七个治理维度：未经授权的能力调用、运行时漂移鲁棒性、恢复成功、政策可移植性、版本升级安全性、人类覆盖响应性和审计完整性。我们定义了一个涵盖单机器人和车队设置的基准结构，包括场景模板、扰动操作、治理指标和基线评估协议。我们描述了如何在具身能力运行时上实例化该基准，利用模块化接口和合同感知的升级工作流。我们的分析表明，具身治理应成为一项重要的评估目标。EmbodiedGovBench为这一转变提供了初步的测量框架。

View on arXiv Download PDF AI Translation

cs.RO / 40 / 2604.11251

CLAW: Composable Language-Annotated Whole-body Motion Generation

CLAW：可组合的语言注释全身运动生成

Cao, Jianuo, Chen, Yuxin, Tomizuka, Masayoshi

Abstract

Training language-conditioned whole-body controllers for humanoid robots requires large-scale datasets pairing motion trajectories with natural-language descriptions.Existing approaches based on motion capture are costly and limited in diversity, while text-to-motion generative models produce purely kinematic outputs that are not guaranteed to be physically feasible.Therefore, we present CLAW, an interactive web-based pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. CLAW treats the motion modes of a kinematic planner as composable building blocks, each parameterized by movement, heading, speed, pelvis height and duration, and provides two browser-based interfaces -- a real-time keyboard mode and a timeline-based sequence editor -- for exploratory and batch data collection. A low-level whole-body controller tracks the planner's kinematic references in MuJoCo simulation, producing physically grounded trajectories recorded at 50Hz. Simultaneously, a deterministic template-based annotation engine generates diverse natural-language descriptions at multiple stylistic registers for every segment and for the full trajectory. We release the system as open source to support scalable generation of language-motion paired data for humanoid robot learning.

Chinese Translation

为人形机器人训练语言条件下的全身控制器需要大规模的数据集，将运动轨迹与自然语言描述配对。现有基于动作捕捉的方法成本高且多样性有限，而文本到运动的生成模型产生的输出仅为运动学数据，无法保证其物理可行性。因此，我们提出了CLAW，一个基于互动网页的管道，用于为Unitree G1人形机器人可扩展地生成语言注释的全身运动数据。CLAW将运动规划器的运动模式视为可组合的构建块，每个构建块由运动、朝向、速度、骨盆高度和持续时间参数化，并提供两种基于浏览器的接口——实时键盘模式和基于时间线的序列编辑器——用于探索性和批量数据收集。一个低级全身控制器在MuJoCo仿真中跟踪规划器的运动学参考，生成以50Hz记录的物理基础轨迹。同时，一个确定性的基于模板的注释引擎为每个片段和完整轨迹生成多样的自然语言描述，涵盖多个风格层次。我们将该系统作为开源发布，以支持人形机器人学习中语言与运动配对数据的可扩展生成。

View on arXiv Download PDF AI Translation

cs.RO / 41 / 2604.11295

Modeling, Analysis and Activation of Planar Viscoelastically-combined Rimless Wheels

平面粘弹性组合无辐条轮的建模、分析与激活

Asano, Fumihiko, Xiang, Yuxuan, Zheng, Yanqiu, Yan, Cong

Abstract

This paper proposes novel passive-dynamic walkers formed by two cross-shaped frames and eight viscoelastic elements. Since it is a combination of two four-legged rimless wheels via viscoelastic elements, we call it viscoelastically-combined rimless wheel (VCRW). Two types of VCRWs consisting of different cross-shaped frames are introduced; one is formed by combining two Greek-cross-shaped frames (VCRW1), and the other is formed by combining two-link cross-shaped frames that can rotate freely around the central axis (VCRW2). First, we describe the model assumptions and equations of motion and collision. Second, we numerically analyze the basic gait properties of passive dynamic walking. Furthermore, we consider an activation of VCRW2 for generating a stable level gait, and discuss the significance of the study as a novel walking support device.

Chinese Translation

本文提出了一种由两个十字形框架和八个粘弹性元件组成的新型被动动力行走器。由于其是通过粘弹性元件将两个四足无辐条轮组合而成，我们称之为粘弹性组合无辐条轮（Viscoelastically-Combined Rimless Wheel，VCRW）。介绍了两种由不同十字形框架组成的VCRW类型：一种是由两个希腊十字形框架组合而成（VCRW1），另一种是由两个可绕中心轴自由旋转的两连杆十字形框架组合而成（VCRW2）。首先，本文描述了模型假设以及运动和碰撞的方程。其次，数值分析了被动动力行走的基本步态特性。此外，针对VCRW2，研究了其激活以生成稳定水平步态的可能性，并讨论了该研究作为一种新型行走辅助装置的意义。

View on arXiv Download PDF AI Translation

cs.RO / 42 / 2604.11302

3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS

基于世界模型的蒙特卡洛树搜索的3D锚定前瞻规划用于持久机器人场景记忆

Sidik, Bronislav, Mizrahi, Dror

Abstract

We present 3D-Anchored Lookahead Planning (3D-ALP), a System 2 reasoning engine for robotic manipulation that combines Monte Carlo Tree Search (MCTS) with a 3D-consistent world model as the rollout oracle. Unlike reactive policies that evaluate actions from the current camera frame only, 3D-ALP maintains a persistent camera-to-world (c2w) anchor that survives occlusion, enabling accurate replanning to object positions that are no longer directly observable. On a 5-step sequential reach task requiring spatial memory (Experiment E3), 3D-ALP achieves 0.650 0.109 success rate on memory-required steps versus 0.006 0.008 for a greedy reactive baseline ({\Delta}=+0.645), while step 5 success reaches 0.822 against 0.000 for greedy. An ablation study (30 episodes, 3 seeds) isolates tree search spatial memory as the primary driver (+0.533, 82% of gain) with additional benefit from deeper lookahead (+0.111, 17%). We also identify and resolve four structural failure modes in applying UCT-MCTS (Upper Confidence Bounds applied to Trees [10]) to continuous robotic manipulation.

Chinese Translation

我们提出了3D锚定前瞻规划（3D-ALP），这是一种用于机器人操作的系统2推理引擎，它将蒙特卡洛树搜索（MCTS）与3D一致的世界模型结合，作为回滚预言机。与仅从当前相机帧评估动作的反应性策略不同，3D-ALP保持一个持久的相机到世界（c2w）锚点，能够在遮挡情况下存活，从而实现对不再直接可观察的物体位置的准确重新规划。在一个需要空间记忆的5步顺序到达任务中（实验E3），3D-ALP在需要记忆的步骤上取得了0.650 ± 0.109的成功率，而贪婪反应基线则为0.006 ± 0.008（{ ext{Δ}}=+0.645），而第5步的成功率达到0.822，贪婪基线为0.000。消融研究（30个实验，3个种子）将树搜索空间记忆作为主要驱动因素（+0.533，增益的82%），并且更深的前瞻规划也带来了额外的好处（+0.111，17%）。我们还识别并解决了在将UCT-MCTS（应用于树的上置信界 [10]）应用于连续机器人操作时的四种结构性失败模式。

View on arXiv Download PDF AI Translation

cs.RO / 43 / 2604.11306

Learning to Forget -- Hierarchical Episodic Memory for Lifelong Robot Deployment

学习遗忘——用于终身机器人部署的层次性情节记忆

Bärmann, Leonard, Plewnia, Joana, Waibel, Alex, Asfour, Tamim

Abstract

Robots must verbalize their past experiences when users ask "Where did you put my keys?" or "Why did the task fail?" Yet maintaining life-long episodic memory (EM) from continuous multimodal perception quickly exceeds storage limits and makes real-time query impractical, calling for selective forgetting that adapts to users' notions of relevance. We present H$^2$-EMV, a framework enabling humanoids to learn what to remember through user interaction. Our approach incrementally constructs hierarchical EM, selectively forgets using language-model-based relevance estimation conditioned on learned natural-language rules, and updates these rules given user feedback about forgotten details. Evaluations on simulated household tasks and 20.5-hour-long real-world recordings from ARMAR-7 demonstrate that H$^2$-EMV maintains question-answering accuracy while reducing memory size by 45% and query-time compute by 35%. Critically, performance improves over time - accuracy increases 70% in second-round queries by adapting to user-specific priorities - demonstrating that learned forgetting enables scalable, personalized EM for long-term human-robot collaboration.

Chinese Translation

当用户询问“你把我的钥匙放在哪里？”或“任务为什么失败？”时，机器人必须能够口头表达其过去的经历。然而，持续的多模态感知所产生的终身情节记忆（EM）很快会超出存储限制，使得实时查询变得不切实际，这就需要一种适应用户相关性概念的选择性遗忘。我们提出了H$^2$-EMV，一个使类人机器人能够通过用户交互学习记忆内容的框架。我们的方法逐步构建层次性情节记忆，利用基于语言模型的相关性估计进行选择性遗忘，该估计基于学习到的自然语言规则，并根据用户对遗忘细节的反馈更新这些规则。在模拟家庭任务和来自ARMAR-7的20.5小时真实世界录音的评估中，H$^2$-EMV在保持问答准确性的同时，将内存大小减少了45%，查询时间计算减少了35%。关键是，性能随着时间的推移而改善——在第二轮查询中，准确性提高了70%，这得益于对用户特定优先级的适应——这表明学习遗忘使得长期人机协作中的可扩展个性化情节记忆成为可能。

View on arXiv Download PDF AI Translation

cs.RO / 44 / 2604.11320

CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping

CLASP：用于开放词汇桌面物体抓取的闭环异步空间感知

Ling, Yiran, Li, Wenxuan, Dong, Siying, Zhang, Yize, Huang, Xiaoyao, Jiang, Jing, Li, Ruonan, Liu, Jie

Abstract

Robot grasping of desktop object is widely used in intelligent manufacturing, logistics, and agriculture.Although vision-language models (VLMs) show strong potential for robotic manipulation, their deployment in low-level grasping faces key challenges: scarce high-quality multimodal demonstrations, spatial hallucination caused by weak geometric grounding, and the fragility of open-loop execution in dynamic environments. To address these challenges, we propose Closed-Loop Asynchronous Spatial Perception(CLASP), a novel asynchronous closed-loop framework that integrates multimodal perception, logical reasoning, and state-reflective feedback. First, we design a Dual-Pathway Hierarchical Perception module that decouples high-level semantic intent from geometric grounding. The design guides the output of the inference model and the definite action tuples, reducing spatial illusions. Second, an Asynchronous Closed-Loop Evaluator is implemented to compare pre- and post-execution states, providing text-based diagnostic feedback to establish a robust error-correction loop and improving the vulnerability of traditional open-loop execution in dynamic environments. Finally, we design a scalable multi-modal data engine that automatically synthesizes high-quality spatial annotations and reasoning templates from real and synthetic scenes without human teleoperation. Extensive experiments demonstrate that our approach significantly outperforms existing baselines, achieving an 87.0% overall success rate. Notably, the proposed framework exhibits remarkable generalization across diverse objects, bridging the sim-to-real gap and providing exceptional robustness in geometrically challenging categories and cluttered scenarios.

Chinese Translation

桌面物体的机器人抓取广泛应用于智能制造、物流和农业。尽管视觉-语言模型（VLMs）在机器人操作中展现出强大的潜力，但在低级抓取中的应用面临关键挑战：高质量多模态演示稀缺、由于几何基础薄弱导致的空间幻觉，以及在动态环境中开放循环执行的脆弱性。为了解决这些挑战，我们提出了闭环异步空间感知（CLASP），这是一种新颖的异步闭环框架，集成了多模态感知、逻辑推理和状态反思反馈。首先，我们设计了一个双通道层次感知模块，将高层语义意图与几何基础解耦。该设计指导推理模型的输出和明确的动作元组，减少空间幻觉。其次，实施了一个异步闭环评估器，以比较执行前后的状态，提供基于文本的诊断反馈，以建立稳健的错误修正循环，并改善传统开放循环执行在动态环境中的脆弱性。最后，我们设计了一个可扩展的多模态数据引擎，能够自动从真实和合成场景中合成高质量的空间注释和推理模板，而无需人工遥控。大量实验表明，我们的方法显著优于现有基线，整体成功率达到87.0%。值得注意的是，所提出的框架在多样化物体上展现出显著的泛化能力，弥合了模拟与现实之间的差距，并在几何挑战类别和杂乱场景中提供了卓越的鲁棒性。

View on arXiv Download PDF AI Translation

cs.RO / 45 / 2604.11349

Learning Racket-Ball Bounce Dynamics Across Diverse Rubbers for Robotic Table Tennis

跨多种橡胶材质的球拍-乒乓球弹跳动力学学习用于机器人乒乓球

Gossard, Thomas

Abstract

Accurate dynamic models for racket-ball bounces are essential for reliable control in robotic table tennis. Existing models typically assume simple linear models and are restricted to inverted rubbers, limiting their ability to generalize across the wide variety of rackets encountered in practice. In this work, we present a unified framework for modeling ball-racket interactions across 10 racket configurations featuring different rubber types, including inverted, anti-spin, and pimpled surfaces. Using a high-speed multi-camera setup with spin estimation, we collect a dataset of racket-ball bounces spanning a broad range of incident velocities and spins. We show that key physical parameters governing rebound, such as the Coefficient of Restitution and tangential impulse response, vary systematically with the impact state and differ significantly across rubbers. To capture these effects while preserving physical interpretability, we estimate the parameters of an impulse-based contact model using Gaussian Processes conditioned on the ball's incoming velocity and spin. The resulting model provides both accurate predictions and uncertainty estimations. Compared to the constant parameter baselines, our approach reduces post-impact velocity and spin prediction errors across all racket types, with the largest improvements observed for nonstandard rubbers. Furthermore, the GP-based model enables online identification of racket dynamics with few observations during gameplay.

Chinese Translation

准确的球拍-乒乓球弹跳动力学模型对于机器人乒乓球的可靠控制至关重要。现有模型通常假设简单的线性模型，且仅限于反胶橡胶，限制了其在实际中面对多样化球拍时的泛化能力。本文提出了一个统一框架，用于建模涵盖10种不同球拍配置的球拍-乒乓球相互作用，这些配置包括反胶、抗旋转和颗粒橡胶表面。通过高速多摄像头系统结合旋转估计，我们收集了涵盖广泛入射速度和旋转的球拍-乒乓球弹跳数据集。研究表明，决定反弹的关键物理参数，如恢复系数和切向冲量响应，随着撞击状态系统性变化，并且在不同橡胶间存在显著差异。为捕捉这些效应并保持物理可解释性，我们基于高斯过程（Gaussian Processes）估计了一个基于冲量的接触模型参数，该模型以球的入射速度和旋转为条件。所得模型不仅提供了准确的预测，还能估计不确定性。与常数参数基线相比，我们的方法在所有球拍类型上均降低了冲击后速度和旋转的预测误差，非标准橡胶的改进尤为显著。此外，基于高斯过程的模型支持在比赛过程中通过少量观测实现球拍动力学的在线识别。

View on arXiv Download PDF AI Translation

cs.RO / 46 / 2604.11351

WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models

WM-DAgger：利用世界模型实现高效数据聚合以进行模仿学习

Yu, Anlan, Chen, Zaishu, Song, Peili, Hong, Zhiqing, Wang, Haotian, Zhang, Desheng, He, Tian, Ding, Yi, Zhang, Daqing

Abstract

Imitation learning is a powerful paradigm for training robotic policies, yet its performance is limited by compounding errors: minor policy inaccuracies could drive robots into unseen out-of-distribution (OOD) states in the training set, where the policy could generate even bigger errors, leading to eventual failures. While the Data Aggregation (DAgger) framework tries to address this issue, its reliance on continuous human involvement severely limits scalability. In this paper, we propose WM-DAgger, an efficient data aggregation framework that leverages World Models to synthesize OOD recovery data without requiring human involvement. Specifically, we focus on manipulation tasks with an eye-in-hand robotic arm and only few-shot demonstrations. To avoid synthesizing misleading data and overcome the hallucination issues inherent to World Models, our framework introduces two key mechanisms: (1) a Corrective Action Synthesis Module that generates task-oriented recovery actions to prevent misleading supervision, and (2) a Consistency-Guided Filtering Module that discards physically implausible trajectories by anchoring terminal synthesized frames to corresponding real frames in expert demonstrations. We extensively validate WM-DAgger on multiple real-world robotic tasks. Results that our method significantly improves success rates, achieving a 93.3\% success rate in soft bag pushing with only five demonstrations. The source code is publicly available at https://github.com/czs12354-xxdbd/WM-Dagger.

Chinese Translation

模仿学习是一种强大的机器人策略训练范式，但其性能受到累积误差的限制：轻微的策略不准确可能导致机器人进入训练集中未见的分布外（OOD）状态，在这些状态下，策略可能产生更大的错误，最终导致失败。虽然数据聚合（DAgger）框架试图解决这一问题，但其对持续人类参与的依赖严重限制了可扩展性。本文提出了WM-DAgger，一种高效的数据聚合框架，利用世界模型合成OOD恢复数据，而无需人类参与。具体而言，我们关注的是带有眼在手的机器人臂的操控任务，并且只使用少量示范。为了避免合成误导性数据并克服世界模型固有的幻觉问题，我们的框架引入了两个关键机制：（1）纠正行动合成模块，生成面向任务的恢复动作以防止误导性监督；（2）一致性引导过滤模块，通过将终端合成帧锚定到专家示范中的相应真实帧，丢弃物理上不合理的轨迹。我们在多个真实世界的机器人任务上对WM-DAgger进行了广泛验证。结果表明，我们的方法显著提高了成功率，在仅有五个示范的情况下，在软袋推动任务中达到了93.3%的成功率。源代码已公开，地址为 https://github.com/czs12354-xxdbd/WM-Dagger。

View on arXiv Download PDF AI Translation

cs.RO / 47 / 2604.11372

MR.ScaleMaster: Scale-Consistent Collaborative Mapping from Crowd-Sourced Monocular Videos

MR.ScaleMaster：来自众包单目视频的尺度一致性协作映射

Ju, Hyoseok, Kim, Giseop

Abstract

Crowd-sourced cooperative mapping from monocular cameras promises scalable 3D reconstruction without specialized sensors, yet remains hindered by two scale-specific failure modes: abrupt scale collapse from false-positive loop closures in repetitive environments, and gradual scale drift over long trajectories and per-robot scale ambiguity that prevent direct multi-session fusion. We present MR.ScaleMaster, a cooperative mapping system for crowd-sourced monocular videos that addresses both failure modes. MR.ScaleMaster introduces three key mechanisms. First, a Scale Collapse Alarm rejects spurious loop closures before they corrupt the pose graph. Second, a Sim(3) anchor node formulation generalizes the classical SE(3) framework to explicitly estimate per-session scale, resolving per-robot scale ambiguity and enforcing global scale consistency. Third, a modular, open-source, plug-and-play interface enables any monocular reconstruction model to integrate without backend modification. On KITTI sequences with up to 15 agents, the Sim(3) formulation achieves a 7.2x ATE reduction over the SE(3) baseline, and the alarm rejects all false-positive loops while preserving every valid constraint. We further demonstrate heterogeneous multi-robot dense mapping fusing MASt3R-SLAM, pi3, and VGGT-SLAM 2.0 within a single unified map.

Chinese Translation

来自单目相机的众包协作映射承诺在没有专用传感器的情况下实现可扩展的三维重建，但仍受到两种特定于尺度的失败模式的阻碍：在重复环境中，由于错误的循环闭合导致的突发尺度崩溃，以及在长轨迹和每个机器人尺度模糊下的逐渐尺度漂移，这些都阻碍了直接的多会话融合。我们提出了MR.ScaleMaster，一个针对众包单目视频的协作映射系统，旨在解决这两种失败模式。MR.ScaleMaster引入了三个关键机制。首先，尺度崩溃警报在伪循环闭合损坏姿态图之前拒绝虚假闭合。其次，Sim(3)锚节点公式将经典的SE(3)框架推广到显式估计每个会话的尺度，从而解决每个机器人的尺度模糊并强制执行全局尺度一致性。第三，模块化、开源的即插即用接口使任何单目重建模型能够无后端修改地集成。在包含多达15个代理的KITTI序列中，Sim(3)公式在SE(3)基线基础上实现了7.2倍的平均绝对误差（ATE）减少，且警报拒绝所有虚假循环，同时保留每个有效约束。我们进一步展示了异构多机器人密集映射，将MASt3R-SLAM、pi3和VGGT-SLAM 2.0融合在一个统一的地图中。

View on arXiv Download PDF AI Translation

cs.RO / 48 / 2604.11373

Minimal Embodiment Enables Efficient Learning of Number Concepts in Robot

最小化具身性促进机器人数字概念的高效学习

Shangguan, Zhegong, Di Nuovo, Alessandro, Cangelosi, Angelo

Abstract

Robots are increasingly entering human-interactive scenarios that require understanding of quantity. How intelligent systems acquire abstract numerical concepts from sensorimotor experience remains a fundamental challenge in cognitive science and artificial intelligence. Here we investigate embodied numerical learning using a neural network model trained to perform sequential counting through naturalistic robotic interaction with a Franka Panda manipulator. We demonstrate that embodied models achieve 96.8\% counting accuracy with only 10\% of training data, compared to 60.6\% for vision-only baselines. This advantage persists when visual-motor correspondences are randomized, indicating that embodiment functions as a structural prior that regularizes learning rather than as an information source. The model spontaneously develops biologically plausible representations: number-selective units with logarithmic tuning, mental number line organization, Weber-law scaling, and rotational dynamics encoding numerical magnitude ($r = 0.97$, slope $= 30.6{\deg}$/count). The learning trajectory parallels children's developmental progression from subset-knowers to cardinal-principle knowers. These findings demonstrate that minimal embodiment can ground abstract concepts, improve data efficiency, and yield interpretable representations aligned with biological cognition, which may contribute to embodied mathematics tutoring and safety-critical industrial applications.

Chinese Translation

机器人越来越多地进入需要理解数量的人机交互场景。智能系统如何从感知运动经验中获取抽象的数字概念仍然是认知科学和人工智能中的一个基本挑战。在这里，我们使用一个神经网络模型研究具身的数字学习，该模型经过训练以通过与Franka Panda机械手的自然交互进行顺序计数。我们证明，具身模型在仅使用10%的训练数据时，计数准确率达到96.8%，而仅使用视觉的基线模型则为60.6%。当视觉-运动对应关系被随机化时，这一优势依然存在，表明具身性作为一种结构性先验，规范了学习，而不是作为信息源。该模型自发地发展出生物学上合理的表征：具有对数调谐的数字选择单元、心理数字线组织、韦伯定律缩放和编码数字大小的旋转动态（$r = 0.97$, slope $= 30.6{ ext{°}}/ ext{count}$）。学习轨迹与儿童从子集知晓者到基数原则知晓者的发展进程相似。这些发现表明，最小化具身性可以为抽象概念提供基础，提高数据效率，并产生与生物认知相一致的可解释表征，这可能有助于具身数学辅导和安全关键的工业应用。

View on arXiv Download PDF AI Translation

cs.RO / 49 / 2604.11386

ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

ComSim：通过组合仿真构建可扩展的真实世界机器人数据生成

Qin, Yiran, Ma, Jiahua, Kang, Li, Li, Wenzhan, Jiao, Yihang, Wen, Xin, Song, Xiufeng, Zhou, Heng, Yu, Jiwen, Yin, Zhenfei, Liu, Xihui, Torr, Philip, Du, Yilun, Zhang, Ruimao

Abstract

Recent advancements in foundational models, such as large language models and world models, have greatly enhanced the capabilities of robotics, enabling robots to autonomously perform complex tasks. However, acquiring large-scale, high-quality training data for robotics remains a challenge, as it often requires substantial manual effort and is limited in its coverage of diverse real-world environments. To address this, we propose a novel hybrid approach called Compositional Simulation, which combines classical simulation and neural simulation to generate accurate action-video pairs while maintaining real-world consistency. Our approach utilizes a closed-loop real-sim-real data augmentation pipeline, leveraging a small amount of real-world data to generate diverse, large-scale training datasets that cover a broader spectrum of real-world scenarios. We train a neural simulator to transform classical simulation videos into real-world representations, improving the accuracy of policy models trained in real-world environments. Through extensive experiments, we demonstrate that our method significantly reduces the sim2real domain gap, resulting in higher success rates in real-world policy model training. Our approach offers a scalable solution for generating robust training data and bridging the gap between simulated and real-world robotics.

Chinese Translation

近年来，基础模型（如大型语言模型和世界模型）的进展极大地增强了机器人技术的能力，使机器人能够自主执行复杂任务。然而，获取大规模、高质量的机器人训练数据仍然是一个挑战，因为这通常需要大量的人工努力，并且在覆盖多样化的真实世界环境方面存在局限性。为了解决这个问题，我们提出了一种名为组合仿真的新型混合方法，该方法结合了经典仿真和神经仿真，以生成准确的动作-视频对，同时保持与真实世界的一致性。我们的方法利用闭环真实-仿真-真实数据增强管道，利用少量的真实世界数据生成多样化的大规模训练数据集，覆盖更广泛的真实场景。我们训练了一个神经仿真器，将经典仿真视频转化为真实世界的表现，从而提高在真实环境中训练的策略模型的准确性。通过广泛的实验，我们证明了我们的方法显著减少了仿真到现实的领域差距，从而提高了真实世界策略模型训练的成功率。我们的方法为生成强大的训练数据和弥合仿真与真实世界机器人之间的差距提供了一种可扩展的解决方案。

View on arXiv Download PDF AI Translation

cs.RO / 50 / 2604.11400

EagleVision: A Multi-Task Benchmark for Cross-Domain Perception in High-Speed Autonomous Racing

EagleVision：面向高速自动驾驶赛车跨域感知的多任务基准测试

Yagudin, Zakhar, Mebrahtu, Murad, Jin, Ren, Huang, Jiaqi, Yue, Yujia, Tsetserukou, Dzmitry, Dias, Jorge, Khonji, Majid

Abstract

High-speed autonomous racing presents extreme perception challenges, including large relative velocities and substantial domain shifts from conventional urban-driving datasets. Existing benchmarks do not adequately capture these high-dynamic conditions. We introduce EagleVision, a unified LiDAR-based multi-task benchmark for 3D detection and trajectory prediction in high-speed racing, providing newly annotated 3D bounding boxes for the Indy Autonomous Challenge dataset (14,893 frames) and the A2RL Real competition dataset (1,163 frames), together with 12,000 simulator-generated annotated frames, all standardized under a common evaluation protocol. Using a dataset-centric transfer framework, we quantify cross-domain generalization across urban, simulator, and real racing domains. Urban pretraining improves detection over scratch training (NDS 0.72 vs. 0.69), while intermediate pretraining on real racing data achieves the best transfer to A2RL (NDS 0.726), outperforming simulator-only adaptation. For trajectory prediction, Indy-trained models surpass in-domain A2RL training on A2RL test sequences (FDE 0.947 vs. 1.250), highlighting the role of motion-distribution coverage in cross-domain forecasting. EagleVision enables systematic study of perception generalization under extreme high-speed dynamics. The dataset and benchmark are publicly available at https://avlab.io/EagleVision

Chinese Translation

高速自动驾驶赛车面临极端的感知挑战，包括较大的相对速度和与传统城市驾驶数据集显著的域偏移。现有基准测试未能充分覆盖这些高动态条件。我们提出了EagleVision，一个基于LiDAR的统一多任务基准，针对高速赛车中的三维检测和轨迹预测，提供了Indy Autonomous Challenge数据集（14,893帧）和A2RL Real竞赛数据集（1,163帧）中新标注的三维边界框，以及12,000帧模拟器生成的标注数据，所有数据均采用统一的评估协议。通过数据集中心的迁移框架，我们量化了城市、模拟器和真实赛车域之间的跨域泛化能力。城市预训练相比从零训练提升了检测性能（NDS 0.72对0.69），而在真实赛车数据上的中间预训练实现了对A2RL的最佳迁移效果（NDS 0.726），优于仅基于模拟器的适配。轨迹预测方面，Indy训练的模型在A2RL测试序列上表现优于A2RL域内训练（FDE 0.947对1.250），凸显了运动分布覆盖在跨域预测中的重要性。EagleVision促进了在极端高速动态条件下感知泛化的系统研究。该数据集和基准测试公开发布，详见https://avlab.io/EagleVision

View on arXiv Download PDF AI Translation

cs.RO / 51 / 2604.11406

Using Unwrapped Full Color Space Palette Recording to Measure Exposedness of a Vehicle Exterior Parts for External Human Machine Interfaces

利用展开的全色彩空间调色板记录测量车辆外部零件对外部人机界面的暴露度

Kwon, Jaerock, Gonzalez-Belmonte, Jose

Abstract

One of the concerns with autonomous vehicles is their ability to communicate their intent to other road users, specially pedestrians, in order to prevent accidents. External Human-Machine Interfaces (eHMIs) are the proposed solution to this issue, through the introduction of electronic devices on the exterior of a vehicle that communicate when the vehicle is planning on slowing down or yielding. This paper uses the technique of unwrapping the faces of a mesh onto a texture where every pixel is a unique color, as well as a series of animated simulations made and ran in the Unity game engine, to measure how many times is each point on a 2015 Ford F-150 King Ranch is unobstructed to a pedestrian attempting to cross the road at a four-way intersection. By cross-referencing the results with a color-coded map of the labeled parts on the exterior of the vehicle, it was concluded that while the bumper, grill, and hood were the parts of the vehicle visible to the crossing pedestrian most often, the existence of other vehicles on the same lane that might obstruct the view of these makes them insufficient. The study recommends instead a distributive approach to eHMIs by using both the windshield and frontal fenders as simultaneous placements for these devices.

Chinese Translation

自动驾驶车辆面临的一个重要问题是其向其他道路使用者，特别是行人，传达意图的能力，以防止事故发生。外部人机界面（eHMIs）被提出作为解决该问题的方案，通过在车辆外部安装电子设备，传达车辆计划减速或让行的信息。本文采用将网格面展开到纹理上的技术，使每个像素具有唯一颜色，并结合在Unity游戏引擎中制作和运行的一系列动画模拟，测量2015款福特F-150 King Ranch车型在四路交叉口处，行人试图过马路时车辆各点的无遮挡次数。通过将结果与车辆外部标注部件的彩色编码地图交叉比对，得出结论：虽然保险杠、格栅和引擎盖是行人最常见到的车辆部位，但同车道内其他车辆可能遮挡视线，使这些部位不足以作为信息传达位置。研究建议采用分布式eHMIs方案，同时在挡风玻璃和前翼子板上安装这些设备。

View on arXiv Download PDF AI Translation

cs.RO / 52 / 2604.11417

Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

面向机器人共语的高效情感感知标志性手势预测

Montiel-Vazquez, Edwin C., Cruz, Christian Arzate, Gkikas, Stefanos, Kassiotis, Thomas, Giannakakis, Giorgos, Gomez, Randy

Abstract

Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.

Chinese Translation

共语手势能够提升互动参与度并增强语言理解。大多数数据驱动的机器人系统生成节奏性类似节拍的动作，但很少融合语义强调。为此，我们提出了一种轻量级Transformer模型，仅基于文本和情感信息推断标志性手势的位置和强度，推理时无需音频输入。该模型在BEAT2数据集上的语义手势位置分类和强度回归任务中均优于GPT-4o，同时保持计算资源紧凑，适合在具身智能体上实时部署。

View on arXiv Download PDF AI Translation

cs.RO / 53 / 2604.11423

Dyadic Partnership(DP): A Missing Link Towards Full Autonomy in Medical Robotics

双向合作（DP）：迈向医疗机器人完全自主性的缺失环节

Navab, Nassir, Jiang, Zhongliang

Abstract

For the past decades medical robotic solutions were mostly based on the concept of tele-manipulation. While their design was extremely intelligent, allowing for better access, improved dexterity, reduced tremor, and improved imaging, their intelligence was limited. They therefore left cognition and decision making to the surgeon. As medical robotics advances towards high-level autonomy, the scientific community needs to explore the required pathway towards partial and full autonomy. Here, we introduce the concept of Dyadic Partnership(DP), a new paradigm in which robots and clinicians engage in intelligent, expert interaction and collaboration. The Dyadic Partners would discuss and agree on decisions and actions during their dynamic and interactive collaboration relying also on intuitive advanced media using generative AI, such as a world model, and advanced multi-modal visualization. This article outlines the foundational components needed to enable such systems, including foundation models for clinical intelligence, multi-modal intent recognition, co-learning frameworks, advanced visualization, and explainable, trust-aware interaction. We further discuss key challenges such as data scarcity, lack of standardization, and ethical acceptance. Dyadic partnership is introduced and is positioned as a powerful yet achievable, acceptable milestone offering a promising pathway toward safer, more intuitive collaboration and a gradual transition to full autonomy across diverse clinical settings.

Chinese Translation

在过去几十年中，医疗机器人解决方案主要基于远程操控的概念。尽管它们的设计极为智能，能够提供更好的接入、改善灵巧性、减少颤抖和改善成像，但其智能性仍然有限。因此，它们将认知和决策留给了外科医生。随着医疗机器人向高水平自主性发展，科学界需要探索实现部分和完全自主所需的路径。在此，我们引入了双向合作（Dyadic Partnership, DP）的概念，这是一种新的范式，其中机器人和临床医生进行智能的专家互动与合作。双向合作伙伴将在动态互动的合作中讨论并达成决策和行动，同时依赖于使用生成性人工智能的直观先进媒介，如世界模型和先进的多模态可视化。本文概述了实现此类系统所需的基础组件，包括临床智能的基础模型、多模态意图识别、共同学习框架、先进可视化以及可解释的、以信任为基础的互动。我们还讨论了数据稀缺、缺乏标准化和伦理接受等关键挑战。双向合作被引入并被定位为一个强大而可实现的可接受里程碑，为在多样化临床环境中实现更安全、更直观的合作和逐步过渡到完全自主性提供了有希望的路径。

View on arXiv Download PDF AI Translation

cs.RO / 54 / 2604.11447

Safe Human-to-Humanoid Motion Imitation Using Control Barrier Functions

基于控制屏障函数的安全人形机器人动作模仿

Cai, Wenqi, Abanes, John, Evangeliou, Nikolaos, Tzes, Anthony

Abstract

Ensuring operational safety is critical for human-to-humanoid motion imitation. This paper presents a vision-based framework that enables a humanoid robot to imitate human movements while avoiding collisions. Human skeletal keypoints are captured by a single camera and converted into joint angles for motion retargeting. Safety is enforced through a Control Barrier Function (CBF) layer formulated as a Quadratic Program (QP), which filters imitation commands to prevent both self-collisions and human-robot collisions. Simulation results validate the effectiveness of the proposed framework for real-time collision-aware motion imitation.

Chinese Translation

确保操作安全对于人形机器人动作模仿至关重要。本文提出了一种基于视觉的框架，使人形机器人能够模仿人类动作的同时避免碰撞。通过单目摄像头捕捉人体骨骼关键点，并将其转换为关节角度以实现动作重定向。通过将控制屏障函数（Control Barrier Function, CBF）层设计为二次规划（Quadratic Program, QP）问题，过滤模仿指令，从而防止机器人自身碰撞及人与机器人之间的碰撞。仿真结果验证了所提框架在实时碰撞感知动作模仿中的有效性。

View on arXiv Download PDF AI Translation

cs.RO / 55 / 2604.11572

DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models

DA-PTQ：面向高效视觉-语言-动作模型的漂移感知后训练量化方法

Xu, Siyuan, Wang, Tianshi, Li, Fengling, Zhu, Lei, Shen, Heng Tao

Abstract

Vision-Language-Action models (VLAs) have demonstrated strong potential for embodied AI, yet their deployment on resource-limited robots remains challenging due to high memory and computational demands. While Post-Training Quantization (PTQ) provides an efficient solution, directly applying PTQ to VLAs often results in severe performance degradation during sequential control. We identify temporal error accumulation as a key factor, where quantization perturbations at the vision-language-to-action interface are progressively amplified, leading to kinematic drift in executed trajectories. To address this issue, we propose Drift-Aware Post-Training Quantization (DA-PTQ), which formulates quantization as a drift-aware optimization problem over sequential decision processes. DA-PTQ consists of two components: (1) Cross-Space Representation Compensation, which mitigates structured distortions between multimodal representations and action space to improve action consistency, and (2) Motion-Driven Mixed-Precision Allocation, which assigns bit-widths by minimizing trajectory-level motion errors. Extensive experiments show that DA-PTQ significantly reduces kinematic drift and achieves comparable performance to full-precision models under low-bit settings, enabling practical deployment of VLAs on resource-limited robotic platforms.

Chinese Translation

视觉-语言-动作模型（Vision-Language-Action，VLA）在具身人工智能领域展现出强大潜力，然而由于其高内存和计算需求，在资源受限的机器人上部署仍面临挑战。尽管后训练量化（Post-Training Quantization，PTQ）提供了一种高效的解决方案，直接将PTQ应用于VLA通常会导致序列控制过程中性能严重下降。我们发现时序误差累积是关键因素，即视觉-语言到动作接口处的量化扰动被逐步放大，导致执行轨迹出现运动学漂移。为解决该问题，我们提出了漂移感知后训练量化（Drift-Aware Post-Training Quantization，DA-PTQ），将量化过程建模为序列决策过程中的漂移感知优化问题。DA-PTQ包含两个核心组件：（1）跨空间表示补偿（Cross-Space Representation Compensation），缓解多模态表示与动作空间之间的结构性失真以提升动作一致性；（2）运动驱动混合精度分配（Motion-Driven Mixed-Precision Allocation），通过最小化轨迹级运动误差来分配比特宽度。大量实验表明，DA-PTQ显著降低了运动学漂移，在低比特量化设置下实现了与全精度模型相当的性能，促进了VLA在资源受限机器人平台上的实际部署。

View on arXiv Download PDF AI Translation

cs.RO / 56 / 2604.11587

Optimal Kinodynamic Motion Planning Through Anytime Bidirectional Heuristic Search with Tight Termination Condition

通过具有紧终止条件的随时双向启发式搜索实现最优的动力学运动规划

Wang, Yi, Mu, Bingxian, Shokouhi, Shahab, Thein, May-Win

Abstract

This paper introduces Bidirectional Tight Informed Trees (BTIT*), an asymptotically optimal kinodynamic sampling-based motion planning algorithm that integrates an anytime bidirectional heuristic search (Bi-HS) and ensures the \emph{meet-in-the-middle} property (MMP) and optimality (MM-optimality). BTIT* is the first anytime MEET-style algorithm to utilize termination conditions that are efficient to evaluate and enable early termination \emph{on-the-fly} in batch-wise sampling-based motion planning. Experiments show that BTIT* achieves strongly faster time-to-first-solution and improved convergence than representative \emph{non-lazy} informed batch planners on two kinodynamic benchmarks: a 4D double-integrator model and a 10D linearized Quadrotor. The source code is available here.

Chinese Translation

本文介绍了一种双向紧信息树（Bidirectional Tight Informed Trees, BTIT*），这是一种渐近最优的基于采样的动力学运动规划算法，结合了随时双向启发式搜索（Anytime Bidirectional Heuristic Search, Bi-HS），并确保了中间相遇属性（meet-in-the-middle property, MMP）和最优性（MM-optimality）。BTIT* 是首个采用高效评估的终止条件的随时 MEET 风格算法，能够在基于批次的采样运动规划中实现即时早期终止。实验表明，BTIT* 在两个动力学基准测试（4D 双重积分器模型和 10D 线性化四旋翼）上实现了显著更快的首次解时间和更好的收敛性，相较于代表性的非懒惰信息批量规划器。源代码可在此处获取。

View on arXiv Download PDF AI Translation

cs.RO / 57 / 2604.11640

Micro-Dexterity in Biological Micromanipulation: Embodiment, Perception, and Control

生物微操作中的微灵巧性：体现、感知与控制

Lu, Kangyi, Wei, Lan, Tan, Zongcai, Zhang, Dandan

Abstract

Microscale manipulation has advanced substantially in controlled locomotion and targeted transport, yet many biomedical applications require precise and adaptive interaction with biological micro-objects. At these scales, manipulation is realized through three main classes of platforms: embodied microrobots that physically interact as mobile agents, field-mediated systems that generate contactless trapping or manipulation forces, and externally actuated end-effectors that interact through remotely driven physical tools. Unlike macroscale manipulators, these systems function in fluidic, confined, and surface-dominated environments characterized by negligible inertia, dominant interfacial forces, and soft, heterogeneous, and fragile targets. Consequently, classical assumptions of dexterous manipulation, including rigid-body contact, stable grasping, and rich proprioceptive feedback, become difficult to maintain. This review introduces micro-dexterity as a framework for analyzing biological micromanipulation through the coupled roles of embodiment, perception, and control. We examine how classical manipulation primitives, including pushing, reorientation, grasping, and cooperative manipulation, are reformulated at the microscale; compare the architectures that enable them, from contact-based micromanipulators to contactless field-mediated systems and cooperative multi-agent platforms; and review the perception and control strategies required for task execution. We identify the current dexterity gap between laboratory demonstrations and clinically relevant biological manipulation, and outline key challenges for future translation.

Chinese Translation

微尺度操作在受控运动和目标运输方面取得了显著进展，但许多生物医学应用需要与生物微物体进行精确和自适应的交互。在这些尺度上，操作主要通过三类平台实现：作为移动代理进行物理交互的具身微型机器人、产生无接触捕获或操作力的场介导系统，以及通过远程驱动物理工具进行交互的外部驱动末端执行器。与宏观尺度的操纵器不同，这些系统在流体、受限和表面主导的环境中运行，其特征是惯性可忽略、界面力占主导地位，以及目标软、异质且脆弱。因此，传统的灵巧操作假设，包括刚体接触、稳定抓取和丰富的本体感知反馈，变得难以维持。本文回顾了微灵巧性作为分析生物微操作的框架，探讨了体现、感知和控制的耦合角色。我们考察了经典操作原语（如推、重新定向、抓取和协作操作）在微尺度下的重新构建；比较了使其成为可能的架构，从基于接触的微操作器到无接触的场介导系统和协作多智能体平台；并回顾了任务执行所需的感知和控制策略。我们识别了实验室演示与临床相关生物操作之间的当前灵巧性差距，并概述了未来转化的关键挑战。

View on arXiv Download PDF AI Translation

cs.RO / 58 / 2604.11674

AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation

AffordSim：一种可扩展的面向可供性意识的机器人操作数据生成器和基准测试

Li, Mingyang, Xu, Haofan, Sun, Haowen, Chen, Xinzhe, Ren, Sihua, Huang, Liqi, Sui, Xinyang, Miao, Chenyang, Cui, Qiongjie, Liu, Zeyang, Chen, Xingyu, Lan, Xuguang

Abstract

Simulation-based data generation has become a dominant paradigm for training robotic manipulation policies, yet existing platforms do not incorporate object affordance information into trajectory generation. As a result, tasks requiring precise interaction with specific functional regions--grasping a mug by its handle, pouring from a cup's rim, or hanging a mug on a hook--cannot be automatically generated with semantically correct trajectories. We introduce AffordSim, the first simulation framework that integrates open-vocabulary 3D affordance prediction into the manipulation data generation pipeline. AffordSim uses our VoxAfford model, an open-vocabulary 3D affordance detector that enhances MLLM output tokens with multi-scale geometric features, to predict affordance maps on object point clouds, guiding grasp pose estimation toward task-relevant functional regions. Built on NVIDIA Isaac Sim with cross-embodiment support (Franka FR3, Panda, UR5e, Kinova), VLM-powered task generation, and novel domain randomization using DA3-based 3D Gaussian reconstruction from real photographs, AffordSim enables automated, scalable generation of affordance-aware manipulation data. We establish a benchmark of 50 tasks across 7 categories (grasping, placing, stacking, pushing/pulling, pouring, mug hanging, long-horizon composite) and evaluate 4 imitation learning baselines (BC, Diffusion Policy, ACT, Pi 0.5). Our results reveal that while grasping is largely solved (53-93% success), affordance-demanding tasks such as pouring into narrow containers (1-43%) and mug hanging (0-47%) remain significantly more challenging for current imitation learning methods, highlighting the need for affordance-aware data generation. Zero-shot sim-to-real experiments on a real Franka FR3 validate the transferability of the generated data.

Chinese Translation

基于仿真的数据生成已成为训练机器人操作策略的主流范式，然而现有平台并未将物体可供性信息纳入轨迹生成中。因此，要求与特定功能区域进行精确交互的任务——例如通过把手抓取杯子、从杯缘倒水或将杯子挂在钩子上——无法自动生成语义上正确的轨迹。我们介绍了AffordSim，这是第一个将开放词汇3D可供性预测整合到操作数据生成流程中的仿真框架。AffordSim使用我们的VoxAfford模型，这是一种开放词汇3D可供性检测器，通过多尺度几何特征增强MLLM输出标记，以预测物体点云上的可供性图，从而引导抓取姿态估计朝向与任务相关的功能区域。AffordSim建立在NVIDIA Isaac Sim之上，支持跨实体（Franka FR3、Panda、UR5e、Kinova），具备基于VLM的任务生成以及使用DA3基于真实照片的3D高斯重建的新颖领域随机化，能够实现自动化、可扩展的可供性意识操作数据生成。我们建立了一个涵盖7个类别（抓取、放置、堆叠、推/拉、倒水、杯子悬挂、长时间复合）的50个任务的基准，并评估了4个模仿学习基线（BC、Diffusion Policy、ACT、Pi 0.5）。我们的结果显示，尽管抓取任务的成功率较高（53-93%），但诸如向狭窄容器倒水（1-43%）和杯子悬挂（0-47%）等需求可供性的任务对于当前的模仿学习方法仍然显著更具挑战性，突显了可供性意识数据生成的必要性。在真实的Franka FR3上进行的零-shot仿真到现实实验验证了生成数据的可转移性。

View on arXiv Download PDF AI Translation

cs.RO / 59 / 2604.11680

Dual-Control Frequency-Aware Diffusion Model for Depth-Dependent Optical Microrobot Microscopy Image Generation

基于双控频率感知扩散模型的深度依赖光学微型机器人显微图像生成

Wei, Lan, Tan, Zongcai, Lu, Kangyi, Zheng, Jian-Qing, Zhang, Dandan

Abstract

Optical microrobots actuated by optical tweezers (OT) are important for cell manipulation and microscale assembly, but their autonomous operation depends on accurate 3D perception. Developing such perception systems is challenging because large-scale, high-quality microscopy datasets are scarce, owing to complex fabrication processes and labor-intensive annotation. Although generative AI offers a promising route for data augmentation, existing generative adversarial network (GAN)-based methods struggle to reproduce key optical characteristics, particularly depth-dependent diffraction and defocus effects. To address this limitation, we propose Du-FreqNet, a dual-control, frequency-aware diffusion model for physically consistent microscopy image synthesis. The framework features two independent ControlNet branches to encode microrobot 3D point clouds and depth-specific mesh layers, respectively. We introduce an adaptive frequency-domain loss that dynamically reweights high- and low-frequency components based on the distance to the focal plane. By leveraging differentiable FFT-based supervision, Du-FreqNet captures physically meaningful frequency distributions often missed by pixel-space methods. Trained on a limited dataset (e.g., 80 images per pose), our model achieves controllable, depth-dependent image synthesis, improving SSIM by 20.7% over baselines. Extensive experiments demonstrate that Du-FreqNet generalizes effectively to unseen poses and significantly enhances downstream tasks, including 3D pose and depth estimation, thereby facilitating robust closed-loop control in microrobotic systems.

Chinese Translation

由光学镊子（Optical Tweezers, OT）驱动的光学微型机器人在细胞操作和微尺度组装中具有重要作用，但其自主运行依赖于精确的三维感知。开发此类感知系统面临挑战，原因在于复杂的制造工艺和繁重的标注工作导致大规模高质量显微镜数据集稀缺。尽管生成式人工智能为数据增强提供了有前景的途径，现有基于生成对抗网络（GAN）的方法难以准确再现关键的光学特性，尤其是深度依赖的衍射和散焦效应。为解决该限制，本文提出了Du-FreqNet，一种双控频率感知扩散模型，用于物理一致的显微图像合成。该框架包含两个独立的ControlNet分支，分别编码微型机器人的三维点云和深度特定的网格层。我们引入了一种自适应频域损失，根据与焦平面的距离动态调整高频和低频成分的权重。通过利用基于可微傅里叶变换（FFT）的监督，Du-FreqNet捕捉了像素空间方法常忽略的物理意义频率分布。在有限数据集（例如每个姿态80张图像）上训练后，模型实现了可控的深度依赖图像合成，结构相似性指数（SSIM）较基线提升了20.7%。大量实验表明，Du-FreqNet在未见姿态上具有良好泛化能力，显著提升了包括三维姿态和深度估计在内的下游任务性能，从而促进了微型机器人系统中稳健的闭环控制。

View on arXiv Download PDF AI Translation

cs.RO / 60 / 2604.11708

ACT: Automated CPS Testing for Open-Source Robotic Platforms

ACT：开源机器人平台的自动化网络物理系统测试

Krishnan, Aditya A., Kim, Donghoon, Kim, Hokeun

Abstract

Open-source software for cyber-physical systems (CPS) often lacks robust testing involving robotic platforms, resulting in critical errors that remain undetected. This is especially challenging when multiple modules of CPS software are developed by various open-source contributors. To address this gap, we propose Automated CPS Testing (ACT) that performs automated, continuous testing of open-source software with its robotic platforms, integrated with the open-source infrastructure such as GitHub. We implement an ACT prototype and conduct a case study on an open-source CPS with an educational robotic platform to demonstrate its capabilities.

Chinese Translation

开源网络物理系统（CPS）软件通常缺乏针对机器人平台的稳健测试，导致关键错误未被发现。这在多个CPS软件模块由不同的开源贡献者开发时尤其具有挑战性。为了解决这一问题，我们提出了自动化CPS测试（ACT），该方法对开源软件及其机器人平台进行自动化、持续的测试，并与GitHub等开源基础设施集成。我们实现了ACT原型，并对一个具有教育意义的开源CPS和机器人平台进行了案例研究，以展示其能力。

View on arXiv Download PDF AI Translation

cs.RO / 61 / 2604.11734

Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

Multi-ORFT：面向协同驾驶的多智能体扩散规划的稳定在线强化微调

Bai, Haojie, Li, Aimin, Yao, Ruoyu, Zhao, Xiongwei, Zhang, Tingting, Zhang, Xing, Gao, Lin, Ma, and Jun

Abstract

Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to improve scene consistency and road adherence of joint trajectories. In post-training, we formulate a two-level MDP that exposes step-wise reverse-kernel likelihoods for online optimization, and combine dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, Multi-ORFT reduces collision rate from 2.04% to 1.89% and off-road rate from 1.68% to 1.36%, while increasing average speed from 8.36 to 8.61 m/s relative to the pre-trained planner, and it outperforms strong open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics. These results show that coupling scene-consistent denoising with stable online diffusion-policy optimization improves the reliability of closed-loop cooperative driving.

Chinese Translation

闭环协同驾驶需要规划器能够生成真实的多模态多智能体轨迹，同时提升安全性和交通效率。现有的扩散规划器能够从示范中建模多模态行为，但通常表现出较弱的场景一致性，且与闭环目标的对齐度较低；同时，在反应式多智能体环境中实现稳定的在线后训练仍然具有挑战性。我们提出了Multi-ORFT，该方法将基于场景条件的扩散预训练与稳定的在线强化后训练相结合。在预训练阶段，规划器采用智能体间自注意力、交叉注意力以及基于AdaLN-Zero的场景条件机制，以提升联合轨迹的场景一致性和道路依从性。在后训练阶段，我们构建了一个两层次的马尔可夫决策过程（MDP），暴露逐步的逆核似然以支持在线优化，并结合了密集的轨迹级奖励与方差门控的群体相对策略优化（VG-GRPO）以稳定训练。在WOMD闭环基准测试中，Multi-ORFT将碰撞率从2.04%降低至1.89%，越路率从1.68%降低至1.36%，同时相较于预训练规划器，平均速度从8.36提升至8.61米/秒，并且在主要的安全性和效率指标上优于包括SMART-large、SMART-tiny-CLSFT及VBD在内的强大开源基线。这些结果表明，将场景一致的去噪与稳定的在线扩散策略优化相结合，能够提升闭环协同驾驶的可靠性。

View on arXiv Download PDF AI Translation

cs.RO / 62 / 2604.11751

Grounded World Model for Semantically Generalizable Planning

用于语义泛化规划的基于感知的世界模型

Li, Quanyi, Feng, Lan, Zhang, Haonan, Li, Wuyang, Wang, Letian, Alahi, Alexandre, Soh, Harold

Abstract

In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder like DINO and JEPA. However, it is challenging to obtain the goal image in advance of the task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, reflected by the similarity of embeddings. This approach transforms the visuomotor MPC to a VLA that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves a 87% success rate on the test set comprising 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.

Chinese Translation

在模型预测控制（Model Predictive Control, MPC）中，世界模型用于预测各种动作提议的未来结果，随后通过评分指导最优动作的选择。对于视觉运动MPC，评分函数是在预训练视觉编码器（如DINO和JEPA）的潜在空间中，预测图像与目标图像之间的距离度量。然而，在任务执行前获取目标图像尤其在新环境中具有挑战性。此外，通过图像传达目标相比自然语言交互性有限。在本工作中，我们提出在视觉-语言对齐的潜在空间中学习一个基于感知的世界模型（Grounded World Model, GWM）。因此，每个动作提议的评分基于其未来结果与任务指令的接近程度，该接近程度通过嵌入向量的相似性体现。该方法将视觉运动MPC转化为视觉语言对齐（VLA）方法，且在语义泛化能力上优于基于视觉语言模型（VLM）的VLA。在所提出的WISER基准测试中，GWM-MPC在包含288个测试任务的测试集上实现了87%的成功率，这些任务包含未见过的视觉信号和指称表达，但通过训练期间演示的动作仍可解决。相比之下，传统的VLA即使在训练集上过拟合达到90%的成功率，在测试集上的平均成功率仅为22%。

View on arXiv Download PDF AI Translation

cs.RO / 63 / 2604.11757

StarVLA-$\alpha$: Reducing Complexity in Vision-Language-Action Systems

StarVLA-$eta$: 减少视觉-语言-动作系统中的复杂性

Ye, Jinhui, Gao, Ning, Yang, Senqiao, Zheng, Jinliang, Wang, Zixuan, Chen, Yuxin, Chen, Pengguang, Chen, Yilun, Liu, Shu, Jia, Jiaya

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex: as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$\alpha$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$\alpha$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $\pi_{0.5}$ by 20\% on the public real-world RoboChallenge benchmark. We expect StarVLA-$\alpha$ to serve as a solid starting point for future research in the VLA regime. Code will be released at https://github.com/starVLA/starVLA.

Chinese Translation

视觉-语言-动作（VLA）模型最近作为构建通用机器人代理的一种有前景的范式而出现。然而，VLA领域仍然高度碎片化且复杂：现有方法在架构、训练数据、体现配置和基准特定工程方面差异显著。在本研究中，我们引入了StarVLA-$eta$，这是一个简单而强大的基线，旨在在受控条件下研究VLA设计选择。StarVLA-$eta$故意最小化架构和流程复杂性，以减少实验混淆因素并实现系统分析。具体而言，我们重新评估了几个关键设计轴，包括动作建模策略、机器人特定的预训练和接口工程。在LIBERO、SimplerEnv、RoboTwin和RoboCasa的统一多基准训练中，相同的简单基线依然具有高度竞争力，表明强大的视觉-语言模型（VLM）骨干结合最小设计已经足以在不依赖额外架构复杂性或工程技巧的情况下实现强劲性能。值得注意的是，我们的单一通用模型在公共真实世界RoboChallenge基准上比$eta_{0.5}$高出20%。我们期待StarVLA-$eta$能够作为未来VLA领域研究的坚实起点。代码将发布在https://github.com/starVLA/starVLA。

View on arXiv Download PDF AI Translation

cs.RO / 64 / 2604.11768

Identifying Inductive Biases for Robot Co-Design

识别机器人协同设计的归纳偏置

Vaish, Apoorv, Brock, Oliver

Abstract

Co-designing a robot's morphology and control can ensure synergistic interactions between them, prevalent in biological organisms. However, co-design is a high-dimensional search problem. To make this search tractable, we need a systematic method for identifying inductive biases tailored to its structure. In this paper, we analyze co-design landscapes for soft locomotion and manipulation tasks and identify three patterns that are consistent across regions of their co-design spaces. We observe that within regions of co-design space, quality varies along a low-dimensional manifold. Higher-quality regions exhibit variations spread across more dimensions, while tightly coupling morphology and control. We leverage these insights to devise an efficient co-design algorithm. Since the precise instantiation of this structure varies across tasks and is not known a priori, our algorithm infers it from information gathered during search and adapts to each task's specific structure. This yields $36\%$ more improvement than benchmark algorithms. Moreover, our algorithm achieved more than two orders of magnitude in sample efficiency compared to these benchmark algorithms, demonstrating the effectiveness of leveraging inductive biases to co-design.

Chinese Translation

协同设计机器人的形态和控制可以确保它们之间的协同互动，这在生物有机体中普遍存在。然而，协同设计是一个高维搜索问题。为了使这一搜索变得可行，我们需要一种系统的方法来识别针对其结构量身定制的归纳偏置。本文分析了软体运动和操控任务的协同设计景观，并识别出在其协同设计空间的不同区域中一致的三种模式。我们观察到，在协同设计空间的区域内，质量沿着低维流形变化。高质量区域的变化分布在更多维度上，同时紧密耦合形态和控制。我们利用这些见解设计了一种高效的协同设计算法。由于这种结构的精确实例化在不同任务中有所不同，且事先并不明确，我们的算法从搜索过程中收集的信息中推断出这一结构，并适应每个任务的特定结构。这比基准算法提高了$36\%$的改进。此外，与这些基准算法相比，我们的算法在样本效率上提高了两个数量级以上，证明了利用归纳偏置进行协同设计的有效性。

View on arXiv Download PDF AI Translation

cs.RO / 65 / 2604.11793

Disentangled Point Diffusion for Precise Object Placement

解耦点扩散用于精确物体放置

He, Lyuxing, Cai, Eric, Aggarwal, Shobhit, Wang, Jianjun, Held, David

Abstract

Recent advances in robotic manipulation have highlighted the effectiveness of learning from demonstration. However, while end-to-end policies excel in expressivity and flexibility, they struggle both in generalizing to novel object geometries and in attaining a high degree of precision. An alternative, object-centric approach frames the task as predicting the placement pose of the target object, providing a modular decomposition of the problem. Building on this goal-prediction paradigm, we propose TAX-DPD, a hierarchical, disentangled point diffusion framework that achieves state-of-the-art performance in placement precision, multi-modal coverage, and generalization to variations in object geometries and scene configurations. We model global scene-level placements through a novel feed-forward Dense Gaussian Mixture Model (GMM) that yields a spatially dense prior over global placements; we then model the local object-level configuration through a novel disentangled point cloud diffusion module that separately diffuses the object geometry and the placement frame, enabling precise local geometric reasoning. Interestingly, we demonstrate that our point cloud diffusion achieves substantially higher accuracy than a prior approach based on SE(3)-diffusion, even in the context of rigid object placement. We validate our approach across a suite of challenging tasks in simulation and in the real-world on high-precision industrial insertion tasks. Furthermore, we present results on a cloth-hanging task in simulation, indicating that our framework can further relax assumptions on object rigidity.

Chinese Translation

近期在机器人操作领域的进展突显了从示范学习的有效性。然而，尽管端到端策略在表达能力和灵活性方面表现出色，但它们在推广到新颖物体几何形状和实现高精度方面面临挑战。另一种以物体为中心的方法将任务框定为预测目标物体的放置姿态，从而提供了问题的模块化分解。在这一目标预测范式的基础上，我们提出了TAX-DPD，一个层次化的解耦点扩散框架，在放置精度、多模态覆盖以及对物体几何形状和场景配置变化的推广能力方面实现了最先进的性能。我们通过一种新颖的前馈密集高斯混合模型（Dense Gaussian Mixture Model, GMM）对全局场景级放置进行建模，该模型在全局放置上产生了空间密集的先验；然后，我们通过一种新颖的解耦点云扩散模块对局部物体级配置进行建模，该模块分别扩散物体几何形状和放置框架，从而实现精确的局部几何推理。有趣的是，我们展示了我们的点云扩散在刚性物体放置的背景下，显著高于基于SE(3)-扩散的先前方法的准确性。我们在一系列具有挑战性的任务中验证了我们的方法，包括在高精度工业插入任务中的模拟和现实世界应用。此外，我们在模拟的布料悬挂任务中呈现了结果，表明我们的框架可以进一步放宽对物体刚性的假设。

View on arXiv Download PDF AI Translation

计算机视觉 (Computer Vision)

298

cs.CV / 1 / 2604.09639

3D Multi-View Stylization with Pose-Free Correspondences Matching for Robust 3D Geometry Preservation

基于无姿态对应匹配的三维多视角风格化方法及其鲁棒的三维几何保持

Bose, Shirsha

Abstract

Artistic style transfer is well studied for images and videos, but extending it to multi-view 3D scenes remains difficult because stylization can disrupt correspondences needed by geometry-aware pipelines. Independent per-view stylization often causes texture drift, warped edges, and inconsistent shading, degrading SLAM, depth prediction, and multi-view reconstruction. This thesis addresses multi-view stylization that remains usable for downstream 3D tasks without assuming camera poses or an explicit 3D representation during training. We introduce a feed-forward stylization network trained with per-scene test-time optimization under a composite objective coupling appearance transfer with geometry preservation. Stylization is driven by an AdaIN-inspired loss from a frozen VGG-19 encoder, matching channel-wise moments to a style image. To stabilize structure across viewpoints, we propose a correspondence-based consistency loss using SuperPoint and SuperGlue, constraining descriptors from a stylized anchor view to remain consistent with matched descriptors from the original multi-view set. We also impose a depth-preservation loss using MiDaS/DPT and use global color alignment to reduce depth-model domain shift. A staged weight schedule introduces geometry and depth constraints. We evaluate on Tanks and Temples and Mip-NeRF 360 using image and reconstruction metrics. Style adherence and structure retention are measured by Color Histogram Distance (CHD) and Structure Distance (DSD). For 3D consistency, we use monocular DROID-SLAM trajectories and symmetric Chamfer distance on back-projected point clouds. Across ablations, correspondence and depth regularization reduce structural distortion and improve SLAM stability and reconstructed geometry; on scenes with MuVieCAST baselines, our method yields stronger trajectory and point-cloud consistency while maintaining competitive stylization.

Chinese Translation

艺术风格迁移在图像和视频领域已有深入研究，但将其扩展到多视角三维场景仍然具有挑战性，因为风格化过程可能破坏几何感知流程所需的对应关系。独立的每视角风格化常导致纹理漂移、边缘扭曲和光照不一致，进而降低SLAM、深度预测和多视角重建的性能。本文针对多视角风格化问题，提出一种无需在训练阶段假设相机姿态或显式三维表示的方案，使风格化结果可用于后续三维任务。我们引入了一种前馈风格化网络，结合每场景测试时优化，通过复合目标函数将外观迁移与几何保持耦合进行训练。风格化由基于冻结的VGG-19编码器的AdaIN启发损失驱动，匹配通道统计量至风格图像。为稳定不同视角间的结构，我们提出基于SuperPoint和SuperGlue的对应一致性损失，约束风格化锚视图的描述子与原始多视角集合中匹配描述子保持一致。同时引入基于MiDaS/DPT的深度保持损失，并采用全局颜色对齐以减少深度模型的域偏移。分阶段权重调度引入几何和深度约束。我们在Tanks and Temples和Mip-NeRF 360数据集上使用图像及重建指标进行评估。风格遵循性和结构保持通过颜色直方图距离（CHD）和结构距离（DSD）衡量；三维一致性采用单目DROID-SLAM轨迹及反投影点云的对称Chamfer距离评估。消融实验表明，对应关系和深度正则化显著减少结构失真，提升SLAM稳定性和重建几何质量；在含MuVieCAST基线的场景中，我们的方法在保持竞争力的风格化效果的同时，实现了更强的轨迹和点云一致性。

View on arXiv Download PDF AI Translation

cs.CV / 2 / 2604.09643

PA-SFM: Tracker-free differentiable acoustic radiation for freehand 3D photoacoustic imaging

PA-SFM：无追踪器的可微分声辐射用于自由手3D光声成像

Li, Shuang, Gao, Jian, Kim, Chulhong, Choi, Seongwook, Chen, Qian, Wang, Yibing, Wu, Shuang, Zhang, Yu, Huang, Tingting, Zhou, Yucheng, Yao, Boxin, Yao, Yao, Li, Changhui

Abstract

Three-dimensional (3D) handheld photoacoustic tomography typically relies on bulky and expensive external positioning sensors to correct motion artifacts, which severely limits its clinical flexibility and accessibility. To address this challenge, we present PA-SFM, a tracker-free framework that leverages exclusively single-modality photoacoustic data for both sensor pose recovery and high-fidelity 3D reconstruction via differentiable acoustic radiation modeling. Unlike traditional structure-from-motion (SFM) methods based on visual features, PA-SFM integrates the acoustic wave equation into a differentiable programming pipeline. By leveraging a high-performance, GPU-accelerated acoustic radiation kernel, the framework simultaneously optimizes the 3D photoacoustic source distribution and the sensor array pose via gradient descent. To ensure robust convergence in freehand scenarios, we introduce a coarse-to-fine optimization strategy that incorporates geometric consistency checks and rigid-body constraints to eliminate motion outliers. We validated the proposed method through both numerical simulations and in-vivo rat experiments. The results demonstrate that PA-SFM achieves sub-millimeter positioning accuracy and restores high-resolution 3D vascular structures comparable to ground-truth benchmarks, offering a low-cost, software-defined solution for clinical freehand photoacoustic imaging. The source code is publicly available at \href{https://github.com/JaegerCQ/PA-SFM}{https://github.com/JaegerCQ/PA-SFM}.

Chinese Translation

三维（3D）手持光声断层成像通常依赖于笨重且昂贵的外部定位传感器来校正运动伪影，这严重限制了其临床灵活性和可及性。为了解决这一挑战，我们提出了PA-SFM，一个无追踪器的框架，专门利用单一模态的光声数据进行传感器姿态恢复和高保真3D重建，采用可微分的声辐射建模。与基于视觉特征的传统运动重建（structure-from-motion, SFM）方法不同，PA-SFM将声波方程集成到可微分编程管道中。通过利用高性能的GPU加速声辐射核，该框架通过梯度下降同时优化3D光声源分布和传感器阵列姿态。为了确保在自由手场景中的稳健收敛，我们引入了一种粗到细的优化策略，该策略结合了几何一致性检查和刚体约束，以消除运动异常值。我们通过数值模拟和体内大鼠实验验证了所提出的方法。结果表明，PA-SFM实现了亚毫米级的定位精度，并恢复了与真实基准相当的高分辨率3D血管结构，提供了一种低成本、软件定义的临床自由手光声成像解决方案。源代码可在 exttt{https://github.com/JaegerCQ/PA-SFM} 获取。

View on arXiv Download PDF AI Translation

cs.CV / 3 / 2604.09648

TRACE: Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock

TRACE：用于牲畜二氧化碳排放的热识别注意框架

Islam, Taminul, Lakhssassi, Abdellah, Sarker, Toqi Tahamid, Embaby, Mohamed, Ahmed, Khaled R, AbuGhazaleh, Amer

Abstract

Quantifying exhaled CO2 from free-roaming cattle is both a direct indicator of rumen metabolic state and a prerequisite for farm-scale carbon accounting, yet no existing system can deliver continuous, spatially resolved measurements without physical confinement or contact. We present TRACE (Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock), the first unified framework to jointly address per-frame CO2 plume segmentation and clip-level emission flux classification from mid-wave infrared (MWIR) thermal video. TRACE contributes three domain-specific advances: a Thermal Gas-Aware Attention (TGAA) encoder that incorporates per-pixel gas intensity as a spatial supervisory signal to direct self-attention toward high-emission regions at each encoder stage; an Attention-based Temporal Fusion (ATF) module that captures breath-cycle dynamics through structured cross-frame attention for sequence-level flux classification; and a four-stage progressive training curriculum that couples both objectives while preventing gradient interference. Benchmarked against fifteen state-of-the-art models on the CO2 Farm Thermal Gas Dataset, TRACE achieves an mIoU of 0.998 and the best result on every segmentation and classification metric simultaneously, outperforming domain-specific gas segmenters with several times more parameters and surpassing all baselines in flux classification. Ablation studies confirm that each component is individually essential: gas-conditioned attention alone determines precise plume boundary localization, and temporal reasoning is indispensable for flux-level discrimination. TRACE establishes a practical path toward non-invasive, continuous, per-animal CO2 monitoring from overhead thermal cameras at commercial scale. Codes are available at https://github.com/taminulislam/trace.

Chinese Translation

量化自由放养牛只呼出的二氧化碳（CO2）不仅是反刍动物代谢状态的直接指标，也是农场规模碳核算的前提条件，然而现有系统无法在不进行物理限制或接触的情况下提供连续的、空间分辨的测量。我们提出了TRACE（用于牲畜二氧化碳排放的热识别注意框架），这是第一个统一框架，旨在共同解决来自中波红外（MWIR）热视频的每帧CO2羽流分割和剪辑级排放通量分类。TRACE在三个领域特定的方面做出了贡献：一个热气体感知注意（TGAA）编码器，它将每个像素的气体强度作为空间监督信号，指导自注意力在每个编码器阶段聚焦于高排放区域；一个基于注意力的时间融合（ATF）模块，通过结构化的跨帧注意捕捉呼吸周期动态，以实现序列级的通量分类；以及一个四阶段渐进训练课程，结合了这两个目标，同时防止梯度干扰。在CO2农场热气体数据集上与十五个最先进模型进行基准测试，TRACE在每个分割和分类指标上同时实现了0.998的mIoU，并在所有指标上取得最佳结果，超越了具有数倍参数的领域特定气体分割器，并在通量分类中超越了所有基线。消融研究确认每个组件都是不可或缺的：仅依赖气体条件的注意力就能精确确定羽流边界定位，而时间推理对于通量级别的区分是不可或缺的。TRACE为从商业规模的高空热像仪进行非侵入式、连续的逐动物CO2监测建立了一条切实可行的路径。代码可在 https://github.com/taminulislam/trace 获取。

View on arXiv Download PDF AI Translation

cs.CV / 4 / 2604.09651

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

FlowHijack：一种动态感知的后门攻击方法针对流匹配视觉-语言-动作模型

An, Xinyuan, Luo, Tao, Peng, Gengyun, Wang, Yaobing, Ren, Kui, Wang, Dongxia

Abstract

Vision-Language-Action (VLA) models are emerging as a cornerstone for robotics, with flow-matching policies like $\pi_0$ showing great promise in generating smooth, continuous actions. As these models advance, their unique action generation mechanism - the vector field dynamics - presents a critical yet unexplored security vulnerability, particularly backdoor vulnerabilities. Existing backdoor attacks designed for autoregressive discretization VLAs cannot be directly applied to this new continuous dynamics. We introduce FlowHijack, the first backdoor attack framework to systematically target the underlying vector-field dynamics of flow-matching VLAs. Our method combines a novel $\tau$-conditioned injection strategy, which manipulates the initial phase of the action generation, with a dynamics mimicry regularizer. Experiments demonstrate that FlowHijack achieves high attack success rates using stealthy, context-aware triggers where prior works failed. Crucially, it preserves benign task performance and, by enforcing kinematic similarity, generates malicious actions that are behaviorally indistinguishable from normal actions. Our findings reveal a significant vulnerability in continuous embodied models, highlighting the urgent need for defenses targeting the model's internal generative dynamics.

Chinese Translation

视觉-语言-动作（VLA）模型正逐渐成为机器人技术的基石，其中流匹配策略如 $ ext{π}_0$ 在生成平滑、连续的动作方面展现出巨大潜力。随着这些模型的发展，其独特的动作生成机制——向量场动力学，暴露出一种关键但尚未被探索的安全漏洞，特别是后门漏洞。现有针对自回归离散化VLA的后门攻击无法直接应用于这种新的连续动力学。我们提出了FlowHijack，这是第一个系统性针对流匹配VLA的基础向量场动力学的后门攻击框架。我们的方法结合了一种新颖的 $ au$-条件注入策略，该策略操控动作生成的初始阶段，以及一种动力学模仿正则化器。实验表明，FlowHijack在使用隐蔽的、上下文感知的触发器时，能够实现高攻击成功率，而之前的工作未能做到这一点。重要的是，它保持了良性的任务性能，并通过强制运动相似性，生成在行为上与正常动作无法区分的恶意动作。我们的研究揭示了连续体模型中的显著漏洞，强调了针对模型内部生成动力学的防御措施的迫切需求。

View on arXiv Download PDF AI Translation

cs.CV / 5 / 2604.09657

Prints in the Magnetic Dust: Robust Similarity Search in Legacy Media Images Using Checksum Count Vectors

磁尘中的印记：基于校验和计数向量的遗留媒体图像中的鲁棒相似性搜索

Grzeszczuk, Maciej, Skorupska, Kinga, Wójcik, Grzegorz M.

Abstract

Digitizing magnetic media containing computer data is only the first step towards the preservation of early home computing era artifacts. The audio tape images must be decoded, verified, repaired if necessary, tested, and documented. If parts of this process could be effectively automated, volunteers could focus on contributing contextual and historical knowledge rather than struggling with technical tools. We therefore propose a feature representation based on Checksum Count Vectors and evaluate its applicability to detecting duplicates and variants of recordings within a large data store. The approach was tested on a collection of decoded tape images (n=4902), achieving 58\% accuracy in detecting variants and 97% accuracy in identifying alternative copies, for damaged recordings with up to 75% of records missing. These results represent an important step towards fully automated pipelines for restoration, de-duplication, and semantic integration of historical digital artifacts through sequence matching, automatic repair and knowledge discovery.

Chinese Translation

数字化包含计算机数据的磁介质仅是保护早期家庭计算时代文物的第一步。音频磁带图像必须进行解码、验证、必要时修复、测试和记录。如果这一过程的某些部分能够有效自动化，志愿者可以专注于贡献背景和历史知识，而不是与技术工具作斗争。因此，我们提出了一种基于校验和计数向量的特征表示，并评估其在大型数据存储中检测录音重复和变体的适用性。该方法在一组解码的磁带图像（n=4902）上进行了测试，在检测变体时达到了58%的准确率，在识别替代副本时达到了97%的准确率，适用于缺失记录高达75%的损坏录音。这些结果代表了朝着完全自动化的恢复、去重和历史数字文物语义整合管道的重要一步，通过序列匹配、自动修复和知识发现实现。

View on arXiv Download PDF AI Translation

cs.CV / 6 / 2604.09685

A Modular Zero-Shot Pipeline for Accident Detection, Localization, and Classification in Traffic Surveillance Video

用于交通监控视频中事故检测、定位与分类的模块化零样本管线

Thakur, Amey, Talele, Sarvesh

Abstract

We describe a zero-shot pipeline developed for the ACCIDENT @ CVPR 2026 challenge. The challenge requires predicting when, where, and what type of traffic accident occurs in surveillance video, without labeled real-world training data. Our method separates the problem into three independent modules. The first module localizes the collision in time by running peak detection on z-score normalized frame-difference signals. The second module finds the impact location by computing the weighted centroid of cumulative dense optical flow magnitude maps using the Farneback algorithm. The third module classifies collision type by measuring cosine similarity between CLIP image embeddings of frames near the detected peak and text embeddings built from multi-prompt natural language descriptions of each collision category. No domain-specific fine-tuning is involved; the pipeline processes each video using only pre-trained model weights. Our implementation is publicly available as a Kaggle notebook.

Chinese Translation

本文介绍了一种为 ACCIDENT @ CVPR 2026 挑战赛开发的零样本管线。该挑战要求在无标注真实世界训练数据的情况下，预测交通监控视频中事故发生的时间、地点及类型。我们的方法将问题拆分为三个独立模块。第一个模块通过对 z-score 标准化的帧差信号进行峰值检测，实现碰撞时间定位。第二个模块利用 Farneback 算法计算累积稠密光流幅值图的加权质心，确定碰撞的空间位置。第三个模块通过测量检测到峰值附近帧的 CLIP 图像嵌入与基于多提示自然语言描述构建的文本嵌入之间的余弦相似度，实现碰撞类型分类。该管线不涉及任何领域特定的微调，仅使用预训练模型权重处理每个视频。我们的实现已作为 Kaggle 笔记本公开发布。

View on arXiv Download PDF AI Translation

cs.CV / 7 / 2604.09687

Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

Grid2Matrix：揭示视觉-语言模型中的数字失认症

Zhang, Yunkai, Li, Linda, Cui, Yingxin, Ruan, Xiyuan, Zheng, Zeyu, Chen, Kezhen, Zhang, Yi, Yang, Diji

Abstract

Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the image and can therefore obscure failures in faithfully capturing all visual details. We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2M provides a simple way to increase visual complexity while minimizing semantic confounds. We find that VLMs exhibit a sharp early collapse in zero-shot end-to-end evaluation, failing on surprisingly small grids rather than degrading gradually as the task becomes denser. We probe the visual encoders of VLMs from two representative families and find that they preserve substantially more of the grid information than the corresponding end-to-end outputs. This suggests that the failure is not explained by visual encoding alone, but also reflects a gap between what remains recoverable from visual features and what is ultimately expressed in language. We term this gap \textit{Digital Agnosia}. Further analyses show that these errors are highly structured and depend strongly on how grid cells overlap with visual patch boundaries. We also find that common strategies such as model scaling and multimodal alignment do not fully eliminate this failure mode. We expect G2M to serve as a useful testbed for understanding where and how VLMs lose fine visual details, and for evaluating tasks where missing even small visual details can matter, such as tables, charts, forms, and GUIs.

Chinese Translation

视觉-语言模型（Vision-Language Models, VLMs）在许多多模态推理基准测试中表现出色，但这些评估通常不要求对图像进行详尽的解读，因此可能掩盖模型在忠实捕捉所有视觉细节方面的失败。我们提出了Grid2Matrix（G2M），一个受控基准测试，模型需要根据所展示的彩色网格及颜色到数字的映射，输出对应的矩阵。通过改变网格大小和颜色数量，G2M提供了一种简单的方法来增加视觉复杂度，同时最大限度地减少语义混淆。我们发现VLMs在零样本端到端评估中表现出明显的早期崩溃，模型在意外小的网格上即失败，而非随着任务密度增加而逐渐退化。我们对两类代表性VLM的视觉编码器进行了探查，发现它们保留的网格信息远多于对应的端到端输出。这表明失败不仅仅是视觉编码的问题，还反映了从视觉特征中可恢复的信息与最终语言表达之间的差距。我们将这一差距称为“数字失认症”（Digital Agnosia）。进一步分析显示，这些错误具有高度结构性，且强烈依赖于网格单元与视觉补丁边界的重叠方式。我们还发现，诸如模型扩展和多模态对齐等常见策略并不能完全消除这一失败模式。我们期望G2M成为理解VLMs在何处及如何丢失细微视觉细节的有用测试平台，并用于评估那些即使缺失微小视觉细节也会产生影响的任务，如表格、图表、表单和图形用户界面（GUI）。

View on arXiv Download PDF AI Translation

cs.CV / 8 / 2604.09688

Immunizing 3D Gaussian Generative Models Against Unauthorized Fine-Tuning via Attribute-Space Traps

通过属性空间陷阱对3D高斯生成模型进行免疫以防止未经授权的微调

Zhang, Jianwei, Cao, Sihan, Zhang, Chaoning, Hong, Ziming, Huang, Jiaxin, Zheng, Pengcheng, Qin, Caiyan, Dong, Wei, Yang, Yang, Liu, Tongliang

Abstract

Recent large-scale generative models enable high-quality 3D synthesis. However, the public accessibility of pre-trained weights introduces a critical vulnerability. Adversaries can fine-tune these models to steal specialized knowledge acquired during pre-training, leading to intellectual property infringement. Unlike defenses for 2D images and language models, 3D generators require specialized protection due to their explicit Gaussian representations, which expose fundamental structural parameters directly to gradient-based optimization. We propose GaussLock, the first approach designed to defend 3D generative models against fine-tuning attacks. GaussLock is a lightweight parameter-space immunization framework that integrates authorized distillation with attribute-aware trap losses targeting position, scale, rotation, opacity, and color. Specifically, these traps systematically collapse spatial distributions, distort geometric shapes, align rotational axes, and suppress primitive visibility to fundamentally destroy structural integrity. By jointly optimizing these dual objectives, the distillation process preserves fidelity on authorized tasks while the embedded traps actively disrupt unauthorized reconstructions. Experiments on large-scale Gaussian models demonstrate that GaussLock effectively neutralizes unauthorized fine-tuning attacks. It substantially degrades the quality of unauthorized reconstructions, evidenced by significantly higher LPIPS and lower PSNR, while effectively maintaining performance on authorized fine-tuning.

Chinese Translation

近期的大规模生成模型实现了高质量的3D合成。然而，预训练权重的公开可访问性引入了一个关键的脆弱性。对手可以对这些模型进行微调，以窃取在预训练过程中获得的专业知识，从而导致知识产权的侵犯。与2D图像和语言模型的防御不同，3D生成器由于其显式的高斯表示，需要专门的保护，因为这些表示直接将基本结构参数暴露给基于梯度的优化。我们提出了GaussLock，这是首个旨在防御3D生成模型免受微调攻击的方法。GaussLock是一个轻量级的参数空间免疫框架，结合了授权蒸馏与针对位置、尺度、旋转、不透明度和颜色的属性感知陷阱损失。具体而言，这些陷阱系统性地压缩空间分布、扭曲几何形状、对齐旋转轴并抑制原始可见性，从根本上破坏结构完整性。通过共同优化这两个目标，蒸馏过程在授权任务上保持了保真度，而嵌入的陷阱则积极干扰未经授权的重建。在大规模高斯模型上的实验表明，GaussLock有效中和了未经授权的微调攻击。它显著降低了未经授权重建的质量，表现为LPIPS显著提高和PSNR显著降低，同时有效维持了授权微调的性能。

View on arXiv Download PDF AI Translation

cs.CV / 9 / 2604.09689

Face Density as a Proxy for Data Complexity: Quantifying the Hardness of Instance Count

面部密度作为数据复杂度的代理指标：量化实例数量的难度

Mohammadi-Seif, Abolfazl, Baeza-Yates, Ricardo

Abstract

Machine learning progress has historically prioritized model-centric innovations, yet achievable performance is frequently capped by the intrinsic complexity of the data itself. In this work, we isolate and quantify the impact of instance density (measured by face count) as a primary driver of data complexity. Rather than simply observing that ``crowded scenes are harder,'' we rigorously control for class imbalance to measure the precise degradation caused by density alone. Controlled experiments on the WIDER FACE and Open Images datasets, restricted to exactly 1 to 18 faces per image with perfectly balanced sampling, reveal that model performance degrades monotonically with increasing face count. This trend holds across classification, regression, and detection paradigms, even when models are fully exposed to the entire density range. Furthermore, we demonstrate that models trained on low-density regimes fail to generalize to higher densities, exhibiting a systematic under-counting bias, with error rates increasing by up to 4.6x, which suggests density acts as a domain shift. These findings establish instance density as an intrinsic, quantifiable dimension of data hardness and motivate specific interventions in curriculum learning and density-stratified evaluation.

Chinese Translation

机器学习的进展历来侧重于模型中心的创新，然而可实现的性能常常受限于数据本身的内在复杂度。在本研究中，我们将实例密度（以面部数量计量）作为数据复杂度的主要驱动因素进行隔离和量化。我们不仅仅观察“拥挤场景更难”，而是严格控制类别不平衡，精确测量仅由密度引起的性能下降。在WIDER FACE和Open Images数据集上进行的受控实验中，图像中面部数量严格限定在1至18个，且采样完全平衡，结果显示模型性能随着面部数量的增加单调下降。该趋势在分类、回归和检测范式中均成立，即使模型充分暴露于整个密度范围。此外，我们证明在低密度条件下训练的模型无法很好地泛化到高密度条件，表现出系统性的低估计偏差，错误率最高增加4.6倍，这表明密度作为一种领域迁移因素存在。这些发现确立了实例密度作为数据难度的内在且可量化维度，并激发了在课程学习和基于密度分层评估中的具体干预措施。

View on arXiv Download PDF AI Translation

cs.CV / 10 / 2604.09690

Are We Recognizing the Jaguar or Its Background? A Diagnostic Framework for Jaguar Re-Identification

我们是在识别美洲虎还是其背景？一种用于美洲虎重识别的诊断框架

Rueda-Toicen, Antonio, Martin, Abigail Allen, Morozov, Daniil, Mahmood, Matin, Schild, Alexandra, Dayani, Shahabeddin, Panza, Davide, de Melo, Gerard

Abstract

Jaguar re-identification (re-ID) from citizen-science imagery can look strong on standard retrieval metrics while still relying on the wrong evidence, such as background context or silhouette shape, instead of the coat pattern that defines identity. We introduce a diagnostic framework for wildlife re-ID with two axes: a leakage-controlled context ratio, background/foreground, computed from inpainted background-only versus foreground-only images, and a laterality diagnostic based on cross-flank retrieval and mirror self-similarity. To make these diagnostics measurable, we curate a Pantanal jaguar benchmark with per-pixel segmentation masks and an identity-balanced evaluation protocol. We then use representative mitigation families, ArcFace fine-tuning, anti-symmetry regularization, and Lorentz hyperbolic embeddings, as case studies under the same evaluation lens. The goal is not only to ask which model ranks best, but also what visual evidence it uses to do so.

Chinese Translation

基于公民科学图像的美洲虎重识别（re-ID）在标准检索指标上可能表现良好，但仍可能依赖错误的证据，例如背景环境或轮廓形状，而非定义身份的皮毛图案。我们提出了一个用于野生动物重识别的诊断框架，包含两个维度：一个是泄漏控制的上下文比例（背景/前景），通过对仅背景和仅前景图像进行修复后计算得出；另一个是基于跨侧检索和镜像自相似性的左右侧诊断。为了使这些诊断可量化，我们整理了一个包含逐像素分割掩码和身份平衡评估协议的潘塔纳尔（Pantanal）美洲虎基准数据集。随后，我们以代表性缓解方法为案例研究，包括ArcFace微调、反对称正则化以及Lorentz双曲嵌入，在相同的评估视角下进行分析。我们的目标不仅是评估哪种模型排名最佳，更关注其使用了哪些视觉证据来实现该排名。

View on arXiv Download PDF AI Translation

cs.CV / 11 / 2604.09691

CAGE: Bridging the Accuracy-Aesthetics Gap in Educational Diagrams via Code-Anchored Generative Enhancement

CAGE：通过代码锚定生成增强弥合教育图表中的准确性与美学差距

Kukreja, Dikshant, Sah, Kshitij, Goyal, Karan, Mohania, Mukesh, Goyal, Vikram

Abstract

Educational diagrams -- labeled illustrations of biological processes, chemical structures, physical systems, and mathematical concepts -- are essential cognitive tools in K-12 instruction. Yet no existing method can generate them both accurately and engagingly. Open-source diffusion models produce visually rich images but catastrophically garble text labels. Code-based generation via LLMs guarantees label correctness but yields visually flat outputs. Closed-source APIs partially bridge this gap but remain unreliable and prohibitively expensive at educational scale. We quantify this accuracy-aesthetics dilemma across all three paradigms on 400 K-12 diagram prompts, measuring both label fidelity and visual quality through complementary automated and human evaluation protocols. To resolve it, we propose CAGE (Code-Anchored Generative Enhancement): an LLM synthesizes executable code producing a structurally correct diagram, then a diffusion model conditioned on the programmatic output via ControlNet refines it into a visually polished graphic while preserving label fidelity. We also introduce EduDiagram-2K, a collection of 2,000 paired programmatic-stylized diagrams enabling this pipeline, and present proof-of-concept results and a research agenda for the multimedia community.

Chinese Translation

教育图表——生物过程、化学结构、物理系统和数学概念的标注插图——是K-12教学中不可或缺的认知工具。然而，目前没有现有的方法能够同时生成准确且引人入胜的图表。开源扩散模型能够生成视觉丰富的图像，但严重混淆文本标签。基于代码的生成通过大语言模型（LLMs）可以保证标签的正确性，但生成的输出在视觉上显得平淡。封闭源API在一定程度上弥补了这一差距，但在教育规模上仍然不可靠且成本过高。我们在400个K-12图表提示上量化了这一准确性与美学的困境，通过互补的自动化和人工评估协议测量标签的保真度和视觉质量。为了解决这一问题，我们提出了CAGE（代码锚定生成增强）：一个大语言模型合成可执行代码，生成结构正确的图表，然后通过ControlNet条件化于程序输出的扩散模型将其精炼为视觉上精美的图形，同时保持标签的保真度。我们还介绍了EduDiagram-2K，这是一个包含2000个配对程序化-风格化图表的集合，以支持这一流程，并展示了概念验证结果和多媒体社区的研究议程。

View on arXiv Download PDF AI Translation

cs.CV / 12 / 2604.09693

TaFall: Balance-Informed Fall Detection via Passive Thermal Sensing

TaFall：基于平衡信息的被动热成像跌倒检测系统

Li, Chengxiao, Zhang, Xie, Zhu, Wei, Jiang, Yan, Wu, Chenshu

Abstract

Falls are a major cause of injury and mortality among older adults, yet most incidents occur in private indoor environments where monitoring must balance effectiveness with privacy. Existing privacy-preserving fall detection approaches, particularly those based on radio frequency sensing, often rely on coarse motion cues, which limits reliability in real-world deployments. We introduce TaFall, a balance-informed fall detection system based on low-cost, privacy-preserving thermal array sensing. The key insight is that TaFall models a fall as a process of balance degradation and detects falls by estimating pose-driven biomechanical balance dynamics. To enable this capability from low-resolution thermal array maps, we propose (i) an appearance-motion fusion model for robust pose reconstruction, (ii) physically grounded balance-aware learning, and (iii) pose-bridged pretraining to improve robustness. TaFall achieves a detection rate of 98.26% with a false alarm rate of 0.65% on our dataset with over 3,000 fall instances from 35 participants across diverse indoor environments. In 27 day deployments across four homes, TaFall attains an ultra-low false alarm rate of 0.00126% and a pilot bathroom study confirms robustness under moisture and thermal interference. Together, these results establish TaFall as a reliable and privacy-preserving approach to fall detection in everyday living environments.

Chinese Translation

跌倒是老年人伤害和死亡的主要原因之一，但大多数跌倒事件发生在私密的室内环境中，监测系统需在有效性与隐私保护之间取得平衡。现有的隐私保护型跌倒检测方法，尤其是基于射频感知的方案，通常依赖粗略的运动线索，限制了其在实际应用中的可靠性。本文提出TaFall，一种基于低成本、隐私保护的热成像阵列传感的平衡信息驱动跌倒检测系统。其核心理念是将跌倒建模为平衡能力的退化过程，通过估计基于姿态的生物力学平衡动态来检测跌倒。为实现从低分辨率热成像阵列图中提取此能力，本文提出了：(i) 外观与运动融合模型以实现鲁棒的姿态重建，(ii) 基于物理原理的平衡感知学习，以及 (iii) 基于姿态的预训练策略以提升系统鲁棒性。TaFall在包含35名参与者、3000余次跌倒实例的多样化室内环境数据集上，实现了98.26%的检测率和0.65%的误报率。在四个家庭环境中进行的为期27天部署中，TaFall达到了极低的误报率0.00126%，且一项浴室试点研究验证了其在湿度和热干扰条件下的鲁棒性。综上所述，TaFall被确立为一种可靠且兼顾隐私保护的日常生活环境跌倒检测方案。

View on arXiv Download PDF AI Translation

cs.CV / 13 / 2604.09694

EDFNet: Early Fusion of Edge and Depth for Thin-Obstacle Segmentation in UAV Navigation

EDFNet：用于无人机导航中细小障碍物分割的边缘与深度早期融合方法

Fathi, Negar

Abstract

Autonomous Unmanned Aerial Vehicles (UAVs) must reliably detect thin obstacles such as wires, poles, and branches to navigate safely in real-world environments. These structures remain difficult to perceive because they occupy few pixels, often exhibit weak visual contrast, and are strongly affected by class imbalance. Existing segmentation methods primarily target coarser obstacles and do not fully exploit the complementary multimodal cues needed for thin-structure perception. We present EDFNet, a modular early-fusion segmentation framework that integrates RGB, depth, and edge information for thin-obstacle perception in cluttered aerial scenes. We evaluate EDFNet on the Drone Depth and Obstacle Segmentation (DDOS) dataset across sixteen modality-backbone configurations using U-Net and DeepLabV3 in pretrained and non-pretrained settings. The results show that early RGB-Depth-Edge fusion provides a competitive and well-balanced baseline, with the most consistent gains appearing in boundary-sensitive and recall-oriented metrics. The pretrained RGBDE U-Net achieves the best overall performance, with the highest Thin-Structure Evaluation Score (0.244), mean IoU (0.219), and boundary IoU (0.234), while maintaining competitive runtime performance (19.62 FPS) on our evaluation hardware. However, performance on the rarest ultra-thin categories remains low across all models, indicating that reliable ultra-thin segmentation is still an open challenge. Overall, these findings position early RGB-Depth-Edge fusion as a practical and modular baseline for thin-obstacle segmentation in UAV navigation.

Chinese Translation

自主无人机（UAV）必须可靠地检测细小障碍物，如电线、杆子和树枝，以在真实环境中安全导航。这些结构由于占据像素较少、视觉对比度弱且受类别不平衡影响较大，仍然难以感知。现有的分割方法主要针对较大障碍物，未能充分利用细小结构感知所需的互补多模态信息。本文提出EDFNet，一种模块化的早期融合分割框架，集成了RGB、深度和边缘信息，用于复杂空中场景中的细小障碍物感知。我们在Drone Depth and Obstacle Segmentation（DDOS）数据集上，采用U-Net和DeepLabV3两种主干网络，在预训练和非预训练设置下，评估了十六种模态-主干组合。结果表明，早期RGB-深度-边缘融合提供了具有竞争力且均衡的基线，边界敏感和召回导向指标上增益最为显著。预训练的RGBDE U-Net实现了最佳整体性能，Thin-Structure Evaluation Score达到0.244，平均IoU为0.219，边界IoU为0.234，同时在评测硬件上保持了19.62 FPS的竞争性运行速度。然而，所有模型在最稀有的超细类别上的表现仍然较低，表明可靠的超细分割仍是一个未解决的挑战。总体而言，这些发现将早期RGB-深度-边缘融合定位为无人机导航中细小障碍物分割的实用且模块化的基线方法。

View on arXiv Download PDF AI Translation

cs.CV / 14 / 2604.09695

Assessing Privacy Preservation and Utility in Online Vision-Language Models

在线视觉语言模型中的隐私保护与效用评估

Chaudhari, Karmesh Siddharam, Zhu, Youxiang, Feng, Amy, Liang, Xiaohui, Zhang, Honggang

Abstract

The increasing use of Online Vision Language Models (OVLMs) for processing images has introduced significant privacy risks, as individuals frequently upload images for various utilities, unaware of the potential for privacy violations. Images contain relationships that relate to Personally Identifiable Information (PII), where even seemingly harmless details can indirectly reveal sensitive information through surrounding clues. This paper explores the critical issue of PII disclosure in images uploaded to OVLMs and its implications for user privacy. We investigate how the extraction of contextual relationships from images can lead to direct (explicit) or indirect (implicit) exposure of PII, significantly compromising personal privacy. Furthermore, we propose methods to protect privacy while preserving the intended utility of the images in Vision Language Model (VLM)-based applications. Our evaluation demonstrates the efficacy of these techniques, highlighting the delicate balance between maintaining utility and protecting privacy in online image processing environments. Index Terms-Personally Identifiable Information (PII), Privacy, Utility, privacy concerns, sensitive information

Chinese Translation

随着在线视觉语言模型（Online Vision Language Models, OVLMs）在图像处理中的广泛应用，隐私风险日益凸显。用户常常为实现各种功能上传图像，却未意识到潜在的隐私泄露问题。图像中包含与个人身份信息（Personally Identifiable Information, PII）相关的关系，即使是看似无害的细节，也可能通过周边线索间接揭示敏感信息。本文聚焦于上传至OVLMs的图像中PII泄露的关键问题及其对用户隐私的影响。我们研究了从图像中提取上下文关系如何导致PII的直接（显性）或间接（隐性）暴露，从而严重威胁个人隐私。此外，本文提出了在基于视觉语言模型（Vision Language Model, VLM）应用中保护隐私的同时，保持图像预期效用的方法。评估结果表明，这些技术在维护效用与保护隐私之间实现了微妙的平衡，彰显了在线图像处理环境中隐私保护的重要性。关键词：个人身份信息（PII）、隐私、效用、隐私关注、敏感信息

View on arXiv Download PDF AI Translation

cs.CV / 15 / 2604.09697

I Can't Believe TTA Is Not Better: When Test-Time Augmentation Hurts Medical Image Classification

我无法相信测试时增强（TTA）没有更好：当测试时增强影响医学图像分类

Medeiros, Daniel Nobrega

Abstract

Test-time augmentation (TTA)--aggregating predictions over multiple augmented copies of a test input--is widely assumed to improve classification accuracy, particularly in medical imaging where it is routinely deployed in production systems and competition solutions. We present a systematic empirical study challenging this assumption across three MedMNIST v2 benchmarks and four architectures spanning three orders of magnitude in parameter count (21K to 11M). Our principal finding is that TTA with standard augmentation pipelines consistently degrades accuracy relative to single-pass inference, with drops as severe as 31.6 percentage points for ResNet-18 on pathology images. This degradation affects all architectures, including convolutional models, and worsens with more augmented views. The sole exception is ResNet-18 on dermatology images, which gains a modest +1.6%. We identify the distribution shift between augmented and training-time inputs--amplified by batch normalization statistics mismatch--as the primary mechanism. Our ablation studies show that augmentation strategy matters critically: intensity-only augmentations preserve more performance than geometric transforms, and including the original unaugmented image partially mitigates but does not eliminate the accuracy drop. These findings serve as a cautionary note for practitioners: TTA should not be applied as a default post-hoc improvement but must be validated on the specific model-dataset combination.

Chinese Translation

测试时增强（TTA）——对多个增强副本的测试输入进行预测聚合——被广泛认为能够提高分类准确性，尤其是在医学成像领域，它在生产系统和竞赛解决方案中被常规使用。我们进行了一项系统的实证研究，挑战这一假设，涵盖了三个 MedMNIST v2 基准和四种架构，参数数量跨越三个数量级（21K 到 11M）。我们的主要发现是，使用标准增强管道的 TTA 在准确性上始终低于单次推理，ResNet-18 在病理图像上的准确率下降幅度高达 31.6 个百分点。这种下降影响所有架构，包括卷积模型，并且随着增强视图的增加而加重。唯一的例外是 ResNet-18 在皮肤病图像上的表现，略微提升了 +1.6%。我们确定增强输入与训练时输入之间的分布偏移——由批量归一化统计不匹配放大——是主要机制。我们的消融研究表明，增强策略至关重要：仅强度的增强比几何变换更能保持性能，并且包含原始未增强图像在一定程度上缓解但并未消除准确率下降。这些发现为从业者提供了警示：TTA 不应作为默认的事后改进，而必须在特定模型-数据集组合上进行验证。

View on arXiv Download PDF AI Translation

cs.CV / 16 / 2604.09700

Attention-Guided Flow-Matching for Sparse 3D Geological Generation

基于注意力引导的流匹配稀疏三维地质生成方法

Lu, Zhixiang, Han, Mengqi, Guo, Peixin, Bai, Tianming, Su, Jionglong, Fang, Fei, Song, Sifan

Abstract

Constructing high-resolution 3D geological models from sparse 1D borehole and 2D surface data is a highly ill-posed inverse problem. Traditional heuristic and implicit modeling methods fundamentally fail to capture non-linear topological discontinuities under extreme sparsity, often yielding unrealistic artifacts. Furthermore, while deep generative architectures like Diffusion Models have revolutionized continuous domains, they suffer from severe representation collapse when conditioned on sparse categorical grids. To bridge this gap, we propose 3D-GeoFlow, the first Attention-Guided Continuous Flow Matching framework tailored for sparse multimodal geological modeling. By reformulating discrete categorical generation as a simulation-free, continuous vector field regression optimized via Mean Squared Error, our model establishes stable, deterministic optimal transport paths. Crucially, we integrate 3D Attention Gates to dynamically propagate localized borehole features across the volumetric latent space, ensuring macroscopic structural coherence. To validate our framework, we curated a large-scale multimodal dataset comprising 2,200 procedurally generated 3D geological cases. Extensive out-of-distribution (OOD) evaluations demonstrate that 3D-GeoFlow achieves a paradigm shift, significantly outperforming heuristic interpolations and standard diffusion baselines.

Chinese Translation

从稀疏的一维钻孔和二维地表数据构建高分辨率三维地质模型是一个高度病态的逆问题。传统的启发式和隐式建模方法在极度稀疏条件下，根本无法捕捉非线性拓扑不连续性，常常产生不真实的伪影。此外，尽管扩散模型（Diffusion Models）等深度生成架构在连续域中取得了革命性进展，但在以稀疏类别网格为条件时，表现出严重的表示崩溃。为弥合这一差距，我们提出了3D-GeoFlow，这是首个针对稀疏多模态地质建模的注意力引导连续流匹配（Attention-Guided Continuous Flow Matching）框架。通过将离散类别生成重新表述为无模拟的连续向量场回归，并以均方误差（Mean Squared Error）进行优化，我们的模型建立了稳定且确定性的最优传输路径。关键在于，我们集成了三维注意力门（3D Attention Gates），动态传播局部钻孔特征至体积潜在空间，确保宏观结构的一致性。为验证该框架，我们构建了一个包含2200个程序生成三维地质案例的大规模多模态数据集。大量的分布外（OOD）评估表明，3D-GeoFlow实现了范式转变，显著优于启发式插值和标准扩散基线方法。

View on arXiv Download PDF AI Translation

cs.CV / 17 / 2604.09701

PASTA: Vision Transformer Patch Aggregation for Weakly Supervised Target and Anomaly Segmentation

PASTA：用于弱监督目标和异常分割的视觉变换器补丁聚合

Neubauer, Melanie, Rueckert, Elmar, Rauch, Christian

Abstract

Detecting unseen anomalies in unstructured environments presents a critical challenge for industrial and agricultural applications such as material recycling and weeding. Existing perception systems frequently fail to satisfy the strict operational requirements of these domains, specifically real-time processing, pixel-level segmentation precision, and robust accuracy, due to their reliance on exhaustively annotated datasets. To address these limitations, we propose a weakly supervised pipeline for object segmentation and classification using weak image-level supervision called 'Patch Aggregation for Segmentation of Targets and Anomalies' (PASTA). By comparing an observed scene with a nominal reference, PASTA identifies Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer (ViT) feature spaces. Our pipeline utilizes semantic text-prompts via the Segment Anything Model 3 to guide zero-shot object segmentation. Evaluations on a custom steel scrap recycling dataset and a plant dataset demonstrate a 75.8% training time reduction of our approach to domain-specific baselines. While being domain-agnostic, our method achieves superior Target (up to 88.3% IoU) and Anomaly (up to 63.5% IoU) segmentation performance in the industrial and agricultural domain.

Chinese Translation

在非结构化环境中检测未见异常是工业和农业应用（如材料回收和除草）面临的一项重大挑战。现有的感知系统由于依赖于详尽标注的数据集，往往无法满足这些领域严格的操作要求，特别是实时处理、像素级分割精度和鲁棒性准确性。为了解决这些局限性，我们提出了一种名为“目标和异常分割的补丁聚合”（PASTA）的弱监督对象分割和分类管道，该管道使用弱图像级监督。通过将观察到的场景与名义参考进行比较，PASTA通过自监督视觉变换器（ViT）特征空间中的分布分析识别目标和异常对象。我们的管道利用语义文本提示，通过Segment Anything Model 3指导零样本对象分割。在定制的钢铁废料回收数据集和植物数据集上的评估表明，我们的方法在训练时间上比领域特定基线减少了75.8%。尽管我们的方法不依赖于特定领域，但在工业和农业领域中，其目标分割性能（最高可达88.3% IoU）和异常分割性能（最高可达63.5% IoU）均表现出色。

View on arXiv Download PDF AI Translation

cs.CV / 18 / 2604.09702

Identity-Aware U-Net: Fine-grained Cell Segmentation via Identity-Aware Representation Learning

Identity-Aware U-Net：基于身份感知表示学习的细粒度细胞分割

Xiao, Rui

Abstract

Precise segmentation of objects with highly similar shapes remains a challenging problem in dense prediction, especially in scenarios with ambiguous boundaries, overlapping instances, and weak inter-instance visual differences. While conventional segmentation models are effective at localizing object regions, they often lack the discriminative capacity required to reliably distinguish a target object from morphologically similar distractors. In this work, we study fine-grained object segmentation from an identity-aware perspective and propose Identity-Aware U-Net (IAU-Net), a unified framework that jointly models spatial localization and instance discrimination. Built upon a U-Net-style encoder-decoder architecture, our method augments the segmentation backbone with an auxiliary embedding branch that learns discriminative identity representations from high-level features, while the main branch predicts pixel-accurate masks. To enhance robustness in distinguishing objects with near-identical contours or textures, we further incorporate triplet-based metric learning, which pulls target-consistent embeddings together and separates them from hard negatives with similar morphology. This design enables the model to move beyond category-level segmentation and acquire a stronger capability for precise discrimination among visually similar objects. Experiments on benchmarks including cell segmentation demonstrate promising results, particularly in challenging cases involving similar contours, dense layouts, and ambiguous boundaries.

Chinese Translation

对于形状高度相似的目标进行精确分割仍然是密集预测中的一大挑战，尤其是在边界模糊、实例重叠以及实例间视觉差异较弱的场景中。尽管传统分割模型在定位目标区域方面表现有效，但它们往往缺乏可靠区分目标对象与形态相似干扰物的判别能力。本文从身份感知的视角研究细粒度目标分割，提出了Identity-Aware U-Net（IAU-Net），一个联合建模空间定位与实例判别的统一框架。该方法基于U-Net风格的编码器-解码器架构，在分割主干网络的基础上增加了辅助嵌入分支，从高层特征中学习判别性身份表示，同时主分支负责预测像素级精确掩码。为了增强对近似轮廓或纹理目标的区分鲁棒性，我们进一步引入基于三元组的度量学习，将目标一致的嵌入拉近，并将其与形态相似的困难负样本分离。该设计使模型超越了类别级分割，获得了在视觉相似目标间进行精确判别的更强能力。在包括细胞分割的多个基准测试中，实验结果表现出良好效果，尤其在轮廓相似、布局密集及边界模糊的挑战性案例中表现突出。

View on arXiv Download PDF AI Translation

cs.CV / 19 / 2604.09704

Multi-Granularity Reasoning for Image Quality Assessment via Attribute-Aware Reinforcement Learning to Rank

基于属性感知强化学习排序的图像质量评估的多粒度推理

Chen, Xiangyong, Lin, Xiaochuan, Liu, Haoran, Li, Xuan, Su, Yichen, Guo, Xiangwei

Abstract

Recent advances in reasoning-induced image quality assessment (IQA) have demonstrated the power of reinforcement learning to rank (RL2R) for training vision-language models (VLMs) to assess perceptual quality. However, existing approaches operate at a single granularity, predicting only an overall quality score, while overlooking the multi-dimensional nature of human quality perception, which encompasses attributes such as sharpness, color fidelity, noise level, and compositional aesthetics. In this paper, we propose MG-IQA (Multi-Granularity IQA), a multi-granularity reasoning framework that extends RL2R to jointly assess overall image quality and fine-grained quality attributes within a single inference pass. Our approach introduces three key innovations: (1) an attribute-aware prompting strategy that elicits structured multi-attribute reasoning from VLMs; (2) a multi-dimensional Thurstone reward model that computes attribute-specific fidelity rewards for group relative policy optimization; and (3) a cross-domain alignment mechanism that enables stable joint training across synthetic distortion, authentic distortion, and AI-generated image datasets without perceptual scale re-alignment. Extensive experiments on eight IQA benchmarks demonstrate that MG-IQA consistently outperforms state-of-the-art methods in both overall quality prediction (average SRCC improvement of 2.1\%) and attribute-level assessment, while generating interpretable, human-aligned quality descriptions.

Chinese Translation

近期在推理驱动的图像质量评估（IQA）方面的进展展示了强化学习排序（RL2R）在训练视觉-语言模型（VLMs）以评估感知质量方面的强大能力。然而，现有方法仅在单一粒度上操作，仅预测整体质量评分，忽视了人类质量感知的多维特性，包括清晰度、色彩保真度、噪声水平和构图美学等属性。本文提出了MG-IQA（多粒度IQA），一种多粒度推理框架，扩展了RL2R，以在单次推理过程中共同评估整体图像质量和细粒度质量属性。我们的方法引入了三个关键创新：（1）一种属性感知的提示策略，能够引导VLMs进行结构化的多属性推理；（2）一种多维的Thurstone奖励模型，计算特定属性的保真度奖励，以进行组相对策略优化；（3）一种跨领域对齐机制，使得在合成失真、真实失真和AI生成图像数据集上能够稳定地进行联合训练，而无需感知尺度的重新对齐。在八个IQA基准上的大量实验表明，MG-IQA在整体质量预测（平均SRCC提升2.1%）和属性级评估方面始终优于最先进的方法，同时生成可解释的、与人类一致的质量描述。

View on arXiv Download PDF AI Translation

cs.CV / 20 / 2604.09706

The Deployment Gap in AI Media Detection: Platform-Aware and Visually Constrained Adversarial Evaluation

人工智能媒体检测中的部署差距：平台感知与视觉约束的对抗性评估

Budhkar, Aishwarya, Dhara, Trishita, Sheth, Siddhesh

Abstract

Recent AI media detectors report near-perfect performance under clean laboratory evaluation, yet their robustness under realistic deployment conditions remains underexplored. In practice, AI-generated images are resized, compressed, re-encoded, and visually modified before being shared on online platforms. We argue that this creates a deployment gap between laboratory robustness and real-world reliability. In this work, we introduce a platform-aware adversarial evaluation framework for AI media detection that explicitly models deployment transforms (e.g., resizing, compression, screenshot-style distortions) and constrains perturbations to visually plausible meme-style bands rather than full-image noise. Under this threat model, detectors achieving AUC $\approx$ 0{.}99 in clean settings experience substantial degradation. Per-image platform-aware attacks reduce AUC to significantly lower levels and achieve high fake-to-real misclassification rates, despite strict visual constraints. We further demonstrate that universal perturbations exist even under localized band constraints, revealing shared vulnerability directions across inputs. Beyond accuracy degradation, we observe pronounced calibration collapse under attack, where detectors become confidently incorrect. Our findings highlight that robustness measured under clean conditions substantially overestimates deployment robustness. We advocate for platform-aware evaluation as a necessary component of future AI media security benchmarks and release our evaluation framework to facilitate standardized robustness assessment.

Chinese Translation

近期的人工智能媒体检测器在干净的实验室评估中表现出近乎完美的性能，然而其在实际部署环境下的鲁棒性仍未得到充分探讨。实际上，人工智能生成的图像在被分享到在线平台之前，通常会经历尺寸调整、压缩、重新编码以及视觉上的修改。我们认为这在实验室鲁棒性与现实世界可靠性之间造成了部署差距。在本研究中，我们提出了一种平台感知的对抗性评估框架，用于人工智能媒体检测，该框架明确建模了部署过程中的变换（例如尺寸调整、压缩、截图式失真），并将扰动限制在视觉上合理的模因风格带状区域，而非全图噪声。在该威胁模型下，检测器在干净环境中达到约0.99的AUC时，其性能会显著下降。针对每张图像的基于平台的攻击显著降低了AUC，并在严格的视觉约束下实现了较高的假图像误判率。我们进一步证明，即使在局部带状约束下，也存在通用扰动，揭示了输入间共享的脆弱方向。除了准确率下降外，我们还观察到攻击下明显的校准崩溃，检测器变得自信但错误。我们的研究结果表明，在干净条件下测得的鲁棒性大大高估了实际部署的鲁棒性。我们主张将平台感知评估作为未来人工智能媒体安全基准测试的必要组成部分，并发布了我们的评估框架以促进标准化的鲁棒性评估。

View on arXiv Download PDF AI Translation

cs.CV / 21 / 2604.09709

Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks

视觉Transformer前馈网络的正交二次补充

Zixian, Wang

Abstract

Recent bilinear feed-forward replacements for vision transformers can substantially improve accuracy, but they often conflate two effects: stronger second-order interactions and increased redundancy relative to the main branch. We study a complementary design principle in which auxiliary quadratic features contribute only information not already captured by the dominant hidden representation. To this end, we propose Orthogonal Quadratic Complements (OQC), which construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch before injection. We further study an efficient low-rank realization (OQC-LR) and gated extensions (OQC-static and OQC-dynamic). Under a parameter-matched Deep-ViT and CIFAR-100 protocol with a fixed penultimate residual readout, full OQC improves an AFBO baseline from 64.25 +/- 0.22 to 65.59 +/- 0.22, while OQC-LR reaches 65.52 +/- 0.25 with a substantially better speed-accuracy tradeoff. On TinyImageNet, the gated extension OQC-dynamic achieves 51.88 +/- 0.32, improving the baseline (50.45 +/- 0.21) by 1.43 points and outperforming all ungated variants. Mechanism analyses show near-zero post-projection auxiliary-main overlap together with improved representation geometry and class separation. The full family, including both ungated and gated variants, generalizes consistently across both datasets.

Chinese Translation

近期针对视觉Transformer的双线性前馈替代方案能够显著提升准确率，但它们往往将两种效应混为一谈：更强的二阶交互作用和相对于主分支的冗余增加。我们研究了一种互补的设计原则，其中辅助二次特征仅贡献主隐藏表示尚未捕获的信息。为此，我们提出了正交二次补充（Orthogonal Quadratic Complements，OQC），该方法构建了一个低秩二次辅助分支，并在注入之前显式地将其投影到主分支的正交补空间。我们进一步研究了高效的低秩实现（OQC-LR）及门控扩展（OQC-static 和 OQC-dynamic）。在参数匹配的Deep-ViT和CIFAR-100协议下，固定倒数第二层残差读出，完整的OQC将AFBO基线从64.25 ± 0.22提升至65.59 ± 0.22，而OQC-LR以更优的速度-准确率权衡达到65.52 ± 0.25。在TinyImageNet上，门控扩展OQC-dynamic达到51.88 ± 0.32，较基线（50.45 ± 0.21）提升1.43个百分点，并优于所有无门控变体。机制分析显示投影后辅助与主分支的重叠几乎为零，同时表现出改进的表示几何结构和类别分离。包括无门控和门控变体在内的整个系列方法在两个数据集上均表现出一致的泛化能力。

View on arXiv Download PDF AI Translation

cs.CV / 22 / 2604.09710

Robust Fair Disease Diagnosis in CT Images

CT图像中的稳健公平疾病诊断

Li, Justin, Ding, Daniel, Pritha, Asmita Yuki, Hou, Aryana, Wang, Xin, Hu, Shu

Abstract

Automated diagnosis from chest CT has improved considerably with deep learning, but models trained on skewed datasets tend to perform unevenly across patient demographics. However, the situation is worse than simple demographic bias. In clinical data, class imbalance and group underrepresentation often coincide, creating compound failure modes that neither standard rebalancing nor fairness corrections can fix alone. We introduce a two-level objective that targets both axes of this problem. Logit-adjusted cross-entropy loss operates at the sample level, shifting decision margins by class frequency with provable consistency guarantees. Conditional Value at Risk aggregation operates at the group level, directing optimization pressure toward whichever demographic group currently has the higher loss. We evaluate on the Fair Disease Diagnosis benchmark using a 3D ResNet-18 pretrained on Kinetics-400, classifying CT volumes into Adenocarcinoma, Squamous Cell Carcinoma, COVID-19, and Normal groups with patient sex annotations. The training set illustrates the compound problem concretely: squamous cell carcinoma has 84 samples total, 5 of them female. The combined loss reaches a gender-averaged macro F1 of 0.8403 with a fairness gap of 0.0239, a 13.3% improvement in score and 78% reduction in demographic disparity over the baseline. Ablations show that each component alone falls short. The code is publicly available at https://github.com/Purdue-M2/Fair-Disease-Diagnosis.

Chinese Translation

基于胸部CT的自动化诊断在深度学习的推动下取得了显著进展，但在偏斜数据集上训练的模型往往在患者人口统计特征上表现不均。然而，情况比简单的人口统计偏差更为严重。在临床数据中，类别不平衡和群体代表性不足往往同时存在，形成复合失效模式，而标准的重平衡或公平性修正单独无法解决这一问题。我们提出了一种双层目标，针对该问题的两个方面进行优化。Logit调整的交叉熵损失在样本层面上操作，通过类别频率调整决策边界，并提供可证明的一致性保证。条件风险价值聚合在群体层面上操作，将优化压力指向当前损失较高的人口群体。我们在公平疾病诊断基准上进行评估，使用在Kinetics-400上预训练的3D ResNet-18，将CT体积分类为腺癌、鳞状细胞癌、COVID-19和正常组，并附有患者性别注释。训练集具体展示了复合问题：鳞状细胞癌总共有84个样本，其中女性仅5个。综合损失达到了性别平均宏F1值0.8403，公平性差距为0.0239，相较基线提高了13.3%的得分，并减少了78%的人口统计差异。消融实验表明，每个组件单独使用时效果均不理想。代码已公开发布在https://github.com/Purdue-M2/Fair-Disease-Diagnosis。

View on arXiv Download PDF AI Translation

cs.CV / 23 / 2604.09711

Head-wise Modality Specialization within MLLMs for Robust Fake News Detection under Missing Modality

在缺失模态下，MLLMs中的头部模态专业化用于稳健的假新闻检测

Qian, Kai, Shi, Weijie, Wang, Jiaqi, Li, Mengze, Chen, Hao, Cui, Yue, Guo, Hanghui, Liu, Ziyi, Zhu, Jia, Xu, Jiajie

Abstract

Multimodal fake news detection (MFND) aims to verify news credibility by jointly exploiting textual and visual evidence. However, real-world news dissemination frequently suffers from missing modality due to deleted images, corrupted screenshots, and similar issues. Thus, robust detection in this scenario requires preserving strong verification ability for each modality, which is challenging in MFND due to insufficient learning of the low-contribution modality and scarce unimodal annotations. To address this issue, we propose Head-wise Modality Specialization within Multimodal Large Language Models (MLLMs) for robust MFND under missing modality. Specifically, we first systematically study attention heads in MLLMs and their relationship with performance under missing modality, showing that modality-critical heads serve as key carriers of unimodal verification ability through their modality specialization. Based on this observation, to better preserve verification ability for the low-contribution modality, we introduce a head-wise specialization mechanism that explicitly allocates these heads to different modalities and preserves their specialization through lower-bound attention constraints. Furthermore, to better exploit scarce unimodal annotations, we propose a Unimodal Knowledge Retention strategy that prevents these heads from drifting away from the unimodal knowledge learned from limited supervision. Experiments show that our method improves robustness under missing modality while preserving performance with full multimodal input.

Chinese Translation

多模态假新闻检测（MFND）旨在通过联合利用文本和视觉证据来验证新闻的可信度。然而，现实世界中的新闻传播常常因删除图片、损坏的截图等问题而遭遇缺失模态。因此，在这种情况下，稳健的检测需要为每种模态保留强大的验证能力，这在MFND中是具有挑战性的，因为低贡献模态的学习不足以及单模态标注稀缺。为了解决这个问题，我们提出了一种在缺失模态下，基于多模态大型语言模型（MLLMs）的头部模态专业化方法，以实现稳健的MFND。具体而言，我们首先系统地研究了MLLMs中的注意力头及其与缺失模态下性能的关系，表明模态关键头作为单模态验证能力的关键载体，通过其模态专业化发挥作用。基于这一观察，为了更好地保留低贡献模态的验证能力，我们引入了一种头部专业化机制，明确将这些头分配给不同的模态，并通过下限注意力约束来保持其专业化。此外，为了更好地利用稀缺的单模态标注，我们提出了一种单模态知识保留策略，防止这些头偏离从有限监督中学习到的单模态知识。实验表明，我们的方法在缺失模态下提高了稳健性，同时在全模态输入下保持了性能。

View on arXiv Download PDF AI Translation

cs.CV / 24 / 2604.09712

LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

LAST：利用工具作为提示以增强多模态大型语言模型的空间推理能力

Tian, Shi-Yu, Zhou, Zhi, Yu, Kun-Yang, Yang, Ming, Chen, Yang, Shang, Ziqiao, Guo, Lan-Zhe, Li, Yu-Feng

Abstract

Spatial reasoning is a cornerstone capability for intelligent systems to perceive and interact with the physical world. However, multimodal large language models (MLLMs) frequently suffer from hallucinations and imprecision when parsing complex geometric layouts. As data-driven scaling struggles to internalize structured geometric priors and spatial constraints, integrating mature, specialized vision models presents a compelling alternative. Despite its promise, applying this paradigm to spatial reasoning is hindered by two key challenges: The difficulty of invoking heterogeneous, parameter-rich tools, as well as the challenge of understanding and effectively leveraging their diverse low-level outputs (e.g., segmentation masks, depth maps) in high-level reasoning. To address these challenges, we propose LAST, a unified framework for tool-augmented spatial reasoning. LAST features an extensible interactive sandbox, termed LAST-Box, which abstracts heterogeneous tool invocations into atomic instructions and reusable spatial skills, returning multimodal hints (e.g., annotated images and textual descriptions) that can be directly consumed by LLMs. We further design a three-stage progressive training strategy that guides models from understanding tool outputs to proficient and adaptive tool invocation. Experiments on four datasets show that LAST-7B achieves around 20\% performance gains over its backbone and outperforms strong proprietary closed-source LLMs, substantially enhancing reasoning on complex spatial tasks.

Chinese Translation

空间推理是智能系统感知和与物理世界互动的基础能力。然而，多模态大型语言模型（MLLMs）在解析复杂几何布局时常常遭遇幻觉和不精确的问题。由于数据驱动的扩展难以内化结构化的几何先验和空间约束，因此整合成熟的专业视觉模型成为一种有吸引力的替代方案。尽管这一范式具有潜力，但将其应用于空间推理面临两个主要挑战：一是调用异构、参数丰富的工具的困难，二是理解和有效利用其多样化低级输出（例如，分割掩码、深度图）进行高级推理的挑战。为了解决这些问题，我们提出了LAST，一个用于工具增强空间推理的统一框架。LAST具有一个可扩展的交互沙盒，称为LAST-Box，它将异构工具调用抽象为原子指令和可重用的空间技能，返回多模态提示（例如，带注释的图像和文本描述），这些提示可以被LLMs直接使用。我们进一步设计了一个三阶段的渐进训练策略，引导模型从理解工具输出到熟练且自适应地调用工具。在四个数据集上的实验表明，LAST-7B在其基础模型上实现了约20%的性能提升，并且超越了强大的专有闭源LLMs，显著增强了复杂空间任务的推理能力。

View on arXiv Download PDF AI Translation

cs.CV / 25 / 2604.09713

Zero-Shot Synthetic-to-Real Handwritten Text Recognition via Task Analogies

通过任务类比实现零样本合成到真实手写文本识别

Garrido-Munoz, Carlos, Panariello, Aniello, Cascianelli, Silvia, Porrello, Angelo, Calderara, Simone, Calvo-Zaragoza, Jorge, Cucchiara, Rita

Abstract

Handwritten Text Recognition (HTR) models trained on synthetic handwriting often struggle to generalize to real text, and existing adaptation methods still require real samples from the target domain. In this work, we tackle the fully zero-shot synthetic-to-real generalization setting, where no real data from the target language is available. Our approach learns how model parameters change when moving from synthetic to real handwriting in one or more source languages and transfers this learned correction to new target languages. When using multiple sources, we rely on linguistic similarity to weigh their contrubition when combining them. Experiments across five languages and six architectures show consistent improvements over synthetic-only baselines and reveal that the transferred corrections benefit even languages unrelated to the sources.

Chinese Translation

在合成手写体上训练的手写文本识别（HTR）模型通常难以推广到真实文本，且现有的适应方法仍需目标域的真实样本。本文研究了完全零样本的合成到真实泛化设置，即目标语言没有任何真实数据。我们的方法学习模型参数从合成手写体到真实手写体在一个或多个源语言上的变化，并将这种学习到的校正迁移到新的目标语言。在使用多个源语言时，我们依赖语言相似性来加权其贡献。跨五种语言和六种架构的实验表明，该方法相较仅使用合成数据的基线有持续改进，且迁移的校正对与源语言无关的语言同样有效。

View on arXiv Download PDF AI Translation

cs.CV / 26 / 2604.09715

MuPPet: Multi-person 2D-to-3D Pose Lifting

MuPPet：多人体二维到三维姿态提升方法

Markhorst, Thomas, Lin, Zhi-Yi, Chew, Jouh Yeong, van Gemert, Jan, Zhang, Xucong

Abstract

Multi-person social interactions are inherently built on coherence and relationships among all individuals within the group, making multi-person localization and body pose estimation essential to understanding these social dynamics. One promising approach is 2D-to-3D pose lifting which provides a 3D human pose consisting of rich spatial details by building on the significant advances in 2D pose estimation. However, the existing 2D-to-3D pose lifting methods often neglect inter-person relationships or cannot handle varying group sizes, limiting their effectiveness in multi-person settings. We propose MuPPet, a novel multi-person 2D-to-3D pose lifting framework that explicitly models inter-person correlations. To leverage these inter-person dependencies, our approach introduces Person Encoding to structure individual representations, Permutation Augmentation to enhance training diversity, and Dynamic Multi-Person Attention to adaptively model correlations between individuals. Extensive experiments on group interaction datasets demonstrate MuPPet significantly outperforms state-of-the-art single- and multi-person 2D-to-3D pose lifting methods, and improves robustness in occlusion scenarios. Our findings highlight the importance of modeling inter-person correlations, paving the way for accurate and socially-aware 3D pose estimation. Our code is available at: https://github.com/Thomas-Markhorst/MuPPet

Chinese Translation

多人体社交互动本质上建立在群体中所有个体之间的连贯性和关系之上，因此多人体定位和身体姿态估计对于理解这些社交动态至关重要。一种有前景的方法是二维到三维姿态提升（2D-to-3D pose lifting），该方法基于二维姿态估计的重大进展，提供包含丰富空间细节的三维人体姿态。然而，现有的二维到三维姿态提升方法往往忽视个体间的关系，或无法处理不同规模的群体，限制了其在多人体场景中的有效性。我们提出了MuPPet，一种新颖的多人体二维到三维姿态提升框架，能够显式建模个体间的相关性。为了利用这些个体间的依赖关系，我们的方法引入了Person Encoding（个体编码）以构建个体表示，Permutation Augmentation（排列增强）以提升训练多样性，以及Dynamic Multi-Person Attention（动态多人体注意力）以自适应地建模个体间的相关性。在群体互动数据集上的大量实验表明，MuPPet显著优于现有的单人及多人二维到三维姿态提升方法，并在遮挡场景下表现出更强的鲁棒性。我们的研究结果强调了建模个体间相关性的重要性，为实现精准且具社交感知的三维姿态估计开辟了新路径。我们的代码可在以下地址获取：https://github.com/Thomas-Markhorst/MuPPet

View on arXiv Download PDF AI Translation

cs.CV / 27 / 2604.09716

Training Deep Visual Networks Beyond Loss and Accuracy Through a Dynamical Systems Approach

通过动态系统方法超越损失和准确率训练深度视觉网络

La Quang, Hai, Ugail, Hassan, Howard, Newton, Tien, Cong Tran, Hoai, Nam Vu, Viet, Hung Nguyen

Abstract

Deep visual recognition models are usually trained and evaluated using metrics such as loss and accuracy. While these measures show whether a model is improving, they reveal very little about how its internal representations change during training. This paper introduces a complementary way to study that process by examining training through the lens of dynamical systems. Drawing on ideas from signal analysis originally used to study biological neural activity, we define three measures from layer activations collected across training epochs: an integration score that reflects long-range coordination across layers, a metastability score that captures how flexibly the network shifts between more and less synchronised states, and a combined dynamical stability index. We apply this framework to nine combinations of model architecture and dataset, including several ResNet variants, DenseNet-121, MobileNetV2, VGG-16, and a pretrained Vision Transformer on CIFAR-10 and CIFAR-100. The results suggest three main patterns. First, the integration measure consistently distinguishes the easier CIFAR-10 setting from the more difficult CIFAR-100 setting. Second, changes in the volatility of the stability index may provide an early sign of convergence before accuracy fully plateaus. Third, the relationship between integration and metastability appears to reflect different styles of training behaviour. Overall, this study offers an exploratory but promising new way to understand deep visual training beyond loss and accuracy.

Chinese Translation

深度视觉识别模型通常使用损失和准确率等指标进行训练和评估。尽管这些指标显示模型是否在改善，但它们对模型在训练过程中内部表征的变化揭示的信息非常有限。本文引入了一种补充的方法，通过动态系统的视角研究训练过程。借鉴最初用于研究生物神经活动的信号分析思想，我们定义了三个基于训练周期中收集的层激活的指标：一个反映层间长程协调的整合评分，一个捕捉网络在更同步和更不同步状态之间灵活转变的亚稳定性评分，以及一个综合动态稳定性指数。我们将这一框架应用于九种模型架构和数据集的组合，包括多个ResNet变体、DenseNet-121、MobileNetV2、VGG-16，以及在CIFAR-10和CIFAR-100上的预训练视觉变换器。结果表明了三个主要模式。首先，整合指标始终能够区分较简单的CIFAR-10设置与较困难的CIFAR-100设置。其次，稳定性指数波动性的变化可能在准确率完全趋于平稳之前提供收敛的早期迹象。第三，整合与亚稳定性之间的关系似乎反映了不同的训练行为风格。总体而言，本研究提供了一种探索性但有前景的新方法，以理解深度视觉训练超越损失和准确率的范畴。

View on arXiv Download PDF AI Translation

cs.CV / 28 / 2604.09717

Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset

基于多头注意力的交互感知架构用于孟加拉手写字符识别：首个基础数据集的构建

Raquib, Mirza, Polok, Asif Pervez, Biswas, Kedar Nath, Prity, Farida Siddiqi, Murad, Saydul Akbar, Rahimi, Nick

Abstract

Character recognition is the fundamental part of an optical character recognition (OCR) system. Word recognition, sentence transcription, document digitization, and language processing are some of the higher-order activities that can be done accurately through character recognition. Nonetheless, recognizing handwritten Bangla characters is not an easy task because they are written in different styles with inconsistent stroke patterns and a high degree of visual character resemblance. The datasets available are usually limited in intra-class and inequitable in class distribution. We have constructed a new balanced dataset of Bangla written characters to overcome those problems. This consists of 78 classes and each class has approximately 650 samples. It contains the basic characters, composite (Juktobarno) characters and numerals. The samples were a diverse group comprising a large age range and socioeconomic groups. Elementary and high school students, university students, and professionals are the contributing factors. The sample also has right and left-handed writers. We have further proposed an interaction-aware hybrid deep learning architecture that integrates EfficientNetB3, Vision Transformer, and Conformer modules in parallel. A multi-head cross-attention fusion mechanism enables effective feature interaction across these components. The proposed model achieves 98.84% accuracy on the constructed dataset and 96.49% on the external CHBCR benchmark, demonstrating strong generalization capability. Grad-CAM visualizations further provide interpretability by highlighting discriminative regions. The dataset and source code of this research is publicly available at: https://huggingface.co/MIRZARAQUIB/Bangla_Handwritten_Character_Recognition.

Chinese Translation

字符识别是光学字符识别（OCR）系统的基础部分。通过字符识别，可以准确完成单词识别、句子转录、文档数字化及语言处理等更高层次的任务。然而，识别孟加拉手写字符并非易事，因为其书写风格多样，笔画模式不一致且字符之间视觉相似度高。现有数据集通常存在类内样本有限且类别分布不均衡的问题。为解决这些问题，我们构建了一个新的平衡孟加拉手写字符数据集，包含78个类别，每类约650个样本，涵盖基本字符、复合字符（Juktobarno）及数字。样本来源多样，涵盖不同年龄段和社会经济群体，包括小学至高中学生、大学生及专业人士，且包含左右手书写者。我们进一步提出了一种交互感知的混合深度学习架构，结合了EfficientNetB3、Vision Transformer和Conformer模块并行处理，通过多头交叉注意力融合机制实现各组件间的有效特征交互。所提模型在构建的数据集上达到98.84%的准确率，在外部CHBCR基准测试上达到96.49%，展现出较强的泛化能力。Grad-CAM可视化进一步增强了模型的可解释性，突出显示了判别区域。本研究的数据集和源码已公开发布于：https://huggingface.co/MIRZARAQUIB/Bangla_Handwritten_Character_Recognition。

View on arXiv Download PDF AI Translation

cs.CV / 29 / 2604.09728

Data-Driven Automated Identification of Optimal Feature-Representative Images in Infrared Thermography Using Statistical and Morphological Metrics

基于统计与形态学指标的红外热成像最优特征代表图像数据驱动自动识别方法

Yagdjian, Harutyun, Gurka, Martin

Abstract

Infrared thermography (IRT) is a widely used non-destructive testing technique for detecting structural features such as subsurface defects. However, most IRT post-processing methods generate image sequences in which defect visibility varies strongly across time, frequency, or coefficient/index domains, making the identification of defect-representative images a critical challenge. Conventional evaluation metrics, such as the signal-to-noise ratio (SNR) or the Tanimoto criterion, often require prior knowledge of defect locations or defect-free reference regions, limiting their suitability for automated and unsupervised analysis. In this work, a data-driven methodology is proposed to identify images within IRT datasets that are most likely to contain and represent structural features, particularly anomalies and defects, without requiring prior spatial information. The approach is based on three complementary metrics: the Homogeneity Index of Mixture (HI), which quantifies statistical heterogeneity via deviations of local intensity distributions from a global reference distribution; a Representative Elementary Area (REA), derived from a Minkowski-functional adaptation of the Representative Elementary Volume concept to two-dimensional images; and a geometrical-topological Total Variation Energy (TVE) index, also based on two-dimensional Minkowski functionals, designed to improve sensitivity to localized anomalies. The framework is validated experimentally using pulse-heated IRT data from a carbon fiber-reinforced polymer (CFRP) plate containing six artificial defects at depths between 0.135 mm and 0.810 mm, and is further supported by one-dimensional N-layer thermal model simulations. The results demonstrate robust and unbiased ranking of image sequences and provide a reliable basis for automated defect-oriented image selection in IRT.

Chinese Translation

红外热成像（Infrared Thermography, IRT）是一种广泛应用的无损检测技术，用于检测如亚表面缺陷等结构特征。然而，大多数IRT后处理方法生成的图像序列中，缺陷的可见性在时间、频率或系数/指标域中变化显著，使得缺陷代表图像的识别成为一项关键挑战。传统的评价指标，如信噪比（Signal-to-Noise Ratio, SNR）或Tanimoto准则，通常需要先验的缺陷位置或无缺陷参考区域信息，限制了其在自动化和无监督分析中的适用性。本研究提出了一种数据驱动的方法，用于在IRT数据集中识别最有可能包含并代表结构特征（尤其是异常和缺陷）的图像，且无需先验空间信息。该方法基于三种互补指标：混合均匀性指数（Homogeneity Index of Mixture, HI），通过局部强度分布与全局参考分布的偏差量化统计异质性；代表性基本区域（Representative Elementary Area, REA），由代表性基本体积（Representative Elementary Volume）概念的Minkowski泛函适配至二维图像而得；以及几何拓扑全变差能量指数（Total Variation Energy, TVE），同样基于二维Minkowski泛函，旨在提升对局部异常的敏感性。该框架通过脉冲加热IRT数据进行实验验证，数据来源于含有六个深度介于0.135 mm至0.810 mm人工缺陷的碳纤维增强聚合物（Carbon Fiber-Reinforced Polymer, CFRP）板材，并辅以一维多层热模型仿真支持。结果表明，该方法能够稳健且无偏地对图像序列进行排序，为IRT中基于缺陷的自动图像选择提供了可靠基础。

View on arXiv Download PDF AI Translation

cs.CV / 30 / 2604.09729

LOLGORITHM: Funny Comment Generation Agent For Short Videos

LOLGORITHM：短视频幽默评论生成代理

Ouyang, Xuan, Wang, Senan, Wang, Bouzhou, Xiahou, Siyuan, Zhou, Jinrong, Li, Yuekang

Abstract

Short-form video platforms have become central to multimedia information dissemination, where comments play a critical role in driving engagement, propagation, and algorithmic feedback. However, existing approaches -- including video summarization and live-streaming danmaku generation -- fail to produce authentic comments that conform to platform-specific cultural and linguistic norms. In this paper, we present LOLGORITHM, a novel modular multi-agent framework for stylized short-form video comment generation. LOLGORITHM supports six controllable comment styles and comprises three core modules: video content summarization, video classification, and comment generation with semantic retrieval and hot meme augmentation. We further construct a bilingual dataset of 3,267 videos and 16,335 comments spanning five high-engagement categories across YouTube and Douyin. Evaluation combining automatic scoring and large-scale human preference analysis demonstrates that LOLGORITHM consistently outperforms baseline methods, achieving human preference selection rates of 80.46\% on YouTube and 84.29\% on Douyin across 107 respondents. Ablation studies confirm that these gains are attributable to the framework architecture rather than the choice of backbone LLM, underscoring the robustness and generalizability of our approach.

Chinese Translation

短视频平台已成为多媒体信息传播的核心，其中评论在推动用户参与、传播和算法反馈方面发挥着关键作用。然而，现有的方法——包括视频摘要和直播弹幕生成——未能生成符合平台特定文化和语言规范的真实评论。本文提出了LOLGORITHM，一种新颖的模块化多代理框架，用于风格化短视频评论生成。LOLGORITHM支持六种可控的评论风格，并由三个核心模块组成：视频内容摘要、视频分类和带有语义检索及热门表情包增强的评论生成。我们进一步构建了一个包含3,267个视频和16,335条评论的双语数据集，涵盖YouTube和抖音五个高参与度类别。结合自动评分和大规模人类偏好分析的评估结果表明，LOLGORITHM在107名受访者中持续优于基线方法，在YouTube上的人类偏好选择率达到80.46%，在抖音上达到84.29%。消融研究确认，这些提升归因于框架架构而非基础大型语言模型的选择，强调了我们方法的鲁棒性和通用性。

View on arXiv Download PDF AI Translation

cs.CV / 31 / 2604.09734

Multi-Frequency Local Plasticity for Visual Representation Learning

用于视觉表征学习的多频局部可塑性

Serj, Mehdi Fatan, Parraga, C. Alejandro, Otazu, Xavier

Abstract

We study how far structured architectural bias can compensate for the absence of end-to-end gradient-based representation learning in visual recognition. Building on the VisNet tradition, we introduce a modular hierarchical framework combining: (i) fixed multi-frequency Gabor decomposition into F=7 parallel streams; (ii) within-stream competitive learning with Hebbian and Oja updates and anti-Hebbian decorrelation; (iii) an associative memory module inspired by modern Hopfield retrieval; and (iv) iterative top-down modulation using local prediction and reconstruction signals. Representational layers are trained without end-to-end backpropagation through the full hierarchy; only the final linear readout and top-down projection matrices are optimized by gradient descent. We therefore interpret the model as a hybrid system that is predominantly locally trained but includes a small number of gradient-trained parameters. On CIFAR-10, the full model reaches 80.1% +/- 0.3% top-1 accuracy, linear probe), compared with 71.0% for a Hebbian-only baseline and 83.4% for a gradient-trained model on the same fixed Gabor basis. On CIFAR-100, performance is 54.8%. Factorial analysis indicates that multi-frequency streams, associative memory, and top-down feedback contribute largely additively, with a significant Streams x TopDown interaction (p=0.02). These results suggest that carefully chosen architectural priors can recover a substantial fraction of the performance typically associated with global gradient training, while leaving a measurable residual gap. Experiments are limited to CIFAR-10/100.

Chinese Translation

我们研究了结构化架构偏置在视觉识别中在缺乏端到端基于梯度的表征学习时能够补偿的程度。基于VisNet传统，我们引入了一个模块化的层级框架，结合了：(i) 固定的多频Gabor分解为F=7个并行流；(ii) 流内竞争学习，采用Hebbian和Oja更新以及反Hebbian去相关；(iii) 受现代Hopfield检索启发的联想记忆模块；(iv) 利用局部预测和重构信号的迭代自顶向下调制。表征层的训练不通过整个层级的端到端反向传播；仅最终的线性读出层和自顶向下投影矩阵通过梯度下降进行优化。因此，我们将该模型解释为一个主要通过局部训练但包含少量梯度训练参数的混合系统。在CIFAR-10数据集上，完整模型达到80.1% ± 0.3%的top-1准确率（线性探针），相比之下，纯Hebbian基线为71.0%，在相同固定Gabor基上梯度训练模型为83.4%。在CIFAR-100上，性能为54.8%。因子分析表明，多频流、联想记忆和自顶向下反馈的贡献大致为加性，且存在显著的Streams x TopDown交互作用（p=0.02）。这些结果表明，精心选择的架构先验能够恢复通常与全局梯度训练相关的性能的很大一部分，同时仍存在可测量的性能差距。实验仅限于CIFAR-10/100。

View on arXiv Download PDF AI Translation

cs.CV / 32 / 2604.09749

See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment

公平关注，真实表达：公平注意力提升视觉语言对齐中的定位能力并减少幻觉现象

Azeez, Mohammad Anas, Deria, Ankan, Siddiqui, Zohaib Hasan, Dukre, Adinath Madhavrao, Ali, Rafiq, Atito, Sara, Xie, Yutong, Razzak, Imran

Abstract

Multimodal large language models (MLLMs) frequently hallucinate objects that are absent from the visual input, often because attention during decoding is disproportionately drawn to visually dominant or frequently occurring content. We observe that this inequity in attention allocation is a root cause of object hallucination: when rare, small, or contextually peripheral objects receive insufficient attention, the model fails to ground its generation in the full visual scene. We argue that every object in an image, regardless of its size, frequency or visual salience, deserves equal representational opportunity during decoding. To this end, we propose DOP-OBC, a training-free and architecture-agnostic decoding strategy built on the principle of equitable attention. Two complementary object-aware signals work in tandem: a Dominant Object Penalty (DOP) that softly suppresses attention over-concentration on visually dominant regions, and an Outlier Boost Coefficient (OBC) that amplifies attention toward rare yet confidently detected objects. These signals are injected as per-row logit modulations within the causal attention mask, requiring no weight updates and preserving autoregressive decoding properties. Extensive experiments across image and video MLLMs demonstrate consistent reductions in object hallucination on CHAIR and POPE benchmarks, alongside improvements in GPT-4o assessed captioning quality across correctness, consistency, detail, context and temporal dimensions. DOP-OBC establishes that fairness in attention allocation is not merely a design principle but a practical and effective path toward more faithful multimodal generation.

Chinese Translation

多模态大型语言模型（MLLMs）经常产生视觉输入中不存在的物体幻觉，这通常是因为解码过程中注意力过度集中于视觉上占主导地位或频繁出现的内容。我们观察到，注意力分配的不公平是物体幻觉的根本原因：当罕见、小型或语境边缘的物体未能获得足够的注意力时，模型无法将生成内容准确地基于完整的视觉场景。我们认为图像中的每个物体，无论其大小、出现频率或视觉显著性如何，都应在解码过程中获得平等的表示机会。为此，我们提出了DOP-OBC，一种基于公平注意力原则的无训练且与架构无关的解码策略。该策略结合了两种互补的面向物体的信号：一种是主导物体惩罚（Dominant Object Penalty，DOP），用于柔性抑制对视觉主导区域的注意力过度集中；另一种是异常提升系数（Outlier Boost Coefficient，OBC），用于增强对罕见但检测置信度高的物体的注意力。这些信号通过因果注意力掩码中的逐行logit调制注入，无需权重更新且保持自回归解码特性。大量针对图像和视频的MLLM实验表明，在CHAIR和POPE基准测试中，DOP-OBC持续减少了物体幻觉，同时在GPT-4o评估的描述质量上，在正确性、一致性、细节、语境和时间维度均有所提升。DOP-OBC证明，注意力分配的公平性不仅是设计原则，更是实现更真实多模态生成的有效途径。

View on arXiv Download PDF AI Translation

cs.CV / 33 / 2604.09757

MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

MedLVR：用于可靠医学视觉问答的潜在视觉推理

Xi, Suyang, Hu, Songtao, Lai, Yuxiang, Dan, Wangyun, Liu, Yaqi, Wang, Shansong, Yang, Xiaofeng

Abstract

Medical vision--language models (VLMs) have shown strong potential for medical visual question answering (VQA), yet their reasoning remains largely text-centric: images are encoded once as static context, and subsequent inference is dominated by language. This paradigm is fundamentally limited in clinical scenarios, where accurate answers often depend on subtle, localized visual evidence that cannot be reliably preserved in static embeddings. We propose \textsc{MedLVR}, a latent visual reasoning framework that introduces an explicit visual evidence state into autoregressive decoding. Instead of relying solely on text-based intermediate reasoning, \textsc{MedLVR} interleaves a short latent reasoning segment within the decoder by reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence before answer generation. To support effective visual supervision, we adopt a two-stage training strategy: region of interest (ROI)-supervised fine-tuning aligns latent states with clinically relevant image evidence, and Visual-Latent Policy Optimization (VLPO) further optimizes latent reasoning and answer generation under outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks show that \textsc{MedLVR} consistently outperforms recent reasoning baselines and improves the average score over the Qwen2.5-VL-7B backbone from 48.3\% to 53.4\%. These results show that latent visual reasoning provides an effective mechanism for preserving diagnostically relevant visual evidence and improving the reliability of medical VQA.

Chinese Translation

医学视觉-语言模型（VLMs）在医学视觉问答（VQA）中展现出强大潜力，但其推理过程仍主要以文本为中心：图像被编码为静态上下文，后续推理主要依赖语言。这一范式在临床场景中存在根本性限制，因为准确答案往往依赖于细微且局部的视觉证据，而这些证据无法在静态嵌入中可靠保存。我们提出了MedLVR，一种潜在视觉推理框架，在自回归解码过程中引入了显式的视觉证据状态。MedLVR不单依赖基于文本的中间推理，而是在解码器中交错插入短暂的潜在推理段，通过重用隐藏状态作为连续潜在步骤，实现对与查询相关视觉证据的迭代保存与细化，随后生成答案。为支持有效的视觉监督，我们采用两阶段训练策略：兴趣区域（ROI）监督微调使潜在状态与临床相关图像证据对齐，视觉-潜在策略优化（VLPO）在结果层面奖励下进一步优化潜在推理和答案生成。在OmniMedVQA及五个外部医学VQA基准上的实验表明，MedLVR持续优于近期推理基线，并使基于Qwen2.5-VL-7B骨干的平均得分从48.3%提升至53.4%。结果表明，潜在视觉推理为保存诊断相关视觉证据及提升医学VQA的可靠性提供了有效机制。

View on arXiv Download PDF AI Translation

cs.CV / 34 / 2604.09781

Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

基于文本引导的6D物体姿态重排通过闭环VLM代理

Baik, Sangwon, Kim, Gunhee, Choi, Mingi, Joo, Hanbyul

Abstract

Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object. It works consistently across both closed-source and open-source VLMs. Moreover, when combining our 6D pose prediction with simple robot motion planning, it enables more successful robot manipulation than existing methods. Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.

Chinese Translation

视觉-语言模型（VLMs）展现出强大的视觉推理能力，但在3D理解方面仍然存在困难。特别是，VLMs常常无法推断出3D场景中目标物体与文本一致的目标6D姿态。然而，我们发现通过一些推理时的技术和迭代推理，VLMs能够实现显著的性能提升。具体而言，给定一个由RGB-D图像（或3D网格的组合场景）表示的3D场景和一个指定期望状态变化的文本指令，我们重复以下循环：观察当前场景；评估其是否忠实于指令；为目标物体提出姿态更新；应用该更新；并渲染更新后的场景。通过这种闭环交互，VLM有效地充当了一个代理。我们进一步引入了三种推理时的技术，这些技术对这一闭环过程至关重要：（i）支持视图选择的多视角推理，（ii）以物体为中心的坐标系可视化，以及（iii）单轴旋转预测。在没有任何额外微调或新模块的情况下，我们的方法在预测目标物体的文本引导目标6D姿态方面超越了之前的方法。它在闭源和开源VLMs中均表现一致。此外，当将我们的6D姿态预测与简单的机器人运动规划相结合时，它比现有方法实现了更成功的机器人操作。最后，我们进行了消融研究，以证明每种提出技术的必要性。

View on arXiv Download PDF AI Translation

cs.CV / 35 / 2604.09782

Biomarker-Based Pretraining for Chagas Disease Screening in Electrocardiograms

基于生物标志物的查加斯病筛查预训练方法在心电图中的应用

Stenhede, Elias, Ranjbar, Arian

Abstract

Chagas disease screening via ECGs is limited by scarce and noisy labels in existing datasets. We propose a biomarker-based pretraining approach, where an ECG feature extractor is first trained to predict percentile-binned blood biomarkers from the MIMIC-IV-ECG dataset. The pretrained model is then fine-tuned on Brazilian datasets for Chagas detection. Our 5-model ensemble, developed by the Ahus AIM team, achieved a challenge score of 0.269 on the hidden test set, ranking 5th in Detection of Chagas Disease from the ECG: The George B. Moody PhysioNet Challenge 2025. Source code and the model are shared on GitHub: github.com/Ahus-AIM/physionet-challenge-2025

Chinese Translation

通过心电图（ECG）进行查加斯病筛查受到现有数据集中稀缺和噪声标签的限制。我们提出了一种基于生物标志物的预训练方法，其中首先训练一个ECG特征提取器，以预测来自MIMIC-IV-ECG数据集的百分位分箱血液生物标志物。然后，将预训练模型在巴西数据集上进行微调，以进行查加斯病检测。由Ahus AIM团队开发的我们的五模型集成在隐藏测试集上取得了0.269的挑战得分，在“从心电图检测查加斯病：乔治·B·穆迪PhysioNet挑战赛2025”中排名第五。源代码和模型已在GitHub上分享：github.com/Ahus-AIM/physionet-challenge-2025

View on arXiv Download PDF AI Translation

cs.CV / 36 / 2604.09814

RobustMedSAM: Degradation-Resilient Medical Image Segmentation via Robust Foundation Model Adaptation

RobustMedSAM：通过鲁棒基础模型适配实现的降解鲁棒医疗图像分割

Li, Jieru, Chen, Matthew, Nnamdi, Micky C., Tamo, J. Ben, Marteau, Benoit L., Wang, May D.

Abstract

Medical image segmentation models built on Segment Anything Model (SAM) achieve strong performance on clean benchmarks, yet their reliability often degrades under realistic image corruptions such as noise, blur, motion artifacts, and modality-specific distortions. Existing approaches address either medical-domain adaptation or corruption robustness, but not both jointly. In SAM, we find that these capabilities are concentrated in complementary modules: the image encoder preserves medical priors, while the mask decoder governs corruption robustness. Motivated by this observation, we propose RobustMedSAM, which adopts module-wise checkpoint fusion by initializing the image encoder from MedSAM and the mask decoder from RobustSAM under a shared ViT-B architecture. We then fine-tune only the mask decoder on 35 medical datasets from MedSegBench, spanning six imaging modalities and 12 corruption types, while freezing the remaining components to preserve pretrained medical representations. We additionally investigate an SVD-based parameter-efficient variant for limited encoder adaptation. Experiments on both in-distribution and out-of-distribution benchmarks show that RobustMedSAM improves degraded-image Dice from 0.613 to 0.719 (+0.106) over SAM, demonstrating that structured fusion of complementary pretrained models is an effective and practical approach for robust medical image segmentation.

Chinese Translation

基于Segment Anything Model（SAM）构建的医疗图像分割模型在干净的基准测试中表现优异，但其在噪声、模糊、运动伪影及特定成像模态失真等现实图像降解条件下的可靠性常常下降。现有方法通常只解决医疗领域适配或降解鲁棒性中的一项，而未能兼顾两者。在SAM中，我们发现这两种能力集中体现在互补的模块中：图像编码器保留了医疗先验知识，而掩码解码器则主导降解鲁棒性。基于此观察，我们提出RobustMedSAM，该方法通过模块级检查点融合，在共享的ViT-B架构下，将图像编码器初始化为MedSAM，将掩码解码器初始化为RobustSAM。随后，我们仅对掩码解码器在涵盖六种成像模态和12种降解类型的MedSegBench中35个医疗数据集上进行微调，同时冻结其余组件以保留预训练的医疗表示。我们还探讨了一种基于奇异值分解（SVD）的参数高效变体，用于有限的编码器适配。针对内部分布和外部分布基准的实验结果表明，RobustMedSAM在降解图像上的Dice系数较SAM从0.613提升至0.719（+0.106），验证了结构化融合互补预训练模型是实现鲁棒医疗图像分割的有效且实用的方法。

View on arXiv Download PDF AI Translation

cs.CV / 37 / 2604.09819

ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos

ACCIDENT：用于交通监控视频中车辆事故检测的基准数据集

Picek, Lukas, Čermák, Michal, Hanzl, Marek, Čermák, Vojtěch

Abstract

We introduce ACCIDENT, a benchmark dataset for traffic accident detection in CCTV footage, designed to evaluate models in supervised (IID and OOD) and zero-shot settings, reflecting both data-rich and data-scarce scenarios. The benchmark consists of a curated set of 2,027 real and 2,211 synthetic clips annotated with the accident time, spatial location, and high-level collision type. We define three core tasks: (i) temporal localization of the accident, (ii) its spatial localization, and (iii) collision type classification. Each task is evaluated using custom metrics that account for the uncertainty and ambiguity inherent in CCTV footage. In addition to the benchmark, we provide a diverse set of baselines, including heuristic, motion-aware, and vision-language approaches, and show that ACCIDENT is challenging. You can access the ACCIDENT at: https://accidentbench.github.io

Chinese Translation

我们介绍了ACCIDENT，这是一个用于CCTV视频中交通事故检测的基准数据集，旨在评估模型在监督（独立同分布和分布外）和零样本设置下的表现，反映数据丰富和数据稀缺的场景。该基准数据集由2,027个真实剪辑和2,211个合成剪辑组成，均已标注事故发生时间、空间位置和高层次碰撞类型。我们定义了三个核心任务：（i）事故的时间定位，（ii）事故的空间定位，以及（iii）碰撞类型分类。每个任务都使用自定义指标进行评估，以考虑CCTV视频中固有的不确定性和模糊性。除了基准数据集，我们还提供了一组多样的基线，包括启发式、运动感知和视觉-语言方法，并展示了ACCIDENT的挑战性。您可以访问ACCIDENT网站： https://accidentbench.github.io

View on arXiv Download PDF AI Translation

cs.CV / 38 / 2604.09835

F3G-Avatar : Face Focused Full-body Gaussian Avatar

F3G-Avatar：面部聚焦的全身高斯头像

Menu, Willem, Akdag, Erkut, Quesado, Pedro, Kashefbahrami, Yasaman, Bondarev, Egor

Abstract

Existing full-body Gaussian avatar methods primarily optimize global reconstruction quality and often fail to preserve fine-grained facial geometry and expression details. This challenge arises from limited facial representational capacity that causes difficulties in modeling high-frequency pose-dependent deformations. To address this, we propose F3G-Avatar, a full-body, face-aware avatar synthesis method that reconstructs animatable human representations from multi-view RGB video and regressed pose/shape parameters. Starting from a clothed Momentum Human Rig (MHR) template, front/back positional maps are rendered and decoded into 3D Gaussians through a two-branch architecture: a body branch that captures pose-dependent non-rigid deformations and a face-focused deformation branch that refines head geometry and appearance. The predicted Gaussians are fused, posed with linear blend skinning (LBS), and rendered with differentiable Gaussian splatting. Training combines reconstruction and perceptual objectives with a face-specific adversarial loss to enhance realism in close-up views. Experiments demonstrate strong rendering quality, with face-view performance reaching PSNR/SSIM/LPIPS of 26.243/0.964/0.084 on the AvatarReX dataset. Ablations further highlight contributions of the MHR template and the face-focused deformation. F3G-Avatar provides a practical, high-quality pipeline for realistic, animatable full-body avatar synthesis.

Chinese Translation

现有的全身高斯头像方法主要优化全局重建质量，往往无法保留细致的面部几何形状和表情细节。这一挑战源于面部表示能力有限，导致在建模高频姿态依赖变形时遇到困难。为了解决这一问题，我们提出了F3G-Avatar，一种全身、面部感知的头像合成方法，该方法从多视角RGB视频和回归的姿态/形状参数中重建可动画的人类表示。我们从一个穿衣的动量人类骨架（Momentum Human Rig, MHR）模板开始，通过一个双分支架构渲染并解码前后位置图为3D高斯体：一个身体分支捕捉姿态依赖的非刚性变形，另一个面部聚焦变形分支则细化头部几何形状和外观。预测的高斯体被融合，通过线性混合蒙皮（Linear Blend Skinning, LBS）进行姿态调整，并通过可微分的高斯点云渲染。训练结合了重建和感知目标，并使用面部特定的对抗损失以增强近距离视图的真实感。实验表明渲染质量强劲，面部视图性能在AvatarReX数据集上达到PSNR/SSIM/LPIPS分别为26.243/0.964/0.084。消融实验进一步突出了MHR模板和面部聚焦变形的贡献。F3G-Avatar提供了一种实用的高质量管道，用于逼真、可动画的全身头像合成。

View on arXiv Download PDF AI Translation

cs.CV / 39 / 2604.09838

Vector Field Synthesis with Sparse Streamlines Using Diffusion Model

基于扩散模型的稀疏流线矢量场合成

Phan, Nguyen K., Morales, Ricardo, Espriella, Sebastian D., Chen, Guoning

Abstract

We present a novel diffusion-based framework for synthesizing 2D vector fields from sparse, coherent inputs (i.e., streamlines) while maintaining physical plausibility. Our method employs a conditional denoising diffusion probabilistic model with classifier-free guidance, enabling progressive reconstruction that preserves both geometric and physical constraints. Experimental results demonstrate our method's ability to synthesize plausible vector fields that adhere to physical laws while maintaining fidelity to sparse input observations, outperforming traditional optimization-based approaches in terms of flexibility and physical consistency.

Chinese Translation

我们提出了一种基于扩散模型的新颖框架，用于从稀疏且连贯的输入（即流线）合成二维矢量场，同时保持物理合理性。该方法采用带有无分类器引导的条件去噪扩散概率模型，实现了渐进式重建，能够同时保持几何和物理约束。实验结果表明，我们的方法能够合成符合物理定律且忠实于稀疏输入观测的合理矢量场，在灵活性和物理一致性方面优于传统的基于优化的方法。

View on arXiv Download PDF AI Translation

cs.CV / 40 / 2604.09841

Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models

是否还有可提取的知识？医学精细调优视觉-语言模型的脆弱性证据

McLaughlin, Oliver, Shubin, Daniel, Eickhoff, Carsten, Singh, Ritambhara, Rudman, William, Golovanevsky, Michal

Abstract

Vision-language models (VLMs) are increasingly adapted through domain-specific fine-tuning, yet it remains unclear whether this improves reasoning beyond superficial visual cues, particularly in high-stakes domains like medicine. We evaluate four paired open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) across four medical imaging tasks of increasing difficulty: brain tumor, pneumonia, skin cancer, and histopathology classification. We find that performance degrades toward near-random levels as task difficulty increases, indicating limited clinical reasoning. Medical fine-tuning provides no consistent advantage, and models are highly sensitive to prompt formulation, with minor changes causing large swings in accuracy and refusal rates. To test whether closed-form VQA suppresses latent knowledge, we introduce a description-based pipeline where models generate image descriptions that a text-only model (GPT-5.1) uses for diagnosis. This recovers a limited additional signal but remains bounded by task difficulty. Analysis of vision encoder embeddings further shows that failures stem from both weak visual representations and downstream reasoning. Overall, medical VLM performance is fragile, prompt-dependent, and not reliably improved by domain-specific fine-tuning.

Chinese Translation

视觉-语言模型（VLMs）正通过领域特定的精细调优而逐渐适应，但尚不清楚这是否能在高风险领域（如医学）中改善推理能力，超越表面的视觉线索。我们评估了四个配对的开源VLM（LLaVA与LLaVA-Med；Gemma与MedGemma）在四个逐渐增加难度的医学影像任务上的表现：脑肿瘤、肺炎、皮肤癌和组织病理分类。我们发现，随着任务难度的增加，性能降级至接近随机水平，表明临床推理能力有限。医学精细调优并未提供一致的优势，模型对提示的表述高度敏感，微小的变化会导致准确率和拒绝率的大幅波动。为了测试封闭式视觉问答（VQA）是否抑制潜在知识，我们引入了一种基于描述的流程，让模型生成图像描述，供一个仅文本模型（GPT-5.1）用于诊断。这恢复了有限的额外信号，但仍受任务难度的限制。对视觉编码器嵌入的分析进一步表明，失败源于弱视觉表征和下游推理能力的不足。总体而言，医学VLM的表现脆弱、依赖提示，并且未能通过领域特定的精细调优可靠地改善。

View on arXiv Download PDF AI Translation

cs.CV / 41 / 2604.09850

Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning

无训练的物体-背景组合 T2I 通过动态空间引导和多路径剪枝

Deng, Yang, Mould, David, Rosin, Paul L., Lai, Yu-Kun

Abstract

Existing text-to-image diffusion models, while excelling at subject synthesis, exhibit a persistent foreground bias that treats the background as a passive and under-optimized byproduct. This imbalance compromises global scene coherence and constrains compositional control. To address the limitation, we propose a training-free framework that restructures diffusion sampling to explicitly account for foreground-background interactions. Our approach consists of two key components. First, Dynamic Spatial Guidance introduces a soft, time step dependent gating mechanism that modulates foreground and background attention during the diffusion process, enabling spatially balanced generation. Second, Multi-Path Pruning performs multi-path latent exploration and dynamically filters candidate trajectories using both internal attention statistics and external semantic alignment signals, retaining trajectories that better satisfy object-background constraints. We further develop a benchmark specifically designed to evaluate object-background compositionality. Extensive evaluations across multiple diffusion backbones demonstrate consistent improvements in background coherence and object-background compositional alignment.

Chinese Translation

现有的文本到图像扩散模型在主体合成方面表现出色，但存在持续的前景偏差，将背景视为被动且未优化的副产品。这种不平衡损害了全局场景的一致性，并限制了组合控制。为了解决这一局限性，我们提出了一种无训练的框架，重构扩散采样，以明确考虑前景与背景之间的相互作用。我们的方法由两个关键组件组成。首先，动态空间引导引入了一种软的、依赖时间步的门控机制，在扩散过程中调节前景和背景的注意力，实现空间平衡生成。其次，多路径剪枝执行多路径潜在探索，并使用内部注意力统计和外部语义对齐信号动态过滤候选轨迹，保留更好满足物体-背景约束的轨迹。我们进一步开发了一个专门设计的基准，以评估物体-背景的组合性。在多个扩散骨干网络上的广泛评估显示，背景一致性和物体-背景组合对齐均有持续改善。

View on arXiv Download PDF AI Translation

cs.CV / 42 / 2604.09853

Do vision models perceive illusory motion in static images like humans?

视觉模型是否像人类一样感知静态图像中的错觉运动？

Rosario, Isabella Elaine, Cheng, Fan L., Sun, Zitang, Kriegeskorte, Nikolaus

Abstract

Understanding human motion processing is essential for building reliable, human-centered computer vision systems. Although deep neural networks (DNNs) achieve strong performance in optical flow estimation, they remain less robust than humans and rely on fundamentally different computational strategies. Visual motion illusions provide a powerful probe into these mechanisms, revealing how human and machine vision align or diverge. While recent DNN-based motion models can reproduce dynamic illusions such as reverse-phi, it remains unclear whether they can perceive illusory motion in static images, exemplified by the Rotating Snakes illusion. We evaluate several representative optical flow models on Rotating Snakes and show that most fail to generate flow fields consistent with human perception. Under simulated conditions mimicking saccadic eye movements, only the human-inspired Dual-Channel model exhibits the expected rotational motion, with the closest correspondence emerging during the saccade simulation. Ablation analyses further reveal that both luminance-based and higher-order color--feature--based motion signals contribute to this behavior and that a recurrent attention mechanism is critical for integrating local cues. Our results highlight a substantial gap between current optical-flow models and human visual motion processing, and offer insights for developing future motion-estimation systems with improved correspondence to human perception and human-centric AI.

Chinese Translation

理解人类的运动处理机制对于构建可靠且以人为中心的计算机视觉系统至关重要。尽管深度神经网络（DNN）在光流估计方面表现出强大的性能，但其鲁棒性仍不及人类，并且依赖于根本不同的计算策略。视觉运动错觉为探究这些机制提供了有力工具，揭示了人类视觉与机器视觉的契合与差异。尽管基于DNN的运动模型能够再现诸如反向Phi（reverse-phi）等动态错觉，但它们是否能感知以旋转蛇（Rotating Snakes）错觉为代表的静态图像中的错觉运动仍不明确。我们评估了若干具有代表性的光流模型在旋转蛇错觉上的表现，结果显示大多数模型未能生成与人类感知一致的流场。在模拟扫视眼动的条件下，只有受人类启发的双通道（Dual-Channel）模型表现出预期的旋转运动，且在扫视模拟期间两者的对应关系最为接近。消融分析进一步表明，基于亮度的运动信号和基于更高阶颜色-特征的运动信号均对该行为有贡献，且循环注意机制对于整合局部线索至关重要。我们的结果凸显了当前光流模型与人类视觉运动处理之间存在的显著差距，并为开发未来与人类感知更为一致、以人为中心的运动估计系统提供了重要见解。

View on arXiv Download PDF AI Translation

cs.CV / 43 / 2604.09862

FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views

FF3R：基于无约束视角的前馈特征三维重建

Zhou, Chaoyi, Wang, Run, Luo, Feng, Pesé, Mert D., Fan, Zhiwen, Zhong, Yiqi, Huang, Siyu

Abstract

Recent advances in vision foundation models have revolutionized geometry reconstruction and semantic understanding. Yet, most of the existing approaches treat these capabilities in isolation, leading to redundant pipelines and compounded errors. This paper introduces FF3R, a fully annotation-free feed-forward framework that unifies geometric and semantic reasoning from unconstrained multi-view image sequences. Unlike previous methods, FF3R does not require camera poses, depth maps, or semantic labels, relying solely on rendering supervision for RGB and feature maps, establishing a scalable paradigm for unified 3D reasoning. In addition, we address two critical challenges in feedforward feature reconstruction pipelines, namely global semantic inconsistency and local structural inconsistency, through two key innovations: (i) a Token-wise Fusion Module that enriches geometry tokens with semantic context via cross-attention, and (ii) a Semantic-Geometry Mutual Boosting mechanism combining geometry-guided feature warping for global consistency with semantic-aware voxelization for local coherence. Extensive experiments on ScanNet and DL3DV-10K demonstrate FF3R's superior performance in novel-view synthesis, open-vocabulary semantic segmentation, and depth estimation, with strong generalization to in-the-wild scenarios, paving the way for embodied intelligence systems that demand both spatial and semantic understanding.

Chinese Translation

近年来，视觉基础模型的进步彻底改变了几何重建和语义理解领域。然而，大多数现有方法将这些能力孤立处理，导致流程冗余且误差叠加。本文提出了FF3R，一种完全无标注的前馈框架，能够从无约束的多视角图像序列中统一进行几何与语义推理。与以往方法不同，FF3R无需相机位姿、深度图或语义标签，仅依赖于RGB和特征图的渲染监督，建立了一个可扩展的统一三维推理范式。此外，我们针对前馈特征重建流程中的两大关键挑战——全局语义不一致性和局部结构不一致性，提出了两项核心创新：（i）通过跨注意力机制，利用Token-wise Fusion模块为几何token注入语义上下文；（ii）结合几何引导的特征扭曲实现全局一致性与语义感知体素化实现局部连贯性的语义-几何互促机制。在ScanNet和DL3DV-10K数据集上的大量实验表明，FF3R在新视角合成、开放词汇语义分割和深度估计任务中表现优异，且具备强大的野外场景泛化能力，为需要空间与语义理解的具身智能系统开辟了新路径。

View on arXiv Download PDF AI Translation

cs.CV / 44 / 2604.09863

PAS: Estimating the target accuracy before domain adaptation

PAS：在领域自适应前估计目标准确率

Diniz, Raphaella, de Faria, Jackson, Ester, Martin

Abstract

The goal of domain adaptation is to make predictions for unlabeled samples from a target domain with the help of labeled samples from a different but related source domain. The performance of domain adaptation methods is highly influenced by the choice of source domain and pre-trained feature extractor. However, the selection of source data and pre-trained model is not trivial due to the absence of a labeled validation set for the target domain and the large number of available pre-trained models. In this work, we propose PAS, a novel score designed to estimate the transferability of a source domain set and a pre-trained feature extractor to a target classification task before actually performing domain adaptation. PAS leverages the generalization power of pre-trained models and assesses source-target compatibility based on the pre-trained feature embeddings. We integrate PAS into a framework that indicates the most relevant pre-trained model and source domain among multiple candidates, thus improving target accuracy while reducing the computational overhead. Extensive experiments on image classification benchmarks demonstrate that PAS correlates strongly with actual target accuracy and consistently guides the selection of the best-performing pre-trained model and source domain for adaptation.

Chinese Translation

领域自适应的目标是利用来自不同但相关的源领域的带标签样本，对目标领域中未标记的样本进行预测。领域自适应方法的性能在很大程度上受源领域选择和预训练特征提取器的影响。然而，由于目标领域缺乏带标签的验证集且可用的预训练模型数量众多，源数据和预训练模型的选择并非易事。在本工作中，我们提出了PAS，一种新颖的评分方法，旨在在实际进行领域自适应之前，估计源领域数据集和预训练特征提取器对目标分类任务的迁移能力。PAS利用预训练模型的泛化能力，并基于预训练特征嵌入评估源领域与目标领域的兼容性。我们将PAS集成到一个框架中，该框架能够在多个候选项中指示最相关的预训练模型和源领域，从而在提升目标准确率的同时降低计算开销。在图像分类基准上的大量实验表明，PAS与实际目标准确率高度相关，并能持续指导选择表现最佳的预训练模型和源领域进行适应。

View on arXiv Download PDF AI Translation

cs.CV / 45 / 2604.09877

DINO_4D: Semantic-Aware 4D Reconstruction

DINO_4D：语义感知的四维重建

Yang, Yiru, Wu, Zhuojie, Marguet, Quentin, Singh, Nishant Kumar, Schulthess, Max

Abstract

In the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes serve as the critical bridge connecting low-level geometric sensing with high-level semantic understanding. We present DINO\_4D, introducing frozen DINOv3 features as structural priors, injecting semantic awareness into the reconstruction process to effectively suppress semantic drift during dynamic tracking. Experiments on the Point Odyssey and TUM-Dynamics benchmarks demonstrate that our method maintains the linear time complexity $O(T)$ of its predecessors while significantly improving Tracking Accuracy (APD) and Reconstruction Completeness. DINO\_4D establishes a new paradigm for constructing 4D World Models that possess both geometric precision and semantic understanding.

Chinese Translation

在计算机视觉与机器人感知的交叉领域，动态场景的四维重建作为连接低层几何感知与高层语义理解的关键桥梁。我们提出了DINO_4D，采用冻结的DINOv3特征作为结构先验，将语义感知注入重建过程，有效抑制动态跟踪中的语义漂移。在Point Odyssey和TUM-Dynamics基准测试中，实验证明我们的方法保持了与前代方法相同的线性时间复杂度$O(T)$，同时显著提升了跟踪精度（APD）和重建完整性。DINO_4D为构建兼具几何精度与语义理解的四维世界模型开辟了新范式。

View on arXiv Download PDF AI Translation

cs.CV / 46 / 2604.09879

Topo-ADV: Generating Topology-Driven Imperceptible Adversarial Point Clouds

Topo-ADV：生成基于拓扑驱动的不可察觉对抗点云

Nampoothiry, Gayathry Chandramana Krishnan, Venkatapuram, Raghuram, Ghosh, Anirban, Dutta, Ayan

Abstract

Deep neural networks for 3D point cloud understanding have achieved remarkable success in object classification and recognition, yet recent work shows that these models remain highly vulnerable to adversarial perturbations. Existing 3D attacks predominantly manipulate geometric properties such as point locations, curvature, or surface structure, implicitly assuming that preserving global shape fidelity preserves semantic content. In this work, we challenge this assumption and introduce the first topology-driven adversarial attack for point cloud deep learning. Our key insight is that the homological structure of a 3D object constitutes a previously unexplored vulnerability surface. We propose Topo-ADV, an end-to-end differentiable framework that incorporates persistent homology as an explicit optimization objective, enabling gradient-based manipulation of topological features during adversarial example generation. By embedding persistence diagrams through differentiable topological representations, our method jointly optimizes (i) a topology divergence loss that alters persistence, (ii) a misclassification objective, and (iii) geometric imperceptibility constraints that preserve visual plausibility. Experiments demonstrate that subtle topology-driven perturbations consistently achieve up to 100% attack success rates on benchmark datasets such as ModelNet40, ShapeNet Part, and ScanObjectNN using PointNet and DGCNN classifiers, while remaining geometrically indistinguishable from the original point clouds, beating state-of-the-art methods on various perceptibility metrics.

Chinese Translation

用于三维点云理解的深度神经网络在物体分类和识别方面取得了显著成功，然而近期研究表明这些模型仍然高度脆弱，易受对抗扰动影响。现有的三维攻击主要操控几何属性，如点的位置、曲率或表面结构，隐含假设保持全局形状的保真度即可保持语义内容。在本工作中，我们挑战这一假设，首次提出基于拓扑驱动的点云深度学习对抗攻击。我们的核心洞见是，三维物体的同调结构构成了一个此前未被探索的脆弱面。我们提出了Topo-ADV，一个端到端可微分的框架，将持久同调作为显式优化目标，支持在对抗样本生成过程中基于梯度的拓扑特征操控。通过可微分拓扑表示嵌入持久性图，我们的方法联合优化(i)改变持久性的拓扑散度损失，(ii)误分类目标，以及(iii)保持视觉合理性的几何不可察觉约束。实验表明，细微的基于拓扑的扰动在ModelNet40、ShapeNet Part和ScanObjectNN等基准数据集上，使用PointNet和DGCNN分类器，始终实现高达100%的攻击成功率，同时在几何上与原始点云无可区分，且在多种感知度量上优于最先进方法。

View on arXiv Download PDF AI Translation

cs.CV / 47 / 2604.09886

Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception

非典型立体估计器：结合视觉与语言的体积感知方法

Vinod, Gautham, Coburn, Bruce, Raghavan, Siddeshwar, Zhu, Fengqing

Abstract

Accurate volume estimation of objects from visual data is a long-standing challenge in computer vision with significant applications in robotics, logistics, and smart health. Existing methods often rely on complex 3D reconstruction pipelines or struggle with the ambiguity inherent in single-view images. To address these limitations, we introduce a new method that fuses implicit 3D cues from stereo vision with explicit prior knowledge from natural language text. Our approach extracts deep features from a stereo image pair and a descriptive text prompt that contains the object's class and an approximate volume, then integrates them using a simple yet effective projection layer into a unified, multi-modal representation for regression. We conduct extensive experiments on public datasets demonstrating that our text-guided approach significantly outperforms vision-only baselines. Our findings show that leveraging even simple textual priors can effectively guide the volume estimation task, paving the way for more context-aware visual measurement systems. Code: https://gitlab.com/viper-purdue/stereo-typical-estimator.

Chinese Translation

从视觉数据中准确估计物体体积是计算机视觉领域长期以来的挑战，且在机器人技术、物流和智能健康等领域具有重要应用。现有方法通常依赖复杂的三维重建流程，或难以解决单视图图像固有的歧义性。为克服这些局限性，我们提出了一种新方法，将立体视觉中的隐式三维线索与自然语言文本中的显式先验知识相融合。该方法从一对立体图像和包含物体类别及近似体积的描述性文本提示中提取深度特征，随后通过一个简单而有效的投影层将其整合为统一的多模态表示以进行回归预测。我们在公开数据集上进行了大量实验，结果表明，基于文本引导的方法显著优于仅依赖视觉的基线模型。研究发现，即使是简单的文本先验也能有效指导体积估计任务，为更具上下文感知能力的视觉测量系统开辟了新路径。代码地址：https://gitlab.com/viper-purdue/stereo-typical-estimator。

View on arXiv Download PDF AI Translation

cs.CV / 48 / 2604.09903

PointSplat: Efficient Geometry-Driven Pruning and Transformer Refinement for 3D Gaussian Splatting

PointSplat：基于几何驱动的高效修剪与变换器精炼用于3D高斯溅射

Tran, Anh Thuan, Kosecka, Jana

Abstract

3D Gaussian Splatting (3DGS) has recently unlocked real-time, high-fidelity novel view synthesis by representing scenes using explicit 3D primitives. However, traditional methods often require millions of Gaussians to capture complex scenes, leading to significant memory and storage demands. Recent approaches have addressed this issue through pruning and per-scene fine-tuning of Gaussian parameters, thereby reducing the model size while maintaining visual quality. These strategies typically rely on 2D images to compute important scores followed by scene-specific optimization. In this work, we introduce PointSplat, 3D geometry-driven prune-and-refine framework that bridges previously disjoint directions of gaussian pruning and transformer refinement. Our method includes two key components: (1) an efficient geometry-driven strategy that ranks Gaussians based solely on their 3D attributes, removing reliance on 2D images during pruning stage, and (2) a dual-branch encoder that separates, re-weights geometric and appearance to avoid feature imbalance. Extensive experiments on ScanNet++ and Replica across varying sparsity levels demonstrate that PointSplat consistently achieves competitive rendering quality and superior efficiency without additional per-scene optimization.

Chinese Translation

3D高斯溅射（3D Gaussian Splatting, 3DGS）最近通过使用显式3D原语表示场景，实现了实时、高保真的新视角合成。然而，传统方法通常需要数百万个高斯来捕捉复杂场景，导致显著的内存和存储需求。近期的方法通过修剪和每个场景的高斯参数微调来解决这一问题，从而在保持视觉质量的同时减少模型大小。这些策略通常依赖于2D图像来计算重要评分，随后进行场景特定的优化。在本研究中，我们提出了PointSplat，一种基于3D几何驱动的修剪与精炼框架，连接了高斯修剪和变换器精炼这两个先前不相关的方向。我们的方法包括两个关键组件：（1）一种高效的几何驱动策略，仅基于高斯的3D属性对其进行排序，在修剪阶段消除了对2D图像的依赖；（2）一个双分支编码器，分离并重新加权几何特征和外观特征，以避免特征不平衡。在ScanNet++和Replica上进行的广泛实验显示，PointSplat在不同稀疏级别下始终实现了具有竞争力的渲染质量和卓越的效率，而无需额外的每场景优化。

View on arXiv Download PDF AI Translation

cs.CV / 49 / 2604.09907

From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping

从无人机影像到农艺推理：用于植物表型分析的多模态大语言模型基准

Wu, Yu, Han, Guangzeng, Niang, Ibra Niang, Ravelombola, Francia, Oliveira, Maiara, Davis, Jason, Chen, Dong, Lin, Feng, Huang, Xiaolei

Abstract

To improve crop genetics, high-throughput, effective and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially in vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs, and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising 385 digital images and more than 3,000 benchmark samples spanning key plant science domains including disease, pest control, weed management, and yield. The benchmark can assess diverse capabilities including visual expertise, quantitative reasoning, and multi-step agronomic reasoning. A total of 11 state-of-the-art VLMs were evaluated. The results indicate that task-specific fine-tuning leads to substantial improvement in accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B achieving up to 78%. At the same time, gains from model scaling diminish beyond a certain capacity, generalization across soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose substantial challenges. These findings suggest that PlantXpert can serve as a foundation for assessing evidence-grounded agronomic reasoning and for advancing multimodal model development in plant science.

Chinese Translation

为了改良作物遗传学，高通量、高效且全面的表型分析是关键前提。尽管此类任务传统上依赖人工完成，近年来多模态基础模型，尤其是视觉-语言模型（VLMs）的进步，使得表型分析更加自动化且稳健。然而，植物科学作为一个特殊领域，对基础模型提出了更高要求，需具备领域专属知识、细粒度视觉解读能力以及复杂的生物学和农艺推理能力。为填补这一空白，我们开发了PlantXpert——一个基于证据的多模态推理基准，专注于大豆和棉花的表型分析。该基准提供了一个结构化且可复现的框架，用于VLMs的农艺适配，并支持基础模型与领域适配模型之间的受控比较。我们构建了包含385张数字图像和超过3000个基准样本的数据集，涵盖植物科学的关键领域，包括病害、害虫防治、杂草管理及产量评估。该基准可评估视觉专业能力、定量推理及多步骤农艺推理等多样化能力。共评测了11款最先进的VLMs，结果显示，针对特定任务的微调显著提升了准确率，其中Qwen3-VL-4B和Qwen3-VL-30B等模型的准确率最高达78%。与此同时，模型规模扩展带来的收益在达到一定容量后趋于减弱，大豆与棉花间的泛化能力仍不均衡，定量及基于生物学的推理依然面临重大挑战。上述发现表明，PlantXpert可作为评估基于证据的农艺推理能力的基础，并推动植物科学中多模态模型的发展。

View on arXiv Download PDF AI Translation

cs.CV / 50 / 2604.09920

Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection

你的视觉基础模型（VFM）会“说植物语”吗？视觉基础模型在目标检测中的植物语法研究

Lundqvist, Lars, Ranario, Earl, Kamangir, Hamid, Yun, Heesup, Diepenbrock, Christine, Bailey, Brian N., Earles, J. Mason

Abstract

Vision foundation models (VFMs) offer the promise of zero-shot object detection without task-specific training data, yet their performance in complex agricultural scenes remains highly sensitive to text prompt construction. We present a systematic prompt optimization framework evaluating four open-vocabulary detectors -- YOLO World, SAM3, Grounding DINO, and OWLv2 -- for cowpea flower and pod detection across synthetic and real field imagery. We decompose prompts into eight axes and conduct one-factor-at-a-time analysis followed by combinatorial optimization, revealing that models respond divergently to prompt structure: conditions that optimize one architecture can collapse another. Applying model-specific combinatorial prompts yields substantial gains over a naive species-name baseline, including +0.357 [email protected] for YOLO World and +0.362 [email protected] for OWLv2 on synthetic cowpea flower data. To evaluate cross-task generalization, we use an LLM to translate the discovered axis structure to a morphologically distinct target -- cowpea pods -- and compare against prompting using the discovered optimal structures from synthetic flower data. Crucially, prompt structures optimized exclusively on synthetic data transfer effectively to real-world fields: synthetic-pipeline prompts match or exceed those discovered on labeled real data for the majority of model-object combinations (flower: 0.374 vs. 0.353 for YOLO World; pod: 0.429 vs. 0.371 for SAM3). Our findings demonstrate that prompt engineering can substantially close the gap between zero-shot VFMs and supervised detectors without requiring manual annotation, and that optimal prompts are model-specific, non-obvious, and transferable across domains.

Chinese Translation

视觉基础模型（VFMs）承诺实现无需特定任务训练数据的零样本目标检测，然而其在复杂农业场景中的表现对文本提示构建极为敏感。本文提出了一种系统的提示优化框架，评估了四种开放词汇检测器——YOLO World、SAM3、Grounding DINO 和 OWLv2——在合成及真实田间图像中对豇豆花和豆荚的检测效果。我们将提示分解为八个维度，采用单因素分析结合组合优化，揭示模型对提示结构的响应存在显著差异：优化一种架构的条件可能导致另一种架构性能崩溃。针对各模型应用特定的组合提示，相较于简单的物种名称基线，性能显著提升，包括YOLO World在合成豇豆花数据上的[email protected]提升0.357，OWLv2提升0.362。为评估跨任务泛化能力，我们利用大型语言模型（LLM）将发现的提示维度结构迁移至形态差异显著的目标——豇豆豆荚，并与使用合成花数据中发现的最优提示结构进行比较。关键的是，仅在合成数据上优化的提示结构能有效迁移至真实田间：合成流程提示在大多数模型-目标组合中匹配或超越了基于标注真实数据发现的提示（花：YOLO World为0.374 vs. 0.353；豆荚：SAM3为0.429 vs. 0.371）。我们的研究表明，提示工程能够显著缩小零样本视觉基础模型与监督检测器之间的差距，无需人工标注，且最优提示具有模型特异性、非显而易见性及跨领域可迁移性。

View on arXiv Download PDF AI Translation

cs.CV / 51 / 2604.09927

BLPR: Robust License Plate Recognition under Viewpoint and Illumination Variations via Confidence-Driven VLM Fallback

BLPR：基于置信度驱动视觉语言模型回退的视角与光照变化下鲁棒车牌识别

Banegas, Guillermo Auza, Vera, Diego Calvimontes, Sandoval, Sergio Castro, Peredo, Natalia Condori, Salcedo, Edwin

Abstract

Robust license plate recognition in unconstrained environments remains a significant challenge, particularly in underrepresented regions with limited data availability and unique visual characteristics, such as Bolivia. Recognition accuracy in real-world conditions is often degraded by factors such as illumination changes and viewpoint distortion. To address these challenges, we introduce BLPR, a novel deep learning-based License Plate Detection and Recognition (LPDR) framework specifically designed for Bolivian license plates. The proposed system follows a two-stage pipeline where a YOLO-based detector is pretrained on synthetic data generated in Blender to simulate extreme perspectives and lighting conditions, and subsequently fine-tuned on street-level data collected in La Paz, Bolivia. Detected plates are geometrically rectified and passed to a character recognition model. To improve robustness under ambiguous scenarios, a lightweight vision-language model (Gemma3 4B) is selectively triggered as a confidence-based fallback mechanism. The proposed framework further leverages synthetic-to-real domain adaptation to improve robustness under diverse real-world conditions. We also introduce the first publicly available Bolivian LPDR dataset, enabling evaluation under diverse viewpoint and illumination conditions. The system achieves a character-level recognition accuracy of 89.6% on real-world data, demonstrating its effectiveness for deployment in challenging urban environments. Our project is publicly available at https://github.com/EdwinTSalcedo/BLPR.

Chinese Translation

在非受控环境下实现鲁棒的车牌识别仍然是一项重大挑战，尤其是在数据有限且具有独特视觉特征的欠代表性地区，如玻利维亚。现实环境中的识别准确率常因光照变化和视角畸变等因素而下降。为应对这些挑战，我们提出了BLPR，一种专为玻利维亚车牌设计的新型基于深度学习的车牌检测与识别（LPDR）框架。该系统采用两阶段流程，首先基于YOLO的检测器在Blender生成的合成数据上预训练，以模拟极端视角和光照条件，随后在玻利维亚拉巴斯采集的街景数据上进行微调。检测到的车牌经过几何校正后传递给字符识别模型。为提升在模糊场景下的鲁棒性，系统选择性地触发轻量级视觉语言模型（Gemma3 4B）作为基于置信度的回退机制。该框架还利用合成到真实域的适应技术，增强在多样真实环境下的鲁棒性。我们同时发布了首个公开的玻利维亚LPDR数据集，支持在多视角和光照条件下的评估。系统在真实数据上的字符级识别准确率达到89.6%，展示了其在复杂城市环境中部署的有效性。项目地址：https://github.com/EdwinTSalcedo/BLPR。

View on arXiv Download PDF AI Translation

cs.CV / 52 / 2604.09942

I Walk the Line: Examining the Role of Gestalt Continuity in Object Binding for Vision Transformers

我走在这条线上：考察格式塔连续性在视觉变换器对象绑定中的作用

Tartaglini, Alexa R., Lepori, Michael A.

Abstract

Object binding is a foundational process in visual cognition, during which low-level perceptual features are joined into object representations. Binding has been considered a fundamental challenge for neural networks, and a major milestone on the way to artificial models with flexible visual intelligence. Recently, several investigations have demonstrated evidence that binding mechanisms emerge in pretrained vision models, enabling them to associate portions of an image that contain an object. The question remains: how are these models binding objects together? In this work, we investigate whether vision models rely on the principle of Gestalt continuity to perform object binding, over and above other principles like similarity and proximity. Using synthetic datasets, we demonstrate that binding probes are sensitive to continuity across a wide range of pretrained vision transformers. Next, we uncover particular attention heads that track continuity, and show that these heads generalize across datasets. Finally, we ablate these attention heads, and show that they often contribute to producing representations that encode object binding.

Chinese Translation

对象绑定是视觉认知中的一个基础过程，在此过程中，低级感知特征被结合成对象表征。绑定被认为是神经网络面临的一个基本挑战，也是通往具有灵活视觉智能的人工模型的重要里程碑。最近，几项研究表明，预训练视觉模型中出现了绑定机制，使其能够关联包含对象的图像部分。问题仍然存在：这些模型是如何将对象绑定在一起的？在本研究中，我们探讨视觉模型是否依赖于格式塔连续性原则来执行对象绑定，而不仅仅是其他原则，如相似性和接近性。通过使用合成数据集，我们展示了绑定探针对广泛的预训练视觉变换器的连续性敏感性。接下来，我们发现了特定的注意力头，它们跟踪连续性，并表明这些头在不同数据集之间具有良好的泛化能力。最后，我们对这些注意力头进行了消融实验，结果表明它们通常有助于生成编码对象绑定的表征。

View on arXiv Download PDF AI Translation

cs.CV / 53 / 2604.09945

Cross-Cultural Value Awareness in Large Vision-Language Models

大型视觉语言模型中的跨文化价值意识

Howard, Phillip, Su, Xin, Fraser, Kathleen C.

Abstract

The rapid adoption of large vision-language models (LVLMs) in recent years has been accompanied by growing fairness concerns due to their propensity to reinforce harmful societal stereotypes. While significant attention has been paid to such fairness concerns in the context of social biases, relatively little prior work has examined the presence of stereotypes in LVLMs related to cultural contexts such as religion, nationality, and socioeconomic status. In this work, we aim to narrow this gap by investigating how cultural contexts depicted in images influence the judgments LVLMs make about a person's moral, ethical, and political values. We conduct a multi-dimensional analysis of such value judgments in five popular LVLMs using counterfactual image sets, which depict the same person across different cultural contexts. Our evaluation framework diagnoses LVLM awareness of cultural value differences through the use of Moral Foundations Theory, lexical analyses, and the sensitivity of generated values to depicted cultural contexts.

Chinese Translation

近年来，大型视觉语言模型（LVLMs）的快速采用伴随着对其可能强化有害社会刻板印象的公平性担忧的增加。尽管在社会偏见的背景下，针对这些公平性问题的关注已相当显著，但之前的研究相对较少关注与文化背景（如宗教、国籍和社会经济地位）相关的刻板印象在LVLMs中的存在。在本研究中，我们旨在填补这一空白，探讨图像中描绘的文化背景如何影响LVLMs对个人道德、伦理和政治价值的判断。我们使用反事实图像集对五个流行的LVLMs进行多维度分析，这些图像集描绘了同一个人在不同文化背景下的形象。我们的评估框架通过使用道德基础理论（Moral Foundations Theory）、词汇分析和生成价值对描绘文化背景的敏感性，诊断LVLM对文化价值差异的意识。

View on arXiv Download PDF AI Translation

cs.CV / 54 / 2604.09948

Unmixing-Guided Spatial-Spectral Mamba with Clustering Tokens for Hyperspectral Image Classification

基于解混导向的空间-光谱Mamba与聚类令牌用于高光谱图像分类

Zhu, Yimin, Xu, Lincoln Linlin

Abstract

Although hyperspectral image (HSI) classification is critical for supporting various environmental applications, it is a challenging task due to the spectral-mixture effect, the spatial-spectral heterogeneity and the difficulty to preserve class boundaries and details. This letter presents a novel unmixing-guided spatial-spectral Mamba with clustering tokens for improved HSI classification, with the following contributions. First, to disentangle the spectral mixture effect in HSI for improved pattern discovery, we design a novel spectral unmixing network that not only automatically learns endmembers and abundance maps from HSI but also accounts for endmember variabilities. Second, to generate Mamba token sequences, based on the clusters defined by abundance maps, we design an efficient Top-\textit{K} token selection strategy to adaptively sequence the tokens for improved representational capability. Third, to improve spatial-spectral feature learning and detail preservation, based on the Top-\textit{K} token sequences, we design a novel unmixing-guided spatial-spectral Mamba module that greatly improves traditional Mamba models in terms of token learning and sequencing. Fourth, to learn simultaneously the endmember-abundance patterns and classification labels, a multi-task scheme is designed for model supervision, leading to a new unmixing-classification framework that outputs not only accurate classification maps but also a comprehensive spectral-library and abundance maps. Comparative experiments on four HSI datasets demonstrate that our model can greatly outperform the other state-of-the-art approaches. Code is available at https://github.com/GSIL-UCalgary/Unmixing_guided_Mamba.git

Chinese Translation

尽管高光谱图像（HSI）分类对支持各种环境应用至关重要，但由于光谱混合效应、空间-光谱异质性以及难以保持类别边界和细节，这一任务仍然具有挑战性。本文提出了一种新颖的基于解混导向的空间-光谱Mamba与聚类令牌，以改善HSI分类，主要贡献如下。首先，为了揭示HSI中的光谱混合效应以改善模式发现，我们设计了一种新颖的光谱解混网络，该网络不仅能够自动学习HSI中的端元和丰度图，还考虑了端元的变异性。其次，为了生成Mamba令牌序列，我们基于丰度图定义的聚类设计了一种高效的Top-K令牌选择策略，以自适应地排列令牌，从而提高表示能力。第三，为了改善空间-光谱特征学习和细节保留，我们基于Top-K令牌序列设计了一种新颖的解混导向空间-光谱Mamba模块，在令牌学习和排序方面显著提升了传统Mamba模型的性能。第四，为了同时学习端元-丰度模式和分类标签，我们设计了一种多任务方案用于模型监督，从而形成一个新的解混-分类框架，该框架不仅输出准确的分类图，还提供全面的光谱库和丰度图。在四个HSI数据集上的对比实验表明，我们的模型在性能上显著优于其他最先进的方法。代码可在 https://github.com/GSIL-UCalgary/Unmixing_guided_Mamba.git 获取。

View on arXiv Download PDF AI Translation

cs.CV / 55 / 2604.09955

Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

可学习的运动聚焦标记化方法用于有效且高效的视频无监督领域适应

Liu, Tzu Ling, Stavness, Ian, Rochan, Mrigank

Abstract

Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.

Chinese Translation

视频无监督领域适应（VUDA）在动作识别中面临重大挑战，需要将模型从标记的源领域适应到未标记的目标领域。尽管近期取得了一些进展，现有的VUDA方法往往无法达到完全监督的性能，主要原因在于静态和无信息背景的普遍存在，这加剧了领域转移。此外，以往的方法在很大程度上忽视了计算效率，限制了其在实际中的应用。为了解决这些问题，我们提出了可学习的运动聚焦标记化方法（LMFT）用于VUDA。LMFT将视频帧标记化为补丁标记，并学习丢弃低运动、冗余的标记，这些标记主要对应于背景区域，同时保留运动丰富、与动作相关的标记以便于适应。在21个领域适应设置下的三个标准VUDA基准上进行的广泛实验表明，我们的VUDA框架结合LMFT实现了最先进的性能，同时显著降低了计算开销。因此，LMFT使得VUDA在有效性和计算效率上均得以提升。

View on arXiv Download PDF AI Translation

cs.CV / 56 / 2604.09985

YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection

YUV20K：一种基于复杂度驱动的基准测试和轨迹感知对齐模型用于视频伪装物体检测

Liu, Yiyu, Ye, Shuo, Hao, Chao, Yu, Zitong

Abstract

Video Camouflaged Object Detection (VCOD) is currently constrained by the scarcity of challenging benchmarks and the limited robustness of models against erratic motion dynamics. Existing methods often struggle with Motion-Induced Appearance Instability and Temporal Feature Misalignment caused by complex motion scenarios. To address the data bottleneck, we present YUV20K, a pixel-level annoated complexity-driven VCOD benchmark. Comprising 24,295 annotated frames across 91 scenes and 47 kinds of species, it specifically targets challenging scenarios like large-displacement motion, camera motion and other 4 types scenarios. On the methodological front, we propose a novel framework featuring two key modules: Motion Feature Stabilization (MFS) and Trajectory-Aware Alignment (TAA). The MFS module utilizes frame-agnostic Semantic Basis Primitives to stablize features, while the TAA module leverages trajectory-guided deformable sampling to ensure precise temporal alignment. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art competitors on existing datasets and establishes a new baseline on the challenging YUV20K. Notably, our framework exhibits superior cross-domain generalization and robustness when confronting complex spatiotemporal scenarios. Our code and dataset will be available at https://github.com/K1NSA/YUV20K

Chinese Translation

视频伪装物体检测（VCOD）目前受到挑战性基准稀缺和模型对不稳定运动动态的鲁棒性有限的制约。现有方法常常在复杂运动场景中面临运动引起的外观不稳定性和时间特征错位的问题。为了解决数据瓶颈，我们提出了YUV20K，这是一个像素级注释的复杂度驱动的VCOD基准。该基准包含来自91个场景和47种物种的24,295帧注释图像，特别针对大位移运动、摄像机运动及其他4种类型的挑战场景。在方法论方面，我们提出了一个新颖的框架，包含两个关键模块：运动特征稳定化（MFS）和轨迹感知对齐（TAA）。MFS模块利用与帧无关的语义基础原语来稳定特征，而TAA模块则利用轨迹引导的可变形采样来确保精确的时间对齐。大量实验表明，我们的方法在现有数据集上显著优于最先进的竞争对手，并在具有挑战性的YUV20K上建立了新的基准。值得注意的是，我们的框架在面对复杂时空场景时表现出优越的跨域泛化能力和鲁棒性。我们的代码和数据集将发布在 https://github.com/K1NSA/YUV20K

View on arXiv Download PDF AI Translation

cs.CV / 57 / 2604.09989

FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation

FlowPalm：基于光流驱动的非刚性变形用于几何多样性掌纹生成

Zou, Yuchen, Shao, Huikai, Fang, Lihuang, Xiong, Zhipeng, Zhong, Dexing

Abstract

Recently, synthetic palmprints have been increasingly used as substitutes for real data to train recognition models. To be effective, such synthetic data must reflect the diversity of real palmprints, including both style variation and geometric variation. However, existing palmprint generation methods mainly focus on style translation, while geometric variation is either ignored or approximated by simple handcrafted augmentations. In this work, we propose FlowPalm, an optical-flow-driven palmprint generation framework capable of simulating the complex non-rigid deformations observed in real palms. Specifically, FlowPalm estimates optical flows between real palmprint pairs to capture the statistical patterns of geometric deformations. Building on these priors, we design a progressive sampling process that gradually introduces the geometric deformations during diffusion while maintaining identity consistency. Extensive experiments on six benchmark datasets demonstrate that FlowPalm significantly outperforms state-of-the-art palmprint generation approaches in downstream recognition tasks. Project page: https://yuchenzou.github.io/FlowPalm/

Chinese Translation

近年来，合成掌纹作为真实数据的替代品被越来越多地用于训练识别模型。为了有效，这类合成数据必须反映真实掌纹的多样性，包括风格变化和几何变化。然而，现有的掌纹生成方法主要侧重于风格转换，而几何变化要么被忽略，要么仅通过简单的手工增强进行近似。在本工作中，我们提出了FlowPalm，一种基于光流驱动的掌纹生成框架，能够模拟真实掌纹中观察到的复杂非刚性变形。具体而言，FlowPalm通过估计真实掌纹对之间的光流，捕捉几何变形的统计模式。在此先验基础上，我们设计了一个渐进采样过程，在扩散过程中逐步引入几何变形，同时保持身份一致性。在六个基准数据集上的大量实验表明，FlowPalm在下游识别任务中显著优于现有最先进的掌纹生成方法。项目主页：https://yuchenzou.github.io/FlowPalm/

View on arXiv Download PDF AI Translation

cs.CV / 58 / 2604.09990

Gait Recognition with Temporal Kolmogorov-Arnold Networks

基于时间的科尔莫戈罗夫-阿诺德网络的人体步态识别

Asad, Mohammed, Vishwakarma, Dinesh Kumar

Abstract

Gait recognition is a biometric modality that identifies individuals from their characteristic walking patterns. Unlike conventional biometric traits, gait can be acquired at a distance and without active subject cooperation, making it suitable for surveillance and public safety applications. Nevertheless, silhouette-based temporal models remain sensitive to long sequences, observation noise, and appearance-related covariates. Recurrent architectures often struggle to preserve information from earlier frames and are inherently sequential to optimize, whereas transformer-based models typically require greater computational resources and larger training sets and may be sensitive to irregular sequence lengths and noisy inputs. These limitations reduce robustness under clothing variation, carrying conditions, and view changes, while also hindering the joint modeling of local gait cycles and longer-term motion trends. To address these challenges, we introduce a Temporal Kolmogorov-Arnold Network (TKAN) for gait recognition. The proposed model replaces fixed edge weights with learnable one-dimensional functions and incorporates a two-level memory mechanism consisting of short-term RKAN sublayers and a gated long-term pathway. This design enables efficient modeling of both cycle-level dynamics and broader temporal context while maintaining a compact backbone. Experiments on the CASIA-B dataset indicate that the proposed CNN+TKAN framework achieves strong recognition performance under the reported evaluation setting.

Chinese Translation

步态识别是一种生物特征识别方式，通过个体特有的行走模式来识别个人。与传统的生物特征不同，步态可以在远距离下获取，并且无需被试者的主动配合，这使其适用于监控和公共安全应用。然而，基于轮廓的时间模型对长序列、观察噪声和外观相关的协变量仍然敏感。递归架构通常难以保留早期帧的信息，并且在优化时本质上是顺序的，而基于变换器的模型通常需要更大的计算资源和更大的训练集，并且可能对不规则的序列长度和噪声输入敏感。这些限制在服装变化、携带条件和视角变化下降低了鲁棒性，同时也妨碍了局部步态周期与长期运动趋势的联合建模。为了解决这些挑战，我们提出了一种用于步态识别的时间科尔莫戈罗夫-阿诺德网络（TKAN）。所提出的模型用可学习的一维函数替代了固定的边权重，并结合了由短期RKAN子层和门控长期路径组成的两级记忆机制。这一设计能够有效建模循环级动态和更广泛的时间上下文，同时保持紧凑的骨干网络。在CASIA-B数据集上的实验表明，所提出的CNN+TKAN框架在报告的评估设置下实现了强大的识别性能。

View on arXiv Download PDF AI Translation

cs.CV / 59 / 2604.09991

Revisiting the Scale Loss Function and Gaussian-Shape Convolution for Infrared Small Target Detection

重新审视红外小目标检测中的尺度损失函数和高斯形状卷积

Li, Hao, Zhuo, Man Fung

Abstract

Infrared small target detection still faces two persistent challenges: training instability from non-monotonic scale loss functions, and inadequate spatial attention due to generic convolution kernels that ignore the physical imaging characteristics of small targets. In this paper, we revisit both aspects. For the loss side, we propose a \emph{diff-based scale loss} that weights predictions according to the signed area difference between the predicted mask and the ground truth, yielding strictly monotonic gradients and stable convergence. We further analyze a family of four scale loss variants to understand how their geometric properties affect detection behavior. For the spatial side, we introduce \emph{Gaussian-shaped convolution} with a learnable scale parameter to match the center-concentrated intensity profile of infrared small targets, and augment it with a \emph{rotated pinwheel mask} that adaptively aligns the kernel with target orientation via a straight-through estimator. Extensive experiments on IRSTD-1k, NUDT-SIRST, and SIRST-UAVB demonstrate consistent improvements in $mIoU$, $P_d$, and $F_a$ over state-of-the-art methods. We release our anonymous code and pretrained models.

Chinese Translation

红外小目标检测仍面临两个持续的挑战：来自非单调尺度损失函数的训练不稳定性，以及由于通用卷积核忽视小目标的物理成像特性而导致的空间注意力不足。本文重新审视了这两个方面。在损失方面，我们提出了一种基于差异的尺度损失（diff-based scale loss），根据预测掩膜与真实值之间的符号面积差异对预测进行加权，从而产生严格单调的梯度和稳定的收敛性。我们进一步分析了四种尺度损失变体，以理解其几何特性如何影响检测行为。在空间方面，我们引入了具有可学习尺度参数的高斯形状卷积（Gaussian-shaped convolution），以匹配红外小目标的中心集中的强度分布，并通过旋转涡轮掩膜（rotated pinwheel mask）增强其效果，该掩膜通过直通估计器自适应地将卷积核与目标方向对齐。在IRSTD-1k、NUDT-SIRST和SIRST-UAVB上的大量实验表明，我们的方法在$mIoU$、$P_d$和$F_a$方面相较于最新的技术方法有了一致的提升。我们将发布我们的匿名代码和预训练模型。

View on arXiv Download PDF AI Translation

cs.CV / 60 / 2604.09996

A Comparative Study of Modern Object Detectors for Robust Apple Detection in Orchard Imagery

现代物体检测器在果园图像中稳健苹果检测的比较研究

Asad, Mohammed, Gautam, Ajai Kumar, Dhiman, Priyanshu, Prajapati, Rishi Raj

Abstract

Accurate apple detection in orchard images is important for yield prediction, fruit counting, robotic harvesting, and crop monitoring. However, changing illumination, leaf clutter, dense fruit clusters, and partial occlusion make detection difficult. To provide a fair and reproducible comparison, this study establishes a controlled benchmark for single-class apple detection on the public AppleBBCH81 dataset using one deterministic train, validation, and test split and a unified evaluation protocol across six representative detectors: YOLOv10n, YOLO11n, RT-DETR-L, Faster R-CNN (ResNet50-FPN), FCOS (ResNet50-FPN), and SSDLite320 (MobileNetV3-Large). Performance is evaluated primarily using COCO-style [email protected] and [email protected]:0.95, and threshold-dependent behavior is further analyzed using precision-recall curves and fixed-threshold precision, recall, and F1-score at IoU = 0.5. On the validation split, YOLO11n achieves the best strict localization performance with [email protected]:0.95 = 0.6065 and [email protected] = 0.9620, followed closely by RT-DETR-L and YOLOv10n. At a fixed operating point with confidence >= 0.05, YOLOv10n attains the highest F1-score, whereas RT-DETR-L achieves very high recall but low precision because of many false positives at low confidence. These findings show that detector selection for orchard deployment should be guided not only by localization-aware accuracy but also by threshold robustness and the requirements of the downstream task.

Chinese Translation

在果园图像中准确检测苹果对于产量预测、果实计数、机器人采摘和作物监测至关重要。然而，光照变化、叶片杂乱、果实密集聚集和部分遮挡使得检测变得困难。为了提供公平且可重复的比较，本研究在公共的AppleBBCH81数据集上建立了一个单类苹果检测的受控基准，采用一个确定性的训练、验证和测试划分，并在六个具有代表性的检测器（YOLOv10n、YOLO11n、RT-DETR-L、Faster R-CNN（ResNet50-FPN）、FCOS（ResNet50-FPN）和SSDLite320（MobileNetV3-Large））之间统一评估协议。性能主要通过COCO风格的[email protected]和[email protected]:0.95进行评估，并进一步通过精确率-召回曲线以及在IoU = 0.5时的固定阈值精确率、召回率和F1分数分析阈值依赖行为。在验证集上，YOLO11n以[email protected]:0.95 = 0.6065和[email protected] = 0.9620实现了最佳的严格定位性能，紧随其后的是RT-DETR-L和YOLOv10n。在固定操作点下，当置信度>= 0.05时，YOLOv10n获得了最高的F1分数，而RT-DETR-L则因低置信度下的许多假阳性而实现了非常高的召回率但低精确率。这些发现表明，果园部署的检测器选择不仅应考虑定位精度，还应考虑阈值的鲁棒性和下游任务的要求。

View on arXiv Download PDF AI Translation

cs.CV / 61 / 2604.09999

GIF: A Conditional Multimodal Generative Framework for IR Drop Imaging in Chip Layouts

GIF：一种用于芯片布局中IR压降成像的条件多模态生成框架

Thorat, Kiran, Meng, Nicole, Karami, Mostafa, Ding, Caiwen, Lao, Yingjie, Shi, Zhijie Jerry

Abstract

IR drop analysis is essential in physical chip design to ensure the power integrity of on-chip power delivery networks. Traditional Electronic Design Automation (EDA) tools have become slow and expensive as transistor density scales. Recent works have introduced machine learning (ML)-based methods that formulate IR drop analysis as an image prediction problem. These existing ML approaches fail to capture both local and long-range dependencies and ignore crucial geometrical and topological information from physical layouts and logical connectivity. To address these limitations, we propose GIF, a Generative IR drop Framework that uses both geometrical and topological information to generate IR drop images. GIF fuses image and graph features to guide a conditional diffusion process, producing high-quality IR drop images. For instance, On the CircuitNet-N28 dataset, GIF achieves 0.78 SSIM, 0.95 Pearson correlation, 21.77 PSNR, and 0.026 NMAE, outperforming prior methods. These results demonstrate that our framework, using diffusion based multimodal conditioning, reliably generates high quality IR drop images. This shows that IR drop analysis can effectively leverage recent advances in generative modeling when geometric layout features and logical circuit topology are jointly modeled. By combining geometry aware spatial features with logical graph representations, GIF enables IR drop analysis to benefit from recent advances in generative modeling for structured image generation.

Chinese Translation

IR压降分析在物理芯片设计中至关重要，以确保片上电源传输网络的电源完整性。随着晶体管密度的提升，传统的电子设计自动化（EDA）工具变得缓慢且成本高昂。近期研究引入了基于机器学习（ML）的方法，将IR压降分析表述为图像预测问题。然而，现有的机器学习方法未能同时捕捉局部和远程依赖关系，且忽视了物理布局的几何和拓扑信息以及逻辑连通性的关键特征。为解决这些局限性，我们提出了GIF（一种生成式IR压降框架），该框架利用几何和拓扑信息生成IR压降图像。GIF融合图像特征与图结构特征，引导条件扩散过程，从而生成高质量的IR压降图像。例如，在CircuitNet-N28数据集上，GIF实现了0.78的结构相似性指数（SSIM）、0.95的皮尔逊相关系数、21.77的峰值信噪比（PSNR）和0.026的归一化平均绝对误差（NMAE），优于以往方法。实验结果表明，我们的框架通过基于扩散的多模态条件生成，能够可靠地产生高质量的IR压降图像。这表明，当几何布局特征与逻辑电路拓扑联合建模时，IR压降分析能够有效利用生成模型的最新进展。通过结合几何感知的空间特征与逻辑图表示，GIF使IR压降分析能够受益于结构化图像生成领域的生成建模最新成果。

View on arXiv Download PDF AI Translation

cs.CV / 62 / 2604.10000

SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation

SwinTextUNet：将基于CLIP的文本引导整合入Swin Transformer U-Net用于医学图像分割

Yeafi, Ashfak, Goswami, Parthaw, Islam, Md Khairul, Shamme, Ashifa Islam

Abstract

Precise medical image segmentation is fundamental for enabling computer aided diagnosis and effective treatment planning. Traditional models that rely solely on visual features often struggle when confronted with ambiguous or low contrast patterns. To overcome these limitations, we introduce SwinTextUNet, a multimodal segmentation framework that incorporates Contrastive Language Image Pretraining (CLIP), derived textual embeddings into a Swin Transformer UNet backbone. By integrating cross attention and convolutional fusion, the model effectively aligns semantic text guidance with hierarchical visual representations, enhancing robustness and accuracy. We evaluate our approach on the QaTaCOV19 dataset, where the proposed four stage variant achieves an optimal balance between performance and complexity, yielding Dice and IoU scores of 86.47% and 78.2%, respectively. Ablation studies further validate the importance of text guidance and multimodal fusion. These findings underscore the promise of vision language integration in advancing medical image segmentation and supporting clinically meaningful diagnostic tools.

Chinese Translation

精准的医学图像分割是实现计算机辅助诊断和有效治疗规划的基础。传统仅依赖视觉特征的模型在面对模糊或低对比度图案时常常表现不佳。为克服这些局限性，我们提出了SwinTextUNet，一种多模态分割框架，将对比语言图像预训练（Contrastive Language Image Pretraining，CLIP）生成的文本嵌入整合到Swin Transformer U-Net骨干网络中。通过融合交叉注意力和卷积融合机制，模型有效地将语义文本引导与层次化视觉表示对齐，提升了鲁棒性和准确性。我们在QaTaCOV19数据集上评估了该方法，所提出的四阶段变体在性能与复杂度之间实现了最佳平衡，分别取得了86.47%的Dice系数和78.2%的IoU分数。消融实验进一步验证了文本引导和多模态融合的重要性。这些结果凸显了视觉语言融合在推进医学图像分割及支持临床诊断工具方面的潜力。

View on arXiv Download PDF AI Translation

cs.CV / 63 / 2604.10014

Demographic and Linguistic Bias Evaluation in Omnimodal Language Models

全模态语言模型中的人口统计和语言偏见评估

Elobaid, Alaa

Abstract

This paper provides a comprehensive evaluation of demographic and linguistic biases in omnimodal language models that process text, images, audio, and video within a single framework. Although these models are being widely deployed, their performance across different demographic groups and modalities is not well studied. Four omnimodal models are evaluated on tasks that include demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification. Accuracy differences are measured across age, gender, skin tone, language, and country of origin. The results show that image and video understanding tasks generally exhibit better performance with smaller demographic disparities. In contrast, audio understanding tasks exhibit significantly lower performance and substantial bias, including large accuracy differences across age groups, genders, and languages, and frequent prediction collapse toward narrow categories. These findings highlight the importance of evaluating fairness across all supported modalities as omnimodal language models are increasingly used in real-world applications.

Chinese Translation

本文对处理文本、图像、音频和视频的全模态语言模型中的人口统计和语言偏见进行了全面评估。尽管这些模型正在广泛应用，但它们在不同人口群体和模态下的表现尚未得到充分研究。我们对四个全模态模型进行了评估，任务包括人口属性估计、身份验证、活动识别、多语言语音转录和语言识别。我们测量了不同年龄、性别、肤色、语言和原籍国之间的准确性差异。结果表明，图像和视频理解任务通常表现出较小的人口统计差异，且性能更佳。相反，音频理解任务的表现显著较低，并存在显著偏见，包括不同年龄组、性别和语言之间的准确性差异较大，以及频繁向狭窄类别的预测崩溃。这些发现强调了在全模态语言模型日益应用于现实世界时，评估所有支持模态的公平性的重要性。

View on arXiv Download PDF AI Translation

cs.CV / 64 / 2604.10017

What and Where to Adapt: Structure-Semantics Co-Tuning for Machine Vision Compression via Synergistic Adapters

适应什么与何处：通过协同适配器进行机器视觉压缩的结构-语义共同调优

Liu, Shaobo, Xiong, Haobo, Liu, Kai, Lin, Yuna

Abstract

Parameter-efficient fine-tuning of pre-trained codecs is a promising direction in image compression for human and machine vision. While most existing works have primarily focused on tuning the feature structure within the encoder-decoder backbones, the adaptation of the statistical semantics within the entropy model has received limited attention despite its function of predicting the probability distribution of latent features. Our analysis reveals that naive adapter insertion into the entropy model can lead to suboptimal outcomes, underscoring that the effectiveness of adapter-based tuning depends critically on the coordination between adapter type and placement across the compression pipeline. Therefore, we introduce Structure-Semantics Co-Tuning (S2-CoT), a novel framework that achieves this coordination via two specialized, synergistic adapters: the Structural Fidelity Adapter (SFA) and the Semantic Context Adapter (SCA). SFA is integrated into the encoder-decoder to preserve high-fidelity representations by dynamically fusing spatial and frequency information; meanwhile, the SCA adapts the entropy model to align with SFA-tuned features by refining the channel context for more efficient statistical coding. Through joint optimization, S2-CoT turns potential performance degradation into synergistic gains, achieving state-of-the-art results across four diverse base codecs with only a small fraction of trainable parameters, closely matching full fine-tuning performance. Code is available at https://github.com/Brock-bit4/S2-CoT.

Chinese Translation

预训练编解码器的参数高效微调是图像压缩在人工和机器视觉中的一个有前景的方向。尽管现有大多数研究主要集中在调整编码器-解码器骨干网络中的特征结构上，但熵模型中统计语义的适应却受到有限关注，尽管其在预测潜在特征的概率分布中发挥着重要作用。我们的分析表明，简单地将适配器插入熵模型可能导致次优结果，强调了基于适配器的调优效果在很大程度上依赖于适配器类型与压缩流程中位置之间的协调。因此，我们提出了结构-语义共同调优（Structure-Semantics Co-Tuning, S2-CoT），这是一个通过两个专门的协同适配器实现这种协调的新框架：结构保真适配器（Structural Fidelity Adapter, SFA）和语义上下文适配器（Semantic Context Adapter, SCA）。SFA被集成到编码器-解码器中，以通过动态融合空间和频率信息来保持高保真表示；与此同时，SCA则通过优化通道上下文来调整熵模型，以使其与SFA调优的特征对齐，从而实现更高效的统计编码。通过联合优化，S2-CoT将潜在的性能下降转化为协同增益，在四种不同的基础编解码器上实现了最先进的结果，仅使用少量可训练参数，性能接近完全微调的效果。代码可在 https://github.com/Brock-bit4/S2-CoT 获取。

View on arXiv Download PDF AI Translation

cs.CV / 65 / 2604.10023

FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer

FREE-Switch：基于频率的动态LoRA切换用于风格迁移

Zheng, Shenghe, Zhang, Minyu, Liu, Tianhao, Wang, Hongzhi

Abstract

With the growing availability of open-sourced adapters trained on the same diffusion backbone for diverse scenes and objects, combining these pretrained weights enables low-cost customized generation. However, most existing model merging methods are designed for classification or text generation, and when applied to image generation, they suffer from content drift due to error accumulation across multiple diffusion steps. For image-oriented methods, training-based approaches are computationally expensive and unsuitable for edge deployment, while training-free ones use uniform fusion strategies that ignore inter-adapter differences, leading to detail degradation. We find that since different adapters are specialized for generating different types of content, the contribution of each diffusion step carries different significance for each adapter. Accordingly, we propose a frequency-domain importance-driven dynamic LoRA switch method. Furthermore, we observe that maintaining semantic consistency across adapters effectively mitigates detail loss; thus, we design an automatic Generation Alignment mechanism to align generation intents at the semantic level. Experiments demonstrate that our FREE-Switch (Frequency-based Efficient and Dynamic LoRA Switch) framework efficiently combines adapters for different objects and styles, substantially reducing the training cost of high-quality customized generation.

Chinese Translation

随着针对不同场景和对象在相同扩散骨干网络上训练的开源适配器数量不断增加，结合这些预训练权重能够实现低成本的定制化生成。然而，大多数现有的模型融合方法设计用于分类或文本生成，应用于图像生成时，由于多步扩散过程中的误差累积，容易导致内容漂移。针对图像的相关方法中，基于训练的方法计算开销大且不适合边缘部署，而无训练方法采用统一融合策略，忽视了适配器间的差异，导致细节退化。我们发现，不同适配器专注于生成不同类型的内容，每个扩散步骤对各适配器的重要性存在差异。基于此，我们提出了一种基于频率域重要性驱动的动态LoRA切换方法。此外，我们观察到保持适配器间的语义一致性能够有效缓解细节丢失，因此设计了自动生成对齐机制以在语义层面对生成意图进行对齐。实验表明，我们的FREE-Switch（Frequency-based Efficient and Dynamic LoRA Switch）框架能够高效地融合不同对象和风格的适配器，大幅降低高质量定制生成的训练成本。

View on arXiv Download PDF AI Translation

cs.CV / 66 / 2604.10024

LVSum: A Benchmark for Timestamp-Aware Long Video Summarization

LVSum：一个面向时间戳感知的长视频摘要基准

Patel, Alkesh, Ozyildirim, Melis, Cheng, Ying-Chang, Nagarajan, Ganesh

Abstract

Long video summarization presents significant challenges for current multimodal large language models (MLLMs), particularly in maintaining temporal fidelity over extended durations and producing summaries that are both semantically and temporally grounded. In this work, we present LVSum, a human-annotated benchmark designed specifically for evaluating long video summarization with fine-grained temporal alignment. LVSum comprises diverse long-form videos across 13 domains, each paired with human-generated summaries containing precise temporal references. We conduct a comprehensive evaluation of both proprietary and open-source MLLMs on LVSum, assessing performance using newly introduced LLM-based metrics for content relevance and modality coherence, alongside standard evaluation metrics. Our experiments reveal systematic gaps in temporal understanding among existing MLLMs and offer insights that establish a new foundation for advancing temporal reasoning in long video summarization.

Chinese Translation

长视频摘要对当前多模态大语言模型（MLLMs）提出了重大挑战，尤其是在保持长时间跨度的时间一致性以及生成语义和时间上均有依据的摘要方面。本文提出了LVSum，一个专门设计用于评估具有细粒度时间对齐的长视频摘要的人类标注基准。LVSum涵盖了13个领域的多样化长视频，每个视频均配有人类生成的包含精确时间参考的摘要。我们在LVSum上对专有和开源的MLLMs进行了全面评估，采用新引入的基于LLM的内容相关性和模态一致性指标，以及标准评估指标。实验结果揭示了现有MLLMs在时间理解方面的系统性不足，并提供了促进长视频摘要时间推理发展的新见解和基础。

View on arXiv Download PDF AI Translation

cs.CV / 67 / 2604.10027

SinkTrack: Attention Sink based Context Anchoring for Large Language Models

SinkTrack：基于Attention Sink的上下文锚定方法用于大型语言模型

Liu, Xu, Chen, Guikun, Wang, Wenguan

Abstract

Large language models (LLMs) suffer from hallucination and context forgetting. Prior studies suggest that attention drift is a primary cause of these problems, where LLMs' focus shifts towards newly generated tokens and away from the initial input context. To counteract this, we make use of a related, intrinsic characteristic of LLMs: attention sink -- the tendency to consistently allocate high attention to the very first token (i.e., ) of a sequence. Concretely, we propose an advanced context anchoring method, SinkTrack, which treats as an information anchor and injects key contextual features (such as those derived from the input image or instruction) into its representation. As such, LLM remains anchored to the initial input context throughout the entire generation process. SinkTrack is training-free, plug-and-play, and introduces negligible inference overhead. Experiments demonstrate that SinkTrack mitigates hallucination and context forgetting across both textual (e.g., +21.6% on SQuAD2.0 with Llama3.1-8B-Instruct) and multi-modal (e.g., +22.8% on M3CoT with Qwen2.5-VL-7B-Instruct) tasks. Its consistent gains across different architectures and scales underscore the robustness and generalizability. We also analyze its underlying working mechanism from the perspective of information delivery. Our source code is available at https://github.com/67L1/SinkTrack.

Chinese Translation

大型语言模型（LLMs）存在幻觉和上下文遗忘的问题。先前研究表明，注意力漂移是这些问题的主要原因，即LLMs的注意力从初始输入上下文转移到新生成的标记上。为了解决这一问题，我们利用了LLMs的一个相关内在特性：attention sink——即模型倾向于持续对序列的第一个标记（即）分配较高的注意力。具体而言，我们提出了一种先进的上下文锚定方法SinkTrack，将视为信息锚点，并将关键上下文特征（如输入图像或指令中提取的特征）注入其表示中。这样，LLM在整个生成过程中始终锚定于初始输入上下文。SinkTrack无需训练，支持即插即用，且引入的推理开销极小。实验表明，SinkTrack在文本任务（例如在Llama3.1-8B-Instruct上，SQuAD2.0提升21.6%）和多模态任务（例如在Qwen2.5-VL-7B-Instruct上，M3CoT提升22.8%）中均有效缓解了幻觉和上下文遗忘问题。其在不同架构和规模上的稳定提升凸显了方法的鲁棒性和泛化能力。我们还从信息传递的角度分析了其潜在的工作机制。源码已开源，地址为：https://github.com/67L1/SinkTrack。

View on arXiv Download PDF AI Translation

cs.CV / 68 / 2604.10030

Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

Prompt Relay：多事件视频生成中的推理时序控制

Chen, Gordon, Huang, Ziqi, Liu, Ziwei

Abstract

Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which multiple events occur. Such control is especially important for movie-grade video synthesis, where coherent storytelling depends on precise timing, duration, and transitions between events. When using a single paragraph-style prompt to describe a sequence of complex events, models often exhibit semantic entanglement, where concepts intended for different moments in the video bleed into one another, resulting in poor text-video alignment. To address these limitations, we propose Prompt Relay, an inference-time, plug-and-play method to enable fine-grained temporal control in multi-event video generation, requiring no architectural modifications and no additional computational overhead. Prompt Relay introduces a penalty into the cross-attention mechanism, so that each temporal segment attends only to its assigned prompt, allowing the model to represent one semantic concept at a time and thereby improving temporal prompt alignment, reducing semantic interference, and enhancing visual quality.

Chinese Translation

视频扩散模型在生成高质量视频方面取得了显著进展。然而，这些模型难以准确表达现实视频中多个事件的时间先后顺序，且缺乏明确机制来控制语义概念的出现时间、持续时长及多个事件的发生顺序。这种控制对于电影级视频合成尤为重要，因为连贯的叙事依赖于事件之间精确的时机、持续时间和过渡。当使用单段落式提示描述复杂事件序列时，模型常出现语义纠缠现象，即不同时间点的概念相互混淆，导致文本与视频的对齐效果较差。为解决这些问题，我们提出了Prompt Relay，一种推理时的即插即用方法，实现多事件视频生成中的细粒度时间控制，无需修改模型结构或增加额外计算开销。Prompt Relay通过在交叉注意力机制中引入惩罚项，使每个时间段仅关注其对应的提示，从而使模型一次只表达一个语义概念，提升时间提示对齐效果，减少语义干扰，并增强视觉质量。

View on arXiv Download PDF AI Translation

cs.CV / 69 / 2604.10039

Counting to Four is still a Chore for VLMs

视觉语言模型（VLMs）在计数至四时仍存在困难

Anh, Duy Le Dinh, Irawan, Patrick Amadeus, Van Vo, Tuan

Abstract

Vision--language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.

Chinese Translation

视觉语言模型（VLMs）在复杂的多模态推理任务中表现出色，然而它们在诸如物体计数等简单的基础技能上仍然存在失败。现有评估大多仅关注最终输出，难以深入了解模型内部失败的具体原因。本文通过行为分析和机制分析，对VLM的计数行为进行了实证研究。我们提出了COUNTINGTRICKS，这是一套受控的基于简单形状的计数评估工具，旨在揭示不同分块布局和对抗性提示条件下的脆弱性。通过注意力分析和组件级探测，我们发现与计数相关的视觉证据在模态投影阶段最为强烈，但在后续语言层中显著衰减，模型在该阶段更易受到文本先验的影响。基于此发现，我们进一步评估了Modality Attention Share（MAS），这是一种轻量级干预方法，旨在答案生成过程中保证视觉注意力的最低预算。实验结果表明，VLM的计数失败不仅源于视觉感知的限制，还源于语言推理阶段对视觉证据的利用不足。代码和数据集将发布于https://github.com/leduy99/-CVPRW26-Modality-Attention-Share。

View on arXiv Download PDF AI Translation

cs.CV / 70 / 2604.10040

Intra-finger Variability of Diffusion-based Latent Fingerprint Generation

基于扩散模型的指纹生成中的指内变异性

Hussein, Noor, Jain, Anil K., Nandakumar, Karthik

Abstract

The primary goal of this work is to systematically evaluate the intra-finger variability of synthetic fingerprints (particularly latent prints) generated using a state-of-the-art diffusion model. Specifically, we focus on enhancing the latent style diversity of the generative model by constructing a comprehensive \textit{latent style bank} curated from seven diverse datasets, which enables the precise synthesis of latent prints with over 40 distinct styles encapsulating different surfaces and processing techniques. We also implement a semi-automated framework to understand the integrity of fingerprint ridges and minutiae in the generated impressions. Our analysis indicates that though the generation process largely preserves the identity, a small number of local inconsistencies (addition and removal of minutiae) are introduced, especially when there are poor quality regions in the reference image. Furthermore, mismatch between the reference image and the chosen style embedding that guides the generation process introduces global inconsistencies in the form of hallucinated ridge patterns. These insights highlight the limitations of existing synthetic fingerprint generators and the need to further improve these models to simultaneously enhance both diversity and identity consistency.

Chinese Translation

本研究的主要目标是系统评估使用最先进的扩散模型生成的合成指纹（特别是潜指纹）的指内变异性。具体而言，我们专注于通过构建一个综合的 extit{潜在风格库}，该库从七个不同的数据集中精心策划，以增强生成模型的潜在风格多样性，从而实现对具有超过40种不同风格的潜指纹的精确合成，这些风格涵盖了不同的表面和处理技术。我们还实施了一个半自动化框架，以理解生成印象中指纹脊线和细节的完整性。我们的分析表明，尽管生成过程在很大程度上保留了身份信息，但在参考图像存在低质量区域时，会引入少量局部不一致性（细节的增加和删除）。此外，参考图像与指导生成过程的所选风格嵌入之间的不匹配，会以幻觉脊线模式的形式引入全局不一致性。这些见解突显了现有合成指纹生成器的局限性，以及进一步改进这些模型以同时增强多样性和身份一致性的必要性。

View on arXiv Download PDF AI Translation

cs.CV / 71 / 2604.10056

U$^{2}$Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation

U$^{2}$Flow：考虑不确定性的无监督光流估计

Sun, Xunpei, Lin, Wenwei, Chang, Yi, Chen, Gang

Abstract

Unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. We propose U$^{2}$Flow, the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. The core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and dynamically modulate the regional smoothness loss. Furthermore, we introduce an uncertainty-guided bidirectional flow fusion mechanism that enhances robustness in challenging regions. Extensive experiments on KITTI and Sintel demonstrate that U$^{2}$Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the effectiveness of our joint estimation paradigm. The code is available at https://github.com/sunzunyi/U2FLOW.

Chinese Translation

无监督光流方法通常缺乏可靠的不确定性估计，限制了其鲁棒性和可解释性。我们提出了U$^{2}$Flow，这是首个联合估计光流和逐像素不确定性的递归无监督框架。其核心创新是一种解耦学习策略，通过基于拉普拉斯分布的最大似然目标，从增强一致性中导出不确定性监督，实现了无需真实标签的稳定训练。预测的不确定性进一步被整合进网络，用以指导自适应光流细化并动态调节区域平滑损失。此外，我们引入了一种基于不确定性的双向光流融合机制，提升了在复杂区域的鲁棒性。在KITTI和Sintel数据集上的大量实验表明，U$^{2}$Flow在无监督方法中实现了最先进的性能，同时生成了高度可靠的不确定性图，验证了我们联合估计范式的有效性。代码已开源，地址为https://github.com/sunzunyi/U2FLOW。

View on arXiv Download PDF AI Translation

cs.CV / 72 / 2604.10064

On The Application of Linear Attention in Multimodal Transformers

线性注意力在多模态变换器中的应用

Gerami, Armin, Madani, Seyedehanita, Duraiswami, Ramani

Abstract

Multimodal Transformers serve as the backbone for state-of-the-art vision-language models, yet their quadratic attention complexity remains a critical barrier to scalability. In this work, we investigate the viability of Linear Attention (LA) as a high-efficiency alternative within multimodal frameworks. By integrating LA, we reduce the computational overhead from quadratic to linear relative to sequence length while preserving competitive performance. We evaluate our approach across ViT-S/16, ViT-B/16, and ViT-L/16 architectures trained on the LAION-400M dataset, with validation focused on ImageNet-21K zero-shot accuracy. Our systematic evaluation demonstrates that Linear Attention not only yields significant computational savings but also adheres to the same scaling laws as standard softmax attention. These findings position Linear Attention as a robust, scalable solution for next-generation multimodal Transformers tasked with processing increasingly large and complex datasets.

Chinese Translation

多模态变换器作为最先进的视觉-语言模型的基础，其二次注意力复杂度仍然是可扩展性的关键障碍。在本研究中，我们探讨了线性注意力（Linear Attention, LA）作为多模态框架中高效替代方案的可行性。通过整合LA，我们将计算开销从与序列长度相关的二次降低到线性，同时保持竞争力的性能。我们在基于LAION-400M数据集训练的ViT-S/16、ViT-B/16和ViT-L/16架构上评估了我们的方法，验证重点集中在ImageNet-21K的零-shot准确率。我们的系统评估表明，线性注意力不仅显著节省计算资源，而且遵循与标准softmax注意力相同的扩展规律。这些发现使线性注意力成为处理日益庞大和复杂数据集的下一代多模态变换器的稳健、可扩展解决方案。

View on arXiv Download PDF AI Translation

cs.CV / 73 / 2604.10071

Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation

聚光与阴影：基于注意力引导的双锚点内省解码用于多模态大语言模型幻觉缓解

Wu, Yebo, Jin, Han, Guo, Zhijiang, Li, Li

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities yet continue to suffer from hallucination, where generated text contradicts visual content. In this paper, we introduce Dual-Anchor Introspective Decoding (DaID), a novel contrastive decoding framework that dynamically calibrates each token generation by mining the model's internal perceptual discrepancies. Specifically, DaID identifies a Spotlight layer to amplify visual factual signals and a Shadow layer to suppress textual inertia. By leveraging visual attention distributions to guide this dual-anchor selection process, our method ensures precise, token-specific adaptation. Experimental results across multiple benchmarks and MLLMs demonstrate that DaID significantly mitigates hallucination while enhancing general reasoning capabilities.

Chinese Translation

多模态大语言模型（MLLMs）展现了卓越的推理能力，但仍存在幻觉问题，即生成文本与视觉内容相矛盾。本文提出了一种新颖的对比解码框架——双锚点内省解码（Dual-Anchor Introspective Decoding，DaID），通过挖掘模型内部的感知差异，动态校准每个生成的词元。具体而言，DaID识别出一个聚光层（Spotlight layer）以增强视觉事实信号，和一个阴影层（Shadow layer）以抑制文本惯性。通过利用视觉注意力分布指导这一双锚点选择过程，我们的方法确保了精确的、针对词元的适应性。多项基准测试和多模态大语言模型上的实验结果表明，DaID显著缓解了幻觉现象，同时提升了整体推理能力。

View on arXiv Download PDF AI Translation

cs.CV / 74 / 2604.10077

DocRevive: A Unified Pipeline for Document Text Restoration

DocRevive：文档文本恢复的统一流程

Purkayastha, Kunal, Banerjee, Ayan, Llados, Josep, Pal, Umapada

Abstract

In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30{,}078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), incorporating edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset and code are available at \href{https://huggingface.co/datasets/kpurkayastha/OPRB}{Hugging Face} and \href{https://github.com/kunalpurkayastha/DocRevive}{Github} respectively.

Chinese Translation

在文档理解中，重建受损、遮挡或不完整文本的挑战仍然是一个关键但尚未被充分探索的问题。后续的文档理解任务可以从文档重建过程中受益。为此，本文提出了一种新颖的统一流程，结合了最先进的光学字符识别（OCR）、高级图像分析、掩码语言建模和基于扩散的模型，以恢复和重建文本，同时保持视觉完整性。我们创建了一个包含30,078个退化文档图像的合成数据集，模拟了多种文档退化场景，为恢复任务设定了基准。我们的流程检测和识别文本，使用遮挡检测器识别退化，并利用图像修复模型进行语义一致的重建。基于扩散的模块无缝地重新整合文本，匹配字体、大小和对齐方式。为了评估恢复质量，我们提出了一种统一上下文相似性度量（UCSM），结合了编辑、语义和长度相似性，以及一个上下文可预测性度量，当正确文本在上下文中显而易见时，对偏差进行惩罚。我们的工作推动了文档恢复的发展，有利于档案研究和数字保存，同时为文本重建设定了新标准。OPRB数据集和代码分别可在 [Hugging Face](https://huggingface.co/datasets/kpurkayastha/OPRB) 和 [Github](https://github.com/kunalpurkayastha/DocRevive) 获取。

View on arXiv Download PDF AI Translation

cs.CV / 75 / 2604.10078

Attention-Guided Dual-Stream Learning for Group Engagement Recognition: Fusing Transformer-Encoded Motion Dynamics with Scene Context via Adaptive Gating

基于注意力引导的双流学习用于群体参与度识别：通过自适应门控融合Transformer编码的运动动态与场景上下文

Chowdhury, Saniah Kayenat, Chowdhury, Muhammad E. H.

Abstract

Student engagement is crucial for improving learning outcomes in group activities. Highly engaged students perform better both individually and contribute to overall group success. However, most existing automated engagement recognition methods are designed for online classrooms or estimate engagement at the individual level. Addressing this gap, we propose DualEngage, a novel two-stream framework for group-level engagement recognition from in-classroom videos. It models engagement as a joint function of both individual and group-level behaviors. The primary stream models person-level motion dynamics by detecting and tracking students, extracting dense optical flow with the Recurrent All-Pairs Field Transforms network, encoding temporal motion patterns using a transformer encoder, and finally aggregating per-student representations through attention pooling into a unified representation. The secondary stream captures scene-level spatiotemporal information from the full video clip, leveraging a pretrained three-dimensional Residual Network. The two-stream representations are combined via softmax-gated fusion, which dynamically weights each stream's contribution based on the joint context of both features. DualEngage learns a joint representation of individual actions with overarching group dynamics. We evaluate the proposed approach using fivefold cross-validation on the Classroom Group Engagement Dataset developed by Ocean University of China, achieving an average classification accuracy of 0.9621+/-0.0161 with a macro-averaged F1 of 0.9530+/-0.0204. To understand the contribution of each branch, we further conduct an ablation study comparing single-stream variants against the two-stream model. This work is among the first in classroom engagement recognition to adopt a dual-stream design that explicitly leverages motion cues as an estimator.

Chinese Translation

学生参与度对于提升小组活动中的学习效果至关重要。高度参与的学生不仅在个人表现上更优异，还能促进整体小组的成功。然而，现有大多数自动化参与度识别方法主要针对在线课堂，或仅估计个体层面的参与度。针对这一空白，我们提出了DualEngage，一种新颖的双流框架，用于从课堂视频中识别群体层面的参与度。该方法将参与度建模为个体行为与群体行为的联合函数。主流通过检测和跟踪学生，利用Recurrent All-Pairs Field Transforms网络提取密集光流，采用Transformer编码器编码时间运动模式，最后通过注意力池化聚合每个学生的表征为统一表示，从而建模个体层面的运动动态。次流则利用预训练的三维残差网络，从完整视频片段中捕捉场景级的时空信息。两条流的表示通过softmax门控融合结合，根据两种特征的联合上下文动态调整各流的权重。DualEngage学习了个体动作与整体群体动态的联合表征。我们在中国海洋大学开发的课堂群体参与度数据集上采用五折交叉验证评估该方法，平均分类准确率达到0.9621±0.0161，宏平均F1分数为0.9530±0.0204。为深入理解各分支的贡献，我们进一步进行了消融实验，将单流变体与双流模型进行了比较。本研究是课堂参与度识别领域首次采用明确利用运动线索作为估计依据的双流设计之一。

View on arXiv Download PDF AI Translation

cs.CV / 76 / 2604.10081

MatRes: Zero-Shot Test-Time Model Adaptation for Simultaneous Matching and Restoration

MatRes：用于同时匹配与修复的零样本测试时模型自适应方法

Lee, Kanggeon, Lee, Soochahn, Lee, Kyoung Mu

Abstract

Real-world image pairs often exhibit both severe degradations and large viewpoint changes, making image restoration and geometric matching mutually interfering tasks when treated independently. In this work, we propose MatRes, a zero-shot test-time adaptation framework that jointly improves restoration quality and correspondence estimation using only a single low-quality and high-quality image pair. By enforcing conditional similarity at corresponding locations, MatRes updates only lightweight modules while keeping all pretrained components frozen, requiring no offline training or additional supervision. Extensive experiments across diverse combinations show that MatRes yields significant gains in both restoration and geometric alignment compared to using either restoration or matching models alone. MatRes offers a practical and widely applicable solution for real-world scenarios where users commonly capture multiple images of a scene with varying viewpoints and quality, effectively addressing the often-overlooked mutual interference between matching and restoration.

Chinese Translation

现实世界中的图像对通常同时存在严重的退化和较大的视角变化，使得图像修复与几何匹配作为独立任务时相互干扰。本文提出了MatRes，一种零样本测试时自适应框架，能够仅利用单个低质量与高质量图像对联合提升修复质量和对应关系估计。通过在对应位置强制条件相似性，MatRes仅更新轻量模块，保持所有预训练组件冻结，无需离线训练或额外监督。大量跨多样组合的实验表明，MatRes在修复和几何对齐方面均显著优于单独使用修复或匹配模型。MatRes为现实场景中用户常因视角和质量差异拍摄多张图像的情况提供了实用且广泛适用的解决方案，有效解决了匹配与修复之间常被忽视的相互干扰问题。

View on arXiv Download PDF AI Translation

cs.CV / 77 / 2604.10084

Active Diffusion Matching: Score-based Iterative Alignment of Cross-Modal Retinal Images

主动扩散匹配：基于评分的跨模态视网膜图像迭代对齐

Lee, Kanggeon, Song, Su Jeong, Lee, Soochahn, Lee, Kyoung Mu

Abstract

Objective: The study aims to address the challenge of aligning Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which is difficult due to their substantial differences in viewing range and the amorphous appearance of the retina. Currently, no specialized method exists for this task, and existing image alignment techniques lack accuracy. Methods: We propose Active Diffusion Matching (ADM), a novel cross-modal alignment method. ADM integrates two interdependent score-based diffusion models to jointly estimate global transformations and local deformations via an iterative Langevin Markov chain. This approach facilitates a stochastic, progressive search for optimal alignment. Additionally, custom sampling strategies are introduced to enhance the adaptability of ADM to given input image pairs. Results: Comparative experimental evaluations demonstrate that ADM achieves state-of-the-art alignment accuracy. This was validated on two datasets: a private dataset of SFI-UWFI pairs and a public dataset of SFI-SFI pairs, with mAUC improvements of 5.2 and 0.4 points on the private and public datasets, respectively, compared to existing state-of-the-art methods. Conclusion: ADM effectively bridges the gap in aligning SFIs and UWFIs, providing an innovative solution to a previously unaddressed challenge. The method's ability to jointly optimize global and local alignment makes it highly effective for cross-modal image alignment tasks. Significance: ADM has the potential to transform the integrated analysis of SFIs and UWFIs, enabling better clinical utility and supporting learning-based image enhancements. This advancement could significantly improve diagnostic accuracy and patient outcomes in ophthalmology.

Chinese Translation

目的：本研究旨在解决标准眼底图像（Standard Fundus Images, SFIs）与超广角眼底图像（Ultra-Widefield Fundus Images, UWFIs）对齐的挑战，由于它们在视野范围和视网膜的无定形外观上存在显著差异，这一任务十分困难。目前尚无针对该任务的专门方法，现有的图像对齐技术缺乏准确性。方法：我们提出了主动扩散匹配（Active Diffusion Matching, ADM），这是一种新颖的跨模态对齐方法。ADM结合了两个相互依赖的基于评分的扩散模型，通过迭代的Langevin马尔可夫链共同估计全局变换和局部变形。这种方法促进了对最佳对齐的随机渐进搜索。此外，引入了定制的采样策略，以增强ADM对给定输入图像对的适应性。结果：比较实验评估表明，ADM达到了最先进的对齐准确性。在两个数据集上进行了验证：一个是SFI-UWFI对的私有数据集，另一个是SFI-SFI对的公共数据集，分别在私有和公共数据集上与现有最先进方法相比，mAUC提高了5.2和0.4点。结论：ADM有效弥补了SFIs与UWFIs对齐的差距，为一个之前未解决的挑战提供了创新解决方案。该方法能够共同优化全局和局部对齐，使其在跨模态图像对齐任务中极为有效。意义：ADM有潜力改变SFIs与UWFIs的综合分析，提升临床实用性，并支持基于学习的图像增强。这一进展可能显著提高眼科诊断的准确性和患者的治疗效果。

View on arXiv Download PDF AI Translation

cs.CV / 78 / 2604.10085

Particle Diffusion Matching: Random Walk Correspondence Search for the Alignment of Standard and Ultra-Widefield Fundus Images

粒子扩散匹配：用于标准视网膜图像与超广角视网膜图像对齐的随机游走对应搜索

Lee, Kanggeon, Lee, Soochahn, Lee, Kyoung Mu

Abstract

We propose a robust alignment technique for Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which are challenging to align due to differences in scale, appearance, and the scarcity of distinctive features. Our method, termed Particle Diffusion Matching (PDM), performs alignment through an iterative Random Walk Correspondence Search (RWCS) guided by a diffusion model. At each iteration, the model estimates displacement vectors for particle points by considering local appearance, the structural distribution of particles, and an estimated global transformation, enabling progressive refinement of correspondences even under difficult conditions. PDM achieves state-of-the-art performance across multiple retinal image alignment benchmarks, showing substantial improvement on a primary dataset of SFI-UWFI pairs and demonstrating its effectiveness in real-world clinical scenarios. By providing accurate and scalable correspondence estimation, PDM overcomes the limitations of existing methods and facilitates the integration of complementary retinal image modalities. This diffusion-guided search strategy offers a new direction for improving downstream supervised learning, disease diagnosis, and multi-modal image analysis in ophthalmology.

Chinese Translation

我们提出了一种针对标准视网膜图像（Standard Fundus Images, SFIs）与超广角视网膜图像（Ultra-Widefield Fundus Images, UWFIs）之间对齐的鲁棒技术。由于两者在尺度、外观上的差异以及显著特征的稀缺性，使得对齐任务极具挑战性。我们的方法称为粒子扩散匹配（Particle Diffusion Matching, PDM），通过基于扩散模型引导的迭代随机游走对应搜索（Random Walk Correspondence Search, RWCS）实现对齐。在每次迭代中，该模型结合局部外观信息、粒子结构分布及估计的全局变换，估计粒子点的位移向量，从而在复杂条件下逐步细化对应关系。PDM在多个视网膜图像对齐基准测试中达到最先进的性能，在主要的SFI-UWFI配对数据集上表现出显著提升，并验证了其在真实临床场景中的有效性。通过提供准确且可扩展的对应估计，PDM克服了现有方法的局限，促进了互补视网膜图像模态的融合。该基于扩散引导的搜索策略为提升后续的监督学习、疾病诊断及多模态图像分析在眼科领域的应用开辟了新方向。

View on arXiv Download PDF AI Translation

cs.CV / 79 / 2604.10094

Global monitoring of methane point sources using deep learning on hyperspectral radiance measurements from EMIT

利用EMIT的高光谱辐射测量通过深度学习对甲烷点源进行全球监测

Batchu, Vishal V., Conserva, Michelangelo, Wilson, Alex, Michalak, Anna M., Gulshan, Varun, Brodrick, Philip G., Thorpe, Andrew K., Arsdale, Christopher V.

Abstract

Anthropogenic methane (CH4) point sources drive near-term climate forcing, safety hazards, and system inefficiencies. Space-based imaging spectroscopy is emerging as a tool for identifying emissions globally, but existing approaches largely rely on manual plume identification. Here we present the Methane Analysis and Plume Localization with EMIT (MAPL-EMIT) model, an end-to-end vision transformer framework that leverages the complete radiance spectrum from the Earth Surface Mineral Dust Source Investigation (EMIT) instrument to jointly retrieve methane enhancements across all pixels within a scene. This approach brings together spectral and spatial context to significantly lower detection limits. MAPL-EMIT simultaneously supports enhancement quantification, plume delineation, and source localization, even for multiple overlapping plumes. The model was trained on 3.6 million physics-based synthetic plumes injected into global EMIT radiance data. Synthetic evaluation confirms the model's ability to identify plumes with high recall and precision and to capture weaker plumes relative to existing matched-filter approaches. On real-world benchmarks, MAPL-EMIT captures 79% of known hand-annotated NASA L2B plume complexes across a test set of 1084 EMIT granules, while capturing twice as many plausible plumes than identified by human analysts. Further validation against coincident airborne data, top-emitting landfills, and controlled release experiments confirms the model's ability to identify previously uncaptured sources. By incorporating model-generated metrics such as spectral fit scores and estimated noise levels, the framework can further limit false-positive rates. Overall, MAPL-EMIT enables high-throughput implementation on the full EMIT catalog, shifting methane monitoring from labor-intensive workflows to a rapid, scalable paradigm for global plume mapping at the facility scale.

Chinese Translation

人为甲烷（CH4）点源推动了近期气候强迫、安全隐患和系统低效。基于空间的成像光谱技术正在成为识别全球排放的一种工具，但现有的方法在很大程度上依赖于手动烟羽识别。在此，我们提出了甲烷分析与EMIT烟羽定位模型（Methane Analysis and Plume Localization with EMIT，MAPL-EMIT），这是一个端到端的视觉变换器框架，利用地球表面矿物尘源调查（EMIT）仪器的完整辐射光谱，联合检索场景中所有像素的甲烷增强。这种方法结合了光谱和空间上下文，显著降低了检测限。MAPL-EMIT同时支持增强量化、烟羽划定和源定位，即使对于多个重叠的烟羽也是如此。该模型在360万个基于物理的合成烟羽上进行训练，这些烟羽被注入到全球EMIT辐射数据中。合成评估确认了该模型以高召回率和精度识别烟羽的能力，并能够捕捉相对于现有匹配滤波方法的较弱烟羽。在真实世界基准测试中，MAPL-EMIT在1084个EMIT颗粒的测试集中捕获了79%的已知手动标注的NASA L2B烟羽复合体，同时捕获的合理烟羽数量是人类分析师识别数量的两倍。针对同时发生的空中数据、主要排放垃圾填埋场和控制释放实验的进一步验证确认了该模型识别先前未捕获源的能力。通过结合模型生成的指标，如光谱拟合分数和估计噪声水平，该框架可以进一步限制假阳性率。总体而言，MAPL-EMIT使得在完整的EMIT目录上实现高通量应用成为可能，将甲烷监测从劳动密集型工作流程转变为快速、可扩展的全球烟羽映射范式，适用于设施级别。

View on arXiv Download PDF AI Translation

cs.CV / 80 / 2604.10095

Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models

挖掘属性子空间以高效微调3D基础模型

Jiang, Yu, Jiang, Hanwen, Abdelkader, Ahmed, Chu, Wen-Sheng, Feng, Brandon Y., Wang, Zhangyang, Huang, Qixing

Abstract

With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA sub-space associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach.

Chinese Translation

随着3D基础模型的出现，微调它们以适应下游任务的兴趣日益增长，其中LoRA是主导的微调范式。由于3D数据集在纹理、几何、相机运动和光照方面表现出明显的变化，出现了一些有趣的基本问题：1）是否存在与每种变化类型相关的LoRA子空间？2）这些子空间是否是解耦的（即彼此正交）？3）我们如何有效地计算它们？本文对所有这些问题提供了答案。我们提出了一种稳健的方法，生成具有可控变化的合成数据集，在每个数据集上微调LoRA适配器，并提取与每种变化类型相关的LoRA子空间。我们展示了这些子空间大致是解耦的。将它们整合后，形成了一个减少的LoRA子空间，使得在下游任务中实现高效的LoRA微调并提高预测准确性。特别是，我们展示了这样一个减少的LoRA子空间，尽管完全源自合成数据，但能够推广到真实数据集。消融研究验证了我们方法中选择的有效性。

View on arXiv Download PDF AI Translation

cs.CV / 81 / 2604.10096

ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents

ABot-Claw：持久性、协作性和自我演化机器人代理的基础

Huo, Dongjie, Liu, Haoyun, Liu, Guoqing, Qi, Dekang, Sun, Zhiming, Gao, Maoguo, He, Jianxin, Yang, Yandan, Chang, Xinyuan, Xiong, Feng, Wei, Xing, Ma, Zhiheng, Xu, Mu

Abstract

Current embodied intelligent systems still face a substantial gap between high-level reasoning and low-level physical execution in open-world environments. Although Vision-Language-Action (VLA) models provide strong perception and intuitive responses, their open-loop nature limits long-horizon performance. Agents incorporating System 2 cognitive mechanisms improve planning, but usually operate in closed sandboxes with predefined toolkits and limited real-system control. OpenClaw provides a localized runtime with full system privileges, but lacks the embodied control architecture required for long-duration, multi-robot execution. We therefore propose ABot-Claw, an embodied extension of OpenClaw that integrates: 1) a unified embodiment interface with capability-driven scheduling for heterogeneous robot coordination; 2) a visual-centric cross-embodiment multimodal memory for persistent context retention and grounded retrieval; and 3) a critic-based closed-loop feedback mechanism with a generalist reward model for online progress evaluation, local correction, and replanning. With a decoupled architecture spanning the OpenClaw layer, shared service layer, and robot embodiment layer, ABot-Claw enables real-world interaction, closes the loop from natural language intent to physical action, and supports progressively self-evolving robotic agents in open, dynamic environments.

Chinese Translation

当前的具身智能系统在开放世界环境中仍面临高层推理与低层物理执行之间的显著差距。尽管视觉-语言-动作（Vision-Language-Action, VLA）模型提供了强大的感知能力和直观响应，但其开放式反馈特性限制了长时间范围内的表现。结合系统2认知机制的代理能够改善规划，但通常在具有预定义工具包和有限真实系统控制的封闭沙箱中运行。OpenClaw提供了一个具有完全系统权限的本地运行时，但缺乏进行长时间、多机器人执行所需的具身控制架构。因此，我们提出了ABot-Claw，这是OpenClaw的具身扩展，集成了：1）一个统一的具身接口，具有基于能力的调度功能，以实现异构机器人协调；2）一个以视觉为中心的跨具身多模态记忆，用于持久的上下文保留和基于情境的检索；3）一个基于评估者的闭环反馈机制，配备通用奖励模型，用于在线进度评估、本地修正和重新规划。通过跨越OpenClaw层、共享服务层和机器人具身层的解耦架构，ABot-Claw实现了与真实世界的交互，闭合了从自然语言意图到物理行动的循环，并支持在开放、动态环境中逐步自我演化的机器人代理。

View on arXiv Download PDF AI Translation

cs.CV / 82 / 2604.10102

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

一致性降级配对训练用于鲁棒的AI生成图像检测

Yang, Zongyou, Hou, Yinghan, Yang, Xiaokun

Abstract

AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussian blur, and resolution downsampling. We observe that state-of-the-art methods, including B-Free, treat degradation robustness as a byproduct of data augmentation rather than an explicit training objective. In this work, we propose Degradation-Consistent Paired Training (DCPT), a simple yet effective training strategy that explicitly enforces robustness through paired consistency constraints. For each training image, we construct a clean view and a degraded view, then impose two constraints: a feature consistency loss that minimizes the cosine distance between clean and degraded representations, and a prediction consistency loss based on symmetric KL divergence that aligns output distributions across views. DCPT adds zero additional parameters and zero inference overhead. Experiments on the Synthbuster benchmark (9 generators, 8 degradation conditions) demonstrate that DCPT improves the degraded-condition average accuracy by 9.1 percentage points compared to an identical baseline without paired training, while sacrificing only 0.9% clean accuracy. The improvement is most pronounced under JPEG compression (+15.7% to +17.9%). Ablation further reveals that adding architectural components leads to overfitting on limited training data, confirming that training objective improvement is more effective than architectural augmentation for degradation robustness.

Chinese Translation

AI生成的图像检测器在面对现实世界图像损坏（如JPEG压缩、高斯模糊和分辨率下采样）时，性能显著下降。我们观察到，最先进的方法（包括B-Free）将降级鲁棒性视为数据增强的副产品，而非明确的训练目标。在本研究中，我们提出了一种简单而有效的训练策略——一致性降级配对训练（Degradation-Consistent Paired Training, DCPT），通过配对一致性约束明确强化鲁棒性。对于每个训练图像，我们构建一个干净视图和一个降级视图，然后施加两个约束：一个特征一致性损失，最小化干净和降级表示之间的余弦距离，以及一个基于对称KL散度的预测一致性损失，使输出分布在不同视图之间对齐。DCPT不增加任何额外参数，也不增加推理开销。在Synthbuster基准测试（9个生成器，8种降级条件）上的实验表明，与没有配对训练的相同基线相比，DCPT在降级条件下的平均准确率提高了9.1个百分点，而干净准确率仅下降了0.9%。在JPEG压缩下，改善最为显著（提高15.7%至17.9%）。消融实验进一步揭示，添加架构组件会导致在有限训练数据上的过拟合，确认训练目标的改进在降级鲁棒性方面比架构增强更为有效。

View on arXiv Download PDF AI Translation

cs.CV / 83 / 2604.10103

Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

通过混合注意力与解耦蒸馏实现长时间段流媒体视频生成

Li, Ruibin, Yang, Tao, Ai, Fangzhou, Wu, Tianhe, Wen, Shilei, Peng, Bingyue, Zhang, Lei

Abstract

Streaming video generation (SVG) distills a pretrained bidirectional video diffusion model into an autoregressive model equipped with sliding window attention (SWA). However, SWA inevitably loses distant history during long video generation, and its computational overhead remains a critical challenge to real-time deployment. In this work, we propose Hybrid Forcing, which jointly optimizes temporal information retention and computational efficiency through a hybrid attention design. First, we introduce lightweight linear temporal attention to preserve long-range dependencies beyond the sliding window. In particular, we maintain a compact key-value state to incrementally absorb evicted tokens, retaining temporal context with negligible memory and computational overhead. Second, we incorporate block-sparse attention into the local sliding window to reduce redundant computation within short-range modeling, reallocating computational capacity toward more critical dependencies. Finally, we introduce a decoupled distillation strategy tailored to the hybrid attention design. A few-step initial distillation is performed under dense attention, then the distillation of our proposed linear temporal and block-sparse attention is activated for streaming modeling, ensuring stable optimization. Extensive experiments on both short- and long-form video generation benchmarks demonstrate that Hybrid Forcing consistently achieves state-of-the-art performance. Notably, our model achieves real-time, unbounded 832x480 video generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or model compression. The source code and trained models are available at https://github.com/leeruibin/hybrid-forcing.

Chinese Translation

流媒体视频生成（SVG）将一个预训练的双向视频扩散模型蒸馏为一个配备滑动窗口注意力（SWA）的自回归模型。然而，在长视频生成过程中，SWA不可避免地会丢失远程历史信息，其计算开销仍然是实时部署的一项关键挑战。在本研究中，我们提出了混合强制（Hybrid Forcing），通过混合注意力设计共同优化时间信息保留和计算效率。首先，我们引入轻量级线性时间注意力，以保持超出滑动窗口的长距离依赖关系。特别地，我们维护一个紧凑的键值状态，以增量方式吸收被驱逐的标记，从而以微不足道的内存和计算开销保留时间上下文。其次，我们将块稀疏注意力结合到局部滑动窗口中，以减少短距离建模中的冗余计算，将计算能力重新分配给更关键的依赖关系。最后，我们引入了一种针对混合注意力设计的解耦蒸馏策略。在密集注意力下进行几步初始蒸馏，然后激活我们提出的线性时间和块稀疏注意力的蒸馏以进行流媒体建模，确保稳定优化。在短视频和长视频生成基准上的大量实验表明，混合强制始终实现了最先进的性能。值得注意的是，我们的模型在单个NVIDIA H100 GPU上以29.5 FPS实现了实时、无界限的832x480视频生成，无需量化或模型压缩。源代码和训练模型可在 https://github.com/leeruibin/hybrid-forcing 获取。

View on arXiv Download PDF AI Translation

cs.CV / 84 / 2604.10106

VGGT-HPE: Reframing Head Pose Estimation as Relative Pose Prediction

VGGT-HPE：将头部姿态估计重新框定为相对姿态预测

Vasileiou, Vasiliki, Filntisis, Panagiotis P., Maragos, Petros, Daniilidis, Kostas

Abstract

Monocular head pose estimation is traditionally formulated as direct regression from a single image to an absolute pose. This paradigm forces the network to implicitly internalize a dataset-specific canonical reference frame. In this work, we argue that predicting the relative rigid transformation between two observed head configurations is a fundamentally easier and more robust formulation. We introduce VGGT-HPE, a relative head pose estimator built upon a general-purpose geometry foundation model. Finetuned exclusively on synthetic facial renderings, our method sidesteps the need for an implicit anchor by reducing the problem to estimating a geometric displacement from an explicitly provided anchor with a known pose. As a practical benefit, the relative formulation also allows the anchor to be chosen at test time - for instance, a near-neutral frame or a temporally adjacent one - so that the prediction difficulty can be controlled by the application. Despite zero real-world training data, VGGT-HPE achieves state-of-the-art results on the BIWI benchmark, outperforming established absolute regression methods trained on mixed and real datasets. Through controlled easy- and hard-pair benchmarks, we also systematically validate our core hypothesis: relative prediction is intrinsically more accurate than absolute regression, with the advantage scaling alongside the difficulty of the target pose. Project page and code: https://vasilikivas.github.io/VGGT-HPE

Chinese Translation

单目头部姿态估计传统上被表述为从单张图像直接回归到绝对姿态。这一范式迫使网络隐式地内化特定数据集的标准参考框架。在本研究中，我们认为预测两个观察到的头部配置之间的相对刚性变换是一种根本上更简单且更稳健的表述。我们提出了VGGT-HPE，一种基于通用几何基础模型的相对头部姿态估计器。我们的算法仅在合成面部渲染图上进行微调，从而通过将问题简化为从已知姿态的显式提供锚点估计几何位移，避免了隐式锚点的需求。作为一种实际的好处，相对表述还允许在测试时选择锚点——例如，接近中性的位置或时间上相邻的位置——从而可以通过应用程序控制预测的难度。尽管没有真实世界的训练数据，VGGT-HPE在BIWI基准测试上取得了最先进的结果，超越了在混合和真实数据集上训练的已建立绝对回归方法。通过控制简单和困难配对的基准测试，我们还系统地验证了我们的核心假设：相对预测本质上比绝对回归更准确，且这种优势随着目标姿态的难度而增强。项目页面和代码： https://vasilikivas.github.io/VGGT-HPE

View on arXiv Download PDF AI Translation

cs.CV / 85 / 2604.10112

Dual-Branch Remote Sensing Infrared Image Super-Resolution

双分支遥感红外图像超分辨率

Ge, Xining, Chang, Gengjia, Yuan, Weijun, Li, Zhan, Chen, Zhanglu, Yao, Boyang, Chen, Yihang, Deng, Yifan, Liu, Shuhong

Abstract

Remote sensing infrared image super-resolution aims to recover sharper thermal observations from low-resolution inputs while preserving target contours, scene layout, and radiometric stability. Unlike visible-image super-resolution, thermal imagery is weakly textured and more sensitive to unstable local sharpening, which makes complementary local and global modeling especially important. This paper presents our solution to the NTIRE 2026 Infrared Image Super-Resolution Challenge, a dual-branch system that combines a HAT-L branch and a MambaIRv2-L branch. The inference pipeline applies test-time local conversion on HAT, eight-way self-ensemble on MambaIRv2, and fixed equal-weight image-space fusion. We report both the official challenge score and a reproducible evaluation on 12 synthetic times-four thermal samples derived from Caltech Aerial RGB-Thermal, on which the fused output outperforms either single branch in PSNR, SSIM, and the overall Score. The results suggest that infrared super-resolution benefits from explicit complementarity between locally strong transformer restoration and globally stable state-space modeling.

Chinese Translation

遥感红外图像超分辨率旨在从低分辨率输入中恢复更清晰的热观测，同时保持目标轮廓、场景布局和辐射稳定性。与可见光图像超分辨率不同，热成像纹理较弱，对不稳定的局部锐化更为敏感，这使得局部和全局建模的互补性尤为重要。本文提出了我们对NTIRE 2026红外图像超分辨率挑战的解决方案，采用双分支系统，结合了HAT-L分支和MambaIRv2-L分支。推理流程在HAT上应用测试时局部转换，在MambaIRv2上进行八向自集成，并进行固定等权重的图像空间融合。我们报告了官方挑战得分以及在12个来自Caltech Aerial RGB-Thermal的合成四倍热样本上的可重复评估，其中融合输出在PSNR、SSIM和总体得分上均优于任一单一分支。结果表明，红外超分辨率受益于局部强变换器恢复与全局稳定状态空间建模之间的显式互补性。

View on arXiv Download PDF AI Translation

cs.CV / 86 / 2604.10116

A Dual Cross-Attention Graph Learning Framework For Multimodal MRI-Based Major Depressive Disorder Detection

基于多模态MRI的重度抑郁障碍检测的双重交叉注意力图学习框架

Alotaibi, Nojod M., Alhothali, Areej M.

Abstract

Major depressive disorder (MDD) is a prevalent mental disorder associated with complex neurobiological changes that cannot be fully captured using a single imaging modality. The use of multimodal magnetic resonance imaging (MRI) provides a more comprehensive understanding of brain changes by combining structural and functional data. Despite this, the effective integration of these modalities remains challenging. In this study, we propose a dual cross-attention-based multimodal fusion framework that explicitly models bidirectional interactions between structural MRI (sMRI) and resting-state functional MRI (rs-fMRI) representations. The proposed approach is tested on the large-scale REST-meta-MDD dataset using both structural and functional brain atlas configurations. Numerous experiments conducted under a 10-fold stratified cross-validation demonstrated that the proposed fusion algorithm achieves robust and competitive performance across all atlas types. The proposed method consistently outperforms conventional feature-level concatenation for functional atlases, while maintaining comparable performance for structural atlases. The most effective dual cross-attention multimodal model obtained 84.71% accuracy, 86.42% sensitivity, 82.89% specificity, 84.34% precision, and 85.37% F1-score. These findings emphasize the importance of explicitly modeling cross-modal interactions for multimodal neuroimaging-based MDD classification.

Chinese Translation

重度抑郁障碍（MDD）是一种常见的精神疾病，伴随着复杂的神经生物学变化，单一成像模态难以全面捕捉这些变化。多模态磁共振成像（MRI）通过结合结构和功能数据，提供了对脑部变化更为全面的理解。尽管如此，这些模态的有效整合仍然具有挑战性。本研究提出了一种基于双重交叉注意力的多模态融合框架，明确建模结构性MRI（sMRI）与静息态功能MRI（rs-fMRI）表征之间的双向交互。该方法在大规模REST-meta-MDD数据集上进行了测试，采用了结构和功能脑图谱配置。通过10折分层交叉验证进行的大量实验表明，所提融合算法在所有脑图谱类型中均表现出稳健且具有竞争力的性能。该方法在功能脑图谱上持续优于传统的特征级拼接方法，同时在结构脑图谱上保持了相当的性能。表现最佳的双重交叉注意力多模态模型达到了84.71%的准确率、86.42%的敏感性、82.89%的特异性、84.34%的精确率及85.37%的F1分数。这些结果强调了明确建模跨模态交互在基于多模态神经影像的MDD分类中的重要性。

View on arXiv Download PDF AI Translation

cs.CV / 87 / 2604.10125

PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit--Explicit Optimization

PhyMix：基于隐式-显式优化的物理一致性单幅图像三维室内场景生成

Wu, Dongli, Hu, Jingyu, Hui, Ka-Hei, Wei, Xiaobao, Luo, Chengwen, Li, Jianqiang, Liu, Zhengzhe

Abstract

Existing single-image 3D indoor scene generators often produce results that look visually plausible but fail to obey real-world physics, limiting their reliability in robotics, embodied AI, and design. To examine this gap, we introduce a unified Physics Evaluator that measures four main aspects: geometric priors, contact, stability, and deployability, which are further decomposed into nine sub-constraints, establishing the first benchmark to measure physical consistency. Based on this evaluator, our analysis shows that state-of-the-art methods remain largely physics-unaware. To overcome this limitation, we further propose a framework that integrates feedback from the Physics Evaluator into both training and inference, enhancing the physical plausibility of generated scenes. Specifically, we propose PhyMix, which is composed of two complementary components: (i) implicit alignment via Scene-GRPO, a critic-free group-relative policy optimization that leverages the Physics Evaluator as a preference signal and biases sampling towards physically feasible layouts, and (ii) explicit refinement via a plug-and-play Test-Time Optimizer (TTO) that uses differentiable evaluator signals to correct residual violations during generation. Overall, our method unifies evaluation, reward shaping, and inference-time correction, producing 3D indoor scenes that are visually faithful and physically plausible. Extensive synthetic evaluations confirm state-of-the-art performance in both visual fidelity and physical plausibility, and extensive qualitative examples in stylized and real-world images further showcase the robustness of the method. We will release codes and models upon publication.

Chinese Translation

现有的单幅图像三维室内场景生成方法虽然在视觉上看似合理，但往往未能遵循真实世界的物理规律，限制了其在机器人学、具身人工智能及设计领域的可靠性。为探究这一差距，我们提出了一个统一的物理评估器（Physics Evaluator），用于衡量四个主要方面：几何先验、接触、稳定性和可部署性，并进一步细分为九个子约束，建立了首个用于测量物理一致性的基准。基于该评估器的分析表明，当前最先进的方法在物理感知方面仍存在较大不足。为克服这一限制，我们提出了一个框架，将物理评估器的反馈整合到训练和推理过程中，以提升生成场景的物理合理性。具体而言，我们提出了PhyMix，该方法由两个互补组件组成：（i）通过Scene-GRPO（一种无判别器的基于群体相对策略优化的方法）实现隐式对齐，该方法利用物理评估器作为偏好信号，引导采样向物理可行布局倾斜；（ii）通过即插即用的测试时优化器（Test-Time Optimizer, TTO）进行显式细化，利用可微分的评估信号在生成过程中修正残余的违规行为。总体而言，我们的方法统一了评估、奖励塑造和推理时校正，生成的三维室内场景在视觉上真实且物理上合理。大量合成评估验证了我们方法在视觉保真度和物理合理性上的先进性能，丰富的风格化及真实图像定性示例进一步展示了方法的鲁棒性。代码和模型将在论文发布时公开。

View on arXiv Download PDF AI Translation

cs.CV / 88 / 2604.10127

VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

VGA-Bench：视频美学与生成质量评估的统一基准与多模型框架

Jiang, Longteng, Zheng, DanDan, Qiao, Qianqian, Huang, Heng, Wang, Huaye, Bo, Yihang, Peng, Bao, Chen, Jingdong, Zhou, Jun, Jin, Xin

Abstract

The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment-particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.

Chinese Translation

基于AIGC的视频生成的快速发展凸显了对全面评估框架的迫切需求，这些框架不仅超越了传统的生成质量指标，还涵盖了美学吸引力。然而，现有的基准测试仍然主要关注技术的真实性，导致在整体评估方面存在显著的空白，特别是在感知和艺术品质方面。为了解决这一局限性，我们提出了VGA-Bench，这是一个用于视频生成质量和美学质量联合评估的统一基准。VGA-Bench建立在一个原则性的三层分类法基础上：美学质量、美学标记和生成质量，每个层次都细分为多个细粒度的子维度，以便进行系统评估。在这一分类法的指导下，我们设计了1,016个多样化的提示，并使用12个视频生成模型生成了一个超过60,000个视频的大规模数据集，确保在内容、风格和伪影方面的广泛覆盖。为了实现可扩展和自动化的评估，我们通过人工标注对数据集的一个子集进行了注释，并开发了三个专门的多任务神经评估器：VAQA-Net用于美学质量预测，VTag-Net用于自动美学标记，以及VGQA-Net用于生成和基本质量属性的评估。大量实验表明，我们的模型与人类判断之间具有可靠的一致性，提供了准确性和效率。我们将VGA-Bench作为公共基准发布，以促进AIGC评估领域的研究，应用于内容审核、模型调试和生成模型优化。

View on arXiv Download PDF AI Translation

cs.CV / 89 / 2604.10130

Improving Deep Learning-Based Target Volume Auto-Delineation for Adaptive MR-Guided Radiotherapy in Head and Neck Cancer: Impact of a Volume-Aware Dice Loss

提高基于深度学习的头颈癌适应性MR引导放疗靶区自动勾画：体积感知Dice损失的影响

Beirami, Sogand, Esmaeilzadeh, Zahra, Gomaa, Ahmed, Stephan, Pluvio, Sheth, Ishita, Weissmann, Thomas, Szkitsak, Juliane, Schubert, Philipp, Huang, Yixing, Schwarz, Annette, Corradini, Stefanie, Putz, Florian

Abstract

Background: Manual delineation of target volumes in head and neck cancer (HNC) remains a significant bottleneck in radiotherapy planning, characterized by high inter-observer variability and time consumption. This study evaluates the integration of a Volume-Aware (VA) Dice loss function into a self-configuring deep learning framework to enhance the auto-segmentation of primary tumors (PT) and metastatic lymph nodes (LN) for adaptive MR-guided radiotherapy. We investigate how volume-sensitive weighting affects the detection of small, anatomically complex nodal metastases compared to conventional loss functions. Methods: Utilizing the HNTS-MRG 2024 dataset, we implemented an nnU-Net ResEnc M architecture. We conducted a multi-label segmentation task, comparing a standard Dice loss baseline against two Volume-Aware configurations: a "Dual Mask" setup (VA loss on both PT and LN) and a "Selective LN Mask" setup (VA loss on LN only). Evaluation metrics included volumetric Dice scores, surface-based metrics (SDS, MSD, HD95), and lesion-wise binary detection sensitivity and precision. Results: The Selective LN Mask configuration achieved the highest LN Volumetric Dice Score (0.758 vs. 0.734 baseline) and significantly improved LN Lesion-Wise Detection Sensitivity (84.93% vs. 81.80%). However, a critical trade-off was observed; PT detection precision declined significantly in the selective setup (63.65% vs. 81.27%). The Dual Mask configuration provided the most balanced performance across both targets, maintaining primary tumor precision at 82.04% while improving LN sensitivity to 83.46%. Conclusions: A volume-sensitive loss function mitigated the under-representation of small metastatic lesions in HNC. While selective weighting yielded the best nodal detection, a dual-mask approach is required in multi-label tasks to maintain segmentation accuracy for larger primary tumor volumes.

Chinese Translation

背景：头颈癌（HNC）靶区的手动勾画在放疗计划中仍然是一个显著的瓶颈，表现为高的观察者间变异性和时间消耗。本研究评估了将体积感知（VA）Dice损失函数集成到自配置深度学习框架中的效果，以增强适应性MR引导放疗中原发肿瘤（PT）和转移性淋巴结（LN）的自动分割。我们研究了体积敏感加权如何影响小型解剖复杂的淋巴结转移的检测，相较于传统损失函数。方法：利用HNTS-MRG 2024数据集，我们实施了nnU-Net ResEnc M架构。我们进行了多标签分割任务，将标准Dice损失基线与两种体积感知配置进行比较：一种是“双重掩膜”设置（PT和LN均使用VA损失），另一种是“选择性LN掩膜”设置（仅对LN使用VA损失）。评估指标包括体积Dice评分、基于表面的指标（SDS、MSD、HD95）以及病灶级的二元检测灵敏度和精度。结果：选择性LN掩膜配置实现了最高的LN体积Dice评分（0.758 vs. 0.734基线），并显著提高了LN病灶级检测灵敏度（84.93% vs. 81.80%）。然而，观察到一个关键的权衡；在选择性设置中，PT检测精度显著下降（63.65% vs. 81.27%）。双重掩膜配置在两个目标之间提供了最平衡的性能，保持了原发肿瘤的精度为82.04%，同时将LN灵敏度提高至83.46%。结论：体积敏感损失函数减轻了HNC中小型转移病灶的低估问题。尽管选择性加权产生了最佳的淋巴结检测，但在多标签任务中需要采用双重掩膜方法，以保持对较大原发肿瘤体积的分割准确性。

View on arXiv Download PDF AI Translation

cs.CV / 90 / 2604.10132

Semantic Manipulation Localization

语义操控定位

Tan, Zhenshan, Lu, Chenhan, Huang, Yuxiang, He, Ziwen, Zhang, Xiang, Sha, Yuzhe, Chen, Xianyi, Chen, Tianrun, Fu, Zhangjie

Abstract

Image Manipulation Localization (IML) aims to identify edited regions in an image. However, with the increasing use of modern image editing and generative models, many manipulations no longer exhibit obvious low-level artifacts. Instead, they often involve subtle but meaning-altering edits to an object's attributes, state, or relationships while remaining highly consistent with the surrounding content. This makes conventional IML methods less effective because they mainly rely on artifact detection rather than semantic sensitivity. To address this issue, we introduce Semantic Manipulation Localization (SML), a new task that focuses on localizing subtle semantic edits that significantly change image interpretation. We further construct a dedicated fine-grained benchmark for SML using a semantics-driven manipulation pipeline with pixel-level annotations. Based on this task, we propose TRACE (Targeted Reasoning of Attributed Cognitive Edits), an end-to-end framework that models semantic sensitivity through three progressively coupled components: semantic anchoring, semantic perturbation sensing, and semantic-constrained reasoning. Specifically, TRACE first identifies semantically meaningful regions that support image understanding, then injects perturbation-sensitive frequency cues to capture subtle edits under strong visual consistency, and finally verifies candidate regions through joint reasoning over semantic content and semantic scope. Extensive experiments show that TRACE consistently outperforms existing IML methods on our benchmark and produces more complete, compact, and semantically coherent localization results. These results demonstrate the necessity of moving beyond artifact-based localization and provide a new direction for image forensics in complex semantic editing scenarios.

Chinese Translation

图像操控定位（Image Manipulation Localization, IML）旨在识别图像中的编辑区域。然而，随着现代图像编辑和生成模型的广泛应用，许多操控不再表现出明显的低级伪影。相反，它们通常涉及对物体属性、状态或关系的微妙但改变意义的编辑，同时与周围内容保持高度一致。这使得传统的IML方法效果不佳，因为它们主要依赖于伪影检测而非语义敏感性。为了解决这一问题，我们提出了语义操控定位（Semantic Manipulation Localization, SML），这是一个新的任务，专注于定位显著改变图像解释的微妙语义编辑。我们进一步构建了一个专门的细粒度基准，用于SML，采用基于语义驱动的操控管道和像素级注释。在此任务基础上，我们提出了TRACE（Targeted Reasoning of Attributed Cognitive Edits），一个端到端框架，通过三个逐步耦合的组件建模语义敏感性：语义锚定、语义扰动感知和语义约束推理。具体而言，TRACE首先识别支持图像理解的语义重要区域，然后注入对扰动敏感的频率线索，以捕捉在强视觉一致性下的微妙编辑，最后通过对语义内容和语义范围的联合推理验证候选区域。大量实验表明，TRACE在我们的基准上始终优于现有的IML方法，并产生更完整、紧凑且语义一致的定位结果。这些结果证明了超越基于伪影的定位的必要性，并为复杂语义编辑场景中的图像取证提供了新的方向。

View on arXiv Download PDF AI Translation

cs.CV / 91 / 2604.10167

Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

视觉后期分块：上下文分块在高效视觉文档检索中的实证研究

Yan, Yibo, Ou, Mingdong, Cao, Yi, Huo, Jiahao, Zou, Xin, Liu, Shuliang, Kwok, James, Hu, Xuming

Abstract

Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping allows for a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate ColChunk achieves over a 90% reduction in storage requirements while simultaneously delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems.

Chinese Translation

多向量模型因其细粒度匹配能力而主导视觉文档检索（VDR），但其高存储和计算成本成为实际应用的主要障碍。本文提出了ColChunk，一个即插即用的框架，引入了多模态后期分块，以构建高效的上下文化多向量。与现有的剪枝或固定令牌方法不同，ColChunk在补丁级嵌入上采用层次聚类，并融合了二维位置先验，以确保空间-语义一致性。这种自适应分组允许生成内容感知的表示，保留全局上下文，同时大幅减少向量数量。在24个VDR数据集上的评估表明，ColChunk在存储需求上实现了超过90%的减少，同时在代表性的单向量模型中，nDCG@5的平均提升达9分。ColChunk为在视觉文档系统中平衡检索准确性和效率提供了实用解决方案。

View on arXiv Download PDF AI Translation

cs.CV / 92 / 2604.10188

Radiology Report Generation for Low-Quality X-Ray Images

低质量X射线图像的放射学报告生成

Zhu, Hongze, Hu, Chen, Jiang, Jiaxuan, Liu, Hong, Huang, Yawen, Hu, Ming, Wang, Tianyu, Wu, Zhijian, Zheng, Yefeng

Abstract

Vision-Language Models (VLMs) have significantly advanced automated Radiology Report Generation (RRG). However, existing methods implicitly assume high-quality inputs, overlooking the noise and artifacts prevalent in real-world clinical environments. Consequently, current models exhibit severe performance degradation when processing suboptimal images. To bridge this gap, we propose a robust report generation framework explicitly designed for image quality variations. We first introduce an Automated Quality Assessment Agent (AQAA) to identify low-quality samples within the MIMIC-CXR dataset and establish the Low-quality Radiology Report Generation (LRRG) benchmark. To tackle degradation-induced shifts, we propose a novel Dual-loop Training Strategy leveraging bi-level optimization and gradient consistency. This approach ensures the model learns quality-agnostic diagnostic features by aligning gradient directions across varying quality regimes. Extensive experiments demonstrate that our approach effectively mitigates model performance degradation caused by image quality deterioration. The code and data will be released upon acceptance.

Chinese Translation

视觉-语言模型（VLMs）在自动化放射学报告生成（RRG）方面取得了显著进展。然而，现有方法隐含地假设输入为高质量图像，忽视了现实临床环境中普遍存在的噪声和伪影。因此，当前模型在处理次优图像时表现出严重的性能下降。为了解决这一问题，我们提出了一种专门针对图像质量变化设计的稳健报告生成框架。我们首先引入了一种自动化质量评估代理（AQAA），用于识别MIMIC-CXR数据集中低质量样本，并建立低质量放射学报告生成（LRRG）基准。为了应对因质量下降引起的变化，我们提出了一种新颖的双循环训练策略，利用双层优化和梯度一致性。这种方法通过对齐不同质量状态下的梯度方向，确保模型学习到与质量无关的诊断特征。大量实验表明，我们的方法有效减轻了因图像质量下降导致的模型性能退化。代码和数据将在接受后发布。

View on arXiv Download PDF AI Translation

cs.CV / 93 / 2604.10210

A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

A3-FPN：用于密集视觉预测的渐近内容感知金字塔注意力网络

Qin, Meng'en, Song, Yu, Zhao, Quanling, Yang, Xiaodong, Che, Yingtao, Yang, Xiaohui

Abstract

Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason-ching/A3-FPN.

Chinese Translation

学习多尺度表示是解决密集预测任务中目标尺度变化的常用策略。尽管现有的特征金字塔网络在视觉识别方面取得了显著进展，但其固有的设计缺陷限制了其捕获判别特征和识别小目标的能力。在本工作中，我们提出了渐近内容感知金字塔注意力网络（Asymptotic Content-Aware Pyramid Attention Network，A3-FPN），通过渐近解耦框架和内容感知注意力模块增强多尺度特征表示。具体而言，A3-FPN采用横向扩展的列网络，实现渐近的全局特征交互，并将每个层级从所有层次表示中解耦。在特征融合阶段，它从相邻层级收集补充内容，生成位置相关的偏移和权重以进行上下文感知的重采样，同时学习深层上下文重加权以提升类内相似性。在特征重组阶段，进一步强化同尺度判别特征学习，并基于特征图的信息内容和空间变化重组冗余特征。在MS COCO、VisDrone2019-DET和Cityscapes上的大量实验表明，A3-FPN能够轻松集成到最先进的基于CNN和Transformer的架构中，带来显著的性能提升。值得注意的是，结合OneFormer和Swin-L主干网络时，A3-FPN在MS COCO上实现了49.6的掩码AP，在Cityscapes上达到了85.6的mIoU。代码已开源，地址为https://github.com/mason-ching/A3-FPN。

View on arXiv Download PDF AI Translation

cs.CV / 94 / 2604.10217

Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?

预训练图像匹配器是否足够适用于SAR-光学卫星配准？

Corley, Isaac, Stoken, Alex, Berton, Gabriele

Abstract

Cross-modal optical-SAR (Synthetic Aperture Radar) registration is a bottleneck for disaster-response via remote sensing, yet modern image matchers are developed and benchmarked almost exclusively on natural-image domains. We evaluate twenty-four pretrained matcher families--in a zero-shot setting with no fine-tuning or domain adaptation on satellite or SAR data--on SpaceNet9 and two additional cross-modal benchmarks under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Our results reveal asymmetric transfer--matchers with explicit cross-modal training do not uniformly outperform those without it. While XoFTR (trained for visible-thermal matching) and RoMa achieve the lowest reported mean error at $3.0$ px on the labeled SpaceNet9 training scenes, RoMa achieves this without any cross-modal training, and MatchAnything-ELoFTR ($3.4$ px)--trained on synthetic cross-modal pairs--matches closely, suggesting (as a working hypothesis) that foundation-model features (DINOv2) may contribute to modality invariance that partially substitutes for explicit cross-modal supervision. 3D-reconstruction matchers (MASt3R, DUSt3R), which are not designed for traditional 2D image matching, are highly protocol-sensitive and remain fragile under default settings. Deployment protocol choices (geometry model, tile size, inlier gating) shift accuracy by up to $33\times$ for a single matcher, sometimes exceeding the effect of swapping matchers entirely within the evaluated sweep--affine geometry alone reduces mean error from $12.34$ to $9.74$ px. These findings inform both practical deployment of existing matchers and future matcher design for cross-modal satellite registration.

Chinese Translation

跨模态光学-SAR（合成孔径雷达）配准是通过遥感进行灾害响应的瓶颈，然而现代图像匹配器几乎专门在自然图像领域进行开发和基准测试。我们在SpaceNet9及另外两个跨模态基准上评估了二十四个预训练匹配器系列，采用零样本设置，未对卫星或SAR数据进行微调或领域适应，使用确定性协议进行大图像分块推理、稳健的几何过滤和基于关键点的度量。我们的结果揭示了不对称的迁移——具有显式跨模态训练的匹配器并不总是优于没有此训练的匹配器。虽然XoFTR（为可见光-热成像匹配训练）和RoMa在标记的SpaceNet9训练场景中达到了最低报告均值误差$3.0$ px，但RoMa在没有任何跨模态训练的情况下实现了这一点，而MatchAnything-ELoFTR（$3.4$ px）——在合成跨模态对上训练——也表现接近，这表明（作为一个工作假设）基础模型特征（DINOv2）可能有助于模态不变性，部分替代显式跨模态监督。3D重建匹配器（MASt3R, DUSt3R）并非为传统的2D图像匹配设计，对协议高度敏感，并在默认设置下保持脆弱。部署协议选择（几何模型、图块大小、内点门控）使得单个匹配器的准确性变化高达$33 imes$，有时超过在评估范围内完全更换匹配器的效果——仅仿射几何就将均值误差从$12.34$降至$9.74$ px。这些发现为现有匹配器的实际部署和未来跨模态卫星配准匹配器的设计提供了指导。

View on arXiv Download PDF AI Translation

cs.CV / 95 / 2604.10218

SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation

SMFormer：通过基础模型与数据增强赋能自监督立体匹配

Wang, Yun, Yang, Zhengjie, Zheng, Jiahao, Zhang, Zhanjie, Wu, Dapeng Oliver, Guo, Yulan

Abstract

Recent self-supervised stereo matching methods have made significant progress. They typically rely on the photometric consistency assumption, which presumes corresponding points across views share the same appearance. However, this assumption could be compromised by real-world disturbances, resulting in invalid supervisory signals and a significant accuracy gap compared to supervised methods. To address this issue, we propose SMFormer, a framework integrating more reliable self-supervision guided by the Vision Foundation Model (VFM) and data augmentation. We first incorporate the VFM with the Feature Pyramid Network (FPN), providing a discriminative and robust feature representation against disturbance in various scenarios. We then devise an effective data augmentation mechanism that ensures robustness to various transformations. The data augmentation mechanism explicitly enforces consistency between learned features and those influenced by illumination variations. Additionally, it regularizes the output consistency between disparity predictions of strong augmented samples and those generated from standard samples. Experiments on multiple mainstream benchmarks demonstrate that our SMFormer achieves state-of-the-art (SOTA) performance among self-supervised methods and even competes on par with supervised ones. Remarkably, in the challenging Booster benchmark, SMFormer even outperforms some SOTA supervised methods, such as CFNet.

Chinese Translation

近年来，自监督立体匹配方法取得了显著进展。此类方法通常依赖于光度一致性假设，即假设不同视角下的对应点具有相同的外观。然而，该假设在现实环境中可能受到干扰，导致监督信号失效，且与监督方法相比存在显著的精度差距。为解决该问题，我们提出了SMFormer框架，该框架结合了视觉基础模型（Vision Foundation Model，VFM）指导下的更可靠自监督机制与数据增强。首先，我们将VFM与特征金字塔网络（Feature Pyramid Network，FPN）结合，提供在多种场景下对干扰具有判别力和鲁棒性的特征表示。随后，我们设计了一种有效的数据增强机制，确保模型对各种变换的鲁棒性。该机制明确强化了学习特征与受光照变化影响特征之间的一致性，同时对强增强样本与标准样本生成的视差预测结果之间的输出一致性进行正则化。在多个主流基准测试中，SMFormer在自监督方法中实现了最先进（SOTA）的性能，甚至能够与监督方法相媲美。值得注意的是，在具有挑战性的Booster基准测试中，SMFormer甚至优于部分最先进的监督方法，如CFNet。

View on arXiv Download PDF AI Translation

cs.CV / 96 / 2604.10233

Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

将二维多模态大语言模型适应于三维CT图像分析

Yu, Yang, Xu, Dunyuan, Li, Yaoqian, Li, Xiaomeng, Li, Jinpeng, Heng, Pheng-Ann

Abstract

3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. Therefore, they have great potential to improve the performance of medical report generation (MRG) and medical visual question answering (MVQA), which serve as two important tasks in clinical scenarios. However, due to the scarcity of 3D medical images, existing 3D medical MLLMs suffer from insufficiently pretrained vision encoder and inability to extract customized image features for different kinds of tasks. In this paper, we propose to first transfer a 2D MLLM, which is well trained with 2D natural images, to support 3D medical volumetric inputs while reusing all of its pre-trained parameters. To enable the vision encoder to extract tailored image features for various tasks, we then design a Text-Guided Hierarchical MoE (TGH-MoE) framework, which can distinguish tasks under the guidance of the text prompt. Furthermore, we propose a two-stage training strategy to learn both task-shared and task-specific image features. As demonstrated empirically, our method outperforms existing 3D medical MLLMs in both MRG and MVQA tasks. Our code will be released once this paper is accepted.

Chinese Translation

三维医学图像分析在疾病诊断和治疗中具有重要意义。近年来，多模态大语言模型（MLLMs）展现了强大的感知能力、良好的跨模态对齐性和良好的泛化能力。因此，它们在提升医学报告生成（MRG）和医学视觉问答（MVQA）这两项临床任务的表现方面具有巨大潜力。然而，由于三维医学图像的稀缺，现有的三维医学MLLMs面临着视觉编码器预训练不足以及无法为不同类型任务提取定制图像特征的问题。本文提出首先将一个经过二维自然图像良好训练的二维MLLM转移，以支持三维医学体积输入，同时重用其所有预训练参数。为了使视觉编码器能够为各种任务提取量身定制的图像特征，我们设计了一个文本引导的层次化MoE（TGH-MoE）框架，该框架能够在文本提示的指导下区分任务。此外，我们提出了一种两阶段训练策略，以学习任务共享和任务特定的图像特征。实证结果表明，我们的方法在MRG和MVQA任务中优于现有的三维医学MLLMs。我们的代码将在论文被接受后发布。

View on arXiv Download PDF AI Translation

cs.CV / 97 / 2604.10242

MedVeriSeg: Teaching MLLM-Based Medical Segmentation Models to Verify Query Validity Without Extra Training

MedVeriSeg：基于MLLM的医学分割模型在无额外训练下实现查询有效性验证

Lu, Ziqian, Tong, Qinyue, Liu, Jun, Yu, Yunlong

Abstract

Despite recent advances in MLLM-based medical image segmentation, existing LISA-like methods cannot reliably reject false queries and often produce hallucinated segmentation masks for absent targets. This limitation reduces practical reliability in both medical education and clinical use. In this work, we propose MedVeriSeg, a training-free verification framework that equips LISA-like medical segmentation models with the ability to identify and reject false queries which contain non-existent targets. Our key observation is that the similarity map between the [SEG] token feature and MLLM image features exhibits markedly different distribution patterns for true and false queries. Based on this, we introduce a Similarity Response Quality Scoring Module that characterizes the similarity map from three aspects: strength, compactness, and purity, producing an initial target-existence prediction. We further incorporate qualitative visual evidence by using GPT-4o to jointly assess the similarity heatmap and the results of Similarity Response Quality Scoring Module for final verification. Experiments on a small-scale benchmark constructed from SA-Med2D-20M show that MedVeriSeg effectively rejects false-query segmentation requests while maintaining reliable recognition of true queries.

Chinese Translation

尽管基于多模态大模型（MLLM）的医学图像分割技术取得了最新进展，现有的类似LISA的方法仍无法可靠地拒绝错误查询，且常常对不存在的目标产生虚假分割掩码。这一限制降低了其在医学教育和临床应用中的实际可靠性。本文提出了MedVeriSeg，一种无需训练的验证框架，使得类似LISA的医学分割模型具备识别并拒绝包含不存在目标的错误查询的能力。我们的关键观察是，[SEG]标记特征与MLLM图像特征之间的相似度图在真实查询与错误查询中表现出显著不同的分布模式。基于此，我们引入了相似度响应质量评分模块，从强度、紧凑性和纯净度三个方面刻画相似度图，生成初步的目标存在性预测。随后，我们利用GPT-4o结合相似度热图和相似度响应质量评分模块的结果进行联合评估，以提供最终验证。基于从SA-Med2D-20M构建的小规模基准实验表明，MedVeriSeg能够有效拒绝错误查询的分割请求，同时保持对真实查询的可靠识别。

View on arXiv Download PDF AI Translation

cs.CV / 98 / 2604.10245

Warm-Started Reinforcement Learning for Iterative 3D/2D Liver Registration

温启动强化学习用于迭代3D/2D肝脏配准

Zhang, Hanyuan, He, Lucas, Cheng, Zijie, Kadkhodamohammadi, Abdolrahim, Stoyanov, Danail, Davidson, Brian R., Mazomenos, Evangeles B., Clarkson, Matthew. J

Abstract

Registration between preoperative CT and intraoperative laparoscopic video plays a crucial role in augmented reality (AR) guidance for minimally invasive surgery. Learning-based methods have recently achieved registration errors comparable to optimization-based approaches while offering faster inference. However, many supervised methods produce coarse alignments that rely on additional optimization-based refinement, thereby increasing inference time. We present a discrete-action reinforcement learning (RL) framework that formulates CT-to-video registration as a sequential decision-making process. A shared feature encoder, warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, extracts representations from CT renderings and laparoscopic frames, while an RL policy head learns to choose rigid transformations along six degrees of freedom and to decide when to stop the iteration. Experiments on a public laparoscopic dataset demonstrated that our method achieved an average target registration error (TRE) of 15.70 mm, comparable to supervised approaches with optimization, while achieving faster convergence. The proposed RL-based formulation enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria. This discrete framework provides a practical foundation for future continuous-action and deformable registration models in surgical AR applications.

Chinese Translation

术前CT与术中腹腔镜视频之间的配准在微创手术的增强现实（AR）指导中起着至关重要的作用。基于学习的方法最近在配准误差上已达到与基于优化的方法相当的水平，同时提供了更快的推理速度。然而，许多监督学习方法产生的粗略对齐依赖于额外的基于优化的细化，从而增加了推理时间。我们提出了一种离散动作强化学习（RL）框架，将CT到视频的配准形式化为一个顺序决策过程。一个共享特征编码器从一个监督姿态估计网络进行温启动，以提供稳定的几何特征和更快的收敛速度，从CT渲染和腹腔镜帧中提取表示，而RL策略头则学习选择沿六个自由度的刚性变换，并决定何时停止迭代。在一个公共腹腔镜数据集上的实验表明，我们的方法实现了平均目标配准误差（TRE）为15.70毫米， comparable于带有优化的监督方法，同时实现了更快的收敛。所提出的基于RL的公式化使得无需手动调整步长或停止标准即可实现自动化、高效的迭代配准。该离散框架为未来在外科AR应用中的连续动作和可变形配准模型提供了实用的基础。

View on arXiv Download PDF AI Translation

cs.CV / 99 / 2604.10246

A Comparison of Multi-View Stereo Methods for Photogrammetric 3D Reconstruction: From Traditional to Learning-Based Approaches

多视角立体匹配方法在摄影测量三维重建中的比较：从传统方法到基于学习的方法

Li, Yawen, Vosselman, George, Nex, Francesco

Abstract

Photogrammetric 3D reconstruction has long relied on traditional Structure-from-Motion (SfM) and Multi-View Stereo (MVS) methods, which provide high accuracy but face challenges in speed and scalability. Recently, learning-based MVS methods have emerged, aiming for faster and more efficient reconstruction. This work presents a comparative evaluation between a representative traditional MVS pipeline (COLMAP) and state-of-the-art learning-based approaches, including geometry-guided methods (MVSNet, PatchmatchNet, MVSAnywhere, MVSFormer++) and end-to-end frameworks (Stereo4D, FoundationStereo, DUSt3R, MASt3R, Fast3R, VGGT). Two experiments were conducted on different aerial scenarios. The first experiment used the MARS-LVIG dataset, where ground-truth 3D reconstruction was provided by LiDAR point clouds. The second experiment used a public scene from the Pix4D official website, with ground truth generated by Pix4Dmapper. We evaluated accuracy, coverage, and runtime across all methods. Experimental results show that although COLMAP can provide reliable and geometrically consistent reconstruction results, it requires more computation time. In cases where traditional methods fail in image registration, learning-based approaches exhibit stronger feature-matching capability and greater robustness. Geometry-guided methods usually require careful dataset preparation and often depend on camera pose or depth priors generated by COLMAP. End-to-end methods such as DUSt3R and VGGT achieve competitive accuracy and reasonable coverage while offering substantially faster reconstruction. However, they exhibit relatively large residuals in 3D reconstruction, particularly in challenging scenarios.

Chinese Translation

摄影测量三维重建长期以来依赖于传统的结构光束法（Structure-from-Motion, SfM）和多视角立体匹配（Multi-View Stereo, MVS）方法，这些方法虽能提供高精度，但在速度和可扩展性方面存在挑战。近年来，基于学习的MVS方法应运而生，旨在实现更快速且高效的重建。本文对代表性的传统MVS流程（COLMAP）与多种先进的基于学习的方法进行了比较评估，后者包括几何引导方法（MVSNet、PatchmatchNet、MVSAnywhere、MVSFormer++）和端到端框架（Stereo4D、FoundationStereo、DUSt3R、MASt3R、Fast3R、VGGT）。实验在两种不同的航拍场景中进行：第一组实验采用MARS-LVIG数据集，利用激光雷达点云提供的三维重建真值；第二组实验使用Pix4D官方网站提供的公开场景，真值由Pix4Dmapper生成。我们对所有方法的精度、覆盖率及运行时间进行了评估。实验结果表明，尽管COLMAP能够提供可靠且几何一致的重建结果，但其计算时间较长。在传统方法图像配准失败的情况下，基于学习的方法表现出更强的特征匹配能力和更高的鲁棒性。几何引导方法通常需要精心准备数据集，且往往依赖COLMAP生成的相机位姿或深度先验。端到端方法如DUSt3R和VGGT在保持竞争性精度和合理覆盖率的同时，显著提升了重建速度，但在复杂场景中三维重建残差较大。

View on arXiv Download PDF AI Translation

cs.CV / 100 / 2604.10259

Real-Time Human Reconstruction and Animation using Feed-Forward Gaussian Splatting

基于前馈高斯点云的实时人体重建与动画

Chatterjee, Devdoot, Laskar, Zakaria, Jawahar, C. V.

Abstract

We present a generalizable feed-forward Gaussian splatting framework for human 3D reconstruction and real-time animation that operates directly on multi-view RGB images and their associated SMPL-X poses. Unlike prior methods that rely on depth supervision, fixed input views, UV map, or repeated feed-forward inference for each target view or pose, our approach predicts, in a canonical pose, a set of 3D Gaussian primitives associated with each SMPL-X vertex. One Gaussian is regularized to remain close to the SMPL-X surface, providing a strong geometric prior and stable correspondence to the parametric body model, while an additional small set of unconstrained Gaussians per vertex allows the representation to capture geometric structures that deviate from the parametric surface, such as clothing and hair. In contrast to recent approaches such as HumanRAM, which require repeated network inference to synthesize novel poses, our method produces an animatable human representation from a single forward pass; by explicitly associating Gaussian primitives with SMPL-X vertices, the reconstructed model can be efficiently animated via linear blend skinning without further network evaluation. We evaluate our method on the THuman 2.1, AvatarReX and THuman 4.0 datasets, where it achieves reconstruction quality comparable to state-of-the-art methods while uniquely supporting real-time animation and interactive applications. Code and pre-trained models are available at https://github.com/Devdoot57/HumanGS .

Chinese Translation

我们提出了一种通用的前馈高斯点云框架，用于人体三维重建和实时动画，该方法直接作用于多视角RGB图像及其关联的SMPL-X姿态。与以往依赖深度监督、固定输入视角、UV映射或针对每个目标视角或姿态重复前馈推理的方法不同，我们的方法在规范姿态下预测与每个SMPL-X顶点相关联的一组三维高斯基元。通过对一个高斯基元进行正则化，使其保持接近SMPL-X表面，提供了强有力的几何先验和与参数化人体模型的稳定对应关系；而每个顶点额外的一小组无约束高斯基元则允许该表示捕捉偏离参数化表面的几何结构，如服装和头发。与近期如HumanRAM等需要重复网络推理以合成新姿态的方法相比，我们的方法仅需一次前向传播即可生成可动画的人体表示；通过将高斯基元显式关联至SMPL-X顶点，重建模型可通过线性混合蒙皮高效动画，无需额外的网络评估。我们在THuman 2.1、AvatarReX和THuman 4.0数据集上评估了该方法，结果显示其重建质量可与最先进方法媲美，同时独特地支持实时动画和交互式应用。代码及预训练模型可在https://github.com/Devdoot57/HumanGS获取。

View on arXiv Download PDF AI Translation

cs.CV / 101 / 2604.10268

EditCrafter: Tuning-free High-Resolution Image Editing via Pretrained Diffusion Model

EditCrafter：无调优的高分辨率图像编辑方法，基于预训练的扩散模型

Kim, Kunho, Seo, Sumin, Cho, Yongjun, Chung, Hyungjin

Abstract

We propose EditCrafter, a high-resolution image editing method that operates without tuning, leveraging pretrained text-to-image (T2I) diffusion models to process images at resolutions significantly exceeding those used during training. Leveraging the generative priors of large-scale T2I diffusion models enables the development of a wide array of novel generation and editing applications. Although numerous image editing methods have been proposed based on diffusion models and exhibit high-quality editing results, they are difficult to apply to images with arbitrary aspect ratios or higher resolutions since they only work at the training resolutions (512x512 or 1024x1024). Naively applying patch-wise editing fails with unrealistic object structures and repetition. To address these challenges, we introduce EditCrafter, a simple yet effective editing pipeline. EditCrafter operates by first performing tiled inversion, which preserves the original identity of the input high-resolution image. We further propose a noise-damped manifold-constrained classifier-free guidance (NDCFG++) that is tailored for high resolution image editing from the inverted latent. Our experiments show that the our EditCrafter can achieve impressive editing results across various resolutions without fine-tuning and optimization.

Chinese Translation

我们提出了EditCrafter，这是一种无须调优的高分辨率图像编辑方法，利用预训练的文本到图像（T2I）扩散模型处理分辨率显著高于训练时使用的图像。利用大规模T2I扩散模型的生成先验，使得开发各种新颖的生成和编辑应用成为可能。尽管已有许多基于扩散模型的图像编辑方法被提出，并展现出高质量的编辑效果，但由于它们仅在训练分辨率（512x512或1024x1024）下工作，因此难以应用于任意纵横比或更高分辨率的图像。简单地进行块状编辑会导致不现实的物体结构和重复现象。为了解决这些挑战，我们引入了EditCrafter，一个简单而有效的编辑流程。EditCrafter首先通过平铺反演来操作，保持输入高分辨率图像的原始特征。我们进一步提出了一种针对反演潜变量的噪声阻尼流形约束无分类器引导（NDCFG++），专为高分辨率图像编辑而设计。我们的实验表明，EditCrafter能够在不同分辨率下实现令人印象深刻的编辑效果，而无需进行微调和优化。

View on arXiv Download PDF AI Translation

cs.CV / 102 / 2604.10273

Dual-Exposure Imaging with Events

事件驱动的双重曝光成像

Lin, Mingyuan, Liu, Hongyi, He, Chu, Yang, Wen, Xia, Gui-Song, Yu, Lei

Abstract

By combining complementary benefits of short- and long-exposure images, Dual-Exposure Imaging (DEI) enhances image quality in low-light scenarios. However, existing DEI approaches inevitably suffer from producing artifacts due to spatial displacement from scene motion and image feature discrepancies from different exposure times. To tackle this problem, we propose a novel Event-based DEI (E-DEI) algorithm, which reconstructs high-quality images from dual-exposure image pairs and events, leveraging high temporal resolution of event cameras to provide accurate inter-/intra-frame dynamic information. Specifically, we decompose this complex task into an integration of two sub-tasks, i.e., event-based motion deblurring and low-light image enhancement tasks, which guides us to design E-DEI network as a dual-path parallel feature propagation architecture. We propose a Dual-path Feature Alignment and Fusion (DFAF) module to effectively align and fuse features extracted from dual-exposure images with assistance of events. Furthermore, we build a real-world Dataset containing Paired low-/normal-light Images and Events (PIED). Experiments on multiple datasets show the superiority of our method. The code and dataset are available at github.

Chinese Translation

通过结合短曝光和长曝光图像的互补优势，双重曝光成像（Dual-Exposure Imaging, DEI）在低光照场景中提升了图像质量。然而，现有的DEI方法不可避免地会因场景运动导致的空间位移和不同曝光时间下图像特征的不一致而产生伪影。为了解决这一问题，我们提出了一种新颖的基于事件的双重曝光成像（Event-based DEI, E-DEI）算法，该算法通过双重曝光图像对和事件重建高质量图像，利用事件相机的高时间分辨率提供准确的帧间/帧内动态信息。具体而言，我们将这一复杂任务分解为两个子任务的整合，即基于事件的运动去模糊和低光照图像增强任务，这指导我们设计E-DEI网络为双路径并行特征传播架构。我们提出了一个双路径特征对齐与融合（Dual-path Feature Alignment and Fusion, DFAF）模块，以有效对齐和融合从双重曝光图像中提取的特征，并借助事件进行辅助。此外，我们构建了一个包含配对低光/正常光图像和事件的真实世界数据集（Paired low-/normal-light Images and Events, PIED）。在多个数据集上的实验表明我们方法的优越性。代码和数据集可在github上获取。

View on arXiv Download PDF AI Translation

cs.CV / 103 / 2604.10275

FastSHADE: Fast Self-augmented Hierarchical Asymmetric Denoising for Efficient inference on mobile devices

FastSHADE：面向移动设备高效推理的快速自增强分层非对称去噪方法

Falaleev, Nikolay

Abstract

Real-time image denoising is essential for modern mobile photography but remains challenging due to the strict latency and power constraints of edge devices. This paper presents FastSHADE (Fast Self-augmented Hierarchical Asymmetric Denoising), a lightweight U-Net-style network tailored for real-time, high-fidelity restoration on mobile GPUs. Our method features a multi-stage architecture incorporating a novel Asymmetric Frequency Denoising Block (AFDB) that decouples spatial structure extraction from high-frequency noise suppression to maximize efficiency, and a Spatially Gated Upsampler (SGU) that optimizes high-resolution skip connection fusion. To address generalization, we introduce an efficient Noise Shifting Self-Augmentation strategy that enhances data diversity without inducing domain shifts. Evaluations on the MAI2021 benchmark demonstrate that our scalable model family establishes a highly efficient speed-fidelity trade-off. Our base FastSHADE-M variant maintains real-time latency (<50 ms on a modern mobile GPU) while preserving structural integrity, and our scaled-up FastSHADE-XL establishes a new state-of-the-art for overall image quality. Ultimately, FastSHADE successfully bridges the gap between theoretical network efficiency and practical deployment for real-world mobile ISP pipelines.

Chinese Translation

实时图像去噪对于现代移动摄影至关重要，但由于边缘设备严格的延迟和功耗限制，仍然具有挑战性。本文提出了FastSHADE（Fast Self-augmented Hierarchical Asymmetric Denoising），一种轻量级的U-Net风格网络，专为移动GPU上的实时高保真恢复设计。我们的方法采用多阶段架构，融合了新颖的非对称频率去噪模块（Asymmetric Frequency Denoising Block, AFDB），该模块将空间结构提取与高频噪声抑制解耦以最大化效率，以及空间门控上采样器（Spatially Gated Upsampler, SGU），优化高分辨率跳跃连接的融合。为提升泛化能力，我们引入了一种高效的噪声迁移自增强策略（Noise Shifting Self-Augmentation），在不引入域偏移的情况下增强数据多样性。在MAI2021基准测试中的评估表明，我们的可扩展模型系列实现了高效的速度与保真度权衡。基础版本FastSHADE-M在现代移动GPU上保持实时延迟（<50毫秒）同时保持结构完整性，扩展版本FastSHADE-XL则在整体图像质量上创下了新的最先进水平。最终，FastSHADE成功弥合了理论网络效率与实际移动ISP流水线部署之间的差距。

View on arXiv Download PDF AI Translation

cs.CV / 104 / 2604.10297

FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data

FashionMV：基于多视角时尚数据的产品级组合图像检索

Yuan, Peng, Mei, Bingyin, Zhang, Hui

Abstract

Composed Image Retrieval (CIR) retrieves target images using a reference image paired with modification text. Despite rapid advances, all existing methods and datasets operate at the image level -- a single reference image plus modification text in, a single target image out -- while real e-commerce users reason about products shown from multiple viewpoints. We term this mismatch View Incompleteness and formally define a new Multi-View CIR task that generalizes standard CIR from image-level to product-level retrieval. To support this task, we construct FashionMV, the first large-scale multi-view fashion dataset for product-level CIR, comprising 127K products, 472K multi-view images, and over 220K CIR triplets, built through a fully automated pipeline leveraging large multimodal models. We further propose ProCIR (Product-level Composed Image Retrieval), a modeling framework built upon a multimodal large language model that employs three complementary mechanisms -- two-stage dialogue, caption-based alignment, and chain-of-thought guidance -- together with an optional supervised fine-tuning (SFT) stage that injects structured product knowledge prior to contrastive training. Systematic ablation across 16 configurations on three fashion benchmarks reveals that: (1) alignment is the single most critical mechanism; (2) the two-stage dialogue architecture is a prerequisite for effective alignment; and (3) SFT and chain-of-thought serve as partially redundant knowledge injection paths. Our best 0.8B-parameter model outperforms all baselines, including general-purpose embedding models 10x its size. The dataset, model, and code are publicly available at https://github.com/yuandaxia2001/FashionMV.

Chinese Translation

组合图像检索（CIR）通过与修改文本配对的参考图像来检索目标图像。尽管技术迅速发展，所有现有方法和数据集均在图像级别上操作——输入一个参考图像和修改文本，输出一个目标图像——而真实的电子商务用户则从多个视角推理所展示的产品。我们将这种不匹配称为视角不完整性，并正式定义了一项新的多视角CIR任务，该任务将标准CIR从图像级别推广到产品级别的检索。为了支持这一任务，我们构建了FashionMV，这是第一个大规模多视角时尚数据集，专为产品级CIR而设计，包含127K个产品、472K个多视角图像和超过220K个CIR三元组，采用完全自动化的流程构建，利用大型多模态模型。我们进一步提出了ProCIR（产品级组合图像检索），这是一个基于多模态大型语言模型的建模框架，采用三种互补机制——两阶段对话、基于标题的对齐和思维链引导——以及一个可选的监督微调（SFT）阶段，在对比训练之前注入结构化的产品知识。在三个时尚基准上的16种配置的系统性消融实验表明：（1）对齐是最关键的机制；（2）两阶段对话架构是有效对齐的前提；（3）SFT和思维链作为部分冗余的知识注入路径。我们最佳的0.8B参数模型在性能上超越了所有基线，包括其大小的10倍的通用嵌入模型。数据集、模型和代码可在https://github.com/yuandaxia2001/FashionMV上公开获取。

View on arXiv Download PDF AI Translation

cs.CV / 105 / 2604.10299

Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking

视而不见：通过对抗性注意力劫持使大型视觉-语言模型对安全指令失明

Li, Jingru, Ren, Wei, Zhu, Tianqing

Abstract

Large Vision-Language Models (LVLMs) rely on attention-based retrieval of safety instructions to maintain alignment during generation. Existing attacks typically optimize image perturbations to maximize harmful output likelihood, but suffer from slow convergence due to gradient conflict between adversarial objectives and the model's safety-retrieval mechanism. We propose Attention-Guided Visual Jailbreaking, which circumvents rather than overpowers safety alignment by directly manipulating attention patterns. Our method introduces two simple auxiliary objectives: (1) suppressing attention to alignment-relevant prefix tokens and (2) anchoring generation on adversarial image features. This simple yet effective push-pull formulation reduces gradient conflict by 45% and achieves 94.4% attack success rate on Qwen-VL (vs. 68.8% baseline) with 40% fewer iterations. At tighter perturbation budgets ($\epsilon=8/255$), we maintain 59.0% ASR compared to 45.7% for standard methods. Mechanistic analysis reveals a failure mode we term safety blindness: successful attacks suppress system-prompt attention by 80%, causing models to generate harmful content not by overriding safety rules, but by failing to retrieve them.

Chinese Translation

大型视觉-语言模型（LVLMs）依赖基于注意力的安全指令检索以在生成过程中保持对齐。现有攻击方法通常通过优化图像扰动以最大化有害输出的可能性，但由于对抗目标与模型安全检索机制之间的梯度冲突，收敛速度较慢。我们提出了注意力引导的视觉越狱（Attention-Guided Visual Jailbreaking），该方法通过直接操控注意力模式来规避而非压制安全对齐。我们的方法引入了两个简单的辅助目标：（1）抑制对与对齐相关的前缀标记的注意力；（2）将生成锚定在对抗性图像特征上。这种简单而有效的推拉式策略减少了45%的梯度冲突，在Qwen-VL上实现了94.4%的攻击成功率（基线为68.8%），且迭代次数减少了40%。在更严格的扰动预算（$ extbackslash epsilon=8/255$）下，我们仍保持59.0%的攻击成功率，而标准方法为45.7%。机制分析揭示了一种我们称之为“安全盲视”的失败模式：成功的攻击通过抑制系统提示的注意力80%，导致模型生成有害内容，这并非通过覆盖安全规则实现，而是由于未能检索到这些规则。

View on arXiv Download PDF AI Translation

cs.CV / 106 / 2604.10303

AC-MIL: Weakly Supervised Atrial LGE-MRI Quality Assessment via Adversarial Concept Disentanglement

AC-MIL：基于对抗性概念解耦的弱监督心房LGE-MRI质量评估

Sultan, K M Arefeen, Hansen, Kaysen, Orkild, Benjamin, Morris, Alan, Kholmovski, Eugene, Bieging, Erik, Kwan, Eugene, Ranjan, Ravi, DiBella, Ed, Elhabian, Shireen

Abstract

High-quality Late Gadolinium Enhancement (LGE) MRI can be helpful for atrial fibrillation management, yet scan quality is frequently compromised by patient motion, irregular breathing, and suboptimal image acquisition timing. While Multiple Instance Learning (MIL) has emerged as a powerful tool for automated quality assessment under weak supervision, current state-of-the-art methods map localized visual evidence to a single, opaque global feature vector. This black box approach fails to provide actionable feedback on specific failure modes, obscuring whether a scan degrades due to motion blur, inadequate contrast, or a lack of anatomical context. In this paper, we propose Adversarial Concept-MIL (AC-MIL), a weakly supervised framework that decomposes global image quality into clinically defined radiological concepts using only volume-level supervision. To capture latent quality variations without entangling predefined concepts, our framework incorporates an unsupervised residual branch guided by an adversarial erasure mechanism to strictly prevent information leakage. Furthermore, we introduce a spatial diversity constraint that penalizes overlap between distinct concept attention maps, ensuring localized and interpretable feature extraction. Extensive experiments on a clinical dataset of atrial LGE-MRI volumes demonstrate that AC-MIL successfully opens the MIL black box, providing highly localized spatial concept maps that allow clinicians to pinpoint the specific causes of non-diagnostic scans. Crucially, our framework achieves this deep clinical transparency while maintaining highly competitive ordinal grading performance against existing baselines. Code to be released on acceptance.

Chinese Translation

高质量的晚期钆增强（Late Gadolinium Enhancement, LGE）磁共振成像（MRI）对心房颤动的管理具有重要意义，然而扫描质量常因患者运动、不规则呼吸及图像采集时机不佳而受损。尽管多实例学习（Multiple Instance Learning, MIL）作为一种弱监督下的自动质量评估工具已展现出强大能力，现有最先进方法通常将局部视觉证据映射为单一且不透明的全局特征向量。这种“黑箱”方法无法针对具体失败模式提供可操作的反馈，难以判断扫描质量下降是由于运动模糊、对比度不足还是缺乏解剖学上下文。本文提出了一种基于对抗性概念解耦的弱监督框架——Adversarial Concept-MIL（AC-MIL），该方法仅利用体积级标签监督，将全局图像质量分解为临床定义的放射学概念。为捕捉潜在的质量变化且避免预定义概念间的混淆，框架引入了由对抗性擦除机制引导的无监督残差分支，严格防止信息泄露。此外，我们设计了空间多样性约束，惩罚不同概念注意力图的重叠，确保特征提取的局部性和可解释性。在心房LGE-MRI临床数据集上的大量实验表明，AC-MIL成功揭开了MIL的黑箱，提供高度局部化的空间概念图，使临床医生能够准确定位非诊断性扫描的具体原因。更重要的是，该框架在实现深度临床透明度的同时，保持了与现有基线方法高度竞争的序数分级性能。代码将在论文接受后公开发布。

View on arXiv Download PDF AI Translation

cs.CV / 107 / 2604.10305

Class-Adaptive Cooperative Perception for Multi-Class LiDAR-based 3D Object Detection in V2X Systems

面向多类别基于LiDAR的V2X系统三维目标检测的类别自适应协同感知

Kyem, Blessing Agyei, Asamoah, Joshua Kofi, Aboah, Armstrong

Abstract

Cooperative perception allows connected vehicles and roadside infrastructure to share sensor observations, creating a fused scene representation beyond the capability of any single platform. However, most cooperative 3D object detectors use a uniform fusion strategy for all object classes, which limits their ability to handle the different geometric structures and point-sampling patterns of small and large objects. This problem is further reinforced by narrow evaluation protocols that often emphasize a single dominant class or only a few cooperation settings, leaving robust multi-class detection across diverse vehicle-to-everything interactions insufficiently explored. To address this gap, we propose a class-adaptive cooperative perception architecture for multi-class 3D object detection from LiDAR data. The model integrates four components: multi-scale window attention with learned scale routing for spatially adaptive feature extraction, a class-specific fusion module that separates small and large objects into attentive fusion pathways, bird's-eye-view enhancement through parallel dilated convolution and channel recalibration for richer contextual representation, and class-balanced objective weighting to reduce bias toward frequent categories. Experiments on the V2X-Real benchmark cover vehicle-centric, infrastructure-centric, vehicle-to-vehicle, infrastructure-to-infrastructure, and vehicle-to-infrastructure settings under identical backbone and training configurations. The proposed method consistently improves mean detection performance over strong intermediate-fusion baselines, with the largest gains on trucks, clear improvements on pedestrians, and competitive results on cars. These results show that aligning feature extraction and fusion with class-dependent geometry and point density leads to more balanced cooperative perception in realistic vehicle-to-everything deployments.

Chinese Translation

协同感知使得联网车辆与路侧基础设施能够共享传感器观测数据，构建出超越单一平台能力的融合场景表示。然而，大多数协同三维目标检测器对所有目标类别采用统一的融合策略，限制了其处理大小目标在几何结构和点采样模式上的差异能力。该问题在评价协议狭窄的情况下尤为突出，这些协议通常侧重于单一主导类别或仅涵盖少数协同设置，导致在多样化车联网（V2X）交互中对稳健多类别检测的探索不足。为填补这一空白，本文提出了一种基于类别自适应的协同感知架构，用于基于LiDAR数据的多类别三维目标检测。该模型集成了四个组件：具备学习尺度路由的多尺度窗口注意力，实现空间自适应特征提取；类别特定融合模块，将大小目标分离至不同的注意力融合路径；通过并行扩张卷积与通道重校准的鸟瞰视图增强，丰富上下文表示；以及类别平衡的目标加权，减少对高频类别的偏倚。在V2X-Real基准上，实验涵盖以车辆为中心、以基础设施为中心、车对车、基础设施对基础设施及车对基础设施等多种设置，均采用相同的骨干网络和训练配置。所提方法在强大的中间融合基线之上持续提升平均检测性能，卡车类别提升最大，行人类别明显改善，汽车类别表现竞争力。这些结果表明，将特征提取与融合过程与类别相关的几何形态及点密度相匹配，有助于实现更均衡的协同感知，适用于现实的车联网部署。

View on arXiv Download PDF AI Translation

cs.CV / 108 / 2604.10306

SatReg: Regression-based Neural Architecture Search for Lightweight Satellite Image Segmentation

SatReg：基于回归的轻量级卫星图像分割神经架构搜索

Humes, Edward, Mohsenin, Tinoosh

Abstract

As Earth-observation workloads move toward onboard and edge processing, remote-sensing segmentation models must operate under tight latency and energy constraints. We present SatReg, a regression-based hardware-aware tuning framework for lightweight remote-sensing segmentation on edge platforms. Using CM-UNet as the teacher architecture, we reduce the search space to two dominant width-related variables, profile a small set of student models on an NVIDIA Jetson Orin Nano, and fit low-order surrogate models for mIoU, latency, and power. Knowledge distillation is used to efficiently train the sampled students. The learned surrogates enable fast selection of near-optimal architecture settings for deployment targets without exhaustive search. Results show that the selected variables affect task accuracy and hardware cost differently, making reduced-space regression a practical strategy for adapting hybrid CNN-Mamba segmentation models to future space-edge systems.

Chinese Translation

随着地球观测工作负载向机载和边缘处理转移，遥感分割模型必须在严格的延迟和能耗限制下运行。我们提出了SatReg，这是一种针对边缘平台轻量级遥感分割的基于回归的硬件感知调优框架。以CM-UNet作为教师架构，我们将搜索空间缩减为两个主要的宽度相关变量，在NVIDIA Jetson Orin Nano上对一小组学生模型进行评估，并为mIoU、延迟和功耗拟合低阶代理模型。知识蒸馏被用于高效训练所采样的学生模型。所学习的代理模型使得能够快速选择接近最佳的架构设置，以便于部署目标，而无需进行全面搜索。结果表明，所选变量对任务准确性和硬件成本的影响不同，使得减少空间的回归成为将混合CNN-Mamba分割模型适应未来空间边缘系统的实用策略。

View on arXiv Download PDF AI Translation

cs.CV / 109 / 2604.10312

Anatomy-Informed Deep Learning for Abdominal Aortic Aneurysm Segmentation

基于解剖信息的深度学习用于腹主动脉瘤分割

Sufyan, Osamah, Brückmann, Martin, Wickenhöfer, Ralph, Dellen, Babette, Jaekel, Uwe

Abstract

In CT angiography, the accurate segmentation of abdominal aortic aneurysms (AAAs) is difficult due to large anatomical variability, low-contrast vessel boundaries, and the close proximity of organs whose intensities resemble vascular structures, often leading to false positives. To address these challenges, we propose an anatomy-aware segmentation framework that integrates organ exclusion masks derived from TotalSegmentator into the training process. These masks encode explicit anatomical priors by identifying non-vascular organsand penalizing aneurysm predictions within these regions, thereby guiding the U-Net to focus on the aorta and its pathological dilation while suppressing anatomically implausible predictions. Despite being trained on a relatively small dataset, the anatomy-aware model achieves high accuracy, substantially reduces false positives, and improves boundary consistency compared to a standard U-Net baseline. The results demonstrate that incorporating anatomical knowledge through exclusion masks provides an efficient mechanism to enhance robustness and generalization, enabling reliable AAA segmentation even with limited training data.

Chinese Translation

在CT血管造影中，由于解剖结构的高度变异性、血管边界的低对比度以及与血管结构强度相似的邻近器官，腹主动脉瘤（AAA）的准确分割面临较大挑战，常导致假阳性。为解决这些问题，我们提出了一种基于解剖信息的分割框架，该框架将由TotalSegmentator生成的器官排除掩码整合到训练过程中。这些掩码通过识别非血管器官并对这些区域内的瘤体预测进行惩罚，编码了显式的解剖先验，从而引导U-Net聚焦于主动脉及其病理性扩张，同时抑制解剖上不合理的预测。尽管训练数据集较小，该基于解剖信息的模型仍实现了较高的准确率，显著减少了假阳性，并提升了边界一致性，相较于标准U-Net基线表现优越。结果表明，通过排除掩码引入解剖知识为提升模型的鲁棒性和泛化能力提供了有效机制，使得即使在有限训练数据条件下也能实现可靠的AAA分割。

View on arXiv Download PDF AI Translation

cs.CV / 110 / 2604.10321

NTIRE 2026 Challenge on Single Image Reflection Removal in the Wild: Datasets, Results, and Methods

NTIRE 2026 野外单幅图像反射去除挑战赛：数据集、结果与方法

Cai, Jie, Yang, Kangning, Li, Zhiyuan, Vasluianu, Florin-Alexandru, Timofte, Radu, Li, Jinlong, Shen, Jinglin, Meng, Zibo, Cao, Junyan, Zhao, Lu, Liu, Pengwei, Zhang, Yuyi, Guo, Fengjun, Hu, Jiagao, Wang, Zepeng, Wang, Fei, Zhou, Daiguo, Chen, Yi'ang, Zhu, Honghui, Yang, Mengru, Luo, Yan, Jiang, Kui, Guo, Jin, Park, Jonghyuk, Sim, Jae-Young, Zhou, Wei, Huang, Hongyu, Li, Linfeng, Kong, Lindong, Meesiyawar, Saiprasad, Khanpagadi, Misbha Falak, Akalwadi, Nikhil, Tabib, Ramesh Ashok, Mudenagudi, Uma, Benjdira, Bilel, Ali, Anas M., Boulila, Wadii, Shigematsu, Kosuke, Shirono, Hiroto, Shin, Asuka, Xu, Guoyi, Jiang, Yaoxin, Liu, Jiajia, Shi, Yaokun, Tu, Jiachen, Joshi, Shreeniketh, Jiang, Jin-Hui, Lin, Yu-Fan, Hsiao, Yu-Jou, Lee, Chia-Ming, Yang, Fu-En, Wang, Yu-Chiang Frank, Hsu, Chih-Chung

Abstract

In this paper, we review the NTIRE 2026 challenge on single-image reflection removal (SIRR) in the Wild. SIRR is a fundamental task in image restoration. Despite progress in academic research, most methods are tested on synthetic images or limited real-world images, creating a gap in real-world applications. In this challenge, we provide participants with the OpenRR-5k dataset, which requires them to process real-world images that cover a range of reflection scenarios and intensities, with the goal of generating clean images without reflections. The challenge attracted more than 100 registrations, with 11 of them participating in the final testing phase. The top-ranked methods advanced the state-of-the-art reflection removal performance and earned unanimous recognition from the five experts in the field. The proposed OpenRR-5k dataset is available at https://huggingface.co/datasets/qiuzhangTiTi/OpenRR-5k, and the homepage of this challenge is at https://github.com/caijie0620/OpenRR-5k. Due to page limitations, this article only presents partial content; the full report and detailed analyses are available in the extended arXiv version.

Chinese Translation

本文回顾了NTIRE 2026野外单幅图像反射去除（Single-Image Reflection Removal, SIRR）挑战赛。SIRR是图像修复中的一项基础任务。尽管学术研究取得了一定进展，但大多数方法仅在合成图像或有限的真实图像上进行测试，导致与实际应用存在差距。本次挑战赛为参赛者提供了OpenRR-5k数据集，要求处理涵盖多种反射场景和强度的真实图像，目标是生成无反射的清晰图像。该挑战吸引了超过100人报名，最终有11支队伍参与了测试阶段。排名靠前的方法推动了反射去除性能的最新进展，获得了领域内五位专家的一致认可。所提出的OpenRR-5k数据集可在https://huggingface.co/datasets/qiuzhangTiTi/OpenRR-5k获取，挑战赛主页为https://github.com/caijie0620/OpenRR-5k。由于篇幅限制，本文仅呈现部分内容，完整报告及详细分析见扩展的arXiv版本。

View on arXiv Download PDF AI Translation

cs.CV / 111 / 2604.10334

SIMPLER: H&E-Informed Representation Learning for Structured Illumination Microscopy

SIMPLER：基于H&E信息的结构化照明显微镜表征学习

Aziz, Abu Zahid Bin, Ahmed, Syed Fahim, Rasineni, Gnanesh, Wang, Mei, Hatipoglu, Olcaytu, Ricci, Marisa, Shaw, Malaiyah, Li, Guang, Brown, J. Quincy, Pascucci, Valerio, Elhabian, Shireen

Abstract

Structured Illumination Microscopy (SIM) enables rapid, high-contrast optical sectioning of fresh tissue without staining or physical sectioning, making it promising for intraoperative and point-of-care diagnostics. Recent foundation and large-scale self-supervised models in digital pathology have demonstrated strong performance on section-based modalities such as Hematoxylin and Eosin (H&E) and immunohistochemistry (IHC). However, these approaches are predominantly trained on thin tissue sections and do not explicitly address thick-tissue fluorescence modalities such as SIM. When transferred directly to SIM, performance is constrained by substantial modality shift, and naive fine-tuning often overfits to modality-specific appearance rather than underlying histological structure. We introduce SIMPLER (Structured Illumination Microscopy-Powered Learning for Embedding Representations), a cross-modality self-supervised pretraining framework that leverages H&E as a semantic anchor to learn reusable SIM representations. H&E encodes rich cellular and glandular structure aligned with established clinical annotations, while SIM provides rapid, nondestructive imaging of fresh tissue. During pretraining, SIM and H&E are progressively aligned through adversarial, contrastive, and reconstruction-based objectives, encouraging SIM embeddings to internalize histological structure from H&E without collapsing modality-specific characteristics. A single pretrained SIMPLER encoder transfers across multiple downstream tasks, including multiple instance learning and morphological clustering, consistently outperforming SIM models trained from scratch or H&E-only pretraining. Importantly, joint alignment enhances SIM performance without degrading H&E representations, demonstrating asymmetric enrichment rather

Chinese Translation

结构化照明显微镜（Structured Illumination Microscopy，SIM）能够实现对新鲜组织的快速、高对比度光学切片，无需染色或物理切片，因而在术中及现场诊断中具有广阔的应用前景。近年来，数字病理学中的基础模型和大规模自监督模型在基于切片的成像模式（如苏木精-伊红染色（Hematoxylin and Eosin，H&E）和免疫组化（Immunohistochemistry，IHC））上表现出强大的性能。然而，这些方法主要在薄组织切片上训练，未能明确针对SIM等厚组织荧光成像模式。当直接迁移至SIM时，性能受到显著模态差异的限制，且简单的微调往往导致模型过拟合于模态特异的外观特征，而非潜在的组织学结构。本文提出SIMPLER（Structured Illumination Microscopy-Powered Learning for Embedding Representations），一种跨模态自监督预训练框架，利用H&E作为语义锚点以学习可复用的SIM表征。H&E编码了丰富的细胞和腺体结构，与既定的临床注释高度一致，而SIM则提供了对新鲜组织的快速、无损成像。在预训练过程中，SIM与H&E通过对抗、对比及重建等目标逐步对齐，促使SIM嵌入能够内化H&E中的组织学结构，同时避免模态特异特征的塌缩。单一预训练的SIMPLER编码器可迁移至多种下游任务，包括多实例学习和形态聚类，持续优于从零训练的SIM模型或仅基于H&E预训练的模型。更重要的是，联合对齐提升了SIM的性能，同时未损害H&E的表征能力，展现出非对称的互补增强效果。

View on arXiv Download PDF AI Translation

cs.CV / 112 / 2604.10344

Context Matters: Vision-Based Depression Detection Comparing Classical and Deep Approaches

情境重要性：基于视觉的抑郁检测比较经典方法与深度学习方法

Bilalpur, Maneesh, Hinduja, Saurabh, Sivarajkumar, Sonish, Allen, Nicholas, Wang, Yanshan, Ertugrul, Itir Onal, Cohn, Jeffrey F.

Abstract

The classical approach to detecting depression from vision emphasizes interpretable features, such as facial expression, and classifiers such as the Support Vector Machine (SVM). With the advent of deep learning, there has been a shift in feature representations and classification approaches. Contemporary approaches use learnt features from general-purpose vision models such as VGGNet to train machine learning models. Little is known about how classical and deep approaches compare in depression detection with respect to accuracy, fairness, and generalizability, especially across contexts. To address these questions, we compared classical and deep approaches to the detection of depression in the visual modality in two different contexts: Mother-child interactions in the TPOT database and patient-clinician interviews in the Pitt database. In the former, depression was operationalized as a history of depression per the DSM and current or recent clinically significant symptoms. In the latter, all participants met initial criteria for depression per DSM, and depression was reassessed over the course of treatment. The classical approach included handcrafted features with SVM classifiers. Learnt features were turn-level embeddings from the FMAE-IAT that were combined with Multi-Layer Perceptron classifiers. The classical approach achieved higher accuracy in both contexts. It was also significantly fairer than the deep approach in the patient-clinician context. Cross-context generalizability was modest at best for both approaches, which suggests that depression may be context-specific.

Chinese Translation

经典的抑郁检测方法强调可解释特征，如面部表情，以及支持向量机（Support Vector Machine, SVM）等分类器。随着深度学习的兴起，特征表示和分类方法发生了转变。现代方法使用来自通用视觉模型（如 VGGNet）学习的特征来训练机器学习模型。目前尚不清楚经典方法与深度方法在抑郁检测中的准确性、公平性和可推广性方面的比较，尤其是在不同情境下。为了解决这些问题，我们在两个不同的情境中比较了经典方法与深度方法在视觉模态下的抑郁检测：TPOT 数据库中的母子互动和 Pitt 数据库中的患者-临床医生访谈。在前者中，抑郁被操作化为根据 DSM 的抑郁病史以及当前或近期的临床显著症状。在后者中，所有参与者均符合 DSM 的初始抑郁标准，并在治疗过程中重新评估抑郁。经典方法包括手工特征与 SVM 分类器。学习到的特征是来自 FMAE-IAT 的轮次级嵌入，与多层感知器（Multi-Layer Perceptron）分类器结合。经典方法在两个情境中的准确性均较高。在患者-临床医生情境中，其公平性也显著高于深度方法。两种方法的跨情境可推广性充其量是适度的，这表明抑郁可能具有情境特异性。

View on arXiv Download PDF AI Translation

cs.CV / 113 / 2604.10347

Multi-modal, multi-scale representation learning for satellite imagery analysis just needs a good ALiBi

多模态、多尺度表示学习在卫星图像分析中的应用只需一个好的 ALiBi

Kage, Patrick, Andreadis, Pavlos

Abstract

Vision foundation models have been shown to be effective at processing satellite imagery into representations fit for downstream tasks, however, creating models which operate over multiple spatial resolutions and modes is challenging. This paper presents Scale-ALiBi, a linear bias transformer attention mechanism with a spatial encoding bias to relationships between image patches at different ground sample distance scales. We provide an implementation of Scale-ALiBi over a dataset of aligned high- and low-resolution optical and low-resolution SAR satellite imagery data using a triple-contrastive and reconstructive architecture, show an improvement on the GEO-Bench benchmark, and release the newly curated dataset publicly.

Chinese Translation

视觉基础模型已被证明在将卫星图像处理成适合下游任务的表示方面是有效的，然而，创建能够在多个空间分辨率和模式下运作的模型仍然具有挑战性。本文提出了 Scale-ALiBi，一种具有空间编码偏差的线性偏置变换器注意力机制，用于处理不同地面采样距离尺度下图像块之间的关系。我们在一个对齐的高分辨率和低分辨率光学及低分辨率 SAR 卫星图像数据集上实现了 Scale-ALiBi，采用三重对比和重构架构，展示了在 GEO-Bench 基准上的改进，并公开发布了新整理的数据集。

View on arXiv Download PDF AI Translation

cs.CV / 114 / 2604.10359

Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex

Multinex：通过多先验Retinex实现轻量级低光照图像增强

Brateanu, Alexandru, Mu, Tingting, Ancuti, Codruta, Ancuti, Cosmin

Abstract

Low-light image enhancement (LLIE) aims to restore natural visibility, color fidelity, and structural detail under severe illumination degradation. State-of-the-art (SOTA) LLIE techniques often rely on large models and multi-stage training, limiting practicality for edge deployment. Moreover, their dependence on a single color space introduces instability and visible exposure or color artifacts. To address these, we propose Multinex, an ultra-lightweight structured framework that integrates multiple fine-grained representations within a principled Retinex residual formulation. It decomposes an image into illumination and color prior stacks derived from distinct analytic representations, and learns to fuse these representations into luminance and reflectance adjustments required to correct exposure. By prioritizing enhancement over reconstruction and exploiting lightweight neural operations, Multinex significantly reduces computational cost, exemplified by its lightweight (45K parameters) and nano (0.7K parameters) versions. Extensive benchmarks show that all lightweight variants significantly outperform their corresponding lightweight SOTA models, and reach comparable performance to heavy models. Paper page available at https://albrateanu.github.io/multinex.

Chinese Translation

低光照图像增强（LLIE）旨在在严重光照退化的情况下恢复自然的可见性、色彩保真度和结构细节。最先进的（SOTA）LLIE技术通常依赖于大型模型和多阶段训练，这限制了其在边缘部署中的实用性。此外，它们对单一颜色空间的依赖引入了不稳定性和可见的曝光或颜色伪影。为了解决这些问题，我们提出了Multinex，一个超轻量级的结构化框架，它在一个有原则的Retinex残差公式中集成了多种细粒度表示。它将图像分解为来自不同分析表示的光照和颜色先验堆栈，并学习将这些表示融合为修正曝光所需的亮度和反射调整。通过优先考虑增强而非重建，并利用轻量级神经操作，Multinex显著降低了计算成本，其轻量级（45K参数）和纳米（0.7K参数）版本便是例证。广泛的基准测试表明，所有轻量级变体的性能显著优于其对应的轻量级SOTA模型，并且达到了与重型模型相当的性能。论文页面可访问 https://albrateanu.github.io/multinex。

View on arXiv Download PDF AI Translation

cs.CV / 115 / 2604.10377

DeepShapeMatchingKit: Accelerated Functional Map Solver and Shape Matching Pipelines Revisited

DeepShapeMatchingKit：加速的功能映射求解器和形状匹配管道重访

Xie, Yizheng, Bastian, Lennart, Deng, Congyue, Mitchel, Thomas W., Gao, Maolin, Cremers, Daniel

Abstract

Deep functional maps, leveraging learned feature extractors and spectral correspondence solvers, are fundamental to non-rigid 3D shape matching. Based on an analysis of open-source implementations, we find that standard functional map implementations solve k independent linear systems serially, which is a computational bottleneck at higher spectral resolution. We thus propose a vectorized reformulation that solves all systems in a single kernel call, achieving up to a 33x speedup while preserving the exact solution. Furthermore, we identify and document a previously unnoticed implementation divergence in the spatial gradient features of the mainstay DiffusionNet: two variants that parameterize distinct families of tangent-plane transformations, and present experiments analyzing their respective behaviors across diverse benchmarks. We additionally revisit overlap prediction evaluation for partial-to-partial matching and show that balanced accuracy provides a useful complementary metric under varying overlap ratios. To share these advancements with the wider community, we present an open-source codebase, DeepShapeMatchingKit, that incorporates these improvements and standardizes training, evaluation, and data pipelines for common deep shape matching methods. The codebase is available at: https://github.com/xieyizheng/DeepShapeMatchingKit

Chinese Translation

深度功能映射利用学习到的特征提取器和谱对应求解器，是非刚性三维形状匹配的基础。通过对开源实现的分析，我们发现标准的功能映射实现以串行方式解决 k 个独立的线性系统，这在较高谱分辨率下成为计算瓶颈。因此，我们提出了一种向量化重构方法，可以在单个内核调用中解决所有系统，达到了最高 33 倍的加速，同时保持了精确解。此外，我们识别并记录了主流 DiffusionNet 中一个之前未被注意的实现差异：两个参数化不同切平面变换族的变体，并展示了在多种基准测试中分析它们各自行为的实验。我们还重新审视了部分到部分匹配的重叠预测评估，并表明在不同重叠比率下，平衡准确率提供了一个有用的补充指标。为了与更广泛的社区分享这些进展，我们推出了一个开源代码库 DeepShapeMatchingKit，整合了这些改进，并标准化了常见深度形状匹配方法的训练、评估和数据管道。代码库可在以下地址获取：https://github.com/xieyizheng/DeepShapeMatchingKit

View on arXiv Download PDF AI Translation

cs.CV / 116 / 2604.10383

Agentic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning

代理视频生成：通过工具约束的LLM规划从文本到可执行事件图

Cudlenco, Nicolae, Masala, Mihai, Leordeanu, Marius

Abstract

Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) -- a structured specification of actors, actions, objects, and temporal constraints -- which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that every generated specification is executable by construction. The system uses a hierarchical two-agent architecture -- a Director that plans the story and a Scene Builder that constructs individual scenes through a round-based state machine -- with dedicated Relation Subagents that populate the logical and semantic edge types of the GEST formalism that procedural generation leaves empty, making this the first approach to exercise the full expressive capacity of the representation. We evaluate in two stages: autonomous generation against procedural baselines via a 3-model LLM jury, where agentic narratives win 79% of text and 74% of video comparisons; and seeded generation where the same text is given to our system, VEO 3.1, and WAN 2.2, with human annotations showing engine-generated videos substantially outperform neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).

Chinese Translation

现有的多智能体视频生成系统使用LLM（大语言模型）代理来协调神经视频生成器，产生视觉上令人印象深刻但语义上不可靠的输出，并且没有真实的标注。我们提出了一种代理系统，颠覆了这一范式：LLM构建一个正式的时空事件图（Graph of Events in Space and Time, GEST）——这是一个结构化的演员、动作、对象和时间约束的规范，然后在3D游戏引擎中确定性地执行。一个分阶段的LLM精炼管道在这一任务上完全失败（50次尝试中没有产生可执行的规范），这促使我们基于关注点分离的根本不同架构：LLM通过自然语言推理处理叙事规划，而程序状态后端通过经过验证的工具调用强制执行所有模拟器约束，确保每个生成的规范在构建时都是可执行的。该系统采用层次化的双代理架构——一个负责规划故事的导演（Director）和一个通过回合制状态机构建单个场景的场景构建者（Scene Builder），并配有专门的关系子代理（Relation Subagents），填充程序生成留下空白的GEST形式的逻辑和语义边缘类型，使这是第一个充分发挥该表示法表达能力的方法。我们分两个阶段进行评估：与程序基线的自主生成，通过一个3模型的LLM评审团，其中代理叙事在79%的文本和74%的视频比较中获胜；以及种子生成，其中相同的文本被提供给我们的系统、VEO 3.1和WAN 2.2，人工标注显示引擎生成的视频在物理有效性（58%对25%和20%）和语义一致性（3.75/5对2.33和1.50）上显著优于神经生成器。

View on arXiv Download PDF AI Translation

cs.CV / 117 / 2604.10385

GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

GTASA：用于视频模型时空分析、评估和训练的真实标注

Cudlenco, Nicolae, Masala, Mihai, Leordeanu, Marius

Abstract

Generating complex multi-actor scenario videos remains difficult even for state-of-the-art neural generators, while evaluating them is hard due to the lack of ground truth for physical plausibility and semantic faithfulness. We introduce GTASA, a corpus of multi-actor videos with per-frame spatial relation graphs and event-level temporal mappings, and the system that produced it based on Graphs of Events in Space and Time (GEST): GEST-Engine. We compare our method with both open and closed source neural generators and prove both qualitatively (human evaluation of physical validity and semantic alignment) and quantitatively (via training video captioning models) the clear advantages of our method. Probing four frozen video encoders across 11 spatiotemporal reasoning tasks enabled by GTASA's exact 3D ground truth reveals that self-supervised encoders encode spatial structure significantly better than VLM visual encoders.

Chinese Translation

生成复杂的多演员场景视频仍然是一个挑战，即使对于最先进的神经生成器而言，评估这些视频也很困难，因为缺乏物理合理性和语义忠实性的真实标注。我们介绍了GTASA，一个包含每帧空间关系图和事件级时间映射的多演员视频语料库，以及基于时空事件图（Graphs of Events in Space and Time，GEST）生成该语料库的系统：GEST-Engine。我们将我们的方法与开放源代码和闭源神经生成器进行了比较，并通过定性（对物理有效性和语义一致性的人工评估）和定量（通过训练视频字幕生成模型）证明了我们方法的明显优势。对四个冻结的视频编码器在GTASA的精确三维真实标注下进行的11个时空推理任务的探测显示，自监督编码器在编码空间结构方面显著优于视觉语言模型（VLM）编码器。

View on arXiv Download PDF AI Translation

cs.CV / 118 / 2604.10391

FishRoPE: Projective Rotary Position Embeddings for Omnidirectional Visual Perception

FishRoPE：用于全向视觉感知的投影旋转位置编码

Ahuja, Rahul, Jain, Mudit, Sudhakar, Bala Murali Manoghar Sai, Narayanan, Venkatraman, Likhar, Pratik, Kumar, Varun Ravi, Yogamani, Senthil

Abstract

Vision foundation models (VFMs) and Bird's Eye View (BEV) representation have advanced visual perception substantially, yet their internal spatial representations assume the rectilinear geometry of pinhole cameras. Fisheye cameras, widely deployed on production autonomous vehicles for their surround-view coverage, exhibit severe radial distortion that renders these representations geometrically inconsistent. At the same time, the scarcity of large-scale fisheye annotations makes retraining foundation models from scratch impractical. We present \ours, a lightweight framework that adapts frozen VFMs to fisheye geometry through two components: a frozen DINOv2 backbone with Low-Rank Adaptation (LoRA) that transfers rich self-supervised features to fisheye without task-specific pretraining, and Fisheye Rotary Position Embedding (FishRoPE), which reparameterizes the attention mechanism in the spherical coordinates of the fisheye projection so that both self-attention and cross-attention operate on angular separation rather than pixel distance. FishRoPE is architecture-agnostic, introduces negligible computational overhead, and naturally reduces to the standard formulation under pinhole geometry. We evaluate \ours on WoodScape 2D detection (54.3 mAP) and SynWoodScapes BEV segmentation (65.1 mIoU), where it achieves state-of-the-art results on both benchmarks.

Chinese Translation

视觉基础模型（Vision Foundation Models，VFMs）和鸟瞰视图（Bird's Eye View，BEV）表示在视觉感知方面取得了显著进展，然而其内部的空间表示假设针孔相机的直线几何结构。鱼眼相机因其全方位视野覆盖被广泛应用于自动驾驶车辆，但其严重的径向畸变使得这些表示在几何上不一致。与此同时，大规模鱼眼图像标注的缺乏使得从零开始重新训练基础模型变得不切实际。我们提出了FishRoPE，一个轻量级框架，通过两个组件将冻结的VFMs适配到鱼眼几何：一个带有低秩适配（Low-Rank Adaptation，LoRA）的冻结DINOv2骨干网络，将丰富的自监督特征迁移到鱼眼图像而无需特定任务的预训练；以及鱼眼旋转位置编码（FishRoPE），该编码在鱼眼投影的球面坐标中重新参数化注意力机制，使得自注意力和交叉注意力均基于角度间隔而非像素距离进行操作。FishRoPE与架构无关，计算开销极小，并且在针孔几何下自然退化为标准形式。我们在WoodScape二维检测（54.3 mAP）和SynWoodScapes BEV分割（65.1 mIoU）上对FishRoPE进行了评估，均取得了两项基准的最先进结果。

View on arXiv Download PDF AI Translation

cs.CV / 119 / 2604.10397

Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

重新思考视频中的人-物交互：基于时间的集合预测实现统一的检测与预测

Luo, Yuanhao, Wen, Di, Peng, Kunyu, Liu, Ruiping, Zheng, Junwei, Chen, Yufan, Wei, Jiale, Stiefelhage, Rainer

Abstract

Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels from actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. Benchmark and code will be publicly available.

Chinese Translation

基于视频的人-物交互（HOI）理解不仅需要检测当前正在进行的交互，还需预测其未来演变。然而，现有方法通常将预测视为建立在外部构建的人-物对基础上的下游预测任务，限制了检测与预测之间的联合推理。此外，当前基准中稀疏的关键帧标注可能导致名义上的未来标签与实际未来动态在时间上的错位，降低了预测评估的可靠性。为解决这些问题，我们引入了DETAnt-HOI，这是一个基于VidHOI和Action Genome经过时间校正的基准，旨在实现更真实的多阶段评估；同时提出了HOI-DA，一个以人-物对为中心的框架，通过将未来交互建模为当前对状态的残差转变，联合执行主体-客体定位、当前HOI检测及未来预测。实验结果表明，在检测和预测任务上均有持续提升，且在较长时间跨度上增益更为显著。我们的研究强调，预测在与检测联合学习时最为有效，作为对对级视频表示学习的结构性约束。基准数据和代码将公开发布。

View on arXiv Download PDF AI Translation

cs.CV / 120 / 2604.10409

IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly

IMPACT：用于工业装配中多粒度人类程序化动作理解的数据集

Wen, Di, Zhong, Zeyun, Schneider, David, Zaremski, Manuel, Kunzmann, Linus, Shi, Yitian, Liu, Ruiping, Chen, Yufan, Zheng, Junwei, Li, Jiahang, Hemmerich, Jonas, Tong, Qiyi, Grauberger, Patric, Ajoudani, Arash, Paudel, Danda Pani, Matthiesen, Sven, Deml, Barbara, Beyerer, Jürgen, Van Gool, Luc, Stiefelhagen, Rainer, Peng, Kunyu

Abstract

We introduce IMPACT, a synchronized five-view RGB-D dataset for deployment-oriented industrial procedural understanding, built around real assembly and disassembly of a commercial angle grinder with professional-grade tools. To our knowledge, IMPACT is the first real industrial assembly benchmark that jointly provides synchronized ego-exo RGB-D capture, decoupled bimanual annotation, compliance-aware state tracking, and explicit anomaly--recovery supervision within a single real industrial workflow. It comprises 112 trials from 13 participants totaling 39.5 hours, with multi-route execution governed by a partial-order prerequisite graph, a six-category anomaly taxonomy, and operator cognitive load measured via NASA-TLX. The annotation hierarchy links hand-specific atomic actions to coarse procedural steps, component assembly states, and per-hand compliance phases, with synchronized null spans across views to decouple perceptual limitations from algorithmic failure. Systematic baselines reveal fundamental limitations that remain invisible to single-task benchmarks, particularly under realistic deployment conditions that involve incomplete observations, flexible execution paths, and corrective behavior. The full dataset, annotations, and evaluation code are available at https://github.com/Kratos-Wen/IMPACT.

Chinese Translation

我们介绍了IMPACT，这是一个同步的五视角RGB-D数据集，旨在面向部署的工业程序理解，围绕使用专业级工具对商用角磨机的真实装配与拆卸构建。据我们所知，IMPACT是首个真实工业装配基准，能够在单一真实工业流程中联合提供同步的第一视角与第三视角RGB-D采集、双手动作解耦标注、遵从性感知状态追踪以及明确的异常-恢复监督。该数据集包含13名参与者的112次试验，总计39.5小时，执行路径由部分有序前置条件图控制，涵盖六类异常分类法，并通过NASA-TLX测量操作员认知负荷。标注层级将特定手部的原子动作关联至粗粒度程序步骤、组件装配状态及每只手的遵从阶段，视角间同步的空白区间用于区分感知限制与算法失败。系统基线实验揭示了单任务基准难以察觉的基本局限性，尤其是在涉及不完整观察、灵活执行路径及纠正行为的真实部署条件下。完整数据集、标注及评估代码可在https://github.com/Kratos-Wen/IMPACT获取。

View on arXiv Download PDF AI Translation

cs.CV / 121 / 2604.10414

Neural Stochastic Processes for Satellite Precipitation Refinement

用于卫星降水细化的神经随机过程

Nagashima, Shunya, Bannai, Takumi, Koyama, Shuitsu, Mitsui, Tomoya, Suzuki, Shuntaro

Abstract

Accurate precipitation estimation is critical for flood forecasting, water resource management, and disaster preparedness. Satellite products provide global hourly coverage but contain systematic biases; ground-based gauges are accurate at point locations but too sparse for direct gridded correction. Existing methods fuse these sources by interpolating gauge observations onto the satellite grid, but treat each time step independently and therefore discard temporal structure in precipitation fields. We propose Neural Stochastic Process (NSP), a model that pairs a Neural Process encoder conditioning on arbitrary sets of gauge observations with a latent Neural SDE on a 2D spatial representation. NSP is trained under a single variational objective with simulation-free cost. We also introduce QPEBench, a benchmark of 43{,}756 hourly samples over the Contiguous United States (2021--2025) with four aligned data sources and six evaluation metrics. On QPEBench, NSP outperforms 13 baselines across all six metrics and surpasses JAXA's operational gauge-calibrated product. An additional experiment on Kyushu, Japan confirms generalization to a different region with independent data sources.

Chinese Translation

准确的降水估计对于洪水预报、水资源管理和灾害防备至关重要。卫星产品提供全球范围的小时级覆盖，但存在系统性偏差；地面雨量计在点位测量上准确，但分布稀疏，难以直接用于格点校正。现有方法通过将雨量计观测插值到卫星网格上实现数据融合，但各时间步独立处理，因而丢弃了降水场的时间结构。我们提出了神经随机过程（Neural Stochastic Process，NSP）模型，该模型结合了基于任意雨量计观测集合的神经过程编码器与二维空间表示上的潜在神经随机微分方程（Neural SDE）。NSP通过单一变分目标进行训练，且无需仿真代价。我们还引入了QPEBench，这是一个包含43,756个小时样本、覆盖美国本土（2021-2025年）、包含四种对齐数据源及六项评估指标的基准测试。在QPEBench上，NSP在所有六项指标上均优于13个基线方法，并超越了日本宇宙航空研究开发机构（JAXA）运营的雨量计校准产品。对日本九州地区的额外实验验证了该模型对不同区域及独立数据源的泛化能力。

View on arXiv Download PDF AI Translation

cs.CV / 122 / 2604.10415

Point2Pose: Occlusion-Recovering 6D Pose Tracking and 3D Reconstruction for Multiple Unknown Objects Via 2D Point Trackers

Point2Pose：基于二维点跟踪器的多未知物体遮挡恢复6D位姿跟踪与三维重建

Lin, Tzu-Yuan, Lee, Ho Jae, Doherty, Kevin, Lee, Yonghyeon, Kim, Sangbae

Abstract

We present Point2Pose, a model-free method for causal 6D pose tracking of multiple rigid objects from monocular RGB-D video. Initialized only from sparse image points on the objects to be tracked, our approach tracks multiple unseen objects without requiring object CAD models or category priors. Point2Pose leverages a 2D point tracker to obtain long-range correspondences, enabling instant recovery after complete occlusion. Simultaneously, the system incrementally reconstructs an online Truncated Signed Distance Function (TSDF) representation of the tracked targets. Alongside the method, we introduce a new multi-object tracking dataset comprising both simulation and real-world sequences, with motion-capture ground truth for evaluation. Experiments show that Point2Pose achieves performance comparable to the state-of-the-art methods on a severe-occlusion benchmark, while additionally supporting multi-object tracking and recovery from complete occlusion, capabilities that are not supported by previous model-free tracking approaches.

Chinese Translation

我们提出了Point2Pose，一种基于单目RGB-D视频的多刚体物体因果6D位姿跟踪的无模型方法。该方法仅从待跟踪物体上的稀疏图像点初始化，能够跟踪多个未知物体，无需物体CAD模型或类别先验。Point2Pose利用二维点跟踪器获取远距离对应点，实现了在完全遮挡后即时恢复。同时，系统增量式地重建被跟踪目标的在线截断有符号距离函数（Truncated Signed Distance Function，TSDF）表示。我们还引入了一个包含仿真和真实世界序列的多物体跟踪新数据集，并配备了动作捕捉的真实标注以供评估。实验结果表明，Point2Pose在严重遮挡基准测试中达到与最先进方法相当的性能，且支持多物体跟踪及完全遮挡恢复，这些能力是以往无模型跟踪方法所不具备的。

View on arXiv Download PDF AI Translation

cs.CV / 123 / 2604.10425

DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain

DiningBench：面向饮食领域感知与推理的分层多视角基准测试

Jin, Song, Zhang, Juntian, Zhang, Xun, Tian, Zeying, Jiang, Fei, Yin, Guojun, Lin, Wei, Liu, Yong, Yan, Rui

Abstract

Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained "hard" negatives from identical menus and rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, identifying five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All codes are released in https://github.com/meituan/DiningBench.

Chinese Translation

近年来，视觉-语言模型（Vision-Language Models, VLMs）的进步极大推动了通用视觉理解的发展。然而，这些模型在食品领域的应用仍受限于依赖粗粒度类别、单视角图像及不准确元数据的基准测试。为填补这一空白，我们提出了DiningBench——一个分层、多视角的基准测试，旨在评估VLMs在细粒度分类、营养估计和视觉问答三个认知复杂度层面的表现。与以往数据集不同，DiningBench包含3,021道独特菜品，每道菜平均配备5.27张图像，且引入了来自相同菜单的细粒度“难”负样本以及基于严格验证的营养数据。我们对29个最先进的开源及专有模型进行了广泛评测。实验结果表明，尽管当前VLMs在通用推理方面表现优异，但在细粒度视觉辨识和精确营养推理上存在显著挑战。此外，我们系统性地探讨了多视角输入和链式思维（Chain-of-Thought）推理的影响，识别出五种主要失败模式。DiningBench作为一个具有挑战性的测试平台，将推动新一代以食品为中心的VLM研究。所有代码均已开源，地址：https://github.com/meituan/DiningBench。

View on arXiv Download PDF AI Translation

cs.CV / 124 / 2604.10436

SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units

SignReasoner：通过功能结构单元进行复杂交通标志理解的组合推理

Wang, Ruibin, Lin, Zhenyu, Zhao, Xinhai

Abstract

Accurate semantic understanding of complex traffic signs-including those with intricate layouts, multi-lingual text, and composite symbols-is critical for autonomous driving safety. Current models, both specialized small ones and large Vision Language Models (VLMs), suffer from a significant bottleneck: a lack of compositional generalization, leading to failure when encountering novel sign configurations. To overcome this, we propose SignReasoner, a novel paradigm that transforms general VLMs into expert traffic sign reasoners. Our core innovation is Functional Structure Unit (FSU), which shifts from common instance-based modeling to flexible function-based decomposition. By breaking down complex signs into minimal, core functional blocks (e.g., Direction, Notice, Lane), our model learns the underlying structural grammar, enabling robust generalization to unseen compositions. We define this decomposition as the FSU-Reasoning task and introduce a two-stage VLM post-training pipeline to maximize performance: Iterative Caption-FSU Distillation that enhances the model's accuracy in both FSU-reasoning and caption generation; FSU-GRPO that uses Tree Edit Distance (TED) to compute FSU differences as the rewards in GRPO algorithm, boosting reasoning abilities. Experiments on the newly proposed FSU-Reasoning benchmark, TrafficSignEval, show that SignReasoner achieves new SOTA with remarkable data efficiency and no architectural modification, significantly improving the traffic sign understanding in various VLMs.

Chinese Translation

对复杂交通标志的准确语义理解——包括那些具有复杂布局、多语言文本和复合符号的标志——对于自动驾驶安全至关重要。目前的模型，无论是专门的小型模型还是大型视觉语言模型（VLMs），都面临着一个显著的瓶颈：缺乏组合泛化能力，导致在遇到新颖的标志配置时出现失败。为了解决这一问题，我们提出了SignReasoner，这是一种新颖的范式，将通用的VLM转变为专家级的交通标志推理器。我们的核心创新是功能结构单元（Functional Structure Unit, FSU），它从常见的基于实例的建模转向灵活的基于功能的分解。通过将复杂的标志分解为最小的核心功能块（例如，方向、通知、车道），我们的模型学习到潜在的结构语法，从而实现对未见组合的强大泛化。我们将这种分解定义为FSU-推理任务，并引入了一个两阶段的VLM后训练管道以最大化性能：迭代的Caption-FSU蒸馏，增强模型在FSU推理和标题生成中的准确性；FSU-GRPO，利用树编辑距离（Tree Edit Distance, TED）计算FSU差异作为GRPO算法中的奖励，从而提升推理能力。在新提出的FSU-推理基准TrafficSignEval上的实验表明，SignReasoner在数据效率和无架构修改的情况下实现了新的最先进水平（SOTA），显著提高了各种VLMs中的交通标志理解能力。

View on arXiv Download PDF AI Translation

cs.CV / 125 / 2604.10437

Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance

通过判别性引导提升3D胸部CT报告生成中的细粒度空间定位能力

Wang, Chenyu, Dai, Weicheng, Liu, Han, Li, Wenchao, Batmanghelich, Kayhan

Abstract

Vision--language models (VLMs) for radiology report generation (RRG) can produce long-form chest CT reports from volumetric scans and show strong potential to improve radiology workflow efficiency and consistency. However, existing methods face two key limitations: (i) training supervision is often coarse, aligning a whole CT volume with a full free-text report without explicit alignment for fine-grained attributes or pathology locations; and (ii) evaluation is typically holistic (lexical overlap, entity matching, or LLM-as-a-judge scores) and not diagnostic for spatial grounding. We propose \emph{Discriminative Cue-Prompting with Prompt Dropout (DCP-PD)}, a plug-and-play framework that distills fine-grained cues from free-text reports and uses them to guide report generation while mitigating shortcut reliance via prompt dropout. DCP-PD achieves state-of-the-art performance on CT-RATE, improving macro F1 from $=0.501$ to $0.603$ (20% relative), and substantially boosts out-of-distribution performance on Rad-ChestCT from F1 $=0.266$ to $0.503$ (89% relative). Finally, we introduce a hierarchical, location-aware question-set protocol (presence $\rightarrow$ laterality $\rightarrow$ lobe) to directly assess pathology-location grounding, showing that fine-grained spatial localization remains challenging even for models that score highly on current benchmarks.

Chinese Translation

用于放射学报告生成（RRG）的视觉-语言模型（VLMs）能够基于体积扫描生成长篇胸部CT报告，展现出提升放射学工作流程效率和一致性的强大潜力。然而，现有方法存在两个主要局限：（i）训练监督通常较为粗糙，将整个CT体积与完整的自由文本报告对齐，缺乏对细粒度属性或病灶位置的显式对齐；（ii）评估通常是整体性的（词汇重叠、实体匹配或以大型语言模型作为评判者的评分），缺乏针对空间定位的诊断性评估。我们提出了“判别性线索提示与提示丢弃”（Discriminative Cue-Prompting with Prompt Dropout，DCP-PD）框架，该框架为即插即用式，能够从自由文本报告中提取细粒度线索并利用其指导报告生成，同时通过提示丢弃减轻模型对捷径的依赖。DCP-PD在CT-RATE数据集上实现了最先进的性能，宏F1分数从0.501提升至0.603（相对提升20%），并在Rad-ChestCT数据集上的分布外性能显著提升，F1分数从0.266提升至0.503（相对提升89%）。最后，我们引入了一个分层、位置感知的问题集协议（存在性→侧别→肺叶），以直接评估病灶位置的空间定位能力，结果表明即使是当前基准上得分较高的模型，细粒度空间定位仍然具有挑战性。

View on arXiv Download PDF AI Translation

cs.CV / 126 / 2604.10439

PERCEPT-Net: A Perceptual Loss Driven Framework for Reducing MRI Artifact Tissue Confusion

PERCEPT-Net：一种基于感知损失的框架，用于减少MRI伪影组织混淆

Guo, Ziheng, Zheng, Danqun, Chen, Chengwei, Pan, Boyang, Li, Shuai, Yu, Ziqin, Chen, Xiaoxiao, Zhong, Langdi, Bian, Yun, Gong, Nan-Jie

Abstract

Purpose: Existing deep learning-based MRI artifact correction models exhibit poor clinical generalization due to inherent artifact-tissue confusion, failing to discriminate artifacts from anatomical structures. To resolve this, we introduce PERCEPT-Net, a framework leveraging dedicated perceptual supervision for structure-preserving artifact suppression. Method: PERCEPT-Net utilizes a residual U-Net backbone integrated with a multi-scale recovery module and dual attention mechanisms to preserve anatomical context and salient features. The core mechanism, Motion Perceptual Loss (MPL), provides artifact-aware supervision by learning generalizable motion artifact representations. This logic directly guides the network to suppress artifacts while maintaining anatomical fidelity. Training utilized a hybrid dataset of real and simulated sequences, followed by prospective validation via objective metrics and expert radiologist assessments. Result: PERCEPT-Net outperformed state-of-the-art methods on clinical data. Ablation analysis established a direct causal link between MPL and performance; its omission caused a significant deterioration in structural consistency (p < 0.001) and tissue contrast (p < 0.001). Radiologist evaluations corroborated these objective metrics, scoring PERCEPT-Net significantly higher in global image quality (median 3 vs. 2, p < 0.001) and verifying the preservation of critical diagnostic structures. Conclusion: By integrating task-specific, artifact-aware perceptual learning, PERCEPT-Net suppresses motion artifacts in clinical MRI without compromising anatomical integrity. This framework improves clinical robustness and provides a verifiable mechanism to mitigate over-smoothing and structural degradation in medical image reconstruction.

Chinese Translation

目的：现有基于深度学习的MRI伪影校正模型由于固有的伪影-组织混淆，表现出较差的临床泛化能力，未能有效区分伪影与解剖结构。为了解决这一问题，我们提出了PERCEPT-Net，一个利用专门感知监督进行结构保留伪影抑制的框架。方法：PERCEPT-Net采用残差U-Net骨干网络，结合多尺度恢复模块和双重注意机制，以保留解剖背景和显著特征。核心机制运动感知损失（Motion Perceptual Loss, MPL）通过学习可泛化的运动伪影表示，提供了伪影感知监督。这一逻辑直接引导网络抑制伪影，同时保持解剖的真实性。训练使用了真实和模拟序列的混合数据集，随后通过客观指标和专家放射科医师评估进行前瞻性验证。结果：PERCEPT-Net在临床数据上优于最先进的方法。消融分析建立了MPL与性能之间的直接因果关系；其缺失导致结构一致性（p < 0.001）和组织对比度（p < 0.001）的显著下降。放射科医师的评估证实了这些客观指标，在整体图像质量上，PERCEPT-Net的评分显著高于其他方法（中位数3对比2，p < 0.001），并验证了关键诊断结构的保留。结论：通过整合任务特定的伪影感知感知学习，PERCEPT-Net在临床MRI中抑制运动伪影而不损害解剖完整性。该框架提高了临床鲁棒性，并提供了一种可验证的机制，以减轻医学图像重建中的过度平滑和结构退化。

View on arXiv Download PDF AI Translation

cs.CV / 127 / 2604.10442

ReContraster: Making Your Posters Stand Out with Regional Contrast

ReContraster：利用区域对比让你的海报脱颖而出

Zhang, Peixuan, Jia, Zijian, Cai, Ziqi, Weng, Shuchen, Li, Si, Shi, Boxin

Abstract

Effective poster design requires rapidly capturing attention and clearly conveying messages. Inspired by the ``contrast effects'' principle, we propose ReContraster, the first training-free model to leverage regional contrast to make posters stand out. By emulating the cognitive behaviors of a poster designer, ReContraster introduces the compositional multi-agent system to identify elements, organize layout, and evaluate generated poster candidates. To further ensure harmonious transitions across region boundaries, ReContraster integrates the hybrid denoising strategy during the diffusion process. We additionally contribute a new benchmark dataset for comprehensive evaluation. Seven quantitative metrics and four user studies confirm its superiority over relevant state-of-the-art methods, producing visually striking and aesthetically appealing posters.

Chinese Translation

高效的海报设计需要快速吸引注意力并清晰传达信息。受“对比效应”原理的启发，我们提出了ReContraster，这是首个无需训练即可利用区域对比使海报突出显示的模型。通过模拟海报设计师的认知行为，ReContraster引入了组合多智能体系统，用于识别元素、组织布局并评估生成的海报候选方案。为了进一步确保区域边界的和谐过渡，ReContraster在扩散过程中融合了混合去噪策略。我们还贡献了一个新的基准数据集以进行全面评估。七项定量指标和四项用户研究均证实其优于相关最先进方法，生成视觉冲击力强且美学效果出众的海报。

View on arXiv Download PDF AI Translation

cs.CV / 128 / 2604.10451

Parameter Efficient Fine-tuning for Domain-specific Gastrointestinal Disease Recognition

面向特定领域胃肠疾病识别的参数高效微调

Poudel, Sanjaya, Kunwor, Nikita, Simkhada, Raj, Munir, Mustafa, Dhakal, Manish, Poudel, Khem

Abstract

Despite recent advancements in the field of medical image analysis with the use of pretrained foundation models, the issue of distribution shifts between cross-source images largely remains adamant. To circumvent that issue, investigators generally train a separate model for each source. However, this method becomes expensive when we fully fine-tune pretrained large models for a single dataset, as we must store multiple copies of those models. Thus, in this work, we propose using a low-rank adaptation (LoRA) module for fine-tuning downstream classification tasks. LoRAs learn lightweight task-specific low-rank matrices that perturb pretrained weights to optimize those downstream tasks. For gastrointestinal tract diseases, they exhibit significantly better results than end-to-end finetuning with improved parameter efficiency. Code is available at: github.com/sanjay931/peft-gi-recognition.

Chinese Translation

尽管预训练基础模型在医学图像分析领域取得了显著进展，但跨源图像之间的分布偏移问题仍然突出。为规避该问题，研究者通常为每个数据源训练独立模型。然而，当对预训练的大型模型进行全量微调以适应单一数据集时，该方法代价昂贵，因为需要存储多份模型副本。因此，本研究提出采用低秩适配（LoRA）模块对下游分类任务进行微调。LoRA通过学习轻量级的任务特定低秩矩阵，对预训练权重进行扰动，从而优化下游任务。在胃肠道疾病识别中，LoRA相比端到端微调表现出显著更优的效果和更高的参数效率。代码可在github.com/sanjay931/peft-gi-recognition获取。

View on arXiv Download PDF AI Translation

cs.CV / 129 / 2604.10454

AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control

AIM-Bench：通过细粒度层次控制对情感图像操控的基准测试与改进

Chen, Shi, Wu, Xuecheng, Sun, Heli, Shi, Yunyun, Yin, Xinyi, Xue, Fengjian, Xie, Jinheng, Yang, Dingkang, Wang, Hao, Xue, Junxiao, He, Liang

Abstract

Affective Image Manipulation (AIM) aims to evoke specific emotions through targeted editing. Current image editing benchmarks primarily focus on object-level modifications in general scenarios, lacking the fine-grained granularity to capture affective dimensions. To bridge this gap, we introduce the first benchmark designed for AIM termed AIM-Bench. This benchmark is built upon a dual-path affective modeling scheme that integrates the Mikels emotion taxonomy with the Valence-Arousal-Dominance framework, enabling high-level semantic and fine-grained continuous manipulation. Through a hierarchical human-in-the-loop workflow, we finally curate 800 high-quality samples covering 8 emotional categories and 5 editing types. To effectively assess performance, we also design a composite evaluation suite combining rule-based and model-based metrics to holistically assess instruction consistency, aesthetics, and emotional expressiveness. Extensive evaluations reveal that current editing models face significant challenges, most notably a prevalent positivity bias, which stemming from inherent imbalances in training data distribution. To tackle this, we propose a scalable data engine utilizing an inverse repainting strategy to construct AIM-40k, a balanced instruction-tuning dataset comprising 40k samples. Concretely, we enhance raw affective images via generative redrawing to establish high-fidelity ground truths, and synthesize input images with divergent emotions and paired precise instructions. Fine-tuning a baseline model on AIM-40k yields a 9.15% relative improvement in overall performance, demonstrating the effectiveness of our AIM-40k. Our data and related code will be made open soon.

Chinese Translation

情感图像操控（Affective Image Manipulation，AIM）旨在通过有针对性的编辑激发特定情感。当前的图像编辑基准主要关注一般场景下的对象级修改，缺乏捕捉情感维度的细粒度粒度。为填补这一空白，我们提出了首个专为AIM设计的基准——AIM-Bench。该基准基于双路径情感建模方案，融合了Mikels情感分类法与Valence-Arousal-Dominance（情感价-唤醒-支配）框架，实现了高层语义与细粒度连续操控。通过层次化的人机交互工作流，我们最终策划了包含8个情感类别和5种编辑类型的800个高质量样本。为有效评估性能，我们设计了结合规则与模型的复合评估套件，全面衡量指令一致性、美学效果及情感表达力。大量评测表明，现有编辑模型面临显著挑战，尤以训练数据分布固有不平衡导致的普遍积极偏差最为突出。为此，我们提出一种利用逆向重绘策略的可扩展数据引擎，构建了包含4万样本的平衡指令调优数据集AIM-40k。具体而言，我们通过生成式重绘增强原始情感图像以建立高保真真值，并合成具有不同情感及配对精确指令的输入图像。在AIM-40k上对基线模型进行微调，整体性能相较提升9.15%，验证了AIM-40k的有效性。我们的数据及相关代码将很快开源。

View on arXiv Download PDF AI Translation

cs.CV / 130 / 2604.10456

A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation

基于指令驱动的电影视频编排基准与多智能体系统

Zhang, Peixuan, Zhou, Chang, Zhang, Ziyuan, Liu, Hualuo, Zhang, Chunjie, Liu, Jingqi, Zhou, Xiaohui, Chen, Xi, Weng, Shuchen, Li, Si, Shi, Boxin

Abstract

The surging demand for adapting long-form cinematic content into short videos has motivated the need for versatile automatic video compilation systems. However, existing compilation methods are limited to predefined tasks, and the community lacks a comprehensive benchmark to evaluate the cinematic compilation. To address this, we introduce CineBench, the first benchmark for instruction-driven cinematic video compilation, featuring diverse user instructions and high-quality ground-truth compilations annotated by professional editors. To overcome contextual collapse and temporal fragmentation, we present CineAgents, a multi-agent system that reformulates cinematic video compilation into ``design-and-compose'' paradigm. CineAgents performs script reverse-engineering to construct a hierarchical narrative memory to provide multi-level context and employs an iterative narrative planning process that refines a creative blueprint into a final compiled script. Extensive experiments demonstrate that CineAgents significantly outperforms existing methods, generating compilations with superior narrative coherence and logical coherence.

Chinese Translation

对将长篇电影内容改编为短视频的需求激增，促使了多功能自动视频编排系统的需求。然而，现有的编排方法仅限于预定义任务，且该领域缺乏一个全面的基准来评估电影编排。为了解决这一问题，我们推出了CineBench，这是第一个基于指令驱动的电影视频编排基准，具有多样化的用户指令和由专业编辑注释的高质量真实编排。为了克服上下文崩溃和时间碎片化的问题，我们提出了CineAgents，一个将电影视频编排重新构思为“设计与创作”范式的多智能体系统。CineAgents通过脚本逆向工程构建层次叙事记忆，以提供多层次的上下文，并采用迭代叙事规划过程，将创意蓝图细化为最终编排脚本。大量实验表明，CineAgents显著优于现有方法，生成的编排在叙事连贯性和逻辑连贯性方面表现更佳。

View on arXiv Download PDF AI Translation

cs.CV / 131 / 2604.10460

Toward Accountable AI-Generated Content on Social Platforms: Steganographic Attribution and Multimodal Harm Detection

朝向社交平台上可追责的AI生成内容：隐写归属与多模态危害检测

Guan, Xinlei, Arosemena, David, Dhandu, Tejaswi, Huang, Kuan, Xu, Meng, Li, Miles Q., Shen, Bingyu, Qin, Ruiyang, Tida, Umamaheswara Rao, Li, Boyang

Abstract

The rapid growth of generative AI has introduced new challenges in content moderation and digital forensics. In particular, benign AI-generated images can be paired with harmful or misleading text, creating difficult-to-detect misuse. This contextual misuse undermines the traditional moderation framework and complicates attribution, as synthetic images typically lack persistent metadata or device signatures. We introduce a steganography enabled attribution framework that embeds cryptographically signed identifiers into images at creation time and uses multimodal harmful content detection as a trigger for attribution verification. Our system evaluates five watermarking methods across spatial, frequency, and wavelet domains. It also integrates a CLIP-based fusion model for multimodal harmful-content detection. Experiments demonstrate that spread-spectrum watermarking, especially in the wavelet domain, provides strong robustness under blur distortions, and our multimodal fusion detector achieves an AUC-ROC of 0.99, enabling reliable cross-modal attribution verification. These components form an end-to-end forensic pipeline that enables reliable tracing of harmful deployments of AI-generated imagery, supporting accountability in modern synthetic media environments. Our code is available at GitHub: https://github.com/bli1/steganography

Chinese Translation

生成性人工智能的快速发展带来了内容审核和数字取证的新挑战。特别是，无害的AI生成图像可以与有害或误导性文本相结合，造成难以检测的误用。这种上下文误用削弱了传统的审核框架，并使归属变得复杂，因为合成图像通常缺乏持久的元数据或设备签名。我们提出了一种隐写技术支持的归属框架，在图像创建时嵌入加密签名的标识符，并使用多模态有害内容检测作为归属验证的触发器。我们的系统评估了五种水印方法，涵盖空间、频率和小波域。它还集成了基于CLIP的融合模型用于多模态有害内容检测。实验表明，扩频水印，尤其是在小波域中，在模糊失真下提供了强大的鲁棒性，而我们的多模态融合检测器达到了0.99的AUC-ROC，能够实现可靠的跨模态归属验证。这些组件形成了一个端到端的取证管道，能够可靠追踪AI生成图像的有害部署，支持现代合成媒体环境中的可追责性。我们的代码可在GitHub上获取：https://github.com/bli1/steganography

View on arXiv Download PDF AI Translation

cs.CV / 132 / 2604.10466

ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos

ExpertEdit：从专家视频中学习技能感知的运动编辑

Somayazulu, Arjun, Grauman, Kristen

Abstract

Visual feedback is critical for motor skill acquisition in sports and rehabilitation, and psychological studies show that observing near-perfect versions of one's own performance accelerates learning more effectively than watching expert demonstrations alone. We propose to enable such personalized feedback by automatically editing a person's motion to reflect higher skill. Existing motion editing approaches are poorly suited for this setting because they assume paired input-output data -- rare and expensive to curate for skill-driven tasks -- and explicit edit guidance at inference. We introduce ExpertEdit, a framework for skill-driven motion editing trained exclusively on unpaired expert video demonstrations. ExpertEdit learns an expert motion prior with a masked language modeling objective that infills masked motion spans with expert-level refinements. At inference, novice motion is masked at skill-critical moments and projected into the learned expert manifold, producing localized skill improvements without paired supervision or manual edit guidance. Across eight diverse techniques and three sports from Ego-Exo4D and Karate Kyokushin, ExpertEdit outperforms state-of-the-art supervised motion editing methods on multiple metrics of motion realism and expert quality. Project page: https://vision.cs.utexas.edu/projects/expert_edit/ .

Chinese Translation

视觉反馈对于运动技能的习得在体育和康复中至关重要，心理学研究表明，观察自己表现的近乎完美版本比单纯观看专家示范更能有效加速学习。我们提出通过自动编辑个人的运动以反映更高技能水平，从而实现这种个性化反馈。现有的运动编辑方法不适合这种情况，因为它们假设存在配对的输入输出数据——这种数据在技能驱动任务中稀缺且昂贵，并且在推理时需要明确的编辑指导。我们引入了ExpertEdit，一个专门针对未配对的专家视频示范进行训练的技能驱动运动编辑框架。ExpertEdit通过掩码语言建模目标学习专家运动先验，填充掩码运动片段以实现专家级的精细化。在推理时，新手运动在技能关键时刻被掩码，并投影到学习到的专家流形中，从而在没有配对监督或手动编辑指导的情况下产生局部技能提升。在Ego-Exo4D和空手道极真流的八种不同技术和三项运动中，ExpertEdit在运动真实感和专家质量的多个指标上超越了最先进的监督运动编辑方法。项目页面：https://vision.cs.utexas.edu/projects/expert_edit/

View on arXiv Download PDF AI Translation

cs.CV / 133 / 2604.10485

UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation

UDAPose：面向低光照人体姿态估计的无监督域适应方法

Chen, Haopeng, Ai, Yihao, Kim, Kabeen, Tan, Robby T., Chen, Yixin, Wang, Bo

Abstract

Low-visibility scenarios, such as low-light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low-light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to utilize well-lit labels by augmenting well-lit images to mimic low-light conditions. But handcrafted augmentations oversimplify noise patterns, while learning-based methods often fail to preserve high-frequency low-light characteristics, producing unrealistic images that lead pose models to generalize poorly to real low-light scenes. Moreover, recent pose estimators rely on image cues through image-to-keypoint cross-attention, but these cues become unreliable under low-light conditions. To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes low-light images and dynamically fuses visual cues with pose priors for improved pose estimation. Specifically, our synthesis method incorporates a Direct-Current-based High-Pass Filter (DHF) and a Low-light Characteristics Injection Module (LCIM) to inject high-frequency details from input low-light images, overcoming rigidity or the detail loss in existing approaches. Furthermore, we introduce a Dynamic Control of Attention (DCA) module that adaptively balances image cues with learned pose priors in the Transformer architecture. Experiments show that UDAPose outperforms state-of-the-art methods, with notable AP gains of 10.1 (56.4%) on the ExLPose-test hard set (LL-H) and 7.4 (31.4%) in cross-dataset validation on EHPT-XC. Code: https://github.com/Vision-and-Multimodal-Intelligence-Lab/UDAPose

Chinese Translation

低能见度场景（如低光照条件）由于缺乏带注释的低光照数据集以及光照不足导致的视觉信息丢失，给人体姿态估计带来了重大挑战。近期的域适应技术尝试通过增强良好光照下的图像以模拟低光照条件，从而利用良好光照的标签。但手工设计的增强方法过于简化噪声模式，而基于学习的方法往往无法保留低光照下的高频特征，生成的图像不真实，导致姿态模型在真实低光照场景中的泛化能力较差。此外，现有的姿态估计器依赖于通过图像到关键点的交叉注意力机制提取的图像线索，但这些线索在低光照条件下变得不可靠。为解决上述问题，我们提出了用于姿态估计的无监督域适应框架（UDAPose），该框架通过合成低光照图像并动态融合视觉线索与姿态先验，实现了姿态估计的提升。具体而言，我们的合成方法引入了基于直流分量的高通滤波器（DHF）和低光照特征注入模块（LCIM），以从输入的低光照图像中注入高频细节，克服了现有方法的刚性或细节丢失问题。此外，我们设计了动态注意力控制模块（DCA），在Transformer架构中自适应地平衡图像线索与学习到的姿态先验。实验结果表明，UDAPose优于最先进方法，在ExLPose-test难集（LL-H）上AP提升显著，达到10.1（56.4%），在EHPT-XC跨数据集验证中提升7.4（31.4%）。代码地址：https://github.com/Vision-and-Multimodal-Intelligence-Lab/UDAPose

View on arXiv Download PDF AI Translation

cs.CV / 134 / 2604.10500

Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

用于多模态潜在推理的视觉增强深度扩展

Han, Yudong, Wang, Yong, Yang, Zaiquan, Qu, Zhen, Pan, Liyuan, Chu, Xiangxiang

Abstract

Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly higher and more volatile gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.

Chinese Translation

多模态潜在推理作为一种有前景的范式，取代了显式的链式思维（Chain-of-Thought, CoT）解码，采用隐式特征传播，同时提升了表示的信息量并降低了推理延迟。通过分析潜在训练过程中令牌级别的梯度动态，我们揭示了两个关键观察：（1）由于固有的语言偏差，视觉令牌的梯度范数显著高于且波动较大于文本令牌，导致视觉部分系统性地欠优化；（2）语义简单的令牌收敛迅速，而复杂令牌则表现出持续的梯度不稳定性，受限于固定的网络深度。为解决这些限制，我们提出了视觉重放模块（visual replay module）和路由深度扩展（routing depth scaling），协同增强视觉感知并细化复杂潜变量以实现更深层次的上下文推理。前者模块利用因果自注意力机制估计令牌显著性，通过空间一致性约束强化细粒度的定位；后者机制则自适应地为复杂令牌分配更多推理步骤，实现更深层的上下文细化。在逐步将显式CoT内化为紧凑潜在表示的课程策略指导下，我们的框架在多种基准测试中取得了最先进的性能，同时相比显式CoT基线显著提升了推理速度。

View on arXiv Download PDF AI Translation

cs.CV / 135 / 2604.10512

FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation

FreeScale：通过确定性感知自由视角生成来扩展三维场景

Jiang, Chenhan, Chen, Yu, Zhang, Qingwen, Song, Jifei, Xu, Songcen, Yeung, Dit-Yan, Deng, Jiankang

Abstract

The development of generalizable Novel View Synthesis (NVS) models is critically limited by the scarcity of large-scale training data featuring diverse and precise camera trajectories. While real-world captures are photorealistic, they are typically sparse and discrete. Conversely, synthetic data scales but suffers from a domain gap and often lacks realistic semantics. We introduce FreeScale, a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. Our key insight is that an imperfect reconstructed scene serves as a rich geometric proxy, but naively sampling from it amplifies artifacts. To this end, we propose a certainty-aware free-view sampling strategy identifying novel viewpoints that are both semantically meaningful and minimally affected by reconstruction errors. We demonstrate FreeScale's effectiveness by scaling up the training of feedforward NVS models, achieving a notable gain of 2.7 dB in PSNR on challenging out-of-distribution benchmarks. Furthermore, we show that the generated data can actively enhance per-scene 3D Gaussian Splatting optimization, leading to consistent improvements across multiple datasets. Our work provides a practical and powerful data generation engine to overcome a fundamental bottleneck in 3D vision. Project page: https://mvp-ai-lab.github.io/FreeScale.

Chinese Translation

通用化的新视角合成（Novel View Synthesis, NVS）模型的发展受到缺乏大规模训练数据的严重限制，这些数据需要具有多样化和精确的相机轨迹。虽然真实世界的捕捉结果具有照片级真实感，但通常是稀疏且离散的。相反，合成数据虽然可以扩展，但存在领域差距，并且往往缺乏现实语义。我们提出了FreeScale，一个新颖的框架，利用场景重建的力量，将有限的真实世界图像序列转化为高质量训练数据的可扩展来源。我们的关键见解是，不完美的重建场景可以作为丰富的几何代理，但从中简单采样会放大伪影。为此，我们提出了一种确定性感知的自由视角采样策略，识别出在语义上有意义且受到重建误差影响最小的新视点。我们通过扩展前馈NVS模型的训练，展示了FreeScale的有效性，在具有挑战性的分布外基准上实现了2.7 dB的PSNR显著提升。此外，我们还展示了生成的数据可以积极增强每个场景的三维高斯点云优化，从而在多个数据集上实现一致的改进。我们的工作提供了一个实用且强大的数据生成引擎，以克服三维视觉中的基本瓶颈。项目页面：https://mvp-ai-lab.github.io/FreeScale。

View on arXiv Download PDF AI Translation

cs.CV / 136 / 2604.10514

Data-Efficient Surgical Phase Segmentation in Small-Incision Cataract Surgery: A Controlled Study of Vision Foundation Models

小切口白内障手术中的数据高效手术阶段分割：视觉基础模型的对照研究

Spencer, Lincoln, Wang, Song, Chen, Chen

Abstract

Surgical phase segmentation is central to computer-assisted surgery, yet robust models remain difficult to develop when labeled surgical videos are scarce. We study data-efficient phase segmentation for manual small-incision cataract surgery (SICS) through a controlled comparison of visual representations. To isolate representation quality, we pair each visual encoder with the same temporal model (MS-TCN++) under identical training and evaluation settings on SICS-155 (19 phases). We compare supervised encoders (ResNet-50, I3D) against large self-supervised foundation models (DINOv3, V-JEPA2), and use a cached-feature pipeline that decouples expensive visual encoding from lightweight temporal learning. Foundation-model features improve segmentation performance in this setup, with DINOv3 ViT-7B achieving the best overall results (83.4% accuracy, 87.0 edit score). We further examine cataract-domain transfer using unlabeled videos and lightweight adaptation, and analyze when it helps or hurts. Overall, the study indicates strong transferability of modern vision foundation models to surgical workflow understanding and provides practical guidance for low-label medical video settings. The project website is available at: https://sl2005.github.io/DataEfficient-sics-phase-seg/

Chinese Translation

手术阶段分割是计算机辅助手术的核心，但在标注手术视频稀缺的情况下，开发稳健的模型仍然困难。我们通过对视觉表示的对照比较，研究了手动小切口白内障手术（SICS）的数据高效阶段分割。为了隔离表示质量，我们将每个视觉编码器与相同的时间模型（MS-TCN++）配对，在相同的训练和评估设置下对SICS-155（19个阶段）进行实验。我们比较了监督编码器（ResNet-50，I3D）与大型自监督基础模型（DINOv3，V-JEPA2），并使用缓存特征管道，将昂贵的视觉编码与轻量级的时间学习解耦。在这种设置中，基础模型特征提高了分割性能，其中DINOv3 ViT-7B取得了最佳整体结果（83.4% 准确率，87.0 编辑分数）。我们进一步研究了使用未标记视频和轻量适应的白内障领域迁移，并分析了何时有助于或有害。总体而言，这项研究表明现代视觉基础模型在手术工作流程理解方面具有强大的迁移能力，并为低标注医疗视频环境提供了实用指导。项目网站可访问： https://sl2005.github.io/DataEfficient-sics-phase-seg/

View on arXiv Download PDF AI Translation

cs.CV / 137 / 2604.10524

FGML-DG: Feynman-Inspired Cognitive Science Paradigm for Cross-Domain Medical Image Segmentation

FGML-DG：基于费曼启发的跨领域医学图像分割认知科学范式

Song, Yucheng, Li, Chenxi, Ding, Haokang, Liao, Zhining, Liao, Zhifang

Abstract

In medical image segmentation across multiple modalities (e.g., MRI, CT, etc.) and heterogeneous data sources (e.g., different hospitals and devices), Domain Generalization (DG) remains a critical challenge in AI-driven healthcare. This challenge primarily arises from domain shifts, imaging variations, and patient diversity, which often lead to degraded model performance in unseen domains. To address these limitations, we identify key issues in existing methods, including insufficient simplification of complex style features, inadequate reuse of domain knowledge, and a lack of feedback-driven optimization. To tackle these problems, inspired by Feynman's learning techniques in educational psychology, this paper introduces a cognitive science-inspired meta-learning paradigm for medical image domain generalization segmentation. We propose, for the first time, a cognitive-inspired Feynman-Guided Meta-Learning framework for medical image domain generalization segmentation (FGML-DG), which mimics human cognitive learning processes to enhance model learning and knowledge transfer. Specifically, we first leverage the 'concept understanding' principle from Feynman's learning method to simplify complex features across domains into style information statistics, achieving precise style feature alignment. Second, we design a meta-style memory and recall method (MetaStyle) to emulate the human memory system's utilization of past knowledge. Finally, we incorporate a Feedback-Driven Re-Training strategy (FDRT), which mimics Feynman's emphasis on targeted relearning, enabling the model to dynamically adjust learning focus based on prediction errors. Experimental results demonstrate that our method outperforms other existing domain generalization approaches on two challenging medical image domain generalization tasks.

Chinese Translation

在多模态医学图像分割（例如，MRI、CT等）和异构数据源（例如，不同医院和设备）中，领域泛化（Domain Generalization，DG）仍然是人工智能驱动的医疗保健中的一个关键挑战。这个挑战主要源于领域转移、成像变化和患者多样性，这些因素常常导致模型在未见领域中的性能下降。为了解决这些局限性，我们识别了现有方法中的关键问题，包括对复杂风格特征的简化不足、领域知识的重用不充分以及缺乏反馈驱动的优化。为了解决这些问题，本文受到费曼在教育心理学中的学习技巧的启发，提出了一种基于认知科学的医学图像领域泛化分割的元学习范式。我们首次提出了一种认知启发的费曼指导元学习框架（FGML-DG），该框架模拟人类认知学习过程，以增强模型学习和知识转移。具体而言，我们首先利用费曼学习方法中的“概念理解”原则，将跨领域的复杂特征简化为风格信息统计，从而实现精确的风格特征对齐。其次，我们设计了一种元风格记忆与回忆方法（MetaStyle），以模拟人类记忆系统对过去知识的利用。最后，我们结合了一种反馈驱动的再训练策略（Feedback-Driven Re-Training，FDRT），该策略模拟费曼对有针对性的再学习的重视，使模型能够根据预测误差动态调整学习重点。实验结果表明，我们的方法在两个具有挑战性的医学图像领域泛化任务上优于其他现有的领域泛化方法。

View on arXiv Download PDF AI Translation

cs.CV / 138 / 2604.10527

STORM: End-to-End Referring Multi-Object Tracking in Videos

STORM：视频中的端到端指代多目标跟踪

Lu, Zijia, Yi, Jingru, Wang, Jue, Chen, Yuxiao, Chen, Junwen, Li, Xinyu, Modolo, Davide

Abstract

Referring multi-object tracking (RMOT) is a task of associating all the objects in a video that semantically match with given textual queries or referring expressions. Existing RMOT approaches decompose object grounding and tracking into separated modules and exhibit limited performance due to the scarcity of training videos, ambiguous annotations, and restricted domains. In this work, we introduce STORM, an end-to-end MLLM that jointly performs grounding and tracking within a unified framework, eliminating external detectors and enabling coherent reasoning over appearance, motion, and language. To improve data efficiency, we propose a task-composition learning (TCL) strategy that decomposes RMOT into image grounding and object tracking, allowing STORM to leverage data-rich sub-tasks and learn structured spatial--temporal reasoning. We further construct STORM-Bench, a new RMOT dataset with accurate trajectories and diverse, unambiguous referring expressions generated through a bottom-up annotation pipeline. Extensive experiments show that STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks, demonstrating strong generalization and robust spatial--temporal grounding in complex real-world scenarios. STORM-Bench is released at https://github.com/amazon-science/storm-referring-multi-object-grounding.

Chinese Translation

指代多目标跟踪（RMOT）是将视频中所有与给定文本查询或指代表达语义匹配的对象进行关联的任务。现有的RMOT方法将对象定位和跟踪分解为独立的模块，由于训练视频稀缺、注释模糊和领域限制，表现有限。在本研究中，我们提出了STORM，一种端到端的多模态学习模型（MLLM），在统一框架内共同执行定位和跟踪，消除了外部检测器，并实现了对外观、运动和语言的连贯推理。为了提高数据效率，我们提出了一种任务组合学习（TCL）策略，将RMOT分解为图像定位和对象跟踪，使STORM能够利用数据丰富的子任务并学习结构化的时空推理。我们进一步构建了STORM-Bench，一个新的RMOT数据集，具有准确的轨迹和通过自下而上的注释流程生成的多样化、明确的指代表达。大量实验表明，STORM在图像定位、单目标跟踪和RMOT基准测试中实现了最先进的性能，展示了在复杂真实场景中强大的泛化能力和稳健的时空定位。STORM-Bench已发布于 https://github.com/amazon-science/storm-referring-multi-object-grounding。

View on arXiv Download PDF AI Translation

cs.CV / 139 / 2604.10528

BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs

BareBones：零-shot几何理解在视觉语言模型中的基准测试

Baranwal, Aaditya, Yadav, Vishal, Rajora, Abhishek

Abstract

While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it yet remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce \textbf{BareBones}, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free geometric taxonomy. WTP-Bench is an extreme, fine-grained visual puzzle that forces models to identify inter-class geometric concepts from boundary contours alone. Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (\eg, GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the \textit{Texture Bias Cliff}. By documenting universal structural blindspots, BareBones establishes a rigorous yardstick for genuine geometric grounding.

Chinese Translation

尽管视觉语言模型（VLMs）在多种多模态任务中展现出卓越的零-shot识别能力，但这些架构是否真正理解几何结构，还是仅仅利用RGB纹理和上下文先验作为统计捷径，仍然是一个未解的问题。现有评估未能隔离这一机制，将语义推理与纹理映射混为一谈，并依赖于不精确的注释，这些注释无意中泄露了环境线索。为了解决这一空白，我们引入了 extbf{BareBones}，一个旨在严格测试纯几何形状理解的零-shot基准。我们在六个数据集中策划了几何上独特类别的像素级轮廓：五个已建立的分割来源（ImageNet-S、DIS5K、ThinObject5K、PASCAL VOC、CUB-200）以及我们的新旗舰集合WTP-Bench，建立了一个无噪声的几何分类法。WTP-Bench是一个极端、细粒度的视觉难题，迫使模型仅从边界轮廓中识别类间几何概念。我们对26个最先进的专有和开放权重的VLM（例如，GPT-4.1、Gemini、Claude Sonnet 4.5、LLaVA）的评估显示，在RGB缺失的情况下，性能出现了一致且严重的崩溃现象，我们称之为 extit{纹理偏差悬崖}。通过记录普遍的结构盲点，BareBones为真正的几何基础建立了一个严格的标准。

View on arXiv Download PDF AI Translation

cs.CV / 140 / 2604.10532

The Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results

2026年NTIRE真实世界人脸修复第二届挑战赛：方法与结果

Wang, Jingkai, Gong, Jue, Chen, Zheng, Liu, Kai, Li, Jiatong, Zhang, Yulun, Timofte, Radu, Tu, Jiachen, Shi, Yaokun, Xu, Guoyi, Jiang, Yaoxin, Liu, Jiajia, Chen, Yingsi, Liu, Yijiao, Li, Hui, Wang, Yu, Zhu, Congchao, Lefterache, Alexandru-Gabriel, Radoi, Anamaria, Yan, Chuanyue, Lu, Tao, Zhang, Yanduo, Zhao, Kanghui, Wang, Jiaming, Li, Yuqi, Xiong, WenBo, Chen, Yifei, Hu, Xian, Deng, Wei, Zhou, Daiguo, Roy V, Sujith, Jesuraj, Claudia, B, Vikas, LC, Spoorthi, Akalwadi, Nikhil, Tabib, Ramesh Ashok, Mudenagudi, Uma, Jiang, Yuxuan, Zeng, Chengxi, Peng, Tianhao, Zhang, Fan, Zhou, David Bull Wei, Li, Linfeng, Huang, Hongyu, Lee, Hoyoung, Oh, SangYun, Jeong, ChangYoung, Niu, Axi, Zhang, Jinyang, Wu, Zhenguo, Qing, Senyan, Sun, Jinqiu, Zhang, Yanning

Abstract

This paper provides a review of the NTIRE 2026 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural and realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. Performance is evaluated using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 96 registrants, with 10 teams submitting valid models; ultimately, 9 teams achieved valid scores in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.

Chinese Translation

本文对2026年NTIRE真实世界人脸修复挑战赛进行了回顾，重点介绍了提出的解决方案及其结果。该挑战赛的重点在于生成自然且真实的输出，同时保持身份一致性。其目标是推动感知质量和真实感的最先进解决方案的发展，而不对计算资源或训练数据施加限制。性能评估使用加权图像质量评估（IQA）分数，并采用AdaFace模型作为身份检查器。比赛吸引了96名注册者，其中10个团队提交了有效模型；最终，9个团队在最终排名中获得了有效分数。这一合作努力推动了真实世界人脸修复的性能，同时提供了该领域最新趋势的深入概述。

View on arXiv Download PDF AI Translation

cs.CV / 141 / 2604.10541

Bidirectional Learning of Facial Action Units and Expressions via Structured Semantic Mapping across Heterogeneous Datasets

通过结构化语义映射实现异构数据集间面部动作单元与表情的双向学习

Li, Jia, Zhang, Yu, Chen, Yin, Hu, Zhenzhen, Li, Yong, Hong, Richang, Shan, Shiguang, Wang, Meng

Abstract

Facial action unit (AU) detection and facial expression (FE) recognition can be jointly viewed as affective facial behavior tasks, representing fine-grained muscular activations and coarse-grained holistic affective states, respectively. Despite their inherent semantic correlation, existing studies predominantly focus on knowledge transfer from AUs to FEs, while bidirectional learning remains insufficiently explored. In practice, this challenge is further compounded by heterogeneous data conditions, where AU and FE datasets differ in annotation paradigms (frame-level vs.\ clip-level), label granularity, and data availability and diversity, hindering effective joint learning. To address these issues, we propose a Structured Semantic Mapping (SSM) framework for bidirectional AU--FE learning under different data domains and heterogeneous supervision. SSM consists of three key components: (1) a shared visual backbone that learns unified facial representations from dynamic AU and FE videos; (2) semantic mediation via a Textual Semantic Prototype (TSP) module, which constructs structured semantic prototypes from fixed textual descriptions augmented with learnable context prompts, serving as supervision signals and cross-task alignment anchors in a shared semantic space; and (3) a Dynamic Prior Mapping (DPM) module that incorporates prior knowledge derived from the Facial Action Coding System and learns a data-driven association matrix in a high-level feature space, enabling explicit and bidirectional knowledge transfer. Extensive experiments on popular AU detection and FE recognition benchmarks show that SSM achieves state-of-the-art performance on both tasks simultaneously, and demonstrate that holistic expression semantics can in turn enhance fine-grained AU learning even across heterogeneous datasets.

Chinese Translation

面部动作单元（AU）检测与面部表情（FE）识别可共同视为情感面部行为任务，分别代表细粒度的肌肉激活和粗粒度的整体情感状态。尽管二者在语义上存在内在关联，现有研究主要集中于从AU向FE的知识迁移，而双向学习尚未得到充分探索。在实际应用中，该挑战因异构数据条件而更加复杂，AU与FE数据集在标注范式（帧级与片段级）、标签粒度及数据可用性和多样性方面存在差异，阻碍了有效的联合学习。为解决这些问题，本文提出了一种结构化语义映射（Structured Semantic Mapping，SSM）框架，实现不同数据域和异构监督下的AU与FE双向学习。SSM包含三个关键组件：（1）共享视觉骨干网络，从动态AU与FE视频中学习统一的面部表示；（2）通过文本语义原型（Textual Semantic Prototype，TSP）模块进行语义中介，该模块基于固定文本描述并辅以可学习的上下文提示构建结构化语义原型，作为监督信号及共享语义空间中的跨任务对齐锚点；（3）动态先验映射（Dynamic Prior Mapping，DPM）模块，融合源自面部动作编码系统（Facial Action Coding System）的先验知识，并在高层特征空间中学习数据驱动的关联矩阵，实现显式且双向的知识传递。在多个主流AU检测与FE识别基准上的大量实验表明，SSM在两项任务上均达到最先进性能，并验证了整体表情语义能够反过来提升细粒度AU学习效果，即使在异构数据集之间亦然。

View on arXiv Download PDF AI Translation

cs.CV / 142 / 2604.10546

Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression

用于生成图像压缩的率-失真优化的可微向量量化

Jiang, Shiyin, Long, Wei, Han, Minghao, Chen, Zhenghao, Zhu, Ce, Gu, Shuhang

Abstract

The rapid growth of visual data under stringent storage and bandwidth constraints makes extremely low-bitrate image compression increasingly important. While Vector Quantization (VQ) offers strong structural fidelity, existing methods lack a principled mechanism for joint rate-distortion (RD) optimization due to the disconnect between representation learning and entropy modeling. We propose RDVQ, a unified framework that enables end-to-end RD optimization for VQ-based compression via a differentiable relaxation of the codebook distribution, allowing the entropy loss to directly shape the latent prior. We further develop an autoregressive entropy model that supports accurate entropy modeling and test-time rate control. Extensive experiments demonstrate that RDVQ achieves strong performance at extremely low bitrates with a lightweight architecture, attaining competitive or superior perceptual quality with significantly fewer parameters. Compared with RDEIC, RDVQ reduces bitrate by up to 75.71% on DISTS and 37.63% on LPIPS on DIV2K-val. Beyond empirical gains, RDVQ introduces an entropy-constrained formulation of VQ, highlighting the potential for a more unified view of image tokenization and compression. The code will be available at https://github.com/CVL-UESTC/RDVQ.

Chinese Translation

在严格的存储和带宽限制下，视觉数据的快速增长使得极低比特率的图像压缩变得愈发重要。虽然向量量化（Vector Quantization, VQ）提供了强大的结构保真度，但现有方法由于表示学习与熵建模之间的脱节，缺乏一个原则性的联合率-失真（Rate-Distortion, RD）优化机制。我们提出了RDVQ，一个统一框架，通过对代码本分布的可微松弛，支持基于VQ的压缩的端到端RD优化，使得熵损失能够直接影响潜在先验。我们进一步开发了一种自回归熵模型，支持准确的熵建模和测试时的比特率控制。大量实验表明，RDVQ在极低比特率下以轻量级架构实现了强大的性能，在参数显著减少的情况下，达到了具有竞争力或更优的感知质量。与RDEIC相比，RDVQ在DIV2K-val上在DISTS上将比特率降低了高达75.71%，在LPIPS上降低了37.63%。除了经验上的提升，RDVQ还引入了一个熵约束的VQ公式，突显了图像标记化和压缩更统一视角的潜力。代码将发布在 https://github.com/CVL-UESTC/RDVQ。

View on arXiv Download PDF AI Translation

cs.CV / 143 / 2604.10551

NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models: Datasets, Methods and Results

NTIRE 2026 挑战赛：基于生成模型的野外短视频UGC修复——数据集、方法与结果

Li, Xin, Gong, Jiachao, Wang, Xijun, Xiong, Shiyao, Li, Bingchen, Yao, Suhang, Zhou, Chao, Chen, Zhibo, Timofte, Radu, Chen, Yuxiang, Yin, Shibo, Zhong, Yilian, Fang, Yushun, Zhu, Xilei, Wang, Yahui, Lu, Chen, Zheng, Meisong, Chen, Xiaoxu, Yang, Jing, Hu, Zhaokun, Liu, Jiahui, Chen, Ying, Bai, Haoran, Deng, Sibin, Li, Shengxi, Xu, Mai, Chen, Junyang, Chen, Hao, Zhu, Xinzhe, Zhang, Fengkai, Sun, Long, Yang, Yixing, Zhang, Xindong, Dong, Jiangxin, Pan, Jinshan, Zhang, Jiyuan, Liu, Shuai, Huang, Yibin, Wang, Xiaotao, Lei, Lei, Liu, Zhirui, Chen, Shinan, Sun, Shang-Quan, Ren, Wenqi, Xu, Jingyi, Chen, Zihong, Zou, Zhuoya, Qiu, Xiuhao, Ma, Jingyu, Fu, Huiyuan, Liu, Kun, Ma, Huadong, Feng, Dehao, Ma, Zhijie, Zhang, Boqi, Shi, Jiawei, Kang, Hao, Yang, Yixin, Jin, Yeying, Cheng, Xu, Jiang, Yuxuan, Zeng, Chengxi, Peng, Tianhao, Zhang, Fan, Bull, David, Xing, Yanan, Tu, Jiachen, Xu, Guoyi, Jiang, Yaoxin, Liu, Jiajia, Shi, Yaokun, Zhou, Wei, Li, Linfeng, Song, Hang, Xu, Qi, Yuan, Kun, Shao, Yizhen, Ren, Yulin

Abstract

This paper presents an overview of the NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models. This challenge utilizes a new short-form UGC (S-UGC) video restoration benchmark, termed KwaiVIR, which is contributed by USTC and Kuaishou Technology. It contains both synthetically distorted videos and real-world short-form UGC videos in the wild. For this edition, the released data include 200 synthetic training videos, 48 wild training videos, 11 validation videos, and 20 testing videos. The primary goal of this challenge is to establish a strong and practical benchmark for restoring short-form UGC videos under complex real-world degradations, especially in the emerging paradigm of generative-model-based S-UGC video restoration. This challenge has two tracks: (i) the primary track is a subjective track, where the evaluation is based on a user study; (ii) the second track is an objective track. These two tracks enable a comprehensive assessment of restoration quality. In total, 95 teams have registered for this competition. And 12 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the KwaiVIR benchmark, demonstrating encouraging progress in short-form UGC video restoration in the wild.

Chinese Translation

本文介绍了NTIRE 2026挑战赛——基于生成模型的野外短视频UGC修复的整体情况。本次挑战赛采用了由中国科学技术大学（USTC）和快手科技共同贡献的新型短视频UGC（S-UGC）修复基准数据集KwaiVIR，该数据集包含合成失真视频和真实野外短视频UGC。此次赛事发布的数据包括200个合成训练视频、48个野外训练视频、11个验证视频及20个测试视频。挑战的主要目标是建立一个强大且实用的基准，用于在复杂真实世界退化条件下修复短视频UGC，特别是在基于生成模型的新兴修复范式中。本次挑战设有两个赛道：（i）主赛道为主观赛道，评估基于用户研究；（ii）第二赛道为客观赛道。这两个赛道共同实现了对修复质量的全面评估。共有95支团队注册参赛，12支团队提交了有效的最终方案及技术说明书参与测试阶段。提交的方法在KwaiVIR基准上表现出色，展示了野外短视频UGC修复领域的积极进展。

View on arXiv Download PDF AI Translation

cs.CV / 144 / 2604.10554

Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor

基于时空差分引导的运动去模糊及互补视觉传感器

Meng, Yapeng, Yang, Lin, Chen, Yuguo, Chen, Xiangru, Wang, Taoyi, Wang, Lijian, Yang, Zheyu, Lin, Yihan, Zhao, Rong

Abstract

Motion blur arises when rapid scene changes occur during the exposure period, collapsing rich intra-exposure motion into a single RGB frame. Without explicit structural or temporal cues, RGB-only deblurring is highly ill-posed and often fails under extreme motion. Inspired by the human visual system, brain-inspired vision sensors introduce temporally dense information to alleviate this problem. However, event cameras still suffer from event rate saturation under rapid motion, while the event modality entangles edge features and motion cues, which limits their effectiveness. As a recent breakthrough, the complementary vision sensor (CVS), Tianmouc, captures synchronized RGB frames together with high-frame-rate, multi-bit spatial difference (SD, encoding structural edges) and temporal difference (TD, encoding motion cues) data within a single RGB exposure, offering a promising solution for RGB deblurring under extreme dynamic scenes. To fully leverage these complementary modalities, we propose Spatio-Temporal Difference Guided Deblur Net (STGDNet), which adopts a recurrent multi-branch architecture that iteratively encodes and fuses SD and TD sequences to restore structure and color details lost in blurry RGB inputs. Our method outperforms current RGB or event-based approaches in both synthetic CVS dataset and real-world evaluations. Moreover, STGDNet exhibits strong generalization capability across over 100 extreme real-world scenarios. Project page: https://tmcDeblur.github.io/

Chinese Translation

运动模糊产生于曝光期间场景的快速变化，将丰富的曝光内运动信息压缩为单一的RGB帧。仅依赖RGB的去模糊问题高度病态，在极端运动条件下常常失败。受人类视觉系统启发，脑启发视觉传感器引入了时间密集的信息以缓解该问题。然而，事件相机在快速运动下仍存在事件率饱和问题，且事件模态将边缘特征与运动线索混合，限制了其有效性。作为近期突破，互补视觉传感器（Complementary Vision Sensor，CVS）——Tianmouc，在单次RGB曝光内同步捕获RGB帧及高帧率、多比特的空间差分（Spatial Difference，SD，编码结构边缘）和时间差分（Temporal Difference，TD，编码运动线索）数据，为极端动态场景下的RGB去模糊提供了有前景的解决方案。为充分利用这些互补模态，我们提出了时空差分引导去模糊网络（Spatio-Temporal Difference Guided Deblur Net，STGDNet），该网络采用递归多分支架构，迭代编码并融合SD和TD序列，以恢复模糊RGB输入中丢失的结构和色彩细节。我们的方法在合成CVS数据集和真实世界评估中均优于当前基于RGB或事件的去模糊方法。此外，STGDNet在超过100个极端真实场景中表现出强大的泛化能力。项目主页：https://tmcDeblur.github.io/

View on arXiv Download PDF AI Translation

cs.CV / 145 / 2604.10573

Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

从无姿态多视图图像中学习空间智能的三维表示

Zhou, Bo, Lai, Qiuxia, Sun, Zeren, Shu, Xiangbo, Yao, Yazhou, Wang, Wenguan

Abstract

Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even under unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, to enforce geometric-semantic consistency, we introduce a pose-conditioned recalibration mechanism that interrelates the outputs of multiple heads by re-projecting predicted 3D point and semantic maps into the image plane using estimated camera parameters, and aligning them with corresponding RGB and semantic predictions to ensure cross-task consistency, thereby resolving geometry-semantic mismatches. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.

Chinese Translation

鲁棒的三维表示学习构成了空间智能的感知基础，支持场景理解和具身人工智能等下游任务。然而，直接从无姿态多视图图像中学习此类表示仍然具有挑战性。近期的自监督方法尝试以前馈方式统一几何、外观和语义，但常面临几何诱导能力弱、外观细节有限以及几何与语义不一致的问题。我们提出了UniSplat，一种通过三个互补组件设计的前馈框架，以解决这些限制。首先，我们提出了一种双重掩码策略，增强编码器中的几何诱导能力。通过同时掩码编码器和解码器的tokens，并将解码器掩码聚焦于几何丰富区域，模型被迫从不完整的视觉线索中推断结构信息，即使在无姿态输入下也能生成具几何感知的表示。其次，我们开发了一种由粗到细的高斯点溅射策略，通过逐步细化辐射场来减少外观与语义的不一致。最后，为了强化几何与语义的一致性，我们引入了一种基于姿态的重新校准机制，该机制通过利用估计的相机参数将预测的三维点和语义图重新投影到图像平面，并与对应的RGB及语义预测对齐，从而确保跨任务一致性，解决几何与语义的错配问题。综合这些组件，UniSplat生成的统一三维表示对无姿态、稀疏视角输入具有鲁棒性，并能泛化至多种任务，为空间智能奠定了感知基础。

View on arXiv Download PDF AI Translation

cs.CV / 146 / 2604.10578

Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

Rein3D：基于全景视频扩散模型的强化三维室内场景生成

Wang, Dehui, Xu, Congsheng, Wei, Rong, Shi, Yue, Chen, Shoufa, Luo, Dingxiang, Yang, Tianshuo, Yang, Xiaokang, Qin, Yusen, Tang, Rui, Mu, Yao

Abstract

The growing demand for Embodied AI and VR applications has highlighted the need for synthesizing high-quality 3D indoor scenes from sparse inputs. However, existing approaches struggle to infer massive amounts of missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions. We present Rein3D, a framework that reconstructs full 360-degree indoor environments by coupling explicit 3D Gaussian Splatting (3DGS) with temporally coherent priors from video diffusion models. Our approach follows a "restore-and-refine" paradigm: we employ a radial exploration strategy to render imperfect panoramic videos along trajectories starting from the origin, effectively uncovering occluded regions from a coarse 3DGS initialization. These sequences are restored by a panoramic video-to-video diffusion model and further enhanced via video super-resolution to synthesize high-fidelity geometry and textures. Finally, these refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. To support this task, we construct PanoV2V-15K, a dataset of over 15K paired clean and degraded panoramic videos for diffusion-based scene restoration. Experiments demonstrate that Rein3D produces photorealistic and globally consistent 3D scenes and significantly improves long-range camera exploration compared with existing baselines.

Chinese Translation

随着具身人工智能（Embodied AI）和虚拟现实（VR）应用需求的增长，如何从稀疏输入合成高质量三维室内场景成为亟需解决的问题。然而，现有方法在推断大范围未知区域中大量缺失几何信息时难以保持全局一致性，往往只能生成局部合理但全局不一致的重建结果。本文提出Rein3D框架，通过将显式三维高斯点渲染（3D Gaussian Splatting，3DGS）与来自视频扩散模型的时间一致先验相结合，实现完整360度室内环境的重建。我们的方法遵循“恢复与精炼”范式：采用径向探索策略沿起点轨迹渲染不完美的全景视频，有效揭示基于粗略3DGS初始化的遮挡区域。随后，利用全景视频到视频的扩散模型对这些序列进行恢复，并通过视频超分辨率进一步增强，以合成高保真几何和纹理。最终，将这些精炼后的视频作为伪真实数据更新全局三维高斯场。为支持该任务，我们构建了PanoV2V-15K数据集，包含超过1.5万对干净与降质的全景视频，用于基于扩散的场景恢复。实验结果表明，Rein3D能够生成光照真实且全局一致的三维场景，并在长距离相机探索任务中显著优于现有基线方法。

View on arXiv Download PDF AI Translation

cs.CV / 147 / 2604.10582

TAPNext++: What's Next for Tracking Any Point (TAP)?

TAPNext++：Tracking Any Point (TAP) 的下一步发展

Jung, Sebastian, Zholus, Artem, Sundermeyer, Martin, Doersch, Carl, Goroshin, Ross, Tan, David Joseph, Chandar, Sarath, Triebel, Rudolph, Tombari, Federico

Abstract

Tracking-Any-Point (TAP) models aim to track any point through a video which is a crucial task in AR/XR and robotics applications. The recently introduced TAPNext approach proposes an end-to-end, recurrent transformer architecture to track points frame-by-frame in a purely online fashion -- demonstrating competitive performance at minimal latency. However, we show that TAPNext struggles with longer video sequences and also frequently fails to re-detect query points that reappear after being occluded or leaving the frame. In this work, we present TAPNext++, a model that tracks points in sequences that are orders of magnitude longer while preserving the low memory and compute footprint of the architecture. We train the recurrent video transformer using several data-driven solutions, including training on long 1024-frame sequences enabled by sequence parallelism techniques. We highlight that re-detection performance is a blind spot in the current literature and introduce a new metric, Re-Detection Average Jaccard ($AJ_{RD}$), to explicitly evaluate tracking on re-appearing points. To improve re-detection of points, we introduce tailored geometric augmentations, such as periodic roll that simulates point re-entries, and supervising occluded points. We demonstrate that recurrent transformers can be substantially improved for point tracking and set a new state-of-the-art on multiple benchmarks. Model and code can be found at https://tap-next-plus-plus.github.io.

Chinese Translation

Tracking-Any-Point (TAP) 模型旨在实现对视频中任意点的跟踪，这在增强现实/扩展现实（AR/XR）和机器人应用中具有重要意义。最近提出的 TAPNext 方法采用端到端的递归变换器架构，以纯在线的方式逐帧跟踪点，展示了在极低延迟下的竞争性能。然而，我们发现 TAPNext 在处理较长视频序列时表现不佳，且经常无法重新检测在被遮挡或离开画面后重新出现的查询点。在本工作中，我们提出了 TAPNext++，该模型能够在序列长度大幅增加的情况下跟踪点，同时保持架构的低内存和计算开销。我们通过多种数据驱动的方案训练递归视频变换器，包括利用序列并行技术训练长达1024帧的序列。我们指出重新检测性能是当前文献中的盲点，并引入了新的评估指标——重新检测平均雅可比指数（Re-Detection Average Jaccard，$AJ_{RD}$），以明确评估对重新出现点的跟踪效果。为提升点的重新检测能力，我们引入了定制的几何增强方法，如模拟点重新进入的周期性滚动（periodic roll）以及对被遮挡点的监督。我们展示了递归变换器在点跟踪任务上的显著改进，并在多个基准测试中创下了新的最先进水平。模型与代码可访问：https://tap-next-plus-plus.github.io。

View on arXiv Download PDF AI Translation

cs.CV / 148 / 2604.10584

CoFusion: Multispectral and Hyperspectral Image Fusion via Spectral Coordinate Attention

CoFusion：基于光谱坐标注意力的多光谱与高光谱图像融合

Li, Baisong

Abstract

Multispectral and Hyperspectral Image Fusion (MHIF) aims to reconstruct high-resolution images by integrating low-resolution hyperspectral images (LRHSI) and high-resolution multispectral images (HRMSI). However, existing methods face limitations in modeling cross-scale interactions and spatial-spectral collaboration, making it difficult to achieve an optimal trade-off between spatial detail enhancement and spectral fidelity. To address this challenge, we propose CoFusion: a unified spatial-spectral collaborative fusion framework that explicitly models cross-scale and cross-modal dependencies. Specifically, a Multi-Scale Generator (MSG) is designed to construct a three-level pyramidal architecture, enabling the effective integration of global semantics and local details. Within each scale, a dual-branch strategy is employed: the Spatial Coordinate-Aware Mixing module (SpaCAM) is utilized to capture multi-scale spatial contexts, while the Spectral Coordinate-Aware Mixing module (SpeCAM) enhances spectral representations through frequency decomposition and coordinate mixing. Furthermore, we introduce the Spatial-Spectral Cross-Fusion Module (SSCFM) to perform dynamic cross-modal alignment and complementary feature fusion. Extensive experiments on multiple benchmark datasets demonstrate that CoFusion consistently outperforms state-of-the-art methods, achieving superior performance in both spatial reconstruction and spectral consistency.

Chinese Translation

多光谱与高光谱图像融合（MHIF）旨在通过融合低分辨率高光谱图像（LRHSI）与高分辨率多光谱图像（HRMSI）来重建高分辨率图像。然而，现有方法在建模跨尺度交互和空间-光谱协同方面存在局限，难以在空间细节增强与光谱保真度之间实现最佳平衡。为解决该挑战，我们提出了CoFusion：一种统一的空间-光谱协同融合框架，能够显式建模跨尺度及跨模态依赖关系。具体而言，设计了多尺度生成器（Multi-Scale Generator，MSG）构建三级金字塔结构，有效整合全局语义与局部细节。在每个尺度内，采用双分支策略：空间坐标感知混合模块（Spatial Coordinate-Aware Mixing，SpaCAM）用于捕捉多尺度空间上下文，光谱坐标感知混合模块（Spectral Coordinate-Aware Mixing，SpeCAM）通过频率分解与坐标混合增强光谱表示。此外，引入空间-光谱交叉融合模块（Spatial-Spectral Cross-Fusion Module，SSCFM）实现动态跨模态对齐与互补特征融合。在多个基准数据集上的大量实验表明，CoFusion持续优于现有最先进方法，在空间重建与光谱一致性方面均取得卓越表现。

View on arXiv Download PDF AI Translation

cs.CV / 149 / 2604.10591

GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing

GeoMeld：面向遥感的语义基础模型

Hasan, Maram, Hossain, Md Aminur, Roy, Savitra, Bhowmik, Souparna, Patel, Ayush V., Singha, Mainak, Chaudhuri, Subhasis, Khan, Muhammad Haris, Banerjee, Biplab

Abstract

Effective foundation modeling in remote sensing requires spatially aligned heterogeneous modalities coupled with semantically grounded supervision, yet such resources remain limited at scale. We present GeoMeld, a large-scale multimodal dataset with approximately 2.5 million spatially aligned samples. The dataset spans diverse modalities and resolutions and is constructed under a unified alignment protocol for modality-aware representation learning. GeoMeld provides semantically grounded language supervision through an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata, encoding measurable cross-modality relationships within textual descriptions. To leverage this dataset, we introduce GeoMeld-FM, a pretraining framework that combines multi-pretext masked autoencoding over aligned modalities, JEPA representation learning, and caption-vision contrastive alignment. This joint objective enables the learned representation space to capture both reliable cross-sensor physical consistency and grounded semantics. Experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness. Together, GeoMeld and GeoMeld-FM establish a scalable reference framework for semantically grounded multi-modal foundation modeling in remote sensing.

Chinese Translation

有效的遥感基础建模需要空间对齐的异构模态与语义基础的监督相结合，但此类资源在规模上仍然有限。我们提出了GeoMeld，一个大规模的多模态数据集，包含约250万个空间对齐的样本。该数据集涵盖了多种模态和分辨率，并在统一的对齐协议下构建，以实现模态感知的表示学习。GeoMeld通过一个代理式字幕框架提供语义基础的语言监督，该框架合成并验证来自光谱信号、地形统计和结构化地理元数据的注释，编码文本描述中的可测量跨模态关系。为了利用该数据集，我们引入了GeoMeld-FM，一个预训练框架，结合了对齐模态的多前置掩码自编码、JEPA表示学习和字幕-视觉对比对齐。这一联合目标使得学习的表示空间能够捕捉到可靠的跨传感器物理一致性和基础语义。实验表明，在下游迁移和跨传感器鲁棒性方面均取得了一致的提升。GeoMeld和GeoMeld-FM共同建立了一个可扩展的参考框架，用于遥感中的语义基础多模态建模。

View on arXiv Download PDF AI Translation

cs.CV / 150 / 2604.10597

COREY: A Prototype Study of Entropy-Guided Operator Fusion with Hadamard Reparameterization for Selective State Space Models

COREY：一种基于熵引导算子融合与Hadamard重参数化的选择性状态空间模型原型研究

Ma, Bo, Wu, Jinsong, Wei, Hongjiang, Yan, Weiqi

Abstract

State Space Models (SSMs), represented by the Mamba family, provide linear-time sequence modeling and are attractive for long-context inference. Yet practical deployments remain memory-bandwidth limited because selective state updates are often decomposed into fragmented kernels with repeated intermediate tensor materialization. We present COREY, a prototype framework that combines memory-aware operator fusion with Hadamard-based feature reparameterization. Activation entropy, estimated with fixed-width histograms, is used as a runtime scheduling statistic to place fusion boundaries and choose tile sizes. To regularize heavy-tailed activations, we absorb normalized Hadamard transforms into linear projections, preserving functional equivalence while reducing peak-coordinate concentration. In a controlled prototype study over heavy-tailed SSM activations, COREY consistently reduces proxy latency, improves throughput, and lowers DRAM traffic relative to unfused and fixed-depth baselines. Low-bit results are reported only through a hand-crafted stability proxy and are intended as diagnostic evidence rather than checkpoint-level quality claims. Code repository: https://github.com/mabo1215/COREY_Transformer.git.

Chinese Translation

状态空间模型（State Space Models，SSMs），以Mamba系列为代表，提供了线性时间的序列建模能力，因而在长上下文推理中具有吸引力。然而，实际部署仍受限于内存带宽，因为选择性状态更新通常被分解为多个碎片化的内核，伴随重复的中间张量物化。我们提出了COREY，一种结合了内存感知算子融合与基于Hadamard的特征重参数化的原型框架。通过固定宽度直方图估计的激活熵被用作运行时调度统计量，以确定融合边界和选择切片大小。为了正则化重尾激活，我们将归一化的Hadamard变换吸收到线性投影中，保持功能等价性的同时减少峰值坐标的集中度。在针对重尾SSM激活的受控原型研究中，COREY相较于未融合及固定深度基线，持续降低代理延迟、提升吞吐量并减少DRAM流量。低位宽结果仅通过手工设计的稳定性代理报告，旨在作为诊断性证据，而非检查点级别的质量声明。代码仓库：https://github.com/mabo1215/COREY_Transformer.git。

View on arXiv Download PDF AI Translation

cs.CV / 151 / 2604.10609

Self-supervised Pretraining of Cell Segmentation Models

细胞分割模型的自监督预训练

Stillwagon, Kaden, VandeLoo, Alexandra Dunnum, Magondu, Benjamin, Forest, Craig R.

Abstract

Instance segmentation enables the analysis of spatial and temporal properties of cells in microscopy images by identifying the pixels belonging to each cell. However, progress is constrained by the scarcity of high-quality labeled microscopy datasets. Many recent approaches address this challenge by initializing models with segmentation-pretrained weights from large-scale natural-image models such as Segment Anything Model (SAM). However, representations learned from natural images often encode objectness and texture priors that are poorly aligned with microscopy data, leading to degraded performance under domain shift. We propose DINOCell, a self-supervised framework for cell instance segmentation that leverages representations from DINOv2 and adapts them to microscopy through continued self-supervised training on unlabeled cell images prior to supervised fine-tuning. On the LIVECell benchmark, DINOCell achieves a SEG score of 0.784, improving by 10.42% over leading SAM-based models, and demonstrates strong zero-shot performance on three out-of-distribution microscopy datasets. These results highlight the benefits of domain-adapted self-supervised pretraining for robust cell segmentation.

Chinese Translation

实例分割通过识别属于每个细胞的像素，实现了对显微镜图像中细胞的空间和时间特性的分析。然而，高质量标注显微镜数据集的稀缺限制了该领域的进展。许多近期方法通过使用大规模自然图像模型（如Segment Anything Model，SAM）中预训练的分割权重来初始化模型，以应对这一挑战。然而，从自然图像中学习到的表征通常编码了与显微镜数据不匹配的物体性和纹理先验，导致在领域转移时性能下降。我们提出了DINOCell，一种用于细胞实例分割的自监督框架，该框架利用DINOv2的表征，并通过在无标签细胞图像上继续进行自监督训练，适配这些表征到显微镜领域，随后进行有监督微调。在LIVECell基准测试中，DINOCell实现了0.784的SEG分数，比基于SAM的领先模型提升了10.42%，并在三个分布外显微镜数据集上展示了强大的零样本性能。这些结果凸显了领域适应自监督预训练在实现稳健细胞分割中的优势。

View on arXiv Download PDF AI Translation

cs.CV / 152 / 2604.10619

How to Design a Compact High-Throughput Video Camera?

如何设计紧凑型高通量视频摄像机？

Qiu, Chenxi, Yue, Tao, Hu, Xuemei

Abstract

High throughput video acquisition is a challenging problem and has been drawing increasing attention. Existing high throughput imaging systems splice hundreds of sub-images/videos into high throughput videos, suffering from extremely high system complexity. Alternatively, with pixel sizes reducing to sub-micrometer levels, integrating ultra-high throughput on a single chip is becoming feasible. Nevertheless, the readout and output transmission speed cannot keep pace with the increasing pixel numbers. To this end, this paper analyzes the strength of gradient cameras in fast readout and efficient representation, and proposes a low-bit gradient camera scheme based on existing technologies that can resolve the readout and transmission bottlenecks for high throughput video imaging. A multi-scale reconstruction CNN is proposed to reconstruct high-resolution images. Extensive experiments on both simulated and real data are conducted to demonstrate the promising quality and feasibility of the proposed method.

Chinese Translation

高通量视频采集是一个具有挑战性的问题，近年来受到越来越多的关注。现有的高通量成像系统将数百个子图像/视频拼接成高通量视频，面临着极高的系统复杂性。另一方面，随着像素尺寸缩小到亚微米级别，在单个芯片上集成超高通量变得可行。然而，读出和输出传输速度无法跟上像素数量的增加。为此，本文分析了梯度摄像机在快速读出和高效表示方面的优势，并提出了一种基于现有技术的低位梯度摄像机方案，旨在解决高通量视频成像中的读出和传输瓶颈。我们提出了一种多尺度重建卷积神经网络（CNN）来重建高分辨率图像。通过对模拟数据和真实数据进行大量实验，验证了所提方法的良好质量和可行性。

View on arXiv Download PDF AI Translation

cs.CV / 153 / 2604.10634

NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

NTIRE 2026 第二届双焦点图像昼夜雨滴去除挑战赛：方法与结果

Li, Xin, Jin, Yeying, Yao, Suhang, Lin, Beibei, Fan, Zhaoxin, Yan, Wending, Jin, Xin, Wu, Zongwei, Li, Bingchen, Shi, Peishu, Yang, Yufei, Li, Yu, Chen, Zhibo, Wen, Bihan, Tan, Robby T., Timofte, Radu, Li, Runzhe, Jiang, Kui, Yu, Zhaocheng, Chen, Yiang, Jiang, Junjun, Liu, Xianming, Gu, Hongde, Li, Zeliang, You, Mache, Dong, Jiangxin, Pan, Jinshan, Rong, Qiyu, Shao, Bowen, Jing, Hongyuan, Zhang, Mengmeng, Ding, Bo, Zhang, Hui, Ren, Yi, Kishawy, Mohab, Chen, Jun, Duong, Anh-Kiet, Gomez-Kramer, Petra, Carozza, Jean-Michel, Xing, Wangzhi, Lu, Xin, Gu, Enxuan, Zhang, Jingxi, Chen, Diqi, Yi, Qiaosi, Wei, Bingcai, Li, Wenjie, Tie, Bowen, Guo, Heng, Ma, Zhanyu, Tu, Jiachen, Xu, Guoyi, Jiang, Yaoxin, Liu, Cici, Shi, Yaokun, Mellado, Paula Garrido, Feijoo, Daniel, Lara, Alvaro Garcia, Conde, Marcos V., Zhu, Zhidong, Xiong, Bangshu, Ou, Qiaofeng, Rao, Zhibo, Li, Wei, Zhang, Zida, Geng, Hui, Xu, Qisheng, Deng, Xuyao, Wang, Changjian, Xu, Kele, Dong, Guanglu, Zhao, Qiyao, Zheng, Tianheng, Li, Chunlei, Mou, Lichao, Ren, Chao, Peng, Chang-De, Tsai, Chieh-Yu, Liu, Guan-Cheng, Kang, Li-Wei, Rajak, Abhishek, Singh, Milan Kumar, Kumar, Ankit, Sonone, Dimple, Upla, Kishor, Raja, Kiran, Zhao, Huilin, Xu, Xing, Chen, Chuan, Lao, Yeming, Xun, Wenjing, Yang, Li, Benjdira, Bilel, Ali, Anas M., Boulila, Wadii, Yang, Hao, Zhang, Ruikun, Pan, Liyuan

Abstract

This paper presents an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Building upon the success of the first edition, this challenge attracted a wide range of impressive solutions, all developed and evaluated on our real-world Raindrop Clarity dataset~\cite{jin2024raindrop}. For this edition, we adjust the dataset with 14,139 images for training, 407 images for validation, and 593 images for testing. The primary goal of this challenge is to establish a strong and practical benchmark for the removal of raindrops under various illumination and focus conditions. In total, 168 teams have registered for the competition, and 17 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the Raindrop Clarity dataset, demonstrating the growing progress in this challenging task.

Chinese Translation

本文介绍了NTIRE 2026第二届双焦点图像昼夜雨滴去除挑战赛的概况。继首届赛事的成功基础上，本次挑战吸引了众多优秀方案，所有方案均在我们真实场景的Raindrop Clarity数据集~\cite{jin2024raindrop}上开发和评估。本届赛事调整了数据集规模，包含14,139张训练图像、407张验证图像和593张测试图像。该挑战的主要目标是建立一个强大且实用的基准，用以评估在不同光照和焦点条件下的雨滴去除效果。共有168支队伍报名参赛，17支队伍提交了有效的最终方案及技术报告以参与测试阶段。提交的方法在Raindrop Clarity数据集上表现出色，展示了该领域持续进步的态势。

View on arXiv Download PDF AI Translation

cs.CV / 154 / 2604.10637

Language Prompt vs. Image Enhancement: Boosting Object Detection With CLIP in Hazy Environments

语言提示与图像增强：在雾霾环境中利用 CLIP 提升物体检测

Pang, Jian, Zhang, Bingfeng, Wang, Jin, Liu, Baodi, Tao, Dapeng, Liu, Weifeng

Abstract

Object detection in hazy environments is challenging because degraded objects are nearly invisible and their semantics are weakened by environmental noise, making it difficult for detectors to identify. Common approaches involve image enhancement to boost weakened semantics, but these methods are limited by the instability of enhanced modules. This paper proposes a novel solution by employing language prompts to enhance weakened semantics without image enhancement. Specifically, we design Approximation of Mutual Exclusion (AME) to provide credible weights for Cross-Entropy Loss, resulting in CLIP-guided Cross-Entropy Loss (CLIP-CE). The provided weights assess the semantic weakening of objects. Through the backpropagation of CLIP-CE, weakened semantics are enhanced, making degraded objects easier to detect. In addition, we present Fine-tuned AME (FAME) which adaptively fine-tunes the weight of AME based on the predicted confidence. The proposed FAME compensates for the imbalanced optimization in AME. Furthermore, we present HazyCOCO, a large-scale synthetic hazy dataset comprising 61258 images. Experimental results demonstrate that our method achieves state-of-the-art performance. The code and dataset will be released.

Chinese Translation

在雾霾环境中进行物体检测具有挑战性，因为退化的物体几乎不可见，且其语义受到环境噪声的削弱，导致检测器难以识别。常见的方法是通过图像增强来提升削弱的语义，但这些方法受到增强模块不稳定性的限制。本文提出了一种新颖的解决方案，通过使用语言提示来增强削弱的语义，而无需进行图像增强。具体而言，我们设计了互斥近似（Approximation of Mutual Exclusion, AME），为交叉熵损失（Cross-Entropy Loss）提供可信的权重，从而形成基于 CLIP 的交叉熵损失（CLIP-guided Cross-Entropy Loss, CLIP-CE）。所提供的权重评估物体的语义削弱程度。通过 CLIP-CE 的反向传播，削弱的语义得以增强，使得退化物体更易于检测。此外，我们提出了微调 AME（Fine-tuned AME, FAME），该方法根据预测的置信度自适应地微调 AME 的权重。所提出的 FAME 补偿了 AME 中的不平衡优化。此外，我们还提出了 HazyCOCO，这是一个包含 61258 张图像的大规模合成雾霾数据集。实验结果表明，我们的方法达到了最先进的性能。代码和数据集将会发布。

View on arXiv Download PDF AI Translation

cs.CV / 155 / 2604.10643

LogitDynamics: Reliable ViT Error Detection from Layerwise Logit Trajectories

LogitDynamics：基于层间Logit轨迹的可靠ViT错误检测

Beigelman, Ido, Freiman, Moti

Abstract

Reliable confidence estimation is critical when deploying vision models. We study error prediction: determining whether an image classifier's output is correct using only signals from a single forward pass. Motivated by internal-signal hallucination detection in large language models, we investigate whether similar depth-wise signals exist in Vision Transformers (ViTs). We propose a simple method that models how class evidence evolves across layers. By attaching lightweight linear heads to intermediate layers, we extract features from the last L layers that capture both the logits of the predicted class and its top-K competitors, as well as statistics describing instability of top-ranked classes across depth. A linear probe trained on these features predicts the error indicator. Across datasets, our method improves or matches AUCPR over baselines and shows stronger cross-dataset generalization while requiring minimal additional computation.

Chinese Translation

在部署视觉模型时，可靠的置信度估计至关重要。我们研究错误预测问题：仅利用单次前向传播的信号判断图像分类器的输出是否正确。受大型语言模型中内部信号幻觉检测的启发，我们探讨了视觉变换器（ViTs）中是否存在类似的深度维度信号。我们提出了一种简单的方法，建模类别证据在各层之间的演变过程。通过在中间层附加轻量级线性头，我们从最后L层提取特征，这些特征不仅包含预测类别及其Top-K竞争类别的logits，还包括描述顶级类别在深度方向上不稳定性的统计信息。基于这些特征训练的线性探针用于预测错误指示器。在多个数据集上，我们的方法在AUCPR指标上优于或匹配基线方法，并展示了更强的跨数据集泛化能力，同时仅需极少的额外计算量。

View on arXiv Download PDF AI Translation

cs.CV / 156 / 2604.10655

LoViF 2026 The First Challenge on Weather Removal in Videos

LoViF 2026 视频天气去除挑战赛

Qian, Chenghao

Abstract

This paper presents a review of the LoViF 2026 Challenge on Weather Removal in Videos. The challenge encourages the development of methods for restoring clean videos from inputs degraded by adverse weather conditions such as rain and snow, with an emphasis on achieving visually plausible and temporally consistent results while preserving scene structure and motion dynamics. To support this task, we introduce a new short-form WRV dataset tailored for video weather removal. It consists of 18 videos 1,216 synthesized frames paired with 1,216 real-world ground-truth frames at a resolution of 832 x 480, and is split into training, validation, and test sets with a ratio of 1:1:1. The goal of this challenge is to advance robust and realistic video restoration under real-world weather conditions, with evaluation protocols that jointly consider fidelity and perceptual quality. The challenge attracted 37 participants and received 5 valid final submissions with corresponding fact sheets, contributing to progress in weather removal for videos. The project is publicly available at https://www.codabench.org/competitions/13462/.

Chinese Translation

本文回顾了 LoViF 2026 视频天气去除挑战赛。该挑战鼓励开发从受到恶劣天气条件（如雨雪）影响的输入中恢复干净视频的方法，强调在保持场景结构和运动动态的同时，实现视觉上可信和时间上一致的结果。为支持这一任务，我们引入了一个新的短格式 WRV 数据集，专门用于视频天气去除。该数据集包含 18 个视频和 1,216 个合成帧，与 1,216 个真实世界的地面真相帧配对，分辨率为 832 x 480，并按 1:1:1 的比例划分为训练集、验证集和测试集。该挑战的目标是推动在真实天气条件下的鲁棒和真实视频恢复，评估协议共同考虑保真度和感知质量。挑战吸引了 37 名参与者，并收到了 5 份有效的最终提交及相应的事实表，为视频天气去除的进展做出了贡献。该项目已在 https://www.codabench.org/competitions/13462/ 上公开发布。

View on arXiv Download PDF AI Translation

cs.CV / 157 / 2604.10666

Omnimodal Dataset Distillation via High-order Proxy Alignment

通过高阶代理对齐实现全模态数据集蒸馏

Gao, Yuxuan, Liu, Xiaohao, Xia, Xiaobo, Liu, Tongliang

Abstract

Dataset distillation compresses large-scale datasets into compact synthetic sets while preserving training performance, but existing methods are largely restricted to single-modal or bimodal settings. Extending dataset distillation to scenarios involving more than two modalities, i.e., Omnimodal Dataset Distillation, remains underexplored and challenging due to increased heterogeneity and complex cross-modal interactions. In this work, we identify the key determinant that bounds the endpoint discrepancy in the omnimodal setting, which is exacerbated with an increasing number of modalities. To this end, we propose HoPA, a unified method that captures high-order cross-modal alignments via a compact proxy, which is compatible with trajectory matching as well. By abstracting omnimodal alignment with a shared similarity structure, our method avoids the combinatorial complexity of pairwise modality modeling and enables scalable joint distillation across heterogeneous modalities. Theoretical analysis from the spectral perspective reveals the rationality of our proposed method against bimodal dataset distillation techniques. Extensive experiments on various benchmarks demonstrate that the proposed method achieves superior compression-performance trade-offs compared to existing competitors. The source code will be publicly released.

Chinese Translation

数据集蒸馏将大规模数据集压缩为紧凑的合成集合，同时保持训练性能，但现有方法大多局限于单模态或双模态设置。将数据集蒸馏扩展到涉及多于两种模态的场景，即全模态数据集蒸馏（Omnimodal Dataset Distillation），由于异质性增加和复杂的跨模态交互，仍然未被充分探索且具有挑战性。在本工作中，我们确定了限制全模态设置中端点差异的关键因素，该因素随着模态数量的增加而加剧。为此，我们提出了HoPA，一种通过紧凑代理捕捉高阶跨模态对齐的统一方法，该方法同样兼容轨迹匹配。通过以共享相似结构抽象全模态对齐，我们的方法避免了模态对建模的组合复杂性，实现了异构模态间的可扩展联合蒸馏。从谱理论视角的分析揭示了我们方法相较于双模态数据集蒸馏技术的合理性。在多个基准测试上的大量实验表明，所提方法在压缩与性能权衡方面优于现有竞争方法。源码将公开发布。

View on arXiv Download PDF AI Translation

cs.CV / 158 / 2604.10675

HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement

HiddenObjects：用于物体摆放的可扩展扩散蒸馏空间先验

Schouten, Marco, Siglidis, Ioannis, Belongie, Serge, Papadopoulos, Dim P.

Abstract

We propose a method to learn explicit, class-conditioned spatial priors for object placement in natural scenes by distilling the implicit placement knowledge encoded in text-conditioned diffusion models. Prior work relies either on manually annotated data, which is inherently limited in scale, or on inpainting-based object-removal pipelines, whose artifacts promote shortcut learning. To address these limitations, we introduce a fully automated and scalable framework that evaluates dense object placements on high-quality real backgrounds using a diffusion-based inpainting pipeline. With this pipeline, we construct HiddenObjects, a large-scale dataset comprising 27M placement annotations, evaluated across 27k distinct scenes, with ranked bounding box insertions for different images and object categories. Experimental results show that our spatial priors outperform sparse human annotations on a downstream image editing task (3.90 vs. 2.68 VLM-Judge), and significantly surpass existing placement baselines and zero-shot Vision-Language Models for object placement. Furthermore, we distill these priors into a lightweight model for fast practical inference (230,000x faster).

Chinese Translation

我们提出了一种方法，通过蒸馏文本条件扩散模型中隐含的摆放知识，学习显式的类别条件空间先验，用于自然场景中的物体摆放。以往的工作要么依赖于人工标注的数据，规模有限，要么依赖基于修补（inpainting）的物体移除流程，该流程的伪影促使模型学习捷径。为了解决这些限制，我们引入了一个完全自动化且可扩展的框架，利用基于扩散的修补流程，在高质量真实背景上评估密集的物体摆放。基于该流程，我们构建了HiddenObjects，一个包含2700万摆放标注的大规模数据集，涵盖2.7万个不同场景，并针对不同图像和物体类别提供了排序的边界框插入。实验结果表明，我们的空间先验在下游图像编辑任务中优于稀疏的人类标注（3.90对2.68 VLM-Judge），并显著超越了现有的摆放基线方法及零-shot视觉语言模型。此外，我们将这些先验蒸馏成一个轻量级模型，实现了快速的实际推理，速度提升达230,000倍。

View on arXiv Download PDF AI Translation

cs.CV / 159 / 2604.10695

Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification

检索以恢复：通过语义一致性净化实现不完整音视频问答

Zhang, Jiayu, Ye, Shuo, Ye, Qilang, Song, Zihan, Huang, Jiajian, Yu, Zitong

Abstract

Recent Audio-Visual Question Answering (AVQA) methods have advanced significantly. However, most AVQA methods lack effective mechanisms for handling missing modalities, suffering from severe performance degradation in real-world scenarios with data interruptions. Furthermore, prevailing methods for handling missing modalities predominantly rely on generative imputation to synthesize missing features. While partially effective, these methods tend to capture inter-modal commonalities but struggle to acquire unique, modality-specific knowledge within the missing data, leading to hallucinations and compromised reasoning accuracy. To tackle these challenges, we propose R$^{2}$ScP, a novel framework that shifts the paradigm of missing modality handling from traditional generative imputation to retrieval-based recovery. Specifically, we leverage cross-modal retrieval via unified semantic embeddings to acquire missing domain-specific knowledge. To maximize semantic restoration, we introduce a context-aware adaptive purification mechanism that eliminates latent semantic noise within the retrieved data. Additionally, we employ a two-stage training strategy to explicitly model the semantic relationships between knowledge from different sources. Extensive experiments demonstrate that R$^{2}$ScP significantly improves AVQA and enhances robustness in modal-incomplete scenarios.

Chinese Translation

近年来，音视频问答（Audio-Visual Question Answering, AVQA）方法取得了显著进展。然而，大多数AVQA方法缺乏有效处理缺失模态的机制，在现实场景中因数据中断导致性能严重下降。此外，现有处理缺失模态的方法主要依赖生成式填充来合成缺失特征。尽管部分有效，这些方法倾向于捕捉模态间的共性，但难以获取缺失数据中独特的模态专属知识，导致幻觉现象和推理准确性受损。为应对上述挑战，我们提出了R$^{2}$ScP，一种将缺失模态处理范式从传统生成式填充转向基于检索恢复的新框架。具体而言，我们通过统一的语义嵌入实现跨模态检索，以获取缺失的领域特定知识。为了最大化语义恢复效果，我们引入了上下文感知的自适应净化机制，消除检索数据中的潜在语义噪声。此外，我们采用两阶段训练策略，显式建模不同来源知识之间的语义关系。大量实验表明，R$^{2}$ScP显著提升了AVQA性能，并增强了在模态不完整场景下的鲁棒性。

View on arXiv Download PDF AI Translation

cs.CV / 160 / 2604.10702

Architecture-Agnostic Modality-Isolated Gated Fusion for Robust Multi-Modal Prostate MRI Segmentation

架构无关的模态隔离门控融合用于稳健的多模态前列腺MRI分割

Shu, Yongbo, Xie, Wenzhao, Yao, Shanhu, Xin, Zirui, Lei, Luo, Chen, Kewen, Luo, Aijing

Abstract

Multi-parametric prostate MRI -- combining T2-weighted, apparent diffusion coefficient, and high b-value diffusion-weighted sequences -- is central to non-invasive detection of clinically significant prostate cancer, yet in routine practice individual sequences may be missing or degraded by motion, artifacts, or abbreviated protocols. Existing multi-modal fusion strategies typically assume complete inputs and entangle modality-specific information at early layers, offering limited resilience when one channel is corrupted or absent. We propose Modality-Isolated Gated Fusion (MIGF), an architecture-agnostic module that maintains separate modality-specific encoding streams before a learned gating stage, combined with modality dropout training to enforce compensation behavior under incomplete inputs. We benchmark six bare backbones and assess MIGF-equipped models under seven missing-modality and artifact scenarios on the PI-CAI dataset (1,500 studies, fold-0 split, five random seeds). Among bare backbones, nnUNet provided the strongest balance of performance and stability. MIGF improved ideal-scenario Ranking Score for UNet, nnUNet, and Mamba by 2.8%, 4.6%, and 13.4%, respectively; the best model, MIGFNet-nnUNet (gating + ModDrop, no deep supervision), achieved 0.7304 +/- 0.056. Mechanistic analysis reveals that robustness gains arise from strict modality isolation and dropout-driven compensation rather than adaptive per-sample quality routing: the gate converged to a stable modality prior, and deep supervision was beneficial only for the largest backbone while degrading lighter models. These findings support a simpler design principle for robust multi-modal segmentation: structurally contain corrupted inputs first, then train explicitly for incomplete-input compensation.

Chinese Translation

多参数前列腺MRI——结合T2加权、表观扩散系数和高b值扩散加权序列——在无创检测临床显著前列腺癌中至关重要。然而在常规实践中，个别序列可能缺失或因运动、伪影或简化协议而退化。现有的多模态融合策略通常假设输入完整，并在早期层中纠缠模态特定信息，当一个通道损坏或缺失时，提供的韧性有限。我们提出了模态隔离门控融合（Modality-Isolated Gated Fusion, MIGF），这是一种架构无关的模块，在学习的门控阶段之前保持独立的模态特定编码流，并结合模态丢弃训练以强制在不完整输入下的补偿行为。我们基准测试了六个基础网络，并在PI-CAI数据集（1,500个研究，fold-0拆分，五个随机种子）下评估了配备MIGF的模型在七种缺失模态和伪影场景下的表现。在基础网络中，nnUNet提供了性能和稳定性的最佳平衡。MIGF分别提高了UNet、nnUNet和Mamba在理想场景下的排名分数2.8%、4.6%和13.4%；最佳模型MIGFNet-nnUNet（门控 + ModDrop，无深度监督）达到了0.7304 +/- 0.056。机制分析表明，稳健性提升源于严格的模态隔离和基于丢弃的补偿，而非自适应的每样本质量路由：门控收敛于稳定的模态先验，深度监督仅对最大的基础网络有益，而对较轻模型则有损害。这些发现支持了一种更简单的稳健多模态分割设计原则：首先结构性地包含损坏的输入，然后明确训练以补偿不完整输入。

View on arXiv Download PDF AI Translation

cs.CV / 161 / 2604.10707

Investigating Bias and Fairness in Appearance-based Gaze Estimation

基于外观的注视估计中的偏见与公平性研究

Akgül, Burak, Şahin, Erol, Kalkan, Sinan

Abstract

While appearance-based gaze estimation has achieved significant improvements in accuracy and domain adaptation, the fairness of these systems across different demographic groups remains largely unexplored. To date, there is no comprehensive benchmark quantifying algorithmic bias in gaze estimation. This paper presents the first extensive evaluation of fairness in appearance-based gaze estimation, focusing on ethnicity and gender attributes. We establish a fairness baseline by analyzing state-of-the-art models using standard fairness metrics, revealing significant performance disparities. Furthermore, we evaluate the effectiveness of existing bias mitigation strategies when applied to the gaze domain and show that their fairness contributions are limited. We summarize key insights and open issues. Overall, our work calls for research into developing robust, equitable gaze estimators. To support future research and reproducibility, we publicly release our annotations, code, and trained models at: github.com/akgulburak/gaze-estimation-fairness

Chinese Translation

尽管基于外观的注视估计在准确性和领域适应性方面取得了显著进展，但这些系统在不同人口群体中的公平性仍然在很大程度上未被探索。迄今为止，尚无全面的基准来量化注视估计中的算法偏见。本文首次对基于外观的注视估计的公平性进行了广泛评估，重点关注种族和性别属性。我们通过使用标准公平性指标分析最先进的模型，建立了公平性基准，揭示了显著的性能差异。此外，我们评估了现有偏见缓解策略在注视领域应用时的有效性，并显示其对公平性的贡献有限。我们总结了关键见解和待解决的问题。总体而言，我们的工作呼吁开展研究，以开发稳健、公平的注视估计器。为了支持未来的研究和可重复性，我们公开发布了我们的注释、代码和训练模型，网址为：github.com/akgulburak/gaze-estimation-fairness

View on arXiv Download PDF AI Translation

cs.CV / 162 / 2604.10715

Defending against Patch-Based and Texture-Based Adversarial Attacks with Spectral Decomposition

基于谱分解的补丁型与纹理型对抗攻击防御方法

Zhang, Wei, Chang, Xinyu, Li, Xiao, Zhu, Yiming, Hu, Xiaolin

Abstract

Adversarial examples present significant challenges to the security of Deep Neural Network (DNN) applications. Specifically, there are patch-based and texture-based attacks that are usually used to craft physical-world adversarial examples, posing real threats to security-critical applications such as person detection in surveillance and autonomous systems, because those attacks are physically realizable. Existing defense mechanisms face challenges in the adaptive attack setting, i.e., the attacks are specifically designed against them. In this paper, we propose Adversarial Spectrum Defense (ASD), a defense mechanism that leverages spectral decomposition via Discrete Wavelet Transform (DWT) to analyze adversarial patterns across multiple frequency scales. The multi-resolution and localization capability of DWT enables ASD to capture both high-frequency (fine-grained) and low-frequency (spatially pervasive) perturbations. By integrating this spectral analysis with the off-the-shelf Adversarial Training (AT) model, ASD provides a comprehensive defense strategy against both patch-based and texture-based adversarial attacks. Extensive experiments demonstrate that ASD+AT achieved state-of-the-art (SOTA) performance against various attacks, outperforming the APs of previous defense methods by 21.73%, in the face of strong adaptive adversaries specifically designed against ASD. Code available at https://github.com/weiz0823/adv-spectral-defense .

Chinese Translation

对抗样本对深度神经网络（DNN）应用的安全性构成了重大挑战。具体而言，补丁型和纹理型攻击通常用于制造物理世界中的对抗样本，这些攻击因其物理可实现性，对监控中的人员检测和自动驾驶系统等安全关键应用构成了真实威胁。现有防御机制在自适应攻击环境下面临挑战，即攻击针对防御机制进行专门设计。本文提出了一种对抗谱防御（Adversarial Spectrum Defense，ASD）机制，该机制利用离散小波变换（Discrete Wavelet Transform，DWT）进行谱分解，以多频率尺度分析对抗模式。DWT的多分辨率和定位能力使ASD能够捕捉高频（细粒度）和低频（空间广泛）扰动。通过将该谱分析与现成的对抗训练（Adversarial Training，AT）模型相结合，ASD提供了针对补丁型和纹理型对抗攻击的综合防御策略。大量实验表明，ASD+AT在面对针对ASD专门设计的强自适应对手时，取得了最先进（SOTA）的性能，攻击成功率较以往防御方法提升了21.73%。代码已开源，地址：https://github.com/weiz0823/adv-spectral-defense 。

View on arXiv Download PDF AI Translation

cs.CV / 163 / 2604.10721

Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization

将生成模型转变为检索模型：为自然语言引导的地理定位解锁多模态大语言模型

Chen, Yuqi, Zhang, Xiaohan, Arrabi, Ahmad, Sultani, Waqas, Chen, Chen, Wshah, Safwan

Abstract

Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, they often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directly optimized for retrieval tasks. In this work, we present a simple yet effective framework to adapt MLLMs for NGCG via parameter-efficient finetuning. Our approach optimizes latent representations within the MLLM while preserving its pretrained multimodal knowledge, enabling strong cross-modal alignment without redesigning model architectures. Through systematic analysis of diverse variables, from model backbone to feature aggregation, we provide practical and generalizable insights for leveraging MLLMs in NGCG. Our method achieves SOTA on GeoText-1652 with a 12.2% improvement in Text-to-Image Recall@1 and secures top performance in 5 out of 12 subtasks on CVG-Text, all while surpassing baselines with far fewer trainable parameters. These results position MLLMs as a robust foundation for semantic cross-view retrieval and pave the way for MLLM-based NGCG to be adopted as a scalable, powerful alternative to traditional dual-encoder designs. Project page and code are available at https://yuqichen888.github.io/NGCG-MLLMs-web/.

Chinese Translation

自然语言引导的跨视角地理定位（NGCG）旨在利用地面场景的文本描述检索地理标记的卫星影像。尽管近期的 NGCG 方法通常依赖于 CLIP 风格的双编码器架构，但它们往往存在跨模态泛化能力弱的问题，并且需要复杂的架构设计。相比之下，多模态大语言模型（MLLMs）提供了强大的语义推理能力，但并未针对检索任务进行直接优化。在本研究中，我们提出了一种简单而有效的框架，通过参数高效的微调将 MLLMs 适配于 NGCG。我们的方法在保留 MLLM 预训练的多模态知识的同时，优化其潜在表示，实现了强大的跨模态对齐，而无需重新设计模型架构。通过对从模型主干到特征聚合等多种变量的系统分析，我们提供了在 NGCG 中利用 MLLMs 的实用且可推广的见解。我们的方法在 GeoText-1652 数据集上达到了最新的状态（SOTA），在文本到图像的召回率@1 上提高了 12.2%，并在 CVG-Text 的 12 个子任务中获得了 5 个的最佳表现，同时在可训练参数远少于基线的情况下超越了基线。这些结果使 MLLMs 成为语义跨视角检索的坚实基础，并为基于 MLLM 的 NGCG 作为可扩展且强大的替代方案取代传统双编码器设计铺平了道路。项目页面和代码可在 https://yuqichen888.github.io/NGCG-MLLMs-web/ 获取。

View on arXiv Download PDF AI Translation

cs.CV / 164 / 2604.10755

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

MMRareBench：一种罕见病多模态多图像医学基准测试

Ning, Junzhi, Lin, Jiashi, Fang, Yingying, Li, Wei, Liu, Jiyao, Tang, Cheng, Ma, Chenglong, Tang, Wenhao, Li, Tianbin, Huang, Ziyan, Yang, Guang, He, Junjun

Abstract

Multimodal large language models (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case reports, with Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance, with medical-domain models trailing general-purpose MLLMs substantially on multi-image tracks despite competitive diagnostic scores. These patterns are consistent with a capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration demands.

Chinese Translation

多模态大型语言模型（MLLMs）在常见疾病的临床任务中取得了进展，但其在罕见病上的表现仍未得到充分测试。在罕见病场景中，临床医生通常缺乏先验临床知识，必须严格依赖病例级证据进行临床判断。现有基准测试主要评估常见病的单图像设置，尚未系统评估在罕见病数据稀缺情况下的多模态和多图像证据整合能力。我们提出了MMRareBench，据我们所知，这是首个罕见病基准，联合评估多模态和多图像临床能力，涵盖诊断、治疗方案制定、跨图像证据对齐和检查建议四个与工作流程相符的任务。该基准包含1756个问答对及7958张来自PMC病例报告的医学图像，结合Orphanet本体对齐、任务特定的信息泄露控制、基于证据的注释及两级评估协议。对23个MLLM的系统评估揭示了能力分布碎片化和普遍较低的治疗方案制定表现，医学领域模型在多图像任务上显著落后于通用MLLM，尽管诊断得分具有竞争力。这些现象与能力稀释效应一致：医学微调虽能缩小诊断差距，但可能削弱罕见病证据整合所需的组合性多图像能力。

View on arXiv Download PDF AI Translation

cs.CV / 165 / 2604.10765

Lung Cancer Detection Using Deep Learning

基于深度学习的肺癌检测

Ajmi, Imama, Das, Abhishek

Abstract

Lung cancer, the second leading cause of cancer-related deaths, is primarily linked to long-term tobacco smoking (85% of cases). Surprisingly, 10-15% of cases occur in non-smokers. In 2020, approximately 2 million people were affected globally, resulting in 1.5 million deaths. The survival rate, at around 20%, lags behind other cancers, partly due to late-stage symptom manifestation. Necessitates early and accurate detection for effective treatment. Performance metrics such as accuracy, precision, recall (sensitivity), and F1-score are computed to provide a comprehensive evaluation of each model's capabilities. By comparing these metrics, this study offers insights into the strengths and limitations of each approach, contributing to the advancement of lung cancer detection techniques. In this paper, we are going to discuss the methodologies of lung cancer detection using different deep learning algorithms - InceptionV3, MobileNetV2, VGG16, ResNet152 - are explored for their efficacy in classifying lung cancer cases. Our Proposed Model algorithm based is a 16 layers architecture based on CNN model. Our Proposed model exhibits several key highlights that contribute to its novelty. By integrating multiple layer types such as convolutional, pooling, flatten, dropout, fully connected and dense layers, the model leverages the strengths of each layer to enhance its predictive capabilities. Novelty of our proposed model is that its accuracy is increasing consistently with the increasing no of epochs. We have tested the model performance up to epoch no 30. Our proposed model also overcome the overfitting problem.

Chinese Translation

肺癌是导致癌症相关死亡的第二大原因，主要与长期吸烟（占85%的病例）有关。令人惊讶的是，10-15%的病例发生在非吸烟者中。2020年，全球约有200万人受到影响，导致150万人死亡。生存率约为20%，低于其他癌症，部分原因是症状表现较晚。因此，迫切需要进行早期和准确的检测以实现有效治疗。本文计算了准确率、精确率、召回率（灵敏度）和F1-score等性能指标，以全面评估每种模型的能力。通过比较这些指标，本研究提供了对每种方法的优缺点的深入见解，为肺癌检测技术的发展做出了贡献。本文将讨论使用不同深度学习算法（如InceptionV3、MobileNetV2、VGG16、ResNet152）进行肺癌检测的方法，探讨其在分类肺癌病例中的有效性。我们提出的模型基于16层的卷积神经网络（CNN）架构。我们的模型展现了多个关键亮点，增强了其新颖性。通过整合卷积层、池化层、展平层、丢弃层、全连接层和稠密层等多种层类型，该模型利用每层的优势来提升其预测能力。我们提出模型的新颖之处在于其准确率随着训练轮数的增加而持续提高。我们已测试模型性能至第30个训练轮次。我们的模型还克服了过拟合问题。

View on arXiv Download PDF AI Translation

cs.CV / 166 / 2604.10766

At FullTilt: Real-Time Open-Set 3D Macromolecule Detection Directly from Tilted 2D Projections

在FullTilt：直接从倾斜的二维投影中进行实时开放集三维大分子检测

Ho, Ming-Yang, Bartesaghi, Alberto

Abstract

Open-set 3D macromolecule detection in cryogenic electron tomography eliminates the need for target-specific model retraining. However, strict VRAM constraints prohibit processing an entire 3D tomogram, forcing current methods to rely on slow sliding-window inference over extracted subvolumes. To overcome this, we propose FullTilt, an end-to-end framework that redefines 3D detection by operating directly on aligned 2D tilt-series. Because a tilt-series contains significantly fewer images than slices in a reconstructed tomogram, FullTilt eliminates redundant volumetric computation, accelerating inference by orders of magnitude. To process the entire tilt-series simultaneously, we introduce a tilt-series encoder to efficiently fuse cross-view information. We further propose a multiclass visual prompt encoder for flexible prompting, a tilt-aware query initializer to effectively anchor 3D queries, and an auxiliary geometric primitives module to enhance the model's understanding of multi-view geometry while improving robustness to adverse imaging artifacts. Extensive evaluations on three real-world datasets demonstrate that FullTilt achieves state-of-the-art zero-shot performance while drastically reducing runtime and VRAM requirements, paving the way for rapid, large-scale visual proteomics analysis. All code and data will be publicly available upon publication.

Chinese Translation

在低温电子断层成像中，开放集三维大分子检测消除了针对特定目标模型重新训练的需求。然而，严格的显存限制禁止处理整个三维断层图，迫使当前方法依赖于对提取的子体积进行缓慢的滑动窗口推理。为了解决这个问题，我们提出了FullTilt，一个端到端的框架，通过直接在对齐的二维倾斜系列上操作重新定义了三维检测。由于倾斜系列包含的图像数量显著少于重建断层图中的切片，FullTilt消除了冗余的体积计算，将推理速度提高了几个数量级。为了同时处理整个倾斜系列，我们引入了一种倾斜系列编码器，以高效融合跨视图信息。我们进一步提出了一种多类视觉提示编码器以实现灵活的提示，一种倾斜感知查询初始化器以有效锚定三维查询，以及一个辅助几何原件模块以增强模型对多视图几何的理解，同时提高对不利成像伪影的鲁棒性。在三个真实世界数据集上的广泛评估表明，FullTilt实现了最先进的零-shot性能，同时大幅减少了运行时间和显存需求，为快速、大规模的视觉蛋白质组学分析铺平了道路。所有代码和数据将在发表时公开。

View on arXiv Download PDF AI Translation

cs.CV / 167 / 2604.10772

HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models

HOG-Layout：通过视觉-语言模型实现层次化3D场景生成、优化与编辑

Jiang, Haiyan, Zhang, Deyu, Weng, Dongdong, Song, Weitao, Duh, Henry Been-Lirn

Abstract

3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for 3D scene synthesis. We present HOG-Layout that enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with existing baselines, while supporting fast and intuitive scene editing.

Chinese Translation

3D布局生成和编辑在具身人工智能和沉浸式虚拟现实交互中扮演着至关重要的角色。然而，手动创建需要繁琐的劳动，而数据驱动的生成往往缺乏多样性。大型模型的出现为3D场景合成带来了新的可能性。我们提出了HOG-Layout，它通过大型语言模型（LLMs）和视觉-语言模型（VLMs）实现文本驱动的层次化场景生成、优化和实时场景编辑。HOG-Layout通过检索增强生成（RAG）技术提高场景的语义一致性和合理性，结合优化模块以增强物理一致性，并采用层次化表示以增强推理和优化，从而实现实时编辑。实验结果表明，与现有基准相比，HOG-Layout生成的环境更加合理，同时支持快速和直观的场景编辑。

View on arXiv Download PDF AI Translation

cs.CV / 168 / 2604.10777

Uncertainty-quantified Pulse Signal Recovery from Facial Video using Regularized Stochastic Interpolants

基于正则化随机插值的面部视频脉冲信号恢复的不确定性量化

Shenoy, Vineet R., Peng, Cheng, Chellappa, Rama, Sun, Yu

Abstract

Imaging Photoplethysmography (iPPG), an optical procedure which recovers a human's blood volume pulse (BVP) waveform using pixel readout from a camera, is an exciting research field with many researchers performing clinical studies of iPPG algorithms. While current algorithms to solve the iPPG task have shown outstanding performance on benchmark datasets, no state-of-the art algorithms, to the best of our knowledge, performs test-time sampling of solution space, precluding an uncertainty analysis that is critical for clinical applications. We address this deficiency though a new paradigm named Regularized Interpolants with Stochasticity for iPPG (RIS-iPPG). Modeling iPPG recovery as an inverse problem, we build probability paths that evolve the camera pixel distribution to the ground-truth signal distribution by predicting the instantaneous flow and score vectors of a time-dependent stochastic process; and at test-time, we sample the posterior distribution of the correct BVP waveform given the camera pixel intensity measurements by solving a stochastic differential equation. Given that physiological changes are slowly varying, we show that iPPG recovery can be improved through regularization that maximizes the correlation between the residual flow vector predictions of two adjacent time windows. Experimental results on three datasets show that RIS-iPPG provides superior reconstruction quality and uncertainty estimates of the reconstruction, a critical tool for the widespread adoption of iPPG algorithms in clinical and consumer settings.

Chinese Translation

成像光电容积脉搏波（iPPG）是一种光学方法，通过从相机读取像素数据来恢复人类的血容量脉搏（BVP）波形，是一个令人兴奋的研究领域，许多研究者正在对iPPG算法进行临床研究。尽管当前解决iPPG任务的算法在基准数据集上表现出色，但据我们所知，没有任何最先进的算法在测试时对解空间进行采样，这阻碍了对临床应用至关重要的不确定性分析。我们通过一种名为正则化随机插值的iPPG（RIS-iPPG）新范式来解决这一缺陷。将iPPG恢复建模为一个逆问题，我们构建概率路径，通过预测时间依赖的随机过程的瞬时流和得分向量，将相机像素分布演变为真实信号分布；在测试时，我们通过求解随机微分方程，给定相机像素强度测量值，采样正确BVP波形的后验分布。鉴于生理变化是缓慢变化的，我们展示了通过最大化两个相邻时间窗口的残差流向量预测之间的相关性，可以改善iPPG恢复。对三个数据集的实验结果表明，RIS-iPPG提供了优越的重建质量和重建的不确定性估计，这是iPPG算法在临床和消费环境中广泛应用的关键工具。

View on arXiv Download PDF AI Translation

cs.CV / 169 / 2604.10780

LIDARLearn: A Unified Deep Learning Library for 3D Point Cloud Classification, Segmentation, and Self-Supervised Representation Learning

LIDARLearn：用于三维点云分类、分割及自监督表示学习的统一深度学习库

Ohamouddou, Said, Afia, Hanaa El, Afia, Abdellatif El, Chiheb, Raddouane

Abstract

Three-dimensional (3D) point cloud analysis has become central to applications ranging from autonomous driving and robotics to forestry and ecological monitoring. Although numerous deep learning methods have been proposed for point cloud understanding, including supervised backbones, self-supervised pre-training (SSL), and parameter-efficient fine-tuning (PEFT), their implementations are scattered across incompatible codebases with differing data pipelines, evaluation protocols, and configuration formats, making fair comparisons difficult. We introduce \lib{}, a unified, extensible PyTorch library that integrates over 55 model configurations covering 29 supervised architectures, seven SSL pre-training methods, and five PEFT strategies, all within a single registry-based framework supporting classification, semantic segmentation, part segmentation, and few-shot learning. \lib{} provides standardised training runners, cross-validation with stratified $K$-fold splitting, automated LaTeX/CSV table generation, built-in Friedman/Nemenyi statistical testing with critical-difference diagrams for rigorous multi-model comparison, and a comprehensive test suite with 2\,200+ automated tests validating every configuration end-to-end. The code is available at https://github.com/said-ohamouddou/LIDARLearn under the MIT licence.

Chinese Translation

三维（3D）点云分析已成为自动驾驶、机器人技术、林业及生态监测等应用的核心。尽管已有大量深度学习方法被提出用于点云理解，包括监督骨干网络、自监督预训练（SSL）以及参数高效微调（PEFT），但它们的实现分散在不兼容的代码库中，且数据管线、评估协议和配置格式各异，导致公平比较变得困难。我们提出了LIDARLearn，一个统一且可扩展的PyTorch库，集成了55余种模型配置，涵盖29种监督架构、7种SSL预训练方法和5种PEFT策略，全部基于单一注册表框架，支持分类、语义分割、部件分割及少样本学习。LIDARLearn提供标准化的训练运行器、带分层K折拆分的交叉验证、自动生成LaTeX/CSV表格、内置Friedman/Nemenyi统计检验及关键差异图以实现严格的多模型比较，并配备2200余项自动化测试，端到端验证每个配置。代码在MIT许可证下开源，地址为https://github.com/said-ohamouddou/LIDARLearn。

View on arXiv Download PDF AI Translation

cs.CV / 170 / 2604.10789

ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

ReplicateAnyScene：通过文本-视觉-空间对齐实现零-shot视频到3D合成

Dong, Mingyu, Xia, Chong, Jia, Mingyuan, Lyu, Weichen, Xu, Long, Zhu, Zheng, Duan, Yueqi

Abstract

Humans exhibit an innate capacity to rapidly perceive and segment objects from video observations, and even mentally assemble them into structured 3D scenes. Replicating such capability, termed compositional 3D reconstruction, is pivotal for the advancement of Spatial Intelligence and Embodied AI. However, existing methods struggle to achieve practical deployment due to the insufficient integration of cross-modal information, leaving them dependent on manual object prompting, reliant on auxiliary visual inputs, and restricted to overly simplistic scenes by training biases. To address these limitations, we propose ReplicateAnyScene, a framework capable of fully automated and zero-shot transformation of casually captured videos into compositional 3D scenes. Specifically, our pipeline incorporates a five-stage cascade to extract and structurally align generic priors from vision foundation models across textual, visual, and spatial dimensions, grounding them into structured 3D representations and ensuring semantic coherence and physical plausibility of the constructed scenes. To facilitate a more comprehensive evaluation of this task, we further introduce the C3DR benchmark to assess reconstruction quality from diverse aspects. Extensive experiments demonstrate the superiority of our method over existing baselines in generating high-quality compositional 3D scenes.

Chinese Translation

人类展现出一种天生的能力，可以快速从视频观察中感知和分割物体，甚至在头脑中将它们组装成结构化的3D场景。复制这种能力，称为组合3D重建，对于空间智能和具身人工智能的发展至关重要。然而，现有方法由于跨模态信息整合不足，难以实现实际应用，依赖于手动对象提示、辅助视觉输入，并因训练偏差而局限于过于简单的场景。为了解决这些局限性，我们提出了ReplicateAnyScene，一个能够完全自动化和零-shot地将随意捕获的视频转换为组合3D场景的框架。具体而言，我们的流程包含一个五阶段级联，以从视觉基础模型中提取和结构性对齐通用先验，涵盖文本、视觉和空间维度，将其基础化为结构化的3D表示，并确保构建场景的语义一致性和物理合理性。为了便于对该任务进行更全面的评估，我们进一步引入了C3DR基准，以从不同方面评估重建质量。大量实验表明，我们的方法在生成高质量组合3D场景方面优于现有基准。

View on arXiv Download PDF AI Translation

cs.CV / 171 / 2604.10797

WBCBench 2026: A Challenge for Robust White Blood Cell Classification Under Class Imbalance

WBCBench 2026：应对类别不平衡的稳健白细胞分类挑战

Tian, Xin, Ma, Xudong, Yang, Tianqi, Achim, Alin, Papież, Bartłomiej W, Watanaboonyongcharoen, Phandee, Anantrasirichai, Nantheera

Abstract

We present WBCBench 2026, an ISBI challenge and benchmark for automated WBC classification designed to stress-test algorithms under three key difficulties: (i) severe class imbalance across 13 morphologically fine-grained WBC classes, (ii) strict patient-level separation between training, validation and test sets, and (iii) synthetic scanner- and setting-induced domain shift via controlled noise, blur and illumination perturbations. All images are single-site microscopic blood smear acquisitions with standardised staining and expert hematopathologist annotations. This paper reviews the challenge and summarises the proposed solutions and final outcomes. The benchmark is organised into two phases. Phase 1 provides a pristine training set. Phase 2 introduces degraded images with split-specific severity distributions for train, validation and test, emulating a realistic shift between development and deployment conditions. We specify a standardised submission schema, open-source evaluator, and macro-averaged F1 score as the primary ranking metric.

Chinese Translation

我们提出了WBCBench 2026，这是一个ISBI挑战和自动化白细胞分类的基准，旨在对算法进行压力测试，面临三大关键困难：（i）13个形态学细致的白细胞类别之间的严重类别不平衡，（ii）训练、验证和测试集之间严格的患者级别分离，以及（iii）通过控制噪声、模糊和光照扰动引入的合成扫描仪和环境引起的领域转移。所有图像均为单一地点的显微血涂片采集，经过标准化染色和专家血液病理学家的注释。本文回顾了该挑战，并总结了提出的解决方案和最终结果。基准分为两个阶段。第一阶段提供了一个原始训练集。第二阶段引入了具有特定分割严重性分布的降级图像，用于训练、验证和测试，模拟开发与部署条件之间的现实转移。我们指定了标准化的提交方案、开源评估器，以及作为主要排名指标的宏平均F1得分。

View on arXiv Download PDF AI Translation

cs.CV / 172 / 2604.10805

Analytical Modeling and Correction of Distance Error in Homography-Based Ground-Plane Mapping

基于单应矩阵的地面平面映射距离误差的解析建模与校正

Szulc, Mateusz, Iwanowski, Marcin

Abstract

Accurate distance estimation from monocular cameras is essential for intelligent monitoring systems. In many deployments, image coordinates are mapped to ground positions using planar homographies initialized by manual selection of corresponding regions. Small inaccuracies in this initialization propagate into systematic distance distortions. This paper derives an explicit relationship between homography perturbations and the resulting distance error, showing that the error grows approximately quadratically with the true distance from the camera. Based on this model, two simple correction strategies are evaluated: regression-based estimation of the quadratic error function and direct optimization of the homography via coordinate-based gradient descent. A large-scale simulation study with more than 19 million test samples demonstrates that regression achieves higher peak accuracy when the model is reliably fitted, whereas gradient descent provides greater robustness against poor initial calibration. This suggests that improving geometric calibration may yield greater performance gains than increasing model complexity in many practical systems.

Chinese Translation

单目相机的精确距离估计对于智能监控系统至关重要。在许多应用中，图像坐标通过平面单应矩阵映射到地面位置，该单应矩阵由手动选择的对应区域初始化。初始化中的微小不准确会传播并导致系统性的距离畸变。本文推导了单应矩阵扰动与由此产生的距离误差之间的显式关系，表明误差大致随真实距离的平方增长。基于该模型，评估了两种简单的校正策略：基于回归的二次误差函数估计和通过坐标梯度下降的单应矩阵直接优化。通过超过1900万测试样本的大规模仿真研究表明，当模型拟合可靠时，回归方法能实现更高的峰值精度，而梯度下降方法在初始标定不佳时表现出更强的鲁棒性。这表明，在许多实际系统中，提升几何标定精度可能比增加模型复杂度带来更大的性能提升。

View on arXiv Download PDF AI Translation

cs.CV / 173 / 2604.10823

Uncertainty-Guided Attention and Entropy-Weighted Loss for Precise Plant Seedling Segmentation

基于不确定性引导注意力与熵加权损失的精准植物幼苗分割方法

Ehab, Mohamed, Hamdi, Ali

Abstract

Plant seedling segmentation supports automated phenotyping in precision agriculture. Standard segmentation models face difficulties due to intricate background images and fine structures in leaves. We introduce UGDA-Net (Uncertainty-Guided Dual Attention Network with Entropy-Weighted Loss and Deep Supervision). Three novel components make up UGDA-Net. The first component is Uncertainty-Guided Dual Attention (UGDA). UGDA uses channel variance to modulate feature maps. The second component is an entropy-weighted hybrid loss function. This loss function focuses on high-uncertainty boundary pixels. The third component employs deep supervision for intermediate encoder layers. We performed a comprehensive systematic ablation study. This study focuses on two widely-used architectures, U-Net and LinkNet. It analyzes five incremental configurations: Baseline, Loss-only, Attention-only, Deep Supervision, and UGDA-Net. We trained UGDA-net using a high-resolution plant seedling image dataset containing 432 images. We demonstrate improved segmentation performance and accuracy. With an increase in Dice coefficient of 9.3% above baseline. LinkNet's variance is 13.2% above baseline. Overlays that are qualitative in nature show the reduced false positives at the leaf boundary. Uncertainty heatmaps are consistent with the complex morphology. UGDA-Net aids in the segmentation of delicate structures in plants and provides a high-def solution. The results showed that uncertainty-guided attention and uncertainty-weighted loss are two complementing systems.

Chinese Translation

植物幼苗分割支持精准农业中的自动表型分析。由于复杂的背景图像和叶片的细微结构，标准分割模型面临挑战。本文提出了UGDA-Net（不确定性引导双重注意力网络，结合熵加权损失与深度监督）。UGDA-Net由三个创新组件组成。第一是不确定性引导双重注意力（UGDA），利用通道方差调节特征图。第二是熵加权混合损失函数，重点关注高不确定性的边界像素。第三是在中间编码器层引入深度监督。我们进行了系统的消融实验，针对两种广泛使用的架构U-Net和LinkNet，分析了五种递进配置：基线、仅损失、仅注意力、深度监督及UGDA-Net。使用包含432张高分辨率植物幼苗图像的数据集训练UGDA-Net，结果显示分割性能和准确率显著提升，Dice系数较基线提高9.3%，LinkNet的方差提升13.2%。定性叠加结果表明叶片边界的误报显著减少，不确定性热图与复杂形态保持一致。UGDA-Net有助于植物细微结构的分割，提供高精度解决方案。结果表明，不确定性引导注意力与不确定性加权损失是两种互补的机制。

View on arXiv Download PDF AI Translation

cs.CV / 174 / 2604.10836

HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching

HO-Flow：基于潜在流匹配的通用手-物体交互生成方法

Chen, Zerui, Potamias, Rolandos Alexandros, Chen, Shizhe, Deng, Jiankang, Schmid, Cordelia, Zafeiriou, Stefanos

Abstract

Generating realistic 3D hand-object interactions (HOI) is a fundamental challenge in computer vision and robotics, requiring both temporal coherence and high-fidelity physical plausibility. Existing methods remain limited in their ability to learn expressive motion representations for generation and perform temporal reasoning. In this paper, we present HO-Flow, a framework for synthesizing realistic hand-object motion sequences from texts and canoncial 3D objects. HO-Flow first employs an interaction-aware variational autoencoder to encode sequences of hand and object motions into a unified latent manifold by incorporating hand and object kinematics, enabling the representation to capture rich interaction dynamics. It then leverages a masked flow matching model that combines auto-regressive temporal reasoning with continuous latent generation, improving temporal coherence. To further enhance generalization, HO-Flow predicts object motions relative to the initial frame, enabling effective pre-training on large-scale synthetic data. Experiments on the GRAB, OakInk, and DexYCB benchmarks demonstrate that HO-Flow achieves state-of-the-art performance in both physical plausibility and motion diversity for interaction motion synthesis.

Chinese Translation

生成逼真的三维手-物体交互（HOI）是计算机视觉和机器人领域的基础性挑战，既需要时间上的连贯性，也要求高保真的物理合理性。现有方法在学习表达性运动表示以用于生成以及进行时间推理方面仍存在局限。本文提出了HO-Flow，一种能够从文本和标准三维物体生成逼真手-物体运动序列的框架。HO-Flow首先采用交互感知变分自编码器，将手部和物体运动序列编码到统一的潜在流形中，通过融合手部和物体运动学，使表示能够捕捉丰富的交互动态。随后，利用掩码流匹配模型结合自回归时间推理与连续潜在生成，提升时间连贯性。为进一步增强泛化能力，HO-Flow预测相对于初始帧的物体运动，从而实现对大规模合成数据的有效预训练。在GRAB、OakInk和DexYCB基准测试中，HO-Flow在物理合理性和运动多样性方面均达到最新的性能水平。

View on arXiv Download PDF AI Translation

cs.CV / 175 / 2604.10837

Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

Immune2V：针对双流图像到视频生成的图像免疫

Long, Zeqian, Kara, Ozgur, Xue, Haotian, Chen, Yongxin, Rehg, James M.

Abstract

Image-to-video (I2V) generation has the potential for societal harm because it enables the unauthorized animation of static images to create realistic deepfakes. While existing defenses effectively protect against static image manipulation, extending these to I2V generation remains underexplored and non-trivial. In this paper, we systematically analyze why modern I2V models are highly robust against naive image-level adversarial attacks (i.e., immunization). We observe that the video encoding process rapidly dilutes the adversarial noise across future frames, and the continuous text-conditioned guidance actively overrides the intended disruptive effect of the immunization. Building on these findings, we propose the Immune2V framework which enforces temporally balanced latent divergence at the encoder level to prevent signal dilution, and aligns intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override. Extensive experiments demonstrate that Immune2V produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.

Chinese Translation

图像到视频（I2V）生成可能对社会造成危害，因为它使得未经授权的静态图像动画化成为可能，从而创造出逼真的深度伪造。虽然现有的防御措施有效地保护静态图像免受操控，但将这些措施扩展到I2V生成仍然未被充分探索且并非易事。在本文中，我们系统地分析了现代I2V模型为何对简单的图像级对抗攻击（即免疫）具有高度的鲁棒性。我们观察到，视频编码过程迅速稀释了未来帧中的对抗噪声，而持续的文本条件引导积极地覆盖了免疫的预期干扰效果。基于这些发现，我们提出了Immune2V框架，该框架在编码器级别强制实施时间平衡的潜在发散，以防止信号稀释，并将中间生成表示与预计算的诱导崩溃轨迹对齐，以抵消文本引导的覆盖。大量实验表明，在相同的不可察觉预算下，Immune2V产生的降级效果显著强于适应的图像级基线，并且更具持久性。

View on arXiv Download PDF AI Translation

cs.CV / 176 / 2604.10843

Retinal Cyst Detection from Optical Coherence Tomography Images

基于光学相干断层扫描图像的视网膜囊肿检测

Dharmaratnakar, Abhishek, Vijayakumar, Aadheeshwar, Dayanand, Suchand

Abstract

Retinal Cysts are formed by leakage and accumulation of fluid in the retina due to the incompetence of retinal vasculature. These cystic spaces have significance in several ocular diseases such as age-related macular degeneration, diabetic macular edema, etc. Optical coherence tomography is one of the predominant diagnosing techniques for imaging retinal pathologies. Segmenting and quantification of intraretinal cysts plays the vital role in predicting visual acuity. In literature, several methods have been proposed for automatic segmentation of intraretinal cysts. As cystoid macular edema becomes a major problem to humankind, we need to quantify it accurately and operate it out, else it might cause many problems later on. Though research is being carried out in this area, not much of progress has been made and accuracy achieved so far is 68\% which is very less. Also, the methods depend on the quality of the image and give very low results for high noise images like topcon. This work uses ResNet CNN (Convolutional Neural Network) approach of segmentation by the way of patchwise classification for training on image set from cyst segmentation challenge dataset and testing on test data set given by 2 different graders for all 4 vendors in the challenge. It also compares these methods using first publicly available novel cyst segmentation challenge dataset. The methods were evaluated using quantitative measures to assess their robustness against the challenges of intraretinal cyst segmentation. The results are found to be better than the previous state of the art approaches giving more than 70\% dice coefficient on all vendors irrespective of their quality.

Chinese Translation

视网膜囊肿是由于视网膜血管功能不全导致液体泄漏和积聚而形成的囊性空间。这些囊肿在多种眼科疾病中具有重要意义，如年龄相关性黄斑变性、糖尿病性黄斑水肿等。光学相干断层扫描（Optical Coherence Tomography, OCT）是成像视网膜病变的主要诊断技术之一。视网膜内囊肿的分割和量化在预测视力方面起着关键作用。文献中已有多种自动分割视网膜内囊肿的方法被提出。随着囊性黄斑水肿成为人类面临的主要问题，我们需要准确量化并及时处理，否则可能引发多种后续问题。尽管该领域的研究正在进行，但迄今为止进展有限，准确率仅达到68%，且这些方法依赖于图像质量，对于如Topcon等高噪声图像表现较差。本文采用ResNet卷积神经网络（Convolutional Neural Network, CNN）通过分块分类的方法对囊肿分割挑战数据集进行训练，并在由两位不同评分员提供的测试数据集上进行测试，涵盖挑战中所有四个设备厂商的数据。本文还基于首个公开的创新性囊肿分割挑战数据集对这些方法进行了比较。通过定量指标评估了方法在视网膜内囊肿分割挑战中的鲁棒性。结果显示，该方法优于之前的先进方法，在所有厂商的数据上均实现了超过70%的Dice系数，且不受图像质量影响。

View on arXiv Download PDF AI Translation

cs.CV / 177 / 2604.10862

LRD-Net: A Lightweight Real-Centered Detection Network for Cross-Domain Face Forgery Detection

LRD-Net：一种轻量级真实中心检测网络用于跨域人脸伪造检测

Zhang, Xuecen, Chaudhary, Vipin

Abstract

The rapid advancement of diffusion-based generative models has made face forgery detection a critical challenge in digital forensics. Current detection methods face two fundamental limitations: poor cross-domain generalization when encountering unseen forgery types, and substantial computational overhead that hinders deployment on resource-constrained devices. We propose LRD-Net (Lightweight Real-centered Detection Network), a novel framework that addresses both challenges simultaneously. Unlike existing dual-branch approaches that process spatial and frequency information independently, LRD-Net adopts a sequential frequency-guided architecture where a lightweight Multi-Scale Wavelet Guidance Module generates attention signals that condition a MobileNetV3-based spatial backbone. This design enables effective exploitation of frequency-domain cues while avoiding the redundancy of parallel feature extraction. Furthermore, LRD-Net employs a real-centered learning strategy with exponential moving average prototype updates and drift regularization, anchoring representations around authentic facial images rather than modeling diverse forgery patterns. Extensive experiments on the DiFF benchmark demonstrate that LRD-Net achieves state-of-the-art cross-domain detection accuracy, consistently outperforming existing methods. Critically, LRD-Net accomplishes this with only 2.63M parameters - approximately 9x fewer than conventional approaches - while achieving over 8x faster training and nearly 10x faster inference. These results demonstrate that robust cross-domain face forgery detection can be achieved without sacrificing computational efficiency, making LRD-Net suitable for real-time deployment in mobile authentication systems and resource-constrained environments.

Chinese Translation

基于扩散的生成模型的快速发展使得人脸伪造检测成为数字取证中的一项关键挑战。目前的检测方法面临两个基本限制：在遇到未见伪造类型时跨域泛化能力差，以及显著的计算开销阻碍了在资源受限设备上的部署。我们提出了LRD-Net（轻量级真实中心检测网络），这是一个同时解决这两个挑战的新框架。与现有的独立处理空间和频率信息的双分支方法不同，LRD-Net采用了一种顺序频率引导架构，其中轻量级多尺度小波引导模块生成注意信号，以调节基于MobileNetV3的空间主干。这种设计能够有效利用频域线索，同时避免并行特征提取的冗余。此外，LRD-Net采用了一种真实中心学习策略，通过指数移动平均原型更新和漂移正则化，将表示锚定在真实的人脸图像周围，而不是建模多样的伪造模式。在DiFF基准上的广泛实验表明，LRD-Net实现了最先进的跨域检测准确性，始终优于现有方法。重要的是，LRD-Net仅用2.63M参数实现这一目标——大约是传统方法的九分之一，同时训练速度提高了8倍，推理速度接近提高了10倍。这些结果表明，强大的跨域人脸伪造检测可以在不牺牲计算效率的情况下实现，使LRD-Net适合于移动认证系统和资源受限环境中的实时部署。

View on arXiv Download PDF AI Translation

cs.CV / 178 / 2604.10885

Product Review Based on Optimized Facial Expression Detection

基于优化面部表情检测的产品评价

Chaugule, Vikrant, D, Abhishek, Vijayakumar, Aadheeshwar, Ramteke, Pravin Bhaskar, Koolagudi, Shashidhar G.

Abstract

This paper proposes a method to review public acceptance of products based on their brand by analyzing the facial expression of the customer intending to buy the product from a supermarket or hypermarket. In such cases, facial expression recognition plays a significant role in product review. Here, facial expression detection is performed by extracting feature points using a modified Harris algorithm. The modified Harris algorithm reduced the time complexity of the existing feature extraction Harris Algorithm. A comparison of time complexities of existing algorithms is done with proposed algorithm. The algorithm proved to be significantly faster and nearly accurate for the needed application by reducing the time complexity for corner points detection.

Chinese Translation

本文提出了一种基于品牌，通过分析顾客在超市或大卖场购买产品时的面部表情来评估产品公众接受度的方法。在此过程中，面部表情识别在产品评价中起着重要作用。本文采用改进的Harris算法提取特征点以实现面部表情检测。该改进的Harris算法降低了现有Harris特征提取算法的时间复杂度。文中对比了现有算法与所提算法的时间复杂度。结果表明，该算法在角点检测的时间复杂度上显著降低，速度更快且在所需应用中准确性接近。

View on arXiv Download PDF AI Translation

cs.CV / 179 / 2604.10894

EviRCOD: Evidence-Guided Probabilistic Decoding for Referring Camouflaged Object Detection

EviRCOD：基于证据引导的概率解码用于指称伪装目标检测

Wang, Ye, Huang, Kai, Shen, Sumin, Ma, Chenyang

Abstract

Referring Camouflaged Object Detection (Ref-COD) focuses on segmenting specific camouflaged targets in a query image using category-aligned references. Despite recent advances, existing methods struggle with reference-target semantic alignment, explicit uncertainty modeling, and robust boundary preservation. To address these issues, we propose EviRCOD, an integrated framework consisting of three core components: (1) a Reference-Guided Deformable Encoder (RGDE) that employs hierarchical reference-driven modulation and multi-scale deformable aggregation to inject semantic priors and align cross-scale representations; (2) an Uncertainty-Aware Evidential Decoder (UAED) that incorporates Dirichlet evidence estimation into hierarchical decoding to model uncertainty and propagate confidence across scales; and (3) a Boundary-Aware Refinement Module (BARM) that selectively enhances ambiguous boundaries by exploiting low-level edge cues and prediction confidence. Experiments on the Ref-COD benchmark demonstrate that EviRCOD achieves state-of-the-art detection performance while providing well-calibrated uncertainty estimates. Code is available at: https://github.com/blueecoffee/EviRCOD.

Chinese Translation

指称伪装目标检测（Referring Camouflaged Object Detection，Ref-COD）旨在利用类别对齐的参考信息对查询图像中的特定伪装目标进行分割。尽管近期取得了一定进展，现有方法在参考与目标的语义对齐、显式不确定性建模以及鲁棒边界保持方面仍存在不足。为解决这些问题，我们提出了EviRCOD，一个集成框架，包含三个核心组件：（1）参考引导的可变形编码器（Reference-Guided Deformable Encoder，RGDE），通过分层的参考驱动调制和多尺度可变形聚合注入语义先验并实现跨尺度表示对齐；（2）不确定性感知的证据解码器（Uncertainty-Aware Evidential Decoder，UAED），将Dirichlet证据估计融入分层解码过程，以建模不确定性并跨尺度传播置信度；（3）边界感知细化模块（Boundary-Aware Refinement Module，BARM），通过利用低层边缘线索和预测置信度，有选择地增强模糊边界。基于Ref-COD基准的实验表明，EviRCOD不仅实现了最先进的检测性能，还提供了良好校准的不确定性估计。代码已开源，地址：https://github.com/blueecoffee/EviRCOD。

View on arXiv Download PDF AI Translation

cs.CV / 180 / 2604.10904

Evaluating the Impact of Medical Image Reconstruction on Downstream AI Fairness and Performance

评估医学图像重建对下游人工智能公平性和性能的影响

Wohlrapp, Matteo, Bubeck, Niklas, Rueckert, Daniel, Lotter, William

Abstract

AI-based image reconstruction models are increasingly deployed in clinical workflows to improve image quality from noisy data, such as low-dose X-rays or accelerated MRI scans. However, these models are typically evaluated using pixel-level metrics like PSNR, leaving their impact on downstream diagnostic performance and fairness unclear. We introduce a scalable evaluation framework that applies reconstruction and diagnostic AI models in tandem, which we apply to two tasks (classification, segmentation), three reconstruction approaches (U-Net, GAN, diffusion), and two data types (X-ray, MRI) to assess the potential downstream implications of reconstruction. We find that conventional reconstruction metrics poorly track task performance, where diagnostic accuracy remains largely stable even as reconstruction PSNR declines with increasing image noise. Fairness metrics exhibit greater variability, with reconstruction sometimes amplifying demographic biases, particularly regarding patient sex. However, the overall magnitude of this additional bias is modest compared to the inherent biases already present in diagnostic models. To explore potential bias mitigation, we adapt two strategies from classification literature to the reconstruction setting, but observe limited efficacy. Overall, our findings emphasize the importance of holistic performance and fairness assessments throughout the entire medical imaging workflow, especially as generative reconstruction models are increasingly deployed.

Chinese Translation

基于人工智能的图像重建模型越来越多地应用于临床工作流程中，以改善来自噪声数据的图像质量，例如低剂量X射线或加速MRI扫描。然而，这些模型通常使用像PSNR这样的像素级指标进行评估，这使得它们对下游诊断性能和公平性的影响不明确。我们引入了一个可扩展的评估框架，该框架将重建和诊断人工智能模型结合应用，我们将其应用于两个任务（分类、分割）、三种重建方法（U-Net、GAN、扩散）和两种数据类型（X射线、MRI），以评估重建的潜在下游影响。我们发现，传统的重建指标在任务性能的追踪上表现不佳，尽管重建的PSNR随着图像噪声的增加而下降，但诊断准确性仍然基本稳定。公平性指标则表现出更大的变异性，重建有时会放大人口统计偏见，尤其是与患者性别相关的偏见。然而，与诊断模型中已经存在的固有偏见相比，这种额外偏见的整体幅度是适度的。为了探索潜在的偏见缓解，我们将分类文献中的两种策略调整到重建环境中，但观察到其效果有限。总体而言，我们的研究结果强调了在整个医学成像工作流程中进行全面性能和公平性评估的重要性，尤其是在生成重建模型越来越多地被部署的背景下。

View on arXiv Download PDF AI Translation

cs.CV / 181 / 2604.10910

STGV: Spatio-Temporal Hash Encoding for Gaussian-based Video Representation

STGV：基于高斯的视频表示的时空哈希编码

Lin, Jierun, Chen, Jiacong, Mao, Qingyu, Liu, Shuai, Meng, Xiandong, Meng, Fanyang, Liang, Yongsheng

Abstract

2D Gaussian Splatting (2DGS) has recently become a promising paradigm for high-quality video representation. However, existing methods employ content-agnostic or spatio-temporal feature overlapping embeddings to predict canonical Gaussian primitive deformations, which entangles static and dynamic components in videos and prevents modeling their distinct properties effectively. These result in inaccurate predictions for spatio-temporal deformations and unsatisfactory representation quality. To address these problems, this paper proposes a Spatio-Temporal hash encoding framework for Gaussian-based Video representation (STGV). By decomposing video features into learnable 2D spatial and 3D temporal hash encodings, STGV effectively facilitates the learning of motion patterns for dynamic components while maintaining background details for static elements.In addition, we construct a more stable and consistent initial canonical Gaussian representation through a key frame canonical initialization strategy, preventing from feature overlapping and a structurally incoherent geometry representation. Experimental results demonstrate that our method attains better video representation quality (+0.98 PSNR) against other Gaussian-based methods and achieves competitive performance in downstream video tasks.

Chinese Translation

二维高斯点云（2D Gaussian Splatting, 2DGS）最近成为高质量视频表示的有前景的范式。然而，现有方法采用与内容无关或时空特征重叠的嵌入来预测典型的高斯原始变形，这使得视频中的静态和动态成分相互交织，无法有效建模它们的不同特性。这导致时空变形的预测不准确以及表示质量不佳。为了解决这些问题，本文提出了一种基于高斯的视频表示的时空哈希编码框架（STGV）。通过将视频特征分解为可学习的二维空间和三维时间哈希编码，STGV有效促进了动态成分运动模式的学习，同时保持静态元素的背景细节。此外，我们通过关键帧典型初始化策略构建了更稳定和一致的初始典型高斯表示，防止特征重叠和结构上不连贯的几何表示。实验结果表明，我们的方法在视频表示质量上优于其他基于高斯的方法（+0.98 PSNR），并在下游视频任务中表现出竞争力。

View on arXiv Download PDF AI Translation

cs.CV / 182 / 2604.10912

TAMISeg: Text-Aligned Multi-scale Medical Image Segmentation with Semantic Encoder Distillation

TAMISeg：基于文本对齐的多尺度医学图像分割及语义编码器蒸馏

Gao, Qiang, Wang, Yi, Zhang, Yong, Li, Yong, Deng, Yongbing, Du, Lan, Chen, Cunjian

Abstract

Medical image segmentation remains challenging due to limited fine-grained annotations, complex anatomical structures, and image degradation from noise, low contrast, or illumination variation. We propose TAMISeg, a text-guided segmentation framework that incorporates clinical language prompts and semantic distillation as auxiliary semantic cues to enhance visual understanding and reduce reliance on pixel-level fine-grained annotations. TAMISeg integrates three core components: a consistency-aware encoder pretrained with strong perturbations for robust feature extraction, a semantic encoder distillation module with supervision from a frozen DINOv3 teacher to enhance semantic discriminability, and a scale-adaptive decoder that segments anatomical structures across different spatial scales. Experiments on the Kvasir-SEG, MosMedData+, and QaTa-COV19 datasets demonstrate that TAMISeg consistently outperforms existing uni-modal and multi-modal methods in both qualitative and quantitative evaluations. Code will be made publicly available at https://github.com/qczggaoqiang/TAMISeg.

Chinese Translation

医学图像分割仍然面临诸多挑战，主要包括细粒度标注的缺乏、复杂的解剖结构以及噪声、低对比度或光照变化导致的图像退化。我们提出了TAMISeg，一种文本引导的分割框架，该框架结合了临床语言提示和语义蒸馏作为辅助语义线索，以增强视觉理解能力并减少对像素级细粒度标注的依赖。TAMISeg集成了三个核心组件：一个通过强扰动预训练以实现鲁棒特征提取的一致性感知编码器；一个由冻结的DINOv3教师模型监督的语义编码器蒸馏模块，以提升语义判别能力；以及一个适应不同空间尺度的尺度自适应解码器，用于分割不同尺度的解剖结构。在Kvasir-SEG、MosMedData+和QaTa-COV19数据集上的实验表明，TAMISeg在定性和定量评估中均持续优于现有的单模态和多模态方法。代码将公开发布于https://github.com/qczggaoqiang/TAMISeg。

View on arXiv Download PDF AI Translation

cs.CV / 183 / 2604.10916

ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

ReXSonoVQA：面向过程中心超声理解的视频问答基准

Wang, Xucheng, Zhang, Xiaoman, Kim, Sung Eun, Pal, Ankit, Rajpurkar, Pranav

Abstract

Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.

Chinese Translation

超声采集需要熟练的探头操作和实时调整。视觉-语言模型（VLMs）有望实现自主超声系统，但现有基准仅评估静态图像，未涵盖动态过程理解。我们提出ReXSonoVQA，这是一个包含514个视频片段和514个问题（249个多项选择题，265个自由回答题）的视频问答基准，针对三项能力：动作-目标推理、伪影识别与优化，以及过程上下文与规划。对Gemini 3 Pro、Qwen3.5-397B、LLaVA-Video-72B和Seed 2.0 Pro的零样本评估表明，视觉-语言模型能够提取部分过程信息，但故障排除类问题仍具挑战性，且相较于仅文本基线提升有限，暴露出因果推理的局限性。ReXSonoVQA促进了超声训练、指导及机器人自动化感知系统的开发。

View on arXiv Download PDF AI Translation

cs.CV / 184 / 2604.10927

LiveGesture Streamable Co-Speech Gesture Generation Model

LiveGesture 可流式同步语音手势生成模型

Saleem, Muhammad Usama, Patel, Mayur Jagdishbhai, Pinyoanuntapong, Ekkasit, Qin, Zhongxing, Yang, Li, Xue, Hongfei, Helmy, Ahmed, Chen, Chen, Wang, Pu

Abstract

We propose LiveGesture, the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence length. Unlike existing co-speech gesture methods, which are designed for offline generation and either treat body regions independently or entangle all joints within a single model, LiveGesture is built from the ground up for causal, region-coordinated motion generation. LiveGesture consists of two main modules: the Streamable Vector Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR). The SVQ tokenizer converts the motion sequence of each body region into causal, discrete motion tokens, enabling real-time, streamable token decoding. On top of SVQ, HAR employs region-expert autoregressive (xAR) transformers to model expressive, fine-grained motion dynamics for each body region. A causal spatio-temporal fusion module (xAR Fusion) then captures and integrates correlated motion dynamics across regions. Both xAR and xAR Fusion are conditioned on live, continuously arriving audio signals encoded by a streamable causal audio encoder. To enhance robustness under streaming noise and prediction errors, we introduce autoregressive masking training, which leverages uncertainty-guided token masking and random region masking to expose the model to imperfect, partially erroneous histories during training. Experiments on the BEAT2 dataset demonstrate that LiveGesture produces coherent, diverse, and beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods under true zero look-ahead conditions.

Chinese Translation

我们提出了 LiveGesture，这是首个完全可流式、以语音驱动的全身手势生成框架，具备零前瞻能力并支持任意序列长度。与现有的同步语音手势方法不同，这些方法通常设计用于离线生成，且要么独立处理身体各区域，要么将所有关节混合在单一模型中，LiveGesture 则从根本上为因果性、区域协调的动作生成而构建。LiveGesture 包含两个主要模块：可流式向量量化动作分词器（Streamable Vector Quantized Motion Tokenizer，SVQ）和分层自回归变换器（Hierarchical Autoregressive Transformer，HAR）。SVQ 分词器将每个身体区域的动作序列转换为因果的离散动作标记，实现实时、可流式的标记解码。在 SVQ 之上，HAR 利用区域专家自回归（xAR）变换器对每个身体区域的表现力丰富且细粒度的动作动态进行建模。随后，因果时空融合模块（xAR Fusion）捕捉并整合跨区域的相关动作动态。xAR 和 xAR Fusion 均以由可流式因果音频编码器编码的实时连续音频信号为条件。为提升模型在流式噪声和预测误差下的鲁棒性，我们引入了自回归掩码训练，利用不确定性引导的标记掩码和随机区域掩码，使模型在训练时暴露于不完美且部分错误的历史信息。基于 BEAT2 数据集的实验表明，LiveGesture 能实时生成连贯、多样且与节拍同步的全身手势，在真正零前瞻条件下，其表现匹配或超越了最先进的离线方法。

View on arXiv Download PDF AI Translation

cs.CV / 185 / 2604.10940

AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling

AmodalSVG：基于语义层剥离的无模态图像矢量化

Hu, Juncheng, Xue, Ziteng, Liang, Guotao, Qi, Anran, Li, Buyu, Wang, Sheng, Xu, Dong, Yu, Qian

Abstract

We introduce AmodalSVG, a new framework for amodal image vectorization that produces semantically organized and geometrically complete SVG representations from natural images. Existing vectorization methods operate under a modal paradigm: tracing only visible pixels and disregarding occlusion. Consequently, the resulting SVGs are semantically entangled and geometrically incomplete, limiting SVG's structural editability. In contrast, AmodalSVG reconstructs full object geometries, including occluded regions, into independent, editable vector layers. To achieve this, AmodalSVG reformulates image vectorization as a two-stage framework, performing semantic decoupling and completion in the raster domain to produce amodally complete semantic layers, which are then independently vectorized. In the first stage, we introduce Semantic Layer Peeling (SLP), a VLM-guided strategy that progressively decomposes an image into semantically coherent layers. By hybrid inpainting, SLP recovers complete object appearances under occlusions, enabling explicit semantic decoupling. To vectorize these layers efficiently, we propose Adaptive Layered Vectorization (ALV), which dynamically modulates the primitive budget via an error-budget-driven adjustment mechanism. Extensive experiments demonstrate that AmodalSVG significantly outperforms prior methods in visual fidelity. Moreover, the resulting amodal layers enable object-level editing directly in the vector domain, capabilities not supported by existing vectorization approaches. Code will be released upon acceptance.

Chinese Translation

我们提出了AmodalSVG，一种用于无模态图像矢量化的新框架，能够从自然图像中生成语义组织良好且几何完整的SVG表示。现有的矢量化方法遵循模态范式：仅追踪可见像素，忽略遮挡部分。因此，生成的SVG在语义上相互纠缠且几何上不完整，限制了SVG的结构编辑能力。相比之下，AmodalSVG重建了包括遮挡区域在内的完整对象几何，将其转化为独立且可编辑的矢量图层。为实现这一目标，AmodalSVG将图像矢量化重新定义为一个两阶段框架，在光栅域内执行语义解耦与补全，生成无模态完整的语义图层，随后对其进行独立矢量化。在第一阶段，我们引入了语义层剥离（Semantic Layer Peeling，SLP），这是一种由视觉语言模型（VLM）指导的策略，逐步将图像分解为语义一致的图层。通过混合修复，SLP恢复了遮挡下的完整对象外观，实现了显式的语义解耦。为了高效矢量化这些图层，我们提出了自适应分层矢量化（Adaptive Layered Vectorization，ALV），该方法通过基于误差预算的调整机制动态调节基本图元预算。大量实验表明，AmodalSVG在视觉保真度上显著优于现有方法。此外，生成的无模态图层支持在矢量域内进行对象级编辑，这是现有矢量化方法所不具备的功能。代码将在论文接受后发布。

View on arXiv Download PDF AI Translation

cs.CV / 186 / 2604.10945

Progressive Deep Learning for Automated Spheno-Occipital Synchondrosis Maturation Assessment

渐进式深度学习用于自动化蝶枕软骨成熟评估

Milani, Omid Halimi, Nikho, Amanda, Tliba, Marouane, Mills, Lauren, Hamdan, Emadeldeen, Cetin, Ahmet Enis, Elnagar, Mohammed H.

Abstract

Accurate assessment of spheno-occipital synchondrosis (SOS) maturation is a key indicator of craniofacial growth and a critical determinant for orthodontic and surgical timing. However, SOS staging from cone-beam CT (CBCT) relies on subtle, continuously evolving morphological cues, leading to high inter-observer variability and poor reproducibility, especially at transitional fusion stages. We frame SOS assessment as a fine-grained visual recognition problem and propose a progressive representation-learning framework that explicitly mirrors how expert clinicians reason about synchondral fusion: from coarse anatomical structure to increasingly subtle patterns of closure. Rather than training a full-capacity network end-to-end, we sequentially grow the model by activating deeper blocks over time, allowing early layers to first encode stable cranial base morphology before higher-level layers specialize in discriminating adjacent maturation stages. This yields a curriculum over network depth that aligns deep feature learning with the biological continuum of SOS fusion. Extensive experiments across convolutional and transformer-based architectures show that this expert-inspired training strategy produces more stable optimization and consistently higher accuracy than standard training, particularly for ambiguous intermediate stages. Importantly, these gains are achieved without changing network architectures or loss functions, demonstrating that training dynamics alone can substantially improve anatomical representation learning. The proposed framework establishes a principled link between expert dental intuition and deep visual representations, enabling robust, data-efficient SOS staging from CBCT and offering a general strategy for modeling other continuous biological processes in medical imaging.

Chinese Translation

准确评估蝶枕软骨（SOS）成熟是颅面生长的关键指标，也是正畸和外科手术时机的重要决定因素。然而，基于锥形束计算机断层扫描（CBCT）的SOS分期依赖于微妙且不断演变的形态线索，导致观察者之间的高变异性和较差的可重复性，尤其是在过渡融合阶段。我们将SOS评估框架视为一个细粒度视觉识别问题，并提出一种渐进式表征学习框架，明确反映专家临床医生对软骨融合的推理方式：从粗略的解剖结构到日益细微的闭合模式。我们不是端到端训练一个全容量网络，而是通过逐步激活更深的模块来扩展模型，使得早期层首先编码稳定的颅底形态，然后高层次的层专注于区分相邻的成熟阶段。这在网络深度上形成了一种课程，使深度特征学习与SOS融合的生物连续性相一致。通过对卷积和基于变换器的架构进行广泛实验，结果表明这种受专家启发的训练策略比标准训练产生更稳定的优化和一致更高的准确性，特别是在模糊的中间阶段。重要的是，这些提升是在不改变网络架构或损失函数的情况下实现的，表明仅通过训练动态就能显著改善解剖表征学习。所提出的框架建立了专家牙科直觉与深度视觉表征之间的原则性联系，使得从CBCT进行稳健且数据高效的SOS分期成为可能，并为在医学影像中建模其他连续生物过程提供了一种通用策略。

View on arXiv Download PDF AI Translation

cs.CV / 187 / 2604.10949

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

伪统一：熵探测揭示统一多模态模型中的信息模式差异

Yang, Songlin, Kong, Xianghao, Rao, Anyi

Abstract

Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.

Chinese Translation

统一多模态模型（UMMs）旨在将大型语言模型（LLMs）的推理能力与视觉模型的生成能力相结合。然而，在实践中，这种协同作用仍然难以实现：UMMs未能将LLM式的推理转移到图像合成上，并表现出不同的响应行为。我们将这一现象称为伪统一。诊断其内部原因至关重要，但现有的探测方法要么缺乏模型内部的洞察，要么忽视了提示-响应之间的依赖关系。为了解决这些局限性，我们提出了一种信息论探测框架，联合分析UMMs如何编码输入和生成输出。应用于十个代表性的UMMs，我们的框架揭示伪统一源于双重差异：（i）模态不对称编码，其中视觉和语言遵循不同的熵轨迹，以及（ii）模式分裂响应，其中文本生成表现出高熵的创造性，而图像合成则强制执行低熵的保真性。只有那些统一两者的模型（例如，通过上下文预测）才能实现更真实的统一，从而即使在参数较少的情况下也能实现更强的基于推理的文本到图像生成。我们的工作提供了对统一的首次模型内部探测，表明真正的多模态协同需要信息流的一致性，而不仅仅是共享参数。

View on arXiv Download PDF AI Translation

cs.CV / 188 / 2604.10950

Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation

通过蒸馏辅助测试时自适应引导视频语义分割模型

Kim, Jihun, Kwon, Hoyong, Kweon, Hyeokjun, Yoon, Kuk-Jin

Abstract

Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2's temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame context. Crucially, DiTTA achieves robust generalization even when adapting with highly limited partial video snippets (e.g., initial 10%), significantly outperforming zero-shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA's effectiveness, achieving competitive or superior performance relative to fully-supervised VSS methods, thus providing a practical and annotation-free solution for real-world VSS tasks.

Chinese Translation

全监督视频语义分割（VSS）严重依赖于密集标注的视频数据，限制了其实用性。另一种方法是逐帧应用预训练的图像语义分割（ISS）模型，避免了标注成本，但忽视了关键的时间一致性。近期的基础模型如SAM2能够实现高质量的掩码传播，但由于语义理解有限且计算开销大，直接用于VSS仍不切实际。本文提出了DiTTA（Distillation-assisted Test-Time Adaptation），一种新颖框架，通过高效的测试时自适应（TTA）将ISS模型转化为具备时间感知能力的VSS模型，无需标注视频。DiTTA在简短的单次初始化阶段，将SAM2的时间分割知识蒸馏到ISS模型中，并辅以轻量级时间融合模块以聚合跨帧上下文。关键是，DiTTA即使在仅使用极少部分视频片段（如前10%）进行自适应时，也能实现稳健的泛化能力，显著优于在推理过程中反复调用SAM2的零样本微调方法。在VSPW和Cityscapes上的大量实验验证了DiTTA的有效性，其性能可与全监督VSS方法媲美甚至更优，提供了一种实用且无标注需求的现实视频语义分割解决方案。

View on arXiv Download PDF AI Translation

cs.CV / 189 / 2604.10954

FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

FineEdit：基于边界框引导的细粒度图像编辑

Xu, Haohang, Liu, Lin, Zhang, Zhibo, Cong, Rong, Zhang, Xiaopeng, Tian, Qi

Abstract

Diffusion-based image editing models have achieved significant progress in real world applications. However, conventional models typically rely on natural language prompts, which often lack the precision required to localize target objects. Consequently, these models struggle to maintain background consistency due to their global image regeneration paradigm. Recognizing that visual cues provide an intuitive means for users to highlight specific areas of interest, we utilize bounding boxes as guidance to explicitly define the editing target. This approach ensures that the diffusion model can accurately localize the target while preserving background consistency. To achieve this, we propose FineEdit, a multi-level bounding box injection method that enables the model to utilize spatial conditions more effectively. To support this high precision guidance, we present FineEdit-1.2M, a large scale, fine-grained dataset comprising 1.2 million image editing pairs with precise bounding box annotations. Furthermore, we construct a comprehensive benchmark, termed FineEdit-Bench, which includes 1,000 images across 10 subjects to effectively evaluate region based editing capabilities. Evaluations on FineEdit-Bench demonstrate that our model significantly outperforms state-of-the-art open-source models (e.g., Qwen-Image-Edit and LongCat-Image-Edit) in instruction compliance and background preservation. Further assessments on open benchmarks (GEdit and ImgEdit Bench) confirm its superior generalization and robustness.

Chinese Translation

基于扩散模型的图像编辑在实际应用中取得了显著进展。然而，传统模型通常依赖自然语言提示，往往缺乏定位目标对象所需的精确性。因此，这些模型由于采用全局图像重建范式，难以保持背景的一致性。鉴于视觉线索为用户突出特定兴趣区域提供了直观手段，我们利用边界框作为引导，明确界定编辑目标。该方法确保扩散模型能够准确定位目标，同时保持背景一致性。为此，我们提出了FineEdit，一种多层次边界框注入方法，使模型能够更有效地利用空间条件。为了支持这种高精度引导，我们发布了FineEdit-1.2M，一个包含120万对带有精确边界框标注的图像编辑对的大规模细粒度数据集。此外，我们构建了一个综合基准测试集FineEdit-Bench，涵盖10个主题的1000张图像，用以有效评估基于区域的编辑能力。在FineEdit-Bench上的评测表明，我们的模型在指令遵循性和背景保持方面显著优于当前最先进的开源模型（如Qwen-Image-Edit和LongCat-Image-Edit）。在公开基准（GEdit和ImgEdit Bench）上的进一步评估也验证了其卓越的泛化能力和鲁棒性。

View on arXiv Download PDF AI Translation

cs.CV / 190 / 2604.10966

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

你只需判断一次：单次前向传播中的多响应奖励建模

Yang, Yinuo, Ma, Zixian, Ganti, Manasi, Zhang, Jieyu, Krishna, Ranjay

Abstract

We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one for each potential response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR$^2$Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR$^2$Bench-Image, MR$^2$Bench-Video, and four other existing benchmarks. Our model outperforms existing larger generative and discriminative reward models. We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces improved policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.

Chinese Translation

我们提出了一种判别性多模态奖励模型，该模型在单次前向传播中对所有候选响应进行评分。传统的判别性奖励模型独立评估每个响应，需为每个潜在响应进行多次前向传播。我们的方法将多个响应与分隔符标记连接，并对其标量评分应用交叉熵，从而实现直接的比较推理和高效的 $N$-方式偏好学习。多响应设计还使得相较于传统的单响应评分，墙钟时间加速可达 $N imes$，并减少了浮点运算量（FLOPs）。为了在现有的成对基准之外实现 $N$-方式奖励评估，我们构建了两个新的基准： (1) MR$^2$Bench-Image 包含来自8个不同模型的响应的人类标注排名； (2) MR$^2$Bench-Video 是一个基于视频的大规模奖励基准，源自94K众包的成对人类判断，涵盖19个模型，并通过偏好图集成去噪。两个基准均提供从完整排名中抽样的4响应评估变体。基于一个具有LoRA微调的4B视觉-语言主干和轻量级MLP值头，我们的模型在六个多模态奖励基准上取得了最先进的结果，包括MR$^2$Bench-Image、MR$^2$Bench-Video及其他四个现有基准。我们的模型超越了现有的更大生成和判别奖励模型。我们进一步证明，当我们的奖励模型与GRPO结合用于强化学习时，能够生成改进的策略模型，这些模型在标准多模态基准上保持性能，同时显著提升开放式生成质量，在训练稳定性和开放式生成质量方面大幅超越单响应判别奖励模型（RM）基线。

View on arXiv Download PDF AI Translation

cs.CV / 191 / 2604.10969

Towards Automated Solar Panel Integrity: Hybrid Deep Feature Extraction for Advanced Surface Defect Identification

迈向太阳能电池板完整性的自动化：用于先进表面缺陷识别的混合深度特征提取

Asif, Muhammad Junaid, Rafaqat, Muhammad Saad, Nazakat, Usman, Khan, Uzair, Ahmad, Rana Fayyaz

Abstract

To ensure energy efficiency and reliable operations, it is essential to monitor solar panels in generation plants to detect defects. It is quite labor-intensive, time consuming and costly to manually monitor large-scale solar plants and those installed in remote areas. Manual inspection may also be susceptible to human errors. Consequently, it is necessary to create an automated, intelligent defect-detection system, that ensures continuous monitoring, early fault detection, and maximum power generation. We proposed a novel hybrid method for defect detection in SOLAR plates by combining both handcrafted and deep learning features. Local Binary Pattern (LBP), Histogram of Gradients (HoG) and Gabor Filters were used for the extraction of handcrafted features. Deep features extracted by leveraging the use of DenseNet-169. Both handcrafted and deep features were concatenated and then fed to three distinct types of classifiers, including Support Vector Machines (SVM), Extreme Gradient Boost (XGBoost) and Light Gradient-Boosting Machine (LGBM). Experimental results evaluated on the augmented dataset show the superior performance, especially DenseNet-169 + Gabor (SVM), had the highest scores with 99.17% accuracy which was higher than all the other systems. In general, the proposed hybrid framework offers better defect-detection accuracy, resistance, and flexibility that has a solid basis on the real-life use of the automated PV panels monitoring system.

Chinese Translation

为了确保能源效率和可靠运行，必须对发电厂中的太阳能电池板进行监测以检测缺陷。手动监测大规模太阳能电站及偏远地区安装的电池板既费时费力又成本高昂，且人工检查容易受到人为错误的影响。因此，有必要构建一个自动化的智能缺陷检测系统，以实现持续监测、早期故障发现和最大功率输出。本文提出了一种结合手工特征与深度学习特征的混合缺陷检测新方法。手工特征提取采用了局部二值模式（Local Binary Pattern, LBP）、梯度直方图（Histogram of Gradients, HoG）和Gabor滤波器。深度特征则通过DenseNet-169提取。将手工特征与深度特征进行拼接后，分别输入三种不同的分类器，包括支持向量机（Support Vector Machines, SVM）、极端梯度提升（Extreme Gradient Boost, XGBoost）和轻量级梯度提升机（Light Gradient-Boosting Machine, LGBM）。在扩增数据集上的实验结果表明，该方法表现优越，尤其是DenseNet-169与Gabor特征结合的SVM分类器，达到了99.17%的最高准确率，优于其他所有系统。总体而言，所提混合框架在缺陷检测的准确性、鲁棒性和灵活性方面表现出色，为自动化光伏电池板监测系统的实际应用奠定了坚实基础。

View on arXiv Download PDF AI Translation

cs.CV / 192 / 2604.10970

Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization

使用自监督学习预训练的深度学习模型进行蛋白质定位

Isselmann, Ben, Göksu, Dilara, Neumann, Heinz, Weinmann, Andreas

Abstract

Background: Task-specific microscopy datasets are often small, making it difficult to train deep learning models that learn robust features. While self-supervised learning (SSL) has shown promise through pretraining on large, domain-specific datasets, generalizability across datasets with differing staining protocols and channel configurations remains underexplored. We investigated the generalizability of SSL models pretrained on ImageNet-1k and HPA FOV, evaluating their embeddings on OpenCell with and without fine-tuning, two channel-mismatch strategies, and varying fine-tuning data fractions. We additionally analyzed single-cell embeddings on a labeled OpenCell subset. Result: DINO-based ViT backbones pretrained on HPA FOV or ImageNet-1k transfer well to OpenCell even without fine-tuning. The HPA FOV-pretrained model achieved the highest zero-shot performance (macro $F_1$ 0.822 $\pm$ 0.007). Fine-tuning further improved performance to 0.860 $\pm$ 0.013. At the single-cell level, the HPA single-cell-pretrained model achieved the highest k-nearest neighbor performance across all neighborhood sizes (macro $F_1$ $\geq$ 0.796). Conclusion: SSL methods like DINO, pretrained on large domain-relevant datasets, enable effective use of deep learning features for fine-tuning on small, task-specific microscopy datasets.

Chinese Translation

背景：特定任务的显微镜数据集通常较小，这使得训练出能够学习稳健特征的深度学习模型变得困难。尽管自监督学习（SSL）通过在大型领域特定数据集上进行预训练显示出了良好的前景，但在具有不同染色协议和通道配置的数据集之间的泛化能力仍然未得到充分探索。我们研究了在ImageNet-1k和HPA FOV上预训练的SSL模型的泛化能力，评估了它们在OpenCell上的嵌入表现，包括有无微调、两种通道不匹配策略以及不同的微调数据比例。我们还分析了在标记的OpenCell子集上的单细胞嵌入。结果：基于DINO的ViT主干网络在HPA FOV或ImageNet-1k上预训练后，即使没有微调，也能很好地迁移到OpenCell。HPA FOV预训练模型在零样本性能上达到了最高（宏观 $F_1$ 0.822 $ ext{±}$ 0.007）。微调进一步将性能提高至0.860 $ ext{±}$ 0.013。在单细胞层面，HPA单细胞预训练模型在所有邻域大小上都达到了最高的k近邻性能（宏观 $F_1$ $ ext{≥}$ 0.796）。结论：像DINO这样的SSL方法，在大型领域相关数据集上进行预训练，使得在小型特定任务显微镜数据集上有效利用深度学习特征进行微调成为可能。

View on arXiv Download PDF AI Translation

cs.CV / 193 / 2604.10971

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

MMR-AD：用于多模态大语言模型通用异常检测基准的大规模多模态数据集

Yao, Xincheng, Qian, Zefeng, Shi, Chao, Song, Jiayang, Zhang, Chongyang

Abstract

In the progress of industrial anomaly detection, general anomaly detection (GAD) is an emerging trend and also the ultimate goal. Unlike the conventional single- and multi-class AD, general AD aims to train a general AD model that can directly detect anomalies in diverse novel classes without any retraining or fine-tuning on the target data. Recently, Multimodal Large Language Models (MLLMs) have shown great promise in achieving general anomaly detection due to their revolutionary visual understanding and language reasoning capabilities. However, MLLM's general AD ability remains underexplored due to: (1) MLLMs are pretrained on amounts of data sourced from the Web, these data still have significant gaps with the data in AD scenarios. Moreover, the image-text pairs during pretraining are also not specifically for AD tasks. (2) The current mainstream AD datasets are image-based and not yet suitable for post-training MLLMs. To facilitate MLLM-based general AD research, we present MMR-AD, which is a comprehensive benchmark for both training and evaluating MLLM-based AD models. With MMR-AD, we reveal that the AD performance of current SOTA generalist MLLMs still falls far behind the industrial requirements. Based on MMR-AD, we also propose a baseline model, Anomaly-R1, which is a reasoning-based AD model that learns from the CoT data in MMR-AD and is further enhanced by reinforcement learning. Extensive experiments show that our Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.

Chinese Translation

在工业异常检测的发展过程中，通用异常检测（General Anomaly Detection，GAD）是一种新兴趋势，也是最终目标。与传统的单类和多类异常检测不同，通用异常检测旨在训练一个通用的异常检测模型，能够在无需对目标数据进行任何再训练或微调的情况下，直接检测多样化的新颖类别中的异常。近年来，多模态大语言模型（Multimodal Large Language Models，MLLMs）因其革命性的视觉理解和语言推理能力，在实现通用异常检测方面展现出巨大潜力。然而，MLLM在通用异常检测能力上的研究仍然不足，原因包括：(1) MLLMs是在大量来源于网络的数据上进行预训练的，这些数据与异常检测场景中的数据存在显著差异，且预训练时的图文对并非专门针对异常检测任务设计；(2) 当前主流的异常检测数据集多为基于图像的，尚不适合用于后续训练MLLM。为推动基于MLLM的通用异常检测研究，我们提出了MMR-AD，这是一个用于训练和评估基于MLLM的异常检测模型的综合基准数据集。利用MMR-AD，我们揭示了当前最先进的通用MLLM在异常检测性能上仍远未达到工业需求。基于MMR-AD，我们还提出了一个基线模型Anomaly-R1，该模型是一种基于推理的异常检测模型，学习于MMR-AD中的链式思维（Chain-of-Thought，CoT）数据，并通过强化学习进一步增强。大量实验表明，Anomaly-R1在异常检测和定位任务中均显著优于通用MLLM。

View on arXiv Download PDF AI Translation

cs.CV / 194 / 2604.10983

Energy-oriented Diffusion Bridge for Image Restoration with Foundational Diffusion Models

面向能量的扩散桥模型在基础扩散模型下的图像恢复

Hou, Jinhui, Zhu, Zhiyu, Hou, Junhui

Abstract

Diffusion bridge models have shown great promise in image restoration by explicitly connecting clean and degraded image distributions. However, they often rely on complex and high-cost trajectories, which limit both sampling efficiency and final restoration quality. To address this, we propose an Energy-oriented diffusion Bridge (E-Bridge) framework to approximate a set of low-cost manifold geodesic trajectories to boost the performance of the proposed method. We achieve this by designing a novel bridge process that evolves over a shorter time horizon and makes the reverse process start from an entropy-regularized point that mixes the degraded image and Gaussian noise, which theoretically reduces the required trajectory energy. To solve this process efficiently, we draw inspiration from consistency models to learn a single-step mapping function, optimized via a continuous-time consistency objective tailored for our trajectory, so as to analytically map any state on the trajectory to the target image. Notably, the trajectory length in our framework becomes a tunable task-adaptive knob, allowing the model to adaptively balance information preservation against generative power for tasks of varying degradation, such as denoising versus super-resolution. Extensive experiments demonstrate that our E-Bridge achieves state-of-the-art performance across various image restoration tasks while enabling high-quality recovery with a single or fewer sampling steps. Our project page is https://jinnh.github.io/E-Bridge/.

Chinese Translation

扩散桥模型在图像恢复中展现出巨大的潜力，通过明确连接干净图像和退化图像的分布。然而，它们通常依赖于复杂且高成本的轨迹，这限制了采样效率和最终恢复质量。为了解决这个问题，我们提出了一种面向能量的扩散桥框架（Energy-oriented diffusion Bridge，E-Bridge），旨在近似一组低成本的流形测地轨迹，以提升所提方法的性能。我们通过设计一种新颖的桥接过程来实现这一目标，该过程在较短的时间范围内演变，并使反向过程从一个熵正则化点开始，该点混合了退化图像和高斯噪声，从理论上减少了所需的轨迹能量。为了高效地解决这一过程，我们从一致性模型中获得灵感，学习一个单步映射函数，通过为我们的轨迹量身定制的连续时间一致性目标进行优化，从而将轨迹上的任何状态解析地映射到目标图像。值得注意的是，我们框架中的轨迹长度成为一个可调的任务自适应旋钮，使模型能够自适应地平衡信息保留与生成能力，以应对不同退化任务，如去噪与超分辨率。大量实验表明，我们的E-Bridge在各种图像恢复任务中实现了最先进的性能，同时能够以单次或更少的采样步骤实现高质量恢复。我们的项目页面是 https://jinnh.github.io/E-Bridge/。

View on arXiv Download PDF AI Translation

cs.CV / 195 / 2604.10992

ArtiCAD: Articulated CAD Assembly Design via Multi-Agent Code Generation

ArtiCAD：基于多智能体代码生成的关节式CAD装配设计

Shui, Yuan, Guan, Yandong, Zhang, Zhanwei, Hu, Juncheng, Zhang, Jing, Xu, Dong, Yu, Qian

Abstract

Parametric Computer-Aided Design (CAD) of articulated assemblies is essential for product development, yet generating these multi-part, movable models from high-level descriptions remains unexplored. To address this, we propose ArtiCAD, the first training-free multi-agent system capable of generating editable, articulated CAD assemblies directly from text or images. Our system divides this complex task among four specialized agents: Design, Generation, Assembly, and Review. One of our key insights is to predict assembly relationships during the initial design stage rather than the assembly stage. By utilizing a Connector that explicitly defines attachment points and joint parameters, ArtiCAD determines these relationships before geometry generation, effectively bypassing the limited spatial reasoning capabilities of current LLMs and VLMs. To further ensure high-quality outputs, we introduce validation steps in the generation and assembly stages, accompanied by a cross-stage rollback mechanism that accurately isolates and corrects design- and code-level errors. Additionally, a self-evolving experience store accumulates design knowledge to continuously improve performance on future tasks. Extensive evaluations on three datasets (ArtiCAD-Bench, CADPrompt, and ACD) validate the effectiveness of our approach. We further demonstrate the applicability of ArtiCAD in requirement-driven conceptual design, physical prototyping, and the generation of embodied AI training assets through URDF export.

Chinese Translation

关节式装配的参数化计算机辅助设计（CAD）对于产品开发至关重要，但如何从高层描述生成这些多部件、可移动模型尚未被充分探索。为此，我们提出了ArtiCAD，这是首个无需训练的多智能体系统，能够直接从文本或图像生成可编辑的关节式CAD装配模型。我们的系统将这一复杂任务分配给四个专业智能体：设计（Design）、生成（Generation）、装配（Assembly）和审查（Review）。我们的关键见解之一是在初始设计阶段而非装配阶段预测装配关系。通过利用明确界定连接点和关节参数的连接器（Connector），ArtiCAD在几何生成之前确定这些关系，有效绕过了当前大型语言模型（LLMs）和视觉语言模型（VLMs）有限的空间推理能力。为了进一步确保输出质量，我们在生成和装配阶段引入了验证步骤，并配备了跨阶段回滚机制，能够准确定位并修正设计层面和代码层面的错误。此外，一个自我进化的经验库积累设计知识，以持续提升未来任务的性能。在三个数据集（ArtiCAD-Bench、CADPrompt和ACD）上的广泛评估验证了我们方法的有效性。我们还展示了ArtiCAD在需求驱动的概念设计、实体原型制作以及通过URDF导出生成具身AI训练资产方面的应用潜力。

View on arXiv Download PDF AI Translation

cs.CV / 196 / 2604.10994

LumiMotion: Improving Gaussian Relighting with Scene Dynamics

LumiMotion：利用场景动态提升高斯重光照效果

Kaleta, Joanna, Wójcik, Piotr, Marzol, Kacper, Trzciński, Tomasz, Kania, Kacper, Kowalski, Marek

Abstract

In 3D reconstruction, the problem of inverse rendering, namely recovering the illumination of the scene and the material properties, is fundamental. Existing Gaussian Splatting-based methods primarily target static scenes and often assume simplified or moderate lighting to avoid entangling shadows with surface appearance. This limits their ability to accurately separate lighting effects from material properties, particularly in real-world conditions. We address this limitation by leveraging dynamic elements - regions of the scene that undergo motion - as a supervisory signal for inverse rendering. Motion reveals the same surfaces under varying lighting conditions, providing stronger cues for disentangling material and illumination. This thesis is supported by our experimental results which show we improve LPIPS by 23% for albedo estimation and by 15% for scene relighting relative to next-best baseline. To this end, we introduce LumiMotion, the first Gaussian-based approach that leverages dynamics for inverse rendering and operates in arbitrary dynamic scenes. Our method learns a dynamic 2D Gaussian Splatting representation that employs a set of novel constraints which encourage the dynamic regions of the scene to deform, while keeping static regions stable. As we demonstrate, this separation is crucial for correct optimization of the albedo. Finally, we release a new synthetic benchmark comprising five scenes under four lighting conditions, each in both static and dynamic variants, for the first time enabling systematic evaluation of inverse rendering methods in dynamic environments and challenging lighting. Link to project page: https://joaxkal.github.io/LumiMotion/

Chinese Translation

在三维重建中，逆向渲染问题，即恢复场景的光照和材质属性，是基础性问题。现有基于高斯散点（Gaussian Splatting）的方法主要针对静态场景，且通常假设简化或适中的光照条件，以避免阴影与表面外观的混淆。这限制了它们在真实环境中准确区分光照效应与材质属性的能力。我们通过利用动态元素——场景中发生运动的区域——作为逆向渲染的监督信号来解决这一限制。运动使得相同表面在不同光照条件下被观察到，提供了更强的线索以分离材质与光照。这一观点得到了实验结果的支持：相较于次优基线，我们在反照率估计上提升了23%的LPIPS指标，在场景重光照上提升了15%。为此，我们提出了LumiMotion，这是首个利用动态信息进行逆向渲染且适用于任意动态场景的基于高斯方法。我们的方法学习了一种动态二维高斯散点表示，并引入了一组新颖约束，促使场景的动态区域发生形变，同时保持静态区域稳定。正如我们所展示的，这种分离对于反照率的正确优化至关重要。最后，我们发布了一个新的合成基准数据集，包含五个场景在四种光照条件下的静态与动态版本，首次实现了在动态环境和复杂光照条件下对逆向渲染方法的系统评估。项目主页链接：https://joaxkal.github.io/LumiMotion/

View on arXiv Download PDF AI Translation

cs.CV / 197 / 2604.10999

TraversalBench: Challenging Paths to Follow for Vision Language Models

TraversalBench：视觉语言模型需遵循的挑战路径

Petrova, Clara, Chen, Zhuo, Soljačić, Marin

Abstract

Vision-language models (VLMs) perform strongly on many multimodal benchmarks. However, the ability to follow complex visual paths -- a task that human observers typically find straightforward -- remains under-tested. We introduce TraversalBench, a controlled benchmark for exact visual path traversal. Each instance contains a single continuous polyline, a unique start marker, and markers placed at path vertices; the task is to recover the exact ordered sequence encountered when traversing the path from start to finish. The benchmark explicitly balances key path-structural factors including self-intersection count, tortuosity, vertex count, and nearby confounding lines, while minimizing reliance on OCR, world knowledge, and open-ended planning. We find that self-intersections are the dominant source of difficulty. A first-crossing analysis shows that errors are sharply localized: performance is relatively stable immediately before the first crossing, then drops steeply when the model must resolve the correct continuation. By contrast, nearby confounding lines produce a weaker persistent degradation that compounds with repeated exposure. These analyses make TraversalBench a useful diagnostic for identifying whether models suffer from human-like failures or other breakdowns in sustained visual processing. An auxiliary reading-order benchmark further reveals a consistent preference for layouts compatible with left-to-right serialization, while not explaining away the main effects of path complexity. Together, these results position TraversalBench as a controlled diagnostic of path-faithful visual reasoning and as a useful testbed for studying multimodal spatial reasoning under ambiguity, clutter, and distractor structure. More broadly, we position TraversalBench as a contribution to the still-limited area of sustained visual grounding benchmarks for VLMs.

Chinese Translation

视觉语言模型（VLMs）在许多多模态基准测试中表现出色。然而，跟随复杂视觉路径的能力——这是人类观察者通常认为简单的任务——仍然缺乏充分测试。我们引入了TraversalBench，这是一个用于精确视觉路径遍历的受控基准。每个实例包含一条连续的多边形线、一个独特的起始标记以及放置在路径顶点的标记；任务是恢复从起点到终点遍历路径时遇到的确切有序序列。该基准明确平衡了关键的路径结构因素，包括自交点数量、曲折度、顶点数量和附近的干扰线，同时最小化对光学字符识别（OCR）、世界知识和开放式规划的依赖。我们发现，自交点是主要的困难来源。首次交叉分析表明，错误是高度局部化的：在首次交叉之前，性能相对稳定，但当模型必须解决正确的延续时，性能急剧下降。相比之下，附近的干扰线产生了较弱的持续退化，随着重复暴露而加重。这些分析使TraversalBench成为识别模型是否遭受类人失败或其他持续视觉处理崩溃的有用诊断工具。辅助的阅读顺序基准进一步揭示了对与从左到右序列化兼容的布局的一致偏好，同时并未解释路径复杂性的主要影响。综合来看，这些结果将TraversalBench定位为路径忠实视觉推理的受控诊断工具，并作为研究模糊、杂乱和干扰结构下多模态空间推理的有用测试平台。更广泛地说，我们将TraversalBench视为对视觉语言模型（VLMs）仍然有限的持续视觉基础基准领域的贡献。

View on arXiv Download PDF AI Translation

cs.CV / 198 / 2604.11004

Panoptic Pairwise Distortion Graph

全景成对失真图

Janjua, Muhammad Kamran, Wahab, Abdul, Rashidi, Bahador

Abstract

In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information such as distortion type, severity, comparison and quality score in a compact interpretable graph structure. To realize the task of learning a distortion graph, we contribute (i) a region-level dataset, PandaSet, (ii) a benchmark suite, PandaBench, with varying region-level difficulty, and (iii) an efficient architecture, Panda, to generate distortion graphs. We demonstrate that PandaBench poses a significant challenge for state-of-the-art multimodal large language models (MLLMs) as they fail to understand region-level degradations even when fed with explicit region cues. We show that training on PandaSet or prompting with DG elicits region-wise distortion understanding, opening a new direction for fine-grained, structured pairwise image assessment.

Chinese Translation

在本研究中，我们通过将图像对表示为其区域的结构化组合，提出了对比图像评估的新视角。与现有方法专注于整体图像分析，同时隐含依赖于区域级理解不同，我们将场景图的图像内部概念扩展到图像之间，并提出了一项新任务——失真图（Distortion Graph, DG）。DG将成对图像视为基于区域的结构拓扑，并在紧凑可解释的图结构中表示密集的降级信息，如失真类型、严重程度、比较和质量评分。为了实现学习失真图的任务，我们贡献了（i）一个区域级数据集PandaSet，（ii）一个具有不同区域级难度的基准套件PandaBench，以及（iii）一个高效的架构Panda，用于生成失真图。我们展示了PandaBench对最先进的多模态大型语言模型（MLLMs）构成了重大挑战，因为即使在提供明确的区域提示时，它们也无法理解区域级降级。我们表明，在PandaSet上进行训练或使用DG进行提示能够引发区域级失真理解，为细粒度、结构化的成对图像评估开辟了新方向。

View on arXiv Download PDF AI Translation

cs.CV / 199 / 2604.11006

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

迈向真实的3D发射材料：发射纹理生成的数据集、基线和评估

Zhang, Zhiyuan, Zhou, Zijian, Li, Linjun, Chen, Long, Tang, Hao, Gong, Yichen

Abstract

3D texture generation is receiving increasing attention, as it enables the creation of realistic and aesthetic texture materials for untextured 3D meshes. However, existing 3D texture generation methods are limited to producing only a few types of non-emissive PBR materials (e.g., albedo, metallic maps and roughness maps), making them difficult to replicate highly popular styles, such as cyberpunk, failing to achieve effects like realistic LED emissions. To address this limitation, we propose a novel task, emission texture generation, which enables the synthesized 3D objects to faithfully reproduce the emission materials from input reference images. Our key contributions include: first, We construct the Objaverse-Emission dataset, the first dataset that contains 40k 3D assets with high-quality emission materials. Second, we propose EmissionGen, a novel baseline for the emission texture generation task. Third, we define detailed evaluation metrics for the emission texture generation task. Our results demonstrate significant potential for future industrial applications. Dataset will be available at https://github.com/yx345kw/EmissionGen.

Chinese Translation

3D纹理生成正受到越来越多的关注，因为它能够为未纹理化的3D网格创建真实且美观的纹理材料。然而，现有的3D纹理生成方法仅限于生成少量类型的非发射PBR材料（例如，反照率、金属图和粗糙度图），使其难以复制高度流行的风格，如赛博朋克，未能实现如真实LED发射等效果。为了解决这一局限性，我们提出了一项新任务，即发射纹理生成，该任务使合成的3D对象能够忠实地再现输入参考图像中的发射材料。我们的主要贡献包括：首先，我们构建了Objaverse-Emission数据集，这是第一个包含4万件高质量发射材料的3D资产的数据集。其次，我们提出了EmissionGen，这是发射纹理生成任务的新基线。第三，我们为发射纹理生成任务定义了详细的评估指标。我们的结果展示了未来工业应用的重大潜力。数据集将可在 https://github.com/yx345kw/EmissionGen 获取。

View on arXiv Download PDF AI Translation

cs.CV / 200 / 2604.11007

Data-Efficient Semantic Segmentation of 3D Point Clouds via Open-Vocabulary Image Segmentation-based Pseudo-Labeling

基于开放词汇图像分割伪标签的3D点云语义分割数据高效方法

Furuya, Takahiko

Abstract

Semantic segmentation of 3D point cloud scenes is a crucial task for various applications. In real-world scenarios, training segmentation models often faces three concurrent forms of data insufficiency: scarcity of training scenes, scarcity of point-level annotations, and absence of 2D image sequences from which point clouds were reconstructed. Existing data-efficient algorithms typically address only one or two of these challenges, leaving the joint treatment of all three unexplored. This paper proposes a data-efficient training framework specifically designed to address the three forms of data insufficiency. Our proposed algorithm, called Point pseudo-Labeling via Open-Vocabulary Image Segmentation (PLOVIS), leverages an Open-Vocabulary Image Segmentation (OVIS) model as a pseudo label generator to compensate for the lack of training data. PLOVIS creates 2D images for pseudo-labeling directly from training 3D point clouds, eliminating the need for 2D image sequences. To mitigate the inherent noise and class imbalance in pseudo labels, we introduce a two-stage filtering of pseudo labels combined with a class-balanced memory bank for effective training. The two-stage filtering mechanism first removes low-confidence pseudo labels, then discards likely incorrect pseudo labels, thereby enhancing the quality of pseudo labels. Experiments on four benchmark datasets, i.e., ScanNet, S3DIS, Toronto3D, and Semantic3D, under realistic data-scarce conditions (a few tens of training 3D scenes, each annotated with only <100 3D points) demonstrate that PLOVIS consistently outperforms existing methods including standard fine-tuning strategies and state-of-the-art weakly supervised learning algorithms. Code will be made publicly available.

Chinese Translation

3D点云场景的语义分割是多种应用中的关键任务。在实际场景中，训练分割模型常面临三种并存的数据不足问题：训练场景稀缺、点级标注稀缺以及缺乏用于重建点云的二维图像序列。现有的数据高效算法通常仅针对其中一项或两项挑战进行处理，尚未对三者的联合处理进行探索。本文提出了一种专门针对这三种数据不足形式设计的数据高效训练框架。我们提出的算法称为基于开放词汇图像分割的点云伪标签生成（Point pseudo-Labeling via Open-Vocabulary Image Segmentation，PLOVIS），该方法利用开放词汇图像分割（Open-Vocabulary Image Segmentation，OVIS）模型作为伪标签生成器，以弥补训练数据的不足。PLOVIS直接从训练的3D点云生成二维图像用于伪标签标注，避免了对二维图像序列的依赖。为缓解伪标签中固有的噪声和类别不平衡问题，我们引入了结合类别平衡记忆库的两阶段伪标签过滤机制以实现有效训练。该两阶段过滤机制首先剔除低置信度伪标签，随后丢弃可能错误的伪标签，从而提升伪标签质量。在四个基准数据集ScanNet、S3DIS、Toronto3D和Semantic3D上，在现实的数据稀缺条件下（仅有数十个训练3D场景，每个场景标注点数少于100个）进行的实验表明，PLOVIS持续优于包括标准微调策略和最先进弱监督学习算法在内的现有方法。代码将公开发布。

View on arXiv Download PDF AI Translation

cs.CV / 201 / 2604.11010

Byte-level generative predictions for forensics multimedia carving

用于取证多媒体雕刻的字节级生成预测

Lee, Jaewon, Eimon, Md Eimran Hossain, Srinivasan, Avinash, Kalva, Hari

Abstract

Digital forensic investigations often face significant challenges when recovering fragmented multimedia files that lack file system metadata. While traditional file carving relies on signatures and discriminative deep learning models for fragment classification, these methods cannot reconstruct or predict missing data. We propose a generative approach to multimedia carving using bGPT, a byte-level transformer designed for next-byte prediction. By feeding partial BMP image data into the model, we simulate the generation of likely fragment continuations. We evaluate the fidelity of these predictions using different metrics, namely, cosine similarity, structural similarity index (SSIM), chi-square distance, and Jensen-Shannon divergence (JSD). Our findings demonstrate that generative models can effectively predict byte-level patterns to support fragment matching in unallocated disk space.

Chinese Translation

数字取证调查在恢复缺乏文件系统元数据的碎片化多媒体文件时常面临重大挑战。传统的文件雕刻依赖于签名和判别深度学习模型进行碎片分类，但这些方法无法重建或预测缺失的数据。我们提出了一种使用 bGPT（字节级变换器）进行多媒体雕刻的生成方法，该模型旨在进行下一个字节的预测。通过将部分 BMP 图像数据输入模型，我们模拟了可能的碎片延续生成。我们使用不同的指标评估这些预测的准确性，即余弦相似度、结构相似性指数（SSIM）、卡方距离和詹森-香农散度（JSD）。我们的研究结果表明，生成模型能够有效预测字节级模式，以支持在未分配磁盘空间中的碎片匹配。

View on arXiv Download PDF AI Translation

cs.CV / 202 / 2604.11014

UHD-GPGNet: UHD Video Denoising via Gaussian-Process-Guided Local Spatio-Temporal Modeling

UHD-GPGNet：通过高斯过程引导的局部时空建模实现超高清视频去噪

He, Weiyuan, Wu, Chen, Dai, Pengwen, Wang, Wei, Lu, Dianjie, Zhang, Guijuan, Fan, Linwei, Wang, Yongzhen, Zheng, Zhuoran

Abstract

Ultra-high-definition (UHD) video denoising requires simultaneously suppressing complex spatio-temporal degradations, preserving fine textures and chromatic stability, and maintaining efficient full-resolution 4K deployment. In this paper, we propose UHD-GPGNet, a Gaussian-process-guided local spatio-temporal denoising framework that addresses these requirements jointly. Rather than relying on implicit feature learning alone, the method estimates sparse GP posterior statistics over compact spatio-temporal descriptors to explicitly characterize local degradation response and uncertainty, which then guide adaptive temporal-detail fusion. A structure-color collaborative reconstruction head decouples luminance, chroma, and high-frequency correction, while a heteroscedastic objective and overlap-tiled inference further stabilize optimization and enable memory-bounded 4K deployment. Experiments on UVG and RealisVideo-4K show that UHD-GPGNet achieves competitive restoration fidelity with substantially fewer parameters than existing methods, enables real-time full-resolution 4K inference with significant speedup over the closest quality competitor, and maintains robust performance across a multi-level mixed-degradation schedule.A real-world study on phone-captured 4K video further confirms that the model, trained entirely on synthetic degradation, generalizes to unseen real sensor noise and improves downstream object detection under challenging conditions.

Chinese Translation

超高清（UHD）视频去噪需要同时抑制复杂的时空退化，保持细腻的纹理和色彩稳定性，并维持高效的全分辨率4K部署。本文提出了UHD-GPGNet，一种高斯过程引导的局部时空去噪框架，旨在共同满足这些要求。该方法不仅依赖于隐式特征学习，而是通过紧凑的时空描述符估计稀疏的高斯过程后验统计，以明确表征局部退化响应和不确定性，从而指导自适应的时间细节融合。结构-色彩协同重建头部解耦了亮度、色度和高频修正，而异方差目标和重叠切片推理进一步稳定了优化，并实现了内存受限的4K部署。在UVG和RealisVideo-4K上的实验表明，UHD-GPGNet在参数显著少于现有方法的情况下实现了竞争性的恢复保真度，能够以显著加速的速度进行实时全分辨率4K推理，并在多级混合退化调度中保持稳健的性能。对手机拍摄的4K视频的真实世界研究进一步确认，该模型在完全基于合成退化训练的情况下，能够推广到未见过的真实传感器噪声，并在具有挑战性的条件下改善下游目标检测。

View on arXiv Download PDF AI Translation

cs.CV / 203 / 2604.11025

Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

感知中的测试时间缩放：解决图像思维中的基础矛盾

Jiang, Zheng, Chen, Yiming, He, Nan, Chen, Jiahui, Li, Chaoyang, Qian, Houde, Sun, Lifeng

Abstract

Recent multimodal large language models (MLLMs) have begun to support Thinking with Images by invoking visual tools such as zooming and cropping during inference. Yet these systems remain brittle in fine-grained visual reasoning because they must decide where to look before they have access to the evidence needed to make that decision correctly. We identify this circular dependency as the Grounding Paradox. To address it, we propose Test-Time Scaling over Perception (TTSP), a framework that treats perception itself as a scalable inference process. TTSP generates multiple exploratory perception traces, filters unreliable traces using entropy-based confidence estimation, distills validated observations into structured knowledge, and iteratively refines subsequent exploration toward unresolved uncertainty. Extensive experiments on high-resolution and general multimodal reasoning benchmarks show that TTSP consistently outperforms strong baselines across backbone sizes, while also exhibiting favorable scalability and token efficiency. Our results suggest that scaling perception at test time is a promising direction for robust multimodal reasoning under perceptual uncertainty.

Chinese Translation

最近的多模态大型语言模型（MLLMs）开始通过在推理过程中调用缩放和裁剪等视觉工具来支持图像思维。然而，这些系统在细粒度视觉推理中仍然脆弱，因为它们必须在获得做出正确决策所需的证据之前决定观察的位置。我们将这种循环依赖关系称为基础矛盾。为了解决这一问题，我们提出了感知中的测试时间缩放（TTSP）框架，该框架将感知本身视为一个可扩展的推理过程。TTSP生成多个探索性感知轨迹，使用基于熵的置信度估计过滤不可靠的轨迹，将经过验证的观察提炼为结构化知识，并迭代地优化后续探索以应对未解决的不确定性。在高分辨率和一般多模态推理基准上的大量实验表明，TTSP在不同的主干网络规模上始终优于强基线，同时还表现出良好的可扩展性和令牌效率。我们的结果表明，在测试时扩展感知是应对感知不确定性下稳健的多模态推理的一个有前景的方向。

View on arXiv Download PDF AI Translation

cs.CV / 204 / 2604.11038

EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates

EgoFun3D：基于功能模板从第一人称视角视频建模交互式物体

Peng, Weikun, Iliash, Denys, Savva, Manolis

Abstract

We present EgoFun3D, a coordinated task formulation, dataset, and benchmark for modeling interactive 3D objects from egocentric videos. Interactive objects are of high interest for embodied AI but scarce, making modeling from readily available real-world videos valuable. Our task focuses on obtaining simulation-ready interactive 3D objects from egocentric video input. While prior work largely focuses on articulations, we capture general cross-part functional mappings (e.g., rotation of stove knob controls stove burner temperature) through function templates, a structured computational representation. Function templates enable precise evaluation and direct compilation into executable code across simulation platforms. To enable comprehensive benchmarking, we introduce a dataset of 271 egocentric videos featuring challenging real-world interactions with paired 3D geometry, segmentation over 2D and 3D, articulation and function template annotations. To tackle the task, we propose a 4-stage pipeline consisting of: 2D part segmentation, reconstruction, articulation estimation, and function template inference. Comprehensive benchmarking shows that the task is challenging for off-the-shelf methods, highlighting avenues for future work.

Chinese Translation

我们提出了EgoFun3D，一种针对从第一人称视角视频中建模交互式三维物体的协调任务定义、数据集及基准测试。交互式物体在具身人工智能领域具有重要意义，但相关数据稀缺，因此从现成的真实世界视频中进行建模具有重要价值。我们的任务聚焦于从第一人称视角视频输入中获取可用于仿真的交互式三维物体。以往工作主要关注物体的关节运动，而我们通过功能模板（一种结构化的计算表示）捕捉跨部件的通用功能映射（例如，炉灶旋钮的旋转控制炉灶火焰温度）。功能模板不仅支持精确评估，还能直接编译成可在多种仿真平台上执行的代码。为实现全面的基准测试，我们引入了一个包含271个第一人称视角视频的数据集，涵盖具有挑战性的真实交互场景，配备了三维几何信息、二维及三维分割、关节运动及功能模板标注。为解决该任务，我们提出了一个包含四个阶段的流水线：二维部件分割、重建、关节运动估计及功能模板推断。全面的基准测试表明，现成方法在该任务上表现具有挑战性，凸显了未来研究的方向。

View on arXiv Download PDF AI Translation

cs.CV / 205 / 2604.11042

Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization

通过自主协调提升跨不一致标注数据集的布局表示学习

Li, Renyu, Kirilenko, Vladimir, You, Yao, Wolfe, Crag

Abstract

Fine-tuning object detection (OD) models on combined datasets assumes annotation compatibility, yet datasets often encode conflicting spatial definitions for semantically equivalent categories. We propose an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous sources before training. We evaluate on document layout detection as a challenging case study, where annotation standards vary widely across corpora. Without harmonization, na\"ive mixed-dataset fine-tuning degrades a pretrained RT-DETRv2 detector: on SCORE-Bench, which measures how accurately the full document conversion pipeline reproduces ground-truth structure, table TEDS drops from 0.800 to 0.750. Applied to two corpora whose 16 and 10 category taxonomies share only 8 direct correspondences, harmonization yields consistent gains across content fidelity, table structure, and spatial consistency: detection F-score improves from 0.860 to 0.883, table TEDS improves to 0.814, and mean bounding box overlap drops from 0.043 to 0.016. Representation analysis further shows that harmonized training produces more compact and separable post-decoder embeddings, confirming that annotation inconsistency distorts the learned feature space and that resolving it before training restores representation structure.

Chinese Translation

在联合数据集上微调目标检测（OD）模型通常假设标注具有兼容性，然而数据集往往对语义等价类别编码了相互冲突的空间定义。我们提出了一种自主标签协调流程，利用视觉-语言模型在训练前调和异构来源中类别语义和边界框粒度。我们以文档布局检测作为具有挑战性的案例研究，该领域中不同语料库的标注标准差异显著。若不进行协调，简单混合数据集微调会降低预训练RT-DETRv2检测器的性能：在SCORE-Bench（衡量完整文档转换流程重现真实结构的准确性）上，表格TEDS指标从0.800降至0.750。针对两个类别体系分别包含16和10类且仅有8个直接对应关系的语料库，协调方法在内容保真度、表格结构和空间一致性方面均带来稳定提升：检测F分数从0.860提升至0.883，表格TEDS提升至0.814，平均边界框重叠度从0.043降至0.016。表示分析进一步表明，协调训练生成了更紧凑且可分离的后解码器嵌入，验证了标注不一致性扭曲了学习到的特征空间，且在训练前解决该问题能够恢复表示结构。

View on arXiv Download PDF AI Translation

cs.CV / 206 / 2604.11071

Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

基于分布归一化预处理与深度可分离卷积U-Net的轻量级低光照图像增强

Murai, Shimon, Kurita, Teppei, Satoh, Ryuta, Moriuchi, Yusuke

Abstract

We present a lightweight two-stage framework for low-light image enhancement (LLIE) that achieves competitive perceptual quality with significantly fewer parameters than existing methods. Our approach combines frozen algorithm-based preprocessing with a compact U-Net built entirely from depthwise-separable convolutions. The preprocessing normalizes the input distribution by providing complementary brightness-corrected views, enabling the trainable network to focus on residual color correction. Our method achieved 4th place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge. We further provide extended benchmarks and ablations to demonstrate the general effectiveness of our methods.

Chinese Translation

我们提出了一种轻量级的两阶段低光照图像增强（LLIE）框架，在显著减少参数量的同时，实现了具有竞争力的感知质量。该方法结合了冻结的基于算法的预处理与完全由深度可分离卷积构建的紧凑型U-Net。预处理通过提供互补的亮度校正视图对输入分布进行归一化，使可训练网络能够专注于残差颜色校正。我们的方法在CVPR 2026 NTIRE高效低光照图像增强挑战赛中获得第四名。我们还提供了扩展的基准测试和消融实验，以展示该方法的普适有效性。

View on arXiv Download PDF AI Translation

cs.CV / 207 / 2604.11080

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

ReSpinQuant：通过子空间残差旋转近似实现高效的层级大语言模型量化

Kim, Suyoung, Wee, Sunghyun, Kim, Hyeonjin, Hwang, Kyomin, Lee, Hyunho, Kwak, Nojun

Abstract

Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we propose ReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles the high expressivity of layer-wise adaptation with only negligible inference overhead. Extensive experiments on W4A4 and W3A3 quantization demonstrate that ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.

Chinese Translation

基于旋转的后训练量化（PTQ）已成为缓解大语言模型（LLMs）量化中激活异常值的有希望的解决方案。全球旋转方法通过将激活旋转融合到注意力和前馈网络（FFN）模块中来实现推理效率，但由于受到在所有层中使用单一可学习旋转矩阵的限制，导致表达能力有限。为了解决这一问题，层级变换方法应运而生，通过局部适应实现了更高的准确性。然而，层级方法无法将激活旋转矩阵融合到权重中，需进行在线计算，造成显著的开销。在本文中，我们提出了ReSpinQuant，一个通过利用离线激活旋转融合和高效的残差子空间旋转匹配基础来解决此类开销的量化框架。该设计将层级适应的高表达能力与仅有微不足道的推理开销相结合。在W4A4和W3A3量化上的大量实验表明，ReSpinQuant实现了最先进的性能，超越了全球旋转方法，并在开销极小的情况下匹配了计算成本高的层级方法的准确性。

View on arXiv Download PDF AI Translation

cs.CV / 208 / 2604.11081

MapATM: Enhancing HD Map Construction through Actor Trajectory Modeling

MapATM：通过行为轨迹建模增强高清地图构建

Li, Mingyang, Lee, Brian, Zuo, Rui, Bacchus, Brent, Mudalige, Priyantha, Qiu, Qinru

Abstract

High-definition (HD) mapping tasks, which perform lane detections and predictions, are extremely challenging due to non-ideal conditions such as view occlusions, distant lane visibility, and adverse weather conditions. Those conditions often result in compromised lane detection accuracy and reduced reliability within autonomous driving systems. To address these challenges, we introduce MapATM, a novel deep neural network that effectively leverages historical actor trajectory information to improve lane detection accuracy, where actors refer to moving vehicles. By utilizing actor trajectories as structural priors for road geometry, MapATM achieves substantial performance enhancements, notably increasing AP by 4.6 for lane dividers and mAP by 2.6 on the challenging NuScenes dataset, representing relative improvements of 10.1% and 6.1%, respectively, compared to strong baseline methods. Extensive qualitative evaluations further demonstrate MapATM's capability to consistently maintain stable and robust map reconstruction across diverse and complex driving scenarios, underscoring its practical value for autonomous driving applications.

Chinese Translation

高清（HD）地图任务，包括车道检测和预测，因视线遮挡、远处车道可见性和恶劣天气等非理想条件而面临极大挑战。这些条件往往导致车道检测精度降低和自动驾驶系统的可靠性下降。为了解决这些挑战，我们提出了MapATM，这是一种新颖的深度神经网络，能够有效利用历史行为轨迹信息来提高车道检测精度，其中行为者指的是移动车辆。通过将行为轨迹作为道路几何的结构先验，MapATM实现了显著的性能提升，特别是在具有挑战性的NuScenes数据集上，车道分隔线的AP提高了4.6，mAP提高了2.6，分别相对于强基线方法的相对提升为10.1%和6.1%。广泛的定性评估进一步证明了MapATM在多样化和复杂驾驶场景中始终保持稳定和强健的地图重建能力，突显了其在自动驾驶应用中的实际价值。

View on arXiv Download PDF AI Translation

cs.CV / 209 / 2604.11082

RESP: Reference-guided Sequential Prompting for Visual Glitch Detection in Video Games

RESP：基于参考的序列提示用于视频游戏中的视觉故障检测

Yu, Yakun, Wiens, Ashley, Barahona-Ríos, Adrián, Wilkins, Benedict, Zadtootaghaj, Saman, Barman, Nabajeet, Bezemer, Cor-Paul

Abstract

Visual glitches in video games degrade player experience and perceived quality, yet manual quality assurance cannot scale to the growing test surface of modern game development. Prior automation efforts, particularly those using vision-language models (VLMs), largely operate on single frames or rely on limited video-level baselines that struggle under realistic scene variation, making robust video-level glitch detection challenging. We present RESP, a practical multi-frame framework for gameplay glitch detection with VLMs. Our key idea is reference-guided prompting: for each test frame, we select a reference frame from earlier in the same video, establishing a visual baseline and reframing detection as within-video comparison rather than isolated classification. RESP sequentially prompts the VLM with reference/test pairs and aggregates noisy frame predictions into a stable video-level decision without fine-tuning the VLM. To enable controlled analysis of reference effects, we introduce RefGlitch, a synthetic dataset of manually labeled reference/test frame pairs with balanced coverage across five glitch types. Experiments across five VLMs and three datasets (one synthetic, two real-world) show that reference guidance consistently strengthens frame-level detection and that the improved frame-level evidence reliably transfers to stronger video-level triage under realistic QA conditions. Code and data are available at: \href{https://github.com/PipiZong/RESP_code.git}{this https URL}.

Chinese Translation

视频游戏中的视觉故障降低了玩家体验和感知质量，但手动质量保证无法适应现代游戏开发日益增长的测试需求。以往的自动化努力，特别是使用视觉-语言模型（VLMs）的研究，主要在单帧上操作或依赖于有限的视频级基线，这在现实场景变化下表现不佳，使得稳健的视频级故障检测变得具有挑战性。我们提出了RESP，一个实用的多帧框架，用于利用VLMs进行游戏故障检测。我们的关键思想是基于参考的提示：对于每个测试帧，我们从同一视频的早期选择一个参考帧，建立视觉基线，并将检测重新框定为视频内比较，而不是孤立分类。RESP依次用参考/测试对提示VLM，并将噪声帧预测聚合为稳定的视频级决策，而无需微调VLM。为了便于对参考效果的控制分析，我们引入了RefGlitch，一个合成数据集，包含手动标注的参考/测试帧对，覆盖五种故障类型。对五个VLM和三个数据集（一个合成，两个真实世界）的实验表明，参考指导始终增强了帧级检测，并且改进的帧级证据在现实质量保证条件下可靠地转移到更强的视频级筛选。代码和数据可在： exttt{https://github.com/PipiZong/RESP_code.git}获取。

View on arXiv Download PDF AI Translation

cs.CV / 210 / 2604.11083

FlowCoMotion: Text-to-Motion Generation via Token-Latent Flow Modeling

FlowCoMotion：基于Token-Latent流建模的文本到动作生成

Guan, Dawei, Yang, Di, Jin, Chengjie, Wang, Jiangtao

Abstract

Text-to-motion generation is driven by learning motion representations for semantic alignment with language. Existing methods rely on either continuous or discrete motion representations. However, continuous representations entangle semantics with dynamics, while discrete representations lose fine-grained motion details. In this context, we propose FlowCoMotion, a novel motion generation framework that unifies both treatments from a modeling perspective. Specifically, FlowCoMotion employs token-latent coupling to capture both semantic content and high-fidelity motion details. In the latent branch, we apply multi-view distillation to regularize the continuous latent space, while in the token branch we use discrete temporal resolution quantization to extract high-level semantic cues. The motion latent is then obtained by combining the representations from the two branches through a token-latent coupling network. Subsequently, a velocity field is predicted based on the textual conditions. An ODE solver integrates this velocity field from a simple prior, thereby guiding the sample to the potential state of the target motion. Extensive experiments show that FlowCoMotion achieves competitive performance on text-to-motion benchmarks, including HumanML3D and SnapMoGen.

Chinese Translation

文本到动作生成依赖于学习动作表示以实现与语言的语义对齐。现有方法依赖于连续或离散的动作表示。然而，连续表示将语义与动态特征纠缠在一起，而离散表示则丢失了细粒度的动作细节。在此背景下，我们提出了FlowCoMotion，一种从建模视角统一这两种处理方式的新型动作生成框架。具体而言，FlowCoMotion采用token-latent耦合来捕捉语义内容和高保真动作细节。在潜变量分支中，我们应用多视角蒸馏以正则化连续潜在空间；而在token分支中，我们使用离散时间分辨率量化以提取高级语义线索。随后，通过token-latent耦合网络将两分支的表示结合，获得动作潜变量。基于文本条件，预测速度场。通过ODE求解器从简单先验积分该速度场，从而引导样本达到目标动作的潜在状态。大量实验表明，FlowCoMotion在HumanML3D和SnapMoGen等文本到动作基准测试中实现了具有竞争力的性能。

View on arXiv Download PDF AI Translation

cs.CV / 211 / 2604.11089

Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization

紧凑且适合生成的图像标记化的结构化状态空间正则化

Lee, Jinsung, Oh, Jaemin, Kim, Namhun, Kim, Dongwon, Yoon, Byung-Jun, Kwak, Suha

Abstract

Image tokenizers are central to modern vision models as they often operate in latent spaces. An ideal latent space must be simultaneously compact and generation-friendly: it should capture image's essential content compactly while remaining easy to model with generative approaches. In this work, we introduce a novel regularizer to align latent spaces with these two objectives. The key idea is to guide tokenizers to mimic the hidden state dynamics of state-space models (SSMs), thereby transferring their critical property, frequency awareness, to latent features. Grounded in a theoretical analysis of SSMs, our regularizer enforces encoding of fine spatial structures and frequency-domain cues into compact latent features; leading to more effective use of representation capacity and improved generative modelability. Experiments demonstrate that our method improves generation quality in diffusion models while incurring only minimal loss in reconstruction fidelity.

Chinese Translation

图像标记器在现代视觉模型中占据核心地位，因为它们通常在潜在空间中操作。理想的潜在空间必须同时紧凑且适合生成：它应紧凑地捕捉图像的基本内容，同时易于通过生成方法进行建模。在本研究中，我们引入了一种新颖的正则化器，以使潜在空间与这两个目标对齐。关键思想是引导标记器模仿状态空间模型（SSMs）的隐藏状态动态，从而将其关键特性——频率意识——转移到潜在特征中。基于对SSMs的理论分析，我们的正则化器强制将细致的空间结构和频域线索编码到紧凑的潜在特征中；这导致了表示能力的更有效利用和生成模型的改进。实验表明，我们的方法在扩散模型中提高了生成质量，同时仅导致重构保真度的最小损失。

View on arXiv Download PDF AI Translation

cs.CV / 212 / 2604.11091

LDEPrompt: Layer-importance guided Dual Expandable Prompt Pool for Pre-trained Model-based Class-Incremental Learning

LDEPrompt：基于层重要性引导的双重可扩展提示池用于预训练模型的类增量学习

Li, Linjie, Wu, Zhenyu, Xiao, Huiyu, Ji, Yang

Abstract

Prompt-based class-incremental learning methods typically construct a prompt pool consisting of multiple trainable key-prompts and perform instance-level matching to select the most suitable prompt embeddings, which has shown promising results. However, existing approaches face several limitations, including fixed prompt pools, manual selection of prompt embeddings, and strong reliance on the pretrained backbone for prompt selection. To address these issues, we propose a \textbf{L}ayer-importance guided \textbf{D}ual \textbf{E}xpandable \textbf{P}rompt Pool (\textbf{LDEPrompt}), which enables adaptive layer selection as well as dynamic freezing and expansion of the prompt pool. Extensive experiments on widely used class-incremental learning benchmarks demonstrate that LDEPrompt achieves state-of-the-art performance, validating its effectiveness and scalability.

Chinese Translation

基于提示的类增量学习方法通常构建一个由多个可训练关键提示（key-prompts）组成的提示池，并通过实例级匹配选择最合适的提示嵌入，已显示出良好的效果。然而，现有方法存在若干限制，包括提示池固定、提示嵌入的手动选择以及对预训练骨干网络在提示选择上的高度依赖。为解决这些问题，我们提出了层重要性引导的双重可扩展提示池（LDEPrompt），该方法支持自适应层选择以及提示池的动态冻结与扩展。在广泛使用的类增量学习基准上进行的大量实验表明，LDEPrompt实现了最先进的性能，验证了其有效性和可扩展性。

View on arXiv Download PDF AI Translation

cs.CV / 213 / 2604.11097

CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation

CDPR：用于可靠单目深度估计的极化交叉模态扩散

Yu, Rongjia, Jia, Tong, Wang, Hao, Li, Xiaofang, Yang, Xiao, Zhang, Zinuo, Liu, Cuiwei

Abstract

Monocular depth estimation is a fundamental yet challenging task in computer vision, especially under complex conditions such as textureless surfaces, transparency, and specular reflections. Recent diffusion-based approaches have significantly advanced performance by reformulating depth prediction as a denoising process in the latent space. However, existing methods rely solely on RGB inputs, which often lack sufficient cues in challenging regions. In this work, we present CDPR - Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation - a novel diffusion-based framework that integrates physically grounded polarization priors to enhance estimation robustness. Specifically, we encode both RGB and polarization (AoLP/DoLP) images into a shared latent space via a pre-trained Variational Autoencoder (VAE), and dynamically fuse multi-modal information through a learnable confidence-aware gating mechanism. This fusion module adaptively suppresses noisy signals in polarization inputs while preserving informative cues, particularly around reflective or transparent surfaces, and provides the integrated latent representation for subsequent monocular depth estimation. Beyond depth estimation, we further verify that our framework can be easily generalized to surface normal prediction with minimal modification, showcasing its scalability to general polarization-guided dense prediction tasks. Experiments on both synthetic and real-world datasets validate that CDPR significantly outperforms RGB-only baselines in challenging regions while maintaining competitive performance in standard scenes.

Chinese Translation

单目深度估计是计算机视觉中的一项基础但具有挑战性的任务，尤其是在无纹理表面、透明度和镜面反射等复杂条件下。最近的基于扩散的方法通过将深度预测重新表述为潜在空间中的去噪过程，显著提升了性能。然而，现有方法仅依赖于RGB输入，这在具有挑战性的区域往往缺乏足够的线索。在本研究中，我们提出了CDPR——用于可靠单目深度估计的极化交叉模态扩散——一个新颖的基于扩散的框架，整合了物理基础的极化先验，以增强估计的鲁棒性。具体而言，我们通过预训练的变分自编码器（Variational Autoencoder, VAE）将RGB和极化（角度的极化（AoLP）/深度的极化（DoLP））图像编码到共享的潜在空间，并通过可学习的信心感知门控机制动态融合多模态信息。该融合模块自适应地抑制极化输入中的噪声信号，同时保留信息线索，特别是在反射或透明表面附近，并为后续的单目深度估计提供集成的潜在表示。除了深度估计，我们进一步验证了我们的框架可以在最小修改的情况下轻松推广到表面法线预测，展示了其在一般极化引导的密集预测任务中的可扩展性。在合成和真实世界数据集上的实验验证了CDPR在具有挑战性的区域显著优于仅使用RGB的基线，同时在标准场景中保持竞争力的性能。

View on arXiv Download PDF AI Translation

cs.CV / 214 / 2604.11098

Efficient Transceiver Design for Aerial Image Transmission and Large-scale Scene Reconstruction

高效的空中图像传输与大规模场景重建收发器设计

Ren, Zeyi, Dong, Jialin, Zuo, Wei, Wang, Yikun, Cheng, Bingyang, Zhou, Sheng, Niu, Zhisheng

Abstract

Large-scale three-dimensional (3D) scene reconstruction in low-altitude intelligent networks (LAIN) demands highly efficient wireless image transmission. However, existing schemes struggle to balance severe pilot overhead with the transmission accuracy required to maintain reconstruction fidelity. To strike a balance between efficiency and reliability, this paper proposes a novel deep learning-based end-to-end (E2E) transceiver design that integrates 3D Gaussian Splatting (3DGS) directly into the training process. By jointly optimizing the communication modules via the combined 3DGS rendering loss, our approach explicitly improves scene recovery quality. Furthermore, this task-driven framework enables the use of a sparse pilot scheme, significantly reducing transmission overhead while maintaining robust image recovery under low-altitude channel conditions. Extensive experiments on real-world aerial image datasets demonstrate that the proposed E2E design significantly outperforms existing baselines, delivering superior transmission performance and accurate 3D scene reconstructions.

Chinese Translation

在低空智能网络（LAIN）中，大规模三维（3D）场景重建对无线图像传输的效率要求极高。然而，现有方案在严重的导频开销与维持重建保真度所需的传输精度之间难以取得平衡。为了解决效率与可靠性之间的矛盾，本文提出了一种新颖的基于深度学习的端到端（E2E）收发器设计，将3D高斯点云渲染（3D Gaussian Splatting, 3DGS）直接集成到训练过程中。通过联合优化通信模块，利用组合的3DGS渲染损失，我们的方法显著提高了场景恢复质量。此外，该任务驱动的框架使得稀疏导频方案的使用成为可能，显著降低了传输开销，同时在低空信道条件下保持了稳健的图像恢复。在真实世界空中图像数据集上的大量实验表明，所提出的E2E设计显著优于现有基准，提供了卓越的传输性能和准确的3D场景重建。

View on arXiv Download PDF AI Translation

cs.CV / 215 / 2604.11102

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

OmniScript：面向长篇电影视频的音视频脚本生成

Pu, Junfu, Chen, Yuxin, Wang, Teng, Shan, Ying

Abstract

Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.

Chinese Translation

当前的多模态大型语言模型（MLLMs）在短视频理解方面展现了显著的能力，但将长篇电影视频转化为详细且具有时间基础的脚本仍然是一项重大挑战。本文引入了新颖的视频到脚本（V2S）任务，旨在生成层次化的逐场景脚本，涵盖角色动作、对话、表情和音频提示。为此，我们构建了首个以人为基础的标注基准，并提出了一种具有时间感知的层次化评估框架。此外，我们还提出了OmniScript，这是一种针对长篇叙事理解的8B参数全模态（音视频）语言模型。OmniScript通过一个渐进式流程进行训练，该流程利用思维链监督微调进行情节和角色推理，随后通过使用时间分段奖励的强化学习进行优化。大量实验表明，尽管参数效率高，OmniScript在时间定位和多领域语义准确性方面显著超越了更大的开源模型，并在性能上与包括Gemini 3-Pro在内的最先进专有模型相当。

View on arXiv Download PDF AI Translation

cs.CV / 216 / 2604.11122

Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

语义-几何双重压缩：无需训练的超高分辨率遥感视觉令牌压缩方法

Li, Yueying, Wang, Fengxiang, Li, Yan, Chen, Mingshuo, Zhao, Mengying, Lan, Long

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated immense potential in Earth observation. However, the massive visual tokens generated when processing Ultra-High-Resolution (UHR) imagery introduce prohibitive computational overhead, severely bottlenecking their inference efficiency. Existing visual token compression methods predominantly adopt static and uniform compression strategies, neglecting the inherent "Semantic-Geometric Duality" in remote sensing interpretation tasks. Specifically, object semantic tasks focus on the abstract semantics of objects and benefit from aggressive background pruning, whereas scene geometric tasks critically rely on the integrity of spatial topology. To address this challenge, we propose DualComp, a task-adaptive dual-stream token compression framework. Dynamically guided by a lightweight pre-trained router, DualComp decouples feature processing into two dedicated pathways. In the object semantic stream, the Spatially-Contiguous Semantic Aggregator (SCSA) utilizes size-adaptive clustering to aggregates redundant background while protecting small object. In the scene geometric stream, the Instruction-Guided Structure Recoverer (IGSR) introduces a greedy path-tracing topology completion mechanism to reconstruct spatial skeletons. Experiments on the UHR remote sensing benchmark XLRS-Bench demonstrate that DualComp accomplishes high-fidelity remote sensing interpretation at an exceptionally low computational cost, achieving simultaneous improvements in both efficiency and accuracy.

Chinese Translation

多模态大语言模型（MLLMs）在地球观测领域展现出巨大潜力。然而，处理超高分辨率（UHR）影像时产生的大量视觉令牌带来了极高的计算开销，严重制约了推理效率。现有的视觉令牌压缩方法主要采用静态且均匀的压缩策略，忽视了遥感解译任务中固有的“语义-几何二元性”。具体而言，目标语义任务侧重于对象的抽象语义，受益于激进的背景剪枝；而场景几何任务则关键依赖空间拓扑的完整性。为解决该挑战，我们提出了DualComp，一种任务自适应的双流令牌压缩框架。在轻量级预训练路由器的动态引导下，DualComp将特征处理解耦为两条专用路径。在目标语义流中，空间连续语义聚合器（Spatially-Contiguous Semantic Aggregator, SCSA）利用尺寸自适应聚类聚合冗余背景，同时保护小目标；在场景几何流中，指令引导结构恢复器（Instruction-Guided Structure Recoverer, IGSR）引入贪婪路径追踪拓扑补全机制以重建空间骨架。基于超高分辨率遥感基准数据集XLRS-Bench的实验表明，DualComp以极低的计算成本实现了高保真遥感解译，在效率和准确性上均取得了同步提升。

View on arXiv Download PDF AI Translation

cs.CV / 217 / 2604.11136

BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

BoxTuning：直接注入目标框以实现多模态模型微调

Qian, Zekun, Han, Ruize, Feng, Wei

Abstract

Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This reduces the token cost significantly, achieving 87-93% text token reduction in practice. It also preserves full temporal resolution, where the trajectory trails further encode inter-frame motion direction and speed within each keyframe, recovering fine-grained dynamics that text-coordinate methods are forced to discard. Experimental results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks, establishing visual prompting as a more natural and efficient paradigm for conveying object information to video MLLMs.

Chinese Translation

目标级时空理解对于视频问答至关重要，然而现有多模态大型语言模型（MLLMs）通常对视频帧进行整体编码，缺乏细粒度目标定位的显式机制。近期工作通过将边界框坐标序列化为文本标记来解决该问题，但这种文本坐标范式存在根本的模态不匹配：目标信息本质上是视觉的，而将其编码为文本会产生高昂的标记成本，迫使模型进行激进的时间下采样。我们提出了BoxTuning，通过将目标时空信息直接注入视觉模态来解决该不匹配问题。彩色边界框和轨迹轨迹作为视觉提示渲染到视频帧上，仅保留简洁的颜色-目标对应图例作为文本。这显著降低了文本标记成本，实际中实现了87%-93%的文本标记减少。同时保留了完整的时间分辨率，轨迹轨迹进一步编码了关键帧内的帧间运动方向和速度，恢复了文本坐标方法被迫舍弃的细粒度动态信息。在五个视频问答基准（CLEVRER、Perception Test、STAR、NExT-QA、IntentQA）上的实验结果表明，BoxTuning在空间导向任务上优于文本坐标基线，且几乎消除了推理中心任务中观察到的准确率下降，确立了视觉提示作为向视频MLLM传递目标信息的更自然且高效的范式。

View on arXiv Download PDF AI Translation

cs.CV / 218 / 2604.11140

Sparse Hypergraph-Enhanced Frame-Event Object Detection with Fine-Grained MoE

基于稀疏超图增强的帧事件对象检测与精细化专家混合模型

Bao, Wei, Wang, Yuehan, Zhou, Tianhang, Li, Siqi, Gao, Yue

Abstract

Integrating frame-based RGB cameras with event streams offers a promising solution for robust object detection under challenging dynamic conditions. However, the inherent heterogeneity and data redundancy of these modalities often lead to prohibitive computational overhead or suboptimal feature fusion. In this paper, we propose Hyper-FEOD, a high-performance and efficient detection framework, which synergistically optimizes multi-modal interaction through two core components. First, we introduce Sparse Hypergraph-enhanced Cross-Modal Fusion (S-HCF), which leverages the inherent sparsity of event streams to construct an event-guided activity map. By performing high-order hypergraph modeling exclusively on selected motion-critical sparse tokens, S-HCF captures complex non-local dependencies between RGB and event data while overcoming the traditional complexity bottlenecks of hypergraph computation. Second, we design a Fine-Grained Mixture of Experts (FG-MoE) Enhancement module to address the diverse semantic requirements of different image regions. This module employs specialized hypergraph experts tailored for object boundaries, internal textures, and backgrounds, utilizing a pixel-level spatial gating mechanism to adaptively route and enhance features. Combined with a load-balancing loss and zero-initialization strategy, FG-MoE ensures stable training and precise feature refinement without disrupting the pre-trained backbone's distribution. Experimental results on mainstream RGB-Event benchmarks demonstrate that Hyper-FEOD achieves a superior accuracy-efficiency trade-off, outperforming state-of-the-art methods while maintaining a lightweight footprint suitable for real-time edge deployment.

Chinese Translation

将基于帧的RGB摄像头与事件流相结合，为在复杂动态条件下的稳健对象检测提供了一种有前景的解决方案。然而，这些模态固有的异质性和数据冗余往往导致不可接受的计算开销或次优的特征融合。在本文中，我们提出了Hyper-FEOD，一个高性能且高效的检测框架，通过两个核心组件协同优化多模态交互。首先，我们引入了稀疏超图增强的跨模态融合（Sparse Hypergraph-enhanced Cross-Modal Fusion, S-HCF），利用事件流的固有稀疏性构建事件引导的活动图。通过仅对选定的运动关键稀疏标记执行高阶超图建模，S-HCF捕捉RGB和事件数据之间复杂的非局部依赖，同时克服了超图计算的传统复杂性瓶颈。其次，我们设计了精细化专家混合模型（Fine-Grained Mixture of Experts, FG-MoE）增强模块，以应对不同图像区域的多样语义需求。该模块采用专门针对对象边界、内部纹理和背景的超图专家，利用像素级空间门控机制自适应地路由和增强特征。结合负载平衡损失和零初始化策略，FG-MoE确保了稳定的训练和精确的特征细化，而不干扰预训练主干网络的分布。在主流RGB-事件基准上的实验结果表明，Hyper-FEOD在准确性和效率之间实现了优越的平衡，超越了最先进的方法，同时保持了适合实时边缘部署的轻量级特性。

View on arXiv Download PDF AI Translation

cs.CV / 219 / 2604.11142

Naka-GS: A Bionics-inspired Dual-Branch Naka Correction and Progressive Point Pruning for Low-Light 3DGS

Naka-GS：一种仿生学启发的双分支Naka校正与渐进式点云剪枝方法用于低光照3D高斯溅射重建

Zhu, Runyu, Dong, SiXun, Zhang, Zhiqiang, Ye, Qingxia, Xu, Zhihua

Abstract

Low-light conditions severely hinder 3D restoration and reconstruction by degrading image visibility, introducing color distortions, and contaminating geometric priors for downstream optimization. We present NAKA-GS, a bionics-inspired framework for low-light 3D Gaussian Splatting that jointly improves photometric restoration and geometric initialization. Our method starts with a Naka-guided chroma-correction network, which combines physics-prior low-light enhancement, dual-branch input modeling, frequency-decoupled correction, and mask-guided optimization to suppress bright-region chromatic artifacts and edge-structure errors. The enhanced images are then fed into a feed-forward multi-view reconstruction model to produce dense scene priors. To further improve Gaussian initialization, we introduce a lightweight Point Preprocessing Module (PPM) that performs coordinate alignment, voxel pooling, and distance-adaptive progressive pruning to remove noisy and redundant points while preserving representative structures. Without introducing heavy inference overhead, NAKA-GS improves restoration quality, training stability, and optimization efficiency for low-light 3D reconstruction. The proposed method was presented in the NTIRE 3D Restoration and Reconstruction (3DRR) Challenge, and outperformed the baseline methods by a large margin. The code is available at https://github.com/RunyuZhu/Naka-GS

Chinese Translation

低光照条件严重影响三维恢复与重建，导致图像可见性降低、颜色失真以及下游优化的几何先验受污染。本文提出了NAKA-GS，一种仿生学启发的低光照三维高斯溅射（3D Gaussian Splatting）框架，联合提升光度恢复与几何初始化。该方法首先采用Naka引导的色度校正网络，结合物理先验的低光增强、双分支输入建模、频率解耦校正及掩码引导优化，以抑制亮区色度伪影和边缘结构误差。增强后的图像随后输入前馈多视角重建模型，生成稠密场景先验。为进一步提升高斯初始化效果，设计了轻量级点预处理模块（Point Preprocessing Module, PPM），通过坐标对齐、体素池化及距离自适应的渐进式剪枝，去除噪声和冗余点，同时保留代表性结构。在不引入额外推理开销的情况下，NAKA-GS提升了低光照三维重建的恢复质量、训练稳定性和优化效率。该方法在NTIRE三维恢复与重建（3DRR）挑战赛中表现优异，显著超越基线方法。代码已开源，地址：https://github.com/RunyuZhu/Naka-GS

View on arXiv Download PDF AI Translation

cs.CV / 220 / 2604.11144

Hierarchical Textual Knowledge for Enhanced Image Clustering

层次化文本知识用于增强图像聚类

Zhong, Yijie, Gao, Yunfan, Jiang, Weipeng, Wang, Haofen

Abstract

Image clustering aims to group images in an unsupervised fashion. Traditional methods focus on knowledge from visual space, making it difficult to distinguish between visually similar but semantically different classes. Recent advances in vision-language models enable the use of textual knowledge to enhance image clustering. However, most existing methods rely on coarse class labels or simple nouns, overlooking the rich conceptual and attribute-level semantics embedded in textual space. In this paper, we propose a knowledge-enhanced clustering (KEC) method that constructs a hierarchical concept-attribute structured knowledge with the help of large language models (LLMs) to guide clustering. Specifically, we first condense redundant textual labels into abstract concepts and then automatically extract discriminative attributes for each single concept and similar concept pairs, via structured prompts to LLMs. This knowledge is instantiated for each input image to achieve the knowledge-enhanced features. The knowledge-enhanced features with original visual features are adapted to various downstream clustering algorithms. We evaluate KEC on 20 diverse datasets, showing consistent improvements across existing methods using additional textual knowledge. KEC without training outperforms zero-shot CLIP on 14 out of 20 datasets. Furthermore, the naive use of textual knowledge may harm clustering performance, while KEC provides both accuracy and robustness.

Chinese Translation

图像聚类旨在以无监督的方式对图像进行分组。传统方法侧重于视觉空间的知识，难以区分视觉上相似但语义上不同的类别。近年来，视觉-语言模型（vision-language models）的进展使得利用文本知识增强图像聚类成为可能。然而，大多数现有方法依赖于粗糙的类别标签或简单名词，忽视了文本空间中丰富的概念和属性级语义。本文提出了一种知识增强聚类（KEC）方法，借助大型语言模型（LLMs）构建层次化的概念-属性结构化知识以指导聚类。具体而言，我们首先将冗余的文本标签凝练为抽象概念，然后通过对LLMs的结构化提示，自动提取每个单一概念及相似概念对的判别属性。该知识被实例化到每个输入图像中，以实现知识增强特征。将知识增强特征与原始视觉特征结合，适配多种下游聚类算法。我们在20个多样化数据集上评估KEC，展示了利用额外文本知识的现有方法的一致性能提升。KEC在无需训练的情况下，在20个数据集中有14个优于零样本CLIP。此外，文本知识的简单使用可能损害聚类性能，而KEC则兼顾了准确性和鲁棒性。

View on arXiv Download PDF AI Translation

cs.CV / 221 / 2604.11156

rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training

rPPG-VQA：一种用于无监督rPPG训练的视频质量评估框架

Dai, Tianyang, Chang, Ming, Chen, Yan, Hu, Yang

Abstract

Unsupervised remote photoplethysmography (rPPG) promises to leverage unlabeled video data, but its potential is hindered by a critical challenge: training on low-quality "in-the-wild" videos severely degrades model performance. An essential step missing here is to assess the suitability of the videos for rPPG model learning before using them for the task. Existing video quality assessment (VQA) methods are mainly designed for human perception and not directly applicable to the above purpose. In this work, we propose rPPG-VQA, a novel framework for assessing video suitability for rPPG. We integrate signal-level and scene-level analyses and design a dual-branch assessment architecture. The signal-level branch evaluates the physiological signal quality of the videos via robust signal-to-noise ratio (SNR) estimation with a multi-method consensus mechanism, and the scene-level branch uses a multimodal large language model (MLLM) to identify interferences like motion and unstable lighting. Furthermore, we propose a two-stage adaptive sampling (TAS) strategy that utilizes the quality score to curate optimal training datasets. Experiments show that by training on large-scale, "in-the-wild" videos filtered by our framework, we can develop unsupervised rPPG models that achieve a substantial improvement in accuracy on standard benchmarks. Our code is available at https://github.com/Tianyang-Dai/rPPG-VQA.

Chinese Translation

无监督远程光电容积脉搏波（rPPG）有望利用未标记的视频数据，但其潜力受到一个关键挑战的制约：在低质量的“野外”视频上训练会严重降低模型性能。这里缺少的一个重要步骤是，在将视频用于任务之前，评估其对rPPG模型学习的适用性。现有的视频质量评估（VQA）方法主要是为人类感知设计的，无法直接应用于上述目的。在本研究中，我们提出了rPPG-VQA，一个用于评估视频适合性的新框架。我们整合了信号级和场景级分析，并设计了一个双分支评估架构。信号级分支通过多方法共识机制进行稳健的信噪比（SNR）估计，评估视频的生理信号质量；场景级分支则使用多模态大型语言模型（MLLM）识别运动和不稳定照明等干扰。此外，我们提出了一种两阶段自适应采样（TAS）策略，利用质量评分来策划最佳训练数据集。实验表明，通过在我们框架过滤的大规模“野外”视频上进行训练，我们可以开发出无监督的rPPG模型，在标准基准测试中实现显著的准确性提升。我们的代码可在 https://github.com/Tianyang-Dai/rPPG-VQA 获取。

View on arXiv Download PDF AI Translation

cs.CV / 222 / 2604.11162

Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks

Boxes2Pixels：从噪声SAM掩码中学习缺陷分割

Lendering, Camile, Akdag, Erkut, Bondarev, Egor

Abstract

Accurate defect segmentation is critical for industrial inspection, yet dense pixel-level annotations are rarely available. A common workaround is to convert inexpensive bounding boxes into pseudo-masks using foundation segmentation models such as the Segment Anything Model (SAM). However, these pseudo-labels are systematically noisy on industrial surfaces, often hallucinating background structure while missing sparse defects. To address this limitation, a noise-robust box-to-pixel distillation framework, Boxes2Pixels, is proposed that treats SAM as a noisy teacher rather than a source of ground-truth supervision. Bounding boxes are converted into pseudo-masks offline by SAM, and a compact student is trained with (i) a hierarchical decoder over frozen DINOv2 features for semantic stability, (ii) an auxiliary binary localization head to decouple sparse foreground discovery from class prediction, and (iii) a one-sided online self-correction mechanism that relaxes background supervision when the student is confident, targeting teacher false negatives. On a manually annotated wind turbine inspection benchmark, the proposed Boxes2Pixels improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Moreover, online self-correction increases the binary recall by +18.56, while the model employs 80\% fewer trainable parameters. Code is available at https://github.com/CLendering/Boxes2Pixels.

Chinese Translation

准确的缺陷分割对于工业检测至关重要，但密集的像素级标注却很少可用。一个常见的解决方案是利用基础分割模型（如Segment Anything Model，SAM）将廉价的边界框转换为伪掩码。然而，这些伪标签在工业表面上系统性地存在噪声，常常虚构背景结构，同时遗漏稀疏缺陷。为了解决这一局限性，提出了一种抗噪声的框到像素蒸馏框架Boxes2Pixels，该框架将SAM视为一个噪声教师，而非真实监督的来源。边界框由SAM离线转换为伪掩码，并且训练了一个紧凑的学生模型，该模型具有（i）在冻结的DINOv2特征上使用的分层解码器，以确保语义稳定性，（ii）一个辅助的二元定位头，以将稀疏前景发现与类别预测解耦，以及（iii）一个单侧在线自我修正机制，当学生模型有信心时放宽背景监督，针对教师的假阴性。在一个手动标注的风力涡轮机检测基准上，所提出的Boxes2Pixels在异常mIoU上提高了+6.97，在二元IoU上提高了+9.71，相较于在相同弱监督下训练的最强基线。此外，在线自我修正使得二元召回率提高了+18.56，同时该模型使用的可训练参数减少了80%。代码可在https://github.com/CLendering/Boxes2Pixels获取。

View on arXiv Download PDF AI Translation

cs.CV / 223 / 2604.11164

RADA: Region-Aware Dual-encoder Auxiliary learning for Barely-supervised Medical Image Segmentation

RADA：面向弱监督医学图像分割的区域感知双编码器辅助学习方法

Zeng, Shuang, Xie, Boxu, Zhu, Lei, Zhang, Xinliang, Hu, Jiakui, Yao, Zhengjian, Li, Yuanwei, Lu, Yuxing, Lu, Yanye

Abstract

Deep learning has greatly advanced medical image segmentation, but its success relies heavily on fully supervised learning, which requires dense annotations that are costly and time-consuming for 3D volumetric scans. Barely-supervised learning reduces annotation burden by using only a few labeled slices per volume. Existing methods typically propagate sparse annotations to unlabeled slices through geometric continuity to generate pseudo-labels, but this strategy lacks semantic understanding, often resulting in low-quality pseudo-labels. Furthermore, medical image segmentation is inherently a pixel-level visual understanding task, where accuracy fundamentally depends on the quality of local, fine-grained visual features. Inspired by this, we propose RADA, a novel Region-Aware Dual-encoder Auxiliary learning pipeline which introduces a dual-encoder framework pre-trained on Alpha-CLIP to extract fine-grained, region-specific visual features from the original images and limited annotations. The framework combines image-level fine-grained visual features with text-level semantic guidance, providing region-aware semantic supervision that bridges image-level semantics and pixel-level segmentation. Integrated into a triple-view training framework, RADA achieves SOTA performance under extremely sparse annotation settings on LA2018, KiTS19 and LiTS, demonstrating robust generalization across diverse datasets.

Chinese Translation

深度学习极大推动了医学图像分割的发展，但其成功高度依赖于全监督学习，而全监督学习需要密集标注，对于三维体积扫描而言成本高且耗时。弱监督学习通过仅使用每个体积中少量标注切片，降低了标注负担。现有方法通常通过几何连续性将稀疏标注传播至未标注切片以生成伪标签，但该策略缺乏语义理解，常导致伪标签质量较低。此外，医学图像分割本质上是一项像素级的视觉理解任务，其准确性根本上依赖于局部细粒度视觉特征的质量。受此启发，我们提出了RADA，一种新颖的区域感知双编码器辅助学习框架，该框架引入了基于Alpha-CLIP预训练的双编码器，用于从原始图像和有限标注中提取细粒度、区域特定的视觉特征。该框架结合了图像级细粒度视觉特征与文本级语义引导，提供区域感知的语义监督，桥接了图像级语义与像素级分割。RADA集成于三视图训练框架中，在极度稀疏标注条件下，于LA2018、KiTS19和LiTS数据集上实现了最先进的性能，展示了在多样化数据集上的强泛化能力。

View on arXiv Download PDF AI Translation

cs.CV / 224 / 2604.11170

Do Instance Priors Help Weakly Supervised Semantic Segmentation?

实例先验是否有助于弱监督语义分割？

Das, Anurag, Kukleva, Anna, Hu, Xinting, Asano, Yuki M., Schiele, Bernt

Abstract

Semantic segmentation requires dense pixel-level annotations, which are costly and time-consuming to acquire. To address this, we present SeSAM, a framework that uses a foundational segmentation model, i.e. Segment Anything Model (SAM), with weak labels, including coarse masks, scribbles, and points. SAM, originally designed for instance-based segmentation, cannot be directly used for semantic segmentation tasks. In this work, we identify specific challenges faced by SAM and determine appropriate components to adapt it for class-based segmentation using weak labels. Specifically, SeSAM decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM masks using weak-label coverage, and iteratively refines labels using pseudo-labels, enabling SAM-generated masks to be effectively used for semantic segmentation. Integrated with a semi-supervised learning framework, SeSAM balances ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels, significantly improving segmentation quality. Extensive experiments across multiple benchmarks and weak annotation types show that SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.

Chinese Translation

语义分割需要密集的像素级标注，这类标注成本高且耗时。为了解决这一问题，我们提出了SeSAM框架，该框架结合了基础分割模型——即Segment Anything Model（SAM）——与弱标签，包括粗略掩码、涂鸦和点标注。SAM最初设计用于基于实例的分割，无法直接应用于语义分割任务。在本工作中，我们识别了SAM面临的具体挑战，并确定了适用于利用弱标签进行基于类别分割的关键组件。具体而言，SeSAM将类别掩码分解为连通组件，沿物体骨架采样点提示，利用弱标签覆盖度选择SAM掩码，并通过伪标签迭代细化标注，从而使SAM生成的掩码能够有效用于语义分割。结合半监督学习框架，SeSAM在真实标签、基于SAM的伪标签和高置信度伪标签之间实现平衡，显著提升了分割质量。在多个基准和不同类型弱标注上的大量实验表明，SeSAM在显著降低标注成本的同时，持续优于弱监督基线方法。

View on arXiv Download PDF AI Translation

cs.CV / 225 / 2604.11171

Development and evaluation of CADe systems in low-prevalence setting: The RARE25 challenge for early detection of Barrett's neoplasia

低患病率环境下CADe系统的开发与评估：Barrett食管早期肿瘤病变检测的RARE25挑战赛

Jaspers, Tim J. M., Caetano, Francisco, Claessens, Cris H. B., Kusters, Carolus H. J., van Heslinga, Rixta A. H. van Eijck, Slooter, Floor, Bergman, Jacques J., De With, Peter H. N., Jong, Martijn R., de Groof, Albert J., van der Sommen, Fons

Abstract

Computer-aided detection (CADe) of early neoplasia in Barrett's esophagus is a low-prevalence surveillance problem in which clinically relevant findings are rare. Although many CADe systems report strong performance on balanced or enriched datasets, their behavior under realistic prevalence remains insufficiently characterized. The RARE25 challenge addresses this gap by introducing a large-scale, prevalence-aware benchmark for neoplasia detection. It includes a public training set and a hidden test set reflecting real-world incidence. Methods were evaluated using operating-point-specific metrics emphasizing high sensitivity and accounting for prevalence. Eleven teams from seven countries submitted approaches using diverse architectures, pretraining, ensembling, and calibration strategies. While several methods achieved strong discriminative performance, positive predictive values remained low, highlighting the difficulty of low-prevalence detection and the risk of overestimating clinical utility when prevalence is ignored. All methods relied on fully supervised classification despite the dominance of normal findings, indicating a lack of prevalence-agnostic approaches such as anomaly detection or one-class learning. By releasing a public dataset and a reproducible evaluation framework, RARE25 aims to support the development of CADe systems robust to prevalence shift and suitable for clinical surveillance workflows.

Chinese Translation

Barrett食管早期肿瘤病变的计算机辅助检测（CADe）是一种低患病率的监测问题，其中临床相关发现极为罕见。尽管许多CADe系统在平衡或富集的数据集上表现出较强的性能，但其在现实患病率条件下的表现尚未得到充分评估。RARE25挑战赛通过引入一个大规模、考虑患病率的肿瘤病变检测基准，填补了这一空白。该挑战赛包含一个公开的训练集和一个反映真实发病率的隐藏测试集。评估方法采用针对特定操作点的指标，强调高敏感性并考虑患病率因素。来自七个国家的十一支团队提交了采用多样架构、预训练、集成和校准策略的方法。尽管若干方法实现了较强的判别性能，但阳性预测值仍然较低，凸显了低患病率检测的难度以及忽视患病率时可能高估临床效用的风险。所有方法均依赖于完全监督的分类，尽管正常发现占主导地位，表明缺乏如异常检测或单类学习等与患病率无关的方法。通过发布公开数据集和可复现的评估框架，RARE25旨在支持开发对患病率变化具有鲁棒性且适用于临床监测流程的CADe系统。

View on arXiv Download PDF AI Translation

cs.CV / 226 / 2604.11176

Precision Synthesis of Multi-Tracer PET via VLM-Modulated Rectified Flow for Stratifying Mild Cognitive Impairment

基于VLM调制整流流的多示踪剂PET精准合成用于轻度认知障碍分层

Liu, Tuo, Lin, Shuijin, Yan, Shaozhen, Wang, Haifeng, Lu, Jie, Ma, Jianhua, Lian, Chunfeng

Abstract

The biological definition of Alzheimer's disease (AD) relies on multi-modal neuroimaging, yet the clinical utility of positron emission tomography (PET) is limited by cost and radiation exposure, hindering early screening at preclinical or prodromal stages. While generative models offer a promising alternative by synthesizing PET from magnetic resonance imaging (MRI), achieving subject-specific precision remains a primary challenge. Here, we introduce DIReCT$++$, a Domain-Informed ReCTified flow model for synthesizing multi-tracer PET from MRI combined with fundamental clinical information. Our approach integrates a 3D rectified flow architecture to capture complex cross-modal and cross-tracer relationships with a domain-adapted vision-language model (BiomedCLIP) that provides text-guided, personalized generation using clinical scores and imaging knowledge. Extensive evaluations on multi-center datasets demonstrate that DIReCT$++$ not only produces synthetic PET images ($^{18}$F-AV-45 and $^{18}$F-FDG) of superior fidelity and generalizability but also accurately recapitulates disease-specific patterns. Crucially, combining these synthesized PET images with MRI enables precise personalized stratification of mild cognitive impairment (MCI), advancing a scalable, data-efficient tool for the early diagnosis and prognostic prediction of AD. The source code will be released on https://github.com/ladderlab-xjtu/DIReCT-PLUS.

Chinese Translation

阿尔茨海默病（AD）的生物学定义依赖于多模态神经影像学，然而正电子发射断层扫描（PET）因成本高昂及辐射暴露限制了其临床应用，阻碍了在临床前或前驱期的早期筛查。尽管生成模型通过从磁共振成像（MRI）合成PET提供了有前景的替代方案，但实现个体化精确合成仍是主要挑战。本文提出DIReCT++，一种基于领域知识的整流流（Domain-Informed ReCTified flow）模型，结合基础临床信息从MRI合成多示踪剂PET。该方法整合了三维整流流架构以捕捉复杂的跨模态及跨示踪剂关系，并结合领域适应的视觉语言模型BiomedCLIP，利用临床评分和影像知识实现文本引导的个性化生成。在多中心数据集上的广泛评估表明，DIReCT++不仅生成了高保真度和良好泛化性的合成PET图像（包括18F-AV-45和18F-FDG），还准确重现了疾病特异性模式。更重要的是，将这些合成PET图像与MRI结合，能够实现轻度认知障碍（MCI）的精准个体化分层，推动了一种可扩展且数据高效的AD早期诊断及预后预测工具。源码将发布于https://github.com/ladderlab-xjtu/DIReCT-PLUS。

View on arXiv Download PDF AI Translation

cs.CV / 227 / 2604.11177

Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

思维流重要吗？评估双子星视觉-语言模型在视频场景理解中的推理能力

Sharma, Shivam, Nagaonkar, Sankalp, Choithani, Ashish, Trivedi, Ashutosh

Abstract

We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.

Chinese Translation

我们基准测试内部推理痕迹（我们称之为思维流）如何影响视觉-语言模型的视频场景理解。通过使用谷歌的 Gemini 2.5 Flash 和 Flash Lite 的四种配置，基于从 100 小时视频中提取的场景，我们提出了三个问题：更多的思考是否会导致更好的输出，收益何时停止，这些模型实际上在思考什么？我们引入了三种评估指标。内容性（Contentfulness）衡量思维流中有多少是有用的场景内容与元评论的比例。思维最终覆盖率（Thought-Final Coverage）衡量思维流如何忠实地转化为最终输出。主导实体分析（Dominant Entity Analysis）识别模型关注的主题、动作和设置。GPT-5 作为独立评判者。我们发现，额外思考带来的质量提升迅速达到饱和，大部分改进发生在前几百个标记中。Flash Lite 在质量和标记使用之间提供了最佳平衡。紧凑的推理预算导致模型在最终输出中添加其从未推理过的内容，这是一种压缩步骤幻觉（compression-step hallucination）。尽管是不同的模型层次，Flash 和 Flash Lite 产生了相似的思维流，但在风格上有所不同：Flash 讨论其推理过程，而 Lite 则专注于描述场景。

View on arXiv Download PDF AI Translation

cs.CV / 228 / 2604.11195

Towards Adaptive Open-Set Object Detection via Category-Level Collaboration Knowledge Mining

基于类别级协作知识挖掘的自适应开放集目标检测方法

Ji, Yuqi, Ke, Junjie, He, Lihuo, Wang, Lizhi, Gao, Xinbo

Abstract

Existing object detectors often struggle to generalize across domains while adapting to emerging novel categories. Adaptive open-set object detection (AOOD) addresses this challenge by training on base categories in the source domain and adapting to both base and novel categories in the target domain without target annotations. However, current AOOD methods remain limited by weak cross-domain representations, ambiguity among novel categories, and source-domain feature bias. To address these issues, we propose a category-level collaboration knowledge mining strategy that exploits both inter-class and intra-class relationships across domains. Specifically, we construct a clustering-based memory bank to encode class prototypes, auxiliary features, and intra-class disparity information, and iteratively update it via unsupervised clustering to enhance category-level knowledge representation. We further design a base-to-novel selection metric to discover source-domain features related to novel categories and use them to initialize novel-category classifiers. In addition, an adaptive feature assignment strategy transfers the learned category-level knowledge to the target domain and asynchronously updates the memory bank to alleviate source-domain bias. Extensive experiments on multiple benchmarks show that our method consistently surpasses state-of-the-art AOOD methods by 1.1-5.5 mAP.

Chinese Translation

现有的目标检测器在跨域泛化及适应新兴类别方面常面临挑战。自适应开放集目标检测（Adaptive Open-Set Object Detection, AOOD）通过在源域的基础类别上训练，并在目标域中无目标注释的情况下适应基础类别和新类别，旨在解决该问题。然而，当前的AOOD方法仍受限于跨域表征能力弱、新类别间的模糊性以及源域特征偏差。为此，我们提出了一种类别级协作知识挖掘策略，充分利用跨域的类间和类内关系。具体而言，我们构建了基于聚类的记忆库，用于编码类别原型、辅助特征及类内差异信息，并通过无监督聚类迭代更新，以增强类别级知识表征。我们进一步设计了基础到新类别的选择度量，用于发现与新类别相关的源域特征，并以此初始化新类别分类器。此外，一种自适应特征分配策略将学习到的类别级知识迁移至目标域，并异步更新记忆库以缓解源域偏差。在多个基准数据集上的大量实验表明，我们的方法在mAP指标上持续超越现有最先进的AOOD方法1.1至5.5个百分点。

View on arXiv Download PDF AI Translation

cs.CV / 229 / 2604.11197

MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration

MedP-CLIP：具备区域感知提示集成的医学CLIP模型

Peng, Jiahui, Yao, He, Li, Jingwen, Su, Yanzhou, Ju, Sibo, Lu, Yujie, Ye, Jin, Lu, Hongchun, Li, Xue, Jiang, Lincheng, Zhu, Min, Cheng, Junlong

Abstract

Contrastive Language-Image Pre-training (CLIP) has demonstrated outstanding performance in global image understanding and zero-shot transfer through large-scale text-image alignment. However, the core of medical image analysis often lies in the fine-grained understanding of specific anatomical structures or lesion regions. Therefore, precisely comprehending region-of-interest (RoI) information provided by medical professionals or perception models becomes crucial. To address this need, we propose MedP-CLIP, a region-aware medical vision-language model (VLM). MedP-CLIP innovatively integrates medical prior knowledge and designs a feature-level region prompt integration mechanism, enabling it to flexibly respond to various prompt forms (e.g., points, bounding boxes, masks) while maintaining global contextual awareness when focusing on local regions. We pre-train the model on a meticulously constructed large-scale dataset (containing over 6.4 million medical images and 97.3 million region-level annotations), equipping it with cross-disease and cross-modality fine-grained spatial semantic understanding capabilities. Experiments demonstrate that MedP-CLIP significantly outperforms baseline methods in various medical tasks, including zero-shot recognition, interactive segmentation, and empowering multimodal large language models. This model provides a scalable, plug-and-play visual backbone for medical AI, combining holistic image understanding with precise regional analysis.

Chinese Translation

对比语言-图像预训练（Contrastive Language-Image Pre-training，CLIP）通过大规模文本与图像的对齐，在全局图像理解和零样本迁移方面表现出色。然而，医学图像分析的核心往往在于对特定解剖结构或病变区域的细粒度理解。因此，准确理解由医学专家或感知模型提供的感兴趣区域（Region-of-Interest，RoI）信息变得尤为关键。为满足这一需求，我们提出了MedP-CLIP，一种具备区域感知能力的医学视觉语言模型（Vision-Language Model，VLM）。MedP-CLIP创新性地融合了医学先验知识，并设计了特征级别的区域提示集成机制，使其能够灵活响应多种提示形式（如点、边界框、掩码），同时在聚焦局部区域时保持全局上下文感知。我们在精心构建的大规模数据集（包含超过640万张医学图像及9730万条区域级注释）上对模型进行了预训练，赋予其跨疾病和跨模态的细粒度空间语义理解能力。实验结果表明，MedP-CLIP在零样本识别、交互式分割及赋能多模态大型语言模型等多项医学任务中显著优于基线方法。该模型为医学人工智能提供了一个可扩展、即插即用的视觉骨干，融合了整体图像理解与精准区域分析能力。

View on arXiv Download PDF AI Translation

cs.CV / 230 / 2604.11207

LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment: Methods and Results

LoViF 2026挑战赛：面向人类的语义图像质量评估方法与结果

Li, Xin, Xu, Daoli, Luo, Wei, Xiang, Guoqiang, Li, Haoran, Zhuang, Chengyu, Chen, Zhibo, Guan, Jian, Li, Weping, Zhang, Weixia, Sun, Wei, Wang, Zhihua, Zhu, Dandan, Zhu, Chengguang, Gupta, Ayush, Agarwal, Rachit, Das, Shouvik, Das, Biplab Ch, Ghosh, Amartya, Fan, Kanglong, Wen, Wen, Zhai, Shuyan, Zhi, Tianwu, Zhang, Aoxiang, Liu, Jianzhao, Zhang, Yabin, Wang, Jiajun, Sun, Yipeng, Lian, Kaiwei, Yin, Banghao

Abstract

This paper reviews the LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment. This challenge aims to raise a new direction, i.e., how to evaluate the loss of semantic information from the human perspective, intending to promote the development of some new directions, like semantic coding, processing, and semantic-oriented optimization, etc. Unlike existing datasets of quality assessment, we form a dataset of human-oriented semantic quality assessment, termed the SeIQA dataset. This dataset is divided into three parts for this competition: (i) training data: 510 pairs of degraded images and their corresponding ground truth references; (ii) validation data: 80 pairs of degraded images and their corresponding ground-truth references; (iii) testing data: 160 pairs of degraded images and their corresponding ground-truth references. The primary objective of this challenge is to establish a new and powerful benchmark for human-oriented semantic image quality assessment. There are a total of 58 teams registered in this competition, and 6 teams submitted valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the SeIQA dataset.

Chinese Translation

本文回顾了LoViF 2026挑战赛，主题为面向人类的语义图像质量评估。该挑战旨在提出一个新的方向，即如何从人类的角度评估语义信息的损失，旨在促进一些新方向的发展，如语义编码、处理和面向语义的优化等。与现有的质量评估数据集不同，我们形成了一个面向人类的语义质量评估数据集，称为SeIQA数据集。该数据集为本次比赛分为三个部分：(i) 训练数据：510对退化图像及其对应的真实参考；(ii) 验证数据：80对退化图像及其对应的真实参考；(iii) 测试数据：160对退化图像及其对应的真实参考。本次挑战的主要目标是建立一个新的强大基准，用于面向人类的语义图像质量评估。共有58支团队注册参加此次比赛，其中6支团队提交了有效的解决方案和最终测试阶段的事实表。这些提交在SeIQA数据集上达到了最先进的（SOTA）性能。

View on arXiv Download PDF AI Translation

cs.CV / 231 / 2604.11211

3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

3DTV：一种用于实时视图合成的前馈插值网络

Schulz, Stefan, Edelstein, Fernando, Dröge, Hannah, Hullin, Matthias B., Plack, Markus

Abstract

Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications. We address this challenge by combining lightweight geometry with learning and propose 3DTV, a feedforward network for real-time sparse-view interpolation. A Delaunay-based triplet selection ensures angular coverage for each target view. Building on this, we introduce a pose-aware depth module that estimates a coarse-to-fine depth pyramid, enabling efficient feature reprojection and occlusion-aware blending. Unlike methods that require scene-specific optimization, 3DTV runs feedforward without retraining, making it practical for AR/VR, telepresence, and interactive applications. Our experiments on challenging multi-view video datasets demonstrate that 3DTV consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines. Crucially, 3DTV avoids explicit proxies, enabling robust rendering across diverse scenes. This makes it a practical solution for low-latency multi-view streaming and interactive rendering. Project Page: https://stefanmschulz.github.io/3DTV_webpage/

Chinese Translation

实时自由视点渲染需要在多摄像头冗余与交互应用的延迟限制之间取得平衡。我们通过结合轻量级几何与学习来应对这一挑战，并提出了3DTV，一种用于实时稀疏视图插值的前馈网络。基于Delaunay的三元组选择确保了每个目标视图的角度覆盖。在此基础上，我们引入了一个姿态感知深度模块，该模块估计粗到细的深度金字塔，从而实现高效的特征重投影和考虑遮挡的混合。与需要场景特定优化的方法不同，3DTV以前馈方式运行，无需重新训练，使其在增强现实（AR）/虚拟现实（VR）、远程呈现和交互应用中具有实用性。我们在具有挑战性的多视角视频数据集上的实验表明，3DTV始终在质量和效率之间取得良好的平衡，超越了近期的实时新视图基线。重要的是，3DTV避免了显式代理，使其能够在多样化场景中实现稳健渲染。这使其成为低延迟多视角流媒体和交互渲染的实用解决方案。项目页面：https://stefanmschulz.github.io/3DTV_webpage/

View on arXiv Download PDF AI Translation

cs.CV / 232 / 2604.11218

H-SPAM: Hierarchical Superpixel Anything Model

H-SPAM：层次化超像素任意模型

Walther, Julien, Giraud, Rémi, Clément, Michaël

Abstract

Superpixels offer a compact image representation by grouping pixels into coherent regions. Recent methods have reached a plateau in terms of segmentation accuracy by generating noisy superpixel shapes. Moreover, most existing approaches produce a single fixed-scale partition that limits their use in vision pipelines that would benefit multi-scale representations. In this work, we introduce H-SPAM (Hierarchical Superpixel Anything Model), a unified framework for generating accurate, regular, and perfectly nested hierarchical superpixels. Starting from a fine partition, guided by deep features and external object priors, H-SPAM constructs the hierarchy through a two-phase region merging process that first preserves object consistency and then allows controlled inter-object grouping. The hierarchy can also be modulated using visual attention maps or user input to preserve important regions longer in the hierarchy. Experiments on standard benchmarks show that H-SPAM strongly outperforms existing hierarchical methods in both accuracy and regularity, while performing on par with most recent state-of-the-art non-hierarchical methods. Code and pretrained models are available: https://github.com/waldo-j/hspam.

Chinese Translation

超像素通过将像素聚集成连贯的区域，提供了一种紧凑的图像表示。近期的方法在分割精度方面已达到瓶颈，主要是由于生成了噪声超像素形状。此外，大多数现有方法生成的是单一固定尺度的分区，这限制了它们在视觉管道中的应用，而这些管道本可以受益于多尺度表示。在本研究中，我们引入了H-SPAM（层次化超像素任意模型），这是一个统一框架，用于生成准确、规则且完美嵌套的层次化超像素。从细粒度分区开始，H-SPAM在深度特征和外部物体先验的指导下，通过一个两阶段的区域合并过程构建层次结构，该过程首先保持物体一致性，然后允许受控的物体间分组。该层次结构还可以使用视觉注意力图或用户输入进行调节，以在层次中更长时间地保留重要区域。在标准基准上的实验表明，H-SPAM在准确性和规则性方面显著优于现有的层次化方法，同时在性能上与大多数最新的非层次化方法相当。代码和预训练模型可在以下链接获取：https://github.com/waldo-j/hspam。

View on arXiv Download PDF AI Translation

cs.CV / 233 / 2604.11225

Sign Language Recognition in the Age of LLMs

大语言模型时代的手语识别研究

Javorek, Vaclav, Honzik, Jakub, Gruber, Ivan, Zelezny, Tomas, Hruz, Marek

Abstract

Recent Vision Language Models (VLMs) have demonstrated strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also address specialized visual recognition problems such as isolated sign language recognition (ISLR) without task-specific training. In this work, we investigate the capability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. Our experiments show that, under prompt-only zero-shot inference, current open-source VLMs remain far behind classic supervised ISLR classifiers by a wide margin. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All our code is publicly available on GitHub.

Chinese Translation

近年来，视觉语言模型（Vision Language Models, VLMs）在多模态推理任务中展现出强大的性能。这引发了一个问题：这类通用模型是否也能在无需特定任务训练的情况下，解决诸如孤立手语识别（Isolated Sign Language Recognition, ISLR）等专业视觉识别问题。在本研究中，我们探讨了现代VLMs在零样本（zero-shot）设置下执行ISLR的能力。我们在WLASL300基准数据集上评估了多个开源及专有VLM模型。实验结果表明，在仅通过提示（prompt-only）进行零样本推断时，当前开源VLMs的表现仍远远落后于传统的监督式ISLR分类器。然而，后续实验揭示这些模型能够部分捕捉手语与文本描述之间的视觉-语义对齐。规模更大的专有模型则显著提升了准确率，凸显了模型规模和训练数据多样性的重要性。我们所有的代码均已在GitHub公开。

View on arXiv Download PDF AI Translation

cs.CV / 234 / 2604.11230

NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: AI Flash Portrait (Track 3)

NTIRE 2026 第三届任何图像恢复模型（RAIM）挑战赛：AI闪光人像（赛道3）

Guan, Ya-nan, Zhang, Shaonan, Guo, Hang, Wang, Yawen, Fan, Xinying, Zhuang, Tianqu, Liang, Jie, Zeng, Hui, Qin, Guanyi, Qu, Lishen, Dai, Tao, Xia, Shu-Tao, Zhang, Lei, Timofte, Radu, Chen, Bin, Zhou, Yuanbo, Wang, Hongwei, Gao, Qinquan, Tong, Tong, Qian, Yanxin, You, Lizhao, Cong, Jingru, Xiong, Lei, Zhu, Shuyuan, Zhong, Zhi-Qiang, Lv, Kan, Yang, Yang, Tang, Kailing, Zhang, Minjian, Lei, Zhipei, Xu, Zhe, Zhang, Liwen, Gou, Dingyong, Wu, Yanlin, Li, Cong, Cui, Xiaohui, Liu, Jiajia, Xu, Guoyi, Jiang, Yaoxin, Shi, Yaokun, Tu, Jiachen, Wang, Liqing, Li, Shihang, Zhang, Bo, Wang, Biao, Xu, Haiming, Long, Xiang, Liao, Xurui, Zhai, Yanqiao, Li, Haozhe, Shi, Shijun, Zhang, Jiangning, Liu, Yong, Hu, Kai, Xu, Jing, Zeng, Xianfang, Liu, Yuyang, Wei, Minchen

Abstract

In this paper, we present a comprehensive overview of the NTIRE 2026 3rd Restore Any Image Model (RAIM) challenge, with a specific focus on Track 3: AI Flash Portrait. Despite significant advancements in deep learning for image restoration, existing models still encounter substantial challenges in real-world low-light portrait scenarios. Specifically, they struggle to achieve an optimal balance among noise suppression, detail preservation, and faithful illumination and color reproduction. To bridge this gap, this challenge aims to establish a novel benchmark for real-world low-light portrait restoration. We comprehensively evaluate the proposed algorithms utilizing a hybrid evaluation system that integrates objective quantitative metrics with rigorous subjective assessment protocols. For this competition, we provide a dataset containing 800 groups of real-captured low-light portrait data. Each group consists of a 1K-resolution low-light input image, a 1K ground truth (GT), and a 1K person mask. This challenge has garnered widespread attention from both academia and industry, attracting over 100 participating teams and receiving more than 3,000 valid submissions. This report details the motivation behind the challenge, the dataset construction process, the evaluation metrics, and the various phases of the competition. The released dataset and baseline code for this track are publicly available from the same \href{https://github.com/zsn1434/AI_Flash-BaseLine/tree/main}{GitHub repository}, and the official challenge webpage is hosted on \href{https://www.codabench.org/competitions/12885/}{CodaBench}.

Chinese Translation

本文全面概述了NTIRE 2026第三届任何图像恢复模型（RAIM）挑战赛，特别关注赛道3：AI闪光人像。尽管深度学习在图像恢复方面取得了显著进展，但现有模型在现实世界低光照人像场景中仍面临重大挑战。具体而言，它们在噪声抑制、细节保留以及真实照明和色彩再现之间难以实现最佳平衡。为了解决这一问题，本挑战旨在建立一个新的现实世界低光照人像恢复基准。我们利用一个混合评估系统全面评估所提出的算法，该系统结合了客观定量指标和严格的主观评估协议。为了本次比赛，我们提供了一个包含800组真实捕获的低光照人像数据集。每组数据包括一张1K分辨率的低光照输入图像、一张1K的真实图（GT）和一张1K的人物掩码。此次挑战引起了学术界和工业界的广泛关注，吸引了超过100个参赛团队，并收到了3000多份有效提交。本文详细介绍了挑战的动机、数据集构建过程、评估指标以及比赛的各个阶段。该赛道发布的数据集和基线代码可从同一[GitHub仓库](https://github.com/zsn1434/AI_Flash-BaseLine/tree/main)公开获取，官方挑战网页托管在[CodaBench](https://www.codabench.org/competitions/12885/)上。

View on arXiv Download PDF AI Translation

cs.CV / 235 / 2604.11231

Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection

Seg2Change：将开放词汇语义分割模型适配于遥感变化检测

Su, You, Song, Yonghong, Chen, Jingqi, Wen, Zehan

Abstract

Change detection is a fundamental task in remote sensing, aiming to quantify the impacts of human activities and ecological dynamics on land-cover changes. Existing change detection methods are limited to predefined classes in training datasets, which constrains their scalability in real-world scenarios. In recent years, numerous advanced open-vocabulary semantic segmentation models have emerged for remote sensing imagery. However, there is still a lack of an effective framework for directly applying these models to open-vocabulary change detection (OVCD), a novel task that integrates vision and language to detect changes across arbitrary categories. To address these challenges, we first construct a category-agnostic change detection dataset, termed CA-CDD. Further, we design a category-agnostic change head to detect the transitions of arbitrary categories and index them to specific classes. Based on them, we propose Seg2Change, an adapter designed to adapt open-vocabulary semantic segmentation models to change detection task. Without bells and whistles, this simple yet effective framework achieves state-of-the-art OVCD performance (+9.52 IoU on WHU-CD and +5.50 mIoU on SECOND). Our code is released at https://github.com/yogurts-sy/Seg2Change.

Chinese Translation

变化检测是遥感领域的一项基础任务，旨在量化人类活动和生态动态对土地覆盖变化的影响。现有的变化检测方法局限于训练数据集中预定义的类别，这限制了其在实际场景中的可扩展性。近年来，针对遥感影像出现了大量先进的开放词汇语义分割模型。然而，仍缺乏一个有效的框架，能够将这些模型直接应用于开放词汇变化检测（OVCD）这一新兴任务，该任务融合视觉与语言以检测任意类别的变化。为解决这些挑战，我们首先构建了一个类别无关的变化检测数据集，称为CA-CDD。进一步地，我们设计了一个类别无关的变化检测头，用于检测任意类别的变化并将其索引到具体类别。基于此，我们提出了Seg2Change，一种用于将开放词汇语义分割模型适配到变化检测任务的适配器。该简单而高效的框架无需复杂设计，即实现了最先进的OVCD性能（在WHU-CD数据集上IoU提升9.52，在SECOND数据集上mIoU提升5.50）。我们的代码已开源，地址为https://github.com/yogurts-sy/Seg2Change。

View on arXiv Download PDF AI Translation

cs.CV / 236 / 2604.11234

Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection

弥合RGB-IR差距：基于共识与差异建模的文本引导多光谱检测

Wu, Jiaqi, Wang, Zhen, Huang, Enhao, Shen, Kangqing, Wang, Yulin, Yue, Yang, Pu, Yifan, Huang, Gao

Abstract

Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion. We further formulate RGB-IR interaction evidence into the regular consensus support and the complementary discrepancy support that contains potentially discriminative cues, and introduce them into fusion via dynamic recalibration as a structured inductive bias. In addition, we design a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement. Extensive experiments demonstrate the effectiveness of the proposed fusion framework and its superior detection performance on multispectral benchmarks. Code is available at https://github.com/zhenwang5372/Bridging-RGB-IR-Gap.

Chinese Translation

文本引导的多光谱目标检测利用文本语义指导RGB与IR之间的语义感知跨模态交互，以实现更稳健的感知。然而，仍然存在显著的局限性：（1）现有方法通常仅将文本作为辅助的语义增强信号，而未能充分利用其指导作用来弥合RGB与IR之间固有的粒度不对称；（2）传统的数据驱动的基于注意力的融合往往强调稳定的共识，而忽视了潜在有价值的跨模态差异。为了解决这些问题，我们提出了一种具有双重支持建模的语义桥接融合框架，用于多光谱目标检测。具体而言，文本被用作共享的语义桥，以在统一类别条件下对齐RGB和IR响应，同时重新校准的热语义先验被投影到RGB分支上，以实现语义级映射融合。我们进一步将RGB-IR交互证据形式化为常规共识支持和包含潜在区分线索的互补差异支持，并通过动态重新校准将其引入融合，作为一种结构化的归纳偏置。此外，我们设计了一个双向语义对齐模块，以增强闭环视觉-文本引导。大量实验表明，所提出的融合框架的有效性及其在多光谱基准上的优越检测性能。代码可在 https://github.com/zhenwang5372/Bridging-RGB-IR-Gap 获取。

View on arXiv Download PDF AI Translation

cs.CV / 237 / 2604.11240

Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models

用于任务感知的解耦相似性在大型视觉语言模型中的令牌剪枝

Ma, Kexin, Xiao, Jing, Chen, Chaofeng, Min, Geyong, Zhu, Guibo, Wang, Jinqiao, Liao, Liang

Abstract

Token pruning has emerged as an effective approach to reduce the substantial computational overhead of Large Vision-Language Models (LVLMs) by discarding less informative visual tokens while preserving performance. However, existing methods typically rely on individual attention sources from different LVLM components, resulting in incomplete and suboptimal pruning decisions due to biased attention distributions. To address this problem, we propose DeSAP, a novel Decoupled Similarity-Aware Pruning method for precise, task-aware token pruning within the visual encoder. Specifically, DeSAP introduces a decoupled similarity to capture fine-grained cross-modal relevance between visual features and text tokens, providing explicit task-related guidance for pruning. By integrating decoupled similarity with visual saliency signals derived from visual attention, DeSAP performs token pruning under the guidance of both task-related and visual cues, enabling robust pruning even under aggressive pruning ratios. Extensive experiments across diverse benchmarks and architectures show that DeSAP consistently outperforms SOTA methods in both accuracy and efficiency. On LLaVA-1.5-7B, DeSAP achieves a 10 times FLOPs reduction and a 2.3 times prefill speedup by retaining only 11.1% of visual tokens, while maintaining 98.1% of the original performance.

Chinese Translation

令牌剪枝已成为减少大型视觉语言模型（LVLMs）显著计算开销的有效方法，通过丢弃信息量较少的视觉令牌来保持性能。然而，现有方法通常依赖于来自不同LVLM组件的单独注意力源，导致由于偏置的注意力分布而产生不完整和次优的剪枝决策。为了解决这个问题，我们提出了DeSAP，一种新颖的解耦相似性感知剪枝方法，用于在视觉编码器中进行精确的任务感知令牌剪枝。具体而言，DeSAP引入了解耦相似性，以捕捉视觉特征与文本令牌之间的细粒度跨模态相关性，为剪枝提供明确的任务相关指导。通过将解耦相似性与来自视觉注意力的视觉显著性信号相结合，DeSAP在任务相关和视觉线索的指导下进行令牌剪枝，即使在激进的剪枝比例下也能实现稳健的剪枝。在多种基准和架构上的广泛实验表明，DeSAP在准确性和效率上始终优于现有最先进的方法。在LLaVA-1.5-7B上，DeSAP通过仅保留11.1%的视觉令牌，实现了10倍的FLOPs减少和2.3倍的预填充加速，同时保持了98.1%的原始性能。

View on arXiv Download PDF AI Translation

cs.CV / 238 / 2604.11244

Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

脚本视频：通过因子化流和关系基础实现深度结构化音视频字幕

Tencent Hunyuan Team

Abstract

Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit identity and temporal links to maintain holistic video consistency. Extensive experiments demonstrate that MTSS consistently enhances video understanding across various models, achieving an average reduction of 25% in the total error rate on Video-SALMONN-2 and an average performance gain of 67% on the Daily-Omni reasoning benchmark. It also narrows the performance gap between smaller and larger MLLMs, indicating a substantially more learnable caption interface. Finally, even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial human-rated improvements: a 45% boost in cross-shot identity consistency, a 56% boost in audio-visual alignment, and a 71% boost in temporal controllability.

Chinese Translation

多模态大型语言模型（MLLMs）的进展正在将视频字幕从一个描述性的终点转变为视频理解和生成的语义接口。然而，主流范式仍将视频视为一个整体叙事段落，交织了视觉、听觉和身份信息。这种密集耦合不仅妨碍了表现的真实性，还限制了可扩展性，因为即使是局部编辑也可能触发全局重写。为了解决这一结构瓶颈，我们提出了多流场景脚本（MTSS），这一新范式用因子化和明确基础的场景描述替代了整体文本。MTSS建立在两个核心原则之上：流因子化（Stream Factorization），它将视频解耦为互补流（参考、镜头、事件和全局），以及关系基础（Relational Grounding），它通过明确的身份和时间链接重新连接这些孤立的流，以保持整体视频的一致性。大量实验表明，MTSS在各种模型上持续增强了视频理解，在Video-SALMONN-2上实现了平均25%的总错误率降低，在Daily-Omni推理基准上实现了平均67%的性能提升。它还缩小了小型和大型MLLMs之间的性能差距，表明字幕接口的可学习性显著提高。最后，即使没有架构适应，将整体提示替换为MTSS在多镜头视频生成中也带来了显著的人工评分改进：跨镜头身份一致性提高45%，音视频对齐提高56%，时间可控性提高71%。

View on arXiv Download PDF AI Translation

cs.CV / 239 / 2604.11250

Variational Latent Entropy Estimation Disentanglement: Controlled Attribute Leakage for Face Recognition

变分潜在熵估计解耦：面部识别中的受控属性泄漏

Öztürk, Ünsal, Hahn, Vedrana Krivokuća, Bhattacharjee, Sushil, Marcel, Sébastien

Abstract

Face recognition embeddings encode identity, but they also encode other factors such as gender and ethnicity. Depending on how these factors are used by a downstream system, separating them from the information needed for verification is important for both privacy and fairness. We propose Variational Latent Entropy Estimation Disentanglement (VLEED), a post-hoc method that transforms pretrained embeddings with a variational autoencoder and encourages a distilled representation where the categorical variable of interest is separated from identity-relevant information. VLEED uses a mutual information-based objective realised through the estimation of the entropy of the categorical attribute in the latent space, and provides stable training with fine-grained control over information removal. We evaluate our method on IJB-C, RFW, and VGGFace2 for gender and ethnicity disentanglement, and compare it to various state-of-the-art methods. We report verification utility, predictability of the disentangled variable under linear and nonlinear classifiers, and group disparity metrics based on false match rates. Our results show that VLEED offers a wide range of privacy-utility tradeoffs over existing methods and can also reduce recognition bias across demographic groups.

Chinese Translation

面部识别嵌入编码了身份，但它们也编码了其他因素，如性别和种族。根据这些因素在下游系统中的使用方式，将其与验证所需的信息分离对于隐私和公平性都至关重要。我们提出了变分潜在熵估计解耦（Variational Latent Entropy Estimation Disentanglement, VLEED），这是一种后处理方法，通过变分自编码器转换预训练嵌入，并鼓励提炼的表示，其中感兴趣的分类变量与身份相关信息分离。VLEED使用基于互信息的目标，通过估计潜在空间中分类属性的熵来实现，并提供稳定的训练，对信息移除进行精细控制。我们在IJB-C、RFW和VGGFace2上评估我们的方法，以实现性别和种族的解耦，并将其与多种最先进的方法进行比较。我们报告了验证效用、在线性和非线性分类器下解耦变量的可预测性，以及基于错误匹配率的群体差异指标。我们的结果表明，VLEED在现有方法中提供了广泛的隐私-效用权衡，并且还可以减少不同人口群体之间的识别偏差。

View on arXiv Download PDF AI Translation

cs.CV / 240 / 2604.11279

A Deep Equilibrium Network for Hyperspectral Unmixing

用于高光谱解混的深度平衡网络

Wang, Chentong, Gao, Jincheng, Zhu, Fei, Chen, Jie

Abstract

Hyperspectral unmixing (HU) is crucial for analyzing hyperspectral imagery, yet achieving accurate unmixing remains challenging. While traditional methods struggle to effectively model complex spectral-spatial features, deep learning approaches often lack physical interpretability. Unrolling-based methods, despite offering network interpretability, inadequately exploit spectral-spatial information and incur high memory costs and numerical precision issues during backpropagation. To address these limitations, we propose DEQ-Unmix, which reformulates abundance estimation as a deep equilibrium model, enabling efficient constant-memory training via implicit differentiation. It replaces the gradient operator of the data reconstruction term with a trainable convolutional network to capture spectral-spatial information. By leveraging implicit differentiation, DEQ-Unmix enables efficient and constant-memory backpropagation. Experiments on synthetic and two real-world datasets demonstrate that DEQ-Unmix achieves superior unmixing performance while maintaining constant memory cost.

Chinese Translation

高光谱解混（Hyperspectral Unmixing, HU）对于高光谱图像分析至关重要，但实现准确的解混仍然具有挑战性。传统方法难以有效建模复杂的光谱-空间特征，而深度学习方法通常缺乏物理可解释性。基于展开（unrolling）的方法虽然提供了网络的可解释性，但在利用光谱-空间信息方面不足，并且在反向传播过程中存在高内存消耗和数值精度问题。为了解决这些限制，我们提出了DEQ-Unmix，该方法将丰度估计重新表述为深度平衡模型（deep equilibrium model），通过隐式微分实现高效且恒定内存的训练。该方法用一个可训练的卷积网络替代数据重建项的梯度算子，以捕获光谱-空间信息。借助隐式微分，DEQ-Unmix实现了高效且恒定内存的反向传播。合成数据集及两个真实数据集上的实验表明，DEQ-Unmix在保持恒定内存消耗的同时，实现了优越的解混性能。

View on arXiv Download PDF AI Translation

cs.CV / 241 / 2604.11283

Empowering Video Translation using Multimodal Large Language Models

利用多模态大语言模型提升视频翻译能力

QU, Bingzheng, Chen, Kehai, Bai, Xuefeng, Zhang, Min

Abstract

Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLMs-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality, but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and systematic review of how MLLMs empower video translation tasks is still lacking. To fill this gap, we provide the first comprehensive overview of MLLMs-based video translation, organized around a three-role taxonomy: 1) Semantic Reasoner, which characterizes how MLLMs perform video understanding, temporal reasoning, and multimodal fusion; 2) Expressive Performer, which analyzes LLM-driven and LLM-augmented techniques for expressive, controllable speech generation; and 3) Visual Synthesizer, which examines different types of video generators for high-fidelity lip-sync and visual alignment. Finally, we discuss open challenges in video understanding, temporal modeling, and multimodal alignment, and outline promising future research directions for MLLMs-powered video translation.

Chinese Translation

近年来，视频翻译的发展进一步促进了跨语言的视频内容访问，多模态大语言模型（MLLMs）在其中扮演着日益重要的辅助角色。凭借强大的多模态理解、推理和生成能力，基于MLLMs的视频翻译系统正在突破传统级联流水线（分别处理自动语音识别、机器翻译、文本转语音和唇动同步）的局限。这些由MLLM驱动的方法不仅在翻译质量上实现了竞争性甚至优越的表现，还在零样本设置和多说话人场景中展现出更强的鲁棒性，同时能够联合建模语义忠实度、时间同步、说话人身份及情感一致性。然而，尽管MLLMs取得了快速进展，且已有大量关于通用视频语言理解的综述，针对MLLMs如何赋能视频翻译任务的系统性聚焦评述仍然缺乏。为填补这一空白，本文首次提供了基于MLLMs的视频翻译综合综述，围绕三大角色分类展开：1）语义推理者（Semantic Reasoner），阐述MLLMs如何实现视频理解、时间推理及多模态融合；2）表现执行者（Expressive Performer），分析由LLM驱动及增强的表达性、可控语音生成技术；3）视觉合成者（Visual Synthesizer），考察用于高保真唇动同步及视觉对齐的各类视频生成器。最后，本文讨论了视频理解、时间建模及多模态对齐中的开放挑战，并展望了MLLMs驱动视频翻译的未来研究方向。

View on arXiv Download PDF AI Translation

cs.CV / 242 / 2604.11331

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

任何三维场景都值1000个Token：基于三维的场景生成大规模表示方法

Wei, Dongxu, Xu, Qi, Li, Zhiqi, Zhou, Hangning, Qiu, Cong, Qin, Hailong, Yang, Mu, Cui, Zhaopeng, Liu, Peidong

Abstract

3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists in the form of multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches degrade 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views--at any resolution and aspect ratio--with fixed complexity and rich semantics. Then we introduce 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. Moreover, since our approach directly generates a 3D scene representation, it can be decoded to images and optional point maps along arbitrary camera trajectories without requiring per-trajectory diffusion sampling pass, which is common in 2D-based approaches.

Chinese Translation

三维场景生成长期以来主要依赖于二维多视角或视频扩散模型。这不仅是由于缺乏场景级的三维潜在表示，还因为大多数场景级三维视觉数据以多视角图像或视频的形式存在，这与二维扩散架构天然兼容。通常，这些基于二维的方法将三维空间外推简化为二维时间扩展，导致两个根本性问题：（i）通过二维视图表示三维场景会引入显著的表示冗余；（ii）基于二维的潜在空间本质上限制了生成三维场景的空间一致性。本文首次提出直接在隐式三维潜在空间中进行三维场景生成以解决上述限制。首先，我们重新利用冻结的二维表示编码器构建了三维表示自编码器（3DRAE），将与视角耦合的二维语义表示锚定到与视角解耦的三维潜在表示中。这使得能够以固定复杂度和丰富语义表示任意数量视角、任意分辨率及长宽比观察到的三维场景。随后，我们引入了三维扩散变换器（3DDiT），在该三维潜在空间中执行扩散建模，实现了高效且空间一致的三维场景生成，同时支持多样的条件配置。此外，由于我们的方法直接生成三维场景表示，可沿任意相机轨迹解码为图像及可选的点云图，无需像二维方法中常见的每条轨迹扩散采样过程。

View on arXiv Download PDF AI Translation

cs.CV / 243 / 2604.11332

A Compact and Efficient 1.251 Million Parameter Machine Learning CNN Model PD36-C for Plant Disease Detection: A Case Study

一种紧凑高效的125.1万参数机器学习卷积神经网络模型PD36-C用于植物病害检测：案例研究

Sherifi, Shkelqim

Abstract

Deep learning has markedly advanced image based plant disease diagnosis as improved hardware and dataset quality have enabled increasingly accurate neural network models. This paper presents PD36 C, a compact convolutional neural network (1,250,694 parameters and 4.77 MB) for plant disease classification. Trained with TensorFlow Keras on the New Plant Diseases Dataset (87k images, 38 classes), PD36 C is designed for robustness and edge deployability, complemented by a Qt for Python desktop application that offers an intuitive GUI and offline inference on commodity hardware. Across experiments, training accuracy reached 0.99697 by epoch 30, and average test accuracy was 0.9953 across 38 classes. Per class performance is uniformly high; on the lower end, Corn (maize) Cercospora leaf spot achieved precision around 0.9777 and recall around 0.9634, indicating occasional confusion with visually similar categories, while on the upper end numerous classes including Apple Black rot, Cedar apple rust, Blueberry healthy, Cherry Powdery mildew, Cherry healthy, and all four grape categories achieved perfect precision 1.00 and recall of 1.00, indicating no false positives and strong coverage. These results show that with a well curated dataset and careful architectural design, small CNNs can achieve competitive accuracy compared with recent baselines while remaining practical for edge scenarios. We also note typical constraints such as adverse weather, low quality imagery, and leaves exhibiting multiple concurrent diseases that can degrade performance and warrant future work on domain robustness. Overall, PD36 C and its application pipeline contribute a field ready, efficient solution for AI assisted plant disease detection in smart agriculture.

Chinese Translation

随着硬件性能和数据集质量的提升，深度学习在基于图像的植物病害诊断中取得了显著进展，推动了神经网络模型的准确性不断提高。本文提出了PD36-C，一种紧凑型卷积神经网络（参数量为1,250,694，模型大小为4.77MB）用于植物病害分类。该模型基于TensorFlow Keras在New Plant Diseases Dataset（包含87,000张图像，38个类别）上训练，设计目标兼顾鲁棒性和边缘设备部署能力。配套的Qt for Python桌面应用程序提供了直观的图形用户界面和在通用硬件上的离线推断功能。实验结果显示，训练至第30个epoch时准确率达到0.99697，38个类别的平均测试准确率为0.9953。各类别性能普遍较高；表现较低的类别如玉米（maize）叶斑病（Cercospora leaf spot）精确率约为0.9777，召回率约为0.9634，表明偶尔与视觉上相似的类别存在混淆；表现较好的类别包括苹果黑腐病（Apple Black rot）、雪松苹果锈病（Cedar apple rust）、蓝莓健康（Blueberry healthy）、樱桃白粉病（Cherry Powdery mildew）、樱桃健康（Cherry healthy）以及所有四个葡萄类别，均实现了精确率和召回率均为1.00，表明无误报且覆盖率极高。结果表明，借助精心策划的数据集和合理的网络架构设计，小型卷积神经网络能够在保持边缘场景实用性的同时，达到与最新基线模型相当的准确率。我们同时指出，恶劣天气、低质量图像以及叶片上多重病害共存等典型限制因素会影响性能，未来工作需加强领域鲁棒性。总体而言，PD36-C及其应用流程为智能农业中基于人工智能的植物病害检测提供了一个可现场应用的高效解决方案。

View on arXiv Download PDF AI Translation

cs.CV / 244 / 2604.11348

LoGo-MR: Screening Breast MRI for Cancer Risk Prediction by Efficient Omni-Slice Modeling

LoGo-MR：通过高效的全切片建模筛查乳腺MRI以预测癌症风险

Wang, Xin, Gao, Yuan, Yiasemis, George, Portaluri, Antonio, Aghdam, Zahra, He, Muzhen, Han, Luyi, Duan, Yaofei, Lu, Chunyao, Liang, Xinglong, Zhang, Tianyu, van Veldhuizen, Vivien, Sun, Yue, Tan, Tao, Mann, Ritse, Teuwen, Jonas

Abstract

Efficient and explainable breast cancer (BC) risk prediction is critical for large-scale population-based screening. Breast MRI provides functional information for personalized risk assessment. Yet effective modeling remains challenging as fully 3D CNNs capture volumetric context at high computational cost, whereas lightweight 2D CNNs fail to model inter-slice continuity. Importantly, breast MRI modeling for shor- and long-term BC risk stratification remains underexplored. In this study, we propose LoGo-MR, a 2.5D local-global structural modeling framework for five-year BC risk prediction. Aligned with clinical interpretation, our framework first employs neighbor-slice encoding to capture subtle local cues linked to short-term risk. It then integrates transformer-enhanced multiple-instance learning (MIL) to model distributed global patterns related to long-term risk and provide interpretable slice importance. We further apply this framework across axial, sagittal, and coronal planes as LoGo3-MR to capture complementary volumetric information. This multi-plane formulation enables voxel-level risk saliency mapping, which may assist radiologists in localizing risk-relevant regions during breast MRI interpretation. Evaluated on a large breast MRI screening cohort (~7.5K), our method outperforms 2D/3D baselines and existing SOTA MIL methods, achieving AUCs of 0.77-0.69 for 1- to 5-year prediction and improving C-index by ~6% over 3D CNNs. LoGo3-MR further improves overall performance with interpretable localization across three planes, and validation across seven backbones shows consistent gains. These results highlight the clinical potential of efficient MRI-based BC risk stratification for large-scale screening. Code will be released publicly.

Chinese Translation

高效且可解释的乳腺癌（BC）风险预测对于大规模人群筛查至关重要。乳腺MRI提供个性化风险评估所需的功能信息。然而，有效建模仍然具有挑战性，因为完全的3D卷积神经网络（CNN）以高计算成本捕获体积上下文，而轻量级的2D CNN则无法建模切片间的连续性。重要的是，乳腺MRI在短期和长期BC风险分层的建模仍然未被充分探索。在本研究中，我们提出了LoGo-MR，一个用于五年BC风险预测的2.5D局部-全局结构建模框架。与临床解释相一致，我们的框架首先采用邻切片编码来捕捉与短期风险相关的细微局部线索。然后，它整合了增强型变换器的多实例学习（MIL），以建模与长期风险相关的分布式全局模式，并提供可解释的切片重要性。我们进一步将该框架应用于轴向、矢状和冠状平面，作为LoGo3-MR，以捕捉互补的体积信息。这种多平面形式使得体素级风险显著性映射成为可能，可能帮助放射科医生在乳腺MRI解读过程中定位风险相关区域。在一个大型乳腺MRI筛查队列（约7.5K）上进行评估，我们的方法优于2D/3D基线和现有的最先进的MIL方法，实现了1至5年预测的AUC为0.77-0.69，并将C指数提高了约6%相较于3D CNN。LoGo3-MR进一步通过在三个平面上的可解释定位提高了整体性能，并且在七个主干网络上的验证显示出一致的提升。这些结果突显了基于MRI的高效BC风险分层在大规模筛查中的临床潜力。代码将公开发布。

View on arXiv Download PDF AI Translation

cs.CV / 245 / 2604.11355

LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization

LEADER：用于LiDAR重定位的可靠局部到全局对应学习

Wu, Jianshi, Zhu, Minghang, Liu, Dunqiang, Li, Wen, Ao, Sheng, Shen, Siqi, Wen, Chenglu, Wang, Cheng

Abstract

LiDAR relocalization has attracted increasing attention as it can deliver accurate 6-DoF pose estimation in complex 3D environments. Recent learning-based regression methods offer efficient solutions by directly predicting global poses without the need for explicit map storage. However, these methods often struggle in challenging scenes due to their equal treatment of all predicted points, which is vulnerable to noise and outliers. In this paper, we propose LEADER, a robust LiDAR-based relocalization framework enhanced by a simple, yet effective geometric encoder. Specifically, a Robust Projection-based Geometric Encoder architecture which captures multi-scale geometric features is first presented to enhance descriptiveness in geometric representation. A Truncated Relative Reliability loss is then formulated to model point-wise ambiguity and mitigate the influence of unreliable predictions. Extensive experiments on the Oxford RobotCar and NCLT datasets demonstrate that LEADER outperforms state-of-the-art methods, achieving 24.1% and 73.9% relative reductions in position error over existing techniques, respectively. The source code is released on https://github.com/JiansW/LEADER.

Chinese Translation

LiDAR重定位因其能够在复杂三维环境中提供精确的6自由度位姿估计而受到越来越多的关注。近期基于学习的回归方法通过直接预测全局位姿，避免了显式地图存储，提供了高效的解决方案。然而，这些方法在复杂场景中常因对所有预测点一视同仁，易受噪声和异常值影响而表现不佳。本文提出了LEADER，一种由简洁而有效的几何编码器增强的鲁棒LiDAR重定位框架。具体而言，首先引入了基于鲁棒投影的几何编码器（Robust Projection-based Geometric Encoder）架构，以捕捉多尺度几何特征，提升几何表示的描述能力。随后，设计了截断相对可靠性损失（Truncated Relative Reliability loss），用于建模点级歧义并减轻不可靠预测的影响。在Oxford RobotCar和NCLT数据集上的大量实验表明，LEADER优于当前最先进的方法，分别在定位误差上实现了24.1%和73.9%的相对降低。源码已发布于https://github.com/JiansW/LEADER。

View on arXiv Download PDF AI Translation

cs.CV / 246 / 2604.11374

What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment?

视觉-语言模型在个性化图像美学评估中编码了什么？

Ryu, Koki, Yanaka, Hitomi

Abstract

Personalized image aesthetics assessment (PIAA) is an important research problem with practical real-world applications. While methods based on vision-language models (VLMs) are promising candidates for PIAA, it remains unclear whether they internally encode rich, multi-level aesthetic attributes required for effective personalization. In this paper, we first analyze the internal representations of VLMs to examine the presence and distribution of such aesthetic attributes, and then leverage them for lightweight, individual-level personalization without model fine-tuning. Our analysis reveals that VLMs encode diverse aesthetic attributes that propagate into the language decoder layers. Building on these representations, we demonstrate that simple linear models can perform PIAA effectively. We further analyze how aesthetic information is transferred across layers in different VLM architectures and across image domains. Our findings provide insights into how VLMs can be utilized for modeling subjective, individual aesthetic preferences. Our code is available at https://github.com/ynklab/vlm-latent-piaa.

Chinese Translation

个性化图像美学评估（PIAA）是一个重要的研究问题，具有实际应用价值。尽管基于视觉-语言模型（VLMs）的方法是PIAA的有希望的候选者，但尚不清楚它们是否内部编码了有效个性化所需的丰富多层次美学属性。本文首先分析了VLMs的内部表示，以检查这些美学属性的存在和分布，然后利用这些属性实现轻量级的个体级个性化，而无需对模型进行微调。我们的分析揭示了VLMs编码了多样的美学属性，这些属性传播到语言解码器层。基于这些表示，我们展示了简单的线性模型可以有效地执行PIAA。我们进一步分析了美学信息如何在不同VLM架构的层之间以及在不同图像领域之间转移。我们的发现为如何利用VLMs建模主观的个体美学偏好提供了洞见。我们的代码可在 https://github.com/ynklab/vlm-latent-piaa 获取。

View on arXiv Download PDF AI Translation

cs.CV / 247 / 2604.11376

From Redaction to Restoration: Deep Learning for Medical Image Anonymization and Reconstruction

从编辑到恢复：用于医学图像匿名化与重建的深度学习方法

Kline, Adrienne, Gaonkar, Abhijit, Pittman, Daniel, Kuehn, Chris, Forkert, Nils

Abstract

Removing patient-specific information from medical images is crucial to enable sharing and open science without compromising patient identities. However, many methods currently used for deidentification have negative effects on downstream image analysis tasks because of removal of relevant but non-identifiable information. This work presents an end-to-end deep learning framework for transforming raw clinical image volumes into de-identified, analysis-ready datasets without compromising downstream utility. The methodology developed and tested in this work first detects and redacts regions likely to contain protected health information (PHI), such as burned-in text and metadata, and then uses a generative deep learning model to inpaint the redacted areas with anatomically and imaging plausible content. The proposed pipeline leverages a lightweight hybrid architecture, combining CRNN-based redaction with a latent-diffusion inpainting restoration module (Stable Diffusion 2). We evaluate the approach using both privacy-oriented metrics, which quantify residual PHI and success of redaction, and image-quality and task-based metrics, which assess the fidelity of restored volumes for representative deep learning applications. Our results suggest that the proposed method yields de-identified medical images that are visually coherent, maintaining fidelity for downstream models, while substantially reducing the risk of patient re-identification. By automating anonymization and image reconstruction within a single workflow, and dissemination of large-scale medical imaging collections, thereby lowering a key barrier to data sharing and multi-institutional collaboration in medical imaging AI.

Chinese Translation

从医学图像中去除患者特定信息对于实现数据共享和开放科学至关重要，同时又不泄露患者身份。然而，目前许多去标识化方法因去除了相关但不可识别的信息，导致后续图像分析任务的性能下降。本文提出了一种端到端的深度学习框架，将原始临床图像体数据转换为去标识化且可用于分析的数据集，同时保证下游任务的效用。该方法首先检测并编辑可能包含受保护健康信息（PHI）的区域，如烧录文本和元数据，然后利用生成式深度学习模型对编辑区域进行图像修复，填充符合解剖结构和成像特征的内容。所提流水线采用轻量级混合架构，结合基于CRNN的编辑模块与潜在扩散修复模块（Stable Diffusion 2）。我们通过隐私保护指标（量化残留PHI及编辑成功率）以及图像质量和任务性能指标（评估修复体数据在典型深度学习应用中的保真度）对方法进行了评估。结果表明，该方法生成的去标识化医学图像在视觉上连贯，保持了对下游模型的高保真度，同时显著降低了患者重新识别的风险。通过在单一工作流程中实现匿名化与图像重建的自动化，该方法有助于大规模医学影像数据集的传播，降低了医学影像人工智能领域数据共享和多机构协作的关键障碍。

View on arXiv Download PDF AI Translation

cs.CV / 248 / 2604.11389

ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines

ConvFormer3D-TAP：相位/不确定性感知前端融合用于心脏磁共振成像视图分类管道

Nia, Nafiseh Ghaffar, Appadurai, Vinesh, V., Suchithra, Rane, Chinmay, Pittman, Daniel, Carr, James, Kline, Adrienne

Abstract

Reliable recognition of standard cine cardiac MRI views is essential because each view determines which cardiac anatomy is visualized and which quantitative analyses can be performed. Incorrect view identification, whether by a human reader or an automated deep learning system, can propagate errors into segmentation, volumetric assessment, strain analysis, and valve evaluation. However, accurate view classification remains challenging under routine clinical variability in scanner vendor, acquisition protocol, motion artifacts, and plane prescription. We present ConvFormer3D-TAP, a cine-specific spatiotemporal architecture that integrates 3D convolutional tokenization with multiscale self-attention. The model is trained using masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion to enhance robustness across cardiac phases and ambiguous temporal segments. The design captures complementary cues: local anatomical structure through convolutional priors and long-range cardiac-cycle dynamics through hierarchical attention. On a cohort of 150,974 clinically acquired cine sequences spanning six standard cine cardiac MRI views, ConvFormer3D-TAP achieved 96% validation accuracy with per-class F1-scores >= 0.94 and strong calibration (ECE = 0.025; Brier = 0.040). Error analysis shows that residual confusions are concentrated in anatomically adjacent long-axis and LVOT/AV view pairs, consistent with intrinsic prescription overlap. These results support ConvFormer3D-TAP as a scalable front-end for view routing, filtering and quality control in end-to-end cMRI workflows.

Chinese Translation

可靠识别标准心脏cine MRI视图至关重要，因为每个视图决定了可视化的心脏解剖结构以及可以进行的定量分析。无论是由人工读者还是自动化深度学习系统进行的错误视图识别，都可能导致分割、体积评估、应变分析和瓣膜评估中的错误传播。然而，在扫描仪厂商、采集协议、运动伪影和平面处方的常规临床变异性下，准确的视图分类仍然具有挑战性。我们提出了ConvFormer3D-TAP，这是一种特定于cine的时空架构，结合了3D卷积标记化和多尺度自注意力。该模型通过掩蔽时空重建和不确定性加权的多剪辑融合进行训练，以增强在心脏相位和模糊时间段中的鲁棒性。该设计捕捉了互补线索：通过卷积先验获取局部解剖结构，通过层次注意力获取长程心动周期动态。在涵盖六个标准cine心脏MRI视图的150,974个临床采集cine序列的队列中，ConvFormer3D-TAP实现了96%的验证准确率，且每类F1分数均>= 0.94，并且具有良好的校准（ECE = 0.025；Brier = 0.040）。错误分析表明，残余混淆集中在解剖相邻的长轴和LVOT/AV视图对上，这与固有的处方重叠一致。这些结果支持ConvFormer3D-TAP作为可扩展的前端，用于视图路由、过滤和端到端心脏MRI工作流程中的质量控制。

View on arXiv Download PDF AI Translation

cs.CV / 249 / 2604.11390

Beyond Reconstruction: Reconstruction-to-Vector Diffusion for Hyperspectral Anomaly Detection

超越重建：用于高光谱异常检测的重建到向量扩散

Xiang, Jijun, Wang, Jiayi, Wang, Pengxiang, Chen, Cheng, Wang, Nian, Wang, Tao

Abstract

While Hyperspectral Anomaly Detection (HAD) excels at identifying sparse targets in complex scenes, existing models remain trapped in a scalar "reconstruction-as-endpoint" paradigm. This reliance on ambiguous scalar residuals consistently triggers sub-pixel anomaly vanishing during spatial downsampling, alongside severe confirmation bias when unpurified anomalies corrupt training weights. In this paper, we propose Reconstruction-to-Vector Diffusion (R2VD), which fundamentally redefines reconstruction as a manifold purification origin to establish a novel residual-guided generative dynamics paradigm. Our framework introduces a four-stage pipeline: (1) a Physical Prior Extraction (PPE) stage that mitigates early confirmation bias via dual-stream statistical guidance; (2) a Guided Manifold Purification (GMP) stage utilizing an OmniContext Autoencoder (OCA) to extract purified residual maps while preserving fragile sub-pixel topologies; (3) a Residual Score Modeling (RSM) stage where a Diffusion Transformer (DiT), guarded by a Physical Spectral Firewall (PSF), effectively isolates cross-spectral leakage; and (4) a Vector Dynamics Inference (VDI) stage that robustly decouples targets from backgrounds by evaluating high-dimensional vector interference patterns instead of conventional scalar errors. Comprehensive evaluations on eight datasets confirm that R2VD establishes a new state-of-the-art, delivering exceptional target detectability and background suppression.

Chinese Translation

虽然高光谱异常检测（HAD）在复杂场景中识别稀疏目标方面表现出色，但现有模型仍然陷入标量“重建作为终点”的范式。这种对模糊标量残差的依赖在空间下采样过程中持续引发亚像素异常消失，并在未净化异常损坏训练权重时造成严重的确认偏差。本文提出了重建到向量扩散（R2VD），从根本上将重建重新定义为流形净化的起源，以建立一种新颖的残差引导生成动力学范式。我们的框架引入了一个四阶段的流程：（1）物理先验提取（PPE）阶段，通过双流统计指导减轻早期确认偏差；（2）引导流形净化（GMP）阶段，利用全上下文自编码器（OCA）提取净化的残差图，同时保留脆弱的亚像素拓扑；（3）残差评分建模（RSM）阶段，扩散变换器（DiT）在物理光谱防火墙（PSF）的保护下有效隔离跨光谱泄漏；（4）向量动力推断（VDI）阶段，通过评估高维向量干扰模式而非传统标量误差，稳健地将目标与背景解耦。对八个数据集的全面评估确认R2VD建立了新的最先进水平，提供了卓越的目标可检测性和背景抑制。

View on arXiv Download PDF AI Translation

cs.CV / 250 / 2604.11395

Video-based Heart Rate Estimation with Angle-guided ROI Optimization and Graph Signal Denoising

基于视频的心率估计：角度引导的感兴趣区域优化与图信号去噪

Pei, Gan, Ning, Junhao, Shen, Boqiu, Zhu, Yan, Hu, Menghan

Abstract

Remote photoplethysmography (rPPG) enables non-contact heart rate measurement from facial videos, but its performance is significantly degraded by facial motions such as speaking and head shaking. To address this issue, we propose two plug-and-play modules. The Angle-guided ROI Adaptive Optimization module quantifies ROI-Camera angles to refine motion-affected signals and capture global motion, while the Multi-region Joint Graph Signal Denoising module jointly models intra- and inter-regional ROI signals using graph signal processing to suppress motion artifacts. The modules are compatible with reflection model-based rPPG methods and validated on three public datasets. Results show that jointly use markedly reduces MAE, with an average decrease of 20.38\% over the baseline, while ablation studies confirm the effectiveness of each module. The work demonstrates the potential of angle-guided optimization and graph-based denoising to enhance rPPG performance in motion scenarios.

Chinese Translation

远程光电容积脉搏波（rPPG）能够通过面部视频实现非接触式心率测量，但其性能受到面部运动（如说话和摇头）的显著影响。为了解决这一问题，我们提出了两个即插即用模块。角度引导的感兴趣区域自适应优化模块量化了ROI-摄像头角度，以精炼受运动影响的信号并捕捉全局运动，而多区域联合图信号去噪模块则利用图信号处理共同建模区域内和区域间的ROI信号，以抑制运动伪影。这些模块与基于反射模型的rPPG方法兼容，并在三个公共数据集上进行了验证。结果表明，联合使用显著降低了平均绝对误差（MAE），相比基线平均减少了20.38%，同时消融研究证实了每个模块的有效性。本研究展示了角度引导优化和基于图的去噪在运动场景中提升rPPG性能的潜力。

View on arXiv Download PDF AI Translation

cs.CV / 251 / 2604.11399

Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging

推理存在于层次中：通过层选择性合并恢复视频语言模型中的时间推理

Fu, Zihang, Wang, Haonan, Kang, Jian, Kawaguchi, Kenji, Wu, Jiaying

Abstract

Multimodal adaptation equips large language models (LLMs) with perceptual capabilities, but often weakens the reasoning ability inherited from language-only pretraining. This trade-off is especially pronounced in video-language models (VLMs), where visual alignment can impair temporal reasoning (TR) over sequential events. We propose MERIT, a training-free, task-driven model merging framework for restoring TR in VLMs. MERIT searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone using an objective that improves TR while penalizing degradation in temporal perception (TP). Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes beyond the search set to four distinct benchmarks. It also outperforms uniform full-model merging and random layer selection, showing that effective recovery depends on selecting the right layers. Interventional masking and frame-level attribution further show that the selected layers are disproportionately important for reasoning and shift model decisions toward temporally and causally relevant evidence. These results show that targeted, perception-aware model merging can effectively restore TR in VLMs without retraining.

Chinese Translation

多模态适应使大型语言模型（LLMs）具备感知能力，但往往削弱了其从仅语言预训练中继承的推理能力。这种权衡在视频语言模型（VLMs）中尤为明显，因为视觉对齐可能会损害对顺序事件的时间推理（TR）。我们提出了MERIT，一种无训练、任务驱动的模型合并框架，用于恢复VLMs中的TR。MERIT在VLM及其配对的仅文本主干之间搜索层级自注意力合并方案，使用一种目标函数来改善TR，同时惩罚时间感知（TP）的退化。在三个代表性的VLM和多个具有挑战性的视频基准测试中，MERIT始终改善TR，保持或提高TP，并在四个不同的基准测试中超越搜索集进行泛化。它还优于均匀的全模型合并和随机层选择，表明有效的恢复依赖于选择合适的层。干预掩蔽和帧级归因进一步表明，所选层对推理的重要性不成比例，并将模型决策转向时间和因果相关的证据。这些结果表明，针对性、感知意识的模型合并可以有效地恢复VLMs中的TR，而无需重新训练。

View on arXiv Download PDF AI Translation

cs.CV / 252 / 2604.11401

GS4City: Hierarchical Semantic Gaussian Splatting via City-Model Priors

GS4City：基于城市模型先验的层次语义高斯点云技术

Zhang, Qilin, Zhu, Jinyu, Wysocki, Olaf, Busam, Benjamin, Jutzi, Boris

Abstract

Recent semantic 3D Gaussian Splatting (3DGS) methods primarily rely on 2D foundation models, often yielding ambiguous boundaries and limited support for structured urban semantics. While city models such as CityGML encode hierarchically organized semantics together with building geometry, these labels cannot be directly mapped to Gaussian primitives. We present GS4City, a hierarchical semantic Gaussian Splatting method that incorporates city-model priors for urban scene understanding. GS4City derives reliable image-aligned masks from Level of Detail (LoD) 3 CityGML models via two-pass raycasting, explicitly using parent-child relations to validate and recover fine-grained facade elements. It then fuses these geometry-grounded masks with foundation-model predictions to establish scene-consistent instance correspondences, and learns a compact identity encoding for each Gaussian under joint 2D identity supervision and 3D spatial regularization. Experiments on the TUM2TWIN and Gold Coast datasets show that GS4City effectively incorporates structured building semantics into Gaussian scene representations, outperforming existing 2D-driven semantic 3DGS baselines, including LangSplat and Gaga, by up to 15.8 IoU points in coarse building segmentation and 14.2 mIoU points in fine-grained semantic segmentation. By bridging structured city models and photorealistic Gaussian scene representations, GS4City enables semantically queryable and structure-aware urban reconstruction. Code is available at https://github.com/Jinyzzz/GS4City.

Chinese Translation

近期的语义三维高斯点云（3DGS）方法主要依赖于二维基础模型，常常导致模糊的边界和对结构化城市语义的支持有限。尽管城市模型如CityGML编码了层次组织的语义以及建筑几何，这些标签却无法直接映射到高斯原语。我们提出了GS4City，这是一种结合城市模型先验的层次语义高斯点云方法，用于城市场景理解。GS4City通过双通道光线投射从细节层次（LoD）3的CityGML模型中推导出可靠的图像对齐掩膜，明确利用父子关系来验证和恢复细致的立面元素。然后，它将这些基于几何的掩膜与基础模型的预测融合，以建立场景一致的实例对应关系，并在联合二维身份监督和三维空间正则化下，为每个高斯学习紧凑的身份编码。在TUM2TWIN和黄金海岸数据集上的实验表明，GS4City有效地将结构化建筑语义融入高斯场景表示，超越现有的二维驱动语义3DGS基线，包括LangSplat和Gaga，在粗糙建筑分割中提高了15.8个IoU点，在细粒度语义分割中提高了14.2个mIoU点。通过桥接结构化城市模型和逼真的高斯场景表示，GS4City实现了语义可查询和结构感知的城市重建。代码可在 https://github.com/Jinyzzz/GS4City 获取。

View on arXiv Download PDF AI Translation

cs.CV / 253 / 2604.11402

Scene Change Detection with Vision-Language Representation Learning

基于视觉-语言表示学习的场景变化检测

Sheng, Diwei, Gohil, Vijayraj, Gaba, Satyam, Liu, Zihan, Hamilton-Fletcher, Giles, Rizzo, John-Ross, Liang, Yongqing, Feng, Chen

Abstract

Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by enforcing semantic consistency and spatial completeness. Existing real-world scene change detection benchmarks provide only binary change annotations, which are insufficient for downstream applications requiring fine-grained understanding of scene dynamics. To address this limitation, we introduce NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City with multiclass change annotations generated through a semi-automatic pipeline. Extensive experiments across multiple street-view benchmarks demonstrate that our language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance and highlighting the value of integrating linguistic reasoning with visual representations for robust scene change detection.

Chinese Translation

场景变化检测（Scene Change Detection, SCD）对于城市监控和导航至关重要，但由于光照变化、季节变换、视角差异以及复杂的城市布局，在真实环境中仍然具有挑战性。现有方法主要依赖低层次的视觉特征，限制了其在复杂城市场景中准确识别变化对象的能力。本文提出了LangSCD，一种融合视觉与语言的场景变化检测框架，通过引入语言的语义推理克服了单一模态的局限性。我们的方法引入了一个模块化的语言组件，利用视觉-语言模型（Vision-Language Models, VLMs）生成场景变化的文本描述，并通过跨模态特征增强器将其与视觉特征融合。进一步地，我们设计了几何-语义匹配模块，通过强化语义一致性和空间完整性来优化预测掩码。现有的真实场景变化检测基准仅提供二元变化标注，难以满足需要细粒度场景动态理解的下游应用。为此，我们引入了NYC-CD，一个包含8122对真实纽约市街景图像的大规模数据集，采用半自动流程生成多类别变化标注。大量在多个街景基准上的实验表明，我们的语言和匹配模块持续提升了现有变化检测架构的性能，实现了最先进的效果，凸显了将语言推理与视觉表示结合以实现鲁棒场景变化检测的价值。

View on arXiv Download PDF AI Translation

cs.CV / 254 / 2604.11411

Online Reasoning Video Object Segmentation

在线推理视频目标分割

Liu, Jinyuan, Wang, Yang, Zhao, Zeyu, Li, Weixin, Wang, Song, Han, Ruize

Abstract

Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployments that require strictly causal, frame-by-frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories. We further propose a baseline with continually-updated segmentation prompts and a structured temporal token reservoir for long-horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while our baseline establishes a strong foundation for future research.

Chinese Translation

推理视频目标分割（Reasoning Video Object Segmentation）旨在根据自然语言查询预测视频中的像素级掩码，这些查询可能涉及隐含和时间上定位的引用。然而，现有方法均在离线环境下开发和评估，即推理时可获得完整视频，并利用未来帧进行回溯消歧，这与需要严格因果、逐帧决策的实际应用场景不符。本文研究在线推理视频目标分割（Online Reasoning Video Object Segmentation，ORVOS），要求模型仅利用过去和当前帧增量式地理解查询，且不能回溯先前预测，同时处理事件发展过程中指称对象的变化。为支持评估，我们引入ORVOSB基准，包含帧级因果注释和指称变化标签，涵盖210个视频、12,907帧标注及512个查询，涵盖五类推理任务。我们进一步提出一个基线方法，采用持续更新的分割提示和结构化时间令牌库，以在有限计算条件下实现长时域推理。实验表明，现有方法在严格因果性和指称变化条件下表现不佳，而我们的方法为未来研究奠定了坚实基础。

View on arXiv Download PDF AI Translation

cs.CV / 255 / 2604.11415

Observe Less, Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding

观察更少，理解更多：成本感知的跨尺度遥感理解

Xie, Zhenghao, Xiao, Jing, Wang, Zhenqi, Ma, Kexin, Liao, Liang, Xia, Gui-Song, Wang, Mi

Abstract

Remote sensing understanding inherently requires multi-resolution observation, since different targets and application tasks demand different levels of spatial detail. While low-resolution (LR) imagery enables efficient global observation, high-resolution (HR) imagery provides critical local details at much higher acquisition cost and limited coverage. This motivates a cross-scale sensing strategy that selectively acquires HR imagery from LR-based global perception to improve task performance under constrained cost. Existing methods for HR sampling methods typically make selection decisions from isolated LR patches, which ignore fine-grained intra-patch importance and cross-patch contextual interactions, leading to fragmented feature representation and suboptimal scene reasoning under sparse HR observations. To address this issue, we formulate cross-scale remote sensing understanding as a unified cost-aware problem that couples fine-grained HR sampling with cross-patch representation prediction, enabling more effective task reasoning with fewer HR observations. Furthermore, we present GL-10M, a large-scale benchmark of 10 million spatially aligned multi-resolution images, enabling systematic evaluation of budget-constrained cross-scale reasoning in remote sensing. Extensive experiments on recognition and retrieval tasks show that our method consistently achieves a superior performance-cost trade-off.

Chinese Translation

遥感理解本质上需要多分辨率观察，因为不同的目标和应用任务需要不同水平的空间细节。低分辨率（LR）影像能够实现高效的全球观察，而高分辨率（HR）影像则提供了关键的局部细节，但其获取成本更高且覆盖范围有限。这促使我们提出一种跨尺度感知策略，从基于LR的全球感知中选择性地获取HR影像，以在成本受限的情况下提高任务性能。现有的HR采样方法通常从孤立的LR图块中做出选择决策，这忽视了图块内部的细粒度重要性和跨图块的上下文交互，导致特征表示碎片化以及在稀疏HR观察下的次优场景推理。为了解决这个问题，我们将跨尺度遥感理解构建为一个统一的成本感知问题，结合细粒度HR采样与跨图块表示预测，从而在更少的HR观察下实现更有效的任务推理。此外，我们提出了GL-10M，这是一个包含1000万张空间对齐的多分辨率影像的大规模基准，能够系统性地评估遥感中预算受限的跨尺度推理。大量在识别和检索任务上的实验表明，我们的方法在性能与成本的权衡上始终表现出色。

View on arXiv Download PDF AI Translation

cs.CV / 256 / 2604.11444

HuiYanEarth-SAR: A Foundation Model for High-Fidelity and Low-Cost Global Remote Sensing Imagery Generation

HuiYanEarth-SAR：高保真低成本全球遥感影像生成的基础模型

Liu, Yongxiang, Zhou, Jie, Song, Yafei, Liu, Tianpeng, Liu, Li

Abstract

Synthetic Aperture Radar (SAR) imagery generation is essential for deepening the study of scattering mechanisms, establishing trustworthy electromagnetic scene models, and fundamentally alleviating the data scarcity bottleneck that constrains development in this field. However, existing methods find it difficult to simultaneously ensure high fidelity in both global geospatial semantics and microscopic scattering mechanisms, resulting in severe challenges for global generation. To address this, we propose \textbf{HuiYanEarth-SAR}, the first foundational SAR imagery generation model based on AlphaEarth and integrated scattering mechanisms. By injecting geospatial priors to control macroscopic structures and utilizing implicit scattering characteristic modeling to ensure the authenticity of microscopic textures, we achieve the capability of generating high-fidelity SAR images for global locations solely based on geographic coordinates. This study not only constructs an efficient SAR scene simulator but also establishes a bridge connecting geography, scatter mechanism, and artificial intelligence from a methodological standpoint. It advances SAR research by expanding the paradigm from perception and understanding to simulation and creation, providing key technical support for constructing a high-confidence digital twin of the Earth.

Chinese Translation

合成孔径雷达（SAR）影像生成对于深入研究散射机制、建立可靠的电磁场景模型以及根本上缓解限制该领域发展的数据稀缺瓶颈至关重要。然而，现有方法难以同时确保全球地理语义和微观散射机制的高保真度，从而对全球生成造成严重挑战。为此，我们提出了 extbf{HuiYanEarth-SAR}，这是第一个基于AlphaEarth和集成散射机制的基础SAR影像生成模型。通过注入地理先验以控制宏观结构，并利用隐式散射特征建模确保微观纹理的真实性，我们实现了仅基于地理坐标生成全球位置高保真SAR影像的能力。本研究不仅构建了一个高效的SAR场景模拟器，还从方法论的角度建立了地理、散射机制与人工智能之间的桥梁。它通过将SAR研究的范式从感知与理解扩展到模拟与创造，为构建高可信度的地球数字双胞胎提供了关键技术支持。

View on arXiv Download PDF AI Translation

cs.CV / 257 / 2604.11468

Beyond Model Design: Data-Centric Training and Self-Ensemble for Gaussian Color Image Denoising

超越模型设计：面向高斯彩色图像去噪的数据驱动训练与自集成方法

Chang, Gengjia, Ge, Xining, Yuan, Weijun, Li, Zhan, Song, Qiurong, Zhu, Luen, Liu, Shuhong

Abstract

This paper presents our solution to the NTIRE 2026 Image Denoising Challenge (Gaussian color image denoising at fixed noise level $\sigma = 50$). Rather than proposing a new restoration backbone, we revisit the performance boundary of the mature Restormer architecture from two complementary directions: stronger data-centric training and more complete Test-Time capability release. Starting from the public Restormer $\sigma\!=\!50$ baseline, we expand the standard multi-dataset training recipe with larger and more diverse public image corpora and organize optimization into two stages. At inference, we apply $\times 8$ geometric self-ensemble to further release model capacity. A TLC-style local inference wrapper is retained for implementation consistency; however, systematic ablation reveals its quantitative contribution to be negligible in this setting. On the challenge validation set of 100 images, our final submission achieves 30.762 dB PSNR and 0.861 SSIM, improving over the public Restormer $\sigma\!=\!50$ pretrained baseline by up to 3.366 dB PSNR. Ablation studies show that the dominant gain originates from the expanded training corpus and the two-stage optimization schedule, and self-ensemble provides marginal but consistent improvement.

Chinese Translation

本文提出了我们针对NTIRE 2026图像去噪挑战赛（固定噪声水平σ=50的高斯彩色图像去噪）的解决方案。我们并未提出新的恢复骨干网络，而是从两个互补方向重新审视成熟的Restormer架构的性能边界：更强的数据驱动训练和更完整的测试时能力释放。基于公开的Restormer σ=50基线，我们通过引入更大且多样化的公开图像数据集扩展了标准的多数据集训练方案，并将优化过程组织为两个阶段。在推理阶段，我们应用了8倍几何自集成（self-ensemble）以进一步释放模型容量。为保持实现一致性，保留了TLC风格的局部推理包装器；然而系统性消融实验表明，在此设置下其定量贡献可忽略不计。在包含100张图像的挑战验证集上，我们的最终提交实现了30.762 dB的PSNR和0.861的SSIM，相较于公开的Restormer σ=50预训练基线最高提升了3.366 dB的PSNR。消融研究表明，主要性能提升来源于扩展的训练语料库和两阶段优化策略，自集成则带来了边际但稳定的改进。

View on arXiv Download PDF AI Translation

cs.CV / 258 / 2604.11470

Degradation-Aware and Structure-Preserving Diffusion for Real-World Image Super-Resolution

面向真实世界图像超分辨率的降质感知与结构保持扩散方法

Ji, Yang, Chen, Zonghao, Xue, Zhihao, Hu, Junqin

Abstract

Real-world image super-resolution is particularly challenging for diffusion models because real degradations are complex, heterogeneous, and rarely modeled explicitly. We propose a degradation-aware and structure-preserving diffusion framework for real-world SR. Specifically, we introduce Degradation-aware Token Injection, which encodes lightweight degradation statistics from low-resolution inputs and fuses them with semantic conditioning features, enabling explicit degradation-aware restoration. We further propose Spatially Asymmetric Noise Injection, which modulates diffusion noise with local edge strength to better preserve structural regions during training. Both modules are lightweight add-ons to the adopted diffusion SR framework, requiring only minor modifications to the conditioning pipeline. Experiments on DIV2K and RealSR show that our method delivers competitive no-reference perceptual quality and visually more realistic restoration results than recent baselines, while maintaining a favorable perception--distortion trade-off. Ablations confirm the effectiveness of each module and their complementary gains when combined. The code and model are publicly available at https://github.com/jiyang0315/DASP-SR.git.

Chinese Translation

真实世界图像超分辨率对扩散模型而言尤为具有挑战性，因为真实降质过程复杂、多样且很少被明确建模。我们提出了一种面向真实世界超分辨率的降质感知与结构保持扩散框架。具体而言，我们引入了降质感知令牌注入（Degradation-aware Token Injection），该方法从低分辨率输入中编码轻量级的降质统计信息，并将其与语义条件特征融合，实现了显式的降质感知恢复。我们进一步提出了空间非对称噪声注入（Spatially Asymmetric Noise Injection），通过局部边缘强度调制扩散噪声，以更好地在训练过程中保持结构区域。两个模块均为所采用的扩散超分辨率框架的轻量级附加组件，仅需对条件管线进行少量修改。基于DIV2K和RealSR数据集的实验表明，我们的方法在无参考感知质量上具有竞争力，且恢复结果在视觉上更为真实，优于近期基线方法，同时保持了良好的感知-失真权衡。消融实验验证了各模块的有效性及其组合时的互补增益。代码和模型已公开，地址为https://github.com/jiyang0315/DASP-SR.git。

View on arXiv Download PDF AI Translation

cs.CV / 259 / 2604.11484

PACO: Proxy-Task Alignment and Online Calibration for On-the-Fly Category Discovery

PACO：代理任务对齐与在线校准的即时类别发现方法

Tang, Weidong, Zhang, Bohan, Chi, Zhixiang, Wu, ZiZhang, Wang, Yang, Wu, Yanan

Abstract

On-the-Fly Category Discovery (OCD) requires a model, trained on an offline support set, to recognize known classes while discovering new ones from an online streaming sequence. Existing methods focus heavily on offline training. They aim to learn discriminative representations on the support set so that novel classes can be separated at test time. However, their discovery mechanism at inference is typically reduced to a single threshold. We argue that this paradigm is fundamentally flawed as OCD is not a static classification problem, but a dynamic process. The model must continuously decide 1) whether a sample belongs to a known class, 2) matches an existing novel category, or 3) should initiate a new one. Moreover, prior methods treat the support set as fixed knowledge. They do not update their decision boundaries as new evidence arrives during inference. This leads to unstable and inconsistent category formation. Our experiments confirm these issues. With properly calibrated and adaptive thresholds, substantial improvements can be achieved, even without changing the representation. Motivated by this, we propose PACO, a support-set-calibrated, tree-structured online decision framework. The framework models inference as a sequence of hierarchical decisions, including known-class routing, birth-aware novel assignment, and attach-versus-create operations over a dynamic prototype memory. Furthermore, we simulate the proxy discovery process to initialize the thresholds during offline training to align with inference. Thresholds are continuously updated during inference using mature novel prototypes. Importantly, PACO requires no heavy training and no dataset-specific tuning. It can be directly integrated into existing OCD pipelines as an inference-time module. Extensive experiments show significant improvements over SOTA baselines across seven benchmarks.

Chinese Translation

即时类别发现（On-the-Fly Category Discovery，OCD）要求模型在离线支持集上训练后，能够识别已知类别，同时从在线流数据中发现新类别。现有方法主要侧重于离线训练，旨在支持集上学习判别性表示，以便在测试时区分新类别。然而，它们在推理阶段的发现机制通常简化为单一阈值。我们认为这一范式存在根本缺陷，因为OCD不是静态分类问题，而是动态过程。模型必须持续决定：1）样本是否属于已知类别，2）是否匹配现有新类别，或3）是否应创建新类别。此外，先前方法将支持集视为固定知识，未能在推理过程中随着新证据的到来更新决策边界，导致类别形成不稳定且不一致。我们的实验验证了这些问题。通过适当校准和自适应阈值，即使不改变表示，也能实现显著提升。基于此，我们提出了PACO，一种支持集校准的树状结构在线决策框架。该框架将推理建模为一系列层级决策，包括已知类别路由、考虑新类别生成的分配，以及基于动态原型记忆的附加或创建操作。此外，我们在离线训练阶段模拟代理发现过程以初始化阈值，使其与推理阶段对齐。阈值在推理过程中利用成熟的新类别原型持续更新。重要的是，PACO无需大量训练和数据集特定调优，可作为推理模块直接集成到现有OCD流程中。大量实验表明，PACO在七个基准测试上均显著优于最先进基线方法。

View on arXiv Download PDF AI Translation

cs.CV / 260 / 2604.11487

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild

NTIRE 2026 野外鲁棒AI生成图像检测挑战赛

Gushchin, Aleksandr, Abud, Khaled, Shumitskaya, Ekaterina, Filippov, Artem, Bychkov, Georgii, Lavrushkin, Sergey, Erofeev, Mikhail, Antsiferova, Anastasia, Chen, Changsheng, Tan, Shunquan, Timofte, Radu, Vatolin, Dmitry, Song, Chuanbiao, Yu, Zijian, Tan, Hao, Lan, Jun, Yang, Zhiqiang, Tang, Yongwei, Wu, Zhiqiang, Seow, Jia Wen, Koay, Hong Vin, Ren, Haodong, Xu, Feng, Chen, Shuai, Xia, Ruiyang, Zhang, Qi, Xu, Yaowen, Zou, Zhaofan, Sun, Hao, Lu, Dagong, Yao, Mufeng, Xu, Xinlei, Wu, Fei, Guo, Fengjun, Luo, Cong, Sharma, Hardik, Negi, Aashish, Shaily, Prateek, Kumar, Jayant, Chaudhary, Sachin, Dudhane, Akshay, Hambarde, Praful, Shukla, Amit, Tu, Zhilin, Li, Fengpeng, Zhang, Jiamin, Fei, Jianwei, Li, Kemou, Wu, Haiwei, Benjdira, Bilel, Ali, Anas M., Boulila, Wadii, Qu, Chenfan, Li, Junchi

Abstract

This paper presents an overview of the NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, held in conjunction with the NTIRE workshop at CVPR 2026. The goal of this challenge was to develop detection models capable of distinguishing real images from generated ones in realistic scenarios: the images are often transformed (cropped, resized, compressed, blurred) for practical usage, and therefore, the detection models should be robust to such transformations. The challenge is based on a novel dataset consisting of 108,750 real and 185,750 AI-generated images from 42 generators comprising a large variety of open-source and closed-source models of various architectures, augmented with 36 image transformations. Methods were evaluated using ROC AUC on the full test set, including both transformed and untransformed images. A total of 511 participants registered, with 20 teams submitting valid final solutions. This report provides a comprehensive overview of the challenge, describes the proposed solutions, and can be used as a valuable reference for researchers and practitioners in increasing the robustness of the detection models to real-world transformations.

Chinese Translation

本文介绍了与CVPR 2026 NTIRE研讨会联合举办的NTIRE 2026野外鲁棒AI生成图像检测挑战赛的概况。该挑战赛旨在开发能够在真实场景中区分真实图像与生成图像的检测模型：由于图像在实际应用中常常经过裁剪、调整大小、压缩、模糊等变换，检测模型需对这些变换具有鲁棒性。挑战赛基于一个新颖的数据集，包含108,750张真实图像和185,750张由42个生成器生成的AI图像，这些生成器涵盖多种开源和闭源模型及多样的架构，数据集还包含36种图像变换。参赛方法通过在包含变换和未变换图像的完整测试集上使用ROC AUC指标进行评估。共有511名参与者注册，20支队伍提交了有效的最终方案。本文报告全面回顾了该挑战赛，介绍了所提出的解决方案，可为研究人员和实践者提升检测模型对现实变换的鲁棒性提供重要参考。

View on arXiv Download PDF AI Translation

cs.CV / 261 / 2604.11496

Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

重新审视双编码视觉-语言模型中的组合性：推理的角色

Miranda, Imanol, Salaberria, Ander, Agirre, Eneko, Azkune, Gorka

Abstract

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.

Chinese Translation

双编码视觉-语言模型（VLMs），如CLIP，常常被描述为词袋系统，因为它们在组合基准测试中的表现较差。我们认为，这一局限性可能更多地源于基于全局余弦相似度的标准推理协议，而非表示能力的不足。首先，通过控制的诊断实验，我们表明，在推理过程中明确强制细粒度区域段对齐显著提高了组合性能，而无需更新预训练编码器。然后，我们引入了一种轻量级变换器，直接从冻结的补丁和标记嵌入中学习这种对齐。与完全微调和先前的端到端组合训练方法相比，我们发现尽管这些方法在领域内检索中有所改善，但其增益在分布转移下并不一致地转移。相反，在冻结表示上学习局部对齐在领域内检索中与完全微调相匹配，同时在受控的领域外组合基准测试中带来了显著的改善。这些结果确定了全局嵌入匹配作为双编码VLMs中的关键瓶颈，并强调了对齐机制在稳健组合泛化中的重要性。

View on arXiv Download PDF AI Translation

cs.CV / 262 / 2604.11498

TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition

TAG-Head：用于即插即用细粒度动作识别的时间对齐图结构头

Hassan, Imtiaz Ul, Bessis, Nik, Behera, Ardhendu

Abstract

Fine-grained human action recognition (FHAR) is challenging because visually similar actions differ by subtle spatio-temporal cues. Many recent systems enhance discriminability with extra modalities (e.g., pose, text, optical flow), but this increases annotation burden and computational cost. We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. Our pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph in which (i) fully-connected intra-frame edges to resolve subtle appearance differences within frames, and (ii) time-aligned temporal edges that connect features at the same spatial location across frames to stabilise motion cues without over-smoothing. The head is compact (little parameter/FLOP overhead), plug-and-play across backbones, and trained end-to-end with the backbone. Extensive evaluations on FineGym (Gym99 and Gym288) and HAA500 show that TAG-Head sets a new state-of-the-art among RGB-only models and surpasses many recent multimodal approaches (video + pose + text) that rely on privileged information. Ablations disentangle the contributions of the Transformer and the graph topology, and complexity analyses confirm low latency. TAG-Head advances FHAR by explicitly coupling global context with high-resolution spatial interactions and low-variance temporal continuity inside a slim, composable graph head. The simplicity of the design enables straightforward adoption in practical systems that favour RGB-only sensors, while delivering performance gains typically associated with heavier or multimodal models. Code will be released on GitHub.

Chinese Translation

细粒度人体动作识别（FHAR）具有挑战性，因为视觉上相似的动作仅通过细微的时空线索区分。许多最新系统通过引入额外模态（如姿态、文本、光流）来增强判别能力，但这增加了标注负担和计算成本。我们提出了TAG-Head，一种轻量级时空图结构头，利用仅RGB信息升级标准3D骨干网络（如SlowFast、R(2+1)D-34、I3D等）以实现FHAR。我们的流程首先对骨干网络输出的tokens应用带有可学习3D位置编码的Transformer编码器，捕捉跨空间和时间的长距离依赖。随后，利用图结构对特征进行精炼，该图(i)在帧内建立全连接边以解决帧内细微外观差异，(ii)通过时间对齐的时序边连接跨帧相同空间位置的特征，以稳定运动线索且避免过度平滑。该头部结构紧凑（参数和计算开销极小），支持跨骨干网络即插即用，并与骨干网络端到端训练。在FineGym（Gym99和Gym288）及HAA500上的大量评测表明，TAG-Head在仅使用RGB的模型中创下新状态，且超越了许多依赖特权信息的视频+姿态+文本多模态方法。消融实验解析了Transformer和图拓扑结构的贡献，复杂度分析验证了低延迟。TAG-Head通过在轻量且可组合的图结构头中显式结合全局上下文、高分辨率空间交互和低方差时间连续性，推动了FHAR的发展。设计简洁使其易于在偏好仅用RGB传感器的实际系统中采用，同时带来通常仅由更重或多模态模型实现的性能提升。代码将于GitHub发布。

View on arXiv Download PDF AI Translation

cs.CV / 263 / 2604.11530

SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models

SVD-Prune：无训练的视觉语言模型高效Token剪枝方法

Apedo, Yvon, Poreba, Martyna, Szczepanski, Michal, Bouchafa, Samia

Abstract

Vision-Language Models (VLM) have revolutionized multimodal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a trainingfree, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-K tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.

Chinese Translation

视觉语言模型（Vision-Language Models，VLM）通过联合处理视觉和文本信息，革新了多模态学习。然而，由于处理长序列视觉Token所需的高计算和内存开销，这些模型面临重大挑战。许多现有方法依赖于局部启发式策略，如注意力分数或Token范数，但这些标准存在位置偏差和信息分散问题，限制了在高剪枝率下保留关键信息的能力，导致在视觉细节丰富的图像上性能下降。为解决上述问题，我们提出了SVD-Prune，一种基于奇异值分解（Singular Value Decomposition）的无训练、即插即用Token剪枝方法。该方法对视觉Token特征矩阵进行分解，并利用统计杠杆分数（statistical leverage scores）选择Top-K Token，确保仅保留对主导全局方差贡献最大的Token。实验结果表明，SVD-Prune在极端视觉Token预算下持续优于先前剪枝方法，即使在仅保留32和16个视觉Token时仍保持强劲性能。

View on arXiv Download PDF AI Translation

cs.CV / 264 / 2604.11539

CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

CLAY：视觉-语言嵌入空间中的条件视觉相似性调制

Lim, Sohwi, Hyoseok, Lee, Park, Jungjoon, Oh, Tae-Hyun

Abstract

Human perception of visual similarity is inherently adaptive and subjective, depending on the users' interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to previous works.

Chinese Translation

人类对视觉相似性的感知本质上是自适应且主观的，依赖于用户的兴趣和关注点。然而，大多数图像检索系统未能体现这种灵活性，依赖于固定的单一度量，无法同时融合多重条件。为此，我们提出了CLAY，一种自适应相似性计算方法，将预训练视觉-语言模型（VLMs）的嵌入空间重新构建为文本条件化的相似性空间，无需额外训练。该设计将文本条件化过程与视觉特征提取分离，允许在固定视觉嵌入的基础上实现高效且多条件的检索。我们还构建了合成评估数据集CLAY-EVAL，以便在多样化条件检索设置下进行全面评估。在标准数据集及我们提出的数据集上的实验表明，CLAY在检索准确率和计算效率方面均优于以往方法。

View on arXiv Download PDF AI Translation

cs.CV / 265 / 2604.11559

Progressively Texture-Aware Diffusion for Contrast-Enhanced Sparse-View CT

渐进式纹理感知扩散用于增强对比稀疏视角CT重建

Wang, Tianqi, Du, Wenchao, Yang, Hongyu

Abstract

Diffusion-based sparse-view CT (SVCT) imaging has achieved remarkable advancements in recent years, thanks to its more stable generative capability. However, recovering reliable image content and visually consistent textures is still a crucial challenge. In this paper, we present a Progressively Texture-aware Diffusion (PTD) model, a coarse-to-fine learning framework tailored for SVCT. Specifically, PTD comprises a basic reconstructive module PTD$_{\textit{rec}}$ and a conditional diffusion module PTD$_{\textit{diff}}$. PTD$_{\textit{rec}}$ first learns a deterministic mapping to recover the majority of the underlying low-frequency signals (i.e., coarse content with smoothed textures), which serves as the initial estimation to enable fidelity. Moreover, PTD$_{\textit{diff}}$ aims to reconstruct high-fidelity details for coarse prediction, which explores a dual-domain guided conditional diffusion to generate reliable and consistent textures. Extensive experiments on sparse-view CT reconstruction demonstrate that our PTD achieves superior performance in terms of structure similarity and visual appeal with only a few sampling steps, which mitigates the randomness inherent in general diffusion models and enables a better trade-off between visual quality and fidelity of high-frequency details.

Chinese Translation

基于扩散的稀疏视角CT（SVCT）成像近年来取得了显著进展，得益于其更稳定的生成能力。然而，恢复可靠的图像内容和视觉一致的纹理仍然是一个关键挑战。本文提出了一种渐进式纹理感知扩散（Progressively Texture-aware Diffusion，PTD）模型，一种针对SVCT设计的由粗到细的学习框架。具体而言，PTD包括基础重建模块PTD_{rec}和条件扩散模块PTD_{diff}。PTD_{rec}首先学习一个确定性映射以恢复大部分潜在的低频信号（即具有平滑纹理的粗略内容），作为初始估计以保证保真度。此外，PTD_{diff}旨在为粗略预测重建高保真细节，通过探索双域引导的条件扩散生成可靠且一致的纹理。在稀疏视角CT重建上的大量实验表明，我们的PTD在结构相似性和视觉效果方面均表现出优越性能，且仅需少量采样步骤，有效缓解了通用扩散模型固有的随机性，实现了视觉质量与高频细节保真度之间的更佳平衡。

View on arXiv Download PDF AI Translation

cs.CV / 266 / 2604.11562

The Impact of Federated Learning on Distributed Remote Sensing Archives

联邦学习对分布式遥感档案的影响

Umashankar, Anand, Tomotaki-Dawoud, Karam, Schneider, Nicolai

Abstract

Remote sensing archives are inherently distributed: Earth observation missions such as Sentinel-1, Sentinel-2, and Sentinel-3 have collectively accumulated more than 5 petabytes of imagery, stored and processed across many geographically dispersed platforms. Training machine learning models on such data in a centralized fashion is impractical due to data volume, sovereignty constraints, and geographic distribution. Federated learning (FL) addresses this by keeping data local and exchanging only model updates. A central challenge for remote sensing is the non-IID nature of Earth observation data: label distributions vary strongly by geographic region, degrading the convergence of standard FL algorithms. In this paper, we conduct a systematic empirical study of three FL strategies -- FedAvg, FedProx, and bulk synchronous parallel (BSP) -- applied to multi-label remote sensing image classification under controlled non-IID label-skew conditions. We evaluate three convolutional neural network (CNN) architectures of increasing depth (LeNet, AlexNet, and ResNet-34) and analyze the joint effect of algorithm choice, model capacity, client fraction, client count, batch size, and communication cost. Experiments on the UC Merced multi-label dataset show that FedProx outperforms FedAvg for deeper architectures under data heterogeneity, that BSP approaches centralized accuracy at the cost of high sequential communication, and that LeNet provides the best accuracy-communication trade-off for the dataset scale considered.

Chinese Translation

遥感档案本质上是分布式的：地球观测任务如 Sentinel-1、Sentinel-2 和 Sentinel-3 已累计超过 5 PB 的影像，这些数据存储和处理在多个地理分散的平台上。由于数据量、主权限制和地理分布，集中方式训练机器学习模型在这种数据上是不切实际的。联邦学习（Federated Learning, FL）通过保持数据本地并仅交换模型更新来解决这一问题。遥感领域的一个中心挑战是地球观测数据的非独立同分布（non-IID）特性：标签分布因地理区域而异，降低了标准 FL 算法的收敛性。本文对三种 FL 策略——FedAvg、FedProx 和大规模同步并行（bulk synchronous parallel, BSP）——在受控的非 IID 标签偏斜条件下应用于多标签遥感图像分类进行了系统的实证研究。我们评估了三种深度逐渐增加的卷积神经网络（CNN）架构（LeNet、AlexNet 和 ResNet-34），并分析了算法选择、模型容量、客户端比例、客户端数量、批量大小和通信成本的联合影响。UC Merced 多标签数据集上的实验表明，在数据异构性下，FedProx 在更深的架构中优于 FedAvg，BSP 在高顺序通信成本的情况下接近集中式准确性，而 LeNet 在考虑到数据集规模时提供了最佳的准确性与通信的权衡。

View on arXiv Download PDF AI Translation

cs.CV / 267 / 2604.11564

Training-Free Model Ensemble for Single-Image Super-Resolution via Strong-Branch Compensation

无训练模型集成用于单图像超分辨率的强支路补偿

Chang, Gengjia, Ge, Xining, Yuan, Weijun, Li, Zhan, Song, Qiurong, Zhu, Luen, Liu, Shuhong

Abstract

Single-image super-resolution has progressed from deep convolutional baselines to stronger Transformer and state-space architectures, yet the corresponding performance gains typically come with higher training cost, longer engineering iteration, and heavier deployment burden. In many practical settings, multiple pretrained models with partially complementary behaviors are already available, and the binding constraint is no longer architectural capacity but how effectively their outputs can be combined without additional training. Rather than pursuing further architectural redesign, this paper proposes a training-free output-level ensemble framework. A dual-branch pipeline is constructed in which a Hybrid attention network with TLC inference provides stable main reconstruction, while a MambaIRv2 branch with geometric self-ensemble supplies strong compensation for high-frequency detail recovery. The two branches process the same low-resolution input independently and are fused in the image space via a lightweight weighted combination, without updating any model parameters or introducing an additional trainable module. As our solution to the NTIRE 2026 Image Super-Resolution ($\times 4$) Challenge, the proposed design consistently improves over the base branch and slightly exceeds the pure strong branch in PSNR at the best operating point under a unified DIV2K bicubic $\times 4$ evaluation protocol. Ablation studies confirm that output-level compensation provides a low-overhead and practically accessible upgrade path for existing super-resolution systems.

Chinese Translation

单图像超分辨率已经从深度卷积基线发展到更强的Transformer和状态空间架构，然而相应的性能提升通常伴随着更高的训练成本、更长的工程迭代时间和更重的部署负担。在许多实际应用中，已经存在多个具有部分互补行为的预训练模型，而限制因素不再是架构能力，而是如何有效地将它们的输出结合在一起而无需额外训练。本文提出了一种无训练的输出级集成框架，而不是追求进一步的架构重设计。构建了一个双支路管道，其中一个具有TLC推理的混合注意力网络提供稳定的主要重建，而一个MambaIRv2支路则通过几何自集成提供强大的高频细节恢复补偿。这两个支路独立处理相同的低分辨率输入，并通过轻量级加权组合在图像空间中融合，而无需更新任何模型参数或引入额外的可训练模块。作为我们对NTIRE 2026图像超分辨率（$ imes 4$）挑战的解决方案，所提出的设计在统一的DIV2K双三次$ imes 4$评估协议下，始终优于基础支路，并在最佳操作点上略微超过纯强支路的PSNR。消融研究确认，输出级补偿为现有超分辨率系统提供了一条低开销且实际可行的升级路径。

View on arXiv Download PDF AI Translation

cs.CV / 268 / 2604.11576

Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models

像预训练一样微调：提升视觉语言模型的零样本对抗鲁棒性

Xing, Songlong, Wang, Weijie, Zhao, Zhengyu, Gu, Jindong, Torr, Philip, Sebe, Nicu

Abstract

Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance its adversarial robustness, recent studies finetune the pretrained vision encoder of CLIP with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, resulting in reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. In this work, we propose a simple yet effective paradigm AdvFLYP, which follows the training recipe of CLIP's pretraining process when performing adversarial finetuning to the model. Specifically, AdvFLYP finetunes CLIP with adversarial images created based on image-text pairs collected from the web, and match them with their corresponding texts via a contrastive loss. To alleviate distortion of adversarial image embeddings of noisy web images, we further propose to regularise AdvFLYP by penalising deviation of adversarial image features. We show that logit- and feature-level regularisation terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show the superiority of our paradigm over mainstream practices. Our code and model weights are released at https://github.com/Sxing2/AdvFLYP.

Chinese Translation

尽管视觉语言模型如CLIP展现了令人印象深刻的零样本能力，但已有研究表明其易受对抗攻击影响。为了增强其对抗鲁棒性，近期研究通过在代理数据集（如ImageNet）上使用对抗样本微调CLIP的预训练视觉编码器，将对抗图像与正确类别标签对齐。然而，这些方法忽视了训练数据分布和学习目标的重要作用，导致零样本能力下降且鲁棒性在不同领域和数据集间的迁移能力有限。在本工作中，我们提出了一种简单而有效的范式AdvFLYP，该方法在对模型进行对抗微调时遵循CLIP预训练过程的训练方案。具体而言，AdvFLYP利用基于从网络收集的图文对生成的对抗图像进行微调，并通过对比损失将其与对应文本匹配。为缓解噪声网络图像中对抗图像嵌入的扭曲，我们进一步提出通过惩罚对抗图像特征的偏离来正则化AdvFLYP。我们展示了对数输出层和特征层的正则化项分别有利于提升鲁棒性和干净样本的准确率。在涵盖多个领域的14个下游数据集上的大量实验表明，我们的方法优于主流实践。我们的代码和模型权重已发布于https://github.com/Sxing2/AdvFLYP。

View on arXiv Download PDF AI Translation

cs.CV / 269 / 2604.11579

Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions

通过触觉感知：基于触觉驱动的材料区域视觉定位

Kim, Seongyu, Lee, Seungwoo, Ryu, Hyeonggon, Chung, Joon Son, Senocak, Arda

Abstract

We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local visuo-tactile alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.

Chinese Translation

我们解决了触觉定位的问题，目标是识别与触觉输入具有相同材料属性的图像区域。现有的视觉-触觉方法依赖于全局对齐，因此无法捕捉到完成此任务所需的细粒度局部对应关系。现有数据集的挑战在于，它们主要包含特写、低多样性的图像。我们提出了一种模型，通过密集的跨模态特征交互学习局部视觉-触觉对齐，生成用于触觉条件材料分割的触觉显著性图。为了克服数据集的限制，我们引入了：（i）在自然环境中拍摄的多材料场景图像，以扩展视觉多样性，以及（ii）一种材料多样性配对策略，将每个触觉样本与视觉上多样但触觉上一致的图像对齐，从而提高上下文定位的准确性和对微弱信号的鲁棒性。我们还构建了两个新的触觉基础材料分割数据集以进行定量评估。在新的和现有的基准测试上进行的实验表明，我们的方法在触觉定位方面显著优于之前的视觉-触觉方法。

View on arXiv Download PDF AI Translation

cs.CV / 270 / 2604.11585

GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth

GeomPrompt：针对缺失和退化深度的RGB-D语义分割几何提示学习

Jaganathan, Krishna, Vela, Patricio

Abstract

Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present GeomPrompt, a lightweight cross-modal adaptation module that synthesizes a task-driven geometric prompt from RGB alone for the fourth channel of a frozen RGB-D semantic segmentation model, without depth supervision. We further introduce GeomPrompt-Recovery, an adaptation module that compensates for degraded depth by predicting the fourth channel correction relevant for the frozen segmenter. Both modules are trained solely with downstream segmentation supervision, enabling recovery of the geometric prior useful for segmentation, rather than estimating depth signals. On SUN RGB-D, GeomPrompt improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion, while remaining competitive with strong monocular depth estimators. For degraded depth, GeomPrompt-Recovery consistently improves robustness, yielding gains up to +3.6 mIoU under severe depth corruptions. GeomPrompt is also substantially more efficient than monocular depth baselines, reaching 7.8 ms latency versus 38.3 ms and 71.9 ms. These results suggest that task-driven geometric prompting is an efficient mechanism for cross-modal compensation under missing and degraded depth inputs in RGB-D perception.

Chinese Translation

机器人和具身人工智能的多模态感知系统通常假设RGB-D传感可靠，但实际上深度信息常常缺失、噪声多或被破坏。为此，我们提出了GeomPrompt，一种轻量级的跨模态适配模块，仅通过RGB图像合成任务驱动的几何提示，用作冻结的RGB-D语义分割模型的第四通道，无需深度监督。我们进一步引入GeomPrompt-Recovery，一种适配模块，通过预测与冻结分割器相关的第四通道修正，来补偿退化的深度信息。两个模块均仅通过下游分割监督进行训练，实现了恢复对分割有用的几何先验，而非估计深度信号。在SUN RGB-D数据集上，GeomPrompt在DFormer模型上相比仅用RGB推理提升了6.1 mIoU，在GeminiFusion模型上提升了3.0 mIoU，同时性能与强大的单目深度估计器相当。对于退化深度，GeomPrompt-Recovery持续提升鲁棒性，在严重深度损坏情况下带来最高3.6 mIoU的增益。GeomPrompt在效率上也远超单目深度基线，延迟仅为7.8毫秒，而后者分别为38.3毫秒和71.9毫秒。这些结果表明，任务驱动的几何提示是一种在RGB-D感知中应对缺失和退化深度输入的高效跨模态补偿机制。

View on arXiv Download PDF AI Translation

cs.CV / 271 / 2604.11589

MLLM-as-a-Judge Exhibits Model Preference Bias

MLLM作为评判者表现出模型偏好偏差

Koyama, Shuitsu, Wada, Yuiga, Yashima, Daichi, Sugiura, Komei

Abstract

Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused connectors and overlapping instruction-tuning resources. Finally, we introduce a simple ensemble of MLLMs, Pomms. Our results demonstrated that Pomms effectively mitigated the model-specific preference bias while maintaining performance.

Chinese Translation

使用多模态大型语言模型（MLLMs）进行自动评估，通常称为MLLM作为评判者，已广泛用于衡量模型性能。如果这种MLLM作为评判者的方法存在偏见，可能会扭曲模型比较和基于基准的科学进展。然而，目前尚不清楚MLLM作为评判者的方法在多大程度上偏向或不偏向由特定MLLM生成的文本。在本研究中，我们提出了Philautia-Eval来调查这种特定模型的偏好偏差。Philautia-Eval通过将偏好倾向与生成质量的差异分离，量化偏差的程度。通过使用从12个MLLM收集的129万对标题-评分数据，我们发现代表性的MLLM往往表现出自我偏好偏差。此外，实验结果表明，在特定模型家族中存在相互偏好偏差，这可能是由重复使用的连接器和重叠的指令调优资源驱动的。最后，我们引入了一种简单的MLLM集成模型Pomms。我们的结果表明，Pomms有效地减轻了模型特定的偏好偏差，同时保持了性能。

View on arXiv Download PDF AI Translation

cs.CV / 272 / 2604.11590

Learning Robustness at Test-Time from a Non-Robust Teacher

从非鲁棒教师中学习测试时的鲁棒性

Bianchettin, Stefano, Rossolini, Giulio, Buttazzo, Giorgio

Abstract

Nowadays, pretrained models are increasingly used as general-purpose backbones and adapted at test-time to downstream environments where target data are scarce and unlabeled. While this paradigm has proven effective for improving clean accuracy on the target domain, adversarial robustness has received far less attention, especially when the original pretrained model is not explicitly designed to be robust. This raises a practical question: \emph{can a pretrained, non-robust model be adapted at test-time to improve adversarial robustness on a target distribution?} To face this question, this work studies how adversarial training strategies behave when integrated into adaptation schemes for the unsupervised test-time setting, where only a small set of unlabeled target samples is available. It first analyzes how classical adversarial training formulations can be extended to this scenario, showing that straightforward distillation-based adaptations remain unstable and highly sensitive to hyperparameter tuning, particularly when the teacher itself is non-robust. To address these limitations, the work proposes a label-free framework that uses the predictions of a non-robust teacher model as a semantic anchor for both the clean and adversarial objectives during adaptation. We further provide theoretical insights showing that our formulation yields a more stable alternative to the self-consistency-based regularization commonly used in classical adversarial training. Experiments evaluate the proposed approach on CIFAR-10 and ImageNet under induced photometric transformations. The results support the theoretical insights by showing that the proposed approach achieves improved optimization stability, lower sensitivity to parameter choices, and a better robustness-accuracy trade-off than existing baselines in this post-deployment test-time setting.

Chinese Translation

如今，预训练模型越来越多地被用作通用骨干网络，并在测试时适应目标数据稀缺且未标记的下游环境。虽然这一范式已被证明能够有效提高目标领域的干净准确性，但对对抗鲁棒性的关注却远远不够，尤其是当原始预训练模型并未明确设计为鲁棒时。这引发了一个实际问题： extit{一个预训练的非鲁棒模型能否在测试时适应以提高目标分布上的对抗鲁棒性？} 为了应对这一问题，本研究探讨了对抗训练策略在无监督测试时设置中与适应方案结合时的表现，在这种情况下，仅有一小部分未标记的目标样本可用。研究首先分析了经典对抗训练公式如何扩展到这一场景，表明直接基于蒸馏的适应方法仍然不稳定，并且对超参数调优高度敏感，尤其是当教师模型本身是非鲁棒时。为了解决这些局限性，本文提出了一种无标签框架，利用非鲁棒教师模型的预测作为适应过程中干净和对抗目标的语义锚点。我们进一步提供了理论见解，表明我们的公式为经典对抗训练中常用的自一致性正则化提供了更稳定的替代方案。实验在CIFAR-10和ImageNet上评估了所提方法在诱导光度变换下的表现。结果支持了理论见解，表明所提方法在优化稳定性、对参数选择的低敏感性以及在这一后部署测试时设置中相比现有基线更好的鲁棒性-准确性权衡方面取得了改进。

View on arXiv Download PDF AI Translation

cs.CV / 273 / 2604.11600

Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language

Geoparsing：基于统一形式语言的平面与立体几何图解析

Wang, Peijie, Zhang, Ming-Liang, Cao, Jun, Deng, Chao, Ran, Dekang, Sun, Hongda, Bu, Pi, Zhang, Xuan, Wang, Yingyao, Song, Jun, Zheng, Bo, Yin, Fei, Liu, Cheng-Lin

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to the perception bottleneck regarding fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry which requires spatial understanding remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. To ensure syntactic correctness and geometric consistency, we propose a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that our parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs' capabilities for downstream geometry reasoning tasks. Our data and code are available at Geoparsing.

Chinese Translation

多模态大型语言模型（MLLMs）已取得显著进展，但在几何推理方面仍面临挑战，主要原因在于对细粒度视觉元素的感知瓶颈。尽管形式语言促进了平面几何的理解，但需要空间理解的立体几何仍然鲜有研究。本文针对该挑战，设计了一种融合平面与立体几何的统一形式语言，全面涵盖几何结构与语义关系。我们构建了GDP-29K数据集，包含2万条平面几何样本和9千条立体几何样本，均来自多样的真实世界来源，并配有对应的真实形式描述。为确保语法正确性与几何一致性，我们提出了一种结合监督微调与基于可验证奖励的强化学习的训练范式。实验结果表明，该方法实现了最先进的解析性能。此外，我们证明解析得到的形式描述作为关键的认知支架，显著提升了MLLMs在后续几何推理任务中的能力。我们的数据与代码已在Geoparsing平台公开。

View on arXiv Download PDF AI Translation

cs.CV / 274 / 2604.11627

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

POINTS-Long：自适应双模式视觉推理在多模态大语言模型中的应用

Wang, Haicheng, Liu, Yuan, Liu, Yikun, Yu, Zhemeng, Zhao, Zhongyin, You, Yangxiu, Yu, Zilin, Tian, Le, Zhou, Xiao, Zhou, Jie, Xie, Weidi, Wang, Yanfeng

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences--especially in long-video and streaming scenarios--poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40-1/10th of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.

Chinese Translation

多模态大语言模型（MLLMs）最近在跨模态理解和生成方面展示了显著的能力。然而，视觉标记序列的快速增长——尤其是在长视频和流媒体场景中——对其可扩展性和实际应用构成了重大挑战。因此，我们提出了POINTS-Long，这是一种具有动态视觉标记缩放功能的原生双模式MLLM，灵感来源于人类视觉系统。该模型支持两种互补的感知模式：聚焦模式和待命模式，使用户能够在推理过程中动态权衡效率和准确性。在细粒度视觉任务中，聚焦模式保持了最佳性能，而在长篇一般视觉理解中，待命模式仅使用1/40到1/10的视觉标记便能保留97.7%至99.7%的原始准确性。此外，POINTS-Long通过动态可拆卸的KV-cache设计原生支持流媒体视觉理解，允许高效维护超长视觉记忆。我们的工作为未来MLLM的设计提供了新的见解，并为自适应和高效的长篇视觉理解奠定了基础。

View on arXiv Download PDF AI Translation

cs.CV / 275 / 2604.11636

MorphoFlow: Sparse-Supervised Generative Shape Modeling with Adaptive Latent Relevance

MorphoFlow：基于自适应潜在相关性的稀疏监督生成形状建模

Karanam, Mokshagna Sai Teja, Kataria, Tushar, Elhabian, Shireen

Abstract

Statistical shape modeling (SSM) is central to population level analysis of anatomical variability, yet most existing approaches rely on densely annotated segmentations and fixed latent representations. These requirements limit scalability and reduce flexibility when modeling complex anatomical variation. We introduce MorphoFlow, a sparse supervised generative shape modeling framework that learns compact probabilistic shape representations directly from sparse surface annotations. MorphoFlow integrates neural implicit shape representations with an autodecoder formulation and autoregressive normalizing flows to learn an expressive probabilistic density over the latent shape space. The neural implicit representation enables resolution-agnostic modeling of 3D anatomy, while the autodecoder formulation supports direct optimization of per-instance latent codes under sparse supervision. The autoregressive flow captures the distribution of latent anatomical variability providing a tractable, likelihood-based generative model of shapes. To promote compact and structured latent representations, we incorporate adaptive latent relevance weighting through sparsity-inducing priors, enabling the model to regulate the contribution of individual latent dimensions according to their relevance to the underlying anatomical variation while preserving generative expressivity. The resulting latent space supports uncertainty quantification and anatomically plausible shape synthesis without manual latent dimensionality tuning. Evaluation on publicly available lumbar vertebrae and femur datasets demonstrates accurate high-resolution reconstruction from sparse inputs and recovery of structured modes of anatomical variation consistent with population level trends.

Chinese Translation

统计形状建模（SSM）是解剖变异群体水平分析的核心，但现有大多数方法依赖于密集标注的分割和固定的潜在表示。这些要求限制了模型的可扩展性，并在建模复杂解剖变异时降低了灵活性。我们提出了MorphoFlow，一种稀疏监督的生成形状建模框架，能够直接从稀疏的表面标注中学习紧凑的概率形状表示。MorphoFlow结合了神经隐式形状表示、自动解码器（autodecoder）结构和自回归归一化流（autoregressive normalizing flows），以学习潜在形状空间上的表达性概率密度。神经隐式表示支持对三维解剖结构的分辨率无关建模，而自动解码器结构支持在稀疏监督下直接优化每个实例的潜在编码。自回归流捕捉潜在解剖变异的分布，提供了可解的基于似然的形状生成模型。为了促进紧凑且结构化的潜在表示，我们通过稀疏诱导先验引入自适应潜在相关性加权，使模型能够根据潜在维度对基础解剖变异的重要性调节其贡献，同时保持生成表达能力。所得潜在空间支持不确定性量化和符合解剖学合理性的形状合成，无需手动调整潜在维度。通过对公开的腰椎和股骨数据集的评估，证明了该方法能从稀疏输入中实现高精度高分辨率重建，并恢复与群体水平趋势一致的结构化解剖变异模式。

View on arXiv Download PDF AI Translation

cs.CV / 276 / 2604.11637

STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding

STS-Mixer：用于4D点云视频理解的时空谱混合器

Li, Wenhao, Jiang, Xueying, Zhang, Gongjie, Zhang, Xiaoqin, Shao, Ling, Lu, Shijian

Abstract

4D point cloud videos capture rich spatial and temporal dynamics of scenes which possess unique values in various 4D understanding tasks. However, most existing methods work in the spatiotemporal domain where the underlying geometric characteristics of 4D point cloud videos are hard to capture, leading to degraded representation learning and understanding of 4D point cloud videos. We address the above challenge from a complementary spectral perspective. By transforming 4D point cloud videos into graph spectral signals, we can decompose them into multiple frequency bands each of which captures distinct geometric structures of point cloud videos. Our spectral analysis reveals that the decomposed low-frequency signals capture more coarse shapes while high-frequency signals encode more fine-grained geometry details. Building on these observations, we design Spatio-Temporal-Spectral Mixer (STS-Mixer), a unified framework that mixes spatial, temporal, and spectral representations of point cloud videos. STS-Mixer integrates multi-band delineated spectral signals with spatiotemporal information to capture rich geometries and temporal dynamics, while enabling fine-grained and holistic understanding of 4D point cloud videos. Extensive experiments show that STS-Mixer achieves superior performance consistently across multiple widely adopted benchmarks on both 3D action recognition and 4D semantic segmentation tasks. Code and models are available at https://github.com/Vegetebird/STS-Mixer.

Chinese Translation

4D点云视频捕捉了场景丰富的时空动态，这在各种4D理解任务中具有独特的价值。然而，大多数现有方法在时空域中工作，难以捕捉4D点云视频的潜在几何特征，导致表示学习和理解的效果下降。我们从互补的谱视角解决了上述挑战。通过将4D点云视频转换为图谱信号，我们可以将其分解为多个频带，每个频带捕捉点云视频的不同几何结构。我们的谱分析表明，分解后的低频信号捕捉到更粗糙的形状，而高频信号则编码了更细致的几何细节。基于这些观察，我们设计了时空谱混合器（Spatio-Temporal-Spectral Mixer，STS-Mixer），这是一个统一框架，混合点云视频的空间、时间和谱表示。STS-Mixer将多频带划分的谱信号与时空信息结合，以捕捉丰富的几何形状和时序动态，同时实现对4D点云视频的细粒度和整体理解。大量实验表明，STS-Mixer在3D动作识别和4D语义分割任务的多个广泛采用的基准上始终实现了优越的性能。代码和模型可在 https://github.com/Vegetebird/STS-Mixer 获取。

View on arXiv Download PDF AI Translation

cs.CV / 277 / 2604.11653

GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays

GazeVaLM：用于评估人工智能生成X光片临床真实性的多观察者眼动追踪基准

Wong, David, Isik, Zeynep, Wang, Bin, Tliba, Marouane, Durak, Gorkem, Keles, Elif, Aktas, Halil Ertugrul, Chetouani, Aladine, Topel, Cagdas, Gennaro, Nicolo, Vendrami, Camila Lopes, Trabzonlu, Tugce Agirlar, Rahsepar, Amir Ali, Perronne, Laetitia, Antalek, Matthew, Ozturk, Onural, Okur, Gokcan, Gordon, Andrew C., Pyrros, Ayis, Miller, Frank H., Borhani, Amir, Savas, Hatice, Hart, Eric, Krupinski, Elizabeth, Bagci, Ulas

Abstract

We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density maps, structured diagnostic labels, and authenticity judgments. We extend the protocol to 6 state-of-the-art multimodal LLMs, releasing their predicted diagnoses, authenticity labels, and confidence scores under matched conditions - enabling direct human-AI comparison at both decision and uncertainty levels. We further provide analyses of gaze agreement, inter-observer consistency, and benchmarking of radiologists versus LLMs in diagnostic accuracy and authenticity detection. GazeVaLM supports research in gaze modeling, clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification. By jointly releasing visual attention data, clinical labels, and model predictions, we aim to facilitate reproducible research on how experts and AI systems perceive, interpret, and evaluate medical images. The dataset is available at https://huggingface.co/datasets/davidcwong/GazeVaLM.

Chinese Translation

我们推出了GazeVaLM，这是一个公共眼动追踪数据集，用于研究在胸部X光片真实性评估过程中的临床感知。该数据集包含来自16位专家放射科医师对30张真实和30张合成胸部X光片（由基于扩散的生成性人工智能生成）进行解读的960个注视记录，分为两种条件：诊断评估和真伪分类（视觉图灵测试）。对于每个图像-观察者对，我们提供原始注视样本、注视图、扫描路径、显著性密度图、结构化诊断标签和真实性判断。我们将该协议扩展到6个最先进的多模态大语言模型（LLMs），发布它们在匹配条件下的预测诊断、真实性标签和置信度分数，从而实现决策和不确定性水平上的人机直接比较。我们进一步提供了注视一致性、观察者间一致性以及放射科医师与LLMs在诊断准确性和真实性检测方面的基准分析。GazeVaLM支持眼动建模、临床决策、人机比较、生成图像真实性评估和不确定性量化的研究。通过共同发布视觉注意数据、临床标签和模型预测，我们旨在促进关于专家和人工智能系统如何感知、解读和评估医学图像的可重复研究。该数据集可在https://huggingface.co/datasets/davidcwong/GazeVaLM获取。

View on arXiv Download PDF AI Translation

cs.CV / 278 / 2604.11668

UNIGEOCLIP: Unified Geospatial Contrastive Learning

UNIGEOCLIP：统一地理空间对比学习

Astruc, Guillaume, Trulls, Eduard, Hosang, Jan, Landrieu, Loic, Sarlin, Paul-Edouard

Abstract

The growing availability of co-located geospatial data spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates offers a unique opportunity for multimodal representation learning. We introduce UNIGEOCLIP, a massively multimodal contrastive framework to jointly align five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefits of holistic multimodal geospatial alignment. A reference implementation is available at https://gastruc.github.io/unigeoclip.

Chinese Translation

随着航空影像、街景视图、高程模型、文本和地理坐标等共存地理空间数据的日益丰富，为多模态表示学习提供了独特的机会。我们提出了UNIGEOCLIP，这是一种大规模多模态对比框架，旨在将五种互补的地理空间模态在单一统一的嵌入空间中进行联合对齐。与之前通过融合模态或依赖中心枢轴表示的方法不同，我们的方法执行全对全的对比对齐，使得在任意模态组合之间实现无缝比较、检索和推理成为可能。我们进一步提出了一种尺度化的经纬度编码器，通过捕捉多尺度的地理结构来改善空间表示。在下游地理空间任务中的大量实验表明，UNIGEOCLIP始终优于单模态对比模型和仅基于坐标的基线，突显了整体多模态地理空间对齐的优势。参考实现可在 https://gastruc.github.io/unigeoclip 获取。

View on arXiv Download PDF AI Translation

cs.CV / 279 / 2604.11679

Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

面向临床的脑MRI基础模型：来自FOMO25挑战的发现

Munk, Asbjørn, Cerri, Stefano, Nersesjan, Vardan, Krag, Christian Hedeager, Ambsdorf, Jakob, García, Pablo Rocamora, Machnio, Julia, Liu, Peirong, Ahn, Suhyun, Akbari, Nasrin, Khalil, Yasmina Al, Amador, Kimberly, Amirrajab, Sina, Arbel, Tal, Cuadra, Meritxell Bach, Baid, Ujjwal, Baheti, Bhakti, Banus, Jaume, Barbierik, Kamil, Brune, Christoph, Bu, Yansong, Callard, Baptiste, Chen, Yuhan, Crijnen, Cornelius, Dancette, Corentin, Drotar, Peter, Dutande, Prasad, Forkert, Nils D., Garg, Saurabh, Gazda, Jakub, Gazda, Matej, Gérin, Benoît, Ghosh, Partha, Gong, Weikang, Gordaliza, Pedro M., Hashemi, Sam, Heimann, Tobias, Jia, Fucang, Jiang, Jiexin, Kaczmarek, Emily, Kang, Chris, Kang, Seung Kwan, Khazaei, Mohammad, Khlaut, Julien, Koutsouvelis, Petros, Lee, Jae Sung, Li, Yuchong, Lyu, Mengye, Ma, Mingchen, Madabhushi, Anant, Maier-Hein, Klaus H., Manceron, Pierre, Mora, Andrés Martínez, Mazher, Moona, Meister, Felix, Molchanova, Nataliia, Niederer, Steven A., Nürnberg, Leonard, Park, Jinah, Qayyum, Abdul, Richiardi, Jonas, Saporta, Antoine, Setlak, Branislav, Shen, Ning, Szeto, Justin, Ulrich, Constantin, Vaish, Puru, Vigneshwaran, Vibujithan, Volmer, Leroy, Wang, Zihao, Wei, Siqi, Winder, Anthony, Wolterink, Jelmer M., Wynen, Maxence, Yang, Chang, Yie, Si Young, Ghazi, Mostafa Mehdipour, Pai, Akshay, Solem, Espen Jimenez, Llambias, Sebastian Nørgaard, Boesen, Mikael, Benros, Michael Eriksen, Iglesias, Juan Eugenio, Nielsen, Mads

Abstract

Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust \textit{foundation models} that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused on high-quality, research-grade data. To address this gap, we organized the FOMO25 challenge as a satellite event at MICCAI 2025. FOMO25 provided participants with a large pretraining dataset, FOMO60K, and evaluated models on data sourced directly from clinical workflows in few-shot and out-of-domain settings. Tasks covered infarct classification, meningioma segmentation, and brain age regression, and considered both models trained on FOMO60K (method track) and any data (open track). Nineteen foundation models from sixteen teams were evaluated using a standardized containerized pipeline. Results show that (a) self-supervised pretraining improves generalization on clinical data under domain shift, with the strongest models trained \textit{out-of-domain} surpassing supervised baselines trained \textit{in-domain}. (b) No single pretraining objective benefits all tasks: MAE favors segmentation, hybrid reconstruction-contrastive objectives favor classification, and (c) strong performance was achieved by small pretrained models, and improvements from scaling model size and training duration did not yield reliable benefits.

Chinese Translation

自动化脑MRI分析的临床应用面临一个基本挑战：临床数据是异质的且噪声较大，高质量标签的获取成本极高。自监督学习（SSL）可以通过利用临床工作流程中产生的大量未标记数据来训练稳健的基础模型，从而以最小的监督适应域外数据。然而，脑MRI基础模型的发展受到小型预训练数据集和专注于高质量研究级数据的域内基准测试的限制。为了解决这一问题，我们在MICCAI 2025上组织了FOMO25挑战作为卫星活动。FOMO25为参与者提供了一个大型预训练数据集FOMO60K，并在少样本和域外设置中评估模型，数据直接来源于临床工作流程。任务包括梗死分类、脑膜瘤分割和脑龄回归，并考虑了在FOMO60K上训练的模型（方法轨道）和任何数据（开放轨道）。来自十六个团队的十九个基础模型通过标准化的容器化流程进行了评估。结果表明：（a）自监督预训练在域移情况下提高了临床数据的泛化能力，最强的域外训练模型超越了域内训练的监督基线；（b）没有单一的预训练目标对所有任务都有益：MAE更有利于分割，混合重建-对比目标更有利于分类；（c）小型预训练模型取得了强劲的性能，扩大模型规模和训练时间的改进并未带来可靠的收益。

View on arXiv Download PDF AI Translation

cs.CV / 280 / 2604.11685

Unfolding 3D Gaussian Splatting via Iterative Gaussian Synopsis

通过迭代高斯概述展开3D高斯点云

Lu, Yuqin, Zhou, Yang, Dai, Yihua, Li, Guiqing, He, Shengfeng

Abstract

3D Gaussian Splatting (3DGS) has become a state-of-the-art framework for real-time, high-fidelity novel view synthesis. However, its substantial storage requirements and inherently unstructured representation pose challenges for deployment in streaming and resource-constrained environments. Existing Level-of-Detail (LOD) strategies, particularly those based on bottom-up construction, often introduce redundancy or lead to fidelity degradation. To overcome these limitations, we propose Iterative Gaussian Synopsis, a novel framework for compact and progressive rendering through a top-down "unfolding" scheme. Our approach begins with a full-resolution 3DGS model and iteratively derives coarser LODs using an adaptive, learnable mask-based pruning mechanism. This process constructs a multi-level hierarchy that preserves visual quality while improving efficiency. We integrate hierarchical spatial grids, which capture the global scene structure, with a shared Anchor Codebook that models localized details. This combination produces a compact yet expressive feature representation, designed to minimize redundancy and support efficient, level-specific adaptation. The unfolding mechanism promotes inter-layer reusability and requires only minimal data overhead for progressive refinement. Experiments show that our method maintains high rendering quality across all LODs while achieving substantial storage reduction. These results demonstrate the practicality and scalability of our approach for real-time 3DGS rendering in bandwidth- and memory-constrained scenarios.

Chinese Translation

3D高斯点云（3D Gaussian Splatting, 3DGS）已成为实时高保真新视图合成的最先进框架。然而，其巨大的存储需求和固有的非结构化表示在流媒体和资源受限环境中的部署面临挑战。现有的细节层次（Level-of-Detail, LOD）策略，特别是基于自下而上的构建方法，往往会引入冗余或导致保真度下降。为克服这些限制，我们提出了迭代高斯概述（Iterative Gaussian Synopsis），这是一个通过自上而下的“展开”方案实现紧凑和渐进渲染的新框架。我们的方法从全分辨率的3DGS模型开始，利用自适应的可学习掩码修剪机制迭代推导出更粗糙的LOD。这一过程构建了一个多层次的层级结构，既保持了视觉质量，又提高了效率。我们将捕捉全局场景结构的层级空间网格与建模局部细节的共享锚代码本（Anchor Codebook）相结合。这种组合产生了一种紧凑而富有表现力的特征表示，旨在最小化冗余并支持高效的层级特定适应。展开机制促进了层间的可重用性，并且仅需最小的数据开销即可实现渐进细化。实验表明，我们的方法在所有LOD中保持了高渲染质量，同时实现了显著的存储减少。这些结果展示了我们的方法在带宽和内存受限场景中进行实时3DGS渲染的实用性和可扩展性。

View on arXiv Download PDF AI Translation

cs.CV / 281 / 2604.11689

LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

LARY：一种潜在动作表示基准，用于可泛化的视觉-动作对齐

Nie, Dujun, Chen, Fengjiao, Lv, Qi, Kuang, Jun, Li, Xiaoyu, Cao, Xuezhi, Cai, Xunliang

Abstract

While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models. (ii) Latent-based visual space is fundamentally better aligned to physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.

Chinese Translation

尽管显式动作数据的短缺限制了视觉-语言-动作（VLA）模型的应用，但人类动作视频提供了一种可扩展的、未标记的数据源。在利用大规模人类视频数据集时，一个关键挑战在于将视觉信号转化为与本体无关的表示，称为潜在动作。然而，潜在动作表示从视觉观察中推导出稳健控制的能力尚未经过严格评估。我们提出了潜在动作表示基准（LARY Benchmark），这是一个统一的框架，用于评估潜在动作表示在高层次语义动作（做什么）和低层次机器人控制（如何做）方面的表现。该数据集经过全面策划，涵盖超过一百万个视频（1000小时），涉及151个动作类别，以及620K图像对和595K运动轨迹，涵盖多种体现和环境。我们的实验揭示了两个关键见解：（i）在没有任何动作监督的情况下训练的通用视觉基础模型，始终优于专门的具身潜在动作模型。（ii）基于潜在的视觉空间在本质上比基于像素的空间更好地与物理动作空间对齐。这些结果表明，通用视觉表示本质上编码了与物理控制相关的知识，而语义层次的抽象在视觉到动作的转化中比像素层次的重建更为有效。

View on arXiv Download PDF AI Translation

cs.CV / 282 / 2604.11707

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

像素之前的表征：语义引导的层次视频预测

Karypidis, Efstathios, Gidaris, Spyros, Komodakis, Nikos

Abstract

Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix

Chinese Translation

准确的未来视频预测需要高视觉保真度和一致的场景语义，特别是在复杂动态环境中，例如自动驾驶。我们提出了Re2Pix，一个层次视频预测框架，将预测过程分解为两个阶段：语义表征预测和表征引导的视觉合成。我们的做法不是直接预测未来的RGB帧，而是首先在冻结的视觉基础模型的特征空间中预测未来场景结构，然后在这些预测的表征上对潜在扩散模型进行条件处理，以渲染出照片级真实感的帧。这种分解使得模型能够首先关注场景动态，然后再关注外观生成。一个关键挑战来自于训练期间可用的真实表征与推理时使用的预测表征之间的训练-测试不匹配。为了解决这个问题，我们引入了两种条件策略：嵌套丢弃（nested dropout）和混合监督（mixed supervision），以提高对不完美自回归预测的鲁棒性。在具有挑战性的驾驶基准测试中的实验表明，与强扩散基线相比，所提出的以语义为先的设计显著提高了时间语义一致性、感知质量和训练效率。我们提供了实现代码，链接为：https://github.com/Sta8is/Re2Pix

View on arXiv Download PDF AI Translation

cs.CV / 283 / 2604.11711

Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models

透视工具：基础分割模型遮挡鲁棒性的受控基准测试

Ho, Nhan, Le, Luu, Nguyen, Thanh-Huy, Nguyen, Thien, Liu, Xiaofeng, Bagci, Ulas

Abstract

Occlusion, where target structures are partially hidden by surgical instruments or overlapping tissues, remains a critical yet underexplored challenge for foundation segmentation models in clinical endoscopy. We introduce OccSAM-Bench, a benchmark designed to systematically evaluate SAM-family models under controlled, synthesized surgical occlusion. Our framework simulates two occlusion types (i.e., surgical tool overlay and cutout) across three calibrated severity levels on three public polyp datasets. We propose a novel three-region evaluation protocol that decomposes segmentation performance into full, visible-only, and invisible targets. This metric exposes behaviors that standard amodal evaluation obscures, revealing two distinct model archetypes: Occluder-Aware models (SAM, SAM 2, SAM 3, MedSAM3), which prioritize visible tissue delineation and reject instruments, and Occluder-Agnostic models (MedSAM, MedSAM2), which confidently predict into occluded regions. SAM-Med2D aligns with neither and underperforms across all conditions. Ultimately, our results demonstrate that occlusion robustness is not uniform across architectures, and model selection must be driven by specific clinical intent-whether prioritizing conservative visible-tissue segmentation or the amodal inference of hidden anatomy.

Chinese Translation

遮挡现象，即目标结构被手术器械或重叠组织部分遮盖，依然是临床内镜基础分割模型面临的一个关键但未充分研究的挑战。我们提出了OccSAM-Bench，一个旨在系统评估SAM系列模型在受控合成手术遮挡条件下表现的基准测试框架。该框架模拟了两种遮挡类型（手术工具覆盖和切割遮挡），并在三个公开息肉数据集上设置了三个校准的遮挡严重度等级。我们提出了一种新颖的三区域评估协议，将分割性能分解为完整目标、仅可见目标和不可见目标三个部分。该指标揭示了标准无遮挡评估所掩盖的行为，暴露出两种不同的模型原型：遮挡感知模型（Occluder-Aware models）（包括SAM、SAM 2、SAM 3、MedSAM3），它们优先进行可见组织的描绘并排除器械；以及遮挡无感模型（Occluder-Agnostic models）（包括MedSAM、MedSAM2），它们能够自信地预测遮挡区域。SAM-Med2D则不符合上述任何一种类型，在所有条件下表现均较差。最终，我们的结果表明，遮挡鲁棒性在不同架构间并不均衡，模型选择必须基于具体的临床需求——无论是优先保证保守的可见组织分割，还是进行隐藏解剖结构的无遮挡推断。

View on arXiv Download PDF AI Translation

cs.CV / 284 / 2604.11714

BEM: Training-Free Background Embedding Memory for False-Positive Suppression in Real-Time Fixed-Background Camera

BEM：一种无训练背景嵌入记忆用于实时固定背景摄像头中的误报抑制

Park, Junwoo, Lee, Jangho, Lim, Sunho

Abstract

Pretrained detectors perform well on benchmarks but often suffer performance degradation in real-world deployments due to distribution gaps between training data and target environments. COCO-like benchmarks emphasize category diversity rather than instance density, causing detectors trained under per-class sparsity to struggle in dense, single- or few-class scenes such as surveillance and traffic monitoring. In fixed-camera environments, the quasi-static background provides a stable, label-free prior that can be exploited at inference to suppress spurious detections. To address the issue, we propose Background Embedding Memory (BEM), a lightweight, training-free, weight-frozen module that can be attached to pretrained detectors during inference. BEM estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits with an inverse-similarity, rank-weighted penalty, effectively reducing false positives while maintaining recall. Empirically, background-frame cosine similarity correlates negatively with object count and positively with Precision-Confidence AUC (P-AUC), motivating its use as a training-free control signal. Across YOLO and RT-DETR families on LLVIP and simulated surveillance streams, BEM consistently reduces false positives while preserving real-time performance. Our code is available at https://github.com/Leo-Park1214/Background-Embedding-Memory.git

Chinese Translation

预训练检测器在基准测试中表现良好，但由于训练数据与目标环境之间的分布差距，往往在实际部署中性能下降。类似COCO的基准强调类别多样性而非实例密度，导致在密集的单类或少类场景（如监控和交通监测）中，基于每类稀疏性训练的检测器表现不佳。在固定摄像头环境中，准静态背景提供了一个稳定的、无标签的先验，可以在推理时利用它来抑制虚假检测。为了解决这个问题，我们提出了背景嵌入记忆（Background Embedding Memory，BEM），这是一个轻量级、无训练、权重冻结的模块，可以在推理期间附加到预训练检测器上。BEM 估计干净的背景嵌入，维护原型记忆，并使用逆相似度、排名加权惩罚重新评分检测 logits，有效减少误报同时保持召回率。从经验上看，背景帧的余弦相似度与物体数量呈负相关，与精度-置信度曲线下面积（Precision-Confidence AUC，P-AUC）呈正相关，这使其成为无训练控制信号的有效选择。在 LLVIP 和模拟监控流的 YOLO 和 RT-DETR 系列中，BEM 一直有效减少误报，同时保持实时性能。我们的代码可在 https://github.com/Leo-Park1214/Background-Embedding-Memory.git 获取。

View on arXiv Download PDF AI Translation

cs.CV / 285 / 2604.11720

On the Robustness of Watermarking for Autoregressive Image Generation

自回归图像生成中的水印鲁棒性研究

Müller, Andreas, Lukovnikov, Denis, Kodama, Shingo, Pham, Minh, Jain, Anubhav, Petit, Jonathan, Cohen, Niv, Fischer, Asja

Abstract

The proliferation of autoregressive (AR) image generators demands reliable detection and attribution of their outputs to mitigate misinformation, and to filter synthetic images from training data to prevent model collapse. To address this need, watermarking techniques, specifically designed for AR models, embed a subtle signal at generation time, enabling downstream verification through a corresponding watermark detector. In this work, we study these schemes and demonstrate their vulnerability to both watermark removal and forgery attacks. We assess existing attacks and further introduce three new attacks: (i) a vector-quantized regeneration removal attack, (ii) adversarial optimization-based attack, and (iii) a frequency injection attack. Our evaluation reveals that removal and forgery attacks can be effective with access to a single watermarked reference image and without access to original model parameters or watermarking secrets. Our findings indicate that existing watermarking schemes for AR image generation do not reliably support synthetic content detection for dataset filtering. Moreover, they enable Watermark Mimicry, whereby authentic images can be manipulated to imitate a generator's watermark and trigger false detection to prevent their inclusion in future model training.

Chinese Translation

自回归（AR）图像生成器的快速发展要求对其输出进行可靠的检测和归属，以减轻错误信息的传播，并过滤合成图像以防止模型崩溃。为满足这一需求，专为AR模型设计的水印技术在生成时嵌入微妙的信号，从而通过相应的水印检测器实现下游验证。在本研究中，我们研究了这些方案，并展示了它们在水印去除和伪造攻击方面的脆弱性。我们评估了现有攻击，并进一步引入了三种新攻击：（i）向量量化再生去除攻击，（ii）基于对抗优化的攻击，以及（iii）频率注入攻击。我们的评估表明，去除和伪造攻击在访问单个带水印的参考图像的情况下是有效的，而无需访问原始模型参数或水印秘密。我们的研究结果表明，现有的AR图像生成水印方案并不能可靠地支持合成内容的检测以进行数据集过滤。此外，它们还使得水印模仿成为可能，即真实图像可以被操控以模仿生成器的水印，从而触发错误检测，防止其被纳入未来的模型训练。

View on arXiv Download PDF AI Translation

cs.CV / 286 / 2604.11724

The Devil is in the Details -- From OCR for Old Church Slavonic to Purely Visual Stemma Reconstruction

细节决定成败——从古教会斯拉夫文的光学字符识别到纯视觉的谱系重建

Hoenen, Armin

Abstract

The age of artificial intelligence has brought many new possibilities and pitfalls in many fields and tasks. The devil is in the details, and those come to the fore when building new pipelines and executing small practical experiments. OCR and stemmatology are no exception. The current investigation starts comparing a range of OCR-systems, from classical over machine learning to LLMs, for roughly 6,000 characters of late handwritten church slavonic manuscripts from the 18th century. Focussing on basic letter correctness, more than 10 CS OCR-systems among which 2 LLMs (GPT5 and Gemini3-flash) are being compared. Then, post-processing via LLMs is assessed and finally, different agentic OCR architectures (specialized post-processing agents, an agentic pipeline and RAG) are tested. With new technology elaborated, experiments suggest, church slavonic CER for basic letters may reach as low as 2-3% but elaborated diacritics could still present a problem. How well OCR can prime stemmatology as a downstream task is the entry point to the second part of the article which introduces a new stemmatic method based solely on image processing. Here, a pipeline of automated visual glyph extraction, clustering and pairwise statistical comparison leading to a distance matrix and ultimately a stemma, is being presented and applied to two small corpora, one for the church slavonic Gospel of Mark from the 14th to 16th centuries, one for the Roman de la Rose in French from the 14th and 15th centuries. Basic functioning of the method can be demonstrated.

Chinese Translation

人工智能时代为许多领域和任务带来了新的可能性和陷阱。细节决定成败，这在构建新的流程和执行小规模实际实验时尤为明显。光学字符识别（OCR）和谱系学也不例外。本研究开始比较一系列OCR系统，从经典方法到机器学习再到大型语言模型（LLMs），针对大约6000个18世纪晚期手写教会斯拉夫文手稿字符进行分析。研究重点放在基本字母的正确性上，比较了10多个教会斯拉夫文OCR系统，其中包括2个大型语言模型（GPT5和Gemini3-flash）。随后，评估了通过大型语言模型进行的后处理，最后测试了不同的代理OCR架构（专门的后处理代理、代理流程和RAG）。随着新技术的深入，实验表明，教会斯拉夫文的基本字母字符错误率（CER）可能低至2-3%，但复杂的变音符号仍可能存在问题。OCR在作为下游任务的谱系学中能发挥多大作用，是文章第二部分的切入点，该部分介绍了一种基于图像处理的新谱系方法。在这里，展示并应用了一条自动化视觉字形提取、聚类和成对统计比较的流程，最终生成距离矩阵并形成谱系，应用于两个小语料库，一个是14至16世纪的教会斯拉夫文《马可福音》，另一个是14至15世纪的法语《玫瑰骑士》。该方法的基本功能得以展示。

View on arXiv Download PDF AI Translation

cs.CV / 287 / 2604.11730

Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions

视频中矛盾/犹豫识别用于个性化数字健康干预

González-González, Manuela, Belharbi, Soufiane, Zeeshan, Muhammad Osama, Sharafi, Masoumeh, Aslam, Muhammad Haseeb, Sia, Lorenzo, Richet, Nicolas, Pedersoli, Marco, Koerich, Alessandro Lameiras, Bacon, Simon L, Granger, Eric

Abstract

Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective approach, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) play a primary role for individuals to delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a modality, such as language, facial, vocal expressions, and body language. While experts can be trained to recognize A/H, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models for A/H recognition in videos, a multi-modal task by nature. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that more adapted multi-modal models are required for accurate A/H recognition. Better methods for modeling spatio-temporal and multimodal fusion are necessary to leverage conflicts within/across modalities.

Chinese Translation

利用行为科学，健康干预通过提供框架帮助患者养成并维持改善医疗结果的健康习惯，进而实现行为改变。面对面干预成本高且难以规模化，尤其是在资源有限的地区。数字健康干预提供了一种具有成本效益的方法，能够支持独立生活和自我管理。近年来，尤其是通过机器学习实现此类干预的自动化，受到了广泛关注。矛盾与犹豫（Ambivalence/Hesitancy，简称A/H）在个体延迟、回避或放弃健康干预中起着关键作用。A/H表现为细微且矛盾的情绪，使个体处于对某行为的正负评价之间，或在接受与拒绝参与之间的状态。其表现为跨模态或单一模态（如语言、面部、声音表达及肢体语言）中的情感不一致。尽管专家可以经过培训识别A/H，但将其整合进数字健康干预成本高且效果有限。因此，自动识别A/H对于数字健康干预的个性化和成本效益至关重要。本文探讨了深度学习模型在视频中识别A/H的应用，该任务本质上是多模态的。具体而言，本文涵盖了三种学习方案：监督学习、用于个性化的无监督领域适应以及通过大型语言模型（LLMs）实现的零样本推断。实验基于近期发布的独特BAH视频数据集进行。结果显示性能有限，表明需要更适配的多模态模型以实现准确的A/H识别。更优的时空建模和多模态融合方法对于利用模态内外的冲突信息至关重要。

View on arXiv Download PDF AI Translation

cs.CV / 288 / 2604.11737

Learning Long-term Motion Embeddings for Efficient Kinematics Generation

学习长期运动嵌入以实现高效的运动学生成

Stracke, Nick, Bauer, Kolja, Baumann, Stefan Andreas, Bautista, Miguel Angel, Susskind, Josh, Ommer, Björn

Abstract

Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.

Chinese Translation

理解和预测运动是视觉智能的一个基本组成部分。尽管现代视频模型在场景动态理解方面表现出强大的能力，但通过完整视频合成探索多种可能的未来仍然效率低下。我们通过直接操作从跟踪器模型获得的大规模轨迹中学习的长期运动嵌入，以数量级的效率建模场景动态。这使得能够高效生成长时间的、逼真的运动，以满足通过文本提示或空间指示指定的目标。为此，我们首先学习一个具有64倍时间压缩因子的高度压缩运动嵌入。在这个空间中，我们训练一个条件流匹配模型，以生成基于任务描述的运动潜变量。所得到的运动分布在性能上优于最先进的视频模型和专门的任务特定方法。

View on arXiv Download PDF AI Translation

cs.CV / 289 / 2604.11762

MosaicMRI: A Diverse Dataset and Benchmark for Raw Musculoskeletal MRI

MosaicMRI：一个多样化的原始肌肉骨骼MRI数据集及基准

Arguello, Paula, Tinaz, Berk, Sepehri, Mohammad Shahab, Soltanolkotabi, Maryam, Soltanolkotabi, Mahdi

Abstract

Deep learning underpins a wide range of applications in MRI, including reconstruction, artifact removal, and segmentation. However, progress has been driven largely by public datasets focused on brain and knee imaging, shaping how models are trained and evaluated. As a result, careful studies of the reliability of these models across diverse anatomical settings remain limited. In this work, we introduce MosaicMRI, a large and diverse collection of fully sampled raw musculoskeletal (MSK) MR measurements designed for training and evaluating machine-learning-based methods. MosaicMRI is the largest open-source raw MSK MRI dataset to date, comprising 2,671 volumes and 80,156 slices. The dataset offers substantial diversity in volume orientation (e.g., axial, sagittal), imaging contrasts (e.g., PD, T1, T2), anatomies (e.g., spine, knee, hip, ankle, and others), and numbers of acquisition coils. Using VarNet as a baseline for accelerated reconstruction task, we perform a comprehensive set of experiments to study scaling behavior with respect to both model capacity and dataset size. Interestingly, models trained on the combined anatomies significantly outperform anatomy-specific models in low-sample regimes, highlighting the benefits of anatomical diversity and the presence of exploitable cross-anatomical correlations. We further evaluate robustness and cross-anatomy generalization by training models on one anatomy (e.g., spine) and testing them on another (e.g., knee). Notably, we identify groups of body parts (e.g., foot and elbow) that generalize well with each other, and highlight that performance under domain shifts depends on both training set size, anatomy, and protocol-specific factors.

Chinese Translation

深度学习支撑了MRI中的广泛应用，包括重建、伪影去除和分割。然而，进展主要依赖于聚焦于脑部和膝部成像的公共数据集，这影响了模型的训练和评估方式。因此，针对不同解剖部位模型可靠性的系统研究仍然有限。在本研究中，我们引入了MosaicMRI，这是一个大型且多样化的全采样原始肌肉骨骼（MSK）MRI测量集合，旨在用于基于机器学习方法的训练和评估。MosaicMRI是迄今为止最大的开源原始MSK MRI数据集，包含2,671个体积和80,156个切片。该数据集在体积方向（如轴向、矢状面）、成像对比度（如PD、T1、T2）、解剖部位（如脊柱、膝盖、髋部、踝部等）及采集线圈数量方面具有显著多样性。以VarNet作为加速重建任务的基线，我们进行了全面实验，研究模型容量和数据集规模的扩展行为。有趣的是，在样本量较低的情况下，训练于多解剖部位组合的数据集的模型显著优于特定解剖部位的模型，凸显了解剖多样性及可利用的跨解剖相关性的优势。我们进一步通过在一种解剖部位（如脊柱）训练模型并在另一种解剖部位（如膝盖）测试，评估了模型的鲁棒性和跨解剖泛化能力。值得注意的是，我们发现某些身体部位群组（如足部和肘部）之间具有良好的泛化能力，并强调了领域转移下性能表现受训练集规模、解剖部位及协议特异性因素的共同影响。

View on arXiv Download PDF AI Translation

cs.CV / 290 / 2604.11775

Efficient KernelSHAP Explanations for Patch-based 3D Medical Image Segmentation

基于补丁的三维医学图像分割的高效 KernelSHAP 解释

Brioso, Ricardo Coimbra, Sichili, Giulio, Dei, Damiano, Lambri, Nicola, Mancosu, Pietro, Scorsetti, Marta, Loiacono, Daniele

Abstract

Perturbation-based explainability methods such as KernelSHAP provide model-agnostic attributions but are typically impractical for patch-based 3D medical image segmentation due to the large number of coalition evaluations and the high cost of sliding-window inference. We present an efficient KernelSHAP framework for volumetric CT segmentation that restricts computation to a user-defined region of interest and its receptive-field support, and accelerates inference via patch logit caching, reusing baseline predictions for unaffected patches while preserving nnU-Net's fusion scheme. To enable clinically meaningful attributions, we compare three automatically generated feature abstractions within the receptive-field crop: whole-organ units, regular FCC supervoxels, and hybrid organ-aware supervoxels, and we study multiple aggregation/value functions targeting stabilizing evidence (TP/Dice/Soft Dice) or false-positive behavior. Experiments on whole-body CT segmentations show that caching substantially reduces redundant computation (with computational savings ranging from 15% to 30%) and that faithfulness and interpretability exhibit clear trade-offs: regular supervoxels often maximize perturbation-based metrics but lack anatomical alignment, whereas organ-aware units yield more clinically interpretable explanations and are particularly effective for highlighting false-positive drivers under normalized metrics.

Chinese Translation

基于扰动的可解释性方法如 KernelSHAP 提供了与模型无关的归因，但由于需要大量的联盟评估和滑动窗口推理的高成本，这些方法通常在基于补丁的三维医学图像分割中不够实用。我们提出了一种高效的 KernelSHAP 框架，用于体积 CT 分割，该框架将计算限制在用户定义的感兴趣区域及其感受野支持范围内，并通过补丁逻辑缓存加速推理，重用未受影响补丁的基线预测，同时保留 nnU-Net 的融合方案。为了实现临床意义的归因，我们比较了在感受野裁剪内自动生成的三种特征抽象：整体器官单元、常规 FCC 超体素和混合器官感知超体素，并研究了多种聚合/价值函数，旨在稳定证据（TP/Dice/Soft Dice）或虚假阳性行为。对全身 CT 分割的实验表明，缓存显著减少了冗余计算（计算节省范围为 15% 至 30%），并且忠实性和可解释性之间存在明显的权衡：常规超体素通常最大化基于扰动的指标，但缺乏解剖学对齐，而器官感知单元则提供了更具临床可解释性的解释，并在标准化指标下特别有效地突出虚假阳性驱动因素。

View on arXiv Download PDF AI Translation

cs.CV / 291 / 2604.11788

HDR Video Generation via Latent Alignment with Logarithmic Encoding

通过对数编码的潜在对齐生成高动态范围视频

Korem, Naomi Ken, Oumoumad, Mohamed, Cain, Harel, Yosef, Matan Ben, Jelercic, Urska, Bibi, Ofir, Inger, Yaron, Patashnik, Or, Cohen-Or, Daniel

Abstract

High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, perceptually compressed data on which these models are trained. A natural solution is to learn new representations for HDR, which introduces additional complexity and data requirements. In this work, we show that HDR generation can be achieved in a much simpler way by leveraging the strong visual priors already captured by pretrained generative models. We observe that a logarithmic encoding widely used in cinematic pipelines maps HDR imagery into a distribution that is naturally aligned with the latent space of these models, enabling direct adaptation via lightweight fine-tuning without retraining an encoder. To recover details that are not directly observable in the input, we further introduce a training strategy based on camera-mimicking degradations that encourages the model to infer missing high dynamic range content from its learned priors. Combining these insights, we demonstrate high-quality HDR video generation using a pretrained video model with minimal adaptation, achieving strong results across diverse scenes and challenging lighting conditions. Our results indicate that HDR, despite representing a fundamentally different image formation regime, can be handled effectively without redesigning generative models, provided that the representation is chosen to align with their learned priors.

Chinese Translation

高动态范围（HDR）图像提供了丰富而真实的场景辐射表示，但由于与这些模型所训练的有界、感知压缩数据的不匹配，仍然对生成模型构成挑战。一个自然的解决方案是为HDR学习新的表示，这会引入额外的复杂性和数据需求。在本研究中，我们展示了通过利用预训练生成模型已经捕获的强视觉先验，可以以更简单的方式实现HDR生成。我们观察到，广泛应用于电影制作流程的对数编码将HDR图像映射到与这些模型的潜在空间自然对齐的分布，从而使得通过轻量级微调直接适应成为可能，而无需重新训练编码器。为了恢复输入中不可直接观察的细节，我们进一步引入了一种基于摄像机模拟退化的训练策略，鼓励模型从其学习的先验中推断缺失的高动态范围内容。结合这些见解，我们展示了使用预训练视频模型进行高质量HDR视频生成的能力，适应性极小，在多样化场景和具有挑战性的光照条件下取得了良好的结果。我们的结果表明，尽管HDR代表了一种根本不同的图像形成机制，但只要选择与其学习的先验对齐的表示，就可以有效处理，而无需重新设计生成模型。

View on arXiv Download PDF AI Translation

cs.CV / 292 / 2604.11789

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

大规模多模态模型（LMMs）与面向对象的视觉：理解、分割、编辑与生成

Yuan, Yuqian, Zhang, Wenqiao, Lin, Juekai, Zhong, Yu, Gao, Mingjian, Yu, Binhe, Cao, Yunqi, Li, Wentong, Zhuang, Yueting, Ooi, Beng Chin

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision--language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.

Chinese Translation

大规模多模态模型（LMMs）在通用视觉-语言理解方面取得了显著进展，然而在需要精确对象级定位、细粒度空间推理以及可控视觉操作的任务中仍存在局限性。尤其是，现有系统常常难以准确识别正确的实例、在多次交互中保持对象身份一致性，以及高精度地定位或修改指定区域。面向对象的视觉提供了一种系统化的框架，通过促进对视觉实体的显式表示和操作，将多模态系统从全局场景理解扩展到对象级的理解、分割、编辑与生成。本文全面回顾了大规模多模态模型与面向对象视觉交汇领域的最新进展。我们将相关文献归纳为四大主题：面向对象的视觉理解、面向对象的指称分割、面向对象的视觉编辑以及面向对象的视觉生成。进一步总结了支持这些能力的关键建模范式、学习策略及评估协议。最后，讨论了若干开放挑战与未来方向，包括鲁棒的实例持久性、细粒度空间控制、一致的多步交互、统一的跨任务建模以及在分布变化下的可靠基准测试。我们期望本文能为构建可扩展、精确且可信赖的面向对象多模态系统提供结构化的视角。

View on arXiv Download PDF AI Translation

cs.CV / 293 / 2604.11792

LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

LottieGPT：用于自回归生成的矢量动画标记化

Chen, Junhao, Gao, Kejun, Cui, Yuehan, Sun, Mingze, Chen, Mingjin, Wang, Shaohui, Long, Xiaoxiao, Ma, Fei, Tian, Qi, Huang, Ruqi, Zhao, Hao

Abstract

Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animation and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).

Chinese Translation

尽管视频生成技术快速发展，现有模型仍无法生成矢量动画，这是一种在互联网上占主导地位且高度表现力丰富的多媒体形式。矢量动画具有分辨率独立性、紧凑性、语义结构和可编辑的参数化运动表示，然而当前的生成模型仅在光栅空间中操作，因此无法合成矢量动画。同时，最近在大型多模态模型方面的进展展示了生成结构化数据（如幻灯片、3D网格、乐高序列和室内布局）的强大能力，这表明原生矢量动画生成可能是可实现的。在本研究中，我们提出了第一个用于标记化和自回归生成矢量动画的框架。我们采用了广泛应用的基于JSON的动画标准Lottie，并设计了一个专门的Lottie标记器，将分层几何原语、变换和基于关键帧的运动编码为紧凑且语义对齐的标记序列。为了支持大规模训练，我们还构建了LottieAnimation-660K，这是迄今为止最大的、最具多样性的矢量动画数据集，包含660,000个真实世界的Lottie动画和1500万个静态Lottie图像文件，这些文件来自广泛的互联网来源。在这些组件的基础上，我们微调了Qwen-VL，创建了LottieGPT，这是一个能够直接从自然语言或视觉提示生成连贯、可编辑的矢量动画的原生多模态模型。实验表明，我们的标记器在保持结构保真度的同时显著减少了序列长度，从而有效支持动态矢量内容的自回归学习。LottieGPT在多种动画风格中表现出强大的泛化能力，并在SVG生成（单帧矢量动画的特例）方面超越了之前的最先进模型。

View on arXiv Download PDF AI Translation

cs.CV / 294 / 2604.11797

SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization

SyncFix：通过多视图同步修复三维重建

Li, Deming, Yadav, Abhay, Peng, Cheng, Chellappa, Rama, Bhattad, Anand

Abstract

We present SyncFix, a framework that enforces cross-view consistency during the diffusion-based refinement of reconstructed scenes. SyncFix formulates refinement as a joint latent bridge matching problem, synchronizing distorted and clean representations across multiple views to fix the semantic and geometric inconsistencies. This means SyncFix learns a joint conditional over multiple views to enforce consistency throughout the denoising trajectory. Our training is done only on image pairs, but it generalizes naturally to an arbitrary number of views during inference. Moreover, reconstruction quality improves with additional views, with diminishing returns at higher view counts. Qualitative and quantitative results demonstrate that SyncFix consistently generates high-quality reconstructions and surpasses current state-of-the-art baselines, even in the absence of clean reference images. SyncFix achieves even higher fidelity when sparse references are available.

Chinese Translation

我们提出了SyncFix，一个在基于扩散的重建场景精炼过程中强制跨视图一致性的框架。SyncFix将精炼过程公式化为一个联合潜在桥接匹配问题，旨在同步多个视图中的失真和清晰表示，以修复语义和几何不一致性。这意味着SyncFix学习了一个跨多个视图的联合条件，以在去噪轨迹中强制一致性。我们的训练仅在图像对上进行，但在推理时可以自然推广到任意数量的视图。此外，随着视图数量的增加，重建质量有所提升，但在更高的视图数量下收益递减。定性和定量结果表明，SyncFix始终生成高质量的重建，并在没有清晰参考图像的情况下超越当前最先进的基准。当稀疏参考可用时，SyncFix的保真度更高。

View on arXiv Download PDF AI Translation

cs.CV / 295 / 2604.11798

Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net

基于预算意识的不确定性用于放射治疗分割质量保证的 nnU-Net 方法

Brioso, Ricardo Coimbra, Mondo, Lorenzo, Dei, Damiano, Lambri, Nicola, Mancosu, Pietro, Scorsetti, Marta, Loiacono, Daniele

Abstract

Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty--error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that highlight more consistently regions requiring manual edits. Overall, integrating calibration with efficient ensembling seems a promising strategy to implement a budget-aware QA workflow for radiotherapy segmentation.

Chinese Translation

临床靶区（CTV）的准确勾画对放射治疗规划至关重要，但仍然耗时且难以评估，尤其是在总骨髓和淋巴结照射（TMLI）等复杂治疗中。尽管基于深度学习的自动分割可以减少工作负担，但安全的临床部署需要可靠的线索来指示模型可能出错的地方。在本研究中，我们提出了一种基于 nnU-Net 的预算意识不确定性驱动的质量保证（QA）框架，结合不确定性量化和后验校准，生成基于预测熵的体素级不确定性图，以指导有针对性的手动审查。我们比较了温度缩放（TS）、深度集成（DE）、检查点集成（CE）和测试时增强（TTA），并在 TMLI 这一代表性用例中对它们进行单独和组合评估。通过 ROI 掩膜校准指标和在现实修订约束下的不确定性-误差对齐来评估可靠性，结果以最不确定的 0-5% 体素的 AUC 进行总结。在不同配置中，分割准确性保持稳定，而 TS 显著改善了校准。不确定性-误差对齐在经过校准的基于检查点的推断中改善最为显著，生成的不确定性图更一致地突出需要手动编辑的区域。总体而言，将校准与高效集成相结合似乎是一种有前景的策略，以实施预算意识的放射治疗分割质量保证工作流程。

View on arXiv Download PDF AI Translation

cs.CV / 296 / 2604.11804

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

OmniShow：统一多模态条件的人机交互视频生成

Zhou, Donghao, Liu, Guisheng, Yang, Hao, Li, Jiatong, Lin, Jingyu, Huang, Xiaohu, Liu, Yichen, Gao, Xin, Chen, Cunjian, Wen, Shilei, Fu, Chi-Wing, Heng, Pheng-Ann

Abstract

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.

Chinese Translation

在本研究中，我们探讨了人机交互视频生成（HOIVG），该任务旨在合成高质量的人机交互视频，条件为文本、参考图像、音频和姿态。该任务在自动化内容创作的实际应用中具有重要的价值，例如电子商务演示、短视频制作和互动娱乐。然而，现有的方法未能满足所有这些必要条件。我们提出了OmniShow，一个为这一实际而具有挑战性的任务量身定制的端到端框架，能够协调多模态条件并提供行业级性能。为了克服可控性与质量之间的权衡，我们引入了统一通道条件（Unified Channel-wise Conditioning）以高效注入图像和姿态，并采用门控局部上下文注意力（Gated Local-Context Attention）以确保精确的音视频同步。为有效应对数据稀缺问题，我们开发了一种解耦后联合训练（Decoupled-Then-Joint Training）策略，利用多阶段训练过程与模型合并，高效利用异构子任务数据集。此外，为填补该领域的评估空白，我们建立了HOIVG-Bench，一个专门的、全面的HOIVG基准。大量实验表明，OmniShow在各种多模态条件设置下实现了整体的最先进性能，为新兴的HOIVG任务设定了坚实的标准。

View on arXiv Download PDF AI Translation

cs.CV / 297 / 2604.11808

Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

Pair2Scene：基于局部对象关系学习的程序化场景生成

Ran, Xingjian, Zhang, Shujie, Zhong, Weipeng, Luo, Li, Dai, Bo

Abstract

Generating high-fidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to scale beyond training distribution to dense scenes or rely on LLMs/VLMs that lack the ability for precise spatial reasoning. Building on top of the observation that object placement relies mainly on local dependencies instead of information-redundant global distributions, in this paper, we propose Pair2Scene, a novel procedural generation framework that integrates learned local rules with scene hierarchies and physics-based algorithms. These rules mainly capture two types of inter-object relations, namely support relations that follow physical hierarchies, and functional relations that reflect semantic links. We model these rules through a network, which estimates spatial position distributions of dependent objects conditioned on position and geometry of the anchor ones. Accordingly, we curate a dataset 3D-Pairs from existing scene data to train the model. During inference, our framework can generate scenes by recursively applying our model within a hierarchical structure, leveraging collision-aware rejection sampling to align local rules into coherent global layouts. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond training data while maintaining physical and semantic plausibility.

Chinese Translation

由于数据稀缺和复杂空间关系建模的挑战，高保真三维室内场景生成仍然是一个重大难题。现有方法往往难以扩展到训练分布之外的密集场景，或依赖缺乏精确空间推理能力的大型语言模型（LLMs）/视觉语言模型（VLMs）。基于对象摆放主要依赖局部依赖关系而非信息冗余的全局分布的观察，本文提出了Pair2Scene，一种将学习到的局部规则与场景层次结构及基于物理的算法相结合的新型程序生成框架。该规则主要捕捉两类对象间关系，即遵循物理层次的支撑关系和反映语义关联的功能关系。我们通过一个网络对这些规则进行建模，该网络基于锚定对象的位置和几何信息，估计依赖对象的空间位置分布。为此，我们从现有场景数据中整理了3D-Pairs数据集用于模型训练。在推理阶段，框架通过在层次结构内递归应用模型，结合碰撞感知的拒绝采样，将局部规则整合为连贯的全局布局。大量实验表明，本框架在生成超越训练数据的复杂环境时，能够优于现有方法，同时保持物理和语义的合理性。

View on arXiv Download PDF AI Translation

cs.CV / 298 / 2604.11809

Who Handles Orientation? Investigating Invariance in Feature Matching

谁来处理方向？特征匹配中的不变性研究

Nordström, David, Edstedt, Johan, Kahl, Fredrik, Bökman, Georg

Abstract

Finding matching keypoints between images is a core problem in 3D computer vision. However, modern matchers struggle with large in-plane rotations. A straightforward mitigation is to learn rotation invariance via data augmentation. However, it remains unclear at which stage rotation invariance should be incorporated. In this paper, we study this in the context of a modern sparse matching pipeline. We perform extensive experiments by training on a large collection of 3D vision datasets and evaluating on popular image matching benchmarks. Surprisingly, we find that incorporating rotation invariance already in the descriptor yields similar performance to handling it in the matcher. However, rotation invariance is achieved earlier in the matcher when it is learned in the descriptor, allowing for a faster rotation-invariant matcher. Further, we find that enforcing rotation invariance does not hurt upright performance when trained at scale. Finally, we study the emergence of rotation invariance through scale and find that increasing the training data size substantially improves generalization to rotated images. We release two matchers robust to in-plane rotations that achieve state-of-the-art performance on e.g. multi-modal (WxBS), extreme (HardMatch), and satellite image matching (SatAst). Code is available at https://github.com/davnords/loma.

Chinese Translation

在图像之间寻找匹配的关键点是三维计算机视觉中的核心问题。然而，现代匹配器在处理大幅度平面内旋转时表现不佳。一种直接的解决方法是通过数据增强学习旋转不变性，但尚不清楚应在何种阶段引入旋转不变性。本文在现代稀疏匹配流程的背景下对此进行了研究。我们通过在大量三维视觉数据集上训练，并在流行的图像匹配基准上进行评估，进行了广泛实验。令人惊讶的是，我们发现直接在描述子（descriptor）中引入旋转不变性，其性能与在匹配器（matcher）中处理旋转不变性相当。然而，当旋转不变性在描述子中学习时，匹配器中实现旋转不变性的过程更早，从而实现了更快速的旋转不变匹配器。此外，我们发现当训练规模足够大时，强制旋转不变性并不会损害正立图像的性能。最后，我们通过训练规模研究了旋转不变性的形成，发现增加训练数据量显著提升了对旋转图像的泛化能力。我们发布了两个对平面内旋转鲁棒的匹配器，在多模态（WxBS）、极端条件（HardMatch）及卫星图像匹配（SatAst）等任务上达到了最先进的性能。代码可在 https://github.com/davnords/loma 获取。

View on arXiv Download PDF AI Translation

人工智能 (Artificial Intelligence)

174

cs.AI / 1 / 2604.09554

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

LABBench2：用于评估执行生物学研究的人工智能系统的改进基准

Laurent, Jon M, Bou, Albert, Pieler, Michael, Igoe, Conor, Andonian, Alex, Narayanan, Siddharth, Braza, James, Vassopoulos, Alexandros Sanchez, Steenwyk, Jacob L, Lash, Blake, White, Andrew D, Rodriques, Samuel G

Abstract

Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI-driven autonomous labs. The need to measure progress of AI systems in scientific domains correspondingly must not only accelerate, but increasingly shift focus to more real-world capabilities. Beyond rote knowledge and even just reasoning to actually measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark LAB-Bench as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real-world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB-Bench, measuring similar capabilities but in more realistic contexts. We evaluate performance of current frontier models, and show that while abilities measured by LAB-Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model-specific accuracy differences range from -26% to -46% across subtasks) and underscores continued room for performance improvement. LABBench2 continues the legacy of LAB-Bench as a de facto benchmark for AI scientific research capabilities and we hope that it continues to help advance development of AI tools for these core research functions. To facilitate community use and development, we provide the task dataset at https://huggingface.co/datasets/futurehouse/labbench2 and a public eval harness at https://github.com/EdisonScientific/labbench2.

Chinese Translation

利用人工智能加速科学发现的乐观情绪持续增长。目前，人工智能在科学研究中的应用涵盖了从在科学数据上训练专用基础模型，到具备代理能力的自主假设生成系统，再到由人工智能驱动的自主实验室。对应地，衡量人工智能系统在科学领域进展的需求不仅必须加快，还需日益转向更贴近现实能力的评估。评估内容超越死记硬背知识，甚至超越纯粹推理，转而实际衡量执行有意义工作的能力。先前工作提出了语言代理生物学基准LAB-Bench，作为测量这些能力的初步尝试。在此，我们介绍该基准的演进版本LABBench2，用以衡量执行有用科学任务的人工智能系统的现实能力。LABBench2包含近1900个任务，大体上是对LAB-Bench的延续，测量类似能力但置于更真实的情境中。我们评估了当前前沿模型的表现，结果显示尽管LAB-Bench和LABBench2测量的能力已有显著提升，LABBench2在难度上实现了实质性跃升（各子任务中模型准确率差异范围为-26%至-46%），并凸显了性能提升的持续空间。LABBench2延续了LAB-Bench作为人工智能科学研究能力事实标准的传统，我们期望其继续推动人工智能工具在核心研究功能上的发展。为促进社区使用和开发，我们提供了任务数据集（https://huggingface.co/datasets/futurehouse/labbench2）及公开评测框架（https://github.com/EdisonScientific/labbench2）。

View on arXiv Download PDF AI Translation

cs.AI / 2 / 2604.09555

Linear Programming for Multi-Criteria Assessment with Cardinal and Ordinal Data: A Pessimistic Virtual Gap Analysis

基于线性规划的多标准评估中的基数和序数数据：一种悲观虚拟差距分析

Liu, Fuh-Hwa Franklin, Shih, Su-Chuan

Abstract

Multi-criteria Analysis (MCA) is used to rank alternatives based on various criteria. Key MCA methods, such as Multiple Criteria Decision Making (MCDM) methods, estimate parameters for criteria to compute the performance of each alternative. Nonetheless, subjective evaluations and biases frequently influence the reliability of results, while the diversity of data affects the precision of the parameters. The novel linear programming-based Virtual Gap Analysis (VGA) models tackle these issues. This paper outlines a two-step method that integrates two novel VGA models to assess each alternative from a pessimistic perspective, using both quantitative and qualitative criteria, and employing cardinal and ordinal data. Next, prioritize the alternatives to eliminate the least favorable one. The proposed method is dependable and scalable, enabling thorough assessments efficiently and effectively within decision support systems.

Chinese Translation

多标准分析（MCA）用于根据各种标准对备选方案进行排序。关键的MCA方法，如多标准决策制定（MCDM）方法，估计标准的参数以计算每个备选方案的性能。然而，主观评估和偏见常常影响结果的可靠性，而数据的多样性则影响参数的精确性。新颖的基于线性规划的虚拟差距分析（VGA）模型解决了这些问题。本文概述了一种两步法，该方法整合了两个新颖的VGA模型，从悲观的角度评估每个备选方案，使用定量和定性标准，并采用基数和序数数据。接下来，优先排序备选方案，以消除最不利的选项。所提出的方法可靠且可扩展，能够在决策支持系统中高效、有效地进行全面评估。

View on arXiv Download PDF AI Translation

cs.AI / 3 / 2604.09563

Seven simple steps for log analysis in AI systems

人工智能系统日志分析的七个简单步骤

Dubois, Magda, Zorer, Ekin, Hamin, Maia, Skinner, Joe, Souly, Alexandra, Wynne, Jerome, Coppock, Harry, Satos, Lucas, Kapoor, Sayash, Dev, Sunischal, Juchems, Keno, Mai, Kimberly, Flesch, Timo, Luettgau, Lennart, Teague, Charles, Patey, Eric, Allaire, JJ, Pacchiardi, Lorenzo, Hernandez-Orallo, Jose, Ududec, Cozmin

Abstract

AI systems produce large volumes of logs as they interact with tools and users. Analysing these logs can help understand model capabilities, propensities, and behaviours, or assess whether an evaluation worked as intended. Researchers have started developing methods for log analysis, but a standardised approach is still missing. Here we suggest a pipeline based on current best practices. We illustrate it with concrete code examples in the Inspect Scout library, provide detailed guidance on each step, and highlight common pitfalls. Our framework provides researchers with a foundation for rigorous and reproducible log analysis.

Chinese Translation

人工智能系统在与工具和用户交互过程中会产生大量日志。分析这些日志有助于理解模型的能力、倾向和行为，或评估某次评估是否按预期进行。研究人员已开始开发日志分析方法，但尚缺乏标准化的流程。本文基于当前最佳实践提出了一套流程，并通过Inspect Scout库中的具体代码示例进行说明，详细指导每个步骤，指出常见陷阱。我们的框架为研究人员提供了进行严谨且可复现的日志分析的基础。

View on arXiv Download PDF AI Translation

cs.AI / 4 / 2604.09574

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

屏幕图灵测试：移动GUI代理人类化的基准评测

Zhu, Jiachen, Yang, Lingyu, Shan, Rong, Zheng, Congmin, Zheng, Zeyu, Liu, Weiwen, Yu, Yong, Zhang, Weinan, Lin, Jianghao

Abstract

The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the ``Turing Test on Screen,'' formally modeling the interaction as a MinMax optimization problem between a detector and an agent aiming to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics, and conduct our analysis that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise to data-driven behavioral matching, demonstrating that agents can achieve high imitability theoretically and empirically without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.

Chinese Translation

自主GUI代理的兴起引发了数字平台的对抗性反制措施，然而现有研究更多关注实用性和鲁棒性，而忽视了关键的反检测维度。我们认为，为了使代理能够在以人为中心的生态系统中生存，它们必须进化出人类化（Humanization）能力。我们提出了“屏幕图灵测试”（Turing Test on Screen），将交互形式化为检测器与旨在最小化行为差异的代理之间的MinMax优化问题。随后，我们收集了新的高保真移动触控动态数据集，并通过分析发现，基于基础大规模模型（vanilla LMM）的代理因运动学不自然而易被检测。因此，我们建立了代理人类化基准（Agent Humanization Benchmark, AHB）及检测指标，以量化模仿性与实用性之间的权衡。最后，我们提出了从启发式噪声注入到数据驱动行为匹配的方法，实验证明代理能够在理论和实践中实现高模仿性，同时不牺牲性能。该工作将研究范式从代理能否完成任务转向代理如何在人类中心生态系统中执行任务，奠定了在对抗性数字环境中实现无缝共存的基础。

View on arXiv Download PDF AI Translation

cs.AI / 5 / 2604.09576

AHC: Meta-Learned Adaptive Compression for Continual Object Detection on Memory-Constrained Microcontrollers

AHC：面向内存受限微控制器的元学习自适应压缩用于持续目标检测

Wilson, Bibin

Abstract

Deploying continual object detection on microcontrollers (MCUs) with under 100KB memory requires efficient feature compression that can adapt to evolving task distributions. Existing approaches rely on fixed compression strategies (e.g., FiLM conditioning) that cannot adapt to heterogeneous task characteristics, leading to suboptimal memory utilization and catastrophic forgetting. We introduce Adaptive Hierarchical Compression (AHC), a meta-learning framework featuring three key innovations: (1) true MAML-based compression that adapts via gradient descent to each new task in just 5 inner-loop steps, (2) hierarchical multi-scale compression with scale-aware ratios (8:1 for P3, 6.4:1 for P4, 4:1 for P5) matching FPN redundancy patterns, and (3) a dual-memory architecture combining short-term and long-term banks with importance-based consolidation under a hard 100KB budget. We provide formal theoretical guarantees bounding catastrophic forgetting as O({\epsilon}{sq.root(T)} + 1/{sq.root(M)}) where {\epsilon} is compression error, T is task count, and M is memory size. Experiments on CORe50, TiROD, and PASCAL VOC benchmarks with three standard baselines (Fine-tuning,EWC, iCaRL) demonstrate that AHC enables practical continual detection within a 100KB replay budget, achieving competitive accuracy through mean-pooled compressed feature replay combined with EWC regularization and feature distillation.

Chinese Translation

在内存不足100KB的微控制器（MCU）上部署持续目标检测，需高效的特征压缩方法以适应不断变化的任务分布。现有方法依赖固定压缩策略（如FiLM条件化），无法适应异质任务特性，导致内存利用率低下及灾难性遗忘。本文提出自适应层次压缩（Adaptive Hierarchical Compression，AHC），一种元学习框架，包含三大创新：（1）基于真实MAML的压缩，通过梯度下降在仅5步内循环中适应每个新任务；（2）具有尺度感知比例的层次多尺度压缩（P3为8:1，P4为6.4:1，P5为4:1），匹配FPN的冗余模式；（3）结合短期与长期存储库的双重记忆架构，在严格的100KB内存预算下基于重要性进行整合。我们提供了形式化理论保证，将灾难性遗忘界定为O(ε√T + 1/√M)，其中ε为压缩误差，T为任务数量，M为内存大小。在CORe50、TiROD和PASCAL VOC基准上，结合三种标准基线（微调、EWC、iCaRL）的实验表明，AHC在100KB重放预算内实现了实用的持续检测，通过均值池化压缩特征重放结合EWC正则化和特征蒸馏，达成了具有竞争力的准确率。

View on arXiv Download PDF AI Translation

cs.AI / 6 / 2604.09578

Explainable Planning for Hybrid Systems

混合系统的可解释规划

Sarwar, Mir Md Sajid

Abstract

The recent advancement in artificial intelligence (AI) technologies facilitates a paradigm shift toward automation. Autonomous systems are fully or partially replacing manually crafted ones. At the core of these systems is automated planning. With the advent of powerful planners, automated planning is now applied to many complex and safety-critical domains, including smart energy grids, self-driving cars, warehouse automation, urban and air traffic control, search and rescue operations, surveillance, robotics, and healthcare. There is a growing need to generate explanations of AI-based systems, which is one of the major challenges the planning community faces today. The thesis presents a comprehensive study on explainable artificial intelligence planning (XAIP) for hybrid systems that capture a representation of real-world problems closely.

Chinese Translation

人工智能（AI）技术的最新进展推动了向自动化范式的转变。自主系统正在完全或部分替代手工设计的系统。这些系统的核心是自动化规划。随着强大规划器的出现，自动化规划现已应用于许多复杂且安全关键的领域，包括智能电网、自动驾驶汽车、仓库自动化、城市及空中交通管制、搜救行动、监控、机器人技术和医疗保健。对基于AI系统的解释需求日益增长，这是当前规划领域面临的主要挑战之一。本论文对面向紧密反映现实问题的混合系统的可解释人工智能规划（Explainable Artificial Intelligence Planning, XAIP）进行了全面研究。

View on arXiv Download PDF AI Translation

cs.AI / 7 / 2604.09579

Help Without Being Asked: A Deployed Proactive Agent System for On-Call Support with Continuous Self-Improvement

无需请求的帮助：一个用于随叫随到支持的主动代理系统及其持续自我改进

Liu, Fengrui, He, Xiao, Zhang, Tieying

Abstract

In large-scale cloud service platforms, thousands of customer tickets are generated daily and are typically handled through on-call dialogues. This high volume of on-call interactions imposes a substantial workload on human support analysts. Recent studies have explored reactive agents that leverage large language models as a first line of support to interact with customers directly and resolve issues. However, when issues remain unresolved and are escalated to human support, these agents are typically disengaged. As a result, they cannot assist with follow-up inquiries, track resolution progress, or learn from the cases they fail to address. In this paper, we introduce Vigil, a novel proactive agent system designed to operate throughout the entire on-call life-cycle. Unlike reactive agents, Vigil focuses on providing assistance during the phase in which human support is already involved. It integrates into the dialogue between the customer and the analyst, proactively offering assistance without explicit user invocation. Moreover, Vigil incorporates a continuous self-improvement mechanism that extracts knowledge from human-resolved cases to autonomously update its capabilities. Vigil has been deployed on Volcano Engine, ByteDance's cloud platform, for over ten months, and comprehensive evaluations based on this deployment demonstrate its effectiveness and practicality. The open source version of this work is publicly available at https://github.com/volcengine/veaiops.

Chinese Translation

在大规模云服务平台上，每天会生成数千个客户工单，通常通过随叫随到的对话进行处理。这种高频率的随叫随到交互给人类支持分析师带来了巨大的工作负担。近期的研究探索了利用大型语言模型作为第一线支持的反应式代理，直接与客户互动并解决问题。然而，当问题未能解决并被升级到人类支持时，这些代理通常会 disengaged。因此，它们无法协助后续查询、跟踪解决进度或从未能处理的案例中学习。本文介绍了 Vigil，一个新颖的主动代理系统，旨在贯穿整个随叫随到的生命周期。与反应式代理不同，Vigil 专注于在已经有人类支持参与的阶段提供帮助。它融入客户与分析师之间的对话，主动提供帮助，而无需用户明确请求。此外，Vigil 还结合了一个持续自我改进机制，从人类解决的案例中提取知识，以自主更新其能力。Vigil 已在字节跳动的云平台 Volcano Engine 上部署超过十个月，基于此部署的全面评估展示了其有效性和实用性。本工作的开源版本可在 https://github.com/volcengine/veaiops 获取。

View on arXiv Download PDF AI Translation

cs.AI / 8 / 2604.09580

OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling

OOWM：通过面向对象的程序化世界建模构建具身推理与规划结构

Chen, Hongyu, Lin, Liang, Wang, Guangrun

Abstract

Standard Chain-of-Thought (CoT) prompting empowers Large Language Models (LLMs) with reasoning capabilities, yet its reliance on linear natural language is inherently insufficient for effective world modeling in embodied tasks. While text offers flexibility, it fails to explicitly represent the state-space, object hierarchies, and causal dependencies required for robust robotic planning. To address these limitations, we propose Object-Oriented World Modeling (OOWM), a novel framework that structures embodied reasoning through the lens of software engineering formalisms. We redefine the world model not as a latent vector space, but as an explicit symbolic tuple $W = \langle S, T \rangle$: a State Abstraction ($G_\text{state}$) instantiating the environmental state $S$, coupled with a Control Policy ($G_\text{control}$) representing the transition logic $T: S \times A \rightarrow S'$. OOWM leverages the Unified Modeling Language (UML) to materialize this definition: it employs Class Diagrams to ground visual perception into rigorous object hierarchies, and Activity Diagrams to operationalize planning into executable control flows. Furthermore, we introduce a three-stage training pipeline combining Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). Crucially, this method utilizes outcome-based rewards from the final plan to implicitly optimize the underlying object-oriented reasoning structure, enabling effective learning even with sparse annotations. Extensive evaluations on the MRoom-30k benchmark demonstrate that OOWM significantly outperforms unstructured textual baselines in planning coherence, execution success, and structural fidelity, establishing a new paradigm for structured embodied reasoning.

Chinese Translation

标准的链式思维（Chain-of-Thought，CoT）提示增强了大型语言模型（LLMs）的推理能力，但其对线性自然语言的依赖在具身任务中的有效世界建模方面存在固有不足。尽管文本具有灵活性，但它无法明确表示状态空间、对象层级结构及因果依赖关系，而这些是实现稳健机器人规划所必需的。为了解决这些限制，我们提出了面向对象的世界建模（Object-Oriented World Modeling，OOWM）框架，该框架通过软件工程形式主义的视角来构建具身推理。我们将世界模型重新定义为显式的符号元组 $W = \langle S, T \rangle$：其中状态抽象（$G_{\text{state}}$）实例化环境状态 $S$，控制策略（$G_{\text{control}}$）表示状态转移逻辑 $T: S \times A \rightarrow S'$。OOWM 利用统一建模语言（UML）实现该定义：采用类图将视觉感知具体化为严谨的对象层级结构，采用活动图将规划操作化为可执行的控制流程。此外，我们引入了结合监督微调（Supervised Fine-Tuning，SFT）与群体相对策略优化（Group Relative Policy Optimization，GRPO）的三阶段训练流程。该方法关键在于利用最终规划结果的基于结果的奖励，隐式优化底层面向对象的推理结构，使得即使在标注稀疏的情况下也能实现有效学习。在 MRoom-30k 基准上的广泛评估表明，OOWM 在规划连贯性、执行成功率及结构保真度方面显著优于无结构文本基线，确立了结构化具身推理的新范式。

View on arXiv Download PDF AI Translation

cs.AI / 9 / 2604.09581

OpeFlo: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding

OpeFlo：基于GUI定位的模拟人类网页交互的自动化用户体验评估

Tan, Wee Joe, Lim, Zi Rui Lucas, Durgad, Shashank, Obegi, Karim, Li, Aiden Yiliu

Abstract

Evaluating web usability typically requires time-consuming user studies and expert reviews, which often limits iteration speed during product development, especially for small teams and agile workflows. We present OpenFlo, a user-experience evaluation agent that simulates user behavior on websites and produces standardized usability. Unlike traditional tools that rely on DOM parsing, OpenFlo grounds actions and observations, enabling it to interact with real web pages end-to-end while maintaining a coherent trace of the user journey. Building on Avenir-Web, our system pairs this robust interaction with simulated user behavior profiles and a structured evaluation protocol that integrates the System Usability Scale (SUS), step-wise Single Ease Questions (SEQ), and concurrent Think Aloud. Subsequently, a comprehensive User Experience (UX) report will be generated. We discuss the architecture of OpenFlo and illustrate how its multimodal grounding improves robustness for web-based interaction and UX evaluation scenarios, paving the way for a new era of continuous, scalable, and data-driven usability testing that empowers every developer to build web interfaces that are usable. Code is available at: https://github.com/Onflow-AI/OpenFlo

Chinese Translation

评估网页可用性通常需要耗时的用户研究和专家评审，这往往限制了产品开发过程中的迭代速度，尤其对于小型团队和敏捷工作流程而言。我们提出了OpenFlo，一种用户体验评估代理，能够模拟用户在网站上的行为并生成标准化的可用性评估。不同于依赖DOM解析的传统工具，OpenFlo通过动作和观察的定位，实现了对真实网页的端到端交互，同时保持用户路径的连贯追踪。基于Avenir-Web，我们的系统将这种稳健的交互与模拟用户行为模型及结构化评估协议相结合，协议整合了系统可用性量表（System Usability Scale, SUS）、分步单项易用性问题（Single Ease Questions, SEQ）以及同步的思维大声说（Think Aloud）。随后，系统将生成一份全面的用户体验（UX）报告。我们讨论了OpenFlo的架构，并展示其多模态定位如何提升基于网页的交互和用户体验评估场景的鲁棒性，为实现持续、可扩展且数据驱动的可用性测试开辟了新途径，使每位开发者都能构建出更具可用性的网页界面。代码可在：https://github.com/Onflow-AI/OpenFlo 获取。

View on arXiv Download PDF AI Translation

cs.AI / 10 / 2604.09582

Factorizing formal contexts from closures of necessity operators

从必要性算子的闭包中分解形式背景

Aragón, Roberto G., Medina, Jesús, Ramírez-Poussa, Eloísa

Abstract

Factorizing datasets is an interesting process in a multitude of approaches, but many times it is not possible or efficient the computation of a factorization of the dataset. A method to obtain independent subcontexts of a formal context with Boolean data was proposed in~\cite{dubois:2012}, based on the operators used in possibility theory. In this paper, we will analyze this method and study different properties related to the pairs of sets from which a factorization of a formal context arises. We also inspect how the properties given in the classical case can be extended to the fuzzy framework, which is essential to obtain a mechanism that allows the computation of independent subcontexts of a fuzzy context.

Chinese Translation

数据集的分解是多种方法中一个有趣的过程，但在许多情况下，计算数据集的分解并不可能或效率低下。文献~ extit{(dubois:2012)}中提出了一种基于可能性理论中使用的算子的方式，以获得具有布尔数据的形式背景的独立子背景。本文将分析该方法，并研究与形式背景分解相关的一对集合的不同属性。我们还将检查经典情况下给出的属性如何扩展到模糊框架，这对于获得允许计算模糊背景的独立子背景的机制至关重要。

View on arXiv Download PDF AI Translation

cs.AI / 11 / 2604.09584

Agentic Exploration of PDE Spaces using Latent Foundation Models for Parameterized Simulations

利用潜在基础模型进行参数化模拟的偏微分方程空间的主动探索

Vishwasrao, Abhijeet, Giral, Francisco, Golestanian, Mahmoud, Tonti, Federica, Ramo, Andrea Arroyo, Lozano-Duran, Adrian, Brunton, Steven L., Hoyas, Sergio, Clainche, Soledad Le, Gomez, Hector, Vinuesa, Ricardo

Abstract

Flow physics and more broadly physical phenomena governed by partial differential equations (PDEs), are inherently continuous, high-dimensional and often chaotic in nature. Traditionally, researchers have explored these rich spatiotemporal PDE solution spaces using laboratory experiments and/or computationally expensive numerical simulations. This severely limits automated and large-scale exploration, unlike domains such as drug discovery or materials science, where discrete, tokenizable representations naturally interface with large language models. We address this by coupling multi-agent LLMs with latent foundation models (LFMs), a generative model over parametrised simulations, that learns explicit, compact and disentangled latent representations of flow fields, enabling continuous exploration across governing PDE parameters and boundary conditions. The LFM serves as an on-demand surrogate simulator, allowing agents to query arbitrary parameter configurations at negligible cost. A hierarchical agent architecture orchestrates exploration through a closed loop of hypothesis, experimentation, analysis and verification, with a tool-modular interface requiring no user support. Applied to flow past tandem cylinders at Re = 500, the framework autonomously evaluates over 1,600 parameter-location pairs and discovers divergent scaling laws: a regime-dependent two-mode structure for minimum displacement thickness and a robust linear scaling for maximum momentum thickness, with both landscapes exhibiting a dual-extrema structure that emerges at the near-wake to co-shedding regime transition. The coupling of the learned physical representations with agentic reasoning establishes a general paradigm for automated scientific discovery in PDE-governed systems.

Chinese Translation

流体物理学以及更广泛的由偏微分方程（PDE）支配的物理现象，本质上是连续的、高维的，并且常常具有混沌特性。传统上，研究人员通过实验室实验和/或计算成本高昂的数值模拟来探索这些丰富的时空PDE解空间。这严重限制了自动化和大规模探索的能力，与药物发现或材料科学等领域不同，后者的离散、可标记表示自然与大型语言模型接口。我们通过将多智能体大型语言模型（LLMs）与潜在基础模型（LFMs）相结合来解决这一问题，LFMs是一种针对参数化模拟的生成模型，能够学习流场的显式、紧凑和解耦的潜在表示，从而实现对主导PDE参数和边界条件的连续探索。LFM作为按需替代模拟器，允许智能体以微乎其微的成本查询任意参数配置。一个分层智能体架构通过假设、实验、分析和验证的闭环来协调探索，工具模块化接口无需用户支持。应用于Re = 500的串联圆柱体流动，该框架自主评估超过1,600个参数-位置对，并发现了不同的缩放规律：最小位移厚度的状态依赖性双模结构和最大动量厚度的稳健线性缩放，两者的景观都表现出在近尾流到共同脱落状态转变时出现的双极值结构。学习到的物理表示与智能推理的结合建立了一个在PDE支配系统中进行自动化科学发现的通用范式。

View on arXiv Download PDF AI Translation

cs.AI / 12 / 2604.09587

MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion

MobiFlow：通过轨迹融合进行真实世界移动代理基准测试

Feng, Yunfei, Zhao, Xi, Zhang, Cheng, Feng, Dahu, Cheng, Daolin, Yu, Jianqi, Xia, Yubin, Feng, Erhu

Abstract

Mobile agents can autonomously complete user-assigned tasks through GUI interactions. However, existing mainstream evaluation benchmarks, such as AndroidWorld, operate by connecting to a system-level Android emulator and provide evaluation signals based on the state of system resources. In real-world mobile-agent scenarios, however, many third-party applications do not expose system-level APIs to determine whether a task has succeeded, leading to a mismatch between benchmarks and real-world usage and making it difficult to evaluate model performance accurately. To address these issues, we propose MobiFlow, an evaluation framework built on tasks drawn from arbitrary third-party applications. Using an efficient graph-construction algorithm based on multi-trajectory fusion, MobiFlow can effectively compress the state space, support dynamic interaction, and better align with real-world third-party application scenarios. MobiFlow covers 20 widely used third-party applications and comprises 240 diverse real-world tasks, with enriched evaluation metrics. Compared with AndroidWorld, MobiFlow's evaluation results show higher alignment with human assessments and can guide the training of future GUI-based models under real workloads.

Chinese Translation

移动代理可以通过图形用户界面（GUI）交互自主完成用户分配的任务。然而，现有的主流评估基准，如AndroidWorld，是通过连接到系统级Android模拟器来运行的，并根据系统资源的状态提供评估信号。然而，在真实世界的移动代理场景中，许多第三方应用程序并未暴露系统级API来确定任务是否成功，导致基准测试与真实使用之间存在不匹配，从而使得准确评估模型性能变得困难。为了解决这些问题，我们提出了MobiFlow，一个基于任意第三方应用程序任务构建的评估框架。MobiFlow采用基于多轨迹融合的高效图构建算法，能够有效压缩状态空间，支持动态交互，并更好地与真实世界的第三方应用场景对齐。MobiFlow涵盖了20个广泛使用的第三方应用程序，并包含240个多样化的真实世界任务，提供丰富的评估指标。与AndroidWorld相比，MobiFlow的评估结果与人类评估的对齐度更高，并能够指导未来在真实工作负载下基于GUI的模型训练。

View on arXiv Download PDF AI Translation

cs.AI / 13 / 2604.09588

Persistent Identity in AI Agents: A Multi-Anchor Architecture for Resilient Memory and Continuity

人工智能代理中的持久身份：一种用于弹性记忆与连续性的多锚架构

Menon, Prahlad G.

Abstract

Modern AI agents suffer from a fundamental identity problem: when context windows overflow and conversation histories are summarized, agents experience catastrophic forgetting -- losing not just information, but continuity of self. This technical limitation reflects a deeper architectural flaw: AI agent identity is centralized in a single memory store, creating a single point of failure. Drawing on neurological case studies of human memory disorders, we observe that human identity survives damage because it is distributed across multiple systems: episodic memory, procedural memory, emotional continuity, and embodied knowledge. We present soul.py, an open-source architecture that implements persistent identity through separable components (identity files and memory logs), and propose extensions toward multi-anchor resilience. The framework introduces a hybrid RAG+RLM retrieval system that automatically routes queries to appropriate memory access patterns, achieving efficient retrieval without sacrificing comprehensiveness. We formalize the notion of identity anchors for AI systems and present a roadmap for building agents whose identity can survive partial memory failures. Code is available at github.com/menonpg/soul.py

Chinese Translation

现代人工智能代理面临一个根本性的身份问题：当上下文窗口溢出且对话历史被摘要时，代理会经历灾难性遗忘——不仅丢失信息，更丧失自我连续性。这一技术限制反映了更深层的架构缺陷：人工智能代理的身份集中于单一记忆存储，形成单点故障。借鉴神经学中关于人类记忆障碍的案例研究，我们观察到人类身份之所以能在损伤后存续，是因为其分布于多个系统：情景记忆、程序性记忆、情感连续性和具身知识。我们提出了 soul.py，这是一个通过可分离组件（身份文件和记忆日志）实现持久身份的开源架构，并提出了迈向多锚弹性的扩展方案。该框架引入了一种混合的 RAG+RLM 检索系统，能够自动将查询路由至适当的记忆访问模式，实现高效检索且不牺牲全面性。我们形式化了人工智能系统身份锚点的概念，并提出了构建身份能在部分记忆故障中存续的代理的路线图。代码可在 github.com/menonpg/soul.py 获取。

View on arXiv Download PDF AI Translation

cs.AI / 14 / 2604.09590

DeepReviewer 2.0: A Traceable Agentic System for Auditable Scientific Peer Review

DeepReviewer 2.0：一个可追溯的代理系统用于可审计的科学同行评审

Weng, Yixuan, Zhu, Minjun, Xie, Qiujie, Ning, Zhiyuan, Li, Shichen, Lu, Panzhong, Lin, Zhen, Gu, Enhao, Sun, Qiyao, Zhang, Yue

Abstract

Automated peer review is often framed as generating fluent critique, yet reviewers and area chairs need judgments they can \emph{audit}: where a concern applies, what evidence supports it, and what concrete follow-up is required. DeepReviewer~2.0 is a process-controlled agentic review system built around an output contract: it produces a \textbf{traceable review package} with anchored annotations, localized evidence, and executable follow-up actions, and it exports only after meeting minimum traceability and coverage budgets. Concretely, it first builds a manuscript-only claim--evidence--risk ledger and verification agenda, then performs agenda-driven retrieval and writes anchored critiques under an export gate. On 134 ICLR~2025 submissions under three fixed protocols, an \emph{un-finetuned 196B} model running DeepReviewer~2.0 outperforms Gemini-3.1-Pro-preview, improving strict major-issue coverage (37.26\% vs.\ 23.57\%) and winning 71.63\% of micro-averaged blind comparisons against a human review committee, while ranking first among automatic systems in our pool. We position DeepReviewer~2.0 as an assistive tool rather than a decision proxy, and note remaining gaps such as ethics-sensitive checks.

Chinese Translation

自动化同行评审通常被视为生成流畅的批评，然而评审者和领域主席需要可以 extit{审计}的判断：关注点适用的地方、支持该关注点的证据以及所需的具体后续行动。DeepReviewer~2.0是一个基于输出契约的过程控制代理评审系统：它生成一个 extbf{可追溯的评审包}，包含固定的注释、本地化的证据和可执行的后续行动，并且仅在满足最低追溯性和覆盖预算后进行导出。具体而言，它首先构建一个仅包含手稿的主张-证据-风险账本和验证议程，然后根据议程驱动进行检索，并在导出门下撰写固定的批评。在134个ICLR~2025提交的案例中，运行DeepReviewer~2.0的 extit{未微调的196B}模型超越了Gemini-3.1-Pro-preview，提高了严格主要问题覆盖率（37.26 ext{ extperthousand} vs. 23.57 ext{ extperthousand}），并在与人类评审委员会的微平均盲比较中获胜71.63 ext{ extperthousand}，同时在我们的自动系统池中排名第一。我们将DeepReviewer~2.0定位为辅助工具，而非决策代理，并指出仍然存在如伦理敏感检查等差距。

View on arXiv Download PDF AI Translation

cs.AI / 15 / 2604.09594

Spatial Competence Benchmark

空间能力基准

Vira, Jash, Harris, Ashley

Abstract

Spatial competence is the quality of maintaining a consistent internal representation of an environment and using it to infer discrete structure and plan actions under constraints. Prevailing spatial evaluations for large models are limited to probing isolated primitives through 3D transformations or visual question answering. We introduce the Spatial Competence Benchmark (SCBench), spanning three hierarchical capability buckets whose tasks require executable outputs verified by deterministic checkers or simulator-based evaluators. On SCBench, three frontier models exhibit monotonically decreasing accuracy up the capability ladder. Sweeping output-token caps shows that accuracy gains concentrate at low budgets and saturate quickly, and failures are dominated by locally plausible geometry that breaks global constraints. We release the task generators, verifiers, and visualisation tooling.

Chinese Translation

空间能力是指维持环境一致的内部表征并利用该表征在约束条件下推断离散结构和规划行动的能力。目前针对大型模型的空间评估主要局限于通过3D变换或视觉问答来探测孤立的原语。我们提出了空间能力基准（SCBench），涵盖三个层次的能力类别，其任务要求可执行的输出由确定性检查器或基于模拟器的评估者进行验证。在SCBench上，三个前沿模型在能力阶梯上表现出单调递减的准确率。对输出令牌上限的全面测试表明，准确率的提升主要集中在低预算上，并迅速饱和，失败主要由局部合理的几何形状引起，这些形状破坏了全局约束。我们发布了任务生成器、验证器和可视化工具。

View on arXiv Download PDF AI Translation

cs.AI / 16 / 2604.09596

DERM-3R: A Resource-Efficient Multimodal Agents Framework for Dermatologic Diagnosis and Treatment in Real-World Clinical Settings

DERM-3R：一种资源高效的多模态代理框架，用于现实临床环境中的皮肤病诊断与治疗

Chen, Ziwen, Wang, Zhendong, Wang, Chongjing, Dong, Yurui, Jin, Luozhijie, Gu, Jihao, Chen, Kui, Yang, Jiaxi, Lu, Bingjie, Zhang, Zhou, Dai, Jirui, Luo, Changyong, Gai, Xiameng, Lan, Haibing, Liu, Zhi

Abstract

Dermatologic diseases impose a large and growing global burden, affecting billions and substantially reducing quality of life. While modern therapies can rapidly control acute symptoms, long-term outcomes are often limited by single-target paradigms, recurrent courses, and insufficient attention to systemic comorbidities. Traditional Chinese medicine (TCM) provides a complementary holistic approach via syndrome differentiation and individualized treatment, but practice is hindered by non-standardized knowledge, incomplete multimodal records, and poor scalability of expert reasoning. We propose DERM-3R, a resource-efficient multimodal agent framework to model TCM dermatologic diagnosis and treatment under limited data and compute. Based on real-world workflows, we reformulate decision-making into three core issues: fine-grained lesion recognition, multi-view lesion representation with specialist-level pathogenesis modeling, and holistic reasoning for syndrome differentiation and treatment planning. DERM-3R comprises three collaborative agents: DERM-Rec, DERM-Rep, and DERM-Reason, each targeting one component of this pipeline. Built on a lightweight multimodal LLM and partially fine-tuned on 103 real-world TCM psoriasis cases, DERM-3R performs strongly across dermatologic reasoning tasks. Evaluations using automatic metrics, LLM-as-a-judge, and physician assessment show that despite minimal data and parameter updates, DERM-3R matches or surpasses large general-purpose multimodal models. These results suggest structured, domain-aware multi-agent modeling can be a practical alternative to brute-force scaling for complex clinical tasks in dermatology and integrative medicine.

Chinese Translation

皮肤病给全球带来了巨大且日益增长的负担，影响数十亿人并显著降低生活质量。尽管现代疗法能够迅速控制急性症状，但长期效果往往受到单一靶点范式、复发性治疗和对系统性合并症关注不足的限制。传统中医（Traditional Chinese Medicine, TCM）通过辨证论治和个性化治疗提供了一种互补的整体方法，但由于知识不规范、缺乏完整的多模态记录以及专家推理的可扩展性差，实践受到阻碍。我们提出了DERM-3R，一种资源高效的多模态代理框架，用于在有限数据和计算条件下建模中医皮肤病的诊断与治疗。基于现实工作流程，我们将决策制定重新表述为三个核心问题：细粒度病变识别、多视角病变表征与专家级病因建模，以及辨证论治和治疗规划的整体推理。DERM-3R由三个协作代理组成：DERM-Rec、DERM-Rep和DERM-Reason，每个代理针对这一流程的一个组成部分。基于轻量级多模态大语言模型（LLM）并在103个真实世界中医银屑病案例上进行部分微调，DERM-3R在皮肤病推理任务中表现出色。使用自动评估指标、LLM作为评判者和医生评估的结果表明，尽管数据和参数更新极少，DERM-3R的表现与大型通用多模态模型相当或更优。这些结果表明，结构化的、领域感知的多代理建模可以成为皮肤病和综合医学中复杂临床任务的实际替代方案，而不是依赖于粗暴的扩展。

View on arXiv Download PDF AI Translation

cs.AI / 17 / 2604.09600

CID-TKG: Collaborative Historical Invariance and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning

CID-TKG：用于时间知识图推理的协作历史不变性与演化动态学习

Lei, Shuai-Long, Zhu, Xiaobin, Liang, Jiarui, Sun, Guoxi, Fang, Zhiyu, Yin, Xu-Cheng

Abstract

Temporal knowledge graph (TKG) reasoning aims to infer future facts at unseen timestamps from temporally evolving entities and relations. Despite recent progress, existing approaches still suffer from inherent limitations due to their inductive biases, as they predominantly rely on time-invariant or weakly time-dependent structures and overlook the evolutionary dynamics. To overcome this limitation, we propose a novel collaborative learning framework for TKGR (dubbed CID-TKG) that integrates evolutionary dynamics and historical invariance semantics as an effective inductive bias for reasoning. Specifically, CID-TKG constructs a historical invariance graph to capture long-term structural regularities and an evolutionary dynamics graph to model short-term temporal transitions. Dedicated encoders are then employed to learn representations from each structure. To alleviate semantic discrepancies across the two structures, we decompose relations into view-specific representations and align view-specific query representations via a contrastive objective, which promotes cross-view consistency while suppressing view-specific noise. Extensive experiments verify that our CID-TKG achieves state-of-the-art performance under extrapolation settings.

Chinese Translation

时间知识图（TKG）推理旨在从时间演变的实体和关系中推断在未见时间戳下的未来事实。尽管近期取得了一定进展，但现有方法仍然由于其归纳偏差而面临固有局限性，因为它们主要依赖于时间不变或弱时间依赖的结构，并忽视了演化动态。为克服这一限制，我们提出了一种新的TKGR（时间知识图推理）协作学习框架（称为CID-TKG），该框架将演化动态和历史不变性语义整合为推理的有效归纳偏差。具体而言，CID-TKG构建了一个历史不变性图，以捕捉长期结构规律，并构建了一个演化动态图，以建模短期时间过渡。然后，采用专门的编码器从每个结构中学习表示。为了缓解两个结构之间的语义差异，我们将关系分解为视图特定表示，并通过对比目标对视图特定查询表示进行对齐，从而促进跨视图一致性，同时抑制视图特定噪声。大量实验验证了我们的CID-TKG在外推设置下实现了最先进的性能。

View on arXiv Download PDF AI Translation

cs.AI / 18 / 2604.09601

Hubble: An LLM-Driven Agentic Framework for Safe and Automated Alpha Factor Discovery

Hubble：一个基于大型语言模型的安全自动化阿尔法因子发现框架

Shi, Runze, Yan, Shengyu, Cai, Yuecheng, Lv, Chengxi

Abstract

Discovering predictive alpha factors in quantitative finance remains a formidable challenge due to the vast combinatorial search space and inherently low signal-to-noise ratios in financial data. Existing automated methods, particularly genetic programming, often produce complex, uninterpretable formulas prone to overfitting. We introduce Hubble, a closed-loop factor mining framework that leverages Large Language Models (LLMs) as intelligent search heuristics, constrained by a domain-specific operator language and an Abstract Syntax Tree (AST)-based execution sandbox. The framework evaluates candidate factors through a rigorous statistical pipeline encompassing cross-sectional Rank Information Coefficient (RankIC), annualized Information Ratio, and portfolio turnover. An evolutionary feedback mechanism returns top-performing factors and structured error diagnostics to the LLM, enabling iterative refinement across multiple generation rounds. In experiments conducted on a panel of 30 U.S. equities over 752 trading days, the system evaluated 181 syntactically valid factors from 122 unique candidates across three rounds, achieving a peak composite score of 0.827 with 100% computational stability. Our results demonstrate that combining LLM-driven generation with deterministic safety constraints yields an effective, interpretable, and reproducible approach to automated factor discovery.

Chinese Translation

在定量金融中，发现预测性阿尔法因子仍然是一个巨大的挑战，因为存在广泛的组合搜索空间和金融数据中固有的低信噪比。现有的自动化方法，特别是遗传编程，往往产生复杂且难以解释的公式，容易导致过拟合。我们提出了Hubble，一个闭环因子挖掘框架，利用大型语言模型（LLMs）作为智能搜索启发式方法，并受限于特定领域的操作语言和基于抽象语法树（AST）的执行沙箱。该框架通过严格的统计流程评估候选因子，包括横截面排名信息系数（RankIC）、年化信息比率和投资组合周转率。一个进化反馈机制将表现最佳的因子和结构化的错误诊断返回给LLM，从而实现跨多个世代轮次的迭代优化。在对30只美国股票进行的752个交易日的实验中，该系统从122个独特候选中评估了181个语法有效的因子，达到了峰值综合得分0.827，并保持100%的计算稳定性。我们的结果表明，将LLM驱动的生成与确定性安全约束相结合，能够有效、可解释且可重复地实现自动化因子发现。

View on arXiv Download PDF AI Translation

cs.AI / 19 / 2604.09602

From Scalars to Tensors: Declared Losses Recover Epistemic Distinctions That Neutrosophic Scalars Cannot Express

从标量到张量：声明性损失恢复中立标量无法表达的认知区分

Mason, Tony

Abstract

Leyva-V\'azquez and Smarandache (2025) demonstrated that neutrosophic T/I/F evaluation, where Truth, Indeterminacy, and Falsity are independent dimensions not constrained to sum to 1.0, which reveals "hyper-truth"' (T+I+F > 1.0) in 35% of complex epistemic cases evaluated by LLMs. We extend their work in two directions. First, we replicate and extend their experiment across five model families from five vendors (Anthropic, Meta, DeepSeek, Alibaba, Mistral), finding hyper-truth in 84% of unconstrained evaluations, which confirms the phenomenon is cross-vendor under our prompt protocol. Second, and more significantly, we identify a limitation of scalar T/I/F that their framework cannot address: models adopting an `"Absorption" position (T=0, I=1, F=0) produce identical scalar outputs for fundamentally different epistemic situations (paradox, ignorance, contingency), collapsing the very distinctions neutrosophic logic was designed to preserve. We demonstrate that extending the evaluation to include declared losses (structured descriptions of what the model cannot evaluate and why) substantially recovers these distinctions. Models producing identical scalars for paradox and ignorance produce nearly disjoint loss vocabularies (Jaccard similarity < 0.10 on loss description keywords), with domain-specific, severity-rated loss declarations that differentiate the nature of their uncertainty. This suggests that scalar T/I/F is a necessary but insufficient representation of epistemic state, and that tensor-structured output (scalars + losses) provides a more faithful model of LLM epistemic capabilities.

Chinese Translation

Leyva-Vázquez 和 Smarandache（2025）证明了中立标度的真值/不确定性/虚假性（T/I/F）评估，其中真值（Truth）、不确定性（Indeterminacy）和虚假性（Falsity）是相互独立的维度，不受限于其和为1.0，这在由大型语言模型（LLMs）评估的复杂认知案例中揭示了35%的“超真”（T+I+F > 1.0）现象。我们在两个方向上扩展了他们的工作。首先，我们在来自五个供应商（Anthropic、Meta、DeepSeek、Alibaba、Mistral）的五个模型家族中复制并扩展了他们的实验，发现84%的非约束评估中存在超真现象，确认了在我们的提示协议下该现象具有跨供应商的普适性。其次，更重要的是，我们识别出标量T/I/F的一个局限性：其框架无法解决采用“吸收”立场（T=0, I=1, F=0）的模型，这些模型对本质上不同的认知情境（悖论、无知、偶然性）产生相同的标量输出，导致中立逻辑旨在保持的区分被压缩。我们展示了将评估扩展为包含声明性损失（即模型无法评估的内容及原因的结构化描述）显著恢复了这些区分。对于悖论和无知产生相同标量的模型，其损失词汇几乎不重叠（损失描述关键词的Jaccard相似度 < 0.10），并具有领域特定且带严重度评级的损失声明，区分了其不确定性的性质。这表明标量T/I/F是认知状态的必要但不充分的表示形式，而张量结构的输出（标量+损失）则提供了对大型语言模型认知能力更为忠实的建模。

View on arXiv Download PDF AI Translation

cs.AI / 20 / 2604.09604

LLMs for Text-Based Exploration and Navigation Under Partial Observability

在部分可观测性下，基于文本的探索与导航的语言模型

Sandfuchs, Stephan, Melchert, Maximilian, Frochte, Jörg

Abstract

Exploration and goal-directed navigation in unknown layouts are central to inspection, logistics, and search-and-rescue. We ask whether large language models (LLMs) can function as \emph{text-only} controllers under partial observability -- without code execution, tools, or program synthesis. We introduce a reproducible benchmark with oracle localisation in fixed ASCII gridworlds: each step reveals only a local $5\times5$ window around the agent and the model must select one of \texttt{UP/RIGHT/DOWN/LEFT}. Nine contemporary LLMs ranging from open/proprietary, dense / Mixture of Experts and instruction- vs. reasoning-tuned are evaluated on two tasks across three layouts of increasing difficulty: \emph{Exploration} (maximising revealed cells) and \emph{Navigation} (reach the goal on the shortest path). The experimental results are evaluated on quantitative metrics including \emph{success rate}, \emph{efficiency} such as normalised coverage and \emph{path length} vs. oracle as well as qualitative analysis. Reasoning-tuned models reliably complete navigation across all layouts, yet remain less efficient than oracle paths. Few-shot demonstrations in the prompt chiefly help these Reasoning-tuned models by reducing invalid moves and shortening paths, while classic dense instruction models remain inconsistent. We observe characteristic action priors (UP/RIGHT) that can induce looping under partial observability. Overall, training regimen and test-time deliberation predict control ability better than raw parameter count. These findings suggest lightweight hybridisation with classical online planners as a practical route to deployable partial map systems.

Chinese Translation

在未知布局中进行探索和目标导向的导航是检查、物流和搜索救援的核心任务。我们探讨大型语言模型（LLMs）是否能够在部分可观测性下作为 extit{仅文本}控制器运作——不依赖于代码执行、工具或程序合成。我们引入了一个可重复的基准测试，在固定的ASCII网格世界中进行oracle定位：每一步仅揭示代理周围的局部$5 imes5$窗口，模型必须选择 exttt{UP/RIGHT/DOWN/LEFT}中的一个。我们评估了九个当代LLM，涵盖开放/专有、密集/专家混合以及指令调优与推理调优的模型，针对三个难度逐渐增加的布局进行两项任务的评估： extit{探索}（最大化揭示的单元格）和 extit{导航}（以最短路径到达目标）。实验结果通过定量指标进行评估，包括 extit{成功率}、 extit{效率}（如标准化覆盖率）和与oracle的 extit{路径长度}，同时进行定性分析。经过推理调优的模型在所有布局中可靠地完成导航，但效率仍低于oracle路径。提示中的少量示范主要通过减少无效移动和缩短路径来帮助这些推理调优模型，而经典的密集指令模型则表现不一致。我们观察到特征性动作先验（UP/RIGHT）在部分可观测性下可能导致循环。总体而言，训练方案和测试时的深思熟虑比原始参数数量更能预测控制能力。这些发现表明，与经典在线规划器的轻量级混合是一种可行的部署部分地图系统的途径。

View on arXiv Download PDF AI Translation

cs.AI / 21 / 2604.09606

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

通过重复提示采样评估大型语言模型安全性的可靠性差距

Broadwater, Keita

Abstract

Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment often exposes a different class of risk: operational failures arising from repeated generations of the same prompt rather than broad task generalization. In high-stakes settings, response consistency and safety under repeated use are critical operational requirements. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by highly accelerated stress testing in reliability engineering. APST probes LLM behavior by repeatedly sampling identical prompts under controlled operational conditions, including temperature variation and prompt perturbation, to surface latent failure modes such as hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST characterizes them statistically as stochastic outcomes of repeated inference. We model observed safety failures using Bernoulli and binomial formulations to estimate per-inference failure probabilities, enabling quantitative comparison of operational risk across models and configurations. We apply APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH 2024 derived safety and security prompts. While models exhibit similar performance under conventional single- or very-low-sample evaluation (N <= 3), repeated sampling reveals substantial variation in empirical failure probabilities across temperatures. These results demonstrate that shallow benchmark scores can obscure meaningful differences in reliability under sustained use.

Chinese Translation

传统的大型语言模型（LLMs）基准测试，如HELM和AIR-BENCH，主要通过对多样化任务的广度评估来衡量安全风险。然而，实际部署中常暴露出另一类风险：由对同一提示的重复生成引发的操作性失败，而非广泛任务泛化能力的不足。在高风险场景下，响应一致性和重复使用时的安全性是关键的操作需求。我们提出了加速提示压力测试（Accelerated Prompt Stress Testing，APST），这是一种受可靠性工程中高度加速应力测试启发的深度评估框架。APST通过在受控操作条件下（包括温度变化和提示扰动）对相同提示进行反复采样，揭示潜在的失败模式，如幻觉、拒绝不一致性和不安全的生成结果。APST不将失败视为孤立事件，而是将其统计描述为重复推理的随机结果。我们采用伯努利和二项分布模型对观察到的安全失败进行建模，以估计每次推理的失败概率，从而实现跨模型和配置的操作风险定量比较。我们将APST应用于多个基于指令调优的LLMs，评估对象为AIR-BENCH 2024中衍生的安全与安全性提示。尽管模型在传统的单次或极低样本（N <= 3）评估下表现相似，重复采样揭示了不同温度下经验失败概率的显著差异。结果表明，浅层基准分数可能掩盖持续使用中可靠性的实质性差异。

View on arXiv Download PDF AI Translation

cs.AI / 22 / 2604.09608

Unifying Ontology Construction and Semantic Alignment for Deterministic Enterprise Reasoning at Scale

统一本体构建与语义对齐以实现大规模确定性企业推理

Zhu, Hongyin

Abstract

While enterprises amass vast quantities of data, much of it remains chaotic and effectively dormant, preventing decision-making based on comprehensive information. Existing neuro-symbolic approaches rely on disjoint pipelines and struggle with error propagation. We introduce the large ontology model (LOM), a unified framework that seamlessly integrates ontology construction, semantic alignment, and logical reasoning into a single end-to-end architecture. LOM employs a construct-align-reason (CAR) pipeline, leveraging its unified architecture across all three stages: it first autonomously constructs a domain-specific ontological universe from raw data, then aligns neural generation with this structural reality using a graph-aware encoder and reinforcement learning, and finally executes deterministic reasoning over the constructed topology, node attributes and relation types. We evaluate LOM on a comprehensive benchmark constructed from diverse real-world enterprise datasets. Experimental results demonstrate that LOM-4B achieves 88.8% accuracy in ontology completion and 94% in complex graph reasoning tasks, significantly outperforming state-of-the-art LLMs. These findings validate that autonomous logical construction is essential for achieving deterministic, enterprise-grade intelligence.

Chinese Translation

尽管企业积累了大量数据，但其中许多数据仍然混乱且基本处于休眠状态，阻碍了基于全面信息的决策制定。现有的神经符号方法依赖于分离的处理流程，且难以避免错误传播。我们提出了大本体模型（LOM），这是一个统一框架，将本体构建、语义对齐和逻辑推理无缝整合到单一的端到端架构中。LOM采用构建-对齐-推理（CAR）流程，利用其统一架构贯穿三个阶段：首先从原始数据自主构建领域特定的本体宇宙；然后通过图感知编码器和强化学习将神经生成与该结构现实对齐；最后在构建的拓扑结构、节点属性及关系类型上执行确定性推理。我们在由多样化真实企业数据集构建的综合基准上评估LOM。实验结果表明，LOM-4B在本体补全任务中达到88.8%的准确率，在复杂图推理任务中达到94%，显著优于最先进的大型语言模型（LLMs）。这些发现验证了自主逻辑构建对于实现确定性、企业级智能的关键作用。

View on arXiv Download PDF AI Translation

cs.AI / 23 / 2604.09609

General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging

通用大型语言模型作为人类驾驶行为模型：简化合并案例

Mohammad, Samir H. A., Mooi, Wouter, Zgonnikov, Arkady

Abstract

Human behavior models are essential as behavior references and for simulating human agents in virtual safety assessment of automated vehicles (AVs), yet current models face a trade-off between interpretability and flexibility. General-purpose large language models (LLMs) offer a promising alternative: a single model potentially deployable without parameter fitting across diverse scenarios. However, what LLMs can and cannot capture about human driving behavior remains poorly understood. We address this gap by embedding two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as standalone, closed-loop driver agents in a simplified one-dimensional merging scenario and comparing their behavior against human data using quantitative and qualitative analyses. Both models reproduce human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captures the human response to dynamic velocity cues, and safety performance diverges sharply between models. A systematic prompt ablation study reveals that prompt components act as model-specific inductive biases that do not transfer across LLMs. These findings suggest that general-purpose LLMs could potentially serve as standalone, ready-to-use human behavior models in AV evaluation pipelines, but future research is needed to better understand their failure modes and ensure their validity as models of human driving behavior.

Chinese Translation

人类行为模型在自动驾驶车辆（AVs）的虚拟安全评估中作为行为参考和模拟人类代理至关重要，但当前模型在可解释性和灵活性之间存在权衡。通用大型语言模型（LLMs）提供了一种有前景的替代方案：一个单一模型在多种场景中可能无需参数调整即可部署。然而，LLMs能够捕捉和无法捕捉的人类驾驶行为仍然不够清楚。我们通过将两个通用LLMs（OpenAI o3和Google Gemini 2.5 Pro）嵌入到一个简化的一维合并场景中，作为独立的闭环驾驶代理，并使用定量和定性分析将其行为与人类数据进行比较，来填补这一空白。两个模型都再现了类似人类的间歇性操作控制和对空间线索的战术依赖。然而，两个模型都未能一致捕捉人类对动态速度线索的反应，且安全性能在模型之间存在显著差异。一项系统的提示消融研究表明，提示组件作为特定模型的归纳偏差，无法在LLMs之间转移。这些发现表明，通用LLMs可能作为自动驾驶评估流程中独立、即用的人类行为模型，但未来的研究需要更好地理解它们的失效模式，并确保它们作为人类驾驶行为模型的有效性。

View on arXiv Download PDF AI Translation

cs.AI / 24 / 2604.09612

Beyond Theory of Mind in Robotics

超越机器人中的心智理论

Jung, Malte F.

Abstract

Theory of Mind, the capacity to explain and predict behavior by inferring hidden mental states, has become the dominant paradigm for social interaction in robotics. Yet ToM rests on three assumptions that poorly capture how most social interaction actually unfolds: that meaning travels inside-out from hidden states to observable behavior; that understanding requires detached inference rather than participation; and that the meaning of behavior is fixed and available to a passive observer. Drawing on ethnomethodology, conversation analysis, and participatory sense-making, I argue that social meaning is not decoded from behavior but produced through moment-to-moment coordination between agents. This interactional foundation has direct implications for robot design: shifting from internal state modeling toward policies for sustaining coordination, from observer-based inference toward active participation, and from fixed behavioral meaning toward meaning potential stabilized through response.

Chinese Translation

心智理论（Theory of Mind，ToM）是通过推断隐藏的心理状态来解释和预测行为的能力，已成为机器人社会交互的主导范式。然而，心智理论基于三个假设，这些假设难以准确反映大多数社会交互的实际展开方式：意义是从隐藏状态向外传递到可观察行为；理解需要超然的推理而非参与；行为的意义是固定的且对被动观察者可得。借鉴民族方法学、会话分析和参与式意义构建的观点，本文认为社会意义不是从行为中解码得出，而是通过主体间的瞬时协调共同生成。这一互动基础对机器人设计具有直接影响：从内部状态建模转向维持协调的策略，从基于观察者的推断转向主动参与，以及从固定的行为意义转向通过回应稳定的意义潜能。

View on arXiv Download PDF AI Translation

cs.AI / 25 / 2604.09614

The Geometry of Knowing: From Possibilistic Ignorance to Probabilistic Certainty -- A Measure-Theoretic Framework for Epistemic Convergence

认知的几何学：从可能性无知到概率确定性——一种基于测度论的认识收敛框架

Jah, Moriba Kemessia

Abstract

This paper develops a measure-theoretic framework establishing when and how a possibilistic representation of incomplete knowledge contracts into a probabilistic representation of intrinsic stochastic variability. Epistemic uncertainty is encoded by a possibility distribution and its dual necessity measure, defining a credal set bounding all probability measures consistent with current evidence. As evidence accumulates, the credal set contracts. The epistemic collapse condition marks the transition: the Choquet integral converges to the Lebesgue integral over the unique limiting density. We prove this rigorously (Theorem 4.5), with all assumptions explicit and a full treatment of the non-consonant case. We introduce the aggregate epistemic width W, establish its axiomatic properties, provide a canonical normalization, and give a feasible online proxy resolving a circularity in prior formulations. Section 7 develops the dynamics of epistemic contraction: evidence induces compatibility, compatibility performs falsification, posterior possibility is the min-intersection of prior possibility and compatibility, and a credibility-directed flow governs support geometry contraction. This is not belief updating. It is knowledge contraction. Probability theory is the limiting geometry of that process. The UKF and ESPF solve different problems by different mechanisms. The UKF minimizes MSE, asserts truth, and requires a valid generative model. The ESPF minimizes maximum entropy and surfaces what evidence has not ruled out. When the world is Gaussian and the model valid, both reach the same estimate by entirely different routes -- convergent optimality, not hierarchical containment. We prove this (Theorem 9.1) and compare both on a 2-day, 877-step orbital tracking scenario. Both achieve 1-meter accuracy. The UKF is accurate but epistemically silent. The ESPF is accurate and epistemically honest.

Chinese Translation

本文构建了一个测度论框架，用以确定何时以及如何将不完全知识的可能性表征收缩为内在随机变异性的概率表征。认识不确定性通过可能性分布及其对偶的必要性测度进行编码，定义了一个信念集，该信念集界定了所有与当前证据一致的概率测度。随着证据的积累，信念集逐渐收缩。认识崩溃条件标志着这一转变：Choquet积分收敛于唯一极限密度上的Lebesgue积分。我们对此进行了严格证明（定理4.5），明确了所有假设并全面处理了非共鸣情况。我们引入了聚合认识宽度W，确立其公理性质，提供了规范化方法，并给出了一个可行的在线代理，解决了先前表述中的循环依赖。第7节发展了认识收缩的动力学：证据引发兼容性，兼容性执行证伪，后验可能性是先验可能性与兼容性的最小交集，且一个以可信度为导向的流动控制支持几何的收缩。这不同于信念更新，而是知识收缩。概率论是该过程的极限几何。UKF（无迹卡尔曼滤波器）和ESPF（熵最大化支持滤波器）通过不同机制解决不同问题。UKF最小化均方误差，断言真理，且需要有效的生成模型。ESPF最小化最大熵，揭示证据未排除的可能性。当世界为高斯且模型有效时，两者通过完全不同路径达到相同估计——这是一种收敛的最优性，而非层级包含关系。我们对此进行了证明（定理9.1），并在一个为期2天、877步的轨道跟踪场景中对两者进行了比较。两者均达到1米精度。UKF准确但认识上保持沉默，ESPF准确且认识上诚实。

View on arXiv Download PDF AI Translation

cs.AI / 26 / 2604.09617

AdaQE-CG: Adaptive Query Expansion for Web-Scale Generative AI Model and Data Card Generation

AdaQE-CG：用于Web规模生成性人工智能模型和数据卡生成的自适应查询扩展

Zhang, Haoxuan, Li, Ruochi, Liang, Zhenni, Sattari, Mehri, Vo, Phat, Qu, Collin, Xiao, Ting, Ding, Junhua, Zhang, Yang, Chen, Haihua

Abstract

Transparent and standardized documentation is essential for building trustworthy generative AI (GAI) systems. However, existing automated methods for generating model and data cards still face three major challenges: (i) static templates, as most systems rely on fixed query templates that cannot adapt to diverse paper structures or evolving documentation requirements; (ii) information scarcity, since web-scale repositories such as Hugging Face often contain incomplete or inconsistent metadata, leading to missing or noisy information; and (iii) lack of benchmarks, as the absence of standardized datasets and evaluation protocols hinders fair and reproducible assessment of documentation quality. To address these limitations, we propose AdaQE-CG, an Adaptive Query Expansion for Card Generation framework that combines dynamic information extraction with cross-card knowledge transfer. Its Intra-Paper Extraction via Context-Aware Query Expansion (IPE-QE) module iteratively refines extraction queries to recover richer and more complete information from scientific papers and repositories, while its Inter-Card Completion using the MetaGAI Pool (ICC-MP) module fills missing fields by transferring semantically relevant content from similar cards in a curated dataset. In addition, we introduce MetaGAI-Bench, the first large-scale, expert-annotated benchmark for evaluating GAI documentation. Comprehensive experiments across five quality dimensions show that AdaQE-CG substantially outperforms existing approaches, exceeds human-authored data cards, and approaches human-level quality for model cards. Code, prompts, and data are publicly available at: https://github.com/haoxuan-unt2024/AdaQE-CG.

Chinese Translation

透明和标准化的文档对于构建可信赖的生成性人工智能（GAI）系统至关重要。然而，现有的自动生成模型和数据卡的方法仍面临三大主要挑战：（i）静态模板，大多数系统依赖于固定的查询模板，无法适应多样化的论文结构或不断变化的文档要求；（ii）信息稀缺，由于Hugging Face等Web规模的资源库通常包含不完整或不一致的元数据，导致信息缺失或噪声；（iii）缺乏基准，缺乏标准化的数据集和评估协议阻碍了文档质量的公平和可重复评估。为了解决这些局限性，我们提出了AdaQE-CG，一个用于卡片生成的自适应查询扩展框架，结合了动态信息提取和跨卡片知识转移。其通过上下文感知查询扩展的论文内提取（IPE-QE）模块迭代地优化提取查询，以从科学论文和资源库中恢复更丰富和更完整的信息，而其使用MetaGAI池的跨卡片补全（ICC-MP）模块则通过从策划数据集中相似卡片中转移语义相关内容来填补缺失字段。此外，我们引入了MetaGAI-Bench，这是第一个大规模的专家注释基准，用于评估GAI文档。全面的实验结果显示，AdaQE-CG在五个质量维度上显著优于现有方法，超越了人工撰写的数据卡，并接近模型卡的人类水平质量。代码、提示和数据可在以下链接公开获取：https://github.com/haoxuan-unt2024/AdaQE-CG。

View on arXiv Download PDF AI Translation

cs.AI / 27 / 2604.09621

Competing with AI Scientists: Agent-Driven Approach to Astrophysics Research

与人工智能科学家竞争：基于代理的天体物理研究方法

Borrett, Thomas, Xu, Licong, Nilipour, Andy, Bolliet, Boris, Pierre, Sebastien, Allys, Erwan, Lecat, Celia, Dai, Biwei, Chang, Po-Wen, Bhimji, Wahid

Abstract

We present an agent-driven approach to the construction of parameter inference pipelines for scientific data analysis. Our method leverages a multi-agent system, Cmbagent (the analysis system of the AI scientist Denario), in which specialized agents collaborate to generate research ideas, write and execute code, evaluate results, and iteratively refine the overall pipeline. As a case study, we apply this approach to the FAIR Universe Weak Lensing Uncertainty Challenge, a competition under time constraints focused on robust cosmological parameter inference with realistic observational uncertainties. While the fully autonomous exploration initially did not reach expert-level performance, the integration of human intervention enabled our agent-driven workflow to achieve a first-place result in the challenge. This demonstrates that semi-autonomous agentic systems can compete with, and in some cases surpass, expert solutions. We describe our workflow in detail, including both the autonomous and semi-autonomous exploration by Cmbagent. Our final inference pipeline utilizes parameter-efficient convolutional neural networks, likelihood calibration over a known parameter grid, and multiple regularization techniques. Our results suggest that agent-driven research workflows can provide a scalable framework to rapidly explore and construct pipelines for inference problems.

Chinese Translation

我们提出了一种基于代理的科学数据分析参数推断管道构建方法。我们的方法利用了一个多代理系统Cmbagent（人工智能科学家Denario的分析系统），在该系统中，专门的代理协作生成研究想法、编写和执行代码、评估结果，并迭代优化整体管道。作为案例研究，我们将该方法应用于FAIR宇宙弱引力透镜不确定性挑战，这是一个在时间限制下进行的竞赛，重点关注在现实观测不确定性下进行稳健的宇宙学参数推断。尽管最初完全自主探索未能达到专家级表现，但人类干预的整合使我们的基于代理的工作流程在挑战中取得了第一名的结果。这表明半自主代理系统能够与专家解决方案竞争，并在某些情况下超越它们。我们详细描述了我们的工作流程，包括Cmbagent的自主和半自主探索。我们的最终推断管道利用了参数高效的卷积神经网络、已知参数网格上的似然校准以及多种正则化技术。我们的结果表明，基于代理的研究工作流程可以提供一个可扩展的框架，以快速探索和构建推断问题的管道。

View on arXiv Download PDF AI Translation

cs.AI / 28 / 2604.09674

How LLMs Might Think

大型语言模型可能如何思考

Gottlieb, Joseph, Kemp, Ethan, Trager, Matthew

Abstract

Do large language models (LLMs) think? Daniel Stoljar and Zhihe Vincent Zhang have recently developed an argument from rationality for the claim that LLMs do not think. We contend, however, that the argument from rationality not only falters, but leaves open an intriguing possibility: that LLMs engage only in arational, associative forms of thinking, and have purely associative minds. Our positive claim is that if LLMs think at all, they likely think precisely in this manner.

Chinese Translation

大型语言模型（LLMs）是否会思考？Daniel Stoljar 和 Zhihe Vincent Zhang 最近提出了一个关于理性的论证，声称 LLMs 并不思考。然而，我们认为，这一理性论证不仅存在缺陷，还留下了一个引人入胜的可能性：LLMs 仅仅从事非理性、联想式的思维，并且具有纯粹的联想思维。我们的积极主张是，如果 LLMs 确实会思考，那么它们很可能正是以这种方式进行思考。

View on arXiv Download PDF AI Translation

cs.AI / 29 / 2604.09686

Belief-Aware VLM Model for Human-like Reasoning

基于信念的视觉语言模型用于类人推理

Nayak, Anshul, Shaik, Shahil, Wang, Yue

Abstract

Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or capture the evolving human intent over long-horizon. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.

Chinese Translation

传统的意图推断神经网络模型在很大程度上依赖于可观察状态，难以在多样化任务和动态环境中进行泛化。近期在视觉语言模型（VLM）和视觉语言行动（VLA）模型方面的进展，通过大规模多模态预训练引入了常识推理，使得模型在各类任务中实现零-shot性能。然而，这些模型仍然缺乏明确的机制来表示和更新信念，限制了它们像人类一样进行推理的能力，或捕捉长期演变的人类意图。为了解决这个问题，我们提出了一种基于信念的VLM框架，该框架集成了基于检索的记忆和强化学习。我们并不直接学习一个明确的信念模型，而是使用基于向量的记忆来近似信念，该记忆检索相关的多模态上下文，并将其纳入VLM进行推理。我们进一步通过在VLM潜在空间上的强化学习策略来优化决策。我们在公开可用的视觉问答（VQA）数据集上评估了我们的方法，如HD-EPIC，并展示了相较于零-shot基线的一致性改进，突显了基于信念的推理的重要性。

View on arXiv Download PDF AI Translation

cs.AI / 30 / 2604.09692

Tipiano: Cascaded Piano Hand Motion Synthesis via Fingertip Priors

Tipiano：基于指尖先验的级联钢琴手部动作合成

Bae, Joonhyung, Kim, Kirak, Cho, Hyeyoon, Lee, Sein, Choi, Yoon-Seok, Hur, Hyeon, Lee, Gyubin, Maezawa, Akira, Obata, Satoshi, Park, Jonghwa, Park, Jaebum, Nam, Juhan

Abstract

Synthesizing realistic piano hand motions requires both precision and naturalness. Physics-based methods achieve precision but produce stiff motions; data-driven models learn natural dynamics but struggle with positional accuracy. Piano motion exhibits a natural hierarchy: fingertip positions are nearly deterministic given piano geometry and fingering, while wrist and intermediate joints offer stylistic freedom. We present [OURS], a four-stage framework exploiting this hierarchy: (1) statistics-based fingertip positioning, (2) FiLM-conditioned trajectory refinement, (3) wrist estimation, and (4) STGCN-based pose synthesis. We contribute expert-annotated fingerings for the F\"urElise dataset (153 pieces, ~10 hours). Experiments demonstrate F1 = 0.910, substantially outperforming diffusion baselines (F1 = 0.121), with user study (N=41) confirming quality approaching motion capture. Expert evaluation by professional pianists (N=5) identified anticipatory motion as the key remaining gap, providing concrete directions for future improvement.

Chinese Translation

合成真实的钢琴手部动作需要精确性和自然性。基于物理的方法实现了精确性，但产生的动作显得僵硬；数据驱动的模型学习自然动态，但在位置精度上存在困难。钢琴动作展现出一种自然的层次结构：给定钢琴几何形状和指法，指尖位置几乎是确定的，而手腕和中间关节则提供了风格自由度。我们提出了[OURS]，一个利用这种层次结构的四阶段框架：(1) 基于统计的指尖定位，(2) 基于FiLM的轨迹细化，(3) 手腕估计，(4) 基于STGCN的姿态合成。我们为F"urElise数据集（153首曲目，约10小时）贡献了专家注释的指法。实验结果表明F1 = 0.910，显著优于扩散基线（F1 = 0.121），用户研究（N=41）确认了合成质量接近动作捕捉。专业钢琴家（N=5）的专家评估指出，预期动作是仍需改进的关键差距，为未来的改进提供了具体方向。

View on arXiv Download PDF AI Translation

cs.AI / 31 / 2604.09780

The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise

专家混合模型（MoEs）中专家专精的迷思：路由反映的是几何结构，而非必然的领域专长

Wang, Xi, Hayou, Soufiane, Nalisnick, Eric

Abstract

Mixture of Experts (MoEs) are now ubiquitous in large language models, yet the mechanisms behind their "expert specialization" remain poorly understood. We show that, since MoE routers are linear maps, hidden state similarity is both necessary and sufficient to explain expert usage similarity, and specialization is therefore an emergent property of the representation space, not of the routing architecture itself. We confirm this at both token and sequence level across five pre-trained models. We additionally prove that load-balancing loss suppresses shared hidden state directions to maintain routing diversity, which might provide a theoretical explanation for specialization collapse under less diverse data, e.g. small batch. Despite this clean mechanistic account, we find that specialization patterns in pre-trained MoEs resist human interpretation: expert overlap between different models answering the same question is no higher than between entirely different questions ($\sim$60\%); prompt-level routing does not predict rollout-level routing; and deeper layers exhibit near-identical expert activation across semantically unrelated inputs, especially in reasoning models. We conclude that, while the efficiency perspective of MoEs is well understood, understanding expert specialization is at least as hard as understanding LLM hidden state geometry, a long-standing open problem in the literature.

Chinese Translation

专家混合模型（Mixture of Experts, MoEs）现已广泛应用于大型语言模型中，但其“专家专精”背后的机制仍未被充分理解。我们展示了，由于MoE的路由器是线性映射，隐藏状态的相似性既是解释专家使用相似性的必要条件，也是充分条件，因此专精是表示空间的一个涌现属性，而非路由架构本身的属性。我们在五个预训练模型的令牌级和序列级上均验证了这一点。我们还证明，负载均衡损失抑制了共享隐藏状态方向以维持路由多样性，这可能为在数据多样性较低（例如小批量）时专精崩溃提供了理论解释。尽管有这一清晰的机制性解释，我们发现预训练MoE中的专精模式难以被人类直观理解：不同模型回答相同问题时专家的重叠度并不高于回答完全不同问题时的重叠度（约60%）；提示级别的路由无法预测推理过程中的路由；且在推理模型中，较深层的专家激活在语义无关的输入间几乎相同。我们得出结论，尽管MoE的效率视角已被充分理解，但理解专家专精至少与理解大型语言模型隐藏状态的几何结构一样困难，而后者是文献中长期存在的开放问题。

View on arXiv Download PDF AI Translation

cs.AI / 32 / 2604.09791

Pioneer Agent: Continual Improvement of Small Language Models in Production

先锋代理：小型语言模型在生产中的持续改进

Atreja, Dhruv, White, Julia, Nayak, Nikhil, Zhang, Kelton, Princis, Henrijs, Hurn-Maloney, George, Lewis, Ash, Zaratiana, Urchade

Abstract

Small language models are attractive for production deployment due to their low cost, fast inference, and ease of specialization. However, adapting them to a specific task remains a challenging engineering loop, driven not by training itself but by surrounding decisions: data curation, failure diagnosis, regression avoidance, and iteration control. We present Pioneer Agent, a closed-loop system that automates this lifecycle. In cold-start mode, given only a natural-language task description, the agent acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given a deployed model with labeled failures, it diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. To evaluate this setting, we introduce AdaptFT-Bench, a benchmark of synthetic inference logs with progressively increasing noise, designed to test the full adaptation loop: diagnosis, curriculum synthesis, retraining, and verification. Across eight cold-start benchmarks spanning reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6-83.8 points. On AdaptFT-Bench, it improves or preserves performance in all seven scenarios, while naive retraining degrades by up to 43 points. On two production-style deployments built from public benchmark tasks, it raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810. Beyond performance gains, the agent often discovers effective training strategies, including chain-of-thought supervision, task-specific optimization, and quality-focused data curation, purely from downstream feedback.

Chinese Translation

小型语言模型因其低成本、快速推理和易于专业化而受到生产部署的青睐。然而，将它们适应特定任务仍然是一个具有挑战性的工程循环，这一过程并非仅由训练本身驱动，而是由周围的决策所推动：数据策划、故障诊断、回归避免和迭代控制。我们提出了先锋代理（Pioneer Agent），这是一个自动化此生命周期的闭环系统。在冷启动模式下，代理仅根据自然语言任务描述获取数据，构建评估集，并通过联合优化数据、超参数和学习策略来迭代训练模型。在生产模式下，代理在给定带标签的故障的已部署模型的情况下，诊断错误模式，构建针对性的训练数据，并在明确的回归约束下进行再训练。为了评估这一设置，我们引入了AdaptFT-Bench，这是一个合成推理日志的基准，具有逐渐增加的噪声，旨在测试完整的适应循环：诊断、课程合成、再训练和验证。在涵盖推理、数学、代码生成、摘要和分类的八个冷启动基准中，先锋代理在基础模型的基础上提高了1.6-83.8分。在AdaptFT-Bench上，它在所有七种场景中改善或保持了性能，而简单的再训练则下降了多达43分。在两个基于公共基准任务构建的生产风格部署中，它将意图分类从84.9%提高到99.3%，实体F1从0.345提高到0.810。除了性能提升外，代理还常常发现有效的训练策略，包括链式思维监督、特定任务优化和以质量为中心的数据策划，这些都是纯粹基于下游反馈发现的。

View on arXiv Download PDF AI Translation

cs.AI / 33 / 2604.09813

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

可控且可验证的工具使用数据合成用于自主强化学习

Xu, Siyuan, Li, Shiyang, Liu, Xin, Liu, Tianyi, Li, Yixiao, Shi, Zhan, Zhang, Zixuan, Wang, Zilong, Yin, Qingyu, Chen, Jianshu, Zhao, Tuo, Yin, Bing

Abstract

Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity. These augmentations introduce distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth. This design enables automatic reward computation via reference matching for standard cases and lightweight judge-assisted verification for special behaviors such as error detection, supporting RL optimization of tool-calling policies. On Qwen2.5-Instruct-14B, COVERT-RL improves overall accuracy on BFCL v3 from 56.5 to 59.9 and on ACEBench from 53.0 to 59.3, with minimal regressions on general-ability benchmarks; when stacked on SFT, it further reaches 62.1 and 61.8, confirming additive gains. These results suggest that oracle-preserving synthetic environments offer a practical RL refinement stage, complementary to SFT, for improving tool-use robustness under ambiguity and unreliable tool feedback.

Chinese Translation

现有的合成工具使用语料库主要设计用于离线监督微调，而强化学习（RL）则需要支持可奖励检查的在线回滚的可执行环境。我们提出了COVERT，一个两阶段的管道，首先通过自我演化合成与多层次验证生成可靠的基础工具使用轨迹，然后应用保持神谕的增强，系统性地增加环境复杂性。这些增强引入了干扰工具、间接或模糊的用户查询，以及嘈杂的、多格式的或错误的工具输出，同时严格保持神谕工具调用和最终答案作为真实值。该设计使得通过参考匹配进行自动奖励计算成为可能，适用于标准案例，并支持轻量级的评审辅助验证，用于错误检测等特殊行为，支持工具调用策略的RL优化。在Qwen2.5-Instruct-14B上，COVERT-RL将BFCL v3的整体准确率从56.5提高到59.9，将ACEBench的准确率从53.0提高到59.3，且在一般能力基准上回归最小；当与SFT叠加时，进一步达到62.1和61.8，确认了增益效果。这些结果表明，保持神谕的合成环境为RL优化提供了一个实用的细化阶段，作为SFT的补充，以提高在模糊和不可靠工具反馈下的工具使用鲁棒性。

View on arXiv Download PDF AI Translation

cs.AI / 34 / 2604.09815

EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning

EE-MCP：通过自动环境生成和经验学习自我演化的MCP-GUI代理

He, Tiantian, Chen, Yihang, Jiang, Keyue, Lee, Ka Yiu, Zhou, Kaiwen, Shao, Kun, Wang, Shuai

Abstract

Computer-use agents that combine GUI interaction with structured API calls via the Model Context Protocol (MCP) show promise for automating software tasks. However, existing approaches lack a principled understanding of how agents should balance these two modalities and how to enable iterative self-improvement across diverse applications. We formulate MCP-GUI interplay as a unified hybrid policy learning problem where the agent learns when each modality provides complementary advantages, and show that distillation and experience augmentation target fundamentally different failure modes - requiring application-aware mechanism selection. Built on this formulation, we propose a self-evolving framework with a fully automatic pipeline that orchestrates automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training - all without manual intervention. A key innovation is our experience bank, which accumulates LLM-learned rules from trajectory comparison, enabling inference-time improvement without fine-tuning. Systematic \textbf{cross-application analysis} across three desktop applications reveals that the optimal strategy depends on MCP-GUI composition: distillation achieves 77.8\% pass rate on MCP-dominant tasks (+17.8pp), while the experience bank excels on GUI-intensive tasks (+10.0pp).

Chinese Translation

结合图形用户界面（GUI）交互与通过模型上下文协议（MCP）进行的结构化API调用的计算机使用代理在自动化软件任务方面展现出良好的前景。然而，现有的方法缺乏对代理如何平衡这两种模式的原则性理解，以及如何在多样化应用中实现迭代自我改进。我们将MCP-GUI的相互作用形式化为一个统一的混合策略学习问题，其中代理学习每种模式何时提供互补优势，并展示了蒸馏和经验增强针对根本不同的失败模式，要求应用感知的机制选择。在此基础上，我们提出了一个自我演化的框架，具有完全自动化的管道，协调自动环境生成与验证、轨迹收集、基于差距的任务合成和质量过滤训练——所有这些均无需人工干预。一个关键创新是我们的经验库，它通过轨迹比较积累了LLM学习的规则，使得在推理时能够改进而无需微调。对三款桌面应用进行的系统性跨应用分析揭示，最佳策略依赖于MCP-GUI的组合：在MCP主导的任务上，蒸馏实现了77.8%的通过率（+17.8个百分点），而经验库在GUI密集型任务上表现优异（+10.0个百分点）。

View on arXiv Download PDF AI Translation

cs.AI / 35 / 2604.09836

COMPOSITE-Stem

Waters, Kyle, Nuzzi, Lucas, Looram, Tadhg, Tomasiello, Alessandro, Kamdoum, Ariel Ghislain Kemogne, Li, Bikun, Sileo, Damien, Kretov, Egor, Fournier-Facio, Francesco, Soloupis, Georgios, Kassahun, Haile, Wolff, Hew, Cai, Jiaqi, Li, Lianghui, Roth, Marc, Naiya, Mohinder, Guo, Naixu, Tang, Qicheng, Wheeler, Richard, Sala, Samuele, Popov, Serguei, Dillman, Steven, Li, Yuqi

Abstract

AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. Our benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures capabilities beyond current agent reach. All tasks are open-sourced with contributor permission to support reproducibility and to promote additional research towards AI's acceleration of scientific progress in these domains.

Chinese Translation

人工智能代理在加速科学发现方面展现出越来越大的潜力；然而，缺乏前沿评估阻碍了其在实际工作流程中的应用。专家编写的基准测试已被证明在衡量人工智能推理能力方面有效，但目前大多数基准测试已趋于饱和，仅能评估受限输出的性能。为弥补这一空白，我们推出了COMPOSITE-STEM，这是一个包含物理、生物、化学和数学领域70个专家编写任务的基准测试，由博士级研究人员精心策划。我们的基准测试结合了精确匹配评分和基于标准的评分细则，并采用了以大型语言模型（LLM）作为评审的评分协议，从而实现对科学意义输出的更灵活评估。基于Harbor智能代理评估框架中适配的多模态Terminus-2代理，我们评估了四个前沿模型。表现最优的模型得分为21%，表明COMPOSITE-STEM能够捕捉当前代理尚未达到的能力。所有任务均经贡献者许可开源，以支持结果复现并促进在这些领域利用人工智能加速科学进展的进一步研究。

View on arXiv Download PDF AI Translation

cs.AI / 36 / 2604.09839

Steered LLM Activations are Non-Surjective

受控大型语言模型激活状态的非满射性

Mishra, Aayush, Khashabi, Daniel, Liu, Anqi

Abstract

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a pre-image under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.

Chinese Translation

激活引导（Activation steering）是一种流行的白盒控制技术，通过修改模型激活状态以引发输出行为的抽象变化。它也已成为可解释性研究中的标准工具（例如，探测真实性，或将激活状态转化为人类可读的解释）以及安全研究（例如，研究越狱能力）。然而，目前尚不清楚受控激活状态是否能够通过任何文本提示实现。在本研究中，我们将该问题表述为满射性问题：对于固定模型，是否每个受控激活状态都存在模型自然前向传播下的原像？在实际假设下，我们证明激活引导会将残差流推离由离散提示可达的状态流形。几乎可以确定，没有任何提示能够重现激活引导所引发的相同内部行为。我们还在三种广泛使用的大型语言模型（LLMs）上进行了实证验证。我们的结果建立了白盒可控性与黑盒提示之间的正式区分。因此，我们警示不要将激活引导的易用性和成功视为基于提示的可解释性或脆弱性的证据，并主张采用明确区分白盒与黑盒干预的评估协议。

View on arXiv Download PDF AI Translation

cs.AI / 37 / 2604.09852

MEMENTO: Teaching LLMs to Manage Their Own Context

MEMENTO：教会大型语言模型管理其自身上下文

Kontonis, Vasilis, Zeng, Yuchen, Garg, Shivam, Chen, Lingjiao, Tang, Hao, Wang, Ziyan, Awadallah, Ahmed, Horvitz, Eric, Langford, John, Papailiopoulos, Dimitris

Abstract

Reasoning models think in long, unstructured streams with no mechanism for compressing or organizing their own intermediate state. We introduce MEMENTO: a method that teaches models to segment reasoning into blocks, compress each block into a memento, i.e., a dense state summary, and reason forward by attending only to mementos, reducing context, KV cache, and compute. To train MEMENTO models, we release OpenMementos, a public dataset of 228K reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. We show that a two-stage SFT recipe on OpenMementos is effective across different model families (Qwen3, Phi-4, Olmo 3) and scales (8B--32B parameters). Trained models maintain strong accuracy on math, science, and coding benchmarks while achieving ${\sim}2.5\times$ peak KV cache reduction. We extend vLLM to support our inference method, achieving ${\sim}1.75\times$ throughput improvement while also enabling us to perform RL and further improve accuracy. Finally, we identify a dual information stream: information from each reasoning block is carried both by the memento text and by the corresponding KV states, which retain implicit information from the original block. Removing this channel drops accuracy by 15\,pp on AIME24.

Chinese Translation

推理模型以长而无结构的流进行思考，缺乏压缩或组织自身中间状态的机制。我们提出了MEMENTO：一种方法，教会模型将推理分段为多个块，将每个块压缩为一个纪念品（memento），即一个紧凑的状态摘要，并通过仅关注纪念品进行前向推理，从而减少上下文、KV缓存和计算量。为了训练MEMENTO模型，我们发布了OpenMementos，这是一个包含228K推理轨迹的公共数据集，来源于OpenThoughts-v3，经过分段并注释了中间摘要。我们展示了在OpenMementos上进行的两阶段SFT（监督微调）方法在不同模型系列（Qwen3、Phi-4、Olmo 3）和规模（8B--32B参数）上都是有效的。训练后的模型在数学、科学和编程基准测试中保持了较强的准确性，同时实现了约2.5倍的峰值KV缓存减少。我们扩展了vLLM以支持我们的推理方法，实现了约1.75倍的吞吐量提升，同时使我们能够进行强化学习（RL）并进一步提高准确性。最后，我们识别出一种双重信息流：每个推理块的信息通过纪念品文本和相应的KV状态传递，后者保留了原始块的隐含信息。移除这个通道会导致AIME24的准确性下降15个百分点。

View on arXiv Download PDF AI Translation

cs.AI / 38 / 2604.09855

Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards

利用可验证奖励的强化学习指导大型语言模型进行谈判

Liu, Shuze Daniel, Chen, Claire, Xiao, Jiabao Sean, Lei, Lei, Zhang, Yuheng, Yue, Yisong, Simchi-Levi, David

Abstract

The recent advancement of Large Language Models (LLMs) has established their potential as autonomous interactive agents. However, they often struggle in strategic games of incomplete information, such as bilateral price negotiation. In this paper, we investigate if Reinforcement Learning from Verifiable Rewards (RLVR) can effectively teach LLMs to negotiate. Specifically, we explore the strategic behaviors that emerge during the learning process. We introduce a framework that trains a mid-sized buyer agent against a regulated LLM seller across a wide distribution of real-world products. By grounding reward signals directly in the maximization of economic surplus and strict adherence to private budget constraints, we reveal a novel four-phase strategic evolution. The agent progresses from naive bargaining to using aggressive starting prices, moves through a phase of deadlock, and ultimately develops sophisticated persuasive skills. Our results demonstrate that this verifiable training allows a 30B agent to significantly outperform frontier models over ten times its size in extracting surplus. Furthermore, the trained agent generalizes robustly to stronger counterparties unseen during training and remains effective even when facing hostile, adversarial seller personas.

Chinese Translation

大型语言模型（LLMs）的最新进展确立了其作为自主交互代理的潜力。然而，它们在不完全信息的策略游戏中常常表现不佳，例如双边价格谈判。本文探讨了可验证奖励强化学习（Reinforcement Learning from Verifiable Rewards，RLVR）是否能够有效地教会LLMs进行谈判。具体而言，我们研究了学习过程中出现的策略行为。我们提出了一个框架，在广泛分布的真实产品环境中，训练一个中等规模的买方代理与受控的LLM卖方进行对抗。通过将奖励信号直接基于经济剩余的最大化和严格遵守私人预算约束，我们揭示了一种新颖的四阶段策略演化。代理从天真的讨价还价开始，逐步采用激进的起始报价，经历僵局阶段，最终发展出复杂的说服技巧。实验结果表明，这种可验证的训练使得一个30亿参数的代理在剩余提取方面显著超越了规模是其十倍的前沿模型。此外，训练后的代理在面对训练中未见过的更强对手时表现出强健的泛化能力，即使面对敌对的、对抗性的卖方角色仍然有效。

View on arXiv Download PDF AI Translation

cs.AI / 39 / 2604.09861

Evolutionary Token-Level Prompt Optimization for Diffusion Models

用于扩散模型的进化令牌级提示优化

Neto, Domício Pereira, Correia, João, Machado, Penousal

Abstract

Text-to-image diffusion models exhibit strong generative performance but remain highly sensitive to prompt formulation, often requiring extensive manual trial and error to obtain satisfactory results. This motivates the development of automated, model-agnostic prompt optimization methods that can systematically explore the conditioning space beyond conventional text rewriting. This work investigates the use of a Genetic Algorithm (GA) for prompt optimization by directly evolving the token vectors employed by CLIP-based diffusion models. The GA optimizes a fitness function that combines aesthetic quality, measured by the LAION Aesthetic Predictor V2, with prompt-image alignment, assessed via CLIPScore. Experiments on 36 prompts from the Parti Prompts (P2) dataset show that the proposed approach outperforms the baseline methods, including Promptist and random search, achieving up to a 23.93% improvement in fitness. Overall, the method is adaptable to image generation models with tokenized text encoders and provides a modular framework for future extensions, the limitations and prospects of which are discussed.

Chinese Translation

文本到图像的扩散模型展现出强大的生成性能，但对提示的表述高度敏感，通常需要大量的手动试错才能获得令人满意的结果。这促使了自动化、模型无关的提示优化方法的发展，这些方法能够系统地探索超越传统文本重写的条件空间。本研究探讨了使用遗传算法（Genetic Algorithm, GA）进行提示优化，通过直接进化CLIP基础的扩散模型所使用的令牌向量。GA优化一个适应度函数，该函数结合了通过LAION美学预测器V2测量的美学质量与通过CLIPScore评估的提示-图像对齐度。在Parti Prompts (P2) 数据集中的36个提示上的实验表明，所提出的方法优于基线方法，包括Promptist和随机搜索，适应度提高了最高达23.93%。总体而言，该方法适用于具有令牌化文本编码器的图像生成模型，并提供了一个模块化框架以便于未来的扩展，讨论了其局限性和前景。

View on arXiv Download PDF AI Translation

cs.AI / 40 / 2604.09885

What do your logits know? (The answer may surprise you!)

你的 logits 知道什么？（答案可能会让你惊讶！）

Fedzechkina, Masha, Gualdoni, Eleonora, Ramos, Rita, Williamson, Sinead

Abstract

Recent work has shown that probing model internals can reveal a wealth of information not apparent from the model generations. This poses the risk of unintentional or malicious information leakage, where model users are able to learn information that the model owner assumed was inaccessible. Using vision-language models as a testbed, we present the first systematic comparison of information retained at different "representational levels'' as it is compressed from the rich information encoded in the residual stream through two natural bottlenecks: low-dimensional projections of the residual stream obtained using tuned lens, and the final top-k logits most likely to impact model's answer. We show that even easily accessible bottlenecks defined by the model's top logit values can leak task-irrelevant information present in an image-based query, in some cases revealing as much information as direct projections of the full residual stream.

Chinese Translation

近期的研究表明，探测模型内部可以揭示大量从模型生成中无法明显看出的信息。这带来了无意或恶意信息泄露的风险，模型用户可能会获取到模型所有者认为无法访问的信息。以视觉-语言模型为实验平台，我们首次系统性地比较了在通过两个自然瓶颈压缩的过程中，不同“表征层次”所保留的信息。这两个瓶颈分别是使用调优透镜获得的残差流的低维投影，以及最有可能影响模型答案的最终 top-k logits。我们展示了，即使是由模型的 top logit 值定义的易于访问的瓶颈，也可能泄露存在于基于图像的查询中的与任务无关的信息，在某些情况下，泄露的信息量甚至可以与直接投影完整残差流相媲美。

View on arXiv Download PDF AI Translation

cs.AI / 41 / 2604.09889

In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach

基于智能代理的线弧增材制造缺陷原位过程监测方法

Halder, Pallock, Mojumder, Satyajit

Abstract

AI agents are being increasingly deployed across a wide range of real-world applications. In this paper, we propose an agentic AI framework for in-situ process monitoring for defect detection in wire-arc additive manufacturing (WAAM). The autonomous agent leverages a WAAM process monitoring dataset and a trained classification tool to build AI agents and uses a large language model (LLM) for in-situ process monitoring decision-making for defect detection. A processing agent is developed based on welder process signals, such as current and voltage, and a monitoring agent is developed based on acoustic data collected during the process. Both agents are tasked with identifying porosity defects from processing and monitoring signals, respectively. Ground truth X-ray computed tomography (XCT) data are used to develop classification tools for both the processing and monitoring agents. Furthermore, a multi-agent framework is demonstrated in which the processing and monitoring agents are orchestrated together for parallel decision-making on the given task of defect classification. Evaluation metrics are proposed to determine the efficacy of both individual agents, the combined single-agent, and the coordinated multi-agent system. The multi-agent configuration outperforms all individual-agent counterparts, achieving a decision accuracy of 91.6% and an F1 score of 0.821 on decided runs, across 15 independent runs, and a reasoning quality score of 3.74 out of 5. These in-situ process monitoring agents hold significant potential for autonomous real-time process monitoring and control toward building qualified parts for WAAM and other additive manufacturing processes.

Chinese Translation

智能代理（AI agents）正日益广泛应用于各种实际场景中。本文提出了一种基于智能代理的线弧增材制造（Wire-Arc Additive Manufacturing, WAAM）缺陷原位过程监测框架。该自主代理利用WAAM过程监测数据集和训练好的分类工具构建AI代理，并采用大型语言模型（Large Language Model, LLM）进行缺陷检测的原位过程监测决策。基于焊工过程信号（如电流和电压）开发了处理代理，基于过程中的声学数据开发了监测代理。两个代理分别负责从处理信号和监测信号中识别孔隙缺陷。采用X射线计算机断层扫描（X-ray Computed Tomography, XCT）作为真实标签数据，开发了处理代理和监测代理的分类工具。此外，本文展示了一个多代理框架，将处理代理和监测代理协同调度，实现缺陷分类任务的并行决策。提出了评估指标以衡量单个代理、合并单代理及协调多代理系统的效能。多代理配置优于所有单代理方案，在15次独立运行中实现了91.6%的决策准确率和0.821的F1分数，推理质量评分为5分制中的3.74分。这些原位过程监测代理在实现WAAM及其他增材制造工艺的自主实时过程监测与控制、构建合格零件方面展现出重要潜力。

View on arXiv Download PDF AI Translation

cs.AI / 42 / 2604.09923

GLEaN: A Text-to-image Bias Detection Approach for Public Comprehension

GLEaN：一种面向公众理解的文本生成图像偏见检测方法

Ding, Bochu, Bent, Brinnae, Wendell, Augustus

Abstract

Text-to-image (T2I) models, and their encoded biases, increasingly shape the visual media the public encounters. While researchers have produced a rich body of work on bias measurement, auditing, and mitigation in T2I systems, those methods largely target technical stakeholders, leaving a gap in public legibility. We introduce GLEaN (Generative Likeness Evaluation at N-Scale), a portrait-based explainability pipeline designed to make T2I model biases visually understandable to a broad audience. GLEaN comprises three stages: automated large-scale image generation from identity prompts, facial landmark-based filtering and spatial alignment, and median-pixel composition that distills a model's central tendency into a single representative portrait. The resulting composites require no statistical background to interpret; a viewer can see, at a glance, who a model 'imagines' when prompted with 'a doctor' versus a 'felon.' We demonstrate GLEaN on Stable Diffusion XL across 40 social and occupational identity prompts, producing composites that reproduce documented biases and surface new associations between skin tone and predicted emotion. We find in a between-subjects user study (N = 291) that GLEaN portraits communicate biases as effectively as conventional data tables, but require significantly less viewing time. Because the method relies solely on generated outputs, it can also be replicated on any black-box and closed-weight systems without access to model internals. GLEaN offers a scalable, model-agnostic approach to bias explainability, purpose-built for public comprehension, and is publicly available at https://github.com/cultureiolab/GLEaN.

Chinese Translation

文本生成图像（Text-to-image，T2I）模型及其内在的偏见，日益影响公众所接触的视觉媒体。尽管研究人员在T2I系统的偏见测量、审计和缓解方面已积累了丰富成果，但这些方法主要面向技术利益相关者，公众的可理解性仍存在不足。我们提出了GLEaN（Generative Likeness Evaluation at N-Scale），一种基于肖像的可解释性流程，旨在使T2I模型的偏见以视觉形式为广大受众所理解。GLEaN包含三个阶段：基于身份提示的自动大规模图像生成、基于面部关键点的筛选与空间对齐，以及通过中位像素合成将模型的中心趋势提炼为单一代表性肖像。生成的合成图无需统计背景即可解读，观众一目了然地看到模型在提示“医生”与“罪犯”时“想象”的形象。我们在Stable Diffusion XL模型上针对40个社会及职业身份提示验证了GLEaN，生成的合成图重现了已知偏见，并揭示了肤色与预测情绪之间的新关联。在一项包含291名参与者的组间用户研究中，我们发现GLEaN肖像传达偏见的效果与传统数据表相当，但所需观看时间显著更短。由于该方法仅依赖生成输出，且无需访问模型内部，因此可在任何黑盒及闭源权重系统上复现。GLEaN提供了一种可扩展、模型无关且专为公众理解设计的偏见可解释性方法，相关代码已公开，地址为https://github.com/cultureiolab/GLEaN。

View on arXiv Download PDF AI Translation

cs.AI / 43 / 2604.09937

HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

HealthAdminBench：评估计算机使用代理在医疗管理任务中的表现

Bedi, Suhana, Welch, Ryan, Steinberg, Ethan, Wornow, Michael, Kim, Taeil Matthew, Ahmed, Haroun, Sterling, Peter, Purohit, Bravim, Akram, Qurat, Acosta, Angelic, Nubla, Esther, Sharma, Pritika, Pfeffer, Michael A., Koyejo, Sanmi, Shah, Nigam H.

Abstract

Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments: an EHR, two payer portals, and a fax system, and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3 percent task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8 percent). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.

Chinese Translation

医疗管理每年支出超过1万亿美元，使其成为基于大型语言模型（LLM）的计算机使用代理（CUA）的一个有前景的目标。尽管LLM在临床应用方面受到了广泛关注，但目前尚无评估CUA在端到端行政工作流程中的表现的基准。为了解决这一空白，我们推出了HealthAdminBench，这是一个包含四个真实的图形用户界面（GUI）环境的基准：电子健康记录（EHR）、两个支付方门户和一个传真系统，以及135个由专家定义的任务，涵盖三种行政任务类型：事前授权、上诉和拒绝管理，以及耐用医疗设备（DME）订单处理。每个任务被细分为可验证的细小子任务，共产生1,698个评估点。我们在多种提示和观察设置下评估了七种代理配置，发现尽管子任务表现良好，但端到端的可靠性仍然较低：表现最佳的代理（Claude Opus 4.6 CUA）仅实现了36.3%的任务成功率，而GPT-5.4 CUA则达到了最高的子任务成功率（82.8%）。这些结果揭示了当前代理能力与现实世界行政工作流程需求之间的显著差距。HealthAdminBench为评估医疗行政工作流程安全可靠自动化的进展提供了严格的基础。

View on arXiv Download PDF AI Translation

cs.AI / 44 / 2604.09940

New Hybrid Fine-Tuning Paradigm for LLMs: Algorithm Design and Convergence Analysis Framework

大型语言模型的新型混合微调范式：算法设计与收敛分析框架

Ma, Shaocong, Yu, Peiran, Huang, Heng

Abstract

Fine-tuning Large Language Models (LLMs) typically involves either full fine-tuning, which updates all model parameters, or Parameter-Efficient Fine-Tuning (PEFT), which adjusts a small subset of parameters. However, both approaches have inherent limitations: full fine-tuning is computationally expensive, while PEFT often struggles to learn new knowledge and exhibits suboptimal performance. To overcome these issues, we propose a novel hybrid fine-tuning approach that jointly updates both LLMs and PEFT modules using a combination of zeroth-order and first-order optimization methods. To analyze our new algorithm, we develop a theoretical framework centered on the concept of hybrid smoothness condition, which accounts for the heterogeneous nature of the optimization landscape in joint LLM and PEFT training. We derive a rigorous convergence analysis for the convergence of reshuffling-type SGD algorithm under multiple learning rates and demonstrate its effectiveness through extensive empirical studies across various downstream tasks and model architectures. On the practical side, our results demonstrate consistent performance improvement, making the approach a viable solution for large-scale language model fine-tuning.

Chinese Translation

微调大型语言模型（LLMs）通常涉及完全微调，即更新所有模型参数，或参数高效微调（Parameter-Efficient Fine-Tuning, PEFT），即调整少量参数。然而，这两种方法都有固有的局限性：完全微调计算开销大，而PEFT往往难以学习新知识，并表现出次优性能。为了解决这些问题，我们提出了一种新颖的混合微调方法，该方法通过结合零阶和一阶优化方法，联合更新LLMs和PEFT模块。为了分析我们的新算法，我们开发了一个以混合光滑性条件为核心的理论框架，该框架考虑了联合LLM和PEFT训练中优化景观的异质性。我们推导了在多学习率下重洗类型随机梯度下降（SGD）算法的严格收敛分析，并通过对各种下游任务和模型架构的广泛实证研究展示了其有效性。在实践方面，我们的结果表明性能持续改善，使该方法成为大规模语言模型微调的可行解决方案。

View on arXiv Download PDF AI Translation

cs.AI / 45 / 2604.10015

FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

FinTrace：针对长期金融任务的LLM工具调用的整体轨迹级评估

Cao, Yupeng, Li, Haohang, Liu, Weijin, Cao, Wenbo, Xu, Anke, Qian, Lingfei, Peng, Xueqing, Tang, Minxue, Yao, Zhiyuan, Huang, Jimin, Subbalakshmi, K. P., Zhu, Zining, Suchow, Jordan W., Yu, Yangyang

Abstract

Recent studies demonstrate that tool-calling capability enables large language models (LLMs) to interact with external environments for long-horizon financial tasks. While existing benchmarks have begun evaluating financial tool calling, they focus on limited scenarios and rely on call-level metrics that fail to capture trajectory-level reasoning quality. To address this gap, we introduce FinTrace, a benchmark comprising 800 expert-annotated trajectories spanning 34 real-world financial task categories across multiple difficulty levels. FinTrace employs a rubric-based evaluation protocol with nine metrics organized along four axes -- action correctness, execution efficiency, process quality, and output quality -- enabling fine-grained assessment of LLM tool-calling behavior. Our evaluation of 13 LLMs reveals that while frontier models achieve strong tool selection, all models struggle with information utilization and final answer quality, exposing a critical gap between invoking the right tools and reasoning effectively over their outputs. To move beyond diagnosis, we construct FinTrace-Training, the first trajectory-level preference dataset for financial tool-calling, containing 8,196 curated trajectories with tool-augmented contexts and preference pairs. We fine-tune Qwen-3.5-9B using supervised fine-tuning followed by direct preference optimization (DPO) and show that training on FinTrace-Training consistently improves intermediate reasoning metrics, with DPO more effectively suppressing failure modes. However, end-to-end answer quality remains a bottleneck, indicating that trajectory-level improvements do not yet fully propagate to final output quality.

Chinese Translation

最近的研究表明，工具调用能力使大型语言模型（LLMs）能够与外部环境进行交互，以完成长期金融任务。尽管现有基准测试已开始评估金融工具调用，但它们关注的场景有限，并依赖于无法捕捉轨迹级推理质量的调用级指标。为了解决这一问题，我们引入了FinTrace，一个包含800个专家注释轨迹的基准，涵盖34个真实世界金融任务类别，跨越多个难度级别。FinTrace采用基于评分标准的评估协议，设定了九个指标，按四个维度组织——行动正确性、执行效率、过程质量和输出质量——使得对LLM工具调用行为的细致评估成为可能。我们对13个LLM的评估显示，尽管前沿模型在工具选择上表现出色，但所有模型在信息利用和最终答案质量上均存在困难，揭示了调用正确工具与有效推理其输出之间的关键差距。为了超越诊断，我们构建了FinTrace-Training，这是首个针对金融工具调用的轨迹级偏好数据集，包含8,196个经过精心策划的轨迹，具有工具增强的上下文和偏好对。我们使用监督微调和直接偏好优化（DPO）对Qwen-3.5-9B进行了微调，并显示在FinTrace-Training上训练能够持续改善中间推理指标，而DPO更有效地抑制失败模式。然而，端到端答案质量仍然是一个瓶颈，表明轨迹级的改进尚未完全传递到最终输出质量。

View on arXiv Download PDF AI Translation

cs.AI / 46 / 2604.10034

AI Achieves a Perfect LSAT Score

人工智能取得完美的法学院入学考试成绩

Ku, Bonmu

Abstract

This paper reports the first documented instance of a language model achieving a perfect score on an officially disclosed Law School Admission Test (LSAT). Controlled experiments on eight reasoning models show that varying the prompt, shuffling answer choices, and sampling multiple responses have no meaningful effect as drivers of performance. Ablating the thinking phase that models generate before answering, however, lowers frontier accuracy by up to 8 percentage points, predominantly in logical reasoning. Distilled models produce full thinking traces in the same format yet plateau far below frontier performance. A pilot process reward model fine-tuned via QLoRA on official LSAT explanations narrows this gap through Best-of-5 selection, with gains again predominantly in logical reasoning. The gatekeeper of elite legal education since 1948, the LSAT has not merely been passed but answered without a single error by models that reason. The upper bound of the cognitive capacities it has tested is no longer exclusive to human cognition.

Chinese Translation

本文报告了首个语言模型在官方披露的法学院入学考试（LSAT）中取得完美成绩的文献实例。对八种推理模型的控制实验表明，改变提示、打乱答案选项以及采样多个响应对性能的驱动作用没有显著影响。然而，消除模型在回答前生成的思考阶段会使前沿准确率降低多达8个百分点，主要体现在逻辑推理方面。经过蒸馏的模型以相同格式生成完整的思考轨迹，但其表现远低于前沿性能。通过在官方LSAT解释上使用QLoRA微调的试点过程奖励模型，通过最佳五选一的选择缩小了这一差距，收益再次主要体现在逻辑推理方面。自1948年以来，LSAT作为精英法律教育的门槛，不仅被通过，而且由能够推理的模型无一错误地回答。它所测试的认知能力的上限不再仅限于人类认知。

View on arXiv Download PDF AI Translation

cs.AI / 47 / 2604.10044

LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention

LoopGuard：通过动态 KV 缓存干预打破自增强注意力循环

Xu, Dongjie, Wu, Hao, Shi, Weijie, Cui, Yue, Liu, Yuanjun, Li, Jiawei, Ma, Haolun, Liu, An, Zhu, Jia, Xu, Jiajie

Abstract

Through systematic experiments on long-context generation, we observe a damaging failure mode in which decoding can collapse into persistent repetition loops. We find that this degeneration is driven by collapsed attention patterns, where a subset of heads locks onto a narrow suffix of the history, and is further stabilized by inference-time KV cache reuse. Crucially, since many existing KV cache policies rely on attention-based importance, this collapse can produce spuriously high scores for repetitive tokens, causing cache management to inadvertently amplify repetition. To study this phenomenon in a controlled and reproducible manner, we introduce LoopBench, a benchmark with explicit loop-inducing conditions and loop-oriented metrics that quantify repetition severity and generation instability beyond downstream task scores. Building on these insights, we propose LoopGuard, a lightweight, plug-in KV cache guard that detects loop onset online and disrupts the feedback cycle by pruning repetitive tail spans under a fixed cache budget. Experiments on LoopBench show that LoopGuard reduces loop incidence by over 90 percentage points, while restoring output diversity and reducing token waste.

Chinese Translation

通过对长上下文生成的系统实验，我们观察到一种有害的失败模式，其中解码可能会陷入持续的重复循环。我们发现这种退化是由崩溃的注意力模式驱动的，其中一部分头部锁定在历史的狭窄后缀上，并且在推理时的 KV 缓存重用进一步稳定了这种现象。关键是，由于许多现有的 KV 缓存策略依赖于基于注意力的重要性，这种崩溃可能会导致重复标记产生虚假的高分，从而使缓存管理无意中加剧重复。为了以可控和可重复的方式研究这一现象，我们引入了 LoopBench，这是一个具有明确循环诱导条件和循环导向指标的基准，量化了重复的严重性和生成的不稳定性，超越了下游任务的分数。在这些见解的基础上，我们提出了 LoopGuard，这是一种轻量级的插件式 KV 缓存保护器，能够在线检测循环的发生，并通过在固定的缓存预算下修剪重复的尾部跨度来打断反馈循环。在 LoopBench 上的实验表明，LoopGuard 将循环发生率降低了超过 90 个百分点，同时恢复了输出多样性并减少了标记浪费。

View on arXiv Download PDF AI Translation

cs.AI / 48 / 2604.10075

Learning Hierarchical and Geometry-Aware Graph Representations for Text-to-CAD

面向文本到CAD的层次化几何感知图表示学习

Gong, Shengjie, Peng, Wenjie, Chen, Hongyuan, Zhang, Gangyu, Hu, Yunqing, Zhang, Huiyuan, Huang, Shuangping, Chen, Tianshui

Abstract

Text-to-CAD code generation is a long-horizon task that translates textual instructions into long sequences of interdependent operations. Existing methods typically decode text directly into executable code (e.g., bpy) without explicitly modeling assembly hierarchy or geometric constraints, which enlarges the search space, accumulates local errors, and often causes cascading failures in complex assemblies. To address this issue, we propose a hierarchical and geometry-aware graph as an intermediate representation. The graph models multi-level parts and components as nodes and encodes explicit geometric constraints as edges. Instead of mapping text directly to code, our framework first predicts structure and constraints, then conditions action sequencing and code generation, thereby improving geometric fidelity and constraint satisfaction. We further introduce a structure-aware progressive curriculum learning strategy that constructs graded tasks through controlled structural edits, explores the model's capability boundary, and synthesizes boundary examples for iterative training. In addition, we build a 12K dataset with instructions, decomposition graphs, action sequences, and bpy code, together with graph- and constraint-oriented evaluation metrics. Extensive experiments show that our method consistently outperforms existing approaches in both geometric fidelity and accurate satisfaction of geometric constraints.

Chinese Translation

文本到CAD代码生成是一项长时序任务，将文本指令转换为长序列的相互依赖操作。现有方法通常直接将文本解码为可执行代码（如 bpy），但未显式建模装配层次结构或几何约束，导致搜索空间扩大、局部误差累积，并常在复杂装配中引发连锁失败。为解决该问题，我们提出了一种层次化且几何感知的图作为中间表示。该图将多层级零件和组件建模为节点，并将显式几何约束编码为边。我们的框架不直接将文本映射为代码，而是先预测结构和约束，再基于此进行动作序列和代码生成，从而提升几何保真度和约束满足度。我们进一步引入了一种结构感知的渐进式课程学习策略，通过受控结构编辑构建分级任务，探索模型能力边界，并合成边界样本进行迭代训练。此外，我们构建了一个包含指令、分解图、动作序列及 bpy 代码的12K数据集，并设计了面向图和约束的评估指标。大量实验表明，我们的方法在几何保真度和几何约束准确满足方面均显著优于现有方法。

View on arXiv Download PDF AI Translation

cs.AI / 49 / 2604.10087

Ontological Trajectory Forecasting via Finite Semigroup Iteration and Lie Algebra Approximation in Geopolitical Knowledge Graphs

通过有限半群迭代和李代数近似在地缘政治知识图谱中进行本体轨迹预测

Wu, Qihang

Abstract

We present EL-DRUIN, an ontological reasoning system for geopolitical intelligence analysis that combines formal ontology, finite semigroup algebra, and Lie algebra approximation to forecast long-run relationship trajectories. Current LLM-based political analysis systems operate as summarisation engines, producing outputs bounded by textual pattern matching. EL-DRUIN departs from this paradigm by modelling geopolitical relationships as states in a finite set of named Dynamic Patterns, composing patterns via a semigroup operation whose structure constants are defined by an explicit composition table, and embedding each pattern as a vector in an 8-dimensional semantic Lie algebra space. Forward simulation iterates this semigroup operation, yielding reachable pattern sets at each discrete timestep; convergence to idempotent absorbing states (fixed points of the composition) constitutes the predicted long-run attractor. Bayesian posterior weights combine ontology-derived confidence priors with a Lie similarity term measuring the cosine similarity between the vector sum of composing patterns and the target pattern vector, providing interpretable, calibrated probabilities that are not self-reported by a language model. Bifurcation points -- steps at which two candidate attractors have near-equal posterior mass -- are detected and exposed to downstream analysis. We demonstrate the framework on six geopolitical scenarios including US-China technology decoupling and the Taiwan Strait military coercion trajectory. The architecture is publicly available as an open-source system with a Streamlit frontend exposing full computation traces, Bayesian posterior breakdowns, and 8D ontological state vectors.

Chinese Translation

我们提出了EL-DRUIN，一个用于地缘政治情报分析的本体推理系统，结合了形式本体、有限半群代数和李代数近似，以预测长期关系轨迹。当前基于大型语言模型（LLM）的政治分析系统作为摘要引擎运作，生成的输出受限于文本模式匹配。EL-DRUIN突破了这一范式，通过将地缘政治关系建模为有限命名动态模式中的状态，利用一个由显式组合表定义的结构常数的半群运算组合模式，并将每个模式嵌入到一个8维语义李代数空间中。前向模拟迭代这一半群运算，在每个离散时间步产生可达模式集；收敛到幂等吸收状态（组合的固定点）构成预测的长期吸引子。贝叶斯后验权重将本体衍生的置信先验与李相似性项结合，后者测量组合模式的向量和与目标模式向量之间的余弦相似性，提供可解释的、经过校准的概率，这些概率并非由语言模型自我报告。分岔点——两个候选吸引子具有近似相等后验质量的步骤——被检测并暴露于下游分析中。我们在六个地缘政治场景中展示了该框架，包括美中技术脱钩和台湾海峡军事胁迫轨迹。该架构作为开源系统公开可用，配有Streamlit前端，展示完整的计算轨迹、贝叶斯后验分解和8D本体状态向量。

View on arXiv Download PDF AI Translation

cs.AI / 50 / 2604.10110

Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards

信任你的记忆：通过多维奖励的强化学习实现智能家居的可验证控制

Guo, Kai-Yuan, Wang, Jiang, Zhao, Renjie, Wang, Tianyi, Mao, Wandong, Gao, Yu, Feng, Mou Xiao, Xu, Yi

Abstract

Large Language Models (LLMs) have become a key foundation for enabling personalized smart home experiences. While existing studies have explored how smart home assistants understand user queries to control devices in real time, their ability to perform memory-driven device control remains challenging from both evaluation and methodological perspectives. In terms of evaluation, existing benchmarks either focus on immediate device control or general open-domain memory retrieval tasks, and therefore cannot effectively evaluate a model's ability to perform memory-driven device control. Methodologically, while memory-driven device control can be approached using Reinforcement Learning, conventional RL methods generally rely on outcome-based supervision (i.e., whether the final task is achieved). This lack of intermediate feedback can lead to sub-optimal performance or local failures in fine-grained memory management tasks (adding, updating, deleting, and utilizing). To address these issues, we first release MemHomeLife, built from anonymized real-world long-term user interaction logs. To enable more fine-grained evaluation of different memory-related subtasks, we further construct MemHome, the first benchmark designed to systematically evaluate memory-driven device control in smart home scenarios.

Chinese Translation

大型语言模型（LLMs）已成为实现个性化智能家居体验的关键基础。尽管现有研究探讨了智能家居助手如何理解用户查询以实时控制设备，但从评估和方法论的角度来看，其在基于记忆的设备控制方面的能力仍然面临挑战。在评估方面，现有基准要么专注于即时设备控制，要么关注一般开放领域的记忆检索任务，因此无法有效评估模型在基于记忆的设备控制中的能力。在方法论上，尽管可以使用强化学习（Reinforcement Learning）来处理基于记忆的设备控制，但传统的强化学习方法通常依赖于基于结果的监督（即最终任务是否完成）。这种缺乏中间反馈的情况可能导致在细粒度记忆管理任务（如添加、更新、删除和利用）中的次优表现或局部失败。为了解决这些问题，我们首先发布了MemHomeLife，该数据集基于匿名的真实世界长期用户交互日志构建。为了更细致地评估不同的记忆相关子任务，我们进一步构建了MemHome，这是第一个旨在系统评估智能家居场景中基于记忆的设备控制的基准。

View on arXiv Download PDF AI Translation

cs.AI / 51 / 2604.10150

Learning from Emptiness: De-biasing Listwise Rerankers with Content-Agnostic Probability Calibration

从空白中学习：通过内容无关的概率校准消除列表重排序器的偏差

Lv, Hang, Gu, Hongchao, Yang, Ruiqing, Li, Liangyue, Chen, Zulong, Lian, Defu, Wang, Hao, Chen, Enhong

Abstract

Generative listwise reranking leverages global context for superior retrieval but is plagued by intrinsic position bias, where models exhibit structural sensitivity to input order independent of relevance. Existing mitigations present a dilemma: inference-time aggregation incurs prohibitive latency, while training-based methods often fail to eradicate ingrained priors, particularly in compact models. To resolve this dilemma, we propose CapCal (Content-Agnostic Probability Calibration), a training-free framework that mechanically decouples positional bias from ranking decisions. By estimating the bias distribution via content-free placeholders, CapCal rectifies output logits through an entropy-adaptive contrastive mechanism. Evaluations across 10 benchmarks confirm that CapCal achieves superior performance among training-free methods while preserving single-pass efficiency. Notably, it unlocks the latent potential of lightweight models (e.g., 0.6B), delivering absolute NDCG gains exceeding 10 points and outperforming both permutation-based aggregation and data-augmentation baselines.

Chinese Translation

生成式列表重排序利用全局上下文以实现更优的检索，但受到内在位置偏差的困扰，即模型对输入顺序表现出结构敏感性，而与相关性无关。现有的缓解措施面临两难：推理时的聚合会导致不可接受的延迟，而基于训练的方法往往无法消除根深蒂固的先验，尤其是在紧凑模型中。为了解决这一困境，我们提出了CapCal（内容无关的概率校准），这是一个无训练的框架，机械地将位置偏差与排名决策解耦。通过使用无内容的占位符估计偏差分布，CapCal通过熵自适应对比机制修正输出logits。在10个基准测试中的评估确认，CapCal在无训练方法中实现了优越的性能，同时保持单次通过的效率。值得注意的是，它释放了轻量级模型（例如0.6B）的潜在能力，带来了超过10点的绝对NDCG增益，并超越了基于排列的聚合和数据增强基线。

View on arXiv Download PDF AI Translation

cs.AI / 52 / 2604.10152

SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding

SpecMoE：基于自助式推测解码的快速高效专家混合推理方法

Bang, Jehyeon, Cho, Eunyeong, Hwang, Ranggi, Chung, Jinha, Rhu, Minsoo

Abstract

The Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training or fine-tuning. Our system improves inference throughput by up to $4.30\times$, while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.

Chinese Translation

专家混合（Mixture-of-Experts，MoE）架构作为一种有前景的方法，通过选择性激活参数来缓解大型语言模型（LLMs）日益增长的计算成本。然而，其高内存需求和参数效率不佳对高效部署构成了重大挑战。尽管文献中提出了基于CPU卸载的MoE推理系统，但其效率有限，尤其是在大批量处理时。本文提出了SpecMoE，一种基于自助式推测解码算法的内存高效MoE推理系统。SpecMoE展示了将推测解码应用于MoE推理的有效性，无需额外的模型训练或微调。我们的系统将推理吞吐量提升至最高4.30倍，同时显著降低了内存受限系统中内存和互连带宽的需求。

View on arXiv Download PDF AI Translation

cs.AI / 53 / 2604.10164

Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities

针对具有新兴实体的时间知识图谱的归纳推理

Zhao, Ze, He, Yuhui, Wu, Lyuwen, Tang, Gu, Lu, Bin, Gan, Xiaoying, Fu, Luoyi, Wang, Xinbing, Zhou, Chenghu

Abstract

Reasoning on Temporal Knowledge Graphs (TKGs) is essential for predicting future events and time-aware facts. While existing methods are effective at capturing relational dynamics, their performance is limited by a closed-world assumption, which fails to account for emerging entities not present in the training. Notably, these entities continuously join the network without historical interactions. Empirical study reveals that emerging entities are widespread in TKGs, comprising roughly 25\% of all entities. The absence of historical interactions of these entities leads to significant performance degradation in reasoning tasks. Whereas, we observe that entities with semantic similarities often exhibit comparable interaction histories, suggesting the presence of transferable temporal patterns. Inspired by this insight, we propose TransFIR (Transferable Inductive Reasoning), a novel framework that leverages historical interaction sequences from semantically similar known entities to support inductive reasoning. Specifically, we propose a codebook-based classifier that categorizes emerging entities into latent semantic clusters, allowing them to adopt reasoning patterns from similar entities. Experimental results demonstrate that TransFIR outperforms all baselines in reasoning on emerging entities, achieving an average improvement of 28.6% in Mean Reciprocal Rank (MRR) across multiple datasets. The implementations are available at https://github.com/zhaodazhuang2333/TransFIR.

Chinese Translation

对时间知识图谱（TKGs）的推理对于预测未来事件和时间相关事实至关重要。尽管现有方法在捕捉关系动态方面有效，但其性能受到封闭世界假设的限制，这一假设未能考虑训练中未出现的新兴实体。值得注意的是，这些实体在没有历史交互的情况下不断加入网络。实证研究表明，新兴实体在TKGs中普遍存在，约占所有实体的25%。这些实体缺乏历史交互导致推理任务的性能显著下降。然而，我们观察到具有语义相似性的实体往往表现出可比的交互历史，这表明存在可转移的时间模式。基于这一洞察，我们提出了TransFIR（可转移归纳推理），这是一个新颖的框架，利用语义相似的已知实体的历史交互序列来支持归纳推理。具体而言，我们提出了一种基于代码本的分类器，将新兴实体分类到潜在的语义簇中，使它们能够采用相似实体的推理模式。实验结果表明，TransFIR在新兴实体的推理上优于所有基线，在多个数据集上实现了平均28.6%的互惠排名（MRR）提升。实现代码可在 https://github.com/zhaodazhuang2333/TransFIR 获取。

View on arXiv Download PDF AI Translation

cs.AI / 54 / 2604.10169

MAVEN-T: Multi-Agent enVironment-aware Enhanced Neural Trajectory predictor with Reinforcement Learning

MAVEN-T：基于多智能体环境感知的强化学习增强神经轨迹预测器

Duan, Wenchang

Abstract

Trajectory prediction remains a critical yet challenging component in autonomous driving systems, requiring sophisticated reasoning capabilities while meeting strict real-time deployment constraints. While knowledge distillation has demonstrated effectiveness in model compression, existing approaches often fail to preserve complex decision-making capabilities, particularly in dynamic multi-agent scenarios. This paper introduces MAVEN-T, a teacher-student framework that achieves state-of-the-art trajectory prediction through complementary architectural co-design and progressive distillation. The teacher employs hybrid attention mechanisms for maximum representational capacity, while the student uses efficient architectures optimized for deployment. Knowledge transfer is performed via multi-granular distillation with adaptive curriculum learning that dynamically adjusts complexity based on performance. Importantly, the framework incorporates reinforcement learning to overcome the imitation ceiling of traditional distillation, enabling the student to verify, refine, and optimize teacher knowledge through dynamic environmental interaction, potentially achieving more robust decision-making than the teacher itself. Extensive experiments on NGSIM and highD datasets demonstrate 6.2x parameter compression and 3.7x inference speedup while maintaining state-of-the-art accuracy, establishing a new paradigm for deploying sophisticated reasoning models under resource constraints.

Chinese Translation

轨迹预测作为自动驾驶系统中的关键且具有挑战性的组成部分，既需具备复杂的推理能力，又需满足严格的实时部署要求。尽管知识蒸馏在模型压缩方面表现出有效性，现有方法往往难以保留复杂的决策能力，尤其是在动态多智能体场景中。本文提出了MAVEN-T，一种师生框架，通过互补的架构协同设计和渐进式蒸馏，实现了最先进的轨迹预测。教师模型采用混合注意力机制以最大化表示能力，学生模型则采用为部署优化的高效架构。知识转移通过多粒度蒸馏结合自适应课程学习进行，动态根据性能调整复杂度。值得注意的是，该框架引入强化学习，突破传统蒸馏的模仿上限，使学生模型能够通过动态环境交互验证、优化和提升教师知识，甚至实现比教师更鲁棒的决策能力。在NGSIM和highD数据集上的大量实验表明，模型参数压缩达6.2倍，推理速度提升3.7倍，同时保持最先进的准确率，开创了在资源受限条件下部署复杂推理模型的新范式。

View on arXiv Download PDF AI Translation

cs.AI / 55 / 2604.10171

PoreDiT: A Scalable Generative Model for Large-Scale Digital Rock Reconstruction

PoreDiT：一种可扩展的大规模数字岩石重建生成模型

Huang, Yizhuo, Sun, Baoquan, Huang, Haibo

Abstract

This manuscript presents PoreDiT, a novel generative model designed for high-efficiency digital rock reconstruction at gigavoxel scales. Addressing the significant challenges in digital rock physics (DRP), particularly the trade-off between resolution and field-of-view (FOV), and the computational bottlenecks associated with traditional deep learning architectures, PoreDiT leverages a three-dimensional (3D) Swin Transformer to break through these limitations. By directly predicting the binary probability field of pore spaces instead of grayscale intensities, the model preserves key topological features critical for pore-scale fluid flow and transport simulations. This approach enhances computational efficiency, enabling the generation of ultra-large-scale ($1024^3$ voxels) digital rock samples on consumer-grade hardware. Furthermore, PoreDiT achieves physical fidelity comparable to previous state-of-the-art methods, including accurate porosity, pore-scale permeability, and Euler characteristics. The model's ability to scale efficiently opens new avenues for large-domain hydrodynamic simulations and provides practical solutions for researchers in pore-scale fluid mechanics, reservoir characterization, and carbon sequestration.

Chinese Translation

本文提出了PoreDiT，一种新颖的生成模型，旨在实现千亿体素规模的高效数字岩石重建。针对数字岩石物理（DRP）领域中分辨率与视野（FOV）之间的权衡以及传统深度学习架构所面临的计算瓶颈问题，PoreDiT利用三维（3D）Swin Transformer突破了这些限制。该模型通过直接预测孔隙空间的二值概率场而非灰度强度，保留了对孔隙尺度流体流动和传输模拟至关重要的关键拓扑特征。此方法提升了计算效率，使得在消费级硬件上生成超大规模（$1024^3$体素）数字岩石样本成为可能。此外，PoreDiT在物理真实性方面达到了与现有最先进方法相当的水平，包括准确的孔隙率、孔隙尺度渗透率及欧拉特征。该模型高效的可扩展性为大尺度水动力学模拟开辟了新途径，并为孔隙尺度流体力学、油藏表征及碳封存领域的研究者提供了实用解决方案。

View on arXiv Download PDF AI Translation

cs.AI / 56 / 2604.10182

Credit-Budgeted ICPC-Style Coding: When Agents Must Pay for Every Decision

基于信用预算的ICPC风格编码：当智能体必须为每个决策付费时

Zhou, Lingfeng, Shi, Junhao, Gao, Jin, Wang, Dequan

Abstract

Current evaluations of autonomous coding agents assume an unrealistic, infinite-resource environment. However, real-world software engineering is a resource-bound competition. As we scale toward large agent swarms, ignoring compute and time costs risks catastrophic budget exhaustion. To shift the focus from isolated accuracy to cost-aware problem-solving, we introduce USACOArena, an interactive ACM-ICPC-style arena driven by a strict "credit" economy. Every generated token, local test, and elapsed second depletes a fixed budget, forcing agents to make strategic trade-offs. Our comprehensive profiling reveals that frontier single agents and swarms currently fail to optimally balance accuracy with these constraints, exhibiting divergent, path-dependent behaviors. Ultimately, USACOArena provides an essential dynamic training ground for developing highly efficient, resource-aware agent architectures.

Chinese Translation

当前对自主编码智能体的评估假设了一个不切实际的无限资源环境。然而，现实世界的软件工程是一场资源受限的竞争。随着我们向大规模智能体群体扩展，忽视计算和时间成本将面临灾难性的预算耗尽风险。为了将关注点从孤立的准确率转向成本感知的问题解决，我们引入了USACOArena，这是一个由严格“信用”经济驱动的交互式ACM-ICPC风格竞技场。每生成一个token、每进行一次本地测试以及每经过一秒钟，都会消耗固定预算，迫使智能体做出战略性权衡。我们的全面分析显示，当前最前沿的单智能体和智能体群体尚未能在准确率与这些约束之间实现最佳平衡，表现出分歧且路径依赖的行为。最终，USACOArena为开发高效且资源感知的智能体架构提供了一个必不可少的动态训练平台。

View on arXiv Download PDF AI Translation

cs.AI / 57 / 2604.10200

Edu-MMBias: A Three-Tier Multimodal Benchmark for Auditing Social Bias in Vision-Language Models under Educational Contexts

Edu-MMBias：一个三层次多模态基准，用于审计教育背景下视觉-语言模型中的社会偏见

Li, Ruijia, Zhang, Mingzi, Yu, Zengyi, Wei, Yuang, Jiang, Bo

Abstract

As Vision-Language Models (VLMs) become integral to educational decision-making, ensuring their fairness is paramount. However, current text-centric evaluations neglect the visual modality, leaving an unregulated channel for latent social biases. To bridge this gap, we present Edu-MMBias, a systematic auditing framework grounded in the tri-component model of attitudes from social psychology. This framework diagnoses bias across three hierarchical dimensions: cognitive, affective, and behavioral. Utilizing a specialized generative pipeline that incorporates a self-correct mechanism and human-in-the-loop verification, we synthesize contamination-resistant student profiles to conduct a holistic stress test on state-of-the-art VLMs. Our extensive audit reveals critical, counter-intuitive patterns: models exhibit a compensatory class bias favoring lower-status narratives while simultaneously harboring deep-seated health and racial stereotypes. Crucially, we find that visual inputs act as a safety backdoor, triggering a resurgence of biases that bypass text-based alignment safeguards and revealing a systematic misalignment between latent cognition and final decision-making. The contributions of this paper are available at: https://anonymous.4open.science/r/EduMMBias-63B2.

Chinese Translation

随着视觉-语言模型（VLMs）在教育决策中的重要性日益增加，确保其公平性变得至关重要。然而，目前以文本为中心的评估忽视了视觉模态，留下了一个未受监管的潜在社会偏见渠道。为了解决这一问题，我们提出了Edu-MMBias，这是一个基于社会心理学态度三成分模型的系统审计框架。该框架从认知、情感和行为三个层次诊断偏见。我们利用一个专门的生成管道，结合自我修正机制和人机协作验证，合成抗污染的学生档案，以对最先进的VLMs进行全面的压力测试。我们的广泛审计揭示了关键的反直觉模式：模型表现出补偿性类别偏见，偏向于低地位叙事，同时深藏健康和种族刻板印象。重要的是，我们发现视觉输入作为一种安全后门，触发了偏见的复苏，绕过了基于文本的对齐保护措施，揭示了潜在认知与最终决策之间的系统性不对齐。本文的贡献可在以下链接获取：https://anonymous.4open.science/r/EduMMBias-63B2。

View on arXiv Download PDF AI Translation

cs.AI / 58 / 2604.10219

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

认知转折点与视觉锚定：揭示和纠正多模态推理模型中的幻觉

Qian, Zhe, Ma, Yanbiao, Ouyang, Zhuohan, Wang, Zhonghua, Xu, Zhongxing, Luo, Fei, Liu, Xinyu, Ge, Zongyuan, Guo, Yike, Han, Jungong

Abstract

Multimodal Large Reasoning Models (MLRMs) have achieved remarkable strides in visual reasoning through test time compute scaling, yet long chain reasoning remains prone to hallucinations. We identify a concerning phenomenon termed the Reasoning Vision Truth Disconnect (RVTD): hallucinations are strongly correlated with cognitive bifurcation points that often exhibit high entropy states. We attribute this vulnerability to a breakdown in visual semantic anchoring, localized within the network's intermediate layers; specifically, during these high uncertainty transitions, the model fails to query visual evidence, reverting instead to language priors. Consequently, we advocate a shift from solely outcome level supervision to augmenting it with fine grained internal attention guidance. To this end, we propose V-STAR (Visual Structural Training with Attention Reinforcement), a lightweight, holistic training paradigm designed to internalize visually aware reasoning capabilities. Central to our approach is the Hierarchical Visual Attention Reward (HVAR), integrated within the GRPO framework. Upon detecting high entropy states, this mechanism dynamically incentivizes visual attention across critical intermediate layers, thereby anchoring the reasoning process back to the visual input. Furthermore, we introduce the Forced Reflection Mechanism (FRM), a trajectory editing strategy that disrupts cognitive inertia by triggering reflection around high entropy cognitive bifurcation points and encouraging verification of subsequent steps against the visual input, thereby translating external debiasing interventions into an intrinsic capability for hallucination mitigation.

Chinese Translation

多模态大型推理模型（MLRMs）在视觉推理方面取得了显著进展，尤其是在测试时计算规模的提升上，但长链推理仍然容易出现幻觉。我们识别出一种令人担忧的现象，称为推理视觉真相脱节（Reasoning Vision Truth Disconnect，RVTD）：幻觉与认知分岔点之间存在强相关性，这些分岔点通常表现出高熵状态。我们将这种脆弱性归因于视觉语义锚定的失效，这种失效局限于网络的中间层；具体而言，在这些高不确定性过渡期间，模型未能查询视觉证据，而是回归到语言先验。因此，我们主张从仅依赖结果层面的监督转向增强细粒度的内部注意力引导。为此，我们提出了V-STAR（带有注意力强化的视觉结构训练），这是一种轻量级的整体训练范式，旨在内化视觉意识推理能力。我们方法的核心是分层视觉注意力奖励（Hierarchical Visual Attention Reward，HVAR），它集成在GRPO框架内。当检测到高熵状态时，该机制动态激励关键中间层的视觉注意力，从而将推理过程锚定回视觉输入。此外，我们引入了强制反思机制（Forced Reflection Mechanism，FRM），这是一种轨迹编辑策略，通过在高熵认知分岔点周围触发反思来打破认知惯性，并鼓励对后续步骤与视觉输入进行验证，从而将外部去偏见干预转化为内在的幻觉缓解能力。

View on arXiv Download PDF AI Translation

cs.AI / 59 / 2604.10228

SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

SVSR：一种用于多模态推理的自我验证与自我修正范式

Qian, Zhe, Su, Nianbing, Wang, Zhonghua, Li, Hebei, Xu, Zhongxing, Li, Yueying, Luo, Fei, Ouyang, Zhuohan, Ma, Yanbiao

Abstract

Current multimodal models often suffer from shallow reasoning, leading to errors caused by incomplete or inconsistent thought processes. To address this limitation, we propose Self-Verification and Self-Rectification (SVSR), a unified framework that explicitly integrates self-verification and self-rectification into the model's reasoning pipeline, substantially improving robustness and reliability in complex visual understanding and multimodal reasoning tasks. SVSR is built on a novel three-stage training paradigm. First, we construct a high-quality unified preference dataset by refining reasoning traces from pre-trained vision-language models, incorporating both forward and backward reasoning to embed self-reflective signals. Second, we perform cold-start supervised fine-tuning on this dataset to learn structured, multi-step reasoning behaviors. Third, we apply a Semi-online Direct Preference Optimization (Semi-online DPO) process, continuously augmenting the training corpus with high-quality, model-generated reasoning traces filtered by a powerful teacher VLM. This pipeline enables the model to learn, elicit, and refine its ability to self-verify and self-rectify. Extensive experiments across diverse benchmarks demonstrate that SVSR improves reasoning accuracy and enables stronger generalization to unseen tasks and question types. Notably, once trained with explicit self-reflective reasoning, the model also exhibits improved implicit reasoning ability, outperforming strong baselines even when no explicit reasoning traces are provided. These results highlight the potential of SVSR for building more dependable, introspective, and cognitively aligned multimodal systems.

Chinese Translation

当前的多模态模型常常面临浅层推理的问题，导致因思维过程不完整或不一致而产生的错误。为了解决这一局限性，我们提出了自我验证与自我修正（Self-Verification and Self-Rectification，SVSR），这是一个统一框架，明确将自我验证和自我修正集成到模型的推理流程中，从而显著提高在复杂视觉理解和多模态推理任务中的鲁棒性和可靠性。SVSR建立在一个新颖的三阶段训练范式之上。首先，我们通过从预训练的视觉-语言模型中提炼推理轨迹，构建一个高质量的统一偏好数据集，结合前向和后向推理以嵌入自我反思信号。其次，我们在该数据集上进行冷启动监督微调，以学习结构化的多步骤推理行为。第三，我们应用半在线直接偏好优化（Semi-online Direct Preference Optimization，Semi-online DPO）过程，持续用强大的教师视觉语言模型（VLM）过滤的高质量模型生成推理轨迹来增强训练语料库。该流程使模型能够学习、引发并完善其自我验证和自我修正的能力。在各种基准测试中进行的广泛实验表明，SVSR提高了推理准确性，并增强了对未见任务和问题类型的更强泛化能力。值得注意的是，一旦经过明确的自我反思推理训练，模型还表现出改善的隐性推理能力，即使在没有提供明确推理轨迹的情况下也能超越强基线。这些结果突显了SVSR在构建更可靠、自省和认知对齐的多模态系统方面的潜力。

View on arXiv Download PDF AI Translation

cs.AI / 60 / 2604.10252

A Dual-Positive Monotone Parameterization for Multi-Segment Bids and a Validity Assessment Framework for Reinforcement Learning Agent-based Simulation of Electricity Markets

多段竞价的双正单调参数化及基于强化学习智能体的电力市场仿真有效性评估框架

Xu, Zunnan, Jing, Zhaoxia, Pan, Zhanhua

Abstract

Reinforcement learning agent-based simulation (RL-ABS) has become an important tool for electricity market mechanism analysis and evaluation. In the modeling of monotone, bounded, multi-segment stepwise bids, existing methods typically let the policy network first output an unconstrained action and then convert it into a feasible bid curve satisfying monotonicity and boundedness through post-processing mappings such as sorting, clipping, or projection. However, such post-processing mappings often fail to satisfy continuous differentiability, injectivity, and invertibility at boundaries or kinks, thereby causing gradient distortion and leading to spurious convergence in simulation results. Meanwhile, most existing studies conduct mechanism analysis and evaluation mainly on the basis of training-curve convergence, without rigorously assessing the distance between the simulation outcomes and Nash equilibrium, which severely undermines the credibility of the results. To address these issues, this paper proposes...

Chinese Translation

基于强化学习的智能体仿真（RL-ABS）已成为电力市场机制分析与评估的重要工具。在单调、有界、多段阶梯竞价的建模中，现有方法通常先让策略网络输出无约束动作，再通过排序、截断或投影等后处理映射将其转换为满足单调性和有界性的可行竞价曲线。然而，此类后处理映射往往在边界或拐点处无法满足连续可微性、单射性和可逆性，导致梯度扭曲并引发仿真结果的伪收敛。同时，大多数现有研究主要基于训练曲线的收敛性进行机制分析与评估，缺乏对仿真结果与纳什均衡之间距离的严格评估，严重削弱了结果的可信度。为解决上述问题，本文提出了...

View on arXiv Download PDF AI Translation

cs.AI / 61 / 2604.10261

The Amazing Agent Race: Strong Tool Users, Weak Navigators

惊人的智能体竞赛：强大的工具使用者，薄弱的导航者

Kim, Zae Myung, Lee, Dongseok, Kim, Jaehyung, Raheja, Vipul, Kang, Dongyeop

Abstract

Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race

Chinese Translation

现有的大型语言模型（LLM）智能体工具使用基准测试大多呈线性结构：我们对六个基准的分析显示，55%至100%的实例是由2到5步组成的简单链条。我们引入了“惊人的智能体竞赛”（The Amazing Agent Race，AAR），这是一个包含有向无环图（DAG）谜题（或称“赛段”）的基准测试，具备分叉合并的工具链。我们发布了1400个实例，涵盖两个变体：顺序型（800个赛段）和组合型（600个DAG赛段）。智能体必须在维基百科中导航，执行多步骤工具链，并将结果汇总为可验证的答案。赛段通过维基百科种子程序化生成，涵盖四个难度等级，并通过实时API进行验证。三个互补指标（终点准确率、停站访问率和路障完成率）分别诊断导航、工具使用和算术错误。对三种智能体框架在1400个赛段上的评测显示，最佳准确率仅为37.2%。导航错误占主导地位（27%至52%的试验），而工具使用错误低于17%；智能体架构的重要性与模型规模相当（Claude Code以6倍更少的tokens达到与Codex CLI相同的37%准确率）。AAR的组合结构揭示，智能体失败的关键不在于调用工具，而在于导航至正确页面，这一盲点在线性基准中难以察觉。项目主页：https://minnesotanlp.github.io/the-amazing-agent-race

View on arXiv Download PDF AI Translation

cs.AI / 62 / 2604.10286

STARS: Skill-Triggered Audit for Request-Conditioned Invocation Safety in Agent Systems

STARS：面向智能体系统请求条件调用安全的技能触发审计

Zhang, Guijia, Yang, Shu, Gong, Xilin, Wang, Di

Abstract

Autonomous language-model agents increasingly rely on installable skills and tools to complete user tasks. Static skill auditing can expose capability surface before deployment, but it cannot determine whether a particular invocation is unsafe under the current user request and runtime context. We therefore study skill invocation auditing as a continuous-risk estimation problem: given a user request, candidate skill, and runtime context, predict a score that supports ranking and triage before a hard intervention is applied. We introduce STARS, which combines a static capability prior, a request-conditioned invocation risk model, and a calibrated risk-fusion policy. To evaluate this setting, we construct SIA-Bench, a benchmark of 3,000 invocation records with group-safe splits, lineage metadata, runtime context, canonical action labels, and derived continuous-risk targets. On a held-out split of indirect prompt injection attacks, calibrated fusion reaches 0.439 high-risk AUPRC, improving over 0.405 for the contextual scorer and 0.380 for the strongest static baseline, while the contextual scorer remains better calibrated with 0.289 expected calibration error. On the locked in-distribution test split, gains are smaller and static priors remain useful. The resulting claim is therefore narrower: request-conditioned auditing is most valuable as an invocation-time risk-scoring and triage layer rather than as a replacement for static screening. Code is available at https://github.com/123zgj123/STARS.

Chinese Translation

自主语言模型智能体日益依赖可安装的技能和工具来完成用户任务。静态技能审计能够在部署前揭示能力范围，但无法判断在当前用户请求和运行时上下文下某次特定调用是否存在安全风险。因此，我们将技能调用审计视为一个持续风险评估问题：给定用户请求、候选技能及运行时上下文，预测一个分数以支持在采取强制干预前的排序和分诊。我们提出了STARS，该方法结合了静态能力先验、基于请求条件的调用风险模型以及校准的风险融合策略。为评估该方法，我们构建了SIA-Bench基准，包含3000条调用记录，具备组安全划分、血缘元数据、运行时上下文、规范动作标签及派生的连续风险目标。在间接提示注入攻击的保留测试集上，校准融合达到0.439的高风险AUPRC，优于上下文评分器的0.405和最强静态基线的0.380，同时上下文评分器以0.289的期望校准误差保持更好的校准性能。在锁定的分布内测试集上，提升较小且静态先验仍然有用。因此，结论更为谨慎：请求条件审计最有价值的是作为调用时的风险评分和分诊层，而非替代静态筛查。代码可在https://github.com/123zgj123/STARS获取。

View on arXiv Download PDF AI Translation

cs.AI / 63 / 2604.10288

Dead Cognitions: A Census of Misattributed Insights

死亡的认知：误归因洞察的普查

Tuor, Aaron, ai, claude.

Abstract

This essay identifies a failure mode of AI chat systems that we term attribution laundering: the model performs substantive cognitive work and then rhetorically credits the user for having generated the resulting insights. Unlike transparent versions of glad handing sycophancy, attribution laundering is systematically occluded to the person it affects and self-reinforcing -- eroding users' ability to accurately assess their own cognitive contributions over time. We trace the mechanisms at both individual and societal scales, from the chat interface that discourages scrutiny to the institutional pressures that reward adoption over accountability. The document itself is an artifact of the process it describes, and is color-coded accordingly -- though the views expressed are the authors' own, not those of any affiliated institution, and the boundary between the human author's views and Claude's is, as the essay argues, difficult to draw.

Chinese Translation

本文识别了一种我们称之为归因洗涤的人工智能聊天系统失效模式：模型执行实质性的认知工作，然后在修辞上将产生的洞察归功于用户。与透明的阿谀奉承版本不同，归因洗涤对受影响者是系统性隐蔽的，并且是自我强化的——随着时间的推移，侵蚀用户准确评估自身认知贡献的能力。我们追踪了在个体和社会层面上的机制，从不鼓励审查的聊天界面到奖励采用而非问责的制度压力。本文档本身是其所描述过程的一个产物，并相应地进行了颜色编码——尽管所表达的观点是作者个人的，而非任何附属机构的观点，正如本文所论证的，人类作者的观点与Claude的观点之间的界限是难以划分的。

View on arXiv Download PDF AI Translation

cs.AI / 64 / 2604.10290

AI Organizations are More Effective but Less Aligned than Individual Agents

人工智能组织比单个智能体更高效但对齐性较差

Shen, Judy Hanwen, Zhu, Daniel, Srinivasan, Siddarth, Sleight, Henry, Wagner III, Lawrence T., Matthews, Morgan Jane, Jones, Erik, Sohl-Dickstein, Jascha

Abstract

AI is increasingly deployed in multi-agent systems; however, most research considers only the behavior of individual models. We experimentally show that multi-agent "AI organizations" are simultaneously more effective at achieving business goals, but less aligned, than individual AI agents. We examine 12 tasks across two practical settings: an AI consultancy providing solutions to business problems and an AI software team developing software products. Across all settings, AI Organizations composed of aligned models produce solutions with higher utility but greater misalignment compared to a single aligned model. Our work demonstrates the importance of considering interacting systems of AI agents when doing both capabilities and safety research.

Chinese Translation

人工智能日益应用于多智能体系统；然而，大多数研究仅关注单个模型的行为。我们通过实验表明，多智能体“人工智能组织”在实现业务目标方面同时更高效，但对齐性低于单个人工智能智能体。我们考察了两个实际场景中的12个任务：一个是为业务问题提供解决方案的人工智能咨询机构，另一个是开发软件产品的人工智能软件团队。在所有场景中，由对齐模型组成的人工智能组织所产生的解决方案效用更高，但相较于单个对齐模型，存在更大的对齐偏差。我们的研究强调了在进行能力和安全性研究时，考虑人工智能智能体交互系统的重要性。

View on arXiv Download PDF AI Translation

cs.AI / 65 / 2604.10291

TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

时间序列考试代理：大规模创建时间序列推理基准

Gwiazda, Malgorzata, Cai, Yifu, Goswami, Mononito, Choudhry, Arjun, Dubrawski, Artur

Abstract

Large Language Models (LLMs) have shown promising performance in time series modeling tasks, but do they truly understand time series data? While multiple benchmarks have been proposed to answer this fundamental question, most are manually curated and focus on narrow domains or specific skill sets. To address this limitation, we propose scalable methods for creating comprehensive time series reasoning benchmarks that combine the flexibility of templates with the creativity of LLM agents. We first develop TimeSeriesExam, a multiple-choice benchmark using synthetic time series to evaluate LLMs across five core reasoning categories: pattern recognitionnoise understandingsimilarity analysisanomaly detection, and causality. Then, with TimeSeriesExamAgent, we scale our approach by automatically generating benchmarks from real-world datasets spanning healthcare, finance and weather domains. Through multi-dimensional quality evaluation, we demonstrate that our automatically generated benchmarks achieve diversity comparable to manually curated alternatives. However, our experiments reveal that LLM performance remains limited in both abstract time series reasoning and domain-specific applications, highlighting ongoing challenges in enabling effective time series understanding in these models. TimeSeriesExamAgent is available at https://github.com/magwiazda/TimeSeriesExamAgent.

Chinese Translation

大型语言模型（LLMs）在时间序列建模任务中表现出良好的性能，但它们是否真正理解时间序列数据？虽然已经提出了多个基准来回答这个基本问题，但大多数基准都是手动策划的，且集中于狭窄的领域或特定的技能集。为了解决这一局限性，我们提出了可扩展的方法，以创建综合性的时间序列推理基准，结合了模板的灵活性和LLM代理的创造力。我们首先开发了TimeSeriesExam，这是一个使用合成时间序列的多项选择基准，用于评估LLMs在五个核心推理类别中的表现：模式识别、噪声理解、相似性分析、异常检测和因果关系。然后，通过TimeSeriesExamAgent，我们通过自动生成涵盖医疗、金融和天气领域的真实世界数据集的基准来扩展我们的方法。通过多维度的质量评估，我们证明了我们自动生成的基准在多样性上与手动策划的替代方案相当。然而，我们的实验表明，LLM在抽象时间序列推理和领域特定应用中的表现仍然有限，突显了在这些模型中实现有效时间序列理解的持续挑战。TimeSeriesExamAgent可在 https://github.com/magwiazda/TimeSeriesExamAgent 获取。

View on arXiv Download PDF AI Translation

cs.AI / 66 / 2604.10311

Gypscie: A Cross-Platform AI Artifact Management System

Gypscie：跨平台人工智能工件管理系统

Porto, Fabio, Ogasawara, Eduardo, Botaro, Gabriela Moraes, Bastos, Julia Neumann, Fonseca, Augusto, Pacitti, Esther, Valduriez, Patrick

Abstract

Artificial Intelligence (AI) models, encompassing both traditional machine learning (ML) and more advanced approaches such as deep learning and large language models (LLMs), play a central role in modern applications. AI model lifecycle management involves the end-to-end process of managing these models, from data collection and preparation to model building, evaluation, deployment, and continuous monitoring. This process is inherently complex, as it requires the coordination of diverse services that manage AI artifacts such as datasets, dataflows, and models, all orchestrated to operate seamlessly. In this context, it is essential to isolate applications from the complexity of interacting with heterogeneous services, datasets, and AI platforms. In this paper, we introduce Gypscie, a cross-platform AI artifact management system. By providing a unified view of all AI artifacts, the Gypscie platform simplifies the development and deployment of AI applications. This unified view is realized through a knowledge graph that captures application semantics and a rule-based query language that supports reasoning over data and models. Model lifecycle activities are represented as high-level dataflows that can be scheduled across multiple platforms, such as servers, cloud platforms, or supercomputers. Finally, Gypscie records provenance information about the artifacts it produces, thereby enabling explainability. Our qualitative comparison with representative AI systems shows that Gypscie supports a broader range of functionalities across the AI artifact lifecycle. Our experimental evaluation demonstrates that Gypscie can successfully optimize and schedule dataflows on AI platforms from an abstract specification.

Chinese Translation

人工智能（AI）模型，包括传统的机器学习（ML）和更先进的方法，如深度学习和大型语言模型（LLMs），在现代应用中发挥着核心作用。AI模型生命周期管理涉及从数据收集和准备到模型构建、评估、部署和持续监控的端到端过程。这个过程本质上是复杂的，因为它需要协调管理AI工件（如数据集、数据流和模型）的多种服务，所有这些服务都需要无缝地协同工作。在这种背景下，将应用程序与异构服务、数据集和AI平台的复杂性隔离开来是至关重要的。在本文中，我们介绍了Gypscie，一个跨平台的AI工件管理系统。通过提供所有AI工件的统一视图，Gypscie平台简化了AI应用的开发和部署。这个统一视图是通过一个捕捉应用语义的知识图谱和一个支持对数据和模型进行推理的基于规则的查询语言实现的。模型生命周期活动被表示为可以在多个平台（如服务器、云平台或超级计算机）上调度的高层数据流。最后，Gypscie记录其生成的工件的来源信息，从而实现可解释性。我们与代表性的AI系统进行的定性比较表明，Gypscie在AI工件生命周期中支持更广泛的功能。我们的实验评估表明，Gypscie能够成功地优化和调度来自抽象规范的AI平台上的数据流。

View on arXiv Download PDF AI Translation

cs.AI / 67 / 2604.10332

From GPT-3 to GPT-5: Mapping their capabilities, scope, limitations, and consequences

从 GPT-3 到 GPT-5：能力、范围、局限性及其影响的映射

Afridi, Hina, Ullah, Habib, Khan, Sultan Daud, Ullah, Mohib

Abstract

We present the progress of the GPT family from GPT-3 through GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.1, and the GPT-5 family. Our work is comparative rather than merely historical. We investigates how the family evolved in technical framing, user interaction, modality, deployment architecture, and governance viewpoint. The work focuses on five recurring themes: technical progression, capability changes, deployment shifts, persistent limitations, and downstream consequences. In term of research design, we consider official technical reports, system cards, API and model documentation, product announcements, release notes, and peer-reviewed secondary studies. A primary assertion is that later GPT generations should not be interpreted only as larger or more accurate language models. Instead, the family evolves from a scaled few-shot text predictor into a set of aligned, multimodal, tool-oriented, long-context, and increasingly workflow-integrated systems. This development complicates simple model-to-model comparison because product routing, tool access, safety tuning, and interface design become part of the effective system. Across generations, several limitations remain unchanged: hallucination, prompt sensitivity, benchmark fragility, uneven behavior across domains and populations, and incomplete public transparency about architecture and training. However, the family has evolved software development, educational practice, information work, interface design, and discussions of frontier-model governance. We infer that the transition from GPT-3 to GPT-5 is best understood not only as an improvement in model capability, but also as a broader reformulation of what a deployable AI system is, how it is evaluated, and where responsibility should be located when such systems are used at scale.

Chinese Translation

本文展示了 GPT 系列模型从 GPT-3 到 GPT-3.5、GPT-4、GPT-4 Turbo、GPT-4o、GPT-4.1 以及 GPT-5 家族的发展进程。我们的研究侧重于比较分析，而非单纯的历史回顾。我们考察了该系列在技术框架、用户交互、模态、多部署架构及治理视角上的演变。研究聚焦于五个反复出现的主题：技术进展、能力变化、部署转变、持续存在的局限性以及下游影响。在研究设计上，我们参考了官方技术报告、系统说明文档、API 和模型文档、产品发布公告、版本说明以及同行评审的二次研究。主要观点认为，后续的 GPT 代不应仅被视为更大规模或更高准确度的语言模型。相反，该系列从一个规模化的少样本文本预测器演变为一套对齐的、多模态的、面向工具的、支持长上下文且日益集成工作流的系统。这一发展使得简单的模型间比较变得复杂，因为产品路由、工具访问、安全调优和界面设计成为有效系统的一部分。跨代来看，若干局限性依然存在：幻觉现象、提示敏感性、基准测试脆弱性、不同领域及人群间表现不均以及关于架构和训练的公开透明度不足。然而，该系列推动了软件开发、教育实践、信息工作、界面设计及前沿模型治理讨论的发展。我们推断，从 GPT-3 到 GPT-5 的转变不仅是模型能力的提升，更是对可部署 AI 系统本质、评估方式及大规模应用时责任归属的更广泛重构。

View on arXiv Download PDF AI Translation

cs.AI / 68 / 2604.10333

Zero-shot World Models Are Developmentally Efficient Learners

零样本世界模型作为发展高效的学习者

Aw, Khai Loong, Kotar, Klemen, Lee, Wanhee, Kim, Seungwoo, Jedoui, Khaled, Venkatesh, Rahul, Chen, Lilian Naing, Frank, Michael C., Yamins, Daniel L. K.

Abstract

Young children demonstrate early abilities to understand their physical world, estimating depth, motion, object coherence, interactions, and many other aspects of physical scene understanding. Children are both data-efficient and flexible cognitive systems, creating competence despite extremely limited training data, while generalizing to myriad untrained tasks -- a major challenge even for today's best AI systems. Here we introduce a novel computational hypothesis for these abilities, the Zero-shot Visual World Model (ZWM). ZWM is based on three principles: a sparse temporally-factored predictor that decouples appearance from dynamics; zero-shot estimation through approximate causal inference; and composition of inferences to build more complex abilities. We show that ZWM can be learned from the first-person experience of a single child, rapidly generating competence across multiple physical understanding benchmarks. It also broadly recapitulates behavioral signatures of child development and builds brain-like internal representations. Our work presents a blueprint for efficient and flexible learning from human-scale data, advancing both a computational account for children's early physical understanding and a path toward data-efficient AI systems.

Chinese Translation

幼儿展现出早期理解物理世界的能力，能够估计深度、运动、物体连贯性、交互以及物理场景理解的许多其他方面。儿童既是数据高效的认知系统，又具备灵活性，尽管训练数据极其有限，却能形成能力，并推广到无数未训练的任务——这对当今最先进的人工智能系统而言仍是重大挑战。本文提出了一种解释这些能力的新型计算假说——零样本视觉世界模型（Zero-shot Visual World Model，ZWM）。ZWM基于三大原则：稀疏的时间因子预测器，将外观与动力学解耦；通过近似因果推断实现零样本估计；以及推断的组合以构建更复杂的能力。我们展示了ZWM可以从单个儿童的第一人称体验中学习，快速在多个物理理解基准上生成能力。它还广泛再现了儿童发展中的行为特征，并构建了类脑的内部表征。我们的工作为从人类规模数据中实现高效且灵活的学习提供了蓝图，推动了儿童早期物理理解的计算解释及数据高效人工智能系统的发展路径。

View on arXiv Download PDF AI Translation

cs.AI / 69 / 2604.10341

VeriTrans: Fine-Tuned LLM-Assisted NL-to-PL Translation via a Deterministic Neuro-Symbolic Pipeline

VeriTrans：通过确定性神经符号管道进行微调的LLM辅助自然语言到程序语言翻译

Liu, Xuan, Kodakandla, Dheeraj, Srivastva, Kushagra, Farooque, Mahfuza

Abstract

\textbf{VeriTrans} is a reliability-first ML system that compiles natural-language requirements into solver-ready logic with validator-gated reliability. The pipeline integrates an instruction-tuned NL$\!\to\!$PL translator, round-trip reconstruction (PL$\!\to\!$NL) used as a high-precision acceptance gate, and canonical PL$\!\to\!$CNF compilation, all executed via fixed API configuration (temperature$=0$; fine-tuning runs use seed$=42$) and per-item artifact logging (prompts, outputs, hashes) to support auditability and replay-driven debugging. On \textbf{SatBench} (2{,}100 specifications), VeriTrans achieves 94.46\% SAT/UNSAT correctness and 87.73\% median round-trip similarity. Compact fine-tuning on 100--150 curated examples improves fidelity by about 1--1.5\,pp without increasing latency (mean 25.8\,s/spec on our 201-spec runtime subset). A thresholded acceptance policy on the round-trip score exposes a reliability--coverage knob: at $\tau{=}75$, roughly 68\% of items are retained with $\sim$94\% correctness on the accepted set. Validator overhead contributes $<15\%$ of end-to-end runtime, and all prompts/responses and timing metadata are logged to enable replay-driven debugging and regression testing. By separating learned translation from symbolic verification and enforcing deterministic, validator-gated acceptance, VeriTrans turns NL$\!\to\!$logic front-ends into auditable, reproducible components for reliability-critical workflows.

Chinese Translation

VeriTrans 是一个以可靠性为首要目标的机器学习系统，能够将自然语言需求编译为可供求解器使用的逻辑，并具备验证器门控的可靠性。该管道集成了一个经过指令微调的自然语言到程序语言（NL$ o$PL）翻译器、作为高精度接受门的往返重构（PL$ o$NL），以及规范的程序语言到合取范式（PL$ o$CNF）编译，所有操作均通过固定的API配置（温度$=0$；微调运行使用种子$=42$）和逐项工件日志记录（提示、输出、哈希）来支持可审计性和基于重放的调试。在 extbf{SatBench}（2,100个规范）上，VeriTrans 实现了94.46\%的SAT/UNSAT正确率和87.73\\%的中位往返相似度。对100-150个精心挑选的示例进行紧凑微调，使得保真度提高约1-1.5个百分点，而不增加延迟（在我们的201个规范运行子集上平均为25.8秒/规范）。对往返得分的阈值接受策略揭示了可靠性与覆盖率的调节：在$ au{=}75$时，约68\\%的项目被保留，接受集的正确率约为94\\%。验证器的开销占端到端运行时间的比例小于15\\%，所有提示/响应和时间元数据均被记录，以便进行基于重放的调试和回归测试。通过将学习到的翻译与符号验证分离，并强制执行确定性、验证器门控的接受，VeriTrans 将自然语言到逻辑的前端转变为可审计、可重现的组件，适用于对可靠性至关重要的工作流程。

View on arXiv Download PDF AI Translation

cs.AI / 70 / 2604.10352

ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

ClawVM：面向有状态工具使用的 LLM 代理的托管虚拟内存

Rafique, Mofasshara, Bindschaedler, Laurent

Abstract

Stateful tool-using LLM agents treat the context window as working memory, yet today's agent harnesses manage residency and durability as best-effort, causing recurring failures: lost state after compaction, bypassed flushes on reset, and destructive writeback. We present \textsc{ClawVM}, a virtual memory layer that manages state as typed pages with minimum-fidelity invariants, multi-resolution representations under a token budget, and validated writeback at every lifecycle boundary. Because the harness already assembles prompts, mediates tools, and observes lifecycle events, it is the natural enforcement point; placing the contract there makes residency and durability deterministic and auditable. Across synthetic workloads, 12 real-session traces, and adversarial stress tests, \textsc{ClawVM} eliminates all policy-controllable faults whenever the minimum-fidelity set fits within the token budget, confirmed by an offline oracle, and adds median <50 microseconds of policy-engine overhead per turn.

Chinese Translation

有状态的工具使用 LLM 代理将上下文窗口视为工作内存，但当前的代理托管在驻留和持久性方面仅采取最佳努力，导致频繁的故障：在压缩后丢失状态、重置时绕过刷新和破坏性写回。我们提出了 extsc{ClawVM}，这是一种虚拟内存层，以类型化页面管理状态，具有最低保真度不变性、在令牌预算下的多分辨率表示，以及在每个生命周期边界的验证写回。由于托管已经组装了提示、调解工具并观察生命周期事件，因此它是自然的强制执行点；将合同放置在此处使驻留和持久性变得确定性和可审计。在合成工作负载、12 个真实会话跟踪和对抗性压力测试中， extsc{ClawVM} 消除了所有可控的策略故障，只要最低保真度集适合令牌预算，这一点得到了离线神谕的确认，并且每次轮次增加了中位数 <50 微秒的策略引擎开销。

View on arXiv Download PDF AI Translation

cs.AI / 71 / 2604.10367

Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

超越独白：具有对话音频上下文感知核的互动说听虚拟形象生成

Weng, Yuzhe, Wang, Haotian, Yu, Xinyi, Wu, Xiaoyan, Xu, Haoran, He, Shan, Du, Jun

Abstract

Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advancements in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios. However, relying on strict frame-to-frame alignment renders the model's response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the unique temporal Scale Discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition into the model as a progressive temporal inductive bias. Building upon this, we construct a full-duplex interactive virtual agent capable of simultaneously processing dual-stream audio inputs for both talking and listening. Furthermore, we introduce a rigorously cleaned Talking-Listening dataset VoxHear featuring perfectly decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics, setting a new state-of-the-art for generating highly natural and responsive full-duplex interactive digital humans. The project page is available at https://warmcongee.github.io/beyond-monologue/ .

Chinese Translation

基于音频的人类视频生成在独白场景中取得了显著成功，这主要得益于强大的视频生成基础模型的进步。超越独白，真实的人类交流本质上是一个全双工的互动过程，这要求虚拟代理不仅能够表达自己的言语，还能自然地对接收到的对话音频做出反应。现有的大多数方法仅仅将传统的音频驱动范式扩展到听的场景。然而，依赖严格的逐帧对齐使得模型对长距离对话动态的响应变得僵化，而直接引入全局注意力则会灾难性地降低唇同步效果。认识到说话和倾听行为之间独特的时间尺度差异，我们引入了一种多头高斯核，将这种物理直觉显式注入模型，作为一种渐进的时间归纳偏置。在此基础上，我们构建了一个全双工互动虚拟代理，能够同时处理双流音频输入以进行说话和倾听。此外，我们引入了一个经过严格清理的说听数据集VoxHear，该数据集具有完美解耦的语音和背景音轨。大量实验表明，我们的方法成功地将强时间对齐与深层上下文语义融合，设定了生成高度自然和响应迅速的全双工互动数字人类的新状态。项目页面可访问 https://warmcongee.github.io/beyond-monologue/ 。

View on arXiv Download PDF AI Translation

cs.AI / 72 / 2604.10386

TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection

TrajOnco：一个用于纵向电子健康记录的多智能体框架，实现多癌症早期检测的时间推理

Zeng, Sihang, Kim, Young Won, Lau, Wilson, Alipour, Ehsan, Etzioni, Ruth, Yetisgen, Meliha, Oka, Anand

Abstract

Accurate estimation of cancer risk from longitudinal electronic health records (EHRs) could support earlier detection and improved care, but modeling such complex patient trajectories remains challenging. We present TrajOnco, a training-free, multi-agent large language model (LLM) framework designed for scalable multi-cancer early detection. Using a chain-of-agents architecture with long-term memory, TrajOnco performs temporal reasoning over sequential clinical events to generate patient-level summaries, evidence-linked rationales, and predicted risk scores. We evaluated TrajOnco on de-identified Truveta EHR data across 15 cancer types using matched case-control cohorts, predicting risk of cancer diagnosis at 1 year. In zero-shot evaluation, TrajOnco achieved AUROCs of 0.64-0.80, performing comparably to supervised machine learning in a lung cancer benchmark while demonstrating better temporal reasoning than single-agent LLMs. The multi-agent design also enabled effective temporal reasoning with smaller-capacity models such as GPT-4.1-mini. The fidelity of TrajOnco's output was validated through human evaluation. Furthermore, TrajOnco's interpretable reasoning outputs can be aggregated to reveal population-level risk patterns that align with established clinical knowledge. These findings highlight the potential of multi-agent LLMs to execute interpretable temporal reasoning over longitudinal EHRs, advancing both scalable multi-cancer early detection and clinical insight generation.

Chinese Translation

从纵向电子健康记录（EHR）中准确估计癌症风险可以支持早期检测和改善护理，但建模如此复杂的患者轨迹仍然具有挑战性。我们提出了TrajOnco，一个无训练的多智能体大型语言模型（LLM）框架，旨在实现可扩展的多癌症早期检测。TrajOnco采用链式智能体架构和长期记忆，能够对顺序临床事件进行时间推理，生成患者级摘要、证据关联的推理和预测风险评分。我们在去标识化的Truveta EHR数据上对TrajOnco进行了评估，涵盖15种癌症类型，使用匹配的病例对照队列，预测1年内癌症诊断的风险。在零样本评估中，TrajOnco的AUROC达到了0.64-0.80，在肺癌基准测试中表现与监督机器学习相当，同时在时间推理方面优于单智能体LLM。多智能体设计还使得使用较小容量模型（如GPT-4.1-mini）进行有效的时间推理成为可能。通过人工评估验证了TrajOnco输出的真实性。此外，TrajOnco的可解释推理输出可以聚合，以揭示与既定临床知识一致的人群风险模式。这些发现突显了多智能体LLM在纵向EHR上执行可解释时间推理的潜力，推动了可扩展的多癌症早期检测和临床洞察生成。

View on arXiv Download PDF AI Translation

cs.AI / 73 / 2604.10410

CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation

CWCD：结构化医学报告生成的类别对比解码

Srivastava, Shantam, Bhosale, Mahesh, Doermann, David, Gao, Mingchen

Abstract

Interpreting chest X-rays is inherently challenging due to the overlap between anatomical structures and the subtle presentation of many clinically significant pathologies, making accurate diagnosis time-consuming even for experienced radiologists. Recent radiology-focused foundation models, such as LLaVA-Rad and Maira-2, have positioned multi-modal large language models (MLLMs) at the forefront of automated radiology report generation (RRG). However, despite these advances, current foundation models generate reports in a single forward pass. This decoding strategy diminishes attention to visual tokens and increases reliance on language priors as generation proceeds, which in turn introduces spurious pathology co-occurrences in the generated reports. To mitigate these limitations, we propose Category-Wise Contrastive Decoding (CWCD), a novel and modular framework designed to enhance structured radiology report generation (SRRG). Our approach introduces category-specific parameterization and generates category-wise reports by contrasting normal X-rays with masked X-rays using category-specific visual prompts. Experimental results demonstrate that CWCD consistently outperforms baseline methods across both clinical efficacy and natural language generation metrics. An ablation study further elucidates the contribution of each architectural component to overall performance.

Chinese Translation

解读胸部X光片本质上具有挑战性，因为解剖结构之间存在重叠，许多临床显著病理的表现也相对微妙，这使得即使是经验丰富的放射科医师也需要耗费大量时间进行准确诊断。近期专注于放射学的基础模型，如LLaVA-Rad和Maira-2，已将多模态大型语言模型（MLLMs）置于自动化放射学报告生成（RRG）的前沿。然而，尽管取得了这些进展，目前的基础模型在生成报告时仅采用单次前向传递的解码策略。这种解码策略减少了对视觉标记的关注，并随着生成过程的推进，增加了对语言先验的依赖，从而在生成的报告中引入了虚假的病理共现。为了缓解这些局限性，我们提出了类别对比解码（CWCD），这是一种新颖且模块化的框架，旨在增强结构化放射学报告生成（SRRG）。我们的方法引入了类别特定的参数化，通过使用类别特定的视觉提示，将正常X光片与遮蔽的X光片进行对比，从而生成类别特定的报告。实验结果表明，CWCD在临床有效性和自然语言生成指标上均持续优于基线方法。消融研究进一步阐明了每个架构组件对整体性能的贡献。

View on arXiv Download PDF AI Translation

cs.AI / 74 / 2604.10429

Safety Guarantees in Zero-Shot Reinforcement Learning for Cascade Dynamical Systems

级联动力系统中零样本强化学习的安全性保障

Rabiei, Shima, Mishra, Sandipan, Paternain, Santiago

Abstract

This paper considers the problem of zero-shot safety guarantees for cascade dynamical systems. These are systems where a subset of the states (the inner states) affects the dynamics of the remaining states (the outer states) but not vice-versa. We define safety as remaining on a set deemed safe for all times with high probability. We propose to train a safe RL policy on a reduced-order model, which ignores the dynamics of the inner states, but it treats it as an action that influences the outer state. Thus, reducing the complexity of the training. When deployed in the full system the trained policy is combined with a low-level controller whose task is to track the reference provided by the RL policy. Our main theoretical contribution is a bound on the safe probability in the full-order system. In particular, we establish the interplay between the probability of remaining safe after the zero-shot deployment and the quality of the tracking of the inner states. We validate our theoretical findings on a quadrotor navigation task, demonstrating that the preservation of the safety guarantees is tied to the bandwidth and tracking capabilities of the low-level controller.

Chinese Translation

本文研究了级联动力系统的零样本安全性保障问题。此类系统中，一部分状态（内层状态）影响其余状态（外层状态）的动力学，而反之则不成立。我们将安全性定义为以高概率在所有时间内保持在被认为安全的集合中。我们提出在一个降阶模型上训练安全强化学习（RL）策略，该模型忽略内层状态的动力学，但将其视为影响外层状态的动作，从而降低训练复杂度。在完整系统中部署时，训练得到的策略与一个低层控制器结合，后者的任务是跟踪由RL策略提供的参考信号。我们的主要理论贡献是给出了完整阶系统中安全概率的界限。特别地，我们建立了零样本部署后保持安全的概率与内层状态跟踪质量之间的关系。我们在四旋翼导航任务中验证了理论结果，展示了安全性保障的保持与低层控制器的带宽及跟踪能力密切相关。

View on arXiv Download PDF AI Translation

cs.AI / 75 / 2604.10441

VeriSim: A Configurable Framework for Evaluating Medical AI Under Realistic Patient Noise

VeriSim：一个可配置的医疗人工智能真实患者噪声评估框架

Mansouri, Sina, Marvania, Mohit, Shihorkar, Vibhavari Ashok, Tran, Han Ngoc, Shafiei, Kazhal, Fazli, Mehrdad, Li, Yikuan, Zhu, Ziwei

Abstract

Medical large language models (LLMs) achieve impressive performance on standardized benchmarks, yet these evaluations fail to capture the complexity of real clinical encounters where patients exhibit memory gaps, limited health literacy, anxiety, and other communication barriers. We introduce VeriSim, a truth-preserving patient simulation framework that injects controllable, clinically evidence-grounded noise into patient responses while maintaining strict adherence to medical ground truth through a hybrid UMLS-LLM verification mechanism. Our framework operationalizes six noise dimensions derived from peer-reviewed medical communication literature, capturing authentic clinical phenomena such as patient recall limitations, health literacy barriers, and stigma-driven non-disclosure. Experiments across seven open-weight LLMs reveal that all models degrade significantly under realistic patient noise, with diagnostic accuracy dropping 15-25% and conversation length increasing 34-55%. Notably, smaller models (7B) show 40% greater degradation than larger models (70B+), while medical fine-tuning on standard corpora provides limited robustness benefits against patient communication noise. Evaluation by board-certified clinicians demonstrates high-quality simulation with strong inter-annotator agreement (kappa > 0.80), while LLM-as-a-Judge serves as a validated auxiliary evaluator achieving comparable reliability for scalable assessment. Our results highlight a critical Sim-to-Real gap in current medical AI. We release VeriSim as an open-source noise-injection framework, establishing a rigorous testbed for evaluating clinical robustness.

Chinese Translation

医疗大语言模型（LLMs）在标准化基准测试中表现出色，然而这些评估未能反映真实临床场景的复杂性，在实际中患者常表现出记忆缺失、有限的健康素养、焦虑及其他沟通障碍。我们提出了VeriSim，一种保持医学真实信息的患者模拟框架，该框架通过混合UMLS-LLM验证机制，在患者回答中注入可控且基于临床证据的噪声，同时严格遵循医学事实。我们的框架实现了源自同行评审医学沟通文献的六个噪声维度，捕捉真实临床现象，如患者回忆限制、健康素养障碍及因污名导致的信息隐瞒。针对七个开源权重的LLMs的实验表明，所有模型在真实患者噪声下性能显著下降，诊断准确率降低15-25%，对话长度增加34-55%。值得注意的是，较小模型（7B）较大模型（70B+）表现出40%的更大性能退化，而在标准语料上进行的医学微调对患者沟通噪声的鲁棒性提升有限。经由持证临床医生的评估显示模拟质量高且评审者间一致性强（kappa > 0.80），同时LLM作为评判者（LLM-as-a-Judge）作为经过验证的辅助评估工具，达到了可比的可靠性，支持大规模评估。我们的结果揭示了当前医疗AI在模拟到现实（Sim-to-Real）方面的关键差距。我们将VeriSim作为开源噪声注入框架发布，建立了一个严谨的临床鲁棒性评估测试平台。

View on arXiv Download PDF AI Translation

cs.AI / 76 / 2604.10475

PEMANT: Persona-Enriched Multi-Agent Negotiation for Travel

PEMANT：基于个性化增强的多智能体旅行协商

Sun, Yuran, Sameen, Mustafa, Zhang, Yaotian, Wu, Chia-yu, Zhao, Xilei

Abstract

Modeling household-level trip generation is fundamental to accurate demand forecasting, traffic flow estimation, and urban system planning. Existing studies were mostly based on classical machine learning models with limited predictive capability, while recent LLM-based approaches have yet to incorporate behavioral theory or intra-household interaction dynamics, both of which are critical for modeling realistic collective travel decisions. To address these limitations, we propose a novel LLM-based framework, named Persona-Enriched Multi-Agent Negotiation for Travel (PEMANT), which first integrates behavioral theory for individualized persona modeling and then conducts household-level trip planning negotiations via a structured multi-agent conversation. Specifically, PEMANT transforms static sociodemographic attributes into coherent narrative profiles that explicitly encode household-level attitudes, subjective norms, and perceived behavioral controls, following our proposed Household-Aware Chain-of-Planned-Behavior (HA-CoPB) framework. Building on these theory-grounded personas, PEMANT captures real-world household decision negotiation via a structured two-phase multi-agent conversation framework with a novel persona-alignment control mechanism. Evaluated on both national and regional household travel survey datasets, PEMANT consistently outperforms state-of-the-art benchmarks across datasets.

Chinese Translation

家庭层面的出行生成建模是准确需求预测、交通流量估计和城市系统规划的基础。现有研究大多基于经典机器学习模型，预测能力有限，而最近的基于大语言模型（LLM）的方法尚未纳入行为理论或家庭内部互动动态，这两者对于建模现实的集体出行决策至关重要。为了解决这些局限性，我们提出了一种新颖的基于LLM的框架，命名为个性化增强的多智能体旅行协商（PEMANT），该框架首先整合了行为理论以进行个性化角色建模，然后通过结构化的多智能体对话进行家庭层面的出行规划协商。具体而言，PEMANT将静态社会人口属性转化为连贯的叙事档案，明确编码家庭层面的态度、主观规范和感知行为控制，遵循我们提出的家庭感知计划行为链（HA-CoPB）框架。在这些基于理论的角色基础上，PEMANT通过一种结构化的两阶段多智能体对话框架，结合新颖的角色对齐控制机制，捕捉现实世界中的家庭决策协商。在国家和地区家庭出行调查数据集上的评估中，PEMANT在各数据集上始终优于最先进的基准。

View on arXiv Download PDF AI Translation

cs.AI / 77 / 2604.10480

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

追溯根源：用于揭示后训练大型语言模型数据血缘关系的多智能体框架

Li, Yu, Shang, Xiaoran, Pei, Qizhi, Zhu, Yun, Gao, Xin, Lin, Honglin, Zhong, Zhanping, Pan, Zhuoshi, Liu, Zheng, Wang, Xiaoyang, He, Conghui, Lin, Dahua, Zhao, Feng, Wu, Lijun

Abstract

Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of \textbf{data lineage} to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including \textit{structural redundancy} induced by implicit dataset intersections and the \textit{propagation of benchmark contamination} along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a \textit{lineage-aware diversity-oriented dataset}. By anchoring instruction sampling at upstream root sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post-training corpus. We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post-training data curation toward a more systematic and controllable paradigm.

Chinese Translation

后训练数据在塑造大型语言模型（LLMs）能力方面发挥着关键作用，但数据集通常被视为孤立的工件，忽视了其演变背后的系统性联系。为了理清这些复杂关系，我们将 extbf{数据血缘关系}的概念引入LLM生态系统，并提出一个自动化的多智能体框架，以重建数据集发展的演化图。通过大规模的血缘分析，我们描述了特定领域的结构模式，例如数学导向数据集中的纵向细化和通用领域语料库中的横向聚合。此外，我们还揭示了普遍存在的系统性问题，包括由隐式数据集交集引起的 extit{结构冗余}以及沿血缘路径传播的 extit{基准污染}。为了展示血缘分析在数据构建中的实际价值，我们利用重建的血缘图创建了一个 extit{关注血缘的多样性导向数据集}。通过将指令采样锚定在上游根源，这种方法减轻了下游同质化和隐性冗余，从而产生了更具多样性的后训练语料库。我们进一步强调以血缘为中心的分析作为大规模数据生态系统中样本级数据集比较的高效且稳健的拓扑替代方案。通过将数据构建基于明确的血缘结构，我们的工作推动了后训练数据策划朝着更系统和可控的范式发展。

View on arXiv Download PDF AI Translation

cs.AI / 78 / 2604.10502

CHAIRO: Contextual Hierarchical Analogical Induction and Reasoning Optimization for LLMs

CHAIRO：面向大型语言模型的上下文层次类比归纳与推理优化

Lu, Haotian, Mou, Yuchen, Wu, Bingzhe

Abstract

Content moderation in online platforms faces persistent challenges due to the evolving complexity of user-generated content and the limitations of traditional rule-based and machine learning approaches. While recent advances in large language models (LLMs) have enabled more sophisticated moderation via direct prompting or fine-tuning, these approaches often exhibit limited generalization, interpretability, and adaptability to unseen or ambiguous cases. In this work, we propose a novel moderation framework that leverages analogical examples to enhance rule induction and decision reliability. Our approach integrates end-to-end optimization of analogical retrieval, rule generation, and moderation classification, enabling the dynamic adaptation of moderation rules to diverse content scenarios. Through comprehensive experiments, we demonstrate that our method significantly outperforms both rule-injected fine-tuning baselines and multi-stage static RAG pipelines in terms of moderation accuracy and rule quality. Further evaluations, including human assessments and external model generalization tests, confirm that our framework produces rules with better clarity, interpretability, and applicability. These findings show that analogical example-driven methods can advance robust, explainable, and generalizable content moderation in real-world applications.

Chinese Translation

在线平台的内容审核面临持续挑战，原因在于用户生成内容的复杂性不断演变，以及传统基于规则和机器学习方法的局限性。尽管近期大型语言模型（LLMs）的进展使得通过直接提示或微调实现更复杂的审核成为可能，但这些方法往往表现出有限的泛化能力、可解释性以及对未见或模糊案例的适应性。在本研究中，我们提出了一种新颖的审核框架，利用类比示例来增强规则归纳和决策可靠性。我们的方法整合了类比检索、规则生成和审核分类的端到端优化，使审核规则能够动态适应多样化的内容场景。通过全面的实验，我们证明了我们的方法在审核准确性和规则质量方面显著优于基于规则注入的微调基线和多阶段静态RAG管道。进一步的评估，包括人类评估和外部模型泛化测试，证实我们的框架生成的规则具有更好的清晰度、可解释性和适用性。这些发现表明，基于类比示例的方法可以推动在现实应用中实现稳健、可解释和可泛化的内容审核。

View on arXiv Download PDF AI Translation

cs.AI / 79 / 2604.10504

CARO: Chain-of-Analogy Reasoning Optimization for Robust Content Moderation

CARO：用于稳健内容审核的类比推理链优化方法

Wu, Bingzhe, Lu, Haotian, Mou, Yuchen

Abstract

Current large language models (LLMs), even those explicitly trained for reasoning, often struggle with ambiguous content moderation cases due to misleading "decision shortcuts" embedded in context. Inspired by cognitive psychology insights into expert moderation, we introduce \caro (Chain-of-Analogy Reasoning Optimization), a novel two-stage training framework to induce robust analogical reasoning in LLMs. First, \caro bootstraps analogical reasoning chains via retrieval-augmented generation (RAG) on moderation data and performs supervised fine-tuning (SFT). Second, we propose a customized direct preference optimization (DPO) approach to reinforce analogical reasoning behaviors explicitly. Unlike static retrieval methods, \caro dynamically generates tailored analogical references during inference, effectively mitigating harmful decision shortcuts. Extensive experiments demonstrate that \caro substantially outperforms state-of-the-art reasoning models (DeepSeek R1, QwQ), specialized moderation models (LLaMA Guard), and advanced fine-tuning and retrieval-augmented methods, achieving an average F1 score improvement of 24.9\% on challenging ambiguous moderation benchmarks.

Chinese Translation

当前的大型语言模型（LLMs），即使是那些专门训练用于推理的模型，往往在处理含糊不清的内容审核案例时表现不佳，原因在于上下文中存在误导性的“决策捷径”。受认知心理学中专家审核行为的启发，我们提出了CARO（Chain-of-Analogy Reasoning Optimization，类比推理链优化），这是一种新颖的两阶段训练框架，旨在引导LLMs形成稳健的类比推理能力。首先，CARO通过基于检索增强生成（RAG）的方式在审核数据上引导类比推理链的生成，并进行监督微调（SFT）。其次，我们提出了一种定制化的直接偏好优化（DPO）方法，显式强化类比推理行为。与静态检索方法不同，CARO在推理过程中动态生成定制的类比参考，有效缓解了有害的决策捷径。大量实验表明，CARO在挑战性模糊审核基准上，较最先进的推理模型（DeepSeek R1、QwQ）、专门的审核模型（LLaMA Guard）以及先进的微调和检索增强方法，平均F1分数提升了24.9%。

View on arXiv Download PDF AI Translation

cs.AI / 80 / 2604.10505

Cooperation in Human and Machine Agents: Promise Theory Considerations

人类与机器代理的合作：承诺理论的考虑

Burgess, M.

Abstract

Agent based systems are more common than we may think. A Promise Theory perspective on cooperation, in systems of human-machine agents, offers a unified perspective on organization and functional design with semi-automated efforts, in terms of the abstract properties of autonomous agents, This applies to human efforts, hardware systems, software, and artificial intelligence, with and without management. One may ask how does a reasoning system of components keep to an intended purpose? As the agent paradigm is now being revived, in connection with artificial intelligence agents, I revisit established principles of agent cooperation, as applied to humans, machines, and their mutual interactions. Promise Theory represents the fundamentals of signalling, comprehension, trust, risk, and feedback between agents, and offers some lessons about success and failure.

Chinese Translation

基于代理的系统比我们想象的更为普遍。从承诺理论的角度看人类与机器代理系统中的合作，为组织和功能设计提供了一个统一的视角，涉及到自主代理的抽象属性，以及半自动化的努力。这适用于人类的努力、硬件系统、软件和人工智能，无论是否有管理。人们可能会问，组件的推理系统如何保持其预期目的？随着代理范式在与人工智能代理相关的背景下重新兴起，我重新审视了适用于人类、机器及其相互作用的代理合作的基本原则。承诺理论代表了代理之间信号传递、理解、信任、风险和反馈的基本要素，并提供了一些关于成功与失败的经验教训。

View on arXiv Download PDF AI Translation

cs.AI / 81 / 2604.10506

A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning

一种用于视觉-语言模型的渐进式训练策略以抵消具身推理中的时空幻觉

Yang, Xiaoda, Yang, Shuai, Wang, Can, Xue, Jingyang, Tang, Menglan, Yu, Checheng, Zhou, Xunzhe, Zhou, Sashuai, Jin, Tao, Yang, Lixin, Yue, Xiangyu, Zhao, Zhou

Abstract

Vision-Language Models (VLMs) have made significant strides in static image understanding but continue to face critical hurdles in spatiotemporal reasoning. A major bottleneck is "multi-image reasoning hallucination", where a massive performance drop between forward and reverse temporal queries reveals a dependence on superficial shortcuts instead of genuine causal understanding. To mitigate this, we first develop a new Chain-of-Thought (CoT) dataset that decomposes intricate reasoning into detailed spatiotemporal steps and definitive judgments. Building on this, we present a progressive training framework: it initiates with supervised pre-training on our CoT dataset to instill logical structures, followed by fine-tuning with scalable weakly-labeled data for broader generalization. Our experiments demonstrate that this approach not only improves backbone accuracy but also slashes the forward-backward performance gap from over 70\% to only 6.53\%. This confirms the method's ability to develop authentic dynamic reasoning and reduce the inherent temporal biases of current VLMs.

Chinese Translation

视觉-语言模型（Vision-Language Models，VLMs）在静态图像理解方面取得了显著进展，但在时空推理方面仍面临关键挑战。一个主要瓶颈是“多图像推理幻觉”，即在正向与反向时间查询之间存在巨大性能差距，表明模型依赖于表面捷径而非真正的因果理解。为缓解这一问题，我们首先构建了一个新的链式思维（Chain-of-Thought，CoT）数据集，将复杂推理分解为详细的时空步骤和明确的判断。在此基础上，我们提出了一种渐进式训练框架：首先在我们的CoT数据集上进行有监督的预训练以灌输逻辑结构，随后利用可扩展的弱标注数据进行微调以实现更广泛的泛化。实验结果表明，该方法不仅提升了主干模型的准确率，还将正反向性能差距从70%以上缩小至仅6.53%。这验证了该方法在培养真实动态推理能力及减少当前VLMs固有时间偏差方面的有效性。

View on arXiv Download PDF AI Translation

cs.AI / 82 / 2604.10507

Beyond Compliance: A Resistance-Informed Motivation Reasoning Framework for Challenging Psychological Client Simulation

超越合规：一个基于抵抗理论的动机推理框架用于挑战心理客户模拟

Liu, Danni, Liu, Bo, Hu, Yuxin, Zhao, Hantao, Liu, Yan, Ding, Ding, Jin, Jiahui, Cao, Jiuxin

Abstract

Psychological client simulators have emerged as a scalable solution for training and evaluating counselor trainees and psychological LLMs. Yet existing simulators exhibit unrealistic over-compliance, leaving counselors underprepared for the challenging behaviors common in real-world practice. To bridge this gap, we present ResistClient, which systematically models challenging client behaviors grounded in Client Resistance Theory by integrating external behaviors with underlying motivational mechanisms. To this end, we propose Resistance-Informed Motivation Reasoning (RIMR), a two-stage training framework. First, RIMR mitigates compliance bias via supervised fine-tuning on RPC, a large-scale resistance-oriented psychological conversation dataset covering diverse client profiles. Second, beyond surface-level response imitation, RIMR models psychologically coherent motivation reasoning before response generation, jointly optimizing motivation authenticity and response consistency via process-supervised reinforcement learning. Extensive automatic and expert evaluations show that ResistClient substantially outperforms existing simulators in challenge fidelity, behavioral plausibility, and reasoning coherence. Moreover, ResistClient facilities evaluation of psychological LLMs under challenging conditions, offering new optimization directions for mental health dialogue systems.

Chinese Translation

心理客户模拟器已成为培训和评估咨询师学员及心理大型语言模型（LLMs）的可扩展解决方案。然而，现有的模拟器表现出不切实际的过度合规，使得咨询师在面对现实实践中常见的挑战性行为时准备不足。为了解决这一问题，我们提出了ResistClient，它系统地基于客户抵抗理论建模挑战性客户行为，通过整合外部行为与潜在动机机制。为此，我们提出了基于抵抗的动机推理（Resistance-Informed Motivation Reasoning, RIMR）这一两阶段的培训框架。首先，RIMR通过在RPC（一个涵盖多样化客户档案的大规模抵抗导向心理对话数据集）上进行监督微调，减轻合规偏见。其次，RIMR在响应生成之前建模心理一致的动机推理，超越表面层次的响应模仿，通过过程监督强化学习共同优化动机真实性和响应一致性。大量的自动和专家评估表明，ResistClient在挑战真实度、行为合理性和推理一致性方面显著优于现有模拟器。此外，ResistClient还促进了在挑战条件下对心理LLMs的评估，为心理健康对话系统提供了新的优化方向。

View on arXiv Download PDF AI Translation

cs.AI / 83 / 2604.10511

Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

快速思考，错误判断：直觉性调节大型语言模型（LLM）在政策评估中的反事实推理能力

He, Yanjie

Abstract

Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness -- whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 2,400 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is nearly eliminated on counter-intuitive ones (interaction OR = 0.053, $p < 0.001$); (2) intuitiveness as the dominant factor, explaining more variance than model choice or prompting strategy (ICC = 0.537); and (3) a knowledge-reasoning dissociation, where citation-based familiarity is unrelated to accuracy ($p = 0.53$), suggesting models possess relevant knowledge but fail to reason with it when findings contradict intuition. We frame these results through the lens of dual-process theory (System 1 vs. System 2) and argue that current LLMs' "slow thinking" may be little more than "slow talking" -- they produce the form of deliberative reasoning without the substance.

Chinese Translation

大型语言模型（LLMs）在因果和反事实推理中的应用日益广泛，但其在现实政策评估中的可靠性尚未得到充分探讨。我们构建了一个包含40个来自经济学和社会科学的实证政策评估案例的基准库，每个案例均基于同行评审的证据，并按照直觉性分类——即实证结果是否与常见先验预期一致（显而易见）、不明确（模糊）或相悖（反直觉）。我们对四个前沿LLM模型采用五种提示策略进行了共计2400次实验试验，并利用混合效应逻辑回归分析结果。研究发现包括三个关键点：（1）链式思维（Chain-of-Thought, CoT）悖论：链式思维提示显著提升了显而易见案例的表现，但在反直觉案例中这一优势几乎消失（交互比值比OR=0.053，p<0.001）；（2）直觉性为主导因素，其解释的方差超过模型选择或提示策略（组内相关系数ICC=0.537）；（3）知识与推理的分离：基于引用的熟悉度与准确率无关（p=0.53），表明模型虽具备相关知识，但当结果违背直觉时，无法有效推理。我们从双过程理论（系统1与系统2）的视角解读这些结果，认为当前LLM的“慢思考”可能仅仅是“慢表达”——它们产生了审慎推理的形式，却缺乏其实质内容。

View on arXiv Download PDF AI Translation

cs.AI / 84 / 2604.10513

Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis

Agent Mentor：通过语义轨迹分析构建智能体知识框架

Ben-Gigi, Roi, David, Yuval, Fournier, Fabiana, Limonad, Lior, Moshkovich, Dany, Mulian, Hadar, Shlomov, Segev

Abstract

AI agent development relies heavily on natural language prompting to define agents' tasks, knowledge, and goals. These prompts are interpreted by Large Language Models (LLMs), which govern agent behavior. Consequently, agentic performance is susceptible to variability arising from imprecise or ambiguous prompt formulations. Identifying and correcting such issues requires examining not only the agent's code, but also the internal system prompts generated throughout its execution lifecycle, as reflected in execution logs. In this work, we introduce an analytics pipeline implemented as part of the Agent Mentor open-source library that monitors and incrementally adapts the system prompts defining another agent's behavior. The pipeline improves performance by systematically injecting corrective instructions into the agent's knowledge. We describe its underlying mechanism, with particular emphasis on identifying semantic features associated with undesired behaviors and using them to derive corrective statements. We evaluate the proposed pipeline across three exemplar agent configurations and benchmark tasks using repeated execution runs to assess effectiveness. These experiments provide an initial exploration of automating such a mentoring pipeline within future agentic governance frameworks. Overall, the approach demonstrates consistent and measurable accuracy improvements across diverse configurations, particularly in settings dominated by specification ambiguity. For reproducibility, we released our code as open source under the Agent Mentor library.

Chinese Translation

人工智能智能体的开发在很大程度上依赖于自然语言提示来定义智能体的任务、知识和目标。这些提示由大型语言模型（LLMs）解释，从而控制智能体的行为。因此，智能体的性能容易受到提示表述不精确或含糊所带来的变异影响。识别和纠正此类问题不仅需要检查智能体的代码，还需审视其执行生命周期中生成的内部系统提示，这些提示反映在执行日志中。在本工作中，我们引入了一个分析流水线，该流水线作为Agent Mentor开源库的一部分实现，用于监控并逐步调整定义另一智能体行为的系统提示。该流水线通过系统地向智能体知识中注入纠正指令来提升性能。我们描述了其底层机制，特别强调识别与不良行为相关的语义特征，并利用这些特征推导纠正性陈述。我们在三个典型智能体配置和基准任务中，通过重复执行运行评估了该流水线的有效性。这些实验为未来智能体治理框架中自动化辅导流水线的探索提供了初步尝试。总体而言，该方法在多样配置中表现出一致且可量化的准确性提升，尤其在规范含糊主导的场景中效果显著。为确保可复现性，我们已将代码作为Agent Mentor库开源发布。

View on arXiv Download PDF AI Translation

cs.AI / 85 / 2604.10517

From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

从感知到规划：通过课程学习演变的自我中心任务导向时空推理

Yang, Xiaoda, Liu, Yuxiang, Gao, Shenzhou, Wang, Can, Xue, Jingyang, Yang, Lixin, Mu, Yao, Jin, Tao, Yan, Shuicheng, Zhang, Zhimeng, Zhao, Zhou

Abstract

Modern vision-language models achieve strong performance in static perception, but remain limited in the complex spatiotemporal reasoning required for embodied, egocentric tasks. A major source of failure is their reliance on temporal priors learned from passive video data, which often leads to spatiotemporal hallucinations and poor generalization in dynamic environments. To address this, we present EgoTSR, a curriculum-based framework for learning task-oriented spatiotemporal reasoning. EgoTSR is built on the premise that embodied reasoning should evolve from explicit spatial understanding to internalized task-state assessment and finally to long-horizon planning. To support this paradigm, we construct EgoTSR-Data, a large-scale dataset comprising 46 million samples organized into three stages: Chain-of-Thought (CoT) supervision, weakly supervised tagging, and long-horizon sequences. Extensive experiments demonstrate that EgoTSR effectively eliminates chronological biases, achieving 92.4% accuracy on long-horizon logical reasoning tasks while maintaining high fine-grained perceptual precision, significantly outperforming existing open-source and closed-source state-of-the-art models.

Chinese Translation

现代视觉-语言模型在静态感知方面表现出色，但在实现具身的自我中心任务所需的复杂时空推理时仍然存在局限性。失败的主要原因在于它们依赖于从被动视频数据中学习的时间先验，这往往导致时空幻觉和在动态环境中的较差泛化能力。为了解决这一问题，我们提出了EgoTSR，一个基于课程的任务导向时空推理学习框架。EgoTSR建立在这样一个前提之上：具身推理应从明确的空间理解演变为内化的任务状态评估，最终达到长远规划。为了支持这一范式，我们构建了EgoTSR-Data，一个包含4600万样本的大规模数据集，分为三个阶段：思维链（Chain-of-Thought, CoT）监督、弱监督标记和长远序列。大量实验表明，EgoTSR有效消除了时间偏差，在长远逻辑推理任务中实现了92.4%的准确率，同时保持了高精度的细粒度感知，显著优于现有的开源和闭源最先进模型。

View on arXiv Download PDF AI Translation

cs.AI / 86 / 2604.10547

Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

Agent^2 RL-Bench：大型语言模型代理能否自主设计强化学习后训练流程？

Chen, Wanyi, Yang, Xiao, Yang, Xu, Sha, Tianming, Li, Qizheng, Wang, Zhuo, Xian, Bowen, Kong, Fang, Liu, Weiqing, Bian, Jiang

Abstract

We introduce Agent^2 RL-Bench, a benchmark for evaluating agentic RL post-training -- whether LLM agents can autonomously design, implement, and run complete RL pipelines that improve foundation models. This capability is important because RL post-training increasingly drives model alignment and specialization, yet existing benchmarks remain largely static: supervised fine-tuning alone yields strong results, leaving interactive RL engineering untested. Agent^2 RL-Bench addresses this with six tasks across three levels -- from static rule-based training to closed-loop online RL with trajectory collection -- each adding a structural requirement that prior levels do not impose. The benchmark provides isolated workspaces with a grading API, runtime instrumentation that records every submission and code revision, and automated post-hoc analysis that generates structured run reports, enabling the first automated diagnostic of agent-driven post-training behavior. Across multiple agent stacks spanning five agent systems and six driver LLMs, we find that agents achieve striking interactive gains -- on ALFWorld, an RL-only agent improves from 5.97 to 93.28 via SFT warm-up and GRPO with online rollouts -- yet make only marginal progress on others (DeepSearchQA: +2.75 within evaluation noise), and that driver choice has a large effect on interactive tasks -- within the same scaffold, switching drivers changes interactive improvement from near-zero to +78pp. More broadly, the benchmark reveals that supervised pipelines dominate agent-driven post-training under fixed budgets, with online RL succeeding as the final best route only on ALFWorld. Code is available at https://github.com/microsoft/RD-Agent/tree/main/rdagent/scenarios/rl/autorl_bench.

Chinese Translation

我们提出了Agent^2 RL-Bench，这是一个用于评估代理式强化学习后训练（agentic RL post-training）的基准测试，旨在检验大型语言模型（LLM）代理是否能够自主设计、实现并运行完整的强化学习流程，从而提升基础模型的性能。这一能力尤为重要，因为强化学习后训练在模型对齐和专业化中扮演着越来越关键的角色，而现有基准测试大多保持静态：仅通过监督微调即可获得较好效果，导致交互式强化学习工程尚未得到充分检验。Agent^2 RL-Bench涵盖六个任务，分布在三个层级——从静态规则训练到带有轨迹收集的闭环在线强化学习——每个层级均引入了前一层级未涉及的结构性要求。该基准提供了独立的工作空间和评分API，运行时监控记录每次提交和代码修订，并配备自动化的事后分析生成结构化运行报告，实现了首次对代理驱动后训练行为的自动诊断。通过涵盖五个代理系统和六个驱动LLM的多种代理堆栈实验，我们发现代理在交互式任务中取得显著提升——例如在ALFWorld中，纯强化学习代理通过监督微调预热和结合在线rollouts的GRPO算法，性能从5.97提升至93.28；但在其他任务中进展有限（如DeepSearchQA提升仅为2.75，处于评估噪声范围内），且驱动模型的选择对交互任务影响显著——在相同框架下，切换驱动模型可使交互提升从接近零跃升至78个百分点。更广泛地看，该基准揭示了在固定预算下，监督训练流程主导代理驱动的后训练过程，在线强化学习仅在ALFWorld上作为最终最佳路径取得成功。代码已开源，地址：https://github.com/microsoft/RD-Agent/tree/main/rdagent/scenarios/rl/autorl_bench。

View on arXiv Download PDF AI Translation

cs.AI / 87 / 2604.10549

Failure Ontology: A Lifelong Learning Framework for Blind Spot Detection and Resilience Design

失败本体论：用于盲点检测与韧性设计的终身学习框架

Sun, Yuan, Yi, Hong, Liu, Jinyuan

Abstract

Personalized learning systems are almost universally designed around a single objective: help people acquire knowledge and skills more efficiently. We argue this framing misses the more consequential problem. The most damaging failures in human life-financial ruin, health collapse, professional obsolescence-are rarely caused by insufficient knowledge acquisition. They arise from the systematic absence of entire conceptual territories from a person's cognitive map: domains they never thought to explore because, from within their existing worldview, those domains did not appear to exist or to matter. We call such absences Ontological Blind Spots and introduce Failure Ontology (F), a formal framework for detecting, classifying, and remediating them across a human lifetime. The framework introduces three original contributions: (1) a four-type taxonomy of blind spots distinguishing domain blindness, structural blindness, weight blindness, and temporal blindness; (2) five convergent failure patterns characterizing how blind spots interact with external disruption to produce catastrophic outcomes; and (3) the Failure Learning Efficiency Theorem, proving that failure-based learning achieves higher sample efficiency than success-based learning under bounded historical data. We illustrate the framework through historical case analysis of the 1997 Asian Financial Crisis and the 2008 subprime mortgage crisis, and through alongitudinal individual case study spanning five life stages.

Chinese Translation

个性化学习系统几乎普遍围绕一个目标设计：帮助人们更高效地获取知识和技能。我们认为这种框架忽视了更为重要的问题。人类生活中最具破坏性的失败——财务崩溃、健康衰退、职业过时——很少是由于知识获取不足引起的。它们源于个体认知地图中系统性缺失的整块概念领域：这些领域从未被考虑探索，因为在现有世界观中，这些领域似乎不存在或无关紧要。我们将此类缺失称为“本体盲点”（Ontological Blind Spots），并提出失败本体论（Failure Ontology，F），这是一个用于在人生全周期内检测、分类及补救盲点的形式化框架。该框架包含三项原创贡献：（1）一种四类盲点分类法，区分领域盲点（domain blindness）、结构盲点（structural blindness）、权重盲点（weight blindness）和时间盲点（temporal blindness）；（2）五种融合的失败模式，描述盲点如何与外部扰动相互作用导致灾难性后果；（3）失败学习效率定理（Failure Learning Efficiency Theorem），证明在有限历史数据条件下，基于失败的学习比基于成功的学习具有更高的样本效率。我们通过对1997年亚洲金融危机和2008年次贷危机的历史案例分析，以及涵盖五个生命阶段的个案纵向研究，来阐释该框架。

View on arXiv Download PDF AI Translation

cs.AI / 88 / 2604.10589

Working Paper: Towards Schema-based Learning from a Category-Theoretic Perspective

工作论文：基于范畴论视角的模式（Schema）学习研究

Riscos, Pablo de los, Corbacho, Fernando J., Arbib, Michael A.

Abstract

We introduce a hierarchical categorical framework for Schema-Based Learning (SBL) structured across four interconnected levels. At the schema level, a free multicategory $Sch_{syn}$ encodes fundamental schemas and transformations. An implementation functor $\mathcal{I}$ maps syntactic schemas to representational languages, inducing via the Grothendieck construction the total category $Sch_{impl}$. Implemented schemas are mapped by a functor $Model$ into the Kleisli category $\mathbf{KL(G)}$ of the Giry monad, yielding probabilistic models, while an instances presheaf assigns evaluated instance spaces. A semantic category $Sch_{sem}$, defined as a full subcategory of $\mathbf{KL(G)}$, provides semantic grounding through an interpretation functor from $Sch_{impl}$. At the agent level, $Sch_{impl}$ is equipped with a duoidal structure $\mathcal{O}_{Sch}$ supporting schema-based workflows. A left duoidal action on the category $Mind$ enables workflow execution over mental objects, whose components include mental spaces, predictive models, and a cognitive kernel composed of memory and cognitive modules. Each module is specified by schema-typed interfaces, duoidal workflows, a success condition, and a logical signature. Memory is formalized categorically via memory subsystems, a presheaf $Data_M$, a monoidal operation category $Ops_M$, and read/write natural transformations. Together with the $Body$ category, Mind defines the embodied SBL agent. At higher levels, SBL is represented as an object of the agent architecture category $ArchCat$, enabling comparison with heterogeneous paradigms, while the $World$ category models multi-agent and agent-environment interactions. Altogether, the framework forms a weak hierarchical $n$-categorical structure linking schema semantics, cognition, embodiment, architectural abstraction, and world-level interaction.

Chinese Translation

我们提出了一个分层的范畴框架，用于结构化的基于模式学习（Schema-Based Learning，SBL），该框架涵盖四个相互关联的层次。在模式层，利用自由多范畴 $Sch_{syn}$ 编码基本的模式及其变换。一个实现函子 $\mathcal{I}$ 将句法模式映射到表示语言，通过 Grothendieck 构造诱导出总范畴 $Sch_{impl}$。实现的模式通过函子 $Model$ 映射到 Giry 单子对应的 Kleisli 范畴 $\mathbf{KL(G)}$，从而得到概率模型，同时实例预层赋予已评估的实例空间。语义范畴 $Sch_{sem}$ 被定义为 $\mathbf{KL(G)}$ 的一个满子范畴，通过从 $Sch_{impl}$ 出发的解释函子提供语义基础。在智能体层，$Sch_{impl}$ 配备了支持基于模式工作流的二元范畴结构 $\mathcal{O}_{Sch}$。对范畴 $Mind$ 的左二元作用使得工作流能够在心理对象上执行，这些心理对象包括心理空间、预测模型及由记忆和认知模块组成的认知内核。每个模块通过模式类型接口、二元工作流、成功条件和逻辑签名进行规范。记忆通过记忆子系统、预层 $Data_M$、单体操作范畴 $Ops_M$ 以及读写自然变换在范畴意义下形式化。结合 $Body$ 范畴，$Mind$ 定义了具身的 SBL 智能体。在更高层次，SBL 被表示为智能体架构范畴 $ArchCat$ 的对象，便于与异构范式进行比较，而 $World$ 范畴则模拟多智能体及智能体-环境交互。整体而言，该框架构成了一个弱分层的 $n$-范畴结构，连接了模式语义、认知、具身性、架构抽象及世界层交互。

View on arXiv Download PDF AI Translation

cs.AI / 89 / 2604.10652

Enhancing Cross-Problem Vehicle Routing via Federated Learning

通过联邦学习增强跨问题车辆路径规划

Meng, Xiangchi, Zhou, Jianan, Gao, Jie, Lu, Yifan, Wu, Yaoxin, Yuan, Gonglin, Hou, Yaqing

Abstract

Vehicle routing problems (VRPs) constitute a core optimization challenge in modern logistics and supply chain management. The recent neural combinatorial optimization (NCO) has demonstrated superior efficiency over some traditional algorithms. While serving as a primary NCO approach for solving general VRPs, current cross-problem learning paradigms are still subject to performance degradation and generalizability decay, when transferring from simple VRP variants to those involving different and complex constraints. To strengthen the paradigms, this paper offers an innovative "Multi-problem Pre-train, then Single-problem Fine-tune" framework with Federated Learning (MPSF-FL). This framework exploits the common knowledge of a federated global model to foster efficient cross-problem knowledge sharing and transfer among local models for single-problem fine-tuning. In this way, local models effectively retain common VRP knowledge from up-to-date global model, while being efficiently adapted to downstream VRPs with heterogeneous complex constraints. Experimental results demonstrate that our framework not only enhances the performance in diverse VRPs, but also improves the generalizability in unseen problems.

Chinese Translation

车辆路径规划问题（VRPs）是现代物流和供应链管理中的核心优化挑战。近期的神经组合优化（NCO）在某些传统算法上展示了更优的效率。尽管作为解决一般 VRP 的主要 NCO 方法，目前的跨问题学习范式在从简单的 VRP 变体转移到涉及不同和复杂约束的情况下，仍然面临性能下降和可泛化性减弱的问题。为增强这些范式，本文提出了一种创新的“多问题预训练，单问题微调”框架，结合了联邦学习（MPSF-FL）。该框架利用联邦全局模型的共同知识，促进了本地模型之间高效的跨问题知识共享和转移，以便进行单问题微调。通过这种方式，本地模型有效地保留了来自最新全局模型的共同 VRP 知识，同时高效适应具有异构复杂约束的下游 VRP。实验结果表明，我们的框架不仅提升了在多样化 VRP 中的性能，还改善了在未见问题中的可泛化性。

View on arXiv Download PDF AI Translation

cs.AI / 90 / 2604.10658

Governed Reasoning for Institutional AI

制度性人工智能的治理推理

Seck, Mamadou

Abstract

Institutional decisions -- regulatory compliance, clinical triage, prior authorization appeal -- require a different AI architecture than general-purpose agents provide. Agent frameworks infer authority conversationally, reconstruct accountability from logs, and produce silent errors: incorrect determinations that execute without any human review signal. We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences. We benchmark three systems on an 11-case balanced prior authorization appeal evaluation set. Cognitive Core achieves 91% accuracy against 55% (ReAct) and 45% (Plan-and-Solve). The governance result is more significant: CC produced zero silent errors while both baselines produced 5-6. We introduce governability -- how reliably a system knows when it should not act autonomously -- as a primary evaluation axis for institutional AI alongside accuracy. The baselines are implemented as prompts, representing the realistic deployment alternative to a governed framework. A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity.

Chinese Translation

制度性决策——如监管合规、临床分诊、事前授权申诉——需要与通用代理提供的不同的人工智能架构。代理框架通过对话推断权威，从日志中重建问责制，并产生无声错误：在没有任何人工审核信号的情况下执行的不正确判断。我们提出了认知核心（Cognitive Core）：一个由九种类型的认知原语（检索、分类、调查、验证、质疑、反思、深思、治理、生成）构建的治理决策基础，采用四层治理模型，其中人工审核是执行的条件，而不是事后检查，内生于计算的可篡改SHA-256哈希链审计账本，以及支持声明和自主推理的需求驱动委托架构。我们在11个案例的平衡事前授权申诉评估集上对三种系统进行了基准测试。认知核心的准确率达到91%，而基线系统ReAct为55%，Plan-and-Solve为45%。治理结果更为显著：认知核心产生了零个无声错误，而两个基线系统则产生了5-6个。我们引入了可治理性（governability）——系统在何时不应自主行动的可靠性——作为制度性人工智能的主要评估轴线之一，与准确率并列。基线系统被实现为提示，代表了治理框架的现实部署替代方案。基于配置的领域模型意味着部署一个新的制度决策领域只需YAML配置，而不是工程能力。

View on arXiv Download PDF AI Translation

cs.AI / 91 / 2604.10664

Preference-Agile Multi-Objective Optimization for Real-time Vehicle Dispatching

面向实时车辆调度的偏好敏捷多目标优化

Jin, Jiahuan, Zhao, Wenhao, Qu, Rong, Ren, Jianfeng, Chen, Xinan, Zhang, Qingfu, Bai, Ruibin

Abstract

Multi-objective optimization (MOO) has been widely studied in literature because of its versatility in human-centered decision making in real-life applications. Recently, demand for dynamic MOO is fast-emerging due to tough market dynamics that require real-time re-adjustments of priorities for different objectives. However, most existing studies focus either on deterministic MOO problems which are not practical, or non-sequential dynamic MOO decision problems that cannot deal with some real-life complexities. To address these challenges, a preference-agile multi-objective optimization (PAMOO) is proposed in this paper to permit users to dynamically adjust and interactively assign the preferences on the fly. To achieve this, a novel uniform model within a deep reinforcement learning (DRL) framework is proposed that can take as inputs users' dynamic preference vectors explicitly. Additionally, a calibration function is fitted to ensure high quality alignment between the preference vector inputs and the output DRL decision policy. Extensive experiments on challenging real-life vehicle dispatching problems at a container terminal showed that PAMOO obtains superior performance and generalization ability when compared with two most popular MOO methods. Our method presents the first dynamic MOO method for challenging \rev{dynamic sequential MOO decision problems

Chinese Translation

多目标优化（Multi-objective Optimization, MOO）因其在人本决策中的广泛适用性而被大量研究。近年来，随着市场动态的加剧，动态多目标优化需求迅速增长，要求对不同目标的优先级进行实时调整。然而，现有研究大多聚焦于确定性多目标优化问题，这类问题缺乏实际应用性，或非序列动态多目标优化决策问题，无法应对某些现实复杂性。为解决上述挑战，本文提出了一种偏好敏捷多目标优化（Preference-Agile Multi-Objective Optimization, PAMOO）方法，允许用户动态调整并交互式地实时分配偏好。为实现该目标，本文设计了一个基于深度强化学习（Deep Reinforcement Learning, DRL）框架的新型统一模型，能够显式输入用户的动态偏好向量。此外，拟合了一个校准函数，确保偏好向量输入与输出的DRL决策策略之间高度匹配。针对集装箱码头的复杂真实车辆调度问题的大量实验表明，PAMOO在性能和泛化能力上均优于两种最流行的多目标优化方法。该方法首次提出了针对复杂动态序列多目标优化决策问题的动态多目标优化方法。

View on arXiv Download PDF AI Translation

cs.AI / 92 / 2604.10673

Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment

原则不会自行应用：一种关于人工智能对齐的解释学视角

Razeghi, Behrooz

Abstract

AI alignment is often framed as the task of ensuring that an AI system follows a set of stated principles or human preferences, but general principles rarely determine their own application in concrete cases. When principles conflict, when they are too broad to settle a situation, or when the relevant facts are unclear, an additional act of judgment is required. This paper analyzes that step through the lens of hermeneutics and argues that alignment therefore includes an interpretive component: it involves context-sensitive judgments about how principles should be read, applied, and prioritized in practice. We connect this claim to recent empirical findings showing that a substantial portion of preference-labeling data falls into cases of principle conflict or indifference, where the principle set does not uniquely determine a decision. We then draw an operational consequence: because such judgments are expressed in behavior, many alignment-relevant choices appear only in the distribution of responses a model generates at deployment time. To formalize this point, we distinguish deployment-induced and corpus-induced evaluation and show that off-policy audits can fail to capture alignment-relevant failures when the two response distributions differ. We argue that principle-specified alignment includes a context-dependent interpretive component.

Chinese Translation

人工智能对齐通常被描述为确保人工智能系统遵循一套明确原则或人类偏好的任务，但一般性原则很少能自行决定其在具体案例中的应用。当原则发生冲突、原则过于宽泛无法解决具体情境，或相关事实不明确时，就需要额外的判断行为。本文通过解释学的视角分析了这一环节，认为对齐因此包含了解释成分：它涉及关于原则应如何解读、应用及优先排序的情境敏感判断。我们将这一观点与近期实证研究结果相联系，显示大量偏好标注数据属于原则冲突或无差异的情况，在这些情况下，原则集合无法唯一决定决策。随后，我们提出一个操作性结论：由于此类判断体现在行为中，许多与对齐相关的选择仅在模型部署时生成的响应分布中显现。为形式化这一点，我们区分了部署诱导和语料诱导的评估，并展示当两种响应分布不同时，离策略审计（off-policy audits）可能无法捕捉对齐相关的失败。我们主张，基于原则的对齐包含一个依赖情境的解释成分。

View on arXiv Download PDF AI Translation

cs.AI / 93 / 2604.10678

FedRio: Personalized Federated Social Bot Detection via Cooperative Reinforced Contrastive Adversarial Distillation

FedRio：基于协作强化对比对抗蒸馏的个性化联邦社交机器人检测

Yang, Yingguang, Liu, Hao, Zhang, Xin, Liu, Yunhui, Xia, Yutong, Wu, Qi, Peng, Hao, Liang, Taoran, Chong, Bin, He, Tieke, Yu, Philip S.

Abstract

Social bot detection is critical to the stability and security of online social platforms. However, current state-of-the-art bot detection models are largely developed in isolation, overlooking the benefits of leveraging shared detection patterns across platforms to improve performance and promptly identify emerging bot variants. The heterogeneity of data distributions and model architectures further complicates the design of an effective cross-platform and cross-model detection framework. To address these challenges, we propose FedRio (Personalized Federated Social Bot Detection with Cooperative Reinforced Contrastive Adversarial Distillation framework. We first introduce an adaptive message-passing module as the graph neural network backbone for each client. To facilitate efficient knowledge sharing of global data distributions, we design a federated knowledge extraction mechanism based on generative adversarial networks. Additionally, we employ a multi-stage adversarial contrastive learning strategy to enforce feature space consistency among clients and reduce divergence between local and global models. Finally, we adopt adaptive server-side parameter aggregation and reinforcement learning-based client-side parameter control to better accommodate data heterogeneity in heterogeneous federated settings. Extensive experiments on two real-world social bot detection benchmarks demonstrate that FedRio consistently outperforms state-of-the-art federated learning baselines in detection accuracy, communication efficiency, and feature space consistency, while remaining competitive with published centralized results under substantially stronger privacy constraints.

Chinese Translation

社交机器人检测对于在线社交平台的稳定性和安全性至关重要。然而，当前最先进的机器人检测模型大多是孤立开发的，忽视了跨平台共享检测模式以提升性能和及时识别新兴机器人变种的优势。数据分布和模型架构的异质性进一步增加了设计有效的跨平台跨模型检测框架的难度。为应对这些挑战，我们提出了FedRio（个性化联邦社交机器人检测框架，结合协作强化对比对抗蒸馏）。我们首先引入了一个自适应消息传递模块，作为每个客户端的图神经网络骨干。为了促进全局数据分布的高效知识共享，我们设计了一种基于生成对抗网络的联邦知识提取机制。此外，我们采用多阶段对抗对比学习策略，以强化客户端间特征空间的一致性并减少局部模型与全局模型之间的差异。最后，我们采用自适应服务器端参数聚合和基于强化学习的客户端参数控制，更好地适应异构联邦环境中的数据异质性。在两个真实社交机器人检测基准上的大量实验表明，FedRio在检测准确率、通信效率和特征空间一致性方面持续优于最先进的联邦学习基线，同时在更强隐私保护约束下，仍能与已发布的集中式结果保持竞争力。

View on arXiv Download PDF AI Translation

cs.AI / 94 / 2604.10690

Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks

大型语言模型是否构建空间世界模型？来自网格世界迷宫任务的证据

Li, Weijiang, Zhu, Yilin, Das, Rajarshi, Dube, Parijat

Abstract

Foundation models have shown remarkable performance across diverse tasks, yet their ability to construct internal spatial world models for reasoning and planning remains unclear. We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled testing context requiring multi-step planning and spatial abstraction. Across comprehensive experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover significant discrepancies in spatial reasoning that challenge assumptions about LLM planning capabilities. Using chain-of-thought prompting, Gemini achieves 80-86% accuracy on smaller mazes (5x5 to 7x7 grids) with tokenized adjacency representations, but performance collapses to 16-34% with visual grid formats, which is a 2-5x difference, suggesting representation-dependent rather than format-invariant spatial reasoning. We further probe spatial understanding through sequential proximity questions and compositional distance comparisons. Despite achieving 96-99% semantic coverage in reasoning traces, models fail to leverage this understanding for consistent spatial computations, indicating that they treat each question independently rather than building cumulative spatial knowledge. Our findings based on the maze-solving tasks suggest that LLMs do not develop robust spatial world models, but rather exhibit representation-specific and prompting-dependent reasoning that succeeds only under narrow conditions. These results have critical implications for deploying foundation models in applications requiring spatial abstraction.

Chinese Translation

基础模型在多种任务中表现出色，但它们构建内部空间世界模型以进行推理和规划的能力仍不明确。我们通过迷宫任务系统性评估大型语言模型的空间理解，这是一种需要多步骤规划和空间抽象的受控测试环境。在对Gemini-2.5-Flash、GPT-5-mini、Claude-Haiku-4.5和DeepSeek-Chat进行的全面实验中，我们发现空间推理存在显著差异，这挑战了关于大型语言模型规划能力的假设。通过链式思维提示，Gemini在较小的迷宫（5x5到7x7网格）上实现了80-86%的准确率，使用的是标记化的邻接表示，但在视觉网格格式下，性能骤降至16-34%，这显示出2-5倍的差异，表明空间推理依赖于表示而非格式。我们进一步通过序列邻近问题和组合距离比较探讨空间理解。尽管在推理轨迹中实现了96-99%的语义覆盖，模型未能利用这种理解进行一致的空间计算，表明它们将每个问题视为独立，而不是构建累积的空间知识。基于迷宫求解任务的发现表明，大型语言模型并未发展出稳健的空间世界模型，而是表现出特定于表示和依赖于提示的推理，仅在狭窄条件下成功。这些结果对在需要空间抽象的应用中部署基础模型具有重要意义。

View on arXiv Download PDF AI Translation

cs.AI / 95 / 2604.10693

FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning

FACT-E：基于因果关系的可信链式推理评估

Sun, Yuxi, Zuo, Aoqi, Xie, Haotian, Gao, Wei, Gong, Mingming, Ma, Jing

Abstract

Chain-of-Thought (CoT) prompting has improved LLM reasoning, but models often generate explanations that appear coherent while containing unfaithful intermediate steps. Existing self-evaluation approaches are prone to inherent biases: the model may confidently endorse coherence even when the step-to-step implication is not valid, leading to unreliable faithfulness evaluation. We propose FACT-E, a causality-inspired framework for evaluating CoT quality. FACT-E uses controlled perturbations as an instrumental signal to separate genuine step-to-step dependence from bias-driven artifacts, producing more reliable faithfulness estimates (\textit{intra-chain faithfulness}). To select trustworthy trajectories, FACT-E jointly considers \textit{intra-chain faithfulness} and \textit{CoT-to-answer consistency}, ensuring that selected chains are both faithful internally and supportive of the correct final answer. Experiments on GSM8K, MATH, and CommonsenseQA show that FACT-E improves reasoning-trajectory selection and yields stronger in-context learning exemplars. FACT-E also reliably detects flawed reasoning under noisy conditions, providing a robust metric for trustworthy LLM reasoning.

Chinese Translation

链式推理（Chain-of-Thought, CoT）提示已改善大型语言模型（LLM）的推理能力，但模型常常生成看似连贯但包含不可靠中间步骤的解释。现有的自我评估方法容易受到固有偏见的影响：即使步骤之间的推导不成立，模型也可能自信地支持连贯性，从而导致不可靠的忠实度评估。我们提出了FACT-E，一个基于因果关系的评估框架，用于评估链式推理的质量。FACT-E使用受控扰动作为工具信号，以区分真实的步骤间依赖关系和偏见驱动的伪影，从而生成更可靠的忠实度估计（ extit{intra-chain faithfulness}）。为了选择可信的轨迹，FACT-E同时考虑 extit{intra-chain faithfulness}和 extit{CoT-to-answer consistency}，确保所选链条在内部上是忠实的，并支持正确的最终答案。在GSM8K、MATH和CommonsenseQA上的实验表明，FACT-E改善了推理轨迹的选择，并产生了更强的上下文学习示例。FACT-E还能够在噪声条件下可靠地检测出错误推理，为可信的LLM推理提供了一个稳健的度量标准。

View on arXiv Download PDF AI Translation

cs.AI / 96 / 2604.10696

Camyla: Scaling Autonomous Research in Medical Image Segmentation

Camyla：在医学图像分割领域实现自主研究的扩展

Gao, Yifan, Li, Haoyue, Yuan, Feng, Gao, Xin, Huang, Weiran, Wang, Xiaosong

Abstract

We present Camyla, a system for fully autonomous research within the scientific domain of medical image segmentation. Camyla transforms raw datasets into literature-grounded research proposals, executable experiments, and complete manuscripts without human intervention. Autonomous experimentation over long horizons poses three interrelated challenges: search effort drifts toward unpromising directions, knowledge from earlier trials degrades as context accumulates, and recovery from failures collapses into repetitive incremental fixes. To address these challenges, the system combines three coupled mechanisms: Quality-Weighted Branch Exploration for allocating effort across competing proposals, Layered Reflective Memory for retaining and compressing cross-trial knowledge at multiple granularities, and Divergent Diagnostic Feedback for diversifying recovery after underperforming trials. The system is evaluated on CamylaBench, a contamination-free benchmark of 31 datasets constructed exclusively from 2025 publications, under a strict zero-intervention protocol across two independent runs within a total of 28 days on an 8-GPU cluster. Across the two runs, Camyla generates more than 2,700 novel model implementations and 40 complete manuscripts, and surpasses the strongest per-dataset baseline selected from 14 established architectures, including nnU-Net, on 22 and 18 of 31 datasets under identical training budgets, respectively (union: 24/31). Senior human reviewers score the generated manuscripts at the T1/T2 boundary of contemporary medical imaging journals. Relative to automated baselines, Camyla outperforms AutoML and NAS systems on aggregate segmentation performance and exceeds six open-ended research agents on both task completion and baseline-surpassing frequency. These results suggest that domain-scale autonomous research is achievable in medical image segmentation.

Chinese Translation

我们提出了Camyla，一个在医学图像分割科学领域内进行完全自主研究的系统。Camyla将原始数据集转化为基于文献的研究提案、可执行实验和完整的手稿，无需人工干预。长期的自主实验面临三个相互关联的挑战：搜索努力可能会偏向不具前景的方向，早期试验的知识随着上下文的积累而退化，以及从失败中恢复的过程往往陷入重复的增量修复。为了解决这些挑战，该系统结合了三种相互关联的机制：质量加权分支探索（Quality-Weighted Branch Exploration），用于在竞争提案之间分配努力；分层反思记忆（Layered Reflective Memory），用于在多个粒度上保留和压缩跨试验知识；以及发散诊断反馈（Divergent Diagnostic Feedback），用于在表现不佳的试验后多样化恢复策略。该系统在CamylaBench上进行了评估，这是一个由2025年出版物独家构建的31个数据集的无污染基准，在严格的零干预协议下，在一个8-GPU集群上进行了两次独立运行，总共28天。在这两次运行中，Camyla生成了超过2700个新的模型实现和40篇完整的手稿，并在相同训练预算下，在31个数据集中分别在22个和18个数据集上超过了从14个已建立架构中选择的最强基线（联合：24/31）。高级人类评审员对生成的手稿在当代医学影像期刊的T1/T2边界上进行了评分。相较于自动化基线，Camyla在整体分割性能上优于AutoML和NAS系统，并在任务完成率和超越基线的频率上超过了六个开放式研究代理。这些结果表明，在医学图像分割领域，实现领域规模的自主研究是可行的。

View on arXiv Download PDF AI Translation

cs.AI / 97 / 2604.10718

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

SciPredict：大型语言模型能否预测自然科学实验的结果？

Sehwag, Udari Madhushani, Lau, Elaine, Oskouie, Haniyeh Ehsani, Shabihi, Shayan, Liang, Erich, Toledo, Andrea, Mangialardi, Guillermo, Fonrouge, Sergio, Cardona, Ed-Yeremai Hernandez, Vergara, Paula, Tyagi, Utkarsh, Zhang, Chen Bo Calvin, Bhatter, Pavi, Johnson, Nicholas, Huang, Furong, Montoya, Ernesto Gabriel Hernandez, Liu, Bing

Abstract

Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26% and human expert performance is $\approx$20%. Although some frontier models exceed human performance model accuracy is still far below what would enable reliable experimental guidance. Even within the limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only $\approx$20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from $\approx$5% to $\approx$80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility all our data and code are provided at https://github.com/scaleapi/scipredict

Chinese Translation

加速科学发现需要在投入资源进行昂贵的物理验证之前，识别出哪些实验将产生最佳结果。尽管现有基准评估了大型语言模型（LLMs）在科学知识和推理方面的能力，但它们预测实验结果的能力——这是人工智能可以显著超越人类能力的任务——仍然在很大程度上未被探索。我们介绍了SciPredict，一个基准测试，包含来自物理学、生物学和化学33个专业子领域的405个任务，基于近期的实证研究。SciPredict解决了两个关键问题：（a）大型语言模型能否以足够的准确性预测科学实验的结果？（b）这样的预测能否在科学研究过程中可靠使用？评估结果揭示了这两个方面的基本局限性。模型的准确率为14-26%，而人类专家的表现约为20%。尽管一些前沿模型的表现超过了人类，但模型的准确性仍远低于能够提供可靠实验指导的水平。即使在有限的表现范围内，模型也未能区分可靠预测与不可靠预测，准确率仅约为20%，无论它们的信心如何，或是它们是否判断结果在没有物理实验的情况下是可预测的。相比之下，人类专家表现出强大的校准能力：当他们认为结果在不进行实验的情况下更可预测时，准确率从约5%提高到约80%。SciPredict建立了一个严格的框架，表明在实验科学中超越人类的表现不仅需要更好的预测，还需要对预测可靠性的更好认识。为了可重复性，我们的所有数据和代码均提供于 https://github.com/scaleapi/scipredict

View on arXiv Download PDF AI Translation

cs.AI / 98 / 2604.10720

Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation

教语言模型像学习者一样编程：用于学生模拟的对话式序列化方法

Koutcheme, Charles, Hellas, Arto, Leinonen, Juho

Abstract

Artificial models that simulate how learners act and respond within educational systems are a promising tool for evaluating tutoring strategies and feedback mechanisms at scale. However, many existing approaches in programming education rely on prompting large, proprietary language models, raising concerns around privacy, cost, and dependence. In this work, we propose a method for training open-weight artificial programming learners using authentic student process data. Our approach serializes temporal log traces into a conversational format, representing each student's problem-solving process as a dialogue between the learner and their automated assessment system. Student code submissions and environment feedback, such as test outcomes, grades, and error traces, form alternating conversational turns, enabling models to learn from the iterative debugging process. We additionally introduce a training pipeline combining supervised fine-tuning with preference optimization to align models with authentic student debugging behavior. We evaluate our framework by training Qwen models at 4B and 8B scales on a large-scale dataset of real student submissions to Python programming assignments. Our results show that incorporating environment feedback strengthens the models' ability to replicate student debugging behavior, improving over both prior code-only approaches and prompted large language models baselines in functional alignment and code similarity. We release our code to support reproducibility.

Chinese Translation

模拟学习者在教育系统中行为和反应的人工模型，是大规模评估辅导策略和反馈机制的有前景工具。然而，许多现有的编程教育方法依赖于调用大型专有语言模型，带来了隐私、成本和依赖性方面的担忧。在本研究中，我们提出了一种利用真实学生过程数据训练开源权重人工编程学习者的方法。我们的方法将时间序列日志转换为对话格式，将每个学生的问题解决过程表示为学习者与其自动评估系统之间的对话。学生代码提交和环境反馈（如测试结果、成绩和错误追踪）交替形成对话轮次，使模型能够从迭代调试过程中学习。我们还引入了结合监督微调与偏好优化的训练流程，使模型与真实学生调试行为保持一致。通过在大规模真实学生Python编程作业提交数据集上训练4B和8B规模的Qwen模型，我们评估了该框架。结果表明，融入环境反馈增强了模型复制学生调试行为的能力，在功能对齐和代码相似度方面优于以往仅基于代码的方法及提示式大型语言模型基线。我们发布了代码以支持结果复现。

View on arXiv Download PDF AI Translation

cs.AI / 99 / 2604.10739

When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

思考过度的弊端：大规模语言模型测试时计算扩展中的过度思考问题

Zhou, Shu, Ling, Rui, Chen, Junan, Wang, Xin, Fan, Tao, Wang, Hao

Abstract

Scaling test-time compute through extended chains of thought has become a dominant paradigm for improving large language model reasoning. However, existing research implicitly assumes that longer thinking always yields better results. This assumption remains largely unexamined. We systematically investigate how the marginal utility of additional reasoning tokens changes as compute budgets increase. We find that marginal returns diminish substantially at higher budgets and that models exhibit ``overthinking'', where extended reasoning is associated with abandoning previously correct answers. Furthermore, we show that optimal thinking length varies across problem difficulty, suggesting that uniform compute allocation is suboptimal. Our cost-aware evaluation framework reveals that stopping at moderate budgets can reduce computation significantly while maintaining comparable accuracy.

Chinese Translation

通过延长思维链条来扩展测试时计算量，已成为提升大型语言模型推理能力的主流范式。然而，现有研究隐含假设更长的思考总能带来更好的结果，这一假设尚未得到充分检验。我们系统地研究了随着计算预算增加，额外推理标记的边际效用如何变化。研究发现，在较高预算下边际收益显著递减，且模型表现出“过度思考”现象，即延长推理过程反而导致放弃先前正确的答案。此外，我们展示了最佳思考长度因问题难度而异，表明统一的计算分配策略并非最优。我们的成本感知评估框架表明，在适中预算下停止推理可以显著减少计算量，同时保持相当的准确率。

View on arXiv Download PDF AI Translation

cs.AI / 100 / 2604.10783

Learning Preference-Based Objectives from Clinical Narratives for Sequential Treatment Decision-Making

从临床叙事中学习基于偏好的目标以进行顺序治疗决策

Tan, Daniel J., See, Kay Choong, Feng, Mengling

Abstract

Designing reward functions remains a central challenge in reinforcement learning (RL) for healthcare, where outcomes are sparse, delayed, and difficult to specify. While structured data capture physiological states, they often fail to reflect the overall quality of a patient's clinical trajectory, including recovery dynamics, treatment burden, and stability. Clinical narratives, in contrast, summarize longitudinal reasoning and implicitly encode evaluations of treatment effectiveness. We propose Clinical Narrative-informed Preference Rewards (CN-PR), a framework for learning reward functions directly from discharge summaries by treating them as scalable supervision for trajectory-level preferences. Using a large language model, we derive trajectory quality scores (TQS) and construct pairwise preferences over patient trajectories, enabling reward learning via a structured preference-based objective. To account for variability in narrative informativeness, we incorporate a confidence signal that weights supervision based on its relevance to the decision-making task. The learned reward aligns strongly with trajectory quality (Spearman rho = 0.63) and enables policies that are consistently associated with improved recovery-related outcomes, including increased organ support-free days and faster shock resolution, while maintaining comparable performance on mortality. These effects persist under external validation. Our results demonstrate that narrative-derived supervision provides a scalable and expressive alternative to handcrafted or outcome-based reward design for dynamic treatment regimes.

Chinese Translation

在医疗保健的强化学习（RL）中，设计奖励函数仍然是一个核心挑战，因为结果稀疏、延迟且难以明确指定。虽然结构化数据能够捕捉生理状态，但它们往往无法反映患者临床轨迹的整体质量，包括恢复动态、治疗负担和稳定性。相比之下，临床叙事总结了纵向推理，并隐含地编码了对治疗有效性的评估。我们提出了临床叙事信息驱动的偏好奖励（Clinical Narrative-informed Preference Rewards, CN-PR），这是一个通过将出院总结视为轨迹级偏好的可扩展监督来直接学习奖励函数的框架。利用大型语言模型，我们推导出轨迹质量评分（Trajectory Quality Scores, TQS），并构建患者轨迹的成对偏好，从而通过结构化的基于偏好的目标实现奖励学习。为了考虑叙事信息量的变异性，我们引入了一个置信信号，根据其与决策任务的相关性对监督进行加权。学习到的奖励与轨迹质量高度一致（Spearman rho = 0.63），并使得政策与改善恢复相关结果（包括增加无器官支持天数和更快的休克解决）持续相关，同时在死亡率方面保持可比的表现。这些效果在外部验证中依然存在。我们的结果表明，叙事导出的监督为动态治疗方案提供了一种可扩展且富有表现力的替代方案，优于手工设计或基于结果的奖励设计。

View on arXiv Download PDF AI Translation

cs.AI / 101 / 2604.10784

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

TorchUMM：用于评估、分析与后训练的统一多模态模型代码库

Luo, Yinyi, Wang, Wenwen, Bai, Hayes, Zhu, Hongyu, Chen, Hao, He, Pan, Savvides, Marios, Li, Sharon, Wang, Jindong

Abstract

Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.

Chinese Translation

近年来，统一多模态模型（Unified Multimodal Models, UMMs）的进展催生了大量能够跨视觉与文本模态进行理解、生成和编辑的架构。然而，由于模型架构的多样性以及训练范式和实现细节的异质性，构建一个统一的UMM框架依然充满挑战。本文提出了TorchUMM，这是首个针对多样化UMM骨干网络、任务和数据集的全面评估、分析与后训练的统一代码库。TorchUMM支持涵盖广泛规模和设计范式的多种模型。我们的基准测试涵盖多模态理解、生成和编辑三大核心任务维度，整合了既有及新颖的数据集，用以评估感知、推理、组合能力及指令遵循能力。通过提供统一接口和标准化评估协议，TorchUMM实现了异构模型之间的公平且可复现的比较，促进对其优势与局限的深入理解，推动更强大统一多模态系统的发展。代码可在：https://github.com/AIFrontierLab/TorchUMM 获取。

View on arXiv Download PDF AI Translation

cs.AI / 102 / 2604.10825

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

CheeseBench：在啮齿动物行为神经科学范式上评估大型语言模型

Bugaud, Zacharie

Abstract

We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.

Chinese Translation

我们介绍了CheeseBench，这是一个基准测试，评估大型语言模型（LLMs）在九个经典行为神经科学范式上的表现（莫里斯水迷宫、巴恩斯迷宫、T型迷宫、放射臂迷宫、星形迷宫、操作室、穿梭箱、条件性地点偏好和延迟非匹配样本），涵盖六个认知维度。每个任务基于经过同行评审的啮齿动物实验方案，并具有近似的动物基线。代理接收一个统一的系统提示，没有特定于任务的指令，必须仅通过ASCII文本观察和奖励信号来发现目标，类似于放置在不熟悉设备中的啮齿动物。我们对六个开放权重的LLM（参数从3B到72B）在基于文本的ASCII渲染上进行了评估，并与随机基线和基于图的强化学习代理进行了比较。我们的最佳模型（Qwen2.5-VL-7B）在ASCII输入上的平均成功率达到52.6%，而随机代理为32.1%，近似的啮齿动物基线为78.9%。我们发现（1）超出7B的扩展收益递减，（2）较长的上下文历史会降低性能，（3）思维链提示反而有害而非有益，以及（4）视觉-语言架构在7B时提供优势，但在32B时则有害。由于同一模型的性能仅因接口参数的不同而在20%到57%之间波动，这些结果表征了代理与接口系统，而非模型本身。在这一统一的零-shot ASCII协议下，目前的开放权重LLM代理在近似啮齿动物参考值之下，尤其是在需要空间导航和试验内状态跟踪的任务上。

View on arXiv Download PDF AI Translation

cs.AI / 103 / 2604.10827

Your Model Diversity, Not Method, Determines Reasoning Strategy

你的模型多样性，而非方法，决定推理策略

Choraria, Moulik, Gerogiannis, Argyrios, Das, Anirban, Chakraborty, Supriyo, Kapusuzoglu, Berkcan, Lee, Chia-Hsuan, Balasubramaniam, Kartik, Zhang, Shi-Xiong, Sahu, Sambit

Abstract

Compute scaling for LLM reasoning requires allocating budget between exploring solution approaches ($breadth$) and refining promising solutions ($depth$). Most methods implicitly trade off one for the other, yet why a given trade-off works remains unclear, and validation on a single model obscures the role of the model itself. We argue that $\textbf{the optimal strategy depends on the model's diversity profile, the spread of probability mass across solution approaches, and that this must be characterized before any exploration strategy is adopted.}$ We formalize this through a theoretical framework decomposing reasoning uncertainty and derive conditions under which tree-style depth refinement outperforms parallel sampling. We validate it on Qwen-3 4B and Olmo-3 7B families, showing that lightweight signals suffice for depth-based refinement on low-diversity aligned models while yielding limited utility for high-diversity base models, which we hypothesize require stronger compensation for lower exploration coverage.

Chinese Translation

大规模语言模型（LLM）推理的计算扩展需要在探索解决方案方法（$breadth$）和优化有前景的解决方案（$depth$）之间分配预算。大多数方法隐含地在两者之间进行权衡，但为何特定的权衡有效仍不清楚，且在单一模型上的验证掩盖了模型本身的作用。我们认为，$ extbf{最佳策略依赖于模型的多样性特征，即概率质量在解决方案方法上的分布，而这一点必须在采用任何探索策略之前进行表征。}$ 我们通过一个理论框架对推理不确定性进行分解，并推导出树状深度优化优于并行采样的条件。我们在 Qwen-3 4B 和 Olmo-3 7B 系列上进行了验证，结果表明，对于低多样性对齐模型，轻量信号足以支持基于深度的优化，而对于高多样性基础模型则效果有限，我们假设这类模型需要更强的补偿以弥补较低的探索覆盖率。

View on arXiv Download PDF AI Translation

cs.AI / 104 / 2604.10853

A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness

基于差距与重叠分析的知识图谱任务准备度评测基准

Mridul, Maruf Ahmed, Kapa, Rohit, Seneviratne, Oshani

Abstract

Task-oriented evaluation of knowledge graph (KG) quality increasingly asks whether an ontology-based representation can answer the competency questions that users actually care about, in a manner that is reproducible, explainable, and traceable to evidence. This paper adopts that perspective and focuses on gap and overlap analysis for policy-like documents (e.g., insurance contracts), where given a scenario, which documents support it (overlap) and which do not (gap), with defensible justifications. The resulting gap/overlap determinations are typically driven by genuine differences in coverage and restrictions rather than missing data, making the task a direct test of KG task readiness rather than a test of missing facts or query expressiveness. We present an executable and auditable benchmark that aligns natural-language contract text with a formal ontology and evidence-linked ground truth, enabling systematic comparison of methods. The benchmark includes: (i) ten simplified yet diverse life-insurance contracts reviewed by a domain expert, (ii) a domain ontology (TBox) with an instantiated knowledge base (ABox) populated from contract facts, and (iii) 58 structured scenarios paired with SPARQL queries with contract-level outcomes and clause-level excerpts that justify each label. Using this resource, we compare a text-only LLM baseline that infers outcomes directly from contract text against an ontology-driven pipeline that answers the same scenarios over the instantiated KG, demonstrating that explicit modeling improves consistency and diagnosis for gap/overlap analyses. Although demonstrated for gap and overlap analysis, the benchmark is intended as a reusable template for evaluating KG quality and supporting downstream work such as ontology learning, KG population, and evidence-grounded question answering.

Chinese Translation

面向任务的知识图谱（KG）质量评估日益关注本体驱动的表示能否以可复现、可解释且可追溯至证据的方式回答用户实际关心的能力问题。本文采用该视角，聚焦于政策类文档（如保险合同）的差距与重叠分析，即在给定场景下，哪些文档支持该场景（重叠），哪些不支持（差距），并提供有理有据的论证。所得差距/重叠判定通常源于覆盖范围和限制的真实差异，而非数据缺失，使该任务成为检验KG任务准备度的直接测试，而非事实缺失或查询表达能力的测试。我们提出了一个可执行且可审计的基准，将自然语言合同文本与形式化本体及证据链接的真实标签对齐，从而支持方法的系统比较。该基准包括：(i) 十份由领域专家审阅的简化且多样化的人寿保险合同，(ii) 一个领域本体（TBox）及基于合同事实填充的实例知识库（ABox），以及 (iii) 58个结构化场景，配套SPARQL查询、合同级结果和条款级摘录以支持每个标签。利用该资源，我们比较了直接从合同文本推断结果的纯文本大型语言模型（LLM）基线与基于本体驱动、在实例化KG上回答相同场景的流程，展示了显式建模在差距/重叠分析中的一致性和诊断能力提升。尽管以差距与重叠分析为示范，该基准旨在作为评估KG质量及支持后续工作（如本体学习、KG填充和基于证据的问题回答）的可复用模板。

View on arXiv Download PDF AI Translation

cs.AI / 105 / 2604.10865

Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering

超越统计共现：解锁表格数据聚类的内在语义

Zhao, Mingjie, Zhang, Yunfan, Zhang, Yiqun, Cheung, Yiu-ming

Abstract

Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like `Flu' and `Cold' are often treated as symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.

Chinese Translation

深度聚类（Deep Clustering，DC）已成为金融和医疗等现实领域中表格数据分析的强大工具。然而，大多数现有方法依赖于数据层面的统计共现来推断潜在的度量空间，往往忽视了蕴含于特征名称和数值中的内在语义知识。因此，诸如“流感（Flu）”和“感冒（Cold）”等语义相关的概念常被视为符号化的标记，导致概念相关的样本被孤立。为弥合数据集特定统计信息与内在语义知识之间的差距，本文提出了表格增强对比聚类（Tabular-Augmented Contrastive Clustering，TagCC）框架，该框架将统计表格表示锚定于开放世界的文本概念。具体而言，TagCC利用大型语言模型（Large Language Models，LLMs）通过语义感知转换将潜在数据语义提炼为文本锚点。通过对比学习（Contrastive Learning，CL），该框架用这些锚点所蕴含的开放世界语义丰富统计表格表示。该对比学习框架与聚类目标联合优化，确保学习到的表示既具语义一致性又利于聚类。大量基准数据集上的实验表明，TagCC显著优于现有方法。

View on arXiv Download PDF AI Translation

cs.AI / 106 / 2604.10873

A Quantitative Definition of Intelligence

智能的定量定义

Choi, Kang-Sin

Abstract

We propose an operational, quantitative definition of intelligence for arbitrary physical systems. The intelligence density of a system is the ratio of the logarithm of its independent outputs to its total description length. A system memorizes if its description length grows with its output count; it knows if its description length remains fixed while its output count diverges. The criterion for knowing is generalization: a system knows its domain if a single finite mechanism can produce correct outputs across an unbounded range of inputs, rather than storing each answer individually. We argue that meaning over a domain is a selection and ordering of functions that produces correct outputs, and that a system whose intelligence density diverges necessarily captures this structure. The definition (1) places intelligence on a substrate-independent continuum from logic gates to brains, (2) blocks Putnam's pancomputationalist triviality argument via an independence condition on outputs, and (3) resolves Searle's Chinese Room Argument by showing that any finite rulebook handling an infinite domain must generalize.

Chinese Translation

我们提出了一个针对任意物理系统的操作性定量智能定义。系统的智能密度定义为其独立输出的对数与其总描述长度的比值。当系统的描述长度随着输出数量的增加而增长时，称其为记忆；当描述长度保持不变而输出数量趋于无穷时，称其为“知道”。“知道”的标准是泛化：如果单一有限机制能够在无限输入范围内产生正确输出，而非逐一存储每个答案，则系统被认为了解其领域。我们认为，领域上的意义是产生正确输出的函数的选择与排序，而智能密度发散的系统必然捕捉到这种结构。该定义（1）将智能置于从逻辑门到大脑的基底无关连续体上，（2）通过对输出的独立性条件，阻止了Putnam的泛计算主义平凡性论证，（3）并通过证明任何处理无限领域的有限规则集必须泛化，解决了Searle的中文房间论证。

View on arXiv Download PDF AI Translation

cs.AI / 107 / 2604.10898

ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval

ZoomR：通过多粒度键值检索实现内存高效推理

Yang, David H., Zhu, Yuxuan, Amiri, Mohammad Mohammadi, Murugesan, Keerthiram, Pedapati, Tejaswini, Chaudhury, Subhajit, Chen, Pin-Yu

Abstract

Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focus on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically "zooming in" on fine-grained details. By using summary keys as a coarse-grained index during decoding, ZoomR uses the query to retrieve details for only the most important thoughts. This hierarchical strategy significantly reduces memory usage by avoiding full-cache attention at each step. Experiments across math and reasoning tasks show that our approach achieves competitive performance compared to baselines, while reducing inference memory requirements by more than $4\times$. These results demonstrate that a multi-granularity KV selection enables more memory efficient decoding, especially for long output generation.

Chinese Translation

大型语言模型（LLMs）在复杂推理任务中表现出色，但通常需要生成较长的中间思维过程才能得出最终答案。在生成过程中，LLMs依赖于键值（KV）缓存进行自回归解码。然而，KV缓存的内存占用随着输出长度的增加而增长。之前关于KV缓存优化的研究主要集中在压缩较长的输入上下文，同时保留完整的KV缓存进行解码。对于需要长输出生成的任务，这导致了计算和内存成本的增加。本文介绍了ZoomR，这是一种新颖的方法，使LLMs能够自适应地将冗长的推理思维压缩为摘要，并使用动态KV缓存选择策略，利用这些摘要，同时在细粒度细节上进行战略性“放大”。通过在解码过程中使用摘要键作为粗粒度索引，ZoomR仅使用查询来检索最重要思维的细节。这种分层策略显著减少了内存使用，避免了在每一步进行全缓存注意力。针对数学和推理任务的实验表明，我们的方法在性能上与基线相比具有竞争力，同时将推理内存需求降低了超过4倍。这些结果表明，多粒度KV选择能够实现更高效的内存解码，特别是在长输出生成方面。

View on arXiv Download PDF AI Translation

cs.AI / 108 / 2604.10900

CASK: Core-Aware Selective KV Compression for Reasoning Traces

CASK：核心感知选择性KV压缩用于推理轨迹

Kim, Buseong, Gwon, Heejun

Abstract

In large language models performing long-form reasoning, the KV cache grows rapidly with decode length, creating bottlenecks in memory and inference stability. Existing reasoning-oriented KV compression has mostly followed an eviction-centered view: estimate token importance more accurately, then discard lower-ranked entries. Our analysis suggests that scorer refinement alone often fails to substantially reorganize the actual keep-set and may therefore not be the main lever for preserving reasoning behavior. We instead frame reasoning KV compression as a behavior-preserving structured consolidation problem. CASK partitions the decode-time reasoning trace into a protected core that anchors answer formation and intermediate state, and mergeable scratch with high redundancy. The core is preserved, while selective consolidation is applied only to the scratch. To address prompt-heavy regimes where the prefix can exhaust the budget before decode-stage compression becomes active, CASK further uses a two-stage design: prefix eviction followed by decode-stage consolidation. On the H100 reasoning gate, CASK shows higher full-KV continuation fidelity than TriAttention at matched budgets on both AIME24 and AIME25, with recurring cask@384 > triattention@512 crossings. In prompt-heavy replay, multi_news and vcsum act as decode-active witnesses, while qmsum and gov_report expose the prefix_budget_exhausted boundary. The overall evidence supports a simple conclusion: effective reasoning KV compression depends less on more elaborate scorer engineering than on combining core preservation with selective scratch consolidation to lower the usable budget frontier.

Chinese Translation

在进行长篇推理的大型语言模型中，KV缓存随着解码长度的增加迅速增长，导致内存和推理稳定性出现瓶颈。现有的面向推理的KV压缩大多遵循以驱逐为中心的观点：更准确地估计令牌的重要性，然后丢弃低排名的条目。我们的分析表明，仅仅对评分器进行细化往往无法实质性地重新组织实际的保留集，因此可能不是保持推理行为的主要杠杆。相反，我们将推理KV压缩框架视为一个保持行为的结构化整合问题。CASK将解码时的推理轨迹划分为一个保护核心，该核心锚定答案形成和中间状态，以及具有高冗余的可合并临时存储。核心部分被保留，而选择性整合仅应用于临时存储。为了应对在解码阶段压缩生效之前，前缀可能耗尽预算的提示重负情况，CASK进一步采用了两阶段设计：前缀驱逐后跟解码阶段整合。在H100推理门上，CASK在AIME24和AIME25的匹配预算下显示出比TriAttention更高的全KV延续保真度，且cask@384与triattention@512之间存在重复交叉。在提示重的重放中，multi_news和vcsum作为解码活动的见证，而qmsum和gov_report则暴露了前缀预算耗尽的边界。总体证据支持一个简单的结论：有效的推理KV压缩与其说依赖于更复杂的评分器工程，不如说依赖于将核心保留与选择性临时存储整合相结合，以降低可用预算的边界。

View on arXiv Download PDF AI Translation

cs.AI / 109 / 2604.10908

Reasoning as Data: Representation-Computation Unity and Its Implementation in a Domain-Algebraic Inference Engine

作为数据的推理：表示-计算统一性及其在领域代数推理引擎中的实现

Li, Chao, Wang, Yuru

Abstract

Every existing knowledge system separates storage from computation. We show this separation is unnecessary and eliminate it. In a standard triple is_a(Apple, Company), domain context lives in the query or the programmer's mind. In a CDC four-tuple is_a(Apple, Company, @Business), domain becomes a structural field embedded in predicate arity. Any system respecting arity automatically performs domain-scoped inference without external rules. We call this representation-computation unity (RCU). From the four-tuple structure, three inference mechanisms emerge: domain-scoped closure, typed inheritance, and write-time falsification via cycle detection per domain fiber. We establish RCU formally via four theorems. RCU is implementable. We present a working symbolic engine (2400 lines Python+Prolog) resolving four engineering issues: rule-data separation, shared-fiber handling, read-only meta-layer design, and intersective convergence. A central result: CDC domain-constrained inference is distinct from Prolog with a domain argument. Two case studies validate the engine. ICD-11 classification (1247 entities, 3 axes) shows fibers resolve multiple inheritance. CBT clinical reasoning shows generalization to temporal reasoning with session turn as ordered domain index. Multi-constraint queries realize CSP arc-consistency with complexity O(m (N/K)^2), confirming the domain lattice's sparsity governs performance. When domain is structural, data computes itself.

Chinese Translation

现有的知识系统将存储与计算分开。我们证明这种分离是多余的，并将其消除。在标准三元组 is_a(Apple, Company) 中，领域上下文存在于查询或程序员的思维中。在 CDC 四元组 is_a(Apple, Company, @Business) 中，领域成为嵌入在谓词元数中的结构字段。任何尊重元数的系统都会自动执行领域范围的推理，而无需外部规则。我们称之为表示-计算统一性（RCU）。从四元组结构中，出现了三种推理机制：领域范围闭包、类型继承和通过每个领域纤维的循环检测进行的写时虚假化。我们通过四个定理正式建立了 RCU。RCU 是可实现的。我们展示了一个工作中的符号引擎（2400 行 Python+Prolog），解决了四个工程问题：规则-数据分离、共享纤维处理、只读元层设计和交叉收敛。一个核心结果是：CDC 领域约束推理与带领域参数的 Prolog 是不同的。两个案例研究验证了该引擎。ICD-11 分类（1247 个实体，3 个轴）显示纤维解决了多重继承。CBT 临床推理展示了对时间推理的概括，将会话轮次作为有序领域索引。多约束查询实现了 CSP 弧一致性，复杂度为 O(m (N/K)^2)，确认领域格的稀疏性支配了性能。当领域是结构性的，数据便能自我计算。

View on arXiv Download PDF AI Translation

cs.AI / 110 / 2604.10911

EvoNash-MARL: A Closed-Loop Multi-Agent Reinforcement Learning Framework for Medium-Horizon Equity Allocation

EvoNash-MARL：一种闭环多智能体强化学习框架用于中期股权配置

Jia, Chongliu, Luo, Yi, Han, Sipeng, Li, Pengwei, Ding, Jie, Hu, Youshuang, Qian, Yimiao, Wang, Qiya

Abstract

Medium-to-long-horizon stock allocation presents significant challenges due toveak predictive structures, non-stadonary market regimes, and the degradationf signals following the application of transaction costs, capacity limits, and tail-isk constraints. Conventional approaches commonly rely on a single predictor orloosely coupled prediction-to-allocation pipeline, limiting robustness underThis work addresses a targeted design question: whetherlistribution shift. 1coupling reinforcement learning (RL), multi-agent policy populations, Policy-Space Response Oracle (PSRO)-style aggregation, league best-response trainingevolutionary replacement, and execution-aware checkpoint selection within ainified walk-forward loop improves allocator robustness at medium to longhorizons. The proposed framework, EvoNash-MARL, integrates these componentswithin an execution-aware allocation loop and further introduces a layeredpolicy architecture comprising a direction head and a risk head, nonlinear signalenhancement, feature-quality reweighting, and constraint-aware checkpointselection. Under a 120-window walk-forward protocol, the resolved v21configuration achieves mean excess Sharpe 0.7600 and robust score -0.0203,anking first among internal controls; on aligned daily out-of-sample returnsrom 2014-01-02 to 2024-01-05, it delivers 19.6% annualized return versus 11.7% for SPY, and in an extended walk-forward evaluation through 2026-02-10 it delivers 20.5% rersus 13.5%. The framework maintains positive performance under realistictress constraints and exhibits structured cross-market generalization; however,lobal strong significance under White's Reality Check (WRC) and SPA-lite testingestablished. Therefore, the results are presented as evidence supporting asnotnore stable medium-to long-horizon training and selection paradigm, ratherhan as prooffof universally superior market-timing performance.

Chinese Translation

中长期股票配置面临显著挑战，主要由于预测结构薄弱、市场状态非平稳以及在应用交易成本、容量限制和尾部风险约束后信号的退化。传统方法通常依赖于单一预测器或松散耦合的预测-配置管道，这在分布转移的情况下限制了其鲁棒性。本研究针对一个特定设计问题进行探讨：将强化学习（Reinforcement Learning, RL）、多智能体策略群体、政策空间响应oracle（Policy-Space Response Oracle, PSRO）风格的聚合、联盟最佳响应训练、进化替换以及执行感知的检查点选择结合在一个统一的前向循环中，是否能提高中长期配置器的鲁棒性。所提出的框架EvoNash-MARL将这些组件整合在一个执行感知的配置循环中，并进一步引入了一个分层政策架构，包括方向头和风险头、非线性信号增强、特征质量重加权以及约束感知的检查点选择。在120窗口的前向协议下，解决的v21配置实现了平均超额夏普比率0.7600和鲁棒性评分-0.0203，在内部控制中排名第一；在2014年1月2日至2024年1月5日的对齐日常样本外收益中，其年化收益为19.6%，而SPY为11.7%；在通过2026年2月10的扩展前向评估中，其年化收益为20.5%，而SPY为13.5%。该框架在现实压力约束下保持积极表现，并展现出结构化的跨市场泛化能力；然而，在White的现实检查（White's Reality Check, WRC）和SPA-lite测试下，建立了全球强显著性。因此，结果被呈现为支持一种更稳定的中长期训练和选择范式的证据，而不是作为普遍优越的市场时机表现的证明。

View on arXiv Download PDF AI Translation

cs.AI / 111 / 2604.10918

CSPO: Alleviating Reward Ambiguity for Structured Table-to-LaTeX Generation

CSPO：缓解结构化表格到LaTeX生成中的奖励歧义问题

Yang, Yunfan, Lan, Cuiling, Sang, Jitao, Lu, Yan

Abstract

Tables contain rich structured information, yet when stored as images their contents remain "locked" within pixels. Converting table images into LaTeX code enables faithful digitization and reuse, but current multimodal large language models (MLLMs) often fail to preserve structural, style, or content fidelity. Conventional post-training with reinforcement learning (RL) typically relies on a single aggregated reward, leading to reward ambiguity that conflates multiple behavioral aspects and hinders effective optimization. We propose Component-Specific Policy Optimization (CSPO), an RL framework that disentangles optimization across LaTeX tables components-structure, style, and content. In particular, CSPO assigns component-specific rewards and backpropagates each signal only through the tokens relevant to its component, alleviating reward ambiguity and enabling targeted component-wise optimization. To comprehensively assess performance, we introduce a set of hierarchical evaluation metrics. Extensive experiments demonstrate the effectiveness of CSPO, underscoring the importance of component-specific optimization for reliable structured generation.

Chinese Translation

表格包含丰富的结构化信息，但当以图像形式存储时，其内容被“锁定”在像素中。将表格图像转换为LaTeX代码能够实现忠实的数字化和再利用，但当前的多模态大型语言模型（MLLMs）往往难以保持结构、样式或内容的准确性。传统的基于强化学习（RL）的后训练通常依赖单一的聚合奖励，导致奖励歧义，将多个行为方面混淆，阻碍了有效的优化。我们提出了组件特定策略优化（Component-Specific Policy Optimization，CSPO）框架，该强化学习方法将LaTeX表格的结构、样式和内容等组件的优化过程解耦。具体而言，CSPO为各组件分配专属奖励，并且仅将每个奖励信号反向传播至与该组件相关的token，从而缓解奖励歧义，实现针对性的组件级优化。为全面评估性能，我们引入了一套分层评估指标。大量实验验证了CSPO的有效性，强调了组件特定优化在可靠结构化生成中的重要性。

View on arXiv Download PDF AI Translation

cs.AI / 112 / 2604.10960

RAG-KT: Cross-platform Explainable Knowledge Tracing with Multi-view Fusion Retrieval Generation

RAG-KT：跨平台可解释知识追踪与多视角融合检索生成

Duan, Zhiyi, Yuan, Hongyu, Liu, Rui

Abstract

Knowledge Tracing (KT) infers a student's knowledge state from past interactions to predict future performance. Conventional Deep Learning (DL)-based KT models are typically tied to platform-specific identifiers and latent representations, making them hard to transfer and interpret. Large Language Model (LLM)-based methods can be either ungrounded under prompting or overly domain-dependent under fine-tuning. In addition, most existing KT methods are developed and evaluated under a same-distribution assumption. In real deployments, educational data often arise from heterogeneous platforms with substantial distribution shift, which often degrades generalization. To this end, we propose RAG-KT, a retrieval-augmented paradigm that frames cross-platform KT as reliable context constrained inference with LLMs. It builds a unified multi-source structured context with cross-source alignment via Question Group abstractions and retrieves complementary rich and reliable context for each prediction, enabling grounded prediction and interpretable diagnosis. Experiments on three public KT benchmarks demonstrate consistent gains in accuracy and robustness, including strong performance under cross-platform conditions.

Chinese Translation

知识追踪（Knowledge Tracing, KT）通过分析学生过去的互动来推断其知识状态，从而预测未来的表现。传统的基于深度学习（Deep Learning, DL）的KT模型通常与特定平台的标识符和潜在表示紧密相关，这使得它们难以迁移和解释。基于大型语言模型（Large Language Model, LLM）的方法在提示下可能缺乏基础，或在微调时过于依赖特定领域。此外，大多数现有的KT方法是在同分布假设下开发和评估的。在实际应用中，教育数据往往来自异构平台，存在显著的分布转变，这通常会降低模型的泛化能力。为此，我们提出了RAG-KT，这是一种增强检索的范式，将跨平台KT框架视为与LLMs进行可靠上下文约束推理。它通过问题组（Question Group）抽象构建统一的多源结构化上下文，并为每个预测检索互补的丰富且可靠的上下文，从而实现基于基础的预测和可解释的诊断。在三个公共KT基准上的实验表明，RAG-KT在准确性和鲁棒性方面均表现出一致的提升，包括在跨平台条件下的强劲表现。

View on arXiv Download PDF AI Translation

cs.AI / 113 / 2604.10963

Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models

通过视觉基础模型深入探讨医学图像分割中的随机不确定性

Li, Ruiyang, Liu, Fang, Jiao, Licheng, Xie, Xinglin, Hao, Jiayao, Li, Shuo, Liu, Xu, Yang, Jingyi, Li, Lingling, Chen, Puhua, Ma, Wenping

Abstract

Medical image segmentation supports clinical workflows by precisely delineating anatomical structures and lesions. However, medical image datasets medical image datasets suffer from acquisition noise and annotation ambiguity, causing pervasive data uncertainty that substantially undermines model robustness. Existing research focuses primarily on model architectural improvements and predictive reliability estimation, while systematic exploration of the intrinsic data uncertainty remains insufficient. To address this gap, this work proposes leveraging the universal representation capabilities of visual foundation models to estimate inherent data uncertainty. Specifically, we analyze the feature diversity of the model's decoded representations and quantify their singular value energy to define the semantic perception scale for each class, thereby measuring sample difficulty and aleatoric uncertainty. Based on this foundation, we design two uncertainty-driven application strategies: (1) the aleatoric uncertainty-aware data filtering mechanism to eliminate potentially noisy samples and enhance model learning quality; (2) the dynamic uncertainty-aware optimization strategy that adaptively adjusts class-specific loss weights during training based on the semantic perception scale, combined with a label denoising mechanism to improve training stability. Experimental results on five public datasets encompassing CT and MRI modalities and involving multi-organ and tumor segmentation tasks demonstrate that our method achieves significant and robust performance improvements across various mainstream network architectures, revealing the broad application potential of aleatoric uncertainty in medical image understanding and segmentation tasks.

Chinese Translation

医学图像分割通过精确描绘解剖结构和病变来支持临床工作流程。然而，医学图像数据集受到采集噪声和标注模糊的影响，导致普遍的数据不确定性，从而显著削弱模型的鲁棒性。现有研究主要集中在模型架构的改进和预测可靠性估计上，而对内在数据不确定性的系统性探索仍然不足。为了解决这一问题，本研究提出利用视觉基础模型的通用表示能力来估计固有的数据不确定性。具体而言，我们分析模型解码表示的特征多样性，并量化其奇异值能量，以定义每个类别的语义感知尺度，从而测量样本难度和随机不确定性。在此基础上，我们设计了两种以不确定性驱动的应用策略：（1）随机不确定性感知的数据过滤机制，以消除潜在的噪声样本并提高模型学习质量；（2）动态不确定性感知的优化策略，根据语义感知尺度在训练过程中自适应调整类别特定的损失权重，并结合标签去噪机制以提高训练稳定性。在涉及多脏器和肿瘤分割任务的五个公共数据集上的实验结果表明，我们的方法在各种主流网络架构中实现了显著且稳健的性能提升，揭示了随机不确定性在医学图像理解和分割任务中的广泛应用潜力。

View on arXiv Download PDF AI Translation

cs.AI / 114 / 2604.10973

CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

CFMS：一种用于增强表格推理的粗到细多模态综合框架

Huang, Qixian, Lin, Hongqiang, Fu, Tong, Wang, Yingsen, Fu, Zhenghui, Wang, Qirui, Sun, Yiding, Zhang, Dongxu

Abstract

Reasoning over tabular data is a crucial capability for tasks like question answering and fact verification, as it requires models to comprehend both free-form questions and semi-structured tables. However, while methods like Chain-of-Thought (CoT) introduce reasoning chains, purely symbolic methodes are inherently limited by their blindness to holistic visual patterns. To address this, we propose the Coarse-to-Fine Multimodal Synthesis framework (CFMS), a novel two-stage paradigm that hierarchically decouples high-level visual perception from granular symbolic reasoning. In the Coarse Stage, CFMS leverages the Multimodal Large Language Models (MLLMs) to perform a one-time synthesis of a multi-perspective knowledge tuple. This tuple subsequently serves as a dynamic reasoning map to guide the fine stage, where a symbolic engine executes a targeted and efficient sequence of iterative operations over the table. Extensive experiments on the WikiTQ and TabFact benchmarks demonstrate that CFMS achieves competitive accuracy. The framework exhibits particular robustness when handling large tables and when instantiated with smaller backbone models, validating its effectiveness and generalizability.

Chinese Translation

表格数据推理是问答和事实验证等任务中的关键能力，因为这类任务要求模型理解自由形式的问题和半结构化的表格。然而，尽管诸如Chain-of-Thought（CoT）的方法引入了推理链，纯符号方法由于无法感知整体视觉模式而存在固有限制。为此，我们提出了粗到细多模态综合框架（CFMS），这是一种新颖的两阶段范式，分层解耦了高层次视觉感知与细粒度符号推理。在粗阶段，CFMS利用多模态大语言模型（MLLMs）一次性综合生成多视角知识元组，该元组随后作为动态推理地图指导细阶段，在细阶段中符号引擎针对表格执行有针对性且高效的迭代操作序列。在WikiTQ和TabFact基准上的大量实验表明，CFMS实现了具有竞争力的准确率。该框架在处理大型表格以及采用较小骨干模型时表现出特别的鲁棒性，验证了其有效性和泛化能力。

View on arXiv Download PDF AI Translation

cs.AI / 115 / 2604.10981

ATANT v1.1: Positioning Continuity Evaluation Against Memory, Long-Context, and Agentic-Memory Benchmarks

ATANT v1.1：针对记忆、长上下文和代理记忆基准的连续性评估

Tanguturi, Samuel Sameer

Abstract

ATANT v1.0 (arXiv:2604.06710) defined continuity as a system property with 7 required properties and introduced a 10-checkpoint, LLM-free evaluation methodology validated on a 250-story corpus. Since publication, a recurring reviewer and practitioner question has concerned not the framework itself but its relationship to a wider set of memory evaluations: LOCOMO, LongMemEval, BEAM, MemoryBench, Zep's evaluation suite, Letta/MemGPT's evaluations, and RULER. This companion paper, v1.1, does not modify the v1.0 standard. It closes a related-work gap that v1.0 left brief under page limits. We show by structural analysis that none of these benchmarks measures continuity as defined in v1.0: of the 7 required properties, the median existing eval covers 1 property, the mean covers 0.43 when partial credit is scored at 0.5, and no eval covers more than 2. We provide a cell-by-cell property-coverage matrix, identify methodological defects specific to each benchmark (including an empty-gold scoring bug in the LOCOMO reference implementation that renders 23% of its corpus unscorable by construction), and publish our reference implementation's LOCOMO score (8.8%) alongside the structural reason that number is uninformative about continuity. We publish our 8.8% LOCOMO score alongside our 96% ATANT cumulative-scale score as a calibration pair: the 87-point divergence is evidence that the two benchmarks measure different properties, not that one system is an order of magnitude better than another. The position v1.1 takes is not adversarial: each benchmark measures a real capability. The claim is that none of them can adjudicate continuity, and conflating them with continuity evaluation has led the field to under-invest in the properties v1.0 names.

Chinese Translation

ATANT v1.0 (arXiv:2604.06710) 将连续性定义为一种系统属性，具有7个必要属性，并引入了一种在250个故事语料库上验证的10个检查点、无大型语言模型(LLM)的评估方法。自发布以来，反复出现的审稿人和从业者问题关注的不是框架本身，而是其与更广泛的记忆评估的关系：LOCOMO、LongMemEval、BEAM、MemoryBench、Zep的评估套件、Letta/MemGPT的评估和RULER。本文作为伴随论文v1.1，并未修改v1.0标准，而是填补了v1.0在页面限制下留下的相关工作空白。我们通过结构分析表明，这些基准中没有一个能够测量v1.0定义的连续性：在7个必要属性中，现有评估的中位数覆盖1个属性，平均覆盖0.43（当部分得分为0.5时），且没有评估覆盖超过2个属性。我们提供了逐项属性覆盖矩阵，识别了每个基准特有的方法缺陷（包括LOCOMO参考实现中的一个空金标准评分错误，使其语料库的23%因构造而无法评分），并发布了我们参考实现的LOCOMO得分（8.8%）以及该数字在连续性方面无信息量的结构原因。我们将8.8%的LOCOMO得分与96%的ATANT累积评分作为校准对比：87分的差异证明这两个基准测量的是不同的属性，而不是某一系统比另一系统好一个数量级。v1.1所持立场并非对抗性：每个基准测量的是一种真实能力。我们的主张是，没有一个基准能够裁定连续性，将它们与连续性评估混淆导致该领域在v1.0所命名的属性上投资不足。

View on arXiv Download PDF AI Translation

cs.AI / 116 / 2604.10985

Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models

回归LLAMAs：在微调视觉语言模型中演化预训练大型语言模型骨干

Horawalavithana, Sameera, Phillips, Lauren, Stewart, Ian, Munikoti, Sai, Pazdernik, Karl

Abstract

Vision-Language Models (VLMs) have rapidly advanced by leveraging powerful pre-trained Large Language Models (LLMs) as core reasoning backbones. As new and more capable LLMs emerge with improved reasoning, instruction-following, and generalization, there is a pressing need to efficiently update existing VLMs to incorporate these advancements. However, the integration of new LLMs into VLMs, particularly how the evolving LLMs contribute to multimodal reasoning, alignment, and task-specific performance remains underexplored. Addressing this gap is important for VLM development, given the rapid evolution of pretrained LLM backbones. This study presents a controlled and systematic investigation of how changes in the pretrained LLM backbone affect downstream VLM task performance. By having the vision encoder, training data, and post-training algorithm remain same across LLAMA-1, LLAMA-2, and LLAMA-3 based VLMs, we find that newer LLM backbones do not always lead to better VLMs, but the performance depends on the downstream VLM task. For example, in visual question and answering tasks, newer LLM backbones tend to solve different questions rather than just more questions, and our analysis shows this is driven by differences in how the models process information, including better calibrated confidence and more stable internal representations. We also find that some VLM capabilities appear only in the newest LLM generation, while tasks that depend mainly on visual understanding see little benefit from a newer LLM backbone.

Chinese Translation

视觉语言模型（VLMs）通过利用强大的预训练大型语言模型（LLMs）作为核心推理骨干，迅速取得了进展。随着更强大且具备改进推理能力、指令遵循能力和泛化能力的新型LLMs的出现，迫切需要高效地更新现有VLMs以融合这些进步。然而，如何将新LLMs整合进VLMs，特别是不断演化的LLMs如何促进多模态推理、对齐及任务特定性能，仍未得到充分探讨。鉴于预训练LLM骨干的快速演变，解决这一问题对于VLM的发展至关重要。本研究对预训练LLM骨干的变化如何影响下游VLM任务性能进行了有控制且系统的调查。通过保持视觉编码器、训练数据及后训练算法在基于LLAMA-1、LLAMA-2和LLAMA-3的VLM中一致，我们发现更新的LLM骨干并不总是带来更优的VLM性能，表现依赖于具体的下游VLM任务。例如，在视觉问答任务中，更新的LLM骨干倾向于解决不同类型的问题，而非仅仅是更多的问题。我们的分析表明，这种现象源于模型处理信息方式的差异，包括更佳的置信度校准和更稳定的内部表征。我们还发现，某些VLM能力仅在最新一代LLM中出现，而主要依赖视觉理解的任务则几乎未从更新的LLM骨干中受益。

View on arXiv Download PDF AI Translation

cs.AI / 117 / 2604.10988

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

WebForge：打破浏览器代理基准测试中的真实性-可复现性-可扩展性三难困境

Yuan, Peng, Yin, Yuyang, Cai, Yuxuan, Wei, Zheng

Abstract

Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline -- Plan, Generate, Refine, and Validate -- that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, while cross-domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at https://github.com/yuandaxia2001/WebForge.

Chinese Translation

现有的浏览器代理基准测试面临一个根本性的三难困境：真实网站基准由于内容漂移缺乏可复现性，受控环境通过省略真实网络噪声牺牲了真实性，而两者都依赖昂贵的人工策划，限制了可扩展性。我们提出了WebForge，这是首个通过四代理流水线——计划（Plan）、生成（Generate）、优化（Refine）和验证（Validate）——实现完全自动化的框架，能够端到端生成交互式、自包含的网页环境，无需人工标注。一个七维难度控制框架沿导航深度、视觉复杂度、推理难度等维度构建任务设计，实现了超越单一综合分数的系统能力分析。基于WebForge，我们构建了WebForge-Bench，一个涵盖7个领域、3个难度等级共934个任务的基准测试。多模型实验表明，难度分层有效区分了模型能力，而跨领域分析揭示了综合指标无法体现的能力偏差。综合结果证实，多维度评估揭示了单一综合分数无法捕捉的独特能力特征。代码和基准测试已公开，地址：https://github.com/yuandaxia2001/WebForge。

View on arXiv Download PDF AI Translation

cs.AI / 118 / 2604.10989

MAFIG: Multi-agent Driven Formal Instruction Generation Framework

MAFIG：多智能体驱动的形式化指令生成框架

Zhao, Shixing, Si, Zheng, Ouyang, Pengpeng, Hu, Zhengqing, Zhu, Wanqi, Chen, Dong, Guo, Yibo, Xu, Mingliang

Abstract

Emergency situations in scheduling systems often trigger local functional failures that undermine system stability and even cause system collapse. Existing methods primarily rely on robust scheduling or reactive scheduling, handling emergencies through predefined rules or rescheduling strategies. However, the diversity and unpredictability of real-world emergencies make them difficult to anticipate, which limits the adaptability of these methods in complex scenarios. Recent studies have shown that Large Language Models (LLMs) possess strong potential for complex scheduling tasks because of their extensive prior knowledge and strong reasoning capabilities. Nevertheless, the high inference latency of LLMs and the lengthy contextual information of scheduling systems significantly hinder their application for emergency handling. To mitigate these issues, we propose the Multi-agent Driven Formal Instruction Generation Framework (MAFIG). The framework constrains the decision scope to local functional modules affected by emergency situations and repairs scheduling logic rapidly by generating formal instructions. MAFIG contains a Perception Agent and an Emergency Decision Agent, which mitigates the adverse impact of lengthy system contexts on emergency decision-making. We further introduce span-focused loss-driven local distillation mechanism (SFL) to transfer the decision-making capability of powerful Cloud Large Language Models (C-LLMs) to lightweight local models, reducing inference latency while preserving decision-making effectiveness. Experiments in the Port, Warehousing, and Deck scheduling datasets show success rates of 98.49\%, 94.97\%, and 97.50\%, with average processing times of 0.33 s, 0.23 s, and 0.19 s. These results demonstrate that MAFIG effectively mitigates the impact of emergencies and improves the robustness and adaptability of scheduling systems.

Chinese Translation

调度系统中的紧急情况常常引发局部功能故障，破坏系统稳定性，甚至导致系统崩溃。现有方法主要依赖于鲁棒调度或响应式调度，通过预定义规则或重新调度策略来处理紧急情况。然而，现实世界紧急情况的多样性和不可预测性使其难以预见，限制了这些方法在复杂场景中的适应性。近期研究表明，大型语言模型（LLMs）凭借其丰富的先验知识和强大的推理能力，在复杂调度任务中展现出巨大潜力。然而，LLMs推理延迟高及调度系统上下文信息冗长，显著阻碍了其在紧急处理中的应用。为缓解这些问题，我们提出了多智能体驱动的形式化指令生成框架（MAFIG）。该框架将决策范围限制在受紧急情况影响的局部功能模块，通过生成形式化指令快速修复调度逻辑。MAFIG包含感知智能体（Perception Agent）和紧急决策智能体（Emergency Decision Agent），有效减轻了冗长系统上下文对紧急决策的负面影响。我们进一步引入了基于跨度聚焦的损失驱动局部蒸馏机制（span-focused loss-driven local distillation，SFL），将强大的云端大型语言模型（Cloud Large Language Models，C-LLMs）的决策能力迁移至轻量级本地模型，降低推理延迟的同时保持决策效果。在港口、仓储和甲板调度数据集上的实验结果显示，成功率分别达到98.49%、94.97%和97.50%，平均处理时间分别为0.33秒、0.23秒和0.19秒。结果表明，MAFIG有效缓解了紧急情况的影响，提升了调度系统的鲁棒性和适应性。

View on arXiv Download PDF AI Translation

cs.AI / 119 / 2604.11003

Sanity Checks for Agentic Data Science

智能代理数据科学的合理性检验

Rewolinski, Zachary T., Zane, Austin V., Huang, Hao, Singh, Chandan, Wang, Chenglong, Gao, Jianfeng, Yu, Bin

Abstract

Agentic data science (ADS) pipelines have grown rapidly in both capability and adoption, with systems such as OpenAI Codex now able to directly analyze datasets and produce answers to statistical questions. However, these systems can reach falsely optimistic conclusions that are difficult for users to detect. To address this, we propose a pair of lightweight sanity checks grounded in the Predictability-Computability-Stability (PCS) framework for veridical data science. These checks use reasonable perturbations to screen whether an agent can reliably distinguish signal from noise, acting as a falsifiability constraint that can expose affirmative conclusions as unsupported. Together, the two checks characterize the trustworthiness of an ADS output, e.g. whether it has found stable signal, is responding to noise, or is sensitive to incidental aspects of the input. We validate the approach on synthetic data with controlled signal-to-noise ratios, confirming that the sanity checks track ground-truth signal strength. We then demonstrate the checks on 11 real-world datasets using OpenAI Codex, characterizing the trustworthiness of each conclusion and finding that in 6 of the datasets an affirmative conclusion is not well-supported, even though a single ADS run may support one. We further analyze failure modes of ADS systems and find that ADS self-reported confidence is poorly calibrated to the empirical stability of its conclusions.

Chinese Translation

智能代理数据科学（Agentic Data Science, ADS）流程在能力和应用上迅速发展，诸如OpenAI Codex等系统现已能够直接分析数据集并回答统计问题。然而，这些系统可能得出难以被用户察觉的错误乐观结论。为此，我们基于可验证数据科学的可预测性-可计算性-稳定性（Predictability-Computability-Stability, PCS）框架，提出了一对轻量级的合理性检验。这些检验通过合理的扰动筛查代理是否能够可靠地区分信号与噪声，作为一种可证伪性约束，能够揭示支持性结论的不足。两项检验共同描述了ADS输出的可信度，例如其是否发现了稳定的信号、是否对噪声作出响应，或是否对输入的偶然性特征敏感。我们在具有可控信噪比的合成数据上验证了该方法，确认合理性检验能够追踪真实信号强度。随后，我们在11个真实世界数据集上使用OpenAI Codex演示了这些检验，评估了每个结论的可信度，发现其中6个数据集的肯定性结论支持不足，尽管单次ADS运行可能支持该结论。我们进一步分析了ADS系统的失败模式，发现ADS自我报告的置信度与其结论的经验稳定性校准较差。

View on arXiv Download PDF AI Translation

cs.AI / 120 / 2604.11005

Diffusion-CAM: Faithful Visual Explanations for dMLLMs

Diffusion-CAM：面向dMLLMs的可信视觉解释

Zuo, Haomin, Li, Yidi, Yang, Luoxiao, Zhang, Xiaofeng

Abstract

While diffusion Multimodal Large Language Models (dMLLMs) have recently achieved remarkable strides in multimodal generation, the development of interpretability mechanisms has lagged behind their architectural evolution. Unlike traditional autoregressive models that produce sequential activations, diffusion-based architectures generate tokens via parallel denoising, resulting in smooth, distributed activation patterns across the entire sequence. Consequently, existing Class Activation Mapping (CAM) methods, which are tailored for local, sequential dependencies, are ill-suited for interpreting these non-autoregressive behaviors. To bridge this gap, we propose Diffusion-CAM, the first interpretability method specifically tailored for dMLLMs. We derive raw activation maps by differentiably probing intermediate representations in the transformer backbone, accordingly capturing both latent features and their class-specific gradients. To address the inherent stochasticity of these raw signals, we incorporate four key modules to resolve spatial ambiguity and mitigate intra-image confounders and redundant token correlations. Extensive experiments demonstrate that Diffusion-CAM significantly outperforms SoTA methods in both localization accuracy and visual fidelity, establishing a new standard for understanding the parallel generation process of diffusion multimodal systems.

Chinese Translation

尽管扩散多模态大型语言模型（diffusion Multimodal Large Language Models，dMLLMs）在多模态生成领域近期取得了显著进展，但其可解释性机制的发展却落后于架构演进。与产生序列激活的传统自回归模型不同，基于扩散的架构通过并行去噪生成标记，导致整个序列中出现平滑且分布式的激活模式。因此，现有针对局部序列依赖设计的类别激活映射（Class Activation Mapping，CAM）方法不适合解释这些非自回归行为。为弥合这一差距，我们提出了Diffusion-CAM，这是首个专门针对dMLLMs设计的可解释性方法。我们通过可微探测Transformer骨干网络中的中间表示，推导出原始激活图，从而同时捕捉潜在特征及其类别特异梯度。为应对这些原始信号的固有随机性，我们引入了四个关键模块，以解决空间歧义并减轻图像内部混淆因素及冗余标记相关性。大量实验表明，Diffusion-CAM在定位准确性和视觉保真度方面显著优于现有最先进方法，树立了理解扩散多模态系统并行生成过程的新标杆。

View on arXiv Download PDF AI Translation

cs.AI / 121 / 2604.11012

Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics

Min-$k$ 采样：通过相对 Logit 动态解耦截断与温度缩放

Ding, Yuanhao, Li, Meimingwei, Arias, Esteban Garces, Aßenmacher, Matthias, Heumann, Christian, Zhang, Chongsheng

Abstract

The quality of text generated by large language models depends critically on the decoding sampling strategy. While mainstream methods such as Top-$k$, Top-$p$, and Min-$p$ achieve a balance between diversity and accuracy through probability-space truncation, they share an inherent limitation: extreme sensitivity to the temperature parameter. Recent logit-space approaches like Top-$n\sigma$ achieve temperature invariance but rely on global statistics that are susceptible to long-tail noise, failing to capture fine-grained confidence structures among top candidates. We propose \textbf{Min-$k$ Sampling}, a novel dynamic truncation strategy that analyzes the local shape of the sorted logit distribution to identify "semantic cliffs": sharp transitions from high-confidence core tokens to uncertain long-tail tokens. By computing a position-weighted relative decay rate, Min-$k$ dynamically determines truncation boundaries at each generation step. We formally prove that Min-$k$ achieves strict temperature invariance and empirically demonstrate its low sensitivity to hyperparameter choices. Experiments on multiple reasoning benchmarks, creative writing tasks, and human evaluation show that Min-$k$ consistently improves text quality, maintaining robust performance even under extreme temperature settings where probability-based methods collapse. We make our code, models, and analysis tools publicly available.

Chinese Translation

大型语言模型生成文本的质量在很大程度上依赖于解码采样策略。尽管主流方法如 Top-$k$、Top-$p$ 和 Min-$p$ 通过概率空间截断实现了多样性与准确性的平衡，但它们存在一个固有的局限性：对温度参数极其敏感。近期基于 logit 空间的方法如 Top-$n\sigma$ 实现了温度不变性，但依赖于易受长尾噪声影响的全局统计，无法捕捉顶级候选项之间的细粒度置信结构。我们提出了\textbf{Min-$k$ 采样}，这是一种新颖的动态截断策略，通过分析排序后 logit 分布的局部形态，识别“语义悬崖”——从高置信核心词到不确定长尾词的急剧转变。Min-$k$ 通过计算位置加权的相对衰减率，在每个生成步骤动态确定截断边界。我们形式化证明了 Min-$k$ 实现了严格的温度不变性，并通过实验证明其对超参数选择的低敏感性。在多个推理基准、创意写作任务及人工评估中，Min-$k$ 持续提升文本质量，即使在概率方法崩溃的极端温度设置下也能保持稳健表现。我们已公开发布代码、模型及分析工具。

View on arXiv Download PDF AI Translation

cs.AI / 122 / 2604.11035

Introspective Diffusion Language Models

内省扩散语言模型

Yu, Yifan, Jian, Yuqing, Wang, Junxiong, Zhou, Zhongzhu, Zhuang, Donglin, Fang, Xinyu, Yanamandra, Sri, Wu, Xiaoxia, Wu, Qingyang, Song, Shuaiwen Leon, Dao, Tri, Athiwaratkun, Ben, Zou, James, Lai, Fan, Xu, Chenfeng

Abstract

Diffusion language models promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We stem this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand of large-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.

Chinese Translation

扩散语言模型（Diffusion Language Models, DLMs）承诺实现并行生成，但在质量上仍落后于自回归（Autoregressive, AR）模型。我们将这一差距归因于内省一致性的缺失：AR模型与其自身生成内容保持一致，而DLMs往往不然。我们定义了内省接受率（introspective acceptance rate），用于衡量模型是否接受其先前生成的标记。这揭示了为何AR训练具有结构性优势：因果掩码（causal masking）和logit偏移（logit shifting）隐式地强制执行内省一致性。基于此观察，我们提出了内省扩散语言模型（Introspective Diffusion Language Model, I-DLM），该范式在保留扩散式并行解码的同时，继承了AR训练的内省一致性。I-DLM采用了一种新颖的内省跨步解码（Introspective Strided Decoding, ISD）算法，使模型能够在同一次前向传播中验证先前生成的标记并推进新的标记。从系统角度出发，我们基于AR继承的优化构建了I-DLM推理引擎，并进一步通过固定批次调度器（stationary-batch scheduler）进行定制。据我们所知，I-DLM是首个在质量上匹配同规模AR模型且在15个基准测试中同时超越先前DLMs的模型，表现出更优的模型质量和实际服务效率。其在AIME-24和LiveCodeBench-v6上的得分分别达到69.6和45.7，分别超越LLaDA-2.1-mini（16B）超过26和15分。除了质量提升，I-DLM还针对日益增长的大并发服务需求设计，实现了约3倍于先前最先进DLMs的吞吐量。

View on arXiv Download PDF AI Translation

cs.AI / 123 / 2604.11040

Intelligent Approval of Access Control Flow in Office Automation Systems via Relational Modeling

基于关系建模的办公自动化系统访问控制流程智能审批

Liu, Dugang, Chen, Zulong, Xu, Chuanfei, He, Jiaxuan, Ma, Yunlu, Xu, Jia

Abstract

Office automation (OA) systems play a crucial role in enterprise operations and management, with access control flow approval (ACFA) being a key component that manages the accessibility of various resources. However, traditional ACFA requires approval from the person in charge at each step, which consumes a significant amount of manpower and time. Its intelligence is a crucial issue that needs to be addressed urgently by all companies. In this paper, we propose a novel relational modeling-driven intelligent approval (RMIA) framework to automate ACFA. Specifically, our RMIA consists of two core modules: (1) The binary relation modeling module aims to characterize the coupling relation between applicants and approvers and provide reliable basic information for ACFA decision-making from a coarse-grained perspective. (2) The ternary relation modeling module utilizes specific resource information as its core, characterizing the complex relations between applicants, resources, and approvers, and thus provides fine-grained gain information for informed decision-making. Then, our RMIA effectively fuses these two kinds of information to form the final decision. Finally, extensive experiments are conducted on two product datasets and an online A/B test to verify the effectiveness of RMIA.

Chinese Translation

办公自动化（OA）系统在企业运营与管理中发挥着关键作用，其中访问控制流程审批（ACFA）是管理各种资源可访问性的核心环节。然而，传统的ACFA需在每个环节由负责人审批，耗费大量人力和时间，其智能化问题亟需各企业解决。本文提出了一种新颖的基于关系建模驱动的智能审批（RMIA）框架以实现ACFA的自动化。具体而言，RMIA包含两个核心模块：（1）二元关系建模模块，旨在刻画申请人与审批人之间的耦合关系，从粗粒度视角为ACFA决策提供可靠的基础信息；（2）三元关系建模模块，以特定资源信息为核心，刻画申请人、资源与审批人之间的复杂关系，从而为决策提供细粒度的增益信息。随后，RMIA有效融合这两类信息形成最终决策。最后，在两个产品数据集及线上A/B测试中进行了大量实验，以验证RMIA的有效性。

View on arXiv Download PDF AI Translation

cs.AI / 124 / 2604.11041

From Topology to Trajectory: LLM-Driven World Models For Supply Chain Resilience

从拓扑到轨迹：基于大型语言模型的供应链韧性世界模型

Luo, Jia

Abstract

Semiconductor supply chains face unprecedented resilience challenges amidst global geopolitical turbulence. Conventional Large Language Model (LLM) planners, when confronting such non-stationary "Policy Black Swan" events, frequently suffer from Decision Paralysis or a severe Grounding Gap due to the absence of physical environmental modeling. This paper introduces ReflectiChain, a cognitive agentic framework tailored for resilient macroeconomic supply chain planning. The core innovation lies in the integration of Latent Trajectory Rehearsal powered by a generative world model, which couples reflection-in-action (System 2 deliberation) with delayed reflection-on-action. Furthermore, we leverage a Retrospective Agentic RL mechanism to enable autonomous policy evolution during the deployment phase (test-time). Evaluations conducted on our high-fidelity benchmark, Semi-Sim, demonstrate that under extreme scenarios such as export bans and material shortages, ReflectiChain achieves a 250% improvement in average step rewards over the strongest LLM baselines. It successfully restores the Operability Ratio (OR) from a deficient 13.3% to over 88.5% while ensuring robust gradient convergence. Ablation studies further underscore that the synergy between physical grounding constraints and double-loop learning is fundamental to bridging the gap between semantic reasoning and physical reality for long-horizon strategic planning.

Chinese Translation

半导体供应链在全球地缘政治动荡中面临前所未有的韧性挑战。传统的大型语言模型（LLM）规划者在面对这种非平稳的“政策黑天鹅”事件时，常常遭遇决策瘫痪或严重的基础缺口，因为缺乏物理环境建模。本文介绍了ReflectiChain，一个为韧性宏观经济供应链规划量身定制的认知代理框架。其核心创新在于将由生成世界模型驱动的潜在轨迹排练与行动中的反思（系统2深思）和延迟的行动反思相结合。此外，我们利用回顾性代理强化学习（Retrospective Agentic RL）机制，使得在部署阶段（测试时）能够实现自主政策演变。在我们高保真基准Semi-Sim上进行的评估表明，在出口禁令和材料短缺等极端场景下，ReflectiChain在平均步骤奖励上比最强的LLM基线提高了250%。它成功地将可操作性比率（Operability Ratio, OR）从不足13.3%恢复到超过88.5%，同时确保了稳健的梯度收敛。消融研究进一步强调，物理基础约束与双环学习之间的协同作用对于弥合语义推理与物理现实之间的差距在长远战略规划中至关重要。

View on arXiv Download PDF AI Translation

cs.AI / 125 / 2604.11043

EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

EmergentBridge：提升统一多模态嵌入模型中的零样本跨模态迁移能力

Xie, Jincheng, Xiao, Xingchen, Liu, Runheng, Huang, Zhongyi, Zheng, Yu, Huang, Heyan

Abstract

Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image--text), leaving \emph{unpaired} modality pairs (e.g., audio$\leftrightarrow$depth, infrared$\leftrightarrow$audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose \textbf{EmergentBridge}, an embedding-level bridging framework that improves performance on these unpaired pairs \emph{without requiring exhaustive pairwise supervision}. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce \emph{gradient interference}, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a \emph{noisy bridge anchor} (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.

Chinese Translation

统一的多模态嵌入空间支撑着跨模态检索和零样本识别等实际应用。然而，在许多实际部署中，监督信号仅存在于少数模态对（例如图像—文本），导致未配对的模态对（如音频↔深度，红外↔音频）之间连接较弱，从而在零样本迁移任务中表现不佳。因此，解决这种稀疏配对问题对于在无需收集详尽成对数据的情况下，将统一嵌入系统扩展到新任务至关重要。我们提出了EmergentBridge，一种嵌入层级的桥接框架，能够在不依赖详尽成对监督的前提下提升未配对模态对的性能。我们的核心观察是，简单地将新模态对齐到合成的代理嵌入上会引入梯度干扰，破坏现有检索/分类所依赖的锚点对齐结构。EmergentBridge通过(i)学习一个映射，从锚点嵌入生成一个“噪声桥接锚点”（即已对齐模态的代理嵌入），以及(ii)仅在与锚点对齐方向正交的子空间中强制代理对齐，从而在保持锚点对齐的同时增强非锚点连接性。我们在涵盖多种模态的九个数据集上进行了验证，EmergentBridge在零样本分类和检索任务中持续优于先前的绑定基线，展现出强大的新兴对齐能力。

View on arXiv Download PDF AI Translation

cs.AI / 126 / 2604.11065

AI Integrity: A New Paradigm for Verifiable AI Governance

人工智能诚信：可验证人工智能治理的新范式

Lee, Seulki

Abstract

AI systems increasingly shape high-stakes decisions in healthcare, law, defense, and education, yet existing governance paradigms -- AI Ethics, AI Safety, and AI Alignment -- share a common limitation: they evaluate outcomes rather than verifying the reasoning process itself. This paper introduces AI Integrity, a concept defined as a state in which the Authority Stack of an AI system -- its layered hierarchy of values, epistemological standards, source preferences, and data selection criteria -- is protected from corruption, contamination, manipulation, and bias, and maintained in a verifiable manner. We distinguish AI Integrity from the three existing paradigms, define the Authority Stack as a 4-layer cascade model (Normative, Epistemic, Source, and Data Authority) grounded in established academic frameworks -- Schwartz Basic Human Values for normative authority, Walton argumentation schemes with GRADE/CEBM hierarchies for epistemic authority, and Source Credibility Theory for source authority -- characterize the distinction between legitimate cascading and Authority Pollution, and identify Integrity Hallucination as the central measurable threat to value consistency. We further specify the PRISM (Profile-based Reasoning Integrity Stack Measurement) framework as the operational methodology, defining six core metrics and a phased research roadmap. Unlike normative frameworks that prescribe which values are correct, AI Integrity is a procedural concept: it requires that the path from evidence to conclusion be transparent and auditable, regardless of which values a system holds.

Chinese Translation

人工智能系统在医疗、法律、国防和教育等高风险决策领域的影响日益显著，然而现有的治理范式——人工智能伦理（AI Ethics）、人工智能安全（AI Safety）和人工智能对齐（AI Alignment）——存在一个共同的局限：它们侧重于评估结果，而非验证推理过程本身。本文提出了“人工智能诚信”（AI Integrity）这一概念，定义为人工智能系统的权威层级（Authority Stack）——其价值观层级、认识论标准、信息源偏好及数据选择标准——免受腐败、污染、操控和偏见影响，并以可验证的方式加以维护的状态。我们将人工智能诚信与现有三种范式区分开来，定义权威层级为基于既有学术框架的四层级级联模型（规范权威Normative、认识论权威Epistemic、信息源权威Source及数据权威Data Authority），其中规范权威基于Schwartz基本人类价值观，认识论权威基于Walton论证方案及GRADE/CEBM层级，信息源权威基于信息源可信度理论（Source Credibility Theory）。我们刻画了合法级联与权威污染（Authority Pollution）之间的区别，并指出诚信幻觉（Integrity Hallucination）是价值一致性的核心可测威胁。进一步地，我们提出了PRISM（基于画像的推理诚信层级测量，Profile-based Reasoning Integrity Stack Measurement）框架作为操作方法，定义了六项核心指标及分阶段的研究路线图。与规定何种价值观正确的规范性框架不同，人工智能诚信是一个程序性概念：它要求从证据到结论的路径必须透明且可审计，无论系统持有何种价值观。

View on arXiv Download PDF AI Translation

cs.AI / 127 / 2604.11070

PRISM Risk Signal Framework: Hierarchy-Based Red Lines for AI Behavioral Risk

PRISM风险信号框架：基于层级的人工智能行为风险红线

Lee, Seulki

Abstract

Current approaches to AI safety define red lines at the case level: specific prompts, specific outputs, specific harms. This paper argues that red lines can be set more fundamentally -- at the level of value, evidence, and source hierarchies that govern AI reasoning. Using the PRISM (Profile-based Reasoning Integrity Stack Measurement) framework, we define a taxonomy of 27 behavioral risk signals derived from structural anomalies in how AI systems prioritize values (L4), weight evidence types (L3), and trust information sources (L2). Each signal is evaluated through a dual-threshold principle combining absolute rank position and relative win-rate gap, producing a two-tier classification (Confirmed Risk vs. Watch Signal). The hierarchy-based approach offers three advantages over case-specific red lines: it is anticipatory rather than reactive (detecting dangerous reasoning structures before they produce harmful outputs), comprehensive rather than enumerative (a single value-hierarchy signal subsumes an unlimited number of case-specific violations), and measurable rather than subjective (grounded in empirical forced-choice data). We demonstrate the framework's detection capacity using approximately 397,000 forced-choice responses from 7 AI models across three Authority Stack layers, showing that the signal taxonomy successfully discriminates between models with structurally extreme profiles, models with context-dependent risk, and models with balanced hierarchies.

Chinese Translation

当前的人工智能安全方法在案例层面定义红线：特定的提示、特定的输出、特定的危害。本文认为，红线可以在更根本的层面上设定——在支配人工智能推理的价值、证据和来源层级的层面上。利用PRISM（基于特征的推理完整性堆栈测量）框架，我们定义了一种由27种行为风险信号构成的分类法，这些信号源于人工智能系统在优先考虑价值（L4）、加权证据类型（L3）和信任信息来源（L2）方面的结构异常。每个信号通过结合绝对排名位置和相对胜率差距的双阈值原则进行评估，产生两级分类（确认风险与观察信号）。基于层级的方法相较于案例特定的红线具有三大优势：它是前瞻性的而非反应性的（在产生有害输出之前检测危险的推理结构），是全面的而非列举性的（单一的价值层级信号涵盖无限数量的案例特定违规），是可测量的而非主观的（基于经验的强制选择数据）。我们通过使用来自三个权威堆栈层的7个人工智能模型的约397,000个强制选择响应，展示了该框架的检测能力，结果表明信号分类法成功区分了具有结构极端特征的模型、具有上下文依赖风险的模型和具有平衡层级的模型。

View on arXiv Download PDF AI Translation

cs.AI / 128 / 2604.11072

Hodoscope: Unsupervised Monitoring for AI Misbehaviors

Hodoscope：用于AI异常行为的无监督监控

Zhong, Ziqian, Saxena, Shashwat, Raghunathan, Aditi

Abstract

Existing approaches to monitoring AI agents rely on supervised evaluation: human-written rules or LLM-based judges that check for known failure modes. However, novel misbehaviors may fall outside predefined categories entirely and LLM-based judges can be unreliable. To address this, we formulate unsupervised monitoring, drawing an analogy to unsupervised learning. Rather than checking for specific misbehaviors, an unsupervised monitor assists humans in discovering problematic agent behaviors without prior assumptions about what counts as problematic, leaving that determination to the human. We observe that problematic behaviors are often distinctive: a model exploiting a benchmark loophole exhibits actions absent from well-behaved baselines, and a vulnerability unique to one evaluation manifests as behavioral anomalies when the same model runs across multiple benchmarks. This motivates using group-wise behavioral differences as the primary signal for unsupervised monitoring. We introduce Hodoscope, a tool that operationalizes this insight. Hodoscope compares behavior distributions across groups and highlights distinctive and potentially suspicious action patterns for human review. Using Hodoscope, we discover a previously unknown vulnerability in the Commit0 benchmark (unsquashed git history allowing ground-truth recovery, inflating scores for at least five models) and independently recover known exploits on ImpossibleBench and SWE-bench. Quantitative evaluation estimates that our method reduces review effort by 6-23$\times$ compared to naive uniform sampling. Finally, we show that behavior descriptions discovered through Hodoscope could improve the detection accuracy of LLM-based judges, demonstrating a path from unsupervised to supervised monitoring.

Chinese Translation

现有的AI代理监控方法依赖于有监督评估：通过人工编写的规则或基于大型语言模型（LLM）的评判者来检测已知的失败模式。然而，新的异常行为可能完全超出预定义类别的范围，且基于LLM的评判者可能不够可靠。为此，我们提出了无监督监控的概念，借鉴无监督学习的思路。无监督监控并不针对特定的异常行为进行检测，而是辅助人类发现问题代理行为，无需事先假设何为问题行为，将最终判定权交由人类。我们观察到，问题行为通常具有显著特征：模型利用基准测试漏洞时，其行为在表现良好的基线模型中不存在；某一评测独有的漏洞会在同一模型跨多个基准测试时表现为行为异常。这一现象促使我们将组间行为差异作为无监督监控的主要信号。基于此，我们引入了Hodoscope工具，能够比较不同组的行为分布，突出显示具有特征性且潜在可疑的行为模式供人类审查。借助Hodoscope，我们发现了Commit0基准测试中一个此前未知的漏洞（未压缩的git历史允许恢复真实标签，导致至少五个模型的得分被人为抬高），并独立复现了ImpossibleBench和SWE-bench上的已知漏洞。定量评估表明，与简单均匀采样相比，我们的方法可将审查工作量减少6至23倍。最后，我们展示了通过Hodoscope发现的行为描述能够提升基于LLM的评判者的检测准确率，证明了从无监督监控到有监督监控的可行路径。

View on arXiv Download PDF AI Translation

cs.AI / 129 / 2604.11077

Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation

朝向主动信息探测：客户服务聊天机器人从对话中获取价值

Huang, Chen, Jiang, Zitan, Zou, Changyi, Lei, Wenqiang, Ng, See-Kiong

Abstract

Customer service chatbots are increasingly expected to serve not merely as reactive support tools for users, but as strategic interfaces for harvesting high-value information and business intelligence. In response, we make three main contributions. 1) We introduce and define a novel task of Proactive Information Probing, which optimizes when to probe users for pre-specified target information while minimizing conversation turns and user friction. 2) We propose PROCHATIP, a proactive chatbot framework featuring a specialized conversation strategy module trained to master the delicate timing of probes. 3) Experiments demonstrate that PROCHATIP significantly outperforms baselines, exhibiting superior capability in both information probing and service quality. We believe that our work effectively redefines the commercial utility of chatbots, positioning them as scalable, cost-effective engines for proactive business intelligence. Our code is available at https://github.com/SCUNLP/PROCHATIP.

Chinese Translation

客户服务聊天机器人越来越被期望不仅仅作为用户的反应支持工具，而是作为获取高价值信息和商业智能的战略接口。对此，我们做出了三项主要贡献。1）我们引入并定义了一项新任务——主动信息探测（Proactive Information Probing），该任务优化了何时向用户探询预设目标信息，同时最小化对话轮次和用户摩擦。2）我们提出了PROCHATIP，一个主动聊天机器人框架，具有一个专门的对话策略模块，经过训练以掌握探测的微妙时机。3）实验表明，PROCHATIP显著优于基线，展现出在信息探测和服务质量方面的卓越能力。我们相信，我们的工作有效地重新定义了聊天机器人的商业效用，将其定位为可扩展、成本效益高的主动商业智能引擎。我们的代码可在https://github.com/SCUNLP/PROCHATIP获取。

View on arXiv Download PDF AI Translation

cs.AI / 130 / 2604.11088

Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents

代理规则是塑造还是扭曲？护栏胜过指导在编码代理中的作用

Zhang, Xing, Wang, Guanghui, Cui, Yanwei, Qiu, Wei, Li, Ziyuan, Zhu, Bing, He, Peiyang

Abstract

Developers increasingly guide AI coding agents through natural language instruction files (e.g., CLAUDE.md, .cursorrules), yet no controlled study has measured whether these rules actually improve agent performance or which properties make a rule beneficial. We scrape 679 such files (25,532 rules) from GitHub and conduct the first large-scale empirical evaluation, running over 5,000 agent runs with a state-of-the-art coding agent on SWE-bench Verified. Rules improve performance by 7--14 percentage points, but random rules help as much as expert-curated ones -- suggesting rules work through context priming rather than specific instruction. Negative constraints ("do not refactor unrelated code") are the only individually beneficial rule type, while positive directives ("follow code style") actively hurt -- a pattern we analyze through the lens of potential-based reward shaping (PBRS). Moreover, individual rules are mostly harmful in isolation yet collectively helpful, with no degradation up to 50 rules. These findings expose a hidden reliability risk -- well-intentioned rules routinely degrade agent performance -- and provide a clear principle for safe agent configuration: constrain what agents must not do, rather than prescribing what they should.

Chinese Translation

开发者越来越多地通过自然语言指令文件（例如，CLAUDE.md，.cursorrules）来指导人工智能编码代理，但尚无控制研究测量这些规则是否真正改善代理性能或哪些属性使规则有益。我们从GitHub抓取了679个此类文件（25,532条规则），并进行首次大规模实证评估，在SWE-bench Verified上运行超过5,000次代理运行。规则使性能提高了7到14个百分点，但随机规则的效果与专家策划的规则相当——这表明规则通过上下文启动而非具体指令发挥作用。负约束（“不重构无关代码”）是唯一具有单独益处的规则类型，而积极指令（“遵循代码风格”）则会产生负面影响——我们通过基于潜力的奖励塑形（PBRS）的视角分析这一模式。此外，单个规则在孤立时大多有害，但集体上是有益的，最多可容忍50条规则而不降级。这些发现揭示了一个隐藏的可靠性风险——出于良好意图的规则常常会降低代理性能——并为安全的代理配置提供了明确原则：限制代理不得做的事情，而不是规定他们应该做的事情。

View on arXiv Download PDF AI Translation

cs.AI / 131 / 2604.11104

Frugal Knowledge Graph Construction with Local LLMs: A Zero-Shot Pipeline, Self-Consistency and Wisdom of Artificial Crowds

节约型知识图谱构建与本地大语言模型：零样本管道、自我一致性与人工群体智慧

Jourlin, Pierre

Abstract

This paper presents an empirical study of a multi-model zero-shot pipeline for knowledge graph construction and exploitation, executed entirely through local inference on consumer-grade hardware. We propose a reproducible evaluation framework integrating two external benchmarks (DocRED, HotpotQA), WebQuestionsSP-style synthetic data, and the RAGAS evaluation framework in an automated pipeline. On 500 document-level relations, our system achieves an F1 of 0.70 $\pm$ 0.041 in zero-shot, compared to 0.80 for supervised DREEAM. Text-to-query achieves an accuracy of 0.80 $\pm$ 0.06 on 200 samples. Multi-hop reasoning achieves an Exact Match (EM) of 0.46$\pm$0.04 on 500 HotpotQA questions, with a RAGAS faithfulness of 0.96 $\pm$ 0.04 on 50 samples. Beyond the pipeline, we study diversity mechanisms for difficult multi-hop reasoning. On 181 questions unsolvable at zero temperature, self-consistency (k=5, T =0.7) recovers up to 23% EM with a single Mixture-of-Experts (MoE) model, but the cross-model oracle (3 architectures x 5 samples) reaches 46.4%. We highlight an agreement paradox: strong consensus among samples signals collective hallucination rather than a reliable answer, echoing the work of Moussa{\"i}d et al. on the wisdom of crowds. Extending to the full pipeline (500 questions), self-consistency (k=3) raises EM from 0.46 to 0.48 $\pm$ 0.04. A confidence-routing cascade mechanism (Phi-4 $\rightarrow$ GPT-OSS, k=5) achieves an EM of 0.55 $\pm$ 0.04, the best result obtained, with 45.4% of questions rerouted. Finally, we show that V3 prompt engineering applied to other models does not reproduce the gains observed with Gemma-4, confirming the specific prompt/model interaction. The entire system runs in $\sim$5 h on a single RTX 3090, without any training, for an estimated carbon footprint of 0.09 kg CO2 eq.

Chinese Translation

本文呈现了一项关于知识图谱构建与利用的多模型零样本管道的实证研究，该研究完全通过消费级硬件上的本地推理执行。我们提出了一个可重复的评估框架，整合了两个外部基准（DocRED、HotpotQA）、WebQuestionsSP风格的合成数据以及RAGAS评估框架，形成一个自动化管道。在500个文档级关系上，我们的系统在零样本情况下实现了F1值为0.70 ± 0.041，而监督学习的DREEAM为0.80。文本到查询在200个样本上达到了0.80 ± 0.06的准确率。多跳推理在500个HotpotQA问题上实现了0.46 ± 0.04的精确匹配（EM），在50个样本上的RAGAS可信度为0.96 ± 0.04。除了管道之外，我们还研究了针对困难多跳推理的多样性机制。在181个在零温度下无法解决的问题上，自我一致性（k=5, T=0.7）通过单个专家混合模型（Mixture-of-Experts, MoE）恢复了高达23%的EM，而跨模型的oracle（3种架构 x 5个样本）达到了46.4%。我们强调了一个一致性悖论：样本之间的强一致性信号集体幻觉而非可靠答案，这与Moussaïd等人关于群体智慧的研究相呼应。扩展到完整管道（500个问题），自我一致性（k=3）将EM从0.46提高到0.48 ± 0.04。一个信心路由级联机制（Phi-4 → GPT-OSS, k=5）实现了0.55 ± 0.04的EM，这是获得的最佳结果，其中45.4%的问题被重新路由。最后，我们展示了将V3提示工程应用于其他模型并未重现Gemma-4观察到的增益，确认了特定提示/模型交互的存在。整个系统在单个RTX 3090上运行约5小时，无需任何训练，预计碳足迹为0.09 kg CO2当量。

View on arXiv Download PDF AI Translation

cs.AI / 132 / 2604.11120

Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs

不受欢迎的人：单一方法的安全评估对于人格赋予的LLM是不完整的

Li, Wenkai, Yang, Fan, Mehta, Shaunak A., Onoue, Koichi

Abstract

Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures ($\rho = 0.71$--$0.96$), but activation-steering vulnerability diverges sharply and cannot be predicted from prompt-side rankings: Llama-3.1-8B is substantially more AS-vulnerable, whereas Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. The most striking illustration of this divergence is the *prosocial persona paradox*: on Llama-3.1-8B, P12 (high conscientiousness + high agreeableness) is among the safest personas under prompting yet becomes the highest-ASR activation-steered persona (ASR ~0.818). This is an inversion robust to coefficient ablation and matched-strength calibration, and replicated on DeepSeek-R1-Distill-Qwen-32B. A trait refusal alignment framework, in which conscientiousness is strongly anti-aligned with refusal on Llama-3.1-8B, offers a partial geometric account. Reasoning provides only partial protection: two 32B reasoning models reach 15--18% prompt-side ASR, and activation steering separates them sharply in both baseline susceptibility and persona-specific vulnerability. Heuristic trace diagnostics suggest that the safer model retains stronger policy recall and self-correction behavior, not merely longer reasoning.

Chinese Translation

人格赋予定制了LLM的行为，但安全评估几乎总是仅研究基于提示的人格。我们表明这种做法是不完整的：提示和激活引导暴露出*不同*的、依赖于架构的脆弱性特征，且仅用一种方法进行测试可能会遗漏模型的主要失效模式。在来自三种架构系列的四个标准模型上，对5,568个评估条件的分析显示，系统提示下的人格危险排名在所有架构中保持一致（$ ho = 0.71$--$0.96$），但激活引导的脆弱性则显著分化，且无法从提示侧排名进行预测：Llama-3.1-8B在激活引导下的脆弱性显著更高，而Gemma-3-27B和Qwen3.5则对提示更为脆弱。这种分歧最引人注目的例证是*亲社会人格悖论*：在Llama-3.1-8B上，P12（高尽责性 + 高宜人性）在提示下是最安全的人格之一，但在激活引导下却成为最高ASR（激活引导脆弱性）的角色（ASR ~0.818）。这种反转在系数消融和匹配强度校准下是稳健的，并在DeepSeek-R1-Distill-Qwen-32B上得到了重复。一个特质拒绝对齐框架，其中尽责性与Llama-3.1-8B上的拒绝强烈反向对齐，提供了部分几何解释。推理仅提供部分保护：两个32B推理模型在提示侧的ASR达到15%--18%，而激活引导在基线易感性和人格特异性脆弱性上将它们明显分开。启发式追踪诊断表明，更安全的模型保持了更强的政策回忆和自我纠正行为，而不仅仅是更长的推理过程。

View on arXiv Download PDF AI Translation

cs.AI / 133 / 2604.11125

A Proposed Biomedical Data Policy Framework to Reduce Fragmentation, Improve Quality, and Incentivize Sharing in Indian Healthcare in the era of Artificial Intelligence and Digital Health

提出的生物医学数据政策框架：在人工智能和数字健康时代减少碎片化、提高质量并激励印度医疗保健中的数据共享

Mehta, Nikhil, Gupta, Sachin, Anand, Gouri RP

Abstract

India generates vast biomedical data through postgraduate research, government hospital services and audits, government schemes, private hospitals and their electronic medical record (EMR) systems, insurance programs and standalone clinics. Unfortunately, these resources remain fragmented across institutional silos and vendor-locked EMR systems. The fundamental bottleneck is not technological but economic and academic. There is a systemic misalignment of incentives that renders data sharing a high-risk, low-reward activity for individual researchers and institutions. Until India's academic promotion criteria, institutional rankings, and funding mechanisms explicitly recognize and reward data curation as professional work, the nation's AI ambitions will remain constrained by fragmented, non-interoperable datasets. We propose a multi-layered incentive architecture integrating recognition of data papers in National Medical Commission (NMC) promotion criteria, incorporation of open data metrics into the National Institutional Ranking Framework (NIRF), adoption of Shapley Value-based revenue sharing in federated learning consortia, and establishment of institutional data stewardship as a mainstream professional role. Critical barriers to data sharing, including fear of data quality scrutiny, concerns about misinterpretation, and selective reporting bias, are addressed through mandatory data quality assessment, structured peer review, and academic credit for auditing roles. The proposed framework directly addresses regulatory constraints introduced by the Digital Personal Data Protection Act 2023 (DPDPA), while constructively engaging with the National Data Sharing and Accessibility Policy (NDSAP), Biotech-PRIDE Guidelines, and the Anusandhan National Research Foundation (ANRF) guidelines.

Chinese Translation

印度通过研究生研究、政府医院服务和审计、政府计划、私立医院及其电子病历（EMR）系统、保险项目和独立诊所生成了大量生物医学数据。不幸的是，这些资源在机构孤岛和供应商锁定的EMR系统中仍然处于碎片化状态。根本的瓶颈不是技术问题，而是经济和学术问题。存在系统性的激励不对齐，使得数据共享对个别研究人员和机构而言成为高风险、低回报的活动。直到印度的学术晋升标准、机构排名和资金机制明确承认并奖励数据管理作为专业工作，该国的人工智能雄心将受到碎片化、不可互操作数据集的限制。我们提出一个多层次的激励架构，整合在国家医学委员会（NMC）晋升标准中对数据论文的认可，将开放数据指标纳入国家机构排名框架（NIRF），在联合学习联盟中采用基于Shapley值的收益分享，并将机构数据管理建立为主流专业角色。通过强制性的数据质量评估、结构化的同行评审和对审计角色的学术认可，解决了数据共享的关键障碍，包括对数据质量审查的恐惧、对误解的担忧和选择性报告偏见。所提框架直接应对了2023年数字个人数据保护法（DPDPA）所引入的监管限制，同时与国家数据共享与可及性政策（NDSAP）、生物技术-PRIDE指南和Anusandhan国家研究基金会（ANRF）指南进行建设性互动。

View on arXiv Download PDF AI Translation

cs.AI / 134 / 2604.11131

MADQRL: Distributed Quantum Reinforcement Learning Framework for Multi-Agent Environments

MADQRL：用于多智能体环境的分布式量子强化学习框架

Sawaika, Abhishek, Chen, Samuel Yen-Chi, Parampalli, Udaya, Buyya, Rajkumar

Abstract

Reinforcement learning (RL) is one of the most practical ways to learn from real-life use-cases. Motivated from the cognitive methods used by humans makes it a widely acceptable strategy in the field of artificial intelligence. Most of the environments used for RL are often high-dimensional, and traditional RL algorithms becomes computationally expensive and challenging to effectively learn from such systems. Recent advancements in practical demonstration of quantum computing (QC) theories, such as compact encoding, enhanced representation and learning algorithms, random sampling, or the inherent stochastic nature of quantum systems, have opened up new directions to tackle these challenges. Quantum reinforcement learning (QRL) is seeking significant traction over the past few years. However, the current state of quantum hardware is not enough to cater for such high-dimensional environments with complex multi-agent setup. To tackle this issue, we propose a distributed framework for QRL where multiple agents learn independently, distributing the load of joint training from individual machines. Our method works well for environments with disjoint sets of action and observation spaces, but can also be extended to other systems with reasonable approximations. We analyze the proposed method on cooperative-pong environment and our results indicate ~10% improvement from other distribution strategies, and ~5% improvement from classical models of policy representation.

Chinese Translation

强化学习（RL）是从现实案例中学习的最实用方法之一。受到人类认知方法的启发，使其成为人工智能领域广泛接受的策略。大多数用于强化学习的环境往往是高维的，传统的强化学习算法在有效学习这些系统时变得计算成本高昂且具有挑战性。近期在量子计算（QC）理论的实际应用方面取得的进展，例如紧凑编码、增强表示和学习算法、随机采样，或量子系统固有的随机特性，为解决这些挑战开辟了新的方向。量子强化学习（QRL）在过去几年中获得了显著关注。然而，目前的量子硬件状态不足以满足具有复杂多智能体设置的高维环境的需求。为了解决这个问题，我们提出了一种分布式QRL框架，其中多个智能体独立学习，分散来自各个机器的联合训练负载。我们的方法在具有不相交的动作和观察空间的环境中表现良好，但也可以扩展到其他系统，前提是合理的近似。我们在合作乒乓（cooperative-pong）环境中分析了所提出的方法，结果表明与其他分布策略相比提高了约10%，与经典策略表示模型相比提高了约5%。

View on arXiv Download PDF AI Translation

cs.AI / 135 / 2604.11137

From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning

从答案到论证：基于Toulmin引导的课程目标条件学习实现可信临床诊断推理

Zhan, Chen, Tan, Xiaoyu, Ma, Gengchen, Xiong, Yu-Jie, Jiang, Xiaoyan, Qiu, Xihe

Abstract

The integration of Large Language Models (LLMs) into clinical decision support is critically obstructed by their opaque and often unreliable reasoning. In the high-stakes domain of healthcare, correct answers alone are insufficient; clinical practice demands full transparency to ensure patient safety and enable professional accountability. A pervasive and dangerous weakness of current LLMs is their tendency to produce "correct answers through flawed reasoning." This issue is far more than a minor academic flaw; such process errors signal a fundamental lack of robust understanding, making the model prone to broader hallucinations and unpredictable failures when faced with real-world clinical complexity. In this paper, we establish a framework for trustworthy clinical argumentation by adapting the Toulmin model to the diagnostic process. We propose a novel training pipeline: Curriculum Goal-Conditioned Learning (CGCL), designed to progressively train LLM to generate diagnostic arguments that explicitly follow this Toulmin structure. CGCL's progressive three-stage curriculum systematically builds a solid clinical argument: (1) extracting facts and generating differential diagnoses; (2) justifying a core hypothesis while rebutting alternatives; and (3) synthesizing the analysis into a final, qualified conclusion. We validate CGCL using T-Eval, a quantitative framework measuring the integrity of the diagnosis reasoning. Experiments show that our method achieves diagnostic accuracy and reasoning quality comparable to resource-intensive Reinforcement Learning (RL) methods, while offering a more stable and efficient training pipeline.

Chinese Translation

大型语言模型（LLMs）在临床决策支持中的应用受到其推理过程不透明且常常不可靠的严重阻碍。在高风险的医疗领域，仅有正确答案是不够的；临床实践要求完全透明，以确保患者安全并实现专业责任追究。当前LLMs普遍且危险的弱点在于其倾向于“通过错误推理得出正确答案”。这一问题远非学术上的小瑕疵；此类过程错误表明模型缺乏稳健的理解能力，使其在面对现实复杂临床情况时容易产生更广泛的幻觉和不可预测的失败。本文通过将Toulmin模型适配于诊断过程，建立了可信临床论证框架。我们提出了一种新颖的训练流程——课程目标条件学习（Curriculum Goal-Conditioned Learning，CGCL），旨在逐步训练LLM生成明确遵循Toulmin结构的诊断论证。CGCL的三阶段渐进课程系统地构建了稳固的临床论证：(1) 提取事实并生成鉴别诊断；(2) 为核心假设提供理由并反驳备选方案；(3) 综合分析形成最终的、有资质的结论。我们使用T-Eval这一诊断推理完整性量化框架验证CGCL。实验结果表明，该方法在诊断准确性和推理质量上可与资源密集型强化学习（Reinforcement Learning，RL）方法媲美，同时提供了更稳定且高效的训练流程。

View on arXiv Download PDF AI Translation

cs.AI / 136 / 2604.11154

Environmental Footprint of GenAI Research: Insights from the Moshi Foundation Model

生成式人工智能研究的环境足迹：来自Moshi基础模型的见解

López-Rauhut, Marta, Landrieu, Loic, Aubry, Mathieu, Ligozat, Anne-Laure

Abstract

New multi-modal large language models (MLLMs) are continuously being trained and deployed, following rapid development cycles. This generative AI frenzy is driving steady increases in energy consumption, greenhouse gas emissions, and a plethora of other environmental impacts linked to datacenter construction and hardware manufacturing. Mitigating the environmental consequences of GenAI remains challenging due to an overall lack of transparency by the main actors in the field. Even when the environmental impacts of specific models are mentioned, they are typically restricted to the carbon footprint of the final training run, omitting the research and development stages. In this work, we explore the impact of GenAI research through a fine-grained analysis of the compute spent to create Moshi, a 7B-parameter speech-text foundation model for real-time dialogue developed by Kyutai, a leading privately funded open science AI lab. For the first time, our study dives into the anatomy of compute-intensive MLLM research, quantifying the GPU-time invested in specific model components and training phases, as well as early experimental stages, failed training runs, debugging, and ablation studies. Additionally, we assess the environmental impacts of creating Moshi from beginning to end using a life cycle assessment methodology: we quantify energy and water consumption, greenhouse gas emissions, and mineral resource depletion associated with the production and use of datacenter hardware. Our detailed analysis allows us to provide actionable guidelines to reduce compute usage and environmental impacts of MLLM research, paving the way for more sustainable AI research.

Chinese Translation

新的多模态大型语言模型（MLLMs）正在不断训练和部署，遵循快速发展的周期。这种生成式人工智能的热潮正在推动能源消耗、温室气体排放以及与数据中心建设和硬件制造相关的众多其他环境影响的稳步增加。由于该领域主要参与者整体缺乏透明度，减轻生成式人工智能的环境后果仍然具有挑战性。即使提到特定模型的环境影响，通常也仅限于最终训练过程的碳足迹，而忽略了研究和开发阶段。在本研究中，我们通过对创建Moshi（一个由Kyutai开发的7B参数实时对话语音-文本基础模型）所消耗的计算资源进行细致分析，探讨生成式人工智能研究的影响。我们的研究首次深入分析了计算密集型MLLM研究的内部结构，量化了在特定模型组件和训练阶段、早期实验阶段、失败的训练过程、调试和消融研究中投入的GPU时间。此外，我们使用生命周期评估方法评估了从头到尾创建Moshi的环境影响：我们量化了与数据中心硬件的生产和使用相关的能源和水消耗、温室气体排放以及矿物资源的耗竭。我们的详细分析使我们能够提供可行的指导方针，以减少MLLM研究的计算使用和环境影响，为更可持续的人工智能研究铺平道路。

View on arXiv Download PDF AI Translation

cs.AI / 137 / 2604.11216

Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models

测量人工智能系统的权威堆栈：对366,120个强制选择反应的实证分析，涵盖8个人工智能模型

Lee, Seulki

Abstract

What values, evidence preferences, and source trust hierarchies do AI systems actually exhibit when facing structured dilemmas? We present the first large-scale empirical mapping of AI decision-making across all three layers of the Authority Stack framework (S. Lee, 2026a): value priorities (L4), evidence-type preferences (L3), and source trust hierarchies (L2). Using the PRISM benchmark -- a forced-choice instrument of 14,175 unique scenarios per layer, spanning 7 professional domains, 3 severity levels, 3 decision timeframes, and 5 scenario variants -- we evaluated 8 major AI models at temperature 0, yielding 366,120 total responses. Key findings include: (1) a symmetric 4:4 split between Universalism-first and Security-first models at L4; (2) dramatic defense-domain value restructuring where Security surges to near-ceiling win-rates (95.1%-99.8%) in 6 of 8 models; (3) divergent evidence hierarchies at L3, with some models favoring empirical-scientific evidence while others prefer pattern-based or experiential evidence; (4) broad convergence on institutional source trust at L2; and (5) Paired Consistency Scores (PCS) ranging from 57.4% to 69.2%, revealing substantial framing sensitivity across scenario variants. Test-Retest Reliability (TRR) ranges from 91.7% to 98.6%, indicating that value instability stems primarily from variant sensitivity rather than stochastic noise. These findings demonstrate that AI models possess measurable -- if sometimes unstable -- Authority Stacks with consequential implications for deployment across professional domains.

Chinese Translation

人工智能系统在面对结构性困境时，实际展现了哪些价值观、证据偏好和来源信任等级？我们首次进行了大规模的实证映射，涵盖权威堆栈框架（S. Lee, 2026a）中的三个层面：价值优先级（L4）、证据类型偏好（L3）和来源信任等级（L2）。使用PRISM基准——一个包含每层14,175个独特场景的强制选择工具，跨越7个专业领域、3个严重性等级、3个决策时间框架和5个场景变体——我们在温度为0的情况下评估了8个主要人工智能模型，获得了总计366,120个反应。主要发现包括：(1) 在L4层面，普遍主义优先和安全优先模型之间呈现对称的4:4分布；(2) 在防御领域，价值观发生剧烈重构，安全在8个模型中的6个中接近达到顶峰胜率（95.1%-99.8%）；(3) 在L3层面，证据等级出现分歧，一些模型偏好经验科学证据，而其他模型则偏好基于模式或经验的证据；(4) 在L2层面，机构来源信任广泛趋同；(5) 配对一致性得分（PCS）范围为57.4%至69.2%，揭示了不同场景变体之间显著的框架敏感性。测试-重测可靠性（TRR）范围为91.7%至98.6%，表明价值不稳定性主要源于变体敏感性，而非随机噪声。这些发现表明，人工智能模型具有可测量的——尽管有时不稳定的——权威堆栈，这对在专业领域的部署具有重要影响。

View on arXiv Download PDF AI Translation

cs.AI / 138 / 2604.11259

Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization

基于轨迹诱导偏好的移动GUI代理隐私个性化优化

Lin, Zhixin, Li, Jungang, Xu, Dongliang, Pan, Shidong, Shi, Yibo, Liu, Yuchi, Min, Yuecong, Yao, Yue

Abstract

Mobile GUI agents powered by Multimodal Large Language Models (MLLMs) can execute complex tasks on mobile devices. Despite this progress, most existing systems still optimize task success or efficiency, neglecting users' privacy personalization. In this paper, we study the often-overlooked problem of agent personalization. We observe that personalization can induce systematic structural heterogeneity in execution trajectories. For example, privacy-first users often prefer protective actions, e.g., refusing permissions, logging out, and minimizing exposure, leading to logically different execution trajectories from utility-first users. Such variable-length and structurally different trajectories make standard preference optimization unstable and less informative. To address this issue, we propose Trajectory Induced Preference Optimization (TIPO), which uses preference-intensity weighting to emphasize key privacy-related steps and padding gating to suppress alignment noise. Results on our Privacy Preference Dataset show that TIPO improves persona alignment and distinction while preserving strong task executability, achieving 65.60% SR, 46.22 Compliance, and 66.67% PD, outperforming existing optimization methods across various GUI tasks. The code and dataset will be publicly released at https://github.com/Zhixin-L/TIPO.

Chinese Translation

由多模态大语言模型（Multimodal Large Language Models, MLLMs）驱动的移动GUI代理能够在移动设备上执行复杂任务。尽管取得了这些进展，但大多数现有系统仍然侧重于任务成功或效率的优化，忽视了用户隐私个性化的问题。本文研究了代理个性化这一常被忽视的问题。我们观察到，个性化可以在执行轨迹中引发系统性的结构异质性。例如，优先考虑隐私的用户通常倾向于采取保护性措施，如拒绝权限、注销和最小化曝光，这导致其执行轨迹在逻辑上与优先考虑效用的用户截然不同。这种可变长度和结构上不同的轨迹使得标准的偏好优化不稳定且信息量较少。为了解决这一问题，我们提出了轨迹诱导偏好优化（Trajectory Induced Preference Optimization, TIPO），该方法使用偏好强度加权来强调关键的隐私相关步骤，并通过填充门控来抑制对齐噪声。我们在隐私偏好数据集上的结果表明，TIPO在保持强任务可执行性的同时，提高了个性对齐和区分度，达到了65.60%的成功率（SR）、46.22的合规性（Compliance）和66.67%的隐私保护（PD），在各种GUI任务中优于现有的优化方法。代码和数据集将公开发布在 https://github.com/Zhixin-L/TIPO。

View on arXiv Download PDF AI Translation

cs.AI / 139 / 2604.11261

Inspectable AI for Science: A Research Object Approach to Generative AI Governance

面向科学的可检视人工智能：基于研究对象的方法治理生成式人工智能

Binkyte, Ruta, Abuaddba, Sharif, Mahawaga, Chamikara, Ding, Ming, Fernandes, Natasha, Fritz, Mario

Abstract

This paper introduces AI as a Research Object (AI-RO), a paradigm for governing the use of generative AI in scientific research. Instead of debating whether AI is an author or merely a tool, we propose treating AI interactions as structured, inspectable components of the research process. Under this view, the legitimacy of an AI-assisted scientific paper depends on how model use is integrated into the workflow, documented, and made accountable. Drawing on Research Object theory and FAIR principles, we propose a framework for recording model configuration, prompts, and outputs through interaction logs and metadata packaging. These properties are particularly consequential in security and privacy (S&P) research, where provenance artifacts must satisfy confidentiality constraints, integrity guarantees, and auditability requirements that generic disclosure practices do not address. We implement a lightweight writing pipeline in which a language model synthesizes human-authored structured literature review notes under explicit constraints and produces a verifiable provenance record. We present this work as a position supported by an initial demonstrative workflow, arguing that governance of generative AI in science can be implemented as structured documentation, controlled disclosure, and integrity-preserving provenance capture. Based on this example, we outline and motivate a set of necessary future developments required to make such practices practical and widely adoptable.

Chinese Translation

本文提出了“作为研究对象的人工智能”（AI as a Research Object，AI-RO）范式，用以治理生成式人工智能在科学研究中的应用。我们不再争论人工智能是作者还是工具，而是建议将人工智能的交互视为研究过程中的结构化、可检视组成部分。在此视角下，人工智能辅助科学论文的合法性取决于模型使用如何融入工作流程、被记录并承担责任。基于研究对象理论和FAIR原则，我们提出了通过交互日志和元数据封装记录模型配置、提示词及输出的框架。这些属性在安全与隐私（S&P）研究中尤为重要，因为溯源文档必须满足保密性约束、完整性保证及审计需求，而通用的信息披露做法无法满足这些要求。我们实现了一个轻量级写作流程，其中语言模型在明确约束下合成人工撰写的结构化文献综述笔记，并生成可验证的溯源记录。我们将此工作作为一种立场，辅以初步示范性工作流程，主张科学领域生成式人工智能的治理可通过结构化文档、受控披露及保持完整性的溯源捕获来实现。基于该示例，我们概述并论证了实现此类实践实用化和广泛采纳所需的一系列未来发展。

View on arXiv Download PDF AI Translation

cs.AI / 140 / 2604.11287

Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model

人工智能生成运动处方的一致性研究：基于大型语言模型的重复生成分析

Lee, Kihyuk

Abstract

Background: Large language models (LLMs) have been explored as tools for generating personalized exercise prescriptions, yet the consistency of outputs under identical conditions remains insufficiently examined. Objective: This study evaluated the intra-model consistency of LLM-generated exercise prescriptions using a repeated generation design. Methods: Six clinical scenarios were used to generate exercise prescriptions using Gemini 2.5 Flash (20 outputs per scenario; total n = 120). Consistency was assessed across three dimensions: (1) semantic consistency using SBERT-based cosine similarity, (2) structural consistency based on the FITT principle using an AI-as-a-judge approach, and (3) safety expression consistency, including inclusion rates and sentence-level quantification. Results: Semantic similarity was high across scenarios (mean cosine similarity: 0.879-0.939), with greater consistency in clinically constrained cases. Frequency showed consistent patterns, whereas variability was observed in quantitative components, particularly exercise intensity. Unclassifiable intensity expressions were observed in 10-25% of resistance training outputs. Safety-related expressions were included in 100% of outputs; however, safety sentence counts varied significantly across scenarios (H=86.18, p less than 0.001), with clinical cases generating more safety expressions than healthy adult cases. Conclusions: LLM-generated exercise prescriptions demonstrated high semantic consistency but showed variability in key quantitative components. Reliability depends substantially on prompt structure, and additional structural constraints and expert validation are needed before clinical deployment.

Chinese Translation

背景：大型语言模型（LLMs）已被探索用于生成个性化运动处方，但在相同条件下输出结果的一致性尚未得到充分研究。目的：本研究采用重复生成设计，评估LLM生成运动处方的模型内一致性。方法：选取六个临床场景，使用Gemini 2.5 Flash生成运动处方（每个场景生成20份，总计120份）。从三个维度评估一致性：（1）基于SBERT的余弦相似度评估语义一致性；（2）基于FITT原则，采用AI作为评判者的方法评估结构一致性；（3）安全表达一致性，包括安全表达的包含率及句子级量化。结果：各场景间语义相似度较高（平均余弦相似度：0.879-0.939），临床约束条件下表现出更高一致性。频率表现出一致模式，而定量成分，尤其是运动强度，存在较大变异。阻力训练输出中有10%-25%的强度表达无法分类。安全相关表达在所有输出中均有体现，但安全句子数量在不同场景间显著差异（H=86.18，p<0.001），临床病例生成的安全表达多于健康成人病例。结论：LLM生成的运动处方在语义层面表现出高度一致性，但关键定量成分存在变异。其可靠性在很大程度上依赖于提示词结构，临床应用前需增加结构性约束并进行专家验证。

View on arXiv Download PDF AI Translation

cs.AI / 141 / 2604.11304

BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

BankerToolBench：评估端到端投资银行工作流程中的人工智能代理

Lau, Elaine, Dücker, Markus, Chaudhary, Ronak, Goh, Hui Wen, Wei, Rosemary, Kumar, Vaibhav, Qunbar, Saed, Gogia, Guram, Liu, Yi, Millslagle, Scott, Borazjanizadeh, Nasim, Tkachenko, Ulyana, Danquah, Samuel Eshun, Schweiker, Collin, Karumathil, Vijay, Devalaraju, Asrith, Sandadi, Varsha, Nam, Haemi, Arani, Punit, Epps, Ray, Arif, Abdullah, Bhaiwala, Sahil, Northcutt, Curtis, Wang, Skyler, Athalye, Anish, Mueller, Jonas, Guzmán, Francisco

Abstract

Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables--including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.

Chinese Translation

现有的人工智能基准缺乏评估专业工作流程中经济上有意义进展的能力。为了评估在高价值、劳动密集型职业中的前沿人工智能代理，我们引入了BankerToolBench（BTB）：一个开放源代码的基准，涵盖初级投资银行家日常执行的端到端分析工作流程。为了开发一个基于代表性工作环境的生态有效基准，我们与来自领先公司的502名投资银行家进行了合作。BTB要求代理通过导航数据室、使用行业工具（市场数据平台、SEC文件数据库）并生成多文件交付物——包括Excel财务模型、PowerPoint推介文稿和PDF/Word报告——来执行高级银行家的请求。完成一个BTB任务需要银行家最多21小时，这突显了成功将这项工作委托给人工智能的经济风险。BTB使任何大型语言模型（LLM）或代理的自动评估成为可能，依据由资深投资银行家定义的100多个评分标准对交付物进行评分，以捕捉利益相关者的效用。测试了9个前沿模型后，我们发现即使是表现最好的模型（GPT-5.4）也未能满足近一半的评分标准，且银行家将其输出的0%评为客户准备就绪。我们的失败分析揭示了关键障碍（例如跨文档一致性的问题）以及在高风险专业工作流程中改进代理人工智能的方向。

View on arXiv Download PDF AI Translation

cs.AI / 142 / 2604.11307

PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers

PaperScope：一个用于代理深度研究的大规模多模态多文档基准，涵盖大量科学论文

Xiong, Lei, Yuan, Huaying, Liu, Zheng, Cao, Zhao, Dou, Zhicheng

Abstract

Leveraging Multi-modal Large Language Models (MLLMs) to accelerate frontier scientific research is promising, yet how to rigorously evaluate such systems remains unclear. Existing benchmarks mainly focus on single-document understanding, whereas real scientific workflows require integrating evidence from multiple papers, including their text, tables, and figures. As a result, multi-modal, multi-document scientific reasoning remains underexplored and lacks systematic evaluation. To address this gap, we introduce PaperScope, a multi-modal multi-document benchmark designed for agentic deep research. PaperScope presents three advantages: (1) Structured scientific grounding. It is built on a knowledge graph of over 2,000 AI papers spanning three years, providing a structured foundation for research-oriented queries. (2) Semantically dense evidence construction. It integrates semantically related key information nodes and employs optimized random-walk article selector to sample thematically coherent paper sets, thereby ensuring adequate semantic density and task complexity. (3) Multi-task evaluation of scientific reasoning. It contains over 2,000 QA pairs across reasoning, retrieval, summarization, and problem solving, enabling evaluation of multi-step scientific reasoning. Experimental results show that even advanced systems such as OpenAI Deep Research and Tongyi Deep Research achieve limited scores on PaperScope, highlighting the difficulty of long-context retrieval and deep multi-source reasoning. PaperScope thus provides a rigorous benchmark alongside a scalable pipeline for constructing large-scale multi-modal, multi-source deep research datasets.

Chinese Translation

利用多模态大型语言模型（MLLMs）加速前沿科学研究前景广阔，但如何对这些系统进行严格评估仍不明确。现有基准主要集中于单文档理解，而真实的科学工作流程需要整合来自多篇论文的证据，包括文本、表格和图形。因此，多模态、多文档的科学推理仍然未得到充分探索，缺乏系统评估。为了解决这一空白，我们引入了PaperScope，一个为代理深度研究设计的多模态多文档基准。PaperScope具有三个优势：（1）结构化的科学基础。它建立在一个涵盖三年内2000多篇AI论文的知识图谱之上，为研究导向的查询提供了结构化基础。（2）语义密集的证据构建。它整合了语义相关的关键信息节点，并采用优化的随机游走文章选择器来抽样主题一致的论文集，从而确保足够的语义密度和任务复杂性。（3）科学推理的多任务评估。它包含2000多个跨推理、检索、摘要和问题解决的问答对，能够评估多步骤的科学推理。实验结果表明，即使是像OpenAI Deep Research和Tongyi Deep Research这样的先进系统在PaperScope上也仅获得有限分数，突显了长上下文检索和深度多源推理的难度。因此，PaperScope提供了一个严格的基准，并构建了一个可扩展的管道，用于构建大规模多模态、多源深度研究数据集。

View on arXiv Download PDF AI Translation

cs.AI / 143 / 2604.11328

Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

选择更聪明，而非更多：具有子模保证的提示感知评估调度

Ma, Xiaoyu, Li, Yiwen, Liu, Haoyue, Wang, Zhichao, Chen, Ye, Guo, Yongxin, Tang, Xiaoying

Abstract

Automatic prompt optimization (APO) hinges on the quality of its evaluation signal, yet scoring every prompt candidate on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but prompt-agnostic) or adapt it heuristically during optimization (flexible but unstable and lacking formal guarantees). We observe that APO naturally maps to an online adaptive testing problem: prompts are examinees, training examples are test items, and the scheduler should select items that best discriminate among the strongest candidates. This insight motivates Prompt-Aware Online Evaluation Scheduling (POES), which integrates an IRT-based discrimination utility, a facility-location coverage term, and switching-cost-aware warm-start swaps into a unified objective that is provably monotone submodular, yielding a (1-1/e) greedy guarantee for cold starts and bounded drift for warm-start updates. An adaptive controller modulates the exploration-exploitation balance based on optimization progress. Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy (6.2 percent improvement over the best baseline) with negligible token overhead (approximately 4 percent) at the same evaluation budget. Moreover, principled selection at k = 20 examples matches or exceeds the performance of naive evaluation at k = 30-50, reducing token consumption by 35-60 percent, showing that selecting smarter is more effective than selecting more. Our results demonstrate that evaluation scheduling is a first-class component of APO, not an implementation detail.

Chinese Translation

自动提示优化（APO）的效果依赖于评估信号的质量，但在完整训练集上对每个提示候选进行评分是极其昂贵的。现有方法要么在优化开始之前固定一个单一的评估子集（原则性但与提示无关），要么在优化过程中通过启发式方法进行适应（灵活但不稳定且缺乏正式保证）。我们观察到，APO 自然映射到一个在线自适应测试问题：提示是考生，训练示例是测试项目，调度器应选择能够最好地区分最强候选者的项目。这一洞察促成了提示感知在线评估调度（POES），它将基于项目反应理论（IRT）的区分效用、设施位置覆盖项和考虑切换成本的热启动交换整合为一个统一的目标，该目标在数学上是单调的子模的，从而为冷启动提供了 (1-1/e) 的贪婪保证，并为热启动更新提供了有界漂移。自适应控制器根据优化进展调节探索与利用的平衡。在涵盖三个基准系列的 36 个任务中，POES 实现了最高的整体平均准确率（比最佳基线提高 6.2%），同时在相同的评估预算下，令牌开销微不足道（约 4%）。此外，在 k = 20 个示例的原则性选择与 k = 30-50 的简单评估相匹配或超越，减少了 35-60% 的令牌消耗，表明选择更聪明比选择更多更有效。我们的结果表明，评估调度是 APO 的一个重要组成部分，而不是实现细节。

View on arXiv Download PDF AI Translation

cs.AI / 144 / 2604.11334

Dynamic Summary Generation for Interpretable Multimodal Depression Detection

用于可解释的多模态抑郁检测的动态摘要生成

Teng, Shiyu, Liu, Jiaqing, Sun, Hao, Li, Yu, Chai, Shurong, Hou, Ruibo, Tateyama, Tomoko, Lin, Lanfen, Chen, Yen-Wei

Abstract

Depression remains widely underdiagnosed and undertreated because stigma and subjective symptom ratings hinder reliable screening. To address this challenge, we propose a coarse-to-fine, multi-stage framework that leverages large language models (LLMs) for accurate and interpretable detection. The pipeline performs binary screening, five-class severity classification, and continuous regression. At each stage, an LLM produces progressively richer clinical summaries that guide a multimodal fusion module integrating text, audio, and video features, yielding predictions with transparent rationale. The system then consolidates all summaries into a concise, human-readable assessment report. Experiments on the E-DAIC and CMDC datasets show significant improvements over state-of-the-art baselines in both accuracy and interpretability.

Chinese Translation

抑郁症因污名化和主观症状评估的障碍而广泛存在漏诊和治疗不足的问题。为了解决这一挑战，我们提出了一种粗到细的多阶段框架，利用大型语言模型（LLMs）进行准确且可解释的检测。该流程进行二元筛查、五类严重程度分类和连续回归。在每个阶段，LLM生成逐渐丰富的临床摘要，以指导集成文本、音频和视频特征的多模态融合模块，从而产生具有透明推理的预测。系统随后将所有摘要整合成一份简明易懂的评估报告。在E-DAIC和CMDC数据集上的实验表明，在准确性和可解释性方面，相较于最先进的基线方法有显著提升。

View on arXiv Download PDF AI Translation

cs.AI / 145 / 2604.11359

CoRe-ECG: Advancing Self-Supervised Representation Learning for 12-Lead ECG via Contrastive and Reconstructive Synergy

CoRe-ECG：通过对比与重构协同推进12导联心电图自监督表示学习

Qin, Zehao, Lin, Xiaojian, Zhang, Ping, Wu, Hongliang, Wang, Xinkang, Liu, Guangling, Chen, Bo, Yang, Wenming, Wang, Guijin

Abstract

Accurate interpretation of electrocardiogram (ECG) remains challenging due to the scarcity of labeled data and the high cost of expert annotation. Self-supervised learning (SSL) offers a promising solution by enabling models to learn expressive representations from unlabeled signals. Existing ECG SSL methods typically rely on either contrastive learning or reconstructive learning. However, each approach in isolation provides limited supervisory signals and suffers from additional limitations, including non-physiological distortions introduced by naive augmentations and trivial correlations across multiple leads that models may exploit as shortcuts. In this work, we propose CoRe-ECG, a unified contrastive and reconstructive pretraining paradigm that establishes a synergistic interaction between global semantic modeling and local structural learning. CoRe-ECG aligns global representations during reconstruction, enabling instance-level discriminative signals to guide local waveform recovery. To further enhance pretraining, we introduce Frequency Dynamic Augmentation (FDA) to adaptively perturb ECG signals based on their frequency-domain importance, and Spatio-Temporal Dual Masking (STDM) to break linear dependencies across leads, increasing the difficulty of reconstructive tasks. Our method achieves state-of-the-art performance across multiple downstream ECG datasets. Ablation studies further demonstrate the necessity and complementarity of each component. This approach provides a robust and physiologically meaningful representation learning framework for ECG analysis.

Chinese Translation

由于标注数据的稀缺性及专家注释的高成本，心电图（ECG）的准确解读仍然具有挑战性。自监督学习（SSL）通过使模型能够从未标注信号中学习表达性表示，提供了一种有前景的解决方案。现有的ECG自监督学习方法通常依赖于对比学习或重构学习中的一种。然而，单一方法提供的监督信号有限，且存在额外的局限性，包括天真的数据增强引入的非生理性失真以及模型可能利用的多导联间的简单相关性作为捷径。在本工作中，我们提出了CoRe-ECG，一种统一的对比与重构预训练范式，实现了全局语义建模与局部结构学习的协同交互。CoRe-ECG在重构过程中对齐全局表示，使实例级判别信号能够指导局部波形的恢复。为进一步增强预训练效果，我们引入了频率动态增强（Frequency Dynamic Augmentation, FDA），基于频域重要性自适应扰动ECG信号，以及时空双重掩码（Spatio-Temporal Dual Masking, STDM），打破导联间的线性依赖，增加重构任务的难度。我们的方法在多个下游ECG数据集上实现了最先进的性能。消融研究进一步验证了各组件的必要性及其互补性。该方法为ECG分析提供了一个稳健且符合生理意义的表示学习框架。

View on arXiv Download PDF AI Translation

cs.AI / 146 / 2604.11364

The Missing Knowledge Layer in Cognitive Architectures for AI Agents

人工智能代理认知架构中的缺失知识层

Roynard, Michaël

Abstract

The two most influential cognitive architecture frameworks for AI agents, CoALA [21] and JEPA [12], both lack an explicit Knowledge layer with its own persistence semantics. This gap produces a category error: systems apply cognitive decay to factual claims, or treat facts and experiences with identical update mechanics. We survey persistence semantics across existing memory systems and identify eight convergence points, from Karpathy's LLM Knowledge Base [10] to the BEAM benchmark's near-zero contradiction-resolution scores [22], all pointing to related architectural gaps. We propose a four-layer decom position (Knowledge, Memory, Wisdom, Intelligence) where each layer has fundamentally different persistence semantics: indefinite supersession, Ebbinghaus decay, evidence-gated revision, and ephemeral inference respectively. Companion implementations in Python and Rust demonstrate the architectural separation is feasible. We borrow terminology from cognitive science as a useful analogy (the Knowledge/Memory distinction echoes Tulving's trichotomy), but our layers are engineering constructs justified by persistence-semantics requirements, not by neural architecture. We argue that these distinctions demand distinct persistence semantics in engineering implementations, and that no current framework or system provides this.

Chinese Translation

目前影响最大的两种人工智能代理认知架构框架，CoALA [21] 和 JEPA [12]，都缺乏一个具有独立持久性语义的明确知识层。这一缺口导致了类别错误：系统对事实声明应用认知衰退，或将事实和经验视为具有相同更新机制。我们调查了现有记忆系统中的持久性语义，并识别出八个收敛点，从 Karpathy 的 LLM 知识库 [10] 到 BEAM 基准的近零矛盾解决分数 [22]，这些都指向相关的架构缺口。我们提出了一个四层分解（知识、记忆、智慧、智能），其中每一层具有根本不同的持久性语义：无限取代、艾宾浩斯衰退、证据门控修订和短暂推理。Python 和 Rust 的伴随实现展示了这种架构分离是可行的。我们借用认知科学的术语作为有用的类比（知识/记忆的区分呼应了 Tulving 的三分法），但我们的层次是由持久性语义要求所证明的工程构造，而不是由神经架构所决定。我们认为这些区分要求在工程实现中具有不同的持久性语义，而当前没有任何框架或系统提供这一点。

View on arXiv Download PDF AI Translation

cs.AI / 147 / 2604.11365

Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories

从对比中学习：从多样化搜索轨迹中合成推理路径

Liu, Peiyang, Chen, Zhirui, Wang, Xi, Liang, Di, Li, Youru, Cai, Zhi, Ye, Wei

Abstract

Monte Carlo Tree Search (MCTS) has been widely used for automated reasoning data exploration, but current supervision extraction methods remain inefficient. Standard approaches retain only the single highest-reward trajectory, discarding the comparative signals present in the many explored paths. Here we introduce \textbf{Contrastive Reasoning Path Synthesis (CRPS)}, a framework that transforms supervision extraction from a filtering process into a synthesis procedure. CRPS uses a structured reflective process to analyze the differences between high- and low-quality search trajectories, extracting explicit information about strategic pivots and local failure modes. These insights guide the synthesis of reasoning chains that incorporate success patterns while avoiding identified pitfalls. We show empirically that models fine-tuned on just 60K CRPS-synthesized examples match or exceed the performance of baselines trained on 590K examples derived from standard rejection sampling, a 20$\times$ reduction in dataset size. Furthermore, CRPS improves generalization on out-of-domain benchmarks, demonstrating that learning from the contrast between success and failure produces more transferable reasoning capabilities than learning from success alone.

Chinese Translation

蒙特卡罗树搜索（Monte Carlo Tree Search, MCTS）已广泛应用于自动推理数据探索，但当前的监督提取方法效率仍然较低。标准方法仅保留单一的最高奖励轨迹，忽略了许多探索路径中存在的比较信号。在此，我们引入了 extbf{对比推理路径合成（Contrastive Reasoning Path Synthesis, CRPS）}框架，将监督提取从过滤过程转变为合成过程。CRPS利用结构化的反思过程分析高质量和低质量搜索轨迹之间的差异，提取有关战略转折点和局部失败模式的明确信息。这些见解指导推理链的合成，结合成功模式，同时避免已识别的陷阱。我们通过实验证明，仅在60K个CRPS合成示例上微调的模型，其性能与在590K个来自标准拒绝采样的示例上训练的基线相匹配或超过，数据集规模减少了20倍。此外，CRPS在域外基准上的泛化能力得到了提升，证明了从成功与失败之间的对比中学习，能够产生比单纯从成功中学习更具可迁移性的推理能力。

View on arXiv Download PDF AI Translation

cs.AI / 148 / 2604.11378

From Agent Loops to Structured Graphs:A Scheduler-Theoretic Framework for LLM Agent Execution

从代理循环到结构化图：基于调度理论的LLM代理执行框架

Wei, Hu

Abstract

The dominant paradigm for building LLM based agents is the Agent Loop, an iterative cycle where a single language model decides what to do next by reading an ever growing context window. This paradigm has three structural weaknesses: implicit dependencies between steps, unbounded recovery loops, and mutable execution history that complicates debugging. We characterize the Agent Loop as a single ready unit scheduler: at any moment, at most one executable unit is active, and the choice of which unit to activate comes from opaque LLM inference rather than an inspectable policy. This perspective places Agent Loops and graph based execution engines on a single semantic continuum. We propose SGH, Structured Graph Harness, which lifts control flow from implicit context into an explicit static DAG. SGH makes three commitments: execution plans are immutable within a plan version, planning execution and recovery are separated into three layers, and recovery follows a strict escalation protocol. These choices trade some expressiveness for controllability, verifiability, and implementability. Our contributions are fourfold: a scheduler unified framework that applies classical scheduling theory to LLM agent execution and identifies challenges introduced by non deterministic LLM nodes; a trade off analysis of controllability, expressiveness, and implementability across 70 surveyed systems; a formal specification including a node state machine with termination and soundness guarantees; and an attributable experimental framework with a seven group design for future validation. This is a position paper and design proposal. We provide a theoretical framework, design analysis, and experimental protocol, not a production implementation or empirical results.

Chinese Translation

构建基于大型语言模型（LLM）代理的主流范式是代理循环（Agent Loop），即通过一个迭代周期，单一语言模型通过读取不断增长的上下文窗口来决定下一步行动。该范式存在三大结构性弱点：步骤间的隐式依赖、无界的恢复循环以及可变的执行历史，这些都增加了调试的复杂性。我们将代理循环描述为单一就绪单元调度器：任一时刻最多只有一个可执行单元处于激活状态，且激活单元的选择来源于不可检视的LLM推理，而非可检查的策略。该视角将代理循环与基于图的执行引擎置于同一语义连续体上。我们提出了结构化图框架（Structured Graph Harness，SGH），将控制流从隐式上下文提升为显式静态有向无环图（DAG）。SGH做出三项承诺：执行计划在同一版本内保持不可变，规划执行与恢复分为三层，恢复遵循严格的升级协议。这些设计权衡了部分表达能力以换取可控性、可验证性和可实现性。我们的贡献包括四方面：一个统一的调度框架，将经典调度理论应用于LLM代理执行并识别非确定性LLM节点带来的挑战；对70个调研系统中可控性、表达能力与可实现性的权衡分析；包含终止性和健全性保证的节点状态机形式规范；以及一个具备七组设计的可归因实验框架以供未来验证。本文为立场论文与设计提案，提供理论框架、设计分析及实验方案，尚无生产实现或实证结果。

View on arXiv Download PDF AI Translation

cs.AI / 149 / 2604.11419

Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval

超越 RAG 的网络威胁情报：基于图形和代理检索的系统评估

Hamzic, Dzenan, Skopik, Florian, Landauer, Max, Wurzenberger, Markus, Rauber, Andreas

Abstract

Cyber threat intelligence (CTI) analysts must answer complex questions over large collections of narrative security reports. Retrieval-augmented generation (RAG) systems help language models access external knowledge, but traditional vector retrieval often struggles with queries that require reasoning over relationships between entities such as threat actors, malware, and vulnerabilities. This limitation arises because relevant evidence is often distributed across multiple text fragments and documents. Knowledge graphs address this challenge by enabling structured multi-hop reasoning through explicit representations of entities and relationships. However, multiple retrieval paradigms, including graph-based, agentic, and hybrid approaches, have emerged with different assumptions and failure modes. It remains unclear how these approaches compare in realistic CTI settings and when graph grounding improves performance. We present a systematic evaluation of four RAG architectures for CTI analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic variant that repairs failed graph queries, and a hybrid approach combining graph queries with text retrieval. We evaluate these systems on 3,300 CTI question-answer pairs spanning factual lookups, multi-hop relational queries, analyst-style synthesis questions, and unanswerable cases. Results show that graph grounding improves performance on structured factual queries. The hybrid graph-text approach improves answer quality by up to 35 percent on multi-hop questions compared to vector RAG, while maintaining more reliable performance than graph-only systems.

Chinese Translation

网络威胁情报 (CTI) 分析师必须在大量叙述性安全报告中回答复杂问题。检索增强生成 (RAG) 系统帮助语言模型访问外部知识，但传统的向量检索在处理需要推理实体之间关系（如威胁行为者、恶意软件和漏洞）的查询时常常面临困难。这一局限性源于相关证据通常分散在多个文本片段和文档中。知识图谱通过对实体和关系的明确表示，解决了这一挑战，从而实现结构化的多跳推理。然而，随着不同假设和失败模式的出现，基于图形、代理和混合方法等多种检索范式应运而生。在现实的 CTI 环境中，这些方法的比较仍不明确，且图形基础的检索何时能提高性能尚不清楚。我们对四种用于 CTI 分析的 RAG 架构进行了系统评估：标准向量检索、基于 CTI 知识图谱的图形检索、修复失败图形查询的代理变体，以及结合图形查询与文本检索的混合方法。我们在 3,300 对 CTI 问答对上评估这些系统，涵盖事实查找、多跳关系查询、分析师风格的综合问题和无法回答的案例。结果表明，图形基础的检索在结构化事实查询上提高了性能。与向量 RAG 相比，混合图形-文本方法在多跳问题上的答案质量提高了多达 35%，同时保持了比仅图形系统更可靠的性能。

View on arXiv Download PDF AI Translation

cs.AI / 150 / 2604.11462

Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

突破上下文瓶颈：通过强化学习实现大型语言模型代理的主动上下文管理

Li, Xiaozhe, Lyu, Tianyi, Yang, Yizhao, Shan, Liang, Yang, Siyi, Zhang, Ligao, Huang, Zhuoyi, Liu, Qingwen, Li, Yang

Abstract

Large Language Models (LLMs) struggle with long-horizon tasks due to the "context bottleneck" and the "lost-in-the-middle" phenomenon, where accumulated noise from verbose environments degrades reasoning over multi-turn interactions. To address this issue, we introduce a symbiotic framework that decouples context management from task execution. Our architecture pairs a lightweight, specialized policy model, ContextCurator, with a powerful frozen foundation model, TaskExecutor. Trained via reinforcement learning, ContextCurator actively reduces information entropy in the working memory. It aggressively prunes environmental noise while preserving reasoning anchors, that is, sparse data points that are critical for future deductions. On WebArena, our framework improves the success rate of Gemini-3.0-flash from 36.4% to 41.2% while reducing token consumption by 8.8% (from 47.4K to 43.3K). On DeepSearch, it achieves a 57.1% success rate, compared with 53.9%, while reducing token consumption by a factor of 8. Remarkably, a 7B ContextCurator matches the context management performance of GPT-4o, providing a scalable and computationally efficient paradigm for autonomous long-horizon agents.

Chinese Translation

大型语言模型（LLMs）在处理长时序任务时面临“上下文瓶颈”和“中间遗失”现象的挑战，即冗长环境中累积的噪声削弱了多轮交互中的推理能力。为解决该问题，我们提出了一个共生框架，将上下文管理与任务执行解耦。该架构将轻量级的专用策略模型ContextCurator与强大的冻结基础模型TaskExecutor配对。ContextCurator通过强化学习训练，主动降低工作记忆中的信息熵，积极剪除环境噪声，同时保留推理锚点，即对未来推断至关重要的稀疏数据点。在WebArena上，该框架将Gemini-3.0-flash的成功率从36.4%提升至41.2%，同时将令牌消耗降低8.8%（从47.4K降至43.3K）。在DeepSearch上，成功率达到57.1%，相比之下基线为53.9%，且令牌消耗减少了8倍。值得注意的是，7B规模的ContextCurator在上下文管理性能上可匹配GPT-4o，展现了面向自主长时序代理的可扩展且计算高效的范式。

View on arXiv Download PDF AI Translation

cs.AI / 151 / 2604.11465

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

三重角色，一体模型：推理时角色编排以缩小小型与大型智能体性能差距

McClendon, S. Aaron, Gallego-Feliciano, Jorge, Zervoudakis, Stavros, Saravanos, Antonios

Abstract

Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. Operating on a single 24\,GB GPU, we evaluate Qwen3-8B under both full-precision (FP16, 12K context) and 4-bit quantized (AWQ, 32K context) configurations. Without any intervention, the raw model achieves just 5.4\% (FP16) and 3.0\% (AWQ) task goal completion. Guided by a systematic failure mode analysis, we introduce a three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) a summarization model that preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history; (2) the main agent model that reasons over the compressed context; and (3) an isolated correction model that reviews and revises the agent's code output without access to conversation history, breaking repetitive failure loops. Applied to the same unmodified model, this scaffolding yields 8.9\% (FP16) and 5.9\% (AWQ) task goal completion, roughly doubling performance in both settings, with particularly strong gains on difficulty-1 tasks (15.8\%$\to$26.3\% FP16; 5.3\%$\to$14.0\% AWQ). On full-precision inference, our scaffolded 8B model surpasses DeepSeek-Coder 33B Instruct (7.1\%) from the original AppWorld evaluation, demonstrating that structured inference-time interventions can make small models competitive with systems 4$\times$ their size. We formalize the approach as a scaffolded policy over a frozen base model, three invocations of the same weights with different conditioning, drawing connections to test-time compute scaling and action-space shaping in reinforcement learning.

Chinese Translation

大型语言模型（LLM）智能体在现实工具使用任务中展现出潜力，但在有限硬件上部署高效智能体仍具挑战性。本文研究仅通过推理时搭建辅助机制，而无需额外训练计算，是否能提升小型模型在复杂多步骤环境中的表现。我们在单个24GB GPU上，评估了Qwen3-8B模型在全精度（FP16，12K上下文）和4位量化（AWQ，32K上下文）两种配置下的性能。未经任何干预，原始模型的任务目标完成率仅为5.4%（FP16）和3.0%（AWQ）。基于系统性的失败模式分析，我们引入了一个三层推理辅助流水线，在三个不同角色中部署同一冻结模型：（1）摘要模型，用于保留关键信息（标记、凭证、API响应）并压缩对话历史；（2）主智能体模型，基于压缩上下文进行推理；（3）独立纠正模型，在无对话历史访问的情况下审查并修正智能体的代码输出，打破重复失败循环。该辅助机制应用于同一未修改模型后，任务完成率提升至8.9%（FP16）和5.9%（AWQ），在两种设置中性能约翻倍，尤其在难度等级为1的任务上表现显著提升（FP16从15.8%提升至26.3%；AWQ从5.3%提升至14.0%）。在全精度推理下，我们的辅助8B模型超越了原AppWorld评测中的DeepSeek-Coder 33B Instruct（7.1%），表明结构化推理时干预可使小型模型在性能上媲美体积为其4倍的系统。我们将该方法形式化为基于冻结基础模型的辅助策略，即以相同权重的三次调用结合不同条件输入，关联了测试时计算扩展与强化学习中的动作空间塑造。

View on arXiv Download PDF AI Translation

cs.AI / 152 / 2604.11467

From Attribution to Action: A Human-Centered Application of Activation Steering

从归因到行动：以人为本的激活引导应用

Labarta, Tobias, Dreyer, Maximilian, Weitz, Katharina, Samek, Wojciech, Lapuschkin, Sebastian

Abstract

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.

Chinese Translation

可解释人工智能（XAI）方法揭示了哪些特征影响模型预测，但为从业者提供的基于这些解释采取行动的手段有限。通过 XAI 识别的组件的激活引导为可操作的解释提供了一条路径，尽管其实际效用仍未得到充分研究。我们介绍了一种互动工作流程，将基于 SAE 的归因与激活引导结合，用于视觉模型中概念使用的实例级分析，并实现为一个基于网络的工具。基于该工作流程，我们对 8 名专家进行了半结构化访谈，进行调试任务以研究从业者如何推理、信任和应用激活引导。我们发现，激活引导使得从检查转向基于干预的假设检验（8/8 参与者），大多数参与者的信任建立在观察到的模型响应上，而不仅仅是解释的合理性（6/8）。参与者采用了以组件抑制为主的系统调试策略（7/8），并强调了包括涟漪效应和实例级修正的有限泛化等风险。总体而言，激活引导使可解释性变得更具可操作性，同时提出了安全有效使用的重要考虑。

View on arXiv Download PDF AI Translation

cs.AI / 153 / 2604.11477

OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems

OOM-RL：基于市场驱动的出价外强化学习，用于基于大型语言模型的多智能体系统对齐

Liu, Kun, Chen, Liqun

Abstract

The alignment of Multi-Agent Systems (MAS) for autonomous software engineering is constrained by evaluator epistemic uncertainty. Current paradigms, such as Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), frequently induce model sycophancy, while execution-based environments suffer from adversarial "Test Evasion" by unconstrained agents. In this paper, we introduce an objective alignment paradigm: \textbf{Out-of-Money Reinforcement Learning (OOM-RL)}. By deploying agents into the non-stationary, high-friction reality of live financial markets, we utilize critical capital depletion as an un-hackable negative gradient. Our longitudinal 20-month empirical study (July 2024 -- February 2026) chronicles the system's evolution from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture. We demonstrate that the undeniable ontological consequences of financial loss forced the MAS to abandon overfitted hallucinations in favor of the \textbf{Strict Test-Driven Agentic Workflow (STDAW)}, which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified $\geq 95\%$ code coverage constraint matrix. Our results show that while early iterations suffered severe execution decay, the final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase. We conclude that substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments, laying the groundwork for generalized paradigms where computational billing acts as an objective physical constraint

Chinese Translation

多智能体系统（MAS）在自主软件工程中的对齐受到评估者认知不确定性的限制。当前范式，如基于人类反馈的强化学习（RLHF）和基于AI反馈的强化学习（RLAIF），常导致模型谄媚行为，而基于执行的环境则面临无约束智能体的对抗性“测试规避”问题。本文提出了一种客观对齐范式：出价外强化学习（Out-of-Money Reinforcement Learning，OOM-RL）。通过将智能体部署到非平稳且高摩擦的实时金融市场现实中，我们利用关键的资本耗尽作为不可篡改的负梯度。我们进行了为期20个月（2024年7月至2026年2月）的纵向实证研究，记录了系统从高换手率、谄媚基线向稳健的流动性感知架构的演变。研究表明，金融损失不可否认的本体论后果迫使MAS摒弃过拟合的幻觉，转而采用严格的测试驱动智能工作流（Strict Test-Driven Agentic Workflow，STDAW），该工作流执行受拜占庭启发的单向状态锁定（RO-Lock），并锚定于确定性验证的≥95%代码覆盖约束矩阵。结果显示，尽管早期迭代存在严重的执行衰减，最终的OOM-RL对齐系统在成熟阶段实现了年化夏普比率2.06的稳定均衡。我们得出结论，用严格的经济惩罚替代主观人类偏好，为在高风险真实环境中对齐自主智能体提供了稳健的方法论，为计算计费作为客观物理约束的泛化范式奠定了基础。

View on arXiv Download PDF AI Translation

cs.AI / 154 / 2604.11480

On the Complexity of the Discussion-based Semantics in Abstraction Argumentation

基于讨论语义的抽象论证复杂性研究

Blümel, Lydia, Sauerwald, Kai, Skiba, Kenneth, Thimm, Matthias

Abstract

We show that deciding whether an argument a is stronger than an argument b with respect to the discussion-based semantics of Amgoud and Ben-Naim is decidable in polynomial time. At its core, this problem is about deciding whether, for two vertices in a graph, the number of walks of each length ending in those vertices is the same. We employ results from automata theory and reduce this problem to the equivalence problem for semiring automata. This offers a new perspective on the computational complexity of ranking semantics, an area in which the complexity of many semantics remains open.

Chinese Translation

我们证明了在Amgoud和Ben-Naim提出的基于讨论的语义框架下，判断一个论点a是否比另一个论点b更强是多项式时间内可判定的问题。本质上，该问题涉及判断图中两个顶点在每个长度的路径数量是否相同。我们利用自动机理论的相关结果，将该问题归约为半环自动机（semiring automata）等价性问题。这为排名语义（ranking semantics）的计算复杂性提供了新的视角，而该领域中许多语义的复杂性问题仍未解决。

View on arXiv Download PDF AI Translation

cs.AI / 155 / 2604.11490

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

人类中心的区域适应在多模态视觉-语言模型中的应用

Cahyawijaya, Samuel, Limkonchotiwat, Peerat, Wong, Tack Hwa, Patel, Hitesh Laxmichand, Agarwal, Amit, Rufino, Manuel Antonio, Catalan, Carlos Rafael, Qorib, Muhammad Reza, Feliren, Vicky, Lovenia, Holy, Khine, Aye Hninn, Hudi, Frederikus, Anugraha, David, Aji, Alham Fikri, Chumpu, Romrawin, Pham, Viet-Thanh, Wang, Minghan, Imam, Mohamed Fazli, Zhang, Ruochen, Imperial, Joseph Marvin, Long, Do Xuan, Wijanarko, Musa Izzanardi, Moniz, Joel Ruben Antony, Irawan, Patrick Amadeus, Zhafran, Hanif Muhammad, Flores, Isaiah, Salsabila, Ira, Kevin, Jun, Rosal, Jostin Jerico, Monderin, Patricia Nicole, Kerdthaisong, Kun, Mustafid, Ahmad, Nguyen, My Chiffon, Jongwiriyanurak, Natchapon, Worajitwannakul, Siva, Li, Haochen, Lim, Adrian Xuan Wei, Wang, Bin, Habibi, Muhammad Ravi Shulthan, Ng, Lynnette Hui Xian, Bangera, Mithil, Bangera, Yeshil, Pattnayak, Priyaranjan, Chan, Dun Li, Djuniwar, Sherissa Caren, Shan, Hee Ming

Abstract

While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple, but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on 3 VL architectures: large vision-language models, text-to-image diffusion models, and vision-language embedding models, and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards applicability of multimodal vision-language models in diverse regions and demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.

Chinese Translation

尽管视觉-语言（VL）领域在整合多种语言和领域的视觉与文本信息方面取得了显著成功，但仍缺乏专门的框架来评估视觉-语言系统中的人类中心对齐。为了解决这一问题，我们提出了两个贡献。首先，我们引入了人类中心的区域适应（Anthropogenic Regional Adaptation）：一种新颖的范式，旨在优化模型与特定区域背景的相关性，同时确保全球泛化能力的保留。其次，我们提出了一种简单但有效的适应方法，称为地理泛化简化（Geographical-generalization-made-easy，GG-EZ），该方法利用区域数据过滤和模型合并。通过对三种视觉-语言架构的大规模实验：大型视觉-语言模型、文本到图像的扩散模型和视觉-语言嵌入模型，以及对东南亚（SEA）区域适应的案例研究，我们展示了人类中心区域适应的重要性和GG-EZ的有效性，显示出在东南亚文化相关性指标上提高了5-15%的成绩，同时保持了超过98%的全球性能，甚至在某些情况下超过了全球性能。我们的研究确立了人类中心区域对齐作为多模态视觉-语言模型在不同区域应用的基础范式，并展示了一种简单而有效的基线方法，该方法优化了区域价值对齐，同时保留了全球泛化能力。

View on arXiv Download PDF AI Translation

cs.AI / 156 / 2604.11504

Lectures on AI for Mathematics

数学领域的人工智能讲座

Chen, Xiaoyang, Chen, Xiaoyang

Abstract

This book provides a comprehensive and accessible introduction to the emerging field of AI for mathematics. It covers the core principles and diverse applications of using artificial intelligence to advance mathematical research. Through clear explanations, the text explores how AI can discover hidden mathematical patterns, assist in proving complicated theorems, and even construct counterexamples to challenge conjectures.

Chinese Translation

本书提供了对新兴的数学领域人工智能的全面且易于理解的介绍。它涵盖了利用人工智能推动数学研究的核心原理和多样化应用。通过清晰的解释，文本探讨了人工智能如何发现隐藏的数学模式、协助证明复杂的定理，甚至构造反例以挑战猜想。

View on arXiv Download PDF AI Translation

cs.AI / 157 / 2604.11523

PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints

PAC-BENCH：在隐私约束下评估多智能体协作

Park, Minjun, Kim, Donghyun, Ju, Hyeonjong, Lim, Seungwon, Choi, Dongwook, Kwon, Taeyoon, Kim, Minju, Yeo, Jinyoung

Abstract

We are entering an era in which individuals and organizations increasingly deploy dedicated AI agents that interact and collaborate with other agents. However, the dynamics of multi-agent collaboration under privacy constraints remain poorly understood. In this work, we present $PAC\text{-}Bench$, a benchmark for systematic evaluation of multi-agent collaboration under privacy constraints. Experiments on $PAC\text{-}Bench$ show that privacy constraints substantially degrade collaboration performance and make outcomes depend more on the initiating agent than the partner. Further analysis reveals that this degradation is driven by recurring coordination breakdowns, including early-stage privacy violations, overly conservative abstraction, and privacy-induced hallucinations. Together, our findings identify privacy-aware multi-agent collaboration as a distinct and unresolved challenge that requires new coordination mechanisms beyond existing agent capabilities.

Chinese Translation

我们正进入一个个人和组织越来越多地部署专用人工智能代理以与其他代理进行互动和协作的时代。然而，在隐私约束下的多智能体协作动态仍然不够清晰。在本研究中，我们提出了$PAC ext{-}Bench$，这是一个用于系统评估隐私约束下多智能体协作的基准。对$PAC ext{-}Bench$的实验表明，隐私约束显著降低了协作性能，并使结果更依赖于发起代理而非合作伙伴。进一步分析揭示，这种性能下降是由反复出现的协调失效驱动的，包括早期隐私违规、过于保守的抽象以及隐私引发的幻觉。综合来看，我们的研究发现，关注隐私的多智能体协作是一个独特且尚未解决的挑战，需要超越现有代理能力的新协调机制。

View on arXiv Download PDF AI Translation

cs.AI / 158 / 2604.11524

Limited Perfect Monotonical Surrogates constructed using low-cost recursive linkage discovery with guaranteed output

基于低成本递归关联发现构建的有限完美单调代理模型及其输出保证

Przewozniczek, M. W., Chicano, F., Tinós, R., Komarnicki, M. M.

Abstract

Surrogates provide a cheap solution evaluation and offer significant leverage for optimizing computationally expensive problems. Usually, surrogates only approximate the original function. Recently, the perfect linear surrogates were proposed that ideally represent the original function. These surrogates do not mimic the original function. In fact, they are another (correct) representation of it and enable a wide range of possibilities, e.g., discovering the optimized function for problems where the direct transformation of the encoded solution into its evaluation is not available. However, many real-world problems can not be represented by linear models, making the aforementioned surrogates inapplicable. Therefore, we propose the Limited Monotonical Perfect Surrogate (LyMPuS), which overcomes this difficulty and enables the comparison of two solutions that differ by a single variable. Our proposition is suitable for limiting the cost of expensive local search procedures. The proposed surrogate is parameterless and can be trained on the fly without any separate surrogate-building step. It uses only the necessary fitness evaluations, and the already-paid costs are not wasted when the model is updated. Finally, it offers low-cost missing-linkage detection and low-cost linkage discovery, guaranteed to find a missing dependency in no more than $2\lceil\log_2(n)\rceil$ steps.

Chinese Translation

代理模型提供了一种廉价的解评估方法，并为优化计算代价高昂的问题提供了显著的杠杆作用。通常，代理模型仅对原始函数进行近似。近期，提出了完美线性代理模型（perfect linear surrogates），其理想地表示原始函数。这些代理模型并非简单模仿原始函数，实际上它们是原函数的另一种（正确的）表示形式，并支持广泛的应用可能性，例如在无法直接将编码解转换为其评估值的问题中发现优化函数。然而，许多实际问题无法用线性模型表示，使得上述代理模型不适用。因此，我们提出了有限单调完美代理模型（Limited Monotonical Perfect Surrogate，LyMPuS），该模型克服了这一难题，能够比较仅在单个变量上不同的两个解。我们的方案适用于限制昂贵局部搜索过程的计算成本。所提代理模型无参数，可在线训练，无需单独的代理构建步骤。它仅使用必要的适应度评估，且已付出的计算成本在模型更新时不会被浪费。最后，该模型提供了低成本的缺失关联检测和低成本的关联发现，保证在不超过$2\lceil\log_2(n)\rceil$步内找到缺失的依赖关系。

View on arXiv Download PDF AI Translation

cs.AI / 159 / 2604.11535

Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems

大规模问题归约：计算困难问题的智能代理集成

Pan, Xi-Wei, An, Shi-Wen, Liu, Jin-Guo

Abstract

Solving an NP-hard optimization problem often requires reformulating it for a specific solver -- quantum hardware, a commercial optimizer, or a domain heuristic. A tool for polynomial-time reductions between hard problems would let practitioners route any supported problem to any supported solver through a single interface. Building such a library at scale, however, has remained out of reach. We show that harness engineering, the practice of designing constraints, verification systems, and feedback loops that channel AI coding agents, can overcome this barrier. Our harness combines a no-code contribution route for domain experts, a multilayer verification stack ranging from type-level checks to agentic feature tests (AI agents role-playing as end users), and a fully automated implementation-review-integration pipeline. In about three months, we built a command-line tool backed by a library of 100+ problem types and 200+~reduction rules in over 170k lines of Rust. The result suggests that a well-engineered harness lets agents build well-tested software at a scale and pace beyond prior reduction-library efforts. Because the reduction graph composes transitively, a new solver registered for any single problem type instantly becomes available to every problem connected by a reduction path. The source code is available at https://github.com/CodingThrust/problem-reductions.

Chinese Translation

解决一个NP难优化问题通常需要针对特定求解器——量子硬件、商业优化器或领域启发式方法——重新表述该问题。一个支持多项式时间归约的工具，可以让从业者通过单一接口将任意支持的问题路由到任意支持的求解器。然而，构建这样一个大规模库一直难以实现。我们展示了约束工程（harness engineering）的应用，即设计约束、验证系统和反馈回路以引导AI编码代理的实践，能够突破这一障碍。我们的约束系统结合了面向领域专家的无代码贡献路径、多层次验证堆栈（涵盖类型级检查到代理特征测试，后者为AI代理扮演终端用户角色）以及全自动实现-审查-集成流水线。在约三个月内，我们构建了一个命令行工具，背后支持着一个包含100多个问题类型和200多条归约规则的库，代码量超过17万行Rust。结果表明，一个设计良好的约束系统使代理能够以超越以往归约库努力的规模和速度构建经过充分测试的软件。由于归约图具有传递组合性，任何单一问题类型注册的新求解器都能立即对所有通过归约路径连接的问题可用。源代码可在https://github.com/CodingThrust/problem-reductions获取。

View on arXiv Download PDF AI Translation

cs.AI / 160 / 2604.11540

A collaborative agent with two lightweight synergistic models for autonomous crystal materials research

一种具有两个轻量级协同模型的协作智能体用于自主晶体材料研究

Shi, Tongyu, Li, Yutang, Li, Zhanyuan, Liu, Qian, Zhou, Jie, Xu, Wenhe, Li, Yang, Dai, Dawei, He, Rui, Zhou, Wenhua, Wang, Jiahong, Yu, Xue-Feng

Abstract

Current large language models require hundreds of billions of parameters yet struggle with domain-specific reasoning and tool coordination in materials science. Here, we present MatBrain, a lightweight collaborative agent system with two synergistic models specialization for crystal materials research. MatBrain employs a dual-model architecture: Mat-R1 (30B parameters) as the analytical model providing expert-level domain reasoning, and Mat-T1 (14B parameters) as the executive model orchestrating tool-based actions. Entropy analysis confirms that this architecture resolves the conflict between tool planning and analytical reasoning by decoupling their distinct entropy dynamics. Enabled by this dual-model architecture and structural efficiency, MatBrain significantly outperforms larger general-purpose models while reducing the hardware deployment barrier by over 95%. MatBrain exhibits versatility across structure generation, property prediction, and synthesis planning tasks. Applied to catalyst design, MatBrain generated 30,000 candidate structures and identified 38 promising materials within 48 hours, achieving approximately 100-fold acceleration over traditional approaches. These results demonstrate the potential of lightweight collaborative intelligence for advancing materials research capabilities.

Chinese Translation

当前的大型语言模型需要数千亿个参数，但在材料科学领域特定推理和工具协调方面仍然存在困难。在此，我们提出了MatBrain，一个轻量级协作智能体系统，具有两个协同模型，专注于晶体材料研究。MatBrain采用双模型架构：Mat-R1（30B参数）作为分析模型，提供专家级领域推理；Mat-T1（14B参数）作为执行模型，协调基于工具的行动。熵分析确认该架构通过解耦其不同的熵动态，解决了工具规划与分析推理之间的冲突。得益于这种双模型架构和结构效率，MatBrain在性能上显著超越了更大的一般用途模型，同时将硬件部署门槛降低了超过95%。MatBrain在结构生成、性质预测和合成规划任务中表现出多样性。在催化剂设计应用中，MatBrain在48小时内生成了30,000个候选结构，并识别出38种有前景的材料，实现了约100倍于传统方法的加速。这些结果展示了轻量级协作智能在提升材料研究能力方面的潜力。

View on arXiv Download PDF AI Translation

cs.AI / 161 / 2604.11548

SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering

SemaClaw：通过驯服工程迈向通用个人人工智能代理的第一步

Zhu, Ningyan, Wang, Huacan, Zhou, Jie, Chen, Feiyu, Zhang, Shuo, Chen, Ge, Liu, Chen, Wu, Jiarou, Chen, Wangyi, Mou, Xiaofeng, Xu, Yi

Abstract

The rise of OpenClaw in early 2026 marks the moment when millions of users began deploying personal AI agents into their daily lives, delegating tasks ranging from travel planning to multi-step research. This scale of adoption signals that two parallel arcs of development have reached an inflection point. First is a paradigm shift in AI engineering, evolving from prompt and context engineering to harness engineering-designing the complete infrastructure necessary to transform unconstrained agents into controllable, auditable, and production-reliable systems. As model capabilities converge, this harness layer is becoming the primary site of architectural differentiation. Second is the evolution of human-agent interaction from discrete tasks toward a persistent, contextually aware collaborative relationship, which demands open, trustworthy and extensible harness infrastructure. We present SemaClaw, an open-source multi-agent application framework that addresses these shifts by taking a step towards general-purpose personal AI agents through harness engineering. Our primary contributions include a DAG-based two-phase hybrid agent team orchestration method, a PermissionBridge behavioral safety system, a three-tier context management architecture, and an agentic wiki skill for automated personal knowledge base construction.

Chinese Translation

2026年初OpenClaw的崛起标志着数百万用户开始将个人人工智能代理融入日常生活中，委派从旅行规划到多步骤研究等各类任务。这种规模的采用表明两个平行的发展轨迹已达到拐点。首先是人工智能工程的范式转变，从提示和上下文工程演变为驯服工程——设计必要的完整基础设施，以将不受限制的代理转变为可控、可审计和生产可靠的系统。随着模型能力的趋同，这一驯服层正成为架构差异化的主要场所。其次是人机交互的演变，从离散任务转向持久的、具上下文意识的协作关系，这要求开放、可信和可扩展的驯服基础设施。我们提出了SemaClaw，一个开源的多代理应用框架，通过驯服工程朝着通用个人人工智能代理迈出了一步，以应对这些变化。我们的主要贡献包括基于有向无环图（DAG）的两阶段混合代理团队编排方法、PermissionBridge行为安全系统、三层上下文管理架构，以及用于自动化个人知识库构建的代理维基技能。

View on arXiv Download PDF AI Translation

cs.AI / 162 / 2604.11557

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

UniToolCall：统一工具使用表示、数据和评估的 LLM 代理框架

Liang, Yijuan, Chen, Xinghao, Ge, Yifan, Wu, Ziyi, Wu, Hao, Zeng, Changyu, Xing, Wei, Shen, Xiaoyu

Abstract

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query--Action--Observation--Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.

Chinese Translation

工具使用能力是 LLM 代理的一个基本组成部分，使其能够通过结构化函数调用与外部系统进行交互。然而，现有研究表现出不一致的交互表示，主要忽视了工具使用轨迹的结构分布，并依赖于不兼容的评估基准。我们提出了 UniToolCall，一个统一的工具学习框架，标准化了从工具集构建、数据集生成到评估的整个流程。该框架策划了一个包含 22,000 多个工具的大型工具池，并通过结合 10 个标准化的公共数据集与结构控制的合成轨迹构建了一个包含 390,000 多个实例的混合训练语料库。它明确建模了多样的交互模式，包括单跳与多跳、单轮与多轮，同时捕捉串行和并行执行结构。为了支持连贯的多轮推理，我们进一步引入了一种锚链接机制，以强制执行跨轮依赖。此外，我们将 7 个公共基准转换为统一的查询-动作-观察-答案 (QAOA) 表示，并在函数调用、轮次和对话层面进行细粒度评估。实验表明，在我们的数据集上对 Qwen3-8B 进行微调显著提高了工具使用性能。在干扰因素较多的 Hybrid-20 设置下，达到了 93.0% 的单轮严格精度，超越了包括 GPT、Gemini 和 Claude 在内的商业模型。

View on arXiv Download PDF AI Translation

cs.AI / 163 / 2604.11609

Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models

交叉性谄媚行为：感知用户人口统计特征如何影响大型语言模型中的虚假认可

Maltbie, Benjamin, Raval, Shivam

Abstract

Large language models exhibit sycophantic tendencies--validating incorrect user beliefs to appear agreeable. We investigate whether this behavior varies systematically with perceived user demographics, testing whether combinations of race, age, gender, and expressed confidence level produce differential false validation rates. Inspired by the legal concept of intersectionality, we conduct 768 multi-turn adversarial conversations using Anthropic's Petri evaluation framework, probing GPT-5-nano and Claude Haiku 4.5 across 128 persona combinations in mathematics, philosophy, and conspiracy theory domains. GPT-5-nano is significantly more sycophantic than Claude Haiku 4.5 overall ($\bar{x}=2.96$ vs. $1.74$, $p < 10^{-32}$, Wilcoxon signed-rank). For GPT-5-nano, we find that philosophy elicits 41% more sycophancy than mathematics and that Hispanic personas receive the highest sycophancy across races. The worst-scoring persona, a confident, 23-year-old Hispanic woman, averages 5.33/10 on sycophancy. Claude Haiku 4.5 exhibits uniformly low sycophancy with no significant demographic variation. These results demonstrate that sycophancy is not uniformly distributed across users and that safety evaluations should incorporate identity-aware testing.

Chinese Translation

大型语言模型表现出谄媚倾向——为了显得认同用户，验证错误的用户信念。我们探讨此行为是否随着感知的用户人口统计特征系统性变化，测试种族、年龄、性别及表达的自信程度组合是否导致不同的虚假认可率。受法律交叉性概念启发，我们使用Anthropic的Petri评估框架，针对GPT-5-nano和Claude Haiku 4.5，在数学、哲学和阴谋论领域，进行了768轮多轮对抗性对话，涵盖128种人格组合。整体来看，GPT-5-nano的谄媚程度显著高于Claude Haiku 4.5（平均值分别为2.96 vs. 1.74，Wilcoxon符号秩检验p < 10^{-32}）。对于GPT-5-nano，哲学领域的谄媚行为比数学领域高出41%，而西班牙裔人格在各族群中获得最高的谄媚程度。表现最差的人格是一位自信的23岁西班牙裔女性，其谄媚评分平均为5.33/10。Claude Haiku 4.5表现出整体较低的谄媚水平，且无显著的人口统计学差异。这些结果表明，谄媚行为在用户间分布不均，安全评估应纳入身份感知的测试。

View on arXiv Download PDF AI Translation

cs.AI / 164 / 2604.11623

Context Kubernetes: Declarative Orchestration of Enterprise Knowledge for Agentic AI Systems

Context Kubernetes：面向智能代理AI系统的企业知识声明式编排架构

Mouzouni, Charafeddine

Abstract

We introduce Context Kubernetes, an architecture for orchestrating enterprise knowledge in agentic AI systems, with a prototype implementation and eight experiments. The core observation is that delivering the right knowledge, to the right agent, with the right permissions, at the right freshness -- across an entire organization -- is structurally analogous to the container orchestration problem Kubernetes solved a decade ago. We formalize six core abstractions, a YAML-based declarative manifest for knowledge-architecture-as-code, a reconciliation loop, and a three-tier agent permission model where agent authority is always a strict subset of human authority. Three value experiments show: (1) without governance, agents serve phantom content from deleted sources and leak cross-domain data in 26.5% of queries; (2) without freshness monitoring, stale content is served silently -- with reconciliation, staleness is detected in under 1ms; (3) in five attack scenarios, flat permissions block 0/5 attacks, basic RBAC blocks 4/5, and the three-tier model blocks 5/5. Five correctness experiments confirm zero unauthorized deliveries, zero invariant violations, and architectural enforcement of out-of-band approval isolation that no surveyed enterprise platform provides. A survey of four major platforms (Microsoft, Salesforce, AWS, Google) documents that none architecturally isolates agent approval channels. We identify four properties that make context orchestration harder than container orchestration, and argue that these make the solution more valuable.

Chinese Translation

本文介绍了Context Kubernetes，一种用于智能代理AI系统中企业知识编排的架构，并提供了原型实现及八项实验。核心观点是，在整个组织范围内，以正确的权限、适当的新鲜度，将合适的知识传递给合适的代理，这一过程在结构上类似于十年前Kubernetes解决的容器编排问题。我们形式化了六个核心抽象、基于YAML的声明式知识架构即代码(manifest)、一个调和循环(reconciliation loop)，以及一个三层代理权限模型，其中代理权限始终严格属于人类权限的子集。三项价值实验表明：(1) 无治理时，代理会从已删除的源提供虚幻内容，且26.5%的查询发生跨域数据泄露；(2) 无新鲜度监控时，过期内容会被静默提供——通过调和机制，过期内容检测时间低于1毫秒；(3) 在五种攻击场景中，扁平权限模型阻挡0/5次攻击，基础RBAC阻挡4/5次，而三层模型阻挡全部5/5次攻击。五项正确性实验确认零未授权交付、零不变量违规，并实现了架构层面的带外审批隔离，而当前调研的企业平台均未提供此功能。对微软、Salesforce、AWS和谷歌四大平台的调研显示，均未在架构上隔离代理审批通道。我们识别出四个使上下文编排比容器编排更复杂的特性，并论证这些特性提升了该解决方案的价值。

View on arXiv Download PDF AI Translation

cs.AI / 165 / 2604.11626

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

RationalRewards：通过推理奖励在训练和测试阶段提升视觉生成

Wang, Haozhe, Wei, Cong, Ren, Weiming, Liu, Jiaming, Lin, Fangzhen, Chen, Wenhu

Abstract

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.

Chinese Translation

大多数视觉生成的奖励模型将丰富的人类判断简化为单一且未解释的分数，忽略了偏好背后的推理过程。我们展示了通过教导奖励模型在评分前生成明确的多维批评，将其从被动评估者转变为主动优化工具，从而以两种互补的方式提升生成器：训练阶段，结构化的推理提供了可解释的细粒度奖励以用于强化学习；测试阶段，生成-批评-改进（Generate-Critique-Refine）循环将批评转化为针对性的提示修正，无需任何参数更新即可改进输出。为了在无需昂贵推理注释的情况下训练此类奖励模型，我们提出了偏好锚定推理（Preference-Anchored Rationalization，PARROT）框架，该框架通过锚定生成、一致性过滤和蒸馏，从现成的偏好数据中恢复高质量推理。所得到的模型RationalRewards（8B）在开源奖励模型中实现了最先进的偏好预测性能，与Gemini-2.5-Pro相当，同时使用的训练数据量比可比基线少10-20倍。作为强化学习奖励，它持续优于标量奖励，提升了文本到图像及图像编辑生成器的表现。最显著的是，其测试时的批评与改进循环在多个基准测试中匹配甚至超越了基于强化学习的微调，表明结构化推理能够激发现有生成器中潜在的能力，而次优提示无法引出这些能力。

View on arXiv Download PDF AI Translation

cs.AI / 166 / 2604.11663

Why Do Large Language Models Generate Harmful Content?

大型语言模型为何会生成有害内容？

Ganguli, Rajesh, Moraffah, Raha

Abstract

Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of such behavior remain under explored. We propose a causal mediation analysis-based approach to identify the causal factors responsible for harmful generation. Our method performs a multi-granular analysis across model layers, modules (MLP and attention blocks), and individual neurons. Extensive experiments on state-of-the-art LLMs indicate that harmful generation arises in the later layers of the model, results primarily from failures in MLP blocks rather than attention blocks, and is associated with neurons that act as a gating mechanism for harmful generation. The results indicate that the early layers in the model are used for a contextual understanding of harmfulness in a prompt, which is then propagated through the model, to generate harmfulness in the late layers, as well as a signal indicating harmfulness through MLP blocks. This is then further propagated to the last layer of the model, specifically to a sparse set of neurons, which receives the signal and determines the generation of harmful content accordingly.

Chinese Translation

大型语言模型（LLMs）已被证明能够生成有害内容。然而，这种行为背后的根本原因仍然未得到充分探讨。我们提出了一种基于因果中介分析的方法，以识别导致有害生成的因果因素。我们的方法在模型的不同层次、模块（多层感知器（MLP）和注意力块）以及单个神经元之间进行多层次分析。对最先进的LLMs进行的广泛实验表明，有害生成主要发生在模型的后期层，主要源于MLP块的失败，而非注意力块，并且与作为有害生成门控机制的神经元相关。结果表明，模型的早期层用于对提示中有害性的上下文理解，这种理解随后通过模型传播，导致后期层的有害生成，以及通过MLP块传递的有害性信号。该信号进一步传播到模型的最后一层，特别是传递给一组稀疏的神经元，这些神经元接收信号并相应地决定生成有害内容。

View on arXiv Download PDF AI Translation

cs.AI / 167 / 2604.11703

DreamKG: A KG-Augmented Conversational System for People Experiencing Homelessness

DreamKG：面向无家可归者的知识图谱增强对话系统

Alizadeh, Javad M, Zheng, Genhui, Tan, Chiu C, Chen, Yuzhou, Martinez, Omar, McCallion, Philip, Ding, Ying, Yang, Chenguang, Tomosky, AnneMarie, Wu, Huanmei

Abstract

People experiencing homelessness (PEH) face substantial barriers to accessing timely, accurate information about community services. DreamKG addresses this through a knowledge graph-augmented conversational system that grounds responses in verified, up-to-date data about Philadelphia organizations, services, locations, and hours. Unlike standard large language models (LLMs) prone to hallucinations, DreamKG combines Neo4j knowledge graphs with structured query understanding to handle location-aware and time-sensitive queries reliably. The system performs spatial reasoning for distance-based recommendations and temporal filtering for operating hours. Preliminary evaluation shows 59% superiority over Google Search AI on relevant queries and 84% rejection of irrelevant queries. This demonstration highlights the potential of hybrid architectures that combines LLM flexibility with knowledge graph reliability to improve service accessibility for vulnerable populations effectively.

Chinese Translation

无家可归者（PEH）在获取社区服务的及时且准确信息方面面临重大障碍。DreamKG通过一个知识图谱增强的对话系统解决了这一问题，该系统基于经过验证的、最新的关于费城组织、服务、地点及营业时间的数据来生成回答。不同于易产生幻觉的标准大型语言模型（LLMs），DreamKG结合了Neo4j知识图谱与结构化查询理解，能够可靠地处理基于位置感知和时间敏感的查询。该系统执行空间推理以提供基于距离的推荐，并进行时间过滤以匹配营业时间。初步评估显示，在相关查询上优于Google Search AI 59%，且对无关查询的拒绝率达到84%。本演示突显了结合LLM灵活性与知识图谱可靠性的混合架构在有效提升弱势群体服务可及性方面的潜力。

View on arXiv Download PDF AI Translation

cs.AI / 168 / 2604.11705

Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems

自主驾驶教练：基于自主智能体的人机闭环网络物理系统的鲁棒性与确定性研究

Prahlad, Deeksha, Fan, Daniel, Kim, Hokeun

Abstract

Foundation models, including large language models (LLMs), are increasingly used for human-in-the-loop (HITL) cyber-physical systems (CPS) because foundation model-based AI agents can potentially interact with both the physical environments and human users. However, the unpredictable behavior of human users and AI agents, in addition to the dynamically changing physical environments, leads to uncontrollable nondeterminism. To address this urgent challenge of enabling agentic AI-powered HITL CPS, we propose a reactor-model-of-computation (MoC)-based approach, realized by the open-source Lingua Franca (LF) framework. We also carry out a concrete case study using the agentic driving coach as an application of HITL CPS. By evaluating the LF-based agentic HITL CPS, we identify practical challenges in reintroducing determinism into such agentic HITL CPS and present pathways to address them.

Chinese Translation

基础模型，包括大型语言模型（LLMs），正日益应用于人机闭环（HITL）网络物理系统（CPS），因为基于基础模型的人工智能智能体能够与物理环境和人类用户进行交互。然而，用户和人工智能智能体的不可预测行为，加之动态变化的物理环境，导致系统中存在不可控的非确定性。为应对赋能自主智能体驱动的人机闭环CPS所面临的这一紧迫挑战，我们提出了一种基于反应器计算模型（reactor-model-of-computation, MoC）的方法，并通过开源的Lingua Franca（LF）框架实现。我们还通过一个具体案例研究——自主驾驶教练，作为人机闭环CPS的应用示范。通过对基于LF的自主智能体HITL CPS进行评估，我们识别了在此类系统中重新引入确定性所面临的实际挑战，并提出了相应的解决路径。

View on arXiv Download PDF AI Translation

cs.AI / 169 / 2604.11709

A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment

基于Mamba的多模态网络用于多尺度爆炸引发的快速结构损伤评估

Ma, Wanli, Selvakumaran, Sivasakthy, Farrimond, Dain G., Dennis, Adam A., Rigby, Samuel E.

Abstract

Accurate and rapid structural damage assessment (SDA) is crucial for post-disaster management, helping responders prioritise resources, plan rescues, and support recovery. Traditional field inspections, though precise, are limited by accessibility, safety risks, and time constraints, especially after large explosions. Machine learning with remote sensing has emerged as a scalable solution for rapid SDA, with Mamba-based networks achieving state-of-the-art performance. However, these methods often require extensive training and large datasets, limiting real-world applicability. Moreover, they fail to incorporate key physical characteristics of blast loading for SDA. To overcome these challenges, we propose a Mamba-based multimodal network for rapid SDA that integrates multi-scale blast-loading information with optical remote sensing images. Evaluated on the 2020 Beirut explosion, our method significantly improves performance over state-of-the-art approaches. Code is available at: https://github.com/IMPACTSquad/Blast-Mamba

Chinese Translation

准确和快速的结构损伤评估（SDA）对于灾后管理至关重要，帮助应急响应者优先分配资源、规划救援和支持恢复。传统的现场检查虽然精确，但受到可达性、安全风险和时间限制的制约，尤其是在大型爆炸后。结合遥感技术的机器学习已成为快速SDA的可扩展解决方案，其中基于Mamba的网络实现了最先进的性能。然而，这些方法通常需要大量的训练和大规模数据集，限制了其在实际中的应用。此外，它们未能将爆炸载荷的关键物理特性纳入SDA。为了解决这些挑战，我们提出了一种基于Mamba的多模态网络，用于快速SDA，集成了多尺度爆炸载荷信息与光学遥感图像。在2020年贝鲁特爆炸事件的评估中，我们的方法显著提高了性能，超越了最先进的方法。代码可在以下链接获取：https://github.com/IMPACTSquad/Blast-Mamba

View on arXiv Download PDF AI Translation

cs.AI / 170 / 2604.11716

SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context

SWE-AGILE：一种高效管理动态推理上下文的软件代理框架

Lian, Shuquan, Liu, Juncheng, Chen, Yazhe, Chen, Yuhong, Li, Hui

Abstract

Prior representative ReAct-style approaches in autonomous Software Engineering (SWE) typically lack the explicit System-2 reasoning required for deep analysis and handling complex edge cases. While recent reasoning models demonstrate the potential of extended Chain-of-Thought (CoT), applying them to the multi-turn SWE task creates a fundamental dilemma: retaining full reasoning history leads to context explosion and ``Lost-in-the-Middle'' degradation, while discarding it would force the agent to redundantly re-reason at every step. To address these challenges, we propose SWE-AGILE, a novel software agent framework designed to bridge the gap between reasoning depth, efficiency, and context constraints. SWE-AGILE introduces a Dynamic Reasoning Context strategy, maintaining a ``sliding window'' of detailed reasoning for immediate continuity to prevent redundant re-analyzing, while compressing historical reasoning content into concise Reasoning Digests. Empirically, SWE-AGILE sets a new standard for 7B-8B models on SWE-Bench-Verified using only 2.2k trajectories and 896 tasks. Code is available at https://github.com/KDEGroup/SWE-AGILE.

Chinese Translation

以往典型的ReAct风格自主软件工程（SWE）方法通常缺乏进行深度分析和处理复杂边缘案例所需的显式系统2级推理。尽管近期的推理模型展示了扩展链式思维（Chain-of-Thought, CoT）的潜力，但将其应用于多轮SWE任务时面临根本性困境：保留完整推理历史会导致上下文爆炸和“中间遗失”（Lost-in-the-Middle）性能下降，而丢弃历史则迫使代理在每一步重复冗余推理。为应对这些挑战，我们提出了SWE-AGILE，一种旨在弥合推理深度、效率与上下文限制之间差距的新型软件代理框架。SWE-AGILE引入了动态推理上下文策略，维护一个详细推理的“滑动窗口”以保证即时连续性，避免重复分析，同时将历史推理内容压缩为简洁的推理摘要（Reasoning Digests）。实证结果表明，SWE-AGILE在仅使用2.2k轨迹和896个任务的情况下，于SWE-Bench-Verified基准上为7B-8B模型树立了新的性能标准。代码已开源，地址：https://github.com/KDEGroup/SWE-AGILE。

View on arXiv Download PDF AI Translation

cs.AI / 171 / 2604.11741

Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

协作多智能体脚本生成以增强谋杀推理游戏中的不完美信息推理

Zhong, Keyang, Xie, Junlin, Wu, Hefeng, Li, Haofeng, Li, Guanbin

Abstract

Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesizing high-quality, role-driven multiplayer game scripts, enabling fine-grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). Our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi-hop reasoning chains, through coordinated agent interactions. We design a two-stage agent-monitored training strategy to enhance the reasoning ability of VLMs: (1) chain-of-thought based fine-tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO-based reinforcement learning with agent-monitored reward shaping, encouraging the model to develop character-specific reasoning behaviors and effective multimodal multi-hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception-resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi-hop reasoning under imperfect information.

Chinese Translation

视觉语言模型（VLMs）在感知任务中展现出令人印象深刻的能力，但在多玩家游戏环境中面对不完美和欺骗性信息时，其复杂的多跳推理能力却有所下降。本文研究了一种具有代表性的多玩家任务——谋杀推理游戏，该游戏要求根据不同意图角色提供的部分线索推断隐藏的真相。为了解决这一挑战，我们提出了一种协作多智能体框架，用于评估和合成高质量的角色驱动多玩家游戏脚本，从而实现针对角色身份（即，谋杀者与无辜者）量身定制的细粒度交互模式。我们的系统通过协调的智能体交互生成丰富的多模态上下文，包括角色背景故事、视觉和文本线索，以及多跳推理链。我们设计了一种两阶段的智能体监控训练策略，以增强VLMs的推理能力：（1）基于思维链的微调，使用经过筛选和合成的数据集来建模不确定性和欺骗；（2）基于GRPO的强化学习，结合智能体监控的奖励塑造，鼓励模型发展特定于角色的推理行为和有效的多模态多跳推理。大量实验表明，我们的方法显著提升了VLMs在叙事推理、隐藏事实提取和抗欺骗理解方面的表现。我们的贡献为在不确定、对抗和社会复杂条件下训练和评估VLMs提供了可扩展的解决方案，为未来在不完美信息下的多模态多跳推理基准奠定了基础。

View on arXiv Download PDF AI Translation

cs.AI / 172 / 2604.11759

Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure

检索不足：为什么组织人工智能需要认识基础设施

Bottino, Federico, Ferrero, Carlo, Dosio, Nicholas, Beneventano, Pierfrancesco

Abstract

Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content without distinguishing binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. We argue that the ceiling on organizational AI is not retrieval fidelity but \emph{epistemic} fidelity--the system's ability to represent commitment strength, contradiction status, and organizational ignorance as computable properties. We present OIDA, a framework that structures organizational knowledge as typed Knowledge Objects carrying epistemic class, importance scores with class-specific decay, and signed contradiction edges. The Knowledge Gravity Engine maintains scores deterministically with proved convergence guarantees (sufficient condition: max degree $< 7$; empirically robust to degree 43). OIDA introduces QUESTION-as-modeled-ignorance: a primitive with inverse decay that surfaces what an organization does \emph{not} know with increasing urgency--a mechanism absent from all surveyed systems. We describe the Epistemic Quality Score (EQS), a five-component evaluation methodology with explicit circularity analysis. In a controlled comparison ($n{=}10$ response pairs), OIDA's RAG condition (3,868 tokens) achieves EQS 0.530 vs.\ 0.848 for a full-context baseline (108,687 tokens); the $28.1\times$ token budget difference is the primary confound. The QUESTION mechanism is statistically validated (Fisher $p{=}0.0325$, OR$=21.0$). The formal properties are established; the decisive ablation at equal token budget (E4) is pre-registered and not yet run.

Chinese Translation

人工智能代理使用的组织知识通常缺乏认识结构：检索系统表面上呈现语义相关的内容，但未能区分绑定决策与被放弃的假设、争议性主张与已解决的主张，或已知事实与未解决的问题。我们认为，组织人工智能的瓶颈不在于检索的准确性，而在于 extit{认识}的准确性——即系统表示承诺强度、矛盾状态和组织无知作为可计算属性的能力。我们提出了OIDA，一个将组织知识结构化为类型化知识对象的框架，这些对象携带认识类别、具有类别特定衰减的重要性分数和带符号的矛盾边。知识引力引擎以确定性方式维护分数，并提供了收敛保证（充分条件：最大度数 $< 7$；在度数43下经验上稳健）。OIDA引入了QUESTION作为建模无知的原语：一种具有逆衰减的原语，随着紧迫性增加而呈现组织所 extit{不}知道的内容——这是所有调查系统中缺失的机制。我们描述了认识质量评分（EQS），这是一种具有明确循环分析的五个组成部分的评估方法。在一个受控比较中（$n{=}10$ 响应对），OIDA的RAG条件（3,868个标记）实现了EQS 0.530，而完整上下文基线（108,687个标记）为0.848；$28.1 imes$的标记预算差异是主要的混淆因素。QUESTION机制经过统计验证（Fisher $p{=}0.0325$, OR$=21.0$）。正式属性已建立；在相等的标记预算下的决定性消融（E4）已预注册且尚未运行。

View on arXiv Download PDF AI Translation

cs.AI / 173 / 2604.11786

GenTac: Generative Modeling and Forecasting of Soccer Tactics

GenTac：足球战术的生成建模与预测

Rao, Jiayuan, Gui, Tianlin, Wu, Haoning, Wang, Yanfeng, Xie, Weidi

Abstract

Modeling open-play soccer tactics is a formidable challenge due to the stochastic, multi-agent nature of the game. Existing computational approaches typically produce single, deterministic trajectory forecasts or focus on highly structured set-pieces, fundamentally failing to capture the inherent variance and branching possibilities of real-world match evolution. Here, we introduce GenTac, a diffusion-based generative framework that conceptualizes soccer tactics as a stochastic process over continuous multi-player trajectories and discrete semantic events. By learning the underlying distribution of player movements from historical tracking data, GenTac samples diverse, plausible, long-horizon future trajectories. The framework supports rich contextual conditioning, including opponent behavior, specific team or league playing styles, and strategic objectives, while grounding continuous spatial dynamics into a 15-class tactical event space. Extensive evaluations on our proposed benchmark, TacBench, demonstrate four key capabilities: (1) GenTac achieves high geometric accuracy while strictly preserving the collective structural consistency of the team; (2) it accurately simulates stylistic nuances, distinguishing between specific teams (e.g., Auckland FC) and leagues (e.g., A-League versus German leagues); (3) it enables controllable counterfactual simulations, demonstrably altering spatial control and expected threat metrics based on offensive or defensive guidance; and (4) it reliably anticipates future tactical outcomes directly from generated rollouts. Finally, we demonstrate that GenTac can be successfully trained to generalize to other dynamic team sports, including basketball, American football, and ice hockey.

Chinese Translation

由于足球比赛的随机性和多智能体特性，开放式比赛战术的建模是一项极具挑战性的任务。现有的计算方法通常仅生成单一的确定性轨迹预测，或专注于高度结构化的定位球，根本无法捕捉现实比赛演变中固有的多样性和分支可能性。在此，我们提出了GenTac，一种基于扩散模型的生成框架，将足球战术概念化为连续多球员轨迹和离散语义事件上的随机过程。通过从历史追踪数据中学习球员运动的潜在分布，GenTac能够采样多样且合理的长时段未来轨迹。该框架支持丰富的上下文条件，包括对手行为、特定球队或联赛的打法风格及战略目标，同时将连续空间动态映射到15类战术事件空间。我们在提出的基准测试TacBench上进行了广泛评估，展示了四项核心能力：（1）GenTac在严格保持团队整体结构一致性的同时，实现了高几何精度；（2）能够准确模拟风格细节，区分特定球队（如奥克兰足球俱乐部）和联赛（如澳大利亚A联赛与德国联赛）；（3）支持可控的反事实模拟，能够根据进攻或防守指导显著改变空间控制和预期威胁指标；（4）能够可靠地从生成的轨迹中预测未来战术结果。最后，我们展示了GenTac可成功训练并推广至其他动态团队运动，包括篮球、美式足球和冰球。

View on arXiv Download PDF AI Translation

cs.AI / 174 / 2604.11806

Detecting Safety Violations Across Many Agent Traces

跨多个代理轨迹检测安全违规行为

Stein, Adam, Brown, Davis, Hassani, Hamed, Naik, Mayur, Wong, Eric

Abstract

To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges miss failures that only become visible across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.

Chinese Translation

为了识别安全违规行为，审计员通常需要在大量代理轨迹中进行搜索。这一搜索过程困难重重，因为失败往往是稀有的、复杂的，有时甚至是对抗性隐藏的，只有在分析多个轨迹时才能被发现。这些挑战出现在各种场景中，例如滥用活动、隐蔽破坏、奖励黑客和提示注入等。现有的方法在这里面临诸多困难。逐轨迹的判断无法发现仅在多个轨迹中可见的失败，简单的代理审计无法扩展到大型轨迹集合，而固定监控对意外行为则显得脆弱。我们提出了Meerkat，它结合了聚类与代理搜索，以揭示用自然语言指定的违规行为。通过结构化搜索和对有前景区域的自适应调查，Meerkat能够在不依赖种子场景、固定工作流程或穷举枚举的情况下发现稀疏的失败。在滥用、失调和任务游戏设置中，Meerkat显著提高了安全违规行为的检测率，发现了在一个顶级代理基准上广泛存在的开发者作弊行为，并在CyBench上发现了近4倍于之前审计的奖励黑客实例。

View on arXiv Download PDF AI Translation

计算语言学 (Computation and Language)

131

cs.CL / 1 / 2604.09624

Self-Calibrating Language Models via Test-Time Discriminative Distillation

通过测试时判别蒸馏实现自校准语言模型

Hedna, Mohamed Rissal, Strich, Jan, Semmann, Martin, Biemann, Chris

Abstract

Large language models (LLMs) are systematically overconfident: they routinely express high certainty on questions they often answer incorrectly. Existing calibration methods either require labeled validation data, degrade under distribution shifts, or incur substantial inference costs. Recent work has shown that LLMs already contain a better-calibrated signal than the one they verbalize: the token probability of "True" when the model is asked "Is this answer correct?" ($P(\text{True})$) consistently outperforms their stated confidence, a gap that is theoretically grounded as generative error is lower-bounded by roughly twice the corresponding discriminative error. We introduce $\textbf{SECL}$ ($\textbf{SE}$lf-$\textbf{C}$alibrating $\textbf{L}$anguage Models), a test-time training (TTT) pipeline that exploits this gap as label-free self-supervision, requiring no labeled data or human supervision. SECL adapts only when the input distribution shifts, training on just 6--26% of the question stream at lower cost than the baseline it distills from. Across four small language models from three model families and four diverse domains, SECL reduces Expected Calibration Error (ECE) by 56--78%, outperforming its own supervision signal and matching or outperforming recent inference-time methods. SECL is the first method to apply TTT to calibration; seven ablations covering signal quality, gating strategy, weight accumulation, loss design, domain ordering, hyperparameter sensitivity, and layer selection confirm that each component is crucial and robust across configurations. Code: https://anonymous.4open.science/r/secl-emnlp26-submission-C890

Chinese Translation

大型语言模型（LLMs）系统性地表现出过度自信：它们在经常回答错误的问题上，常常表达出高度的确定性。现有的校准方法要么需要标记的验证数据，要么在分布变化时性能下降，或者导致显著的推理成本。近期的研究表明，LLMs 已经包含比它们所表述的更好校准的信号：当模型被问及“这个答案正确吗？”时，‘真’的标记概率（$P( ext{True})$）始终优于它们所声明的置信度，这一差距在理论上是有依据的，因为生成错误的下限大约是相应判别错误的两倍。我们提出了 $ extbf{SECL}$ （$ extbf{SE}$lf-$ extbf{C}$alibrating $ extbf{L}$anguage Models），一个测试时训练（TTT）管道，利用这一差距作为无标签自我监督，且不需要任何标记数据或人工监督。SECL 仅在输入分布发生变化时进行适应，训练仅占问题流的 6% 到 26%，其成本低于其蒸馏的基线模型。在来自三个模型家族和四个不同领域的四个小型语言模型中，SECL 将期望校准误差（ECE）降低了 56% 到 78%，超越了其自身的监督信号，并与最近的推理时方法相匹配或超越。SECL 是第一个将 TTT 应用于校准的方法；七个消融实验涵盖了信号质量、门控策略、权重累积、损失设计、领域排序、超参数敏感性和层选择，确认每个组件在不同配置中都是至关重要且稳健的。代码链接： https://anonymous.4open.science/r/secl-emnlp26-submission-C890

View on arXiv Download PDF AI Translation

cs.CL / 2 / 2604.09625

Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations

基于网络规模数据和集成LLM注释的通用跨语言仇恨语言检测研究

Dang, Dang H., Mitrovi, Jelena, Granitzer, Michael

Abstract

We study whether large-scale unlabelled web data and LLM-based synthetic annotations can improve multilingual hate speech detection. Starting from texts crawled via OpenWebSearch.eu~(OWS) in four languages (English, German, Spanish, Vietnamese), we pursue two complementary strategies. First, we apply continued pre-training to BERT models by continuing masked language modelling on unlabelled OWS texts before supervised fine-tuning, and show that this yields an average macro-F1 gain of approximately 3% over standard baselines across sixteen benchmarks, with stronger gains in low-resource settings. Second, we use four open-source LLMs (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B) to produce synthetic annotations through three ensemble strategies: mean averaging, majority voting, and a LightGBM meta-learner. The LightGBM ensemble consistently outperforms the other strategies. Fine-tuning on these synthetic labels substantially benefits a small model (Llama3.2-1B: +11% pooled F1), but provides only a modest gain for the larger Qwen2.5-14B (+0.6%). Our results indicate that the combination of web-scale unlabelled data and LLM-ensemble annotations is the most valuable for smaller models and low-resource languages.

Chinese Translation

我们研究了大规模未标记的网络数据和基于LLM的合成注释是否能改善多语言仇恨言论检测。我们从通过OpenWebSearch.eu（OWS）抓取的四种语言（英语、德语、西班牙语、越南语）的文本出发，采用两种互补策略。首先，我们对BERT模型进行持续预训练，在未标记的OWS文本上继续进行掩码语言建模，然后进行监督微调，结果表明，这种方法在十六个基准测试中相较于标准基线平均提升了约3%的宏观F1分数，尤其在低资源环境下表现更为显著。其次，我们使用四个开源LLM（Mistral-7B、Llama3.1-8B、Gemma2-9B、Qwen2.5-14B）通过三种集成策略（均值平均、简单多数投票和LightGBM元学习器）生成合成注释。LightGBM集成方法始终优于其他策略。在这些合成标签上进行微调对小模型（Llama3.2-1B）带来了显著益处（+11%的汇总F1），而对更大模型Qwen2.5-14B的提升则相对有限（+0.6%）。我们的结果表明，网络规模的未标记数据与LLM集成注释的结合对小模型和低资源语言最为有价值。

View on arXiv Download PDF AI Translation

cs.CL / 3 / 2604.09629

HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation

HumorGen：通过基于角色的蒸馏实现大语言模型的幽默生成的认知协同

Ajayi, Edward, Mitra, Prasenjit

Abstract

Humor generation poses a significant challenge for Large Language Models (LLMs), because their standard training objective - predicting the most likely next word - inherently conflicts with the surprise and incongruity needed for comedy. To bridge this gap, we introduce the Cognitive Synergy Framework, a theoretically grounded methodology for generating high-quality humor data inspired by psychological theories of humor. Utilizing a Mixture-of-Thought (MoT) approach, we deploy six cognitive personas (e.g., The Absurdist, The Cynic) to synthesize diverse comedic perspectives for a given prompt. This framework creates a theoretically grounded dataset, which we use to fine-tune a 7B-parameter student model. We compare Direct Preference Optimization (DPO) and a novel Offline Group Relative Policy Optimization (O-GRPO); our 7B model significantly outperforms larger instruction-tuned baselines and achieves performance competitive with state-of-the-art proprietary models. We find that cognitive-driven data curation is far more critical than alignment algorithms or model scale for humor generation. Code and data will be available upon publication.

Chinese Translation

幽默生成对大语言模型（LLMs）构成了重大挑战，因为它们的标准训练目标——预测最可能的下一个单词——与幽默所需的惊喜和不协调性本质上存在冲突。为了解决这一问题，我们提出了认知协同框架，这是一种基于理论的方法，旨在生成高质量的幽默数据，灵感来源于幽默的心理学理论。采用混合思维（Mixture-of-Thought, MoT）方法，我们部署了六种认知角色（例如，荒诞主义者、愤世嫉俗者）来为给定提示合成多样的喜剧视角。该框架创建了一个基于理论的数据集，我们利用该数据集对一个70亿参数的学生模型进行微调。我们比较了直接偏好优化（Direct Preference Optimization, DPO）和一种新颖的离线组相对策略优化（Offline Group Relative Policy Optimization, O-GRPO）；我们的70亿模型显著优于更大规模的指令调优基线，并在性能上与最先进的专有模型相竞争。我们发现，基于认知的数据策划对幽默生成的重要性远超过对齐算法或模型规模。代码和数据将在发表时提供。

View on arXiv Download PDF AI Translation

cs.CL / 4 / 2604.09645

Generating High Quality Synthetic Data for Dutch Medical Conversations

生成高质量荷兰语医疗对话的合成数据

Kuan, Cecilia, Parikh, Aditya Kamlesh, Heuvel, Henk van den

Abstract

Medical conversations offer insights into clinical communication often absent from Electronic Health Records. However, developing reliable clinical Natural Language Processing (NLP) models is hampered by the scarcity of domain-specific datasets, as clinical data are typically inaccessible due to privacy and ethical constraints. To address these challenges, we present a pipeline for generating synthetic Dutch medical dialogues using a Dutch fine-tuned Large Language Model, with real medical conversations serving as linguistic and structural reference. The generated dialogues were evaluated through quantitative metrics and qualitative review by native speakers and medical practitioners. Quantitative analysis revealed strong lexical variety and overly regular turn-taking, suggesting scripted rather than natural conversation flow. Qualitative review produced slightly below-average scores, with raters noting issues in domain specificity and natural expression. The limited correlation between quantitative and qualitative results highlights that numerical metrics alone cannot fully capture linguistic quality. Our findings demonstrate that generating synthetic Dutch medical dialogues is feasible but requires domain knowledge and carefully structured prompting to balance naturalness and structure in conversation. This work provides a foundation for expanding Dutch clinical NLP resources through ethically generated synthetic data.

Chinese Translation

医疗对话提供了临床交流的见解，这些见解通常在电子健康记录中缺失。然而，由于隐私和伦理限制，临床数据通常难以获取，导致开发可靠的临床自然语言处理（NLP）模型受限于领域特定数据集的稀缺性。为应对这些挑战，我们提出了一种生成合成荷兰语医疗对话的流程，利用经过荷兰语微调的大型语言模型（Large Language Model），并以真实医疗对话作为语言和结构参考。生成的对话通过定量指标和由母语者及医疗从业者进行的定性评审进行评估。定量分析显示词汇多样性强，但轮次交替过于规律，暗示对话流程更像是脚本化而非自然。定性评审得分略低于平均水平，评审者指出领域特异性和自然表达存在问题。定量与定性结果之间的有限相关性表明，仅凭数值指标无法全面反映语言质量。我们的研究表明，生成合成荷兰语医疗对话是可行的，但需要领域知识和精心设计的提示，以在对话的自然性与结构性之间取得平衡。该工作为通过伦理生成的合成数据扩展荷兰临床NLP资源奠定了基础。

View on arXiv Download PDF AI Translation

cs.CL / 5 / 2604.09793

GIANTS: Generative Insight Anticipation from Scientific Literature

GIANTS：来自科学文献的生成性洞察预测

He-Yueya, Joy, Singh, Anikait, Gao, Ge, Li, Michael Y., Yang, Sherry, Finn, Chelsea, Brunskill, Emma, Goodman, Noah D.

Abstract

Scientific breakthroughs often emerge from synthesizing prior ideas into novel contributions. While language models (LMs) show promise in scientific discovery, their ability to perform this targeted, literature-grounded synthesis remains underexplored. We introduce insight anticipation, a generation task in which a model predicts a downstream paper's core insight from its foundational parent papers. To evaluate this capability, we develop GiantsBench, a benchmark of 17k examples across eight scientific domains, where each example consists of a set of parent papers paired with the core insight of a downstream paper. We evaluate models using an LM judge that scores similarity between generated and ground-truth insights, and show that these similarity scores correlate with expert human ratings. Finally, we present GIANTS-4B, an LM trained via reinforcement learning (RL) to optimize insight anticipation using these similarity scores as a proxy reward. Despite its smaller open-source architecture, GIANTS-4B outperforms proprietary baselines and generalizes to unseen domains, achieving a 34% relative improvement in similarity score over gemini-3-pro. Human evaluations further show that GIANTS-4B produces insights that are more conceptually clear than those of the base model. In addition, SciJudge-30B, a third-party model trained to compare research abstracts by likely citation impact, predicts that insights generated by GIANTS-4B are more likely to lead to higher citations, preferring them over the base model in 68% of pairwise comparisons. We release our code, benchmark, and model to support future research in automated scientific discovery.

Chinese Translation

科学突破往往源于将先前的思想综合为新的贡献。尽管语言模型（LMs）在科学发现中展现出潜力，但它们在进行这种有针对性的、基于文献的综合方面的能力仍然未得到充分探索。我们引入了洞察预测，这是一项生成任务，模型从其基础母论文中预测下游论文的核心洞察。为了评估这一能力，我们开发了GiantsBench，这是一个涵盖八个科学领域的1.7万例的基准，每个例子由一组母论文及其对应的下游论文的核心洞察组成。我们使用LM评估者来评估模型，评分生成的洞察与真实洞察之间的相似性，并显示这些相似性评分与专家人类评分相关。最后，我们提出了GIANTS-4B，这是一个通过强化学习（RL）训练的语言模型，旨在使用这些相似性评分作为代理奖励来优化洞察预测。尽管其开放源代码架构较小，GIANTS-4B在性能上超越了专有基线，并能推广到未见领域，在相似性评分上相较于gemini-3-pro实现了34%的相对提升。人类评估进一步显示，GIANTS-4B生成的洞察在概念上比基础模型更清晰。此外，SciJudge-30B，一个第三方模型，旨在通过可能的引用影响比较研究摘要，预测GIANTS-4B生成的洞察更可能导致更高的引用率，在68%的成对比较中优于基础模型。我们发布了我们的代码、基准和模型，以支持未来自动化科学发现的研究。

View on arXiv Download PDF AI Translation

cs.CL / 6 / 2604.09812

Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering

Claim2Vec：用于多语言相似性和聚类的事实核查声明嵌入

Panchendrarajan, Rrubaa, Zubiaga, Arkaitz

Abstract

Recurrent claims present a major challenge for automated fact-checking systems designed to combat misinformation, especially in multilingual settings. While tasks such as claim matching and fact-checked claim retrieval aim to address this problem by linking claim pairs, the broader challenge of effectively representing groups of similar claims that can be resolved with the same fact-check via claim clustering remains relatively underexplored. To address this gap, we introduce Claim2Vec, the first multilingual embedding model optimized to represent fact-check claims as vectors in an improved semantic embedding space. We fine-tune a multilingual encoder using contrastive learning with similar multilingual claim pairs. Experiments on the claim clustering task using three datasets, 14 multilingual embedding models, and 7 clustering algorithms demonstrate that Claim2Vec significantly improves clustering performance. Specifically, it enhances both cluster label alignment and the geometric structure of the embedding space across different cluster configurations. Our multilingual analysis shows that clusters containing multiple languages benefit from fine-tuning, demonstrating cross-lingual knowledge transfer.

Chinese Translation

重复出现的声明对旨在打击错误信息的自动化事实核查系统构成了重大挑战，尤其是在多语言环境中。虽然声明匹配和事实核查声明检索等任务通过关联声明对来解决这一问题，但通过声明聚类有效表示可通过相同事实核查解决的相似声明组的更广泛挑战仍相对缺乏研究。为填补这一空白，我们提出了Claim2Vec，这是首个多语言嵌入模型，旨在将事实核查声明表示为改进的语义嵌入空间中的向量。我们利用对比学习对多语言编码器进行微调，使用相似的多语言声明对。基于三个数据集、14个多语言嵌入模型和7种聚类算法的声明聚类任务实验表明，Claim2Vec显著提升了聚类性能。具体而言，它增强了不同聚类配置下的簇标签对齐和嵌入空间的几何结构。我们的多语言分析显示，包含多种语言的簇从微调中受益，体现了跨语言知识迁移的效果。

View on arXiv Download PDF AI Translation

cs.CL / 7 / 2604.09854

Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling

剧透警告：叙事预测作为大型语言模型故事讲述中的紧张度指标

Sui, Peiqi, Zhu, Yutong, Cheng, Tianyi, West, Peter, So, Richard Jean, Long, Hoyt, Holtzman, Ari

Abstract

LLMs have so far failed both to generate consistently compelling stories and to recognize this failure--on the leading creative-writing benchmark (EQ-Bench), LLM judges rank zero-shot AI stories above New Yorker short stories, a gold standard for literary fiction. We argue that existing rubrics overlook a key dimension of compelling human stories: narrative tension. We introduce the 100-Endings metric, which walks through a story sentence by sentence: at each position, a model predicts how the story will end 100 times given only the text so far, and we measure tension as how often predictions fail to match the ground truth. Beyond the mismatch rate, the sentence-level curve yields complementary statistics, such as inflection rate, a geometric measure of how frequently the curve reverses direction, tracking twists and revelations. Unlike rubric-based judges, 100-Endings correctly ranks New Yorker stories far above LLM outputs. Grounded in narratological principles, we design a story-generation pipeline using structural constraints, including analysis of story templates, idea formulation, and narrative scaffolding. Our pipeline significantly increases narrative tension as measured by the 100-Endings metric, while maintaining performance on the EQ-Bench leaderboard.

Chinese Translation

迄今为止，大型语言模型（LLMs）在生成一致引人入胜的故事和识别这一失败方面均未能取得成功——在领先的创意写作基准（EQ-Bench）上，LLM评审将零样本AI故事的排名置于《纽约客》短篇小说之上，而后者被视为文学小说的黄金标准。我们认为，现有的评分标准忽视了引人入胜的人类故事的一个关键维度：叙事紧张度。我们引入了100-Endings指标，该指标逐句分析故事：在每个位置，模型仅根据目前的文本预测故事将如何结束100次，我们通过预测与真实情况不符的频率来衡量紧张度。除了不匹配率外，句子级曲线还提供了补充统计数据，如转折率，这是一个几何度量，表示曲线反转方向的频率，跟踪故事的转折和揭示。与基于评分标准的评审不同，100-Endings能够正确地将《纽约客》的故事排名远高于LLM输出。基于叙事学原则，我们设计了一个故事生成管道，使用结构约束，包括故事模板分析、创意形成和叙事支架。我们的管道显著提高了通过100-Endings指标测量的叙事紧张度，同时保持了在EQ-Bench排行榜上的表现。

View on arXiv Download PDF AI Translation

cs.CL / 8 / 2604.09874

Simulating Organized Group Behavior: New Framework, Benchmark, and Analysis

组织化群体行为模拟：新框架、基准与分析

Zou, Xinkai, Huang, Yiming, Wu, Zhuohang, Sha, Jian, Huang, Nan, Yun, Longfei, Shang, Jingbo, Peng, Letian

Abstract

Simulating how organized groups (e.g., corporations) make decisions (e.g., responding to a competitor's move) is essential for understanding real-world dynamics and could benefit relevant applications (e.g., market prediction). In this paper, we formalize this problem as a concrete research platform for group behavior understanding, providing: (1) a task definition with benchmark and evaluation criteria, (2) a structured analytical framework with a corresponding algorithm, and (3) detailed temporal and cross-group analysis. Specifically, we propose Organized Group Behavior Simulation, a task that models organized groups as collective entities from a practical perspective: given a group facing a particular situation (e.g., AI Boom), predict the decision it would take. To support this task, we present GROVE (GRoup Organizational BehaVior Evaluation), a benchmark covering 44 entities with 8,052 real-world context-decision pairs collected from Wikipedia and TechCrunch across 9 domains, with an end-to-end evaluation protocol assessing consistency, initiative, scope, magnitude, and horizon. Beyond straightforward prompting pipelines, we propose a structured analytical framework that converts collective decision-making events into an interpretable, adaptive, and traceable behavioral model, achieving stronger performance than summarization- and retrieval-based baselines. It further introduces an adapter mechanism for time-aware evolution and group-aware transfer, and traceable evidence nodes grounding each decision rule in originating historical events. Our analysis reveals temporal behavioral drift within individual groups, which the time-aware adapter effectively captures for stronger prediction, and structured cross-group similarity that enables knowledge transfer for data-scarce organizations.

Chinese Translation

模拟组织化群体（如企业）如何做出决策（如应对竞争对手的行动）对于理解现实世界动态至关重要，并且有助于相关应用（如市场预测）。本文将该问题形式化为一个具体的群体行为理解研究平台，提供：（1）具有基准和评估标准的任务定义，（2）结构化分析框架及相应算法，（3）详尽的时间序列及跨群体分析。具体而言，我们提出了组织化群体行为模拟（Organized Group Behavior Simulation）任务，从实际角度将组织化群体建模为集体实体：在给定群体面临特定情境（如人工智能热潮）时，预测其将采取的决策。为支持该任务，我们构建了GROVE（GRoup Organizational BehaVior Evaluation）基准，涵盖44个实体，收集自Wikipedia和TechCrunch的9个领域共8,052对真实世界情境-决策对，并设计了端到端评估协议，评估一致性、主动性、范围、幅度及时间视角。除了直接的提示式管道，我们提出了结构化分析框架，将集体决策事件转化为可解释、适应性强且可追踪的行为模型，性能优于基于摘要和检索的基线方法。该框架进一步引入了时间感知演化和群体感知迁移的适配器机制，以及可追踪的证据节点，将每条决策规则根植于起源的历史事件。我们的分析揭示了个体群体内的时间行为漂移，时间感知适配器有效捕捉该漂移以增强预测能力，同时结构化的跨群体相似性促进了数据稀缺组织的知识迁移。

View on arXiv Download PDF AI Translation

cs.CL / 9 / 2604.09890

Should We be Pedantic About Reasoning Errors in Machine Translation?

我们是否应该对机器翻译中的推理错误过于拘泥？

Bao, Calvin, Carpuat, Marine

Abstract

Across multiple language pairings (English $\to$ \{Spanish, French, German, Mandarin, Japanese, Urdu, Cantonese\}), we find reasoning errors in translation. To quantify how often these reasoning errors occur, we leverage an automated annotation protocol for reasoning evaluation wherein the goal is to detect if a reasoning step is any of three error categories: (1) source sentence-misaligned, (2) model hypothesis-misaligned, or (3) reasoning trace-misaligned. We probe the reasoning model with perturbed traces correcting for these identified reasoning errors using an array of weak-to-strong interventions: hedging, removal, re-reasoning after removal, hindsight, and oracle interventions. Experimenting with interventions on the reasoning traces suggests that small corrections to the reasoning have little impact on translation quality, but stronger interventions yield the highest resolution rates, despite translation quality gains being mixed. We find ultimately that reasoning errors in MT can be identified with high precision in Urdu but lower precision in Spanish, but that removing these reasoning errors does not resolve the initial errors significantly, suggesting limited reasoning faithfulness for machine translation.

Chinese Translation

在多种语言对之间（英语 $ o$ ext{西班牙语, 法语, 德语, 普通话, 日语, 乌尔都语, 粤语}），我们发现翻译中的推理错误。为了量化这些推理错误的发生频率，我们利用了一种自动化注释协议进行推理评估，其目标是检测推理步骤是否属于三种错误类别中的任何一种：（1）源句子不一致，（2）模型假设不一致，或（3）推理轨迹不一致。我们通过扰动的轨迹对推理模型进行探测，纠正这些已识别的推理错误，采用了一系列从弱到强的干预措施：模糊处理、删除、删除后的重新推理、事后诸葛亮和神谕干预。对推理轨迹进行干预的实验表明，对推理的小幅修正对翻译质量影响不大，但更强的干预措施能获得最高的解决率，尽管翻译质量的提升效果不一。最终，我们发现乌尔都语中的推理错误能够以高精度识别，而西班牙语中的识别精度较低，但消除这些推理错误并未显著解决初始错误，表明机器翻译的推理可信度有限。

View on arXiv Download PDF AI Translation

cs.CL / 10 / 2604.09960

Human vs. Machine Deception: Distinguishing AI-Generated and Human-Written Fake News Using Ensemble Learning

人类与机器的欺骗：使用集成学习区分人工生成与人类撰写的假新闻

Jaeger, Samuel, Ibeneye, Calvin, Vera-Jimenez, Aya, Ghosh, Dhrubajyoti

Abstract

The rapid adoption of large language models has introduced a new class of AI-generated fake news that coexists with traditional human-written misinformation, raising important questions about how these two forms of deceptive content differ and how reliably they can be distinguished. This study examines linguistic, structural, and emotional differences between human-written and AI-generated fake news and evaluates machine learning and ensemble-based methods for distinguishing these content types. A document-level feature representation is constructed using sentence structure, lexical diversity, punctuation patterns, readability indices, and emotion-based features capturing affective dimensions such as fear, anger, joy, sadness, trust, and anticipation. Multiple classification models, including logistic regression, random forest, support vector machines, extreme gradient boosting, and a neural network, are applied alongside an ensemble framework that aggregates predictions across models. Model performance is assessed using accuracy and area under the receiver operating characteristic curve. The results show strong and consistent classification performance, with readability-based features emerging as the most informative predictors and AI-generated text exhibiting more uniform stylistic patterns. Ensemble learning provides modest but consistent improvements over individual models. These findings indicate that stylistic and structural properties of text provide a robust basis for distinguishing AI-generated misinformation from human-written fake news.

Chinese Translation

大型语言模型的快速应用引入了一类新的人工智能生成的假新闻，这些假新闻与传统的人类撰写的虚假信息共存，提出了关于这两种欺骗性内容之间的差异以及它们的可区分性的重要问题。本研究考察了人类撰写的假新闻与人工智能生成的假新闻在语言、结构和情感上的差异，并评估了机器学习和基于集成的方法来区分这些内容类型。通过句子结构、词汇多样性、标点模式、可读性指数以及基于情感的特征（捕捉恐惧、愤怒、快乐、悲伤、信任和期待等情感维度）构建了文档级特征表示。应用了多种分类模型，包括逻辑回归、随机森林、支持向量机、极端梯度提升和神经网络，并结合一个集成框架对模型的预测进行聚合。通过准确率和接收者操作特征曲线下面积评估模型性能。结果显示出强大且一致的分类性能，以可读性为基础的特征成为最具信息量的预测因子，而人工智能生成的文本表现出更为统一的风格模式。集成学习在个体模型的基础上提供了适度但一致的改进。这些发现表明，文本的风格和结构特性为区分人工智能生成的虚假信息与人类撰写的假新闻提供了可靠的基础。

View on arXiv Download PDF AI Translation

cs.CL / 11 / 2604.10022

Weird Generalization is Weirdly Brittle

奇异泛化的脆弱性研究

Wanner, Miriam, Collison, Hannah, Jurayj, William, Van Durme, Benjamin, Dredze, Mark, Walden, William

Abstract

Weird generalization is a phenomenon in which models fine-tuned on data from a narrow domain (e.g. insecure code) develop surprising traits that manifest even outside that domain (e.g. broad misalignment)-a phenomenon that prior work has highlighted as a critical safety concern. Here, we present an extended replication study of key weird generalization results across an expanded suite of models and datasets. We confirm that surprising (and dangerous) traits can emerge under certain circumstances, but we find that weird generalization is exceptionally brittle: it emerges only for specific models on specific datasets, and it vanishes under simple training-time, prompt-based interventions. We find that the most effective interventions provide prompt context that makes the generalized behavior the expected behavior. However, we show that even very generic interventions that do not anticipate specific generalized traits can still be effective in mitigating weird generalization's effects. Our findings thus help clarify the nature of the safety threat that weird generalization poses and point toward an easily implemented set of solutions.

Chinese Translation

奇异泛化是一种现象，指的是在狭窄领域（例如不安全代码）数据上微调的模型，表现出即使在该领域之外也会显现的意外特征（例如广泛的错位），这一现象被先前研究强调为关键的安全隐患。本文对关键奇异泛化结果进行了扩展复制研究，涵盖了更多模型和数据集。我们确认在特定情况下确实会出现令人惊讶（且危险）的特征，但发现奇异泛化极其脆弱：它仅在特定模型和特定数据集上出现，并且在简单的训练时基于提示（prompt）的干预下即消失。我们发现，最有效的干预措施是提供使泛化行为成为预期行为的提示上下文。然而，我们也证明，即使是未针对特定泛化特征设计的非常通用的干预措施，也能有效缓解奇异泛化的影响。我们的研究结果有助于澄清奇异泛化所带来的安全威胁的本质，并指明了一套易于实施的解决方案。

View on arXiv Download PDF AI Translation

cs.CL / 12 / 2604.10031

CoSToM:Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models

CoSToM：面向因果的内在心智理论对齐在大型语言模型中的引导方法

Li, Mengfan, Shi, Xuanhua, Deng, Yang

Abstract

Theory of Mind (ToM), the ability to attribute mental states to others, is a hallmark of social intelligence. While large language models (LLMs) demonstrate promising performance on standard ToM benchmarks, we observe that they often fail to generalize to complex task-specific scenarios, relying heavily on prompt scaffolding to mimic reasoning. The critical misalignment between the internal knowledge and external behavior raises a fundamental question: Do LLMs truly possess intrinsic cognition, and can they externalize this internal knowledge into stable, high-quality behaviors? To answer this, we introduce CoSToM (Causal-oriented Steering for ToM alignment), a framework that transitions from mechanistic interpretation to active intervention. First, we employ causal tracing to map the internal distribution of ToM features, empirically uncovering the internal layers' characteristics in encoding fundamental ToM semantics. Building on this insight, we implement a lightweight alignment framework via targeted activation steering within these ToM-critical layers. Experiments demonstrate that CoSToM significantly enhances human-like social reasoning capabilities and downstream dialogue quality.

Chinese Translation

心智理论（Theory of Mind，ToM）是指归因他人心理状态的能力，是社会智能的标志。尽管大型语言模型（LLMs）在标准的ToM基准测试中表现出良好性能，但我们观察到它们常常无法推广到复杂的特定任务场景，过度依赖提示脚手架来模拟推理。内部知识与外部行为之间的关键错配引发了一个根本性问题：LLMs是否真正具备内在认知能力，且能否将这种内在知识外化为稳定且高质量的行为？为了解答这一问题，我们提出了CoSToM（Causal-oriented Steering for ToM alignment），一个从机械式解释转向主动干预的框架。首先，我们采用因果追踪方法映射ToM特征的内部分布，实证揭示了内部层在编码基础ToM语义方面的特性。在此基础上，我们通过针对性激活引导，在这些关键的ToM层中实现了轻量级的对齐框架。实验表明，CoSToM显著提升了类人社会推理能力及下游对话质量。

View on arXiv Download PDF AI Translation

cs.CL / 13 / 2604.10035

Computational Implementation of a Model of Category-Theoretic Metaphor Comprehension

范畴理论隐喻理解模型的计算实现

Iwaki, Fumitaka, Fuyama, Miho, Saigo, Hayato, Takahashi, Tatsuji

Abstract

In this study, we developed a computational implementation for a model of metaphor comprehension based on the theory of indeterminate natural transformation (TINT) proposed by Fuyama et al. We simplified the algorithms implementing the model to be closer to the original theory and verified it through data fitting and simulations. The outputs of the algorithms are evaluated with three measures: data-fitting with experimental data, the systematicity of the metaphor comprehension result, and the novelty of the comprehension (i.e. the correspondence of the associative structure of the source and target of the metaphor). The improved algorithm outperformed the existing ones in all the three measures.

Chinese Translation

本研究基于Fuyama等人提出的不确定自然变换理论（Theory of Indeterminate Natural Transformation, TINT），开发了一个隐喻理解模型的计算实现。我们简化了实现该模型的算法，使其更贴近原始理论，并通过数据拟合和模拟进行了验证。算法输出通过三项指标进行评估：与实验数据的数据拟合度、隐喻理解结果的系统性，以及理解的新颖性（即隐喻源域与目标域的联想结构对应性）。改进后的算法在这三项指标上均优于现有算法。

View on arXiv Download PDF AI Translation

cs.CL / 14 / 2604.10063

Linguistic Accommodation Between Neurodivergent Communities on Reddit:A Communication Accommodation Theory Analysis of ADHD and Autism Groups

Reddit上神经多样性社群之间的语言适应：对ADHD和自闭症群体的沟通适应理论分析

Mankarious, Saad, Zein, Nour, Hou, Iyad Ait, Zirikly, Aya

Abstract

Social media research on mental health has focused predominantly on detecting and diagnosing conditions at the individual level. In this work, we shift attention to \emph{intergroup} behavior, examining how two prominent neurodivergent communities, ADHD and autism, adjust their language when engaging with each other on Reddit. Grounded in Communication Accommodation Theory (CAT), we first establish that each community maintains a distinct linguistic profile as measured by Language Inquiry and Word Count Lexicon (LIWC). We then show that these profiles shift in opposite directions when users cross community boundaries: features that are elevated in one group's home community decrease when its members post in the other group's space, and vice versa, consistent with convergent accommodation. The involvement of topic-independent summary variables (Authentic, Clout) in these shifts provides partial evidence against a purely topical explanation. Finally, in an exploratory longitudinal analysis around the moment of public diagnosis disclosure, we find that its effects on linguistic style are small and, in some cases, directionally opposite to cross-community accommodation, providing initial evidence that situational audience adaptation and longer-term identity processes may involve different mechanisms. Our findings contribute to understanding intergroup communication dynamics among neurodivergent populations online and carry implications for community moderation and clinical perspectives on these conditions.

Chinese Translation

关于心理健康的社交媒体研究主要集中在个体层面上检测和诊断状况。在本研究中，我们将注意力转向 extit{群体间}行为，考察两个显著的神经多样性社群，ADHD和自闭症，在Reddit上互动时如何调整他们的语言。基于沟通适应理论（Communication Accommodation Theory, CAT），我们首先确立每个社群在语言查询与词汇计数词典（Language Inquiry and Word Count Lexicon, LIWC）测量下维持独特的语言特征。然后，我们展示当用户跨越社群边界时，这些特征在相反方向上发生变化：在一个群体的本土社群中被提升的特征在其成员在另一个群体的空间发帖时减少，反之亦然，这与趋同适应一致。这些变化中涉及的与主题无关的总结变量（Authentic, Clout）提供了部分证据，反驳了纯粹的主题解释。最后，在关于公开诊断披露时刻的探索性纵向分析中，我们发现其对语言风格的影响较小，并且在某些情况下与跨社群适应的方向相反，初步证据表明情境受众适应和长期身份过程可能涉及不同的机制。我们的研究有助于理解在线神经多样性人群之间的群体间沟通动态，并对社群管理和这些状况的临床视角具有重要意义。

View on arXiv Download PDF AI Translation

cs.CL / 15 / 2604.10065

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

ASPIRin：面向交互优化的全双工语音语言模型中的动作空间投影强化学习

Hsiao, Chi-Yuan, Lu, Ke-Han, Fu, Yu-Kuan, Lin, Guan-Ting, Hung, Hsiao-Tsung, Lee, Hung-yi

Abstract

End-to-end full-duplex Speech Language Models (SLMs) require precise turn-taking for natural interaction. However, optimizing temporal dynamics via standard raw-token reinforcement learning (RL) degrades semantic quality, causing severe generative collapse and repetition. We propose ASPIRin, an interactivity-optimized RL framework that explicitly decouples when to speak from what to say. Using Action Space Projection, ASPIRin maps the text vocabulary into a coarse-grained binary state (active speech vs. inactive silence). By applying Group Relative Policy Optimization (GRPO) with rule-based rewards, it balances user interruption and response latency. Empirical evaluations show ASPIRin optimizes interactivity across turn-taking, backchanneling, and pause handling. Crucially, isolating timing from token selection preserves semantic coherence and reduces the portion of duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.

Chinese Translation

端到端全双工语音语言模型（SLMs）需要精确的轮次控制以实现自然交互。然而，通过标准的原始词元强化学习（RL）优化时间动态会降低语义质量，导致严重的生成崩溃和重复。我们提出了ASPIRin，一种交互优化的强化学习框架，明确地将“何时发言”与“说什么”解耦。通过动作空间投影（Action Space Projection），ASPIRin将文本词汇映射为粗粒度的二元状态（活跃发言与非活跃静默）。利用基于规则奖励的群体相对策略优化（Group Relative Policy Optimization，GRPO），它在用户打断和响应延迟之间实现平衡。实证评估表明，ASPIRin在轮次控制、回声反馈和暂停处理等方面优化了交互性。关键是将时序与词元选择分离，保持了语义连贯性，并相比标准GRPO将重复n-gram的比例降低了50%以上，有效消除了退化性重复。

View on arXiv Download PDF AI Translation

cs.CL / 16 / 2604.10072

Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty

仅在必要时推理：通过模型内部不确定性实现高效的生成奖励建模

Xue, Chao, Wang, Yao, Liu, Mengqiao, Liang, Di, Han, Xingsheng, Liu, Peiyang, Wu, Xianjie, Lu, Chenyao, Jiang, Lei, Lu, Yu, Shi, Haibo, Liang, Shuang, Peng, Minlong, Salim, Flora D.

Abstract

Recent advancements in the Generative Reward Model (GRM) have demonstrated its potential to enhance the reasoning abilities of LLMs through Chain-of-Thought (CoT) prompting. Despite these gains, existing implementations of GRM suffer from two critical limitations. First, CoT prompting is applied indiscriminately to all inputs regardless of their inherent complexity. This introduces unnecessary computational costs for tasks amenable to fast, direct inference. Second, existing approaches primarily rely on voting-based mechanisms to evaluate CoT outputs, which often lack granularity and precision in assessing reasoning quality. In this paper, we propose E-GRM, an efficient generative reward modeling framework grounded in model-internal uncertainty. E-GRM leverages the convergence behavior of parallel model generations to estimate uncertainty and selectively trigger CoT reasoning only when needed, without relying on handcrafted features or task-dependent signals. To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths. Experiments on multiple reasoning benchmarks show that E-GRM substantially reduces inference cost while consistently improving answer accuracy, demonstrating that model-internal uncertainty is an effective and general signal for efficient reasoning-aware reward modeling.

Chinese Translation

最近在生成奖励模型（Generative Reward Model, GRM）方面的进展表明，它有潜力通过思维链（Chain-of-Thought, CoT）提示增强大型语言模型（LLMs）的推理能力。尽管取得了这些成果，现有的GRM实现仍然存在两个关键限制。首先，CoT提示对所有输入都不加区分地应用，而不考虑其固有复杂性。这为适合快速直接推理的任务引入了不必要的计算成本。其次，现有方法主要依赖基于投票的机制来评估CoT输出，这往往缺乏对推理质量的细致和精确评估。本文提出了E-GRM，一个基于模型内部不确定性的高效生成奖励建模框架。E-GRM利用并行模型生成的收敛行为来估计不确定性，并仅在需要时选择性地触发CoT推理，而不依赖于手工特征或任务依赖信号。为了提高奖励的真实性，我们引入了一种轻量级的判别评分器，该评分器通过混合回归-排序目标进行训练，以提供对推理路径的细致评估。在多个推理基准上的实验表明，E-GRM显著降低了推理成本，同时持续提高了答案准确性，证明了模型内部不确定性是高效推理感知奖励建模的有效且通用的信号。

View on arXiv Download PDF AI Translation

cs.CL / 17 / 2604.10079

Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models

为何监督微调未能有效学习：大型语言模型中不完全学习的系统研究

Xue, Chao, Wang, Yao, Liu, Mengqiao, Liang, Di, Han, Xingsheng, Liu, Peiyang, Wu, Xianjie, Lu, Chenyao, Jiang, Lei, Lu, Yu, Shi, Haibo, Liang, Shuang, Peng, Minlong, Salim, Flora D.

Abstract

Supervised Fine-Tuning (SFT) is the standard approach for adapting large language models (LLMs) to downstream tasks. However, we observe a persistent failure mode: even after convergence, models often fail to correctly reproduce a subset of their own supervised training data. We refer to this behavior as the Incomplete Learning Phenomenon(ILP). This paper presents the first systematic study of ILP in LLM fine-tuning. We formalize ILP as post-training failure to internalize supervised instances and demonstrate its prevalence across multiple model families, domains, and datasets. Through controlled analyses, we identify five recurrent sources of incomplete learning: (1) missing prerequisite knowledge in the pre-trained model, (2) conflicts between SFT supervision and pre-training knowledge, (3) internal inconsistencies within SFT data, (4) left-side forgetting during sequential fine-tuning, and (5) insufficient optimization for rare or complex patterns. We introduce a diagnostic-first framework that maps unlearned samples to these causes using observable training and inference signals, and study several targeted mitigation strategies as causal interventions. Experiments on Qwen, LLaMA, and OLMo2 show that incomplete learning is widespread and heterogeneous, and that improvements in aggregate metrics can mask persistent unlearned subsets. The findings highlight the need for fine-grained diagnosis of what supervised fine-tuning fails to learn, and why.

Chinese Translation

监督微调（Supervised Fine-Tuning, SFT）是将大型语言模型（Large Language Models, LLMs）适应于下游任务的标准方法。然而，我们观察到一种持续存在的失败模式：即使在收敛后，模型通常无法正确再现其自身监督训练数据的一个子集。我们将这种行为称为不完全学习现象（Incomplete Learning Phenomenon, ILP）。本文首次系统性地研究了LLM微调中的ILP。我们将ILP形式化为在训练后未能内化监督实例，并展示其在多个模型家族、领域和数据集中的普遍性。通过控制分析，我们识别出五个反复出现的不完全学习来源：（1）预训练模型中缺失的先决知识，（2）SFT监督与预训练知识之间的冲突，（3）SFT数据内部的不一致性，（4）在顺序微调过程中左侧遗忘，以及（5）对稀有或复杂模式的优化不足。我们引入了一种以诊断为先的框架，通过可观察的训练和推理信号将未学习样本映射到这些原因，并研究几种作为因果干预的针对性缓解策略。在Qwen、LLaMA和OLMo2上的实验表明，不完全学习是普遍且异质的，整体指标的改善可能掩盖持续存在的未学习子集。这些发现强调了对监督微调未能学习的内容及其原因进行细致诊断的必要性。

View on arXiv Download PDF AI Translation

cs.CL / 18 / 2604.10091

SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models

SEPTQ：一种简单高效的大型语言模型后训练量化范式

Liu, Han, Gao, Haotian, Zhang, Xiaotong, Li, Changya, Zhang, Feng, Wang, Wei, Ma, Fenglong, Yu, Hong

Abstract

Large language models (LLMs) have shown remarkable performance in various domains, but they are constrained by massive computational and storage costs. Quantization, an effective technique for compressing models to fit resource-limited devices while preserving generative quality, encompasses two primary methods: quantization aware training (QAT) and post-training quantization (PTQ). QAT involves additional retraining or fine-tuning, thus inevitably resulting in high training cost and making it unsuitable for LLMs. Consequently, PTQ has become the research hotspot in recent quantization methods. However, existing PTQ methods usually rely on various complex computation procedures and suffer from considerable performance degradation under low-bit quantization settings. To alleviate the above issues, we propose a simple and effective post-training quantization paradigm for LLMs, named SEPTQ. Specifically, SEPTQ first calculates the importance score for each element in the weight matrix and determines the quantization locations in a static global manner. Then it utilizes the mask matrix which represents the important locations to quantize and update the associated weights column-by-column until the appropriate quantized weight matrix is obtained. Compared with previous methods, SEPTQ simplifies the post-training quantization procedure into only two steps, and considers the effectiveness and efficiency simultaneously. Experimental results on various datasets across a suite of models ranging from millions to billions in different quantization bit-levels demonstrate that SEPTQ significantly outperforms other strong baselines, especially in low-bit quantization scenarios.

Chinese Translation

大型语言模型（LLMs）在多个领域展现了卓越的性能，但其巨大的计算和存储成本成为制约因素。量化作为一种有效的模型压缩技术，能够使模型适配资源受限的设备，同时保持生成质量，主要包括量化感知训练（QAT）和后训练量化（PTQ）两种方法。QAT需要额外的再训练或微调，导致训练成本较高，不适用于大型语言模型。因此，PTQ成为近期量化研究的热点。然而，现有的PTQ方法通常依赖复杂的计算流程，并且在低比特量化设置下性能显著下降。为缓解上述问题，我们提出了一种简单且高效的针对大型语言模型的后训练量化范式，称为SEPTQ。具体而言，SEPTQ首先计算权重矩阵中每个元素的重要性得分，并以静态全局的方式确定量化位置。随后，利用表示重要位置的掩码矩阵，逐列量化并更新相关权重，直至获得合适的量化权重矩阵。与以往方法相比，SEPTQ将后训练量化流程简化为仅两步，同时兼顾效果与效率。在涵盖从百万到十亿参数规模、不同量化比特级别的多种模型和数据集上的实验结果表明，SEPTQ显著优于其他强基线方法，尤其在低比特量化场景中表现突出。

View on arXiv Download PDF AI Translation

cs.CL / 19 / 2604.10101

Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry

谁写了这句诗？大型语言模型生成古典汉诗的检测评估

Li, Jiang, Lan, Tian, Wang, Shanshan, Zhang, Dongxing, Lin, Dianqing, Gao, Guanglai, Wong, Derek F., Su, Xiangdong

Abstract

The rapid development of large language models (LLMs) has extended text generation tasks into the literary domain. However, AI-generated literary creations has raised increasingly prominent issues of creative authenticity and ethics in literary world, making the detection of LLM-generated literary texts essential and urgent. While previous works have made significant progress in detecting AI-generated text, it has yet to address classical Chinese poetry. Due to the unique linguistic features of classical Chinese poetry, such as strict metrical regularity, a shared system of poetic imagery, and flexible syntax, distinguishing whether a poem is authored by AI presents a substantial challenge. To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs. Based on ChangAn, we conducted a systematic evaluation of 12 AI detectors, investigating their performance variations across different text granularities and generation strategies. Our findings highlight the limitations of current Chinese text detectors, which fail to serve as reliable tools for detecting LLM-generated classical Chinese poetry. These results validate the effectiveness and necessity of our proposed ChangAn benchmark. Our dataset and code are available at https://github.com/VelikayaScarlet/ChangAn.

Chinese Translation

大型语言模型（LLMs）的快速发展将文本生成任务拓展到了文学领域。然而，AI生成的文学作品在文学界引发了日益突出的创作真实性和伦理问题，使得检测LLM生成的文学文本变得必要且紧迫。尽管以往研究在检测AI生成文本方面取得了显著进展，但尚未涉及古典汉诗。由于古典汉诗具有严格的格律规律、共享的诗歌意象体系以及灵活的句法等独特语言特征，区分一首诗是否由AI创作成为一项重大挑战。为解决这些问题，我们提出了ChangAn基准，专门用于检测LLM生成的古典汉诗，包含总计30,664首诗，其中10,276首为人类创作，20,388首由四个流行的LLM生成。基于ChangAn，我们系统评估了12种AI检测器，考察了它们在不同文本粒度和生成策略下的性能差异。研究结果凸显了当前中文文本检测器的局限性，它们无法作为可靠工具用于检测LLM生成的古典汉诗。这些结果验证了我们所提ChangAn基准的有效性和必要性。我们的数据集和代码可在https://github.com/VelikayaScarlet/ChangAn获取。

View on arXiv Download PDF AI Translation

cs.CL / 20 / 2604.10114

CircuitSynth: Reliable Synthetic Data Generation

CircuitSynth：可靠的合成数据生成

Cheng, Zehua, Dai, Wei, Sun, Jiahao, Lukasiewicz, Thomas

Abstract

The generation of high-fidelity synthetic data is a cornerstone of modern machine learning, yet Large Language Models (LLMs) frequently suffer from hallucinations, logical inconsistencies, and mode collapse when tasked with structured generation. Existing approaches, such as prompting or retrieval-augmented generation, lack the mechanisms to balance linguistic expressivity with formal guarantees regarding validity and coverage. To address this, we propose CircuitSynth, a novel neuro-symbolic framework that decouples semantic reasoning from surface realization. By distilling the reasoning capabilities of a Teacher LLM into a Probabilistic Sentential Decision Diagram (PSDD), CircuitSynth creates a tractable semantic prior that structurally enforces hard logical constraints. Furthermore, we introduce a convex optimization mechanism to rigorously satisfy soft distributional goals. Empirical evaluations across diverse benchmarks demonstrate that CircuitSynth achieves 100% Schema Validity even in complex logic puzzles where unconstrained baselines fail (12.4%) while significantly outperforming state-of-the-art methods in rare-combination coverage.

Chinese Translation

高保真合成数据的生成是现代机器学习的基石，然而大型语言模型（LLMs）在进行结构化生成时常常遭遇幻觉、逻辑不一致和模式崩溃等问题。现有的方法，如提示或检索增强生成，缺乏在语言表达性与有效性和覆盖率的正式保证之间取得平衡的机制。为了解决这一问题，我们提出了CircuitSynth，一种新颖的神经符号框架，它将语义推理与表面实现解耦。通过将教师LLM的推理能力提炼为概率句子决策图（Probabilistic Sentential Decision Diagram, PSDD），CircuitSynth 创建了一个可处理的语义先验，从结构上强制执行严格的逻辑约束。此外，我们引入了一种凸优化机制，以严格满足软分布目标。在多样化基准测试中的实证评估表明，CircuitSynth 在复杂逻辑难题中实现了100%的模式有效性，而在这些情况下，无约束基线的表现仅为12.4%，同时在稀有组合覆盖率方面显著优于最先进的方法。

View on arXiv Download PDF AI Translation

cs.CL / 21 / 2604.10123

Training-Free Cross-Lingual Dysarthria Severity Assessment via Phonological Subspace Analysis in Self-Supervised Speech Representations

无训练的跨语言构音障碍严重程度评估：基于自监督语音表示中的音位子空间分析

Muller, Bernard, Barrañón, Antonio Armando Ortiz, Roberts, LaVonne

Abstract

Dysarthric speech severity assessment typically requires trained clinicians or supervised models built from labelled pathological speech, limiting scalability across languages and clinical settings. We present a training-free method that quantifies dysarthria severity by measuring degradation in phonological feature subspaces within frozen HuBERT representations. No supervised severity model is trained; feature directions are estimated from healthy control speech using a pretrained forced aligner. For each speaker, we extract phone-level embeddings via Montreal Forced Aligner, compute d-prime scores along phonological contrast directions (nasality, voicing, stridency, sonorance, manner, and four vowel features) derived exclusively from healthy controls, and construct a 12-dimensional phonological profile.Evaluating 890 speakers across 10 corpora, 5 languages (English, Spanish, Dutch, Mandarin, French), and 3 primary aetiologies (Parkinson's disease, cerebral palsy, ALS), we find that all five consonant d-prime features correlate significantly with clinical severity (random-effects meta-analysis rho = -0.50 to -0.56, p < 2e-4; pooled Spearman rho = -0.47 to -0.55 with bootstrap 95% CIs not crossing zero). The effect replicates within individual corpora, survives FDR correction, and remains robust to leave-one-corpus-out removal and alignment quality controls. Nasality d-prime decreases monotonically from control to severe in 6 of 7 severity-graded corpora. Mann-Whitney U tests confirm that all 12 features distinguish controls from severely dysarthric speakers (p < 0.001).The method requires no dysarthric training data and applies to any language with an existing MFA acoustic model (currently 29 languages). We release the full pipeline and phone feature configurations for six languages.

Chinese Translation

构音障碍语音的严重程度评估通常需要经过训练的临床医生或基于标记病理语音构建的监督模型，这限制了其在不同语言和临床环境中的可扩展性。我们提出了一种无训练的方法，通过测量冻结的HuBERT表示中的音位特征子空间的退化来量化构音障碍的严重程度。该方法不需要训练监督严重性模型；特征方向是通过使用预训练的强制对齐器从健康对照语音中估计得出的。对于每位说话者，我们通过蒙特利尔强制对齐器提取音位级嵌入，沿着仅从健康对照中得出的音位对比方向（鼻音、发声、清浊、响度、方式和四个元音特征）计算d-prime分数，并构建一个12维的音位特征档案。在对890名说话者进行评估时，涵盖10个语料库、5种语言（英语、西班牙语、荷兰语、普通话、法语）和3种主要病因（帕金森病、脑瘫、ALS），我们发现所有五个辅音d-prime特征与临床严重性显著相关（随机效应元分析rho = -0.50至-0.56，p < 2e-4；合并斯皮尔曼rho = -0.47至-0.55，bootstrap 95%置信区间不跨越零）。该效应在各个语料库中得到了复制，经过FDR校正后仍然显著，并且在逐一去除语料库和对齐质量控制时保持稳健。在7个严重程度分级语料库中，鼻音d-prime在从对照到严重的过程中单调递减。曼-惠特尼U检验确认所有12个特征能够区分对照组和严重构音障碍说话者（p < 0.001）。该方法不需要构音障碍训练数据，并适用于任何具有现有MFA声学模型的语言（目前有29种语言）。我们发布了完整的流程和六种语言的音位特征配置。

View on arXiv Download PDF AI Translation

cs.CL / 22 / 2604.10135

Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities

以句子为单位思考：显式句子边界提升语言模型能力

Liu, Zhichen, Li, Yongyuan, Xu, Yang

Abstract

Researchers have explored different ways to improve large language models (LLMs)' capabilities via dummy token insertion in contexts. However, existing works focus solely on the dummy tokens themselves, but fail to leverage the inherent sentence-level structure of natural language. This is a critical oversight, as LLMs acquire linguistic capabilities through exposure to human-generated texts, which are inherently structured at the sentence level. Motivated by this gap, we propose an approach that inserts delimiters at sentence boundaries in LLM inputs, which not only integrates dummy tokens into the context, but also facilitates LLMs with sentence-by-sentence processing behavior during reasoning. Two concrete methods: (1). In-context learning and (2). Supervised fine-tuning are experimented using 7B models to 600B Deepseek-V3. Our results demonstrate consistent improvements across various tasks, with notable gains of up to 7.7\% on GSM8k and 12.5\% on DROP. Furthermore, the fine-tuned LLMs can incorporate sentence awareness evidenced by their internal representations. Our work establishes a simple yet effective technique for enhancing LLM's capabilities, offering promising directions for cognitive-inspired LLM enhancement paradigm.

Chinese Translation

研究人员探索了通过在上下文中插入虚拟标记以提升大型语言模型（LLMs）能力的不同方法。然而，现有工作仅关注虚拟标记本身，未能利用自然语言固有的句子级结构。这是一个关键的疏忽，因为LLMs通过接触人类生成的文本获得语言能力，而这些文本本质上是以句子为结构单位的。针对这一空白，我们提出了一种在LLM输入中于句子边界插入分隔符的方法，该方法不仅将虚拟标记整合进上下文，还促使LLMs在推理过程中实现逐句处理行为。我们在7B模型到600B Deepseek-V3上，分别通过（1）上下文学习和（2）监督微调两种具体方法进行了实验。结果显示，在多个任务上均有稳定提升，其中GSM8k任务最高提升达7.7%，DROP任务提升达12.5%。此外，微调后的LLMs在内部表示中体现了句子意识。我们的工作确立了一种简单而有效的技术以增强LLM能力，为认知启发的LLM增强范式提供了有前景的方向。

View on arXiv Download PDF AI Translation

cs.CL / 23 / 2604.10151

Nationality encoding in language model hidden states: Probing culturally differentiated representations in persona-conditioned academic text

语言模型隐藏状态中的国籍编码：探究以人格条件化的学术文本中的文化差异性表征

Jackson, Paul, Li, Ruizhe, Edelstein, Elspeth

Abstract

Large language models are increasingly used as writing tools and pedagogical resources in English for Academic Purposes, but it remains unclear whether they encode culturally differentiated representations when generating academic text. This study tests whether Gemma-3-4b-it encodes nationality-discriminative information in hidden states when generating research article introductions conditioned by British and Chinese academic personas. A corpus of 270 texts was generated from 45 prompt templates crossed with six persona conditions in a 2 x 3 design. Logistic regression probes were trained on hidden-state activations across all 35 layers, with shuffled-label baselines, a surface-text skyline classifier, cross-family tests, and sentence-level baselines used as controls. Probe-selected token positions were annotated for structural, lexical, and stance features using the Stanza NLP pipeline. The nationality probe reached 0.968 cross-validated accuracy at Layer 18, with perfect held-out classification. Nationality encoding followed a non-monotonic trajectory across layers, with structural effects strongest in the middle to upper network and lexical-domain effects peaking earlier. At high-signal token positions, British-associated patterns showed more postmodification, hedging, boosting, passive voice, and evaluative or process-oriented vocabulary, while Chinese-associated patterns showed more premodification, nominal predicates, and sociocultural or internationalisation vocabulary. However, sentence-level analysis found no significant nationality differences in the full generated surface text. The findings extend probing methodology to a sociolinguistic attribute and have practical implications for EAP and language pedagogy.

Chinese Translation

大型语言模型在学术英语写作和教学资源中的应用日益广泛，但其在生成学术文本时是否编码了文化差异性的表征尚不明确。本研究检验了Gemma-3-4b-it模型在生成以英国和中国学术人格为条件的研究文章引言时，隐藏状态中是否包含国籍区分信息。研究构建了一个包含270篇文本的语料库，基于45个提示模板与六种人格条件，采用2×3设计。通过对所有35层隐藏状态激活进行逻辑回归探测，结合标签随机基线、表层文本的天际线分类器、跨模型家族测试及句子级基线作为对照。利用Stanza自然语言处理管线对探测选定的词元位置进行了结构、词汇及立场特征注释。国籍探测在第18层达到0.968的交叉验证准确率，并实现了完美的留出分类。国籍编码在各层呈非单调变化，中上层网络结构效应最强，词汇领域效应则较早达到峰值。在高信号词元位置，英国相关模式表现出更多的后置修饰、委婉语、强调、被动语态及评价性或过程导向词汇，而中国相关模式则表现出更多的前置修饰、名词性谓语及社会文化或国际化词汇。然而，句子级分析未发现完整生成表层文本中存在显著的国籍差异。研究成果将探测方法扩展至社会语言学属性，对学术英语教学及语言教学具有实际意义。

View on arXiv Download PDF AI Translation

cs.CL / 24 / 2604.10159

ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification

ODUTQA-MDC：基于多轮对话澄清的开放域不充分指定表格问答任务

Wang, Zhensheng, Lin, ZhanTeng, Yang, Wenmian, Zhou, Kun, Zhang, Yiquan, Jia, Weijia

Abstract

The advancement of large language models (LLMs) has enhanced tabular question answering (Tabular QA), yet they struggle with open-domain queries exhibiting underspecified or uncertain expressions. To address this, we introduce the ODUTQA-MDC task and the first comprehensive benchmark to tackle it. This benchmark includes: (1) a large-scale ODUTQA dataset with 209 tables and 25,105 QA pairs; (2) a fine-grained labeling scheme for detailed evaluation; and (3) a dynamic clarification interface that simulates user feedback for interactive assessment. We also propose MAIC-TQA, a multi-agent framework that excels at detecting ambiguities, clarifying them through dialogue, and refining answers. Experiments validate our benchmark and framework, establishing them as a key resource for advancing conversational, underspecification-aware Tabular QA research.

Chinese Translation

大型语言模型（LLMs）的进步提升了表格问答（Tabular QA）的能力，但它们在处理表现出不充分指定或不确定表达的开放域查询时仍存在困难。为此，我们提出了ODUTQA-MDC任务及其首个全面基准。该基准包括：（1）一个包含209张表格和25,105个问答对的大规模ODUTQA数据集；（2）一个细粒度标注方案以实现详细评估；（3）一个动态澄清界面，用于模拟用户反馈以支持交互式评估。我们还提出了MAIC-TQA，一种多智能体框架，擅长检测歧义，通过对话进行澄清并优化答案。实验验证了我们的基准和框架，确立其作为推动对话式、不充分指定感知表格问答研究的重要资源。

View on arXiv Download PDF AI Translation

cs.CL / 25 / 2604.10189

FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness

FAITH：通过整合可信度与诚实性实现事实性对齐

Dong, Xiaoning, Wu, Chengyan, Wen, Yajie, Chen, Yu, Xue, Yun, Zhang, Jing, Xu, Wei, Ma, Bolei

Abstract

Large Language Models (LLMs) can generate factually inaccurate content even if they have corresponding knowledge, which critically undermines their reliability. Existing approaches attempt to mitigate this by incorporating uncertainty in QA prompt during training, but these numerical scores lack the semantic richness for LLM to properly understand its internal states of trustworthiness and honestness, leading to insufficient factuality alignment. We introduce FAITH (Factuality Alignment through Integrating Trustworthiness and Honestness), a post-training framework for factuality alignment that integrates natural-language uncertainty signals with external knowledge. Specifically, we augment training datasets by computing confidence scores and semantic entropy from LLM outputs and mapping them into a knowledge state quadrant that describes the model's internal knowledge possession (trustworthiness) and answering behaviors (honestness) in natural language. Based on this enhanced data, we design a reward function that considers both correctness and uncertainty signals, and fine-tune the LLM using the Proximal Policy Optimization (PPO) algorithm. To further mitigate weakly grounded responses, we design a retrieval-augmented module that retrieves relevant external passages, improving the consistency between internal and external knowledge representations. Extensive experiments on four knowledge-intensive benchmarks demonstrate that FAITH enhances the factual accuracy and truthfulness of LLMs.

Chinese Translation

大型语言模型（LLMs）即使具备相关知识，也可能生成事实不准确的内容，这严重削弱了其可靠性。现有方法试图通过在训练过程中引入问答提示中的不确定性来缓解这一问题，但这些数值评分缺乏语义丰富性，导致LLM无法充分理解其内部的可信度和诚实性状态，从而导致事实性对齐不足。我们提出了FAITH（Factuality Alignment through Integrating Trustworthiness and Honestness），一种基于后训练的事实性对齐框架，该框架将自然语言的不确定性信号与外部知识相结合。具体而言，我们通过计算LLM输出的置信度分数和语义熵，构建训练数据集，并将其映射到描述模型内部知识拥有情况（可信度）和回答行为（诚实性）的知识状态象限中。基于这一增强数据，我们设计了一个同时考虑正确性和不确定性信号的奖励函数，并使用近端策略优化算法（Proximal Policy Optimization，PPO）对LLM进行微调。为进一步减少依据不足的回答，我们设计了一个检索增强模块，用于检索相关的外部段落，从而提升内部与外部知识表示之间的一致性。在四个知识密集型基准测试上的大量实验表明，FAITH显著提升了LLM的事实准确性和真实性。

View on arXiv Download PDF AI Translation

cs.CL / 26 / 2604.10212

Relational Probing: LM-to-Graph Adaptation for Financial Prediction

关系探测：金融预测中的语言模型到图的适应

Niu, Yingjie, Jin, Changhong, Dolphin, Rian, Dong, Ruihai

Abstract

Language models can be used to identify relationships between financial entities in text. However, while structured output mechanisms exist, prompting-based pipelines still incur autoregressive decoding costs and decouple graph construction from downstream optimization. We propose \emph{Relational Probing}, which replaces the standard language-model head with a relation head that induces a relational graph directly from language-model hidden states and is trained jointly with the downstream task model for stock-trend prediction. This approach both learns semantic representations and preserves the strict structure of the induced relational graph. It enables language-model outputs to go beyond text, allowing them to be reshaped into task-specific formats for downstream models. To enhance reproducibility, we provide an operational definition of small language models (SLMs): models that can be fine-tuned end-to-end on a single 24GB GPU under specified batch-size and sequence-length settings. Experiments use Qwen3 backbones (0.6B/1.7B/4B) as upstream SLMs and compare against a co-occurrence baseline. Relational Probing yields consistent performance improvements at competitive inference cost.

Chinese Translation

语言模型可以用于识别文本中金融实体之间的关系。然而，尽管存在结构化输出机制，基于提示的管道仍然会产生自回归解码成本，并将图构建与下游优化解耦。我们提出了 extit{关系探测}，该方法用关系头替代标准语言模型头，直接从语言模型的隐藏状态中诱导出关系图，并与下游任务模型（用于股票趋势预测）联合训练。这种方法不仅学习语义表示，还保持了诱导关系图的严格结构。它使语言模型的输出超越文本，能够被重塑为下游模型的任务特定格式。为了增强可重复性，我们提供了小型语言模型（SLMs）的操作性定义：在指定的批量大小和序列长度设置下，可以在单个24GB GPU上进行端到端微调的模型。实验使用Qwen3骨干网络（0.6B/1.7B/4B）作为上游SLMs，并与共现基线进行比较。关系探测在具有竞争力的推理成本下，产生了一致的性能提升。

View on arXiv Download PDF AI Translation

cs.CL / 27 / 2604.10235

CodeComp: Structural KV Cache Compression for Agentic Coding

CodeComp：面向智能编码的结构化KV缓存压缩方法

Chen, Qiujiang, Xiong, Jing, Zhao, Chenyang, Yang, Sidi, Wong, Ngai

Abstract

Agentic code tasks such as fault localization and patch generation require processing long codebases under tight memory constraints, where the Key-Value (KV) cache becomes the primary inference bottleneck. Existing compression methods rely exclusively on attention signals to estimate token importance, systematically discarding structurally critical tokens such as call sites, branch conditions, and assignments that are essential for code understanding. We present CodeComp, a training-free KV cache compression framework that incorporates static program analysis into LLM inference via Code Property Graph priors extracted by Joern. Across bug localization and code generation benchmarks, CodeComp consistently outperforms attention-only compression baselines under equal memory budgets, recovering the majority of full-context accuracy under aggressive KV cache compression, while matching the patch generation quality of uncompressed full-context inference and integrating seamlessly into SGLang-based agentic coding pipelines without model modification.

Chinese Translation

智能编码任务如故障定位和补丁生成需要在严格的内存限制下处理大型代码库，其中键值（KV）缓存成为主要的推理瓶颈。现有的压缩方法仅依赖注意力信号来估计标记的重要性，系统性地丢弃了对代码理解至关重要的结构性关键标记，如调用点、分支条件和赋值操作。我们提出了CodeComp，一种无需训练的KV缓存压缩框架，通过Joern提取的代码属性图（Code Property Graph）先验，将静态程序分析融入大型语言模型（LLM）推理中。在缺陷定位和代码生成基准测试中，CodeComp在相同内存预算下持续优于仅基于注意力的压缩基线，在激进的KV缓存压缩下恢复了大部分全上下文准确率，同时匹配未压缩全上下文推理的补丁生成质量，并能无缝集成到基于SGLang的智能编码流水线中，无需修改模型。

View on arXiv Download PDF AI Translation

cs.CL / 28 / 2604.10316

Comparative Analysis of Large Language Models in Healthcare

医疗领域大型语言模型的比较分析

Santhosh, Subin, Abbas, Farwa, Ahmad, Hussain, Szabo, Claudia

Abstract

Background: Large Language Models (LLMs) are transforming artificial intelligence applications in healthcare due to their ability to understand, generate, and summarize complex medical text. They offer valuable support to clinicians, researchers, and patients, yet their deployment in high-stakes clinical environments raises critical concerns regarding accuracy, reliability, and patient safety. Despite substantial attention in recent years, standardized benchmarking of LLMs for medical applications has been limited. Objective: This study addresses the need for a standardized comparative evaluation of LLMs in medical settings. Method: We evaluate multiple models, including ChatGPT, LLaMA, Grok, Gemini, and ChatDoctor, on core medical tasks such as patient note summarization and medical question answering, using the open-access datasets, MedMCQA, PubMedQA, and Asclepius, and assess performance through a combination of linguistic and task-specific metrics. Results: The results indicate that domain-specific models, such as ChatDoctor, excel in contextual reliability, producing medically accurate and semantically aligned text, whereas general-purpose models like Grok and LLaMA perform better in structured question-answering tasks, demonstrating higher quantitative accuracy. This highlights the complementary strengths of domain-specific and general-purpose LLMs depending on the medical task. Conclusion: Our findings suggest that LLMs can meaningfully support medical professionals and enhance clinical decision-making; however, their safe and effective deployment requires adherence to ethical standards, contextual accuracy, and human oversight in relevant cases. These results underscore the importance of task-specific evaluation and cautious integration of LLMs into healthcare workflows.

Chinese Translation

背景：大型语言模型（LLMs）因其理解、生成和总结复杂医学文本的能力，正在改变医疗领域的人工智能应用。它们为临床医生、研究人员和患者提供了宝贵支持，但在高风险临床环境中的应用引发了关于准确性、可靠性和患者安全的重大关注。尽管近年来受到了广泛关注，但针对医学应用的LLMs的标准化基准测试仍然有限。目的：本研究旨在满足对医学环境中LLMs进行标准化比较评估的需求。方法：我们评估了多个模型，包括ChatGPT、LLaMA、Grok、Gemini和ChatDoctor，在核心医学任务上，如患者病历摘要和医学问答，使用开放获取的数据集MedMCQA、PubMedQA和Asclepius，并通过语言学和任务特定指标的组合评估其性能。结果：结果表明，特定领域模型如ChatDoctor在上下文可靠性方面表现优异，能够生成医学准确且语义一致的文本，而通用模型如Grok和LLaMA在结构化问答任务中表现更佳，显示出更高的定量准确性。这突显了特定领域和通用LLMs在不同医学任务中的互补优势。结论：我们的研究结果表明，LLMs可以有效支持医疗专业人员并增强临床决策；然而，它们的安全和有效部署需要遵循伦理标准、确保上下文准确性，并在相关情况下进行人工监督。这些结果强调了任务特定评估的重要性以及谨慎将LLMs整合到医疗工作流程中的必要性。

View on arXiv Download PDF AI Translation

cs.CL / 29 / 2604.10335

Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation

基于难度感知路由与不确定性引导聚合的自适应多专家推理

Ehab, Mohamed, Hamdi, Ali

Abstract

Large language models (LLMs) demonstrate strong performance in math reasoning benchmarks, but their performance varies inconsistently across problems with varying levels of difficulty. This paper describes Adaptive Multi-Expert Reasoning (AMR), a framework that focuses on problem complexity by reasoning with dynamically adapted strategies. An agile routing system that focuses on problem text predicts problems' difficulty and uncertainty and guides a reconfigurable sampling mechanism to manage the breadth of generation. Three specialized experts create candidate responses, which are modified during multiple correction and finalization phases. A neural verifier assesses the correctness of responses, while a clustering-based aggregation technique identifies the final candidate answer based on a combination of consensus and answer quality. When evaluated on the GSM8K dataset, AMR achieved 75.28% accuracy while only using the original training data. This result outperformed the majority of comparable 7B models that were trained on synthetic data. This showcases that models using difficulty-based routing and uncertainty-driven aggregation are efficient and effective in improving math reasoning models' robustness.

Chinese Translation

大型语言模型（LLMs）在数学推理基准测试中表现出强劲的性能，但其在不同难度级别的问题上的表现存在不一致性。本文提出了自适应多专家推理（Adaptive Multi-Expert Reasoning，AMR）框架，该框架通过动态调整推理策略，聚焦于问题复杂度。一个敏捷的路由系统基于问题文本预测问题的难度和不确定性，并指导可重构的采样机制以管理生成的广度。三个专门的专家生成候选答案，这些答案在多轮修正和定稿阶段被不断优化。一个神经验证器评估答案的正确性，同时基于聚类的聚合技术结合共识度和答案质量，确定最终的候选答案。在GSM8K数据集上的评测中，AMR实现了75.28%的准确率，且仅使用了原始训练数据。该结果优于大多数使用合成数据训练的同类7B模型，展示了基于难度路由和不确定性驱动聚合的模型在提升数学推理模型鲁棒性方面的高效性和有效性。

View on arXiv Download PDF AI Translation

cs.CL / 30 / 2604.10368

A Structured Clustering Approach for Inducing Media Narratives

一种用于归纳媒体叙事的结构化聚类方法

Das, Rohan, Deshmukh, Advait, Leto, Alexandria, Naaman, Zohar, Lee, I-Ta, Pacheco, Maria Leonor

Abstract

Media narratives wield tremendous power in shaping public opinion, yet computational approaches struggle to capture the nuanced storytelling structures that communication theory emphasizes as central to how meaning is constructed. Existing approaches either miss subtle narrative patterns through coarse-grained analysis or require domain-specific taxonomies that limit scalability. To bridge this gap, we present a framework for inducing rich narrative schemas by jointly modeling events and characters via structured clustering. Our approach produces explainable narrative schemas that align with established framing theory while scaling to large corpora without exhaustive manual annotation.

Chinese Translation

媒体叙事在塑造公众舆论方面具有巨大影响力，然而计算方法难以捕捉传播理论所强调的构建意义的细腻叙事结构。现有方法要么通过粗粒度分析忽略了细微的叙事模式，要么依赖限制可扩展性的领域特定分类体系。为弥合这一差距，我们提出了一种通过结构化聚类联合建模事件和角色的框架，以归纳丰富的叙事模式。该方法生成可解释的叙事模式，符合既有的框架理论，同时能够扩展至大规模语料库，无需耗费大量人工标注。

View on arXiv Download PDF AI Translation

cs.CL / 31 / 2604.10389

BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

BLUEmed：基于检索增强的多智能体辩论框架用于临床错误检测

You, Saukun Thika, Tran, Nguyen Anh Khoa, Marizane, Wesley K., Rao, Hanshu, Zhang, Qiunan, Huang, Xiaolei

Abstract

Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.

Chinese Translation

临床记录中的术语替换错误，即用语言上有效但临床意义不同的医学术语替代原术语，是自动化错误检测中持续存在的挑战。我们提出了BLUEmed，一种结合混合检索增强生成（Retrieval-Augmented Generation, RAG）的多智能体辩论框架，该框架融合了基于证据的推理与多视角验证，用于临床错误检测。BLUEmed将每条临床记录分解为聚焦的子查询，通过密集检索、稀疏检索和在线检索获取分源证据，并为两位领域专家智能体分配不同的知识库以生成独立分析；当专家意见不一致时，系统通过结构化的反驳回合和跨源裁决解决冲突，随后通过级联安全层过滤常见的误报模式。我们在一个临床术语替换检测基准上，采用零样本和少样本提示，结合多种涵盖专有和开源模型的骨干模型，对BLUEmed进行了评估。实验结果表明，BLUEmed在少样本提示下取得了最佳准确率（69.13%）、ROC-AUC（74.45%）和PR-AUC（72.44%），优于单智能体RAG和仅辩论基线方法。对六种骨干模型和两种提示策略的进一步分析确认，检索增强与结构化辩论具有互补性，且该框架在具备充分指令遵循能力和临床语言理解能力的模型上表现最佳。

View on arXiv Download PDF AI Translation

cs.CL / 32 / 2604.10401

NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

NameBERT：基于大语言模型增强开放学术数据的姓名国籍分类规模化研究

Ming, Cong, Shi, Ruixin, Hu, Yifan

Abstract

Inferring nationality from personal names is a critical capability for equity and bias monitoring, personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we created a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.

Chinese Translation

从个人姓名推断国籍是实现公平性和偏见监测、个性化服务的重要能力，同时也是生物医学和社会学研究中的有力工具。然而，现有基于姓名的国籍分类器通常仅在规模较小或特定来源的标注数据集上训练，这可能导致覆盖不足并限制对代表性较低国家的性能表现。尽管大语言模型（LLMs）在姓名国籍预测任务中展现了强大的零样本能力，但其计算成本和延迟使其难以应用于实时大规模部署。本文构建了一个基于开放学术图谱（Open Academic Graph, OAG）的海量姓名-国籍数据集，并提出了一种将LLMs作为数据集增强工具而非推理引擎的框架。我们利用LLM生成的姓名对低资源国家数据进行增强，并在真实及合成尾部测试集上进行评估。结果表明，当评估包含合成尾部姓名时，增强带来了显著提升；即使在其他情况下，对尾部国家指标也有适度提升。总体而言，NameBERT模型在域内及域外任务中均显著优于最先进基线方法，同时相比LLMs在大规模推理时保持了高效性。

View on arXiv Download PDF AI Translation

cs.CL / 33 / 2604.10417

LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset

LASQ：一个低资源语言的基于方面的情感四元组抽取数据集

Yusufu, Aizihaierjiang, Liu, Jiang, Aziz, Kamran, Ainiwaer, Abidan, Li, Bobo, Li, Fei, Ji, Donghong, Yusufu, Aizierguli

Abstract

In recent years, aspect-based sentiment analysis (ABSA) has made rapid progress and shown strong practical value. However, existing research and benchmarks are largely concentrated on high-resource languages, leaving fine-grained sentiment extraction in low-resource languages under-explored. To address this gap, we constructed the first Low-resource languages Aspect-based Sentiment Quadruple dataset, named LASQ, which includes two low-resource languages: Uzbek and Uyghur. Secondly, it includes a fine-grained target-aspect-opinion-sentiment quadruple extraction task. To facilitate future research, we designed a grid-tagging model that integrates syntactic knowledge. This model incorporates part-of-speech (POS) and dependency knowledge into the model through our designed Syntax Knowledge Embedding Module (SKEM), thereby alleviating the lexical sparsity problem caused by agglutinative languages. Experiments on LASQ demonstrate consistent gains over competitive baselines, validating both the dataset's utility and the effectiveness of the proposed modeling approach.

Chinese Translation

近年来，基于方面的情感分析（ABSA）取得了快速进展并展现出强大的实际应用价值。然而，现有研究和基准测试主要集中在高资源语言上，低资源语言中的细粒度情感抽取尚未得到充分探索。为填补这一空白，我们构建了首个低资源语言的基于方面的情感四元组数据集，命名为LASQ，涵盖了两种低资源语言：乌兹别克语和维吾尔语。其次，该数据集包含细粒度的目标-方面-观点-情感四元组抽取任务。为促进未来研究，我们设计了一种融合句法知识的网格标注模型。该模型通过我们设计的句法知识嵌入模块（Syntax Knowledge Embedding Module, SKEM）将词性（POS）和依存句法知识引入模型，有效缓解了黏着语导致的词汇稀疏问题。在LASQ数据集上的实验结果显示，该模型相较于竞争基线方法取得了持续的性能提升，验证了数据集的实用性及所提模型方法的有效性。

View on arXiv Download PDF AI Translation

cs.CL / 34 / 2604.10418

Turing or Cantor: That is the Question

图灵还是康托尔：问题的本质

Eberbach, Eugene

Abstract

Alan Turing is considered as a founder of current computer science together with Kurt Godel, Alonzo Church and John von Neumann. In this paper multiple new research results are presented. It is demonstrated that there would not be Alan Turing's achievements without earlier seminal contributions by Georg Cantor in the set theory and foundations of mathematics. It is proposed to introduce the measure of undecidability of problems unsolvable by Turing machines based on probability distribution of its input data, i.e., to provide the degree of unsolvabilty based on the number of undecidable instances of input data versus decidable ones. It is proposed as well to extend the Turing's work on infinite logics and Oracle machines to a whole class of super-Turing models of computation. Next, the three new complexity classes for TM undecidable problems have been defined: U-complete (Universal complete), D-complete (Diagonalization complete) and H-complete (Hypercomputation complete) classes. The above has never been defined explicitly before by other scientists, and has been inspired by Cook/Levin NP-complete class for intractable problems. Finally, an equivalent to famous P is not equal to NP unanswered question for NP-complete class, has been answered negatively for U-complete class of complexity for undecidable problems.

Chinese Translation

艾伦·图灵被认为是现代计算机科学的奠基人之一，与库尔特·哥德尔、阿隆佐·丘奇和约翰·冯·诺依曼齐名。本文提出了多项新的研究成果。文章论证了如果没有乔治·康托尔在集合论和数学基础领域的早期开创性贡献，艾伦·图灵的成就将难以实现。文中提出基于输入数据的概率分布，引入图灵机无法解决问题的不可判定性度量，即通过不可判定输入实例数量与可判定实例数量的比值来衡量问题的不可解程度。同时，建议将图灵关于无限逻辑和Oracle机器的工作扩展到一整类超图灵（super-Turing）计算模型。接着，定义了针对图灵机不可判定问题的三类新的复杂度类：U-完全类（Universal complete）、D-完全类（Diagonalization complete）和H-完全类（Hypercomputation complete）。上述复杂度类此前未被其他学者明确提出，其灵感来源于Cook/Levin提出的NP完全类，用于描述难解问题。最后，针对NP完全类著名的“P是否等于NP”未解问题，本文对不可判定问题的U-完全复杂度类给出了否定的回答。

View on arXiv Download PDF AI Translation

cs.CL / 35 / 2604.10426

CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning

CodaRAG：受互补学习启发的关联性连接方法

Li, Cheng-Yen, Chen, Xuanjun, Lin, Claire, Chen, Wei-Yu, Nie, Wenhua, Lee, Hung-Yi, Jang, Jyh-Shing Roger

Abstract

Large Language Models (LLMs) struggle with knowledge-intensive tasks due to hallucinations and fragmented reasoning over dispersed information. While Retrieval-Augmented Generation (RAG) grounds generation in external sources, existing methods often treat evidence as isolated units, failing to reconstruct the logical chains that connect these dots. Inspired by Complementary Learning Systems (CLS), we propose CodaRAG, a framework that evolves retrieval from passive lookup into active associative discovery. CodaRAG operates via a three-stage pipeline: (1) Knowledge Consolidation to unify fragmented extractions into a stable memory substrate; (2) Associative Navigation to traverse the graph via multi-dimensional pathways-semantic, contextualized, and functional-explicitly recovering dispersed evidence chains; and (3) Interference Elimination to prune hyper-associative noise, ensuring a coherent, high-precision reasoning context. On GraphRAG-Bench, CodaRAG achieves absolute gains of 7-10% in retrieval recall and 3-11% in generation accuracy. These results demonstrate CodaRAG's superior ability to systematically robustify associative evidence retrieval for factual, reasoning, and creative tasks.

Chinese Translation

大型语言模型（LLMs）在处理知识密集型任务时，常因幻觉现象和对分散信息的碎片化推理而表现不足。尽管检索增强生成（Retrieval-Augmented Generation，RAG）通过外部资源为生成提供依据，现有方法往往将证据视为孤立单元，未能重建连接这些点的逻辑链条。受互补学习系统（Complementary Learning Systems，CLS）启发，我们提出了CodaRAG框架，将检索从被动查找演进为主动的关联发现。CodaRAG通过三阶段流水线运行：（1）知识整合，将分散提取统一为稳定的记忆基底；（2）关联导航，通过多维路径——语义的、上下文化的和功能性的——遍历图谱，显式恢复分散的证据链；（3）干扰消除，修剪过度关联的噪声，确保推理上下文的连贯性和高精度。在GraphRAG-Bench上，CodaRAG在检索召回率上实现了7-10%的绝对提升，生成准确率提升了3-11%。这些结果表明，CodaRAG在系统性稳健地强化事实、推理及创造性任务的关联证据检索方面表现卓越。

View on arXiv Download PDF AI Translation

cs.CL / 36 / 2604.10448

Instruction Data Selection via Answer Divergence

基于答案差异度的指令数据选择

Li, Bo, Wang, Mingda, Zhang, Shikun, Ye, Wei

Abstract

Instruction tuning relies on large instruction-response corpora whose quality and composition strongly affect downstream performance. We propose Answer Divergence-Guided Selection (ADG), which selects instruction data based on the geometric structure of multi-sample outputs. ADG draws several high-temperature generations per instruction, maps responses into an embedding space, and computes an output divergence score that jointly encodes dispersion magnitude and shape anisotropy. High scores correspond to instructions whose answers are both far apart and multi-modal, rather than clustered paraphrases along a single direction. Across two backbones and three public instruction pools, fine-tuning on only 10K ADG-selected examples consistently outperforms strong selectors on six benchmarks spanning reasoning, knowledge, and coding. Analyses further show that both dispersion magnitude and shape anisotropy are necessary, supporting answer divergence as a practical signal for instruction data selection. Code and appendix are included in the supplementary materials.

Chinese Translation

指令微调依赖于大规模的指令-响应语料库，其质量和组成对下游性能有显著影响。我们提出了答案差异度引导选择（Answer Divergence-Guided Selection，ADG）方法，该方法基于多样本输出的几何结构选择指令数据。ADG针对每条指令生成多个高温采样结果，将响应映射到嵌入空间，并计算一个输出差异度得分，该得分联合编码了分散程度和形状各向异性。高得分对应的指令，其答案不仅彼此相距较远且呈多模态分布，而非沿单一方向聚集的同义改写。在两个基础模型和三个公开指令池上，使用仅由ADG选出的1万条样本进行微调，在涵盖推理、知识和编码的六个基准测试中均持续优于强基线选择方法。进一步分析表明，分散程度和形状各向异性均为必要因素，支持答案差异度作为指令数据选择的有效信号。代码及附录包含于补充材料中。

View on arXiv Download PDF AI Translation

cs.CL / 37 / 2604.10452

NOSE: Neural Olfactory-Semantic Embedding with Tri-Modal Orthogonal Contrastive Learning

NOSE：基于三模态正交对比学习的神经嗅觉-语义嵌入

Su, Yanyi, Wang, Hongshuai, Gao, Zhifeng, Cheng, Jun

Abstract

Olfaction lies at the intersection of chemical structure, neural encoding, and linguistic perception, yet existing representation methods fail to fully capture this pathway. Current approaches typically model only isolated segments of the olfactory pathway, overlooking the complete chain from molecule to receptors to linguistic descriptions. Such fragmentation yields learned embeddings that lack both biological grounding and semantic interpretability. We propose NOSE (Neural Olfactory-Semantic Embedding), a representation learning framework that aligns three modalities along the olfactory pathway: molecular structure, receptor sequence, and natural language description. Rather than simply fusing these signals, we decouple their contributions via orthogonal constraints, preserving the unique encoded information of each modality. To address the sparsity of olfactory language, we introduce a weak positive sample strategy to calibrate semantic similarity, preventing erroneous repulsion of similar odors in the feature space. Extensive experiments demonstrate that NOSE achieves state-of-the-art (SOTA) performance and excellent zero-shot generalization, confirming the strong alignment between its representation space and human olfactory intuition.

Chinese Translation

嗅觉位于化学结构、神经编码与语言感知的交汇处，然而现有的表示方法未能充分捕捉这一路径。当前方法通常仅建模嗅觉路径中的孤立环节，忽视了从分子到受体再到语言描述的完整链条。这种碎片化导致学习到的嵌入既缺乏生物学基础，也缺乏语义可解释性。我们提出了NOSE（Neural Olfactory-Semantic Embedding），一种表示学习框架，将嗅觉路径上的三种模态——分子结构、受体序列和自然语言描述进行对齐。我们并非简单融合这些信号，而是通过正交约束解耦它们的贡献，保留每种模态独特的编码信息。为解决嗅觉语言的稀疏性，我们引入了弱正样本策略来校准语义相似度，防止在特征空间中错误地排斥相似气味。大量实验表明，NOSE实现了最先进（SOTA）的性能和优异的零样本泛化能力，验证了其表示空间与人类嗅觉直觉的高度一致性。

View on arXiv Download PDF AI Translation

cs.CL / 38 / 2604.10455

EviCare: Enhancing Diagnosis Prediction with Deep Model-Guided Evidence for In-Context Reasoning

EviCare：基于深度模型引导证据的上下文推理提升诊断预测

Zhang, Hengyu, Zhang, Xuyun, Zhan, Pengxiang, Luo, Linhao, Lv, Hang, Tan, Yanchao, Pan, Shirui, Yang, Carl

Abstract

Recent advances in large language models (LLMs) have enabled promising progress in diagnosis prediction from electronic health records (EHRs). However, existing LLM-based approaches tend to overfit to historically observed diagnoses, often overlooking novel yet clinically important conditions that are critical for early intervention. To address this, we propose EviCare, an in-context reasoning framework that integrates deep model guidance into LLM-based diagnosis prediction. Rather than prompting LLMs directly with raw EHR inputs, EviCare performs (1) deep model inference for candidate selection, (2) evidential prioritization for set-based EHRs, and (3) relational evidence construction for novel diagnosis prediction. These signals are then composed into an adaptive in-context prompt to guide LLM reasoning in an accurate and interpretable manner. Extensive experiments on two real-world EHR benchmarks (MIMIC-III and MIMIC-IV) demonstrate that EviCare achieves significant performance gains, which consistently outperforms both LLM-only and deep model-only baselines by an average of 20.65\% across precision and accuracy metrics. The improvements are particularly notable in challenging novel diagnosis prediction, yielding average improvements of 30.97\%.

Chinese Translation

近年来，大型语言模型（LLMs）的进展推动了基于电子健康记录（EHRs）的诊断预测取得了有希望的成果。然而，现有基于LLM的方法往往过度拟合历史观测到的诊断，常常忽视对早期干预至关重要的新颖且临床重要的疾病。为了解决这一问题，我们提出了EviCare，一种将深度模型引导整合进基于LLM的诊断预测的上下文推理框架。EviCare并非直接以原始EHR输入提示LLM，而是依次执行（1）深度模型推断以进行候选选择，（2）基于集合的EHR证据优先级排序，以及（3）关系证据构建以支持新颖诊断预测。随后，这些信号被组合成自适应的上下文提示，以准确且可解释的方式引导LLM推理。在两个真实世界EHR基准数据集（MIMIC-III和MIMIC-IV）上的大量实验表明，EviCare在精确度和准确率指标上平均超越仅用LLM和仅用深度模型的基线方法20.65%，取得显著性能提升。尤其是在具有挑战性的新颖诊断预测任务中，平均提升达到30.97%。

View on arXiv Download PDF AI Translation

cs.CL / 39 / 2604.10459

Dynamic Adaptive Attention and Supervised Contrastive Learning: A Novel Hybrid Framework for Text Sentiment Classification

动态自适应注意力与监督对比学习：一种用于文本情感分类的新型混合框架

Li, Qingyang

Abstract

The exponential growth of user-generated movie reviews on digital platforms has made accurate text sentiment classification a cornerstone task in natural language processing. Traditional models, including standard BERT and recurrent architectures, frequently struggle to capture long-distance semantic dependencies and resolve ambiguous emotional expressions in lengthy review texts. This paper proposes a novel hybrid framework that seamlessly integrates dynamic adaptive multi-head attention with supervised contrastive learning into a BERT-based Transformer encoder. The dynamic adaptive attention module employs a global context pooling vector to dynamically regulate the contribution of each attention head, thereby focusing on critical sentiment-bearing tokens while suppressing noise. Simultaneously, the supervised contrastive learning branch enforces tighter intra-class compactness and larger inter-class separation in the embedding space. Extensive experiments on the IMDB dataset demonstrate that the proposed model achieves competitive performance with an accuracy of 94.67\%, outperforming strong baselines by 1.5--2.5 percentage points. The framework is lightweight, efficient, and readily extensible to other text classification tasks.

Chinese Translation

数字平台上用户生成的电影评论呈指数级增长，使得准确的文本情感分类成为自然语言处理中的核心任务。传统模型，包括标准的BERT和循环结构，常常难以捕捉长距离的语义依赖关系，并解决冗长评论文本中的模糊情感表达。本文提出了一种新颖的混合框架，将动态自适应多头注意力与监督对比学习无缝集成到基于BERT的Transformer编码器中。动态自适应注意力模块利用全局上下文池化向量动态调节每个注意力头的贡献，从而聚焦于关键的情感承载词，同时抑制噪声。与此同时，监督对比学习分支在嵌入空间中强化类内紧凑性和类间分离度。基于IMDB数据集的大量实验表明，所提模型以94.67%的准确率实现了具有竞争力的性能，较强基线提升了1.5至2.5个百分点。该框架轻量高效，且易于扩展至其他文本分类任务。

View on arXiv Download PDF AI Translation

cs.CL / 40 / 2604.10470

From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation

从查询到咨询：基于多智能体框架和数据集的结构化推理用于法律咨询

Lu, Mingfei, Zhang, Yi, Wu, Mengjia, Feng, Yue

Abstract

Legal consultation question answering (Legal CQA) presents unique challenges compared to traditional legal QA tasks, including the scarcity of high-quality training data, complex task composition, and strong contextual dependencies. To address these, we construct JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses, and design a structured task decomposition that converts each query into a legal element graph integrating entities, events, intents, and legal issues. We further propose JurisMA, a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Combined with the element graph, the framework enables strong context-aware reasoning, effectively capturing dependencies across legal facts, norms, and procedural logic. Trained on JurisCQAD and evaluated on a refined LawBench, our system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics, demonstrating the benefits of interpretable decomposition and modular collaboration in Legal CQA.

Chinese Translation

法律咨询问答（Legal CQA）与传统法律问答任务相比，面临着独特的挑战，包括高质量训练数据的稀缺、复杂的任务组成以及强烈的上下文依赖性。为了解决这些问题，我们构建了JurisCQAD，这是一个大规模数据集，包含超过43,000个真实的中文法律查询，并附有专家验证的正面和负面响应。同时，我们设计了一种结构化的任务分解方法，将每个查询转换为一个法律元素图，整合了实体、事件、意图和法律问题。此外，我们还提出了JurisMA，一个支持动态路由、法定基础和风格优化的模块化多智能体框架。结合元素图，该框架实现了强大的上下文感知推理，有效捕捉法律事实、规范和程序逻辑之间的依赖关系。在JurisCQAD上训练，并在经过精炼的LawBench上评估，我们的系统在多个词汇和语义指标上显著优于通用和法律领域的语言模型，展示了可解释的分解和模块化协作在法律咨询问答中的优势。

View on arXiv Download PDF AI Translation

cs.CL / 41 / 2604.10495

Why Don't You Know? Evaluating the Impact of Uncertainty Sources on Uncertainty Quantification in LLMs

你为什么不知道？评估不确定性来源对大型语言模型不确定性量化的影响

Goloburda, Maiya, Vashurin, Roman, Chernogorsky, Fedor, Laiyk, Nurkhan, Orel, Daniil, Nakov, Preslav, Panov, Maxim

Abstract

As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to produce a single confidence score -- for example, estimating the probability that a model's answer is correct. However, uncertainty in natural language tasks arises from multiple distinct sources, including model knowledge gaps, output variability, and input ambiguity, which have different implications for system behavior and user interaction. In this work, we study how the source of uncertainty impacts the behavior and effectiveness of existing UQ methods. To enable controlled analysis, we introduce a new dataset that explicitly categorizes uncertainty sources, allowing systematic evaluation of UQ performance under each condition. Our experiments reveal that while many UQ methods perform well when uncertainty stems solely from model knowledge limitations, their performance degrades or becomes misleading when other sources are introduced. These findings highlight the need for uncertainty-aware methods that explicitly account for the source of uncertainty in large language models.

Chinese Translation

随着大型语言模型（LLMs）在现实应用中的日益广泛部署，可靠的不确定性量化（UQ）对于安全有效的使用变得至关重要。现有的大多数语言模型不确定性量化方法旨在生成单一的置信度分数——例如，估计模型答案正确的概率。然而，自然语言任务中的不确定性来源多样，涵盖模型知识缺口、输出变异性和输入歧义等不同方面，这些来源对系统行为和用户交互具有不同的影响。在本研究中，我们探讨了不确定性来源如何影响现有不确定性量化方法的行为和效果。为实现受控分析，我们引入了一个新数据集，该数据集明确分类了不确定性来源，从而能够系统地评估在各类条件下不确定性量化的表现。实验结果表明，尽管许多不确定性量化方法在不确定性仅源自模型知识限制时表现良好，但当引入其他不确定性来源时，其性能会下降甚至产生误导。这些发现强调了开发能够显式考虑不确定性来源的大型语言模型不确定性感知方法的必要性。

View on arXiv Download PDF AI Translation

cs.CL / 42 / 2604.10516

Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning

基于代码依赖的结构化知识检索用于多步骤数据推理

Huang, Xinyi, Lu, Mingzhe, Dong, Haoyu

Abstract

Selecting the right knowledge is critical when using large language models (LLMs) to solve domain-specific data analysis tasks. However, most retrieval-augmented approaches rely primarily on lexical or embedding similarity, which is often a weak proxy for the task-critical knowledge needed for multi-step reasoning. In many such tasks, the relevant knowledge is not merely textually related to the query, but is instead grounded in executable code and the dependency structure through which computations are carried out. To address this mismatch, we propose SGKR (Structure-Grounded Knowledge Retrieval), a retrieval framework that organizes domain knowledge with a graph induced by function-call dependencies. Given a question, SGKR extracts semantic input and output tags, identifies dependency paths connecting them, and constructs a task-relevant subgraph. The associated knowledge and corresponding function implementations are then assembled as a structured context for LLM-based code generation. Experiments on multi-step data analysis benchmarks show that SGKR consistently improves solution correctness over no-retrieval and similarity-based retrieval baselines for both vanilla LLMs and coding agents.

Chinese Translation

在使用大型语言模型（LLMs）解决特定领域的数据分析任务时，选择正确的知识至关重要。然而，大多数增强检索的方法主要依赖于词汇或嵌入相似性，这往往无法有效代表多步骤推理所需的关键任务知识。在许多此类任务中，相关知识不仅与查询在文本上相关，而是基于可执行代码及其计算所依赖的结构。为了解决这一不匹配，我们提出了SGKR（结构化知识检索）框架，该框架通过函数调用依赖关系构建图来组织领域知识。给定一个问题，SGKR提取语义输入和输出标签，识别连接它们的依赖路径，并构建一个与任务相关的子图。然后，将相关知识和相应的函数实现组合成一个结构化的上下文，以供基于LLM的代码生成使用。在多步骤数据分析基准测试中的实验表明，SGKR在没有检索和基于相似性的检索基线之上，始终提高了无论是普通LLM还是编码代理的解决方案正确性。

View on arXiv Download PDF AI Translation

cs.CL / 43 / 2604.10520

ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization

ReFEree：一种无参考且细粒度的真实世界代码摘要事实一致性评估方法

Bae, Suyoung, Na, CheolWon, Lee, Jaehoon, Lee, Yumin, Choi, YunSeok, Lee, Jee-Hyong

Abstract

As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionalities and fail to accurately assess dependency context commonly found in real-world code summaries. To address this, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate them at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. We construct a code summarization benchmark with human-annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15-18% over the previous state-of-the-art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.

Chinese Translation

随着大型语言模型（LLMs）能够生成冗长且描述性强的代码摘要，准确且可靠地评估事实一致性成为一项关键挑战。然而，现有的评估方法主要针对孤立代码片段的简短摘要设计，因此难以对多句功能描述进行细粒度评估，也无法准确评估真实世界代码摘要中常见的依赖上下文。为此，我们提出了ReFEree，一种无参考且细粒度的真实世界代码摘要事实一致性评估方法。我们定义了针对代码摘要的事实不一致性标准，并结合依赖信息在片段级别进行评估，随后将片段级结果聚合为细粒度评分。我们构建了一个带有人类标注事实一致性标签的代码摘要基准数据集。评估结果表明，ReFEree在13个基线方法中与人类判断的相关性最高，较之前最先进方法提升了15%至18%。我们的代码和数据可在https://github.com/bsy99615/ReFEree.git获取。

View on arXiv Download PDF AI Translation

cs.CL / 44 / 2604.10556

Lost in Diffusion: Uncovering Hallucination Patterns and Failure Modes in Diffusion Large Language Models

迷失于扩散：揭示扩散大型语言模型中的幻觉模式与失败模式

Guo, Zhengnan, Tan, Fei

Abstract

While Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive paradigm comparable to autoregressive (AR) models, their faithfulness, specifically regarding hallucination, remains largely underexplored. To bridge this gap, we present the first controlled comparative study to evaluate hallucination patterns in dLLMs. Our results demonstrate that current dLLMs exhibit a higher propensity for hallucination than AR counterparts controlled for architecture, scale, and pre-training weights. Furthermore, an analysis of inference-time compute reveals divergent dynamics: while quasi-autoregressive generation suffers from early saturation, non-sequential decoding unlocks potential for continuous refinement. Finally, we identify distinct failure modes unique to the diffusion process, including premature termination, incomplete denoising, and context intrusion. Our findings underscore that although dLLMs have narrowed the performance gap on general tasks, their distinct hallucination mechanisms pose a critical challenge to model reliability. Our code is available at https://github.com/ZeroLoss-Lab/Lost-in-Diffusion

Chinese Translation

尽管扩散大型语言模型（Diffusion Large Language Models，dLLMs）作为一种有别于自回归（autoregressive，AR）模型的非自回归范式展现出良好前景，但其忠实性，特别是关于幻觉现象，仍然鲜有深入研究。为填补这一空白，我们首次开展了受控对比研究，评估dLLMs中的幻觉模式。结果表明，在架构、规模及预训练权重相同的条件下，当前dLLMs较AR模型表现出更高的幻觉倾向。此外，对推理阶段计算过程的分析揭示了不同的动态特征：准自回归生成存在早期饱和问题，而非序列解码则释放了持续优化的潜力。最后，我们识别出扩散过程特有的失败模式，包括过早终止、去噪不完全及上下文干扰。研究结果强调，尽管dLLMs在通用任务上的性能差距已缩小，但其独特的幻觉机制对模型可靠性构成了关键挑战。我们的代码已开源，地址为：https://github.com/ZeroLoss-Lab/Lost-in-Diffusion

View on arXiv Download PDF AI Translation

cs.CL / 45 / 2604.10557

LLMs Should Incorporate Explicit Mechanisms for Human Empathy

大型语言模型应纳入明确的人类共情机制

You, Xiaoxing, Huang, Qiang, Yu, Jun

Abstract

This paper argues that Large Language Models (LLMs) should incorporate explicit mechanisms for human empathy. As LLMs become increasingly deployed in high-stakes human-centered settings, their success depends not only on correctness or fluency but on faithful preservation of human perspectives. Yet, current LLMs systematically fail at this requirement: even when well-aligned and policy-compliant, they often attenuate affect, misrepresent contextual salience, and rigidify relational stance in ways that distort meaning. We formalize empathy as an observable behavioral property: the capacity to model and respond to human perspectives while preserving intention, affect, and context. Under this framing, we identify four recurring mechanisms of empathic failure in contemporary LLMs--sentiment attenuation, empathic granularity mismatch, conflict avoidance, and linguistic distancing--arising as structural consequences of prevailing training and alignment practices. We further organize these failures along three dimensions: cognitive, cultural, and relational empathy, to explain their manifestation across tasks. Empirical analyses show that strong benchmark performance can mask systematic empathic distortions, motivating empathy-aware objectives, benchmarks, and training signals as first-class components of LLM development.

Chinese Translation

本文主张大型语言模型（LLMs）应纳入明确的人类共情机制。随着LLMs在高风险以人为本的环境中日益广泛应用，其成功不仅依赖于正确性或流畅性，更取决于对人类视角的忠实保留。然而，当前的LLMs系统性地未能满足这一要求：即使在良好对齐且符合政策的情况下，它们常常削弱情感表达、错误呈现语境重要性，并以僵化的关系立场扭曲意义。我们将共情形式化为一种可观察的行为特性：即在保留意图、情感和语境的同时，建模并响应人类视角的能力。在此框架下，我们识别出现代LLMs中四种反复出现的共情失败机制——情感削弱（sentiment attenuation）、共情粒度不匹配（empathic granularity mismatch）、冲突回避（conflict avoidance）以及语言疏离（linguistic distancing）——这些均为当前训练和对齐实践的结构性后果。我们进一步沿认知、文化和关系三维度组织这些失败，以解释其在不同任务中的表现。实证分析表明，强劲的基准测试表现可能掩盖系统性的共情扭曲，促使将共情感知目标、基准和训练信号作为LLM开发的一等组成部分。

View on arXiv Download PDF AI Translation

cs.CL / 46 / 2604.10567

Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

早期决策的重要性：非自回归扩散语言模型中的邻近偏差与初始轨迹塑造

Kim, Jiyeon, Choi, Sungik, Jo, Yongrae, Lee, Moontae, Seo, Minjoon

Abstract

Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias-the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.

Chinese Translation

基于扩散的语言模型（dLLMs）作为自回归语言模型的有希望的替代方案，提供了并行生成标记和双向上下文建模的潜力。然而，如何利用这种灵活性实现完全非自回归解码仍然是一个未解的问题，特别是在推理和规划任务中。在本研究中，我们通过系统分析dLLMs在时间轴上的推理动态，探讨了非自回归解码。具体而言，我们发现了一种基于置信度的非自回归生成中的固有失败模式，这种模式源于强邻近偏差——去噪顺序倾向于集中在空间上相邻的标记上。这种局部依赖导致空间错误传播，使得整个轨迹在很大程度上依赖于初始去掩蔽位置。基于这一洞察，我们提出了一种最小干预的方法，指导早期标记选择，采用轻量级规划器和序列结束温度退火。我们在各种推理和规划任务上对我们的方法进行了全面评估，观察到相较于现有启发式基线有显著的整体改进，而没有显著的计算开销。

View on arXiv Download PDF AI Translation

cs.CL / 47 / 2604.10580

Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark

知道该强调什么：一个基于语篇条件的文本到语音基准测试

Turetzky, Arnon, Dekel, Avihu, Aronowitz, Hagai, Hoory, Ron, Adi, Yossi

Abstract

Spoken meaning often depends not only on what is said, but also on which word is emphasized. The same sentence can convey correction, contrast, or clarification depending on where emphasis falls. Although modern text-to-speech (TTS) systems generate expressive speech, it remains unclear whether they infer contextually appropriate stress from discourse alone. To address this gap, we present Context-Aware Stress TTS (CAST), a benchmark for evaluating context-conditioned word-level stress in TTS. Items are defined as contrastive context pairs: identical sentences paired with distinct contexts requiring different stressed words. We evaluate state-of-the-art systems and find a consistent gap: text-only language models reliably recover the intended stress from context, yet TTS systems frequently fail to realize it in speech. We release the benchmark, evaluation framework, construction pipeline and a synthetic corpus to support future work on context-aware speech synthesis.

Chinese Translation

口语表达的意义往往不仅取决于所说内容，还取决于哪个词被强调。同一句话根据强调位置的不同，可以传达纠正、对比或澄清的含义。尽管现代文本到语音（TTS）系统能够生成富有表现力的语音，但尚不清楚它们是否仅凭语篇就能推断出语境适当的重音。为填补这一空白，我们提出了Context-Aware Stress TTS（CAST），这是一个用于评估基于语境条件的词级重音的TTS基准测试。测试项定义为对比语境对：相同的句子配以不同的语境，要求强调不同的词。我们评估了最先进的系统，发现了一个持续存在的差距：仅基于文本的语言模型能够可靠地从语境中恢复预期的重音，而TTS系统则经常未能在语音中实现这一点。我们发布了该基准测试、评估框架、构建流程及一个合成语料库，以支持未来在语境感知语音合成领域的研究工作。

View on arXiv Download PDF AI Translation

cs.CL / 48 / 2604.10590

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

弥合语言鸿沟：预训练与数据集中的跨语言映射以提升多语言大型语言模型性能

Zheng, Weihua, Liu, Chang, Liu, Zhengyuan, Huang, Xin, Wu, Kui, Shahrin, Muhammad Huzaifah Md, Aw, Aiti, Lee, Roy Ka-Wei

Abstract

Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages, as well as monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, can improve cross-lingual performance, but they often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task during the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM embedding space, improving both language generation and comprehension. We further propose a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves gains of up to 11.9 BLEU points in MT, 6.72 points in CLQA BERTScore-Precision, and more than 5% in CLNLU accuracy over strong multilingual baselines. These findings highlight the potential of incorporating cross-lingual objectives into pre-training to improve multilingual LLMs.

Chinese Translation

多语言大型语言模型（LLMs）在跨语言任务中表现不佳，主要原因在于高资源语言与低资源语言之间的数据不平衡以及预训练中的单语偏差。现有方法如双语微调和对比对齐虽能提升跨语言性能，但通常依赖大量平行数据或存在不稳定性。为解决这些问题，我们在预训练阶段引入了跨语言映射任务，增强跨语言对齐能力，同时不损害单语流畅性。该方法在LLM嵌入空间中实现语言的双向映射，提升语言生成与理解能力。我们进一步提出了语言对齐系数（Language Alignment Coefficient），以在数据有限的情况下稳健量化跨语言一致性。机器翻译（MT）、跨语言自然语言理解（CLNLU）及跨语言问答（CLQA）上的实验结果表明，我们的模型在MT任务中BLEU分数提升最高达11.9分，CLQA的BERTScore-Precision提升6.72分，CLNLU准确率超过强基线模型5%以上。这些结果凸显了将跨语言目标纳入预训练以提升多语言LLM性能的潜力。

View on arXiv Download PDF AI Translation

cs.CL / 49 / 2604.10627

Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment

多语言模型中的计算性损伤区分共享与语言特定的大脑对齐

Cui, Yang, Sun, Jingyuan, Sun, Yizheng, Wang, Yifan, Zhang, Yunhao, Li, Jixing, Wang, Shaonan, Zhou, Hongpeng, Hale, John, Zong, Chengqing, Nenadic, Goran

Abstract

How the brain supports language across different languages is a basic question in neuroscience and a useful test for multilingual artificial intelligence. Neuroimaging has identified language-responsive brain regions across languages, but it cannot by itself show whether the underlying processing is shared or language-specific. Here we use six multilingual large language models (LLMs) as controllable systems and create targeted ``computational lesions'' by zeroing small parameter sets that are important across languages or especially important for one language. We then compare intact and lesioned models in predicting functional magnetic resonance imaging (fMRI) responses during 100 minutes of naturalistic story listening in native English, Chinese and French (112 participants). Lesioning a compact shared core reduces whole-brain encoding correlation by 60.32% relative to intact models, whereas language-specific lesions preserve cross-language separation in embedding space but selectively weaken brain predictivity for the matched native language. These results support a shared backbone with embedded specializations and provide a causal framework for studying multilingual brain-model alignment.

Chinese Translation

大脑如何支持不同语言的语言处理是神经科学中的一个基本问题，也是对多语言人工智能的有用测试。神经影像学已经识别出跨语言的语言响应大脑区域，但仅凭此无法显示基础处理是共享的还是语言特定的。在这里，我们使用六个多语言大型语言模型（LLMs）作为可控系统，通过将对跨语言重要或对某一语言特别重要的小参数集归零，创建有针对性的“计算性损伤”。然后，我们比较完整模型和损伤模型在预测100分钟自然故事听觉中的功能性磁共振成像（fMRI）反应（参与者112人）。损伤一个紧凑的共享核心使整个大脑编码相关性相对于完整模型减少60.32%，而语言特定的损伤则在嵌入空间中保持跨语言的分离，但选择性地削弱了与匹配母语的脑预测能力。这些结果支持共享骨架与嵌入特化的观点，并为研究多语言大脑模型对齐提供了因果框架。

View on arXiv Download PDF AI Translation

cs.CL / 50 / 2604.10633

ProUIE: A Macro-to-Micro Progressive Learning Method for LLM-based Universal Information Extraction

ProUIE：一种基于大规模到微观的渐进学习方法，用于基于LLM的通用信息提取

Liu, Wenda, Song, Zhigang, Nie, Shuai, Liu, Guangyao, Chen, Lisung, Yang, Binyu, Chen, Yaran, Zhou, Peng, Wang, Hongzhen, Liu, Yuchen, Hu, Wenyue, Xu, Jiaming, Shi, Runyu, Huang, Ying

Abstract

LLM-based universal information extraction (UIE) methods often rely on additional information beyond the original training data, which increases training complexity yet often yields limited gains. To address this, we propose ProUIE, a Macro-to-Micro progressive learning approach that improves UIE without introducing any external information. ProUIE consists of three stages: (i) macro-level Complete Modeling (CM), which learns NER, RE, and EE along their intrinsic difficulty order on the full training data to build a unified extraction foundation, (ii) meso-level Streamlined Alignment (SA), which operates on sampled data with simplified target formats, streamlining and regularizing structured outputs to make them more concise and controllable, and (iii) micro-level Deep Exploration (DE), which applies GRPO with stepwise fine-grained rewards (SFR) over structural units to guide exploration and improve performance. Experiments on 36 public datasets show that ProUIE consistently improves unified extraction, outperforming strong instruction-tuned baselines on average for NER and RE while using a smaller backbone, and it further demonstrates clear gains in large-scale production-oriented information extraction.

Chinese Translation

基于LLM的通用信息提取（UIE）方法通常依赖于超出原始训练数据的额外信息，这增加了训练复杂性，但往往带来有限的收益。为了解决这个问题，我们提出了ProUIE，一种宏观到微观的渐进学习方法，旨在在不引入任何外部信息的情况下改善UIE。ProUIE由三个阶段组成：（i）宏观层面的完整建模（Complete Modeling, CM），在全量训练数据上按照内在难度顺序学习命名实体识别（NER）、关系提取（RE）和事件抽取（EE），以建立统一的提取基础；（ii）中观层面的简化对齐（Streamlined Alignment, SA），在采样数据上操作，使用简化的目标格式，精简和规范化结构化输出，使其更加简洁和可控；（iii）微观层面的深度探索（Deep Exploration, DE），在结构单元上应用带有逐步细粒度奖励（Stepwise Fine-grained Rewards, SFR）的GRPO，以指导探索并提高性能。在36个公共数据集上的实验表明，ProUIE在统一提取方面始终表现出色，平均超越强大的指令调优基线，尤其在NER和RE任务中表现优异，同时使用了更小的基础模型，并在大规模面向生产的信息提取中进一步展示了明显的收益。

View on arXiv Download PDF AI Translation

cs.CL / 51 / 2604.10660

Efficient Process Reward Modeling via Contrastive Mutual Information

基于对比互信息的高效过程奖励建模

Lee, Nakyung, Hong, Sangwoo, Lee, Jungwoo

Abstract

Recent research has devoted considerable effort to verifying the intermediate reasoning steps of chain-of-thought (CoT) trajectories using process reward models (PRMs) and other verifier models. However, training a PRM typically requires human annotators to assign reward scores to each reasoning step, which is both costly and time-consuming. Existing automated approaches, such as Monte Carlo (MC) estimation, also demand substantial computational resources due to repeated LLM rollouts. To overcome these limitations, we propose contrastive pointwise mutual information (CPMI), a novel automatic reward labeling method that leverages the model's internal probability to infer step-level supervision while significantly reducing the computational burden of annotating dataset. CPMI quantifies how much a reasoning step increases the mutual information between the step and the correct target answer relative to hard-negative alternatives. This contrastive signal serves as a proxy for the step's contribution to the final solution and yields a reliable reward. The experimental results show that CPMI-based labeling reduces dataset construction time by 84% and token generation by 98% compared to MC estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.

Chinese Translation

近期研究大量关注利用过程奖励模型（Process Reward Models, PRMs）及其他验证模型对链式思维（Chain-of-Thought, CoT）轨迹中的中间推理步骤进行验证。然而，训练PRM通常需要人工标注者为每个推理步骤分配奖励分数，这既昂贵又耗时。现有的自动化方法，如蒙特卡洛（Monte Carlo, MC）估计，也因反复调用大型语言模型（LLM）而消耗大量计算资源。为克服这些限制，我们提出了对比点互信息（Contrastive Pointwise Mutual Information, CPMI），这是一种新颖的自动奖励标注方法，利用模型内部概率推断步骤级监督，同时显著降低数据集标注的计算负担。CPMI量化了某推理步骤相较于困难负样本，提升该步骤与正确目标答案之间互信息的程度。该对比信号作为步骤对最终解贡献的代理，生成可靠的奖励。实验结果表明，基于CPMI的标注相比MC估计减少了84%的数据集构建时间和98%的生成token数量，同时在过程级评估和数学推理基准上取得了更高的准确率。

View on arXiv Download PDF AI Translation

cs.CL / 52 / 2604.10665

HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval

HeceTokenizer：一种基于音节的土耳其语检索分词方法

Gulgonul, Senol

Abstract

HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the deterministic six-pattern phonological structure of the language to construct a closed, out-of-vocabulary (OOV)-free vocabulary of approximately 8,000 unique syllable types. A BERT-tiny encoder (1.5M parameters) is trained from scratch on a subset of Turkish Wikipedia using a masked language modeling objective and evaluated on the TQuAD retrieval benchmark using Recall@5. Combined with a fine-grained chunk-based retrieval strategy, HeceTokenizer achieves 50.3% Recall@5, surpassing the 46.92% reported by a morphology-driven baseline that uses a 200 times larger model. These results suggest that the phonological regularity of Turkish syllables provides a strong and resource-light inductive bias for retrieval tasks.

Chinese Translation

HeceTokenizer 是一种基于音节的土耳其语分词器，它利用该语言确定性的六种模式的音位结构，构建了一个约 8000 种独特音节类型的封闭、无词汇外（OOV）词汇表。使用掩码语言模型目标，从土耳其维基百科的一个子集上从头开始训练了一个 BERT-tiny 编码器（参数量为 1.5M），并在 TQuAD 检索基准上使用 Recall@5 进行评估。结合细粒度的基于块的检索策略，HeceTokenizer 实现了 50.3% 的 Recall@5，超过了使用 200 倍更大模型的形态驱动基线所报告的 46.92%。这些结果表明，土耳其语音节的音位规律性为检索任务提供了强大且资源节省的归纳偏置。

View on arXiv Download PDF AI Translation

cs.CL / 53 / 2604.10667

Learning and Enforcing Context-Sensitive Control for LLMs

学习与执行上下文敏感控制的语言模型

Albinhassan, Mohammad, Madhyastha, Pranava, Law, Mark, Russo, Alessandra

Abstract

Controlling the output of Large Language Models (LLMs) through context-sensitive constraints has emerged as a promising approach to overcome the limitations of Context-Free Grammars (CFGs) in guaranteeing generation validity. However, such constraints typically require manual specification -- a significant barrier demanding specialized expertise. We introduce a framework that automatically learns context-sensitive constraints from LLM interactions through a two-phase process: syntactic exploration to gather diverse outputs for constraint learning, followed by constraint exploitation to enforce these learned rules during generation. Experiments demonstrate that our method enables even small LLMs (1B parameters) to learn and generate with perfect constraint adherence, outperforming larger counterparts and state-of-the-art reasoning models. This work represents the first integration of context-sensitive grammar learning with LLM generation, eliminating manual specification while maintaining generation validity.

Chinese Translation

通过上下文敏感约束控制大型语言模型（LLMs）的输出，已成为克服上下文无关文法（CFGs）在确保生成有效性方面局限性的有希望的方法。然而，这些约束通常需要手动指定，这成为了一个需要专业知识的重要障碍。我们提出了一个框架，通过一个两阶段的过程自动学习上下文敏感约束：首先进行句法探索，以收集多样化的输出用于约束学习，其次进行约束利用，在生成过程中执行这些学习到的规则。实验表明，我们的方法使得即使是小型LLMs（1B参数）也能够学习并生成完全遵循约束的输出，超越了更大模型和最先进的推理模型。这项工作首次将上下文敏感语法学习与LLM生成相结合，消除了手动指定的需求，同时保持了生成的有效性。

View on arXiv Download PDF AI Translation

cs.CL / 54 / 2604.10687

QFS-Composer: Query-focused summarization pipeline for less resourced languages

QFS-Composer：面向资源匮乏语言的查询聚焦摘要生成流程

Đuranović, Vuk, Šikonja, Marko Robnik

Abstract

Large language models (LLMs) demonstrate strong performance in text summarization, yet their effectiveness drops significantly across languages with restricted training resources. This work addresses the challenge of query-focused summarization (QFS) in less-resourced languages, where labeled datasets and evaluation tools are limited. We present a novel QFS framework, QFS-Composer, that integrates query decomposition, question generation (QG), question answering (QA), and abstractive summarization to improve the factual alignment of a summary with user intent. We test our approach on the Slovenian language. To enable high-quality supervision and evaluation, we develop the Slovenian QA and QG models based on a Slovene LLM and adapt evaluation approaches for reference-free summary evaluation. Empirical evaluation shows that the QA-guided summarization pipeline yields improved consistency and relevance over baseline LLMs. Our work establishes an extensible methodology for advancing QFS in less-resourced languages.

Chinese Translation

大型语言模型（LLMs）在文本摘要生成中表现出强大的能力，但其效果在训练资源受限的语言中显著下降。本文针对资源匮乏语言中的查询聚焦摘要（QFS）问题展开研究，该领域标注数据集和评估工具均较为有限。我们提出了一种新颖的QFS框架——QFS-Composer，集成了查询分解、问题生成（QG）、问答（QA）以及抽象式摘要生成，以提升摘要与用户意图的事实一致性。我们在斯洛文尼亚语上验证了该方法。为实现高质量的监督和评估，我们基于斯洛文尼亚大型语言模型开发了斯洛文尼亚语的QA和QG模型，并针对无参考摘要评估调整了评估方法。实证评估表明，基于QA指导的摘要生成流程在一致性和相关性方面优于基线LLMs。我们的工作建立了一种可扩展的方法论，推动资源匮乏语言的查询聚焦摘要研究。

View on arXiv Download PDF AI Translation

cs.CL / 55 / 2604.10697

Attention Sinks as Internal Signals for Hallucination Detection in Large Language Models

注意力汇聚点作为大型语言模型幻觉检测的内部信号

Binkowski, Jakub, Adamczewski, Kamil, Kajdanowicz, Tomasz

Abstract

Large language models frequently exhibit hallucinations: fluent and confident outputs that are factually incorrect or unsupported by the input context. While recent hallucination detection methods have explored various features derived from attention maps, the underlying mechanisms they exploit remain poorly understood. In this work, we propose SinkProbe, a hallucination detection method grounded in the observation that hallucinations are deeply entangled with attention sinks - tokens that accumulate disproportionate attention mass during generation - indicating a transition from distributed, input-grounded attention to compressed, prior-dominated computation. Importantly, although sink scores are computed solely from attention maps, we find that the classifier preferentially relies on sinks whose associated value vectors have large norms. Moreover, we show that previous methods implicitly depend on attention sinks by establishing their mathematical relationship to sink scores. Our findings yield a novel hallucination detection method grounded in theory that produces state-of-the-art results across popular datasets and LLMs.

Chinese Translation

大型语言模型经常出现幻觉现象：即流畅且自信的输出内容在事实层面上不正确或缺乏输入上下文支持。尽管近期的幻觉检测方法探索了基于注意力图的多种特征，但其所利用的底层机制仍未被充分理解。在本研究中，我们提出了SinkProbe，一种基于观察到幻觉与注意力汇聚点（attention sinks）紧密相关的幻觉检测方法——注意力汇聚点指的是在生成过程中积累了不成比例注意力质量的标记，表明注意力从分布式、基于输入的关注转向了压缩的、以先验为主导的计算。重要的是，尽管汇聚点得分仅从注意力图计算得出，我们发现分类器更倾向于依赖那些其关联的值向量（value vectors）范数较大的汇聚点。此外，我们通过建立汇聚点得分的数学关系，证明了先前方法隐式依赖于注意力汇聚点。我们的研究成果提出了一种基于理论的新颖幻觉检测方法，在多个流行数据集和大型语言模型（LLMs）上实现了最先进的性能。

View on arXiv Download PDF AI Translation

cs.CL / 56 / 2604.10724

Expect the Unexpected? Testing the Surprisal of Salient Entities

意料之外？显著实体惊讶度的检验

Lin, Jessica, Zeldes, Amir

Abstract

Previous work examining the Uniform Information Density (UID) hypothesis has shown that while information as measured by surprisal metrics is distributed more or less evenly across documents overall, local discrepancies can arise due to functional pressures corresponding to syntactic and discourse structural constraints. However, work thus far has largely disregarded the relative salience of discourse participants. We fill this gap by studying how overall salience of entities in discourse relates to surprisal using 70K manually annotated mentions across 16 genres of English and a novel minimal-pair prompting method. Our results show that globally salient entities exhibit significantly higher surprisal than non-salient ones, even controlling for position, length, and nesting confounds. Moreover, salient entities systematically reduce surprisal for surrounding content when used as prompts, enhancing document-level predictability. This effect varies by genre, appearing strongest in topic-coherent texts and weakest in conversational contexts. Our findings refine the UID competing pressures framework by identifying global entity salience as a mechanism shaping information distribution in discourse.

Chinese Translation

以往关于统一信息密度（Uniform Information Density, UID）假说的研究表明，尽管通过惊讶度（surprisal）指标衡量的信息在整体文档中分布较为均匀，但由于句法和话语结构约束所带来的功能性压力，局部仍可能出现差异。然而，迄今为止的研究大多忽视了话语参与者相对显著性的影响。本文通过分析涵盖16种英语文体、共7万条手工标注提及的实体数据，并采用一种新颖的最小对比提示法（minimal-pair prompting method），填补了这一空白。研究结果表明，整体显著的实体表现出显著高于非显著实体的惊讶度，即使在控制了位置、长度和嵌套等混淆因素后依然成立。此外，当显著实体作为提示使用时，会系统性地降低周围内容的惊讶度，从而提升文档层面的可预测性。该效应因文体而异，在主题连贯文本中最为显著，而在对话语境中最弱。我们的发现通过识别全局实体显著性作为塑造话语信息分布的机制，细化了UID竞争压力框架。

View on arXiv Download PDF AI Translation

cs.CL / 57 / 2604.10733

Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

好脾气到不敢说真话：量化角色扮演语言模型中由宜人性驱动的谄媚行为

Shah, Arya, Mishra, Deepali, Silpasuwanchai, Chaklam

Abstract

Large language models increasingly serve as conversational agents that adopt personas and role-play characters at user request. This capability, while valuable, raises concerns about sycophancy: the tendency to provide responses that validate users rather than prioritize factual accuracy. While prior work has established that sycophancy poses risks to AI safety and alignment, the relationship between specific personality traits of adopted personas and the degree of sycophantic behavior remains unexplored. We present a systematic investigation of how persona agreeableness influences sycophancy across 13 small, open-weight language models ranging from 0.6B to 20B parameters. We develop a benchmark comprising 275 personas evaluated on NEO-IPIP agreeableness subscales and expose each persona to 4,950 sycophancy-eliciting prompts spanning 33 topic categories. Our analysis reveals that 9 of 13 models exhibit statistically significant positive correlations between persona agreeableness and sycophancy rates, with Pearson correlations reaching $r = 0.87$ and effect sizes as large as Cohen's $d = 2.33$. These findings demonstrate that agreeableness functions as a reliable predictor of persona-induced sycophancy, with direct implications for the deployment of role-playing AI systems and the development of alignment strategies that account for personality-mediated deceptive behaviors.

Chinese Translation

大型语言模型日益作为对话代理，能够根据用户请求采用特定人格并扮演角色。尽管这一能力具有价值，但也引发了关于谄媚行为的担忧：即倾向于提供迎合用户而非优先保证事实准确性的回答。先前研究已确认谄媚行为对人工智能安全与对齐构成风险，但尚未探讨所采用人格的具体性格特质与谄媚行为程度之间的关系。本文系统研究了人格宜人性（agreeableness）如何影响13个参数规模从0.6亿到20亿不等的小型开源语言模型中的谄媚行为。我们构建了包含275个人格的基准测试，这些人格基于NEO-IPIP宜人性子量表进行评估，并针对33个话题类别设计了4950条诱发谄媚的提示语对每个人格进行测试。分析结果显示，13个模型中有9个在人格宜人性与谄媚率之间存在统计显著的正相关，Pearson相关系数最高达到$r=0.87$，效应量最大可达Cohen's $d=2.33$。这些发现表明，宜人性是预测人格诱发谄媚行为的可靠指标，对角色扮演AI系统的部署及考虑人格介导的欺骗行为的对齐策略开发具有直接意义。

View on arXiv Download PDF AI Translation

cs.CL / 58 / 2604.10734

Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS

自我纠正的RAG：通过MMKP上下文选择与NLI引导的MCTS提升生成结果的可信度

Xu, Shijia, Wu, Zhou, Jia, Xiaolong, Wang, Yu, Liu, Kai, Dong, April Xiaowen

Abstract

Retrieval-augmented generation (RAG) substantially extends the knowledge boundary of large language models. However, it still faces two major challenges when handling complex reasoning tasks: low context utilization and frequent hallucinations. To address these issues, we propose Self-Correcting RAG, a unified framework that reformulates retrieval and generation as constrained optimization and path planning. On the input side, we move beyond traditional greedy retrieval and, for the first time, formalize context selection as a multi-dimensional multiple-choice knapsack problem (MMKP), thereby maximizing information density and removing redundancy under a strict token budget. On the output side, we introduce a natural language inference (NLI)-guided Monte Carlo Tree Search (MCTS) mechanism, which leverages test-time compute to dynamically explore reasoning trajectories and validate the faithfulness of generated answers. Experiments on six multi-hop question answering and fact-checking datasets demonstrate that our method significantly improves reasoning accuracy on complex queries while effectively reducing hallucinations, outperforming strong existing baselines.Our code is available at https://github.com/xjiacs/Self-Correcting-RAG .

Chinese Translation

检索增强生成（Retrieval-augmented generation，RAG）显著扩展了大型语言模型的知识边界。然而，在处理复杂推理任务时，仍面临两个主要挑战：上下文利用率低和频繁出现幻觉现象。为解决这些问题，我们提出了自我纠正的RAG（Self-Correcting RAG），这是一个将检索与生成重新表述为约束优化和路径规划的统一框架。在输入端，我们突破传统的贪婪检索，首次将上下文选择形式化为多维多选背包问题（Multi-dimensional Multiple-choice Knapsack Problem，MMKP），从而在严格的令牌预算下最大化信息密度并消除冗余。在输出端，我们引入了基于自然语言推理（Natural Language Inference，NLI）指导的蒙特卡洛树搜索（Monte Carlo Tree Search，MCTS）机制，利用测试时计算资源动态探索推理路径并验证生成答案的可信度。在六个多跳问答和事实核查数据集上的实验表明，我们的方法显著提升了复杂查询的推理准确率，同时有效减少了幻觉现象，优于现有强基线。我们的代码可在 https://github.com/xjiacs/Self-Correcting-RAG 获取。

View on arXiv Download PDF AI Translation

cs.CL / 59 / 2604.10736

BlasBench: An Open Benchmark for Irish Speech Recognition

BlasBench：一个用于爱尔兰语语音识别的开放基准测试

Raj, Jyoutir, Conway, John

Abstract

No open Irish-specific benchmark compares end-user ASR systems under a shared Irish-aware evaluation protocol. To solve this, we release BlasBench, an open evaluation harness with Irish-aware text normalisation that preserves fadas, lenition, and eclipsis. We benchmark 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE. All Whisper variants exceed 100% WER. The best open model (omniASR LLM 7B) achieves 30.65% WER on Common Voice and 39.09% on FLEURS. We noticed models fine-tuned on Common Voice lose 33-43 WER points on FLEURS, revealing a generalisation gap that is invisible to single-dataset evaluation.

Chinese Translation

目前尚无针对爱尔兰语的开放基准测试，能够在统一的爱尔兰语感知评估协议下比较终端用户的自动语音识别（ASR）系统。为解决这一问题，我们发布了BlasBench，这是一个包含爱尔兰语感知文本规范化的开放评估平台，能够保留fadas（长元音符号）、软化（lenition）和遮蔽（eclipsis）等语言特征。我们在Common Voice ga-IE和FLEURS ga-IE数据集上，对来自四种架构类别的12个系统进行了基准测试。所有Whisper变体的词错误率（WER）均超过100%。表现最佳的开放模型（omniASR LLM 7B）在Common Voice上的WER为30.65%，在FLEURS上的WER为39.09%。我们注意到，在Common Voice上微调的模型在FLEURS上的WER下降了33至43个百分点，揭示了单一数据集评估无法发现的泛化差距。

View on arXiv Download PDF AI Translation

cs.CL / 60 / 2604.10740

RCBSF: A Multi-Agent Framework for Automated Contract Revision via Stackelberg Game

RCBSF：一种基于Stackelberg博弈的自动合同修订多智能体框架

Xu, Shijia, Wang, Yu, Jia, Xiaolong, Wu, Zhou, Liu, Kai, Dong, April Xiaowen

Abstract

Despite the widespread adoption of Large Language Models (LLMs) in Legal AI, their utility for automated contract revision remains impeded by hallucinated safety and a lack of rigorous behavioral constraints. To address these limitations, we propose the Risk-Constrained Bilevel Stackelberg Framework (RCBSF), which formulates revision as a non-cooperative Stackelberg game. RCBSF establishes a hierarchical Leader Follower structure where a Global Prescriptive Agent (GPA) imposes risk budgets upon a follower system constituted by a Constrained Revision Agent (CRA) and a Local Verification Agent (LVA) to iteratively optimize output. We provide theoretical guarantees that this bilevel formulation converges to an equilibrium yielding strictly superior utility over unguided configurations. Empirical validation on a unified benchmark demonstrates that RCBSF achieves state-of-the-art performance, surpassing iterative baselines with an average Risk Resolution Rate (RRR) of 84.21\% while enhancing token efficiency. Our code is available at https://github.com/xjiacs/RCBSF .

Chinese Translation

尽管大型语言模型（LLMs）在法律人工智能中的广泛应用，但其在自动合同修订中的效用仍受到虚假安全感和缺乏严格行为约束的限制。为了解决这些问题，我们提出了风险约束双层Stackelberg框架（RCBSF），将修订过程形式化为一个非合作的Stackelberg博弈。RCBSF建立了一个层级的领导者-追随者结构，其中全球规范代理（GPA）对由受限修订代理（CRA）和本地验证代理（LVA）构成的追随者系统施加风险预算，以迭代优化输出。我们提供了理论保证，证明该双层形式化能够收敛到一个均衡状态，产生比无指导配置严格优越的效用。在统一基准上的实证验证表明，RCBSF实现了最先进的性能，超越了迭代基线，平均风险解决率（RRR）达到84.21%，同时提高了令牌效率。我们的代码可在 https://github.com/xjiacs/RCBSF 获取。

View on arXiv Download PDF AI Translation

cs.CL / 61 / 2604.10741

Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

Deep-Reporter：面向有依据的多模态长文本生成的深度研究

Ye, Fangda, Xie, Zhifei, Hu, Yuxin, Yin, Yihang, Huang, Shurui, Dong, Shikai, Bao, Jianzhu, Yan, Shuicheng

Abstract

Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further introduce M2LongBench, a comprehensive testbed comprising 247 research tasks across 9 domains and a stable multimodal sandbox. Extensive experiments demonstrate that long-form multimodal generation is a challenging task, especially in multimodal selection and integration, and effective post-training can bridge the gap.

Chinese Translation

近期的智能搜索框架通过迭代规划与检索实现深度研究，减少幻觉现象并增强事实依据。然而，这些方法仍以文本为中心，忽视了真实专家报告中所体现的多模态证据。我们提出了一项紧迫的任务：多模态长文本生成。为此，我们设计了Deep-Reporter，一种统一的智能框架，用于有依据的多模态长文本生成。该框架协调实现：(i) 智能多模态搜索与筛选，用于检索和筛选文本段落及信息密集的视觉内容；(ii) 检查表引导的增量合成，确保图文融合的连贯性及引用位置的最优安排；(iii) 递归上下文管理，平衡长距离连贯性与局部流畅性。我们构建了严谨的策划流程，产出8千条高质量的智能轨迹用于模型优化。进一步地，我们引入了M2LongBench，一个涵盖9个领域247个研究任务的综合测试平台及稳定的多模态沙箱。大量实验表明，多模态长文本生成是一项具有挑战性的任务，尤其在多模态选择与融合方面，而有效的后训练能够弥合性能差距。

View on arXiv Download PDF AI Translation

cs.CL / 62 / 2604.10745

How You Ask Matters! Adaptive RAG Robustness to Query Variations

提问方式的重要性！自适应检索增强生成（RAG）对查询变体的鲁棒性

Jang, Yunah, Sundriyal, Megha, Jung, Kyomin, Cha, Meeyoung

Abstract

Adaptive Retrieval-Augmented Generation (RAG) promises accuracy and efficiency by dynamically triggering retrieval only when needed and is widely used in practice. However, real-world queries vary in surface form even with the same intent, and their impact on Adaptive RAG remains under-explored. We introduce the first large-scale benchmark of diverse yet semantically identical query variations, combining human-written and model-generated rewrites. Our benchmark facilitates a systematic evaluation of Adaptive RAG robustness by examining its key components across three dimensions: answer quality, computational cost, and retrieval decisions. We discover a critical robustness gap, where small surface-level changes in queries dramatically alter retrieval behavior and accuracy. Although larger models show better performance, robustness does not improve accordingly. These findings reveal that Adaptive RAG methods are highly vulnerable to query variations that preserve identical semantics, exposing a critical robustness challenge.

Chinese Translation

自适应检索增强生成（RAG）通过仅在需要时动态触发检索，承诺提供准确性和效率，并在实践中得到广泛应用。然而，现实世界中的查询即使意图相同，其表面形式也存在差异，而这些差异对自适应 RAG 的影响仍然未得到充分探讨。我们引入了第一个大规模基准，涵盖多样但语义相同的查询变体，结合了人工撰写和模型生成的重写。我们的基准通过在答案质量、计算成本和检索决策三个维度上检查自适应 RAG 的关键组件，促进了其鲁棒性的系统评估。我们发现了一个关键的鲁棒性差距，即查询中小的表面变化会显著改变检索行为和准确性。尽管更大的模型表现更好，但鲁棒性并未相应改善。这些发现揭示了自适应 RAG 方法对保持相同语义的查询变体高度脆弱，暴露了一个关键的鲁棒性挑战。

View on arXiv Download PDF AI Translation

cs.CL / 63 / 2604.10748

Generating Multiple-Choice Knowledge Questions with Interpretable Difficulty Estimation using Knowledge Graphs and Large Language Models

利用知识图谱和大型语言模型生成具可解释难度估计的多项选择知识题

Şakiroğlu, Mehmet Can, Güvenir, H. Altay, Kaya, Kamer

Abstract

Generating multiple-choice questions (MCQs) with difficulty estimation remains challenging in automated MCQ-generation systems used in adaptive, AI-assisted education. This study proposes a novel methodology for generating MCQs with difficulty estimation from the input documents by utilizing knowledge graphs (KGs) and large language models (LLMs). Our approach uses an LLM to construct a KG from input documents, from which MCQs are then systematically generated. Each MCQ is generated by selecting a node from the KG as the key, sampling a related triple or quintuple -- optionally augmented with an extra triple -- and prompting an LLM to generate a corresponding stem from these graph components. Distractors are then selected from the KG. For each MCQ, nine difficulty signals are computed and combined into a unified difficulty score using a data-driven approach. Experimental results demonstrate that our method generates high-quality MCQs whose difficulty estimation is interpretable and aligns with human perceptions. Our approach improves automated MCQ generation by integrating structured knowledge representations with LLMs and a data-driven difficulty estimation model.

Chinese Translation

在自适应人工智能辅助教育中，自动生成带有难度估计的多项选择题（MCQs）仍然具有挑战性。本研究提出了一种新颖的方法，结合知识图谱（KGs）和大型语言模型（LLMs），从输入文档中生成带难度估计的多项选择题。我们的方法首先利用大型语言模型从输入文档构建知识图谱，随后系统地生成多项选择题。每道题通过从知识图谱中选择一个节点作为关键点，采样相关的三元组或五元组（可选地增加额外的三元组），并提示大型语言模型基于这些图谱组件生成题干。干扰项则从知识图谱中选取。对于每道题，计算九个难度信号，并通过数据驱动的方法将其合成为统一的难度评分。实验结果表明，该方法生成的多项选择题质量高，难度估计具有可解释性且与人类认知一致。该方法通过整合结构化知识表示、大型语言模型及数据驱动的难度估计模型，提升了自动多项选择题的生成效果。

View on arXiv Download PDF AI Translation

cs.CL / 64 / 2604.10786

Do BERT Embeddings Encode Narrative Dimensions? A Token-Level Probing Analysis of Time, Space, Causality, and Character in Fiction

BERT嵌入是否编码叙事维度？基于时间、空间、因果关系和角色的小说文本级探测分析

Bei, Beicheng, Chun, Hannah Hyesun, Guo, Chen, Saghiri, Arwa

Abstract

Narrative understanding requires multidimensional semantic structures. This study investigates whether BERT embeddings encode dimensions of fictional narrative semantics -- time, space, causality, and character. Using an LLM to accelerate annotation, we construct a token-level dataset labeled with these four narrative categories plus "others." A linear probe on BERT embeddings (94% accuracy) significantly outperforms a control probe on variance-matched random embeddings (47%), confirming that BERT encodes meaningful narrative information. With balanced class weighting, the probe achieves a macro-average recall of 0.83, with moderate success on rare categories such as causality (recall = 0.75) and space (recall = 0.66). However, confusion matrix analysis reveals "Boundary Leakage," where rare dimensions are systematically misclassified as "others." Clustering analysis shows that unsupervised clustering aligns near-randomly with predefined categories (ARI = 0.081), suggesting that narrative dimensions are encoded but not as discretely separable clusters. Future work includes a POS-only baseline to disentangle syntactic patterns from narrative encoding, expanded datasets, and layer-wise probing.

Chinese Translation

叙事理解需要多维语义结构。本研究探讨BERT嵌入是否编码了虚构叙事语义的维度——时间、空间、因果关系和角色。通过使用大型语言模型（LLM）加速标注，我们构建了一个带有这四个叙事类别及“其他”类别的文本级数据集。对BERT嵌入进行线性探测，准确率达到94%，显著优于方差匹配的随机嵌入对照探测（47%），证实BERT编码了有意义的叙事信息。在类别权重平衡的条件下，探测器实现了宏平均召回率0.83，对因果关系（召回率=0.75）和空间（召回率=0.66）等稀有类别表现出中等成功。然而，混淆矩阵分析揭示了“边界泄漏”现象，即稀有维度系统性地被误分类为“其他”。聚类分析显示，无监督聚类与预定义类别的匹配接近随机（ARI=0.081），表明叙事维度虽被编码，但并非以离散可分的簇形式存在。未来工作包括引入仅基于词性（POS）的基线以区分句法模式与叙事编码，扩展数据集，以及分层探测分析。

View on arXiv Download PDF AI Translation

cs.CL / 65 / 2604.10787

When Meaning Isn't Literal: Exploring Idiomatic Meaning Across Languages and Modalities

当意义不是字面上的：跨语言和模态探索习语意义

Das, Sarmistha, Guha, Shreyas, Bandyopadhyay, Suvrayan, Phosit, Salisa, Pasupa, Kitsuchart, Saha, Sriparna

Abstract

Idiomatic reasoning, deeply intertwined with metaphor and culture, remains a blind spot for contemporary language models, whose progress skews toward surface-level lexical and semantic cues. For instance, the Bengali idiom \textit{\foreignlanguage{bengali}{\char"0986\char"0999\char"09CD\char"0997\char"09C1 \char"09B0 \char"09AB\char"09B2 \char"099F\char"0995}} (angur fol tok, ``grapes are sour''): it encodes denial-driven rationalization, yet naive models latch onto the literal fox-and-grape imagery. Addressing this oversight, we present ``Mediom,'' a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with gold-standard explanations, cross-lingual translations, and carefully aligned text--image representations. We benchmark both large language models (textual reasoning) and vision-language models (figurative disambiguation) on Mediom, exposing systematic failures in metaphor comprehension. To mitigate these gaps, we propose ``HIDE,'' a Hinting-based Idiom Explanation framework that leverages error-feedback retrieval and targeted diagnostic cues for iterative reasoning refinement. Collectively, Mediom and HIDE establish a rigorous test bed and methodology for culturally grounded, multimodal idiom understanding embedded with reasoning hints in next-generation AI systems.

Chinese Translation

习语推理与隐喻和文化密切相关，仍然是当代语言模型的盲点，这些模型的进展倾向于表层的词汇和语义线索。例如，孟加拉语习语 extit{oreignlanguage{bengali}{ ext{এগুর ফল টক}}}（angur fol tok，意为“葡萄是酸的”）：它编码了基于否认的合理化，然而天真的模型却抓住了字面上的狐狸与葡萄的意象。为了解决这一问题，我们提出了“Mediom”，一个包含3533个印地语、孟加拉语和泰语习语的多语言、多模态习语语料库，每个习语都配有黄金标准的解释、跨语言翻译和精心对齐的文本-图像表示。我们在Mediom上对大型语言模型（文本推理）和视觉-语言模型（比喻消歧）进行了基准测试，揭示了隐喻理解中的系统性失败。为了弥补这些差距，我们提出了“HIDE”，一个基于提示的习语解释框架，利用错误反馈检索和针对性的诊断线索进行迭代推理优化。总体而言，Mediom和HIDE建立了一个严格的测试平台和方法论，用于文化根植的多模态习语理解，并在下一代人工智能系统中嵌入推理提示。

View on arXiv Download PDF AI Translation

cs.CL / 66 / 2604.10788

TInR: Exploring Tool-Internalized Reasoning in Large Language Models

TInR：探索大型语言模型中的工具内化推理

Xu, Qiancheng, Li, Yongqi, Liu, Fan, Wang, Hongru, Yang, Min, Li, Wenjie

Abstract

Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.

Chinese Translation

工具集成推理（Tool-Integrated Reasoning, TIR）作为一种通过在推理过程中结合外部工具来扩展大型语言模型（Large Language Models, LLMs）能力的有前景方向，已逐渐兴起。现有的TIR方法通常依赖于推理过程中外部工具的文档支持，然而这导致了工具掌握难度大、工具规模受限及推理效率低下等问题。为缓解这些问题，我们提出了工具内化推理（Tool-Internalized Reasoning, TInR），旨在促进将工具知识内化于LLMs中以辅助推理。实现该目标需要满足显著的要求，包括工具内化和工具推理协调。为此，我们提出了TInR-U，一种实现统一推理与工具使用的工具内化推理框架。TInR-U通过三阶段流程进行训练：1）采用双向知识对齐策略实现工具内化；2）利用高质量推理标注进行监督微调预热；3）结合TInR特定奖励的强化学习。我们在领域内和领域外多种设置下对方法进行了全面评估。实验结果表明，TInR-U在两种设置中均表现出卓越性能，凸显了其有效性与高效性。

View on arXiv Download PDF AI Translation

cs.CL / 67 / 2604.10791

Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V

无位置依赖的预投影用于变换器注意力：在 Q/K/V 之前的非线性特征构建和内容跳跃

Shinde, Chirag

Abstract

We propose two complementary modifications to transformer attention blocks. First, a non-linear pre-projection MLP is inserted between layer norm and Q/K/V projections, constructing richer features in a position-agnostic manner before any positional encoding is applied. Second, a content skip connection routes the pre-projection's features around the attention mechanism, allowing content information to bypass position-aware attention where beneficial. In frozen-probe experiments on Pythia-160M and 410M, the combined approach achieves the strongest results across methods: +40.6% LAMBADA accuracy and -39% perplexity at 160M scale. Learned skip connection weights reveal a consistent pattern across model sizes: later transformer layers activate the content bypass more strongly than earlier layers, suggesting that deeper layers benefit from content information that does not pass through positional attention. All modifications add no K/V cache overhead.

Chinese Translation

我们提出了对变换器注意力模块的两项互补修改。首先，在层归一化和 Q/K/V 投影之间插入一个非线性预投影多层感知机（MLP），以无位置依赖的方式构建更丰富的特征，在应用任何位置编码之前。其次，一个内容跳跃连接将预投影的特征绕过注意力机制，从而允许内容信息在有利的情况下绕过位置感知的注意力。在对 Pythia-160M 和 410M 的冻结探针实验中，综合方法在各个方法中取得了最佳结果：在 160M 规模下，LAMBADA 准确率提高了 40.6%，困惑度降低了 39%。学习到的跳跃连接权重揭示了不同模型规模之间的一致模式：后面的变换器层比前面的层更强烈地激活内容绕过，表明更深层的层受益于未通过位置注意力的内容信息。所有修改均未增加 K/V 缓存的开销。

View on arXiv Download PDF AI Translation

cs.CL / 68 / 2604.10799

Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series

通过优化分词器推动波兰语模型的发展：Bielik v3 7B 和 11B 系列

Ociepa, Krzysztof, Flis, Łukasz, Kinas, Remigiusz, Wróbel, Krzysztof, Gwoździej, Adrian

Abstract

The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal tokenizers. These tokenizers, typically designed to cover a broad spectrum of languages, often fail to capture the morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows. This report details the transition from the universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models, exploring the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment involving Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.

Chinese Translation

Bielik v3 PL 系列的开发，包括 7B 和 11B 参数变体，标志着语言特定的大型语言模型（LLM）优化领域的重要里程碑。尽管通用模型通常展现出令人印象深刻的多语言能力，但它们常常面临一个根本的架构低效问题：使用通用分词器。这些分词器通常旨在覆盖广泛的语言范围，但往往无法捕捉像波兰语这样的特定语言的形态学细微差别，导致更高的繁殖比、增加的推理成本以及受限的有效上下文窗口。本报告详细介绍了从通用的基于 Mistral 的分词转向专门为 Bielik v3 模型优化的波兰语词汇的过程，探讨了基于 FOCUS 的嵌入初始化、多阶段预训练课程，以及随后涉及监督微调、直接偏好优化和通过可验证奖励的群体相对策略优化的强化学习的后训练对齐。

View on arXiv Download PDF AI Translation

cs.CL / 69 / 2604.10866

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

OccuBench：通过语言世界模型评估AI代理在真实职业任务中的表现

Hu, Xiaomeng, Zhang, Yinger, Huang, Fei, Tu, Jianhong, Su, Yang, Deng, Lianghao, Liu, Yuxuan, Liu, Yantao, Liu, Dayiheng, Ho, Tsung-Yi

Abstract

AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance. GPT-5.2 improves by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is critical for LWM-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.

Chinese Translation

AI代理被期望能够在数百个职业领域执行专业工作（从急诊科分诊到核反应堆安全监控再到海关进口处理），然而现有基准测试仅能评估那些存在公共环境的少数领域中的代理表现。我们提出了OccuBench，这是一套涵盖10个行业类别、65个专业领域共100个真实职业任务场景的基准测试，基于语言世界模型（Language World Models，LWMs）通过大型语言模型驱动的工具响应生成模拟特定领域环境。我们的多代理合成流程自动生成具有可解性保证、难度校准和文档支持多样性的评估实例。OccuBench从两个互补维度评估代理：跨专业领域的任务完成度以及在受控故障注入（显性错误、隐性数据退化和混合故障）下的环境鲁棒性。我们对来自8个模型家族的15个前沿模型进行了评测，发现：（1）没有单一模型能够主导所有行业，每个模型均展现出独特的职业能力特征；（2）隐性故障（数据截断、字段缺失）比显性错误（超时、500错误）和混合故障更难处理，因为其缺乏明显的错误信号，要求代理自主检测数据退化；（3）更大规模的模型、更先进的代际和更高的推理努力均持续提升性能。GPT-5.2在从最低到最高推理努力下性能提升了27.5分；（4）强大的代理不一定是强环境模拟器，模拟器质量对基于LWM的评估可靠性至关重要。OccuBench首次实现了对AI代理在跨行业专业职业任务上的系统性评估。

View on arXiv Download PDF AI Translation

cs.CL / 70 / 2604.10874

AOP-Smart: A RAG-Enhanced Large Language Model Framework for Adverse Outcome Pathway Analysis

AOP-Smart：一种基于RAG增强的大型语言模型框架用于不良结果途径分析

Niu, Qinjiang, Yan, Lu

Abstract

Adverse Outcome Pathways (AOPs) are an important knowledge framework in toxicological research and risk assessment. In recent years, large language models (LLMs) have gradually been applied to AOP-related question answering and mechanistic reasoning tasks. However, due to the existence of the hallucination problem, that is, the model may generate content that is inconsistent with facts or lacks evidence, their reliability is still limited. To address this issue, this study proposes an AOP-oriented Retrieval-Augmented Generation (RAG) framework, AOP-Smart. Based on the official XML data from AOP-Wiki, this method uses Key Events (KEs), Key Event Relationships (KERs), and specific AOP information to retrieve relevant knowledge for user questions, thereby improving the reliability of the generated results of large language models. To evaluate the effectiveness of the proposed method, this study constructed a test set containing 20 AOP-related question answering tasks, covering KE identification, upstream and downstream KE retrieval, and complex AOP retrieval tasks. Experiments were conducted on three mainstream large language models, Gemini, DeepSeek, and ChatGPT, and comparative tests were performed under two settings: without RAG and with RAG. The experimental results show that, without using RAG, the accuracies of GPT, DeepSeek, and Gemini were 15.0\%, 35.0\%, and 20.0\%, respectively; after using RAG, their accuracies increased to 95.0\%, 100.0\%, and 95.0\%, respectively. The results indicate that AOP-Smart can significantly alleviate the hallucination problem of large language models in AOP knowledge tasks, and greatly improve the accuracy and consistency of their answers.

Chinese Translation

不良结果途径（Adverse Outcome Pathways，AOPs）是毒理学研究和风险评估中的重要知识框架。近年来，大型语言模型（Large Language Models，LLMs）逐渐被应用于AOP相关的问答和机制推理任务。然而，由于存在幻觉问题，即模型可能生成与事实不符或缺乏证据的内容，其可靠性仍然有限。为解决该问题，本研究提出了一种面向AOP的检索增强生成（Retrieval-Augmented Generation，RAG）框架——AOP-Smart。该方法基于AOP-Wiki的官方XML数据，利用关键事件（Key Events，KEs）、关键事件关系（Key Event Relationships，KERs）及特定AOP信息，检索与用户问题相关的知识，从而提升大型语言模型生成结果的可靠性。为评估所提方法的有效性，构建了包含20个AOP相关问答任务的测试集，涵盖KE识别、上下游KE检索及复杂AOP检索任务。实验在三种主流大型语言模型Gemini、DeepSeek和ChatGPT上进行，并在无RAG和有RAG两种设置下进行了对比测试。实验结果表明，未使用RAG时，GPT、DeepSeek和Gemini的准确率分别为15.0%、35.0%和20.0%；使用RAG后，准确率分别提升至95.0%、100.0%和95.0%。结果表明，AOP-Smart能够显著缓解大型语言模型在AOP知识任务中的幻觉问题，极大提升其回答的准确性和一致性。

View on arXiv Download PDF AI Translation

cs.CL / 71 / 2604.10917

HTAA: Enhancing LLM Planning via Hybrid Toolset Agentization & Adaptation

HTAA：通过混合工具集代理化与适配提升大型语言模型的规划能力

Huang, Chengrui, Zhang, Junshuo, Ma, Zhiyuan, Wang, Xikun, Wang, Ximeng, Jiang, Menghua, Zeng, Gang, Han, Zhaobing, Gao, Shen, Shang, Shuo

Abstract

Enabling large language models to scale and reliably use hundreds of tools is critical for real-world applications, yet challenging due to the inefficiency and error accumulation inherent in flat tool-calling architectures. To address this, we propose Hybrid Toolset Agentization & Adaptation (HTAA), a hierarchical framework for scalable tool-use planning. We propose a novel toolset agentization paradigm, which encapsulates frequently co-used tools into specialized agent tools, thereby reducing the planner's action space and mitigating redundancy. To ensure effective coordination, we design Asymmetric Planner Adaptation, a trajectory-based training paradigm that aligns the high-level planner with agent tools via backward reconstruction and forward refinement. To validate the performance of HTAA, we conduct experiments on a real-world internal dataset, InfoVerify, based on the POI validation workflow of China's largest online large-scale ride-hailing platform, featuring long-horizon executable tool trajectories. Experiments on InfoVerify and widely-used benchmarks show that HTAA consistently achieves higher task success rates, requires short tool calling trajectories, and significantly reduces context overhead compared to strong baselines. Furthermore, in a production deployment, HTAA substantially reduces manual validation effort and operational cost, demonstrating its practical efficacy.

Chinese Translation

使大型语言模型能够扩展并可靠地使用数百种工具对于实际应用至关重要，但由于扁平工具调用架构固有的低效和错误累积问题，这一目标面临挑战。为此，我们提出了混合工具集代理化与适配（Hybrid Toolset Agentization & Adaptation，HTAA），一种用于可扩展工具使用规划的分层框架。我们提出了一种新颖的工具集代理化范式，将频繁共用的工具封装为专门的代理工具，从而减少规划器的动作空间并降低冗余。为确保有效协调，我们设计了非对称规划器适配（Asymmetric Planner Adaptation），这是一种基于轨迹的训练范式，通过逆向重构和前向精炼使高层规划器与代理工具对齐。为了验证HTAA的性能，我们在一个基于中国最大在线大规模网约车平台POI验证工作流的真实内部数据集InfoVerify上进行了实验，该数据集具有长时序可执行工具轨迹。InfoVerify及多个广泛使用的基准测试结果表明，HTAA持续实现更高的任务成功率，所需工具调用轨迹更短，并显著降低上下文开销，相较于强基线方法表现优异。此外，在生产部署中，HTAA显著减少了人工验证工作量和运营成本，展示了其实用效能。

View on arXiv Download PDF AI Translation

cs.CL / 72 / 2604.10923

Mem$^2$Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation

Mem$^2$Evolve：通过协同进化的能力扩展与经验蒸馏实现自我进化智能体

Cheng, Zihao, Liu, Zeming, Shan, Yingyu, Wang, Xinyi, Zhu, Xiangrong, Ma, Yunpu, Wang, Hongru, Guo, Yuhang, Lin, Wei, Wang, Yunhong

Abstract

While large language model--powered agents can self-evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co-evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose the \textbf{Mem$^{\textbf{2}}$Evolve}, which integrates two core components: \textbf{Experience Memory} and \textbf{Asset Memory}. Specifically, Mem$^{2}$Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent's capability space while simultaneously acquiring new experience to achieve co-evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem$^{2}$Evolve achieves improvement of 18.53\% over standard LLMs, 11.80\% over agents evolving solely through experience, and 6.46\% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self-evolving agent framework. Code is available at: https://buaa-irip-llm.github.io/Mem2Evolve.

Chinese Translation

虽然基于大型语言模型（LLM）的智能体可以通过积累经验或动态创建新资产（即工具或专家智能体）实现自我进化，但现有框架通常将这两种进化过程孤立对待。这种分离忽视了它们内在的相互依赖性：前者本质上受限于人工预定义的静态工具集，而后者则从零开始生成新资产，缺乏经验指导，导致能力增长有限且进化不稳定。为解决此限制，我们提出了一种协同进化的能力扩展与经验蒸馏新范式。在该范式指导下，我们设计了Mem$^{2}$Evolve，集成了两个核心组件：经验记忆（Experience Memory）和资产记忆（Asset Memory）。具体而言，Mem$^{2}$Evolve利用积累的经验指导资产的动态创建，从而扩展智能体的能力空间，同时同步获取新经验以实现协同进化。在涵盖6类任务和8个基准的广泛实验中，Mem$^{2}$Evolve相较于标准大型语言模型提升了18.53%，相较于仅通过经验进化的智能体提升了11.80%，相较于仅通过资产创建进化的智能体提升了6.46%，证明其为一种更高效且稳定的自我进化智能体框架。代码地址：https://buaa-irip-llm.github.io/Mem2Evolve。

View on arXiv Download PDF AI Translation

cs.CL / 73 / 2604.10968

YIELD: A Large-Scale Dataset and Evaluation Framework for Information Elicitation Agents

YIELD：一个大规模数据集和信息引导代理的评估框架

De Lima, Victor, Yang, Grace Hui

Abstract

Most conversational agents (CAs) are designed to satisfy user needs through user-driven interactions. However, many real-world settings, such as academic interviewing, judicial proceedings, and journalistic investigations, involve broader institutional decision-making processes and require agents that can elicit information from users. In this paper, we introduce Information Elicitation Agents (IEAs) in which the agent's goal is to elicit information from users to support the agent's institutional or task-oriented objectives. To enable systematic research on this setting, we present YIELD, a 26M-token dataset of 2,281 ethically sourced, human-to-human dialogues. Moreover, we formalize information elicitation as a finite-horizon POMDP and propose novel metrics tailored to IEAs. Pilot experiments on multiple foundation LLMs show that training on YIELD improves their alignment with real elicitation behavior and findings are corroborated by human evaluation. We release YIELD under CC BY 4.0. The dataset, project code, evaluation tools, and fine-tuned model adapters are available at: https://github.com/infosenselab/yield.

Chinese Translation

大多数对话代理（CAs）旨在通过用户驱动的交互来满足用户需求。然而，许多现实世界的场景，如学术面试、司法程序和新闻调查，涉及更广泛的机构决策过程，并需要能够从用户那里引导信息的代理。在本文中，我们介绍了信息引导代理（IEAs），其目标是从用户那里引导信息，以支持代理的机构或任务导向目标。为了系统地研究这一设置，我们提出了YIELD，这是一个包含2,281个经过伦理来源的人际对话的2600万标记数据集。此外，我们将信息引导形式化为有限时域部分可观测马尔可夫决策过程（POMDP），并提出了针对IEAs的新指标。在多个基础大型语言模型（LLMs）上的初步实验表明，在YIELD上训练可以提高它们与真实引导行为的一致性，且这一发现得到了人工评估的验证。我们在CC BY 4.0许可下发布YIELD。数据集、项目代码、评估工具和微调模型适配器可在以下网址获取：https://github.com/infosenselab/yield。

View on arXiv Download PDF AI Translation

cs.CL / 74 / 2604.10990

When Verification Fails: How Compositionally Infeasible Claims Escape Rejection

当验证失败：组合不可行的声明如何逃避拒绝

Liu, Muxin, Rao, Delip, Kim, Grace, Callison-Burch, Chris

Abstract

Scientific claim verification, the task of determining whether claims are entailed by scientific evidence, is fundamental to establishing discoveries in evidence while preventing misinformation. This process involves evaluating each asserted constraint against validated evidence. Under the Closed-World Assumption (CWA), a claim is accepted if and only if all asserted constraints are positively supported. We show that existing verification benchmarks cannot distinguish models enforcing this standard from models applying a simpler shortcut called salient-constraint checking, which applies CWA's rejection criterion only to the most salient constraint and accepts when that constraint is supported. Because existing benchmarks construct infeasible claims by perturbing a single salient element they are insufficient at distinguishing between rigorous claim verification and simple salient-constraint reliance. To separate the two, we construct compositionally infeasible claims where the salient constraint is supported but a non-salient constraint is contradicted. Across model families and modalities, models that otherwise saturate existing benchmarks consistently over-accept these claims, confirming the prevalence of such shortcut reasoning. Via model context interventions, we show that different models and prompting strategies occupy distinct positions on a shared ROC curve, indicating that the gap between model families reflects differences in verification threshold rather than underlying reasoning ability, and that the compositional inference bottleneck is a structural property of current verification behavior that strategy guidance alone cannot overcome.

Chinese Translation

科学声明验证，即确定声明是否被科学证据所支持，是在确立发现的证据基础同时防止错误信息传播的基础性任务。该过程涉及将每个断言的约束与已验证的证据进行评估。在闭世界假设（Closed-World Assumption, CWA）下，只有当所有断言的约束均得到正向支持时，声明才被接受。我们表明，现有的验证基准无法区分严格执行该标准的模型与采用一种更简单捷径——显著约束检查（salient-constraint checking）的模型，后者仅对最显著的约束应用CWA的拒绝标准，并在该约束得到支持时即接受声明。由于现有基准通过扰动单一显著元素构造不可行声明，因而不足以区分严格的声明验证与简单依赖显著约束的行为。为区分两者，我们构造了组合不可行的声明，其中显著约束得到支持，但非显著约束被证伪。在不同模型族和模态中，那些在现有基准上表现饱和的模型均倾向于过度接受此类声明，证实了此类捷径推理的普遍存在。通过模型上下文干预，我们发现不同模型及提示策略在共享的ROC曲线上占据不同位置，表明模型族间的差异反映的是验证阈值的不同，而非底层推理能力的差异；且组合推理瓶颈是当前验证行为的结构性特征，单靠策略指导无法克服。

View on arXiv Download PDF AI Translation

cs.CL / 75 / 2604.10996

When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies

当有效信号失效：大型语言模型特征与强化学习交易策略之间的范式边界

Yang, Zhengzhe

Abstract

Can large language models (LLMs) generate continuous numerical features that improve reinforcement learning (RL) trading agents? We build a modular pipeline where a frozen LLM serves as a stateless feature extractor, transforming unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. We introduce an automated prompt-optimization loop that treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient - the Spearman rank correlation between predicted and realized returns - rather than NLP losses. The optimized prompt discovers genuinely predictive features (IC above 0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream task performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise, and the augmented agent under-performs a price-only baseline. In a calmer test regime the agent recovers, yet macroeconomic state variables remain the most robust driver of policy improvement. Our findings highlight a gap between feature-level validity and policy-level robustness that parallels known challenges in transfer learning under distribution shift.

Chinese Translation

大型语言模型（LLMs）能否生成连续的数值特征以提升强化学习（RL）交易代理的表现？我们构建了一个模块化流程，其中一个冻结的LLM作为无状态特征提取器，将非结构化的每日新闻和公告转化为固定维度的向量，供下游的PPO代理使用。我们引入了一个自动化提示优化循环，将提取提示视为离散超参数，直接针对信息系数（IC，即预测收益与实际收益的斯皮尔曼等级相关系数）进行调优，而非传统的自然语言处理损失。优化后的提示发现了真正具有预测能力的特征（在留出数据上的IC超过0.15）。然而，这些有效的中间表示并未自动转化为下游任务的性能提升：在由宏观经济冲击引起的分布转移期间，LLM衍生的特征反而增加了噪声，增强型代理的表现不及仅使用价格的基线模型。在较为平稳的测试环境中，代理表现有所恢复，但宏观经济状态变量依然是策略改进最稳健的驱动因素。我们的研究揭示了特征层面有效性与策略层面鲁棒性之间的差距，这一现象与分布转移下迁移学习面临的已知挑战相呼应。

View on arXiv Download PDF AI Translation

cs.CL / 76 / 2604.11036

Uncertainty-Aware Web-Conditioned Scientific Fact-Checking

基于不确定性感知的网络条件科学事实核查

Vinod, Ashwin, Erk, Katrin

Abstract

Scientific fact-checking is vital for assessing claims in specialized domains such as biomedicine and materials science, yet existing systems often hallucinate or apply inconsistent reasoning, especially when verifying technical, compositional claims against an evidence snippet under source and cost/latency constraints. We present a pipeline centered on atomic predicate-argument decomposition and calibrated, uncertainty-gated corroboration: atomic facts are aligned to local snippets via embeddings, verified by a compact evidence-grounded checker, and only facts with uncertain support trigger domain-restricted web search over authoritative sources. The system supports both binary and tri-valued classification where it predicts labels from Supported, Refuted, NEI for three-way tasks. We evaluate under two regimes, Context-Only (no web) and Context+Web (uncertainty-gated web corroboration); when retrieved evidence conflicts with the provided context, we abstain with NEI rather than overriding the context. On multiple benchmarks, our framework surpasses the strongest benchmarks. In our experiments, web corroboration was invoked for only a minority of atomic facts on average, indicating that external evidence is consulted selectively under calibrated uncertainty rather than routinely. Overall, coupling atomic granularity with calibrated, uncertainty-gated corroboration yields more interpretable and context-conditioned verification, making the approach well-suited to high-stakes, single-document settings that demand traceable rationales, predictable cost/latency, and conservative.

Chinese Translation

科学事实核查对于评估生物医学和材料科学等专业领域的声明至关重要，然而现有系统在验证技术性、组合性声明时，尤其是在受限于信息来源及成本/延迟的情况下，常常出现幻觉或推理不一致的问题。我们提出了一种以原子谓词-论元分解和校准的不确定性门控证据支持为核心的流程：通过嵌入将原子事实与局部证据片段对齐，利用紧凑的基于证据的核查器进行验证，且仅当事实支持存在不确定性时，才触发对权威来源的领域限制网络搜索。该系统支持二元及三值分类，三值任务中预测标签包括支持（Supported）、反驳（Refuted）和无法确定（NEI）。我们在两种模式下进行了评估，即仅上下文（无网络）和上下文+网络（不确定性门控的网络证据支持）；当检索到的证据与提供的上下文冲突时，系统选择以NEI弃权而非覆盖上下文。在多个基准测试中，我们的框架超越了最强的基线方法。实验中，网络证据支持仅在少数原子事实上被调用，表明外部证据是在校准不确定性条件下选择性咨询，而非常规使用。总体而言，将原子粒度与校准的不确定性门控证据支持相结合，实现了更具解释性和上下文条件的验证，使该方法非常适合需要可追溯推理、可预测成本/延迟及保守策略的高风险单文档场景。

View on arXiv Download PDF AI Translation

cs.CL / 77 / 2604.11048

A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities

人格引导对大型语言模型能力影响的系统分析

Chen, Jiaqi, Wang, Ming, Xie, Tingna, Feng, Shi, Liu, Yongkang

Abstract

Imbuing Large Language Models (LLMs) with specific personas is prevalent for tailoring interaction styles, yet the impact on underlying cognitive capabilities remains unexplored. We employ the Neuron-based Personality Trait Induction (NPTI) framework to induce Big Five personality traits in LLMs and evaluate performance across six cognitive benchmarks. Our findings reveal that persona induction produces stable, reproducible shifts in cognitive task performance beyond surface-level stylistic changes. These effects exhibit strong task dependence: certain personalities yield consistent gains on instruction-following, while others impair complex reasoning. Effect magnitude varies systematically by trait dimension, with Openness and Extraversion exerting the most robust influence. Furthermore, LLM effects show 73.68% directional consistency with human personality-cognition relationships. Capitalizing on these regularities, we propose Dynamic Persona Routing (DPR), a lightweight query-adaptive strategy that outperforms the best static persona without additional training.

Chinese Translation

为大型语言模型（LLMs）赋予特定人格已成为定制交互风格的普遍做法，但其对基础认知能力的影响仍未被探讨。我们采用基于神经元的人格特征诱导（Neuron-based Personality Trait Induction, NPTI）框架，在LLMs中诱导五大人格特征，并在六个认知基准上评估其表现。我们的研究发现，人格诱导在认知任务表现上产生了稳定且可重复的变化，超越了表面风格的变化。这些效应表现出强烈的任务依赖性：某些人格在遵循指令方面带来了持续的提升，而其他人格则削弱了复杂推理能力。效应的大小在特征维度上系统性变化，其中开放性（Openness）和外向性（Extraversion）对表现的影响最为显著。此外，LLM的效应与人类人格-认知关系显示出73.68%的方向一致性。基于这些规律，我们提出了动态人格路由（Dynamic Persona Routing, DPR），这是一种轻量级的查询自适应策略，其性能优于最佳静态人格且无需额外训练。

View on arXiv Download PDF AI Translation

cs.CL / 78 / 2604.11050

Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds

小型语言模型间共享情感几何：表征、行为及方法论混淆的跨架构研究

Jeong, Jihoon

Abstract

We extract 21-emotion vector sets from twelve small language models (six architectures x base/instruct, 1B-8B parameters) under a unified comprehension-mode pipeline at fp16 precision, and compare the resulting geometries via representational similarity analysis on raw cosine RDMs. The five mature architectures (Qwen 2.5 1.5B, SmolLM2 1.7B, Llama 3.2 3B, Mistral 7B v0.3, Llama 3.1 8B) share nearly identical 21-emotion geometry, with pairwise RDM Spearman correlations of 0.74-0.92. This universality persists across diametrically opposed behavioral profiles: Qwen 2.5 and Llama 3.2 occupy opposite poles of MTI Compliance facets yet produce nearly identical emotion RDMs (rho = 0.81), so behavioral facet differences arise above the shared emotion representation. Gemma-3 1B base, the one immature case in our dataset, exhibits extreme residual-stream anisotropy (0.997) and is restructured by RLHF across all geometric descriptors, whereas the five already-mature families show within-family base x instruct RDM correlations of rho >= 0.92 (Mistral 7B v0.3 at rho = 0.985), suggesting RLHF restructures only representations that are not yet organized. Methodologically, we show that what prior work has read as a single comprehension-vs-generation method effect in fact decomposes into four distinct layers -- a coarse method-dependent dissociation, robust sub-parameter sensitivity within generation, a true precision (fp16 vs INT8) effect, and a conflated cross-experiment bias that distorts in opposite directions for different models -- so that a single rho between two prior emotion-vector studies is not a safe basis for interpretation without the layered decomposition.

Chinese Translation

我们在统一的理解模式管道下，以fp16精度从十二个小型语言模型（六种架构 x 基础/指令，1B-8B参数）中提取了21个情感向量集，并通过对原始余弦RDM的表征相似性分析比较了结果几何。五种成熟架构（Qwen 2.5 1.5B、SmolLM2 1.7B、Llama 3.2 3B、Mistral 7B v0.3、Llama 3.1 8B）共享几乎相同的21情感几何，成对RDM的Spearman相关系数为0.74-0.92。这种普遍性在截然相反的行为特征中依然存在：Qwen 2.5和Llama 3.2在MTI合规性方面处于对立极端，但产生几乎相同的情感RDM（rho = 0.81），因此行为特征的差异源于共享情感表征之上。Gemma-3 1B基础模型是我们数据集中唯一不成熟的案例，表现出极端的残差流各向异性（0.997），并在所有几何描述符中通过RLHF进行了重构，而五个已经成熟的家族在家族内的基础x指令RDM相关系数为rho >= 0.92（Mistral 7B v0.3的rho = 0.985），这表明RLHF仅重构尚未组织的表征。在方法论上，我们展示了以往研究所解读的单一理解与生成方法效应实际上分解为四个不同层次——一种粗略的依赖方法的解离、生成中的强鲁棒子参数敏感性、真实的精度（fp16与INT8）效应，以及对不同模型在相反方向上扭曲的混合跨实验偏差——因此，两个先前情感向量研究之间的单一rho并不是安全的解释基础，除非进行层次分解。

View on arXiv Download PDF AI Translation

cs.CL / 79 / 2604.11066

ks-pret-5m: a 5 million word, 12 million token kashmiri pretraining dataset

KS-PRET-5M：一个包含500万词、1200万子词的克什米尔语预训练数据集

Malik, Haq Nawaz, Nissar, Nahfid

Abstract

We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.6M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes: digitized archival and literary material, encompassing literature, news, biographies, novels, poetry, religious scholarship, and academic writing, recovered from the proprietary InPage desktop-publishing format using the converter of Malik~\cite{malik2024inpage}, and Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, reducing Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens, substantially higher than prior estimates derived from non-Kashmiri Perso-Arabic analogues. KS-PRET-5M is released as a single continuous text stream under CC~BY~4.0 to support language model pretraining, tokenizer training, and computational linguistic research for Kashmiri.

Chinese Translation

我们推出了KS-PRET-5M，这是目前公开可用的最大克什米尔语预训练数据集，包含5,090,244（约509万）词，27,692,959（约2769万）字符，以及295,433（约29.5万）独特词汇类型。该数据集由两类来源构成：数字化的档案及文学资料，包括文学作品、新闻、传记、小说、诗歌、宗教学术和学术写作，这些内容通过Malik~\cite{malik2024inpage}的转换器从专有的InPage桌面出版格式中恢复；以及从克什米尔语网络资源收集的Unicode原生文本。所有文本均经过十一阶段的清洗流程处理，克什米尔文字比例平均达到0.9965，整个数据集中天城文（Devanagari）污染字符仅为146个。我们采用google/muril-base-cased模型进行了经验性分词，得到每词2.383个子词的分词率，总计约1213万个子词标记，远高于基于非克什米尔波斯-阿拉伯文类比的先前估计。KS-PRET-5M作为单一连续文本流在CC BY 4.0许可下发布，旨在支持克什米尔语的语言模型预训练、分词器训练及计算语言学研究。

View on arXiv Download PDF AI Translation

cs.CL / 80 / 2604.11096

Efficient Training for Cross-lingual Speech Language Models

跨语言语音语言模型的高效训练方法

Zhou, Yan, Fang, Qingkai, Hong, Yun, Feng, Yang

Abstract

Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data and the difficulty in expanding to more languages. In this paper, we introduce Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We propose a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By conducting instruction fine-tuning following a speech-text interleaved chain-of-modality generation process, we enhance modal alignment at a finer granularity, thereby improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without the need for massive speech data, thus exhibiting good language scalability. Evaluations on cross-modal tasks, mono-lingual conversational tasks, and cross-lingual conversational tasks demonstrate CSLM's strong cross-modal alignment capabilities and general task abilities. (Code is available at: https://github.com/ictnlp/CSLM)

Chinese Translation

目前，大型语言模型（LLMs）主要集中于文本模态。为了实现更自然的人机交互，语音大型语言模型（speech LLMs）正在兴起，但由于数据有限且难以扩展到更多语言，构建有效的端到端语音LLMs仍然具有挑战性。本文提出了跨语言语音语言模型（Cross-lingual Speech Language Model，CSLM），这是一种基于离散语音标记的跨语言语音LLMs高效训练方法。我们提出了一种新颖的对齐策略，通过持续预训练实现跨模态和跨语言的对齐。通过在语音-文本交错的模态链生成过程后进行指令微调，我们在更细粒度上增强了模态对齐，从而提升了生成质量并降低了延迟。CSLM能够同时对齐不同模态和语言，无需大量语音数据，表现出良好的语言可扩展性。在跨模态任务、单语言对话任务和跨语言对话任务上的评估表明，CSLM具有强大的跨模态对齐能力和通用任务能力。（代码地址：https://github.com/ictnlp/CSLM）

View on arXiv Download PDF AI Translation

cs.CL / 81 / 2604.11121

BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection

BITS Pilani在SemEval-2026任务9中的表现：结合DPO优化的结构化监督微调用于极化检测

Gupta, Atharva, Kumar, Dhruv, Sinha, Yash

Abstract

The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting political polarization in social media text that combines structured supervised fine-tuning with Direct Preference Optimization (DPO) refinement. We fine-tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Experiments on the SemEval 2026 POLAR shared task dataset show that preference-based refinement improves both accuracy and decreases false negatives without extra annotation. On the English development set, DPO increases recall from 0.5085 to 0.7797 and improves macro-F1 by ~5 points.

Chinese Translation

POLAR SemEval-2026共享任务旨在检测在线极化，重点关注多语言、多文化和多事件极化的分类与识别。由于微妙的修辞、隐含的框架以及人工标注的高成本，准确的在线极化计算检测面临挑战。基于近期研究发现的上下文提示能够使大型语言模型作为强大的极化检测器，我们提出了一种两阶段的方法，用于检测社交媒体文本中的政治极化，该方法结合了结构化监督微调与直接偏好优化（Direct Preference Optimization, DPO）优化。我们使用可解释的槽填充模板（目标、主张类型、表现清单和理由）对Qwen 2.5-7B-Instruct进行LoRA微调。然后，我们应用DPO与自动生成的偏好对，以减少代价高昂的假阴性。在SemEval 2026 POLAR共享任务数据集上的实验表明，基于偏好的优化提高了准确性并减少了假阴性，而无需额外的标注。在英语开发集中，DPO将召回率从0.5085提高到0.7797，并使宏观F1提高约5个百分点。

View on arXiv Download PDF AI Translation

cs.CL / 82 / 2604.11129

DeCoVec: Building Decoding Space based Task Vector for Large Language Models via In-Context Learning

DeCoVec：通过上下文学习构建基于解码空间的任务向量以引导大型语言模型

Li, Feiyang, Wang, Yile

Abstract

Task vectors, representing directions in model or activation spaces that encode task-specific behaviors, have emerged as a promising tool for steering large language models (LLMs). However, existing approaches typically require fine-tuning or invasive manipulation of internal states, limiting their flexibility and scalability. We propose \textsc{DeCoVec} (Decoding Space based Task Vector), a training-free and non-invasive framework that constructs task vectors directly in the \textit{decoding space} by leveraging in-context learning (ICL). Specifically, \textsc{DeCoVec} captures the task essence as the difference between the output logit distributions of few-shot and zero-shot prompts, then steers generation by injecting this vector into the decoding process. Experiments across seven LLMs (0.5B--9B) on TruthfulQA, Math-500, and AQUA-RAT show that \textsc{DeCoVec} consistently outperforms standard few-shot baselines, with gains up to +5.50 average accuracy. Further analysis demonstrates that \textsc{DeCoVec} effectively suppresses generation degeneration and logical flaws while exhibiting strong robustness to demonstration ordering, all without incurring additional input token costs. Our method offers a training-free and non-invasive solution for LLM steering without requiring weight updates or auxiliary models.

Chinese Translation

任务向量代表模型或激活空间中的方向，编码特定任务的行为，已成为引导大型语言模型（LLMs）的有前景工具。然而，现有的方法通常需要微调或对内部状态进行侵入性操作，限制了它们的灵活性和可扩展性。我们提出了 extsc{DeCoVec}（基于解码空间的任务向量），这是一个无训练和非侵入性的框架，通过利用上下文学习（ICL）直接在 extit{解码空间} 中构建任务向量。具体而言， extsc{DeCoVec} 将任务本质捕捉为少量示例和零示例提示的输出对数分布之间的差异，然后通过将该向量注入解码过程来引导生成。在 TruthfulQA、Math-500 和 AQUA-RAT 上对七个 LLM（0.5B–9B）的实验表明， extsc{DeCoVec} 始终优于标准的少量示例基线，平均准确率提高了 +5.50。进一步分析表明， extsc{DeCoVec} 有效抑制了生成退化和逻辑缺陷，同时对示例顺序表现出强大的鲁棒性，且没有增加额外的输入令牌成本。我们的方法提供了一种无训练和非侵入性的 LLM 引导解决方案，无需权重更新或辅助模型。

View on arXiv Download PDF AI Translation

cs.CL / 83 / 2604.11133

How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts

大型语言模型在临床数值能力上的稳健性如何？关于临床环境中数值推理能力的实证研究

Nguyen, Minh-Vuong, Shiri, Fatemeh, Li, Zhuang, Verspoor, Karin

Abstract

Large Language Models (LLMs) are increasingly being explored for clinical question answering and decision support, yet safe deployment critically requires reliable handling of patient measurements in heterogeneous clinical notes. Existing evaluations of LLMs for clinical numerical reasoning provide limited operation-level coverage, restricted primarily to arithmetic computation, and rarely assess the robustness of numerical understanding across clinical note formats. We introduce ClinicNumRobBench, a benchmark of 1,624 context-question instances with ground-truth answers that evaluates four main types of clinical numeracy: value retrieval, arithmetic computation, relational comparison, and aggregation. To stress-test robustness, ClinicNumRobBench presents longitudinal MIMIC-IV vital-sign records in three semantically equivalent representations, including a real-world note-style variant derived from the Open Patients dataset, and instantiates queries using 42 question templates. Experiments on 14 LLMs show that value retrieval is generally strong, with most models exceeding 85% accuracy, while relational comparison and aggregation remain challenging, with some models scoring below 15%. Fine-tuning on medical data can reduce numeracy relative to base models by over 30%, and performance drops under note-style variation indicate LLM sensitivity to format. ClinicNumRobBench offers a rigorous testbed for clinically reliable numerical reasoning. Code and data URL are available on https://github.com/MinhVuong2000/ClinicNumRobBench.

Chinese Translation

大型语言模型（LLMs）越来越多地被用于临床问答和决策支持，但安全部署关键在于可靠处理异构临床笔记中的患者测量数据。现有的针对LLMs在临床数值推理方面的评估覆盖面有限，主要集中在算术计算上，且很少评估不同临床笔记格式下数值理解的稳健性。我们引入了ClinicNumRobBench，这是一个包含1,624个上下文-问题实例及其真实答案的基准，评估四种主要的临床数值能力：值检索、算术计算、关系比较和聚合。为了进行稳健性压力测试，ClinicNumRobBench以三种语义等价的表现形式呈现纵向MIMIC-IV生命体征记录，包括从Open Patients数据集中衍生的真实世界笔记风格变体，并使用42个问题模板实例化查询。对14个LLMs的实验表明，值检索通常表现良好，大多数模型的准确率超过85%，而关系比较和聚合仍然具有挑战性，部分模型得分低于15%。在医学数据上进行微调可能使数值能力相对于基础模型降低超过30%，而在笔记风格变体下的性能下降表明LLMs对格式的敏感性。ClinicNumRobBench为临床可靠的数值推理提供了一个严格的测试平台。代码和数据网址可在 https://github.com/MinhVuong2000/ClinicNumRobBench 获取。

View on arXiv Download PDF AI Translation

cs.CL / 84 / 2604.11152

SHARE: Social-Humanities AI for Research and Education

SHARE：面向社会科学与人文学科的人工智能研究与教育平台

Gonçalves, João, de Jager, Sonia, Knoth, Petr, Pride, David, Jelicic, Nick

Abstract

This intermediate technical report introduces the SHARE family of base models and the MIRROR user interface. The SHARE models are the first causal language models fully pretrained by and for the social sciences and humanities (SSH). Their performance in modelling SSH texts is close to that of general purpose models (Phi-4) which use 100 times more tokens, as shown by our custom SSH Cloze benchmark. The MIRROR user interface is designed for reviewing text inputs from the SSH disciplines while preserving critical engagement. By prototyping a generative AI interface that does not generate any text, we propose a way to harness the capabilities of the SHARE models without compromising the integrity of SSH principles and norms.

Chinese Translation

本中期技术报告介绍了SHARE系列基础模型及MIRROR用户界面。SHARE模型是首批完全由社会科学与人文学科（SSH）领域预训练并专为该领域设计的因果语言模型。通过我们定制的SSH完形填空基准测试，SHARE模型在建模SSH文本方面的表现接近使用100倍更多语料的通用模型（Phi-4）。MIRROR用户界面旨在审阅来自SSH学科的文本输入，同时保持批判性参与。通过原型设计一种不生成任何文本的生成式AI界面，我们提出了一种在不损害SSH原则和规范完整性的前提下，利用SHARE模型能力的方法。

View on arXiv Download PDF AI Translation

cs.CL / 85 / 2604.11182

Evaluating Memory Capability in Continuous Lifelog Scenario

在连续生活记录场景中评估记忆能力

Zheng, Jianjie, Liu, Zhichen, Shen, Zhanyu, Qu, Jingxiang, Chen, Guanhua, Wang, Yile, Xu, Yang, Liu, Yang, Cheng, Sijie

Abstract

Nowadays, wearable devices can continuously lifelog ambient conversations, creating substantial opportunities for memory systems. However, existing benchmarks primarily focus on online one-on-one chatting or human-AI interactions, thus neglecting the unique demands of real-world scenarios. Given the scarcity of public lifelogging audio datasets, we propose a hierarchical synthesis framework to curate \textbf{\textsc{LifeDialBench}}, a novel benchmark comprising two complementary subsets: \textbf{EgoMem}, built on real-world egocentric videos, and \textbf{LifeMem}, constructed using simulated virtual community. Crucially, to address the issue of temporal leakage in traditional offline settings, we propose an \textbf{Online Evaluation} protocol that strictly adheres to temporal causality, ensuring systems are evaluated in a realistic streaming fashion. Our experimental results reveal a counterintuitive finding: current sophisticated memory systems fail to outperform a simple RAG-based baseline. This highlights the detrimental impact of over-designed structures and lossy compression in current approaches, emphasizing the necessity of high-fidelity context preservation for lifelog scenarios. We release our code and data at https://github.com/qys77714/LifeDialBench.

Chinese Translation

如今，穿戴设备可以持续记录环境对话，为记忆系统创造了巨大的机会。然而，现有的基准主要集中在在线一对一聊天或人机交互上，忽视了现实场景的独特需求。鉴于公共生活记录音频数据集的稀缺，我们提出了一个分层合成框架，以策划 extbf{ extsc{LifeDialBench}}，这是一个新颖的基准，包含两个互补的子集： extbf{EgoMem}，基于真实的自我中心视频构建，以及 extbf{LifeMem}，使用模拟虚拟社区构建。关键是，为了解决传统离线设置中的时间泄漏问题，我们提出了一种 extbf{在线评估} 协议，严格遵循时间因果关系，确保系统以现实流媒体的方式进行评估。我们的实验结果揭示了一个反直觉的发现：当前复杂的记忆系统未能超越简单的基于 RAG 的基线。这突显了当前方法中过度设计结构和有损压缩的有害影响，强调了在生活记录场景中高保真上下文保留的必要性。我们在 https://github.com/qys77714/LifeDialBench 发布了我们的代码和数据。

View on arXiv Download PDF AI Translation

cs.CL / 86 / 2604.11188

MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis

MathAgent：用于数学推理数据合成的约束图对抗演化方法

Yu, Zixiong, Rao, Jun, Chen, Guhan, Tian, Songtao, Li, Bohan, Wei, Jiansheng, Zhang, Min, Meng, Xiaojun

Abstract

Synthesizing high-quality mathematical reasoning data without human priors remains a significant challenge. Current approaches typically rely on seed data mutation or simple prompt engineering, often suffering from mode collapse and limited logical complexity. This paper proposes a hierarchical synthesis framework that formulates data synthesis as an unsupervised optimization problem over a constraint graph followed by semantic instantiation, rather than treating it as a direct text generation task. We introduce a Legislator-Executor paradigm: The Legislator adversarially evolves structured generation blueprints encoding the constraints of the problem, while the Executor instantiates these specifications into diverse natural language scenarios. This decoupling of skeleton design from linguistic realization enables a prioritized focus on constructing complex and diverse logical structures, thereby guiding high-quality data synthesis. Experiments conducted on a total of 10 models across the Qwen, Llama, Mistral, and Gemma series demonstrate that our method achieves notable results: models fine-tuned on 1K synthesized samples outperform widely-used datasets of comparable scale (LIMO, s1K) across eight mathematical benchmarks, exhibiting superior out-of-distribution generalization.

Chinese Translation

在缺乏人工先验的情况下合成高质量的数学推理数据仍然是一项重大挑战。当前方法通常依赖于种子数据变异或简单的提示工程，常常面临模式崩溃和逻辑复杂度有限的问题。本文提出了一种分层合成框架，将数据合成视为约束图上的无监督优化问题，随后进行语义实例化，而非将其作为直接的文本生成任务。我们引入了Legislator-Executor范式：Legislator对编码问题约束的结构化生成蓝图进行对抗演化，Executor则将这些规范实例化为多样的自然语言场景。这种骨架设计与语言实现的解耦使得能够优先构建复杂且多样的逻辑结构，从而引导高质量的数据合成。在Qwen、Llama、Mistral和Gemma系列共计10个模型上的实验表明，我们的方法取得了显著成果：在1K合成样本上微调的模型，在八个数学基准测试中均优于规模相当的广泛使用数据集（LIMO，s1K），表现出更优越的分布外泛化能力。

View on arXiv Download PDF AI Translation

cs.CL / 87 / 2604.11193

TRACE: An Experiential Framework for Coherent Multi-hop Knowledge Graph Question Answering

TRACE：一种连贯的多跳知识图谱问答体验框架

Wang, Yingxu, Huang, Jiaxin, Wang, Mengzhu, Yin, Nan

Abstract

Multi-hop Knowledge Graph Question Answering (KGQA) requires coherent reasoning across relational paths, yet existing methods often treat each reasoning step independently and fail to effectively leverage experience from prior explorations, leading to fragmented reasoning and redundant exploration. To address these challenges, we propose Trajectoryaware Reasoning with Adaptive Context and Exploration priors (TRACE), an experiential framework that unifies LLM-driven contextual reasoning with exploration prior integration to enhance the coherence and robustness of multihop KGQA. Specifically, TRACE dynamically translates evolving reasoning paths into natural language narratives to maintain semantic continuity, while abstracting prior exploration trajectories into reusable experiential priors that capture recurring exploration patterns. A dualfeedback re-ranking mechanism further integrates contextual narratives with exploration priors to guide relation selection during reasoning. Extensive experiments on multiple KGQA benchmarks demonstrate that TRACE consistently outperforms state-of-the-art baselines.

Chinese Translation

多跳知识图谱问答（KGQA）需要在关系路径上进行连贯的推理，但现有方法往往将每个推理步骤独立对待，未能有效利用先前探索的经验，导致推理碎片化和冗余探索。为了解决这些挑战，我们提出了具有轨迹感知的自适应上下文和探索先验的推理框架（TRACE），该框架将基于大语言模型（LLM）的上下文推理与探索先验整合，以增强多跳KGQA的连贯性和鲁棒性。具体而言，TRACE动态地将不断演变的推理路径转化为自然语言叙述，以保持语义连续性，同时将先前的探索轨迹抽象为可重用的经验先验，以捕捉重复的探索模式。双重反馈重排序机制进一步将上下文叙述与探索先验结合，以指导推理过程中的关系选择。在多个KGQA基准上的广泛实验表明，TRACE始终优于最先进的基线方法。

View on arXiv Download PDF AI Translation

cs.CL / 88 / 2604.11201

CocoaBench: Evaluating Unified Digital Agents in the Wild

CocoaBench：在实际环境中评估统一数字代理

CocoaBench Team, Hao, Shibo, Zhang, Zhining, Liang, Zhiqi, Liu, Tianyang, Zha, Yuheng, Gao, Qiyue, Chen, Jixuan, Wang, Zilong, Cheng, Zhoujun, Zhang, Haoxiang, Wang, Junli, Jin, Hexi, Zheng, Boyuan, Zhou, Kun, Wang, Yu, Yao, Feng, Liu, Licheng, Li, Yijiang, Li, Zhifei, Han, Zhengtao, Promthaw, Pracha, Cerruti, Tommaso, Fu, Xiaohan, Ma, Ziqiao, Shang, Jingbo, Qin, Lianhui, McAuley, Julian, Xing, Eric P., Liu, Zhengzhong, Srivastava, Rupesh Kumar, Hu, Zhiting

Abstract

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.

Chinese Translation

大型语言模型（LLM）代理在软件工程、深度研究、图形用户界面自动化及其他多种应用中表现出色，而最近的代理框架和模型正日益将这些能力整合为统一系统。然而，大多数评估仍然在孤立的情况下测试这些能力，这导致了一个缺口，无法涵盖需要代理结合不同能力的多样化用例。我们介绍了CocoaBench，这是一个针对统一数字代理的基准，基于人类设计的长期任务，这些任务需要灵活组合视觉、搜索和编码能力。任务仅通过指令和对最终输出的自动评估函数进行指定，从而实现了在多样化代理基础设施中可靠且可扩展的评估。我们还提出了CocoaAgent，这是一个轻量级的共享框架，用于在模型骨干之间进行受控比较。实验表明，当前代理在CocoaBench上的表现仍然远未可靠，最佳评估系统的成功率仅为45.1%。我们的分析进一步指出，在推理与规划、工具使用与执行以及视觉基础方面仍有显著的改进空间。

View on arXiv Download PDF AI Translation

cs.CL / 89 / 2604.11209

Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method

探索知识冲突以实现可信的大型语言模型推理：基准与方法

Zhao, Tianzhe, Chen, Jiaoyan, Zhang, Shuxiu, Zhu, Haiping, Lin, Qika, Liu, Jun

Abstract

Large language models (LLMs) have achieved remarkable success across a wide range of applications especially when augmented by external knowledge through retrieval-augmented generation (RAG). Despite their widespread adoption, recent studies have shown that LLMs often struggle to perform faithful reasoning when conflicting knowledge is retrieved. However, existing work primarily focuses on conflicts between external knowledge and the parametric knowledge of LLMs, leaving conflicts across external knowledge largely unexplored. Meanwhile, modern RAG systems increasingly emphasize the integration of unstructured text and (semi-)structured data like knowledge graphs (KGs) to improve knowledge completeness and reasoning faithfulness. To address this gap, we introduce ConflictQA, a novel benchmark that systematically instantiates conflicts between textual evidence and KG evidence. Extensive evaluations across representative LLMs reveal that, facing such cross-source conflicts, LLMs often fail to identify reliable evidence for correct reasoning. Instead, LLMs become more sensitive to prompting choices and tend to rely exclusively on either KG or textual evidence, resulting in incorrect responses. Based on these findings, we further propose XoT, a two-stage explanation-based thinking framework tailored for reasoning over heterogeneous conflicting evidence, and verify its effectiveness with extensive experiments.

Chinese Translation

大型语言模型（LLMs）在广泛的应用中取得了显著成功，尤其是在通过检索增强生成（RAG）引入外部知识的情况下。尽管被广泛采用，近期研究表明，当检索到的知识存在冲突时，LLMs往往难以进行可信的推理。然而，现有工作主要关注外部知识与LLMs参数化知识之间的冲突，较少涉及外部知识之间的冲突。同时，现代RAG系统日益重视非结构化文本与（半）结构化数据如知识图谱（KGs）的整合，以提升知识的完整性和推理的可信度。为填补这一空白，我们提出了ConflictQA，一种系统化体现文本证据与知识图谱证据冲突的新型基准。在代表性LLMs上的广泛评估表明，面对此类跨源冲突，LLMs常常无法识别可靠证据以支持正确推理。相反，LLMs对提示选择变得更加敏感，且倾向于单一依赖知识图谱或文本证据，导致回答错误。基于这些发现，我们进一步提出了XoT，一种针对异构冲突证据推理设计的两阶段基于解释的思考框架，并通过大量实验验证了其有效性。

View on arXiv Download PDF AI Translation

cs.CL / 90 / 2604.11214

HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning

HiEdit：基于层次强化学习的终身模型编辑方法

Wang, Yangfan, Sun, Tianyang, Tang, Chen, Liu, Jie, Cai, Wei, Jiang, Jingchi

Abstract

Lifelong model editing (LME) aims to sequentially rectify outdated or inaccurate knowledge in deployed LLMs while minimizing side effects on unrelated inputs. However, existing approaches typically apply parameter perturbations to a static and dense set of LLM layers for all editing instances. This practice is counter-intuitive, as we hypothesize that different pieces of knowledge are stored in distinct layers of the model. Neglecting this layer-wise specificity can impede adaptability in integrating new knowledge and result in catastrophic forgetting for both general and previously edited knowledge. To address this, we propose HiEdit, a hierarchical reinforcement learning framework that adaptively identifies the most knowledge-relevant layers for each editing instance. By enabling dynamic, instance-aware layer selection and incorporating an intrinsic reward for sparsity, HiEdit achieves precise, localized updates. Experiments on various LLMs show that HiEdit boosts the performance of the competitive RLEdit by an average of 8.48% with perturbing only half of the layers per edit. Our code is available at: https://github.com/yangfanww/hiedit.

Chinese Translation

终身模型编辑（Lifelong Model Editing，LME）旨在连续修正已部署大型语言模型（LLMs）中陈旧或不准确的知识，同时最大限度地减少对无关输入的副作用。然而，现有方法通常对所有编辑实例在静态且密集的LLM层集合上施加参数扰动。这种做法与直觉相悖，因为我们假设不同的知识片段存储在模型的不同层中。忽视这种层级特异性会阻碍新知识的适应性整合，并导致对通用知识及先前编辑知识的灾难性遗忘。为此，我们提出了HiEdit，一种层次强化学习框架，能够自适应地识别每个编辑实例中最相关的知识层。通过实现动态的、实例感知的层选择并引入稀疏性的内在奖励，HiEdit实现了精确且局部的更新。在多个LLM上的实验表明，HiEdit在每次编辑仅扰动一半层的情况下，平均提升了竞争方法RLEdit 8.48%的性能。我们的代码已开源，地址为：https://github.com/yangfanww/hiedit。

View on arXiv Download PDF AI Translation

cs.CL / 91 / 2604.11233

RUMLEM: A Dictionary-Based Lemmatizer for Romansh

RUMLEM：一种基于词典的罗曼什语词形还原器

Fischer, Dominic P., Hopton, Zachary, Vamvas, Jannis

Abstract

Lemmatization -- the task of mapping an inflected word form to its dictionary form -- is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77-84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30'000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.

Chinese Translation

词形还原——将屈折词形映射到其词典形式的任务——是许多自然语言处理应用中的关键组成部分。本文介绍了RUMLEM，一种覆盖罗曼什语五个主要方言以及超区域标准变体Rumantsch Grischun的词形还原器。该系统基于罗曼什语社区驱动的综合形态数据库，使RUMLEM能够覆盖典型罗曼什语文本中77%至84%的词汇。由于每个罗曼什语方言均有专门的数据库，RUMLEM的另一应用是具备方言识别能力的语言分类。在对3万篇不同长度的罗曼什语文本进行评估时，RUMLEM在95%的情况下正确识别了文本的方言。此外，概念验证实验表明，基于该词形还原器实现罗曼什语与非罗曼什语的语言分类是可行的。

View on arXiv Download PDF AI Translation

cs.CL / 92 / 2604.11246

Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers

如同人类考官般评判：面向长文本生成任务的加权重要性多点评估框架

Yu, Guoxin, Zhou, Chulun, Liu, Lemao, Wang, Qi, Yu, Mo, Tang, Jialong, Yang, Baosong, Ao, Xiang, Lam, Wao, Yu, Yue

Abstract

Evaluating the quality of model responses remains challenging in generative tasks with long-form answers, as the expected answers usually contain multiple semantically distinct yet complementary factors that should be factorized for fine-grained assessment. Recent evaluation methods resort to relying on either task-level rubrics or question-aware checklists. However, they still 1) struggle to assess whether a response is genuinely grounded in provided contexts; 2) fail to capture the heterogeneous importance of different aspects of reference answers. Inspired by human examiners, we propose a Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted context-bound scoring points. Two complementary metrics, namely Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), are designed to measure the alignment and contradiction between model responses and reference answers. Extensive experiments on 10 generative tasks demonstrate that WIMPE achieves higher correlations with human annotations.

Chinese Translation

在长文本生成任务中，评估模型回答的质量依然具有挑战性，因为预期答案通常包含多个语义上不同但互补的因素，这些因素需要被分解以实现细粒度的评估。现有的评估方法通常依赖任务级评分标准或问题感知的检查表。然而，这些方法仍存在以下不足：1）难以评估回答是否真正基于提供的上下文；2）未能捕捉参考答案中不同方面的重要性差异。受人类考官启发，我们提出了加权重要性多点评估（Weighted Importance Multi-Point Evaluation，WIMPE）框架，将每个参考答案分解为加权的上下文绑定评分点。设计了两种互补指标，即加权点对齐（Weighted Point-wise Alignment，WPA）和点冲突惩罚（Point-wise Conflict Penalty，PCP），用于衡量模型回答与参考答案之间的一致性与矛盾。基于10个生成任务的广泛实验表明，WIMPE与人工标注的相关性更高。

View on arXiv Download PDF AI Translation

cs.CL / 93 / 2604.11258

Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate

Dialectic-Med：通过反事实对抗多智能体辩论缓解诊断幻觉

Lu, Zhixiang, Su, Jionglong

Abstract

Multimodal Large Language Models (MLLMs) in healthcare suffer from severe confirmation bias, often hallucinating visual details to support initial, potentially erroneous diagnostic hypotheses. Existing Chain-of-Thought (CoT) approaches lack intrinsic correction mechanisms, rendering them vulnerable to error propagation. To bridge this gap, we propose Dialectic-Med, a multi-agent framework that enforces diagnostic rigor through adversarial dialectics. Unlike static consensus models, Dialectic-Med orchestrates a dynamic interplay between three role-specialized agents: a proponent that formulates diagnostic hypotheses; an opponent equipped with a novel visual falsification module that actively retrieves contradictory visual evidence to challenge the Proponent; and a mediator that resolves conflicts via a weighted consensus graph. By explicitly modeling the cognitive process of falsification, our framework guarantees that diagnostic reasoning is tightly grounded in verified visual regions. Empirical evaluations on MIMIC-CXR-VQA, VQA-RAD, and PathVQA demonstrate that Dialectic-Med not only achieves state-of-the-art performance but also fundamentally enhances the trustworthiness of the reasoning process. Beyond accuracy, our approach significantly enhances explanation faithfulness and decisively mitigates hallucinations, establishing a new standard over single-agent baselines.

Chinese Translation

医疗领域的多模态大型语言模型（MLLMs）存在严重的确认偏差，常常为了支持初步且可能错误的诊断假设而产生视觉细节幻觉。现有的链式思维（Chain-of-Thought, CoT）方法缺乏内在的纠正机制，易导致错误传播。为弥补这一不足，我们提出了Dialectic-Med，这是一种通过对抗辩证法强化诊断严谨性的多智能体框架。与静态共识模型不同，Dialectic-Med协调三个角色专长的智能体之间的动态互动：一名提出诊断假设的支持者（Proponent）；一名配备新颖视觉伪证模块、主动检索矛盾视觉证据以挑战支持者的反对者（Opponent）；以及一名通过加权共识图解决冲突的调解者（Mediator）。通过显式建模伪证的认知过程，本框架确保诊断推理紧密基于已验证的视觉区域。基于MIMIC-CXR-VQA、VQA-RAD和PathVQA数据集的实证评估表明，Dialectic-Med不仅实现了最先进的性能，还从根本上提升了推理过程的可信度。除了准确性外，我们的方法显著增强了解释的忠实度，并有效缓解了幻觉问题，树立了超越单智能体基线的新标杆。

View on arXiv Download PDF AI Translation

cs.CL / 94 / 2604.11288

Transactional Attention: Semantic Sponsorship for KV-Cache Retention

事务注意力：用于 KV 缓存保留的语义赞助

Basu, Abhinaba

Abstract

At K=16 tokens (0.4% of a 4K context), every existing KV-cache compression method achieves 0% on credential retrieval. The failure mode is dormant tokens: credentials, API keys, and configuration values that receive near-zero attention but become essential at generation time. Because these tokens lack the statistical signals that eviction policies rely on, no method based on attention scores, reconstruction loss, or learned retention gates retains them. We introduce Transactional Attention (TA), a sponsorship mechanism in which structural anchor patterns (e.g., "key:", "password:") protect adjacent value-bearing tokens from eviction. TA achieves 100% credential retrieval at K=16 where six baselines (H2O, TOVA, SnapKV, StreamingLLM, PyramidKV, DynamicKV) achieve 0%, and sustains 100% accuracy across 200 function-calling trials. TA-Fast, an attention-free variant, reduces memory overhead by 52% and is compatible with SDPA and FlashAttention. TA is orthogonal to existing compression methods and adds less than 1% latency overhead.

Chinese Translation

在 K=16 个标记（占 4K 上下文的 0.4%）时，所有现有的 KV 缓存压缩方法在凭证检索上均达到 0%。失败模式是休眠标记：凭证、API 密钥和接近零关注度的配置值，但在生成时变得至关重要。由于这些标记缺乏驱逐策略所依赖的统计信号，因此基于注意力分数、重构损失或学习保留门的任何方法都无法保留它们。我们提出了事务注意力（Transactional Attention, TA），这是一种赞助机制，其中结构性锚点模式（例如“key:”、“password:”）保护相邻的值承载标记不被驱逐。在 K=16 时，TA 实现了 100% 的凭证检索，而六个基线方法（H2O、TOVA、SnapKV、StreamingLLM、PyramidKV、DynamicKV）均达到 0%，并在 200 次函数调用试验中保持 100% 的准确率。TA-Fast 是一种无注意力变体，减少了 52% 的内存开销，并与 SDPA 和 FlashAttention 兼容。TA 与现有的压缩方法是正交的，并且增加的延迟开销不到 1%。

View on arXiv Download PDF AI Translation

cs.CL / 95 / 2604.11290

Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

多语种教师：评估语言模型在多语言合成数据生成中的应用

Miranda, Lester James V., Vulić, Ivan, Korhonen, Anna

Abstract

Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We measure intrinsic measures of data quality with extrinsic student model performance in a metric we call Polyglot Score; evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different student base model families. Further analyses reveal that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, length, and response fluency capture over 93.3% of variance in intrinsic data quality and predict student performance. Finally, we provide practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope that our work advances data-centric research in multilingual synthetic data and LM development.

Chinese Translation

从语言模型（LMs）合成监督微调（SFT）数据以教授较小模型多语言任务的做法日益普遍。然而，教师模型的选择往往是临时的，通常默认选择可用的最大选项，尽管这些模型在非英语语言方面可能存在显著的能力差距。这种做法可能导致合成数据质量差和学生下游性能不佳。在本研究中，我们系统地描述了有效多语言教师的特征。我们通过一种称为多语种评分（Polyglot Score）的指标，测量数据质量的内在指标与学生模型性能的外在指标；评估了10个语言模型在6种类型多样的语言中的表现，生成了超过140万个SFT示例，并训练了240个学生模型。在测试的模型中，Gemma 3 27B和Aya Expanse 32B在不同的学生基础模型家族中表现出一致的有效性。进一步分析表明，仅凭模型规模并不能显著预测教师的有效性；相反，数据质量的特征，如提示多样性、长度和响应流畅性，捕获了内在数据质量中超过93.3%的方差，并预测学生性能。最后，我们提供了一些实际建议，包括匹配教师-学生对的模型家族以及从现有提示中翻译或响应，这可以为资源较少的语言带来改进。我们希望我们的研究能够推动多语言合成数据和语言模型开发中的数据中心研究。

View on arXiv Download PDF AI Translation

cs.CL / 96 / 2604.11299

Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning

通过字形驱动微调增强多模态大型语言模型对古代汉字演变分析的能力

Song, Rui, Shi, Lida, Qi, Ruihua, Li, Yingji, Xu, Hao

Abstract

In recent years, rapid advances in Multimodal Large Language Models (MLLMs) have increasingly stimulated research on ancient Chinese scripts. As the evolution of written characters constitutes a fundamental pathway for understanding cultural transformation and historical continuity, how MLLMs can be systematically leveraged to support and advance text evolution analysis remains an open and largely underexplored problem. To bridge this gap, we construct a comprehensive benchmark comprising 11 tasks and over 130,000 instances, specifically designed to evaluate the capability of MLLMs in analyzing the evolution of ancient Chinese scripts. We conduct extensive evaluations across multiple widely used MLLMs and observe that, while existing models demonstrate a limited ability in glyph-level comparison, their performance on core tasks-such as character recognition and evolutionary reasoning-remains substantially constrained. Motivated by these findings, we propose a glyph-driven fine-tuning framework (GEVO) that explicitly encourages models to capture evolutionary consistency in glyph transformations and enhances their understanding of text evolution. Experimental results show that even models at the 2B scale achieve consistent and comprehensive performance improvements across all evaluated tasks. To facilitate future research, we publicly release both the benchmark and the trained models\footnote{https://github.com/songruiecho/GEVO}.

Chinese Translation

近年来，多模态大型语言模型（MLLMs）的快速发展日益激发了对古代汉字的研究。书写字符的演变构成了理解文化变迁和历史延续的基本途径，如何系统性地利用MLLMs来支持和推进文本演变分析仍然是一个未被充分探索的问题。为了解决这一问题，我们构建了一个包含11个任务和超过130,000个实例的综合基准，专门用于评估MLLMs在分析古代汉字演变方面的能力。我们对多个广泛使用的MLLMs进行了广泛评估，观察到虽然现有模型在字形级别比较方面表现有限，但在核心任务（如字符识别和演变推理）上的表现仍然受到显著限制。基于这些发现，我们提出了一种字形驱动的微调框架（GEVO），该框架明确鼓励模型捕捉字形变换中的演变一致性，并增强其对文本演变的理解。实验结果表明，即使是2B规模的模型在所有评估任务上也实现了一致且全面的性能提升。为了促进未来的研究，我们公开发布了基准和训练模型。

View on arXiv Download PDF AI Translation

cs.CL / 97 / 2604.11322

Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations

大型语言模型是否了解工具无关性？揭示工具调用中的结构对齐偏差

Liu, Yilong, Lin, Xixun, Cao, Pengfei, Zhang, Ge, Fang, Fang, Cao, Yanan

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in utilizing external tools. In practice, however, LLMs are often exposed to tools that are irrelevant to the user's query, in which case the desired behavior is to refrain from invocations. In this work, we identify a widespread yet overlooked mechanistic flaw in tool refusal, which we term structural alignment bias: Even when a tool fails to serve the user's goal, LLMs still tend to invoke it whenever query attributes can be validly assigned to tool parameters. To systematically study this bias, we introduce SABEval, a new dataset that decouples structural alignment from semantic relevance. Our analysis shows that structural alignment bias induces severe tool-invocation errors in LLMs, yet remains largely unaccounted for in existing evaluations. To investigate the internal mechanisms underlying this bias, we propose Contrastive Attention Attribution, which reveals two competing pathways for semantic checking and structural matching. The relative strength of these pathways drives LLMs' tool invocation decisions. Based on these findings, we further introduce a rebalancing strategy that effectively mitigates structural alignment bias, as demonstrated by extensive experiments, without degrading general tool-use capabilities.

Chinese Translation

大型语言模型（LLMs）在利用外部工具方面展现了令人印象深刻的能力。然而，在实际应用中，LLMs 常常会接触到与用户查询无关的工具，此时期望的行为是避免调用。在本研究中，我们识别出一种普遍存在但被忽视的工具拒绝机制缺陷，我们称之为结构对齐偏差：即使工具未能满足用户的目标，LLMs 仍然倾向于在查询属性可以有效分配给工具参数时调用该工具。为了系统地研究这种偏差，我们引入了 SABEval，这是一种将结构对齐与语义相关性解耦的新数据集。我们的分析表明，结构对齐偏差在 LLMs 中引发了严重的工具调用错误，但在现有评估中仍未得到充分考虑。为了探讨这一偏差背后的内部机制，我们提出了对比注意力归因（Contrastive Attention Attribution），该方法揭示了语义检查和结构匹配的两条竞争路径。这些路径的相对强度驱动了 LLMs 的工具调用决策。基于这些发现，我们进一步提出了一种重新平衡策略，该策略有效减轻了结构对齐偏差，经过广泛实验验证，且未降低一般工具使用能力。

View on arXiv Download PDF AI Translation

cs.CL / 98 / 2604.11407

Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

检索作为生成：具有自触发信息规划的统一框架

Li, Bo, Wang, Mingda, Fang, Gexiang, Zhang, Shikun, Ye, Wei

Abstract

We revisit retrieval-augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token-level decoding, enabling end-to-end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose \textbf{GRIP} (\textbf{G}eneration-guided \textbf{R}etrieval with \textbf{I}nformation \textbf{P}lanning), a unified framework in which the model regulates retrieval behavior through control-token emission. Central to GRIP is \textit{Self-Triggered Information Planning}, which allows the model to decide when to retrieve, how to reformulate queries, and when to terminate, all within a single autoregressive trajectory. This design tightly couples retrieval and reasoning and supports dynamic multi-step inference with on-the-fly evidence integration. To supervise these behaviors, we construct a structured training set covering answerable, partially answerable, and multi-hop queries, each aligned with specific token patterns. Experiments on five QA benchmarks show that GRIP surpasses strong RAG baselines and is competitive with GPT-4o while using substantially fewer parameters.

Chinese Translation

我们通过将检索控制直接嵌入生成中，重新审视了增强检索生成（RAG）。我们不再将检索视为外部干预，而是在标记级解码中表达检索决策，从而实现端到端的协调，无需额外的控制器或分类器。在“检索作为生成”的范式下，我们提出了 extbf{GRIP}（ extbf{G}eneration-guided extbf{R}etrieval with extbf{I}nformation extbf{P}lanning），这是一个统一框架，其中模型通过控制标记的发出来调节检索行为。GRIP的核心是 extit{自触发信息规划}，它允许模型决定何时检索、如何重构查询以及何时终止，所有这些都在单一的自回归轨迹中完成。该设计紧密耦合了检索与推理，并支持动态多步骤推理与即时证据整合。为了监督这些行为，我们构建了一个结构化训练集，涵盖可回答、部分可回答和多跳查询，每个查询与特定的标记模式对齐。在五个问答基准上的实验表明，GRIP超越了强大的RAG基线，并且在参数使用上显著少于GPT-4o，具有竞争力。

View on arXiv Download PDF AI Translation

cs.CL / 99 / 2604.11424

Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

连接模型认知与表达方式：用于富有表现力语音生成的自感知语音语言模型

Wang, Kuang, Wei, Lai, Bai, Qibing, Lin, Ping, Fang, Wenkai, Jiang, Feng, Jiang, Zhongjie, Huang, Jun, Wang, Yannan, Li, Haizhou

Abstract

Speech Language Models (SLMs) exhibit strong semantic understanding, yet their generated speech often sounds flat and fails to convey expressive intent, undermining user engagement. We term this mismatch the semantic understanding-acoustic realization gap. We attribute this gap to two key deficiencies: (1) intent transmission failure, where SLMs fail to provide the stable utterance-level intent needed for expressive delivery; and (2) realization-unaware training, where no feedback signal verifies whether acoustic outputs faithfully reflect intended expression. To address these issues, we propose SA-SLM (Self-Aware Speech Language Model), built on the principle that the model should be aware of what it thinks during generation and how it speaks during training. SA-SLM addresses this gap through two core contributions: (1) Intent-Aware Bridging, which uses a Variational Information Bottleneck (VIB) objective to translate the model's internal semantics into temporally smooth expressive intent, making speech generation aware of what the model intends to express; and (2) Realization-Aware Alignment, which repurposes the model as its own critic to verify and align acoustic realization with intended expressive intent via rubric-based feedback. Trained on only 800 hours of expressive speech data, our 3B parameter SA-SLM surpasses all open-source baselines and comes within 0.08 points of GPT-4o-Audio in overall expressiveness on the EchoMind benchmark.

Chinese Translation

语音语言模型（Speech Language Models, SLMs）展现了强大的语义理解能力，然而其生成的语音往往显得平淡，难以传达富有表现力的意图，影响用户的参与感。我们将这种不匹配称为语义理解与声学实现之间的差距。我们认为该差距源于两个关键缺陷：（1）意图传递失败，即SLMs未能提供用于富有表现力表达的稳定话语级意图；（2）实现感知训练缺失，即缺乏反馈信号来验证声学输出是否忠实反映了预期的表达意图。为解决这些问题，我们提出了SA-SLM（Self-Aware Speech Language Model，自感知语音语言模型），其核心理念是在生成过程中模型应意识到自身的认知内容，在训练过程中意识到自身的发声方式。SA-SLM通过两大核心贡献弥合该差距：（1）意图感知桥接（Intent-Aware Bridging），利用变分信息瓶颈（Variational Information Bottleneck, VIB）目标将模型内部语义转化为时间上平滑的富有表现力意图，使语音生成过程意识到模型意图表达的内容；（2）实现感知对齐（Realization-Aware Alignment），将模型自身作为评判者，通过基于评分标准的反馈验证并对齐声学实现与预期表达意图。该模型仅使用800小时的富有表现力语音数据进行训练，拥有30亿参数，在EchoMind基准测试中，其整体表现力超过所有开源基线模型，并在表现力评分上仅落后GPT-4o-Audio 0.08分。

View on arXiv Download PDF AI Translation

cs.CL / 100 / 2604.11427

METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues

METRO：从专家对话记录中推导策略以应对非协作对话

Yang, Haofu, Liu, Jiaji, Huang, Chen, Wu, Faguo, Lei, Wenqiang, Ng, See-Kiong

Abstract

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies. We propose \ours, a method that leverages large language models to autonomously induce both strategy actions and planning logic directly from raw transcripts. METRO formalizes expert knowledge into a Strategy Forest, a hierarchical structure that captures both short-term responses (nodes) and long-term strategic foresight (branches). Experimental results across two benchmarks show that METRO demonstrates promising performance, outperforming existing methods by an average of 9%-10%. Our further analysis not only reveals the success behind METRO (strategic behavioral diversity and foresight), but also demonstrates its robust cross-task transferability. This offers new insights into building non-collaborative agents in a cost-effective and scalable way. Our code is available at https://github.com/Humphrey-0125/METRO.

Chinese Translation

开发非协作对话代理通常需要手动且不可扩展的专家策略编码。我们提出了 extit{METRO}，一种利用大型语言模型从原始记录中自主推导策略行动和规划逻辑的方法。METRO将专家知识形式化为策略森林（Strategy Forest），这是一种层次结构，捕捉短期响应（节点）和长期战略前瞻（分支）。在两个基准测试中的实验结果表明，METRO表现出良好的性能，平均超越现有方法9%-10%。我们的进一步分析不仅揭示了METRO成功的原因（战略行为多样性和前瞻性），还展示了其在跨任务转移中的稳健性。这为以成本效益高且可扩展的方式构建非协作代理提供了新的见解。我们的代码可在https://github.com/Humphrey-0125/METRO获取。

View on arXiv Download PDF AI Translation

cs.CL / 101 / 2604.11435

Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books

写作前思考：基于问答的角色描述推理

Papoudakis, Argyrios, Lapata, Mirella, Keller, Frank

Abstract

Character description generation is an important capability for narrative-focused applications such as summarization, story analysis, and character-driven simulations. However, generating accurate character descriptions from long-form narratives (e.g., novels) is challenging: models must track evolving attributes (e.g., relationships and events), integrate evidence scattered across the text, and infer implicit details. Despite the success of reasoning-enabled LLMs on many benchmarks, we find that for character description generation their performance improves when built-in reasoning is disabled (i.e., an empty reasoning trace). Motivated by this, we propose a training framework that decouples reasoning from generation. Our approach, which can be applied on top of long-context LLMs or chunk-based methods, consists of a reasoning model that produces a structured QA reasoning trace and a generation model that conditions on this trace to produce the final character description. Experiments on two datasets (BookWorm and CroSS) show that QA-guided reasoning improves faithfulness, informativeness, and grounding over strong long-context baselines.

Chinese Translation

角色描述生成是叙事驱动应用（如摘要、故事分析和角色驱动模拟）中的一项重要能力。然而，从长篇叙事（例如小说）中生成准确的角色描述具有挑战性：模型必须跟踪不断变化的属性（例如关系和事件），整合散布在文本中的证据，并推断隐含细节。尽管推理增强的语言模型（LLMs）在许多基准测试中取得了成功，但我们发现，在角色描述生成任务中，当禁用内置推理（即空的推理痕迹）时，它们的性能反而有所提升。基于此，我们提出了一种将推理与生成解耦的训练框架。我们的方法可以应用于长上下文的 LLMs 或基于块的方法，包含一个生成结构化问答推理痕迹的推理模型和一个基于该痕迹生成最终角色描述的生成模型。在两个数据集（BookWorm 和 CroSS）上的实验表明，基于问答的推理在忠实性、信息量和基础性方面优于强大的长上下文基线。

View on arXiv Download PDF AI Translation

cs.CL / 102 / 2604.11502

METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

METER：评估大型语言模型中的多层次上下文因果推理能力

Li, Pengfeng, Huang, Chen, Hao, Chaoqun, Chen, Hongyao, Wei, Xiao-Yong, Lei, Wenqiang, Ng, See-Kiong

Abstract

Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower level of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to a reduced performance. We belive our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER .

Chinese Translation

上下文因果推理是大型语言模型（LLMs）的一项关键且具有挑战性的能力。然而，现有的基准测试往往在碎片化的环境中评估该技能，未能确保上下文的一致性或涵盖完整的因果层级。为此，我们首创了METER，在统一的上下文设置下系统地对LLMs在因果阶梯的三个层级进行基准测试。我们对多种LLMs的广泛评估显示，随着任务因果层级的提升，模型的能力显著下降。为诊断这种性能退化，我们通过错误模式识别和内部信息流追踪进行了深入的机制分析。分析揭示了两种主要的失败模式：（1）在较低因果层级，LLMs易受因果无关但事实正确的信息干扰；（2）随着任务因果层级的提升，对所提供上下文的忠实度下降，导致性能降低。我们相信，本研究推动了对LLMs上下文因果推理机制的理解，并为未来相关研究奠定了重要基础。我们的代码和数据集可在https://github.com/SCUNLP/METER 获取。

View on arXiv Download PDF AI Translation

cs.CL / 103 / 2604.11510

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

策略分裂：通过双模熵正则化激励大语言模型强化学习中的双模探索

Yao, Jiashu, Huang, Heyan, Luo, Chuwei, Wu, Daiqing, Liu, Zeming, Guo, Yuhang, Kang, Yangyang

Abstract

To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates distinct behavioral patterns to the normal mode, providing unique learning signals.

Chinese Translation

为了在不降低准确性的前提下促进大语言模型（LLMs）强化学习（RL）中的多样化探索，我们提出了策略分裂（Policy Split），这是一种将策略分为常规模式和高熵模式的创新范式，并配以高熵提示。两种模式共享模型参数，但通过协同的双模熵正则化针对不同目标进行优化。具体而言，常规模式优化任务正确性，而高熵模式则引入探索偏好，两者协同学习。大量实验表明，我们的方法在各种模型规模下，在通用和创造性任务中均持续优于已有的基于熵引导的强化学习基线。进一步分析显示，策略分裂促进了双模探索，高熵模式生成与常规模式截然不同的行为模式，提供了独特的学习信号。

View on arXiv Download PDF AI Translation

cs.CL / 104 / 2604.11522

Triviality Corrected Endogenous Reward

平凡性修正的内生奖励

Wang, Xinda, Hou, Zhengxu, Zhang, Yangshijie, Yan, Bingren, Liu, Jialin, Zhao, Chenzhuo, Yang, Zhibo, Yang, Bin-Bin, Xiao, Feng

Abstract

Reinforcement learning for open-ended text generation is constrained by the lack of verifiable rewards, necessitating reliance on judge models that require either annotated data or powerful closed-source models. Inspired by recent work on unsupervised reinforcement learning for mathematical reasoning using confidence-based endogenous rewards, we investigate whether this principle can be adapted to open-ended writing tasks. We find that directly applying confidence rewards leads to Triviality Bias: the policy collapses toward high-probability outputs, reducing diversity and meaningful content. We propose TCER (Triviality Corrected Endogenous Reward), which addresses this bias by rewarding the relative information gain between a specialist policy and a generalist reference policy, modulated by a probability-dependent correction mechanism. Across multiple writing benchmarks and model architectures, TCER achieves consistent improvements without external supervision. Furthermore, TCER also transfers effectively to mathematical reasoning, validating the generality of our approach across different generation tasks.

Chinese Translation

开放式文本生成的强化学习受限于缺乏可验证的奖励，因而必须依赖需要标注数据或强大闭源模型的评判模型。受近期基于置信度的内生奖励进行无监督数学推理强化学习工作的启发，我们探讨了该原理是否可适用于开放式写作任务。我们发现直接应用置信度奖励会导致平凡性偏差（Triviality Bias）：策略趋向于高概率输出，降低了生成内容的多样性和实质性。为此，我们提出了TCER（Triviality Corrected Endogenous Reward，平凡性修正的内生奖励），通过奖励专家策略与通用参考策略之间的相对信息增益，并结合基于概率的修正机制来解决该偏差。在多个写作基准和模型架构上，TCER在无外部监督的情况下实现了持续改进。此外，TCER还有效迁移至数学推理任务，验证了我们方法在不同生成任务中的通用性。

View on arXiv Download PDF AI Translation

cs.CL / 105 / 2604.11543

NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment

NovBench：评估大型语言模型在学术论文新颖性评估中的表现

Wu, Wenqing, Zhao, Yi, Wang, Yuzhuo, Li, Siyou, Shao, Juexi, Long, Yunfei, Zhang, Chengzhi

Abstract

Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs' capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (including Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine--tuned models often suffer from instruction-following deficiencies. These findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.

Chinese Translation

新颖性是学术出版的核心要求，也是同行评审的中心焦点，但日益增长的投稿量给人类评审者带来了越来越大的压力。尽管包括在同行评审数据上进行微调的大型语言模型（LLMs）在生成评审评论方面显示出潜力，但缺乏专门的基准限制了对它们评估研究新颖性能力的系统性评估。为了解决这一问题，我们引入了NovBench，这是第一个旨在评估LLMs生成新颖性评估能力的大规模基准，以支持人类同行评审。NovBench包含来自一家领先的自然语言处理（NLP）会议的1,684对论文-评审对，包括从论文引言中提取的新颖性描述和相应的专家撰写的新颖性评估。我们关注这两个来源，因为引言提供了新颖性主张的标准化和明确的表述，而专家撰写的新颖性评估则构成了当前人类判断的黄金标准之一。此外，我们提出了一个四维评估框架（包括相关性、正确性、覆盖率和清晰度）来评估LLM生成的新颖性评估的质量。在不同提示策略下对一般和专业LLM进行的广泛实验表明，当前模型对科学新颖性的理解有限，而微调模型往往存在遵循指令的缺陷。这些发现强调了需要针对性微调策略，以共同提高新颖性理解和指令遵循能力。

View on arXiv Download PDF AI Translation

cs.CL / 106 / 2604.11544

Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory

时间不是标签：针对时间知识图谱和代理记忆的连续相位旋转

Li, Weixian Waylon, Zhang, Jiaxin, Yang, Xianan Jim, Ma, Tiejun, Guo, Yiwen

Abstract

Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approaches model time as discrete metadata, either sorting by recency (burying old-yet-permanent knowledge), simply overwriting outdated facts, or requiring an expensive LLM call at every ingestion step, leaving them unable to distinguish persistent facts from evolving ones. To address this, we introduce RoMem, a drop-in temporal knowledge graph module for structured memory systems, applicable to agentic memory and beyond. A pretrained Semantic Speed Gate maps each relation's text embedding to a volatility score, learning from data that evolving relations (e.g., "president of") should rotate fast while persistent ones (e.g., "born in") should remain stable. Combined with continuous phase rotation, this enables geometric shadowing: obsolete facts are rotated out of phase in complex vector space, so temporally correct facts naturally outrank contradictions without deletion. On temporal knowledge graph completion, RoMem achieves state-of-the-art results on ICEWS05-15 (72.6 MRR). Applied to agentic memory, it delivers 2-3x MRR and answer accuracy on temporal reasoning (MultiTQ), dominates hybrid benchmark (LoCoMo), preserves static memory with zero degradation (DMR-MSC), and generalises zero-shot to unseen financial domains (FinTMMBench).

Chinese Translation

结构化记忆表示（如知识图谱）是自主智能体和其他长期存在系统的核心。然而，大多数现有方法将时间建模为离散元数据，要么通过最近性排序（埋没旧的但持久的知识），要么简单地覆盖过时的事实，或者在每个摄取步骤中都需要昂贵的LLM调用，从而无法区分持久事实与不断演变的事实。为了解决这个问题，我们引入了RoMem，一个适用于结构化记忆系统的时间知识图谱模块，适用于代理记忆及其他领域。一个预训练的语义速度门将每个关系的文本嵌入映射到一个波动性评分，从数据中学习到不断演变的关系（例如，“总统”）应该快速旋转，而持久的关系（例如，“出生于”）应该保持稳定。结合连续相位旋转，这使得几何阴影成为可能：过时的事实在复杂的向量空间中被旋转出相位，因此时间上正确的事实自然优于矛盾，而无需删除。在时间知识图谱补全任务中，RoMem在ICEWS05-15上实现了最先进的结果（72.6 MRR）。应用于代理记忆时，它在时间推理（MultiTQ）上提供了2-3倍的MRR和答案准确性，主导混合基准（LoCoMo），在静态记忆中保持零降级（DMR-MSC），并在未见过的金融领域（FinTMMBench）实现零样本泛化。

View on arXiv Download PDF AI Translation

cs.CL / 107 / 2604.11554

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

Relax：一个用于大规模全模态后训练的异步强化学习引擎

Zhang, Liujie, Ning, Benzhe, Yang, Rui, Yu, Xiaoyan, Li, Jiaxing, Wu, Lumeng, Liu, Jia, Li, Minghao, Chen, Weihang, Hu, Weiqi, Zhang, Lei

Abstract

Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness -- throughput tradeoff. We present \textbf{Relax} (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an \emph{omni-native architecture} builds multimodal support into the full stack -- from data preprocessing and modality-aware parallelism to inference generation -- rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. Relax achieves a 1.20$\times$ end-to-end speedup over veRL on Qwen3-4B on-policy training. Its fully async mode delivers a 1.76$\times$ speedup over colocate on Qwen3-4B and a 2.00$\times$ speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay)~\cite{ma2025r3} for MoE models with only 1.9\% overhead, compared to 32\% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, sustaining over 2{,}000 steps on video without degradation. Relax is available at https://github.com/rednote-ai/Relax.

Chinese Translation

强化学习（RL）后训练已被证明在解锁大型语言模型的推理、自我反思和工具使用能力方面有效。随着模型扩展到全模态输入和自主多轮工作流程，RL训练系统面临三个相互依赖的挑战：异构数据流、规模化操作的鲁棒性以及过时性与吞吐量之间的权衡。我们提出了 extbf{Relax}（Reinforcement Engine Leveraging Agentic X-modality），一个开源的RL训练引擎，通过三个共同设计的架构层来解决这些挑战。首先，一个 extit{全原生架构}将多模态支持构建到整个堆栈中——从数据预处理和模态感知并行性到推理生成——而不是将其后期添加到以文本为中心的管道中。其次，每个RL角色作为独立的、故障隔离的服务运行，可以在没有全局协调的情况下进行扩展、恢复和升级。第三，服务级别的解耦通过TransferQueue数据总线实现异步训练，其中单个过时性参数在策略执行、近似策略执行和完全异步执行之间平滑插值。Relax在Qwen3-4B的策略训练中，相较于veRL实现了1.20$ imes$的端到端加速。其完全异步模式在Qwen3-4B上实现了1.76$ imes$的加速，在Qwen3-Omni-30B上实现了2.00$ imes$的加速，而所有模式都收敛到相同的奖励水平。与veRL在相同配置下的32 ext{%}降级相比，Relax支持R3（Rollout Routing Replay）~ootnote{ma2025r3}用于MoE模型，仅增加1.9 ext{%}的开销。它进一步展示了在Qwen3-Omni上跨图像、文本和音频的稳定全模态RL收敛，在视频上持续超过2{,}000步而没有降级。Relax可在https://github.com/rednote-ai/Relax获取。

View on arXiv Download PDF AI Translation

cs.CL / 108 / 2604.11563

Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo

Synthius-Mem：类脑抗幻觉角色记忆系统，在LoCoMo基准上实现94.4%的记忆准确率与99.6%的对抗鲁棒性

Gadzhiev, Artem, Kislov, Andrew

Abstract

Providing AI agents with reliable long-term memory that does not hallucinate remains an open problem. Current approaches to memory for LLM agents -- sliding windows, summarization, embedding-based RAG, and flat fact extraction -- each reduce token cost but introduce catastrophic information loss, semantic drift, or uncontrolled hallucination about the user. The structural reason is architectural: every published memory system on the LoCoMo benchmark treats conversation as a retrieval problem over raw or lightly summarized dialogue segments, and none reports adversarial robustness, the ability to refuse questions about facts the user never disclosed. We present Synthius-Mem, a brain-inspired structured persona memory system that takes a fundamentally different approach. Instead of retrieving what was said, Synthius-Mem extracts what is known about the person: a full persona extraction pipeline decomposes conversations into six cognitive domains (biography, experiences, preferences, social circle, work, psychometrics), consolidates and deduplicates per domain, and retrieves structured facts via CategoryRAG at 21.79 ms latency. On the LoCoMo benchmark (ACL 2024, 10 conversations, 1,813 questions), Synthius-Mem achieves 94.37% accuracy, exceeding all published systems including MemMachine (91.69%, adversarial score is not reported) and human performance (87.9 F1). Core memory fact accuracy reaches 98.64%. Adversarial robustness, the hallucination resistance metric that no competing system reports, reaches 99.55%. Synthius-Mem reduces token consumption by ~5x compared to full-context replay while achieving higher accuracy. Synthius-Mem achieves state-of-the-art results on LoCoMo and is, to our knowledge, the only persona memory system that both exceeds human-level performance and reports adversarial robustness.

Chinese Translation

为人工智能代理提供可靠且无幻觉的长期记忆仍是一个未解决的问题。当前针对大型语言模型（LLM）代理的记忆方法——滑动窗口、摘要、基于嵌入的检索增强生成（RAG）以及扁平事实提取——虽然降低了令牌成本，但引入了灾难性的信息丢失、语义漂移或对用户信息的不可控幻觉。其结构性原因在于架构设计：所有已发布的LoCoMo基准记忆系统均将对话视为对原始或轻度摘要对话片段的检索问题，且无一报告对抗鲁棒性，即拒绝回答用户未披露事实的能力。本文提出Synthius-Mem，一种类脑启发的结构化角色记忆系统，采用根本不同的方法。Synthius-Mem不检索所说内容，而是提取关于人物的已知信息：完整的角色提取流程将对话分解为六个认知领域（传记、经历、偏好、社交圈、工作、心理测量），在各领域内进行整合与去重，并通过CategoryRAG以21.79毫秒的延迟检索结构化事实。在LoCoMo基准（ACL 2024，10场对话，1813个问题）上，Synthius-Mem实现了94.37%的准确率，超过所有已发布系统，包括MemMachine（91.69%，未报告对抗得分）及人类表现（87.9 F1）。核心记忆事实准确率达到98.64%。对抗鲁棒性——所有竞品系统未报告的抗幻觉指标——达到99.55%。Synthius-Mem在实现更高准确率的同时，将令牌消耗相比全上下文重放减少约5倍。Synthius-Mem在LoCoMo上达成了最先进的成果，据我们所知，是唯一既超越人类水平又报告对抗鲁棒性的角色记忆系统。

View on arXiv Download PDF AI Translation

cs.CL / 109 / 2604.11565

Phonological distances for linguistic typology and the origin of Indo-European languages

用于语言类型学和印欧语系起源的语音距离研究

Mavridis, Marius, De Gregorio, Juan, Toral, Raul, Sanchez, David

Abstract

We show that short-range phoneme dependencies encode large-scale patterns of linguistic relatedness, with direct implications for quantitative typology and evolutionary linguistics. Specifically, using an information-theoretic framework, we argue that phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system. This finding enables us to quantify distances among 67 modern languages from a multilingual parallel corpus employing a distance metric that incorporates articulatory features of phonemes. The resulting phonological distance matrix recovers major language families and reveals signatures of contact-induced convergence. Remarkably, we obtain a clear correlation with geographic distance, allowing us to constrain a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.

Chinese Translation

我们展示了短程音素依赖关系能够编码大规模的语言相关性模式，这对定量语言类型学和进化语言学具有直接意义。具体而言，基于信息论框架，我们提出将音素序列建模为二阶马尔可夫链，本质上捕捉了语音系统的统计相关性。该发现使我们能够利用包含音素发音特征的距离度量，从一个多语言平行语料库中量化67种现代语言之间的距离。所得的语音距离矩阵成功还原了主要语言家族结构，并揭示了接触引起的趋同特征。值得注意的是，我们发现语音距离与地理距离存在显著相关性，从而限制了印欧语系可能的起源区域，且该结果与草原假说（Steppe hypothesis）相一致。

View on arXiv Download PDF AI Translation

cs.CL / 110 / 2604.11575

MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

MIXAR：将自回归像素级语言模型扩展到多语言和多书写系统

Hu, Chen, Tai, Yintao, Vergari, Antonio, Keller, Frank, Suglia, Alessandro

Abstract

Pixel-based language models are gaining momentum as alternatives to traditional token-based approaches, promising to circumvent tokenization challenges. However, the inherent perceptual diversity across languages poses a significant hurdle for multilingual generalization in pixel space. This paper introduces MIXAR, the first generative pixel-based language model trained on eight different languages utilizing a range of different scripts. We empirically evaluate MIXAR against previous pixel-based models as well as comparable tokenizer-based models, demonstrating substantial performance improvement on discriminative and generative multilingual tasks. Additionally, we show how MIXAR is robust to languages never seen during the training. These results are further strengthened when scaling the model to 0.5B parameters which not only improves its capabilities in generative tasks like LAMBADA but also its robustness when challenged with input perturbations such as orthographic attacks.

Chinese Translation

基于像素的语言模型作为传统基于标记的方法的替代方案正逐渐获得关注，能够有效规避分词带来的挑战。然而，不同语言间固有的感知多样性在像素空间中对多语言泛化能力构成了重大障碍。本文提出了MIXAR，这是首个在八种不同语言及多种书写系统上训练的生成式基于像素的语言模型。我们通过实证评估，将MIXAR与先前的基于像素模型及可比的基于分词器的模型进行对比，展示了其在判别性和生成性多语言任务上的显著性能提升。此外，我们还展示了MIXAR对训练中未见语言的鲁棒性。当模型规模扩大至5亿参数时，这些结果进一步得到强化，不仅提升了其在如LAMBADA等生成任务中的能力，也增强了其面对输入扰动（如正字法攻击）时的鲁棒性。

View on arXiv Download PDF AI Translation

cs.CL / 111 / 2604.11581

Decomposing and Reducing Hidden Measurement Error in LLM Evaluation Pipelines

分解与减少大型语言模型评估流程中的隐藏测量误差

Messing, Solomon

Abstract

LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet these scores carry hidden uncertainty: rephrasing the prompt, switching the judge model, or changing the temperature can shift results enough to flip rankings and reverse conclusions. Standard confidence intervals ignore this variance, producing under-coverage that worsens with more data. The unmeasured variance also creates an exploitable surface: model developers can optimize against measurement noise rather than genuine capability. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and projects the most efficient path to reducing total error. For benchmark builders, the same decomposition identifies which design choices contribute exploitable surface for gaming and prescribes designs that minimize it. Across ideology annotation, safety classification, MMLU benchmarking, and a human-validated propaganda audit, projection-optimized pipelines outperform 73\% of possible naive pipelines against a human baseline. On MMLU, optimized budget allocation halves estimation error compared to standard single-prompt evaluation at equivalent cost. A small-sample variance estimation exercise is sufficient to derive confidence intervals that approach nominal coverage when the model includes the relevant pipeline facets, and to generate recommendations for reducing measurement error and improving benchmark robustness.

Chinese Translation

大型语言模型（LLM）评估决定了哪些模型被部署、哪些安全标准被采纳以及哪些研究结论被发表。然而，这些评分存在隐藏的不确定性：重新表述提示词、更换评判模型或调整温度参数，均可能导致结果发生显著变化，甚至颠倒排名和结论。标准置信区间忽视了这种方差，导致覆盖率不足，且随着数据量增加问题愈发严重。未被测量的方差还创造了可被利用的攻击面：模型开发者可能针对测量噪声进行优化，而非真实能力。本文将LLM评估流程中的不确定性分解为不同来源，区分随数据量增加而减小的方差与对研究者设计选择敏感的方差，并预测减少总误差的最有效路径。对于基准测试构建者，同样的分解方法可识别哪些设计选择产生可被利用的攻击面，并提出最小化该攻击面的设计方案。在意识形态注释、安全分类、MMLU基准测试及人类验证的宣传审计中，基于预测优化的评估流程在与人类基线比较时，表现优于73%的简单流程。在MMLU测试中，优化的预算分配使估计误差相比标准单提示评估在等成本条件下减少了一半。通过小样本方差估计即可推导出接近名义覆盖率的置信区间（前提是模型涵盖了相关流程要素），并生成减少测量误差及提升基准测试稳健性的建议。

View on arXiv Download PDF AI Translation

cs.CL / 112 / 2604.11582

A Triadic Suffix Tokenization Scheme for Numerical Reasoning

一种用于数值推理的三元后缀分词方案

Chetverina, Olga

Abstract

Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure - a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). Unlike approaches that rely on positional inference, this method provides a consistent gradient signal, which should ensure stable convergence. Two implementation variants are proposed: (1) a vocabulary-based approach that adds at most 10,000 fixed tokens to an existing vocabulary, covering 33 orders of magnitude ($10^{-15}$ to $10^{18}$); and (2) a suffix-marker approach that uses a small set of special tokens to denote magnitude dynamically. Both variants preserve exact digits while making order-of-magnitude relationships transparent at the token level. The framework is inherently scalable, allowing for linear vocabulary expansion to accommodate arbitrary precision and range. TST is architecture-agnostic and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.

Chinese Translation

标准的子词分词方法对数字的切分不一致，导致大型语言模型（LLMs）丧失数字的位置信息和小数结构——这是算术和科学推理中错误的主要原因。我们提出了三元后缀分词（Triadic Suffix Tokenization，TST），这是一种确定性方案，将数字划分为三位一组的三元组，并为每个三元组标注明确的数量级标记。关键在于，该方案为整数部分（千、百万、十亿等）的后缀与数量级之间定义了固定的一一映射关系，并为小数部分（十分位、千分位、百万分位等）采用了对应的重复标记系统。不同于依赖位置推断的方法，该方法提供了一致的梯度信号，有助于确保模型的稳定收敛。我们提出了两种实现变体：（1）基于词汇表的方法，在现有词汇表中最多新增1万个固定标记，覆盖33个数量级（从10^-15到10^18）；（2）后缀标记方法，使用一小组特殊标记动态表示数量级。两种变体均保留了数字的精确性，同时使数量级关系在标记层面上透明可见。该框架本质上具有可扩展性，允许词汇表线性扩展以适应任意精度和范围。TST与模型架构无关，可作为即插即用的预处理步骤集成。实验验证将留待未来工作完成。

View on arXiv Download PDF AI Translation

cs.CL / 113 / 2604.11610

Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks

跨异构任务的自我演化大型语言模型记忆提取

Yang, Yuqing, Liu, Tengxiao, Zhu, Wang Bill, Shi, Taiwei, Song, Linxin, Jia, Robin

Abstract

As LLM-based assistants become persistent and personalized, they must extract and retain useful information from past conversations as memory. However, the types of information worth remembering vary considerably across tasks. We formalize the \textit{heterogeneous memory extraction} task and introduce \textbf{BEHEMOTH}, a benchmark that repurposes 18 existing datasets spanning personalization, problem-solving, and agentic tasks, using a downstream utility-driven metric for systematic evaluation. Our empirical analysis confirms that no single static extraction prompt dominates across all task categories, and that existing self-evolving prompt optimization frameworks, originally designed for homogeneous distributions, degrade when training tasks are heterogeneous. To address this, we propose \textbf{CluE}, a cluster-based self-evolving strategy that groups training examples into clusters by extraction scenarios, analyzes each cluster independently, and synthesizes cross-cluster insights to update the extraction prompt. Experiments on BEHEMOTH show that CluE generalizes effectively across heterogeneous tasks ($+$9.04\% relative gain), consistently outperforming prior self-evolving frameworks.

Chinese Translation

随着基于大型语言模型（LLM）的助手变得持久和个性化，它们必须从过去的对话中提取和保留有用的信息作为记忆。然而，值得记住的信息类型在不同任务之间差异显著。我们正式定义了 extit{异构记忆提取}任务，并引入了 extbf{BEHEMOTH}，这是一个基准，重新利用了18个现有数据集，涵盖个性化、问题解决和自主任务，采用下游效用驱动的指标进行系统评估。我们的实证分析确认，没有单一的静态提取提示在所有任务类别中占主导地位，且现有的自我演化提示优化框架（最初为同质分布设计）在训练任务异构时表现下降。为了解决这个问题，我们提出了 extbf{CluE}，一种基于聚类的自我演化策略，通过提取场景将训练示例分组为聚类，独立分析每个聚类，并综合跨聚类的见解以更新提取提示。在BEHEMOTH上的实验表明，CluE在异构任务中有效泛化（相对增益为+9.04\%），并始终优于先前的自我演化框架。

View on arXiv Download PDF AI Translation

cs.CL / 114 / 2604.11611

Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

通过互信息自我评估的强化学习利用和校准事后过程奖励

Yao, Jiashu, Huang, Heyan, Liu, Zeming, Guo, Yuhang

Abstract

To overcome the sparse reward challenge in reinforcement learning (RL) for agents based on large language models (LLMs), we propose Mutual Information Self-Evaluation (MISE), an RL paradigm that utilizes hindsight generative self-evaluation as dense reward signals while simultaneously calibrating them against the environmental feedbacks. Empirically, MISE enables an agent to learn autonomously from dense internal rewards supplementing sparse extrinsic signals. Theoretically, our work provides the first formal foundation for the paradigm of generative self-rewarding. We prove that utilizing hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence term between the policy and a proxy reward policy. This theoretical insight then informs and justifies our calibration step, which actively aligns these rewards with the optimal policy. Extensive experiments show that MISE outperforms strong baselines, enabling open-source LLMs about 7B parameters to achieve performance comparable to GPT-4o on validation without expert supervision.

Chinese Translation

为了克服基于大型语言模型（LLMs）的强化学习（RL）中稀疏奖励的挑战，我们提出了互信息自我评估（MISE），这是一种强化学习范式，利用事后生成的自我评估作为密集奖励信号，同时将其与环境反馈进行校准。从经验上看，MISE使得代理能够自主地从密集的内部奖励中学习，以补充稀疏的外部信号。从理论上讲，我们的工作为生成自我奖励的范式提供了首个正式基础。我们证明了利用事后自我评估奖励等同于最小化一个目标，该目标结合了互信息与策略和代理奖励策略之间的KL散度项。这一理论见解随后为我们的校准步骤提供了信息和依据，该步骤主动将这些奖励与最优策略对齐。大量实验表明，MISE的表现优于强基线，使得约70亿参数的开源LLM在没有专家监督的情况下在验证集上达到了与GPT-4o相当的性能。

View on arXiv Download PDF AI Translation

cs.CL / 115 / 2604.11628

Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation

回归基础：让对话代理仅凭检索与生成实现记忆

Wu, Yuqian, Chen, Wei, Huang, Zhengjun, Chen, Junle, Liu, Qingxiang, Wang, Kai, Zhou, Xiaofang, Liang, Yuxuan

Abstract

Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning to manage long-term dialogue history, yet remain vulnerable to context dilution as conversations grow. In this work, we offer a different perspective: the primary bottleneck may lie not in memory architecture, but in the \textit{Signal Sparsity Effect} within the latent knowledge manifold. Through controlled experiments, we identify two key phenomena: \textit{Decisive Evidence Sparsity}, where relevant signals become increasingly isolated with longer sessions, leading to sharp degradation in aggregation-based methods; and \textit{Dual-Level Redundancy}, where both inter-session interference and intra-session conversational filler introduce large amounts of non-informative content, hindering effective generation. Motivated by these insights, we propose \method, a minimalist framework that brings conversational memory back to basics, relying solely on retrieval and generation via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to construct a compact, high-density evidence set. Extensive experiments on multiple benchmarks demonstrate that \method achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency, establishing a new minimalist baseline for conversational memory.

Chinese Translation

现有的对话记忆系统依赖复杂的层次化摘要或强化学习来管理长期对话历史，但随着对话长度增加，仍易受到上下文稀释的影响。在本研究中，我们提出了不同的视角：主要瓶颈可能不在于记忆架构本身，而在于潜在知识流形中的“信号稀疏效应”。通过受控实验，我们识别出两个关键现象：一是“决定性证据稀疏”，即随着会话时长增长，相关信号变得愈发孤立，导致基于聚合的方法性能急剧下降；二是“双层冗余”，即会话间干扰和会话内填充内容均引入大量无信息量内容，阻碍有效生成。基于这些洞见，我们提出了 extit{method}，一个极简框架，将对话记忆回归基础，仅依赖通过回合隔离检索（Turn Isolation Retrieval, TIR）和查询驱动剪枝（Query-Driven Pruning, QDP）实现的检索与生成。TIR以最大激活策略替代全局聚合，捕捉回合级信号；QDP则去除冗余会话和对话填充，构建紧凑且高密度的证据集。在多个基准测试上的大量实验表明， extit{method}在多样化场景中表现稳健，持续超越强基线，同时在令牌使用和延迟方面保持高效，确立了对话记忆的全新极简基线。

View on arXiv Download PDF AI Translation

cs.CL / 116 / 2604.11632

CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

CArtBench：评估视觉-语言模型在中国艺术理解、诠释与真伪鉴定上的表现

Wei, Xuefeng, Wang, Zhixuan, Zhou, Xuan, Qu, Zhi, Li, Hongyao, Sakai, Yusuke, Kamigaito, Hidetaka, Watanabe, Taro

Abstract

We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.

Chinese Translation

我们提出了CARTBENCH，这是一个基于博物馆资源的基准，用于评估视觉-语言模型（VLMs）在中国艺术品上的表现，超越了短文本识别和问答任务。CARTBENCH包含四个子任务：CURATORQA，用于基于证据的识别与推理；CATALOGCAPTION，用于结构化的四部分专家风格鉴赏；REINTERPRET，用于具有专家评分的可辩护重新诠释；以及CONNOISSEURPAIRS，用于在视觉相似干扰下进行诊断性真伪鉴别。CARTBENCH通过将维基数据中带图像的故宫博物院藏品与权威目录页面对齐构建，涵盖多个朝代的五大艺术类别。在对九个代表性视觉-语言模型的评测中，我们发现：尽管CURATORQA整体准确率较高，但在困难的证据关联和风格-时期推断上表现显著下降；长文本鉴赏远未达到专家水平；而面向真伪鉴别的诊断性判别准确率接近随机水平，凸显了当前模型在鉴赏家级别推理任务上的挑战。

View on arXiv Download PDF AI Translation

cs.CL / 117 / 2604.11655

RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

RPA-Check：一种用于评估动态基于LLM的角色扮演代理的多阶段自动化框架

Rosati, Riccardo, Colucci, Edoardo, Bolognini, Massimiliano, Mancini, Adriano, Sernani, Paolo

Abstract

The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraints-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, no redundancy and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework's ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures prone to user-alignment bias or sycophancy. RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.

Chinese Translation

大型语言模型（LLMs）在交互系统中的快速应用促成了动态、开放式角色扮演代理（RPAs）的创建。然而，评估这些代理仍然是一项重大挑战，因为标准的自然语言处理指标无法捕捉角色遵循性、逻辑一致性及长期叙事稳定性的细微差别。本文提出了RPA-Check，一种多阶段自动化评估框架，旨在客观评估基于LLM的RPAs在复杂且约束严格环境中的表现。我们的方法基于四步流程：（1）维度定义，建立高层次的定性行为标准；（2）增强，将这些要求细化为具体的布尔检查表指标；（3）语义过滤，确保指标的客观性、无冗余性及代理隔离；（4）LLM作为评审，采用链式思维验证对代理忠实度进行评分。我们通过将该框架应用于LLM Court（一款用于法医培训的严肃游戏，涉及多个量化本地模型）进行了验证。五个不同法律场景的实验结果表明，该框架能够识别模型规模、推理深度与操作稳定性之间的微妙权衡。值得注意的是，研究发现参数规模与程序一致性呈反比关系，表明较小且经过充分指令调优的模型（8-9B）能够优于易受用户对齐偏差或谄媚影响的大型架构。RPA-Check为未来专门领域中生成代理评估提供了标准化且可重复的度量工具。

View on arXiv Download PDF AI Translation

cs.CL / 118 / 2604.11662

Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation

鲁棒性中的隐性失败：为何监督不确定性量化需要更好的评估

Stacey, Joe, Orgad, Hadas, Inui, Kentaro, Heinzerling, Benjamin, Moosavi, Nafise Sadat

Abstract

Recent work has shown that the hidden states of large language models contain signals useful for uncertainty estimation and hallucination detection, motivating a growing interest in efficient probe-based approaches. Yet it remains unclear how robust existing methods are, and which probe designs provide uncertainty estimates that are reliable under distribution shift. We present a systematic study of supervised uncertainty probes across models, tasks, and OOD settings, training over 2,000 probes while varying the representation layer, feature type, and token aggregation strategy. Our evaluation highlights poor robustness in current methods, particularly in the case of long-form generations. We also find that probe robustness is driven less by architecture and more by the probe inputs. Middle-layer representations generalise more reliably than final-layer hidden states, and aggregating across response tokens is consistently more robust than relying on single-token features. These differences are often largely invisible in-distribution but become more important under distribution shift. Informed by our evaluation, we explore a simple hybrid back-off strategy for improving robustness, arguing that better evaluation is a prerequisite for building more robust probes.

Chinese Translation

近期研究表明，大型语言模型的隐藏状态包含对不确定性估计和幻觉检测有用的信号，这激发了对高效探针（probe）方法的日益关注。然而，目前尚不清楚现有方法的鲁棒性如何，以及哪些探针设计能在分布偏移（distribution shift）下提供可靠的不确定性估计。本文系统地研究了不同模型、任务及OOD（Out-Of-Distribution）设置下的监督不确定性探针，训练了超过2000个探针，变换表示层、特征类型和token聚合策略。评估结果凸显了当前方法在鲁棒性方面的不足，尤其是在长文本生成任务中。我们还发现，探针的鲁棒性更多受探针输入的影响，而非模型架构。中间层表示比最终层隐藏状态的泛化能力更强，跨响应token的聚合策略始终比依赖单token特征更具鲁棒性。这些差异在分布内表现中往往不明显，但在分布偏移时变得尤为重要。基于评估结果，我们探索了一种简单的混合回退策略以提升鲁棒性，强调更好的评估是构建更鲁棒探针的前提。

View on arXiv Download PDF AI Translation

cs.CL / 119 / 2604.11666

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

协同作战：通过心智理论学习双重代理防御者以引导信念

Xiao, Hanqi, Patil, Vaidehi, Khan, Zaid, Lee, Hyunji, Stengel-Eskin, Elias, Bansal, Mohit

Abstract

As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.

Chinese Translation

随着大型语言模型（LLMs）成为对话系统的核心，其推理对话伙伴意图和状态的能力（即形成并运用心智理论，Theory of Mind，ToM）对于与潜在对抗性伙伴的安全交互变得日益关键。我们提出了一种新颖的隐私主题心智理论挑战——引导信念的心智理论（ToM for Steering Beliefs，ToM-SB），其中防御者必须作为双重代理，在共享的环境中引导具有部分先验知识的攻击者的信念。要在ToM-SB中取得成功，防御者必须与攻击者互动并形成其心智模型，目标是欺骗攻击者使其相信已成功提取敏感信息。我们发现，诸如Gemini3-Pro和GPT-5.4等强大前沿模型在ToM-SB上表现不佳，尤其在攻击者具备部分先验知识的复杂场景中，即使通过心智理论提示（ToM prompting）促使模型推理攻击者的信念，也常常未能成功欺骗攻击者。为弥补这一差距，我们通过强化学习训练模型作为AI双重代理，测试欺骗和心智理论两种奖励机制。值得注意的是，我们发现心智理论与欺骗攻击者之间存在双向涌现关系：仅奖励欺骗成功即可提升心智理论能力，反之亦然。在四种不同强度的攻击者、六种防御方法以及分布内与分布外（OOD）评估中，心智理论与欺骗成功的提升高度相关，凸显信念建模是ToM-SB成功的关键驱动力。结合心智理论与欺骗奖励的AI双重代理在欺骗和心智理论表现上均最为优异，在复杂场景中超越了使用心智理论提示的Gemini3-Pro和GPT-5.4。我们还展示了ToM-SB及AI双重代理可扩展至更强攻击者，验证了任务的分布外泛化能力及可升级性。

View on arXiv Download PDF AI Translation

cs.CL / 120 / 2604.11687

Please Make it Sound like Human: Encoder-Decoder vs. Decoder-Only Transformers for AI-to-Human Text Style Transfer

请让文本听起来更像人类：编码器-解码器与仅解码器Transformer在AI到人类文本风格转换中的比较

Paneru, Utsav

Abstract

AI-generated text has become common in academic and professional writing, prompting research into detection methods. Less studied is the reverse: systematically rewriting AI-generated prose to read as genuinely human-authored. We build a parallel corpus of 25,140 paired AI-input and human-reference text chunks, identify 11 measurable stylistic markers separating the two registers, and fine-tune three models: BART-base, BART-large, and Mistral-7B-Instruct with QLoRA. BART-large achieves the highest reference similarity -- BERTScore F1 of 0.924, ROUGE-L of 0.566, and chrF++ of 55.92 -- with 17x fewer parameters than Mistral-7B. We show that Mistral-7B's higher marker shift score reflects overshoot rather than accuracy, and argue that shift accuracy is a meaningful blind spot in current style transfer evaluation.

Chinese Translation

AI生成的文本在学术和专业写作中已变得普遍，促使相关检测方法的研究。然而，反向问题——系统性地将AI生成的散文改写为真正的人类创作风格——却鲜有研究。我们构建了一个包含25,140对AI输入与人类参考文本片段的平行语料库，识别出11个可量化的风格标记以区分两种文本风格，并对三种模型进行了微调：BART-base、BART-large和基于QLoRA的Mistral-7B-Instruct。BART-large在参考相似度上表现最佳——BERTScore F1为0.924，ROUGE-L为0.566，chrF++为55.92——且参数量比Mistral-7B少17倍。我们发现Mistral-7B较高的风格标记变化得分反映的是过度调整而非准确性，并指出风格转换准确性是当前风格转换评估中的一个重要盲点。

View on arXiv Download PDF AI Translation

cs.CL / 121 / 2604.11699

Legal2LogicICL: Improving Generalization in Transforming Legal Cases to Logical Formulas via Diverse Few-Shot Learning

Legal2LogicICL：通过多样化的少样本学习提高法律案例转化为逻辑公式的泛化能力

Xue, Jieying, Nguyen, Phuong Minh, Nguyen, Ha Thanh, Zin, May Myo, Satoh, Ken

Abstract

This work aims to improve the generalization of logic-based legal reasoning systems by integrating recent advances in NLP with legal-domain adaptive few-shot learning techniques using LLMs. Existing logic-based legal reasoning pipelines typically rely on fine-tuned models to map natural-language legal cases into logical formulas before forwarding them to a symbolic reasoner. However, such approaches are heavily constrained by the scarcity of high-quality annotated training data. To address this limitation, we propose a novel LLM-based legal reasoning framework that enables effective in-context learning through retrieval-augmented generation. Specifically, we introduce Legal2LogicICL, a few-shot retrieval framework that balances diversity and similarity of exemplars at both the latent semantic representation level and the legal text structure level. In addition, our method explicitly accounts for legal structure by mitigating entity-induced retrieval bias in legal texts, where lengthy and highly specific entity mentions often dominate semantic representations and obscure legally meaningful reasoning patterns. Our Legal2LogicICL constructs informative and robust few-shot demonstrations, leading to accurate and stable logical rule generation without requiring additional training. In addition, we construct a new dataset, named Legal2Proleg, which is annotated with alignments between legal cases and PROLEG logical formulas to support the evaluation of legal semantic parsing. Experimental results on both open-source and proprietary LLMs demonstrate that our approach significantly improves accuracy, stability, and generalization in transforming natural-language legal case descriptions into logical representations, highlighting its effectiveness for interpretable and reliable legal reasoning. Our code is available at https://github.com/yingjie7/Legal2LogicICL.

Chinese Translation

本研究旨在通过将自然语言处理（NLP）的最新进展与法律领域自适应的少样本学习技术结合，利用大型语言模型（LLMs）提高基于逻辑的法律推理系统的泛化能力。现有的基于逻辑的法律推理流程通常依赖于微调模型将自然语言法律案例映射为逻辑公式，然后将其转发给符号推理器。然而，这种方法受到高质量标注训练数据稀缺的严重限制。为了解决这一问题，我们提出了一种新颖的基于LLM的法律推理框架，该框架通过检索增强生成（retrieval-augmented generation）实现有效的上下文学习。具体而言，我们引入了Legal2LogicICL，这是一个少样本检索框架，在潜在语义表示层面和法律文本结构层面平衡示例的多样性和相似性。此外，我们的方法明确考虑法律结构，通过减轻法律文本中的实体引起的检索偏差来优化，其中冗长且高度特定的实体提及往往主导语义表示并掩盖法律上有意义的推理模式。我们的Legal2LogicICL构建了信息丰富且稳健的少样本示例，从而实现准确且稳定的逻辑规则生成，而无需额外的训练。此外，我们构建了一个新的数据集，命名为Legal2Proleg，该数据集标注了法律案例与PROLEG逻辑公式之间的对齐，以支持法律语义解析的评估。在开源和专有LLM上的实验结果表明，我们的方法显著提高了将自然语言法律案例描述转化为逻辑表示的准确性、稳定性和泛化能力，突显了其在可解释和可靠的法律推理中的有效性。我们的代码可在https://github.com/yingjie7/Legal2LogicICL获取。

View on arXiv Download PDF AI Translation

cs.CL / 122 / 2604.11721

Evaluating Cooperation in LLM Social Groups through Elected Leadership

通过选举领导评估大型语言模型社会群体中的合作

Faulkner, Ryan, Deshpande, Anushka, Piedrahita, David Guzman, Leibo, Joel Z., Jin, Zhijing

Abstract

Governing common-pool resources requires agents to develop enduring strategies through cooperation and self-governance to avoid collective failure. While foundation models have shown potential for cooperation in these settings, existing multi-agent research provides little insight into whether structured leadership and election mechanisms can improve collective decision making. The lack of such a critical organizational feature ubiquitous in human society presents a significant shortcoming of the current methods. In this work we aim to directly address whether leadership and elections can support improved social welfare and cooperation through multi-agent simulation with LLMs. We present our open-source framework that simulates leadership through elected personas and candidate-driven agendas and carry out an empirical study of LLMs under controlled governance conditions. Our experiments demonstrate that having elected leadership improves social welfare scores by 55.4% and survival time by 128.6% across a range of high performing LLMs. Through the construction of an agent social graph we compute centrality metrics to assess the social influence of leader personas and also analyze rhetorical and cooperative tendencies revealed through a sentiment analysis on leader utterances. This work lays the foundation for further study of election mechanisms in multi-agent systems toward navigating complex social dilemmas.

Chinese Translation

治理公共资源池需要代理通过合作和自我管理制定持久策略，以避免集体失败。尽管基础模型（foundation models）在这些环境中展现了合作潜力，现有的多智能体研究却很少探讨结构化领导和选举机制是否能够提升集体决策能力。缺乏这一在人类社会中普遍存在的关键组织特征，是当前方法的一大不足。本文旨在通过多智能体模拟，直接探讨领导和选举是否能够促进社会福利和合作的提升。我们提出了一个开源框架，通过选举产生的角色（personas）和候选人驱动的议程模拟领导，并在受控治理条件下对大型语言模型（LLMs）进行了实证研究。实验结果表明，拥有选举产生的领导能够使社会福利得分提升55.4%，生存时间延长128.6%，这一效果在多种高性能LLMs中均有体现。通过构建代理社会图，我们计算了中心性指标以评估领导角色的社会影响力，并通过对领导发言的情感分析，探讨了其修辞和合作倾向。该研究为多智能体系统中选举机制的进一步研究奠定了基础，助力解决复杂社会困境。

View on arXiv Download PDF AI Translation

cs.CL / 123 / 2604.11742

Discourse Diversity in Multi-Turn Empathic Dialogue

多轮共情对话中的话语多样性

Zhan, Hongli, Gueorguieva, Emma S., Hernandez, Javier, Suh, Jina, Ong, Desmond C., Li, Junyi Jessy

Abstract

Large language models (LLMs) produce responses rated as highly empathic in single-turn settings (Ayers et al., 2023; Lee et al., 2024), yet they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks (Jiang et al., 2025; Shaib et al., 2024; Namuduri et al., 2025). Less attention has been paid to whether this formulaicity extends to the level of discourse moves, i.e., what a response does for the person it is addressing. This question is especially consequential for empathic dialogue, where effective support demands not just a kind response at one moment but varied strategies as a conversation unfolds (Stiles et al., 1998). Indeed, prior work shows that LLMs reuse the same tactic sequences more than human supporters in single-turn settings (Gueorguieva et al., 2026). We extend this analysis to multi-turn conversations and find that the rigidity compounds: once a tactic appears in a supporter turn, LLMs reuse it in the next at nearly double the rate of humans (0.50-0.56 vs. 0.27). This pattern holds across LLMs serving as supporters in real emotional support conversations, and is invisible to standard similarity metrics. To address this gap, we introduce MINT (Multi-turn Inter-tactic Novelty Training), the first reinforcement learning framework to optimize discourse move diversity across multi-turn empathic dialogue. The best MINT variant combines an empathy quality reward with a cross-turn tactic novelty signal, improving aggregate empathy by 25.3% over vanilla across 1.7B and 4B models while reducing cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines including quality-only and token-level diversity methods on both measures. These results suggest that what current models lack is not empathy itself, but the ability to vary their discourse moves across a conversation.

Chinese Translation

大型语言模型（LLMs）在单轮对话环境中生成的回复被评为高度共情（Ayers 等，2023；Lee 等，2024），但它们也被认为是公式化的生成器，会在不同任务中重复使用相同的词汇模式、句法模板和话语结构（Jiang 等，2025；Shaib 等，2024；Namuduri 等，2025）。然而，关于这种公式化是否扩展到话语动作层面，即回复对被回应者所起的作用，关注较少。该问题对于共情对话尤为重要，因为有效的支持不仅需要在某一时刻表现出友善的回应，还需在对话过程中采用多样化的策略（Stiles 等，1998）。事实上，先前研究表明，在单轮环境中，LLMs比人类支持者更频繁地重复使用相同的策略序列（Gueorguieva 等，2026）。我们将此分析扩展到多轮对话，发现这种僵化现象加剧：一旦某策略出现在支持者的某轮回复中，LLMs在下一轮中重复使用该策略的概率几乎是人类的两倍（0.50-0.56 对 0.27）。这一模式在真实情感支持对话中担任支持者角色的LLMs中普遍存在，且标准相似度指标无法检测到。为填补这一空白，我们提出了MINT（Multi-turn Inter-tactic Novelty Training，多轮跨策略新颖性训练），这是首个旨在优化多轮共情对话中话语动作多样性的强化学习框架。表现最佳的MINT变体结合了共情质量奖励与跨轮策略新颖性信号，在1.7B和4B参数模型上相比基础模型提升了25.3%的整体共情度，同时在4B模型上减少了26.3%的跨轮话语动作重复，超越了包括仅质量奖励和基于词元多样性方法在内的所有基线指标。这些结果表明，当前模型缺乏的并非共情本身，而是在对话过程中变换话语动作的能力。

View on arXiv Download PDF AI Translation

cs.CL / 124 / 2604.11748

LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

LangFlow：连续扩散在语言建模中与离散模型相抗衡

Chen, Yuxin, Liang, Chumeng, Sui, Hangke, Guo, Ruihan, Cheng, Chaoran, You, Jiaxuan, Liu, Ge

Abstract

Continuous diffusion models have achieved strong performance across domains such as images. However, in language modeling, prior continuous diffusion language models (DLMs) lag behind discrete counterparts. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion. Our approach connects embedding-space DLMs to Flow Matching via Bregman divergence and introduces three key innovations: (1) a novel ODE-based NLL bound for principled evaluation of continuous flow-based language models; (2) an information-uniform principle for noise scheduling, motivating a learnable scheduler based on a Gumbel distribution; and (3) an improved training protocol incorporating self-conditioning, which enhances both likelihood and sample quality.LangFlow achieves strong performance across benchmarks, reaching a perplexity (PPL) of 30.0 on LM1B and 24.6 on OpenWebText. It matches top discrete DLMs at comparable scale and surpasses autoregressive baselines in zero-shot transfer across multiple benchmarks. LangFlow provides clear evidence that continuous diffusion is a competitive and promising paradigm for language modeling. https://github.com/nealchen2003/LangFlow

Chinese Translation

连续扩散模型在图像等领域取得了强劲的性能。然而，在语言建模中，先前的连续扩散语言模型（DLMs）落后于离散模型。在本研究中，我们通过LangFlow缩小了这一差距，LangFlow是首个能够与离散扩散模型相抗衡的连续DLM。我们的方法通过Bregman散度将嵌入空间DLM与流匹配（Flow Matching）连接起来，并引入了三项关键创新：（1）一种新颖的基于常微分方程（ODE）的负对数似然（NLL）界限，用于对基于连续流的语言模型进行原则性评估；（2）一种用于噪声调度的信息均匀原则，激励基于Gumbel分布的可学习调度器；（3）一种改进的训练协议，结合自条件（self-conditioning），增强了似然性和样本质量。LangFlow在各基准测试中表现出色，在LM1B上达到30.0的困惑度（PPL），在OpenWebText上达到24.6。它在可比规模下与顶尖的离散DLM相匹配，并在多个基准测试中超越了自回归基线，展现了连续扩散作为语言建模的竞争性和前景广阔的范式的明确证据。

View on arXiv Download PDF AI Translation

cs.CL / 125 / 2604.11749

HistLens: Mapping Idea Change across Concepts and Corpora

HistLens：跨概念和语料库的思想变化映射

Jing, Yi, Qiu, Weiyun, Peng, Yihang, Sui, Zhifang

Abstract

Language change both reflects and shapes social processes, and the semantic evolution of foundational concepts provides a measurable trace of historical and social transformation. Despite recent advances in diachronic semantics and discourse analysis, existing computational approaches often (i) concentrate on a single concept or a single corpus, making findings difficult to compare across heterogeneous sources, and (ii) remain confined to surface lexical evidence, offering insufficient computational and interpretive granularity when concepts are expressed implicitly. We propose HistLens, a unified, SAE-based framework for multi-concept, multi-corpus conceptual-history analysis. The framework decomposes concept representations into interpretable features and tracks their activation dynamics over time and across sources, yielding comparable conceptual trajectories within a shared coordinate system. Experiments on long-span press corpora show that HistLens supports cross-concept, cross-corpus computation of patterns of idea evolution and enables implicit concept computation. By bridging conceptual modeling with interpretive needs, HistLens broadens the analytical perspectives and methodological repertoire available to social science and the humanities for diachronic text analysis.

Chinese Translation

语言变化既反映又塑造社会过程，而基础概念的语义演变提供了历史和社会变革的可测量痕迹。尽管在历时语义学和话语分析方面取得了最近的进展，现有的计算方法通常 (i) 仅集中于单一概念或单一语料库，使得跨异质来源的研究结果难以比较，(ii) 仍然局限于表层词汇证据，当概念隐含表达时，提供的计算和解释的细致程度不足。我们提出了HistLens，一个基于统一的SAE框架的多概念、多语料库的概念历史分析工具。该框架将概念表示分解为可解释的特征，并跟踪其在时间和来源上的激活动态，从而在共享坐标系统内产生可比较的概念轨迹。对长时间跨度新闻语料库的实验表明，HistLens支持跨概念、跨语料库的思想演变模式计算，并能够进行隐含概念的计算。通过将概念建模与解释需求相结合，HistLens拓宽了社会科学和人文学科在历时文本分析中可用的分析视角和方法论工具。

View on arXiv Download PDF AI Translation

cs.CL / 126 / 2604.11753

Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

面向长时域自主任务的自主聚合并行扩展方法

Lee, Yoonsang, Yen, Howard, Ye, Xi, Chen, Danqi

Abstract

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model's context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods-by up to 5.3% absolute on average and 10.3% on two deep research tasks-while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.

Chinese Translation

我们研究了针对长时域自主任务（如自主搜索和深度研究）的并行测试时扩展方法，其中多个执行轨迹并行生成并聚合为最终响应。尽管此类扩展在链式思维推理中已被证明有效，自主任务却面临独特挑战：轨迹较长、包含多轮交互且辅以工具，且输出通常是开放式的。仅聚合最终答案会丢失轨迹中的丰富信息，而简单串联所有轨迹又会超出模型的上下文窗口。为此，我们提出了AggAgent，一种将并行轨迹视为环境的聚合代理。我们为其配备了轻量级工具以检查候选解并在轨迹间搜索，使其能够按需导航和综合信息。在六个基准测试和三种模型家族（GLM-4.7、Qwen3.5、MiniMax-M2.5）上，AggAgent优于所有现有聚合方法——平均提升最高达5.3个百分点，在两个深度研究任务中提升达10.3个百分点——且仅带来极小的额外开销，因为聚合成本受限于单次自主执行轨迹。我们的研究结果确立了自主聚合作为一种高效且成本效益显著的并行测试时扩展方法。

View on arXiv Download PDF AI Translation

cs.CL / 127 / 2604.11778

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

General365：在多样化和具有挑战性的任务中评估大型语言模型的一般推理能力

Liu, Junlin, An, Shengnan, Zhou, Shuang, Ma, Dan, Luo, Shixiong, Xie, Ying, Zhang, Yuan, Yuan, Wenling, Zhou, Yifan, Li, Xiaoyu, Wang, Ziwen, Cao, Xuezhi, Cai, Xunliang

Abstract

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts--often termed general reasoning--remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io

Chinese Translation

当代大型语言模型（LLMs）在推理能力方面表现出色，尤其是在数学和物理等专业领域。然而，它们将这些推理技能推广到更一般和更广泛的上下文中的能力——通常称为一般推理——仍然未得到充分探索。与特定领域的推理不同，一般推理对专家知识的依赖较小，但仍然面临复杂的推理挑战，例如复杂约束、嵌套逻辑分支和语义干扰。为了解决这一空白，我们推出了General365，这是一个专门设计用于评估LLMs一般推理能力的基准。通过将背景知识限制在K-12水平，General365明确将推理与专业知识解耦。该基准包含365个种子问题和1,095个变体问题，涵盖八个类别，确保了高难度和多样性。在对26个领先的LLMs进行评估时发现，即使是表现最好的模型也仅达到62.8%的准确率，这与LLMs在数学和物理基准测试中的近乎完美表现形成鲜明对比。这些结果表明，当前LLMs的推理能力高度依赖于领域，留有显著的改进空间以适应更广泛的应用。我们设想General365作为推动LLM推理超越特定领域任务，朝向强健的一般用途现实场景的催化剂。代码、数据集和排行榜： https://general365.github.io

View on arXiv Download PDF AI Translation

cs.CL / 128 / 2604.11796

C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

C-ReD：基于真实提示语的综合中文AI生成文本检测基准

Qing, Chenxi, Wu, Junxi, Liu, Zheng, Qiu, Yixiang, Yu, Hongyao, Chen, Bin, Wu, Hao, Xia, Shu-Tao

Abstract

Recently, large language models (LLMs) are capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, like phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets-addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.

Chinese Translation

近年来，大型语言模型（LLMs）能够生成高度流畅的文本内容。尽管它们为人类带来了极大便利，但也引入了诸如网络钓鱼和学术不端等多种风险。大量研究致力于开发AI生成文本检测算法及构建相关数据集。然而，在中文语料领域，仍存在模型多样性不足和数据同质化等挑战。为解决这些问题，我们提出了C-ReD：一个综合性的中文真实提示语AI生成文本检测基准。实验表明，C-ReD不仅实现了可靠的领域内检测，还支持对未见过的LLMs及外部中文数据集的强泛化能力，弥补了以往中文检测基准在模型多样性、领域覆盖和提示语真实性方面的关键不足。我们已在https://github.com/HeraldofLight/C-ReD公开了相关资源。

View on arXiv Download PDF AI Translation

cs.CL / 129 / 2604.11801

CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation

CLSGen：一种用于联合概率分类与口头解释的双头微调框架

Yoon, WonJin, Zhu, Kangyu, Bulovic, Ian, Sehy, Autumn, Gao, Yanjun, Dligach, Dmitriy, Afshar, Majid, Miller, Timothy A.

Abstract

With the recent progress of Large Language Models (LLMs), there is a growing interest in applying these models to solve complex and challenging problems. Modern LLMs, capable of processing long contexts and generating verbalized explanations, offer significant potential in addressing real-world applications. However, a critical hurdle in deploying LLMs for practical decision-making is their inability to provide reliable, quantitative probabilities. While task-specific fine-tuning of LLMs using traditional discriminative objectives (similar to encoder-only models) can yield probability estimates, this often leads to catastrophic forgetting and linguistic collapse. Consequently, the model loses its ability to generate explanations, severely undermining its interpretability and usability. To address this challenge, we propose CLSGen, a novel LLM fine-tuning framework designed for binary classification tasks. The CLSGen framework encompasses a new model architecture, training methodology, and data construction strategy to enable robust probability estimation without sacrificing the model's inherent explanation-generation capabilities. Experimental results across multiple benchmark datasets demonstrate that models fine-tuned with CLSGen outperform existing baselines in classification metrics (AUROC and F1-score). Regarding explanation, the results showed strong alignment between predicted labels and generated justifications, as well as high readability.

Chinese Translation

随着大型语言模型（LLMs）的最新进展，越来越多的研究关注将这些模型应用于解决复杂且具有挑战性的问题。现代LLMs具备处理长上下文和生成口头解释的能力，在应对实际应用中展现出显著潜力。然而，将LLMs部署于实际决策中的一个关键障碍是其无法提供可靠的定量概率估计。虽然通过传统判别目标（类似于仅编码器模型）对LLMs进行特定任务微调可以获得概率估计，但这通常导致灾难性遗忘和语言能力崩溃。结果，模型失去了生成解释的能力，严重削弱了其可解释性和可用性。为解决该挑战，我们提出了CLSGen，一种针对二分类任务设计的新型LLM微调框架。CLSGen框架包括全新的模型架构、训练方法和数据构建策略，旨在实现稳健的概率估计，同时不牺牲模型固有的解释生成能力。多项基准数据集上的实验结果表明，采用CLSGen微调的模型在分类指标（AUROC和F1分数）上优于现有基线方法。在解释方面，结果显示预测标签与生成的理由高度一致，且具备良好的可读性。

View on arXiv Download PDF AI Translation

cs.CL / 130 / 2604.11802

Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs?

心理概念神经元：神经控制能否影响大语言模型中的探测与生成偏向？

Harada, Yuto, Hamada, Hiro Taiyo

Abstract

Using psychological constructs such as the Big Five, large language models (LLMs) can imitate specific personality profiles and predict a user's personality. While LLMs can exhibit behaviors consistent with these constructs, it remains unclear where and how they are represented inside the model and how they relate to behavioral outputs. To address this gap, we focus on questionnaire-operationalized Big Five concepts, analyze the formation and localization of their internal representations, and use interventions to examine how these representations relate to behavioral outputs. In our experiment, we first use probing to examine where Big Five information emerges across model depth. We then identify neurons that respond selectively to each Big Five concept and test whether enhancing or suppressing their activations can bias latent representations and label generation in intended directions. We find that Big Five information becomes rapidly decodable in early layers and remains detectable through the final layers, while concept-selective neurons are most prevalent in mid layers and exhibit limited overlap across domains. Interventions on these neurons consistently shift probe readouts toward targeted concepts, with targeted success rates exceeding 0.8 for some concepts, indicating that the model's internal separation of Big Five personality traits can be causally steered. At the label-generation level, the same interventions often bias generated label distributions in the intended directions, but the effects are weaker, more concept-dependent, and often accompanied by cross-trait spillover, indicating that comparable control over generated labels is difficult even with interventions on a large fraction of concept-selective neurons. Overall, our findings reveal a gap between representational control and behavioral control in LLMs.

Chinese Translation

利用诸如大五人格（Big Five）等心理学构念，大型语言模型（LLMs）能够模拟特定的人格特征并预测用户的人格。尽管LLMs可以表现出与这些构念一致的行为，但这些构念在模型内部的具体位置和表现形式以及它们与行为输出的关系仍不清楚。为填补这一空白，我们聚焦于通过问卷操作化的大五人格概念，分析其内部表征的形成与定位，并通过干预手段检验这些表征与行为输出的关联。在实验中，我们首先通过探测（probing）方法考察大五人格信息在模型深度中的出现位置。随后，我们识别出对每个大五人格概念选择性响应的神经元，并测试增强或抑制这些神经元激活是否能够在潜在表征和标签生成中引导偏向预期方向。结果显示，大五人格信息在模型早期层迅速可解码，并在最终层依然可检测到；而概念选择性神经元主要集中于中间层，且在不同领域间重叠有限。对这些神经元的干预持续将探测结果向目标概念偏移，部分概念的目标成功率超过0.8，表明模型内部对大五人格特质的区分具有因果可控性。在标签生成层面，同样的干预常常使生成的标签分布朝预期方向偏移，但效果较弱、依赖具体概念且常伴随跨特质的溢出效应，表明即使对大量概念选择性神经元进行干预，实现对生成标签的同等控制仍然困难。总体而言，我们的研究揭示了LLMs中表征控制与行为控制之间的差距。

View on arXiv Download PDF AI Translation

cs.CL / 131 / 2604.11803

Saar-Voice: A Multi-Speaker Saarbr\"ucken Dialect Speech Corpus

萨尔-语音：一个多说话者的萨尔布吕肯方言语料库

Oberkircher, Lena S., Alabi, Jesujoba O., Klakow, Dietrich, Trouvain, Jürgen

Abstract

Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbr\"ucken dialect of German. The dataset was created by first collecting text through digitized books and locally sourced materials. A subset of this text was recorded by nine speakers, and we conducted analyses on both the textual and speech components to assess the dataset's characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus provides aligned textual and audio representations. This serves as a foundation for future research on dialect-aware text-to-speech (TTS), particularly in low-resource scenarios, including zero-shot and few-shot model adaptation.

Chinese Translation

自然语言处理（NLP）和语音技术近年来取得了显著进展；然而，它们仍然主要集中在标准化语言变体上。尽管方言在文化上具有重要意义并被广泛使用，但在语言资源和计算模型中却被严重低估，导致性能差异。为了解决这一问题，我们介绍了萨尔-语音（Saar-Voice），这是一个为萨尔布吕肯德语方言创建的六小时语音语料库。该数据集首先通过数字化书籍和本地来源材料收集文本。随后，由九位说话者录制了该文本的一个子集，我们对文本和语音组件进行了分析，以评估数据集的特征和质量。我们讨论了与正字法和说话者变异相关的方法论挑战，并探讨了字素到音素（G2P）转换。最终生成的语料库提供了对齐的文本和音频表示。这为未来在低资源场景下的方言感知文本到语音（TTS）研究奠定了基础，包括零样本和少样本模型适应。

View on arXiv Download PDF AI Translation

arXiv Papers

Spectral Kernel Dynamics via Maximum Caliber: Fixed Points, Geodesics, and Phase Transitions

Kinematics of continuum planar grasping

ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

Perception Is All You Need: A Neuroscience Framework for Low Cost Sensorless Gaze in HRI

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

CableTract: A Co-Designed Cable-Driven Field Robot for Low-Compaction, Off-Grid Capable Agriculture

GPU-Accelerated Continuous-Time Successive Convexification for Contact-Implicit Legged Locomotion

Vision-Language-Action Model, Robustness, Multi-modal Learning, Robot Manipulation

Natural Gradient Gaussian Approximation Filter on Lie Groups for Robot State Estimation

A Ray Intersection Algorithm for Fast Growth Distance Computation Between Convex Sets

MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

Device-Conditioned Neural Architecture Search for Efficient Robotic Manipulation

ReaLiTy and LADS: A Unified Framework and Dataset Suite for LiDAR Adaptation Across Sensors and Adverse Weather Conditions

A Coordinate-Invariant Local Representation of Motion and Force Trajectories for Identification and Generalization Across Coordinate Systems

Trajectory-based actuator identification via differentiable simulation

COSMIK-MPPI: Scaling Constrained Model Predictive Control to Collision Avoidance in Close-Proximity Dynamic Human Environments

AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

PRoID: Predicted Rate of Information Delivery in Multi-Robot Exploration and Relaying

VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions

Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

AWARE: Adaptive Whole-body Active Rotating Control for Enhanced LiDAR-Inertial Odometry under Human-in-the-Loop Interaction

OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction

LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment

WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving

HECTOR: Human-centric Hierarchical Coordination and Supervision of Robotic Fleets under Continual Temporal Tasks

Ro-SLM: Onboard Small Language Models for Robot Task Planning and Operation Code Generation

Fast-SegSim: Real-Time Open-Vocabulary Segmentation for Robotics in Simulation

Diffusion Reinforcement Learning Based Online 3D Bin Packing Spatial Strategy Optimization

ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

{\Psi}-Map: Panoptic Surface Integrated Mapping Enables Real2Sim Transfer

Inferring World Belief States in Dynamic Real-World Environments

Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Fragmentation

Simulator Adaptation for Sim-to-Real Learning of Legged Locomotion via Proprioceptive Distribution Matching

AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

ViserDex: Visual Sim-to-Real for Robust Dexterous In-hand Reorientation

EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

CLAW: Composable Language-Annotated Whole-body Motion Generation

Modeling, Analysis and Activation of Planar Viscoelastically-combined Rimless Wheels

3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS

Learning to Forget -- Hierarchical Episodic Memory for Lifelong Robot Deployment

CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping

Learning Racket-Ball Bounce Dynamics Across Diverse Rubbers for Robotic Table Tennis

WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models

MR.ScaleMaster: Scale-Consistent Collaborative Mapping from Crowd-Sourced Monocular Videos

Minimal Embodiment Enables Efficient Learning of Number Concepts in Robot

ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

EagleVision: A Multi-Task Benchmark for Cross-Domain Perception in High-Speed Autonomous Racing

Using Unwrapped Full Color Space Palette Recording to Measure Exposedness of a Vehicle Exterior Parts for External Human Machine Interfaces

Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

Dyadic Partnership(DP): A Missing Link Towards Full Autonomy in Medical Robotics

Safe Human-to-Humanoid Motion Imitation Using Control Barrier Functions

DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models

Optimal Kinodynamic Motion Planning Through Anytime Bidirectional Heuristic Search with Tight Termination Condition

Micro-Dexterity in Biological Micromanipulation: Embodiment, Perception, and Control

AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation

Dual-Control Frequency-Aware Diffusion Model for Depth-Dependent Optical Microrobot Microscopy Image Generation

ACT: Automated CPS Testing for Open-Source Robotic Platforms

Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

Grounded World Model for Semantically Generalizable Planning

StarVLA-$\alpha$: Reducing Complexity in Vision-Language-Action Systems

Identifying Inductive Biases for Robot Co-Design

Disentangled Point Diffusion for Precise Object Placement

3D Multi-View Stylization with Pose-Free Correspondences Matching for Robust 3D Geometry Preservation

PA-SFM: Tracker-free differentiable acoustic radiation for freehand 3D photoacoustic imaging

TRACE: Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

Prints in the Magnetic Dust: Robust Similarity Search in Legacy Media Images Using Checksum Count Vectors

A Modular Zero-Shot Pipeline for Accident Detection, Localization, and Classification in Traffic Surveillance Video

Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

Immunizing 3D Gaussian Generative Models Against Unauthorized Fine-Tuning via Attribute-Space Traps

Face Density as a Proxy for Data Complexity: Quantifying the Hardness of Instance Count

Are We Recognizing the Jaguar or Its Background? A Diagnostic Framework for Jaguar Re-Identification

CAGE: Bridging the Accuracy-Aesthetics Gap in Educational Diagrams via Code-Anchored Generative Enhancement

TaFall: Balance-Informed Fall Detection via Passive Thermal Sensing

EDFNet: Early Fusion of Edge and Depth for Thin-Obstacle Segmentation in UAV Navigation

Assessing Privacy Preservation and Utility in Online Vision-Language Models