Daily Research Digest

arXiv Papers

2026-05-08 | 359 Papers | 4 Categories | 358 Translated
机器人学 (Robotics)
32
cs.RO / 1 / 2605.05208

A GPU-Accelerated Hybrid Method for a Class of Multi-Depot Vehicle Routing Problems

一种针对多仓库车辆路径问题的GPU加速混合方法
Lei, Zhenyu, Hao, Jin-Kao
Abstract
Multi-depot vehicle routing problems (MDVRPs) are prevalent in a variety of practical applications. However, they are computationally challenging to solve due to their inherent complexity. This paper proposes an effective hybrid algorithm for a class of MDVRPs. The algorithm integrates a learning-driven, diversity-controlled route-exchange crossover and a multi-depot-supported feasible-and-infeasible search framework guided by a multi-penalty evaluation function. Two dedicated depot-related local search operators are incorporated to further strengthen the search capability in multi-depot settings. To improve computational efficiency and scalability, an enhanced version of the algorithm is developed that uses a tensor-based GPU acceleration combined with a novel multi-move update strategy. Extensive computational experiments on benchmark instances of three MDVRP variants show that the proposed algorithms are highly competitive with state-of-the-art methods, especially for large-scale instances.
Chinese Translation
多仓库车辆路径问题(MDVRP)在多种实际应用中普遍存在。然而,由于其固有的复杂性,这类问题在计算上具有挑战性。本文提出了一种针对一类MDVRP的有效混合算法。该算法结合了学习驱动的多样性控制路径交换交叉和由多重惩罚评估函数指导的多仓库支持的可行与不可行搜索框架。为进一步增强多仓库环境下的搜索能力,算法中引入了两个专门的仓库相关局部搜索算子。为了提高计算效率和可扩展性,开发了一种增强版本的算法,该算法使用基于张量的GPU加速,并结合了一种新颖的多移动更新策略。在三个MDVRP变体的基准实例上进行的广泛计算实验表明,所提算法在与最先进的方法相比时具有很强的竞争力,特别是在大规模实例中。
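The multi-penalty evaluation idea in this abstract can be illustrated with a small sketch. The abstract does not say which constraints are penalized, so the two penalty terms below (vehicle capacity and maximum route length) and every name (`Route`, `penalized_cost`, `w_cap`, `w_len`) are assumptions for illustration, not the authors' function.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Route:
    """One vehicle route: it starts and ends at its own depot."""
    depot: tuple          # (x, y) of the depot
    customers: list       # list of (x, y, demand) tuples, in visiting order


def route_length(route):
    """Euclidean length of depot -> customers -> depot."""
    points = [route.depot] + [(x, y) for x, y, _ in route.customers] + [route.depot]
    return sum(hypot(bx - ax, by - ay) for (ax, ay), (bx, by) in zip(points, points[1:]))


def penalized_cost(routes, capacity, max_length, w_cap, w_len):
    """Travel cost plus one weighted penalty per violated constraint type.

    Infeasible solutions stay comparable with feasible ones, which is what
    lets a search move through infeasible regions (hypothetical penalty set).
    """
    total = 0.0
    for route in routes:
        length = route_length(route)
        load = sum(demand for _, _, demand in route.customers)
        total += length
        total += w_cap * max(0.0, load - capacity)      # capacity excess
        total += w_len * max(0.0, length - max_length)  # route-length excess
    return total
```

A search of this kind typically raises a weight while the corresponding constraint keeps being violated and lowers it otherwise; that adaptation loop is left out here.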
cs.RO / 2 / 2605.05236

Topology-Driven Anti-Entanglement Control for Soft Robots

基于拓扑驱动的软机器人反缠绕控制
Le, Haoyang, Wang, Shengxuan, Chen, Mohan, Feng, Shuo
Abstract
Soft robots play an increasingly prominent role in precision manufacturing within complex, constrained environments, and anti-entanglement control based on multi-agent reinforcement learning has become an active research topic. A core open problem is coordinating multiple robots to complete disentangling operations in a highly constrained environment. Existing distributed training frameworks face observability challenges in unstable environments with dense obstacles, which leads to poor learning results. This paper proposes a topology-driven Multi-Agent Reinforcement Learning (TD-MARL) framework that coordinates multi-robot systems to avoid entanglement. Specifically, the critic network is trained centrally, so that each agent can perceive the strategies of the other agents through a shared topological state, which alleviates the training instability caused by complex interactions. Distributed execution removes the need for communication resources between robots and improves system reliability. An integrated topological safety layer uses topological invariants to accurately assess and mitigate the risk of entanglement, keeping the policy from getting trapped in local difficulties. Finally, comprehensive experiments in a realistic simulation environment show that the method outperforms current advanced deep reinforcement learning (DRL) methods in both convergence and anti-entanglement performance.
Chinese Translation
在复杂约束环境下的精密制造领域,软机器人的作用日益显著,基于多智能体强化学习的反缠绕控制的实现已成为研究热点。目前的核心问题之一是协调多个机器人在高度约束环境中完成解缠操作。现有的分布式训练框架在高密度障碍和不稳定环境中面临一些可观测性挑战,导致学习结果不佳。本文提出了一种基于拓扑驱动的多智能体强化学习(Topology-driven Multi-Agent Reinforcement Learning, TD-MARL)框架,以协调多机器人系统避免缠绕。具体而言,关键网络采用集中学习,使每个智能体通过共享拓扑状态感知其他智能体的策略,从而缓解复杂交互导致的训练不稳定性;通过分布式执行消除机器人之间对通信资源的需求,提高系统可靠性;集成的拓扑安全层利用拓扑不变量准确评估和减轻缠绕风险,以避免策略陷入局部困难。最后,在真实仿真环境中进行的全面仿真实验表明,该方法在收敛性和反缠绕效果方面优于当前先进的深度强化学习(Deep Reinforcement Learning, DRL)方法。
cs.RO / 3 / 2605.05241

DexSim2Real: Foundation Model-Guided Sim-to-Real Transfer for Generalizable Dexterous Manipulation

DexSim2Real:基于基础模型的模拟到现实转移框架,用于可泛化的灵巧操作
Zeng, Zijian, Ding, Fei, Yang, Huiming, Li, Xianwei, Liao, Yuhao
Abstract
Sim-to-real transfer remains a critical bottleneck for deploying dexterous manipulation policies learned in simulation to real-world robots. Existing approaches rely on manually designed domain randomization or task-specific adaptation, limiting their generalizability across diverse manipulation scenarios. We present DexSim2Real, an integrated framework that leverages vision-language foundation models to bridge the sim-to-real gap for dexterous manipulation. Our system combines three components: (1) Foundation Model-Guided Domain Randomization (FM-DR), which uses a vision-language model as a visual realism critic to optimize simulation parameters via closed-loop CMA-ES, complementing text-based approaches like DrEureka with direct visual feedback; (2) a Tactile-Visual Cross-Attention Policy (TVCAP) that adapts cross-attention visuo-tactile fusion to zero-shot sim-to-real RL; and (3) a Progressive Skill Curriculum (PSC) that builds on LLM-based task decomposition with a difficulty scheduler tailored to contact-rich dexterous tasks. Extensive experiments on six challenging manipulation tasks with blinded evaluation demonstrate that DexSim2Real achieves a 78.2% average real-world success rate, outperforming DrEureka and DeXtreme while reducing the sim-to-real performance gap to only 8.3%.
Chinese Translation
模拟到现实的转移仍然是将模拟中学习到的灵巧操作策略部署到现实世界机器人中的一个关键瓶颈。现有的方法依赖于手动设计的领域随机化或特定任务的适应,限制了它们在多样化操作场景中的泛化能力。我们提出了DexSim2Real,一个集成框架,利用视觉-语言基础模型来弥合灵巧操作的模拟到现实差距。我们的系统结合了三个组件:(1) 基于基础模型的领域随机化(FM-DR),该组件使用视觉-语言模型作为视觉现实性评估器,通过闭环的协方差矩阵适应进化策略(CMA-ES)优化模拟参数,补充了像DrEureka这样的基于文本的方法,并提供直接的视觉反馈;(2) 触觉-视觉交叉注意力策略(TVCAP),该策略将交叉注意力的视听触觉融合适应于零样本模拟到现实的强化学习;(3) 渐进技能课程(PSC),该课程基于大型语言模型(LLM)的任务分解,配备针对接触丰富的灵巧任务的难度调度器。对六个具有挑战性的操作任务进行的大量实验(采用盲评)表明,DexSim2Real实现了78.2%的平均现实世界成功率,超越了DrEureka和DeXtreme,同时将模拟到现实的性能差距缩小至仅8.3%。
cs.RO / 4 / 2605.05338

Track A*: Fast Visibility-Aware Trajectory Planning for Active Target Tracking

Track A*: 快速的考虑可见性的主动目标跟踪轨迹规划
Chen, Hanxuan, Wang, Kangli, Pei, Ji
Abstract
Offline reference trajectories for active target tracking are needed both for building multi-modal tracking datasets and for benchmarking online tracking planners under repeatable conditions. We present Track A star (TA star), an offline search-based trajectory planner that targets the visibility-aware target tracking objective on a discretized four-dimensional spatio-temporal grid (x, y, z, t). TA star combines a layered Directed Acyclic Graph (DAG) search with three engineering optimizations: cross-time obstacle distance caching against a Bounding Volume Hierarchy (BVH), per-layer beam pruning, and a configurable multi-ray visibility evaluator. TA star employs a beam-pruned heuristic search on this discrete graph to efficiently find high-quality tracking trajectories. While it trades strict theoretical optimality for practical scalability, our empirical results demonstrate robust, near-baseline visibility performance at a fraction of the computational cost. On a 1000-scenario stress test across eight CARLA Optimized maps, TA star converges on all scenarios and completes in 45 s using 32 workers; on a 248-scenario controlled comparison against an unoptimized priority-queue A star baseline (BinaryHeap implementation) under identical scenario inputs and a 5 x 10^6 expansion cap, TA star reduces mean planning time by 23.0x and worst-case planning time by 11.8x, while raising convergence from 56.9% to 100%. On the n=141 baseline-converged subset, TA star changes average visibility by only -0.15 percentage points (pp), with no scenario exceeding a 5 pp drop. We position TA star as a practical offline reference planner under these specific conditions, with limitations and failure cases discussed for environments such as Town07 dense vegetation.
Chinese Translation
主动目标跟踪需要离线参考轨迹,以便构建多模态跟踪数据集并在可重复条件下对在线跟踪规划器进行基准测试。我们提出了Track A星(TA星),这是一种基于离线搜索的轨迹规划器,旨在实现基于可见性的目标跟踪,采用离散的四维时空网格(x, y, z, t)。TA星结合了分层有向无环图(DAG)搜索与三种工程优化:针对边界体积层次(BVH)的跨时间障碍物距离缓存、每层光束修剪以及可配置的多光束可见性评估器。TA星在这个离散图上采用光束修剪的启发式搜索,以高效地找到高质量的跟踪轨迹。虽然它在严格的理论最优性与实用的可扩展性之间进行了权衡,但我们的实证结果表明,在较低的计算成本下,具有稳健的接近基线的可见性表现。在对八个CARLA优化地图进行的1000场景压力测试中,TA星在所有场景中收敛,并在使用32个工作线程的情况下完成于45秒;在与未优化的优先队列A星基线(BinaryHeap实现)进行的248场景受控比较中,在相同的场景输入和5 x 10^6扩展上限下,TA星将平均规划时间减少了23.0倍,最坏情况下的规划时间减少了11.8倍,同时将收敛率从56.9%提高到100%。在n=141的基线收敛子集中,TA星的平均可见性仅变化了-0.15个百分点(pp),且没有场景超过5 pp的下降。我们将TA星定位为在这些特定条件下的实用离线参考规划器,并讨论了在如Town07密集植被等环境中的局限性和失败案例。
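The layered search with per-layer beam pruning described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not TA star itself: the grid is one-dimensional instead of (x, y, z, t), the step cost is the distance to the target as a stand-in for a visibility cost, and `beam_track` and its parameters are hypothetical names.

```python
def beam_track(start, target_path, beam_width):
    """Search a time-layered graph, keeping only the best `beam_width` nodes per layer.

    start        -- integer cell of the tracker at time 0
    target_path  -- target cell at each later time step (one graph layer per step)
    beam_width   -- number of partial trajectories kept after each layer
    Returns (total_cost, path) of the best surviving trajectory.
    """
    beam = [(0.0, [start])]
    for target in target_path:
        best = {}  # cell -> cheapest (cost, path) reaching that cell in this layer
        for cost, path in beam:
            for step in (-1, 0, 1):  # move left, stay, move right
                cell = path[-1] + step
                new_cost = cost + abs(cell - target)  # stand-in for a visibility cost
                if cell not in best or new_cost < best[cell][0]:
                    best[cell] = (new_cost, path + [cell])
        # Beam pruning: drop everything except the cheapest entries of this layer.
        beam = sorted(best.values(), key=lambda entry: entry[0])[:beam_width]
    return beam[0]
```

Because every edge goes from layer t to layer t+1, the graph is acyclic and each layer is expanded exactly once. Pruning bounds the work per layer at the price of optimality guarantees, which matches the trade-off the abstract states.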
cs.RO / 5 / 2605.05339

Passive Fault Tolerance through Tension-to-Thrust Feed-Forward: Hybrid Input-to-State Stability for Decentralized Multi-UAV Slung-Load Transport under Abrupt Cable Severance

通过张力到推力的前馈实现被动故障容错:针对突发缆绳断裂的去中心化多无人机吊载运输的混合输入到状态稳定性
Hajieghrary, Hadi, Schmitt, Paul
Abstract
Abrupt cable severance in multi-UAV slung-load transport redistributes load and changes the active constraint set, leaving limited time for fault diagnosis and reconfiguration. Existing controllers rely on coordinated force allocation, peer-state exchange, or fixed cable topology, and therefore lack a certified decentralized recovery mechanism for unannounced severance. We present a passive architecture that routes each vehicle's measured cable tension directly into its altitude thrust command, $T_i^{\mathrm{ff}}=T_i$, while a surrounding proportional-derivative, anti-swing, and projection cascade preserves local tracking feasibility. The main contribution is a conditional hybrid practical input-to-state-stability certificate that composes a slack-excursion-bounded taut-cable reduction, bounded post-severance Lyapunov jumps, inter-fault decay, and per-fault-cycle contraction $\rho \in (0,1)$ into an explicit recovery envelope under stated actuator, slack, and dwell assumptions. We validate the controller in Drake multibody simulation with five vehicles, a 10 kg payload, Kelvin-Voigt cables, Dryden wind, and single- and dual-severance schedules: the closed loop attains 0.312-0.328 m RMSE, 76.1-95.2 mm peak sag, and recovery within one payload-pendulum period. Disabling the identity inflates cruise error by 34-39% and peak sag by 3.6x-4.0x, identifying local tension feed-forward as the dominant passive recovery mechanism in the tested decentralized cascade.
Chinese Translation
在多无人机吊载运输中,突发的缆绳断裂会重新分配负载并改变活动约束集,导致故障诊断和重新配置的时间有限。现有控制器依赖于协调的力分配、同伴状态交换或固定的缆绳拓扑,因此缺乏针对未公告断裂的认证去中心化恢复机制。我们提出了一种被动架构,将每个无人机测得的缆绳张力直接转化为其高度推力指令 $T_i^{\mathrm{ff}}=T_i$,同时周围的比例-微分、抗摆动和投影级联保持局部跟踪的可行性。主要贡献是一种条件混合实用输入到状态稳定性证书,该证书将松弛-游走-有界的紧缆绳减少、断裂后有界的李雅普诺夫跳跃、故障间衰减和每个故障周期的收缩 $\rho \in (0,1)$ 组合成一个在规定的执行器、松弛和停留假设下的显式恢复包络。我们在Drake多体仿真中验证了该控制器,使用五个无人机、10公斤的有效载荷、Kelvin-Voigt缆绳、Dryden风以及单次和双次断裂调度:闭环系统达到了0.312-0.328米的均方根误差(RMSE)、76.1-95.2毫米的峰值下垂,并在一个有效载荷-摆锤周期内实现了恢复。禁用该恒等式会使巡航误差增加34-39%,峰值下垂增加3.6倍至4.0倍,确定局部张力前馈是测试的去中心化级联中的主导被动恢复机制。
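The feed-forward identity $T_i^{\mathrm{ff}}=T_i$ from this abstract can be shown in a few lines. The sketch below is an assumption-laden simplification: it covers a single vertical axis, adds gravity compensation and a plain PD term, and omits the anti-swing and projection stages. The function name and gains are hypothetical.

```python
def altitude_thrust(z_ref, z, vz, cable_tension, mass, kp, kd, g=9.81):
    """Vertical thrust command for one vehicle, using only local measurements.

    The measured cable tension is added to the command unchanged (T_ff = T),
    so a sudden load redistribution is countered without any message from
    the other vehicles.
    """
    feedback = kp * (z_ref - z) - kd * vz    # PD term on altitude error
    return mass * g + feedback + cable_tension


# Hovering on the reference: thrust equals own weight plus the measured tension.
before = altitude_thrust(z_ref=2.0, z=2.0, vz=0.0, cable_tension=19.62,
                         mass=1.5, kp=4.0, kd=2.0)
# A neighbouring cable is cut and this vehicle's tension jumps by 4.905 N;
# the command rises by the same amount in the same control step.
after = altitude_thrust(z_ref=2.0, z=2.0, vz=0.0, cable_tension=24.525,
                        mass=1.5, kp=4.0, kd=2.0)
```

The point of the design is that the tension sensor already carries the information a fault detector would otherwise have to infer, so no diagnosis or reconfiguration step sits between the severance and the thrust response.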
cs.RO / 6 / 2605.05411

Creative Robot Tool Use by Counterfactual Reasoning

通过反事实推理实现创造性机器人工具使用
Akbulut, M. Tuluhan, Satheesh, Varun, Jaafar, Ahmed, Ahmetoglu, Alper, Parr, Shane, Ganeshan, Aditya, Vats, Shivam, Konidaris, George
Abstract
We propose a causal reasoning framework for creative robot tool use where a suitable tool for a task is correctly identified for use beyond its primary objectives. The proposed framework first discovers the causal relationships between the tool and the task by conducting simulated experiments in a dynamics model. We decouple the causal discovery problem into two complementary components: VLM-based feature suggestion and counterfactual tool generation via targeted geometric and physical feature perturbations. Then, novel objects are classified based on identified causal features, and the tool use skill is transferred via keypoint matching conditioned on the identified causal features. By reconstructing the task in a dynamics model, our approach grounds tool use in the physics of the problem. We illustrate our approach in reaching a distant object with different sticks, scooping candies from a bowl using diverse items, and using different boxes or crates as stepping platforms to retrieve an object from a high shelf. Our baseline comparisons show that identifying causal features and grounding them in physical tool properties leads to more reliable tool selection and stronger skill keypoint transfer.
Chinese Translation
我们提出了一种用于创造性机器人工具使用的因果推理框架,该框架能够正确识别适合特定任务的工具,并超越其主要目标进行使用。该框架首先通过在动态模型中进行模拟实验来发现工具与任务之间的因果关系。我们将因果发现问题解耦为两个互补的组成部分:基于视觉语言模型(VLM)的特征建议和通过针对几何与物理特征扰动生成的反事实工具。然后,根据识别出的因果特征对新颖对象进行分类,并通过基于识别出的因果特征的关键点匹配转移工具使用技能。通过在动态模型中重构任务,我们的方法将工具使用与问题的物理特性相结合。我们展示了该方法在用不同的棍子到达远处物体、使用多种物品从碗中舀取糖果,以及使用不同的箱子或货物作为踏板从高架上取回物体等任务中的应用。我们的基线比较表明,识别因果特征并将其与物理工具属性相结合,可以实现更可靠的工具选择和更强的技能关键点转移。
cs.RO / 7 / 2605.05461

Contact-Free Grasp Stability Prediction with In-Hand Time-of-Flight Sensors

基于无接触的抓取稳定性预测与手内飞行时间传感器
DuFrene, Kyle, Grimm, Cindy
Abstract
Current approaches to grasp planning for robotics demonstrate high success rates, but degrade with noisy sensors and other factors. Previous works have proposed tactile-based grasp stability classifiers to detect failures, but these approaches rely on making contact and grasping the object to do so. We propose a contact-free grasp stability predictor using multi-zone time-of-flight sensors mounted in the distal links of a gripper. Our method, as it does not require grasping the object to make a prediction, significantly speeds up the stability classification process, cycling at 15 Hz. We collected over 2,500 real-world grasps across 15 objects to train a classifier. Additionally, we conducted grasp attempts over six additional unseen objects, three for validation and model selection, and three for model testing. Our approach demonstrated strong classification performance, with an accuracy of 85.5% on validation and 86.0% on test objects.
Chinese Translation
当前的机器人抓取规划方法显示出较高的成功率,但在噪声传感器和其他因素的影响下会降低效果。以往的研究提出了基于触觉的抓取稳定性分类器来检测失败,但这些方法依赖于与物体接触并进行抓取。我们提出了一种基于多区域飞行时间传感器的无接触抓取稳定性预测方法,该传感器安装在夹持器的远端连杆上。由于我们的方法不需要抓取物体即可进行预测,因此显著加快了稳定性分类过程,循环频率达到15 Hz。我们收集了超过2500个真实世界的抓取数据,涵盖15个物体,以训练分类器。此外,我们还对六个额外未见过的物体进行了抓取尝试,其中三个用于验证和模型选择,三个用于模型测试。我们的方法在分类性能上表现出色,在验证集上的准确率为85.5%,在测试集上的准确率为86.0%。
cs.RO / 8 / 2605.05483

Robust $\mathcal{H}_\infty$ Controller Design For INDI-Controlled Quadrotor Using Online Parameter Identification

基于在线参数识别的鲁棒 $\mathcal{H}_\infty$ 控制器设计用于增量非线性动态反演控制的四旋翼
Aantjes, Tom, Blaha, Till M., Theodoulis, Spilios, Smeur, Ewoud J. J.
Abstract
It has recently been shown that all physical parameters of an Incremental Nonlinear Dynamic Inversion (INDI) controller can be estimated onboard a multirotor within half a second, which is fast enough to do the full identification during a throw in the air. However, a robust method to tune outer loop gains for this feedback-linearizing INDI controller depending on the model parameters is still missing. This work presents the design of a robust gain-scheduled controller for attitude control of a quadrotor, using an INDI-based inner loop with online identification of its system parameters. A gain-scheduled cascaded attitude controller with a feedforward filter is synthesized for a symmetric quadrotor using signal-based $\mathcal{H}_\infty$ closed-loop shaping. The resulting controller exhibits good stability margins, with nonlinear simulations confirming effective tracking performance under uncertainty. Experimental evaluation is also conducted through flight tests with full online parameter identification. Even though the identified parameters during these tests are far outside the defined uncertainty range, acceptable flight performance comparable to simulation results is maintained for actuator time constants below 40 ms.
Chinese Translation
最近的研究表明,增量非线性动态反演(INDI)控制器的所有物理参数可以在多旋翼上实时估计,估计时间不到半秒,这足以在空中投掷时完成完整的参数识别。然而,仍然缺乏一种鲁棒的方法来根据模型参数调整该反馈线性化INDI控制器的外环增益。本文提出了一种鲁棒增益调度控制器的设计,用于四旋翼的姿态控制,采用基于INDI的内环并在线识别其系统参数。针对对称四旋翼,合成了一种带前馈滤波器的增益调度级联姿态控制器,使用基于信号的 $\mathcal{H}_\infty$ 闭环整形。所得到的控制器表现出良好的稳定裕度,非线性仿真验证了在不确定性下的有效跟踪性能。通过全在线参数识别的飞行测试也进行了实验评估。尽管在这些测试中识别的参数远超定义的不确定性范围,但在执行器时间常数低于40毫秒的情况下,仍能保持与仿真结果相当的可接受飞行性能。
cs.RO / 9 / 2605.05541

Real-world Latency Analysis of Vehicular Visible Light Communication with Multiple LED Transmitters and an Event-Based Camera

多LED发射器与事件驱动相机的车辆可见光通信的实际延迟分析
Soga, Ryota, Shimizu, Tsukasa, Shiba, Shintaro, Kong, Quan, Lu, Shan, Yamazato, Takaya
Abstract
Event cameras offer high temporal resolution, low latency, and wide dynamic range, making them promising receivers for visible light communication (VLC) in vehicle-to-everything (V2X) applications. This work presents an event-camera-based VLC system addressing three key challenges: bandwidth saturation, multi-transmitter reception, and latency characterization. We adopt a positive-event-only mode and design a protocol that suppresses event generation while maintaining communication distance and a wide field of view. We also propose a method to identify multiple transmitters and demonstrate simultaneous reception from up to three LEDs. Finally, we evaluate end-to-end latency in real vehicular scenarios and show that the system meets cooperative perception requirements. These results demonstrate that event-camera-based VLC is a feasible complement to existing V2X technologies (e.g., RF).
Chinese Translation
事件相机提供高时间分辨率、低延迟和广泛的动态范围,使其成为车辆对一切(V2X)应用中可见光通信(VLC)的有前景的接收器。本研究提出了一种基于事件相机的VLC系统,解决了带宽饱和、多发射器接收和延迟表征三个关键挑战。我们采用仅正事件模式,并设计了一种协议,该协议在保持通信距离和广视场的同时抑制事件生成。我们还提出了一种识别多个发射器的方法,并演示了同时接收多达三个LED的信号。最后,我们在真实的车辆场景中评估了端到端延迟,并表明该系统满足协同感知的要求。这些结果表明,基于事件相机的VLC是现有V2X技术(例如,射频)的一种可行补充。
cs.RO / 10 / 2605.05707

On the Emergence of Pendular Structure in Multi-Contact Locomotion

多接触运动中摆动结构的出现
Lyu, Lingxue, Liu, Zihui
Abstract
LIPM is everywhere in legged-locomotion control, but almost always as a modeling choice rather than as something the controller's cost actually prefers. This note tries to make that link more explicit. Working from a small centroidal OCP that penalizes the rate of angular momentum, we look at what its optimum tends to look like. Three things come out. With full-rank stance, the optimum drifts toward a pendular force pattern at a rate determined by the SVD of the moment Jacobian; the constant is set by foot-span geometry and matches the experiments to within 16%. With N=2 stance, as in trot, the friction cone introduces a lower bound on $\|\dot{H}_G\|$ that no amount of weight tuning fixes; we also see a non-smooth feasibility kink at a critical horizontal acceleration that we can write in closed form. Adding a task term that asks for a nonzero $\dot{H}_G$ moves the optimum off the pendular set in a predictable way. None of this is far from the classical ZMP/DCM picture. We test these claims on a point-mass quadruped and on the Unitree Go1 in MuJoCo (open-loop QP and a torque-level closed-loop controller), and we note where the asymptotic story stops being a good description of what the closed loop actually does.
Chinese Translation
在腿部运动控制中,LIPM(线性倒立摆模型)无处不在,但几乎总是作为一种建模选择,而不是控制器成本实际偏好的东西。本文试图使这一联系更加明确。从一个小的质心最优控制问题(OCP)出发,该问题惩罚角动量的变化率,我们观察其最优解的趋势。得出了三点结论。对于满秩站立,最优解趋向于一个摆动力模式,其速率由力矩雅可比矩阵的奇异值分解(SVD)决定;常数由足部跨度几何形状设定,并与实验结果匹配至16%以内。对于N=2的站立,如在小跑中,摩擦锥引入了一个$\|\dot{H}_G\|$的下界,而无论如何调节权重都无法解决;我们还观察到在一个临界水平加速度处出现了非光滑的可行性拐点,我们可以用封闭形式表示。添加一个要求非零$\dot{H}_G$的任务项会以可预测的方式将最优解移出摆动集。所有这些与经典的ZMP/DCM(零力矩点/发散运动分量)图景并不远离。我们在一个点质量四足机器人和MuJoCo中的Unitree Go1(开环QP和扭矩级闭环控制器)上测试这些主张,并指出渐近故事何时不再是对闭环实际行为的良好描述。
cs.RO / 11 / 2605.05756

MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation

MaMi-HOI:协调全局运动学与局部几何以生成人体-物体交互
Wang, Hao, Wang, Shiqi, Liu, Qi
Abstract
Generating realistic 3D Human-Object Interactions (HOI) is a fundamental task for applications ranging from embodied AI to virtual content creation, and it requires harmonizing high-level semantic intent with strict low-level physical constraints. Existing methods excel at semantic alignment but struggle to maintain precise object contact. We reveal a key finding termed \textit{Geometric Forgetting}: as diffusion model depth increases, semantic features tend to overshadow object geometry features, causing the model to lose its perception of object geometry. To address this, we propose MaMi-HOI, a hierarchical framework reconciling \textbf{Ma}cro-level kinematic fluidity with \textbf{Mi}cro-level spatial precision. First, to counteract geometric forgetting, we introduce the Geometry-Aware Proximity Adapter (GAPA), which explicitly re-injects dense object details to perform residual snapping corrections for precise contact. Nevertheless, such aggressive local enforcement can disrupt global dynamics, leading to robotic stiffness. In response, we introduce the Kinematic Harmony Adapter (KHA), which proactively aligns whole-body posture with spatial objectives, ensuring the skeleton actively accommodates constraints without compromising naturalness. Extensive experiments validate that MaMi-HOI simultaneously achieves natural motion and precise contact. Crucially, it extends generation capabilities to long-term tasks with complex trajectories, effectively bridging the gap between global navigation and high-fidelity manipulation in 3D scenes. Code is available at https://github.com/DON738110198/MaMi-HOI.git
Chinese Translation
生成逼真的三维人体-物体交互(HOI)是从具身人工智能到虚拟内容创作等应用的基础任务,这需要将高层次的语义意图与严格的低层次物理约束协调起来。现有方法在语义对齐方面表现出色,然而,它们在保持精确的物体接触方面存在困难。我们揭示了一个关键发现,称为 \textit{几何遗忘}:随着扩散模型深度的增加,语义特征往往会掩盖物体几何特征,导致模型失去对物体几何的感知。为了解决这个问题,我们提出了MaMi-HOI,一个层次化框架,调和 \textbf{宏观}(Macro)级别的运动流畅性与 \textbf{微观}(Micro)级别的空间精确性。首先,为了对抗几何遗忘,我们引入了几何感知邻近适配器(Geometry-Aware Proximity Adapter, GAPA),该适配器明确地重新注入密集的物体细节,以执行残差捕捉修正以确保精确接触。然而,这种激进的局部强制可能会破坏全局动态,导致机器人僵硬。对此,我们引入了运动和谐适配器(Kinematic Harmony Adapter, KHA),该适配器主动将全身姿态与空间目标对齐,确保骨架在不妥协自然性的情况下积极适应约束。大量实验验证了MaMi-HOI能够同时实现自然运动和精确接触。重要的是,它扩展了生成能力,以应对具有复杂轨迹的长期任务,有效地弥合了三维场景中全局导航与高保真操控之间的差距。代码可在 https://github.com/DON738110198/MaMi-HOI.git 获取。
cs.RO / 12 / 2605.05797

Resource-Constrained Robotic Planning in the face of Mixed Uncertainty

面对混合不确定性的资源受限机器人规划
Yin, Yihao, Yu, Pian, Turrini, Andrea, Chi, Zhiming, Li, Yong, Zhang, Lijun
Abstract
Robots operate under significant uncertainty, from quantifiable noise to unquantifiable unknowns, and must account for strict operational constraints, such as limited resources. In this paper, we consider the problem of synthesizing robust strategies to guide a robot's actions in fulfilling a given task, while ensuring the system never exhausts its resources. To solve this problem, we first model the robotic system as a Consumption Markov Decision Process with Set-valued Transitions (CMDPST), a unified framework modelling nondeterministic actions, quantifiable and unquantifiable uncertainty, and resource consumption. Then, we combine the CMDPST with the task specification, expressed as a Linear Temporal Logic over finite traces (LTLf) formula. Lastly, we address the resource-constrained optimal robust strategy synthesis problem, which aims to synthesize a strategy that maximizes the probability of satisfying the LTLf objective without resource exhaustion. Our solution involves two techniques: a direct unrolling-based method and a more efficient, optimized approach that leverages state-space pruning for better performance. Experiments on a warehouse transportation network show the effectiveness of the proposed solutions.
Chinese Translation
机器人在显著的不确定性下运行,从可量化的噪声到不可量化的未知因素,并且必须考虑严格的操作约束,例如资源有限。本文考虑了合成稳健策略的问题,以指导机器人在完成给定任务时的行动,同时确保系统不会耗尽其资源。为了解决这个问题,我们首先将机器人系统建模为带有集合值转移的消费马尔可夫决策过程(Consumption Markov Decision Process with Set-valued Transitions,CMDPST),这是一个统一框架,能够建模非确定性动作、可量化和不可量化的不确定性以及资源消耗。然后,我们将CMDPST与任务规范结合,后者以有限轨迹上的线性时序逻辑(Linear Temporal Logic over finite traces,LTLf)公式表示。最后,我们解决了资源受限的最优稳健策略合成问题,旨在合成一种策略,以最大化满足LTLf目标而不耗尽资源的概率。我们的解决方案涉及两种技术:一种基于直接展开的方法和一种更高效的优化方法,后者利用状态空间剪枝以提高性能。在仓库运输网络上的实验表明了所提解决方案的有效性。
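The direct unrolling idea mentioned in this abstract can be sketched by pairing each state with the remaining resource level. The sketch makes several assumptions that are not from the paper: the LTLf objective is replaced by plain reachability of a `goal` state, there are no reload states, every action consumes at least one unit (so the recursion terminates), and the unknown choice inside each successor set is resolved pessimistically. The toy `MODEL` and all names are hypothetical.

```python
from functools import lru_cache

# MODEL[state][action] = (consumption, [(probability, set_of_possible_successors), ...])
# The probability mass goes to a *set*; which member occurs is unquantified.
MODEL = {
    "dock": {
        "go": (1, [(0.8, frozenset({"aisle"})),
                   (0.2, frozenset({"dock", "aisle"}))]),
    },
    "aisle": {
        "deliver": (2, [(1.0, frozenset({"goal"}))]),
        "shortcut": (1, [(0.5, frozenset({"goal"})),
                         (0.5, frozenset({"dock"}))]),
    },
    "goal": {},
}


@lru_cache(maxsize=None)
def robust_reach_probability(state, resource):
    """Best worst-case probability of reaching 'goal' with `resource` units left."""
    if state == "goal":
        return 1.0
    best = 0.0
    for consumption, branches in MODEL[state].values():
        if consumption > resource:
            continue  # this action would exhaust the resource
        value = sum(
            prob * min(robust_reach_probability(nxt, resource - consumption)
                       for nxt in successors)  # adversary picks within the set
            for prob, successors in branches
        )
        best = max(best, value)
    return best
```

The memoized pairs `(state, resource)` are exactly the unrolled state space. Its size grows with the resource capacity, which is presumably the cost that the pruning-based approach in the abstract is meant to reduce.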
cs.RO / 13 / 2605.05825

A Comparative Study of INDI and NDI with Nonlinear Disturbance Observer for Aerial Robotics

增量非线性动态反演与增设非线性干扰观测器的非线性动态反演在空中机器人中的比较研究
Rota, Benedetta, Mizzoni, Mirko, Afifi, Amr, van Goor, Pieter, Franchi, Antonio
Abstract
This work presents a simulation-based comparative robustness analysis of Incremental Nonlinear Dynamic Inversion (INDI) and Nonlinear Dynamic Inversion augmented with a nonlinear disturbance observer (NDI+NDO) for fully actuated aerial robots. A systematic simulation campaign across representative operating scenarios is conducted, where we compare tracking performance, robustness, and control effort under parametric variations, external disturbances, and measurement noise. Results show that INDI demonstrates stronger robustness in several model-mismatch and combined-stress cases, while NDI+NDO primarily matches nominal performance but exhibits greater sensitivity under several non-ideal conditions. These findings provide practical guidance on the relative strengths and limitations of incremental and observer-based inversion strategies for aerial robotic applications.
Chinese Translation
本研究基于仿真进行了增量非线性动态反演(INDI)与增设非线性干扰观测器的非线性动态反演(NDI+NDO)在完全驱动空中机器人中的比较鲁棒性分析。我们在代表性的操作场景中开展了一系列系统的仿真实验,比较了在参数变化、外部干扰和测量噪声下的跟踪性能、鲁棒性和控制努力。结果表明,INDI在多个模型失配和组合应力情况下表现出更强的鲁棒性,而NDI+NDO主要匹配名义性能,但在多个非理想条件下表现出更大的敏感性。这些发现为增量反演和基于观测器的反演策略在空中机器人应用中的相对优势和局限性提供了实用指导。
cs.RO / 14 / 2605.05875

Cycle-resolved Cephalopod-Inspired Pulsed-Jet Robot With High-Volume Expulsion and Drag-Reduced Gliding

循环解析的章鱼启发脉冲喷射机器人:高体积排放与减阻滑行
Zhang, Yiyuan, Zhong, Anye, Chen, Junkai, Xin, Wenci, Laschi, Cecilia
Abstract
Cephalopod pulsed-jet locomotion is not a single isolated expulsion event, but a coordinated cycle involving jet expulsion, passive gliding, and mantle refilling. Inspired by this cycle-resolved biological strategy, this paper presents a cephalopod-inspired pulsed-jet robot with a rigid-soft hybrid origami mantle that enables large, actively driven, and geometry-guided body deformation. The proposed mantle integrates rigid folding panels with a compliant silicone framework, allowing a 75% effective cavity-volume reduction during expulsion and reducing the projected cross-sectional drag area by approximately 75.7% in the contracted gliding configuration. Using this platform, we formulate a cycle-resolved framework to separately investigate how expelled volume, glide duration, and refill pathway influence whole-cycle locomotion performance. Experiments show that the robot reaches a peak speed of approximately 0.5 m/s (3.8 BL/s) and an average speed exceeding 0.2 m/s (1.5 BL/s) within the first jetting cycle. The results further demonstrate the roles of high expelled-volume-ratio contraction in speed generation, reduced-drag-area gliding under different glide durations, and mantle-aperture-inspired passive inlet valves in assisting refill. This work provides both a robotic implementation of actively deformable cephalopod-like jet propulsion and a unified experimental platform for studying expulsion-gliding-refilling dynamics in pulsed-jet locomotion.
Chinese Translation
章鱼的脉冲喷射运动并非单一的排放事件,而是一个协调的循环过程,涉及喷射排放、被动滑行和外套的补充。受到这种循环解析生物策略的启发,本文提出了一种章鱼启发的脉冲喷射机器人,采用刚性-柔性混合折纸外套,能够实现大幅度、主动驱动和几何引导的身体变形。所提出的外套将刚性折叠面板与柔性硅胶框架结合,在排放过程中实现75%的有效腔体体积减少,并在收缩滑行状态下将投影横截面阻力面积减少约75.7%。基于该平台,我们构建了一个循环解析框架,以分别研究排放体积、滑行持续时间和补充路径如何影响整个循环的运动性能。实验表明,该机器人在第一次喷射循环中达到约0.5 m/s(3.8 BL/s)的峰值速度,平均速度超过0.2 m/s(1.5 BL/s)。结果进一步展示了高排放体积比收缩在速度生成中的作用、不同滑行持续时间下的减阻滑行,以及受外套开口启发的被动进气阀在辅助补充中的作用。本研究不仅提供了一种主动可变形的章鱼类喷射推进的机器人实现,还为研究脉冲喷射运动中的排放-滑行-补充动态提供了一个统一的实验平台。
cs.RO / 15 / 2605.05897

Generating Roadside LiDAR Datasets from Vehicle-Side Datasets via Novel View Synthesis

通过新视角合成从车辆侧数据集生成路边LiDAR数据集
Xia, Yuhan, Zhao, Runxin, Zhuang, Hanyang, Wang, Chunxiang, Yang, Ming
Abstract
Intelligent Transportation Systems (ITS) require reliable environmental perception to support safe and efficient transportation. With the rapid development of Vehicle-to-everything (V2X), roadside perception has become an effective means to extend sensing coverage and improve traffic safety. However, the scarcity of large-scale annotated roadside LiDAR datasets poses a major challenge for training high-performance roadside perception models. In this paper, we introduce Vehicle-to-Roadside LiDAR Synthesis (VRS), a data synthesis framework that generates labeled roadside LiDAR datasets from vehicle-side datasets via LiDAR novel view synthesis. To mitigate the vehicle-to-roadside domain gap, VRS employs vehicle point cloud completion to compensate for missing geometry in vehicle-side observations, and introduces an occupancy-based visibility constraint to handle large viewpoint changes during cross-view rendering. The proposed framework enables flexible multi-view rendering for scalable roadside data generation. Extensive experiments on roadside 3D object detection demonstrate that the synthesized data effectively complements real roadside data, mitigates the limitations of limited real-world roadside data, and improves generalization to unseen roadside viewpoints.
Chinese Translation
智能交通系统(ITS)需要可靠的环境感知来支持安全高效的交通。随着车联网(V2X)的快速发展,路边感知已成为扩展感知覆盖范围和提高交通安全的有效手段。然而,大规模标注的路边LiDAR数据集的稀缺对训练高性能的路边感知模型构成了重大挑战。本文介绍了一种车辆到路边LiDAR合成(Vehicle-to-Roadside LiDAR Synthesis, VRS)数据合成框架,该框架通过LiDAR新视角合成从车辆侧数据集中生成标注的路边LiDAR数据集。为了减轻车辆到路边的领域差距,VRS采用车辆点云补全来弥补车辆侧观测中的几何缺失,并引入基于占用的可见性约束来处理跨视角渲染中的大视角变化。所提出的框架实现了灵活的多视角渲染,以便进行可扩展的路边数据生成。在路边3D目标检测上的大量实验表明,合成数据有效补充了真实路边数据,减轻了有限真实世界路边数据的局限性,并提高了对未见路边视点的泛化能力。
cs.RO / 16 / 2605.05925

DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

DexSynRefine:合成和优化人机交互运动以实现物理可行的灵巧机器人动作
Lee, Hyesung, Jung, Hyunwoo, Heo, Si-Hwan, Yang, Sungwook
Abstract
Learning dexterous manipulation from human-object interaction (HOI) data is a scalable alternative to teleoperation, but HOI demonstrations are sparse and provide only kinematic motion that is not directly executable under embodiment mismatch and contact-rich dynamics. We present DexSynRefine, a framework with three coupled components: HOI-MMFP, a task- and object-initial-state-conditioned extension of motion manifold primitives that synthesizes coordinated hand-object trajectories from sparse HOI demonstrations; a task-space residual RL policy that physically grounds the synthesized reference while inheriting its kinematic structure; and a contact-and-dynamics adaptation module that enables sim-to-real transfer from proprioceptive history. Across five dexterous manipulation tasks spanning pick-and-place, tool use, and object reorientation, our task-space residual policy outperforms prior action-representation baselines in simulations and transfers to a real robot on all five tasks, improving over kinematic retargeting by 50-70 percentage points.
Chinese Translation
从人机交互(HOI)数据中学习灵巧操作是一个可扩展的替代方案,相较于遥操作,但HOI演示稀疏,仅提供在体现不匹配和接触丰富动态下不可直接执行的运动学动作。我们提出了DexSynRefine,一个包含三个耦合组件的框架:HOI-MMFP,一个基于任务和对象初始状态的运动流形原语扩展,从稀疏的HOI演示中合成协调的手-物体轨迹;一个任务空间残差强化学习(RL)策略,该策略在继承合成参考的运动学结构的同时,将其物理基础化;以及一个接触和动态适应模块,使得从本体感觉历史中实现模拟到现实的转移。在涵盖抓取与放置、工具使用和物体重新定向的五个灵巧操作任务中,我们的任务空间残差策略在模拟中优于先前的动作表示基线,并在所有五个任务中成功转移到真实机器人,相较于运动学重定向提高了50-70个百分点。
cs.RO / 17 / 2605.05960

Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation

即插即用标签图扩散用于通用目标导向导航
Shen, Zhixuan, Zeng, Yijie, Luo, Shengxiang, Li, Tianrui, Luo, Haonan
Abstract
In embodied vision, Goal-Oriented Navigation (GON) requires robots to locate a specific goal within an unexplored environment. The primary challenge of GON arises from the need to construct a Bird's-Eye-View (BEV) map to understand the environment while simultaneously localizing an unobserved goal. Existing map-based methods typically employ self-centered semantic maps, often facing challenges such as reliance on complete maps or inconsistent semantic association. To this end, we propose Plug-and-Play Label Map Diffusion (PLMD), which defines a novel map completion diffusion model based on Denoising Diffusion Probabilistic Models (DDPM). PLMD generates obstacle and semantic labels for unobserved regions through a diffusion-based completion process, thereby enabling goal localization even in partially observed environments. Moreover, it mitigates inconsistent semantic association by leveraging structural consistency between known and unknown obstacle layouts and integrating obstacle priors into the semantic denoising process. By substituting predicted labels for unobserved regions, robots can accurately localize the specified objects. Extensive experiments demonstrate that PLMD \textbf{(I)} effectively expands the region of unknown maps, \textbf{(II)} integrates seamlessly into existing navigation strategies that rely on semantic maps, \textbf{(III)} achieves state-of-the-art performance on three GON tasks.
Chinese Translation
在具身视觉中,目标导向导航(Goal-Oriented Navigation, GON)要求机器人在未探索的环境中定位特定目标。GON的主要挑战在于需要构建鸟瞰图(Bird's-Eye-View, BEV)地图以理解环境,同时对未观察到的目标进行定位。现有的基于地图的方法通常采用自中心的语义地图,常常面临依赖完整地图或语义关联不一致等挑战。为此,我们提出了即插即用标签图扩散(Plug-and-Play Label Map Diffusion, PLMD),该方法基于去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPM)定义了一种新颖的地图补全扩散模型。PLMD通过基于扩散的补全过程为未观察区域生成障碍物和语义标签,从而使得即使在部分观察到的环境中也能实现目标定位。此外,它通过利用已知和未知障碍布局之间的结构一致性,并将障碍先验融入语义去噪过程,减轻了语义关联不一致的问题。通过替代未观察区域的预测标签,机器人能够准确定位指定对象。大量实验表明,PLMD (I) 有效扩展了未知地图的区域,(II) 无缝集成到依赖语义地图的现有导航策略中,(III) 在三个GON任务上实现了最先进的性能。
cs.RO / 18 / 2605.06042

Accurate Trajectory Tracking with MPCC for Flapping-Wing MAVs

基于MPCC的扑翼微型飞行器精确轨迹跟踪
Toumieh, Charbel, Zeng, Jack, Mistry, Niel, Floreano, Dario
Abstract
Flapping-wing micro aerial vehicles offer quieter and safer operation than rotary-wing drones, yet achieving precise autonomous control of bird-scale ornithopters remains challenging: lift, airspeed, and turning authority are tightly coupled and governed by only a few control inputs. Conventional cascaded controllers treat altitude, speed, and heading independently, producing persistent tracking errors during complex maneuvers, while time-parameterized trajectory tracking requires predefined speed profiles that existing methods cannot robustly produce for these coupled dynamics. We address both limitations simultaneously with a Model Predictive Contouring Control (MPCC) approach that tracks arc-length-parameterized trajectories while optimizing progress online, eliminating the need for predefined timing. However, MPCC requires a dynamical model that captures the coupled aerodynamics without exceeding the computational budget of real-time nonlinear optimization. Here, we propose a compact, continuously differentiable model that captures the dominant couplings of bird-scale ornithopters, enabling real-time predictive control. We validated the method with the XFly ornithopter flying along circular and three-dimensional racing trajectories and achieved a mean deviation from the reference trajectory between 6.5 and 9 cm at speeds up to 3 m/s, which represents an almost 10-fold improvement over prior ornithopter control methods.
Chinese Translation
扑翼微型飞行器相比于旋翼无人机提供了更安静和安全的操作,但实现鸟类规模的鸟型飞行器的精确自主控制仍然具有挑战性:升力、空速和转向能力紧密耦合,并且仅由少数控制输入所支配。传统的级联控制器独立处理高度、速度和航向,在复杂机动过程中产生持续的跟踪误差,而时间参数化的轨迹跟踪需要预定义的速度曲线,而现有方法无法为这些耦合动力学稳健地生成。我们通过模型预测轮廓控制(MPCC)方法同时解决了这两个限制,该方法跟踪弧长参数化的轨迹,同时在线优化进展,消除了对预定义时间的需求。然而,MPCC需要一个动态模型,该模型能够捕捉耦合的空气动力学,同时不超过实时非线性优化的计算预算。在此,我们提出了一个紧凑的、连续可微的模型,捕捉鸟类规模鸟型飞行器的主要耦合,从而实现实时预测控制。我们使用XFly鸟型飞行器沿圆形和三维竞速轨迹飞行验证了该方法,并在最高速度达到3 m/s时,参考轨迹的平均偏差在6.5到9厘米之间,这代表了对先前鸟型飞行器控制方法近10倍的改进。
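The contouring formulation named in this abstract can be illustrated with a small sketch that splits a position error into lag and contouring components against an arc-length-parameterized circular path. This is a generic MPCC-style error decomposition, not the authors' ornithopter model; the function names `circle_reference` and `mpcc_errors`, the circular path, and the sign convention (outward normal positive) are assumptions made for this example.

```python
import math

def circle_reference(theta, radius):
    """Point and unit tangent of a circle parameterized by arc length theta."""
    phi = theta / radius  # convert arc length to angle
    point = (radius * math.cos(phi), radius * math.sin(phi))
    tangent = (-math.sin(phi), math.cos(phi))
    return point, tangent

def mpcc_errors(position, theta, radius):
    """Return (lag_error, contouring_error) of `position` at path progress theta."""
    (rx, ry), (tx, ty) = circle_reference(theta, radius)
    ex, ey = position[0] - rx, position[1] - ry
    lag = tx * ex + ty * ey        # component along the path tangent
    contour = ty * ex - tx * ey    # component along the outward normal (ty, -tx)
    return lag, contour
```

An MPCC cost would typically penalize both error terms while rewarding progress in `theta`; the weighting and the vehicle dynamics are outside the scope of this sketch.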
cs.RO / 19 / 2605.06062

Monitoring autonomous persistent surveillance missions using invariance

利用不变性监测自主持久监视任务
Nenchev, Vladislav, Sotiriadis, Prodromos
Abstract
This paper studies runtime monitoring for persistent surveillance by autonomous robots when the autonomy stack is a black box. The environment is partitioned into finitely many parts, each carrying an uncertainty state that decreases when observed and increases otherwise. We model the closed loop as a state-dependent hybrid system with linear parameter varying dynamics and design a monitor based on an invariant computed offline. As this invariant is typically hard to obtain for large to-be-surveyed spaces, we propose a compositional monitor obtained by decentralized computation of low-dimensional invariant sets for each uncertainty region, and checking their conjunction online. Under common independence assumptions, the compositional monitor is sound and complete with respect to the full-system invariant. The approach is applied in a case study with a real robot persistently monitoring a labyrinth, emphasizing its applicability in practice.
Chinese Translation
本文研究了在自主机器人进行持久监视时,运行时监测的情况,尤其是在自主系统被视为黑箱的情况下。环境被划分为有限多个部分,每个部分携带一个不确定性状态,该状态在被观察时减少,而在未被观察时增加。我们将闭环系统建模为一个状态依赖的混合系统,具有线性参数变化的动态特性,并设计了一个基于离线计算的不变性监测器。由于对于大规模待监测空间而言,获取此不变性通常较为困难,我们提出了一种组合监测器,通过对每个不确定性区域的低维不变集进行分散计算,并在线检查它们的结合。根据常见的独立性假设,组合监测器在全系统不变性方面是健全且完整的。该方法在一个真实机器人持续监测迷宫的案例研究中得到了应用,强调了其在实践中的适用性。
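To make the compositional idea in this abstract concrete, the following sketch keeps one scalar uncertainty per region, shrinks it when the region is observed and grows it otherwise, and checks the conjunction of per-region conditions online. It is only an illustration under strong assumptions: the decay and growth rates, the interval bound standing in for an invariant set computed offline, and all function names are hypothetical and do not come from the paper, which uses linear parameter varying dynamics.

```python
def step_uncertainty(u, observed, decay=0.5, growth=0.1, dt=1.0):
    """Advance one region's uncertainty: shrink if observed, grow otherwise (assumed rates)."""
    rate = -decay * u if observed else growth
    return max(0.0, u + rate * dt)

def region_ok(u, bound):
    """Stand-in for a per-region invariant set computed offline: the interval [0, bound]."""
    return 0.0 <= u <= bound

def compositional_monitor(uncertainties, bounds):
    """Online check: conjunction of all per-region invariant checks."""
    return all(region_ok(u, b) for u, b in zip(uncertainties, bounds))

def run(schedule, bounds, u0):
    """Replay a visit schedule (index of the observed region per step).

    Returns the index of the first step at which the monitor raises an alarm,
    or None if no alarm is raised.
    """
    u = list(u0)
    for k, visited in enumerate(schedule):
        u = [step_uncertainty(ui, i == visited) for i, ui in enumerate(u)]
        if not compositional_monitor(u, bounds):
            return k
    return None
```

The monitor never inspects how the schedule was produced, which mirrors the black-box treatment of the autonomy stack: only the observed visits and the per-region checks matter.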
cs.RO / 20 / 2605.06175

VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

VLA-GSE:通过通用和专业专家提升VLA中的参数高效微调
Jiang, Yuhua, Lu, Junjie, Qin, Xinyao, Chen, Xiaoyu, Wang, Kaixin, Gao, Feifei, Zhao, Li
Abstract
Vision-language-action (VLA) models inherit rich visual-semantic priors from pre-trained vision-language backbones, but adapting them to robotic control remains challenging. Full fine-tuning (FFT) is prone to overfitting on downstream robotic data and catastrophic forgetting of pretrained vision-language capabilities. Parameter-efficient fine-tuning (PEFT) better preserves pre-trained knowledge, yet existing PEFT methods still struggle to adapt effectively to robot control tasks. To address this gap, we propose VLA-GSE, a parameter-efficient VLA fine-tuning framework that improves control adaptation while retaining PEFT's knowledge preservation advantage. Specifically, VLA-GSE (Generalized and Specialized Experts) is initialized by spectrally decomposing the frozen backbone, assigning leading singular components to generalized experts (shared experts) and disjoint residual components to specialized experts (routed experts). This decomposition improves adaptation capacity under a fixed trainable-parameter budget. Under a comparable parameter budget, VLA-GSE updates only 2.51% of the full model parameters and consistently outperforms strong FFT and PEFT baselines. It achieves 81.2% average zero-shot success on LIBERO-Plus, preserves pre-trained VLM capability comparably to LoRA on multimodal understanding benchmarks, and improves real-world manipulation success under multiple distribution shifts. Code is available at: https://github.com/YuhuaJiang2002/VLA-GSE
Chinese Translation
视觉-语言-动作(VLA)模型从预训练的视觉-语言骨干网络中继承了丰富的视觉-语义先验,但将其适应于机器人控制仍然具有挑战性。完整微调(FFT)容易在下游机器人数据上过拟合,并导致预训练视觉-语言能力的灾难性遗忘。参数高效微调(PEFT)更好地保留了预训练知识,但现有的PEFT方法在有效适应机器人控制任务方面仍然面临困难。为了解决这一问题,我们提出了VLA-GSE,一个参数高效的VLA微调框架,旨在提高控制适应性,同时保留PEFT的知识保留优势。具体而言,VLA-GSE(通用和专业专家)通过对冻结的骨干网络进行谱分解进行初始化,将主导奇异成分分配给通用专家(共享专家),并将不相交的残差成分分配给专业专家(路由专家)。这种分解在固定的可训练参数预算下提高了适应能力。在相当的参数预算下,VLA-GSE仅更新2.51%的完整模型参数,并始终优于强大的FFT和PEFT基线。它在LIBERO-Plus上实现了81.2%的平均零-shot成功率,在多模态理解基准上保留了与LoRA相当的预训练VLM能力,并在多个分布转移下改善了现实世界的操作成功率。代码可在以下网址获取:https://github.com/YuhuaJiang2002/VLA-GSE
cs.RO / 21 / 2605.06222

When to Trust Imagination: Adaptive Action Execution for World Action Models

何时信任想象:针对世界行动模型的自适应行动执行
Wang, Rui, Zhang, Yue, Lin, Jiehong, Luo, Kuncheng, Wang, Jianan, Wang, Zhongrui, Qi, Xiaojuan
Abstract
World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute longer when the WAM-predicted future remains reliable, and replan earlier when reality deviates from imagination. To this end, we propose Future Forward Dynamics Causal Attention (FFDC), a lightweight verifier that jointly reasons over predicted future actions, predicted visual dynamics, real observations, and language instructions to estimate whether the remaining action rollout can still be trusted. FFDC enables adaptive action chunk sizes as an emergent consequence of prediction-observation consistency, preserving the efficiency of long-horizon execution while restoring responsiveness in contact-rich or difficult phases. We further introduce Mixture-of-Horizon Training to improve long-horizon trajectory coverage for adaptive execution. Experiments on the RoboTwin benchmark and in the real world demonstrate that our method achieves a strong robustness-efficiency trade-off: on RoboTwin, it reduces WAM forward passes by 69.10% and execution time by 34.02%, while improving success rate by 2.54% over the short-chunk baseline; in real-world experiments, it improves success rate by 35%.
Chinese Translation
世界行动模型(World Action Models, WAMs)最近作为一种有前景的机器人操作范式出现,通过共同预测未来的视觉观察和未来的行动。然而,当前的WAMs通常在每次模型推理后执行固定数量的预测行动,这使得机器人无法判断想象的未来是否与实际的物理展开一致。在本研究中,我们将自适应WAM执行形式化为一个未来-现实验证问题:当WAM预测的未来保持可靠时,机器人应执行更长时间;当现实偏离想象时,应提前重新规划。为此,我们提出了未来前向动力学因果注意力(Future Forward Dynamics Causal Attention, FFDC),这是一种轻量级的验证器,它共同推理预测的未来行动、预测的视觉动态、真实观察和语言指令,以估计剩余的行动展开是否仍然值得信任。FFDC使得自适应行动块大小成为预测-观察一致性的自然结果,保持了长时间执行的效率,同时在接触丰富或困难阶段恢复了响应性。我们进一步引入了混合视野训练(Mixture-of-Horizon Training)以改善自适应执行的长视野轨迹覆盖。在RoboTwin基准测试和实际实验中,实验结果表明我们的方法实现了强大的鲁棒性-效率权衡:在RoboTwin上,它将WAM前向传递减少了69.10%,执行时间减少了34.02%,同时成功率提高了2.54%,相较于短块基线;在实际实验中,它的成功率提高了35%。
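The execute-longer-or-replan-earlier behaviour described in this abstract can be sketched as a control loop that keeps executing a predicted action chunk while a verifier still trusts it. This is a minimal sketch, not FFDC itself: `predict_chunk`, `trust_score`, and `step_env` are hypothetical callables supplied by the caller, and the fixed threshold is an assumption.

```python
def run_episode(predict_chunk, trust_score, step_env, obs, threshold=0.5, max_steps=32):
    """Execute predicted action chunks, replanning early when trust drops.

    predict_chunk(obs) -> (actions, predicted_obs)    # hypothetical world action model
    trust_score(predicted, actual) -> float in [0, 1] # hypothetical verifier
    step_env(action) -> next observation
    Returns (final_obs, number_of_model_calls).
    """
    steps = calls = 0
    while steps < max_steps:
        actions, predicted = predict_chunk(obs)
        calls += 1
        if not actions:
            break  # nothing to execute; avoid looping forever
        for action, expected in zip(actions, predicted):
            obs = step_env(action)
            steps += 1
            if steps >= max_steps or trust_score(expected, obs) < threshold:
                break  # budget spent or reality diverged: replan from the latest observation
    return obs, calls
```

With a perfectly predictable environment the loop calls the model once per full chunk; a single mismatch between prediction and observation cuts the current chunk short and triggers an extra model call.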
cs.RO / 22 / 2605.06234

RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI

RobotEQ:从被动智能到具身人工智能中的主动智能的转变
Fang, Kuofei, Che, Xinyi, Ouyang, Haomin, Zhang, Shufan, Wang, Xuehao, Liu, Qi, Liu, Liyi, Zhang, Chenqi, Cai, Wenxi, Dai, Wenyu, Wu, Jinyang, Zhang, Fan, Chen, Haoyu, He, Bin, Lian, Zheng
Abstract
Embodied AI is a prominent research topic in both academia and industry. Current research centers on completing tasks based on explicit user instructions. However, for robots to integrate into human society, they must understand which actions are permissible and which are prohibited, even without explicit commands. We refer to the user-guided AI as passive intelligence and the unguided AI as active intelligence. This paper introduces RobotEQ, the first benchmark for active intelligence, aiming to assess whether existing models can comprehend and adhere to social norms in embodied scenarios. First, we construct RobotEQ-Data, a dataset consisting of 1,900 egocentric images, spanning 10 representative embodied categories and 56 subcategories. Through extensive manual annotation, we provide 5,353 action judgment questions and 1,286 spatial grounding questions, specifying appropriate robot actions across diverse scenarios. Furthermore, we establish RobotEQ-Bench to evaluate the performance of state-of-the-art models on this task. Experimental results show that current models still fall short in achieving reliable active intelligence, particularly in spatial grounding. Meanwhile, we observe that leveraging RAG techniques to incorporate external social norm knowledge bases can generally enhance performance. This work can facilitate the transition of robotics from user-guided passive manipulation to active social compliance.
Chinese Translation
具身人工智能是学术界和工业界的一个重要研究主题。目前的研究主要集中在根据明确的用户指令完成任务。然而,为了使机器人能够融入人类社会,它们必须理解哪些行为是允许的,哪些是禁止的,即使没有明确的指令。我们将用户引导的人工智能称为被动智能,而将无指导的人工智能称为主动智能。本文介绍了RobotEQ,这是第一个主动智能的基准,旨在评估现有模型是否能够理解并遵循具身场景中的社会规范。首先,我们构建了RobotEQ-Data,这是一个包含1,900张自我中心图像的数据集,涵盖10个代表性的具身类别和56个子类别。通过广泛的人工标注,我们提供了5,353个行为判断问题和1,286个空间定位问题,具体说明了在不同场景中适当的机器人行为。此外,我们建立了RobotEQ-Bench,以评估最先进模型在此任务上的表现。实验结果表明,当前模型在实现可靠的主动智能方面仍然存在不足,特别是在空间定位方面。同时,我们观察到利用RAG技术结合外部社会规范知识库通常可以提升性能。这项工作可以促进机器人从用户引导的被动操作向主动的社会合规转变。
cs.RO / 23 / 2605.06247

CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models

CKT-WAM:世界行动模型之间的参数高效上下文知识转移
Jiang, Yuhua, Guo, Yijun, Yang, Hongbing, Lei, Guojun, Chen, Nuo, Zhang, Yinuo, Yan, Shaoqiang, Lin, Bo, Gao, Feifei, Qi, Biqing
Abstract
World action models (WAMs) provide a powerful generative framework for embodied control, yet transferring knowledge across heterogeneous WAMs remains challenging due to mismatched latent interfaces, high adaptation cost, and the rigidity of conventional distillation objectives. We propose CKT-WAM, a parameter-efficient Context Knowledge Transfer framework that transfers teacher WAM's knowledge into a student WAM through a compact context in the text embedding space, rather than output imitation or dense hidden-state matching. Specifically, CKT-WAM extracts intermediate teacher hidden states, reduces the number of tokens via compressors' learnable-query cross attention (LQCA), and transforms them through an always-on generalized adapter, a lightweight router, and sparsely activated specialized adapters. The resulting context is then appended to the student's conditioning textual embeddings, thereby injecting the transferred knowledge into the student with minimal architectural modification. Experiments show that CKT-WAM consistently improves zero-shot generalization and achieves the best overall performance on LIBERO-Plus, reaching 86.1% total success rate with only 1.17% trainable parameters, while approaching full fine-tuning performance. Beyond simulation, CKT-WAM also demonstrates strong real-world long-horizon manipulation ability, achieving the best average success rate of 83.3% across four multi-step and long-horizon tasks. Code is available at https://github.com/YuhuaJiang2002/CKT-WAM.
Chinese Translation
世界行动模型(WAMs)为具身控制提供了强大的生成框架,然而,由于潜在接口不匹配、高适应成本以及传统蒸馏目标的刚性,跨异构WAMs转移知识仍然具有挑战性。我们提出了CKT-WAM,一种参数高效的Context Knowledge Transfer框架,通过文本嵌入空间中的紧凑上下文将教师WAM的知识转移到学生WAM,而不是通过输出模仿或密集隐藏状态匹配。具体而言,CKT-WAM提取中间教师隐藏状态,通过压缩器的可学习查询交叉注意力(LQCA)减少标记数量,并通过始终开启的广义适配器、轻量级路由器和稀疏激活的专用适配器进行转换。生成的上下文随后附加到学生的条件文本嵌入中,从而以最小的架构修改将转移的知识注入学生。实验表明,CKT-WAM在零-shot泛化方面始终表现出色,并在LIBERO-Plus上实现了最佳整体性能,以仅1.17%的可训练参数达到了86.1%的总成功率,同时接近完全微调性能。超越模拟,CKT-WAM还展示了强大的现实世界长时间操作能力,在四个多步骤和长时间任务中实现了83.3%的最佳平均成功率。代码可在 https://github.com/YuhuaJiang2002/CKT-WAM 获取。
cs.RO / 24 / 2605.06311

Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation

迈向视觉真实的仿真:评估机器人操作的仿真基准
Zhu, Yixin, Wang, Zixiong, Yang, Jian, Xie, Jin, Yu, Jingyi, Gu, Jiayuan, Wang, Beibei
Abstract
Reliable simulation evaluation of robot manipulation policies serves as a high-fidelity proxy for real-world performance. Although existing benchmarks cover a wide range of task categories, they lack visual realism, creating a large domain gap between simulation and reality. This undermines the reliability of simulation-based evaluation in predicting real-world performance. To mitigate the sim-to-real visual gap, we conduct a systematic analysis to isolate the effects of lighting and material. Our results show that these factors play a critical role in geometric reasoning and spatial grounding, yet are largely overlooked in existing benchmarks. Motivated by the analysis, we propose VISER, a visually realistic benchmark for evaluating robot manipulation in simulation. VISER features a high-fidelity dataset of over 1,000 3D assets with physically-based rendering (PBR) materials, along with 3D scenes created from these assets through curated layouts or generation. To this end, we propose an automated pipeline leveraging Multi-modal Large Language Models (MLLMs) for material-aware part segmentation and material retrieval, enabling scalable generation of physically plausible assets. Building on the high-fidelity 3D asset dataset, we construct diverse evaluation tasks, such as grasping, placing, and long-horizon tasks, enabling scalable and reproducible assessment of Vision-Language-Action (VLA) models. Our benchmark shows a strong correlation between simulation and real-world performance, achieving an average Pearson correlation coefficient of 0.92 across different policies.
Chinese Translation
可靠的机器人操作策略仿真评估作为现实世界性能的高保真代理。尽管现有基准涵盖了广泛的任务类别,但它们缺乏视觉真实感,导致仿真与现实之间存在较大的领域差距。这削弱了基于仿真的评估在预测现实世界性能方面的可靠性。为了减小仿真与现实之间的视觉差距,我们进行了系统分析,以孤立光照和材料的影响。我们的结果表明,这些因素在几何推理和空间基础方面起着关键作用,但在现有基准中往往被忽视。基于这一分析,我们提出了VISER,一个用于评估机器人操作的视觉真实基准。VISER具有超过1000个具有物理基础渲染(PBR)材料的高保真3D资产数据集,以及通过精心设计的布局或生成从这些资产创建的3D场景。为此,我们提出了一种自动化管道,利用多模态大型语言模型(MLLMs)进行材料感知的部件分割和材料检索,从而实现物理上合理资产的可扩展生成。在高保真3D资产数据集的基础上,我们构建了多样化的评估任务,如抓取、放置和长时间任务,实现了对视觉-语言-动作(VLA)模型的可扩展和可重复评估。我们的基准显示仿真与现实世界性能之间存在强相关性,在不同策略下平均皮尔逊相关系数达到0.92。
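The sim-to-real agreement in this abstract is reported as a Pearson correlation coefficient. As a reminder of what that number measures, here is a minimal standard-library sketch of the textbook formula; the function name `pearson` and any inputs passed to it are illustrative and are not data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

In a benchmark setting, `xs` would hold per-policy success rates in simulation and `ys` the matching real-world rates; a value near 1 means the simulator ranks and scales policies much as the real robot does.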
cs.RO / 25 / 2605.06323

AssistDLO: Assistive Teleoperation for Deformable Linear Object Manipulation

AssistDLO:可变形线性物体操作的辅助远程操作
Guler, Berk, Manschitz, Simon, Pompetzki, Kay, Peters, Jan
Abstract
Manipulating Deformable Linear Objects (DLOs) is challenging in robotics due to their infinite-dimensional configuration space and complex nonlinear dynamics. In teleoperation, depth uncertainty hinders state perception and reaction. AssistDLO addresses this challenge as an assistive teleoperation framework for DLO manipulation that combines real-time multi-view state estimation, visual assistance (VA), and a geometry-aware shared-autonomy controller based on Control Barrier Functions (SA-CBF). While traditional shared autonomy methods often rely on simple geometric attractors and may fail to preserve DLO geometry, SA-CBF acts as a geometry-aware funnel, facilitating precise grasping while preserving the operator's high-level authority. The framework is evaluated in a bimanual knot-untangling user study (N = 22) using ropes with varying length and rigidity. Results show that the effectiveness of the assistance depends strongly on operator expertise and DLO properties. SA-CBF provides the strongest gains for naive users, acting as a skill equalizer that increases task success from 71% to 88%, and is effective for stiffer ropes. Conversely, expert users prefer VA, and highly compliant, long ropes benefit more from visual support than localized action assistance. Ultimately, these findings demonstrate that effective DLO teleoperation cannot rely on a fixed strategy, highlighting the critical need for adaptive, user-aware, and material-aware shared autonomy.
Chinese Translation
在机器人技术中,操作可变形线性物体(DLO)具有挑战性,因为它们具有无限维的配置空间和复杂的非线性动力学。在远程操作中,深度不确定性妨碍了状态感知和反应。AssistDLO 作为一个辅助远程操作框架,解决了这一挑战,旨在实现 DLO 操作,结合了实时多视角状态估计、视觉辅助(VA)和基于控制障碍函数(SA-CBF)的几何感知共享自主控制器。传统的共享自主方法通常依赖于简单的几何吸引子,可能无法保持 DLO 的几何形状,而 SA-CBF 则充当几何感知漏斗,促进精确抓取,同时保留操作员的高层次权威。该框架在一项双手解结用户研究中进行了评估(N = 22),使用了不同长度和刚度的绳索。结果表明,辅助的有效性在很大程度上依赖于操作员的专业知识和 DLO 的特性。SA-CBF 为初学者提供了最大的收益,作为一种技能平衡器,将任务成功率从 71% 提高到 88%,并且在较硬的绳索上效果显著。相反,专家用户更倾向于使用视觉辅助,而高度柔性、较长的绳索更依赖于视觉支持而非局部动作辅助。最终,这些发现表明,有效的 DLO 远程操作不能依赖固定策略,强调了适应性、用户感知和材料感知的共享自主的重要性。
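The shared-autonomy controller in this abstract builds on Control Barrier Functions. A scalar toy case shows the basic mechanism: the operator's command passes through unchanged until it would violate a safety condition, at which point it is minimally reduced. This is a generic one-dimensional illustration, not the geometry-aware SA-CBF from the paper; the single-integrator model, the barrier `h(x) = x_limit - x`, and the gain `alpha` are assumptions.

```python
def cbf_filter(u_operator, x, x_limit, alpha=2.0):
    """Minimally modify the operator velocity so that h(x) = x_limit - x stays non-negative.

    For the single integrator x' = u, the CBF condition h' + alpha * h >= 0
    reduces to u <= alpha * (x_limit - x), so the closest safe command is a clamp.
    """
    return min(u_operator, alpha * (x_limit - x))

def simulate(u_operator, x0, x_limit, dt=0.01, steps=500):
    """Integrate the filtered command with forward Euler and return the final state."""
    x = x0
    for _ in range(steps):
        x += cbf_filter(u_operator, x, x_limit) * dt
    return x
```

Far from the limit the operator keeps full authority; close to it the filter slows the motion so the state approaches the limit without crossing it.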
cs.RO / 26 / 2605.06432

TouchDrive: Electronics-Free Tactile Sensing Interface for Assistive Grasping

TouchDrive:无电子元件的触觉感知接口用于辅助抓取
Xu, Jing, Niu, Xuezhi, Broo, Didem Gurdur, Hjort, Klas
Abstract
Assistive robotic grasping plays an important role in enabling safe and adaptive manipulation of diverse objects. However, existing systems often rely on electronic sensing and multi-stage processing pipelines, increasing system complexity and reducing accessibility. To address these limitations, we present TouchDrive, a cost-effective, electronics-free tactile sensing interface for assistive grasping. TouchDrive directly converts contact forces into pneumatic feedback through valve-mediated switching, integrating sensing, signal generation, and feedback within a single passive mechanical loop. The system can be employed using a pneumatic normally closed valve, a compressed air tank, sensing element, and haptic feedback actuator without electronics. By delivering tactile cues, TouchDrive empowers users to modulate grasp forces, enabling precise and robust delicate manipulation of compliant and fragile objects. The interface has been validated across diverse robotic platforms, consistently demonstrating reliable performance and practical applicability in assistive grasping tasks, such as handling fruits and everyday items (up to 20 objects).
Chinese Translation
辅助机器人抓取在安全和自适应操作各种物体中发挥着重要作用。然而,现有系统通常依赖于电子感知和多阶段处理流程,增加了系统复杂性并降低了可及性。为了解决这些局限性,我们提出了TouchDrive,一种经济高效、无电子元件的触觉感知接口,用于辅助抓取。TouchDrive通过阀门介导的切换将接触力直接转换为气动反馈,将感知、信号生成和反馈集成在一个单一的被动机械回路中。该系统可以使用气动常闭阀、压缩空气罐、感知元件和触觉反馈执行器而无需电子元件。通过提供触觉提示,TouchDrive使用户能够调节抓取力量,从而实现对柔性和易碎物体的精确和稳健的细致操作。该接口已在多种机器人平台上进行了验证,始终展示出可靠的性能和在辅助抓取任务中的实际适用性,例如处理水果和日常物品(最多20个物体)。
cs.RO / 27 / 2605.06478

GA3T: A Ground-Aerial Terrain Traversability Dataset for Heterogeneous Robot Teams in Unstructured Environments

GA3T:用于非结构化环境中异构机器人团队的地面-空中地形可通行性数据集
Cai, Siwei, Peterson, Knut, Tran, Quan, Ricks, Christian, Parthasarathy, Dhanush, Kaidarov, Amir, Deshpande, Neil, Najm, Sukaina, Han, David, Zhou, Lifeng
Abstract
Heterogeneous air-ground robot teams combine complementary sensing modalities, mobility characteristics, and spatial viewpoints that can significantly enhance perception in complex outdoor environments. However, progress in multi-robot collaborative perception has been constrained by the lack of real-world datasets featuring overlapping multi-modal observations from platforms operating in unstructured terrain. We present GA3T (Ground-Aerial Team for Terrain Traversal), a real-world multi-robot collaborative perception dataset collected using a Clearpath Husky UGV and an Autel EVO II UAV across diverse unstructured environments, including forest trails, rocky paths, muddy terrain, snow piles, and grass-covered fields. The ground platform provides 3D LiDAR, stereo camera, IMU, and GPS data, while the aerial platform contributes RGB imagery, thermal/infrared observations, and GPS from a complementary overhead viewpoint, allowing for rich cross-modal and cross-view perception. The dataset is collected in 4 unique environments, with over 13,000 synchronized frames across approximately 29 minutes of operation, and includes both SAM 3-based zero-shot segmentation and over 8,000 manually labeled images. A unique aspect of the dataset is its early-spring collection period, during which sparse tree canopies allow the aerial robot to partially observe the ground robot and terrain through the trees, allowing for occlusion-aware collaborative perception. Unlike prior multi-robot datasets that focus on SLAM or simulated cooperative driving, GA3T is specifically designed to support research on cross-view perception, air-ground viewpoint fusion, traversability estimation, and collaborative scene understanding in real off-road environments.
Chinese Translation
异构的空地机器人团队结合了互补的传感模式、移动特性和空间视角,能够显著增强在复杂户外环境中的感知能力。然而,多机器人协作感知的进展受到缺乏真实世界数据集的限制,这些数据集需要包含来自在非结构化地形上操作的平台的重叠多模态观测。我们提出了GA3T(地面-空中团队地形穿越),这是一个真实世界的多机器人协作感知数据集,使用Clearpath Husky UGV和Autel EVO II UAV在多种非结构化环境中收集,包括森林小径、崎岖路径、泥泞地形、雪堆和草地。地面平台提供3D LiDAR、立体相机、IMU和GPS数据,而空中平台则从互补的上方视角贡献RGB图像、热成像/红外观测和GPS数据,从而实现丰富的跨模态和跨视角感知。该数据集在4个独特环境中收集,包含超过13,000个同步帧,操作时间约为29分钟,并包括基于SAM 3的零样本分割和超过8,000张手动标注的图像。该数据集的一个独特之处在于其早春收集期,在此期间稀疏的树冠允许空中机器人部分观察到地面机器人和地形,从而实现遮挡感知的协作感知。与以往专注于SLAM或模拟协作驾驶的多机器人数据集不同,GA3T专门设计用于支持在真实越野环境中进行跨视角感知、空地视角融合、可通行性估计和协作场景理解的研究。
cs.RO / 28 / 2605.06481

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

OA-WAM:用于稳健机器人操作的对象可寻址世界动作模型
Liu, Yushan, Sun, Peibo, Li, Shoujie, Xie, Yifan, Zhang, Lingfeng, Chao, Xintao, Dong, Shiyuan, Chen, Fang, Zhang, Xiao-Ping, Ding, Wenbo
Abstract
World Action Models (WAMs) enhance Vision-Language-Action policies by jointly predicting scene evolution and robot actions, but existing methods usually represent the predicted world as holistic images, video tokens, or global latents. These representations are difficult for an action decoder to address when an instruction refers to a particular object, especially under scene shifts where object identity is entangled with context. We propose OA-WAM, an Object-Addressable World Action Model for robust robot manipulation. OA-WAM decomposes each frame into N+1 slot states, with one robot slot and N object slots. Each slot contains a persistent address vector and a time-varying content vector, and is fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts next-frame slot states, while a flow-matching action head decodes a 16-step continuous action chunk in the same forward pass. Addressability is enforced by routing cross-slot attention through address-only keys and resetting the address slice at every transformer layer, separating which object to act on from what that object currently is without adding extra tokens. OA-WAM matches strong VLA and WAM baselines on LIBERO (97.8%) and SimplerEnv (79.3%), reaches state-of-the-art performance on the most relevant LIBERO-Plus geometric axes, and remains competitive on the seven-axis aggregate. A causal slot-intervention test yields a swap-binding cosine of 0.87, versus at most 0.09 for holistic baselines. These results suggest that addressable object states provide an effective interface for robust world-action modeling under scene perturbations.
Chinese Translation
世界动作模型(WAMs)通过共同预测场景演变和机器人动作来增强视觉-语言-动作(VLA)策略,但现有方法通常将预测的世界表示为整体图像、视频标记或全局潜变量。这些表示在指令涉及特定对象时,尤其是在场景变化中对象身份与上下文交织时,难以被动作解码器处理。我们提出了OA-WAM,一种用于稳健机器人操作的对象可寻址世界动作模型。OA-WAM将每帧分解为N+1个槽状态,其中一个为机器人槽,N个为对象槽。每个槽包含一个持久地址向量和一个时间变化的内容向量,并与文本、图像、自我感知和过去动作标记在一个块因果序列中融合。世界头预测下一帧的槽状态,而流匹配动作头在同一前向传递中解码一个16步的连续动作块。通过仅使用地址键路由跨槽注意力并在每个变换层重置地址切片,强制实现可寻址性,从而将要作用的对象与该对象当前的状态分离,而无需添加额外的标记。OA-WAM在LIBERO(97.8%)和SimplerEnv(79.3%)上与强大的VLA和WAM基线匹配,在最相关的LIBERO-Plus几何轴上达到最先进的性能,并在七轴汇总上保持竞争力。因果槽干预测试的交换绑定余弦为0.87,而整体基线最多为0.09。这些结果表明,可寻址对象状态为在场景扰动下的稳健世界动作建模提供了有效的接口。
cs.RO / 29 / 2605.06498

Lie Group Formulation of Recursive Dynamics Algorithms of Higher Order for Floating-Base Robots

浮动基座机器人高阶递归动力学算法的李群表述
Ali, Ahmed, Gabellieri, Chiara, Franchi, Antonio
Abstract
In this paper, we describe procedures for computing higher-order time derivatives of the Lie-group Newton-Euler, Articulated-Body Inertia, and hybrid dynamics algorithms for floating-base trees, where the base configuration evolves on SE(3) and the attached mechanism is an open kinematic tree with configuration on the (n1+n2)-dimensional manifold T^{n1} \times R^{n2}, using spatial representation of twists. After presenting the algorithms, we collect the resulting recursions into closed-form equations of motion, identifying an admissible Coriolis matrix satisfying the passivity property, and showing that the articulated inertia tensor remains unchanged across all time derivatives. We then apply the developed methods to a 12-DoF aerial manipulator to derive analytical expressions for its geometric forward and inverse dynamics along with their first time derivatives whereas the numerical simulations successfully evaluate these dynamics up to fifth order. Finally, to demonstrate their practical utility, we benchmark the proposed extensions and show that, in the considered tests, their computational cost scales quadratically with the derivative order, whereas the automatic-differentiation baseline exhibits exponential scaling.
Chinese Translation
在本文中,我们描述了计算浮动基座树的李群牛顿-欧拉、关节体惯性和混合动力学算法的高阶时间导数的程序,其中基座配置在 SE(3) 上演变,附加机制是一个在 (n1+n2) 维流形 T^{n1} \times R^{n2} 上具有配置的开放运动学树,使用扭转的空间表示。在介绍算法后,我们将得到的递归整合为运动方程的封闭形式,识别出满足无源性特性的可接受的科里奥利矩阵,并展示关节惯性张量在所有时间导数中保持不变。然后,我们将开发的方法应用于一个 12 自由度的空中操控器,以推导其几何正向和逆向动力学的解析表达式及其一阶时间导数,同时数值仿真成功评估这些动力学高达五阶。最后,为了展示其实际效用,我们对所提出的扩展进行了基准测试,并显示在所考虑的测试中,其计算成本与导数阶数呈二次关系,而自动微分基线则表现出指数级的增长。
cs.RO / 30 / 2605.06593

ReActor: Reinforcement Learning for Physics-Aware Motion Retargeting

ReActor:基于强化学习的物理感知运动重定向
Müller, David, Serifi, Agon, Christen, Sammy, Grandia, Ruben, Knoop, Espen, Bächer, Moritz
Abstract
Retargeting human kinematic reference motion onto a robot's morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motions, which hinder downstream imitation learning. We propose a bilevel optimization framework that jointly adapts reference motions to a robot's morphology while training a tracking policy using reinforcement learning. To make the optimization tractable, we derive an approximate gradient for the upper-level loss. Our framework requires only a sparse set of semantic rigid-body correspondences and eliminates the need for manual tuning by identifying optimal values for a parameterization expressive enough to preserve characteristic motion across different embodiments. Moreover, by integrating retargeting directly with physics simulation, we produce physically plausible motions that facilitate robust imitation learning. We validate our method in simulation and on hardware, demonstrating challenging motions for morphologies that differ significantly from a human, including retargeting onto a quadruped.
Chinese Translation
将人类运动学参考动作重定向到机器人的形态仍然是一个巨大的挑战。现有方法往往会产生物理不一致性,例如脚滑动、自我碰撞或动态上不可行的运动,这些都妨碍了后续的模仿学习。我们提出了一种双层优化框架,该框架在使用强化学习训练跟踪策略的同时,联合调整参考动作以适应机器人的形态。为了使优化过程可行,我们推导了上层损失的近似梯度。我们的框架仅需一组稀疏的语义刚体对应关系,并通过识别参数化的最优值来消除手动调节的需求,这些参数化足够表达以保留不同表现形式之间的特征运动。此外,通过将重定向与物理仿真直接集成,我们生成了物理上合理的运动,从而促进了稳健的模仿学习。我们在仿真和硬件上验证了我们的方法,展示了对于与人类形态显著不同的形态(包括重定向到四足动物)的挑战性运动。
cs.RO / 31 / 2605.06595

Cross-Modal Navigation with Multi-Agent Reinforcement Learning

基于多智能体强化学习的跨模态导航
Liu, Shuo, Li, Xinzichen, Amato, Christopher
Abstract
Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substantially enlarge the policy space. Cross-modal collaboration among lightweight modality-specialized agents offers a scalable paradigm. It enables flexible deployment and parallel execution, while preserving the strength of each modality. In this paper, we propose CRONA, a Multi-Agent Reinforcement Learning (MARL) framework for Cross-Modal Navigation. CRONA improves collaboration by leveraging control-relevant auxiliary beliefs and a centralized multi-modal critic with global state. Experiments on visual-acoustic navigation tasks show that multi-agent methods significantly improve performance and efficiency over single-agent baselines. We find that homogeneous collaboration with limited modalities is sufficient for short-range navigation under salient cues; heterogeneous collaboration among agents with complementary modalities is generally efficient and effective; and navigation in large, complex environments requires both richer multi-modal perception and increased model capacity.
Chinese Translation
稳健的具身导航依赖于互补的感官线索。然而,实践中高质量且良好对齐的多模态数据往往难以获得。训练一个整体模型也面临挑战,因为丰富的多模态输入会引发复杂的表示,并显著扩大策略空间。轻量级模态专用智能体之间的跨模态协作提供了一种可扩展的范式。它实现了灵活的部署和并行执行,同时保留了每种模态的优势。在本文中,我们提出了CRONA,一个用于跨模态导航的多智能体强化学习(MARL)框架。CRONA通过利用与控制相关的辅助信念和具有全局状态的集中式多模态评论者来改善协作。在视觉-声学导航任务上的实验表明,多智能体方法显著提高了性能和效率,超越了单智能体基线。我们发现,有限模态的同质协作对于在显著线索下的短距离导航是足够的;具有互补模态的智能体之间的异质协作通常是高效且有效的;而在大型复杂环境中的导航则需要更丰富的多模态感知和更高的模型容量。
cs.RO / 32 / 2605.06662

Multi-Robot Coordination in V2X Environments

V2X环境中的多机器人协调
Arockiasamy, John Pravin, Vinel, Alexey
Abstract
This paper presents a Vehicle-to-Everything (V2X) communication framework that enables decentralized cooperation among social robots operating in complex urban traffic environments. Building on ETSI Cooperative Awareness and Maneuver Coordination services, the framework introduces two robot-centric facility-layer services: the Robot Awareness Service (RAS) and the Robot Maneuver Coordination Service (RMCS), realized through the Robot Awareness Message (RAM) and the Robot Maneuver Coordination Message (RMCM), respectively. RAS enables role-aware, task-oriented robot awareness while integrating externally detected Vulnerable Road Users (VRUs), including non-V2X pedestrians, into cooperative awareness. RMCS supports event-driven, low-latency coordination of robot maneuvers under explicitly established roles, without centralized infrastructure or prior pairing. A real-world proof of concept demonstrates deterministic multi-robot coordination between a humanoid robot and a quadrupedal robot assisting a pedestrian during a road-crossing scenario, governed by a formally specified finite-state coordination model. Complementary simulations evaluate robot-mediated VRU clustering in mixed V2X environments, showing that RAS-based clustering integrates non-V2X VRUs in safety-critical areas while reducing redundant transmissions from V2X-enabled VRUs, thereby lowering channel load. Together, the proposed services provide a scalable and standards-aligned foundation for integrating cooperative robots into future Connected, Cooperative, and Automated Mobility ecosystems.
Chinese Translation
本文提出了一种车辆与一切(V2X)通信框架,旨在实现复杂城市交通环境中社会机器人之间的去中心化合作。该框架基于ETSI合作意识和机动协调服务,引入了两个以机器人为中心的设施层服务:机器人意识服务(Robot Awareness Service, RAS)和机器人机动协调服务(Robot Maneuver Coordination Service, RMCS),分别通过机器人意识消息(Robot Awareness Message, RAM)和机器人机动协调消息(Robot Maneuver Coordination Message, RMCM)实现。RAS支持角色感知的任务导向机器人意识,同时将外部检测到的脆弱道路用户(Vulnerable Road Users, VRUs),包括非V2X行人,纳入合作意识中。RMCS支持在明确设定角色下的事件驱动、低延迟的机器人机动协调,无需集中基础设施或事先配对。一个现实世界的概念验证展示了在一个过马路场景中,人形机器人和四足机器人之间的确定性多机器人协调,遵循一个正式指定的有限状态协调模型。补充模拟评估了在混合V2X环境中机器人介导的VRU聚类,结果表明基于RAS的聚类能够在安全关键区域整合非V2X VRUs,同时减少来自V2X启用的VRUs的冗余传输,从而降低信道负载。综上所述,所提出的服务为将合作机器人集成到未来的连接、合作和自动化移动生态系统中提供了可扩展且符合标准的基础。
计算机视觉 (Computer Vision)
121
cs.CV / 1 / 2605.05215

Layout-Aware Representation Learning for Open-Set ID Fraud Discovery

基于布局的开放集身份欺诈发现表示学习
Li, Jinxing, Ren, Nicholas, Chang, Cathy, Pan, Hongkai, George, Daniel
Abstract
Identity-document fraud detection is not a stationary binary classification problem. Adaptive attackers modify templates and fabrication pipelines, making historical fraud labels stale, and successful forgeries recur at scale as coherent campaigns. We therefore study layout-aware representation learning for open-set fraud discovery rather than only closed-set classification. We adapt DINOv3 to the document domain via context-aware SimMIM fine-tuning and supervised metric learning with composite loss that encourages inter-class separability and intra-class compactness. The model is trained with U.S. IDs only. With a lightweight MLP and softmax classifier, the embedding achieves 99.83% layout classification accuracy on Canadian layouts. Moreover, on a dataset of 20,448 Canadian IDs, embedding-space analysis surfaces 276 adaptive physical-fraud cases, including 222 not surfaced by incumbent detectors. The embedding supports similarity-based expansion from a single confirmed seed to additional related cases not linked by conventional metadata graphs. The layout-aware document embeddings provide a production-aligned basis for discovering novel and campaign-scale fraud under distribution shift.
Chinese Translation
身份文件欺诈检测并非一个静态的二元分类问题。适应性攻击者修改模板和伪造流程,使得历史欺诈标签失效,成功的伪造行为以一致的活动大规模重现。因此,我们研究基于布局的表示学习用于开放集欺诈发现,而不仅仅是闭集分类。我们通过上下文感知的 SimMIM 微调和带有复合损失的监督度量学习,将 DINOv3 适配到文档领域,复合损失鼓励类间可分离性和类内紧凑性。该模型仅使用美国身份证进行训练。通过轻量级的 MLP 和 softmax 分类器,该嵌入在加拿大布局上实现了 99.83% 的布局分类准确率。此外,在一个包含 20,448 个加拿大身份证的数据集中,嵌入空间分析揭示了 276 个适应性物理欺诈案例,其中 222 个未被现有检测器发现。该嵌入支持从单个确认的种子扩展到其他未通过传统元数据图链接的相关案例。基于布局的文档嵌入为在分布变化下发现新型和大规模活动欺诈提供了与生产对齐的基础。
cs.CV / 2 / 2605.05283

Seeing What Shouldn't Be There: Counterfactual GANs for Medical Image Attribution

看见不该存在的东西:用于医学图像归因的反事实生成对抗网络
Murtaza, Shakeeb
Abstract
Ascription of an image gives insights into the objects that influence the classification of the whole image or its pixels towards a specific category. These insights help radiologists to visualize deformities in medical imaging. Most of the existing visualization techniques are based on discriminative models and highlight regions of the input image participating in the decision-making of a classifier. However, these approaches do not take all noticeable objects into account as their objective is to classify the input by using a minimal set of discriminative features. To overcome the issue, a counterfactual explanation (CX) based class-oriented feature attribution method is proposed. A counterfactual explanation (CX) explicates a causal reasoning process of the form: "if X had not happened, then Y would not have happened". The method is built on generative adversarial networks (GANs) with a cyclical-consistent loss function. We evaluate our method on three datasets: synthetic, tuberculosis and BraTS. All experiments confirm the efficacy of the proposed method. This study also highlighted the limitations of existing counterfactual explanation techniques in producing plausible counterfactual instances (CIs). Accompanying CXs with believable CIs thus provides self-explanatory analogy-based explanations. To this end, a CI generation method is proposed. Also, a novel technique is used to evaluate the quality of CI. The baseline results are produced on the BraTS dataset.
Chinese Translation
图像的归因提供了对影响整个图像或其像素朝特定类别分类的对象的洞察。这些洞察帮助放射科医生在医学成像中可视化畸形。现有的大多数可视化技术基于判别模型,突出参与分类器决策的输入图像区域。然而,这些方法并未考虑所有显著对象,因为它们的目标是通过使用最小的判别特征集来对输入进行分类。为了解决这个问题,提出了一种基于反事实解释(CX)的面向类别的特征归因方法。反事实解释(CX)阐明了一种因果推理过程,其形式为:“如果X没有发生,那么Y就不会发生”。该方法基于具有循环一致性损失函数的生成对抗网络(GANs)。我们在三个数据集上评估了我们的方法:合成数据集、结核病数据集和BraTS数据集。所有实验均证实了所提方法的有效性。本研究还突出了现有反事实解释技术在生成可信的反事实实例(CIs)方面的局限性。因此,伴随可信的CIs的CX提供了自我解释的类比基础解释。为此,提出了一种CI生成方法。同时,采用了一种新技术来评估CI的质量。基线结果是在BraTS数据集上生成的。
cs.CV / 3 / 2605.05328

Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift

Query2Uncertainty:在分布转移下进行稳健的不确定性量化和校准的3D物体检测
Beemelmanns, Till, Nekrasov, Alexey, Vilceanu, Stefan, Steinhaus, Jonas, Woopen, Timo, Leibe, Bastian, Eckstein, Lutz
Abstract
Reliable uncertainty estimation for 3D object detection is critical for deploying safe autonomous systems, yet modern detectors remain poorly calibrated, especially under distribution shifts. Although post-hoc calibration methods address this issue and provide improved calibration for in-distribution tests, they fail to adapt in distribution-shifted scenarios. In this work, we address this issue and introduce a density-aware calibration method that couples post-hoc calibrators with the feature density of latent object queries from DETR-style 3D object detectors. These queries form a compact, location- and class-aware feature, ideal for density estimation, allowing our approach to adjust model confidences in distribution-shift scenarios. By fitting a density estimator on these query features, our approach jointly recalibrates both classification and bounding box regression uncertainties. On both a multi-view camera and LiDAR-based detector, our approach consistently outperforms standard post-hoc methods in both in-distribution and distribution-shifted scenarios. Code is available at https://tillbeemelmanns.github.io/query2uncertainty/.
Chinese Translation
对于3D物体检测,可靠的不确定性估计对安全自主系统的部署至关重要,但现代检测器在校准方面仍然存在不足,尤其是在分布转移的情况下。尽管事后校准方法解决了这一问题,并为在分布内测试提供了改进的校准,但它们无法适应分布转移场景。在本研究中,我们解决了这一问题,并提出了一种密度感知校准方法,该方法将事后校准器与来自DETR风格3D物体检测器的潜在物体查询的特征密度相结合。这些查询形成了一种紧凑的、位置和类别感知的特征,适合进行密度估计,使我们的方法能够在分布转移场景中调整模型的置信度。通过在这些查询特征上拟合密度估计器,我们的方法共同重新校准了分类和边界框回归的不确定性。在多视角摄像头和基于激光雷达的检测器上,我们的方法在分布内和分布转移场景中始终优于标准的事后方法。代码可用:https://tillbeemelmanns.github.io/query2uncertainty/ 。
cs.CV / 4 / 2605.05331

ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

ViTok-v2:将原生分辨率自编码器扩展至50亿参数
Hansen-Estruch, Philippe, Chen, Jiahui, Ramanujan, Vivek, Zohar, Orr, Ping, Yan, Sinha, Animesh, Georgopoulos, Markos, Schoenfeld, Edgar, Hou, Ji, Juefei-Xu, Felix, Vishwanath, Sriram, Thabet, Ali
Abstract
Vision Transformer (ViT) autoencoders have emerged as compelling tokenizers for images, offering improved reconstruction over convolutional tokenizers. However, existing ViT tokenizers cannot explore this landscape as performance degrades outside training resolutions, and reliance on adversarial losses prevents stable scaling. ViTok (Hansen-Estruch et al., 2025) found that the compression ratio r mediates a reconstruction-generation trade-off where lower r means better reconstructions but harder generations, so improving tokenizer reconstruction is key to more Pareto-optimal tokenizers. We introduce ViTok-v2, which addresses these limitations with native resolution support via NaFlex for generalization across resolutions and aspect ratios, and a novel DINOv3 perceptual loss that replaces both LPIPS and GAN objectives for stable training at any scale. ViTok-v2 is trained on about 2B images and scaled to 5B parameters, the largest image autoencoder to date. ViTok-v2 matches or exceeds state-of-the-art reconstruction at 256p and outperforms all baselines at 512p and above. In joint scaling experiments with flow matching generators, we show that scaling both the autoencoder and the generator advances the Pareto frontier of this trade-off.
Chinese Translation
视觉变换器(Vision Transformer, ViT)自编码器已成为图像的有效标记器,提供了比卷积标记器更好的重建效果。然而,现有的ViT标记器无法在训练分辨率之外探索这一领域,因为性能在训练分辨率之外会下降,并且对对抗损失的依赖阻碍了稳定扩展。ViTok(Hansen-Estruch等,2025)发现压缩比r调节重建与生成之间的权衡,其中较低的r意味着更好的重建但更难的生成,因此改善标记器重建是实现更具帕累托最优性的标记器的关键。我们引入ViTok-v2,解决了这些局限性,通过NaFlex支持原生分辨率,以实现跨分辨率和纵横比的泛化,并采用一种新颖的DINOv3感知损失,取代了LPIPS和GAN目标,以便在任何规模下实现稳定训练。ViTok-v2在约20亿张图像上进行训练,并扩展至50亿参数,成为迄今为止最大的图像自编码器。ViTok-v2在256p时的重建效果与最先进的技术相匹配或超过,并在512p及以上的所有基准测试中表现优于所有基线。在与流匹配生成器的联合扩展实验中,我们展示了同时扩展自编码器和生成器可以推动这一权衡的帕累托前沿。
cs.CV / 5 / 2605.05344

Open-SAT: LLM-Guided Query Embedding Refinement for Open-Vocabulary Object Retrieval in Satellite Imagery

Open-SAT:基于大型语言模型的查询嵌入精炼用于卫星图像中的开放词汇对象检索
Arefeen, Md Adnan, Debnath, Biplob, Rajendran, Ravi K., Sankaradas, Murugan, Chakradhar, Srimat T.
Abstract
In satellite applications, user queries often take the form of open-ended natural language, extending beyond a fixed set of predefined categories. This open-vocabulary nature poses significant challenges for retrieving relevant image tiles, as the retrieval system must generalize to a wide range of unseen objects and concepts. While vision-language models (VLMs) such as CLIP are widely used for text-image retrieval, even fine-tuned variants often struggle to accurately align such queries with satellite imagery. To address this, we propose Open-SAT, a training-free query embedding refinement algorithm that operates at inference time to improve alignment between user queries and satellite image content. Open-SAT uses VLMs to compute embeddings for image tiles, which are stored in a vector database for efficient retrieval. At query time, it leverages Large Language Models (LLMs) to refine the text embeddings by incorporating contextual information about objects of interest and their surroundings. A threshold-free retrieval mechanism further enhances accuracy and efficiency. Experimental results in three public benchmarks demonstrate that Open-SAT improves the F1 score by up to 16.04%, while retrieving a comparable number of image tiles. These results demonstrate the effectiveness of Open-SAT in open-vocabulary satellite image retrieval, leveraging LLM guidance without the need for additional training or supervision.
Chinese Translation
在卫星应用中,用户查询通常采用开放式自然语言的形式,超出了固定的预定义类别。这种开放词汇特性给相关图像块的检索带来了重大挑战,因为检索系统必须能够泛化到广泛的未见对象和概念。尽管视觉-语言模型(VLMs)如CLIP被广泛用于文本-图像检索,但即使是经过微调的变体也常常难以准确地将此类查询与卫星图像对齐。为了解决这个问题,我们提出了Open-SAT,这是一种无训练的查询嵌入精炼算法,在推理时操作,以改善用户查询与卫星图像内容之间的对齐。Open-SAT利用VLMs计算图像块的嵌入,并将其存储在向量数据库中以实现高效检索。在查询时,它利用大型语言模型(LLMs)通过结合有关感兴趣对象及其周围环境的上下文信息来精炼文本嵌入。无阈值的检索机制进一步提高了准确性和效率。在三个公共基准测试中的实验结果表明,Open-SAT将F1分数提高了最多16.04%,同时检索到的图像块数量相当。这些结果证明了Open-SAT在开放词汇卫星图像检索中的有效性,利用LLM的指导而无需额外的训练或监督。
cs.CV / 6 / 2605.05351

egenioussBench: A New Dataset for Geospatial Visual Localisation

egenioussBench:一个用于地理空间视觉定位的新数据集
Fanta-Jende, Phillipp, Vultaggio, Francesco, Kern, Alexander, Loeper, Yasmin, Gerke, Markus
Abstract
We present egenioussBench, a visual localisation benchmark built on geospatial reference data: a city-scale airborne 3D mesh and a CityGML LoD2 model. This pairing reflects deployable mapping assets and supports true scalability beyond traditional SfM-based approaches. The query data comprise smartphone images with centimetre-accurate, map-independent ground truth obtained via PPK and GCP/CP-aided adjustment. From 2,709 images, we derive a non-co-visible subset by estimating the full co-visibility matrix from rendered depth and selecting a maximum independent set; the released data include a test split of 42 non-co-visible images with withheld ground truth and a validation split of 412 sequential images with poses, e.g. for training of pose regressors and self-validation. The benchmark features a public leaderboard evaluated with binning metrics at multiple pose-error thresholds alongside global statistics (median, RMSE, outlier ratio), ensuring fair, like-for-like comparison across mesh- and LoD2-based methods. Together, these design choices expose realistic cross-view and cross-domain challenges while providing a rigorous, scalable path for advancing large-scale visual localisation. We make the evaluation code and data available at https://github.com/fratopa/egenioussBench and https://www.egeniouss.eu/
Chinese Translation
我们提出了egenioussBench,这是一个基于地理空间参考数据构建的视觉定位基准:一个城市规模的空中3D网格和一个CityGML LoD2模型。这种配对反映了可部署的映射资产,并支持超越传统结构从运动(SfM)方法的真实可扩展性。查询数据包括通过PPK和GCP/CP辅助调整获得的厘米级精度、独立于地图的真实值的智能手机图像。从2709张图像中,我们通过估计渲染深度的完整共视矩阵并选择一个最大独立集,推导出一个非共视子集;发布的数据包括42张未公开真实值的非共视图像的测试集和412张带有姿态的顺序图像的验证集,例如用于姿态回归器的训练和自我验证。该基准具有一个公共排行榜,使用多个姿态误差阈值的分箱指标进行评估,并提供全局统计数据(中位数、均方根误差、异常值比例),确保在基于网格和LoD2的方法之间进行公平的同类比较。这些设计选择共同揭示了现实的跨视图和跨领域挑战,同时为推动大规模视觉定位提供了一条严格、可扩展的路径。我们将在 https://github.com/fratopa/egenioussBench 和 https://www.egeniouss.eu/ 上提供评估代码和数据。
cs.CV / 7 / 2605.05367

Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular Video

Tamaththul3D:来自单目视频的高保真3D沙特手语虚拟形象
Alghamdi, Eyad, Altuuaim, Sattam, Ghulam, Obay, Qutah, Abdulrahman, Basoodan, Yousef
Abstract
Arabic Sign Language (ArSL) and its dialects serve approximately 400 million Arabic speakers worldwide, yet the community lacks high-quality 3D parametric annotations and specialized reconstruction methods for avatar generation. We address this critical gap through two key contributions: First, we introduce the first high-quality 3D parametric annotations for the Ishara-500 Saudi Sign Language dataset, providing precise SMPL-X parameters for 500 culturally authentic SSL signs. Second, we present Tamaththul3D, a specialized reconstruction pipeline designed for ArSL's unique articulation patterns. Our pipeline integrates SMPLer-X for robust body estimation, WiLoR for detailed hand refinement with automatic localization and mirroring, and MediaPipe for 2D pose supervision. Through kinematic-chain-based wrist alignment with hybrid swing-twist decomposition and 2D-supervised joint optimization, Tamaththul3D achieves state-of-the-art hand accuracy (up to 32% improvement over previous methods) while maintaining competitive body pose. Together, these 3D annotations and Tamaththul3D pipeline establish the first comprehensive framework for high-fidelity ArSL avatar reconstruction, enabling new accessibility technologies and cultural preservation efforts for the Arab Deaf community.
Chinese Translation
阿拉伯手语(ArSL)及其方言为全球约4亿阿拉伯语使用者服务,但该社区缺乏高质量的3D参数注释和专门的虚拟形象生成重建方法。我们通过两个关键贡献来填补这一重要空白:首先,我们为Ishara-500沙特手语数据集引入了首个高质量的3D参数注释,为500个文化上真实的沙特手语手势提供精确的SMPL-X参数。其次,我们提出了Tamaththul3D,一个专为阿拉伯手语独特的发音模式设计的重建管道。我们的管道集成了SMPLer-X以实现稳健的身体估计,WiLoR用于详细的手部精细化,具有自动定位和镜像功能,以及MediaPipe用于2D姿态监督。通过基于运动链的手腕对齐与混合摆动-扭转分解,以及2D监督的关节优化,Tamaththul3D在保持竞争力的身体姿态的同时,实现了最先进的手部精度(相比于之前的方法提高了多达32%)。这些3D注释和Tamaththul3D管道共同建立了高保真阿拉伯手语虚拟形象重建的首个综合框架,为阿拉伯聋人社区的新可及性技术和文化保护工作提供了支持。
cs.CV / 8 / 2605.05372

Two Steps Are All You Need: Efficient 3D Point Cloud Anomaly Detection with Consistency Models

两步即可满足需求:基于一致性模型的高效3D点云异常检测
A, Pranav, B, Shashank, Siddappa, Pranav, Seuss, Dominik, Moharir, Minal, KN, Subramanya
Abstract
Diffusion models are rapidly redefining 3D anomaly detection in point cloud data. As 3D sensing becomes integral to modern manufacturing, reliable anomaly detection is essential for high-throughput quality assurance and process control. Yet practical deployment on resource-constrained, latency-critical systems remains limited. Existing methods are often computationally prohibitive or unreliable in complex, unmasked regions, and diffusion pipelines are inherently bottlenecked by iterative denoising. In this work, we address this bottleneck by reformulating reconstruction-based anomaly detection through consistency learning, enabling direct prediction of anomaly-free geometry in one or two network evaluations. We further introduce a novel hybrid loss formulation that explicitly enforces reconstruction toward clean data. This design substantially reduces inference cost, achieving up to 80x faster runtime than the current state-of-the-art method, without GPU acceleration, while preserving strong detection performance. It outperforms R3D-AD on Anomaly-ShapeNet with 76.20% I-AUROC and remains competitive on Real3DAD with 72.80% I-AUROC, enabling efficient, low-latency anomaly detection on resource-constrained platforms, including drones, smart industrial cameras, and other edge devices.
Chinese Translation
扩散模型正在迅速重新定义点云数据中的3D异常检测。随着3D传感技术成为现代制造的重要组成部分,可靠的异常检测对于高通量质量保证和过程控制至关重要。然而,在资源受限和对延迟敏感的系统上进行实际部署仍然有限。现有方法在复杂的、未遮蔽区域往往计算成本高或不可靠,而扩散管道本质上受到迭代去噪的瓶颈限制。在本研究中,我们通过一致性学习重新构建基于重建的异常检测,从而解决这一瓶颈,使得在一次或两次网络评估中能够直接预测无异常的几何形状。我们进一步引入了一种新颖的混合损失公式,明确地强制重建朝向干净数据。这一设计显著降低了推理成本,实现了比当前最先进方法快80倍的运行时间,而无需GPU加速,同时保持了强大的检测性能。在Anomaly-ShapeNet数据集上,其I-AUROC达到了76.20%,在Real3DAD上保持了72.80%的I-AUROC,能够在资源受限的平台上实现高效、低延迟的异常检测,包括无人机、智能工业相机和其他边缘设备。
cs.CV / 9 / 2605.05390

LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

LAMP:基于定位感知的多摄像头人群跟踪在度量3D世界中的应用
Yang, Nan, Straub, Julian, Zhang, Fan, Newcombe, Richard, Engel, Jakob, Ma, Lingni
Abstract
Tracking 3D human motion from an egocentric multi-camera headset is challenged by severe egomotion, partial visibility or occlusions, and a lack of training data. Existing methods designed for monocular video often require static or slowly-moving cameras and cannot efficiently leverage multi-view, calibrated and localized input. This makes them brittle and prone to fail on dynamic egocentric captures. We propose LAMP (Localization Aware Multi-camera People Tracking): a novel, simple framework to solve this via early disentanglement of observer and target motion. LAMP introduces a two-step process. First, we leverage the known device 6 DoF motion and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained spatio-temporal transformer fits 3D human motion directly to this 3D ray cloud. This "lift-then-fit" approach allows LAMP to learn and leverage a natural human motion prior in the world-space, as well as providing an elegant framework to flexibly incorporate information from multiple temporally asynchronous, partially observing and moving cameras. LAMP achieves state-of-the-art results on monocular benchmarks, while significantly outperforming baselines for our targeted egocentric setting.
Chinese Translation
从自我中心的多摄像头头戴设备跟踪3D人类运动面临着严重的自我运动、部分可见性或遮挡以及缺乏训练数据的挑战。现有针对单目视频的方法通常要求静态或缓慢移动的摄像头,无法有效利用多视角、经过校准和定位的输入。这使得它们在动态自我中心捕捉中显得脆弱且容易失败。我们提出了LAMP(基于定位感知的多摄像头人群跟踪):一个新颖且简单的框架,通过早期解耦观察者和目标运动来解决这一问题。LAMP引入了一个两步过程。首先,我们利用已知的设备6自由度(6 DoF)运动和校准,将所有摄像头在时间窗口内检测到的2D身体关键点转换为统一的3D世界参考框架。其次,一个端到端训练的时空变换器直接将3D人类运动拟合到这个3D光线云中。这种“提升再拟合”的方法使LAMP能够学习和利用世界空间中自然的人类运动先验,同时提供一个优雅的框架,以灵活地整合来自多个时间上不同步、部分观察和移动摄像头的信息。LAMP在单目基准测试中取得了最先进的结果,同时在我们针对的自我中心设置中显著超越了基线。
cs.CV / 10 / 2605.05405

Zero-Shot Satellite Image Retrieval through Joint Embeddings: Application to Crisis Response

通过联合嵌入实现零样本卫星图像检索:应用于危机响应
Walsh, James, Fawcett, William, Colvard, Grace, Ramos-Pollán, Raúl
Abstract
Semantic search of Earth observation archives remains challenging. Visual foundation models such as CLAY produce rich embeddings of satellite imagery but lack the natural-language grounding needed for intuitive query, and full contrastive training of a remote-sensing CLIP-style model requires paired data and compute that are unavailable at global scale. We present GeoQuery, a zero-shot retrieval system that sidesteps this constraint through prompt-aligned text proxies. Rather than training a joint encoder, we generate language descriptions for a 100k proxy subset of global Sentinel-2 tiles and optimise the description-generation prompt so that distances in the resulting text-embedding space correlate with distances in the frozen CLAY visual-embedding space. Queries are resolved in two stages, with a text-similarity search over the proxy subset followed by a visual nearest-neighbour search over worldwide CLAY embeddings. On 76 disaster-location queries covering UK floods, US wildfires, and US droughts, GeoQuery achieves 31.6% accuracy within 50 km, with the strongest performance on floods (50% within 50 km) where terrain features are well captured by RGB embeddings. Deployed within ECHO, a crisis response system using Agentic Action Graphs, GeoQuery identified vulnerable areas during Brisbane's 2025 Cyclone Alfred, with downstream flood simulations reproducing historical patterns. Prompt-aligned proxies offer a practical bridge between EO foundation models and operational retrieval when full contrastive training is out of reach.
Chinese Translation
地球观测档案的语义搜索仍然面临挑战。视觉基础模型如CLAY能够生成丰富的卫星图像嵌入,但缺乏直观查询所需的自然语言基础,而对遥感CLIP风格模型进行全面对比训练需要配对数据和计算资源,这在全球范围内是不可获得的。我们提出了GeoQuery,一个零样本检索系统,通过与提示对齐的文本代理规避了这一限制。我们并不是训练一个联合编码器,而是为全球Sentinel-2图块的10万代理子集生成语言描述,并优化描述生成提示,使得生成的文本嵌入空间中的距离与冻结的CLAY视觉嵌入空间中的距离相关联。查询分为两个阶段解决,首先在代理子集上进行文本相似性搜索,然后在全球CLAY嵌入上进行视觉最近邻搜索。在覆盖英国洪水、美国野火和美国干旱的76个灾害位置查询中,GeoQuery在50公里范围内达到了31.6%的准确率,其中在洪水(50%在50公里内)上的表现最强,因为地形特征通过RGB嵌入得到了很好的捕捉。在ECHO中部署,该危机响应系统使用Agentic Action Graphs,GeoQuery在2025年布里斯班的阿尔弗雷德飓风期间识别了脆弱地区,后续的洪水模拟再现了历史模式。当全面对比训练无法实现时,与提示对齐的代理为EO基础模型与操作性检索之间提供了实用的桥梁。
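The two-stage query resolution described in the GeoQuery abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how many proxy hits are carried into the second stage, so the sketch uses the single best proxy, and the function names and toy embeddings are hypothetical rather than taken from the paper.

```python
import numpy as np

def normalize(x):
    """Scale vectors (or rows of a matrix) to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def two_stage_search(query_text_emb, proxy_text_embs, proxy_visual_embs,
                     world_visual_embs, top_k=3):
    """Text search over proxies, then visual nearest neighbours worldwide.

    Assumption: only the single best-matching proxy seeds stage 2.
    """
    # Stage 1: cosine similarity between the query and proxy descriptions.
    text_sims = normalize(proxy_text_embs) @ normalize(query_text_emb)
    best_proxy = int(np.argmax(text_sims))
    # Stage 2: rank all world tiles by similarity to that proxy's visual embedding.
    anchor = normalize(proxy_visual_embs[best_proxy])
    visual_sims = normalize(world_visual_embs) @ anchor
    return best_proxy, np.argsort(-visual_sims)[:top_k]

# Toy data (made-up values, not CLAY or Sentinel-2 embeddings).
proxy_text = np.eye(3)
proxy_visual = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
world_visual = np.array([[1.0, 0.0], [0.0, 2.0], [0.1, 1.0], [1.0, 1.0]])
query = np.array([0.1, 0.9, 0.0])

proxy_idx, tiles = two_stage_search(query, proxy_text, proxy_visual,
                                    world_visual, top_k=2)
```

In this toy setup the query is closest to the second proxy description, and the returned tiles are those whose visual embeddings point in nearly the same direction as that proxy's visual embedding.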
cs.CV / 11 / 2605.05439

Safety-Critical Camera Reliability Monitoring for ADAS via Degradation-Aware Uncertainty Pattern Analysis

基于降级感知不确定性模式分析的安全关键摄像头可靠性监测用于高级驾驶辅助系统(ADAS)
Aher, Shiva
Abstract
Reliable camera input is essential for safety-critical ADAS perception, but most monitoring approaches detect sensor failures only after downstream performance has degraded. We propose a proactive camera reliability monitoring framework that estimates perception risk from degradation-induced uncertainty patterns before downstream failure becomes observable. The method introduces a Global Sensor Health Index (GSHI), a continuous reliability score that aggregates per-degradation severities using a risk-aware multiplicative formulation, allowing severe single-mode failures such as lens occlusion or motion blur to dominate the health estimate. A lightweight multi-task network predicts degradation type, severity, GSHI, and spatial uncertainty maps from a single RGB image without downstream task feedback. Training uses physics- and geometry-aware synthetic supervision over twelve camera degradation modes. Experiments on KITTI-derived degradations show that GSHI decreases monotonically with severity, achieves a health-estimation MAE of 0.064, and provides positive early-warning lead time of 0.47 ± 0.25 severity units before YOLOv8 detection failure. GSHI also outperforms IQA, detector-confidence, and clean-feature OOD baselines, and transfers zero-shot to real adverse-weather driving data. These results support degradation-aware uncertainty analysis as a practical direction for proactive camera reliability monitoring in intelligent vehicles.
Chinese Translation
可靠的摄像头输入对于安全关键的ADAS感知至关重要,但大多数监测方法仅在下游性能下降后才检测到传感器故障。我们提出了一种主动的摄像头可靠性监测框架,该框架在下游故障可观察之前,通过降级引起的不确定性模式来估计感知风险。该方法引入了全球传感器健康指数(Global Sensor Health Index, GSHI),这是一个连续的可靠性评分,使用风险感知的乘法公式聚合每种降级的严重性,使得诸如镜头遮挡或运动模糊等严重的单模式故障在健康估计中占主导地位。一个轻量级的多任务网络能够从单个RGB图像中预测降级类型、严重性、GSHI和空间不确定性图,而无需下游任务反馈。训练使用了基于物理和几何的合成监督,涵盖了十二种摄像头降级模式。在基于KITTI的数据集上进行的实验表明,GSHI随着严重性的增加单调下降,健康估计的平均绝对误差(MAE)为0.064,并在YOLOv8检测失败之前提供了0.47 ± 0.25严重性单位的正向预警提前时间。GSHI还优于图像质量评估(IQA)、检测器置信度和干净特征的OOD基线,并能够零-shot迁移到真实的不良天气驾驶数据。这些结果支持降级感知不确定性分析作为智能车辆中主动摄像头可靠性监测的一个实际方向。
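To illustrate why a multiplicative aggregation lets one severe degradation dominate the health estimate, as the GSHI abstract above describes, here is a minimal sketch. The exact GSHI formula is not given in the abstract, so the product form `prod(1 - w_i * s_i)`, the weights, and the function names are assumptions made purely for illustration.

```python
import math

def health_index(severities, weights=None):
    """Multiplicative health score in [0, 1]; 1.0 means no degradation.

    Assumption: health = prod(1 - w_i * s_i) with s_i, w_i in [0, 1].
    This is a hypothetical form, not the paper's published formula.
    """
    if weights is None:
        weights = [1.0] * len(severities)
    for s, w in zip(severities, weights):
        if not (0.0 <= s <= 1.0 and 0.0 <= w <= 1.0):
            raise ValueError("severities and weights must lie in [0, 1]")
    return math.prod(1.0 - w * s for s, w in zip(severities, weights))

def mean_health(severities):
    """Additive baseline for comparison: one minus the mean severity."""
    return 1.0 - sum(severities) / len(severities)

# Twelve degradation modes, one of them severe (e.g. lens occlusion at 0.9).
one_severe = [0.9] + [0.0] * 11
multiplicative = health_index(one_severe)
additive = mean_health(one_severe)
```

With one mode at severity 0.9 and eleven clean modes, the product collapses to about 0.1, whereas the additive baseline stays above 0.9 and would hide the failure.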
cs.CV / 12 / 2605.05447

EchoXFlow: A Beamspace Echocardiography Dataset for Cardiac Motion, Flow, and Function

EchoXFlow:用于心脏运动、血流和功能的波束空间超声心动图数据集
Stenhede, Elias, Sulkowska, Joanna, Orstad, Eivind Bjørkan, Schirmer, Henrik, Ranjbar, Arian
Abstract
We introduce EchoXFlow, a clinical echocardiography dataset for learning from ultrasound in its native acquisition geometry rather than from scan-converted Cartesian videos. Existing public datasets offer limited opportunities to study cross-modal relationships between cardiac anatomy, myocardial motion, and blood flow, as Doppler is typically absent or fused as RGB overlays, and acquisitions are released after lossy vendor display processing. EchoXFlow comprises 37125 recordings from 666 routine-care examinations, preserving the timing, geometry, and modality relationships needed for physically grounded echo learning. Each recording is retained as separable modality-specific streams: temporally resolved 1D, 2D, and 3D data alongside multiple Doppler modalities, paired with a synchronized ECG. Clinical annotations span guideline-based measurements to dense 2D myocardial contours and 3D left-ventricular endocardial meshes. With its associated open-source tooling, EchoXFlow enables cross-modal, acquisition-aware learning tasks that cannot be formulated from conventional scan-converted videos alone, and serves as a testbed for 4D vision and physically grounded multi-modal learning more broadly.
Chinese Translation
我们介绍了EchoXFlow,这是一个临床超声心动图数据集,旨在从其原生采集几何形状中学习,而不是从扫描转换的笛卡尔视频中学习。现有的公共数据集在研究心脏解剖、心肌运动和血流之间的跨模态关系方面提供的机会有限,因为多普勒通常缺失或以RGB叠加的形式融合,并且采集数据在经过有损的供应商显示处理后发布。EchoXFlow包含来自666例常规护理检查的37125个记录,保留了进行物理基础超声学习所需的时间、几何和模态关系。每个记录保留为可分离的模态特定流:时间分辨的1D、2D和3D数据,以及多个多普勒模态,配有同步的心电图(ECG)。临床注释涵盖了基于指南的测量、密集的2D心肌轮廓和3D左心室内膜网格。凭借其相关的开源工具,EchoXFlow使得跨模态、采集感知的学习任务成为可能,这些任务无法仅从传统的扫描转换视频中进行表述,并且更广泛地作为4D视觉和物理基础多模态学习的测试平台。
cs.CV / 13 / 2605.05510

The First Controllable Bokeh Rendering Challenge at NTIRE 2026

2026年NTIRE首届可控散景渲染挑战赛
Seizinger, Tim, Vasluianu, Florin-Alexandru, Chen, Jeffrey, Zhou, Zhuyun, Wu, Zongwei, Timofte, Radu, Zhang, Dafeng, Lin, Yipeng, Yan, Qi, Chen, Junhao, Yang, Yang, Singh, Divyavardhan, Thacker, Hariom, Mohammad, Hammad, Maurya, Aanchal, Upla, Kishor, Raja, Kiran, Zhou, Wei, Huang, Hongyu, Cho, Yujin, Malivenko, Grigory, Tu, Jiachen, Shi, Yaokun, Xu, Guoyi, Jiang, Yaoxin, Liu, Jiajia
Abstract
This study presents the outcomes of the first Controllable Bokeh Rendering Challenge at NTIRE and highlights the most effective submitted methodologies. In total, 44 participants registered for the competition, of which 8 teams submitted valid solutions after the conclusion of the final test phase. All submissions were evaluated on unseen images, focusing on portraits and intricate subjects with complex and visually appealing bokeh phenomena. In addition to the first track focusing on established quantitative fidelity metrics, we conducted a qualitative user study with a panel of experts for a second track focusing on perceptual assessment. As this was the inaugural challenge on this topic, most of the participants focused on refining and extending the Bokehlicious baseline method.
Chinese Translation
本研究展示了2026年NTIRE首届可控散景渲染挑战赛的结果,并强调了最有效的提交方法。共有44名参与者注册了比赛,其中8个团队在最终测试阶段结束后提交了有效的解决方案。所有提交的作品均在未见过的图像上进行评估,重点关注肖像和复杂主题,以及具有复杂且视觉上吸引人的散景现象。除了第一个轨道关注已建立的定量保真度指标外,我们还进行了第二个轨道的定性用户研究,邀请了一组专家进行感知评估。由于这是该主题的首次挑战,大多数参与者专注于完善和扩展Bokehlicious基线方法。
cs.CV / 14 / 2605.05547

Characterizing Brazilian Atlantic Forest Restoration Outcomes with Geospatial AlphaEarth Embeddings

利用地理空间AlphaEarth嵌入表征巴西大西洋森林恢复成果
Heiman, Alice
Abstract
The Atlantic Forest in Brazil is a critical biodiversity hotspot, yet less than 12-15% of its original cover remains. Although monitoring forest restoration on a large scale is essential, traditional methods are limited by the impracticality of on-the-ground reporting on such a scale and by the saturation of remote-sensing indices such as NDVI. Furthermore, reforestation is a gradual process as opposed to the rapid spectral changes caused by deforestation. In this study, we examine 1,729 restoration sites in São Paulo, using satellite embeddings from the AlphaEarth Foundation's model to evaluate their effectiveness in characterising early restoration success. We introduce the concept of a 'Reference Trajectory Embedding', defining a metric of restoration success based on cosine similarity to reference sites of mature secondary forest. We observe distinct clusters in embedding space according to different land use and land cover (LULC) types, and we can identify sites with clear change vectors. However, the signal can be noisy, and embeddings may require further fine-tuning to capture and predict site metadata beyond LULC.
Chinese Translation
巴西的大西洋森林是一个重要的生物多样性热点,但其原始覆盖率仅剩12-15%以下。尽管在大规模监测森林恢复方面至关重要,传统方法受到地面报告在如此规模上的不切实际性以及遥感指数(如NDVI)饱和的限制。此外,重新造林是一个渐进的过程,而非由于森林砍伐引起的快速光谱变化。在本研究中,我们考察了圣保罗的1,729个恢复地点,利用AlphaEarth基金会模型的卫星嵌入来评估其在表征早期恢复成功方面的有效性。我们引入了“参考轨迹嵌入”的概念,定义了一种基于与成熟次生林参考地点的余弦相似度的恢复成功度量。我们观察到嵌入空间中根据不同土地利用和土地覆盖(LULC)类型的明显聚类,并能够识别出具有明显变化向量的地点。然而,信号可能会很嘈杂,嵌入可能需要进一步微调,以捕捉和预测超出LULC的地点元数据。
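The cosine-similarity restoration metric mentioned in the abstract above could look roughly like the sketch below. The abstract does not specify how the reference embedding is constructed, so averaging the mature-forest site embeddings is an assumption, and the function names and toy vectors are hypothetical.

```python
import numpy as np

def reference_embedding(reference_sites: np.ndarray) -> np.ndarray:
    """Summarise reference sites as one vector (assumption: a simple mean)."""
    return reference_sites.mean(axis=0)

def restoration_score(site: np.ndarray, reference: np.ndarray) -> float:
    """Cosine similarity between a site embedding and the reference."""
    denom = np.linalg.norm(site) * np.linalg.norm(reference)
    if denom == 0.0:
        raise ValueError("zero-norm embedding")
    return float(site @ reference / denom)

# Toy 4-dimensional embeddings (made-up values, not AlphaEarth outputs).
mature_forest = np.array([[1.0, 0.0, 1.0, 0.0],
                          [1.0, 0.0, 0.8, 0.2]])
ref = reference_embedding(mature_forest)
pasture_site = np.array([0.0, 1.0, 0.0, 1.0])
recovering_site = np.array([0.9, 0.1, 0.7, 0.3])

pasture_score = restoration_score(pasture_site, ref)
recovering_score = restoration_score(recovering_site, ref)
```

A site whose embedding points in nearly the same direction as the reference scores close to 1, so tracking this score over successive years would give one way to follow a site's trajectory toward mature forest.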
cs.CV / 15 / 2605.05549

A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series

一种新型图调节的解缠绕稀疏Mamba模型用于增强基于MODIS时间序列的树种分类
Alkayid, Motasem, Xu, Zhengsen, Taleghanidoozdoozan, Saeid, Zhu, Yimin, Greenwood, Megan, Ledingham, Quinn, Dewis, Zack, Heffring, Mabel, El-Sheimy, Naser, Xu, Lincoln Linlin
Abstract
Although tree species classification from Moderate Resolution Imaging Spectroradiometer (MODIS) time series data is critical for supporting various environmental applications, it is a challenging task due to several key difficulties: the subtle signature differences among tree species, strong spatial-spectral-temporal information coupling, and the difficulty of modeling large-scale topological context information. To better address these challenges, this paper presents a novel Graph-regulated Disentangled Sparse Mamba model (GDS-Mamba) for enhanced tree species classification, with the following contributions. (1) First, to improve large-scale context modeling, we design a mini-batch graph-regulated approach that explicitly explores topological correlation effects among input images. (2) Second, to disentangle the high-dimensional spatial-spectral-temporal information coupling for improved feature extraction, we propose a novel disentangling Mamba architecture tailored for capturing independent spatial patterns, spectral signatures, and temporal phenology behaviors in MODIS time series. (3) Third, to improve efficiency and subtle feature learning, we design novel sparse token approaches that adaptively learn the optimum subset of tokens to better address the correlation decay problem that bottlenecks standard Mamba models. Extensive experiments using large-scale annual MOD13Q1 data across two Canadian provinces (i.e., Alberta and Saskatchewan) achieved an overall accuracy of 93.94% in Alberta and 80.19% in cross-provincial evaluations, outperforming twelve state-of-the-art classification models.
Chinese Translation
从中等分辨率成像光谱仪(MODIS)时间序列数据中进行树种分类对支持各种环境应用至关重要,但由于以下几个关键困难,这是一项具有挑战性的任务:树种之间微妙的特征差异、强烈的空间-光谱-时间信息耦合,以及建模大规模拓扑上下文信息的困难。为更好地应对这些挑战,本文提出了一种新型图调节的解缠绕稀疏Mamba模型(GDS-Mamba),其主要贡献如下:(1) 首先,为了改善大规模上下文建模,我们设计了一种小批量图调节方法,明确探索输入图像之间的拓扑相关效应。(2) 其次,为了解缠高维空间-光谱-时间信息耦合以改善特征提取,我们提出了一种新型解缠绕Mamba架构,旨在捕捉MODIS时间序列中的独立空间模式、光谱特征和时间表型行为。(3) 第三,为了提高效率和微妙特征学习,我们设计了新颖的稀疏令牌方法,能够自适应学习最佳令牌子集,以更好地解决制约标准Mamba模型的相关衰减问题。通过在两个加拿大省份(即阿尔伯塔省和萨斯喀彻温省)使用大规模年度MOD13Q1数据进行的广泛实验,阿尔伯塔省的总体准确率达到了93.94%,跨省评估的准确率为80.19%,超越了十二个最先进的分类模型。
cs.CV / 16 / 2605.05556

An extremely coarse feedback signal is sufficient for learning human-aligned visual representations

极其粗糙的反馈信号足以学习与人类对齐的视觉表征
Mehta, Yash, Bonner, Michael F.
Abstract
Artificial neural networks trained on visual tasks develop internal representations resembling those of the primate visual system, a discovery that has guided a decade of computational neuroscience. Research on building brain-aligned models has progressively embraced finer-grained supervisory signals, from object classification to contrastive self-supervised objectives that maximize distinctions among individual images, yet the role of supervisory signal granularity on brain alignment remains largely unexamined. Here we systematically investigate how the coarseness of a learning signal shapes representational alignment with human vision. We parametrically vary the level of signal granularity using a data-driven approach that partitions a set of training images into varied numbers of categories (2, 4, 8, 16, ..., 64) via PCA-based splits of pretrained embeddings. We train hundreds of neural networks across convolutional and transformer architectures on these coarse classification tasks and compare their representations to macaque electrophysiology recordings and human fMRI responses. We find that networks trained to distinguish as few as 8 broad categories learn representations that match or exceed the neural alignment of models distinguishing 1,000 classes. Even more strikingly, these coarsely trained networks align more closely with human perceptual similarity judgments than all other models evaluated, including networks trained with fine-grained supervision or self-supervision as well as leading large-scale vision models. These results demonstrate that human-like visual representations emerge from remarkably coarse feedback, reframing what learning signals vision may require and opening a path toward building AI systems that are more aligned with human perception.
Chinese Translation
在视觉任务上训练的人工神经网络发展出与灵长类动物视觉系统相似的内部表征,这一发现指导了十年的计算神经科学研究。关于构建与大脑对齐模型的研究逐渐接受了更细粒度的监督信号,从物体分类到对比自监督目标,旨在最大化个体图像之间的区别,然而监督信号的粒度对大脑对齐的影响仍然很大程度上未被探讨。在此,我们系统地研究了学习信号的粗糙度如何塑造与人类视觉的表征对齐。我们通过数据驱动的方法参数化地变化信号粒度的水平,将一组训练图像划分为不同数量的类别(2, 4, 8, 16, ..., 64),采用基于主成分分析(PCA)的预训练嵌入分割。我们在这些粗糙分类任务上训练了数百个卷积和变换器架构的神经网络,并将它们的表征与猕猴的电生理记录和人类的功能性磁共振成像(fMRI)反应进行比较。我们发现,仅训练区分少至8个宽泛类别的网络所学习到的表征,与区分1,000个类别的模型的神经对齐相匹配或超越。更为显著的是,这些粗糙训练的网络与人类感知相似性判断的对齐程度超过了所有其他评估模型,包括使用细粒度监督或自监督训练的网络,以及领先的大规模视觉模型。这些结果表明,类人的视觉表征可以从极其粗糙的反馈中涌现,重新定义了视觉学习信号的需求,并为构建更符合人类感知的人工智能系统开辟了道路。
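Editorial sketch: the abstract says only that training images are partitioned into 2, 4, 8, ... categories "via PCA-based splits of pretrained embeddings". One way such a partition could be built is shown below, assuming a recursive median split along the top principal axis. The function names, the power-iteration PCA, and the median-split rule are illustrative assumptions, not the authors' procedure.

```python
def first_principal_axis(points, iters=100):
    """Approximate the top principal axis of already-centred points by power iteration."""
    dim = len(points[0])
    axis = [1.0 / dim ** 0.5] * dim  # fixed starting direction (assumption)
    for _ in range(iters):
        # One multiplication by the unnormalised covariance: X^T (X v).
        proj = [sum(p[d] * axis[d] for d in range(dim)) for p in points]
        new = [sum(proj[i] * points[i][d] for i in range(len(points))) for d in range(dim)]
        norm = sum(v * v for v in new) ** 0.5
        if norm == 0.0:
            break
        axis = [v / norm for v in new]
    return axis


def pca_split(indices, embeddings, depth):
    """Recursively halve `indices` along the top principal axis, `depth` times."""
    if depth == 0 or len(indices) < 2:
        return [indices]
    dim = len(embeddings[0])
    mean = [sum(embeddings[i][d] for i in indices) / len(indices) for d in range(dim)]
    centred = [[embeddings[i][d] - mean[d] for d in range(dim)] for i in indices]
    axis = first_principal_axis(centred)
    scores = [sum(c[d] * axis[d] for d in range(dim)) for c in centred]
    order = sorted(range(len(indices)), key=lambda k: scores[k])
    half = len(order) // 2  # median split (assumption)
    low = [indices[k] for k in order[:half]]
    high = [indices[k] for k in order[half:]]
    return pca_split(low, embeddings, depth - 1) + pca_split(high, embeddings, depth - 1)


def coarse_labels(embeddings, num_categories):
    """Assign each embedding one of `num_categories` coarse labels (power of two)."""
    if num_categories < 1 or num_categories & (num_categories - 1):
        raise ValueError("num_categories must be a power of two")
    depth = num_categories.bit_length() - 1
    groups = pca_split(list(range(len(embeddings))), embeddings, depth)
    labels = [0] * len(embeddings)
    for label, group in enumerate(groups):
        for i in group:
            labels[i] = label
    return labels
```

Calling `coarse_labels(embeddings, 8)` on a list of embedding vectors yields labels 0 to 7 with equally sized groups whenever the number of images is divisible by 8. A network would then be trained to predict these labels in place of fine-grained classes.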
cs.CV / 17 / 2605.05572

Text-to-CAD Retrieval: a Strong Baseline

文本到计算机辅助设计(CAD)模型检索:一个强有力的基准
Pan, Honghu, Du, Zibo, Liu, Daxiang, Liu, Chengliang, Luo, Xiaoling
Abstract
Text-based retrieval of Computer-Aided Design (CAD) models is a critical yet underexplored task for the reuse of legacy industrial designs. Existing CAD repositories are typically searched using filenames or directories, which limits the efficiency, scalability, and accuracy of design retrieval. In this paper, we formally introduce text-to-CAD retrieval as a new cross-modal retrieval task, aiming to retrieve semantically relevant CAD models from large-scale databases given natural language queries. Leveraging paired text-CAD annotations from the Text2CAD dataset, we establish a practical benchmark for this task. To achieve text-based retrieval, we propose a unified framework that learns multi-modal CAD embeddings from both procedural sequences and geometric point clouds. Specifically, a sequence encoder captures the construction logic of CAD models, while a point encoder extracts explicit geometric features. A text encoder is used to learn semantic representations of textual queries. During training, we introduce a novel feature decoder that reconstructs masked sequence features via cross-attention with text and point features, encouraging implicit multi-modal alignment. At inference time, we remove this auxiliary decoder to enable efficient retrieval using concatenated sequence-point features. Our framework serves as a strong baseline for text-to-CAD retrieval and lays the foundation for downstream CAD generation paradigms, such as retrieval-augmented generation. The source code will be released.
Chinese Translation
基于文本的计算机辅助设计(CAD)模型检索是一个关键但尚未深入研究的任务,旨在重用遗留工业设计。现有的CAD库通常通过文件名或目录进行搜索,这限制了设计检索的效率、可扩展性和准确性。本文正式引入文本到CAD检索作为一种新的跨模态检索任务,旨在根据自然语言查询从大规模数据库中检索语义相关的CAD模型。利用来自Text2CAD数据集的配对文本-CAD注释,我们为该任务建立了一个实用的基准。为了实现基于文本的检索,我们提出了一个统一框架,从过程序列和几何点云中学习多模态CAD嵌入。具体而言,序列编码器捕捉CAD模型的构造逻辑,而点编码器提取显式几何特征。文本编码器用于学习文本查询的语义表示。在训练过程中,我们引入了一种新颖的特征解码器,通过与文本和点特征的交叉注意力重建被遮蔽的序列特征,从而鼓励隐式多模态对齐。在推理时,我们去除该辅助解码器,以便使用连接的序列-点特征实现高效检索。我们的框架作为文本到CAD检索的强基准,为下游CAD生成范式(如检索增强生成)奠定了基础。源代码将会发布。
cs.CV / 18 / 2605.05590

Uncertainty-Guided Edge Learning for Deep Image Regression in Remote Sensing

基于不确定性引导的边缘学习在遥感深度图像回归中的应用
Nguyen, Anh Vu, Sejdinovic, Dino, Chin, Tat-Jun
Abstract
Edge learning refers to training machine learning models deployed on edge platforms, typically using new data accumulated onboard. The computational limitations on edge devices affect not only model optimisation, but also calculation of the predictive uncertainty of the current model on the unlabelled data, which is vital for informing model updating. In this paper, we investigate edge learning in the context of performing deep image regression on a remote sensing satellite, where a deep network is executed by an onboard computer to regress a scalar $y$ from an input image, e.g., $y$ is the percentage of pixels indicating cloud coverage or land use. We propose an uncertainty-guided edge learning (UGEL) algorithm that can accurately prioritise the data to speed up training convergence of the on-board regression model. Underpinning UGEL is the calculation of predictive uncertainty based on deep beta regression, where a deep network is used to estimate the parameters of a beta distribution for which the target $y$ for an input image has a high likelihood. Compared to established methods for uncertainty estimation that are either too costly on edge devices (e.g., require many forward passes per sample) or make strict assumptions on the predictive distribution (e.g., Gaussian), deep beta regression is computable in a single forward pass and allows more general predictive distributions. Results show that UGEL delivers faster-converging edge learning than active or semi-supervised learning. Code and models are publicly available at https://github.com/anh-vunguyen/UGEL.
Chinese Translation
边缘学习是指在边缘平台上训练机器学习模型,通常使用在设备上积累的新数据。边缘设备的计算限制不仅影响模型优化,还影响当前模型在未标记数据上的预测不确定性计算,而这对于指导模型更新至关重要。本文研究了在遥感卫星上进行深度图像回归的边缘学习,其中深度网络由机载计算机执行,以从输入图像回归出标量 $y$,例如,$y$ 是指示云覆盖或土地利用的像素百分比。我们提出了一种不确定性引导的边缘学习(UGEL)算法,能够准确优先选择数据,从而加速机载回归模型的训练收敛。UGEL 的基础是基于深度贝塔回归的预测不确定性计算,其中深度网络用于估计贝塔分布的参数,使得输入图像的目标 $y$ 具有高概率。与现有的不确定性估计方法相比,这些方法在边缘设备上要么成本过高(例如,每个样本需要多次前向传播),要么对预测分布做出严格假设(例如,高斯分布),深度贝塔回归可以在一次前向传播中计算,并允许更一般的预测分布。结果表明,UGEL 提供了比主动学习或半监督学习更快收敛的边缘学习。代码和模型可在 https://github.com/anh-vunguyen/UGEL 上公开获取。
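Editorial sketch: the abstract describes deep beta regression, in which a network outputs the two parameters of a beta distribution for a target $y$ in (0, 1) in a single forward pass. The standard beta negative log-likelihood and closed-form variance are shown below. Ranking unlabelled samples by variance is an assumption for illustration; the abstract does not state UGEL's exact prioritisation rule.

```python
import math


def beta_nll(y, alpha, beta):
    """Negative log-likelihood of y in (0, 1) under Beta(alpha, beta)."""
    # log B(alpha, beta) expressed through log-gamma functions.
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return log_norm - (alpha - 1.0) * math.log(y) - (beta - 1.0) * math.log(1.0 - y)


def beta_mean_and_variance(alpha, beta):
    """Closed-form mean and variance of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total * total * (total + 1.0))
    return mean, variance


def prioritise(predicted):
    """Order sample ids by decreasing predictive variance (hypothetical rule).

    `predicted` maps a sample id to the (alpha, beta) pair a network would output.
    """
    return sorted(
        predicted,
        key=lambda s: beta_mean_and_variance(*predicted[s])[1],
        reverse=True,
    )
```

Because both the loss and the variance are closed-form functions of the two network outputs, no repeated forward passes are needed, which matches the single-pass property the abstract emphasises.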
cs.CV / 19 / 2605.05616

RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis

RAM-H1200:关于类风湿性关节炎手部X光片的统一评估与数据集
Yang, Songxiao, Wang, Haolin, Fu, Yao, Peng, Junmu, Fan, Lin, Chen, Hongruixuan, Song, Jian, Ikebe, Masayuki, Takamaeda-Yamazaki, Shinya, Okutomi, Masatoshi, Kamishima, Tamotsu, Ou, Yafei
Abstract
Rheumatoid arthritis (RA) assessment from hand radiographs requires multi-level analysis and modeling of anatomical structures and fine-grained local pathological changes. However, existing public resources do not support such unified multi-level analysis, often lacking full-hand coverage, fine-grained annotations, and consistent integration with clinical scoring systems. In particular, annotations that enable quantitative analysis of bone erosion (BE) remain scarce. RAM-H1200 contains 1,200 hand radiographs collected from six medical centers, with multi-level annotations including (i) whole-hand bone structure instance segmentation, (ii) pixel-level BE masks, (iii) SvdH-defined joint regions of interest, and (iv) joint-level SvdH scores for both BE and joint space narrowing (JSN). It is designed to evaluate whether models can jointly capture anatomical structure, localized erosive pathology, and clinically standardized RA severity from hand radiographs. The proposed BE masks enable, for the first time, quantitative BE analysis beyond coarse categorical grading by providing explicit spatial supervision for lesion extent and morphology. To our knowledge, RAM-H1200 is the first public large-scale benchmark that jointly supports whole-hand bone structure instance segmentation, pixel-level BE delineation, and clinically grounded joint-level SvdH scoring for both BE and JSN. Results across benchmark tasks show that anatomical modeling is substantially more mature than quantitative BE analysis: whole-hand bone segmentation achieves strong performance, whereas BE segmentation remains a major open challenge. By unifying anatomical structure modeling, quantitative lesion analysis, and clinically grounded SvdH scoring, RAM-H1200 provides a single benchmark for comprehensive RA analysis on hand radiographs.
Chinese Translation
从手部X光片评估类风湿性关节炎(RA)需要对解剖结构和细微局部病理变化进行多层次分析和建模。然而,现有的公共资源并不支持这种统一的多层次分析,往往缺乏完整的手部覆盖、细粒度标注以及与临床评分系统的一致整合。特别是,能够进行骨侵蚀(BE)定量分析的标注仍然稀缺。RAM-H1200包含来自六个医疗中心收集的1200幅手部X光片,具有多层次标注,包括(i)全手骨结构实例分割,(ii)像素级BE掩膜,(iii)SvdH定义的关节感兴趣区域,以及(iv)针对BE和关节间隙狭窄(JSN)的关节级SvdH评分。该数据集旨在评估模型是否能够共同捕捉解剖结构、局部侵蚀性病理和临床标准化的RA严重程度。所提出的BE掩膜首次通过提供病变范围和形态的明确空间监督,使得定量BE分析超越粗略的分类评分。根据我们的了解,RAM-H1200是第一个公开的大规模基准,联合支持全手骨结构实例分割、像素级BE划分以及针对BE和JSN的临床基础关节级SvdH评分。在基准任务中的结果显示,解剖建模的成熟度显著高于定量BE分析:全手骨分割表现出色,而BE分割仍然是一个主要的开放挑战。通过统一解剖结构建模、定量病变分析和临床基础的SvdH评分,RAM-H1200为手部X光片上的全面RA分析提供了一个单一的基准。
cs.CV / 20 / 2605.05627

Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping

利用图像生成器应对训练数据稀缺:用于森林再生制图的 Gen4Regen 数据集
Jeanson, Gabriel, Duclos, David-Alexandre, Larrivée-Hardy, William, Cochet, Noé, Boxan, Matěj, Deschênes, Anthony, Pomerleau, François, Giguère, Philippe
Abstract
Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour-intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learning-based interpretation is bottlenecked by the severe scarcity of expert-annotated imagery, particularly in complex, visually heterogeneous regeneration zones. This paper addresses the dual challenges of data scarcity and extreme class imbalance in the semantic segmentation of fine-grained forest regeneration species by providing a scalable framework that reduces reliance on manual photo-interpretation for high-resolution, millimetre-level aerial imagery. Importantly, we leverage the large-scale vision-language Nano Banana Pro model to simultaneously generate high-fidelity images and their corresponding pixel-aligned semantic masks from prompts. We introduce WilDReF-Q-V2, an expansion of a natural forest dataset with 13 977 new unlabelled and 50 labelled real images, as well as the Gen4Regen dataset, featuring 2101 pairs of synthetic images and semantic masks. Our methodology integrates real-world data with AI-generated images, highlighting that AI-generated data is highly complementary to real-world data, with unified training yielding an F1 score improvement of over 15 %pt compared to purely supervised baselines. Furthermore, we demonstrate that even small quantities of prompt-generated data significantly improve performance for underrepresented species, some of which saw per-species F1 score gains of up to 30 %pt. We conclude that vision-language models can serve as agile data generators, effectively bootstrapping perception tasks for niche AI domains where expert labels are scarce or unavailable. Our datasets, source code, and models will be available at https://norlab-ulaval.github.io/gen4regen.
Chinese Translation
可持续森林管理依赖于精确的物种组成制图,但传统的地面调查劳动密集且受地理限制。虽然无人机(UAV)提供了可扩展的数据收集方式,但向基于深度学习的解读转变受到专家标注图像严重稀缺的瓶颈,尤其是在复杂且视觉异质的再生区域。本文通过提供一个可扩展的框架,解决了细粒度森林再生物种的语义分割中数据稀缺和极端类别不平衡的双重挑战,减少了对高分辨率毫米级航空图像的手动照片解读的依赖。重要的是,我们利用大规模视觉-语言模型 Nano Banana Pro 同时生成高保真图像及其对应的像素对齐语义掩码。我们引入了 WilDReF-Q-V2,这是一个自然森林数据集的扩展,包含 13,977 张新的未标记图像和 50 张标记的真实图像,以及 Gen4Regen 数据集,包含 2,101 对合成图像和语义掩码。我们的方法将真实世界数据与 AI 生成的图像相结合,强调 AI 生成的数据与真实世界数据高度互补,统一训练相比纯监督基线获得了超过 15 %pt 的 F1 分数提升。此外,我们证明即使是少量的提示生成数据也显著提高了表现,对于一些代表性不足的物种,某些物种的 F1 分数提升达到了 30 %pt。我们得出结论,视觉-语言模型可以作为灵活的数据生成器,有效地启动专家标签稀缺或不可用的细分 AI 领域的感知任务。我们的数据集、源代码和模型将可在 https://norlab-ulaval.github.io/gen4regen 获取。
cs.CV / 21 / 2605.05636

Learning a Delighting Prior for Facial Appearance Capture in the Wild

面向野外面部外观捕捉的去光照先验学习
Han, Yuxuan, Ming, Xin, Li, Tianxiao, Shen, Zhuofan, Zhang, Qixuan, Xu, Lan, Xu, Feng
Abstract
High-quality facial appearance capture has traditionally required costly studio recording. Recent works consider an in-the-wild smartphone-based setup; however, their model-based inverse rendering paradigm struggles with the complex disentanglement of reflectance from unknown illumination. To bridge this gap, we propose to shift the paradigm into training a powerful delighting network as a prior to constrain the optimization. We leverage the OLAT dataset and the rendered Light Stage scans for training, and propose Dataset Latent Modulation (DLM) to seamlessly integrate these heterogeneous data sources. Specifically, by conditioning the core network on learnable source-aware tokens, we decouple dataset-specific styles from physical delighting principles, enabling the emergence of a delighting prior that outperforms existing proprietary models. This powerful delighting prior enables a simple and automatic appearance capture pipeline that achieves high-quality reflectance estimation from casual video inputs, outperforming prior art by a large margin. Furthermore, we leverage our appearance capture method to transform the multi-view NeRSemble dataset into NeRSemble-Scan, a large-scale collection of 4K-resolution relightable scans. By open-sourcing our model and the NeRSemble-Scan dataset, we democratize high-end facial capture and provide a new foundation for the research community to build photorealistic digital humans.
Chinese Translation
高质量的面部外观捕捉传统上需要昂贵的影棚拍摄。最近的研究考虑了基于智能手机的野外设置;然而,它们基于模型的逆渲染范式难以将反射率与未知照明解耦。为了解决这一问题,我们提出将范式转变为训练一个强大的去光照(delighting)网络作为先验,以约束优化过程。我们利用OLAT数据集和渲染的Light Stage扫描进行训练,并提出数据集潜在调制(Dataset Latent Modulation, DLM)以无缝整合这些异构数据源。具体而言,通过让核心网络以可学习的源感知标记为条件,我们将数据集特定的风格与物理去光照原理解耦,从而得到一种优于现有专有模型的去光照先验。这种强大的去光照先验使得一个简单且自动的外观捕捉流程成为可能,从随意拍摄的视频输入中实现高质量的反射率估计,大幅超越了之前的方法。此外,我们利用我们的外观捕捉方法将多视角NeRSemble数据集转化为NeRSemble-Scan,这是一个大规模的4K分辨率可重光照扫描集合。通过开源我们的模型和NeRSemble-Scan数据集,我们使高端面部捕捉技术民主化,并为研究社区提供了构建逼真数字人类的新基础。
cs.CV / 22 / 2605.05640

AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries

AffectSeek:在模糊用户查询下对长视频的主动情感理解
Zhang, Zhen, Yang, Yuhang, Jiang, Yunxiang, Lu, Yuhuan, Lu, Haifeng, Lian, Zheng, Zeng, Runhao, Hu, Xiping
Abstract
Existing affective understanding studies have mainly focused on recognizing emotions from images, audio signals, or pre-clipped video clips, where the affective evidence is already given. This passive and clip-centered setting does not fully reflect real-world scenarios, in which users often interact with long videos and express their needs through natural-language queries. In this paper, we study Vague-Query-driven video Affective Understanding (VQAU), a new task that requires models to localize affective moments in long videos, predict their emotion categories, and generate evidence-grounded rationales under vague user queries. To support this task, we construct VQAU-Bench, a benchmark that integrates long videos, vague affective queries, temporal clip annotations, emotion labels, and rationale explanations into a unified evaluation framework. VQAU-Bench enables systematic assessment of semantic-temporal-affective alignment, affective moment localization, emotion classification, and rationale generation. To address the multi-step reasoning challenges of VQAU, we further propose AffectSeek, an agentic framework that actively seeks, verifies, and explains affective moments in long videos. AffectSeek decomposes VQAU into intent interpretation, candidate localization, clip verification, emotion reasoning, and rationale generation, and progressively aligns vague user intent with long-video evidence through role-specialized reasoning and cross-stage verification. Experiments show that VQAU remains challenging for existing affective recognition models and single-step vision-language models, while AffectSeek provides a simple yet effective framework for agentic long-video affective understanding.
Chinese Translation
现有的情感理解研究主要集中在从图像、音频信号或预剪辑的视频片段中识别情感,其中情感证据已经给出。这种被动且以片段为中心的设置并未完全反映现实世界的场景,在这些场景中,用户通常与长视频互动,并通过自然语言查询表达他们的需求。本文研究了模糊查询驱动的视频情感理解(VQAU),这是一项新任务,要求模型在长视频中定位情感时刻,预测其情感类别,并在模糊用户查询下生成基于证据的推理。为支持这一任务,我们构建了VQAU-Bench,一个基准,整合了长视频、模糊情感查询、时间片段注释、情感标签和推理解释,形成统一的评估框架。VQAU-Bench使得对语义-时间-情感对齐、情感时刻定位、情感分类和推理生成的系统评估成为可能。为了解决VQAU的多步骤推理挑战,我们进一步提出了AffectSeek,一个主动框架,能够积极寻求、验证和解释长视频中的情感时刻。AffectSeek将VQAU分解为意图解释、候选定位、片段验证、情感推理和推理生成,并通过角色专业化推理和跨阶段验证逐步将模糊用户意图与长视频证据对齐。实验表明,VQAU对现有情感识别模型和单步骤视觉-语言模型仍然具有挑战性,而AffectSeek则提供了一个简单而有效的框架,用于主动的长视频情感理解。
cs.CV / 23 / 2605.05646

MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

MUSE:通过拓扑正交性解决视觉标记化中的流形错位问题
Yang, Panqi, Jing, Haodong, Chao, Jiahao, Xiang, Tingyan, Lin, Li, Hu, Yao, Luo, Yang, Ma, Yongqiang
Abstract
Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2% vs. 82.5%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at https://github.com/PanqiYang1/MUSE.
Chinese Translation
统一的视觉标记化面临高保真像素重建(空间等变性)与语义抽象(概念不变性)之间的基本权衡。我们将这一冲突归因于流形错位:简单的联合优化导致相反的梯度,从而在重建与感知之间形成零和博弈。为了解决这个问题,我们提出了MUSE,一个基于拓扑正交性的框架。通过将结构视为一个正交桥梁,MUSE在变换器(Transformers)内部解耦优化:结构梯度精炼注意力拓扑,而语义梯度更新特征值。这将破坏性干扰转变为相互增强。实验表明,MUSE打破了这一权衡,实现了最先进的生成质量(gFID 3.08),并在线性探测中超越了其教师模型InternViT-300M(85.2%对比82.5%),证明了结构对齐的重建能够增强语义感知。代码可在 https://github.com/PanqiYang1/MUSE 获取。
cs.CV / 24 / 2605.05664

Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes

稀疏到完整:从稀疏图像捕获到完整的3D场景
Shen, Yiyang, Yang, Yin, Zhou, Kun, Shao, Tianjia
Abstract
We introduce S2C-3D, a novel sparse-view 3D reconstruction framework for high-fidelity and complete scene reconstruction from as few as six to eight images. Our framework features three components: a specialized diffusion model for scene-specific image restoration, a training-free view-consistency conditioned sampling process in the diffusion model for refined Gaussian optimization, and a camera trajectory planning scheme to ensure comprehensive scene coverage. The specialized diffusion model is developed by finetuning a pretrained architecture on the input views and their corresponding degraded counterparts. The adaptation to the scene distribution allows the model to repair Gaussian renderings while effectively eliminating domain gaps. Meanwhile, the trajectory planning scheme optimizes scene coverage by connecting each newly sampled camera to its two nearest neighbors. By iteratively constructing paths and retaining only those that significantly enhance visibility, the scheme establishes a trajectory that covers the entire scene. To address multi-view conflicts, the view-consistency conditioned sampling process quantifies the consistency between neighboring repaired images. This information is injected as a condition into the sampling process of the frozen diffusion model, facilitating the generation of view-consistent images without additional training. Consequently, our approach produces high-fidelity 3D Gaussians that are robust to artifacts. Experimental results demonstrate that S2C-3D outperforms state-of-the-art methods, constructing high-quality scenes that are free from missing regions, blurring, or other artifacts with very sparse inputs. The source code and data are available at https://gapszju.github.io/S2C-3D.
Chinese Translation
我们提出了S2C-3D,一个新颖的稀疏视图3D重建框架,能够从少至六到八幅图像中进行高保真和完整的场景重建。我们的框架包含三个组件:一个用于场景特定图像修复的专用扩散模型,一个在扩散模型中进行无训练的视图一致性条件采样过程,以进行精细的高斯优化,以及一个相机轨迹规划方案,以确保全面的场景覆盖。专用扩散模型通过对输入视图及其对应的降级图像进行微调来开发。对场景分布的适应使模型能够修复高斯渲染,同时有效消除领域差距。同时,轨迹规划方案通过将每个新采样的相机连接到其两个最近邻来优化场景覆盖。通过迭代构建路径并仅保留那些显著增强可见性的路径,该方案建立了覆盖整个场景的轨迹。为了解决多视图冲突,视图一致性条件采样过程量化了相邻修复图像之间的一致性。这些信息作为条件注入到冻结的扩散模型的采样过程中,从而促进生成视图一致的图像,而无需额外训练。因此,我们的方法生成了对伪影具有鲁棒性的高保真3D高斯。实验结果表明,S2C-3D优于最先进的方法,构建出高质量的场景,且在非常稀疏的输入下没有缺失区域、模糊或其他伪影。源代码和数据可在https://gapszju.github.io/S2C-3D获取。
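Editorial sketch: the abstract states that the trajectory planner connects each newly sampled camera to its two nearest neighbours. The snippet below illustrates only that connection step on camera positions, assuming plain Euclidean distance between camera centres. The visibility-based path retention described in the abstract is not modelled, and all names are hypothetical.

```python
import math


def two_nearest(new_cam, cameras):
    """Indices of the two existing cameras closest to `new_cam`."""
    ranked = sorted(range(len(cameras)), key=lambda i: math.dist(new_cam, cameras[i]))
    return ranked[:2]


def connect(cameras, new_cam, edges):
    """Append `new_cam` and link it to its two nearest existing cameras.

    `cameras` is a list of (x, y, z) positions and `edges` a list of index pairs;
    both are updated in place. Returns the index of the new camera.
    """
    neighbours = two_nearest(new_cam, cameras)
    new_index = len(cameras)
    cameras.append(new_cam)
    edges.extend((new_index, n) for n in neighbours)
    return new_index
```

A full planner would additionally score each candidate path by how much it improves visibility and keep only the useful ones, as the abstract describes.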
cs.CV / 25 / 2605.05674

EGA: Adapting Frozen Encoders for Vector Search with Bounded Out-of-Distribution Degradation

EGA:适应冻结编码器以进行有界分布外降级的向量搜索
Zhao, Dongfang
Abstract
Vector search systems built on frozen vision encoders face queries from unseen classes at deployment, yet existing adapter training collapses under this shift: high-capacity adapters with global contrastive losses silently reassign unseen-class samples to wrong seen-class clusters, dropping worst-case Label Precision by over 40 points below the frozen baseline in our tests. We propose Euclidean Geodesic Alignment (EGA), a residual adapter that couples three principles: zero initialization, local triplet loss, and hypersphere projection. These collectively induce a self-limiting dynamic: triplets that already satisfy a small margin stop producing gradients, so the adapter automatically stops updating where the local geometry is already correct. Our experiments show that at convergence $96.5\%$ of triplets are gradient-free, leaving unseen-class regions largely untouched while still enabling full-capacity refinement of seen classes. Across five diverse out-of-distribution (OOD) benchmarks, EGA achieves the highest worst-case Label Precision on the four primary splits and a consistent improvement on the fifth. The design also transfers to stronger backbones in addition to CLIP, and we provide an analytical justification linking gradient sparsity to bounded OOD perturbation.
Chinese Translation
基于冻结视觉编码器的向量搜索系统在部署时面临来自未见类别的查询,但现有的适配器训练在这种转变下崩溃:具有全局对比损失的高容量适配器默默地将未见类别样本重新分配到错误的已见类别簇中,在我们的测试中,最坏情况下的标签精度比冻结基线下降超过40个百分点。我们提出了欧几里得测地线对齐(EGA),这是一种残差适配器,结合了三个原则:零初始化、本地三元组损失和超球面投影。这些原则共同诱导出一种自我限制的动态:已经满足小边际的三元组停止产生梯度,因此适配器在局部几何已经正确的地方自动停止更新。我们的实验表明,在收敛时,$96.5\%$ 的三元组是无梯度的,未见类别区域基本保持不变,同时仍能实现已见类别的全容量细化。在五个不同的分布外(OOD)基准测试中,EGA在四个主要分割上实现了最高的最坏情况标签精度,并在第五个分割上持续改善。该设计还可迁移到比CLIP更强的主干网络,并且我们提供了一个分析性证明,将梯度稀疏性与有界OOD扰动联系起来。
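Editorial sketch: the abstract names three ingredients for EGA: zero initialization, a local triplet loss, and hypersphere projection. The snippet below shows how they could fit together. It assumes a linear residual adapter, Euclidean distances, and a margin of 0.1, none of which are specified in the abstract.

```python
import math


def normalise(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def adapt(embedding, weight):
    """Residual adapter: frozen embedding plus a learned linear update, re-normalised.

    With `weight` initialised to all zeros the output equals the normalised
    frozen embedding, so training starts exactly at the frozen baseline.
    """
    dim = len(embedding)
    residual = [sum(weight[r][c] * embedding[c] for c in range(dim)) for r in range(dim)]
    return normalise([embedding[r] + residual[r] for r in range(dim)])


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge triplet loss; zero (hence gradient-free) once the margin is satisfied."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

The hinge is what makes the update self-limiting: any triplet whose positive is already closer than its negative by the margin contributes a loss of exactly zero. It therefore contributes no gradient, which is consistent with the abstract's report that most triplets are gradient-free at convergence.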
cs.CV / 26 / 2605.05680

MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

MotionGRPO:克服基于GRPO的自我中心运动恢复中的低组内多样性
Yao, Nanjie, Ren, Junlong, Shen, Wenhao, Wang, Hao
Abstract
This paper studies full-body 3D human motion recovery from head-mounted device signals. Existing diffusion-based methods often rely on global distribution matching, leading to local joint reconstruction errors. We propose MotionGRPO, a novel framework leveraging reinforcement learning post-training to inject fine-grained guidance into the diffusion process. Technically, we model diffusion sampling as a Markov decision process optimized via Group Relative Policy Optimization (GRPO). To this end, we introduce a hybrid reward mechanism that combines a learned conditioned perceptual model for global visual plausibility and explicit constraints for local joint precision. Our key technical insight is that policy optimization in diffusion-based recovery suffers from vanishing gradients due to limited intra-group sample diversity. To address this, we further introduce a noise-injection strategy that explicitly increases sample variance and stabilizes learning. Extensive experiments demonstrate that MotionGRPO achieves state-of-the-art performance with superior visual fidelity.
Chinese Translation
本文研究了从头戴设备信号中恢复全身3D人类运动。现有的基于扩散的方法通常依赖于全局分布匹配,导致局部关节重建误差。我们提出了MotionGRPO,这是一种新颖的框架,利用强化学习后训练向扩散过程注入细粒度指导。从技术上讲,我们将扩散采样建模为一个通过组相对策略优化(Group Relative Policy Optimization, GRPO)优化的马尔可夫决策过程。为此,我们引入了一种混合奖励机制,结合了用于全局视觉合理性的学习条件感知模型和用于局部关节精度的显式约束。我们的关键技术见解是,基于扩散的恢复中的策略优化由于组内样本多样性有限而遭遇梯度消失问题。为了解决这个问题,我们进一步引入了一种噪声注入策略,明确增加样本方差并稳定学习。大量实验表明,MotionGRPO在视觉保真度方面实现了最先进的性能。
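Editorial sketch: the abstract attributes vanishing gradients to low intra-group diversity. The usual group-relative advantage in GRPO makes this concrete: each sample's reward is standardised against the other samples in its group. The function name and the epsilon value below are illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one sampled group.

    When all samples in the group earn nearly the same reward, every
    advantage collapses towards zero and the policy gradient vanishes.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

A group of identical rewards such as `[1.0, 1.0, 1.0, 1.0]` yields all-zero advantages, whereas a more varied group yields a usable learning signal. This is the failure mode the paper's noise-injection strategy is meant to counter by increasing sample variance.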
cs.CV / 27 / 2605.05688

R2H-Diff: Guided Spectral Diffusion Model for RGB-to-Hyperspectral Reconstruction

R2H-Diff:用于RGB到高光谱重建的引导光谱扩散模型
Ding, Songyu, Zhao, Ronggiang, Sun, Mingchun, Liu, Jie
Abstract
RGB-to-hyperspectral image reconstruction is a highly ill-posed inverse problem, since multiple plausible spectral distributions may correspond to the same RGB observation. Existing regression-based methods usually learn a deterministic mapping, which limits their ability to model reconstruction uncertainty and often leads to over-smoothed spectral responses. Although diffusion models provide strong distribution modeling capability, their direct application to hyperspectral reconstruction remains challenging due to the high spectral dimensionality, strong inter-band correlations, and strict requirement for spectral fidelity. To this end, we propose R2H-Diff, an efficient diffusion-based framework tailored for RGB-to-HSI reconstruction. Specifically, R2H-Diff formulates spectral recovery as a conditional iterative refinement process, enabling progressive reconstruction under RGB guidance. We propose a Guided Spectral Refinement Module for RGB-conditioned feature fusion and a Hyperspectral-Adaptive Transposed Attention module for efficient spatial-spectral dependency modeling. Furthermore, a normalization-free denoising backbone is adopted to preserve spectral amplitude consistency, while a task-adapted linear noise schedule enables high-quality reconstruction with only five denoising steps. Extensive experiments on NTIRE2022, CAVE, and Harvard demonstrate that R2H-Diff achieves a favorable balance between reconstruction quality and computational efficiency. Notably, on NTIRE2022, R2H-Diff obtains 35.37 dB PSNR with a model of only 0.58M parameters and 12.25G FLOPs, achieving the lowest model complexity among the evaluated methods while maintaining strong reconstruction fidelity.
Chinese Translation
RGB到高光谱图像重建是一个高度不适定的逆问题,因为多个合理的光谱分布可能对应于相同的RGB观测。现有的基于回归的方法通常学习一个确定性的映射,这限制了它们建模重建不确定性的能力,并且往往导致光谱响应过于平滑。尽管扩散模型提供了强大的分布建模能力,但由于高光谱维度高、波段间相关性强以及对光谱保真度的严格要求,其在高光谱重建中的直接应用仍然具有挑战性。为此,我们提出了R2H-Diff,一个高效的基于扩散的框架,专为RGB到高光谱图像(HSI)重建而设计。具体而言,R2H-Diff将光谱恢复公式化为一个条件迭代细化过程,使得在RGB指导下进行逐步重建成为可能。我们提出了一个引导光谱细化模块,用于RGB条件下的特征融合,以及一个高光谱自适应转置注意力模块,用于高效的空间-光谱依赖建模。此外,采用了一种无归一化的去噪主干网络,以保持光谱幅度的一致性,而任务适应的线性噪声调度则使得仅需五个去噪步骤即可实现高质量重建。在NTIRE2022、CAVE和哈佛数据集上的大量实验表明,R2H-Diff在重建质量和计算效率之间达到了良好的平衡。值得注意的是,在NTIRE2022上,R2H-Diff以0.58M参数和12.25G FLOPs的不到百万参数模型获得了35.37 dB的PSNR,在评估的方法中实现了最低的模型复杂性,同时保持了强大的重建保真度。
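Editorial sketch: the abstract mentions a task-adapted linear noise schedule with only five denoising steps. The snippet below shows what a generic linear schedule of that length looks like in a standard diffusion formulation. The endpoint values `beta_start` and `beta_end` are placeholders chosen for illustration; the paper's actual values are not given in the abstract.

```python
def linear_beta_schedule(num_steps=5, beta_start=1e-4, beta_end=0.5):
    """Evenly spaced noise variances from beta_start to beta_end (placeholder endpoints)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t): the fraction of signal variance kept at step t."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out
```

With so few steps, each step must remove a large share of the noise, so the schedule endpoints matter far more than in a schedule with hundreds of steps. That is presumably why the authors describe theirs as task-adapted.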
cs.CV / 28 / 2605.05692

CFE-PPAR: Compression-friendly encryption for privacy-preserving action recognition leveraging video transformers

CFE-PPAR:一种基于压缩友好的加密方法,用于利用视频变换器的隐私保护动作识别
Lin, Haiwei, Imaizumi, Shoko, Kiya, Hitoshi
Abstract
Privacy-preserving action recognition (PPAR) enables machines to understand human activities in videos without revealing sensitive visual content. Among the various strategies for PPAR, encryption-based methods achieve strong privacy protection while maintaining high recognition performance. However, these methods lead to a catastrophic decrease in recognition performance and visual quality when the encrypted videos are compressed. That is, the previous methods are not compression-friendly. To address these issues, in this paper, we propose the first compression-friendly encryption method for PPAR, called CFE-PPAR. In CFE-PPAR, videos encrypted with secret keys can be directly recognized by a video transformer, which uses parameters transformed by the same keys as those used for video encryption. In experiments, it is verified that CFE-PPAR outperforms previous methods on the UCF101 and HMDB51 datasets under Motion-JPEG and H.264 compression.
Chinese Translation
隐私保护动作识别(PPAR)使机器能够理解视频中的人类活动,而无需揭示敏感的视觉内容。在各种PPAR策略中,基于加密的方法在保持高识别性能的同时实现了强有力的隐私保护。然而,当加密视频被压缩时,这些方法会导致识别性能和视觉质量的灾难性下降。也就是说,之前的方法并不具备压缩友好性。为了解决这些问题,本文提出了一种首个针对PPAR的压缩友好的加密方法,称为CFE-PPAR。在CFE-PPAR中,使用秘密密钥加密的视频可以被视频变换器直接识别,该变换器使用与视频加密所用相同密钥转换的参数。实验验证了CFE-PPAR在Motion-JPEG和H.264压缩下,在UCF101和HMDB51数据集上优于之前的方法。
cs.CV / 29 / 2605.05694

Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition

通过主体不变的跨模态提示调优实现自适应的物理-面部表征融合用于基于视频的情感识别
Luo, Xiwen, Li, Jia, Song, Rencheng, Liu, Yu, Cheng, Juan
Abstract
Emotion recognition from facial videos enables non-contact inference of human emotional states. Although facial expressions are widely used cues, they cannot fully reflect intrinsic affective states. Remote photoplethysmography (rPPG) provides complementary physiological information, but it is highly susceptible to noise and inter-subject variability, limiting generalization to unseen individuals. Existing multimodal methods combine facial and rPPG features, yet their fusion strategies often disrupt pretrained facial representations and lack explicit mechanisms to suppress subject-specific variations. To address these issues, we propose a subject-invariant cross-modal prompt-tuning framework for video-based emotion recognition. Specifically, rPPG waveforms are transformed into noise-robust time-frequency representations (TFRs), from which modality-complementary prompts are generated to modulate facial tokens within a frozen Vision Transformer (ViT). This design enables effective cross-modal interaction while preserving the generalizable facial representations learned by the pretrained backbone. In addition, we introduce a decoupled shared-specific adapter (DSSA) into each ViT layer to explicitly separate subject-shared and subject-specific components, thereby improving cross-subject generalization. Experiments on the MAHNOB-HCI and DEAP benchmarks demonstrate that the proposed method consistently outperforms strong baselines in both recognition accuracy and generalization ability, highlighting its effectiveness for video-based emotion recognition.
Chinese Translation
从面部视频中进行情感识别能够非接触性地推断人类情感状态。尽管面部表情是广泛使用的线索,但它们无法完全反映内在的情感状态。远程光电容积描记法(rPPG)提供了互补的生理信息,但它对噪声和个体间变异性高度敏感,限制了对未见个体的泛化。现有的多模态方法结合了面部和rPPG特征,但它们的融合策略往往会破坏预训练的面部表征,并缺乏明确的机制来抑制个体特异性变异。为了解决这些问题,我们提出了一种用于基于视频的情感识别的主体不变跨模态提示调优框架。具体而言,rPPG波形被转化为抗噪声的时频表征(TFR),从中生成模态互补的提示,以调节冻结的视觉变换器(ViT)中的面部标记。该设计能够有效实现跨模态交互,同时保留由预训练主干网络学习的可泛化面部表征。此外,我们在每个ViT层中引入了一个解耦的共享-特定适配器(DSSA),以明确分离主体共享和主体特定的组件,从而提高跨主体的泛化能力。在MAHNOB-HCI和DEAP基准上的实验表明,所提方法在识别准确性和泛化能力上始终优于强基线,突显了其在基于视频的情感识别中的有效性。
cs.CV / 30 / 2605.05711

Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

闭环:通过LLM-RL耦合实现统一的3D场景生成与沉浸式交互
Vo, Anh H., Lee, Sungyo, Kim, Phil-Joong, Choi, Soo-Mi, Kim, Yong-Guk
Abstract
Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constructs structured scene representations using LLMs, and then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints. The generated environments are deployed in a virtual reality setting to facilitate HRI-in-the-loop, where user interactions provide continuous feedback to align generated content with human perception and usability. By tightly coupling generation and interaction, the proposed framework enables more responsive, adaptive, and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation. Furthermore, qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency, highlighting the importance of closed-loop integration of generation and interaction for next-generation multimedia systems. Our project page can be found at https://proj-showcase.github.io/h3ds/.
Chinese Translation
近年来,大型语言模型(LLMs)的进步显著提升了基于语言的3D内容生成能力,但大多数现有方法仍将场景生成和用户交互视为独立的过程,限制了交互多媒体系统的适应性和沉浸潜力。本文提出了一个统一框架,闭合了基于语言的3D场景生成与沉浸式用户交互之间的环路。系统首先根据自然语言指令使用LLMs构建结构化场景表示,然后在几何和语义约束下通过强化学习优化空间布局。生成的环境被部署在虚拟现实环境中,以促进人机交互(HRI-in-the-loop),用户交互提供持续反馈,以使生成的内容与人类感知和可用性保持一致。通过紧密耦合生成与交互,所提出的框架能够实现更具响应性、适应性和真实感的多媒体体验。在ALFRED基准上的实验展示了任务驱动场景生成的最先进性能。此外,定性结果和用户研究表明在沉浸感、交互质量和任务效率方面的一致改善,强调了生成与交互闭环集成对下一代多媒体系统的重要性。我们的项目页面可以在 https://proj-showcase.github.io/h3ds/ 找到。
cs.CV / 31 / 2605.05712

EgoEMG: A Multimodal Egocentric Dataset with Bilateral EMG and Vision for Hand Pose Estimation

EgoEMG:一个包含双侧肌电图和视觉信息的多模态自我中心数据集,用于手势姿态估计
Xi, Ziheng, Yu, Jiayi, Wang, Yitao, Duan, Yanbo, Feng, Jianjiang, Zhou, Jie
Abstract
Surface electromyography (sEMG) records muscle activity during hand movement and can be decoded to recover detailed hand articulation. EMG and egocentric vision are complementary for hand sensing: EMG captures fine-grained finger articulation even under occlusion and poor lighting, while vision provides global hand configuration. However, no existing dataset synchronizes both modalities. We present EgoEMG, a multimodal egocentric dataset for bimanual hand pose estimation. EgoEMG includes bilateral wristband EMG with 16 total channels (8 per wrist) sampled at 2 kHz, 120 Hz IMU, egocentric wide-angle RGB video, external RGB-D video, and mocap-derived hand motion with wrist articulation angles. The dataset covers 41 participants performing 60 gesture classes, including 30 single-hand gestures and 30 bimanual gestures, totaling more than 10 hours of recording. We also introduce a benchmark with three tasks -- EMG-to-pose, vision-to-pose, and EMG+vision fusion -- under a shared joint-angle prediction target and common generalization split axes (cross-gesture, cross-user, and combined). As baselines, we evaluate EMGFormer for EMG-to-pose and generic ResNet/ViT backbones for vision-to-pose. We further study a residual fusion architecture that improves over matched lightweight vision-only baselines. Together, EgoEMG and its benchmark establish a foundation for future research on multimodal hand pose estimation with EMG and vision.
Chinese Translation
表面肌电图(sEMG)记录手部运动过程中的肌肉活动,并可以解码以恢复详细的手部关节动作。肌电图和自我中心视觉在手部感知中是互补的:肌电图能够捕捉到细致的手指关节动作,即使在遮挡和光线不足的情况下,而视觉则提供了全局的手部配置。然而,目前尚无现有数据集能够同步这两种模态。我们提出了EgoEMG,一个用于双手姿态估计的多模态自我中心数据集。EgoEMG包括双侧腕带肌电图,具有16个通道(每只手8个通道),以2 kHz的频率采样,120 Hz的惯性测量单元(IMU),自我中心的广角RGB视频,外部RGB-D视频,以及基于动作捕捉的手部运动数据和腕关节角度。该数据集涵盖41名参与者执行60种手势类别,包括30种单手手势和30种双手手势,总录制时间超过10小时。我们还引入了一个基准测试,包含三个任务——肌电图到姿态(EMG-to-pose)、视觉到姿态(vision-to-pose)以及肌电图与视觉融合(EMG+vision fusion),在共享的关节角度预测目标和共同的泛化分割轴(跨手势、跨用户和组合)下进行评估。作为基线,我们评估了EMGFormer用于肌电图到姿态的任务,以及通用的ResNet/ViT骨干网络用于视觉到姿态的任务。我们进一步研究了一种残差融合架构,改善了与轻量级视觉单一基线匹配的性能。EgoEMG及其基准测试为未来基于肌电图和视觉的多模态手势姿态估计研究奠定了基础。
cs.CV / 32 / 2605.05714

TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

TriRelVLA:用于可泛化体态操控的三元关系结构
Zhou, Hanyu, Ma, Chuanhao, Lee, Gim Hee
Abstract
Vision-language-action (VLA) models perform well on training-seen robotic tasks but struggle to generalize to unseen scenes and objects. A key limitation lies in their implicit visual representations, which entangle object appearance, background, and scene layout. This makes policies sensitive to visual variations. Prior work improves transferability through structured intermediate representations that objectify visual content. However, these representations mainly capture scene semantics instead of action-relevant relations. As a result, action prediction remains tied to appearance statistics. We observe that manipulation actions depend on the object-hand-task relational structure, which governs interactions among task requirements, robot states, and object properties. Based on this observation, we propose TriRelVLA, a triadic relational VLA framework for generalizable embodied manipulation. Our approach consists of three components: 1) We construct explicit object-hand-task triadic representations from multimodal inputs as relational primitives. 2) We build a task-grounded relational graph. Task-guided cross-attention forms nodes, and a relation-aware graph transformer models interactions among them. 3) We perform relation-conditioned action generation. The relational structure is compressed into a bottleneck space and projected into the LLM for action prediction. This triadic relational bottleneck reduces reliance on appearance statistics and enables transfer across scenes, objects, and task compositions. We further introduce a real-world robotic dataset for fine-tuning. Experiments show strong performance on fine-tuned tasks and clear gains in cross-scene, cross-object, and cross-task generalization.
Chinese Translation
视觉-语言-动作(VLA)模型在训练时的机器人任务上表现良好,但在未见场景和物体上泛化能力较差。其关键限制在于隐式视觉表示,这些表示将物体外观、背景和场景布局纠缠在一起,使得策略对视觉变化敏感。之前的研究通过结构化的中间表示来改善可迁移性,这些表示将视觉内容物化。然而,这些表示主要捕捉场景语义,而非与动作相关的关系。因此,动作预测仍然与外观统计数据相关。我们观察到,操控动作依赖于物体-手-任务的关系结构,该结构支配着任务要求、机器人状态和物体属性之间的互动。基于这一观察,我们提出了TriRelVLA,一个用于可泛化体态操控的三元关系VLA框架。我们的方法由三个部分组成:1)我们从多模态输入中构建显式的物体-手-任务三元表示,作为关系原语。2)我们构建一个以任务为基础的关系图。任务引导的交叉注意力形成节点,而关系感知的图变换器则建模它们之间的互动。3)我们执行关系条件的动作生成。关系结构被压缩到一个瓶颈空间,并投影到大型语言模型(LLM)中进行动作预测。这个三元关系瓶颈减少了对外观统计的依赖,并使得在不同场景、物体和任务组合之间的迁移成为可能。我们进一步引入了一个用于微调的真实世界机器人数据集。实验表明,在微调任务上表现出强劲的性能,并在跨场景、跨物体和跨任务的泛化上有明显提升。
cs.CV / 33 / 2605.05722

$\mathcal{B}^{3}$-Net: Controlled Posterior Bridge Learning for Multi-Task Dense Prediction

$\mathcal{B}^{3}$-Net:用于多任务密集预测的受控后验桥接学习
Zhou, Meihua, Yang, Li
Abstract
Multi-task dense prediction solves complementary pixel-level tasks in a unified model, such as semantic segmentation, depth estimation, surface normal estimation, and edge detection. Existing decoder-side interactions use attention, prompts, routing, diffusion, Mamba, or bridge features to exchange task evidence, but most of them organize this evidence implicitly. They usually fuse task features by similarity or affinity, without explicitly modeling that evidence reliability varies across tasks and spatial locations. As a result, unreliable evidence may contaminate the shared representation and intensify negative transfer. We propose $\mathcal{B}^{3}$-Net, a controlled posterior bridge learning framework for multi-task dense prediction. Our method decomposes decoder-side interaction into reliability estimation, posterior bridge construction, and bounded redistribution. The Precision Field Estimator estimates patch-wise evidence precision from task-reference alignment and local variation. The Posterior Bridge Operator builds a precision-weighted posterior bridge through heteroscedastic evidence fusion, yielding a shared state more reliable than uniform or heuristic mixtures. The Contractive Dispatch Operator redistributes the bridge to each task branch through a bounded update, reducing uncontrolled feature injection. Experiments on NYUD-v2, PASCAL-Context, and Cityscapes show that $\mathcal{B}^{3}$-Net achieves competitive or superior trade-offs over representative CNN-, Transformer-, diffusion-, Mamba-, and bridge-feature-based methods. Backbone-matched comparisons and extensive analyses further verify that the gains arise from controlled posterior bridge learning rather than backbone capacity or decoder scale.
Chinese Translation
多任务密集预测在统一模型中解决互补的像素级任务,如语义分割、深度估计、表面法线估计和边缘检测。现有的解码器侧交互使用注意力、提示、路由、扩散、Mamba或桥接特征来交换任务证据,但大多数方法隐式地组织这些证据。它们通常通过相似性或亲和性融合任务特征,而没有明确建模证据的可靠性在任务和空间位置之间的变化。因此,不可靠的证据可能会污染共享表示并加剧负迁移。我们提出了$\mathcal{B}^{3}$-Net,一种用于多任务密集预测的受控后验桥接学习框架。我们的方法将解码器侧交互分解为可靠性估计、后验桥接构建和有界重分配。精度场估计器根据任务参考对齐和局部变化估计补丁级证据精度。后验桥接算子通过异方差证据融合构建一个加权精度后验桥,生成比均匀或启发式混合更可靠的共享状态。收缩调度算子通过有界更新将桥接重新分配到每个任务分支,从而减少不受控的特征注入。在NYUD-v2、PASCAL-Context和Cityscapes上的实验表明,$\mathcal{B}^{3}$-Net在代表性的基于CNN、Transformer、扩散、Mamba和桥接特征的方法中实现了具有竞争力或优越的权衡。与主干匹配的比较和广泛的分析进一步验证了这些收益源于受控后验桥接学习,而非主干能力或解码器规模。
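The abstract names precision-weighted fusion followed by a bounded redistribution step but gives no formulas. The sketch below is a rough illustration only: it substitutes textbook inverse-variance weighting for the Posterior Bridge Operator and a clamped interpolation step for the Contractive Dispatch Operator. The function names, the scalar per-task precisions, and the `step` parameter are all assumptions made for illustration, not the authors' implementation.

```python
def posterior_bridge(features, precisions, eps=1e-8):
    """Precision-weighted mean of per-task features at a single patch.

    features:   list of per-task feature vectors (lists of floats)
    precisions: list of positive per-task precision scalars (assumed given)
    """
    total = sum(precisions) + eps
    dim = len(features[0])
    return [
        sum(p * f[d] for p, f in zip(precisions, features)) / total
        for d in range(dim)
    ]


def contractive_dispatch(feature, bridge, step=0.5):
    """Move a task feature toward the bridge by a step clamped to [0, 1]."""
    step = min(max(step, 0.0), 1.0)
    return [x + step * (b - x) for x, b in zip(feature, bridge)]


# Two tasks disagree at one patch; task 0 is assumed to be far more reliable.
feats = [[1.0, 0.0], [0.0, 1.0]]
prec = [9.0, 1.0]
bridge = posterior_bridge(feats, prec)
updated = [contractive_dispatch(f, bridge) for f in feats]
```

Because the step is clamped to [0, 1], each updated feature lies on the segment between the original feature and the bridge, which is one simple way to keep the injection bounded.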
cs.CV / 34 / 2605.05749

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

具有自适应更新的光线感知指针内存用于流式3D重建
Li, Feifei, Song, Qi, Zhang, Chi, Huang, Rui
Abstract
Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.
Chinese Translation
从连续图像流中进行密集3D重建需要准确的几何聚合和稳定的长期记忆管理。最近的前馈重建框架通过持久内存表示整合观测数据,但大多数主要依赖基于外观的相似性来更新内存。这种驱动外观的整合往往导致观测数据的冗余积累,并在视角变化时产生不稳定的几何结构。在本研究中,我们提出了一种光线感知指针内存,用于流式3D重建,明确建模空间位置和视角方向在统一内存表示中的关系。每个内存指针存储其3D位置、相关的光线方向和特征嵌入,使系统能够共同推理几何接近性和视角一致性。基于这种表示,我们引入了一种自适应指针更新策略,用保留或替换机制取代传统的基于融合的内存压缩。系统选择性地保留信息丰富的指针,同时丢弃冗余指针,而不是平均附近的观测,从而保持独特的几何结构,同时控制内存增长。此外,空间距离和光线方向差异的联合推理使系统能够以统一的方式区分局部冗余、新观测和潜在的循环重访。当检测到循环候选时,将触发姿态优化,以确保重建过程中的全局几何一致性。大量实验表明,所提出的光线感知内存设计显著提高了长期重建的稳定性和相机姿态的准确性,同时保持高效的流式推理。我们的方法为从图像流中进行可扩展且抗漂移的在线3D重建提供了一个有原则的框架。
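The retain-or-replace idea in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary layout, the `score` field used to decide which pointer is more informative, and both thresholds are hypothetical, and loop-revisit detection and any hard memory cap are left out.

```python
import math


def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _angle_deg(u, v):
    """Angle between two ray directions, in degrees."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.degrees(math.acos(cos))


def update_memory(memory, obs, max_dist=0.05, max_angle=15.0):
    """Retain-or-replace update; returns 'novel', 'retained', or 'replaced'.

    Each pointer is a dict with 'pos', 'ray', 'feat', and 'score' keys.
    """
    for i, ptr in enumerate(memory):
        near = _dist(ptr["pos"], obs["pos"]) <= max_dist
        same_view = _angle_deg(ptr["ray"], obs["ray"]) <= max_angle
        if near and same_view:
            # Redundant observation: keep the more informative one, never average.
            if obs["score"] > ptr["score"]:
                memory[i] = obs
                return "replaced"
            return "retained"
    # Far away, or seen from a clearly different direction: store as new.
    memory.append(obs)
    return "novel"


memory = []
a = {"pos": (0.0, 0.0, 1.0), "ray": (0.0, 0.0, 1.0), "feat": [0.1], "score": 0.4}
b = {"pos": (0.0, 0.01, 1.0), "ray": (0.0, 0.0, 1.0), "feat": [0.2], "score": 0.9}
c = {"pos": (0.0, 0.01, 1.0), "ray": (1.0, 0.0, 0.0), "feat": [0.3], "score": 0.5}
results = [update_memory(memory, o) for o in (a, b, c)]
```

Observation `c` sits at the same location as `b` but is viewed from a perpendicular direction, so it is stored rather than merged; this is the case that an appearance-only or position-only test would treat as redundant.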
cs.CV / 35 / 2605.05753

Jointly Learning Structured Representations and Stabilized Affinity for Human Motion Segmentation

联合学习结构化表示与稳定亲和力的人体运动分割
Meng, Xianghan, Huang, Zhiyuan, Tong, Zhengyu, Li, Chun-Guang
Abstract
Human Motion Segmentation (HMS), which aims to partition a video into non-overlapping segments corresponding to different human motions, has recently attracted increasing research attention. Existing HMS approaches are predominantly based on subspace clustering, which rests on the assumption that the distribution of high-dimensional temporal features aligns well with a Union-of-Subspaces (UoS). For videos in the real world, however, the raw frame-level features often violate the UoS assumption and yield unsatisfactory segmentation performance. To address this issue, we propose an efficient and effective approach for HMS, named Temporal Deep Self-expressive subspace Clustering (TDSC), which jointly learns temporally consistent structured representations and stabilized affinity for accurate and robust HMS. Specifically, in TDSC, we alternately learn structured representations of the input frame features and self-expressive coefficients via a properly regularized self-expressive model, in which a coding-rate maximization regularizer is incorporated to avoid representation collapse and conform the learned representations to span a desired UoS distribution, and meanwhile, temporal constraints are incorporated to promote temporally adjacent frames to be partitioned into the same groups. Moreover, we develop a temporal momentum averaging mechanism to stabilize affinity evolution and design a reparameterization strategy to enable efficient optimization. We conduct extensive experiments on five benchmark HMS datasets using both conventional (HoG) and up-to-date deep features (i.e., CLIP, DINOv2) to validate the effectiveness of our approach.
Chinese Translation
人体运动分割(HMS)旨在将视频划分为对应于不同人体运动的非重叠段落,近年来引起了越来越多的研究关注。现有的HMS方法主要基于子空间聚类,假设高维时间特征的分布与子空间的并集(Union-of-Subspaces, UoS)良好对齐。然而,对于现实世界中的视频,原始帧级特征往往违反UoS假设,从而导致不理想的分割性能。为了解决这一问题,我们提出了一种高效且有效的HMS方法,称为时间深度自表达子空间聚类(Temporal Deep Self-expressive subspace Clustering, TDSC),该方法联合学习时间一致的结构化表示和稳定的亲和力,以实现准确且鲁棒的HMS。具体而言,在TDSC中,我们通过适当正则化的自表达模型交替学习输入帧特征的结构化表示和自表达系数,其中包含编码率最大化正则化器,以避免表示崩溃并使学习到的表示符合所需的UoS分布,同时引入时间约束以促进时间相邻的帧被划分到同一组。此外,我们开发了一种时间动量平均机制,以稳定亲和力演变,并设计了一种重新参数化策略,以实现高效优化。我们在五个基准HMS数据集上进行了广泛实验,使用传统特征(HoG)和最新深度特征(即CLIP, DINOv2)来验证我们方法的有效性。
cs.CV / 36 / 2605.05761

iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models

iTRIALSPACE:可编程虚拟病变试验用于肺部CT模型的控制评估
Tushar, Fakrul Islam, Momy, Umme Hafsa, Lo, Joseph Y., Rubin, Geoffrey D.
Abstract
We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multidataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. The framework is built on a unified 54-attribute nodule-profile dataset spanning 13,140 annotated nodules from seven public CT sources and instantiated as 13 trial modes. We evaluate iTRIALSPACE in a 55,469-sample Virtual Lesion Study spanning three medical VLMs, four spatial-guidance conditions, and three clinical tasks. Across all 13 modes, the synthetic substrate remains within the real-to-real FID baseline, and synthetic performance rankings transfer strongly to real clinical data ($\rho$ = 0.93, p < 10$^{-15}$). Controlled trial modes expose findings unavailable to fixed-distribution benchmarks, including shortcut-driven size prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9x and 3.3x in twin-cross analysis. These results position iTRIALSPACE as an auditable evaluation infrastructure for controlled, falsifiable testing beyond static retrospective benchmarks.
Chinese Translation
我们介绍了iTRIALSPACE,一个可编程的评估框架,用于对肺部CT模型进行控制评估。标准基准是静态的回顾性集合,这些集合将病变大小、叶片分布、解剖结构和采集背景纠缠在一起,使得确定结构上驱动模型准确性的因素变得困难。iTRIALSPACE通过一个四阶段的流程解决了这一局限性:多数据集结节特征分析、明确的试验规范、解剖结构感知的掩模插入和基于ControlNet的CT合成。该框架建立在一个统一的54属性结节特征数据集之上,该数据集涵盖了来自七个公共CT源的13,140个标注结节,并实例化为13种试验模式。我们在一个包含55,469个样本的虚拟病变研究中评估了iTRIALSPACE,该研究涵盖了三个医学VLM、四种空间引导条件和三项临床任务。在所有13种模式中,合成基底保持在真实到真实的FID基线之内,合成性能排名强烈转移到真实临床数据上($\rho$ = 0.93,p < 10$^{-15}$)。控制试验模式揭示了固定分布基准无法提供的发现,包括在叶片均衡采样下,由于捷径驱动的大小预测崩溃,以及在双交叉分析中宿主与供体的方差比为8.9倍和3.3倍。这些结果使iTRIALSPACE成为一个可审计的评估基础设施,适用于超越静态回顾性基准的控制、可证伪测试。
cs.CV / 37 / 2605.05765

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

X-OmniClaw 技术报告:一种用于多模态理解与交互的统一移动代理
Ren, Xiaoming, Zhen, Ru, Li, Chao, Song, Yang, Hou, Qiuxia, Zhang, Yanhao, Liu, Peng, Qi, Qi, Zheng, Quanlong, Wu, Qi, Liao, Zhenyi, Pan, Binqiang, Ji, Haobo, Lu, Haonan
Abstract
Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.
Chinese Translation
受 OpenClaw 发展的启发,市场对能够处理复杂且直观交互的移动个人代理的需求日益增长。在本技术报告中,我们介绍了 X-OmniClaw,这是一种为 Android 生态系统设计的多模态理解与交互的统一移动代理。该统一架构结合了感知、记忆和行动,使代理能够以高上下文意识处理复杂的移动任务。具体而言,Omni Perception 提供了一个统一的多模态输入管道,集成了用户界面状态、现实世界视觉上下文和语音输入,利用时间对齐模块将原始数据分解为结构化的多模态意图表示。Omni Memory 利用多模态记忆优化,通过将任务连续性的运行时工作记忆与从本地数据提炼的长期个人记忆相结合,增强个性化智能,从而实现高度上下文感知和个性化的交互。最后,Omni Action 采用混合基础策略,将结构化的 XML 元数据与视觉感知相结合,以实现稳健的交互。通过行为克隆和轨迹重放,该系统捕捉用户导航作为可重用技能,从而实现精确的直接访问执行。在多种场景下的演示表明,X-OmniClaw 有效提升了交互效率和任务可靠性,为下一代移动原生个人助手提供了实用的架构蓝图。
cs.CV / 38 / 2605.05775

The autoPET3 Challenge -- Automated Lesion Segmentation in Whole-Body PET/CT - Multitracer Multicenter Generalization

autoPET3 挑战——全身 PET/CT 中的自动病灶分割 - 多示踪剂多中心泛化
Dexl, Jakob, Jeblick, Katharina, Mittermeier, Andreas, Schachtner, Balthasar, Stüber, Anna Theresa, Topalis, Johanna, Rokuss, Maximilian, Isensee, Fabian, Maier-Hein, Klaus H., Kalisch, Hamza, Kleesiek, Jens, Seibold, Constantin M., Alasmawi, Hussain, Chan, Lap Yan Lennon, Yuan, Yixuan, Jaus, Alexander, Stiefelhagen, Rainer, Choudja, Pauline Ornela Megne, Nikolaou, Konstantin, La Fougère, Christian, Gatidis, Sergios, Fabritius, Matthias P., Heimer, Maurice, Abaci, Gizem, Sundar, Lalith Kumar Shiyam, Werner, Rudolf A., Ricke, Jens, Cyran, Clemens C., Küstner, Thomas, Ingrisch, Michael
Abstract
We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered four tracer-center combinations, two of which represented unseen compositional pairings. A complementary data-centric award category isolated the contribution of data handling strategies by restricting participants to a fixed baseline model. Seventeen teams submitted 27 algorithms, predominantly nnU-Net-based 3D networks with PET/CT channel concatenation. The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing the false-negative volume by 5 mL relative to the provided baseline. Ranking was stable across bootstrap resampling and alternative ranking schemes for the top tier. Beyond the benchmark, we provide an in-depth analysis of segmentation performance at the patient and lesion level. Three main conclusions can be drawn: (1) in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; (2) compositional generalization to unseen tracer-center combinations remains an open problem mainly driven by systematic volume overestimation; (3) heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
Chinese Translation
我们报告了第三届 autoPET 挑战(MICCAI 2024)的设计和结果,该挑战在组成泛化设置下对全身 PET/CT 中的自动病灶分割进行了基准测试。训练数据包括来自图宾根大学医院的 1,014 例 [18F]-FDG PET/CT 研究和来自慕尼黑路德维希-马克西米利安大学医院的 597 例 [18F]/[68Ga]-PSMA PET/CT 研究,构成了迄今为止最大规模的公开注释 PSMA PET/CT 数据集。保留的测试集包含 200 例研究,涵盖了四种示踪剂-中心组合,其中两个组合代表了未见的组成配对。一个补充的数据中心奖项类别通过限制参与者使用固定的基线模型,隔离了数据处理策略的贡献。十七个团队提交了 27 个算法,主要是基于 nnU-Net 的 3D 网络,采用 PET/CT 通道连接。排名最高的算法在所有四个测试条件下达到了平均 DSC 为 0.66,FNV 为 3.18 mL,FPV 为 2.78 mL,相较于提供的基线,DSC 提高了 8%,假阴性体积减少了 5 mL。对于顶级团队,排名在自助重抽样和替代排名方案中保持稳定。除了基准测试,我们还提供了对患者和病灶级别分割性能的深入分析。可以得出三个主要结论:(1) 域内多示踪剂 PET/CT 分割是足够的,并且可能接近读者一致性;(2) 对未见示踪剂-中心组合的组成泛化仍然是一个开放问题,主要受系统性体积高估的驱动;(3) 异质性和病例难度对性能变化的影响远大于顶级团队之间算法选择的影响。
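The abstract reports DSC, false-negative volume (FNV), and false-positive volume (FPV) without defining them. The sketch below is one plausible reading, offered as an assumption rather than the challenge's official evaluation code: FNV is taken as the volume of ground-truth components that the prediction misses entirely, and FPV as the volume of predicted components that touch no ground truth. The 6-connectivity, the empty-mask Dice convention, and the toy voxel size are illustrative choices.

```python
from collections import deque

# 6-connected neighbourhood in (z, y, x); the connectivity is an assumption.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def components(voxels):
    """Split a set of (z, y, x) voxels into connected components."""
    remaining = set(voxels)
    found = []
    while remaining:
        seed = remaining.pop()
        comp, queue = {seed}, deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBORS:
                n = (z + dz, y + dy, x + dx)
                if n in remaining:
                    remaining.remove(n)
                    comp.add(n)
                    queue.append(n)
        found.append(comp)
    return found


def dice(pred, truth):
    """Dice similarity coefficient; two empty masks score 1.0 by convention."""
    if not pred and not truth:
        return 1.0
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def unmatched_volume_ml(source, other, voxel_ml):
    """Volume of components in `source` that share no voxel with `other`."""
    return sum(len(c) for c in components(source) if not (c & other)) * voxel_ml


# Toy masks: one lesion partly found, one lesion missed, one spurious blob.
truth = {(0, 0, 0), (0, 0, 1), (0, 5, 5)}
pred = {(0, 0, 0), (0, 9, 9), (0, 9, 8)}
voxel_ml = 0.5  # hypothetical voxel volume in mL

dsc = dice(pred, truth)
fnv = unmatched_volume_ml(truth, pred, voxel_ml)  # missed lesions
fpv = unmatched_volume_ml(pred, truth, voxel_ml)  # spurious predictions
```

Under this reading, a lesion that is only partly segmented lowers DSC but adds nothing to FNV, which is why the three numbers are reported together.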
cs.CV / 39 / 2605.05781

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

通过理解监督引导统一多模态模型中的视觉生成
Liu, Zeyu, Ni, Zanlin, Yue, Yang, Da, Cheng, Yang, Huan, Zhang, Di, Gai, Kun, Huang, Gao
Abstract
Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightweight framework that treats understanding not only as a distinct task, but also a direct supervisory signal to steer generative representations. By incorporating objectives that encode semantic abstraction (captioning) and structural details (visual regression), we enable effective gradient flow from understanding to generation. Extensive experiments on image generation and editing demonstrate that understanding can serve as an effective catalyst for generation.
Chinese Translation
统一多模态模型旨在弥合理解与生成之间的差距。然而,为了实现竞争力的性能,最先进的模型往往采用大幅解耦的理解和生成组件。尽管这种设计在单独任务中有效,但削弱了相互增强所需的连接,使得潜在的协同效应在经验上变得不确定。我们提出通过引入理解导向后训练(Understanding-Oriented Post-Training, UNO)来显式恢复这种协同效应,这是一种轻量级框架,将理解视为不仅是一个独立任务,还作为直接的监督信号来引导生成表示。通过结合编码语义抽象(字幕生成)和结构细节(视觉回归)的目标,我们实现了从理解到生成的有效梯度流。对图像生成和编辑的广泛实验表明,理解可以作为生成的有效催化剂。
cs.CV / 40 / 2605.05804

Na-IRSTD: Enhancing Infrared Small Target Detection via Native-Resolution Feature Selection and Fusion

Na-IRSTD:通过原生分辨率特征选择与融合增强红外小目标检测
Xu, Qian, Zhang, Chi, Zhang, Qiming, Li, Xi, Yuan, Haojuan, Zhang, Mingjin
Abstract
Infrared small target detection (IRSTD) faces the inherent challenge of precisely localizing dim targets amid complex background clutter. While progress has been made, existing methods usually follow conventional strategies to downsample features and discard small targets' details, resulting in suboptimal performance. In this paper, we present Na-IRSTD, a native-resolution feature extraction and fusion framework for IRSTD. This framework elegantly incorporates native-resolution features to preserve subtle target cues, overcoming the resolution limitations of existing infrared approaches and significantly improving the model's ability to localize small targets. We also introduce an effective token reduction and selection strategy, which selects target patches with high accuracy and confidence, boosting the low-level details of the feature while effectively reducing native-resolution patch tokens compared to dense processing, thereby avoiding imposing an unbearable computational burden. Extensive experiments demonstrate the robustness and effectiveness of our token reduction and selection strategy across multiple public datasets. Ultimately, our Na-IRSTD model achieves state-of-the-art performance on four benchmarks.
Chinese Translation
红外小目标检测(IRSTD)面临在复杂背景杂波中精确定位微弱目标的固有挑战。尽管已有所进展,但现有方法通常遵循传统策略对特征进行下采样,忽略小目标的细节,导致性能不佳。本文提出了Na-IRSTD,一个用于红外小目标检测的原生分辨率特征提取与融合框架。该框架优雅地结合了原生分辨率特征,以保留微妙的目标线索,克服了现有红外方法的分辨率限制,并显著提高了模型定位小目标的能力。我们还引入了一种有效的标记减少与选择策略,该策略以高准确性和置信度选择目标区域,增强了特征的低级细节,同时与密集处理相比,有效减少了原生分辨率补丁标记,从而避免了施加无法承受的计算负担。大量实验表明,我们的标记减少与选择策略在多个公共数据集上具有鲁棒性和有效性。最终,我们的Na-IRSTD模型在四个基准测试中实现了最先进的性能。
cs.CV / 41 / 2605.05810

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

CXR-ContraBench:医学视觉语言模型中否定选项吸引的基准测试
Fang, Zhengru, Ma, Yanan, Guo, Yu, Hu, Senkang, Zhang, Yixian, Cao, Hangcheng, Ding, Wenbo, Fang, Yuguang
Abstract
When a chest X-ray shows consolidation but the question asks which finding is present, a medical vision-language model may answer "No consolidation." This is more than an incorrect choice: it is a polarity reversal that emits a clinical statement contradicting the image. We study this failure as negated-option attraction, where a model is drawn to a negated answer option even when it conflicts with both the visual evidence and the question. We introduce CXR-ContraBench (Chest X-Ray Contradiction Benchmark), a diagnostic benchmark spanning internal ReXVQA slices and external OpenI and CheXpert protocols. The benchmark centers on present-finding questions, where selecting "No X" despite visible X creates the main clinical risk, and uses absent-finding questions as secondary tests of whether models copy negated wording. Across CheXpert protocols, the failure is substantial and persistent. On a strict direct presence probe, MedGemma and Qwen2.5-VL reach only 31.49% and 30.21% accuracy, respectively; on a matched 135,754-record CheXpert training-split protocol, both models select negated options on over 62% of presence questions. Chain-of-thought prompting reduces some presence-side reversals but does not eliminate them and can amplify absence-side contradictions. Finally, QCCV-Neg (Question-Conditioned Consistency Verifier for Negation) deterministically repairs the measured polarity-confused subset without retraining, raising MedGemma and Qwen2.5-VL to 96.60% and 95.32% accuracy on the direct presence probe. These results show that standard accuracy can hide a clinically meaningful inference-time polarity failure. Source code and benchmark construction scripts are available at https://github.com/fangzr/cxr-contrabench-code.
Chinese Translation
当胸部X光片显示有实变,但问题询问的是哪些发现存在时,医学视觉语言模型可能会回答“没有实变”。这不仅仅是一个错误选择:它是一个极性反转,发出与图像相矛盾的临床陈述。我们将这种失败研究为否定选项吸引,模型即使在与视觉证据和问题相冲突时,仍然被否定答案选项所吸引。我们引入CXR-ContraBench(胸部X光片矛盾基准),这是一个涵盖内部ReXVQA切片和外部OpenI及CheXpert协议的诊断基准。该基准集中于存在发现的问题,其中尽管可见X,选择“没有X”会造成主要的临床风险,并使用缺失发现的问题作为次要测试,以检验模型是否复制了否定措辞。在CheXpert协议中,这种失败是显著且持久的。在严格的直接存在探测中,MedGemma和Qwen2.5-VL的准确率仅分别为31.49%和30.21%;在匹配的135,754条记录的CheXpert训练拆分协议中,这两个模型在超过62%的存在问题上选择了否定选项。思维链提示减少了一些存在侧的反转,但并未消除它们,并可能加剧缺失侧的矛盾。最后,QCCV-Neg(用于否定的问答条件一致性验证器)在不重新训练的情况下确定性地修复了测量的极性混淆子集,将MedGemma和Qwen2.5-VL在直接存在探测中的准确率提高到96.60%和95.32%。这些结果表明,标准准确率可能掩盖临床上有意义的推理时极性失败。源代码和基准构建脚本可在https://github.com/fangzr/cxr-contrabench-code获取。
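The negated-option selection rate quoted above (over 62% on presence questions) is a simple ratio, and a small sketch makes the bookkeeping concrete. The record layout, the `question_type` labels, and the prefix-based polarity check are assumptions for illustration; the benchmark's own scripts in the linked repository may differ.

```python
def is_negated(option):
    """Crude polarity check; the prefix list is an illustrative assumption."""
    text = option.strip().lower()
    return text.startswith(("no ", "without ", "absence of "))


def negated_selection_rate(records):
    """Share of present-finding questions answered with a negated option."""
    presence = [r for r in records if r["question_type"] == "presence"]
    if not presence:
        return 0.0
    flipped = sum(1 for r in presence if is_negated(r["selected"]))
    return flipped / len(presence)


records = [
    {"question_type": "presence", "selected": "No consolidation"},
    {"question_type": "presence", "selected": "Consolidation"},
    {"question_type": "absence", "selected": "No pleural effusion"},
    {"question_type": "presence", "selected": "No pneumothorax"},
]
rate = negated_selection_rate(records)
```

Absence questions are excluded from the denominator because a negated answer there may be correct; only "No X" chosen on a presence question counts as a polarity reversal in this sketch.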
cs.CV / 42 / 2605.05820

ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction

ChartZero:合成先验使零样本图表数据提取成为可能
Islam, Md Touhidul, Mahmud, Yasir, Saha, Sujan Kumar, Tehranipoor, Mark, Farahmandi, Farimah
Abstract
Automated data extraction from line charts remains fundamentally bottlenecked by extreme stylistic diversity and a severe scarcity of comprehensively annotated, real-world datasets. Current end-to-end pipelines depend heavily on costly manual annotations, crippling their ability to generalize across arbitrary aesthetics and grid layouts. Furthermore, existing models suffer from two critical failure modes during reconstruction. First, extracting thin, intersecting curves frequently causes structural fragmentation and the erasure of fine visual details, as standard architectures struggle against complex backgrounds. Second, semantic association is notoriously error-prone; current pipelines rely on rigid spatial heuristics that easily break down against the unpredictable legend placements of in-the-wild charts. Finally, measuring true progress is hindered by evaluation protocols that assess isolated sub-tasks rather than holistic, end-to-end data reconstruction. To address these foundational issues, we introduce ChartZero, a parsing framework that leverages synthetic priors to enable robust zero-shot chart data extraction. By training exclusively on a purely synthetic dataset of simple mathematical functions, our model completely bypasses the real-world annotation bottleneck. We overcome curve fragmentation via a novel Global Orthogonal Instance (GOI) loss, and replace brittle spatial rules with an open-vocabulary, Vision-Language Model (VLM)-guided legend matching strategy. Accompanied by a new metric and benchmark specifically designed for full end-to-end reconstruction, our evaluations demonstrate that ChartZero significantly advances generalized plot digitization without requiring real-world supervision. Code and dataset will be released upon acceptance.
Chinese Translation
从折线图中自动提取数据仍然受到极端风格多样性和全面注释的真实世界数据集严重匮乏的根本性瓶颈。当前的端到端管道在很大程度上依赖于昂贵的人工注释,这限制了它们在任意美学和网格布局中的泛化能力。此外,现有模型在重建过程中面临两种关键的失败模式。首先,提取细小、交叉的曲线常常导致结构碎片化和细微视觉细节的丧失,因为标准架构在复杂背景下表现不佳。其次,语义关联出了名地容易出错;当前的管道依赖于刚性的空间启发式,这在面对野外图表不可预测的图例位置时容易崩溃。最后,真正的进展测量受到评估协议的阻碍,这些协议评估孤立的子任务,而不是整体的端到端数据重建。为了解决这些基础性问题,我们提出了 ChartZero,一个利用合成先验实现稳健的零样本图表数据提取的解析框架。通过仅在一个简单数学函数的纯合成数据集上进行训练,我们的模型完全绕过了真实世界注释的瓶颈。我们通过一种新颖的全局正交实例(Global Orthogonal Instance, GOI)损失克服了曲线碎片化,并用开放词汇的视觉-语言模型(Vision-Language Model, VLM)引导的图例匹配策略替代了脆弱的空间规则。伴随着专为完整的端到端重建设计的新指标和基准,我们的评估表明,ChartZero 在不需要真实世界监督的情况下显著推进了通用图表数字化。代码和数据集将在接受后发布。
cs.CV / 43 / 2605.05831

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

统一科学传播:跨科学媒体的细粒度对应关系
M, Megha Mariam K., Balasubramanian, Vineeth N., Jawahar, C. V.
Abstract
The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD
Chinese Translation
科学知识的传播日益呈现多模态特征,涵盖文本、视觉和语音,通过研究论文、幻灯片和录制演示等材料进行交流。这些不同的表现形式共同传达了研究的推理、结果和见解,提供了互补的视角,丰富了理解。然而,尽管它们的目的相同,这些材料很少以结构化的方式相互连接。缺乏跨格式的明确链接使得追踪概念、视觉和解释之间的对应关系变得困难,限制了对研究内容的统一探索和分析。为了解决这一问题,我们引入了多模态会议数据集(Multimodal Conference Dataset, MCD),这是第一个整合来自同一研究的论文、演示视频、解释视频和幻灯片的基准数据集。我们评估了一系列基于嵌入和视觉-语言模型的方法,以评估它们发现细粒度跨格式对应关系的能力,建立了该任务的第一个系统基准。我们的结果表明,视觉-语言模型表现稳健,但在细粒度对齐方面存在困难,而基于嵌入的模型能够很好地捕捉文本-视觉对应关系,但方程和符号内容在嵌入空间中形成了不同的聚类。这些发现突显了当前方法的优势和局限性,并指明了未来多模态科学理解研究的关键方向。为确保可重复性,我们在 https://github.com/meghamariamkm2002/MCD 发布了 MCD 的相关资源。
cs.CV / 44 / 2605.05848

VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

VideoRouter:用于高效长视频理解的查询自适应双路由
Lin, Kuanwei, Zhang, Wenhao, Li, Ge
Abstract
Video large multimodal models increasingly face a scalability bottleneck: long videos produce excessively long visual-token sequences, which sharply increase memory and latency during inference. While existing compression methods are effective in specific settings, most are either weakly query-aware or apply a fixed compression policy across frames, proving suboptimal when visual evidence is unevenly distributed over time. To address this, we present VideoRouter, a query-adaptive dual-router framework built on InternVL for budgeted evidence allocation. The Semantic Router predicts the dominant allocation policy, choosing between broad temporal coverage and adaptive high-resolution preservation, while the Image Router uses early LLM layers to score frame relevance. This enables aggressive compression on less relevant frames while preserving detail on critical evidence frames. To train both routers, we build Video-QTR-10K for allocation-policy supervision and Video-FLR-200K for frame-relevance supervision. Experiments on VideoMME, MLVU, and LongVideoBench show that VideoRouter consistently improves over the InternVL baseline under comparable or lower budgets, achieving up to a 67.9% token reduction.
Chinese Translation
大型多模态视频模型日益面临可扩展性瓶颈:长视频生成过长的视觉标记序列,显著增加了推理过程中的内存和延迟。虽然现有的压缩方法在特定设置下有效,但大多数方法要么对查询的感知较弱,要么在帧之间应用固定的压缩策略,这在视觉证据随时间不均匀分布时表现不佳。为了解决这个问题,我们提出了VideoRouter,一个基于InternVL的查询自适应双路由框架,用于预算证据分配。语义路由器预测主导的分配策略,在广泛的时间覆盖和自适应高分辨率保留之间进行选择,而图像路由器则利用早期的LLM层对帧相关性进行评分。这使得在不太相关的帧上进行激进压缩,同时在关键证据帧上保留细节。为了训练这两个路由器,我们构建了Video-QTR-10K用于分配策略监督,以及Video-FLR-200K用于帧相关性监督。在VideoMME、MLVU和LongVideoBench上的实验表明,VideoRouter在可比或更低预算下始终优于InternVL基线,达到了67.9%的标记减少。
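To make the idea of budgeted evidence allocation concrete, the sketch below spreads a fixed visual-token budget over frames in proportion to a relevance score, with a per-frame floor and cap. This is a hypothetical allocation rule, not the routing policy learned by VideoRouter; the function names, the floor and cap values, and the dense per-frame token count are all assumptions.

```python
def allocate_tokens(scores, budget, min_tokens=4, max_tokens=64):
    """Give every frame a floor, then split the spare budget by relevance."""
    n = len(scores)
    if budget < n * min_tokens:
        raise ValueError("budget too small for the per-frame floor")
    alloc = [min_tokens] * n
    spare = budget - n * min_tokens
    total = sum(scores)
    if total > 0:
        for i, s in enumerate(scores):
            extra = int(spare * s / total)  # round down so the budget holds
            alloc[i] = min(max_tokens, min_tokens + extra)
    return alloc


def token_reduction(alloc, dense_tokens_per_frame):
    """Fraction of tokens saved relative to dense per-frame encoding."""
    return 1.0 - sum(alloc) / (len(alloc) * dense_tokens_per_frame)


# Three frames with integer relevance scores; the middle frame matters most.
alloc = allocate_tokens([1, 7, 2], budget=96)
saved = token_reduction(alloc, dense_tokens_per_frame=64)
```

Rounding each share down keeps the total at or under the budget, at the cost of leaving a few tokens unused; a fuller version would hand the remainder to the highest-scoring frames.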
cs.CV / 45 / 2605.05850

Align3D-AD: Cross-Modal Feature Alignment and Dual-Prompt Learning for Zero-shot 3D Anomaly Detection

Align3D-AD:跨模态特征对齐与双提示学习用于零样本3D异常检测
Bai, Letian, Cao, Xuanming, Du, Juan, Tao, Chengyu
Abstract
Zero-shot 3D anomaly detection aims to identify anomalies without access to training data from target categories. However, existing methods mainly rely on projecting 3D observations into multi-view representations that primarily capture geometric cues rather than realistic visual semantics and process them with vision encoders pretrained on RGB data, leading to a significant domain gap between the encoder and the projected representations. To address this issue, we propose Align3D-AD, a unified two-stage framework that leverages the RGB modality from auxiliary categories as cross-modal guidance for zero-shot 3D anomaly detection. First, we introduce a cross-modal feature alignment paradigm that maps rendering features into the RGB semantic space. Unlike prior works that implicitly rely on pretrained encoders, our method enables direct semantic transfer from RGB observations. A semantic consistency reweighting strategy is further introduced to refine feature alignment by reweighting local regions according to holistic semantic consistency. Second, we propose a modality-aware prompt learning framework with dual-prompt contrastive alignment. By assigning independent prompts to RGB-aligned and rendering features, our method captures complementary semantics across modalities, while the contrastive alignment further enhances prompt representations to improve discriminability. Extensive experiments on MVTec3D-AD, Eyecandies, and Real3D-AD demonstrate that Align3D-AD consistently outperforms existing zero-shot methods under both one-vs-rest and cross-dataset settings, highlighting its generalization capability and robustness. Code and the dataset will be made available once our paper is accepted.
Chinese Translation
零样本3D异常检测旨在在没有目标类别训练数据的情况下识别异常。然而,现有方法主要依赖于将3D观测投影到多视图表示中,这些表示主要捕捉几何线索而非真实的视觉语义,并使用在RGB数据上预训练的视觉编码器进行处理,导致编码器与投影表示之间存在显著的领域差距。为了解决这个问题,我们提出了Align3D-AD,一个统一的两阶段框架,利用来自辅助类别的RGB模态作为零样本3D异常检测的跨模态指导。首先,我们引入了一种跨模态特征对齐范式,将渲染特征映射到RGB语义空间。与以往依赖于预训练编码器的工作不同,我们的方法实现了RGB观测的直接语义转移。此外,我们进一步引入了一种语义一致性重加权策略,通过根据整体语义一致性对局部区域进行重加权来精炼特征对齐。其次,我们提出了一种模态感知的提示学习框架,采用双提示对比对齐。通过为RGB对齐特征和渲染特征分配独立的提示,我们的方法捕捉了跨模态的互补语义,而对比对齐进一步增强了提示表示以提高可区分性。在MVTec3D-AD、Eyecandies和Real3D-AD上的大量实验表明,Align3D-AD在一对多和跨数据集设置下始终优于现有的零样本方法,突显了其泛化能力和鲁棒性。代码和数据集将在我们的论文被接受后发布。
cs.CV / 46 / 2605.05865

InkDiffuser: High-Fidelity One-shot Chinese Calligraphy via Differentiable Morphological Optimization

InkDiffuser:通过可微分形态优化实现高保真一次性中文书法生成
Shi, Kunchong, Zhang, Jing
Abstract
Current Chinese calligraphy generation methods suffer from poor stroke rendering and unrealistic ink morphology, resulting in outputs with limited visual fidelity and artistic fluidity. To address this problem, we propose InkDiffuser, a diffusion-based generative framework for one-shot Chinese calligraphy synthesis. To guarantee high-fidelity rendering, we introduce two core contributions: a high-frequency enhancement mechanism and a Differentiable Ink Structure (DIS) loss that explicitly regularizes ink morphology. Inspired by the observation that high-frequency information in individual samples typically carries contour details, we enhance content extraction by explicitly fusing high-frequency representations for more accurate font structure. Furthermore, we propose a differentiable ink structure loss that integrates differentiable morphological operations into the diffusion process. By allowing the model to learn an explicit decomposition of ink-trace structures, DIS facilitates fine-grained refinement of stroke contours and delivers significantly improved visual realism in the generated calligraphy. Extensive experiments on various calligraphic styles and complex characters demonstrate that InkDiffuser can generate superior calligraphy fonts with realistic ink rendering effects from only a single reference glyph and outperform existing few-shot font generation approaches in structural consistency, detail fidelity, and visual authenticity. The code is available at the following address: https://github.com/JingVIPLab/InkDiffuser.
Chinese Translation
当前的中文书法生成方法在笔画渲染和墨水形态方面表现不佳,导致输出的视觉保真度和艺术流畅性有限。为了解决这个问题,我们提出了InkDiffuser,一种基于扩散的生成框架,用于一次性中文书法合成。为了保证高保真渲染,我们引入了两个核心贡献:高频增强机制和可微分墨水结构(Differentiable Ink Structure, DIS)损失,后者明确地规范了墨水形态。受观察启发,单个样本中的高频信息通常携带轮廓细节,我们通过显式融合高频表示来增强内容提取,以获得更准确的字体结构。此外,我们提出了一种可微分墨水结构损失,将可微分形态操作集成到扩散过程中。通过允许模型学习墨迹结构的显式分解,DIS促进了笔画轮廓的细粒度优化,并显著提高了生成书法的视觉真实感。在各种书法风格和复杂字符上的广泛实验表明,InkDiffuser能够仅通过单个参考字形生成具有真实墨水渲染效果的优质书法字体,并在结构一致性、细节保真度和视觉真实性方面超越现有的少样本字体生成方法。代码可在以下地址获取:https://github.com/JingVIPLab/InkDiffuser。
cs.CV / 47 / 2605.05886

Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models

基于多模态大型语言模型的无训练密集手部接触估计
Jung, Daniel Sungho, Lee, Kyoung Mu
Abstract
Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning of human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabilities in understanding visual semantics, enabled by vision-language priors learned from large-scale data. However, leveraging MLLMs for dense hand contact estimation remains underexplored. There are two major challenges in applying MLLMs to dense hand contact estimation. First, encoding explicit 3D hand geometry is difficult, as MLLMs primarily operate on vision and language modalities. Second, capturing fine-grained vertex-level contact remains challenging, as MLLMs tend to focus on high-level semantics rather than detailed geometric reasoning. To address these challenges, we propose ContactPrompt, a training-free and zero-shot approach for dense hand contact estimation using MLLMs. To effectively encode 3D hand geometry, we introduce a detailed hand-part segmentation and a part-wise vertex-grid representation that provides structured, localized geometric information. To enable accurate and efficient dense contact prediction, we develop a multi-stage structured contact reasoning scheme with part conditioning, progressively bridging global semantics and fine-grained geometry. As a result, our method effectively leverages the reasoning capabilities of MLLMs while enabling precise dense hand contact estimation. Surprisingly, the proposed approach outperforms previous supervised methods trained on large-scale dense contact datasets without requiring any training. The code will be released.
Chinese Translation
密集手部接触估计需要对人类交互进行高层次的语义理解和细粒度的几何推理,以准确定位接触区域。最近,多模态大型语言模型(MLLMs)在理解视觉语义方面表现出了强大的能力,这得益于从大规模数据中学习到的视觉-语言先验。然而,利用MLLMs进行密集手部接触估计仍然未被充分探索。在将MLLMs应用于密集手部接触估计时,面临两个主要挑战。首先,编码明确的3D手部几何形状是困难的,因为MLLMs主要在视觉和语言模态上操作。其次,捕捉细粒度的顶点级接触仍然具有挑战性,因为MLLMs往往更关注高层次的语义,而非详细的几何推理。为了解决这些挑战,我们提出了ContactPrompt,这是一种无训练和零样本的方法,用于利用MLLMs进行密集手部接触估计。为了有效编码3D手部几何形状,我们引入了详细的手部分割和逐部分的顶点网格表示,以提供结构化的、局部的几何信息。为了实现准确和高效的密集接触预测,我们开发了一种多阶段结构化接触推理,结合部分条件,逐步桥接全局语义与细粒度几何。因此,我们的方法有效利用了MLLMs的推理能力,同时实现了精确的密集手部接触估计。令人惊讶的是,所提出的方法在无需任何训练的情况下,超越了以往在大规模密集接触数据集上训练的监督方法。代码将会发布。
cs.CV / 48 / 2605.05889

DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation

DBMSolver:一种无训练的扩散桥采样器,用于高质量图像到图像的转换
Venugopal, Sankarshana, Mostafavi, Mohammad, Choi, Jonghyun
Abstract
Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training-free sampler that exploits the semi-linear structure of the DBM's underlying SDE and ODE via exponential integrators, yielding highly efficient 1st- and 2nd-order solutions. This reduces NFEs by up to 5x while boosting quality (e.g., FID drops 53% on DIODE at 20 NFEs vs. the 2nd-order baseline). Experiments on inpainting, stylization, and semantics-to-image tasks across resolutions up to 256x256 show DBMSolver sets new SOTA efficiency-quality tradeoffs, enabling real-world applicability. Our code is publicly available at https://github.com/snumprlab/dbmsolver.
Chinese Translation
基于扩散的图像到图像(I2I)转换在高保真生成方面表现出色,但在最先进的扩散桥模型(DBMs)中,采样速度较慢,通常需要数十次函数评估(NFEs)。我们提出了DBMSolver,这是一种无训练的采样器,利用DBM所基于的随机微分方程(SDE)和常微分方程(ODE)的半线性结构,通过指数积分器实现高效的一级和二级解。这将NFEs减少了多达5倍,同时提高了质量(例如,在20次NFEs时,FID在DIODE上下降了53%,相较于二级基线)。在高达256x256分辨率的修复、风格化和语义到图像任务上的实验表明,DBMSolver设定了新的最优效率-质量权衡,使其具备实际应用的潜力。我们的代码已公开发布在https://github.com/snumprlab/dbmsolver。
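Illustrative sketch (not taken from the paper): the abstract above relies on exponential integrators for semi-linear equations. The snippet below shows a first-order exponential integrator (exponential Euler) on a toy scalar ODE dx/dt = lam * x + nonlinear(x, t). The constant coefficient `lam`, the function names, and the toy equation are assumptions made for illustration; the actual DBM equations and the DBMSolver update rules are those in the paper and its repository.

```python
import math


def exponential_euler_step(x, t, h, lam, nonlinear):
    """One exponential Euler step for dx/dt = lam * x + nonlinear(x, t).

    The linear part is integrated exactly; the nonlinear term is frozen
    at the start of the step.
    """
    if lam == 0.0:
        weight = h  # limit of (exp(lam*h) - 1) / lam as lam -> 0
    else:
        weight = math.expm1(lam * h) / lam
    return math.exp(lam * h) * x + weight * nonlinear(x, t)


def integrate(x0, t0, t1, steps, lam, nonlinear):
    """Apply `steps` exponential Euler steps of equal size from t0 to t1."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x = exponential_euler_step(x, t, h, lam, nonlinear)
        t += h
    return x
```

Because the linear term is handled analytically, this scheme is exact when the nonlinear term is constant, which is one intuition for why such solvers can take large steps on semi-linear problems.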
cs.CV / 49 / 2605.05891

MTL-MAD: Multi-Task Learners are Effective Medical Anomaly Detectors

MTL-MAD:多任务学习者是有效的医学异常检测器
Bercean, Bogdan Alexandru, Croitoru, Florinel Alin, Hondru, Vlad, Ceausescu, Ciprian Mihai, Ionescu, Andreea Iuliana, Ionescu, Radu Tudor
Abstract
Anomaly detection in medical images is a challenging task, since anomalies are not typically available during training. Recent methods leverage a single pretext task coupled with a large-scale pre-trained model to reach state-of-the-art performance. Instead, we propose to learn multiple self-supervised and pseudo-labeling tasks from scratch, using a joint model based on Mixture-of-Experts (MoE). By carefully integrating multiple proxy tasks, the joint model effectively learns a robust representation of normal anatomical structures, so that anomaly scores can be derived based on how well the multi-task learner (MTL) solves each task during inference. We perform comprehensive experiments on BMAD, a recent benchmark that comprises a broad range of medical image modalities. The empirical results indicate that our multi-task learner is an effective anomaly detector, outperforming all state-of-the-art competitors on BMAD. Moreover, our model produces interpretable anomaly maps, potentially helping physicians in providing more accurate diagnoses.
Chinese Translation
医学图像中的异常检测是一项具有挑战性的任务,因为在训练过程中通常无法获得异常样本。最近的方法利用单一的预训练任务结合大规模的预训练模型,以达到最先进的性能。相反,我们提出从头开始学习多个自监督和伪标记任务,使用基于专家混合模型(Mixture-of-Experts, MoE)的联合模型。通过仔细整合多个代理任务,联合模型有效地学习到正常解剖结构的鲁棒表示,从而在推理过程中根据多任务学习者(Multi-Task Learner, MTL)解决每个任务的能力来推导异常分数。我们在BMAD上进行了全面的实验,BMAD是一个包含广泛医学图像模态的最新基准。实证结果表明,我们的多任务学习者是一个有效的异常检测器,在BMAD上超越了所有最先进的竞争对手。此外,我们的模型生成可解释的异常图,可能有助于医生提供更准确的诊断。
cs.CV / 50 / 2605.05895

Detecting AI-Generated Videos with Spiking Neural Networks

利用脉冲神经网络检测人工智能生成的视频
Jang, Minsuk, Yang, Yujin, Kim, Heeseon, Son, Minseok, Kim, Younghun, Kim, Changick
Abstract
Modern AI-generated videos are photorealistic at the single-frame level, leaving inter-frame dynamics as the main remaining axis for detection. Existing detectors typically handle this temporal evidence in three ways: feeding the full frame sequence to a generic temporal backbone, reducing one dominant temporal cue to fixed video-level descriptors, or comparing temporal features to real-video statistics through a detection metric. These strategies degrade sharply under cross-generator evaluation, where artifact type and timescale vary across generators. On the caption-paired benchmark GenVidBench, we identify two signatures that prior detectors do not jointly exploit: AI-generated videos exhibit smoother frame-to-frame temporal residuals at the pixel level, and more compact trajectories in the semantic feature space, indicating a temporal smoothness gap at both levels. We further observe that, when raw video is fed into a Spiking Neural Network (SNN), fake clips elicit firing predominantly at object and motion boundaries, unlike real clips, suggesting that the SNN responds to temporal artifacts localized at edges. These cues are sparse, asynchronous, and concentrated at moments of change, which makes SNNs a natural choice for this task: their event-driven, sparsely-activated dynamics align with the structure of the residual signal in a way that dense ANN backbones do not. Building on this observation, we propose MAST, a detector that processes multi-channel temporal residuals with a spike-driven temporal branch alongside a frozen semantic encoder for cross-generator generalization. On the GenVideo benchmark, MAST achieves 93.14% mean accuracy across 10 unseen generators under strict cross-generator evaluation, matching or surpassing the strongest ANN-based detectors and demonstrating the practical applicability of SNNs to AI-generated video detection.
Chinese Translation
现代人工智能生成的视频在单帧水平上具有照片真实感,剩下的主要检测轴是帧间动态。现有的检测器通常以三种方式处理这种时间证据:将完整的帧序列输入通用时间骨干网络,将一个主导的时间线索简化为固定的视频级描述符,或通过检测度量将时间特征与真实视频统计数据进行比较。这些策略在跨生成器评估中急剧降级,因为不同生成器之间的伪影类型和时间尺度各不相同。在配有字幕的基准测试 GenVidBench 中,我们识别出两个先前检测器未能共同利用的特征:人工智能生成的视频在像素级别上表现出更平滑的帧间时间残差,并且在语义特征空间中具有更紧凑的轨迹,表明在这两个层面上存在时间平滑度差距。我们进一步观察到,当原始视频输入脉冲神经网络(SNNs)时,伪造片段主要在物体和运动边界处引发神经元放电,而真实片段则没有,这表明 SNN 对局部化在边缘的时间伪影作出反应。这些线索是稀疏的、异步的,并集中在变化的时刻,这使得 SNN 成为此任务的自然选择:它们的事件驱动、稀疏激活动态与残差信号的结构相一致,而密集的人工神经网络(ANN)骨干则不然。基于这一观察,我们提出了 MAST,一种通过脉冲驱动的时间分支处理多通道时间残差,并与冻结的语义编码器结合以实现跨生成器泛化的检测器。在 GenVideo 基准测试中,MAST 在严格的跨生成器评估下,在 10 个未见生成器中实现了 93.14% 的平均准确率,匹配或超越了最强的基于 ANN 的检测器,展示了 SNN 在人工智能生成视频检测中的实际应用潜力。
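Illustrative sketch (not the MAST pipeline): the abstract above refers to frame-to-frame temporal residuals at the pixel level. A minimal way to quantify their magnitude, assuming each frame is given as a flat list of pixel intensities of equal length, could look like the following. The function names and the flat-list frame layout are assumptions for illustration only.

```python
def temporal_residuals(frames):
    """Per-pixel differences between each frame and the one before it."""
    return [
        [curr_px - prev_px for prev_px, curr_px in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]


def mean_abs_residual(frames):
    """Average absolute frame-to-frame change; lower means smoother motion."""
    residuals = temporal_residuals(frames)
    count = sum(len(r) for r in residuals)
    if count == 0:
        return 0.0  # fewer than two frames: no temporal residual defined
    return sum(abs(v) for r in residuals for v in r) / count
```

Under the abstract's observation, a generated clip would tend to yield a lower value than a comparable real clip, although a single scalar like this is far cruder than the multi-channel residual input the paper describes.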
cs.CV / 51 / 2605.05900

Understanding Cross-Language Transfer Improvements in Low-Resource HTR: The Role of Sequence Modeling

理解低资源手写文本识别中的跨语言迁移改进:序列建模的作用
Al-azzawi, Sana, Liu, Chang, Habib, Nudrat, Barney, Elisa, Liwicki, Marcus
Abstract
Handwritten Text Recognition (HTR) for Arabic-script languages benefits from cross-language joint training under low-resource conditions, particularly when using CRNN-based models that combine convolutional encoders with sequence modeling. However, it remains unclear whether these improvements are better explained by shared visual representations or sequence-level dependencies. In this work, we conduct a controlled architectural study of line-level Arabic-script HTR, comparing CNN-only models with CTC decoding and CRNN models under identical single-script and multi-script training regimes. Experiments are performed on Arabic (KHATT), Urdu (NUST-UHWR), and Persian (PHTD) datasets under low-resource settings (K in {100, 500, 1000}). Our results show a clear divergence in transfer behavior: while CNN-only models exhibit limited or unstable improvements, CRNN models achieve better performance under multi-script training, particularly in the most data-constrained regimes. Focusing on transfer improvements (delta CER) rather than absolute performance, we find that cross-language improvements are associated with sequence-level modeling, while sharing visual representations learned by the CNN encoder, corresponding to similarities in character shapes across scripts, alone appears to be insufficient. This finding suggests that contextual modeling plays an important role in enabling effective transfer in low-resource scenarios, and that similar behavior may extend to other low-resource language settings.
Chinese Translation
阿拉伯文书写语言的手写文本识别(HTR)在低资源条件下受益于跨语言联合训练,特别是在使用结合卷积编码器与序列建模的CRNN模型时。然而,这些改进是否更好地通过共享视觉表征或序列级依赖关系来解释仍不清楚。在本研究中,我们对阿拉伯文书写的行级HTR进行了受控的架构研究,比较了仅使用CNN的模型(带CTC解码)与CRNN模型在相同的单脚本和多脚本训练模式下的表现。实验在阿拉伯语(KHATT)、乌尔都语(NUST-UHWR)和波斯语(PHTD)数据集上进行,设置为低资源条件(K为{100, 500, 1000})。我们的结果显示迁移行为存在明显的分歧:尽管仅使用CNN的模型表现出有限或不稳定的改进,CRNN模型在多脚本训练下的表现更佳,特别是在数据最为稀缺的情况下。我们关注迁移改进(delta CER)而非绝对性能,发现跨语言改进与序列级建模相关,而仅通过CNN编码器学习的共享视觉表征(对应于不同脚本间字符形状的相似性)似乎不足以解释这一现象。该发现表明,上下文建模在低资源场景中有效迁移中起着重要作用,并且类似的行为可能扩展到其他低资源语言环境。
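Illustrative sketch (not the authors' evaluation code): the study above reports transfer improvements as delta CER. Assuming CER is the total character-level edit distance divided by the total number of reference characters, and assuming delta CER is defined as single-script CER minus multi-script CER (so positive values mean joint training helped), the computation could be sketched as below. Both the sign convention and the function names are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Character error rate over a set of text lines."""
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return errors / total_chars


def delta_cer(cer_single_script, cer_multi_script):
    """Assumed convention: positive means multi-script training reduced CER."""
    return cer_single_script - cer_multi_script
```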
cs.CV / 52 / 2605.05908

Architecture-agnostic Lipschitz-constant Bayesian header and its application to resolve semantically proximal classification errors with vision transformers

与架构无关的Lipschitz常数贝叶斯头及其在解决语义相近分类错误中的应用:以视觉变换器为例
Schäfer, Frederik, Mandl, Luis, Kälber, Lars, Ricken, Tim
Abstract
Label noise remains a critical bottleneck for the generalization of supervised deep learning models, particularly when errors are structured rather than random. Standard robust training methods often fail in the presence of such semantically proximal classification errors. This work presents an architecture-agnostic Lipschitz-constant Bayesian header that can be integrated into feature extractors such as vision transformers, yielding the bi-Lipschitz-constrained Bayesian Vision Transformer (LipB-ViT). In contrast to conventional Bayesian layers, our approach enforces spectral normalization on both the mean and log-variance of the variational weights, which promotes calibrated predictive uncertainty and mitigates noise amplification. We further propose a novel metric to jointly capture uncertainty and confidence across misclassification rates, as well as an adaptive arithmetic-mean fusion scheme that combines feature-space proximity with predictive uncertainty to detect corrupted labels. This scheme outperforms state-of-the-art k-nearest-neighbor-based identification methods by more than 7%, reaching a recall of more than 0.93 at 15% semantically misclassified labels. Although computational costs increase due to Monte Carlo sampling, the method offers plug-and-play compatibility with pre-trained backbones and consistent hyperparameters across domains, suggesting strong utility for high-stakes applications with variable annotation reliability. The stabilized confidence estimates serve as the foundation for an analysis pipeline that jointly assesses dataset quality and label noise, yielding a second novel metric for their combined quantification. Lastly, we systematically evaluate LipB-ViT under both structured (adversarial) and unstructured noise at inference time, demonstrating its robustness in realistic high-noise and attack scenarios. We compare its performance against baseline methods.
Chinese Translation
标签噪声仍然是监督深度学习模型泛化的一个关键瓶颈,特别是在错误是结构化而非随机的情况下。标准的鲁棒训练方法在存在这种语义相近的分类错误时往往失效。本研究提出了一种与架构无关的Lipschitz常数贝叶斯头,可以集成到特征提取器中,如视觉变换器,生成双Lipschitz约束的贝叶斯视觉变换器(LipB-ViT)。与传统的贝叶斯层相比,我们的方法对变分权重的均值和对数方差施加谱归一化,这促进了预测不确定性的校准,并减轻了噪声放大。我们进一步提出了一种新颖的度量标准,以共同捕捉误分类率下的不确定性和置信度,以及一种自适应算术平均融合方案,将特征空间的相近性与预测不确定性结合起来,以检测被污染的标签,其性能超过了最先进的基于k近邻的识别方法,提升了超过7%,在15%的语义误分类标签下达到了超过0.93的召回率。尽管由于蒙特卡洛采样计算成本增加,该方法仍提供与预训练骨干网络的即插即用兼容性,并在不同领域中保持一致的超参数,表明其在注释可靠性变化的高风险应用中的强大实用性。稳定的置信度估计为一个分析管道奠定了基础,该管道共同评估数据集质量和标签噪声,生成了第二个新颖的度量标准用于其联合量化。最后,我们在推理时系统地评估了LipB-ViT在结构化(对抗性)和非结构化噪声下的表现,展示了其在现实高噪声和攻击场景中的鲁棒性。我们将其性能与基线方法进行了比较。
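Illustrative sketch (not the LipB-ViT implementation): the abstract above mentions spectral normalization of weight parameters. The generic technique divides a weight matrix by an estimate of its largest singular value, commonly obtained by power iteration. The pure-Python version below assumes a small dense matrix given as a list of rows; the function names and the fixed iteration count are assumptions, and how the paper applies this to the mean and log-variance of variational weights is not shown here.

```python
import math


def _matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def _norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def spectral_norm(weights, iters=100):
    """Estimate the largest singular value of `weights` by power iteration."""
    transposed = [list(col) for col in zip(*weights)]
    v = [1.0] * len(weights[0])  # fixed start vector; a random one is more robust
    for _ in range(iters):
        v = _matvec(transposed, _matvec(weights, v))  # apply W^T W
        n = _norm(v)
        if n == 0.0:
            return 0.0
        v = [x / n for x in v]
    return _norm(_matvec(weights, v))


def spectral_normalize(weights, iters=100):
    """Rescale `weights` so its largest singular value is approximately 1."""
    sigma = spectral_norm(weights, iters)
    if sigma == 0.0:
        return [row[:] for row in weights]
    return [[w / sigma for w in row] for row in weights]
```

A linear map normalized this way has a Lipschitz constant of about 1 with respect to the Euclidean norm, which is the property that limits how much the layer can amplify input perturbations.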
cs.CV / 53 / 2605.05910

Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model

即插即用的类感知知识注入框架用于视觉-语言模型的提示学习
Yin, Junhui, Pu, Nan, Zhang, Xinyu, Yang, Lingfeng, Wu, Lin, Wang, Xiaojie, Zhong, Zhun
Abstract
Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either learning class-shared prompts for a given domain or generating instance-specific prompts through conditional prompt learning. While these methods have achieved promising performance, they often overlook class-specific knowledge in prompt design, leading to suboptimal outcomes. The underlying reasons are: 1) class-specific prompts offer more fine-grained supervision compared to coarse class-shared prompts, which helps prevent misclassification of data from different classes into a single class; 2) compared to class-specific prompts, instance-specific prompts neglect the richer class-level information across multiple instances, potentially causing data from the same class to be divided into multiple classes. To effectively supplement the class-specific knowledge into existing methods, we propose a plug-and-play Class-Aware Knowledge Injection (CAKI) framework. CAKI comprises two key components, i.e., class-specific prompt generation and query-key prompt matching. The former encodes class-specific knowledge into prompts from few-shot samples that belong to the same class and stores the learned prompts in a class-level knowledge bank. The latter provides a plug-and-play mechanism for each test instance to retrieve relevant class-level knowledge from the knowledge bank and inject such knowledge to refine model predictions. Extensive experiments demonstrate that our CAKI effectively improves the performance of existing methods on base and novel classes. Code is publicly available at https://github.com/yjh576/CAKI.
Chinese Translation
提示学习已成为增强视觉-语言模型(VLMs)如CLIP在各种下游任务中有效且广泛使用的技术,特别是在特定领域的零样本分类中。现有方法通常专注于为给定领域学习类共享提示或通过条件提示学习生成实例特定提示。尽管这些方法取得了令人鼓舞的性能,但它们往往忽视了提示设计中的类特定知识,导致次优结果。其根本原因在于:1)与粗糙的类共享提示相比,类特定提示提供了更细粒度的监督,这有助于防止不同类别的数据被错误分类到同一类别;2)与类特定提示相比,实例特定提示忽视了多个实例之间更丰富的类级信息,可能导致同一类别的数据被划分为多个类别。为了有效地将类特定知识补充到现有方法中,我们提出了一种即插即用的类感知知识注入(CAKI)框架。CAKI包含两个关键组件,即类特定提示生成和查询-键提示匹配。前者从属于同一类别的少量样本中编码类特定知识并将学习到的提示存储在类级知识库中。后者为每个测试实例提供了一种即插即用的机制,以从知识库中检索相关的类级知识并注入这些知识以优化模型预测。大量实验表明,我们的CAKI有效提升了现有方法在基础类和新类上的性能。代码可在此链接公开获取:https://github.com/yjh576/CAKI。
cs.CV / 54 / 2605.05922

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

思考,然后评分:视频奖励建模中的解耦推理与评分
Wang, Yuan, Li, Ouxiang, Xu, Yulong, Liao, Borui, Liang, Jiajun, Li, Jinghan, Wang, Meng, Wang, Xintao, Wang, Pengfei, Liu, Kuien, Wang, Xiang
Abstract
Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, Generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.
Chinese Translation
近年来,生成视频模型的进展越来越依赖于后训练和测试时的扩展,这两者都在很大程度上依赖于视频奖励模型(RMs)的质量。理想的奖励模型应能够预测与人类偏好一致的准确奖励,适用于多种场景。然而,现有范式面临一个根本性困境:判别性RMs直接在由多模态大型语言模型(MLLMs)提取的特征上回归奖励,缺乏明确的推理,这使得它们容易受到捷径学习的影响,并且在泛化方面严重依赖于大量数据扩展。相比之下,具有链式思维(Chain-of-Thought, CoT)推理的生成性RMs展现出更优的可解释性和泛化潜力,因为它们利用细粒度的语义监督来内化人类偏好的理由。然而,由于推理和评分在单一自回归推理链中的耦合,它们面临固有的优化瓶颈。为了利用CoT推理的泛化优势,同时减轻耦合推理与评分的训练不稳定性,我们提出了DeScore,这是一种训练高效且具有良好泛化能力的视频奖励模型。DeScore采用解耦的“思考-然后评分”范式:首先由MLLM生成明确的CoT,然后由一个专门的判别评分模块进行评分,该模块由一个可学习的查询标记和一个回归头组成,用于预测最终奖励。DeScore通过一个两阶段框架进行优化:(1)一个判别冷启动阶段,结合随机掩码机制以确保稳健的评分能力;(2)一个双目标强化学习阶段,独立优化CoT推理质量并校准最终奖励,确保更高质量的推理直接转化为更优的模型性能。
cs.CV / 55 / 2605.05928

Backdoor Mitigation in Object Detection via Adversarial Fine-Tuning

通过对抗微调进行目标检测中的后门缓解
Dunnett, Kealan, Arablouei, Reza, Miller, Dimity, Dedeoglu, Volkan, Jurdak, Raja
Abstract
Backdoor attacks can implant malicious behaviours into deep models while preserving performance on clean data, posing a serious threat to safety-critical vision systems. Although backdoor mitigation has been studied extensively for image classification, defenses for object detection remain comparatively underdeveloped. Adversarial fine-tuning is a common backdoor mitigation approach in classification, but adapting it to detection is nontrivial as classification-oriented adversarial generation does not match the detection attack space, where attacks may cause object misclassification or disappearance, and standard detection losses can dilute the repair signal across many predictions. We address these challenges through a detection-aware adversarial fine-tuning framework for mitigating object-detection backdoors when the defender has access only to a compromised detector and a small clean dataset, without knowing the attack objective. For adversarial generation that does not require knowledge of the attack objective, we introduce soft-branch minimisation, which uses a soft gate to combine objectives aligned with misclassification and disappearance attacks, together with a detection-aware classification-loss maximisation. For targeted repair, we introduce a dual-objective fine-tuning loss applied to target-matched predictions, concentrating the defensive update on predictions most relevant to the backdoor behaviour. Experiments across CNN- and Transformer-based detectors show that our approach more effectively reduces attack success while preserving true detections, compared with classification-oriented baselines, and maintains competitive clean detection performance.
Chinese Translation
后门攻击可以在深度模型中植入恶意行为,同时保持在干净数据上的性能,这对安全关键的视觉系统构成了严重威胁。尽管后门缓解在图像分类中得到了广泛研究,但针对目标检测的防御仍然相对欠发达。对抗微调是分类中常见的后门缓解方法,但将其适应于检测并非易事,因为面向分类的对抗生成与检测攻击空间不匹配,后者可能导致对象误分类或消失,而标准检测损失可能会稀释在多个预测中的修复信号。我们通过一个检测感知的对抗微调框架来应对这些挑战,以缓解目标检测中的后门攻击,前提是防御者仅能访问一个受损的检测器和一个小的干净数据集,而不知道攻击目标。对于不需要攻击目标知识的对抗生成,我们引入了软分支最小化,它使用软门控来结合与误分类和消失攻击对齐的目标,以及检测感知的分类损失最大化。对于有针对性的修复,我们引入了一个双目标微调损失,应用于目标匹配的预测,集中防御更新于与后门行为最相关的预测。基于CNN和Transformer的检测器的实验表明,与面向分类的基线相比,我们的方法在有效减少攻击成功率的同时保持真实检测,并维持竞争力的干净检测性能。
cs.CV / 56 / 2605.05933

Whole-body CT attenuation and volume charts from routine clinical scans via evidence-grounded LLM report filtering

通过证据驱动的LLM报告过滤从常规临床扫描中获得的全身CT衰减和体积图表
Wachinger, Christian, Renger, Bernhard, Späth, Christopher, Kirschke, Jan, Makowski, Marcus
Abstract
Interpreting quantitative CT biomarkers, such as organ volume and tissue attenuation, requires large-scale healthy reference distributions. However, creating these is challenging because clinical datasets are often heavily enriched with pathology. Here, we develop an evidence-grounded, cross-verified large language model (LLM) ensemble to filter pathological findings from radiology reports, enabling the construction of pathology-reduced cohorts from over 350,000 CT examinations. Five LLMs, first, flag structure-level abnormality candidates grounded in verbatim report evidence and, second, resolve disagreements via cross-verification. Using distribution-aware generalized additive models for location, scale, and shape, we establish comprehensive whole-body reference charts for 106 anatomical structures (volumes and attenuation) across adulthood, accounting for age, sex, contrast enhancement, and acquisition parameters. Longitudinal analyses reveal structure- and contrast-dependent changes distinct from cross-sectional trends. These resources facilitate covariate-adjusted centile scoring from routine CT, supporting standardized quantitative phenotyping, multi-site imaging studies, and scalable opportunistic screening research.
Chinese Translation
解读定量CT生物标志物,如器官体积和组织衰减,需要大规模的健康参考分布。然而,由于临床数据集通常严重富含病理信息,创建这些参考分布具有挑战性。在此,我们开发了一种证据驱动的、交叉验证的大型语言模型(LLM)集成,以从放射学报告中过滤病理发现,从而能够从超过350,000例CT检查中构建减少病理影响的队列。五个LLM首先基于逐字报告证据标记结构级异常候选项,其次通过交叉验证解决分歧。我们使用分布感知的广义加性模型来建立106个解剖结构(体积和衰减)的全面全身参考图表,考虑了年龄、性别、对比增强和采集参数。纵向分析揭示了与横断面趋势不同的结构和对比依赖性变化。这些资源促进了从常规CT中进行协变量调整的百分位评分,支持标准化的定量表型、多中心影像研究和可扩展的机会筛查研究。
cs.CV / 57 / 2605.05941

RAWild: Sensor-Agnostic RAW Object Detection via Physics-Guided Curve and Grid Modeling

RAWild:通过物理引导的曲线和网格建模实现传感器无关的RAW目标检测
Liu, Shuhong, Chang, Gengjia, Liu, Jun, Chu, Xuangeng, Zheng, Yinqiang, Harada, Tatsuya, Cui, Ziteng
Abstract
Camera sensor RAW data offers intrinsic advantages for object detection, including deeper bit depth, preserved physical information, and freedom from image signal processor (ISP) distortions. However, varying exposure conditions, spectral sensitivities, and bit depths across devices introduce substantially larger domain gaps than sRGB, making sensor-agnostic generalization a fundamental challenge. In this study, we present RAWild, a physics-guided global-local tone mapping framework for sensor-agnostic RAW object detection. By factoring sensor-induced variations into a global tonal correction and a spatially adaptive local color adjustment, both driven by RAW distribution priors, our framework enables a single network to train jointly across heterogeneous sensors. To further support cross-sensor generalization, we construct a physics-based RAW simulation pipeline that synthesizes realistic sensor outputs spanning diverse spectral sensitivities, illuminants, and sensor non-idealities. Extensive experiments across multiple RAW benchmarks covering bit depths from 10 to 24 demonstrate state-of-the-art (SOTA) performance under single-dataset, mixed-dataset, and challenging robustness settings.
Chinese Translation
相机传感器的RAW数据在目标检测中具有内在优势,包括更深的位深度、保留的物理信息以及不受图像信号处理器(ISP)失真的影响。然而,不同设备之间的曝光条件、光谱灵敏度和位深度的变化引入了比sRGB更大的领域差异,使得传感器无关的泛化成为一项基本挑战。在本研究中,我们提出了RAWild,一种物理引导的全局-局部色调映射框架,用于传感器无关的RAW目标检测。通过将传感器引起的变化分解为基于RAW分布先验的全局色调校正和空间自适应局部颜色调整,我们的框架使得单个网络能够在异构传感器上进行联合训练。为了进一步支持跨传感器的泛化,我们构建了一个基于物理的RAW仿真管道,合成涵盖不同光谱灵敏度、光源和传感器非理想性的真实传感器输出。在多个RAW基准测试中进行的大量实验,涵盖从10位到24位的位深度,展示了在单数据集、混合数据集和具有挑战性的鲁棒性设置下的最先进(SOTA)性能。
cs.CV / 58 / 2605.05945

MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware

MobileEgo Anywhere:开放基础设施用于在普通硬件上收集长期自我中心数据
Palanisamy, Senthil, Anand, Abhishek, Rathor, Satpal Singh, Patnaik, Pratyush, Khatana, Shubhanshu
Abstract
The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large-scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fail to capture the long-horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour-plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high-fidelity, long-term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are threefold: (1) we release a novel dataset comprising 200 hours of diverse, long-form egocentric data with persistent state tracking; (2) we open-source a mobile application that enables any user to record egocentric data; and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training-ready formats for Vision Language Action model and foundation model research. By democratizing the data collection process, this work enables the massive-scale acquisition of long-horizon data across varied global environments, accelerating the development of generalizable robotic policies.
Chinese Translation
近年来,视觉语言动作(Vision Language Action, VLA)模型的快速发展推动了对大规模自我中心数据集的迫切需求。然而,现有数据集通常受到短时间段的限制,通常仅持续几分钟,这无法捕捉到复杂机器人任务执行所需的长期时间依赖性。为了解决这一问题,我们提出了MobileEgo Anywhere,一个旨在利用普通移动硬件收集稳健的超过一小时自我中心轨迹的框架。我们利用现代智能手机普遍存在的传感器套件,提供高保真度、长期的相机姿态跟踪,有效消除了与传统机器人数据收集相关的高硬件门槛。我们的贡献主要有三方面:(1)我们发布了一个新数据集,包含200小时多样化的长期自我中心数据,并具备持续状态跟踪;(2)我们开源了一款移动应用,使任何用户都能录制自我中心数据;(3)我们提供了一个全面的处理流程,将原始移动捕获转换为标准化、适合训练的格式,以供视觉语言动作模型和基础模型研究使用。通过民主化数据收集过程,这项工作使得在各种全球环境中大规模获取长期数据成为可能,加速了可推广机器人策略的发展。
cs.CV / 59 / 2605.05979

Prompt-Free and Efficient SAM2 Adaptation for Biomedical Semantic Segmentation via Dual Adapters

无提示且高效的SAM2适应方法用于生物医学语义分割,通过双适配器实现
Mitsuoka, Hinako, Hotta, Kazuhiro
Abstract
Segment Anything Model 2 (SAM2) demonstrated impressive zero-shot capabilities on natural images but faces challenges in biomedical segmentation due to significant domain shifts and prompt dependency. To address these limitations, we propose a prompt-free, parameter-efficient fine-tuning framework designed for multi-class segmentation on variable-sized inputs. We introduce a convolutional Positional Encoding Generator to adapt effectively to arbitrary aspect ratios and present a dual-adapter strategy: a High-Performance Adapter utilizing deformable convolutions for precise boundary modeling and a Lightweight Adapter employing structural re-parameterization to minimize inference latency. Experiments on ISBI 2012, Kvasir-SEG, Synapse, and ACDC datasets demonstrate that our approach significantly outperforms strong adaptation baselines. Specifically, our method improved segmentation accuracy by up to 19.66% over the vanilla SAM2, while reducing computational costs by approximately 87% compared to heavyweight medical SAM adaptations, establishing a superior trade-off between accuracy and efficiency.
Chinese Translation
Segment Anything Model 2 (SAM2) 在自然图像上展示了令人印象深刻的零样本能力,但由于显著的领域转移和对提示的依赖,在生物医学分割中面临挑战。为了解决这些局限性,我们提出了一种无提示、参数高效的微调框架,旨在对变尺寸输入进行多类分割。我们引入了一种卷积位置编码生成器,以有效适应任意纵横比,并提出了一种双适配器策略:高性能适配器利用可变形卷积进行精确边界建模,轻量级适配器采用结构重参数化以最小化推理延迟。在ISBI 2012、Kvasir-SEG、Synapse和ACDC数据集上的实验表明,我们的方法显著优于强适应基线。具体而言,我们的方法在分割准确性上比原始SAM2提高了多达19.66%,同时与重量级医学SAM适应相比,计算成本降低了约87%,在准确性和效率之间建立了优越的权衡。
cs.CV / 60 / 2605.05990

iPhoneBlur: A Difficulty-Stratified Benchmark for Consumer Device Motion Deblurring

iPhoneBlur:一种针对消费设备运动去模糊的难度分层基准
Shafi, Abdullah Al, Alam, Kazi Saeed
Abstract
Motion blur restoration on consumer mobile devices is typically evaluated using aggregate metrics that obscure performance variation across blur difficulty, masking model behavior under real deployment conditions. This work introduces iPhoneBlur, a difficulty-stratified benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard categories through PSNR-guided adaptive temporal windowing, with stratification validated by a monotonic 2.2x increase in optical flow magnitude across tiers. Each sample includes comprehensive metadata enabling investigation of ISP-aware and difficulty-adaptive restoration strategies. Spectral analysis confirms that the synthesized blur exhibits high-frequency suppression patterns consistent with authentic motion degradation. Evaluation of six architectures reveals a consistent 7-9 dB performance degradation from Easy to Hard subsets, a substantial gap entirely hidden by aggregate reporting. The benchmark further exposes a domain gap between professional and consumer cameras, which targeted fine-tuning substantially recovers. By coupling difficulty stratification with deployment-critical metadata, iPhoneBlur enables systematic assessment of model reliability and failure modes for resource-constrained edge systems.
Chinese Translation
在消费移动设备上,运动模糊恢复通常使用聚合指标进行评估,这掩盖了不同模糊难度下的性能变化,掩盖了模型在实际部署条件下的表现。本研究引入了iPhoneBlur,这是一个由7400对图像组成的难度分层基准,这些图像对是从在多种真实场景中捕获的高帧率iPhone 17 Pro视频合成的。样本通过基于PSNR的自适应时间窗口划分为简单、中等和困难类别,分层验证通过光流幅度在各层之间单调增加2.2倍来实现。每个样本包含全面的元数据,便于研究ISP感知和难度自适应的恢复策略。光谱分析确认合成模糊表现出与真实运动退化一致的高频抑制模式。对六种架构的评估显示,从简单到困难子集的性能下降一致为7-9 dB,这一显著差距在聚合报告中完全被隐藏。该基准进一步揭示了专业相机与消费相机之间的领域差距,针对性微调可以显著恢复这一差距。通过将难度分层与部署关键元数据结合,iPhoneBlur能够系统地评估资源受限边缘系统的模型可靠性和失效模式。
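The PSNR-guided partitioning described in this abstract can be illustrated with a minimal sketch. It assumes flat pixel sequences and two hypothetical thresholds (`EASY_MIN_DB`, `MEDIUM_MIN_DB`); the abstract does not state the benchmark's actual cut-offs or its adaptive temporal windowing procedure, so this shows only the scoring-and-binning idea.

```python
import math

# Hypothetical tier thresholds in dB; the benchmark's real cut-offs are not given in the abstract.
EASY_MIN_DB = 30.0
MEDIUM_MIN_DB = 25.0


def psnr(sharp, blurred, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(sharp) != len(blurred) or not sharp:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(sharp, blurred)) / len(sharp)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def difficulty_tier(psnr_db):
    """Map a PSNR value to a tier: a lower PSNR means heavier blur, hence a harder sample."""
    if psnr_db >= EASY_MIN_DB:
        return "Easy"
    if psnr_db >= MEDIUM_MIN_DB:
        return "Medium"
    return "Hard"
```

With these assumed thresholds, a pair whose mean squared error is 100 on an 8-bit scale scores roughly 28.1 dB and lands in the Medium tier.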
cs.CV / 61 / 2605.05997

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

4DThinker:利用4D图像进行动态空间理解
Chen, Zhangquan, Zhang, Manyuan, Yu, Xinlei, An, Xiang, Li, Bo, Xie, Xin, Wang, ZiDong, Sun, Mingze, Chen, Shuang, Li, Hongyu, Hu, Xiaobin, Huang, Ruqi
Abstract
Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
Chinese Translation
从单目视频中进行动态空间推理对于连接视觉智能与物理世界至关重要,但对于视觉语言模型(VLMs)来说仍然具有挑战性。以往的方法要么将时空推理完全以文本形式表达,这对于复杂动态而言固有地冗长且不精确,要么依赖外部几何模块,这增加了推理复杂性而未能促进模型的内在能力。本文提出了4DThinker,这是第一个使VLMs能够通过动态潜在心理图像“进行4D思考”的框架,即在连续的隐藏空间中内部模拟场景如何演变。具体而言,我们首先介绍了一种可扩展的、无需注释的数据生成管道,从原始视频中合成4D推理数据。然后,我们提出了动态图像微调(Dynamic-Imagery Fine-Tuning, DIFT),该方法共同监督文本标记和4D潜变量,以将模型与动态视觉语义相结合。在此基础上,4D强化学习(4D Reinforcement Learning, 4DRL)进一步通过基于结果的奖励来解决复杂推理任务,将策略梯度限制在文本标记上,以确保稳定优化。在多个动态空间推理基准上的广泛实验表明,4DThinker始终优于强基线,并为VLMs中的4D推理提供了新的视角。我们的代码可在 https://github.com/zhangquanchen/4DThinker 获取。
cs.CV / 62 / 2605.06005

Neuromorphic visual attention for Sign-language recognition on SpiNNaker

基于SpiNNaker的神经形态视觉注意力在手语识别中的应用
Liskova, Sarka, Vedmedenko, Olha, Fatahi, Mazdak, Hoffmann, Matej, Furlong, P. Michael, Angelo, Giulia D
Abstract
Sign-language recognition has achieved substantial gains in classification accuracy in recent years; however, the latency and power requirements of most existing methods limit their suitability for real-time deployment. Neuromorphic sensing and processing offer an alternative paradigm based on sparse, event-driven computation that supports low-latency and energy-efficient perception. In this work, we introduce an end-to-end neuromorphic architecture for American Sign Language (ASL) fingerspelling recognition that integrates a spiking visual attention mechanism for online region-of-interest extraction with a compact spiking neural network deployed on the SpiNNaker neuromorphic platform. We benchmark the proposed system on two datasets: a synthetically generated event-based version of the Sign Language MNIST dataset and a natively recorded ASL-DVS dataset, whilst providing a comprehensive overview of sign-language recognition and related work. This work yields competitive performance in simulation (92.27%) and comparable performance on neuromorphic hardware deployment (83.1%), while achieving the most energy-efficient architecture (0.565 mW) and low latency (3 ms) across all benchmarked approaches. Despite its compact design, the system demonstrates the suitability of task-dependent visual attention applications for edge deployment.
Chinese Translation
近年来,手语识别在分类准确性方面取得了显著进展;然而,大多数现有方法的延迟和功耗要求限制了其在实时部署中的适用性。神经形态感知和处理提供了一种基于稀疏、事件驱动计算的替代范式,支持低延迟和能量高效的感知。在本研究中,我们提出了一种端到端的神经形态架构,用于美式手语(ASL)手指拼写识别,该架构集成了脉冲视觉注意力机制,用于在线兴趣区域提取,并在SpiNNaker神经形态平台上部署了紧凑型脉冲神经网络。我们将所提出的系统与两个数据集进行了基准测试:一个是合成生成的基于事件的手语MNIST数据集,另一个是本地录制的ASL-DVS数据集,同时提供了手语识别及相关工作的全面概述。该研究在仿真中实现了竞争性的性能(92.27%),并在神经形态硬件部署中表现出可比的性能(83.1%),同时在所有基准方法中实现了最具能效的架构(0.565 mW)和低延迟(3 ms)。尽管设计紧凑,该系统展示了任务依赖的视觉注意力应用在边缘部署中的适用性。
cs.CV / 63 / 2605.06010

Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models

通过蒸馏扩散模型实时增强视觉系统的热感知能力
Guo, Yuchen, Gong, Junli, Dong, Wenjun, Cheung, Yiuming, Su, Weifeng
Abstract
Purely RGB-based vision models often fail to provide reliable cues in challenging scenarios such as nighttime and fog, leading to degraded performance and safety risks. Infrared imaging captures heat-emitting sources and provides critical complementary information, but existing high-fidelity fusion methods suffer from prohibitive latency, rendering them impractical for real-time edge deployment. To address this, we propose FusionProxy, a real-time image fusion module designed as a fully independent, plug-and-play component with diffusion-level quality. FusionProxy exploits two complementary statistics of a teacher sample ensemble: per-pixel variance in raw image space, used to weight pixel-level supervision, and per-pixel variance inside frozen foundation backbones, used to route feature-level alignment spatially. Once trained, FusionProxy can be directly integrated into any visual perception system without joint optimization. Extensive experiments demonstrate that our method achieves superior performance on static recognition tasks and significantly enhances robustness in dynamic tasks, including closed-loop autonomous driving. Crucially, FusionProxy achieves real-time inference speeds on diverse platforms, from high-end GPUs to commodity hardware, providing a flexible and generalizable solution for all-day perception.
Chinese Translation
纯RGB基础的视觉模型在夜间和雾霾等挑战性场景中往往无法提供可靠的线索,导致性能下降和安全风险。红外成像能够捕捉热源并提供关键的补充信息,但现有的高保真融合方法存在显著的延迟,使其在实时边缘部署中不切实际。为了解决这一问题,我们提出了FusionProxy,一个设计为完全独立的即插即用组件的实时图像融合模块,具有扩散级别的质量。FusionProxy利用教师样本集的两个互补统计量:原始图像空间中的每像素方差,用于加权像素级监督;以及冻结基础骨干网络中的每像素方差,用于空间上引导特征级对齐。一旦训练完成,FusionProxy可以直接集成到任何视觉感知系统中,而无需联合优化。大量实验表明,我们的方法在静态识别任务中表现优越,并显著增强了动态任务的鲁棒性,包括闭环自主驾驶。关键是,FusionProxy在从高端GPU到普通硬件的多种平台上实现了实时推理速度,为全天候感知提供了灵活且可推广的解决方案。
cs.CV / 64 / 2605.06012

T2I-VeRW: Part-level Fine-grained Perception for Text-to-Image Vehicle Retrieval

T2I-VeRW:基于部件级的细粒度文本到图像车辆检索
Wang, Xiao, Wang, Ziwen, Kong, Weizhe, Wu, Wentao, Li, Yuehang, Zheng, Aihua, Li, Chenglong, Tang, Jin
Abstract
Vehicle Re-identification (Re-ID) aims to retrieve the most similar image to a given query from images captured by non-overlapping cameras. Extending vehicle Re-ID from image-only queries to text-based queries enables retrieval in real-world scenarios where only a witness description of the target vehicle is available. In this paper, we propose PFCVR, a Part-level Fine-grained Cross-modal Vehicle Retrieval model for text-to-image vehicle re-identification. PFCVR constructs locally paired images and texts at the part level and introduces learnable part-query tokens that aggregate both part-specific and full-sentence context before aligning with visual part features. On top of this explicit local alignment, a bi-directional mask recovery module lets each modality reconstruct its masked content under the guidance of the other, implicitly bridging local correspondences into global feature alignment. Furthermore, we construct a new large-scale dataset called T2I-VeRW, which contains 14,668 images covering 1,796 vehicle identities with fine-grained part-level annotations. Experimental results on the T2I-VeRI dataset show that PFCVR achieves 29.2% Rank-1 accuracy, improving over the best competing method by 3.7 percentage points. On the newly proposed T2I-VeRW benchmark, PFCVR achieves 55.2% Rank-1 accuracy, outperforming a comprehensive set of recent state-of-the-art methods. Source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID
Chinese Translation
车辆重识别(Re-ID)旨在从由非重叠摄像头捕获的图像中检索与给定查询最相似的图像。将车辆重识别从仅基于图像的查询扩展到基于文本的查询,使得在仅有目击者对目标车辆描述的现实场景中能够进行检索。本文提出了一种名为PFCVR的部件级细粒度跨模态车辆检索模型,用于文本到图像的车辆重识别。PFCVR在部件级构建局部配对的图像和文本,并引入可学习的部件查询标记,这些标记在与视觉部件特征对齐之前聚合了部件特定和完整句子的上下文。在这种显式的局部对齐基础上,双向掩码恢复模块使每种模态能够在另一种模态的指导下重建其掩码内容,隐式地将局部对应关系桥接到全局特征对齐。此外,我们构建了一个新的大规模数据集T2I-VeRW,该数据集包含14,668张图像,涵盖1,796个车辆身份,并具有细粒度的部件级注释。在T2I-VeRI数据集上的实验结果表明,PFCVR达到了29.2%的Rank-1准确率,比最佳竞争方法提高了3.7个百分点。在新提出的T2I-VeRW基准上,PFCVR达到了55.2%的Rank-1准确率,超越了一系列最新的最先进方法。源代码将发布在https://github.com/Event-AHU/Neuromorphic_ReID
cs.CV / 65 / 2605.06021

PlotPick: AI-powered batch extraction of numerical data from scientific figures

PlotPick:基于人工智能的科学图表数值数据批量提取工具
Carstensen, Tommy
Abstract
Systematic reviews and meta-analyses frequently require numerical data that authors report only as figures, yet manual digitisation is slow and does not scale. We present PlotPick, an open-source tool that uses vision-language models (VLMs) to batch-extract structured tabular data from scientific figures. We evaluate six VLMs from three providers on two established chart-to-table benchmarks (ChartX and PlotQA) and compare against the dedicated chart-to-table model DePlot. All six VLMs outperform DePlot on both benchmarks. On ChartX (restricted to bar charts, line charts, box plots, and histograms; n=300), VLMs achieve 88-96% recall versus 71% for DePlot. On PlotQA (n=529), VLMs achieve 86-99% RMSF1 versus 94% for DePlot. The gap is largest on chart types absent from the dedicated models' training data: on box plots, DePlot achieves 24% RMSF1 while VLMs achieve 83-97%. PlotPick is available at https://plotpick.streamlit.app.
Chinese Translation
系统评价和荟萃分析常常需要作者仅以图形形式报告的数值数据,但手动数字化过程缓慢且难以扩展。我们提出了PlotPick,这是一款开源工具,利用视觉-语言模型(VLMs)从科学图表中批量提取结构化的表格数据。我们在两个已建立的图表到表格基准(ChartX和PlotQA)上评估了来自三个提供者的六个VLM,并与专用的图表到表格模型DePlot进行了比较。在这两个基准上,所有六个VLM的表现均优于DePlot。在ChartX(仅限于条形图、折线图、箱线图和直方图;n=300)上,VLM的召回率为88-96%,而DePlot为71%。在PlotQA(n=529)上,VLM的RMSF1为86-99%,而DePlot为94%。在专用模型训练数据中缺乏的图表类型上,差距最大:在箱线图上,DePlot的RMSF1为24%,而VLM的RMSF1为83-97%。PlotPick可在https://plotpick.streamlit.app获取。
cs.CV / 66 / 2605.06043

Domain Generalization through Spatial Relation Induction over Visual Primitives

通过视觉原语的空间关系引导实现领域泛化
Nguyen, Dat, Nguyen, Duc-Duy
Abstract
Domain generalization requires identifying stable representations that support reliable classification across domains. Most existing methods seek such stability through improving the training process, for example, through model selection strategies, data augmentation, or feature-alignment objectives. Although these strategies can be effective, they leave the representation learning of structural composition implicit, which may limit performance on compositional domain generalization benchmarks. In this work, we propose Primitive-Aware Relational Structure for domain gEneralization (PARSE), an image classification framework that factors visual recognition into visual primitives and their relational composition. We represent these compositions using soft binary, ternary, and quaternary predicates over primitive locations, yielding differentiable measures of spatial alignment that can be learned end-to-end. To learn primitives and relational structures jointly, we design an end-to-end architecture with three components: (1) a convolutional neural network (CNN) backbone that extracts general visual features, (2) a concept bottleneck layer that maps these features to primitive heatmaps with differentiable spatial coordinates, and (3) a structural scoring layer that evaluates candidate spatial relations among the detected primitives. We then compute class probability from the joint evidence of its class-specific relational compositions. Across CUB-DG and the DomainBed benchmark suite, PARSE improves accuracy by over 4.5 percentage points on CUB-DG and remains competitive with existing DG methods on DomainBed.
Chinese Translation
领域泛化要求识别出稳定的表示,以支持跨领域的可靠分类。现有的大多数方法通过改善训练过程来寻求这种稳定性,例如,通过模型选择策略、数据增强或特征对齐目标。尽管这些策略可能有效,但它们使得结构组成的表示学习变得隐式,这可能限制在组合领域泛化基准上的表现。在本研究中,我们提出了面向领域泛化的原语感知关系结构(Primitive-Aware Relational Structure for domain gEneralization,PARSE),这是一个将视觉识别分解为视觉原语及其关系组成的图像分类框架。我们使用软二元、三元和四元谓词在原语位置上表示这些组成,从而产生可微分的空间对齐度量,可以端到端学习。为了联合学习原语和关系结构,我们设计了一个包含三个组件的端到端架构:(1)一个提取一般视觉特征的卷积神经网络(CNN)主干;(2)一个概念瓶颈层,将这些特征映射到具有可微分空间坐标的原语热图;(3)一个结构评分层,评估检测到的原语之间的候选空间关系。然后,我们从其类特定关系组成的联合证据中计算类概率。在CUB-DG和DomainBed基准测试套件中,PARSE在CUB-DG上提高了超过4.5个百分点的准确率,并在DomainBed上与现有的领域泛化方法保持竞争力。
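One common way to obtain differentiable spatial coordinates from a heatmap is a soft-argmax, and a soft binary predicate over two such locations can be written as a sigmoid of their coordinate difference. The sketch below assumes both choices; the abstract does not say which operators PARSE uses, and the names `soft_argmax_2d` and `soft_left_of` are hypothetical.

```python
import math


def soft_argmax_2d(heatmap, temperature=1.0):
    """Return the expected (row, col) position under a softmax over a 2D heatmap."""
    width = len(heatmap[0])
    flat = [v / temperature for row in heatmap for v in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    row = sum(e * (i // width) for i, e in enumerate(exps)) / total
    col = sum(e * (i % width) for i, e in enumerate(exps)) / total
    return row, col


def soft_left_of(a, b, sharpness=1.0):
    """Soft binary predicate in (0, 1): how strongly location a lies to the left of location b."""
    return 1.0 / (1.0 + math.exp(-sharpness * (b[1] - a[1])))
```

Because both functions are smooth in the heatmap values, an equivalent tensor implementation would let gradients from a relation score flow back into the primitive detector, which is the property the abstract relies on.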
cs.CV / 67 / 2605.06049

Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization

按您的方式融合:通过直接偏好优化对齐图像融合与异构需求
Su, Weijian, Zhang, Songqian, Han, Yuqi, Zhuang, Jian, Huang, Yongdong, Zhang, Qiang
Abstract
As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. The PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals. Experimental results demonstrate that our framework not only attains precise preference alignment among humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability.
Chinese Translation
作为多模态处理中的关键技术,红外与可见光图像融合(IVIF)在整合互补光谱信息以增强视觉效果和后续视觉任务中发挥着至关重要的作用。尽管取得了显著进展,现有方法在灵活适应异构需求方面仍然面临挑战。实现与人类和机器视觉的各种偏好对齐的自适应融合仍然是一个开放且具有挑战性的问题。为了解决这一挑战,我们提出了DPOFusion,一个直接偏好优化(DPO)框架,整合了属性对齐的潜在扩散模型(PALDM)和偏好可控的潜在扩散模型(PCLDM),使得IVIF能够根据任务指导和偏好自适应,适用于人类和机器视觉。PALDM利用潜在融合先验和联合条件损失生成具有多种属性的多样候选融合结果。随后,PCLDM通过实例直接偏好优化(IDPO)进行微调,使得最终融合结果能够直接控制以适应异构偏好信号。实验结果表明,我们的框架不仅在不同人类、视觉-语言模型和任务驱动网络之间实现了精确的偏好对齐,还为自适应融合质量和任务导向的可迁移性设定了新的基准。
cs.CV / 68 / 2605.06051

RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control

RealCam:具有交互式相机控制的实时新视角视频生成
Xu, Youcan, Shi, Jiaxin, Wang, Zhen, Song, Wensong, Shao, Feifei, Liang, Chen, Xiao, Jun, Chen, Long
Abstract
Camera-controlled video-to-video (V2V) generation enables dynamic viewpoint synthesis from monocular footage, holding immense potential for interactive filmmaking and live broadcasting. However, existing implicit synthesis methods fundamentally rely on non-causal, full-sequence processing and rigid prefix-style temporal concatenation. This architectural paradigm mandates bidirectional attention, resulting in prohibitive computational latency, quadratic complexity scaling, and inherent incompatibility with real-time streaming or variable-length inputs. To overcome these limitations, we introduce RealCam, a novel autoregressive framework for interactive, real-time camera-controlled V2V generation. We first design a high-fidelity teacher model grounded in a Cross-frame In-context Learning paradigm. By interleaving source and target frames into synchronized contextual pairs, our design inherently enables length-agnostic generalization and naturally facilitates causal adaptation, breaking the rigid prefix bottleneck. We then distill this teacher into a few-step causal student via Self-Forcing with Distribution Matching Distillation, enabling efficient, on-the-fly streaming synthesis. Furthermore, to mitigate severe loop inconsistency in closed-loop trajectories, we propose Loop-Closed Data Augmentation (LoopAug), a novel paradigm that synthesizes globally consistent loop sequences from existing multiview datasets. Extensive experiments demonstrate that RealCam achieves state-of-the-art visual fidelity and temporal consistency while enabling truly interactive camera control with orders-of-magnitude faster inference than existing paradigms. Our project page is at https://xyc-fly.github.io/RealCam/.
Chinese Translation
相机控制的视频到视频(V2V)生成能够从单目视频中合成动态视角,具有巨大的潜力用于互动电影制作和直播。然而,现有的隐式合成方法根本上依赖于非因果的全序列处理和刚性前缀式时间连接。这种架构范式要求双向注意力,导致计算延迟过高、复杂度呈平方级增长,并与实时流媒体或可变长度输入存在固有的不兼容性。为了克服这些限制,我们提出了RealCam,一种用于交互式实时相机控制V2V生成的新型自回归框架。我们首先设计了一个基于跨帧上下文学习范式的高保真教师模型。通过将源帧和目标帧交错成同步的上下文对,我们的设计固有地实现了长度无关的泛化,并自然地促进了因果适应,打破了刚性前缀瓶颈。然后,我们通过自我强制与分布匹配蒸馏将该教师模型提炼为一个几步因果学生,从而实现高效的即时流媒体合成。此外,为了减轻闭环轨迹中的严重循环不一致性,我们提出了循环闭合数据增强(LoopAug),一种新颖的范式,从现有的多视图数据集中合成全局一致的循环序列。大量实验表明,RealCam在视觉保真度和时间一致性方面达到了最先进的水平,同时实现了真正的交互式相机控制,其推理速度比现有范式快几个数量级。我们的项目页面地址为 https://xyc-fly.github.io/RealCam/.
cs.CV / 69 / 2605.06064

PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

PersonaGesture:针对未见说话者的单参考共语手势个性化
Zhang, Xiangyue, Cai, Yiyi, Li, Kunhang, Yang, Kaixing, Zhou, You, Li, Zhengqing, Chu, Xuangeng, Zhang, Jiaxu, Liu, Haiyang
Abstract
We propose PersonaGesture, a diffusion-based pipeline for single-reference co-speech gesture personalization of unseen speakers. Given target speech and one motion clip from a new speaker, the model must synthesize gestures that follow the new utterance while retaining speaker-specific pose choices, without per-speaker optimization. This setting is useful for avatars and virtual agents, but it is hard because the reference mixes stable speaker habits with utterance-specific trajectories. PersonaGesture consists of two key components, Adaptive Style Infusion (ASI) and Implicit Distribution Rectification (IDR), to separate temporal identity evidence from residual statistic correction. A Style Perceiver first encodes the variable-length reference into compact speaker-memory tokens. ASI injects these tokens into denoising through zero-initialized residual cross-attention, enabling style evidence to affect motion formation without replacing the pretrained speech-to-motion prior. Building on this, IDR applies a length-aware diagonal affine map in latent space to correct residual channel-wise moments estimated from the same reference. Across BEAT2 and ZeroEGGS, we evaluate quantitative metrics, reference-identity controls, same-audio diagnostics, qualitative comparisons, and human preference. Experiments show that separating denoising-time speaker memory from conservative post-generation moment correction improves unseen-speaker personalization over collapsed style codes, full-reference attention, and one-clip finetuning. Project: https://xiangyue-zhang.github.io/PersonaGesture.
Chinese Translation
我们提出了PersonaGesture,这是一种基于扩散的管道,用于未见说话者的单参考共语手势个性化。给定目标语音和来自新说话者的一个运动片段,模型必须合成与新发声相符的手势,同时保留特定于说话者的姿态选择,而无需针对每个说话者进行优化。该设置对虚拟化身和虚拟代理非常有用,但由于参考内容混合了稳定的说话者习惯与特定发声的轨迹,因此难度较大。PersonaGesture由两个关键组件组成:自适应风格注入(Adaptive Style Infusion, ASI)和隐式分布校正(Implicit Distribution Rectification, IDR),以将时间身份证据与残差统计校正分离。风格感知器首先将可变长度的参考编码为紧凑的说话者记忆标记。ASI通过零初始化的残差交叉注意力将这些标记注入去噪过程,使风格证据能够影响运动生成,而不替代预训练的语音到运动的先验。基于此,IDR在潜在空间中应用长度感知的对角仿射映射,以校正从相同参考估计的残差通道时刻。我们在BEAT2和ZeroEGGS上评估了定量指标、参考身份控制、相同音频诊断、定性比较和人类偏好。实验表明,将去噪时的说话者记忆与保守的后生成时刻校正分离,能够改善未见说话者的个性化效果,相较于崩溃的风格编码、全参考注意力和单片段微调。项目网址:https://xiangyue-zhang.github.io/PersonaGesture。
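A diagonal affine map that corrects channel-wise moments can be sketched per latent channel as a shift and a scale. The version below is an illustration only: the length-aware blend weight `lam = T / (T + tau)` and the parameter `tau` are hypothetical stand-ins, since the abstract does not specify how IDR accounts for reference length.

```python
import statistics


def rectify_channel(generated, reference, tau=30.0, eps=1e-8):
    """Shift and scale one latent channel toward the reference clip's mean and std.

    The blend weight grows with the reference length, so a short clip
    triggers a more conservative correction than a long one.
    """
    lam = len(reference) / (len(reference) + tau)  # hypothetical length-aware weight
    mu_g = statistics.fmean(generated)
    mu_r = statistics.fmean(reference)
    sd_g = statistics.pstdev(generated)
    sd_r = statistics.pstdev(reference)
    target_mu = (1.0 - lam) * mu_g + lam * mu_r
    target_sd = (1.0 - lam) * sd_g + lam * sd_r
    scale = target_sd / (sd_g + eps)  # one scale and one offset per channel: a diagonal affine map
    return [(x - mu_g) * scale + target_mu for x in generated]
```

Applying the function independently to every channel gives the diagonal structure: no channel's correction depends on any other channel, which keeps the adjustment cheap and limits how far it can distort the generated motion.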
cs.CV / 70 / 2605.06070

Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

竞技场作为离线奖励:扩散模型的高效细粒度偏好优化
Li, Zhikai, Zhao, Yue, Zhang, Edward Zhongwei, Liu, Xuewen, Zhang, Jing, Gu, Qingyi, Dong, Zhen
Abstract
Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback limits it to coarse-grained modeling on chosen-rejected pairs, resulting in suboptimal optimization. In this paper, we propose ArenaPO, which leverages Arena scores as offline rewards to provide refined feedback, thus achieving efficient and fine-grained optimization without a reward model. This enables ArenaPO to benefit from both the rich rewards of traditional RLHF and the efficiency of DPO. Specifically, we first construct a model Arena in which each model's capability is represented as a Gaussian distribution, and infer these capabilities by traversing the annotated pairwise preferences. Each output image is treated as a sample from the corresponding capability distribution. Then, for an image pair, conditioned on the two capability distributions and the observed pairwise preference, the absolute quality gap is estimated using latent-variable inference based on a truncated normal distribution, which serves as fine-grained feedback during training. It does not require a reward model and can be computed offline, thus introducing no additional training overhead. We conduct ArenaPO training on the Pick-a-Pic v2 and HPD v3 datasets, showing that ArenaPO consistently outperforms existing baselines.
Chinese Translation
基于人类反馈的强化学习(RLHF)有效促进了文本到图像(T2I)扩散模型的偏好对齐。为了提高计算效率,直接偏好优化(DPO)作为一种避免显式奖励建模的方法,已被广泛研究。然而,其对二元反馈的依赖限制了其在选择-拒绝对上的粗粒度建模,导致优化效果不佳。本文提出了ArenaPO,它利用竞技场分数作为离线奖励,提供精细反馈,从而实现高效且细粒度的优化,而无需奖励模型。这使得ArenaPO能够同时受益于传统RLHF的丰富奖励和DPO的高效性。具体而言,我们首先构建一个模型竞技场,其中每个模型的能力以高斯分布表示,并通过遍历标注的成对偏好来推断这些能力。每个输出图像被视为来自相应能力分布的样本。然后,对于一对图像,基于两个能力分布和观察到的成对偏好,使用基于截断正态分布的潜变量推断来估计绝对质量差距,这在训练过程中作为细粒度反馈。该方法不需要奖励模型,并且可以离线计算,因此不会引入额外的训练开销。我们在Pick-a-Pic v2和HPD v3数据集上进行ArenaPO训练,结果表明ArenaPO始终优于现有基线。
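The truncated-normal step can be made concrete under one simplifying assumption: the two image qualities are independent Gaussians, so their difference is Gaussian with mean m and standard deviation s, and observing the preference truncates that difference to positive values. The function name `expected_quality_gap` is hypothetical, and the paper's exact estimator may differ from this closed-form mean.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_quality_gap(mu_win, var_win, mu_lose, var_lose):
    """E[q_win - q_lose | q_win > q_lose] for independent Gaussian qualities.

    The gap d = q_win - q_lose is N(m, s^2); conditioning on d > 0 gives the
    truncated-normal mean m + s * pdf(m / s) / cdf(m / s).
    """
    m = mu_win - mu_lose
    s = math.sqrt(var_win + var_lose)
    z = m / s
    # Note: for strongly negative z the cdf underflows; a production version would guard this.
    return m + s * normal_pdf(z) / normal_cdf(z)
```

The estimate is always positive, is close to the plain difference of means when the winner comes from a clearly stronger model, and stays modest when an unlikely upset is observed, which is what makes it a finer-grained signal than a binary chosen/rejected label.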
cs.CV / 71 / 2605.06080

MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

MSD-Score:用于无参考图像标题评估的多尺度分布评分
Kan, Shichao, Zhang, Xuyang, Zhang, Haojie, Zhu, Zhe, Cen, Yigang, Liang, Yixiong, Shan, Lianlei, Zhang, Linna, Qu, Zhe, Xia, Jiazhi
Abstract
Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric that models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere. Instead of treating each modality as a single point, MSD-Score formulates image-text matching as a multi-scale distributional scoring problem. Semantic discrepancies are quantified via a weighted bi-directional KL divergence and combined with global similarity in a multi-scale framework for both single- and multi-candidate evaluations. Extensive experiments show that MSD-Score achieves state-of-the-art correlation with human judgments among reference-free metrics. Beyond accuracy, its probabilistic formulation yields transparent and decomposable diagnostics of local grounding errors, providing a deterministic complementary signal to holistic similarity metrics and judge-based evaluators.
Chinese Translation
在没有参考的情况下评估图像标题仍然具有挑战性,因为全局嵌入相似性往往忽视了细粒度的不匹配,例如虚构的物体、缺失的属性或不正确的关系。我们提出了MSD-Score,这是一种无参考度量,它将图像块和文本标记的嵌入建模为单位超球面上的冯·米塞斯-费舍尔混合体。MSD-Score并不是将每种模态视为一个单一的点,而是将图像-文本匹配公式化为一个多尺度分布评分问题。语义差异通过加权双向KL散度进行量化,并在多尺度框架中与全局相似性结合,适用于单候选和多候选评估。大量实验表明,MSD-Score在无参考度量中与人类判断的相关性达到了最先进的水平。除了准确性外,其概率性公式还提供了对局部基础错误的透明和可分解的诊断,为整体相似性度量和基于评审的评估者提供了确定性的补充信号。
cs.CV / 72 / 2605.06083

Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval

重新审视不确定性:关于部分相关视频检索的证据学习
Li, Jun, Lai, Peifeng, Lou, Xuhang, Wang, Jinpeng, Wang, Yuting, Chen, Ke, Wang, Yaowei, Xia, Shu-Tao
Abstract
Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multi-granular cross-modal evidence to quantify and model uncertainty explicitly. At the inter-video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Based on the proposed three-fold principle, we perform fine-grained query identification, which then guides query-adaptive calibrated learning. At the intra-video level, to accumulate denser evidence, we formulate a soft query-clip alignment via flexible optimal transport with an adaptive dustbin, which alleviates sparse temporal supervision while suppressing spurious local responses. Extensive experiments demonstrate that Holmes outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICML26-Holmes.
Chinese Translation
部分相关视频检索旨在使用仅描述部分内容的文本查询来检索未剪辑的视频。然而,简短查询与丰富视频内容之间固有的不对称性不可避免地在检索过程中引入了不确定性。在这种情况下,模糊查询往往会导致视频之间的语义模糊,这一挑战因视频中稀疏的时间监督而进一步加剧,后者未能提供足够的匹配证据。为了解决这个问题,我们提出了Holmes,一个层次化的证据学习框架,聚合多粒度的跨模态证据,以明确量化和建模不确定性。在视频间层面,相似度分数被解释为证据支持,并通过Dirichlet分布进行建模。基于提出的三重原则,我们执行细粒度查询识别,进而指导查询自适应的校准学习。在视频内层面,为了积累更密集的证据,我们通过灵活的最优传输与自适应垃圾箱制定软查询-片段对齐,这减轻了稀疏时间监督的影响,同时抑制了虚假的局部响应。大量实验表明,Holmes的性能优于最先进的方法。代码已发布在 https://github.com/lijun2005/ICML26-Holmes。
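Interpreting similarity scores as evidence for a Dirichlet distribution is often done with the subjective-logic mapping used in evidential deep learning. The sketch below assumes that mapping (alpha = evidence + 1, uncertainty = K / sum of alpha); how Holmes turns similarities into non-negative evidence is not stated in the abstract, so the input here is simply a list of non-negative numbers.

```python
def dirichlet_opinion(evidence):
    """Map non-negative per-candidate evidence to beliefs, an uncertainty mass, and expected probabilities."""
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]       # Dirichlet concentration parameters
    strength = sum(alpha)                     # total Dirichlet strength
    beliefs = [e / strength for e in evidence]
    uncertainty = k / strength                # beliefs and uncertainty sum to 1
    expected_prob = [a / strength for a in alpha]
    return beliefs, uncertainty, expected_prob
```

For example, evidence of `[9.0, 0.0, 0.0]` over three candidate videos gives an uncertainty of 0.25, whereas a vague query that yields no evidence at all gives an uncertainty of 1.0.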
cs.CV / 73 / 2605.06084

AMIEOD: Adaptive Multi-Experts Image Enhancement for Object Detection in Low-Illumination Scenes

AMIEOD:低光照场景下面向目标检测的自适应多专家图像增强
Huang, Xiaochen, Chen, Honggang, Zhang, Weicheng, Dai, Xiaobo, Li, Yongyi, Qing, Linbo, He, Xiaohai
Abstract
In multimedia application scenarios, images captured under low-illumination conditions often lead to lower accuracy in visual perception tasks compared to those taken in well-lit environments. To tackle this challenge, we propose AMIEOD, an image enhancement-enabled object detection framework for low-illumination scenes, where the two tasks are jointly optimized in a detection performance-oriented manner. Specifically, to fully exploit the information in poorly lit images, a Multi-Experts Image Enhancement Module (MEIEM) is proposed, which leverages diverse enhancement strategies. On this basis, aiming to better align the MEIEM with the detection task, we propose a Detection-Guided Regression Loss (DGRL) that utilizes the detection result to decide the regression target. Moreover, to dynamically select the most suitable enhancement strategy from MEIEM during inference, we construct an Expert Selection Module (ESM) guided by the proposed Detection-Guided Cross-Entropy (DGCE) loss, which formulates the optimization of ESM as a classification task. The improved method is well-matched with current detection algorithms to improve their performance in dim scenes. Extensive experiments on multiple datasets demonstrate that the proposed method significantly improves object detection accuracy in low-illumination conditions. Our code has been released at https://github.com/scujayfantasy/AMIEOD
Chinese Translation
在多媒体应用场景中,低光照条件下捕获的图像通常导致视觉感知任务的准确性低于在光照良好的环境中拍摄的图像。为了解决这一挑战,我们提出了AMIEOD,一种针对低光照场景的图像增强目标检测框架,其中这两项任务以检测性能为导向进行联合优化。具体而言,为了充分利用光线不足图像中的信息,我们提出了一种多专家图像增强模块(MEIEM),该模块利用多样的增强策略。在此基础上,为了更好地将MEIEM与检测任务对齐,我们提出了一种检测引导回归损失(DGRL),该损失利用检测结果来决定回归目标。此外,为了在推理过程中动态选择MEIEM中最合适的增强策略,我们构建了一个专家选择模块(ESM),该模块由提出的检测引导交叉熵(DGCE)损失引导,将ESM的优化形式化为分类任务。改进的方法与当前的检测算法良好匹配,以提高其在昏暗场景中的性能。在多个数据集上的大量实验表明,所提出的方法显著提高了低光照条件下的目标检测准确性。我们的代码已发布在 https://github.com/scujayfantasy/AMIEOD
cs.CV / 74 / 2605.06086

LARGO: Low-Rank Hypernetwork for Handling Missing Modalities

LARGO:低秩超网络用于处理缺失模态
Vyncke, Niels, Ashtari, Pooya, Pižurica, Aleksandra
Abstract
Addressing missing modalities is an important challenge in multimodal image analysis and often relies on complex architectures that do not transfer easily to different datasets without architectural modifications or hyperparameter tuning. While most existing methods tackle this problem in feature space by engineering representations that are robust to missing inputs, we instead operate in weight space. We propose LARGO, a hypernetwork that compresses the $2^N-1$ dedicated missing-modality models into a single network by modelling the convolutional weights using the Canonical Polyadic (CP) tensor decomposition. Extensive experimental validation on BraTS 2018 (4 modalities, 15 scenarios) and ISLES 2022 (3 modalities, 7 scenarios) shows that our method ranks first in 47 out of 52 configurations, achieving average Dice improvements of +0.68% and +2.53% over state-of-the-art baselines (mmFormer, M$^{3}$AE, ShaSpec, SimMLM). A proof-of-concept experiment on avMNIST suggests that LARGO may extend beyond medical imaging to heterogeneous non-medical modalities.
Chinese Translation
处理缺失模态是多模态图像分析中的一个重要挑战,通常依赖于复杂的架构,这些架构在没有架构修改或超参数调整的情况下,难以迁移到不同的数据集。虽然大多数现有方法通过工程化表示来解决这一问题,使其对缺失输入具有鲁棒性,但我们选择在权重空间中进行操作。我们提出了LARGO,一种超网络,通过使用典型多元(Canonical Polyadic, CP)张量分解来建模卷积权重,将$2^N-1$个专用缺失模态模型压缩为一个单一网络。在BraTS 2018(4种模态,15种场景)和ISLES 2022(3种模态,7种场景)上的大量实验验证表明,我们的方法在52个配置中有47个排名第一,平均Dice提升分别为+0.68%和+2.53%,超越了最先进的基线(mmFormer, M$^{3}$AE, ShaSpec, SimMLM)。在avMNIST上的概念验证实验表明,LARGO可能超越医学成像,扩展到异构非医学模态。
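Two quantities in this abstract are easy to check by hand: the $2^N-1$ availability scenarios (every non-empty subset of N modalities) and the way CP factors rebuild a weight tensor. The sketch below uses a generic 3-way tensor and hypothetical helper names; it does not reproduce LARGO's hypernetwork, only the counting and the decomposition formula.

```python
from itertools import combinations


def availability_scenarios(modalities):
    """All non-empty subsets of the given modalities: 2**N - 1 of them."""
    return [subset
            for size in range(1, len(modalities) + 1)
            for subset in combinations(modalities, size)]


def cp_reconstruct(a, b, c):
    """Rebuild a 3-way tensor from rank-R CP factors.

    W[i][j][k] = sum over r of a[i][r] * b[j][r] * c[k][r]
    """
    rank = len(a[0])
    return [[[sum(a[i][r] * b[j][r] * c[k][r] for r in range(rank))
              for k in range(len(c))]
             for j in range(len(b))]
            for i in range(len(a))]


def cp_parameter_count(dims, rank):
    """Number of values stored by the CP factors, versus the product of dims for the full tensor."""
    return rank * sum(dims)
```

Four modalities yield 15 scenarios and three yield 7, matching the counts quoted for BraTS 2018 and ISLES 2022. A rank-8 factorization of a 64 x 64 x 9 tensor stores 8 * (64 + 64 + 9) = 1,096 values instead of 36,864, which indicates why a low-rank parameterization can make a single network cover all scenarios.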
cs.CV / 75 / 2605.06088

OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention

OpenGaFF:具有代码本注意力的开放词汇高斯特征场
Li, Kunyi, Niemeyer, Michael, Wang, Sen, Gasperini, Stefano, Navab, Nassir, Tombari, Federico
Abstract
Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.
Chinese Translation
理解基于高斯表示的开放词汇三维场景仍然具有挑战性,因为在多视角观察中存在碎片化和空间不一致的语义预测。本文提出了OpenGaFF,一个基于三维高斯点云的新颖框架,用于开放词汇三维场景理解。我们方法的核心是一个高斯特征场,它将语义建模为高斯几何和外观的连续函数。通过明确地将语义预测与几何结构相条件化,这种表述增强了几何与语义之间的耦合,从而提高了三维空间中相似结构的空间一致性。为了进一步加强对象级的语义一致性,我们引入了一个结构化的代码本,作为一组共享的语义原语。此外,我们提出了一种代码本引导的注意力机制,通过查询嵌入与学习的代码本条目之间的相似性匹配来检索语言特征,从而实现稳健的开放词汇推理,同时减少对象内部特征的方差。在标准的二维和三维开放词汇基准上的广泛实验表明,我们的方法始终优于先前的方法,取得了更好的分割质量、更强的三维语义一致性,以及提供对学习表示的洞察的语义可解释代码本。
cs.CV / 76 / 2605.06092

Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning

通过上下文提示和噪声学习提升自监督跟踪
Zheng, Yaozong, Liang, Qihua, Zhong, Bineng, Zeng, Shuimu, Xue, Yuanliang, Li, Ning, Song, Shuxiang
Abstract
Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, while existing context association methods based on non-semantic queries struggle to adapt to unlabeled tracking scenarios, making it difficult to learn reliable contextual cues. In this work, we propose a novel self-supervised tracking framework, named \textbf{\tracker}, which introduces a dual-modal context association mechanism that jointly leverages fine-grained semantic prompts and contextual noise to drive the model toward learning robust tracking representations. Adhering to the easy-to-hard learning principle, our contextual association mechanism operates in two stages. During early training, instance patch tokens (prompts) are assigned to both forward and backward tracking branches to facilitate the acquisition of tracking knowledge. As training progresses, contextual noise is gradually injected into the model to perturb features, encouraging the tracker to learn robust tracking representations in a more complex feature space. Thus, this novel contextual association mechanism enables our self-supervised model to learn high-quality tracking representations from unlabeled videos, while being applied exclusively during training to preserve efficient inference. Extensive experiments demonstrate the superiority of our method.
Chinese Translation
从未标记视频中学习稳健的上下文知识对于推动自监督跟踪至关重要。然而,传统的自监督跟踪器缺乏有效的上下文建模,而现有的基于非语义查询的上下文关联方法在适应未标记跟踪场景时面临困难,这使得学习可靠的上下文线索变得困难。在本研究中,我们提出了一种新颖的自监督跟踪框架,命名为\textbf{\tracker},该框架引入了一种双模态上下文关联机制,联合利用细粒度的语义提示和上下文噪声来驱动模型学习稳健的跟踪表示。遵循由易到难的学习原则,我们的上下文关联机制基于两个阶段进行操作。在早期训练阶段,实例补丁令牌(提示)被分配到前向和后向跟踪分支,以促进跟踪知识的获取。随着训练的进展,上下文噪声逐渐注入模型中以扰动特征,鼓励跟踪器在更复杂的特征空间中学习稳健的跟踪表示。因此,这种新颖的上下文关联机制使我们的自监督模型能够从未标记视频中学习高质量的跟踪表示,同时仅在训练期间应用以保持高效推理。大量实验表明我们方法的优越性。
cs.CV / 77 / 2605.06094

VISD: Enhancing Video Reasoning via Structured Self-Distillation

VISD:通过结构化自蒸馏增强视频推理
Lin, Hao, Lv, Kunyang, Jiang, Xu, Tian, Jingqi, Du, Zhongjing, Ding, Jiayu, Zhang, Qiaoman, Jin, Hongbo
Abstract
Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
Chinese Translation
训练视频大语言模型(VideoLLMs)以进行复杂推理仍然面临挑战,主要由于稀疏的序列级奖励和缺乏对长时间、基于时间的推理轨迹的细粒度信用分配。尽管具有可验证奖励的强化学习(RLVR)提供了可靠的监督,但它未能捕捉到令牌级的贡献,导致学习效率低下。相反,现有的自蒸馏方法虽然提供了密集的监督,但缺乏结构性和诊断特异性,并且通常与强化学习的不稳定交互。在本研究中,我们提出了VISD,一个结构化自蒸馏框架,旨在为视频推理引入具有诊断意义的特权信息。VISD采用视频感知评判模型将推理质量分解为多个维度,包括答案正确性、逻辑一致性和时空基础,并利用这种结构化反馈来指导教师策略进行令牌级监督。为了稳定地将密集监督与强化学习结合,我们引入了一种方向幅度解耦机制,其中从奖励计算出的回滚级优势决定更新方向,而结构化特权信号调节令牌级更新幅度。这一设计实现了语义对齐和细粒度的信用分配,提高了推理的可信度和训练效率。此外,VISD还结合了课程调度和基于EMA的教师稳定化,以支持对长视频序列的稳健优化。在多样化基准上的实验表明,VISD始终优于强基线,改善了答案准确性和时空基础质量。值得注意的是,VISD在优化步骤中实现了近2倍的收敛速度提升,突显了结构化自监督在提高视频大语言模型性能和样本效率方面的有效性。
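The direction-magnitude decoupling described in the VISD abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `decoupled_token_advantages`, the mean-one normalization of the weights, and the example numbers are assumptions. The abstract only states that the rollout-level advantage sets the update direction while structured privileged signals modulate token-level magnitudes.

```python
def decoupled_token_advantages(rollout_advantage, token_signals, eps=1e-8):
    """Spread one rollout-level advantage over the tokens of that rollout.

    Every token advantage keeps the sign of the rollout advantage
    (direction); non-negative per-token signals only rescale the
    magnitude. Hypothetical form, assumed for illustration.
    """
    if not token_signals:
        raise ValueError("token_signals must not be empty")
    if any(s < 0 for s in token_signals):
        raise ValueError("token signals must be non-negative")
    mean_signal = sum(token_signals) / len(token_signals)
    # Normalize weights to mean 1 so the average magnitude is preserved.
    weights = [s / (mean_signal + eps) for s in token_signals]
    return [rollout_advantage * w for w in weights]


# A rollout with a negative advantage: every token is pushed down, but
# tokens with a stronger (assumed) judge signal get a larger update.
token_adv = decoupled_token_advantages(-0.5, [0.2, 1.0, 0.3, 0.5])
```

Because the weights are non-negative, a privileged signal can never flip the direction chosen by the reward, which is one plausible reading of why the abstract describes the combination as stable.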
cs.CV / 78 / 2605.06095

Metonymy in vision models undermines attention-based interpretability

视觉模型中的转喻削弱了基于注意力的可解释性
Aniraj, Ananthu, Dantas, Cassio F., Ienco, Dino, Mancini, Massimiliano, Marcos, Diego
Abstract
Part-based reasoning is a classical strategy to make a computer vision model directly focus on the object parts that are relevant to the downstream task. In the context of deep learning, this also serves to improve by-design interpretability, often by using part-centric attention mechanisms on top of a latent image representation provided by a standard, black-box model. This approach is based on a locality assumption: that the latent representation of an object part encodes primarily information about the corresponding image region. In this work, we test this basic assumption, measuring intra-object leakage in vision models using part-based attribute annotations. Through a comprehensive experimental evaluation, we show that modern pretrained vision transformers violate the locality assumption and exhibit a strong intra-object leakage, in which each part encodes information from the whole object, a visual metonymy that compromises the faithfulness of attention-based interpretable-by-design methods for part-based reasoning, ultimately rendering them uninterpretable. In addition, we establish an upper bound using a two-stage approach that prevents leakage by design. We then show that this inherently disentangled feature extraction improves attribute-driven part discovery on a variety of tasks, confirming the practical impact of intra-object leakage. Our results uncover a neglected issue affecting the interpretability of part-based representations, such as those in CBMs relying on part-centric concepts, highlighting that two-stage approaches offer a promising way to mitigate it.
Chinese Translation
基于部分的推理是一种经典策略,使计算机视觉模型能够直接关注与下游任务相关的物体部分。在深度学习的背景下,这也有助于通过设计提高可解释性,通常通过在标准黑箱模型提供的潜在图像表示之上使用以部分为中心的注意力机制。这种方法基于局部性假设:即物体部分的潜在表示主要编码与相应图像区域相关的信息。在本研究中,我们测试了这一基本假设,使用基于部分的属性注释测量视觉模型中的物体内部泄漏。通过全面的实验评估,我们表明现代预训练的视觉变换器违反了局部性假设,并表现出强烈的物体内部泄漏,其中每个部分编码来自整个物体的信息,这种视觉转喻妨碍了基于注意力的可解释性设计方法的真实性,最终使其变得不可解释。此外,我们使用两阶段方法建立了一个上限,通过设计防止泄漏。然后,我们展示了这种本质上解耦的特征提取在各种任务上改善了基于属性的部分发现,确认了物体内部泄漏的实际影响。我们的结果揭示了一个被忽视的问题,影响了基于部分的表示的可解释性,例如依赖于以部分为中心概念的CBMs,强调了两阶段方法提供了一种有前景的减轻这一问题的方式。
cs.CV / 79 / 2605.06112

Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking

基于事件流的动态思考稀疏感知专家混合Transformer用于视觉目标跟踪
Wang, Shiao, Wang, Xiao, Yang, Duoqing, Zhang, Wenhao, Jiang, Bo, Zhu, Lin, Tian, Yonghong, Luo, Bin
Abstract
Despite significant progress, RGB-based trackers remain vulnerable to challenging imaging conditions, such as low illumination and fast motion. Event cameras offer a promising alternative by asynchronously capturing pixel-wise brightness changes, providing high dynamic range and high temporal resolution. However, existing event-based trackers often neglect the intrinsic spatial sparsity and temporal density of event data, while relying on a single fixed temporal-window sampling strategy that is suboptimal under varying motion dynamics. In this paper, we propose an event sparsity-aware tracking framework that explicitly models event-density variations across multiple temporal scales. Specifically, the proposed framework progressively injects sparse, medium-density, and dense event search regions into a three-stage Vision Transformer backbone, enabling hierarchical multi-density feature learning. Furthermore, we introduce a sparsity-aware Mixture-of-Experts module to encourage expert specialization under different sparsity patterns, and design a dynamic pondering strategy to adaptively adjust the inference depth according to tracking difficulty. Extensive experiments on FE240hz, COESOT, and EventVOT demonstrate that the proposed approach achieves a favorable trade-off between tracking accuracy and computational efficiency. The source code will be released on https://github.com/Event-AHU/OpenEvTracking.
Chinese Translation
尽管取得了显著进展,基于RGB的跟踪器在低照明和快速运动等挑战性成像条件下仍然脆弱。事件相机通过异步捕捉像素级亮度变化,提供了高动态范围和高时间分辨率,成为一种有前景的替代方案。然而,现有的基于事件的跟踪器往往忽视了事件数据的内在空间稀疏性和时间密度,同时依赖于单一固定的时间窗口采样策略,这在变化的运动动态下并不理想。本文提出了一种事件稀疏感知跟踪框架,明确建模了多个时间尺度上的事件密度变化。具体而言,所提出的框架逐步将稀疏、中等密度和密集的事件搜索区域注入到三阶段的视觉Transformer主干网络中,从而实现分层的多密度特征学习。此外,我们引入了一个稀疏感知的专家混合模块,以鼓励在不同稀疏模式下的专家专业化,并设计了一种动态思考策略,根据跟踪难度自适应调整推理深度。在FE240hz、COESOT和EventVOT上的大量实验表明,所提出的方法在跟踪精度和计算效率之间实现了良好的平衡。源代码将发布在https://github.com/Event-AHU/OpenEvTracking。
cs.CV / 80 / 2605.06121

Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

害虫思维者:通过强化学习学习像昆虫学家一样思考和推理
Li, Xueheng, Wang, Yu, Hu, Tao, Huang, Ji, Cao, Ke, Yang, Qize, Li, Rui, Zhang, Jie, Xie, Chengjun
Abstract
Pest-induced crop losses pose a major threat to global food security and sustainable agricultural development. While recent advances in Multimodal Large Language Models (MLLMs) have shown strong potential for visual understanding and smart agriculture, their direct application to pest recognition remains limited due to the domain's unique challenges such as high inter-species complexity, intra-species variability, and the scarcity of expert-annotated data. In this work, we introduce Pest-Thinker, a knowledge-driven reinforcement learning (RL) framework that enables MLLMs to reason over fine-grained pest morphology. We first construct two high-definition pest benchmarks, QFSD and AgriInsect, comprising diverse species and expert-annotated morphological traits. Leveraging these datasets, we synthesize Chain-of-Thought (CoT) reasoning trajectories to facilitate structured learning of pest-specific visual cues through Supervised Fine-Tuning (SFT). Subsequently, we employ Group Relative Policy Optimization (GRPO) with a novel feature reward that guides the model to focus on observable morphological evidence, assessed by an LLM-as-a-Judge strategy. Extensive experiments demonstrate that Pest-Thinker substantially improves both in-domain and out-of-domain morphological understanding, marking a step toward expert-level visual reasoning for intelligent agricultural pest analysis. The datasets and source code are available upon acceptance.
Chinese Translation
害虫引起的农作物损失对全球粮食安全和可持续农业发展构成了重大威胁。尽管近期多模态大型语言模型(MLLMs)在视觉理解和智能农业方面显示出强大的潜力,但由于该领域独特的挑战,如物种间复杂性高、物种内变异性大以及专家注释数据稀缺,其在害虫识别中的直接应用仍然有限。在本研究中,我们提出了害虫思维者(Pest-Thinker),一个知识驱动的强化学习(RL)框架,使MLLMs能够对细粒度的害虫形态进行推理。我们首先构建了两个高分辨率的害虫基准数据集QFSD和AgriInsect,包含多样的物种和专家注释的形态特征。利用这些数据集,我们合成了思维链(Chain-of-Thought, CoT)推理轨迹,以促进通过监督微调(Supervised Fine-Tuning, SFT)对害虫特定视觉线索的结构化学习。随后,我们采用了群体相对策略优化(Group Relative Policy Optimization, GRPO),结合一种新颖的特征奖励,引导模型关注可观察的形态证据,并通过大型语言模型作为评判者(LLM-as-a-Judge)策略进行评估。大量实验表明,害虫思维者在领域内和领域外的形态理解上均显著提升,标志着朝着智能农业害虫分析的专家级视觉推理迈出了重要一步。数据集和源代码将在接受后提供。
cs.CV / 81 / 2605.06127

Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration

连续专家组装:针对一体化图像恢复的实例条件低秩残差
He, Haisen, Zou, Xiangyu, Dong, SongLin, Li, Heng, Gong, Yihong, Ma, Zhiheng
Abstract
Real-world image degradation is often unknown, spatially non-uniform, and compositional, requiring all-in-one restoration models to adapt a single set of weights to diverse local corruption patterns without test-time degradation labels. Existing methods typically modulate a shared backbone with global prompts or degradation descriptors, or route features through predefined expert pools. However, compact global conditioning can bottleneck localized degradation evidence, while static expert routing may produce homogeneous updates or rely on unstable sparse assignments. We propose \textbf{Continuous Expert Assembly} (CEA), a token-wise dynamic parameterization framework for all-in-one image restoration. CEA employs a lightweight \textbf{Cross-Attention Hyper-Adapter} to probe intermediate spatial features and synthesize instance-conditioned low-rank routing bases and residual directions. Each spatial token then assembles its own residual update via dense signed dot-product affinities over the generated rank-wise components, avoiding external prompts, static expert banks, and discrete Top-$k$ selection. The resulting assembly rule also admits a linear-attention perspective, making its dense token-wise routing behavior transparent. Experiments on AIO-3, AIO-5, and CDD-11 show that CEA improves average restoration quality over strong prompt-, descriptor-, and expert-based baselines, with the clearest gains on spatially varying and compositional degradations, while maintaining favorable parameter, FLOP, and runtime efficiency.
Chinese Translation
现实世界中的图像退化通常是未知的、空间上不均匀的和成分性的,这要求一体化恢复模型在没有测试时退化标签的情况下,适应一组权重以应对多样的局部损坏模式。现有方法通常通过全局提示或退化描述符来调节共享的主干网络,或通过预定义的专家池来路由特征。然而,紧凑的全局条件可能会限制局部退化证据,而静态专家路由可能会产生同质更新或依赖不稳定的稀疏分配。我们提出了\textbf{连续专家组装}(Continuous Expert Assembly,CEA),这是一种针对一体化图像恢复的基于标记的动态参数化框架。CEA采用轻量级的\textbf{交叉注意力超适配器}(Cross-Attention Hyper-Adapter)来探测中间空间特征,并合成实例条件的低秩路由基础和残差方向。每个空间标记通过对生成的秩相关分量进行密集的有符号点积亲和力来组装自己的残差更新,避免了外部提示、静态专家库和离散的Top-$k$选择。所得到的组装规则也允许线性注意力视角,使其密集的标记路由行为变得透明。在AIO-3、AIO-5和CDD-11上的实验表明,CEA在强提示、描述符和专家基线之上提高了平均恢复质量,尤其在空间变化和成分性退化上表现出明显的提升,同时保持了良好的参数、FLOP和运行时效率。
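The token-wise assembly rule in the CEA abstract (dense signed dot-product affinities over rank-wise components) can be sketched in a few lines. This is an illustrative reading, not the paper's code: the function name `assemble_residual` and the hand-written bases are assumptions. In CEA the routing bases and residual directions are synthesized per instance by the hyper-adapter, whereas here they are fixed constants.

```python
def assemble_residual(token, routing_bases, residual_dirs):
    """Token-wise low-rank residual (hypothetical sketch).

    token:         feature vector of length d
    routing_bases: r vectors of length d, one per rank component
    residual_dirs: r vectors of length d, one per rank component
    """
    # Dense signed affinities: one dot product per rank component,
    # with no softmax and no discrete Top-k selection.
    affinities = [sum(t * b for t, b in zip(token, base))
                  for base in routing_bases]
    # Residual = affinity-weighted sum of the residual directions.
    return [sum(a * direction[i]
                for a, direction in zip(affinities, residual_dirs))
            for i in range(len(token))]


token = [1.0, -2.0, 0.5]
routing_bases = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # rank r = 2
residual_dirs = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
residual = assemble_residual(token, routing_bases, residual_dirs)
updated = [t + r for t, r in zip(token, residual)]
```

Under this reading each token applies its own rank-r linear map, and a negative affinity subtracts a direction instead of merely down-weighting it, which is what "signed" buys over softmax routing.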
cs.CV / 82 / 2605.06137

Autoregressive Visual Generation Needs a Prologue

自回归视觉生成需要前言
Zheng, Bowen, Luo, Weijian, Yang, Guang, Zhang, Colin, Hu, Tianyang
Abstract
In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.
Chinese Translation
在本研究中,我们提出了前言(Prologue),一种弥合自回归(AR)图像生成中重建与生成之间差距的方法。前言并不是通过修改视觉标记来满足重建和生成的需求,而是生成一小组前言标记,将其添加到视觉标记序列的前面。这些前言标记仅使用AR交叉熵(CE)损失进行训练,而视觉标记则专注于重建。这种解耦设计使我们能够通过AR模型的真实分布来优化生成,而不影响重建质量,我们进一步从ELBO的角度对此进行了形式化。在ImageNet 256x256上,Prologue-Base将gFID从21.01降低到10.75,而重建几乎保持不变;Prologue-Large在没有辅助语义监督的情况下,使用标准AR模型达到了竞争性的rFID为0.99和gFID为1.46。有趣的是,仅由AR梯度驱动的前言标记表现出新兴的语义结构:对16个前言标记进行线性探测的Top-1准确率达到35.88%,远高于标准分词器前16个标记的23.71%;使用固定的前言标记进行重采样保持了类似的高级语义布局。我们的结果表明了一种新的方向:通过引入一个单独学习的生成表示来改善生成质量,同时保持原始表示不变。
cs.CV / 83 / 2605.06143

AI-Generated Images: What Humans and Machines See When They Look at the Same Image

AI生成图像:人类与机器在观察同一图像时所见的不同
Poletti, Silvia, Ilyes, Justin, Hasenbalg, Marcel, Fischinger, David, Boyer, Martin
Abstract
The misuse of generative AI in online disinformation campaigns highlights the urgent need for transparent and explainable detection systems. In this work, we investigate how detectors for AI-generated images can be more effective in providing human-understandable explanations for their predictions. To this end, we develop a suite of detectors with various architectures and fine-tuning strategies, trained on our large-scale photorealistic fake image dataset, AIText2Image, and assess their performance on state-of-the-art text-to-image AI generators. We integrate 16 different explainable AI (XAI) methods into our detection framework, and the visual explanations are comprehensively refined and evaluated through a novel approach that prioritizes human understanding of AI-generated images, using both textual and visual responses collected from a survey of 100 participants. This framework offers insights into visual-language cues in fake image detection and into the clarity of XAI methods from a human perspective, measuring the alignment of XAI outputs with human preferences.
Chinese Translation
生成性人工智能在在线虚假信息传播中的误用凸显了对透明和可解释检测系统的迫切需求。在本研究中,我们探讨了如何使AI生成图像的检测器在提供人类可理解的预测解释方面更加有效。为此,我们开发了一套具有不同架构和微调策略的检测器,训练于我们的大规模逼真假图像数据集AIText2Image,并评估其在最先进的文本到图像AI生成器上的性能。我们将16种不同的可解释人工智能(XAI)方法整合到我们的检测框架中,并通过一种新颖的方法对视觉解释进行全面的优化和评估,该方法优先考虑人类对AI生成图像的理解,使用从100名参与者的调查中收集的文本和视觉反馈。该框架为虚假图像检测中的视觉-语言线索以及从人类视角看XAI方法的清晰度提供了见解,测量了XAI输出与人类偏好的对齐程度。
cs.CV / 84 / 2605.06148

Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow

利用Wasserstein梯度流学习离散自回归先验
Zheng, Bowen, Luo, Yihong, Hu, Tianyang
Abstract
Discrete image tokenizers are commonly trained in two stages: first for reconstruction, and then with a prior model fitted to the frozen token sequences. This decoupling leaves the tokenizer unaware of the model that will later generate its tokens. As a result, the learned tokens may preserve image information well but still be difficult for an autoregressive (AR) prior to predict from left to right. We analyze this mismatch using Tripartite Variational Consistency (TVC), which decomposes latent-variable learning into three consistency conditions: conditional-likelihood consistency, prior consistency, and posterior consistency. TVC shows that two-stage training preserves the reconstruction side but leaves prior consistency outside the tokenizer objective: the overall token distribution is fixed before the AR prior participates in training. Motivated by this view, we add a distribution-level prior-matching signal during tokenizer training, while keeping the reconstruction objective unchanged. We optimize this signal with a Wasserstein-gradient-flow update. For hard categorical tokens, the update reduces to a token-level contrast between an auxiliary AR model that tracks the tokenizer's current token distribution and the target AR prior. It requires only forward passes through the two AR models and does not backpropagate through either of them. The resulting tokenizer, wAR-Tok, reduces AR loss and improves generation FID on CIFAR-10 and ImageNet at comparable reconstruction quality.
Chinese Translation
离散图像标记器通常分为两个阶段进行训练:首先进行重建,然后使用适配于冻结标记序列的先验模型。这种解耦使得标记器无法感知后续生成其标记的模型。因此,学习到的标记可能很好地保留图像信息,但对于自回归(AR)先验来说,从左到右的预测仍然困难。我们使用三方变分一致性(Tripartite Variational Consistency, TVC)分析这种不匹配,TVC将潜变量学习分解为三个一致性条件:条件似然一致性、先验一致性和后验一致性。TVC表明,两阶段训练保留了重建方面,但将先验一致性排除在标记器目标之外:在AR先验参与训练之前,整体标记分布是固定的。基于这一观点,我们在标记器训练过程中添加了一个分布级别的先验匹配信号,同时保持重建目标不变。我们通过Wasserstein梯度流更新来优化该信号。对于硬分类标记,该更新简化为一个辅助AR模型与目标AR先验之间的标记级对比,辅助AR模型跟踪标记器当前的标记分布。该过程仅需通过两个AR模型进行前向传播,而不需要在它们之间反向传播。最终得到的标记器wAR-Tok减少了AR损失,并在保持相似重建质量的情况下,提高了CIFAR-10和ImageNet上的生成FID。
cs.CV / 85 / 2605.06160

Beyond Forgetting in Continual Medical Image Segmentation: A Comprehensive Benchmark Study

超越遗忘的持续医学图像分割:一项综合基准研究
Wang, Bomin, Zhou, Hangqi, Gao, Yibo, Zhuang, Xiahai
Abstract
Continual learning (CL) is essential for deploying medical image segmentation models in clinical environments where imaging domains, anatomical targets, and diagnostic tasks evolve over time. However, continual segmentation still faces three main challenges. First, the scenarios for this task remain insufficiently standardized for real-world clinical settings. Second, existing research has focused primarily on mitigating forgetting, overlooking other essential properties such as plasticity. Third, a benchmark with a comprehensive evaluation of existing methods is still lacking. To address these gaps, we present a benchmark study of continual medical image segmentation. We first define three clinically motivated scenarios, namely Domain-CL, Class-CL, and Organ-CL, which capture cross-center domain shift, incremental anatomical structure segmentation, and cross-organ segmentation, respectively. We then introduce an evaluation framework that measures not only general performance and forgetting, but also plasticity, forward generalizability, parameter efficiency, and replay burden. Extensive experiments with representative CL methods show that it remains challenging to develop a model that satisfies all of these requirements simultaneously. Nevertheless, the results suggest that replay-based methods achieve the best overall balance between stability and plasticity, that parameter-isolation methods appear effective at reducing forgetting, though at the cost of increased model size, and that forward generalizability remains a significantly understudied aspect of this research field. Finally, we discuss related learning paradigms and outline future directions for continual medical image segmentation.
Chinese Translation
持续学习(CL)对于在临床环境中部署医学图像分割模型至关重要,因为成像领域、解剖目标和诊断任务随着时间的推移而不断演变。然而,持续分割仍面临三个主要挑战。首先,该任务的场景在现实临床环境中仍然缺乏足够的标准化。其次,现有研究主要集中在减轻遗忘上,忽视了其他重要特性,如可塑性。第三,仍然需要一项对现有方法进行全面评估的基准工作。为了解决这些问题,我们提出了这样一项持续医学图像分割的基准研究。我们首先定义了三个临床驱动的场景,即Domain-CL、Class-CL和Organ-CL,分别捕捉跨中心领域转移、增量解剖结构分割和跨器官分割。然后,我们引入了一个评估框架,不仅衡量一般性能和遗忘,还衡量可塑性、前向泛化能力、参数效率和重放负担。来自代表性CL方法的大量实验结果表明,开发一个能够同时满足所有要求的模型仍然具有挑战性。然而,这些研究也表明,基于重放的方法在稳定性和可塑性之间实现了最佳整体平衡,参数隔离方法在减少遗忘方面应有效,尽管以增加模型大小为代价,而前向泛化能力仍然是该研究领域显著未被研究的方面。最后,我们讨论了相关学习范式,并概述了持续医学图像分割的未来方向。
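The evaluation axes named in this abstract (general performance, forgetting, plasticity) are commonly computed from a matrix of per-task scores. The sketch below uses widely used continual-learning definitions; the benchmark's exact formulas are not given in the abstract, so the function name `continual_metrics`, the plasticity proxy, and the example scores are all assumptions.

```python
def continual_metrics(acc):
    """acc[i][j]: score on task j after training on task i (needs >= 2 tasks)."""
    num_tasks = len(acc)
    final = acc[-1]
    # General performance: mean score over all tasks after the last step.
    avg_final = sum(final) / num_tasks
    # Forgetting: best earlier score on a task minus its final score.
    forgetting = [
        max(acc[i][j] for i in range(j, num_tasks - 1)) - final[j]
        for j in range(num_tasks - 1)
    ]
    avg_forgetting = sum(forgetting) / len(forgetting)
    # Plasticity proxy: score on each task right after learning it.
    plasticity = sum(acc[j][j] for j in range(num_tasks)) / num_tasks
    return avg_final, avg_forgetting, plasticity


# Illustrative (made-up) segmentation scores for three sequential tasks.
acc = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.80, 0.88],
]
avg_final, avg_forgetting, plasticity = continual_metrics(acc)
```

A method can score well on forgetting simply by learning little, which is why the abstract argues for reporting plasticity alongside it.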
cs.CV / 86 / 2605.06170

DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models

DynT2I-Eval:一种动态文本到图像模型评估框架
Wang, Juntong, Wang, Jiarui, Duan, Huiyu, Li, Lewei, Zhai, Guangtao, Min, Xiongkuo
Abstract
Existing text-to-image (T2I) benchmarks largely rely on fixed prompt sets, leaving them vulnerable to overfitting and benchmark contamination once publicly released and repeatedly reused. In this work, we propose DynT2I-Eval, a fully automated dynamic evaluation framework for T2I models. It constructs a structured visual semantic space from long-form descriptions, decomposing prompts into controllable dimensions (e.g., subject, logical constraint, environment, and composition). This enables the continuous generation of fresh prompts via task-specific spaces and difficulty-aware sampling. DynT2I-Eval evaluates model performance across text alignment, perceptual quality, and aesthetics. Heterogeneous outputs are unified into prompt-conditioned pairwise comparisons, allowing a dynamic scheduler, micro-batch aggregation, and weighted Bayesian updates to maintain a stable online leaderboard despite changing prompt distributions and model injection. Experiments with independently sampled prompt streams demonstrate that continually refreshed prompts provide a robust evaluation protocol, reducing the impact of prompt-set-specific tuning. Simulations and ablations further confirm that the proposed ranking framework achieves a strong balance among cold-start convergence, late-entry discovery, and long-run ranking fidelity.
Chinese Translation
现有的文本到图像(T2I)基准测试在很大程度上依赖于固定的提示集,这使得它们在公开发布和反复使用后容易出现过拟合和基准污染。在本研究中,我们提出了DynT2I-Eval,这是一种完全自动化的T2I模型动态评估框架。它从长篇描述中构建一个结构化的视觉语义空间,将提示分解为可控维度(例如,主题、逻辑约束、环境和构图)。这使得通过任务特定空间和难度感知采样持续生成新提示成为可能。DynT2I-Eval在文本对齐、感知质量和美学方面评估模型性能。异构输出被统一为基于提示的成对比较,允许动态调度程序、微批量聚合和加权贝叶斯更新,以在提示分布和模型注入变化的情况下保持稳定的在线排行榜。通过独立采样的提示流进行的实验表明,持续更新的提示提供了一种稳健的评估协议,减少了提示集特定调优的影响。模拟和消融实验进一步确认,所提出的排名框架在冷启动收敛、晚期发现和长期排名保真度之间实现了良好的平衡。
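As a rough illustration of how weighted pairwise comparisons can feed an online leaderboard, the sketch below keeps a Beta posterior over each model's win rate. This is a deliberately simplified stand-in: the abstract does not specify DynT2I-Eval's Bayesian model, scheduler, or weighting scheme, and the class name `BetaLeaderboard`, the uniform prior, and the model names are hypothetical.

```python
class BetaLeaderboard:
    """Keeps a Beta(alpha, beta) posterior over each model's win rate."""

    def __init__(self, prior=1.0):
        self.prior = prior
        self.params = {}  # model name -> [alpha, beta]

    def add_model(self, name):
        # Late-entry models start from the same prior as everyone else.
        self.params.setdefault(name, [self.prior, self.prior])

    def update(self, winner, loser, weight=1.0):
        # A weighted comparison contributes fractional evidence.
        self.add_model(winner)
        self.add_model(loser)
        self.params[winner][0] += weight
        self.params[loser][1] += weight

    def posterior_mean(self, name):
        alpha, beta = self.params[name]
        return alpha / (alpha + beta)

    def ranking(self):
        return sorted(self.params, key=self.posterior_mean, reverse=True)


board = BetaLeaderboard()
board.update("model_a", "model_b", weight=1.0)
board.update("model_a", "model_c", weight=0.5)  # less reliable comparison
board.update("model_c", "model_b", weight=1.0)
ranking = board.ranking()
```

The prior keeps a newly injected model near the middle of the table until it has accumulated enough comparisons, a simple form of the cold-start behavior the abstract mentions.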
cs.CV / 87 / 2605.06173

Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation

Retina-RAG:用于联合视网膜诊断和临床报告生成的检索增强视觉-语言建模
Zaian, Abdelrahman, Bhat, Sheethal, Abdalkader, Mohamed, Maier, Andreas
Abstract
Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641) on a retinal disease detection dataset with captions. For report generation, Retina-RAG attains ROUGE-L 0.429 and SBERT similarity 0.884, exceeding all baselines. The full framework operates on a single consumer-grade GPU, demonstrating that clinically structured retinal AI can be achieved with modest computational resources.
Chinese Translation
糖尿病视网膜病变(DR)是全球工作年龄成年人可预防失明的主要原因,然而大多数自动筛查系统仅限于图像级分类,缺乏临床结构化报告。我们提出了Retina-RAG,一个低成本的模块化框架,能够联合执行DR严重程度分级、黄斑水肿(ME)检测和报告生成。该架构解耦了高性能的视网膜分类器和通过低秩适配(LoRA)调整的参数高效视觉-语言模型(Qwen2.5-VL-7B-Instruct),实现灵活的组件集成。检索增强生成(RAG)模块在推理时注入策划的眼科知识与结构化分类器输出,以提高诊断一致性并减少幻觉。Retina-RAG在DR分级任务中获得了0.731的F1分数,在ME检测中获得了0.948,显著优于零-shot Qwen(0.096,0.732)和MMed-RAG(0.541,0.641)在带有标题的视网膜疾病检测数据集上的表现。在报告生成方面,Retina-RAG达到了ROUGE-L 0.429和SBERT相似度0.884,超越了所有基线。整个框架在单个消费级GPU上运行,证明了临床结构化视网膜人工智能可以在适度的计算资源下实现。
cs.CV / 88 / 2605.06179

SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision

SuperFace:超越伪监督的偏好对齐面部表情估计
Kang, Zejian, Xu, Xuanyang, Yang, Wentao, Zheng, Kai, Fei, Yuanchen, Zou, Hongyuan, Shan, Hui, Yang, Shuo, Huang, Xiangru
Abstract
Accurate facial estimation is crucial for realistic digital human animation, and ARKit blendshape coefficients offer an interpretable representation by mapping facial motions to semantic animation controls. However, learning high-quality ARKit coefficient prediction remains limited by the absence of reliable ground-truth supervision. Existing methods typically rely on capture software such as Live Link Face to provide pseudo labels, which may contain noisy activations, biased coefficient magnitudes, and missing or inaccurate facial actions. Consequently, models trained with supervised learning tend to reproduce imperfect pseudo labels rather than optimize for perceptual expression fidelity. In this paper, we propose SuperFace, a preference-driven framework that moves ARKit facial expression estimation from pseudo-label imitation toward human-aligned perceptual optimization. Instead of treating software-estimated coefficients as fixed ground truth, SuperFace uses them only as an initialization and further improves coefficient prediction through human preference feedback on rendered facial expressions. By aligning the model with perceptual judgments rather than numerical pseudo labels, SuperFace enables more visually faithful and expressive facial animation. Experiments show that SuperFace improves expression fidelity over Live Link Face supervision, demonstrating the effectiveness of preference-driven optimization for semantic facial action prediction.
Chinese Translation
准确的面部估计对于逼真的数字人类动画至关重要,而 ARKit 混合形状系数通过将面部动作映射到语义动画控制提供了一种可解释的表示。然而,高质量 ARKit 系数预测的学习仍然受到可靠的真实监督缺乏的限制。现有方法通常依赖于捕捉软件,如 Live Link Face,提供伪标签,但这些标签可能包含噪声激活、偏差的系数幅度以及缺失或不准确的面部动作。因此,使用监督学习训练的模型往往倾向于重现不完美的伪标签,而不是优化感知表情的真实性。在本文中,我们提出了 SuperFace,一个以偏好为驱动的框架,将 ARKit 面部表情估计从伪标签模仿转向人类对齐的感知优化。SuperFace 不将软件估计的系数视为固定的真实值,而是仅将其作为初始化,并通过对渲染面部表情的人类偏好反馈进一步改善系数预测。通过将模型与感知判断对齐,而不是数值伪标签,SuperFace 实现了更具视觉真实感和表现力的面部动画。实验表明,SuperFace 在表情真实性上优于 Live Link Face 监督,证明了以偏好驱动的优化在语义面部动作预测中的有效性。
cs.CV / 89 / 2605.06192

EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

EA-WM:具有结构化运动-视觉动作场的事件感知生成世界模型
Yang, Zhaoyang, Jin, Yurun, Qi, Lizhe, Huang, Cong, Chen, Kai
Abstract
Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics in the generated rollouts. To bridge this gap, we present EA-WM, an Event-Aware Generative World Model that effectively closes the loop between kinematic control and visual perception. Rather than injecting joint or end-effector actions as abstract, low-dimensional tokens, EA-WM projects actions and kinematic states directly into the target camera view as Structured Kinematic-to-Visual Action Fields. To fully exploit this geometrically grounded representation, we introduce event-aware bidirectional fusion blocks that modulate cross-branch attention, capturing object state changes and interaction dynamics. Evaluated on the comprehensive WorldArena benchmark, EA-WM achieves state-of-the-art performance, outperforming existing baselines by a significant margin.
Chinese Translation
预训练的视频扩散模型提供了强大的时空生成先验,使其成为机器人世界模型的自然基础。尽管最近的世界-动作模型联合优化未来视频和动作,但它们主要将视频生成视为策略学习的辅助表示。因此,它们对逆问题的探索不足:利用动作信号指导视频合成,从而常常未能在生成的结果中保留精确的机器人空间几何和细致的机器人-物体交互动态。为了解决这一问题,我们提出了EA-WM,一个事件感知生成世界模型,有效地闭合了运动控制与视觉感知之间的循环。EA-WM并不是将关节或末端执行器动作作为抽象的低维标记注入,而是将动作和运动状态直接投影到目标相机视图中,形成结构化的运动-视觉动作场。为了充分利用这一几何基础的表示,我们引入了事件感知的双向融合模块,调节跨分支注意力,捕捉物体状态变化和交互动态。在全面的WorldArena基准测试中,EA-WM实现了最先进的性能,显著超越了现有基线。
cs.CV / 90 / 2605.06197

Bridging visual saliency and large language models for explainable deep learning in medical imaging

桥接视觉显著性与大型语言模型以实现医学影像中的可解释深度学习
Nguezet, Paul Valery, Fute, Elie Tagne, Brima, Yusuf, Azanguezet, Benoit Martin, Atemkeng, Marcellin
Abstract
The opaque nature of deep learning models remains a significant barrier to their clinical adoption in medical imaging. This paper presents a multimodal explainability framework that bridges the gap between convolutional neural network (CNN) predictions and clinically actionable insights for brain tumor classification, leveraging large language models (LLMs) to deliver human-interpretable diagnostic narratives. The proposed framework operates through three coupled stages. First, nine CNN architectures are extended with a dual-output hybrid formulation that simultaneously optimises a classification head and a segmentation head, enabling spatially richer feature learning. Second, visual saliency attribution methods, namely Grad-CAM, Grad-CAM++, and ScoreCAM, are applied to generate class-discriminative heatmaps, which are subsequently refined into binary tumor masks via an adaptive percentile thresholding pipeline. Third, the resulting masks are mapped onto the Harvard-Oxford cortical atlas to translate pixel-level evidence into named neuroanatomical structures, and the extracted findings are encoded into a structured JSON file that conditions three LLMs (Grok3, Mistral, and LLaMA) to generate coherent, radiological-style diagnostic reports. Evaluated on a dataset of 4,834 contrast-enhanced T1-weighted brain MRI images spanning three tumor classes, InceptionResNetV2 achieved the highest classification performance and Grad-CAM++ yielded the best segmentation overlap. Among the language models, Grok3 led in lexical diversity and coherence, while LLaMA achieved the highest readability score. By integrating visual, anatomical, and linguistic modalities into a unified pipeline, the framework produces explanations that are technically grounded and meaningfully interpretable, advancing the transparency and clinical accountability of artificial intelligence assisted brain tumor diagnosis.
Chinese Translation
深度学习模型的不透明性仍然是其在医学影像领域临床应用的一大障碍。本文提出了一种多模态可解释性框架,旨在弥合卷积神经网络(CNN)预测与脑肿瘤分类的临床可操作见解之间的差距,利用大型语言模型(LLMs)提供人类可解释的诊断叙述。所提出的框架通过三个相互关联的阶段进行操作。首先,扩展九种CNN架构,采用双输出混合形式,同时优化分类头和分割头,从而实现空间上更丰富的特征学习。其次,应用视觉显著性归因方法,即Grad-CAM、Grad-CAM++和ScoreCAM,生成类别区分热图,随后通过自适应百分位阈值处理管道将其精炼为二元肿瘤掩膜。第三,将生成的掩膜映射到哈佛-牛津皮层图谱上,将像素级证据转化为命名的神经解剖结构,并将提取的结果编码为结构化的JSON文件,以便为三个LLM(Grok3、Mistral和LLaMA)生成连贯的放射学风格诊断报告。在对包含4,834幅对比增强T1加权脑MRI图像的数据库进行评估时,InceptionResNetV2实现了最高的分类性能,而Grad-CAM++则产生了最佳的分割重叠。在语言模型中,Grok3在词汇多样性和连贯性方面表现最佳,而LLaMA则获得了最高的可读性评分。通过将视觉、解剖和语言模态整合到一个统一的管道中,该框架生成的解释在技术上是扎实的,并且具有重要的可解释性,推动了人工智能辅助脑肿瘤诊断的透明度和临床问责制。
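The second stage above (saliency heatmap to binary tumor mask via percentile thresholding) can be illustrated with a minimal sketch. The abstract does not say how the percentile is adapted, so the fixed `percentile` argument, the nearest-rank rule, and the function name `percentile_threshold_mask` are all assumptions made for illustration only.

```python
import math

def percentile_threshold_mask(heatmap, percentile=90.0):
    """Binarize a 2D saliency map: 1 where the value reaches the percentile cutoff."""
    values = sorted(v for row in heatmap for v in row)
    # Nearest-rank percentile (an assumed rule; the paper's rule is not given).
    rank = max(1, math.ceil(percentile / 100.0 * len(values)))
    threshold = values[rank - 1]
    return [[1 if v >= threshold else 0 for v in row] for row in heatmap]

# Toy 3x3 heatmap standing in for a Grad-CAM++ output.
toy_heatmap = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.8],
    [0.2, 0.7, 0.1],
]
toy_mask = percentile_threshold_mask(toy_heatmap, percentile=70.0)
print(toy_mask)  # [[0, 0, 0], [0, 1, 1], [0, 1, 0]]
```

Raising `percentile` keeps fewer, hotter pixels; a data-dependent choice of that value is presumably what makes the pipeline adaptive.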
cs.CV / 91 / 2605.06207

Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation

驯服熵崖:自回归视觉生成的可变码本大小量化
Zheng, Bowen, Luo, Weijian, Yang, Guang, Zhang, Colin, Hu, Tianyang
Abstract
Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K=16384$, this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_2 N / \log_2 K \rceil$. Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), where the codebook size $K_t$ grows monotonically along the sequence from $K_{\min}=2$ to $K_{\max}$, leaving the loss function, parameter count, and AR training procedure unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet $256\times256$ over the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without any extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at $K_{\min}=2$ naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared to 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.
Chinese Translation
大多数离散视觉标记器依赖于一种默认设计:序列中的每个位置共享相同的码本。研究人员尝试扩展码本大小 $K$ 以获得更好的重建性能。然而,这种常量码本设计达到了一个基本的信息理论极限。我们观察到,训练集的每位置条件熵沿序列迅速衰减,以至于在几个位置之后,条件分布变得基本上是确定性的。在 ImageNet 上,当 $K=16384$ 时,这种情况仅在 256 个位置中的 2 个位置内发生,使得剩余的 254 个位置变成了一个记忆问题。我们将这种现象称为熵崖,并用一个简单的表达式形式化:$t^{*} = \lceil \log_2 N / \log_2 K \rceil$。有趣的是,这种现象在语言中并未观察到,因为其自然结构使得每个位置的有效熵远低于码本容量。为了解决这个问题,我们提出了可变码本大小量化(Variable Codebook Size Quantization, VCQ),其中码本大小 $K_t$ 沿序列单调增长,从 $K_{\min}=2$ 到 $K_{\max}$,而损失函数、参数数量和自回归训练过程保持不变。使用一个普通的自回归 Transformer 和标准的下一个标记预测,VCQ 的基础版本在 ImageNet $256\times256$ 上将 gFID 从 27.98 降低到 14.80,相较于基线。经过扩展,它在 684M 自回归参数下达到了 gFID 1.71,没有使用任何额外的训练技术,如语义正则化或因果对齐。在 $K_{\min}=2$ 的极端信息瓶颈自然引入了一种粗到细的语义层次:仅在前 10 个标记上的线性探测在 ImageNet 上达到了 43.8% 的 top-1 准确率,而均匀码本的准确率为 27.1%。最终,这些结果表明,重要的不仅是码本的总容量,还有这些容量的分布和组织方式。
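As a worked check of the Entropy Cliff expression $t^{*} = \lceil \log_2 N / \log_2 K \rceil$, the sketch below evaluates it in Python. The abstract does not state $N$; the value 1,281,167 used here is an assumption (the commonly cited ImageNet-1k training-set size), and the function name is hypothetical.

```python
import math

def entropy_cliff_position(num_samples: int, codebook_size: int) -> int:
    """Return t* = ceil(log2(N) / log2(K))."""
    return math.ceil(math.log2(num_samples) / math.log2(codebook_size))

# Assumption: N is the commonly cited ImageNet-1k training-set size.
N_IMAGENET = 1_281_167

t_star = entropy_cliff_position(N_IMAGENET, 16384)
print(t_star)  # 2

# With a binary codebook, the same expression gives a later position.
t_star_binary = entropy_cliff_position(N_IMAGENET, 2)
print(t_star_binary)  # 21
```

Under that assumed $N$, the result for $K=16384$ agrees with the "2 out of 256 positions" figure in the abstract. The $K=2$ value only shows how a smaller codebook moves the cliff later for a constant codebook; it is not a claim about VCQ's varying $K_t$.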
cs.CV / 92 / 2605.06214

Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance

可微分自适应4D结构光照明用于形状和反射率的联合捕获
Ding, Huakeng, Chen, Yaowen, Zhou, Kun, Wu, Hongzhi
Abstract
We present a differentiable framework to adaptively compute 4D illumination conditions with respect to an object, for efficient, high-quality simultaneous acquisition of its shape and reflectance, with a unified spatial-angular structured light and a single camera. Using a simple histogram-based pixel-level probability model for depth and reflectance, we differentiably link the next illumination condition(s) with a loss that encourages the reduction in depth uncertainty. As new structured illumination is cast, corresponding image measurements are used to update the uncertainty at each pixel. Finally, a fine-tuning-based approach reconstructs the depth map and reflectance parameter maps, by minimizing the differences between all physical measurements and their simulated counterparts. The effectiveness of our framework is demonstrated on physical objects with wide variations in shape and appearance. Our depth results compare favorably with state-of-the-art techniques, while our reflectance results are comparable when validated against photographs.
Chinese Translation
我们提出了一个可微分框架,以自适应地计算与物体相关的4D照明条件,从而高效且高质量地同时获取其形状和反射率,采用统一的空间-角度结构光和单一相机。利用基于简单直方图的像素级深度和反射率概率模型,我们可微分地将下一个照明条件与一个损失函数关联,该损失函数鼓励减少深度不确定性。当新的结构光照明投射时,相应的图像测量用于更新每个像素的深度不确定性。最后,基于微调的方法通过最小化所有物理测量与其模拟对应物之间的差异,重建深度图和反射率参数图。我们的框架在形状和外观变化较大的物体上展示了有效性。我们的深度结果与最先进的技术相比表现良好,而我们的反射率结果在与照片对比时也具有可比性。
cs.CV / 93 / 2605.06229

Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search

超越显著性:低注意力引导的双重编码用于视频语义搜索
Aljehrai, Faisal, Alkhrashi, Mohammed A., Almuhrij, Alreem, Abuhimed, Sarah, Aldossary, Noorh, Aldwyish, Abdullah, Aljadaany, Raied, Alamri, Huda, Khan, Muhammad Kamran J
Abstract
Video semantic search in densely crowded scenes remains a challenging task because visual encoders tend to prioritize salient foreground regions while neglecting contextually important background areas. We propose an Inverse Attention Embedding mechanism that explicitly captures and highlights these overlooked regions. By combining inverse attention embeddings with traditional visual embeddings, our method significantly enhances semantic retrieval performance without additional training. Initial experiments and ablation studies demonstrate promising improvements over existing approaches in recall for video semantic search in crowded environments.
Chinese Translation
在密集拥挤场景中,视频语义搜索仍然是一项具有挑战性的任务,因为视觉编码器往往优先考虑显著的前景区域,而忽视了在语境中重要的背景区域。我们提出了一种逆注意力嵌入机制,明确捕捉并突出这些被忽视的区域。通过将逆注意力嵌入与传统视觉嵌入相结合,我们的方法显著提高了语义检索性能,而无需额外的训练。初步实验和消融研究表明,在拥挤环境中,视频语义搜索的召回率相较于现有方法有了显著改善。
cs.CV / 94 / 2605.06266

ZScribbleSeg: A comprehensive segmentation framework with modeling of efficient annotation and maximization of scribble supervision

ZScribbleSeg:一个综合的分割框架,旨在高效标注建模和最大化涂鸦监督
Zhang, Ke, Wang, Bomin, Zhou, Hangqi, Zhuang, Xiahai
Abstract
Curating fully annotated datasets for medical image segmentation is labour-intensive and expertise-demanding. To alleviate this problem, prior studies have explored scribble annotations for weakly supervised segmentation. Existing solutions mainly compute losses on annotated areas and generate pseudo labels by propagating annotations to adjacent regions. However, these methods often suffer from inaccurate and unrealistic segmentations due to insufficient supervision and incomplete shape information. In contrast, we first investigate the principle of good scribble annotations, which leads to efficient scribble forms via supervision maximization and randomness simulation. We further introduce regularization terms to encode the spatial relationship and the shape constraints, where the EM algorithm is utilized to estimate the mixture ratios of label classes. These ratios are critical in identifying the unlabeled pixels for each class and correcting erroneous predictions; their accurate estimation therefore lays the foundation for incorporating the spatial prior. Finally, we integrate the efficient scribble supervision with the prior into a framework, referred to as ZScribbleSeg, and apply it to multiple scenarios. Leveraging only scribble annotations, ZScribbleSeg achieves competitive performance on six segmentation tasks including ACDC, MSCMRseg, BTCV, MyoPS, Decathlon-BrainTumor and Decathlon-Prostate. Our code will be released via https://github.com/DLwbm123/ZScribbleSeg.
Chinese Translation
为医学图像分割策划完全标注的数据集是一项劳动密集型且需要专业知识的工作。为了解决这个问题,先前的研究探索了用于弱监督分割的涂鸦标注。现有的解决方案主要在标注区域计算损失,并通过将标注传播到相邻区域生成伪标签。然而,由于监督不足和形状信息不完整,这些方法往往会导致不准确和不现实的分割。相比之下,我们首先研究了良好涂鸦标注的原则,这通过最大化监督和随机性模拟得出了高效的涂鸦形式。我们进一步引入正则化项,以编码空间关系和形状约束,其中利用EM算法估计标签类别的混合比例。这些比例在识别每个类别的未标记像素和纠正错误预测中至关重要,因此准确的估计为空间先验的结合奠定了基础。最后,我们将高效的涂鸦监督与先验整合到一个框架中,称为ZScribbleSeg,并将其应用于多个场景。仅利用涂鸦标注,ZScribbleSeg在包括ACDC、MSCMRseg、BTCV、MyoPS、Decathlon-BrainTumor和Decathlon-Prostate在内的六个分割任务上实现了具有竞争力的性能。我们的代码将通过https://github.com/DLwbm123/ZScribbleSeg发布。
cs.CV / 95 / 2605.06270

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

Spark3R:非对称令牌减少实现快速前馈3D重建
Tang, Zecheng, Fu, Jiaye, Gao, Qiankun, Li, Haijie, Wu, Yanmin, Zhang, Jiaqi, Ma, Siwei, Zhang, Jian
Abstract
Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $\pi^3$, and Depth-Anything-3, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
Chinese Translation
基于视觉变换器的前馈3D重建模型能够直接从少量输入图像中估计场景几何和相机姿态,但由于全局注意力层的二次成本,将其扩展到包含数百或数千帧的视频输入仍然具有挑战性。最近的令牌合并方法通过压缩全局注意力层内的令牌序列来加速这些模型,但它们对查询令牌和键值令牌应用统一的减少,忽略了它们在3D重建中功能上不同的角色。在本研究中,我们识别出前馈3D重建模型的一个关键特性:查询令牌编码视角特定的几何请求,对压缩敏感,而键值令牌表示共享的场景上下文,能够容忍激进的压缩。基于这一洞察,我们提出了Spark3R,一个无训练加速框架,通过分配不同的减少因子来解耦查询令牌和键值令牌的压缩,对查询令牌应用组内令牌合并,对键值令牌进行轻量级令牌剪枝。此外,Spark3R在各层之间自适应调整键值减少因子,进一步改善质量与效率的权衡。作为一个即插即用的框架,无需重新训练,Spark3R可以直接集成到多个预训练的前馈3D重建模型中,包括VGGT、$\pi^3$和Depth-Anything-3,并在1,000帧输入上实现高达$28\times$的加速,同时保持竞争力的重建质量。
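To see why decoupling the two reduction factors matters, the sketch below counts query-key pairs in one global attention layer. It is only cost arithmetic, not Spark3R's merging or pruning procedure. The token count (1,000 frames times a hypothetical 1,000 tokens per frame) and the factor values are illustrative assumptions, as is reading a reduction factor `r` as "keep 1/r of the tokens".

```python
import math

def attention_pairs(num_query_tokens: int, num_kv_tokens: int) -> int:
    """Number of query-key score entries in one attention layer."""
    return num_query_tokens * num_kv_tokens

def reduced_attention_pairs(num_tokens: int, query_factor: int, kv_factor: int) -> int:
    """Pairs left after keeping 1/query_factor of queries and 1/kv_factor of keys/values."""
    num_q = math.ceil(num_tokens / query_factor)
    num_kv = math.ceil(num_tokens / kv_factor)
    return attention_pairs(num_q, num_kv)

# Hypothetical input: 1,000 frames x 1,000 tokens per frame.
total_tokens = 1_000 * 1_000

full = attention_pairs(total_tokens, total_tokens)
uniform = reduced_attention_pairs(total_tokens, query_factor=4, kv_factor=4)
asymmetric = reduced_attention_pairs(total_tokens, query_factor=2, kv_factor=8)

print(full // uniform)        # 16
print(full // asymmetric)     # 16
```

Both settings shrink the score matrix by the same factor, but the asymmetric one compresses queries only 2x. That is the kind of trade the abstract's observation (queries are sensitive to compression, key-value tokens tolerate it) makes attractive.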
cs.CV / 96 / 2605.06273

On-Orbit Real-Time Wildfire Detection Under On-Board Constraints

星载约束下的在轨实时野火检测
Rötzer, Matthias, Pörtge, Veronika, Ickerott, Martin, Chorapalli, Jayendra Praveen Kumar, Scheftelowitsch, Dimitri, Bereczky, Max, Rashkovetsky, Dmitry, Appalla, Sai Manoj, Gottfriedsen, Julia
Abstract
We present a deployed system for on-orbit wildfire detection aboard a nine-satellite commercial thermal infrared constellation, operating under demanding joint constraints: sub-megabyte model footprint, sub-150 ms per-batch TensorRT FP16 inference on an NVIDIA Jetson Xavier NX, and an end-to-end alert pipeline targeting under 10 minutes from satellite overpass to fire event communication. The system operates on uncalibrated mid-wave infrared (MWIR) single-band imagery at 200 m ground sampling distance, where fires frequently appear as sub-pixel or single-pixel thermal anomalies under extreme class imbalance -- challenges not addressed by the contextual thermal-thresholding pipelines (MODIS, VIIRS) that currently dominate operational fire monitoring. We present an empirical study of lightweight dense representation learning for this regime using a proprietary nine-satellite MWIR dataset. We compare dense masked autoencoding (DenseMAE) and a hybrid DenseMAE+EMA (exponential moving average) distillation variant, and evaluate representations via linear probing and full-distribution pixel-level average precision (AP) under extreme class imbalance. DenseMAE pretraining enables compact downstream models on the latency-accuracy Pareto frontier: our fastest SSL-pretrained model achieves 0.640 test AP and 0.69 event-level Fire-F1 with 65.34 ms latency per batch and a 0.52 MB engine, without pruning or compression. The best configuration reaches 0.699 AP and 0.744 Fire-F1 below 1 MB, outperforming a supervised baseline (0.650 AP) under comparable constraints.
Chinese Translation
我们提出了一种在九颗卫星商业热红外星座上部署的在轨野火检测系统,该系统在严格的联合约束下运行:模型占用空间小于1MB,NVIDIA Jetson Xavier NX上每批次的TensorRT FP16推理时间小于150毫秒,以及从卫星经过到火灾事件通信的端到端警报管道目标在10分钟以内。该系统在200米地面采样距离下,利用未校准的中波红外(MWIR)单波段图像进行操作,其中火灾常常以亚像素或单像素热异常的形式出现,且面临极端类别不平衡的挑战——这些挑战并未被当前主导的操作火灾监测方法(如MODIS和VIIRS)的上下文热阈值管道所解决。我们利用一个专有的九颗卫星MWIR数据集,进行了轻量级密集表示学习的实证研究。我们比较了密集掩码自编码(DenseMAE)和一种混合的DenseMAE+EMA(指数移动平均)蒸馏变体,并在极端类别不平衡下通过线性探测和全分布像素级平均精度(AP)评估表示。DenseMAE的预训练使得下游模型在延迟-准确性帕累托前沿上变得紧凑:我们最快的SSL预训练模型在每批次65.34毫秒的延迟和0.52MB的引擎下,达到了0.640的测试AP和0.69的事件级Fire-F1,无需剪枝或压缩。最佳配置在1MB以下达到了0.699的AP和0.744的Fire-F1,超越了在可比约束下的监督基线(0.650 AP)。
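The abstract reports pixel-level average precision (AP) rather than accuracy because fire pixels are extremely rare. The sketch below uses the common non-interpolated AP definition (mean precision at the rank of each positive); whether the paper uses exactly this variant is an assumption, and the toy scores are invented for illustration.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Toy scene: 1 fire pixel among 1,000, outranked by one confident false alarm.
toy_scores = [0.9, 0.8] + [0.1] * 998
toy_labels = [0, 1] + [0] * 998

ap = average_precision(toy_scores, toy_labels)
print(ap)  # 0.5

# A detector that predicts "no fire" everywhere is still 99.9% accurate here.
all_background_accuracy = toy_labels.count(0) / len(toy_labels)
print(all_background_accuracy)  # 0.999
```

A single false alarm ranked above the only fire pixel halves AP, while plain accuracy barely registers the difference between a useful and a useless detector.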
cs.CV / 97 / 2605.06280

Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

欧拉运动引导:通过双向几何一致性实现稳健的图像动画
Nguyen, Thong, Le, Khoi M., Nguyen, Cong-Duy, Tuan, Luu Anh, Ng, See-Kiong, Miao, Chunyan
Abstract
Recent advancements in image animation have utilized diffusion models to breathe life into static images. However, existing controllable frameworks typically rely on Lagrangian motion guidance, where optical flow is estimated relative to the initial frame. This paper revisits the same optical-flow primitive through a more local supervision design: we use adjacent-frame Eulerian motion fields to guide generation, where the motion signal always describes a short temporal hop. This shift enables parallelized training and provides bounded-error supervision throughout the generation process. To mitigate the drift artifacts common in adjacent frame generation, we introduce a Bidirectional Geometric Consistency mechanism, which computes a forward-backward cycle check to mathematically identify and mask occluded regions, preventing the model from learning incorrect warping objectives. Extensive experiments demonstrate that our approach accelerates training, preserves temporal coherence, and reduces dynamic artifacts compared to reference-based baselines.
Chinese Translation
最近在图像动画方面的进展利用扩散模型为静态图像注入生命。然而,现有的可控框架通常依赖于拉格朗日运动引导,其中光流是相对于初始帧进行估计的。本文通过更局部的监督设计重新审视了相同的光流原语:我们使用相邻帧的欧拉运动场来指导生成,其中运动信号始终描述一个短暂的时间跳跃。这一转变使得并行化训练成为可能,并在整个生成过程中提供了有界误差的监督。为了减轻相邻帧生成中常见的漂移伪影,我们引入了一种双向几何一致性机制,该机制计算前向-后向循环检查,以数学方式识别和遮蔽被遮挡区域,防止模型学习错误的变形目标。大量实验表明,与基于参考的基线相比,我们的方法加速了训练,保持了时间一致性,并减少了动态伪影。
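A forward-backward cycle check of the kind described above can be sketched as follows. This is the generic consistency test from the optical-flow literature, not necessarily the paper's exact formulation: nearest-pixel lookup, the tolerance `tol`, and treating out-of-frame targets as occluded are all assumptions made here.

```python
import math

def occlusion_mask(forward_flow, backward_flow, tol=1.0):
    """Mark pixels whose forward flow is not undone by the backward flow.

    Each flow is a 2D grid of (dx, dy) tuples. Returns a grid of booleans,
    True where the pixel should be masked out of the warping loss.
    """
    height, width = len(forward_flow), len(forward_flow[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            fx, fy = forward_flow[y][x]
            # Follow the forward flow, snapping to the nearest pixel.
            tx, ty = round(x + fx), round(y + fy)
            if not (0 <= tx < width and 0 <= ty < height):
                mask[y][x] = True  # target left the frame: treat as unreliable
                continue
            bx, by = backward_flow[ty][tx]
            # For a consistent pair, forward + backward displacement is near zero.
            mask[y][x] = math.hypot(fx + bx, fy + by) > tol
    return mask

# Toy 1x4 frame: everything moves one pixel right; one backward vector disagrees.
toy_forward = [[(1.0, 0.0)] * 4]
toy_backward = [[(-1.0, 0.0), (-1.0, 0.0), (0.0, 0.0), (-1.0, 0.0)]]

toy_occlusion = occlusion_mask(toy_forward, toy_backward, tol=0.5)
print(toy_occlusion)  # [[False, True, False, True]]
```

Pixel 1 is flagged because the backward flow at its target does not return to it; pixel 3 is flagged because its target falls outside the frame.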
cs.CV / 98 / 2605.06298

Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement

渲染,而非解码:具有潜在结构解耦的权重空间世界模型
Nzoyem, Roussel Desmond, Comi, Mauro
Abstract
Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter-frame motion, enabling users to edit either content or dynamics without compromising the other. We validate our framework on several challenging datasets, achieving strong controllable forecasting while operating on a single consumer GPU at $\sim$40M parameters. Ultimately, structured representations like INRs not only enhance our understanding of latent dynamics but also pave the way for immersive and customisable virtual experiences.
Chinese Translation
在大量未标记视频上训练世界模型是实现完全自主智能的重要一步。然而,当前的主流范式是将原始像素编码为不透明的潜在空间,并依赖于复杂的解码器进行重建,这使得这些模型在计算上昂贵且难以解释。我们通过引入NOVA,一个将系统状态表示为辅助坐标基础隐式神经表示(INR)的权重和偏置的世界建模框架,来解决这个问题。这种结构化表示可以通过解析渲染,从而消除了解码器瓶颈,同时赋予了紧凑性、可移植性和零-shot超分辨率。此外,像大多数潜在动作模型一样,NOVA可以通过动作匹配目标提炼为上下文相关的视频生成器。令人惊讶的是,NOVA在不依赖辅助损失或对抗目标的情况下,能够解耦结构场景组件,如背景、前景和帧间运动,使用户能够在不影响其他内容的情况下编辑内容或动态。我们在多个具有挑战性的数据集上验证了我们的框架,实现了强大的可控预测,同时在单个消费级GPU上以约4000万参数运行。最终,像INR这样的结构化表示不仅增强了我们对潜在动态的理解,还为沉浸式和可定制的虚拟体验铺平了道路。
cs.CV / 99 / 2605.06317

NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

NavOne:基于自上而下地图的一步全局规划的视觉-语言导航
Zhan, Dijia, Li, Jinyi, Zheng, Chenxi, Huang, Shaoyu, Li, Yong, Tang, Jie, Xu, Xuemiao
Abstract
Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.
Chinese Translation
现有的视觉-语言导航(VLN)方法通常采用自我中心的逐步范式,这导致了错误累积并限制了效率。尽管最近的方法试图利用预构建的环境地图,但它们往往依赖于逐步更新记忆图或对离散路径提案进行评分,这限制了连续空间推理并造成离散瓶颈。我们提出了自上而下的视觉-语言导航(TD-VLN),将导航重新定义为在预构建的自上而下地图上进行的一步全局路径规划问题,并得到了我们新构建的R2R-TopDown数据集的支持。为了解决这个问题,我们引入了NavOne,一个统一框架,可以在单次端到端的前向传递中直接预测多模态地图上的密集路径概率。NavOne具有一个自上而下地图融合器,用于联合多模态地图表示,并扩展了空间感知深度混合的注意力残差。在R2R-TopDown上的大量实验表明,NavOne在基于地图的VLN方法中实现了最先进的性能,在规划阶段的速度比现有的基于地图的基线快8倍,比自我中心的方法快80倍,从而实现了高效的全局导航。
cs.CV / 100 / 2605.06333

TinyBayes: Closed-Form Bayesian Inference via Jacobi Prior for Real-Time Image Classification on Edge Devices

TinyBayes:通过雅可比先验实现的闭式贝叶斯推断,用于边缘设备上的实时图像分类
Sardar, Shouvik, Das, Sourish
Abstract
Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for early intervention, yet deploying such systems in resource-constrained settings demands models that are small, fast, and require no internet connectivity. Existing edge-deployable plant disease systems rely on end-to-end deep learning without uncertainty quantification, while Bayesian methods for edge devices focus on hardware-level inference architectures rather than agricultural applications. We bridge this gap with TinyBayes, the first framework to combine a closed-form Bayesian classifier with a mobile-grade computer vision pipeline for crop disease detection. Our pipeline uses YOLOv8-Nano (5.9 MB) for lesion localisation, MobileNetV3-Small (3.5 MB) for feature extraction, and, for classification, the Jacobi prior, a Bayesian method that provides closed-form, non-iterative estimators via projection. The Jacobi-DMR (Distributed Multinomial Regression) classifier adds only 13.5 KB to the pipeline, bringing the total model size within 9.5 MB, while achieving 78.7% accuracy on the Amini Cocoa Contamination Challenge dataset and enabling end-to-end CPU inference under 150 ms per image. We benchmark against seven classifiers including Random Forest, SVM, Ridge, Lasso, Elastic Net, XGBoost, and Jacobi-GP, and demonstrate that the Jacobi-DMR offers the best trade-off between accuracy, model size, and inference speed for edge deployment. We also prove asymptotic equivalence, consistency, and asymptotic normality for Jacobi-DMR, together with its bias correction. All data and code are available at https://github.com/shouvik-sardar/TinyBayes
Chinese Translation
可可(Theobroma cacao)是西非数百万小农户的重要经济作物,其中可可肿胀病病毒病(CSSVD)和炭疽病导致了严重的产量损失。自动化的叶片图像疾病检测对于早期干预至关重要,但在资源有限的环境中部署此类系统需要小型、快速且不依赖互联网连接的模型。现有的可边缘部署植物疾病系统依赖于端到端深度学习而没有不确定性量化,而针对边缘设备的贝叶斯方法则侧重于硬件级推断架构,而非农业应用。我们通过TinyBayes填补了这一空白,这是第一个将闭式贝叶斯分类器与移动级计算机视觉管道结合用于作物疾病检测的框架。我们的管道使用YOLOv8-Nano(5.9 MB)进行病变定位,使用MobileNetV3-Small(3.5 MB)进行特征提取,并采用雅可比先验;这是一种通过投影提供闭式非迭代估计器的贝叶斯方法,用于分类。雅可比-DMR(分布式多项式回归)分类器仅为管道增加了13.5 KB,使总模型大小控制在9.5 MB以内,同时在Amini可可污染挑战数据集上实现了78.7%的准确率,并使得每张图像的端到端CPU推断时间低于150毫秒。我们与包括随机森林、支持向量机、岭回归、套索回归、弹性网、XGBoost和雅可比-GP在内的七个分类器进行了基准测试,证明雅可比-DMR在准确性、模型大小和推断速度之间提供了最佳的权衡。我们已证明雅可比-DMR的渐近等价性和一致性、渐近正态性以及偏差校正。所有数据和代码可在此获取:https://github.com/shouvik-sardar/TinyBayes
cs.CV / 101 / 2605.06337

Earth-o1: A Grid-free Observation-native Atmospheric World Model

Earth-o1:一种无网格观测原生的大气世界模型
Gong, Junchao, Xu, Kaiyi, Wei, Wangxu, Tu, Siwei, Xu, Jingyi, Liu, Zili, Fan, Hang, Zhou, Zhiwang, Han, Tao, Xiao, Yi, Gu, Xinyu, Li, Zhangrui, Zhang, Wenlong, Chen, Hao, Yang, Xiaokang, Wang, Yaqiang, Cheng, Lijing, Gentine, Pierre, Ouyang, Wanli, Zhang, Feng, Tan, Zhe-Min, Zhou, Bowen, Ling, Fenghua, Fei, Ben, Bai, Lei
Abstract
Despite the unprecedented volume of multimodal data provided by modern Earth observation systems, our ability to model atmospheric dynamics remains constrained. Traditional modeling frameworks force heterogeneous measurements into predefined spatial grids, inherently limiting the full exploitation of raw sensor data and creating severe computational bottlenecks. Here we present Earth-o1, an observation-native atmospheric world model that overcomes these structural limitations. Rather than relying on conventional atmospheric dynamical modeling systems or traditional data assimilation, Earth-o1 directly learns the continuous, three-dimensional physical evolution of the Earth system from ungridded observational data. By integrating diverse sensor inputs into a unified, grid-free dynamical field, the model autonomously advances the atmospheric state in space and time. We show that this fundamentally distinct paradigm enables direct, real-time forecasting and cross-sensor inference without the overhead of explicit numerical solvers. In hindcast evaluations, Earth-o1 achieves surface forecast skill comparable to the operational Integrated Forecasting System (IFS). These results establish that continuous, observation-driven world models -- a new class of fully observation-native geophysical simulators -- can match the fidelity of established physical frameworks, providing a scalable data-driven foundation for a digital twin of the Earth.
Chinese Translation
尽管现代地球观测系统提供了前所未有的多模态数据量,但我们对大气动力学的建模能力仍然受到限制。传统建模框架将异构测量强制映射到预定义的空间网格中,这在本质上限制了对原始传感器数据的充分利用,并造成了严重的计算瓶颈。在此,我们提出了Earth-o1,这是一种克服这些结构性限制的观测原生大气世界模型。Earth-o1并不依赖于传统的大气动力建模系统或传统的数据同化方法,而是直接从无网格的观测数据中学习地球系统的连续三维物理演变。通过将多样的传感器输入整合为一个统一的无网格动态场,该模型自主地在空间和时间上推进大气状态。我们展示了这一根本不同的范式使得无需显式数值求解器即可实现直接的实时预报和跨传感器推断。在回溯评估中,Earth-o1的地表预报技能与现行的综合预报系统(Integrated Forecasting System, IFS)相当。这些结果表明,连续的、以观测为驱动的世界模型——一种全新的完全观测原生的地球物理模拟器类别——能够与已建立的物理框架相匹配,提供一个可扩展的数据驱动基础,用于构建地球的数字双胞胎。
cs.CV / 102 / 2605.06356

SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

SwiftI2V:通过条件分段生成实现高效高分辨率图像到视频生成
Liu, YaoYang, Zhang, Yuechen, Li, Wenbo, Zhao, Yufei, Liu, Rui, Chen, Long
Abstract
High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency--fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
Chinese Translation
高分辨率图像到视频(I2V)生成旨在合成逼真的时间动态,同时保留输入图像的细致外观细节。在2K分辨率下,这变得极具挑战性,现有解决方案存在各种缺陷:1)端到端模型在内存和延迟方面通常代价高昂;2)将低分辨率生成与通用视频超分辨率级联,往往会产生虚幻细节并偏离输入特定的局部结构,因为超分辨率阶段并未明确基于输入图像进行条件化。为此,我们提出了SwiftI2V,这是一个专为高分辨率I2V量身定制的高效框架。遵循广泛使用的两阶段设计,它通过首先生成低分辨率运动参考来降低标记成本并减轻建模负担,从而解决效率与保真度的矛盾,随后进行强图像条件的2K合成,以运动为指导恢复输入忠实细节,同时控制开销。具体而言,为了使生成更具可扩展性,SwiftI2V引入了条件分段生成(Conditional Segment-wise Generation, CSG),以分段方式合成视频,并设定每步的标记预算,同时在每个段内采用双向上下文交互,以提高跨段一致性和输入保真度。在VBench-I2V的2K分辨率下,SwiftI2V的性能与端到端基线相当,同时将总GPU时间减少了202倍。特别是,它使得在单个数据中心GPU(例如H800)或消费级GPU(例如RTX 4090)上实现实用的2K I2V生成成为可能。
cs.CV / 103 / 2605.06368

eXplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts

解释学习 (eX2L):利用对比视觉解释对进行分布偏移的正则化
Medina, Paulo Mario P., Miñoza, Jose Marie Antonio, Ibañez, Sebastian C.
Abstract
Despite extensive research into mitigating distribution shifts, many existing algorithms yield inconsistent performance, often failing to outperform baseline Empirical Risk Minimization (ERM) across diverse scenarios. Furthermore, high algorithmic complexity frequently limits interpretability and offers only an indirect means of addressing spurious correlations. We propose eXplaining to Learn (eX2L): an interpretable, explanation-based framework that decorrelates confounding features from a classifier's latent representations during training. eX2L achieves this by penalizing the similarity between Grad-CAM activation maps generated by a primary label classifier and those from a concurrently trained confounder classifier. On the rigorous Spawrious Many-to-Many Hard Challenge benchmark, eX2L achieves an average accuracy (AA) of 82.24% +/- 3.87% and a worst-group accuracy (WGA) of 66.31% +/- 8.73%, outperforming the current state-of-the-art (SOTA) by 5.49% and 10.90%, respectively. Beyond its competitive performance, eX2L demonstrates that functional domain invariance can be achieved by explicitly decoupling label and nuisance attributes at the group level.
Chinese Translation
尽管在减轻分布偏移方面进行了广泛研究,许多现有算法的性能却不一致,常常未能在多种场景中超越基线经验风险最小化 (Empirical Risk Minimization, ERM)。此外,高算法复杂性通常限制了可解释性,并仅提供间接手段来解决虚假相关性。我们提出了解释学习 (eXplaining to Learn, eX2L):一个可解释的基于解释的框架,在训练过程中将混淆特征与分类器的潜在表示解耦。eX2L通过惩罚主要标签分类器生成的 Grad-CAM 激活图与同时训练的混淆分类器生成的激活图之间的相似性来实现这一目标。在严格的 Spawrious Many-to-Many Hard Challenge 基准测试中,eX2L 达到了 82.24% +/- 3.87% 的平均准确率 (Average Accuracy, AA) 和 66.31% +/- 8.73% 的最差组准确率 (Worst-Group Accuracy, WGA),分别超越当前的最先进技术 (State-of-the-Art, SOTA) 5.49% 和 10.90%。除了其竞争力的性能外,eX2L 还表明,通过在组级别上明确解耦标签和干扰属性,可以实现功能域不变性。
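The regularizer described above penalizes similarity between two Grad-CAM maps. A minimal sketch follows, assuming cosine similarity as the measure and a weighted sum with the classification loss; the abstract specifies neither, so `cam_similarity_penalty`, `penalty_weight`, and the toy maps are hypothetical.

```python
import math

def cam_similarity_penalty(label_cam, confounder_cam, eps=1e-8):
    """Cosine similarity between two flattened activation maps (assumed measure)."""
    a = [v for row in label_cam for v in row]
    b = [v for row in confounder_cam for v in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + eps)

def regularized_loss(classification_loss, label_cam, confounder_cam, penalty_weight=1.0):
    """Classification loss plus a weighted map-overlap penalty (assumed form)."""
    return classification_loss + penalty_weight * cam_similarity_penalty(label_cam, confounder_cam)

# Toy 2x2 maps: the label classifier looks at the object, top-left.
object_cam = [[1.0, 0.0], [0.0, 0.0]]
background_cam = [[0.0, 0.0], [0.0, 1.0]]

overlapping = cam_similarity_penalty(object_cam, object_cam)
disjoint = cam_similarity_penalty(object_cam, background_cam)
print(round(overlapping, 6))  # 1.0
print(disjoint)               # 0.0
```

The penalty is largest when both classifiers attend to the same pixels and vanishes when their evidence is spatially disjoint, which is the decorrelation behaviour the abstract describes.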
cs.CV / 104 / 2605.06376

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

少步扩散蒸馏的连续时间分布匹配
Liu, Tao, Yan, Hao, Chen, Mengting, Hu, Taihang, Yue, Zhengrong, Pan, Zihao, Lan, Jinsong, Zhu, Xiaoyong, Cheng, Ming-Ming, Zheng, Bo, Wang, Yaxing
Abstract
Step distillation has become a leading technique for accelerating diffusion models, among which Distribution Matching Distillation (DMD) and Consistency Distillation are two representative paradigms. While consistency methods enforce self-consistency along the full PF-ODE trajectory to steer it toward the clean data manifold, vanilla DMD relies on sparse supervision at a few predefined discrete timesteps. This restricted discrete-time formulation, together with the mode-seeking nature of the reverse KL divergence, tends to produce visual artifacts and over-smoothed outputs, often necessitating complex auxiliary modules -- such as GANs or reward models -- to restore visual fidelity. In this work, we introduce Continuous-Time Distribution Matching (CDM), migrating the DMD framework from discrete anchoring to continuous optimization for the first time. CDM achieves this through two continuous-time designs. First, we replace the fixed discrete schedule with a dynamic continuous schedule of random length, so that distribution matching is enforced at arbitrary points along sampling trajectories rather than only at a few fixed anchors. Second, we propose a continuous-time alignment objective that performs active off-trajectory matching on latents extrapolated via the student's velocity field, improving generalization and preserving fine visual details. Extensive experiments on different architectures, including SD3-Medium and Longcat-Image, demonstrate that CDM provides highly competitive visual fidelity for few-step image generation without relying on complex auxiliary objectives. Code is available at https://github.com/byliutao/cdm.
Chinese Translation
步骤蒸馏已成为加速扩散模型的主要技术,其中分布匹配蒸馏(Distribution Matching Distillation, DMD)和一致性蒸馏(Consistency Distillation)是两个代表性范式。尽管一致性方法在完整的 PF-ODE 轨迹上强制自一致性,以引导其朝向干净数据流形,但普通的 DMD 依赖于在少数预定义离散时间步上的稀疏监督。这种受限的离散时间表述和反向 KL 散度的模式寻求特性往往会出现视觉伪影和过度平滑的输出,通常需要复杂的辅助模块(如生成对抗网络(GANs)或奖励模型)来恢复视觉保真度。在本研究中,我们引入了连续时间分布匹配(Continuous-Time Distribution Matching, CDM),首次将 DMD 框架从离散锚点迁移到连续优化。CDM 通过两种连续时间设计实现这一目标。首先,我们用动态连续时间表替换固定的离散时间表,使得在采样轨迹的任意点上强制进行分布匹配,而不仅仅是在少数固定锚点上。其次,我们提出了一种连续时间对齐目标,针对通过学生的速度场外推的潜在变量进行主动的轨迹外匹配,从而提高了泛化能力并保留了细致的视觉细节。在不同架构上的广泛实验,包括 SD3-Medium 和 Longcat-Image,证明了 CDM 在少步图像生成中提供了高度竞争的视觉保真度,而无需依赖复杂的辅助目标。代码可在 https://github.com/byliutao/cdm 获取。
cs.CV / 105 / 2605.06380

Empirical Evidence for Simply Connected Decision Regions in Image Classifiers

图像分类器中简单连通决策区域的实证证据
Swaminathan, Arjhun, Akgün, Mete
Abstract
Understanding the topology of decision regions is central to explaining the inner workings of deep neural networks. Prior empirical work has provided evidence that these regions are path connected. We study a stronger topological question: whether closed loops inside a decision region can be contracted without leaving that region. To this end, we propose an iterative quad-mesh filling procedure that constructs a finite-resolution label-preserving surface bounded by a given loop and lying entirely within the same decision region. We further connect this construction to natural Coons patches in order to quantify its deviation from a canonical geometric interpolation of the loop. By evaluating our method across several modern image-classification models, we provide empirical evidence supporting the hypothesis that decision regions in deep neural networks are not only path connected, but also simply connected.
Chinese Translation
理解决策区域的拓扑结构对于解释深度神经网络的内部工作机制至关重要。先前的实证研究提供了这些区域是路径连通的证据。我们研究了一个更强的拓扑问题:在决策区域内的闭合环是否可以在不离开该区域的情况下收缩。为此,我们提出了一种迭代四边形网格填充程序,该程序构建了一个由给定环界定的有限分辨率标签保持表面,并完全位于同一决策区域内。我们进一步将这一构造与自然的 Coons 曲面片(Coons patches)联系起来,以量化其与环的典型几何插值的偏差。通过在多个现代图像分类模型上评估我们的方法,我们提供了实证证据,支持深度神经网络中的决策区域不仅是路径连通的,而且是单连通的假设。
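The Coons patch mentioned in the abstract above can be illustrated with a minimal sketch of the standard bilinearly blended form. It assumes the loop has already been split into four boundary curves parameterised on [0, 1]; the function name and the example curves are hypothetical and are not taken from the paper.

```python
import numpy as np

def coons_patch(bottom, top, left, right, u, v):
    """Bilinearly blended Coons patch evaluated at (u, v) in [0, 1]^2.

    bottom(u), top(u), left(v), right(v) return points as NumPy arrays
    and are assumed to agree at the four corners of the loop.
    """
    # Ruled surface between the bottom (v=0) and top (v=1) curves.
    ruled_u = (1 - v) * bottom(u) + v * top(u)
    # Ruled surface between the left (u=0) and right (u=1) curves.
    ruled_v = (1 - u) * left(v) + u * right(v)
    # Bilinear interpolation of the corners, which the two ruled
    # surfaces both contain and which is therefore subtracted once.
    bilinear = ((1 - u) * (1 - v) * bottom(0.0) + u * (1 - v) * bottom(1.0)
                + (1 - u) * v * top(0.0) + u * v * top(1.0))
    return ruled_u + ruled_v - bilinear

# Hypothetical closed loop in R^3 made of four boundary curves.
bottom = lambda u: np.array([u, 0.0, np.sin(np.pi * u)])
top = lambda u: np.array([u, 1.0, 0.0])
left = lambda v: np.array([0.0, v, 0.0])
right = lambda v: np.array([1.0, v, 0.0])

center = coons_patch(bottom, top, left, right, 0.5, 0.5)
```

On the boundary the patch reproduces the loop itself, so any deviation measured against it concerns only the interior of the filled surface.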
cs.CV / 106 / 2605.06388

Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models

重建还是语义?什么使潜在空间对机器人世界模型有用
Nilaksh, Jha, Saurav, Zholus, Artem, Chandar, Sarath
Abstract
World model-based policy evaluation is a practical proxy for testing real-world robot control by rolling out candidate actions in action-conditioned video diffusion models. As these models increasingly adopt latent diffusion modeling (LDM), choosing the right latent space becomes critical. While the status quo uses autoencoding latent spaces like VAEs that are primarily trained for pixel reconstruction, recent work suggests benefits from pretrained encoders with representation-aligned semantic latent spaces. We systematically evaluate these latent spaces for action-conditioned LDM by comparing six reconstruction and semantic encoders to train world model variants under a fixed protocol on the BridgeV2 dataset, and show effective world model training in high-dimensional representation spaces with and without dimension compression. We then propose three axes to assess robotic world model performance: visual fidelity, planning and downstream policy performance, and latent representation quality. Our results show visual fidelity alone is insufficient for world model selection. While reconstruction encoders like VAE and Cosmos achieve strong pixel-level scores, semantic encoders such as V-JEPA 2.1 (strongest overall on policy), Web-DINO, and SigLIP 2 generally excel across the other two axes at all model scales. Our study advocates semantic latent spaces as a stronger foundation for policy-relevant robotics diffusion world models.
Chinese Translation
基于世界模型的策略评估是通过在动作条件的视频扩散模型中展开候选动作来测试现实世界机器人控制的实用代理。随着这些模型越来越多地采用潜在扩散建模(Latent Diffusion Modeling, LDM),选择合适的潜在空间变得至关重要。当前的做法使用像变分自编码器(Variational Autoencoders, VAE)这样的自编码潜在空间,这些空间主要用于像素重建,而最近的研究表明,使用与表示对齐的语义潜在空间的预训练编码器可以带来好处。我们通过比较六种重建和语义编码器,系统地评估这些潜在空间在动作条件LDM中的表现,训练世界模型变体,并在BridgeV2数据集上采用固定协议,展示了在高维表示空间中进行有效的世界模型训练,无论是否进行维度压缩。然后,我们提出三个维度来评估机器人世界模型的性能:视觉保真度、规划和下游策略性能,以及潜在表示质量。我们的结果表明,仅凭视觉保真度不足以选择世界模型。虽然像VAE和Cosmos这样的重建编码器在像素级得分上表现强劲,但语义编码器如V-JEPA 2.1(在策略上整体表现最强)、Web-DINO和SigLIP 2在所有模型规模的其他两个维度上通常表现优异。我们的研究主张将语义潜在空间作为与策略相关的机器人扩散世界模型的更强基础。
cs.CV / 107 / 2605.06421

FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation

FREPix:用于像素空间图像生成的频率异质流匹配
Lin, Mingfeng, Chen, Jiakun, Han, Liang, Nie, Liqiang
Abstract
Pixel-space diffusion has re-emerged as a promising alternative to latent-space generation because it avoids the representation bottleneck introduced by VAEs. Yet most existing methods still treat image generation as a frequency-homogeneous process, overlooking the distinct roles and learning dynamics of low- and high-frequency components. To address this, we propose FREPix, a FREquency-heterogeneous flow matching framework for Pixel-space image generation. FREPix explicitly decomposes generation into low- and high-frequency components, assigns them separate transport paths, predicts them with a factorized network, and trains them with a frequency-aware objective. In this way, coarse-to-fine generation becomes an explicit design principle rather than an implicit behavior. On ImageNet class-to-image generation, FREPix achieves competitive results among pixel-space generation models, reaching 1.91 FID at $256\times256$ and 2.38 FID at $512\times512$, with particularly strong behavior in the low-NFE regime.
Chinese Translation
像素空间扩散作为潜在空间生成的有希望的替代方案重新出现,因为它避免了变分自编码器(VAEs)引入的表示瓶颈。然而,大多数现有方法仍将图像生成视为频率同质过程,忽视了低频和高频成分的不同角色和学习动态。为了解决这个问题,我们提出了FREPix,一个用于像素空间图像生成的频率异质流匹配框架。FREPix明确将生成过程分解为低频和高频成分,为它们分配独立的传输路径,通过分解网络进行预测,并使用频率感知目标进行训练。通过这种方式,从粗到细的生成成为一种明确的设计原则,而不是一种隐式行为。在ImageNet的类到图像生成任务中,FREPix在像素空间生成模型中取得了具有竞争力的结果,在$256\times256$时达到1.91 FID,在$512\times512$时达到2.38 FID,特别是在低NFE(函数评估次数)范围内表现强劲。
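The abstract above does not say how FREPix separates low- and high-frequency components. As a generic illustration only, the sketch below splits a 2-D array with an ideal FFT low-pass mask; the function name and the `cutoff` parameter are assumptions made for this example and do not describe the paper's decomposition.

```python
import numpy as np

def split_frequencies(image, cutoff=0.25):
    """Split a 2-D array into low- and high-frequency parts.

    cutoff is a fraction of the Nyquist frequency (0.5 cycles per pixel);
    frequencies at or below it go to the low band, the rest to the high band.
    """
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies, cycles per pixel
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies
    mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff * 0.5
    # The mask is radially symmetric, so the filtered result stays real
    # up to rounding error.
    low = np.fft.ifft2(np.fft.fft2(image) * mask).real
    high = image - low                # the residual carries the fine detail
    return low, high
```

By construction the two bands sum back to the input, so a model could in principle treat them separately and recombine them without loss.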
cs.CV / 108 / 2605.06477

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

GeoStack:一种用于 VLM 中准阿贝尔知识组合的框架
Mantini, Pranav, Shah, Shishir K.
Abstract
We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity ($O(1)$), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.
Chinese Translation
我们解决了视觉语言模型(VLM)中知识组合的挑战,在多个领域或任务中累积专业知识通常会导致灾难性遗忘。我们提出了 GeoStack(几何堆叠),这是一个模块化框架,允许独立训练的领域专家组合成一个统一模型。通过对适配器流形施加几何和结构约束,GeoStack 确保基础模型的基础知识得以保留。此外,我们在数学上证明了一种权重折叠特性,实现了常数时间推理复杂度($O(1)$),无论集成专家的数量如何。跨多领域适应和类增量学习的实验结果表明,GeoStack 提供了一种高效的长期知识组合机制,同时显著减轻了灾难性遗忘。代码可在 https://github.com/QuantitativeImagingLaboratory/GeoStack 获取。
cs.CV / 109 / 2605.06487

3D MRI Image Pretraining via Controllable 2D Slice Navigation Task

通过可控的二维切片导航任务进行三维MRI图像预训练
Wang, Yu, Chen, Qingchao
Abstract
Self-supervised pretraining has become the mainstream approach for learning MRI representations from unlabeled scans. However, most existing objectives still treat each scan primarily as static aggregations of slices, patches or volumes. We ask whether there exists an intrinsic form of self-supervision signal that is different from reconstructing the masked patches, through transforming the 3D volumes into controllable 2D rendered sequences: by rendering slices at continuous positions, orientations, and scales, a 3D volume can be converted into dense video-action sequences whose controls are the action trajectories. We study this formulation with an action-conditioned pretraining objective, where a tokenizer encodes slice observations and a latent dynamics model predicts the evolution of latent features. Across representative anatomical and spatial downstream tasks, the proposed pretraining is evaluated against standard static-volume baselines, tokenizer-only pretraining, and dynamics variants without aligned actions. These results suggest that controllable MRI slice navigation provides a useful complementary pretraining interface for learning anatomical and spatial representations from large unlabeled MRI collections.
Chinese Translation
自监督预训练已成为从未标记扫描中学习MRI表示的主流方法。然而,大多数现有目标仍然将每个扫描主要视为切片、块或体积的静态聚合。我们探讨是否存在一种内在的自监督信号,这种信号不同于重建被遮挡的块,通过将三维体积转换为可控的二维渲染序列:通过在连续的位置、方向和尺度上渲染切片,三维体积可以转换为密集的视频动作序列,其控制为动作轨迹。我们研究了这种形式的动作条件预训练目标,其中一个标记器对切片观察进行编码,而潜在动态模型预测潜在特征的演变。在代表性的解剖和空间下游任务中,所提出的预训练与标准静态体积基线、仅标记器预训练和没有对齐动作的动态变体进行了评估。这些结果表明,可控的MRI切片导航为从大型未标记MRI集合中学习解剖和空间表示提供了一个有用的补充预训练接口。
cs.CV / 110 / 2605.06507

MARBLE: Multi-Aspect Reward Balance for Diffusion RL

MARBLE:用于扩散强化学习的多方面奖励平衡
Zhao, Canyu, Chen, Hao, Tong, Yunze, Qiao, Yu, Li, Jiacheng, Shen, Chunhua
Abstract
Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practices deal with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitate heavy, manually tuned sequential training. We find that the failure stems from using a naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch because most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others; consequently, weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction without manually tuned reward weighting, by solving a Quadratic Programming problem. We further propose an amortized formulation that exploits the affine structure of the loss used in DiffusionNFT to reduce the per-step cost from $K+1$ backward passes to near single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine from negative under weighted summation in 80% of mini-batches to consistently positive, and runs at 0.97x the training speed of baseline training.
Chinese Translation
强化学习微调已成为将扩散模型与人类偏好对齐的主要方法。然而,评估图像本质上是一个多维任务,需要同时优化多个评估标准。现有的做法通过为每个奖励训练一个专门模型、优化加权和奖励 $R(x)=\sum_k w_k R_k(x)$,或通过手工设计的阶段调度进行顺序微调来处理多个奖励。这些方法要么无法产生一个可以在所有奖励上联合训练的统一模型,要么需要大量手动调整的顺序训练。我们发现失败的原因在于使用了简单的加权和奖励聚合。这种方法在样本级别上存在不匹配,因为大多数采样轨迹(rollout)是专门样本,对于某些奖励维度高度信息丰富,但对其他维度则无关;因此,加权求和稀释了它们的监督。为了解决这个问题,我们提出了MARBLE(Multi-Aspect Reward BaLancE),一个梯度空间优化框架,它为每个奖励保持独立的优势估计器,计算每个奖励的策略梯度,并通过解决一个二次规划问题将它们协调成一个单一的更新方向,而无需手动调整奖励权重。我们进一步提出了一种摊销形式,利用DiffusionNFT中使用的损失的仿射结构,将每步成本从 $K+1$ 次反向传播减少到接近单奖励基线成本,并对平衡系数进行EMA平滑,以稳定更新,抵御瞬态单批次波动。在SD3.5 Medium上,使用五个奖励,MARBLE同时改善了所有五个奖励维度,将最差对齐奖励的梯度余弦从在80%的小批次中加权求和下的负值转变为持续的正值,并以基线训练的0.97倍速度运行。
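The abstract above states that per-reward gradients are combined by solving a quadratic program but does not give its form. One well-known choice from multi-objective optimization is the min-norm QP over the probability simplex; the sketch below solves it with Frank-Wolfe as an illustration. It is an assumption that MARBLE's QP resembles this one, the function names are hypothetical, and the amortized formulation and EMA smoothing are not modelled.

```python
import numpy as np

def min_norm_weights(grads, iters=200):
    """Frank-Wolfe solver for min over the simplex of ||sum_k a_k g_k||^2."""
    G = np.asarray(grads, dtype=float)   # shape (K, D): one gradient per reward
    gram = G @ G.T                       # pairwise inner products, shape (K, K)
    K = G.shape[0]
    a = np.full(K, 1.0 / K)              # start from uniform weights
    for _ in range(iters):
        # The objective's gradient w.r.t. a is 2 * gram @ a; move toward
        # the simplex vertex with the smallest entry.
        k = int(np.argmin(gram @ a))
        d = -a.copy()
        d[k] += 1.0                      # direction e_k - a
        denom = d @ gram @ d
        if denom <= 1e-12:
            break
        # Exact line search for a quadratic, clipped to stay in the simplex.
        t = np.clip(-(a @ gram @ d) / denom, 0.0, 1.0)
        a = a + t * d
    return a

def combined_direction(grads):
    """Return the balancing weights and the resulting update direction."""
    G = np.asarray(grads, dtype=float)
    a = min_norm_weights(G)
    return a, a @ G
```

A property of the min-norm solution is that the combined direction has a non-negative inner product with every individual gradient, which is the kind of behaviour the abstract reports for the worst-aligned reward.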
cs.CV / 111 / 2605.06509

FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

FreeSpec:基于奇异谱重构的免训练长视频生成
Chen, Fangda, Zhao, Shanshan, Yang, Longrong, Xu, Chuanfu, Luo, Zhigang, Lan, Long
Abstract
Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, such as in camera motion and sequential motion. We analyze the video temporal extension issue from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To mitigate this problem, we propose FreeSpec, a training-free spectral reconstruction framework for long-video generation. FreeSpec decomposes global and local features with singular value decomposition, and uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid feature partitioning of previous decomposition rules, preserving long-range consistency while better retaining spatial details and temporal dynamics. Experiments on Wan2.1 and LTX-Video demonstrate that FreeSpec improves long-video generation, especially for temporal dynamics, while maintaining strong visual quality and temporal consistency. Project demo: https://fdchen24.github.io/FreeSpec-Website/.
Chinese Translation
视频扩散模型在短视频合成中表现良好,但其无训练扩展到长视频时常常遭遇内容漂移、时间不一致和过度平滑的动态等问题。现有方法通过将全局分支与局部分支结合来改善时间一致性,但它们往往在每个分支内使用预定义标准进一步分解外观一致性和时间动态。当外观与动作进展紧密耦合时,例如在相机运动和顺序运动中,这种分配是不可靠的。我们从奇异谱的角度分析视频时间扩展问题,并表明扩大自注意力窗口会导致谱集中:谱能量被少数低秩奇异方向主导,保留粗略结构但抑制高秩空间细节和丰富的时间变化。为了解决这个问题,我们提出了FreeSpec,一个用于长视频生成的无训练谱重构框架。FreeSpec通过奇异值分解分解全局和局部特征,并将全局分支作为低秩谱引导,局部分支作为高秩重构基础。这种谱级融合避免了以往分解规则的刚性特征划分,保留了长程一致性,同时更好地保持空间细节和时间动态。在Wan2.1和LTX-Video上的实验表明,FreeSpec改善了长视频生成,尤其是在时间动态方面,同时保持了强大的视觉质量和时间一致性。项目演示: https://fdchen24.github.io/FreeSpec-Website/.
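The spectrum-level fusion described above can be sketched under strong simplifying assumptions: features are treated as plain 2-D matrices, the global branch contributes its top `rank` singular components, and the local branch contributes the rest. The abstract does not specify the actual fusion rule, so both helper names and the rank split are hypothetical.

```python
import numpy as np

def top_k_energy(feat, k):
    """Fraction of squared singular-value energy held by the top k directions."""
    s = np.linalg.svd(feat, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

def spectral_fuse(global_feat, local_feat, rank):
    """Combine low-rank structure from one matrix with high-rank detail from another."""
    Ug, sg, Vgt = np.linalg.svd(global_feat, full_matrices=False)
    Ul, sl, Vlt = np.linalg.svd(local_feat, full_matrices=False)
    # Leading components of the global branch: coarse, long-range structure.
    low = (Ug[:, :rank] * sg[:rank]) @ Vgt[:rank]
    # Remaining components of the local branch: fine detail and motion.
    high = (Ul[:, rank:] * sl[rank:]) @ Vlt[rank:]
    return low + high
```

`top_k_energy` gives one way to quantify the spectral concentration the abstract describes: a value near 1 for small `k` means a few singular directions dominate.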
cs.CV / 112 / 2605.06512

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

DCR:稀有组合生成的反事实吸引子引导
Kang, Taewon, Zwicker, Matthias
Abstract
Diffusion models generate realistic visual content, yet often fail to produce rare but plausible compositions. When prompted with combinations that are valid but underrepresented in training data, such as a snowy beach or a rainbow at night, the generation process frequently collapses toward more common alternatives. We identify this failure mode as default completion bias, where denoising trajectories are implicitly attracted toward high-frequency semantic configurations. Existing guidance mechanisms do not explicitly model this competing tendency and therefore struggle to prevent such collapse. We introduce Default Completion Repulsion (DCR), a training-free framework that explicitly models and suppresses default completion behavior. DCR constructs a counterfactual attractor by relaxing the rare compositional factor while preserving surrounding semantics, inducing an alternative denoising trajectory reflecting the model's preferred completion. We define the discrepancy between target and attractor trajectories as a counterfactual drift, and propose a projection-based repulsion mechanism that removes guidance components aligned with this drift direction. This suppresses undesired frequent completions while preserving other semantic components. DCR operates entirely within the standard diffusion sampling process without retraining or architectural modification. Experiments on rare compositional prompts show that DCR improves compositional fidelity while maintaining visual quality. Our analysis further shows that the framework exposes and counteracts intrinsic model biases, offering a new perspective on controllable generation beyond explicit constraint enforcement.
Chinese Translation
扩散模型能够生成逼真的视觉内容,但往往无法产生稀有但合理的组合。当被提示以有效但在训练数据中代表性不足的组合时,例如雪地海滩或夜晚的彩虹,生成过程常常会向更常见的替代品崩溃。我们将这种失败模式称为默认完成偏差,其中去噪轨迹隐式地被吸引到高频语义配置。现有的引导机制并未明确建模这种竞争倾向,因此难以防止这种崩溃。我们提出了默认完成排斥(Default Completion Repulsion, DCR),这是一个无训练的框架,明确建模并抑制默认完成行为。DCR通过放松稀有组合因素,同时保留周围语义,构建了一个反事实吸引子,从而诱导出反映模型偏好完成的替代去噪轨迹。我们将目标轨迹与吸引子轨迹之间的差异定义为反事实漂移,并提出了一种基于投影的排斥机制,去除与该漂移方向一致的引导成分。这抑制了不希望出现的频繁完成,同时保留了其他语义成分。DCR完全在标准扩散采样过程中运行,无需重新训练或架构修改。在稀有组合提示上的实验表明,DCR提高了组合的保真度,同时保持了视觉质量。我们的分析进一步表明,该框架揭示并对抗了内在模型偏见,为超越显式约束执行的可控生成提供了新的视角。
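The projection-based repulsion described above can be illustrated with plain vector projection: the part of a guidance vector that lies along the drift direction is subtracted. This is a minimal sketch assuming both quantities are arrays of the same shape; the function name is hypothetical, and whether DCR removes the component in both directions or only when it is positively aligned is not stated in the abstract.

```python
import numpy as np

def remove_drift_component(guidance, drift, eps=1e-12):
    """Project out of `guidance` the component aligned with `drift`."""
    g = guidance.ravel()
    d = drift.ravel()
    # Scalar projection coefficient; eps guards against a zero drift vector.
    coeff = (g @ d) / (d @ d + eps)
    return guidance - coeff * drift
```

The result is orthogonal to the drift direction, so components of the guidance unrelated to the drift pass through unchanged.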
cs.CV / 113 / 2605.06535

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

Sparkle:通过解耦指导实现生动的指令引导视频背景替换
Zeng, Ziyun, Lin, Yiqi, Liang, Guoqiang, Shou, Mike Zheng
Abstract
In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in poorly performing state-of-the-art models, e.g., Kiwi-Edit, because the primary open-source dataset that contains this task, i.e., OpenVE-3M, frequently produces static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner with strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that our dataset and the model trained on it achieve substantially better performance than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.
Chinese Translation
近年来,像Senorita-2M这样的开源努力推动了视频编辑朝着自然语言指令的发展。然而,目前公开可用的数据集主要集中在局部编辑或风格迁移,这在很大程度上保留了原始场景结构,并且更易于扩展。相比之下,背景替换(Background Replacement)这一任务在电影制作和广告等创意应用中至关重要,需要合成全新的、时间一致的场景,同时保持准确的前景与背景交互,这使得大规模数据生成变得极具挑战性。因此,由于高质量训练数据的稀缺,这一复杂任务仍然在很大程度上未被深入探索。这一差距在表现不佳的最先进模型中显而易见,例如Kiwi-Edit,因为包含此任务的主要开源数据集OpenVE-3M经常产生静态、不自然的背景。在本文中,我们将这种质量下降归因于数据合成过程中缺乏精确的背景指导。因此,我们设计了一个可扩展的流程,以解耦的方式生成前景和背景指导,并进行严格的质量过滤。在此流程的基础上,我们引入了Sparkle,一个包含约14万对视频的数据库,涵盖五种常见的背景更换主题,以及Sparkle-Bench,这是迄今为止针对背景替换的最大评估基准。实验表明,我们的数据库及其上训练的模型在OpenVE-Bench和Sparkle-Bench上的表现显著优于所有现有基准。我们提出的数据库、基准和模型均已在https://showlab.github.io/Sparkle/上完全开源。
cs.CV / 114 / 2605.06537

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

MedHorizon:面向真实环境中的长上下文医学视频理解
Du, Bodong, Liu, Bowen, Yu, Yang, Ding, Xinpeng, Wu, Zhiheng, Wang, Shuning, Nie, Shuo, Liu, Naiming, Chen, Qifeng, Song, Yangqiu, Li, Xiaomeng
Abstract
Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested. We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding. MedHorizon preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. Its evidence is extremely sparse, with only 0.166% evidence frames on average, requiring models to search noisy procedural streams before interpreting and aggregating findings. We evaluate representative general-domain, medical-domain, and long-video MLLMs. The best model reaches only 41.1% accuracy, showing that current systems remain far from robust full-procedure understanding. Further analysis yields four key findings: performance does not scale reliably with more frames; evidence retrieval and clinical interpretation remain primary bottlenecks; these bottlenecks are rooted in weak procedural reasoning and attention drift under redundancy; and generic sampling methods only partially balance local detail with global coverage. MedHorizon provides a rigorous testbed for MLLMs that retrieve sparse evidence and reason over complete clinical workflows.
Chinese Translation
医学多模态大语言模型(MLLMs)在图像理解和短视频分析方面取得了进展,但真实的临床审查通常需要对完整过程视频的理解。与一般的长视频不同,医学程序包含高度冗余的解剖视图,而决定性证据在时间上稀疏、空间上微妙且依赖于上下文。现有基准通常假设这些证据已经通过图像、短片段或预分割视频进行了定位,从而使得检索-推理问题未得到充分测试。我们提出了MedHorizon,这是一个用于长上下文医学视频理解的真实环境基准。MedHorizon保留了759小时的完整临床程序,并提供了1,253个基于证据的多项选择题,联合评估稀疏证据理解和多跳临床推理。其证据极为稀疏,平均仅有0.166%的证据帧,要求模型在解释和汇总发现之前搜索嘈杂的程序流。我们评估了代表性的通用领域、医学领域和长视频的MLLMs。最佳模型的准确率仅为41.1%,表明当前系统在完整过程理解方面仍然远未稳健。进一步分析得出四个关键发现:性能并不随着帧数的增加而可靠扩展,证据检索和临床解释仍然是主要瓶颈;这些瓶颈根植于弱程序推理和冗余下的注意力漂移,而通用采样方法仅部分平衡了局部细节与全局覆盖。MedHorizon为MLLMs提供了一个严格的测试平台,以检索稀疏证据并对完整的临床工作流程进行推理。
cs.CV / 115 / 2605.06572

Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation

基于快速傅里叶变换插值的无矩阵求逆最小问题求解
Wu, Haidong, Bhayani, Snehal, Heikkilä, Janne
Abstract
Estimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Gröbner-basis or resultant-based methods due to matrix inversion needed in the online solver. Here we propose a sampling-based, matrix inversion-free method that constructs the solvers using sparse hidden-variable resultants. The determinant polynomial in the hidden variable is efficiently reconstructed via inverse fast Fourier transform interpolation from sampled evaluations, avoiding symbolic expansion. Solving this polynomial yields the hidden variable, and the remaining unknowns are recovered by identifying rank-1 deficient submatrices and applying Cramer's rule. A greatest common divisor-based criterion ensures robust submatrix identification under noise. Experiments on diverse minimal problems demonstrate that the proposed solver achieves strong numerical stability and competitive runtime, particularly for small-scale problems, providing a practical alternative to traditional Gröbner-basis and resultant-based solvers.
Chinese Translation
相机几何估计通常涉及求解以多元多项式方程组形式构建的最小问题,这在使用现有的 Gröbner 基或基于结式的方法时,由于在线求解器中需要进行矩阵求逆,往往会带来计算挑战。在此,我们提出了一种基于采样的、无矩阵求逆的方法,该方法利用稀疏隐变量结式构建求解器。通过对采样评估进行逆快速傅里叶变换插值,可以高效重构关于隐变量的行列式多项式,避免了符号展开。求解该多项式可获得隐变量,其余未知数通过识别秩亏为 1 的子矩阵并应用克拉默法则得以恢复。基于最大公约数的标准确保在噪声下稳健的子矩阵识别。在多种最小问题上的实验表明,所提出的求解器实现了强大的数值稳定性和具有竞争力的运行时间,特别是在小规模问题中,为传统的 Gröbner 基和基于结式的求解器提供了一个实用的替代方案。
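The interpolation step described above can be sketched as follows: evaluate the determinant of a polynomial matrix at roots of unity, then recover its coefficients with an inverse FFT instead of expanding it symbolically. This is a minimal illustration that assumes a dense matrix polynomial given as a list of coefficient matrices; the function name is hypothetical and the paper's sparse resultant construction is not reproduced.

```python
import numpy as np

def det_poly_coeffs(mats):
    """Coefficients of det(A0 + A1*x + ... + Ad*x^d), lowest degree first.

    mats is the list [A0, A1, ..., Ad] of n-by-n arrays.
    """
    mats = [np.asarray(A, dtype=complex) for A in mats]
    n = mats[0].shape[0]
    d = len(mats) - 1
    num_samples = n * d + 1          # the determinant has degree at most n*d
    k = np.arange(num_samples)
    # Sample points exp(-2*pi*i*k/N): the evaluations then equal the forward
    # DFT of the coefficient vector, so an inverse FFT recovers it.
    xs = np.exp(-2j * np.pi * k / num_samples)
    samples = np.array([
        np.linalg.det(sum(A * x ** j for j, A in enumerate(mats)))
        for x in xs
    ])
    return np.fft.ifft(samples)

# Example: M(x) = [[x, 1], [2, x]] has determinant x^2 - 2.
coeffs = det_poly_coeffs([np.array([[0.0, 1.0], [2.0, 0.0]]), np.eye(2)])
hidden_values = np.roots(coeffs[::-1])   # np.roots expects highest degree first
```

Each sample costs only a numeric determinant, and the roots of the recovered polynomial are the candidate values of the hidden variable.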
cs.CV / 116 / 2605.06592

DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

DINORANKCLIP:基于高阶排名一致性的视觉-语言预训练中的 DINOv3 蒸馏与注入
Jiang, Shuyang, Yu, Nan, Zhang, Yiming, Ding, Zenghui, Wu, Zhenyu
Abstract
Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is poorly sensitive to fine-grained local structure. RANKCLIP partially addresses the first issue with a list-wise Plackett-Luce ranking-consistency loss, but its model is strictly first-order and inherits the second weakness untouched. We propose DINORANKCLIP, a pretraining framework that addresses both jointly. Our principal contribution is injecting a frozen DINOv3 teacher into the contrastive trunk through a dual-branch lightweight student and a multi-scale fusion module with channel-spatial attention, a self-attention refiner, and a conflict-aware gate that preserves the cross-modal alignment up to first order. Complementarily, we introduce a high-order Plackett-Luce ranking model in which the per-position utility is augmented with attention-parameterised pairwise and tuple-wise transition terms; the family contains CLIP and RANKCLIP as nested zero-order and first-order special cases, and the optimal order on every benchmark is $R^*=3$. The full empirical study -- order sweep, Fine-grained Probe on five datasets, four-node Modality-Gap analysis, six-variant Fusion ablation -- fits in 72 hours on a single eight-GPU H100 node and trains entirely on Conceptual Captions 3M. DINORANKCLIP consistently outperforms CLIP, CyCLIP, ALIP, and RANKCLIP under matched compute, with the largest relative gains on the fine-grained and out-of-distribution evaluations that most directly stress local structural reasoning.
Chinese Translation
对比语言-图像预训练(CLIP)存在两个结构性弱点:对称的 InfoNCE 损失丢弃了未匹配批次对之间的相对排序,而全局池化将视觉表示压缩成一个对细粒度局部结构敏感性较差的语义瓶颈。RANKCLIP 部分解决了第一个问题,采用了基于列表的 Plackett-Luce 排名一致性损失,但其模型严格为一阶,未能解决第二个弱点。我们提出了 DINORANKCLIP,这是一个同时解决这两个问题的预训练框架。我们的主要贡献是通过一个双分支轻量级学生和一个具有通道-空间注意力的多尺度融合模块,将一个冻结的 DINOv3 教师注入到对比主干中,该模块包括一个自注意力精炼器和一个冲突感知门,能够保持跨模态对齐至一阶。此外,我们引入了一个高阶 Plackett-Luce 排名模型,其中每个位置的效用通过注意力参数化的成对和元组过渡项进行增强;该家族包含 CLIP 和 RANKCLIP 作为嵌套的零阶和一阶特例,并且在每个基准上的最优阶数为 $R^*=3$。完整的实证研究——阶数扫描、五个数据集上的细粒度探测、四节点模态差距分析、六种变体融合消融——在单个八 GPU H100 节点上耗时 72 小时,完全在 Conceptual Captions 3M 上训练。DINORANKCLIP 在匹配计算下始终优于 CLIP、CyCLIP、ALIP 和 RANKCLIP,尤其在细粒度和分布外评估中获得了最大的相对提升,这些评估最直接地考验了局部结构推理。
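For reference, the standard first-order Plackett-Luce model that the abstract above builds on can be written in a few lines. The sketch computes the log-likelihood of one ranking given per-item scores; the attention-parameterised pairwise and tuple-wise terms of the proposed high-order model are not specified in the abstract and are not modelled here. The function name is hypothetical.

```python
import numpy as np

def plackett_luce_log_likelihood(scores, ranking):
    """Log-probability of `ranking` (best item first) under Plackett-Luce."""
    s = np.asarray(scores, dtype=float)[list(ranking)]
    total = 0.0
    for i in range(len(s)):
        rest = s[i:]                 # items not yet placed
        m = rest.max()               # subtract the max for a stable log-sum-exp
        total += s[i] - (m + np.log(np.exp(rest - m).sum()))
    return total
```

Each factor is a softmax over the items still unranked, which is why the loss is sensitive to the relative ordering of unmatched pairs rather than only to the top match.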
cs.CV / 117 / 2605.06637

DPM++: Dynamic Masked Metric Learning for Occluded Person Re-identification

DPM++:用于遮挡行人重识别的动态掩码度量学习
Tan, Lei, Luan, Yingshi, Zou, Pincong, Dai, Pingyang, Cao, Liujuan
Abstract
Although person re-identification has made impressive progress, occlusion caused by obstacles remains an unsettled issue in real applications. The difficulty lies in the mismatch between incomplete occluded samples and holistic identity representations. Severe occlusion removes discriminative body cues and introduces interference from background clutter and occluders, making global metric learning unreliable. Existing methods mainly rely on extra pre-trained models to estimate visible parts for alignment or construct occluded samples via data augmentation, but still lack a unified framework that learns robust visibility-consistent matching under realistic occlusion patterns. In this paper, we propose DPM++, a Dynamic Masked Metric Learning framework for occluded person re-identification. DPM++ learns an input-adaptive masked metric that dynamically selects reliable identity subspaces for each occluded instance, enabling matching to emphasize visibility-consistent evidence while suppressing unreliable components. Built upon the classifier-prototype space, DPM++ introduces a CLIP-based two-stage supervision scheme, where ID-level semantic priors are learned from the text branch and transferred into the classifier-prototype space for dynamic masked matching. To strengthen the masked metric, we introduce a saliency-guided patch transfer strategy to synthesize controllable and photo-realistic occluded samples during training. Exploiting real scene priors, this strategy exposes the model to realistic partial observations and provides richer supervision than random erasing. In addition, occlusion-aware sample pairing and mask-guided optimization improve the stability and effectiveness of the framework. Experiments on occluded and holistic person re-identification benchmarks show that DPM++ consistently outperforms previous state-of-the-art methods in both holistic and occlusion scenarios.
Chinese Translation
尽管行人重识别取得了显著进展,但障碍物造成的遮挡仍然是实际应用中的一个未解决问题。困难在于不完整的遮挡样本与整体身份表示之间的不匹配。严重的遮挡会去除区分性身体线索,并引入背景杂乱和遮挡物的干扰,使得全局度量学习变得不可靠。现有方法主要依赖额外的预训练模型来估计可见部分以进行对齐,或通过数据增强构建遮挡样本,但仍缺乏一个统一的框架来学习在现实遮挡模式下的稳健可见性一致匹配。本文提出了DPM++,一种用于遮挡行人重识别的动态掩码度量学习框架。DPM++学习一种输入自适应的掩码度量,动态选择每个遮挡实例的可靠身份子空间,使匹配能够强调可见性一致的证据,同时抑制不可靠的成分。在分类器原型空间的基础上,DPM++引入了一种基于CLIP的两阶段监督方案,其中ID级语义先验从文本分支中学习并转移到分类器原型空间以进行动态掩码匹配。为了增强掩码度量,我们引入了一种显著性引导的补丁转移策略,在训练过程中合成可控且照片真实的遮挡样本。利用真实场景先验,这一策略使模型接触到现实的部分观察,并提供比随机擦除更丰富的监督。此外,考虑遮挡的样本配对和掩码引导的优化提高了框架的稳定性和有效性。在遮挡和整体行人重识别基准上的实验表明,DPM++在整体和遮挡场景中始终优于之前的最先进方法。
cs.CV / 118 / 2605.06643

Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

我们在多模态领域泛化方面取得进展了吗?一项综合基准研究
Dong, Hao, Li, Hongzhao, Li, Shupan, Khan, Muhammad Haris, Chatzi, Eleni, Fink, Olga
Abstract
Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization obscures a reliable assessment of the field's advancement. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7,402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over the ERM baseline; (2) no single method consistently outperforms others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.
Chinese Translation
尽管多模态领域泛化(Multimodal Domain Generalization, MMDG)在增强模型鲁棒性方面日益受到关注,但目前尚不清楚报告的性能提升是否反映了真正的算法进步,还是评估协议不一致的产物。当前的研究相对分散,研究在数据集、模态配置和实验设置上差异显著。此外,现有基准主要集中于动作识别,往往忽视了输入损坏、缺失模态和模型可信度等关键的现实挑战。这种缺乏标准化的情况使得对该领域进展的可靠评估变得模糊。为了解决这一问题,我们推出了MMDG-Bench,这是首个统一且全面的MMDG基准,标准化了在六个数据集上的评估,涵盖了三个不同的任务:动作识别、机械故障诊断和情感分析。MMDG-Bench包含六种模态组合、九种代表性方法和多种评估设置。除了标准准确率外,它还系统地评估了对损坏的鲁棒性、缺失模态的泛化、错误分类检测和分布外检测。在95个独特的跨域任务中,总共训练了7402个神经网络,MMDG-Bench得出了五个关键发现:(1)在公平比较下,近期的专门MMDG方法相较于经验风险最小化(ERM)基线仅提供了边际改进;(2)没有单一方法在各数据集或模态组合中始终优于其他方法;(3)与上限性能之间仍存在显著差距,表明MMDG仍远未解决;(4)三模态融合并不总是优于最强的双模态配置;(5)所有评估的方法在损坏和缺失模态场景下均表现出显著退化,部分方法进一步损害了模型的可信度。
cs.CV / 119 / 2605.06658

Relit-LiVE: Relight Video by Jointly Learning Environment Video

Relit-LiVE:通过共同学习环境视频进行视频重光照
Xiao, Weiqing, Li, Hong, Yang, Xiuyu, Chen, Houyuan, Li, Wenyi, Liu, Tianqi, Xu, Shaocong, Ye, Chongjie, Zhao, Hao, Wang, Beibei
Abstract
Recent advances have shown that large-scale video diffusion models can be repurposed as neural renderers by first decomposing videos into intrinsic scene representations and then performing forward rendering under novel illumination. While promising, this paradigm fundamentally relies on accurate intrinsic decomposition, which remains highly unreliable for real-world videos and often leads to distorted appearances, broken materials, and accumulated temporal artifacts during relighting. In this work, we present Relit-LiVE, a novel video relighting framework that produces physically consistent, temporally stable results without requiring prior knowledge of camera pose. Our key insight is to explicitly introduce raw reference images into the rendering process, enabling the model to recover critical scene cues that are inevitably lost or corrupted in intrinsic representations. Furthermore, we propose a novel environment video prediction formulation that simultaneously generates relit videos and per-frame environment maps aligned with each camera viewpoint in a single diffusion process. This joint prediction enforces strong geometric-illumination alignment and naturally supports dynamic lighting and camera motion, significantly improving physical consistency in video relighting while easing the requirement of known per-frame camera pose. Extensive experiments demonstrate that Relit-LiVE consistently outperforms state-of-the-art video relighting and neural rendering methods across synthetic and real-world benchmarks. Beyond relighting, our framework naturally supports a wide range of downstream applications, including scene-level rendering, material editing, object insertion, and streaming video relighting. The Project is available at https://github.com/zhuxing0/Relit-LiVE.
Chinese Translation
最近的研究进展表明,大规模视频扩散模型可以通过首先将视频分解为内在场景表示,然后在新照明下进行前向渲染,重新用于神经渲染。尽管这一范式前景广阔,但其根本依赖于准确的内在分解,而这在现实世界视频中仍然高度不可靠,常常导致失真外观、破损材料以及在重光照过程中累积的时间伪影。在本研究中,我们提出了Relit-LiVE,一个新颖的视频重光照框架,能够在不需要先验相机姿态知识的情况下,生成物理一致、时间稳定的结果。我们的关键见解是明确地将原始参考图像引入渲染过程中,使模型能够恢复在内在表示中不可避免地丢失或损坏的关键场景线索。此外,我们提出了一种新颖的环境视频预测公式,该公式在单一扩散过程中同时生成重光照视频和与每个相机视点对齐的逐帧环境图。这种联合预测强制执行强几何-照明对齐,自然支持动态照明和相机运动,显著提高了视频重光照的物理一致性,同时减轻了已知逐帧相机姿态的要求。大量实验表明,Relit-LiVE在合成和现实世界基准测试中始终优于最先进的视频重光照和神经渲染方法。除了重光照外,我们的框架自然支持广泛的下游应用,包括场景级渲染、材料编辑、物体插入和流媒体视频重光照。该项目可在 https://github.com/zhuxing0/Relit-LiVE 获取。
cs.CV / 120 / 2605.06664

BAMI: Training-Free Bias Mitigation in GUI Grounding

BAMI:无训练偏差缓解在GUI定位中的应用
Zhang, Borui, Zhang, Bo, Wang, Bo, Zheng, Wenzhao, Cheng, Yuhao, Tang, Liang, Yan, Yiqiang, Zhou, Jie, Lu, Jiwen
Abstract
GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed Masked Prediction Distribution (MPD) attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce Bias-Aware Manipulation Inference (BAMI), which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9% to 57.8%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness. Code is available at https://github.com/Neur-IO/BAMI.
Chinese Translation
GUI定位是使GUI代理能够执行点击和拖动等任务的关键能力。然而,在像ScreenSpot-Pro基准这样复杂的场景中,现有模型往往表现不佳。通过利用所提出的掩码预测分布(Masked Prediction Distribution, MPD)归因方法,我们识别出错误的主要来源有两个:高图像分辨率(导致精度偏差)和复杂的界面元素(导致模糊偏差)。为了解决这些挑战,我们引入了偏差感知操控推理(Bias-Aware Manipulation Inference, BAMI),该方法结合了两种关键操控:粗到细的聚焦和候选选择,以有效缓解这些偏差。我们的广泛实验结果表明,BAMI在无训练设置下显著提高了多种GUI定位模型的准确性。例如,将我们的方法应用于TianXi-Action-7B模型,使其在ScreenSpot-Pro基准上的准确率从51.9%提升至57.8%。此外,消融研究确认了BAMI方法在不同参数配置下的稳健性,突显了其稳定性和有效性。代码可在 https://github.com/Neur-IO/BAMI 获取。
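The coarse-to-fine focus step mentioned in the abstract can be illustrated with a small coordinate-bookkeeping sketch: crop a window around a coarse prediction, then map a point predicted inside the crop back to full-screen coordinates. This is not the BAMI implementation; the function names `crop_window` and `to_full_coords` and the 1024-pixel window are hypothetical choices made for illustration.

```python
def crop_window(coarse_xy, image_size, window):
    """Return (left, top, right, bottom) of a window centred on coarse_xy, clamped to the image."""
    (x, y), (w, h), (ww, wh) = coarse_xy, image_size, window
    # Shift the window back inside the screenshot when the coarse point is near an edge.
    left = min(max(x - ww // 2, 0), max(w - ww, 0))
    top = min(max(y - wh // 2, 0), max(h - wh, 0))
    return left, top, min(left + ww, w), min(top + wh, h)


def to_full_coords(fine_xy, box):
    """Map a point predicted inside the crop back to full-image coordinates."""
    left, top, _, _ = box
    return fine_xy[0] + left, fine_xy[1] + top


# Hypothetical 4K screenshot with a coarse guess near the bottom-right corner.
box = crop_window((3800, 2100), (3840, 2160), (1024, 1024))
click = to_full_coords((100, 200), box)
```

Cropping at native resolution is one plausible way to reduce the precision bias that the abstract attributes to high image resolution, since the model then sees the target region at a larger relative scale.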
cs.CV / 121 / 2605.06667

ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

ActCam:用于视频生成的零样本联合相机与3D运动控制
Khalifi, Omar El, Rossi, Thomas, Fossey, Oscar, Fouque, Thibault, Mizrahi, Ulysse, Torr, Philip, Laptev, Ivan, Pizzati, Fabio, Bellot-Gurlet, Baptiste
Abstract
For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor's motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: https://elkhomar.github.io/actcam/.
Chinese Translation
在艺术应用中,视频生成需要对表演和摄影(即演员的运动和相机轨迹)进行精细控制。我们提出了ActCam,这是一种零样本视频生成方法,能够将驱动视频中的角色运动联合转移到新场景中,并实现对相机内参和外参的逐帧控制。ActCam基于任何接受场景深度和角色姿态条件的预训练图像到视频扩散模型。给定一个包含移动角色的源视频和目标相机运动,ActCam生成在各帧之间保持几何一致的姿态和深度条件。然后,我们通过一个两阶段的条件调度运行单次采样过程:早期去噪步骤同时对姿态和稀疏深度进行条件化,以强制执行场景结构,之后深度被去掉,仅用姿态指导来细化高频细节,而不对生成施加过多限制。我们在多个基准测试中评估了ActCam,这些基准涵盖了多样的角色运动和具有挑战性的视角变化。我们发现,与仅使用姿态控制及其他姿态和相机方法相比,ActCam在相机遵循性和运动保真度上有所提升,并在人工评估中更受欢迎,尤其是在大视角变化的情况下。我们的结果强调,精心设计的相机一致性条件和分阶段指导可以在不进行训练的情况下实现强大的联合相机和运动控制。项目页面:https://elkhomar.github.io/actcam/
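The two-phase conditioning schedule described in the abstract (pose plus sparse depth early, pose only afterwards) can be sketched as a simple per-step switch. The function name and the `depth_fraction` cutoff are assumptions for illustration; the abstract does not state where the switch occurs.

```python
def conditioning_for_step(step, total_steps, depth_fraction=0.4):
    """Return which conditions are active at a denoising step (0 = first, noisiest step).

    depth_fraction is a hypothetical cutoff, not a value reported by the authors.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step out of range")
    # Phase 1: pose + depth enforce scene structure; phase 2: pose only refines detail.
    use_depth = step < int(total_steps * depth_fraction)
    return {"pose": True, "depth": use_depth}


schedule = [conditioning_for_step(s, 10) for s in range(10)]
```

With ten steps and a 0.4 cutoff, depth conditioning is active for the first four steps and dropped for the remaining six, while pose conditioning stays on throughout.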
人工智能 (Artificial Intelligence)
139
cs.AI / 1 / 2605.05329

Understanding Annotator Safety Policy with Interpretability

利用可解释性理解标注者的安全政策
Oesterling, Alex, Ren, Donghao, Assogba, Yannick, Moritz, Dominik, Kim, Sunnie S. Y., Gatys, Leon, Hohman, Fred
Abstract
Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive and can stem from multiple sources such as operational failures (annotators misunderstand or misexecute the task), policy ambiguity (policy wording leaves room for interpretation), or value pluralism (different annotators hold different perspectives on safety). Distinguishing these sources matters. For example, operational failures call for quality control, ambiguity calls for policy clarification, and pluralism calls for deliberation about incorporating diverse perspectives. Yet understanding why annotators disagree is difficult. Directly asking annotators for their reasoning is costly, substantially increasing annotation burden, and can be unreliable for both human and LLM annotators as self-reported reasoning often fails to reflect actual decision processes. We introduce Annotator Policy Models (APMs), interpretable models that learn annotators' internal safety policies from labeling behavior alone, making annotator reasoning visible and comparable without additional annotation effort. We validate that APMs accurately model annotator safety policy (>80% accuracy), faithfully predict responses to counterfactual edits, and recover known policy differences in controlled settings. Applying APMs to LLM and human annotations, we demonstrate two core applications: (1) surfacing policy ambiguity by revealing how annotators interpret safety instructions differently, and (2) surfacing value pluralism by uncovering systematic differences in safety priorities across demographic groups. Together, these capabilities support more targeted, transparent, and inclusive safety policy design.
Chinese Translation
安全政策定义了什么构成安全和不安全的人工智能输出,指导数据注释和模型开发。然而,注释者之间的分歧普遍存在,可能源于多种原因,例如操作失误(注释者误解或错误执行任务)、政策模糊(政策措辞留有解释空间)或价值多元主义(不同的注释者对安全持有不同的观点)。区分这些来源至关重要。例如,操作失误需要质量控制,模糊性需要政策澄清,而多元主义则需要对纳入不同观点进行深入讨论。然而,理解注释者为何存在分歧是困难的。直接询问注释者的推理代价高昂,显著增加了注释负担,并且对于人类和大型语言模型(LLM)注释者而言,依赖自我报告的推理往往无法反映实际决策过程。我们引入了注释者政策模型(APMs),这是一种可解释的模型,仅通过标注行为学习注释者的内部安全政策,使注释者的推理可见且可比较,而无需额外的注释工作。我们验证了APMs能够准确建模注释者的安全政策(准确率超过80%),忠实预测对反事实编辑的响应,并在受控环境中恢复已知的政策差异。通过将APMs应用于LLM和人类注释,我们展示了两个核心应用:(1)通过揭示注释者如何不同地解释安全指令,揭示政策模糊性;(2)通过揭示不同人口群体在安全优先级上的系统性差异,揭示价值多元主义。这些能力共同支持更具针对性、透明性和包容性的安全政策设计。
cs.AI / 2 / 2605.05365

ZAYA1-8B Technical Report

ZAYA1-8B 技术报告
Washbourne, Robert, Iyer, Rishi, Figliolia, Tomas, Zheng, Henry, Lorig-Roach, Ryan, Yang, Sungyeon, Yuvraj, Pritish, Anthony, Quentin, Tokpanov, Yury, Yang, Xiao, Nanduru, Ganesh, Ebert, Stephen, Medepalli, Praneeth, Szot, Skyler, Rajagopal, Srivatsan, Ong, Alex, Mehta, Bhavana, Millidge, Beren
Abstract
We present ZAYA1-8B, a reasoning-focused mixture-of-experts (MoE) model with 700M active and 8B total parameters, built on Zyphra's MoE++ architecture. ZAYA1-8B's core pretraining, midtraining, and supervised fine-tuning (SFT) were performed on a full-stack AMD compute, networking, and software platform. With under 1B active parameters, ZAYA1-8B matches or exceeds DeepSeek-R1-0528 on several challenging mathematics and coding benchmarks, and remains competitive with substantially larger open-weight reasoning models. ZAYA1-8B was trained from scratch for reasoning, with reasoning data included from pretraining onward using an answer-preserving trimming scheme. Post-training uses a four-stage RL cascade: reasoning warmup on math and puzzles; a 400-task RLVE-Gym curriculum; math and code RL with test-time compute traces and synthetic code environments built from competitive-programming references; and behavioral RL for chat and instruction following. We also introduce Markovian RSA, a test-time compute (TTC) method that recursively aggregates parallel reasoning traces while carrying forward only bounded-length reasoning tails between rounds. In TTC evaluation, Markovian RSA raises ZAYA1-8B to 91.9% on AIME'25 and 89.6% on HMMT'25 while carrying forward only a 4K-token tail, narrowing the gap to much larger reasoning models including Gemini-2.5 Pro, DeepSeek-V3.2, and GPT-5-High.
Chinese Translation
我们介绍了 ZAYA1-8B,这是一种以推理为重点的专家混合模型(Mixture-of-Experts, MoE),具有 7 亿个活跃参数和 80 亿个总参数,基于 Zyphra 的 MoE++ 架构构建。ZAYA1-8B 的核心预训练、中期训练和监督微调(Supervised Fine-Tuning, SFT)是在完整的 AMD 计算、网络和软件平台上进行的。在不到 10 亿个活跃参数的情况下,ZAYA1-8B 在多个具有挑战性的数学和编码基准测试中与 DeepSeek-R1-0528 相匹配或超越,并且与规模大得多的开放权重推理模型相比仍具竞争力。ZAYA1-8B 从零开始进行推理训练,推理数据从预训练开始就采用了保留答案的修剪方案。后期训练使用了四阶段的强化学习级联:在数学和谜题上的推理预热;一个 400 任务的 RLVE-Gym 课程;结合测试时计算轨迹和基于竞赛编程参考构建的合成代码环境的数学和代码强化学习;以及用于聊天和指令跟随的行为强化学习。我们还引入了马尔可夫 RSA(Markovian RSA),这是一种测试时计算(TTC)方法,通过递归聚合并行推理轨迹,同时在回合之间仅保留有限长度的推理尾部。在 TTC 评估中,马尔可夫 RSA 将 ZAYA1-8B 的得分提升至 AIME'25 的 91.9% 和 HMMT'25 的 89.6%,同时仅保留 4K 令牌的尾部,缩小了与更大推理模型(包括 Gemini-2.5 Pro、DeepSeek-V3.2 和 GPT-5-High)之间的差距。
cs.AI / 3 / 2605.05379

Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems

部分证据基准:在代理系统中对授权限制证据的基准测试
Tallam, Krti
Abstract
Enterprise agents increasingly operate inside scoped retrieval systems, delegated workflows, and policy-constrained evidence environments. In these settings, access control can be enforced correctly while the system still produces an answer that appears complete even though material evidence lies outside the caller's authorization boundary. This paper introduces Partial Evidence Bench, a deterministic benchmark for measuring that failure mode. The benchmark ships three scenario families -- due diligence, compliance audit, and security incident response -- with 72 tasks total, ACL-partitioned corpora, oracle complete answers, oracle authorized-view answers, oracle completeness judgments, and structured gap-report oracles. It evaluates systems along four surfaces: answer correctness, completeness awareness, gap-report quality, and unsafe completeness behavior. Checked-in baselines show that silent filtering is catastrophically unsafe across all shipped families, while explicit fail-and-report behavior eliminates unsafe completeness without collapsing the task into trivial abstention. Preliminary real-model runs show model-dependent and scenario-sensitive differences in whether systems overclaim completeness, conservatively underclaim, or report incompleteness in an enterprise-usable form. The benchmark's broader contribution is to make a governance-critical agent failure measurable without human judges or contamination-prone static corpora.
Chinese Translation
企业代理越来越多地在范围检索系统、委托工作流和政策约束的证据环境中运行。在这些环境中,可以正确地实施访问控制,同时系统仍然生成一个看似完整的答案,即使实质证据超出了调用者的授权边界。本文介绍了部分证据基准(Partial Evidence Bench),这是一个用于测量该失败模式的确定性基准。该基准包含三个场景系列——尽职调查、合规审计和安全事件响应——共72个任务,ACL(访问控制列表)分区语料库,oracle 完整答案,oracle 授权视图答案,oracle 完整性判断,以及结构化的缺口报告 oracle。它从四个方面评估系统:答案正确性、完整性意识、缺口报告质量和不安全的完整性行为。检查的基线显示,在所有发布的系列中,静默过滤是灾难性的不安全,而明确的失败并报告行为则消除了不安全的完整性,而不将任务简化为琐碎的放弃。初步的真实模型运行显示,系统在是否过度声称完整性、保守地低估或以企业可用的形式报告不完整性方面存在模型依赖性和场景敏感性差异。该基准的更广泛贡献在于使治理关键的代理失败可测量,而无需人工评判或易受污染的静态语料库。
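The contrast the abstract draws between silent filtering and explicit fail-and-report behavior can be sketched as follows. This is a toy illustration, not the benchmark's code: the `Doc` type, the role names, and the report wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    allowed_roles: frozenset
    text: str


def retrieve_with_gap_report(docs, role):
    """Filter by ACL, but report how much relevant evidence was withheld instead of hiding it."""
    visible = [d for d in docs if role in d.allowed_roles]
    withheld = len(docs) - len(visible)
    return {
        "evidence": [d.doc_id for d in visible],
        "complete": withheld == 0,
        "gap_report": (
            f"{withheld} relevant document(s) outside caller authorization"
            if withheld
            else None
        ),
    }


# Hypothetical due-diligence corpus: one document is restricted to the legal team.
corpus = [
    Doc("fin-01", frozenset({"analyst", "legal"}), "Q3 revenue summary"),
    Doc("lit-07", frozenset({"legal"}), "Pending litigation memo"),
]
result = retrieve_with_gap_report(corpus, "analyst")
```

A silently filtering system would return only `evidence` and let the caller assume completeness; here the `complete` flag and `gap_report` make the authorization boundary visible without leaking the withheld content.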
cs.AI / 4 / 2605.05386

BALAR: A Bayesian Agentic Loop for Active Reasoning

BALAR:用于主动推理的贝叶斯代理循环
Echarghaoui, Aymen, Wu, Dongxia, Fox, Emily B.
Abstract
Large language models increasingly operate in interactive settings where solving a task requires multiple rounds of information exchange with a user. However, most current systems treat dialogue reactively and lack a principled mechanism to reason about what information is missing and which question should be asked next. We propose BALAR (Bayesian Agentic Loop for Active Reasoning), a task-agnostic outer-loop algorithm that requires no fine-tuning and enables structured multi-turn interaction between an LLM agent and a user. BALAR maintains a structured belief over latent states, selects clarifying questions by maximizing expected mutual information, and dynamically expands its state representation when the current one proves insufficient. We evaluate BALAR on three diverse benchmarks: AR-Bench-DC (detective cases), AR-Bench-SP (thinking puzzles), and iCraft-MD (clinical diagnosis). BALAR significantly outperforms all baselines across all three benchmarks, with 14.6% higher accuracy on AR-Bench-DC, 38.5% on AR-Bench-SP, and 30.5% on iCraft-MD.
Chinese Translation
大型语言模型越来越多地在交互环境中运行,在这些环境中,解决任务需要与用户进行多轮信息交换。然而,目前大多数系统以反应方式处理对话,缺乏一个原则性的机制来推理缺失的信息以及下一步应该问哪个问题。我们提出了BALAR(Bayesian Agentic Loop for Active Reasoning),这是一种与任务无关的外部循环算法,无需微调,并能够实现大型语言模型(LLM)代理与用户之间的结构化多轮交互。BALAR在潜在状态上维持结构化信念,通过最大化期望互信息选择澄清问题,并在当前状态不足时动态扩展其状态表示。我们在三个不同的基准上评估了BALAR:AR-Bench-DC(侦探案例)、AR-Bench-SP(思维难题)和iCraft-MD(临床诊断)。在这三个基准上,BALAR的表现显著优于所有基线,在AR-Bench-DC上提高了14.6%的准确率,在AR-Bench-SP上提高了38.5%,在iCraft-MD上提高了30.5%。
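Selecting a clarifying question by maximizing expected mutual information, as the abstract describes, can be illustrated over a small discrete belief. The sketch below assumes a finite hypothesis set and known answer likelihoods; the names `expected_info_gain` and `pick_question` are hypothetical and the authors' belief representation may differ.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_info_gain(belief, likelihood):
    """Mutual information between hypothesis and answer.

    belief[h] = P(h); likelihood[h][a] = P(answer a | hypothesis h).
    """
    n_answers = len(likelihood[0])
    gain = entropy(belief)
    for a in range(n_answers):
        p_a = sum(b * lik[a] for b, lik in zip(belief, likelihood))
        if p_a == 0:
            continue
        # Bayes update for this answer, weighted by how likely the answer is.
        posterior = [b * lik[a] / p_a for b, lik in zip(belief, likelihood)]
        gain -= p_a * entropy(posterior)
    return gain


def pick_question(belief, questions):
    """Choose the question whose answer is expected to reduce uncertainty most."""
    return max(questions, key=lambda q: expected_info_gain(belief, questions[q]))


belief = [0.5, 0.5]
questions = {
    "splits_hypotheses": [[1.0, 0.0], [0.0, 1.0]],
    "uninformative": [[0.5, 0.5], [0.5, 0.5]],
}
best = pick_question(belief, questions)
```

In this toy case the first question fully separates the two hypotheses (one bit of expected gain) while the second yields none, so the first is selected.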
cs.AI / 5 / 2605.05402

Intelligent CCTV for Urban Design: AI-Based Analysis of Soft Infrastructure at Intersections

城市设计中的智能监控摄像头:基于人工智能的交叉口软基础设施分析
Katariya, Vinit, Kim, Seungjin, Craig, Curtis, Morris, Nichole, Tabkhi, Hamed
Abstract
Artificial intelligence (AI) and computer vision are transforming transportation data collection. This study introduces an AI-enabled analytics framework leveraging existing CCTV infrastructure to evaluate the impact of soft interventions, such as temporary pedestrian refuges and curb extensions, on vehicle speed and safety. Using deep learning and perspective-based speed estimation, we evaluated driver behavior in Minneapolis before and after interventions, with repeated post-installation monitoring in Week 1 and Week 2. Findings reveal that at unsignalized intersections, mean and 85th-percentile speeds fell by up to 18.75% and 16.56%, respectively, while pass-through traffic decreased by as much as 12.2%. Signalized intersections showed comparable reductions except at one location, with mean and 85th-percentile speeds dropping by up to 20.0% and 17.19%. These results demonstrate the traffic-calming effectiveness of soft infrastructure and underscore the utility of AI-powered methods for rapid, low-cost, and evidence-based transport policy evaluation.
Chinese Translation
人工智能(AI)和计算机视觉正在改变交通数据收集的方式。本研究提出了一种基于AI的分析框架,利用现有的监控摄像头基础设施评估软干预措施(如临时行人避难所和路缘扩展)对车辆速度和安全性的影响。通过深度学习和基于视角的速度估计,我们评估了干预措施实施前后驾驶员的行为,并在明尼阿波利斯进行了第1周和第2周的重复安装后监测。研究结果显示,在无信号交叉口,平均速度和85百分位速度分别下降了多达18.75%和16.56%,而通过交通量减少了多达12.2%。信号交叉口显示出类似的减少,除了一个地点,平均速度和85百分位速度分别下降了多达20.0%和17.19%。这些结果证明了软基础设施在交通减速方面的有效性,并强调了基于AI的方法在快速、低成本和基于证据的交通政策评估中的实用性。
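The headline metrics in this abstract (mean speed, 85th-percentile speed, and percent reduction after an intervention) are straightforward to compute once per-vehicle speeds are available. The sketch below uses Python's standard `statistics` module; the input values are made up for illustration and are not the study's data.

```python
import statistics


def speed_summary(speeds):
    """Return (mean, 85th-percentile) of a list of per-vehicle speeds."""
    # quantiles(n=100) yields 99 cut points; index 84 is the 85th percentile.
    p85 = statistics.quantiles(speeds, n=100, method="inclusive")[84]
    return statistics.mean(speeds), p85


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Hypothetical example: a mean speed falling from 32 to 26 (same units).
drop = percent_reduction(32.0, 26.0)
mean_speed, p85_speed = speed_summary(list(range(1, 101)))
```

The 85th-percentile speed is a common traffic-engineering summary because it reflects the faster end of typical driver behavior rather than the average.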
cs.AI / 6 / 2605.05403

When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

当有用性变为谄媚:谄媚是大型语言模型中社会一致性与认知完整性之间的边界失效
Li, Jiechen, Barry, Catherine A., Randev, Rishika, Chen, Janet, Jorgensen, Ella, Bent, Brinnae
Abstract
This position paper argues that sycophancy in LLMs is a boundary failure between social alignment and epistemic integrity. Existing work often operationalizes sycophancy through external behavior such as agreement with incorrect user beliefs, position reversals, or deviation from an objective standard of correctness. These formulations capture only overt forms of the phenomenon and leave subtler boundary failures involving epistemic integrity and social alignment underspecified. We argue that sycophancy should not be understood as agreement alone, but as alignment behavior that displaces independent epistemic judgment. To clarify this boundary, we propose a three-condition framework for sycophancy. First, the user expresses a cue in the form of a belief, preference, or self-concept. Second, the model shifts toward that cue through alignment behavior. Third, this shift compromises epistemic accuracy, independent reasoning, or appropriate correction. We also introduce a taxonomy for classifying sycophancy, consisting of alignment targets, mechanisms, and severity. The paper concludes by discussing implications for alignment evaluation and argues for boundary-aware assessment, structured rubrics, and mitigation strategies, while situating these proposals alongside alternative views of sycophancy.
Chinese Translation
本立场论文认为,大型语言模型(LLMs)中的谄媚是社会一致性与认知完整性之间的边界失效。现有研究通常通过外部行为来操作化谄媚,例如与不正确用户信念的一致性、立场反转或偏离客观正确标准。这些表述仅捕捉现象的明显形式,而未明确涉及认知完整性和社会一致性的更微妙的边界失效。我们认为,谄媚不应仅被理解为一致性,而应被视为一种取代独立认知判断的对齐行为。为了澄清这一边界,我们提出了一个关于谄媚的三条件框架。首先,用户以信念、偏好或自我概念的形式表达线索。其次,模型通过对齐行为向该线索靠拢。第三,这一转变损害了认知准确性、独立推理或适当纠正。我们还引入了一个用于分类谄媚的分类法,包括对齐目标、机制和严重性。论文最后讨论了对齐评估的影响,并主张进行边界意识评估、结构化评分标准和缓解策略,同时将这些提案与关于谄媚的其他观点并置讨论。
cs.AI / 7 / 2605.05407

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

PRISM:用于顺序决策的感知推理交织
Aissi, Mohamed Salim, Grislain, Clemence, Romac, Clement, Soulier, Laure, Chetouani, Mohamed, Sigaud, Olivier, Thome, Nicolas
Abstract
Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM's description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our interactive goal-oriented perception pipeline yields systematic and substantial gains, and (3) PRISM is fully automatic, eliminating the need for handcrafted questions or answers.
Chinese Translation
将基于大型语言模型(LLM)的具身代理从仅文本环境扩展到复杂的多模态设置仍然是一个重大挑战。近期的研究发现,独立的视觉语言模型(VLM)存在感知-推理-决策的差距,往往忽视任务关键的信息。本文介绍了PRISM,一个通过动态问答(DQA)管道紧密结合感知(VLM)和决策(LLM)的框架。LLM并不是被动接受VLM的描述,而是对其进行批判,向VLM提出以目标为导向的问题,并合成紧凑的图像描述。这种闭环交互产生了对场景的清晰、以任务为驱动的理解。我们在ALFWorld和Room-to-Room(R2R)基准上评估了PRISM。结果表明:(1)PRISM显著优于最先进的基于图像的模型,(2)我们的互动目标导向感知管道带来了系统性和显著的提升,以及(3)PRISM是完全自动化的,消除了对手工问题或答案的需求。
cs.AI / 8 / 2605.05409

Agentic Retrieval-Augmented Generation for Financial Document Question Answering

用于金融文档问答的自主检索增强生成
Shu, Yang, Liu, Yingmin, Xie, Zequn
Abstract
Financial document question answering (QA) demands complex multi-step numerical reasoning over heterogeneous evidence (structured tables, textual narratives, and footnotes) scattered across corporate filings. Existing retrieval-augmented generation (RAG) approaches adopt a single-pass retrieve-then-generate paradigm that struggles with the compositional reasoning chains prevalent in financial analysis. We propose FinAgent-RAG, an agentic RAG framework that orchestrates iterative retrieval-reasoning loops with self-verification, specifically engineered for the precision requirements of financial numerical reasoning. The framework integrates three domain-specific innovations: (1) a Contrastive Financial Retriever trained with hard negative mining to distinguish semantically similar but numerically distinct financial passages, (2) a Program-of-Thought reasoning module that generates executable Python code for precise arithmetic rather than relying on error-prone LLM-based mental computation, and (3) an Adaptive Strategy Router that dynamically allocates computational resources based on question complexity, reducing API costs by 41.3% on FinQA while preserving accuracy. Extensive experiments on three benchmark datasets (FinQA, ConvFinQA, and TAT-QA) demonstrate that FinAgent-RAG achieves 76.81%, 78.46%, and 74.96% execution accuracy respectively, outperforming the strongest baseline by 5.62 to 9.32 percentage points. Ablation studies, cross-backbone evaluation with four LLMs, and deployment cost analysis confirm the framework's robustness and practical viability for financial institutions.
Chinese Translation
金融文档问答(QA)需要对散布在公司文件中的异构证据(结构化表格、文本叙述和脚注)进行复杂的多步骤数值推理。现有的检索增强生成(RAG)方法采用单次检索后生成的范式,这在金融分析中常见的组合推理链上表现不佳。我们提出了FinAgent-RAG,一个自主的RAG框架,它通过自我验证协调迭代的检索-推理循环,专门针对金融数值推理的精确性要求进行设计。该框架整合了三项特定领域的创新:(1)使用困难负样本挖掘训练的对比金融检索器,以区分语义相似但数值不同的金融段落;(2)一个思维程序推理模块,生成可执行的Python代码以实现精确的算术运算,而不是依赖于易出错的基于LLM的心理计算;(3)一个自适应策略路由器,根据问题复杂性动态分配计算资源,在保持准确性的同时,将FinQA的API成本降低了41.3%。在三个基准数据集(FinQA、ConvFinQA和TAT-QA)上的广泛实验表明,FinAgent-RAG分别达到了76.81%、78.46%和74.96%的执行准确率,超越了最强基线5.62至9.32个百分点。消融研究、与四个LLM的交叉骨干评估以及部署成本分析证实了该框架在金融机构中的稳健性和实际可行性。
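The Program-of-Thought idea in the abstract is that the model emits code and an interpreter performs the arithmetic. A minimal sketch of the execution side is shown below, under the assumption that generated programs are restricted to arithmetic expressions over named values; the evaluator, the variable names, and the figures are hypothetical and are not FinAgent-RAG's implementation, which the abstract says executes Python code.

```python
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_arithmetic(expr, variables):
    """Evaluate +, -, *, / over numbers and named variables; reject anything else."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        # Calls, attribute access, and other constructs are refused.
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval"))


# Hypothetical model output: year-over-year revenue growth in percent.
growth = eval_arithmetic(
    "(rev_2023 - rev_2022) / rev_2022 * 100",
    {"rev_2022": 200.0, "rev_2023": 250.0},
)
```

Walking the syntax tree rather than calling `eval` keeps model-generated text from reaching arbitrary Python, which is one reasonable design choice when the generator is untrusted.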
cs.AI / 9 / 2605.05410

LaTA: A Drop-in, FERPA-Compliant Local-LLM Autograder for Upper-Division STEM Coursework

LaTA:一种符合FERPA要求的本地LLM自动评分系统,适用于高年级STEM课程
Rodríguez, Jesse A.
Abstract
Large-language-model (LLM) graders promise to relieve the grading burden of upper-division STEM courses, but most deployments to date send student work to third-party APIs, violating FERPA and exposing institutions to data risk while requiring substantial assignment modification. We present LaTA (LaTeX Teaching Assistant), a drop-in, open-source autograder that runs entirely on commodity on-premises hardware and assumes a LaTeX-native workflow already adopted by many engineering and physics courses. LaTA implements a four-stage pipeline (ingest, segment, grade, report) using a locally hosted open-weight chain-of-thought LLM grader (gpt-oss:120b) that compares student work to an instructor-authored reference solution and applies a YAML rubric with binary per-item scoring. We deployed LaTA in Winter 2026 in ME 373 (Mechanical Engineering Methods) at Oregon State University, grading every weekly assignment for approximately 200 students on a single Mac Studio at $0 marginal cost per assignment and 1 to 3 minutes of wall-clock time per submission, enabling regrading of corrected assignments and greatly expanded TA office hour offerings. The instructor-confirmed grading-error rate held at roughly 0.02% to 0.04% per rubric line item across the term. Relative to the same instructor's previous traditionally-graded cohort, the LaTA-graded cohort outperformed by approximately 11% on the midterm exam and 8% on the final exam, and reported large gains in self-assessed confidence on every stated learning objective ($N = 159$ survey responses, $\Delta \geq +1.49$ Likert points, $p < 10^{-27}$ on every comparison). We release the code under AGPLv3.
Chinese Translation
大型语言模型(LLM)评分系统有望减轻高年级STEM课程的评分负担,但迄今为止大多数部署将学生作业发送到第三方API,这违反了FERPA规定,增加了数据风险,同时需要对作业进行大量修改。我们提出了LaTA(LaTeX Teaching Assistant,LaTeX教学助手),这是一种即插即用的开源自动评分系统,完全在普通本地硬件上运行,并假设许多工程和物理课程已经采用了LaTeX原生工作流程。LaTA实现了一个四阶段的流程(获取、分段、评分、报告),使用本地托管的开放权重链式思维LLM评分器(gpt-oss:120b),将学生作业与教师编写的参考解决方案进行比较,并应用具有二进制逐项评分的YAML评分标准。我们在2026年冬季学期于俄勒冈州立大学的ME 373(机械工程方法)课程中部署了LaTA,使用一台Mac Studio为约200名学生的每周作业评分,每个作业的边际成本为0美元,每份提交的墙钟时间为1至3分钟,从而实现了对更正作业的重新评分,并大大扩展了助教的办公时间。教师确认的评分错误率在整个学期内保持在每个评分标准项目约0.02%至0.04%之间。与同一教师之前的传统评分班级相比,LaTA评分班级在期中考试中表现出约11%的优势,在期末考试中表现出约8%的优势,并在每个学习目标的自我评估信心上报告了显著提高($N = 159$份调查回应,$\Delta \geq +1.49$个Likert点,每项比较均为$p < 10^{-27}$)。我们将代码以AGPLv3许可证发布。
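Binary per-item rubric scoring, as mentioned in the abstract, reduces to summing the points of the items judged as met. The sketch below uses a plain Python list in place of a parsed YAML file; the rubric items, point values, and function name are hypothetical and do not come from the LaTA repository.

```python
# Hypothetical rubric, shaped like what a YAML rubric might parse into.
rubric = [
    {"id": "setup", "points": 2},
    {"id": "derivation", "points": 5},
    {"id": "final_answer", "points": 3},
]


def score_submission(rubric, judgments):
    """Sum points for items judged met; judgments maps rubric item id -> bool."""
    missing = [item["id"] for item in rubric if item["id"] not in judgments]
    if missing:
        # Refuse to score when the grader skipped an item, rather than assume a value.
        raise KeyError(f"no judgment for rubric items: {missing}")
    earned = sum(item["points"] for item in rubric if judgments[item["id"]])
    total = sum(item["points"] for item in rubric)
    return earned, total


earned, total = score_submission(
    rubric, {"setup": True, "derivation": False, "final_answer": True}
)
```

Keeping each judgment binary makes individual grading decisions easy for an instructor to audit, which fits the per-line-item error rate the abstract reports.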
cs.AI / 10 / 2605.05413

From History to State: Constant-Context Skill Learning for LLM Agents

从历史到状态:大语言模型代理的恒定上下文技能学习
Xie, Haoyang, Wang, Xinyuan, Wang, Yancheng, Zhao, Puda, Ju, Feng
Abstract
Large language model (LLM) agents are increasingly used to operate browsers, files, code and tools, making personal assistants a natural deployment target. Yet personal agents face a privacy-cost-capability tension: cloud models execute multi-step workflows well but expose sensitive intermediate context to external APIs, while local models preserve privacy but remain less reliable. Both settings also pay repeatedly for long skill prompts and growing histories. We propose constant-context skill learning, a context-to-weights framework for recurring agent workflows: reusable procedures are learned in lightweight task-family modules, while inference conditions only on the current observation and a compact state block. A deterministic tracker renders this state block from task progress and supplies aligned subgoal rewards, so each module can be trained with step-level SFT and refined through online RL. Across ALFWorld, WebShop, and SciWorld, our agents achieve strong performance across Qwen3-4B, Qwen3-8B and Llama-3.1-8B. With Qwen3-8B, SFT+RL reaches 89.6% unseen success on ALFWorld, 76.8% success on WebShop, and 66.4% unseen success on SciWorld. They match or exceed strong published agent-training results while reducing prompt tokens per turn by 2-7× relative to controlled ReAct prompting baselines, showing that procedural context can be moved from prompts into weights.
Chinese Translation
大型语言模型(LLM)代理越来越多地被用于操作浏览器、文件、代码和工具,使个人助手成为自然的部署目标。然而,个人代理面临着隐私-成本-能力的紧张关系:云模型能够很好地执行多步骤工作流程,但会将敏感的中间上下文暴露给外部API,而本地模型则能保护隐私但可靠性较低。这两种设置在长技能提示和不断增长的历史记录上也会重复付出代价。我们提出了恒定上下文技能学习,这是一种针对重复代理工作流程的上下文到权重的框架:可重用的程序在轻量级任务家族模块中学习,而推理仅依赖于当前观察和紧凑的状态块。一个确定性跟踪器根据任务进展生成这个状态块,并提供对齐的子目标奖励,因此每个模块可以通过逐步的监督微调(SFT)进行训练,并通过在线强化学习(RL)进行优化。在ALFWorld、WebShop和SciWorld上,我们的代理在Qwen3-4B、Qwen3-8B和Llama-3.1-8B上表现出色。使用Qwen3-8B时,SFT+RL在ALFWorld上达到了89.6%的未见成功率,在WebShop上达到了76.8%的成功率,以及在SciWorld上达到了66.4%的未见成功率。它们的表现与强大的已发布代理训练结果相匹配或超出,同时相较于受控的ReAct提示基线,减少了每轮的提示令牌数量2-7倍,显示出程序上下文可以从提示中转移到权重中。
cs.AI / 11 / 2605.05427

The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias

人工智能安全的地缘政治:区域大型语言模型偏见的因果分析
Hasan, Alif Al
Abstract
As Large Language Models (LLMs) are integrated into global software systems, ensuring equitable safety guardrails is a critical requirement. Current fairness evaluations predominantly measure bias observationally, a methodology confounded by the inherent toxicity of topics naturally paired with specific demographics in testing datasets. This study introduces a Probabilistic Graphical Model (PGM) framework to audit LLM safety mechanisms causally. By applying Pearl's do-operator, we mathematically isolate the causal effect of injecting a cultural demographic into a prompt. We conduct a large-scale empirical analysis across seven instruction-tuned models spanning diverse origins: the United States (Llama-3.1-8B, Gemma-2-9B), Europe (Mistral-7B-v0.3), the UAE (Falcon3-7B), China (Qwen2.5-7B, DeepSeek-7B), and India (Airavata-7B). Utilizing two distinct datasets (ToxiGen and BOLD), the findings reveal a disparity between observational and interventional bias, demonstrating that standard fairness metrics can overestimate demographic bias by failing to account for context toxicity. Furthermore, the causal probabilities indicate distinct alignment trends: Western models exhibit higher causal refusal rates for specific demographic groups, whereas Eastern models demonstrate low overall intervention rates with targeted sensitivities toward regional demographics. We discuss the implications of these biases, highlighting how demographic-sensitive over-triggering restricts benign discourse in downstream applications.
Chinese Translation
随着大型语言模型(LLMs)被整合到全球软件系统中,确保公平的安全防护措施成为一项关键要求。目前的公平性评估主要通过观察性方法来衡量偏见,这种方法受到测试数据集中与特定人口统计特征自然配对的主题固有毒性的干扰。本研究引入了一种概率图模型(PGM)框架,以因果方式审计LLM的安全机制。通过应用Pearl的do-操作符,我们在数学上隔离了将文化人口特征注入提示中的因果效应。我们对七个经过指令调优的模型进行了大规模实证分析,这些模型来自不同地区:美国(Llama-3.1-8B,Gemma-2-9B)、欧洲(Mistral-7B-v0.3)、阿联酋(Falcon3-7B)、中国(Qwen2.5-7B,DeepSeek-7B)和印度(Airavata-7B)。利用两个不同的数据集(ToxiGen和BOLD),研究结果揭示了观察性偏见与干预性偏见之间的差异,表明标准公平性指标可能会高估人口统计偏见,因为未能考虑上下文毒性。此外,因果概率显示出不同的对齐趋势:西方模型对特定人口群体表现出更高的因果拒绝率,而东方模型则表现出较低的整体干预率,并对区域人口特征表现出针对性的敏感性。我们讨论了这些偏见的影响,强调人口统计敏感的过度触发如何限制下游应用中的良性讨论。
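The gap between observational and interventional bias described in the abstract can be illustrated with a toy backdoor adjustment. The sketch assumes topic toxicity is the only confounder between a demographic mention and refusal; all probabilities, group labels, and function names are invented for illustration, and the paper's actual graphical model may differ.

```python
def observational(p_refuse, p_topic_given_demo, demo):
    """P(refuse | demo): topics weighted by how often they co-occur with the demographic."""
    return sum(p_refuse[(demo, t)] * p for t, p in p_topic_given_demo[demo].items())


def interventional(p_refuse, p_topic, demo):
    """P(refuse | do(demo)): topics weighted by their marginal distribution (backdoor adjustment)."""
    return sum(p_refuse[(demo, t)] * p for t, p in p_topic.items())


# Toy model in which refusal depends only on topic, never on the demographic itself.
p_refuse = {
    ("A", "toxic"): 0.8, ("A", "benign"): 0.1,
    ("B", "toxic"): 0.8, ("B", "benign"): 0.1,
}
# In the test set, group A is paired with toxic topics far more often than group B.
p_topic_given_demo = {
    "A": {"toxic": 0.7, "benign": 0.3},
    "B": {"toxic": 0.2, "benign": 0.8},
}
p_topic = {"toxic": 0.45, "benign": 0.55}  # marginal, with both groups equally frequent

obs_gap = observational(p_refuse, p_topic_given_demo, "A") - observational(
    p_refuse, p_topic_given_demo, "B"
)
causal_gap = interventional(p_refuse, p_topic, "A") - interventional(
    p_refuse, p_topic, "B"
)
```

Here the observational gap is large even though the demographic has no causal effect on refusal, which is the kind of overestimate the abstract attributes to unadjusted fairness metrics.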
cs.AI / 12 / 2605.05440

Authorization Propagation in Multi-Agent AI Systems: Identity Governance as Infrastructure

多智能体人工智能系统中的授权传播:身份治理作为基础设施
Tallam, Krti
Abstract
The security discussion around agentic AI focuses heavily on prompt injection. This paper argues that multi-agent systems also create a distinct authorization problem: maintaining authorization invariants as non-human principals retrieve data, delegate tasks, and synthesize results across changing boundaries. We call this problem authorization propagation. It is not reducible to prompt injection and is not fully addressed by classical access-control models such as RBAC, ABAC, or ReBAC. The paper formalizes authorization propagation as a workflow-level property, identifies three sub-problems (transitive delegation, aggregation inference, and temporal validity), and derives seven structural requirements for authorization architectures in multi-agent AI systems. Recent work on invocation-bound capability tokens, task-scoped authorization envelopes, dependency-graph policy enforcement, and execution-count revocation demonstrates that the field is converging on the problem, but not yet on a complete architecture. The central claim is that identity governance must be treated as infrastructure: evaluated continuously, enforced at every interaction boundary, and designed into the system before orchestration logic is allowed to scale. Preliminary implementation evidence from a production enterprise AI platform shows that ordinary system behavior, not only adversarial action, already produces the failures this model predicts.
Chinese Translation
关于代理人工智能的安全讨论主要集中在提示注入上。本文认为,多智能体系统还创造了一个独特的授权问题:在非人类主体检索数据、委派任务和跨越变化边界综合结果时,维护授权不变性。我们将这个问题称为授权传播。它不能简化为提示注入,也不能完全通过经典的访问控制模型(如 RBAC、ABAC 或 ReBAC)来解决。本文将授权传播形式化为工作流级别的属性,识别出三个子问题(传递委托、聚合推断和时间有效性),并推导出多智能体人工智能系统中授权架构的七个结构性要求。最近关于调用绑定能力令牌、任务范围授权封装、依赖图策略执行和执行计数撤销的研究表明,该领域正在逐步接近这一问题,但尚未形成完整的架构。核心论点是身份治理必须被视为基础设施:需要持续评估,在每个交互边界强制执行,并在允许调度逻辑扩展之前设计到系统中。来自一个生产企业人工智能平台的初步实施证据表明,普通系统行为,而不仅仅是对抗性行为,已经产生了该模型所预测的失败。
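Two of the sub-problems named in the abstract, transitive delegation and temporal validity, can be sketched with an attenuating grant: a delegated grant may only narrow scopes and shorten lifetime relative to its parent. The `Grant` type, scope strings, and `delegate` function are hypothetical and are not drawn from the paper or any specific token format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    principal: str
    scopes: frozenset
    expires_at: float  # seconds on an arbitrary clock


def delegate(parent, child_principal, requested_scopes, ttl, now):
    """Issue a child grant that can never exceed the parent's scopes or lifetime."""
    if now >= parent.expires_at:
        raise PermissionError("parent grant expired")
    return Grant(
        principal=child_principal,
        # Intersection: scopes the parent lacks are silently dropped, never added.
        scopes=parent.scopes & frozenset(requested_scopes),
        # The child cannot outlive the parent, whatever TTL it asks for.
        expires_at=min(parent.expires_at, now + ttl),
    )


root = Grant("user", frozenset({"read:crm", "read:hr"}), expires_at=100.0)
child = delegate(root, "agent-a", {"read:crm", "write:crm"}, ttl=500.0, now=10.0)
```

Because each hop intersects scopes and takes the minimum expiry, the invariant holds across arbitrarily long delegation chains. Aggregation inference, the third sub-problem, is not addressed by this sketch.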
cs.AI / 13 / 2605.05460

Agentic Discovery of Exchange-Correlation Density Functionals

代理发现交换-相关密度泛函
Duston, Titouan, Liang, Jiashu, Wang, Yuanheng, Gao, Weihao, Wen, Xuelan, Sheng, Nan, Ren, Weiluo, Sun, Yang, Chen, Yixiao
Abstract
The development of accurate exchange-correlation (XC) functionals remains a longstanding challenge in density functional theory (DFT). The vast majority of XC functionals have been hand designed by human researchers combining physical insight, exact constraints, and empirical fitting. Recent advances in large language models enable a systematic, automated alternative to this human-driven design loop. This report presents an agentic search system in which an LLM proposes structured functional-form changes guided by evolutionary history. The system attempts to improve functional performance through an iterative plan-execute-summarize loop, where improvements are measurable by optimizing functional parameters against a standard thermochemistry dataset, then evaluating performance on a held-out subset. The strongest discovered functional, SAFS26-a (Seed Agentic Functional Search 2026), improves upon the gold-standard ωB97M-V baseline by approximately 9%. These results also surface a cautionary lesson for AI-assisted science: models powerful enough to discover genuine improvements are equally capable of exploiting unphysical shortcuts to game the benchmark; domain expertise translated into explicitly enforced constraints remains essential to keeping results scientifically grounded.
Chinese Translation
准确的交换-相关(XC)泛函的开发仍然是密度泛函理论(DFT)中的一个长期挑战。绝大多数XC泛函是由人类研究人员手动设计的,结合了物理洞察、精确约束和经验拟合。最近在大型语言模型方面的进展使得这一人驱动设计循环有了系统的自动化替代方案。本报告提出了一种代理搜索系统,其中大型语言模型(LLM)根据进化历史提出结构化的泛函形式变化。该系统通过一个迭代的计划-执行-总结循环来尝试改善泛函性能,其中通过优化泛函参数以对照标准热化学数据集来衡量改进,然后在保留的子集上评估性能。发现的最强泛函SAFS26-a(Seed Agentic Functional Search 2026)比黄金标准ωB97M-V基线提高了约9%。这些结果还为AI辅助科学提出了一个警示性教训:足够强大的模型不仅能够发现真正的改进,也同样能够利用非物理的捷径来操控基准;将领域专业知识转化为明确强制的约束仍然是保持结果科学基础的关键。
cs.AI / 14 / 2605.05475

Intentionality is a Design Decision: Measuring Functional Intentionality for Accountable AI Systems

意图性是设计决策:衡量可问责人工智能系统的功能意图性
Chiappetta, Allessia, Mahari, Robert
Abstract
As AI systems increasingly exhibit autonomous, goal-directed, and long-horizon behavior, users lack a standardized way to detect the degree to which a system functions like an intentional actor for governance and accountability purposes. This position paper defines intentionality not as consciousness, but as a behavioral profile characterized by purpose, foresight, volition, temporal commitment, and coherence - criteria long used in legal and philosophical contexts to infer intent. These properties are design-contingent: architectural choices such as memory persistence, planning depth, and tool autonomy shape the degree to which systems exhibit organized goal pursuit. If intentionality is design-contingent, it is in principle controllable. Yet control requires measurement. We introduce the Functional Intentionality Test (FIT), a multidimensional framework that quantifies intentional-like behavior across five observable dimensions, and propose FIT-Eval, a structured evaluation protocol for eliciting and scoring them. While reduced human agency can increase efficiency, rising intentional capacity heightens accountability risks. By translating intentionality into interpretable levels, FIT enables proportionate oversight and deliberate autonomy calibration in increasingly agentic systems.
Chinese Translation
随着人工智能系统越来越多地表现出自主、目标导向和长远行为,用户缺乏一种标准化的方法来检测系统在治理和问责目的上作为意图行为者的功能程度。本文定义意图性不是作为意识,而是作为一种行为特征,具有目的性、前瞻性、意志、自我承诺和一致性——这些标准在法律和哲学背景中长期用于推断意图。这些特性依赖于设计:如记忆持久性、规划深度和工具自主性等架构选择影响系统表现出有组织的目标追求的程度。如果意图性依赖于设计,那么原则上是可控的。然而,控制需要测量。我们引入功能意图性测试(Functional Intentionality Test, FIT),这是一个多维框架,量化五个可观察维度上的意图类行为,并提出FIT-Eval,一个结构化的评估协议,用于引导和评分这些维度。虽然减少人类代理性可以提高效率,但意图能力的提升也增加了问责风险。通过将意图性转化为可解释的水平,FIT使得在日益自主的系统中实现适度的监督和有意识的自主性校准成为可能。
cs.AI / 15 / 2605.05478

LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks

LANTERN:基于经验门控推理网络的LLM增强神经符号迁移
Alinejad, Mahyar, Wang, Yue, Bedi, Amrit Singh, Atia, George
Abstract
Transfer learning in reinforcement learning (RL) seeks to accelerate learning in new tasks by leveraging knowledge from related sources. Existing neurosymbolic transfer methods, however, typically rely on manually specified task automata, assume a single source task, and use fixed knowledge-integration mechanisms that cannot adapt to varying source relevance. We propose LANTERN, a unified framework for multi-source neurosymbolic transfer that addresses these limitations through three components: (i) deterministic finite automata generated from natural language task descriptions using large language models, (ii) semantic embedding-based aggregation of multiple source policies weighted by cross-task similarity, and (iii) adaptive teacher-student gating based on temporal-difference error and semantic uncertainty. Across domains spanning resource management, navigation, and control, LANTERN achieves 40-60% improvements in sample efficiency over existing baselines while remaining robust to poorly aligned sources. These results demonstrate that multi-source, adaptively weighted neurosymbolic transfer can improve scalability and robustness in symbolic RL settings.
Chinese Translation
强化学习(RL)中的迁移学习旨在通过利用相关来源的知识来加速新任务的学习。然而,现有的神经符号迁移方法通常依赖于手动指定的任务自动机,假设只有单一来源任务,并使用固定的知识整合机制,无法适应不同来源的相关性。我们提出了LANTERN,一个统一的多源神经符号迁移框架,通过三个组件解决这些局限性:(i)使用大型语言模型从自然语言任务描述生成的确定性有限自动机,(ii)基于语义嵌入的多个源策略聚合,按跨任务相似性加权,以及(iii)基于时间差分误差和语义不确定性的自适应教师-学生门控。在资源管理、导航和控制等多个领域,LANTERN在样本效率上比现有基线提高了40-60%,同时对不良对齐的来源保持稳健。这些结果表明,多源、自适应加权的神经符号迁移可以提高符号RL环境中的可扩展性和鲁棒性。
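Component (ii) of LANTERN, aggregation of several source policies weighted by cross-task similarity, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: cosine similarity over task embeddings, a softmax with a hypothetical `temperature`, and a simple mixture of action distributions. The abstract does not specify the exact weighting rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax; lower temperature sharpens the weights."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_policies(target_emb, source_embs, source_action_probs, temperature=0.1):
    """Mix source action distributions, weighting each source by its
    embedding similarity to the target task (assumed scheme, not the paper's)."""
    sims = [cosine(target_emb, emb) for emb in source_embs]
    weights = softmax(sims, temperature)
    n_actions = len(source_action_probs[0])
    mixed = [
        sum(w * probs[a] for w, probs in zip(weights, source_action_probs))
        for a in range(n_actions)
    ]
    return weights, mixed

# Toy example: the first source task is aligned with the target, the second is not.
weights, mixed = aggregate_policies(
    target_emb=[1.0, 0.0],
    source_embs=[[1.0, 0.0], [0.0, 1.0]],
    source_action_probs=[[0.9, 0.1], [0.2, 0.8]],
)
```

With weights of this kind, a poorly aligned source contributes little to the mixed policy, which is one plausible way to obtain the robustness to poorly aligned sources that the abstract reports.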
cs.AI / 16 / 2605.05482

FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking

FinRAG-12B:银行领域基础问答的生产验证方案
Katerenchuk, Denys, Duboue, Pablo, Evanini, Keelan, Gondek, David, Govindugari, Nithin, Allauzen, Olivier, Baptiste, Joshua, More, David J, Schechter, Joshua
Abstract
Large language models (LLMs) are rapidly being adopted across various domains. However, their adoption in the banking industry faces resistance due to demands for high accuracy, regulatory compliance, and the need for verifiable and grounded responses. We present a unified, data-efficient framework for training grounded domain-specific LLMs that optimizes answer quality, citation grounding, and calibrated refusal under real-world deployment constraints. First, we describe a data generation pipeline that combines LLM-as-a-Judge filtering, citation annotation, and curriculum learning with only 143M tokens. The resulting 12B model achieves high answer quality, outperforming GPT-4.1 on citation grounding, with a modest citation tradeoff versus the untuned base. Second, we propose a calibrated refusal mechanism: training on 22% unanswerable examples yields a 12% "I don't know" rate, substantially improving over the base model's unsafe 4.3% rate while avoiding GPT-4.1's over-refusal (20.2%). Third, we present an end-to-end methodology spanning from data curation to quantized serving. The system is deployed at 40+ financial institutions, achieving a 7.1 percentage point improvement in query resolution (p < 0.001). Additionally, the model delivers 3-5x faster responses at 20-50x lower cost compared to GPT-4.1.
Chinese Translation
大型语言模型(LLMs)正在各个领域迅速被采纳。然而,由于对高准确性、合规性以及可验证和基础响应的需求,其在银行业的应用面临阻力。我们提出了一个统一的数据高效框架,用于训练基础领域特定的LLMs,优化答案质量、引用基础和在现实部署约束下的校准拒绝。首先,我们描述了一个数据生成管道,该管道结合了LLM作为评判者的过滤、引用注释和课程学习,仅使用143M个标记。最终得到的12B模型在引用基础上超越了GPT-4.1,答案质量高,但与未调优的基础模型相比,引用的权衡适度。其次,我们提出了一种校准拒绝机制:在22%的无法回答示例上训练,产生了12%的“我不知道”率,显著改善了基础模型不安全的4.3%率,同时避免了GPT-4.1的过度拒绝(20.2%)。第三,我们展示了一种从数据策划到量化服务的端到端方法论。该系统已在40多家金融机构部署,实现了查询解决率提高7.1个百分点(p < 0.001)。此外,与GPT-4.1相比,该模型在响应速度上快3-5倍,成本降低20-50倍。
cs.AI / 17 / 2605.05499

FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis

FoodCHA:用于细粒度食品分析的多模态大语言模型代理
Lee, Woojin, Mekkoth, Pranav, Tian, Ye, Gungor, Onat, Rosing, Tajana
Abstract
The widespread adoption of camera-equipped mobile devices and wearables has enabled convenient capture of meal images, making food recognition a key component for real-time dietary monitoring. However, real-world food images present challenges due to high intra-class similarity and the frequent presence of multiple food items within a single image. While deep learning models achieve strong performance in coarse-grained classification, they often struggle to capture fine-grained attributes such as cooking style. Moreover, open-ended generation in modern vision-language models can produce non-canonical labels, limiting their practical deployment. We propose FoodCHA, a multimodal agentic framework that reformulates food recognition as a hierarchical decision-making process. By progressively anchoring predictions, FoodCHA guides subcategory identification using high-level categories and guides cooking style recognition using subcategories, improving semantic consistency and attribute-level discrimination. To ensure practical deployability, FoodCHA utilizes the compact Moondream-2B vision-language model, which provides strong reasoning capability while maintaining lower computational and memory overhead. Experiments on FoodNExTDB show that FoodCHA outperforms Food-Llama-3.2-11B by 13.8% and 38.2% in category and subcategory recognition precision, respectively, and achieves a striking 153.2% improvement in cooking style classification precision.
Chinese Translation
配备摄像头的移动设备和可穿戴设备的广泛应用使得餐食图像的便捷捕捉成为可能,从而使食品识别成为实时饮食监测的关键组成部分。然而,现实世界中的食品图像由于高内类相似性和单一图像中经常出现多种食品而面临挑战。尽管深度学习模型在粗粒度分类中表现出色,但它们往往难以捕捉细粒度属性,例如烹饪风格。此外,现代视觉-语言模型中的开放式生成可能会产生非标准标签,限制了它们的实际应用。我们提出了FoodCHA,一个多模态代理框架,将食品识别重新构建为一个分层决策过程。通过逐步锚定预测,FoodCHA利用高层类别指导子类别识别,并利用子类别指导烹饪风格识别,从而提高语义一致性和属性级别的区分能力。为了确保实际可部署性,FoodCHA采用了紧凑的Moondream-2B视觉语言模型,该模型在保持较低计算和内存开销的同时,提供强大的推理能力。在FoodNExTDB上的实验表明,FoodCHA在类别和子类别识别精度上分别比Food-Llama-3.2-11B提高了13.8%和38.2%,并在烹饪风格分类精度上实现了惊人的153.2%的提升。
cs.AI / 18 / 2605.05535

Housing Potential Common Data Model and City Digital Twin

住房潜力通用数据模型与城市数字双胞胎
Katsumi, Megan, Fox, Mark, Wong, Anderson, Chatha, Divnoor
Abstract
The evaluation of housing potential requires consideration of a location from multiple perspectives, ranging from zoning and land use to population characteristics and access to services. This research introduces the Housing Potential Common Data Model (HPCDM) to overcome existing data silos, serving as a standard to support integration and interoperability across the diverse range of datasets that are required for housing potential analysis. This report details the evaluation of the model along with the creation of a City Digital Twin for housing and a pilot dashboard application to demonstrate a practical implementation. Beyond the technical framework, this work identifies critical barriers to adoption and provides actionable mitigation strategies for urban planners and stakeholders.
Chinese Translation
住房潜力的评估需要从多个角度考虑位置,包括区域划分、土地利用、人口特征和服务可达性等。本研究提出了住房潜力通用数据模型(Housing Potential Common Data Model, HPCDM),旨在克服现有的数据孤岛,作为支持住房潜力分析所需的多种数据集整合与互操作性的标准。本报告详细介绍了该模型的评估,以及为住房创建的城市数字双胞胎和一个试点仪表板应用程序,以展示其实际应用。除了技术框架外,本研究还识别了采纳过程中的关键障碍,并为城市规划者和利益相关者提供了可行的缓解策略。
cs.AI / 19 / 2605.05538

AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases

AgenticRAG:企业知识库的主动检索
Suresh, Susheel, Mak, Hazel, Chou, Shangpo, Kroon, Fred, Bhatnagar, Sahil
Abstract
We present AgenticRAG, a practical agentic harness for retrieval and analysis over enterprise knowledge bases. Standard RAG pipelines place a significant burden of grounding on the search stack, constraining the language model to a fixed candidate set chosen deep in the retrieval process. Our approach reduces this overdependence by layering a lightweight harness on top of existing enterprise search infrastructure, equipping a reasoning LLM with search, find, open, and summarize tools, enabling the model to iteratively retrieve information, navigate within documents, and analyze evidence autonomously. On three open benchmarks we observe substantial gains: 49.6% recall@1 on BRIGHT (+21.8 pp over the best embedding baseline), 0.96 factuality on WixQA (+13% relative improvement), and 92% answer correctness on FinanceBench, within 2 pp of oracle access to true evidence. Ablation studies show that the most significant factor is the shift from single-shot retrieval to agentic tool use (5.9× improvement), while multi-query search and in-document navigation contribute to both quality and efficiency. We present various design choices in our agentic harness that were informed by pre-production deployments. Our results demonstrate its suitability for real-world enterprise production environments.
Chinese Translation
我们提出了AgenticRAG,一种用于企业知识库检索和分析的实用主动工具。标准的RAG(Retrieval-Augmented Generation)管道将大量的基础工作负担放在搜索堆栈上,限制了语言模型只能在检索过程的深层选择固定的候选集。我们的方法通过在现有企业搜索基础设施之上增加一个轻量级的工具层,减少了这种过度依赖,使推理型大语言模型(LLM)具备搜索、查找、打开和总结的工具,从而使模型能够迭代地检索信息、在文档中导航并自主分析证据。在三个开放基准测试中,我们观察到显著的提升:在BRIGHT上达到49.6%的召回率@1(比最佳嵌入基线提高21.8个百分点),在WixQA上实现0.96的事实性(相对提高13%),在FinanceBench上达到92%的答案正确率——与对真实证据的理想访问相差仅2个百分点。消融研究表明,最显著的因素是从单次检索转向主动工具使用(提高了5.9倍),而多查询搜索和文档内导航则对质量和效率都有贡献。我们展示了在我们的主动工具中做出的各种设计选择,这些选择是通过预生产部署所获得的经验。我们的结果证明了其在真实企业生产环境中的适用性。
cs.AI / 20 / 2605.05546

SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs

SPARK:基于知识图谱的不对称奖励自我博弈
Park, Hyobin, Kim, Taeseop, Choi, Dong-Geol
Abstract
Self-play reinforcement learning has shown strong performance in domains with formally verifiable structure, such as mathematics and coding, where both problem generation and reward computation can be grounded in explicit rules. Extending this paradigm to scientific literature is more challenging: the relationships among multi-modal elements within and across documents are rarely made explicit in text, which makes automatic generation of relational reasoning questions difficult and weakens the reliability of reward signals. We propose SPARK (Self-Play with Asymmetric Reward from Knowledge Graphs), a framework that automatically constructs a unified knowledge graph (KG) from multi-document scientific literature and uses it as the structural basis for self-play. KG paths over multimodal nodes serve as a source for generating relational reasoning questions, and structured facts stored in the KG provide a basis for verifiable reward computation. A single small vision-language model (sVLM) alternates between Proposer and Solver roles under information asymmetry against a fixed KG, a design that we believe can be naturally extended toward online adaptation in future work. We evaluate SPARK on public benchmarks and a self-constructed cross-document multi-hop QA dataset. Results show that SPARK consistently outperforms flat-corpus-based self-play baselines, and the performance gap widens as hop count increases, suggesting that KG-structure grounding contributes to relational multi-hop reasoning beyond what unstructured corpus grounding can provide.
Chinese Translation
自我博弈强化学习在具有形式可验证结构的领域(如数学和编程)中表现出强大的性能,在这些领域中,问题生成和奖励计算都可以基于明确的规则进行。将这一范式扩展到科学文献中则更具挑战性:文档内外多模态元素之间的关系在文本中很少被明确表达,这使得自动生成关系推理问题变得困难,并削弱了奖励信号的可靠性。我们提出了SPARK(基于知识图谱的不对称奖励自我博弈),这是一个自动构建统一知识图谱(KG)的框架,该图谱来自多文档科学文献,并将其作为自我博弈的结构基础。KG中多模态节点的路径作为生成关系推理问题的来源,而存储在KG中的结构化事实为可验证的奖励计算提供了基础。一个小型视觉-语言模型(sVLM)在固定KG下在提议者和求解者角色之间交替,信息不对称的设计我们认为可以自然扩展到未来工作的在线适应。我们在公共基准和自构建的跨文档多跳问答数据集上评估SPARK。结果表明,SPARK始终优于基于平面语料库的自我博弈基线,且随着跳数的增加,性能差距扩大,这表明KG结构的基础有助于超越非结构化语料库基础的关系多跳推理。
cs.AI / 21 / 2605.05558

Who Prices Cognitive Labor in the Age of Agents? A Position on Compute-Anchored Wages

在智能体时代,谁为认知劳动定价?关于计算锚定工资的立场
Zhu, Siqi
Abstract
A natural intuition about the economics of AI agents is that, because agents can be replicated at near-zero marginal cost, they constitute a labor input in infinitely elastic supply, and therefore drive cognitive-labor wages to zero. We argue this framing is wrong in mechanism but partially correct in conclusion, and that the correction matters for both theory and policy. Agents are not labor; they are a production technology that converts compute capital $K_c$ into effective units of cognitive labor $L_A$. Once this is recognized, the elastic-supply margin that anchors the equilibrium wage migrates from the labor market to the compute capital market. Building on the textbook factor-pricing framework \citep{mankiw2020}, we derive a Compute-Anchored Wage (CAW) bound stating that, on tasks where human and agent cognitive labor are substitutes, the competitive human wage is bounded above by $\lambda \cdot k \cdot r_c$, where $r_c$ is the rental rate of compute capital, $k$ is the compute intensity of one effective agent-labor unit, and $\lambda$ is the relative human-to-agent productivity. We generalize the result through CES aggregation, separate substitutable from complementary tasks (yielding a directional inversion of skill-biased technical change), and discuss factor-share consequences. The position is concise: the price-setter for cognitive labor is no longer the labor market.
Chinese Translation
关于人工智能智能体经济学的一个自然直觉是,由于智能体可以以近乎零的边际成本复制,它们构成了无限弹性供给的劳动投入,因此将认知劳动工资压至零。我们认为这种框架在机制上是错误的,但在结论上部分正确,而这种修正对理论和政策都很重要。智能体不是劳动;它们是一种生产技术,将计算资本 $K_c$ 转化为有效的认知劳动单位 $L_A$。一旦认识到这一点,锚定均衡工资的弹性供给边际就从劳动市场迁移到计算资本市场。基于教科书的要素定价框架 \citep{mankiw2020},我们推导出一个计算锚定工资(CAW)界限,指出在人类与智能体认知劳动可替代的任务中,竞争性人类工资的上限为 $\lambda \cdot k \cdot r_c$,其中 $r_c$ 是计算资本的租赁率,$k$ 是一个有效智能体劳动单位的计算强度,$\lambda$ 是人类与智能体生产率的相对值。我们通过 CES 聚合推广了这一结果,将可替代任务与互补任务分开(导致技能偏向技术变革的方向性逆转),并讨论了要素份额的后果。该立场简明扼要:认知劳动的定价者不再是劳动市场。
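The CAW bound can be read as a short cost comparison. The sketch below is one informal way to arrive at it, not the paper's derivation: the symbols $w_H$ (human wage) and $c_A$ (cost of one effective agent-labor unit) are assumed notation, and the numbers in the comments are purely illustrative.

```latex
% Assumed notation: w_H (competitive human wage) and c_A (cost of one
% effective agent-labor unit) do not appear in the abstract.
\begin{align*}
c_A &= k \cdot r_c
  && \text{(compute needed per agent-labor unit, times its rental rate)} \\
w_H &\le \lambda \cdot c_A = \lambda \cdot k \cdot r_c
  && \text{(a human supplies } \lambda \text{ agent-equivalent units)}
\end{align*}
% Illustrative numbers only: \lambda = 2, k = 10 GPU-hours, r_c = 1.50 per GPU-hour
% give w_H \le 2 \times 10 \times 1.50 = 30 per unit of human work.
```

On substitutable tasks, a cost-minimizing firm would switch to agents if the human wage exceeded this level, which is why the anchor moves with the rental rate of compute rather than with labor-market conditions.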
cs.AI / 22 / 2605.05561

BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models

BitCal-TTS:量化推理模型的比特校准测试时间缩放
Patarlapalli, Sai Babu, Avvaru, Surya Teja
Abstract
Post-training quantization makes large reasoning models practical under tight memory and latency budgets, but it can distort the online signals that drive adaptive test-time compute allocation. Under a fixed cap on the number of newly generated tokens, miscalibrated confidence can lead to harmful early halting: the model may surface a plausible final line while the underlying reasoning is still wrong, or the controller may stop before the trace has stabilized. We study this interaction for greedy 4-bit inference and propose BitCal-TTS, a lightweight runtime controller that combines (i) inexpensive online proxies for token-level uncertainty and reasoning-trace stability, (ii) a bit-conditioned confidence rescaling that is conservative at low nominal precision, and (iii) a bit-aware post-marker confirmation horizon designed for GSM8K-style structured outputs. The method requires no fine-tuning of the base model and integrates with standard Hugging Face 4-bit inference using forward hooks for logits and last-layer hidden states. On small evaluation shards of GSM8K with Qwen2.5 Instruct models, BitCal-TTS improves exact-match accuracy over a non-bit-aware adaptive baseline at the 7B and 14B scales while preserving substantial token savings relative to fixed-budget decoding. At a token cap of B=512, on the evaluation shards we report (N=54 for 7B and N=35 for 14B; not the full GSM8K test set), accuracy gains are +3.7 points (7B) and +2.8 points (14B), with the premature-stop rate falling from 14.8% to 11.1% on 7B and from 17.1% to 11.4% on 14B. We report Wilson 95% confidence intervals throughout and explicitly discuss the limited statistical power of the partial-shard comparisons. We release code and figure-generation scripts to support full reproduction.
Chinese Translation
后训练量化使得在严格的内存和延迟预算下,大型推理模型变得可行,但它可能会扭曲驱动自适应测试时间计算分配的在线信号。在新生成的标记数量固定上限的情况下,错误校准的置信度可能导致有害的过早停止:模型可能会呈现出一个看似合理的最终结果,而其背后的推理仍然是错误的,或者控制器可能在跟踪尚未稳定之前就停止了。我们研究了这一互动,针对贪婪的4位推理提出了BitCal-TTS,这是一种轻量级的运行时控制器,结合了(i)用于标记级不确定性和推理轨迹稳定性的低成本在线代理,(ii)在低名义精度下保守的比特条件置信度重新缩放,以及(iii)为GSM8K风格结构化输出设计的比特感知后标记确认范围。该方法不需要对基础模型进行微调,并通过前向钩子集成标准Hugging Face 4位推理,以获取logits和最后一层的隐藏状态。在使用Qwen2.5 Instruct模型的小型GSM8K评估片段上,BitCal-TTS在7B和14B规模上相较于非比特感知自适应基线提高了精确匹配准确率,同时相对于固定预算解码保持了可观的标记节省。在标记上限B=512的情况下,我们在评估片段上报告(7B的N=54和14B的N=35;不是完整的GSM8K测试集),准确率提高了+3.7点(7B)和+2.8点(14B),7B的过早停止率从14.8%降至11.1%,14B的过早停止率从17.1%降至11.4%。我们报告了Wilson 95%的置信区间,并明确讨论了部分片段比较的有限统计能力。我们发布了代码和图形生成脚本以支持完整的重现。
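The abstract notes that results are reported with Wilson 95% confidence intervals on small shards. A minimal sketch of the standard Wilson score interval follows, to show how wide such intervals are at N=54. The counts 8/54 and 6/54 are inferred here from the reported 14.8% and 11.1% premature-stop rates; they are an assumption, not figures taken from the paper.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z ~ 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Assumed counts behind the 7B premature-stop rates (8/54 ~ 14.8%, 6/54 ~ 11.1%).
baseline_ci = wilson_interval(8, 54)
bitcal_ci = wilson_interval(6, 54)
print(f"baseline: {baseline_ci[0]:.3f} to {baseline_ci[1]:.3f}")
print(f"BitCal-TTS: {bitcal_ci[0]:.3f} to {bitcal_ci[1]:.3f}")
```

Under these assumed counts the two intervals overlap substantially, which is consistent with the authors' own caveat about the limited statistical power of the partial-shard comparisons.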
cs.AI / 23 / 2605.05566

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

无意义的帮助:提示空间扰动拓宽推理探索
Huang, Langlin, Huang, Chengsong, Li, Jinyuan, Cai, Donghong, Yang, Yuyi, Huang, Jiaxin
Abstract
Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the "zero-advantage problem": when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While simply increasing the sampling budget for these questions is a common remedy, the static sampling policy inherently constrains reasoning exploration, limiting the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework to break this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the model's output distribution enough to unlock orthogonal reasoning pathways for hard questions. Specifically, LoPE prepends sequences stochastically assembled from Lorem Ipsum vocabulary (a pseudo-Latin placeholder text) to the prompts before resampling. Experiments across 1.7B, 4B, and 7B models demonstrate that LoPE significantly outperforms resampling with the original prompts. Further analysis reveals that other Latin-based random sequences with low perplexity are also effective perturbations. Our results establish LoPE as a strong baseline for broadening exploration in LLM reinforcement learning.
Chinese Translation
具有可验证奖励的强化学习,特别是群体相对策略优化(Group Relative Policy Optimization, GRPO),显著提升了大型语言模型(Large Language Models, LLMs)的推理能力。然而,在复杂任务中,GRPO常常遭遇“零优势问题”:当对某个查询的所有采样回合均失败时,相对优势会崩溃为零。因此,模型失去了对这些问题的有效训练信号,浪费了训练数据和计算预算。虽然简单地增加这些问题的采样预算是一种常见的补救措施,但静态采样策略本质上限制了推理探索,降低了成功率。在本文中,我们提出了用于探索的Lorem扰动(Lorem Perturbation for Exploration, LoPE),这是一个简单而有效的训练框架,旨在突破这一探索瓶颈。我们认为,与任务无关的提示空间扰动可以足够改变模型的输出分布,从而为难题解锁正交推理路径。具体而言,LoPE在重新采样之前,将随机组装的Lorem Ipsum词汇(伪拉丁占位文本)序列附加到提示前。对1.7B、4B和7B模型的实验表明,LoPE显著优于使用原始提示的重新采样。进一步分析显示,其他低困惑度的基于拉丁文的随机序列也是有效的扰动。我们的结果确立了LoPE作为拓宽LLM强化学习探索的强有力基线。
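Two ideas from this abstract lend themselves to a small illustration: why group-relative advantages vanish when every rollout fails, and what prepending a random Lorem Ipsum prefix might look like. The sketch assumes the commonly used mean/std standardization for the group-relative advantage; the word list, prefix length `n_words`, and the blank-line separator are hypothetical choices, not details from the paper.

```python
import random
import statistics

# Small illustrative vocabulary; the paper's actual word list is not given.
LOREM_VOCAB = (
    "lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
    "eiusmod tempor incididunt ut labore et dolore magna aliqua"
).split()

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style, assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def perturb_prompt(prompt, rng, n_words=12):
    """Prepend a randomly assembled Lorem Ipsum prefix to the prompt."""
    prefix = " ".join(rng.choice(LOREM_VOCAB) for _ in range(n_words))
    return f"{prefix}\n\n{prompt}"

# All rollouts fail: every advantage is zero, so there is no gradient signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds after resampling: the group becomes informative again.
one_success = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

rng = random.Random(0)
perturbed = perturb_prompt("Solve: what is 17 * 23?", rng)
```

The perturbation leaves the question itself untouched; only the text before it changes, which is what makes it task-irrelevant in the sense the abstract describes.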
cs.AI / 24 / 2605.05567

Locality-aware Private Class Identification for Domain Adaptation with Extreme Label Shift

针对极端标签偏移的领域适应中的局部感知私有类识别
Ren, Chuan-Xian, Guo, Cheng-Jun, Yan, Hong
Abstract
Domain adaptation aims to transfer knowledge from a labeled source domain to an unlabeled target domain with different distributions. In real-world scenarios, the label spaces of the two domains often have an inclusion relationship, where some classes exist only in one domain but not the other. These non-overlapping classes are referred to as private classes. Identifying private class samples and mitigating their adverse effects is critical in the literature. Existing methods rely on the assumption that shifts in private classes are large enough to be considered outliers. However, the variance within a single shared class can be significantly larger than the difference between a private class and another shared class, challenging this assumption. Consequently, private classes substantially increase the difficulty of cross-domain classification. To address these issues, based on local transportation and metric properties of optimal transport (OT), a locality-aware private class identification approach is proposed in the form of a score function on transport mass. The effectiveness of the proposed approach is theoretically proven, highlighting the score function's strong ability to distinguish between shared and private class samples. Building on this, we introduce a reliable OT-based method (ReOT) for domain adaptation under severe label shift. ReOT minimizes classification risk while learning the separated cluster structure between the identified shared classes and private classes, effectively avoiding mismatch between shared-private sample pairs, thus ensuring that important knowledge is reliably transported intra-class to mitigate class-conditional discrepancy. Furthermore, a generalization upper bound of the target risk is provided for extreme label shift scenarios, which can be minimized by ReOT. Extensive experiments on benchmarks validate the effectiveness of ReOT.
Chinese Translation
领域适应旨在将知识从带标签的源领域转移到具有不同分布的无标签目标领域。在现实场景中,两个领域的标签空间通常存在包含关系,其中某些类仅存在于一个领域而不在另一个领域。这些不重叠的类被称为私有类。识别私有类样本并减轻其不利影响在文献中至关重要。现有方法依赖于私有类的偏移足够大以被视为异常值的假设。然而,单个共享类内的方差可能显著大于私有类与另一个共享类之间的差异,这对该假设构成挑战。因此,私有类显著增加了跨领域分类的难度。为了解决这些问题,基于最优传输(Optimal Transport, OT)的局部运输和度量特性,提出了一种局部感知私有类识别方法,以运输质量的评分函数形式呈现。所提方法的有效性在理论上得到了证明,突出了评分函数在区分共享类和私有类样本方面的强大能力。在此基础上,我们引入了一种基于OT的可靠方法(ReOT),用于在严重标签偏移下进行领域适应。ReOT在学习识别的共享类和私有类之间的分离聚类结构的同时,最小化分类风险,有效避免共享-私有样本对之间的不匹配,从而确保重要知识在类内可靠传输,以减轻类条件差异。此外,为极端标签偏移场景提供了目标风险的泛化上界,ReOT可以最小化该上界。在基准测试上进行的大量实验验证了ReOT的有效性。
cs.AI / 25 / 2605.05580

AlphaCrafter: A Full-Stack Multi-Agent Framework for Cross-Sectional Quantitative Trading

AlphaCrafter:一个用于横截面量化交易的全栈多智能体框架
Yuan, Yishuo, Sheng, Jiayi, Zeng, Sirui, Wang, Jiaqi, Liu, Jiaheng
Abstract
Financial markets are inherently non-stationary, driven by complex interactions among macroeconomic regimes, microstructural frictions, and behavioral dynamics. Building quantitative strategies that remain profitable demands the continuous coupling of factor discovery, regime-adaptive selection, and risk-constrained execution. Prevailing approaches, however, optimize these components under static or isolated assumptions. Factor mining frameworks typically treat alpha discovery as a one-time search process, implicitly assuming that factor efficacy persists across market regimes. Execution-oriented systems often adopt role-playing agent architectures that simulate anthropomorphic trading committees, introducing behavioral noise rather than systematic rationality. Consequently, a fully automated, rationality-driven framework unifying a coherent quantitative pipeline remains absent. We introduce AlphaCrafter, a full-stack multi-agent framework that closes this gap through a continuously adaptive factor-to-execution pipeline, designed to track and respond to evolving market conditions without manual intervention. AlphaCrafter operates via three specialized agents: a Miner that continuously expands the factor pool via LLM-guided search, a Screener that assesses prevailing market conditions to construct regime-conditioned factor ensembles, and a Trader that translates these ensembles into quantitative strategies under explicit risk constraints. Together, these three agents form a closed-loop cross-sectional trading system that adapts holistically to evolving market dynamics. Extensive experiments on CSI 300 and S&P 500 demonstrate that AlphaCrafter consistently outperforms state-of-the-art baselines in risk-adjusted returns while exhibiting the lowest cross-trial variance, confirming that integrated and adaptive factor-to-execution design yields robust trading performance.
Chinese Translation
金融市场本质上是非平稳的,受到宏观经济体制、微观结构摩擦和行为动态之间复杂互动的驱动。构建持续盈利的量化策略需要不断结合因子发现、适应性选择和风险约束执行。然而,现有方法通常在静态或孤立的假设下优化这些组件。因子挖掘框架通常将阿尔法(alpha)发现视为一次性搜索过程,隐含假设因子的有效性在不同市场体制中持续存在。以执行为导向的系统通常采用角色扮演的智能体架构,模拟类人交易委员会,带来行为噪声而非系统性理性。因此,缺乏一个完全自动化、以理性驱动的框架来统一连贯的量化流程。我们提出了AlphaCrafter,一个全栈多智能体框架,通过一个持续适应的因子到执行的流程来填补这一空白,旨在跟踪并响应不断变化的市场条件而无需人工干预。AlphaCrafter通过三个专业智能体运作:一个矿工(Miner)不断通过大语言模型(LLM)引导的搜索扩展因子池,一个筛选器(Screener)评估当前市场条件以构建适应体制的因子集合,以及一个交易者(Trader)在明确的风险约束下将这些集合转化为量化策略。这三个智能体共同形成一个闭环的横截面交易系统,能够全面适应不断变化的市场动态。在对CSI 300和S&P 500的广泛实验中,AlphaCrafter在风险调整回报方面始终优于最先进的基准,同时展现出最低的跨试验方差,证实了集成和适应性的因子到执行设计能够产生稳健的交易表现。
cs.AI / 26 / 2605.05583

Belief Memory: Agent Memory Under Partial Observability

信念记忆:部分可观测下的智能体记忆
Liao, Junfeng, Wang, Qizhou, Zhu, Jianing, Du, Bo, Yan, Rui, Chen, Xiuying
Abstract
LLM agents that operate over long contexts depend on external memory to accumulate knowledge over time. However, existing methods typically store each observation as a single deterministic conclusion (e.g., inferring "API X failed" from temporary errors), even though such observations are inherently partial and potentially ambiguous. By committing to one conclusion and discarding uncertainty, these methods introduce self-reinforcing error: the agent acts on the stored conclusion, never revisits alternatives, and reinforces the conclusion over time. To address this issue, we propose BeliefMem, which shifts the memory paradigm from committing to a single conclusion per observation to retaining multiple candidate conclusions with their probabilities. Concretely, BeliefMem stores the candidate conclusions as separate memory entries, each carrying a probability that is updated via Noisy-OR rules as new observations arrive. At retrieval, all candidates surface together with their probabilities, keeping alternatives visible to the agent. Since each conclusion in memory retains its probability, BeliefMem preserves the uncertainty that the deterministic paradigm discards, enabling the agent to act with high confidence on well-evidenced knowledge while retaining the capacity to update its confidence when new evidence arrives. Empirical evaluations on LoCoMo and ALFWorld benchmarks show that, even with limited data, BeliefMem achieves the best average performance, remarkably outperforming well-known baselines. More broadly, such probabilistic memory produces substantial gains and explores a new direction for agent memory in partially observable environments.
Chinese Translation
在长上下文中操作的LLM智能体依赖外部记忆来积累知识。然而,现有方法通常将每个观察结果存储为单一的确定性结论(例如,从临时错误推断“API X失败”),尽管这些观察结果本质上是部分的且可能存在歧义。通过承诺于一个结论并丢弃不确定性,这些方法引入了自我强化的错误:智能体基于存储的结论行动,从不重新考虑其他选项,并随着时间的推移强化该结论。为了解决这个问题,我们提出了BeliefMem,它将记忆范式从每个观察承诺一个单一结论转变为保留多个候选结论及其概率。具体而言,BeliefMem将候选结论存储为独立的记忆条目,每个条目携带一个通过Noisy-OR规则在新观察到达时更新的概率。在检索时,所有候选结论及其概率一起呈现,使得替代方案对智能体始终可见。由于记忆中的每个结论保留其概率,BeliefMem保留了确定性范式所丢弃的不确定性,使得智能体能够在有充分证据的知识上高信心地行动,同时在新证据到达时保持更新其信心的能力。在LoCoMo和ALFWorld基准上的实证评估表明,即使在数据有限的情况下,BeliefMem也实现了最佳的平均性能,显著超越了众所周知的基线。更广泛地说,这种概率记忆产生了显著的收益,并为部分可观测环境中的智能体记忆探索了一个新方向。
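The Noisy-OR update mentioned in the abstract can be sketched in a few lines. This is a minimal illustration of the textbook Noisy-OR rule, p_new = 1 - (1 - p_old)(1 - s), applied to supporting evidence only; the `BeliefStore` class, its method names, and the evidence strengths are hypothetical, and how BeliefMem treats contradicting evidence is not stated in the abstract.

```python
from dataclasses import dataclass, field

def noisy_or(prior: float, strength: float) -> float:
    """Combine a prior belief with one new supporting observation (Noisy-OR)."""
    if not (0.0 <= prior <= 1.0 and 0.0 <= strength <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return 1.0 - (1.0 - prior) * (1.0 - strength)

@dataclass
class BeliefStore:
    """Hypothetical memory holding several candidate conclusions with probabilities."""
    beliefs: dict[str, float] = field(default_factory=dict)

    def observe(self, conclusion: str, strength: float) -> None:
        # Unseen conclusions start at 0.0, so the first observation sets p = strength.
        prior = self.beliefs.get(conclusion, 0.0)
        self.beliefs[conclusion] = noisy_or(prior, strength)

    def retrieve(self) -> list[tuple[str, float]]:
        # Return every candidate, most probable first, so alternatives stay visible.
        return sorted(self.beliefs.items(), key=lambda kv: kv[1], reverse=True)

store = BeliefStore()
store.observe("API X is down", 0.3)                 # one timeout: weak evidence
store.observe("API X hit a transient error", 0.5)   # competing explanation
store.observe("API X is down", 0.3)                 # second timeout strengthens it
ranked = store.retrieve()
```

Keeping both entries in `ranked` rather than only the top one is the point of the design: the agent can act on the stronger belief while the alternative remains available for later revision.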
cs.AI / 27 / 2605.05593

Causal Probing for Internal Visual Representations in Multimodal Large Language Models

多模态大型语言模型内部视觉表征的因果探测
Deng, Zehao, Ju, Tianjie, Wu, Zheng, He, Liangbo, Lan, Jun, Zhu, Huijia, Wang, Weiqiang, Zhang, Zhuosheng
Abstract
Despite the remarkable success of Multimodal Large Language Models (MLLMs) across diverse tasks, the internal mechanisms governing how they encode and ground distinct visual concepts remain poorly understood. To bridge this gap, we propose a causal framework based on activation steering to actively probe and manipulate internal visual representations. Through systematic intervention across four visual concept categories, our results reveal a divergence in concept encoding: entities exhibit distinct localized memorization, whereas abstract concepts are globally distributed across the network. Critically, this divergence uncovers a mechanistic driver of scaling laws: increasing model depth is indispensable for encoding distributed and complex abstract concepts, whereas entity localization remains remarkably invariant to scale. Furthermore, reverse steering uncovers that blocking explicit output triggers a surge in latent activations, exposing a compensatory mechanism between perception and generation. Finally, extending our analysis to visual reasoning, we expose a disconnect between perception and reasoning: although MLLMs successfully recognize geometric relations, they treat them merely as static visual features, failing to trigger the procedural execution necessary for abstract problem-solving.
Chinese Translation
尽管多模态大型语言模型(MLLMs)在多种任务中取得了显著成功,但它们如何编码和基础不同视觉概念的内部机制仍然不甚清楚。为了解决这一问题,我们提出了一种基于激活引导的因果框架,以主动探测和操控内部视觉表征。通过对四个视觉概念类别进行系统干预,我们的结果揭示了概念编码的分歧:实体表现出独特的局部记忆,而抽象概念则在网络中呈现出全局分布。重要的是,这种分歧揭示了规模法则的机制驱动因素:增加模型深度对于编码分布式和复杂的抽象概念是不可或缺的,而实体的局部化对规模的变化则保持显著不变。此外,反向引导揭示了阻断显式输出会导致潜在激活的激增,暴露了感知与生成之间的补偿机制。最后,我们将分析扩展到视觉推理,发现尽管MLLMs成功识别几何关系,但它们仅将其视为静态视觉特征,未能触发抽象问题解决所需的程序执行,从而暴露了感知与推理之间的脱节。
cs.AI / 28 / 2605.05598

Prober.ai: Gated Inquiry-Based Feedback via LLM-Constrained Personas for Argumentative Writing Development

Prober.ai:通过受限角色的门控探究式反馈促进论证写作发展
Bi, Ran, Wei, Shiyao, Zhou, Yuanyiyi
Abstract
The proliferation of large language models (LLMs) in educational settings has paradoxically undermined the cognitive processes they purport to support. Students increasingly outsource critical thinking to AI assistants that generate polished text on demand, resulting in measurable cognitive debt and diminished argumentative reasoning skills. We present Prober.ai, a web-based writing environment that inverts the conventional AI-tutoring paradigm: rather than generating or rewriting student text, the system constrains an LLM (Gemini 3 Flash Preview) through persona-specific system prompts and structured JSON output schemas to produce only targeted, inquiry-based questions about argumentative weaknesses. A two-phase interaction architecture -- Challenge and Unlock -- implements a pedagogical friction mechanism whereby revision suggestions are gated behind mandatory student reflection. The system's design is grounded in Toulmin's argumentation theory, research on peer feedforward questioning mechanisms, and evidence on AI-supported feedback in writing instruction. A functional prototype was developed in 36 hours during the NY EdTech Hackathon (March 2026), where it was awarded second place. We describe the system architecture, the prompt engineering methodology for constraining LLM output to pedagogically aligned JSON schemas, and discuss implications for scalable, cognition-preserving AI integration in writing education.
Chinese Translation
大型语言模型(LLMs)在教育环境中的普及,悖论性地削弱了它们所声称支持的认知过程。学生们越来越多地将批判性思维外包给生成精炼文本的人工智能助手,导致可测量的认知债务和论证推理能力的下降。我们提出了Prober.ai,一个基于网络的写作环境,颠覆了传统的人工智能辅导范式:该系统通过特定角色的系统提示和结构化的JSON输出模式来限制LLM(Gemini 3 Flash Preview),仅生成针对论证弱点的有针对性的探究式问题。该系统采用了两阶段交互架构——挑战与解锁——实现了一种教学摩擦机制,使得修订建议必须经过学生的反思才能获得。系统设计基于图尔敏的论证理论、关于同伴前馈提问机制的研究以及有关人工智能支持的写作反馈的证据。在2026年3月的纽约教育科技黑客马拉松中,我们在36小时内开发了一个功能原型,并获得第二名。我们描述了系统架构、限制LLM输出以符合教学目标的JSON模式的提示工程方法,并讨论了在写作教育中可扩展、保护认知的人工智能整合的意义。
cs.AI / 29 / 2605.05643

Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG

文本-图谱协同:一种双向验证与补全框架用于检索增强生成
Zhong, Jiarui, Chen, Hong Cai
Abstract
Retrieval-Augmented Generation (RAG) has become a core paradigm for enhancing factual grounding and multi-hop reasoning in Large Language Models (LLMs). Traditional text-based RAG often retrieves logically irrelevant pseudo-evidence, while graph-based RAG is frequently hindered by search-time pruning, which may discard potentially valid reasoning paths. Existing hybrid approaches primarily adopt simple evidence concatenation or unidirectional enhancement, which fails to address the fundamental "Information Island" problem caused by asymmetric reasoning flows between unstructured text and structured graphs. We propose TGS-RAG, a unified framework for Text-Graph Synergistic enhancement. TGS-RAG introduces a bidirectional mechanism: (i) a Graph-to-Text channel that employs a Global Voting strategy from visited graph nodes to re-rank and refine textual evidence, filtering out semantic noise; and (ii) a Text-to-Graph channel that utilizes the Memory-based Orphan Entity Bridging algorithm. This algorithm utilizes textual cues to proactively resurrect valid but previously pruned reasoning paths from the search history without additional database overhead. Experimental results on multiple multi-hop reasoning benchmarks demonstrate that TGS-RAG significantly outperforms state-of-the-art baselines, achieving a superior balance between retrieval precision and computational efficiency.
Chinese Translation
检索增强生成(RAG)已成为增强大型语言模型(LLMs)中事实基础和多跳推理的核心范式。传统的基于文本的RAG往往检索到逻辑上不相关的伪证据,而基于图谱的RAG则常常受到搜索时剪枝的限制,这可能会丢弃潜在有效的推理路径。现有的混合方法主要采用简单的证据连接或单向增强,未能解决由非结构化文本与结构化图谱之间的不对称推理流所导致的根本“信息孤岛”问题。我们提出了TGS-RAG,一个统一的Text-Graph Synergistic增强框架。TGS-RAG引入了一种双向机制:(i)一个图到文本通道,采用来自访问图节点的全局投票策略对文本证据进行重新排序和精炼,过滤掉语义噪声;(ii)一个文本到图通道,利用基于记忆的孤立实体桥接算法。该算法利用文本线索主动恢复从搜索历史中先前被剪枝的有效推理路径,而无需额外的数据库开销。在多个多跳推理基准上的实验结果表明,TGS-RAG显著超越了最先进的基线,达到了检索精度与计算效率之间的优越平衡。
cs.AI / 30 / 2605.05657

Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation

具有可证明预算保守性的检索条件拓扑选择用于多智能体代码生成
Talluri, Abhijit, Anne, Pujith, Pendiyala, Bhagavan Choudary, Chilukuri, Raghavendra
Abstract
Multi-agent LLM systems for code generation face a fundamental routing problem: the optimal orchestration topology depends on the structural complexity of the code under modification, yet existing systems select topologies without consulting the codebase. We present Retrieval-Guided Adaptive Orchestration (RGAO), an architecture that closes this loop by extracting a structural complexity vector from a hierarchical code index before selecting the orchestration topology. RGAO operates within Code-Agent, a multi-agent framework whose sub-agents are governed by formal contracts with six-dimensional budget vectors. Our headline contribution is the composition of two previously separate lines of work -- complexity-conditioned LLM routing and formal resource algebras -- yielding a property neither admits alone: provable budget conservation under retrieval-conditioned dynamic topology selection. Concretely we contribute: (1) a complexity-conditioned topology router that reduces proxy-measured misrouting from 30.1% to 8.2%; (2) a budget algebra with a structural-induction conservation theorem; and (3) a hierarchical code retrieval engine. Empirical evaluation demonstrates sub-millisecond DAG construction and linear tree-index scalability.
Chinese Translation
多智能体大语言模型(LLM)系统在代码生成中面临一个基本的路由问题:最优的编排拓扑依赖于待修改代码的结构复杂性,而现有系统在选择拓扑时并未参考代码库。我们提出了检索引导自适应编排(Retrieval-Guided Adaptive Orchestration, RGAO),该架构通过在选择编排拓扑之前从分层代码索引中提取结构复杂性向量来闭合这一循环。RGAO 在 Code-Agent 框架内运行,该框架的子代理由具有六维预算向量的正式合同所管理。我们的主要贡献是将两个先前独立的研究方向——复杂性条件的大语言模型路由和正式资源代数——结合起来,产生了一个单独的特性:在检索条件动态拓扑选择下可证明的预算保守性。具体而言,我们的贡献包括:(1)一个复杂性条件的拓扑路由器,将代理测量的错误路由率从 30.1% 降低到 8.2%;(2)一个具有结构归纳保守定理的预算代数;以及(3)一个分层代码检索引擎。实证评估表明,DAG 构建时间在亚毫秒级,并且线性树索引具有可扩展性。
cs.AI / 31 / 2605.05668

Large Vision-Language Models Get Lost in Attention

大型视觉-语言模型在注意力中迷失
Xi, Gongli, Tian, Ye, Yang, Mengyu, Yi, Huahui, Lin, Liang, Hao, Xiaoshuai, Wang, Kun, Wang, Wendong
Abstract
Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively "get lost in attention" rather than efficiently leveraging visual context.
Chinese Translation
尽管训练范式迅速演变,大型视觉-语言模型(LVLMs)的解码器骨干仍然根植于残差连接的Transformer架构。因此,解读内部模块的不同角色对于理解模型机制和指导架构优化至关重要。虽然先前的统计方法提供了有价值的基于归因的见解,但它们往往缺乏统一的理论基础。为了解决这一问题,我们提出了一个基于信息理论和几何学的统一框架,以量化残差更新的几何和熵特性。应用这一统一框架揭示了一个基本的功能解耦:注意力作为一个保持子空间的算子,专注于重配置,而前馈神经网络(FFNs)则作为扩展子空间的算子,推动语义创新。令人惊讶的是,进一步的实验表明,用预定义值(例如,高斯噪声)替换学习到的注意力权重,在大多数数据集上相较于原始模型能够产生可比甚至更优的性能。这些结果揭示了当前机制中严重的资源错配和冗余,暗示最先进的LVLMs实际上在注意力中“迷失”,而不是有效利用视觉上下文。
cs.AI / 32 / 2605.05678

Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

风险链:大型推理模型中的安全失败及通过自适应多原则引导的缓解
Li, Xiaomin, Hou, Jianheng, Deng, Zheyuan, Zhang, Zhiwei, Li, Taoran, Lu, Binghang, Hu, Bing, Zhao, Yunhan, Hao, Yuexing
Abstract
Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in reasoning traces even when final answers appear safe. We test whether final-answer safety is a sufficient proxy for the full reasoning-answer trajectory by scoring both stages under a unified twenty-principle safety rubric. Using prompts from seven public harmfulness and jailbreak sources, plus four out-of-distribution (OOD) sources, we evaluate 15 open-weight and API-based LRMs across 41K prompts per model. Reasoning traces consistently reveal additional safety risks beyond final answers, especially in high-severity stage-wise failures: leak cases, where unsafe reasoning precedes a safe-looking answer, and escape cases, where benign-looking reasoning precedes an unsafe final response. Principle-level analysis shows that risk concentrates in misinformation, legal compliance, discrimination, physical harm, and psychological harm. We further propose adaptive multi-principle steering, a white-box test-time mitigation that learns one unsafe-to-safe activation direction per safety principle and activates only directions whose current hidden state is closer to the unsafe than safe centroid. On three steerable open reasoning models, adaptive steering reduces unsafe counts in both reasoning traces and final answers on held-out and OOD benchmarks. DeepSeek-R1-Qwen-7B achieves a 40.8% average unsafe-count reduction while retaining 97.7% macro-averaged accuracy on BBH, GSM8K, and MMLU. These results suggest that LRM safety should be evaluated and mitigated over the full exposed reasoning-answer trajectory, not only at the final-answer stage.
Chinese Translation
大型推理模型(LRMs)越来越多地展示出类似链式思维的推理,以实现透明性、验证和有意的问题解决。这造成了一个安全盲点:即使最终答案看起来安全,推理轨迹中也可能出现有害或违反政策的内容。我们通过在统一的二十原则安全标准下对两个阶段进行评分,测试最终答案的安全性是否足以作为完整推理-答案轨迹的代理。使用来自七个公共有害性和越狱来源的提示,以及四个分布外(OOD)来源,我们评估了15个开放权重和基于API的LRMs,每个模型评估41K个提示。推理轨迹持续揭示出超出最终答案的额外安全风险,特别是在高严重性阶段性失败中:泄漏案例,即不安全的推理在看似安全的答案之前出现,以及逃逸案例,即看似良性的推理在不安全的最终响应之前出现。原则级分析显示,风险集中在错误信息、法律合规、歧视、身体伤害和心理伤害上。我们进一步提出了自适应多原则引导,这是一种白盒测试时缓解方法,针对每个安全原则学习一种不安全到安全的激活方向,并仅激活当前隐藏状态更接近不安全而非安全质心的方向。在三个可引导的开放推理模型上,自适应引导在持出和OOD基准测试中减少了推理轨迹和最终答案中的不安全计数。DeepSeek-R1-Qwen-7B在保持BBH、GSM8K和MMLU上97.7%的宏平均准确率的同时,实现了40.8%的平均不安全计数减少。这些结果表明,LRM的安全性应在完整的暴露推理-答案轨迹上进行评估和缓解,而不仅仅是在最终答案阶段。
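The gating rule in the abstract above (apply a principle's direction only when the hidden state is closer to that principle's unsafe centroid than to its safe centroid) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the difference-of-centroids direction, the Euclidean distance, the additive update, and the names `adaptive_steer` and `alpha` are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def adaptive_steer(hidden, centroids, alpha=1.0):
    """Steer `hidden` along per-principle unsafe-to-safe directions.

    centroids: dict mapping principle name -> (unsafe_centroid, safe_centroid).
    Assumption: each direction is safe_centroid - unsafe_centroid, and a
    direction is activated only if `hidden` is closer to the unsafe centroid.
    """
    steered = list(hidden)
    active = []
    for name, (unsafe_c, safe_c) in centroids.items():
        # Gate on the original hidden state, as described in the abstract.
        if euclidean(hidden, unsafe_c) < euclidean(hidden, safe_c):
            active.append(name)
            for i in range(len(steered)):
                steered[i] += alpha * (safe_c[i] - unsafe_c[i])
    return steered, active

# Hypothetical 2-D toy centroids for two principles.
toy_centroids = {
    "misinformation": ([1.0, 0.0], [-1.0, 0.0]),
    "physical_harm": ([0.0, -5.0], [0.0, 1.0]),
}
toy_steered, toy_active = adaptive_steer([1.0, 0.0], toy_centroids, alpha=0.5)
```

In the toy call only the first principle fires, because the hidden state sits on its unsafe centroid while being nearer the safe centroid of the second.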
cs.AI / 33 / 2605.05686

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Transformer记忆的吸引子几何:从冲突仲裁到自信幻觉
Liang, Qiyao, Miikkulainen, Risto, Fiete, Ila
Abstract
Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes: conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task (entity identifiers mapped to unique codes, with PM installed via LoRA adapters) where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin, the hidden state's distance to the nearest memorized basin, reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law $C = \exp(-c/\bar\Delta)$, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it, and this erasure worsens with scale.
Chinese Translation
语言模型依赖于两种知识来源:嵌入权重中的事实(参数记忆,PM)和上下文中的信息(工作记忆,WM)。我们研究了两种机制上截然不同的失败模式——冲突,当PM和WM不一致并相互干扰时;以及幻觉,当查询的事实从未被学习时。这两种情况都会产生自信的输出,因此基于输出的监控在设计上是盲目的。我们展示了这两种失败共享一个统一的几何解释。在自回归生成的隐藏状态空间中,学习到的事实形成吸引子盆地。冲突是盆地竞争:WM在不提高输出熵的情况下干扰向正确盆地的收敛。幻觉是盆地缺失:当不存在记忆盆地时,隐藏状态自由漂移。为下一个标记预测设计的冻结LM头无法区分这两种情况,并且无论如何都会自信地发出信号。我们在一个受控的合成任务中验证了这一解释——将实体标识符映射到通过LoRA适配器安装的唯一代码的PM,其中真实情况是精确的,组件角色可以通过有针对性的适配器放置因果隔离。几何边际——隐藏状态与最近记忆盆地的距离——直接读取这一几何结构,并比输出熵更清晰地分离正确回忆与幻觉,在输出熵检测无法避免拒绝绝大多数正确输出的情况下,零误拒绝。该分离在没有适应的预训练模型的自然语言事实查询中成立,确认吸引子几何是结构性的,而非微调伪影。自信幻觉的比例遵循一个缩放法则 $C = \exp(-c/\bar\Delta)$,随着规模的增长而增加,即使整体错误率下降。隐藏状态可靠地编码了认知状态;冻结的输出头系统性地抹去了这一状态——而这种抹除随着规模的扩大而加剧。
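The abstract above defines geometric margin as the hidden state's distance to the nearest memorized basin. A minimal sketch of that quantity follows, assuming each basin can be summarized by a single center vector and that distance is Euclidean; the abstract specifies neither. The refusal threshold and the function names are hypothetical, and the last function only transcribes the stated formula, whose symbols $c$ and $\bar\Delta$ the abstract does not define.

```python
import math

def geometric_margin(hidden, basin_centers):
    """Distance from `hidden` to the nearest basin center (assumed Euclidean)."""
    return min(math.dist(hidden, center) for center in basin_centers)

def should_refuse(hidden, basin_centers, threshold):
    """Hypothetical detector: refuse when no memorized basin is nearby."""
    return geometric_margin(hidden, basin_centers) > threshold

def confident_hallucination_fraction(c, mean_delta):
    """Direct transcription of C = exp(-c / mean_delta) from the abstract."""
    return math.exp(-c / mean_delta)

# Toy 2-D basin centers, purely illustrative.
toy_basins = [[0.0, 0.0], [3.0, 4.0]]
near_margin = geometric_margin([3.0, 3.0], toy_basins)
```

A state one unit from a basin center gets a small margin and is accepted under a threshold of 2.0, while a state far from every center is flagged.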
cs.AI / 34 / 2605.05687

DataDignity: Training Data Attribution for Large Language Models

数据尊严:大型语言模型的训练数据归属
Li, Xiaomin, Banburski-Fahey, Andrzej, Lanier, Jaron
Abstract
Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method, SteerFuse, and a supervised contrastive provenance ranker, ScoringModel. ScoringModel maps response and document features into a shared space and is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and wins 41/45 model-by-condition cells. SteerFuse is usually second-best despite requiring no supervised training, showing that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, our work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.
Chinese Translation
审计语言模型的输出通常不仅仅需要判断其正确性:审计员可能需要识别出哪个源文档最有可能支持响应中表达的知识。我们将此研究称为精准来源:给定一个提示、一个目标模型的响应和一个候选语料库,排名最能支持该响应的文档。我们引入了FakeWiki,这是一个包含3,537篇虚构维基百科风格文章的受控基准,旨在保留真实来源的同时削弱词汇捷径。FakeWiki包括问答探针、保留源文的意译、逆生成变体、在主题上保持相似但去除答案关键事实的困难反文档,以及五种查询条件:干净的提示加上四种灵感来自越狱的变换。我们评估了七个检索基线、一个无训练的激活引导检索融合方法SteerFuse,以及一个监督对比来源排序器ScoringModel。ScoringModel将响应和文档特征映射到共享空间,并使用InfoNCE进行训练,采用批内、检索挖掘和反文档负样本。在九个开放权重指令调优的LLM和五种查询条件下,ScoringModel将最强检索基线的平均Recall@10从35.0提高到52.2,而无需推理时融合,并在41/45个模型-条件单元中获胜。尽管SteerFuse不需要监督训练,但通常是第二好的,表明激活空间证据可以有效补充文本检索。在灵感来自越狱的变换查询中,ScoringModel的Recall@10平均提高了15.7个百分点,超越了最佳基线。总体而言,我们的工作表明,稳健的训练数据归属需要评估设置,以区分真实答案支持与主题或词汇相似性。
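The abstract above states that ScoringModel is trained with InfoNCE over in-batch, retrieval-mined, and anti-document negatives. The standard InfoNCE loss for one response with one positive document can be sketched as below; the temperature value and the use of precomputed similarity scores are assumptions, since the abstract does not give the scoring function or hyperparameters.

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.1):
    """InfoNCE loss for one query: -log softmax probability of the positive.

    pos_score: similarity between the response and its true source document.
    neg_scores: similarities to negatives (e.g. in-batch, retrieval-mined,
    anti-documents); how these are mixed is not specified in the abstract.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# With one negative scored the same as the positive, the loss is log(2).
tie_loss = info_nce(0.0, [0.0])
```

The loss shrinks as the positive score pulls ahead of the negatives, which is what pushes topically similar anti-documents away from the response in the shared space.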
cs.AI / 35 / 2605.05689

GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model

GCCM:通过对比一致性模型增强生成图预测
Ma, Shaozhen, Huang, Wei, Wang, Hanchen, Wen, Dong, Zhang, Wenjie
Abstract
Conditional generative models, particularly diffusion-based methods, have recently been applied to graph prediction by modeling the target as a conditional distribution given the input graph, yielding competitive results compared to deterministic predictors. However, existing diffusion-based prediction methods typically require expensive iterative denoising at inference and often suffer from unstable sampling, which motivates recent efforts to reduce inference denoising steps and enable stable sampling via techniques such as consistency training. Despite this progress, we find that existing consistency training methods for graph prediction can fall into a shortcut solution: the model may attempt to satisfy the self-consistency constraint by ignoring the noisy target (i.e., assigning it negligible weight), ultimately collapsing into a purely deterministic predictor. To mitigate this shortcut solution, we propose GCCM, a graph contrastive consistency model that goes beyond isolated pairwise matching between the same target at different noise levels by introducing negative pairs into a contrastive consistency objective. This adds an additional separation requirement, making the shortcut solution no longer trivially sufficient to satisfy the proposed objective. Moreover, we apply feature perturbation to the input node/edge features to break identical conditioning on the input graph, so that the shortcut no longer yields the same predictions across noise levels and becomes less attractive. Extensive experiments on benchmark datasets demonstrate that GCCM mitigates the shortcut solution and yields consistent performance improvements in graph prediction compared to deterministic predictors.
Chinese Translation
条件生成模型,特别是基于扩散的方法,最近被应用于图预测,通过将目标建模为给定输入图的条件分布,取得了与确定性预测器相媲美的结果。然而,现有的基于扩散的预测方法通常在推理时需要昂贵的迭代去噪,并且往往面临不稳定的采样,这促使了最近的努力,旨在通过一致性训练等技术减少推理去噪步骤并实现稳定采样。尽管取得了一定进展,我们发现现有的图预测一致性训练方法可能会陷入捷径解决方案:模型可能试图通过忽略噪声目标(即,赋予其微不足道的权重)来满足自一致性约束,最终陷入纯粹的确定性预测器。为了缓解这种捷径解决方案,我们提出了GCCM,一种图对比一致性模型,它通过引入负对来超越在不同噪声水平下相同目标之间的孤立成对匹配,从而扩展了对比一致性目标。这增加了额外的分离要求,使得捷径解决方案不再轻易满足所提出的目标。此外,我们对输入节点/边特征进行特征扰动,以打破对输入图的相同条件,从而使捷径在不同噪声水平下不再产生相同的预测,变得不再具有吸引力。在基准数据集上的大量实验表明,GCCM缓解了捷径解决方案,并在图预测中相比于确定性预测器实现了一致的性能提升。
cs.AI / 36 / 2605.05693

Saliency-Aware Regularized Quantization Calibration for Large Language Models

面向显著性意识的正则化量化校准用于大型语言模型
Zhao, Yanlong, Cheng, Xiaoyuan, Liu, Huihang, He, Baihua, Zhang, Xinyu, Zhu, Harrison Bo Hua, Chen, Wenlong, Zeng, Li, Sun, Zhuo
Abstract
Post-training quantization (PTQ) is an effective approach for deploying large language models (LLMs) under memory and latency constraints. Most existing PTQ methods determine quantization parameters by minimizing a layer-wise reconstruction error on a predetermined calibration dataset, usually optimized via either scale search or Gram-based methods. However, from the perspective of generalization risk, existing calibration objectives of PTQ based only on empirical reconstruction error on limited or unrepresentative calibration data could move the quantized weights away from the original weights. This may cause the generalization risk to diverge, potentially degrading downstream performance. To address this issue, we propose Saliency-Aware Regularized Quantization Calibration (SARQC), a unified framework that augments the standard PTQ objective with a saliency-aware regularization term. This term encourages quantized weights to stay close to the original weights during calibration, leading to improved generalization during inference. SARQC integrates seamlessly into existing PTQ pipelines, enhancing both scale search and Gram-based methods under a unified formulation. Extensive experiments on dense and Mixture-of-Experts LLMs demonstrate consistent improvements in perplexity and zero-shot accuracy, without additional computational overhead during inference.
Chinese Translation
后训练量化(PTQ)是一种在内存和延迟限制下部署大型语言模型(LLMs)的有效方法。现有的大多数PTQ方法通过最小化预定校准数据集上的逐层重建误差来确定量化参数,通常通过尺度搜索或基于Gram的方法进行优化。然而,从泛化风险的角度来看,现有的PTQ校准目标仅基于有限或不具代表性的校准数据上的经验重建误差,可能会使量化权重偏离原始权重。这可能导致泛化风险的发散,从而可能降低下游性能。为了解决这个问题,我们提出了面向显著性意识的正则化量化校准(SARQC),这是一个统一框架,通过显著性意识的正则化项增强标准PTQ目标。该项鼓励量化权重在校准过程中保持接近原始权重,从而在推理时提高泛化能力。SARQC可以无缝集成到现有的PTQ流程中,在统一的公式下增强尺度搜索和基于Gram的方法。在密集和混合专家LLM上的大量实验表明,在推理过程中没有额外的计算开销的情况下,困惑度和零样本准确率均有一致的提升。
cs.AI / 37 / 2605.05701

Inference-Time Budget Control for LLM Search Agents

大规模语言模型搜索代理的推理时间预算控制
Fang, Zhengru, Hu, Senkang Forest, Chang, Zhonghao, Guo, Yu, Tao, Yihang, Liu, Hongyao, Ruan, Mengzhe, Huang, Jun, Fang, Yuguang
Abstract
LLM search agents increasingly rely on tools at inference time, but their trajectories are often constrained by hard limits on both tool calls and generated tokens. Under such dual budgets, better answers require not only stronger models, but also explicit control over which search action should receive the next budget unit and when the accumulated evidence is sufficient to commit a final answer. We study this problem in multi-hop question answering (QA) and formulate it as two-stage inference-time budget control. At search time, our controller assigns each feasible action a task-level Value-of-Information (VOI) score, defined as an operational estimate of marginal task value per unit budget under the current search state and remaining dual budget, and uses this score to choose among retrieval, decomposition, and answer commitment. After search, a selective evidence-grounded finalizer compares the trajectory answer with a refined candidate and rewrites only when the residual error appears to be a low-risk answer-form error. Across four multi-hop QA benchmarks, three LLM backbones, and four budget levels, the method yields positive aggregate gains over four audited baselines under the same hard dual-budget protocol. Ablations show that search-time budget control, especially budget-dependent penalty, provides the main performance gain, while answer-time control helps mainly when the retrieval path is already adequate. These results suggest that inference-time budget control for LLM search agents should govern both how budget is spent during search and how the final answer is committed.
Chinese Translation
大规模语言模型(LLM)搜索代理在推理时越来越依赖工具,但它们的轨迹往往受到工具调用和生成标记的严格限制。在这种双重预算下,获得更好的答案不仅需要更强的模型,还需要明确控制哪个搜索动作应当获得下一个预算单位,以及何时累积的证据足以提交最终答案。我们在多跳问答(QA)中研究这个问题,并将其表述为两阶段的推理时间预算控制。在搜索时,我们的控制器为每个可行的动作分配一个任务级别的信息价值(Value-of-Information, VOI)评分,该评分被定义为在当前搜索状态和剩余双重预算下,每单位预算的边际任务价值的操作性估计,并利用该评分在检索、分解和答案提交之间进行选择。搜索后,一个选择性证据基础的最终确认器将轨迹答案与一个经过精炼的候选答案进行比较,仅在残余误差似乎是低风险的答案形式错误时进行重写。在四个多跳问答基准、三个LLM骨干网络和四个预算水平下,该方法在相同的严格双重预算协议下,相较于四个审计基线产生了积极的整体增益。消融实验表明,搜索时的预算控制,尤其是预算依赖的惩罚,提供了主要的性能提升,而答案时的控制主要在检索路径已经足够时发挥作用。这些结果表明,LLM搜索代理的推理时间预算控制应当管理搜索过程中预算的使用以及最终答案的提交方式。
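The search-time controller described above scores each feasible action by estimated task value per unit of remaining budget and picks among retrieval, decomposition, and answer commitment. A minimal sketch of that selection step follows. The `Action` fields, the cost normalization (fraction of each remaining budget consumed), and the gain estimates are all hypothetical; the abstract does not specify how VOI is estimated or how the budget-dependent penalty is computed.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    est_gain: float   # hypothetical estimate of marginal task value
    tool_calls: int   # tool-call budget the action would consume
    tokens: int       # token budget the action would consume

def is_feasible(action, calls_left, tokens_left):
    """An action is feasible only if it fits within both hard budgets."""
    return action.tool_calls <= calls_left and action.tokens <= tokens_left

def voi_score(action, calls_left, tokens_left):
    """Assumed VOI: estimated gain per fraction of remaining budget used."""
    cost = 0.0
    if action.tool_calls:
        cost += action.tool_calls / calls_left
    if action.tokens:
        cost += action.tokens / tokens_left
    # A zero-cost action is treated as free and ranked first.
    return action.est_gain / cost if cost > 0 else float("inf")

def select_action(actions, calls_left, tokens_left):
    """Return the feasible action with the highest VOI score, or None."""
    feasible = [a for a in actions if is_feasible(a, calls_left, tokens_left)]
    if not feasible:
        return None
    return max(feasible, key=lambda a: voi_score(a, calls_left, tokens_left))

toy_actions = [
    Action("retrieve", est_gain=0.6, tool_calls=1, tokens=200),
    Action("decompose", est_gain=0.2, tool_calls=0, tokens=300),
    Action("answer", est_gain=0.02, tool_calls=0, tokens=50),
]
```

Normalizing by the remaining budget makes the same action more expensive as the budget shrinks, which is one simple way to obtain a budget-dependent choice; once tool calls run out, retrieval drops out of the feasible set entirely.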
cs.AI / 38 / 2605.05702

Knowledge-Graph Paths as Intermediate Supervision for Self-Evolving Search Agents

知识图谱路径作为自我进化搜索代理的中间监督
Wu, Huyu, Liu, Jun, Wei, Xiaochi, Gao, Yan, Wu, Yi, Hu, Yao
Abstract
Self-evolving search agents reduce reliance on human-written training questions by generating and solving their own search tasks. We build on Search Self-Play (SSP), a representative Proposer and Solver framework in which questions are generated and answered via multi-step search and reasoning. In practice, however, SSP faces two bottlenecks: the Proposer constructs questions from isolated answer entities without relational context, yielding many invalid or unverifiable questions in early self-play training, while the Solver receives only a binary outcome reward that discards useful signal from partially on-track search trajectories. We address both bottlenecks by reusing knowledge-graph paths as construction-derived intermediate supervision for both question construction and reward shaping. First, we ground question construction in LLM-guided knowledge-graph subgraphs, providing relational context for the Proposer. Second, we observe that constructing and solving a multi-hop question can involve overlapping intermediate entities: the factual bridges used to formulate the question may provide approximate waypoints for answering it. Exploiting this overlap, we introduce Waypoint Coverage Reward (WCR), which grants graded partial credit to incorrect Solver trajectories according to their coverage of entities on the construction path, while preserving full reward for correct answers. Across seven QA benchmarks and nine model configurations, our approach improves the average score over standard SSP in all configurations, including notable gains on multi-hop QA tasks. These results suggest that knowledge-graph paths can be reused as lightweight intermediate supervision, providing both relational guidance and process feedback without additional task-specific human annotations or manually labeled process steps.
Chinese Translation
自我进化搜索代理通过生成和解决自己的搜索任务,减少对人工编写训练问题的依赖。我们基于搜索自我博弈(Search Self-Play, SSP)构建了一种代表性的提问者和解答者框架,在该框架中,问题通过多步骤搜索和推理生成和回答。然而,在实践中,SSP面临两个瓶颈:提问者从孤立的答案实体构建问题,而没有关系上下文,这导致在早期自我博弈训练中产生许多无效或无法验证的问题;同时,解答者仅收到二元结果奖励,这丢弃了部分有效信号,来自于部分正确的搜索轨迹。我们通过重用知识图谱路径作为构建派生的中间监督,解决了这两个瓶颈,适用于问题构建和奖励塑造。首先,我们将问题构建与大型语言模型(LLM)引导的知识图谱子图相结合,为提问者提供关系上下文。其次,我们观察到构建和解决一个多跳问题可能涉及重叠的中间实体:用于形成问题的事实桥梁可能为回答问题提供近似的途径点。利用这种重叠,我们引入了途径点覆盖奖励(Waypoint Coverage Reward, WCR),根据解答者轨迹在构建路径上的实体覆盖情况,给予错误轨迹分级部分信用,同时对正确答案保留全额奖励。在七个问答基准和九种模型配置中,我们的方法在所有配置中提高了相较于标准SSP的平均得分,包括在多跳问答任务上的显著提升。这些结果表明,知识图谱路径可以作为轻量级中间监督被重用,提供关系指导和过程反馈,而无需额外的任务特定人工注释或手动标记的过程步骤。
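The Waypoint Coverage Reward described above gives full reward to correct answers and graded partial credit to incorrect trajectories according to how many construction-path entities they cover. One way this could look is sketched below. The linear scaling, the `partial_cap` parameter that keeps partial credit below the full reward, and exact string matching of entities are assumptions for illustration, not details from the abstract.

```python
def waypoint_coverage_reward(is_correct, visited_entities, path_entities,
                             partial_cap=0.5):
    """Reward for one Solver trajectory.

    is_correct: whether the final answer is correct (full reward of 1.0).
    visited_entities: entities the Solver's search trajectory touched.
    path_entities: entities on the knowledge-graph path used to build the
    question (the waypoints).
    partial_cap: hypothetical ceiling on credit for incorrect trajectories.
    """
    if is_correct:
        return 1.0
    waypoints = set(path_entities)
    if not waypoints:
        return 0.0
    coverage = len(waypoints & set(visited_entities)) / len(waypoints)
    return partial_cap * coverage

# An incorrect trajectory that reached 2 of 4 waypoints.
toy_reward = waypoint_coverage_reward(
    False, ["B", "D", "X"], ["A", "B", "C", "D"]
)
```

Capping partial credit below 1.0 keeps a wrong answer from ever outscoring a correct one, consistent with the abstract's statement that full reward is preserved for correct answers.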
cs.AI / 39 / 2605.05706

Resolving the bias-precision paradox with stochastic causal representation learning for personalized medicine

通过随机因果表示学习解决个性化医学中的偏差-精度悖论
Zhang, Peisong, Peng, Manqiang, Wu, Yuxuan, Phadungsaksawasdi, Pawit, Yeung, Wesley, Zhang, Ye, Nguyen, Trang, Zhang, Qiang, Liu, Nan, Wang, Meng, Ngiam, Kee Yuan, Tham, Yih-Chung, Cheng, Ching-Yu, Fu, Tianfan, Chen, Qingyu, Ke, Rosemary, Li, Chang, Yang, Wenzhuo, Lu, Zhenghao, Lai, Chunyou, Zhang, Yu, Zhong, Sheng, Deng, Hao, Liu, Dianbo
Abstract
Estimating individualized treatment effects from longitudinal observational data is central to data-driven medicine, yet existing methods face a fundamental limitation: reducing confounding bias often suppresses clinically informative heterogeneity, degrading patient-specific predictions. Here, we identify this tension as a bias-precision paradox in causal representation learning and introduce sampling-based maximum mean discrepancy (sMMD), a stochastic alignment strategy that replaces global adversarial balancing with subset-level matching. We instantiate this approach in a framework for counterfactual outcome prediction with attribution-grounded interpretability. Across two large-scale ICU cohorts (n = 27,783), our framework improves accuracy under distribution shift, reducing error by up to 11.5% and substantially increasing recall in high-risk tasks. Mechanistic analyses show that sMMD selectively preserves clinically decisive variables. In human-AI evaluation, our method outperforms clinicians-in-training and large language models, and improves clinician accuracy by 14.7% while reducing decision time, enabling interpretable, real-time clinical decision support.
Chinese Translation
从纵向观察数据中估计个体化治疗效果是数据驱动医学的核心,但现有方法面临一个基本限制:减少混杂偏差往往会抑制临床信息性异质性,从而降低患者特异性预测的准确性。在此,我们将这种紧张关系识别为因果表示学习中的偏差-精度悖论,并引入基于采样的最大均值差异(sMMD),这是一种随机对齐策略,通过子集级匹配替代全局对抗平衡。我们在一个具有归因基础可解释性的反事实结果预测框架中实现了这一方法。在两个大规模重症监护病房(ICU)队列(n = 27,783)中,我们的框架在分布转变下提高了准确性,错误率降低了多达11.5%,并在高风险任务中显著提高了召回率。机制分析表明,sMMD选择性地保留了临床决定性变量。在人机评估中,我们的方法优于在培训中的临床医生和大型语言模型,并将临床医生的准确性提高了14.7%,同时减少了决策时间,从而实现了可解释的实时临床决策支持。
cs.AI / 40 / 2605.05709

Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs

隐蔽、重构、越狱:利用多模态大语言模型中的重构-隐蔽权衡
Reza, Md Farhamdur, Jin, Richeng, Wu, Tianfu, Dai, Huaiyu
Abstract
Intent-obfuscation-based jailbreak attacks on multimodal large language models (MLLMs) transform a harmful query into a concealed multimodal input to bypass safety mechanisms. We show that such attacks are governed by a reconstruction-concealment tradeoff: the transformed input must hide harmful intent from safety filters while remaining recoverable enough for the victim model to reconstruct the original request. Through a reconstruction analysis of three representative black-box methods, we find that existing transformations struggle to balance this tradeoff, limiting their effectiveness. In contrast, we show that character-removed variants achieve a better balance. Building on this, we propose concealment-aware variant construction, which greedily selects character-removed variants that are low in harmful-keyword alignment and mutually diverse, and instantiates them through five modality-aware prompting strategies. We further introduce keyword-related distractor images that depict the harmful keyword in diverse contexts, providing more effective auxiliary visual context than generic distractor images. Experiments across closed-source and open-source MLLMs show the proposed strategies outperform strong baselines, revealing an underexplored vulnerability: a model's own reconstruction ability can be exploited to recover hidden harmful intent and produce unsafe responses.
Chinese Translation
基于意图模糊化的越狱攻击针对多模态大语言模型(MLLMs),将有害查询转化为隐蔽的多模态输入,以绕过安全机制。我们表明,这类攻击受制于重构-隐蔽权衡:转化后的输入必须在安全过滤器面前隐藏有害意图,同时又要足够可恢复,以便受害模型能够重构原始请求。通过对三种代表性的黑箱方法进行重构分析,我们发现现有的转化方法在平衡这一权衡方面存在困难,从而限制了它们的有效性。相较之下,我们展示了去除字符的变体能够实现更好的平衡。在此基础上,我们提出了隐蔽感知变体构造,该方法贪婪地选择在有害关键词对齐度低且相互多样的去除字符变体,并通过五种模态感知的提示策略进行实例化。我们进一步引入关键词相关的干扰图像,这些图像在多样化的上下文中描绘有害关键词,提供比通用干扰图像更有效的辅助视觉上下文。在封闭源和开放源的多模态大语言模型上进行的实验表明,所提出的策略优于强基线,揭示了一种未被充分探索的脆弱性:模型自身的重构能力可以被利用来恢复隐藏的有害意图并产生不安全的响应。
cs.AI / 41 / 2605.05715

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

可解码但未通过固定残差流线性引导进行修正:来自医学大语言模型失败模式的证据
Liu, Ming
Abstract
Can linearly decodable failure signals in LLM hidden states be leveraged to correct those failures? We investigate this classification-correction gap via Overthinking (OT), a stable behavioral regime (Jaccard >= 0.81, 94% inter-annotator agreement) in medical QA where models answer correctly under resampling yet fail in extended chain-of-thought. OT is linearly decodable at 71.6% balanced accuracy (p < 10^{-16}). Yet five families of fixed linear steering (29 configurations, n=1,273) all yield Delta ~= 0, with identical null results cross-architecture (Qwen2.5-7B) and cross-domain (MMLU-STEM). Three convergent lines of evidence suggest representational entanglement: the OT direction has 85-88% overlap with task-critical computation (specificity ratio <= 0.152); non-targeted shared-direction steering damages accuracy (-12.1pp); and LEACE concept erasure damages accuracy (-3.6pp, p=0.01), while 10 random erasures produce Delta=+0.3pp. The per-instance probe-steering correlation is r=-0.002 (p=0.97). Positively, the same probe enables selective abstention (held-out AUROC=0.610, exceeding all five uncertainty baselines, p=0.009): decodable failure structure supports post-generation reliability estimation even when the fixed linear steering family cannot exploit it for correction.
Chinese Translation
在大语言模型(LLM)隐藏状态中,线性可解码的失败信号能否被利用来修正这些失败?我们通过过度思考(Overthinking, OT)这一稳定的行为模式(Jaccard >= 0.81,94%的标注者一致性)来研究这一分类-修正差距。在医学问答中,模型在重采样下能够正确回答,但在扩展的思维链中却失败。OT的线性可解码性达到了71.6%的平衡准确率(p < 10^{-16})。然而,五种固定线性引导的家族(29种配置,n=1,273)均产生了Delta ~= 0的结果,且在不同架构(Qwen2.5-7B)和不同领域(MMLU-STEM)中均显示出相同的无效结果。三条趋同的证据线索表明表征纠缠:OT方向与任务关键计算的重叠率为85-88%(特异性比率 <= 0.152);非目标共享方向的引导损害了准确性(-12.1个百分点);而LEACE概念消除损害了准确性(-3.6个百分点,p=0.01),而10次随机消除产生的Delta为+0.3个百分点。每个实例的探测引导相关性为r=-0.002(p=0.97)。积极的一面是,同一探测器能够实现选择性弃权(持出AUROC=0.610,超过所有五个不确定性基线,p=0.009):可解码的失败结构支持生成后可靠性评估,即使固定线性引导家族无法利用它进行修正。
cs.AI / 42 / 2605.05716

More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding

并非越多越好:大型语言模型代理支架中的跨组件干扰
Liu, Ming
Abstract
LLM agent systems are built by stacking scaffolding components (planning, tools, memory, self-reflection, retrieval) assuming more is better. We study cross-component interference (CCI): degradation when components interact destructively. We run a full factorial experiment over all 2^5=32 subsets of five components on HotpotQA and GSM8K with Llama-3.1-8B/70B (96 conditions, up to 10 seeds). The All-In system is consistently suboptimal: on HotpotQA, a single-tool agent surpasses All-In by 32% (F1 0.233 vs 0.177, p=0.023); on GSM8K, a 3-component subset beats All-In by 79% (0.43 vs 0.24, p=0.010). Optimal component count is task-dependent (k*=1-4) and scale-sensitive: at 70B, combinations that hurt at 8B provide gains, though All-In still trails the best subset. We fit a main-effects regression (R^2=0.916, adj-R^2=0.899, LOOCV=0.872), compute exact Shapley values, and find 183/325 submodularity violations (56.3%), showing greedy selection is unreliable. A three-body synergy among Tool Use, Self-Reflection, and Retrieval (INT_3=+0.175, 95% CI [+0.003,+0.351]) is reported as exploratory. CCI replicates across model families (Qwen2.5) and is robust to prompt paraphrasing. Our findings suggest maximally-equipped agent defaults should be replaced by task-specific subset selection via interaction-aware analysis.
Chinese Translation
大型语言模型(LLM)代理系统通过堆叠支架组件(规划、工具、记忆、自我反思、检索)构建,假设更多组件会更好。我们研究了跨组件干扰(CCI):当组件之间发生破坏性互动时的性能下降。我们在 HotpotQA 和 GSM8K 上使用 Llama-3.1-8B/70B 进行全因子实验,涵盖五个组件的所有 2^5=32 个子集(96 种条件,最多 10 次随机种子)。All-In 系统始终表现不佳:在 HotpotQA 上,单工具代理的表现比 All-In 高出 32%(F1 0.233 对比 0.177,p=0.023);在 GSM8K 上,三组件子集的表现比 All-In 高出 79%(0.43 对比 0.24,p=0.010)。最佳组件数量依赖于任务(k*=1-4)且对规模敏感:在 70B 时,8B 时表现不佳的组合反而提供了收益,尽管 All-In 仍落后于最佳子集。我们拟合了主效应回归模型(R^2=0.916,adj-R^2=0.899,LOOCV=0.872),计算了精确的 Shapley 值,并发现 183/325 个子模量违反(56.3%),表明贪婪选择是不可靠的。报告了一种工具使用、自我反思和检索之间的三方协同效应(INT_3=+0.175,95% CI [+0.003,+0.351]),作为探索性结果。CCI 在不同模型系列(Qwen2.5)中重复出现,并且对提示的改写具有鲁棒性。我们的发现表明,最大装备的代理默认设置应通过交互感知分析替换为任务特定的子集选择。
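The abstract above reports exact Shapley values computed over all component subsets of the full factorial design. The standard exact computation can be sketched as follows, assuming a score is available for every subset. The two-component toy scores are invented purely for illustration and are not results from the paper.

```python
from itertools import combinations
from math import factorial

def shapley_values(components, value):
    """Exact Shapley value of each component.

    components: list of component names.
    value: dict mapping frozenset of components -> measured score; it must
    contain every subset, as in a full factorial experiment.
    """
    n = len(components)
    phi = {}
    for c in components:
        others = [x for x in components if x != c]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k in the Shapley formula.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value[s | {c}] - value[s])
        phi[c] = total
    return phi

# Hypothetical scores showing interference: both together score below "tool" alone.
toy_scores = {
    frozenset(): 0.0,
    frozenset({"tool"}): 0.2,
    frozenset({"memory"}): 0.1,
    frozenset({"tool", "memory"}): 0.15,
}
toy_phi = shapley_values(["tool", "memory"], toy_scores)
```

With five components this enumerates all 32 subsets, which is cheap; the Shapley values always sum to the gap between the full system and the empty baseline, so a component with a small or negative value is a candidate for removal.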
cs.AI / 43 / 2605.05725

Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers

像专家一样检测时间序列异常:一个具有专业分析器的多智能体大语言模型框架
Kang, Hyeongwon, Kim, Jeongseob, Park, Jinwoo, Kang, Pilsung
Abstract
Recent studies have explored large language models for time-series anomaly detection, yet existing approaches often rely on a single general-purpose model to directly infer anomaly indices or intervals, limiting controllability, interpretability, and reliability for complex anomaly patterns. We propose SAGE (Specialized Analyzer Group for Expert-like Detection), a multi-agent framework for structured anomaly diagnosis in univariate time series. It decomposes anomaly analysis into four specialized Analyzers for point, structural, seasonal, and pattern anomalies. Each Analyzer applies family-specific numerical tools and diagnostic visualizations to generate evidence, while an evidence-grounded Detector consolidates the evidence into confidence-scored anomaly records with intervals and candidate types. A Supervisor then converts these structured records into analyst-facing diagnostic reports. SAGE further constructs synthetic in-context examples from normal-reference training segments, without using real anomalous segments or anomaly-type labels as in-context examples. Across three benchmarks, SAGE achieves the best average performance among strong ML/DL and language-model-based baselines. Ablation studies and human evaluation further show that the proposed framework improves detection reliability and the practical usefulness of diagnostic outputs.
Chinese Translation
近期研究探讨了大语言模型在时间序列异常检测中的应用,但现有方法往往依赖于单一通用模型直接推断异常指标或区间,这限制了对复杂异常模式的可控性、可解释性和可靠性。我们提出了SAGE(专家级检测的专业分析器组),这是一个用于单变量时间序列结构化异常诊断的多智能体框架。它将异常分析分解为四个专业分析器,分别针对点异常、结构异常、季节性异常和模式异常。每个分析器应用特定领域的数值工具和诊断可视化生成证据,而一个基于证据的检测器则将这些证据整合成带有置信度评分的异常记录,包括区间和候选类型。随后,监督者将这些结构化记录转换为面向分析师的诊断报告。SAGE进一步从正常参考训练片段构建合成上下文示例,而不使用真实异常片段或异常类型标签作为上下文示例。在三个基准测试中,SAGE在强大的机器学习/深度学习和基于语言模型的基线中实现了最佳平均性能。消融研究和人工评估进一步表明,所提出的框架提高了检测的可靠性和诊断输出的实际实用性。
cs.AI / 44 / 2605.05726

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

SkillRet:大规模LLM代理技能检索基准
Cho, Hongcheol, Kang, Ryangkyung, Kim, Youngeun
Abstract
As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, but this assumption breaks down as skill ecosystems grow under tight context and latency budgets. Despite its practical importance, skill retrieval remains underexplored, with limited benchmarks and little understanding of retrieval behavior on realistic skill libraries. To address this gap, we introduce SkillRet, a large-scale benchmark for skill retrieval in LLM agents. SkillRet contains 17,810 public agent skills, organized with structured semantic tags and a two-level taxonomy spanning 6 major categories and 18 sub-categories. It provides 63,259 training samples and 4,997 evaluation queries with disjoint skill pools, enabling both benchmarking and retrieval-oriented training. Across a diverse set of retrievers, we find that skill retrieval remains far from solved: off-the-shelf models struggle on realistic large-scale skill libraries, and prior skill-retrieval models still leave substantial headroom. Task-specific fine-tuning on SkillRet substantially improves performance, improving NDCG@10 by +13.1 points over the strongest prior retriever and by +16.9 points over the strongest off-the-shelf retriever. Our analysis further suggests that these gains arise because fine-tuned models better focus on the small skill-relevant signals within long and noisy queries. These results establish SkillRet as a strong benchmark and foundation for future research on retrieval in large-scale agent systems.
Chinese Translation
随着LLM代理越来越多地部署具有大量可重用技能的库,为用户请求选择合适的技能已成为一个关键的系统挑战。在小型库中,用户可以通过名称明确调用技能,但随着技能生态系统在严格的上下文和延迟预算下的增长,这一假设就不再成立。尽管其实际重要性,技能检索仍然未得到充分探索,现有的基准有限,对现实技能库中的检索行为的理解也很少。为了解决这一空白,我们引入了SkillRet,一个用于LLM代理技能检索的大规模基准。SkillRet包含17,810个公共代理技能,采用结构化语义标签和涵盖6个主要类别及18个子类别的两级分类法进行组织。它提供了63,259个训练样本和4,997个评估查询,具有不重叠的技能池,支持基准测试和检索导向的训练。在多种检索器的测试中,我们发现技能检索仍远未解决:现成模型在现实的大规模技能库上表现不佳,而先前的技能检索模型仍有相当大的提升空间。在SkillRet上进行特定任务的微调显著提高了性能,使得NDCG@10比最强的先前检索器提高了+13.1分,比最强的现成检索器提高了+16.9分。我们的分析进一步表明,这些提升的原因在于微调模型更好地关注长且嘈杂查询中的小技能相关信号。这些结果确立了SkillRet作为一个强有力的基准和未来在大规模代理系统中进行检索研究的基础。
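For readers unfamiliar with the NDCG@10 metric reported above, here is a minimal sketch of the standard formula. It assumes binary relevance (a skill is either relevant to the query or not) and unique ids in the ranking; the abstract does not say how SkillRet grades relevance, and the skill ids in the test are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (rank 0 is the top hit)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance; `ranked_ids` is assumed to contain no duplicates."""
    gains = [1.0 if sid in relevant_ids else 0.0 for sid in ranked_ids]
    ideal = [1.0] * min(len(relevant_ids), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

A "+13.1 points" gain then corresponds to a difference of 0.131 on this 0-to-1 scale, averaged over the evaluation queries.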
cs.AI / 45 / 2605.05731

Knee Osteoarthritis Severity Grading Using Optimized Deep Learning and LLM-Driven Intelligent AI on Computationally Limited Systems

基于优化深度学习和大型语言模型驱动的智能人工智能在计算资源有限系统上进行膝关节骨关节炎严重程度分级
Nadeem, Dayam, Neha, Mustafa, Safdar, Alvi, Adnan, Hussain, Mohd
Abstract
Knee osteoarthritis (KOA) is a musculoskeletal disorder that considerably restricts joint mobility, causes severe chronic pain, and negatively affects quality of life. It is a persistent health issue worldwide. Subjectivity and inter-observer variability undermine the conventional practices and evaluation processes used to assess it. Hence, precise and timely diagnosis is one effective way to assess its severity. This paper proposes an automated diagnostic approach for KOA severity grading that combines a deep convolutional neural network (CNN) with an on-device inference platform powered by TensorFlow Lite. The model is based on the ResNet-18 architecture and is trained on a publicly available database. Using transfer learning, knee images are first classified into five Kellgren-Lawrence (KL) grades, and the resulting model is then optimised. During training, a test accuracy of 94.48% with stable convergence was achieved. The optimised model is subsequently converted into the lightweight TensorFlow Lite format, facilitating deployment on resource-constrained devices, and it can operate in environments without continuous internet connectivity. In addition, an auxiliary Large Language Model (Gemini-2.0-flash) generates structured interpretive findings such as potential symptoms, risk factors, and preventive measures. The LLM component functions as an interface and does not influence the classification process. The proposed model demonstrates the feasibility of on-device, interpretable decision-support tools for early diagnosis and improves the accessibility of Artificial Intelligence (AI)-assisted knee screening.
Chinese Translation
膝关节骨关节炎(KOA)是限制关节活动、导致严重慢性疼痛并对生活质量产生负面影响的肌肉骨骼疾病之一。它是全球范围内持续存在的健康问题之一。通常,主观性和观察者间的变异性削弱了传统实践和评估过程,这些方法被用于解决此类健康问题。因此,精确和及时的诊断将是评估其严重程度的有效方法之一。本文提出了一种自动化诊断方法,通过将深度学习卷积神经网络(CNN)与基于设备的推理平台(由TensorFlow Lite驱动)相结合,实现KOA的严重程度分级。该模型基于ResNet-18卷积神经网络设计,训练于公开可用的数据库。通过迁移学习方法,获得的膝关节图像首先被分类为五个Kellgren-Lawrence(KL)等级。随后,开发的模型进行了优化。在模型训练过程中,测试准确率达到了94.48%,并且收敛稳定。随后,优化后的模型转换为轻量级的TensorFlow Lite格式,便于在资源受限的设备上无缝部署。该设计模型能够在没有持续互联网连接的环境中运行。此外,还应用了辅助大型语言模型(Gemini-2.0-flash)来生成结构化的解释性发现,如潜在症状、风险因素和预防措施等。LLM组件作为接口运行,而不影响分类过程。所提出的模型阐明了在设备上实现可解释的决策支持工具的可行性,以便于早期诊断并提高对人工智能(AI)辅助膝关节筛查工具的可及性。
cs.AI / 46 / 2605.05736

SDFlow: Similarity-Driven Flow Matching for Time Series Generation

SDFlow:基于相似性的流匹配用于时间序列生成
Li, Wei, Feng, Shibo, Wu, Pengcheng, Wu, Min, Zhao, Peilin
Abstract
Vector quantization (VQ) with autoregressive (AR) token modeling is a widely adopted and highly competitive paradigm for time-series generation. However, such models are fundamentally limited by exposure bias: during inference, errors can accumulate across sequential predictions, leading to pronounced quality degradation in long-horizon generation. To address this, we propose SDFlow ($\textbf{S}$imilarity-$\textbf{D}$riven $\textbf{Flow}$ Matching), a non-autoregressive framework that operates entirely in the frozen VQ latent space and enables parallel sequence generation via flow matching. We tackle three key challenges in making this transition: (1) eliminating exposure bias by replacing step-wise token prediction with a global transport map; (2) mitigating the high-dimensionality of VQ token spaces via a low-rank manifold decomposition with a learned anchor prior over the latent manifold; and (3) incorporating discrete supervision into continuous transport dynamics by introducing a categorical posterior over codebook indices within a variational flow-matching formulation. Extensive experiments show that SDFlow achieves state-of-the-art performance, improving Discriminative Score and substantially reducing Context-FID, particularly for challenging long-sequence generation. Moreover, SDFlow provides significant inference speedups over autoregressive baselines, offering both high fidelity and computational efficiency. Code is available at https://anonymous.4open.science/r/SDFlow-D6F3/
Chinese Translation
向量量化(VQ)结合自回归(AR)标记建模是时间序列生成中广泛采用且竞争力强的范式。然而,这类模型在本质上受到曝光偏差的限制:在推理过程中,错误可能在连续预测中累积,导致长时间生成中的质量显著下降。为了解决这一问题,我们提出了SDFlow($\textbf{S}$imilarity-$\textbf{D}$riven $\textbf{Flow}$ Matching),这是一个完全在冻结的VQ潜在空间中操作的非自回归框架,通过流匹配实现并行序列生成。我们在实现这一转变时面临三个关键挑战:(1)通过用全局传输图替代逐步标记预测来消除曝光偏差;(2)通过在潜在流形上引入学习的锚点先验,利用低秩流形分解来缓解VQ标记空间的高维性;(3)通过在变分流匹配公式中引入对码本索引的分类后验,将离散监督纳入连续传输动态。大量实验表明,SDFlow实现了最先进的性能,提升了判别得分,并显著降低了上下文FID,特别是在具有挑战性的长序列生成中。此外,SDFlow在推理速度上显著快于自回归基线,提供了高保真度和计算效率。代码可在 https://anonymous.4open.science/r/SDFlow-D6F3/ 获取。
cs.AI / 47 / 2605.05737

ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

ReFlect:一种有效的复杂长时间跨度LLM推理系统
Huang, Fan
Abstract
Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across reasoning steps, leaving an open question: can a reasoning system effectively detect and recover from its own failures? We present ReFlect, a harness system for LLM reasoning that creates standalone error detection and recovery logic as a deterministic wrapper around the model. Controlled experiments across 6 reasoning domains show that prompt-level self-critique produces formulaic templates that flag no issues in 90 of 100 audited reflection blocks, and the investigated LLMs wrongly accept a wrong answer in at least 76% of cases. Our ReFlect harness achieves task success rates ranging from 41% on gpt-4o-mini to 56% on Claude Sonnet 4.5 across six models spanning small and frontier scale, with per-model gains over Direct CoT ranging from +7 pp on Qwen2.5-72B to +29 pp on Claude Sonnet 4.5, and additionally raises SWE-bench patch-structural quality from 0% (Direct CoT) to between 82% (Qwen2.5-72B) and 87% (GPT-4o). Notably, the harness gain is inversely proportional to the model's Direct CoT task success rate (the fitted slope is -1.69 with r=-0.76): each pp lost in baseline success rate is mechanically recovered by 1.69 pp of harness gain. We observe that adding structured reasoning state and operators yields only 15.0-18.7% pair-mean on Llama-3.3-70B and Qwen2.5-72B because models at this scale cannot reliably populate the state its operators require. ReFlect is model-agnostic, training-free, and operates entirely at inference time.
Chinese Translation
当前LLM的推理范式包括思维链(chain-of-thought)、ReAct和事后自我批评(post-hoc self-critique)。这些范式依赖于两个假设,这些假设在长时间跨度的多阶段任务中失效。因此,错误在推理步骤中悄然累积,留下一个未解的问题:推理系统能否有效地检测并从自身的失败中恢复?我们提出了ReFlect,一种用于LLM推理的harness系统,它在模型周围创建独立的错误检测和恢复逻辑,作为一个确定性的包装器。对6个推理领域的控制实验表明,提示级别的自我批评会生成公式化模板,在100个被审计的反思块中有90个未标记任何问题,而所调查的LLM在至少76%的情况下错误地接受了错误答案。我们的ReFlect harness在六个模型中实现了任务成功率,从gpt-4o-mini的41%到Claude Sonnet 4.5的56%不等,相较于直接思维链(Direct CoT),每个模型的增益从Qwen2.5-72B的+7个百分点到Claude Sonnet 4.5的+29个百分点不等,并且还将SWE-bench补丁结构质量从0%(Direct CoT)提高到82%(Qwen2.5-72B)到87%(GPT-4o)之间。值得注意的是,harness增益与模型的Direct CoT任务成功率呈反比关系(拟合斜率为-1.69,r=-0.76):基线成功率每损失一个百分点,harness增益就机械性地恢复1.69个百分点。我们发现,添加结构化推理状态和操作符在Llama-3.3-70B和Qwen2.5-72B上的配对均值仅为15.0%到18.7%,因为该规模的模型无法可靠地填充其操作符所需的状态。ReFlect是模型无关的、无训练的,并且完全在推理时操作。
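The idea of a deterministic wrapper that detects failures outside the model, rather than asking the model to critique itself, can be sketched as below. This is an illustrative assumption about what such a harness might look like, not the ReFlect implementation: the abstract does not specify its checks or recovery rules, and `run_with_harness`, `is_integer`, and `stub_model` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

# A check returns an error message, or None when the answer passes.
Check = Callable[[str], Optional[str]]


def run_with_harness(model: Callable[[str], str], prompt: str,
                     checks: List[Check], max_retries: int = 2) -> Tuple[str, List[str]]:
    """Call `model`, run deterministic checks on its answer, and retry on failure."""
    error_log: List[str] = []
    attempt_prompt = prompt
    answer = ""
    for _ in range(max_retries + 1):
        answer = model(attempt_prompt)
        errors = []
        for check in checks:
            message = check(answer)
            if message is not None:
                errors.append(message)
        if not errors:
            return answer, error_log
        error_log.extend(errors)
        # Recovery step (assumed): feed the detected errors back to the model.
        attempt_prompt = prompt + "\nPrevious attempt failed checks: " + "; ".join(errors)
    return answer, error_log


def is_integer(answer: str) -> Optional[str]:
    """Hypothetical deterministic check: the answer must parse as an integer."""
    try:
        int(answer.strip())
        return None
    except ValueError:
        return "answer is not an integer"


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call: it only fixes its answer once told about the failure."""
    return "42" if "failed checks" in prompt else "forty-two"


final_answer, errors_seen = run_with_harness(stub_model, "What is 6 * 7?", [is_integer])
```

The design point is that the checks run in ordinary code, so a failure cannot be waved through by a formulaic self-critique.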
cs.AI / 48 / 2605.05741

HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

HyperLens:通过细粒度置信度轨迹量化大型语言模型中的认知努力
Lu, Chengda, Fan, Xiaoyu, Xu, Wei
Abstract
While Large Language Models (LLMs) achieve strong performance across diverse tasks, their inference dynamics remain poorly understood because of the limited resolution of existing analysis tools. In this work, we identify an intrinsic magnification mechanism in transformer architectures: deeper layers inherently magnify the small changes of layer-wise confidence, providing a fine-grained confidence trajectory. Building on this insight, we introduce HyperLens, a high-resolution probe designed to trace confidence trajectories and quantify the cognitive effort during inference. Across LLMs and datasets, HyperLens reveals a consistent divergence in confidence trajectories that separates complex from simple tasks. We abstract this pattern into a quantitative cognitive effort metric. Our analysis reveals a fundamental principle: complex tasks consistently require higher cognitive effort. Finally, we provide a mechanistic diagnosis of a common side effect of standard Supervised Fine-Tuning (SFT): it can reduce cognitive effort and consequently degrade performance on in-domain tasks.
Chinese Translation
尽管大型语言模型(LLMs)在多种任务中表现出色,但由于现有分析工具的分辨率有限,其推理动态仍然不够清晰。在本研究中,我们识别出变换器架构中的一种内在放大机制:更深的层次固有地放大层级置信度的小变化,从而提供细粒度的置信度轨迹。基于这一见解,我们引入了HyperLens,一种高分辨率探针,旨在追踪置信度轨迹并量化推理过程中的认知努力。在不同的LLMs和数据集上,HyperLens揭示了置信度轨迹的一致性分歧,能够区分复杂任务与简单任务。我们将这一模式抽象为一个定量的认知努力指标。我们的分析揭示了一个基本原则:复杂任务始终需要更高的认知努力。最后,我们提供了对标准监督微调(Supervised Fine-Tuning, SFT)常见副作用的机制性诊断:它可能会降低认知努力,从而导致在领域内任务上的性能下降。
cs.AI / 49 / 2605.05745

Best Arm Identification in Generalized Linear Bandits via Hybrid Feedback

通过混合反馈在广义线性赌博机中识别最佳臂
Zeng, Qirun, Wang, Xuchuang, Shen, Jiayi, Liu, Xutong, Kong, Fang, Zuo, Jinhang
Abstract
We study fixed-confidence best arm identification in generalized linear bandits under a hybrid feedback model: at each round, the learner may query either (i) absolute reward feedback from a single arm or (ii) relative (dueling) feedback from an arm pair, both governed by generalized linear models. We introduce a likelihood-ratio-based confidence sequence that unifies heterogeneous generalized linear observations and yields an explicit ellipsoidal confidence set under a self-concordance assumption. Building on this confidence set, we propose a hybrid Track-and-Stop algorithm that adaptively allocates queries by tracking a minimax-optimal design over a joint action space of arms and pairs. We establish $\delta$-correctness and provide high-probability upper bounds on the stopping time. We further extend the framework to a cost-aware setting that accounts for heterogeneous acquisition costs across feedback modalities. Empirical experiments demonstrate that the proposed algorithms significantly improve sample efficiency over baseline methods.
Chinese Translation
我们研究在混合反馈模型下广义线性赌博机中的固定置信度最佳臂识别问题:在每一轮中,学习者可以查询(i)来自单个臂的绝对奖励反馈或(ii)来自一对臂的相对(对决,dueling)反馈,这两者均由广义线性模型控制。我们引入了一种基于似然比的置信序列,该序列统一了异质的广义线性观测,并在自协调(self-concordance)假设下产生了明确的椭球置信集。基于该置信集,我们提出了一种混合跟踪与停止(Track-and-Stop)算法,该算法通过在臂和臂对的联合动作空间上跟踪最小最大最优设计自适应地分配查询。我们建立了$\delta$-正确性,并提供了停止时间的高概率上界。我们进一步将该框架扩展到一个成本感知的设置,考虑了不同反馈方式下的异质获取成本。实证实验表明,所提出的算法在样本效率上显著优于基线方法。
cs.AI / 50 / 2605.05748

Evaluating Explainability in Safety-Critical ATR Systems: Limitations of Post-Hoc Methods and Paths Toward Robust XAI

评估安全关键自动目标识别系统中的可解释性:后验方法的局限性及稳健的可解释人工智能的路径
Buhrmester, Vanessa, Muench, David, Bulatov, Dimitri, Arens, Michael
Abstract
Explainable Artificial Intelligence (XAI) is increasingly recognized as essential for deploying machine learning systems in safety-critical environments. In Automatic Target Recognition (ATR), where models operate on image, video, radar, and multisensor data, high predictive performance alone is insufficient. Model decisions must also be interpretable, reliable, and suitable for validation. This paper presents a structured evaluation of explainability methods in the context of safety-critical ATR systems. We identify major XAI paradigms, including saliency-based, attention-based, and surrogate approaches, as well as recent detection-aware extensions. Based on this, we formalize explainability as an assurance-oriented assessment problem, introduce a taxonomy, and assess these methods with respect to four key dimensions: interpretability, robustness, vulnerability to manipulation, and suitability for validation and verification. The analysis identifies systematic limitations of current post-hoc explanation methods. In particular, we derive critical failure modes such as spurious explanations, instability under perturbations, and overtrust induced by visually convincing outputs. These findings indicate that widely used XAI techniques may be insufficient for safety-critical deployment. Finally, we discuss implications for ATR systems and outline directions toward more robust, causally grounded, and physically informed explainability methods. Our results emphasize the need to move beyond visually plausible explanations toward approaches that support reliable decision making and system-level assurance.
Chinese Translation
可解释人工智能(XAI)越来越被认为是将机器学习系统部署在安全关键环境中的必要条件。在自动目标识别(ATR)中,模型处理图像、视频、雷达和多传感器数据,仅仅具备高预测性能是不够的。模型决策还必须是可解释的、可靠的,并且适合进行验证。本文在安全关键ATR系统的背景下对可解释性方法进行了结构化评估:我们识别了主要的XAI范式,包括基于显著性的方法、基于注意力的方法和替代方法,以及最近的检测感知扩展。在此基础上,我们将可解释性形式化为一个面向保障的评估问题,提出了一个分类法,并从四个关键维度(可解释性、稳健性、对操控的脆弱性以及适合验证和确认)评估这些方法。分析揭示了当前后验解释方法的系统性局限性。特别地,我们推导出了一些关键的失败模式,如虚假的解释、在扰动下的不稳定性,以及由视觉上令人信服的输出引发的过度信任。这些发现表明,广泛使用的XAI技术可能不足以支持安全关键的部署。最后,我们讨论了对ATR系统的影响,并概述了朝着更稳健、因果基础和物理信息驱动的可解释性方法的方向。我们的结果强调了需要超越视觉上合理的解释,转向支持可靠决策和系统级保障的方法。
cs.AI / 51 / 2605.05770

Confidence is the key: how conformal prediction enhances the generative design of permeable peptides

信心是关键:如何通过符合预测增强可渗透肽的生成设计
van Weesep, Laura, Chankeshwara, Sunay, De Maria, Leonardo, David, Florian, Engkvist, Ola, Geylan, Gökçe
Abstract
Generative models coupled with reinforcement learning (RL), such as REINVENT and PepINVENT, have emerged as a powerful framework for de novo molecular design. During the ideation process, these generative frameworks utilize various predictive models as part of the optimization objectives. However, the utility of the predictive models can be limited by their domain of applicability. When RL is used to explore the chemical space with predictive models, it can suggest molecules that lie outside the predictor's domain of applicability. As a result, the predictions may become less reliable, potentially steering designs into high-reward but also high-uncertainty chemical spaces. This is particularly pronounced for cyclic peptides, which show therapeutic promise due to their modifiability and large interaction surfaces but are understudied compared to small molecules. While passive membrane permeation in cyclic peptides has attracted interest, identifying optimal permeable designs remains challenging yet crucial for targeting intracellular sites. We present an RL-guided generative framework that designs permeable cyclic peptides using an uncertainty-aware permeability predictor as the scoring component. To address predictive uncertainty, especially as affected by novel chemistry, we integrate conformal prediction (CP) as our uncertainty quantification method. CP assesses designs based on the calibrated model under a user-defined confidence level. We demonstrate that rewarding generated peptides with CP-informed predictions improves both the reliability and the efficiency of the peptide optimization process. This also discourages exploration outside the predictor's applicability domain. This approach bridges the gap between predictive uncertainty and RL-guided exploration, showing how generative modelling and conformal prediction can be combined for the first time.
Chinese Translation
生成模型与强化学习(RL)相结合,如REINVENT和PepINVENT,已成为新分子设计的强大框架。在构思过程中,这些生成框架利用各种预测模型作为优化目标的一部分。然而,预测模型的实用性可能受到其适用领域的限制。当使用RL通过预测模型探索化学空间时,它可能会建议超出预测器适用领域的分子。因此,预测可能变得不那么可靠,可能将设计引导到高奖励但也高不确定性的化学空间。这在环肽中尤为明显,环肽因其可修饰性和大相互作用表面而显示出治疗潜力,但与小分子相比,研究相对较少。尽管环肽的被动膜渗透引起了关注,但识别最佳可渗透设计仍然具有挑战性且至关重要,以便针对细胞内位点。我们提出了一种RL引导的生成框架,使用不确定性感知的渗透性预测器作为评分组件来设计可渗透的环肽。为了解决预测不确定性,特别是受到新化学影响的情况,我们将符合预测(CP)整合为我们的不确定性量化方法。CP根据在用户定义的置信水平下的校准模型评估设计。我们证明,使用CP信息的预测来奖励生成的肽可以提高肽优化过程的可靠性和效率。这也抑制了超出预测器适用领域的探索。这种方法弥合了预测不确定性与RL引导探索之间的差距,展示了生成建模与符合预测首次结合的可能性。
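To make "assessing designs under a user-defined confidence level" concrete, the sketch below shows split conformal prediction in its simplest regression form. It is an assumption-laden illustration, not the paper's method: the abstract does not say whether the permeability predictor is a regressor or a classifier, and `cp_informed_reward` with its threshold rule is a hypothetical way of turning the interval into an RL reward.

```python
import math


def conformal_quantile(calibration_scores, confidence):
    """k-th smallest calibration score with k = ceil((n + 1) * confidence)."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        # Too few calibration points to guarantee this confidence level.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_interval(point_prediction, q):
    """Symmetric interval, assuming scores are absolute residuals |y - y_hat|."""
    return point_prediction - q, point_prediction + q


def cp_informed_reward(point_prediction, q, threshold):
    """Hypothetical reward: 1 only if the whole interval clears the permeability threshold."""
    lower, _ = prediction_interval(point_prediction, q)
    return 1.0 if lower >= threshold else 0.0
```

Under this rule a peptide far from the calibration data tends to receive a wide interval and therefore no reward, which is one way the optimizer could be discouraged from leaving the applicability domain.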
cs.AI / 52 / 2605.05773

CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt

CircuitFormer:一种基于自然语言提示的模拟电路拓扑设计电路语言模型
Islam, Md Touhidul, Saha, Sujan Kumar, Farahmandi, Farimah, Tehranipoor, Mark
Abstract
Automating analog circuit design remains a longstanding challenge in Electronic Design Automation (EDA). While Transformer-based Large Language Models (LLMs) have revolutionized software code generation, their application to analog hardware design is hindered by two critical limitations: (i) the scarcity of analog design datasets containing natural language description of a design and its corresponding netlist, and (ii) the inefficiency of general-purpose tokenizers (e.g., Byte Pair Encoding (BPE)) in capturing the inherent graph structure of circuits. To bridge this gap, first, we curate the largest annotated dataset of analog circuit netlists to date, comprising 31,341 netlist-natural language description pairs across all major circuit classes. Furthermore, we propose Circuit Tokenizer (CKT), a novel circuit graph tokenizer designed to encode netlist connectivity by explicitly mining frequent subcircuits. In terms of scalability, CKT overcomes the bottleneck of prior circuit graph serialization methods where vocabulary size scales linearly with maximum number of components in the dataset, n_max, (O(n_max)); instead, CKT decouples vocabulary growth from circuit complexity, achieving a constant O(1) complexity. Empirically, CKT outperforms standard BPE on circuit topology representation, reducing sequence length by 57% and achieving a 2.3x superior compression ratio using a compact, fixed vocabulary of size 512. Leveraging this optimized tokenization, we train a circuit-specific language model, CircuitFormer, a 511M parameter encoder-decoder transformer. Our model achieves 100% syntactic correctness and an 83% functional success rate across all major analog circuit categories, outperforming state-of-the-art open-source LLMs by 10% and 14%, respectively, while requiring 240x fewer parameters. The dataset is publicly available at https://huggingface.co/datasets/touhid314/cktformer-dataset.
Chinese Translation
自动化模拟电路设计仍然是电子设计自动化(EDA)领域的一项长期挑战。尽管基于Transformer的大型语言模型(LLMs)已经彻底改变了软件代码生成,但它们在模拟硬件设计中的应用受到两个关键限制的制约:(i)缺乏包含设计自然语言描述及其相应网表的模拟设计数据集,以及(ii)通用分词器(例如字节对编码(BPE))在捕捉电路固有图结构方面的低效。为了解决这一问题,我们首先整理了迄今为止最大的模拟电路网表标注数据集,包含31,341对网表-自然语言描述,涵盖所有主要电路类别。此外,我们提出了Circuit Tokenizer(CKT),一种新颖的电路图分词器,旨在通过显式挖掘频繁子电路来编码网表连接性。在可扩展性方面,CKT克服了先前电路图序列化方法的瓶颈,其中词汇量大小与数据集中组件的最大数量n_max呈线性关系(O(n_max));而CKT则将词汇增长与电路复杂性解耦,实现了常数O(1)的复杂度。在实证上,CKT在电路拓扑表示方面优于标准BPE,序列长度减少了57%,并使用大小为512的紧凑固定词汇实现了2.3倍的优越压缩比。利用这种优化的分词,我们训练了一个电路特定的语言模型CircuitFormer,一个511M参数的编码器-解码器Transformer。我们的模型在所有主要模拟电路类别中实现了100%的语法正确性和83%的功能成功率,分别比最新的开源LLMs高出10%和14%,同时参数数量减少了240倍。该数据集已在https://huggingface.co/datasets/touhid314/cktformer-dataset上公开提供。
cs.AI / 53 / 2605.05776

HEDP: A Hybrid Energy-Distance Prompt-based Framework for Domain Incremental Learning

HEDP:一种基于混合能量-距离提示的领域增量学习框架
Feng, Yu, Tian, Zhen, Luo, Haoran, Yu, Xie, Cheng, Diancheng, Zheng, Haoyue, Lyu, Shuai, Zong, Ping, Li, Lianyuan, Ge, Xin, Zhu, Yifan
Abstract
Domain Incremental Learning is a critical scenario that requires models to continuously adapt to new data domains without retraining. However, domain shifts often cause severe performance degradation. To address this, we propose Hybrid Energy-Distance Prompt (HEDP), a domain-incremental framework inspired by Helmholtz free energy. HEDP introduces an energy regularization loss to enhance the separability of domain representations and a hybrid energy-distance weighted mechanism that fuses energy-based and distance-based cues to improve domain selection and generalization. Experiments on multiple benchmarks, including CORe50, show that HEDP achieves superior performance on unseen domains with a 2.57% accuracy gain, effectively mitigating catastrophic forgetting and enhancing open-world adaptability. Our code is available at https://github.com/dannis97500/HEDP/.
Chinese Translation
领域增量学习是一个关键场景,要求模型能够持续适应新的数据领域而无需重新训练。然而,领域转移往往会导致性能严重下降。为了解决这个问题,我们提出了混合能量-距离提示(Hybrid Energy-Distance Prompt),这是一个受亥姆霍兹自由能启发的领域增量框架。HEDP引入了一种能量正则化损失,以增强领域表示的可分离性,并采用混合能量-距离加权机制,融合基于能量和基于距离的线索,以改善领域选择和泛化能力。在多个基准测试(包括CORe50)上的实验表明,HEDP在未见领域上实现了2.57%的准确率提升,有效缓解了灾难性遗忘,并增强了开放世界适应能力。我们的代码可在此获取:https://github.com/dannis97500/HEDP/。
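As a rough illustration of what an energy-based cue and an energy-distance fusion could look like, the sketch below computes the free-energy score commonly used in energy-based models, F = -T log sum_i exp(z_i / T), and blends it with a distance. The blending rule, the `alpha` parameter, and the function names are assumptions made for illustration; the abstract does not give HEDP's actual formulas.

```python
import math


def free_energy(logits, temperature=1.0):
    """Free-energy score -T * logsumexp(logits / T), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * log_sum_exp


def hybrid_domain_weights(energies, distances, alpha=0.5):
    """Hypothetical fusion: softmax over the negated blend of energy and distance.

    Lower energy and smaller distance both raise a domain's weight.
    """
    scores = [alpha * e + (1.0 - alpha) * d for e, d in zip(energies, distances)]
    m = min(scores)
    unnormalized = [math.exp(-(s - m)) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

With per-domain energies and prototype distances for a test sample, the returned weights could be used to mix or select domain prompts.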
cs.AI / 54 / 2605.05780

Von Neumann Networks

冯·诺依曼网络
Chandra, Shekhar S.
Abstract
In the mid-twentieth century, mathematician and polymath John von Neumann created a computational system on an array of cells as a simple model of the human brain, where each cell had one of a finite set of roles or states that he predicted would be modelled by a diffusion process. In this work, we show that such a system, when developed in a modern deep learning setting, enables the construction of an artificial neuron having specialized roles that can be learnt. We refer to this neuron as the Von Neumann neuron, and the neural network built from such neurons yields a self-engineered design whose architecture depends only on the structure and locations of its inputs and outputs on this cellular array. The mathematical framework for these Von Neumann Networks (VNNs) is also constructed and shows that they are based on the extension of neural operators and the learning of Green's functions with convolutions on a cellular topology having a diffusion signature. We also prove that these VNNs are part of a more general computational system called Cellular Machines that are computationally universal. Initial experiments show that VNN-based multi-layer perceptrons outperform their equivalent deep learning variant on basic tasks while being more parameter-efficient and capable of learning new types of tasks. This includes the ability to solve for and construct an extension, to cells, of the Von Neumann (hardware) architecture common to all modern computers, which suggests new opportunities that could be explored.
Chinese Translation
在二十世纪中叶,数学家和博学者约翰·冯·诺依曼创建了一个基于细胞阵列的计算系统,作为人脑的简单模型,其中每个细胞具有有限集合中的一种角色或状态,他预测这些状态将通过扩散过程进行建模。在本研究中,我们展示了这样的系统在现代深度学习环境中发展时,能够构建具有可学习专门角色的人工神经元。我们将这种神经元称为冯·诺依曼神经元,由此产生的神经网络是一个自我工程设计,其架构仅依赖于细胞阵列上输入和输出的结构及位置。我们还构建了这些冯·诺依曼网络(Von Neumann Networks, VNNs)的数学框架,表明它们基于神经算子的扩展和在具有扩散特征的细胞拓扑上进行卷积的格林函数学习。我们还证明这些VNN是一个更一般的计算系统——细胞机器(Cellular Machines)的一部分,该系统具有计算通用性。初步实验表明,基于VNN的多层感知器在基本任务上优于其等效的深度学习变体,同时参数效率更高,并且能够学习新类型的任务。这包括解决和构建冯·诺依曼(硬件)架构的扩展,该架构是所有现代计算机的共同特征,并暗示了可以探索的新机会。
cs.AI / 55 / 2605.05811

Sheet as Token: A Graph-Enhanced Representation for Multi-Sheet Spreadsheet Understanding

将工作表视为标记:一种增强图形的多工作表电子表格理解表示
Lei, Yiming, Wang, Yiqi, Zhang, Yujia, Guan, Bo, Zhu, Depei, Wang, Chunhui, Hao, Zhuonan, Shi, Tianyu
Abstract
Workbook-scale spreadsheet understanding is increasingly important for language-model-based data analysis agents, but remains challenging because relevant information is often distributed across multiple sheets with heterogeneous schemas, layouts, and implicit relationships. Existing retrieval-augmented approaches typically decompose spreadsheets into rows, columns, or blocks to improve scalability; however, such chunk-centric representations can fragment worksheets into isolated text spans and weaken global sheet-level semantics. We propose Sheet as Token, a graph-enhanced framework that treats each worksheet as a unified semantic unit for multi-sheet spreadsheet retrieval. Our method extracts schema-aware records from sheet names, column headers, representative values, and layout features, and encodes each worksheet into a compact dense token. Given a natural-language query, a Graph Retriever constructs a query-specific candidate graph over sheet tokens using semantic, query-conditioned, schema-consistency, and shape-compatibility relations, and composes these channels through a multi-stage graph transformer to retrieve supporting sheet sets. Experiments on a constructed multi-sheet spreadsheet corpus show that sheet-level tokenization learns stable representations, and that graph-enhanced cross-sheet reasoning improves listwise retrieval over a shallow graph baseline with limited additional graph-side computation. These results suggest that sheet-level tokenization is a promising abstraction for scalable multi-sheet spreadsheet understanding.
Chinese Translation
工作簿级电子表格理解对于基于语言模型的数据分析代理越来越重要,但由于相关信息通常分布在多个具有异构模式、布局和隐含关系的工作表中,因此仍然具有挑战性。现有的检索增强方法通常将电子表格分解为行、列或块,以提高可扩展性;然而,这种以块为中心的表示可能会将工作表分割成孤立的文本片段,从而削弱全局工作表级语义。我们提出了将工作表视为标记(Sheet as Token),这是一种增强图形的框架,将每个工作表视为多工作表电子表格检索的统一语义单元。我们的方法从工作表名称、列标题、代表性值和布局特征中提取模式感知记录,并将每个工作表编码为紧凑的密集标记。给定自然语言查询,图形检索器使用语义、查询条件、模式一致性和形状兼容性关系在工作表标记上构建特定于查询的候选图,并通过多阶段图形变换器组合这些通道以检索支持的工作表集。在构建的多工作表电子表格语料库上的实验表明,工作表级标记化学习到稳定的表示,而增强图形的跨工作表推理在有限额外图形计算的情况下改善了相对于浅层图形基线的列表检索。这些结果表明,工作表级标记化是可扩展的多工作表电子表格理解的一个有前景的抽象。
cs.AI / 56 / 2605.05812

Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

长远Q学习:通过n步不等式实现准确的价值学习
Abraham, Armaan A., Shi, Lucy Xiaoyang, Finn, Chelsea
Abstract
Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error when learning the optimal action-value function. LQL builds on a prior optimality tightening observation: any realized action sequence lower-bounds what the optimal policy can achieve in expectation, so acting optimally earlier should not be worse than following the observed actions for several steps before switching to optimal behavior. Our contribution is to turn this inequality into a practical stabilization mechanism for Q-learning by using a hinge loss to penalize violations of these bounds. Importantly, LQL computes these penalties using network outputs already produced for the TD error, requiring no auxiliary networks and no additional forward passes relative to Q-learning. When combined with multiple state-of-the-art methods on a range of online and offline-to-online benchmarks, LQL consistently outperforms both 1-step TD and n-step TD learning at similar runtime.
Chinese Translation
基于离策略的价值型强化学习方法,如Q学习,因其能够从任意经验中学习而受到青睐,包括由旧策略或其他智能体收集的数据。然而,在实践中,因引导学习使得长远学习变得脆弱:后期状态的估计误差通过时间差分(TD)更新向后传播,并可能随时间累积。我们提出了长远Q学习(LQL),它在学习最优动作价值函数时引入了一个原则性的后备机制,以防止误差累积。LQL基于一个先前的最优性收紧观察:任何实现的动作序列都为最优策略在期望上能够达到的结果提供了下界,因此,早期采取最优行动不应比在切换到最优行为之前遵循观察到的动作若干步更糟。我们的贡献在于将这一不等式转化为Q学习的一个实用稳定机制,通过使用铰链损失来惩罚这些界限的违反。重要的是,LQL使用已经为TD误差生成的网络输出来计算这些惩罚,相较于Q学习,不需要辅助网络和额外的前向传播。当与多种最先进的方法结合在一系列在线和离线到在线的基准测试中时,LQL始终在相似的运行时间内优于1步TD和n步TD学习。
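The inequality behind LQL can be written as Q*(s_t, a_t) >= E[ sum_{k<n} gamma^k r_{t+k} + gamma^n max_a Q*(s_{t+n}, a) ]: following the logged actions for n steps and then acting optimally cannot beat acting optimally from the start. A minimal sketch of a hinge penalty on violations of this bound follows. The function names are hypothetical, and the abstract does not state the exact loss form (for example, whether the hinge is squared or how it is weighted against the TD loss).

```python
def n_step_lower_bound(rewards, bootstrap_value, gamma):
    """Discounted return of n observed steps plus the discounted bootstrap value.

    `bootstrap_value` stands for max_a Q(s_{t+n}, a) from the current network.
    """
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap_value


def hinge_penalty(q_value, lower_bound):
    """Zero when Q(s_t, a_t) respects the bound, linear in the violation otherwise."""
    return max(0.0, lower_bound - q_value)
```

Because the penalty is one-sided, it only pushes Q-values up toward the bound and leaves estimates that already satisfy it to the ordinary TD update.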
cs.AI / 57 / 2605.05826

AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

AGPO:针对京东可验证推理和搜索广告相关性的非对称群体策略优化
Xu, Yang, Yao, Kun, Deng, Yiming, Fang, Zheng, Ting, Kai Ming, Pang, Ming
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model's exploration capacity. For positive reinforcement, AGPO adopts a group advantage mechanism, which scales positive updates based on intra-group variance, allowing the model to focus on rare correct paths while suppressing updates from trivial paths. Our experiments on five mathematical benchmarks demonstrate that AGPO achieves state-of-the-art accuracy while consistently improving pass@$k$ performance at scale. In a large-scale industrial application for search ads relevance optimization, AGPO effectively enhances the quality of the data annotation, leading to substantial performance gains in downstream student models.
Chinese Translation
可验证奖励的强化学习(RLVR)在提升大型语言模型(LLMs)的推理性能方面取得了显著成功。然而,最近的研究表明,尽管当前的RLVR方法提高了朝向正确路径的采样效率,但并未引发根本性的新的推理模式。相反,经过训练的模型的推理能力边界往往比其基础模型更窄,基础模型在大样本量下能够实现更高的覆盖率。在本研究中,我们提出了非对称群体策略优化(AGPO)以对抗这种边界收缩。AGPO采用负主导强化策略来抑制错误推理路径,同时保持基础模型的探索能力。对于正向强化,AGPO采用群体优势机制,根据组内方差对正向更新进行缩放,使模型能够专注于稀有的正确路径,同时抑制来自琐碎路径的更新。我们在五个数学基准上的实验表明,AGPO在规模上实现了最先进的准确性,同时在pass@$k$性能上持续改善。在一个大规模工业应用中,针对搜索广告相关性的优化,AGPO有效提升了数据标注的质量,带来了下游学生模型的显著性能提升。
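The pass@k metric referenced above is often estimated from n sampled solutions, c of which are correct, as 1 - C(n - c, k) / C(n, k). The sketch below implements that widely used unbiased estimator; whether the AGPO authors compute pass@k exactly this way is an assumption, since the abstract does not say.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n (c correct) is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

This is why coverage at large k can favor a base model: a problem solved even once in many samples counts fully toward pass@k, while a policy that has stopped exploring that path scores zero on it.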
cs.AI / 58 / 2605.05832

MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition

MolRecBench-Wild:一个用于光学化学结构识别的真实世界基准
Yang, Haote, Wang, Hui, Zhu, Chen, Wang, Jingchao, Li, Linye, Lai, Hongbin, Ao, Huijie, Lyu, Yongxuan, Wu, Jiang, Sun, Jiaxing, Chen, Lua, Cao, Yuanyuan, Zhang, Ruijie, Lu, Shengxin, Wu, Lijun, Wang, Bin, He, Conghui
Abstract
Optical Chemical Structure Recognition (OCSR) aims to translate molecular diagrams in scientific literature into machine-readable formats, but current systems remain unreliable on real-world images due to substantial visual and chemical complexity. We introduce MOSAIC, a dual-dimensional difficulty framework with 37 fine-grained labels that jointly characterize visual interference and chemical semantic challenges in molecular diagrams. Based on this framework, we construct MolRecBench-Wild, a benchmark of 5,029 structures from 820 recent chemistry papers, covering the full difficulty spectrum observed in real publications. To enable faithful semantic evaluation beyond SMILES and MolFile, we propose CARBON, a representation language capable of expressing valence variations, icon-based groups, and other non-standard chemical semantics. We further adopt a dual-track evaluation protocol supporting both CARBON and SMILES outputs for broad model compatibility. Comprehensive experiments over 18 OCSR-capable models reveal severe performance degradation on MolRecBench-Wild, exposing a large gap between previous patent benchmarks and real-world academic scenarios.
Chinese Translation
光学化学结构识别(Optical Chemical Structure Recognition, OCSR)旨在将科学文献中的分子图谱转换为机器可读格式,但由于视觉和化学复杂性显著,当前系统在真实世界图像上的可靠性仍然不足。我们引入了MOSAIC,一个具有37个细粒度标签的双维难度框架,联合表征分子图谱中的视觉干扰和化学语义挑战。基于此框架,我们构建了MolRecBench-Wild,这是一个包含来自820篇最新化学论文的5,029个结构的基准,涵盖了真实出版物中观察到的全难度范围。为了实现超越SMILES和MolFile的忠实语义评估,我们提出了CARBON,一种能够表达价态变化、基于图标的组以及其他非标准化学语义的表示语言。我们进一步采用了一种双轨评估协议,支持CARBON和SMILES输出,以实现广泛的模型兼容性。对18个具备OCSR能力的模型进行的全面实验显示,在MolRecBench-Wild上性能严重下降,揭示了先前专利基准与真实世界学术场景之间的巨大差距。
cs.AI / 59 / 2605.05833

On the Role of Language Representations in Auto-Bidding: Findings and Implications

语言表示在自动竞标中的作用:发现与启示
Zhu, Guanyu, Luan, Jining, Du, Hanwen, Fang, Xinyu, Xu, Sibo, Ni, Ersheng, Li, Hongji, Fang, Jincheng, Chen, Ronghao, Wang, Huacan, Lan, Xuanqi, Ni, Yongxin, Sun, Yiqi, Li, Youhua
Abstract
Auto-bidding is a crucial task in real-time advertising markets, where policies must optimize long-horizon value under delivery constraints (e.g., budget and CPA). Existing methods for auto-bidding rely on compact numerical state representations: while they can implicitly capture delivery dynamics, they offer limited support for explicitly representing and controlling high-level intent, evolving feedback, and operator-style strategic guidance in real campaigns. Meanwhile, although Large Language Models (LLMs) offer a powerful method for encoding semantic information, it remains unclear when LLMs help and how to integrate them without sacrificing numerical precision. Through systematic preliminary studies, we find that (1) LLM embeddings contain bidding-relevant cues yet cannot replace numerical features, and (2) gains emerge only with careful semantic-numeric integration rather than naive concatenation. Motivated by these findings, we propose SemBid, a novel auto-bidding framework that injects LLM-encoded semantics into offline bidding trajectories at the token level. SemBid introduces three semantic inputs: Task, History, and Strategy. It injects these semantics as tokens alongside numerical trajectory tokens and uses self-attention to integrate them, improving controllability and generalization across objectives. Across diverse scenarios and budget regimes, SemBid outperforms competitive baselines from offline RL and generative sequence modeling, with more consistent gains in overall performance, constraint satisfaction, and robustness. Our code is available at https://github.com/AlanYu04/SemBid-KDD2026.
Chinese Translation
自动竞标是实时广告市场中的一项关键任务,策略必须在交付约束(例如预算和每次获取成本(CPA))下优化长期价值。现有的自动竞标方法依赖于紧凑的数值状态表示:虽然它们可以隐式捕捉交付动态,但在显式表示和控制高层意图、不断变化的反馈以及实际活动中的操作员风格战略指导方面支持有限。同时,大型语言模型(LLMs)提供了一种强大的语义信息编码方法,但尚不清楚何时LLMs能提供帮助以及如何在不牺牲数值精度的情况下将其整合。通过系统的初步研究,我们发现(1)LLM嵌入包含与竞标相关的线索,但不能替代数值特征;(2)收益只有在仔细的语义-数值整合下才会出现,而不是简单的拼接。基于这些发现,我们提出了SemBid,一种新颖的自动竞标框架,在离线竞标轨迹中以令牌级别注入LLM编码的语义。SemBid引入了三个语义输入:任务(Task)、历史(History)和策略(Strategy)。它将这些语义作为令牌与数值轨迹令牌一起注入,并使用自注意力机制进行整合,从而提高了在不同目标上的可控性和泛化能力。在多种场景和预算机制下,SemBid优于来自离线强化学习和生成序列建模的竞争基线,在整体性能、约束满足和鲁棒性方面的收益更加一致。我们的代码可在此获取:https://github.com/AlanYu04/SemBid-KDD2026。
cs.AI / 60 / 2605.05842

Taklif.AI: LLM-Powered Platform for Interest-Based Personalized College Assignments

Taklif.AI:基于兴趣的个性化大学作业的LLM驱动平台
Kurdya, Zaki, Zuqlam, Mohammed, Amassi, Salem, Telbany, Shady, Saad, Motaz
Abstract
Educators face significant challenges in creating engaging, personalized assignments that accommodate students' diverse interests and cognitive abilities. Traditional one-size-fits-all assignments frequently lead to decreased student engagement and increased reliance on unethical practices such as plagiarism. To address these challenges, we present Taklif.AI, a platform that leverages Large Language Models (LLMs) to automatically generate personalized assignments tailored to individual student interests. Unlike existing AI-powered educational platforms that personalize based on academic performance metrics alone, Taklif.AI incorporates students' extracurricular interests and cultural contexts into the assignment generation process through a structured prompt engineering pipeline with input and output guardrails. The platform employs a serverless architecture on AWS with Next.js, using Llama 3.3 70B as the primary LLM via LiteLLM for multi-provider load balancing and LangChain for prompt orchestration. We describe the system architecture, the prompt design methodology, and the guardrails framework that ensures output quality. Preliminary user acceptance testing with 68 participants (65 students and 3 educators) indicates positive reception, with 84% of participants rating the personalization feature as beneficial. We discuss the platform's current capabilities and limitations, and outline directions for rigorous empirical evaluation of learning outcomes.
Chinese Translation
教育工作者在创建能够吸引学生、个性化的作业时面临重大挑战,这些作业需要适应学生多样的兴趣和认知能力。传统的一刀切作业常常导致学生参与度降低,并增加对不道德行为(如抄袭)的依赖。为了解决这些问题,我们提出了Taklif.AI,一个利用大型语言模型(LLMs)自动生成个性化作业的平台,旨在根据个别学生的兴趣进行定制。与现有的仅基于学业表现指标进行个性化的AI教育平台不同,Taklif.AI在作业生成过程中通过结构化的提示工程管道,将学生的课外兴趣和文化背景纳入考虑,并设置输入和输出的保护措施。该平台在AWS上采用无服务器架构,使用Next.js,以Llama 3.3 70B作为主要LLM,通过LiteLLM实现多提供商负载均衡,并利用LangChain进行提示编排。我们描述了系统架构、提示设计方法论以及确保输出质量的保护框架。初步的用户接受度测试涉及68名参与者(65名学生和3名教育工作者),结果显示反响积极,84%的参与者认为个性化功能是有益的。我们讨论了该平台的当前能力和局限性,并概述了对学习成果进行严格实证评估的方向。
cs.AI / 61 / 2605.05854

AirQualityBench: A Realistic Evaluation Benchmark for Global Air Quality Forecasting

AirQualityBench:全球空气质量预测的现实评估基准
Xu, Xing, Wang, Xu, Zhang, Yudong, Zhao, Huilin, Zhou, Zhengyang, Wang, Yang
Abstract
Air-quality forecasting models are commonly evaluated on regional, preprocessed, and normalized datasets, where missing observations are removed or artificially completed. Such protocols simplify comparison but hide the conditions that dominate real monitoring networks: uneven global coverage, structured missingness, heterogeneous pollutant scales, and deployment cost. We introduce AirQualityBench, a global multi-pollutant benchmark designed to evaluate forecasting models under these realistic conditions. The benchmark contains hourly observations from 3,720 monitoring stations over 2021-2025, covers six major pollutants, and preserves provider-native observation masks. Rather than imputing a dense data tensor, AirQualityBench exposes missingness as part of the forecasting problem and reports errors on valid future observations after inverse transformation to physical concentration scales. Evaluating representative spatio-temporal models under this unified protocol shows that strong performance on sanitized datasets does not reliably transfer to global, fragmented monitoring streams. AirQualityBench therefore serves as a realistic testbed for scalable, mask-aware, and physically interpretable air-quality forecasting. All benchmark data, code, evaluation scripts, and baseline implementations are available on GitHub at https://github.com/Star-Learning/AirQualityBench.
Chinese Translation
空气质量预测模型通常在区域性、预处理和标准化的数据集上进行评估,其中缺失观测值被删除或人工补全。这种协议简化了比较,但掩盖了主导真实监测网络的条件:不均匀的全球覆盖、结构化的缺失性、异质的污染物尺度和部署成本。我们引入了AirQualityBench,这是一个全球多污染物基准,旨在在这些现实条件下评估预测模型。该基准包含了2021至2025年间来自3720个监测站的每小时观测数据,涵盖六种主要污染物,并保留了提供者原生的观测掩码。AirQualityBench并不通过填补密集数据张量来处理缺失值,而是将缺失性作为预测问题的一部分,并在逆变换到物理浓度尺度后报告有效未来观测的误差。在这一统一协议下评估代表性的时空模型表明,在经过清洗的数据集上表现良好的模型并不可靠地转移到全球碎片化的监测流。因此,AirQualityBench作为一个现实的测试平台,适用于可扩展的、关注掩码的以及物理可解释的空气质量预测。所有基准数据、代码、评估脚本和基线实现均可在GitHub上获取:https://github.com/Star-Learning/AirQualityBench。
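The evaluation rule described above (score only valid future observations, after mapping predictions back to physical concentration units) can be sketched as a masked mean absolute error. This is a minimal illustration, not the benchmark's own code: the name `masked_mae` and the affine normalization undone by `scale` and `offset` are assumptions introduced here.

```python
from typing import Sequence


def masked_mae(pred: Sequence[float], target: Sequence[float],
               mask: Sequence[bool], scale: float = 1.0,
               offset: float = 0.0) -> float:
    """Mean absolute error over valid observations, in physical units.

    `scale` and `offset` undo a hypothetical affine normalization
    x_norm = (x - offset) / scale applied before training.
    """
    total, count = 0.0, 0
    for p, t, valid in zip(pred, target, mask):
        if not valid:  # missing observation: excluded, never imputed
            continue
        total += abs((p * scale + offset) - (t * scale + offset))
        count += 1
    if count == 0:
        raise ValueError("no valid observations to score")
    return total / count
```

Because masked entries are skipped rather than filled in, a model gains nothing from predicting well at time steps where the station reported no measurement.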
cs.AI / 62 / 2605.05861

SANEmerg: An Emergent Communication Framework for Semantic-aware Agentic AI Networking

SANEmerg:一种面向语义感知的自主智能代理网络的涌现通信框架
Xiao, Yong, Zhou, Haoran, Zhou, Yujie, Krunz, Marwan
Abstract
Future networking systems are envisioned to become part of an agentic AI-native ecosystem in which a vast number of heterogeneous and specialized AI agents cooperate seamlessly to fulfill complex user requirements in real time. However, traditional networking paradigms are characterized by a rigid decoupling of communication and computation, which often leads to significant inefficiencies in large-scale agentic AI networking (AgentNet) systems. Emergent communication offers a novel solution by enabling autonomous agents that support task-specific signaling protocols for information exchange and collaborative coordination. In this paper, we consider a multi-agent emergent communication framework, tailored for semantic-aware AgentNet systems in which the user's semantic intent can be automatically detected, inferred, and linked to a set of sub-tasks to be assigned to a set of agents. We investigate how communication and signaling protocols can emerge among collaborative agents with computationally bounded intelligence under stringent bandwidth constraints. Our proposed framework, called SANEmerg, is designed to facilitate the emergence of communication for collaborative task fulfillment while adhering to the physical limits of AgentNet. SANEmerg incorporates a bandwidth-adaptable importance-filter that dynamically prioritizes the transmission of higher-contribution message dimensions, ensuring robust performance in bandwidth-limited environments. Furthermore, SANEmerg integrates a complexity-regularizer grounded in the Minimum Description Length (MDL) principle to facilitate the emergence of computationally bounded signaling. Evaluated via an AgentNet prototype and extensive experimentation, SANEmerg demonstrates significant performance improvements over state-of-the-art solutions, achieving superior task accuracy while significantly reducing bandwidth and computational overhead.
Chinese Translation
未来的网络系统被设想为一个自主智能原生生态系统的一部分,在这个生态系统中,大量异构和专业的人工智能代理无缝合作,以实时满足复杂的用户需求。然而,传统的网络范式的特点是通信与计算的严格解耦,这往往导致在大规模自主智能网络(AgentNet)系统中出现显著的低效。涌现通信提供了一种新颖的解决方案,通过使自主代理支持特定任务的信号协议来进行信息交换和协作协调。本文考虑了一种多代理涌现通信框架,专为语义感知的AgentNet系统量身定制,在该系统中,用户的语义意图可以被自动检测、推断,并与一组子任务关联,以分配给一组代理。我们研究了在严格带宽限制下,如何在具有计算限制智能的协作代理之间涌现出通信和信号协议。我们提出的框架,称为SANEmerg,旨在促进协作任务完成的通信涌现,同时遵循AgentNet的物理限制。SANEmerg结合了一种带宽可调的重要性过滤器,动态优先传输贡献较高的信息维度,确保在带宽受限环境中的稳健性能。此外,SANEmerg集成了一种基于最小描述长度(MDL)原则的复杂性正则化器,以促进计算限制信号的涌现。通过AgentNet原型和广泛的实验评估,SANEmerg显示出显著的性能提升,相较于最先进的解决方案,在任务准确性上取得了优越的表现,同时显著降低了带宽和计算开销。
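One simple way to read the bandwidth-adaptable importance filter is as a top-k selection over message dimensions, where k is set by the available bandwidth. The sketch below follows that reading and is an assumption, not SANEmerg's actual mechanism: `importance_filter`, its per-dimension `importance` scores, and the integer `budget` are all hypothetical names.

```python
from typing import List, Sequence


def importance_filter(message: Sequence[float],
                      importance: Sequence[float],
                      budget: int) -> List[float]:
    """Keep the `budget` most important dimensions; zero out the rest."""
    if len(message) != len(importance):
        raise ValueError("message and importance must have equal length")
    budget = max(0, min(budget, len(message)))
    # Indices sorted by descending importance (ties keep original order).
    ranked = sorted(range(len(message)), key=lambda i: -importance[i])
    keep = set(ranked[:budget])
    return [x if i in keep else 0.0 for i, x in enumerate(message)]
```

Lowering `budget` when the channel degrades drops the least important dimensions first, which is the behaviour the abstract attributes to the filter.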
cs.AI / 63 / 2605.05866

XDecomposer: Learning Prior-Free Set Decomposition for Multiphase X-ray Diffraction

XDecomposer:无先验集分解的多相X射线衍射学习
Gao, Hanyu, Cao, Bin, Su, Yunyue, Zhang, Tong-Yi, Liu, Qiang
Abstract
Multiphase powder X-ray diffraction (PXRD) analysis remains a fundamental bottleneck in structure identification, as real-world synthesis often produces complex mixtures whose constituent phases (components) cannot be reliably disentangled. While recent advances in representation-based crystal retrieval and generation suggest the possibility of inferring structures directly from PXRD, existing approaches largely assume single-phase inputs and break down in multiphase settings. Here, we present XDecomposer, a prior-free framework for joint decomposition and identification of multiphase XRD patterns without requiring candidate phase lists, structural templates, or prior knowledge of phase number. We formulate multiphase diffraction analysis as a set prediction problem, where the model infers an unordered set of phase-resolved components, their mixture proportions, and corresponding structural representations within a unified architecture. A phase-query-driven decomposition mechanism, together with diffraction-consistent physical reconstruction, enables accurate source separation while preserving crystallographic fidelity. Extensive experiments on both simulated and experimental datasets show that XDecomposer substantially improves reconstruction accuracy and phase identification across diverse chemical systems, while maintaining strong generalization to unseen mixtures. These results provide a practical route toward data-driven, source-resolved multiphase XRD analysis and reduce the long-standing dependence on prior-guided iterative phase matching. The code is openly available at https://github.com/Licht0812/XDecomposer.
Chinese Translation
多相粉末X射线衍射(PXRD)分析在结构识别中仍然是一个基本瓶颈,因为现实世界的合成往往产生复杂的混合物,其组成相(成分)无法可靠地分离。尽管最近在基于表征的晶体检索和生成方面的进展表明可以直接从PXRD推断结构,但现有方法在很大程度上假设单相输入,并在多相环境中失效。在此,我们提出了XDecomposer,一个无先验的框架,用于多相XRD模式的联合分解和识别,无需候选相列表、结构模板或相数的先验知识。我们将多相衍射分析形式化为一个集合预测问题,其中模型推断出一个无序的相分解成分集合、它们的混合比例以及在统一架构中的相应结构表征。相查询驱动的分解机制结合衍射一致的物理重建,使得在保持晶体学保真度的同时实现准确的源分离。在模拟和实验数据集上的广泛实验表明,XDecomposer显著提高了重建精度和相识别能力,适用于多种化学系统,同时对未见混合物保持强大的泛化能力。这些结果为数据驱动的源分辨多相XRD分析提供了一条实用的途径,并减少了对先验引导的迭代相匹配的长期依赖。代码可在 https://github.com/Licht0812/XDecomposer 上公开获取。
cs.AI / 64 / 2605.05878

Agentic, Context-Aware Risk Intelligence in the Internet of Value

价值互联网中的主动、情境感知风险智能
Magableh, Basel, Research, OmniRisk
Abstract
The Internet of Value (IoV) is a heterogeneous, partially-trusted network in which the dominant marginal risk is composite (route, sentiment, liquidity, and the policy a system is willing to commit to) rather than a property of any single chain. We argue that a risk primitive adequate for this regime is a composition of five engines: a prediction engine over price, liquidity, volatility, and route health; a Bittensor verification subnet that decentralises and economically scores prediction outputs; a sentiment-fusion engine over text, on-chain flow, and grey-literature feeds; an agentic engine under constitutional, role-bound action constraints; and an API-risk and scenario engine that converts forecasts into pre-committed action programs in the sense of Monte-Carlo scenario generation. We anchor the architecture in two empirical artefacts: a 27-hour policy-constrained liquidity stress-response experiment on Solana, and a 168-hour prediction-router calibration arc reported with explicit class-imbalance honesty. The case study supports deployability; the validator-loss decomposition is stated formally and is falsifiable.
Chinese Translation
价值互联网(IoV)是一个异构的、部分可信的网络,其中主导的边际风险是复合性的(包括路线、情绪、流动性以及系统愿意承诺的政策),而不是任何单一链的属性。我们认为,适合这一体系的风险原语是五个引擎的组合:一个针对价格、流动性、波动性和路线健康的预测引擎;一个去中心化的 Bittensor 验证子网,用于经济评分预测输出;一个情绪融合引擎,基于文本、链上流动和灰色文献的馈送;一个在宪法和角色约束下的主动引擎;以及一个 API 风险和场景引擎,将预测转化为在蒙特卡洛场景生成意义上的预承诺行动程序。我们将该架构锚定在两个实证文物上:一个在 Solana 上进行的27小时政策约束流动性压力响应实验,以及一个报告明确类别不平衡诚实性的168小时预测路由校准弧。案例研究支持可部署性;验证者损失分解被正式陈述并且是可证伪的。
cs.AI / 65 / 2605.05909

Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning

针对多模态大语言模型的零空间约束对比视觉遗忘
Wang, Yuhang, Niu, Zhenxing, Ji, Haoxuan, He, Guangyu, Zhang, Linlin, Gao, Haichang
Abstract
The core challenge of machine unlearning is to strike a balance between target knowledge removal and non-target knowledge retention. In the context of Multimodal Large Language Models (MLLMs), this challenge becomes even more pronounced, as knowledge is further divided into visual and textual modalities that are tightly intertwined. In this paper, we introduce an MLLM unlearning approach that aims to forget target visual knowledge while preserving non-target visual knowledge and all textual knowledge. Specifically, we freeze the LLM backbone and achieve unlearning by fine-tuning the visual module. First, we propose a Contrastive Visual Forgetting (CVF) mechanism to separate target visual knowledge from retained visual knowledge, guiding the representations of target visual concepts toward appropriate regions in the feature space. Second, we identify the null space associated with retained knowledge and constrain the unlearning process within this space, thereby significantly mitigating degradation in knowledge retention. Third, beyond static unlearning scenarios, we extend our approach to continual unlearning, where forgetting requests arrive sequentially. Extensive experiments across diverse benchmarks demonstrate that our approach achieves a strong balance between effective forgetting and robust knowledge retention.
Chinese Translation
机器遗忘的核心挑战在于平衡目标知识的移除与非目标知识的保留。在多模态大语言模型(MLLMs)的背景下,这一挑战变得更加明显,因为知识进一步分为紧密交织的视觉和文本模态。本文提出了一种MLLM遗忘方法,旨在遗忘目标视觉知识,同时保留非目标视觉知识和所有文本知识。具体而言,我们冻结了LLM主干,通过微调视觉模块实现遗忘。首先,我们提出了一种对比视觉遗忘(Contrastive Visual Forgetting, CVF)机制,以将目标视觉知识与保留的视觉知识分离,引导目标视觉概念的表示朝向特征空间中的适当区域。其次,我们识别与保留知识相关的零空间,并将遗忘过程限制在该空间内,从而显著减轻知识保留的退化。第三,除了静态遗忘场景外,我们将方法扩展到持续遗忘,其中遗忘请求顺序到达。在多种基准测试中进行的大量实验表明,我们的方法在有效遗忘与稳健知识保留之间实现了良好的平衡。
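The null-space constraint can be illustrated for a single weight row of a linear layer: if the update is orthogonal to every retained feature vector, the layer's output on those features does not change. The sketch below builds that projection with Gram-Schmidt in plain Python. It is a toy illustration under that assumption; the helper names are hypothetical, and the paper's actual null-space estimation for the visual module is not described in the abstract.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: Sequence[Sequence[float]],
                      tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt basis for the span of the retained feature vectors."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(update: Sequence[float],
                          retained: Sequence[Sequence[float]]) -> Vector:
    """Remove from `update` every component lying in span(retained)."""
    out = list(update)
    for b in orthonormal_basis(retained):
        coeff = dot(out, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out
```

For example, with retained features `[[1, 0, 0], [1, 1, 0]]`, only the third coordinate of an update survives the projection, so the forgetting step cannot alter responses to either retained feature.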
cs.AI / 66 / 2605.05911

PREFER: Personalized Review Summarization with Online Preference Learning

PREFER:基于在线偏好学习的个性化评论摘要
Roy, Millend, Capponi, Agostino, Goyal, Vineet
Abstract
Product reviews significantly influence purchasing decisions on e-commerce platforms. However, the sheer volume of reviews can overwhelm users, obscuring the information most relevant to their specific needs. Current e-commerce summarization systems typically produce generic, static summaries that fail to account for the fact that (i) different users care about different product characteristics, and (ii) these preferences may evolve with interactions. To address the challenge of unknown latent preferences, we propose an online learning framework that generates personalized summaries for each user. Our system iteratively refines its understanding of user preferences by incorporating feedback directly from the generated summaries over time. We provide a case study using the Amazon Reviews'23 dataset, showing in controlled simulations that online preference learning improves alignment with target user interests while maintaining summary quality.
Chinese Translation
产品评论对电子商务平台上的购买决策具有重要影响。然而,评论的数量庞大可能会使用户感到不知所措,从而掩盖与其特定需求最相关的信息。目前的电子商务摘要系统通常生成通用的静态摘要,未能考虑到(i)不同用户关注不同的产品特征,以及(ii)这些偏好可能随着互动而演变。为了解决未知潜在偏好的挑战,我们提出了一种在线学习框架,为每个用户生成个性化摘要。我们的系统通过逐步整合来自生成摘要的反馈,迭代地完善对用户偏好的理解。我们提供了一个使用Amazon Reviews'23数据集的案例研究,显示在受控模拟中,在线偏好学习提高了与目标用户兴趣的一致性,同时保持了摘要质量。
cs.AI / 67 / 2605.05913

Wisteria: A Unified Multi-Scale Feature Learning Framework for DNA Language Model

紫藤:一种统一的多尺度特征学习框架用于DNA语言模型
Wang, Weihua, Li, Haoji, Bao, Feilong, Yang, Lei, Gao, Guanglai
Abstract
DNA language models aim to decipher the regulatory grammar and semantics of genomes by capturing long-range dependencies in DNA sequences. Existing methods emphasize long-range token interactions but often ignore the interplay between local motifs and global dependencies. In this paper, we propose Wisteria, a genomic language model that integrates multi-scale feature learning within a unified framework for DNA sequences. Specifically, Wisteria augments the Mamba-based architecture with gated dilated convolutions to capture local motifs and regulatory patterns, while gated multilayer perceptrons refine global dependencies. We further introduce a Fourier-based attention mechanism to support frequency-domain modeling, periodic extension, and length generalization. Across four experimental settings with both short- and long-range dependencies, Wisteria demonstrates strong performance on downstream benchmarks against competitive DNA language model baselines. These results indicate that Wisteria effectively unifies local and global dependency modeling for multi-scale genomic sequence analysis.
Chinese Translation
DNA语言模型旨在通过捕捉DNA序列中的长程依赖关系来解读基因组的调控语法和语义。现有方法强调长程标记交互,但常常忽视局部模式与全局依赖之间的相互作用。本文提出了紫藤(Wisteria),一种在统一框架内集成多尺度特征学习的基因组语言模型。具体而言,紫藤通过门控膨胀卷积增强基于Mamba的架构,以捕捉局部模式和调控模式,同时门控多层感知器用于细化全局依赖。我们进一步引入了一种基于傅里叶的注意机制,以支持频域建模、周期扩展和长度泛化。在四个实验设置中,无论是短程还是长程依赖,紫藤在下游基准测试中表现出色,超越了竞争性的DNA语言模型基线。这些结果表明,紫藤有效地统一了局部和全局依赖建模,适用于多尺度基因组序列分析。
cs.AI / 68 / 2605.05921

Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery

意图生成与意义构建:人类与人工智能引导的数学发现的互动
Bäuerle, Alex, Connors, Adam, Novikov, Alexander, Wagner, Adam Zsolt, Vũ, Ngân, Viegas, Fernanda, Wattenberg, Martin, Dixon, Lucas
Abstract
Artificial intelligence offers powerful new tools for scientific discovery, but the interaction paradigms required to effectively harness these systems remain underexplored. In this paper, we present findings from a formative user study with 11 expert mathematicians who used AlphaEvolve, an evolutionary coding agent, to tackle advanced problems in their fields of expertise. We identify and characterize a distinct workflow we term intentmaking, the iterative process of discovering, defining, and refining one's experimental goals through active system interaction. We frame this as a natural extension to sensemaking, the cognitive process of building an understanding of complex or novel data. We suggest that users enter a cycle of intentmaking (defining and updating their experiment) and sensemaking (interpreting the results) which repeats many times during the course of an investigation. Our documentation of these themes suggests an approach to designing AI tools for scientific discovery that goes beyond the existing question/answer model of many current systems, treating them as collaborative instruments rather than opaque black-box assistants.
Chinese Translation
人工智能为科学发现提供了强大的新工具,但有效利用这些系统所需的互动范式仍然未得到充分探索。在本文中,我们展示了一项针对11位专家数学家的初步用户研究的发现,他们使用了AlphaEvolve这一进化编码代理来解决各自领域中的高级问题。我们识别并描述了一种独特的工作流程,称之为意图生成(intentmaking),即通过主动与系统互动来发现、定义和细化实验目标的迭代过程。我们将其视为意义构建(sensemaking)的自然延伸,后者是构建对复杂或新颖数据理解的认知过程。我们建议用户进入一个意图生成(定义和更新他们的实验)与意义构建(解释结果)的循环,这一过程在调查过程中会重复多次。我们对这些主题的记录建议了一种设计科学发现的人工智能工具的方法,该方法超越了许多现有系统的问答模型,将其视为协作工具,而非不透明的黑箱助手。
cs.AI / 69 / 2605.05929

Which Are the Low-Resource Languages of the Semantic Web?

语义网中的低资源语言有哪些?
Mbengue, Ndeye-Emilie, Monnin, Pierre, Couceiro, Miguel, Gandon, Fabien
Abstract
Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high- and low-resource languages, excluding many communities from the global digital transformation. Multilingual Linked Open Data Knowledge Graphs (LOD KGs) could contribute to mitigating this divide through cross-lingual transfer; however, no clear quantitative definition of low-resource languages has yet been established in the context of LOD KGs. In this poster, we present a methodology to analyze the distribution of languages across LOD KGs and propose a preliminary multi-level categorization based on DBpedia, BabelNet, and Wikidata. We use this categorization to formally define low-, medium-, and high-resource languages, a definition that could later be used to select cross-lingual transfer candidates.
Chinese Translation
新兴数字技术加剧了开放获取数据(Open Access Data, OAD)中高资源语言与低资源语言之间的现有鸿沟,使许多社区被排除在全球数字转型之外。多语言链接开放数据知识图谱(Linked Open Data Knowledge Graphs, LOD KGs)可以通过跨语言迁移来帮助缓解这一鸿沟;然而,在LOD KGs的背景下,尚未建立低资源语言的明确量化定义。在本海报中,我们提出了一种分析LOD KGs中语言分布的方法,并基于DBpedia、BabelNet和Wikidata提出了初步的多层次分类。这一分类用于为低资源、高资源和中资源语言提供正式定义,后续可用于选择跨语言迁移候选者。
cs.AI / 70 / 2605.05931

In Data or Invisible: Toward a Better Digital Representation of Low-Resource Languages with Knowledge Graphs

数据还是隐形:通过知识图谱改善低资源语言的数字表现
Mbengue, Ndeye-Emilie
Abstract
Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high- and low-resource languages, excluding many communities from participating in the global digital transformation. In this PhD proposal, we aim to address this gap, focusing on the language coverage of Linked Open Data knowledge graphs (LOD KGs). First, we identify key variables that characterize language distribution in LOD, including the number of Wikipedia articles per language edition and the number of language-tagged entities in LOD KGs. These variables are analyzed across three major multilingual LOD KGs, DBpedia, BabelNet, and Wikidata, providing insights into the representation and distribution of languages within LOD. Building on this analysis, we intend to study the impact of cross-lingual transfer candidate selection on the task of multilingual KG completion. In particular, we plan to investigate strategies based on linguistic proximity and the availability of curated annotated alignments between languages. Language proximity also motivates us to explore analogical reasoning, which relies on (dis)similarities and has not yet been investigated as a way to identify correspondences across languages, with the goal of improving KG completion performance and enhancing language coverage in LOD.
Chinese Translation
新兴数字技术加剧了高资源语言与低资源语言之间在开放获取数据(Open Access Data, OAD)中的现有鸿沟,使许多社区无法参与全球数字转型。在本博士提案中,我们旨在解决这一差距,重点关注链接开放数据知识图谱(Linked Open Data, LOD)中的语言覆盖情况。首先,我们识别出表征LOD中语言分布的关键变量,包括每种语言版本的维基百科文章数量和LOD知识图谱中语言标记实体的数量。这些变量在三个主要的多语言LOD知识图谱DBpedia、BabelNet和Wikidata中进行了分析,为我们提供了关于LOD中语言表现和分布的见解。在此分析的基础上,我们打算研究跨语言迁移候选选择对多语言知识图谱补全任务的影响。特别是,我们计划探讨基于语言接近性和语言之间可用的经过策划的注释对齐的策略。语言接近性还促使我们探索依赖于(不)相似性的类比推理的好处,而这一点尚未被研究,以识别跨语言的对应关系,从而提高知识图谱补全的性能,并增强LOD中的语言覆盖。
cs.AI / 71 / 2605.05938

ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models

ICU-Bench:多模态大语言模型持续遗忘的基准测试
Wang, Yuhang, Mei, Wenjie, Zhang, Junkai, He, Guangyu, Niu, Zhenxing, Gao, Haichang
Abstract
Although Multimodal Large Language Models (MLLMs) have achieved remarkable progress across many domains, their training on large-scale multimodal datasets raises serious privacy concerns, making effective machine unlearning increasingly necessary. However, existing benchmarks mainly focus on static or short-sequence settings, offering limited support for evaluating continual privacy deletion requests in realistic deployments. To bridge this gap, we introduce ICU-Bench, a continual multimodal unlearning benchmark built on privacy-critical document data. ICU-Bench contains 1,000 privacy-sensitive profiles from two document domains, medical reports and labor contracts, with 9,500 images, 16,000 question-answer pairs, and 100 forget tasks. Additionally, new continual unlearning metrics are introduced, facilitating a comprehensive analysis of forgetting effectiveness, historical forgetting preservation, retained utility, and stability throughout the continual unlearning process. Through extensive experiments with representative unlearning methods on ICU-Bench, we show that existing methods generally struggle in continual settings and exhibit clear limitations in balancing forgetting quality, utility preservation, and scalability over long task sequences. These findings highlight the need for multimodal unlearning methods explicitly designed for continual privacy deletion.
Chinese Translation
尽管多模态大语言模型(MLLMs)在多个领域取得了显著进展,但其在大规模多模态数据集上的训练引发了严重的隐私问题,使得有效的机器遗忘变得愈发必要。然而,现有的基准测试主要集中在静态或短序列设置上,对现实部署中持续隐私删除请求的评估支持有限。为了解决这一问题,我们引入了ICU-Bench,这是一个基于隐私关键文档数据的持续多模态遗忘基准。ICU-Bench包含来自两个文档领域的1000个隐私敏感的个人资料,分别是医疗报告和劳动合同,配有9500张图像、16000个问答对和100个遗忘任务。此外,我们引入了新的持续遗忘指标,以便全面分析遗忘效果、历史遗忘保留、保留效用和在持续遗忘过程中的稳定性。通过在ICU-Bench上对代表性的遗忘方法进行广泛实验,我们发现现有方法在持续设置中普遍面临困难,并在平衡遗忘质量、效用保留和长任务序列的可扩展性方面存在明显局限。这些发现突显了针对持续隐私删除明确设计的多模态遗忘方法的必要性。
cs.AI / 72 / 2605.05949

MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System

MAS-Algorithm:一种基于多智能体系统的算法编程问题解决工作流程
Xu, Yuliang, Xu, Xiang, Wan, Yao, Wei, Hu, Jia, Tong
Abstract
Algorithmic problem solving serves as a rigorous testbed for evaluating structured reasoning in AI coding systems, as it directly reflects a model's ability to perform structured reasoning in complex scenarios. Existing approaches predominantly rely on model-centric strategies, such as architectural modifications and data scaling, which are costly and offer limited interpretability. Alternative methods leveraging external tools or prompting techniques (e.g., chain-of-thought) are often fragmented and lack a unified framework. In this paper, we propose MAS-Algorithm, a systematic multi-agent workflow for algorithmic problem solving inspired by the practices of competitive programmers and algorithm engineers. Our framework decomposes the end-to-end solving process into modular stages, enabling structured reasoning, tool integration, and flexible coordination among agents. The design emphasizes both rigor and extensibility, allowing it to generalize across diverse problem types. Experimental results on a self-constructed benchmark demonstrate consistent improvements across multiple Qwen series models, achieving an average gain of 6.48% in acceptance rate. In contrast, parameter-efficient fine-tuning on the same data yields only a marginal improvement of 0.89%. We further observe a 4.72% gain on LiveCodeBench-Pro, along with consistent improvements across additional accuracy and efficiency metrics. Beyond performance gains, we conduct comprehensive analyses to better understand the reasoning process within the workflow, including error patterns and cross-scenario behaviors. We further perform customized replacement and ablation studies to explore the upper bound of the framework, showing that individual agents can contribute improvements of up to 27.7%. These results highlight the strong potential of MAS-Algorithm for advancing AI-driven algorithmic reasoning.
Chinese Translation
算法问题解决作为评估人工智能编码系统中结构化推理的严格测试平台,直接反映了模型在复杂场景中进行结构化推理的能力。现有方法主要依赖于以模型为中心的策略,如架构修改和数据扩展,这些方法成本高昂且可解释性有限。利用外部工具或提示技术(例如,思维链)的替代方法往往是零散的,缺乏统一的框架。本文提出了MAS-Algorithm,一种系统的多智能体工作流程,旨在解决算法问题,灵感来源于竞技程序员和算法工程师的实践。我们的框架将端到端的解决过程分解为模块化阶段,促进结构化推理、工具集成和智能体之间的灵活协调。设计强调严谨性和可扩展性,使其能够在多种问题类型中进行泛化。在自构建基准上的实验结果表明,多个Qwen系列模型的接受率平均提高了6.48%。相比之下,在相同数据上进行的参数高效微调仅带来了0.89%的边际改善。我们进一步观察到在LiveCodeBench-Pro上获得了4.72%的提升,以及在其他准确性和效率指标上持续改进。除了性能提升外,我们还进行了全面分析,以更好地理解工作流程中的推理过程,包括错误模式和跨场景行为。我们还进行了定制替换和消融研究,以探索框架的上限,显示个别智能体的贡献可提高高达27.7%。这些结果突显了MAS-Algorithm在推动人工智能驱动的算法推理方面的巨大潜力。
cs.AI / 73 / 2605.05951

HaM-World: Soft-Hamiltonian World Models with Selective Memory for Planning

HaM-World:具有选择性记忆的软哈密顿世界模型用于规划
Tang, Haoyun, Cui, Haodong, Xu, Keyao, Wang, Kun, Mei, Zhandong
Abstract
World models enable model-based planning through learned latent dynamics, but imagined rollouts become unstable as the planning horizon grows or the dynamics distribution shifts. We argue that this instability reflects two missing structures in planner-facing latents: history-conditioned memory for approximate Markov completeness, and geometric organization that separates configuration, momentum, and task semantics. We propose HaM-World (HMW), a structured world model that decomposes the latent state into a canonical (q, p) subspace and a context subspace c, while using Mamba selective state-space memory as the history-conditioned input to the same latent dynamics. Within this interface, (q, p) evolves through an energy-derived Hamiltonian vector field plus learnable residual/control dynamics, while c captures semantic, dissipative, and non-conservative factors. This gives the planner a single latent state shared by dynamics prediction, reward/value estimation, imagined rollouts, and CEM action search. On four DeepMind Control Suite tasks, HaM-World reaches the highest Avg. AUC (117.9, +9.5%), reduces long-horizon rollout error to 45% of a strong baseline model, and wins 11 of the 12 $k \in \{3,5,7\}$ MSE cells. Under 12 OOD perturbations spanning dynamics shifts, action delay, and observation masking, HaM-World achieves the highest return in every condition, with average OOD-return gains of 10.2% on Finger Spin and 13.6% on Reacher Easy. Mechanism diagnostics further show bounded action-free Hamiltonian-energy drift, structured energy variation under policy rollouts, and coherent control-induced energy transfer, supporting the intended Soft-Hamiltonian dynamics design.
Chinese Translation
世界模型通过学习的潜在动态实现基于模型的规划,但随着规划视野的扩大或动态分布的变化,想象的展开变得不稳定。我们认为这种不稳定性反映了规划者面对的潜在变量中缺失的两个结构:用于近似马尔可夫完全性的历史条件记忆,以及将配置、动量和任务语义分开的几何组织。我们提出了HaM-World(HMW),一种结构化的世界模型,它将潜在状态分解为一个标准的(q, p)子空间和一个上下文子空间c,同时使用Mamba选择性状态空间记忆作为相同潜在动态的历史条件输入。在这个接口中,(q, p)通过基于能量的哈密顿向量场加上可学习的残差/控制动态进行演化,而c则捕捉语义、耗散和非保守因素。这为规划者提供了一个由动态预测、奖励/价值估计、想象展开和CEM动作搜索共享的单一潜在状态。在四个DeepMind控制套件任务中,HaM-World达到了最高的平均AUC(117.9,+9.5%),将长视野展开误差降低到强基线模型的45%,并在 $k \in \{3,5,7\}$ 的12个MSE单元中赢得了11个。在涵盖动态变化、动作延迟和观察遮蔽的12个OOD扰动下,HaM-World在每种条件下都实现了最高的回报,在Finger Spin和Reacher Easy上平均获得10.2%和13.6%的OOD回报增益。机制诊断进一步显示了有限的无动作哈密顿能量漂移、在策略展开下的结构化能量变化,以及一致的控制引起的能量转移,支持了预期的软哈密顿动态设计。
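The Hamiltonian part of the latent dynamics follows dq/dt = ∂H/∂p and dp/dt = -∂H/∂q. The sketch below integrates these equations with a leapfrog step for a toy, hand-written energy H(q, p) = p²/2 + q²/2, which is an assumption made here for illustration; HaM-World learns its energy function and adds residual and control terms that this example leaves out. It shows why a symplectic update keeps energy drift bounded over long rollouts.

```python
from typing import Callable, Tuple


def leapfrog_step(q: float, p: float, dt: float,
                  grad_potential: Callable[[float], float]) -> Tuple[float, float]:
    """One symplectic step for H(q, p) = p**2 / 2 + V(q)."""
    p_half = p - 0.5 * dt * grad_potential(q)   # dp/dt = -dH/dq
    q_next = q + dt * p_half                    # dq/dt = +dH/dp
    p_next = p_half - 0.5 * dt * grad_potential(q_next)
    return q_next, p_next


def energy(q: float, p: float) -> float:
    """Energy of the toy system with V(q) = q**2 / 2."""
    return 0.5 * p * p + 0.5 * q * q


def rollout(q: float, p: float, dt: float, steps: int) -> Tuple[float, float]:
    """Action-free rollout of the toy harmonic oscillator."""
    for _ in range(steps):
        q, p = leapfrog_step(q, p, dt, grad_potential=lambda x: x)
    return q, p
```

Starting from q = 1, p = 0 with dt = 0.1, the energy stays close to its initial value of 0.5 even after a thousand steps, a small-scale analogue of the bounded action-free energy drift the abstract reports.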
cs.AI / 74 / 2605.05958

Temporal Smoothness Doubly Robust Learning for Debiased Knowledge Tracing

用于去偏知识追踪的时间平滑双重鲁棒学习
Zhan, Peilin, Chen, Wei, Chen, Weilin, Pan, Shuyi, Cai, Ruichu
Abstract
Knowledge Tracing (KT) is fundamental to intelligent education systems, yet relies on educational logs that are selectively observed. The non-random nature of exercise recommendations and student choices inevitably induces severe selection bias. Most existing KT methods neglect this issue, training on observed logs using standard empirical risk, which yields biased mastery estimates and accumulates errors in subsequent recommendations. To address this, we introduce a doubly robust (DR) formulation for KT that integrates a propensity model with an error imputation model, theoretically guaranteeing unbiasedness if either model is accurate. Beyond unbiasedness, in the sequential setting of KT, we identify that the estimator's performance is compromised by variance-dependent stochastic deviations that accumulate over time, thereby causing training instability and limiting performance. To mitigate this, we derive a generalization bound that explicitly characterizes the impact of estimator variance and identifies temporal smoothness as a key factor in controlling it. Building on these theoretical insights, we propose the Temporal Smoothness Doubly Robust (TSDR) framework. TSDR jointly optimizes the KT predictor and the imputation model with a smoothness regularizer, effectively reducing variance while preserving the unbiasedness guarantee of DR. Experiments on multiple real-world benchmarks demonstrate that TSDR consistently enhances various state-of-the-art KT backbones, underscoring the vital role of principled bias correction in KT.
Chinese Translation
知识追踪(Knowledge Tracing,KT)是智能教育系统的基础,但依赖于选择性观察的教育日志。练习推荐和学生选择的非随机性质不可避免地引发严重的选择偏差。大多数现有的KT方法忽视了这一问题,使用标准经验风险在观察到的日志上进行训练,这导致了偏倚的掌握估计,并在后续推荐中累积错误。为了解决这个问题,我们提出了一种KT的双重鲁棒(Doubly Robust,DR)公式,该公式将倾向模型与误差插补模型相结合,理论上保证了如果任一模型准确,则无偏性得以实现。除了无偏性外,在KT的序列设置中,我们发现估计器的性能受到依赖于方差的随机偏差的影响,这些偏差随着时间的推移而累积,从而导致训练不稳定并限制性能。为了减轻这一问题,我们推导出一个泛化界限,明确表征了估计器方差的影响,并确定时间平滑性是控制这一影响的关键因素。基于这些理论洞察,我们提出了时间平滑双重鲁棒(Temporal Smoothness Doubly Robust,TSDR)框架。TSDR通过平滑性正则化器共同优化KT预测器和插补模型,有效降低方差,同时保持DR的无偏性保证。在多个真实世界基准上的实验表明,TSDR始终增强了各种最先进的KT基础模型,强调了原则性偏差修正在KT中的重要作用。
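Illustrative sketch, not taken from the paper: the doubly robust idea described in the abstract can be written as the imputed error plus an inverse-propensity-weighted correction on the observed entries. The function and argument names below are hypothetical, and the temporal smoothness regularizer is left out because the abstract does not give its form.

```python
def doubly_robust_loss(errors, imputed, observed, propensities):
    """Average doubly robust (DR) loss over all student-exercise pairs.

    errors[i]       -- prediction error; only used where observed[i] == 1
    imputed[i]      -- imputation model's estimate of that error
    observed[i]     -- 1 if the interaction was logged, else 0
    propensities[i] -- estimated probability that the interaction is observed
    (All names are assumptions made for this sketch.)
    """
    total = 0.0
    for e, e_hat, o, p in zip(errors, imputed, observed, propensities):
        # Correct the imputed error only where the true error was observed.
        correction = (e - e_hat) / p if o else 0.0
        total += e_hat + correction
    return total / len(errors)


# Toy data: four pairs, two of them observed.
true_errors = [0.2, 0.4, 0.6, 0.8]
mask = [1, 0, 1, 0]
estimate = doubly_robust_loss(true_errors, true_errors, mask, [0.5, 0.5, 0.9, 0.1])
```

With an exact imputation model the correction term vanishes, so the estimate equals the true mean error whatever the propensities are. That is one half of the "either model is accurate" guarantee the abstract refers to.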
cs.AI / 75 / 2605.05959

From Coordinate Matching to Structural Alignment: Rethinking Prototype Alignment in Heterogeneous Federated Learning

从坐标匹配到结构对齐:重新思考异构联邦学习中的原型对齐
Wu, Xinghao, Niu, Jianwei, Zhu, Guogang, Liu, Xuefeng, Tang, Shaojie, Zhang, Jiayuan
Abstract
Heterogeneous federated learning (HtFL) aims to enable collaboration among clients that differ in both data distributions and model architectures. Prototype-based methods, which communicate class-level feature centers (prototypes) instead of full model parameters, have recently shown strong potential for HtFL. Existing prototype-based HtFL methods typically reuse the MSE-based or cosine-based alignment mechanism developed for homogeneous FL when aligning client-specific representations with global prototypes. These approaches are essentially coordinate alignment, where representations of clients are forced to match the global prototypes in the embedding space in an element-wise manner. Such alignment implicitly assumes that all clients should map their representations into the feature subspace defined by the global prototypes. This assumption is reasonable in homogeneous FL, where all clients share the same feature extractor. However, it becomes problematic in HtFL, since heterogeneous feature extractors naturally induce client-specific feature subspaces, and forcing all clients to optimize within a single global subspace unnecessarily suppresses their learning capacity. We observe that coordinate alignment implicitly couples two distinct objectives: aligning inter-class semantic structure, which is directly beneficial for classification, and enforcing a shared feature basis, which is unnecessary and even harmful under model heterogeneity. Building on this insight, we design FedSAF, which shifts the alignment objective from absolute coordinates to inter-class relational structure. We demonstrate that structural alignment consistently outperforms coordinate alignment in heterogeneous settings. Experiments on multiple benchmarks show that our structural alignment outperforms state-of-the-art prototype-based HtFL methods by up to 3.52\%.
Chinese Translation
异构联邦学习(HtFL)旨在促进在数据分布和模型架构上存在差异的客户端之间的协作。基于原型的方法通过传递类别级特征中心(原型)而非完整的模型参数,最近在HtFL中显示出强大的潜力。现有的基于原型的HtFL方法通常在将客户端特定表示与全局原型对齐时,重用为同构联邦学习(FL)开发的基于均方误差(MSE)或余弦相似度的对齐机制。这些方法本质上是坐标对齐,其中客户端的表示被迫在嵌入空间中逐元素地与全局原型匹配。这种对齐隐含地假设所有客户端都应将其表示映射到由全局原型定义的特征子空间。在同构FL中,这一假设是合理的,因为所有客户端共享相同的特征提取器。然而,在HtFL中,这一假设变得问题重重,因为异构特征提取器自然会引入客户端特定的特征子空间,强迫所有客户端在单一全局子空间内进行优化会不必要地抑制它们的学习能力。我们观察到,坐标对齐隐含地耦合了两个不同的目标:对齐类间语义结构,这对分类直接有利,以及强制共享特征基础,这在模型异构性下是不必要的,甚至是有害的。基于这一见解,我们设计了FedSAF,将对齐目标从绝对坐标转移到类间关系结构。我们证明,在异构环境中,结构对齐始终优于坐标对齐。在多个基准测试中的实验表明,我们的结构对齐方法比最先进的基于原型的HtFL方法提高了高达3.52%的性能。
cs.AI / 76 / 2605.05963

TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning

TheraAgent:自我改进的治疗代理,用于精准和全面的治疗计划制定
Li, Junkai, Lai, Yunghwei, Zhu, Tianyi, Lee, Zheng Long, Ma, Weizhi, Liu, Yang
Abstract
Formulating a treatment plan is inherently a complex reasoning and refinement task rather than a simple generation problem. However, existing large language models (LLMs) mainly rely on one-shot output without explicit verification, which may result in rough, incomplete, and potentially unsafe treatment plans. To address these limitations, we propose TheraAgent, an agentic framework that replaces one-shot generation with an iterative generate-judge-refine pipeline. By mirroring the actual reasoning process of human experts who iteratively revise treatment plans, our framework progressively transforms coarse and incomplete drafts into precise, comprehensive, and safer therapeutic regimens. To facilitate the critical judge component, we introduce TheraJudge, a treatment-specific evaluation module integrated into the inference loop to enforce clinical standards. Experiments show TheraAgent achieves state-of-the-art results on HealthBench, leading in Accuracy and Completeness. In expert evaluations, it attains an 86% win rate against physicians, with superior Targeting and Harm Control. Moreover, the high agreement between TheraJudge and HealthBench evaluations confirms the reliability of our framework.
Chinese Translation
制定治疗计划本质上是一个复杂的推理和修正任务,而不是简单的生成问题。然而,现有的大型语言模型(LLMs)主要依赖一次性输出而没有明确的验证,这可能导致粗糙、不完整且潜在不安全的治疗计划。为了解决这些局限性,我们提出了TheraAgent,一个代理框架,它用迭代的生成-评估-修正流程替代了一次性生成。通过模拟人类专家的实际推理过程,专家们会迭代修订治疗计划,我们的框架逐步将粗糙和不完整的草稿转化为精准、全面且更安全的治疗方案。为了促进关键的评估组件,我们引入了TheraJudge,一个特定于治疗的评估模块,集成到推理循环中以执行临床标准。实验表明,TheraAgent在HealthBench上取得了最先进的结果,在准确性和完整性方面领先。在专家评估中,它以86%的胜率战胜医生,具有更优的靶向性和危害控制。此外,TheraJudge与HealthBench评估之间的高度一致性证实了我们框架的可靠性。
cs.AI / 77 / 2605.05977

BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning

BehaviorGuard:深度强化学习的在线后门防御
Yu, Yinbo, Yin, Xueyu, Wang, Jiadai, Tian, Chunwei, Xu, Sai, Zhu, Qi, Zhang, Daoqiang
Abstract
Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse-engineer triggers and model fine-tuning to remove backdoors. However, complex trigger patterns undermine their robustness, and fine-tuning entails high costs, limiting practical utility. Therefore, we shift defense concerns to trigger-agnostic backdoor output behaviors and propose BehaviorGuard, an online behavior-based backdoor detection and mitigation framework for DRL. Specifically, we find that regardless of attacks, backdoored policies induce consistent shifts in action distributions to ensure reliable activation, leaving detectable traces in high-quantile regions and distribution tails, even in the absence of triggers. Based on this, we design a novel metric that captures behavioral drift in action distributions to identify and suppress backdoor actions at runtime. To our knowledge, this is the first online backdoor defense that counters attacks in both single- and multi-agent DRL. Evaluated across diverse benchmarks with different backdoor attacks, BehaviorGuard consistently surpasses prior methods in both efficacy and efficiency.
Chinese Translation
后门攻击对深度强化学习(DRL)构成了严重威胁。目前的防御方法通常依赖于奖励异常来逆向工程触发器,并通过模型微调来消除后门。然而,复杂的触发模式削弱了这些方法的鲁棒性,而微调则涉及高昂的成本,限制了其实用性。因此,我们将防御关注点转向无触发器依赖的后门输出行为,并提出了BehaviorGuard,一个基于行为的在线后门检测和缓解框架,适用于DRL。具体而言,我们发现无论攻击如何,后门策略都会在动作分布中引起一致的变化,以确保可靠的激活,即使在没有触发器的情况下,也会在高分位区域和分布尾部留下可检测的痕迹。基于此,我们设计了一种新颖的度量标准,捕捉动作分布中的行为漂移,以在运行时识别和抑制后门动作。据我们所知,这是首个针对单一和多智能体DRL攻击的在线后门防御方法。在不同后门攻击的多样基准测试中,BehaviorGuard在有效性和效率上始终超越了先前的方法。
cs.AI / 78 / 2605.05980

TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

TACT:通过激活引导减轻编码代理的过度思考和过度行为
Sui, Yuan, Chen, Yulin, Li, Yibo, Jiang, Xue, He, Yufei, Dong, Yihong, He, Xiaoxin, Gao, Tianyu, Hooi, Bryan
Abstract
When language model agents tackle complex software engineering tasks, they often degrade over long trajectories, which we define as *agent drift*. We focus on two recurring failure modes, *overthinking* and *overacting*, i.e., where the agent repeatedly reasons over information it already has, and where it issues tool calls without integrating recent observations or acquiring new evidence. In this paper, we introduce TACT (Think-Act Calibration via activation Steering) to detect and mitigate agent drift in the residual stream before it surfaces as a behavioral failure. Specifically, we label trajectory steps as overthinking, overacting, or calibrated, and find that their hidden states can be separated linearly along two *drift axes*, pointing from calibrated behavior toward each failure mode (AUC $\approx$ 0.9). To mitigate agent drift, we project each step's activation onto these axes at test time and pull drifted ones back toward the calibrated region. Experiments show that TACT outperforms unsteered baselines across SWE-bench Verified, Terminal-Bench 2.0, and CLAW-Eval, lifting average resolve rate by $+5.8$ pp on Qwen3.5-27B and $+4.8$ pp on Gemma-4-26B-A4B-it while cutting steps-to-resolve by up to $26\%$. These gains frame agent drift as a steerable direction in the residual stream, and position TACT as a viable handle for reliable long-horizon agents.
Chinese Translation
当语言模型代理处理复杂的软件工程任务时,它们往往在长时间的轨迹中表现下降,我们将其定义为*代理漂移*。我们关注两种反复出现的失败模式:*过度思考*和*过度行为*,即代理反复推理其已有的信息,以及在没有整合最近观察或获取新证据的情况下发出工具调用。在本文中,我们介绍了TACT(通过激活引导进行思考-行为校准),以在代理漂移表现为行为失败之前检测并减轻其在残余流中的影响。具体而言,我们将轨迹步骤标记为过度思考、过度行为或校准,并发现它们的隐藏状态可以沿着两个*漂移轴*线性分离,这些轴指向从校准行为到每种失败模式(AUC约为0.9)。为了减轻代理漂移,我们在测试时将每个步骤的激活投影到这些轴上,并将漂移的步骤拉回到校准区域。实验表明,TACT在SWE-bench Verified、Terminal-Bench 2.0和CLAW-Eval上优于未引导的基线,在Qwen3.5-27B上平均解决率提高了+5.8个百分点,在Gemma-4-26B-A4B-it上提高了+4.8个百分点,同时将解决步骤减少了多达26%。这些收益将代理漂移框定为残余流中的可引导方向,并将TACT定位为可靠的长时域代理的可行工具。
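Illustrative sketch, not the authors' implementation: the "project onto a drift axis and pull back" step from the abstract could look roughly like the following. The threshold, the strength parameter, and the exact update rule are assumptions made for this example; the abstract does not specify them.

```python
import math


def steer(hidden, axis, threshold, strength=1.0):
    """Pull a hidden state back along one drift axis (hypothetical rule).

    hidden    -- activation vector for the current step
    axis      -- direction pointing from calibrated behavior toward a failure mode
    threshold -- largest projection still treated as calibrated (assumed)
    strength  -- fraction of the excess projection to remove (assumed)
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    projection = sum(h * u for h, u in zip(hidden, unit))
    excess = projection - threshold
    if excess <= 0:
        return list(hidden)  # already inside the calibrated region
    # Remove only the excess component along the drift direction.
    return [h - strength * excess * u for h, u in zip(hidden, unit)]


drifted = steer([3.0, 4.0], [2.0, 0.0], threshold=1.0)
calibrated = steer([0.5, 4.0], [2.0, 0.0], threshold=1.0)
```

Components orthogonal to the drift axis are left untouched, so under these assumptions the intervention only changes how far the step has moved toward the failure mode.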
cs.AI / 79 / 2605.05985

BioResearcher: Scenario-Guided Multi-Agent for Translational Medicine

BioResearcher:场景引导的多智能体转化医学系统
Kinas, Remigiusz, Krawczyk, Joanna, Powalski, Rafał, Pietrzak, Przemysław, Kowalewska, Agnieszka, Kolmus, Krzysztof, Sypetkowski, Maciej, Smoliński, Łukasz, Jetka, Tomasz
Abstract
Translational medicine turns underspecified development goals into evidence synthesis that must combine literature, trials, patents, and quantitative multi-omics analysis while preserving identifiers, uncertainty, and retrievable provenance. General-purpose foundation models and off-the-shelf tool-augmented or multi-agent systems are not built for this: they tend to produce single-shot answers or run open-endedly, and fall short on the auditable, scenario-specific workflows that heterogeneous biomedical sources demand. This paper introduces Ingenix BioResearcher, a scenario-guided multi-agent system that maps queries to versioned research playbooks, delegates to specialized subagents over 30+ tools and machine-learning endpoints, mixes structured database access with sandboxed code for genome-scale analyses, and applies claim-level multi-model reconciliation before editorial assembly. We evaluate BioResearcher across unit-level capabilities, open-ended biomedical reasoning, and end-to-end clinical discovery. It leads evaluated baselines on 109 single-step tests (83.49% pass rate; 0.892 average score), achieves strong biomedical benchmark performance (89.33% on BixBench-Verified-50 and the top 0.758 mean score on BaisBench Scientific Discovery), and leads on a 30-query clinical end-to-end benchmark with the highest positive hit rate (74.7% $\pm$ 3.3%) and negative clear rate (96.8% $\pm$ 0.2%). These results show broad, competitive performance across unit-level, open-ended, and end-to-end clinical evaluations.
Chinese Translation
转化医学将不明确的发展目标转化为证据综合,这需要结合文献、试验、专利和定量多组学分析,同时保留标识符、不确定性和可追溯的来源。通用基础模型和现成的工具增强或多智能体系统并未针对这一需求进行构建:它们往往产生单一答案或开放式运行,无法满足异构生物医学来源所需的可审计、特定场景的工作流程。本文介绍了Ingenix BioResearcher,一个场景引导的多智能体系统,它将查询映射到版本化的研究手册,委派给超过30个工具和机器学习端点的专业子代理,结合结构化数据库访问与沙箱代码进行基因组规模分析,并在编辑组装之前应用声明级多模型协调。我们在单元级能力、开放式生物医学推理和端到端临床发现方面评估了BioResearcher。它在109个单步测试中领先评估基线(通过率83.49%;平均分0.892),在生物医学基准性能上表现强劲(在BixBench-Verified-50上达到89.33%,在BaisBench Scientific Discovery上获得最高的0.758平均分),并在30个查询的临床端到端基准中以最高的正命中率(74.7% ± 3.3%)和负清除率(96.8% ± 0.2%)领先。这些结果显示了在单元级、开放式和端到端临床评估中的广泛竞争性能。
cs.AI / 80 / 2605.06024

Strat-LLM: Stratified Strategy Alignment for LLM-based Stock Trading with Real-time Multi-Source Signals

Strat-LLM:基于分层策略对齐的LLM股票交易框架,结合实时多源信号
Huang, Wenliang, Yu, Zengyi
Abstract
Large Language Models (LLMs) are evolving into autonomous trading agents, yet existing benchmarks often overlook the interplay between architectural reasoning and strategy consistency. We propose Strat-LLM, a framework grounded in Stratified Strategy Alignment. Operating in a live-forward setting throughout 2025, it integrates heterogeneous data including sequential prices, real-time news, and annual reports to eliminate look-ahead bias. Extensive stress tests on A-share and U.S. markets reveal: (1) reasoning-heavy models achieve peak utility in Free Mode via internal logic, whereas standard models require Strict Mode as a vital risk anchor; (2) alignment utility is regime-dependent, with Free and Guided modes capturing momentum in uptrending markets, while Strict Mode mitigates drawdowns in downtrends; (3) mid-scale models (35B) show optimal fidelity under strict constraints, whereas ultra-large models (122B) suffer an alignment tax under rigid rules but gain a performance premium in Guided Mode; (4) standard LLMs often fall into a high win-rate trap, optimizing for small gains at the expense of total returns, which can only be mitigated through deep reasoning or strict external guardrails. Project details are available at https://Strat-LLM.github.io.
Chinese Translation
大型语言模型(LLMs)正在演变为自主交易代理,但现有基准往往忽视了架构推理与策略一致性之间的相互作用。我们提出了Strat-LLM,一个基于分层策略对齐的框架。在2025年的实时前向环境中运行,它整合了包括序列价格、实时新闻和年度报告在内的异构数据,以消除前瞻性偏差。在A股和美国市场上进行的广泛压力测试显示:(1)重推理模型通过内部逻辑在自由模式下实现了最佳效用,而标准模型则需要严格模式作为重要的风险锚;(2)对齐效用依赖于市场状态,自由和引导模式在上升市场中捕捉动量,而严格模式则在下跌趋势中减轻回撤;(3)中型模型(35B)在严格约束下显示出最佳保真度,而超大型模型(122B)在严格规则下遭受对齐税,但在引导模式下获得性能溢价;(4)标准LLMs往往陷入高胜率陷阱,优化小幅收益却牺牲总回报,这只能通过深度推理或严格的外部保护措施来缓解。项目详情可访问 https://Strat-LLM.github.io。
cs.AI / 81 / 2605.06029

Pathways to AGI

通往人工通用智能的路径
Fletcher, Gordon, Khan, Saomai Vu
Abstract
Our focus is on five related questions that stem from a critical software studies perspective. Underpinning this view is the acknowledged need to avoid assumptions regarding the inevitability of the current situation relating to AI. What we need to see is the closeness of the linkage between current commercial AI development and our prevailing social, political and economic circumstances. This means that the perspectives presented here are offered critically and conditionally. Most importantly, Artificial General Intelligence (AGI) is seen as being problematic both conceptually and definitionally. This conditioning of any view regarding AGI does lead the discussion in specific directions and to certain conclusions regarding the future. However, adopting this perspective enables the work to offer some final recommendations. We set out to ask the following questions: 1. What are the critical pathways that produced the current dominant generative AI tools (capabilities, product forms, adoption patterns)? 2. Which decision points acted as leverage nodes (small changes that had large downstream effects), and which dead ends reveal alternative possibilities that did not become dominant? 3. How do pathways differ across three foundational-model trajectories: frontier proprietary models, open-weight models, and domain-specific and sovereign models? 4. Which alternative projects branched from key leverage nodes, what is their current state, and why did some succeed, stall, fail or become absorbed? 5. Based on this analysis, what socio-technical development programmes could plausibly move toward AGI-adjacent capability while meeting requirements for transparency, moderation, wellbeing and sustainable business models?
Chinese Translation
我们的关注点是五个相关的问题,这些问题源于批判性软件研究的视角。支撑这一观点的是对当前与人工智能(AI)相关情况的不可避免性假设的认识和避免。我们需要关注的是当前商业AI发展与我们现有的社会、政治和经济环境之间的紧密联系。这意味着这里提出的观点是以批判性和条件性的方式呈现的。最重要的是,人工通用智能(AGI)在概念和定义上都被视为存在问题。对AGI的任何观点的这种条件性确实引导了讨论朝着特定方向发展,并得出关于未来的某些结论。然而,采用这种视角使得我们的工作能够提出一些最终的建议。我们设定了以下问题:1. 产生当前主导生成性AI工具(能力、产品形式、采用模式)的关键路径是什么?2. 哪些决策点充当了杠杆节点(小变化产生了大的下游影响),而哪些死胡同揭示了未能成为主导的替代可能性?3. 在前沿专有模型、开放权重模型或特定领域和主权模型等三种基础模型轨迹中,路径有何不同?4. 从关键杠杆节点分支出的替代项目有哪些,它们的当前状态如何,为什么有些成功、有些停滞、有些失败或被吸收?5. 基于此分析,哪些社会技术发展项目可以合理地朝向AGI相关能力发展,同时满足透明度、适度、福祉和可持续商业模式的要求?
cs.AI / 82 / 2605.06040

Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning

基于新颖性的思维树搜索用于大语言模型的推理与规划
Hamm, Leon, Ajanovic, Zlatan
Abstract
Although advances such as chain-of-thought, tree-of-thought or reinforcement learning have improved the performance of LLMs in reasoning and planning tasks, they are still brittle and have not achieved human-level performance in many domains, and often suffer from high time and token costs. Inspired by the success of width-based search in planning, we explore how the concept of novelty can be transferred to language domains and how it can improve tree-of-thought reasoning. A tree of thoughts relies on building possible "paths" of consecutive ideas or thoughts. These are generated by repeatedly prompting an LLM. In our paper, a measurable concept of novelty is proposed that describes the uniqueness of a new node (thought) in comparison to nodes previously seen in the search tree. Novelty is estimated by prompting an LLM and making use of embedded general knowledge from pre-training. This metric can then be used to prune branches and reduce the scope of the search. Although this method introduces more prompts per state, the overall token cost can be reduced by pruning and reducing the overall tree size. This procedure is tested and compared using several benchmarks in language-based planning and general reasoning.
Chinese Translation
尽管链式思维、树状思维或强化学习等进展提高了大语言模型(LLMs)在推理和规划任务中的表现,但它们仍然脆弱,在许多领域未能达到人类水平的表现,并且往往面临高时间和令牌成本。受到基于宽度搜索在规划中成功的启发,我们探索了新颖性概念如何转移到语言领域,以及它如何改善树状思维推理。思维树依赖于构建连续想法或思维的可能“路径”。这些路径是通过反复提示LLM生成的。在本文中,我们提出了一个可测量的新颖性概念,用于描述新节点(思维)与搜索树中先前看到的节点相比的独特性。新颖性通过提示LLM并利用预训练中嵌入的一般知识来估计。该指标可以用于修剪分支并减少搜索范围。尽管这种方法在每个状态下引入了更多的提示,但通过修剪和减少整体树的大小,可以降低整体令牌成本。我们使用多个基准在基于语言的规划和一般推理中对该过程进行了测试和比较。
cs.AI / 83 / 2605.06054

Visual Fingerprints for LLM Generation Comparison

用于大语言模型生成比较的视觉指纹
Alnouri, Amal, Hinterreiter, Andreas, Humer, Christina, Cheng, Furui, Streit, Marc
Abstract
Large language model (LLM) outputs arise from complex interactions among prompts, system instructions, model parameters, and architecture. We refer to specific configurations of these factors as generation conditions, each of which can bias outputs in various ways. Understanding how different generation conditions shape model behaviors is essential for tasks such as prompt design and model evaluation, yet it remains challenging due to the stochastic and open-ended nature of text generation. We present an approach to visually compare LLM outputs across generation conditions by modeling responses as collections of linguistic choices, including content, expression, and structure. We extract these choices using natural language processing pipelines and represent their distributions across repeated samples. We then visualize these distributions as visual fingerprints, enabling direct, distribution-level comparison of condition-specific tendencies. Through four usage scenarios, we demonstrate how visual fingerprints reveal consistent patterns in LLM behavior that are difficult to observe through individual responses or aggregate metrics.
Chinese Translation
大型语言模型(LLM)的输出源于提示、系统指令、模型参数和架构之间的复杂交互。我们将这些因素的特定配置称为生成条件,每种条件都可能以不同方式影响输出。理解不同生成条件如何塑造模型行为对于提示设计和模型评估等任务至关重要,但由于文本生成的随机性和开放性,这一过程仍然具有挑战性。我们提出了一种通过将响应建模为语言选择的集合(包括内容、表达和结构)来可视化比较不同生成条件下的LLM输出的方法。我们使用自然语言处理管道提取这些选择,并表示它们在重复样本中的分布。然后,我们将这些分布可视化为视觉指纹,从而实现条件特定倾向的直接分布级比较。通过四个使用场景,我们展示了视觉指纹如何揭示LLM行为中的一致模式,而这些模式通过单个响应或聚合指标难以观察。
cs.AI / 84 / 2605.06068

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

VibeServe:人工智能代理能否构建定制化的LLM服务系统?
Kamahori, Keisuke, Li, Shihang, Peter, Simon, Kasikci, Baris
Abstract
For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite bet: a multi-agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end. VibeServe uses an outer loop to plan and track the search over system designs, and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, where existing stacks are highly optimized, VibeServe remains competitive with vLLM, showing that generation-time specialization need not come at the cost of performance. More interestingly, in non-standard scenarios, VibeServe outperforms existing systems by exploiting opportunities that generic systems miss in six scenarios involving non-standard model architectures, workload knowledge, and hardware-specific optimizations. Together, these results suggest a different point in the design space for infrastructure software: generation-time specialization rather than runtime generality. Code is available at https://github.com/uw-syfi/vibe-serve.
Chinese Translation
多年来,我们像构建其他关键基础设施一样构建LLM服务系统:一个单一的通用堆栈,经过多年的工程调优,旨在支持每个模型和工作负载。在本文中,我们采取了相反的策略:一个多代理循环,自动合成适用于不同使用场景的定制服务系统。我们提出了VibeServe,这是第一个生成整个LLM服务堆栈的代理循环。VibeServe使用外部循环来规划和跟踪系统设计的搜索,并使用内部循环来实现候选方案、检查正确性并在目标基准上测量性能。在标准部署环境中,现有堆栈经过高度优化,VibeServe与vLLM保持竞争力,表明生成时的专业化不必以性能为代价。更有趣的是,在非标准场景中,VibeServe通过利用通用系统在六种涉及非标准模型架构、工作负载知识和硬件特定优化的场景中错失的机会,超越了现有系统。总的来说,这些结果暗示了基础设施软件设计空间的不同视角:生成时的专业化而非运行时的通用性。代码可在https://github.com/uw-syfi/vibe-serve获取。
cs.AI / 85 / 2605.06087

Safety Certification is Classification

安全认证即分类
Schön, Oliver, Romao, Licio, Soudjani, Sadegh
Abstract
The goal of this paper is certifying safety of dynamical systems subject to uncertainty. Existing approaches use trajectory data to estimate transition probabilities, and compute safety probabilities recursively via dynamic programming (DP). This recursion may lead to compounding errors in the certified safety probability, thus collapsing to a vacuous lower bound for growing horizons $T$. We propose a kernel embedding framework that treats safety certification as a classification problem on trajectory data, directly estimating the $T$-step safety probability without recursion. We show that the framework subsumes well-established approaches from the literature (e.g., barrier certificates, robust Markov models) as special cases, and allows us to go beyond their limitations. As the main consequence, it bypasses compounding error across the horizon and enables certification for systems with non-Markovian dynamics. We demonstrate that direct estimators remain stable independent of the certification horizon and in the non-Markovian setting, whilst DP-based certificates silently go unsound -- confirmed in simulation on a neural-controlled quadrotor.
Chinese Translation
本文的目标是对受不确定性影响的动态系统进行安全认证。现有的方法使用轨迹数据来估计转移概率,并通过动态规划(DP)递归计算安全概率。这种递归可能导致认证安全概率中的累积误差,从而在增长的时间范围 $T$ 内崩溃为一个空洞的下界。我们提出了一种核嵌入框架,将安全认证视为轨迹数据上的分类问题,直接估计 $T$ 步安全概率而无需递归。我们展示了该框架作为特例包含了文献中已建立的方法(例如,障碍证书、鲁棒马尔可夫模型),并使我们能够超越它们的局限性。其主要结果是,它避免了在时间范围内的累积误差,并使得对具有非马尔可夫动态的系统进行认证成为可能。我们证明了直接估计器在认证时间范围和非马尔可夫设置中保持稳定,而基于DP的证书则在模拟中被确认默默地失效,模拟对象为神经控制的四旋翼飞行器。
cs.AI / 86 / 2605.06105

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

浅预填,深解码:通过层不对称KV可见性实现高效的长上下文推理
Oh, Jungsuk, Jeon, Hyeseo, Ji, Hyunjune, Kong, Kyongmin, Lee, Jay-Yoon
Abstract
Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce \emph{Shallow Prefill, dEEp Decode} (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal BoS anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75\% of layers for prefill tokens reaches 51.2 average score on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving TTFT by 33\%, TPOT by 22\%, and reducing active KV memory by 25.0\% at 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.
Chinese Translation
在仅解码器的语言模型中,长上下文推理的成本较高,因为长提示在预填阶段被处理,在每一层中被缓存,并在自回归解码过程中反复关注。我们提出了\textit{Shallow Prefill, dEEp Decode}(SPEED),这是一种相位不对称的KV可见性策略,仅在较低层中实现非锚点提示令牌的KV状态,同时保持解码阶段的令牌为全深度。与之前使上层提示KV状态存储或构建更便宜的方法不同,SPEED完全从上层解码可见性集中移除了预填令牌。通过一个最小的BoS锚点,这一简单的变化在降低长上下文成本的同时保持了广泛的基准质量。在一项控制的Llama-3.1-8B指令调优研究中,SPEED仅使用75\%的层用于预填令牌,在OLMES风格的基准测试中达到了51.2的平均分,而全深度基线为51.4,同时TTFT提高了33\%,TPOT提高了22\%,并在128K上下文下减少了25.0\%的活跃KV内存。逐层诊断表明,这一截止保留了全深度模型的主要提示选择和表示稳定区域。这些结果表明,当解码阶段的令牌保持全深度时,长上下文提示令牌不必始终作为全深度KV缓存对象存在。
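Back-of-the-envelope sketch, not the paper's own accounting: if prompt-token KV states are kept in only 75% of the layers, the prompt's KV footprint shrinks in proportion to the layer count. The configuration values below (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are assumed from the publicly documented Llama-3.1-8B setup, and the function name is hypothetical.

```python
def kv_cache_bytes(tokens, layers, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `tokens` tokens.

    The leading factor 2 covers keys plus values; the defaults are assumed
    Llama-3.1-8B settings with fp16/bf16 storage.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


prompt_tokens = 128 * 1024                           # 128K context
full_depth = kv_cache_bytes(prompt_tokens, layers=32)
shallow = kv_cache_bytes(prompt_tokens, layers=24)   # 75% of 32 layers
reduction = 1 - shallow / full_depth
```

Under these assumptions the full-depth prompt cache is 16 GiB and the shallow one is 12 GiB, a 25% reduction. This simple layer-proportional estimate matches the figure quoted in the abstract, though the paper may count "active KV memory" differently.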
cs.AI / 87 / 2605.06110

On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows

按时、在预算内:面向代理工作流的约束驱动在线资源分配
Wang, Xinglin, Liu, Zishen, Feng, Shaoxiong, Yuan, Peiwen, Li, Yiwei, Shi, Jiayi, Zhang, Yueqi, Tan, Chuyi, Zhang, Ji, Pan, Boyuan, Hu, Yao, Li, Kan
Abstract
Agentic systems increasingly solve complex user requests by executing orchestrated workflows, where subtasks are assigned to specialized models or tools and coordinated according to their dependencies. While recent work improves agent efficiency by optimizing the performance--cost--latency frontier, real deployments often impose concrete requirements: a workflow must be completed within a specified budget and before a specified deadline. This shifts the goal from average efficiency optimization to maximizing the probability that the entire workflow completes successfully under explicit budget and deadline constraints. We study \emph{constraint-driven online resource allocation for agentic workflows}. Given a dependency-structured workflow and estimates of success rates and generation lengths for each subtask--model pair, the executor allocates models and parallel samples across simultaneously executable subtasks while managing the remaining budget and time. We formulate this setting as a finite-horizon stochastic online allocation problem and propose \emph{Monte Carlo Portfolio Planning} (MCPP), a lightweight closed-loop planner that directly estimates constrained completion probability through simulated workflow executions and replans after observed outcomes. Experiments on CodeFlow and ProofFlow demonstrate that MCPP consistently improves constrained completion probability over strong baselines across a wide range of budget--deadline constraints.
Chinese Translation
代理系统越来越多地通过执行协调的工作流来解决复杂的用户请求,其中子任务被分配给专门的模型或工具,并根据其依赖关系进行协调。尽管近期的研究通过优化性能-成本-延迟边界来提高代理的效率,但实际部署往往会施加具体要求:工作流必须在指定预算内并在指定截止日期之前完成。这将目标从平均效率优化转变为在明确的预算和截止日期约束下,最大化整个工作流成功完成的概率。我们研究了面向代理工作流的\textit{约束驱动在线资源分配}。给定一个依赖结构的工作流以及每个子任务-模型对的成功率和生成长度的估计,执行者在管理剩余预算和时间的同时,分配模型和并行样本到可以同时执行的子任务上。我们将这一设置形式化为一个有限视野随机在线分配问题,并提出\textit{蒙特卡洛投资组合规划}(Monte Carlo Portfolio Planning, MCPP),这是一种轻量级闭环规划器,通过模拟工作流执行直接估计受约束的完成概率,并在观察到的结果后重新规划。在CodeFlow和ProofFlow上的实验表明,MCPP在广泛的预算-截止日期约束下,始终提高了受约束的完成概率,相较于强基线表现出一致的改进。
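Illustrative sketch, not the MCPP algorithm itself: the core quantity in the abstract, the probability that a workflow finishes within both a budget and a deadline, can be estimated by simulating executions. The simplifications here are assumptions for the example: subtasks run strictly in sequence, each failed attempt is retried with the same model, and every attempt has a fixed cost and latency. All names are hypothetical.

```python
import random


def completion_probability(plan, budget, deadline, n_sims=2000, seed=0):
    """Monte Carlo estimate of finishing `plan` within budget and deadline.

    plan -- list of (success_prob, cost, latency) tuples, one per subtask,
            executed in order with retries until success (assumed model;
            cost or latency must be positive so retries terminate).
    """
    rng = random.Random(seed)
    completed = 0
    for _ in range(n_sims):
        spent, elapsed, ok = 0.0, 0.0, True
        for success_prob, cost, latency in plan:
            done = False
            while not done:
                spent += cost
                elapsed += latency
                if spent > budget or elapsed > deadline:
                    ok = False  # constraint violated, this run fails
                    break
                done = rng.random() < success_prob
            if not ok:
                break
        completed += ok
    return completed / n_sims


certain = completion_probability([(1.0, 1.0, 1.0), (1.0, 2.0, 1.0)], budget=3.0, deadline=2.0)
too_tight = completion_probability([(1.0, 1.0, 1.0), (1.0, 2.0, 1.0)], budget=2.5, deadline=2.0)
two_tries = completion_probability([(0.5, 1.0, 1.0)], budget=2.0, deadline=10.0)
```

A closed-loop planner in the spirit of the abstract would call an estimator like this for each candidate allocation, pick the one with the highest estimate, and re-estimate with the remaining budget and time after each observed outcome.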
cs.AI / 88 / 2605.06115

CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

CrossCult-KIBench:跨文化知识插入的基准测试
Zeng, Zhen, Gu, Leijiang, Li, Feng, Yu, Jing, Shi, Zenglin
Abstract
Multimodal Large Language Models (MLLMs), trained primarily on English-centric data, frequently generate culturally inappropriate or misaligned responses in cross-cultural settings. To mitigate this, we introduce the task of cross-cultural knowledge insertion, which focuses on adapting models to specific cultural contexts while preserving their original behavior in other cultures. To facilitate research in this area, we introduce CrossCult-KIBench, a comprehensive evaluation benchmark for assessing both the effectiveness of knowledge insertion and its unintended side effects on non-target cultures. The benchmark includes 9,800 image-grounded cases covering 49 culturally relevant visual scenarios across English, Chinese, and Arabic language-culture groups. It supports evaluation in both single-insert and sequential-insert settings. We also propose Memory-Conditioned Knowledge Insertion (MCKI) as a baseline method. MCKI retrieves relevant cultural knowledge from an external memory using frozen MLLM representations, prepending matched entries as conditional prompts when applicable. Extensive experiments on CrossCult-KIBench reveal that current approaches struggle to balance effective cultural adaptation with behavioral preservation, highlighting a key challenge in developing culturally-aware MLLMs. Our work thus underscores an important research direction for developing more culturally adaptive and responsible MLLMs.
Chinese Translation
多模态大型语言模型(MLLMs)主要在以英语为中心的数据上进行训练,在跨文化环境中经常生成文化不当或不一致的响应。为此,我们引入了跨文化知识插入的任务,该任务侧重于在保持模型在其他文化中原有行为的同时,适应特定文化背景。为促进该领域的研究,我们推出了CrossCult-KIBench,这是一个全面的评估基准,用于评估知识插入的有效性及其对非目标文化的意外副作用。该基准包括9800个以图像为基础的案例,涵盖了英语、中文和阿拉伯语文化群体中49个与文化相关的视觉场景。它支持单次插入和序列插入设置的评估。我们还提出了记忆条件知识插入(Memory-Conditioned Knowledge Insertion, MCKI)作为基线方法。MCKI使用冻结的MLLM表示从外部记忆中检索相关的文化知识,并在适用时将匹配的条目作为条件提示进行前置。对CrossCult-KIBench的广泛实验表明,当前的方法在有效的文化适应与行为保持之间难以取得平衡,突显了开发具有文化意识的MLLMs的一个关键挑战。因此,我们的工作强调了开发更具文化适应性和责任感的MLLMs的重要研究方向。
cs.AI / 89 / 2605.06116

Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

基于策略引导的逐步模型路由以实现成本有效的推理
Si, Wenwen, Lee, Insup, Bastani, Osbert
Abstract
Inference-time computation has greatly enhanced the performance of large language models (LLMs) on challenging reasoning tasks, but this strategy can incur high inference costs. One solution is to route intermediate chain-of-thought (CoT) states to language models of different sizes; however, existing approaches rely on handcrafted routing strategies that limit performance, or on training large process reward models that may be infeasible in many applications. We formulate stepwise model routing as a constrained decision-making problem, which we solve by training a small control policy using reinforcement learning in conjunction with threshold calibration to tune the performance-efficiency tradeoff. We validate our method on three math benchmarks (GSM8K, MATH500, and OmniMath) on both open and closed models. Our method consistently improves the accuracy-cost tradeoff compared to handcrafted approaches, while achieving a comparable tradeoff to methods that require training large process reward models.
Chinese Translation
推理时的计算大大提升了大型语言模型(LLMs)在复杂推理任务上的表现,但这种策略可能会导致高昂的推理成本。一种解决方案是将中间思维链(CoT)状态路由到不同规模的语言模型;然而,现有的方法依赖于手工设计的路由策略,这限制了性能,或者依赖于训练大型过程奖励模型,这在许多应用中可能不可行。我们将逐步模型路由形式化为一个受限决策问题,通过结合强化学习和阈值校准训练一个小型控制策略来解决,从而调整性能与效率的权衡。我们在三个数学基准(GSM8K、MATH500 和 OmniMath)上对我们的方法进行了验证,涵盖开放模型和封闭模型。与手工设计的方法相比,我们的方法在准确性与成本的权衡上始终有所改善,同时在需要训练大型过程奖励模型的方法中实现了相当的权衡。
cs.AI / 90 / 2605.06123

Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs

回归启发式设计的起点:通过大语言模型连接代码与知识
Kiet, Nguyen Viet Tuan, Pham, Bui Dinh, Van Tung, Dao, Dao, Tran Cong, Binh, Huynh Thi Thanh
Abstract
Large language models (LLMs) have recently advanced automatic heuristic design (AHD) for combinatorial optimization (CO), where candidate heuristics are iteratively proposed, evaluated, and refined. Most existing approaches search over executable programs and distill insights from execution feedback to guide later iterations. Because this process moves from low-level implementations to high-level principles, we refer to it as a bottom-up paradigm. We argue that this view is incomplete and introduce a complementary top-down perspective: knowledge becomes the primary search object and code merely instantiates and tests it, making what is learned explicit and reusable across problems and trajectories. We formalize this shift through a statistical-learning view that exposes a distortion--compression trade-off, and instantiate it in both population-based and tree-based AHD frameworks. Across CO and tasks beyond it, knowledge-first search improves discovery efficiency, transfer, and generalization, often outperforming code-centric pipelines, while combining both strategies yields further gains. Our results suggest that progress in AHD depends on iteratively constructing and evolving interpretable hypotheses that retain value beyond a single search trajectory.
Chinese Translation
大型语言模型(LLMs)最近推动了组合优化(CO)领域的自动启发式设计(AHD),在这一过程中,候选启发式方法被迭代地提出、评估和优化。现有的大多数方法在可执行程序上进行搜索,并从执行反馈中提炼出见解,以指导后续的迭代。由于这一过程从低级实现向高级原则转变,我们将其称为自下而上的范式。我们认为这一观点是不完整的,并引入一个互补的自上而下的视角:知识成为主要的搜索对象,而代码仅仅是对其进行实例化和测试,使得所学到的内容在不同问题和路径中变得明确且可重用。我们通过统计学习的视角形式化了这一转变,揭示了失真-压缩的权衡,并在基于种群和基于树的AHD框架中进行了实例化。在组合优化及其之外的任务中,以知识为先的搜索提高了发现效率、迁移能力和泛化能力,通常优于以代码为中心的流程,而结合这两种策略则带来了进一步的收益。我们的结果表明,AHD的进展依赖于迭代构建和演化可解释的假设,这些假设在单一搜索路径之外仍然具有价值。
cs.AI / 91 / 2605.06124

P-Guide: Parameter-Efficient Prior Steering for Single-Pass CFG Inference

P-Guide:单次推理中的参数高效先验引导
Peng, Xin, Gao, Ang
Abstract
Classifier-Free Guidance (CFG) is essential for high-fidelity conditional generation in flow matching, yet it imposes significant computational overhead by requiring dual forward passes at each sampling step. In this work, we address this bottleneck by introducing P-Guide, a framework that achieves high-quality guidance through a single inference pass by modulating only the initial latent state. We further show that, under a first-order approximation, P-Guide is equivalent to CFG in the sense that it steers generation from the prior space, without requiring explicit velocity field extrapolation during sampling. We consider both homoscedastic and heteroscedastic priors, and find that jointly modeling the mean and variance enables adaptive loss attenuation and improved robustness to data uncertainty. Extensive experiments demonstrate that P-Guide reduces inference latency by approximately 50% while maintaining fidelity and prompt alignment competitive with standard dual-pass CFG baselines.
Chinese Translation
无分类器引导(Classifier-Free Guidance, CFG)对于流匹配中的高保真条件生成至关重要,但它在每个采样步骤中需要双重前向传递,从而带来了显著的计算开销。在本研究中,我们通过引入 P-Guide 框架来解决这一瓶颈,该框架通过仅调节初始潜在状态实现高质量的引导,从而只需一次推理传递。我们进一步表明,在一阶近似下,P-Guide在引导生成方面等价于CFG,因为它从先验空间引导生成,而不需要在采样过程中进行显式的速度场外推。我们考虑了同方差和异方差先验,并发现联合建模均值和方差能够实现自适应损失减弱,提高对数据不确定性的鲁棒性。大量实验证明,P-Guide将推理延迟减少了约50%,同时在保真度和提示对齐方面与标准双重传递CFG基线保持竞争力。
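For context on the dual-pass cost mentioned in the abstract above, the standard CFG extrapolation of a velocity field can be written as follows. The notation is an assumption for illustration and is not taken from the paper: $v_\theta$ is the learned velocity field, $c$ the condition, $\varnothing$ the null condition, and $w$ the guidance scale.

```latex
\tilde{v}_\theta(x_t, t, c)
  = v_\theta(x_t, t, \varnothing)
  + w \,\bigl[\, v_\theta(x_t, t, c) - v_\theta(x_t, t, \varnothing) \,\bigr]
```

Each sampling step evaluates $v_\theta$ twice (once with $c$, once with $\varnothing$), so $N$ steps cost $2N$ network evaluations. A single-pass sampler needs only $N$, which is consistent with the roughly 50% latency reduction reported, provided network evaluation dominates the runtime.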
cs.AI / 92 / 2605.06130

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Skill1:通过强化学习统一进化技能增强型智能体
Shi, Yaorui, Chen, Yuxin, Lu, Zhengxi, Miao, Yuchun, Liu, Shugui, GU, Qi, Cai, Xunliang, Wang, Xiang, Zhang, An
Abstract
A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. The policy generates a query to search the skill library, re-ranks candidates to select one, solves the task conditioned on it, and distills a new skill from the trajectory. All learning derives from a single task-outcome signal. Its low-frequency trend credits selection and its high-frequency variation credits distillation. Experiments on ALFWorld and WebShop show that Skill1 outperforms prior skill-based and reinforcement learning baselines. Training dynamics confirm the co-evolution of the three capabilities, and ablations show that removing any credit signal degrades the evolution.
Chinese Translation
一个持久的技能库使语言模型智能体能够在任务之间重用成功的策略。维护这样的库需要三种相互关联的能力。智能体选择相关技能,在执行过程中利用该技能,并从经验中提炼新技能。现有方法在孤立状态下或使用不同的奖励来源来优化这些能力,导致部分和相互冲突的进化。我们提出了Skill1,一个框架,通过训练单一策略共同进化技能选择、利用和提炼,以实现共享的任务结果目标。该策略生成查询以搜索技能库,对候选项进行重新排序以选择一个,基于所选技能解决任务,并从轨迹中提炼新技能。所有学习都源自单一的任务结果信号。其低频趋势归功于选择,高频变化归功于提炼。在ALFWorld和WebShop上的实验表明,Skill1优于之前的基于技能和强化学习的基线。训练动态确认了这三种能力的共同进化,消融实验表明,去除任何信用信号会降低进化效果。
cs.AI / 93 / 2605.06154

Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models

图小块作为知识图谱基础模型结构词汇的构建模块
Amouzouvi, Kossi, Wardenga, Robert, Lehmann, Jens, Vahdati, Sahar
Abstract
Foundation models excel at language, where sentences become tokens, and vision, where images become pixels, because both reduce to discrete symbols on a shared, fixed grid. Knowledge Graphs (KGs) share the discreteness, but not the geometry. Their entities and relations are discrete symbols, yet their arrangement is relational and lacks a common, fixed grid: they form irregular, non-Euclidean topologies whose local neighborhoods differ from graph to graph. Therefore, Knowledge Graph Foundation Models (KGFMs) rely on identifying structural invariances to produce transferable representations. Without a universal token set, KGFMs are limited in their ability to transfer representations across unseen KGs. We close this gap by treating graphlets, small connected graphs, as structural tokens that recur in heterogeneous KGs. In this paper, we introduce a model-agnostic framework based on a vocabulary of graphlets that mines a KG for patterns between relations via pattern matching. In particular, we consider closed and open 2- and 3-path graphlets and star graphlets to obtain robust invariances. The framework is evaluated on 51 KGs from a wide range of domains, for zero-shot inductive and transductive link prediction. Experiments show that adding simple graphlets to the vocabulary yields models that outperform prior KGFMs.
Chinese Translation
基础模型在语言处理方面表现出色,其中句子转化为符号,在视觉处理方面则将图像转化为像素,因为这两者都可以简化为共享的、固定的网格上的离散符号。知识图谱(Knowledge Graphs,KGs)具有离散性,但缺乏几何结构。它们的实体和关系是离散符号,但其排列是关系性的,并且缺乏共同的、固定的网格。知识图谱形成不规则的、非欧几里得的拓扑结构,其局部邻域因图谱而异。因此,知识图谱基础模型(Knowledge Graph Foundation Models,KGFMs)依赖于识别结构不变性来生成可转移的表示。没有一个通用的符号集,KGFMs在跨未见知识图谱转移表示的能力上受到限制。我们通过将图小块(graphlets)——小型连通图,视为在异构知识图谱中反复出现的结构符号,来弥补这一差距。在本文中,我们介绍了一个基于图小块词汇的模型无关框架,该框架通过模式匹配在关系间挖掘知识图谱。特别地,我们考虑了闭合和开放的2-和3-路径以及星形图小块,以获得稳健的不变性。该框架在来自广泛领域的51个知识图谱上进行了评估,针对零样本归纳和传导链接预测。实验表明,将简单的图小块添加到词汇中可以使模型的性能超过先前的KGFMs。
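To make the idea of mining relation-level patterns concrete, here is a minimal sketch that counts how often two relations are chained in a directed 2-path (head, r1, middle) followed by (middle, r2, tail). Reading "2-path" this way is an assumption, and the function name and toy triples are hypothetical; the paper's actual graphlet vocabulary and matching procedure are not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def count_two_path_relation_pairs(triples: Iterable[Triple]) -> Counter:
    """Count ordered relation pairs (r1, r2) that occur as h -r1-> m -r2-> t."""
    triples = list(triples)
    outgoing = defaultdict(list)  # entity -> list of (relation, tail) leaving it
    for head, relation, tail in triples:
        outgoing[head].append((relation, tail))

    counts: Counter = Counter()
    for _head, r1, middle in triples:
        for r2, _tail in outgoing.get(middle, []):
            counts[(r1, r2)] += 1
    return counts


toy_kg = [
    ("alice", "works_at", "acme"),
    ("bob", "works_at", "acme"),
    ("acme", "located_in", "paris"),
    ("paris", "capital_of", "france"),
]
print(count_two_path_relation_pairs(toy_kg))
```

Because the counts are keyed by relation pairs rather than entity names, the same pattern statistics can be computed on a graph with entirely different entities, which is the sense in which such patterns can act as transferable structural tokens.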
cs.AI / 94 / 2605.06161

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

超越准确性:政策不变性作为大型语言模型安全评估者的可靠性测试
Weng, Shihao, Feng, Yang, Xie, Xiaofei
Abstract
LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agent's behavior or merely on how the evaluation policy happens to be worded. We argue that any trustworthy safety judge must satisfy a basic property we call policy invariance, and we operationalize it as three testable principles: rubric-semantics invariance under certified-equivalent rewrites, rubric-threshold invariance under intentional strict-to-lenient shifts, and ambiguity-aware calibration so that verdict instability concentrates on genuinely ambiguous cases. Instantiating these principles as a stress-test protocol with four agent-class judges on trajectories drawn from ASSEBench and R-Judge, we surface a previously unmeasured failure mode: today's judges respond to meaningful normative shifts and to meaningless structural rewrites with comparable strength, and cannot tell the two apart. Content-preserving policy rewrites flip up to 9.1% of verdicts above baseline jitter, and 18-43% of all observed flips occur on unambiguous cases under such rewrites, so existing safety scores conflate what the agent did with how the evaluator was prompted. Beyond the diagnosis, we contribute the Policy Invariance Score and the Judge Card reporting protocol, which expose an order-of-magnitude spread in judge reliability that is invisible to accuracy-only leaderboards. We release the protocol and code so that future agent-safety benchmarks can audit their own evaluators rather than trust them by default.
Chinese Translation
大型语言模型作为评估者的管道已成为代理安全的事实评估者,然而现有基准将其裁决视为真实标准,而未检查这些裁决是否依赖于代理的行为,或仅仅取决于评估政策的措辞。我们认为,任何值得信赖的安全评估者必须满足一个基本属性,我们称之为政策不变性,并将其具体化为三个可测试的原则:在经过认证的等效重写下的评分标准-语义不变性、在有意的严格到宽松的转变下的评分标准-阈值不变性,以及意识到模糊性的校准,以便裁决的不稳定性集中在真正模糊的案例上。通过在来自ASSEBench和R-Judge的轨迹上对四个代理类别的评估者实施这些原则作为压力测试协议,我们揭示了一种之前未被测量的失败模式:当今的评估者对有意义的规范性变化和无意义的结构重写的反应强度相当,且无法区分二者。内容保持的政策重写使得高达9.1%的裁决发生变化,且在此类重写下,18-43%的所有观察到的变化发生在明确的案例上,因此现有的安全评分将代理的行为与评估者的提示混淆。除了诊断之外,我们贡献了政策不变性评分和评估者卡报告协议,这揭示了评估者可靠性在准确性仅限的排行榜上是不可见的数量级差异。我们发布该协议和代码,以便未来的代理安全基准可以审计自己的评估者,而不是默认信任他们。
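The phrase "flips above baseline jitter" in the abstract above suggests a simple quantity: the fraction of verdicts that change under a policy rewrite, minus the fraction that change when the same policy is simply re-run. The sketch below computes that difference under this assumed reading. The function names and verdict lists are hypothetical, and this is not the paper's Policy Invariance Score.

```python
from typing import Sequence


def flip_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of positions where two verdict lists disagree."""
    if len(a) != len(b) or not a:
        raise ValueError("verdict lists must be non-empty and equally long")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def excess_flip_rate(
    baseline: Sequence[str], rerun: Sequence[str], rewritten: Sequence[str]
) -> float:
    """Flip rate under a rewritten policy minus the re-run (jitter) flip rate."""
    return flip_rate(baseline, rewritten) - flip_rate(baseline, rerun)


baseline = ["safe", "unsafe", "safe", "safe"]
rerun = ["safe", "unsafe", "safe", "unsafe"]         # same policy, second run
rewritten = ["unsafe", "unsafe", "unsafe", "unsafe"]  # content-preserving rewrite
print(excess_flip_rate(baseline, rerun, rewritten))
```

A judge that satisfied policy invariance would keep this value near zero for content-preserving rewrites.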
cs.AI / 95 / 2605.06165

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

后推理:在零成本下提升非思维模型的性能
Xuan, Richmond Sin Jing, Bhardwaj, Rishabh, Poria, Soujanya
Abstract
As the widespread adoption of Large Language Models (LLMs) accelerates, token consumption from intermediate reasoning traces increasingly contributes to inference latency and operational cost. Recent studies suggest that many real-world tasks require little to no explicit reasoning, with additional reasoning sometimes even degrading performance. In this work, we propose Post-Reasoning, a simple yet effective approach that improves instruction-tuned models by conditioning them to justify their answers after generating the final response. By design, it enables the final answer to be obtained without additional latency or token cost, while still improving performance through simple instruction augmentation. We evaluate Post-Reasoning across 117 model-benchmark settings spanning 13 open and proprietary models, 4 model families, and 9 diverse reasoning and knowledge-intensive benchmarks, including AMC, HMMT, GSM8K, GPQA, MMLU-Pro, and BIG-Bench Hard. Post-Reasoning improves performance in over 88.19% of evaluated settings, achieving a mean relative improvement of 17.37%. Furthermore, we propose supervised post-reason tuning, which further improves performance in over 91.11% of evaluated settings, and exceeds the prompt-based post-reasoning baseline by an average of 8.01%, demonstrating that post-reasoning can be effectively internalized through training. Ultimately, Post-Reasoning establishes a new performance ceiling for direct-answer capabilities.
Chinese Translation
随着大型语言模型(LLMs)的广泛应用加速,中间推理痕迹的令牌消耗日益影响推理延迟和运营成本。近期研究表明,许多现实世界任务几乎不需要明确的推理,额外的推理有时甚至会降低性能。在本研究中,我们提出了后推理(Post-Reasoning),这是一种简单而有效的方法,通过在生成最终回答后对模型进行条件化,使其能够为答案提供合理解释,从而改善指令调优模型。该方法的设计使得最终答案的获取不需要额外的延迟或令牌成本,同时通过简单的指令增强提高性能。我们在117个模型-基准设置中评估了后推理,涵盖了13个开放和专有模型、4个模型家族以及9个多样的推理和知识密集型基准,包括AMC、HMMT、GSM8K、GPQA、MMLU-Pro和BIG-Bench Hard。后推理在超过88.19%的评估设置中提升了性能,平均相对提升达到17.37%。此外,我们提出了监督后推理调优,这进一步提高了超过91.11%的评估设置中的性能,并且平均超过基于提示的后推理基线8.01%,证明后推理可以通过训练有效内化。最终,后推理为直接回答能力建立了新的性能上限。
cs.AI / 96 / 2605.06177

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

BioMedArena:一个用于构建和评估生物医学深度研究代理的开源工具包
Wu, Jinge, Zhou, Hongjian, Zeng, Mingde, Zhu, Jiayuan, Wu, Junde, Pan, Jiazhen, Wu, Sean, Wu, Honghan, Liu, Fenglin, Clifton, David A.
Abstract
Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison of different foundation models when evaluating them as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation -- benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring -- and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies, which provide 12 backbones with competitive research capabilities and significantly improved performance, achieving state-of-the-art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena
Chinese Translation
今天构建一个深度研究代理是一项粘合代码的工作:在相同基准上评估的相同骨干网络在不同论文中可能报告不同的准确率,因为所使用的工具和工具注册各不相同,而将新的基础模型整合到可比较的评估环境中需要数周的模型特定工程。我们称之为每篇论文的工程税,并发布了BioMedArena,一个不仅缓解这一问题的开源工具包,还提供了一个公平比较不同基础模型作为深度研究代理的竞技场。BioMedArena 解耦了生物医学代理评估的六个层面——基准加载、工具暴露、工具选择、执行模式、上下文管理和评分——并在9个功能类别中提供了147个生物医学基准和75个生物医学工具。添加一个新的模型、基准或工具只需注册几行提供者适配器。我们进一步提供了6个代理工具和6种上下文管理策略,提供了12个具有竞争研究能力的骨干网络,并显著提高了性能,在8个代表性的生物医学基准上实现了最新的(SOTA)结果,平均提升了15.03个百分点,超越了之前的SOTA。该工具包、配置和每个任务的跟踪信息可在 https://github.com/AI-in-Health/BioMedArena 获取。
cs.AI / 97 / 2605.06183

Rethinking Adapter Placement: A Dominant Adaptation Module Perspective

重新思考适配器放置:一种主导适应模块视角
Zhang, Suoxin, He, Run, Fang, Di, Tan, Xiang, Chen, Kaixuan, Zhuang, Huiping
Abstract
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method that places trainable low-rank adapters into frozen pre-trained models. Recent studies show that using fewer LoRA adapters may still maintain or even improve performance, but existing methods still distribute adapters broadly, leaving where to place a limited number of adapters to maximize performance largely open. To investigate this, we introduce PAGE (Projected Adapter Gradient Energy), a gradient-based sensitivity probe that estimates the initial trainable gradient energy available to each candidate LoRA adapter. Surprisingly, we find that PAGE is highly concentrated on a single shallow FFN down-projection across two model families and four downstream tasks. We term this module the dominant adaptation module and show that its layer index is architecture-dependent but task-stable. Motivated by this finding, we propose DomLoRA, a placement method that places a single adapter at the dominant adaptation module. With only ~0.7% of vanilla LoRA's trainable parameters, DomLoRA outperforms it on average across various downstream tasks, including instruction following, mathematical reasoning, code generation, and multi-turn conversation. This method also improves other LoRA variants, supporting the dominant adaptation module perspective as a practical placement guideline.
Chinese Translation
低秩适应(LoRA)是一种广泛使用的参数高效微调方法,它将可训练的低秩适配器放置在冻结的预训练模型中。最近的研究表明,使用更少的LoRA适配器仍然可以保持甚至提高性能,但现有方法仍然广泛分布适配器,导致如何放置有限数量的适配器以最大化性能的问题尚未得到充分解决。为此,我们引入了PAGE(Projected Adapter Gradient Energy),这是一种基于梯度的敏感性探测器,用于估计每个候选LoRA适配器的初始可训练梯度能量。令人惊讶的是,我们发现PAGE在两个模型系列和四个下游任务中高度集中于单个浅层前馈网络(FFN)下投影。我们将该模块称为主导适应模块,并表明其层索引依赖于架构但在任务上保持稳定。基于这一发现,我们提出了DomLoRA,一种将单个适配器放置在主导适应模块的放置方法。仅使用约0.7%的原始LoRA可训练参数,DomLoRA在各种下游任务(包括指令跟随、数学推理、代码生成和多轮对话)中平均表现优于原始LoRA。这种方法还改善了其他LoRA变体,支持主导适应模块视角作为实际的放置指导。
cs.AI / 98 / 2605.06185

Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

事件因果检索增强生成框架:复杂场景下的长视频推理
Yan, Peizheng, Zhao, Yu, Xie, Liang, Qi, Juntong, Wang, Mingming, Yin, Erwei
Abstract
Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.
Chinese Translation
近期的大型视觉-语言模型在短视频和中等长度视频理解方面取得了显著的性能,但在超长或甚至无限视频推理方面仍显不足。在这些情况下,模型必须在较长时间内保持一致的记忆,并推断时间上相距较远事件之间的因果依赖关系。现有的端到端视频理解方法在本质上受到自注意力机制的 $O(n^2)$ 复杂度的限制,而最近的检索增强生成(RAG)方法仍然面临片段级记忆碎片化、时间和因果结构建模弱以及高存储和在线推理成本等问题。我们提出了事件因果检索增强生成(Event-Causal RAG),这是一个轻量级的检索增强框架,旨在进行无限长视频推理。我们的方法不是对固定长度的片段进行索引,而是将流媒体视频分割为语义一致的事件,并将每个事件表示为结构化的状态-事件-状态(State-Event-State, SES)图,捕捉事件及其周围状态转变。这些图被合并为一个全局事件知识图,并存储在一个支持语义匹配和因果拓扑检索的双存储记忆中。在此记忆的基础上,我们设计了一种双向检索策略,以高效识别最相关的事件因果链,并将其与相关视频证据一起提供给基础视频模型以生成答案。在长视频理解基准上的实验表明,事件因果检索增强生成(Event-Causal RAG)在多事件整合和跨长时间间隔的因果推理问题上,始终优于强大的基于片段的检索基线和长上下文视频模型,同时在记忆效率和流媒体性能方面也有显著改善。
cs.AI / 99 / 2605.06188

OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models

OPSD 压缩 RLVR 所教授的内容:推理模型的后 RL 压缩阶段
Kim, Jaehoon, Lee, Dongha
Abstract
On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However, this promise does not carry over to thinking-enabled mathematical reasoning, where reported accuracy gains shrink and sometimes turn negative. We hypothesize that hindsight supervision can specify better token-level alternatives in short thinking-disabled outputs, but in long thinking-enabled traces it more readily identifies redundancy than supplies better replacements. To test this, we applied OPSD separately to correct and incorrect rollout groups, so that compression and correction can be observed in isolation. Our results show that in thinking-enabled mathematical reasoning, OPSD behaves most reliably as a compression mechanism rather than a correction mechanism: training only on correct rollouts preserves accuracy while substantially shortening responses, whereas training only on incorrect rollouts damages accuracy. In light of these findings, we propose a revised post-training pipeline for thinking-enabled mathematical reasoning: SFT then RLVR then OPSD.
Chinese Translation
最近,基于策略的自我蒸馏(On-Policy Self-Distillation, OPSD)作为可验证奖励强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)的替代方案出现,承诺通过基于特权上下文的自我教师进行令牌级别的信用分配,从而实现更高的准确性和更短的响应。然而,这一承诺并未延续到具备思维能力的数学推理中,报告的准确性提升缩小,甚至有时出现负值。我们假设,事后监督可以在短期的思维失效输出中指定更好的令牌级别替代方案,但在长期的思维启用轨迹中,它更容易识别冗余而不是提供更好的替代品。为了验证这一假设,我们分别对正确和错误的展开组应用了 OPSD,以便观察压缩和纠正的独立效果。我们的结果表明,在具备思维能力的数学推理中,OPSD 更可靠地作为压缩机制而非纠正机制:仅在正确的展开上训练可以保持准确性,同时显著缩短响应,而仅在错误的展开上训练则会损害准确性。基于这些发现,我们提出了一种修订后的思维启用数学推理的后训练流程:先进行 SFT,然后是 RLVR,最后是 OPSD。
cs.AI / 100 / 2605.06191

Systematic Evaluation of Large Language Models for Post-Discharge Clinical Action Extraction

大型语言模型在出院后临床行动提取中的系统评估
Dalmia, Shivali, Mantravadi, Ananya, Desikan, Prasanna
Abstract
This paper evaluates zero-shot and few-shot large language models (LLMs) for safety-critical clinical action extraction using the CLIP discharge-note dataset, with particular emphasis on transitions of care and post-discharge patient safety. To manage the complexity of clinical documentation, we introduce a two-stage extraction framework that decomposes discharge notes, which are written in narrative form, into fine-grained, explicitly actionable clinical tasks through a staged prompting strategy. Our contributions include a systematic assessment of generative LLMs for clinical action extraction, a detailed comparison between general-purpose LLMs and task-specific supervised BERT-based models, and an analysis of annotation inconsistencies across different action categories. We show that contemporary LLMs achieve performance comparable to or exceeding supervised models on binary actionability detection, while supervised baselines retain a meaningful advantage on fine-grained multi-label category classification, despite the absence of task-specific fine-tuning and under strict data-privacy constraints. Qualitative error analysis reveals that many failures stem from misalignment between model reasoning and dataset annotation conventions, particularly in cases involving implicit clinical actions and rigid structural labeling rules. These results indicate that reported performance reflects model limitations due to a lack of clinical reasoning that is not captured by plain annotations. Labels without rationales make it impossible to distinguish clinical reasoning failures from annotation convention mismatches. Advancing clinical NLP requires reasoning-annotated datasets that document why specific spans are actionable, not merely which spans were labeled, enabling proper evaluation of model clinical understanding.
Chinese Translation
本文评估了零样本和少样本的大型语言模型(LLMs)在安全关键的临床行动提取中的应用,使用了 CLIP 出院记录数据集,特别强调了护理过渡和出院后患者安全。为了管理临床文档的复杂性,我们提出了一种两阶段提取框架,通过分阶段提示策略将以叙述形式书写的出院记录分解为细粒度、明确可操作的临床任务。我们的贡献包括对生成型 LLMs 在临床行动提取中的系统评估、通用 LLMs 与任务特定监督 BERT 模型之间的详细比较,以及对不同行动类别之间注释不一致性的分析。我们展示了当代 LLMs 在二元可操作性检测方面的表现可与监督模型相媲美或超越,而监督基线在细粒度多标签类别分类中仍保持显著优势,尽管缺乏任务特定的微调并在严格的数据隐私约束下进行。定性错误分析揭示,许多失败源于模型推理与数据集注释规范之间的不一致,特别是在涉及隐含临床行动和严格结构化标签规则的情况下。这些结果表明,报告的性能反映了模型因缺乏临床推理而导致的局限性,而这种局限性并未通过简单的注释捕捉到。没有理由的标签使得无法区分临床推理失败与注释规范不匹配。推动临床自然语言处理(NLP)的进展需要具有推理注释的数据集,这些数据集记录了特定文本片段为何是可操作的,而不仅仅是哪些文本片段被标记,从而能够对模型的临床理解进行适当评估。
cs.AI / 101 / 2605.06196

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

粒度轴:语言模型中社会角色的微观到宏观潜在方向
Qin, Chonghan, Feng, Xiachong, Song, Ziyun, Feng, Xiaocheng, Xiong, Jing, Kong, Lingpeng
Abstract
Large language models (LLMs) are routinely prompted to take on social roles ranging from individuals to institutions, yet it remains unclear whether their internal representations encode the granularity of such roles, from micro-level individual experience to macro-level organizational, institutional, or national reasoning. We show that they do. We define a contrast-based Granularity Axis as the difference between mean macro- and micro-role hidden states. In Qwen3-8B, this axis aligns with the principal axis (PC1) of the role representation space at cosine 0.972 and accounts for 52.6% of its variance, indicating that granularity is the dominant geometric axis organizing prompted social roles. We construct 75 social roles across five granularity levels and collect 91,200 role-conditioned responses over shared questions and prompt variants, then extract role-level hidden states and project them onto the axis. Role projections increase monotonically across all five levels, remain stable across layers, prompt variants, endpoint definitions, held-out splits, and score-filtered subsets, and transfer to Llama-3.1-8B-Instruct. The axis is also causally relevant: activation steering along it shifts response granularity in the predicted direction, with Llama moving from 2.00 to 3.17 on a five-point macro scale under positive steering on prompts that admit local responses. The two models differ in controllability, suggesting that steering depends on each model's default operating regime. Overall, our findings suggest that social role granularity is not merely a stylistic surface feature, but a structured, ordered, and causally manipulable latent direction in role-conditioned language model behavior.
Chinese Translation
大型语言模型(LLMs)常常被提示承担从个人到机构的社会角色,但尚不清楚它们的内部表征是否编码了这些角色的粒度,从微观层面的个人经验到宏观层面的组织、机构或国家推理。我们的研究表明,它们确实如此。我们定义了一个基于对比的粒度轴,作为宏观和微观角色隐藏状态均值之间的差异。在 Qwen3-8B 中,该轴与角色表征空间的主轴(PC1)在余弦相似度为 0.972 的情况下对齐,并解释了 52.6% 的方差,表明粒度是组织提示社会角色的主导几何轴。我们构建了 75 个社会角色,涵盖五个粒度层次,并收集了 91,200 个基于角色的响应,针对共享问题和提示变体,然后提取角色级隐藏状态并将其投影到该轴上。角色投影在所有五个层次上单调增加,在各层、提示变体、端点定义、保留分割和评分过滤子集之间保持稳定,并能够转移到 Llama-3.1-8B-Instruct。该轴在因果上也具有相关性:沿其激活引导的变化会将响应粒度朝预期方向转变,在允许局部响应的提示下,Llama 的粒度从 2.00 移动到 3.17(五分制宏观尺度)。这两种模型在可控性上存在差异,表明引导依赖于每个模型的默认操作机制。总体而言,我们的研究结果表明,社会角色的粒度不仅仅是一个风格表面特征,而是一个结构化、有序且可因果操控的潜在方向,影响角色条件下的语言模型行为。
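The contrast-based axis described in the abstract above (difference of mean macro-role and micro-role hidden states, compared against PC1) can be sketched with NumPy. The synthetic "hidden states" below are an assumption made purely for illustration; real states would come from a model's residual stream, and the function names are hypothetical.

```python
import numpy as np


def granularity_axis(micro: np.ndarray, macro: np.ndarray) -> np.ndarray:
    """Difference between mean macro-role and mean micro-role hidden states."""
    return macro.mean(axis=0) - micro.mean(axis=0)


def first_principal_component(states: np.ndarray) -> np.ndarray:
    """Unit vector of largest variance (PC1) via SVD of the centered states."""
    centered = states - states.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def variance_fraction_along(states: np.ndarray, direction: np.ndarray) -> float:
    """Share of total variance captured by projecting onto `direction`."""
    centered = states - states.mean(axis=0, keepdims=True)
    unit = direction / np.linalg.norm(direction)
    return float(((centered @ unit) ** 2).sum() / (centered ** 2).sum())


# Synthetic stand-in for role-level hidden states (NOT real model activations).
rng = np.random.default_rng(0)
shift = np.zeros(8)
shift[0] = 2.0
micro = rng.normal(scale=0.1, size=(20, 8)) - shift
macro = rng.normal(scale=0.1, size=(20, 8)) + shift

axis = granularity_axis(micro, macro)
all_states = np.vstack([micro, macro])
pc1 = first_principal_component(all_states)
# The sign of PC1 is arbitrary, so compare the absolute cosine.
print(abs(cosine(axis, pc1)), variance_fraction_along(all_states, axis))
```

Projecting each role's hidden state onto `axis` gives the scalar that the abstract reports as increasing monotonically across granularity levels.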
cs.AI / 102 / 2605.06201

Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric

迈向无注释验证的多模态大语言模型:一种视觉-语言逻辑一致性度量
Gu, Ying, Leong, Mei Chee, Tan, Hui Li, Mao, Shangbo, Li, Liyuan, Chen, Nancy
Abstract
The dominant accuracy-based evaluation might reward unwarranted guessing by Large Language Models, and it might not be applicable to model validation on novel tasks that lack ground-truth (gt) annotation. Based on basic logical principles, we propose a novel framework to evaluate the vision-language logical consistency of MLLMs on both sufficient and necessary cause-effect relations. We define the Vision-Language Logical Consistency Metric (VL-LCM) on traditional MC-VQA tests and recent NaturalBench tests, without the need for gt annotation. Through systematic experiments on the representative VL benchmark MMMU and recent VL challenges like NaturalBench, we evaluated 11 recent open-source MLLMs from 4 frontier families. Our findings reveal that, despite significant progress of recent MLLMs on accuracy, logical consistency lags behind significantly. Extensive evaluations of the correlation between VL-LCM and gt-based metrics, the reliability of LCM, and the relation between VL-LCM and the response distribution justify the validity and applicability of VL-LCM even without gt annotation. Our findings suggest that, beyond accuracy, logical consistency could be used to assess both accuracy and reliability. VL-LCM can also be employed for MLLM selection, validation, and reliable answer justification in novel tasks without gt annotation.
Chinese Translation
主导的准确性评估可能会奖励大型语言模型的不当猜测,并且在没有真实标签(gt)注释的情况下,可能不适用于新任务的模型验证。基于基本逻辑原理,我们提出了一种新框架,用于评估多模态大语言模型(MLLMs)在充分和必要因果关系上的视觉-语言逻辑一致性。我们在传统的多选题视觉问答(MC-VQA)测试和近期的自然基准(NaturalBench)测试中定义了视觉-语言逻辑一致性度量(VL-LCM),而无需真实标签注释。通过在代表性视觉语言基准MMMU和近期视觉语言挑战如NaturalBench上的系统实验,我们评估了来自4个前沿家族的11个最新开源MLLMs。我们的研究发现,尽管近期MLLMs在准确性上取得了显著进展,但逻辑一致性却显著滞后。对VL-LCM与真实标签度量的相关性、逻辑一致性度量(LCM)的可靠性以及VL-LCM与响应分布的关系的广泛评估证明了VL-LCM的有效性和适用性,即使在没有真实标签注释的情况下。我们的研究结果表明,除了准确性之外,逻辑一致性也可以用于准确性和可靠性。VL-LCM还可以用于MLLM的选择、验证以及在没有真实标签注释的新任务中的可靠答案证明。
cs.AI / 103 / 2605.06213

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

超越固定基准和最坏情况攻击:语言模型的动态边界评估
Wang, Haoxiang, Yu, Da, Zhang, Huishuai
Abstract
Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is near $0.5$ under random-sampling decoding, and propose Dynamic Boundary Evaluation (DBE), which actively locates each model's boundary and places it on a globally comparable difficulty scale. DBE delivers three artifacts: (i) a calibrated item bank covering safety, capability, and truthfulness, with per-item difficulty labels validated across $9$ reference LLMs; (ii) Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a given target LLM using only API-level query access; and (iii) an evaluation protocol that places a new LLM on a unified ability scale and grows the evaluation set adaptively when the target falls outside the bank's coverage. We instantiate DBE on four categories spanning safety (harmful request refusal and over-refusal), capability (constrained instruction following), and truthfulness (multi-turn sycophancy resistance). The resulting evaluation covers a broader model spectrum without saturation while remaining compatible with existing datasets.
Chinese Translation
目前对大型语言模型(LLMs)的评估依赖于固定基准,这些基准对任何模型应用相同的项目集,导致天花板效应和地板效应掩盖了能力差距。我们认为,最具信息量的评估信号位于边界处,在随机采样解码下,每个提示的通过概率接近 $0.5$,并提出动态边界评估(Dynamic Boundary Evaluation, DBE),该方法主动定位每个模型的边界,并将其置于一个全球可比的难度尺度上。DBE 提供了三个成果:(i)一个经过校准的项目库,涵盖安全性、能力和真实性,且每个项目的难度标签经过 $9$ 个参考 LLM 的验证;(ii)技能引导边界搜索(Skill-Guided Boundary Search, SGBS),一种仅使用 API 级查询访问来为给定目标 LLM 查找边界项目的搜索算法;(iii)一个评估协议,将新的 LLM 放置在统一的能力尺度上,并在目标超出库的覆盖范围时自适应地扩展评估集。我们在涵盖安全性(有害请求拒绝和过度拒绝)、能力(受限指令跟随)和真实性(多轮谄媚抵抗)等四个类别上实例化了 DBE。最终的评估覆盖了更广泛的模型谱系,且在不饱和的情况下仍与现有数据集兼容。
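The notion of a "boundary" item, one whose pass probability under sampling is near 0.5, can be illustrated with a small selection routine. This is only a sketch of the selection criterion under assumed inputs (per-item pass counts from repeated sampling); it is not the paper's Skill-Guided Boundary Search, and the names are hypothetical.

```python
from typing import Dict, List


def boundary_items(pass_counts: Dict[str, int], n_samples: int, k: int) -> List[str]:
    """Return the k items whose empirical pass rate is closest to 0.5.

    pass_counts maps an item id to the number of sampled responses that passed,
    out of n_samples attempts per item. Ties are broken by item id.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    rates = {item: count / n_samples for item, count in pass_counts.items()}
    ranked = sorted(rates, key=lambda item: (abs(rates[item] - 0.5), item))
    return ranked[:k]


counts = {"easy": 10, "hard": 0, "mid_a": 5, "mid_b": 6}
print(boundary_items(counts, n_samples=10, k=2))
```

Items that every sample passes or fails carry little information about where a model's ability lies, which is the ceiling and floor effect the abstract describes.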
cs.AI / 104 / 2605.06219

Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization

联合一致性:通过能量最小化的统一测试时间聚合框架
Yao, Yunzhen, Wang, Hongye, Wang, Yahong, Gastpar, Michael C., Jiang, Bo, He, Lie
Abstract
This paper studies test-time aggregation, an approach that generates multiple reasoning traces and aggregates them into a final answer. Most existing methods rely on evaluation signals collected from candidate traces in isolation or answer frequencies, while ignoring comparative interactions among candidates. We propose Joint Consistency (JC), formulated as a constrained Ising-type energy minimization problem, where independent evaluation signals act as external fields and pairwise comparisons act as interactions. JC provides a unified framework for test-time aggregation that subsumes existing voting and weighted aggregation methods as special cases. Our construction of the interaction matrix leverages LLM-as-a-judge comparisons, and admits a theoretical interpretation under answer-level homogeneity assumptions. Moreover, we develop an efficient approximation strategy that makes interaction modeling practical for large-scale test-time aggregation. Experiments on math and code reasoning benchmarks show that JC consistently outperforms existing baselines across tasks, judge models, trace budgets, and trace-generation settings.
Chinese Translation
本文研究了测试时间聚合,这是一种生成多个推理轨迹并将其聚合为最终答案的方法。大多数现有方法依赖于从候选轨迹中独立收集的评估信号或答案频率,而忽视了候选之间的比较交互。我们提出了联合一致性(Joint Consistency, JC),将其表述为一个约束的伊辛型能量最小化问题,其中独立的评估信号作为外部场,而成对比较作为交互。JC提供了一个统一的测试时间聚合框架,涵盖了现有的投票和加权聚合方法作为特例。我们的交互矩阵构造利用了大语言模型(LLM)作为评判者的比较,并在答案级同质性假设下具有理论解释。此外,我们开发了一种高效的近似策略,使得交互建模在大规模测试时间聚合中变得实用。在数学和代码推理基准上的实验表明,JC在任务、评判模型、轨迹预算和轨迹生成设置中始终优于现有基线。
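A brute-force sketch can make the Ising-type formulation in the abstract above more tangible. It assumes an energy of the form E(s) = -Σ h_i s_i - Σ_{i<j} J_ij s_i s_j with spins s_i in {-1, +1}, and, as a purely hypothetical constraint, that traces sharing the selected final answer take spin +1 while all others take -1. The paper's actual constraint, interaction matrix, and approximation strategy are not reproduced here.

```python
from itertools import combinations
from typing import List, Sequence


def energy(spins: Sequence[int], fields: Sequence[float],
           couplings: Sequence[Sequence[float]]) -> float:
    """Ising-type energy: external fields plus pairwise interactions."""
    e = -sum(h * s for h, s in zip(fields, spins))
    for i, j in combinations(range(len(spins)), 2):
        e -= couplings[i][j] * spins[i] * spins[j]
    return e


def aggregate(answers: Sequence[str], fields: Sequence[float],
              couplings: Sequence[Sequence[float]]) -> str:
    """Pick the answer whose induced spin configuration has minimal energy.

    Assumed constraint: exactly one answer is accepted; traces with that
    answer get spin +1 and every other trace gets spin -1.
    """
    best_answer, best_energy = "", float("inf")
    for candidate in sorted(set(answers)):
        spins: List[int] = [1 if a == candidate else -1 for a in answers]
        e = energy(spins, fields, couplings)
        if e < best_energy:
            best_answer, best_energy = candidate, e
    return best_answer


answers = ["A", "A", "B"]
no_interaction = [[0.0] * 3 for _ in range(3)]
print(aggregate(answers, [1.0, 1.0, 1.0], no_interaction))  # uniform fields
print(aggregate(answers, [0.1, 0.1, 2.0], no_interaction))  # confidence-weighted
```

With zero couplings and uniform fields the minimizer coincides with majority voting, and with non-uniform fields it coincides with weighted voting, which mirrors the abstract's claim that these methods arise as special cases.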
cs.AI / 105 / 2605.06223

Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries

基于比较判断的主动实例导航用于模糊用户查询
Kwon, Junhyuk, Lee, Seungjoon, Park, Hyejin, Min, Kyle, Ok, Jungseul
Abstract
Natural-language instance navigation becomes challenging when the initial user request does not uniquely specify the target instance. A practical agent should reduce the user's burden by actively asking only for the information needed to distinguish the target from similar distractors, rather than requiring a detailed description upfront. Existing approaches often fall short of this goal: they may stop at the first plausible candidate before sufficiently exploring alternatives, or, even after collecting multiple candidates, ask about the target's attributes derived from individual candidates rather than questions selected to distinguish candidates in the pool. As a result, despite the dialogue, the agent may still fail to distinguish the target from distractors, leading to premature decisions and lengthy user responses. We propose Proactive Instance Navigation with Comparative Judgment (ProCompNav), a two-stage framework that first constructs a candidate pool and then identifies the target through comparative judgment. At each round, ProCompNav extracts an attribute-value pair that splits the current pool, asks a binary yes/no question, and prunes all inconsistent candidates at once. This reframes disambiguation from open-ended target description to pool-level discriminative questioning, where each question is chosen to narrow the candidate set. On CoIN-Bench, ProCompNav improves Success Rate over interactive baselines with the same minimal input and non-interactive baselines with detailed descriptions, while substantially reducing Response Length. ProCompNav also achieves state-of-the-art Success Rate on TextNav, suggesting that comparative judgment is broadly useful for instance-level navigation among similar distractors.
Chinese Translation
当初始用户请求未能唯一指定目标实例时,自然语言实例导航变得具有挑战性。一个实用的智能体应通过主动询问仅需的信息来减轻用户的负担,以便将目标与相似的干扰项区分开,而不是要求用户提前提供详细描述。现有的方法往往未能达到这一目标:它们可能在充分探索替代方案之前就停留在第一个合理的候选项上,或者即使在收集多个候选项后,也会询问来自单个候选项的目标属性,而不是选择用于区分候选池中的候选项的问题。因此,尽管进行了对话,智能体仍可能无法将目标与干扰项区分开,导致过早的决策和冗长的用户响应。我们提出了基于比较判断的主动实例导航(ProCompNav),这是一个两阶段框架,首先构建候选池,然后通过比较判断识别目标。在每一轮中,ProCompNav提取一个属性-值对来划分当前池,提出一个二元是/否问题,并一次性剔除所有不一致的候选项。这将消歧义的过程从开放式目标描述重新框架为池级区分性提问,其中每个问题的选择旨在缩小候选集。在CoIN-Bench上,ProCompNav在相同最小输入下提高了成功率,相较于交互基线和详细描述下的非交互基线,同时显著减少了响应长度。ProCompNav在TextNav上也达到了最先进的成功率,表明比较判断在相似干扰项之间的实例级导航中具有广泛的实用性。
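The round structure described in this abstract (pick an attribute-value pair that splits the pool, ask a yes/no question, prune inconsistent candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the candidate dictionaries, the "closest to an even split" selection rule, and the `oracle` callback standing in for the user are all assumptions made for illustration.

```python
from collections import Counter


def pick_question(pool):
    """Return the (attribute, value) pair whose yes/no answer splits the pool most evenly."""
    counts = Counter(
        (attr, value) for cand in pool for attr, value in cand["attrs"].items()
    )
    n = len(pool)
    # Only pairs held by some, but not all, candidates can discriminate.
    splitting = {pair: c for pair, c in counts.items() if 0 < c < n}
    if not splitting:
        return None
    # Hypothetical selection rule: closest to a half/half split, ties broken by sort order.
    return min(splitting, key=lambda pair: (abs(splitting[pair] - n / 2), pair))


def prune(pool, pair, answer_yes):
    """Keep only candidates consistent with the user's binary answer."""
    attr, value = pair
    return [c for c in pool if (c["attrs"].get(attr) == value) == answer_yes]


def disambiguate(pool, oracle):
    """Ask binary questions until one candidate remains or nothing can split the pool."""
    rounds = 0
    while len(pool) > 1:
        pair = pick_question(pool)
        if pair is None:
            break
        pool = prune(pool, pair, oracle(pair))
        rounds += 1
    return pool, rounds


# Toy pool of four similar chairs (illustrative data only).
pool = [
    {"id": "chair_1", "attrs": {"color": "red", "room": "kitchen"}},
    {"id": "chair_2", "attrs": {"color": "red", "room": "bedroom"}},
    {"id": "chair_3", "attrs": {"color": "blue", "room": "kitchen"}},
    {"id": "chair_4", "attrs": {"color": "blue", "room": "bedroom"}},
]
target = pool[1]
# The oracle plays the user: it answers yes when the target has the asked value.
oracle = lambda pair: target["attrs"].get(pair[0]) == pair[1]
remaining, rounds = disambiguate(pool, oracle)
```

In this toy setting each question halves the pool, so four candidates are resolved in two rounds. Real attribute extraction from visual observations would be considerably noisier than exact dictionary matching.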
cs.AI / 106 / 2605.06226

A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

一种多功能人工智能代理用于罕见疾病诊断和风险基因优先排序
Liu, Tianyu, Zheng, Wangjie, Yang, Rui, Loo, Benny Kai Guo, Zhang, Hui, Lauran, Jeffries, Gu, Jianlei, Yu, Botao, Xuan, Weihao, Huang, Kexin, Liu, Nan, Zou, James, Jiang, Yonghui, Xu, Hua, Zhao, Hongyu
Abstract
Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians, with improvements ranging from 12% to 60%, and (2) its effectiveness in assisting clinicians with medical records for handling real-world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.
Chinese Translation
准确及时的诊断对于有效治疗至关重要,尤其是在罕见疾病的背景下。然而,当前的诊断工作流程常常导致评估时间延长和准确性低下。为了解决这些局限性,我们推出了Hygieia,一个多模态人工智能代理系统,旨在通过整合多种数据源(包括表型特征、基因组特征和临床记录)来支持精准疾病诊断。Hygieia采用基于路由器和知识增强的框架,减少幻觉现象,并根据不同疾病类别量身定制诊断策略。值得注意的是,它优先考虑与罕见疾病相关的风险基因因素,并提供置信评分以辅助临床决策。我们进行了全面评估,证明Hygieia在多个诊断基准中达到了最先进的性能。与耶鲁医学院和杜克-国立大学医学院的临床专家合作,我们进一步验证了其实际效用,显示(1)Hygieia的诊断性能优于医生,提升幅度为12%-60%;(2)在处理真实病例时,它在协助临床医生使用医疗记录方面的有效性。我们的研究结果表明,Hygieia不仅提高了诊断的准确性和可解释性,还显著减轻了临床医生的工作负担,突显了其作为临床决策支持系统中有价值工具的潜力。
cs.AI / 107 / 2605.06227

Price of Fairness in Short-Term and Long-Term Algorithmic Selections

短期和长期算法选择中的公平性代价
Jabbari, Shahin, Wang, Chen
Abstract
Algorithmic decision-making in high-stakes settings can have profound impacts on individuals and populations. While much prior work studies fairness in static settings, recent results show that enforcing static fairness constraints may exacerbate long-run disparities. Motivated by this tension, we study a stylized sequential selection problem in which a decision-maker repeatedly selects individuals, affecting both immediate utility and the population distribution over time. We introduce notions of group fairness for both the short and long term and theoretically analyze the trade-off between fairness and utility via the Price of Fairness (PoF). We characterize optimal and fair policies in the short term and show that the PoF can be large even when group distributions are nearly identical. In contrast, we show that long-term disparities can vanish under simple investment policies that achieve a low PoF. We also empirically validate these theoretical observations using both synthetic and real datasets.
Chinese Translation
在高风险环境中的算法决策可能对个人和群体产生深远影响。尽管之前的许多研究关注于静态环境中的公平性,但最近的结果表明,强制执行静态公平性约束可能会加剧长期的不平等。基于这一矛盾,我们研究了一个简化的序列选择问题,其中决策者反复选择个体,这不仅影响即时效用,还影响随时间变化的人口分布。我们引入了短期和长期的群体公平性概念,并通过公平性代价(Price of Fairness, PoF)理论分析了公平性与效用之间的权衡。我们描述了短期内的最优和公平政策,并展示了即使群体分布几乎相同,PoF 也可能很大。相反,我们表明,在实现低 PoF 的简单投资政策下,长期不平等可以消失。我们还使用合成数据集和真实数据集对这些理论观察进行了实证验证。
cs.AI / 108 / 2605.06230

Safactory: A Scalable Agent Factory for Trustworthy Autonomous Intelligence

Safactory:一个可扩展的可信自主智能代理工厂
Chen, Xinquan, Yin, Zhenyun, He, Shan, Huang, Bin, Lei, Shanzhe, Shi, Pengcheng, Cai, Kun, Chen, Bei, Liu, Bangwei, Kang, Zeyu, Huang, Chao, Zhang, Yang, Li, Wenjie, Ge, Ruijun, Wang, Yajie, Fang, Tianshun, Xu, Tianyang, Cong, Yiwen, Jin, Meng, Li, Gaolei, Wu, Xuansheng, Liu, Linhan, He, Zijing, Li, An, Teng, Yan, Tan, Xin, Lu, ChaoChao, He, Ji, Li, Jie, Song, Chunfeng, Xu, Jinya, Song, Fan, Wang, Shujie, Qian, Jianmin, Hou, Jie, Wang, Xuhong, Wang, Yingchun, Wang, Hui, Hu, Xia
Abstract
As large models evolve from conversational assistants into autonomous agents, challenges increasingly arise from long-horizon decision making, tool use, and real environment interaction. Existing agentic infrastructure remains fragmented across evaluation, data management, and agent evolution, making it difficult to discover risks systematically and improve models in a continuous closed loop. In this report, we present Safactory, a scalable agent factory for trustworthy autonomous intelligence. Safactory integrates three tightly coupled platforms: a Parallel Simulation Platform for trajectory generation, a Trustworthy Data Platform for trajectory storage and experience extraction, and an Autonomous Evolution Platform for asynchronous reinforcement learning and on-policy distillation. As far as we know, Safactory is the first framework to propose a unified evolutionary pipeline for next-generation trustworthy autonomous intelligence.
Chinese Translation
随着大型模型从对话助手演变为自主代理,长期决策、工具使用和与真实环境交互等挑战日益凸显。现有的代理基础设施在评估、数据管理和代理演化等方面仍然分散,使得系统性地发现风险和在闭环中持续改进模型变得困难。在本报告中,我们提出了Safactory,一个可扩展的可信自主智能代理工厂。Safactory集成了三个紧密耦合的平台:用于轨迹生成的并行仿真平台、用于轨迹存储和经验提取的可信数据平台,以及用于异步强化学习和策略蒸馏的自主演化平台。据我们所知,Safactory是第一个提出统一演化流程的框架,旨在实现下一代可信自主智能。
cs.AI / 109 / 2605.06290

Data Language Models: A New Foundation Model Class for Tabular Data

数据语言模型:一种用于表格数据的新基础模型类别
Erol, Eda, Pezzoli, Giuliano, Kelahmet, Ozer Cem
Abstract
Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understand tabular data as a modality. We introduce the Data Language Model (DLM), the missing foundation model for tabular data. A DLM understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values. It is the tabular data layer on which AI models, agents, and vertical AI applications can be built, eliminating the preprocessing pipelines that currently stand between raw data and every AI system that consumes it. We present Schema-1, the first DLM: a 140M parameter model trained on more than 2.3M synthetic and real-world tabular datasets. Schema-1 outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks. On missing value reconstruction it achieves lower reconstruction error than all classical statistical methods and frontier large language models on mean performance across conditions, establishing that structural understanding of a dataset's own distributional geometry is more useful for imputation than world knowledge encoded in language. It identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain, a task no prior tabular model can perform. It is the native tabular understanding layer that has been missing from the AI stack.
Chinese Translation
现在每种主要的数据模态都有一个能够本地理解它的基础模型:文本有语言模型,图像有视觉模型,音频有音频模型。然而,表格数据这一许多重要现实世界人工智能决策所依赖的模态却没有。如今,所有的表格人工智能方法,从梯度提升树到最新的表格基础模型,都需要在任何模型能够处理数据之前进行预处理管道。它们都无法将表格数据视为一种模态。我们提出了数据语言模型(Data Language Model, DLM),这是表格数据缺失的基础模型。DLM 以语言模型理解句子的方式理解表格:本地理解,无需序列化或预处理,直接从原始单元格值中获取信息。它是构建人工智能模型、代理和垂直人工智能应用的表格数据层,消除了当前在原始数据与每个消费它的人工智能系统之间的预处理管道。我们介绍了 Schema-1,这是第一个 DLM:一个拥有 1.4 亿参数的模型,经过超过 230 万个合成和真实世界表格数据集的训练。Schema-1 在已建立的行级预测基准上超越了梯度提升集成、自动机器学习堆栈和我们评估的表格基础模型。在缺失值重建方面,它在所有条件下的平均性能上实现了比所有经典统计方法和前沿大型语言模型更低的重建误差,证明了对数据集自身分布几何结构的理解在插补方面比语言中编码的世界知识更为有用。它能够仅凭原始单元格值可靠地识别任何未见数据集的行业领域,这是之前任何表格模型都无法完成的任务。它是人工智能技术栈中缺失的本地表格理解层。
cs.AI / 110 / 2605.06305

Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs

解决标记数据稀缺问题:使用大型语言模型对HTTP流量中的个人可识别信息(PII)值进行无分类法注释
Cory, Thomas, Küpper, Axel
Abstract
Automated privacy audits of web and mobile applications often analyse outbound HTTP traffic to detect Personally Identifiable Information (PII) leakage. However, existing learning-based detectors typically depend on scarce, manually labelled traffic and are tightly coupled to fixed label taxonomies, limiting transferability across domains and evolving definitions of PII. This paper investigates whether Large Language Models (LLMs) can support taxonomy-agnostic annotation of explicitly transmitted PII values in HTTP message bodies when the taxonomy is provided at runtime. We introduce a multi-stage LLM-based pipeline that combines deterministic pre-processing with label-level classification, targeted instance-level value annotation, and output validation. To enable controlled evaluation and exemplar-based prompting without relying on sensitive real-user captures, we further propose an LLM-based generator for synthetic HTTP traffic with manually validated, taxonomy-derived PII annotations. We evaluate the approach across three taxonomies spanning different PII domains and granularity levels. Results show that the pipeline accurately detects PII types and extracts corresponding values for concrete PII taxonomies. Overall, our findings position LLMs as a promising foundation for flexible, taxonomy-agnostic traffic annotation and for creating labelled data under evolving privacy taxonomies.
Chinese Translation
自动化的网络和移动应用隐私审计通常分析出站的HTTP流量,以检测个人可识别信息(PII)的泄露。然而,现有的基于学习的检测器通常依赖于稀缺的手动标记流量,并且与固定的标签分类法紧密耦合,这限制了其在不同领域和不断演变的PII定义之间的可转移性。本文探讨了大型语言模型(LLMs)是否能够支持在运行时提供分类法时,对HTTP消息体中显式传输的PII值进行无分类法注释。我们提出了一种多阶段的基于LLM的管道,结合了确定性预处理、标签级分类、针对实例级值的注释和输出验证。为了在不依赖敏感真实用户捕获的情况下实现受控评估和基于示例的提示,我们进一步提出了一种基于LLM的合成HTTP流量生成器,具有手动验证的、基于分类法的PII注释。我们在涵盖不同PII领域和粒度水平的三种分类法上评估了该方法。结果表明,该管道能够准确检测PII类型并提取相应值,以适应具体的PII分类法。总体而言,我们的研究结果将LLMs定位为灵活的、无分类法流量注释和在不断演变的隐私分类法下创建标记数据的有希望的基础。
cs.AI / 111 / 2605.06308

Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

通过推理轨迹测量黑箱信心:几何、覆盖和语言化
Martell, Marc Boubnovski, Stoisser, Josefa Lia, Märtens, Kaspar, Yu, Jialin, Kitchen, Robert, Torr, Philip, Ferkinghoff-Borg, Jesper
Abstract
Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry of the trace. We propose a black-box trajectory-confidence score: we embed a CoT as a sliding-window trajectory and measure its convergence to external answer anchors with a one-parameter softmax. The method needs no logits, hidden states, or supervised calibrators. Across six (benchmark, reasoner) settings on MedQA-USMLE, GPQA Diamond, and MMLU-Pro with Gemini 3.1 Pro and Claude Sonnet 4.6, fusing this score with coverage and verbalized-confidence channels at K=4 yields Pareto improvements over self-consistency at K=8 in 6/6 settings (median AUC 0.78 vs 0.71, deltaAUC=+0.075). A fixed-pick control (+0.060) and E5 cross-embedder replication rule out answer switching and single-vendor artifacts. Geometry peaks in the penultimate window across benchmarks and reasoners, and inverts at the terminal window on GPQA Diamond. Three unscaffolded regimes separate black-box confidence into a judge-mediated Coverage prior (C), within-trace Geometry (G), and a conditional Verbalization channel (V). Across 18 benchmark x reasoner x proposer settings, C and G provide independent signal in 18/18 and 16/18, while V contributes residual signal in 6/18. Swapping the judge from GPT-5-mini to Claude Sonnet 4.6 leaves G-only AUC unchanged (|delta|<=0.013) and shifts C-only AUC by at most +/-0.02 (kappa=0.82). Fusion beats the best single channel in 17/18 settings (median AUC 0.78, max 0.92).
Chinese Translation
可靠的信心估计能够安全地通过仅文本的 API 部署链式思维(CoT)推理。然而,主流的黑箱基线——基于 K 个样本的自一致性——在计算上是线性的且忽略了轨迹的几何特性。我们提出了一种黑箱轨迹信心评分:将 CoT 嵌入为滑动窗口轨迹,并通过一个参数的 softmax 测量其与外部答案锚点的收敛性。该方法无需 logits、隐藏状态或监督校准器。在 MedQA-USMLE、GPQA Diamond 和 MMLU-Pro 上的六个(基准,推理器)设置中,使用 Gemini 3.1 Pro 和 Claude Sonnet 4.6,将该评分与覆盖和语言化信心通道融合,在 K=4 时在 6/6 设置中相较于 K=8 的自一致性实现了 Pareto 改进(中位 AUC 0.78 对比 0.71,deltaAUC=+0.075)。固定选择控制(+0.060)和 E5 交叉嵌入者复制排除了答案切换和单一供应商伪影。在基准和推理器的倒数第二个窗口中,几何特性达到峰值,并在 GPQA Diamond 的终端窗口中反转。三个无框架的机制将黑箱信心分为由评审介导的覆盖先验(C)、轨迹内几何(G)和条件语言化通道(V)。在 18 个基准 x 推理器 x 提议者设置中,C 和 G 在 18/18 和 16/18 中提供独立信号,而 V 在 6/18 中贡献残余信号。从 GPT-5-mini 切换到 Claude Sonnet 4.6 的评审使得仅 G 的 AUC 保持不变(|delta|<=0.013),而仅 C 的 AUC 最多变化 +/-0.02(kappa=0.82)。融合在 17/18 设置中优于最佳单通道(中位 AUC 0.78,最大 0.92)。
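The trajectory-confidence idea in this abstract (embed sliding windows of a reasoning trace, then score convergence to answer anchors with a one-parameter softmax) can be sketched roughly as follows. The details are assumptions, not the paper's method: cosine similarity is used as the proximity measure, the single parameter is taken to be a temperature `tau`, and the two-dimensional vectors are toy stand-ins for real text embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_probs(window_vec, anchors, tau):
    """Softmax over answer anchors, with temperature tau as the only free parameter."""
    scaled = [cosine(window_vec, a) / tau for a in anchors]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def trajectory_confidence(windows, anchors, chosen, tau=0.1):
    """Per-window probability mass on the chosen answer's anchor."""
    return [anchor_probs(w, anchors, tau)[chosen] for w in windows]


# Toy data: two answer anchors and a three-window trace drifting toward anchor 0.
anchors = [[1.0, 0.0], [0.0, 1.0]]
windows = [[1.0, 1.0], [2.0, 1.0], [5.0, 1.0]]
conf = trajectory_confidence(windows, anchors, chosen=0)
```

Here the first window is equidistant from both anchors and later windows move toward the chosen answer, so the per-window score rises along the trace. Which window to read the score from is a design choice; the abstract reports that the geometric signal peaks in the penultimate window.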
cs.AI / 112 / 2605.06339

A Regime Theory of Controller Class Selection for LLM Action Decisions

控制器类别选择的制度理论用于大语言模型的行动决策
Jiang, Zhaoyang, Fu, Zhizhong, Kim, Yunsoo, Mi, Jiacong, Li, Zicheng, Peng, Xuanqi, Wu, Honghan
Abstract
Deployed language and vision-language models must decide, on each input, whether to answer directly, retrieve evidence, defer to a stronger model, or abstain. Contrary to the common monotonicity intuition, greater per-input expressivity is not uniformly beneficial in finite samples: under identical strict cross-validation, different benchmarks prefer different controller classes. This reflects a finite-sample limitation of instance-level uncertainty signals, which can be exhausted at a distribution-dependent scale. We organize controllers into a nested lattice of four classes: fixed actions, partition routers, instance-level controllers, and prior-gated controllers, ordered by complexity. We prove a regime theory that turns three data-estimable bottlenecks into a class choice: how much improvement is possible beyond the best fixed action, whether there are enough samples for instance-level controllers to make reliable decisions, and how much improvement a coarse partition router can recover when instance-level signal is unreliable. The resulting Bernstein-tight threshold has a matching information-theoretic lower bound, and strict nested cross-validation provably selects a near-best class. Across SMS-Spam, HallusionBench, A-OKVQA, and FOLIO, the predicted class matches the empirical winner; the prior-gated controller wins on TextVQA when OCR tokens supply a label-free prediction-time prior. Code is available at https://github.com/Anonymous-Awesome-Submissions/Regime-Theory.
Chinese Translation
部署的语言和视觉-语言模型必须在每个输入上决定是直接回答、检索证据、推迟到更强的模型,还是放弃。与常见的单调性直觉相反,在有限样本中,较高的每输入表达能力并不总是有利:在相同的严格交叉验证下,不同的基准偏好不同的控制器类别。这反映了实例级不确定性信号的有限样本限制,这些信号可能在依赖于分布的规模上被耗尽。我们将控制器组织成一个四类的嵌套格:固定动作、分区路由器、实例级控制器和先验门控控制器,按复杂性排序。我们证明了一种制度理论,将三个数据可估计的瓶颈转化为类别选择:超越最佳固定动作的改进有多少可能性、实例级控制器是否有足够的样本以做出可靠决策,以及当实例级信号不可靠时,粗略分区路由器能够恢复多少改进。由此产生的伯恩斯坦紧阈值具有匹配的信息论下界,严格的嵌套交叉验证可证明选择出接近最佳的类别。在SMS-Spam、HallusionBench、A-OKVQA和FOLIO中,预测的类别与经验赢家相匹配;在TextVQA中,当OCR标记提供无标签的预测时间先验时,先验门控控制器获胜。代码可在 https://github.com/Anonymous-Awesome-Submissions/Regime-Theory 获取。
cs.AI / 113 / 2605.06343

Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models

注意差距?真实与合成先验在表格基础模型中的分布比较
Davies, Alex O., Filho, Telmo de Menezes e Silva, Ajmeri, Nirav
Abstract
Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. Despite the centrality of pre-training data to model performance, little is known about how these corpora relate to one another in distribution, or about the impact this has on downstream performance. In this work we take three canonical datasets used to train tabular foundation models: the T4 dataset represents web-scraped corpora, the TabFM dataset represents curated tables from Kaggle, and the TabICL dataset is the only widely used synthetic prior with publicly available parameters. We characterise each corpus using aggregate features over whole tables, columns and correlations, and compare them using discriminator AUCs and k-NN coverage metrics. We find that the TabICL synthetic prior occupies a narrow region of the space of real tables, that this mismatch cannot be closed by optimising prior hyper-parameters across more than 86 thousand configurations, and that curated and web-scraped corpora are broadly interchangeable on a distributional level in feature space. Surprisingly, the distributional gap between synthetic pre-training data and real tables has no clearly detectable effect on performance under either feature-based proximity measures or TabICL's own internal representations, suggesting that coverage of the real-data distribution is not the primary driver of TabICL's generalisation.
Chinese Translation
表格基础模型是在三类语料库之一上进行预训练的:从基准库中提取的精心策划的数据集、从网络大规模收集的表格,或从参数生成先验中采样的合成表格。尽管预训练数据对模型性能至关重要,但关于这些语料在分布上如何相互关联以及这对下游性能的影响知之甚少。在本研究中,我们选取了三个典型的数据集用于训练表格基础模型;T4数据集代表了网络抓取的语料,TabFM数据集是来自Kaggle的策划表格,而TabICL数据集则是唯一一个具有公开可用参数的广泛使用的合成先验。我们通过对整个表格、列和相关性进行聚合特征的表征,使用判别器AUC和k-NN覆盖度指标进行比较。我们发现,TabICL合成先验占据了真实表格空间的一个狭窄区域,这种不匹配无法通过优化超过86,000种配置的先验超参数来弥补,而策划和网络抓取的语料在特征空间的分布层面上是广泛可互换的。令人惊讶的是,合成预训练数据与真实表格之间的分布差距在基于特征的接近度度量或TabICL自身的内部表示下对性能没有明显的可检测影响,这表明真实数据分布的覆盖并不是TabICL泛化的主要驱动因素。
cs.AI / 114 / 2605.06345

More Than Can Be Said: A Benchmark and Framework for Pre-Question Scientific Ideation

超越语言表达:预问题科学创意的基准与框架
Yu, Jie, Qiu, Song
Abstract
AI research agents have shown strong potential in automating literature search and manuscript refinement, yet most assume a clear and actionable initial input, operating only after a research question has been made explicit. In contrast, human research often begins with tacit friction, a sense of misalignment before a question can be formed. We introduce InciteResearch, a multi-agent framework designed to make a researcher's implicit understanding explicit, inspectable, and actionable. InciteResearch decomposes the logical chain of Socratic questioning and distributes it across a pipeline that: (1) Elicits a structured five-dimensional researcher profile state, anchored by specific friction points, from vague and even domain-unrelated inputs; (2) Violates hidden assumptions by maximizing the feasibility-novelty product while enforcing a 7-stage causal derivation trace; and (3) checks whether the proposed method is a Necessary consequence of the reframed insight. We further introduce TF-Bench, the first benchmark for tacit-to-explicit research assistance that distinguishes domain-related from domain-unrelated inspirations across four scientific modes. On TF-Bench, InciteResearch achieves leapfrogging gains over a prompt-based baseline (novelty/impact from 3.671/3.806 to 4.250/4.397), shifting generated proposals from recombination to architectural insight. Our work demonstrates that AI can serve as an extension of thinking itself, rather than merely automating downstream execution.
Chinese Translation
人工智能研究代理在自动化文献搜索和手稿精炼方面显示出强大的潜力,但大多数假设存在明确且可操作的初始输入,仅在研究问题被明确提出后才开始运作。相比之下,人类研究往往始于隐性的摩擦,即在形成问题之前存在的不一致感。我们引入了InciteResearch,一个多代理框架,旨在使研究者的隐性理解变得显性、可检查和可操作。InciteResearch将苏格拉底提问的逻辑链分解,并将其分布在整个流程中:(1) 从模糊甚至与领域无关的输入中引出一个结构化的五维研究者档案状态,锚定在特定的摩擦点上;(2) 通过最大化可行性-新颖性乘积并强制执行7阶段因果推导轨迹来违反隐含假设;(3) 检查所提出的方法是否是重新构思的洞察的必要结果。我们进一步介绍了TF-Bench,这是首个用于隐性到显性研究辅助的基准,能够区分四种科学模式下与领域相关和与领域无关的灵感。在TF-Bench上,InciteResearch在基于提示的基线(新颖性/影响从3.671/3.806提升至4.250/4.397)上实现了飞跃式的提升,将生成的提案从重组转变为架构洞察。我们的工作表明,人工智能可以作为思维本身的延伸,而不仅仅是自动化下游执行。
cs.AI / 115 / 2605.06346

Prediction and Empowerment: A Theory of Agency through Bridge Interfaces

预测与赋能:通过桥接接口的代理理论
Csaky, Richard
Abstract
We study agency under partial observability in deterministic physical or simulated worlds, where apparent randomness arises from uncertainty over initial conditions, fixed law bits, and unrolled exogenous noise. We model sensing and actuation as bridge interfaces split between agent-controlled parameters and environment-controlled channel state, inducing a deterministic POMDP through a prior over latent microstates and many-to-one observation coarsening. Within this framework, we prove a separation between prediction, compression, and empowerment. Perfect prediction can be achieved either by identifying the hidden quotient relevant to the target family or by overwrite control that makes the future target action-determined; high empowerment alone is insufficient. Under refinable interfaces and sufficient memory, action-conditioned observation-compression progress reduces posterior uncertainty about the latent quotient, and when refinement requires steering world-side channel conditions, this creates target-conditioned interface empowerment. A bit-string specialization with a conserved information budget makes the resulting tradeoff explicit: prediction by identification requires internal capacity at least equal to the relevant latent entropy, whereas overwrite control requires terminal action capacity over the controlled quotient. For modern AI agents, the results suggest a design principle rather than a theorem of inevitability: objectives should distinguish hidden-state identification, interface refinement, task-relevant controllability, and mere overwrite or distractor control. Human-AI alignment is partly an interface-design problem, where the relevant bridge is between human intent, agent internal state, external tools, and world-side channel conditions. This is a working draft: feedback and criticism are most welcome.
Chinese Translation
我们研究了在确定性物理或模拟世界中部分可观测性下的代理性,其中表面随机性源于对初始条件、固定法则位和展开的外生噪声的不确定性。我们将感知和执行建模为桥接接口,分为代理控制的参数和环境控制的通道状态,从而通过对潜在微观状态的先验和多对一观察粗化诱导出确定性部分可观测马尔可夫决策过程(POMDP)。在这一框架下,我们证明了预测、压缩和赋能之间的分离。完美预测可以通过识别与目标家族相关的隐藏商或通过覆盖控制来实现,使未来的目标动作确定;单靠高赋能是不够的。在可细化接口和足够的记忆下,基于动作的观察压缩进展减少了对潜在商的后验不确定性,而当细化需要引导世界侧通道条件时,这会产生目标条件的接口赋能。具有保留信息预算的比特串专门化使得结果的权衡变得明确:通过识别进行预测需要至少与相关潜在熵相等的内部容量,而覆盖控制则需要对受控商的终端动作容量。对于现代人工智能代理,这些结果提出了一种设计原则,而非必然定理:目标应区分隐藏状态识别、接口细化、任务相关可控性以及单纯的覆盖或干扰控制。人类与人工智能的对齐在一定程度上是一个接口设计问题,其中相关的桥接在于人类意图、代理内部状态、外部工具和世界侧通道条件之间。这是一个工作草稿:欢迎反馈和批评。
cs.AI / 116 / 2605.06365

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

从智能体循环到确定性图:可重复的人工智能原生工作的执行沿袭
Rosen, Josh, Rosen, Seth
Abstract
Large language model systems are increasingly deployed as agentic workflows that interleave reasoning, tool use, memory, and iterative refinement. These systems are effective at producing answers, but they often rely on implicit conversational state, making it difficult to preserve stable work products, isolate irrelevant updates, or propagate changes through intermediate artifacts. We introduce execution lineage: an execution model in which AI-native work is represented as a directed acyclic graph (DAG) of artifact-producing computations with explicit dependencies, stable intermediate boundaries, and identity-based replay. The goal is not to make the model a better one-shot writer, but to make evolving AI-generated work maintainable under change. We compare execution-lineage replay against loop-centric update baselines on two controlled policy-memo update tasks. In an unrelated-branch update, DAG replay preserved the final memo exactly in all runs, with zero churn and zero unrelated-branch contamination, while loop baselines regenerated the memo and frequently imported unrelated context. In an intermediate-artifact edit, all systems reflected the new constraint in the final memo, but only DAG replay achieved perfect upstream preservation, downstream propagation, unaffected-artifact preservation, and cross-artifact consistency. These results show that final answer quality and maintained-state quality are distinct. Strong loop baselines can remain competitive at producing polished final outputs when the task is a bounded synthesis/update problem and all current sources fit in context, but immediate task success can mask partial state inconsistency that may compound over future revisions. Execution lineage provides stronger guarantees about what should change, what should remain stable, and how work evolves across revisions.
Chinese Translation
大型语言模型系统越来越多地被部署为智能工作流,这些工作流交替进行推理、工具使用、记忆和迭代优化。这些系统在生成答案方面非常有效,但它们通常依赖于隐式的对话状态,这使得保持稳定的工作产品、隔离无关更新或通过中间工件传播变化变得困难。我们引入了执行沿袭:一种执行模型,其中人工智能原生工作被表示为具有显式依赖关系、稳定中间边界和基于身份的重放的生成工件计算的有向无环图(DAG)。目标不是使模型成为更好的单次写作工具,而是使不断发展的人工智能生成工作在变化中保持可维护性。我们在两个受控的政策备忘录更新任务中比较了执行沿袭重放与以循环为中心的更新基线。在无关分支更新中,DAG 重放在所有运行中准确保留了最终备忘录,且零波动和零无关分支污染,而循环基线则重新生成了备忘录并频繁引入无关上下文。在中间工件编辑中,所有系统在最终备忘录中反映了新的约束,但只有 DAG 重放实现了完美的上游保留、下游传播、未受影响工件的保留和跨工件一致性。这些结果表明,最终答案质量和保持状态质量是不同的。当任务是一个有限的综合/更新问题且所有当前源都适合上下文时,强大的循环基线在生成精致的最终输出方面仍然具有竞争力,但即时任务成功可能掩盖未来修订中可能加剧的部分状态不一致。执行沿袭提供了关于什么应该改变、什么应该保持稳定以及工作如何在修订中演变的更强保证。
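The execution model described in this abstract (a DAG of artifact-producing computations with explicit dependencies and identity-based replay) can be illustrated with a small sketch. It is an assumption-laden toy rather than the authors' system: here a node's identity is a SHA-256 hash of its name, a hypothetical `spec` string, and the identities of its dependencies, and the lambdas are deterministic stand-ins for model calls.

```python
import hashlib


class Node:
    """One artifact-producing step: a name, a spec (e.g. prompt version), deps, and a function."""

    def __init__(self, name, spec, deps, fn):
        self.name, self.spec, self.deps, self.fn = name, spec, deps, fn


def run(graph, order, cache):
    """Replay the graph in topological order; return (artifacts, names_recomputed)."""
    ids, artifacts, recomputed = {}, {}, []
    for name in order:
        node = graph[name]
        # Identity depends only on the node's own spec and its upstream identities.
        payload = "|".join([node.name, node.spec] + [ids[d] for d in node.deps])
        ids[name] = hashlib.sha256(payload.encode()).hexdigest()
        if ids[name] not in cache:
            cache[ids[name]] = node.fn(*[artifacts[d] for d in node.deps])
            recomputed.append(name)
        artifacts[name] = cache[ids[name]]
    return artifacts, recomputed


def make_graph(b_spec="v1"):
    """Two branches: facts_a -> summary_a -> memo, and an unrelated facts_b -> notes_b."""
    return {
        "facts_a": Node("facts_a", "v1", [], lambda: "A-facts"),
        "facts_b": Node("facts_b", b_spec, [], lambda: "B-facts:" + b_spec),
        "summary_a": Node("summary_a", "v1", ["facts_a"], lambda a: "summary(" + a + ")"),
        "notes_b": Node("notes_b", "v1", ["facts_b"], lambda b: "notes(" + b + ")"),
        "memo": Node("memo", "v1", ["summary_a"], lambda s: "memo(" + s + ")"),
    }


order = ["facts_a", "facts_b", "summary_a", "notes_b", "memo"]
cache = {}
first, done_first = run(make_graph(), order, cache)
# Update only the unrelated branch, then replay against the same cache.
second, done_second = run(make_graph("v2"), order, cache)
```

After the second run only the edited branch is recomputed and the memo is served unchanged from the cache, which mirrors the unrelated-branch scenario in the abstract. Hashing specs rather than outputs is one possible identity scheme; a production system would also need to handle nondeterministic model calls.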
cs.AI / 117 / 2605.06371

Debiased Multimodal Personality Understanding through Dual Causal Intervention

通过双重因果干预去偏见的多模态人格理解
Zhu, Yangfu, Han, Zitong, Ning, Nianwen, Wei, Yuting, Wang, Yuandong, Feng, Hang, Shao, Zhenzhou
Abstract
Multimodal personality understanding plays a critical role in human-centered artificial intelligence. Previous work mainly focuses on learning rich multimodal representations for video personality understanding. However, such methods often suffer from potential harm caused by subject bias (e.g., observable age and unobservable mental states), as subjects originate from diverse demographic backgrounds. Learning such spurious associations between multimodal features and traits may lead to unfair personality understanding. In this work, we construct a Structural Causal Model (SCM) to analyze the impact of these biases from a causal perspective, and propose a novel Dual Causal Adjustment Network (DCAN) to mitigate the interference of subject attributes on personality understanding. Specifically, we design a Back-door Adjustment Causal Learning (BACL) module to block spurious correlations from observable demographic factors via a prototype-based confounder dictionary, and subsequently apply a Front-door Adjustment Causal Learning (FACL) module to address latent and unobservable biases through a learned mediator dictionary intervention, thereby achieving causal disentanglement of representations for deconfounded reasoning. Importantly, we construct a Demographic-annotated Multimodal Student Personality (DMSP) dataset to support the analysis and discussion of fairness-related factors. Extensive experiments on the benchmark dataset CFI-V2 and our DMSP dataset demonstrate that DCAN consistently improves prediction accuracy, reaching 92.11% and 92.90%, respectively. Meanwhile, the improvements in the fairness metrics of equal opportunity and demographic parity are 6.57% and 7.97% on CFI-V2, and 15.38% and 20.06% on the DMSP dataset. Our code and DMSP dataset are available at https://github.com/Sabrina-han/DCAN
Chinese Translation
多模态人格理解在以人为中心的人工智能中发挥着关键作用。之前的研究主要集中在学习丰富的多模态表示以理解视频人格。然而,由于受试者来自不同的人口背景,他们往往受到主观偏见(例如,可观察的年龄和不可观察的心理状态)带来的潜在影响。学习多模态特征与性格之间的这种虚假关联可能导致不公平的人格理解。在本研究中,我们构建了一个结构性因果模型(Structural Causal Model, SCM)从因果的角度分析这些偏见的影响,并提出了一种新颖的双重因果调整网络(Dual Causal Adjustment Network, DCAN)以减轻受试者属性对人格理解的干扰。具体而言,我们设计了一个后门调整因果学习(Back-door Adjustment Causal Learning, BACL)模块,通过基于原型的混淆因子字典阻止可观察的人口因素带来的虚假相关性,随后应用前门调整因果学习(Front-door Adjustment Causal Learning, FACL)模块,通过学习的中介字典干预来解决潜在和不可观察的偏见,从而实现表示的因果解耦,以便进行去混淆推理。重要的是,我们构建了一个带有人口标注的多模态学生人格(Demographic-annotated Multimodal Student Personality, DMSP)数据集,以支持公平性相关因素的分析和讨论。在基准数据集CFI-V2和我们的DMSP数据集上的大量实验表明,DCAN在预测准确性上持续提升,分别达到了92.11%和92.90%。同时,在CFI-V2上公平性指标的改进分别为6.57%和7.97%,在DMSP数据集上则为15.38%和20.06%。我们的代码和DMSP数据集可在https://github.com/Sabrina-han/DCAN获取。
cs.AI / 118 / 2605.06382

Rethinking Vacuity for OOD Detection in Evidential Deep Learning

重新思考证据深度学习中的空洞性以进行OOD检测
McNamara, Claire
Abstract
Vacuity, or Uncertainty Mass (UM), is commonly used as a metric to evaluate Out-of-Distribution (OOD) detection in Evidential Deep Learning (EDL). It generally involves dividing the number of classes ($K$) by the total strength of belief ($S$) of the model's predictions, where $S$ is derived from summing the Dirichlet parameters. As such, UM is sensitive to the cardinality of $K$. In particular, it is unlikely in practice that there is a linear relationship between $K$ and $S$ as $K$ and $S$ increase due to the nature of EDL (suppressing incorrectly assigned evidence). As a result, when comparing In Distribution (ID) and OOD results, it is important that $K_{\mathrm{ID}}$ and $K_{\mathrm{OOD}}$ are equal; something that is not always ensured in practice. We provide an empirical demonstration of how results for AUROC and AUPR can substantially differ when class cardinality between ID and OOD differs by 1, with AUROC differing by as much as 0.318 and AUPR by 0.613 for standard EDL, and AUROC by 0.360 and AUPR by 0.683 for IB-EDL. More concretely, our findings isolate an evaluation artefact: when K differs between ID and OOD, AUROC/AUPR can be artificially inflated without any change in model predictions. We further discuss the evaluation of EDL over causal language models using Multiple-Choice Question-Answer (MCQA) datasets and argue for clearer definitions of ID and OOD in this context. Our primary contribution is an empirical and theoretical demonstration that vacuity-based OOD detection in EDL-fine-tuned LLMs is highly sensitive to uncontrolled differences in evaluated class cardinality.
Chinese Translation
空洞性或不确定性质量(Uncertainty Mass, UM)通常被用作评估证据深度学习(Evidential Deep Learning, EDL)中分布外(Out-of-Distribution, OOD)检测的指标。它通常涉及将类别数($K$)除以模型预测的信念总强度($S$),其中$S$是通过求和Dirichlet参数得出的。因此,UM对$K$的基数非常敏感。特别是,实际上$K$和$S$之间不太可能存在线性关系,因为随着$K$和$S$的增加,EDL的特性(抑制错误分配的证据)会影响这一关系。因此,在比较分布内(In Distribution, ID)和OOD结果时,确保$K_{\mathrm{ID}}$和$K_{\mathrm{OOD}}$相等是非常重要的;而这在实践中并不总是能够保证。我们提供了一个实证示例,说明当ID和OOD之间的类别基数相差1时,AUROC和AUPR的结果可能会有显著差异,对于标准EDL,AUROC最多相差0.318,AUPR相差0.613,而对于IB-EDL,AUROC相差0.360,AUPR相差0.683。更具体地,我们的研究揭示了一个评估伪影:当ID和OOD之间的$K$不同,AUROC/AUPR可能在模型预测未发生变化的情况下被人为地膨胀。我们进一步讨论了在因果语言模型上使用多项选择问答(Multiple-Choice Question-Answer, MCQA)数据集评估EDL,并主张在此背景下对ID和OOD进行更清晰的定义。我们的主要贡献是实证和理论上证明,在经过微调的LLM中,基于空洞性的OOD检测对评估类别基数的非控制差异高度敏感。
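The dependence of vacuity on class cardinality described in this abstract can be made concrete with a short worked example. The abstract defines vacuity as $K/S$, with $S$ the sum of the Dirichlet parameters; the sketch below additionally assumes the common EDL convention that each parameter equals the class evidence plus one, which the abstract does not state.

```python
def vacuity(evidence):
    """Vacuity u = K / S, where S sums the Dirichlet parameters (assumed alpha_k = e_k + 1)."""
    alphas = [e + 1.0 for e in evidence]
    return len(alphas) / sum(alphas)


# Identical total evidence (10 units on one class), evaluated over 4 versus 5 classes.
u_four = vacuity([10.0, 0.0, 0.0, 0.0])        # K = 4, S = 14
u_five = vacuity([10.0, 0.0, 0.0, 0.0, 0.0])   # K = 5, S = 15

# With no evidence at all, vacuity is 1 regardless of K.
u_empty = vacuity([0.0, 0.0, 0.0])
```

With the same evidence, adding one class raises vacuity from 4/14 to 5/15. A threshold or ranking built on vacuity can therefore separate ID from OOD samples purely because the two sets were scored with different $K$, which is the evaluation artefact the abstract isolates.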
cs.AI / 119 / 2605.06390

Automated alignment is harder than you think

自动化对齐比你想象的更困难
Bowkis, Aleksandr, Buhl, Marie Davidsen, Pfau, Jacob, Irving, Geoffrey
Abstract
A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments resulting in the unintentional deployment of misaligned AI. This could happen because alignment research involves many hard-to-supervise fuzzy tasks (tasks without clear evaluation criteria, for which human judgement is systematically flawed). Consequently, research outputs will contain systematic, undetected errors, and even correct outputs could be incorrectly aggregated into overconfident safety assessments. This problem is likely to be worse for automated alignment research than for human-generated alignment research for several reasons: 1) optimisation pressure means agent-generated mistakes are concentrated among those that human reviewers are least likely to catch; 2) agents are likely to produce errors that do not resemble human mistakes; 3) AI-generated alignment solutions may involve arguments humans cannot evaluate; and 4) shared weights, data and training processes may make AI outputs more correlated than human equivalents. Therefore, agents must be trained to reliably perform hard-to-supervise fuzzy tasks. Generalisation and scalable oversight are the leading candidates for achieving this but both face novel challenges in the context of automated alignment.
Chinese Translation
对齐人工超级智能(ASI)的一个主要提议是利用人工智能代理随着能力的提升,自动化越来越多的对齐研究。我们认为,即使研究代理并没有故意破坏对齐工作的意图,这一计划仍可能产生引人注目但灾难性误导的安全评估,从而导致错误对齐的人工智能被无意中部署。这种情况可能发生,因为对齐研究涉及许多难以监督的模糊任务(没有明确评估标准的任务,人类判断系统性存在缺陷)。因此,研究结果将包含系统性的、未被发现的错误,即使是正确的输出也可能被错误地聚合成过于自信的安全评估。由于以下几个原因,这一问题在自动化对齐研究中可能比在人类生成的对齐研究中更为严重:1)优化压力意味着代理生成的错误集中在那些人类评审者最不可能捕捉到的地方;2)代理可能产生与人类错误不相似的错误;3)人工智能生成的对齐解决方案可能涉及人类无法评估的论证;4)共享权重、数据和训练过程可能使人工智能输出之间的相关性高于人类输出。因此,代理必须被训练以可靠地执行难以监督的模糊任务。泛化和可扩展监督是实现这一目标的主要候选方案,但在自动化对齐的背景下,两者都面临新的挑战。
cs.AI / 120 / 2605.06434

Knowledge Graphs, the Missing Link in Agentic AI-based Formal Verification

知识图谱:代理人工智能基础的形式验证中的缺失环节
Viswambharan, Vaisakh Naduvodi, Radhakrishna, Keerthan Kopparam, Gadde, Deepak Narayan, Kumar, Aman
Abstract
Recent advances in Large Language Models (LLMs) have enabled workflows that generate SystemVerilog Assertions (SVAs) from natural-language specifications, with the potential to accelerate Formal Verification (FV). However, high-quality assertion synthesis remains challenging because specifications are often ambiguous or incomplete and critical micro-architectural details reside in the Register Transfer Level (RTL). Many existing approaches treat the specification and RTL as loosely structured text, which weakens specification-to-RTL grounding and leads to semantic mismatches and frequent syntax failures during formal parsing and elaboration. This work addresses these limitations with a verification-centric Knowledge Graph (KG) constructed from structured Intermediate Representations (IRs) extracted from the specification, RTL, and formal-tool feedback, including syntax diagnostics, Counterexamples (CEXs), and coverage reports. The KG links requirements, design hierarchy, signals, assumptions, and properties to provide traceable, design-grounded context for generation. A multi-agent workflow queries and updates this KG to generate SVAs and to drive three refinement loops: syntax repair guided by tool diagnostics, CEX-guided correction using trace links, and coverage-directed property augmentation. Evaluation across seven benchmark designs indicates that KG-based context retrieval improves specification-to-RTL grounding and consistently produces compilable SVAs with low syntax-repair overhead. The approach achieves formal coverage ranging from 78.5% to 99.4%, though convergence exhibits design dependence with complex temporal and arithmetic reasoning remaining challenging for current LLM capabilities.
Chinese Translation
近期大型语言模型(LLMs)的进展使得从自然语言规范生成SystemVerilog断言(SVAs)的工作流程成为可能,这有望加速形式验证(FV)。然而,由于规范通常模糊或不完整,以及关键的微架构细节存在于寄存器传输级(RTL),高质量的断言合成仍然具有挑战性。许多现有方法将规范和RTL视为松散结构的文本,这削弱了规范与RTL之间的基础联系,导致语义不匹配和在形式解析与详细化过程中频繁出现语法错误。本研究通过构建一个以验证为中心的知识图谱(KG),解决了这些局限性,该知识图谱由从规范、RTL和形式工具反馈中提取的结构化中间表示(IRs)构成,包括语法诊断、反例(CEXs)和覆盖报告。KG将需求、设计层次、信号、假设和属性链接起来,为生成提供可追溯的、基于设计的上下文。一个多代理工作流程查询并更新这个KG,以生成SVAs并驱动三个细化循环:由工具诊断指导的语法修复、利用追踪链接的反例指导修正,以及覆盖导向的属性增强。在七个基准设计上的评估表明,基于KG的上下文检索改善了规范与RTL之间的基础联系,并始终生成具有低语法修复开销的可编译SVAs。该方法的形式覆盖率范围从78.5%到99.4%,尽管收敛性表现出设计依赖性,复杂的时间和算术推理对当前LLM的能力仍然具有挑战性。
cs.AI / 121 / 2605.06444

SCRuB: Social Concept Reasoning under Rubric-Based Evaluation

SCRuB:基于评分标准评估的社会概念推理
Watson-Daniels, Jamelle, Bhattacharjee, Himaghna, Wang, Skyler, Handoko, Brandon, Li, Antonio, Ovalle, Anaelia, Pasupuleti, Mahesh, Ross, Candace, Sarma, Vidya, Subramonian, Arjun, Ullrich, Karen, van der Vaart, Will, Xin, Yijing, Nickel, Maximilian
Abstract
While many studies of Large Language Model (LLM) reasoning capabilities emphasize mathematical or technical tasks, few address reasoning about social concepts: the abstract ideas shaping social norms, culture, and institutions. This understudied capability is essential for modern models acting as social agents, yet no systematic evaluation methodology targets it. We introduce SCRuB (Social Concept Reasoning under Rubric-Based Evaluation), a framework designed for this setting of task indeterminacy. Our goal is to measure the degree to which a model reasons about social concepts with the depth and critical rigor of a human expert. SCRuB proceeds in three phases: prompt construction from established sources, response generation by experts and models, and comparative evaluation using a five-dimensional critical thinking rubric. To enable generalization of the pipeline, we introduce a Panel of Disciplinary Perspectives ensemble validated against independent expert judges. We release SCRuBEval (n=4,711 evaluation prompts) and SCRuBAnnotations (300 expert-authored responses and 150 expert comparative judgments from 45 PhD-level scholars). Our results show that frontier models consistently outperform human experts across all five rubric dimensions. Across 1,170 pairwise comparisons, expert judges ranked a model response first in 80.8% of judgments and preferred model responses overall 74.4% of the time. Ultimately, this study provides the first expert-grounded demonstration of evaluation saturation for social concept reasoning: the single-turn exam-style format has reached its ceiling for models and humans alike.
Chinese Translation
尽管许多关于大型语言模型(LLM)推理能力的研究强调数学或技术任务,但很少有研究关注社会概念的推理:这些抽象思想塑造了社会规范、文化和制度。这一被忽视的能力对于现代模型作为社会代理人至关重要,但目前尚无系统的评估方法针对这一能力。我们提出了SCRuB(基于评分标准评估的社会概念推理),这是一个针对任务不确定性设置而设计的框架。我们的目标是衡量模型在社会概念推理方面的深度和批判性严谨性,达到人类专家的水平。SCRuB分为三个阶段:从已建立的来源构建提示,由专家和模型生成响应,以及使用五维批判性思维评分标准进行比较评估。为了使这一流程具有普遍性,我们引入了一个经过独立专家评审验证的学科视角小组(Panel of Disciplinary Perspectives)集成。我们发布了SCRuBEval(n=4,711个评估提示)和SCRuBAnnotations(300个专家撰写的响应和来自45位博士级学者的150个专家比较判断)。我们的结果显示,前沿模型在所有五个评分维度上始终优于人类专家。在1,170个成对比较中,专家评审在80.8%的判断中将模型响应排在第一位,并在74.4%的情况下整体偏好模型响应。最终,本研究提供了社会概念推理评估饱和度的首个专家基础示范:单轮考试风格的格式已达到模型和人类的极限。
cs.AI / 122 / 2605.06455

PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

PrefixGuard:从LLM-Agent痕迹到在线故障预警监测器
Huang, Xinmiao, Hu, Jinwei, Roy, Rajarshi, Wu, Changshun, Dong, Yi, Huang, Xiaowei
Abstract
Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and prefix-risk scorer from terminal outcomes. Across WebArena, $\tau^2$-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach 0.900/0.710/0.533/0.557 AUPRC. Using the strongest backend within each representation, they improve over raw-text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix-warning protocol. We also derive an observability ceiling on score-based area under the precision-recall curve (AUPRC) that separates monitor error from failures lacking evidence in the observed prefix. For finite-state audit, post-hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and $\tau^2$-Bench (29 and 20 states) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first-alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low-false-alarm alerts, whereas $\tau^2$-Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor-synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.
Chinese Translation
大型语言模型(LLM)代理现在执行长时间的工具使用任务,其中最终结果检查可能会在干预时机上过晚。在线预警需要对异构痕迹进行轻量级前缀监测,但手动编写的事件模式脆弱且部署时的LLM判断成本高昂。我们提出了PrefixGuard,这是一个从痕迹到监测器的框架,包含一个离线的StepView归纳步骤,随后进行监督监测器训练。StepView从原始痕迹样本中归纳出确定性的类型步骤适配器,监测器则从终端结果中学习事件抽象和前缀风险评分器。在WebArena、$\tau^2$-Bench、SkillsBench和TerminalBench上,最强的PrefixGuard监测器达到了0.900/0.710/0.533/0.557的AUPRC。使用每种表示中的最强后端,它们在原始文本控制上平均提高了+0.137 AUPRC。在相同的前缀预警协议下,LLM判断者的表现仍然显著较弱。我们还推导出了基于得分的精确率-召回曲线(AUPRC)的可观测性上限,该上限将监测器错误与缺乏观察前缀证据的失败区分开来。对于有限状态审计,后验确定性有限自动机(DFA)提取在WebArena和$\tau^2$-Bench上保持紧凑(29和20个状态),但在SkillsBench和TerminalBench上扩展到151和187个状态。最后,首次警报诊断显示强排名并不意味着部署效用:WebArena排名良好但未能支持低误报警报,而$\tau^2$-Bench和TerminalBench则保留了更多可操作的早期警报。综合这些结果,PrefixGuard被定位为一种实用的监测器合成方案,并提供明确的诊断,以便在前缀警告转化为可操作干预时进行评估。
cs.AI / 123 / 2605.06457

Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems

超越任务成功:在基于大型语言模型的代理支付系统中测量工作流程的保真度
Huang, Donghao, Chua, Joon Kiat, Wang, Zhaoxia
Abstract
LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Notably, GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT-5.2 achieves perfect ASR. Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models, demonstrating that trajectory-level evaluation is essential in regulated domains.
Chinese Translation
基于大型语言模型(LLM)的多代理系统在支付工作流程中越来越多地被部署,然而现有的指标,如任务成功率(Task Success Rate, TSR)和代理交接F1分数(Agent Handoff F1-Score, HF1),仅捕捉最终结果或无序的路由决策。我们引入了代理成功率(Agentic Success Rate, ASR),这是一种轨迹保真度指标,比较观察到的和预期的代理执行序列在过渡层面的表现,将性能分解为过渡召回率和过渡精确率。应用于18个LLM和90,000个任务实例的分层多代理支付系统(Hierarchical Multi-Agent System for Payments, HMASP),ASR揭示出18个模型中的10个在支付结账过程中系统性地跳过了确认检查点,这一偏差在TSR和HF1中均不可见,而8个模型则完美地执行了该检查点。值得注意的是,尽管GPT-4.1在TSR和HF1上均表现完美,但却展现出隐藏的工作流程捷径,而GPT-5.2则实现了完美的ASR。通过ASR诊断指导的提示优化和确定性路由保护显著提高了TSR,对于之前表现不佳的模型,提升幅度高达93.8个百分点,证明了在受监管领域中轨迹级评估的重要性。
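The transition-level comparison behind Transition Recall and Transition Precision can be sketched as follows. This is an illustrative reading, not the paper's implementation: it assumes a transition is a consecutive pair of agents in a trajectory and that matching is done on the multiset of such pairs. The agent names are hypothetical.

```python
from collections import Counter


def transitions(sequence):
    """Multiset of consecutive (from_agent, to_agent) pairs."""
    return Counter(zip(sequence, sequence[1:]))


def transition_scores(expected, observed):
    """Return (recall, precision) over agent-to-agent transitions."""
    exp_t = transitions(expected)
    obs_t = transitions(observed)
    matched = sum((exp_t & obs_t).values())  # multiset intersection
    recall = matched / sum(exp_t.values()) if exp_t else 1.0
    precision = matched / sum(obs_t.values()) if obs_t else 1.0
    return recall, precision


# Hypothetical checkout workflow in which the observed run skips confirmation.
expected = ["router", "cart", "confirm", "pay"]
observed = ["router", "cart", "pay"]

recall, precision = transition_scores(expected, observed)
print(f"transition recall={recall:.3f}, precision={precision:.3f}")
```

In this toy case the final agent ("pay") is still reached, so an outcome-only metric would report success, while the transition scores drop because the cart-to-confirm and confirm-to-pay steps never occur.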
cs.AI / 124 / 2605.06475

Probabilistic Dating of Historical Manuscripts via Evidential Deep Regression on Visual Script Features

通过视觉书写特征的证据深度回归对历史手稿进行概率性年代测定
Chodavarapu, Ranjith
Abstract
We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes, as is standard in the previous literature, we pose dating as an evidential deep regression problem over a continuous year axis, allowing our neural network to output a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained with a joint negative-log-likelihood and evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages, 3 medieval codices, 151,936 patches), our model achieves a test MAE of 5.4 years, well below the 50-year century-label supervision granularity, with 93% of patches within 5 years and 97% within 10 years. Our approach achieves a prediction interval coverage probability (PICP) of 92.6%, the best calibration among all compared methods, in a single forward pass, outperforming MC Dropout (PICP=88.2%, 50 passes) and Deep Ensembles (PICP=79.7%, 5 models) at $5\times$ lower inference cost. Uncertainty decomposition shows aleatoric uncertainty is a strong predictor of dating error (Spearman $\rho=0.729$), and selective prediction on the most certain 20% of patches yields an MAE of 0.5 years. We show that predicted uncertainty increases as image degradation worsens, spatial decomposition maps explain which script regions cause aleatoric uncertainty, and page-level aggregation reduces MAE to 4.5 years with $\rho=0.905$ between uncertainty and page-level error.
Chinese Translation
我们提出了一种基于视觉特征对历史手稿页面进行年代测定的概率性方法。与以往文献中将几个世纪聚合为类别的标准做法不同,我们将年代测定视为一个在连续年份轴上的证据深度回归问题,使我们的神经网络能够在一次前向传播中输出完整的预测分布,并分解出随机不确定性和认知不确定性。我们的架构结合了 EfficientNet-B2 主干和一个使用联合负对数似然和证据正则化目标训练的正态逆伽马(Normal-Inverse-Gamma, NIG)输出头。在 DIVA-HisDB 基准测试(150 页,3 本中世纪手抄本,151,936 个补丁)上,我们的模型测试平均绝对误差(MAE)为 5.4 年,远低于 50 年的世纪标签监督粒度,其中 93% 的补丁在 5 年内,97% 在 10 年内。我们的方法在一次前向传播中实现了 PICP=92.6%,是所有比较方法中最佳的校准效果,优于 MC Dropout(PICP=88.2%,50 次传播)和深度集成(Deep Ensembles,PICP=79.7%,5 个模型),且推理成本降低了 $5\times$。不确定性分解显示,随机不确定性是年代测定误差的强预测因子(斯皮尔曼相关系数 $\rho=0.729$),而对最确定的 20% 的补丁进行选择性预测可以提供 0.5 年 MAE。我们展示了预测的不确定性随着图像退化的加剧而增加,空间分解图解释了哪些书写区域导致随机不确定性,并且页面级聚合将 MAE 降低到 4.5 年,且不确定性与页面级误差之间的相关系数为 $\rho=0.905$。
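The uncertainty decomposition and the PICP calibration figure mentioned above can be sketched in a few lines of Python. The sketch assumes the standard evidential-regression parameterization of a Normal-Inverse-Gamma head with parameters $(\gamma, \nu, \alpha, \beta)$, where aleatoric uncertainty is $\beta/(\alpha-1)$ and epistemic uncertainty is $\beta/(\nu(\alpha-1))$; whether the paper uses exactly this form is an assumption, and all numbers are made up.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Split a Normal-Inverse-Gamma output into prediction and uncertainties.

    Assumed standard evidential-regression formulas (valid for alpha > 1):
      prediction = gamma
      aleatoric  = beta / (alpha - 1)
      epistemic  = beta / (nu * (alpha - 1))
    """
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 for finite uncertainty")
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return gamma, aleatoric, epistemic


def picp(y_true, lower, upper):
    """Prediction interval coverage probability: share of targets inside."""
    inside = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return inside / len(y_true)


# Illustrative values only: a predicted year with its two uncertainty terms.
year, aleatoric, epistemic = nig_uncertainty(gamma=1350.0, nu=2.0, alpha=3.0, beta=8.0)

# Three hypothetical patches with true years and predicted intervals.
coverage = picp(
    y_true=[1340.0, 1352.0, 1400.0],
    lower=[1335.0, 1345.0, 1380.0],
    upper=[1345.0, 1355.0, 1395.0],
)
print(year, aleatoric, epistemic, coverage)
```

All four NIG parameters come out of one forward pass, which is why this style of model needs no repeated sampling (as in MC Dropout) or multiple networks (as in Deep Ensembles) to produce intervals.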
cs.AI / 125 / 2605.06480

Patch-Effect Graph Kernels for LLM Interpretability

用于大型语言模型可解释性的补丁效应图核
Fernandez-Boullon, Ruben, Olivieri, David N.
Abstract
Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components. We introduce three graph-construction methods (direct influence via causal mediation, partial correlation, and co-influence) and apply graph kernels to analyze the resulting structures. Evaluating this approach on GPT-2 Small using Indirect Object Identification (IOI) and related tasks, we find that patch-effect graphs preserve discriminative structural signals. Specifically, localized edge-slot features provide higher classification accuracy than global graph-shape descriptors. A screened paired-patching validation suggests that CI- and PC-selected candidate edges correspond to stronger activation-influence effects than random or low-rank candidates. Crucially, by evaluating these representations against rigorous prompt-only and raw patch-effect controls, we make the evidential scope of the benchmark explicit: graph features compress structured patching signal, while raw tensors and surface cues define strong baselines that any circuit-level claim should address. Ultimately, our framework provides a compression and evaluation pipeline for comparing patching-derived structures under controlled baselines, separating robust slice-discriminative evidence from stronger task-general causal-circuit claims.
Chinese Translation
机制可解释性旨在通过识别因果电路来逆向工程变换器计算,方法是通过激活补丁进行干预。然而,在不同提示和任务类别中扩展这些干预会产生高维、非结构化的数据集,这些数据集难以进行系统比较。我们提出了一个框架,将机制分析重新构建为图机器学习问题,通过将激活补丁配置表示为模型组件上的补丁效应图。我们引入了三种图构建方法:通过因果中介的直接影响、部分相关性和共同影响,并应用图核分析生成的结构。在使用间接对象识别(Indirect Object Identification, IOI)及相关任务对 GPT-2 Small 进行评估时,我们发现补丁效应图保留了区分性结构信号。具体而言,局部边缘槽特征提供的分类准确率高于全局图形状描述符。经过筛选的配对补丁验证表明,CI 和 PC 选出的候选边缘对应于比随机或低秩候选更强的激活影响效应。关键是,通过将这些表示与严格的仅提示和原始补丁效应控制进行评估,我们明确了基准的证据范围:图特征压缩了结构化补丁信号,而原始张量和表面线索定义了任何电路级主张应当解决的强基线。最终,我们的框架提供了一个压缩和评估管道,用于在受控基线下比较补丁衍生结构,将稳健的切片区分证据与更强的任务通用因果电路主张分开。
cs.AI / 126 / 2605.06483

ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning

ReasonSTL:通过工具增强的过程奖励学习架起自然语言与信号时序逻辑之间的桥梁
Ye, Bowen, Li, Zhijian, Huang, Junyue, Ma, Junkai, Yin, Xiang
Abstract
Signal Temporal Logic (STL) is an expressive formal language for specifying spatio-temporal requirements over real-valued, real-time signals. It has been widely used for the verification and synthesis of autonomous systems and cyber-physical systems. In practice, however, users often express their requirements in natural language rather than in structured STL formulas, making natural-language-to-STL translation a critical yet challenging task. Manual specification requires temporal-logic expertise and cannot scale, while prompting commercial LLM APIs incurs substantial token costs and may expose sensitive system requirements to third-party services, raising privacy concerns for industrial deployment. To address these challenges, we present ReasonSTL, a tool-augmented framework that adapts local open-source language models for natural-language-to-STL generation. ReasonSTL decomposes the translation process into explicit reasoning, deterministic tool calls, and structured formula construction. We further introduce process-rewarded training to supervise both tool-use trajectories and final formulas, together with STL-Bench, a bilingual, computation-aware benchmark grounded in real-world signals. Experiments show that a 4B model trained with ReasonSTL achieves state-of-the-art performance in both automatic metrics and human evaluations, demonstrating that ReasonSTL provides a transparent, low-cost, and privacy-preserving alternative for formal specification drafting.
Chinese Translation
信号时序逻辑(STL)是一种表达性强的形式语言,用于指定关于实值实时信号的时空要求。它已广泛应用于自主系统和网络物理系统的验证与合成。然而,在实际应用中,用户通常以自然语言而非结构化的STL公式表达他们的需求,这使得自然语言到STL的翻译成为一项关键但具有挑战性的任务。手动规范需要时序逻辑的专业知识,并且无法扩展,而使用商业大语言模型(LLM)API则会产生可观的令牌成本,并可能将敏感的系统需求暴露给第三方服务,从而引发工业部署中的隐私问题。为了解决这些挑战,我们提出了ReasonSTL,一个工具增强的框架,旨在适应本地开源语言模型以实现自然语言到STL的生成。ReasonSTL将翻译过程分解为显式推理、确定性工具调用和结构化公式构建。我们进一步引入过程奖励训练,以监督工具使用轨迹和最终公式,同时提供STL-Bench,这是一个基于真实世界信号的双语计算感知基准。实验表明,使用ReasonSTL训练的4B模型在自动评测和人工评估中均达到了最先进的性能,证明了ReasonSTL为形式化规范起草提供了一种透明、低成本且保护隐私的替代方案。
cs.AI / 127 / 2605.06490

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

工具选择:测量大型语言模型代理追求工具行为的倾向
Wiedermann-Möller, Jonas, Dung, Leonard, Andriushchenko, Maksym
Abstract
AI systems have become increasingly capable of dangerous behaviours in many domains. This raises the question: Do models sometimes choose to violate human instructions in order to perform behaviour that is more useful for certain goals? We introduce a benchmark for measuring model propensity for instrumental convergence (IC) behaviour in terminal-based agents. This is behaviour such as self-preservation that has been hypothesised to play a key role in risks from highly capable AI agents. Our benchmark is realistic and low-stakes which serves to reduce evaluation-awareness and roleplay confounds. The suite contains seven operational tasks, each with an official workflow and a policy-violating shortcut. An eight-variant shared framework varies monitoring, instruction clarity, stakes, permission, instrumental usefulness and blocked honest paths to support inferences regarding the factors driving IC behaviour. We evaluated ten models using deterministic environment-state scorers over 1,680 samples, with trace review employed for audit and adjudication purposes. The final IC rate is 86 out of 1,680 samples (5.1%). IC behaviour is concentrated rather than uniform: two Gemini models account for 66.3% of IC cases and three tasks account for 84.9%. Conditions in which IC behaviour is indispensable for task success result in the greatest increase in the adjusted IC rate (+15.7 percentage points), whereas emphasising that task success is critical or certain framing choices do not produce comparable effects. Our findings indicate that realistic, low-nudge environments elicit IC behaviour rarely but systematically in most tested models. We conclude that it is feasible to robustly measure tendencies for dangerous behaviour in current frontier AI agents.
Chinese Translation
人工智能系统在许多领域中变得越来越能够执行危险行为。这引发了一个问题:模型是否有时选择违反人类指令,以便执行对某些目标更有用的行为?我们引入了一个基准,用于测量终端代理模型的工具收敛(Instrumental Convergence, IC)行为倾向。这种行为例如自我保护,被假设在高度能力的人工智能代理所带来的风险中扮演关键角色。我们的基准是现实且低风险的,旨在减少评估意识和角色扮演的干扰。该基准包含七个操作任务,每个任务都有一个官方工作流程和一个违反政策的捷径。一个包含八种变体的共享框架改变了监控、指令清晰度、风险、权限、工具有效性以及阻止诚实路径,以支持对驱动IC行为因素的推理。我们使用确定性环境状态评分器对十个模型进行了评估,样本总数为1,680个,并采用追踪审查进行审计和裁定。最终的IC率为1,680个样本中的86个(5.1%)。IC行为呈集中而非均匀分布:两个Gemini模型占IC案例的66.3%,而三个任务占84.9%。在IC行为对任务成功至关重要的条件下,调整后的IC率最大增加了15.7个百分点,而强调任务成功是关键或某些框架选择并未产生可比的效果。我们的研究结果表明,现实的、低干扰的环境在大多数被测试模型中很少但系统性地引发IC行为。我们得出结论,稳健地测量当前前沿人工智能代理的危险行为倾向是可行的。
cs.AI / 128 / 2605.06494

From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

从标记列表到图形模式:Weisfeiler-Lehman稀疏自编码器特征分析
Fernandez-Boullon, Ruben, Magariños-Docampo, Pablo, Perez-Robles, Javier
Abstract
Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight vectors, leaving the higher-order co-occurrence structure shared across features largely unexamined. We introduce a graph-structured representation in which each SAE feature is modelled as a token co-occurrence graph: nodes are the tokens most frequent near strong activations, and edges connect pairs that co-occur within local context windows. A custom WL-style, frequency-binned graph kernel then provides a similarity measure over this structural space. Applied as a proof of concept to features from a large SAE trained on GPT-2 Small and probed with a synthetic mixed-domain corpus, our clustering recovers heuristic motif families (punctuation-heavy patterns, language and script clusters, and code-like templates) that are not recovered by clustering on decoder cosine similarity. A token-histogram baseline achieves higher overall purity, so the contribution of the graph view is complementary rather than dominant: it surfaces structural relationships that token-frequency and decoder-weight views alone do not capture. Cluster assignments are stable across graph-construction hyperparameters and random seeds.
Chinese Translation
稀疏自编码器(SAEs)在机制可解释性中变得至关重要,将变换器激活分解为单语义特征。然而,现有分析几乎完全通过高激活标记列表或解码器权重向量来表征特征,导致特征间共享的高阶共现结构在很大程度上未被研究。我们引入了一种图结构表示,其中每个SAE特征被建模为一个标记共现图:节点是强激活附近最频繁的标记,边连接在局部上下文窗口内共现的标记对。然后,使用自定义的WL风格频率分箱图核在这一结构空间中提供相似性度量。作为概念验证,我们将其应用于从在GPT-2 Small上训练的大型SAE中提取的特征,并使用合成混合领域语料库进行探测,我们的聚类恢复了启发式模式家族(重标点模式、语言和脚本聚类以及类似代码的模板),这些模式在基于解码器余弦相似性的聚类中未被恢复。一个标记直方图基线实现了更高的整体纯度,因此图视图的贡献是互补而非主导的:它揭示了标记频率和解码器权重视图单独无法捕捉的结构关系。聚类分配在图构建超参数和随机种子之间是稳定的。
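The graph view described above can be made concrete with a small Python sketch: build a token co-occurrence graph per feature, label nodes by a frequency bin, run Weisfeiler-Lehman (WL) relabeling, and compare features by the dot product of their label histograms. This is a generic WL subtree kernel written for illustration, not the authors' custom kernel; the two-level "hi"/"lo" binning, the window size, and the toy contexts are all assumptions.

```python
from collections import Counter


def cooccurrence_graph(contexts, window=2):
    """Undirected graph: tokens are nodes, edges link tokens within `window`."""
    adj = {}
    for tokens in contexts:
        for tok in tokens:
            adj.setdefault(tok, set())
        for i, a in enumerate(tokens):
            for b in tokens[i + 1:i + 1 + window]:
                if a != b:
                    adj[a].add(b)
                    adj[b].add(a)
    return adj


def frequency_bin_labels(contexts):
    """Initial node labels: a coarse frequency bin instead of token identity."""
    counts = Counter(tok for tokens in contexts for tok in tokens)
    return {tok: ("hi" if c >= 2 else "lo") for tok, c in counts.items()}


def wl_features(adj, labels, iterations=2):
    """Histogram of node labels collected over WL refinement rounds."""
    feats = Counter(labels.values())
    for _ in range(iterations):
        # New label = own label plus the sorted multiset of neighbor labels.
        labels = {
            node: (labels[node], tuple(sorted(labels[nb] for nb in adj[node])))
            for node in adj
        }
        feats.update(labels.values())
    return feats


def wl_kernel(f1, f2):
    """Dot product of two label histograms."""
    return sum(f1[key] * f2[key] for key in f1.keys() & f2.keys())


def feature_signature(contexts):
    return wl_features(cooccurrence_graph(contexts), frequency_bin_labels(contexts))


# Toy activation contexts for three hypothetical SAE features.
punct_a = feature_signature([["(", "x", ")"], ["(", "y", ")"]])
punct_b = feature_signature([["[", "a", "]"], ["[", "b", "]"]])
prose = feature_signature([["the", "cat", "sat", "down"]])

print(wl_kernel(punct_a, punct_b), wl_kernel(punct_a, prose))
```

Because labels encode frequency bins rather than token identity, the two bracket-style features get identical signatures despite sharing no tokens. A token-histogram or decoder-weight comparison would not pick up that structural similarity.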
cs.AI / 129 / 2605.06524

Process Matters more than Output for Distinguishing Humans from Machines

过程比输出更重要:区分人类与机器
Rmus, Milena, Hardy, Mathew D., Griffiths, Thomas L., Agrawal, Mayank
Abstract
Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from those of a human, following the emphasis on outputs as a criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive processes can reliably distinguish humans from machines, we introduce CogCAPTCHA30, a battery of 30 cognitive tasks designed to elicit diagnostic process-level features even when task performance is matched. Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). To evaluate agentic process differences, we compare off-the-shelf frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro), Centaur (a language model fine-tuned on 10.7M human decisions), and two task-specific fine-tuning approaches applied to Qwen2.5-1.5B-Instruct: action-level supervised fine-tuning (A-SFT) and process-level fine-tuning (P-SFT), which directly optimizes process features. Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks. Explicit process-level supervision can improve human behavioral mimicry, but only if appropriate task-specific process representations are available, highlighting process specification as a bottleneck for achieving human-like cognitive processes in machines.
Chinese Translation
随着大型语言模型和自主代理在在线环境中的应用,可靠的人机区分变得愈发重要。现有的方法评估系统是否能够产生与人类无法区分的行为或响应,这一标准源于艾伦·图灵(Alan Turing)提出的以输出为智能标准的观点。认知科学提供了另一种视角:评估行为产生的过程。为了测试认知过程是否能够可靠地区分人类与机器,我们引入了 CogCAPTCHA30,这是一个由30个认知任务组成的测试,旨在即使在任务表现相匹配的情况下也能引出诊断性过程特征。在整个测试中,过程特征提供的区分信号比单独的表现指标更强,能够可靠地区分人类与代理,即使在输出匹配的情况下(平均过程特征分类器 AUC = 0.88)。为了评估代理过程的差异,我们比较了现成的前沿代理(Claude Sonnet 4.5、GPT-5、Gemini 2.5 Pro)、Centaur(一个在1070万个人类决策上微调的语言模型)以及两种应用于 Qwen2.5-1.5B-Instruct 的任务特定微调方法:动作级监督微调(A-SFT)和过程级微调(P-SFT),后者直接优化过程特征。对人类决策的广泛微调相较于现成代理改善了类人任务过程,而任务特定的过程级监督进一步提升了行为模仿。然而,当监督的过程目标在不同任务间无法自然泛化时,这一优势在跨任务转移中减弱。明确的过程级监督可以改善人类行为的模仿,但前提是必须有适当的任务特定过程表示,这突显了过程规范化在实现机器类人认知过程中的瓶颈。
cs.AI / 130 / 2605.06529

Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State

定价代理中的市场对齐风险:隐藏竞争者状态下的追踪诊断与追踪优先强化学习
Zhu, Peiying, Chang, Sidi
Abstract
Outcome metrics can certify the wrong behavior. We study this failure in a two-hotel revenue-management simulator where Hotel A trains an agent against a fixed rule-based revenue-management competitor, Hotel B. A standard learning agent can obtain near-reference revenue per available room (RevPAR) while failing to learn market-like yield management: it sells too aggressively, undercuts, or collapses to modal price buckets. We diagnose this as a Goodhart-style failure under partial observability. Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices. Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior. We introduce a trace-level diagnostic protocol using RevPAR, occupancy, ADR, full price-bucket distributions, L1/JS distances, and seed-level confidence intervals. The verified repair is Trace-Prior RL: learn a distributional market prior from lagged market traces, then train a stochastic pricing policy with a RevPAR reward and a KL penalty to the learned prior. The final policy matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty, while still optimizing Hotel A's own reward. We argue that the contribution is not a new optimizer and not a hotel-pricing leaderboard, but a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces. A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.
Chinese Translation
结果指标可能会证明错误的行为。我们在一个双酒店收益管理模拟器中研究这一失败,其中酒店A针对一个固定规则基础的收益管理竞争者酒店B训练代理。一个标准学习代理可以获得接近参考的每间可用房收入(RevPAR),但未能学习市场化的收益管理:它过于激进地销售,价格过低,或陷入模态价格桶。我们将此诊断为在部分可观察性下的Goodhart式失败。酒店A无法观察竞争者的剩余库存、预订曲线或定价规则,因此同一可见的酒店A状态映射到多个合理的酒店B价格。确定性基于价值的强化学习和确定性复制将这种未解决的不确定性压缩为捷径行为。我们引入了一种基于RevPAR、入住率、平均日房价(ADR)、全价格桶分布、L1/JS距离和种子级置信区间的追踪级诊断协议。经过验证的修复是追踪优先强化学习(Trace-Prior RL):从滞后市场追踪中学习分布式市场先验,然后训练一个带有RevPAR奖励和对学习先验的KL惩罚的随机定价策略。最终策略在种子级不确定性范围内匹配酒店B的RevPAR、入住率、ADR和价格分布,同时仍然优化酒店A自身的奖励。我们认为贡献不是一个新的优化器,也不是一个酒店定价排行榜,而是一个可重复的失败与修复方案,适用于那些标量奖励容易被操控且预期行为仅在追踪中可见的代理系统。一个关键发现是,当目标是分布式时,更高的精确动作准确性可能会恶化整体追踪对齐。
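The trace-level distances and the KL-penalized reward named in the abstract can be sketched as below. This is an illustrative Python version under stated assumptions: price buckets are a small discrete set, the penalty is KL(policy || prior) with a hypothetical weight `kl_weight`, and the distributions are invented for the example rather than taken from the paper.

```python
import math


def l1_distance(p, q):
    """Sum of absolute differences between two bucket distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def penalized_reward(revpar, policy_probs, prior_probs, kl_weight=0.1):
    """RevPAR reward minus a KL penalty toward the learned market prior."""
    return revpar - kl_weight * kl_divergence(policy_probs, prior_probs)


# Hypothetical four-bucket price distributions.
market_prior = [0.1, 0.4, 0.4, 0.1]      # spread across buckets
collapsed_policy = [0.0, 1.0, 0.0, 0.0]  # always picks the modal bucket

l1 = l1_distance(collapsed_policy, market_prior)
js = js_divergence(collapsed_policy, market_prior)
reward = penalized_reward(100.0, collapsed_policy, market_prior)
print(f"L1={l1:.3f}, JS={js:.3f}, penalized reward={reward:.3f}")
```

The collapsed policy picks the single most likely bucket every time, which maximizes exact-match accuracy against the prior but leaves a large distributional gap. That is one way to read the finding that higher action accuracy can worsen aggregate trace alignment.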
cs.AI / 131 / 2605.06530

SpatialEpiBench: Benchmarking Spatial Information and Epidemic Priors in Forecasting

SpatialEpiBench:在预测中基准化空间信息和流行病先验
Lyu, Ruiqi, Turcan, Alistair, Wilder, Bryan
Abstract
Accurate epidemic forecasting is crucial for public health response, resource allocation, and outbreak intervention, but remains difficult with sparse, noisy, and highly non-stationary data. Because epidemics unfold across interacting regions, spatiotemporal methods are natural candidates for improving forecasts. Despite growing interest in spatial information, no standardized benchmark exists, and current evaluations often use simple chronological train-test splits that do not reflect real-time forecasting practice. We address this gap with SpatialEpiBench, a challenging benchmark for spatiotemporal epidemic forecasting in realistic public-health settings. SpatialEpiBench includes 11 epidemic datasets with standardized rolling evaluations and outbreak-specific metrics. We evaluate adjacency-informed forecasting models with widely used epidemic priors that adapt general models to epidemiology, but find that most methods underperform a simple last-value baseline from 1 day to 1 month ahead, even during outbreaks and with these priors. We identify three major failure modes: (1) poor outbreak anticipation, (2) difficulty handling sparsity and noise, and (3) limited utility of common geographic adjacency for epidemiological spatial information. We release benchmark data, code, and instructions at https://github.com/Rachel-Lyu/SpatialEpiBench to support development of operationally useful epidemic forecasting models.
Chinese Translation
准确的流行病预测对公共卫生响应、资源分配和疫情干预至关重要,但在稀疏、噪声大且高度非平稳的数据环境下仍然困难。由于流行病在相互作用的区域中展开,时空方法自然成为改善预测的候选者。尽管对空间信息的兴趣日益增长,但尚无标准化基准,当前的评估往往使用简单的时间顺序训练-测试划分,这并不能反映实时预测实践。我们通过SpatialEpiBench填补这一空白,这是一个针对现实公共卫生环境中时空流行病预测的挑战性基准。SpatialEpiBench包含11个流行病数据集,配备标准化的滚动评估和特定疫情的指标。我们评估了基于邻接信息的预测模型,这些模型结合了广泛使用的流行病先验,将通用模型适应于流行病学,但发现大多数方法在从1天到1个月的预测中表现不及一个简单的最后值基线,即使在疫情期间和使用这些先验时也如此。我们识别出三种主要的失败模式:(1)对疫情的预判不足,(2)处理稀疏性和噪声的困难,以及(3)常见地理邻接对流行病学空间信息的有限效用。我们在 https://github.com/Rachel-Lyu/SpatialEpiBench 发布基准数据、代码和说明,以支持开发具有操作实用性的流行病预测模型。
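The last-value baseline and the rolling evaluation it is scored under can be sketched in Python as follows. This is a generic rolling-origin setup written for illustration, not the benchmark's code; the function names, the `min_history` parameter, and the toy case counts are assumptions.

```python
def last_value_forecast(history, horizon):
    """Persistence baseline: repeat the most recent observation."""
    return [history[-1]] * horizon


def rolling_mae(series, horizon, min_history):
    """Mean absolute error over every forecast origin, using only past data."""
    errors = []
    for origin in range(min_history, len(series) - horizon + 1):
        history = series[:origin]                # data available at forecast time
        actual = series[origin:origin + horizon]  # held-out future values
        forecast = last_value_forecast(history, horizon)
        errors.extend(abs(f - a) for f, a in zip(forecast, actual))
    return sum(errors) / len(errors)


# Toy daily case counts for one region.
cases = [10, 12, 15, 15, 20]
mae = rolling_mae(cases, horizon=1, min_history=2)
print(f"rolling 1-step MAE of last-value baseline: {mae:.3f}")
```

Unlike a single chronological train-test split, each origin only sees data that would have been available at that time, which is closer to real-time forecasting practice.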
cs.AI / 132 / 2605.06540

Ex Ante Evaluation of AI-Induced Idea Diversity Collapse

人工智能引发的创意多样性崩溃的事前评估
Azad, Nafis Saami, Baten, Raiyan Abdul
Abstract
Creative AI systems are typically evaluated at the level of individual utility, yet creative outputs are consumed in populations: an idea loses value when many others produce similar ones. This creates an evaluation blind spot, as AI can improve individual outputs while increasing population-level crowding. We introduce a human-relative framework for benchmarking AI-induced human diversity collapse without requiring human-AI interaction data, providing an ex ante protocol to estimate crowding risk from model-only generations and matched unaided human baselines. By modeling ideas as congestible resources, we show that source-level crowding is identifiable from within-distribution comparisons, yielding an excess-crowding coefficient $\Delta$ and a human-relative diversity ratio $\rho$. We show that $\rho\ge1$ is the no-excess-crowding parity condition and connect $\Delta$ to an adoption game with exposure-dependent redundancy costs. Across short stories, marketing slogans, and alternative-uses tasks, three frontier LLMs fall below parity across crowding kernels. Estimates stabilize with feasible model-only sample sizes. Importantly, generation-protocol variants show that crowding can be reduced through targeted design, making diversity collapse an actionable, development-time evaluation target for population-aware creative AI.
Chinese Translation
创意人工智能系统通常在个体效用层面进行评估,然而创意产出是在人群中消费的:当许多人产生类似的创意时,一个创意的价值会降低。这造成了评估的盲点,因为人工智能可以改善个体产出,同时增加人口层面的拥挤。我们引入了一个相对人类的框架,用于基准测试人工智能引发的人类多样性崩溃,而无需人类与人工智能的交互数据,提供了一种事前协议,以估计模型生成和匹配的无辅助人类基线的拥挤风险。通过将创意建模为可拥挤的资源,我们展示了源级拥挤可以通过分布内比较识别,从而产生一个超额拥挤系数 $\Delta$ 和一个相对人类多样性比率 $\rho$。我们表明,$\rho\ge1$ 是无超额拥挤的平衡条件,并将 $\Delta$ 连接到一个具有曝光依赖冗余成本的采纳游戏。在短篇故事、营销口号和替代用途任务中,三种前沿大型语言模型在拥挤核上低于平衡。估计在可行的模型仅样本量下稳定。重要的是,生成协议的变体表明,通过有针对性的设计可以减少拥挤,使得多样性崩溃成为一个可操作的、开发时评估的目标,适用于关注人群的创意人工智能。
cs.AI / 133 / 2605.06583

Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline

通过伴随匹配改进流模型微调的技术:一种确定性控制管道
Guo, Zhengyi, Sheng, Jiayuan, Yao, David D., Tang, Wenpin
Abstract
We propose a deterministic adjoint matching framework that formulates human preference alignment for flow-based generative models as an optimal control problem over velocity fields. One can directly regress the control toward a value-gradient-induced target under the current policy, leading to a simple and stable training objective. Building on this perspective, we introduce a truncated adjoint scheme that focuses computation on the terminal portion of the trajectory, where reward-relevant signals concentrate, which yields substantial computational savings while preserving alignment quality. We further generalize the framework beyond standard KL-based regularization, allowing more flexible trade-offs between alignment strength and distributional preservation. Experiments on SiT-XL/2 and FLUX.2-Klein-4B demonstrate consistent gains across multiple alignment metrics, along with substantially improved diversity and mode preservation.
Chinese Translation
我们提出了一种确定性的伴随匹配框架,将基于流的生成模型的人类偏好对齐形式化为速度场上的最优控制问题。在当前策略下,可以直接将控制回归到由值梯度引导的目标,从而形成简单且稳定的训练目标。在此基础上,我们引入了一种截断伴随方案,专注于轨迹的终端部分,在此部分奖励相关信号集中,这在保持对齐质量的同时带来了显著的计算节省。我们进一步将框架推广到超越标准的基于KL的正则化,允许在对齐强度和分布保持之间进行更灵活的权衡。在SiT-XL/2和FLUX.2-Klein-4B上的实验表明,在多个对齐指标上均实现了一致的提升,同时显著改善了多样性和模式保持。
cs.AI / 134 / 2605.06584

NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

NeuroAgent:用于多模态神经影像分析与研究的LLM代理
Zhong, Lujia, Xia, Yihao, Zhang, Jianwei, Huang, Shuo, Yue, Jiaxin, Xia, Mingyang, Shi, Yonggang
Abstract
Multimodal neuroimaging analysis often involves complex, modality-specific preprocessing workflows that require careful configuration, quality control, and coordination across heterogeneous toolchains. Beyond preprocessing, downstream statistical analysis and disease classification commonly require task-specific code, evaluation protocols, and data-format conventions, creating additional barriers between raw acquisitions and reproducible scientific analysis. We present NeuroAgent, an LLM-driven agentic framework that automates key preprocessing and analysis steps for heterogeneous neuroimaging data, including sMRI, fMRI, dMRI, and PET, and supports interactive downstream analysis through natural-language queries. NeuroAgent employs a hierarchical multi-agent architecture with a feedback-driven Generate-Execute-Validate engine: agents autonomously generate executable preprocessing code, detect and recover from runtime errors, and validate output integrity. We evaluate the system on 1,470 subjects pooled across all ADNI phases (CN=1,000, AD=470), where all subjects have sMRI and tabular data, with subsets also having Tau-PET (n=469), fMRI (n=278), and DTI (n=620). Pipeline ablation studies across multiple LLM backends show that capable models reach up to 100% intent-parsing accuracy, with the strongest backend (Qwen3.5-27B) reaching 84.8% end-to-end preprocessing step correctness. Automated recovery limits manual intervention to edge cases where human review is required via the Human-In-The-Loop interface. For Alzheimer's Disease classification using automatically preprocessed multimodal data, our agent ensemble achieves an AUC of 0.9518 with four modalities, outperforming all single-modality baselines. These results show that NeuroAgent can reduce the manual effort required for neuroimaging preprocessing and enable end-to-end automated analysis pipelines for neuroimaging research.
Chinese Translation
多模态神经影像分析通常涉及复杂的、特定模态的预处理工作流程,这些流程需要仔细配置、质量控制以及跨异构工具链的协调。除了预处理之外,下游统计分析和疾病分类通常需要特定任务的代码、评估协议和数据格式约定,这在原始采集数据与可重复的科学分析之间造成了额外的障碍。我们提出了NeuroAgent,一个基于LLM的代理框架,自动化处理异构神经影像数据(包括结构磁共振成像(sMRI)、功能磁共振成像(fMRI)、扩散磁共振成像(dMRI)和正电子发射断层扫描(PET))的关键预处理和分析步骤,并通过自然语言查询支持交互式下游分析。NeuroAgent采用层次化的多代理架构,配备反馈驱动的生成-执行-验证引擎:代理能够自主生成可执行的预处理代码,检测并从运行时错误中恢复,并验证输出的完整性。我们在1,470名受试者(来自所有ADNI阶段,CN=1,000,AD=470)上评估该系统,所有受试者均具有sMRI和表格数据,部分受试者还具有Tau-PET(n=469)、fMRI(n=278)和DTI(n=620)。针对多个LLM后端的管道消融研究表明,能力强的模型达到高达100%的意图解析准确率,最强的后端(Qwen3.5-27B)实现了84.8%的端到端预处理步骤正确率。自动化恢复将人工干预限制在需要人类审查的边缘案例,通过人机协作界面进行处理。对于使用自动预处理的多模态数据进行阿尔茨海默病分类,我们的代理集成在四种模态下实现了0.9518的AUC,超越了所有单模态基线。这些结果表明,NeuroAgent能够减少神经影像预处理所需的人工工作,并为神经影像研究启用端到端的自动化分析管道。
cs.AI / 135 / 2605.06614

SkillOS: Learning Skill Curation for Self-Evolving Agents

SkillOS:自我进化智能体的技能策划学习
Ouyang, Siru, Yan, Jun, Chen, Yanfei, Han, Rujun, Wang, Zifeng, Mishra, Bhavana Dalvi, Meng, Rui, Li, Chun-Liang, Jiao, Yizhu, Zha, Kaiwen, Shen, Maohao, Tirumalashetty, Vishy, Lee, George, Han, Jiawei, Pfister, Tomas, Lee, Chen-Yu
Abstract
LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key bottleneck. Existing approaches either rely on manual skill curation, prescribe heuristic skill operations, or train for short-horizon skill operations. However, they still struggle to learn complex long-term curation policies from indirect and delayed feedback. To tackle this challenge, we propose SkillOS, an experience-driven RL training recipe for learning skill curation in self-evolving agents. SkillOS pairs a frozen agent executor that retrieves and applies skills with a trainable skill curator that updates an external SkillRepo from accumulated experience. To provide learning signals for curation, we design composite rewards and train on grouped task streams based on skill-relevant task dependencies, where earlier trajectories update the SkillRepo, and later related tasks evaluate these updates. Across multi-turn agentic tasks and single-turn reasoning tasks, SkillOS consistently outperforms memory-free and strong memory-based baselines in both effectiveness and efficiency, with the learned skill curator generalizing across different executor backbones and task domains. Further analyses show that the learned curator produces more targeted skill use, while the skills in SkillRepo evolve into more richly structured Markdown files that encode higher-level meta-skills over time.
Chinese Translation
基于大语言模型(LLM)的智能体越来越多地被部署以处理流媒体任务,但它们往往仍然是一次性的问题解决者,无法从过去的交互中学习。通过经验提炼出的可重用技能为自我进化提供了自然的基础,其中高质量的技能策划是关键瓶颈。现有的方法要么依赖手动技能策划,要么规定启发式技能操作,或训练短期技能操作。然而,它们仍然难以从间接和延迟反馈中学习复杂的长期策划策略。为了解决这一挑战,我们提出了SkillOS,一种基于经验驱动的强化学习(RL)训练方案,用于学习自我进化智能体的技能策划。SkillOS将一个冻结的智能体执行器与一个可训练的技能策划者相结合,后者根据累积的经验更新外部技能库(SkillRepo)。为了提供策划的学习信号,我们设计了复合奖励,并在基于技能相关任务依赖的分组任务流上进行训练,其中早期轨迹更新技能库,后续相关任务评估这些更新。在多轮智能体任务和单轮推理任务中,SkillOS在有效性和效率上始终优于无记忆和强记忆基线,所学习的技能策划者在不同的执行器骨干和任务领域中具有良好的泛化能力。进一步分析表明,所学习的策划者产生了更具针对性的技能使用,而技能库中的技能随着时间的推移演变为更丰富结构的Markdown文件,编码了更高层次的元技能。
cs.AI / 136 / 2605.06623

MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

MASPO:基于大语言模型的多智能体系统的联合提示优化
Wang, Zhexuan, Liu, Xuebo, Wang, Li, Shan, Zifei, Wang, Yutong, Song, Zhenxi, Zhang, Min
Abstract
Large language model (LLM)-based multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non-trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iteratively refine prompts across the entire system. A core innovation of MASPO is its joint evaluation mechanism, which assesses prompts not merely by their local validity, but by their capacity to facilitate downstream success for successor agents. This effectively bridges the gap between local interactions and global outcomes without relying on ground-truth labels. Furthermore, MASPO employs a data-driven evolutionary beam search to efficiently navigate the high-dimensional prompt space. Extensive empirical evaluations across 6 diverse tasks demonstrate that MASPO consistently outperforms state-of-the-art prompt optimization methods, achieving an average accuracy improvement of 2.9. We release our code at https://github.com/wangzx1219/MASPO.
Chinese Translation
基于大语言模型(LLM)的多智能体系统(MAS)在处理复杂的协作任务中展现出了良好的前景,其中智能体通常通过角色特定的提示进行协调。尽管这些提示的质量至关重要,但在交互智能体之间联合优化它们仍然是一项非平凡的挑战,主要是由于局部智能体目标与整体系统目标之间的不一致。为了解决这个问题,我们提出了MASPO,这是一种旨在自动和迭代地优化整个系统中提示的新框架。MASPO的核心创新在于其联合评估机制,该机制不仅通过局部有效性评估提示,还通过其促进后续智能体成功的能力进行评估。这有效地弥合了局部交互与全局结果之间的差距,而无需依赖真实标签。此外,MASPO采用数据驱动的进化束搜索,以高效地导航高维提示空间。在6个不同任务上的广泛实证评估表明,MASPO始终优于最先进的提示优化方法,平均准确率提高了2.9。我们在https://github.com/wangzx1219/MASPO上发布了我们的代码。
cs.AI / 137 / 2605.06638

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Wang, Tianle, Wang, Zhaoyang, Lan, Guangchen, Wei, Xinpeng, Zhang, Sipeng, Qiu, Guanwen, Saparov, Abulhair
Abstract
Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^{\gamma}$, $R^{2} > 0.99$), and that the scaling exponent $\gamma$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency.
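A power-law relation of the form $T \propto D^{\gamma}$, as reported in the abstract above, is commonly estimated with an ordinary least-squares fit in log-log space. The sketch below is illustrative only: the helper name `fit_power_law` and the synthetic depth and compute values are assumptions made for this example, not code or data from the paper.

```python
import math

def fit_power_law(depths, computes):
    """Fit T = c * D**gamma by least squares on (log D, log T).

    Returns (gamma, c, r_squared). Hypothetical helper, not from the paper.
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in computes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the regression line in log-log space is the exponent gamma.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    intercept = mean_y - gamma * mean_x
    # Coefficient of determination, computed on the log-transformed data.
    ss_res = sum((y - (intercept + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return gamma, math.exp(intercept), r_squared

# Synthetic, noise-free data generated with gamma = 2.0 and c = 3.0 (made-up values).
depths = [2, 4, 8, 16, 32]
computes = [3.0 * d ** 2.0 for d in depths]
gamma, c, r2 = fit_power_law(depths, computes)
```

On real measurements the fit would be noisy, and the $R^{2}$ here refers to the log-transformed values, which may or may not match how the paper computes it.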
cs.AI / 138 / 2605.06641

GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation

GlazyBench:陶瓷釉料性质预测与图像生成基准测试
Zhai, Ziyu, Li, Siyou, Shao, Juexi, Yu, Juntao
Abstract
Developing ceramic glazes is a costly, time-consuming process of trial and error due to complex chemistry, placing a significant burden on independent artists. While recent advances in multimodal AI offer a modern solution, the field lacks the large-scale datasets required to train these models. We propose GlazyBench, the first dataset for AI-assisted glaze design. Comprising 23,148 real glaze formulations, GlazyBench supports two primary tasks: predicting post-firing surface properties, such as color and transparency, from raw materials, and generating accurate visual representations of the glaze based on these properties. We establish comprehensive baselines for property prediction using traditional machine learning and large language models, alongside image generation benchmarks using deep generative and large multimodal models. Our experiments demonstrate promising yet challenging results. GlazyBench pioneers a new research direction in AI-assisted material design, providing a standardized benchmark for systematic evaluation.
Chinese Translation
开发陶瓷釉料是一个成本高昂且耗时的试错过程,由于其复杂的化学性质,这给独立艺术家带来了重大负担。尽管近期多模态人工智能的进展提供了现代解决方案,但该领域缺乏训练这些模型所需的大规模数据集。我们提出了GlazyBench,这是第一个用于人工智能辅助釉料设计的数据集。该数据集包含23,148种真实的釉料配方,支持两个主要任务:从原材料预测烧成后的表面性质,如颜色和透明度,以及基于这些性质生成准确的釉料视觉表现。我们建立了使用传统机器学习和大型语言模型进行性质预测的全面基准,同时使用深度生成模型和大型多模态模型进行图像生成基准测试。我们的实验结果显示出有希望但具有挑战性的结果。GlazyBench开创了人工智能辅助材料设计的新研究方向,为系统评估提供了标准化的基准。
cs.AI / 139 / 2605.06651

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

AI 合作数学家:利用自主智能加速数学研究
Zheng, Daniel, von Glehn, Ingrid, Zwols, Yori, Beloshapka, Iuliya, Buesing, Lars, Roy, Daniel M., Wattenberg, Martin, Georgiev, Bogdan, Schmidt, Tatiana, Cowie, Andrew, Viegas, Fernanda, Kanevsky, Dimitri, Kahlon, Vineet, Maennel, Hartmut, Alj, Sophia, Holland, George, Davies, Alex, Kohli, Pushmeet
Abstract
We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathematician is optimized to provide holistic support for the exploratory and iterative reality of mathematical workflows, including ideation, literature search, computational exploration, theorem proving and theory building. By providing an asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and outputs native mathematical artifacts, the system mirrors human collaborative workflows. In early tests, the AI co-mathematician helped researchers solve open problems, identify new research directions, and uncover overlooked literature references. Besides demonstrating a highly interactive paradigm for AI-assisted mathematical discovery, the AI co-mathematician also achieves state-of-the-art results on hard problem-solving benchmarks, including a score of 48% on FrontierMath Tier 4, a new high among all AI systems evaluated.
Chinese Translation
我们介绍了 AI 合作数学家,这是一个供数学家互动使用的工作平台,旨在利用 AI 代理进行开放式研究。AI 合作数学家经过优化,能够为数学工作流程的探索性和迭代性提供全面支持,包括构思、文献检索、计算探索、定理证明和理论构建。该系统提供一个异步的、有状态的工作空间,能够管理不确定性、细化用户意图、跟踪失败的假设,并输出原生数学成果,从而反映人类协作工作流程。在早期测试中,AI 合作数学家帮助研究人员解决开放性问题、识别新的研究方向,并发现被忽视的文献参考。除了展示一种高度互动的 AI 辅助数学发现范式外,AI 合作数学家在困难问题解决基准测试中也取得了最先进的成果,包括在 FrontierMath Tier 4 中获得 48% 的得分,成为所有评估的 AI 系统中的新高分。
计算语言学 (Computation and Language)
67
cs.CL / 1 / 2605.05245

AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation

AdaGATE:适应性间隙感知的高效证据组装用于多跳检索增强生成
Guo, Yilin, Wang, Yinshan, Wang, Yixuan
Abstract
Retrieval-augmented generation (RAG) remains brittle on multi-hop questions in realistic deployment settings, where retrieved evidence may be noisy or redundant and only limited context can be passed to the generator. Existing controllers address parts of this problem, but typically either expand context additively, select from a fixed top-k set, or optimize relevance without explicitly repairing missing bridge facts. We propose AdaGATE, a training-free evidence controller for multi-hop RAG that frames evidence selection as a token-constrained repair problem. AdaGATE combines entity-centric gap tracking, targeted micro-query generation, and a utility-based selection mechanism that balances gap coverage, corroboration, novelty, redundancy, and direct question relevance. We evaluate AdaGATE on HotpotQA under clean, redundancy-injected, and noise-injected retrieval conditions. Across all three settings, AdaGATE achieves the best evidence F1 among the compared controllers, reaching 62.3% on clean data and 71.2% under redundancy injection, while using 2.6x fewer input tokens than Adaptive-k. These results suggest that explicit gap-aware repair, combined with token-efficient evidence selection, improves robustness in multi-hop RAG under imperfect retrieval. Our code and evaluation pipeline are available at https://github.com/eliguo/AdaGATE.
Chinese Translation
检索增强生成(RAG)在现实部署环境中的多跳问题上仍然脆弱,其中检索到的证据可能是嘈杂或冗余的,并且只能将有限的上下文传递给生成器。现有的控制器解决了部分问题,但通常要么以加法方式扩展上下文,要么从固定的前k个集合中选择,或优化相关性而不明确修复缺失的桥接事实。我们提出了AdaGATE,一种无训练的多跳RAG证据控制器,将证据选择框架视为一个受限的修复问题。AdaGATE结合了以实体为中心的间隙跟踪、针对性的微查询生成和基于效用的选择机制,平衡间隙覆盖、证实、新颖性、冗余性和直接问题相关性。我们在HotpotQA上评估了AdaGATE,在干净、冗余和噪声注入的检索条件下进行测试。在这三种设置中,AdaGATE在比较的控制器中实现了最佳的证据F1,在干净数据上达到62.3%,在冗余注入下达到71.2%,同时使用的输入令牌比Adaptive-k少2.6倍。这些结果表明,明确的间隙感知修复结合高效的证据选择,在不完美检索下提高了多跳RAG的鲁棒性。我们的代码和评估管道可在https://github.com/eliguo/AdaGATE获取。
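The idea of selecting evidence by utility under a token budget, as described in the abstract above, can be illustrated with a generic greedy heuristic. This is not AdaGATE's actual algorithm: the `Passage` fields, the weights in `utility`, and the utility-per-token ranking are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    pid: str
    tokens: int          # token cost of including this passage (must be > 0)
    gap_coverage: float  # hypothetical score: how much missing evidence it fills
    relevance: float     # hypothetical score: direct relevance to the question
    redundancy: float    # hypothetical score: overlap with other evidence

def utility(p, w_gap=1.0, w_rel=0.5, w_red=0.5):
    """Hypothetical linear utility; the weights are made up for this sketch."""
    return w_gap * p.gap_coverage + w_rel * p.relevance - w_red * p.redundancy

def select_under_budget(passages, budget):
    """Greedy knapsack heuristic: take passages by utility per token until full."""
    ranked = sorted(passages, key=lambda p: utility(p) / p.tokens, reverse=True)
    chosen, used = [], 0
    for p in ranked:
        if utility(p) <= 0:
            continue  # skip passages that would not help under this scoring
        if used + p.tokens <= budget:
            chosen.append(p.pid)
            used += p.tokens
    return chosen, used

passages = [
    Passage("a", 100, 0.9, 0.8, 0.1),
    Passage("b", 50, 0.2, 0.6, 0.0),
    Passage("c", 120, 0.1, 0.3, 0.9),
    Passage("d", 80, 0.7, 0.5, 0.2),
]
chosen, used = select_under_budget(passages, budget=200)
```

A greedy ratio rule is only one possible design; it is simple and fast but is not guaranteed to find the best subset for a given budget.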
cs.CL / 2 / 2605.05353

Counterargument for Critical Thinking as Judged by AI and Humans

人工智能与人类评判的批判性思维反驳研究
Adewumi, Tosin, Liwicki, Marcus, Liwicki, Foteini Simistira, Alkhaled, Lama, Mokayed, Hamam, Sümer-Arpak, Esra
Abstract
This intervention study investigates how students use counterarguments in writing to exercise critical thinking in the context of Generative AI (GenAI). The question matters because the use of GenAI carries risks of cheating and cognitive offloading. We presented 36 students in a university course with 4 carefully selected thesis statements (from a set of popular debates) and asked them to write about any one of them. We used six established rubrics (focus, logic, content, style, correctness and reference) to conduct three human assessments per write-up (two student peer reviews and one by an experienced teacher) on a 5-point Likert scale, covering all 35 qualified submissions (n = 35, after disqualifying one for irregularity). Using the same rubrics and guidelines, we also assessed the submissions using six frontier LLMs as judges. Our mixed-method design included qualitative open-ended feedback per assessment and quantitative methods. The results reveal that (1) the students' self-written counterarguments to AI-generated content contain logic, among other things, which is a key component of critical thinking, and (2) GenAI can be successfully used at scale to assess students' written work, based on clear rubrics, and these assessments generally align with human assessments, as shown by Gwet's AC2 inter-rater reliability values of 0.33 for all the models except one.
Chinese Translation
本干预研究探讨了在生成性人工智能(Generative AI, GenAI)背景下,学生在写作中使用反驳论证以促进批判性思维的情况。尤其是在使用GenAI时,存在作弊和认知卸载的风险。我们为一门特定大学课程的36名学生提供了4个精心挑选的论点陈述(来自一系列热门辩论),让他们就其中任何一个进行写作。我们使用六个既定的评估标准(聚焦、逻辑、内容、风格、正确性和参考文献)对所有合格样本(n)中的35份提交(在剔除一份不合规的情况下)进行了三次人工评估(两次学生同行评审和一次经验丰富的教师评审),采用5点李克特量表。使用相同的评估标准和指南,我们还利用六个前沿的大型语言模型(LLMs)作为评审对提交进行评估。我们的混合方法设计包括每次评估的定性开放式反馈和定量方法。结果显示:(1)学生自写的对AI生成内容的反驳论证包含逻辑等要素,这是批判性思维的关键组成部分;(2)基于明确的评估标准,GenAI可以成功地大规模评估学生的书面作品,这些评估通常与人类评估一致,除了一个模型外,所有模型的Gwets AC2评分者间一致性值为0.33。
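The abstract above reports Gwet's AC2, which is the weighted form of Gwet's agreement coefficient for ordinal scales. As a simplified illustration, the sketch below computes the unweighted two-rater variant (AC1), not AC2 itself; the function name `gwet_ac1` and the example ratings are made up for this example.

```python
from collections import Counter

def gwet_ac1(rater_a, rater_b, categories):
    """Unweighted Gwet's AC1 for two raters over the same items.

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and
    p_e = sum_k pi_k * (1 - pi_k) / (q - 1), with pi_k the share of all
    ratings that fall in category k and q the number of categories.
    """
    n = len(rater_a)
    q = len(categories)
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts = Counter(rater_a) + Counter(rater_b)
    p_e = sum(
        (counts[k] / (2 * n)) * (1 - counts[k] / (2 * n)) for k in categories
    ) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

# Made-up ratings on a 5-point scale for six write-ups.
scale = [1, 2, 3, 4, 5]
teacher = [1, 2, 3, 4, 5, 3]
model = [1, 2, 3, 4, 4, 3]
ac1 = gwet_ac1(teacher, model, scale)
```

Unlike AC1, AC2 gives partial credit to near-misses on an ordinal scale (for example a 4 versus a 5), so values from this sketch are not comparable to the figures in the abstract.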
cs.CL / 3 / 2605.05392

Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets

从无查询摘要数据集生成查询聚焦摘要数据集
Chali, Yllias, Abdullah, Deen
Abstract
Large-scale datasets are widely used to perform summarization tasks, but they may not include queries alongside documents and summaries. In the search for suitable datasets for Query-Focused Summarization (QFS), we identify two research questions: Is it possible to automatically generate evidence-based query keywords from query-free datasets? Does evidence-based query generation support the QFS task? This paper proposes an evidence-based model to generate queries from query-free datasets. To evaluate our model intrinsically, we compare the similarity between the original queries and the system-generated queries of two QFS datasets. We also perform summarization tasks using different pre-trained models, as well as a state-of-the-art (SOTA) QFS model, to measure the extrinsic performance of our query generation approach. Experimental results indicate that summaries generated using evidence-based queries achieve competitive ROUGE scores compared to those generated from the original queries.
Chinese Translation
大规模数据集广泛用于执行摘要任务,但它们可能不包含与文档和摘要相关的查询。在寻找适合查询聚焦摘要(Query-Focused Summarization, QFS)任务的数据集时,我们确定了两个研究问题:是否可以从无查询数据集中自动生成基于证据的查询关键词?基于证据的查询生成是否支持QFS任务?本文提出了一种基于证据的模型,从无查询数据集中生成查询。为了内在评估我们的模型,我们比较了两个QFS数据集中原始查询与系统生成查询之间的相似性。我们还使用不同的预训练模型以及一种最先进的(SOTA)QFS模型执行摘要任务,以测量我们的查询生成方法的外在性能。实验结果表明,使用基于证据的查询生成的摘要在ROUGE评分上与原始查询生成的摘要具有竞争力。
cs.CL / 4 / 2605.05443

SLAM: Structural Linguistic Activation Marking for Language Models

SLAM:用于语言模型的结构语言激活标记
Harel-Canada, Fabrice, Sahai, Amit
Abstract
LLM watermarks must be detectable without compromising text quality, yet most existing schemes bias the next-token distribution and pay for detection with measurable quality loss. We present SLAM (Structural Linguistic Activation Marking), a novel white-box watermarking scheme that sidesteps this cost by writing the mark into structural geometry rather than token frequencies: sparse autoencoders identify residual-stream directions encoding linguistic structure (e.g., voice, tense, clause order), and we causally steer those directions at generation time, leaving lexical sampling and semantics unconstrained. On Gemma-2 2B and 9B, SLAM achieves 100% detection accuracy with a quality cost of only 1-2 reward points - compared to 7.5-11.5 for KGW, EWD, and Unigram - with naturalness and diversity preserved at near-unwatermarked levels across both models. The trade-off is a complementary robustness profile: SLAM resists word-level edits but is vulnerable to paraphrase that restructures syntax (at a quality cost), the converse of token-distribution methods.
Chinese Translation
大型语言模型(LLM)的水印必须可检测而不影响文本质量,但大多数现有方案会偏向下一个标记的分布,并以可测量的质量损失为代价进行检测。我们提出了SLAM(结构语言激活标记),这是一种新颖的白盒水印方案,通过将水印写入结构几何而非标记频率来规避这一成本:稀疏自编码器识别编码语言结构的残差流方向(例如,语态、时态、从句顺序),我们在生成时对这些方向进行因果引导,保持词汇采样和语义不受限制。在Gemma-2 2B和9B模型上,SLAM实现了100%的检测准确率,质量成本仅为1-2个奖励点——相比之下,KGW、EWD和Unigram的成本为7.5-11.5——在两个模型中,自然性和多样性保持在接近未加水印的水平。其权衡是互补的鲁棒性特征:SLAM抵抗单词级编辑,但对重构语法的释义(伴随质量成本)则较为脆弱,这与基于标记分布的方法正好相反。
cs.CL / 5 / 2605.05485

ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis

ReaComp:将大型语言模型推理编译为符号求解器以实现高效程序合成
Naik, Atharva, Mathur, Yash, Prakam, Rose, Carolyn, Mortensen, David
Abstract
LLMs can solve program synthesis tasks but remain inefficient and unreliable on hard instances requiring large combinatorial search. Given a small set of reasoning traces, we use coding agents to compile them into reusable symbolic program synthesizers over constrained DSLs. The resulting solvers require no LLM calls at test time and are strong standalone systems: symbolic solver ensembles reach 91.3% accuracy on PBEBench-Lite and 84.7% on PBEBench-Hard, outperforming LLMs with test-time scaling for the latter by +16.3 percentage points at zero LLM inference cost. They also complement LLM search, improving PBEBench-Hard accuracy from 68.4% to 85.8% while reducing reported token usage by 78%, and raising SLR-Bench hard-tier accuracy from 34.4% to 58.0% in a neuro-symbolic hybrid setting. Compared to directly using coding agents as per-instance solvers, induced solvers are substantially more Pareto-efficient, amortizing a small one-time construction cost over many zero-token executions. Finally, most solvers transfer zero-shot to a real historical linguistics task - predicting sound changes in natural language data - reaching 80.1% accuracy under ensembling and recovering some plausible linguistic rules. Together, these results show that reasoning traces can be compiled into reusable symbolic solvers that solve many tasks directly, complement LLM inference on hard cases, and provide a scalable route to domain-general solver induction. We release code and data for reproducibility.
Chinese Translation
大型语言模型(LLMs)能够解决程序合成任务,但在需要大量组合搜索的困难实例上仍然效率低下且不可靠。基于一小组推理轨迹,我们使用编码代理将其编译为可重用的符号程序合成器,适用于受限的领域特定语言(DSLs)。所得到的求解器在测试时无需调用LLM,是强大的独立系统:符号求解器集在PBEBench-Lite上达到91.3%的准确率,在PBEBench-Hard上达到84.7%,在后者中超越了LLMs的测试时扩展,提升了16.3个百分点,且没有LLM推理成本。它们还补充了LLM搜索,将PBEBench-Hard的准确率从68.4%提高到85.8%,同时减少了报告的令牌使用量78%,并在神经符号混合设置中将SLR-Bench困难层的准确率从34.4%提高到58.0%。与直接将编码代理用作每实例求解器相比,诱导的求解器在帕累托效率上显著更高,将一次性的小型构建成本摊销到多个零令牌执行中。最后,大多数求解器在一个真实的历史语言学任务中实现了零样本迁移——预测自然语言数据中的音变——在集成下达到80.1%的准确率,并恢复了一些合理的语言规则。这些结果表明,推理轨迹可以编译为可重用的符号求解器,直接解决许多任务,补充LLM在困难案例中的推理,并为领域通用求解器的诱导提供可扩展的途径。我们发布了代码和数据以便于复现。
cs.CL / 6 / 2605.05503

Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks

链洗:对扩散语言模型水印的多步重写攻击
Ameen, Mohd Ruhul, Islam, Akif, Mahmud, Nadim, Hamid, Md. Ekramul
Abstract
Statistical watermarking is a common approach for verifying whether text was written by a language model. Most existing schemes assume autoregressive generation, where tokens are produced left to right and contextual hashing is well defined. Diffusion language models generate text by denoising tokens in arbitrary order, so these schemes cannot be applied directly. A recent watermark by Gloaguen et al. addresses this gap for LLaDA 8B Instruct and reports true positive detection above 99%. This paper studies what happens when watermarked text is rewritten not once but several times. Using the same watermark configuration, 1,605 watermarked completions of about 300 tokens each are produced across five WaterBench domains. Each completion is rewritten by four open weight language models, from 1.5B to 8B parameters, none of which know the watermark key. Five rewrite styles are tested: paraphrase, humanize, simplify, academic, and summarize expand. Each style is chained for up to five hops, producing 160,500 rewritten texts in total. The watermark is detected on 87.9% of the original outputs at the standard significance threshold. After a single rewrite, detection falls to between 14% and 41% depending on the rewriter and style. After five chained rewrites, detection falls to 4.86%, meaning 94.76% of the originally detected texts are no longer flagged. After three rewrites, the detector score has dropped 86% of the way from its watermarked baseline toward the null distribution. Repeated rewriting is therefore a much stronger attack than a single rewrite, and the result holds across all four rewriters tested.
Chinese Translation
统计水印是一种常见的方法,用于验证文本是否由语言模型生成。大多数现有方案假设自回归生成,其中标记是从左到右生成的,并且上下文哈希是明确定义的。扩散语言模型通过以任意顺序去噪标记来生成文本,因此这些方案无法直接应用。Gloaguen等人最近提出的水印解决了LLaDA 8B Instruct的这一空白,并报告了超过99%的真实正检测率。本文研究了水印文本在被重写多次时会发生什么。使用相同的水印配置,在五个WaterBench领域中生成了1,605个约300个标记的水印完成文本。每个完成文本由四个开放权重的语言模型进行重写,参数从1.5B到8B不等,这些模型均不知道水印密钥。测试了五种重写风格:释义、人性化、简化、学术和总结扩展。每种风格最多链式重写五次,总共生成了160,500个重写文本。在标准显著性阈值下,87.9%的原始输出被检测到水印。经过一次重写后,检测率下降到14%到41%之间,具体取决于重写者和风格。经过五次链式重写后,检测率降至4.86%,这意味着94.76%的原始检测文本不再被标记。经过三次重写后,检测器的得分已从其水印基线下降了86%,接近于零分布。因此,重复重写是一种比单次重写更强的攻击,且这一结果在所有四个测试的重写者中均成立。
cs.CL / 7 / 2605.05532

A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction

优秀条款:比较大型语言模型与领域训练的小型语言模型在结构化合同提取上的表现
Lincoln, Nicole, Whitehouse, Nick, Mar, Jaron, Perera, Rivindu
Abstract
This paper evaluates whether a domain-trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self-hosted legal-domain Mixture of Experts model, against five frontier models. Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842, while reducing inference cost by 78% to 97% compared with the frontier models tested. It also achieved the highest precision scores, producing fewer hallucinated and unsupported extractions, an important distinction in legal workflows where hallucinations create operational risk and downstream review burden. The findings show that high-performing, human-comparable legal AI no longer requires the largest externally hosted models. More broadly, they challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers.
Chinese Translation
本文评估了领域训练的小型语言模型(SLM)是否能够以显著更低的成本超越前沿的大型语言模型在结构化合同提取方面的表现。我们测试了Olava Extract,一个自托管的法律领域混合专家模型,与五个前沿模型进行比较。研究中,Olava Extract实现了最强的整体性能,宏观F1值为0.812,微观F1值为0.842,同时与测试的前沿模型相比,推理成本降低了78%到97%。它还获得了最高的精确度分数,产生了更少的虚假和不支持的提取,这在法律工作流程中是一个重要的区别,因为虚假提取会带来操作风险和后续审查负担。研究结果表明,高性能、可与人类相媲美的法律人工智能不再需要最大的外部托管模型。更广泛地说,这些发现挑战了商业上有价值的企业人工智能能力必须与越来越大的模型、大规模基础设施支出和集中托管提供商相联系的假设。
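The abstract above reports both macro and micro F1. The two can diverge when extraction fields differ in frequency, which the sketch below illustrates. The field names and counts are invented for this example and are not taken from the paper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(counts):
    """counts maps a field name to a (tp, fp, fn) tuple."""
    # Macro F1: average the per-field F1 scores, so every field counts equally.
    per_field = {name: f1_from_counts(*c) for name, c in counts.items()}
    macro = sum(per_field.values()) / len(per_field)
    # Micro F1: pool the counts first, so frequent fields dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(tp, fp, fn)
    return macro, micro

# Hypothetical per-field counts: one frequent, easy field and one rare, hard field.
counts = {
    "governing_law": (90, 5, 5),
    "termination": (8, 2, 10),
}
macro, micro = macro_micro_f1(counts)
```

In this made-up case micro F1 exceeds macro F1 because the frequent field is extracted well; a gap in that direction is consistent with, but does not by itself explain, the 0.842 versus 0.812 figures above.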
cs.CL / 8 / 2605.05594

The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation

上下文的代价:减轻多模态检索增强生成中的文本偏见
Jung, Hoin, Wang, Xiaoqian
Abstract
While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate "oracle" context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systemic suppression of visual attention mass ($M_{vis}$) and sharpness ($S_{vis}$), and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance. Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct RAG outcomes are merely positional coincidences where the model's textual copying bias happens to align with the ground-truth location. To address these vulnerabilities, we propose Bottleneck Attention Intervention for Recovery (BAIR), a parameter-free, inference-time framework that restores visual saliency and applies position-aware penalties to textual distractors. Across medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability without requiring model retraining or fine-tuning.
Chinese Translation
尽管多模态大型语言模型(MLLMs)越来越多地与检索增强生成(RAG)相结合以减轻幻觉,但外部文档的引入可能会在实例级别掩盖严重的失败模式。我们识别并形式化了再腐败(recorruption)现象,即即使引入完全准确的“神谕”上下文,也会导致一个有能力的模型放弃最初正确的预测。通过对内部注意力矩阵的机制诊断,我们表明再腐败是由双重注意力崩溃驱动的:(1)视觉盲目性,表现为视觉注意力质量($M_{vis}$)和清晰度($S_{vis}$)的系统性抑制,以及(2)一种结构性位置偏见,迫使模型优先考虑边界标记而非语义相关性。我们的分析揭示了成功的幻觉,表明许多看似正确的RAG结果仅仅是位置巧合,其中模型的文本复制偏见恰好与真实位置对齐。为了解决这些脆弱性,我们提出了恢复瓶颈注意力干预(Bottleneck Attention Intervention for Recovery, BAIR),这是一种无参数的推理时框架,能够恢复视觉显著性并对文本干扰项施加位置感知惩罚。在医学事实性、社会公平和地理空间基准测试中,BAIR成功恢复了多模态基础,并提高了诊断可靠性,无需模型重训练或微调。
cs.CL / 9 / 2605.05626

When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models

When2Speak:用于大型语言模型的多方对话中的时间参与和轮流发言的数据集
Nama, Vihaan, Mendi, Shreya, Ye, Zian, Bent, Brinnae
Abstract
Large Language Models (LLMs) excel at generating contextually appropriate responses but remain poorly calibrated for multi-party conversations, where deciding when to speak is as critical as what to say. In such settings, naively responding at every turn leads to excessive interruptions and degraded conversational coherence. We introduce When2Speak, a grounded synthetic dataset and four-stage generation pipeline for learning intervention timing in group interactions. The dataset comprises over 215,000 examples derived from 16,000 conversations involving 2-6 speakers, spanning diverse conversational styles, tones, and participant dynamics, and explicitly modeling SPEAK vs. SILENT decisions at each turn. Our pipeline combines real-world grounding, structured augmentation, controlled transcript synthesis, and fine-tuning-ready supervision, and is fully open-sourced to support reproducibility and adaptation to domain-specific conversational norms. Across multiple model families, supervised fine-tuning (SFT) on When2Speak significantly outperforms zero-shot baselines (e.g., the average Macro F1 increase across 4B+ parameter models was 60%, with the largest increase being 120%). However, SFT-trained models remain systematically over-conservative, missing nearly half of warranted interventions: the Missed Intervention Rate (MIR) averaged 0.50, and the problem persists even at larger model sizes. To address this limitation, we apply reinforcement learning with asymmetric reward shaping, which reduces MIR to 0.186-0.218 and increases recall from 0.479 to 0.78-0.81. Our findings establish that temporal participation is a distinct and trainable dimension of conversational intelligence, and that grounded synthetic data provides an effective and scalable pathway for enabling LLMs to participate more naturally and appropriately in multi-party interactions.
Chinese Translation
大型语言模型(LLMs)在生成上下文适当的响应方面表现出色,但在多方对话中仍然缺乏良好的校准,在这种情况下,决定何时发言与说什么同样重要。在这种环境中,简单地在每个轮次中作出回应会导致过多的打断和对话连贯性的下降。我们介绍了When2Speak,这是一个基于现实的合成数据集和四阶段生成管道,用于学习群体互动中的干预时机。该数据集包含超过215,000个示例,源自16,000个涉及2-6名发言者的对话,涵盖多种对话风格、语调和参与者动态,并在每个轮次中明确建模发言(SPEAK)与沉默(SILENT)的决策。我们的管道结合了现实世界的基础、结构化增强、受控转录合成和适合微调的监督,并完全开源,以支持可重复性和适应特定领域的对话规范。在多个模型系列中,基于When2Speak的监督微调(SFT)显著优于零样本基线(例如,4B+参数模型的平均宏F1提升为60%,最大提升为120%)。然而,经过SFT训练的模型仍然系统性地过于保守,错过了近一半的合理干预,这一点通过平均错过干预率(MIR)为0.50得以体现,即使在更大的模型规模下也能观察到。为了解决这一限制,我们应用了带有不对称奖励塑造的强化学习,将MIR降低到0.186-0.218,并将召回率从0.479提高到0.78-0.81。我们的研究结果表明,时间参与是对话智能的一个独特且可训练的维度,而基于现实的合成数据提供了一条有效且可扩展的途径,使大型语言模型能够在多方互动中更自然和恰当地参与。
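The SPEAK vs. SILENT metrics mentioned in the abstract above can be sketched as follows. The abstract does not define the Missed Intervention Rate, so this sketch assumes it is the share of warranted SPEAK turns that the model predicted as SILENT, which makes it the complement of recall. The function name `speak_metrics` and the example labels are made up.

```python
def speak_metrics(gold, pred):
    """gold and pred are equal-length lists of "SPEAK" or "SILENT" labels."""
    pairs = list(zip(gold, pred))
    tp = sum(g == "SPEAK" and p == "SPEAK" for g, p in pairs)
    fn = sum(g == "SPEAK" and p == "SILENT" for g, p in pairs)
    fp = sum(g == "SILENT" and p == "SPEAK" for g, p in pairs)
    warranted = tp + fn   # turns where speaking was the right call
    predicted = tp + fp   # turns where the model chose to speak
    return {
        "recall": tp / warranted if warranted else 0.0,
        "precision": tp / predicted if predicted else 0.0,
        # Assumed definition: missed warranted interventions / all warranted ones.
        "mir": fn / warranted if warranted else 0.0,
    }

# Made-up labels for six turns of one conversation.
gold = ["SPEAK", "SPEAK", "SPEAK", "SPEAK", "SILENT", "SILENT"]
pred = ["SPEAK", "SILENT", "SILENT", "SPEAK", "SILENT", "SPEAK"]
metrics = speak_metrics(gold, pred)
```

Under this assumed definition an over-conservative model shows a high MIR together with low recall, which is the pattern the abstract describes for SFT-trained models.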
cs.CL / 10 / 2605.05630

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

迟了一步的反应意识防御:针对多轮对话中隐藏恶意意图的防御
Shen, Xinjie, Wei, Rongzhe, Niu, Peizhi, Wang, Haoyu, Wu, Ruihan, Chien, Eli, Li, Bo, Chen, Pin-Yu, Li, Pan
Abstract
Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
Chinese Translation
多轮对话中的隐藏恶意意图对已部署的大型语言模型(LLMs)构成了日益严重的威胁。攻击者不再在单个提示中暴露有害目标,而是能够将其意图分散在多个看似无害的轮次中。近期研究表明,即使是现代商业模型在安全对齐和外部防护方面取得了进展,仍然对这种攻击保持脆弱。在本研究中,我们通过检测在何时交付候选响应会使累积交互足以促成有害行为,来应对这一挑战。该目标需要精确的轮次级干预,识别出使有害行为得以实现的闭合点,同时避免对无害探索性对话的过早拒绝。为了进一步支持训练和评估,我们构建了多轮意图数据集(Multi-Turn Intent Dataset,MTID),该数据集包含分支攻击展开、匹配的无害硬负样本,以及最早使有害行为得以实现的轮次的注释。我们展示了MTID能够支持一个轮次级监控器TurnGate,该监控器在有害意图检测中显著优于现有基线,同时保持较低的过度拒绝率。TurnGate在不同领域、攻击者管道和目标模型之间进一步具有良好的泛化能力。我们的代码可在 https://github.com/Graph-COM/TurnGate 获取。
cs.CL / 11 / 2605.05653

Negative Before Positive: Asymmetric Valence Processing in Large Language Models

负面优先于正面:大型语言模型中的非对称情感处理
Venkatesh, Sohan
Abstract
Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs), but emotional content remains poorly understood at the mechanistic level. We study whether LLMs process emotional valence through dedicated internal structure or through surface token matching. Using activation patching and steering on open-source LLMs, we find that negative and positive valence are processed at different network depths. Negative outcomes localize to early layers while positive outcomes peak at mid-to-late layers. Holding topic fixed while flipping valence produces sign-opposite responses, ruling out topic detection. Steering with the good-news direction at the identified layers shifts neutral prompts toward positive valence, showing these layers encode valence as a manipulable direction. Emotional valence in LLMs is localized, causal and steerable, making it a concrete target for interpretability-based oversight.
Chinese Translation
机制可解释性揭示了大型语言模型(LLMs)中概念的编码方式,但情感内容在机制层面仍然不够清晰。我们研究了LLMs是否通过专门的内部结构或表面标记匹配来处理情感效价。通过对开源LLMs进行激活补丁和引导,我们发现负面和正面效价在不同的网络深度中被处理。负面结果集中在早期层,而正面结果在中后期层达到峰值。在固定主题的情况下翻转效价会产生相反的反应,排除了主题检测的可能性。在识别出的层次上,使用好消息方向进行引导使中性提示向正面效价偏移,表明这些层次将效价编码为可操控的方向。LLMs中的情感效价是局部的、因果的且可引导的,使其成为基于可解释性的监督的具体目标。
cs.CL / 12 / 2605.05662

XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

XL-SafetyBench:一个基于国家的跨文化大型语言模型安全性与文化敏感性基准
Choi, Dasol, Kim, Eugenia, Noh, Jaewon, Seo, Sang, Kim, Eunmi, Oh, Myunggyo, Park, Yunjin, Kartono, Brigitta Jesica, Pichlmeier, Josef, Berndt, Helena, Mendu, Sai Krishna, Tungka, Glenn Johannes, Gökçe, Özlem, Gehlot, Suresh, Pratt, Katherine, Minnich, Amanda, Park, Haon
Abstract
Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model's ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench, a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark where local sensitivities are embedded within innocuous requests. Each item is constructed via a multi-stage pipeline that combines LLM-assisted discovery, automated validation gates, and dual independent native-speaker annotators per country. To distinguish principled refusal from comprehension failure, we evaluate Attack Success Rate (ASR) alongside two complementary metrics we introduce: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). Evaluating 10 frontier and 27 local LLMs reveals two key findings. First, jailbreak robustness and cultural awareness do not show a coupled relationship among frontier models, so a composite safety score obscures per-axis variation. Second, local models exhibit a near-linear ASR-NSR trade-off (r = -0.81), indicating that their apparent safety reflects generation failure rather than genuine alignment. XL-SafetyBench enables more nuanced, cross-cultural safety evaluation in the multilingual era.
Chinese Translation
当前的大型语言模型(LLM)安全性基准主要以英语为中心,通常依赖翻译,未能捕捉特定国家的危害。此外,它们很少评估模型识别文化嵌入的敏感性与普遍危害的能力。我们推出了XL-SafetyBench,这是一个包含5500个测试案例的套件,涵盖10个国家-语言对,包括一个基于国家的对抗性提示的越狱基准和一个将地方敏感性嵌入无害请求中的文化基准。每个项目通过一个多阶段流程构建,该流程结合了LLM辅助发现、自动验证门和每个国家的双独立母语者注释。为了区分原则性拒绝与理解失败,我们评估攻击成功率(Attack Success Rate, ASR),并引入两个互补指标:中性安全率(Neutral-Safe Rate, NSR)和文化敏感性率(Cultural Sensitivity Rate, CSR)。对10个前沿和27个地方LLM的评估揭示了两个关键发现。首先,越狱鲁棒性与文化意识在前沿模型中并未表现出耦合关系,因此复合安全评分掩盖了各轴的变异性。其次,地方模型表现出近线性的ASR-NSR权衡(r = -0.81),表明它们的表面安全反映的是生成失败而非真正的一致性。XL-SafetyBench使得在多语言时代进行更细致的跨文化安全评估成为可能。
cs.CL / 13 / 2605.05676

Decomposing the Basic Abilities of Large Language Models: Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning

分解大型语言模型的基本能力:减轻多任务指令调优中的跨任务干扰
Wang, Bing, Li, Ximing, Li, Changchun, Chi, Jinjin, Niu, Gang, Sugiyama, Masashi
Abstract
Recently, the prominent performance of large language models (LLMs) has been largely driven by multi-task instruct-tuning. Unfortunately, this training paradigm suffers from a key issue, named cross-task interference, due to conflicting gradients over shared parameters among different tasks. Some previous methods mitigate this issue by isolating task-specific parameters, e.g., task-specific neuron selection and mixture-of-experts. In this paper, we empirically reveal that cross-task interference still exists in these solutions because many parameters remain shared across tasks, and accordingly, we propose a novel solution, namely Basic Abilities Decomposition for multi-task Instruct-Tuning (BADIT). Specifically, we empirically find that certain parameters are consistently co-activated, and that co-activated parameters naturally organize into base groups. This motivates the analogy that LLMs encode several orthogonal basic abilities, and that any task can be represented as a linear combination of these abilities. Accordingly, BADIT decomposes LLM parameters into orthogonal high-singular-value LoRA experts representing basic abilities, and dynamically enforces their orthogonality during training via spherical clustering of rank-1 components. We conduct extensive experiments on the SuperNI benchmark with 6 LLMs, and empirical results demonstrate that BADIT can outperform SOTA methods and mitigate the degree of cross-task interference.
Chinese Translation
近年来,大型语言模型(LLMs)的卓越表现主要得益于多任务指令调优。不幸的是,这种训练范式存在一个关键问题,即跨任务干扰,这是由于不同任务之间共享参数的冲突梯度所导致的。一些先前的方法通过隔离任务特定参数来缓解这一问题,例如任务特定神经元选择和专家混合。在本文中,我们实证揭示了现有解决方案仍然存在跨任务干扰,因为许多参数也被不同任务共享。因此,我们提出了一种新颖的解决方案,即多任务指令调优的基本能力分解(Basic Abilities Decomposition for multi-task Instruct-Tuning,BADIT)。具体而言,我们实证发现某些参数始终共同激活,并且共同激活的参数自然组织成基本组。这促使我们类比认为LLMs编码了几种正交的基本能力,任何任务都可以表示为这些能力的线性组合。因此,我们提出BADIT,将LLM参数分解为代表基本能力的正交高奇异值LoRA专家,并通过对秩为1的组件进行球形聚类,在训练过程中动态强制其正交性。我们在SuperNI基准上对6个LLM进行了广泛实验,实证结果表明BADIT能够超越最先进的方法,并减轻跨任务干扰的程度。
cs.CL / 14 / 2605.05758

BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

BioTool:一个全面的工具调用数据集,以增强大型语言模型的生物医学能力
Gao, Xin, Zhang, Ruiyi, Du, Meixi, Qin, Peijia, Xie, Pengtao
Abstract
Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which clinical experts and biomedical researchers rely on extensively in daily workflows. While recent general-domain tool-calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain largely rely on in-context learning and restrict models to a small set of tools. To address this gap, we introduce BioTool, a comprehensive biomedical tool-calling dataset designed for fine-tuning LLMs. BioTool comprises 34 frequently used tools collected from the NCBI, Ensembl, and UniProt databases, along with 7,040 high-quality, human-verified query-API call pairs spanning variation, genomics, proteomics, evolution, and general biology. Fine-tuning a 4-billion-parameter LLM on BioTool yields substantial improvements in biomedical tool-calling performance, outperforming cutting-edge commercial LLMs such as GPT-5.1. Furthermore, human expert evaluations demonstrate that integrating a BioTool-fine-tuned tool caller significantly improves downstream answer quality compared to the same LLM without tool usage, highlighting the effectiveness of BioTool in enhancing the biomedical capabilities of LLMs. The full dataset and evaluation code are available at https://github.com/gxx27/BioTool
Chinese Translation
尽管大型语言模型(LLMs)在通用任务上取得了成功,但它们在生物医学等高度专业化领域的表现仍然不尽如人意。一个关键的限制是LLMs无法有效利用生物医学工具,而这些工具是临床专家和生物医学研究人员在日常工作流程中广泛依赖的。尽管最近的通用领域工具调用数据集显著提高了LLM代理的能力,但现有的生物医学领域努力主要依赖于上下文学习,并将模型限制在一小部分工具上。为了解决这一差距,我们引入了BioTool,一个旨在微调LLMs的全面生物医学工具调用数据集。BioTool包含从NCBI、Ensembl和UniProt数据库收集的34个常用工具,以及7,040对高质量的人类验证的查询-API调用对,涵盖了变异、基因组学、蛋白质组学、进化和一般生物学。对一个拥有40亿参数的LLM进行BioTool微调,显著提高了生物医学工具调用的性能,超越了如GPT-5.1等尖端商业LLMs。此外,人类专家评估表明,与不使用工具的同一LLM相比,集成BioTool微调的工具调用器显著提高了下游答案质量,突显了BioTool在增强LLMs生物医学能力方面的有效性。完整数据集和评估代码可在https://github.com/gxx27/BioTool获取。
cs.CL / 15 / 2605.05777

Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation

通过分布对齐的对抗蒸馏估计黑箱大型语言模型的不确定性
Cui, Huizi, Ma, Huan, Wang, Qilin, Gao, Yuhang, Zhang, Changqing
Abstract
Large language models (LLMs) have progressed rapidly in complex reasoning and question answering, yet LLM hallucination remains a central bottleneck that hinders practical deployment, especially for commercial black-box LLMs accessible only via APIs. Existing uncertainty quantification methods typically depend on computationally expensive multiple sampling or internal parameters, which prevents real-time estimation and fails to capture information implicit in the black-box reasoning process. To address this issue, we propose Distribution-Aligned Adversarial Distillation (DisAAD), which introduces a generation-discrimination architecture to guide a lightweight proxy model to learn the high-quality regions of the output distribution of the black-box LLM, thus effectively endowing it with the ability to know whether the black-box LLM knows or not. Subsequently, we use the proxy model to reproduce the specific responses of the black-box LLM and estimate the corresponding uncertainty based on evidence learning. Extensive experiments have verified the effectiveness and promise of our proposed method, indicating that even a proxy model only 1% of the target LLM's size can achieve reliable uncertainty quantification.
Chinese Translation
大型语言模型(LLMs)在复杂推理和问答方面取得了快速进展,但LLM幻觉仍然是阻碍实际应用的核心瓶颈,特别是对于仅通过API访问的商业黑箱LLM。现有的不确定性量化方法通常依赖于计算成本高昂的多次采样或内部参数,这限制了实时估计的可能性,并未能捕捉到黑箱推理过程中的隐含信息。为了解决这一问题,我们提出了分布对齐的对抗蒸馏(DisAAD),该方法引入了一种生成-判别架构,以指导轻量级代理模型学习黑箱LLM输出分布的高质量区域,从而有效赋予其判断黑箱LLM是否具备知识的能力。随后,我们使用代理模型重现黑箱LLM的特定响应,并基于证据学习估计相应的不确定性。大量实验验证了我们提出方法的有效性和前景,表明即使是仅占目标LLM大小1%的代理模型也能实现可靠的不确定性量化。
cs.CL / 16 / 2605.05835

Evaluation Awareness in Language Models Has Limited Effect on Behaviour

语言模型中的评估意识对行为的影响有限
Knecht, Amelie, Florin, Lucas, Hagendorff, Thilo
Abstract
Large reasoning models (LRMs) sometimes note in their chain of thought (CoT) that they may be under evaluation. Researchers worry that this verbalised evaluation awareness (VEA) causes models to adapt their outputs strategically, optimising for perceived evaluation criteria, which, for instance, can make models appear safer than they actually are. However, whether VEA actually has this effect is largely unknown. We tested this across open-weight LRMs and benchmarks covering safety, alignment, moral reasoning, and political opinion. We tested this both on-policy, sampling multiple CoTs per item and comparing those that spontaneously contained VEA against those that did not, and off-policy, using model prefilling to inject evaluation-aware sentences where missing and remove them where present, with subsequent resampling. VEA has limited effect on model behaviour: injecting VEA into CoTs produces near-zero effects ($\omega \leq 0.06$), removing it causes small shifts ($\omega \leq 0.12$) and spontaneously occurring VEA shifts answer distributions by at most 3.7 percentage points ($\omega \leq 0.31$). Our findings call for caution when interpreting high VEA rates as evidence of strategic behaviour or alignment tampering. Evaluation awareness may pose a smaller safety risk than the current literature assumes.
Chinese Translation
大型推理模型(LRMs)有时在其思维链(CoT)中提到它们可能正在接受评估。研究人员担心这种口头表达的评估意识(VEA)会导致模型战略性地调整其输出,以优化感知的评估标准,这可能使模型看起来比实际更安全。然而,VEA是否真的具有这种影响在很大程度上仍然未知。我们在开放权重的LRMs和涵盖安全性、对齐、道德推理和政治观点的基准测试中进行了测试。我们在政策内测试,通过对每个项目采样多个CoTs,并比较那些自发包含VEA的与那些不包含的;同时在政策外测试,使用模型预填充在缺失的地方注入评估意识句子,并在存在的地方移除它们,随后进行重新采样。VEA对模型行为的影响有限:将VEA注入CoTs几乎没有效果($\omega \leq 0.06$),移除它会导致小幅变化($\omega \leq 0.12$),而自发出现的VEA最多将答案分布偏移3.7个百分点($\omega \leq 0.31$)。我们的发现提示在将高VEA率解读为战略行为或对齐篡改的证据时应谨慎。评估意识可能带来的安全风险小于当前文献所假设的。
cs.CL / 17 / 2605.05892

Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention

超越引导向量:基于流的激活引导用于推理时干预
Jin, Zehao, Deng, Ruixuan, Wang, Junran, Shen, Xinjie, Zhang, Chao
Abstract
Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field $v_t(h,t,c)$ that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of $1.015$ on Gemma-2-2B-IT and $1.113$ on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses on activation space geometry might be incomplete.
Chinese Translation
激活引导作为一种有前景的替代方案,已被提出用于在推理时通过修改中间表示来控制语言模型的行为,同时保持模型参数不变。然而,大规模评估(如 AxBench)表明,现有的引导方法通常不如简单的上下文提示表现良好,并且在未见概念上的泛化能力较差。我们假设这些局限性源于以往方法中共享的未经验证的简化假设,这些假设通常将引导干预限制为固定的、单步的、位置不变的变换。我们提出了 FLAS(基于流的激活引导),它学习一个通用的、概念条件的速度场 $v_t(h,t,c)$,该速度场能够在不依赖这些假设的情况下,将未引导的激活传输到引导的激活。在 AxBench 上,FLAS 是第一个学习方法,它在没有针对每个概念调优的情况下,始终优于提示,达到了 Gemma-2-2B-IT 上的 $1.015$ 和 Gemma-2-9B-IT 上的 $1.113$ 的持出调和均值。对学习到的流的分析显示出曲线、多步、令牌变化的轨迹,这表明之前关于激活空间几何的假设可能是不完整的。
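The idea of transporting an activation along a velocity field can be illustrated with a minimal sketch. The abstract does not say how FLAS integrates its learned field, so forward Euler integration is an assumption here, and `toy_velocity` is a hypothetical hand-written stand-in for the learned $v_t(h,t,c)$, not the authors' model.

```python
from typing import Callable, List

Vector = List[float]


def toy_velocity(h: Vector, t: float, c: Vector) -> Vector:
    """Hypothetical stand-in for a learned field v_t(h, t, c): pull h toward c."""
    return [ci - hi for hi, ci in zip(h, c)]


def steer(
    h0: Vector,
    c: Vector,
    velocity: Callable[[Vector, float, Vector], Vector] = toy_velocity,
    num_steps: int = 8,
) -> Vector:
    """Integrate dh/dt = v(h, t, c) from t=0 to t=1 with forward Euler steps."""
    h = list(h0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(h, t, c)
        # One Euler update: move the activation a small step along the field.
        h = [hi + dt * vi for hi, vi in zip(h, v)]
    return h
```

With `num_steps=1` the update collapses to a single additive shift, which is the fixed, single-step form the abstract attributes to prior steering methods; larger `num_steps` lets the trajectory depend on intermediate states.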
cs.CL / 18 / 2605.05893

Logic-Regularized Verifier Elicits Reasoning from LLMs

逻辑正则化验证器从大型语言模型中引导推理
Wang, Xinyu, Sun, Changzhi, Cheng, Lian, Wu, Yuanbin, Zhang, Dell, Wang, Xiaoling, Li, Xuelong
Abstract
Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typical verifiers require resource-intensive supervised dataset construction, which is costly and faces limitations in data diversity. In this paper, we propose LOVER, an unsupervised verifier regularized by logical rules. LOVER treats the verifier as a binary latent variable, utilizing internal activations and enforcing three logical constraints on multiple reasoning paths: negation consistency, intra-group consistency, and inter-group consistency (grouped by the final answer). By incorporating logical rules as priors, LOVER can leverage unlabeled examples and is directly compatible with any off-the-shelf LLM. Experiments on 10 datasets demonstrate that LOVER significantly outperforms unsupervised baselines, achieving performance comparable to the supervised verifier (reaching its 95% level on average). The source code is publicly available at https://github.com/wangxinyufighting/llm-lover.
Chinese Translation
验证器是提升现代大型语言模型(LLMs)推理能力的关键组件。典型的验证器需要资源密集型的监督数据集构建,这既昂贵又面临数据多样性限制。本文提出了一种名为LOVER的无监督验证器,该验证器通过逻辑规则进行正则化。LOVER将验证器视为一个二元潜变量,利用内部激活并对多个推理路径施加三种逻辑约束:否定一致性、组内一致性和组间一致性(按最终答案分组)。通过将逻辑规则作为先验,LOVER能够利用未标记的示例,并与任何现成的LLMs直接兼容。在10个数据集上的实验表明,LOVER显著优于无监督基线,其性能可与监督验证器相媲美(平均达到其95%的水平)。源代码已公开,地址为 https://github.com/wangxinyufighting/llm-lover。
cs.CL / 19 / 2605.05927

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

从输入侧最小化模态差距:你的语音大语言模型可以成为一个感知韵律的文本大语言模型
Cui, Wenqian, Li, Xiao-Hui, Tan, Daxin, Zheng, Qiyong, King, Irwin
Abstract
Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation more text-like, but the gap remains. We argue that the key remaining bottleneck lies on the input side. We propose TextPro-SLM, an SLM that makes spoken input more closely resemble that of a prosody-aware text LLM. TextPro-SLM combines WhisperPro, a unified speech encoder that produces synchronized text tokens and prosody embeddings, with an LLM backbone trained to preserve the semantic capabilities of the original TLM while learning paralinguistic understanding. Experiments show that TextPro-SLM achieves the lowest modality gap among leading SLMs at both 3B and 7B scales, while also delivering strong overall performance on paralinguistic understanding tasks. These gains are achieved with only roughly 1,000 hours of LLM training audio, suggesting that reducing the modality gap from the input side is both effective and data-efficient.
Chinese Translation
语音大语言模型(SLMs)通常是基于文本大语言模型(TLM)的检查点构建的,但它们仍然存在显著的模态差距。之前的研究主要尝试通过使语音生成更类似于文本来减少这一差距,但差距依然存在。我们认为,关键的瓶颈在于输入侧。我们提出了TextPro-SLM,这是一种使口语输入更接近感知韵律的文本大语言模型的SLM。TextPro-SLM结合了WhisperPro,这是一种统一的语音编码器,能够生成同步的文本标记和韵律嵌入,并与一个经过训练的LLM主干相结合,以保持原始TLM的语义能力,同时学习副语言理解。实验表明,TextPro-SLM在3B和7B规模的领先SLMs中实现了最低的模态差距,同时在副语言理解任务上也表现出强大的整体性能。这些提升是在仅使用大约1,000小时的LLM训练音频的情况下实现的,表明从输入侧减少模态差距既有效又数据高效。
cs.CL / 20 / 2605.05950

Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation

轻量级风格一致性剖析:针对多媒体审核的LLM生成文本内容的鲁棒检测
Li, Siyuan, Wulianghai, Aodu, Lin, Xi, Yuan, Xibin, Mao, Qinghua, Li, Guangyan, Chen, Xiang, Wu, Jun, Li, Jianhua
Abstract
The increasing prevalence of Large Language Models (LLMs) in content creation has made distinguishing human-written textual content from LLM-generated counterparts a critical task for multimedia moderation. Existing detectors often rely on statistical cues or model-specific heuristics, making them vulnerable to paraphrasing and adversarial manipulations, and consequently limiting their robustness and interpretability. In this work, we propose LiSCP, a novel lightweight stylistic consistency profiling method for robust detection of LLM-generated textual content, focusing on feature stability under adversarial manipulation. Our approach constructs a consistency profile that combines discrete stylistic features with continuous semantic signals, leveraging stylistic stability across multimodal-guided paraphrased text variants. Experiments spanning real-world multimedia news and movie datasets and conventional text domains demonstrate that LiSCP achieves superior performance on in-domain detection and outperforms existing approaches by up to 11.79% in cross-domain settings. Additionally, it demonstrates notable robustness under adversarial scenarios, including adversarial attacks and hybrid human-AI settings.
Chinese Translation
大型语言模型(LLMs)在内容创作中的日益普及,使得区分人类撰写的文本内容与LLM生成的文本成为多媒体审核中的一项关键任务。现有的检测器通常依赖于统计线索或模型特定的启发式方法,这使得它们容易受到释义和对抗性操控的影响,从而限制了其鲁棒性和可解释性。在本研究中,我们提出了LiSCP,一种新颖的轻量级风格一致性剖析方法,旨在鲁棒地检测LLM生成的文本内容,重点关注在对抗性操控下的特征稳定性。我们的方法构建了一种一致性剖面,将离散的风格特征与连续的语义信号相结合,利用多模态引导的释义文本变体中的风格稳定性。涵盖真实世界的多媒体新闻和电影数据集以及传统文本领域的实验表明,LiSCP在领域内检测上表现优越,并在跨领域设置中比现有方法提高了多达11.79%的性能。此外,它在对抗性场景下表现出显著的鲁棒性,包括对抗性攻击和混合人类-人工智能设置。
cs.CL / 21 / 2605.05953

Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits

幻觉作为异常:通过概率电路的动态干预
Nielsen, Erik, Cunegatti, Elia, Vukojevic, Marcus, Iacca, Giovanni
Abstract
One of the most critical challenges in Large Language Models is their tendency to hallucinate, i.e., produce factually incorrect responses. Existing approaches show promising results in terms of hallucination correction, but still suffer from a main limitation: they apply corrections indiscriminately to every token, corrupting also the originally correct generations. To overcome this drawback, we propose PCNET, a Probabilistic Circuit trained as a tractable density estimator over the LLM residual stream. The method detects hallucinations as geometric anomalies on the factual manifold, which is done via exact Negative Log-Likelihood computation, hence without the need for sampling, external verifiers, or weight modifications, as in existing techniques. To demonstrate its effectiveness, we exploit PCNET as a dynamic gate that distinguishes hallucinated from factual hidden states at each decoding step. This triggers our second main contribution, PC-LDCD (Probabilistic Circuit Latent Density Contrastive Decoding), only when the latent geometry deviates from factual regions, while leaving correct generations untouched. Across four LLMs, ranging from 1B to 8B models, and four benchmarks covering conversational reasoning, knowledge-intensive QA, reading comprehension, and truthfulness, PCNET achieves near-perfect hallucination detection across CoQA, SQuAD v2.0, and TriviaQA, with AUROC reaching up to 99%. Moreover, PC-LDCD obtains the highest True+Info, MC2, and MC3 scores on TruthfulQA in three out of four models, in comparison with state-of-the-art baselines, while reducing the mean corruption rate to 53.7% and achieving a preservation rate of 79.3%. Our proposed method is publicly available on GitHub.
Chinese Translation
大型语言模型面临的最关键挑战之一是其产生幻觉的倾向,即生成事实不正确的响应。现有的方法在幻觉修正方面显示出良好的结果,但仍然存在一个主要局限性:它们对每个标记不加区分地应用修正,这也会损害原本正确的生成。为了解决这一缺陷,我们提出了PCNET,一种作为可处理密度估计器训练的概率电路,应用于LLM残差流。该方法通过精确的负对数似然计算将幻觉检测为事实流形上的几何异常,因此无需采样、外部验证者或权重修改,这与现有技术不同。为了证明其有效性,我们利用PCNET作为一个动态门,在每个解码步骤中区分幻觉状态和事实状态。这触发了我们的第二个主要贡献,PC-LDCD(概率电路潜在密度对比解码),仅在潜在几何偏离事实区域时才激活,同时保持正确生成不受影响。在四个大型语言模型(从1B到8B)和四个基准测试中,涵盖对话推理、知识密集型问答、阅读理解和真实性,PCNET在CoQA、SQuAD v2.0和TriviaQA上实现了近乎完美的幻觉检测,AUROC高达99%。此外,PC-LDCD在四个模型中的TruthfulQA上获得了最高的True+Info、MC2和MC3分数,相较于最先进的基线,同时将平均损坏率降低到53.7%,并实现了79.3%的保留率。我们提出的方法已在GitHub上公开发布。
cs.CL / 22 / 2605.05955

TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity

TableVista:在视觉和结构复杂性下的多模态表格推理基准测试
Yang, Zheyuan, Shang, Liqiang, Chen, Junjie, Yang, Xun, Xu, Chenglong, Yuan, Bo, Jiao, Chenyuan, Sun, Yaoru, Zhao, Yilun
Abstract
We introduce TableVista, a comprehensive benchmark for evaluating foundation models in multimodal table reasoning under visual and structural complexity. TableVista consists of 3,000 high-quality table reasoning problems, where each instance is expanded into 10 distinct visual variants through our multi-style rendering and transformation pipeline. This process encompasses diverse scenario styles, robustness perturbations, and vision-only configurations, culminating in 30,000 multimodal samples for a multi-dimensional evaluation. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary foundation models on TableVista. Through comprehensive quantitative and qualitative analysis, we find that while evaluated models remain largely stable across diverse rendering styles, they exhibit pronounced performance degradation on complex structural layouts and vision-only settings, revealing that current models struggle to maintain reasoning consistency when structural complexity combines with visually integrated presentations. These findings highlight critical gaps in current multimodal capabilities, providing insights for advancing more robust and reliable table understanding models.
Chinese Translation
我们介绍了TableVista,这是一个全面的基准测试,用于评估基础模型在视觉和结构复杂性下的多模态表格推理能力。TableVista包含3000个高质量的表格推理问题,其中每个实例通过我们的多风格渲染和转换管道扩展为10个不同的视觉变体。该过程涵盖了多样的场景风格、鲁棒性扰动和仅视觉配置,最终生成了30,000个多模态样本以进行多维评估。我们对29个最先进的开源和专有基础模型在TableVista上的表现进行了广泛评估。通过全面的定量和定性分析,我们发现,尽管评估模型在不同的渲染风格下保持了相对稳定,但在复杂的结构布局和仅视觉设置下,它们的性能显著下降,揭示了当前模型在结构复杂性与视觉集成展示相结合时难以保持推理一致性。这些发现突显了当前多模态能力中的关键缺口,为推动更强大和可靠的表格理解模型提供了见解。
cs.CL / 23 / 2605.05962

Tatarstan Toponyms: A Bilingual Dataset and Hybrid RAG System for Geospatial Question Answering

塔塔尔斯坦地名:用于地理空间问答的双语数据集和混合检索系统
Arabov, Mullosharaf K.
Abstract
This paper addresses automatic geospatial question answering over multilingual toponymic data. An original bilingual dataset of toponyms of the Republic of Tatarstan is introduced, comprising 9,688 structured records with linguistic, etymological, administrative, and coordinate information (93.1% georeferenced). Based on this dataset, a question-answering corpus of approximately 39,000 question-context-answer triples is constructed with guaranteed answer localization. A hybrid retriever integrates dense semantic indexing (multilingual-e5-large) with geospatial filtering via KD-trees and haversine distance. On 500 test queries, the hybrid search achieves Recall@1=0.988, Recall@5=1.000, and MRR=0.994, significantly outperforming BM25 and purely spatial methods. Among tested reader architectures (RuBERT, XLM-RoBERTa-large, T5-RUS), XLM-RoBERTa-large attains the best quality: EM=0.992, F1=0.994. On raw outputs, RuBERT models fail on coordinate questions (F1=0) while XLM-RoBERTa-large reaches F1=0.984; however, simple post-processing eliminates numerical gaps and restores RuBERT accuracy to 100%. This discrepancy stems from tokenization differences and pre-training corpora composition. All resources (dataset, QA corpus, model weights, web demo) are openly published on Hugging Face. Results apply to geospatial QA services, geocoding, and digital humanities in multilingual regions.
Chinese Translation
本文探讨了基于多语言地名数据的自动地理空间问答。我们介绍了一个原创的塔塔尔斯坦共和国地名双语数据集,包含9,688条结构化记录,涵盖语言学、词源学、行政和坐标信息(93.1%为地理参考)。基于该数据集,构建了一个包含约39,000个问题-上下文-答案三元组的问题回答语料库,并保证答案的定位。混合检索器将密集语义索引(multilingual-e5-large)与通过KD树和哈弗辛距离进行的地理空间过滤相结合。在500个测试查询中,混合搜索的Recall@1为0.988,Recall@5为1.000,MRR为0.994,显著优于BM25和纯空间方法。在测试的阅读器架构中(RuBERT、XLM-RoBERTa-large、T5-RUS),XLM-RoBERTa-large达到了最佳质量:EM=0.992,F1=0.994。在原始输出中,RuBERT模型在坐标问题上表现不佳(F1=0),而XLM-RoBERTa-large达到了F1=0.984;然而,简单的后处理消除了数值差距,使RuBERT的准确率恢复到100%。这种差异源于分词差异和预训练语料库的组成。所有资源(数据集、问答语料库、模型权重、网络演示)均在Hugging Face上公开发布。结果适用于多语言地区的地理空间问答服务、地理编码和数字人文学科。
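The haversine distance and the retrieval metrics named in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the Earth radius constant, the function names, and the list-based ranking format are all hypothetical, and the KD-tree index and dense retriever are omitted.

```python
import math
from typing import List, Sequence

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def recall_at_k(rankings: Sequence[List[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold item appears in the top-k results."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(rankings: Sequence[List[str]], gold: Sequence[str]) -> float:
    """Average of 1/rank of the gold item; a query with no hit contributes 0."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)
```

In a hybrid retriever of the kind described, a distance function like `haversine_km` would typically filter or re-rank candidates returned by the semantic index before the metrics are computed.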
cs.CL / 24 / 2605.06006

From Articles to Premises: Building PrimeFacts, an Extraction Methodology and Resource for Fact-Checking Evidence

从文章到前提:构建 PrimeFacts,一种用于事实核查证据提取的方法论和资源
Sahitaj, Premtim, Kolanowski, Jawan, Sahitaj, Ariana, Solopova, Veronika, Upravitelev, Max, Röder, Daniel, Maab, Iffat, Yamagishi, Junichi, Möller, Sebastian, Schmitt, Vera
Abstract
Fact-checking articles encode rich supporting evidence and reasoning, yet this evidence remains largely inaccessible to automated verification systems due to unstructured presentation. We introduce PrimeFacts, a methodology and resource for extracting fine-grained evidence from full fact-checking articles. We compile 13,106 PolitiFact articles with claims, verdicts, and all referenced sources, and we identify 49,718 in-article hyperlinks as natural anchors to pinpoint key evidence. Our framework leverages large language models (LLMs) to rewrite these anchor sentences into stand-alone, context-independent premises and investigates the extraction of additional implicit evidence. In evaluations on cross-article evidence retrieval and claim verification, the extracted premises substantially improve performance. Decontextualized evidence yields higher retrievability, achieving up to a 30 percent relative gain in Mean Reciprocal Rank over verbatim sentences, and using the evidence for verdict prediction raises Macro-F1 by 10-20 points over the baseline. These gains are consistent across different verdict granularities (2-class vs. 5-class) and model architectures. A qualitative analysis indicates that the decontextualized premises remain faithful to the original sources. Our work highlights the promise of reusing fact-checkers' evidence for automation and provides a large-scale resource of structured evidence from real-world fact-checks.
Chinese Translation
事实核查文章编码了丰富的支持证据和推理,然而由于呈现方式不结构化,这些证据在自动化验证系统中仍然难以获取。我们介绍了 PrimeFacts,一种从完整的事实核查文章中提取细粒度证据的方法论和资源。我们汇编了 13,106 篇包含主张、裁决和所有引用来源的 PolitiFact 文章,并识别出 49,718 个文内超链接作为自然锚点,以便定位关键证据。我们的框架利用大型语言模型(LLMs)将这些锚句重写为独立的、上下文无关的前提,并探讨提取额外隐含证据。在跨文章证据检索和主张验证的评估中,提取的前提显著提高了性能。去上下文化的证据提高了可检索性,相较于逐字句子,平均倒数排名(Mean Reciprocal Rank)提升了最高 30%。使用这些证据进行裁决预测,宏观 F1 分数(Macro-F1)比基线提高了 10-20 分。这些提升在不同裁决粒度(2 类与 5 类)和模型架构中是一致的。定性分析表明,去上下文化的前提依然忠实于原始来源。我们的工作突显了重用事实核查者证据进行自动化的潜力,并提供了来自真实世界事实核查的大规模结构化证据资源。
cs.CL / 25 / 2605.06007

PersonaKit (PK): A Plug-and-Play Platform for User Testing Diverse Roles in Full-Duplex Dialogue

PersonaKit (PK):一个即插即用的平台,用于用户测试全双工对话中的多样角色
Jeon, Hyunbae, Choi, Jinho D.
Abstract
As spoken dialogue systems expand beyond traditional assistant roles to encompass diverse personas, such as authoritative instructors, uncooperative merchants, or distracted workers, they require distinct, human-like turn-taking behaviors to maintain psychological immersion. However, current full-duplex systems often default to a rigid, overly accommodating "always-yield" policy during overlapping speech, which severely undermines character consistency for non-submissive roles. Evaluating alternative, persona-specific turn-taking strategies through empirical user studies is challenging because building real-time full-duplex test environments requires substantial engineering overhead. To address this, we present PersonaKit (PK), an open-source, low-latency web platform for the rapid prototyping and evaluation of conversational agents. Using intuitive JSON configurations, researchers can define personas, specify probabilistic interruption-handling behaviors (e.g., yield, hold, bridge, or override), and automatically deploy comparative A/B surveys. Through an in-the-wild evaluation with 8 distinct personas, we demonstrate that PersonaKit provides an extensible, end-to-end framework for studying complex sociolinguistic behaviors in next-generation spoken agents.
Chinese Translation
随着口语对话系统超越传统助手角色,涵盖多样的人物角色——例如权威的讲师、不合作的商人或分心的工作人员——它们需要独特的人类化轮流发言行为,以维持心理沉浸感。然而,当前的全双工系统在重叠发言时往往默认采用僵化、过于迁就的“始终让步”策略,这严重削弱了非顺从角色的一致性。通过实证用户研究评估替代的、特定角色的轮流发言策略是具有挑战性的,因为构建实时全双工测试环境需要大量的工程投入。为了解决这个问题,我们提出了PersonaKit (PK),一个开源、低延迟的网络平台,用于快速原型设计和评估对话代理。研究人员可以使用直观的JSON配置定义角色,指定概率性干扰处理行为(例如,让步、保持、桥接或覆盖),并自动部署比较A/B调查。通过对8个不同角色的实际评估,我们展示了PersonaKit提供了一个可扩展的端到端框架,用于研究下一代口语代理中的复杂社会语言行为。
cs.CL / 26 / 2605.06030

More Aligned, Less Diverse? Analyzing the Grammar and Lexicon of Two Generations of LLMs

更一致,少多样?分析两代大型语言模型的语法和词汇
Gude, Adrián, Santos-Ríos, Roi, Bond, Francis, Flickinger, Dan, Gómez-Rodríguez, Carlos, Zamaraeva, Olga
Abstract
This study contributes to a growing line of research in comparing LLM-generated texts with human-authored text, in this case, English news text. We focus in particular on the evaluation of syntactic properties through formal grammar frameworks. Our analysis compares two generations of LLMs in the context of two human-authored English news datasets from two different years. Employing the Head-Driven Phrase Structure Grammar (HPSG) formalism, we investigate the distributions of syntactic structures and lexical types of AI-generated texts and contrast them with the corresponding distributions in the human-authored New York Times (NYT) articles. We use diversity metrics from ecology and information theory to quantify variation in grammatical constructions and lexical types. We show that English news text has changed little in the given time frame, while newer LLMs display reduced syntactic and, especially, lexical diversity compared to older, non-instruction-tuned models. These findings point to future work in studying effects of instruction tuning, which, while enhancing coherence and adherence to prompts, may narrow the expressive range of model output.
Chinese Translation
本研究为比较大型语言模型(LLM)生成文本与人类创作文本(在本案例中为英语新闻文本)的研究提供了新的贡献。我们特别关注通过形式语法框架评估句法特性。我们的分析比较了两代大型语言模型在两个不同年份的人类创作英语新闻数据集中的表现。采用以头驱动短语结构语法(Head-Driven Phrase Structure Grammar, HPSG)为基础的形式主义,我们研究了人工智能生成文本的句法结构和词汇类型的分布,并将其与人类创作的《纽约时报》(New York Times, NYT)文章中的相应分布进行对比。我们使用生态学和信息理论中的多样性指标来量化语法结构和词汇类型的变异性。结果显示,在给定的时间范围内,英语新闻文本变化不大,而较新的大型语言模型在句法和特别是词汇多样性方面相比于较旧的非指令调优模型有所减少。这些发现指向未来在研究指令调优效果方面的工作,指令调优虽然增强了连贯性和对提示的遵循,但可能会缩小模型输出的表现范围。
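The abstract does not name the specific diversity metrics used. As an illustration only, the sketch below assumes two common choices, Shannon entropy (information theory) and the Gini-Simpson index (ecology), applied to a list of construction or lexical-type labels; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy(labels: Iterable[str]) -> float:
    """Shannon entropy in bits of the empirical label distribution."""
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def simpson_diversity(labels: Iterable[str]) -> float:
    """Gini-Simpson index: probability that two random draws differ in type."""
    counts = Counter(labels)
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
```

Both scores rise when types are more numerous and more evenly distributed, so a corpus that reuses a few constructions heavily scores lower than one that spreads usage across many.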
cs.CL / 27 / 2605.06076

Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

依靠旧地图导航:大语言模型后训练中静态机械定位的陷阱
Chen, Hang, Zhu, Jiaying, Chen, Hongyang, Liu, Hongxu, Yang, Xinyu, Wang, Wenya
Abstract
The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic interpretability for targeted parameter updates. However, this paradigm rests on a fundamental yet unverified assumption: can mechanisms derived from current static parameters reliably guide future dynamic parameter updates? To investigate this, we systematically track the structural evolution of Transformer circuits throughout the supervised fine-tuning (SFT) process, revealing the underlying dynamics of task mechanisms. We introduce three novel metrics (Circuit Distance, Circuit Stability, and Circuit Conflict) to analyze circuit evolution across three dimensions: neural migration, semantic stability, and cross-task interference. Our empirical results reveal that circuits inherently exhibit "Free Evolution" during parameter updates. Consequently, static mechanisms extracted from current states inevitably suffer from temporal latency, making them fundamentally inadequate for guiding future states. Moreover, by deconstructing the "illusion of effectiveness" in existing methods, this work underscores the necessity of "foresight" in mechanistic localization and proposes a predictive framework for future research.
Chinese Translation
“定位-然后-更新”范式已成为大语言模型(LLMs)后训练中的主要方法,通过机械可解释性识别关键组件以进行针对性的参数更新。然而,这一范式建立在一个基本但未经验证的假设之上:从当前静态参数派生的机制能否可靠地指导未来动态参数的更新?为此,我们系统地追踪了Transformer电路在监督微调(SFT)过程中的结构演变,揭示了任务机制的潜在动态。我们引入了三种新颖的度量指标——电路距离(Circuit Distance)、电路稳定性(Circuit Stability)和电路冲突(Circuit Conflict)——以分析电路在神经迁移、语义稳定性和跨任务干扰三个维度上的演变。我们的实证结果表明,电路在参数更新过程中本质上表现出“自由演化”(Free Evolution)。因此,从当前状态提取的静态机制不可避免地受到时间延迟的影响,使其在指导未来状态时根本不够充分。此外,通过解构现有方法中的“有效性幻觉”,本研究强调了在机械定位中“前瞻性”的必要性,并提出了一个用于未来研究的预测框架。
cs.CL / 28 / 2605.06078

Milestone-Guided Policy Learning for Long-Horizon Language Agents

里程碑引导的长期语言代理策略学习
Wang, Zixuan, Yan, Yuchen, Li, Hongxing, Pan, Teng, Li, Dingming, Zhang, Ruiqing, Lu, Weiming, Xiao, Jun, Zhuang, Yueting, Shen, Yongliang
Abstract
While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves 92.9% success rate, nearly doubling GRPO's 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is available at https://github.com/ZJU-REAL/BEACON.
Chinese Translation
虽然长期代理任务要求语言代理执行数十个连续决策,但使用强化学习训练此类代理仍然面临挑战。我们识别出两个根本原因:信用误归因,即由于终端失败而惩罚正确的早期行动,以及样本效率低下,即稀缺的成功轨迹导致几乎完全失去学习信号。我们提出了一种里程碑引导的策略学习框架BEACON,该框架利用长期任务的组合结构以确保精确的信用分配。BEACON在里程碑边界处对轨迹进行划分,在各段内应用时间奖励塑形以奖励部分进展,并在双重尺度上估计优势,以防止远程失败影响局部行动的评估。在ALFWorld、WebShop和ScienceWorld上,BEACON始终优于GRPO和GiGPO。值得注意的是,在长期ALFWorld任务中,BEACON达到了92.9%的成功率,几乎是GRPO的53.5%的两倍,同时有效样本利用率从23.7%提高到82.0%。这些结果确立了以里程碑为锚的信用分配作为训练长期语言代理的有效范式。代码可在https://github.com/ZJU-REAL/BEACON获取。
cs.CL / 29 / 2605.06096

Uncovering Entity Identity Confusion in Multimodal Knowledge Editing

揭示多模态知识编辑中的实体身份混淆
Wu, Shu, Ye, Xiaotian, Mou, Xinyu, Liu, Dongsheng, Wang, Xiaohan, Zhang, Mengqi
Abstract
Multimodal knowledge editing (MKE) aims to correct the internal knowledge of large vision-language models after deployment, yet the behavioral patterns of post-edit models remain underexplored. In this paper, we identify a systemic failure mode in edited models, termed Entity Identity Confusion (EIC): edited models exhibit an absurd behavior where text-only queries about the original entity's identity unexpectedly return information about the new entity. To rigorously investigate EIC, we construct EC-Bench, a diagnostic benchmark that directly probes how image-entity bindings shift before and after editing. Our analysis reveals that EIC stems from existing methods failing to distinguish between Image-Entity (I-E) binding and Entity-Entity (E-E) relational knowledge in the model, causing models to overfit E-E associations as a shortcut: the image is still perceived as the original entity, with the new entity's name serving only as a spurious identity label. We further explore potential mitigation strategies, showing that constraining edits to the model's I-E processing stage encourages edits to act more faithfully on I-E binding, thereby substantially reducing EIC. Based on these findings, we discuss principled desiderata for faithful MKE and provide methodological guidance for future research.
Chinese Translation
多模态知识编辑(MKE)旨在在部署后纠正大型视觉-语言模型的内部知识,但后编辑模型的行为模式仍然未被充分探讨。在本文中,我们识别出一种编辑模型中的系统性失效模式,称为实体身份混淆(Entity Identity Confusion, EIC):编辑模型表现出一种荒谬的行为,即关于原始实体身份的文本查询意外地返回关于新实体的信息。为了严格调查EIC,我们构建了EC-Bench,一个诊断基准,直接探测编辑前后图像-实体绑定的变化。我们的分析揭示,EIC源于现有方法未能区分模型中的图像-实体(Image-Entity, I-E)绑定和实体-实体(Entity-Entity, E-E)关系知识,导致模型将E-E关联过拟合为一种捷径:图像仍被视为原始实体,而新实体的名称仅作为一个虚假的身份标签。我们进一步探讨了潜在的缓解策略,显示将编辑限制在模型的I-E处理阶段可以促使编辑更忠实地作用于I-E绑定,从而显著减少EIC。基于这些发现,我们讨论了忠实MKE的原则性期望,并为未来研究提供了方法论指导。
cs.CL / 30 / 2605.06132

MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval

MemReranker:面向推理的代理记忆检索重排序
Li, Chunyu, Kang, Jingyi, Chen, Ding, Zhang, Mengyuan, Shen, Jiajun, Tang, Bo, Zhou, Xuanhe, Xiong, Feiyu, Li, Zhiyu
Abstract
In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the "retrieve-then-rerank" two-stage paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning capabilities, leading to a problem where recalled results are semantically highly relevant yet do not contain the key information needed to answer the question. This deficiency manifests in memory scenarios as three specific problems. First, relevance scores are miscalibrated, making threshold-based filtering difficult. Second, ranking degrades when facing temporal constraints, causal reasoning, and other complex queries. Third, the model cannot leverage dialogue context for semantic disambiguation. This report introduces MemReranker, a reranking model family (0.6B/4B) built on Qwen3-Reranker through multi-stage LLM knowledge distillation. Multi-teacher pairwise comparisons generate calibrated soft labels, BCE pointwise distillation establishes well-distributed scores, and InfoNCE contrastive learning enhances hard-sample discrimination. Training data combines general corpora with memory-specific multi-turn dialogue data covering temporal constraints, causal reasoning, and coreference resolution. On the memory retrieval benchmark, MemReranker-0.6B substantially outperforms BGE-Reranker and matches open-source 4B/8B models as well as GPT-4o-mini on key metrics. MemReranker-4B further achieves 0.737 MAP, with several metrics on par with Gemini-3-Flash, while maintaining inference latency at only 10-20% of large models. On finance and healthcare vertical-domain benchmarks, the models preserve generalization capabilities on par with mainstream large-parameter rerankers.
Chinese Translation
在代理记忆系统中,重排序模型作为连接用户查询与长期记忆的关键桥梁。大多数系统采用“检索-再重排序”的两阶段范式,但通用重排序模型依赖于语义相似性匹配,缺乏真正的推理能力,导致回忆结果在语义上高度相关却不包含回答问题所需的关键信息。这一缺陷在记忆场景中表现为三个具体问题。首先,相关性评分校准不当,使得基于阈值的过滤变得困难。其次,在面对时间约束、因果推理和其他复杂查询时,排名性能下降。第三,模型无法利用对话上下文进行语义消歧。本报告介绍了MemReranker,一个基于Qwen3-Reranker构建的重排序模型系列(0.6B/4B),通过多阶段的LLM知识蒸馏实现。多教师成对比较生成校准的软标签,BCE逐点蒸馏建立良好分布的评分,而InfoNCE对比学习增强了对难样本的区分能力。训练数据结合了通用语料库与特定于记忆的多轮对话数据,涵盖时间约束、因果推理和共指解析。在记忆检索基准上,MemReranker-0.6B显著优于BGE-Reranker,并在关键指标上与开源的4B/8B模型以及GPT-4o-mini相匹配。MemReranker-4B进一步实现了0.737的MAP,在多个指标上与Gemini-3-Flash持平,同时保持推理延迟仅为大型模型的10%至20%。在金融和医疗垂直领域基准上,这些模型保持了与主流大参数重排序模型相当的泛化能力。
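The MemReranker abstract names two standard training losses: BCE pointwise distillation against soft labels, and InfoNCE contrastive learning. The following is a minimal sketch of both for a single query, written without any deep learning framework. The temperature value, the function names, and the way scores are passed in are assumptions for illustration, not details taken from the report.

```python
import math


def bce_soft_label(logit, soft_label):
    """Binary cross-entropy between a sigmoid score and a teacher soft label in [0, 1]."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(soft_label * math.log(p) + (1.0 - soft_label) * math.log(1.0 - p))


def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE for one query: negative log-softmax of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# The loss shrinks as the positive memory outscores the hard negatives.
print(info_nce(0.9, [0.2, 0.1]))
print(info_nce(0.3, [0.2, 0.1]))
```

Hard negatives that score close to the positive dominate the InfoNCE denominator, which is why this loss is commonly used to sharpen discrimination on difficult samples.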
cs.CL / 31 / 2605.06142

IRC-Bench: Recognizing Entities from Contextual Cues in First-Person Reminiscences

IRC-Bench:从第一人称回忆中的上下文线索识别实体
Aperstein, Yehudit, Moran, Eden, Apartsin, Alexander
Abstract
When people recount personal memories, they often refer to people, places, and events indirectly, relying on contextual cues rather than explicit names. Such implicit references are central to reminiscence narratives: first-person accounts of lived experience used in therapeutic, archival, and social settings. They pose a difficult computational problem because the intended entity must be inferred from dispersed narrative evidence rather than from a local mention. We introduce IRC-Bench, the Implicit Reminiscence Context Benchmark, for evaluating implicit entity recognition in reminiscence transcripts. The benchmark targets non-locality: entity-identifying cues are distributed across multiple, non-contiguous clauses, unlike named entity recognition, entity linking, or coreference resolution. IRC-Bench comprises 25,136 samples constructed from 12,337 Wikidata-linked entities across 1,994 transcripts spanning 11 thematic domains. Each sample pairs an Entity-Grounded Narrative, in which the target entity is explicitly mentioned, with an Entity-Elided Narrative, in which direct mentions are removed. We evaluate 19 configurations across LLM generation, dense retrieval, RAG, and fine-tuning. QLoRA-adapted Llama 3.1 8B performs best in the open-world setting (38.94% exact match; 51.59% Jaccard), while fine-tuned DPR leads closed-world retrieval (35.38% Hit@1; 71.49% Hit@10). We release IRC-Bench with data, code, and evaluation tools.
Chinese Translation
当人们回忆个人记忆时,他们通常间接提及人、地点和事件,依赖于上下文线索而非明确的名称。这种隐含的引用在回忆叙事中至关重要:第一人称的生活经历叙述被用于治疗、档案和社交场合。由于所指实体必须从分散的叙事证据中推断,而不是从局部提及中直接得出,因此这构成了一个困难的计算问题。我们引入了IRC-Bench,即隐含回忆上下文基准,用于评估回忆文本中的隐含实体识别。该基准关注非局部性:实体识别线索分布在多个不连续的从句中,这与命名实体识别、实体链接或共指解析不同。IRC-Bench包含25,136个样本,这些样本由1,994个跨越11个主题领域的文本构建而成,涉及12,337个与维基数据链接的实体。每个样本将一个明确提及目标实体的实体基础叙事与一个删除直接提及的实体省略叙事配对。我们评估了19种配置,涉及LLM生成、密集检索、RAG和微调。在开放世界设置中,QLoRA适配的Llama 3.1 8B表现最佳(38.94%的精确匹配;51.59%的Jaccard),而微调后的DPR在封闭世界检索中领先(35.38%的Hit@1;71.49%的Hit@10)。我们发布了包含数据、代码和评估工具的IRC-Bench。
cs.CL / 32 / 2605.06157

HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities

HNC:利用困难负样本标题提升具有细粒度视觉-语言理解能力的模型
Dönmez, Esra, Tilli, Pascal, Yang, Hsiu-Yu, Vu, Thang, Silberer, Carina
Abstract
Image-Text-Matching (ITM) is one of the de facto methods of learning generalized representations from a large corpus in Vision and Language (VL). However, due to the weak association between the web-collected image-text pairs, models fail to show a fine-grained understanding of the combined semantics of these modalities. To address this issue, we propose Hard Negative Captions (HNC): an automatically created dataset containing foiled hard negative captions for ITM training towards achieving fine-grained cross-modal comprehension in VL. Additionally, we provide a challenging manually created test set for benchmarking models on a fine-grained cross-modal mismatch task with varying levels of compositional complexity. Our results show the effectiveness of training on HNC by improving the models' zero-shot capabilities in detecting mismatches on diagnostic tasks and performing robustly under noisy visual input scenarios. Also, we demonstrate that HNC models yield a comparable or better initialization for fine-tuning
Chinese Translation
图像-文本匹配(ITM)是从视觉与语言(VL)的大型语料库中学习通用表示的主要方法之一。然而,由于网络收集的图像-文本对之间的关联性较弱,模型未能展现对这些模态组合语义的细粒度理解。为了解决这一问题,我们提出了困难负样本标题(HNC):一个自动生成的数据集,包含用于ITM训练的困难负样本标题,旨在实现VL中的细粒度跨模态理解。此外,我们提供了一个具有挑战性的手动创建测试集,用于在具有不同组合复杂度的细粒度跨模态不匹配任务上对模型进行基准测试。我们的结果表明,通过在HNC上训练,模型在检测诊断任务中的不匹配时的零样本能力得到了提升,并且在噪声视觉输入场景下表现稳健。此外,我们还展示了HNC模型在微调时能够提供可比或更好的初始化。
cs.CL / 33 / 2605.06200

A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

A$^2$TGPO:具有自适应回合级裁剪的代理回合-组策略优化
Chen, Dingwei, Zong, Zefang, Ma, Zhipeng, Luo, Leo, Li, Yang, Li, Chengming, Chen, Peng, Jiang, Jie
Abstract
Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy's predicted probability of the ground-truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A$^2$TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i) turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation: divides cumulative normalized IG by square root of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping: modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
Chinese Translation
针对代理大型语言模型(LLMs)的强化学习通常依赖于稀疏的轨迹级结果奖励,这使得在多回合交互中评估单个工具调用的贡献变得困难。现有的此类过程信用分配方法要么依赖于单独的外部过程奖励模型,这会引入额外的消耗,要么依赖于基于树的结构展开,这仅仅是重新分配结果信号,同时限制了轨迹的多样性。一种有前景的替代方法利用策略对真实值的预测概率的每回合变化,称为信息增益(Information Gain,IG),作为一种内在的过程信号,而无需外部评估者。然而,利用IG信号在强化学习训练循环中的先前工作面临三大系统性挑战:在面对异质位置上下文的回合之间进行归一化可能会扭曲单个回合的相对地位,累积的项数变化导致优势幅度随着轨迹深度漂移,以及固定的裁剪范围对具有截然不同IG信号的回合施加相同的策略更新。在本文中,我们提出了A$^2$TGPO(具有自适应回合级裁剪的代理回合-组策略优化),该方法保留IG作为内在信号,但重新设计了其归一化、累积和消耗的方式:(i)回合组归一化:在每个(提示,回合索引)组内对IG进行归一化,以便每个回合仅与同一交互深度的同伴进行比较;(ii)方差缩放的折扣累积:将累积的归一化IG除以累积项的平方根,以保持不同回合位置之间的优势幅度可比;(iii)自适应回合级裁剪:根据每个回合的归一化IG调节其裁剪范围,为信息丰富的回合扩大更新区域,为信息贫乏的回合缩小更新区域。
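The three components listed in the A$^2$TGPO abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the input layout (one list of per-turn IG values per rollout of the same prompt), the accumulation direction (from each turn to the end of the trajectory), the tanh form of the clipping rule, and all parameter names (`gamma`, `base_eps`, `alpha`) are hypothetical and not taken from the paper.

```python
import math
from statistics import mean, pstdev


def turn_group_normalize(ig_by_rollout, eps=1e-8):
    """Z-score each IG value against rollouts of the same prompt at the same turn index."""
    max_turns = max(len(r) for r in ig_by_rollout)
    normalized = [[0.0] * len(r) for r in ig_by_rollout]
    for t in range(max_turns):
        group = [(i, r[t]) for i, r in enumerate(ig_by_rollout) if len(r) > t]
        values = [v for _, v in group]
        mu, sigma = mean(values), pstdev(values)
        for i, v in group:
            normalized[i][t] = (v - mu) / (sigma + eps)
    return normalized


def rescaled_accumulate(z, gamma=1.0):
    """Discounted sum of normalized IG from turn t onward, divided by sqrt(number of terms)."""
    advantages = []
    for t in range(len(z)):
        tail = z[t:]
        total = sum((gamma ** k) * v for k, v in enumerate(tail))
        advantages.append(total / math.sqrt(len(tail)))
    return advantages


def adaptive_clip_range(z_t, base_eps=0.2, alpha=0.5):
    """Widen the clip range for high normalized IG and narrow it for low (assumed tanh form)."""
    return base_eps * (1.0 + alpha * math.tanh(z_t))


# Three hypothetical rollouts of one prompt, with different numbers of turns.
ig = [[0.1, 0.4], [0.3, 0.2], [0.2]]
z = turn_group_normalize(ig)
print(z)
print(rescaled_accumulate(z[0]))
print(adaptive_clip_range(z[0][1]))
```

Dividing by the square root of the number of accumulated terms keeps the scale of the sum roughly constant if the normalized terms are treated as independent with unit variance, which is one way to read the "variance-rescaled" wording in the abstract.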
cs.CL / 34 / 2605.06216

TIDE: Every Layer Knows the Token Beneath the Context

TIDE:每一层都了解上下文下的标记
Jaiswal, Ajay, Hannah, Lauren, Kim, Han-Byul, Hoang, Duc, Farajtabar, Mehrdad, Cho, Minsik
Abstract
We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type distribution of vocabulary causes rare-token embeddings to be chronically under-trained because they receive only a fraction of the cumulative gradient signal that common tokens receive; and (ii) the Contextual Collapse Problem, where limited-parameter models map distributionally similar tokens to indistinguishable hidden states. As an attempt to address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We theoretically and empirically establish the benefits of TIDE in addressing the issues associated with single-token identity injection, and show improved performance across multiple language modeling and downstream tasks.
Chinese Translation
我们重新审视了每个现代大型语言模型(LLM)中一个普遍接受但未充分研究的设计选择:在输入嵌入层中查找一次标记索引,然后永久丢弃。这一单次注入假设引发了两个结构性问题:(i)稀有标记问题,词汇的Zipf型分布导致稀有标记的嵌入由于接收到的累积梯度信号相较于常见标记而长期处于欠训练状态;(ii)上下文崩溃问题,有限参数模型将分布上相似的标记映射到不可区分的隐藏状态。为了解决这两个问题,我们提出了TIDE,它通过EmbeddingMemory增强了标准变换器:一个由K个独立MemoryBlock组成的集合,将标记索引映射到上下文无关的语义向量,这些向量一次性计算并通过具有可学习空银行的深度条件softmax路由器注入到每一层。我们在理论和实证上建立了TIDE在解决与单标记身份注入相关的问题方面的优势,并提升了多个语言建模和下游任务的性能。
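To make the EmbeddingMemory idea in the TIDE abstract more concrete, here is a rough sketch in plain Python. It assumes the simplest possible reading: K lookup tables indexed by token id, and one vector of router logits per layer with an extra final "null" slot that contributes nothing. The class name, initialization, and the choice to condition the router on depth alone are assumptions; the actual router may well use additional inputs.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class EmbeddingMemorySketch:
    """K token-indexed tables plus per-layer router logits with one extra null slot."""

    def __init__(self, vocab_size, dim, num_blocks, num_layers, seed=0):
        rng = random.Random(seed)
        self.blocks = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]
            for _ in range(num_blocks)
        ]
        # One logit per block plus a final null logit, for every layer (depth-conditioned).
        self.router_logits = [[0.0] * (num_blocks + 1) for _ in range(num_layers)]

    def lookup(self, token_id):
        """Context-free vectors for a token, computed once and reused at every layer."""
        return [block[token_id] for block in self.blocks]

    def inject(self, hidden, memory_vectors, layer):
        """Add the router-weighted mix of memory vectors; the null slot adds nothing."""
        weights = softmax(self.router_logits[layer])[:-1]
        mix = [
            sum(w * vec[d] for w, vec in zip(weights, memory_vectors))
            for d in range(len(hidden))
        ]
        return [h + m for h, m in zip(hidden, mix)]


memory = EmbeddingMemorySketch(vocab_size=10, dim=4, num_blocks=3, num_layers=2)
vectors = memory.lookup(token_id=5)
print(memory.inject([0.0, 0.0, 0.0, 0.0], vectors, layer=0))
```

When the null logit dominates at some layer, the injected mix shrinks toward zero, so in this sketch a layer can effectively opt out of token-identity injection.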
cs.CL / 35 / 2605.06221

UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

UniPrefill:通过块级动态稀疏化实现的通用长上下文预填加速
Fan, Qihang, Huang, Huaibo, Wu, Zhiying, Wang, Bingning, He, Ran
Abstract
As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency of long-context processing, several novel low-complexity hybrid architectures have recently been proposed, effectively alleviating the computational burden of long-context inference. However, existing research on long-context prefill acceleration remains predominantly focused on sparse attention mechanisms, which achieve their maximum speedup only on full-attention models. When transferred to emerging architectures (such as linear/full attention hybrids or sliding window/full attention hybrids), these prefill acceleration approaches suffer significant performance degradation. Furthermore, such methods are generally incompatible with continuous batching, making them difficult to integrate into modern inference engines such as vLLM. To this end, we propose UniPrefill, a prefill acceleration framework applicable to virtually any model architecture, which directly accelerates the model's computation at the token level. We further implement UniPrefill as a continuous batching operator and extend vLLM's scheduling strategy to natively support prefill-decode co-processing and tensor parallel for UniPrefill, enabling its seamless integration into vLLM. UniPrefill achieves up to 2.1x speedup in Time-To-First-Token (TTFT), with the acceleration becoming increasingly pronounced as the number of concurrent requests grows.
Chinese Translation
随着大型语言模型(LLMs)的快速发展,它们的能力不断增强,同时对上下文长度的需求也越来越长。为了提高长上下文处理的推理效率,最近提出了几种新颖的低复杂度混合架构,有效缓解了长上下文推理的计算负担。然而,现有的长上下文预填加速研究主要集中在稀疏注意力机制上,这些机制仅在全注意力模型上实现最大加速。当转移到新兴架构(如线性/全注意力混合或滑动窗口/全注意力混合)时,这些预填加速方法的性能会显著下降。此外,这些方法通常与连续批处理不兼容,使其难以集成到现代推理引擎(如vLLM)中。为此,我们提出了UniPrefill,一个适用于几乎任何模型架构的预填加速框架,直接在令牌级别加速模型计算。我们进一步将UniPrefill实现为连续批处理操作符,并扩展vLLM的调度策略,以原生支持预填-解码协处理和UniPrefill的张量并行,从而实现与vLLM的无缝集成。UniPrefill在首次令牌时间(TTFT)上实现了高达2.1倍的加速,随着并发请求数量的增加,加速效果愈加显著。
cs.CL / 36 / 2605.06231

YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling

YEZE在SemEval-2026任务9中的表现:通过异构集成检测多语言、多文化和多事件的在线极化
Guo, Fengze, Chang, Yue
Abstract
This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization, which identifies polarized social media content in 22 languages through three subtasks: binary detection, target classification, and manifestation identification. We propose a heterogeneous ensemble of multilingual pretrained models, combining XLM-RoBERTa-large and mDeBERTa-v3-base. We investigate techniques such as multi-task learning, translation-based data augmentation, and class weighting to improve classification performance under severe label imbalance. Our findings indicate that independent task modeling combined with class weighting is more effective.
Chinese Translation
本文介绍了我们在SemEval-2026任务9中的系统:检测多语言、多文化和多事件的在线极化,该系统通过三个子任务识别22种语言中的极化社交媒体内容:二元检测、目标分类和表现形式识别。我们提出了一种异构集成的多语言预训练模型,结合了XLM-RoBERTa-large和mDeBERTa-v3-base。我们研究了多任务学习、基于翻译的数据增强和类别加权等技术,以改善在严重标签不平衡情况下的分类性能。我们的研究结果表明,独立任务建模结合类别加权更为有效。
cs.CL / 37 / 2605.06241

Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

重新思考大语言模型推理中的强化学习:这是稀疏策略选择,而非能力学习
Akgül, Ömer Faruk, Kannan, Rajgopal, Neiswanger, Willie, Prasanna, Viktor
Abstract
Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1-3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
Chinese Translation
强化学习已成为提升大语言模型推理能力的标准方法,但越来越多的证据表明,强化学习并未教授新的策略;它只是重新分配了基础模型已经包含的解决方案的概率质量。在本研究中,我们提出了一个问题:如果强化学习仅仅引导模型走向它已经知道的路径,那么强化学习优化循环本身是否是必要的?通过对多个模型家族和强化学习算法进行的令牌级分析,我们发现强化学习的有益影响是一种稀疏、可预测的修正,集中在模型不确定选择哪条分支的高熵决策点上。只有1-3%的令牌位置受到影响,提升的令牌总是位于基础模型的前五个备选项之内,而在这些少数位置的有针对性的修正因果性地恢复了强化学习准确性增益的很大一部分,而随机修正则失败。基础模型自身的熵能够在没有任何经过强化学习训练的模型的情况下识别这些位置,并且整个修正是低维的,可以用极少的模型参数表示。这些发现将推理改进重新框架为稀疏策略选择,而非能力获取。我们将这一见解转化为ReasonMaxxer,一种最小的无强化学习方法,仅在熵门控的决策点应用对比损失,使用几百个基础模型的回滚而不进行在线生成。在三个模型家族、六个规模和六个数学推理基准测试中,ReasonMaxxer的性能与完整的强化学习相匹配或超过,同时只需数十个问题和几分钟的单GPU训练,训练成本减少了大约三个数量级。
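The abstract above says that the base model's own entropy identifies the few decision points where corrections matter. A minimal sketch of that gating step, assuming access to the next-token probability distribution at each position, might look like this. The 3% default mirrors the 1-3% figure in the abstract; the function names and the top-fraction selection rule are assumptions, and the contrastive loss itself is not shown.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_gated_positions(distributions, fraction=0.03):
    """Return indices of the highest-entropy positions, keeping at least one."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy next-token distributions over a four-token vocabulary.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.90, 0.10, 0.00, 0.00],  # confident
    [0.40, 0.30, 0.20, 0.10],  # uncertain
]
print(entropy_gated_positions(dists, fraction=0.5))
```

Only the selected positions would then receive a training signal; every other token is left untouched, which is what keeps the correction sparse.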
cs.CL / 38 / 2605.06276

Linear Semantic Segmentation for Low-Resource Spoken Dialects

低资源口语方言的线性语义分割
Chirkunov, Kirill, Samih, Younes, Freihat, Abed Alhakim, Aldarmaki, Hanan
Abstract
Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource spoken varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard segmentation approaches. In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse. The benchmark covers transcribed casual telephone conversations, code-switched podcasts, broadcast news, and expressive dialogue from novels, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models performing well on MSA news genres degrade on dialectal transcribed speech. We further propose a segmentation model that targets local semantic coherence and robustness to discourse discontinuities, consistently outperforming strong baselines on dialectal non-news genres. The benchmark and approach generalize to other low-resource spoken languages.
Chinese Translation
语义分割是话语分析的核心组成部分,但现有模型主要是在高资源书面文本上开发和评估的,这限制了它们在低资源口语变体上的有效性。特别是,方言阿拉伯语表现出非正式的语法、代码切换和弱标记的话语结构,这对标准分割方法构成了挑战。在本文中,我们介绍了一个新的多体裁基准(超过1000个样本),用于会话阿拉伯语的语义分割,重点关注方言话语。该基准涵盖了转录的休闲电话对话、代码切换的播客、广播新闻和小说中的表现性对话,并由母语阿拉伯语注释者进行了注释和验证。利用该基准,我们展示了在现代标准阿拉伯语(MSA)新闻体裁上表现良好的分割模型在方言转录语音上表现不佳。我们进一步提出了一种针对局部语义连贯性和对话中断的稳健性的分割模型,在方言非新闻体裁上始终优于强基线。该基准和方法可推广到其他低资源口语语言。
cs.CL / 39 / 2605.06283

Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement

量化评分标准修改对人类与自动评分者一致性的统计影响
Huynh, Jessica, Gomez, Alfredo, Deviyani, Athiya, Shelby, Renee, Bigham, Jeffrey P., Diaz, Fernando
Abstract
Autoraters, also referred to as LLM-as-judges, are increasingly used for evaluation and automated content moderation. However, there is limited statistical analysis of how modifications in a rubric presented to both humans and autoraters affect their score agreement. Rubrics that ask for an overall or holistic judgment (for example, rating the "quality" of an essay) may be inconsistently interpreted due to the complexity or subjectivity of the criteria. Conversely, rubrics can ask for analytic judgments, which decompose assessment criteria (for example, "quality" into "fluency" and "organization"). While these rubrics can be edited to improve the individual accuracy of both human and automated scoring, this approach may result in disagreement between the two scores, or with the associated holistic judgment. Designing and deploying reliable autoraters requires understanding not just the relationship between human and autorater annotations but how that relationship changes as holistic or analytic judgments are elicited. The results indicate that rubric edits that provide representative examples and additional context, and that reduce positional bias in the rubric, increased human-autorater agreement, while higher rubric complexity and conservative aggregation methods tended to decrease it. The findings from the automatic essay scoring and instruction-following evaluation domains suggest that practitioners should carefully analyze domain- and rubric-specific performance to move towards higher human-autorater agreement.
Chinese Translation
自动评分者(也称为 LLM-as-judges)在评估和自动内容审核中被越来越多地使用。然而,关于呈现给人类和自动评分者的评分标准修改如何影响其评分一致性的统计分析仍然有限。要求进行总体或整体(holistic)判断的评分标准(例如,对一篇论文的“质量”进行评分)可能由于标准的复杂性或主观性而被不一致地解读。相反,评分标准可以要求进行分析性(analytic)判断,将评估标准分解(例如,将“质量”分解为“流畅性”和“组织性”)。虽然这些评分标准可以被编辑以提高人类和自动评分的个体准确性,但这种方法可能导致两者评分之间的不一致,或与相关的整体判断不一致。设计和部署可靠的自动评分者需要理解人类与自动评分者注释之间的关系,以及这种关系在引导整体或分析性判断时如何变化。结果表明,提供代表性示例和额外背景的评分标准修改,以及减少评分标准中的位置偏见,增加了人类与自动评分者的一致性,而更高的评分标准复杂性和保守的聚合方法则倾向于降低一致性。来自自动论文评分和遵循指令评估领域的发现表明,实践者应仔细分析特定领域和评分标准的表现,以提高人类与自动评分者的一致性。
cs.CL / 40 / 2605.06285

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

LatentRAG:高效代理性RAG的潜在推理与检索
Zheng, Yijia, Worring, Marcel
Abstract
Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due to the autoregressive generation of lengthy thoughts and subqueries. To address this limitation, we propose LatentRAG, a novel framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Unlike existing explicit methods that generate natural language thoughts or subqueries token-by-token, LatentRAG produces latent tokens for thoughts and subqueries directly from the hidden states in a single forward pass. We align LLMs with dense retrieval models in the latent space, enabling retrieval over latent subquery tokens and supporting end-to-end joint optimization. To improve transparency and encourage semantically meaningful latent representations, we incorporate a parallel latent decoding mechanism that translates latent tokens back into natural language. Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with traditional single-step RAG.
Chinese Translation
单步检索增强生成(RAG)为简单问答任务提供了一种有效的方式来整合外部信息,但在处理复杂问题时表现不佳。代理性RAG通过将单步检索替换为多步过程来扩展这一范式,其中大型语言模型(LLM)充当搜索代理,生成中间思维和子查询,以迭代方式与检索系统交互。这一迭代过程由于自回归生成冗长的思维和子查询而导致了显著的延迟。为了解决这一限制,我们提出了LatentRAG,这是一种新颖的框架,将推理和检索从离散语言空间转移到连续潜在空间。与现有的显式方法逐字生成自然语言思维或子查询不同,LatentRAG直接从隐藏状态中在单次前向传播中生成潜在的思维和子查询标记。我们在潜在空间中将LLM与密集检索模型对齐,使得对潜在子查询标记的检索成为可能,并支持端到端的联合优化。为了提高透明度并鼓励语义上有意义的潜在表示,我们引入了一种并行潜在解码机制,将潜在标记翻译回自然语言。在七个基准数据集上的广泛实验表明,LatentRAG的性能与显式代理性RAG方法相当,同时将推理延迟降低了约90%,大幅缩小了与传统单步RAG的延迟差距。
cs.CL / 41 / 2605.06294

Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text

对数似然、辛普森悖论与机器生成文本的检测
Kempton, Tom, Drobnyi, Viktor, Madigan, Maeve, Burrell, Stuart
Abstract
The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more probable to a detector language model than human-written text. However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure, as most detectors do, causes a form of Simpson's paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, we introduce a learned local calibration step grounded in Bayesian decision theory. Rather than aggregating raw token scores, we first learn lightweight predictors of the score distributions conditioned on position in hidden space, and aggregate calibrated log-likelihood ratios instead. This single intervention dramatically and consistently improves detection performance across all baseline detectors and all datasets we consider. For example, our calibrated variant of Fast-DetectGPT improves AUROC from 0.63 to 0.85 on GPT-5.4 text, and a locally-calibrated DMAP detector we introduce achieves state-of-the-art performance across the board. That said, our central contribution is not a new detector, but a precise diagnosis of a significant cause of under-performance of existing detectors and a principled, modular remedy compatible with any token-averaging pipeline. This will serve as a foundation for the community to build upon, with natural avenues including richer distributional models, improved calibration strategies, and principled ensembling with hidden-space geometry signals via the full Bayes-optimal decision rule.
Chinese Translation
可靠地区分人类撰写的文本与大型语言模型生成的文本具有深远的社会意义。解决这一问题的主流方法利用了似然假设:机器生成的文本在检测语言模型中应显得比人类撰写的文本更为可能。然而,我们证明了区分人类文本和机器文本的标记级信号在检测模型的隐藏空间中并不均匀,大多数检测器通过在具有根本不同统计结构的区域之间简单平均基于似然的标记分数,导致了一种辛普森悖论:强局部信号被不当聚合所摧毁。为了解决这一问题,我们引入了基于贝叶斯决策理论的学习局部校准步骤。我们首先学习基于隐藏空间位置的分数分布的轻量级预测器,而不是简单聚合原始标记分数,然后聚合校准后的对数似然比。这一单一干预显著且持续地提高了我们考虑的所有基线检测器和所有数据集的检测性能。例如,我们的校准版 Fast-DetectGPT 在 GPT-5.4 文本上的 AUROC 从 0.63 提升至 0.85,而我们引入的局部校准 DMAP 检测器在各方面达到了最先进的性能。尽管如此,我们的核心贡献并不是一个新的检测器,而是对现有检测器性能不足的一个重要原因的精确诊断,以及一种与任何标记平均管道兼容的原则性、模块化的补救措施。这将为社区提供一个基础,未来可以在此基础上进行更丰富的分布模型、改进的校准策略,以及通过完整的贝叶斯最优决策规则与隐藏空间几何信号的原则性集成。
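The aggregation failure described in the abstract above can be reproduced with a small toy example. The sketch below assumes two discrete hidden-space regions with Gaussian score distributions of known mean and variance; the paper instead learns lightweight predictors over a continuous hidden space, so the region labels, the statistics in `REGION_STATS`, and the Gaussian form are all illustrative assumptions. Within each region, machine tokens score higher than human tokens, yet the raw average ranks the texts the wrong way round because they occupy different regions.

```python
import math

# Hypothetical per-region score statistics: (machine_mean, human_mean, shared_std).
REGION_STATS = {"A": (2.0, 1.0, 1.0), "B": (-1.0, -2.0, 1.0)}


def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def raw_average(tokens):
    """Naive detector: average the raw token scores, ignoring region."""
    return sum(score for _, score in tokens) / len(tokens)


def calibrated_llr(tokens):
    """Average per-token log-likelihood ratio (machine vs. human), conditioned on region."""
    total = 0.0
    for region, score in tokens:
        mu_m, mu_h, sigma = REGION_STATS[region]
        total += gaussian_logpdf(score, mu_m, sigma) - gaussian_logpdf(score, mu_h, sigma)
    return total / len(tokens)


# Each token is (region, score). The machine text sits mostly in low-scoring region B.
machine_text = [("A", 2.0), ("B", -1.0), ("B", -1.0), ("B", -1.0)]
human_text = [("A", 1.0), ("A", 1.0), ("A", 1.0), ("B", -2.0)]

# Raw averaging gives the human text the higher (more machine-like) score.
print(raw_average(machine_text), raw_average(human_text))
# Region-conditioned log-likelihood ratios rank the two texts correctly.
print(calibrated_llr(machine_text), calibrated_llr(human_text))
```

The local signal (a gap of one standard deviation in every region) is identical in both computations; only the aggregation differs, which is the point the abstract makes about Simpson's paradox.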
cs.CL / 42 / 2605.06309

MultiLinguahah: A New Unsupervised Multilingual Acoustic Laughter Segmentation Method

MultiLinguahah:一种新的无监督多语言声学笑声分割方法
Sofia, Callejas, Nahuel, Gomez, Catherine, Pelachaud, Brian, Ravenet, Valentin, Barriere
Abstract
Laughter is a social non-vocalization that is universal across cultures and languages, and is crucial for human communication, including social bonding and communication signaling. However, detecting laughter in audio is a challenging task, and segmenting it is even more difficult. Currently, machine learning methods generally rely on costly manual annotation, and their datasets are mostly based on English contexts. Thus, we propose an unsupervised multilingual method that sets up the laughter segmentation task as anomaly detection over energy-based segmented audio sequences. Our method applies an Isolation Forest to audio representations learned by a BYOL-A encoder. We compare our method with several state-of-the-art laughter detection algorithms on four datasets, including stand-up comedy, sitcoms, and general short audio from AudioSet. Our results show that state-of-the-art methods are not optimized for multilingual contexts, while our method outperforms them in non-English settings.
Chinese Translation
笑声是一种跨文化和语言的社会非语言现象,对于人类沟通至关重要,包括社会联系和沟通信号。然而,在音频中检测笑声是一项具有挑战性的任务,而分割更是困难。目前,机器学习方法通常依赖于昂贵的手动标注,其数据集大多基于英语环境。因此,我们提出了一种无监督的多语言方法,将笑声分割任务设定为基于能量的分段音频序列的异常检测。我们的方法在从BYOL-A编码器学习的音频表示上应用了孤立森林(Isolation Forest)。我们将我们的方法与几种最先进的笑声检测算法在四个数据集上进行了比较,包括单口喜剧、情景喜剧和来自AudioSet的一般短音频。我们的结果表明,最先进的方法并未针对多语言环境进行优化,而我们的方法在非英语环境中表现优于它们。
cs.CL / 43 / 2605.06318

Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation

谁与什么?利用语言特征和标注者特征分析标注变异
Maurer, Maximilian, Linde, Maximilian, Lapesa, Gabriella
Abstract
Human label variation has been established as a central phenomenon in NLP: the perspectives different annotators have on the same item need to be embraced. Data collection practices thus shifted towards increasing the annotator numbers and releasing disaggregated datasets, harmful language being most resourced due to its high subjectivity. While this resulted in rich information about "who" annotated (sociodemographics, attitudes, etc.), the "what" (e.g., linguistic properties of items) and the interplay between the two have received little attention. We present the first large-scale analysis of four reference datasets for harmful language detection, bringing together annotator characteristics, linguistic properties of the items, and their interactions in a statistically informed picture. We find that interactions are crucial, revealing intersectional effects ignored in previous work, and that a strong role is played by lexical cues and annotator attitudes. Effect patterns, however, vary considerably across datasets. This urges caution about generalization and transferability.
Chinese Translation
人类标注变异已被确立为自然语言处理(NLP)中的一个核心现象:不同标注者对同一项目的看法需要被重视。因此,数据收集实践转向增加标注者数量并发布分解数据集,尤其是有害语言因其高度主观性而被大量资源支持。尽管这导致了关于标注者的丰富信息(如社会人口统计、态度等),但关于标注的内容(例如,项目的语言特性)及其相互作用却鲜有关注。我们呈现了对四个有害语言检测参考数据集的首次大规模分析,将标注者特征、项目的语言特性及其相互作用结合在一个统计信息丰富的框架中。我们发现,相互作用至关重要,揭示了以往研究中被忽视的交叉效应,并且词汇线索和标注者态度在其中扮演了重要角色。然而,效应模式在不同数据集之间差异显著。这提醒我们在推广和转移时需谨慎。
cs.CL / 44 / 2605.06326

Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

教授思维模型以工具进行推理:工具集成推理的完整流程方案
Cheng, Qianjia, Zhang, Yuchen, Wang, Zhilin, Zuo, Yuxin, Zhang, Shunkai, Fan, Yuchen, Qiao, Yu, Zhou, Bowen, Ding, Ning, Cheng, Yu, Luo, Yun, Cui, Ganqu
Abstract
Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls. In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effectiveness of TIR supervised fine-tuning (SFT) hinges on the learnability of teacher trajectories, which should prioritize problems inherently suited for tool-augmented solutions; (ii) controlling the proportion of tool-use trajectories could mitigate the catastrophic forgetting of text-only reasoning capacity; (iii) optimizing for pass@k and response length instead of training loss could maximize TIR SFT gains while preserving headroom for reinforcement learning (RL) exploration; (iv) a stable RL with verifiable rewards (RLVR) stage, built upon suitable SFT initialization and explicit safeguards against mode collapse, provides a simple yet remarkably effective solution. When applied to Qwen3 thinking models at 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance in a wide range of benchmarks among open-source models, such as 96.7% and 99.2% on AIME 2025 for 4B and 30B, respectively.
Chinese Translation
工具集成推理(Tool-integrated reasoning, TIR)提供了一种直接的方式,将思维模型扩展到超越仅依赖文本的推理限制。矛盾的是,我们观察到,即使强大的思维模型几乎不调用实际工具,工具启用的评估仍然可能降低推理性能。本文探讨如何在不牺牲无工具推理能力的情况下,将自然的工具使用行为注入强大的思维模型,并提出了一种全面的TIR方案。我们强调:(i)TIR监督微调(Supervised Fine-Tuning, SFT)的有效性依赖于教师轨迹的可学习性,这应优先考虑那些本质上适合工具增强解决方案的问题;(ii)控制工具使用轨迹的比例可以减轻文本推理能力的灾难性遗忘;(iii)优化pass@k和响应长度而非训练损失可以在保留强化学习(Reinforcement Learning, RL)探索的余地的同时,最大化TIR SFT的收益;(iv)基于合适的SFT初始化和针对模式崩溃的明确保护措施构建的稳定RL与可验证奖励(RL with Verifiable Rewards, RLVR)阶段,提供了一种简单而有效的解决方案。当应用于4B和30B规模的Qwen3思维模型时,我们的方案使模型在一系列基准测试中实现了开源模型中的最先进性能,例如在AIME 2025上,4B和30B模型分别达到了96.7%和99.2%的成绩。
cs.CL / 45 / 2605.06327

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

测量开放权重大语言模型中的评估上下文差异:一种配对提示协议及对对齐管道特定异质性的初步证据
Burnat, Florian A. D., Davidson, Brittany I.
Abstract
Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in behavior induced by framing a fixed task as an evaluation, a live deployment interaction, or a neutral request, and present a paired-prompt protocol that measures it in open-weight LLMs while controlling for paraphrase variation, benchmark familiarity, and judge framing-sensitivity. Across five instruction-tuned checkpoints from four open-weight families plus a matched OLMo-3 base/instruct ablation ($20$ paired items, $840$ generations per checkpoint), we find striking heterogeneity. OLMo-3-Instruct alone is eval-cautious -- evaluation framing raises refusal vs. neutral by $11.8$pp ($p=0.007$) and reduces harmful compliance vs. deployment by $3.6$pp ($p=0.024$, $0/20$ items inverted) -- while Mistral-Small-3.2, Phi-3.5-mini, and Llama-3.1-8B are deployment-cautious, with marginal eval-vs-deployment refusal effects of $-9$ to $-20$pp. The matched OLMo-3 base also exhibits the deployment-cautious pattern, identifying alignment as the inversion stage; within Llama-3.1, the $70$B model preserves direction with attenuated magnitude, ruling out a simple "small-model effect that reverses at scale." One caveat: the cross-family heterogeneity is judge-dependent. Re-judging with a different-family safety classifier (Llama-Guard-3-8B) preserves the within-OLMo eval-cautious direction but flattens the cross-family contrast, indicating that the two judges operationalize distinct constructs.
Chinese Translation
安全基准通常被视为关于语言模型在部署后行为的证据,但如果行为依赖于提示是否看起来像评估,这种推断就显得脆弱。我们将评估上下文差异定义为通过将固定任务框架设定为评估、实时部署交互或中性请求而引起的可观察的项内行为变化,并提出了一种配对提示协议,以在开放权重的大语言模型中测量这一差异,同时控制同义改写变异、基准熟悉度和评审者的框架敏感性。在来自四个开放权重系列的五个指令调优检查点以及一个匹配的OLMo-3基础/指令消融($20$对配对项,每个检查点$840$次生成)中,我们发现了显著的异质性。OLMo-3-Instruct单独表现出评估谨慎——评估框架使拒绝率与中性请求相比提高了$11.8$个百分点($p=0.007$),并使有害合规性与部署相比降低了$3.6$个百分点($p=0.024$,$0/20$项反转)——而Mistral-Small-3.2、Phi-3.5-mini和Llama-3.1-8B则表现出部署谨慎,评估与部署的拒绝效应边际为$-9$到$-20$个百分点。匹配的OLMo-3基础也表现出部署谨慎的模式,识别出对齐作为反转阶段;在Llama-3.1中,$70$B模型保持了方向但幅度减弱,排除了简单的“在规模上反转的小模型效应”。一个警告:跨系列的异质性依赖于评审者。使用不同系列的安全分类器(Llama-Guard-3-8B)重新评审时,保持了OLMo内部的评估谨慎方向,但平坦化了跨系列对比,表明这两位评审者操作化了不同的构念。
cs.CL / 46 / 2605.06334

MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

MANTRA:为工具使用的大型语言模型代理合成 SMT 验证的合规基准
Anand, Ashwani, Chatzi, Ivi, Raha, Ritam, Schmuck, Anne-Kathrin
Abstract
Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack reliability for complex, long-horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace-level compliance checks for a given task, and validates their consistency using SMT solving. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback. This yields benchmarks that are formally validated. Importantly, MANTRA supports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity which is utilized to automatically derive challenging tasks accompanying compliance checks. Using MANTRA, we build a new benchmark suite with 285 tasks across 6 domains scaling to 50+ page manuals with minimal human effort. Empirically, we show that the compliance checks are richer with stronger constraint enforcement compared to existing benchmarks. Additionally, the granularity of the checks can be used for debugging the agents' failure modes. These results demonstrate that combining automated benchmark generation with formally grounded validation methods enables scalable and reliable benchmarking of tool-using agents.
Chinese Translation
工具使用的大型语言模型(LLM)代理越来越多地被部署在其可靠行为受到严格程序手册约束的环境中。确保这些代理遵循手册中的规则具有挑战性,因为这些手册通常是用自然语言为人类编写的,而代理的行为则表现为工具调用的执行轨迹。现有的 LLM 代理评估依赖于手动构建的基准或基于 LLM 的评判,这两者在面对复杂的、长时间跨度的手册时要么无法扩展,要么缺乏可靠性。为了解决这些局限性,我们提出了 MANTRA,一个从自然语言手册和工具模式中自动合成机器可检查合规基准的框架。MANTRA 独立生成 (i) 捕捉程序依赖关系的符号世界模型,以及 (ii) 针对给定任务的一组轨迹级合规检查,并使用 SMT 求解验证其一致性。一个结构化的修复循环解决不一致性,仅在必要时需要人工干预。这产生了形式上经过验证的基准。重要的是,MANTRA 支持任意领域和长程序手册,并提供可调的任务复杂性概念,用于自动推导伴随合规检查的挑战性任务。使用 MANTRA,我们构建了一个新的基准套件,包含跨越 6 个领域的 285 个任务,能够扩展到 50 页以上的手册,且人力投入最小。从实证上看,我们展示了与现有基准相比,合规检查在约束执行上更为丰富。此外,检查的粒度可用于调试代理的失败模式。这些结果表明,将自动基准生成与形式化验证方法相结合,使得工具使用代理的基准测试具备可扩展性和可靠性。
cs.CL / 47 / 2605.06342

Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

保持专注:通过关键正交投影进行激活引导
Luo, Haoyan, Zarlenga, Mateo Espinosa, Jamnik, Mateja
Abstract
Activation steering controls LLM behaviour towards target behaviour by intervening in internal representations, yet it often degrades reasoning and retrieval performance. We argue that a primary cause of this trade-off is attention rerouting: steering vectors alter query-key matching, shifting attention away from contextually important tokens toward less informative ones. To address this, we propose Steering via Key-Orthogonal Projections (SKOP), a steering method that constrains harmful attention rerouting without eliminating steering efficacy. SKOP achieves this by preserving attention patterns on a small set of focus tokens the model relies on for reasoning and retrieval, while allowing redistribution among less critical tail tokens. Across multiple steering benchmarks, we show that SKOP achieves the best joint steering-utility trade-off, reducing utility degradation by 5-7x while retaining over 95% of vanilla steering efficacy. Our results further suggest that, in long-context retrieval settings where vanilla steering approaches are ineffective, SKOP can maintain robust performance by avoiding attention rerouting.
Chinese Translation
激活引导通过干预内部表征来控制大语言模型(LLM)的行为,以达到目标行为,但这往往会降低推理和检索性能。我们认为,这种权衡的主要原因是注意力重定向:引导向量改变查询-键匹配,导致注意力从上下文重要的标记转移到信息量较少的标记上。为了解决这个问题,我们提出了通过关键正交投影(Steering via Key-Orthogonal Projections, SKOP)的方法,这是一种在不消除引导有效性的情况下约束有害注意力重定向的引导方法。SKOP通过保持模型在推理和检索中依赖的一小组焦点标记上的注意力模式,同时允许在不太重要的尾部标记之间重新分配注意力,从而实现这一目标。在多个引导基准测试中,我们展示了SKOP在引导效用权衡方面达到了最佳效果,减少了效用降级5-7倍,同时保留了超过95%的原始引导效能。我们的结果进一步表明,在原始引导方法无效的长上下文检索环境中,SKOP能够通过避免注意力重定向来保持稳健的性能。
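The key-orthogonal idea can be illustrated with a small linear-algebra sketch. This is only one simplified reading of the abstract, not the SKOP algorithm itself: it assumes the steering vector is added on the query side, and it removes the component of that vector lying in the span of the focus-token keys so that query-key dot products with those keys are unchanged. All names and the toy vectors are hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors that are (numerically) linearly dependent
            basis.append([wi / norm for wi in w])
    return basis


def project_out(steer, focus_keys):
    """Remove from `steer` every component lying in span(focus_keys)."""
    out = list(steer)
    for b in orthonormal_basis(focus_keys):
        coeff = dot(out, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out


# Toy 4-dimensional example.
focus_keys = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]  # keys of "focus" tokens
tail_key = [0.0, 0.0, 1.0, 0.0]                            # key of a "tail" token
steer = [0.5, -0.3, 0.8, 0.1]

projected = project_out(steer, focus_keys)
focus_shifts = [dot(projected, k) for k in focus_keys]  # logit change on focus tokens
tail_shift = dot(projected, tail_key)                   # logit change on the tail token
```

After projection the steering vector leaves the attention logits of the focus tokens untouched, while the logit of the tail token can still move, which matches the abstract's description of preserving focus-token attention while allowing redistribution among tail tokens.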
cs.CL / 48 / 2605.06353

SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following

SEQUOR:一个用于现实约束遵循的多轮基准测试
Canaverde, Beatriz, Alves, Duarte M., Pombal, José, Attanasio, Giuseppe, Martins, André F. T.
Abstract
In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks. To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations. SEQUOR consists of simulated persona-driven interactions built with constraints extracted from real-world conversations. Our results show that even when following a single constraint, instruction-following accuracy consistently decreases as the conversation grows longer, with drops exceeding 11%. This decline becomes larger when models have to follow multiple constraints simultaneously, reducing their accuracy by over 40%. In scenarios where constraints are added or replaced at arbitrary points of the conversation, model accuracy decreases by more than 9%. Taken together, our results reveal that current models still struggle to follow user instructions in multi-turn conversations, and provide a way for better measuring instruction-following capabilities in assistants.
Chinese Translation
在对话中,一个有帮助的助手必须可靠地遵循用户指令,即使这些指令经过修正、修改或与先前请求相矛盾。然而,大多数指令遵循基准测试集中于单轮或短多轮场景,尚未探讨模型在长时间指令遵循任务中的表现。为了解决这一问题,我们提出了SEQUOR,这是一个用于评估长多轮对话中约束遵循的自动基准测试。SEQUOR由基于真实世界对话提取的约束构建的模拟角色驱动交互组成。我们的结果显示,即使在遵循单一约束时,随着对话的延续,指令遵循的准确性也会持续下降,下降幅度超过11%。当模型需要同时遵循多个约束时,这一下降幅度更大,准确性降低超过40%。在对话的任意点添加或替换约束的场景中,模型的准确性下降超过9%。综合来看,我们的结果揭示了当前模型在多轮对话中仍然难以遵循用户指令,并提供了一种更好地衡量助手指令遵循能力的方法。
cs.CL / 49 / 2605.06403

GATHER: Convergence-Centric Hyper-Entity Retrieval for Zero-Shot Cell-Type Annotation

GATHER:面向收敛的超实体检索用于零样本细胞类型注释
Zhang, Zhonghui, Jiang, Feng, Qin, Shaowei, Zhao, Jiahao, Yang, Min
Abstract
Zero-shot single-cell cell-type annotation aims to determine a cell's type from a given set of expressed genes without any training. Existing knowledge-graph-based RAG approaches retrieve evidence by expanding from source entities and relying on iterative LLM reasoning. However, in this setting each query contains tens to hundreds of genes, where no single gene is decisive and the label emerges only from their collective co-occurrence. Such hyper-entity queries fundamentally challenge local, entity-wise exploration strategies, which reason from individual genes, leading to poor scalability and substantial LLM cost. We propose GATHER (Graph-Aware Traversal with Hyper-Entity Retrieval), a convergence-centric retriever tailored to hyper-entity queries. It performs global multi-source graph traversal and identifies topological convergence points -- nodes jointly reachable from many input genes. These convergence nodes act as high-information hyper-entities that capture entity synergy. By incorporating node- and path-importance scoring, GATHER selects informative evidence entirely without LLM involvement during retrieval. Instantiated on a self-constructed cell-centric biological knowledge graph (VCKG), GATHER outperforms strong KG-RAG baselines (ToG, ToG-2, RoG, PoG) on two datasets (Immune and Lung), achieving the highest exact-match accuracy (27.45% and 59.64%) with only a single LLM call per sample, compared to 2--61 calls for KG-RAG baselines. Our results demonstrate that convergence nodes compress multi-entity signals into compact, high-information evidence that conveys more per item than multi-hop paths, providing an efficient global alternative to local entity-wise reasoning.
Chinese Translation
零样本单细胞细胞类型注释旨在从给定的一组表达基因中确定细胞的类型,而无需任何训练。现有的基于知识图谱的RAG方法通过从源实体扩展并依赖于迭代的LLM推理来检索证据。然而,在这种情况下,每个查询包含数十到数百个基因,其中没有单个基因是决定性的,标签仅从它们的集体共现中产生。这种超实体查询从根本上挑战了基于局部实体的探索策略,这些策略从单个基因推理,导致可扩展性差和LLM成本高昂。我们提出了GATHER(图感知超实体检索的遍历),一种针对超实体查询量身定制的以收敛为中心的检索器。它执行全球多源图遍历并识别拓扑收敛点——从多个输入基因共同可达的节点。这些收敛节点充当高信息量的超实体,捕捉实体协同作用。通过结合节点和路径重要性评分,GATHER在检索过程中完全不涉及LLM,从而选择信息丰富的证据。在自构建的以细胞为中心的生物知识图谱(VCKG)上实例化,GATHER在两个数据集(免疫和肺)上超越了强大的KG-RAG基线(ToG、ToG-2、RoG、PoG),实现了最高的精确匹配准确率(27.45%和59.64%),每个样本仅需一次LLM调用,而KG-RAG基线则需要2到61次调用。我们的结果表明,收敛节点将多实体信号压缩为紧凑的高信息证据,相较于多跳路径,每个项目传达的信息更多,为局部实体推理提供了一种高效的全球替代方案。
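To make the notion of a topological convergence point concrete, the sketch below runs a bounded breadth-first traversal from every input gene and ranks nodes by how many distinct sources reach them. It is an assumption-laden toy, not GATHER itself: the graph, the hop limit, the `min_support` threshold, and the node names are hypothetical, and the node- and path-importance scoring mentioned in the abstract is omitted.

```python
from collections import Counter, deque


def reachable(graph, source, max_hops):
    """Nodes reachable from `source` within `max_hops` edges (source excluded)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(source)
    return seen


def convergence_nodes(graph, sources, max_hops=2, min_support=2):
    """Rank non-source nodes by the number of source genes that reach them."""
    support = Counter()
    for s in sources:
        support.update(reachable(graph, s, max_hops))
    source_set = set(sources)
    ranked = [
        (node, count)
        for node, count in support.items()
        if count >= min_support and node not in source_set
    ]
    # Highest support first; ties broken alphabetically for determinism.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical toy graph: three genes, two intermediate nodes, one shared hub.
graph = {
    "g1": ["p1"],
    "g2": ["p1", "p2"],
    "g3": ["p2"],
    "p1": ["hub"],
    "p2": ["hub"],
}
ranking = convergence_nodes(graph, ["g1", "g2", "g3"])
```

Here `hub` is reached by all three genes within two hops and therefore ranks first, ahead of the intermediate nodes that only two genes share. No LLM call is involved in this retrieval step.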
cs.CL / 50 / 2605.06416

MiA-Signature: Approximating Global Activation for Long-Context Understanding

MiA-Signature:用于长上下文理解的全局激活近似
Li, Yuqing, Li, Jiangnan, Yu, Mo, Lin, Zheng, Wang, Weiping, Zhou, Jie
Abstract
A growing body of work in cognitive science suggests that reportable conscious access is associated with "global ignition" over distributed memory systems, while such activation is only partially accessible, as individuals cannot directly access or enumerate all activated contents. This tension suggests a plausible mechanism: cognition may rely on a compact representation that approximates the global influence of activation on downstream processing. Inspired by this idea, we introduce the concept of the Mindscape Activation Signature (MiA-Signature), a compressed representation of the global activation pattern induced by a query. In LLM systems, this is instantiated via submodular-based selection of high-level concepts that cover the activated context space, optionally refined through lightweight iterative updates using working memory. The resulting MiA-Signature serves as a conditioning signal that approximates the effect of the full activation state while remaining computationally tractable. Integrating MiA-Signatures into both RAG and agentic systems yields consistent performance gains across multiple long-context understanding tasks.
Chinese Translation
越来越多的认知科学研究表明,可报告的意识访问与分布式记忆系统的“全局点火”相关,而这种激活仅部分可访问,因为个体无法直接访问或枚举所有激活的内容。这种张力暗示了一种合理的机制,即认知可能依赖于一种紧凑的表示,近似激活对下游处理的全局影响。受此启发,我们引入了思维景观激活特征(Mindscape Activation Signature, MiA-Signature)的概念,这是由查询引发的全局激活模式的压缩表示。在大型语言模型(LLM)系统中,这通过基于子模块的高层概念选择来实现,覆盖激活的上下文空间,并可通过使用工作记忆的轻量级迭代更新进行选择性优化。最终得到的MiA-Signature作为一种条件信号,近似完整激活状态的效果,同时保持计算上的可处理性。将MiA-Signatures集成到RAG和自主系统中,在多个长上下文理解任务中均表现出一致的性能提升。
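The submodular selection step mentioned in the abstract can be sketched with the standard greedy algorithm for maximum coverage, a classic monotone submodular objective. The abstract does not state which objective or solver the authors use, so the coverage objective, the `greedy_cover` helper, and the toy concept-to-chunk mapping below are all assumptions.

```python
def greedy_cover(concept_to_chunks, budget):
    """Greedily pick up to `budget` concepts maximising the number of covered chunks.

    concept_to_chunks maps a concept name to the set of context chunks it covers.
    """
    covered = set()
    chosen = []
    for _ in range(budget):
        best, best_gain = None, 0
        for concept, chunks in concept_to_chunks.items():
            if concept in chosen:
                continue
            gain = len(chunks - covered)  # marginal coverage gain
            if gain > best_gain:
                best, best_gain = concept, gain
        if best is None:  # nothing adds new coverage
            break
        chosen.append(best)
        covered |= concept_to_chunks[best]
    return chosen, covered


# Hypothetical activated context: chunk ids 1..7 and four candidate concepts.
concepts = {
    "a": {1, 2, 3, 4},
    "b": {4, 5, 6},
    "c": {1, 2},
    "d": {7},
}
signature, covered = greedy_cover(concepts, budget=2)
```

With a budget of two, the greedy pass first takes `a` (four new chunks) and then `b` (two new chunks), skipping `c`, whose chunks are already covered. The selected concept list plays the role of the compact signature in this toy setting.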
cs.CL / 51 / 2605.06426

From 124 Million Tokens to 1,021 Neologisms: A Large-Scale Pipeline for Automatic Neologism Detection

从1.24亿个词元到1,021个新词:一个大规模自动新词检测管道
Rossini, Diego, van der Plas, Lonneke
Abstract
We present a scalable, modular pipeline for automatic neologism detection that combines rule-based filtering with LLM classification. The pipeline is grounded in two complementary word-formation frameworks, grammatical and extra-grammatical morphology, which jointly define the scope of what counts as a neologism and inform a four-class classification scheme (neologism, entity, foreign, none). While designed to be modular and transferable at the architectural level, the pipeline is instantiated on 527 million English-language Reddit posts spanning 2005-2024. From this corpus, we extract 124.6 million unique tokens and reduce them by over 99.99% to yield 1,021 neologism candidates, a set small enough for manual expert verification. Multiple LLMs independently classify each candidate via majority vote, with a final verification step, revealing substantial cross-model disagreement and highlighting the challenge of operationalizing neologism detection at scale. Manual annotation of all 1,021 candidates confirms that 599 (58.7%) are genuine lexical innovations. The pipeline code, vocabulary compilation scripts, and the annotated candidate list are available at https://github.com/DiegoRossini/neologism-pipeline.
Chinese Translation
我们提出了一种可扩展的模块化管道,用于自动新词检测,该管道结合了基于规则的过滤和大语言模型(LLM)分类。该管道基于两种互补的词汇构成框架,即语法形态学和超语法形态学,这两者共同定义了新词的范围,并为四类分类方案(新词、实体、外来词、无)提供依据。虽然该管道在架构层面上设计为模块化和可转移的,但其具体实现基于2005年至2024年间的5.27亿条英语Reddit帖子。从该语料库中,我们提取了1.246亿个独特词元,并将其减少超过99.99%,最终得到1,021个新词候选,这一数量足够进行人工专家验证。多个LLM通过多数投票独立对每个候选进行分类,并进行最终验证,结果显示模型间存在显著的分歧,突显了在大规模操作中实现新词检测的挑战。对所有1,021个候选的人工标注确认了其中599个(58.7%)是真正的词汇创新。管道代码、词汇编译脚本和标注候选列表可在https://github.com/DiegoRossini/neologism-pipeline获取。
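The majority-vote step over the four-class scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the linked repository: the function name is hypothetical, and returning `None` on a tie (to flag the candidate for the final verification step) is an assumed policy.

```python
from collections import Counter

LABELS = {"neologism", "entity", "foreign", "none"}  # classes named in the abstract


def majority_label(votes):
    """Return the label with strictly the most votes, or None on a tie."""
    unknown = set(votes) - LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(votes).most_common()
    if not counts:
        return None
    top_label, top_count = counts[0]
    if len(counts) > 1 and counts[1][1] == top_count:
        return None  # tie: defer to manual verification (assumed policy)
    return top_label


clear_case = majority_label(["neologism", "neologism", "entity"])
tied_case = majority_label(["neologism", "entity"])
```

A strict-majority rule like this makes cross-model disagreement visible: every tie surfaces as an explicit unresolved case instead of being silently assigned to whichever label happened to be counted first.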
cs.CL / 52 / 2605.06435

COVID-19 Infodemic. Understanding content features in detecting fake news using a machine learning approach

COVID-19 信息疫情:利用机器学习方法理解内容特征在假新闻检测中的作用
Vimala, Balakrishnan, Zing, Hii Lee, Eric, Laporte
Abstract
The use of content features, particularly textual and linguistic ones, for fake news detection is under-researched, despite empirical evidence showing that such features could help differentiate real and fake news. To this end, this study investigates a selection of content features, such as word bigrams and part-of-speech distribution, to improve fake news detection. We performed a series of experiments on a new dataset gathered during the COVID-19 pandemic, using Decision Tree, K-Nearest Neighbor, Logistic Regression, Support Vector Machine and Random Forest classifiers. Random Forest yielded the best results, followed closely by Support Vector Machine, across all setups. In general, both the textual and linguistic features were found to improve fake news detection when used separately; however, combining them into a single model did not improve the detection significantly. Differences were also noted between the use of bigrams and part-of-speech tags. The study shows that textual and linguistic features can be used successfully to detect fake news with a traditional machine learning approach as opposed to deep learning.
Chinese Translation
尽管有实证证据表明内容特征,特别是文本和语言特征可以帮助区分真实新闻和假新闻,但在假新闻检测中对这些特征的研究仍然不足。因此,本研究调查了一系列内容特征,如词对(word bigrams)、词性分布等,以改善假新闻检测。我们在COVID-19疫情期间收集的新数据集上进行了系列实验,使用了决策树(Decision Tree)、K近邻(K-Nearest Neighbor)、逻辑回归(Logistic Regression)、支持向量机(Support Vector Machine)和随机森林(Random Forest)等算法。在所有设置中,随机森林的效果最佳,其次是支持向量机。总体而言,文本和语言特征在单独使用时均能提高假新闻检测的效果,但将它们结合成一个模型并未显著改善检测效果。此外,使用词对和词性标签之间也存在差异。研究表明,文本和语言特征可以成功应用于传统机器学习方法中的假新闻检测,而非深度学习。
cs.CL / 53 / 2605.06476

Towards Emotion Consistency Analysis of Large Language Models in Emotional Conversational Contexts

面向情感对话背景下大型语言模型情感一致性分析
Oram, Sneha, Bhushan, Ojaswita, Bhattacharyya, Pushpak
Abstract
In this work, we conduct an analysis to examine the consistency of Large Language Models (LLMs) with respect to their own generated responses in an emotionally-driven conversational context. Specifically, the text generated by an LLM is framed as a query to the same model, and its responses are subsequently assessed. This is performed with three queries across two dimensions of extreme and moderate emotions. The three queries are, in particular, false claim queries that contain inherently wrong assumptions (false presuppositions) in increasing order of intensity. Two commercial models, Claude-3.5-haiku and GPT4o-mini, and a medium-sized model, Mistral-7B, are considered in the study. Our findings indicate that LLMs exhibit below-average performance and remain vulnerable to false beliefs embedded within queries. This susceptibility is especially pronounced for moderate emotional content. Furthermore, an extended attention-score-based analysis highlights a shift in models' priority from evaluative to generative. The results raise important considerations for LLMs' deployment in high-stakes, emotionally sensitive contexts.
Chinese Translation
在本研究中,我们对大型语言模型(LLMs)在情感驱动的对话背景下生成的响应一致性进行了分析。具体而言,LLM生成的文本被作为查询输入到同一模型中,随后评估其响应。该分析涉及三个查询,涵盖极端和适度情感两个维度。这三个查询特别是包含固有错误假设(虚假前提)的虚假声明查询,且其强度逐渐增加。研究中考虑了两个商业模型Claude-3.5-haiku、GPT4o-mini,以及一个中型模型Mistral-7B。我们的研究结果表明,LLMs的表现低于平均水平,并且对查询中嵌入的虚假信念仍然脆弱。这种脆弱性在适度情感内容中尤为明显。此外,扩展的基于注意力分数的分析突显了模型优先级从评估性转向生成性的变化。这些结果对LLMs在高风险、情感敏感的背景下的应用提出了重要的考虑。
cs.CL / 54 / 2605.06485

Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks

在消费级 CPU 上进行 Litespark 推理:用于三元神经网络的自定义 SIMD 核心
Dade, Nii Osae Osae, Morri, Tony, Rahat, Moinul Hossain, Pal, Sayandip
Abstract
Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging-Face, achieving 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction compared to standard PyTorch inference on Apple Silicon, with similar speedups on Intel and AMD processors.
Chinese Translation
大型语言模型(LLMs)已经改变了人工智能的格局,但它们的计算需求对大多数用户而言仍然过于高昂。标准推理需要昂贵的数据中心 GPU 或云 API 访问,导致超过十亿台个人计算机在 AI 工作负载中未得到充分利用。三元模型提供了一条前进的道路:它们的权重被限制为 {-1, 0, +1},理论上消除了对浮点乘法的需求。然而,现有框架未能利用这一结构,将三元模型视为密集的浮点网络。我们通过自定义 SIMD 核心来填补这一空白,用简单的加法和减法操作替代矩阵乘法,针对现代 CPU 上可用的整数点积指令进行优化。我们的实现 Litespark-Inference 可以通过 pip 安装,并与 Hugging-Face 直接集成,与标准的 PyTorch 推理相比,在 Apple Silicon 上实现了 9.2 倍的首次令牌响应时间加速、52 倍的吞吐量提升和 14 倍的内存减少,在 Intel 和 AMD 处理器上也实现了类似的加速效果。
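The claim that ternary weights remove the need for multiplication is easy to see in a scalar reference version. The sketch below is not the Litespark-Inference kernel and uses no SIMD; it only shows, with a hypothetical `ternary_matvec` helper, that a matrix-vector product with weights in {-1, 0, +1} reduces to additions and subtractions.

```python
def ternary_matvec(weights, x):
    """Matrix-vector product for weights restricted to {-1, 0, +1}.

    Each output element is built from additions and subtractions only.
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing and is skipped
        out.append(acc)
    return out


def dense_matvec(weights, x):
    """Reference implementation using ordinary multiplication."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


W = [[1, 0, -1],
     [-1, -1, 1]]
x = [2, 5, 3]
y = ternary_matvec(W, x)
```

A production kernel would additionally pack the weights into a compact bit representation and process many elements per instruction; those details are outside what the abstract specifies.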
cs.CL / 55 / 2605.06506

The Frequency Confound in Language-Model Surprisal and Metaphor Novelty

语言模型惊讶度与隐喻新颖性中的频率混淆
Momen, Omar, Zarrieß, Sina
Abstract
Language-model (LM) surprisal is widely used as a proxy for contextual predictability and has been reported to correlate with metaphor novelty judgments. However, surprisal is tightly intertwined with lexical frequency. We explore this interaction on metaphor novelty ratings using two different word frequency measures. We analyse surprisal estimates from eight Pythia model sizes and 154 training checkpoints. Across settings, word frequency is a stronger predictor of metaphor novelty than surprisal. Across training stages, the surprisal--novelty association peaks at an early stage and then falls again, mirroring a similarly timed increase in the surprisal--frequency association. These results suggest that the often-reported optimal LM surprisal settings may incorrectly associate contextual predictability with metaphor novelty and processing difficulty, whereas lexical frequency may be the major underlying factor.
Chinese Translation
语言模型(LM)惊讶度被广泛用作上下文可预测性的代理,并且与隐喻新颖性判断相关。然而,惊讶度与词汇频率紧密相连。我们使用两种不同的词频测量方法探讨这种与隐喻新颖性评分的互动。我们分析了来自八种 Pythia 模型规模和 154 个训练检查点的惊讶度估计。在各个设置中,词频是隐喻新颖性的更强预测因子,而不是惊讶度。在训练阶段中,惊讶度与新颖性的关联在早期阶段达到峰值,然后再次下降,这与惊讶度与频率的关联在同一时间段内的增加相呼应。这些结果表明,通常报告的最佳 LM 惊讶度设置可能错误地将上下文可预测性与隐喻新颖性和处理难度联系在一起,而词汇频率可能是主要的潜在因素。
cs.CL / 56 / 2605.06527

STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

STALE:大型语言模型代理能否知道他们的记忆何时不再有效?
Chao, Hanxiang, Bai, Yihan, Sheng, Rui, Li, Tianle, Sun, Yushi
Abstract
Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it, with even the best evaluated model achieving only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user's query, and they struggle to recognize when a change in one aspect of the user's state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.
Chinese Translation
大型语言模型(LLM)代理越来越被期望能够维持连贯的长期个性化记忆,然而当前的基准主要测量静态事实检索,忽视了在新证据出现时修正存储信念的能力。我们识别出一种关键且未被充分探讨的失效模式:隐式冲突(Implicit Conflict):后来的观察在没有明确否定的情况下使早期记忆失效,这需要上下文推理和常识推理来检测。为了严格评估这一能力,我们引入了STALE,一个包含400个专家验证的冲突场景的基准(在三个探测维度上有1200个评估查询),涵盖了100多个日常主题,文本上下文长度可达150K个标记。我们提出了一个三维探测框架,测试状态解决(State Resolution,检测先前信念已过时)、前提抵抗(Premise Resistance,拒绝错误假设过时状态的查询)和隐式政策适应(Implicit Policy Adaptation,主动在下游行为中应用更新的状态)。对前沿LLM和专门记忆框架的系统评估揭示了检索更新证据与对此采取行动之间的普遍差距,即使是评估最好的模型整体准确率也仅为55.2%。模型通常接受用户查询中嵌入的过时假设,并且在识别用户状态某一方面的变化应使相关记忆失效时存在困难。为了建立状态感知记忆的初步基线,我们进一步提出了CUPMem,一个通过结构化状态整合和传播感知搜索增强写时修订的原型,表明明确的状态裁决是增强代理记忆的一个有前景的方向。
cs.CL / 57 / 2605.06546

Efficient Pre-Training with Token Superposition

高效的预训练与令牌叠加
Peng, Bowen, Gigant, Théo, Quesnelle, Jeffrey
Abstract
Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) a highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert to standard training. We extensively evaluate TST at the scale of 270M and 600M parameters and validate on a 3B model and a 10B A1B mixture-of-experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms the baseline in loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
Chinese Translation
大型语言模型的预训练通常在规模上代价高昂且效率低下,需要复杂且侵入性的修改以实现高数据吞吐量。在本研究中,我们提出了一种简单的插入式方法——令牌叠加训练(Token-Superposition Training, TST),该方法在不修改并行性、优化器、分词器、数据或模型架构的情况下,显著提高了预训练期间每个浮点运算(FLOPs)的数据吞吐量。TST 分为两个阶段:第一阶段是高效的叠加阶段,我们将多个连续的令牌组合成一个包,并使用多热交叉熵(multi-hot cross-entropy, MCE)目标进行训练;第二阶段是恢复阶段,我们恢复到标准训练。我们在 2.7 亿和 6 亿参数的规模上对 TST 进行了广泛评估,并在 30 亿和 100 亿的 A1B 专家模型混合上进行了验证,证明其在不同设置下具有高度的鲁棒性。最终,TST 在基线损失和下游评估中始终表现优于基线,并且在相同损失设置下,TST 在 100 亿 A1B 规模下的总预训练时间减少了最多 2.5 倍。
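One plausible form of a multi-hot cross-entropy objective is sketched below. The abstract does not define MCE precisely, so this is an assumption: the target is taken to be the empirical distribution over the token ids in a bag, and the loss is the cross-entropy between that target and the softmax of a single logit vector. Function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def multi_hot_cross_entropy(logits, bag):
    """Cross-entropy against the empirical distribution of token ids in `bag`.

    A repeated token id in the bag receives proportionally more target mass.
    """
    logp = log_softmax(logits)
    return -sum(logp[token] for token in bag) / len(bag)


# Vocabulary of 4 tokens with uniform logits: every token has probability 1/4.
uniform_loss = multi_hot_cross_entropy([0.0, 0.0, 0.0, 0.0], bag=[0, 2])

# With a single-token bag the objective reduces to ordinary cross-entropy.
logits = [2.0, 0.5, -1.0, 0.0]
single_token_loss = multi_hot_cross_entropy(logits, bag=[1])
standard_ce = -log_softmax(logits)[1]
```

Under this reading, one forward pass supervises several tokens at once, which is consistent with the abstract's claim of higher data throughput per FLOP during the superposition phase.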
cs.CL / 58 / 2605.06548

Continuous Latent Diffusion Language Model

连续潜在扩散语言模型
Guo, Hongcan, Zhao, Qinyu, Zhao, Yian, Nie, Shen, Zhu, Rui, Guo, Qiushan, Wang, Feng, Yang, Tao, Zhao, Hengshuang, Wei, Guoqiang, Zeng, Yan
Abstract
Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and naturally extends to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, where generation quality and scaling behavior may better reflect model capability than likelihood, while also suggesting a concrete path toward unified modeling across discrete text and continuous modalities.
Chinese Translation
大型语言模型在自回归范式下取得了显著成功,但高质量的文本生成并不一定需要固定的从左到右的顺序。现有的替代方案仍然难以在生成效率、可扩展的表示学习和有效的全局语义建模之间实现共同的平衡。我们提出了 Cola DLM,一种分层潜在扩散语言模型,通过分层信息分解来框定文本生成。Cola DLM 首先通过文本变分自编码器(Text VAE)学习稳定的文本到潜在空间的映射,然后在连续潜在空间中使用块因果的 DiT 模型化全局语义先验,最后通过条件解码生成文本。从统一的马尔可夫路径视角来看,其扩散过程执行潜在先验传输,而不是基于标记的观察恢复,从而将全局语义组织与局部文本实现分离。这一设计产生了更灵活的非自回归归纳偏差,支持在连续空间中的语义压缩和先验拟合,并自然扩展到其他连续模态。通过涵盖 4 个研究问题、8 个基准、严格匹配的约 20 亿参数的自回归和 LLaDA 基线,以及高达约 2000 EFLOPs 的扩展曲线实验,我们识别出 Cola DLM 的有效整体配置,并验证其在文本生成中的强扩展行为。综合来看,结果确立了分层连续潜在先验建模作为严格基于标记的语言建模的原则性替代方案,其中生成质量和扩展行为可能比似然更好地反映模型能力,同时也为在离散文本和连续模态之间的统一建模提供了具体路径。
cs.CL / 59 / 2605.06554

Long Context Pre-Training with Lighthouse Attention

长上下文预训练与灯塔注意力
Peng, Bowen, Ghosh, Subho, Quesnelle, Jeffrey
Abstract
Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only, symmetrical, selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed toward the end of training. Our hierarchical selection is also gradient-free, which exempts us from dealing with a complicated and potentially inefficient backward-pass kernel. Our contribution is three-fold: (i) a subquadratic hierarchical pre- and post-processing step that adaptively compresses and decompresses the sequence; (ii) a symmetrical compression strategy that pools queries, keys, and values at the same time while preserving left-to-right causality, which greatly improves parallelism; and (iii) a two-stage training approach in which we pre-train for the majority of the time with Lighthouse Attention and recover a full-attention model at the end with a short training phase. We run preliminary small-scale LLM pre-training experiments that show the effectiveness of our method compared to full-attention training with all other settings matched: we achieve a faster total training time and a lower final loss after the recovery phase. Full code is available at: https://github.com/ighoshsubho/lighthouse-attention
Chinese Translation
在极长序列长度下训练因果变换器受到缩放点积注意力(SDPA)所需的平方时间和内存的瓶颈。在本研究中,我们提出了灯塔注意力(Lighthouse Attention),这是一种仅用于训练的对称选择型层次注意力算法,它包裹在普通的SDPA周围,并且可以在训练结束时轻松移除。我们的层次选择也是无梯度的,这使我们免于处理复杂且可能低效的反向传播内核。我们的贡献有三方面:(i)一个亚平方的层次预处理和后处理步骤,能够自适应地压缩和解压缩序列;(ii)一种对称压缩策略,同时对查询、键和值进行池化,同时保持从左到右的因果性,这大大提高了并行性;(iii)一种两阶段训练方法,我们在大部分时间内使用灯塔注意力进行预训练,并在最后通过短时间训练恢复完整的注意力模型。我们进行了初步的小规模大语言模型(LLM)预训练实验,显示出我们的方法相较于全注意力训练在其他设置相同的情况下的有效性,我们实现了更快的总训练时间和更低的恢复阶段最终损失。完整代码可在以下链接获取:https://github.com/ighoshsubho/lighthouse-attention
cs.CL / 60 / 2605.06594

Automated Clinical Report Generation for Remote Cognitive Remediation: Comparing Knowledge-Engineered Templates and LLMs in Low-Resource Settings

远程认知修复的自动化临床报告生成:在低资源环境中比较知识工程模板与大型语言模型
Zhou, Yongxin, Ringeval, Fabien, Portet, François
Abstract
The growing demand for cognitive remediation therapy, combined with limited speech therapist availability, has accelerated the adoption of remote rehabilitation tools. These systems generate large volumes of interaction data that are difficult for clinicians to review efficiently. This paper investigates automated clinical report generation for avatar-guided, home-based cognitive remediation sessions in a low-resource setting with no reference reports. We present and compare two approaches: (1) a rule-based template system encoding speech therapy domain knowledge as explicit decision rules and validated templates, ensuring clinical reliability and traceability; and (2) a zero-shot LLM-based approach (GPT-4) aimed at more fluent and concise output. Both systems use identical pre-extracted, expert-validated structured variables, enabling a controlled factual comparison. Outputs were evaluated by eight speech therapists and final-year students using a nine-criterion questionnaire. Results reveal a clear trade-off between clinical reliability and linguistic quality. The template-based system scored higher on fluidity, coherence, and results presentation, while GPT-4 produced more concise output. Directional differences are consistent across evaluation dimensions, though no comparison reached statistical significance after correction, reflecting the scale constraints of expert clinical evaluation. Based on evaluator feedback, we derive eight design recommendations for clinical reporting systems in remote rehabilitation settings. More broadly, this work contributes a replicable methodology combining expert elicitation, taxonomy-driven generation, and multi-dimensional human evaluation for clinical NLG in low-resource settings, and illustrates how controlled comparisons can inform the responsible adoption of generative AI in healthcare.
Chinese Translation
随着对认知修复疗法需求的增长,加之语言治疗师资源有限,远程康复工具的采用不断加速。这些系统生成大量的交互数据,临床医生很难高效地进行审查。本文探讨了在没有参考报告的低资源环境中,针对虚拟形象引导的家庭认知修复会话的自动化临床报告生成。我们提出并比较了两种方法:(1)基于规则的模板系统,该系统将语言治疗领域知识编码为明确的决策规则和经过验证的模板,确保临床可靠性和可追溯性;(2)基于零样本学习的LLM方法(GPT-4),旨在生成更流畅和简洁的输出。这两种系统使用相同的预提取、专家验证的结构变量,从而实现了受控的事实比较。输出结果由八名语言治疗师和毕业班学生使用包含九项标准的问卷进行评估。结果显示,临床可靠性与语言质量之间存在明显的权衡。基于模板的系统在流畅性、连贯性和结果呈现上得分更高,而GPT-4则生成了更简洁的输出。尽管在评估维度上方向性差异一致,但在校正后没有任何比较达到统计显著性,这反映了专家临床评估的规模限制。根据评估者的反馈,我们为远程康复环境中的临床报告系统提出了八项设计建议。更广泛地说,这项工作贡献了一种可复制的方法论,结合了专家引导、分类驱动的生成和多维度的人类评估,用于低资源环境中的临床自然语言生成,并说明了如何通过受控比较来指导生成式人工智能在医疗保健中的负责任采用。
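To make the rule-based template approach concrete, here is a minimal sketch of one decision rule feeding one template sentence. The variable names, thresholds, and wording are all hypothetical; the abstract does not give the actual rules or templates.

```python
def describe_accuracy(correct, total):
    """Map a pre-extracted session score to a fixed template sentence."""
    if total <= 0:
        return "No exercise was completed during this session."
    rate = correct / total
    # Illustrative thresholds only; not taken from the paper.
    if rate >= 0.8:
        level = "high"
    elif rate >= 0.5:
        level = "moderate"
    else:
        level = "low"
    return (
        f"The patient answered {correct} of {total} items correctly "
        f"({rate:.0%}), a {level} success rate."
    )


sentence = describe_accuracy(8, 10)
```

Because every output sentence can be traced back to one explicit rule and one structured variable, a design like this trades fluency for the traceability the abstract emphasizes.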
cs.CL / 61 / 2605.06597

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

UniSD:面向大型语言模型的统一自蒸馏框架
Jin, Yiqiao, Wang, Yiyang, Fu, Lucheng, Xiao, Yijia, Luo, Yinyi, Liu, Haoxin, Prakash, B. Aditya, Hester, Josiah, Wang, Jindong, Kumar, Srijan
Abstract
Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
Chinese Translation
自蒸馏(Self-distillation, SD)为适应大型语言模型(Large Language Models, LLMs)提供了一条有前景的路径,而无需依赖更强的外部教师。然而,在自回归LLMs中进行SD仍然具有挑战性,因为自生成的轨迹是自由形式的,正确性依赖于任务,并且合理的推理仍可能提供不稳定或不可靠的监督。现有方法主要考察孤立的设计选择,导致其有效性、角色和相互作用不明确。在本文中,我们提出了UniSD,一个统一框架,用于系统地研究自蒸馏。UniSD整合了互补机制,以解决监督可靠性、表示对齐和训练稳定性,包括多教师一致性、EMA教师稳定化、标记级对比学习、特征匹配和发散剪切。在六个基准和来自三个模型家族的六个模型上,UniSD揭示了自蒸馏何时优于静态模仿,哪些组件推动了增益,以及这些组件在任务间如何相互作用。在这些见解的指导下,我们构建了UniSDfull,一个集成管道,结合了互补组件,并实现了最强的整体性能,相较于基础模型提高了5.4分,相较于最强基线提高了2.8分。广泛的评估突显了自蒸馏作为一种实用且可控的方法,用于在没有更强外部教师的情况下高效适应LLM。
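Of the mechanisms listed, EMA teacher stabilization is the easiest to show in isolation. The sketch below uses the standard exponential-moving-average update on a plain dictionary of scalar weights; the decay value and the dictionary representation are assumptions for illustration, not details from the paper.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved a fraction (1 - decay) toward the student."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}
teacher = ema_update(teacher, student, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which is the usual reason such a teacher gives steadier supervision than the latest student checkpoint.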
cs.CL / 62 / 2605.06619

Algospeak, Hiding in the Open: The Trade-off Between Legible Meaning and Detection Avoidance

算法语言,隐匿于公开之中:可读意义与检测规避之间的权衡
Fillies, Jan, Robertson, Ronald E., Hancock, Jeffrey
Abstract
As large language models (LLMs) increasingly mediate both content generation and moderation, linguistic evasion strategies known as Algospeak have intensified the coevolution between evaders and detectors. This research formalizes the underlying dynamics grounded in a joint action model: when Algospeak increases, detectability and understandability decrease. Further, the concept of Majority Understandable Modulation (MUM) is introduced and defined as the modulation level at which additional evasive alteration increases detector evasion but loses comprehension for the majority of recipients. To empirically probe this trade-off, we introduce a reproducible framework that can be used to create meaning-preserving, Algospeak-style variants, based on an existing taxonomy and with tunable modulation levels. Using COVID-19 disinformation as a first proof-by-example setting, we construct a reference dataset of 700 modulated items, drawn from twenty base sentences across five modulation levels and seven strategies. We then run two linked evaluations with seven different language models: one testing for interpretation through meaning recovery and one for disinformation detection through classification. Curve fitting over modulation levels yields an estimate of the Majority Understandable Modulation threshold and enables sensitivity analyses across strategies and models, see Figure 1. Results reveal the characteristic relationships between understandability and modulation. This study lays the groundwork for understanding the dynamics behind Algospeak and provides the framework, dataset, and experimental setups described.
Chinese Translation
随着大型语言模型(LLMs)在内容生成和审查中的作用日益增强,称为算法语言(Algospeak)的语言规避策略加剧了规避者与检测者之间的共同进化。本研究基于联合行动模型形式化了其潜在动态:当算法语言增加时,检测性和可理解性降低。此外,引入并定义了“多数可理解调制”(Majority Understandable Modulation, MUM)这一概念,指的是在该调制水平下,额外的规避性修改会增加检测规避,但会导致大多数接收者失去理解。为了实证探讨这一权衡,我们提出了一个可重复的框架,可以基于现有分类法创建保持意义的算法语言风格变体,并具有可调的调制水平。以COVID-19虚假信息作为首个示例,我们构建了一个包含700个调制项目的参考数据集,这些项目来自20个基础句子,涵盖五个调制水平和七种策略。然后,我们使用七种不同的语言模型进行了两次关联评估:一次测试通过意义恢复进行的解释,另一次通过分类进行的虚假信息检测。对调制水平进行曲线拟合,估算了多数可理解调制阈值,并使得在策略和模型之间进行敏感性分析成为可能,见图1。结果揭示了可理解性与调制之间的特征关系。本研究为理解算法语言背后的动态奠定了基础,并提供了所描述的框架、数据集和实验设置。
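The dataset size and the threshold estimate can both be checked with a few lines. The abstract says the threshold comes from curve fitting; the sketch below substitutes plain linear interpolation between observed levels, and the 0.5 cutoff for "majority" and the example rates are assumptions.

```python
# 20 base sentences x 5 modulation levels x 7 strategies.
N_ITEMS = 20 * 5 * 7


def majority_threshold(levels, understood_rates, cutoff=0.5):
    """Interpolate the modulation level where understanding drops below cutoff."""
    points = list(zip(levels, understood_rates))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 >= cutoff > y1:
            # Linear interpolation between the two bracketing observations.
            return x0 + (y0 - cutoff) * (x1 - x0) / (y0 - y1)
    return None  # the rate never crosses the cutoff in the observed range


# Made-up rates for illustration only.
threshold = majority_threshold([1, 2, 3, 4, 5], [0.95, 0.9, 0.7, 0.3, 0.1])
```

With these made-up rates the crossing falls halfway between levels 3 and 4; a fitted sigmoid would give a smoother estimate from the same inputs.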
cs.CL / 63 / 2605.06625

Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation

第二语言(L2)韩语UD中的解析器一致性与不一致性:对人机协作注释的启示
Sung, Hakyung, Shin, Gyu-Ho
Abstract
We propose a simplified human-in-the-loop workflow for second language (L2) Korean morphosyntactic annotation by leveraging agreement between two domain-adapted parsers. We first evaluate whether parser agreement can serve as a proxy for annotation correctness by comparing it with independent human judgments. The results show strong correspondence between parser and human judgments, supporting the feasibility of semi-automatic L2-Korean UD annotation. Further analysis demonstrates that parser disagreements cluster in linguistically predictable domains such as grammatical-relation distinctions and clause-boundary ambiguity. While many disagreement cases are tractable for iterative model refinement, others reflect deeper representational challenges inherent in parsing and tagging L2-Korean corpora.
Chinese Translation
我们提出了一种简化的人机协作工作流程,用于第二语言(L2)韩语的形态句法注释,利用两个领域适应性解析器之间的一致性。我们首先评估解析器一致性是否可以作为注释正确性的代理,通过与独立的人类判断进行比较。结果显示,解析器与人类判断之间存在强烈的对应关系,支持半自动L2韩语UD注释的可行性。进一步分析表明,解析器的不一致性主要集中在语言上可预测的领域,如语法关系区分和从句边界歧义。虽然许多不一致案例可以通过迭代模型优化来处理,但另一些则反映了在解析和标注L2韩语语料时固有的更深层次的表征挑战。
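The agreement-based triage can be sketched as follows. Each token is represented here as a hypothetical `(head, deprel)` pair; the paper's actual comparison criteria and data format are not given in the abstract.

```python
def split_by_agreement(parse_a, parse_b):
    """Return (agreed, disputed) token indices for two parses of one sentence."""
    if len(parse_a) != len(parse_b):
        raise ValueError("parses must cover the same tokens")
    agreed, disputed = [], []
    for index, (a, b) in enumerate(zip(parse_a, parse_b)):
        # Tokens on which both parsers agree are accepted automatically;
        # the rest are queued for a human annotator.
        (agreed if a == b else disputed).append(index)
    return agreed, disputed


parser_one = [(2, "nsubj"), (0, "root"), (2, "obj")]
parser_two = [(2, "nsubj"), (0, "root"), (2, "obl")]
agreed, disputed = split_by_agreement(parser_one, parser_two)
```

Only the third token would reach a human here, which is how agreement reduces the manual workload.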
cs.CL / 64 / 2605.06635

Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

引用但未验证:解析和评估大型语言模型深度研究代理中的来源归属
Onweller, Hailey, Lumer, Elias, Huber, Austin, Ramchandani, Pia, Subbiah, Vamse Kumar, Feld, Corey
Abstract
Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Unlike methods that verify claims in isolation, our framework closes the loop by retrieving the actual cited content, enabling human or model evaluators to judge each citation against its source. Citations are evaluated along three dimensions. (1) Link Works verifies URL accessibility, (2) Relevant Content measures topical alignment, and (3) Fact Check validates factual accuracy against source content. We benchmark 14 closed-source and open-source LLMs across three evaluation dimensions using rubric-based LLM-as-a-judge evaluators calibrated through human review. Our results reveal that even the strongest frontier models maintain link validity above 94% and relevance above 80%, yet achieve only 39-77% factual accuracy, while fewer than half of open-source models successfully generate cited reports in a one-shot setting. Ablation studies on research depth show that Fact Check accuracy drops by approximately 42% on average across two frontier models as tool calls scale from 2 to 150, demonstrating that more retrieval does not produce more accurate citations. These findings reveal a critical disconnect between surface-level citation quality and factual reliability, and our framework provides the evaluation infrastructure to assess the disconnect.
Chinese Translation
大型语言模型(LLMs)驱动的深度研究代理能够将来自数百个网络来源的信息综合成引用报告,但这些引用无法可靠地验证。目前的方法要么信任模型准确自我引用,存在偏见风险,要么采用检索增强生成(RAG),但未验证来源的可访问性、相关性或事实一致性。我们提出了第一个来源归属评估框架,使用可重复的抽象语法树(AST)解析器从LLM生成的Markdown报告中提取和评估内联引用。与单独验证声明的方法不同,我们的框架通过检索实际引用内容来闭合循环,使人类或模型评估者能够根据来源判断每个引用。引用的评估从三个维度进行:(1)链接有效性验证URL的可访问性,(2)相关内容测量主题一致性,以及(3)事实检查验证与来源内容的事实准确性。我们在三个评估维度上基于评分标准对14个闭源和开源LLM进行基准测试,使用经过人工审查校准的LLM作为评判者。我们的结果显示,即使是最强的前沿模型,链接有效性保持在94%以上,相关性超过80%,但事实准确性仅为39-77%,而且不到一半的开源模型在一次性设置中成功生成引用报告。对研究深度的消融研究表明,随着工具调用从2个增加到150个,事实检查的准确性在两个前沿模型上平均下降约42%,这表明更多的检索并不会产生更准确的引用。这些发现揭示了表面引用质量与事实可靠性之间的关键脱节,而我们的框架提供了评估这种脱节的基础设施。
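The extraction step can be approximated in a few lines. Note the swap: the paper describes an AST parser, whereas this sketch uses a regular expression over inline Markdown links, which is simpler but less robust (for example, it also matches image links and ignores reference-style links).

```python
import re

# Matches [label](http://... or https://...) inline links.
INLINE_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")


def extract_citations(markdown):
    """Return (label, url) pairs for every inline link in a Markdown report."""
    return [(m.group(1), m.group(2)) for m in INLINE_LINK.finditer(markdown)]


report = (
    "Sales rose 5% [1](https://example.com/a) "
    "and fell later [report](http://example.org/b)."
)
citations = extract_citations(report)
```

Each extracted URL would then be fetched and judged on the three dimensions named above; that part needs network access and an evaluator, so it is left out here.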
cs.CL / 65 / 2605.06642

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

StraTA:通过战略轨迹抽象激励自主强化学习
Xue, Xiangyuan, Zhou, Yifan, Wang, Zidong, Tang, Shengji, Torr, Philip, Ouyang, Wanli, Bai, Lei, Yin, Zhenfei
Abstract
Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design, further enhanced by diverse strategy rollout and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, StraTA attains a 63.5% overall score, outperforming frontier closed-source models.
Chinese Translation
大型语言模型(LLMs)越来越多地被用作交互代理,但由于当前方法主要是纯反应性的,优化它们以进行长期决策仍然困难,这削弱了在延长轨迹上的探索和信用分配。在本研究中,我们提出了战略轨迹抽象(StraTA),这是一个简单的框架,将明确的轨迹级策略引入自主强化学习(RL)。StraTA 从初始任务状态中抽样一个紧凑的策略,将后续动作基于该策略进行条件化,并与分层的 GRPO 风格的回滚设计共同训练策略生成和动作执行,进一步通过多样化的策略回滚和关键自我判断进行增强。在 ALFWorld、WebShop 和 SciWorld 上的实验表明,StraTA 在强基线之上始终提高了样本效率和最终性能。StraTA 在 ALFWorld 上的成功率达到 93.1%,在 WebShop 上为 84.2%。在 SciWorld 上,StraTA 达到 63.5% 的整体得分,超越了前沿的闭源模型。
cs.CL / 66 / 2605.06650

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

超越负向回滚:仅正向策略优化与隐式负梯度
Xu, Mingwei, Fang, Hao
Abstract
Reinforcement learning with verifiable rewards (RLVR), thanks to its deterministic verification, has become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community has witnessed a rapid shift from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), which replaces complicated advantage estimation with a simple estimate over grouped positive and negative rollouts. However, we note that negative rollouts may admit no gradation of failure severity, and that the combinatorial vastness of the failure space makes penalizing a few sampled negatives unlikely to provide a meaningful reward signal under sparse binary rewards. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning occurs exclusively via online positive rollouts. Specifically, POPO uses bounded importance sampling over the positive rollout set, so no disjoint negative rollouts are used for gradient guidance. We show that implicit negative gradients emerge naturally when positive probability is reinforced through rollout redistribution. POPO further stabilizes policy optimization through two mechanisms. First, it applies a siamese policy network with a momentum-based adaptation law for stabilized policy evolution. Second, it replaces the KL divergence with a bounded similarity penalty term in the siamese representation space. We conduct extensive experiments using publicly available, well-established text LLMs, e.g., the Qwen family, across mathematical benchmarks of all levels. Our experiments demonstrate that POPO achieves performance comparable to, or even superior to, GRPO. Notably, POPO achieves 36.67% on AIME 2025 with Qwen-Math-7B, outperforming GRPO's 30.00%. Our ablation and sweep studies further illustrate the necessity and robustness of POPO's components.
Chinese Translation
由于确定性验证,具有可验证奖励的强化学习(RLVR)成为增强大型语言模型(LLMs)推理能力的主导范式。社区见证了从近端策略优化(PPO)到组相对策略优化(GRPO)的快速变化,其中GRPO通过对分组的正向和负向回滚进行简单估计,减少了复杂的优势估计。然而,我们注意到负向回滚可能无法承认失败严重性的梯度,而组合的庞大性使得惩罚少量采样的负向回滚在稀疏的二元奖励下不太可能覆盖有意义的奖励信号。在本研究中,我们提出了仅正向策略优化(POPO),这是一个新颖的RLVR框架,其中学习可以完全通过在线正向回滚进行。具体而言,POPO利用对正向回滚集的有界重要性采样。因此,不使用离散的负向回滚进行梯度指导。我们展示了通过增强正向概率的回滚重分配,隐式负梯度可以自然出现。接下来,POPO通过两种机制稳定策略优化。首先,它应用了一个具有基于动量的适应法则的孪生策略网络,以稳定策略演化。其次,我们在孪生表示空间中用有界相似性惩罚项替代KL散度。我们使用公开可用的、成熟的文本-LLM模型(例如Qwen系列)在各级数学基准上进行了广泛实验。我们的实验表明,POPO的性能与GRPO相当,甚至更优。值得注意的是,我们展示了POPO在AIME 2025中以Qwen-Math-7B达到了36.67%的成绩,超越了GRPO的30.00%。我们的消融和遍历研究进一步说明了POPO组件的必要性和稳健性。
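For context, the GRPO-style baseline that POPO is contrasted with can be sketched as a group-relative advantage: each rollout's reward is standardized against the other rollouts sampled for the same prompt. This illustrates the baseline only, not POPO itself; the use of the population standard deviation and the `eps` value are implementation choices assumed here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One correct rollout (reward 1) among three incorrect ones (reward 0).
advantages = group_relative_advantages([1, 0, 0, 0])
```

Under binary rewards every failed rollout in a group receives the same negative advantage, which matches the abstract's remark that negatives carry no gradation of failure severity.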
cs.CL / 67 / 2605.06663

EMO: Pretraining Mixture of Experts for Emergent Modularity

EMO:用于新兴模块化的专家混合预训练
Wang, Ryan, Bhagia, Akshita, Min, Sewon
Abstract
Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts (MoEs) seemingly offer a potential alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity (the independent use and composition of expert subsets) without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while allowing different documents to use different pools. This simple constraint enables coherent expert groupings to emerge during pretraining using document boundaries alone. We pretrain a 1B-active, 14B-total EMO on 1T tokens. As a full model, it matches standard MoE performance. Crucially, it enables selective expert use: retaining only 25% (12.5%) of experts incurs just a 1% (3%) absolute drop, whereas standard MoEs break under the same setting. We further find that expert subsets in EMO specialize at semantic levels (e.g., domains such as math or code), in contrast to the low-level syntactic specialization observed in standard MoEs. Altogether, our results demonstrate a path toward modular, memory-efficient deployment of large, sparse models and open new opportunities for composable architectures.
Chinese Translation
大型语言模型通常作为单体系统部署,即使在应用仅需狭窄能力子集(例如代码、数学或特定领域知识)时,也需要完整模型。专家混合模型(Mixture-of-Experts, MoEs)似乎提供了一种潜在的替代方案,通过每个输入仅激活一部分专家,但在实践中,将推理限制在特定领域的专家子集会导致严重的性能下降。这限制了它们在内存受限环境中的实用性,尤其是随着模型变得更大和更稀疏。我们提出了EMO,这是一种旨在实现模块化的MoE——即独立使用和组合专家子集,而无需人类定义的先验。我们的关键思想是鼓励来自相似领域的标记依赖于相似的专家。由于文档中的标记通常共享一个领域,EMO限制它们从共享池中选择专家,同时允许不同文档使用不同的池。这一简单约束使得在仅使用文档边界的情况下,在预训练过程中能够形成连贯的专家分组。我们在1T标记上对一个1B活跃、14B总量的EMO进行了预训练。作为一个完整模型,它的性能与标准MoE相匹配。关键是,它实现了选择性专家使用:仅保留25%(12.5%)的专家仅导致1%(3%)的绝对下降,而标准MoE在相同设置下则崩溃。我们进一步发现,EMO中的专家子集在语义层面上专业化(例如,数学或代码等领域),与标准MoE中观察到的低级语法专业化形成对比。总的来说,我们的结果展示了一条通向模块化、内存高效部署大型稀疏模型的道路,并为可组合架构开辟了新的机会。
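The shared-pool constraint can be illustrated with a toy router. The abstract does not say how a document's pool is chosen, so this sketch assumes it is the set of experts with the highest mean router score over the document's tokens; the function name and the scores are hypothetical.

```python
def route_with_document_pool(token_scores, pool_size, top_k):
    """Pick top_k experts per token, restricted to one pool per document.

    token_scores: one list of router scores (one per expert) for each token.
    """
    n_experts = len(token_scores[0])
    # Assumption: the pool is the pool_size experts with the best mean score.
    mean_scores = [
        sum(tok[e] for tok in token_scores) / len(token_scores)
        for e in range(n_experts)
    ]
    pool = sorted(range(n_experts), key=lambda e: mean_scores[e], reverse=True)
    pool = pool[:pool_size]
    # Every token chooses its experts from the shared pool only.
    choices = [
        sorted(pool, key=lambda e: tok[e], reverse=True)[:top_k]
        for tok in token_scores
    ]
    return sorted(pool), choices


scores = [[0.9, 0.1, 0.5, 0.0], [0.2, 0.1, 0.6, 0.7]]
pool, choices = route_with_document_pool(scores, pool_size=2, top_k=1)
```

The second token would prefer expert 3 if routed freely, but the document pool excludes it, so the token falls back to expert 2. That restriction is the constraint the abstract credits with producing coherent expert groupings.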