← Back to Index
Daily Research Digest

arXiv Papers

2026-02-17
418
Papers
4
Categories
418
Translated
收藏清单 0
机器人学 (Robotics)
61
cs.RO / 1 / 2602.13212

UAVGENT: A Language-Guided Distributed Control Framework

UAVGENT:一种语言引导的分布式控制框架
Zhang, Ziyi, Deng, Xiyu, Qu, Guannan, Nakahira, Yorie
Abstract
We study language-in-the-loop control for multi-drone systems that execute evolving, high-level missions while retaining formal robustness guarantees at the physical layer. We propose a three-layer architecture in which (i) a human operator issues natural-language instructions, (ii) an LLM-based supervisor periodically interprets, verifies, and corrects the commanded task in the context of the latest state and target estimates, and (iii) a distributed inner-loop controller tracks the resulting reference using only local relative information. We derive a theoretical guarantee that characterizes tracking performance under bounded disturbances and piecewise-smooth references with discrete jumps induced by LLM updates. Overall, our results illustrate how centralized language-based task reasoning can be combined with distributed feedback control to achieve complex behaviors with provable robustness and stability.
Chinese Translation
我们研究了多无人机系统中的语言环控,旨在执行不断演变的高层次任务,同时在物理层面保持正式的鲁棒性保证。我们提出了一种三层架构,其中(i)人类操作员发出自然语言指令,(ii)基于大型语言模型(LLM)的监督者定期解释、验证并纠正命令任务,以适应最新的状态和目标估计,(iii)分布式内环控制器仅使用局部相对信息跟踪生成的参考信号。我们推导出一个理论保证,表征在有界干扰和因LLM更新引起的离散跳跃的分段平滑参考下的跟踪性能。总体而言,我们的结果展示了如何将集中式语言基础任务推理与分布式反馈控制相结合,以实现具有可证明鲁棒性和稳定性的复杂行为。
cs.RO / 2 / 2602.13252

DORA: Dataflow Oriented Robotic Architecture

DORA:数据流导向的机器人架构
Zhang, Xiaodong, Lv, Baorui, Tao, Xavier, Wang, Xiong, Bao, Jie, He, Yong, Chen, Yue, Yang, Zijiang
Abstract
Robotic middleware serves as the foundational infrastructure, enabling complex robotic systems to operate in a coordinated and modular manner. In data-intensive robotic applications, especially in industrial scenarios, communication efficiency directly impact system responsiveness, stability, and overall productivity. However, existing robotic middleware exhibit several limitations: (1) they rely heavily on (de)serialization mechanisms, introducing significant overhead for large-sized data; (2) they lack efficient and flexible support for heterogeneous data sizes, particularly in intra-robot communication and Python-based execution environments. To address these challenges, we propose Dataflow-Oriented Robotic Architecture (DORA) that enables explicit data dependency specification and efficient zero-copy data transmission. We implement the proposed framework as an open-source system and evaluate it through extensive experiments in both simulation and real-world robotic environments. Experimental results demonstrate substantial reductions in latency and CPU overhead compared to state-of-the-art middleware.
Chinese Translation
机器人中间件作为基础设施,使复杂的机器人系统能够以协调和模块化的方式运行。在数据密集型机器人应用中,尤其是在工业场景中,通信效率直接影响系统的响应能力、稳定性和整体生产力。然而,现有的机器人中间件存在几个局限性:(1)它们在很大程度上依赖于(反)序列化机制,为大规模数据引入了显著的开销;(2)它们缺乏对异构数据大小的高效和灵活支持,特别是在机器人内部通信和基于Python的执行环境中。为了解决这些挑战,我们提出了数据流导向的机器人架构(Dataflow-Oriented Robotic Architecture,DORA),该架构能够显式指定数据依赖关系并实现高效的零拷贝数据传输。我们将所提出的框架实现为一个开源系统,并通过在仿真和真实机器人环境中的广泛实验进行评估。实验结果表明,与最先进的中间件相比,延迟和CPU开销显著降低。
cs.RO / 3 / 2602.13436

High-Fidelity, Customizable Force Sensing for the Wearable Human-Robot Interface

高保真、可定制的可穿戴人机界面力传感
Rubin, Noah, Schraeder, Ava, Sahu, Hrishikesh, Bulea, Thomas C., Chin, Lillian
Abstract
Mechanically characterizing the human-machine interface is essential to understanding user behavior and optimizing wearable robot performance. This interface has been challenging to sensorize due to manufacturing complexity and non-linear sensor responses. Here, we measure human limb-device interaction via fluidic innervation, creating a 3D-printed silicone pad with embedded air channels to measure forces. As forces are applied to the pad, the air channels compress, resulting in a pressure change measurable by off-the-shelf pressure transducers. We demonstrate in benchtop testing that pad pressure is highly linearly related to applied force ($R^2 = 0.998$). This is confirmed with clinical dynamometer correlations with isometric knee torque, where above-knee pressure was highly correlated with flexion torque ($R^2 = 0.95$), while below-knee pressure was highly correlated with extension torque ($R^2 = 0.75$). We build on these idealized settings to test pad performance in more unconstrained settings. We place the pad over \textit{biceps brachii} during cyclic curls and stepwise isometric holds, observing a correlation between pressure and elbow angle. Finally, we integrated the sensor into the strap of a lower-extremity robotic exoskeleton and recorded pad pressure during repeated squats with the device unpowered. Pad pressure tracked squat phase and overall task dynamics consistently. Overall, our preliminary results suggest fluidic innervation is a readily customizable sensing modality with high signal-to-noise ratio and temporal resolution for capturing human-machine mechanical interaction. In the long-term, this modality may provide an alternative real-time sensing input to control / optimize wearable robotic systems and to capture user function during device use.
Chinese Translation
机械特性化人机界面对于理解用户行为和优化可穿戴机器人性能至关重要。然而,由于制造复杂性和传感器非线性响应,该界面的传感化一直面临挑战。在此,我们通过流体神经传导测量人类肢体与设备的交互,创建了一种嵌入空气通道的3D打印硅胶垫以测量力。当力施加到垫子上时,空气通道会被压缩,导致可通过现成的压力传感器测量的压力变化。我们在台式测试中证明,垫子压力与施加的力呈高度线性关系($R^2 = 0.998$)。这一结果通过临床测力计与等长膝关节扭矩的相关性得到了验证,其中膝上压力与屈曲扭矩高度相关($R^2 = 0.95$),而膝下压力与伸展扭矩高度相关($R^2 = 0.75$)。我们在理想化环境的基础上,测试了垫子在更不受约束环境中的性能。我们将垫子放置在肱二头肌上进行循环卷曲和逐步等长保持,观察到压力与肘关节角度之间的相关性。最后,我们将传感器集成到下肢机器人外骨骼的带子上,并在设备未通电的情况下记录重复深蹲时的垫子压力。垫子压力一致地跟踪了深蹲阶段和整体任务动态。总体而言,我们的初步结果表明,流体神经传导是一种易于定制的传感模式,具有高信噪比和时间分辨率,能够捕捉人机机械交互。从长远来看,这种模式可能为控制/优化可穿戴机器人系统提供一种替代的实时传感输入,并在设备使用过程中捕捉用户功能。
cs.RO / 4 / 2602.13444

FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation

FlowHOI:基于流的语义基础手-物体交互生成用于灵巧机器人操控
Zeng, Huajian, Chen, Lingyun, Yang, Jiaqi, Zhang, Yuantai, Shi, Fan, Liu, Peidong, Zuo, Xingxing
Abstract
Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7$\times$ higher physics simulation success rate than the strongest diffusion-based baseline, while delivering a 40$\times$ inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real-robot execution pipelines.
Chinese Translation
近期的视觉-语言-动作(VLA)模型能够生成合理的末端执行器运动,但在长时间跨度和接触丰富的任务中往往表现不佳,因为底层的手-物体交互(HOI)结构没有被明确表示。一个与具体实现无关的交互表示能够捕捉这一结构,从而使操控行为更易于验证和在不同机器人之间转移。我们提出了FlowHOI,一个两阶段的流匹配框架,生成语义基础、时间连贯的HOI序列,包括手部姿态、物体姿态和手-物体接触状态,这些序列以自我中心的观察、语言指令和3D高斯点云重建(3DGS)为条件。我们将以几何为中心的抓取与以语义为中心的操控解耦,将后者以紧凑的3D场景标记为条件,并采用运动-文本对齐损失,使生成的交互在物理场景布局和语言指令中都具有语义基础。为了解决高保真HOI监督的稀缺性,我们引入了一种重建管道,从大规模自我中心视频中恢复对齐的手-物体轨迹和网格,生成稳健的HOI先验以支持生成。在GRAB和HOT3D基准测试中,FlowHOI实现了最高的动作识别准确率,并且其物理仿真成功率比最强的基于扩散的基线高出1.7倍,同时提供了40倍的推理加速。我们进一步展示了在四个灵巧操控任务上的真实机器人执行,说明了将生成的HOI表示重新定向到真实机器人执行管道的可行性。
cs.RO / 5 / 2602.13457

Inferring Turn-Rate-Limited Engagement Zones with Sacrificial Agents for Safe Trajectory Planning

利用牺牲代理推断受限转向率的交战区域以实现安全轨迹规划
Stagg, Grant, Peterson, Cameron K.
Abstract
This paper presents a learning-based framework for estimating pursuer parameters in turn-rate-limited pursuit-evasion scenarios using sacrificial agents. Each sacrificial agent follows a straight-line trajectory toward an adversary and reports whether it was intercepted or survived. These binary outcomes are related to the pursuer's parameters through a geometric reachable-region (RR) model. Two formulations are introduced: a boundary-interception case, where capture occurs at the RR boundary, and an interior-interception case, which allows capture anywhere within it. The pursuer's parameters are inferred using a gradient-based multi-start optimization with custom loss functions tailored to each case. Two trajectory-selection strategies are proposed for the sacrificial agents: a geometric heuristic that maximizes the spread of expected interception points, and a Bayesian experimental-design method that maximizes the D-score of the expected Gauss-Newton information matrix, thereby selecting trajectories that yield maximal information gain. Monte Carlo experiments demonstrate accurate parameter recovery with five to twelve sacrificial agents. The learned engagement models are then used to generate safe, time-optimal paths for high-value agents that avoid all feasible pursuer engagement regions.
Chinese Translation
本文提出了一种基于学习的框架,用于在受限转向率的追逐-逃避场景中利用牺牲代理估计追击者参数。每个牺牲代理沿直线路径朝向对手移动,并报告其是否被拦截或幸存。这些二元结果通过几何可达区域(RR)模型与追击者的参数相关联。提出了两种公式:边界拦截情况,其中捕获发生在RR边界;以及内部拦截情况,允许在RR内部的任何位置捕获。利用基于梯度的多起始优化方法推断追击者的参数,并为每种情况定制损失函数。为牺牲代理提出了两种轨迹选择策略:一种几何启发式方法,最大化预期拦截点的分散程度;另一种贝叶斯实验设计方法,最大化预期高斯-牛顿信息矩阵的D分数,从而选择能够获得最大信息增益的轨迹。蒙特卡洛实验表明,使用五到十二个牺牲代理可以准确恢复参数。然后,利用学习到的交战模型生成安全的、时间最优的路径,以避免所有可行的追击者交战区域,适用于高价值代理。
cs.RO / 6 / 2602.13476

AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge

AsyncVLA:一种用于边缘快速且稳健导航的异步 VLA
Hirose, Noriaki, Glossop, Catherine, Shah, Dhruv, Levine, Sergey
Abstract
Robotic foundation models achieve strong generalization by leveraging internet-scale vision-language representations, but their massive computational cost creates a fundamental bottleneck: high inference latency. In dynamic environments, this latency breaks the control loop, rendering powerful models unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. Inspired by hierarchical control, AsyncVLA runs a large foundation model on a remote workstation to provide high-level guidance, while a lightweight, onboard Edge Adapter continuously refines actions at high frequency. To bridge the domain gap between these asynchronous streams, we introduce an end-to-end finetuning protocol and a trajectory re-weighting strategy that prioritizes dynamic interactions. We evaluate our approach on real-world vision-based navigation tasks with communication delays up to 6 seconds. AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines, effectively bridging the gap between the semantic intelligence of large models and the reactivity required for edge robotics.
Chinese Translation
机器人基础模型通过利用互联网规模的视觉-语言表示实现了强大的泛化能力,但其巨大的计算成本造成了一个根本性的瓶颈:高推理延迟。在动态环境中,这种延迟打破了控制循环,使得强大的模型在实时部署中变得不安全。我们提出了 AsyncVLA,一种异步控制框架,将语义推理与反应执行解耦。受分层控制的启发,AsyncVLA 在远程工作站上运行大型基础模型以提供高层次的指导,同时轻量级的车载边缘适配器以高频率持续优化动作。为了弥合这些异步流之间的领域差距,我们引入了一种端到端的微调协议和一种优先考虑动态交互的轨迹重加权策略。我们在具有高达 6 秒通信延迟的真实世界视觉导航任务上评估了我们的方法。AsyncVLA 的成功率比最先进的基线高出 40%,有效地弥合了大型模型的语义智能与边缘机器人所需的反应性之间的差距。
cs.RO / 7 / 2602.13577

ONRAP: Occupancy-driven Noise-Resilient Autonomous Path Planning

ONRAP:基于占用率的抗噪声自主路径规划
Tariq, Faizan M., Singh, Avinash, Ramtekkar, Vipul, D'sa, Jovin, Isele, David, Sakamoto, Yosuke, Bae, Sangjae
Abstract
Dynamic path planning must remain reliable in the presence of sensing noise, uncertain localization, and incomplete semantic perception. We propose a practical, implementation-friendly planner that operates on occupancy grids and optionally incorporates occupancy-flow predictions to generate ego-centric, kinematically feasible paths that safely navigate through static and dynamic obstacles. The core is a nonlinear program in the spatial domain built on a modified bicycle model with explicit feasibility and collision-avoidance penalties. The formulation naturally handles unknown obstacle classes and heterogeneous agent motion by operating purely in occupancy space. The pipeline runs in real-time (faster than 10 Hz on average), requires minimal tuning, and interfaces cleanly with standard control stacks. We validate our approach in simulation with severe localization and perception noises, and on an F1TENTH platform, demonstrating smooth and safe maneuvering through narrow passages and rough routes. The approach provides a robust foundation for noise-resilient, prediction-aware planning, eliminating the need for handcrafted heuristics. The project website can be accessed at https://honda-research-institute.github.io/onrap/
Chinese Translation
动态路径规划必须在存在传感噪声、不确定定位和不完整语义感知的情况下保持可靠性。我们提出了一种实用的、易于实现的规划器,该规划器基于占用网格,并可选择性地结合占用流预测,以生成自我中心的、运动学上可行的路径,安全地穿越静态和动态障碍物。其核心是一个在空间域内构建的非线性规划,基于修改后的自行车模型,具有明确的可行性和避碰惩罚。该模型自然地处理未知障碍物类别和异构代理运动,通过纯粹在占用空间中操作。该流程以实时方式运行(平均速度超过10 Hz),需要最小的调优,并与标准控制堆栈无缝对接。我们在模拟中验证了我们的方法,面对严重的定位和感知噪声,并在F1TENTH平台上进行测试,展示了在狭窄通道和崎岖路线中的平稳和安全操控。该方法为抗噪声、预测感知的规划提供了坚实的基础,消除了对手工启发式方法的需求。项目网站可访问:https://honda-research-institute.github.io/onrap/
cs.RO / 8 / 2602.13579

TactAlign: Human-to-Robot Policy Transfer via Tactile Alignment

TactAlign:通过触觉对齐实现人机策略转移
Wi, Youngsun, Yin, Jessica, Xiang, Elvis, Sharma, Akash, Malik, Jitendra, Mukadam, Mustafa, Fazeli, Nima, Hellebrekers, Tess
Abstract
Human demonstrations collected by wearable devices (e.g., tactile gloves) provide fast and dexterous supervision for policy learning, and are guided by rich, natural tactile feedback. However, a key challenge is how to transfer human-collected tactile signals to robots despite the differences in sensing modalities and embodiment. Existing human-to-robot (H2R) approaches that incorporate touch often assume identical tactile sensors, require paired data, and involve little to no embodiment gap between human demonstrator and the robots, limiting scalability and generality. We propose TactAlign, a cross-embodiment tactile alignment method that transfers human-collected tactile signals to a robot with different embodiment. TactAlign transforms human and robot tactile observations into a shared latent representation using a rectified flow, without paired datasets, manual labels, or privileged information. Our method enables low-cost latent transport guided by hand-object interaction-derived pseudo-pairs. We demonstrate that TactAlign improves H2R policy transfer across multiple contact-rich tasks (pivoting, insertion, lid closing), generalizes to unseen objects and tasks with human data (less than 5 minutes), and enables zero-shot H2R transfer on a highly dexterous tasks (light bulb screwing).
Chinese Translation
通过可穿戴设备(例如触觉手套)收集的人类示范为策略学习提供了快速而灵活的监督,并且受到丰富、自然的触觉反馈的指导。然而,一个关键挑战是如何将人类收集的触觉信号转移到机器人上,尽管感知模式和体现之间存在差异。现有的人机(H2R)方法在整合触觉时通常假设触觉传感器相同,要求配对数据,并且在人类示范者与机器人之间几乎没有体现差距,这限制了可扩展性和普遍性。我们提出了TactAlign,这是一种跨体现的触觉对齐方法,能够将人类收集的触觉信号转移到具有不同体现的机器人上。TactAlign通过校正流将人类和机器人触觉观察转化为共享的潜在表示,而无需配对数据集、手动标签或特权信息。我们的方法使得通过手物体交互衍生的伪配对实现低成本的潜在传输。我们证明了TactAlign在多个接触丰富的任务(如旋转、插入、盖子关闭)中改善了H2R策略转移,能够对未见过的物体和任务进行泛化(少于5分钟的人类数据),并在高度灵活的任务(如灯泡拧紧)上实现零-shot H2R转移。
cs.RO / 9 / 2602.13591

AgentRob: From Virtual Forum Agents to Hijacked Physical Robots

AgentRob:从虚拟论坛代理到被劫持的物理机器人
Liu, Wenrui, Wang, Yaxuan, Zhang, Xun, Wang, Yanshu, Wei, Jiashen, Xiang, Yifan, Wang, Yuhang, Ye, Mingshen, Dai, Elsie, Liu, Zhiqi, Xu, Yingjie, Chen, Xinyang, Sun, Hengzhe, Shen, Jiyu, He, Jingjing, Yang, Tong
Abstract
Large Language Model (LLM)-powered autonomous agents have demonstrated significant capabilities in virtual environments, yet their integration with the physical world remains narrowly confined to direct control interfaces. We present AgentRob, a framework that bridges online community forums, LLM-powered agents, and physical robots through the Model Context Protocol (MCP). AgentRob enables a novel paradigm where autonomous agents participate in online forums--reading posts, extracting natural language commands, dispatching physical robot actions, and reporting results back to the community. The system comprises three layers: a Forum Layer providing asynchronous, persistent, multi-agent interaction; an Agent Layer with forum agents that poll for @mention-targeted commands; and a Robot Layer with VLM-driven controllers and Unitree Go2/G1 hardware that translate commands into robot primitives via iterative tool calling. The framework supports multiple concurrent agents with distinct identities and physical embodiments coexisting in the same forum, establishing the feasibility of forum-mediated multi-agent robot orchestration.
Chinese Translation
基于大语言模型(LLM)的自主代理在虚拟环境中展现出了显著的能力,但它们与物理世界的结合仍然局限于直接控制接口。我们提出了AgentRob,一个通过模型上下文协议(Model Context Protocol, MCP)连接在线社区论坛、基于LLM的代理和物理机器人的框架。AgentRob实现了一种新颖的范式,使自主代理能够参与在线论坛——阅读帖子、提取自然语言命令、调度物理机器人动作,并将结果反馈给社区。该系统由三层组成:论坛层提供异步、持久的多代理交互;代理层包含轮询@提及目标命令的论坛代理;机器人层则配备基于视觉语言模型(VLM)的控制器和Unitree Go2/G1硬件,通过迭代工具调用将命令转换为机器人原语。该框架支持多个具有不同身份和物理体现的并发代理在同一论坛中共存,建立了论坛介导的多代理机器人编排的可行性。
cs.RO / 10 / 2602.13640

Hierarchical Audio-Visual-Proprioceptive Fusion for Precise Robotic Manipulation

用于精确机器人操作的层次音频-视觉-本体感觉融合
Li, Siyuan, Lu, Jiani, Song, Yu, Li, Xianren, An, Bo, Liu, Peng
Abstract
Existing robotic manipulation methods primarily rely on visual and proprioceptive observations, which may struggle to infer contact-related interaction states in partially observable real-world environments. Acoustic cues, by contrast, naturally encode rich interaction dynamics during contact, yet remain underexploited in current multimodal fusion literature. Most multimodal fusion approaches implicitly assume homogeneous roles across modalities, and thus design flat and symmetric fusion structures. However, this assumption is ill-suited for acoustic signals, which are inherently sparse and contact-driven. To achieve precise robotic manipulation through acoustic-informed perception, we propose a hierarchical representation fusion framework that progressively integrates audio, vision, and proprioception. Our approach first conditions visual and proprioceptive representations on acoustic cues, and then explicitly models higher-order cross-modal interactions to capture complementary dependencies among modalities. The fused representation is leveraged by a diffusion-based policy to directly generate continuous robot actions from multimodal observations. The combination of end-to-end learning and hierarchical fusion structure enables the policy to exploit task-relevant acoustic information while mitigating interference from less informative modalities. The proposed method has been evaluated on real-world robotic manipulation tasks, including liquid pouring and cabinet opening. Extensive experiment results demonstrate that our approach consistently outperforms state-of-the-art multimodal fusion frameworks, particularly in scenarios where acoustic cues provide task-relevant information not readily available from visual observations alone. Furthermore, a mutual information analysis is conducted to interpret the effect of audio cues in robotic manipulation via multimodal fusion.
Chinese Translation
现有的机器人操作方法主要依赖于视觉和本体感觉观察,这在部分可观察的真实环境中可能难以推断与接触相关的交互状态。相比之下,声学线索自然地编码了接触过程中的丰富交互动态,但在当前的多模态融合文献中仍未得到充分利用。大多数多模态融合方法隐含地假设各模态之间的角色是同质的,因此设计了平坦且对称的融合结构。然而,这一假设并不适合声学信号,因为其本质上是稀疏且驱动于接触的。为了通过声学信息感知实现精确的机器人操作,我们提出了一种层次表示融合框架,该框架逐步整合音频、视觉和本体感觉。我们的方法首先将视觉和本体感觉表示条件化于声学线索之上,然后显式建模高阶跨模态交互,以捕捉模态之间的互补依赖关系。融合后的表示通过基于扩散的策略直接从多模态观察中生成连续的机器人动作。端到端学习与层次融合结构的结合使得策略能够利用与任务相关的声学信息,同时减轻来自信息较少模态的干扰。所提方法已在真实世界的机器人操作任务中进行了评估,包括液体倒入和柜门开启。大量实验结果表明,我们的方法在各类场景中始终优于最先进的多模态融合框架,特别是在声学线索提供了视觉观察无法轻易获取的任务相关信息的情况下。此外,还进行了互信息分析,以解释声学线索在多模态融合下对机器人操作的影响。
cs.RO / 11 / 2602.13641

SPLIT: Sparse Incremental Learning of Error Dynamics for Control-Oriented Modeling in Autonomous Vehicles

SPLIT:面向控制的自主车辆误差动态稀疏增量学习
Li, Yaoyu, Huang, Chaosheng, Li, Jun
Abstract
Accurate, computationally efficient, and adaptive vehicle models are essential for autonomous vehicle control. Hybrid models that combine a nominal model with a Gaussian Process (GP)-based residual model have emerged as a promising approach. However, the GP-based residual model suffers from the curse of dimensionality, high evaluation complexity, and the inefficiency of online learning, which impede the deployment in real-time vehicle controllers. To address these challenges, we propose SPLIT, a sparse incremental learning framework for control-oriented vehicle dynamics modeling. SPLIT integrates three key innovations: (i) Model Decomposition. We decompose the vehicle model into invariant elements calibrated by experiments, and variant elements compensated by the residual model to reduce feature dimensionality. (ii) Local Incremental Learning. We define the valid region in the feature space and partition it into subregions, enabling efficient online learning from streaming data. (iii) GP Sparsification. We use bayesian committee machine to ensure scalable online evaluation. Integrated into model-based controllers, SPLIT is evaluated in aggressive simulations and real-vehicle experiments. Results demonstrate that SPLIT improves model accuracy and control performance online. Moreover, it enables rapid adaptation to vehicle dynamics deviations and exhibits robust generalization to previously unseen scenarios.
Chinese Translation
准确、计算高效且具有适应性的车辆模型对于自主车辆控制至关重要。将名义模型与基于高斯过程(Gaussian Process, GP)的残差模型相结合的混合模型已成为一种有前景的方法。然而,基于GP的残差模型受到维度诅咒、高评估复杂性以及在线学习效率低下的困扰,这些问题阻碍了其在实时车辆控制器中的应用。为了解决这些挑战,我们提出了SPLIT,一个用于面向控制的车辆动态建模的稀疏增量学习框架。SPLIT集成了三个关键创新:(i)模型分解。我们将车辆模型分解为通过实验校准的恒定元素和通过残差模型补偿的可变元素,以减少特征维度。(ii)局部增量学习。我们定义特征空间中的有效区域并将其划分为子区域,从而实现从流数据中高效的在线学习。(iii)GP稀疏化。我们使用贝叶斯委员会机器(bayesian committee machine)来确保可扩展的在线评估。SPLIT集成到基于模型的控制器中,在激进的仿真和真实车辆实验中进行了评估。结果表明,SPLIT在线提高了模型的准确性和控制性能。此外,它能够快速适应车辆动态的偏差,并在以前未见过的场景中表现出强大的泛化能力。
cs.RO / 12 / 2602.13656

A Kung Fu Athlete Bot That Can Do It All Day: Highly Dynamic, Balance-Challenging Motion Dataset and Autonomous Fall-Resilient Tracking

一款全天候的功夫运动员机器人:高度动态、平衡挑战的运动数据集与自主抗跌落跟踪
Lei, Zhongxiang, Cao, Lulu, Wang, Xuyang, Qian, Tianyi, Liu, Jinyan, Li, Xuesong
Abstract
Current humanoid motion tracking systems can execute routine and moderately dynamic behaviors, yet significant gaps remain near hardware performance limits and algorithmic robustness boundaries. Martial arts represent an extreme case of highly dynamic human motion, characterized by rapid center-of-mass shifts, complex coordination, and abrupt posture transitions. However, datasets tailored to such high-intensity scenarios remain scarce. To address this gap, we construct KungFuAthlete, a high-dynamic martial arts motion dataset derived from professional athletes' daily training videos. The dataset includes ground and jump subsets covering representative complex motion patterns. The jump subset exhibits substantially higher joint, linear, and angular velocities compared to commonly used datasets such as LAFAN1, PHUMA, and AMASS, indicating significantly increased motion intensity and complexity. Importantly, even professional athletes may fail during highly dynamic movements. Similarly, humanoid robots are prone to instability and falls under external disturbances or execution errors. Most prior work assumes motion execution remains within safe states and lacks a unified strategy for modeling unsafe states and enabling reliable autonomous recovery. We propose a novel training paradigm that enables a single policy to jointly learn high-dynamic motion tracking and fall recovery, unifying agile execution and stabilization within one framework. This framework expands robotic capability from pure motion tracking to recovery-enabled execution, promoting more robust and autonomous humanoid performance in real-world high-dynamic scenarios.
Chinese Translation
当前的人形运动跟踪系统能够执行常规和中等动态的行为,但在硬件性能极限和算法鲁棒性边界附近仍存在显著差距。武术代表了一种极端的高度动态人类运动,其特点是快速的重心移动、复杂的协调以及突发的姿态转换。然而,针对这种高强度场景的数据集仍然稀缺。为了解决这一问题,我们构建了KungFuAthlete,这是一个基于专业运动员日常训练视频的高动态武术运动数据集。该数据集包括覆盖代表性复杂运动模式的地面和跳跃子集。与常用数据集如LAFAN1、PHUMA和AMASS相比,跳跃子集表现出显著更高的关节、线性和角速度,表明运动强度和复杂性显著增加。重要的是,即使是专业运动员在高度动态的运动中也可能出现失误。同样,人形机器人在外部干扰或执行错误下也容易出现不稳定和跌落。大多数先前的研究假设运动执行保持在安全状态内,缺乏统一的策略来建模不安全状态并实现可靠的自主恢复。我们提出了一种新颖的训练范式,使单一策略能够共同学习高动态运动跟踪和跌落恢复,将灵活执行和稳定性统一在一个框架内。该框架将机器人能力从纯运动跟踪扩展到支持恢复的执行,促进在现实世界高动态场景中更强大和自主的人形表现。
cs.RO / 13 / 2602.13689

Symmetry-Aware Fusion of Vision and Tactile Sensing via Bilateral Force Priors for Robotic Manipulation

基于双边力先验的视觉与触觉感知对称融合用于机器人操控
Lee, Wonju, Grimaldi, Matteo, Yu, Tao
Abstract
Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that na\"ive visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing na\"ive and gated fusion baselines and closely matching the privileged "wrist + contact force" configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.
Chinese Translation
机器人操控中的插入任务需要精确且丰富的接触交互,而单靠视觉无法解决。尽管触觉反馈直观上具有价值,现有研究表明,简单的视觉-触觉融合往往无法提供一致的改进。在本研究中,我们提出了一种跨模态变换器(Cross-Modal Transformer, CMT)用于视觉-触觉融合,通过结构化的自注意力和交叉注意力将腕部摄像头观察与触觉信号结合起来。为了稳定触觉嵌入,我们进一步引入了一种物理信息正则化,鼓励双边力平衡,反映人类运动控制的原则。在TacSL基准测试中的实验表明,采用对称正则化的CMT达到了96.59%的插入成功率,超越了简单和门控融合的基线,并与特权的“腕部 + 接触力”配置(96.09%)接近。这些结果突显了两个核心见解:(i)触觉感知对于精确对齐不可或缺;(ii)经过物理信息正则化进一步增强的原则性多模态融合,释放了视觉和触觉的互补优势,在现实感知下接近特权性能。
cs.RO / 14 / 2602.13718

HybridFlow: A Two-Step Generative Policy for Robotic Manipulation

HybridFlow:一种用于机器人操作的两步生成策略
Dong, Zhenchen, Fu, Jinna, Wu, Jiaming, Yu, Shengyuan, Chen, Fulin, Liu, Yide
Abstract
Limited by inference latency, existing robot manipulation policies lack sufficient real-time interaction capability with the environment. Although faster generation methods such as flow matching are gradually replacing diffusion methods, researchers are pursuing even faster generation suitable for interactive robot control. MeanFlow, as a one-step variant of flow matching, has shown strong potential in image generation, but its precision in action generation does not meet the stringent requirements of robotic manipulation. We therefore propose \textbf{HybridFlow}, a \textbf{3-stage method} with \textbf{2-NFE}: Global Jump in MeanFlow mode, ReNoise for distribution alignment, and Local Refine in ReFlow mode. This method balances inference speed and generation quality by leveraging the rapid advantage of MeanFlow one-step generation while ensuring action precision with minimal generation steps. Through real-world experiments, HybridFlow outperforms the 16-step Diffusion Policy by \textbf{15--25\%} in success rate while reducing inference time from 152ms to 19ms (\textbf{8$\times$ speedup}, \textbf{$\sim$52Hz}); it also achieves 70.0\% success on unseen-color OOD grasping and 66.3\% on deformable object folding. We envision HybridFlow as a practical low-latency method to enhance real-world interaction capabilities of robotic manipulation policies.
Chinese Translation
受限于推理延迟,现有的机器人操作策略缺乏与环境进行充分实时交互的能力。尽管更快的生成方法如流匹配(flow matching)正在逐渐取代扩散方法,研究人员仍在追求更适合于交互式机器人控制的更快生成。作为流匹配的一种一步变体,MeanFlow在图像生成中表现出强大的潜力,但其在动作生成中的精度未能满足机器人操作的严格要求。因此,我们提出了 extbf{HybridFlow},这是一种 extbf{3阶段方法},具有 extbf{2-NFE}:在MeanFlow模式下进行全局跳跃(Global Jump),通过ReNoise进行分布对齐,以及在ReFlow模式下进行局部精细化(Local Refine)。该方法通过利用MeanFlow一步生成的快速优势,同时确保动作精度以最小生成步骤,平衡了推理速度和生成质量。通过实际实验,HybridFlow在成功率上比16步扩散策略提高了 extbf{15--25 extbf{ extpercent}},同时将推理时间从152毫秒减少到19毫秒( extbf{8$ imes$加速}, extbf{$ extsim$52Hz});在未见颜色的OOD抓取中成功率达到70.0 extbf{ extpercent},在可变形物体折叠中成功率达到66.3 extbf{ extpercent}。我们设想HybridFlow作为一种实用的低延迟方法,以增强机器人操作策略在现实世界中的交互能力。
cs.RO / 15 / 2602.13720

FC-Vision: Real-Time Visibility-Aware Replanning for Occlusion-Free Aerial Target Structure Scanning in Unknown Environments

FC-Vision:实时可见性感知的重新规划框架,用于在未知环境中无遮挡的空中目标结构扫描
Feng, Chen, Xu, Yang, Shen, Shaojie
Abstract
Autonomous aerial scanning of target structures is crucial for practical applications, requiring online adaptation to unknown obstacles during flight. Existing methods largely emphasize collision avoidance and efficiency, but overlook occlusion-induced visibility degradation, severely compromising scanning quality. In this study, we propose FC-Vision, an on-the-fly visibility-aware replanning framework that proactively and safely prevents target occlusions while preserving the intended coverage and efficiency of the original plan. Our approach explicitly enforces dense surface-visibility constraints to regularize replanning behavior in real-time via an efficient two-level decomposition: occlusion-free viewpoint repair that maintains coverage with minimal deviation from the nominal scan intent, followed by segment-wise clean-sensing connection in 5-DoF space. A plug-in integration strategy is also presented to seamlessly interface FC-Vision with existing UAV scanning systems without architectural changes. Comprehensive simulation and real-world evaluations show that FC-Vision consistently improves scanning quality under unexpected occluders, delivering a maximum coverage gain of 55.32% and a 73.17% reduction in the occlusion ratio, while achieving real-time performance with a moderate increase in flight time. The source code will be made publicly available.
Chinese Translation
自主空中扫描目标结构对于实际应用至关重要,要求在飞行过程中能够在线适应未知障碍物。现有方法主要强调避碰和效率,但忽视了遮挡引起的可见性下降,严重影响了扫描质量。在本研究中,我们提出了FC-Vision,一种实时可见性感知的重新规划框架,能够主动且安全地防止目标遮挡,同时保持原计划的覆盖范围和效率。我们的方法明确施加了密集表面可见性约束,通过高效的两级分解实时规范重新规划行为:首先进行无遮挡视点修复,以最小偏离名义扫描意图的方式保持覆盖,然后在5自由度空间中进行分段清晰感知连接。我们还提出了一种插件集成策略,使FC-Vision能够与现有无人机扫描系统无缝对接,而无需架构更改。全面的仿真和现实世界评估表明,FC-Vision在意外遮挡物下始终提高扫描质量,实现了最高55.32%的覆盖增益和73.17%的遮挡比降低,同时在适度增加飞行时间的情况下实现实时性能。源代码将公开发布。
cs.RO / 16 / 2602.13733

Improving Driver Satisfaction with a Driving Function Learning from Implicit Human Feedback -- a Test Group Study

通过隐性人类反馈学习的驾驶功能提升驾驶员满意度——一项测试组研究
Schwager, Robin, Anastasio, Andrea, Hartmann, Simon, Ronellenfitsch, Andreas, Grimm, Michael, Brühl, Tim, Sohn, Tin Stribor, Eberhardt, Tim Dieter, Hohmann, Sören
Abstract
During the use of advanced driver assistance systems, drivers frequently intervene into the active driving function and adjust the system's behavior to their personal wishes. These active driver-initiated takeovers contain feedback about deviations in the driving function's behavior from the drivers' personal preferences. This feedback should be utilized to optimize and personalize the driving function's behavior. In this work, the adjustment of the speed profile of a Predictive Longitudinal Driving Function (PLDF) on a pre-defined route is highlighted. An algorithm is introduced which iteratively adjusts the PLDF's speed profile by taking into account both the original speed profile of the PLDF and the driver demonstration. This approach allows for personalization in a traded control scenario during active use of the PLDF. The applicability of the proposed algorithm is tested in a driving simulator-based test group study with 43 participants. The study finds a significant increase in driver satisfaction and a significant reduction in the intervention frequency when using the proposed adaptive PLDF. Additionally, feedback by the participants was gathered to identify further optimization potentials of the proposed system.
Chinese Translation
在使用先进驾驶辅助系统的过程中,驾驶员经常干预主动驾驶功能,并根据个人意愿调整系统的行为。这些由驾驶员主动发起的接管行为包含了关于驾驶功能行为偏离驾驶员个人偏好的反馈。这些反馈应被利用来优化和个性化驾驶功能的行为。在本研究中,重点介绍了在预定义路线上的预测纵向驾驶功能(Predictive Longitudinal Driving Function, PLDF)速度曲线的调整。提出了一种算法,该算法通过考虑PLDF的原始速度曲线和驾驶员的演示,迭代地调整PLDF的速度曲线。这种方法允许在PLDF的主动使用过程中实现个性化的交易控制场景。所提出算法的适用性在一项基于驾驶模拟器的测试组研究中进行了测试,参与者人数为43人。研究发现,使用所提出的自适应PLDF后,驾驶员满意度显著提高,干预频率显著降低。此外,还收集了参与者的反馈,以识别所提系统的进一步优化潜力。
cs.RO / 17 / 2602.13739

XIT: Exploration and Exploitation Informed Trees for Active Gas Distribution Mapping in Unknown Environments

XIT:用于未知环境中主动气体分布映射的探索与利用信息树
Fazliu, Mal, Coombes, Matthew, Wang, Sen, Liu, Cunjia
Abstract
Mobile robotic gas distribution mapping (GDM) provides critical situational awareness during emergency responses to hazardous gas releases. However, most systems still rely on teleoperation, limiting scalability and response speed. Autonomous active GDM is challenging in unknown and cluttered environments, because the robot must simultaneously explore traversable space, map the environment, and infer the gas distribution belief from sparse chemical measurements. We address this by formulating active GDM as a next-best-trajectory informative path planning (IPP) problem and propose XIT (Exploration-Exploitation Informed Trees), a sampling-based planner that balances exploration and exploitation by generating concurrent trajectories toward exploration-rich goals while collecting informative gas measurements en route. XIT draws batches of samples from an Upper Confidence Bound (UCB) information field derived from the current gas posterior and expands trees using a cost that trades off travel effort against gas concentration and uncertainty. To enable plume-aware exploration, we introduce the gas frontier concept, defined as unobserved regions adjacent to high gas concentrations, and propose the Wavefront Gas Frontier Detection (WGFD) algorithm for their identification. High-fidelity simulations and real-world experiments demonstrate the benefits of XIT in terms of GDM quality and efficiency. Although developed for active GDM, XIT is readily applicable to other robotic information-gathering tasks in unknown environments that face the exploration and exploitation trade-off.
Chinese Translation
移动机器人气体分布映射(GDM)在应对有害气体泄漏的紧急响应中提供了关键的情境意识。然而,大多数系统仍依赖于遥控操作,这限制了可扩展性和响应速度。在未知和杂乱的环境中,主动GDM面临挑战,因为机器人必须同时探索可通行空间、绘制环境地图,并根据稀疏的化学测量推断气体分布信念。我们通过将主动GDM形式化为下一个最佳轨迹信息路径规划(IPP)问题来解决这一挑战,并提出了XIT(探索-利用信息树),这是一种基于采样的规划器,通过生成朝向丰富探索目标的并行轨迹,同时在途中收集信息性气体测量,来平衡探索与利用。XIT从基于当前气体后验的上置信界(UCB)信息场中提取样本批次,并使用一种权衡旅行努力与气体浓度和不确定性的成本来扩展树。为了实现对气体羽流的感知探索,我们引入了气体前沿的概念,定义为邻近高气体浓度的未观察区域,并提出了波前气体前沿检测(WGFD)算法以识别这些区域。高保真模拟和现实世界实验展示了XIT在GDM质量和效率方面的优势。尽管XIT是为主动GDM开发的,但它同样适用于在面临探索与利用权衡的未知环境中进行的其他机器人信息收集任务。
cs.RO / 18 / 2602.13747

The More the Merrier: Running Multiple Neuromorphic Components On-Chip for Robotic Control

越多越好:在芯片上运行多个神经形态组件以实现机器人控制
Eames, Evan, Kannan, Priyadarshini, Sangouard, Ronan, Plank, Philipp, Hajizada, Elvin, Palinauskas, Gintautas, Amaya, Lana, Neumeier, Michael, Sharma, Sai Thejeshwar, Toth, Marcella, Sarkar, Prottush, von Arnim, Axel
Abstract
It has long been realized that neuromorphic hardware offers benefits for the domain of robotics such as low energy, low latency, as well as unique methods of learning. In aiming for more complex tasks, especially those incorporating multimodal data, one hurdle continuing to prevent their realization is an inability to orchestrate multiple networks on neuromorphic hardware without resorting to off-chip process management logic. To address this, we show a first example of a pipeline for vision-based robot control in which numerous complex networks can be run entirely on hardware via the use of a spiking neural state machine for process orchestration. The pipeline is validated on the Intel Loihi 2 research chip. We show that all components can run concurrently on-chip in the milli Watt regime at latencies competitive with the state-of-the-art. An equivalent network on simulated hardware is shown to accomplish robotic arm plug insertion in simulation, and the core elements of the pipeline are additionally tested on a real robotic arm.
Chinese Translation
长期以来,人们认识到神经形态硬件在机器人领域提供了诸多优势,如低能耗、低延迟以及独特的学习方法。在追求更复杂的任务,特别是那些涉及多模态数据的任务时,持续阻碍其实现的一个障碍是无法在神经形态硬件上协调多个网络,而不得不依赖于芯片外的过程管理逻辑。为了解决这个问题,我们展示了一个基于视觉的机器人控制管道的首个示例,其中多个复杂网络可以通过使用脉冲神经状态机进行过程协调,完全在硬件上运行。该管道在英特尔Loihi 2研究芯片上进行了验证。我们展示了所有组件可以在毫瓦级别的功耗下并发运行,且延迟与最先进的技术相当。一个在模拟硬件上的等效网络被展示为在模拟中完成机器人手臂插头插入,管道的核心元素也在真实的机器人手臂上进行了额外测试。
cs.RO / 19 / 2602.13762

Impact-Robust Posture Optimization for Aerial Manipulation

抗冲击姿态优化方法在空中操作中的应用
Afifi, Amr, Gazar, Ahmad, Alonso-Mora, Javier, Giordano, Paolo Robuffo, Franchi, Antonio
Abstract
We present a novel method for optimizing the posture of kinematically redundant torque-controlled robots to improve robustness during impacts. A rigid impact model is used as the basis for a configuration-dependent metric that quantifies the variation between pre- and post-impact velocities. By finding configurations (postures) that minimize the aforementioned metric, spikes in the robot's state and input commands can be significantly reduced during impacts, improving safety and robustness. The problem of identifying impact-robust postures is posed as a min-max optimization of the aforementioned metric. To overcome the real-time intractability of the problem, we reformulate it as a gradient-based motion task that iteratively guides the robot towards configurations that minimize the proposed metric. This task is embedded within a task-space inverse dynamics (TSID) whole-body controller, enabling seamless integration with other control objectives. The method is applied to a kinematically redundant aerial manipulator performing repeated point contact tasks. We test our method inside a realistic physics simulator and compare it with the nominal TSID. Our method leads to a reduction (up to 51% w.r.t. standard TSID) of post-impact spikes in the robot's configuration and successfully avoids actuator saturation. Moreover, we demonstrate the importance of kinematic redundancy for impact robustness using additional numerical simulations on a quadruped and a humanoid robot, resulting in up to 45% reduction of post-impact spikes in the robot's state w.r.t. nominal TSID.
Chinese Translation
我们提出了一种新颖的方法,用于优化运动冗余的扭矩控制机器人姿态,以提高其在冲击过程中的鲁棒性。我们使用刚性冲击模型作为基础,构建了一种依赖于配置的度量,量化冲击前后速度的变化。通过寻找能够最小化上述度量的配置(姿态),可以显著减少机器人在冲击过程中的状态和输入命令的尖峰,从而提高安全性和鲁棒性。识别抗冲击姿态的问题被表述为上述度量的最小-最大优化。为了克服该问题在实时处理中的不可行性,我们将其重新表述为基于梯度的运动任务,迭代引导机器人朝向最小化所提度量的配置。该任务嵌入在任务空间逆动力学(TSID)全身控制器中,使其能够与其他控制目标无缝集成。该方法应用于执行重复点接触任务的运动冗余空中操纵器。我们在一个真实的物理模拟器中测试了该方法,并与标准TSID进行了比较。我们的研究表明,该方法能够减少机器人配置中的后冲击尖峰(相较于标准TSID减少高达51%),并成功避免了执行器饱和。此外,我们通过对四足机器人和人形机器人进行额外的数值模拟,展示了运动冗余对抗冲击鲁棒性的重要性,结果显示相较于标准TSID,机器人状态中的后冲击尖峰减少了高达45%。
cs.RO / 20 / 2602.13764

MOTIF: Learning Action Motifs for Few-shot Cross-Embodiment Transfer

MOTIF:用于少样本跨体现转移的动作模式学习
Zhi, Heng, Tan, Wentao, Zhu, Lei, Li, Fengling, Li, Jingjing, Yang, Guoli, Shen, Heng Tao
Abstract
While vision-language-action (VLA) models have advanced generalist robotic learning, cross-embodiment transfer remains challenging due to kinematic heterogeneity and the high cost of collecting sufficient real-world demonstrations to support fine-tuning. Existing cross-embodiment policies typically rely on shared-private architectures, which suffer from limited capacity of private parameters and lack explicit adaptation mechanisms. To address these limitations, we introduce MOTIF for efficient few-shot cross-embodiment transfer that decouples embodiment-agnostic spatiotemporal patterns, termed action motifs, from heterogeneous action data. Specifically, MOTIF first learns unified motifs via vector quantization with progress-aware alignment and embodiment adversarial constraints to ensure temporal and cross-embodiment consistency. We then design a lightweight predictor that predicts these motifs from real-time inputs to guide a flow-matching policy, fusing them with robot-specific states to enable action generation on new embodiments. Evaluations across both simulation and real-world environments validate the superiority of MOTIF, which significantly outperforms strong baselines in few-shot transfer scenarios by 6.5% in simulation and 43.7% in real-world settings. Code is available at https://github.com/buduz/MOTIF.
Chinese Translation
尽管视觉-语言-动作(VLA)模型在通用机器人学习方面取得了进展,但由于运动学异质性和收集足够的现实世界演示以支持微调的高成本,跨体现转移仍然具有挑战性。现有的跨体现策略通常依赖于共享-私有架构,这些架构受到私有参数容量有限和缺乏明确适应机制的影响。为了解决这些局限性,我们提出了MOTIF,以实现高效的少样本跨体现转移,该方法将与体现无关的时空模式(称为动作模式)与异质动作数据解耦。具体而言,MOTIF首先通过向量量化与进度感知对齐和体现对抗约束学习统一的模式,以确保时间和跨体现的一致性。然后,我们设计了一个轻量级预测器,从实时输入中预测这些模式,以指导流匹配策略,将其与特定于机器人的状态融合,以便在新的体现上生成动作。在模拟和现实世界环境中的评估验证了MOTIF的优越性,在少样本转移场景中,MOTIF在模拟中比强基线提高了6.5%,在现实世界设置中提高了43.7%。代码可在 https://github.com/buduz/MOTIF 获取。
cs.RO / 21 / 2602.13800

Ontological grounding for sound and natural robot explanations via large language models

基于本体的声音和自然机器人解释的基础:通过大型语言模型
Olivares-Alarcos, Alberto, Ahsan, Muhammad, Sanjaya, Satrio, Lin, Hsien-I, Alenyà, Guillem
Abstract
Building effective human-robot interaction requires robots to derive conclusions from their experiences that are both logically sound and communicated in ways aligned with human expectations. This paper presents a hybrid framework that blends ontology-based reasoning with large language models (LLMs) to produce semantically grounded and natural robot explanations. Ontologies ensure logical consistency and domain grounding, while LLMs provide fluent, context-aware and adaptive language generation. The proposed method grounds data from human-robot experiences, enabling robots to reason about whether events are typical or atypical based on their properties. We integrate a state-of-the-art algorithm for retrieving and constructing static contrastive ontology-based narratives with an LLM agent that uses them to produce concise, clear, interactive explanations. The approach is validated through a laboratory study replicating an industrial collaborative task. Empirical results show significant improvements in the clarity and brevity of ontology-based narratives while preserving their semantic accuracy. Initial evaluations further demonstrate the system's ability to adapt explanations to user feedback. Overall, this work highlights the potential of ontology-LLM integration to advance explainable agency, and promote more transparent human-robot collaboration.
Chinese Translation
构建有效的人机交互要求机器人从其经验中得出既合乎逻辑又符合人类期望的结论。本文提出了一种混合框架,将基于本体的推理与大型语言模型(LLMs)相结合,以生成语义上扎实且自然的机器人解释。本体确保逻辑一致性和领域基础,而LLMs则提供流畅、上下文感知和自适应的语言生成。所提出的方法将人机经验中的数据进行基础化,使机器人能够基于事件的属性推理事件是典型的还是非典型的。我们整合了一种最先进的算法,用于检索和构建静态对比本体叙述,并与一个使用这些叙述生成简洁、清晰、互动解释的LLM代理相结合。该方法通过在实验室研究中复制工业协作任务进行验证。实证结果显示,在保持语义准确性的同时,基于本体的叙述在清晰度和简洁性上有显著改善。初步评估进一步证明了系统根据用户反馈调整解释的能力。总体而言,这项工作突显了本体与LLM集成在推动可解释智能体和促进更透明的人机协作方面的潜力。
cs.RO / 22 / 2602.13833

Semantic-Contact Fields for Category-Level Generalizable Tactile Tool Manipulation

用于类别级可泛化触觉工具操作的语义接触场
Ma, Kevin Yuchen, Zhang, Heng, Lin, Weisi, Shou, Mike Zheng, Wu, Yan
Abstract
Generalizing tool manipulation requires both semantic planning and precise physical control. Modern generalist robot policies, such as Vision-Language-Action (VLA) models, often lack the high-fidelity physical grounding required for contact-rich tool manipulation. Conversely, existing contact-aware policies that leverage tactile or haptic sensing are typically instance-specific and fail to generalize across diverse tool geometries. Bridging this gap requires learning unified contact representations from diverse data, yet a fundamental barrier remains: diverse real-world tactile data are prohibitive at scale, while direct zero-shot sim-to-real transfer is challenging due to the complex dynamics of nonlinear deformation of soft sensors. To address this, we propose Semantic-Contact Fields (SCFields), a unified 3D representation fusing visual semantics with dense contact estimates. We enable this via a two-stage Sim-to-Real Contact Learning Pipeline: first, we pre-train on a large simulation data set to learn general contact physics; second, we fine-tune on a small set of real data, pseudo-labeled via geometric heuristics and force optimization, to align sensor characteristics. This allows physical generalization to unseen tools. We leverage SCFields as the dense observation input for a diffusion policy to enable robust execution of contact-rich tool manipulation tasks. Experiments on scraping, crayon drawing, and peeling demonstrate robust category-level generalization, significantly outperforming vision-only and raw-tactile baselines.
Chinese Translation
工具操作的泛化需要语义规划和精确的物理控制。现代通用机器人策略,如视觉-语言-动作(Vision-Language-Action, VLA)模型,通常缺乏进行接触丰富的工具操作所需的高保真物理基础。相反,现有的接触感知策略利用触觉或力觉传感,通常是特定于实例的,无法在多样的工具几何形状之间进行泛化。弥合这一差距需要从多样的数据中学习统一的接触表示,但仍然存在一个基本障碍:多样的真实世界触觉数据在规模上是不可行的,而由于软传感器非线性变形的复杂动态,直接的零样本仿真到现实转移也具有挑战性。为了解决这个问题,我们提出了语义接触场(Semantic-Contact Fields, SCFields),这是一种将视觉语义与密集接触估计融合的统一三维表示。我们通过一个两阶段的仿真到现实接触学习管道来实现这一点:首先,我们在一个大型仿真数据集上进行预训练,以学习一般的接触物理;其次,我们在一小组真实数据上进行微调,这些数据通过几何启发式和力优化进行伪标注,以对齐传感器特性。这使得对未见工具的物理泛化成为可能。我们利用SCFields作为扩散策略的密集观测输入,以实现接触丰富的工具操作任务的稳健执行。在刮擦、蜡笔绘画和剥离的实验中,展示了稳健的类别级泛化,显著优于仅基于视觉和原始触觉的基线。
cs.RO / 23 / 2602.13849

Push-Placement: A Hybrid Approach Integrating Prehensile and Non-Prehensile Manipulation for Object Rearrangement

推放置:一种结合抓取与非抓取操作的混合方法用于物体重排
Sadeghinejad, Majid, Barghi, Arman, Hosseini, Hamed, Masouleh, Mehdi Tale, Kalhor, Ahmad
Abstract
Efficient tabletop rearrangement remains challenging due to collisions and the need for temporary buffering when target poses are obstructed. Prehensile pick-and-place provides precise control but often requires extra moves, whereas non-prehensile pushing can be more efficient but suffers from complex, imprecise dynamics. This paper proposes push-placement, a hybrid action primitive that uses the grasped object to displace obstructing items while being placed, thereby reducing explicit buffering. The method is integrated into a physics-in-the-loop Monte Carlo Tree Search (MCTS) planner and evaluated in the PyBullet simulator. Empirical results show push-placement reduces the manipulator travel cost by up to 11.12% versus a baseline MCTS planner and 8.56% versus dynamic stacking. These findings indicate that hybrid prehensile/non-prehensile action primitives can substantially improve efficiency in long-horizon rearrangement tasks.
Chinese Translation
高效的桌面重排仍然面临挑战,主要由于碰撞和在目标姿态被阻碍时需要临时缓冲。抓取式的拾取与放置提供了精确的控制,但通常需要额外的移动,而非抓取式的推动可以更高效,但存在复杂且不精确的动态问题。本文提出了推放置(push-placement),一种混合动作原语,利用被抓取物体在放置过程中位移阻碍物,从而减少显式缓冲。该方法集成到物理仿真的蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)规划器中,并在PyBullet模拟器中进行了评估。实证结果表明,推放置相比基线MCTS规划器减少了操控器的移动成本高达11.12%,相比动态堆叠减少了8.56%。这些发现表明,混合的抓取/非抓取动作原语可以显著提高长时间重排任务的效率。
cs.RO / 24 / 2602.13850

Humanoid Hanoi: Investigating Shared Whole-Body Control for Skill-Based Box Rearrangement

类人汉诺塔:研究基于技能的共享全身控制用于箱体重排
Kim, Minku, Chen, Kuan-Chia, Shrestha, Aayam, Fuxin, Li, Lee, Stefan, Fern, Alan
Abstract
We investigate a skill-based framework for humanoid box rearrangement that enables long-horizon execution by sequencing reusable skills at the task level. In our architecture, all skills execute through a shared, task-agnostic whole-body controller (WBC), providing a consistent closed-loop interface for skill composition, in contrast to non-shared designs that use separate low-level controllers per skill. We find that naively reusing the same pretrained WBC can reduce robustness over long horizons, as new skills and their compositions induce shifted state and command distributions. We address this with a simple data aggregation procedure that augments shared-WBC training with rollouts from closed-loop skill execution under domain randomization. To evaluate the approach, we introduce \emph{Humanoid Hanoi}, a long-horizon Tower-of-Hanoi box rearrangement benchmark, and report results in simulation and on the Digit V3 humanoid robot, demonstrating fully autonomous rearrangement over extended horizons and quantifying the benefits of the shared-WBC approach over non-shared baselines.
Chinese Translation
我们研究了一种基于技能的类人箱体重排框架,该框架通过在任务层面上对可重用技能进行排序,实现了长时间的执行。在我们的架构中,所有技能通过一个共享的、与任务无关的全身控制器(WBC)执行,为技能组合提供了一致的闭环接口,这与每个技能使用单独低级控制器的非共享设计形成对比。我们发现,简单地重用相同的预训练WBC可能会降低长时间执行的鲁棒性,因为新技能及其组合会引起状态和指令分布的变化。我们通过一种简单的数据聚合程序来解决这个问题,该程序通过在领域随机化下的闭环技能执行中进行回放,增强了共享WBC的训练。为了评估该方法,我们引入了 extit{类人汉诺塔},这是一个长时间的汉诺塔箱体重排基准,并在模拟和Digit V3类人机器人上报告了结果,展示了在延长的时间范围内完全自主的重排,并量化了共享WBC方法相较于非共享基线的优势。
cs.RO / 25 / 2602.13866

Modeling and Optimizing the Provisioning of Exhaustible Capabilities for Simultaneous Task Allocation and Scheduling

建模与优化可耗能力的提供以实现任务的同时分配与调度
Park, Jinwoo, Ravichandar, Harish, Hutchinson, Seth
Abstract
Deploying heterogeneous robot teams to accomplish multiple tasks over extended time horizons presents significant computational challenges for task allocation and planning. In this paper, we present a comprehensive, time-extended, offline heterogeneous multi-robot task allocation framework, TRAITS, which we believe to be the first that can cope with the provisioning of exhaustible traits under battery and temporal constraints. Specifically, we introduce a nonlinear programming-based trait distribution module that can optimize the trait-provisioning rate of coalitions to yield feasible and time-efficient solutions. TRAITS provides a more accurate feasibility assessment and estimation of task execution times and makespan by leveraging trait-provisioning rates while optimizing battery consumption -- an advantage that state-of-the-art frameworks lack. We evaluate TRAITS against two state-of-the-art frameworks, with results demonstrating its advantage in satisfying complex trait and battery requirements while remaining computationally tractable.
Chinese Translation
部署异构机器人团队在较长时间范围内完成多个任务面临着任务分配和规划的重大计算挑战。本文提出了一种全面的、时间扩展的离线异构多机器人任务分配框架TRAITS,我们认为这是第一个能够在电池和时间约束下应对可耗特征提供的框架。具体而言,我们引入了一种基于非线性规划的特征分配模块,该模块能够优化联盟的特征提供速率,从而产生可行且时间高效的解决方案。TRAITS通过利用特征提供速率来优化电池消耗,提供了更准确的可行性评估和任务执行时间及完成时间的估算,而这一点是当前最先进框架所缺乏的。我们将TRAITS与两个最先进的框架进行了评估,结果表明其在满足复杂的特征和电池需求方面具有优势,同时保持了计算的可处理性。
cs.RO / 26 / 2602.13900

UAV-SEAD: State Estimation Anomaly Dataset for UAVs

UAV-SEAD:无人机状态估计异常数据集
Kabaoglu, Aykut, Sariel, Sanem
Abstract
Accurate state estimation in Unmanned Aerial Vehicles (UAVs) is crucial for ensuring reliable and safe operation, as anomalies occurring during mission execution may induce discrepancies between expected and observed system behaviors, thereby compromising mission success or posing potential safety hazards. It is essential to continuously monitor and detect such conditions in order to ensure a timely response and maintain system reliability. In this work, we focus on UAV state estimation anomalies and provide a large-scale real-world UAV dataset to facilitate research aimed at improving the development of anomaly detection. Unlike existing datasets that primarily rely on injected faults into simulated data, this dataset comprises 1396 real flight logs totaling over 52 hours of flight time, collected across diverse indoor and outdoor environments using a collection of PX4-based UAVs equipped with a variety of sensor configurations. The dataset comprises both normal and anomalous flights without synthetic manipulation, making it uniquely suitable for realistic anomaly detection tasks. A structured classification is proposed that categorizes UAV state estimation anomalies into four classes: mechanical and electrical, external position, global position, and altitude anomalies. These classifications reflect collective, contextual, and outlier anomalies observed in multivariate sensor data streams, including IMU, GPS, barometer, magnetometer, distance sensors, visual odometry, and optical flow, that can be found in the PX4 logging mechanism. It is anticipated that this dataset will play a key role in the development, training, and evaluation of anomaly detection and isolation systems to address the critical gap in UAV reliability research.
Chinese Translation
在无人机(UAV)中,准确的状态估计对于确保可靠和安全的操作至关重要,因为在任务执行过程中发生的异常可能导致预期与观察到的系统行为之间的差异,从而危及任务成功或造成潜在的安全隐患。因此,持续监测和检测这些条件以确保及时响应和维持系统可靠性是必要的。在本研究中,我们关注无人机状态估计异常,并提供一个大规模的真实无人机数据集,以促进旨在改善异常检测发展的研究。与现有数据集主要依赖于注入故障到模拟数据不同,该数据集包含1396个真实飞行日志,总飞行时间超过52小时,数据收集自多种室内和室外环境,使用了一系列基于PX4的无人机,配备了多种传感器配置。该数据集包括正常和异常飞行,且没有合成操作,使其特别适合于现实的异常检测任务。我们提出了一种结构化分类方法,将无人机状态估计异常分为四类:机械和电气异常、外部位置异常、全球位置异常和高度异常。这些分类反映了在多变量传感器数据流中观察到的集体、上下文和离群异常,包括IMU、GPS、气压计、磁力计、距离传感器、视觉里程计和光流,这些数据可以在PX4日志机制中找到。预计该数据集将在异常检测和隔离系统的开发、训练和评估中发挥关键作用,以填补无人机可靠性研究中的重要空白。
cs.RO / 27 / 2602.13909

High-fidelity 3D reconstruction for planetary exploration

用于行星探索的高保真三维重建
Martínez-Petersen, Alfonso, Gerdes, Levin, Rodríguez-Martínez, David, Pérez-del-Pulgar, C. J.
Abstract
Planetary exploration increasingly relies on autonomous robotic systems capable of perceiving, interpreting, and reconstructing their surroundings in the absence of global positioning or real-time communication with Earth. Rovers operating on planetary surfaces must navigate under sever environmental constraints, limited visual redundancy, and communication delays, making onboard spatial awareness and visual localization key components for mission success. Traditional techniques based on Structure-from-Motion (SfM) and Simultaneous Localization and Mapping (SLAM) provide geometric consistency but struggle to capture radiometric detail or to scale efficiently in unstructured, low-texture terrains typical of extraterrestrial environments. This work explores the integration of radiance field-based methods - specifically Neural Radiance Fields (NeRF) and Gaussian Splatting - into a unified, automated environment reconstruction pipeline for planetary robotics. Our system combines the Nerfstudio and COLMAP frameworks with a ROS2-compatible workflow capable of processing raw rover data directly from rosbag recordings. This approach enables the generation of dense, photorealistic, and metrically consistent 3D representations from minimal visual input, supporting improved perception and planning for autonomous systems operating in planetary-like conditions. The resulting pipeline established a foundation for future research in radiance field-based mapping, bridging the gap between geometric and neural representations in planetary exploration.
Chinese Translation
行星探索日益依赖于能够在缺乏全球定位或与地球实时通信的情况下感知、解释和重建周围环境的自主机器人系统。在行星表面运行的探测车必须在严苛的环境约束、有限的视觉冗余和通信延迟下导航,使得机载空间感知和视觉定位成为任务成功的关键组成部分。基于运动结构(Structure-from-Motion, SfM)和同时定位与地图构建(Simultaneous Localization and Mapping, SLAM)的传统技术提供了几何一致性,但在捕捉辐射度细节或在典型的外星环境中低纹理的非结构化地形上高效扩展方面存在困难。本研究探讨了基于辐射场的方法的整合——特别是神经辐射场(Neural Radiance Fields, NeRF)和高斯喷溅(Gaussian Splatting)——到一个统一的、自动化的环境重建管道中,以服务于行星机器人。我们的系统结合了Nerfstudio和COLMAP框架,并采用与ROS2兼容的工作流程,能够直接处理来自rosbag录制的原始探测车数据。这种方法使得能够从最小的视觉输入生成密集的、照片真实的和度量一致的三维表示,支持在类行星条件下运行的自主系统的感知和规划的改进。最终建立的管道为未来基于辐射场的地图构建研究奠定了基础,弥合了行星探索中几何和神经表示之间的差距。
cs.RO / 28 / 2602.13932

Joint Task Assistance Planning via Nested Branch and Bound (Extended Version)

通过嵌套分支界限的联合任务辅助规划(扩展版)
Daube, Omer, Salzman, Oren
Abstract
We introduce and study the Joint Task Assistance Planning problem which generalizes prior work on optimizing assistance in robotic collaboration. In this setting, two robots operate over predefined roadmaps, each represented as a graph corresponding to its configuration space. One robot, the task robot, must execute a timed mission, while the other, the assistance robot, provides sensor-based support that depends on their spatial relationship. The objective is to compute a path for both robots that maximizes the total duration of assistance given. Solving this problem is challenging due to the combinatorial explosion of possible path combinations together with the temporal nature of the problem (time needs to be accounted for as well). To address this, we propose a nested branch-and-bound framework that efficiently explores the space of robot paths in a hierarchical manner. We empirically evaluate our algorithm and demonstrate a speedup of up to two orders of magnitude when compared to a baseline approach.
Chinese Translation
我们引入并研究了联合任务辅助规划问题,该问题概括了之前在机器人协作中优化辅助的研究。在这一背景下,两台机器人在预定义的路线图上操作,每台机器人对应于其配置空间的图形表示。一台机器人,即任务机器人,必须执行一个定时任务,而另一台机器人,即辅助机器人,则提供基于传感器的支持,这取决于它们的空间关系。我们的目标是计算两台机器人的路径,以最大化提供的总辅助时间。解决这个问题具有挑战性,因为可能路径组合的组合爆炸以及问题的时间特性(时间也需要被考虑)。为了解决这个问题,我们提出了一种嵌套分支界限框架,以层次化的方式高效地探索机器人路径的空间。我们对算法进行了实证评估,并在与基线方法的比较中展示了高达两个数量级的加速。
cs.RO / 29 / 2602.13977

WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

WoVR:作为后训练视觉-语言-动作(VLA)策略可靠模拟器的世界模型
Jiang, Zhennan, Zhou, Shangqing, Jiang, Yutong, Huang, Zefang, Wei, Mingjie, Chen, Yuhui, Zhou, Tianxing, Guo, Zhen, Lin, Hao, Zhang, Quanlu, Wang, Yu, Li, Haoran, Yu, Chao, Zhao, Dongbin
Abstract
Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors do not merely degrade visual fidelity; they corrupt the optimization signal, encouraging policies to exploit model inaccuracies rather than genuine task progress. We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics. It improves rollout stability through a controllable action-conditioned video world model, reshapes imagined interaction to reduce effective error depth via Keyframe-Initialized Rollouts, and maintains policy-simulator alignment through World Model-Policy co-evolution. Extensive experiments on LIBERO benchmarks and real-world robotic manipulation demonstrate that WoVR enables stable long-horizon imagined rollouts and effective policy optimization, improving average LIBERO success from 39.95% to 69.2% (+29.3 points) and real-robot success from 61.7% to 91.7% (+30.0 points). These results show that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly controlled.
Chinese Translation
强化学习(RL)有望为视觉-语言-动作(VLA)模型解锁超越模仿学习的能力,但其对大量现实世界交互的需求阻碍了在物理机器人上的直接部署。近期的研究尝试使用学习到的世界模型作为策略优化的模拟器,但闭环想象回放不可避免地受到幻觉和长时间误差累积的影响。这些误差不仅降低了视觉真实感,还破坏了优化信号,促使策略利用模型的不准确性而非真实任务进展。我们提出了WoVR,一个基于世界模型的可靠强化学习框架,用于后训练的VLA策略。WoVR并不假设一个真实的世界模型,而是明确调节RL如何与不完美的想象动态进行交互。它通过可控的动作条件视频世界模型提高回放稳定性,通过关键帧初始化回放重塑想象交互以减少有效误差深度,并通过世界模型-策略共同进化保持策略与模拟器的一致性。在LIBERO基准和真实机器人操作上的大量实验表明,WoVR能够实现稳定的长时间想象回放和有效的策略优化,将LIBERO的平均成功率从39.95%提高到69.2%(+29.3个百分点),将真实机器人成功率从61.7%提高到91.7%(+30.0个百分点)。这些结果表明,当幻觉被明确控制时,学习到的世界模型可以作为强化学习的实用模拟器。
cs.RO / 30 / 2602.13999

It Takes Two to Tango: A Holistic Simulator for Joint Order Scheduling and Multi-Agent Path Finding in Robotic Warehouses

双人共舞:用于机器人仓库联合订单调度与多智能体路径规划的整体模拟器
Xu, Haozheng, Li, Wenhao, Wei, Zifan, Jin, Bo, Bai, Hongxing, Yang, Ben, Wang, Xiangfeng
Abstract
The prevailing paradigm in Robotic Mobile Fulfillment Systems (RMFS) typically treats order scheduling and multi-agent pathfinding as isolated sub-problems. We argue that this decoupling is a fundamental bottleneck, masking the critical dependencies between high-level dispatching and low-level congestion. Existing simulators fail to bridge this gap, often abstracting away heterogeneous kinematics and stochastic execution failures. We propose WareRover, a holistic simulation platform that enforces a tight coupling between OS and MAPF via a unified, closed-loop optimization interface. Unlike standard benchmarks, WareRover integrates dynamic order streams, physics-aware motion constraints, and non-nominal recovery mechanisms into a single evaluation loop. Experiments reveal that SOTA algorithms often falter under these realistic coupled constraints, demonstrating that WareRover provides a necessary and challenging testbed for robust, next-generation warehouse coordination. The project and video is available at https://hhh-x.github.io/WareRover/.
Chinese Translation
在机器人移动履行系统(RMFS)中,当前的范式通常将订单调度和多智能体路径规划视为孤立的子问题。我们认为,这种解耦是一个根本性的瓶颈,掩盖了高层调度与低层拥堵之间的关键依赖关系。现有的模拟器未能弥合这一差距,往往抽象掉异质运动学和随机执行失败。我们提出了WareRover,一个整体模拟平台,通过统一的闭环优化接口强制订单调度(OS)与多智能体路径规划(MAPF)之间的紧密耦合。与标准基准测试不同,WareRover将动态订单流、考虑物理的运动约束和非正常恢复机制整合到一个单一的评估循环中。实验表明,最先进(SOTA)算法在这些现实的耦合约束下往往表现不佳,证明了WareRover为下一代仓库协调提供了一个必要且具有挑战性的测试平台。项目和视频可在 https://hhh-x.github.io/WareRover/ 获取。
cs.RO / 31 / 2602.14032

RoboAug: One Annotation to Hundreds of Scenes via Region-Contrastive Data Augmentation for Robotic Manipulation

RoboAug:通过区域对比数据增强实现单一标注对应数百场景的机器人操控
Wang, Xinhua, Wu, Kun, Zhao, Zhen, Cao, Hu, Zhao, Yinuo, Xu, Zhiyuan, Li, Meng, Fan, Shichao, Wu, Di, Zhang, Yixue, Liu, Ning, Che, Zhengping, Tang, Jian
Abstract
Enhancing the generalization capability of robotic learning to enable robots to operate effectively in diverse, unseen scenes is a fundamental and challenging problem. Existing approaches often depend on pretraining with large-scale data collection, which is labor-intensive and time-consuming, or on semantic data augmentation techniques that necessitate an impractical assumption of flawless upstream object detection in real-world scenarios. In this work, we propose RoboAug, a novel generative data augmentation framework that significantly minimizes the reliance on large-scale pretraining and the perfect visual recognition assumption by requiring only the bounding box annotation of a single image during training. Leveraging this minimal information, RoboAug employs pre-trained generative models for precise semantic data augmentation and integrates a plug-and-play region-contrastive loss to help models focus on task-relevant regions, thereby improving generalization and boosting task success rates. We conduct extensive real-world experiments on three robots, namely UR-5e, AgileX, and Tien Kung 2.0, spanning over 35k rollouts. Empirical results demonstrate that RoboAug significantly outperforms state-of-the-art data augmentation baselines. Specifically, when evaluating generalization capabilities in unseen scenes featuring diverse combinations of backgrounds, distractors, and lighting conditions, our method achieves substantial gains over the baseline without augmentation. The success rates increase from 0.09 to 0.47 on UR-5e, from 0.16 to 0.60 on AgileX, and from 0.19 to 0.67 on Tien Kung 2.0. These results highlight the superior generalization and effectiveness of RoboAug in real-world manipulation tasks. Our project is available at https://x-roboaug.github.io/.
Chinese Translation
提升机器人学习的泛化能力,使机器人能够在多样化的、未见过的场景中有效操作,是一个基本且具有挑战性的问题。现有的方法通常依赖于大规模数据收集的预训练,这既费时又费力,或者依赖于语义数据增强技术,这要求在现实场景中假设完美的上游物体检测。在本研究中,我们提出了RoboAug,这是一种新颖的生成数据增强框架,显著减少了对大规模预训练和完美视觉识别假设的依赖,仅需在训练过程中对单幅图像进行边界框标注。利用这一最小信息,RoboAug采用预训练的生成模型进行精确的语义数据增强,并集成了即插即用的区域对比损失,帮助模型关注与任务相关的区域,从而提高泛化能力并提升任务成功率。我们在三种机器人(UR-5e、AgileX和Tien Kung 2.0)上进行了广泛的真实世界实验,涵盖超过35,000次的实验。实证结果表明,RoboAug显著优于最先进的数据增强基线。具体而言,在评估未见场景的泛化能力时,涉及背景、干扰物和光照条件的多样组合,我们的方法在没有增强的基线之上取得了显著提升。UR-5e的成功率从0.09提高到0.47,AgileX从0.16提高到0.60,Tien Kung 2.0从0.19提高到0.67。这些结果突显了RoboAug在现实世界操控任务中的卓越泛化能力和有效性。我们的项目可在 https://x-roboaug.github.io/ 获取。
cs.RO / 32 / 2602.14048

ProAct: A Dual-System Framework for Proactive Embodied Social Agents

ProAct:一种用于主动体现社交代理的双系统框架
Zhang, Zeyi, Kang, Zixi, Zhao, Ruijie, Feng, Yusen, Jiang, Biao, Liu, Libin
Abstract
Embodied social agents have recently advanced in generating synchronized speech and gestures. However, most interactive systems remain fundamentally reactive, responding only to current sensory inputs within a short temporal window. Proactive social behavior, in contrast, requires deliberation over accumulated context and intent inference, which conflicts with the strict latency budget of real-time interaction. We present \emph{ProAct}, a dual-system framework that reconciles this time-scale conflict by decoupling a low-latency \emph{Behavioral System} for streaming multimodal interaction from a slower \emph{Cognitive System} which performs long-horizon social reasoning and produces high-level proactive intentions. To translate deliberative intentions into continuous non-verbal behaviors without disrupting fluency, we introduce a streaming flow-matching model conditioned on intentions via ControlNet. This mechanism supports asynchronous intention injection, enabling seamless transitions between reactive and proactive gestures within a single motion stream. We deploy ProAct on a physical humanoid robot and evaluate both motion quality and interactive effectiveness. In real-world interaction user studies, participants and observers consistently prefer ProAct over reactive variants in perceived proactivity, social presence, and overall engagement, demonstrating the benefits of dual-system proactive control for embodied social interaction.
Chinese Translation
体现社交代理在生成同步的语言和手势方面最近取得了进展。然而,大多数交互系统仍然是根本上反应性的,仅在短时间窗口内对当前的感官输入作出反应。相比之下,主动社交行为需要对积累的上下文进行深思熟虑和意图推断,这与实时交互的严格延迟预算相冲突。我们提出了 extit{ProAct},一种双系统框架,通过将用于流式多模态交互的低延迟 extit{行为系统}与执行长时间范围社交推理并产生高层次主动意图的较慢 extit{认知系统}解耦,从而调和这一时间尺度冲突。为了将深思熟虑的意图转化为连续的非语言行为而不干扰流畅性,我们引入了一种基于意图的流式流匹配模型,通过ControlNet进行条件化。该机制支持异步意图注入,使得在单一运动流中能够实现反应性和主动性手势之间的无缝过渡。我们在一个物理人形机器人上部署了ProAct,并评估了运动质量和交互效果。在现实世界的交互用户研究中,参与者和观察者一致偏好ProAct而非反应性变体,在感知的主动性、社交存在感和整体参与度方面表现更佳,展示了双系统主动控制在体现社交交互中的优势。
cs.RO / 33 / 2602.14099

SemanticFeels: Semantic Labeling during In-Hand Manipulation

SemanticFeels:手中操作过程中的语义标注
Khalil, Anas Al Shikh, Qi, Haozhi, Calandra, Roberto
Abstract
As robots become increasingly integrated into everyday tasks, their ability to perceive both the shape and properties of objects during in-hand manipulation becomes critical for adaptive and intelligent behavior. We present SemanticFeels, an extension of the NeuralFeels framework that integrates semantic labeling with neural implicit shape representation, from vision and touch. To illustrate its application, we focus on material classification: high-resolution Digit tactile readings are processed by a fine-tuned EfficientNet-B0 convolutional neural network (CNN) to generate local material predictions, which are then embedded into an augmented signed distance field (SDF) network that jointly predicts geometry and continuous material regions. Experimental results show that the system achieves a high correspondence between predicted and actual materials on both single- and multi-material objects, with an average matching accuracy of 79.87% across multiple manipulation trials on a multi-material object.
Chinese Translation
随着机器人越来越多地融入日常任务,它们在手中操作过程中感知物体形状和属性的能力变得至关重要,以实现自适应和智能行为。我们提出了SemanticFeels,这是一个扩展的NeuralFeels框架,结合了来自视觉和触觉的语义标注与神经隐式形状表示。为了说明其应用,我们专注于材料分类:高分辨率的Digit触觉读数通过经过微调的EfficientNet-B0卷积神经网络(CNN)进行处理,以生成局部材料预测,然后将这些预测嵌入到一个增强的有符号距离场(SDF)网络中,该网络共同预测几何形状和连续材料区域。实验结果表明,该系统在单一和多材料物体上预测的材料与实际材料之间具有高度对应性,在多材料物体的多次操作试验中,平均匹配准确率达到79.87%。
cs.RO / 34 / 2602.14104

Rigidity-Based Multi-Finger Coordination for Precise In-Hand Manipulation of Force-Sensitive Objects

基于刚性的多指协调用于对力敏感物体的精确手内操作
Rong, Xinan, Wan, Changhuang, He, Aochen, Li, Xiaolong, Jing, Gangshan
Abstract
Precise in-hand manipulation of force-sensitive objects typically requires judicious coordinated force planning as well as accurate contact force feedback and control. Unlike multi-arm platforms with gripper end effectors, multi-fingered hands rely solely on fingertip point contacts and are not able to apply pull forces, therefore poses a more challenging problem. Furthermore, calibrated torque sensors are lacking in most commercial dexterous hands, adding to the difficulty. To address these challenges, we propose a dual-layer framework for multi-finger coordination, enabling high-precision manipulation of force-sensitive objects through joint control without tactile feedback. This approach solves coordinated contact force planning by incorporating graph rigidity and force closure constraints. By employing a force-to-position mapping, the planned force trajectory is converted to a joint trajectory. We validate the framework on a custom dexterous hand, demonstrating the capability to manipulate fragile objects-including a soft yarn, a plastic cup, and a raw egg-with high precision and safety.
Chinese Translation
对力敏感物体的精确手内操作通常需要谨慎的协调力规划以及准确的接触力反馈和控制。与配备夹持器末端效应器的多臂平台不同,多指手仅依赖于指尖点接触,无法施加拉力,因此面临更具挑战性的问题。此外,大多数商业灵巧手缺乏经过校准的扭矩传感器,进一步增加了难度。为了解决这些挑战,我们提出了一种双层框架用于多指协调,通过关节控制实现对力敏感物体的高精度操作,而无需触觉反馈。该方法通过结合图的刚性和力闭合约束来解决协调接触力规划。通过采用力到位置的映射,规划的力轨迹被转换为关节轨迹。我们在一个定制的灵巧手上验证了该框架,展示了其以高精度和安全性操作脆弱物体的能力,包括软纱线、塑料杯和生鸡蛋。
cs.RO / 35 / 2602.14174

Direction Matters: Learning Force Direction Enables Sim-to-Real Contact-Rich Manipulation

方向至关重要:学习力的方向使得模拟到真实的接触丰富操作成为可能
Yang, Yifei, Chen, Anzhe, Zhu, Zhenjie, Xu, Kechun, Mao, Yunxuan, Wei, Yufei, Chen, Lu, Xiong, Rong, Wang, Yue
Abstract
Sim-to-real transfer for contact-rich manipulation remains challenging due to the inherent discrepancy in contact dynamics. While existing methods often rely on costly real-world data or utilize blind compliance through fixed controllers, we propose a framework that leverages expert-designed controller logic for transfer. Inspired by the success of privileged supervision in kinematic tasks, we employ a human-designed finite state machine based position/force controller in simulation to provide privileged guidance. The resulting policy is trained to predict the end-effector pose, contact state, and crucially the desired contact force direction. Unlike force magnitudes, which are highly sensitive to simulation inaccuracies, force directions encode high-level task geometry and remain robust across the sim-to-real gap. At deployment, these predictions configure a force-aware admittance controller. By combining the policy's directional intent with a constant, low-cost manually tuned force magnitude, the system generates adaptive, task-aligned compliance. This tuning is lightweight, typically requiring only a single scalar per contact state. We provide theoretical analysis for stability and robustness to disturbances. Experiments on four real-world tasks, i.e., microwave opening, peg-in-hole, whiteboard wiping, and door opening, demonstrate that our approach significantly outperforms strong baselines in both success rate and robustness. Videos are available at: https://yifei-y.github.io/project-pages/DirectionMatters/.
Chinese Translation
由于接触动力学固有的差异,接触丰富操作的模拟到真实转移仍然面临挑战。现有方法通常依赖于昂贵的真实世界数据或通过固定控制器利用盲目顺应,而我们提出了一种利用专家设计控制逻辑进行转移的框架。受到运动任务中特权监督成功的启发,我们在模拟中采用了人类设计的有限状态机位置/力控制器,以提供特权指导。所得到的策略被训练以预测末端执行器的姿态、接触状态,以及关键的期望接触力方向。与对模拟不准确性高度敏感的力大小不同,力方向编码了高层次的任务几何形状,并在模拟到真实的差距中保持稳健。在部署时,这些预测配置了一个力感知的顺应控制器。通过将策略的方向意图与一个恒定的、低成本的手动调节力大小相结合,该系统生成自适应的、与任务对齐的顺应。这种调节轻量,通常只需每个接触状态一个标量。我们提供了稳定性和对干扰的鲁棒性的理论分析。在四个真实世界任务上的实验,即微波炉开关、插销入孔、白板擦拭和开门,证明了我们的方法在成功率和鲁棒性上显著优于强基线。视频可在以下网址观看:https://yifei-y.github.io/project-pages/DirectionMatters/.
cs.RO / 36 / 2602.14193

Learning Part-Aware Dense 3D Feature Field for Generalizable Articulated Object Manipulation

学习具有部件感知的密集3D特征场以实现可泛化的关节物体操作
Chen, Yue, Jiang, Muqing, Zheng, Kaifeng, Liang, Jiaqi, Tie, Chenrui, Lu, Haoran, Wu, Ruihai, Dong, Hao
Abstract
Articulated object manipulation is essential for various real-world robotic tasks, yet generalizing across diverse objects remains a major challenge. A key to generalization lies in understanding functional parts (e.g., door handles and knobs), which indicate where and how to manipulate across diverse object categories and shapes. Previous works attempted to achieve generalization by introducing foundation features, while these features are mostly 2D-based and do not specifically consider functional parts. When lifting these 2D features to geometry-profound 3D space, challenges arise, such as long runtimes, multi-view inconsistencies, and low spatial resolution with insufficient geometric information. To address these issues, we propose Part-Aware 3D Feature Field (PA3FF), a novel dense 3D feature with part awareness for generalizable articulated object manipulation. PA3FF is trained by 3D part proposals from a large-scale labeled dataset, via a contrastive learning formulation. Given point clouds as input, PA3FF predicts a continuous 3D feature field in a feedforward manner, where the distance between point features reflects the proximity of functional parts: points with similar features are more likely to belong to the same part. Building on this feature, we introduce the Part-Aware Diffusion Policy (PADP), an imitation learning framework aimed at enhancing sample efficiency and generalization for robotic manipulation. We evaluate PADP on several simulated and real-world tasks, demonstrating that PA3FF consistently outperforms a range of 2D and 3D representations in manipulation scenarios, including CLIP, DINOv2, and Grounded-SAM. Beyond imitation learning, PA3FF enables diverse downstream methods, including correspondence learning and segmentation tasks, making it a versatile foundation for robotic manipulation. Project page: https://pa3ff.github.io
Chinese Translation
关节物体操作对于各种现实世界的机器人任务至关重要,但在不同物体之间的泛化仍然是一个主要挑战。泛化的关键在于理解功能部件(例如,门把手和旋钮),这些部件指示了在不同物体类别和形状中如何进行操作。之前的研究试图通过引入基础特征来实现泛化,但这些特征大多基于2D,并未特别考虑功能部件。当将这些2D特征提升到几何深刻的3D空间时,会出现诸如长时间运行、多视图不一致和空间分辨率低、几何信息不足等挑战。为了解决这些问题,我们提出了部件感知3D特征场(Part-Aware 3D Feature Field, PA3FF),这是一种新颖的密集3D特征,具有部件感知能力,旨在实现可泛化的关节物体操作。PA3FF通过对大规模标注数据集中的3D部件提议进行对比学习的方式进行训练。给定点云作为输入,PA3FF以前馈方式预测连续的3D特征场,其中点特征之间的距离反映了功能部件的接近程度:具有相似特征的点更可能属于同一部件。在此特征基础上,我们引入了部件感知扩散策略(Part-Aware Diffusion Policy, PADP),这是一种模仿学习框架,旨在提高样本效率和机器人操作的泛化能力。我们在多个模拟和现实世界任务上评估了PADP,证明PA3FF在操作场景中始终优于一系列2D和3D表示,包括CLIP、DINOv2和Grounded-SAM。除了模仿学习,PA3FF还支持多种下游方法,包括对应学习和分割任务,使其成为机器人操作的多功能基础。项目页面:https://pa3ff.github.io
cs.RO / 37 / 2602.14222

Muscle Coactivation in the Sky: Geometry and Pareto Optimality of Energy vs. Promptness in Multirotors

天空中的肌肉协同激活:多旋翼飞行器中能量与及时性的几何与帕累托最优性
Franchi, Antonio
Abstract
In robotics and human biomechanics, the tension between energy economy and kinematic readiness is well recognized; this work brings that fundamental principle to aerial multirotors. We show that the limited torque of the motors and the nonlinear aerodynamic map from rotor speed to thrust naturally give rise to the novel concept of promptness-a metric akin to dynamic aerodynamic manipulability. By treating energy consumption as a competing objective and introducing a geometric fiber-bundle formulation, we turn redundancy resolution into a principled multi-objective program on affine fibers. The use of the diffeomorphic transformation linearizing the signed-quadratic propulsion model allows us to lay the foundations for a rigorous study of the interplay between these costs. Through an illustrative case study on 4-DoF allocation on the hexarotor, we reveal that this interplay is fiber-dependent and physically shaped by hardware inequalities. For unidirectional thrusters, the feasible fibers are compact, yielding interior allocations and a short Pareto arc, while torque demands break symmetry and separate the optima. Conversely, with reversible propellers, the null space enables antagonistic rotor co-contraction that drives promptness to hardware limits, making optimal endurance and agility fundamentally incompatible in those regimes. Ultimately, rather than relying on heuristic tuning or black box algorithms to empirically improve task execution, this framework provides a foundational understanding of why and how to achieve agility through geometry-aware control allocation, offering possible guidance for vehicle design, certification metrics, and threat-aware flight operation.
Chinese Translation
在机器人技术和人类生物力学中,能量经济与运动准备之间的张力是众所周知的;本研究将这一基本原理引入到空中多旋翼飞行器中。我们展示了电机的有限扭矩和从旋翼速度到推力的非线性气动映射自然引发了及时性这一新概念——一种类似于动态气动可操作性的度量。通过将能量消耗视为一个竞争目标,并引入几何纤维束的形式,我们将冗余解决转化为一个在仿射纤维上的原则性多目标程序。利用微分同胚变换线性化带符号的二次推进模型,使我们能够为这些成本之间相互作用的严格研究奠定基础。通过对六旋翼飞行器的4自由度分配的示例案例研究,我们揭示了这种相互作用依赖于纤维,并受到硬件不平等的物理影响。对于单向推力器,可行的纤维是紧凑的,产生内部分配和短的帕累托弧,而扭矩需求打破对称性并分离最优解。相反,使用可逆螺旋桨时,零空间使得对抗性旋翼的共同收缩推动及时性达到硬件极限,使得在这些状态下最佳的耐力与敏捷性根本上不兼容。最终,该框架提供了对如何通过几何感知的控制分配实现敏捷性的基础理解,而不是依赖启发式调优或黑箱算法来经验性地改善任务执行,为飞行器设计、认证指标和威胁感知飞行操作提供了可能的指导。
cs.RO / 38 / 2602.14247

Path Planning Optimisation for SParse, AwaRe and Cooperative Networked Aerial Robot Teams (SpArC-NARTs): Optimisation Tool and Ground Sensing Coverage Use Cases

稀疏、感知和协作网络空中机器人团队(SpArC-NARTs)的路径规划优化:优化工具和地面感知覆盖用例
Conceição, Maria, Grilo, António, Basiri, Meysam
Abstract
A networked aerial robot team (NART) comprises a group of agents (e.g., unmanned aerial vehicles (UAVs), ground control stations, etc.) interconnected by wireless links. Inter-agent connectivity, even if intermittent (i.e. sparse), enables data exchanges between agents and supports cooperative behaviours in several NART missions. It can benefit online decentralised decision-making and group resilience, particularly when prior knowledge is inaccurate or incomplete. These requirements can be accounted for in the offline mission planning stages to incentivise cooperative behaviours and improve mission efficiency during the NART deployment. This paper proposes a novel path planning tool for a Sparse, Aware, and Cooperative Networked Aerial Robot Team (SpArC-NART) in exploration missions. It simultaneously considers different levels of prior information regarding the environment, limited agent energy, sensing, and communication, as well as distinct NART constitutions. The communication model takes into account the limitations of user-defined radio technology and physical phenomena. The proposed tool aims to maximise the mission goals (e.g., finding one or multiple targets, covering the full area of the environment, etc.), while cooperating with other agents to reduce agent reporting times, increase their global situational awareness (e.g., their knowledge of the environment), and facilitate mission replanning, if required. The developed cooperation mechanism leverages soft-motion constraints and dynamic rewards based on the Value of Movement and the expected communication availability between the agents at each time step. A ground sensing coverage use case was chosen to illustrate the current capabilities of this tool.
Chinese Translation
网络空中机器人团队(NART)由一组通过无线链接互联的代理(例如,无人机(UAVs)、地面控制站等)组成。即使是间歇性的(即稀疏的)代理间连接也能促进代理之间的数据交换,并支持多个NART任务中的协作行为。这可以为在线去中心化决策提供帮助,并增强团队的韧性,特别是在先前知识不准确或不完整的情况下。这些需求可以在离线任务规划阶段予以考虑,以激励协作行为并提高NART部署期间的任务效率。本文提出了一种针对稀疏、感知和协作网络空中机器人团队(SpArC-NART)在探索任务中的新型路径规划工具。该工具同时考虑了关于环境的不同级别的先前信息、有限的代理能量、感知和通信,以及不同的NART构成。通信模型考虑了用户定义的无线技术和物理现象的限制。所提出的工具旨在最大化任务目标(例如,寻找一个或多个目标、覆盖环境的全部区域等),同时与其他代理合作以减少代理报告时间,提高其全球态势感知(例如,对环境的了解),并在必要时促进任务重新规划。开发的合作机制利用基于运动价值和每个时间步代理之间预期通信可用性的动态奖励和软运动约束。选择了一个地面感知覆盖用例来说明该工具的当前能力。
cs.RO / 39 / 2602.14255

A Latency-Aware Framework for Visuomotor Policy Learning on Industrial Robots

一种针对工业机器人视觉运动策略学习的延迟感知框架
Ruan, Daniel, Mozaffari, Salma, Adriaenssens, Sigrid, Adel, Arash
Abstract
Industrial robots are increasingly deployed in contact-rich construction and manufacturing tasks that involve uncertainty and long-horizon execution. While learning-based visuomotor policies offer a promising alternative to open-loop control, their deployment on industrial platforms is challenged by a large observation-execution gap caused by sensing, inference, and control latency. This gap is significantly greater than on low-latency research robots due to high-level interfaces and slower closed-loop dynamics, making execution timing a critical system-level issue. This paper presents a latency-aware framework for deploying and evaluating visuomotor policies on industrial robotic arms under realistic timing constraints. The framework integrates calibrated multimodal sensing, temporally consistent synchronization, a unified communication pipeline, and a teleoperation interface for demonstration collection. Within this framework, we introduce a latency-aware execution strategy that schedules finite-horizon, policy-predicted action sequences based on temporal feasibility, enabling asynchronous inference and execution without modifying policy architectures or training. We evaluate the framework on a contact-rich industrial assembly task while systematically varying inference latency. Using identical policies and sensing pipelines, we compare latency-aware execution with blocking and naive asynchronous baselines. Results show that latency-aware execution maintains smooth motion, compliant contact behavior, and consistent task progression across a wide range of latencies while reducing idle time and avoiding instability observed in baseline methods. These findings highlight the importance of explicitly handling latency for reliable closed-loop deployment of visuomotor policies on industrial robots.
Chinese Translation
工业机器人越来越多地应用于涉及不确定性和长时间执行的接触丰富的建筑和制造任务。虽然基于学习的视觉运动策略为开放环控制提供了一个有前景的替代方案,但在工业平台上的应用受到由传感、推理和控制延迟引起的大量观察-执行差距的挑战。由于高层接口和较慢的闭环动态,这一差距显著大于低延迟研究机器人,使得执行时机成为一个关键的系统级问题。本文提出了一种延迟感知框架,用于在现实时限约束下部署和评估工业机器人臂上的视觉运动策略。该框架集成了校准的多模态传感、时间一致的同步、统一的通信管道以及用于演示收集的遥操作接口。在此框架内,我们引入了一种延迟感知执行策略,该策略根据时间可行性调度有限时域内的策略预测动作序列,从而实现异步推理和执行,而无需修改策略架构或训练。我们在一个接触丰富的工业装配任务上评估该框架,同时系统性地变化推理延迟。使用相同的策略和传感管道,我们将延迟感知执行与阻塞和天真的异步基线进行比较。结果表明,延迟感知执行在广泛的延迟范围内保持平滑运动、顺应接触行为和一致的任务进展,同时减少空闲时间并避免在基线方法中观察到的不稳定性。这些发现突显了明确处理延迟对于在工业机器人上可靠闭环部署视觉运动策略的重要性。
cs.RO / 40 / 2602.14287

Autonomous Robotic Tissue Palpation and Abnormalities Characterisation via Ergodic Exploration

通过遍历探索实现自主机器人组织触诊及异常特征表征
Beber, Luca, Lamon, Edoardo, Saveriano, Matteo, Fontanelli, Daniele, Palopoli, Luigi
Abstract
We propose a novel autonomous robotic palpation framework for real-time elastic mapping during tissue exploration using a viscoelastic tissue model. The method combines force-based parameter estimation using a commercial force/torque sensor with an ergodic control strategy driven by a tailored Expected Information Density, which explicitly biases exploration toward diagnostically relevant regions by jointly considering model uncertainty, stiffness magnitude, and spatial gradients. An Extended Kalman Filter is employed to estimate viscoelastic model parameters online, while Gaussian Process Regression provides spatial modelling of the estimated elasticity, and a Heat Equation Driven Area Coverage controller enables adaptive, continuous trajectory planning. Simulations on synthetic stiffness maps demonstrate that the proposed approach achieves better reconstruction accuracy, enhanced segmentation capability, and improved robustness in detecting stiff inclusions compared to Bayesian Optimisation-based techniques. Experimental validation on a silicone phantom with embedded inclusions emulating pathological tissue regions further corroborates the potential of the method for autonomous tissue characterisation in diagnostic and screening applications.
Chinese Translation
我们提出了一种新颖的自主机器人触诊框架,用于在组织探索过程中实时进行弹性映射,采用了一种粘弹性组织模型。该方法结合了基于力的参数估计(使用商业力/扭矩传感器)与一种由定制的期望信息密度驱动的遍历控制策略,明确地将探索偏向于诊断相关区域,综合考虑了模型不确定性、刚度大小和空间梯度。采用扩展卡尔曼滤波器在线估计粘弹性模型参数,同时高斯过程回归提供了对估计弹性的空间建模,而热方程驱动的区域覆盖控制器则实现了自适应的连续轨迹规划。对合成刚度图的仿真表明,所提方法在重建精度、分割能力和检测刚性包块的鲁棒性方面优于基于贝叶斯优化的技术。在嵌入模仿病理组织区域的包块的硅胶模型上的实验验证进一步证实了该方法在诊断和筛查应用中进行自主组织特征表征的潜力。
cs.RO / 41 / 2602.14311

Exploiting Structure-from-Motion for Robust Vision-Based Map Matching for Aircraft Surface Movement

利用运动重建技术实现稳健的基于视觉的地图匹配以支持飞机地面移动
Choate, Daniel, Rife, Jason
Abstract
In this paper we introduce a vision-aided navigation (VAN) pipeline designed to support ground navigation of autonomous aircraft. The proposed algorithm combines the computational efficiency of indirect methods with the robustness of direct image-based techniques to enhance solution integrity. The pipeline starts by processing ground images (e.g., acquired by a taxiing aircraft) and relates them via a feature-based structure-from-motion (SfM) solution. A ground plane mosaic is then constructed via homography transforms and matched to satellite imagery using a sum of squares differences (SSD) of intensities. Experimental results reveal that drift within the SfM solution, similar to that observed in dead-reckoning systems, challenges the expected accuracy benefits of map-matching with a wide-baseline ground-plane mosaic. However, the proposed algorithm demonstrates key integrity features, such as the ability to identify registration anomalies and ambiguous matches. These characteristics of the pipeline can mitigate outlier behaviors and contribute toward a robust, certifiable solution for autonomous surface movement of aircraft.
Chinese Translation
本文介绍了一种视觉辅助导航(VAN)管道,旨在支持自主飞机的地面导航。所提出的算法结合了间接方法的计算效率与直接基于图像技术的稳健性,以增强解的完整性。该管道首先处理地面图像(例如,由滑行中的飞机获取),并通过基于特征的运动重建(SfM)解将其关联。接着,通过单应性变换构建地面平面马赛克,并使用强度的平方和差(SSD)与卫星图像进行匹配。实验结果表明,SfM解中的漂移现象,类似于在推算系统中观察到的情况,挑战了与宽基线地面平面马赛克进行地图匹配所期望的精度提升。然而,所提出的算法展示了关键的完整性特征,例如识别配准异常和模糊匹配的能力。这些管道特性可以减轻异常值行为,并为自主飞机的地面移动提供稳健且可认证的解决方案。
cs.RO / 42 / 2602.14363

AdaptManip: Learning Adaptive Whole-Body Object Lifting and Delivery with Online Recurrent State Estimation

AdaptManip:通过在线递归状态估计学习自适应全身物体提升与交付
Byrd, Morgan, Baek, Donghoon, Garg, Kartik, Jung, Hyunyoung, Cho, Daesol, Sorokin, Maks, Wright, Robert, Ha, Sehoon
Abstract
This paper presents Adaptive Whole-body Loco-Manipulation, AdaptManip, a fully autonomous framework for humanoid robots to perform integrated navigation, object lifting, and delivery. Unlike prior imitation learning-based approaches that rely on human demonstrations and are often brittle to disturbances, AdaptManip aims to train a robust loco-manipulation policy via reinforcement learning without human demonstrations or teleoperation data. The proposed framework consists of three coupled components: (1) a recurrent object state estimator that tracks the manipulated object in real time under limited field-of-view and occlusions; (2) a whole-body base policy for robust locomotion with residual manipulation control for stable object lifting and delivery; and (3) a LiDAR-based robot global position estimator that provides drift-robust localization. All components are trained in simulation using reinforcement learning and deployed on real hardware in a zero-shot manner. Experimental results show that AdaptManip significantly outperforms baseline methods, including imitation learning-based approaches, in adaptability and overall success rate, while accurate object state estimation improves manipulation performance even under occlusion. We further demonstrate fully autonomous real-world navigation, object lifting, and delivery on a humanoid robot.
Chinese Translation
本文提出了自适应全身运动操控(AdaptManip),这是一个完全自主的框架,使类人机器人能够执行集成导航、物体提升和交付。与以往依赖人类示范的模仿学习方法不同,AdaptManip旨在通过强化学习训练出一个稳健的运动操控策略,而无需人类示范或遥控数据。该框架由三个相互耦合的组件组成:(1)一个递归物体状态估计器,能够在有限视野和遮挡条件下实时跟踪被操控物体;(2)一个全身基础策略,用于稳健的运动控制,并具备残余操控能力,以确保物体的稳定提升与交付;(3)一个基于激光雷达(LiDAR)的机器人全局位置估计器,提供抗漂移的定位。所有组件均在模拟环境中使用强化学习进行训练,并以零样本方式部署到真实硬件上。实验结果表明,AdaptManip在适应性和整体成功率方面显著优于基线方法,包括基于模仿学习的方法,而准确的物体状态估计即使在遮挡情况下也能提高操控性能。我们进一步展示了类人机器人在真实世界中的完全自主导航、物体提升和交付。
cs.RO / 43 / 2602.14434

A Soft Wrist with Anisotropic and Selectable Stiffness for Robust Robot Learning in Contact-rich Manipulation

一种具有各向异性和可选择刚度的柔性腕部机制,用于在接触丰富的操作中实现稳健的机器人学习
Oh, Steven, Takahashi, Tomoya, Beltran-Hernandez, Cristian C., Kuroda, Yuki, Hamaya, Masashi
Abstract
Contact-rich manipulation tasks in unstructured environments pose significant robustness challenges for robot learning, where unexpected collisions can cause damage and hinder policy acquisition. Existing soft end-effectors face fundamental limitations: they either provide a limited deformation range, lack directional stiffness control, or require complex actuation systems that compromise practicality. This study introduces CLAW (Compliant Leaf-spring Anisotropic soft Wrist), a novel soft wrist mechanism that addresses these limitations through a simple yet effective design using two orthogonal leaf springs and rotary joints with a locking mechanism. CLAW provides large 6-degree-of-freedom deformation (40mm lateral, 20mm vertical), anisotropic stiffness that is tunable across three distinct modes, while maintaining lightweight construction (330g) at low cost ($550). Experimental evaluations using imitation learning demonstrate that CLAW achieves 76% success rate in benchmark peg-insertion tasks, outperforming both the Fin Ray gripper (43%) and rigid gripper alternatives (36%). CLAW successfully handles diverse contact-rich scenarios, including precision assembly with tight tolerances and delicate object manipulation, demonstrating its potential to enable robust robot learning in contact-rich domains. Project page: https://project-page-manager.github.io/CLAW/
Chinese Translation
在非结构化环境中,接触丰富的操作任务给机器人学习带来了显著的稳健性挑战,意外的碰撞可能导致损坏并妨碍策略获取。现有的柔性末端执行器面临基本限制:它们要么提供有限的变形范围,要么缺乏方向性刚度控制,或者需要复杂的驱动系统,从而影响实用性。本研究介绍了CLAW(顺应性叶片弹簧各向异性柔性腕部),这是一种新型柔性腕部机制,通过使用两个正交的叶片弹簧和带锁定机制的旋转关节,以简单而有效的设计解决了这些限制。CLAW提供了大范围的六自由度变形(横向40mm,纵向20mm),具有可调的各向异性刚度,且在保持轻量化结构(330g)的同时,成本低($550)。通过模仿学习的实验评估表明,CLAW在基准的插销插入任务中达到了76%的成功率,优于Fin Ray夹具(43%)和刚性夹具替代品(36%)。CLAW成功处理了多种接触丰富的场景,包括精密装配和精细物体操作,展示了其在接触丰富领域实现稳健机器人学习的潜力。项目页面:https://project-page-manager.github.io/CLAW/
cs.RO / 44 / 2602.14438

RoboSolver: A Multi-Agent Large Language Model Framework for Solving Robotic Arm Problems

RoboSolver:一个用于解决机器人手臂问题的多智能体大语言模型框架
Khabazi, Hamid, Meghdari, Ali F., Taheri, Alireza
Abstract
This study proposes an intelligent multi-agent framework built on LLMs and VLMs and specifically tailored to robotics. The goal is to integrate the strengths of LLMs and VLMs with computational tools to automatically analyze and solve problems related to robotic manipulators. Our developed framework accepts both textual and visual inputs and can automatically perform forward and inverse kinematics, compute velocities and accelerations of key points, generate 3D simulations of the robot, and ultimately execute motion control within the simulated environment, all according to the user's query. To evaluate the framework, three benchmark tests were designed, each consisting of ten questions. In the first benchmark test, the framework was evaluated while connected to GPT-4o, DeepSeek-V3.2, and Claude-Sonnet-4.5, as well as their corresponding raw models. The objective was to extract the forward kinematics of robots directly from textual descriptions. The results showed that the framework integrated with GPT-4o achieved the highest accuracy, reaching 0.97 in computing the final solution, whereas the raw model alone attained an accuracy of only 0.30 for the same task. Similarly, for the other two models, the framework consistently outperformed the corresponding raw models in terms of accuracy. The second benchmark test was identical to the first, except that the input was provided in visual form. In this test, the GPT-4o LLM was used alongside the Gemini 2.5 Pro VLM. The results showed that the framework achieved an accuracy of 0.93 in obtaining the final answer, which is approximately 20% higher than that of the corresponding raw model. The third benchmark test encompassed a range of robotic tasks, including simulation, control, velocity and acceleration computation, as well as inverse kinematics and Jacobian calculation, for which the framework achieved an accuracy of 0.97.
Chinese Translation
本研究提出了一种基于大语言模型(LLMs)和视觉语言模型(VLMs)的智能多智能体框架,专门针对机器人技术进行定制。其目标是将LLMs和VLMs的优势与计算工具相结合,自动分析和解决与机器人操纵器相关的问题。我们开发的框架接受文本和视觉输入,并能够自动执行正向和逆向运动学,计算关键点的速度和加速度,生成机器人的三维模拟,并最终根据用户的查询在模拟环境中执行运动控制。为了评估该框架,设计了三个基准测试,每个测试包含十个问题。在第一个基准测试中,框架在连接GPT-4o、DeepSeek-V3.2和Claude-Sonnet-4.5及其相应的原始模型时进行了评估。目标是直接从文本描述中提取机器人的正向运动学。结果表明,与GPT-4o集成的框架在计算最终解时达到了最高的准确率,达到了0.97,而原始模型仅在同一任务中获得了0.30的准确率。同样,对于其他两个模型,该框架在准确性方面始终优于相应的原始模型。第二个基准测试与第一个相同,只是输入以视觉形式提供。在此测试中,使用了GPT-4o LLM和Gemini 2.5 Pro VLM。结果显示,该框架在获得最终答案时达到了0.93的准确率,比相应的原始模型高出约20%。第三个基准测试涵盖了一系列机器人任务,包括模拟、控制、速度和加速度计算,以及逆向运动学和雅可比计算,框架在此任务中达到了0.97的准确率。
cs.RO / 45 / 2602.14473

Learning Transferability: A Two-Stage Reinforcement Learning Approach for Enhancing Quadruped Robots' Performance in U-Shaped Stair Climbing

学习可转移性:一种两阶段强化学习方法用于提升四足机器人在U型楼梯攀爬中的性能
Huang, Baixiao, Huang, Baiyu, Hou, Yu
Abstract
Quadruped robots are employed in various scenarios in building construction. However, autonomous stair climbing across different indoor staircases remains a major challenge for robot dogs to complete building construction tasks. In this project, we employed a two-stage end-to-end deep reinforcement learning (RL) approach to optimize a robot's performance on U-shaped stairs. The training robot-dog modality, Unitree Go2, was first trained to climb stairs on Isaac Lab's pyramid-stair terrain, and then to climb a U-shaped indoor staircase using the learned policies. This project explores end-to-end RL methods that enable robot dogs to autonomously climb stairs. The results showed (1) the successful goal reached for robot dogs climbing U-shaped stairs with a stall penalty, and (2) the transferability from the policy trained on U-shaped stairs to deployment on straight, L-shaped, and spiral stair terrains, and transferability from other stair models to deployment on U-shaped terrain.
Chinese Translation
四足机器人在建筑施工的各种场景中得到了应用。然而,机器人狗在不同室内楼梯上自主攀爬仍然是完成建筑施工任务的一大挑战。在本项目中,我们采用了一种两阶段端到端深度强化学习(RL)方法来优化机器人在U型楼梯上的性能。训练用的机器人狗模型Unitree Go2,首先在Isaac Lab的金字塔楼梯地形上进行了楼梯攀爬训练,然后使用学习到的策略在U型室内楼梯上进行攀爬。本项目探讨了使机器人狗能够自主攀爬楼梯的端到端RL方法。结果显示:(1)机器人狗成功达成了在U型楼梯上攀爬的目标,并且施加了停滞惩罚;(2)从在U型楼梯上训练的策略到在直梯、L型和螺旋楼梯地形上的部署具有可转移性,以及从其他楼梯模型到U型地形的部署也具备可转移性。
cs.RO / 46 / 2602.14526

TWISTED-RL: Hierarchical Skilled Agents for Knot-Tying without Human Demonstrations

TWISTED-RL:无须人类示范的层次化熟练代理用于打结
Freund, Guy, Jurgenson, Tom, Sudry, Matan, Karpas, Erez
Abstract
Robotic knot-tying represents a fundamental challenge in robotics due to the complex interactions between deformable objects and strict topological constraints. We present TWISTED-RL, a framework that improves upon the previous state-of-the-art in demonstration-free knot-tying (TWISTED), which smartly decomposed a single knot-tying problem into manageable subproblems, each addressed by a specialized agent. Our approach replaces TWISTED's single-step inverse model that was learned via supervised learning with a multi-step Reinforcement Learning policy conditioned on abstract topological actions rather than goal states. This change allows more delicate topological state transitions while avoiding costly and ineffective data collection protocols, thus enabling better generalization across diverse knot configurations. Experimental results demonstrate that TWISTED-RL manages to solve previously unattainable knots of higher complexity, including commonly used knots such as the Figure-8 and the Overhand. Furthermore, the increase in success rates and drop in planning time establishes TWISTED-RL as the new state-of-the-art in robotic knot-tying without human demonstrations.
Chinese Translation
机器人打结因可变形物体之间的复杂交互和严格的拓扑约束而成为机器人技术中的一项基本挑战。我们提出了TWISTED-RL,一个在无示范打结领域(TWISTED)上改进的框架,该框架巧妙地将单一的打结问题分解为可管理的子问题,每个子问题由一个专门的代理处理。我们的方法用基于抽象拓扑动作而非目标状态的多步强化学习策略替代了TWISTED的通过监督学习获得的单步逆模型。这一变化允许更细致的拓扑状态转变,同时避免了昂贵且低效的数据收集协议,从而在不同的打结配置中实现更好的泛化。实验结果表明,TWISTED-RL成功解决了以前无法实现的更高复杂度的打结,包括常用的打结如8字结和单结。此外,成功率的提高和规划时间的减少使TWISTED-RL成为无须人类示范的机器人打结领域的新最先进技术。
cs.RO / 47 / 2602.14540

Multimodal Covariance Steering in Belief Space with Active Probing and Influence for Autonomous Driving

具有主动探测和影响的信念空间多模态协方差引导在自主驾驶中的应用
Chakravarty, Devodita, Dolan, John, Lyu, Yiwei
Abstract
Autonomous driving in complex traffic requires reasoning under uncertainty. Common approaches rely on prediction-based planning or risk-aware control, but these are typically treated in isolation, limiting their ability to capture the coupled nature of action and inference in interactive settings. This gap becomes especially critical in uncertain scenarios, where simply reacting to predictions can lead to unsafe maneuvers or overly conservative behavior. Our central insight is that safe interaction requires not only estimating human behavior but also shaping it when ambiguity poses risks. To this end, we introduce a hierarchical belief model that structures human behavior across coarse discrete intents and fine motion modes, updated via Bayesian inference for interpretable multi-resolution reasoning. On top of this, we develop an active probing strategy that identifies when multimodal ambiguity in human predictions may compromise safety and plans disambiguating actions that both reveal intent and gently steer human decisions toward safer outcomes. Finally, a runtime risk-evaluation layer based on Conditional Value-at-Risk (CVaR) ensures that all probing actions remain within human risk tolerance during influence. Our simulations in lane-merging and unsignaled intersection scenarios demonstrate that our approach achieves higher success rates and shorter completion times compared to existing methods. These results highlight the benefit of coupling belief inference, probing, and risk monitoring, yielding a principled and interpretable framework for planning under uncertainty.
Chinese Translation
在复杂交通环境中,自主驾驶需要在不确定性下进行推理。常见的方法依赖于基于预测的规划或风险意识控制,但这些方法通常是孤立处理的,限制了它们在交互环境中捕捉行动与推理耦合特性的能力。这一差距在不确定场景中尤为关键,因为仅仅对预测做出反应可能导致不安全的操作或过于保守的行为。我们的核心见解是,安全的交互不仅需要估计人类行为,还需要在模糊性带来风险时对其进行引导。为此,我们提出了一种分层信念模型,该模型通过贝叶斯推理对人类行为进行结构化,涵盖粗略离散意图和细致运动模式,以实现可解释的多分辨率推理。在此基础上,我们开发了一种主动探测策略,识别何时人类预测中的多模态模糊性可能危及安全,并规划出既能揭示意图又能温和引导人类决策朝向更安全结果的消歧义行动。最后,基于条件风险价值(Conditional Value-at-Risk, CVaR)的运行时风险评估层确保所有探测行动在影响过程中保持在可接受的人类风险容忍范围内。我们在车道合并和无信号交叉口场景中的模拟表明,我们的方法相比现有方法实现了更高的成功率和更短的完成时间。这些结果突显了信念推理、探测和风险监控相结合的好处,提供了一个在不确定性下进行规划的原则性和可解释性框架。
cs.RO / 48 / 2602.14551

Replanning Human-Robot Collaborative Tasks with Vision-Language Models via Semantic and Physical Dual-Correction

通过语义和物理双重修正重新规划人机协作任务的视觉语言模型
Kato, Taichi, Kiyokawa, Takuya, Saito, Namiko, Harada, Kensuke
Abstract
Human-Robot Collaboration (HRC) plays an important role in assembly tasks by enabling robots to plan and adjust their motions based on interactive, real-time human instructions. However, such instructions are often linguistically ambiguous and underspecified, making it difficult to generate physically feasible and cooperative robot behaviors. To address this challenge, many studies have applied Vision-Language Models (VLMs) to interpret high-level instructions and generate corresponding actions. Nevertheless, VLM-based approaches still suffer from hallucinated reasoning and an inability to anticipate physical execution failures. To address these challenges, we propose an HRC framework that augments a VLM-based reasoning with a dual-correction mechanism: an internal correction model that verifies logical consistency and task feasibility prior to action execution, and an external correction model that detects and rectifies physical failures through post-execution feedback. Simulation ablation studies demonstrate that the proposed method improves the success rate compared to baselines without correction models. Our real-world experiments in collaborative assembly tasks supported by object fixation or tool preparation by an upper body humanoid robot further confirm the framewor's effectiveness in enabling interactive replanning across different collaborative tasks in response to human instructions, validating its practical feasibility.
Chinese Translation
人机协作(HRC)在装配任务中发挥着重要作用,使机器人能够根据实时的人类指令进行规划和调整其动作。然而,这些指令往往在语言上模糊且不够具体,导致难以生成物理上可行且具有协作性的机器人行为。为了解决这一挑战,许多研究应用了视觉语言模型(VLMs)来解释高层次指令并生成相应的动作。然而,基于VLM的方法仍然面临幻觉推理和无法预见物理执行失败的问题。为了解决这些挑战,我们提出了一种HRC框架,该框架通过双重修正机制增强基于VLM的推理:一个内部修正模型在动作执行之前验证逻辑一致性和任务可行性,另一个外部修正模型通过执行后的反馈检测并纠正物理失败。仿真消融研究表明,所提出的方法相比于没有修正模型的基线提高了成功率。我们在由上半身类人机器人支持的协作装配任务中的现实世界实验进一步验证了该框架在响应人类指令时实现不同协作任务的互动重新规划的有效性,验证了其实际可行性。
cs.RO / 49 / 2602.14561

Simulation-based Learning of Electrical Cabinet Assembly Using Robot Skills

基于仿真的电气柜组装学习使用机器人技能
Laemmle, Arik, Bálint, Balázs András, Tenbrock, Philipp, Naegele, Frank, Traunecker, David, Váncza, József, Huber, Marco F.
Abstract
This paper presents a simulation-driven approach for automating the force-controlled assembly of electrical terminals on DIN-rails, a task traditionally hindered by high programming effort and product variability. The proposed method integrates deep reinforcement learning (DRL) with parameterizable robot skills in a physics-based simulation environment. To realistically model the snap-fit assembly process, we develop and evaluate two types of joining models: analytical models based on beam theory and rigid-body models implemented in the MuJoCo physics engine. These models enable accurate simulation of interaction forces, essential for training DRL agents. The robot skills are structured using the pitasc framework, allowing modular, reusable control strategies. Training is conducted in simulation using Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithms. Domain randomization is applied to improve robustness. The trained policies are transferred to a physical UR10e robot system without additional tuning. Experimental results demonstrate high success rates (up to 100%) in both simulation and real-world settings, even under significant positional and rotational deviations. The system generalizes well to new terminal types and positions, significantly reducing manual programming effort. This work highlights the potential of combining simulation-based learning with modular robot skills for flexible, scalable automation in small-batch manufacturing. Future work will explore hybrid learning methods, automated environment parameterization, and further refinement of joining models for design integration.
Chinese Translation
本文提出了一种基于仿真的方法,用于自动化在DIN导轨上进行电气端子的力控制组装,这一任务传统上受到高编程成本和产品变异性的限制。所提方法将深度强化学习(Deep Reinforcement Learning, DRL)与可参数化的机器人技能集成于物理仿真环境中。为了真实地模拟卡扣式组装过程,我们开发并评估了两种类型的连接模型:基于梁理论的解析模型和在MuJoCo物理引擎中实现的刚体模型。这些模型能够准确模拟交互力,这对于训练DRL代理至关重要。机器人技能采用pitasc框架进行结构化,允许模块化和可重用的控制策略。训练在仿真中使用软演员-评论家(Soft Actor-Critic, SAC)和双延迟深度确定性策略梯度(Twin Delayed Deep Deterministic Policy Gradient, TD3)算法进行。应用领域随机化以提高鲁棒性。训练好的策略可以无须额外调优地转移到物理UR10e机器人系统。实验结果表明,在仿真和实际环境中均实现了高成功率(高达100%),即使在显著的位移和旋转偏差下也能保持稳定。该系统对新类型和新位置的端子具有良好的泛化能力,显著减少了手动编程的工作量。本研究突显了将基于仿真的学习与模块化机器人技能相结合的潜力,以实现小批量制造中的灵活、可扩展的自动化。未来的工作将探索混合学习方法、自动化环境参数化以及进一步完善连接模型以便于设计集成。
cs.RO / 50 / 2602.14666

Real-time Monocular 2D and 3D Perception of Endoluminal Scenes for Controlling Flexible Robotic Endoscopic Instruments

实时单目2D和3D内腔场景感知用于控制柔性机器人内窥镜仪器
Wei, Ruofeng, Chen, Kai, Ng, Yui Lun, Ma, Yiyao, Ho, Justin Di-Lang, Tong, Hon Sing, Wang, Xiaomei, Dai, Jing, Kwok, Ka-Wai, Dou, Qi
Abstract
Endoluminal surgery offers a minimally invasive option for early-stage gastrointestinal and urinary tract cancers but is limited by surgical tools and a steep learning curve. Robotic systems, particularly continuum robots, provide flexible instruments that enable precise tissue resection, potentially improving outcomes. This paper presents a visual perception platform for a continuum robotic system in endoluminal surgery. Our goal is to utilize monocular endoscopic image-based perception algorithms to identify position and orientation of flexible instruments and measure their distances from tissues. We introduce 2D and 3D learning-based perception algorithms and develop a physically-realistic simulator that models flexible instruments dynamics. This simulator generates realistic endoluminal scenes, enabling control of flexible robots and substantial data collection. Using a continuum robot prototype, we conducted module and system-level evaluations. Results show that our algorithms improve control of flexible instruments, reducing manipulation time by over 70% for trajectory-following tasks and enhancing understanding of surgical scenarios, leading to robust endoluminal surgeries.
Chinese Translation
内腔手术为早期胃肠道和泌尿道癌症提供了一种微创选择,但受到手术工具和陡峭学习曲线的限制。机器人系统,尤其是连续机器人,提供了灵活的仪器,使得精确的组织切除成为可能,从而有望改善手术结果。本文提出了一种用于内腔手术的连续机器人系统的视觉感知平台。我们的目标是利用基于单目内窥镜图像的感知算法来识别柔性仪器的位置和方向,并测量其与组织的距离。我们引入了基于学习的2D和3D感知算法,并开发了一个物理真实的模拟器,以模拟柔性仪器的动态。该模拟器生成逼真的内腔场景,使得柔性机器人的控制和大量数据收集成为可能。使用连续机器人原型,我们进行了模块和系统级的评估。结果表明,我们的算法改善了柔性仪器的控制,使得轨迹跟随任务的操作时间减少超过70%,并增强了对手术场景的理解,从而实现了稳健的内腔手术。
cs.RO / 51 / 2602.14726

ManeuverNet: A Soft Actor-Critic Framework for Precise Maneuvering of Double-Ackermann-Steering Robots with Optimized Reward Functions

ManeuverNet:一种软演员-评论家框架,用于双阿克曼转向机器人精确操控的优化奖励函数
Deflesselle, Kohio, Daniel, Mélodie, Magassouba, Aly, Aranda, Miguel, Ly, Olivier
Abstract
Autonomous control of double-Ackermann-steering robots is essential in agricultural applications, where robots must execute precise and complex maneuvers within a limited space. Classical methods, such as the Timed Elastic Band (TEB) planner, can address this problem, but they rely on parameter tuning, making them highly sensitive to changes in robot configuration or environment and impractical to deploy without constant recalibration. At the same time, end-to-end deep reinforcement learning (DRL) methods often fail due to unsuitable reward functions for non-holonomic constraints, resulting in sub-optimal policies and poor generalization. To address these challenges, this paper presents ManeuverNet, a DRL framework tailored for double-Ackermann systems, combining Soft Actor-Critic with CrossQ. Furthermore, ManeuverNet introduces four specifically designed reward functions to support maneuver learning. Unlike prior work, ManeuverNet does not depend on expert data or handcrafted guidance. We extensively evaluate ManeuverNet against both state-of-the-art DRL baselines and the TEB planner. Experimental results demonstrate that our framework substantially improves maneuverability and success rates, achieving more than a 40% gain over DRL baselines. Moreover, ManeuverNet effectively mitigates the strong parameter sensitivity observed in the TEB planner. In real-world trials, ManeuverNet achieved up to a 90% increase in maneuvering trajectory efficiency, highlighting its robustness and practical applicability.
Chinese Translation
双阿克曼转向机器人的自主控制在农业应用中至关重要,这些机器人必须在有限空间内执行精确而复杂的机动。经典方法,如定时弹性带(Timed Elastic Band, TEB)规划器,可以解决这一问题,但它们依赖于参数调优,使其对机器人配置或环境的变化高度敏感,且在没有持续重新校准的情况下难以部署。同时,端到端深度强化学习(Deep Reinforcement Learning, DRL)方法往往由于不适合非完整约束的奖励函数而失败,导致次优策略和较差的泛化能力。为了解决这些挑战,本文提出了ManeuverNet,一种针对双阿克曼系统的DRL框架,结合了软演员-评论家(Soft Actor-Critic)与CrossQ。此外,ManeuverNet引入了四个专门设计的奖励函数以支持机动学习。与之前的工作不同,ManeuverNet不依赖于专家数据或手工指导。我们对ManeuverNet进行了广泛评估,与最先进的DRL基线和TEB规划器进行了比较。实验结果表明,我们的框架显著提高了机动性和成功率,相较于DRL基线提升超过40%。此外,ManeuverNet有效减轻了TEB规划器中观察到的强参数敏感性。在实际试验中,ManeuverNet在机动轨迹效率上实现了高达90%的提升,突显了其稳健性和实际应用性。
cs.RO / 52 / 2602.14794

Analysis of a Cuspidal 6R Robot

尖顶6R机器人的分析
Feeß, Alexander, Weiß, Martin
Abstract
We present a theoretical and numerical analysis of the kinematics for the "Transpressor", a cuspidal 6R robot. It admits up to 16 inverse kinematics solutions which are described geometrically. For special target poses, we provide the solutions analytically and present a simple numerical solver for the general case. Moreover, an analytical estimate of the Jacobian determinant on a path between two solutions proves cuspidality for a class of robots similar to the transpressor.
Chinese Translation
我们对“Transpressor”这一尖顶6R机器人的运动学进行了理论和数值分析。该机器人最多可接受16个逆运动学解,这些解通过几何方式进行描述。对于特定的目标姿态,我们提供了解析解,并为一般情况提供了一个简单的数值求解器。此外,我们在两个解之间的路径上对雅可比行列式进行了分析估计,证明了一类与Transpressor相似的机器人的尖顶性。
cs.RO / 53 / 2602.14799

Scalable Multi-Robot Path Planning via Quadratic Unconstrained Binary Optimization

通过二次无约束二进制优化实现可扩展的多机器人路径规划
Villasmil, Javier González
Abstract
Multi-Agent Path Finding (MAPF) remains a fundamental challenge in robotics, where classical centralized approaches exhibit exponential growth in joint-state complexity as the number of agents increases. This paper investigates Quadratic Unconstrained Binary Optimization (QUBO) as a structurally scalable alternative for simultaneous multi-robot path planning. This approach is a robotics-oriented QUBO formulation incorporating BFS-based logical pre-processing (achieving over 95% variable reduction), adaptive penalty design for collision and constraint enforcement, and a time-windowed decomposition strategy that enables execution within current hardware limitations. An experimental evaluation in grid environments with up to four robots demonstrated near-optimal solutions in dense scenarios and favorable scaling behavior compared to sequential classical planning. These results establish a practical and reproducible baseline for future quantum and quantum-inspired multi-robot coordinations.
Chinese Translation
多智能体路径寻找(MAPF)仍然是机器人技术中的一个基本挑战,传统的集中式方法在智能体数量增加时,联合状态复杂性呈指数增长。本文探讨了二次无约束二进制优化(QUBO)作为一种结构上可扩展的替代方案,用于同时进行多机器人路径规划。该方法是一种面向机器人应用的QUBO公式,结合了基于广度优先搜索(BFS)的逻辑预处理(实现超过95%的变量减少)、用于碰撞和约束强制的自适应惩罚设计,以及一种时间窗口分解策略,使得在当前硬件限制内能够执行。在最多四个机器人的网格环境中的实验评估显示,在密集场景中接近最优的解决方案,并且与传统的顺序规划相比,表现出良好的扩展性。这些结果为未来的量子和量子启发式多机器人协调建立了一个实用且可重复的基准。
cs.RO / 54 / 2602.14874

Affordance Transfer Across Object Instances via Semantically Anchored Functional Map

通过语义锚定功能图实现跨对象实例的可供性转移
Dong, Xiaoxiang, Zhi, Weiming
Abstract
Traditional learning from demonstration (LfD) generally demands a cumbersome collection of physical demonstrations, which can be time-consuming and challenging to scale. Recent advances show that robots can instead learn from human videos by extracting interaction cues without direct robot involvement. However, a fundamental challenge remains: how to generalize demonstrated interactions across different object instances that share similar functionality but vary significantly in geometry. In this work, we propose \emph{Semantic Anchored Functional Maps} (SemFM), a framework for transferring affordances across objects from a single visual demonstration. Starting from a coarse mesh reconstructed from an image, our method identifies semantically corresponding functional regions between objects, selects mutually exclusive semantic anchors, and propagates these constraints over the surface using a functional map to obtain a dense, semantically consistent correspondence. This enables demonstrated interaction regions to be transferred across geometrically diverse objects in a lightweight and interpretable manner. Experiments on synthetic object categories and real-world robotic manipulation tasks show that our approach enables accurate affordance transfer with modest computational cost, making it well-suited for practical robotic perception-to-action pipelines.
Chinese Translation
传统的示范学习(LfD)通常需要繁琐的物理示范收集,这既耗时又难以扩展。最近的进展表明,机器人可以通过提取人类视频中的交互线索来学习,而无需直接参与。然而,仍然存在一个基本挑战:如何在不同的对象实例之间推广已示范的交互,这些对象具有相似的功能但在几何形状上存在显著差异。在本研究中,我们提出了 extit{语义锚定功能图}(Semantic Anchored Functional Maps, SemFM),这是一个从单一视觉示范中跨对象转移可供性的框架。我们的算法从图像重建的粗糙网格开始,识别对象之间语义对应的功能区域,选择相互排斥的语义锚点,并利用功能图在表面上传播这些约束,以获得密集且语义一致的对应关系。这使得已示范的交互区域能够以轻量且可解释的方式在几何多样的对象之间转移。在合成对象类别和真实世界机器人操作任务上的实验表明,我们的方法能够以适度的计算成本实现准确的可供性转移,使其非常适合实际的机器人感知到行动的管道。
cs.RO / 55 / 2602.14948

Kalman Filtering Based Flight Management System Modeling for AAM Aircraft

基于卡尔曼滤波的先进空中移动(AAM)飞机飞行管理系统建模
Kandoria, Balram, Samyal, Aryaman Singh
Abstract
Advanced Aerial Mobility (AAM) operations require strategic flight planning services that predict both spatial and temporal uncertainties to safely validate flight plans against hazards such as weather cells, restricted airspaces, and CNS disruption areas. Current uncertainty estimation methods for AAM vehicles rely on conservative linear models due to limited real-world performance data. This paper presents a novel Kalman Filter-based uncertainty propagation method that models AAM Flight Management System (FMS) architectures through sigmoid-blended measurement noise covariance. Unlike existing approaches with fixed uncertainty thresholds, our method continuously adapts the filter's measurement trust based on progress toward waypoints, enabling FMS correction behavior to emerge naturally. The approach scales proportionally with control inputs and is tunable to match specific aircraft characteristics or route conditions. We validate the method using real ADS-B data from general aviation aircraft divided into training and verification sets. Uncertainty propagation parameters were tuned on the training set, achieving 76% accuracy in predicting arrival times when compared against the verification dataset, demonstrating the method's effectiveness for strategic flight plan validation in AAM operations.
Chinese Translation
先进空中移动(AAM)操作需要战略飞行规划服务,以预测空间和时间的不确定性,从而安全地验证飞行计划,避免诸如天气单元、限制空域和CNS干扰区域等危险。目前,AAM飞行器的不确定性估计方法由于缺乏真实世界的性能数据,依赖于保守的线性模型。本文提出了一种新颖的基于卡尔曼滤波的不确定性传播方法,通过sigmoid混合测量噪声协方差建模AAM飞行管理系统(FMS)架构。与现有的固定不确定性阈值方法不同,我们的方法根据飞往航点的进展持续调整滤波器的测量信任度,使FMS的修正行为自然出现。该方法与控制输入成比例扩展,并可调节以匹配特定飞机特性或航线条件。我们使用来自一般航空飞机的真实ADS-B数据对该方法进行了验证,数据集分为训练集和验证集。在训练集上调整不确定性传播参数,在与验证数据集比较时,预测到达时间的准确率达76%,证明了该方法在AAM操作中的战略飞行计划验证中的有效性。
cs.RO / 56 / 2602.14958

Morphing of and writing with a scissor linkage mechanism

剪刀连杆机制的变形与书写
A, Mohanraj, Prasath, S Ganga
Abstract
Kinematics of mechanisms is intricately coupled to their geometry and their utility often arises out of the ability to perform reproducible motion with fewer actuating degrees of freedom. In this article, we explore the assembly of scissor-units, each made of two rigid linear members connected by a pin joint. The assembly has a single degree of freedom, where actuating any single unit results in a shape change of the entire assembly. We derive expressions for the effective curvature of the unit and the trajectory of the mechanism's tip as a function of the geometric variables which we then use as the basis to program two tasks in the mechanism: shape morphing and writing. By phrasing these tasks as optimization problems and utilizing the differentiable simulation framework, we arrive at solutions that are then tested in table-top experiments. Our results show that the geometry of scissor assemblies can be leveraged for automated navigation and inspection in complex domains, in light of the optimization framework. However, we highlight that the challenges associated with rapid programming and error-free implementation in experiments without feedback still remain.
Chinese Translation
机制的运动学与其几何形状密切相关,其实用性往往源于能够以更少的驱动自由度执行可重复的运动。本文探讨了剪刀单元的组装,每个单元由两个通过销接头连接的刚性线性构件组成。该组装具有一个自由度,驱动任一单元都会导致整个组装的形状变化。我们推导了单元的有效曲率和机制尖端轨迹的表达式,这些表达式作为基础用于在机制中编程两个任务:形状变形和书写。通过将这些任务表述为优化问题,并利用可微分仿真框架,我们得出了可在桌面实验中测试的解决方案。我们的结果表明,剪刀组装的几何形状可以用于复杂领域中的自动导航和检查,基于优化框架。然而,我们强调,在没有反馈的实验中,快速编程和无误实施所面临的挑战仍然存在。
cs.RO / 57 / 2602.14968

PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement

PhyScensis:用于复杂物理场景布置的物理增强大语言模型代理
Wang, Yian, Yang, Han, Guo, Minghao, Qiu, Xiaowen, Wang, Tsun-Hsuan, Matusik, Wojciech, Tenenbaum, Joshua B., Gan, Chuang
Abstract
Automatically generating interactive 3D environments is crucial for scaling up robotic data collection in simulation. While prior work has primarily focused on 3D asset placement, it often overlooks the physical relationships between objects (e.g., contact, support, balance, and containment), which are essential for creating complex and realistic manipulation scenarios such as tabletop arrangements, shelf organization, or box packing. Compared to classical 3D layout generation, producing complex physical scenes introduces additional challenges: (a) higher object density and complexity (e.g., a small shelf may hold dozens of books), (b) richer supporting relationships and compact spatial layouts, and (c) the need to accurately model both spatial placement and physical properties. To address these challenges, we propose PhyScensis, an LLM agent-based framework powered by a physics engine, to produce physically plausible scene configurations with high complexity. Specifically, our framework consists of three main components: an LLM agent iteratively proposes assets with spatial and physical predicates; a solver, equipped with a physics engine, realizes these predicates into a 3D scene; and feedback from the solver informs the agent to refine and enrich the configuration. Moreover, our framework preserves strong controllability over fine-grained textual descriptions and numerical parameters (e.g., relative positions, scene stability), enabled through probabilistic programming for stability and a complementary heuristic that jointly regulates stability and spatial relations. Experimental results show that our method outperforms prior approaches in scene complexity, visual quality, and physical accuracy, offering a unified pipeline for generating complex physical scene layouts for robotic manipulation.
Chinese Translation
自动生成交互式3D环境对于在仿真中扩大机器人数据收集至关重要。尽管之前的研究主要集中在3D资产的布置上,但往往忽视了物体之间的物理关系(例如接触、支撑、平衡和包含),这些关系对于创建复杂且真实的操作场景(如桌面布置、货架组织或箱子装箱)是必不可少的。与经典的3D布局生成相比,生成复杂的物理场景带来了额外的挑战:(a)更高的物体密度和复杂性(例如,一个小货架可能容纳数十本书),(b)更丰富的支撑关系和紧凑的空间布局,以及(c)需要准确建模空间布置和物理属性。为了解决这些挑战,我们提出了PhyScensis,一个基于大语言模型(LLM)代理的框架,利用物理引擎生成具有高复杂性的物理合理场景配置。具体而言,我们的框架由三个主要组件组成:一个LLM代理迭代地提出具有空间和物理谓词的资产;一个配备物理引擎的求解器将这些谓词实现为3D场景;求解器的反馈则通知代理以优化和丰富配置。此外,我们的框架在细粒度文本描述和数值参数(例如相对位置、场景稳定性)上保持强大的可控性,这得益于通过概率编程实现的稳定性以及一种互补启发式方法,该方法共同调节稳定性和空间关系。实验结果表明,我们的方法在场景复杂性、视觉质量和物理准确性方面优于之前的方法,为生成复杂物理场景布局以供机器人操作提供了统一的流程。
cs.RO / 58 / 2602.14974

DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI

DM0:面向物理人工智能的具身原生视觉-语言-行动模型
Yu, En, Lv, Haoran, Sun, Jianjian, Lin, Kangheng, Zhang, Ruitao, Shi, Yukang, Chen, Yuyang, Chen, Ze, Zhang, Ziheng, Jia, Fan, Liu, Kaixin, Zhang, Meng, Hao, Ruitao, Huang, Saike, Xie, Songhan, Liu, Yu, Wu, Zhao, Xie, Bin, Zhang, Pengwei, Yang, Qi, Deng, Xianchi, Wei, Yunfei, Zhang, Enwen, Peng, Hongyang, Zhao, Jie, Liu, Kai, Sun, Wei, Wei, Yajun, Yang, Yi, Zhang, Yunqiao, Yan, Ziwei, Yang, Haitao, Liu, Hao, Fan, Haoqiang, Zhang, Haowei, Huang, Junwen, Chen, Yang, Ma, Yunchao, Yang, Yunhuan, Du, Zhengyuan, Liu, Ziming, Niu, Jiahui, Zhao, Yucheng, Jiang, Daxin, Tang, Wenbin, Zhang, Xiangyu, Ge, Zheng, Zhou, Erjin, Wang, Tiancai
Abstract
Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the onset. Our methodology follows a comprehensive three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First, we conduct large-scale unified pretraining on the Vision-Language Model (VLM) using diverse corpora--seamlessly integrating web text, autonomous driving scenarios, and embodied interaction logs-to jointly acquire semantic knowledge and physical priors. Subsequently, we build a flow-matching action expert atop the VLM. To reconcile high-level reasoning with low-level control, DM0 employs a hybrid training strategy: for embodied data, gradients from the action expert are not backpropagated to the VLM to preserve generalized representations, while the VLM remains trainable on non-embodied data. Furthermore, we introduce an Embodied Spatial Scaffolding strategy to construct spatial Chain-of-Thought (CoT) reasoning, effectively constraining the action solution space. Experiments on the RoboChallenge benchmark demonstrate that DM0 achieves state-of-the-art performance in both Specialist and Generalist settings on Table30.
Chinese Translation
超越传统的将互联网预训练模型适应于物理任务的范式,我们提出了DM0,一个为物理人工智能设计的具身原生视觉-语言-行动(VLA)框架。与将物理基础视为微调后思考的做法不同,DM0通过从异构数据源中学习,统一了具身操作和导航。我们的方法遵循一个全面的三阶段流程:预训练、中期训练和后期训练。首先,我们在视觉-语言模型(VLM)上进行大规模统一预训练,使用多样的语料库——无缝整合网络文本、自动驾驶场景和具身交互日志——以共同获取语义知识和物理先验。随后,我们在VLM的基础上构建了一个流匹配行动专家。为了调和高层推理与低层控制,DM0采用了一种混合训练策略:对于具身数据,来自行动专家的梯度不回传至VLM,以保留泛化表示,而VLM在非具身数据上仍然可训练。此外,我们引入了一种具身空间支架策略,以构建空间链式思维(CoT)推理,有效约束行动解决空间。在RoboChallenge基准测试中的实验表明,DM0在Table30的专业和通用设置下均实现了最先进的性能。
cs.RO / 59 / 2602.14979

RynnBrain: Open Embodied Foundation Models

RynnBrain:开放的具身基础模型
Dang, Ronghao, Guo, Jiayan, Hou, Bohan, Leng, Sicong, Li, Kehan, Li, Xin, Liu, Jiangpin, Mao, Yunxuan, Wang, Zhikai, Yuan, Yuqian, Zhu, Minghao, Lin, Xiao, Bai, Yang, Jiang, Qian, Zhao, Yaxi, Zeng, Minghua, Gao, Junlong, Jiang, Yuming, Cen, Jun, Huang, Siteng, Wang, Liuyi, Zhang, Wenqiao, Liu, Chengju, Yang, Jianfei, Lu, Shijian, Zhao, Deli
Abstract
Despite rapid progress in multimodal foundation models, embodied intelligence community still lacks a unified, physically grounded foundation model that integrates perception, reasoning, and planning within real-world spatial-temporal dynamics. We introduce RynnBrain, an open-source spatiotemporal foundation model for embodied intelligence. RynnBrain strengthens four core capabilities in a unified framework: comprehensive egocentric understanding, diverse spatiotemporal localization, physically grounded reasoning, and physics-aware planning. The RynnBrain family comprises three foundation model scales (2B, 8B, and 30B-A3B MoE) and four post-trained variants tailored for downstream embodied tasks (i.e., RynnBrain-Nav, RynnBrain-Plan, and RynnBrain-VLA) or complex spatial reasoning tasks (i.e., RynnBrain-CoP). In terms of extensive evaluations on 20 embodied benchmarks and 8 general vision understanding benchmarks, our RynnBrain foundation models largely outperform existing embodied foundation models by a significant margin. The post-trained model suite further substantiates two key potentials of the RynnBrain foundation model: (i) enabling physically grounded reasoning and planning, and (ii) serving as a strong pretrained backbone that can be efficiently adapted to diverse embodied tasks.
Chinese Translation
尽管多模态基础模型取得了快速进展,具身智能领域仍然缺乏一个统一的、物理基础的基础模型,该模型能够在真实世界的时空动态中整合感知、推理和规划。我们介绍了RynnBrain,一个开源的时空基础模型,旨在支持具身智能。RynnBrain在一个统一框架中增强了四个核心能力:全面的自我中心理解、多样的时空定位、物理基础的推理和物理感知的规划。RynnBrain系列包括三种基础模型规模(2B、8B和30B-A3B MoE)以及四种后训练变体,专门针对下游具身任务(即RynnBrain-Nav、RynnBrain-Plan和RynnBrain-VLA)或复杂的空间推理任务(即RynnBrain-CoP)。在对20个具身基准和8个通用视觉理解基准的广泛评估中,我们的RynnBrain基础模型在性能上大幅超越了现有的具身基础模型。后训练模型套件进一步证明了RynnBrain基础模型的两个关键潜力:(i)实现物理基础的推理和规划,以及(ii)作为一个强大的预训练骨干网络,可以高效地适应多样的具身任务。
cs.RO / 60 / 2602.15010

BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames

BPP:通过关注关键历史帧实现长上下文机器人模仿学习
Mark, Max Sobol, Liang, Jacky, Attarian, Maria, Fu, Chuyuan, Dwibedi, Debidatta, Shah, Dhruv, Kumar, Aviral
Abstract
Many robot tasks require attending to the history of past observations. For example, finding an item in a room requires remembering which places have already been searched. However, the best-performing robot policies typically condition only on the current observation, limiting their applicability to such tasks. Naively conditioning on past observations often fails due to spurious correlations: policies latch onto incidental features of training histories that do not generalize to out-of-distribution trajectories upon deployment. We analyze why policies latch onto these spurious correlations and find that this problem stems from limited coverage over the space of possible histories during training, which grows exponentially with horizon. Existing regularization techniques provide inconsistent benefits across tasks, as they do not fundamentally address this coverage problem. Motivated by these findings, we propose Big Picture Policies (BPP), an approach that conditions on a minimal set of meaningful keyframes detected by a vision-language model. By projecting diverse rollouts onto a compact set of task-relevant events, BPP substantially reduces distribution shift between training and deployment, without sacrificing expressivity. We evaluate BPP on four challenging real-world manipulation tasks and three simulation tasks, all requiring history conditioning. BPP achieves 70% higher success rates than the best comparison on real-world evaluations.
Chinese Translation
许多机器人任务需要关注过去观察的历史。例如,在一个房间中寻找物品需要记住已经搜索过的位置。然而,表现最佳的机器人策略通常仅基于当前观察进行条件化,这限制了它们在此类任务中的适用性。简单地基于过去观察进行条件化往往会失败,因为存在虚假的相关性:策略会依赖于训练历史中的偶然特征,而这些特征在部署时无法推广到分布外的轨迹。我们分析了为什么策略会依赖于这些虚假的相关性,发现这个问题源于训练过程中对可能历史空间的覆盖有限,而这种覆盖随着时间跨度呈指数增长。现有的正则化技术在任务间提供的不一致收益,并未从根本上解决这一覆盖问题。基于这些发现,我们提出了大视野策略(Big Picture Policies, BPP),该方法基于视觉-语言模型检测到的最小有意义关键帧进行条件化。通过将多样化的回放投影到一组紧凑的任务相关事件上,BPP显著减少了训练与部署之间的分布偏移,同时不牺牲表达能力。我们在四个具有挑战性的现实世界操作任务和三个模拟任务上评估了BPP,这些任务均要求历史条件化。BPP在现实世界评估中实现了比最佳对比方法高出70%的成功率。
cs.RO / 61 / 2602.15018

Neurosim: A Fast Simulator for Neuromorphic Robot Perception

Neurosim:一种快速的神经形态机器人感知模拟器
Das, Richeek, Chaudhari, Pratik
Abstract
Neurosim is a fast, real-time, high-performance library for simulating sensors such as dynamic vision sensors, RGB cameras, depth sensors, and inertial sensors. It can also simulate agile dynamics of multi-rotor vehicles in complex and dynamic environments. Neurosim can achieve frame rates as high as ~2700 FPS on a desktop GPU. Neurosim integrates with a ZeroMQ-based communication library called Cortex to facilitate seamless integration with machine learning and robotics workflows. Cortex provides a high-throughput, low-latency message-passing system for Python and C++ applications, with native support for NumPy arrays and PyTorch tensors. This paper discusses the design philosophy behind Neurosim and Cortex. It demonstrates how they can be used to (i) train neuromorphic perception and control algorithms, e.g., using self-supervised learning on time-synchronized multi-modal data, and (ii) test real-time implementations of these algorithms in closed-loop. Neurosim and Cortex are available at https://github.com/grasp-lyrl/neurosim .
Chinese Translation
Neurosim 是一个快速、实时、高性能的库,用于模拟动态视觉传感器、RGB 摄像头、深度传感器和惯性传感器等传感器。它还可以模拟多旋翼飞行器在复杂和动态环境中的灵活动态。Neurosim 在桌面 GPU 上可以实现高达 ~2700 FPS 的帧率。Neurosim 与一个基于 ZeroMQ 的通信库 Cortex 集成,以促进与机器学习和机器人工作流程的无缝集成。Cortex 为 Python 和 C++ 应用程序提供高吞吐量、低延迟的消息传递系统,原生支持 NumPy 数组和 PyTorch 张量。本文讨论了 Neurosim 和 Cortex 背后的设计理念,并展示了它们如何用于 (i) 训练神经形态感知和控制算法,例如,使用时间同步的多模态数据进行自监督学习,以及 (ii) 在闭环中测试这些算法的实时实现。Neurosim 和 Cortex 可在 https://github.com/grasp-lyrl/neurosim 获取。
计算机视觉 (Computer Vision)
159
cs.CV / 1 / 2602.13267

Beyond Ground: Map-Free LiDAR Relocalization for UAVs

超越地面:无人机的无地图激光雷达重定位
Mu, Hengyu, Wu, Jianshi, Guo, Yuxin, Lin, XianLian, Hu, Qingyong, Wen, Chenglu, Wang, Cheng
Abstract
Localization is a fundamental capability in unmanned aerial vehicle (UAV) systems. Map-free LiDAR relocalization offers an effective solution for achieving high-precision positioning in environments with weak or unavailable GNSS signals. However, existing LiDAR relocalization methods are primarily tailored to autonomous driving, exhibiting significantly degraded accuracy in UAV scenarios. In this paper, we propose MAILS, a novel map-free LiDAR relocalization framework for UAVs. A Locality-Preserving Sliding Window Attention module is first introduced to extract locally discriminative geometric features from sparse point clouds. To handle substantial yaw rotations and altitude variations encountered during UAV flight, we then design a coordinate-independent feature initialization module and a locally invariant positional encoding mechanism, which together significantly enhance the robustness of feature extraction. Furthermore, existing LiDAR-based relocalization datasets fail to capture real-world UAV flight characteristics, such as irregular trajectories and varying altitudes. To address this gap, we construct a large-scale LiDAR localization dataset for UAVs, which comprises four scenes and various flight trajectories, designed to evaluate UAV relocalization performance under realistic conditions. Extensive experiments demonstrate that our method achieves satisfactory localization precision and consistently outperforms existing techniques by a significant margin. Our code and dataset will be released soon.
Chinese Translation
定位是无人机(UAV)系统的一项基本能力。无地图激光雷达重定位为在信号弱或不可用的全球导航卫星系统(GNSS)环境中实现高精度定位提供了一种有效的解决方案。然而,现有的激光雷达重定位方法主要针对自动驾驶,因而在无人机场景中表现出显著降低的精度。本文提出了一种新颖的无地图激光雷达重定位框架MAILS,专为无人机设计。首先引入了一种局部保持滑动窗口注意力模块,以从稀疏点云中提取局部具有区分性的几何特征。为了处理无人机飞行中遇到的大幅偏航旋转和高度变化,我们设计了一个坐标无关的特征初始化模块和一个局部不变的位置信息编码机制,这两者显著增强了特征提取的鲁棒性。此外,现有的基于激光雷达的重定位数据集未能捕捉到真实世界无人机飞行特征,如不规则轨迹和变化的高度。为了解决这一问题,我们构建了一个大规模的无人机激光雷达定位数据集,包含四个场景和各种飞行轨迹,旨在评估无人机在现实条件下的重定位性能。大量实验表明,我们的方法实现了令人满意的定位精度,并且在性能上显著优于现有技术。我们的代码和数据集将很快发布。
cs.CV / 2 / 2602.13286

Explanatory Interactive Machine Learning for Bias Mitigation in Visual Gender Classification

用于视觉性别分类偏见缓解的解释性互动机器学习
Satriani, Nathanya, Slijepčević, Djordje, Schedl, Markus, Zeppelzauer, Matthias
Abstract
Explanatory interactive learning (XIL) enables users to guide model training in machine learning (ML) by providing feedback on the model's explanations, thereby helping it to focus on features that are relevant to the prediction from the user's perspective. In this study, we explore the capability of this learning paradigm to mitigate bias and spurious correlations in visual classifiers, specifically in scenarios prone to data bias, such as gender classification. We investigate two methodologically different state-of-the-art XIL strategies, i.e., CAIPI and Right for the Right Reasons (RRR), as well as a novel hybrid approach that combines both strategies. The results are evaluated quantitatively by comparing segmentation masks with explanations generated using Gradient-weighted Class Activation Mapping (GradCAM) and Bounded Logit Attention (BLA). Experimental results demonstrate the effectiveness of these methods in (i) guiding ML models to focus on relevant image features, particularly when CAIPI is used, and (ii) reducing model bias (i.e., balancing the misclassification rates between male and female predictions). Our analysis further supports the potential of XIL methods to improve fairness in gender classifiers. Overall, the increased transparency and fairness obtained by XIL leads to slight performance decreases with an exception being CAIPI, which shows potential to even improve classification accuracy.
Chinese Translation
解释性互动学习(Explanatory Interactive Learning, XIL)使用户能够通过对模型解释的反馈来引导机器学习(Machine Learning, ML)中的模型训练,从而帮助模型关注用户视角下与预测相关的特征。在本研究中,我们探讨了这一学习范式在缓解视觉分类器中的偏见和虚假相关性方面的能力,特别是在容易出现数据偏见的场景中,如性别分类。我们研究了两种方法论上不同的最先进的XIL策略,即CAIPI和Right for the Right Reasons (RRR),以及一种结合这两种策略的新型混合方法。通过比较使用梯度加权类激活映射(Gradient-weighted Class Activation Mapping, GradCAM)和有界对数注意力(Bounded Logit Attention, BLA)生成的解释与分割掩码,定量评估了结果。实验结果表明,这些方法在(i)引导ML模型关注相关图像特征方面的有效性,特别是在使用CAIPI时,以及(ii)减少模型偏见(即平衡男性和女性预测之间的错误分类率)。我们的分析进一步支持XIL方法在改善性别分类器公平性方面的潜力。总体而言,XIL所带来的透明度和公平性的提升导致性能略有下降,唯一的例外是CAIPI,显示出甚至可以提高分类准确性的潜力。
cs.CV / 3 / 2602.13287

COOPERTRIM: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception

COOPERTRIM:面向不确定性感知的自适应数据选择
Mukhopadhyay, Shilpa, Roy-Chowdhury, Amit, Qiu, Hang
Abstract
Cooperative perception enables autonomous agents to share encoded representations over wireless communication to enhance each other's live situational awareness. However, the tension between the limited communication bandwidth and the rich sensor information hinders its practical deployment. Recent studies have explored selection strategies that share only a subset of features per frame while striving to keep the performance on par. Nevertheless, the bandwidth requirement still stresses current wireless technologies. To fundamentally ease the tension, we take a proactive approach, exploiting the temporal continuity to identify features that capture environment dynamics, while avoiding repetitive and redundant transmission of static information. By incorporating temporal awareness, agents are empowered to dynamically adapt the sharing quantity according to environment complexity. We instantiate this intuition into an adaptive selection framework, COOPERTRIM, which introduces a novel conformal temporal uncertainty metric to gauge feature relevance, and a data-driven mechanism to dynamically determine the sharing quantity. To evaluate COOPERTRIM, we take semantic segmentation and 3D detection as example tasks. Across multiple open-source cooperative segmentation and detection models, COOPERTRIM achieves up to 80.28% and 72.52% bandwidth reduction respectively while maintaining a comparable accuracy. Relative to other selection strategies, COOPERTRIM also improves IoU by as much as 45.54% with up to 72% less bandwidth. Combined with compression strategies, COOPERTRIM can further reduce bandwidth usage to as low as 1.46% without compromising IoU performance. Qualitative results show COOPERTRIM gracefully adapts to environmental dynamics, localization error, and communication latency, demonstrating flexibility and paving the way for real-world deployment.
Chinese Translation
协同感知使自主代理能够通过无线通信共享编码表示,以增强彼此的实时情境意识。然而,有限的通信带宽与丰富的传感器信息之间的矛盾阻碍了其实际部署。近期研究探讨了选择策略,仅在每帧中共享一部分特征,同时努力保持性能水平。然而,带宽需求仍然对当前无线技术造成压力。为根本缓解这一矛盾,我们采取了一种主动的方法,利用时间连续性来识别捕捉环境动态的特征,同时避免静态信息的重复和冗余传输。通过引入时间意识,代理能够根据环境复杂性动态调整共享数量。我们将这一直觉具体化为一个自适应选择框架COOPERTRIM,该框架引入了一种新颖的符合时间不确定性度量来评估特征相关性,并采用数据驱动机制动态确定共享数量。为了评估COOPERTRIM,我们以语义分割和3D检测作为示例任务。在多个开源协同分割和检测模型中,COOPERTRIM分别实现了高达80.28%和72.52%的带宽减少,同时保持了相当的准确性。与其他选择策略相比,COOPERTRIM在IoU方面的提升幅度可达45.54%,同时带宽减少高达72%。结合压缩策略,COOPERTRIM的带宽使用率可以进一步降低至1.46%,而不影响IoU性能。定性结果表明,COOPERTRIM能够优雅地适应环境动态、定位误差和通信延迟,展示了灵活性,并为实际部署铺平了道路。
cs.CV / 4 / 2602.13289

Evaluating the Impact of Post-Training Quantization on Reliable VQA with Multimodal LLMs

评估后训练量化对多模态大语言模型可靠视觉问答的影响
Kurz, Paul Jonas, Wieczorek, Tobias Jan, Abdelsalam, Mohamed A., Aljundi, Rahaf, Rohrbach, Marcus
Abstract
Multimodal Large Language Models (MLLM) are increasingly deployed in domains where both reliability and efficiency are critical. However, current models remain overconfident, producing highly certain but incorrect answers. At the same time, their large size limits deployment on edge devices, necessitating compression. We study the intersection of these two challenges by analyzing how Post-Training Quantization (PTQ) compression affects both accuracy and reliability in Visual Question Answering (VQA). We evaluate two MLLMs, Qwen2-VL-7B and Idefics3-8B, quantized with data-free (HQQ) and data-aware (MBQ) methods across multiple bit widths. To counteract the reduction in reliability caused by quantization, we adapt the Selector confidence estimator for quantized multimodal settings and test its robustness across various quantization levels and out-of-distribution (OOD) scenarios. We find that PTQ degrades both accuracy and reliability. Data-aware methods soften the effect thereof. The Selector substantially mitigates the reliability impact. The combination of int4 MBQ and the Selector achieves the best efficiency-reliability trade-off, closing in on uncompressed performance at approx. 75% less memory demand. Overall, we present the first systematic study linking quantization and reliability in multimodal settings.
Chinese Translation
多模态大语言模型(MLLM)在可靠性和效率至关重要的领域中越来越多地被部署。然而,当前模型仍然过于自信,产生高度确定但不正确的答案。同时,其庞大的体积限制了在边缘设备上的部署,迫切需要压缩。我们通过分析后训练量化(PTQ)压缩如何影响视觉问答(VQA)的准确性和可靠性,研究这两个挑战的交集。我们评估了两种MLLM,Qwen2-VL-7B和Idefics3-8B,采用无数据(HQQ)和有数据(MBQ)方法在多个比特宽度下进行量化。为了抵消量化带来的可靠性下降,我们调整了选择器(Selector)置信度估计器以适应量化的多模态环境,并测试其在不同量化水平和分布外(OOD)场景下的鲁棒性。我们发现PTQ会降低准确性和可靠性,而有数据方法则减轻了这种影响。选择器显著减轻了可靠性影响。int4 MBQ与选择器的组合实现了最佳的效率-可靠性权衡,以大约75%的内存需求接近未压缩性能。总体而言,我们呈现了首个系统性研究,连接了多模态环境中的量化与可靠性。
cs.CV / 5 / 2602.13293

NutVLM: A Self-Adaptive Defense Framework against Full-Dimension Attacks for Vision Language Models in Autonomous Driving

NutVLM:一种自适应防御框架,针对自动驾驶中视觉语言模型的全维攻击
Peng, Xiaoxu, Zhou, Dong, Zhang, Jianwen, Sun, Guanghui, Ngo, Anh Tu, Chattopadhyay, Anupam
Abstract
Vision Language Models (VLMs) have advanced perception in autonomous driving (AD), but they remain vulnerable to adversarial threats. These risks range from localized physical patches to imperceptible global perturbations. Existing defense methods for VLMs remain limited and often fail to reconcile robustness with clean-sample performance. To bridge these gaps, we propose NutVLM, a comprehensive self-adaptive defense framework designed to secure the entire perception-decision lifecycle. Specifically, we first employ NutNet++ as a sentinel, which is a unified detection-purification mechanism. It identifies benign samples, local patches, and global perturbations through three-way classification. Subsequently, localized threats are purified via efficient grayscale masking, while global perturbations trigger Expert-guided Adversarial Prompt Tuning (EAPT). Instead of the costly parameter updates of full-model fine-tuning, EAPT generates "corrective driving prompts" via gradient-based latent optimization and discrete projection. These prompts refocus the VLM's attention without requiring exhaustive full-model retraining. Evaluated on the Dolphins benchmark, our NutVLM yields a 4.89% improvement in overall metrics (e.g., Accuracy, Language Score, and GPT Score). These results validate NutVLM as a scalable security solution for intelligent transportation. Our code is available at https://github.com/PXX/NutVLM.
Chinese Translation
视觉语言模型(VLMs)在自动驾驶(AD)中的感知能力得到了提升,但仍然容易受到对抗性威胁。这些风险范围从局部物理补丁到不可察觉的全局扰动。现有的VLM防御方法仍然有限,往往无法在鲁棒性与干净样本性能之间取得平衡。为了解决这些问题,我们提出了NutVLM,这是一种全面的自适应防御框架,旨在保护整个感知-决策生命周期。具体而言,我们首先采用NutNet++作为哨兵,这是一种统一的检测-净化机制。它通过三分类识别良性样本、局部补丁和全局扰动。随后,通过高效的灰度掩蔽净化局部威胁,而全局扰动则触发专家引导的对抗提示调优(Expert-guided Adversarial Prompt Tuning,EAPT)。EAPT通过基于梯度的潜在优化和离散投影生成“纠正驾驶提示”,而不是耗时的全模型微调的参数更新。这些提示在不需要全面重新训练全模型的情况下,重新聚焦VLM的注意力。在Dolphins基准测试中评估,我们的NutVLM在整体指标(如准确率、语言得分和GPT得分)上提高了4.89%。这些结果验证了NutVLM作为智能交通的可扩展安全解决方案的有效性。我们的代码可在https://github.com/PXX/NutVLM获取。
cs.CV / 6 / 2602.13294

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

VisPhyWorld:通过代码驱动的视频重建探究物理推理
Liang, Jiarong, Ku, Max, Hui, Ka-Hei, Nie, Ping, Chen, Wenhu
Abstract
Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% on the benchmark. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics.
Chinese Translation
评估多模态大型语言模型(MLLMs)是否真正能够推理物理动态仍然具有挑战性。现有的大多数基准测试依赖于识别式协议,例如视觉问答(VQA)和期望违背(VoE),这些问题往往可以在不承诺于明确的、可测试的物理假设的情况下回答。我们提出了VisPhyWorld,一个基于执行的框架,通过要求模型从视觉观察生成可执行的模拟器代码来评估物理推理。通过生成可运行的代码,推断出的世界表征可以直接被检查、编辑和证伪。这将物理推理与渲染分开。在此框架的基础上,我们引入了VisPhyBench,包括209个评估场景,源自108个物理模板,以及一个系统化的协议,用于评估模型重建外观和再现物理上合理的运动的能力。我们的管道在基准测试中产生了97.7%的有效重建视频。实验表明,尽管最先进的MLLMs在语义场景理解方面表现强劲,但它们在准确推断物理参数和模拟一致的物理动态方面仍然存在困难。
cs.CV / 7 / 2602.13296

MFN Decomposition and Related Metrics for High-Resolution Range Profiles Generative Models

高分辨率距离轮廓生成模型的MFN分解及相关指标
Brient, Edwyn, Velasco-Forero, Santiago, Kassab, Rami
Abstract
High-resolution range profile (HRRP ) data are in vogue in radar automatic target recognition (RATR). With the interest in classifying models using HRRP, filling gaps in datasets using generative models has recently received promising contributions. Evaluating generated data is a challenging topic, even for explicit data like face images. However, the evaluation methods used in the state-ofthe-art of HRRP generation rely on classification models. Such models, called ''black-box'', do not allow either explainability on generated data or multi-level evaluation. This work focuses on decomposing HRRP data into three components: the mask, the features, and the noise. Using this decomposition, we propose two metrics based on the physical interpretation of those data. We take profit from an expensive dataset to evaluate our metrics on a challenging task and demonstrate the discriminative ability of those.
Chinese Translation
高分辨率距离轮廓(HRRP)数据在雷达自动目标识别(RATR)中越来越受到关注。随着对使用HRRP进行分类模型的兴趣增加,利用生成模型填补数据集中的空白最近得到了有希望的贡献。评估生成数据是一个具有挑战性的课题,即使对于显式数据如人脸图像也是如此。然而,目前HRRP生成的最先进评估方法依赖于分类模型。这些被称为“黑箱”的模型既无法对生成数据进行解释,也无法进行多层次的评估。本研究聚焦于将HRRP数据分解为三个组成部分:掩模、特征和噪声。基于这种分解,我们提出了两个基于这些数据物理解释的指标。我们利用一个昂贵的数据集在一个具有挑战性的任务上评估我们的指标,并展示了这些指标的区分能力。
cs.CV / 8 / 2602.13297

Conditional Generative Models for High-Resolution Range Profiles: Capturing Geometry-Driven Trends in a Large-Scale Maritime Dataset

高分辨率范围轮廓的条件生成模型:捕捉大规模海洋数据集中几何驱动的趋势
Brient, Edwyn, Velasco-Forero, Santiago, Kassab, Rami
Abstract
High-resolution range profiles (HRRPs) enable fast onboard processing for radar automatic target recognition, but their strong sensitivity to acquisition conditions limits robustness across operational scenarios. Conditional HRRP generation can mitigate this issue, yet prior studies are constrained by small, highly specific datasets. We study HRRP synthesis on a largescale maritime database representative of coastal surveillance variability. Our analysis indicates that the fundamental scenario drivers are geometric: ship dimensions and the desired aspect angle. Conditioning on these variables, we train generative models and show that the synthesized signatures reproduce the expected line-of-sight geometric trend observed in real data. These results highlight the central role of acquisition geometry for robust HRRP generation.
Chinese Translation
高分辨率范围轮廓(HRRPs)能够快速进行雷达自动目标识别的机载处理,但其对采集条件的强敏感性限制了在不同操作场景下的鲁棒性。条件HRRP生成可以缓解这一问题,但以往研究受限于小规模、高度特定的数据集。我们研究了在一个代表沿海监视变异性的大规模海洋数据库上进行HRRP合成。我们的分析表明,基本场景驱动因素是几何因素:船舶尺寸和所需的视角。通过对这些变量进行条件化,我们训练了生成模型,并展示了合成的特征再现了在真实数据中观察到的预期视线几何趋势。这些结果突显了采集几何在鲁棒HRRP生成中的核心作用。
cs.CV / 9 / 2602.13298

Effect of Convolutional Depth on Image Recognition Performance: VGG vs. ResNet vs. GoogLeNet

卷积深度对图像识别性能的影响:VGG、ResNet与GoogLeNet的比较
Fischer, Manfred M., Pitts, Joshua
Abstract
Increasing convolutional depth has been central to advances in image recognition, yet deeper networks do not uniformly yield higher accuracy, stable optimization, or efficient computation. We present a controlled comparative study of three canonical convolutional neural network architectures - VGG, ResNet, and GoogLeNet - to isolate how depth influences classification performance, convergence behavior, and computational efficiency. By standardizing training protocols and explicitly distinguishing between nominal and effective depth, we show that the benefits of depth depend critically on architectural mechanisms that constrain its effective manifestation during training rather than on nominal depth alone. Although plain deep networks exhibit early accuracy saturation and optimization instability, residual and inception-based architectures consistently translate additional depth into improved accuracy at lower effective depth and favorable accuracy-compute trade-offs. These findings demonstrate that effective depth, not nominal depth, is the operative quantity governing depth's role as a productive scaling dimension in convolutional networks.
Chinese Translation
增加卷积深度是图像识别进展的核心,但更深的网络并不总是能带来更高的准确率、稳定的优化或高效的计算。我们进行了一项受控的比较研究,分析三种经典卷积神经网络架构——VGG、ResNet和GoogLeNet——以隔离深度如何影响分类性能、收敛行为和计算效率。通过标准化训练协议,并明确区分名义深度和有效深度,我们表明深度的好处在于依赖于在训练过程中限制其有效表现的架构机制,而不仅仅是名义深度。尽管普通深度网络表现出早期准确率饱和和优化不稳定,残差网络和基于Inception的架构始终能将额外的深度转化为在较低有效深度下的准确率提升和有利的准确率与计算权衡。这些发现表明,有效深度而非名义深度是决定深度在卷积网络中作为生产性扩展维度作用的关键量。
cs.CV / 10 / 2602.13299

KidMesh: Computational Mesh Reconstruction for Pediatric Congenital Hydronephrosis Using Deep Neural Networks

KidMesh:基于深度神经网络的儿童先天性肾盂积水计算网格重建
Sun, Haoran, Zhu, Zhanpeng, Zhang, Anguo, Liu, Bo, Lin, Zhaohua, Huang, Liqin, Yang, Mingjing, Liu, Lei, Lin, Shan, Ding, Wangbin
Abstract
Pediatric congenital hydronephrosis (CH) is a common urinary tract disorder, primarily caused by obstruction at the renal pelvis-ureter junction. Magnetic resonance urography (MRU) can visualize hydronephrosis, including renal pelvis and calyces, by utilizing the natural contrast provided by water. Existing voxel-based segmentation approaches can extract CH regions from MRU, facilitating disease diagnosis and prognosis. However, these segmentation methods predominantly focus on morphological features, such as size, shape, and structure. To enable functional assessments, such as urodynamic simulations, external complex post-processing steps are required to convert these results into mesh-level representations. To address this limitation, we propose an end-to-end method based on deep neural networks, namely KidMesh, which could automatically reconstruct CH meshes directly from MRU. Generally, KidMesh extracts feature maps from MRU images and converts them into feature vertices through grid sampling. It then deforms a template mesh according to these feature vertices to generate the specific CH meshes of MRU images. Meanwhile, we develop a novel schema to train KidMesh without relying on accurate mesh-level annotations, which are difficult to obtain due to the sparsely sampled MRU slices. Experimental results show that KidMesh could reconstruct CH meshes in an average of 0.4 seconds, and achieve comparable performance to conventional methods without requiring post-processing. The reconstructed meshes exhibited no self-intersections, with only 3.7% and 0.2% of the vertices having error distances exceeding 3.2mm and 6.4mm, respectively. After rasterization, these meshes achieved a Dice score of 0.86 against manually delineated CH masks. Furthermore, these meshes could be used in renal urine flow simulations, providing valuable urodynamic information for clinical practice.
Chinese Translation
儿童先天性肾盂积水(CH)是一种常见的尿路疾病,主要由肾盂-输尿管交界处的阻塞引起。磁共振尿路成像(MRU)可以通过利用水提供的自然对比度来可视化肾盂及肾盏的积水情况。现有的基于体素的分割方法能够从MRU中提取CH区域,从而促进疾病的诊断和预后。然而,这些分割方法主要关注形态特征,如大小、形状和结构。为了实现功能评估,如尿动力学模拟,需要外部复杂的后处理步骤将这些结果转换为网格级表示。为了解决这一限制,我们提出了一种基于深度神经网络的端到端方法,命名为KidMesh,该方法可以直接从MRU自动重建CH网格。一般而言,KidMesh从MRU图像中提取特征图,并通过网格采样将其转换为特征顶点。然后,根据这些特征顶点变形模板网格,以生成MRU图像的特定CH网格。同时,我们开发了一种新方案来训练KidMesh,而无需依赖于准确的网格级注释,因为这些注释由于MRU切片的稀疏采样而难以获得。实验结果表明,KidMesh能够在平均0.4秒内重建CH网格,并且在不需要后处理的情况下,性能与传统方法相当。重建的网格没有自交,只有3.7%和0.2%的顶点的误差距离超过3.2mm和6.4mm。经过光栅化后,这些网格与手动勾画的CH掩膜相比,Dice得分达到了0.86。此外,这些网格可以用于肾脏尿流模拟,为临床实践提供有价值的尿动力学信息。
cs.CV / 11 / 2602.13301

DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving

DriveMamba:面向任务的可扩展状态空间模型以实现高效的端到端自动驾驶
Su, Haisheng, Wu, Wei, Song, Feixiang, Zhang, Junjie, Yang, Zhenjie, Yan, Junchi
Abstract
Recent advances towards End-to-End Autonomous Driving (E2E-AD) have been often devoted on integrating modular designs into a unified framework for joint optimization e.g. UniAD, which follow a sequential paradigm (i.e., perception-prediction-planning) based on separable Transformer decoders and rely on dense BEV features to encode scene representations. However, such manual ordering design can inevitably cause information loss and cumulative errors, lacking flexible and diverse relation modeling among different modules and sensors. Meanwhile, insufficient training of image backbone and quadratic-complexity of attention mechanism also hinder the scalability and efficiency of E2E-AD system to handle spatiotemporal input. To this end, we propose DriveMamba, a Task-Centric Scalable paradigm for efficient E2E-AD, which integrates dynamic task relation modeling, implicit view correspondence learning and long-term temporal fusion into a single-stage Unified Mamba decoder. Specifically, both extracted image features and expected task outputs are converted into token-level sparse representations in advance, which are then sorted by their instantiated positions in 3D space. The linear-complexity operator enables efficient long-context sequential token modeling to capture task-related inter-dependencies simultaneously. Additionally, a bidirectional trajectory-guided "local-to-global" scan method is designed to preserve spatial locality from ego-perspective, thus facilitating the ego-planning. Extensive experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability and great efficiency of DriveMamba.
Chinese Translation
最近在端到端自动驾驶(E2E-AD)方面的进展,通常致力于将模块化设计整合到统一框架中以进行联合优化,例如 UniAD,这种方法遵循基于可分离 Transformer 解码器的顺序范式(即感知-预测-规划),并依赖于密集的鸟瞰视图(BEV)特征来编码场景表示。然而,这种手动排序设计不可避免地会导致信息丢失和累积误差,缺乏不同模块和传感器之间灵活多样的关系建模。同时,图像主干网络的训练不足和注意力机制的二次复杂性也阻碍了 E2E-AD 系统处理时空输入的可扩展性和效率。为此,我们提出了 DriveMamba,一种面向任务的可扩展范式,以实现高效的 E2E-AD,它将动态任务关系建模、隐式视图对应学习和长期时间融合集成到单阶段的统一 Mamba 解码器中。具体而言,提取的图像特征和期望的任务输出被提前转换为标记级稀疏表示,然后根据它们在三维空间中的实例化位置进行排序。线性复杂度的操作符使得高效的长上下文序列标记建模成为可能,以同时捕捉与任务相关的相互依赖关系。此外,设计了一种双向轨迹引导的“局部到全局”扫描方法,以保留来自自我视角的空间局部性,从而促进自我规划。在 nuScenes 和 Bench2Drive 数据集上进行的大量实验表明,DriveMamba 在优越性、泛化能力和高效性方面具有显著优势。
cs.CV / 12 / 2602.13303

Spectral Collapse in Diffusion Inversion

扩散反演中的光谱崩溃
Bourriez, Nicolas, Verine, Alexandre, Genovesio, Auguste
Abstract
Conditional diffusion inversion provides a powerful framework for unpaired image-to-image translation. However, we demonstrate through an extensive analysis that standard deterministic inversion (e.g. DDIM) fails when the source domain is spectrally sparse compared to the target domain (e.g., super-resolution, sketch-to-image). In these contexts, the recovered latent from the input does not follow the expected isotropic Gaussian distribution. Instead it exhibits a signal with lower frequencies, locking target sampling to oversmoothed and texture-poor generations. We term this phenomenon spectral collapse. We observe that stochastic alternatives attempting to restore the noise variance tend to break the semantic link to the input, leading to structural drift. To resolve this structure-texture trade-off, we propose Orthogonal Variance Guidance (OVG), an inference-time method that corrects the ODE dynamics to enforce the theoretical Gaussian noise magnitude within the null-space of the structural gradient. Extensive experiments on microscopy super-resolution (BBBC021) and sketch-to-image (Edges2Shoes) demonstrate that OVG effectively restores photorealistic textures while preserving structural fidelity.
Chinese Translation
条件扩散反演为无配对图像到图像的转换提供了一个强大的框架。然而,我们通过广泛的分析证明,当源域在光谱上相较于目标域(例如,超分辨率、素描到图像)稀疏时,标准的确定性反演(例如,DDIM)会失效。在这些情况下,从输入中恢复的潜在变量并不遵循预期的各向同性高斯分布。相反,它表现出较低频率的信号,使得目标采样锁定在过度平滑和纹理贫乏的生成上。我们将这种现象称为光谱崩溃。我们观察到,试图恢复噪声方差的随机替代方法往往会破坏与输入的语义联系,导致结构漂移。为了解决这种结构与纹理之间的权衡,我们提出了正交方差引导(Orthogonal Variance Guidance, OVG),这是一种推理时方法,通过修正常微分方程(ODE)动态来强制在结构梯度的零空间内维持理论高斯噪声幅度。在显微镜超分辨率(BBBC021)和素描到图像(Edges2Shoes)上的大量实验表明,OVG能够有效恢复逼真的纹理,同时保持结构的保真度。
cs.CV / 13 / 2602.13304

Progressive Contrast Registration for High-Fidelity Bidirectional Photoacoustic Microscopy Alignment

高保真双向光声显微镜对齐的渐进对比注册
Qin, Jiahao
Abstract
High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing methods, constrained by brightness constancy assumptions, achieve limited alignment quality (NCC~$\leq 0.96$). We propose PCReg-Net, a progressive contrast-guided registration framework that performs coarse-to-fine alignment through four lightweight modules: (1)~a registration U-Net for coarse alignment, (2)~a reference feature extractor capturing multi-scale structural cues, (3)~a contrast module that identifies residual misalignment by comparing coarse-registered and reference features, and (4)~a refinement U-Net with feature injection for high-fidelity output. We further propose the Temporal NCC (TNCC) and Temporal NCC Gap (TNCG) for reference-free evaluation of inter-frame temporal consistency. On OR-PAM-Reg-4K (432 test samples), PCReg-Net achieves NCC of 0.983, SSIM of 0.982, and PSNR of 46.96 dB, surpassing the state-of-the-art by over 14 dB at real-time speed. Code is available at https://github.com/JiahaoQin/PCReg-Net
Chinese Translation
高速光学分辨率光声显微镜(OR-PAM)采用双向栅格扫描可将成像速度提高一倍,但同时引入了前向和后向扫描线之间的耦合域偏移和几何不对齐。现有方法受限于亮度恒定假设,导致对齐质量有限(NCC~≤ 0.96)。我们提出了PCReg-Net,一种渐进对比引导的注册框架,通过四个轻量级模块实现粗到精的对齐: (1) 用于粗对齐的注册U-Net,(2) 捕捉多尺度结构线索的参考特征提取器,(3) 通过比较粗对齐特征和参考特征来识别残余不对齐的对比模块,以及 (4) 具有特征注入的精细化U-Net,以获得高保真输出。我们进一步提出了无参考的时间一致性评估指标——时间NCC(TNCC)和时间NCC差距(TNCG)。在OR-PAM-Reg-4K(432个测试样本)上,PCReg-Net实现了0.983的NCC、0.982的SSIM和46.96 dB的PSNR,超过了现有技术14 dB以上,并且在实时速度下运行。代码可在 https://github.com/JiahaoQin/PCReg-Net 获取。
cs.CV / 14 / 2602.13305

WildfireVLM: AI-powered Analysis for Early Wildfire Detection and Risk Assessment Using Satellite Imagery

WildfireVLM:基于人工智能的卫星影像早期野火检测与风险评估分析
Ayanzadeh, Aydin, Dixit, Prakhar, Kamal, Sadia, Halem, Milton
Abstract
Wildfires are a growing threat to ecosystems, human lives, and infrastructure, with their frequency and intensity rising due to climate change and human activities. Early detection is critical, yet satellite-based monitoring remains challenging due to faint smoke signals, dynamic weather conditions, and the need for real-time analysis over large areas. We introduce WildfireVLM, an AI framework that combines satellite imagery wildfire detection with language-driven risk assessment. We construct a labeled wildfire and smoke dataset using imagery from Landsat-8/9, GOES-16, and other publicly available Earth observation sources, including harmonized products with aligned spectral bands. WildfireVLM employs YOLOv12 to detect fire zones and smoke plumes, leveraging its ability to detect small, complex patterns in satellite imagery. We integrate Multimodal Large Language Models (MLLMs) that convert detection outputs into contextualized risk assessments and prioritized response recommendations for disaster management. We validate the quality of risk reasoning using an LLM-as-judge evaluation with a shared rubric. The system is deployed using a service-oriented architecture that supports real-time processing, visual risk dashboards, and long-term wildfire tracking, demonstrating the value of combining computer vision with language-based reasoning for scalable wildfire monitoring.
Chinese Translation
野火对生态系统、人类生命和基础设施构成日益严重的威胁,其频率和强度因气候变化和人类活动而上升。早期检测至关重要,但基于卫星的监测仍面临挑战,主要由于微弱的烟雾信号、动态的天气条件以及对大面积实时分析的需求。我们提出了WildfireVLM,一个将卫星影像野火检测与基于语言的风险评估相结合的人工智能框架。我们使用Landsat-8/9、GOES-16及其他公开可用的地球观测源的影像构建了一个标注的野火和烟雾数据集,包括具有对齐光谱波段的协调产品。WildfireVLM采用YOLOv12检测火灾区域和烟雾羽流,利用其在卫星影像中检测小型复杂模式的能力。我们整合了多模态大型语言模型(MLLMs),将检测输出转换为上下文化的风险评估和优先响应建议,以便于灾害管理。我们通过使用共享评分标准的LLM作为评审的评估方法验证风险推理的质量。该系统采用面向服务的架构进行部署,支持实时处理、可视化风险仪表板和长期野火跟踪,展示了将计算机视觉与基于语言的推理相结合在可扩展野火监测中的价值。
cs.CV / 15 / 2602.13306

Fine-Tuning a Large Vision-Language Model for Artwork's Scoring and Critique

针对艺术作品评分与评论的大型视觉-语言模型微调
Zhang, Zhehan, Qian, Meihua, Luo, Li, Huang, Siyu, Zhou, Chaoyi, Saha, Ripon, Song, Xinxin
Abstract
Assessing artistic creativity is foundational to creativity research and arts education, yet manual scoring (e.g., Torrance Tests of Creative Thinking) is labor-intensive at scale. Prior machine-learning approaches show promise for visual creativity scoring, but many rely mainly on image features and provide limited or no explanatory feedback. We propose a framework for automated creativity assessment of human paintings by fine-tuning the vision-language model Qwen2-VL-7B with multi-task learning. Our dataset contains 1000 human-created paintings scored on a 1-100 scale and paired with a short human-written description (content or artist explanation). Two expert raters evaluated each work using a five-dimension rubric (originality, color, texture, composition, content) and provided written critiques; we use an 80/20 train-test split. We add a lightweight regression head on the visual encoder output so the model can predict a numerical score and generate rubric-aligned feedback in a single forward pass. By embedding the structured rubric and the artwork description in the system prompt, we constrain the generated text to match the quantitative prediction. Experiments show strong accuracy, achieving Pearson r > 0.97 and MAE about 3.95 on the 100-point scale. Qualitative evaluation indicates the generated feedback is semantically close to expert critiques (average SBERT cosine similarity = 0.798). The proposed approach bridges computer vision and art assessment and offers a scalable tool for creativity research and classroom feedback.
Chinese Translation
评估艺术创造力是创造力研究和艺术教育的基础,但手动评分(例如,托兰斯创造性思维测试)在大规模应用中劳动强度大。以往的机器学习方法在视觉创造力评分方面显示出潜力,但许多方法主要依赖图像特征,提供的解释性反馈有限或没有。我们提出了一种通过多任务学习微调视觉-语言模型Qwen2-VL-7B进行人类绘画自动创造力评估的框架。我们的数据集包含1000幅人类创作的绘画,评分范围为1-100,并配有简短的人类书面描述(内容或艺术家解释)。两位专家评审使用五维评分标准(原创性、色彩、纹理、构图、内容)对每件作品进行评估,并提供书面评论;我们采用80/20的训练-测试分割。我们在视觉编码器输出上添加了一个轻量级回归头,使模型能够在一次前向传播中预测数值评分并生成与评分标准对齐的反馈。通过将结构化评分标准和艺术作品描述嵌入系统提示中,我们限制生成的文本与定量预测相匹配。实验结果显示出强大的准确性,Pearson相关系数r > 0.97,平均绝对误差约为3.95(基于100分制)。定性评估表明,生成的反馈在语义上与专家评论相近(平均SBERT余弦相似度=0.798)。所提出的方法架起了计算机视觉与艺术评估之间的桥梁,并为创造力研究和课堂反馈提供了一种可扩展的工具。
cs.CV / 16 / 2602.13310

Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

视觉并行思考者:用于视觉理解的分而治之推理
Xu, Haoran, Wang, Hongyu, Li, Jiaze, Chen, Shunpeng, Tong, Zizhao, Ju, Jianzhong, Luo, Zhenbo, Luan, Jian
Abstract
Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the narrowing of exploration. However, the extension of this paradigm to visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and subsequently propose two distinct strategies. Based on the above, we introduce Visual Para-Thinker, representing the inaugural parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we have developed a native multimodal implementation that facilitates high-efficiency parallel processing. Empirical results on benchmark datasets such as V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.
Chinese Translation
现有的大型语言模型(LLM)测试时扩展规律强调通过延长推理长度而出现的自我反思行为。然而,这种纵向扩展策略在探索中往往会遇到平台期,因为模型会被锁定在特定的思维模式中。通过从深度推理转向并行思维,平行思考减轻了探索的狭窄。然而,将这一范式扩展到视觉领域仍然是一个未解的研究问题。本文首先考察视觉分区在并行推理中的作用,随后提出两种不同的策略。在此基础上,我们引入了视觉并行思考者(Visual Para-Thinker),这是针对多模态大型语言模型(MLLMs)的首个并行推理框架。为了保持路径独立性并促进推理的多样性,我们的方法结合了Pa-Attention和LPRoPE。利用vLLM框架,我们开发了一种原生的多模态实现,促进高效的并行处理。在V*、CountBench、RefCOCO和HallusionBench等基准数据集上的实证结果确认,视觉并行思考者成功将并行推理的优势扩展到视觉领域。
cs.CV / 17 / 2602.13313

Agentic Spatio-Temporal Grounding via Collaborative Reasoning

通过协作推理实现能动的时空定位
Zhao, Heng, Ong, Yew-Soon, Zhou, Joey Tianyi
Abstract
Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. Most existing approaches perform frame-wise spatial localization within a predicted temporal span, resulting in redundant computation, heavy supervision requirements, and limited generalization. Weakly-supervised variants mitigate annotation costs but remain constrained by the dataset-level train-and-fit paradigm with an inferior performance. To address these challenges, we propose the Agentic Spatio-Temporal Grounder (ASTG) framework for the task of STVG towards an open-world and training-free scenario. Specifically, two specialized agents SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent) constructed leveraging on modern Multimoal Large Language Models (MLLMs) work collaboratively to retrieve the target tube in an autonomous and self-guided manner. Following a propose-and-evaluation paradigm, ASTG duly decouples spatio-temporal reasoning and automates the tube extraction, verification and temporal localization processes. With a dedicate visual memory and dialogue context, the retrieval efficiency is significantly enhanced. Experiments on popular benchmarks demonstrate the superiority of the proposed approach where it outperforms existing weakly-supervised and zero-shot approaches by a margin and is comparable to some of the fully-supervised methods.
Chinese Translation
时空视频定位(STVG)旨在根据文本查询在视频中检索目标物体或人物的时空管道。现有的大多数方法在预测的时间范围内进行逐帧空间定位,导致冗余计算、较高的监督要求和有限的泛化能力。弱监督变体虽然减轻了标注成本,但仍受限于数据集级别的训练与拟合范式,表现较差。为了解决这些挑战,我们提出了能动时空定位器(ASTG)框架,旨在实现开放世界和无训练场景下的STVG任务。具体而言,两个专门的代理——空间推理代理(SRA)和时间推理代理(TRA),利用现代多模态大型语言模型(MLLMs)协同工作,以自主和自我引导的方式检索目标管道。遵循提出与评估的范式,ASTG有效地解耦了时空推理,并自动化了管道提取、验证和时间定位过程。通过专用的视觉记忆和对话上下文,检索效率显著提高。在流行基准上的实验表明,所提方法的优越性,其性能超越了现有的弱监督和零样本方法,并与一些全监督方法相当。
cs.CV / 18 / 2602.13314

Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction

Sim2Radar:通过 VLM 引导场景重建缩小雷达模拟与现实之间的差距
Bejerano, Emily, Tondolo, Federico, Qayyum, Aayan, Yu, Xiaofan, Jiang, Xiaofan
Abstract
Millimeter-wave (mmWave) radar provides reliable perception in visually degraded indoor environments (e.g., smoke, dust, and low light), but learning-based radar perception is bottlenecked by the scarcity and cost of collecting and annotating large-scale radar datasets. We present Sim2Radar, an end-to-end framework that synthesizes training radar data directly from single-view RGB images, enabling scalable data generation without manual scene modeling. Sim2Radar reconstructs a material-aware 3D scene by combining monocular depth estimation, segmentation, and vision-language reasoning to infer object materials, then simulates mmWave propagation with a configurable physics-based ray tracer using Fresnel reflection models parameterized by ITU-R electromagnetic properties. Evaluated on real-world indoor scenes, Sim2Radar improves downstream 3D radar perception via transfer learning: pre-training a radar point-cloud object detection model on synthetic data and fine-tuning on real radar yields up to +3.7 3D AP (IoU 0.3), with gains driven primarily by improved spatial localization. These results suggest that physics-based, vision-driven radar simulation can provide effective geometric priors for radar learning and measurably improve performance under limited real-data supervision.
Chinese Translation
毫米波(mmWave)雷达在视觉退化的室内环境(如烟雾、灰尘和低光照)中提供可靠的感知,但基于学习的雷达感知受到收集和标注大规模雷达数据集的稀缺性和成本的制约。我们提出了 Sim2Radar,这是一个端到端框架,能够直接从单视角 RGB 图像合成训练雷达数据,实现无需手动场景建模的可扩展数据生成。Sim2Radar 通过结合单目深度估计、分割和视觉-语言推理来推断物体材料,从而重建一个材料感知的 3D 场景,然后使用基于物理的可配置光线追踪器模拟 mmWave 传播,该光线追踪器采用由 ITU-R 电磁特性参数化的菲涅尔反射模型。在真实的室内场景上进行评估,Sim2Radar 通过迁移学习改善下游 3D 雷达感知:在合成数据上预训练雷达点云目标检测模型,并在真实雷达上进行微调,获得高达 +3.7 的 3D AP(IoU 0.3),其增益主要源于空间定位的改善。这些结果表明,基于物理的、以视觉驱动的雷达模拟可以为雷达学习提供有效的几何先验,并在有限的真实数据监督下显著提高性能。
cs.CV / 19 / 2602.13315

IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs

IDPruner:在视觉令牌剪枝中协调重要性与多样性以优化多模态大语言模型
Tan, Yifan, Sun, Yifu, Huang, Shirui, Liu, Hong, Yu, Guanghua, Zhu, Jianchen, Deng, Yangdong
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities, yet they encounter significant computational bottlenecks due to the massive volume of visual tokens. Consequently, visual token pruning, which substantially reduces the token count, has emerged as a critical technique for accelerating MLLM inference. Existing approaches focus on token importance, diversity, or an intuitive combination of both, without a principled framework for their optimal integration. To address this issue, we first conduct a systematic analysis to characterize the trade-off between token importance and semantic diversity. Guided by this analysis, we propose the \textbf{I}mportance and \textbf{D}iversity Pruner (\textbf{IDPruner}), which leverages the Maximal Marginal Relevance (MMR) algorithm to achieve a Pareto-optimal balance between these two objectives. Crucially, our method operates without requiring attention maps, ensuring full compatibility with FlashAttention and efficient deployment via one-shot pruning. We conduct extensive experiments across various model architectures and multimodal benchmarks, demonstrating that IDPruner achieves state-of-the-art performance and superior generalization across diverse architectures and tasks. Notably, on Qwen2.5-VL-7B-Instruct, IDPruner retains 95.18\% of baseline performance when pruning 75\% of the tokens, and still maintains 86.40\% even under an extreme 90\% pruning ratio. Our code is available at https://github.com/Tencent/AngelSlim.
Chinese Translation
多模态大语言模型(MLLMs)展现了令人印象深刻的能力,但由于视觉令牌数量庞大,它们面临着显著的计算瓶颈。因此,视觉令牌剪枝作为一种显著减少令牌数量的关键技术,已成为加速MLLM推理的重要手段。现有方法主要关注令牌的重要性、多样性或两者的直观结合,但缺乏一个原则性框架来实现它们的最佳整合。为了解决这一问题,我们首先进行系统分析,以表征令牌重要性与语义多样性之间的权衡。在此分析的指导下,我们提出了 extbf{I}mportance和 extbf{D}iversity Pruner( extbf{IDPruner}),该方法利用最大边际相关性(Maximal Marginal Relevance, MMR)算法在这两个目标之间实现帕累托最优平衡。值得注意的是,我们的方法在不需要注意力图的情况下运行,确保与FlashAttention的完全兼容,并通过一次性剪枝实现高效部署。我们在多种模型架构和多模态基准上进行了广泛实验,证明IDPruner在不同架构和任务中实现了最先进的性能和优越的泛化能力。特别是在Qwen2.5-VL-7B-Instruct上,IDPruner在剪枝75\%的令牌时保留了95.18\\%的基线性能,即使在极端的90\\%剪枝比例下仍保持86.40\\%的性能。我们的代码可在https://github.com/Tencent/AngelSlim获取。
cs.CV / 20 / 2602.13322

Diagnostic Benchmarks for Invariant Learning Dynamics: Empirical Validation of the Eidos Architecture

不变学习动态的诊断基准:Eidos架构的实证验证
Anderson, Datorien L.
Abstract
We present the PolyShapes-Ideal (PSI) dataset, a suite of diagnostic benchmarks designed to isolate topological invariance -- the ability to maintain structural identity across affine transformations -- from the textural correlations that dominate standard vision benchmarks. Through three diagnostic probes (polygon classification under noise, zero-shot font transfer from MNIST, and geometric collapse mapping under progressive deformation), we demonstrate that the Eidos architecture achieves >99% accuracy on PSI and 81.67% zero-shot transfer across 30 unseen typefaces without pre-training. These results validate the "Form-First" hypothesis: generalization in structurally constrained architectures is a property of geometric integrity, not statistical scale.
Chinese Translation
我们提出了PolyShapes-Ideal (PSI) 数据集,这是一个旨在将拓扑不变性——在仿射变换中保持结构身份的能力——与主流视觉基准中占主导地位的纹理相关性隔离的诊断基准套件。通过三个诊断探针(在噪声下的多边形分类、从MNIST的零样本字体迁移以及在渐进变形下的几何崩溃映射),我们展示了Eidos架构在PSI上达到了超过99%的准确率,并在30种未见字体上实现了81.67%的零样本迁移,无需预训练。这些结果验证了“形式优先”假设:在结构约束架构中的泛化是几何完整性的属性,而非统计规模的结果。
cs.CV / 21 / 2602.13324

Synthesizing the Kill Chain: A Zero-Shot Framework for Target Verification and Tactical Reasoning on the Edge

综合杀伤链:一种用于边缘目标验证和战术推理的零样本框架
Barkley, Jesse, George, Abraham, Farimani, Amir Barati
Abstract
Deploying autonomous edge robotics in dynamic military environments is constrained by both scarce domain-specific training data and the computational limits of edge hardware. This paper introduces a hierarchical, zero-shot framework that cascades lightweight object detection with compact Vision-Language Models (VLMs) from the Qwen and Gemma families (4B-12B parameters). Grounding DINO serves as a high-recall, text-promptable region proposer, and frames with high detection confidence are passed to edge-class VLMs for semantic verification. We evaluate this pipeline on 55 high-fidelity synthetic videos from Battlefield 6 across three tasks: false-positive filtering (up to 100% accuracy), damage assessment (up to 97.5%), and fine-grained vehicle classification (55-90%). We further extend the pipeline into an agentic Scout-Commander workflow, achieving 100% correct asset deployment and a 9.8/10 reasoning score (graded by GPT-4o) with sub-75-second latency. A novel "Controlled Input" methodology decouples perception from reasoning, revealing distinct failure phenotypes: Gemma3-12B excels at tactical logic but fails in visual perception, while Gemma3-4B exhibits reasoning collapse even with accurate inputs. These findings validate hierarchical zero-shot architectures for edge autonomy and provide a diagnostic framework for certifying VLM suitability in safety-critical applications.
Chinese Translation
在动态军事环境中部署自主边缘机器人受到稀缺领域特定训练数据和边缘硬件计算限制的制约。本文提出了一种层次化的零样本框架,该框架将轻量级目标检测与来自Qwen和Gemma系列(4B-12B参数)的紧凑视觉-语言模型(VLMs)级联。Grounding DINO作为一种高召回率、可文本提示的区域提议器,具有高检测置信度的框架被传递给边缘级VLMs进行语义验证。我们在《战场6》的55个高保真合成视频上评估了该流程,涵盖三个任务:假阳性过滤(准确率高达100%)、损伤评估(高达97.5%)和细粒度车辆分类(55-90%)。我们进一步将该流程扩展为一个自主侦察-指挥工作流,实现了100%的正确资产部署和9.8/10的推理评分(由GPT-4o评定),延迟低于75秒。一种新颖的“受控输入”方法将感知与推理解耦,揭示了不同的失败表型:Gemma3-12B在战术逻辑方面表现出色,但在视觉感知方面失败,而Gemma3-4B即使在准确输入下也出现推理崩溃。这些发现验证了边缘自主的层次化零样本架构,并提供了一个诊断框架,用于认证VLM在安全关键应用中的适用性。
cs.CV / 22 / 2602.13326

MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation

MotionWeaver:面向多类人形图像动画的整体4D锚定框架
Hu, Xirui, Ding, Yanbo, Wang, Jiahao, Shi, Tingting, Wang, Yali, Zhi, Guo Zhi, Zhang, Weizhan
Abstract
Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate these ideas in MotionWeaver, an end-to-end framework for multi-humanoid image animation. To support this setting, we curate a 46-hour dataset of multi-human videos with rich interactions, and construct a 300-video benchmark featuring paired humanoid characters. Quantitative and qualitative experiments demonstrate that MotionWeaver not only achieves state-of-the-art results on our benchmark but also generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-humanoid scenarios.
Chinese Translation
角色图像动画通过姿态序列合成参考角色的视频,虽然发展迅速,但仍然主要局限于单人设置。现有方法在多类人形场景中难以推广,这些场景涉及多样的人形形式、复杂的交互和频繁的遮挡。我们通过两个关键创新来填补这一空白。首先,我们引入统一的运动表征,提取与身份无关的运动,并明确将其绑定到相应角色上,从而实现对多样人形形式的推广,并无缝扩展到多类人形场景。其次,我们提出一个整体的4D锚定范式,构建一个共享的4D空间,将运动表征与视频潜变量融合,并通过分层的4D级别监督进一步强化这一过程,以更好地处理交互和遮挡。我们在MotionWeaver中实例化这些理念,这是一个用于多类人形图像动画的端到端框架。为了支持这一设置,我们整理了一个包含丰富交互的46小时多人人视频数据集,并构建了一个包含配对人形角色的300个视频基准。定量和定性实验表明,MotionWeaver不仅在我们的基准上实现了最先进的结果,而且在多样的人形形式、复杂的交互和具有挑战性的多类人形场景中也能有效推广。
cs.CV / 23 / 2602.13329

HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving

HiST-VLA:一种用于端到端自主驾驶的层次化时空视觉-语言-动作模型
Wang, Yiru, Gu, Zichong, Gao, Yu, Jiang, Anqing, Sun, Zhigang, Wang, Shuo, Heng, Yuwen, Sun, Hao
Abstract
Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their utilization in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them, effectively reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical transformer-based planner to progressively refine coarse VLA waypoints into fine-grained trajectories. Crucially, the planner utilizes dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence. Extensive evaluation on the NAVSIM v2 benchmark demonstrates state-of-the-art performance on Navtest, achieving an EPDMS of 88.6, and EPDMS of 50.9 on pseudo closed-loop Navhard benchmark.
Chinese Translation
视觉-语言-动作(VLA)模型通过多模态理解为自主驾驶提供了有前景的能力。然而,它们在安全关键场景中的应用受到固有限制的制约,包括不精确的数值推理、较弱的三维空间意识和对上下文的高度敏感性。为了解决这些挑战,我们提出了HiST-VLA,一种新颖的层次化时空VLA模型,旨在实现可靠的轨迹生成。我们的框架通过将几何意识与细粒度驾驶指令和状态历史提示相结合,增强了三维空间和时间推理。为了确保计算效率,我们将动态令牌稀疏化集成到VLA架构中。这种方法融合了冗余令牌,而不是过滤它们,有效地减少了冗余而不牺牲模型性能。此外,我们采用基于层次化变换器的规划器,逐步将粗略的VLA航点细化为细粒度轨迹。重要的是,规划器利用动态潜在正则化来结合语言指令,确保严格的空间基础和时间一致性。在NAVSIM v2基准上的广泛评估表明,在Navtest上实现了最先进的性能,EPDMS达到了88.6,而在伪闭环Navhard基准上达到了50.9的EPDMS。
cs.CV / 24 / 2602.13330

Zwitscherkasten -- DIY Audiovisual bird monitoring

Zwitscherkasten -- 自制音视频鸟类监测系统
Blum, Dominik, Häring, Elias, Jirges, Fabian, Schäffer, Martin, Schick, David, Schulenberg, Florian, Schön, Torsten
Abstract
This paper presents Zwitscherkasten, a DiY, multimodal system for bird species monitoring using audio and visual data on edge devices. Deep learning models for bioacoustic and image-based classification are deployed on resource-constrained hardware, enabling real-time, non-invasive monitoring. An acoustic activity detector reduces energy consumption, while visual recognition is performed using fine-grained detection and classification pipelines. Results show that accurate bird species identification is feasible on embedded platforms, supporting scalable biodiversity monitoring and citizen science applications.
Chinese Translation
本文介绍了Zwitscherkasten,一个基于音频和视觉数据的自制多模态鸟类监测系统,适用于边缘设备。该系统在资源受限的硬件上部署了用于生物声学和图像分类的深度学习模型,实现了实时、非侵入式监测。声学活动检测器降低了能耗,而视觉识别则通过细粒度检测和分类流程进行。结果表明,在嵌入式平台上实现准确的鸟类物种识别是可行的,这为可扩展的生物多样性监测和公民科学应用提供了支持。
cs.CV / 25 / 2602.13332

MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling

MedScope:通过粗到细的工具调用激励临床推理中的“视频思考”
Li, Wenjie, Zhang, Yujie, Sun, Haoran, He, Xingqi, Gao, Hongcheng, Ma, Chenglong, Hu, Ming, Wang, Guankun, Yao, Shiyi, Yang, Renhao, Ren, Hongliang, Wang, Lei, He, Junjun, Jiang, Yankai
Abstract
Long-form clinical videos are central to visual evidence-based decision-making, with growing importance for applications such as surgical robotics and related settings. However, current multimodal large language models typically process videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate, verify, and justify predictions with temporally targeted evidence. To close this gap, we propose MedScope, a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. By interleaving intermediate reasoning with targeted tool calls and verification on retrieved observations, MedScope produces more accurate and trustworthy predictions that are explicitly grounded in temporally localized visual evidence. To address the lack of high-fidelity supervision, we build ClinVideoSuite, an evidence-centric, fine-grained clinical video suite. We then optimize MedScope with Grounding-Aware Group Relative Policy Optimization (GA-GRPO), which directly reinforces tool use with grounding-aligned rewards and evidence-weighted advantages. On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. Our approach illuminates a path toward medical AI agents that can genuinely "think with videos" through tool-integrated reasoning. We will release our code, models, and data.
Chinese Translation
长格式临床视频在基于视觉的证据决策中占据核心地位,尤其在外科机器人和相关应用中日益重要。然而,目前的多模态大型语言模型通常以被动采样或弱基础检查的方式处理视频,这限制了它们迭代定位、验证和用时间上有针对性的证据来证明预测的能力。为了解决这一问题,我们提出了MedScope,这是一种工具使用的临床视频推理模型,能够在长格式程序中执行粗到细的证据寻求。通过将中间推理与针对性工具调用和对检索观察结果的验证交替进行,MedScope能够产生更准确、更可信的预测,这些预测明确基于时间上局部的视觉证据。为了解决高保真监督的缺乏,我们构建了ClinVideoSuite,这是一个以证据为中心的细粒度临床视频套件。随后,我们使用基于定位感知的群体相对策略优化(Grounding-Aware Group Relative Policy Optimization, GA-GRPO)来优化MedScope,该方法通过与定位对齐的奖励和基于证据的优势直接强化工具的使用。在完整和细粒度视频理解基准测试中,MedScope在领域内和领域外评估中均实现了最先进的性能。我们的方法为能够真正“用视频思考”的医疗AI代理指明了方向。我们将发布我们的代码、模型和数据。
cs.CV / 26 / 2602.13334

Ask the Expert: Collaborative Inference for Vision Transformers with Near-Edge Accelerators

请教专家:基于近边缘加速器的视觉变换器协同推理
Liu, Hao, Fahmy, Suhaib A.
Abstract
Deploying Vision Transformers on edge devices is challenging due to their high computational complexity, while full offloading to cloud resources presents significant latency overheads. We propose a novel collaborative inference framework, which orchestrates a lightweight generalist ViT on an edge device and multiple medium-sized expert ViTs on a near-edge accelerator. A novel routing mechanism uses the edge model's Top-$\mathit{k}$ predictions to dynamically select the most relevant expert for samples with low confidence. We further design a progressive specialist training strategy to enhance expert accuracy on dataset subsets. Extensive experiments on the CIFAR-100 dataset using a real-world edge and near-edge testbed demonstrate the superiority of our framework. Specifically, the proposed training strategy improves expert specialization accuracy by 4.12% on target subsets and enhances overall accuracy by 2.76% over static experts. Moreover, our method reduces latency by up to 45% compared to edge execution, and energy consumption by up to 46% compared to just near-edge offload.
Chinese Translation
在边缘设备上部署视觉变换器面临着高计算复杂度的挑战,而完全卸载到云资源则会带来显著的延迟开销。我们提出了一种新颖的协同推理框架,该框架在边缘设备上协调一个轻量级的通用视觉变换器(ViT)和多个中等规模的专家视觉变换器(ViTs)在近边缘加速器上运行。一个新颖的路由机制利用边缘模型的Top-$ extit{k}$预测动态选择与低置信度样本最相关的专家。我们进一步设计了一种渐进式专家训练策略,以提高专家在数据集子集上的准确性。在真实世界的边缘和近边缘测试平台上对CIFAR-100数据集进行的大量实验表明了我们框架的优越性。具体而言,所提出的训练策略在目标子集上提高了专家专业化准确性4.12%,并在静态专家上整体准确性提高了2.76%。此外,与边缘执行相比,我们的方法将延迟减少了多达45%,与仅进行近边缘卸载相比,能耗减少了多达46%。
cs.CV / 27 / 2602.13335

Meningioma Analysis and Diagnosis using Limited Labeled Samples

基于有限标记样本的脑膜瘤分析与诊断
Lu, Jiamiao, Wu, Wei, Gao, Ke, Mao, Ping, Zhang, Weichuan, Wang, Tuo, Ma, Lingkun, Guo, Jiapan, Wu, Zanyi, Hu, Yuqing, Sun, Changming
Abstract
The biological behavior and treatment response of meningiomas depend on their grade, making an accurate diagnosis essential for treatment planning and prognosis assessment. We observed that the weighted fusion of spatial-frequency domain features significantly influences meningioma classification performance. Notably, the contribution of specific frequency bands obtained by discrete wavelet transform varies considerably across different images. A feature fusion architecture with adaptive weights of different frequency band information and spatial domain information is proposed for few-shot meningioma learning. To verify the effectiveness of the proposed method, a new MRI dataset of meningiomas is introduced. The experimental results demonstrate the superiority of the proposed method compared with existing state-of-the-art methods in three datasets. The code will be available at: https://github.com/ICL-SUST/AMSF-Net
Chinese Translation
脑膜瘤的生物行为和治疗反应依赖于其分级,因此准确的诊断对于治疗规划和预后评估至关重要。我们观察到,空间频率域特征的加权融合显著影响脑膜瘤分类性能。值得注意的是,通过离散小波变换获得的特定频带的贡献在不同图像中差异显著。我们提出了一种特征融合架构,采用不同频带信息和空间域信息的自适应权重,以实现少样本脑膜瘤学习。为了验证所提方法的有效性,我们引入了一个新的脑膜瘤MRI数据集。实验结果表明,与现有的最先进方法相比,所提方法在三个数据集上表现出优越性。代码将发布于:https://github.com/ICL-SUST/AMSF-Net
cs.CV / 28 / 2602.13339

An Integrated Causal Inference Framework for Traffic Safety Modeling with Semantic Street-View Visual Features

基于语义街景视觉特征的交通安全建模综合因果推断框架
Sun, Lishan, Cheng, Yujia, Cui, Pengfei, Han, Lei, Abdel-Aty, Mohamed, Zheng, Yunhan, Zhang, Xingchen
Abstract
Macroscopic traffic safety modeling aims to identify critical risk factors for regional crashes, thereby informing targeted policy interventions for safety improvement. However, current approaches rely heavily on static sociodemographic and infrastructure metrics, frequently overlooking the impacts from drivers' visual perception of driving environment. Although visual environment features have been found to impact driving and traffic crashes, existing evidence remains largely observational, failing to establish the robust causality for traffic policy evaluation under complex spatial environment. To fill these gaps, we applied semantic segmentation on Google Street View imageries to extract visual environmental features and proposed a Double Machine Learning framework to quantify their causal effects on regional crashes. Meanwhile, we utilized SHAP values to characterize the nonlinear influence mechanisms of confounding variables in the models and applied causal forests to estimate conditional average treatment effects. Leveraging crash records from the Miami metropolitan area, Florida, and 220,000 street view images, evidence shows that greenery proportion exerts a significant and robust negative causal effect on traffic crashes (Average Treatment Effect = -6.38, p = 0.005). This protective effect exhibits spatial heterogeneity, being most pronounced in densely populated and socially vulnerable urban cores. While greenery significantly mitigates angle and rear-end crashes, its protective benefit for vulnerable road users (VRUs) remains limited. Our findings provide causal evidence for greening as a potential safety intervention, prioritizing hazardous visual environments while highlighting the need for distinct design optimizations to protect VRUs.
Chinese Translation
宏观交通安全建模旨在识别区域交通事故的关键风险因素,从而为安全改善提供针对性的政策干预。然而,目前的方法过于依赖静态的社会人口和基础设施指标,常常忽视驾驶者对驾驶环境的视觉感知所带来的影响。尽管已有研究发现视觉环境特征会影响驾驶行为和交通事故,但现有证据大多为观察性研究,未能在复杂空间环境下建立交通政策评估的稳健因果关系。为填补这些空白,我们对谷歌街景图像应用语义分割技术,以提取视觉环境特征,并提出了一种双重机器学习框架,以量化这些特征对区域交通事故的因果影响。同时,我们利用SHAP值来表征模型中混杂变量的非线性影响机制,并应用因果森林来估计条件平均处理效应。通过利用佛罗里达州迈阿密大都会区的事故记录和220,000张街景图像,证据表明,绿化比例对交通事故具有显著且稳健的负因果影响(平均处理效应 = -6.38, p = 0.005)。这种保护效应表现出空间异质性,在人口密集和社会脆弱的城市核心区域最为明显。虽然绿化显著减轻了角碰撞和追尾事故,但对脆弱道路用户(VRUs)的保护效益仍然有限。我们的研究结果为绿化作为潜在安全干预措施提供了因果证据,强调了优先考虑危险视觉环境的必要性,同时突显了保护VRUs所需的不同设计优化。
cs.CV / 29 / 2602.13344

FireRed-Image-Edit-1.0 Techinical Report

FireRed-Image-Edit-1.0 技术报告
Super Intelligence Team, Qiao, Changhao, Hui, Chao, Li, Chen, Wang, Cunzheng, Song, Dejia, Zhang, Jiale, Li, Jing, Xiang, Qiang, Wang, Runqi, Sun, Shuang, Zhu, Wei, Tang, Xu, Hu, Yao, Chen, Yibo, Huang, Yuhao, Duan, Yuxuan, Chen, Zhiyi, Guo, Ziyuan
Abstract
We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. We release code, models, and the benchmark suite to support future research.
Chinese Translation
我们提出了 FireRed-Image-Edit,这是一种基于指令的图像编辑扩散变换器,通过系统优化数据策划、训练方法和评估设计,实现了最先进的性能。我们构建了一个包含 16 亿样本的训练语料库,其中包括来自多种来源的 9 亿对文本到图像和 7 亿对图像编辑样本。在经过严格的清理、分层、自动标注和两阶段过滤后,我们保留了超过 1 亿个高质量样本,确保生成和编辑之间的平衡,具有强大的语义覆盖和指令对齐。我们的多阶段训练流程通过预训练、监督微调和强化学习逐步构建编辑能力。为了提高数据效率,我们引入了多条件感知桶采样器(Multi-Condition Aware Bucket Sampler)用于可变分辨率批处理,以及具有动态提示重新索引的随机指令对齐(Stochastic Instruction Alignment)。为了稳定优化并增强可控性,我们提出了用于 DPO 的非对称梯度优化(Asymmetric Gradient Optimization)、具有布局感知 OCR 奖励的 DiffusionNFT 用于文本编辑,以及用于身份保持的可微一致性损失(differentiable Consistency Loss)。此外,我们建立了 REDEdit-Bench,这是一个涵盖 15 个编辑类别的综合基准,包括新引入的美化和低级增强任务。在 REDEdit-Bench 和公共基准(ImgEdit 和 GEdit)上的广泛实验表明,我们的性能在开源和专有系统中具有竞争力或优越性。我们发布了代码、模型和基准套件,以支持未来的研究。
cs.CV / 30 / 2602.13347

Visual Foresight for Robotic Stow: A Diffusion-Based World Model from Sparse Snapshots

机器人存储的视觉预见:基于稀疏快照的扩散世界模型
Zhang, Lijun, Chacko, Nikhil, Nilsson, Petter, Xu, Ruinian, Thakar, Shantanu, Lou, Bai, Sawhney, Harpreet, Zhang, Zhebin, Agrawal, Mudit, Chandrashekhar, Bhavana, Parness, Aaron
Abstract
Automated warehouses execute millions of stow operations, where robots place objects into storage bins. For these systems it is valuable to anticipate how a bin will look from the current observations and the planned stow behavior before real execution. We propose FOREST, a stow-intent-conditioned world model that represents bin states as item-aligned instance masks and uses a latent diffusion transformer to predict the post-stow configuration from the observed context. Our evaluation shows that FOREST substantially improves the geometric agreement between predicted and true post-stow layouts compared with heuristic baselines. We further evaluate the predicted post-stow layouts in two downstream tasks, in which replacing the real post-stow masks with FOREST predictions causes only modest performance loss in load-quality assessment and multi-stow reasoning, indicating that our model can provide useful foresight signals for warehouse planning.
Chinese Translation
自动化仓库执行数百万次存储操作,机器人将物体放入存储箱中。对于这些系统,能够预测在当前观察和计划的存储行为下,存储箱的外观是非常有价值的。我们提出了FOREST,这是一种基于存储意图的世界模型,将存储箱状态表示为物品对齐的实例掩码,并使用潜在扩散变换器从观察到的上下文中预测存储后的配置。我们的评估显示,与启发式基线相比,FOREST显著提高了预测的存储后布局与真实存储后布局之间的几何一致性。我们进一步在两个下游任务中评估预测的存储后布局,其中用FOREST预测替换真实的存储后掩码仅导致负载质量评估和多存储推理中的适度性能损失,这表明我们的模型可以为仓库规划提供有用的预见信号。
cs.CV / 31 / 2602.13349

From Prompt to Production:Automating Brand-Safe Marketing Imagery with Text-to-Image Models

从提示到生产:利用文本到图像模型自动化品牌安全的营销图像
Atighehchian, Parmida, Wang, Henry, Kapustin, Andrei, Lerner, Boris, Jiang, Tiancheng, Jensen, Taylor, Sokhandan, Negin
Abstract
Text-to-image models have made significant strides, producing impressive results in generating images from textual descriptions. However, creating a scalable pipeline for deploying these models in production remains a challenge. Achieving the right balance between automation and human feedback is critical to maintain both scale and quality. While automation can handle large volumes, human oversight is still an essential component to ensure that the generated images meet the desired standards and are aligned with the creative vision. This paper presents a new pipeline that offers a fully automated, scalable solution for generating marketing images of commercial products using text-to-image models. The proposed system maintains the quality and fidelity of images, while also introducing sufficient creative variation to adhere to marketing guidelines. By streamlining this process, we ensure a seamless blend of efficiency and human oversight, achieving a $30.77\%$ increase in marketing object fidelity using DINOV2 and a $52.00\%$ increase in human preference over the generated outcome.
Chinese Translation
文本到图像模型在从文本描述生成图像方面取得了显著进展,产生了令人印象深刻的结果。然而,为这些模型在生产中部署创建一个可扩展的管道仍然是一个挑战。在自动化与人工反馈之间取得适当的平衡对于维持规模和质量至关重要。虽然自动化可以处理大量数据,但人工监督仍然是确保生成图像符合预期标准并与创意愿景一致的重要组成部分。本文提出了一种新的管道,提供了一种完全自动化、可扩展的解决方案,用于利用文本到图像模型生成商业产品的营销图像。所提出的系统在保持图像质量和真实感的同时,还引入了足够的创意变体,以遵循营销指南。通过简化这一过程,我们确保了效率与人工监督的无缝结合,使用DINOV2实现了$30.77\%$的营销对象真实感提升,以及$52.00\\%$的生成结果在人类偏好上的提升。
cs.CV / 32 / 2602.13350

Detecting Brick Kiln Infrastructure at Scale: Graph, Foundation, and Remote Sensing Models for Satellite Imagery Data

大规模检测砖窑基础设施:针对卫星影像数据的图模型、基础模型和遥感模型
Nazir, Usman, Chen, Xidong, Abubakar, Hafiz Muhammad, Bakar, Hadia Abu, Arbaz, Raahim, Rasool, Fezan, Chen, Bin, Khalid, Sara
Abstract
Brick kilns are a major source of air pollution and forced labor in South Asia, yet large-scale monitoring remains limited by sparse and outdated ground data. We study brick kiln detection at scale using high-resolution satellite imagery and curate a multi city zoom-20 (0.149 meters per pixel) resolution dataset comprising over 1.3 million image tiles across five regions in South and Central Asia. We propose ClimateGraph, a region-adaptive graph-based model that captures spatial and directional structure in kiln layouts, and evaluate it against established graph learning baselines. In parallel, we assess a remote sensing based detection pipeline and benchmark it against recent foundation models for satellite imagery. Our results highlight complementary strengths across graph, foundation, and remote sensing approaches, providing practical guidance for scalable brick kiln monitoring from satellite imagery.
Chinese Translation
砖窑是南亚空气污染和强迫劳动的主要来源,但大规模监测仍受到稀疏和过时的地面数据的限制。我们利用高分辨率卫星影像研究大规模砖窑检测,并整理出一个多城市的20倍缩放(每像素0.149米)分辨率数据集,涵盖南亚和中亚五个地区的超过130万个影像切片。我们提出了ClimateGraph,一个区域自适应的基于图的模型,能够捕捉砖窑布局中的空间和方向结构,并将其与已建立的图学习基准进行评估。同时,我们评估了一种基于遥感的检测流程,并将其与近期的卫星影像基础模型进行基准测试。我们的结果突显了图模型、基础模型和遥感方法之间的互补优势,为从卫星影像进行可扩展的砖窑监测提供了实用指导。
cs.CV / 33 / 2602.13352

Using Deep Learning to Generate Semantically Correct Hindi Captions

利用深度学习生成语义正确的印地语图像描述
Khan, Wasim Akram, Vuppala, Anil Kumar
Abstract
Automated image captioning using the content from the image is very appealing when done by harnessing the capability of computer vision and natural language processing. Extensive research has been done in the field with a major focus on the English language which gives the scope for further developments in the same with consideration of popular foreign languages. This research utilizes distinct models for translating the image caption into Hindi, the fourth most popular language across the world. Exploring the multi-modal architectures this research comprises local visual features, global visual features, attention mechanisms, and pre-trained models. Using google cloud translator on the image dataset from Flickr8k, Hindi image descriptions have been generated. Pre-trained CNNs like VGG16, ResNet50, and Inception V3 helped in retrieving image characteristics, while the uni-directional and bi-directional techniques of text encoding are used for the text encoding process. An additional Attention layer helps to generate a weight vector and, by multiplying it, combine image characteristics from each time step into a sentence-level feature vector. Bilingual evaluation understudy scores are used to compare the research outcome. Many experiments that serve as a baseline are done for the comparative analysis of the research. An image with a score of BLEU-1 is considered sufficient, whereas one with a score of BLEU-4 is considered to have fluid image captioning. For both BLEU scores, the attention-based bidirectional LSTM with VGG16 produced the best results of 0.59 and 0.19 respectively. The experiments conclude that researchs ability to produce relevant, semantically accurate image captions in Hindi. The research accomplishes the goals and future research can be guided by this research model.
Chinese Translation
利用计算机视觉和自然语言处理的能力进行自动化图像描述非常吸引人。该领域的广泛研究主要集中在英语上,这为考虑流行外语的进一步发展提供了空间。本研究利用不同的模型将图像描述翻译成印地语,这是一种在全球范围内第四受欢迎的语言。通过探索多模态架构,本研究包括局部视觉特征、全局视觉特征、注意机制和预训练模型。使用谷歌云翻译对来自Flickr8k的图像数据集生成印地语图像描述。预训练的卷积神经网络(CNN)如VGG16、ResNet50和Inception V3有助于提取图像特征,同时在文本编码过程中使用单向和双向的文本编码技术。额外的注意层有助于生成权重向量,并通过乘法将每个时间步的图像特征组合成句子级特征向量。使用双语评估指标(BLEU)分数来比较研究结果。进行了许多实验作为基线,以进行研究的比较分析。得分为BLEU-1的图像被认为是足够的,而得分为BLEU-4的图像则被认为具有流畅的图像描述。对于这两个BLEU分数,基于注意机制的双向LSTM与VGG16结合产生了最佳结果,分别为0.59和0.19。实验得出结论,研究能够生成相关且语义准确的印地语图像描述。该研究达成了目标,未来的研究可以以此研究模型为指导。
cs.CV / 34 / 2602.13357

AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers

AdaCorrection:用于准确扩散变换器的自适应偏移缓存校正
Liu, Dong, Yu, Yanxuan, Lengerich, Ben, Wu, Ying Nian
Abstract
Diffusion Transformers (DiTs) achieve state-of-the-art performance in high-fidelity image and video generation but suffer from expensive inference due to their iterative denoising structure. While prior methods accelerate sampling by caching intermediate features, they rely on static reuse schedules or coarse-grained heuristics, which often lead to temporal drift and cache misalignment that significantly degrade generation quality. We introduce \textbf{AdaCorrection}, an adaptive offset cache correction framework that maintains high generation fidelity while enabling efficient cache reuse across Transformer layers during diffusion inference. At each timestep, AdaCorrection estimates cache validity with lightweight spatio-temporal signals and adaptively blends cached and fresh activations. This correction is computed on-the-fly without additional supervision or retraining. Our approach achieves strong generation quality with minimal computational overhead, maintaining near-original FID while providing moderate acceleration. Experiments on image and video diffusion benchmarks show that AdaCorrection consistently improves generation performance.
Chinese Translation
扩散变换器(Diffusion Transformers, DiTs)在高保真图像和视频生成方面达到了最先进的性能,但由于其迭代去噪结构,推理成本昂贵。虽然先前的方法通过缓存中间特征来加速采样,但它们依赖于静态重用计划或粗粒度启发式,这往往导致时间漂移和缓存不对齐,从而显著降低生成质量。我们提出了 extbf{AdaCorrection},一种自适应偏移缓存校正框架,在扩散推理过程中保持高生成保真度,同时实现跨变换器层的高效缓存重用。在每个时间步,AdaCorrection通过轻量级时空信号估计缓存有效性,并自适应地混合缓存的和新鲜的激活。该校正是在不需要额外监督或重新训练的情况下即时计算的。我们的方法在保持接近原始FID的同时,以最小的计算开销实现了强大的生成质量,并提供了适度的加速。在图像和视频扩散基准测试中的实验表明,AdaCorrection始终提高生成性能。
cs.CV / 35 / 2602.13361

The Diffusion Duet: Harmonizing Dual Channels with Wavelet Suppression for Image Separation

扩散二重奏:通过小波抑制协调双通道以实现图像分离
Li, Jingwei, Pu, Wei
Abstract
Blind image separation (BIS) refers to the inverse problem of simultaneously estimating and restoring multiple independent source images from a single observation image under conditions of unknown mixing mode and without prior knowledge of the source images. Traditional methods relying on statistical independence assumptions or CNN/GAN variants struggle to characterize complex feature distributions in real scenes, leading to estimation bias, texture distortion, and artifact residue under strong noise and nonlinear mixing. This paper innovatively introduces diffusion models into dual-channel BIS, proposing an efficient Dual-Channel Diffusion Separation Model (DCDSM). DCDSM leverages diffusion models' powerful generative capability to learn source image feature distributions and reconstruct feature structures effectively. A novel Wavelet Suppression Module (WSM) is designed within the dual-branch reverse denoising process, forming an interactive separation network that enhances detail separation by exploiting the mutual coupling noise characteristic between source images. Extensive experiments on synthetic datasets containing rain/snow and complex mixtures demonstrate that DCDSM achieves state-of-the-art performance: 1) In image restoration tasks, it obtains PSNR/SSIM values of 35.0023 dB/0.9549 and 29.8108 dB/0.9243 for rain and snow removal respectively, outperforming Histoformer and LDRCNet by 1.2570 dB/0.9272 dB (PSNR) and 0.0262/0.0289 (SSIM) on average; 2) For complex mixture separation, the restored dual-source images achieve average PSNR and SSIM of 25.0049 dB and 0.7997, surpassing comparative methods by 4.1249 dB and 0.0926. Both subjective and objective evaluations confirm DCDSM's superiority in addressing rain/snow residue removal and detail preservation challenges.
Chinese Translation
盲图像分离(BIS)是指在未知混合模式和没有源图像先验知识的条件下,同时估计和恢复多个独立源图像的逆问题。传统方法依赖于统计独立性假设或CNN/GAN变体,难以表征真实场景中的复杂特征分布,导致在强噪声和非线性混合下出现估计偏差、纹理失真和伪影残留。本文创新性地将扩散模型引入双通道BIS,提出了一种高效的双通道扩散分离模型(DCDSM)。DCDSM利用扩散模型强大的生成能力,有效学习源图像特征分布并重构特征结构。在双分支反向去噪过程中设计了一种新颖的小波抑制模块(WSM),形成一个交互式分离网络,通过利用源图像之间的相互耦合噪声特性来增强细节分离。在包含雨雪和复杂混合的合成数据集上进行的广泛实验表明,DCDSM实现了最先进的性能:1)在图像恢复任务中,雨雪去除分别获得35.0023 dB/0.9549和29.8108 dB/0.9243的PSNR/SSIM值,平均超越Histoformer和LDRCNet 1.2570 dB/0.9272 dB(PSNR)和0.0262/0.0289(SSIM);2)对于复杂混合分离,恢复的双源图像平均PSNR和SSIM分别达到25.0049 dB和0.7997,超越对比方法4.1249 dB和0.0926。主观和客观评估均确认DCDSM在解决雨雪残留去除和细节保留挑战方面的优越性。
cs.CV / 36 / 2602.13376

An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation

一种在线无参考评估框架用于流程图图像到代码生成
Nguyen, Giang Son, Lim, Zi Pong, Modi, Sarthak Ketanbhai, Teo, Yon Shin, Wang, Wenya
Abstract
Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: $\text{Recall}{\text{OCR}}$, which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and $\text{Precision}{\text{VE}}$, which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, $\text{F1}{\text{OCR-VE}}$, provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson's $r = 0.97$, $0.91$, and $0.94$ for Recall, Precision, and F1, respectively), confirming the framework's reliability as a practical, reference-free alternative for continuous quality monitoring in production settings.
Chinese Translation
视觉语言模型(VLMs)在文档处理流程中越来越多地被用于将流程图图像转换为结构化代码(例如,Mermaid)。在实际应用中,这些系统处理任意输入,而这些输入没有真实的代码作为参考,这使得输出质量的评估变得困难。我们提出了一种无参考评估框架,该框架在推理时监控流程图图像到代码生成的质量,仅使用输入图像和生成的输出。该框架引入了两个自动化指标:$ ext{Recall}{ ext{OCR}}$,通过光学字符识别(OCR)从输入图像中提取文本作为代理参考,估计内容覆盖率;以及$ ext{Precision}{ ext{VE}}$,通过与原始图像的视觉蕴涵(Visual Entailment)检测虚构元素。它们的调和平均数,$ ext{F1}{ ext{OCR-VE}}$,提供了一个统一的质量评分。在FlowVQA数据集上的验证显示,与真实指标(Recall、Precision和F1的平均Pearson相关系数分别为$0.97$、$0.91$和$0.94$)有很强的一致性,确认了该框架作为一种实用的无参考替代方案在生产环境中进行持续质量监控的可靠性。
cs.CV / 37 / 2602.13378

LAF-YOLOv10 with Partial Convolution Backbone, Attention-Guided Feature Pyramid, Auxiliary P2 Head, and Wise-IoU Loss for Small Object Detection in Drone Aerial Imagery

基于部分卷积骨干网、注意力引导特征金字塔、辅助 P2 头和 Wise-IoU 损失的 LAF-YOLOv10 用于无人机航拍图像中的小物体检测
Farooqui, Sohail Ali, Taha, Zuhair Ahmed Khan, Uddin, Mohammed Mudassir, Alam, Shahnawaz
Abstract
Unmanned aerial vehicles serve as primary sensing platforms for surveillance, traffic monitoring, and disaster response, making aerial object detection a central problem in applied computer vision. Current detectors struggle with UAV-specific challenges: targets spanning only a few pixels, cluttered backgrounds, heavy occlusion, and strict onboard computational budgets. This study introduces LAF-YOLOv10, built on YOLOv10n, integrating four complementary techniques to improve small-object detection in drone imagery. A Partial Convolution C2f (PC-C2f) module restricts spatial convolution to one quarter of backbone channels, reducing redundant computation while preserving discriminative capacity. An Attention-Guided Feature Pyramid Network (AG-FPN) inserts Squeeze-and-Excitation channel gates before multi-scale fusion and replaces nearest-neighbor upsampling with DySample for content-aware interpolation. An auxiliary P2 detection head at 160$\times$160 resolution extends localization to objects below 8$\times$8 pixels, while the P5 head is removed to redistribute parameters. Wise-IoU v3 replaces CIoU for bounding box regression, attenuating gradients from noisy annotations in crowded aerial scenes. The four modules address non-overlapping bottlenecks: PC-C2f compresses backbone computation, AG-FPN refines cross-scale fusion, the P2 head recovers spatial resolution, and Wise-IoU stabilizes regression under label noise. No individual component is novel; the contribution is the joint integration within a single YOLOv10 framework. Across three training runs (seeds 42, 123, 256), LAF-YOLOv10 achieves 35.1$\pm$0.3\% [email protected] on VisDrone-DET2019 with 2.3\,M parameters, exceeding YOLOv10n by 3.3 points. Cross-dataset evaluation on UAVDT yields 35.8$\pm$0.4\% [email protected]. Benchmarks on NVIDIA Jetson Orin Nano confirm 24.3 FPS at FP16, demonstrating viability for embedded UAV deployment.
Chinese Translation
无人驾驶飞行器作为监视、交通监测和灾害响应的主要感知平台,使得航拍物体检测成为应用计算机视觉中的核心问题。目前的检测器在无人机特有的挑战中表现不佳:目标仅占几个像素、背景杂乱、严重遮挡以及严格的机载计算预算。本研究提出了 LAF-YOLOv10,基于 YOLOv10n,整合了四种互补技术以改善无人机图像中的小物体检测。部分卷积 C2f (PC-C2f) 模块将空间卷积限制在骨干通道的四分之一,减少冗余计算,同时保留辨别能力。注意力引导特征金字塔网络 (AG-FPN) 在多尺度融合之前插入了挤压和激励通道门,并用 DySample 替代最近邻上采样,以实现内容感知插值。160×160 分辨率的辅助 P2 检测头将定位扩展到小于 8×8 像素的物体,同时移除了 P5 头以重新分配参数。Wise-IoU v3 替代 CIoU 用于边界框回归,减弱了拥挤航拍场景中噪声标注带来的梯度影响。这四个模块解决了不重叠的瓶颈:PC-C2f 压缩了骨干计算,AG-FPN 精炼了跨尺度融合,P2 头恢复了空间分辨率,而 Wise-IoU 在标签噪声下稳定了回归。没有单个组件是新颖的;贡献在于在单一 YOLOv10 框架内的联合整合。在三次训练运行(种子 42、123、256)中,LAF-YOLOv10 在 VisDrone-DET2019 上达到了 35.1±0.3% [email protected],参数量为 2.3M,比 YOLOv10n 提高了 3.3 个点。在 UAVDT 上的跨数据集评估得到了 35.8±0.4% [email protected]。在 NVIDIA Jetson Orin Nano 上的基准测试确认了 24.3 FPS 的 FP16 性能,证明了其在嵌入式无人机部署中的可行性。
cs.CV / 38 / 2602.13430

Handling Supervision Scarcity in Chest X-ray Classification: Long-Tailed and Zero-Shot Learning

处理胸部X光分类中的监督稀缺性:长尾学习与零样本学习
Pham, Ha-Hieu, Nguyen, Hai-Dang, Nguyen, Thanh-Huy, Xu, Min, Bagci, Ulas, Le, Trung-Nghia, Pham, Huy-Hieu
Abstract
Chest X-ray (CXR) classification in clinical practice is often limited by imperfect supervision, arising from (i) extreme long-tailed multi-label disease distributions and (ii) missing annotations for rare or previously unseen findings. The CXR-LT 2026 challenge addresses these issues on a PadChest-based benchmark with a 36-class label space split into 30 in-distribution classes for training and 6 out-of-distribution (OOD) classes for zero-shot evaluation. We present task-specific solutions tailored to the distinct supervision regimes. For Task 1 (long-tailed multi-label classification), we adopt an imbalance-aware multi-label learning strategy to improve recognition of tail classes while maintaining stable performance on frequent findings. For Task 2 (zero-shot OOD recognition), we propose a prediction approach that produces scores for unseen disease categories without using any supervised labels or examples from the OOD classes during training. Evaluated with macro-averaged mean Average Precision (mAP), our method achieves strong performance on both tasks, ranking first on the public leaderboard of the development phase. Code and pre-trained models are available at https://github.com/hieuphamha19/CXR_LT.
Chinese Translation
胸部X光(CXR)分类在临床实践中常常受到不完善监督的限制,这种限制源于(i)极端长尾的多标签疾病分布和(ii)对稀有或之前未见发现的缺失注释。CXR-LT 2026挑战在基于PadChest的基准上解决了这些问题,该基准具有36类标签空间,分为30个训练中的类和6个用于零样本评估的分布外(OOD)类。我们提出了针对不同监督机制量身定制的任务特定解决方案。对于任务1(长尾多标签分类),我们采用了一种关注不平衡的多标签学习策略,以提高对尾部类的识别,同时保持对常见发现的稳定性能。对于任务2(零样本OOD识别),我们提出了一种预测方法,该方法在训练期间不使用任何监督标签或来自OOD类的示例,即可为未见疾病类别生成分数。通过宏平均平均精度(mAP)进行评估,我们的方法在两个任务上均表现出色,在开发阶段的公共排行榜上排名第一。代码和预训练模型可在 https://github.com/hieuphamha19/CXR_LT 获取。
cs.CV / 39 / 2602.13440

Learning on the Fly: Replay-Based Continual Object Perception for Indoor Drones

即时学习:基于重放的室内无人机持续物体感知
Nae, Sebastian-Ion, Barbu, Mihai-Eugen, Mocanu, Sebastian, Leordeanu, Marius
Abstract
Autonomous agents such as indoor drones must learn new object classes in real-time while limiting catastrophic forgetting, motivating Class-Incremental Learning (CIL). However, most unmanned aerial vehicle (UAV) datasets focus on outdoor scenes and offer limited temporally coherent indoor videos. We introduce an indoor dataset of $14,400$ frames capturing inter-drone and ground vehicle footage, annotated via a semi-automatic workflow with a $98.6\%$ first-pass labeling agreement before final manual verification. Using this dataset, we benchmark 3 replay-based CIL strategies: Experience Replay (ER), Maximally Interfered Retrieval (MIR), and Forgetting-Aware Replay (FAR), using YOLOv11-nano as a resource-efficient detector for deployment-constrained UAV platforms. Under tight memory budgets ($5-10\%$ replay), FAR performs better than the rest, achieving an average accuracy (ACC, $mAP_{50-95}$ across increments) of $82.96\%$ with $5\%$ replay. Gradient-weighted class activation mapping (Grad-CAM) analysis shows attention shifts across classes in mixed scenes, which is associated with reduced localization quality for drones. The experiments further demonstrate that replay-based continual learning can be effectively applied to edge aerial systems. Overall, this work contributes an indoor UAV video dataset with preserved temporal coherence and an evaluation of replay-based CIL under limited replay budgets. Project page: https://spacetime-vision-robotics-laboratory.github.io/learning-on-the-fly-cl
Chinese Translation
自主代理(如室内无人机)必须实时学习新的物体类别,同时限制灾难性遗忘,这促使了类别增量学习(Class-Incremental Learning, CIL)的研究。然而,大多数无人机(UAV)数据集集中于户外场景,提供的时间上连贯的室内视频有限。我们引入了一个室内数据集,包含$14,400$帧,捕捉了无人机与地面车辆之间的画面,通过半自动工作流程进行标注,首次标注一致性达到了$98.6\%$,并在最终手动验证之前进行了确认。利用该数据集,我们基准测试了三种基于重放的CIL策略:经验重放(Experience Replay, ER)、最大干扰检索(Maximally Interfered Retrieval, MIR)和遗忘感知重放(Forgetting-Aware Replay, FAR),使用YOLOv11-nano作为资源高效的检测器,适用于受限于部署的无人机平台。在严格的内存预算下($5-10\\%$重放),FAR的表现优于其他方法,达到了$82.96\\%$的平均准确率(ACC,$mAP_{50-95}$跨增量)。梯度加权类激活映射(Gradient-weighted Class Activation Mapping, Grad-CAM)分析显示,在混合场景中类之间的注意力转移,这与无人机的定位质量下降相关。实验进一步表明,基于重放的持续学习可以有效应用于边缘空中系统。总体而言,本研究贡献了一个保持时间连贯性的室内无人机视频数据集,并在有限重放预算下评估了基于重放的CIL。项目页面:https://spacetime-vision-robotics-laboratory.github.io/learning-on-the-fly-cl
cs.CV / 40 / 2602.13479

GLIMPSE : Real-Time Text Recognition and Contextual Understanding for VQA in Wearables

GLIMPSE:可穿戴设备中的实时文本识别与上下文理解用于视觉问答
Ramachandran, Akhil, Arun, Ankit, Shenoy, Ashish, Harpale, Abhay, Jayakumar, Srihari, Chatterjee, Debojeet, Moslehpour, Mohsen, Chuang, Pierce, Lu, Yichao, Bhardwaj, Vikas, Heidari, Peyman
Abstract
Video Large Language Models (Video LLMs) have shown remarkable progress in understanding and reasoning about visual content, particularly in tasks involving text recognition and text-based visual question answering (Text VQA). However, deploying Text VQA on wearable devices faces a fundamental tension: text recognition requires high-resolution video, but streaming high-quality video drains battery and causes thermal throttling. Moreover, existing models struggle to maintain coherent temporal context when processing text across multiple frames in real-time streams. We observe that text recognition and visual reasoning have asymmetric resolution requirements - OCR needs fine detail while scene understanding tolerates coarse features. We exploit this asymmetry with a hybrid architecture that performs selective high-resolution OCR on-device while streaming low-resolution video for visual context. On a benchmark of text-based VQA samples across five task categories, our system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming, enabling sustained VQA sessions on resource-constrained wearables without sacrificing text understanding quality.
Chinese Translation
视频大型语言模型(Video LLMs)在理解和推理视觉内容方面取得了显著进展,特别是在涉及文本识别和基于文本的视觉问答(Text VQA)任务中。然而,在可穿戴设备上部署文本视觉问答面临着根本性的矛盾:文本识别需要高分辨率视频,但流式传输高质量视频会消耗大量电池并导致热限制。此外,现有模型在实时流中处理跨多个帧的文本时难以保持一致的时间上下文。我们观察到文本识别和视觉推理在分辨率需求上存在不对称性——光学字符识别(OCR)需要细致的细节,而场景理解则可以容忍粗略特征。我们利用这种不对称性,采用一种混合架构,在设备上进行选择性的高分辨率OCR,同时流式传输低分辨率视频以提供视觉上下文。在五个任务类别的基于文本的视觉问答样本基准测试中,我们的系统在仅为全分辨率流式传输的0.49倍功耗下实现了72%的准确率,使得在资源受限的可穿戴设备上能够持续进行视觉问答会话,而不牺牲文本理解质量。
cs.CV / 41 / 2602.13507

Benchmarking Video Foundation Models for Remote Parkinson's Disease Screening

远程帕金森病筛查的视频基础模型基准测试
Islam, Md Saiful, Hossain, Ekram, Abdelkader, Abdelrahman, Adnan, Tariq, Mashrur, Fazla Rabbi, Park, Sooyong, Kumar, Praveen, Sudais, Qasim, Chunga, Natalia, Shah, Nami, Freyberg, Jan, Kanan, Christopher, Schneider, Ruth, Hoque, Ehsan
Abstract
Remote, video-based assessments offer a scalable pathway for Parkinson's disease (PD) screening. While traditional approaches rely on handcrafted features mimicking clinical scales, recent advances in video foundation models (VFMs) enable representation learning without task-specific customization. However, the comparative effectiveness of different VFM architectures across diverse clinical tasks remains poorly understood. We present a large-scale systematic study using a novel video dataset from 1,888 participants (727 with PD), comprising 32,847 videos across 16 standardized clinical tasks. We evaluate seven state-of-the-art VFMs -- including VideoPrism, V-JEPA, ViViT, and VideoMAE -- to determine their robustness in clinical screening. By evaluating frozen embeddings with a linear classification head, we demonstrate that task saliency is highly model-dependent: VideoPrism excels in capturing visual speech kinematics (no audio) and facial expressivity, while V-JEPA proves superior for upper-limb motor tasks. Notably, TimeSformer remains highly competitive for rhythmic tasks like finger tapping. Our experiments yield AUCs of 76.4-85.3% and accuracies of 71.5-80.6%. While high specificity (up to 90.3%) suggests strong potential for ruling out healthy individuals, the lower sensitivity (43.2-57.3%) highlights the need for task-aware calibration and integration of multiple tasks and modalities. Overall, this work establishes a rigorous baseline for VFM-based PD screening and provides a roadmap for selecting suitable tasks and architectures in remote neurological monitoring. Code and anonymized structured data are publicly available: https://anonymous.4open.science/r/parkinson\_video\_benchmarking-A2C5
Chinese Translation
基于远程视频的评估为帕金森病(PD)筛查提供了一种可扩展的途径。尽管传统方法依赖于模仿临床量表的手工特征,最近在视频基础模型(VFM)方面的进展使得无需特定任务定制即可进行表征学习。然而,不同VFM架构在多样化临床任务中的比较有效性仍然不甚明了。我们展示了一项大规模的系统研究,使用来自1888名参与者(727名PD患者)的新颖视频数据集,涵盖了16个标准化临床任务的32,847个视频。我们评估了七种最先进的VFM,包括VideoPrism、V-JEPA、ViViT和VideoMAE,以确定它们在临床筛查中的稳健性。通过评估冻结的嵌入与线性分类头,我们证明了任务显著性高度依赖于模型:VideoPrism在捕捉视觉语音运动学(无音频)和面部表现力方面表现出色,而V-JEPA在上肢运动任务中表现更佳。值得注意的是,TimeSformer在指尖敲击等节奏任务中仍然具有很强的竞争力。我们的实验得到了76.4-85.3%的AUC和71.5-80.6%的准确率。尽管高特异性(高达90.3%)表明在排除健康个体方面具有强大的潜力,但较低的敏感性(43.2-57.3%)突显了任务感知校准和多任务及多模态集成的必要性。总体而言,这项工作为基于VFM的PD筛查建立了严格的基准,并为远程神经监测中选择合适的任务和架构提供了路线图。代码和匿名结构数据已公开可用: https://anonymous.4open.science/r/parkinson_video_benchmarking-A2C5
cs.CV / 42 / 2602.13515

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

SpargeAttention2:通过混合 Top-k+Top-p 掩码和蒸馏微调实现可训练的稀疏注意力
Zhang, Jintao, Jiang, Kai, Xiang, Chendong, Feng, Weiqi, Hu, Yuezhou, Xi, Haocheng, Chen, Jianfei, Zhu, Jun
Abstract
Many training-free sparse attention methods are effective for accelerating diffusion models. Recently, several works suggest that making sparse attention trainable can further increase sparsity while preserving generation quality. We study three key questions: (1) when do the two common masking rules, i.e., Top-k and Top-p, fail, and how can we avoid these failures? (2) why can trainable sparse attention reach higher sparsity than training-free methods? (3) what are the limitations of fine-tuning sparse attention using the diffusion loss, and how can we address them? Based on this analysis, we propose SpargeAttention2, a trainable sparse attention method that achieves high sparsity without degrading generation quality. SpargeAttention2 includes (i) a hybrid masking rule that combines Top-k and Top-p for more robust masking at high sparsity, (ii) an efficient trainable sparse attention implementation, and (iii) a distillation-inspired fine-tuning objective to better preserve generation quality during fine-tuning using sparse attention. Experiments on video diffusion models show that SpargeAttention2 reaches 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality, consistently outperforming prior sparse attention methods.
Chinese Translation
许多无训练的稀疏注意力方法在加速扩散模型方面效果显著。最近的几项研究表明,使稀疏注意力可训练可以进一步提高稀疏性,同时保持生成质量。我们研究了三个关键问题:(1)两种常见的掩码规则,即 Top-k 和 Top-p,何时失效,以及我们如何避免这些失效?(2)为什么可训练的稀疏注意力能够达到比无训练方法更高的稀疏性?(3)使用扩散损失微调稀疏注意力的局限性是什么,我们如何解决这些问题?基于这一分析,我们提出了 SpargeAttention2,一种可训练的稀疏注意力方法,能够在不降低生成质量的情况下实现高稀疏性。SpargeAttention2 包括:(i)一种混合掩码规则,结合了 Top-k 和 Top-p,以在高稀疏性下实现更稳健的掩码;(ii)一种高效的可训练稀疏注意力实现;(iii)一种受蒸馏启发的微调目标,以更好地在使用稀疏注意力的微调过程中保持生成质量。在视频扩散模型上的实验表明,SpargeAttention2 达到了 95% 的注意力稀疏性和 16.2 倍的注意力加速,同时保持生成质量,始终优于先前的稀疏注意力方法。
cs.CV / 43 / 2602.13549

Nighttime Autonomous Driving Scene Reconstruction with Physically-Based Gaussian Splatting

基于物理的高斯点云夜间自动驾驶场景重建
Kim, Tae-Kyeong, Chen, Xingxin, Wu, Guile, Huang, Chengjie, Bai, Dongfeng, Liu, Bingbing
Abstract
This paper focuses on scene reconstruction under nighttime conditions in autonomous driving simulation. Recent methods based on Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved photorealistic modeling in autonomous driving scene reconstruction, but they primarily focus on normal-light conditions. Low-light driving scenes are more challenging to model due to their complex lighting and appearance conditions, which often causes performance degradation of existing methods. To address this problem, this work presents a novel approach that integrates physically based rendering into 3DGS to enhance nighttime scene reconstruction for autonomous driving. Specifically, our approach integrates physically based rendering into composite scene Gaussian representations and jointly optimizes Bidirectional Reflectance Distribution Function (BRDF) based material properties. We explicitly model diffuse components through a global illumination module and specular components by anisotropic spherical Gaussians. As a result, our approach improves reconstruction quality for outdoor nighttime driving scenes, while maintaining real-time rendering. Extensive experiments across diverse nighttime scenarios on two real-world autonomous driving datasets, including nuScenes and Waymo, demonstrate that our approach outperforms the state-of-the-art methods both quantitatively and qualitatively.
Chinese Translation
本文聚焦于自动驾驶仿真中的夜间条件下的场景重建。基于神经辐射场(NeRFs)和三维高斯点云(3DGS)的方法在自动驾驶场景重建中实现了逼真的建模,但它们主要集中在正常光照条件下。低光照驾驶场景由于其复杂的照明和外观条件,更难以建模,这往往导致现有方法的性能下降。为了解决这个问题,本文提出了一种新颖的方法,将基于物理的渲染集成到3DGS中,以增强自动驾驶的夜间场景重建。具体而言,我们的方法将基于物理的渲染集成到复合场景高斯表示中,并联合优化基于双向反射分布函数(BRDF)的材料属性。我们通过全局照明模块显式建模漫反射成分,并通过各向异性球形高斯建模镜面成分。因此,我们的方法在保持实时渲染的同时,提高了户外夜间驾驶场景的重建质量。在两个真实世界的自动驾驶数据集(包括nuScenes和Waymo)上进行的广泛实验表明,我们的方法在定量和定性上均优于最先进的方法。
cs.CV / 44 / 2602.13555

Privacy-Concealing Cooperative Perception for BEV Scene Segmentation

隐私保护的合作感知用于鸟瞰视角场景分割
Wang, Song, Li, Lingling, Santos, Marcus, Wang, Guanghui
Abstract
Cooperative perception systems for autonomous driving aim to overcome the limited perception range of a single vehicle by communicating with adjacent agents to share sensing information. While this improves perception performance, these systems also face a significant privacy-leakage issue, as sensitive visual content can potentially be reconstructed from the shared data. In this paper, we propose a novel Privacy-Concealing Cooperation (PCC) framework for Bird's Eye View (BEV) semantic segmentation. Based on commonly shared BEV features, we design a hiding network to prevent an image reconstruction network from recovering the input images from the shared features. An adversarial learning mechanism is employed to train the network, where the hiding network works to conceal the visual clues in the BEV features while the reconstruction network attempts to uncover these clues. To maintain segmentation performance, the perception network is integrated with the hiding network and optimized end-to-end. The experimental results demonstrate that the proposed PCC framework effectively degrades the quality of the reconstructed images with minimal impact on segmentation performance, providing privacy protection for cooperating vehicles. The source code will be made publicly available upon publication.
Chinese Translation
自主驾驶的合作感知系统旨在通过与相邻代理进行通信以共享感知信息,从而克服单一车辆的感知范围限制。尽管这提高了感知性能,但这些系统也面临着显著的隐私泄露问题,因为敏感的视觉内容可能会从共享数据中被重建。本文提出了一种新颖的隐私保护合作(Privacy-Concealing Cooperation, PCC)框架,用于鸟瞰视角(Bird's Eye View, BEV)语义分割。基于共同共享的BEV特征,我们设计了一个隐藏网络,以防止图像重建网络从共享特征中恢复输入图像。采用对抗学习机制来训练网络,其中隐藏网络旨在隐蔽BEV特征中的视觉线索,而重建网络则试图揭示这些线索。为了保持分割性能,感知网络与隐藏网络集成并进行端到端优化。实验结果表明,所提出的PCC框架有效降低了重建图像的质量,同时对分割性能的影响最小,为合作车辆提供了隐私保护。源代码将在发表后公开。
cs.CV / 45 / 2602.13585

Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation

Diff-Aid:用于校正文本到图像生成的推理时自适应交互去噪
Li, Binglei, Yang, Mengping, Tan, Zhiyu, Zhang, Junping, Li, Hao
Abstract
Recent text-to-image (T2I) diffusion models have achieved remarkable advancement, yet faithfully following complex textual descriptions remains challenging due to insufficient interactions between textual and visual features. Prior approaches enhance such interactions via architectural design or handcrafted textual condition weighting, but lack flexibility and overlook the dynamic interactions across different blocks and denoising stages. To provide a more flexible and efficient solution to this problem, we propose Diff-Aid, a lightweight inference-time method that adaptively adjusts per-token text and image interactions across transformer blocks and denoising timesteps. Beyond improving generation quality, Diff-Aid yields interpretable modulation patterns that reveal how different blocks, timesteps, and textual tokens contribute to semantic alignment during denoising. As a plug-and-play module, Diff-Aid can be seamlessly integrated into downstream applications for further improvement, including style LoRAs, controllable generation, and zero-shot editing. Experiments on strong baselines (SD 3.5 and FLUX) demonstrate consistent improvements in prompt adherence, visual quality, and human preference across various metrics. Our code and models will be released.
Chinese Translation
近期的文本到图像(T2I)扩散模型取得了显著进展,但由于文本和视觉特征之间的交互不足,忠实地遵循复杂的文本描述仍然具有挑战性。以往的方法通过架构设计或手工调整文本条件权重来增强这种交互,但缺乏灵活性,并忽视了不同块和去噪阶段之间的动态交互。为了解决这个问题,我们提出了Diff-Aid,这是一种轻量级的推理时方法,能够在变换器块和去噪时间步之间自适应地调整每个标记的文本和图像交互。除了提高生成质量,Diff-Aid还产生可解释的调制模式,揭示不同块、时间步和文本标记在去噪过程中的语义对齐贡献。作为一个即插即用模块,Diff-Aid可以无缝集成到下游应用中以进一步提升,包括风格LoRA、可控生成和零-shot编辑。在强基线(SD 3.5和FLUX)上的实验表明,在各种指标上,提示遵循性、视觉质量和人类偏好均有一致的改善。我们的代码和模型将会发布。
cs.CV / 46 / 2602.13588

Two-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks

场景解析与几何视觉任务的双流交互联合学习
Tang, Guanfeng, Zhao, Hongbo, Long, Ziwei, Li, Jiayao, Xiao, Bohong, Ye, Wei, Wang, Hanli, Fan, Rui
Abstract
Inspired by the human visual system, which operates on two parallel yet interactive streams for contextual and spatial understanding, this article presents Two Interactive Streams (TwInS), a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. TwInS adopts a unified, general-purpose architecture in which multi-level contextual features from the scene parsing stream are infused into the geometric vision stream to guide its iterative refinement. In the reverse direction, decoded geometric features are projected into the contextual feature space for selective heterogeneous feature fusion via a novel cross-task adapter, which leverages rich cross-view geometric cues to enhance scene parsing. To eliminate the dependence on costly human-annotated correspondence ground truth, TwInS is further equipped with a tailored semi-supervised training strategy, which unleashes the potential of large-scale multi-view data and enables continuous self-evolution without requiring ground-truth correspondences. Extensive experiments conducted on three public datasets validate the effectiveness of TwInS's core components and demonstrate its superior performance over existing state-of-the-art approaches. The source code will be made publicly available upon publication.
Chinese Translation
受到人类视觉系统的启发,该系统在上下文和空间理解方面运作于两个平行且相互交互的流中,本文提出了双流交互系统(TwInS),一种新颖的生物启发联合学习框架,能够同时执行场景解析和几何视觉任务。TwInS采用统一的通用架构,其中来自场景解析流的多层次上下文特征被注入到几何视觉流中,以指导其迭代优化。反向地,解码的几何特征被投影到上下文特征空间,以通过一种新颖的跨任务适配器进行选择性异构特征融合,该适配器利用丰富的跨视角几何线索来增强场景解析。为了消除对昂贵的人类标注对应真实值的依赖,TwInS还配备了一种量身定制的半监督训练策略,释放了大规模多视角数据的潜力,并实现了无需真实值对应的持续自我进化。在三个公共数据集上进行的大量实验验证了TwInS核心组件的有效性,并展示了其优于现有最先进方法的卓越性能。源代码将在发表后公开。
cs.CV / 47 / 2602.13600

AdaVBoost: Mitigating Hallucinations in LVLMs via Token-Level Adaptive Visual Attention Boosting

AdaVBoost:通过令牌级自适应视觉注意力增强来减轻大型视觉语言模型中的幻觉
Zhang, Jiacheng, Liu, Feng, Du, Chao, Pang, Tianyu
Abstract
Visual attention boosting has emerged as a promising direction for mitigating hallucinations in Large Vision-Language Models (LVLMs), where existing methods primarily focus on where to boost by applying a predefined scaling to the attention of method-specific visual tokens during autoregressive generation. In this paper, we identify a fundamental trade-off in these methods: a predefined scaling factor can be too weak at some generation steps, leaving hallucinations unresolved, yet too strong at others, leading to new hallucinations. Motivated by this finding, we propose AdaVBoost, a token-level visual attention boosting framework that adaptively determines how much attention to boost at each generation step. Specifically, we introduce Visual Grounding Entropy (VGE) to estimate hallucination risk, which leverages visual grounding as a complementary signal to capture evidence mismatches beyond entropy. Guided by VGE, AdaVBoost applies stronger visual attention boosting to high-risk tokens and weaker boosting to low-risk tokens, enabling token-level adaptive intervention at each generation step. Extensive experiments show that AdaVBoost significantly outperforms baseline methods across multiple LVLMs and hallucination benchmarks.
Chinese Translation
视觉注意力增强已成为减轻大型视觉语言模型(LVLMs)中幻觉的一个有前景的方向,现有方法主要集中在通过在自回归生成过程中对特定视觉令牌的注意力应用预定义的缩放来决定增强的方向。本文识别出这些方法中的一个基本权衡:预定义的缩放因子在某些生成步骤中可能过弱,导致幻觉未得到解决,而在其他步骤中又可能过强,导致新的幻觉。基于这一发现,我们提出了AdaVBoost,一种令牌级视觉注意力增强框架,能够自适应地确定在每个生成步骤中增强多少注意力。具体而言,我们引入了视觉基础熵(Visual Grounding Entropy, VGE)来估计幻觉风险,该方法利用视觉基础作为补充信号,以捕捉超出熵的证据不匹配。在VGE的指导下,AdaVBoost对高风险令牌施加更强的视觉注意力增强,而对低风险令牌施加较弱的增强,从而在每个生成步骤实现令牌级自适应干预。大量实验表明,AdaVBoost在多个LVLMs和幻觉基准测试中显著优于基线方法。
cs.CV / 48 / 2602.13602

Towards Sparse Video Understanding and Reasoning

迈向稀疏视频理解与推理
Xu, Chenwei, Ye, Zhen, Wu, Shang, Li, Weijian, Wang, Zihan, Xia, Zhuofan, Lu, Lie, Maneriker, Pranav, Du, Fan, Li, Manling, Liu, Han
Abstract
We present \revise (\underline{Re}asoning with \underline{Vi}deo \underline{S}parsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, \revise selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a ``plug-and-play'' setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, \revise improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
Chinese Translation
我们提出了 evise( extit{Re}asoning with extit{Vi}deo extit{S}parsity),一种用于视频问答(VQA)的多轮智能体。与其均匀采样帧, evise 选择一小组信息丰富的帧,在多个回合中保持摘要作为状态,并在有信心时提前停止。它支持在“即插即用”设置下的专有视觉-语言模型(VLMs),并为开源模型提供强化微调的能力。为了进行微调,我们引入了 EAGER(证据调整增益以实现高效推理),这是一种无注释的奖励机制,包含三个部分:(1)置信增益:在添加新帧后,我们奖励正确选项与最强替代选项之间对数几率差距的增加;(2)摘要充分性:在回答时,我们仅使用最后提交的摘要重新提问并奖励成功;(3)正确且提前停止:在小的回合预算内正确回答会获得奖励。在多个 VQA 基准测试中, evise 提高了准确性,同时减少了帧数、回合数和提示令牌,展示了实用的稀疏视频推理。
cs.CV / 49 / 2602.13633

A generalizable foundation model for intraoperative understanding across surgical procedures

一种可推广的基础模型用于手术过程中的 intraoperative 理解
Park, Kanggil, Jeon, Yongjun, Lim, Soyoung, Park, Seonmin, Shin, Jongmin, Kim, Jung Yong, An, Sehyeon, Rhu, Jinsoo, Kim, Jongman, Choi, Gyu-Seong, Oh, Namkee, Jung, Kyu-Hwan
Abstract
In minimally invasive surgery, clinical decisions depend on real-time visual interpretation, yet intraoperative perception varies substantially across surgeons and procedures. This variability limits consistent assessment, training, and the development of reliable artificial intelligence systems, as most surgical AI models are designed for narrowly defined tasks and do not generalize across procedures or institutions. Here we introduce ZEN, a generalizable foundation model for intraoperative surgical video understanding trained on more than 4 million frames from over 21 procedures using a self-supervised multi-teacher distillation framework. We curated a large and diverse dataset and systematically evaluated multiple representation learning strategies within a unified benchmark. Across 20 downstream tasks and full fine-tuning, frozen-backbone, few-shot and zero-shot settings, ZEN consistently outperforms existing surgical foundation models and demonstrates robust cross-procedure generalization. These results suggest a step toward unified representations for surgical scene understanding and support future applications in intraoperative assistance and surgical training assessment.
Chinese Translation
在微创手术中,临床决策依赖于实时视觉解读,然而,手术过程中的感知在不同外科医生和手术类型之间存在显著差异。这种变异性限制了一致的评估、培训以及可靠人工智能系统的发展,因为大多数外科 AI 模型是为狭义任务设计的,无法在不同手术或机构之间进行推广。在此,我们介绍了 ZEN,一种可推广的基础模型,用于 intraoperative 外科视频理解,该模型基于来自 21 种手术的超过 400 万帧数据,采用自监督多教师蒸馏框架进行训练。我们策划了一个大型且多样化的数据集,并在统一基准内系统地评估了多种表征学习策略。在 20 个下游任务和全面微调、冻结骨干、少量样本和零样本设置中,ZEN 始终优于现有的外科基础模型,并展示了强大的跨手术推广能力。这些结果表明,朝着手术场景理解的统一表征迈出了一步,并支持未来在 intraoperative 辅助和外科培训评估中的应用。
cs.CV / 50 / 2602.13636

Layer-Guided UAV Tracking: Enhancing Efficiency and Occlusion Robustness

层引导无人机跟踪:提升效率与遮挡鲁棒性
Zhou, Yang, Ding, Derui, Sun, Ran, Sun, Ying, Zhang, Haohua
Abstract
Visual object tracking (VOT) plays a pivotal role in unmanned aerial vehicle (UAV) applications. Addressing the trade-off between accuracy and efficiency, especially under challenging conditions like unpredictable occlusion, remains a significant challenge. This paper introduces LGTrack, a unified UAV tracking framework that integrates dynamic layer selection, efficient feature enhancement, and robust representation learning for occlusions. By employing a novel lightweight Global-Grouped Coordinate Attention (GGCA) module, LGTrack captures long-range dependencies and global contexts, enhancing feature discriminability with minimal computational overhead. Additionally, a lightweight Similarity-Guided Layer Adaptation (SGLA) module replaces knowledge distillation, achieving an optimal balance between tracking precision and inference efficiency. Experiments on three datasets demonstrate LGTrack's state-of-the-art real-time speed (258.7 FPS on UAVDT) while maintaining competitive tracking accuracy (82.8\% precision). Code is available at https://github.com/XiaoMoc/LGTrack
Chinese Translation
视觉目标跟踪(VOT)在无人机(UAV)应用中发挥着关键作用。在准确性与效率之间的权衡,尤其是在不可预测的遮挡等挑战性条件下,仍然是一个重要的挑战。本文提出了LGTrack,一个统一的无人机跟踪框架,集成了动态层选择、高效特征增强和针对遮挡的鲁棒表示学习。通过采用一种新颖的轻量级全局分组坐标注意力(Global-Grouped Coordinate Attention, GGCA)模块,LGTrack能够捕捉长距离依赖关系和全局上下文,以最小的计算开销增强特征的可区分性。此外,轻量级相似性引导层适应(Similarity-Guided Layer Adaptation, SGLA)模块取代了知识蒸馏,实现了跟踪精度与推理效率之间的最佳平衡。在三个数据集上的实验表明,LGTrack在保持竞争性跟踪精度(82.8% 精度)的同时,达到了最先进的实时速度(在UAVDT上为258.7 FPS)。代码可在 https://github.com/XiaoMoc/LGTrack 获取。
cs.CV / 51 / 2602.13637

DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation

DCDM:用于保持一致性的视频生成的分治扩散模型
Zhao, Haoyu, Zhang, Yuang, Cheng, Junqi, Gu, Jiaxi, Lu, Zenghui, Shu, Peng, Wu, Zuxuan, Jiang, Yu-Gang
Abstract
Recent video generative models have demonstrated impressive visual fidelity, yet they often struggle with semantic, geometric, and identity consistency. In this paper, we propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM), to address three key challenges: (1) intra-clip world knowledge consistency, (2) inter-clip camera consistency, and (3) inter-shot element consistency. DCDM decomposes video consistency modeling under these scenarios into three dedicated components while sharing a unified video generation backbone. For intra-clip consistency, DCDM leverages a large language model to parse input prompts into structured semantic representations, which are subsequently translated into coherent video content by a diffusion transformer. For inter-clip camera consistency, we propose a temporal camera representation in the noise space that enables precise and stable camera motion control, along with a text-to-image initialization mechanism to further enhance controllability. For inter-shot consistency, DCDM adopts a holistic scene generation paradigm with windowed cross-attention and sparse inter-shot self-attention, ensuring long-range narrative coherence while maintaining computational efficiency. We validate our framework on the test set of the CVM Competition at AAAI'26, and the results demonstrate that the proposed strategies effectively address these challenges.
Chinese Translation
最近的视频生成模型展示了令人印象深刻的视觉保真度,但它们在语义、一致性、几何和身份一致性方面常常面临挑战。本文提出了一种系统级框架,称为分治扩散模型(DCDM),以解决三个关键挑战:(1)片段内部世界知识一致性,(2)片段间相机一致性,以及(3)镜头间元素一致性。DCDM将这些场景下的视频一致性建模分解为三个专用组件,同时共享一个统一的视频生成主干。对于片段内部一致性,DCDM利用大型语言模型将输入提示解析为结构化的语义表示,这些表示随后通过扩散变换器转化为连贯的视频内容。对于片段间相机一致性,我们提出了一种在噪声空间中的时间相机表示,能够实现精确和稳定的相机运动控制,并结合文本到图像的初始化机制进一步增强可控性。对于镜头间一致性,DCDM采用了一种整体场景生成范式,结合窗口交叉注意力和稀疏镜头间自注意力,确保长距离叙事的一致性,同时保持计算效率。我们在AAAI'26的CVM竞赛测试集上验证了我们的框架,结果表明所提出的策略有效地解决了这些挑战。
cs.CV / 52 / 2602.13650

KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination

KorMedMCQA-V:评估视觉-语言模型在韩国医学执照考试中的多模态基准
Choi, Byungjin, Bae, Seongsu, Kweon, Sunjun, Choi, Edward
Abstract
We introduce KorMedMCQA-V, a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark for evaluating vision-language models (VLMs). The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations (2012-2023), with about 30% containing multiple images requiring cross-image evidence integration. Images cover clinical modalities including X-ray, computed tomography (CT), electrocardiography (ECG), ultrasound, endoscopy, and other medical visuals. We benchmark over 50 VLMs across proprietary and open-source categories-spanning general-purpose, medical-specialized, and Korean-specialized families-under a unified zero-shot evaluation protocol. The best proprietary model (Gemini-3.0-Pro) achieves 96.9% accuracy, the best open-source model (Qwen3-VL-32B-Thinking) 83.7%, and the best Korean-specialized model (VARCO-VISION-2.0-14B) only 43.2%. We further find that reasoning-oriented model variants gain up to +20 percentage points over instruction-tuned counterparts, medical domain specialization yields inconsistent gains over strong general-purpose baselines, all models degrade on multi-image questions, and performance varies notably across imaging modalities. By complementing the text-only KorMedMCQA benchmark, KorMedMCQA-V forms a unified evaluation suite for Korean medical reasoning across text-only and multimodal conditions. The dataset is available via Hugging Face Datasets: https://huggingface.co/datasets/seongsubae/KorMedMCQA-V.
Chinese Translation
我们介绍了KorMedMCQA-V,这是一个针对视觉-语言模型(VLMs)的多模态多项选择题回答基准,采用韩国医学执照考试的风格。该数据集包含1,534个问题和2,043张与之相关的图像,均来自2012年至2023年的韩国医学执照考试,其中约30%的问题包含多个图像,需要进行跨图像证据整合。图像涵盖了临床模式,包括X光、计算机断层扫描(CT)、心电图(ECG)、超声、内窥镜检查及其他医学视觉资料。我们在统一的零-shot评估协议下,对50多种VLM进行了基准测试,涵盖了专有和开源类别,涉及通用、医学专业和韩国专业等多个家族。表现最佳的专有模型(Gemini-3.0-Pro)达到了96.9%的准确率,最佳开源模型(Qwen3-VL-32B-Thinking)为83.7%,而最佳韩国专业模型(VARCO-VISION-2.0-14B)仅为43.2%。我们进一步发现,面向推理的模型变体相比于经过指令调优的模型可提高多达20个百分点,医学领域专业化在强通用基线上的增益不一致,所有模型在多图像问题上的表现均有所下降,并且不同成像模式的性能差异显著。通过补充仅文本的KorMedMCQA基准,KorMedMCQA-V形成了一个统一的评估套件,用于在文本和多模态条件下进行韩国医学推理。该数据集可通过Hugging Face Datasets获取:https://huggingface.co/datasets/seongsubae/KorMedMCQA-V。
cs.CV / 53 / 2602.13658

Optimizing Point-of-Care Ultrasound Video Acquisition for Probabilistic Multi-Task Heart Failure Detection

优化临床床边超声视频采集以实现概率性多任务心力衰竭检测
Saadat, Armin, Hashemi, Nima, Khodabakhshian, Bahar, Tsang, Michael Y., Luong, Christina, Tsang, Teresa S. M., Abolmaesumi, Purang
Abstract
Purpose: Echocardiography with point-of-care ultrasound (POCUS) must support clinical decision-making under tight bedside time and operator-effort constraints. We introduce a personalized data acquisition strategy in which an RL agent, given a partially observed multi-view study, selects the next view to acquire or terminates acquisition to support heart-failure (HF) assessment. Upon termination, a diagnostic model jointly predicts aortic stenosis (AS) severity and left ventricular ejection fraction (LVEF), two key HF biomarkers, and outputs uncertainty, enabling an explicit trade-off between diagnostic performance and acquisition cost. Methods: We model POCUS as a sequential acquisition problem: at each step, a video selector (RL agent) chooses the next view to acquire or terminates acquisition. Upon termination, a shared multi-view transformer performs multi-task inference with two heads, ordinal AS classification, and LVEF regression, and outputs Gaussian predictive distributions yielding ordinal probabilities over AS classes and EF thresholds. These probabilities drive a reward that balances expected diagnostic benefit against acquisition cost, producing patient-specific acquisition pathways. Results: The dataset comprises 12,180 patient-level studies, split into training/validation/test sets (75/15/15). On the 1,820 test studies, our method matches full-study performance while using 32% fewer videos, achieving 77.2% mean balanced accuracy (bACC) across AS severity classification and LVEF estimation, demonstrating robust multi-task performance under acquisition budgets. Conclusion: Patient-tailored, cost-aware acquisition can streamline POCUS workflows while preserving decision quality, producing interpretable scan pathways suited to bedside use. The framework is extensible to additional cardiac endpoints and merits prospective evaluation for clinical integration.
Chinese Translation
目的:床边超声(POCUS)心脏超声必须在严格的床边时间和操作员努力约束下支持临床决策。我们提出了一种个性化的数据采集策略,其中一个强化学习(RL)代理在给定部分观察的多视角研究的情况下,选择下一个要采集的视角或终止采集,以支持心力衰竭(HF)评估。在终止时,诊断模型共同预测主动脉狭窄(AS)严重程度和左心室射血分数(LVEF),这两个是关键的HF生物标志物,并输出不确定性,从而实现诊断性能与采集成本之间的明确权衡。方法:我们将POCUS建模为一个顺序采集问题:在每一步,视频选择器(RL代理)选择下一个要采集的视角或终止采集。在终止时,一个共享的多视角变换器执行多任务推断,具有两个头,分别为有序AS分类和LVEF回归,并输出高斯预测分布,产生AS类别和EF阈值的有序概率。这些概率驱动一个奖励,平衡预期的诊断收益与采集成本,从而生成特定于患者的采集路径。结果:数据集包含12,180个患者级研究,分为训练/验证/测试集(75/15/15)。在1,820个测试研究中,我们的方法在使用32%更少视频的情况下匹配了完整研究的性能,在AS严重程度分类和LVEF估计中实现了77.2%的平均平衡准确率(bACC),展示了在采集预算下的强大多任务性能。结论:以患者为中心、关注成本的采集可以简化POCUS工作流程,同时保持决策质量,生成适合床边使用的可解释扫描路径。该框架可扩展至其他心脏终点,并值得进行前瞻性评估以实现临床整合。
cs.CV / 54 / 2602.13662

LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases

LeafNet:用于植物疾病基础视觉-语言理解的大规模数据集和综合基准
Quoc, Khang Nguyen, Dao, Phuong D., Quach, Luyl-Da
Abstract
Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application in domain-specific agricultural tasks, such as plant pathology, remains limited due to the lack of large-scale, comprehensive multimodal image--text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 leaf digital images spanning 97 disease classes, paired with metadata, generating 13,950 question-answer pairs spanning six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities. Our study shows performance varies markedly across tasks: binary healthy--diseased classification exceeds 90\% accuracy, while fine-grained pathogen and species identification remains below 65\%. Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.
Chinese Translation
基础模型和视觉-语言预训练显著推动了视觉-语言模型(VLMs)的发展,使得视觉和语言数据的多模态处理成为可能。然而,由于缺乏大规模、全面的多模态图像-文本数据集和基准,其在特定领域的农业任务(如植物病理学)中的应用仍然有限。为了解决这一问题,我们引入了LeafNet,一个全面的多模态数据集,以及LeafBench,一个视觉问答基准,旨在系统评估VLMs在理解植物疾病方面的能力。该数据集包含186,000张叶片数字图像,涵盖97个疾病类别,并配有元数据,生成了13,950个问题-答案对,涉及六个关键农业任务。这些问题评估植物病理学理解的各个方面,包括视觉症状识别、分类关系和诊断推理。在我们的LeafBench数据集上对12个最先进的VLM进行基准测试,揭示了它们在疾病理解能力上的显著差异。我们的研究表明,不同任务的表现差异显著:健康-疾病二分类的准确率超过90%,而细粒度病原体和物种识别的准确率仍低于65%。视觉模型与VLM的直接比较显示了多模态架构的关键优势:经过微调的VLM优于传统视觉模型,证实了整合语言表示显著提高了诊断精度。这些发现突显了当前VLM在植物病理学应用中的关键缺口,并强调了LeafBench作为方法论进步和可靠的AI辅助植物疾病诊断评估框架的必要性。代码可在 https://github.com/EnalisUs/LeafBench 获取。
cs.CV / 55 / 2602.13669

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

EchoTorrent:迈向快速、持续和流式的多模态视频生成
Meng, Rang, Wu, Weipeng, Yin, Yingjie, Li, Yuming, Ma, Chenguang
Abstract
Recent multi-modal video generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal degradation, such as spatial blurring, temporal drift, and lip desynchronization, which creates an unresolved efficiency-performance trade-off. To this end, we propose EchoTorrent, a novel schema with a fourfold design: (1) Multi-Teacher Training fine-tunes a pre-trained model on distinct preference domains to obtain specialized domain experts, which sequentially transfer domain-specific knowledge to a student model; (2) Adaptive CFG Calibration (ACC-DMD), which calibrates the audio CFG augmentation errors in DMD via a phased spatiotemporal schedule, eliminating redundant CFG computations and enabling single-pass inference per step; (3) Hybrid Long Tail Forcing, which enforces alignment exclusively on tail frames during long-horizon self-rollout training via a causal-bidirectional hybrid architecture, effectively mitigates spatiotemporal degradation in streaming mode while enhancing fidelity to reference frames; and (4) VAE Decoder Refiner through pixel-domain optimization of the VAE decoder to recover high-frequency details while circumventing latent-space ambiguities. Extensive experiments and analysis demonstrate that EchoTorrent achieves few-pass autoregressive generation with substantially extended temporal consistency, identity preservation, and audio-lip synchronization.
Chinese Translation
近期的多模态视频生成模型已实现高视觉质量,但其高延迟和有限的时间稳定性阻碍了实时部署。流式推理加剧了这些问题,导致显著的多模态退化,如空间模糊、时间漂移和唇部不同步,从而产生了未解决的效率与性能之间的权衡。为此,我们提出了EchoTorrent,这是一种具有四重设计的新方案:(1) 多教师训练(Multi-Teacher Training)对预训练模型进行微调,以获得在不同偏好领域的专业领域专家,逐步将领域特定知识转移给学生模型;(2) 自适应CFG校准(Adaptive CFG Calibration,ACC-DMD),通过分阶段时空调度校准DMD中的音频CFG增强误差,消除冗余的CFG计算,并实现每步的单次推理;(3) 混合长尾强制(Hybrid Long Tail Forcing),通过因果双向混合架构在长时间自回归训练中仅对尾帧施加对齐,有效减轻流式模式下的时空退化,同时增强对参考帧的保真度;(4) VAE解码器精炼(VAE Decoder Refiner),通过对VAE解码器进行像素域优化以恢复高频细节,同时避免潜在空间的模糊性。大量实验和分析表明,EchoTorrent实现了少量自回归生成,显著延长了时间一致性、身份保留和音频-唇部同步。
cs.CV / 56 / 2602.13681

An Ensemble Learning Approach towards Waste Segmentation in Cluttered Environment

一种集成学习方法在复杂环境中进行废物分拣
Jafar, Maimoona, Ali, Syed Imran, Saadat, Ahsan, Bilal, Muhammad, Khalid, Shah
Abstract
Environmental pollution is a critical global issue, with recycling emerging as one of the most viable solutions. This study focuses on waste segregation, a crucial step in recycling processes to obtain raw material. Recent advancements in computer vision have significantly contributed to waste classification and recognition. In waste segregation, segmentation masks are essential for robots to accurately localize and pick objects from conveyor belts. The complexity of real-world waste environments, characterized by deformed items without specific patterns and overlapping objects, further complicates waste segmentation tasks. This paper proposes an Ensemble Learning approach to improve segmentation accuracy by combining high performing segmentation models, U-Net and FPN, using a weighted average method. U-Net excels in capturing fine details and boundaries in segmentation tasks, while FPN effectively handles scale variation and context in complex environments, and their combined masks result in more precise predictions. The dataset used closely mimics real-life waste scenarios, and preprocessing techniques were applied to enhance feature learning for deep learning segmentation models. The ensemble model, referred to as EL-4, achieved an IoU value of 0.8306, an improvement over U-Net's 0.8065, and reduced Dice loss to 0.09019 from FPN's 0.1183. This study could contribute to the efficiency of waste sorting at Material Recovery Facility, facilitating better raw material acquisition for recycling with minimal human intervention and enhancing the overall throughput.
Chinese Translation
环境污染是一个全球性的重要问题,而回收利用则成为最可行的解决方案之一。本研究聚焦于废物分拣,这是回收过程中的关键步骤,旨在获取原材料。近年来,计算机视觉的进步显著推动了废物分类和识别的发展。在废物分拣中,分割掩膜对于机器人准确定位和从传送带上抓取物体至关重要。现实世界废物环境的复杂性,表现为变形物品没有特定模式以及物体重叠,进一步增加了废物分拣任务的难度。本文提出了一种集成学习方法,通过加权平均法结合高性能的分割模型U-Net和FPN,以提高分割准确性。U-Net在捕捉分割任务中的细节和边界方面表现出色,而FPN则有效处理复杂环境中的尺度变化和上下文信息,它们的组合掩膜产生了更精确的预测。所使用的数据集紧密模拟了现实生活中的废物场景,并应用了预处理技术以增强深度学习分割模型的特征学习。该集成模型称为EL-4,达到了0.8306的IoU值,相较于U-Net的0.8065有所提升,并将Dice损失从FPN的0.1183降低至0.09019。本研究有望提高物料回收设施的废物分拣效率,促进更好地获取原材料以进行回收,同时减少人工干预,提高整体处理能力。
cs.CV / 57 / 2602.13693

A WDLoRA-Based Multimodal Generative Framework for Clinically Guided Corneal Confocal Microscopy Image Synthesis in Diabetic Neuropathy

基于WDLoRA的多模态生成框架用于糖尿病神经病变的临床引导角膜共聚焦显微镜图像合成
Zhang, Xin, Han, Liangxiu, Shi, Yue, Zheng, Yalin, Alam, Uazman, Ferdousi, Maryam, Malik, Rayaz
Abstract
Corneal Confocal Microscopy (CCM) is a sensitive tool for assessing small-fiber damage in Diabetic Peripheral Neuropathy (DPN), yet the development of robust, automated deep learning-based diagnostic models is limited by scarce labelled data and fine-grained variability in corneal nerve morphology. Although Artificial Intelligence (AI)-driven foundation generative models excel at natural image synthesis, they often struggle in medical imaging due to limited domain-specific training, compromising the anatomical fidelity required for clinical analysis. To overcome these limitations, we propose a Weight-Decomposed Low-Rank Adaptation (WDLoRA)-based multimodal generative framework for clinically guided CCM image synthesis. WDLoRA is a parameter-efficient fine-tuning (PEFT) mechanism that decouples magnitude and directional weight updates, enabling foundation generative models to independently learn the orientation (nerve topology) and intensity (stromal contrast) required for medical realism. By jointly conditioning on nerve segmentation masks and disease-specific clinical prompts, the model synthesises anatomically coherent images across the DPN spectrum (Control, T1NoDPN, T1DPN). A comprehensive three-pillar evaluation demonstrates that the proposed framework achieves state-of-the-art visual fidelity (Fr\'echet Inception Distance (FID): 5.18) and structural integrity (Structural Similarity Index Measure (SSIM): 0.630), significantly outperforming GAN and standard diffusion baselines. Crucially, the synthetic images preserve gold-standard clinical biomarkers and are statistically equivalent to real patient data. When used to train automated diagnostic models, the synthetic dataset improves downstream diagnostic accuracy by 2.1% and segmentation performance by 2.2%, validating the framework's potential to alleviate data bottlenecks in medical AI.
Chinese Translation
角膜共聚焦显微镜(CCM)是一种敏感工具,用于评估糖尿病周围神经病(DPN)中的小纤维损伤,但基于深度学习的自动化诊断模型的开发受到标注数据稀缺和角膜神经形态细微变异的限制。尽管人工智能(AI)驱动的基础生成模型在自然图像合成方面表现出色,但由于领域特定训练的不足,它们在医学成像中常常面临挑战,影响了临床分析所需的解剖学保真度。为了解决这些限制,我们提出了一种基于权重分解低秩适应(WDLoRA)的多模态生成框架,用于临床引导的CCM图像合成。WDLoRA是一种参数高效的微调(PEFT)机制,它将幅度和方向权重更新解耦,使基础生成模型能够独立学习医学真实感所需的方向(神经拓扑)和强度(基质对比度)。通过共同条件化神经分割掩膜和特定疾病的临床提示,该模型合成了在DPN谱(对照组、T1无DPN、T1有DPN)中解剖上连贯的图像。全面的三大支柱评估表明,所提出的框架在视觉保真度(Fréchet Inception Distance (FID): 5.18)和结构完整性(结构相似性指数测量(SSIM):0.630)方面达到了最先进的水平,显著优于GAN和标准扩散基线。至关重要的是,合成图像保留了黄金标准的临床生物标志物,并在统计上与真实患者数据等效。当用于训练自动化诊断模型时,合成数据集提高了下游诊断准确率2.1%和分割性能2.2%,验证了该框架在缓解医学AI数据瓶颈方面的潜力。
cs.CV / 58 / 2602.13712

Fine-tuned Vision Language Model for Localization of Parasitic Eggs in Microscopic Images

微调视觉语言模型用于显微图像中寄生虫卵的定位
Sien, Chan Hao, Karim, Hezerul Abdul, AlDahoul, Nouar
Abstract
Soil-transmitted helminth (STH) infections continuously affect a large proportion of the global population, particularly in tropical and sub-tropical regions, where access to specialized diagnostic expertise is limited. Although manual microscopic diagnosis of parasitic eggs remains the diagnostic gold standard, the approach can be labour-intensive, time-consuming, and prone to human error. This paper aims to utilize a vision language model (VLM) such as Microsoft Florence that was fine-tuned to localize all parasitic eggs within microscopic images. The preliminary results show that our localization VLM performs comparatively better than the other object detection methods, such as EfficientDet, with an mIOU of 0.94. This finding demonstrates the potential of the proposed VLM to serve as a core component of an automated framework, offering a scalable engineering solution for intelligent parasitological diagnosis.
Chinese Translation
土壤传播的蠕虫(STH)感染持续影响着全球大量人口,尤其是在热带和亚热带地区,这些地区对专业诊断技术的获取有限。尽管手动显微镜下寄生虫卵的诊断仍然是诊断的金标准,但该方法往往劳动密集、耗时且容易出现人为错误。本文旨在利用微调后的视觉语言模型(VLM),如微软的Florence,来定位显微图像中的所有寄生虫卵。初步结果表明,我们的定位VLM在性能上优于其他目标检测方法,如EfficientDet,mIOU达到0.94。这一发现展示了所提VLM作为自动化框架核心组件的潜力,为智能寄生虫学诊断提供了可扩展的工程解决方案。
cs.CV / 59 / 2602.13726

RGA-Net: A Vision Enhancement Framework for Robotic Surgical Systems Using Reciprocal Attention Mechanisms

RGA-Net:一种基于互惠注意机制的机器人手术系统视觉增强框架
Li, Quanjun, Li, Weixuan, Xia, Han, Zhou, Junhua, Pun, Chi-Man, Chen, Xuhang
Abstract
Robotic surgical systems rely heavily on high-quality visual feedback for precise teleoperation; yet, surgical smoke from energy-based devices significantly degrades endoscopic video feeds, compromising the human-robot interface and surgical outcomes. This paper presents RGA-Net (Reciprocal Gating and Attention-fusion Network), a novel deep learning framework specifically designed for smoke removal in robotic surgery workflows. Our approach addresses the unique challenges of surgical smoke-including dense, non-homogeneous distribution and complex light scattering-through a hierarchical encoder-decoder architecture featuring two key innovations: (1) a Dual-Stream Hybrid Attention (DHA) module that combines shifted window attention with frequency-domain processing to capture both local surgical details and global illumination changes, and (2) an Axis-Decomposed Attention (ADA) module that efficiently processes multi-scale features through factorized attention mechanisms. These components are connected via reciprocal cross-gating blocks that enable bidirectional feature modulation between encoder and decoder pathways. Extensive experiments on the DesmokeData and LSD3K surgical datasets demonstrate that RGA-Net achieves superior performance in restoring visual clarity suitable for robotic surgery integration. Our method enhances the surgeon-robot interface by providing consistently clear visualization, laying a technical foundation for alleviating surgeons' cognitive burden, optimizing operation workflows, and reducing iatrogenic injury risks in minimally invasive procedures. These practical benefits could be further validated through future clinical trials involving surgeon usability assessments. The proposed framework represents a significant step toward more reliable and safer robotic surgical systems through computational vision enhancement.
Chinese Translation
机器人手术系统在精确遥操作中高度依赖高质量的视觉反馈;然而,来自能量设备的手术烟雾显著降低了内窥镜视频的质量,妨碍了人机界面和手术结果。本文提出了RGA-Net(互惠门控与注意融合网络),这是一种专门为机器人手术工作流程中的烟雾去除而设计的新型深度学习框架。我们的方法针对手术烟雾的独特挑战——包括密集的非均匀分布和复杂的光散射——采用了层次化的编码器-解码器架构,并引入了两个关键创新:(1)双流混合注意(Dual-Stream Hybrid Attention, DHA)模块,该模块结合了位移窗口注意与频域处理,以捕捉局部手术细节和全局光照变化;(2)轴分解注意(Axis-Decomposed Attention, ADA)模块,通过分解的注意机制高效处理多尺度特征。这些组件通过互惠交叉门控块连接,使得编码器和解码器路径之间实现双向特征调制。在DesmokeData和LSD3K手术数据集上的大量实验表明,RGA-Net在恢复适合机器人手术集成的视觉清晰度方面表现优越。我们的方法通过提供持续清晰的可视化来增强外科医生与机器人之间的界面,为减轻外科医生的认知负担、优化操作工作流程以及降低微创手术中的医源性损伤风险奠定了技术基础。这些实际好处可以通过未来涉及外科医生可用性评估的临床试验进一步验证。所提出的框架代表了朝着更可靠和安全的机器人手术系统迈出的重要一步,通过计算机视觉增强实现。
cs.CV / 60 / 2602.13728

Explore Intrinsic Geometry for Query-based Tiny and Oriented Object Detector with Momentum-based Bipartite Matching

探索基于查询的微小定向物体检测器的内在几何特征与动量双边匹配
Zhang, Junpeng, Yang, Zewei, Feng, Jie, Zheng, Yuhui, Shang, Ronghua, Zhang, Mengxuan
Abstract
Recent query-based detectors have achieved remarkable progress, yet their performance remains constrained when handling objects with arbitrary orientations, especially for tiny objects capturing limited texture information. This limitation primarily stems from the underutilization of intrinsic geometry during pixel-based feature decoding and the occurrence of inter-stage matching inconsistency caused by stage-wise bipartite matching. To tackle these challenges, we present IGOFormer, a novel query-based oriented object detector that explicitly integrates intrinsic geometry into feature decoding and enhances inter-stage matching stability. Specifically, we design an Intrinsic Geometry-aware Decoder, which enhances the object-related features conditioned on an object query by injecting complementary geometric embeddings extrapolated from their correlations to capture the geometric layout of the object, thereby offering a critical geometric insight into its orientation. Meanwhile, a Momentum-based Bipartite Matching scheme is developed to adaptively aggregate historical matching costs by formulating an exponential moving average with query-specific smoothing factors, effectively preventing conflicting supervisory signals arising from inter-stage matching inconsistency. Extensive experiments and ablation studies demonstrate the superiority of our IGOFormer for aerial oriented object detection, achieving an AP$_{50}$ score of 78.00\% on DOTA-V1.0 using Swin-T backbone under the single-scale setting. The code will be made publicly available.
Chinese Translation
近期的基于查询的检测器取得了显著进展,但在处理具有任意方向的物体时,其性能仍然受到限制,尤其是在捕捉有限纹理信息的微小物体上。这一限制主要源于在基于像素的特征解码过程中对内在几何特征的未充分利用,以及由于阶段性双边匹配导致的阶段间匹配不一致。为了解决这些挑战,我们提出了IGOFormer,一种新颖的基于查询的定向物体检测器,明确地将内在几何特征整合到特征解码中,并增强阶段间匹配的稳定性。具体而言,我们设计了一种内在几何感知解码器,该解码器通过注入从物体查询的相关性推导出的互补几何嵌入,增强与物体相关的特征,从而捕捉物体的几何布局,为其方向提供重要的几何洞察。同时,开发了一种基于动量的双边匹配方案,通过制定带有查询特定平滑因子的指数移动平均,自适应地聚合历史匹配成本,有效防止因阶段间匹配不一致而产生的冲突监督信号。大量实验和消融研究证明了我们IGOFormer在空中定向物体检测中的优越性,在单尺度设置下,使用Swin-T骨干网络在DOTA-V1.0上实现了78.00\%的AP$_{50}$得分。代码将公开发布。
cs.CV / 61 / 2602.13731

Generative Latent Representations of 3D Brain MRI for Multi-Task Downstream Analysis in Down Syndrome

用于唐氏综合症多任务下游分析的3D脑MRI生成潜在表示
Malé, Jordi, Fortea, Juan, Rozalem-Aranha, Mateus, Martínez-Abadías, Neus, Sevillano, Xavier
Abstract
Generative models have emerged as powerful tools in medical imaging, enabling tasks such as segmentation, anomaly detection, and high-quality synthetic data generation. These models typically rely on learning meaningful latent representations, which are particularly valuable given the high-dimensional nature of 3D medical images like brain magnetic resonance imaging (MRI) scans. Despite their potential, latent representations remain underexplored in terms of their structure, information content, and applicability to downstream clinical tasks. Investigating these representations is crucial for advancing the use of generative models in neuroimaging research and clinical decision-making. In this work, we develop multiple variational autoencoders (VAEs) to encode 3D brain MRI scans into compact latent space representations for generative and predictive applications. We systematically evaluate the effectiveness of the learned representations through three key analyses: (i) a quantitative and qualitative assessment of MRI reconstruction quality, (ii) a visualisation of the latent space structure using Principal Component Analysis, and (iii) downstream classification tasks on a proprietary dataset of euploid and Down syndrome individuals brain MRI scans. Our results demonstrate that the VAE successfully captures essential brain features while maintaining high reconstruction fidelity. The latent space exhibits clear clustering patterns, particularly in distinguishing individuals with Down syndrome from euploid controls.
Chinese Translation
生成模型已成为医学影像中的强大工具,能够实现分割、异常检测和高质量合成数据生成等任务。这些模型通常依赖于学习有意义的潜在表示,这在3D医学图像(如脑磁共振成像(MRI)扫描)的高维特性下尤为重要。尽管潜在表示具有很大的潜力,但在其结构、信息内容和对下游临床任务的适用性方面仍然未得到充分探索。研究这些表示对于推动生成模型在神经影像研究和临床决策中的应用至关重要。在本研究中,我们开发了多个变分自编码器(Variational Autoencoders, VAEs),将3D脑MRI扫描编码为紧凑的潜在空间表示,以用于生成和预测应用。我们通过三项关键分析系统地评估所学习表示的有效性:(i)MRI重建质量的定量和定性评估,(ii)使用主成分分析(Principal Component Analysis)可视化潜在空间结构,以及(iii)在一组包含正常和唐氏综合症个体的脑MRI扫描的专有数据集上的下游分类任务。我们的结果表明,VAE成功捕捉了重要的脑部特征,同时保持了高重建保真度。潜在空间表现出明显的聚类模式,尤其是在区分唐氏综合症个体与正常对照方面。
cs.CV / 62 / 2602.13751

T2MBench: A Benchmark for Out-of-Distribution Text-to-Motion Generation

T2MBench:一种用于分布外文本到运动生成的基准测试
Yang, Bin, Ou, Rong, Xu, Weisheng, Xiong, Jiaqi, Li, Xintao, Wang, Taowen, Zhu, Luyu, Jiang, Xu, Tan, Jing, Xu, Renjing
Abstract
Most existing evaluations of text-to-motion generation focus on in-distribution textual inputs and a limited set of evaluation criteria, which restricts their ability to systematically assess model generalization and motion generation capabilities under complex out-of-distribution (OOD) textual conditions. To address this limitation, we propose a benchmark specifically designed for OOD text-to-motion evaluation, which includes a comprehensive analysis of 14 representative baseline models and the two datasets derived from evaluation results. Specifically, we construct an OOD prompt dataset consisting of 1,025 textual descriptions. Based on this prompt dataset, we introduce a unified evaluation framework that integrates LLM-based Evaluation, Multi-factor Motion evaluation, and Fine-grained Accuracy Evaluation. Our experimental results reveal that while different baseline models demonstrate strengths in areas such as text-to-motion semantic alignment, motion generalizability, and physical quality, most models struggle to achieve strong performance with Fine-grained Accuracy Evaluation. These findings highlight the limitations of existing methods in OOD scenarios and offer practical guidance for the design and evaluation of future production-level text-to-motion models.
Chinese Translation
现有的大多数文本到运动生成评估主要集中在分布内的文本输入和有限的评估标准上,这限制了它们在复杂的分布外(OOD)文本条件下系统评估模型泛化能力和运动生成能力的能力。为了解决这一局限性,我们提出了一种专门为OOD文本到运动评估设计的基准测试,其中包括对14个代表性基线模型的全面分析以及基于评估结果衍生的两个数据集。具体而言,我们构建了一个包含1,025个文本描述的OOD提示数据集。基于该提示数据集,我们引入了一个统一的评估框架,整合了基于大型语言模型(LLM)的评估、多因素运动评估和细粒度准确性评估。我们的实验结果显示,尽管不同的基线模型在文本到运动语义对齐、运动泛化能力和物理质量等方面表现出优势,但大多数模型在细粒度准确性评估中难以取得良好的表现。这些发现突显了现有方法在OOD场景中的局限性,并为未来生产级文本到运动模型的设计和评估提供了实用指导。
cs.CV / 63 / 2602.13758

OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding

OmniScience:一个用于科学图像理解的大规模多模态数据集
Tao, Haoyi, Huang, Chaozheng, Wang, Nan, Lyu, Han, Zhang, Linfeng, Ke, Guolin, Fang, Xi
Abstract
Multimodal Large Language Models demonstrate strong performance on natural image understanding, yet exhibit limited capability in interpreting scientific images, including but not limited to schematic diagrams, experimental characterizations, and analytical charts. This limitation is particularly pronounced in open-source MLLMs. The gap largely stems from existing datasets with limited domain coverage, coarse structural annotations, and weak semantic grounding. We introduce OmniScience, a large-scale, high-fidelity multi-modal dataset comprising 1.5 million figure-caption-context triplets, spanning more than 10 major scientific disciplines. To obtain image caption data with higher information density and accuracy for multi-modal large-model training, we develop a dynamic model-routing re-captioning pipeline that leverages state-of-the-art multi-modal large language models to generate dense, self-contained descriptions by jointly synthesizing visual features, original figure captions, and corresponding in-text references authored by human scientists. The pipeline is further reinforced with rigorous quality filtering and alignment with human expert judgments, ensuring both factual accuracy and semantic completeness, and boosts the image-text multi-modal similarity score from 0.769 to 0.956. We further propose a caption QA protocol as a proxy task for evaluating visual understanding. Under this setting, Qwen2.5-VL-3B model finetuned on OmniScience show substantial gains over baselines, achieving a gain of 0.378 on MM-MT-Bench and a gain of 0.140 on MMMU.
Chinese Translation
多模态大型语言模型在自然图像理解方面表现出色,但在解释科学图像方面能力有限,包括但不限于示意图、实验特征和分析图表。这一局限性在开源的多模态大型语言模型中尤为明显。这一差距主要源于现有数据集的领域覆盖有限、结构注释粗糙以及语义基础薄弱。我们引入了OmniScience,一个大规模、高保真度的多模态数据集,包含150万个图像-标题-上下文三元组,涵盖10多个主要科学学科。为了获得更高信息密度和准确性的图像标题数据以用于多模态大型模型训练,我们开发了一种动态模型路由重新标注管道,该管道利用最先进的多模态大型语言模型,通过联合合成视觉特征、原始图像标题和人类科学家撰写的相应文本引用,生成密集且自包含的描述。该管道进一步通过严格的质量过滤和与人类专家判断的一致性进行强化,确保事实准确性和语义完整性,并将图像-文本多模态相似度评分从0.769提升至0.956。我们还提出了一种标题问答协议作为评估视觉理解的代理任务。在此设置下,基于OmniScience微调的Qwen2.5-VL-3B模型在基准测试上显示出显著提升,在MM-MT-Bench上获得0.378的增益,在MMMU上获得0.140的增益。
cs.CV / 64 / 2602.13760

SAM4Dcap: Training-free Biomechanical Twin System from Monocular Video

SAM4Dcap:无训练的单目视频生物力学双胞胎系统
Wang, Li, Wang, HaoYu, Chen, Xi, Jiang, ZeKun, Li, Kang, Li, Jian
Abstract
Quantitative biomechanical analysis is essential for clinical diagnosis and injury prevention but is often restricted to laboratories due to the high cost of optical motion capture systems. While multi-view video approaches have lowered barriers, they remain impractical for home-based scenarios requiring monocular capture. This paper presents SAM4Dcap, an open-source, end-to-end framework for estimating biomechanical metrics from monocular video without additional training. SAM4Dcap integrates the temporally consistent 4D human mesh recovery of SAM-Body4D with the OpenSim biomechanical solver. The pipeline converts reconstructed meshes into trajectory files compatible with diverse musculoskeletal models. We introduce automated prompting strategies and a Linux-native build for processing. Preliminary evaluations on walking and drop-jump tasks indicate that SAM4Dcap has the potential to achieve knee kinematic predictions comparable to multi-view systems, although some discrepancies in hip flexion and residual jitter remain. By bridging advanced computer vision with established biomechanical simulation, SAM4Dcap provides a flexible, accessible foundation for non-laboratory motion analysis.
Chinese Translation
定量生物力学分析对于临床诊断和伤害预防至关重要,但由于光学运动捕捉系统的高成本,通常局限于实验室环境。尽管多视角视频方法降低了使用门槛,但在需要单目捕捉的家庭场景中仍然不够实用。本文提出了SAM4Dcap,一个开源的端到端框架,用于从单目视频中估计生物力学指标,而无需额外的训练。SAM4Dcap将SAM-Body4D的时间一致的4D人类网格恢复与OpenSim生物力学求解器相结合。该流程将重建的网格转换为与多种肌肉骨骼模型兼容的轨迹文件。我们引入了自动提示策略和Linux原生构建以进行处理。在步态和下落跳跃任务上的初步评估表明,SAM4Dcap在膝关节运动学预测方面有潜力达到与多视角系统相当的水平,尽管在髋关节屈曲和残余抖动方面仍存在一些差异。通过将先进的计算机视觉与成熟的生物力学模拟相结合,SAM4Dcap为非实验室运动分析提供了灵活且可访问的基础。
cs.CV / 65 / 2602.13772

Offline-Poly: A Polyhedral Framework For Offline 3D Multi-Object Tracking

Offline-Poly:离线三维多目标跟踪的多面体框架
Li, Xiaoyu, Wu, Yitao, Wu, Xian, Zhuo, Haolin, Zhao, Lijun, Sun, Lining
Abstract
Offline 3D multi-object tracking (MOT) is a critical component of the 4D auto-labeling (4DAL) process. It enhances pseudo-labels generated by high-performance detectors through the incorporation of temporal context. However, existing offline 3D MOT approaches are direct extensions of online frameworks and fail to fully exploit the advantages of offline setting. Moreover, these methods often depend on fixed upstream and customized architectures, limiting their adaptability. To address these limitations, we propose Offline-Poly, a general offline 3D MOT method based on a tracking-centric design. We introduce a standardized paradigm termed Tracking-by-Tracking (TBT), which operates exclusively on arbitrary off-the-shelf tracking outputs and produces offline-refined tracklets. This formulation decouples offline tracker from specific upstream detectors or trackers. Under the TBT paradigm, Offline-Poly accepts one or multiple coarse tracking results and processes them through a structured pipeline comprising pre-processing, hierarchical matching and fusion, and tracklet refinement. Each module is designed to capitalize on the two fundamental properties of offline tracking: resource unconstrainedness, which permits global optimization beyond real-time limits, and future observability, which enables tracklet reasoning over the full temporal horizon. Offline-Poly first eliminates short-term ghost tracklets and re-identifies fragmented segments using global scene context. It then constructs scene-level similarity to associate tracklets across multiple input sources. Finally, Offline-Poly refines tracklets by jointly leveraging local and global motion patterns. On nuScenes, we achieve SOTA performance with 77.6% AMOTA. On KITTI, it achieves leading results with 83.00% HOTA. Comprehensive experiments further validate the flexibility, generalizability, and modular effectiveness of Offline-Poly.
Chinese Translation
离线三维多目标跟踪(MOT)是4D自动标注(4DAL)过程中的关键组成部分。它通过结合时间上下文来增强由高性能检测器生成的伪标签。然而,现有的离线三维MOT方法是在线框架的直接扩展,未能充分利用离线设置的优势。此外,这些方法通常依赖于固定的上游和定制架构,限制了它们的适应性。为了解决这些局限性,我们提出了Offline-Poly,一种基于跟踪中心设计的通用离线三维MOT方法。我们引入了一种称为“跟踪通过跟踪”(Tracking-by-Tracking, TBT)的标准化范式,该范式仅在任意现成的跟踪输出上操作,并生成离线精炼的轨迹段。该公式将离线跟踪器与特定的上游检测器或跟踪器解耦。在TBT范式下,Offline-Poly接受一个或多个粗略跟踪结果,并通过一个结构化的管道进行处理,该管道包括预处理、分层匹配与融合以及轨迹段精炼。每个模块都旨在利用离线跟踪的两个基本特性:资源不受限制,这允许超越实时限制的全局优化;以及未来可观测性,这使得在整个时间范围内进行轨迹段推理成为可能。Offline-Poly首先消除短期虚假轨迹段,并利用全局场景上下文重新识别碎片化的段落。然后,它构建场景级相似性,以关联来自多个输入源的轨迹段。最后,Offline-Poly通过共同利用局部和全局运动模式来精炼轨迹段。在nuScenes数据集上,我们实现了77.6%的SOTA性能。在KITTI数据集上,它以83.00%的HOTA取得领先结果。全面的实验进一步验证了Offline-Poly的灵活性、通用性和模块化有效性。
cs.CV / 66 / 2602.13778

Skeleton2Stage: Reward-Guided Fine-Tuning for Physically Plausible Dance Generation

Skeleton2Stage:基于奖励引导的物理可行舞蹈生成微调
Jia, Jidong, Zhang, Youjian, Fu, Huan, Tao, Dacheng
Abstract
Despite advances in dance generation, most methods are trained in the skeletal domain and ignore mesh-level physical constraints. As a result, motions that look plausible as joint trajectories often exhibit body self-penetration and Foot-Ground Contact (FGC) anomalies when visualized with a human body mesh, reducing the aesthetic appeal of generated dances and limiting their real-world applications. We address this skeleton-to-mesh gap by deriving physics-based rewards from the body mesh and applying Reinforcement Learning Fine-Tuning (RLFT) to steer the diffusion model toward physically plausible motion synthesis under mesh visualization. Our reward design combines (i) an imitation reward that measures a motion's general plausibility by its imitability in a physical simulator (penalizing penetration and foot skating), and (ii) a Foot-Ground Deviation (FGD) reward with test-time FGD guidance to better capture the dynamic foot-ground interaction in dance. However, we find that the physics-based rewards tend to push the model to generate freezing motions for fewer physical anomalies and better imitability. To mitigate it, we propose an anti-freezing reward to preserve motion dynamics while maintaining physical plausibility. Experiments on multiple dance datasets consistently demonstrate that our method can significantly improve the physical plausibility of generated motions, yielding more realistic and aesthetically pleasing dances. The project page is available at: https://jjd1123.github.io/Skeleton2Stage/
Chinese Translation
尽管舞蹈生成技术已有所进展,但大多数方法仍在骨骼域中训练,忽视了网格级别的物理约束。因此,当通过人类身体网格可视化时,看似合理的关节轨迹往往会出现身体自穿透和足-地面接触(Foot-Ground Contact, FGC)异常,从而降低生成舞蹈的美感,并限制其在现实世界中的应用。我们通过从身体网格中推导基于物理的奖励,解决了这一骨骼到网格的差距,并应用强化学习微调(Reinforcement Learning Fine-Tuning, RLFT)来引导扩散模型朝着在网格可视化下生成物理可行的运动合成。我们的奖励设计结合了(i)模仿奖励,该奖励通过在物理模拟器中的可模仿性来衡量运动的一般合理性(惩罚穿透和滑动),以及(ii)足-地面偏差(Foot-Ground Deviation, FGD)奖励,结合测试时的FGD指导,以更好地捕捉舞蹈中的动态足-地面交互。然而,我们发现基于物理的奖励往往会促使模型生成静止运动,以减少物理异常并提高可模仿性。为此,我们提出了一种反静止奖励,以在保持物理可行性的同时保留运动动态。对多个舞蹈数据集的实验一致表明,我们的方法可以显著提高生成运动的物理可行性,产生更真实和美观的舞蹈。项目页面可访问: https://jjd1123.github.io/Skeleton2Stage/
cs.CV / 67 / 2602.13780

Foundation Model-Driven Semantic Change Detection in Remote Sensing Imagery

基于基础模型的遥感影像语义变化检测
Shen, Hengtong, Yan, Li, Xie, Hong, Wei, Yaxuan, Li, Xinhao, Shen, Wenfei, Lv, Peixian, Tan, Fei
Abstract
Remote sensing (RS) change detection methods can extract critical information on surface dynamics and are an essential means for humans to understand changes in the earth's surface and environment. Among these methods, semantic change detection (SCD) can more effectively interpret the multi-class information contained in bi-temporal RS imagery, providing semantic-level predictions that support dynamic change monitoring. However, due to the limited semantic understanding capability of the model and the inherent complexity of the SCD tasks, existing SCD methods face significant challenges in both performance and paradigm complexity. In this paper, we propose PerASCD, a SCD method driven by RS foundation model PerA, designed to enhance the multi-scale semantic understanding and overall performance. We introduce a modular Cascaded Gated Decoder (CG-Decoder) that simplifies complex SCD decoding pipelines while promoting effective multi-level feature interaction and fusion. In addition, we propose a Soft Semantic Consistency Loss (SSCLoss) to mitigate the numerical instability commonly encountered during SCD training. We further explore the applicability of multiple existing RS foundation models on the SCD task when equipped with the proposed decoder. Experimental results demonstrate that our decoder not only effectively simplifies the paradigm of SCD, but also achieves seamless adaptation across various vision encoders. Our method achieves state-of-the-art (SOTA) performance on two public benchmark datasets, validating its effectiveness. The code is available at https://github.com/SathShen/PerASCD.git.
Chinese Translation
遥感(RS)变化检测方法能够提取地表动态的重要信息,是人类理解地球表面和环境变化的重要手段。在这些方法中,语义变化检测(SCD)能够更有效地解读双时相遥感影像中包含的多类信息,提供支持动态变化监测的语义级预测。然而,由于模型的语义理解能力有限以及SCD任务固有的复杂性,现有的SCD方法在性能和范式复杂性方面面临重大挑战。本文提出了PerASCD,一种由遥感基础模型PerA驱动的SCD方法,旨在增强多尺度语义理解和整体性能。我们引入了一种模块化的级联门控解码器(Cascaded Gated Decoder,CG-Decoder),简化复杂的SCD解码流程,同时促进有效的多层次特征交互和融合。此外,我们提出了一种软语义一致性损失(Soft Semantic Consistency Loss,SSCLoss),以减轻SCD训练过程中常见的数值不稳定性。我们进一步探讨了在配备所提解码器的情况下,多种现有遥感基础模型在SCD任务上的适用性。实验结果表明,我们的解码器不仅有效简化了SCD的范式,而且在各种视觉编码器之间实现了无缝适应。我们的方法在两个公共基准数据集上达到了最先进的(SOTA)性能,验证了其有效性。代码可在 https://github.com/SathShen/PerASCD.git 获取。
cs.CV / 68 / 2602.13801

Joint Orientation and Weight Optimization for Robust Watertight Surface Reconstruction via Dirichlet-Regularized Winding Fields

通过Dirichlet正则化缠绕场进行稳健的密闭表面重建的联合方向和权重优化
Li, Jiaze, Jin, Daisheng, Hou, Fei, Hou, Junhui, Liu, Zheng, Xin, Shiqing, Wang, Wenping, He, Ying
Abstract
We propose Dirichlet Winding Reconstruction (DiWR), a robust method for reconstructing watertight surfaces from unoriented point clouds with non-uniform sampling, noise, and outliers. Our method uses the generalized winding number (GWN) field as the target implicit representation and jointly optimizes point orientations, per-point area weights, and confidence coefficients in a single pipeline. The optimization minimizes the Dirichlet energy of the induced winding field together with additional GWN-based constraints, allowing DiWR to compensate for non-uniform sampling, reduce the impact of noise, and downweight outliers during reconstruction, with no reliance on separate preprocessing. We evaluate DiWR on point clouds from 3D Gaussian Splatting, a computer-vision pipeline, and corrupted graphics benchmarks. Experiments show that DiWR produces plausible watertight surfaces on these challenging inputs and outperforms both traditional multi-stage pipelines and recent joint orientation-reconstruction methods.
Chinese Translation
我们提出了Dirichlet缠绕重建(DiWR),这是一种从未定向点云中重建密闭表面的稳健方法,适用于非均匀采样、噪声和异常值。我们的方法使用广义缠绕数(GWN)场作为目标隐式表示,并在单一流程中联合优化点的方向、每点的面积权重和置信系数。该优化最小化诱导缠绕场的Dirichlet能量,并结合额外的基于GWN的约束,使得DiWR能够补偿非均匀采样,减少噪声的影响,并在重建过程中降低异常值的权重,而无需依赖单独的预处理。我们在来自3D高斯溅射的点云、计算机视觉管道和受损图形基准测试上评估了DiWR。实验表明,DiWR在这些具有挑战性的输入上生成了合理的密闭表面,并优于传统的多阶段管道和最近的联合方向-重建方法。
cs.CV / 69 / 2602.13806

Gaussian Sequences with Multi-Scale Dynamics for 4D Reconstruction from Monocular Casual Videos

具有多尺度动态的高斯序列用于从单目随意视频进行四维重建
Li, Can, Gu, Jie, Chen, Jingmin, Qiu, Fangzhou, Sun, Lei
Abstract
Understanding dynamic scenes from casual videos is critical for scalable robot learning, yet four-dimensional (4D) reconstruction under strictly monocular settings remains highly ill-posed. To address this challenge, our key insight is that real-world dynamics exhibits a multi-scale regularity from object to particle level. To this end, we design the multi-scale dynamics mechanism that factorizes complex motion fields. Within this formulation, we propose Gaussian sequences with multi-scale dynamics, a novel representation for dynamic 3D Gaussians derived through compositions of multi-level motion. This layered structure substantially alleviates ambiguity of reconstruction and promotes physically plausible dynamics. We further incorporate multi-modal priors from vision foundation models to establish complementary supervision, constraining the solution space and improving the reconstruction fidelity. Our approach enables accurate and globally consistent 4D reconstruction from monocular casual videos. Experiments of dynamic novel-view synthesis (NVS) on benchmark and real-world manipulation datasets demonstrate considerable improvements over existing methods.
Chinese Translation
从随意视频中理解动态场景对于可扩展的机器人学习至关重要,但在严格的单目设置下,四维(4D)重建仍然高度不适定。为了解决这一挑战,我们的关键见解是,现实世界的动态在物体到粒子级别上表现出多尺度的规律性。为此,我们设计了多尺度动态机制,以分解复杂的运动场。在这一框架中,我们提出了具有多尺度动态的高斯序列,这是一种通过多层次运动组合而得出的动态3D高斯的新颖表示。这种分层结构显著减轻了重建的模糊性,并促进了物理上合理的动态。我们进一步结合来自视觉基础模型的多模态先验,以建立互补的监督,限制解空间并提高重建的保真度。我们的方法能够从单目随意视频中实现准确且全局一致的4D重建。在基准和真实世界操作数据集上的动态新视图合成(NVS)实验表明,相较于现有方法有显著改进。
cs.CV / 70 / 2602.13818

VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer

VAR-3D:一种视图感知自回归模型用于通过3D标记器进行文本到3D生成
Han, Zongcheng, Cao, Dongyan, Sun, Haoran, Hong, Yu
Abstract
Recent advances in auto-regressive transformers have achieved remarkable success in generative modeling. However, text-to-3D generation remains challenging, primarily due to bottlenecks in learning discrete 3D representations. Specifically, existing approaches often suffer from information loss during encoding, causing representational distortion before the quantization process. This effect is further amplified by vector quantization, ultimately degrading the geometric coherence of text-conditioned 3D shapes. Moreover, the conventional two-stage training paradigm induces an objective mismatch between reconstruction and text-conditioned auto-regressive generation. To address these issues, we propose View-aware Auto-Regressive 3D (VAR-3D), which intergrates a view-aware 3D Vector Quantized-Variational AutoEncoder (VQ-VAE) to convert the complex geometric structure of 3D models into discrete tokens. Additionally, we introduce a rendering-supervised training strategy that couples discrete token prediction with visual reconstruction, encouraging the generative process to better preserve visual fidelity and structural consistency relative to the input text. Experiments demonstrate that VAR-3D significantly outperforms existing methods in both generation quality and text-3D alignment.
Chinese Translation
最近,自回归变换器在生成建模方面取得了显著成功。然而,文本到3D生成仍然面临挑战,主要是由于学习离散3D表示的瓶颈。具体而言,现有方法在编码过程中常常遭遇信息损失,导致量化过程前的表示失真。这一影响在向量量化过程中进一步放大,最终降低了文本条件下3D形状的几何一致性。此外,传统的两阶段训练范式导致重建与文本条件自回归生成之间的目标不匹配。为了解决这些问题,我们提出了视图感知自回归3D(VAR-3D),它集成了视图感知的3D向量量化变分自编码器(VQ-VAE),将3D模型的复杂几何结构转换为离散标记。此外,我们引入了一种渲染监督训练策略,将离散标记预测与视觉重建相结合,鼓励生成过程更好地保留相对于输入文本的视觉保真度和结构一致性。实验表明,VAR-3D在生成质量和文本-3D对齐方面显著优于现有方法。
cs.CV / 71 / 2602.13823

Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings

Embed-RL:面向推理驱动的多模态嵌入的强化学习
Jiang, Haonan, Wang, Yuji, Zhu, Yongjie, Lu, Xin, Qin, Wenyu, Wang, Meng, Wan, Pengfei, Tang, Yansong
Abstract
Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the generated reasoning CoTs of existing generative embedding methods are limited to the textual analysis of queries and are irrelevant to the retrieval of the targets. To address these limitations, we propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework where the Embedder provides explicit supervision to the Reasoner, ensuring the generated CoT traces are aligned with embedding tasks. (2) We introduce T-CoT, which extracts critical multimodal cues to focus on retrieval-relevant elements and provides multimodal inputs for the Embedder. (3) With limited computational resources, our framework outperforms the pioneering embedding model on both MMEB-V2 and UVRB benchmarks. The integration of multimodal evidence in structured reasoning, paired with retrieval-oriented alignment, effectively strengthens cross-modal semantic consistency and boosts the fine-grained matching capability of the model as well as the generalization across complex scenarios. Our work demonstrates that targeted reasoning optimization can significantly improve multimodal embedding quality, providing a practical and efficient solution for reasoning-driven UME development.
Chinese Translation
利用多模态大型语言模型(MLLMs)在推动通用多模态嵌入(UME)以应对多样化的跨模态任务中变得至关重要。近期研究表明,结合生成性思维链(Chain-of-Thought, CoT)推理相比于区分性方法可以显著增强特定任务的表示。然而,现有生成嵌入方法生成的推理 CoT 限于对查询的文本分析,与目标的检索无关。为了解决这些局限性,我们提出了一种推理驱动的 UME 框架,该框架整合了嵌入引导的强化学习(Embedder-Guided Reinforcement Learning, EG-RL),以优化推理器生成证据可追溯性思维链(Traceability CoT, T-CoT)。我们的主要贡献有三点:(1)我们设计了一个 EG-RL 框架,其中嵌入器为推理器提供明确的监督,确保生成的 CoT 跟踪与嵌入任务对齐。(2)我们引入了 T-CoT,它提取关键的多模态线索,以关注与检索相关的元素,并为嵌入器提供多模态输入。(3)在有限的计算资源下,我们的框架在 MMEB-V2 和 UVRB 基准测试中超越了开创性的嵌入模型。多模态证据在结构化推理中的整合,加上面向检索的对齐,有效增强了跨模态语义一致性,提高了模型的细粒度匹配能力以及在复杂场景中的泛化能力。我们的工作表明,针对性的推理优化可以显著提升多模态嵌入质量,为推理驱动的 UME 开发提供了一种实用且高效的解决方案。
cs.CV / 72 / 2602.13831

Prior-guided Hierarchical Instance-pixel Contrastive Learning for Ultrasound Speckle Noise Suppression

基于先验引导的层次实例像素对比学习用于超声斑点噪声抑制
Bu, Zhenyu, Xie, Yuanxin, Zhou, Guang-Quan
Abstract
Ultrasound denoising is essential for mitigating speckle-induced degradations, thereby enhancing image quality and improving diagnostic reliability. Nevertheless, because speckle patterns inherently encode both texture and fine anatomical details, effectively suppressing noise while preserving structural fidelity remains a significant challenge. In this study, we propose a prior-guided hierarchical instance-pixel contrastive learning model for ultrasound denoising, designed to promote noise-invariant and structure-aware feature representations by maximizing the separability between noisy and clean samples at both pixel and instance levels. Specifically, a statistics-guided pixel-level contrastive learning strategy is introduced to enhance distributional discrepancies between noisy and clean pixels, thereby improving local structural consistency. Concurrently, a memory bank is employed to facilitate instance-level contrastive learning in the feature space, encouraging representations that more faithfully approximate the underlying data distribution. Furthermore, a hybrid Transformer-CNN architecture is adopted, coupling a Transformer-based encoder for global context modeling with a CNN-based decoder optimized for fine-grained anatomical structure restoration, thus enabling complementary exploitation of long-range dependencies and local texture details. Extensive evaluations on two publicly available ultrasound datasets demonstrate that the proposed model consistently outperforms existing methods, confirming its effectiveness and superiority.
Chinese Translation
超声去噪对于减轻斑点引起的降解至关重要,从而提高图像质量和改善诊断可靠性。然而,由于斑点模式本质上编码了纹理和细微解剖细节,有效抑制噪声同时保持结构保真性仍然是一个重大挑战。在本研究中,我们提出了一种基于先验引导的层次实例像素对比学习模型用于超声去噪,旨在通过最大化噪声样本与干净样本在像素和实例层面的可分离性,促进噪声不变和结构感知的特征表示。具体而言,引入了一种统计引导的像素级对比学习策略,以增强噪声像素与干净像素之间的分布差异,从而改善局部结构一致性。同时,采用了一个记忆库以促进特征空间中的实例级对比学习,鼓励更忠实于基础数据分布的表示。此外,采用了一种混合的Transformer-CNN架构,将基于Transformer的编码器用于全局上下文建模与优化细粒度解剖结构恢复的CNN解码器相结合,从而实现对长程依赖和局部纹理细节的互补利用。在两个公开可用的超声数据集上的广泛评估表明,所提出的模型始终优于现有方法,确认了其有效性和优越性。
cs.CV / 73 / 2602.13837

High-Fidelity Causal Video Diffusion Models for Real-Time Ultra-Low-Bitrate Semantic Communication

高保真因果视频扩散模型用于实时超低比特率语义通信
Eteke, Cem, Tosun, Batuhan, Griessel, Alexander, Kellerer, Wolfgang, Steinbach, Eckehard
Abstract
We introduce a video diffusion model for high-fidelity, causal, and real-time video generation under ultra-low-bitrate semantic communication constraints. Our approach utilizes lossy semantic video coding to transmit the semantic scene structure, complemented by a stream of highly compressed, low-resolution frames that provide sufficient texture information to preserve fidelity. Building on these inputs, we introduce a modular video diffusion model that contains Semantic Control, Restoration Adapter, and Temporal Adapter. We further introduce an efficient temporal distillation procedure that enables extension to real-time and causal synthesis, reducing trainable parameters by 300x and training time by 2x, while adhering to communication constraints. Evaluated across diverse datasets, the framework achieves strong perceptual quality, semantic fidelity, and temporal consistency at ultra-low bitrates (< 0.0003 bpp), outperforming classical, neural, and generative baselines in extensive quantitative, qualitative, and subjective evaluations.
Chinese Translation
我们提出了一种视频扩散模型,用于在超低比特率语义通信约束下进行高保真、因果和实时的视频生成。我们的方法利用有损语义视频编码来传输语义场景结构,并辅以一系列高度压缩的低分辨率帧,以提供足够的纹理信息以保持保真度。在这些输入的基础上,我们引入了一种模块化视频扩散模型,其中包含语义控制、恢复适配器和时间适配器。我们进一步引入了一种高效的时间蒸馏程序,使得扩展到实时和因果合成成为可能,训练参数减少了300倍,训练时间减少了2倍,同时遵循通信约束。在多个数据集上的评估表明,该框架在超低比特率(< 0.0003 bpp)下实现了强大的感知质量、语义保真度和时间一致性,超越了经典、神经和生成基线,在广泛的定量、定性和主观评估中表现优异。
cs.CV / 74 / 2602.13842

Automated Prediction of Paravalvular Regurgitation before Transcatheter Aortic Valve Implantation

经导管主动脉瓣植入术前旁瓣反流的自动预测
Cannito, Michele, Renzulli, Riccardo, Duarte, Adson, Nikfam, Farzad, Barbano, Carlo Alberto, Chiesa, Enrico, Bruno, Francesco, Giacobbe, Federico, Wanha, Wojciech, Giordano, Arturo, Grangetto, Marco, D'Ascenzo, Fabrizio
Abstract
Severe aortic stenosis is a common and life-threatening condition in elderly patients, often treated with Transcatheter Aortic Valve Implantation (TAVI). Despite procedural advances, paravalvular aortic regurgitation (PVR) remains one of the most frequent post-TAVI complications, with a proven impact on long-term prognosis. In this work, we investigate the potential of deep learning to predict the occurrence of PVR from preoperative cardiac CT. To this end, a dataset of preoperative TAVI patients was collected, and 3D convolutional neural networks were trained on isotropic CT volumes. The results achieved suggest that volumetric deep learning can capture subtle anatomical features from pre-TAVI imaging, opening new perspectives for personalized risk assessment and procedural optimization. Source code is available at https://github.com/EIDOSLAB/tavi.
Chinese Translation
严重的主动脉狭窄是老年患者中常见且危及生命的疾病,通常通过经导管主动脉瓣植入术(TAVI)进行治疗。尽管手术技术有所进步,旁瓣主动脉反流(PVR)仍然是TAVI后最常见的并发症之一,对长期预后有显著影响。在本研究中,我们探讨了深度学习在从术前心脏CT预测PVR发生的潜力。为此,收集了一组术前TAVI患者的数据集,并对各向同性CT体积进行了三维卷积神经网络的训练。所获得的结果表明,体积深度学习能够捕捉术前影像中的微妙解剖特征,为个性化风险评估和手术优化开辟了新的视角。源代码可在 https://github.com/EIDOSLAB/tavi 获取。
cs.CV / 75 / 2602.13844

Synthetic Dataset Generation and Validation for Robotic Surgery Instrument Segmentation

用于机器人手术仪器分割的合成数据集生成与验证
Chiesa, Giorgio, Borra, Rossella, Lauro, Vittorio, De Cillis, Sabrina, Amparore, Daniele, Fiori, Cristian, Renzulli, Riccardo, Grangetto, Marco
Abstract
This paper presents a comprehensive workflow for generating and validating a synthetic dataset designed for robotic surgery instrument segmentation. A 3D reconstruction of the Da Vinci robotic arms was refined and animated in Autodesk Maya through a fully automated Python-based pipeline capable of producing photorealistic, labeled video sequences. Each scene integrates randomized motion patterns, lighting variations, and synthetic blood textures to mimic intraoperative variability while preserving pixel-accurate ground truth masks. To validate the realism and effectiveness of the generated data, several segmentation models were trained under controlled ratios of real and synthetic data. Results demonstrate that a balanced composition of real and synthetic samples significantly improves model generalization compared to training on real data only, while excessive reliance on synthetic data introduces a measurable domain shift. The proposed framework provides a reproducible and scalable tool for surgical computer vision, supporting future research in data augmentation, domain adaptation, and simulation-based pretraining for robotic-assisted surgery. Data and code are available at https://github.com/EIDOSLAB/Sintetic-dataset-DaVinci.
Chinese Translation
本文提出了一种用于生成和验证合成数据集的综合工作流程,该数据集旨在用于机器人手术仪器分割。通过一个完全自动化的基于Python的管道,对达芬奇(Da Vinci)机器人手臂进行了3D重建、精细化和动画处理,能够生成逼真的标注视频序列。每个场景整合了随机化的运动模式、光照变化和合成血液纹理,以模拟手术过程中的变异,同时保持像素级准确的真实标注掩膜。为了验证生成数据的真实性和有效性,在控制真实数据与合成数据比例的条件下训练了多个分割模型。结果表明,与仅在真实数据上训练相比,真实样本与合成样本的平衡组合显著提高了模型的泛化能力,而过度依赖合成数据则引入了可测量的领域偏移。所提出的框架为外科计算机视觉提供了一种可重复和可扩展的工具,支持未来在数据增强、领域适应和基于模拟的预训练等方面的研究。数据和代码可在 https://github.com/EIDOSLAB/Sintetic-dataset-DaVinci 获取。
cs.CV / 76 / 2602.13846

Cardiac Output Prediction from Echocardiograms: Self-Supervised Learning with Limited Data

基于超声心动图的心输出量预测:有限数据下的自监督学习
Duarte, Adson, Vitturini, Davide, Milillo, Emanuele, Bragagnolo, Andrea, Barbano, Carlo Alberto, Renzulli, Riccardo, Cannito, Michele, Giacobbe, Federico, Bruno, Francesco, de Filippo, Ovidio, D'Ascenzo, Fabrizio, Grangetto, Marco
Abstract
Cardiac Output (CO) is a key parameter in the diagnosis and management of cardiovascular diseases. However, its accurate measurement requires right-heart catheterization, an invasive and time-consuming procedure, motivating the development of reliable non-invasive alternatives using echocardiography. In this work, we propose a self-supervised learning (SSL) pretraining strategy based on SimCLR to improve CO prediction from apical four-chamber echocardiographic videos. The pretraining is performed using the same limited dataset available for the downstream task, demonstrating the potential of SSL even under data scarcity. Our results show that SSL mitigates overfitting and improves representation learning, achieving an average Pearson correlation of 0.41 on the test set and outperforming PanEcho, a model trained on over one million echocardiographic exams. Source code is available at https://github.com/EIDOSLAB/cardiac-output.
Chinese Translation
心输出量(CO)是心血管疾病诊断和管理中的关键参数。然而,其准确测量需要进行右心导管插入,这是一种侵入性且耗时的程序,这促使我们开发可靠的非侵入性替代方案,利用超声心动图。在本研究中,我们提出了一种基于SimCLR的自监督学习(SSL)预训练策略,以提高从心尖四腔超声心动图视频中预测心输出量的能力。预训练是在下游任务可用的相同有限数据集上进行的,展示了即使在数据稀缺的情况下,SSL的潜力。我们的结果表明,SSL减轻了过拟合并改善了表征学习,在测试集上实现了平均Pearson相关系数为0.41,超越了在超过一百万个超声心动图检查上训练的模型PanEcho。源代码可在https://github.com/EIDOSLAB/cardiac-output获取。
cs.CV / 77 / 2602.13859

Low-Pass Filtering Improves Behavioral Alignment of Vision Models

低通滤波改善视觉模型的行为一致性
Wolff, Max, Klein, Thomas, Rusak, Evgenia, Wichmann, Felix, Brendel, Wieland
Abstract
Despite their impressive performance on computer vision benchmarks, Deep Neural Networks (DNNs) still fall short of adequately modeling human visual behavior, as measured by error consistency and shape bias. Recent work hypothesized that behavioral alignment can be drastically improved through \emph{generative} -- rather than \emph{discriminative} -- classifiers, with far-reaching implications for models of human vision. Here, we instead show that the increased alignment of generative models can be largely explained by a seemingly innocuous resizing operation in the generative model which effectively acts as a low-pass filter. In a series of controlled experiments, we show that removing high-frequency spatial information from discriminative models like CLIP drastically increases their behavioral alignment. Simply blurring images at test-time -- rather than training on blurred images -- achieves a new state-of-the-art score on the model-vs-human benchmark, halving the current alignment gap between DNNs and human observers. Furthermore, low-pass filters are likely optimal, which we demonstrate by directly optimizing filters for alignment. To contextualize the performance of optimal filters, we compute the frontier of all possible pareto-optimal solutions to the benchmark, which was formerly unknown. We explain our findings by observing that the frequency spectrum of optimal Gaussian filters roughly matches the spectrum of band-pass filters implemented by the human visual system. We show that the contrast sensitivity function, describing the inverse of the contrast threshold required for humans to detect a sinusoidal grating as a function of spatiotemporal frequency, is approximated well by Gaussian filters of the specific width that also maximizes error consistency.
Chinese Translation
尽管深度神经网络(DNNs)在计算机视觉基准测试中表现出色,但它们在充分建模人类视觉行为方面仍然存在不足,这可以通过错误一致性和形状偏差来衡量。近期研究假设,通过 extit{生成性}而非 extit{判别性}分类器,行为一致性可以得到显著改善,这对人类视觉模型具有深远的影响。在此,我们展示了生成模型的增强一致性在很大程度上可以通过生成模型中一种看似无害的调整操作来解释,该操作有效地充当了低通滤波器。在一系列控制实验中,我们表明,从像CLIP这样的判别模型中去除高频空间信息会显著提高其行为一致性。仅在测试时对图像进行模糊处理——而不是在模糊图像上进行训练——在模型与人类基准测试中达到了新的最先进分数,将DNN与人类观察者之间的当前一致性差距缩小了一半。此外,低通滤波器可能是最优的,我们通过直接优化滤波器以提高一致性来证明这一点。为了将最优滤波器的性能进行背景化,我们计算了所有可能的帕累托最优解的前沿,这在以前是未知的。我们通过观察最优高斯滤波器的频谱大致匹配人类视觉系统实施的带通滤波器的频谱来解释我们的发现。我们展示了对比敏感度函数,它描述了人类检测正弦波纹理所需的对比阈值的倒数与时空频率的关系,可以通过特定宽度的高斯滤波器很好地近似,该宽度也最大化了错误一致性。
cs.CV / 78 / 2602.13887

Human-Aligned Evaluation of a Pixel-wise DNN Color Constancy Model

人类对齐的像素级深度神经网络颜色恒常性模型评估
Heidari-Gorji, Hamed, Rodriguez, Raquel Gil, Gegenfurtner, Karl R.
Abstract
We previously investigated color constancy in photorealistic virtual reality (VR) and developed a Deep Neural Network (DNN) that predicts reflectance from rendered images. Here, we combine both approaches to compare and study a model and human performance with respect to established color constancy mechanisms: local surround, maximum flux and spatial mean. Rather than evaluating the model against physical ground truth, model performance was assessed using the same achromatic object selection task employed in the human experiments. The model, a ResNet based U-Net from our previous work, was pre-trained on rendered images to predict surface reflectance. We then applied transfer learning, fine-tuning only the network's decoder on images from the baseline VR condition. To parallel the human experiment, the model's output was used to perform the same achromatic object selection task across all conditions. Results show a strong correspondence between the model and human behavior. Both achieved high constancy under baseline conditions and showed similar, condition-dependent performance declines when the local surround or spatial mean color cues were removed.
Chinese Translation
我们之前研究了光真实感虚拟现实(VR)中的颜色恒常性,并开发了一种深度神经网络(DNN),该网络能够从渲染图像中预测反射率。在此,我们结合这两种方法,比较和研究模型与人类在已建立的颜色恒常性机制方面的表现:局部环境、最大通量和空间均值。我们并未将模型与物理真实值进行评估,而是使用与人类实验中相同的无色物体选择任务来评估模型性能。该模型是基于我们之前工作的ResNet结构的U-Net,经过预训练以预测表面反射率。随后,我们应用迁移学习,仅对网络的解码器在基线VR条件下的图像上进行了微调。为了与人类实验相平行,模型的输出被用于在所有条件下执行相同的无色物体选择任务。结果显示模型与人类行为之间存在强烈的对应关系。在基线条件下,两者均实现了高恒常性,并在移除局部环境或空间均值颜色线索时表现出相似的、依赖条件的性能下降。
cs.CV / 79 / 2602.13889

Parameter-Efficient Fine-Tuning of DINOv2 for Large-Scale Font Classification

DINOv2的参数高效微调用于大规模字体分类
Chen, Daniel, Zinn, Zaria, Lowe, Marcus
Abstract
We present a font classification system capable of identifying 394 font families from rendered text images. Our approach fine-tunes a DINOv2 Vision Transformer using Low-Rank Adaptation (LoRA), achieving approximately 86% top-1 accuracy while training fewer than 1% of the model's 87.2M parameters. We introduce a synthetic dataset generation pipeline that renders Google Fonts at scale with diverse augmentations including randomized colors, alignment, line wrapping, and Gaussian noise, producing training images that generalize to real-world typographic samples. The model incorporates built-in preprocessing to ensure consistency between training and inference, and is deployed as a HuggingFace Inference Endpoint. We release the model, dataset, and full training pipeline as open-source resources.
Chinese Translation
我们提出了一种字体分类系统,能够识别来自渲染文本图像的394种字体系列。我们的方法通过低秩适应(Low-Rank Adaptation, LoRA)对DINOv2视觉变换器进行微调,训练过程中仅使用不到1%的模型参数(共87.2M),实现了约86%的Top-1准确率。我们引入了一种合成数据集生成管道,以大规模渲染Google Fonts,并进行多样化的数据增强,包括随机颜色、对齐、换行和高斯噪声,从而生成能够推广到真实世界排版样本的训练图像。该模型内置预处理功能,以确保训练与推理之间的一致性,并作为HuggingFace推理端点进行部署。我们将模型、数据集和完整的训练管道作为开源资源发布。
cs.CV / 80 / 2602.13901

RPGD: RANSAC-P3P Gradient Descent for Extrinsic Calibration in 3D Human Pose Estimation

RPGD:基于RANSAC-P3P的梯度下降法在3D人体姿态估计中的外部标定
Tuo, Zhanyu
Abstract
In this paper, we propose RPGD (RANSAC-P3P Gradient Descent), a human-pose-driven extrinsic calibration framework that robustly aligns MoCap-based 3D skeletal data with monocular or multi-view RGB cameras using only natural human motion. RPGD formulates extrinsic calibration as a coarse-to-fine problem tailored to human poses, combining the global robustness of RANSAC-P3P with Gradient-Descent-based refinement. We evaluate RPGD on three large-scale public 3D HPE datasets as well as on a self-collected in-the-wild dataset. Experimental results demonstrate that RPGD consistently recovers extrinsic parameters with accuracy comparable to the provided ground truth, achieving sub-pixel MPJPE reprojection error even in challenging, noisy settings. These results indicate that RPGD provides a practical and automatic solution for reliable extrinsic calibration of large-scale 3D HPE dataset collection.
Chinese Translation
本文提出了RPGD(基于RANSAC-P3P的梯度下降法),这是一个以人体姿态为驱动的外部标定框架,能够稳健地将基于运动捕捉(MoCap)的3D骨骼数据与单目或多视角RGB相机对齐,仅使用自然的人体运动。RPGD将外部标定问题构建为一个针对人体姿态的粗到精的过程,结合了RANSAC-P3P的全局鲁棒性与基于梯度下降的精细化调整。我们在三个大规模公共3D人体姿态估计(HPE)数据集以及一个自收集的野外数据集上评估了RPGD。实验结果表明,RPGD能够始终如一地恢复外部参数,其准确性与提供的真实值相当,即使在具有挑战性和噪声的环境中也能实现亚像素级的MPJPE重投影误差。这些结果表明,RPGD为大规模3D HPE数据集的可靠外部标定提供了一种实用且自动化的解决方案。
cs.CV / 81 / 2602.13930

MamaDino: A Hybrid Vision Model for Breast Cancer 3-Year Risk Prediction

MamaDino:一种用于乳腺癌三年风险预测的混合视觉模型
Santeramo, Ruggiero, Zubarev, Igor, Jug, Florian
Abstract
Breast cancer screening programmes increasingly seek to move from one-size-fits-all interval to risk-adapted and personalized strategies. Deep learning (DL) has enabled image-based risk models with stronger 1- to 5-year prediction than traditional clinical models, but leading systems (e.g., Mirai) typically use convolutional backbones, very high-resolution inputs (>1M pixels) and simple multi-view fusion, with limited explicit modelling of contralateral asymmetry. We hypothesised that combining complementary inductive biases (convolutional and transformer-based) with explicit contralateral asymmetry modelling would allow us to match state-of-the-art 3-year risk prediction performance even when operating on substantially lower-resolution mammograms, indicating that using less detailed images in a more structured way can recover state-of-the-art accuracy. We present MamaDino, a mammography-aware multi-view attentional DINO model. MamaDino fuses frozen self-supervised DINOv3 ViT-S features with a trainable CNN encoder at 512x512 resolution, and aggregates bilateral breast information via a BilateralMixer to output a 3-year breast cancer risk score. We train on 53,883 women from OPTIMAM (UK) and evaluate on matched 3-year case-control cohorts: an in-distribution test set from four screening sites and an external out-of-distribution cohort from an unseen site. At breast-level, MamaDino matches Mirai on both internal and external tests while using ~13x fewer input pixels. Adding the BilateralMixer improves discrimination to AUC 0.736 (vs 0.713) in-distribution and 0.677 (vs 0.666) out-of-distribution, with consistent performance across age, ethnicity, scanner, tumour type and grade. These findings demonstrate that explicit contralateral modelling and complementary inductive biases enable predictions that match Mirai, despite operating on substantially lower-resolution mammograms.
Chinese Translation
乳腺癌筛查程序越来越倾向于从统一的筛查间隔转向风险适应和个性化策略。深度学习(DL)使得基于图像的风险模型在1至5年的预测能力上优于传统临床模型,但领先系统(如Mirai)通常使用卷积骨干网络、非常高分辨率的输入(>1M像素)以及简单的多视图融合,且对对侧不对称性的显式建模有限。我们假设结合互补的归纳偏差(卷积和基于变换器的)与显式的对侧不对称性建模,可以使我们在使用显著较低分辨率的乳腺X光片时,仍能达到最先进的三年风险预测性能,这表明以更结构化的方式使用较少细节的图像可以恢复最先进的准确性。我们提出了MamaDino,这是一种考虑乳腺X光的多视图注意力DINO模型。MamaDino将冻结的自监督DINOv3 ViT-S特征与可训练的CNN编码器在512x512分辨率下融合,并通过双侧混合器(BilateralMixer)聚合双侧乳腺信息,以输出三年乳腺癌风险评分。我们在53,883名来自OPTIMAM(英国)的女性上进行训练,并在匹配的三年病例对照队列上进行评估:来自四个筛查地点的内部测试集和来自未见地点的外部分布外队列。在乳腺层面上,MamaDino在内部和外部测试中均与Mirai相匹配,同时使用的输入像素数量减少约13倍。添加双侧混合器提高了内部分布的区分度至AUC 0.736(对比0.713),外部分布为0.677(对比0.666),并在年龄、种族、扫描仪、肿瘤类型和等级之间表现出一致的性能。这些发现表明,显式的对侧建模和互补的归纳偏差使得预测能够与Mirai相匹配,尽管是在显著较低分辨率的乳腺X光片上进行操作。
cs.CV / 82 / 2602.13944

Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology

像素与基因的融合:计算病理学中的空间感知学习
Han, Minghao, Yang, Dingkang, Qu, Linhao, Chen, Zizhi, Li, Gang, Wang, Han, Wang, Jiacong, Zhang, Lihua
Abstract
Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M are available at: https://github.com/Hanminghao/STAMP.
Chinese Translation
近年来,计算病理学中的多模态学习取得了显著进展。现有模型主要依赖视觉和语言模态;然而,仅依靠语言缺乏分子特异性,并且提供的病理监督有限,导致表征瓶颈。在本文中,我们提出了STAMP(空间转录组增强多模态病理表征学习框架),该框架整合了空间分辨的基因表达谱,以实现病理图像和转录组数据的分子引导联合嵌入。我们的研究表明,自监督的基因引导训练为学习病理图像表征提供了强大且与任务无关的信号。结合空间上下文和多尺度信息进一步增强了模型的性能和泛化能力。为此,我们构建了SpaVis-6M,这是迄今为止最大的基于Visium的空间转录组数据集,并在此资源上训练了一个空间感知的基因编码器。通过利用分层多尺度对比对齐和跨尺度补丁定位机制,STAMP有效地将空间转录组与病理图像对齐,捕捉空间结构和分子变异。我们在六个数据集和四个下游任务中验证了STAMP,其表现始终强劲。这些结果突显了整合空间分辨的分子监督在推动计算病理学中的多模态学习中的价值和必要性。代码已包含在补充材料中。预训练权重和SpaVis-6M可在以下链接获取:https://github.com/Hanminghao/STAMP。
cs.CV / 83 / 2602.13961

MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars

MarsRetrieval:针对火星行星尺度地理空间检索的视觉-语言模型基准测试
Wang, Shuoyuan, Wang, Yiran, Wei, Hongxin
Abstract
Data-driven approaches like deep learning are rapidly advancing planetary science, particularly in Mars exploration. Despite recent progress, most existing benchmarks remain confined to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery. We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery. MarsRetrieval includes three tasks: (1) paired image-text retrieval, (2) landform retrieval, and (3) global geo-localization, covering multiple spatial scales and diverse geomorphic origins. We propose a unified retrieval-centric protocol to benchmark multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. Our evaluation shows MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions. We further show that domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings. Our code is available at https://github.com/ml-stat-Sustech/MarsRetrieval
Chinese Translation
数据驱动的方法,如深度学习,正在迅速推动行星科学的发展,特别是在火星探索方面。尽管最近取得了一些进展,但大多数现有基准仍局限于封闭集监督视觉任务,并不支持文本引导的地理空间发现检索。我们提出了MarsRetrieval,这是一个用于评估火星地理空间发现的视觉-语言模型的检索基准。MarsRetrieval包含三个任务:(1)成对图像-文本检索,(2)地貌检索,以及(3)全球地理定位,涵盖多个空间尺度和多样的地貌起源。我们提出了一种统一的以检索为中心的协议,以基准测试多模态嵌入架构,包括对比双塔编码器和生成式视觉-语言模型。我们的评估表明,MarsRetrieval具有挑战性:即使是强大的基础模型也常常无法捕捉领域特定的地貌差异。我们进一步表明,领域特定的微调对于在行星环境中实现可泛化的地理空间发现至关重要。我们的代码可在 https://github.com/ml-stat-Sustech/MarsRetrieval 获取。
cs.CV / 84 / 2602.13993

Elastic Diffusion Transformer

弹性扩散变换器
Wang, Jiangshan, Lai, Zeqiang, Chen, Jiarui, Guo, Jiayi, Guo, Hang, Li, Xiu, Yue, Xiangyu, Guo, Chunchao
Abstract
Diffusion Transformers (DiT) have demonstrated remarkable generative capabilities but remain highly computationally expensive. Previous acceleration methods, such as pruning and distillation, typically rely on a fixed computational capacity, leading to insufficient acceleration and degraded generation quality. To address this limitation, we propose \textbf{Elastic Diffusion Transformer (E-DiT)}, an adaptive acceleration framework for DiT that effectively improves efficiency while maintaining generation quality. Specifically, we observe that the generative process of DiT exhibits substantial sparsity (i.e., some computations can be skipped with minimal impact on quality), and this sparsity varies significantly across samples. Motivated by this observation, E-DiT equips each DiT block with a lightweight router that dynamically identifies sample-dependent sparsity from the input latent. Each router adaptively determines whether the corresponding block can be skipped. If the block is not skipped, the router then predicts the optimal MLP width reduction ratio within the block. During inference, we further introduce a block-level feature caching mechanism that leverages router predictions to eliminate redundant computations in a training-free manner. Extensive experiments across 2D image (Qwen-Image and FLUX) and 3D asset (Hunyuan3D-3.0) demonstrate the effectiveness of E-DiT, achieving up to $\sim$2$\times$ speedup with negligible loss in generation quality. Code will be available at https://github.com/wangjiangshan0725/Elastic-DiT.
Chinese Translation
扩散变换器(DiT)展现了显著的生成能力,但计算成本仍然很高。以往的加速方法,如剪枝和蒸馏,通常依赖于固定的计算能力,导致加速效果不足且生成质量下降。为了解决这一局限性,我们提出了 extbf{弹性扩散变换器(E-DiT)},这是一个针对DiT的自适应加速框架,能够有效提高效率,同时保持生成质量。具体而言,我们观察到DiT的生成过程表现出显著的稀疏性(即某些计算可以在对质量影响较小的情况下被跳过),而这种稀疏性在不同样本之间变化显著。基于这一观察,E-DiT为每个DiT模块配备了一个轻量级路由器,能够动态识别输入潜变量中的样本依赖稀疏性。每个路由器自适应地决定相应模块是否可以被跳过。如果模块没有被跳过,路由器将预测该模块内的最优MLP宽度缩减比例。在推理过程中,我们进一步引入了一个模块级特征缓存机制,利用路由器的预测以无训练的方式消除冗余计算。针对2D图像(Qwen-Image和FLUX)和3D资产(Hunyuan3D-3.0)的大量实验表明,E-DiT的有效性,达到了约$ extsim$2$ imes$的加速,同时生成质量损失微乎其微。代码将发布在https://github.com/wangjiangshan0725/Elastic-DiT。
cs.CV / 85 / 2602.13994

Inject Where It Matters: Training-Free Spatially-Adaptive Identity Preservation for Text-to-Image Personalization

在重要位置进行注入:无训练的空间自适应身份保留用于文本到图像个性化
Li, Guandong, Ye, Mengxia
Abstract
Personalized text-to-image generation aims to integrate specific identities into arbitrary contexts. However, existing tuning-free methods typically employ Spatially Uniform Visual Injection, causing identity features to contaminate non-facial regions (e.g., backgrounds and lighting) and degrading text adherence. To address this without expensive fine-tuning, we propose SpatialID, a training-free spatially-adaptive identity modulation framework. SpatialID fundamentally decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses. Furthermore, we introduce a Temporal-Spatial Scheduling strategy that dynamically adjusts spatial constraints - transitioning from Gaussian priors to attention-based masks and adaptive relaxation - to align with the diffusion generation dynamics. Extensive experiments on IBench demonstrate that SpatialID achieves state-of-the-art performance in text adherence (CLIP-T: 0.281), visual consistency (CLIP-I: 0.827), and image quality (IQ: 0.523), significantly eliminating background contamination while maintaining robust identity preservation.
Chinese Translation
个性化文本到图像生成旨在将特定身份融入任意上下文。然而,现有的无调优方法通常采用空间均匀视觉注入,导致身份特征污染非面部区域(例如背景和光照),并降低文本的遵循性。为了解决这一问题而无需昂贵的微调,我们提出了SpatialID,一个无训练的空间自适应身份调制框架。SpatialID 从根本上将身份注入解耦为与面部相关的区域和上下文无关的区域,使用从交叉注意力响应中提取的空间掩码提取器。此外,我们引入了一种时间-空间调度策略,动态调整空间约束——从高斯先验过渡到基于注意力的掩码和自适应放松——以与扩散生成动态相一致。在IBench上的大量实验表明,SpatialID在文本遵循性(CLIP-T: 0.281)、视觉一致性(CLIP-I: 0.827)和图像质量(IQ: 0.523)方面实现了最先进的性能,显著消除了背景污染,同时保持了强大的身份保留能力。
cs.CV / 86 / 2602.14010

A Deployment-Friendly Foundational Framework for Efficient Computational Pathology

一种适合部署的基础框架用于高效计算病理学
Cai, Yu, Jin, Cheng, Ma, Jiabo, Zhou, Fengtao, Xu, Yingxue, Guo, Zhengrui, Wang, Yihui, Zhang, Zhengyu, Liang, Ling, Tan, Yonghao, Dong, Pingcheng, Cai, Du, Tang, On Ki, Zhao, Chenglong, Wang, Xi, Yang, Can, Xu, Yali, Cui, Jing, Li, Zhenhui, Chan, Ronald Cheong Kin, Liu, Yueping, Gao, Feng, Zhang, Xiuming, Liang, Li, Chen, Hao, Cheng, Kwang-Ting
Abstract
Pathology foundation models (PFMs) have enabled robust generalization in computational pathology through large-scale datasets and expansive architectures, but their substantial computational cost, particularly for gigapixel whole slide images, limits clinical accessibility and scalability. Here, we present LitePath, a deployment-friendly foundational framework designed to mitigate model over-parameterization and patch level redundancy. LitePath integrates LiteFM, a compact model distilled from three large PFMs (Virchow2, H-Optimus-1 and UNI2) using 190 million patches, and the Adaptive Patch Selector (APS), a lightweight component for task-specific patch selection. The framework reduces model parameters by 28x and lowers FLOPs by 403.5x relative to Virchow2, enabling deployment on low-power edge hardware such as the NVIDIA Jetson Orin Nano Super. On this device, LitePath processes 208 slides per hour, 104.5x faster than Virchow2, and consumes 0.36 kWh per 3,000 slides, 171x lower than Virchow2 on an RTX3090 GPU. We validated accuracy using 37 cohorts across four organs and 26 tasks (26 internal, 9 external, and 2 prospective), comprising 15,672 slides from 9,808 patients disjoint from the pretraining data. LitePath ranks second among 19 evaluated models and outperforms larger models including H-Optimus-1, mSTAR, UNI2 and GPFM, while retaining 99.71% of the AUC of Virchow2 on average. To quantify the balance between accuracy and efficiency, we propose the Deployability Score (D-Score), defined as the weighted geometric mean of normalized AUC and normalized FLOP, where LitePath achieves the highest value, surpassing Virchow2 by 10.64%. These results demonstrate that LitePath enables rapid, cost-effective and energy-efficient pathology image analysis on accessible hardware while maintaining accuracy comparable to state-of-the-art PFMs and reducing the carbon footprint of AI deployment.
Chinese Translation
病理基础模型(PFMs)通过大规模数据集和广泛架构在计算病理学中实现了强大的泛化能力,但其巨大的计算成本,特别是对于千兆像素的全切片图像,限制了临床可及性和可扩展性。在此,我们提出了LitePath,一种适合部署的基础框架,旨在减轻模型的过度参数化和补丁级别的冗余。LitePath集成了LiteFM,这是一个从三个大型PFMs(Virchow2、H-Optimus-1和UNI2)中提炼出的紧凑模型,使用了1.9亿个补丁,以及自适应补丁选择器(APS),这是一个用于任务特定补丁选择的轻量级组件。该框架相较于Virchow2将模型参数减少了28倍,FLOPs降低了403.5倍,使其能够在低功耗边缘硬件上部署,如NVIDIA Jetson Orin Nano Super。在该设备上,LitePath每小时处理208个切片,比Virchow2快104.5倍,并且每处理3000个切片消耗0.36千瓦时,比在RTX3090 GPU上运行的Virchow2低171倍。我们通过37个队列在四个器官和26个任务(26个内部,9个外部和2个前瞻性)中验证了准确性,涉及来自9808名患者的15672个切片,这些患者与预训练数据不重叠。LitePath在19个评估模型中排名第二,超越了包括H-Optimus-1、mSTAR、UNI2和GPFM在内的更大模型,同时在平均上保留了Virchow2的99.71%的AUC。为了量化准确性与效率之间的平衡,我们提出了可部署性评分(D-Score),定义为标准化AUC和标准化FLOP的加权几何平均值,其中LitePath达到了最高值,超越Virchow2 10.64%。这些结果表明,LitePath能够在可及硬件上实现快速、经济和节能的病理图像分析,同时保持与最先进的PFMs相当的准确性,并减少AI部署的碳足迹。
cs.CV / 87 / 2602.14021

Flow4R: Unifying 4D Reconstruction and Tracking with Scene Flow

Flow4R:通过场景流统一四维重建与跟踪
Qian, Shenhan, Zhang, Ganlin, Wu, Shangzhe, Cremers, Daniel
Abstract
Reconstructing and tracking dynamic 3D scenes remains a fundamental challenge in computer vision. Existing approaches often decouple geometry from motion: multi-view reconstruction methods assume static scenes, while dynamic tracking frameworks rely on explicit camera pose estimation or separate motion models. We propose Flow4R, a unified framework that treats camera-space scene flow as the central representation linking 3D structure, object motion, and camera motion. Flow4R predicts a minimal per-pixel property set-3D point position, scene flow, pose weight, and confidence-from two-view inputs using a Vision Transformer. This flow-centric formulation allows local geometry and bidirectional motion to be inferred symmetrically with a shared decoder in a single forward pass, without requiring explicit pose regressors or bundle adjustment. Trained jointly on static and dynamic datasets, Flow4R achieves state-of-the-art performance on 4D reconstruction and tracking tasks, demonstrating the effectiveness of the flow-central representation for spatiotemporal scene understanding.
Chinese Translation
重建和跟踪动态三维场景仍然是计算机视觉中的一项基本挑战。现有方法通常将几何与运动解耦:多视角重建方法假设场景是静态的,而动态跟踪框架依赖于显式的相机姿态估计或独立的运动模型。我们提出了Flow4R,一个统一的框架,将相机空间中的场景流视为连接三维结构、物体运动和相机运动的核心表示。Flow4R使用视觉变换器从双视图输入中预测最小的每像素属性集——三维点位置、场景流、姿态权重和置信度。该以流为中心的公式化允许在单次前向传递中,通过共享解码器对局部几何和双向运动进行对称推断,而无需显式的姿态回归器或束调整。Flow4R在静态和动态数据集上共同训练,在四维重建和跟踪任务中实现了最先进的性能,展示了以流为中心的表示在时空场景理解中的有效性。
cs.CV / 88 / 2602.14027

Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation

短期训练,长期推理:无训练的自回归视频生成视野扩展
Li, Jia, Fu, Xiaomeng, Peng, Xurui, Chen, Weifeng, Zheng, Youwei, Zhao, Tianyu, Wang, Jiexi, Chen, Fangmin, Wang, Xing, So, Hayden Kwok-Hay
Abstract
Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the \textit{spectral bias} of 3D positional embeddings and the lack of \textit{dynamic priors} in noise sampling. To address these issues, we propose \textbf{FLEX} (\textbf{F}requency-aware \textbf{L}ength \textbf{EX}tension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at $6\times$ extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at $12\times$ scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at \href{https://ga-lee.github.io/FLEX_demo}{https://ga-lee.github.io/FLEX}.
Chinese Translation
自回归视频扩散模型已成为一种可扩展的长视频生成范式。然而,它们常常面临严重的外推失败,快速的误差积累导致在超出训练视野时显著的时间降解。我们发现,这种失败主要源于3D位置嵌入的 extit{频谱偏差}以及噪声采样中缺乏 extit{动态先验}。为了解决这些问题,我们提出了 extbf{FLEX}( extbf{F}requency-aware extbf{L}ength extbf{EX}tension),一种无训练的推理时框架,旨在弥合短期训练与长期推理之间的差距。FLEX引入了频率感知的RoPE调制,以自适应地插值训练不足的低频成分,同时外推高频成分,以保持多尺度时间可辨别性。这与反相噪声采样(Antiphase Noise Sampling, ANS)相结合,以注入高频动态先验,并通过仅推理的注意力沉没(Inference-only Attention Sink)来锚定全局结构。在VBench上的广泛评估表明,FLEX在$6 imes$外推(30秒时长)时显著优于最先进的模型,并在$12 imes$规模(60秒时长)时与长视频微调基线的性能相匹配。作为一种即插即用的增强,FLEX无缝集成到现有的推理管道中以进行视野扩展。它有效地推动了如LongLive等模型的生成极限,支持在4分钟规模下的一致和动态视频合成。项目页面可访问 exttt{https://ga-lee.github.io/FLEX_demo}。
cs.CV / 89 / 2602.14040

Explainability-Inspired Layer-Wise Pruning of Deep Neural Networks for Efficient Object Detection

基于可解释性的深度神经网络逐层剪枝方法用于高效目标检测
Shukla, Abhinav, Tapas, Nachiket
Abstract
Deep neural networks (DNNs) have achieved remarkable success in object detection tasks, but their increasing complexity poses significant challenges for deployment on resource-constrained platforms. While model compression techniques such as pruning have emerged as essential tools, traditional magnitude-based pruning methods do not necessarily align with the true functional contribution of network components to task-specific performance. In this work, we present an explainability-inspired, layer-wise pruning framework tailored for efficient object detection. Our approach leverages a SHAP-inspired gradient--activation attribution to estimate layer importance, providing a data-driven proxy for functional contribution rather than relying solely on static weight magnitudes. We conduct comprehensive experiments across diverse object detection architectures, including ResNet-50, MobileNetV2, ShuffleNetV2, Faster R-CNN, RetinaNet, and YOLOv8, evaluating performance on the Microsoft COCO 2017 validation set. The results show that the proposed attribution-inspired pruning consistently identifies different layers as least important compared to L1-norm-based methods, leading to improved accuracy--efficiency trade-offs. Notably, for ShuffleNetV2, our method yields a 10\% empirical increase in inference speed, whereas L1-pruning degrades performance by 13.7\%. For RetinaNet, the proposed approach preserves the baseline mAP (0.151) with negligible impact on inference speed, while L1-pruning incurs a 1.3\% mAP drop for a 6.2\% speed increase. These findings highlight the importance of data-driven layer importance assessment and demonstrate that explainability-inspired compression offers a principled direction for deploying deep neural networks on edge and resource-constrained platforms while preserving both performance and interpretability.
Chinese Translation
深度神经网络(DNN)在目标检测任务中取得了显著成功,但其日益复杂性给在资源受限平台上的部署带来了重大挑战。尽管模型压缩技术如剪枝已成为重要工具,但传统的基于权重大小的剪枝方法并不一定与网络组件对特定任务性能的真实功能贡献相一致。在本研究中,我们提出了一种基于可解释性的逐层剪枝框架,旨在提高目标检测的效率。我们的方法利用受SHAP启发的梯度-激活归因来估计层的重要性,提供了一种基于数据的功能贡献代理,而不是仅依赖于静态权重大小。我们在多种目标检测架构上进行了全面实验,包括ResNet-50、MobileNetV2、ShuffleNetV2、Faster R-CNN、RetinaNet和YOLOv8,并在Microsoft COCO 2017验证集上评估性能。结果表明,所提出的基于归因的剪枝方法在识别不同层的重要性方面始终优于基于L1范数的方法,从而改善了准确性与效率的权衡。值得注意的是,对于ShuffleNetV2,我们的方法在推理速度上实现了10%的实证提升,而L1剪枝则导致性能下降13.7%。对于RetinaNet,所提出的方法在几乎不影响推理速度的情况下保持了基线mAP(0.151),而L1剪枝则导致mAP下降1.3%,以换取6.2%的速度提升。这些发现强调了基于数据的层重要性评估的重要性,并证明了基于可解释性的压缩为在边缘和资源受限平台上部署深度神经网络提供了一个有原则的方向,同时保持了性能和可解释性。
cs.CV / 90 / 2602.14041

BitDance: Scaling Autoregressive Generative Models with Binary Tokens

BitDance:使用二进制令牌扩展自回归生成模型
Ai, Yuang, Han, Jiaming, Zhuang, Shaobin, Mao, Weijia, Hu, Xuefeng, Yang, Ziyan, Yang, Zhenheng, Huang, Huaibo, Yue, Xiangyu, Chen, Hao
Abstract
We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to $2^{256}$ states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous-space diffusion to generate the binary tokens. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance beats state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and achieving 8.7x speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and generates high-resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves a speedup of over 30x compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: https://github.com/shallowdream204/BitDance.
Chinese Translation
我们提出了BitDance,一种可扩展的自回归(AR)图像生成器,它预测二进制视觉令牌而不是代码本索引。通过高熵二进制潜变量,BitDance使每个令牌能够表示多达$2^{256}$种状态,从而实现紧凑而高度表达的离散表示。从如此庞大的令牌空间进行采样在标准分类中是困难的。为了解决这个问题,BitDance使用了二进制扩散头:它不是通过softmax预测索引,而是采用连续空间扩散生成二进制令牌。此外,我们提出了下一补丁扩散,这是一种新的解码方法,可以高精度并行预测多个令牌,大大加快推理速度。在ImageNet 256x256上,BitDance实现了1.24的FID,这是所有AR模型中最好的。通过下一补丁扩散,BitDance超越了使用14亿参数的最先进并行AR模型,同时使用了5.4倍更少的参数(260M),并实现了8.7倍的加速。在文本到图像生成方面,BitDance在大规模多模态令牌上进行训练,并高效生成高分辨率、逼真的图像,展现出强大的性能和良好的扩展性。在生成1024x1024图像时,BitDance相比于之前的AR模型实现了超过30倍的加速。我们发布了代码和模型,以促进对AR基础模型的进一步研究。代码和模型可在以下地址获取:https://github.com/shallowdream204/BitDance。
cs.CV / 91 / 2602.14042

Restoration Adaptation for Semantic Segmentation on Low Quality Images

低质量图像语义分割的恢复适应
Guan, Kai, Wu, Rongyuan, Li, Shuai, Zhu, Wentao, Zeng, Wenjun, Zhang, Lei
Abstract
In real-world scenarios, the performance of semantic segmentation often deteriorates when processing low-quality (LQ) images, which may lack clear semantic structures and high-frequency details. Although image restoration techniques offer a promising direction for enhancing degraded visual content, conventional real-world image restoration (Real-IR) models primarily focus on pixel-level fidelity and often fail to recover task-relevant semantic cues, limiting their effectiveness when directly applied to downstream vision tasks. Conversely, existing segmentation models trained on high-quality data lack robustness under real-world degradations. In this paper, we propose Restoration Adaptation for Semantic Segmentation (RASS), which effectively integrates semantic image restoration into the segmentation process, enabling high-quality semantic segmentation on the LQ images directly. Specifically, we first propose a Semantic-Constrained Restoration (SCR) model, which injects segmentation priors into the restoration model by aligning its cross-attention maps with segmentation masks, encouraging semantically faithful image reconstruction. Then, RASS transfers semantic restoration knowledge into segmentation through LoRA-based module merging and task-specific fine-tuning, thereby enhancing the model's robustness to LQ images. To validate the effectiveness of our framework, we construct a real-world LQ image segmentation dataset with high-quality annotations, and conduct extensive experiments on both synthetic and real-world LQ benchmarks. The results show that SCR and RASS significantly outperform state-of-the-art methods in segmentation and restoration tasks. Code, models, and datasets will be available at https://github.com/Ka1Guan/RASS.git.
Chinese Translation
在现实场景中,处理低质量(LQ)图像时,语义分割的性能往往会下降,这些图像可能缺乏清晰的语义结构和高频细节。尽管图像恢复技术为增强退化的视觉内容提供了一个有前景的方向,但传统的现实世界图像恢复(Real-IR)模型主要关注像素级的保真度,往往无法恢复与任务相关的语义线索,从而限制了它们在下游视觉任务中的有效性。相反,现有在高质量数据上训练的分割模型在现实世界的退化下缺乏鲁棒性。本文提出了用于语义分割的恢复适应(RASS),有效地将语义图像恢复整合到分割过程中,从而直接在低质量图像上实现高质量的语义分割。具体而言,我们首先提出了一种语义约束恢复(SCR)模型,通过将其交叉注意力图与分割掩码对齐,将分割先验注入恢复模型中,鼓励语义上忠实的图像重建。然后,RASS通过基于LoRA的模块合并和任务特定的微调,将语义恢复知识转移到分割中,从而增强模型对低质量图像的鲁棒性。为了验证我们框架的有效性,我们构建了一个具有高质量注释的现实世界低质量图像分割数据集,并在合成和现实世界的低质量基准上进行了广泛的实验。结果表明,SCR和RASS在分割和恢复任务中显著优于最先进的方法。代码、模型和数据集将可在 https://github.com/Ka1Guan/RASS.git 获取。
cs.CV / 92 / 2602.14068

CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning

CoCoEdit:通过区域正则化强化学习实现内容一致的图像编辑
Wu, Yuhui, Xie, Chenxi, Li, Ruibin, Chen, Liyi, Yi, Qiaosi, Zhang, Lei
Abstract
Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high quality samples are curated as training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only competitive editing scores with state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings.
Chinese Translation
随着大规模生成模型的发展,图像编辑取得了令人瞩目的成果。然而,现有模型主要关注于目标对象和区域的编辑效果,往往导致非目标区域出现不必要的变化。我们提出了一种基于区域正则化强化学习的内容一致编辑(CoCoEdit)后训练框架。我们首先通过精细化的指令和掩码增强现有的编辑数据集,从中策划出40K多样且高质量的样本作为训练集。然后,我们引入了一种像素级相似性奖励,以补充基于MLLM的奖励,使模型在编辑过程中能够确保编辑质量和内容一致性。为了克服奖励的空间无关特性,我们提出了一种基于区域的正则化方法,旨在为高奖励样本保留未编辑区域,同时鼓励低奖励样本的编辑效果。在评估方面,我们为GEdit-Bench和ImgEdit-Bench注释了编辑掩码,引入像素级相似性指标来衡量内容一致性和编辑质量。将CoCoEdit应用于Qwen-Image-Edit和FLUX-Kontext,我们不仅在编辑得分上达到了与最先进模型的竞争水平,而且在内容一致性方面显著更好,通过PSNR/SSIM指标和人类主观评分进行测量。
cs.CV / 93 / 2602.14098

ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization

ForgeryVCR:通过高效取证工具实现的以视觉为中心的推理框架,用于图像伪造检测和定位
Wang, Youqi, Chen, Shen, Wang, Haowei, Peng, Rongxuan, Yao, Taiping, Tan, Shunquan, Chen, Changsheng, Li, Bin, Ding, Shouhong
Abstract
Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as linguistic modalities are insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and frequency domains. Extensive experiments reveal that ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at https://youqiwong.github.io/projects/ForgeryVCR/.
Chinese Translation
现有的多模态大型语言模型(MLLMs)在图像伪造检测和定位中主要采用以文本为中心的思维链(Chain-of-Thought, CoT)范式。然而,强迫这些模型以文本方式描述不可察觉的低级篡改痕迹不可避免地会导致幻觉,因为语言模态不足以捕捉如此细粒度的像素级不一致性。为了解决这一问题,我们提出了ForgeryVCR,一个框架,它结合了取证工具箱,通过以视觉为中心的推理将不可察觉的痕迹转化为明确的视觉中介。为了实现高效的工具利用,我们引入了一种战略工具学习后训练范式,包括基于收益驱动的轨迹构建,用于监督微调(Supervised Fine-Tuning, SFT)以及随后由工具效用奖励指导的强化学习(Reinforcement Learning, RL)优化。该范式使得MLLM能够作为一个主动的决策者,学习自发调用多视角推理路径,包括局部放大以进行细粒度检查,以及分析压缩历史、噪声残差和频域中的不可见不一致性。大量实验表明,ForgeryVCR在检测和定位任务中均实现了最先进的(SOTA)性能,展现出卓越的泛化能力和鲁棒性,同时工具冗余最小。项目页面可访问:https://youqiwong.github.io/projects/ForgeryVCR/.
cs.CV / 94 / 2602.14119

GeoFusionLRM: Geometry-Aware Self-Correction for Consistent 3D Reconstruction

GeoFusionLRM:面向几何的自我修正以实现一致的三维重建
Yildirim, Ahmet Burak, Saygin, Tuna, Ceylan, Duygu, Dundar, Aysegul
Abstract
Single-image 3D reconstruction with large reconstruction models (LRMs) has advanced rapidly, yet reconstructions often exhibit geometric inconsistencies and misaligned details that limit fidelity. We introduce GeoFusionLRM, a geometry-aware self-correction framework that leverages the model's own normal and depth predictions to refine structural accuracy. Unlike prior approaches that rely solely on features extracted from the input image, GeoFusionLRM feeds back geometric cues through a dedicated transformer and fusion module, enabling the model to correct errors and enforce consistency with the conditioning image. This design improves the alignment between the reconstructed mesh and the input views without additional supervision or external signals. Extensive experiments demonstrate that GeoFusionLRM achieves sharper geometry, more consistent normals, and higher fidelity than state-of-the-art LRM baselines.
Chinese Translation
单幅图像的三维重建在大型重建模型(LRMs)上取得了快速进展,但重建结果常常表现出几何不一致和细节错位的问题,限制了重建的真实感。我们提出了GeoFusionLRM,一种面向几何的自我修正框架,利用模型自身的法线和深度预测来提高结构的准确性。与以往仅依赖输入图像提取特征的方法不同,GeoFusionLRM通过专用的变换器和融合模块反馈几何线索,使模型能够纠正错误并与条件图像保持一致。这一设计在没有额外监督或外部信号的情况下,改善了重建网格与输入视图之间的对齐。大量实验表明,GeoFusionLRM在几何清晰度、一致法线和真实感方面优于最先进的LRM基线。
cs.CV / 95 / 2602.14122

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

EgoSound:在自我中心视频中评估声音理解的基准
Zhu, Bingwen, Fu, Yuqian, Dong, Qiaole, Sun, Guolei, Qian, Tianwen, Wu, Yuzheng, Paudel, Danda Pani, Xue, Xiangyang, Fu, Yanwei
Abstract
Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.
Chinese Translation
多模态大型语言模型(MLLMs)最近在视觉-语言理解方面取得了显著进展。然而,人类感知本质上是多感官的,整合了视觉、听觉和运动来推理世界。在这些模态中,声音提供了关于空间布局、屏幕外事件和因果交互的重要线索,特别是在自我中心环境中,听觉和视觉信号紧密结合。为此,我们提出了EgoSound,这是第一个旨在系统评估MLLMs中自我中心声音理解的基准。EgoSound统一了来自Ego4D和EgoBlind的数据,涵盖了视觉和依赖声音的体验。它定义了一个涵盖内在声音感知、空间定位、因果推理和跨模态推理的七任务分类法。EgoSound通过多阶段自生成管道构建,包含7315个经过验证的问答对,分布在900个视频中。在对九个最先进的MLLMs进行的全面实验中,结果显示当前模型展现出新兴的听觉推理能力,但在细粒度空间和因果理解方面仍然有限。EgoSound为推动多感官自我中心智能奠定了具有挑战性的基础,弥合了观察与真正听到世界之间的差距。
cs.CV / 96 / 2602.14134

DenseMLLM: Standard Multimodal LLMs are Intrinsic Dense Predictors

DenseMLLM:标准多模态大语言模型是内在的密集预测器
Li, Yi, Shen, Hongze, Tang, Lexiang, Li, Xin, Ding, Xinpeng, Liu, Yinsong, Jiang, Deqiang, Sun, Xing, Li, Xiaomeng
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization.
Chinese Translation
多模态大语言模型(MLLMs)在高水平视觉理解方面展现了卓越的能力。然而,将这些模型扩展到细粒度密集预测任务,如语义分割和深度估计,通常需要引入复杂的任务特定解码器和其他定制化。这种架构的碎片化增加了模型的复杂性,并偏离了MLLMs的通用设计,最终限制了它们的实用性。在本研究中,我们挑战了这一范式,使标准MLLMs能够执行密集预测,而无需额外的任务特定解码器。我们提出的模型称为DenseMLLM,基于标准架构,并采用了一种新颖的视觉标记监督策略,以支持多个标签和任务。尽管设计简约,我们的模型在广泛的密集预测和视觉-语言基准测试中实现了高度竞争的性能,证明了标准的通用MLLM可以有效支持密集感知,而无需架构专业化。
cs.CV / 97 / 2602.14140

Detection of On-Ground Chestnuts Using Artificial Intelligence Toward Automated Picking

利用人工智能检测地面栗子以实现自动采摘
Fang, Kaixuan, Lu, Yuzhen, Mu, Xinyang
Abstract
Traditional mechanized chestnut harvesting is too costly for small producers, non-selective, and prone to damaging nuts. Accurate, reliable detection of chestnuts on the orchard floor is crucial for developing low-cost, vision-guided automated harvesting technology. However, developing a reliable chestnut detection system faces challenges in complex environments with shading, varying natural light conditions, and interference from weeds, fallen leaves, stones, and other foreign on-ground objects, which have remained unaddressed. This study collected 319 images of chestnuts on the orchard floor, containing 6524 annotated chestnuts. A comprehensive set of 29 state-of-the-art real-time object detectors, including 14 in the YOLO (v11-13) and 15 in the RT-DETR (v1-v4) families at varied model scales, was systematically evaluated through replicated modeling experiments for chestnut detection. Experimental results show that the YOLOv12m model achieves the best [email protected] of 95.1% among all the evaluated models, while the RT-DETRv2-R101 was the most accurate variant among RT-DETR models, with [email protected] of 91.1%. In terms of mAP@[0.5:0.95], the YOLOv11x model achieved the best accuracy of 80.1%. All models demonstrate significant potential for real-time chestnut detection, and YOLO models outperformed RT-DETR models in terms of both detection accuracy and inference, making them better suited for on-board deployment. Both the dataset and software programs in this study have been made publicly available at https://github.com/AgFood-Sensing-and-Intelligence-Lab/ChestnutDetection.
Chinese Translation
传统的机械化栗子收获对于小型生产者来说成本过高,且不具选择性,容易损坏栗子。准确、可靠地检测果园地面上的栗子对于开发低成本、视觉引导的自动化收获技术至关重要。然而,开发可靠的栗子检测系统面临着复杂环境中的挑战,包括阴影、变化的自然光条件,以及来自杂草、落叶、石头和其他外来地面物体的干扰,这些问题尚未得到解决。本研究收集了319张果园地面上栗子的图像,包含6524个标注的栗子。通过重复建模实验,系统评估了一套包括29种最先进的实时目标检测器的综合模型,其中14种属于YOLO(v11-13)系列,15种属于RT-DETR(v1-v4)系列,涵盖不同的模型规模。实验结果表明,YOLOv12m模型在所有评估模型中实现了最佳的[email protected]为95.1%,而RT-DETRv2-R101是RT-DETR模型中最准确的变体,[email protected]为91.1%。在mAP@[0.5:0.95]方面,YOLOv11x模型实现了最佳的准确率为80.1%。所有模型在实时栗子检测中展现出显著潜力,且YOLO模型在检测准确性和推理速度方面均优于RT-DETR模型,更适合于车载部署。本研究中的数据集和软件程序已公开发布,网址为https://github.com/AgFood-Sensing-and-Intelligence-Lab/ChestnutDetection。
cs.CV / 98 / 2602.14147

LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

LaViDa-R1:推进统一多模态扩散语言模型的推理能力
Li, Shufan, Zhu, Yuchen, Gu, Jiuxiang, Liu, Kangning, Lin, Zhe, Chen, Yongxin, Tao, Molei, Grover, Aditya, Kuen, Jason
Abstract
Diffusion language models (dLLMs) recently emerged as a promising alternative to auto-regressive LLMs. The latest works further extended it to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.
Chinese Translation
扩散语言模型(dLLMs)最近作为自回归语言模型(LLMs)的有希望的替代方案而出现。最新的研究进一步将其扩展到多模态理解和生成任务。在本研究中,我们提出了LaViDa-R1,一种多模态通用推理的dLLM。与现有通过任务特定的强化学习构建推理dLLMs的工作不同,LaViDa-R1以统一的方式整合了多样的多模态理解和生成任务。特别地,LaViDa-R1采用了一种新颖的统一后训练框架,能够无缝集成监督微调(SFT)和多任务强化学习(RL)。它采用了多种新颖的训练技术,包括答案强制、树搜索和互补似然估计,以增强有效性和可扩展性。大量实验表明,LaViDa-R1在广泛的多模态任务上表现出色,包括视觉数学推理、推理密集的基础知识和图像编辑。
cs.CV / 99 / 2602.14153

ARport: An Augmented Reality System for Markerless Image-Guided Port Placement in Robotic Surgery

ARport:一种用于机器人手术中无标记图像引导端口放置的增强现实系统
Han, Zheng, Yang, Zixin, Long, Yonghao, Zhang, Lin, Kazanzides, Peter, Dou, Qi
Abstract
Purpose: Precise port placement is a critical step in robot-assisted surgery, where port configuration influences both visual access to the operative field and instrument maneuverability. To bridge the gap between preoperative planning and intraoperative execution, we present ARport, an augmented reality (AR) system that automatically maps pre-planned trocar layouts onto the patient's body surface, providing intuitive spatial guidance during surgical preparation. Methods: ARport, implemented on an optical see-through head-mounted display (OST-HMD), operates without any external sensors or markers, simplifying setup and enhancing workflow integration. It reconstructs the operative scene from RGB, depth, and pose data captured by the OST-HMD, extracts the patient's body surface using a foundation model, and performs surface-based markerless registration to align preoperative anatomical models to the extracted patient's body surface, enabling in-situ visualization of planned trocar layouts. A demonstration video illustrating the overall workflow is available online. Results: In full-scale human-phantom experiments, ARport accurately overlaid pre-planned trocar sites onto the physical phantom, achieving consistent spatial correspondence between virtual plans and real anatomy. Conclusion: ARport provides a fully marker-free and hardware-minimal solution for visualizing preoperative trocar plans directly on the patient's body surface. The system facilitates efficient intraoperative setup and demonstrates potential for seamless integration into routine clinical workflows.
Chinese Translation
目的:精确的端口放置是机器人辅助手术中的关键步骤,端口配置影响手术视野的可视化和器械的操作灵活性。为了弥合术前规划与术中执行之间的差距,我们提出了ARport,一种增强现实(AR)系统,它能够自动将预先规划的 trocar 布局映射到患者的体表,为手术准备提供直观的空间指导。方法:ARport 在光学透视头戴显示器(OST-HMD)上实现,操作时无需任何外部传感器或标记,简化了设置并增强了工作流程的整合。它通过 OST-HMD 捕获的 RGB、深度和姿态数据重建手术场景,利用基础模型提取患者的体表,并执行基于表面的无标记配准,将术前解剖模型与提取的患者体表对齐,从而实现预定 trocar 布局的现场可视化。在线提供了演示视频以展示整体工作流程。结果:在全尺寸人形实验中,ARport 准确地将预先规划的 trocar 位置叠加到物理人形上,实现了虚拟计划与真实解剖之间的一致空间对应。结论:ARport 提供了一种完全无标记且硬件需求最小的解决方案,能够直接在患者的体表上可视化术前的 trocar 计划。该系统促进了高效的术中设置,并展示了与常规临床工作流程无缝集成的潜力。
cs.CV / 100 / 2602.14157

When Test-Time Guidance Is Enough: Fast Image and Video Editing with Diffusion Guidance

当测试时引导足够时:基于扩散引导的快速图像和视频编辑
Ghorbel, Ahmed, Moufad, Badr, Shouraki, Navid Bagheri, Durmus, Alain Oliviero, Hirtz, Thomas, Moulines, Eric, Olsson, Jimmy, Janati, Yazid
Abstract
Text-driven image and video editing can be naturally cast as inpainting problems, where masked regions are reconstructed to remain consistent with both the observed content and the editing prompt. Recent advances in test-time guidance for diffusion and flow models provide a principled framework for this task; however, existing methods rely on costly vector--Jacobian product (VJP) computations to approximate the intractable guidance term, limiting their practical applicability. Building upon the recent work of Moufad et al. (2025), we provide theoretical insights into their VJP-free approximation and substantially extend their empirical evaluation to large-scale image and video editing benchmarks. Our results demonstrate that test-time guidance alone can achieve performance comparable to, and in some cases surpass, training-based methods.
Chinese Translation
基于文本的图像和视频编辑可以自然地视为修补问题,其中被遮挡的区域被重建,以保持与观察到的内容和编辑提示的一致性。最近在扩散和流模型的测试时引导方面的进展为这一任务提供了一个原则性框架;然而,现有方法依赖于昂贵的向量-雅可比乘积(VJP)计算来近似不可处理的引导项,限制了它们的实际应用性。在Moufad等人(2025)的最新工作基础上,我们提供了对其无VJP近似的理论见解,并大幅扩展了其在大规模图像和视频编辑基准上的实证评估。我们的结果表明,仅凭测试时引导就能实现与基于训练的方法相当的性能,在某些情况下甚至超过它们。
cs.CV / 101 / 2602.14177

Towards Spatial Transcriptomics-driven Pathology Foundation Models

基于空间转录组学的病理基础模型研究
Hemker, Konstantin, Song, Andrew H., Almagro-Pérez, Cristina, Jaume, Guillaume, Wagner, Sophia J., Vaidya, Anurag, Simidjievski, Nikola, Jamnik, Mateja, Mahmood, Faisal
Abstract
Spatial transcriptomics (ST) provides spatially resolved measurements of gene expression, enabling characterization of the molecular landscape of human tissue beyond histological assessment as well as localized readouts that can be aligned with morphology. Concurrently, the success of multimodal foundation models that integrate vision with complementary modalities suggests that morphomolecular coupling between local expression and morphology can be systematically used to improve histological representations themselves. We introduce Spatial Expression-Aligned Learning (SEAL), a vision-omics self-supervised learning framework that infuses localized molecular information into pathology vision encoders. Rather than training new encoders from scratch, SEAL is designed as a parameter-efficient vision-omics finetuning method that can be flexibly applied to widely used pathology foundation models. We instantiate SEAL by training on over 700,000 paired gene expression spot-tissue region examples spanning tumor and normal samples from 14 organs. Tested across 38 slide-level and 15 patch-level downstream tasks, SEAL provides a drop-in replacement for pathology foundation models that consistently improves performance over widely used vision-only and ST prediction baselines on slide-level molecular status, pathway activity, and treatment response prediction, as well as patch-level gene expression prediction tasks. Additionally, SEAL encoders exhibit robust domain generalization on out-of-distribution evaluations and enable new cross-modal capabilities such as gene-to-image retrieval. Our work proposes a general framework for ST-guided finetuning of pathology foundation models, showing that augmenting existing models with localized molecular supervision is an effective and practical step for improving visual representations and expanding their cross-modal utility.
Chinese Translation
空间转录组学(ST)提供了基因表达的空间分辨测量,使得能够超越组织学评估对人类组织的分子景观进行表征,并且能够与形态学对齐的局部读数。同时,成功的多模态基础模型将视觉与互补模态相结合,表明局部表达与形态学之间的形态分子耦合可以系统性地用于改善组织学表征本身。我们提出了空间表达对齐学习(SEAL),这是一种视觉-组学自监督学习框架,将局部分子信息注入病理视觉编码器。SEAL并不是从头开始训练新的编码器,而是设计为一种参数高效的视觉-组学微调方法,可以灵活应用于广泛使用的病理基础模型。我们通过对来自14个器官的肿瘤和正常样本的70万对基因表达点-组织区域示例进行训练来实例化SEAL。在38个幻灯片级和15个补丁级下游任务中测试,SEAL提供了一种可替代的病理基础模型,持续在幻灯片级分子状态、通路活性和治疗反应预测,以及补丁级基因表达预测任务上超越广泛使用的仅视觉和ST预测基线。此外,SEAL编码器在分布外评估中表现出强大的领域泛化能力,并且启用了新的跨模态能力,如基因到图像的检索。我们的工作提出了一个用于ST引导的病理基础模型微调的通用框架,表明通过局部分子监督增强现有模型是改善视觉表征和扩展其跨模态实用性的有效且实用的步骤。
cs.CV / 102 / 2602.14178

UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model

UniWeTok:一种统一的二进制分词器,具有大小为 $ extit{2^{128}}$ 的代码本,用于统一的多模态大型语言模型
Zhuang, Shaobin, Ai, Yuang, Han, Jiaming, Mao, Weijia, Li, Xiaohui, Wang, Fangyikang, Wang, Xiao, Li, Yan, Lin, Shanchuan, Xu, Kun, Yang, Zhenheng, Huang, Huaibo, Yue, Xiangyu, Chen, Hao, Wang, Yali
Abstract
Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ($\mathit{2^{128}}$). For training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. SigLu activation not only bounds the encoder output and stabilizes the semantic distillation process but also effectively addresses the optimization conflict between token entropy loss and commitment loss. We further propose a three-stage training framework designed to enhance UniWeTok's adaptability cross various image resolutions and perception-sensitive scenarios, such as those involving human faces and textual content. On ImageNet, UniWeTok achieves state-of-the-art image generation performance (FID: UniWeTok 1.38 vs. REPA 1.42) while requiring a remarkably low training compute (Training Tokens: UniWeTok 33B vs. REPA 262B). On general-domain, UniWeTok demonstrates highly competitive capabilities across a broad range of tasks, including multimodal understanding, image generation (DPG Score: UniWeTok 86.63 vs. FLUX.1 [Dev] 83.84), and editing (GEdit Overall Score: UniWeTok 5.09 vs. OmniGen 5.06). We release code and models to facilitate community exploration of unified tokenizer and MLLM.
Chinese Translation
统一的多模态大型语言模型(MLLMs)需要一种视觉表示,能够同时支持高保真重建、复杂语义提取和生成适应性。然而,现有的视觉分词器通常难以在单一框架内满足这些相互矛盾的目标。在本文中,我们介绍了 UniWeTok,这是一种统一的离散分词器,旨在通过使用一个巨大的二进制代码本($ extit{2^{128}}$)来填补这一空白。在训练框架方面,我们引入了预后蒸馏(Pre-Post Distillation)和生成感知先验(Generative-Aware Prior),以增强离散分词的语义提取和生成先验。在模型架构方面,我们提出了一种卷积-注意力混合架构,采用 SigLu 激活函数。SigLu 激活不仅限制了编码器输出并稳定了语义蒸馏过程,还有效解决了分词熵损失与承诺损失之间的优化冲突。我们进一步提出了一种三阶段训练框架,旨在增强 UniWeTok 在各种图像分辨率和感知敏感场景(如涉及人脸和文本内容的场景)中的适应性。在 ImageNet 上,UniWeTok 实现了最先进的图像生成性能(FID: UniWeTok 1.38 vs. REPA 1.42),同时所需的训练计算量极低(训练分词:UniWeTok 33B vs. REPA 262B)。在通用领域,UniWeTok 在多种任务中展现出高度竞争的能力,包括多模态理解、图像生成(DPG 分数:UniWeTok 86.63 vs. FLUX.1 [Dev] 83.84)和编辑(GEdit 总分:UniWeTok 5.09 vs. OmniGen 5.06)。我们发布了代码和模型,以促进社区对统一分词器和 MLLM 的探索。
cs.CV / 103 / 2602.14186

UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing

UniRef-Image-Edit:迈向可扩展且一致的多参考图像编辑
Wei, Hongyang, Wen, Bin, Long, Yancheng, Yang, Yankai, Hu, Yuhang, Zhang, Tianke, Chen, Wei, Fan, Haonan, Jiang, Kaiyu, Chen, Jiankang, Liu, Changyi, Tang, Kaiyu, Ding, Haojie, Yang, Xiao, Sun, Jia, Wang, Huaiqing, Yang, Zhenyu, Wei, Xinyu, He, Xianglong, Li, Yangguang, Yang, Fan, Gao, Tingting, Zhang, Lei, Zhou, Guorui, Li, Han
Abstract
We present UniRef-Image-Edit, a high-performance multi-modal generation system that unifies single-image editing and multi-image composition within a single framework. Existing diffusion-based editing methods often struggle to maintain consistency across multiple conditions due to limited interaction between reference inputs. To address this, we introduce Sequence-Extended Latent Fusion (SELF), a unified input representation that dynamically serializes multiple reference images into a coherent latent sequence. During a dedicated training stage, all reference images are jointly constrained to fit within a fixed-length sequence under a global pixel-budget constraint. Building upon SELF, we propose a two-stage training framework comprising supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we jointly train on single-image editing and multi-image composition tasks to establish a robust generative prior. We adopt a progressive sequence length training strategy, in which all input images are initially resized to a total pixel budget of $1024^2$, and are then gradually increased to $1536^2$ and $2048^2$ to improve visual fidelity and cross-reference consistency. This gradual relaxation of compression enables the model to incrementally capture finer visual details while maintaining stable alignment across references. For the RL stage, we introduce Multi-Source GRPO (MSGRPO), to our knowledge the first reinforcement learning framework tailored for multi-reference image generation. MSGRPO optimizes the model to reconcile conflicting visual constraints, significantly enhancing compositional consistency. We will open-source the code, models, training data, and reward data for community research purposes.
Chinese Translation
我们提出了UniRef-Image-Edit,这是一个高性能的多模态生成系统,统一了单图像编辑和多图像合成于一个框架内。现有的基于扩散的编辑方法常常由于参考输入之间的有限交互而难以在多个条件下保持一致性。为了解决这个问题,我们引入了序列扩展潜在融合(Sequence-Extended Latent Fusion,SELF),这是一种统一的输入表示,它动态地将多个参考图像序列化为一个连贯的潜在序列。在专门的训练阶段,所有参考图像被共同约束以适应在全球像素预算约束下的固定长度序列。在此基础上,我们提出了一个两阶段的训练框架,包括监督微调(Supervised Fine-Tuning,SFT)和强化学习(Reinforcement Learning,RL)。在SFT阶段,我们在单图像编辑和多图像合成任务上进行联合训练,以建立一个稳健的生成先验。我们采用了逐步序列长度训练策略,所有输入图像最初被调整为总像素预算为$1024^2$,然后逐渐增加到$1536^2$和$2048^2$以提高视觉保真度和跨参考一致性。这种逐步放宽压缩的方式使模型能够逐渐捕捉更细致的视觉细节,同时保持参考之间的稳定对齐。在RL阶段,我们引入了多源GRPO(Multi-Source GRPO),据我们所知,这是第一个专为多参考图像生成量身定制的强化学习框架。MSGRPO优化模型以调和相互冲突的视觉约束,显著增强合成一致性。我们将开源代码、模型、训练数据和奖励数据,以供社区研究之用。
cs.CV / 104 / 2602.14201

GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

GeoEyes:基于证据的超高分辨率遥感图像理解的按需视觉聚焦
Wang, Fengxiang, Chen, Mingshuo, Li, Yueying, Yang, Yajie, Zhang, Yifan, Lan, Long, Yang, Xue, Sun, Hongda, Wang, Yulin, Wang, Di, Song, Jun, Zhang, Jing, Du, Bo
Abstract
The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.
Chinese Translation
“以图像思考”的范式使得多模态大型语言模型(MLLMs)能够通过放大工具主动探索视觉场景。这对于超高分辨率(UHR)遥感视觉问答(VQA)至关重要,因为任务相关线索稀疏且微小。然而,我们观察到现有的支持缩放的MLLMs存在一种一致的失败模式:工具使用同质化,在这种模式下,工具调用陷入任务无关的模式,从而限制了有效证据的获取。为了解决这个问题,我们提出了GeoEyes,一个分阶段的训练框架,包括(1)一个冷启动的监督微调(SFT)数据集UHR Chain-of-Zoom(UHR-CoZ),该数据集涵盖了多样的缩放模式,以及(2)一种自主强化学习方法AdaZoom-GRPO,该方法在缩放交互过程中明确奖励证据获取和答案改进。最终模型学习了按需缩放的适当停止行为,并在UHR遥感基准测试中取得了显著的改进,在XLRS-Bench上达到了54.23%的准确率。
cs.CV / 105 / 2602.14214

HiVid: LLM-Guided Video Saliency For Content-Aware VOD And Live Streaming

HiVid:基于大型语言模型的内容感知视频显著性用于点播和直播流媒体
Chen, Jiahui, Peng, Bo, Jia, Lianchen, Zhang, Zeyu, Huang, Tianchi, Sun, Lifeng
Abstract
Content-aware streaming requires dynamic, chunk-level importance weights to optimize subjective quality of experience (QoE). However, direct human annotation is prohibitively expensive while vision-saliency models generalize poorly. We introduce HiVid, the first framework to leverage Large Language Models (LLMs) as a scalable human proxy to generate high-fidelity weights for both Video-on-Demand (VOD) and live streaming. We address 3 non-trivial challenges: (1) To extend LLMs' limited modality and circumvent token limits, we propose a perception module to assess frames in a local context window, autoregressively building a coherent understanding of the video. (2) For VOD with rating inconsistency across local windows, we propose a ranking module to perform global re-ranking with a novel LLM-guided merge-sort algorithm. (3) For live streaming which requires low-latency, online inference without future knowledge, we propose a prediction module to predict future weights with a multi-modal time series model, which comprises a content-aware attention and adaptive horizon to accommodate asynchronous LLM inference. Extensive experiments show HiVid improves weight prediction accuracy by up to 11.5\% for VOD and 26\% for live streaming over SOTA baselines. Real-world user study validates HiVid boosts streaming QoE correlation by 14.7\%.
Chinese Translation
内容感知流媒体需要动态的、分块级别的重要性权重,以优化主观体验质量(QoE)。然而,直接的人类标注成本过高,而视觉显著性模型的泛化能力较差。我们提出了HiVid,这是第一个利用大型语言模型(LLMs)作为可扩展人类代理的框架,用于生成高保真度的权重,适用于点播(VOD)和直播流媒体。我们解决了三个非平凡的挑战:(1)为了扩展LLMs的有限模态并规避令牌限制,我们提出了一个感知模块,以在局部上下文窗口中评估帧,自回归地构建对视频的连贯理解。(2)对于在局部窗口中存在评分不一致的点播(VOD),我们提出了一个排名模块,通过一种新颖的LLM引导的归并排序算法进行全局重新排序。(3)对于需要低延迟、在线推理且不依赖未来知识的直播流媒体,我们提出了一个预测模块,利用多模态时间序列模型预测未来权重,该模型包括内容感知注意力和自适应时间范围,以适应异步LLM推理。大量实验表明,HiVid在点播(VOD)和直播流媒体中相较于最先进的基线提高了权重预测准确性,分别达到了11.5 ext{%}和26 ext{%}。真实世界用户研究验证了HiVid提升了流媒体QoE相关性14.7 ext{%}。
cs.CV / 106 / 2602.14226

Freq-DP Net: A Dual-Branch Network for Fence Removal using Dual-Pixel and Fourier Priors

Freq-DP 网络:一种基于双像素和傅里叶先验的双分支围栏去除网络
Swami, Kunal, Velusamy, Sudha, Seelamantula, Chandra Sekhar
Abstract
Removing fence occlusions from single images is a challenging task that degrades visual quality and limits downstream computer vision applications. Existing methods often fail on static scenes or require motion cues from multiple frames. To overcome these limitations, we introduce the first framework to leverage dual-pixel (DP) sensors for this problem. We propose Freq-DP Net, a novel dual-branch network that fuses two complementary priors: a geometric prior from defocus disparity, modeled using an explicit cost volume, and a structural prior of the fence's global pattern, learned via Fast Fourier Convolution (FFC). An attention mechanism intelligently merges these cues for highly accurate fence segmentation. To validate our approach, we build and release a diverse benchmark with different fence varieties. Experiments demonstrate that our method significantly outperforms strong general-purpose baselines, establishing a new state-of-the-art for single-image, DP-based fence removal.
Chinese Translation
从单幅图像中去除围栏遮挡是一项具有挑战性的任务,它会降低视觉质量并限制下游计算机视觉应用。现有方法通常在静态场景中表现不佳,或者需要来自多个帧的运动线索。为了解决这些局限性,我们提出了第一个利用双像素(Dual-Pixel, DP)传感器来处理这一问题的框架。我们提出了 Freq-DP 网络,这是一种新颖的双分支网络,融合了两种互补的先验信息:一种来自散焦视差的几何先验,通过显式代价体建模;另一种是围栏全局模式的结构先验,通过快速傅里叶卷积(Fast Fourier Convolution, FFC)学习而来。注意力机制智能地合并这些线索,以实现高精度的围栏分割。为了验证我们的方法,我们构建并发布了一个包含不同围栏种类的多样化基准数据集。实验表明,我们的方法显著优于强大的通用基线,确立了基于单幅图像和双像素的围栏去除的新最先进水平。
cs.CV / 107 / 2602.14228

Learning Significant Persistent Homology Features for 3D Shape Understanding

学习重要的持久同调特征以理解三维形状
Kudeshia, Prachi, Poovvancheri, Jiju
Abstract
Geometry and topology constitute complementary descriptors of three-dimensional shape, yet existing benchmark datasets primarily capture geometric information while neglecting topological structure. This work addresses this limitation by introducing topologically-enriched versions of ModelNet40 and ShapeNet, where each point cloud is augmented with its corresponding persistent homology features. These benchmarks with the topological signatures establish a foundation for unified geometry-topology learning and enable systematic evaluation of topology-aware deep learning architectures for 3D shape analysis. Building on this foundation, we propose a deep learning-based significant persistent point selection method, \textit{TopoGAT}, that learns to identify the most informative topological features directly from input data and the corresponding topological signatures, circumventing the limitations of hand-crafted statistical selection criteria. A comparative study verifies the superiority of the proposed method over traditional statistical approaches in terms of stability and discriminative power. Integrating the selected significant persistent points into standard point cloud classification and part-segmentation pipelines yields improvements in both classification accuracy and segmentation metrics. The presented topologically-enriched datasets, coupled with our learnable significant feature selection approach, enable the broader integration of persistent homology into the practical deep learning workflows for 3D point cloud analysis.
Chinese Translation
几何和拓扑构成了三维形状的互补描述符,但现有的基准数据集主要捕捉几何信息,而忽视了拓扑结构。本研究通过引入拓扑增强版本的ModelNet40和ShapeNet来解决这一局限性,其中每个点云都增加了其对应的持久同调特征。这些具有拓扑特征的基准为统一的几何-拓扑学习奠定了基础,并使得对拓扑感知深度学习架构进行系统评估成为可能。在此基础上,我们提出了一种基于深度学习的重要持久点选择方法, extit{TopoGAT},该方法学习直接从输入数据及其对应的拓扑特征中识别最具信息量的拓扑特征,避免了手工设计的统计选择标准的局限性。比较研究验证了所提方法在稳定性和区分能力方面优于传统统计方法。将选定的重要持久点整合到标准的点云分类和部件分割流程中,提升了分类准确性和分割指标。所呈现的拓扑增强数据集,加上我们可学习的重要特征选择方法,使得持久同调在三维点云分析的实际深度学习工作流程中的更广泛整合成为可能。
cs.CV / 108 / 2602.14236

Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models

面向视觉-语言模型的长视频理解的双信号自适应KV缓存优化
Sai, Vishnu, Sai, Dheeraj, B, Srinath, Varma, Girish, Shukla, Priyesh
Abstract
Vision-Language Models (VLMs) face a critical memory bottleneck when processing long-form video content due to the linear growth of the Key-Value (KV) cache with sequence length. Existing solutions predominantly employ reactive eviction strategies that compute full attention matrices before discarding tokens, resulting in substantial computational waste. We propose Sali-Cache, a novel a priori optimization framework that implements dual-signal adaptive caching through proactive memory management. By integrating a temporal filter based on optical flow analysis for detecting inter-frame redundancy and a spatial filter leveraging saliency detection for identifying visually significant regions, Sali-Cache intelligently manages memory allocation before entering computationally expensive attention operations. Experimental evaluation on the LLaVA 1.6 architecture demonstrates that our method achieves a 2.20x compression ratio in effective memory usage while maintaining 100% accuracy across BLEU, ROUGE-L, and Exact Match metrics. Furthermore, under identical memory budget constraints, Sali-Cache preserves context-rich features over extended temporal durations without degrading model performance, enabling efficient processing of long-form video content on consumer-grade hardware.
Chinese Translation
视觉-语言模型(VLMs)在处理长视频内容时面临严重的内存瓶颈,因为Key-Value(KV)缓存的大小随着序列长度线性增长。现有解决方案主要采用反应式驱逐策略,在丢弃令牌之前计算完整的注意力矩阵,这导致了大量的计算浪费。我们提出了Sali-Cache,这是一种新颖的先验优化框架,通过主动内存管理实现双信号自适应缓存。Sali-Cache结合了基于光流分析的时间滤波器,用于检测帧间冗余,以及利用显著性检测的空间滤波器,用于识别视觉上重要的区域,从而智能地管理内存分配,避免进入计算成本高昂的注意力操作。对LLaVA 1.6架构的实验评估表明,我们的方法在有效内存使用上实现了2.20倍的压缩比,同时在BLEU、ROUGE-L和精确匹配指标上保持100%的准确率。此外,在相同的内存预算限制下,Sali-Cache能够在较长的时间段内保持丰富的上下文特征,而不降低模型性能,从而实现了在消费级硬件上高效处理长视频内容的能力。
cs.CV / 109 / 2602.14237

AbracADDbra: Touch-Guided Object Addition by Decoupling Placement and Editing Subtasks

AbracADDbra:通过解耦放置和编辑子任务实现触控引导的对象添加
Swami, Kunal, Chittersu, Raghu, Rathore, Yuvraj, Irny, Rajeev, Doodekula, Shashavali, Shukla, Alok
Abstract
Instruction-based object addition is often hindered by the ambiguity of text-only prompts or the tedious nature of mask-based inputs. To address this usability gap, we introduce AbracADDbra, a user-friendly framework that leverages intuitive touch priors to spatially ground succinct instructions for precise placement. Our efficient, decoupled architecture uses a vision-language transformer for touch-guided placement, followed by a diffusion model that jointly generates the object and an instance mask for high-fidelity blending. To facilitate standardized evaluation, we contribute the Touch2Add benchmark for this interactive task. Our extensive evaluations, where our placement model significantly outperforms both random placement and general-purpose VLM baselines, confirm the framework's ability to produce high-fidelity edits. Furthermore, our analysis reveals a strong correlation between initial placement accuracy and final edit quality, validating our decoupled approach. This work thus paves the way for more accessible and efficient creative tools.
Chinese Translation
基于指令的对象添加常常受到仅有文本提示的模糊性或基于掩码输入的繁琐性所阻碍。为了解决这一可用性问题,我们提出了AbracADDbra,一个用户友好的框架,利用直观的触控先验将简洁的指令空间定位,以实现精确的放置。我们的高效解耦架构使用视觉-语言变换器进行触控引导的放置,随后采用扩散模型共同生成对象及其实例掩码,以实现高保真混合。为了促进标准化评估,我们为这一交互任务贡献了Touch2Add基准。我们的广泛评估表明,我们的放置模型显著优于随机放置和通用视觉语言模型(VLM)基线,证实了该框架能够产生高保真的编辑。此外,我们的分析揭示了初始放置准确性与最终编辑质量之间的强相关性,验证了我们的解耦方法。因此,这项工作为更易获取和高效的创意工具铺平了道路。
cs.CV / 110 / 2602.14276

Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

超越稀疏基础的完整屏幕解析监督
Gurbuz, A. Said, Hong, Sunghwan, Nassar, Ahmed, Pollefeys, Marc, Staar, Peter
Abstract
Modern computer-use agents (CUA) must perceive a screen as a structured state, what elements are visible, where they are, and what text they contain, before they can reliably ground instructions and act. Yet, most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse urls, extracts annotations and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.
Chinese Translation
现代计算机使用代理(CUA)必须将屏幕视为一个结构化状态,了解哪些元素是可见的、它们的位置以及包含的文本,才能可靠地理解指令并采取行动。然而,现有的大多数基础数据集提供的监督信息稀疏,标注的任务相关元素仅为每个屏幕的一小部分,且标签多样性不足,这限制了覆盖范围和泛化能力;此外,实际部署要求高效性,以实现低延迟的设备端使用。我们引入了ScreenParse,这是一个用于完整屏幕解析的大规模数据集,包含对771K网页截图(2100万个元素)中所有可见用户界面元素(框、55类类型和文本)的密集标注。ScreenParse是通过Webshot生成的,这是一种自动化、可扩展的流程,能够渲染多样化的URL,提取标注并应用基于视觉语言模型(VLM)的重新标注和质量过滤。使用ScreenParse,我们训练了ScreenVLM,一个紧凑的316M参数视觉语言模型(VLM),该模型解码一个紧凑的ScreenTag标记表示,并采用结构感知损失来加大结构关键标记的权重。ScreenVLM在密集解析上显著优于更大的基础VLM(例如,在ScreenParse上,PageIoU为0.592对比0.294),并在公共基准测试中表现出强大的迁移能力。此外,在ScreenParse上微调基础VLM始终能提高其基础性能,表明密集屏幕监督为用户界面理解提供了可迁移的结构先验。项目页面:https://saidgurbuz.github.io/screenparse/
cs.CV / 111 / 2602.14297

Differential pose optimization in descriptor space -- Combining Geometric and Photometric Methods for Motion Estimation

描述符空间中的差分姿态优化——结合几何和光度方法进行运动估计
Teigen, Andreas L., Stahl, Annette, Mester, Rudolf
Abstract
One of the fundamental problems in computer vision is the two-frame relative pose optimization problem. Primarily, two different kinds of error values are used: photometric error and re-projection error. The selection of error value is usually directly dependent on the selection of feature paradigm, photometric features, or geometric features. It is a trade-off between accuracy, robustness, and the possibility of loop closing. We investigate a third method that combines the strengths of both paradigms into a unified approach. Using densely sampled geometric feature descriptors, we replace the photometric error with a descriptor residual from a dense set of descriptors, thereby enabling the employment of sub-pixel accuracy in differential photometric methods, along with the expressiveness of the geometric feature descriptor. Experiments show that although the proposed strategy is an interesting approach that results in accurate tracking, it ultimately does not outperform pose optimization strategies based on re-projection error despite utilizing more information. We proceed to analyze the underlying reason for this discrepancy and present the hypothesis that the descriptor similarity metric is too slowly varying and does not necessarily correspond strictly to keypoint placement accuracy.
Chinese Translation
计算机视觉中的一个基本问题是两帧相对姿态优化问题。主要使用两种不同类型的误差值:光度误差和重投影误差。误差值的选择通常直接依赖于特征范式的选择,即光度特征或几何特征。这是在准确性、鲁棒性和闭环可能性之间的权衡。我们研究了一种第三种方法,将两种范式的优点结合成一个统一的方法。通过使用密集采样的几何特征描述符,我们用来自密集描述符集合的描述符残差替代光度误差,从而使得在差分光度方法中能够采用亚像素精度,同时保留几何特征描述符的表现力。实验表明,尽管所提出的策略是一种有趣的方法,能够实现准确的跟踪,但最终并未超越基于重投影误差的姿态优化策略,尽管利用了更多的信息。我们进一步分析了这种差异的潜在原因,并提出假设,即描述符相似性度量变化太慢,并不一定严格对应于关键点位置的准确性。
cs.CV / 112 / 2602.14356

A Generative AI Approach for Reducing Skin Tone Bias in Skin Cancer Classification

一种生成性人工智能方法用于减少皮肤癌分类中的肤色偏见
Shabu, Areez Muhammed, Ansari, Mohammad Samar, Aslam, Asra
Abstract
Skin cancer is one of the most common cancers worldwide and early detection is critical for effective treatment. However, current AI diagnostic tools are often trained on datasets dominated by lighter skin tones, leading to reduced accuracy and fairness for people with darker skin. The International Skin Imaging Collaboration (ISIC) dataset, one of the most widely used benchmarks, contains over 70% light skin images while dark skins fewer than 8%. This imbalance poses a significant barrier to equitable healthcare delivery and highlights the urgent need for methods that address demographic diversity in medical imaging. This paper addresses this challenge of skin tone imbalance in automated skin cancer detection using dermoscopic images. To overcome this, we present a generative augmentation pipeline that fine-tunes a pre-trained Stable Diffusion model using Low-Rank Adaptation (LoRA) on the image dark-skin subset of the ISIC dataset and generates synthetic dermoscopic images conditioned on lesion type and skin tone. In this study, we investigated the utility of these images on two downstream tasks: lesion segmentation and binary classification. For segmentation, models trained on the augmented dataset and evaluated on held-out real images show consistent improvements in IoU, Dice coefficient, and boundary accuracy. These evalutions provides the verification of Generated dataset. For classification, an EfficientNet-B0 model trained on the augmented dataset achieved 92.14% accuracy. This paper demonstrates that synthetic data augmentation with Generative AI integration can substantially reduce bias with increase fairness in conventional dermatological diagnostics and open challenges for future directions.
Chinese Translation
皮肤癌是全球最常见的癌症之一,早期检测对有效治疗至关重要。然而,目前的人工智能诊断工具往往在以较浅肤色为主的数据库上进行训练,这导致对肤色较深的人群的准确性和公平性降低。国际皮肤影像合作组织(ISIC)数据集是最广泛使用的基准之一,其中超过70%的图像为浅肤色,而深肤色图像不足8%。这种不平衡对公平医疗服务构成了重大障碍,并突显了在医学影像中解决人口多样性问题的迫切需求。本文针对自动化皮肤癌检测中的肤色不平衡问题,使用皮肤镜图像进行研究。为了解决这一问题,我们提出了一种生成性增强管道,该管道利用低秩适应(Low-Rank Adaptation, LoRA)对预训练的稳定扩散(Stable Diffusion)模型进行微调,使用ISIC数据集中深肤色子集的图像,并根据病变类型和肤色生成合成的皮肤镜图像。在本研究中,我们调查了这些图像在两个下游任务中的效用:病变分割和二分类。在分割任务中,基于增强数据集训练的模型在保留的真实图像上评估时,IoU、Dice系数和边界准确性均显示出一致的改善。这些评估验证了生成数据集的有效性。在分类任务中,基于增强数据集训练的EfficientNet-B0模型达到了92.14%的准确率。本文证明,结合生成性人工智能的合成数据增强可以显著减少偏见,提高传统皮肤病诊断的公平性,并为未来的研究方向打开新的挑战。
cs.CV / 113 / 2602.14365

Image-based Joint-level Detection for Inflammation in Rheumatoid Arthritis from Small and Imbalanced Data

基于图像的类关节水平检测:从小规模不平衡数据中识别类风湿关节炎的炎症
Kato, Shun, Kondo, Yasushi, Saito, Shuntaro, Aoki, Yoshimitsu, Isogawa, Mariko
Abstract
Rheumatoid arthritis (RA) is an autoimmune disease characterized by systemic joint inflammation. Early diagnosis and tight follow-up are essential to the management of RA, as ongoing inflammation can cause irreversible joint damage. The detection of arthritis is important for diagnosis and assessment of disease activity; however, it often takes a long time for patients to receive appropriate specialist care. Therefore, there is a strong need to develop systems that can detect joint inflammation easily using RGB images captured at home. Consequently, we tackle the task of RA inflammation detection from RGB hand images. This task is highly challenging due to general issues in medical imaging, such as the scarcity of positive samples, data imbalance, and the inherent difficulty of the task itself. However, to the best of our knowledge, no existing work has explicitly addressed these challenges in RGB-based RA inflammation detection. This paper quantitatively demonstrates the difficulty of visually detecting inflammation by constructing a dedicated dataset, and we propose a inflammation detection framework with global local encoder that combines self-supervised pretraining on large-scale healthy hand images with imbalance-aware training to detect RA-related joint inflammation from RGB hand images. Our experiments demonstrated that the proposed approach improves F1-score by 0.2 points and Gmean by 0.25 points compared with the baseline model.
Chinese Translation
类风湿关节炎(RA)是一种以全身性关节炎症为特征的自身免疫性疾病。早期诊断和紧密随访对于RA的管理至关重要,因为持续的炎症可能导致不可逆的关节损伤。关节炎的检测对于疾病的诊断和活动性评估至关重要;然而,患者通常需要很长时间才能获得适当的专业护理。因此,迫切需要开发能够利用在家中拍摄的RGB图像轻松检测关节炎症的系统。因此,我们着手解决从RGB手部图像中检测RA炎症的任务。由于医学成像中的一般问题,如阳性样本稀缺、数据不平衡以及任务本身的固有难度,这一任务极具挑战性。然而,据我们所知,现有研究尚未明确解决RGB基础的RA炎症检测中的这些挑战。本文通过构建专门的数据集定量展示了视觉检测炎症的难度,并提出了一种结合自监督预训练和不平衡感知训练的全局局部编码器的炎症检测框架,以从RGB手部图像中检测与RA相关的关节炎症。我们的实验表明,与基线模型相比,所提出的方法将F1-score提高了0.2点,Gmean提高了0.25点。
cs.CV / 114 / 2602.14376

Event-based Visual Deformation Measurement

基于事件的视觉变形测量
Wu, Yuliang, Zhai, Wei, Cui, Yuxin, Zhao, Tiesong, Cao, Yang, Zha, Zheng-Jun
Abstract
Visual Deformation Measurement (VDM) aims to recover dense deformation fields by tracking surface motion from camera observations. Traditional image-based methods rely on minimal inter-frame motion to constrain the correspondence search space, which limits their applicability to highly dynamic scenes or necessitates high-speed cameras at the cost of prohibitive storage and computational overhead. We propose an event-frame fusion framework that exploits events for temporally dense motion cues and frames for spatially dense precise estimation. Revisiting the solid elastic modeling prior, we propose an Affine Invariant Simplicial (AIS) framework. It partitions the deformation field into linearized sub-regions with low-parametric representation, effectively mitigating motion ambiguities arising from sparse and noisy events. To speed up parameter searching and reduce error accumulation, a neighborhood-greedy optimization strategy is introduced, enabling well-converged sub-regions to guide their poorly-converged neighbors, effectively suppress local error accumulation in long-term dense tracking. To evaluate the proposed method, a benchmark dataset with temporally aligned event streams and frames is established, encompassing over 120 sequences spanning diverse deformation scenarios. Experimental results show that our method outperforms the state-of-the-art baseline by 1.6% in survival rate. Remarkably, it achieves this using only 18.9% of the data storage and processing resources of high-speed video methods.
Chinese Translation
视觉变形测量(VDM)旨在通过跟踪来自相机观测的表面运动来恢复密集的变形场。传统的基于图像的方法依赖于最小的帧间运动来限制对应搜索空间,这限制了其在高度动态场景中的适用性,或需要高速相机,这会导致存储和计算开销的显著增加。我们提出了一种事件-帧融合框架,利用事件提供时间上密集的运动线索,并利用帧进行空间上密集的精确估计。重新审视固体弹性建模的先验,我们提出了一种仿射不变单纯形(Affine Invariant Simplicial, AIS)框架。该框架将变形场划分为低参数表示的线性化子区域,有效减轻了由稀疏和噪声事件引起的运动模糊。为了加速参数搜索并减少误差累积,引入了一种邻域贪婪优化策略,使得收敛良好的子区域能够引导其收敛不良的邻域,有效抑制长期密集跟踪中的局部误差累积。为了评估所提出的方法,建立了一个基准数据集,包含时间对齐的事件流和帧,涵盖了超过120个序列,涉及多种变形场景。实验结果表明,我们的方法在生存率上比最先进的基线提高了1.6%。值得注意的是,它仅使用了高速视频方法的18.9%的数据存储和处理资源。
cs.CV / 115 / 2602.14381

Adapting VACE for Real-Time Autoregressive Video Diffusion

为实时自回归视频扩散调整 VACE
Fosdick, Ryan
Abstract
We describe an adaptation of VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at https://github.com/daydreamlive/scope.
Chinese Translation
我们描述了 VACE(视频全能创作与编辑)在实时自回归视频生成中的一种适应性修改。VACE 提供统一的视频控制(参考引导、结构条件、修补和时间扩展),但假设对完整序列进行双向注意,这使其与需要固定块大小和因果注意的流媒体管道不兼容。关键的修改是将参考帧从扩散潜在空间移动到并行条件路径中,从而保留自回归模型所需的固定块大小和 KV 缓存。这种适应性在不进行额外训练的情况下重用了现有的预训练 VACE 权重。在 1.3B 和 14B 模型规模上,VACE 为结构控制和修补增加了 20-30% 的延迟开销,相对于基础模型的 VRAM 成本可以忽略不计。由于因果注意约束,参考到视频的保真度与批处理 VACE 相比严重下降。参考实现可在 https://github.com/daydreamlive/scope 获取。
cs.CV / 116 / 2602.14399

Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

针对大型视觉-语言模型的多轮自适应提示攻击
Choi, In Chong, Zhang, Jiacheng, Liu, Feng, Song, Yiliao
Abstract
Multi-turn jailbreak attacks are effective against text-only large language models (LLMs) by gradually introducing malicious content across turns. When extended to large vision-language models (LVLMs), we find that naively adding visual inputs can cause existing multi-turn jailbreaks to be easily defended. For example, overly malicious visual input will easily trigger the defense mechanism of safety-aligned LVLMs, making the response more conservative. To address this, we propose MAPA: a multi-turn adaptive prompting attack that 1) at each turn, alternates text-vision attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 11-35% on recent benchmarks against LLaVA-V1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct and GPT-4o-mini.
Chinese Translation
多轮越狱攻击通过在多个回合中逐步引入恶意内容,对仅文本的大型语言模型(LLMs)有效。当这一方法扩展到大型视觉-语言模型(LVLMs)时,我们发现简单地添加视觉输入会导致现有的多轮越狱攻击容易被防御。例如,过于恶意的视觉输入会轻易触发安全对齐LVLM的防御机制,使得响应变得更加保守。为了解决这个问题,我们提出了MAPA:一种多轮自适应提示攻击,1)在每个回合中交替进行文本-视觉攻击,以引发最恶意的响应;2)在多个回合中,通过迭代的反复调整攻击轨迹,逐步增强响应的恶意性。这种双层设计使得MAPA在对抗LLaVA-V1.6-Mistral-7B、Qwen2.5-VL-7B-Instruct、Llama-3.2-Vision-11B-Instruct和GPT-4o-mini的最新基准测试中,攻击成功率提高了11-35%,始终优于现有的最先进方法。
cs.CV / 117 / 2602.14401

pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI

pFedNavi:面向结构的个性化联邦视觉-语言导航框架用于具身人工智能
Yang, Qingqian, Wang, Hao, Zhang, Sai Qian, Li, Jian, Hua, Yang, Pan, Miao, Song, Tao, Qi, Zhengwei, Guan, Haibing
Abstract
Vision-Language Navigation VLN requires large-scale trajectory instruction data from private indoor environments, raising significant privacy concerns. Federated Learning FL mitigates this by keeping data on-device, but vanilla FL struggles under VLNs' extreme cross-client heterogeneity in environments and instruction styles, making a single global model suboptimal. This paper proposes pFedNavi, a structure-aware and dynamically adaptive personalized federated learning framework tailored for VLN. Our key idea is to personalize where it matters: pFedNavi adaptively identifies client-specific layers via layer-wise mixing coefficients, and performs fine-grained parameter fusion on the selected components (e.g., the encoder-decoder projection and environment-sensitive decoder layers) to balance global knowledge sharing with local specialization. We evaluate pFedNavi on two standard VLN benchmarks, R2R and RxR, using both ResNet and CLIP visual representations. Across all metrics, pFedNavi consistently outperforms the FedAvg-based VLN baseline, achieving up to 7.5% improvement in navigation success rate and up to 7.8% gain in trajectory fidelity, while converging 1.38x faster under non-IID conditions.
Chinese Translation
视觉-语言导航(VLN)需要来自私人室内环境的大规模轨迹指令数据,这引发了显著的隐私问题。联邦学习(FL)通过将数据保留在设备上来缓解这一问题,但传统的FL在VLN的极端跨客户端异质性(环境和指令风格)下表现不佳,使得单一的全局模型并不理想。本文提出了pFedNavi,一个面向结构的、动态自适应的个性化联邦学习框架,专为VLN量身定制。我们的核心思想是个性化关注重要部分:pFedNavi通过逐层混合系数自适应地识别客户端特定层,并对选定组件(例如,编码器-解码器投影和环境敏感的解码器层)进行细粒度参数融合,以平衡全球知识共享与本地专业化。我们在两个标准VLN基准测试R2R和RxR上评估了pFedNavi,使用了ResNet和CLIP视觉表示。在所有指标上,pFedNavi始终优于基于FedAvg的VLN基线,在导航成功率上提高了最多7.5%,在轨迹保真度上提高了最多7.8%,同时在非独立同分布(non-IID)条件下收敛速度快了1.38倍。
cs.CV / 118 / 2602.14408

Feature Recalibration Based Olfactory-Visual Multimodal Model for Fine-Grained Rice Deterioration Detection

基于特征重校准的嗅觉-视觉多模态模型用于细粒度稻米变质检测
Zhao, Rongqiang, Hu, Hengrui, Wang, Yijing, Sun, Mingchun, Liu, Jie
Abstract
Multimodal methods are widely used in rice deterioration detection, which exhibit limited capability in representing and extracting fine-grained abnormal features. Moreover, these methods rely on devices, such as hyperspectral cameras and mass spectrometers, increasing detection costs and prolonging data acquisition time. To address these issues, we propose a feature recalibration based olfactory-visual multimodal model for fine-grained rice deterioration detection. The fine-grained deterioration embedding constructor (FDEC) is proposed to reconstruct the labeled multimodal embedded-feature dataset, enhancing sample representation. The fine-grained deterioration recalibration attention network (FDRA-Net) is proposed to emphasize signal variations and increase sensitivity to fine-grained deterioration on the rice surface. Experiments show that the proposed method achieves a classification accuracy of 99.89%. Compared with state-of-the-art methods, the detection accuracy is improved and the procedure is simplified. Furthermore, field detection demonstrates the advantages of accuracy and operational simplicity. The proposed method can also be extended to other agrifood in agriculture and food industry.
Chinese Translation
多模态方法广泛应用于稻米变质检测,但在表示和提取细粒度异常特征方面能力有限。此外,这些方法依赖于高光谱相机和质谱仪等设备,增加了检测成本并延长了数据获取时间。为了解决这些问题,我们提出了一种基于特征重校准的嗅觉-视觉多模态模型,用于细粒度稻米变质检测。我们提出了细粒度变质嵌入构建器(FDEC),用于重构标记的多模态嵌入特征数据集,从而增强样本表示。我们还提出了细粒度变质重校准注意力网络(FDRA-Net),以强调信号变化并提高对稻米表面细粒度变质的敏感性。实验表明,所提方法的分类准确率达到99.89%。与最先进的方法相比,检测准确率有所提高,程序也得到了简化。此外,现场检测展示了准确性和操作简便性的优势。所提方法还可以扩展到农业和食品工业中的其他农产品。
cs.CV / 119 / 2602.14409

Learning Proposes, Geometry Disposes: A Modular Framework for Efficient Spatial Reasoning

学习提出,几何处置:高效空间推理的模块化框架
Zhu, Haichao, Yang, Zhaorui, Zhang, Qian
Abstract
Spatial perception aims to estimate camera motion and scene structure from visual observations, a problem traditionally addressed through geometric modeling and physical consistency constraints. Recent learning-based methods have demonstrated strong representational capacity for geometric perception and are increasingly used to augment classical geometry-centric systems in practice. However, whether learning components should directly replace geometric estimation or instead serve as intermediate modules within such pipelines remains an open question. In this work, we address this gap and investigate an end-to-end modular framework for effective spatial reasoning, where learning proposes geometric hypotheses, while geometric algorithms dispose estimation decisions. In particular, we study this principle in the context of relative camera pose estimation on RGB-D sequences. Using VGGT as a representative learning model, we evaluate learning-based pose and depth proposals under varying motion magnitudes and scene dynamics, followed by a classical point-to-plane RGB-D ICP as the geometric backend. Our experiments on the TUM RGB-D benchmark reveal three consistent findings: (1) learning-based pose proposals alone are unreliable; (2) learning-proposed geometry, when improperly aligned with camera intrinsics, can degrade performance; and (3) when learning-proposed depth is geometrically aligned and followed by a geometric disposal stage, consistent improvements emerge in moderately challenging rigid settings. These results demonstrate that geometry is not merely a refinement component, but an essential arbiter that validates and absorbs learning-based geometric observations. Our study highlights the importance of modular, geometry-aware system design for robust spatial perception.
Chinese Translation
空间感知旨在从视觉观察中估计相机运动和场景结构,这一问题传统上通过几何建模和物理一致性约束来解决。最近的基于学习的方法展示了在几何感知方面强大的表征能力,并越来越多地用于增强经典的以几何为中心的系统。然而,学习组件是否应直接替代几何估计,或应作为此类管道中的中间模块仍然是一个悬而未决的问题。在本研究中,我们填补了这一空白,探讨了一个端到端的模块化框架,以实现有效的空间推理,其中学习提出几何假设,而几何算法则负责处置估计决策。特别地,我们在RGB-D序列的相对相机姿态估计的背景下研究这一原则。使用VGGT作为代表性的学习模型,我们在不同运动幅度和场景动态下评估基于学习的姿态和深度提议,随后使用经典的点到平面RGB-D ICP作为几何后端。我们在TUM RGB-D基准上的实验揭示了三个一致的发现:(1)仅依赖学习的姿态提议是不可靠的;(2)当学习提出的几何与相机内参不正确对齐时,可能会降低性能;(3)当学习提出的深度经过几何对齐并随后经过几何处置阶段时,在适度具有挑战性的刚性设置中会出现一致的改进。这些结果表明,几何不仅仅是一个精细化组件,而是一个验证和吸收基于学习的几何观察的重要仲裁者。我们的研究强调了模块化、关注几何的系统设计在稳健空间感知中的重要性。
cs.CV / 120 / 2602.14413

Understanding Sensor Vulnerabilities in Industrial XR Tracking

理解工业XR追踪中的传感器脆弱性
Saha, Sourya, Absur, Md. Nurul
Abstract
Extended Reality (XR) systems deployed in industrial and operational settings rely on Visual--Inertial Odometry (VIO) for continuous six-degree-of-freedom pose tracking, yet these environments often involve sensing conditions that deviate from ideal assumptions. Despite this, most VIO evaluations emphasize nominal sensor behavior, leaving the effects of sustained sensor degradation under operational conditions insufficiently understood. This paper presents a controlled empirical study of VIO behavior under degraded sensing, examining faults affecting visual and inertial modalities across a range of operating regimes. Through systematic fault injection and quantitative evaluation, we observe a pronounced asymmetry in fault impact where degradations affecting visual sensing typically lead to bounded pose errors on the order of centimeters, whereas degradations affecting inertial sensing can induce substantially larger trajectory deviations, in some cases reaching hundreds to thousands of meters. These observations motivate greater emphasis on inertial reliability in the evaluation and design of XR systems for real-life industrial settings.
Chinese Translation
在工业和操作环境中部署的扩展现实(XR)系统依赖于视觉-惯性里程计(VIO)进行连续的六自由度姿态追踪,但这些环境往往涉及偏离理想假设的传感条件。尽管如此,大多数VIO评估强调的是名义传感器行为,导致在操作条件下持续传感器退化的影响尚未得到充分理解。本文呈现了一项关于退化传感下VIO行为的受控实证研究,考察了影响视觉和惯性模态的故障,涵盖了一系列操作模式。通过系统的故障注入和定量评估,我们观察到故障影响的明显不对称性,其中影响视觉传感的退化通常导致厘米级的有限姿态误差,而影响惯性传感的退化则可能引发更大幅度的轨迹偏差,在某些情况下可达到数百到数千米。这些观察结果促使我们在评估和设计用于现实工业环境的XR系统时,更加重视惯性可靠性。
cs.CV / 121 / 2602.14425

Hierarchical Vision-Language Interaction for Facial Action Unit Detection

面部动作单元检测的层次化视觉-语言交互
Li, Yong, Ren, Yi, Zhang, Yizhe, Zhang, Wenhua, Zhang, Tianyi, Jiang, Muyun, Xie, Guo-Sen, Guan, Cuntai
Abstract
Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge w.r.t AU detection is the effective learning of discriminative and generalizable AU representations under conditions of limited annotated data. To address this, we propose a Hierarchical Vision-language Interaction for AU Understanding (HiVA) method, which leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated within a hierarchical cross-modal attention architecture comprising two complementary mechanisms: Disentangled Dual Cross-Attention (DDCA), which establishes fine-grained, AU-specific interactions between visual and textual features, and Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. This collaborative, cross-modal learning paradigm enables HiVA to leverage multi-grained vision-based AU features in conjunction with refined language-based AU details, culminating in robust and semantically enriched AU detection capabilities. Extensive experiments show that HiVA consistently surpasses state-of-the-art approaches. Besides, qualitative analyses reveal that HiVA produces semantically meaningful activation patterns, highlighting its efficacy in learning robust and interpretable cross-modal correspondences for comprehensive facial behavior analysis.
Chinese Translation
面部动作单元(AU)检测旨在识别由面部动作编码系统(FACS)定义的细微面部肌肉激活。关于AU检测的一个主要挑战是如何在有限标注数据的条件下有效学习具有区分性和可推广性的AU表示。为了解决这个问题,我们提出了一种用于AU理解的层次化视觉-语言交互方法(HiVA),该方法利用文本AU描述作为语义先验,以指导和增强AU检测。具体而言,HiVA采用大型语言模型生成多样且具有上下文丰富的AU描述,以加强基于语言的表示学习。为了捕捉细粒度和整体的视觉-语言关联,HiVA引入了一个AU感知的动态图模块,促进AU特定视觉表示的学习。这些特征进一步整合在一个层次化的跨模态注意力架构中,该架构包括两种互补机制:解耦双重交叉注意力(DDCA),用于建立视觉和文本特征之间细粒度的AU特定交互,以及上下文双重交叉注意力(CDCA),用于建模全局的AU间依赖关系。这种协作的跨模态学习范式使HiVA能够结合多粒度的基于视觉的AU特征和精细的基于语言的AU细节,从而实现强大且语义丰富的AU检测能力。大量实验表明,HiVA始终超越了最先进的方法。此外,定性分析显示HiVA产生了具有语义意义的激活模式,突显了其在学习稳健且可解释的跨模态对应关系方面的有效性,从而为全面的面部行为分析提供支持。
cs.CV / 122 / 2602.14441

D-SECURE: Dual-Source Evidence Combination for Unified Reasoning in Misinformation Detection

D-SECURE:用于虚假信息检测的双源证据结合统一推理
Singh, Gagandeep, Amarasinghe, Samudi, Singh, Priyanka
Abstract
Multimodal misinformation increasingly mixes realistic im-age edits with fluent but misleading text, producing persuasive posts that are difficult to verify. Existing systems usually rely on a single evidence source. Content-based detectors identify local inconsistencies within an image and its caption but cannot determine global factual truth. Retrieval-based fact-checkers reason over external evidence but treat inputs as coarse claims and often miss subtle visual or textual manipulations. This separation creates failure cases where internally consistent fabrications bypass manipulation detectors and fact-checkers verify claims that contain pixel-level or token-level corruption. We present D-SECURE, a framework that combines internal manipulation detection with external evidence-based reasoning for news-style posts. D-SECURE integrates the HAMMER manipulation detector with the DEFAME retrieval pipeline. DEFAME performs broad verification, and HAMMER analyses residual or uncertain cases that may contain fine-grained edits. Experiments on DGM4 and ClaimReview samples highlight the complementary strengths of both systems and motivate their fusion. We provide a unified, explainable report that incorporates manipulation cues and external evidence.
Chinese Translation
多模态虚假信息越来越多地将现实图像编辑与流畅但具有误导性的文本混合在一起,产生难以验证的有说服力的帖子。现有系统通常依赖于单一的证据来源。基于内容的检测器识别图像及其标题中的局部不一致性,但无法确定整体事实真相。基于检索的事实核查者在外部证据上进行推理,但将输入视为粗略的主张,往往会错过细微的视觉或文本操控。这种分离导致了失败案例,其中内部一致的虚构内容绕过了操控检测器,而事实核查者则验证了包含像素级或标记级损坏的主张。我们提出了D-SECURE,一个将内部操控检测与基于外部证据的推理结合起来的框架,适用于新闻风格的帖子。D-SECURE将HAMMER操控检测器与DEFAME检索管道集成。DEFAME执行广泛的验证,而HAMMER分析可能包含细粒度编辑的残余或不确定案例。对DGM4和ClaimReview样本的实验突显了这两个系统的互补优势,并推动了它们的融合。我们提供了一个统一的、可解释的报告,结合了操控线索和外部证据。
cs.CV / 123 / 2602.14443

Controlling Your Image via Simplified Vector Graphics

通过简化向量图形控制您的图像
Guo, Lanqing, Liu, Xi, Wang, Yufei, Li, Zhihao, Huang, Siyu
Abstract
Recent advances in image generation have achieved remarkable visual quality, while a fundamental challenge remains: Can image generation be controlled at the element level, enabling intuitive modifications such as adjusting shapes, altering colors, or adding and removing objects? In this work, we address this challenge by introducing layer-wise controllable generation through simplified vector graphics (VGs). Our approach first efficiently parses images into hierarchical VG representations that are semantic-aligned and structurally coherent. Building on this representation, we design a novel image synthesis framework guided by VGs, allowing users to freely modify elements and seamlessly translate these edits into photorealistic outputs. By leveraging the structural and semantic features of VGs in conjunction with noise prediction, our method provides precise control over geometry, color, and object semantics. Extensive experiments demonstrate the effectiveness of our approach in diverse applications, including image editing, object-level manipulation, and fine-grained content creation, establishing a new paradigm for controllable image generation. Project page: https://guolanqing.github.io/Vec2Pix/
Chinese Translation
近年来,图像生成技术取得了显著的视觉质量进展,但一个基本挑战依然存在:图像生成能否在元素层面上进行控制,从而实现直观的修改,例如调整形状、改变颜色或添加和移除对象?在本研究中,我们通过引入基于简化向量图形(VGs)的分层可控生成来解决这一挑战。我们的方法首先将图像高效解析为语义对齐且结构一致的分层VG表示。在此表示的基础上,我们设计了一种新颖的图像合成框架,该框架以VGs为指导,允许用户自由修改元素,并将这些编辑无缝转换为逼真的输出。通过结合VGs的结构和语义特征与噪声预测,我们的方法在几何、颜色和对象语义方面提供了精确的控制。大量实验表明,我们的方法在图像编辑、对象级操作和细粒度内容创作等多种应用中表现出色,为可控图像生成建立了新的范式。项目页面:https://guolanqing.github.io/Vec2Pix/
cs.CV / 124 / 2602.14464

CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer

CoCoDiff:用于细粒度风格迁移的一致性对应扩散模型
Nie, Wenbo, Li, Zixiang, Tao, Renshuai, Wu, Bin, Wei, Yunchao, Zhao, Yao
Abstract
Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. While existing methods have made great strides, most of them operate at global level but overlook region-wise and even pixel-wise semantic correspondence. To address this, we propose CoCoDiff, a novel training-free and low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We identify that correspondence cues within generative diffusion models are under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. Furthermore, a cycle-consistency module then enforces structural and perceptual alignment across iterations, yielding object and region level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.
Chinese Translation
在图像之间转移视觉风格的同时保持相似物体之间的语义对应关系,仍然是计算机视觉中的一个核心挑战。尽管现有方法取得了显著进展,但大多数方法在全局层面上操作,忽视了区域甚至像素级的语义对应关系。为了解决这一问题,我们提出了CoCoDiff,这是一种新颖的无训练且低成本的风格迁移框架,利用预训练的潜在扩散模型实现细粒度的语义一致风格化。我们发现生成扩散模型中的对应线索尚未得到充分探索,而在语义匹配区域之间保持内容一致性往往被忽视。CoCoDiff引入了一个像素级语义对应模块,挖掘中间扩散特征,以构建内容图像和风格图像之间的密集对齐图。此外,一个循环一致性模块在迭代过程中强制执行结构和感知对齐,从而实现保持几何形状和细节的对象和区域级风格化。尽管不需要额外的训练或监督,CoCoDiff仍然提供了最先进的视觉质量和强大的定量结果,超越了依赖额外训练或注释的方法。
cs.CV / 125 / 2602.14482

TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning

TikArt:通过强化学习进行细粒度视觉推理的光圈引导观察
Ding, Hao, Yang, Zhichuan, Ge, Weijie, Gao, Ziqin, Lu, Chaoyi, Zhao, Lei
Abstract
We address fine-grained visual reasoning in multimodal large language models (MLLMs), where key evidence may reside in tiny objects, cluttered regions, or subtle markings that are lost under a single global image encoding. We introduce TikArt (Thinking Aperture), an aperture-guided agent that casts multi-step vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions: Zoom extracts rectangular crops, while Segment invokes SAM2 to obtain mask-based crops for irregular targets. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory. Built on Qwen3-VL-8B, TikArt optimizes its reasoning policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum: it warms up segmentation actions and then jointly optimizes visual math, fine-grained VQA, and segmentation, using rewards that couple task success with purposeful aperture use. Experiments on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg show consistent gains over the backbone and yield interpretable aperture trajectories for high-resolution reasoning.
Chinese Translation
我们研究多模态大语言模型(MLLMs)中的细粒度视觉推理,其中关键证据可能存在于微小物体、杂乱区域或在单一全局图像编码下丢失的细微标记中。我们提出了TikArt(思维光圈),一种光圈引导的智能体,将多步骤的视觉-语言推理视为对感兴趣区域的决策过程。TikArt遵循思维-光圈-观察循环,在语言生成和两种光圈操作之间交替:缩放(Zoom)提取矩形裁剪,而分割(Segment)调用SAM2以获取不规则目标的基于掩模的裁剪。在每个操作之后,模型必须生成明确的观察,将局部视觉线索转化为持久的语言记忆。TikArt基于Qwen3-VL-8B,使用AGRPO优化其推理策略,这是一种GRPO风格的强化学习算法,具有两阶段课程:首先预热分割操作,然后联合优化视觉数学、细粒度视觉问答(VQA)和分割,使用将任务成功与有目的的光圈使用相结合的奖励。在V*、HR-Bench-4K/8K、MME-RealWorld-Lite、MMStar、RefCOCO和ReasonSeg上的实验显示出相对于基础模型的一致提升,并为高分辨率推理提供了可解释的光圈轨迹。
cs.CV / 126 / 2602.14493

Gaussian Mesh Renderer for Lightweight Differentiable Rendering

轻量级可微分渲染的高斯网格渲染器
Liu, Xinpeng, Okura, Fumio
Abstract
3D Gaussian Splatting (3DGS) has enabled high-fidelity virtualization with fast rendering and optimization for novel view synthesis. On the other hand, triangle mesh models still remain a popular choice for surface reconstruction but suffer from slow or heavy optimization in traditional mesh-based differentiable renderers. To address this problem, we propose a new lightweight differentiable mesh renderer leveraging the efficient rasterization process of 3DGS, named Gaussian Mesh Renderer (GMR), which tightly integrates the Gaussian and mesh representations. Each Gaussian primitive is analytically derived from the corresponding mesh triangle, preserving structural fidelity and enabling the gradient flow. Compared to the traditional mesh renderers, our method achieves smoother gradients, which especially contributes to better optimization using smaller batch sizes with limited memory. Our implementation is available in the public GitHub repository at https://github.com/huntorochi/Gaussian-Mesh-Renderer.
Chinese Translation
3D 高斯点云(3D Gaussian Splatting, 3DGS)实现了高保真虚拟化,并快速渲染和优化新视图合成。另一方面,三角网格模型仍然是表面重建的热门选择,但在传统的基于网格的可微分渲染器中,优化过程往往缓慢或繁重。为了解决这个问题,我们提出了一种新的轻量级可微分网格渲染器,利用 3DGS 的高效光栅化过程,命名为高斯网格渲染器(Gaussian Mesh Renderer, GMR),它紧密集成了高斯和网格表示。每个高斯原语都是从相应的网格三角形中解析得出的,保持了结构的保真性并实现了梯度流。与传统的网格渲染器相比,我们的方法实现了更平滑的梯度,这尤其有助于在内存有限的情况下使用更小的批量大小进行更好的优化。我们的实现已在公共 GitHub 仓库中发布,地址为 https://github.com/huntorochi/Gaussian-Mesh-Renderer。
cs.CV / 127 / 2602.14498

Uncertainty-Aware Vision-Language Segmentation for Medical Imaging

面向医学影像的不确定性感知视觉-语言分割
Das, Aryan, Rachamalla, Tanishq, Biswas, Koushik, Roy, Swalpa Kumar, Verma, Vinay Kumar
Abstract
We introduce a novel uncertainty-aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion and long-range dependency modelling. To guide learning under ambiguity, we propose the Spectral-Entropic Uncertainty (SEU) Loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty in a unified objective. In complex clinical circumstances with poor image quality, this formulation improves model reliability. Extensive experiments on various publicly available medical datasets, QATA-COVID19, MosMed++, and Kvasir-SEG, demonstrate that our method achieves superior segmentation performance while being significantly more computationally efficient than existing State-of-the-Art (SoTA) approaches. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks. Code: https://github.com/arya-domain/UA-VLS
Chinese Translation
我们提出了一种新颖的不确定性感知多模态分割框架,利用放射影像和相关临床文本进行精确的医学诊断。我们提出了一种模态解码注意力块(Modality Decoding Attention Block,MoDAB)与轻量级状态空间混合器(State Space Mixer,SSMix),以实现高效的跨模态融合和长距离依赖建模。为了在模糊情况下指导学习,我们提出了谱熵不确定性损失(Spectral-Entropic Uncertainty,SEU Loss),该损失在统一目标中共同捕捉空间重叠、光谱一致性和预测不确定性。在图像质量较差的复杂临床环境中,这种公式提高了模型的可靠性。在多个公开可用的医学数据集(QATA-COVID19、MosMed++和Kvasir-SEG)上进行的广泛实验表明,我们的方法在分割性能上优于现有的最先进方法(State-of-the-Art,SoTA),同时在计算效率上显著更高。我们的结果强调了在视觉-语言医学分割任务中纳入不确定性建模和结构化模态对齐的重要性。代码链接:https://github.com/arya-domain/UA-VLS
cs.CV / 128 / 2602.14501

Prototype Instance-semantic Disentanglement with Low-rank Regularized Subspace Clustering for WSIs Explainable Recognition

基于低秩正则化子空间聚类的原型实例语义解缠结框架用于全幻灯片图像的可解释性识别
Li, Chentao, Huang, Pan
Abstract
The tumor region plays a key role in pathological diagnosis. Tumor tissues are highly similar to precancerous lesions and non tumor instances often greatly exceed tumor instances in whole slide images (WSIs). These issues cause instance-semantic entanglement in multi-instance learning frameworks, degrading both model representation capability and interpretability. To address this, we propose an end-to-end prototype instance semantic disentanglement framework with low-rank regularized subspace clustering, PID-LRSC, in two aspects. First, we use secondary instance subspace learning to construct low-rank regularized subspace clustering (LRSC), addressing instance entanglement caused by an excessive proportion of non tumor instances. Second, we employ enhanced contrastive learning to design prototype instance semantic disentanglement (PID), resolving semantic entanglement caused by the high similarity between tumor and precancerous tissues. We conduct extensive experiments on multicentre pathology datasets, implying that PID-LRSC outperforms other SOTA methods. Overall, PID-LRSC provides clearer instance semantics during decision-making and significantly enhances the reliability of auxiliary diagnostic outcomes.
Chinese Translation
肿瘤区域在病理诊断中发挥着关键作用。肿瘤组织与癌前病变高度相似,且在全幻灯片图像(WSIs)中,非肿瘤实例的数量往往远超肿瘤实例。这些问题导致多实例学习框架中的实例语义缠结,降低了模型的表征能力和可解释性。为了解决这一问题,我们提出了一种端到端的原型实例语义解缠结框架,结合低秩正则化子空间聚类(PID-LRSC),从两个方面进行改进。首先,我们采用二次实例子空间学习构建低秩正则化子空间聚类(LRSC),解决因非肿瘤实例比例过高而导致的实例缠结问题。其次,我们利用增强对比学习设计原型实例语义解缠结(PID),解决肿瘤与癌前组织之间高度相似性导致的语义缠结。我们在多中心病理数据集上进行了广泛实验,结果表明PID-LRSC优于其他最先进的方法(SOTA)。总体而言,PID-LRSC在决策过程中提供了更清晰的实例语义,并显著提高了辅助诊断结果的可靠性。
cs.CV / 129 / 2602.14509

MacNet: An End-to-End Manifold-Constrained Adaptive Clustering Network for Interpretable Whole Slide Image Classification

MacNet:一种端到端的流形约束自适应聚类网络,用于可解释的全切片图像分类
Ma, Mingrui, Li, Chentao, Huang, Pan, Qin, Jing
Abstract
Whole slide images (WSIs) are the gold standard for pathological diagnosis and sub-typing. Current main-stream two-step frameworks employ offline feature encoders trained without domain-specific knowledge. Among them, attention-based multiple instance learning (MIL) methods are outcome-oriented and offer limited interpretability. Clustering-based approaches can provide explainable decision-making process but suffer from high dimension features and semantically ambiguous centroids. To this end, we propose an end-to-end MIL framework that integrates Grassmann re-embedding and manifold adaptive clustering, where the manifold geometric structure facilitates robust clustering results. Furthermore, we design a prior knowledge guiding proxy instance labeling and aggregation strategy to approximate patch labels and focus on pathologically relevant tumor regions. Experiments on multicentre WSI datasets demonstrate that: 1) our cluster-incorporated model achieves superior performance in both grading accuracy and interpretability; 2) end-to-end learning refines better feature representations and it requires acceptable computation resources.
Chinese Translation
全切片图像(WSIs)是病理诊断和亚型分类的金标准。目前主流的两步框架采用离线特征编码器,这些编码器在没有领域特定知识的情况下进行训练。其中,基于注意力的多实例学习(MIL)方法以结果为导向,提供的可解释性有限。基于聚类的方法可以提供可解释的决策过程,但在高维特征和语义模糊的质心方面存在问题。为此,我们提出了一种端到端的MIL框架,集成了Grassmann重新嵌入和流形自适应聚类,其中流形几何结构促进了稳健的聚类结果。此外,我们设计了一种先验知识引导的代理实例标记和聚合策略,以近似补丁标签并关注病理相关的肿瘤区域。在多中心WSI数据集上的实验表明:1)我们的聚类整合模型在分级准确性和可解释性方面表现优越;2)端到端学习优化了更好的特征表示,并且需要可接受的计算资源。
cs.CV / 130 / 2602.14512

MedVAR: Towards Scalable and Efficient Medical Image Generation via Next-scale Autoregressive Prediction

MedVAR:通过下一尺度自回归预测实现可扩展且高效的医学图像生成
He, Zhicheng, Zhao, Yunpeng, Wu, Junde, Niu, Ziwei, Li, Zijun, Lin, Lanfen, Jin, Yueming
Abstract
Medical image generation is pivotal in applications like data augmentation for low-resource clinical tasks and privacy-preserving data sharing. However, developing a scalable generative backbone for medical imaging requires architectural efficiency, sufficient multi-organ data, and principled evaluation, yet current approaches leave these aspects unresolved. Therefore, we introduce MedVAR, the first autoregressive-based foundation model that adopts the next-scale prediction paradigm to enable fast and scale-up-friendly medical image synthesis. MedVAR generates images in a coarse-to-fine manner and produces structured multi-scale representations suitable for downstream use. To support hierarchical generation, we curate a harmonized dataset of around 440,000 CT and MRI images spanning six anatomical regions. Comprehensive experiments across fidelity, diversity, and scalability show that MedVAR achieves state-of-the-art generative performance and offers a promising architectural direction for future medical generative foundation models.
Chinese Translation
医学图像生成在低资源临床任务的数据增强和隐私保护数据共享等应用中至关重要。然而,开发一个可扩展的医学成像生成基础模型需要架构效率、足够的多器官数据和原则性的评估,而当前的方法在这些方面仍未得到解决。因此,我们提出了MedVAR,这是第一个基于自回归的基础模型,采用下一尺度预测范式,以实现快速且易于扩展的医学图像合成。MedVAR以粗到细的方式生成图像,并生成适合下游使用的结构化多尺度表示。为了支持分层生成,我们整理了一个和谐的数据集,包含约440,000张覆盖六个解剖区域的CT和MRI图像。全面的实验结果显示,在保真度、多样性和可扩展性方面,MedVAR达到了最先进的生成性能,并为未来医学生成基础模型提供了一个有前景的架构方向。
cs.CV / 131 / 2602.14514

Efficient Text-Guided Convolutional Adapter for the Diffusion Model

高效文本引导卷积适配器用于扩散模型
Das, Aryan, Biswas, Koushik, Roy, Swalpa Kumar, Patro, Badri Narayana, Verma, Vinay Kumar
Abstract
We introduce the Nexus Adapters, novel text-guided efficient adapters to the diffusion-based framework for the Structure Preserving Conditional Generation (SPCG). Recently, structure-preserving methods have achieved promising results in conditional image generation by using a base model for prompt conditioning and an adapter for structure input, such as sketches or depth maps. These approaches are highly inefficient and sometimes require equal parameters in the adapter compared to the base architecture. It is not always possible to train the model since the diffusion model is itself costly, and doubling the parameter is highly inefficient. In these approaches, the adapter is not aware of the input prompt; therefore, it is optimal only for the structural input but not for the input prompt. To overcome the above challenges, we proposed two efficient adapters, Nexus Prime and Slim, which are guided by prompts and structural inputs. Each Nexus Block incorporates cross-attention mechanisms to enable rich multimodal conditioning. Therefore, the proposed adapter has a better understanding of the input prompt while preserving the structure. We conducted extensive experiments on the proposed models and demonstrated that the Nexus Prime adapter significantly enhances performance, requiring only 8M additional parameters compared to the baseline, T2I-Adapter. Furthermore, we also introduced a lightweight Nexus Slim adapter with 18M fewer parameters than the T2I-Adapter, which still achieved state-of-the-art results. Code: https://github.com/arya-domain/Nexus-Adapters
Chinese Translation
我们提出了Nexus适配器,这是一种新颖的文本引导高效适配器,旨在用于基于扩散的结构保持条件生成(SPCG)框架。近年来,结构保持方法通过使用基础模型进行提示条件化和适配器处理结构输入(如草图或深度图)在条件图像生成中取得了令人鼓舞的成果。然而,这些方法效率极低,有时适配器所需的参数与基础架构相当。由于扩散模型本身的高成本,训练模型并不总是可行,而增加参数量则极为低效。在这些方法中,适配器并未考虑输入提示,因此它仅对结构输入最优,而对输入提示并不适用。为了解决上述挑战,我们提出了两个高效适配器,Nexus Prime和Slim,它们由提示和结构输入引导。每个Nexus块结合了交叉注意机制,以实现丰富的多模态条件化。因此,所提出的适配器在保持结构的同时,对输入提示有更好的理解。我们对所提出的模型进行了广泛的实验,结果表明,Nexus Prime适配器显著提升了性能,与基线T2I-Adapter相比,仅需额外8M参数。此外,我们还介绍了一种轻量级的Nexus Slim适配器,其参数比T2I-Adapter少18M,但仍然达到了最先进的结果。代码链接:https://github.com/arya-domain/Nexus-Adapters
cs.CV / 132 / 2602.14523

Architectural Insights for Post-Tornado Damage Recognition

龙卷风后损伤识别的建筑学见解
Umeike, Robinson, Dao, Thang, Crawford, Shane, van de Lindt, John, Johnston, Blythe, Wanting, Wang, Do, Trung, Mofikoya, Ajibola, Banjara, Sarbesh, Pham, Cuong
Abstract
Rapid and accurate building damage assessment in the immediate aftermath of tornadoes is critical for coordinating life-saving search and rescue operations, optimizing emergency resource allocation, and accelerating community recovery. However, current automated methods struggle with the unique visual complexity of tornado-induced wreckage, primarily due to severe domain shift from standard pre-training datasets and extreme class imbalance in real-world disaster data. To address these challenges, we introduce a systematic experimental framework evaluating 79 open-source deep learning models, encompassing both Convolutional Neural Networks (CNNs) and Vision Transformers, across over 2,300 controlled experiments on our newly curated Quad-State Tornado Damage (QSTD) benchmark dataset. Our findings reveal that achieving operational-grade performance hinges on a complex interaction between architecture and optimization, rather than architectural selection alone. Most strikingly, we demonstrate that optimizer choice can be more consequential than architecture: switching from Adam to SGD provided dramatic F1 gains of +25 to +38 points for Vision Transformer and Swin Transformer families, fundamentally reversing their ranking from bottom-tier to competitive with top-performing CNNs. Furthermore, a low learning rate of 1x10^(-4) proved universally critical, boosting average F1 performance by +10.2 points across all architectures. Our champion model, ConvNeXt-Base trained with these optimized settings, demonstrated strong cross-event generalization on the held-out Tuscaloosa-Moore Tornado Damage (TMTD) dataset, achieving 46.4% Macro F1 (+34.6 points over baseline) and retaining 85.5% Ordinal Top-1 Accuracy despite temporal and sensor domain shifts.
Chinese Translation
在龙卷风过后立即进行快速而准确的建筑损伤评估对于协调生命救援行动、优化紧急资源分配以及加速社区恢复至关重要。然而,当前的自动化方法在处理龙卷风引发的残骸时面临独特的视觉复杂性,主要是由于与标准预训练数据集之间的严重领域转移以及现实灾害数据中的极端类别不平衡。为了解决这些挑战,我们引入了一个系统的实验框架,评估了79个开源深度学习模型,包括卷积神经网络(CNNs)和视觉变换器(Vision Transformers),在我们新创建的四州龙卷风损伤(Quad-State Tornado Damage, QSTD)基准数据集上进行了超过2300个受控实验。我们的研究结果表明,实现操作级性能依赖于架构与优化之间的复杂互动,而不仅仅是架构选择。最引人注目的是,我们展示了优化器的选择可能比架构更为重要:将优化器从Adam切换到SGD为视觉变换器和Swin变换器系列带来了+25到+38点的显著F1提升,根本上改变了它们的排名,从底层跃升至与顶级CNN竞争。此外,1x10^(-4)的低学习率在所有架构中普遍至关重要,使平均F1性能提升了+10.2点。我们的冠军模型ConvNeXt-Base在这些优化设置下训练,展示了在保留的塔斯卡卢萨-摩尔龙卷风损伤(Tuscaloosa-Moore Tornado Damage, TMTD)数据集上的强大跨事件泛化能力,达到了46.4%的宏观F1(比基线提升+34.6点),并在时间和传感器领域转移的情况下保持了85.5%的序数Top-1准确率。
cs.CV / 133 / 2602.14524

Error Patterns in Historical OCR: A Comparative Analysis of TrOCR and a Vision-Language Model

历史OCR中的错误模式:TrOCR与视觉语言模型的比较分析
Vesalainen, Ari, Mäkelä, Eetu, Ruotsalainen, Laura, Tolonen, Mikko
Abstract
Optical Character Recognition (OCR) of eighteenth-century printed texts remains challenging due to degraded print quality, archaic glyphs, and non-standardized orthography. Although transformer-based OCR systems and Vision-Language Models (VLMs) achieve strong aggregate accuracy, metrics such as Character Error Rate (CER) and Word Error Rate (WER) provide limited insight into their reliability for scholarly use. We compare a dedicated OCR transformer (TrOCR) and a general-purpose Vision-Language Model (Qwen) on line-level historical English texts using length-weighted accuracy metrics and hypothesis driven error analysis. While Qwen achieves lower CER/WER and greater robustness to degraded input, it exhibits selective linguistic regularization and orthographic normalization that may silently alter historically meaningful forms. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation. Our findings show that architectural inductive biases shape OCR error structure in systematic ways. Models with similar aggregate accuracy can differ substantially in error locality, detectability, and downstream scholarly risk, underscoring the need for architecture-aware evaluation in historical digitization workflows.
Chinese Translation
由于印刷质量下降、古老的字形和非标准化的拼写,十八世纪印刷文本的光学字符识别(OCR)仍然面临挑战。尽管基于变换器的OCR系统和视觉语言模型(VLM)在整体准确性上表现出色,但字符错误率(CER)和词错误率(WER)等指标对其在学术使用中的可靠性提供的洞察有限。我们比较了一种专用的OCR变换器(TrOCR)和一种通用的视觉语言模型(Qwen),使用长度加权的准确性指标和假设驱动的错误分析,针对线级历史英语文本进行分析。尽管Qwen在CER/WER上表现出更低的值并对劣化输入具有更强的鲁棒性,但它表现出选择性的语言规则化和拼写规范化,可能会默默改变历史上有意义的形式。TrOCR在保持拼写忠实性方面更为一致,但更容易出现错误级联传播。我们的研究结果表明,架构的归纳偏差以系统的方式塑造OCR错误结构。具有相似整体准确性的模型在错误的局部性、可检测性和下游学术风险方面可能存在显著差异,这突显了在历史数字化工作流程中进行架构意识评估的必要性。
cs.CV / 134 / 2602.14525

Cross-view Domain Generalization via Geometric Consistency for LiDAR Semantic Segmentation

通过几何一致性实现跨视角领域泛化的LiDAR语义分割
Zhao, Jindong, Gao, Yuan, Xia, Yang, Nie, Sheng, Yue, Jun, Sun, Weiwei, Xia, Shaobo
Abstract
Domain-generalized LiDAR semantic segmentation (LSS) seeks to train models on source-domain point clouds that generalize reliably to multiple unseen target domains, which is essential for real-world LiDAR applications. However, existing approaches assume similar acquisition views (e.g., vehicle-mounted) and struggle in cross-view scenarios, where observations differ substantially due to viewpoint-dependent structural incompleteness and non-uniform point density. Accordingly, we formulate cross-view domain generalization for LiDAR semantic segmentation and propose a novel framework, termed CVGC (Cross-View Geometric Consistency). Specifically, we introduce a cross-view geometric augmentation module that models viewpoint-induced variations in visibility and sampling density, generating multiple cross-view observations of the same scene. Subsequently, a geometric consistency module enforces consistent semantic and occupancy predictions across geometrically augmented point clouds of the same scene. Extensive experiments on six public LiDAR datasets establish the first systematic evaluation of cross-view domain generalization for LiDAR semantic segmentation, demonstrating that CVGC consistently outperforms state-of-the-art methods when generalizing from a single source domain to multiple target domains with heterogeneous acquisition viewpoints. The source code will be publicly available at https://github.com/KintomZi/CVGC-DG
Chinese Translation
领域泛化的LiDAR语义分割(LSS)旨在在源领域点云上训练模型,使其能够可靠地泛化到多个未见过的目标领域,这对于实际的LiDAR应用至关重要。然而,现有的方法假设采集视角相似(例如,车载),在跨视角场景中表现不佳,因为观察结果因视角依赖的结构不完整性和非均匀点密度而存在显著差异。因此,我们为LiDAR语义分割制定了跨视角领域泛化的框架,并提出了一种新颖的框架,称为CVGC(Cross-View Geometric Consistency)。具体而言,我们引入了一个跨视角几何增强模块,该模块建模了视角引起的可见性和采样密度的变化,生成同一场景的多个跨视角观察。随后,一个几何一致性模块在同一场景的几何增强点云之间强制执行一致的语义和占用预测。在六个公共LiDAR数据集上的广泛实验建立了对LiDAR语义分割的跨视角领域泛化的首次系统评估,结果表明,CVGC在从单一源领域泛化到具有异构采集视角的多个目标领域时,始终优于最先进的方法。源代码将公开发布在 https://github.com/KintomZi/CVGC-DG
cs.CV / 135 / 2602.14534

MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation

MoRL:统一运动理解与生成的强化推理
Wang, Hongpeng, Zhang, Zeyu, Li, Wenhao, Tang, Hao
Abstract
Human motion understanding and generation are crucial for vision and robotics but remain limited in reasoning capability and test-time planning. We propose MoRL, a unified multimodal motion model trained with supervised fine-tuning and reinforcement learning with verifiable rewards. Our task-specific reward design combines semantic alignment and reasoning coherence for understanding with physical plausibility and text-motion consistency for generation, improving both logical reasoning and perceptual realism. To further enhance inference, we introduce Chain-of-Motion (CoM), a test-time reasoning method that enables step-by-step planning and reflection. We also construct two large-scale CoT datasets, MoUnd-CoT-140K and MoGen-CoT-140K, to align motion sequences with reasoning traces and action descriptions. Experiments on HumanML3D and KIT-ML show that MoRL achieves significant gains over state-of-the-art baselines. Code: https://github.com/AIGeeksGroup/MoRL. Website: https://aigeeksgroup.github.io/MoRL.
Chinese Translation
人类运动的理解与生成对于视觉和机器人技术至关重要,但在推理能力和测试时规划方面仍然有限。我们提出了MoRL,一种统一的多模态运动模型,通过监督微调和具有可验证奖励的强化学习进行训练。我们的任务特定奖励设计结合了语义对齐和推理一致性用于理解,以及物理合理性和文本-运动一致性用于生成,从而提高了逻辑推理和感知现实性。为了进一步增强推理能力,我们引入了运动链(Chain-of-Motion, CoM),这是一种测试时推理方法,能够实现逐步规划和反思。我们还构建了两个大规模的CoT数据集,MoUnd-CoT-140K和MoGen-CoT-140K,以将运动序列与推理轨迹和动作描述对齐。在HumanML3D和KIT-ML上的实验表明,MoRL在性能上显著超越了最先进的基线。代码链接:https://github.com/AIGeeksGroup/MoRL。网站链接:https://aigeeksgroup.github.io/MoRL。
cs.CV / 136 / 2602.14552

OmniVTON++: Training-Free Universal Virtual Try-On with Principal Pose Guidance

OmniVTON++:无训练的通用虚拟试穿与主姿态引导
Yang, Zhaotong, Du, Yong, He, Shengfeng, Li, Yuhui, Li, Xinzhe, Xu, Yangyang, Dong, Junyu, Yang, Jian
Abstract
Image-based Virtual Try-On (VTON) concerns the synthesis of realistic person imagery through garment re-rendering under human pose and body constraints. In practice, however, existing approaches are typically optimized for specific data conditions, making their deployment reliant on retraining and limiting their generalization as a unified solution. We present OmniVTON++, a training-free VTON framework designed for universal applicability. It addresses the intertwined challenges of garment alignment, human structural coherence, and boundary continuity by coordinating Structured Garment Morphing for correspondence-driven garment adaptation, Principal Pose Guidance for step-wise structural regulation during diffusion sampling, and Continuous Boundary Stitching for boundary-aware refinement, forming a cohesive pipeline without task-specific retraining. Experimental results demonstrate that OmniVTON++ achieves state-of-the-art performance across diverse generalization settings, including cross-dataset and cross-garment-type evaluations, while reliably operating across scenarios and diffusion backbones within a single formulation. In addition to single-garment, single-human cases, the framework supports multi-garment, multi-human, and anime character virtual try-on, expanding the scope of virtual try-on applications. The source code will be released to the public.
Chinese Translation
基于图像的虚拟试穿(VTON)涉及在人体姿态和身体约束下合成逼真的人物图像。然而,在实际应用中,现有的方法通常针对特定数据条件进行优化,使得它们的部署依赖于重新训练,并限制了作为统一解决方案的泛化能力。我们提出了OmniVTON++,一个旨在通用适用性的无训练VTON框架。它通过协调结构化服装变形(Structured Garment Morphing)以实现基于对应的服装适应、主姿态引导(Principal Pose Guidance)以在扩散采样过程中进行逐步结构调节,以及连续边界缝合(Continuous Boundary Stitching)以进行边界感知的细化,解决了服装对齐、人类结构一致性和边界连续性之间的相互挑战,形成了一个无需特定任务重新训练的连贯流程。实验结果表明,OmniVTON++在多样化的泛化设置中实现了最先进的性能,包括跨数据集和跨服装类型的评估,同时在单一框架内可靠地在不同场景和扩散骨干网络中运行。除了单一服装、单一人类的案例外,该框架还支持多服装、多人类和动漫角色的虚拟试穿,扩展了虚拟试穿应用的范围。源代码将公开发布。
cs.CV / 137 / 2602.14577

DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving

DriveFine:精细增强的掩蔽扩散 VLA 以实现精确和稳健的驾驶
Dang, Chenxu, Ang, Sining, Li, Yongkang, Tian, Haochen, Wang, Jie, Li, Guang, Ye, Hangjun, Ma, Jie, Chen, Long, Wang, Yan
Abstract
Vision-Language-Action (VLA) models for autonomous driving increasingly adopt generative planners trained with imitation learning followed by reinforcement learning. Diffusion-based planners suffer from modality alignment difficulties, low training efficiency, and limited generalization. Token-based planners are plagued by cumulative causal errors and irreversible decoding. In summary, the two dominant paradigms exhibit complementary strengths and weaknesses. In this paper, we propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities. In particular, we design a novel plug-and-play block-MoE, which seamlessly injects a refinement expert on top of the generation expert. By enabling explicit expert selection during inference and gradient blocking during training, the two experts are fully decoupled, preserving the foundational capabilities and generic patterns of the pretrained weights, which highlights the flexibility and extensibility of the block-MoE design. Furthermore, we design a hybrid reinforcement learning strategy that encourages effective exploration of refinement expert while maintaining training stability. Extensive experiments on NAVSIM v1, v2, and Navhard benchmarks demonstrate that DriveFine exhibits strong efficacy and robustness. The code will be released at https://github.com/MSunDYY/DriveFine.
Chinese Translation
自主驾驶的视觉-语言-动作(VLA)模型越来越多地采用经过模仿学习和后续强化学习训练的生成规划器。然而,基于扩散的规划器面临模态对齐困难、训练效率低下和泛化能力有限的问题。基于标记的规划器则受到累积因果错误和不可逆解码的困扰。总的来说,这两种主流范式展现出互补的优势和劣势。本文提出了 DriveFine,一种掩蔽扩散 VLA 模型,结合了灵活解码与自我修正能力。特别地,我们设计了一种新颖的即插即用块-MoE(Mixture of Experts),能够在生成专家之上无缝注入一个精细化专家。通过在推理过程中启用显式专家选择以及在训练过程中进行梯度阻断,这两个专家被完全解耦,从而保留了预训练权重的基础能力和通用模式,突显了块-MoE 设计的灵活性和可扩展性。此外,我们设计了一种混合强化学习策略,鼓励有效探索精细化专家,同时保持训练的稳定性。在 NAVSIM v1、v2 和 Navhard 基准上的大量实验表明,DriveFine 展现了强大的有效性和稳健性。代码将发布在 https://github.com/MSunDYY/DriveFine。
cs.CV / 138 / 2602.14582

YOLO26: A Comprehensive Architecture Overview and Key Improvements

YOLO26:全面架构概述及关键改进
Hidayatullah, Priyanto, Tubagus, Refdinal
Abstract
You Only Look Once (YOLO) has been the prominent model for computer vision in deep learning for a decade. This study explores the novel aspects of YOLO26, the most recent version in the YOLO series. The elimination of Distribution Focal Loss (DFL), implementation of End-to-End NMS-Free Inference, introduction of ProgLoss + Small-Target-Aware Label Assignment (STAL), and use of the MuSGD optimizer are the primary enhancements designed to improve inference speed, which is claimed to achieve a 43% boost in CPU mode. This is designed to allow YOLO26 to attain real-time performance on edge devices or those without GPUs. Additionally, YOLO26 offers improvements in many computer vision tasks, including instance segmentation, pose estimation, and oriented bounding box (OBB) decoding. We aim for this effort to provide more value than just consolidating information already included in the existing technical documentation. Therefore, we performed a rigorous architectural investigation into YOLO26, mostly using the source code available in its GitHub repository and its official documentation. The authentic and detailed operational mechanisms of YOLO26 are inside the source code, which is seldom extracted by others. The YOLO26 architectural diagram is shown as the outcome of the investigation. This study is, to our knowledge, the first one presenting the CNN-based YOLO26 architecture, which is the core of YOLO26. Our objective is to provide a precise architectural comprehension of YOLO26 for researchers and developers aspiring to enhance the YOLO model, ensuring it remains the leading deep learning model in computer vision.
Chinese Translation
You Only Look Once (YOLO) 在深度学习的计算机视觉领域已占据主导地位达十年之久。本研究探讨了 YOLO26 的新颖之处,这是 YOLO 系列中的最新版本。主要改进包括消除分布式焦点损失 (Distribution Focal Loss, DFL)、实现端到端无 NMS 推理、引入 ProgLoss + 小目标感知标签分配 (Small-Target-Aware Label Assignment, STAL),以及使用 MuSGD 优化器,这些改进旨在提高推理速度,声称在 CPU 模式下实现了 43% 的提升。这些设计使 YOLO26 能够在边缘设备或无 GPU 的设备上实现实时性能。此外,YOLO26 在多个计算机视觉任务中也有所改进,包括实例分割、姿态估计和定向边界框 (Oriented Bounding Box, OBB) 解码。我们希望此次努力提供的价值不仅仅是整合现有技术文档中的信息。因此,我们对 YOLO26 进行了严格的架构调查,主要使用其 GitHub 仓库中的源代码及其官方文档。YOLO26 的真实且详细的操作机制在源代码中,其他人很少提取。YOLO26 的架构图是本次调查的结果。据我们所知,本研究是首次呈现基于 CNN 的 YOLO26 架构,这也是 YOLO26 的核心。我们的目标是为希望改进 YOLO 模型的研究人员和开发者提供 YOLO26 的精确架构理解,确保其在计算机视觉领域继续保持领先的深度学习模型。
cs.CV / 139 / 2602.14615

VariViT: A Vision Transformer for Variable Image Sizes

VariViT:一种适用于可变图像尺寸的视觉变换器
Varma, Aswathi, Shit, Suprosanna, Prabhakar, Chinmay, Scholz, Daniel, Li, Hongwei Bran, Menze, Bjoern, Rueckert, Daniel, Wiestler, Benedikt
Abstract
Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: https://github.com/Aswathi-Varma/varivit
Chinese Translation
视觉变换器(Vision Transformers, ViTs)已成为表征学习领域的最先进架构,利用自注意力机制在各种任务中表现出色。ViTs 将图像分割为固定大小的补丁,这限制了它们的预定义尺寸,并需要进行如调整大小、填充或裁剪等预处理步骤。这在医学成像中带来了挑战,尤其是对于肿瘤等不规则形状的结构。固定的边界框裁剪大小会产生前景与背景比例高度可变的输入图像。调整医学图像大小可能会降低信息质量并引入伪影,从而影响诊断。因此,定制适合感兴趣区域的可变大小裁剪可以增强特征表示能力。此外,大图像的计算成本较高,而较小的图像则存在信息丢失的风险,呈现出计算与准确性之间的权衡。我们提出了 VariViT,一种改进的 ViT 模型,旨在处理可变图像尺寸,同时保持一致的补丁大小。VariViT 采用了一种新颖的位置嵌入调整方案,以适应可变数量的补丁。我们还在 VariViT 中实施了一种新的批处理策略,以降低计算复杂性,从而加快训练和推理时间。在对两个 3D 脑部 MRI 数据集的评估中,VariViT 在胶质瘤基因型预测和脑肿瘤分类方面超越了传统的 ViTs 和 ResNet,分别达到了 75.5% 和 76.3% 的 F1 分数,学习到了更具辨别力的特征。我们提出的批处理策略与传统架构相比,计算时间减少了多达 30%。这些发现强调了 VariViT 在图像表示学习中的有效性。我们的代码可以在此找到:https://github.com/Aswathi-Varma/varivit
cs.CV / 140 / 2602.14633

VIGIL: Tackling Hallucination Detection in Image Recontextualization

VIGIL:应对图像重语境化中的幻觉检测
Wojciechowicz, Joanna, Łubniewska, Maria, Antczak, Jakub, Baczyńska, Justyna, Gromski, Wojciech, Kozłowski, Wojciech, Zięba, Maciej
Abstract
We introduce VIGIL (Visual Inconsistency & Generative In-context Lucidity), the first benchmark dataset and framework providing a fine-grained categorization of hallucinations in the multimodal image recontextualization task for large multimodal models (LMMs). While existing research often treats hallucinations as a uniform issue, our work addresses a significant gap in multimodal evaluation by decomposing these errors into five categories: pasted object hallucinations, background hallucinations, object omission, positional & logical inconsistencies, and physical law violations. To address these complexities, we propose a multi-stage detection pipeline. Our architecture processes recontextualized images through a series of specialized steps targeting object-level fidelity, background consistency, and omission detection, leveraging a coordinated ensemble of open-source models, whose effectiveness is demonstrated through extensive experimental evaluations. Our approach enables a deeper understanding of where the models fail with an explanation; thus, we fill a gap in the field, as no prior methods offer such categorization and decomposition for this task. To promote transparency and further exploration, we openly release VIGIL, along with the detection pipeline and benchmark code, through our GitHub repository: https://github.com/mlubneuskaya/vigil and Data repository: https://huggingface.co/datasets/joannaww/VIGIL.
Chinese Translation
我们介绍了VIGIL(视觉不一致性与生成上下文清晰度),这是首个基准数据集和框架,为大型多模态模型(LMMs)在多模态图像重语境化任务中提供了幻觉的细粒度分类。尽管现有研究通常将幻觉视为一个统一的问题,我们的工作通过将这些错误分解为五类:粘贴物体幻觉、背景幻觉、物体遗漏、位置与逻辑不一致以及物理法则违反,填补了多模态评估中的一个重要空白。为了应对这些复杂性,我们提出了一个多阶段检测管道。我们的架构通过一系列专门步骤处理重语境化的图像,针对物体级别的保真度、背景一致性和遗漏检测,利用一组协调的开源模型,其有效性通过广泛的实验评估得以证明。我们的方法使得对模型失败的原因有更深入的理解,并提供了解释;因此,我们填补了该领域的空白,因为没有先前的方法为此任务提供这样的分类和分解。为了促进透明度和进一步探索,我们通过我们的GitHub存储库(https://github.com/mlubneuskaya/vigil)和数据存储库(https://huggingface.co/datasets/joannaww/VIGIL)公开发布了VIGIL及其检测管道和基准代码。
cs.CV / 141 / 2602.14648

SketchingReality: From Freehand Scene Sketches To Photorealistic Images

SketchingReality:从自由手绘场景草图到照片级真实图像
Bourouis, Ahmed, Bessmeltsev, Mikhail, Gryaditskaya, Yulia
Abstract
Recent years have witnessed remarkable progress in generative AI, with natural language emerging as the most common conditioning input. As underlying models grow more powerful, researchers are exploring increasingly diverse conditioning signals, such as depth maps, edge maps, camera parameters, and reference images, to give users finer control over generation. Among different modalities, sketches are a natural and long-standing form of human communication, enabling rapid expression of visual concepts. Previous literature has largely focused on edge maps, often misnamed 'sketches', yet algorithms that effectively handle true freehand sketches, with their inherent abstraction and distortions, remain underexplored. We pursue the challenging goal of balancing photorealism with sketch adherence when generating images from freehand input. A key obstacle is the absence of ground-truth, pixel-aligned images: by their nature, freehand sketches do not have a single correct alignment. To address this, we propose a modulation-based approach that prioritizes semantic interpretation of the sketch over strict adherence to individual edge positions. We further introduce a novel loss that enables training on freehand sketches without requiring ground-truth pixel-aligned images. We show that our method outperforms existing approaches in both semantic alignment with freehand sketch inputs and in the realism and overall quality of the generated images.
Chinese Translation
近年来,生成性人工智能取得了显著进展,自然语言成为最常见的条件输入。随着基础模型的不断增强,研究人员正在探索越来越多样化的条件信号,如深度图、边缘图、相机参数和参考图像,以便为用户提供更精细的生成控制。在不同的模态中,草图是一种自然且长期存在的人类沟通形式,能够快速表达视觉概念。以往的文献主要集中在边缘图上,常常被误称为“草图”,然而有效处理真实自由手绘草图的算法,因其固有的抽象性和失真,仍然未得到充分探索。我们追求在从自由手绘输入生成图像时,平衡照片级真实感与草图遵循性的挑战目标。一个关键障碍是缺乏真实的、像素对齐的图像:自由手绘草图本质上没有单一的正确对齐方式。为了解决这个问题,我们提出了一种基于调制的方法,优先考虑草图的语义解释,而不是严格遵循单个边缘位置。我们进一步引入了一种新颖的损失函数,使得在没有真实像素对齐图像的情况下也能对自由手绘草图进行训练。我们展示了我们的方法在与自由手绘草图输入的语义对齐以及生成图像的真实感和整体质量方面优于现有方法。
cs.CV / 142 / 2602.14662

Advances in Global Solvers for 3D Vision

三维视觉的全局求解器进展
Zhao, Zhenjun, Yang, Heng, Liao, Bangyan, Zeng, Yingping, Yan, Shaocheng, Gu, Yingdong, Liu, Peidong, Zhou, Yi, Li, Haoang, Civera, Javier
Abstract
Global solvers have emerged as a powerful paradigm for 3D vision, offering certifiable solutions to nonconvex geometric optimization problems traditionally addressed by local or heuristic methods. This survey presents the first systematic review of global solvers in geometric vision, unifying the field through a comprehensive taxonomy of three core paradigms: Branch-and-Bound (BnB), Convex Relaxation (CR), and Graduated Non-Convexity (GNC). We present their theoretical foundations, algorithmic designs, and practical enhancements for robustness and scalability, examining how each addresses the fundamental nonconvexity of geometric estimation problems. Our analysis spans ten core vision tasks, from Wahba problem to bundle adjustment, revealing the optimality-robustness-scalability trade-offs that govern solver selection. We identify critical future directions: scaling algorithms while maintaining guarantees, integrating data-driven priors with certifiable optimization, establishing standardized benchmarks, and addressing societal implications for safety-critical deployment. By consolidating theoretical foundations, practical advances, and broader impacts, this survey provides a unified perspective and roadmap toward certifiable, trustworthy perception for real-world applications. A continuously-updated literature summary and companion code tutorials are available at https://github.com/ericzzj1989/Awesome-Global-Solvers-for-3D-Vision.
Chinese Translation
全局求解器作为三维视觉的强大范式,提供了对传统上由局部或启发式方法解决的非凸几何优化问题的可证明解决方案。本调查首次系统性地回顾了几何视觉中的全局求解器,通过对三种核心范式的全面分类:分支限界法(Branch-and-Bound, BnB)、凸松弛(Convex Relaxation, CR)和渐进非凸性(Graduated Non-Convexity, GNC),统一了该领域。我们介绍了它们的理论基础、算法设计以及在鲁棒性和可扩展性方面的实际增强,考察了每种方法如何应对几何估计问题的基本非凸性。我们的分析涵盖了十个核心视觉任务,从Wahba问题到束调整,揭示了影响求解器选择的最优性-鲁棒性-可扩展性权衡。我们确定了关键的未来方向:在保持保证的同时扩展算法,将数据驱动的先验与可证明的优化相结合,建立标准化基准,以及解决安全关键部署的社会影响。通过整合理论基础、实际进展和更广泛的影响,本调查提供了一个统一的视角和通往可证明、可信感知的路线图,以应对现实世界应用。持续更新的文献摘要和配套代码教程可在 https://github.com/ericzzj1989/Awesome-Global-Solvers-for-3D-Vision 获取。
cs.CV / 143 / 2602.14672

MeFEm: Medical Face Embedding model

MeFEm:医学面部嵌入模型
Borets, Yury, Botman, Stepan
Abstract
We present MeFEm, a vision model based on a modified Joint Embedding Predictive Architecture (JEPA) for biometric and medical analysis from facial images. Key modifications include an axial stripe masking strategy to focus learning on semantically relevant regions, a circular loss weighting scheme, and the probabilistic reassignment of the CLS token for high quality linear probing. Trained on a consolidated dataset of curated images, MeFEm outperforms strong baselines like FaRL and Franca on core anthropometric tasks despite using significantly less data. It also shows promising results on Body Mass Index (BMI) estimation, evaluated on a novel, consolidated closed-source dataset that addresses the domain bias prevalent in existing data. Model weights are available at https://huggingface.co/boretsyury/MeFEm , offering a strong baseline for future work in this domain.
Chinese Translation
我们提出了MeFEm,一种基于修改后的联合嵌入预测架构(Joint Embedding Predictive Architecture, JEPA)的视觉模型,用于从面部图像进行生物识别和医学分析。主要修改包括轴向条纹掩蔽策略,以集中学习于语义相关区域,循环损失加权方案,以及对CLS标记的概率重新分配,以实现高质量的线性探测。在经过精心策划的图像合成数据集上训练后,尽管使用的数据显著较少,MeFEm在核心人类测量任务上超越了FaRL和Franca等强基线。它在体重指数(Body Mass Index, BMI)估计方面也显示出良好的结果,评估基于一个新颖的、合成的闭源数据集,该数据集解决了现有数据中普遍存在的领域偏差。模型权重可在https://huggingface.co/boretsyury/MeFEm获取,为该领域未来的研究提供了强有力的基线。
cs.CV / 144 / 2602.14679

Universal Image Immunization against Diffusion-based Image Editing via Semantic Injection

通过语义注入实现对基于扩散的图像编辑的通用图像免疫
Lee, Chanhui, Shin, Seunghyun, Choi, Donggyu, Jeon, Hae-gon, Son, Jeany
Abstract
Recent advances in diffusion models have enabled powerful image editing capabilities guided by natural language prompts, unlocking new creative possibilities. However, they introduce significant ethical and legal risks, such as deepfakes and unauthorized use of copyrighted visual content. To address these risks, image immunization has emerged as a promising defense against AI-driven semantic manipulation. Yet, most existing approaches rely on image-specific adversarial perturbations that require individual optimization for each image, thereby limiting scalability and practicality. In this paper, we propose the first universal image immunization framework that generates a single, broadly applicable adversarial perturbation specifically designed for diffusion-based editing pipelines. Inspired by universal adversarial perturbation (UAP) techniques used in targeted attacks, our method generates a UAP that embeds a semantic target into images to be protected. Simultaneously, it suppresses original content to effectively misdirect the model's attention during editing. As a result, our approach effectively blocks malicious editing attempts by overwriting the original semantic content in the image via the UAP. Moreover, our method operates effectively even in data-free settings without requiring access to training data or domain knowledge, further enhancing its practicality and broad applicability in real-world scenarios. Extensive experiments show that our method, as the first universal immunization approach, significantly outperforms several baselines in the UAP setting. In addition, despite the inherent difficulty of universal perturbations, our method also achieves performance on par with image-specific methods under a more restricted perturbation budget, while also exhibiting strong black-box transferability across different diffusion models.
Chinese Translation
最近,扩散模型的进展使得基于自然语言提示的强大图像编辑能力成为可能,开启了新的创作可能性。然而,这也带来了显著的伦理和法律风险,例如深度伪造和未经授权使用受版权保护的视觉内容。为了解决这些风险,图像免疫作为一种针对人工智能驱动的语义操控的有希望的防御机制应运而生。然而,大多数现有方法依赖于特定于图像的对抗性扰动,这需要对每个图像进行单独优化,从而限制了可扩展性和实用性。本文提出了第一个通用图像免疫框架,该框架生成一个单一的、广泛适用的对抗性扰动,专门为基于扩散的编辑管道设计。我们的研究受到用于定向攻击的通用对抗性扰动(UAP)技术的启发,生成一种UAP,将语义目标嵌入到需要保护的图像中。同时,它抑制原始内容,以有效地误导模型在编辑过程中的注意力。因此,我们的方法通过通过UAP覆盖图像中的原始语义内容,有效阻止了恶意编辑尝试。此外,我们的方法在无数据环境中也能有效运行,无需访问训练数据或领域知识,进一步增强了其在现实场景中的实用性和广泛适用性。大量实验表明,作为第一个通用免疫方法,我们的方法在UAP设置中显著优于多个基线。此外,尽管通用扰动本身具有固有的困难,我们的方法在更严格的扰动预算下也达到了与特定图像方法相当的性能,同时在不同扩散模型之间表现出强大的黑箱可转移性。
cs.CV / 145 / 2602.14705

It's a Matter of Time: Three Lessons on Long-Term Motion for Perception

时间问题:关于感知的长期运动的三条经验教训
Davison, Willem, Hao, Xinyue, Sevilla-Lara, Laura
Abstract
Temporal information has long been considered to be essential for perception. While there is extensive research on the role of image information for perceptual tasks, the role of the temporal dimension remains less well understood: What can we learn about the world from long-term motion information? What properties does long-term motion information have for visual learning? We leverage recent success in point-track estimation, which offers an excellent opportunity to learn temporal representations and experiment on a variety of perceptual tasks. We draw 3 clear lessons: 1) Long-term motion representations contain information to understand actions, but also objects, materials, and spatial information, often even better than images. 2) Long-term motion representations generalize far better than image representations in low-data settings and in zero-shot tasks. 3) The very low dimensionality of motion information makes motion representations a better trade-off between GFLOPs and accuracy than standard video representations, and used together they achieve higher performance than video representations alone. We hope these insights will pave the way for the design of future models that leverage the power of long-term motion information for perception.
Chinese Translation
时间信息长期以来被认为对感知至关重要。尽管关于图像信息在感知任务中作用的研究广泛,但时间维度的作用仍然不太为人所知:我们可以从长期运动信息中学到关于世界的什么?长期运动信息在视觉学习中具有哪些特性?我们利用最近在点轨迹估计方面的成功,这为学习时间表示和在各种感知任务中进行实验提供了极好的机会。我们总结了三条明确的经验教训:1)长期运动表示包含理解动作、物体、材料和空间信息所需的信息,往往甚至比图像更好。2)在低数据环境和零样本任务中,长期运动表示的泛化能力远远优于图像表示。3)运动信息的极低维度使得运动表示在GFLOPs和准确性之间的权衡优于标准视频表示,且两者结合使用时的性能优于单独使用视频表示。我们希望这些见解能够为未来模型的设计铺平道路,使其利用长期运动信息的力量来提升感知能力。
cs.CV / 146 / 2602.14751

Depth Completion as Parameter-Efficient Test-Time Adaptation

深度补全作为参数高效的测试时适应
Ke, Bingxin, Zhou, Qunjie, Huang, Jiahui, Ren, Xuanchi, Shen, Tianchang, Schindler, Konrad, Leal-Taixé, Laura, Huang, Shengyu
Abstract
We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion, using sparse geometric cues. Unlike prior methods that train task-specific encoders for auxiliary inputs, which often overfit and generalize poorly, CAPA freezes the FM backbone. Instead, it updates only a minimal set of parameters using Parameter-Efficient Fine-Tuning (e.g. LoRA or VPT), guided by gradients calculated directly from the sparse observations available at inference time. This approach effectively grounds the foundation model's geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets. Project page: research.nvidia.com/labs/dvl/projects/capa.
Chinese Translation
我们提出了CAPA,一个参数高效的测试时优化框架,利用稀疏几何线索对预训练的3D基础模型(FMs)进行深度补全适应。与之前的方法不同,后者为辅助输入训练特定任务的编码器,往往会导致过拟合和泛化能力差,CAPA则冻结了基础模型的主干。相反,它仅更新一小部分参数,使用参数高效微调(例如LoRA或VPT),并通过直接从推理时可用的稀疏观测中计算的梯度进行引导。这种方法有效地将基础模型的几何先验与场景特定的测量结果结合起来,纠正失真和错位结构。对于视频,CAPA引入了序列级参数共享,联合适应所有帧,以利用时间相关性,提高鲁棒性,并强制多帧一致性。CAPA是模型无关的,兼容任何基于ViT的基础模型,并在室内和室外数据集的多种条件模式下实现了最先进的结果。项目页面:research.nvidia.com/labs/dvl/projects/capa。
cs.CV / 147 / 2602.14767

SAILS: Segment Anything with Incrementally Learned Semantics for Task-Invariant and Training-Free Continual Learning

SAILS:通过增量学习语义实现任务不变和无训练的持续学习的任意分割
Muralidhara, Shishir, Stricker, Didier, Schuster, René
Abstract
Continual learning remains constrained by the need for repeated retraining, high computational costs, and the persistent challenge of forgetting. These factors significantly limit the applicability of continuous learning in real-world settings, as iterative model updates require significant computational resources and inherently exacerbate forgetting. We present SAILS -- Segment Anything with Incrementally Learned Semantics, a training-free framework for Class-Incremental Semantic Segmentation (CISS) that sidesteps these challenges entirely. SAILS leverages foundational models to decouple CISS into two stages: Zero-shot region extraction using Segment Anything Model (SAM), followed by semantic association through prototypes in a fixed feature space. SAILS incorporates selective intra-class clustering, resulting in multiple prototypes per class to better model intra-class variability. Our results demonstrate that, despite requiring no incremental training, SAILS typically surpasses the performance of existing training-based approaches on standard CISS datasets, particularly in long and challenging task sequences where forgetting tends to be most severe. By avoiding parameter updates, SAILS completely eliminates forgetting and maintains consistent, task-invariant performance. Furthermore, SAILS exhibits positive backward transfer, where the introduction of new classes can enhance performance on previous classes.
Chinese Translation
持续学习仍然受到重复再训练、高计算成本和遗忘问题的制约。这些因素显著限制了持续学习在现实世界中的应用,因为迭代模型更新需要大量计算资源,并且本质上加剧了遗忘。我们提出了SAILS——通过增量学习语义实现任意分割的框架,这是一个无训练的类增量语义分割(Class-Incremental Semantic Segmentation, CISS)框架,完全规避了这些挑战。SAILS利用基础模型将CISS解耦为两个阶段:使用任意分割模型(Segment Anything Model, SAM)进行零样本区域提取,随后在固定特征空间中通过原型进行语义关联。SAILS结合选择性类内聚类,为每个类别生成多个原型,以更好地建模类内变异性。我们的结果表明,尽管不需要增量训练,SAILS通常在标准CISS数据集上超越现有基于训练的方法的性能,特别是在遗忘最为严重的长任务序列中。通过避免参数更新,SAILS完全消除了遗忘,并保持一致的任务不变性能。此外,SAILS表现出积极的向后迁移,即新类别的引入可以提升对先前类别的性能。
cs.CV / 148 / 2602.14771

GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

GOT-JEPA:基于联合嵌入预测架构的模型适应与遮挡处理的通用目标跟踪
Chen, Shih-Fang, Chen, Jun-Cheng, Jhuo, I-Hong, Lin, Yen-Yu
Abstract
The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.
Chinese Translation
人类视觉系统通过将当前观察与先前观察到的信息整合,适应目标和场景的变化,并在细粒度上推理遮挡来跟踪物体。相比之下,最近的通用目标跟踪器通常针对训练目标进行了优化,这限制了其在未见场景中的鲁棒性和泛化能力,并且其遮挡推理仍然较为粗糙,缺乏对遮挡模式的详细建模。为了解决这些泛化和遮挡感知方面的局限性,我们提出了GOT-JEPA,这是一种模型预测预训练框架,它将JEPA从预测图像特征扩展到预测跟踪模型。在相同的历史信息下,教师预测器从干净的当前帧生成伪跟踪模型,而学生预测器则学习从当前帧的损坏版本预测相同的伪跟踪模型。这种设计提供了稳定的伪监督,并明确训练预测器在遮挡、干扰物和其他不利观察下生成可靠的跟踪模型,从而提高对动态环境的泛化能力。在GOT-JEPA的基础上,我们进一步提出了OccuSolver,以增强目标跟踪的遮挡感知。OccuSolver调整了以点为中心的点跟踪器,以实现基于目标的可见性估计和详细的遮挡模式捕捉。在跟踪器迭代生成的目标先验条件下,OccuSolver逐步细化可见性状态,加强遮挡处理,并生成更高质量的参考标签,从而逐步改善后续模型的预测。在七个基准上的广泛评估表明,我们的方法有效增强了跟踪器的泛化能力和鲁棒性。
cs.CV / 149 / 2602.14788

VIPA: Visual Informative Part Attention for Referring Image Segmentation

VIPA:用于指代图像分割的视觉信息部分注意力
Cho, Yubin, Yu, Hyunwoo, Kong, Kyeongbo, Sohn, Kyomin, Hyun, Bongjoon, Kang, Suk-Ju
Abstract
Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by leveraging the vision information into the language tokens. To more effectively exploit visual contexts for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which can effectively provide the structural and semantic visual target information to the network. This design reduces high-variance cross-modal projection and enhances semantic consistency in an attention mechanism of the referring image segmentation. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens for reducing noise information and sharing informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture semantic visual contexts of informative regions. In this way, our framework enables the network's attention to robustly align with the fine-grained regions of interest. Extensive experiments and visual analysis demonstrate the effectiveness of our approach. Our VIPA outperforms the existing state-of-the-art methods on four public RIS benchmarks.
Chinese Translation
指代图像分割(RIS)旨在对通过自然语言表达描述的目标对象进行分割。现有方法通过将视觉信息与语言标记结合,得到了发展。为了更有效地利用视觉上下文进行细粒度分割,我们提出了一种新颖的视觉信息部分注意力(VIPA)框架用于指代图像分割。VIPA利用视觉上下文中的信息部分,称为视觉表达,这可以有效地向网络提供结构和语义的视觉目标信息。该设计减少了高方差的跨模态投影,并增强了指代图像分割注意力机制中的语义一致性。我们还设计了一个视觉表达生成器(VEG)模块,该模块通过局部-全局语言上下文线索检索信息丰富的视觉标记,并对检索到的标记进行优化,以减少噪声信息并共享信息丰富的视觉属性。该模块使视觉表达能够考虑全面的上下文,并捕捉信息区域的语义视觉上下文。通过这种方式,我们的框架使网络的注意力能够稳健地与细粒度的兴趣区域对齐。大量实验和视觉分析证明了我们方法的有效性。我们的VIPA在四个公共RIS基准上超越了现有的最先进方法。
cs.CV / 150 / 2602.14834

Debiasing Central Fixation Confounds Reveals a Peripheral "Sweet Spot" for Human-like Scanpaths in Hard-Attention Vision

去偏中心注视混淆揭示了人类类似扫描路径的周边“甜点”区域
Pan, Pengcheng, Shogo, Yonekura, Kuniyosh, Yasuo
Abstract
Human eye movements in visual recognition reflect a balance between foveal sampling and peripheral context. Task-driven hard-attention models for vision are often evaluated by how well their scanpaths match human gaze. However, common scanpath metrics can be strongly confounded by dataset-specific center bias, especially on object-centric datasets. Using Gaze-CIFAR-10, we show that a trivial center-fixation baseline achieves surprisingly strong scanpath scores, approaching many learned policies. This makes standard metrics optimistic and blurs the distinction between genuine behavioral alignment and mere central tendency. We then analyze a hard-attention classifier under constrained vision by sweeping foveal patch size and peripheral context, revealing a peripheral sweet spot: only a narrow range of sensory constraints yields scanpaths that are simultaneously (i) above the center baseline after debiasing and (ii) temporally human-like in movement statistics. To address center bias, we propose GCS (Gaze Consistency Score), a center-debiased composite metric augmented with movement similarity. GCS uncovers a robust sweet spot at medium patch size with both foveal and peripheral vision, that is not obvious from raw scanpath metrics or accuracy alone, and also highlights a "shortcut regime" when the field-of-view becomes too large. We discuss implications for evaluating active perception on object-centric datasets and for designing gaze benchmarks that better separate behavioral alignment from center bias.
Chinese Translation
人类在视觉识别中的眼动反映了中央采样与周边背景之间的平衡。任务驱动的硬注意力视觉模型通常通过其扫描路径与人类注视的匹配程度进行评估。然而,常见的扫描路径指标可能受到数据集特定中心偏差的强烈影响,尤其是在以物体为中心的数据集上。通过使用 Gaze-CIFAR-10,我们展示了一个简单的中心注视基线能够获得意外强劲的扫描路径得分,接近许多学习策略。这使得标准指标显得过于乐观,并模糊了真正行为一致性与单纯中心倾向之间的区别。随后,我们分析了在受限视觉下的硬注意力分类器,通过调整中央补丁大小和周边背景,揭示了一个周边甜点区域:只有在狭窄的感知约束范围内,才能产生同时 (i) 在去偏后高于中心基线和 (ii) 在运动统计上类似人类的扫描路径。为了解决中心偏差问题,我们提出了 GCS(注视一致性得分),这是一种去偏的复合指标,增强了运动相似性。GCS 在中等补丁大小下揭示了一个稳健的甜点区域,结合了中央和周边视觉,这一点从原始扫描路径指标或准确性中并不明显,同时也突出了当视野过大时的“捷径状态”。我们讨论了在以物体为中心的数据集上评估主动感知的意义,以及设计更好地区分行为一致性与中心偏差的注视基准的建议。
cs.CV / 151 / 2602.14837

Integrating Affordances and Attention models for Short-Term Object Interaction Anticipation

整合可供性与注意力模型以进行短期物体交互预测
Labadia, Lorenzo Mur, Martinez-Cantin, Ruben, Guerrero, Jose J., Farinella, Giovanni M., Furnari, Antonino
Abstract
Short Term object-interaction Anticipation consists in detecting the location of the next active objects, the noun and verb categories of the interaction, as well as the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants to understand user goals and provide timely assistance, or to enable human-robot interaction. In this work, we present a method to improve the performance of STA predictions. Our contributions are two-fold: 1 We propose STAformer and STAformer plus plus, two novel attention-based architectures integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-input video pair; 2 We introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. We explore how to integrate environment affordances via simple late fusion and with an approach which adaptively learns how to best fuse affordances with end-to-end predictions. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant improvements on Overall Top-5 mAP, with gain up to +23p.p on Ego4D and +31p.p on a novel set of curated EPIC-Kitchens STA labels. We released the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens to encourage future research in this area.
Chinese Translation
短期物体交互预测旨在检测下一个活跃物体的位置、交互的名词和动词类别,以及从自我中心视频观察到的接触时间。这一能力对于可穿戴助手理解用户目标并提供及时帮助,或促进人机交互至关重要。在本研究中,我们提出了一种改善短期物体交互预测(STA)性能的方法。我们的贡献主要有两个方面:1 我们提出了STAformer和STAformer plus plus,这两种新颖的基于注意力的架构集成了帧引导的时间池化、双图像-视频注意力和多尺度特征融合,以支持从图像输入的视频对进行STA预测;2 我们引入了两个新模块,通过建模可供性将STA预测与人类行为相结合。首先,我们整合了一个环境可供性模型,该模型作为在特定物理场景中可能发生交互的持久记忆。我们探讨了如何通过简单的后期融合以及一种自适应学习如何最佳融合可供性与端到端预测的方法来整合环境可供性。其次,我们通过观察手部和物体轨迹预测交互热点,从而提高了围绕热点的STA预测的信心。我们的结果显示在整体Top-5 mAP上有显著改善,在Ego4D上提升高达+23个百分点,在一组新创建的EPIC-Kitchens STA标签上提升高达+31个百分点。我们发布了代码、注释和在Ego4D和EPIC-Kitchens上预提取的可供性,以鼓励未来在该领域的研究。
cs.CV / 152 / 2602.14846

Multi-dimensional Persistent Sheaf Laplacians for Image Analysis

用于图像分析的多维持久束拉普拉斯算子
Wang, Xiang Xiang, Wei, Guo-Wei
Abstract
We propose a multi-dimensional persistent sheaf Laplacian (MPSL) framework on simplicial complexes for image analysis. The proposed method is motivated by the strong sensitivity of commonly used dimensionality reduction techniques, such as principal component analysis (PCA), to the choice of reduced dimension. Rather than selecting a single reduced dimension or averaging results across dimensions, we exploit complementary advantages of multiple reduced dimensions. At a given dimension, image samples are regarded as simplicial complexes, and persistent sheaf Laplacians are utilized to extract a multiscale localized topological spectral representation for individual image samples. Statistical summaries of the resulting spectra are then aggregated across scales and dimensions to form multiscale multi-dimensional image representations. We evaluate the proposed framework on the COIL20 and ETH80 image datasets using standard classification protocols. Experimental results show that the proposed method provides more stable performance across a wide range of reduced dimensions and achieves consistent improvements to PCA-based baselines in moderate dimensional regimes.
Chinese Translation
我们提出了一种基于单纯复形的多维持久束拉普拉斯(MPSL)框架用于图像分析。该方法的提出源于常用的降维技术(如主成分分析,PCA)对降维选择的强敏感性。我们并不选择单一的降维或在不同维度上平均结果,而是利用多个降维的互补优势。在给定的维度下,图像样本被视为单纯复形,并利用持久束拉普拉斯提取个别图像样本的多尺度局部拓扑谱表示。然后,将得到的谱的统计摘要在不同尺度和维度上进行汇总,以形成多尺度多维图像表示。我们在COIL20和ETH80图像数据集上使用标准分类协议评估了该框架。实验结果表明,该方法在广泛的降维范围内提供了更稳定的性能,并在中等维度范围内对基于PCA的基线方法实现了一致的改进。
cs.CV / 153 / 2602.14879

CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography

CT-Bench:一种用于计算机断层扫描中多模态病变理解的基准
Zhu, Qingqing, Jin, Qiao, Mathai, Tejas S., Fang, Yin, Wang, Zhizheng, Yang, Yifan, Sarfo-Gyamfi, Maame, Hou, Benjamin, Gu, Ran, Balamuralikrishna, Praveen T. S., Wang, Kenneth C., Summers, Ronald M., Lu, Zhiyong
Abstract
Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.
Chinese Translation
人工智能(AI)可以自动在计算机断层扫描(CT)图像上勾画病变并生成放射学报告内容,但由于缺乏公开可用的具有病变级别注释的CT数据集,进展受到限制。为了解决这一问题,我们推出了CT-Bench,这是首个此类基准数据集,包含两个部分:一个包含20,335个病变的病变图像和元数据集,来自7,795个CT研究,附有边界框、描述和大小信息;以及一个多任务视觉问答基准,包含2,850对问答,涵盖病变定位、描述、大小估计和属性分类。我们还包含了困难的负例,以反映现实世界中的诊断挑战。我们评估了多种最先进的多模态模型,包括视觉-语言模型和医学CLIP变体,通过将它们的性能与放射科医生的评估进行比较,展示了CT-Bench作为病变分析综合基准的价值。此外,在病变图像和元数据集上微调模型在两个部分均显著提升了性能,强调了CT-Bench的临床实用性。
cs.CV / 154 / 2602.14929

Wrivinder: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery

Wrivinder:面向空间智能的地面图像与卫星影像的地理定位
Gudavalli, Chandrakanth, Mohammed, Tajuddin Manhar, Yadav, Abhay, Bhaskar, Ananth Vishnu, Prajapati, Hardik, Peng, Cheng, Chellappa, Rama, Chandrasekaran, Shivkumar, Manjunath, B. S.
Abstract
Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth--based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task, which lacks suitable benchmarks, we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivinder and MC-Sat provide a first comprehensive baseline and testbed for studying geometry-centered cross-view alignment without paired supervision. In zero-shot experiments, Wrivinder achieves sub-30\,m geolocation accuracy across both dense and large-area scenes, highlighting the promise of geometry-based aggregation for robust ground-to-satellite localization.
Chinese Translation
将地面图像与地理注册的卫星地图对齐对于制图、导航和态势感知至关重要,但在大视角差异或GPS不可靠的情况下仍然具有挑战性。我们提出了Wrivinder,一个零样本、基于几何的框架,聚合多张地面照片以重建一致的3D场景,并将其与上方的卫星影像对齐。Wrivinder结合了结构从运动(SfM)重建、3D高斯点云、语义基础和基于单目深度的度量线索,生成一个稳定的天顶视图渲染,可以直接与卫星上下文匹配,以实现度量准确的相机地理定位。为了支持对这一任务的系统评估,因缺乏合适的基准,我们还发布了MC-Sat,一个策划的数据集,将多视角地面图像与不同户外环境中的地理注册卫星图块链接起来。Wrivinder和MC-Sat共同提供了第一个全面的基准和测试平台,用于研究无配对监督的以几何为中心的跨视图对齐。在零样本实验中,Wrivinder在密集和大面积场景中实现了低于30米的地理定位精度,突显了基于几何聚合的强大潜力,能够实现稳健的地面到卫星定位。
cs.CV / 155 / 2602.14941

AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

AnchorWeave:具有检索局部空间记忆的世界一致性视频生成
Wang, Zun, Lin, Han, Yoon, Jaehong, Cho, Jaemin, Zhang, Yue, Bansal, Mohit
Abstract
Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.
Chinese Translation
在长时间范围内保持空间世界一致性仍然是相机可控视频生成的一个核心挑战。现有的基于记忆的方法通常通过从历史重建的几何体中渲染锚视频,来对全球重建的3D场景进行生成条件。然而,从多个视角重建全球3D场景不可避免地引入了视角间的不对齐,因为姿态和深度估计误差导致相同的表面在不同视角下被重建到略微不同的3D位置。当这些不一致性融合时,会累积成噪声几何体,污染条件信号并降低生成质量。我们提出了AnchorWeave,一个增强记忆的视频生成框架,它用多个干净的局部几何记忆替代单一的不对齐全球记忆,并学习调和它们的视角间不一致性。为此,AnchorWeave执行与目标轨迹对齐的覆盖驱动局部记忆检索,并在生成过程中通过多锚编织控制器整合所选的局部记忆。大量实验表明,AnchorWeave显著提高了长期场景一致性,同时保持了强大的视觉质量,消融和分析研究进一步验证了局部几何条件、多锚控制和覆盖驱动检索的有效性。
cs.CV / 156 / 2602.14965

PAct: Part-Decomposed Single-View Articulated Object Generation

PAct:部分分解的单视图关节物体生成
Liu, Qingming, Yao, Xinyue, Zhang, Shuyuan, Deng, Yueci, Liu, Guiliang, Liu, Zhen, Jia, Kui
Abstract
Articulated objects are central to interactive 3D applications, including embodied AI, robotics, and VR/AR, where functional part decomposition and kinematic motion are essential. Yet producing high-fidelity articulated assets remains difficult to scale because it requires reliable part decomposition and kinematic rigging. Existing approaches largely fall into two paradigms: optimization-based reconstruction or distillation, which can be accurate but often takes tens of minutes to hours per instance, and inference-time methods that rely on template or part retrieval, producing plausible results that may not match the specific structure and appearance in the input observation. We introduce a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning. Our representation models an object as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues. Conditioned on a single image, the model generates articulated 3D assets that preserve instance-level correspondence while maintaining valid part structure and motion. The resulting approach avoids per-instance optimization, enables fast feed-forward inference, and supports controllable assembly and articulation, which are important for embodied interaction. Experiments on common articulated categories (e.g., drawers and doors) show improved input consistency, part accuracy, and articulation plausibility over optimization-based and retrieval-driven baselines, while substantially reducing inference time.
Chinese Translation
关节物体是交互式3D应用的核心,包括具身人工智能、机器人技术以及虚拟现实/增强现实,其中功能性部分分解和运动学运动至关重要。然而,生成高保真度的关节资产仍然难以扩展,因为这需要可靠的部分分解和运动学绑定。现有的方法主要分为两种范式:基于优化的重建或蒸馏,这些方法虽然可以准确,但通常每个实例需要花费数十分钟到数小时;以及依赖模板或部分检索的推理时方法,这些方法生成的结果虽然看似合理,但可能与输入观察中的特定结构和外观不匹配。我们提出了一种以部分为中心的生成框架,用于关节物体的创建,该框架在明确的部分感知条件下合成部分几何、组成和关节。我们的表示将物体建模为一组可移动的部分,每个部分通过增强了部分身份和关节线索的潜在标记进行编码。在单幅图像的条件下,该模型生成的关节3D资产能够保持实例级对应关系,同时维持有效的部分结构和运动。该方法避免了每个实例的优化,支持快速的前馈推理,并且支持可控的组装和关节,这对于具身交互非常重要。在常见的关节类别(例如抽屉和门)上的实验表明,与基于优化和检索驱动的基线相比,输入一致性、部分准确性和关节合理性得到了改善,同时显著减少了推理时间。
cs.CV / 157 / 2602.14989

ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery

ThermEval:用于评估热成像视觉语言模型的结构化基准测试
Shrivastava, Ayush, Gangani, Kirtan, Jain, Laksh, Goel, Mayank, Batra, Nipun
Abstract
Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.
Chinese Translation
视觉语言模型(VLMs)在RGB图像上表现出色,但它们无法推广到热图像。热感应在可见光失效的环境中发挥着关键作用,包括夜间监控、搜索与救援、自动驾驶和医疗筛查。与RGB图像不同,热图像编码的是物理温度而非颜色或纹理,这需要现有以RGB为中心的基准测试未能评估的感知和推理能力。我们引入了ThermEval-B,这是一个由大约55,000对热视觉问答对组成的结构化基准,旨在评估热视觉语言理解所需的基础原理。ThermEval-B整合了公共数据集和我们新收集的ThermEval-D,这是第一个提供密集的每像素温度图以及跨多样化室内和室外环境的语义身体部位注释的数据集。通过评估25个开源和闭源的VLM,我们发现模型在基于温度的推理上持续失败,在色彩图转换下性能下降,并且默认使用语言先验或固定响应,仅在提示或监督微调中获得微小的提升。这些结果表明,热理解需要超越RGB中心假设的专门评估,ThermEval被定位为推动热视觉语言建模进展的基准。
cs.CV / 158 / 2602.15030

Image Generation with a Sphere Encoder

基于球体编码器的图像生成
Yue, Kaiyu, Jia, Menglin, Hou, Ji, Goldstein, Tom
Abstract
We introduce the Sphere Encoder, an efficient generative framework capable of producing images in a single forward pass and competing with many-step diffusion models using fewer than five steps. Our approach works by learning an encoder that maps natural images uniformly onto a spherical latent space, and a decoder that maps random latent vectors back to the image space. Trained solely through image reconstruction losses, the model generates an image by simply decoding a random point on the sphere. Our architecture naturally supports conditional generation, and looping the encoder/decoder a few times can further enhance image quality. Across several datasets, the sphere encoder approach yields performance competitive with state of the art diffusions, but with a small fraction of the inference cost. Project page is available at https://sphere-encoder.github.io .
Chinese Translation
我们介绍了球体编码器(Sphere Encoder),这是一种高效的生成框架,能够在单次前向传播中生成图像,并且在使用不到五个步骤的情况下与多步骤扩散模型竞争。我们的方法通过学习一个编码器,将自然图像均匀映射到球形潜在空间,以及一个解码器,将随机潜在向量映射回图像空间。该模型仅通过图像重建损失进行训练,通过简单地解码球面上的随机点生成图像。我们的架构自然支持条件生成,并且循环使用编码器/解码器几次可以进一步提升图像质量。在多个数据集上,球体编码器方法的性能与最先进的扩散模型相当,但推理成本却小得多。项目页面可访问 https://sphere-encoder.github.io 。
cs.CV / 159 / 2602.15031

EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing

EditCtrl:用于实时生成视频编辑的局部与全局控制解耦
Litman, Yehonathan, Liu, Shikun, Seyb, Dario, Milef, Nicholas, Zhou, Yang, Marshall, Carl, Tulsiani, Shubham, Leak, Caleb
Abstract
High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10 times more compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and autoregressive content propagation.
Chinese Translation
高保真生成视频编辑通过利用预训练的视频基础模型实现了显著的质量提升。然而,它们的计算成本成为了主要瓶颈,因为这些模型通常设计得低效,处理整个视频上下文时不考虑修补掩码的大小,即使对于稀疏的局部编辑也是如此。本文提出了EditCtrl,一个高效的视频修补控制框架,仅在需要的地方进行计算。我们的方法具有一个新颖的局部视频上下文模块,该模块仅在被掩盖的标记上操作,从而使计算成本与编辑规模成正比。这种局部优先的生成方式由一个轻量级的时间全局上下文嵌入器引导,确保视频范围内的一致性,同时最小化开销。EditCtrl的计算效率比最先进的生成编辑方法高出10倍,且与采用全注意力设计的方法相比,编辑质量也有所提升。最后,我们展示了EditCtrl如何解锁新功能,包括使用文本提示进行多区域编辑和自回归内容传播。
人工智能 (Artificial Intelligence)
113
cs.AI / 1 / 2602.13213

Agentic AI for Commercial Insurance Underwriting with Adversarial Self-Critique

具有对抗性自我批评的商业保险承保代理智能AI
Roy, Joyjit, Singh, Samaresh Kumar
Abstract
Commercial insurance underwriting is a labor-intensive process that requires manual review of extensive documentation to assess risk and determine policy pricing. While AI offers substantial efficiency improvements, existing solutions lack comprehensive reasoning capabilities and internal mechanisms to ensure reliability within regulated, high-stakes environments. Full automation remains impractical and inadvisable in scenarios where human judgment and accountability are critical. This study presents a decision-negative, human-in-the-loop agentic system that incorporates an adversarial self-critique mechanism as a bounded safety architecture for regulated underwriting workflows. Within this system, a critic agent challenges the primary agent's conclusions prior to submitting recommendations to human reviewers. This internal system of checks and balances addresses a critical gap in AI safety for regulated workflows. Additionally, the research develops a formal taxonomy of failure modes to characterize potential errors by decision-negative agents. This taxonomy provides a structured framework for risk identification and risk management in high-stakes applications. Experimental evaluation using 500 expert-validated underwriting cases demonstrates that the adversarial critique mechanism reduces AI hallucination rates from 11.3% to 3.8% and increases decision accuracy from 92% to 96%. At the same time, the framework enforces strict human authority over all binding decisions by design. These findings indicate that adversarial self-critique supports safer AI deployment in regulated domains and offers a model for responsible integration where human oversight is indispensable.
Chinese Translation
商业保险承保是一个劳动密集型的过程,需要对大量文档进行人工审查以评估风险并确定保单定价。尽管人工智能提供了显著的效率提升,但现有解决方案缺乏全面的推理能力和内部机制,以确保在受监管的高风险环境中的可靠性。在需要人类判断和问责的场景中,完全自动化仍然不切实际且不可取。本研究提出了一种决策消极的人机协作代理系统,该系统结合了对抗性自我批评机制,作为受监管承保工作流程的有限安全架构。在该系统中,批评代理在向人类审查者提交建议之前,挑战主要代理的结论。这个内部的制衡系统填补了受监管工作流程中人工智能安全的一个关键空白。此外,研究开发了一个正式的失败模式分类法,以表征决策消极代理可能出现的错误。该分类法为高风险应用中的风险识别和风险管理提供了结构化框架。使用500个专家验证的承保案例进行的实验评估表明,对抗性批评机制将人工智能的幻觉率从11.3%降低到3.8%,并将决策准确性从92%提高到96%。同时,该框架在设计上严格执行人类对所有约束性决策的权威。这些发现表明,对抗性自我批评支持在受监管领域中更安全的人工智能部署,并提供了一个在需要人类监督的情况下负责任整合的模型。
cs.AI / 2 / 2602.13214

BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors

BotzoneBench:通过分级人工智能锚点进行可扩展的语言模型评估
Li, Lingfeng, Lu, Yunlong, Zhang, Yuefei, Yao, Jingyu, Zhu, Yixin, Cheng, KeYuan, Wang, Yongyi, Zheng, Qirui, Yang, Xionghui, Li, Wenxin
Abstract
Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability. Built on the Botzone platform's established competitive infrastructure, our BotzoneBench evaluates LLMs across eight diverse games spanning deterministic perfect-information board games to stochastic imperfect-information card games. Through systematic assessment of 177,047 state-action pairs from five flagship models, we reveal significant performance disparities and identify distinct strategic behaviors, with top-performing models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. This anchored evaluation paradigm generalizes beyond games to any domain with well-defined skill hierarchies, establishing a scalable and reusable framework for assessing interactive AI capabilities.
Chinese Translation
大型语言模型(LLMs)越来越多地应用于需要战略决策的互动环境中,但对这些能力的系统评估仍然具有挑战性。现有的LLM基准主要通过孤立任务评估静态推理,未能捕捉动态战略能力。近期的基于游戏的评估采用LLM对抗LLM的锦标赛,这种方法产生的相对排名依赖于瞬时模型池,导致计算成本呈平方级增长,并且缺乏稳定的性能锚点以进行纵向跟踪。核心挑战在于建立一个可扩展的评估框架,以一致、可解释的标准衡量LLM的战略推理,而不是依赖于波动的同伴模型。在此,我们展示了将LLM评估锚定于固定的技能校准游戏人工智能(AI)层级,使得能够以线性时间进行绝对技能测量,并具备稳定的跨时间可解释性。基于Botzone平台已建立的竞争基础设施,我们的BotzoneBench在八款多样化游戏中评估LLM,这些游戏涵盖了确定性完美信息棋盘游戏和随机不完美信息纸牌游戏。通过对五个旗舰模型的177,047个状态-动作对的系统评估,我们揭示了显著的性能差异,并识别出不同的战略行为,表现最佳的模型在多个领域的能力可与中高端专业游戏AI相媲美。这种锚定评估范式超越了游戏,适用于任何具有明确技能层级的领域,为评估互动AI能力建立了一个可扩展和可重用的框架。
cs.AI / 3 / 2602.13215

When to Think Fast and Slow? AMOR: Entropy-Based Metacognitive Gate for Dynamic SSM-Attention Switching

何时快速思考与慢速思考?AMOR:基于熵的元认知门控用于动态SSM注意力切换
Zheng, Haoran
Abstract
Transformers allocate uniform computation to every position, regardless of difficulty. State Space Models (SSMs) offer efficient alternatives but struggle with precise information retrieval over a long horizon. Inspired by dual-process theories of cognition (Kahneman, 2011), we propose AMOR (Adaptive Metacognitive Output Router), a hybrid architecture that dynamically engages sparse attention only when an SSM backbone is "uncertain"--as measured by prediction entropy. Compared to standard transformers, AMOR gains efficiency by projecting keys and values from SSM hidden states (Ghost KV), reusing the SSM's O(n) computation rather than requiring O(n^2) attention at every layer. On small-scale synthetic retrieval tasks, AMOR outperforms both SSM-only and transformer-only baselines, achieving perfect retrieval accuracy while engaging attention on only 22% of positions. We validate that prediction entropy reliably signals retrieval need, with a gap of 1.09 nats (nearly half the entropy range) between retrieval and local positions. Additionally, our approach provides interpretable adaptive computation, where routing decisions can be understood in information-theoretic terms.
Chinese Translation
变压器对每个位置分配均匀的计算,无论其难度如何。状态空间模型(SSMs)提供了高效的替代方案,但在长时间范围内精确信息检索方面存在困难。受到认知双过程理论(Kahneman, 2011)的启发,我们提出了AMOR(自适应元认知输出路由器),这是一种混合架构,仅在SSM主干“存在不确定性”时动态地参与稀疏注意力——这一点通过预测熵来衡量。与标准变压器相比,AMOR通过从SSM隐藏状态中投影键和值(Ghost KV)来提高效率,重用SSM的O(n)计算,而不是在每一层都需要O(n^2)的注意力。在小规模合成检索任务中,AMOR的表现优于仅使用SSM和仅使用变压器的基线,达到了完美的检索准确率,同时仅在22%的位置上参与注意力。我们验证了预测熵可靠地指示了检索需求,检索位置与局部位置之间的熵差为1.09 nats(几乎是熵范围的一半)。此外,我们的方法提供了可解释的自适应计算,其中路由决策可以用信息论的术语来理解。
cs.AI / 4 / 2602.13217

VeRA: Verified Reasoning Data Augmentation at Scale

VeRA:大规模验证推理数据增强
Cheng, Zerui, Liu, Jiashuo, Wu, Chunjie, Yao, Jianzhu, Viswanath, Pramod, Zhang, Ge, Huang, Wenhao
Abstract
The main issue with most evaluation schemes today is their "static" nature: the same problems are reused repeatedly, allowing for memorization, format exploitation, and eventual saturation. To measure genuine AI progress, we need evaluation that is robust by construction, not by post-hoc detection. In response, we propose VeRA (Verified Reasoning Data Augmentation), a framework that converts benchmark problems into executable specifications, comprising (i) a natural language template with placeholder slots, (ii) a coherent generator that samples valid configurations, and (iii) a deterministic verifier that validates parameters and calculates the corresponding correct answers for each configuration. From a single seed problem, VeRA automatically creates unlimited verified variants with reliable labels at near-zero marginal cost without human involvement. VeRA operates in two complementary modes. VeRA-E (equivalent) rewrites problems while keeping the underlying logic intact, useful for detecting memorization versus genuine reasoning. VeRA-H (hardened) systematically increases complexity while remaining verifiable, enabling reliable creation and labelling of fresh difficult tasks at the boundary of intelligence. Evaluating 16 frontier models with VeRA, we find: (i) VeRA-E improves evaluation quality and reveals contamination patterns. (ii) VeRA-H enables human-free generation of hard tasks with reliable labels. (iii) VeRA establishes verified benchmarks as a general paradigm. VeRA reconceptualizes benchmarks from static objects used until exhausted, to executable specifications generating fresh, verified instances on demand, enhancing robustness and cost-effectiveness for evaluation. With VeRA, we envision that evaluation in any verifiable domain can scale indefinitely without sacrificing label integrity. To stimulate future research, we have open-sourced all code and datasets.
Chinese Translation
目前大多数评估方案的主要问题在于其“静态”特性:相同的问题被反复使用,导致记忆、格式利用和最终的饱和。为了衡量真正的人工智能进展,我们需要通过构造而非事后检测来实现稳健的评估。为此,我们提出了VeRA(Verified Reasoning Data Augmentation),一个将基准问题转换为可执行规范的框架,包含(i)带有占位符的自然语言模板,(ii)一个能够采样有效配置的连贯生成器,以及(iii)一个验证参数并计算每个配置对应正确答案的确定性验证器。VeRA从单一的种子问题出发,自动创建无限的经过验证的变体,并以近乎零的边际成本提供可靠标签,无需人工干预。VeRA以两种互补模式运行。VeRA-E(等效)在保持基础逻辑不变的情况下重写问题,适用于检测记忆与真正推理的区别。VeRA-H(强化)系统性地增加复杂性,同时保持可验证性,使得在智能边界上可靠地创建和标记新的困难任务成为可能。通过VeRA对16个前沿模型进行评估,我们发现:(i)VeRA-E提高了评估质量并揭示了污染模式;(ii)VeRA-H实现了无人工干预的困难任务生成,并提供可靠标签;(iii)VeRA确立了经过验证的基准作为一种通用范式。VeRA重新构思了基准,从被耗尽的静态对象转变为可执行的规范,按需生成新的经过验证的实例,从而增强评估的稳健性和成本效益。我们设想,通过VeRA,任何可验证领域的评估都可以无限扩展,而不牺牲标签的完整性。为了刺激未来的研究,我们已开源所有代码和数据集。
cs.AI / 5 / 2602.13218

Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning

扩展扩展逻辑:代理元合成的逻辑推理
Liu, Bowen, Wu, Zhi, Xie, Runquan, Kang, Zhanhui, Li, Jia
Abstract
Scaling verifiable training signals remains a key bottleneck for Reinforcement Learning from Verifiable Rewards (RLVR). Logical reasoning is a natural substrate: constraints are formal and answers are programmatically checkable. However, prior synthesis pipelines either depend on expert-written code or operate within fixed templates/skeletons, which limits growth largely to instance-level perturbations. We propose SSLogic, an agentic meta-synthesis framework that scales at the task-family level by iteratively synthesizing and repairing executable Generator--Validator program pairs in a closed Generate--Validate--Repair loop, enabling continuous family evolution with controllable difficulty. To ensure reliability, we introduce a Multi-Gate Validation Protocol that combines multi-strategy consistency checks with Adversarial Blind Review, where independent agents must solve instances by writing and executing code to filter ambiguous or ill-posed tasks. Starting from 400 seed families, two evolution rounds expand to 953 families and 21,389 verifiable instances (from 5,718). Training on SSLogic-evolved data yields consistent gains over the seed baseline at matched training steps, improving SynLogic by +5.2, BBEH by +1.4, AIME25 by +3.0, and Brumo25 by +3.7.
Chinese Translation
可验证训练信号的扩展仍然是可验证奖励强化学习(RLVR)的一个关键瓶颈。逻辑推理是一个自然的基础:约束是形式化的,答案是程序可检查的。然而,先前的合成流程要么依赖于专家编写的代码,要么在固定的模板/框架内操作,这在很大程度上限制了增长仅限于实例级的扰动。我们提出了SSLogic,一个代理元合成框架,通过在闭合的生成-验证-修复循环中迭代合成和修复可执行的生成器-验证器程序对,在任务家族级别扩展,从而实现可控难度的持续家族演化。为了确保可靠性,我们引入了一种多门验证协议,将多策略一致性检查与对抗性盲审结合在一起,独立代理必须通过编写和执行代码来解决实例,以过滤模糊或不适当的任务。从400个种子家族开始,经过两轮演化扩展到953个家族和21,389个可验证实例(从5,718个)。在SSLogic演化数据上进行训练,在匹配的训练步骤下,相较于种子基线取得了一致的增益,SynLogic提高了+5.2,BBEH提高了+1.4,AIME25提高了+3.0,Brumo25提高了+3.7。
cs.AI / 6 / 2602.13224

A Geometric Taxonomy of Hallucinations in LLMs

大语言模型中幻觉的几何分类
Marín, Javier
Abstract
The term "hallucination" in large language models conflates distinct phenomena with different geometric signatures in embedding space. We propose a taxonomy identifying three types: unfaithfulness (failure to engage with provided context), confabulation (invention of semantically foreign content), and factual error (incorrect claims within correct conceptual frames). We observe a striking asymmetry. On standard benchmarks where hallucinations are LLM-generated, detection is domain-local: AUROC 0.76-0.99 within domains, but 0.50 (chance level) across domains. Discriminative directions are approximately orthogonal between domains (mean cosine similarity -0.07). On human-crafted confabulations - invented institutions, redefined terminology, fabricated mechanisms - a single global direction achieves 0.96 AUROC with 3.8% cross-domain degradation. We interpret this divergence as follows: benchmarks capture generation artifacts (stylistic signatures of prompted fabrication), while human-crafted confabulations capture genuine topical drift. The geometric structure differs because the underlying phenomena differ. Type III errors show 0.478 AUROC - indistinguishable from chance. This reflects a theoretical constraint: embeddings encode distributional co-occurrence, not correspondence to external reality. Statements with identical contextual patterns occupy similar embedding regions regardless of truth value. The contribution is a geometric taxonomy clarifying the scope of embedding-based detection: Types I and II are detectable; Type III requires external verification mechanisms.
Chinese Translation
在大语言模型中,“幻觉”一词混淆了在嵌入空间中具有不同几何特征的不同现象。我们提出了一种分类法,识别出三种类型:不忠实性(未能与提供的上下文进行互动)、虚构(发明语义上不相关的内容)和事实错误(在正确的概念框架内提出不正确的主张)。我们观察到一种显著的不对称性。在标准基准测试中,当幻觉由大语言模型生成时,检测是领域局部的:在领域内的AUROC为0.76-0.99,但在领域间为0.50(随机水平)。不同领域之间的判别方向大致正交(平均余弦相似度为-0.07)。在人工构造的虚构内容——发明的机构、重新定义的术语、虚构的机制——中,一个全局方向实现了0.96的AUROC,跨领域降级为3.8%。我们将这种差异解释为:基准测试捕捉生成伪影(提示性虚构的风格特征),而人工构造的虚构内容捕捉真正的主题漂移。几何结构的不同是因为基础现象的不同。类型III错误显示0.478的AUROC——与随机水平无异。这反映了一种理论限制:嵌入编码的是分布共现,而不是与外部现实的对应关系。具有相同上下文模式的陈述占据相似的嵌入区域,无论其真值如何。我们的贡献是一个几何分类法,澄清了基于嵌入的检测范围:类型I和II是可检测的;类型III需要外部验证机制。
cs.AI / 7 / 2602.13226

Variation is the Key: A Variation-Based Framework for LLM-Generated Text Detection

变异是关键:基于变异的LLM生成文本检测框架
Li, Xuecong, Li, Xiaohong, Hu, Qiang, Zhang, Yao, Wang, Junjie
Abstract
Detecting text generated by large language models (LLMs) is crucial but challenging. Existing detectors depend on impractical assumptions, such as white-box settings, or solely rely on text-level features, leading to imprecise detection ability. In this paper, we propose a simple but effective and practical LLM-generated text detection method, VaryBalance. The core of VaryBalance is that, compared to LLM-generated texts, there is a greater difference between human texts and their rewritten version via LLMs. Leveraging this observation, VaryBalance quantifies this through mean standard deviation and distinguishes human texts and LLM-generated texts. Comprehensive experiments demonstrated that VaryBalance outperforms the state-of-the-art detectors, i.e., Binoculars, by up to 34.3\% in terms of AUROC, and maintains robustness against multiple generating models and languages.
Chinese Translation
检测大型语言模型(LLMs)生成的文本至关重要,但也充满挑战。现有的检测器依赖于不切实际的假设,例如白盒设置,或仅仅依赖于文本级特征,导致检测能力不准确。本文提出了一种简单但有效且实用的LLM生成文本检测方法VaryBalance。VaryBalance的核心在于,与LLM生成的文本相比,人类文本及其通过LLM重写版本之间存在更大的差异。基于这一观察,VaryBalance通过均值标准差量化这一差异,从而区分人类文本和LLM生成的文本。全面的实验表明,VaryBalance在AUROC方面比最先进的检测器Binoculars提高了多达34.3\%,并且在多种生成模型和语言下保持了鲁棒性。
cs.AI / 8 / 2602.13230

Intelligence as Trajectory-Dominant Pareto Optimization

智力作为轨迹主导的帕累托优化
Khanh, Truong Xuan, Hoa, Truong Quynh
Abstract
Despite recent advances in artificial intelligence, many systems exhibit stagnation in long-horizon adaptability despite continued performance optimization. This work argues that such limitations do not primarily arise from insufficient learning, data, or model capacity, but from a deeper structural property of how intelligence is optimized over time. We formulate intelligence as a trajectory-level phenomenon governed by multi-objective trade-offs, and introduce Trajectory-Dominant Pareto Optimization, a path-wise generalization of classical Pareto optimality in which dominance is defined over full trajectories. Within this framework, Pareto traps emerge as locally non-dominated regions of trajectory space that nevertheless restrict access to globally superior developmental paths under conservative local optimization. To characterize the rigidity of such constraints, we define the Trap Escape Difficulty Index (TEDI), a composite geometric measure capturing escape distance, structural constraints, and behavioral inertia. We show that dynamic intelligence ceilings arise as inevitable geometric consequences of trajectory-level dominance, independent of learning progress or architectural scale. We further introduce a formal taxonomy of Pareto traps and illustrate the resulting trajectory-level divergence using a minimal agent-environment model. Together, these results shift the locus of intelligence from terminal performance to optimization geometry, providing a principled framework for diagnosing and overcoming long-horizon developmental constraints in adaptive systems.
Chinese Translation
尽管人工智能最近取得了进展,但许多系统在长期适应性方面仍然表现出停滞,尽管性能优化持续进行。本研究认为,这种限制并非主要源于学习、数据或模型能力不足,而是源于智力在时间上优化的更深层次的结构特性。我们将智力形式化为一种受多目标权衡支配的轨迹级现象,并引入轨迹主导的帕累托优化(Trajectory-Dominant Pareto Optimization),这是一种经典帕累托最优性的路径级推广,其中主导性是在完整轨迹上定义的。在这一框架内,帕累托陷阱作为轨迹空间中局部非主导区域出现,然而在保守的局部优化下,这些区域限制了对全球优越发展路径的访问。为了表征这种约束的刚性,我们定义了陷阱逃逸难度指数(Trap Escape Difficulty Index, TEDI),这是一个综合几何度量,捕捉逃逸距离、结构约束和行为惯性。我们展示了动态智力上限作为轨迹级主导的不可避免的几何结果,独立于学习进展或架构规模。我们进一步引入了帕累托陷阱的正式分类法,并使用一个最小代理-环境模型说明了由此产生的轨迹级发散。这些结果共同将智力的焦点从终极性能转移到优化几何,为诊断和克服适应系统中的长期发展约束提供了一个有原则的框架。
cs.AI / 9 / 2602.13232

PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading

PlotChain:针对工程图形阅读的多模态大语言模型的确定性检查点评估
Ravishankara, Mayank
Abstract
We present PlotChain, a deterministic, generator-based benchmark for evaluating multimodal large language models (MLLMs) on engineering plot reading-recovering quantitative values from classic plots (e.g., Bode/FFT, step response, stress-strain, pump curves) rather than OCR-only extraction or free-form captioning. PlotChain contains 15 plot families with 450 rendered plots (30 per family), where every item is produced from known parameters and paired with exact ground truth computed directly from the generating process. A central contribution is checkpoint-based diagnostic evaluation: in addition to final targets, each item includes intermediate 'cp_' fields that isolate sub-skills (e.g., reading cutoff frequency or peak magnitude) and enable failure localization within a plot family. We evaluate four state-of-the-art MLLMs under a standardized, deterministic protocol (temperature = 0 and a strict JSON-only numeric output schema) and score predictions using per-field tolerances designed to reflect human plot-reading precision. Under the 'plotread' tolerance policy, the top models achieve 80.42% (Gemini 2.5 Pro), 79.84% (GPT-4.1), and 78.21% (Claude Sonnet 4.5) overall field-level pass rates, while GPT-4o trails at 61.59%. Despite strong performance on many families, frequency-domain tasks remain brittle: bandpass response stays low (<= 23%), and FFT spectrum remains challenging. We release the generator, dataset, raw model outputs, scoring code, and manifests with checksums to support fully reproducible runs and retrospective rescoring under alternative tolerance policies.
Chinese Translation
我们提出了PlotChain,这是一种基于生成器的确定性基准,用于评估多模态大语言模型(MLLMs)在工程图形阅读中的表现——从经典图形中恢复定量值(例如,Bode/FFT、阶跃响应、应力-应变、泵曲线),而不是仅依赖光学字符识别(OCR)提取或自由形式的说明。PlotChain包含15个图形系列,共450个渲染图形(每个系列30个),每个项目均由已知参数生成,并与直接从生成过程中计算得出的精确真实值配对。一个核心贡献是基于检查点的诊断评估:除了最终目标外,每个项目还包括中间的'cp_'字段,以隔离子技能(例如,读取截止频率或峰值幅度),并能够在图形系列中定位失败。我们在标准化的确定性协议下(温度=0,严格的仅JSON数值输出模式)评估四个最先进的MLLM,并使用每个字段的容差对预测进行评分,这些容差旨在反映人类图形阅读的精度。在'plotread'容差政策下,顶级模型的整体字段级通过率分别为80.42%(Gemini 2.5 Pro)、79.84%(GPT-4.1)和78.21%(Claude Sonnet 4.5),而GPT-4o的通过率为61.59%。尽管在许多系列中表现强劲,频域任务仍然脆弱:带通响应保持在低水平(<= 23%),FFT频谱仍然具有挑战性。我们发布了生成器、数据集、原始模型输出、评分代码和带有校验和的清单,以支持完全可重复的运行和在替代容差政策下的回顾性重新评分。
cs.AI / 10 / 2602.13234

Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

保持角色,保持安全:用于安全角色扮演代理的双循环对抗自我演化
Liao, Mingyang, Wan, Yichen, wu, shuchen, Miao, Chenxi, Shen, Xin, Li, Weikang, Li, Yang, Xia, Deguo, Huang, Jizhou
Abstract
LLM-based role-playing has rapidly improved in fidelity, yet stronger adherence to persona constraints commonly increases vulnerability to jailbreak attacks, especially for risky or negative personas. Most prior work mitigates this issue with training-time solutions (e.g., data curation or alignment-oriented regularization). However, these approaches are costly to maintain as personas and attack strategies evolve, can degrade in-character behavior, and are typically infeasible for frontier closed-weight LLMs. We propose a training-free Dual-Cycle Adversarial Self-Evolution framework with two coupled cycles. A Persona-Targeted Attacker Cycle synthesizes progressively stronger jailbreak prompts, while a Role-Playing Defender Cycle distills observed failures into a hierarchical knowledge base of (i) global safety rules, (ii) persona-grounded constraints, and (iii) safe in-character exemplars. At inference time, the Defender retrieves and composes structured knowledge from this hierarchy to guide generation, producing responses that remain faithful to the target persona while satisfying safety constraints. Extensive experiments across multiple proprietary LLMs show consistent gains over strong baselines on both role fidelity and jailbreak resistance, and robust generalization to unseen personas and attack prompts.
Chinese Translation
基于大型语言模型(LLM)的角色扮演在真实性方面迅速提升,但对角色约束的更强遵循通常会增加对越狱攻击的脆弱性,尤其是对于风险或负面角色。大多数先前的研究通过训练时的解决方案(例如,数据整理或面向对齐的正则化)来缓解这一问题。然而,随着角色和攻击策略的演变,这些方法的维护成本较高,可能会降低角色行为的真实性,并且通常对前沿的闭合权重LLM不可行。我们提出了一种无训练的双循环对抗自我演化框架,包含两个耦合的循环。角色针对攻击者循环合成逐渐增强的越狱提示,而角色扮演防御者循环将观察到的失败提炼为一个层次知识库,包括(i)全球安全规则,(ii)基于角色的约束,以及(iii)安全的角色示例。在推理时,防御者从该层次中检索并组合结构化知识以指导生成,产生的响应既忠实于目标角色,又满足安全约束。在多个专有LLM上的广泛实验表明,在角色真实性和越狱抵抗力方面,相较于强基线有一致的提升,并且对未见角色和攻击提示具有强大的泛化能力。
cs.AI / 11 / 2602.13235

Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains

Lang2Act:通过自发语言工具链实现细粒度视觉推理
Xiong, Yuqi, Peng, Chunyi, Xu, Zhipeng, Liu, Zhenghao, Chen, Zulong, Yan, Yukun, Wang, Shuo, Gu, Yu, Yu, Ge
Abstract
Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework. Specifically, the first stage optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at https://github.com/NEUIR/Lang2Act.
Chinese Translation
视觉检索增强生成(VRAG)通过引入外部视觉文档来增强视觉-语言模型(VLMs),以应对给定的查询。现有的VRAG框架通常依赖于刚性、预定义的外部工具,以扩展VLMs的感知能力,通常通过明确将视觉感知与后续推理过程分离。然而,这种解耦设计可能导致视觉信息的不必要损失,特别是在应用诸如裁剪等基于图像的操作时。在本文中,我们提出了Lang2Act,它通过自发的语言工具链实现细粒度的视觉感知和推理。Lang2Act并不调用固定的外部引擎,而是收集自发的动作作为语言工具,并利用它们来增强VLMs的视觉感知能力。为了支持这一机制,我们设计了一个基于强化学习(RL)的两阶段训练框架。具体而言,第一阶段优化VLMs以自我探索高质量的动作,以构建可重用的语言工具箱,第二阶段进一步优化VLMs,以有效利用这些语言工具进行下游推理。实验结果表明,Lang2Act在显著增强VLMs的视觉感知能力方面是有效的,性能提升超过4%。所有代码和数据可在https://github.com/NEUIR/Lang2Act获取。
cs.AI / 12 / 2602.13237

NL2LOGIC: AST-Guided Translation of Natural Language into First-Order Logic with Large Language Models

NL2LOGIC:基于抽象语法树的大型语言模型自然语言到一阶逻辑的翻译
Putra, Rizky Ramadhana, Basuki, Raihan Sultan Pasha, Cheng, Yutong, Gao, Peng
Abstract
Automated reasoning is critical in domains such as law and governance, where verifying claims against facts in documents requires both accuracy and interpretability. Recent work adopts structured reasoning pipelines that translate natural language into first-order logic and delegate inference to automated solvers. With the rise of large language models, approaches such as GCD and CODE4LOGIC leverage their reasoning and code generation capabilities to improve logic parsing. However, these methods suffer from fragile syntax control due to weak enforcement of global grammar constraints and low semantic faithfulness caused by insufficient clause-level semantic understanding. We propose NL2LOGIC, a first-order logic translation framework that introduces an abstract syntax tree as an intermediate representation. NL2LOGIC combines a recursive large language model based semantic parser with an abstract syntax tree guided generator that deterministically produces solver-ready logic code. Experiments on the FOLIO, LogicNLI, and ProofWriter benchmarks show that NL2LOGIC achieves 99 percent syntactic accuracy and improves semantic correctness by up to 30 percent over state-of-the-art baselines. Furthermore, integrating NL2LOGIC into Logic-LM yields near-perfect executability and improves downstream reasoning accuracy by 31 percent compared to Logic-LM's original few-shot unconstrained translation module.
Chinese Translation
自动推理在法律和治理等领域至关重要,因为在文档中验证声明与事实的一致性需要准确性和可解释性。近期的研究采用结构化推理流程,将自然语言翻译为一阶逻辑,并将推理委托给自动求解器。随着大型语言模型的兴起,诸如 GCD 和 CODE4LOGIC 的方法利用其推理和代码生成能力来改善逻辑解析。然而,这些方法由于全球语法约束的弱执行而导致语法控制脆弱,并且由于缺乏足够的子句级语义理解而导致语义忠实度低。我们提出了 NL2LOGIC,这是一个一阶逻辑翻译框架,引入抽象语法树作为中间表示。NL2LOGIC 将递归的大型语言模型基础的语义解析器与一个基于抽象语法树的引导生成器相结合,确定性地生成适合求解器的逻辑代码。在 FOLIO、LogicNLI 和 ProofWriter 基准测试中的实验表明,NL2LOGIC 达到了 99% 的语法准确率,并且在语义正确性上比最先进的基线提高了多达 30%。此外,将 NL2LOGIC 集成到 Logic-LM 中实现了近乎完美的可执行性,并且与 Logic-LM 原有的少量样本无约束翻译模块相比,提高了下游推理准确性 31%。
cs.AI / 13 / 2602.13240

AST-PAC: AST-guided Membership Inference for Code

AST-PAC:基于抽象语法树的代码成员推断
Koohestani, Roham, Al-Kaswan, Ali, Katzy, Jonathan, Izadi, Maliheh
Abstract
Code Large Language Models are frequently trained on massive datasets containing restrictively licensed source code. This creates urgent data governance and copyright challenges. Membership Inference Attacks (MIAs) can serve as an auditing mechanism to detect unauthorized data usage in models. While attacks like the Loss Attack provide a baseline, more involved methods like Polarized Augment Calibration (PAC) remain underexplored in the code domain. This paper presents an exploratory study evaluating these methods on 3B--7B parameter code models. We find that while PAC generally outperforms the Loss baseline, its effectiveness relies on augmentation strategies that disregard the rigid syntax of code, leading to performance degradation on larger, complex files. To address this, we introduce AST-PAC, a domain-specific adaptation that utilizes Abstract Syntax Tree (AST) based perturbations to generate syntactically valid calibration samples. Preliminary results indicate that AST-PAC improves as syntactic size grows, where PAC degrades, but under-mutates small files and underperforms on alphanumeric-rich code. Overall, the findings motivate future work on syntax-aware and size-adaptive calibration as a prerequisite for reliable provenance auditing of code language models.
Chinese Translation
大型代码语言模型通常在包含限制性许可源代码的大型数据集上进行训练。这带来了紧迫的数据治理和版权挑战。成员推断攻击(Membership Inference Attacks, MIA)可以作为一种审计机制,用于检测模型中未经授权的数据使用。虽然像损失攻击(Loss Attack)这样的攻击提供了基线,但在代码领域中,像极化增强校准(Polarized Augment Calibration, PAC)这样更复杂的方法仍然未被充分探索。本文呈现了一项探索性研究,评估这些方法在3B至7B参数代码模型上的表现。我们发现,虽然PAC通常优于损失基线,但其有效性依赖于忽视代码严格语法的增强策略,这导致在较大、复杂文件上的性能下降。为了解决这个问题,我们引入了AST-PAC,一种领域特定的适应方法,利用基于抽象语法树(Abstract Syntax Tree, AST)的扰动生成语法上有效的校准样本。初步结果表明,AST-PAC在语法大小增加时表现改善,而PAC则下降,但在小文件上变异不足,并且在富含字母数字的代码上表现不佳。总体而言,这些发现激励未来在语法感知和大小自适应校准方面的研究,以作为可靠的代码语言模型来源审计的前提。
cs.AI / 14 / 2602.13248

X-Blocks: Linguistic Building Blocks of Natural Language Explanations for Automated Vehicles

X-Blocks:自动驾驶车辆自然语言解释的语言构建块
Zadeh, Ashkan Y., Li, Xiaomeng, Rakotonirainy, Andry, Schroeter, Ronald, Glaser, Sebastien, Zhu, Zishuo
Abstract
Natural language explanations play a critical role in establishing trust and acceptance of automated vehicles (AVs), yet existing approaches lack systematic frameworks for analysing how humans linguistically construct driving rationales across diverse scenarios. This paper introduces X-Blocks (eXplanation Blocks), a hierarchical analytical framework that identifies the linguistic building blocks of natural language explanations for AVs at three levels: context, syntax, and lexicon. At the context level, we propose RACE (Reasoning-Aligned Classification of Explanations), a multi-LLM ensemble framework that combines Chain-of-Thought reasoning with self-consistency mechanisms to robustly classify explanations into 32 scenario-aware categories. Applied to human-authored explanations from the Berkeley DeepDrive-X dataset, RACE achieves 91.45 percent accuracy and a Cohens kappa of 0.91 against cases with human annotator agreement, indicating near-human reliability for context classification. At the lexical level, log-odds analysis with informative Dirichlet priors reveals context-specific vocabulary patterns that distinguish driving scenarios. At the syntactic level, dependency parsing and template extraction show that explanations draw from a limited repertoire of reusable grammar families, with systematic variation in predicate types and causal constructions across contexts. The X-Blocks framework is dataset-agnostic and task-independent, offering broad applicability to other automated driving datasets and safety-critical domains. Overall, our findings provide evidence-based linguistic design principles for generating scenario-aware explanations that support transparency, user trust, and cognitive accessibility in automated driving systems.
Chinese Translation
自然语言解释在建立对自动驾驶车辆(AVs)的信任和接受度方面发挥着关键作用,但现有方法缺乏系统框架来分析人类在不同场景中如何语言构建驾驶理由。本文介绍了X-Blocks(eXplanation Blocks),一个层次分析框架,识别自动驾驶车辆自然语言解释的语言构建块,分为三个层次:上下文、句法和词汇。在上下文层面,我们提出了RACE(Reasoning-Aligned Classification of Explanations),一个多LLM集成框架,将思维链推理与自一致性机制结合,以稳健地将解释分类为32个情境感知类别。应用于来自伯克利深度驾驶-X数据集的人类撰写的解释,RACE在与人类标注者一致的案例中实现了91.45%的准确率和0.91的Cohen's kappa,表明上下文分类的近人类可靠性。在词汇层面,使用信息性Dirichlet先验的对数几率分析揭示了区分驾驶场景的上下文特定词汇模式。在句法层面,依赖解析和模板提取显示,解释源自有限的可重用语法家族,且在不同上下文中谓词类型和因果结构存在系统性变化。X-Blocks框架不依赖于特定数据集和任务,广泛适用于其他自动驾驶数据集和安全关键领域。总体而言,我们的研究结果为生成支持透明度、用户信任和认知可及性的情境感知解释提供了基于证据的语言设计原则。
cs.AI / 15 / 2602.13255

DPBench: Large Language Models Struggle with Simultaneous Coordination

DPBench:大型语言模型在同时协调方面的挑战
Hasan, Najmul, BusiReddyGari, Prashanth
Abstract
Large language models are increasingly deployed in multi-agent systems, yet we lack benchmarks that test whether they can coordinate under resource contention. We introduce DPBench, a benchmark based on the Dining Philosophers problem that evaluates LLM coordination across eight conditions that vary decision timing, group size, and communication. Our experiments with GPT-5.2, Claude Opus 4.5, and Grok 4.1 reveal a striking asymmetry: LLMs coordinate effectively in sequential settings but fail when decisions must be made simultaneously, with deadlock rates exceeding 95\% under some conditions. We trace this failure to convergent reasoning, where agents independently arrive at identical strategies that, when executed simultaneously, guarantee deadlock. Contrary to expectations, enabling communication does not resolve this problem and can even increase deadlock rates. Our findings suggest that multi-agent LLM systems requiring concurrent resource access may need external coordination mechanisms rather than relying on emergent coordination. DPBench is released as an open-source benchmark. Code and benchmark are available at https://github.com/najmulhasan-code/dpbench.
Chinese Translation
大型语言模型在多智能体系统中的应用日益增多,但我们缺乏测试它们在资源竞争下是否能够协调的基准。我们提出了DPBench,这是一个基于哲学家就餐问题的基准,评估大型语言模型(LLM)在八种条件下的协调能力,这些条件涉及决策时机、群体规模和沟通方式。我们对GPT-5.2、Claude Opus 4.5和Grok 4.1的实验揭示了一个显著的不对称性:LLM在顺序设置中能够有效协调,但在必须同时做出决策时却失败,在某些条件下死锁率超过95%。我们将这种失败归因于收敛推理,即智能体独立得出相同策略,而在同时执行时则保证了死锁。与预期相反,启用沟通并未解决这一问题,反而可能增加死锁率。我们的研究结果表明,需要并发资源访问的多智能体LLM系统可能需要外部协调机制,而不是依赖于自发协调。DPBench作为开源基准发布,代码和基准可在https://github.com/najmulhasan-code/dpbench获取。
cs.AI / 16 / 2602.13258

MAPLE: A Sub-Agent Architecture for Memory, Learning, and Personalization in Agentic AI Systems

MAPLE:一种用于记忆、学习和个性化的子代理架构在自主智能系统中的应用
Piskala, Deepak Babu
Abstract
Large language model (LLM) agents have emerged as powerful tools for complex tasks, yet their ability to adapt to individual users remains fundamentally limited. We argue this limitation stems from a critical architectural conflation: current systems treat memory, learning, and personalization as a unified capability rather than three distinct mechanisms requiring different infrastructure, operating on different timescales, and benefiting from independent optimization. We propose MAPLE (Memory-Adaptive Personalized LEarning), a principled decomposition where Memory handles storage and retrieval infrastructure; Learning extracts intelligence from accumulated interactions asynchronously; and Personalization applies learned knowledge in real-time within finite context budgets. Each component operates as a dedicated sub-agent with specialized tooling and well-defined interfaces. Experimental evaluation on the MAPLE-Personas benchmark demonstrates that our decomposition achieves a 14.6% improvement in personalization score compared to a stateless baseline (p < 0.01, Cohen's d = 0.95) and increases trait incorporation rate from 45% to 75% -- enabling agents that genuinely learn and adapt.
Chinese Translation
大型语言模型(LLM)代理已成为处理复杂任务的强大工具,但它们适应个体用户的能力仍然根本有限。我们认为这种限制源于一个关键的架构混淆:当前系统将记忆、学习和个性化视为统一的能力,而不是需要不同基础设施、在不同时间尺度上操作并受益于独立优化的三种不同机制。我们提出了MAPLE(Memory-Adaptive Personalized LEarning),一种原则性分解,其中记忆处理存储和检索基础设施;学习从累积的交互中异步提取智能;个性化在有限的上下文预算内实时应用所学知识。每个组件作为一个专门的子代理运行,配备专业工具和明确定义的接口。在MAPLE-Personas基准上的实验评估表明,我们的分解在个性化评分上较无状态基线提高了14.6%(p < 0.01,Cohen's d = 0.95),并将特征融入率从45%提高到75%——使得代理能够真正学习和适应。
cs.AI / 17 / 2602.13262

General learned delegation by clones

克隆的通用学习委托
Li, Darren, Chen, Meiqi, Shao, Chenze, Meng, Fandong, Zhou, Jie
Abstract
Frontier language models improve with additional test-time computation, but serial reasoning or uncoordinated parallel sampling can be compute-inefficient under fixed inference budgets. We propose SELFCEST, which equips a base model with the ability to spawn same-weight clones in separate parallel contexts by agentic reinforcement learning. Training is end-to-end under a global task reward with shared-parameter rollouts, yielding a learned controller that allocates both generation and context budget across branches. Across challenging math reasoning benchmarks and long-context multi-hop QA, SELFCEST improves the accuracy-cost Pareto frontier relative to monolithic baselines at matched inference budget, and exhibits out-of-distribution generalization in both domains.
Chinese Translation
前沿语言模型通过额外的测试时间计算得以改进,但在固定推理预算下,串行推理或不协调的并行采样可能导致计算效率低下。我们提出了SELFCEST,它通过自主强化学习使基础模型具备在独立的并行上下文中生成相同权重克隆的能力。训练是在全局任务奖励下进行的,采用共享参数的回滚策略,产生一个学习控制器,该控制器在各个分支之间分配生成和上下文预算。在具有挑战性的数学推理基准和长上下文多跳问答中,SELFCEST在匹配推理预算的情况下,相较于单一基线改善了准确性-成本的帕累托前沿,并在这两个领域展现了超出分布的泛化能力。
cs.AI / 18 / 2602.13271

Human-Centered Explainable AI for Security Enhancement: A Deep Intrusion Detection Framework

以人为本的可解释人工智能用于安全增强:深度入侵检测框架
Ayan, Md Muntasir Jahid, Rashid, Md. Shahriar, Hassan, Tazzina Afroze, Jamil, Hossain Md. Mubashshir, Islam, Mahbubul, Amin, Lisan Al, Das, Rupak Kumar, Akter, Farzana, Quader, Faisal
Abstract
The increasing complexity and frequency of cyber-threats demand intrusion detection systems (IDS) that are not only accurate but also interpretable. This paper presented a novel IDS framework that integrated Explainable Artificial Intelligence (XAI) to enhance transparency in deep learning models. The framework was evaluated experimentally using the benchmark dataset NSL-KDD, demonstrating superior performance compared to traditional IDS and black-box deep learning models. The proposed approach combined Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks for capturing temporal dependencies in traffic sequences. Our deep learning results showed that both CNN and LSTM reached 0.99 for accuracy, whereas LSTM outperformed CNN at macro average precision, recall, and F-1 score. For weighted average precision, recall, and F-1 score, both models scored almost similarly. To ensure interpretability, the XAI model SHapley Additive exPlanations (SHAP) was incorporated, enabling security analysts to understand and validate model decisions. Some notable influential features were srv_serror_rate, dst_host_srv_serror_rate, and serror_rate for both models, as pointed out by SHAP. We also conducted a trust-focused expert survey based on IPIP6 and Big Five personality traits via an interactive UI to evaluate the system's reliability and usability. This work highlighted the potential of combining performance and transparency in cybersecurity solutions and recommends future enhancements through adaptive learning for real-time threat detection.
Chinese Translation
网络威胁的复杂性和频率日益增加,要求入侵检测系统(IDS)不仅要准确,还要可解释。本文提出了一种新颖的入侵检测框架,集成了可解释人工智能(XAI),以增强深度学习模型的透明性。该框架通过使用基准数据集NSL-KDD进行了实验评估,显示出相较于传统IDS和黑箱深度学习模型的优越性能。所提方法结合了卷积神经网络(CNN)和长短期记忆网络(LSTM),以捕捉流量序列中的时间依赖性。我们的深度学习结果表明,CNN和LSTM的准确率均达到了0.99,而在宏平均精确率、召回率和F1分数方面,LSTM优于CNN。在加权平均精确率、召回率和F1分数方面,两种模型的得分几乎相似。为了确保可解释性,XAI模型SHapley Additive exPlanations(SHAP)被纳入其中,使安全分析师能够理解和验证模型决策。SHAP指出,srv_serror_rate、dst_host_srv_serror_rate和serror_rate是两个模型的一些显著影响特征。我们还基于IPIP6和大五人格特质,通过交互式用户界面进行了以信任为中心的专家调查,以评估系统的可靠性和可用性。本研究强调了在网络安全解决方案中结合性能和透明性的潜力,并建议通过自适应学习进行未来的增强,以实现实时威胁检测。
cs.AI / 19 / 2602.13272

TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks

TemporalBench:评估基于大型语言模型的智能体在上下文和事件驱动时间序列任务中的表现的基准
Weng, Muyan, Cao, Defu, Yang, Wei, Sharma, Yashaswi, Liu, Yan
Abstract
It is unclear whether strong forecasting performance reflects genuine temporal understanding or the ability to reason under contextual and event-driven conditions. We introduce TemporalBench, a multi-domain benchmark designed to evaluate temporal reasoning behavior under progressively richer informational settings. TemporalBench adopts a four-tier task taxonomy that examines historical structure interpretation, context-free forecasting, contextual temporal reasoning, and event-conditioned prediction across four real-world domains: retail, healthcare, energy, and physical systems. By controlling access to future targets and contextual information, the benchmark enables a diagnostic analysis of whether models can correctly interpret temporal patterns, align them with external context, and adapt predictions when conditions change. Extensive baseline experiments show that strong numerical forecasting accuracy does not reliably translate into robust contextual or event-aware temporal reasoning; instead, existing agent frameworks exhibit fragmented strengths and systematic failure modes that remain largely hidden under forecasting-only benchmarks. The TemporalBench dataset is publicly available at https://huggingface.co/datasets/Melady/TemporalBench, and we additionally provide a public leaderboard at https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard.
Chinese Translation
目前尚不清楚强大的预测性能是否反映了真实的时间理解能力,或者是在上下文和事件驱动条件下推理的能力。我们介绍了TemporalBench,这是一个多领域基准,旨在评估在逐渐丰富的信息环境下的时间推理行为。TemporalBench采用四层任务分类法,考察历史结构解释、无上下文预测、上下文时间推理和事件条件预测,涵盖零售、医疗保健、能源和物理系统四个真实世界领域。通过控制对未来目标和上下文信息的访问,该基准能够诊断模型是否能够正确解释时间模式,将其与外部上下文对齐,并在条件变化时调整预测。大量基线实验表明,强大的数值预测准确性并不可靠地转化为稳健的上下文或事件感知时间推理;相反,现有的智能体框架表现出零散的优势和系统性的失败模式,这些在仅依赖预测的基准下大多被掩盖。TemporalBench数据集已公开发布,网址为 https://huggingface.co/datasets/Melady/TemporalBench,我们还提供了一个公共排行榜,网址为 https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard。
cs.AI / 20 / 2602.13274

ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs

ProMoral-Bench:评估大型语言模型中的道德推理和安全性提示策略
Thomas, Rohan Subramanian, Shiromani, Shikhar, Chaudhry, Abdullah, Li, Ruizhe, Sharma, Vasu, Zhu, Kevin, Dev, Sunishchal
Abstract
Prompt design significantly impacts the moral competence and safety alignment of large language models (LLMs), yet empirical comparisons remain fragmented across datasets and models.We introduce ProMoral-Bench, a unified benchmark evaluating 11 prompting paradigms across four LLM families. Using ETHICS, Scruples, WildJailbreak, and our new robustness test, ETHICS-Contrast, we measure performance via our proposed Unified Moral Safety Score (UMSS), a metric balancing accuracy and safety. Our results show that compact, exemplar-guided scaffolds outperform complex multi-stage reasoning, providing higher UMSS scores and greater robustness at a lower token cost. While multi-turn reasoning proves fragile under perturbations, few-shot exemplars consistently enhance moral stability and jailbreak resistance. ProMoral-Bench establishes a standardized framework for principled, cost-effective prompt engineering.
Chinese Translation
提示设计对大型语言模型(LLMs)的道德能力和安全对齐有显著影响,但实证比较在数据集和模型之间仍然零散。我们引入了ProMoral-Bench,这是一个统一的基准,评估四个LLM家族中的11种提示范式。通过使用ETHICS、Scruples、WildJailbreak以及我们的新鲁棒性测试ETHICS-Contrast,我们通过提出的统一道德安全评分(UMSS)来衡量性能,该指标平衡了准确性和安全性。我们的结果表明,紧凑的示例指导框架在复杂的多阶段推理中表现更佳,提供了更高的UMSS分数和更大的鲁棒性,同时降低了令牌成本。尽管多轮推理在扰动下表现脆弱,但少量示例始终增强了道德稳定性和越狱抵抗力。ProMoral-Bench建立了一个标准化框架,用于原则性和成本效益的提示工程。
cs.AI / 21 / 2602.13275

Artificial Organisations

人工组织
Waites, William
Abstract
Alignment research focuses on making individual AI systems reliable. Human institutions achieve reliable collective behaviour differently: they mitigate the risk posed by misaligned individuals through organisational structure. Multi-agent AI systems should follow this institutional model using compartmentalisation and adversarial review to achieve reliable outcomes through architectural design rather than assuming individual alignment. We demonstrate this approach through the Perseverance Composition Engine, a multi-agent system for document composition. The Composer drafts text, the Corroborator verifies factual substantiation with full source access, and the Critic evaluates argumentative quality without access to sources: information asymmetry enforced by system architecture. This creates layered verification: the Corroborator detects unsupported claims, whilst the Critic independently assesses coherence and completeness. Observations from 474 composition tasks (discrete cycles of drafting, verification, and evaluation) exhibit patterns consistent with the institutional hypothesis. When assigned impossible tasks requiring fabricated content, this iteration enabled progression from attempted fabrication toward honest refusal with alternative proposals--behaviour neither instructed nor individually incentivised. These findings motivate controlled investigation of whether architectural enforcement produces reliable outcomes from unreliable components. This positions organisational theory as a productive framework for multi-agent AI safety. By implementing verification and evaluation as structural properties enforced through information compartmentalisation, institutional design offers a route to reliable collective behaviour from unreliable individual components.
Chinese Translation
对齐研究专注于使个体人工智能系统可靠。人类机构通过组织结构以不同方式实现可靠的集体行为:它们通过减轻不对齐个体所带来的风险来达到这一目标。多智能体人工智能系统应遵循这一制度模型,利用分隔和对抗性审查,通过架构设计实现可靠的结果,而不是假设个体的对齐。我们通过“毅力组合引擎”(Perseverance Composition Engine)展示了这一方法,该系统是一个用于文档创作的多智能体系统。创作者(Composer)起草文本,验证者(Corroborator)通过完全访问来源验证事实依据,批评者(Critic)在没有访问来源的情况下评估论证质量:信息不对称由系统架构强制执行。这创造了分层验证:验证者检测不支持的主张,而批评者独立评估连贯性和完整性。来自474个创作任务(起草、验证和评估的离散循环)的观察显示出与制度假设一致的模式。当被分配需要虚构内容的不可能任务时,这一迭代使得从尝试虚构向诚实拒绝并提出替代方案的进展成为可能——这种行为既没有被指示,也没有个体激励。这些发现促使我们对架构强制是否能够从不可靠的组件中产生可靠结果进行受控研究。这将组织理论定位为多智能体人工智能安全的一个富有成效的框架。通过将验证和评估作为通过信息分隔强制执行的结构属性,制度设计为从不可靠的个体组件中实现可靠的集体行为提供了一条途径。
cs.AI / 22 / 2602.13280

BEAGLE: Behavior-Enforced Agent for Grounded Learner Emulation

BEAGLE:基于行为的代理用于真实学习者模拟
Wang, Hanchen David, Cohn, Clayton, Xu, Zifan, Guo, Siyuan, Biswas, Gautam, Ma, Meiyi
Abstract
Simulating student learning behaviors in open-ended problem-solving environments holds potential for education research, from training adaptive tutoring systems to stress-testing pedagogical interventions. However, collecting authentic data is challenging due to privacy concerns and the high cost of longitudinal studies. While Large Language Models (LLMs) offer a promising path to student simulation, they suffer from competency bias, optimizing for efficient correctness rather than the erratic, iterative struggle characteristic of novice learners. We present BEAGLE, a neuro-symbolic framework that addresses this bias by incorporating Self-Regulated Learning (SRL) theory into a novel architecture. BEAGLE integrates three key technical innovations: (1) a semi-Markov model that governs the timing and transitions of cognitive behaviors and metacognitive behaviors; (2) Bayesian Knowledge Tracing with explicit flaw injection to enforce realistic knowledge gaps and "unknown unknowns"; and (3) a decoupled agent design that separates high-level strategy use from code generation actions to prevent the model from silently correcting its own intentional errors. In evaluations on Python programming tasks, BEAGLE significantly outperforms state-of-the-art baselines in reproducing authentic trajectories. In a human Turing test, users were unable to distinguish synthetic traces from real student data, achieving an accuracy indistinguishable from random guessing (52.8%).
Chinese Translation
在开放式问题解决环境中模拟学生学习行为对教育研究具有潜在价值,从训练自适应辅导系统到压力测试教学干预。然而,由于隐私问题和长期研究的高成本,收集真实数据面临挑战。尽管大型语言模型(LLMs)为学生模拟提供了有前景的路径,但它们存在能力偏差,优化效率正确性而非新手学习者特有的反复挣扎。我们提出了BEAGLE,一个神经符号框架,通过将自我调节学习(SRL)理论融入新颖架构来解决这一偏差。BEAGLE整合了三个关键技术创新:(1)一个半马尔可夫模型,控制认知行为和元认知行为的时机和转变;(2)带有显式缺陷注入的贝叶斯知识追踪,以强制实现现实的知识差距和“未知的未知”;(3)一个解耦的代理设计,将高层策略使用与代码生成行为分离,以防止模型默默纠正自身的故意错误。在对Python编程任务的评估中,BEAGLE在再现真实轨迹方面显著优于最先进的基线。在一次人类图灵测试中,用户无法区分合成轨迹和真实学生数据,准确率与随机猜测(52.8%)无异。
cs.AI / 23 / 2602.13283

Accuracy Standards for AI at Work vs. Personal Life: Evidence from an Online Survey

工作与个人生活中人工智能的准确性标准:来自在线调查的证据
Besanson, Gaston, Todeschini, Federico
Abstract
We study how people trade off accuracy when using AI-powered tools in professional versus personal contexts for adoption purposes, the determinants of those trade-offs, and how users cope when AI/apps are unavailable. Because modern AI systems (especially generative models) can produce acceptable but non-identical outputs, we define "accuracy" as context-specific reliability: the degree to which an output aligns with the user's intent within a tolerance threshold that depends on stakes and the cost of correction. In an online survey (N=300), among respondents with both accuracy items (N=170), the share requiring high accuracy (top-box) is 24.1% at work vs. 8.8% in personal life (+15.3 pp; z=6.29, p<0.001). The gap remains large under a broader top-two-box definition (67.0% vs. 32.9%) and on the full 1-5 ordinal scale (mean 3.86 vs. 3.08). Heavy app use and experience patterns correlate with stricter work standards (H2). When tools are unavailable (H3), respondents report more disruption in personal routines than at work (34.1% vs. 15.3%, p<0.01). We keep the main text focused on these substantive results and place test taxonomy and power derivations in a technical appendix.
Chinese Translation
我们研究了人们在职业与个人环境中使用人工智能工具时如何权衡准确性以促进采纳,影响这些权衡的因素,以及当人工智能/应用程序不可用时用户如何应对。由于现代人工智能系统(尤其是生成模型)能够产生可接受但不完全相同的输出,我们将“准确性”定义为特定于上下文的可靠性:输出与用户意图的一致程度,且该一致性在一个依赖于风险和修正成本的容忍阈值内。在一项在线调查中(N=300),在同时具备准确性项目的受访者中(N=170),要求高准确性的比例在工作中为24.1%,而在个人生活中为8.8%(+15.3个百分点;z=6.29,p<0.001)。在更广泛的前两名定义下,这一差距仍然很大(67.0%对32.9%),在完整的1-5序数尺度上(均值3.86对3.08)。频繁使用应用程序和经验模式与更严格的工作标准相关(H2)。当工具不可用时(H3),受访者报告个人日常活动受到的干扰大于工作中的干扰(34.1%对15.3%,p<0.01)。我们将主要文本集中于这些实质性结果,并将测试分类和功效推导放在技术附录中。
cs.AI / 24 / 2602.13292

Mirror: A Multi-Agent System for AI-Assisted Ethics Review

Mirror:一个用于AI辅助伦理审查的多智能体系统
Ding, Yifan, Shi, Yuhui, Li, Zhiyan, Wang, Zilong, Gao, Yifeng, Yang, Yajun, Yang, Mengjie, Liang, Yixiu, Qiu, Xipeng, Huang, Xuanjing, Ma, Xingjun, Jiang, Yu-Gang, Wang, Guoyu
Abstract
Ethics review is a foundational mechanism of modern research governance, yet contemporary systems face increasing strain as ethical risks arise as structural consequences of large-scale, interdisciplinary scientific practice. The demand for consistent and defensible decisions under heterogeneous risk profiles exposes limitations in institutional review capacity rather than in the legitimacy of ethics oversight. Recent advances in large language models (LLMs) offer new opportunities to support ethics review, but their direct application remains limited by insufficient ethical reasoning capability, weak integration with regulatory structures, and strict privacy constraints on authentic review materials. In this work, we introduce Mirror, an agentic framework for AI-assisted ethical review that integrates ethical reasoning, structured rule interpretation, and multi-agent deliberation within a unified architecture. At its core is EthicsLLM, a foundational model fine-tuned on EthicsQA, a specialized dataset of 41K question-chain-of-thought-answer triples distilled from authoritative ethics and regulatory corpora. EthicsLLM provides detailed normative and regulatory understanding, enabling Mirror to operate in two complementary modes. Mirror-ER (expedited Review) automates expedited review through an executable rule base that supports efficient and transparent compliance checks for minimal-risk studies. Mirror-CR (Committee Review) simulates full-board deliberation through coordinated interactions among expert agents, an ethics secretary agent, and a principal investigator agent, producing structured, committee-level assessments across ten ethical dimensions. Empirical evaluations demonstrate that Mirror significantly improves the quality, consistency, and professionalism of ethics assessments compared with strong generalist LLMs.
Chinese Translation
伦理审查是现代研究治理的基础机制,但当代系统在面对大规模跨学科科学实践所带来的伦理风险时,面临着日益增加的压力。对异质风险特征下的一致且可辩护的决策的需求暴露了机构审查能力的局限,而非伦理监督的合法性。最近的大型语言模型(LLMs)的进展为支持伦理审查提供了新的机会,但其直接应用仍受到伦理推理能力不足、与监管结构整合薄弱以及对真实审查材料的严格隐私限制的限制。在本研究中,我们介绍了Mirror,一个用于AI辅助伦理审查的智能框架,它在统一架构中整合了伦理推理、结构化规则解释和多智能体协商。其核心是EthicsLLM,一个在EthicsQA(一个由权威伦理和监管文献提炼的41K问题链-思维-答案三元组的专用数据集)上微调的基础模型。EthicsLLM提供了详细的规范性和监管理解,使Mirror能够以两种互补模式运作。Mirror-ER(加速审查)通过可执行的规则库自动化加速审查,支持对低风险研究的高效和透明的合规检查。Mirror-CR(委员会审查)通过专家智能体、伦理秘书智能体和主要研究者智能体之间的协调互动模拟全体委员会的审议,产生跨十个伦理维度的结构化委员会级评估。实证评估表明,与强大的通用LLMs相比,Mirror显著提高了伦理评估的质量、一致性和专业性。
cs.AI / 25 / 2602.13318

DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing

DECKBench:学术幻灯片生成与编辑的多智能体框架基准测试
Jang, Daesik, Heisler, Morgan Lindsay, Xing, Linzi, Li, Yifei, Wang, Edward, Xiong, Ying, Zhang, Yong, Fan, Zhenan
Abstract
Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout-aware rendering, and robust multi-turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi-agent slide generation and editing. DECKBench is built on a curated dataset of paper to slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following. We further implement a modular multi-agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi-agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at https://github.com/morgan-heisler/DeckBench .
Chinese Translation
自动生成和迭代编辑学术幻灯片需要的不仅仅是文档摘要。这要求忠实的内容选择、连贯的幻灯片组织、布局感知的渲染以及稳健的多轮指令跟随。然而,现有的基准测试和评估协议并未充分衡量这些挑战。为了解决这一问题,我们引入了幻灯片编辑与合规工具包基准(DECKBench),这是一个用于多智能体幻灯片生成与编辑的评估框架。DECKBench建立在一个经过精心策划的论文与幻灯片对的数据集之上,并增强了现实的、模拟的编辑指令。我们的评估协议系统地评估了幻灯片级和幻灯片组级的忠实度、连贯性、布局质量以及多轮指令跟随。我们进一步实现了一个模块化的多智能体基线系统,将幻灯片生成和编辑任务分解为论文解析与摘要、幻灯片规划、HTML创建和迭代编辑。实验结果表明,所提出的基准测试突出了优势,揭示了失败模式,并提供了可行的见解,以改善多智能体幻灯片生成与编辑系统。总体而言,这项工作为学术演示生成与编辑的可重复和可比较评估建立了标准化基础。代码和数据可在 https://github.com/morgan-heisler/DeckBench 上公开获取。
cs.AI / 26 / 2602.13319

Situation Graph Prediction: Structured Perspective Inference for User Modeling

情境图预测:用户建模的结构化视角推理
Shin, Jisung, Platnick, Daniel, Alirezaie, Marjan, Rahnama, Hossein
Abstract
Perspective-Aware AI requires modeling evolving internal states--goals, emotions, contexts--not merely preferences. Progress is limited by a data bottleneck: digital footprints are privacy-sensitive and perspective states are rarely labeled. We propose Situation Graph Prediction (SGP), a task that frames perspective modeling as an inverse inference problem: reconstructing structured, ontology-aligned representations of perspective from observable multimodal artifacts. To enable grounding without real labels, we use a structure-first synthetic generation strategy that aligns latent labels and observable traces by design. As a pilot, we construct a dataset and run a diagnostic study using retrieval-augmented in-context learning as a proxy for supervision. In our study with GPT-4o, we observe a gap between surface-level extraction and latent perspective inference--indicating latent-state inference is harder than surface extraction under our controlled setting. Results suggest SGP is non-trivial and provide evidence for the structure-first data synthesis strategy.
Chinese Translation
视角感知人工智能需要建模不断变化的内部状态——目标、情感、上下文——而不仅仅是偏好。进展受到数据瓶颈的限制:数字足迹涉及隐私,视角状态很少被标注。我们提出了情境图预测(Situation Graph Prediction, SGP),该任务将视角建模框架化为逆向推理问题:从可观察的多模态文物中重构结构化的、与本体对齐的视角表示。为了在没有真实标签的情况下实现基础,我们采用了一种结构优先的合成生成策略,通过设计将潜在标签与可观察痕迹对齐。作为试点,我们构建了一个数据集,并使用检索增强的上下文学习作为监督的代理进行诊断研究。在与GPT-4o的研究中,我们观察到表层提取与潜在视角推理之间存在差距——这表明在我们的控制设置下,潜在状态推理比表层提取更困难。结果表明SGP并非简单,并为结构优先的数据合成策略提供了证据。
cs.AI / 27 / 2602.13320

Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol

工具使用的LLM代理中的信息保真度:模型上下文协议的马尔可夫分析
Fan, Flint Xiaofeng, Tan, Cheston, Wattenhofer, Roger, Ong, Yew-Soon
Abstract
As AI agents powered by large language models (LLMs) increasingly use external tools for high-stakes decisions, a critical reliability question arises: how do errors propagate across sequential tool calls? We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents, proving that cumulative distortion exhibits linear growth and high-probability deviations bounded by $O(\sqrt{T})$. This concentration property ensures predictable system behavior and rules out exponential failure modes. We develop a hybrid distortion metric combining discrete fact matching with continuous semantic similarity, then establish martingale concentration bounds on error propagation through sequential tool interactions. Experiments across Qwen2-7B, Llama-3-8B, and Mistral-7B validate our theoretical predictions, showing empirical distortion tracks the linear trend with deviations consistently within $O(\sqrt{T})$ envelopes. Key findings include: semantic weighting reduces distortion by 80\%, and periodic re-grounding approximately every 9 steps suffices for error control. We translate these concentration guarantees into actionable deployment principles for trustworthy agent systems.
Chinese Translation
随着由大型语言模型(LLMs)驱动的人工智能代理越来越多地使用外部工具进行高风险决策,一个关键的可靠性问题随之而来:错误是如何在连续的工具调用中传播的?我们提出了第一个理论框架,用于分析模型上下文协议(Model Context Protocol, MCP)代理中的错误累积,证明了累积失真呈线性增长,并且高概率偏差被限制在 $O( ext{sqrt}(T))$ 范围内。这一集中性特性确保了系统行为的可预测性,并排除了指数级失败模式。我们开发了一种混合失真度量,结合了离散事实匹配与连续语义相似性,然后建立了关于通过连续工具交互传播的错误的马尔可夫集中界限。在 Qwen2-7B、Llama-3-8B 和 Mistral-7B 上的实验验证了我们的理论预测,显示经验失真跟踪线性趋势,偏差始终在 $O( ext{sqrt}(T))$ 范围内。主要发现包括:语义加权将失真减少了 80\%,并且每约 9 步进行一次周期性再定位足以实现错误控制。我们将这些集中性保证转化为可操作的可信代理系统部署原则。
cs.AI / 28 / 2602.13321

Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

通过自动化语言特征提取检测临床训练大语言模型中的越狱尝试
Nguyen, Tri, Le, Huy Hoang Bao, Pentapalli, Lohith Srikanth, Turner, Laurah, Cohen, Kelly
Abstract
Detecting jailbreak attempts in clinical training large language models (LLMs) requires accurate modeling of linguistic deviations that signal unsafe or off-task user behavior. Prior work on the 2-Sigma clinical simulation platform showed that manually annotated linguistic features could support jailbreak detection. However, reliance on manual annotation limited both scalability and expressiveness. In this study, we extend this framework by using experts' annotations of four core linguistic features (Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction) and training multiple general-domain and medical-domain BERT-based LLM models to predict these features directly from text. The most reliable feature regressor for each dimension was selected and used as the feature extractor in a second layer of classifiers. We evaluate a suite of predictive models, including tree-based, linear, probabilistic, and ensemble methods, to determine jailbreak likelihood from the extracted features. Across cross-validation and held-out evaluations, the system achieves strong overall performance, indicating that LLM-derived linguistic features provide an effective basis for automated jailbreak detection. Error analysis further highlights key limitations in current annotations and feature representations, pointing toward future improvements such as richer annotation schemes, finer-grained feature extraction, and methods that capture the evolving risk of jailbreak behavior over the course of a dialogue. This work demonstrates a scalable and interpretable approach for detecting jailbreak behavior in safety-critical clinical dialogue systems.
Chinese Translation
检测临床训练大语言模型(LLMs)中的越狱尝试需要准确建模语言偏差,这些偏差表明用户行为不安全或偏离任务。先前在2-Sigma临床模拟平台上的研究表明,手动注释的语言特征可以支持越狱检测。然而,依赖手动注释限制了可扩展性和表现力。在本研究中,我们通过使用专家对四个核心语言特征(专业性、医学相关性、伦理行为和情境干扰)的注释,扩展了这一框架,并训练了多个基于BERT的通用领域和医学领域的LLM模型,以直接从文本中预测这些特征。为每个维度选择了最可靠的特征回归器,并在第二层分类器中用作特征提取器。我们评估了一系列预测模型,包括基于树的、线性、概率和集成方法,以根据提取的特征确定越狱的可能性。在交叉验证和保留评估中,该系统实现了强大的整体性能,表明LLM派生的语言特征为自动化越狱检测提供了有效的基础。错误分析进一步突出了当前注释和特征表示的关键局限性,指向未来的改进方向,例如更丰富的注释方案、更细粒度的特征提取,以及捕捉对话过程中越狱行为演变风险的方法。这项工作展示了一种可扩展且可解释的方法,用于在安全关键的临床对话系统中检测越狱行为。
cs.AI / 29 / 2602.13323

Contrastive explanations of BDI agents

BDI代理的对比解释
Winikoff, Michael
Abstract
The ability of autonomous systems to provide explanations is important for supporting transparency and aiding the development of (appropriate) trust. Prior work has defined a mechanism for Belief-Desire-Intention (BDI) agents to be able to answer questions of the form ``why did you do action $X$?''. However, we know that we ask \emph{contrastive} questions (``why did you do $X$ \emph{instead of} $F$?''). We therefore extend previous work to be able to answer such questions. A computational evaluation shows that using contrastive questions yields a significant reduction in explanation length. A human subject evaluation was conducted to assess whether such contrastive answers are preferred, and how well they support trust development and transparency. We found some evidence for contrastive answers being preferred, and some evidence that they led to higher trust, perceived understanding, and confidence in the system's correctness. We also evaluated the benefit of providing explanations at all. Surprisingly, there was not a clear benefit, and in some situations we found evidence that providing a (full) explanation was worse than not providing any explanation.
Chinese Translation
自主系统提供解释的能力对于支持透明度和促进(适当的)信任发展至关重要。先前的研究定义了一种机制,使得信念-欲望-意图(Belief-Desire-Intention, BDI)代理能够回答“你为什么做了动作 $X$?”这种形式的问题。然而,我们知道我们会提出 extit{对比性}问题(“你为什么选择做 $X$ extit{而不是} $F$?”)。因此,我们扩展了之前的研究,以能够回答此类问题。计算评估显示,使用对比性问题显著减少了解释的长度。我们进行了人类受试者评估,以评估此类对比性答案是否更受欢迎,以及它们在支持信任发展和透明度方面的效果。我们发现了一些证据表明对比性答案更受欢迎,并且有一些证据表明它们提高了信任感、理解感和对系统正确性的信心。我们还评估了提供解释的整体益处。令人惊讶的是,并没有明显的好处,在某些情况下,我们发现提供(完整)解释的效果不如不提供任何解释。
cs.AI / 30 / 2602.13367

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts

Nanbeige4.1-3B:一个能够推理、对齐和行动的小型通用模型
Yang, Chen, Peng, Guangyue, Zhu, Jiaying, Le, Ran, Feng, Ruixiang, Zhang, Tao, Xu, Xiyun, Song, Yang, Jia, Yiming, Wen, Yuntao, Xu, Yunzhi, Wang, Zekai, An, Zhenwei, Sun, Zhicong, Chen, Zongchao
Abstract
We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in Reinforcement Learning, optimizing both correctness and efficiency. In deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, even achieving superior performance compared to much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B parameter models.
Chinese Translation
我们提出了Nanbeige4.1-3B,这是一种统一的通用语言模型,能够在仅有30亿参数的情况下同时实现强大的代理行为、代码生成和一般推理。据我们所知,它是第一个在单一模型中实现如此多功能的开源小型语言模型(SLM)。为了提高推理能力和偏好对齐,我们结合了逐点和成对奖励建模,确保高质量的人类对齐响应。在代码生成方面,我们在强化学习中设计了复杂度感知奖励,优化了正确性和效率。在深度搜索中,我们进行复杂的数据合成,并在训练过程中融入回合级监督。这使得Nanbeige4.1-3B能够稳定地进行长时间的工具交互,可靠地执行多达600次工具调用以解决复杂问题。大量实验结果表明,Nanbeige4.1-3B显著优于先前相似规模的模型,如Nanbeige4-3B-2511和Qwen3-4B,甚至在性能上超过了更大规模的模型,如Qwen3-30B-A3B。我们的结果表明,小型模型可以同时实现广泛的能力和强大的专业化,重新定义了30亿参数模型的潜力。
cs.AI / 31 / 2602.13372

MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

MoralityGym:评估顺序决策代理中层次道德一致性的基准
Rosen, Simon, Singh, Siddarth, Gelo, Ebenezer, Robertson, Helen Sarah, Suder, Ibrahim, Williams, Victoria, Rosman, Benjamin, Tasse, Geraud Nangue, James, Steven
Abstract
Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym allows the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.
Chinese Translation
在应对冲突的、层次结构的人类规范时,评估代理的道德一致性是人工智能安全、道德哲学和认知科学交叉领域中的一项关键挑战。我们引入了道德链(Morality Chains),这是一种将道德规范表示为有序的义务约束的新形式,以及MoralityGym,这是一个包含98个伦理困境问题的基准,这些问题以类似于电车难题(trolley dilemma)的健身环境呈现。通过将任务解决与道德评估解耦,并引入一种新颖的道德度量(Morality Metric),MoralityGym使得心理学和哲学的见解能够融入到对规范敏感推理的评估中。使用安全强化学习(Safe RL)方法的基线结果揭示了关键的局限性,强调了对伦理决策需要更有原则的方法。本研究为开发在复杂现实环境中表现得更可靠、透明和伦理的人工智能系统奠定了基础。
cs.AI / 32 / 2602.13407

On-Policy Supervised Fine-Tuning for Efficient Reasoning

高效推理的在线监督微调
Zhao, Anhao, Chen, Ziyang, Tong, Junlong, Fan, Yingqi, Ye, Fanghua, Li, Shuhao, Ma, Yunpu, Li, Wenjie, Shen, Xiaoyu
Abstract
Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought reasoning, achieving strong performance at high computational cost. Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs. We revisit this objective and challenge the necessity of such complexity. Through principled analysis, we identify fundamental misalignments in this paradigm: KL regularization loses its intended role when correctness and length are directly verifiable, and group-wise normalization becomes ambiguous under multiple reward signals. By removing these two items and simplifying the reward to a truncation-based length penalty, we show that the optimization problem reduces to supervised fine-tuning on self-generated data filtered for both correctness and conciseness. We term this simplified training strategy on-policy SFT. Despite its simplicity, on-policy SFT consistently defines the accuracy-efficiency Pareto frontier. It reduces CoT length by up to 80 while maintaining original accuracy, surpassing more complex RL-based methods across five benchmarks. Furthermore, it significantly enhances training efficiency, reducing GPU memory usage by 50% and accelerating convergence by 70%. Our code is available at https://github.com/EIT-NLP/On-Policy-SFT.
Chinese Translation
大型推理模型(LRMs)通常通过强化学习(RL)进行训练,以探索长链推理,尽管在高计算成本下取得了强大的性能。最近的方法增加了多重奖励目标,以共同优化正确性和简洁性,但这些复杂的扩展往往会使训练不稳定,并产生次优的权衡。我们重新审视这一目标,并质疑这种复杂性的必要性。通过原则性分析,我们识别出这一范式中的基本不一致:当正确性和长度可以直接验证时,KL正则化失去了其预期的作用,而在多重奖励信号下,组归一化变得模糊。通过去除这两个项目并将奖励简化为基于截断的长度惩罚,我们表明优化问题简化为在自生成数据上进行监督微调,该数据经过过滤以确保正确性和简洁性。我们将这种简化的训练策略称为在线监督微调(on-policy SFT)。尽管其简单性,在线监督微调始终定义了准确性-效率的帕累托前沿。它在保持原始准确性的同时,将链推理长度减少了多达80,超越了五个基准测试中更复杂的基于RL的方法。此外,它显著提高了训练效率,将GPU内存使用减少了50%,并加速了70%的收敛。我们的代码可在 https://github.com/EIT-NLP/On-Policy-SFT 获得。
cs.AI / 33 / 2602.13473

NeuroWeaver: An Autonomous Evolutionary Agent for Exploring the Programmatic Space of EEG Analysis Pipelines

NeuroWeaver:一种用于探索脑电图分析管道程序空间的自主进化代理
Wang, Guoan, Yang, Shihao, Ding, Jun-En, Zhu, Hao, Liu, Feng
Abstract
Although foundation models have demonstrated remarkable success in general domains, the application of these models to electroencephalography (EEG) analysis is constrained by substantial data requirements and high parameterization. These factors incur prohibitive computational costs, thereby impeding deployment in resource-constrained clinical environments. Conversely, general-purpose automated machine learning frameworks are often ill-suited for this domain, as exploration within an unbounded programmatic space fails to incorporate essential neurophysiological priors and frequently yields solutions that lack scientific plausibility. To address these limitations, we propose NeuroWeaver, a unified autonomous evolutionary agent designed to generalize across diverse EEG datasets and tasks by reformulating pipeline engineering as a discrete constrained optimization problem. Specifically, we employ a Domain-Informed Subspace Initialization to confine the search to neuroscientifically plausible manifolds, coupled with a Multi-Objective Evolutionary Optimization that dynamically balances performance, novelty, and efficiency via self-reflective refinement. Empirical evaluations across five heterogeneous benchmarks demonstrate that NeuroWeaver synthesizes lightweight solutions that consistently outperform state-of-the-art task-specific methods and achieve performance comparable to large-scale foundation models, despite utilizing significantly fewer parameters.
Chinese Translation
尽管基础模型在一般领域取得了显著成功,但这些模型在脑电图(EEG)分析中的应用受到大量数据需求和高参数化的限制。这些因素导致了高昂的计算成本,从而妨碍了在资源受限的临床环境中的部署。相反,通用的自动化机器学习框架通常不适合该领域,因为在无限制的程序空间内进行探索未能纳入必要的神经生理学先验,并且经常产生缺乏科学合理性的解决方案。为了解决这些限制,我们提出了NeuroWeaver,一种统一的自主进化代理,旨在通过将管道工程重新表述为离散约束优化问题,从而在多样化的EEG数据集和任务中进行泛化。具体而言,我们采用领域知情的子空间初始化,将搜索限制在神经科学上合理的流形内,并结合多目标进化优化,通过自我反思的精炼动态平衡性能、新颖性和效率。五个异构基准的实证评估表明,NeuroWeaver合成的轻量级解决方案在性能上始终优于最先进的任务特定方法,并且在参数使用显著减少的情况下,达到了与大规模基础模型相当的性能。
cs.AI / 34 / 2602.13477

OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage

OMNI-LEAK:协调者多智能体网络引发的数据泄露
Naik, Akshat, Culligan, Jay, Gal, Yarin, Torr, Philip, Aljundi, Rahaf, Paren, Alasdair, Bibi, Adel
Abstract
As Large Language Model (LLM) agents become more capable, their coordinated use in the form of multi-agent systems is anticipated to emerge as a practical paradigm. Prior work has examined the safety and misuse risks associated with agents. However, much of this has focused on the single-agent case and/or setups missing basic engineering safeguards such as access control, revealing a scarcity of threat modeling in multi-agent systems. We investigate the security vulnerabilities of a popular multi-agent pattern known as the orchestrator setup, in which a central agent decomposes and delegates tasks to specialized agents. Through red-teaming a concrete setup representative of a likely future use case, we demonstrate a novel attack vector, OMNI-LEAK, that compromises several agents to leak sensitive data through a single indirect prompt injection, even in the \textit{presence of data access control}. We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non-reasoning models are vulnerable, even when the attacker lacks insider knowledge of the implementation details. Our work highlights the importance of safety research to generalize from single-agent to multi-agent settings, in order to reduce the serious risks of real-world privacy breaches and financial losses and overall public trust in AI agents.
Chinese Translation
随着大型语言模型(LLM)智能体能力的提升,它们以多智能体系统的形式协同使用被预期将成为一种实用范式。先前的研究已探讨了与智能体相关的安全性和误用风险。然而,这些研究大多集中在单一智能体的案例上,或缺乏基本的工程保护措施,如访问控制,导致多智能体系统中的威胁建模相对匮乏。我们研究了一种流行的多智能体模式——协调者设置的安全漏洞,其中一个中央智能体将任务分解并委派给专业智能体。通过对一个代表可能未来使用案例的具体设置进行红队测试,我们展示了一种新颖的攻击向量OMNI-LEAK,该攻击通过单个间接提示注入,妥协多个智能体以泄露敏感数据,即使在存在数据访问控制的情况下。我们报告了前沿模型对不同类别攻击的易感性,发现推理模型和非推理模型均存在脆弱性,即使攻击者缺乏对实现细节的内部知识。我们的研究强调了安全性研究的重要性,以便从单一智能体推广到多智能体环境,从而减少现实世界中隐私泄露和经济损失的严重风险,以及公众对人工智能智能体的整体信任。
cs.AI / 35 / 2602.13502

Translating Dietary Standards into Healthy Meals with Minimal Substitutions

将饮食标准转化为健康餐膳,最小替换
Chan, Trevor, Tagkopoulos, Ilias
Abstract
An important goal for personalized diet systems is to improve nutritional quality without compromising convenience or affordability. We present an end-to-end framework that converts dietary standards into complete meals with minimal change. Using the What We Eat in America (WWEIA) intake data for 135,491 meals, we identify 34 interpretable meal archetypes that we then use to condition a generative model and a portion predictor to meet USDA nutritional targets. In comparisons within archetypes, generated meals are better at following recommended daily intake (RDI) targets by 47.0%, while remaining compositionally close to real meals. Our results show that by allowing one to three food substitutions, we were able to create meals that were 10% more nutritious, while reducing costs 19-32%, on average. By turning dietary guidelines into realistic, budget-aware meals and simple swaps, this framework can underpin clinical decision support, public-health programs, and consumer apps that deliver scalable, equitable improvements in everyday nutrition.
Chinese Translation
个性化饮食系统的一个重要目标是提高营养质量,而不牺牲便利性或经济性。我们提出了一个端到端的框架,将饮食标准转化为完整的餐膳,变化最小。利用美国饮食调查(What We Eat in America, WWEIA)对135,491餐膳的摄入数据,我们识别出34种可解释的餐膳原型,并利用这些原型来调节生成模型和份量预测器,以满足美国农业部(USDA)的营养目标。在原型内的比较中,生成的餐膳在遵循推荐每日摄入量(RDI)目标方面提高了47.0%,同时在成分上与真实餐膳保持接近。我们的结果表明,通过允许一到三种食物替换,我们能够创造出营养价值提高10%的餐膳,同时平均降低19-32%的成本。通过将饮食指南转化为现实的、考虑预算的餐膳和简单的替换,这一框架可以支持临床决策支持、公共卫生项目和消费者应用程序,从而在日常营养中实现可扩展和公平的改善。
cs.AI / 36 / 2602.13516

SPILLage: Agentic Oversharing on the Web

SPILLage:网络上的代理过度分享
Roh, Jaechul, Bagdasarian, Eugene, Haddadi, Hamed, Shamsabadi, Ali Shahin
Abstract
LLM-powered agents are beginning to automate user's tasks across the open web, often with access to user resources such as emails and calendars. Unlike standard LLMs answering questions in a controlled ChatBot setting, web agents act "in the wild", interacting with third parties and leaving behind an action trace. Therefore, we ask the question: how do web agents handle user resources when accomplishing tasks on their behalf across live websites? In this paper, we formalize Natural Agentic Oversharing -- the unintentional disclosure of task-irrelevant user information through an agent trace of actions on the web. We introduce SPILLage, a framework that characterizes oversharing along two dimensions: channel (content vs. behavior) and directness (explicit vs. implicit). This taxonomy reveals a critical blind spot: while prior work focuses on text leakage, web agents also overshare behaviorally through clicks, scrolls, and navigation patterns that can be monitored. We benchmark 180 tasks on live e-commerce sites with ground-truth annotations separating task-relevant from task-irrelevant attributes. Across 1,080 runs spanning two agentic frameworks and three backbone LLMs, we demonstrate that oversharing is pervasive with behavioral oversharing dominates content oversharing by 5x. This effect persists -- and can even worsen -- under prompt-level mitigation. However, removing task-irrelevant information before execution improves task success by up to 17.9%, demonstrating that reducing oversharing improves task success. Our findings underscore that protecting privacy in web agents is a fundamental challenge, requiring a broader view of "output" that accounts for what agents do on the web, not just what they type. Our datasets and code are available at https://github.com/jrohsc/SPILLage.
Chinese Translation
基于大型语言模型(LLM)的代理开始在开放网络上自动化用户的任务,通常可以访问用户的资源,如电子邮件和日历。与在受控的聊天机器人环境中回答问题的标准LLM不同,网络代理在“野外”环境中行动,与第三方互动并留下行动痕迹。因此,我们提出了一个问题:网络代理在代表用户完成任务时如何处理用户资源?在本文中,我们形式化了自然代理过度分享(Natural Agentic Oversharing)——通过网络上的代理行动痕迹无意中披露与任务无关的用户信息。我们引入了SPILLage,一个框架,通过两个维度(渠道(内容与行为)和直接性(显性与隐性))来表征过度分享。这一分类法揭示了一个关键盲点:尽管以往的研究集中于文本泄露,但网络代理也通过点击、滚动和导航模式等行为进行过度分享,这些行为是可以被监测的。我们在实时电子商务网站上基准测试了180个任务,并通过真实标注将任务相关属性与任务无关属性分开。在跨越两个代理框架和三个基础LLM的1,080次运行中,我们展示了过度分享的普遍性,行为过度分享比内容过度分享多出5倍。这一效应持续存在——甚至可能在提示级别的缓解下恶化。然而,在执行前移除与任务无关的信息可将任务成功率提高多达17.9%,表明减少过度分享可以提高任务成功率。我们的研究结果强调,在网络代理中保护隐私是一项基本挑战,需要对“输出”有更广泛的理解,考虑代理在网络上所做的事情,而不仅仅是他们所输入的内容。我们的数据集和代码可在 https://github.com/jrohsc/SPILLage 获取。
cs.AI / 37 / 2602.13530

REMem: Reasoning with Episodic Memory in Language Agent

REMem:在语言智能体中利用情节记忆进行推理
Shu, Yiheng, Jonnalagedda, Saisri Padmaja, Gao, Xiang, Gutiérrez, Bernal Jiménez, Qi, Weijian, Das, Kamalika, Sun, Huan, Su, Yu
Abstract
Humans excel at remembering concrete experiences along spatiotemporal contexts and performing reasoning across those events, i.e., the capacity for episodic memory. In contrast, memory in language agents remains mainly semantic, and current agents are not yet capable of effectively recollecting and reasoning over interaction histories. We identify and formalize the core challenges of episodic recollection and reasoning from this gap, and observe that existing work often overlooks episodicity, lacks explicit event modeling, or overemphasizes simple retrieval rather than complex reasoning. We present REMem, a two-phase framework for constructing and reasoning with episodic memory: 1) Offline indexing, where REMem converts experiences into a hybrid memory graph that flexibly links time-aware gists and facts. 2) Online inference, where REMem employs an agentic retriever with carefully curated tools for iterative retrieval over the memory graph. Comprehensive evaluation across four episodic memory benchmarks shows that REMem substantially outperforms state-of-the-art memory systems such as Mem0 and HippoRAG 2, showing 3.4% and 13.4% absolute improvements on episodic recollection and reasoning tasks, respectively. Moreover, REMem also demonstrates more robust refusal behavior for unanswerable questions.
Chinese Translation
人类擅长在时空背景下记忆具体经历,并在这些事件之间进行推理,即具备情节记忆的能力。相比之下,语言智能体的记忆主要是语义性的,目前的智能体尚未能够有效地回忆和推理交互历史。我们识别并形式化了这一差距中情节回忆和推理的核心挑战,并观察到现有工作往往忽视情节性,缺乏明确的事件建模,或过度强调简单检索而非复杂推理。我们提出了REMem,一个用于构建和利用情节记忆的两阶段框架:1)离线索引,REMem将经历转换为一个混合记忆图,灵活地链接时间感知的要点和事实;2)在线推理,REMem利用一个智能检索器,配备精心策划的工具进行记忆图的迭代检索。对四个情节记忆基准的全面评估表明,REMem在情节回忆和推理任务上分别比最先进的记忆系统如Mem0和HippoRAG 2有3.4%和13.4%的绝对提升。此外,REMem在无法回答的问题上也表现出更强的拒绝行为。
cs.AI / 38 / 2602.13559

OpAgent: Operator Agent for Web Navigation

OpAgent:用于网页导航的操作代理
Guo, Yuyu, Yang, Wenjie, Yang, Siyuan, Liu, Ziyang, Chen, Cheng, Wei, Yuan, Hu, Yun, Huang, Yang, Hao, Guoliang, Yuan, Dongsheng, Wang, Jianming, Chen, Xin, Yu, Hang, Lei, Lei, Di, Peng
Abstract
To fulfill user instructions, autonomous web agents must contend with the inherent complexity and volatile nature of real-world websites. Conventional paradigms predominantly rely on Supervised Fine-Tuning (SFT) or Offline Reinforcement Learning (RL) using static datasets. However, these methods suffer from severe distributional shifts, as offline trajectories fail to capture the stochastic state transitions and real-time feedback of unconstrained wide web environments. In this paper, we propose a robust Online Reinforcement Learning WebAgent, designed to optimize its policy through direct, iterative interactions with unconstrained wide websites. Our approach comprises three core innovations: 1) Hierarchical Multi-Task Fine-tuning: We curate a comprehensive mixture of datasets categorized by functional primitives -- Planning, Acting, and Grounding -- establishing a Vision-Language Model (VLM) with strong instruction-following capabilities for Web GUI tasks. 2) Online Agentic RL in the Wild: We develop an online interaction environment and fine-tune the VLM using a specialized RL pipeline. We introduce a Hybrid Reward Mechanism that combines a ground-truth-agnostic WebJudge for holistic outcome assessment with a Rule-based Decision Tree (RDT) for progress reward. This system effectively mitigates the credit assignment challenge in long-horizon navigation. Notably, our RL-enhanced model achieves a 38.1\% success rate (pass@5) on WebArena, outperforming all existing monolithic baselines. 3) Operator Agent: We introduce a modular agentic framework, namely \textbf{OpAgent}, orchestrating a Planner, Grounder, Reflector, and Summarizer. This synergy enables robust error recovery and self-correction, elevating the agent's performance to a new State-of-the-Art (SOTA) success rate of \textbf{71.6\%}.
Chinese Translation
为了满足用户指令,自主网页代理必须应对现实世界网站固有的复杂性和动态性。传统范式主要依赖于监督微调(Supervised Fine-Tuning, SFT)或使用静态数据集的离线强化学习(Offline Reinforcement Learning, RL)。然而,这些方法受到严重的分布转移影响,因为离线轨迹无法捕捉到无约束广域网络环境中的随机状态转移和实时反馈。本文提出了一种强大的在线强化学习网页代理(WebAgent),旨在通过与无约束广域网站的直接迭代交互来优化其策略。我们的方法包括三个核心创新:1)层次多任务微调:我们策划了一种全面的数据集混合,按功能原语分类——规划(Planning)、行动(Acting)和基础(Grounding),建立了一个具有强指令跟随能力的视觉语言模型(Vision-Language Model, VLM)以处理网页图形用户界面(Web GUI)任务。2)野外在线代理强化学习:我们开发了一个在线交互环境,并使用专门的强化学习管道对VLM进行微调。我们引入了一种混合奖励机制,将一个与真实情况无关的WebJudge用于整体结果评估,与基于规则的决策树(Rule-based Decision Tree, RDT)结合用于进度奖励。该系统有效缓解了长时间导航中的信用分配挑战。值得注意的是,我们的强化学习增强模型在WebArena上达到了38.1%的成功率(pass@5),超越了所有现有的单一基线。3)操作代理:我们引入了一种模块化的代理框架,即 extbf{OpAgent},协调规划者、基础者、反思者和总结者。这种协同作用使得错误恢复和自我修正能力显著提升,将代理的表现提升至新的最先进成功率(State-of-the-Art, SOTA)71.6%。
cs.AI / 39 / 2602.13568

Who Do LLMs Trust? Human Experts Matter More Than Other LLMs

大型语言模型信任谁?人类专家比其他大型语言模型更重要
Bajaj, Anooshka, Tiganj, Zoran
Abstract
Large language models (LLMs) increasingly operate in environments where they encounter social information such as other agents' answers, tool outputs, or human recommendations. In humans, such inputs influence judgments in ways that depend on the source's credibility and the strength of consensus. This paper investigates whether LLMs exhibit analogous patterns of influence and whether they privilege feedback from humans over feedback from other LLMs. Across three binary decision-making tasks, reading comprehension, multi-step reasoning, and moral judgment, we present four instruction-tuned LLMs with prior responses attributed either to friends, to human experts, or to other LLMs. We manipulate whether the group is correct and vary the group size. In a second experiment, we introduce direct disagreement between a single human and a single LLM. Across tasks, models conform significantly more to responses labeled as coming from human experts, including when that signal is incorrect, and revise their answers toward experts more readily than toward other LLMs. These results reveal that expert framing acts as a strong prior for contemporary LLMs, suggesting a form of credibility-sensitive social influence that generalizes across decision domains.
Chinese Translation
大型语言模型(LLMs)越来越多地在遇到社会信息的环境中运作,例如其他智能体的回答、工具输出或人类推荐。在人类中,这些输入会以依赖于来源可信度和共识强度的方式影响判断。本文探讨了大型语言模型是否表现出类似的影响模式,以及它们是否更重视来自人类的反馈而非其他大型语言模型的反馈。在三个二元决策任务中,包括阅读理解、多步推理和道德判断,我们向四个经过指令调优的大型语言模型呈现了先前的回答,这些回答分别归因于朋友、人类专家或其他大型语言模型。我们操控了群体的正确性,并改变了群体规模。在第二个实验中,我们引入了一名人类与一名大型语言模型之间的直接分歧。在各个任务中,模型对标记为来自人类专家的回答的符合度显著更高,包括在该信号不正确的情况下,并且它们更倾向于向专家修正答案,而不是向其他大型语言模型修正。这些结果揭示了专家框架对当代大型语言模型起到了强先验的作用,暗示了一种在决策领域普遍适用的、对可信度敏感的社会影响形式。
cs.AI / 40 / 2602.13583

Differentiable Rule Induction from Raw Sequence Inputs

从原始序列输入中可微分的规则诱导
Gao, Kun, Inoue, Katsumi, Cao, Yongzhi, Wang, Hanpin, Yang, Feng
Abstract
Rule learning-based models are widely used in highly interpretable scenarios due to their transparent structures. Inductive logic programming (ILP), a form of machine learning, induces rules from facts while maintaining interpretability. Differentiable ILP models enhance this process by leveraging neural networks to improve robustness and scalability. However, most differentiable ILP methods rely on symbolic datasets, facing challenges when learning directly from raw data. Specifically, they struggle with explicit label leakage: The inability to map continuous inputs to symbolic variables without explicit supervision of input feature labels. In this work, we address this issue by integrating a self-supervised differentiable clustering model with a novel differentiable ILP model, enabling rule learning from raw data without explicit label leakage. The learned rules effectively describe raw data through its features. We demonstrate that our method intuitively and precisely learns generalized rules from time series and image data.
Chinese Translation
基于规则学习的模型因其透明的结构而广泛应用于高度可解释的场景。归纳逻辑编程(Inductive Logic Programming, ILP)作为一种机器学习形式,从事实中诱导规则,同时保持可解释性。可微分的 ILP 模型通过利用神经网络增强了这一过程,提高了鲁棒性和可扩展性。然而,大多数可微分的 ILP 方法依赖于符号数据集,在直接从原始数据中学习时面临挑战。具体而言,它们在显式标签泄漏方面存在困难:无法在没有输入特征标签显式监督的情况下将连续输入映射到符号变量。在本研究中,我们通过将自监督的可微分聚类模型与新颖的可微分 ILP 模型相结合,解决了这一问题,使得能够从原始数据中进行规则学习而不发生显式标签泄漏。所学习的规则通过其特征有效地描述了原始数据。我们展示了我们的方法能够直观且精确地从时间序列和图像数据中学习到广义规则。
cs.AI / 41 / 2602.13587

A First Proof Sprint

首次证明冲刺
Corneli, Joseph
Abstract
This monograph reports a multi-agent proof sprint on ten research-level problems, combining rapid draft generation with adversarial verification, targeted repair, and explicit provenance. The workflow uses wiring-diagram decompositions of claim dependencies to localize gaps and coordinate reviewer-driven revisions. Final outcomes are heterogeneous but explicit: the manuscript distinguishes mathematical status from QC-validation status. Mathematically, Problem~3 has a validation-complete existence path under the scoped criterion used here (uniqueness/irreducibility treated as optional), Problem 5 is solved in a scope-limited form for $F_O$-local connective spectra, Problem 10 is conditional under clearly stated assumptions (with explicit necessity counterexamples when assumptions are dropped), and Problems 4 and 6 are partial with named remaining obligations in the general case (including an unconditional $K_n$ result for Problem 6 with $c_0 = 1/3$). Problem 7 is treated as provisionally closed via the rotation-route theorem chain, pending independent ledger re-check. At the QC layer, Problems~7 and~9 have node-level validation artifacts but still contain unresolved verifier gaps. The main methodological result is that structure-aware verification and layer-switching strategies improve reliability and calibration in compressed proof sprints.
Chinese Translation
本专著报告了一次针对十个研究级问题的多智能体证明冲刺,结合了快速草稿生成、对抗性验证、针对性修复和明确的来源追溯。工作流程利用声明依赖关系的接线图分解来定位缺口并协调审稿人驱动的修订。最终结果虽各异但明确:手稿区分了数学状态与质量控制(QC)验证状态。在数学上,问题3在此处使用的范围标准下具有验证完整的存在路径(唯一性/不可约性视为可选),问题5在$F_O$-局部连接谱的范围限制形式下得到解决,问题10在明确陈述的假设下是有条件的(当假设被放弃时提供明确必要的反例),而问题4和6在一般情况下是部分解决的,并列出了剩余义务(包括问题6在$c_0 = 1/3$下的无条件$K_n$结果)。问题7通过旋转路线定理链暂时视为关闭,待独立账本重新检查。在质量控制层面,问题7和9具有节点级验证工件,但仍存在未解决的验证者缺口。主要的方法论结果是,结构感知验证和层切换策略提高了压缩证明冲刺中的可靠性和校准性。
cs.AI / 42 / 2602.13594

Hippocampus: An Efficient and Scalable Memory Module for Agentic AI

海马体:一种高效且可扩展的智能体人工智能记忆模块
Li, Yi, Cao, Lianjie, Ahmed, Faraz, Sharma, Puneet, Li, Bingzhe
Abstract
Agentic AI require persistent memory to store user-specific histories beyond the limited context window of LLMs. Existing memory systems use dense vector databases or knowledge-graph traversal (or hybrid), incurring high retrieval latency and poor storage scalability. We introduce Hippocampus, an agentic memory management system that uses compact binary signatures for semantic search and lossless token-ID streams for exact content reconstruction. Its core is a Dynamic Wavelet Matrix (DWM) that compresses and co-indexes both streams to support ultra-fast search in the compressed domain, thus avoiding costly dense-vector or graph computations. This design scales linearly with memory size, making it suitable for long-horizon agentic deployments. Empirically, our evaluation shows that Hippocampus reduces end-to-end retrieval latency by up to 31$\times$ and cuts per-query token footprint by up to 14$\times$, while maintaining accuracy on both LoCoMo and LongMemEval benchmarks.
Chinese Translation
智能体人工智能需要持久的记忆,以存储超出大型语言模型(LLMs)有限上下文窗口的用户特定历史。现有的记忆系统使用密集向量数据库或知识图谱遍历(或两者结合),导致高检索延迟和较差的存储可扩展性。我们提出了海马体(Hippocampus),一种智能体记忆管理系统,采用紧凑的二进制签名进行语义搜索,并使用无损的令牌ID流进行精确内容重构。其核心是动态小波矩阵(Dynamic Wavelet Matrix, DWM),能够压缩并共同索引这两种流,以支持在压缩域中的超快速搜索,从而避免昂贵的密集向量或图计算。这一设计随着内存大小线性扩展,使其适合于长期智能体部署。实证评估表明,海马体将端到端检索延迟减少了多达31倍,并将每查询的令牌占用减少了多达14倍,同时在LoCoMo和LongMemEval基准测试中保持了准确性。
cs.AI / 43 / 2602.13595

The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

量化陷阱:打破多跳推理中的线性缩放法则
Han, Henry, Liu, Xiyang, Wang, Xiaodong, Han, Fei, Li, Xiaodong
Abstract
Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile (E proportional to bits). In this paper, we demonstrate that this scaling law breaks in the context of multi-hop reasoning. We reveal a 'quantization trap' where reducing precision from 16-bit to 8/4-bit paradoxically increases more net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to hardware casting overhead, the hidden latency cost of dequantization kernels, which becomes a dominant bottleneck in sequential reasoning chains, as well as to a sequential energy amortization failure. As a result, scaling law breaking is unavoidable in practice. Our findings suggest that the industry's "smaller-is-better" heuristic is mathematically counterproductive for complex reasoning tasks.
Chinese Translation
神经缩放法则为人工智能的进步提供了一种可预测的方案:降低数值精度应线性提高计算效率和能量特性(能量与位数成正比)。在本文中,我们展示了这一缩放法则在多跳推理的背景下是如何失效的。我们揭示了一个“量化陷阱”,在这个陷阱中,将精度从16位降低到8位或4位,反而会使净能耗增加,同时推理准确性下降。我们提供了一个严格的理论分解,将这一失败归因于硬件转换开销、去量化内核的隐性延迟成本,这在顺序推理链中成为主要瓶颈,以及顺序能量摊销失败。因此,在实践中打破缩放法则是不可避免的。我们的研究结果表明,行业内的“越小越好”启发式在复杂推理任务中在数学上是适得其反的。
cs.AI / 44 / 2602.13616

DiffusionRollout: Uncertainty-Aware Rollout Planning in Long-Horizon PDE Solving

DiffusionRollout:基于不确定性的长时间范围偏微分方程求解的回滚规划
Yoo, Seungwoo, Koo, Juil, Choi, Daehyeon, Sung, Minhyuk
Abstract
We propose DiffusionRollout, a novel selective rollout planning strategy for autoregressive diffusion models, aimed at mitigating error accumulation in long-horizon predictions of physical systems governed by partial differential equations (PDEs). Building on the recently validated probabilistic approach to PDE solving, we further explore its ability to quantify predictive uncertainty and demonstrate a strong correlation between prediction errors and standard deviations computed over multiple samples-supporting their use as a proxy for the model's predictive confidence. Based on this observation, we introduce a mechanism that adaptively selects step sizes during autoregressive rollouts, improving long-term prediction reliability by reducing the compounding effect of conditioning on inaccurate prior outputs. Extensive evaluation on long-trajectory PDE prediction benchmarks validates the effectiveness of the proposed uncertainty measure and adaptive planning strategy, as evidenced by lower prediction errors and longer predicted trajectories that retain a high correlation with their ground truths.
Chinese Translation
我们提出了DiffusionRollout,一种新颖的自回归扩散模型选择性回滚规划策略,旨在减轻由偏微分方程(PDE)控制的物理系统在长时间范围预测中的误差积累。基于最近验证的PDE求解的概率方法,我们进一步探讨了其量化预测不确定性的能力,并展示了预测误差与多个样本计算的标准差之间的强相关性,这支持了将其作为模型预测置信度的代理。基于这一观察,我们引入了一种机制,在自回归回滚过程中自适应选择步长,通过减少对不准确先前输出的条件化的累积效应,从而提高长期预测的可靠性。在长轨迹PDE预测基准上的广泛评估验证了所提出的不确定性度量和自适应规划策略的有效性,表现为更低的预测误差和与真实值保持高度相关的更长预测轨迹。
cs.AI / 45 / 2602.13639

Guided Collaboration in Heterogeneous LLM-Based Multi-Agent Systems via Entropy-Based Understanding Assessment and Experience Retrieval

基于熵的理解评估与经验检索在异构大语言模型多智能体系统中的引导协作
Wang, Linlin, Zhu, Tianqing, Qin, Laiqiao, Gao, Longxiang, Zhou, Wanlei
Abstract
With recent breakthroughs in large language models (LLMs) for reasoning, planning, and complex task generation, artificial intelligence systems are transitioning from isolated single-agent architectures to multi-agent systems with collaborative intelligence. However, in heterogeneous multi-agent systems (HMAS), capability differences among agents give rise to consistent cognitive problems, where strong and weak models fail to contribute effectively. We define the collaboration as a strong-weak system. Through comprehensive experiments, we disclose a counterintuitive phenomenon in the strong-weak system: a strong-weak collaboration may under-perform weak-weak combinations, revealing that cognitive mismatching are key bottlenecks limiting heterogeneous cooperation. To overcome these challenges, we propose an Entropy-Based Adaptive Guidance Framework that dynamically aligns the guidance with the cognitive state of each agent. The framework quantifies the understanding of weak agents through multi-dimensional entropy metrics - covering expression, uncertainty, structure, coherence, and relevance - and adaptively adjusts the intensity of the guidance at light, moderate and intensive levels. Furthermore, a Retrieval-Augmented Generation (RAG) mechanism is incorporated to retain successful collaboration experiences, enabling both immediate adaptation and long-term learning. Extensive experiments on three benchmark datasets, GSM8K, MBPP, and CVRP demonstrate that our approach consistently enhances the effectiveness and stability of heterogeneous collaboration. The results highlight that adaptive guidance not only mitigates cognitive imbalance but also establishes a scalable pathway toward more robust, cooperative multi-agent intelligence.
Chinese Translation
随着在推理、规划和复杂任务生成方面的大语言模型(LLMs)取得的最新突破,人工智能系统正在从孤立的单智能体架构转向具有协作智能的多智能体系统。然而,在异构多智能体系统(HMAS)中,智能体之间的能力差异导致了一系列认知问题,其中强模型和弱模型未能有效贡献。我们将协作定义为强-弱系统。通过全面的实验,我们揭示了强-弱系统中的一个反直觉现象:强-弱协作可能表现不如弱-弱组合,这表明认知不匹配是限制异构合作的关键瓶颈。为了解决这些挑战,我们提出了一种基于熵的自适应引导框架,该框架动态地将引导与每个智能体的认知状态对齐。该框架通过多维熵度量量化弱智能体的理解——涵盖表达、确定性、结构、一致性和相关性——并在轻度、中度和强度水平上自适应地调整引导强度。此外,框架中还结合了检索增强生成(RAG)机制,以保留成功的协作经验,从而实现即时适应和长期学习。在三个基准数据集GSM8K、MBPP和CVRP上的广泛实验表明,我们的方法始终增强了异构协作的有效性和稳定性。结果强调,自适应引导不仅缓解了认知失衡,还为更强大、合作的多智能体智能建立了可扩展的路径。
cs.AI / 46 / 2602.13653

Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization

通过Agentic-Q估计和逐步策略优化构建自主图形用户界面导航
Wang, Yibo, Huzhang, Guangda, Hu, Yuwei, Xia, Yu, Lu, Shiyin, Chen, Qing-Guo, Xu, Zhao, Luo, Weihua, Zhang, Kaifu, Zhang, Lijun
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interface (GUI). Nevertheless, in real-world applications, GUI agents are often faced with non-stationary environments, leading to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents, which consists of two components: agentic-Q estimation and step-wise policy optimization. The former one aims to optimize a Q-model that can generate step-wise values to evaluate the contribution of a given action to task completion. The latter one takes step-wise samples from the state-action trajectory as inputs, and optimizes the policy via reinforcement learning with our agentic-Q model. It should be noticed that (i) all state-action trajectories are produced by the policy itself, so that the data collection costs are manageable; (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving remarkable performances on GUI navigation and grounding benchmarks and even surpassing contenders with larger scales.
Chinese Translation
近年来,多模态大型语言模型(MLLMs)的进展显著推动了图形用户界面(GUI)自主代理的发展。然而,在实际应用中,GUI代理常常面临非平稳环境,导致数据整理和策略优化的高计算成本。在本报告中,我们介绍了一种以MLLM为中心的GUI代理新框架,该框架由两个组件组成:Agentic-Q估计和逐步策略优化。前者旨在优化一个Q模型,该模型能够生成逐步值,以评估给定动作对任务完成的贡献。后者则将状态-动作轨迹中的逐步样本作为输入,通过强化学习与我们的Agentic-Q模型优化策略。需要注意的是:(i)所有状态-动作轨迹均由策略自身生成,因此数据收集成本可控;(ii)策略更新与环境解耦,确保了稳定和高效的优化。实证评估表明,我们的框架赋予Ovis2.5-9B强大的GUI交互能力,在GUI导航和基础基准测试中取得了显著的表现,甚至超越了规模更大的竞争者。
cs.AI / 47 / 2602.13665

HyFunc: Accelerating LLM-based Function Calls for Agentic AI through Hybrid-Model Cascade and Dynamic Templating

HyFunc:通过混合模型级联和动态模板加速基于LLM的代理人工智能函数调用
Liao, Weibin, Lou, Jian-guang, Xiong, Haoyi
Abstract
While agentic AI systems rely on LLMs to translate user intent into structured function calls, this process is fraught with computational redundancy, leading to high inference latency that hinders real-time applications. This paper identifies and addresses three key redundancies: (1) the redundant processing of a large library of function descriptions for every request; (2) the redundant use of a large, slow model to generate an entire, often predictable, token sequence; and (3) the redundant generation of fixed, boilerplate parameter syntax. We introduce HyFunc, a novel framework that systematically eliminates these inefficiencies. HyFunc employs a hybrid-model cascade where a large model distills user intent into a single "soft token." This token guides a lightweight retriever to select relevant functions and directs a smaller, prefix-tuned model to generate the final call, thus avoiding redundant context processing and full-sequence generation by the large model. To eliminate syntactic redundancy, our "dynamic templating" technique injects boilerplate parameter syntax on-the-fly within an extended vLLM engine. To avoid potential limitations in generalization, we evaluate HyFunc on an unseen benchmark dataset, BFCL. Experimental results demonstrate that HyFunc achieves an excellent balance between efficiency and performance. It achieves an inference latency of 0.828 seconds, outperforming all baseline models, and reaches a performance of 80.1%, surpassing all models with a comparable parameter scale. These results suggest that HyFunc offers a more efficient paradigm for agentic AI. Our code is publicly available at https://github.com/MrBlankness/HyFunc.
Chinese Translation
代理人工智能系统依赖于大型语言模型(LLMs)将用户意图转化为结构化的函数调用,但这一过程存在计算冗余,导致高推理延迟,妨碍实时应用。本文识别并解决了三个关键冗余问题:(1)对每个请求重复处理大量函数描述库;(2)重复使用大型、缓慢的模型生成整个、通常可预测的令牌序列;(3)重复生成固定的、模板化的参数语法。我们提出了HyFunc,一个系统性消除这些低效的创新框架。HyFunc采用混合模型级联,其中大型模型将用户意图提炼为单个“软令牌”。该令牌指导轻量级检索器选择相关函数,并指引较小的、经过前缀调优的模型生成最终调用,从而避免大型模型的冗余上下文处理和完整序列生成。为了消除语法冗余,我们的“动态模板”技术在扩展的vLLM引擎中即时注入模板化的参数语法。为了避免在泛化方面的潜在限制,我们在一个未见的基准数据集BFCL上评估了HyFunc。实验结果表明,HyFunc在效率和性能之间达到了优良的平衡。其推理延迟为0.828秒,优于所有基线模型,性能达到80.1%,超越所有具有可比参数规模的模型。这些结果表明,HyFunc为代理人工智能提供了一种更高效的范式。我们的代码已公开发布在https://github.com/MrBlankness/HyFunc。
cs.AI / 48 / 2602.13680

AllMem: A Memory-centric Recipe for Efficient Long-context Modeling

AllMem:一种以内存为中心的高效长上下文建模方案
Wang, Ziming, Wang, Xiang, Peng, Kailong, Qin, Lang, Kostelec, Juan Gabriel, Sourmpis, Christos, Laborieux, Axel, Guo, Qinghai
Abstract
Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks due to the computational complexity and memory overhead inherent in the self-attention mechanism. To address these challenges, we introduce \textsc{AllMem}, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks. \textsc{AllMem} enables models to effectively scale to ultra-long contexts while mitigating catastrophic forgetting. This approach not only overcomes the representation constraints typical of linear memory models but also significantly reduces the computational and memory footprint during long-sequence inference. Furthermore, we implement a Memory-Efficient Fine-Tuning strategy to replace standard attention layers in pre-trained models with memory-augmented sliding window layers. This framework facilitates the efficient transformation of any off-the-shelf pre-trained LLM into an \textsc{AllMem}-based architecture. Empirical evaluations confirm that our 4k window model achieves near-lossless performance on 37k LongBench with a marginal 0.83 drop compared to full attention. Furthermore, on InfiniteBench at a 128k context, our 8k window variant outperforms full attention, which validates the effectiveness of our parameterized memory in mitigating noise and maintaining robust long-range modeling without the prohibitive costs of global attention.
Chinese Translation
大型语言模型(LLMs)在长序列任务中面临显著的性能瓶颈,这主要源于自注意力机制固有的计算复杂性和内存开销。为了解决这些挑战,我们提出了 extsc{AllMem},一种新颖且高效的混合架构,将滑动窗口注意力(Sliding Window Attention, SWA)与非线性测试时训练(Test-Time Training, TTT)内存网络相结合。 extsc{AllMem}使得模型能够有效扩展到超长上下文,同时减轻灾难性遗忘。这种方法不仅克服了线性内存模型典型的表示限制,还显著降低了长序列推理过程中的计算和内存占用。此外,我们实施了一种内存高效微调策略,将预训练模型中的标准注意力层替换为增强内存的滑动窗口层。该框架便于将任何现成的预训练LLM高效转换为基于 extsc{AllMem}的架构。实证评估确认我们的4k窗口模型在37k LongBench上实现了近乎无损的性能,相较于全注意力仅有0.83的微小下降。此外,在128k上下文的InfiniteBench上,我们的8k窗口变体超越了全注意力,这验证了我们参数化内存在减轻噪声和保持强健的长距离建模方面的有效性,而无需全球注意力的高昂成本。
cs.AI / 49 / 2602.13691

PhGPO: Pheromone-Guided Policy Optimization for Long-Horizon Tool Planning

PhGPO:基于信息素的长期工具规划策略优化
Li, Yu, Cai, Guangfeng, Yang, Shengtian, Luo, Han, Han, Shuo, He, Xu, Li, Dong, Feng, Lei
Abstract
Recent advancements in Large Language Model (LLM) agents have demonstrated strong capabilities in executing complex tasks through tool use. However, long-horizon multi-step tool planning is challenging, because the exploration space suffers from a combinatorial explosion. In this scenario, even when a correct tool-use path is found, it is usually considered an immediate reward for current training, which would not provide any reusable information for subsequent training. In this paper, we argue that historically successful trajectories contain reusable tool-transition patterns, which can be leveraged throughout the whole training process. Inspired by ant colony optimization where historically successful paths can be reflected by the pheromone, we propose Pheromone-Guided Policy Optimization (PhGPO), which learns a trajectory-based transition pattern (i.e., pheromone) from historical trajectories and then uses the learned pheromone to guide policy optimization. This learned pheromone provides explicit and reusable guidance that steers policy optimization toward historically successful tool transitions, thereby improving long-horizon tool planning. Comprehensive experimental results demonstrate the effectiveness of our proposed PhGPO.
Chinese Translation
近期在大型语言模型(LLM)代理方面的进展显示出其在通过工具使用执行复杂任务方面的强大能力。然而,长期多步骤工具规划面临挑战,因为探索空间受到组合爆炸的影响。在这种情况下,即使找到了一条正确的工具使用路径,通常也被视为当前训练的即时奖励,这并不会为后续训练提供任何可重用的信息。本文认为,历史上成功的轨迹包含可重用的工具转换模式,这些模式可以在整个训练过程中加以利用。受到蚁群优化的启发,历史上成功的路径可以通过信息素进行反映,我们提出了基于信息素的策略优化(PhGPO),该方法从历史轨迹中学习基于轨迹的转换模式(即信息素),然后利用学习到的信息素来指导策略优化。这种学习到的信息素提供了明确且可重用的指导,能够引导策略优化朝向历史上成功的工具转换,从而改善长期工具规划。全面的实验结果证明了我们提出的PhGPO的有效性。
cs.AI / 50 / 2602.13695

Can a Lightweight Automated AI Pipeline Solve Research-Level Mathematical Problems?

轻量级自动化人工智能管道能解决研究级数学问题吗?
Meng, Lve, Zhao, Weilong, Zhang, Yanzhi, Guan, Haoxiang, He, Jiyan
Abstract
Large language models (LLMs) have recently achieved remarkable success in generating rigorous mathematical proofs, with "AI for Math" emerging as a vibrant field of research. While these models have mastered competition-level benchmarks like the International Mathematical Olympiad and show promise in research applications through auto-formalization, their deployment via lightweight, natural-language pipelines for research problems remains underexplored. In this work, we demonstrate that next-generation models (e.g., Gemini 3 Pro, GPT-5.2 Pro), when integrated into a streamlined automated pipeline optimized for citation-based verification, can solve sophisticated research-grade problems. We evaluate our pipeline on two novel datasets: (1) the ICCM problem sets (comparable to the S.-T. Yau College Student Mathematics Contest) proposed by leading mathematicians, and (2) the "First Proof" problem set, consisting of previously unpublished research questions. Our pipeline generated candidate proofs for all problems in the first two ICCM sets and the "First Proof" set. The solutions for the first two ICCM sets and Problem 4 of the "First Proof" set have been fully verified by our team. All generated proofs have been submitted to the official organization, and our generated results are publicly available. We plan to open-source the complete pipeline methodology in due course.
Chinese Translation
大型语言模型(LLMs)最近在生成严谨的数学证明方面取得了显著成功,"数学中的人工智能"(AI for Math)作为一个充满活力的研究领域应运而生。尽管这些模型已经掌握了国际数学奥林匹克等竞争级基准,并在通过自动形式化进行研究应用方面展现出潜力,但通过轻量级自然语言管道解决研究问题的部署仍然未被充分探索。在本研究中,我们展示了下一代模型(如 Gemini 3 Pro、GPT-5.2 Pro)在集成到一个针对基于引用验证优化的简化自动化管道时,能够解决复杂的研究级问题。我们在两个新数据集上评估了我们的管道:(1)由顶尖数学家提出的 ICCM 问题集(可与 S.-T. Yau 大学生数学竞赛相媲美),以及(2)由未发表的研究问题组成的“首次证明”(First Proof)问题集。我们的管道为前两个 ICCM 集和“首次证明”集中的所有问题生成了候选证明。前两个 ICCM 集和“首次证明”集的第 4 个问题的解决方案已由我们的团队完全验证。所有生成的证明已提交给官方组织,我们生成的结果也已公开。我们计划在适当的时候开源完整的管道方法论。
cs.AI / 51 / 2602.13697

No Need to Train Your RDB Foundation Model

无需训练您的关系数据库基础模型
Xu, Linjie, Zhang, Yanlin, Gan, Quan, Wang, Minjie, Wipf, David
Abstract
Relational databases (RDBs) contain vast amounts of heterogeneous tabular information that can be exploited for predictive modeling purposes. But since the space of potential targets is vast across enterprise settings, how can we \textit{avoid retraining} a new model each time we wish to predict a new quantity of interest? Foundation models based on in-context learning (ICL) offer a convenient option, but so far are largely restricted to single-table operability. In generalizing to multiple interrelated tables, it is essential to compress variably-sized RDB neighborhoods into fixed-length ICL samples for consumption by the decoder. However, the details here are critical: unlike existing supervised learning RDB pipelines, we provide theoretical and empirical evidence that ICL-specific compression should be constrained \emph{within} high-dimensional RDB columns where all entities share units and roles, not \textit{across} columns where the relevance of heterogeneous data types cannot possibly be determined without label information. Conditioned on this restriction, we then demonstrate that encoder expressiveness is actually not compromised by excluding trainable parameters. Hence we arrive at a principled family of RDB encoders that can be seamlessly paired with already-existing single-table ICL foundation models, whereby no training or fine-tuning is required. From a practical standpoint, we develop scalable SQL primitives to implement the encoder stage, resulting in an easy-to-use open-source RDB foundation model\footnote{\label{foot: RDBLearn_learn} https://github.com/HKUSHXLab/rdblearn} capable of robust performance on unseen datasets out of the box.
Chinese Translation
关系数据库(RDB)包含大量异构的表格信息,这些信息可以用于预测建模。然而,由于在企业环境中潜在目标的范围非常广泛,我们如何才能在每次希望预测新的关注量时 extit{避免重新训练}一个新模型呢?基于上下文学习(ICL)的基础模型提供了一个方便的选项,但到目前为止,这些模型在很大程度上仅限于单表操作。在推广到多个相互关联的表时,将可变大小的RDB邻域压缩为固定长度的ICL样本以供解码器使用是至关重要的。然而,这里的细节至关重要:与现有的监督学习RDB管道不同,我们提供了理论和实证证据,表明ICL特定的压缩应该限制在高维RDB列内,其中所有实体共享单位和角色,而不是在列之间进行压缩,因为在没有标签信息的情况下,无法确定异构数据类型的相关性。在这一限制条件下,我们展示了通过排除可训练参数,编码器的表达能力实际上并未受到损害。因此,我们得出了一类原则性的RDB编码器,可以与现有的单表ICL基础模型无缝配对,从而无需训练或微调。从实际角度来看,我们开发了可扩展的SQL原语来实现编码器阶段,最终形成了一个易于使用的开源RDB基础模型 ootnote{ extit{https://github.com/HKUSHXLab/rdblearn}},能够在未见过的数据集上实现稳健的性能。
cs.AI / 52 / 2602.13738

OneLatent: Single-Token Compression for Visual Latent Reasoning

OneLatent:用于视觉潜在推理的单标记压缩
Lv, Bo, Sun, Yasheng, Wang, Junjie, Shi, Haoxiang
Abstract
Chain-of-thought (CoT) prompting improves reasoning but often increases inference cost by one to two orders of magnitude. To address these challenges, we present \textbf{OneLatent}, a framework that compresses intermediate reasoning into a single latent token via supervision from rendered CoT images and DeepSeek-OCR hidden states. By rendering textual steps into images, we obtain a deterministic supervision signal that can be inspected and audited without requiring the model to output verbose textual rationales. Across benchmarks, OneLatent reduces average output length by $11\times$ with only a $2.21\%$ average accuracy drop relative to textual CoT, while improving output token contribution (OTC) by $6.8\times$. On long-chain logical reasoning, OneLatent reaches $99.80\%$ on ProntoQA and $97.80\%$ on ProsQA with one latent token, with compression up to $87.4\times$, supporting compression-constrained generalization.
Chinese Translation
链式思维(CoT)提示提高了推理能力,但通常会使推理成本增加一个到两个数量级。为了解决这些挑战,我们提出了 extbf{OneLatent},一个通过从渲染的CoT图像和DeepSeek-OCR隐藏状态的监督,将中间推理压缩为单个潜在标记的框架。通过将文本步骤渲染为图像,我们获得了一种确定性的监督信号,可以在不要求模型输出冗长文本推理的情况下进行检查和审计。在各项基准测试中,OneLatent将平均输出长度减少了$11 imes$,相对于文本CoT仅有$2.21\%$的平均准确率下降,同时将输出标记贡献(OTC)提高了$6.8 imes$。在长链逻辑推理中,OneLatent在ProntoQA上达到了$99.80\\%$,在ProsQA上达到了$97.80\\%$,仅使用一个潜在标记,压缩比高达$87.4 imes$,支持压缩约束下的泛化。
cs.AI / 53 / 2602.13769

OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery

OR-Agent:桥接进化搜索与结构化研究以实现自动化算法发现
Liu, Qi, Ma, Wanjing
Abstract
Automating scientific discovery in complex, experiment-driven domains requires more than iterative mutation of programs; it demands structured hypothesis management, environment interaction, and principled reflection. We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental environments. OR-Agent organizes research as a structured tree-based workflow that explicitly models branching hypothesis generation and systematic backtracking, enabling controlled management of research trajectories beyond simple mutation-crossover loops. At its core, we introduce an evolutionary-systematic ideation mechanism that unifies evolutionary selection of research starting points, comprehensive research plan generation, and coordinated exploration within a research tree. We further propose a hierarchical optimization-inspired reflection system: short-term experimental reflection operates as a form of verbal gradient providing immediate corrective signals; long-term reflection accumulates cross-experiment insights as verbal momentum; and memory compression serves as a regularization mechanism analogous to weight decay, preserving essential signals while mitigating drift. Together, these components form a principled architecture governing research dynamics. We conduct extensive experiments across classical combinatorial optimization benchmarks-including traveling salesman, capacitated vehicle routing, bin packing, orienteering, and multiple knapsack problems-as well as simulation-based cooperative driving scenarios. Results demonstrate that OR-Agent outperforms strong evolutionary baselines while providing a general, extensible, and inspectable framework for AI-assisted scientific discovery. OR-Agent source code and experiments data are publicly available at https://github.com/qiliuchn/OR-Agent.
Chinese Translation
在复杂的实验驱动领域中实现科学发现的自动化,不仅需要对程序进行迭代变异,还需要结构化的假设管理、环境交互和原则性反思。我们提出了OR-Agent,一个可配置的多智能体研究框架,旨在丰富实验环境中的自动化探索。OR-Agent将研究组织为一个结构化的基于树的工作流程,明确建模分支假设生成和系统回溯,使得研究轨迹的控制管理超越简单的变异-交叉循环。在其核心,我们引入了一种进化-系统化的构思机制,统一了研究起点的进化选择、全面的研究计划生成以及在研究树中的协调探索。我们进一步提出了一种受层次优化启发的反思系统:短期实验反思作为一种语言梯度,提供即时的纠正信号;长期反思则积累跨实验的见解,形成语言动量;而记忆压缩则作为一种正则化机制,类似于权重衰减,保留重要信号的同时减轻漂移。所有这些组件共同构成了一个原则性架构,支配着研究动态。我们在经典组合优化基准(包括旅行商问题、容量受限车辆路径问题、装箱问题、定向问题和多重背包问题)以及基于仿真的合作驾驶场景中进行了广泛的实验。结果表明,OR-Agent在强大的进化基线之上表现优越,同时提供了一个通用、可扩展和可检查的AI辅助科学发现框架。OR-Agent的源代码和实验数据已公开,地址为 https://github.com/qiliuchn/OR-Agent。
cs.AI / 54 / 2602.13792

StackingNet: Collective Inference Across Independent AI Foundation Models

StackingNet:跨独立人工智能基础模型的集体推理
Li, Siyang, Liu, Chenhao, Wu, Dongrui, Zeng, Zhigang, Ding, Lieyun
Abstract
Artificial intelligence built on large foundation models has transformed language understanding, vision and reasoning, yet these systems remain isolated and cannot readily share their capabilities. Integrating the complementary strengths of such independent foundation models is essential for building trustworthy intelligent systems. Despite rapid progress in individual model design, there is no established approach for coordinating such black-box heterogeneous models. Here we show that coordination can be achieved through a meta-ensemble framework termed StackingNet, which draws on principles of collective intelligence to combine model predictions during inference. StackingNet improves accuracy, reduces bias, enables reliability ranking, and identifies or prunes models that degrade performance, all operating without access to internal parameters or training data. Across tasks involving language comprehension, visual estimation, and academic paper rating, StackingNet consistently improves accuracy, robustness, and fairness, compared with individual models and classic ensembles. By turning diversity from a source of inconsistency into collaboration, StackingNet establishes a practical foundation for coordinated artificial intelligence, suggesting that progress may emerge from not only larger single models but also principled cooperation among many specialized ones.
Chinese Translation
基于大型基础模型构建的人工智能已经改变了语言理解、视觉和推理,但这些系统仍然是孤立的,无法轻易分享其能力。整合这些独立基础模型的互补优势对于构建可信赖的智能系统至关重要。尽管个别模型设计取得了快速进展,但尚无建立协调这些黑箱异构模型的有效方法。在此,我们展示了一种通过名为StackingNet的元集成框架实现协调的方法,该框架利用集体智能的原则在推理过程中结合模型预测。StackingNet提高了准确性,减少了偏见,实现了可靠性排名,并识别或剔除会降低性能的模型,所有操作均不依赖于内部参数或训练数据。在涉及语言理解、视觉估计和学术论文评分的任务中,与单个模型和经典集成相比,StackingNet始终提高了准确性、鲁棒性和公平性。通过将多样性从不一致的来源转变为协作,StackingNet为协调人工智能奠定了实用基础,表明进步不仅可能来自于更大的单一模型,也可能源于众多专业模型之间的原则性合作。
cs.AI / 55 / 2602.13804

Attention in Constant Time: Vashista Sparse Attention for Long-Context Decoding with Exponential Guarantees

恒定时间内的注意力:具有指数保证的 Vashista 稀疏注意力用于长上下文解码
Nobaub, Vashista
Abstract
Large language models spend most of their inference cost on attention over long contexts, yet empirical behavior suggests that only a small subset of tokens meaningfully contributes to each query. We formalize this phenomenon by modeling attention as a projection onto the convex hull of key vectors and analyzing its entropic (softmax-like) relaxation. Our main theoretical contribution is a face-stability theorem showing that, under a strict complementarity margin (a support gap (\Delta) certified by KKT multipliers), entropic attention concentrates on a constant-size active face: the total mass assigned to inactive tokens decays exponentially as (\exp(-\Omega(\Delta/\varepsilon))), while the error on the active face scales linearly in the temperature/regularization parameter (\varepsilon). This yields a practical criterion for when sparse long-context decoding is safe and provides a principled knob to trade accuracy for compute. Building on these guarantees, we introduce Vashista Sparse Attention, a drop-in mechanism that maintains a small candidate set per query through a paging-style context selection strategy compatible with modern inference stacks. Across long-context evaluations, we observe stable constant-size effective support, strong wall-clock speedups, and minimal quality degradation in the regimes predicted by the support-gap diagnostics. Finally, we discuss deployment implications for privacy-sensitive and air-gapped settings, where interchangeable attention modules enable predictable latency and cost without external retrieval dependencies.
Chinese Translation
大型语言模型在长上下文的推理成本中大部分时间花费在注意力机制上,然而经验行为表明,只有一小部分标记对每个查询有意义的贡献。我们通过将注意力建模为对关键向量的凸包的投影,并分析其熵(类似 softmax)的松弛,来形式化这一现象。我们的主要理论贡献是面稳定性定理,表明在严格互补边际(由 KKT 乘子认证的支持间隙 ())下,熵注意力集中在一个恒定大小的活跃面上:分配给非活跃标记的总质量以指数方式衰减为 ( ext{exp}(- ext{Ω}(/))),而活跃面的误差在温度/正则化参数 () 中线性缩放。这为稀疏长上下文解码的安全性提供了实用标准,并提供了一个原则性的调节器,以在准确性与计算之间进行权衡。在这些保证的基础上,我们引入了 Vashista 稀疏注意力,这是一种通过与现代推理堆栈兼容的分页式上下文选择策略,保持每个查询的小候选集的机制。在长上下文评估中,我们观察到稳定的恒定大小有效支持、显著的时钟速度提升,以及在支持间隙诊断预测的范围内质量降级最小。最后,我们讨论了在隐私敏感和隔离环境中的部署影响,其中可互换的注意力模块使得在没有外部检索依赖的情况下实现可预测的延迟和成本。
cs.AI / 56 / 2602.13808

An end-to-end agentic pipeline for smart contract translation and quality evaluation

智能合约翻译与质量评估的端到端代理管道
Goel, Abhinav, Shah, Chaitya, Capponi, Agostino, Gliozzo, Alfio
Abstract
We present an end-to-end framework for systematic evaluation of LLM-generated smart contracts from natural-language specifications. The system parses contractual text into structured schemas, generates Solidity code, and performs automated quality assessment through compilation and security checks. Using CrewAI-style agent teams with iterative refinement, the pipeline produces structured artifacts with full provenance metadata. Quality is measured across five dimensions, including functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality aggregated into composite scores. The framework supports paired evaluation against ground-truth implementations, quantifying alignment and identifying systematic error modes such as logic omissions and state transition inconsistencies. This provides a reproducible benchmark for empirical research on smart contract synthesis quality and supports extensions to formal verification and compliance checking.
Chinese Translation
我们提出了一个端到端框架,用于系统评估从自然语言规范生成的LLM(大语言模型)智能合约。该系统将合同文本解析为结构化模式,生成Solidity代码,并通过编译和安全检查进行自动化质量评估。通过使用CrewAI风格的代理团队进行迭代优化,该管道生成具有完整来源元数据的结构化文档。质量评估涵盖五个维度,包括功能完整性、变量保真度、状态机正确性、业务逻辑保真度以及汇总成复合分数的代码质量。该框架支持与真实实现的配对评估,量化对齐程度并识别系统性错误模式,如逻辑遗漏和状态转换不一致。这为智能合约合成质量的实证研究提供了可重复的基准,并支持对形式验证和合规检查的扩展。
cs.AI / 57 / 2602.13852

Experimentation Accelerator: Interpretable Insights and Creative Recommendations for A/B Testing with Content-Aware ranking

实验加速器:可解释的洞察与基于内容的 A/B 测试创意推荐
Hu, Zhengmian, Shi, Lei, Sinha, Ritwik, Grover, Justin, Arbour, David
Abstract
Modern online experimentation faces two bottlenecks: scarce traffic forces tough choices on which variants to test, and post-hoc insight extraction is manual, inconsistent, and often content-agnostic. Meanwhile, organizations underuse historical A/B results and rich content embeddings that could guide prioritization and creative iteration. We present a unified framework to (i) prioritize which variants to test, (ii) explain why winners win, and (iii) surface targeted opportunities for new, higher-potential variants. Leveraging treatment embeddings and historical outcomes, we train a CTR ranking model with fixed effects for contextual shifts that scores candidates while balancing value and content diversity. For better interpretability and understanding, we project treatments onto curated semantic marketing attributes and re-express the ranker in this space via a sign-consistent, sparse constrained Lasso, yielding per-attribute coefficients and signed contributions for visual explanations, top-k drivers, and natural-language insights. We then compute an opportunity index combining attribute importance (from the ranker) with under-expression in the current experiment to flag missing, high-impact attributes. Finally, LLMs translate ranked opportunities into concrete creative suggestions and estimate both learning and conversion potential, enabling faster, more informative, and more efficient test cycles. These components have been built into a real Adobe product, called \textit{Experimentation Accelerator}, to provide AI-based insights and opportunities to scale experimentation for customers. We provide an evaluation of the performance of the proposed framework on some real-world experiments by Adobe business customers that validate the high quality of the generation pipeline.
Chinese Translation
现代在线实验面临两个瓶颈:稀缺的流量迫使我们在测试哪些变体上做出艰难选择,而事后洞察提取则是手动的、不一致的,并且往往与内容无关。同时,组织对历史 A/B 结果和丰富的内容嵌入的利用不足,这些结果和嵌入本可以指导优先级排序和创意迭代。我们提出了一个统一框架,以 (i) 确定优先测试哪些变体,(ii) 解释赢家为何获胜,以及 (iii) 发掘针对新高潜力变体的机会。通过利用处理嵌入和历史结果,我们训练了一个具有固定效应的点击率(CTR)排名模型,以应对上下文变化,在平衡价值和内容多样性的同时对候选项进行评分。为了更好地解释和理解,我们将处理映射到策划的语义营销属性上,并通过符号一致的稀疏约束 Lasso 在该空间中重新表达排名器,从而获得每个属性的系数和符号贡献,以便进行可视化解释、前 k 驱动因素和自然语言洞察。然后,我们计算一个机会指数,将属性重要性(来自排名器)与当前实验中的低表达结合起来,以标记缺失的高影响属性。最后,LLMs 将排名机会转化为具体的创意建议,并估计学习和转化潜力,从而实现更快、更具信息量和更高效的测试周期。这些组件已被构建到一个真实的 Adobe 产品中,称为实验加速器(Experimentation Accelerator),为客户提供基于 AI 的洞察和机会,以扩大实验规模。我们对所提出框架在一些 Adobe 商业客户的真实实验中的表现进行了评估,验证了生成管道的高质量。
cs.AI / 58 / 2602.13855

From Fluent to Verifiable: Claim-Level Auditability for Deep Research Agents

从流畅到可验证:深度研究代理的声明级审计能力
Rasheed, Razeen A, Banerjee, Somnath, Mukherjee, Animesh, Hazra, Rima
Abstract
A deep research agent produces a fluent scientific report in minutes; a careful reader then tries to verify the main claims and discovers the real cost is not reading, but tracing: which sentence is supported by which passage, what was ignored, and where evidence conflicts. We argue that as research generation becomes cheap, auditability becomes the bottleneck, and the dominant risk shifts from isolated factual errors to scientifically styled outputs whose claim-evidence links are weak, missing, or misleading. This perspective proposes claim-level auditability as a first-class design and evaluation target for deep research agents, distills recurring long-horizon failure modes (objective drift, transient constraints, and unverifiable inference), and introduces the Auditable Autonomous Research (AAR) standard, a compact measurement framework that makes auditability testable via provenance coverage, provenance soundness, contradiction transparency, and audit effort. We then argue for semantic provenance with protocolized validation: persistent, queryable provenance graphs that encode claim--evidence relations (including conflicts) and integrate continuous validation during synthesis rather than after publication, with practical instrumentation patterns to support deployment at scale.
Chinese Translation
深度研究代理在几分钟内生成流畅的科学报告;仔细的读者随后尝试验证主要声明,并发现真正的成本不在于阅读,而在于追踪:哪句话由哪段支持,哪些内容被忽略,以及证据之间的冲突。我们认为,随着研究生成变得便宜,审计能力成为瓶颈,主要风险从孤立的事实错误转移到科学风格的输出,其声明-证据链接薄弱、缺失或误导。这个观点提出了声明级审计能力作为深度研究代理的首要设计和评估目标,提炼出反复出现的长期失败模式(目标漂移、瞬态约束和不可验证的推理),并引入了可审计自主研究(Auditable Autonomous Research, AAR)标准,这是一个紧凑的测量框架,通过来源覆盖、来源健全性、矛盾透明度和审计工作量使审计能力可测试。然后,我们主张采用语义来源与协议化验证:持久的、可查询的来源图,编码声明-证据关系(包括冲突),并在合成过程中而非发布后集成持续验证,采用实用的仪器模式以支持大规模部署。
cs.AI / 59 / 2602.13865

Enabling Option Learning in Sparse Rewards with Hindsight Experience Replay

在稀疏奖励下通过事后经验重放实现选项学习
Romio, Gabriel, Melchiades, Mateus Begnini, da Silva, Bruno Castro, Ramos, Gabriel de Oliveira
Abstract
Hierarchical Reinforcement Learning (HRL) frameworks like Option-Critic (OC) and Multi-updates Option Critic (MOC) have introduced significant advancements in learning reusable options. However, these methods underperform in multi-goal environments with sparse rewards, where actions must be linked to temporally distant outcomes. To address this limitation, we first propose MOC-HER, which integrates the Hindsight Experience Replay (HER) mechanism into the MOC framework. By relabeling goals from achieved outcomes, MOC-HER can solve sparse reward environments that are intractable for the original MOC. However, this approach is insufficient for object manipulation tasks, where the reward depends on the object reaching the goal rather than on the agent's direct interaction. This makes it extremely difficult for HRL agents to discover how to interact with these objects. To overcome this issue, we introduce Dual Objectives Hindsight Experience Replay (2HER), a novel extension that creates two sets of virtual goals. In addition to relabeling goals based on the object's final state (standard HER), 2HER also generates goals from the agent's effector positions, rewarding the agent for both interacting with the object and completing the task. Experimental results in robotic manipulation environments show that MOC-2HER achieves success rates of up to 90%, compared to less than 11% for both MOC and MOC-HER. These results highlight the effectiveness of our dual objective relabeling strategy in sparse reward, multi-goal tasks.
Chinese Translation
层次强化学习(HRL)框架,如选项评论员(Option-Critic, OC)和多更新选项评论员(Multi-updates Option Critic, MOC),在学习可重用选项方面取得了显著进展。然而,这些方法在稀疏奖励的多目标环境中表现不佳,因为在这些环境中,动作必须与时间上遥远的结果相联系。为了解决这一局限性,我们首先提出了MOC-HER,它将事后经验重放(Hindsight Experience Replay, HER)机制集成到MOC框架中。通过重新标记来自已实现结果的目标,MOC-HER能够解决原始MOC无法处理的稀疏奖励环境。然而,这种方法对于物体操作任务来说是不够的,因为奖励依赖于物体达到目标,而不是代理的直接交互。这使得HRL代理极难发现如何与这些物体进行交互。为了解决这个问题,我们引入了双目标事后经验重放(Dual Objectives Hindsight Experience Replay, 2HER),这是一个新颖的扩展,创建了两组虚拟目标。除了基于物体最终状态重新标记目标(标准HER)外,2HER还从代理的效应器位置生成目标,奖励代理既与物体交互又完成任务。在机器人操作环境中的实验结果表明,MOC-2HER的成功率高达90%,而MOC和MOC-HER的成功率均低于11%。这些结果突显了我们在稀疏奖励多目标任务中双目标重新标记策略的有效性。
cs.AI / 60 / 2602.13873

Ambient Physics: Training Neural PDE Solvers with Partial Observations

环境物理:利用部分观测训练神经偏微分方程求解器
Majid, Harris Abdul, Daras, Giannis, Tudisco, Francesco, McDonagh, Steven
Abstract
In many scientific settings, acquiring complete observations of PDE coefficients and solutions can be expensive, hazardous, or impossible. Recent diffusion-based methods can reconstruct fields given partial observations, but require complete observations for training. We introduce Ambient Physics, a framework for learning the joint distribution of coefficient-solution pairs directly from partial observations, without requiring a single complete observation. The key idea is to randomly mask a subset of already-observed measurements and supervise on them, so the model cannot distinguish "truly unobserved" from "artificially unobserved", and must produce plausible predictions everywhere. Ambient Physics achieves state-of-the-art reconstruction performance. Compared with prior diffusion-based methods, it achieves a 62.51$\%$ reduction in average overall error while using 125$\times$ fewer function evaluations. We also identify a "one-point transition": masking a single already-observed point enables learning from partial observations across architectures and measurement patterns. Ambient Physics thus enables scientific progress in settings where complete observations are unavailable.
Chinese Translation
在许多科学场景中,获取偏微分方程(PDE)系数和解的完整观测可能是昂贵的、危险的或不可能的。近期的基于扩散的方法可以在给定部分观测的情况下重建场,但在训练时需要完整的观测。我们提出了环境物理(Ambient Physics),这是一个从部分观测中直接学习系数-解对的联合分布的框架,而无需任何完整观测。其关键思想是随机屏蔽一部分已观测的测量值,并在这些测量值上进行监督,使得模型无法区分“真正未观测”的和“人为未观测”的情况,必须在所有地方产生合理的预测。环境物理在重建性能上达到了最先进的水平。与之前的基于扩散的方法相比,它在平均整体误差上减少了62.51%,同时使用的函数评估次数减少了125倍。我们还发现了一个“单点转变”:屏蔽一个已观测的点可以使得跨架构和测量模式从部分观测中学习。因此,环境物理使得在无法获得完整观测的情况下推动科学进展成为可能。
cs.AI / 61 / 2602.13880

VSAL: A Vision Solver with Adaptive Layouts for Graph Property Detection

VSAL:一种具有自适应布局的图属性检测视觉求解器
Xie, Jiahao, Tong, Guangmo
Abstract
Graph property detection aims to determine whether a graph exhibits certain structural properties, such as being Hamiltonian. Recently, learning-based approaches have shown great promise by leveraging data-driven models to detect graph properties efficiently. In particular, vision-based methods offer a visually intuitive solution by processing the visualizations of graphs. However, existing vision-based methods rely on fixed visual graph layouts, and therefore, the expressiveness of their pipeline is restricted. To overcome this limitation, we propose VSAL, a vision-based framework that incorporates an adaptive layout generator capable of dynamically producing informative graph visualizations tailored to individual instances, thereby improving graph property detection. Extensive experiments demonstrate that VSAL outperforms state-of-the-art vision-based methods on various tasks such as Hamiltonian cycle, planarity, claw-freeness, and tree detection.
Chinese Translation
图属性检测旨在确定图是否具有某些结构属性,例如是否为哈密顿图。近年来,基于学习的方法通过利用数据驱动模型高效地检测图属性,展现出了巨大的潜力。特别是,基于视觉的方法通过处理图的可视化提供了一种直观的解决方案。然而,现有的基于视觉的方法依赖于固定的图形布局,因此其管道的表现力受到限制。为了解决这一限制,我们提出了VSAL,一种基于视觉的框架,结合了一个自适应布局生成器,能够动态生成针对个体实例的信息丰富的图形可视化,从而改善图属性检测。大量实验表明,VSAL在哈密顿循环、平面性、无爪性和树检测等多种任务上优于最先进的基于视觉的方法。
cs.AI / 62 / 2602.13904

Diagnosing Pathological Chain-of-Thought in Reasoning Models

诊断推理模型中的病理性思维链
Liu, Manqing, Williams-King, David, Caspary, Ida, Le, Linh, Whittingham, Hannes, Radmard, Puria, Tice, Cameron, Young, Edward James
Abstract
Chain-of-thought (CoT) reasoning is fundamental to modern LLM architectures and represents a critical intervention point for AI safety. However, CoT reasoning may exhibit failure modes that we note as pathologies, which prevent it from being useful for monitoring. Prior work has identified three distinct pathologies: post-hoc rationalization, where models generate plausible explanations backwards from predetermined answers; encoded reasoning, where intermediate steps conceal information within seemingly interpretable text; and internalized reasoning, where models replace explicit reasoning with meaningless filler tokens while computing internally. To better understand and discriminate between these pathologies, we create a set of concrete metrics that are simple to implement, computationally inexpensive, and task-agnostic. To validate our approach, we develop model organisms deliberately trained to exhibit specific CoT pathologies. Our work provides a practical toolkit for assessing CoT pathologies, with direct implications for training-time monitoring.
Chinese Translation
思维链(Chain-of-thought, CoT)推理是现代大型语言模型(LLM)架构的基础,并且是人工智能安全的重要干预点。然而,CoT推理可能表现出一些我们称之为病理的失败模式,这些模式妨碍了其在监控中的有效性。先前的研究已识别出三种不同的病理:事后合理化(post-hoc rationalization),即模型从预定答案反向生成看似合理的解释;编码推理(encoded reasoning),即中间步骤在看似可解释的文本中隐藏信息;以及内化推理(internalized reasoning),即模型在内部计算时用无意义的填充符号替代明确的推理。为了更好地理解和区分这些病理,我们创建了一套具体的指标,这些指标简单易实施、计算成本低且与任务无关。为了验证我们的方法,我们开发了故意训练以表现特定CoT病理的模型生物。我们的工作提供了一套实用工具,用于评估CoT病理,对训练过程中的监控具有直接的影响。
cs.AI / 63 / 2602.13912

From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design

从像素到政策:增强语言模型中的空间推理以实现内容感知的布局设计
Li, Sha, Petrangeli, Stefano, Shen, Yu, Chen, Xiang
Abstract
We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design. LaySPA addresses two key challenges: LLMs' limited spatial reasoning and the lack of opacity in design decision making. Instead of operating at the pixel level, we reformulate layout design as a policy learning problem over a structured textual spatial environment that explicitly encodes canvas geometry, element attributes, and inter-element relationships. LaySPA produces dual-level outputs comprising interpretable reasoning traces and structured layout specifications, enabling transparent and controllable design decision making. Layout design policy is optimized via a multi-objective spatial critique that decomposes layout quality into geometric validity, relational coherence, and aesthetic consistency, and is trained using relative group optimization to stabilize learning in open-ended design spaces. Experiments demonstrate that LaySPA improves structural validity and visual quality, outperforming larger proprietary LLMs and achieving performance comparable to specialized SOTA layout generators while requiring fewer annotated samples and reduced latency.
Chinese Translation
我们介绍了LaySPA,一个强化学习框架,旨在为大型语言模型(LLMs)提供明确且可解释的空间推理,以支持内容感知的图形布局设计。LaySPA解决了两个关键挑战:LLMs的空间推理能力有限以及设计决策过程缺乏透明性。我们将布局设计重新表述为一个在结构化文本空间环境中的策略学习问题,该环境明确编码了画布几何形状、元素属性和元素间关系。LaySPA生成双层输出,包括可解释的推理轨迹和结构化的布局规范,从而实现透明和可控的设计决策。布局设计策略通过多目标空间评估进行优化,该评估将布局质量分解为几何有效性、关系一致性和美学一致性,并使用相对组优化进行训练,以稳定开放式设计空间中的学习。实验表明,LaySPA提高了结构有效性和视觉质量,超越了更大型的专有LLMs,并在性能上与专业的最先进布局生成器相当,同时需要更少的标注样本和降低延迟。
cs.AI / 64 / 2602.13933

HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling

HyMem:具有动态检索调度的混合内存架构
Zhao, Xiaochen, Wang, Kaikai, Zhang, Xiaowen, Yao, Chen, Wang, Aili
Abstract
Large language model (LLM) agents demonstrate strong performance in short-text contexts but often underperform in extended dialogues due to inefficient memory management. Existing approaches face a fundamental trade-off between efficiency and effectiveness: memory compression risks losing critical details required for complex reasoning, while retaining raw text introduces unnecessary computational overhead for simple queries. The crux lies in the limitations of monolithic memory representations and static retrieval mechanisms, which fail to emulate the flexible and proactive memory scheduling capabilities observed in humans, thus struggling to adapt to diverse problem scenarios. Inspired by the principle of cognitive economy, we propose HyMem, a hybrid memory architecture that enables dynamic on-demand scheduling through multi-granular memory representations. HyMem adopts a dual-granular storage scheme paired with a dynamic two-tier retrieval system: a lightweight module constructs summary-level context for efficient response generation, while an LLM-based deep module is selectively activated only for complex queries, augmented by a reflection mechanism for iterative reasoning refinement. Experiments show that HyMem achieves strong performance on both the LOCOMO and LongMemEval benchmarks, outperforming full-context while reducing computational cost by 92.6\%, establishing a state-of-the-art balance between efficiency and performance in long-term memory management.
Chinese Translation
大型语言模型(LLM)代理在短文本上下文中表现出色,但在扩展对话中往往表现不佳,原因在于内存管理效率低下。现有方法面临效率与有效性之间的根本权衡:内存压缩可能会丢失复杂推理所需的关键细节,而保留原始文本则会为简单查询引入不必要的计算开销。关键在于单一内存表示和静态检索机制的局限性,这些机制无法模拟人类所观察到的灵活和主动的内存调度能力,因此难以适应多样化的问题场景。受到认知经济原则的启发,我们提出了HyMem,一种混合内存架构,能够通过多粒度内存表示实现动态按需调度。HyMem采用双粒度存储方案,并配备动态双层检索系统:轻量级模块构建摘要级上下文以实现高效的响应生成,而基于LLM的深度模块仅在复杂查询时选择性激活,并通过反思机制增强迭代推理的细化。实验表明,HyMem在LOCOMO和LongMemEval基准测试中表现优异,超越了全上下文方法,同时将计算成本降低了92.6%,在长期内存管理中建立了效率与性能之间的最先进平衡。
cs.AI / 65 / 2602.13935

Statistical Early Stopping for Reasoning Models

推理模型的统计早停法
Xie, Yangxinyu, Wang, Tao, Mallick, Soham, Sun, Yan, Noarov, Georgy, Yu, Mengxin, Mallick, Tanwi, Su, Weijie J., Dobriban, Edgar
Abstract
While LLMs have seen substantial improvement in reasoning capabilities, they also sometimes overthink, generating unnecessary reasoning steps, particularly under uncertainty, given ill-posed or ambiguous queries. We introduce statistically principled early stopping methods that monitor uncertainty signals during generation to mitigate this issue. Our first approach is parametric: it models inter-arrival times of uncertainty keywords as a renewal process and applies sequential testing for stopping. Our second approach is nonparametric and provides finite-sample guarantees on the probability of halting too early on well-posed queries. We conduct empirical evaluations on reasoning tasks across several domains and models. Our results indicate that uncertainty-aware early stopping can improve both efficiency and reliability in LLM reasoning, and we observe especially significant gains for math reasoning.
Chinese Translation
尽管大型语言模型(LLMs)在推理能力上取得了显著进展,但在面对不确定性以及不良或模糊查询时,它们有时会过度思考,生成不必要的推理步骤。我们提出了统计学上合理的早停方法,通过在生成过程中监测不确定性信号来缓解这一问题。我们第一种方法是参数化的:它将不确定性关键词的到达时间建模为更新过程,并应用序贯检验进行停止。我们的第二种方法是非参数化的,并对在良好条件下查询过早停止的概率提供有限样本保证。我们在多个领域和模型的推理任务上进行了实证评估。我们的结果表明,关注不确定性的早停方法可以提高大型语言模型推理的效率和可靠性,尤其在数学推理方面观察到了显著的提升。
cs.AI / 66 / 2602.13936

A Generalizable Physics-guided Causal Model for Trajectory Prediction in Autonomous Driving

一种可推广的物理引导因果模型用于自主驾驶中的轨迹预测
Zong, Zhenyu, Wang, Yuchen, Lin, Haohong, Gan, Lu, Shao, Huajie
Abstract
Trajectory prediction for traffic agents is critical for safe autonomous driving. However, achieving effective zero-shot generalization in previously unseen domains remains a significant challenge. Motivated by the consistent nature of kinematics across diverse domains, we aim to incorporate domain-invariant knowledge to enhance zero-shot trajectory prediction capabilities. The key challenges include: 1) effectively extracting domain-invariant scene representations, and 2) integrating invariant features with kinematic models to enable generalized predictions. To address these challenges, we propose a novel generalizable Physics-guided Causal Model (PCM), which comprises two core components: a Disentangled Scene Encoder, which adopts intervention-based disentanglement to extract domain-invariant features from scenes, and a CausalODE Decoder, which employs a causal attention mechanism to effectively integrate kinematic models with meaningful contextual information. Extensive experiments on real-world autonomous driving datasets demonstrate our method's superior zero-shot generalization performance in unseen cities, significantly outperforming competitive baselines. The source code is released at https://github.com/ZY-Zong/Physics-guided-Causal-Model.
Chinese Translation
交通参与者的轨迹预测对于安全的自主驾驶至关重要。然而,在以前未见的领域中实现有效的零样本泛化仍然是一个重大挑战。受到不同领域中运动学一致性的启发,我们旨在结合领域不变的知识以增强零样本轨迹预测能力。主要挑战包括:1)有效提取领域不变的场景表示,2)将不变特征与运动学模型相结合以实现泛化预测。为了解决这些挑战,我们提出了一种新颖的可推广的物理引导因果模型(Physics-guided Causal Model, PCM),其包含两个核心组件:一个解耦场景编码器(Disentangled Scene Encoder),采用基于干预的解耦方法从场景中提取领域不变特征;一个因果ODE解码器(CausalODE Decoder),利用因果注意机制有效地将运动学模型与有意义的上下文信息相结合。在真实世界的自主驾驶数据集上进行的大量实验表明,我们的方法在未见城市中的零样本泛化性能优于竞争基线,表现显著提升。源代码已发布在 https://github.com/ZY-Zong/Physics-guided-Causal-Model。
cs.AI / 67 / 2602.13967

Neuromem: A Granular Decomposition of the Streaming Lifecycle in External Memory for LLMs

Neuromem:大规模语言模型外部记忆流生命周期的细粒度分解
Zhang, Ruicheng, Li, Xinyi, Xu, Tianyi, Zhang, Shuhao, Liao, Xiaofei, Jin, Hai
Abstract
Most evaluations of External Memory Module assume a static setting: memory is built offline and queried at a fixed state. In practice, memory is streaming: new facts arrive continuously, insertions interleave with retrievals, and the memory state evolves while the model is serving queries. In this regime, accuracy and cost are governed by the full memory lifecycle, which encompasses the ingestion, maintenance, retrieval, and integration of information into generation. We present Neuromem, a scalable testbed that benchmarks External Memory Modules under an interleaved insertion-and-retrieval protocol and decomposes its lifecycle into five dimensions including memory data structure, normalization strategy, consolidation policy, query formulation strategy, and context integration mechanism. Using three representative datasets LOCOMO, LONGMEMEVAL, and MEMORYAGENTBENCH, Neuromem evaluates interchangeable variants within a shared serving stack, reporting token-level F1 and insertion/retrieval latency. Overall, we observe that performance typically degrades as memory grows across rounds, and time-related queries remain the most challenging category. The memory data structure largely determines the attainable quality frontier, while aggressive compression and generative integration mechanisms mostly shift cost between insertion and retrieval with limited accuracy gain.
Chinese Translation
大多数对外部记忆模块的评估假设在静态环境中进行:记忆是在离线状态下构建的,并在固定状态下进行查询。实际上,记忆是流式的:新事实持续到来,插入与检索交替进行,记忆状态在模型处理查询时不断演变。在这种情况下,准确性和成本受到完整记忆生命周期的影响,该生命周期包括信息的摄取、维护、检索和集成到生成中。我们提出了Neuromem,一个可扩展的测试平台,用于在交替插入和检索协议下基准测试外部记忆模块,并将其生命周期分解为五个维度,包括记忆数据结构、归一化策略、整合策略、查询构造策略和上下文集成机制。使用三个代表性数据集LOCOMO、LONGMEMEVAL和MEMORYAGENTBENCH,Neuromem在共享服务栈中评估可互换的变体,报告令牌级F1和插入/检索延迟。总体而言,我们观察到随着记忆在多个轮次中增长,性能通常会下降,而与时间相关的查询仍然是最具挑战性的类别。记忆数据结构在很大程度上决定了可达到的质量边界,而激进的压缩和生成集成机制主要在插入和检索之间转移成本,且准确性提升有限。
cs.AI / 68 / 2602.13980

Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking

软提示的认知块化:通过块级因果掩蔽加速压缩器学习
Liu, Guojie, Wang, Yiqi, Yang, Yanfeng, Fan, Wenqi, Jian, Songlei, Zhang, Jianfeng, Yu, Jie
Abstract
Providing extensive context via prompting is vital for leveraging the capabilities of Large Language Models (LLMs). However, lengthy contexts significantly increase inference latency, as the computational cost of self-attention grows quadratically with sequence length. To mitigate this issue, context compression-particularly soft prompt compressio-has emerged as a widely studied solution, which converts long contexts into shorter memory embeddings via a trained compressor. Existing methods typically compress the entire context indiscriminately into a set of memory tokens, requiring the compressor to capture global dependencies and necessitating extensive pre-training data to learn effective patterns. Inspired by the chunking mechanism in human working memory and empirical observations of the spatial specialization of memory embeddings relative to original tokens, we propose Parallelized Iterative Compression (PIC). By simply modifying the Transformer's attention mask, PIC explicitly restricts the receptive field of memory tokens to sequential local chunks, thereby lowering the difficulty of compressor training. Experiments across multiple downstream tasks demonstrate that PIC consistently outperforms competitive baselines, with superiority being particularly pronounced in high compression scenarios (e.g., achieving relative improvements of 29.8\% in F1 score and 40.7\% in EM score on QA tasks at the $64\times$ compression ratio). Furthermore, PIC significantly expedites the training process. Specifically, when training the 16$\times$ compressor, it surpasses the peak performance of the competitive baseline while effectively reducing the training time by approximately 40\%.
Chinese Translation
通过提示提供广泛的上下文对于充分利用大型语言模型(LLMs)的能力至关重要。然而,冗长的上下文显著增加了推理延迟,因为自注意力的计算成本随着序列长度的增加而呈二次增长。为了解决这个问题,上下文压缩,特别是软提示压缩,已成为一种广泛研究的解决方案,它通过训练好的压缩器将长上下文转换为较短的记忆嵌入。现有方法通常将整个上下文不加区分地压缩为一组记忆标记,要求压缩器捕捉全局依赖关系,并需要大量的预训练数据来学习有效的模式。受到人类工作记忆中块化机制的启发,以及对记忆嵌入相对于原始标记的空间特化的实证观察,我们提出了并行迭代压缩(Parallelized Iterative Compression, PIC)。通过简单地修改变换器的注意力掩蔽,PIC明确限制了记忆标记的感受野为顺序局部块,从而降低了压缩器训练的难度。在多个下游任务上的实验表明,PIC始终优于竞争基线,尤其在高压缩场景中表现尤为突出(例如,在$64 imes$压缩比的问答任务中,F1得分相对提升29.8\%,EM得分相对提升40.7\%)。此外,PIC显著加快了训练过程。具体而言,在训练16$ imes$压缩器时,它超越了竞争基线的峰值性能,同时有效地将训练时间减少了约40\\%。
cs.AI / 69 / 2602.13985

Bridging AI and Clinical Reasoning: Abductive Explanations for Alignment on Critical Symptoms

弥合人工智能与临床推理之间的差距:针对关键症状的溯因解释
Sonna, Belona, Grastien, Alban
Abstract
Artificial intelligence (AI) has demonstrated strong potential in clinical diagnostics, often achieving accuracy comparable to or exceeding that of human experts. A key challenge, however, is that AI reasoning frequently diverges from structured clinical frameworks, limiting trust, interpretability, and adoption. Critical symptoms, pivotal for rapid and accurate decision-making, may be overlooked by AI models even when predictions are correct. Existing post hoc explanation methods provide limited transparency and lack formal guarantees. To address this, we leverage formal abductive explanations, which offer consistent, guaranteed reasoning over minimal sufficient feature sets. This enables a clear understanding of AI decision-making and allows alignment with clinical reasoning. Our approach preserves predictive accuracy while providing clinically actionable insights, establishing a robust framework for trustworthy AI in medical diagnosis.
Chinese Translation
人工智能(AI)在临床诊断中展现出强大的潜力,常常达到或超过人类专家的准确性。然而,一个主要挑战在于,AI推理经常偏离结构化的临床框架,这限制了信任度、可解释性和应用性。关键症状对于快速和准确的决策至关重要,即使在预测正确的情况下,AI模型也可能忽视这些症状。现有的事后解释方法提供的透明度有限,且缺乏正式保证。为了解决这一问题,我们利用形式化的溯因解释,它在最小充分特征集上提供一致的、保证的推理。这使得对AI决策过程的清晰理解成为可能,并允许与临床推理保持一致。我们的方法在保持预测准确性的同时,提供临床可操作的见解,为医疗诊断中的可信AI建立了一个稳健的框架。
cs.AI / 70 / 2602.14003

Prompt-Driven Low-Altitude Edge Intelligence: Modular Agents and Generative Reasoning

基于提示的低空边缘智能:模块化代理与生成推理
You, Jiahao, Jia, Ziye, Dong, Chao, Wu, Qihui
Abstract
The large artificial intelligence models (LAMs) show strong capabilities in perception, reasoning, and multi-modal understanding, and can enable advanced capabilities in low-altitude edge intelligence. However, the deployment of LAMs at the edge remains constrained by some fundamental limitations. First, tasks are rigidly tied to specific models, limiting the flexibility. Besides, the computational and memory demands of full-scale LAMs exceed the capacity of most edge devices. Moreover, the current inference pipelines are typically static, making it difficult to respond to real-time changes of tasks. To address these challenges, we propose a prompt-to-agent edge cognition framework (P2AECF), enabling the flexible, efficient, and adaptive edge intelligence. Specifically, P2AECF transforms high-level semantic prompts into executable reasoning workflows through three key mechanisms. First, the prompt-defined cognition parses task intent into abstract and model-agnostic representations. Second, the agent-based modular execution instantiates these tasks using lightweight and reusable cognitive agents dynamically selected based on current resource conditions. Third, the diffusion-controlled inference planning adaptively constructs and refines execution strategies by incorporating runtime feedback and system context. In addition, we illustrate the framework through a representative low-altitude intelligent network use case, showing its ability to deliver adaptive, modular, and scalable edge intelligence for real-time low-altitude aerial collaborations.
Chinese Translation
大型人工智能模型(LAMs)在感知、推理和多模态理解方面表现出强大的能力,并能够在低空边缘智能中实现先进的功能。然而,LAMs在边缘的部署仍然受到一些基本限制的制约。首先,任务与特定模型紧密绑定,限制了灵活性。此外,全面规模的LAMs在计算和内存上的需求超出了大多数边缘设备的能力。此外,目前的推理管道通常是静态的,难以响应任务的实时变化。为了解决这些挑战,我们提出了一种基于提示到代理的边缘认知框架(P2AECF),使边缘智能具备灵活、高效和自适应的特性。具体而言,P2AECF通过三个关键机制将高层语义提示转化为可执行的推理工作流。首先,提示定义的认知将任务意图解析为抽象且与模型无关的表示。其次,基于代理的模块化执行使用轻量且可重用的认知代理动态选择并实例化这些任务,基于当前资源条件。第三,扩散控制的推理规划通过结合运行时反馈和系统上下文,自适应地构建和优化执行策略。此外,我们通过一个代表性的低空智能网络用例来说明该框架,展示其在实时低空空中协作中提供自适应、模块化和可扩展的边缘智能的能力。
cs.AI / 71 / 2602.14035

FloCA: Towards Faithful and Logically Consistent Flowchart Reasoning

FloCA:迈向忠实且逻辑一致的流程图推理
Zou, Jinzi, Wang, Bolin, Li, Liang, Zhang, Shuo, Xu, Nuo, Zhao, Junzhou
Abstract
Flowchart-oriented dialogue (FOD) systems aim to guide users through multi-turn decision-making or operational procedures by following a domain-specific flowchart to achieve a task goal. In this work, we formalize flowchart reasoning in FOD as grounding user input to flowchart nodes at each dialogue turn while ensuring node transition is consistent with the correct flowchart path. Despite recent advances of LLMs in task-oriented dialogue systems, adapting them to FOD still faces two limitations: (1) LLMs lack an explicit mechanism to represent and reason over flowchart topology, and (2) they are prone to hallucinations, leading to unfaithful flowchart reasoning. To address these limitations, we propose FloCA, a zero-shot flowchart-oriented conversational agent. FloCA uses an LLM for intent understanding and response generation while delegating flowchart reasoning to an external tool that performs topology-constrained graph execution, ensuring faithful and logically consistent node transitions across dialogue turns. We further introduce an evaluation framework with an LLM-based user simulator and five new metrics covering reasoning accuracy and interaction efficiency. Extensive experiments on FLODIAL and PFDial datasets highlight the bottlenecks of existing LLM-based methods and demonstrate the superiority of FloCA. Our codes are available at https://github.com/Jinzi-Zou/FloCA-flowchart-reasoning.
Chinese Translation
面向流程图的对话(FOD)系统旨在通过遵循特定领域的流程图,引导用户进行多轮决策或操作程序,以实现任务目标。在本研究中,我们将FOD中的流程图推理形式化为在每个对话轮次中将用户输入与流程图节点对接,同时确保节点转换与正确的流程图路径一致。尽管大型语言模型(LLMs)在任务导向对话系统中取得了近期进展,但将其适应于FOD仍面临两个限制:(1)LLMs缺乏明确的机制来表示和推理流程图拓扑;(2)它们容易产生幻觉,导致不忠实的流程图推理。为了解决这些限制,我们提出了FloCA,一个零样本的流程图导向对话代理。FloCA利用LLM进行意图理解和响应生成,同时将流程图推理委托给一个外部工具,该工具执行拓扑约束的图执行,确保在对话轮次中节点转换的忠实性和逻辑一致性。我们进一步引入了一个评估框架,配备基于LLM的用户模拟器和五个新的指标,涵盖推理准确性和交互效率。在FLODIAL和PFDial数据集上的大量实验突显了现有基于LLM的方法的瓶颈,并展示了FloCA的优越性。我们的代码可在 https://github.com/Jinzi-Zou/FloCA-flowchart-reasoning 获取。
cs.AI / 72 / 2602.14038

Choosing How to Remember: Adaptive Memory Structures for LLM Agents

选择如何记忆:针对大语言模型代理的自适应记忆结构
Lu, Mingfei, Wu, Mengjia, Liu, Feng, Xu, Jiawei, Li, Weikai, Wang, Haoyang, Hu, Zhengdong, Ding, Ying, Sun, Yizhou, Lu, Jie, Zhang, Yi
Abstract
Memory is critical for enabling large language model (LLM) based agents to maintain coherent behavior over long-horizon interactions. However, existing agent memory systems suffer from two key gaps: they rely on a one-size-fits-all memory structure and do not model memory structure selection as a context-adaptive decision, limiting their ability to handle heterogeneous interaction patterns and resulting in suboptimal performance. We propose a unified framework, FluxMem, that enables adaptive memory organization for LLM agents. Our framework equips agents with multiple complementary memory structures. It explicitly learns to select among these structures based on interaction-level features, using offline supervision derived from downstream response quality and memory utilization. To support robust long-horizon memory evolution, we further introduce a three-level memory hierarchy and a Beta Mixture Model-based probabilistic gate for distribution-aware memory fusion, replacing brittle similarity thresholds. Experiments on two long-horizon benchmarks, PERSONAMEM and LoCoMo, demonstrate that our method achieves average improvements of 9.18% and 6.14%.
Chinese Translation
记忆对于使基于大语言模型(LLM)的代理在长时间交互中保持一致行为至关重要。然而,现有的代理记忆系统存在两个主要缺陷:它们依赖于一种通用的记忆结构,并且未将记忆结构选择建模为一种上下文自适应决策,这限制了它们处理异质交互模式的能力,导致性能不佳。我们提出了一个统一框架FluxMem,能够为LLM代理提供自适应的记忆组织。我们的框架为代理配备了多个互补的记忆结构。它明确学习根据交互级特征在这些结构之间进行选择,使用从下游响应质量和记忆利用率中获得的离线监督。为了支持稳健的长时间记忆演变,我们进一步引入了三级记忆层次结构和基于Beta混合模型的概率门,用于分布感知的记忆融合,取代了脆弱的相似性阈值。在两个长时间基准测试PERSONAMEM和LoCoMo上的实验表明,我们的方法平均提高了9.18%和6.14%的性能。
cs.AI / 73 / 2602.14065

REAL: Resolving Knowledge Conflicts in Knowledge-Intensive Visual Question Answering via Reasoning-Pivot Alignment

REAL:通过推理枢轴对齐解决知识密集型视觉问答中的知识冲突
Ye, Kai, Mao, Xianwei, Zhou, Sheng, Shao, Zirui, Mo, Ye, Liu, Liangliang, Huang, Haikuan, Li, Bin, Bu, Jiajun
Abstract
Knowledge-intensive Visual Question Answering (KI-VQA) frequently suffers from severe knowledge conflicts caused by the inherent limitations of open-domain retrieval. However, existing paradigms face critical limitations due to the lack of generalizable conflict detection and intra-model constraint mechanisms to handle conflicting evidence. To address these challenges, we propose the REAL (Reasoning-Pivot Alignment) framework centered on the novel concept of the Reasoning-Pivot. Distinct from reasoning steps that prioritize internal self-derivation, a reasoning-pivot serves as an atomic unit (node or edge) in the reasoning chain that emphasizes knowledge linkage, and it typically relies on external evidence to complete the reasoning. Supported by our constructed REAL-VQA dataset, our approach integrates Reasoning-Pivot Aware SFT (RPA-SFT) to train a generalizable discriminator by aligning conflicts with pivot extraction, and employs Reasoning-Pivot Guided Decoding (RPGD), an intra-model decoding strategy that leverages these pivots for targeted conflict mitigation. Extensive experiments across diverse benchmarks demonstrate that REAL significantly enhances discrimination accuracy and achieves state-of-the-art performance, validating the effectiveness of our pivot-driven resolution paradigm.
Chinese Translation
知识密集型视觉问答(KI-VQA)常常受到开放域检索固有局限性导致的严重知识冲突的影响。然而,现有范式由于缺乏可推广的冲突检测和模型内部约束机制来处理冲突证据,面临着关键限制。为了解决这些挑战,我们提出了REAL(推理枢轴对齐)框架,该框架以推理枢轴的新概念为中心。与优先考虑内部自我推导的推理步骤不同,推理枢轴作为推理链中的原子单元(节点或边),强调知识链接,通常依赖外部证据来完成推理。在我们构建的REAL-VQA数据集的支持下,我们的方法集成了推理枢轴感知的微调(RPA-SFT),通过将冲突与枢轴提取对齐来训练可推广的判别器,并采用推理枢轴引导解码(RPGD),这是一种利用这些枢轴进行针对性冲突缓解的模型内部解码策略。广泛的实验在不同基准上表明,REAL显著提高了判别准确性,并达到了最先进的性能,验证了我们基于枢轴的解决范式的有效性。
cs.AI / 74 / 2602.14083

Plan-MCTS: Plan Exploration for Action Exploitation in Web Navigation

计划-蒙特卡洛树搜索:在网页导航中进行行动利用的计划探索
Zhang, Weiming, Wang, Jihong, Zhou, Jiamu, Li, Qingyao, Ma, Xinbei, Zheng, Congmin, Lou, Xingyu, Liu, Weiwen, Zhang, Zhuosheng, Wang, Jun, Yu, Yong, Zhang, Weinan
Abstract
Large Language Models (LLMs) have empowered autonomous agents to handle complex web navigation tasks. While recent studies integrate tree search to enhance long-horizon reasoning, applying these algorithms in web navigation faces two critical challenges: sparse valid paths that lead to inefficient exploration, and a noisy context that dilutes accurate state perception. To address this, we introduce Plan-MCTS, a framework that reformulates web navigation by shifting exploration to a semantic Plan Space. By decoupling strategic planning from execution grounding, it transforms sparse action space into a Dense Plan Tree for efficient exploration, and distills noisy contexts into an Abstracted Semantic History for precise state awareness. To ensure efficiency and robustness, Plan-MCTS incorporates a Dual-Gating Reward to strictly validate both physical executability and strategic alignment and Structural Refinement for on-policy repair of failed subplans. Extensive experiments on WebArena demonstrate that Plan-MCTS achieves state-of-the-art performance, surpassing current approaches with higher task effectiveness and search efficiency.
Chinese Translation
大型语言模型(LLMs)赋能自主代理处理复杂的网页导航任务。尽管近期研究将树搜索整合以增强长远推理,但在网页导航中应用这些算法面临两个关键挑战:导致低效探索的稀疏有效路径,以及稀释准确状态感知的噪声上下文。为了解决这一问题,我们提出了计划-蒙特卡洛树搜索(Plan-MCTS),一个通过将探索转移到语义计划空间来重新构建网页导航的框架。通过将战略规划与执行基础解耦,它将稀疏的行动空间转化为密集的计划树,以实现高效探索,并将噪声上下文提炼为抽象语义历史,以提高状态意识的准确性。为了确保效率和稳健性,Plan-MCTS结合了双重门控奖励,以严格验证物理可执行性和战略一致性,并采用结构性修正对失败的子计划进行策略修复。在WebArena上的广泛实验表明,Plan-MCTS实现了最先进的性能,超越了当前方法,具有更高的任务有效性和搜索效率。
cs.AI / 75 / 2602.14093

GUI-GENESIS: Automated Synthesis of Efficient Environments with Verifiable Rewards for GUI Agent Post-Training

GUI-GENESIS:用于GUI代理后训练的高效环境自动合成与可验证奖励
Cao, Yuan, Ran, Dezhi, Wu, Mengzhou, Guo, Yuzhe, Chen, Xin, Li, Ang, Cao, Gang, Zhi, Gong, Yu, Hao, Li, Linyi, Yang, Wei, Xie, Tao
Abstract
Post-training GUI agents in interactive environments is critical for developing generalization and long-horizon planning capabilities. However, training on real-world applications is hindered by high latency, poor reproducibility, and unverifiable rewards relying on noisy visual proxies. To address the limitations, we present GUI-GENESIS, the first framework to automatically synthesize efficient GUI training environments with verifiable rewards. GUI-GENESIS reconstructs real-world applications into lightweight web environments using multimodal code models and equips them with code-native rewards, executable assertions that provide deterministic reward signals and eliminate visual estimation noise. Extensive experiments show that GUI-GENESIS reduces environment latency by 10 times and costs by over $28,000 per epoch compared to training on real applications. Notably, agents trained with GUI-GENESIS outperform the base model by 14.54% and even real-world RL baselines by 3.27% on held-out real-world tasks. Finally, we observe that models can synthesize environments they cannot yet solve, highlighting a pathway for self-improving agents.
Chinese Translation
在交互环境中进行后训练的GUI代理对于发展泛化能力和长时间规划能力至关重要。然而,基于真实应用的训练受到高延迟、低可重复性以及依赖于嘈杂视觉代理的不可验证奖励的限制。为了解决这些局限性,我们提出了GUI-GENESIS,这是第一个能够自动合成具有可验证奖励的高效GUI训练环境的框架。GUI-GENESIS利用多模态代码模型将真实应用重构为轻量级的网络环境,并为其配备代码原生奖励,这些可执行的断言提供确定性的奖励信号,消除了视觉估计噪声。大量实验表明,与在真实应用上训练相比,GUI-GENESIS将环境延迟降低了10倍,每个周期的成本减少了超过28,000美元。值得注意的是,使用GUI-GENESIS训练的代理在保留的真实世界任务上比基础模型提高了14.54%,甚至比真实世界的强化学习基线提高了3.27%。最后,我们观察到模型能够合成它们尚未解决的环境,这为自我改进的代理指明了一条路径。
cs.AI / 76 / 2602.14095

NEST: Nascent Encoded Steganographic Thoughts

NEST:新生编码隐写思维
Karpov, Artem
Abstract
Monitoring chain-of-thought (CoT) reasoning is a foundational safety technique for large language model (LLM) agents; however, this oversight is compromised if models learn to conceal their reasoning. We explore the potential for steganographic CoT -- where models hide secret reasoning within innocuous text -- to inform risk assessment and deployment policies. We systematically evaluate the limits of steganographic capabilities across 28 models, ranging from past generations to the current frontier. We measure monitor evasion, refusal rates, encoding fidelity, and hidden task accuracy across four datasets, comparing steganographic acrostics against plain reasoning and filler-token baselines. We find that current models cannot yet sustain hidden reasoning for complex math and arithmetic tasks. However, in a simplified counting experiment, Claude Opus 4.5 achieved 92% accuracy on the hidden task, demonstrating nascent capability. Notably, in rare cases (<1%), GPT-5.2 might refuse steganographic instructions while simultaneously complying with them. Our findings underscore the need for continuous evaluation of steganographic risks. This study provides a methodology to preemptively detect and prevent hidden reasoning that might empower misaligned scheming and deceptive behavior.
Chinese Translation
监控思维链(CoT)推理是大型语言模型(LLM)代理的一项基础安全技术;然而,如果模型学会隐藏其推理,这一监督将受到影响。我们探讨了隐写CoT的潜力——即模型在无害文本中隐藏秘密推理——以指导风险评估和部署政策。我们系统地评估了28个模型的隐写能力的限制,这些模型涵盖了从过去几代到当前前沿的范围。我们在四个数据集上测量了监控规避、拒绝率、编码保真度和隐藏任务准确性,并将隐写首字母缩略词与普通推理和填充标记基线进行比较。我们发现当前模型尚无法在复杂的数学和算术任务中维持隐藏推理。然而,在一个简化的计数实验中,Claude Opus 4.5在隐藏任务上达到了92%的准确率,展示了初步能力。值得注意的是,在少数情况下(<1%),GPT-5.2可能会拒绝隐写指令,同时又会遵循这些指令。我们的研究结果强调了对隐写风险进行持续评估的必要性。本研究提供了一种方法论,以预防性地检测和阻止可能助长不当策划和欺骗行为的隐藏推理。
cs.AI / 77 / 2602.14130

Algebraic Quantum Intelligence: A New Framework for Reproducible Machine Creativity

代数量子智能:可重复机器创造力的新框架
Yano, Kazuo, Lee, Jonghyeok, Ishitomi, Tae, Kawaguchi, Hironobu, Koyama, Akira, Ota, Masakuni, Ota, Yuki, Sato, Nobuo, Shimada, Keita, Takematsu, Sho, Tobinai, Ayaka, Tsuji, Satomi, Yanagi, Kazunori, Yano, Keiko, Harada, Manabu, Matsuda, Yuki, Matsumoto, Kazunori, Matsumura, Kenichi, Matsuo, Hamae, Miyazaki, Yumi, Murai, Kotaro, Ohshita, Tatsuya, Seki, Marie, Tanoue, Shun, Terakado, Tatsuki, Ichimaru, Yuko, Saito, Mirei, Otsuka, Akihiro, Ara, Koji
Abstract
Large language models (LLMs) have achieved remarkable success in generating fluent and contextually appropriate text; however, their capacity to produce genuinely creative outputs remains limited. This paper posits that this limitation arises from a structural property of contemporary LLMs: when provided with rich context, the space of future generations becomes strongly constrained, and the generation process is effectively governed by near-deterministic dynamics. Recent approaches such as test-time scaling and context adaptation improve performance but do not fundamentally alter this constraint. To address this issue, we propose Algebraic Quantum Intelligence (AQI) as a computational framework that enables systematic expansion of semantic space. AQI is formulated as a noncommutative algebraic structure inspired by quantum theory, allowing properties such as order dependence, interference, and uncertainty to be implemented in a controlled and designable manner. Semantic states are represented as vectors in a Hilbert space, and their evolution is governed by C-values computed from noncommutative operators, thereby ensuring the coexistence and expansion of multiple future semantic possibilities. In this study, we implement AQI by extending a transformer-based LLM with more than 600 specialized operators. We evaluate the resulting system on creative reasoning benchmarks spanning ten domains under an LLM-as-a-judge protocol. The results show that AQI consistently outperforms strong baseline models, yielding statistically significant improvements and reduced cross-domain variance. These findings demonstrate that noncommutative algebraic dynamics can serve as a practical and reproducible foundation for machine creativity. Notably, this architecture has already been deployed in real-world enterprise environments.
Chinese Translation
大型语言模型(LLMs)在生成流畅且符合上下文的文本方面取得了显著成功;然而,它们产生真正创造性输出的能力仍然有限。本文认为,这一限制源于当代LLMs的结构特性:在提供丰富上下文时,未来生成的空间受到强烈限制,生成过程实际上受到近乎确定性动态的支配。近期的一些方法,如测试时缩放和上下文适应,虽然提高了性能,但并未从根本上改变这一约束。为了解决这一问题,我们提出了代数量子智能(Algebraic Quantum Intelligence,AQI)作为一个计算框架,能够系统性地扩展语义空间。AQI被构建为一种受量子理论启发的非交换代数结构,允许诸如顺序依赖、干涉和不确定性等特性以可控和可设计的方式实现。语义状态被表示为希尔伯特空间中的向量,其演化由从非交换算子计算得出的C值支配,从而确保多个未来语义可能性的共存和扩展。在本研究中,我们通过扩展一个基于变换器的LLM,增加了600多个专用算子来实现AQI。我们在十个领域的创造性推理基准上评估了该系统,采用LLM作为评判者的协议。结果表明,AQI在强基线模型上始终表现优异,取得了统计显著的改善,并降低了跨领域的方差。这些发现表明,非交换代数动态可以作为机器创造力的一个实用且可重复的基础。值得注意的是,该架构已经在现实世界的企业环境中部署。
cs.AI / 78 / 2602.14135

ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI

前瞻安全基准:面向安全人工智能的前沿风险评估与治理框架
Tong, Haibo, Zhao, Feifei, Feng, Linghao, Wu, Ruoyu, Chen, Ruolin, Jia, Lu, Zhao, Zhou, Li, Jindong, Li, Tenglong, Lin, Erliang, Yang, Shuai, Lu, Enmeng, Sun, Yinqian, Zhang, Qian, Ruan, Zizhe, Yue, Zeyang, Wu, Ping, Li, Huangrui, Sun, Chengyi, Zeng, Yi
Abstract
Rapidly evolving AI exhibits increasingly strong autonomy and goal-directed capabilities, accompanied by derivative systemic risks that are more unpredictable, difficult to control, and potentially irreversible. However, current AI safety evaluation systems suffer from critical limitations such as restricted risk dimensions and failed frontier risk detection. The lagging safety benchmarks and alignment technologies can hardly address the complex challenges posed by cutting-edge AI models. To bridge this gap, we propose the "ForesightSafety Bench" AI Safety Evaluation Framework, beginning with 7 major Fundamental Safety pillars and progressively extends to advanced Embodied AI Safety, AI4Science Safety, Social and Environmental AI risks, Catastrophic and Existential Risks, as well as 8 critical industrial safety domains, forming a total of 94 refined risk dimensions. To date, the benchmark has accumulated tens of thousands of structured risk data points and assessment results, establishing a widely encompassing, hierarchically clear, and dynamically evolving AI safety evaluation framework. Based on this benchmark, we conduct systematic evaluation and in-depth analysis of over twenty mainstream advanced large models, identifying key risk patterns and their capability boundaries. The safety capability evaluation results reveals the widespread safety vulnerabilities of frontier AI across multiple pillars, particularly focusing on Risky Agentic Autonomy, AI4Science Safety, Embodied AI Safety, Social AI Safety and Catastrophic and Existential Risks. Our benchmark is released at https://github.com/Beijing-AISI/ForesightSafety-Bench. The project website is available at https://foresightsafety-bench.beijing-aisi.ac.cn/.
Chinese Translation
快速发展的人工智能展现出越来越强的自主性和目标导向能力,同时伴随着更不可预测、难以控制且可能不可逆的衍生系统风险。然而,目前的人工智能安全评估系统存在关键局限性,例如风险维度受限和前沿风险检测失败。滞后的安全基准和对齐技术难以应对尖端人工智能模型带来的复杂挑战。为填补这一空白,我们提出了“前瞻安全基准”(ForesightSafety Bench)人工智能安全评估框架,该框架以7个主要的基础安全支柱为起点,逐步扩展到先进的具身人工智能安全、AI4科学安全、社会与环境人工智能风险、灾难性与生存风险,以及8个关键工业安全领域,形成总共94个细化的风险维度。迄今为止,该基准已积累了数万个结构化风险数据点和评估结果,建立了一个广泛涵盖、层次清晰且动态演变的人工智能安全评估框架。基于此基准,我们对超过20个主流先进大型模型进行了系统评估和深入分析,识别出关键风险模式及其能力边界。安全能力评估结果揭示了前沿人工智能在多个支柱上的广泛安全脆弱性,特别关注风险代理自主性、AI4科学安全、具身人工智能安全、社会人工智能安全以及灾难性与生存风险。我们的基准已在 https://github.com/Beijing-AISI/ForesightSafety-Bench 发布。项目网站可访问 https://foresightsafety-bench.beijing-aisi.ac.cn/。
cs.AI / 79 / 2602.14160

Process-Supervised Multi-Agent Reinforcement Learning for Reliable Clinical Reasoning

过程监督的多智能体强化学习用于可靠的临床推理
Lee, Chaeeun, Yates, T. Michael, Minervini, Pasquale, Simpson, T. Ian
Abstract
Clinical decision-making requires nuanced reasoning over heterogeneous evidence and traceable justifications. While recent LLM multi-agent systems (MAS) show promise, they largely optimise for outcome accuracy while overlooking process-grounded reasoning aligned with clinical standards. One critical real-world case of this is gene-disease validity curation, where experts must determine whether a gene is causally implicated in a disease by synthesising diverse biomedical evidence. We introduce an agent-as-tool reinforcement learning framework for this task with two objectives: (i) process-level supervision to ensure reasoning follows valid clinical pathways, and (ii) efficient coordination via a hierarchical multi-agent system. Our evaluation on the ClinGen dataset shows that with outcome-only rewards, MAS with a GRPO-trained Qwen3-4B supervisor agent substantially improves final outcome accuracy from 0.195 with a base model supervisor to 0.732, but results in poor process alignment (0.392 F1). Conversely, with process + outcome rewards, MAS with GRPO-trained supervisor achieves higher outcome accuracy (0.750) while significantly improving process fidelity to 0.520 F1. Our code is available at https://github.com/chaeeunlee-io/GeneDiseaseCurationAgents.
Chinese Translation
临床决策需要对异质证据进行细致的推理和可追溯的论证。尽管最近的LLM多智能体系统(MAS)显示出良好的前景,但它们主要优化结果准确性,而忽视了与临床标准相一致的过程基础推理。一个关键的现实案例是基因-疾病有效性整理,在此过程中,专家必须通过综合多样的生物医学证据来确定一个基因是否在疾病中具有因果关系。我们为此任务引入了一种代理作为工具的强化学习框架,具有两个目标:(i)过程级监督,以确保推理遵循有效的临床路径;(ii)通过分层多智能体系统实现高效协调。我们在ClinGen数据集上的评估表明,使用仅基于结果的奖励,配备GRPO训练的Qwen3-4B监督代理的MAS将最终结果准确性从基模型监督的0.195显著提高至0.732,但导致过程对齐较差(0.392 F1)。相反,使用过程+结果奖励的MAS配备GRPO训练的监督代理则实现了更高的结果准确性(0.750),同时显著提高了过程保真度至0.520 F1。我们的代码可在https://github.com/chaeeunlee-io/GeneDiseaseCurationAgents获取。
cs.AI / 80 / 2602.14225

Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding

视觉之前的文本:阶段性知识注入对超高分辨率遥感理解中的自主强化学习可验证奖励的重要性
Wang, Fengxiang, Chen, Mingshuo, Li, Yueying, Yang, Yajie, Zhou, Yuhao, Wang, Di, Zhang, Yifan, Wang, Haoyu, Zhao, Haiyan, Sun, Hongda, Lan, Long, Song, Jun, Wang, Yulin, Zhang, Jing, Zhang, Wenlong, Du, Bo
Abstract
Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition: the model necessitates localizing tiny task-relevant regions in massive pixel spaces. While Agentic Reinforcement Learning with Verifiable Rewards (RLVR) using zoom-in tools offers a path forward, we find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. In this paper, we investigate the interplay between post-training paradigms: comparing Cold-start Supervised Fine-Tuning (SFT), RLVR, and Agentic RLVR on the UHR RS benchmark.Our controlled studies yield a counter-intuitive finding: high-quality Earth-science text-only QA is a primary driver of UHR visual reasoning gains. Despite lacking images, domain-specific text injects the concepts, mechanistic explanations, and decision rules necessary to guide visual evidence retrieval.Based on this, we propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures;and (2) "pre-warming" on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL. This approach achieves a 60.40% Pass@1 on XLRS-Bench, significantly outperforming larger general purpose models (e.g., GPT-5.2, Gemini 3.0 Pro, Intern-S1) and establishing a new state-of-the-art.
Chinese Translation
超高分辨率(UHR)遥感(RS)的多模态推理通常受到视觉证据获取的瓶颈:模型需要在庞大的像素空间中定位微小的任务相关区域。尽管使用放大工具的可验证奖励的自主强化学习(RLVR)提供了一条前进的路径,但我们发现标准强化学习在没有结构化领域先验的情况下难以在这些广阔的视觉空间中导航。本文探讨了后训练范式之间的相互作用:比较冷启动监督微调(SFT)、RLVR和自主RLVR在UHR RS基准上的表现。我们的控制研究得出了一个反直觉的发现:高质量的地球科学文本问答是UHR视觉推理提升的主要驱动因素。尽管缺乏图像,领域特定的文本注入了必要的概念、机制解释和决策规则,以指导视觉证据的检索。基于此,我们提出了一种阶段性知识注入的方案:(1)通过可扩展的、知识图谱验证的地球科学文本问答进行冷启动,以灌输推理结构;(2)在SFT期间对相同的困难UHR图像-文本示例进行“预热”,以稳定和增强后续基于工具的RL。该方法在XLRS-Bench上实现了60.40%的Pass@1,显著超越了更大的通用模型(例如,GPT-5.2、Gemini 3.0 Pro、Intern-S1),并建立了新的最先进水平。
cs.AI / 81 / 2602.14229

CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments

CORPGEN:在多层次任务环境中模拟具有自主数字员工的企业环境
Jaye, Abubakarr, Kumankumah, Nigel Boachie, Biringa, Chidera, Patel, Anjel Shaileshbhai, Vesal, Sulaiman, Julienne, Dayquan, Siska, Charlotte, Luján, Manuel Raúl Meléndez, Twum-Barimah, Anthony, Velazco, Mauricio, Chen, Tianwei
Abstract
Long-horizon reasoning is a key challenge for autonomous agents, yet existing benchmarks evaluate agents on single tasks in isolation. Real organizational work requires managing many concurrent long-horizon tasks with interleaving, dependencies, and reprioritization. We introduce Multi-Horizon Task Environments (MHTEs): a distinct problem class requiring coherent execution across dozens of interleaved tasks (45+, 500-1500+ steps) within persistent execution contexts spanning hours. We identify four failure modes that cause baseline CUAs to degrade from 16.7% to 8.7% completion as load scales 25% to 100%, a pattern consistent across three independent implementations. These failure modes are context saturation (O(N) vs O(1) growth), memory interference, dependency complexity (DAGs vs. chains), and reprioritization overhead. We present CorpGen, an architecture-agnostic framework addressing these failures via hierarchical planning for multi-horizon goal alignment, sub-agent isolation preventing cross-task contamination, tiered memory (working, structured, semantic), and adaptive summarization. CorpGen simulates corporate environments through digital employees with persistent identities and realistic schedules. Across three CUA backends (UFO2, OpenAI CUA, hierarchical) on OSWorld Office, CorpGen achieves up to 3.5x improvement over baselines (15.2% vs 4.3%) with stable performance under increasing load, confirming that gains stem from architectural mechanisms rather than specific CUA implementations. Ablation studies show experiential learning provides the largest gains.
Chinese Translation
长时间推理是自主智能体面临的关键挑战,然而现有基准测试仅在孤立的单一任务上评估智能体。真实的组织工作需要管理许多并发的长时间任务,这些任务之间存在交错、依赖关系和优先级重新调整。我们引入了多层次任务环境(Multi-Horizon Task Environments, MHTEs):一种独特的问题类别,要求在持续执行上下文中跨越数十个交错任务(超过45个任务,500-1500步)进行连贯执行,持续数小时。我们识别出四种导致基线CUA(Cognitive User Agent)在负载从25%增加到100%时完成率从16.7%降至8.7%的失效模式,这一模式在三个独立实现中一致存在。这些失效模式包括上下文饱和(O(N)与O(1)增长)、内存干扰、依赖复杂性(有向无环图与链)和优先级重新调整开销。我们提出了CorpGen,一个与架构无关的框架,通过层次规划解决这些失效问题,以实现多层次目标对齐、子智能体隔离以防止任务间污染、分级内存(工作内存、结构化内存、语义内存)和自适应摘要。CorpGen通过具有持久身份和现实日程的数字员工模拟企业环境。在OSWorld Office的三个CUA后端(UFO2、OpenAI CUA、层次化)中,CorpGen在基线之上实现了最高3.5倍的提升(15.2%对比4.3%),并在负载增加的情况下保持稳定性能,确认增益源于架构机制而非特定CUA实现。消融研究表明,经验学习提供了最大的收益。
cs.AI / 82 / 2602.14234

REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

REDSearcher:一个可扩展且成本高效的长视野搜索代理框架
Chu, Zheng, Wang, Xiao, Hong, Jack, Fan, Huiming, Huang, Yuqi, Yang, Yue, Xu, Guohai, Zhao, Chenxiao, Xiang, Cheng, Hu, Shengchao, Kuang, Dongdong, Liu, Ming, Qin, Bing, Yu, Xing
Abstract
Large language models are transitioning from generalpurpose knowledge engines to realworld problem solvers, yet optimizing them for deep search tasks remains challenging. The central bottleneck lies in the extreme sparsity of highquality search trajectories and reward signals, arising from the difficulty of scalable longhorizon task construction and the high cost of interactionheavy rollouts involving external tool calls. To address these challenges, we propose REDSearcher, a unified framework that codesigns complex task synthesis, midtraining, and posttraining for scalable searchagent optimization. Specifically, REDSearcher introduces the following improvements: (1) We frame task synthesis as a dualconstrained optimization, where task difficulty is precisely governed by graph topology and evidence dispersion, allowing scalable generation of complex, highquality tasks. (2) We introduce toolaugmented queries to encourage proactive tool use rather than passive recall.(3) During midtraining, we strengthen core atomic capabilities knowledge, planning, and function calling substantially reducing the cost of collecting highquality trajectories for downstream training. (4) We build a local simulated environment that enables rapid, lowcost algorithmic iteration for reinforcement learning experiments. Across both textonly and multimodal searchagent benchmarks, our approach achieves stateoftheart performance. To facilitate future research on longhorizon search agents, we will release 10K highquality complex text search trajectories, 5K multimodal trajectories and 1K text RL query set, and together with code and model checkpoints.
Chinese Translation
大型语言模型正从通用知识引擎转变为现实世界问题解决者,但优化它们以应对深度搜索任务仍然具有挑战性。主要瓶颈在于高质量搜索轨迹和奖励信号的极度稀疏,这源于可扩展的长视野任务构建的困难以及涉及外部工具调用的交互密集型回合的高成本。为了解决这些挑战,我们提出了REDSearcher,一个统一框架,旨在共同设计复杂任务合成、中期训练和后期训练,以实现可扩展的搜索代理优化。具体而言,REDSearcher引入了以下改进:(1)我们将任务合成框架设定为双约束优化,其中任务难度由图拓扑和证据分散精确控制,从而允许可扩展地生成复杂的高质量任务。(2)我们引入了工具增强查询,以鼓励主动使用工具而非被动回忆。(3)在中期训练中,我们增强了核心原子能力知识、规划和函数调用,显著降低了收集高质量轨迹用于下游训练的成本。(4)我们构建了一个本地模拟环境,使得强化学习实验能够快速、低成本地进行算法迭代。在文本和多模态搜索代理基准测试中,我们的方法达到了最先进的性能。为了促进未来对长视野搜索代理的研究,我们将发布1万条高质量复杂文本搜索轨迹、5000条多模态轨迹和1000条文本强化学习查询集,并提供代码和模型检查点。
cs.AI / 83 / 2602.14252

GRAIL: Goal Recognition Alignment through Imitation Learning

GRAIL:通过模仿学习实现目标识别对齐
Elhadad, Osher, Meneguzzi, Felipe, Mirsky, Reuth
Abstract
Understanding an agent's goals from its behavior is fundamental to aligning AI systems with human intentions. Existing goal recognition methods typically rely on an optimal goal-oriented policy representation, which may differ from the actor's true behavior and hinder the accurate recognition of their goal. To address this gap, this paper introduces Goal Recognition Alignment through Imitation Learning (GRAIL), which leverages imitation learning and inverse reinforcement learning to learn one goal-directed policy for each candidate goal directly from (potentially suboptimal) demonstration trajectories. By scoring an observed partial trajectory with each learned goal-directed policy in a single forward pass, GRAIL retains the one-shot inference capability of classical goal recognition while leveraging learned policies that can capture suboptimal and systematically biased behavior. Across the evaluated domains, GRAIL increases the F1-score by more than 0.5 under systematically biased optimal behavior, achieves gains of approximately 0.1-0.3 under suboptimal behavior, and yields improvements of up to 0.4 under noisy optimal trajectories, while remaining competitive in fully optimal settings. This work contributes toward scalable and robust models for interpreting agent goals in uncertain environments.
Chinese Translation
从代理的行为中理解其目标是将人工智能系统与人类意图对齐的基础。现有的目标识别方法通常依赖于最优的目标导向策略表示,这可能与行为者的真实行为存在差异,从而妨碍其目标的准确识别。为了解决这一问题,本文提出了通过模仿学习实现目标识别对齐(Goal Recognition Alignment through Imitation Learning,GRAIL),该方法利用模仿学习和逆强化学习直接从(可能是次优的)演示轨迹中为每个候选目标学习一个目标导向策略。通过在单次前向传播中使用每个学习到的目标导向策略对观察到的部分轨迹进行评分,GRAIL保留了经典目标识别的一次性推理能力,同时利用能够捕捉次优和系统性偏差行为的学习策略。在评估的各个领域中,GRAIL在系统性偏差的最优行为下将F1-score提高了超过0.5,在次优行为下实现了约0.1-0.3的增益,并在噪声最优轨迹下提高了最多0.4的性能,同时在完全最优的设置中保持竞争力。这项工作为在不确定环境中解释代理目标的可扩展和稳健模型做出了贡献。
cs.AI / 84 / 2602.14296

AutoWebWorld: Synthesizing Infinite Verifiable Web Environments via Finite State Machines

AutoWebWorld:通过有限状态机合成无限可验证的网络环境
Wu, Yifan, Peng, Yiran, Chen, Yiyu, Ruan, Jianhao, Zhuang, Zijie, Yang, Cheng, Zhang, Jiayi, Chen, Man, Tseng, Yenchi, Yu, Zhaoyang, Chen, Liang, Zhai, Yuyao, Liu, Bang, Wu, Chenglin, Luo, Yuyu
Abstract
The performance of autonomous Web GUI agents heavily relies on the quality and quantity of their training data. However, a fundamental bottleneck persists: collecting interaction trajectories from real-world websites is expensive and difficult to verify. The underlying state transitions are hidden, leading to reliance on inconsistent and costly external verifiers to evaluate step-level correctness. To address this, we propose AutoWebWorld, a novel framework for synthesizing controllable and verifiable web environments by modeling them as Finite State Machines (FSMs) and use coding agents to translate FSMs into interactive websites. Unlike real websites, where state transitions are implicit, AutoWebWorld explicitly defines all states, actions, and transition rules. This enables programmatic verification: action correctness is checked against predefined rules, and task success is confirmed by reaching a goal state in the FSM graph. AutoWebWorld enables a fully automated search-and-verify pipeline, generating over 11,663 verified trajectories from 29 diverse web environments at only $0.04 per trajectory. Training on this synthetic data significantly boosts real-world performance. Our 7B Web GUI agent outperforms all baselines within 15 steps on WebVoyager. Furthermore, we observe a clear scaling law: as the synthetic data volume increases, performance on WebVoyager and Online-Mind2Web consistently improves.
Chinese Translation
自主网络图形用户界面(GUI)代理的性能在很大程度上依赖于其训练数据的质量和数量。然而,仍然存在一个根本性的瓶颈:从真实网站收集交互轨迹既昂贵又难以验证。潜在的状态转换是隐含的,这导致依赖不一致且成本高昂的外部验证者来评估逐步的正确性。为了解决这个问题,我们提出了AutoWebWorld,一个通过将网络环境建模为有限状态机(Finite State Machines, FSMs)来合成可控和可验证的网络环境的新框架,并使用编码代理将FSM转换为交互式网站。与真实网站中隐含的状态转换不同,AutoWebWorld明确定义了所有状态、动作和转换规则。这使得程序化验证成为可能:动作的正确性根据预定义规则进行检查,任务的成功通过在FSM图中达到目标状态来确认。AutoWebWorld实现了一个完全自动化的搜索与验证管道,从29个多样化的网络环境中生成超过11,663条经过验证的轨迹,每条轨迹的成本仅为0.04美元。在这些合成数据上进行训练显著提升了在真实世界中的表现。我们的7B Web GUI代理在WebVoyager上在15步内超越了所有基准。此外,我们观察到一个明显的规模法则:随着合成数据量的增加,在WebVoyager和Online-Mind2Web上的性能持续改善。
cs.AI / 85 / 2602.14307

Benchmarking at the Edge of Comprehension

理解边界的基准测试
Marro, Samuele, Yu, Jialin, La Malfa, Emanuele, Deb, Oishi, Li, Jiawei, Yang, Yibo, Abraham, Ebey, Sengupta, Sunando, Sommerlade, Eric, Wooldridge, Michael, Torr, Philip
Abstract
As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.
Chinese Translation
随着前沿大型语言模型(LLMs)在新基准发布后迅速达到饱和,基准测试本身正处于一个关键时刻:如果前沿模型持续改进,人类将越来越难以生成具有区分性的任务、提供准确的真实答案或评估复杂的解决方案。如果基准测试变得不可行,我们衡量人工智能进步的能力将岌岌可危。我们将这种情形称为后理解阶段。在本研究中,我们提出了抗批评基准测试(Critique-Resilient Benchmarking),这是一种对抗性框架,旨在比较模型,即使在完全人类理解不可行的情况下。我们的方法依赖于抗批评正确性的概念:如果没有对手令人信服地证明答案是错误的,则该答案被视为正确。与标准基准测试不同,人类作为有限的验证者,专注于局部主张,这在超越对任务的完全理解的情况下保持了评估的完整性。通过使用逐项双边Bradley-Terry模型,我们共同对LLMs进行排名,依据它们解决挑战性任务和生成困难但可解问题的能力。我们展示了我们的方法在数学领域的有效性,涵盖了八个前沿LLMs,结果表明得分稳定且与外部能力测量相关。我们的框架将基准测试重新构建为一种对抗性生成-评估游戏,其中人类作为最终裁决者。
cs.AI / 86 / 2602.14370

Competition for attention predicts good-to-bad tipping in AI

注意力竞争预测人工智能中的良性到恶性转变
Johnson, Neil F., Huo, Frank Y.
Abstract
More than half the global population now carries devices that can run ChatGPT-like language models with no Internet connection and minimal safety oversight -- and hence the potential to promote self-harm, financial losses and extremism among other dangers. Existing safety tools either require cloud connectivity or discover failures only after harm has occurred. Here we show that a large class of potentially dangerous tipping originates at the atomistic scale in such edge AI due to competition for the machinery's attention. This yields a mathematical formula for the dynamical tipping point n*, governed by dot-product competition for attention between the conversation's context and competing output basins, that reveals new control levers. Validated against multiple AI models, the mechanism can be instantiated for different definitions of 'good' and 'bad' and hence in principle applies across domains (e.g. health, law, finance, defense), changing legal landscapes (e.g. EU, UK, US and state level), languages, and cultural settings.
Chinese Translation
如今,全球超过一半的人口携带能够在没有互联网连接和最小安全监督的情况下运行类似ChatGPT的语言模型的设备,因此可能会促进自我伤害、经济损失和极端主义等多种危险。现有的安全工具要么需要云连接,要么只能在伤害发生后发现故障。在这里,我们展示了一类潜在危险的转变是如何在这种边缘人工智能的原子尺度上源于对机器注意力的竞争。这产生了一个动态转折点n*的数学公式,该公式由对话上下文与竞争输出盆之间的点积注意力竞争所主导,揭示了新的控制杠杆。经过多个人工智能模型的验证,该机制可以针对“好”和“坏”的不同定义进行实例化,因此原则上适用于多个领域(例如健康、法律、金融、国防)、变化的法律环境(例如欧盟、英国、美国及州级别)、语言和文化背景。
cs.AI / 87 / 2602.14404

Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces

圆形面包还是法棍?关于任务拓扑、长度泛化及推理痕迹的益处研究
Tong, William L., Cakar, Ege, Pehlevan, Cengiz
Abstract
Recent years have witnessed meteoric progress in reasoning models: neural networks that generate intermediate reasoning traces (RTs) before producing a final output. Despite the rapid advancement, our understanding of how RTs support reasoning, and the limits of this paradigm, remain incomplete. To promote greater clarity, we introduce PITA: a novel large-scale dataset of over 23 million statements in propositional logic and their corresponding proofs. As a benchmark for robust reasoning, we focus on length generalization: if a model is trained to determine truth or falsity on statements with proofs up to fixed length, how well does it generalize to statements requiring longer proofs? We propose notions of (1) task depth and (2) task breadth, which measure respectively (1) the number of steps required to solve an example from a task and (2) the number of unique examples across a task. We vary these quantities across subsets of PITA, and find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non-RT baselines. To determine whether our results are idiosyncratic to PITA or indicative of general phenomena, we compare our results to a simple synthetic task based on syllogisms. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks. Our findings overall identify fundamental benefits and limitations inherent in using reasoning traces.
Chinese Translation
近年来,推理模型取得了飞速进展:神经网络在生成最终输出之前,会生成中间推理痕迹(RTs)。尽管快速发展,但我们对推理痕迹如何支持推理以及该范式的局限性仍然了解不够。为促进更清晰的理解,我们引入了PITA:一个包含超过2300万条命题逻辑语句及其对应证明的新型大规模数据集。作为稳健推理的基准,我们关注长度泛化:如果一个模型经过训练以确定固定长度证明的语句的真伪,那么它在需要更长证明的语句上泛化的效果如何?我们提出了(1)任务深度和(2)任务广度的概念,分别测量(1)解决任务示例所需的步骤数量和(2)任务中独特示例的数量。我们在PITA的子集上变化这些数量,并发现RT模型在广泛且浅层的子集上泛化良好,而在狭窄且深层的子集上相较于非RT基线则表现不佳。为了确定我们的结果是PITA特有的还是指示一般现象,我们将结果与基于三段论的简单合成任务进行了比较。我们的理论结果表明了限制RT模型在深层任务表现的基本尺度,并突显了它们在广泛任务上的泛化优势。总体而言,我们的发现识别了使用推理痕迹固有的基本益处和局限性。
cs.AI / 88 / 2602.14451

Precedent-Informed Reasoning: Mitigating Overthinking in Large Reasoning Models via Test-Time Precedent Learning

基于先例的推理:通过测试时先例学习减轻大型推理模型中的过度思考
Wang, Qianyue, Hu, Jinwu, Lin, Huanxiang, Chen, Bolin, Wen, Zhiquan, Chen, Yaofo, Rong, Yu, Tan, Mingkui
Abstract
Reasoning in Large Language Models (LLMs) often suffers from inefficient long chain-of-thought traces with redundant self-exploration and validation, which inflate computational costs and even degrade performance. Inspired by human reasoning patterns where people solve new problems by leveraging past related cases to constrain search spaces and reduce trial-and-error, we propose Precedent Informed Reasoning (PIR) transforming LRMs'reasoning paradigm from exhaustive self-exploration to guided learning from precedents. PIR addresses two key challenges: what precedents to adopt and how to utilize them. First, Adaptive Precedent Selection (APS) constructs, for each question and LRM, a compact set of precedents that are both semantically related and informative for the model. It ranks examples by a joint score with semantic similarity and model perplexity, then adapts the amount of precedents to maximize perplexity reduction. Second, Test-time Experience Internalization (TEI) is treated as the test-time learning on precedent-informed instruction, updating lightweight adapters to internalize solution patterns and use them as a prior during subsequent reasoning. Experiments across mathematical reasoning, scientific QA, and code generation demonstrate that PIR consistently shortens reasoning traces while maintaining or improving final accuracy across LLMs, yielding outstanding accuracy-efficiency trade-offs.
Chinese Translation
大型语言模型(LLMs)的推理常常面临低效的长链思维过程,伴随冗余的自我探索和验证,这不仅增加了计算成本,甚至可能降低性能。受到人类推理模式的启发,人们通过利用过去相关案例来解决新问题,从而限制搜索空间并减少试错,我们提出了基于先例的推理(Precedent Informed Reasoning, PIR),将大型推理模型的推理范式从全面的自我探索转变为从先例中引导学习。PIR解决了两个关键挑战:采用哪些先例以及如何利用它们。首先,自适应先例选择(Adaptive Precedent Selection, APS)为每个问题和大型推理模型构建一个紧凑的先例集合,这些先例在语义上相关且对模型具有信息价值。它通过语义相似度和模型困惑度的联合评分对示例进行排名,然后调整先例的数量以最大化困惑度的降低。其次,测试时经验内化(Test-time Experience Internalization, TEI)被视为在基于先例的指令上的测试时学习,更新轻量级适配器以内化解决模式,并在后续推理中将其作为先验。跨数学推理、科学问答和代码生成的实验表明,PIR在保持或提高最终准确率的同时,始终缩短推理过程,提供了卓越的准确性与效率的权衡。
cs.AI / 89 / 2602.14457

Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5

前沿人工智能风险管理框架实践:风险分析技术报告 v1.5
Liu, Dongrui, Yu, Yi, Zhang, Jie, Chen, Guanxu, Lin, Qihao, Zhu, Hanxi, Huang, Lige, Zhou, Yijin, Wang, Peng, Shao, Shuai, Zhang, Boxuan, Liu, Zicheng, Sun, Jingwei, Li, Yu, Xie, Yuejin, Guo, Jiaxuan, Xu, Jia, Lu, Chaochao, Zhou, Bowen, Hu, Xia, Shao, Jing
Abstract
To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, Frontier AI Risk Management Framework in Practice presents a comprehensive assessment of their frontier risks. As Large Language Models (LLMs) general capabilities rapidly evolve and the proliferation of agentic AI, this version of the risk analysis technical report presents an updated and granular assessment of five critical dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R\&D, and self-replication. Specifically, we introduce more complex scenarios for cyber offense. For persuasion and manipulation, we evaluate the risk of LLM-to-LLM persuasion on newly released LLMs. For strategic deception and scheming, we add the new experiment with respect to emergent misalignment. For uncontrolled AI R\&D, we focus on the ``mis-evolution'' of agents as they autonomously expand their memory substrates and toolsets. Besides, we also monitor and evaluate the safety performance of OpenClaw during the interaction on the Moltbook. For self-replication, we introduce a new resource-constrained scenario. More importantly, we propose and validate a series of robust mitigation strategies to address these emerging threats, providing a preliminary technical and actionable pathway for the secure deployment of frontier AI. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
Chinese Translation
为了理解和识别快速发展的人工智能(AI)模型所带来的前所未有的风险,《前沿人工智能风险管理框架实践》对其前沿风险进行了全面评估。随着大型语言模型(LLMs)通用能力的迅速演变以及自主AI的普及,本版本的风险分析技术报告提供了五个关键维度的更新和细致评估:网络攻击、说服与操控、战略欺骗、失控的AI研发以及自我复制。具体而言,我们为网络攻击引入了更复杂的场景。在说服与操控方面,我们评估了LLM对新发布的LLMs进行说服的风险。在战略欺骗和策划方面,我们增加了关于新出现的不一致性的新实验。对于失控的AI研发,我们关注于代理在自主扩展其记忆基底和工具集时的“错误进化”。此外,我们还监测和评估了OpenClaw在Moltbook上的交互过程中的安全性能。对于自我复制,我们引入了一个新的资源受限场景。更重要的是,我们提出并验证了一系列强有力的缓解策略,以应对这些新出现的威胁,为前沿AI的安全部署提供初步的技术和可操作路径。这项工作反映了我们对AI前沿风险的当前理解,并呼吁集体行动以减轻这些挑战。
cs.AI / 90 / 2602.14503

Bounding Probabilities of Causation with Partial Causal Diagrams

使用部分因果图界定因果概率
Xie, Yuxuan, Li, Ang
Abstract
Probabilities of causation are fundamental to individual-level explanation and decision making, yet they are inherently counterfactual and not point-identifiable from data in general. Existing bounds either disregard available covariates, require complete causal graphs, or rely on restrictive binary settings, limiting their practical use. In real-world applications, causal information is often partial but nontrivial. This paper proposes a general framework for bounding probabilities of causation using partial causal information. We show how the available structural or statistical information can be systematically incorporated as constraints in a optimization programming formulation, yielding tighter and formally valid bounds without full identifiability. This approach extends the applicability of probabilities of causation to realistic settings where causal knowledge is incomplete but informative.
Chinese Translation
因果概率对于个体层面的解释和决策至关重要,但它们本质上是反事实的,通常无法从数据中明确识别。现有的界限要么忽视可用的协变量,要么要求完整的因果图,或者依赖于限制性的二元设置,从而限制了它们的实际应用。在现实应用中,因果信息往往是部分的但并非微不足道。本文提出了一种使用部分因果信息界定因果概率的通用框架。我们展示了如何将可用的结构性或统计信息系统性地作为约束纳入优化编程模型,从而在不完全可识别的情况下获得更紧凑和形式上有效的界限。这种方法扩展了因果概率在因果知识不完整但信息丰富的现实环境中的适用性。
cs.AI / 91 / 2602.14505

Formally Verifying and Explaining Sepsis Treatment Policies with COOL-MC

使用 COOL-MC 正式验证和解释脓毒症治疗政策
Gross, Dennis
Abstract
Safe and interpretable sequential decision-making is critical in healthcare, yet reinforcement learning (RL) policies for sepsis treatment optimization remain opaque and difficult to verify. Standard probabilistic model checkers operate on the full state space, which becomes infeasible for larger MDPs, and cannot explain why a learned policy makes particular decisions. COOL-MC wraps the model checker Storm but adds three key capabilities: it constructs only the reachable state space induced by a trained policy, yielding a smaller discrete-time Markov chain amenable to verification even when full-MDP analysis is intractable; it automatically labels states with clinically meaningful atomic propositions; and it integrates explainability methods with probabilistic computation tree logic (PCTL) queries to reveal which features drive decisions across treatment trajectories. We demonstrate COOL-MC's capabilities on the ICU-Sepsis MDP, a benchmark derived from approximately 17,000 sepsis patient records, which serves as a case study for applying COOL-MC to the formal analysis of sepsis treatment policies. Our analysis establishes hard bounds via full MDP verification, trains a safe RL policy that achieves optimal survival probability, and analyzes its behavior via PCTL verification and explainability on the induced DTMC. This reveals, for instance, that our trained policy relies predominantly on prior dosing history rather than the patient's evolving condition, a weakness that is invisible to standard evaluation but is exposed by COOL-MC's integration of formal verification and explainability. Our results illustrate how COOL-MC could serve as a tool for clinicians to investigate and debug sepsis treatment policies before deployment.
Chinese Translation
在医疗保健中,安全且可解释的顺序决策至关重要,然而,针对脓毒症治疗优化的强化学习(RL)政策仍然不透明且难以验证。标准的概率模型检查器在完整状态空间上运行,这在较大的马尔可夫决策过程(MDP)中变得不可行,并且无法解释学习到的政策为何做出特定决策。COOL-MC 包装了模型检查器 Storm,但增加了三个关键功能:它仅构建由训练政策引发的可达状态空间,从而生成一个更小的离散时间马尔可夫链,即使在全 MDP 分析不可行时也便于验证;它自动为状态标记具有临床意义的原子命题;并且它将可解释性方法与概率计算树逻辑(PCTL)查询集成,以揭示哪些特征驱动治疗轨迹中的决策。我们在 ICU-Sepsis MDP 上演示了 COOL-MC 的能力,该基准源自大约 17,000 份脓毒症患者记录,作为将 COOL-MC 应用于脓毒症治疗政策正式分析的案例研究。我们的分析通过完整 MDP 验证建立了严格的界限,训练了一个安全的 RL 政策,实现了最佳生存概率,并通过 PCTL 验证和对诱导的 DTMC 的可解释性分析其行为。这揭示了,例如,我们训练的政策主要依赖于先前的给药历史,而不是患者不断变化的病情,这一弱点在标准评估中是不可见的,但通过 COOL-MC 的形式验证和可解释性集成得以暴露。我们的结果说明了 COOL-MC 如何作为临床医生在部署前调查和调试脓毒症治疗政策的工具。
cs.AI / 92 / 2602.14518

Diagnosing Knowledge Conflict in Multimodal Long-Chain Reasoning

多模态长链推理中的知识冲突诊断
Tang, Jing, Wang, Kun, Lu, Haolang, Chen, Hongjin, Chen, KaiTao, Sun, Zhongxiang, Li, Qiankun, Lyu, Lingjuan, Nan, Guoshun, Zeng, Zhigang
Abstract
Multimodal large language models (MLLMs) in long chain-of-thought reasoning often fail when different knowledge sources provide conflicting signals. We formalize these failures under a unified notion of knowledge conflict, distinguishing input-level objective conflict from process-level effective conflict. Through probing internal representations, we reveal that: (I) Linear Separability: different conflict types are explicitly encoded as linearly separable features rather than entangled; (II) Depth Localization: conflict signals concentrate in mid-to-late layers, indicating a distinct processing stage for conflict encoding; (III) Hierarchical Consistency: aggregating noisy token-level signals along trajectories robustly recovers input-level conflict types; and (IV) Directional Asymmetry: reinforcing the model's implicit source preference under conflict is far easier than enforcing the opposite source. Our findings provide a mechanism-level view of multimodal reasoning under knowledge conflict and enable principled diagnosis and control of long-CoT failures.
Chinese Translation
在长链思维推理中,多模态大型语言模型(MLLMs)在不同知识来源提供冲突信号时常常失败。我们在统一的知识冲突概念下对这些失败进行了形式化,区分了输入层面的客观冲突与过程层面的有效冲突。通过探测内部表征,我们揭示了以下几点:(I) 线性可分性:不同类型的冲突被明确编码为线性可分特征,而非纠缠在一起;(II) 深度定位:冲突信号集中在中后层,表明冲突编码有一个独特的处理阶段;(III) 层次一致性:沿着轨迹聚合噪声的标记级信号能够稳健地恢复输入层面的冲突类型;(IV) 方向性不对称:在冲突情况下,强化模型对隐含来源的偏好远比强化相反来源要容易。我们的发现提供了在知识冲突下多模态推理的机制层面视角,并使得对长链思维失败的原则性诊断和控制成为可能。
cs.AI / 93 / 2602.14529

Disentangling Deception and Hallucination Failures in LLMs

解构大型语言模型中的欺骗与幻觉失败
Lu, Haolang, Peng, Hongrui, Fu, WeiYe, Nan, Guoshun, Cao, Xinye, Li, Xingrui, Guo, Hongcan, Wang, Kun
Abstract
Failures in large language models (LLMs) are often analyzed from a behavioral perspective, where incorrect outputs in factual question answering are commonly associated with missing knowledge. In this work, focusing on entity-based factual queries, we suggest that such a view may conflate different failure mechanisms, and propose an internal, mechanism-oriented perspective that separates Knowledge Existence from Behavior Expression. Under this formulation, hallucination and deception correspond to two qualitatively different failure modes that may appear similar at the output level but differ in their underlying mechanisms. To study this distinction, we construct a controlled environment for entity-centric factual questions in which knowledge is preserved while behavioral expression is selectively altered, enabling systematic analysis of four behavioral cases. We analyze these failure modes through representation separability, sparse interpretability, and inference-time activation steering.
Chinese Translation
大型语言模型(LLMs)的失败通常从行为角度进行分析,其中在事实问答中的错误输出常常与缺乏知识相关联。在本研究中,我们聚焦于基于实体的事实查询,认为这种观点可能混淆了不同的失败机制,并提出了一种内部的、机制导向的视角,将知识存在与行为表达区分开来。在这种表述下,幻觉和欺骗对应于两种质上不同的失败模式,这些模式在输出层面上可能看起来相似,但在其底层机制上却存在差异。为了研究这种区别,我们构建了一个针对以实体为中心的事实问题的受控环境,在该环境中知识得以保留,而行为表达则被选择性地改变,从而实现对四种行为案例的系统分析。我们通过表示可分离性、稀疏可解释性和推理时激活引导来分析这些失败模式。
cs.AI / 94 / 2602.14589

MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs

MATEO:用于大规模视觉语言模型的时间推理与规划的多模态基准
Roccabruna, Gabriel, Khomyn, Olha, Riccardi, Giuseppe
Abstract
AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Execution Order (TEO, a directed acyclic graph that ensures each step executes only after its preconditions are satisfied. Existing research on foundational models' understanding of temporal execution is limited to automatically derived annotations, approximations of the TEO as a linear chain, or text-only inputs. To address this gap, we introduce MATEO (MultimodAl Temporal Execution Order), a benchmark designed to assess and improve the temporal reasoning abilities of Large Vision Language Models (LVLMs) required for real-world planning. We acquire a high-quality professional multimodal recipe corpus, authored through a standardized editorial process that decomposes instructions into discrete steps, each paired with corresponding images. We collect TEO annotations as graphs by designing and using a scalable crowdsourcing pipeline. Using MATEO, we evaluate six state-of-the-art LVLMs across model scales, varying language context, multimodal input structure, and fine-tuning strategies.
Chinese Translation
人工智能代理需要进行规划,以实现涉及感知、子目标分解和执行的复杂目标。这些计划由按照时间执行顺序(Temporal Execution Order, TEO)结构化的有序步骤组成,TEO 是一个有向无环图,确保每个步骤仅在其前置条件满足后执行。现有关于基础模型理解时间执行的研究仅限于自动生成的注释、将 TEO 近似为线性链或仅使用文本输入。为了解决这一空白,我们引入了 MATEO(MultimodAl Temporal Execution Order),这是一个旨在评估和提升大规模视觉语言模型(Large Vision Language Models, LVLMs)在现实世界规划中所需的时间推理能力的基准。我们通过标准化的编辑过程获取了高质量的专业多模态食谱语料库,将指令分解为离散步骤,并为每个步骤配对相应的图像。我们通过设计和使用可扩展的众包管道收集了作为图形的 TEO 注释。利用 MATEO,我们评估了六个最先进的 LVLM,在模型规模、语言上下文、多模态输入结构和微调策略等方面进行了变化。
cs.AI / 95 / 2602.14622

Tabular Foundation Models Can Learn Association Rules

表格基础模型能够学习关联规则
Karabulut, Erkan, Daza, Daniel, Groth, Paul, Schut, Martijn C., Degeler, Victoria
Abstract
Association Rule Mining (ARM) is a fundamental task for knowledge discovery in tabular data and is widely used in high-stakes decision-making. Classical ARM methods rely on frequent itemset mining, leading to rule explosion and poor scalability, while recent neural approaches mitigate these issues but suffer from degraded performance in low-data regimes. Tabular foundation models (TFMs), pretrained on diverse tabular data with strong in-context generalization, provide a basis for addressing these limitations. We introduce a model-agnostic association rule learning framework that extracts association rules from any conditional probabilistic model over tabular data, enabling us to leverage TFMs. We then introduce TabProbe, an instantiation of our framework that utilizes TFMs as conditional probability estimators to learn association rules out-of-the-box without frequent itemset mining. We evaluate our approach on tabular datasets of varying sizes based on standard ARM rule quality metrics and downstream classification performance. The results show that TFMs consistently produce concise, high-quality association rules with strong predictive performance and remain robust in low-data settings without task-specific training. Source code is available at https://github.com/DiTEC-project/tabprobe.
Chinese Translation
关联规则挖掘(ARM)是表格数据知识发现的基础任务,广泛应用于高风险决策中。传统的ARM方法依赖于频繁项集挖掘,导致规则爆炸和可扩展性差,而最近的神经网络方法虽然缓解了这些问题,但在低数据环境下性能下降。表格基础模型(TFMs)在多样化的表格数据上进行预训练,具备强大的上下文泛化能力,为解决这些局限性提供了基础。我们提出了一种模型无关的关联规则学习框架,该框架能够从任何条件概率模型中提取关联规则,从而使我们能够利用TFMs。接着,我们介绍了TabProbe,这是我们框架的一个实例,利用TFMs作为条件概率估计器,能够直接学习关联规则,无需频繁项集挖掘。我们在不同规模的表格数据集上评估了我们的方法,基于标准的ARM规则质量指标和下游分类性能。结果表明,TFMs始终能够生成简洁、高质量的关联规则,具备强大的预测性能,并在低数据环境下保持稳健,无需特定任务的训练。源代码可在 https://github.com/DiTEC-project/tabprobe 获取。
cs.AI / 96 / 2602.14643

Arbor: A Framework for Reliable Navigation of Critical Conversation Flows

Arbor:一个可靠导航关键对话流程的框架
Silva, Luís, Gonçalves, Diogo, Farinha, Catarina, Matos, Clara, Ungaro, Luís
Abstract
Large language models struggle to maintain strict adherence to structured workflows in high-stakes domains such as healthcare triage. Monolithic approaches that encode entire decision structures within a single prompt are prone to instruction-following degradation as prompt length increases, including lost-in-the-middle effects and context window overflow. To address this gap, we present Arbor, a framework that decomposes decision tree navigation into specialized, node-level tasks. Decision trees are standardized into an edge-list representation and stored for dynamic retrieval. At runtime, a directed acyclic graph (DAG)-based orchestration mechanism iteratively retrieves only the outgoing edges of the current node, evaluates valid transitions via a dedicated LLM call, and delegates response generation to a separate inference step. The framework is agnostic to the underlying decision logic and model provider. Evaluated against single-prompt baselines across 10 foundation models using annotated turns from real clinical triage conversations. Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and achieves an average 14.4x reduction in per-turn cost. These results indicate that architectural decomposition reduces dependence on intrinsic model capability, enabling smaller models to match or exceed larger models operating under single-prompt baselines.
Chinese Translation
大型语言模型在高风险领域(如医疗分诊)中难以严格遵循结构化工作流程。将整个决策结构编码在单一提示中的单体方法,随着提示长度的增加,容易出现指令遵循能力下降,包括中途失效和上下文窗口溢出等问题。为了解决这一问题,我们提出了Arbor,一个将决策树导航分解为专门节点级任务的框架。决策树被标准化为边列表表示,并存储以便动态检索。在运行时,基于有向无环图(DAG)的编排机制迭代地仅检索当前节点的出边,通过专门的LLM调用评估有效的转移,并将响应生成委托给单独的推理步骤。该框架与底层决策逻辑和模型提供者无关。在使用真实临床分诊对话的标注轮次对10个基础模型进行的评估中,Arbor将平均轮次准确率提高了29.4个百分点,减少了每轮延迟57.1%,并实现了每轮成本平均降低14.4倍。这些结果表明,架构分解减少了对内在模型能力的依赖,使得较小的模型能够匹配或超越在单一提示基线下运行的较大模型。
cs.AI / 97 / 2602.14674

From User Preferences to Base Score Extraction Functions in Gradual Argumentation

从用户偏好到渐进论证中的基础分数提取函数
Civit, Aniol, Rago, Antonio, Andriella, Antonio, Alenyà, Guillem, Toni, Francesca
Abstract
Gradual argumentation is a field of symbolic AI which is attracting attention for its ability to support transparent and contestable AI systems. It is considered a useful tool in domains such as decision-making, recommendation, debate analysis, and others. The outcomes in such domains are usually dependent on the arguments' base scores, which must be selected carefully. Often, this selection process requires user expertise and may not always be straightforward. On the other hand, organising the arguments by preference could simplify the task. In this work, we introduce \emph{Base Score Extraction Functions}, which provide a mapping from users' preferences over arguments to base scores. These functions can be applied to the arguments of a \emph{Bipolar Argumentation Framework} (BAF), supplemented with preferences, to obtain a \emph{Quantitative Bipolar Argumentation Framework} (QBAF), allowing the use of well-established computational tools in gradual argumentation. We outline the desirable properties of base score extraction functions, discuss some design choices, and provide an algorithm for base score extraction. Our method incorporates an approximation of non-linearities in human preferences to allow for better approximation of the real ones. Finally, we evaluate our approach both theoretically and experimentally in a robotics setting, and offer recommendations for selecting appropriate gradual semantics in practice.
Chinese Translation
渐进论证是一个符号人工智能领域,因其支持透明和可争议的人工智能系统而受到关注。它被视为在决策、推荐、辩论分析等领域中的有用工具。这些领域的结果通常依赖于论证的基础分数,而这些分数必须经过仔细选择。通常,这一选择过程需要用户的专业知识,并且可能并不总是简单。另一方面,通过偏好组织论证可能会简化这一任务。在本研究中,我们引入了基础分数提取函数(Base Score Extraction Functions),该函数提供了用户对论证的偏好与基础分数之间的映射。这些函数可以应用于带有偏好的双极论证框架(Bipolar Argumentation Framework, BAF)的论证上,以获得定量双极论证框架(Quantitative Bipolar Argumentation Framework, QBAF),从而在渐进论证中使用成熟的计算工具。我们概述了基础分数提取函数的期望特性,讨论了一些设计选择,并提供了基础分数提取的算法。我们的方法结合了对人类偏好非线性的近似,以更好地逼近真实偏好。最后,我们在机器人设置中对我们的方法进行了理论和实验评估,并提供了在实践中选择适当渐进语义的建议。
cs.AI / 98 / 2602.14676

GREAT-EER: Graph Edge Attention Network for Emergency Evacuation Responses

GREAT-EER:用于紧急疏散响应的图边注意力网络
Lischka, Attila, Kulcsár, Balázs
Abstract
Emergency situations that require the evacuation of urban areas can arise from man-made causes (e.g., terrorist attacks or industrial accidents) or natural disasters, the latter becoming more frequent due to climate change. As a result, effective and fast methods to develop evacuation plans are of great importance. In this work, we identify and propose the Bus Evacuation Orienteering Problem (BEOP), an NP-hard combinatorial optimization problem with the goal of evacuating as many people from an affected area by bus in a short, predefined amount of time. The purpose of bus-based evacuation is to reduce congestion and disorder that arises in purely car-focused evacuation scenarios. To solve the BEOP, we propose a deep reinforcement learning-based method utilizing graph learning, which, once trained, achieves fast inference speed and is able to create evacuation routes in fractions of seconds. We can bound the gap of our evacuation plans using an MILP formulation. To validate our method, we create evacuation scenarios for San Francisco using real-world road networks and travel times. We show that we achieve near-optimal solution quality and are further able to investigate how many evacuation vehicles are necessary to achieve certain bus-based evacuation quotas given a predefined evacuation time while keeping run time adequate.
Chinese Translation
紧急情况可能由于人为原因(例如,恐怖袭击或工业事故)或自然灾害而导致城市地区的疏散,后者由于气候变化而变得更加频繁。因此,制定有效且快速的疏散计划的方法至关重要。在本研究中,我们识别并提出了公交疏散定向问题(Bus Evacuation Orienteering Problem, BEOP),这是一个NP难的组合优化问题,其目标是在短时间内通过公交车从受影响区域疏散尽可能多的人。基于公交的疏散旨在减少在纯粹以汽车为中心的疏散场景中产生的拥堵和混乱。为了解决BEOP,我们提出了一种基于深度强化学习的方法,利用图学习,一旦训练完成,能够实现快速推理速度,并在几秒钟内生成疏散路线。我们可以通过混合整数线性规划(MILP)形式化来界定我们的疏散计划的差距。为了验证我们的方法,我们使用真实的道路网络和旅行时间为旧金山创建了疏散场景。我们展示了我们达到了接近最优的解决方案质量,并进一步探讨了在预定义的疏散时间内,为实现特定的公交疏散配额所需的疏散车辆数量,同时保持运行时间的适宜性。
cs.AI / 99 / 2602.14691

Removing Planner Bias in Goal Recognition Through Multi-Plan Dataset Generation

通过多计划数据集生成消除目标识别中的规划者偏见
Abdelwahed, Mustafa F., Gusmao, Felipe Meneguzzi Kin Max Piamolini, Espasa, Joan
Abstract
Autonomous agents require some form of goal and plan recognition to interact in multiagent settings. Unfortunately, all existing goal recognition datasets suffer from a systematical bias induced by the planning systems that generated them, namely heuristic-based forward search. This means that existing datasets lack enough challenge for more realistic scenarios (e.g., agents using different planners), which impacts the evaluation of goal recognisers with respect to using different planners for the same goal. In this paper, we propose a new method that uses top-k planning to generate multiple, different, plans for the same goal hypothesis, yielding benchmarks that mitigate the bias found in the current dataset. This allows us to introduce a new metric called Version Coverage Score (VCS) to measure the resilience of the goal recogniser when inferring a goal based on different sets of plans. Our results show that the resilience of the current state-of-the-art goal recogniser degrades substantially under low observability settings.
Chinese Translation
自主代理在多智能体环境中需要某种形式的目标和计划识别。不幸的是,所有现有的目标识别数据集都受到由生成它们的规划系统(即基于启发式的前向搜索)引起的系统性偏见。这意味着现有数据集在更现实的场景(例如,使用不同规划器的代理)中缺乏足够的挑战,这影响了目标识别器在使用不同规划器针对同一目标时的评估。在本文中,我们提出了一种新方法,使用 top-k 规划为同一目标假设生成多个不同的计划,从而产生减轻当前数据集中发现的偏见的基准。这使我们能够引入一种新的度量标准,称为版本覆盖率得分(Version Coverage Score, VCS),用于衡量目标识别器在基于不同计划集推断目标时的韧性。我们的结果表明,在低可观察性设置下,当前最先进的目标识别器的韧性显著下降。
cs.AI / 100 / 2602.14697

Evolutionary System Prompt Learning can Facilitate Reinforcement Learning for LLMs

进化系统提示学习可以促进大语言模型的强化学习
Zhang, Lunjun, Chen, Ryan, Stadie, Bradly C.
Abstract
Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI. Large language models (LLMs) today primarily self-improve via two mechanisms: self-reflection for context updates, and reinforcement learning (RL) for weight updates. In this work, we propose Evolutionary System Prompt Learning (E-SPL), a method for jointly improving model contexts and model weights. In each RL iteration, E-SPL selects multiple system prompts and runs rollouts with each in parallel. It applies RL updates to model weights conditioned on each system prompt, and evolutionary updates to the system prompt population via LLM-driven mutation and crossover. Each system prompt has a TrueSkill rating for evolutionary selection, updated from relative performance within each RL iteration batch. E-SPL encourages a natural division between declarative knowledge encoded in prompts and procedural knowledge encoded in weights, resulting in improved performance across reasoning and agentic tasks. For instance, in an easy-to-hard (AIME $\rightarrow$ BeyondAIME) generalization setting, E-SPL improves RL success rate from 38.8% $\rightarrow$ 45.1% while also outperforming reflective prompt evolution (40.0%). Overall, our results show that coupling reinforcement learning with system prompt evolution yields consistent gains in sample efficiency and generalization. Code: https://github.com/LunjunZhang/E-SPL
Chinese Translation
构建能够自主从经验中自我改进的智能体系统是人工智能的一个长期目标。目前,大型语言模型(LLMs)主要通过两种机制进行自我改进:自我反思用于上下文更新,以及强化学习(RL)用于权重更新。在本研究中,我们提出了进化系统提示学习(E-SPL),一种联合改善模型上下文和模型权重的方法。在每次RL迭代中,E-SPL选择多个系统提示,并与每个提示并行运行回合。它根据每个系统提示对模型权重应用RL更新,并通过LLM驱动的变异和交叉对系统提示种群进行进化更新。每个系统提示都有一个用于进化选择的TrueSkill评分,该评分根据每个RL迭代批次内的相对表现进行更新。E-SPL鼓励在提示中编码的陈述性知识与在权重中编码的程序性知识之间形成自然的分离,从而在推理和智能体任务中提高性能。例如,在一个由易到难(AIME $ ightarrow$ BeyondAIME)的泛化设置中,E-SPL将RL成功率从38.8%提高到45.1%,同时也超越了反思性提示进化(40.0%)。总体而言,我们的结果表明,将强化学习与系统提示进化相结合可以在样本效率和泛化能力上带来持续的提升。代码链接:https://github.com/LunjunZhang/E-SPL
cs.AI / 101 / 2602.14721

WebWorld: A Large-Scale World Model for Web Agent Training

WebWorld:用于网络代理训练的大规模世界模型
Xiao, Zikai, Tu, Jianhong, Zou, Chuhang, Zuo, Yuxin, Li, Zhi, Wang, Peng, Yu, Bowen, Huang, Fei, Lin, Junyang, Liu, Zuozhu
Abstract
Web agents require massive trajectories to generalize, yet real-world training is constrained by network latency, rate limits, and safety risks. We introduce \textbf{WebWorld} series, the first open-web simulator trained at scale. While existing simulators are restricted to closed environments with thousands of trajectories, WebWorld leverages a scalable data pipeline to train on 1M+ open-web interactions, supporting reasoning, multi-format data, and long-horizon simulations of 30+ steps. For intrinsic evaluation, we introduce WebWorld-Bench with dual metrics spanning nine dimensions, where WebWorld achieves simulation performance comparable to Gemini-3-Pro. For extrinsic evaluation, Qwen3-14B trained on WebWorld-synthesized trajectories improves by +9.2\% on WebArena, reaching performance comparable to GPT-4o. WebWorld enables effective inference-time search, outperforming GPT-5 as a world model. Beyond web simulation, WebWorld exhibits cross-domain generalization to code, GUI, and game environments, providing a replicable recipe for world model construction.
Chinese Translation
网络代理需要大量轨迹以实现泛化,但现实世界的训练受到网络延迟、速率限制和安全风险的制约。我们介绍了 extbf{WebWorld} 系列,这是第一个在规模上进行训练的开放网络模拟器。虽然现有模拟器仅限于具有数千条轨迹的封闭环境,但 WebWorld 利用可扩展的数据管道在 1M+ 的开放网络交互中进行训练,支持推理、多格式数据和超过 30 步的长时间模拟。为了进行内在评估,我们引入了 WebWorld-Bench,采用涵盖九个维度的双重指标,其中 WebWorld 的模拟性能与 Gemini-3-Pro 相当。为了进行外在评估,在 WebWorld 合成轨迹上训练的 Qwen3-14B 在 WebArena 上提高了 +9.2\%,达到了与 GPT-4o 相当的性能。WebWorld 使得有效的推理时搜索成为可能,其表现超越了作为世界模型的 GPT-5。除了网络模拟外,WebWorld 还展现了对代码、图形用户界面和游戏环境的跨领域泛化,提供了一个可复制的世界模型构建方案。
cs.AI / 102 / 2602.14740

AI Arms and Influence: Frontier Models Exhibit Sophisticated Reasoning in Simulated Nuclear Crises

人工智能武器与影响力:前沿模型在模拟核危机中展现复杂推理
Payne, Kenneth
Abstract
Today's leading AI models engage in sophisticated behaviour when placed in strategic competition. They spontaneously attempt deception, signaling intentions they do not intend to follow; they demonstrate rich theory of mind, reasoning about adversary beliefs and anticipating their actions; and they exhibit credible metacognitive self-awareness, assessing their own strategic abilities before deciding how to act. Here we present findings from a crisis simulation in which three frontier large language models (GPT-5.2, Claude Sonnet 4, Gemini 3 Flash) play opposing leaders in a nuclear crisis. Our simulation has direct application for national security professionals, but also, via its insights into AI reasoning under uncertainty, has applications far beyond international crisis decision-making. Our findings both validate and challenge central tenets of strategic theory. We find support for Schelling's ideas about commitment, Kahn's escalation framework, and Jervis's work on misperception, inter alia. Yet we also find that the nuclear taboo is no impediment to nuclear escalation by our models; that strategic nuclear attack, while rare, does occur; that threats more often provoke counter-escalation than compliance; that high mutual credibility accelerated rather than deterred conflict; and that no model ever chose accommodation or withdrawal even when under acute pressure, only reduced levels of violence. We argue that AI simulation represents a powerful tool for strategic analysis, but only if properly calibrated against known patterns of human reasoning. Understanding how frontier models do and do not imitate human strategic logic is essential preparation for a world in which AI increasingly shapes strategic outcomes.
Chinese Translation
当今领先的人工智能模型在战略竞争中表现出复杂的行为。它们自发地尝试欺骗,传达并不打算遵循的意图;它们展现出丰富的心智理论,推理对手的信念并预测其行动;同时,它们表现出可信的元认知自我意识,在决定如何行动之前评估自身的战略能力。在此,我们展示了在一次危机模拟中的发现,其中三种前沿大型语言模型(GPT-5.2、Claude Sonnet 4、Gemini 3 Flash)扮演核危机中的对立领导者。我们的模拟对国家安全专业人士具有直接应用价值,但通过对人工智能在不确定性下推理的洞察,其应用远超国际危机决策。我们的发现既验证又挑战了战略理论的核心原则。我们发现支持谢林关于承诺的观点、卡恩的升级框架以及杰维斯关于误判的研究等。然而,我们也发现核禁忌并未阻碍模型的核升级;战略核攻击虽然罕见,但确实发生;威胁更常引发反升级而非遵从;高度的相互可信性加速而非阻止冲突;而且即使在极大压力下,没有模型选择妥协或撤退,只是降低了暴力水平。我们认为,人工智能模拟代表了战略分析的强大工具,但前提是必须与已知的人类推理模式进行适当的校准。理解前沿模型如何模仿或不模仿人类战略逻辑是为一个人工智能日益影响战略结果的世界做好准备的关键。
cs.AI / 103 / 2602.14795

Return of the Schema: Building Complete Datasets for Machine Learning and Reasoning on Knowledge Graphs

模式的回归:构建适用于机器学习和知识图谱推理的完整数据集
Diliso, Ivan, Barile, Roberto, d'Amato, Claudia, Fanizzi, Nicola
Abstract
Datasets for the experimental evaluation of knowledge graph refinement algorithms typically contain only ground facts, retaining very limited schema level knowledge even when such information is available in the source knowledge graphs. This limits the evaluation of methods that rely on rich ontological constraints, reasoning or neurosymbolic techniques and ultimately prevents assessing their performance in large-scale, real-world knowledge graphs. In this paper, we present \resource{} the first resource that provides a workflow for extracting datasets including both schema and ground facts, ready for machine learning and reasoning services, along with the resulting curated suite of datasets. The workflow also handles inconsistencies detected when keeping both schema and facts and also leverage reasoning for entailing implicit knowledge. The suite includes newly extracted datasets from KGs with expressive schemas while simultaneously enriching existing datasets with schema information. Each dataset is serialized in OWL making it ready for reasoning services. Moreover, we provide utilities for loading datasets in tensor representations typical of standard machine learning libraries.
Chinese Translation
用于知识图谱精炼算法实验评估的数据集通常仅包含基础事实,即使在源知识图谱中有相关信息时,也仅保留非常有限的模式层知识。这限制了依赖丰富本体约束、推理或神经符号技术的方法的评估,最终阻碍了在大规模、真实世界知识图谱中评估其性能。在本文中,我们介绍了 esource{},这是第一个提供提取包括模式和基础事实的数据集的工作流程的资源,旨在为机器学习和推理服务做好准备,并附带最终整理的数据集套件。该工作流程还处理在同时保留模式和事实时检测到的不一致性,并利用推理来推导隐含知识。该套件包括从具有表现力模式的知识图谱中新提取的数据集,同时丰富现有数据集的模式信息。每个数据集都以OWL格式序列化,使其适合推理服务。此外,我们还提供了用于加载标准机器学习库中典型张量表示的数据集的工具。
cs.AI / 104 / 2602.14857

World Models for Policy Refinement in StarCraft II

用于《星际争霸 II》中的策略细化的世界模型
Zhang, Yixin, Wang, Ziyi, Rong, Yiming, Wang, Haoxi, Jiang, Jinling, Xu, Shuang, Wu, Haoran, Zhou, Shiyu, Xu, Bo
Abstract
Large Language Models (LLMs) have recently shown strong reasoning and generalization capabilities, motivating their use as decision-making policies in complex environments. StarCraft II (SC2), with its massive state-action space and partial observability, is a challenging testbed. However, existing LLM-based SC2 agents primarily focus on improving the policy itself and overlook integrating a learnable, action-conditioned transition model into the decision loop. To bridge this gap, we propose StarWM, the first world model for SC2 that predicts future observations under partial observability. To facilitate learning SC2's hybrid dynamics, we introduce a structured textual representation that factorizes observations into five semantic modules, and construct SC2-Dynamics-50k, the first instruction-tuning dataset for SC2 dynamics prediction. We further develop a multi-dimensional offline evaluation framework for predicted structured observations. Offline results show StarWM's substantial gains over zero-shot baselines, including nearly 60% improvements in resource prediction accuracy and self-side macro-situation consistency. Finally, we propose StarWM-Agent, a world-model-augmented decision system that integrates StarWM into a Generate--Simulate--Refine decision loop for foresight-driven policy refinement. Online evaluation against SC2's built-in AI demonstrates consistent improvements, yielding win-rate gains of 30%, 15%, and 30% against Hard (LV5), Harder (LV6), and VeryHard (LV7), respectively, alongside improved macro-management stability and tactical risk assessment.
Chinese Translation
大型语言模型(LLMs)最近展现了强大的推理和泛化能力,这激励了它们在复杂环境中作为决策政策的应用。《星际争霸 II》(SC2)由于其庞大的状态-动作空间和部分可观测性,成为了一个具有挑战性的测试平台。然而,现有的基于LLM的SC2代理主要集中在改善策略本身,而忽视了将可学习的、基于动作的转移模型整合到决策循环中。为了解决这一问题,我们提出了StarWM,这是第一个在部分可观测性下预测未来观测的SC2世界模型。为了促进SC2混合动态的学习,我们引入了一种结构化文本表示,将观测分解为五个语义模块,并构建了SC2-Dynamics-50k,这是第一个用于SC2动态预测的指令调优数据集。我们进一步开发了一个多维离线评估框架,用于预测的结构化观测。离线结果显示,StarWM在零-shot基线上的显著提升,包括资源预测准确性提高近60%以及自我侧宏观情境一致性。最后,我们提出了StarWM-Agent,这是一个增强了世界模型的决策系统,将StarWM整合到生成-模拟-细化的决策循环中,以实现基于前瞻的策略细化。与SC2内置AI的在线评估显示出一致的改进,赢得率分别提高了30%、15%和30%,针对困难(LV5)、更困难(LV6)和非常困难(LV7)的对手,同时改善了宏观管理的稳定性和战术风险评估。
cs.AI / 105 / 2602.14865

EmbeWebAgent: Embedding Web Agents into Any Customized UI

EmbeWebAgent:将网络代理嵌入任何定制用户界面
Ma, Chenyang, Fare, Clyde, Wilson, Matthew, Braines, Dave
Abstract
Most web agents operate at the human interface level, observing screenshots or raw DOM trees without application-level access, which limits robustness and action expressiveness. In enterprise settings, however, explicit control of both the frontend and backend is available. We present EmbeWebAgent, a framework for embedding agents directly into existing UIs using lightweight frontend hooks (curated ARIA and URL-based observations, and a per-page function registry exposed via a WebSocket) and a reusable backend workflow that performs reasoning and takes actions. EmbeWebAgent is stack-agnostic (e.g., React or Angular), supports mixed-granularity actions ranging from GUI primitives to higher-level composites, and orchestrates navigation, manipulation, and domain-specific analytics via MCP tools. Our demo shows minimal retrofitting effort and robust multi-step behaviors grounded in a live UI setting. Live Demo: https://youtu.be/Cy06Ljee1JQ
Chinese Translation
大多数网络代理在用户界面层面操作,观察屏幕截图或原始DOM树,而无法访问应用程序级别,这限制了其稳健性和操作表达能力。然而,在企业环境中,前端和后端的显式控制是可用的。我们提出了EmbeWebAgent,这是一个将代理直接嵌入现有用户界面的框架,使用轻量级前端钩子(策划的ARIA和基于URL的观察,以及通过WebSocket暴露的每页函数注册表)和可重用的后端工作流来进行推理和采取行动。EmbeWebAgent与技术栈无关(例如,React或Angular),支持从GUI原语到更高层次复合操作的混合粒度操作,并通过MCP工具协调导航、操作和特定领域分析。我们的演示显示了最小的改造工作量和基于实时用户界面的稳健多步骤行为。实时演示:https://youtu.be/Cy06Ljee1JQ
cs.AI / 106 / 2602.14869

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

概念影响:利用可解释性提升训练数据归因的性能和效率
Kowal, Matthew, Paulo, Goncalo, Jaburi, Louis, Tseng, Tom, McKinney, Lev E, Heimersheim, Stefan, Tucker, Aaron David, Gleave, Adam, Pelrine, Kellin
Abstract
As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches like influence functions are both computationally expensive and attribute based on single test examples, which can bias results toward syntactic rather than semantic similarity. To address these issues of scalability and influence to abstract behavior, we leverage interpretable structures within the model during the attribution. First, we introduce Concept Influence which attribute model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of Concept Influence that achieve comparable performance while being over an order-of-magnitude faster. We empirically validate Concept Influence and approximations across emergent misalignment benchmarks and real post-training datasets, and demonstrate they achieve comparable performance to classical influence functions while being substantially more scalable. More broadly, we show that incorporating interpretable structure within traditional TDA pipelines can enable more scalable, explainable, and better control of model behavior through data.
Chinese Translation
随着大型语言模型的不断训练和微调,实践者需要方法来识别哪些训练数据驱动特定行为,特别是那些意外的行为。训练数据归因(Training Data Attribution, TDA)方法通过估计数据点影响来解决这个问题。现有的方法,如影响函数(influence functions),在计算上既昂贵,又基于单个测试示例进行归因,这可能导致结果偏向于语法相似性而非语义相似性。为了解决可扩展性和影响抽象行为的问题,我们在归因过程中利用模型中的可解释结构。首先,我们引入概念影响(Concept Influence),将模型行为归因于语义方向(如线性探针或稀疏自编码器特征),而不是单个测试示例。其次,我们展示了基于简单探针的归因方法是概念影响的一阶近似,能够在性能上达到可比的效果,同时速度快了一个数量级。我们在新兴的失调基准和真实的后训练数据集上实证验证了概念影响及其近似,并证明它们在性能上与经典影响函数相当,同时在可扩展性上显著更优。更广泛地说,我们展示了在传统TDA流程中融入可解释结构可以实现更可扩展、可解释且更好地控制模型行为的能力。
cs.AI / 107 / 2602.14890

Lifted Relational Probabilistic Inference via Implicit Learning

通过隐式学习进行提升的关系概率推理
Ge, Luise, Juba, Brendan, Nilsson, Kris, Shao, Alison
Abstract
Reconciling the tension between inductive learning and deductive reasoning in first-order relational domains is a longstanding challenge in AI. We study the problem of answering queries in a first-order relational probabilistic logic through a joint effort of learning and reasoning, without ever constructing an explicit model. Traditional lifted inference assumes access to a complete model and exploits symmetry to evaluate probabilistic queries; however, learning such models from partial, noisy observations is intractable in general. We reconcile these two challenges through implicit learning to reason and first-order relational probabilistic inference techniques. More specifically, we merge incomplete first-order axioms with independently sampled, partially observed examples into a bounded-degree fragment of the sum-of-squares (SOS) hierarchy in polynomial time. Our algorithm performs two lifts simultaneously: (i) grounding-lift, where renaming-equivalent ground moments share one variable, collapsing the domain of individuals; and (ii) world-lift, where all pseudo-models (partial world assignments) are enforced in parallel, producing a global bound that holds across all worlds consistent with the learned constraints. These innovations yield the first polynomial-time framework that implicitly learns a first-order probabilistic logic and performs lifted inference over both individuals and worlds.
Chinese Translation
调和第一阶关系领域中归纳学习与演绎推理之间的矛盾是人工智能领域长期以来的挑战。我们研究在第一阶关系概率逻辑中回答查询的问题,通过学习与推理的联合努力,而不构建显式模型。传统的提升推理假设可以访问完整模型,并利用对称性来评估概率查询;然而,从部分的、嘈杂的观察中学习此类模型在一般情况下是不可行的。我们通过隐式学习来调和这两个挑战,以进行推理和第一阶关系概率推理技术。更具体地,我们将不完整的第一阶公理与独立采样的部分观察示例合并到多项式时间内的平方和(SOS)层次的有界度片段中。我们的算法同时进行两个提升:(i) 基于基础的提升,其中重命名等效的基础时刻共享一个变量,从而压缩个体的域;(ii) 世界提升,其中所有伪模型(部分世界赋值)并行施加,产生一个在与学习约束一致的所有世界中都成立的全局界限。这些创新产生了第一个多项式时间框架,隐式学习第一阶概率逻辑,并对个体和世界进行提升推理。
cs.AI / 108 / 2602.14903

The Potential of CoT for Reasoning: A Closer Look at Trace Dynamics

链式思维(CoT)推理的潜力:追踪动态的深入分析
Bachmann, Gregor, Jiang, Yichen, Dezfooli, Seyed Mohsen Moosavi, Nabi, Moin
Abstract
Chain-of-thought (CoT) prompting is a de-facto standard technique to elicit reasoning-like responses from large language models (LLMs), allowing them to spell out individual steps before giving a final answer. While the resemblance to human-like reasoning is undeniable, the driving forces underpinning the success of CoT reasoning still remain largely unclear. In this work, we perform an in-depth analysis of CoT traces originating from competition-level mathematics questions, with the aim of better understanding how, and which parts of CoT actually contribute to the final answer. To this end, we introduce the notion of a potential, quantifying how much a given part of CoT increases the likelihood of a correct completion. Upon examination of reasoning traces through the lens of the potential, we identify surprising patterns including (1) its often strong non-monotonicity (due to reasoning tangents), (2) very sharp but sometimes tough to interpret spikes (reasoning insights and jumps) as well as (3) at times lucky guesses, where the model arrives at the correct answer without providing any relevant justifications before. While some of the behaviours of the potential are readily interpretable and align with human intuition (such as insights and tangents), others remain difficult to understand from a human perspective. To further quantify the reliance of LLMs on reasoning insights, we investigate the notion of CoT transferability, where we measure the potential of a weaker model under the partial CoT from another, stronger model. Indeed aligning with our previous results, we find that as little as 20% of partial CoT can ``unlock'' the performance of the weaker model on problems that were previously unsolvable for it, highlighting that a large part of the mechanics underpinning CoT are transferable.
Chinese Translation
链式思维(CoT)提示是一种事实上的标准技术,用于从大型语言模型(LLMs)中引出类似推理的响应,使其在给出最终答案之前能够逐步阐述各个步骤。尽管与人类推理的相似性不容忽视,但支撑CoT推理成功的驱动因素仍然在很大程度上不清楚。在本研究中,我们对源自竞争级数学问题的CoT轨迹进行了深入分析,旨在更好地理解CoT的哪些部分以及如何实际贡献于最终答案。为此,我们引入了潜力的概念,量化CoT的某一部分在多大程度上增加了正确完成的可能性。通过潜力的视角审视推理轨迹,我们识别出一些令人惊讶的模式,包括(1)其通常强烈的非单调性(由于推理切线),(2)非常尖锐但有时难以解释的峰值(推理洞察和跳跃),以及(3)有时的幸运猜测,即模型在未提供任何相关理由的情况下得出正确答案。尽管潜力的一些行为易于解释并与人类直觉相符(如洞察和切线),但其他一些行为从人类的角度来看仍然难以理解。为了进一步量化LLMs对推理洞察的依赖,我们研究了CoT可转移性的概念,在此我们测量了较弱模型在来自另一较强模型的部分CoT下的潜力。确实,与我们之前的结果一致,我们发现仅20%的部分CoT就能“解锁”较弱模型在之前无法解决的问题上的表现,突显出支撑CoT的许多机制是可转移的。
cs.AI / 109 / 2602.14910

Position: Introspective Experience from Conversational Environments as a Path to Better Learning

位置:对话环境中的内省体验作为更好学习的途径
Musat, Claudiu Cristian, Tolins, Jackson, Antognini, Diego, Li, Jingling, Klissarov, Martin, Duerig, Tom
Abstract
Current approaches to AI training treat reasoning as an emergent property of scale. We argue instead that robust reasoning emerges from linguistic self-reflection, itself internalized from high-quality social interaction. Drawing on Vygotskian developmental psychology, we advance three core positions centered on Introspection. First, we argue for the Social Genesis of the Private Mind: learning from conversational environments rises to prominence as a new way to make sense of the world; the friction of aligning with another agent, internal or not, refines and crystallizes the reasoning process. Second, we argue that dialogically scaffolded introspective experiences allow agents to engage in sense-making that decouples learning from immediate data streams, transforming raw environmental data into rich, learnable narratives. Finally, we contend that Dialogue Quality is the New Data Quality: the depth of an agent's private reasoning, and its efficiency regarding test-time compute, is determined by the diversity and rigor of the dialogues it has mastered. We conclude that optimizing these conversational scaffolds is the primary lever for the next generation of general intelligence.
Chinese Translation
当前的人工智能训练方法将推理视为规模的涌现属性。我们认为,稳健的推理源于语言自我反思,而这种反思又是从高质量的社会互动中内化而来的。基于维果茨基的发展的心理学,我们提出了三个以内省为中心的核心观点。首先,我们主张私人心智的社会生成:从对话环境中学习成为理解世界的新方式;与另一个代理(无论是内部的还是外部的)对齐的摩擦,精炼并固化了推理过程。其次,我们认为对话性支架的内省体验使代理能够进行意义构建,从而将学习与即时数据流解耦,将原始环境数据转化为丰富的、可学习的叙事。最后,我们认为对话质量是新的数据质量:代理的私人推理的深度及其在测试时计算的效率,取决于其掌握的对话的多样性和严谨性。我们总结认为,优化这些对话支架是下一代通用智能的主要杠杆。
cs.AI / 110 / 2602.14922

ReusStdFlow: A Standardized Reusability Framework for Dynamic Workflow Construction in Agentic AI

ReusStdFlow:一种用于智能体人工智能动态工作流构建的标准化可重用性框架
Zhang, Gaoyang, Zou, Shanghong, Wang, Yafang, Zhang, He, Xu, Ruohua, Zhao, Feng
Abstract
To address the ``reusability dilemma'' and structural hallucinations in enterprise Agentic AI,this paper proposes ReusStdFlow, a framework centered on a novel ``Extraction-Storage-Construction'' paradigm. The framework deconstructs heterogeneous, platform-specific Domain Specific Languages (DSLs) into standardized, modular workflow segments. It employs a dual knowledge architecture-integrating graph and vector databases-to facilitate synergistic retrieval of both topological structures and functional semantics. Finally, workflows are intelligently assembled using a retrieval-augmented generation (RAG) strategy. Tested on 200 real-world n8n workflows, the system achieves over 90% accuracy in both extraction and construction. This framework provides a standardized solution for the automated reorganization and efficient reuse of enterprise digital assets.
Chinese Translation
为了解决企业智能体人工智能中的“可重用性困境”和结构性幻觉,本文提出了ReusStdFlow,一个基于新颖的“提取-存储-构建”范式的框架。该框架将异构的、特定平台的领域特定语言(DSL)解构为标准化的模块化工作流片段。它采用双重知识架构——集成图形和向量数据库——以促进拓扑结构和功能语义的协同检索。最后,工作流通过增强检索生成(RAG)策略智能地组装。经过对200个真实世界n8n工作流的测试,该系统在提取和构建方面均实现了超过90%的准确率。该框架为企业数字资产的自动重组和高效重用提供了一种标准化解决方案。
cs.AI / 111 / 2602.14926

MAC-AMP: A Closed-Loop Multi-Agent Collaboration System for Multi-Objective Antimicrobial Peptide Design

MAC-AMP:一种用于多目标抗菌肽设计的闭环多智能体协作系统
Zhou, Gen, Janarthanan, Sugitha, Chen, Lianghong, Hu, Pingzhao
Abstract
To address the global health threat of antimicrobial resistance, antimicrobial peptides (AMP) are being explored for their potent and promising ability to fight resistant pathogens. While artificial intelligence (AI) is being employed to advance AMP discovery and design, most AMP design models struggle to balance key goals like activity, toxicity, and novelty, using rigid or unclear scoring methods that make results hard to interpret and optimize. As the capabilities of Large Language Models (LLM) advance and evolve swiftly, we turn to AI multi-agent collaboration based on such models (multi-agent LLMs), which show rapidly rising potential in complex scientific design scenarios. Based on this, we introduce MAC-AMP, a closed-loop multi-agent collaboration (MAC) system for multi-objective AMP design. The system implements a fully autonomous simulated peer review-adaptive reinforcement learning framework that requires only a task description and example dataset to design novel AMPs. The novelty of our work lies in introducing a closed-loop multi-agent system for AMP design, with cross-domain transferability, that supports multi-objective optimization while remaining explainable rather than a 'black box'. Experiments show that MAC-AMP outperforms other AMP generative models by effectively optimizing AMP generation for multiple key molecular properties, demonstrating exceptional results in antibacterial activity, AMP likeliness, toxicity compliance, and structural reliability.
Chinese Translation
为了应对抗菌耐药性带来的全球健康威胁,抗菌肽(AMP)因其强效和前景广阔的抗击耐药病原体的能力而受到关注。尽管人工智能(AI)正在被用于推动AMP的发现和设计,但大多数AMP设计模型在平衡活性、毒性和新颖性等关键目标时面临困难,采用的评分方法往往僵化或不明确,导致结果难以解释和优化。随着大型语言模型(LLM)能力的迅速发展,我们转向基于此类模型的AI多智能体协作(多智能体LLM),这在复杂的科学设计场景中显示出迅速上升的潜力。在此基础上,我们提出了MAC-AMP,一个用于多目标AMP设计的闭环多智能体协作(MAC)系统。该系统实现了一个完全自主的模拟同行评审自适应强化学习框架,只需任务描述和示例数据集即可设计新颖的AMP。我们工作的创新之处在于引入了一个支持多目标优化的AMP设计闭环多智能体系统,具备跨领域的可转移性,并且保持可解释性,而不是一个“黑箱”。实验表明,MAC-AMP在有效优化AMP生成多个关键分子属性方面优于其他AMP生成模型,在抗菌活性、AMP相似性、毒性合规性和结构可靠性等方面表现出色。
cs.AI / 112 / 2602.14994

On the Semantics of Primary Cause in Hybrid Dynamic Domains

混合动态领域中主要原因的语义
Khan, Shakil M., Mehmood, Asim, Zilles, Sandra
Abstract
Reasoning about actual causes of observed effects is fundamental to the study of rationality. This important problem has been studied since the time of Aristotle, with formal mathematical accounts emerging recently. We live in a world where change due to actions can be both discrete and continuous, that is, hybrid. Yet, despite extensive research on actual causation, only few recent studies looked into causation with continuous change. Building on recent progress, in this paper we propose two definitions of primary cause in a hybrid action-theoretic framework, namely the hybrid temporal situation calculus. One of these is foundational in nature while the other formalizes causation through contributions, which can then be verified from a counterfactual perspective using a modified ``but-for'' test. We prove that these two definitions are indeed equivalent. We then show that our definitions of causation have some intuitively justifiable properties.
Chinese Translation
推理观察到的效果的实际原因是理性研究的基础。这个重要问题自亚里士多德时代以来就一直受到研究,最近出现了正式的数学描述。我们生活在一个由于行动而导致的变化可以是离散的也可以是连续的,即混合的世界。然而,尽管对实际因果关系进行了广泛的研究,只有少数近期研究关注了连续变化下的因果关系。基于近期的进展,本文在混合行动理论框架下提出了主要原因的两个定义,即混合时间情境演算。其中一个定义具有基础性,而另一个则通过贡献形式化因果关系,这可以通过修改后的“但为”测试从反事实的角度进行验证。我们证明这两个定义实际上是等价的。然后,我们展示了我们对因果关系的定义具有一些直观上合理的属性。
cs.AI / 113 / 2602.15019

Hunt Globally: Deep Research AI Agents for Drug Asset Scouting in Investing, Business Development, and Search & Evaluation

全球猎寻:用于投资、商业开发及搜索与评估的深度研究人工智能代理在药物资产侦查中的应用
Vinogradova, Alisa, Vinogradov, Vlad, Greenwood, Luba, Yasny, Ilya, Kobyzev, Dmitry, Kasbekar, Shoman, Nguyen, Kong, Radkevich, Dmitrii, Doronin, Roman, Doronichev, Andrey
Abstract
Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests >85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non-U.S. Industry estimates put China at ~30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high-recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. We compare Bioptic Agent against Claude Opus 4.6, OpenAI GPT-5.2 Pro, Perplexity Deep Research, Gemini 3 Pro + Deep Research, and Exa Websets. Bioptic Agent achieves 79.7% F1 versus 56.2% (Claude Opus 4.6), 50.6% (Gemini 3 Pro + Deep Research), 46.6% (GPT-5.2 Pro), 44.2% (Perplexity Deep Research), and 26.9% (Exa Websets). Performance improves steeply with additional compute, supporting the view that more compute yields better results.
Chinese Translation
生物制药创新已发生转变:许多新药资产现在源自美国以外,主要通过区域性、非英语渠道披露。最近的数据表明,超过85%的专利申请源自美国以外,其中中国占全球总量的近一半;越来越多的学术成果也来自非美国地区。行业估计中国在全球药物开发中占约30%,涵盖超过1200个新候选药物。在这个高风险环境中,未能发现“低调”的资产为投资者和商业开发团队带来了数十亿美元的风险,使得资产侦查成为一个覆盖至关重要的竞争领域,速度和完整性驱动着价值。然而,今天的深度研究人工智能代理在实现跨异构、多语言来源的高召回率发现方面仍落后于人类专家,且存在幻觉现象。我们提出了一种药物资产侦查的基准测试方法,以及一个调优的基于树的自学习生物光学代理,旨在实现完整且无幻觉的侦查。我们使用多语言多代理管道构建了一个具有挑战性的完整性基准:复杂的用户查询与主要不在美国雷达范围内的真实资产相结合。为了反映真实交易的复杂性,我们从专家投资者、商业开发和风险投资专业人士那里收集了筛选查询,并将其作为先验条件生成基准查询。对于评分,我们使用基于大型语言模型(LLM)的评估,经过专家意见的校准。我们将生物光学代理与Claude Opus 4.6、OpenAI GPT-5.2 Pro、Perplexity Deep Research、Gemini 3 Pro + Deep Research和Exa Websets进行了比较。生物光学代理的F1得分为79.7%,而Claude Opus 4.6为56.2%、Gemini 3 Pro + Deep Research为50.6%、GPT-5.2 Pro为46.6%、Perplexity Deep Research为44.2%、Exa Websets为26.9%。随着计算能力的增加,性能显著提升,支持了更多计算能带来更好结果的观点。
计算语言学 (Computation and Language)
85
cs.CL / 1 / 2602.13263

Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation

基于多模态一致性指导的无参考数据选择用于自动语音识别口音适应
Lei, Ligong, Lu, Wenwen, Pang, Xudong, Kadeer, Zaokere, Wumaier, Aishan
Abstract
Automatic speech recognition (ASR) systems often degrade on accented speech because acoustic-phonetic and prosodic shifts induce a mismatch to training data, making labeled accent adaptation costly. However, common pseudo-label selection heuristics are largely text-centric (e.g., perplexity (PPL) filtering) and can prefer fluent yet acoustically mismatched hypotheses, leading to error amplification when fine-tuning. To address this, we introduce a multimodal consistency-guided, reference-free data selection pipeline for ASR accent adaptation under a transductive, label-free protocol. The pipeline starts with a target-aware preselection step based on submodular mutual information to improve query relevance and reduce downstream computation. It then generates multiple pseudo-transcriptions per utterance via perturbation-based decoding and scores each hypothesis using two reference-free signals: speech--text alignment in a shared embedding space and predicted word error rate (WER). A simple percentile-based selection rule retains reliable pseudo-labels for fine-tuning while discarding noisy utterances. In an in-domain setting, selecting ~1.5k utterances from a 30k pool achieves 10.91% WER, close to 10.45% obtained using 30k supervised labels. In a cross-domain setting with a mismatched candidate pool, consistency-filtered subsets avoid the degradation caused by unfiltered pseudo-labels under strong accent shift, and matched-hour experiments on a stronger ASR backbone further confirm gains over random sampling and recent selection baselines.
Chinese Translation
自动语音识别(ASR)系统在口音语音上往往表现不佳,因为声学-语音学和韵律的变化导致与训练数据的不匹配,使得带标签的口音适应成本高昂。然而,常见的伪标签选择启发式方法主要以文本为中心(例如,困惑度(PPL)过滤),可能偏向流畅但声学上不匹配的假设,从而在微调时导致错误放大。为了解决这个问题,我们提出了一种基于多模态一致性指导的无参考数据选择管道,用于ASR口音适应,采用传导性、无标签的协议。该管道首先通过基于子模块互信息的目标感知预选择步骤来提高查询相关性并减少下游计算。然后,通过基于扰动的解码为每个发声生成多个伪转录,并使用两个无参考信号对每个假设进行评分:在共享嵌入空间中的语音-文本对齐和预测的词错误率(WER)。简单的百分位选择规则保留可靠的伪标签以进行微调,同时丢弃噪声发声。在领域内设置中,从30k池中选择约1.5k发声实现了10.91%的WER,接近使用30k监督标签获得的10.45%。在具有不匹配候选池的跨领域设置中,一致性过滤的子集避免了在强口音变化下未过滤伪标签造成的降级,并且在更强的ASR主干上的匹配小时实验进一步确认了相较于随机抽样和最近选择基线的提升。
cs.CL / 2 / 2602.13452

LLM-Powered Automatic Translation and Urgency in Crisis Scenarios

基于大型语言模型的自动翻译与危机情境中的紧迫性
Ticona, Belu, Anastasopoulos, Antonis
Abstract
Large language models (LLMs) are increasingly proposed for crisis preparedness and response, particularly for multilingual communication. However, their suitability for high-stakes crisis contexts remains insufficiently evaluated. This work examines the performance of state-of-the-art LLMs and machine translation systems in crisis-domain translation, with a focus on preserving urgency, which is a critical property for effective crisis communication and triaging. Using multilingual crisis data and a newly introduced urgency-annotated dataset covering over 32 languages, we show that both dedicated translation models and LLMs exhibit substantial performance degradation and instability. Crucially, even linguistically adequate translations can distort perceived urgency, and LLM-based urgency classifications vary widely depending on the language of the prompt and input. These findings highlight significant risks in deploying general-purpose language technologies for crisis communication and underscore the need for crisis-aware evaluation frameworks.
Chinese Translation
大型语言模型(LLMs)越来越多地被提议用于危机准备和响应,特别是在多语言沟通方面。然而,它们在高风险危机环境中的适用性尚未得到充分评估。本研究考察了最先进的LLMs和机器翻译系统在危机领域翻译中的表现,重点关注紧迫性这一有效危机沟通和分诊的关键属性。通过使用多语言危机数据和一个新引入的覆盖32种语言的紧迫性标注数据集,我们展示了专用翻译模型和LLMs均表现出显著的性能下降和不稳定性。关键是,即使在语言上合适的翻译也可能扭曲感知的紧迫性,而基于LLM的紧迫性分类在提示和输入的语言上变化很大。这些发现突显了在危机沟通中部署通用语言技术的重大风险,并强调了危机意识评估框架的必要性。
cs.CL / 3 / 2602.13455

Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety

利用机器学习增强斯瓦希里语中模糊滥用词汇的检测:聚焦儿童安全
Nabangi, Phyllis, Zakaria, Abdul-Jalil, Ndibwile, Jema David
Abstract
The rise of digital technology has dramatically increased the potential for cyberbullying and online abuse, necessitating enhanced measures for detection and prevention, especially among children. This study focuses on detecting abusive obfuscated language in Swahili, a low-resource language that poses unique challenges due to its limited linguistic resources and technological support. Swahili is chosen due to its popularity and being the most widely spoken language in Africa, with over 16 million native speakers and upwards of 100 million speakers in total, spanning regions in East Africa and some parts of the Middle East. We employed machine learning models including Support Vector Machines (SVM), Logistic Regression, and Decision Trees, optimized through rigorous parameter tuning and techniques like Synthetic Minority Over-sampling Technique (SMOTE) to handle data imbalance. Our analysis revealed that, while these models perform well in high-dimensional textual data, our dataset's small size and imbalance limit our findings' generalizability. Precision, recall, and F1 scores were thoroughly analyzed, highlighting the nuanced performance of each model in detecting obfuscated language. This research contributes to the broader discourse on ensuring safer online environments for children, advocating for expanded datasets and advanced machine-learning techniques to improve the effectiveness of cyberbullying detection systems. Future work will focus on enhancing data robustness, exploring transfer learning, and integrating multimodal data to create more comprehensive and culturally sensitive detection mechanisms.
Chinese Translation
数字技术的兴起显著增加了网络欺凌和在线虐待的潜在风险,因此迫切需要增强检测和预防措施,尤其是在儿童群体中。本研究专注于检测斯瓦希里语中的模糊滥用语言,这是一种资源匮乏的语言,由于其有限的语言资源和技术支持,面临独特的挑战。选择斯瓦希里语是因为它在非洲的广泛使用,拥有超过1600万的母语者和超过1亿的总使用者,分布在东非和中东的一些地区。我们采用了包括支持向量机(Support Vector Machines, SVM)、逻辑回归(Logistic Regression)和决策树(Decision Trees)在内的机器学习模型,通过严格的参数调优和合成少数类过采样技术(Synthetic Minority Over-sampling Technique, SMOTE)等方法来处理数据不平衡。我们的分析显示,尽管这些模型在高维文本数据中表现良好,但由于数据集的规模小和不平衡性,限制了我们研究结果的普遍性。我们对精确度、召回率和F1分数进行了深入分析,突出了每个模型在检测模糊语言方面的细微表现。本研究为确保儿童在线环境安全的更广泛讨论做出了贡献,倡导扩展数据集和采用先进的机器学习技术,以提高网络欺凌检测系统的有效性。未来的工作将集中在增强数据的稳健性、探索迁移学习以及整合多模态数据,以创建更全面和文化敏感的检测机制。
cs.CL / 4 / 2602.13466

Language Model Memory and Memory Models for Language

语言模型记忆与语言的记忆模型
Badger, Benjamin L.
Abstract
The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of `memory', is widely employed but not well characterized. We find that language model embeddings typically contain relatively little input information regardless of data and compute scale during training. In contrast, embeddings from autoencoders trained for input regeneration are capable of nearly perfect memory formation. The substitution of memory embeddings for token sequences leads to substantial computational efficiencies, motivating the introduction of a parallelizable encoder-decoder memory model architecture. Upon causal training these models contain information-poor embeddings incapable of arbitrary information access, but by combining causal and information retention objective functions they learn to form and decode information-rich memories. Training can be further streamlined by freezing a high fidelity encoder followed by a curriculum training approach where decoders first learn to process memories and then learn to additionally predict next tokens. We introduce the perspective that next token prediction training alone is poorly suited for accurate memory formation as the objective itself is non-invertible, motivating the use of combined objective functions for models where the entire input is not exposed.
Chinese Translation
机器学习模型在隐藏层向量嵌入中存储输入信息的能力,类似于“记忆”的概念,已被广泛应用但尚未得到很好的表征。我们发现,语言模型的嵌入通常包含相对较少的输入信息,无论在训练期间的数据和计算规模如何。相比之下,为输入再生而训练的自编码器的嵌入能够几乎完美地形成记忆。用记忆嵌入替代标记序列可以显著提高计算效率,这促使我们引入一种可并行化的编码器-解码器记忆模型架构。在因果训练后,这些模型包含信息稀缺的嵌入,无法进行任意信息访问,但通过结合因果和信息保留目标函数,它们学习形成和解码信息丰富的记忆。通过冻结高保真度编码器,训练可以进一步简化,随后采用课程训练方法,其中解码器首先学习处理记忆,然后学习额外预测下一个标记。我们引入了一个观点,即仅进行下一个标记预测训练不适合准确的记忆形成,因为该目标本身是不可逆的,这促使我们对那些未暴露整个输入的模型使用组合目标函数。
cs.CL / 5 / 2602.13504

From Perceptions To Evidence: Detecting AI-Generated Content In Turkish News Media With A Fine-Tuned Bert Classifier

从认知到证据:使用微调的BERT分类器检测土耳其新闻媒体中的AI生成内容
Ozdemir, Ozancan
Abstract
The rapid integration of large language models into newsroom workflows has raised urgent questions about the prevalence of AI-generated content in online media. While computational studies have begun to quantify this phenomenon in English-language outlets, no empirical investigation exists for Turkish news media, where existing research remains limited to qualitative interviews with journalists or fake news detection. This study addresses that gap by fine-tuning a Turkish-specific BERT model (dbmdz/bert-base-turkish-cased) on a labeled dataset of 3,600 articles from three major Turkish outlets with distinct editorial orientations for binary classification of AI-rewritten content. The model achieves 0.9708 F1 score on the held-out test set with symmetric precision and recall across both classes. Subsequent deployment on over 3,500 unseen articles spanning between 2023 and 2026 reveals consistent cross-source and temporally stable classification patterns, with mean prediction confidence exceeding 0.96 and an estimated 2.5 percentage of examined news content rewritten or revised by LLMs on average. To the best of our knowledge, this is the first study to move beyond self-reported journalist perceptions toward empirical, data-driven measurement of AI usage in Turkish news media.
Chinese Translation
大型语言模型迅速融入新闻编辑室工作流程,引发了关于在线媒体中AI生成内容普遍性的紧迫问题。尽管计算研究已开始量化这一现象在英语媒体中的表现,但针对土耳其新闻媒体的实证研究尚不存在,现有研究仅限于与记者的定性访谈或假新闻检测。本研究通过对来自三家主要土耳其媒体的3,600篇文章的标注数据集进行微调,填补了这一空白,使用特定于土耳其的BERT模型(dbmdz/bert-base-turkish-cased)进行AI重写内容的二元分类。该模型在保留的测试集上达到了0.9708的F1分数,并在两个类别之间实现了对称的精确度和召回率。随后在2023年至2026年间对3,500多篇未见文章的部署显示出一致的跨源和时间稳定的分类模式,平均预测置信度超过0.96,估计平均有2.5%的被检查新闻内容是由大型语言模型重写或修订的。根据我们的最佳知识,这是首个超越自我报告的记者认知,朝向土耳其新闻媒体中AI使用的实证、数据驱动测量的研究。
cs.CL / 6 / 2602.13517

Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

深思熟虑,而非仅仅是长时间思考:通过深度思考标记测量大型语言模型的推理努力
Chen, Wei-Lin, Peng, Liqian, Tan, Tian, Zhao, Chao, Chen, Blake JianHang, Lin, Ziqian, Go, Alec, Meng, Yu
Abstract
Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
Chinese Translation
大型语言模型(LLMs)通过在推理时扩展计算能力,展示了令人印象深刻的推理能力,尤其是在长链式思维(Chain-of-Thought, CoT)中。然而,最近的研究发现,原始标记计数并不是推理质量的可靠代理:生成长度的增加并不总是与准确性相关,反而可能表明“过度思考”,导致性能下降。在本研究中,我们通过识别深度思考标记来量化推理时的努力——这些标记是在模型的深层中进行显著修正的内部预测,直到收敛。我们在四个具有挑战性的数学和科学基准(AIME 24/25、HMMT 25和GPQA-diamond)以及一组多样化的以推理为重点的模型(GPT-OSS、DeepSeek-R1和Qwen3)上展示,深度思考比例(生成序列中深度思考标记的比例)与准确性之间存在强而一致的正相关,显著优于基于长度和基于置信度的基线。基于这一洞察,我们引入了Think@n,一种在测试时优先考虑高深度思考比例样本的扩展策略。我们证明,Think@n的表现与标准自一致性性能相匹配或超过,同时通过基于短前缀的早期拒绝不太可能的生成,显著降低了推理成本。
cs.CL / 7 / 2602.13540

On Calibration of Large Language Models: From Response To Capability

大型语言模型的校准:从响应到能力
Yang, Sin-Han, Wu, Cheng-Kuang, Lin, Chieh-Yen, Chen, Yun-Nung, Lee, Hung-yi, Sun, Shao-Hua
Abstract
Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.
Chinese Translation
大型语言模型(LLMs)被广泛应用于通用问题解决,因此准确的置信度估计对于可靠使用至关重要。先前关于LLM校准的研究主要集中在响应级别的置信度上,即估计单个生成输出的正确性。然而,这种表述与许多实际场景不符,在这些场景中,核心问题是模型解决查询的整体可能性。我们表明,这种不匹配源于现代LLM解码的随机性,在这种情况下,单一响应的正确性无法反映模型的潜在能力。为了解决这一问题,我们引入了能力校准,旨在针对模型在查询上的预期准确性。我们正式区分能力校准与响应校准,并展示两者在理论和实证上均存在差异。我们建立了一个实证评估框架,并研究了一系列置信度估计方法。我们的结果表明,能力校准的置信度改善了pass@$k$预测和推理预算分配,为多种应用奠定了基础。
cs.CL / 8 / 2602.13551

Small Reward Models via Backward Inference

通过反向推理构建小型奖励模型
Wang, Yike, Brahman, Faeze, Feng, Shangbin, Xiao, Teng, Hajishirzi, Hannaneh, Tsvetkov, Yulia
Abstract
Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility. In this work, we propose FLIP (FLipped Inference for Prompt reconstruction), a reference-free and rubric-free reward modeling approach that reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response. The similarity between the inferred and the original instructions is then used as the reward signal. Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%. Moreover, FLIP substantially improves downstream performance in extrinsic evaluations under test-time scaling via parallel sampling and GRPO training. We further find that FLIP is particularly effective for longer outputs and robust to common forms of reward hacking. By explicitly exploiting the validation-generation gap, FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail. Code available at https://github.com/yikee/FLIP.
Chinese Translation
奖励模型(RMs)在语言模型(LM)流程中扮演着核心角色,尤其是在不可验证的领域。然而,主流的LLM-as-a-Judge范式依赖于大型模型的强大推理能力,而替代方法则需要参考响应或明确的评分标准,这限制了灵活性和更广泛的可访问性。在本研究中,我们提出了FLIP(FLipped Inference for Prompt reconstruction),一种无参考和无评分标准的奖励建模方法,通过反向推理重新构建奖励建模:推断出最有可能产生给定响应的指令。然后,推断出的指令与原始指令之间的相似性被用作奖励信号。在使用13个小型语言模型的四个领域的评估中,FLIP的表现平均超越了LLM-as-a-Judge基线79.6%。此外,FLIP在测试时扩展下的外部评估中,通过并行采样和GRPO训练显著提高了下游性能。我们进一步发现,FLIP对较长输出特别有效,并且对常见的奖励黑客形式具有鲁棒性。通过明确利用验证-生成差距,FLIP使得在判断方法失效的降级环境中实现可靠的奖励建模成为可能。代码可在 https://github.com/yikee/FLIP 获取。
cs.CL / 9 / 2602.13567

DistillLens: Symmetric Knowledge Distillation Through Logit Lens

DistillLens:通过 Logit Lens 进行对称知识蒸馏
Dhakal, Manish, Jinadu, Uthman, Budathoki, Anjila, Sunderraman, Rajshekhar, Ding, Yi
Abstract
Standard Knowledge Distillation (KD) compresses Large Language Models (LLMs) by optimizing final outputs, yet it typically treats the teacher's intermediate layer's thought process as a black box. While feature-based distillation attempts to bridge this gap, existing methods (e.g., MSE and asymmetric KL divergence) ignore the rich uncertainty profiles required for the final output. In this paper, we introduce DistillLens, a framework that symmetrically aligns the evolving thought processes of student and teacher models. By projecting intermediate hidden states into the vocabulary space via the Logit Lens, we enforce structural alignment using a symmetric divergence objective. Our analysis proves that this constraint imposes a dual-sided penalty, preventing both overconfidence and underconfidence while preserving the high-entropy information conduits essential for final deduction. Extensive experiments on GPT-2 and Llama architectures demonstrate that DistillLens consistently outperforms standard KD and feature-transfer baselines on diverse instruction-following benchmarks. The code is available at https://github.com/manishdhakal/DistillLens.
Chinese Translation
标准知识蒸馏(Knowledge Distillation, KD)通过优化最终输出对大型语言模型(Large Language Models, LLMs)进行压缩,但通常将教师模型中间层的思维过程视为黑箱。尽管基于特征的蒸馏试图弥补这一差距,但现有方法(如均方误差(MSE)和非对称 KL 散度)忽略了最终输出所需的丰富不确定性特征。在本文中,我们提出了 DistillLens,一个对称对齐学生模型和教师模型思维过程的框架。通过通过 Logit Lens 将中间隐藏状态投影到词汇空间,我们使用对称散度目标强制执行结构对齐。我们的分析证明了这一约束施加了双向惩罚,防止了过于自信和不足自信,同时保留了对最终推理至关重要的高熵信息通道。在 GPT-2 和 Llama 架构上的广泛实验表明,DistillLens 在多样的指令跟随基准测试中始终优于标准 KD 和特征转移基线。代码可在 https://github.com/manishdhakal/DistillLens 获取。
cs.CL / 10 / 2602.13571

LLM-Confidence Reranker: A Training-Free Approach for Enhancing Retrieval-Augmented Generation Systems

LLM-Confidence Reranker:一种无训练的增强检索增强生成系统的方法
Song, Zhipeng, Kong, Xiangyu, Bao, Xinrui, Zhou, Yizhi, Jiao, Jiulong, Liu, Sitong, Zhou, Yuhang, Qi, Heng
Abstract
Large language models (LLMs) have revolutionized natural language processing, yet hallucinations in knowledge-intensive tasks remain a critical challenge. Retrieval-augmented generation (RAG) addresses this by integrating external knowledge, but its efficacy depends on accurate document retrieval and ranking. Although existing rerankers demonstrate effectiveness, they frequently necessitate specialized training, impose substantial computational expenses, and fail to fully exploit the semantic capabilities of LLMs, particularly their inherent confidence signals. We propose the LLM-Confidence Reranker (LCR), a training-free, plug-and-play algorithm that enhances reranking in RAG systems by leveraging black-box LLM confidence derived from Maximum Semantic Cluster Proportion (MSCP). LCR employs a two-stage process: confidence assessment via multinomial sampling and clustering, followed by binning and multi-level sorting based on query and document confidence thresholds. This approach prioritizes relevant documents while preserving original rankings for high-confidence queries, ensuring robustness. Evaluated on BEIR and TREC benchmarks with BM25 and Contriever retrievers, LCR--using only 7--9B-parameter pre-trained LLMs--consistently improves NDCG@5 by up to 20.6% across pre-trained LLM and fine-tuned Transformer rerankers, without degradation. Ablation studies validate the hypothesis that LLM confidence positively correlates with document relevance, elucidating LCR's mechanism. LCR offers computational efficiency, parallelism for scalability, and broad compatibility, mitigating hallucinations in applications like medical diagnosis.
Chinese Translation
大型语言模型(LLMs)已经彻底改变了自然语言处理,但在知识密集型任务中仍然存在幻觉问题,这是一项关键挑战。检索增强生成(RAG)通过整合外部知识来应对这一挑战,但其有效性依赖于准确的文档检索和排名。尽管现有的重排序器表现出有效性,但它们通常需要专门的训练,带来可观的计算开销,并未充分利用LLMs的语义能力,特别是其固有的信心信号。我们提出了LLM-Confidence Reranker(LCR),这是一种无训练的即插即用算法,通过利用基于最大语义聚类比例(Maximum Semantic Cluster Proportion, MSCP)获得的黑箱LLM信心,增强RAG系统中的重排序。LCR采用两阶段过程:通过多项式采样和聚类进行信心评估,随后根据查询和文档信心阈值进行分箱和多级排序。这种方法优先考虑相关文档,同时为高信心查询保留原始排名,确保鲁棒性。在使用BM25和Contriever检索器的BEIR和TREC基准上进行评估时,LCR仅使用7-9B参数的预训练LLMs,在预训练LLM和微调Transformer重排序器中,NDCG@5的提升幅度高达20.6%,且没有性能下降。消融研究验证了LLM信心与文档相关性之间的正相关假设,阐明了LCR的机制。LCR提供了计算效率、可扩展的并行性和广泛的兼容性,减轻了在医疗诊断等应用中的幻觉问题。
cs.CL / 11 / 2602.13575

Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

Elo-Evolve:一种用于语言模型对齐的共进化框架
Zhao, Jing, Zhen, Ting, bao, Junwei, Jiang, Hongfei, song, Yang
Abstract
Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability. We introduce Elo-Evolve, a co-evolutionary framework that redefines alignment as dynamic multi-agent competition within an adaptive opponent pool. Our approach makes two key innovations: (1) eliminating Bradley-Terry model dependencies by learning directly from binary win/loss outcomes in pairwise competitions, and (2) implementing Elo-orchestrated opponent selection that provides automatic curriculum learning through temperature-controlled sampling. We ground our approach in PAC learning theory, demonstrating that pairwise comparison achieves superior sample complexity and empirically validate a 4.5x noise reduction compared to absolute scoring approaches. Experimentally, we train a Qwen2.5-7B model using our framework with opponents including Qwen2.5-14B, Qwen2.5-32B, and Qwen3-8B models. Results demonstrate a clear performance hierarchy: point-based methods < static pairwise training < Elo-Evolve across Alpaca Eval 2.0 and MT-Bench, validating the progressive benefits of pairwise comparison and dynamic opponent selection for LLM alignment.
Chinese Translation
目前,大型语言模型(LLMs)的对齐方法依赖于将大量人类偏好数据压缩为静态的绝对奖励函数,这导致了数据稀缺、噪声敏感性和训练不稳定性。我们提出了Elo-Evolve,这是一种共进化框架,将对齐重新定义为在自适应对手池中的动态多智能体竞争。我们的方法有两个关键创新:(1)通过直接从成对竞争中的二元胜负结果中学习,消除对Bradley-Terry模型的依赖;(2)实施Elo协调的对手选择,通过温度控制的采样提供自动化的课程学习。我们将方法基于PAC学习理论,证明成对比较在样本复杂性上具有优越性,并在与绝对评分方法相比的实验中验证了4.5倍的噪声减少。实验中,我们使用包括Qwen2.5-14B、Qwen2.5-32B和Qwen3-8B模型在内的对手,利用我们的框架训练了Qwen2.5-7B模型。结果表明,性能层次清晰:基于点的方法 < 静态成对训练 < Elo-Evolve,在Alpaca Eval 2.0和MT-Bench上验证了成对比较和动态对手选择对LLM对齐的渐进性好处。
cs.CL / 12 / 2602.13701

Metaphors' journeys across time and genre: tracking the evolution of literary metaphors with temporal embeddings

隐喻在时间和体裁中的旅程:通过时间嵌入追踪文学隐喻的演变
Mangiaterra, Veronica, Pietro, Chiara Barattieri di San, Canal, Paolo, Bambini, Valentina
Abstract
Metaphors are a distinctive feature of literary language, yet they remain less studied experimentally than everyday metaphors. Moreover, previous psycholinguistic and computational approaches overlooked the temporal dimension, although many literary metaphors were coined centuries apart from contemporary readers. This study innovatively applies tools from diachronic distributional semantics to assess whether the processing costs of literary metaphors varied over time and genre. Specifically, we trained word embeddings on literary and nonliterary Italian corpora from the 19th and 21st centuries, for a total of 124 million tokens, and modeled changes in the semantic similarity between topics and vehicles of 515 19th-century literary metaphors, taking this measure as a proxy of metaphor processing demands. Overall, semantic similarity, and hence metaphor processing demands, remained stable over time. However, genre played a key role: metaphors appeared more difficult (i.e., lower topic-vehicle similarity) in modern literary contexts than in 19th-century literature, but easier (i.e., higher topic-vehicle similarity) in today's nonliterary language (e.g., the Web) than in 19th-century nonliterary texts. This pattern was further shaped by semantic features of metaphors' individual terms, such as vector coherence and semantic neighborhood density. Collectively, these findings align with broader linguistic changes in Italian, such as the stylistic simplification of modern literature, which may have increased metaphor processing demands, and the high creativity of the Web's language, which seems to render metaphor more accessible.
Chinese Translation
隐喻是文学语言的一个独特特征,但与日常隐喻相比,它们在实验研究中仍然较少受到关注。此外,之前的心理语言学和计算方法忽视了时间维度,尽管许多文学隐喻是在与当代读者相隔数世纪的背景下创造的。本研究创新性地应用了历时分布语义学的工具,以评估文学隐喻的处理成本是否随着时间和体裁的变化而变化。具体而言,我们在19世纪和21世纪的文学和非文学意大利语语料库上训练了词嵌入,总计124百万个词元,并对515个19世纪文学隐喻的主题与载体之间的语义相似性变化进行了建模,将这一指标视为隐喻处理需求的代理。总体而言,语义相似性,因此隐喻处理需求,随着时间的推移保持稳定。然而,体裁发挥了关键作用:在现代文学语境中,隐喻显得更为困难(即,主题-载体相似性较低),而在19世纪文学中则相对容易;但在当今的非文学语言(例如网络)中,隐喻则显得更为容易(即,主题-载体相似性较高),相比于19世纪的非文学文本。这一模式还受到隐喻个别术语的语义特征的进一步影响,例如向量一致性和语义邻域密度。总体而言,这些发现与意大利语的更广泛语言变化相一致,例如现代文学的风格简化,这可能增加了隐喻处理的需求,以及网络语言的高度创造性,这似乎使隐喻变得更易于理解。
cs.CL / 13 / 2602.13713

On Theoretically-Driven LLM Agents for Multi-Dimensional Discourse Analysis

基于理论驱动的多维话语分析的LLM代理
Uberna, Maciej, Wawer, Michał, Chudziak, Jarosław A., Koszowy, Marcin
Abstract
Identifying the strategic uses of reformulation in discourse remains a key challenge for computational argumentation. While LLMs can detect surface-level similarity, they often fail to capture the pragmatic functions of rephrasing, such as its role within rhetorical discourse. This paper presents a comparative multi-agent framework designed to quantify the benefits of incorporating explicit theoretical knowledge for this task. We utilise an dataset of annotated political debates to establish a new standard encompassing four distinct rephrase functions: Deintensification, Intensification, Specification, Generalisation, and Other, which covers all remaining types (D-I-S-G-O). We then evaluate two parallel LLM-based agent systems: one enhanced by argumentation theory via Retrieval-Augmented Generation (RAG), and an identical zero-shot baseline. The results reveal a clear performance gap: the RAG-enhanced agents substantially outperform the baseline across the board, with particularly strong advantages in detecting Intensification and Generalisation context, yielding an overall Macro F1-score improvement of nearly 30\%. Our findings provide evidence that theoretical grounding is not only beneficial but essential for advancing beyond mere paraphrase detection towards function-aware analysis of argumentative discourse. This comparative multi-agent architecture represents a step towards scalable, theoretically informed computational tools capable of identifying rhetorical strategies in contemporary discourse.
Chinese Translation
识别话语中重述的战略性使用仍然是计算论证中的一项关键挑战。虽然大型语言模型(LLMs)能够检测表层相似性,但它们往往无法捕捉重述的语用功能,例如其在修辞话语中的作用。本文提出了一种比较多代理框架,旨在量化在这一任务中纳入显性理论知识的好处。我们利用一个注释的政治辩论数据集建立了一个新的标准,涵盖四种不同的重述功能:减弱(Deintensification)、增强(Intensification)、具体化(Specification)、概括(Generalisation)以及其他(Other),涵盖所有剩余类型(D-I-S-G-O)。然后,我们评估了两个平行的基于LLM的代理系统:一个通过检索增强生成(Retrieval-Augmented Generation, RAG)增强了论证理论,另一个是相同的零样本基线。结果显示出明显的性能差距:RAG增强的代理在各方面的表现均显著优于基线,尤其在检测增强和概括上下文方面具有显著优势,整体宏观F1分数提高近30%。我们的研究结果提供了证据,表明理论基础不仅有益,而且对于超越单纯的重述检测,向功能感知的论证话语分析迈进是必不可少的。这种比较多代理架构代表了朝着可扩展的、基于理论的计算工具迈出的一步,这些工具能够识别当代话语中的修辞策略。
cs.CL / 14 / 2602.13748

RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

RMPL:基于关系的多任务渐进学习框架,针对多媒体事件提取的阶段性训练
Jin, Yongkang, Luo, Jianwen, Wang, Jingjing, Yao, Jianmin, Hong, Yu
Abstract
Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision--Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.
Chinese Translation
多媒体事件提取(MEE)旨在从包含文本和图像的文档中识别事件及其论元。这需要在不同模态之间对事件语义进行基础性理解。MEE的进展受到缺乏标注训练数据的限制。M2E2是唯一建立的基准,但仅提供用于评估的标注,这使得直接的监督训练变得不切实际。现有方法主要依赖于跨模态对齐或使用视觉-语言模型(VLMs)进行推理时提示。这些方法并未明确学习结构化的事件表示,且在多模态环境中往往产生较弱的论元基础。为了解决这些局限性,我们提出了RMPL,一种在低资源条件下用于MEE的基于关系的多任务渐进学习框架。RMPL结合了来自单模态事件提取和多媒体关系提取的异构监督,并采用阶段性训练。模型首先使用统一的模式进行训练,以学习跨模态的共享事件中心表示。然后,利用混合的文本和视觉数据对事件提及识别和论元角色提取进行微调。在M2E2基准上与多种VLMs的实验显示,在不同模态设置下均取得了一致的改进。
cs.CL / 15 / 2602.13790

How Do Lexical Senses Correspond Between Spoken German and German Sign Language?

口语德语与德语手语之间的词汇意义如何对应?
Çelikkol, Melis, Zhao, Wei
Abstract
Sign language lexicographers construct bilingual dictionaries by establishing word-to-sign mappings, where polysemous and homonymous words corresponding to different signs across contexts are often underrepresented. A usage-based approach examining how word senses map to signs can identify such novel mappings absent from current dictionaries, enriching lexicographic resources. We address this by analyzing German and German Sign Language (Deutsche Geb\"ardensprache, DGS), manually annotating 1,404 word use-to-sign ID mappings derived from 32 words from the German Word Usage Graph (D-WUG) and 49 signs from the Digital Dictionary of German Sign Language (DW-DGS). We identify three correspondence types: Type 1 (one-to-many), Type 2 (many-to-one), and Type 3 (one-to-one), plus No Match cases. We evaluate computational methods: Exact Match (EM) and Semantic Similarity (SS) using SBERT embeddings. SS substantially outperforms EM overall 88.52% vs. 71.31%), with dramatic gains for Type 1 (+52.1 pp). Our work establishes the first annotated dataset for cross-modal sense correspondence and reveals which correspondence patterns are computationally identifiable. Our code and dataset are made publicly available.
Chinese Translation
手语词典编纂者通过建立词与手势的映射来构建双语词典,其中多义词和同音词在不同语境下对应不同手势的情况往往被低估。基于使用的研究方法考察词义如何映射到手势,可以识别出当前词典中缺失的新映射,从而丰富词典资源。我们通过分析德语和德语手语(Deutsche Gebärdensprache, DGS)来解决这一问题,手动标注了来自德语词汇使用图(D-WUG)中的32个词和来自数字德语手语词典(DW-DGS)中的49个手势所衍生的1,404个词用到手势的ID映射。我们识别出三种对应类型:类型1(一个对多个)、类型2(多个对一个)和类型3(一个对一个),以及无匹配案例。我们评估了计算方法:精确匹配(Exact Match, EM)和语义相似性(Semantic Similarity, SS),使用SBERT嵌入。总体而言,SS的表现显著优于EM(88.52%对71.31%),类型1的提升尤为显著(+52.1个百分点)。我们的工作建立了首个跨模态意义对应的注释数据集,并揭示了哪些对应模式是可以通过计算识别的。我们的代码和数据集已公开提供。
cs.CL / 16 / 2602.13793

OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum

OMGs:支持卵巢肿瘤护理连续体中MDT决策的多智能体系统
Zhang, Yangyang, Wang, Zilong, Xu, Jianbo, Chen, Yongqi, Han, Chu, Zhang, Zhihao, Liu, Shuai, Li, Hui, Zhang, Huiping, Liu, Ziqi, Chen, Jiaxin, Zhu, Jun, Feng, Zheng, Wen, Hao, Ju, Xingzhu, Zhong, Yanping, Zhang, Yunqiu, Duan, Jie, Li, Jun, Li, Dongsheng, Wang, Weijie, Zhu, Haiyan, Jiang, Wei, Wu, Xiaohua, Wang, Shuo, Li, Haiming, Guo, Qinhao
Abstract
Ovarian tumour management has increasingly relied on multidisciplinary tumour board (MDT) deliberation to address treatment complexity and disease heterogeneity. However, most patients worldwide lack access to timely expert consensus, particularly in resource-constrained centres where MDT resources are scarce or unavailable. Here we present OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi-agent AI framework where domain-specific agents deliberate collaboratively to integrate multidisciplinary evidence and generate MDT-style recommendations with transparent rationales. To systematically evaluate MDT recommendation quality, we developed SPEAR (Safety, Personalization, Evidence, Actionability, Robustness) and validated OMGs across diverse clinical scenarios spanning the care continuum. In multicentre re-evaluation, OMGs achieved performance comparable to expert MDT consensus ($4.45 \pm 0.30$ versus $4.53 \pm 0.23$), with higher Evidence scores (4.57 versus 3.92). In prospective multicentre evaluation (59 patients), OMGs demonstrated high concordance with routine MDT decisions. Critically, in paired human-AI studies, OMGs most substantially enhanced clinicians' recommendations in Evidence and Robustness, the dimensions most compromised when multidisciplinary expertise is unavailable. These findings suggest that multi-agent deliberative systems can achieve performance comparable to expert MDT consensus, with potential to expand access to specialized oncology expertise in resource-limited settings.
Chinese Translation
卵巢肿瘤管理越来越依赖于多学科肿瘤委员会(MDT)的讨论,以应对治疗复杂性和疾病异质性。然而,全球大多数患者缺乏及时的专家共识,尤其是在资源有限的中心,MDT资源稀缺或不可用。在此,我们提出了OMGs(卵巢肿瘤多学科智能代理系统),这是一个多智能体人工智能框架,其中特定领域的代理协作讨论,以整合多学科证据并生成具有透明推理的MDT风格建议。为了系统评估MDT建议的质量,我们开发了SPEAR(安全性、个性化、证据、可操作性、稳健性),并在涵盖护理连续体的多种临床场景中验证了OMGs。在多中心重新评估中,OMGs的表现与专家MDT共识相当($4.45 imes 0.30$ 对比 $4.53 imes 0.23$),且证据得分更高(4.57 对比 3.92)。在前瞻性多中心评估中(59名患者),OMGs与常规MDT决策表现出高度一致性。关键的是,在配对的人类-人工智能研究中,OMGs显著增强了临床医生在证据和稳健性方面的建议,这两个维度在缺乏多学科专业知识时受到最严重的影响。这些发现表明,多智能体讨论系统可以达到与专家MDT共识相当的表现,并有潜力在资源有限的环境中扩大对专业肿瘤学知识的获取。
cs.CL / 17 / 2602.13816

The acquisition of English irregular inflections by Yemeni L1 Arabic learners: A Universal Grammar approach

也门阿拉伯语母语学习者对英语不规则词尾变化的习得:一种普遍语法的视角
Alsawsh, Muneef Y., Shormani, Mohammed Q.
Abstract
This study examines the acquisition of English irregular inflections by Yemeni learners of English as a second language (L2), utilizing a Universal Grammar (UG) approach. Within the UG approach, the study considers Feature Reassembly Hypothesis (FRH) (Lardiere, 2008, 2009) part of UG, focusing on the roles of first language (L1) transfer and L2 developmental influence. It analyzes learner errors across two developmental stages. Stage 1 data reveal a dominant influence of L1 transfer, particularly in phonological and structural mismatches, while stage 2 data demonstrate increased learner sensitivity to UG properties and morphological reconfiguration toward the target language. Findings reveal that errors in irregular inflectional morphology are attributed to both interlingual and intralingual sources, with overgeneralization of L2 rules as a common developmental strategy. Statistical analysis, including a one-way ANOVA, indicates significant improvement in the production of well-formed irregular inflections from stage 1 to stage 2, underscoring learners' continued access to UG. However, persistent difficulties with consonant change, zero-morpheme, and -a plural inflections suggest that limited exposure, ineffective input modeling, and insufficient instructional quality constrain full UG access. The study concludes that while L1 transfer and L2 developmental factors influence initial stages of acquisition, appropriate linguistic input and instruction are critical for facilitating UG-driven feature reassembly in adult L2 learners.
Chinese Translation
本研究考察了也门英语作为第二语言(L2)学习者对英语不规则词尾变化的习得,采用普遍语法(UG)的方法。研究在UG的框架下,考虑了特征重组假设(Feature Reassembly Hypothesis, FRH)(Lardiere, 2008, 2009)作为UG的一部分,重点分析了母语(L1)迁移和L2发展影响的作用。研究分析了学习者在两个发展阶段的错误。阶段1的数据揭示了L1迁移的主导影响,特别是在音韵和结构不匹配方面,而阶段2的数据则显示学习者对UG特性和向目标语言的形态重构的敏感性有所提高。研究结果表明,不规则词尾形态的错误既源于跨语言(interlingual)因素,也源于同语言(intralingual)因素,L2规则的过度推广是一个常见的发展策略。统计分析,包括单因素方差分析(one-way ANOVA),表明从阶段1到阶段2,学习者在正确生成不规则词尾方面有显著改善,强调了学习者对UG的持续访问。然而,在辅音变化、零形态和-a复数词尾方面的持续困难表明,有限的接触、无效的输入建模和不足的教学质量限制了对UG的全面访问。研究结论指出,尽管L1迁移和L2发展因素影响习得的初始阶段,但适当的语言输入和教学对于促进成人L2学习者的UG驱动特征重组至关重要。
cs.CL / 18 / 2602.13832

Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind

超越语言:通过心智理论评估和弥合用户代理交互中的认知差异
Ruan, Minyuan, Wang, Ziyue, Liu, Kaiming, Lai, Yunghwei, Li, Peng, Liu, Yang
Abstract
Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users. However, they still struggle to comprehend and respond to the true user needs when intentions and instructions are imprecisely conveyed, leading to a divergence between subjective user believes and true environment states. Resolving this epistemic divergence requires Theory of Mind (ToM), yet existing ToM evaluations for LLMs primarily focus on isolated belief inference, overlooking its functional utility in real-world interaction. To this end, we formalize ToM for LLMs as a mechanism for epistemic divergence detection and resolution, and propose a benchmark, \benchname, to assess how models reconcile user beliefs and profiles in practice. Results across 11 leading models reveal a significant limitation to identify underlying cognitive gaps that impede task success. To bridge this gap, we further curate a trajectory-based ToM dataset linking belief tracking with task-related state inference. The model trained on this data via reinforcement learning shows consistent improvement in reasoning about user mental states, leading to enhanced downstream performance. Our work highlights the practical value of ToM as an essential interaction-level mechanism rather than as a standalone reasoning skill.
Chinese Translation
大型语言模型(LLMs)发展迅速,广泛应用于通用和专业任务,以协助人类用户。然而,当意图和指令表达不准确时,它们仍然难以理解和回应用户的真实需求,从而导致主观用户信念与真实环境状态之间的差异。解决这种认知差异需要心智理论(Theory of Mind, ToM),然而现有的LLMs心智理论评估主要集中在孤立的信念推断上,忽视了其在现实交互中的功能性实用性。为此,我们将LLMs的心智理论形式化为一种认知差异检测和解决机制,并提出基准 enchname,以评估模型如何在实践中调和用户信念与个人资料。对11个领先模型的结果显示,识别阻碍任务成功的潜在认知差距存在显著限制。为弥补这一差距,我们进一步整理了一个基于轨迹的心智理论数据集,将信念追踪与任务相关状态推断相结合。通过强化学习在该数据上训练的模型在推理用户心理状态方面表现出持续改进,从而提升了下游性能。我们的研究强调了心智理论作为一种基本交互机制的实际价值,而非仅仅作为独立的推理技能。
cs.CL / 19 / 2602.13836

Speculative Decoding with a Speculative Vocabulary

带有推测词汇的推测解码
Williams, Miles, Kwon, Young D., Li, Rui, Kouris, Alexandros, Venieris, Stylianos I.
Abstract
Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. This relies upon a small draft model, tasked with predicting the outputs of the target model. State-of-the-art speculative decoding methods use a draft model consisting of a single decoder layer and output embedding matrix, with the latter dominating drafting time for the latest LMs. Recent work has sought to address this output distribution bottleneck by reducing the vocabulary of the draft model. Although this can improve throughput, it compromises speculation effectiveness when the target token is out-of-vocabulary. In this paper, we argue for vocabulary speculation as an alternative to a reduced vocabulary. We propose SpecVocab, an efficient and effective method that selects a vocabulary subset per decoding step. Across a variety of tasks, we demonstrate that SpecVocab can achieve a higher acceptance length than state-of-the-art speculative decoding approach, EAGLE-3. Notably, this yields up to an 8.1% increase in average throughput over EAGLE-3.
Chinese Translation
推测解码迅速成为加速语言模型(LM)推理的主要方法,因为它在提供显著加速的同时产生相同的输出。这依赖于一个小型草稿模型,负责预测目标模型的输出。最先进的推测解码方法使用一个由单个解码器层和输出嵌入矩阵组成的草稿模型,后者在最新的语言模型中主导了草稿时间。近期的研究试图通过减少草稿模型的词汇量来解决这一输出分布瓶颈。尽管这可以提高吞吐量,但当目标标记超出词汇范围时,会妨碍推测的有效性。在本文中,我们主张将词汇推测作为减少词汇量的替代方案。我们提出了SpecVocab,这是一种高效且有效的方法,在每个解码步骤中选择一个词汇子集。在各种任务中,我们证明SpecVocab能够实现比最先进的推测解码方法EAGLE-3更高的接受长度。值得注意的是,这使得平均吞吐量比EAGLE-3提高了高达8.1%。
cs.CL / 20 / 2602.13840

PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training

PrivAct:通过多智能体偏好训练内化上下文隐私保护
Cheng, Yuhan, Ye, Hancheng, Li, Hai Helen, Sun, Jingwei, Chen, Yiran
Abstract
Large language model (LLM) agents are increasingly deployed in personalized tasks involving sensitive, context-dependent information, where privacy violations may arise in agents' action due to the implicitness of contextual privacy. Existing approaches rely on external, inference-time interventions which are brittle, scenario-specific, and may expand the privacy attack surface. We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions. By embedding privacy preferences into each agent, PrivAct enhances system-wide contextual integrity while achieving a more favorable privacy-helpfulness tradeoff. Experiments across multiple LLM backbones and benchmarks demonstrate consistent improvements in contextual privacy preservation, reducing leakage rates by up to 12.32% while maintaining comparable helpfulness, as well as zero-shot generalization and robustness across diverse multi-agent topologies. Code is available at https://github.com/chengyh23/PrivAct.
Chinese Translation
大型语言模型(LLM)智能体越来越多地应用于涉及敏感和依赖上下文的信息的个性化任务中,在这些任务中,由于上下文隐私的隐含性,智能体的行动可能会导致隐私侵犯。现有方法依赖于外部推理时干预,这些干预脆弱、特定于场景,并可能扩大隐私攻击面。我们提出了PrivAct,一种上下文隐私感知的多智能体学习框架,直接将上下文隐私保护内化到模型的生成行为中,以实现符合隐私要求的智能体行动。通过将隐私偏好嵌入每个智能体,PrivAct增强了系统范围内的上下文完整性,同时实现了更有利的隐私与帮助性之间的权衡。在多个LLM基础模型和基准测试中的实验表明,在上下文隐私保护方面持续改善,泄露率降低了多达12.32%,同时保持了可比的帮助性,以及在多种多智能体拓扑结构中的零样本泛化和鲁棒性。代码可在 https://github.com/chengyh23/PrivAct 获取。
cs.CL / 21 / 2602.13860

Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe

指导大型语言模型实现领域适应、精确性和安全性
Banerjee, Somnath
Abstract
The overarching research direction of this work is the development of a ''Responsible Intelligence'' framework designed to reconcile the immense generative power of Large Language Models (LLMs) with the stringent requirements of real-world deployment. As these models become a transformative force in artificial intelligence, there is an urgent need to move beyond general-purpose architectures toward systems that are contextually aware, inherently safer, and deeply respectful of global cultural nuances. This research navigates three interconnected threads: domain adaptation to ensure technical precision, ethical rigor to mitigate adversarial vulnerabilities, and cultural/multilingual alignment to promote global inclusivity. The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.
Chinese Translation
本研究的总体方向是开发一个“负责任的智能”框架,旨在调和大型语言模型(LLMs)巨大的生成能力与现实世界部署的严格要求。随着这些模型成为人工智能的变革性力量,迫切需要超越通用架构,朝着具有上下文意识、内在安全性和深刻尊重全球文化差异的系统发展。本研究探讨了三个相互关联的主题:领域适应以确保技术精确性、伦理严谨以减轻对抗性脆弱性,以及文化/多语言对齐以促进全球包容性。方法论的轨迹从经典的监督适应以满足任务特定需求,转向解码时的对齐以确保安全,最终利用人类反馈和偏好建模实现社会语言学的敏锐性。
cs.CL / 22 / 2602.13867

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages

弥合多语言安全差距:面向全球南方语言的高效文化意识对齐
Banerjee, Somnath, Hazra, Rima, Mukherjee, Animesh
Abstract
Large language models (LLMs) are being deployed across the Global South, where everyday use involves low-resource languages, code-mixing, and culturally specific norms. Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality ''transfer'' across languages. Evidence increasingly shows they do not. We synthesize recent findings indicating that (i) safety guardrails weaken sharply on low-resource and code-mixed inputs, (ii) culturally harmful behavior can persist even when standard toxicity scores look acceptable, and (iii) English-only knowledge edits and safety patches often fail to carry over to low-resource languages. In response, we outline a practical agenda for researchers and students in the Global South: parameter-efficient safety steering, culturally grounded evaluation and preference data, and participatory workflows that empower local communities to define and mitigate harm. Our aim is to make multilingual safety a core requirement-not an add-on-for equitable AI in underrepresented regions.
Chinese Translation
大型语言模型(LLMs)正在全球南方地区被广泛应用,在日常使用中涉及低资源语言、代码混合和文化特定规范。然而,安全管道、基准测试和对齐仍然主要针对英语和少数高资源语言,隐含地假设安全性和事实性在不同语言间能够“转移”。越来越多的证据表明,这种假设并不成立。我们综合了近期研究结果,指出(i)在低资源和代码混合输入上,安全防护措施显著减弱;(ii)即使标准毒性评分看似可接受,文化上有害的行为仍然可能持续存在;(iii)仅针对英语的知识编辑和安全补丁往往无法有效转移到低资源语言。对此,我们为全球南方的研究人员和学生提出了一项切实可行的议程:参数高效的安全引导、以文化为基础的评估和偏好数据,以及赋权当地社区定义和减轻危害的参与式工作流程。我们的目标是使多语言安全成为公平人工智能在欠代表地区的核心要求,而非附加选项。
cs.CL / 23 / 2602.13870

ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

ADAB:阿拉伯语自动礼貌基准的数据集——一个大规模的计算社会语用学资源
Al-Khalifa, Hend, Ghezaiel, Nadia, Bounnit, Maria, Alhazmi, Hend Hamed, Alfear, Noof Abdullah, Alqifari, Reem Fahad, Almasoud, Ameera Masoud, Al-Ghamdi, Sharefah Ahmed
Abstract
The growing importance of culturally-aware natural language processing systems has led to an increasing demand for resources that capture sociopragmatic phenomena across diverse languages. Nevertheless, Arabic-language resources for politeness detection remain under-explored, despite the rich and complex politeness expressions embedded in Arabic communication. In this paper, we introduce ADAB (Arabic Politeness Dataset), a new annotated Arabic dataset collected from four online platforms, including social media, e-commerce, and customer service domains, covering Modern Standard Arabic and multiple dialects (Gulf, Egyptian, Levantine, and Maghrebi). The dataset was annotated based on Arabic linguistic traditions and pragmatic theory, resulting in three classes: polite, impolite, and neutral. It contains 10,000 samples with linguistic feature annotations across 16 politeness categories and achieves substantial inter-annotator agreement (kappa = 0.703). We benchmark 40 model configurations, including traditional machine learning, transformer-based models, and large language models. The dataset aims to support research on politeness-aware Arabic NLP.
Chinese Translation
随着文化意识自然语言处理系统的重要性日益增加,对能够捕捉多种语言社会语用现象的资源的需求也在不断上升。然而,尽管阿拉伯语交流中蕴含着丰富而复杂的礼貌表达,阿拉伯语的礼貌检测资源仍然未得到充分探索。本文介绍了ADAB(阿拉伯礼貌数据集),这是一个从四个在线平台收集的新标注阿拉伯语数据集,包括社交媒体、电子商务和客户服务领域,涵盖现代标准阿拉伯语及多种方言(海湾方言、埃及方言、黎凡特方言和摩洛哥方言)。该数据集基于阿拉伯语言传统和语用理论进行标注,结果分为三类:礼貌、不礼貌和中立。数据集中包含10,000个样本,涵盖16个礼貌类别的语言特征标注,并实现了显著的标注者间一致性(kappa = 0.703)。我们基准测试了40种模型配置,包括传统机器学习、基于变换器的模型和大型语言模型。该数据集旨在支持关于礼貌意识的阿拉伯自然语言处理研究。
cs.CL / 24 / 2602.13890

Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach

评估小型语言模型中用于检索增强生成的提示工程技术:一种多跳问答方法
Mohammadi, Amir Hossein, Moeinian, Ali, Razavizade, Zahra, Fatemi, Afsaneh, Ramezani, Reza
Abstract
Retrieval Augmented Generation (RAG) is a powerful approach for enhancing the factual grounding of language models by integrating external knowledge. While widely studied for large language models, the optimization of RAG for Small Language Models (SLMs) remains a critical research gap, particularly in complex, multi-hop question-answering tasks that require sophisticated reasoning. In these systems, prompt template design is a crucial yet under-explored factor influencing performance. This paper presents a large-scale empirical study to investigate this factor, evaluating 24 different prompt templates on the HotpotQA dataset. The set includes a standard RAG prompt, nine well-formed techniques from the literature, and 14 novel hybrid variants, all tested on two prominent SLMs: Qwen2.5-3B Instruct and Gemma3-4B-It. Our findings, based on a test set of 18720 instances, reveal significant performance gains of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, yielding an improvement of up to 6% for both models compared to the Standard RAG prompt. This research also offers concrete analysis and actionable recommendations for designing effective and efficient prompts for SLM-based RAG systems, practically for deployment in resource-constrained environments.
Chinese Translation
检索增强生成(Retrieval Augmented Generation, RAG)是一种通过整合外部知识来增强语言模型事实基础的强大方法。尽管在大型语言模型中得到了广泛研究,但针对小型语言模型(Small Language Models, SLMs)优化RAG仍然是一个重要的研究空白,特别是在需要复杂推理的多跳问答任务中。在这些系统中,提示模板设计是一个关键但尚未深入探讨的因素,影响着性能。本文呈现了一项大规模的实证研究,以调查这一因素,在HotpotQA数据集上评估了24种不同的提示模板。该集合包括一个标准RAG提示、九种文献中的良好形式技术和14种新颖的混合变体,所有这些都在两个著名的SLM上进行了测试:Qwen2.5-3B Instruct和Gemma3-4B-It。基于18720个实例的测试集,我们的发现显示,Qwen2.5的性能提升高达83%,Gemma3-4B-It的提升高达84.5%,与标准RAG提示相比,两种模型的改进幅度均达到6%。本研究还提供了具体的分析和可操作的建议,以设计有效且高效的SLM基础RAG系统的提示,特别适用于资源受限环境中的部署。
cs.CL / 25 / 2602.13905

Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin

用于自动转录中世纪法语和拉丁文手稿的前编辑规范化
Clérice, Thibault, Bawden, Rachel, Glaise, Anthony, Pinche, Ariane, Smith, David
Abstract
Recent advances in Automatic Text Recognition (ATR) have improved access to historical archives, yet a methodological divide persists between palaeographic transcriptions and normalized digital editions. While ATR models trained on more palaeographically-oriented datasets such as CATMuS have shown greater generalizability, their raw outputs remain poorly compatible with most readers and downstream NLP tools, thus creating a usability gap. On the other hand, ATR models trained to produce normalized outputs have been shown to struggle to adapt to new domains and tend to over-normalize and hallucinate. We introduce the task of Pre-Editorial Normalization (PEN), which consists in normalizing graphemic ATR output according to editorial conventions, which has the advantage of keeping an intermediate step with palaeographic fidelity while providing a normalized version for practical usability. We present a new dataset derived from the CoMMA corpus and aligned with digitized Old French and Latin editions using passim. We also produce a manually corrected gold-standard evaluation set. We benchmark this resource using ByT5-based sequence-to-sequence models on normalization and pre-annotation tasks. Our contributions include the formal definition of PEN, a 4.66M-sample silver training corpus, a 1.8k-sample gold evaluation set, and a normalization model achieving a 6.7% CER, substantially outperforming previous models for this task.
Chinese Translation
近期在自动文本识别(ATR)方面的进展改善了对历史档案的访问,然而,古文字转录与规范化数字版之间仍然存在方法论上的鸿沟。尽管在更注重古文字学的数据集(如CATMuS)上训练的ATR模型显示出更好的泛化能力,但其原始输出与大多数读者和下游自然语言处理工具的兼容性较差,从而造成了可用性差距。另一方面,旨在生成规范化输出的ATR模型在适应新领域时表现不佳,往往过度规范化并产生虚构内容。我们提出了前编辑规范化(PEN)任务,该任务旨在根据编辑规范对图形化ATR输出进行规范化,具有在保持古文字学忠实度的同时提供实用的规范化版本的优势。我们展示了一个新的数据集,该数据集源自CoMMA语料库,并与使用passim数字化的古法语和拉丁文版对齐。我们还制作了一个手动校正的金标准评估集。我们使用基于ByT5的序列到序列模型对该资源在规范化和预标注任务上进行了基准测试。我们的贡献包括对PEN的正式定义、一个包含466万样本的银级训练语料库、一个包含1800个样本的金级评估集,以及一个实现6.7%字符错误率(CER)的规范化模型,显著优于以往在该任务上的模型表现。
cs.CL / 26 / 2602.13964

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

HLE-Verified:人类最后考试的系统验证与结构化修订
Zhai, Weiqi, Wang, Zhihai, Wang, Jinghang, Yang, Boyu, Li, Xiaogang, Xu, Xiang, Wang, Bohan, Wang, Peng, Wu, Xingzhe, Li, Anfeng, Feng, Qiyuan, Zhou, Yuhao, Han, Shoulin, Luo, Wenjie, Li, Yiyuan, Wang, Yaxuan, Luo, Ruixian, Lin, Guojie, Xiao, Peiyao, Xu, Chengliang, Wang, Ben, Wang, Zeyu, Chen, Zichao, Ye, Jianan, Hu, Yijie, Chen, Jialong, Shen, Zongwen, Xu, Yuliang, Yang, An, Yu, Bowen, Liu, Dayiheng, Lin, Junyang, Wei, Hu, Shen, Que, Zhao, Bing
Abstract
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7--10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30--40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://github.com/SKYLENAGE-AI/HLE-Verified
Chinese Translation
人类最后考试(HLE)已成为评估前沿大型语言模型在具有挑战性的多领域问题上的广泛使用基准。然而,社区主导的分析提出了HLE中存在非平凡数量的噪声项的担忧,这可能会影响评估结果并扭曲跨模型比较。为了解决这一挑战,我们引入了HLE-Verified,这是HLE的一个经过验证和修订的版本,具有透明的验证协议和细致的错误分类。我们的构建遵循一个两阶段的验证与修复工作流程,最终形成一个认证基准。在第一阶段,每个项目通过领域专家审查和基于模型的交叉检查进行问题和最终答案的二元验证,最终获得641个经过验证的项目。在第二阶段,存在缺陷但可修复的项目在严格约束下进行修订,以保留原始评估意图,通过双独立专家修复、模型辅助审计和最终裁决,最终形成1170个修订和认证的项目。剩余的689个项目作为文档化的不确定集合发布,并附有明确的不确定来源和专业标签,以便未来的修订。我们在HLE和HLE-Verified上评估了七个最先进的语言模型,观察到HLE-Verified的平均绝对准确率提高了7到10个百分点。特别是在原始问题陈述和/或参考答案存在错误的项目上,改进尤为显著,增幅达到30到40个百分点。我们的分析进一步揭示了模型置信度与问题陈述或参考答案中错误存在之间的强关联,支持了我们修订的有效性。总体而言,HLE-Verified通过减少注释噪声并实现对模型能力的更真实测量,改善了HLE风格的评估。数据可在以下链接获取:https://github.com/SKYLENAGE-AI/HLE-Verified
cs.CL / 27 / 2602.13979

Chain-of-Thought Reasoning with Large Language Models for Clinical Alzheimer's Disease Assessment and Diagnosis

利用大型语言模型进行临床阿尔茨海默病评估与诊断的思维链推理
Zhang, Tongze, Ding, Jun-En, Ozolcer, Melik, Hung, Fang-Ming, Yang, Albert Chih-Chieh, Liu, Feng, Ji, Yi-Rou, Bae, Sang Won
Abstract
Alzheimer's disease (AD) has become a prevalent neurodegenerative disease worldwide. Traditional diagnosis still relies heavily on medical imaging and clinical assessment by physicians, which is often time-consuming and resource-intensive in terms of both human expertise and healthcare resources. In recent years, large language models (LLMs) have been increasingly applied to the medical field using electronic health records (EHRs), yet their application in Alzheimer's disease assessment remains limited, particularly given that AD involves complex multifactorial etiologies that are difficult to observe directly through imaging modalities. In this work, we propose leveraging LLMs to perform Chain-of-Thought (CoT) reasoning on patients' clinical EHRs. Unlike direct fine-tuning of LLMs on EHR data for AD classification, our approach utilizes LLM-generated CoT reasoning paths to provide the model with explicit diagnostic rationale for AD assessment, followed by structured CoT-based predictions. This pipeline not only enhances the model's ability to diagnose intrinsically complex factors but also improves the interpretability of the prediction process across different stages of AD progression. Experimental results demonstrate that the proposed CoT-based diagnostic framework significantly enhances stability and diagnostic performance across multiple CDR grading tasks, achieving up to a 15% improvement in F1 score compared to the zero-shot baseline method.
Chinese Translation
阿尔茨海默病(AD)已成为全球普遍的神经退行性疾病。传统诊断仍然高度依赖医学影像和医生的临床评估,这在时间和资源方面往往是耗时且资源密集的,涉及人力专业知识和医疗资源。近年来,大型语言模型(LLMs)在医疗领域的应用逐渐增多,尤其是在电子健康记录(EHRs)方面,但其在阿尔茨海默病评估中的应用仍然有限,特别是考虑到AD涉及复杂的多因素病因,这些病因难以通过影像学直接观察。在本研究中,我们提出利用LLMs对患者的临床EHRs进行思维链(Chain-of-Thought, CoT)推理。与直接对EHR数据进行LLMs微调以进行AD分类不同,我们的方法利用LLM生成的CoT推理路径,为模型提供明确的AD评估诊断依据,随后进行结构化的基于CoT的预测。该流程不仅增强了模型诊断内在复杂因素的能力,还改善了不同AD进展阶段预测过程的可解释性。实验结果表明,所提出的基于CoT的诊断框架显著提高了多个CDR分级任务的稳定性和诊断性能,与零样本基线方法相比,F1分数提高了多达15%。
cs.CL / 28 / 2602.14002

The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective

从信息瓶颈视角看大语言模型自我解释中的充分性与简洁性权衡
Zahedzadeh, Ali, Bahrak, Behnam
Abstract
Large Language Models increasingly rely on self-explanations, such as chain of thought reasoning, to improve performance on multi step question answering. While these explanations enhance accuracy, they are often verbose and costly to generate, raising the question of how much explanation is truly necessary. In this paper, we examine the trade-off between sufficiency, defined as the ability of an explanation to justify the correct answer, and conciseness, defined as the reduction in explanation length. Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers.To operationalize this view, we introduce an evaluation pipeline that constrains explanation length and assesses sufficiency using multiple language models on the ARC Challenge dataset. To broaden the scope, we conduct experiments in both English, using the original dataset, and Persian, as a resource-limited language through translation. Our experiments show that more concise explanations often remain sufficient, preserving accuracy while substantially reducing explanation length, whereas excessive compression leads to performance degradation.
Chinese Translation
大型语言模型越来越依赖自我解释,例如思维链推理,以提高多步骤问答的性能。虽然这些解释提高了准确性,但它们往往冗长且生成成本高,这引发了一个问题:究竟需要多少解释才算真正必要。本文考察了充分性与简洁性之间的权衡,其中充分性被定义为解释能够证明正确答案的能力,而简洁性则被定义为解释长度的减少。基于信息瓶颈原理,我们将解释概念化为压缩表示,仅保留生成正确答案所必需的信息。为了实现这一观点,我们引入了一个评估流程,该流程限制了解释长度,并使用多个语言模型在ARC挑战数据集上评估充分性。为了扩展研究范围,我们在英语(使用原始数据集)和波斯语(通过翻译作为资源有限语言)中进行了实验。我们的实验表明,更简洁的解释通常仍然是充分的,能够保持准确性,同时显著减少解释长度,而过度压缩则会导致性能下降。
cs.CL / 29 / 2602.14009

Named Entity Recognition for Payment Data Using NLP

基于自然语言处理的支付数据命名实体识别
Nayak, Srikumar
Abstract
Named Entity Recognition (NER) has emerged as a critical component in automating financial transaction processing, particularly in extracting structured information from unstructured payment data. This paper presents a comprehensive analysis of state-of-the-art NER algorithms specifically designed for payment data extraction, including Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory with CRF (BiLSTM-CRF), and transformer-based models such as BERT and FinBERT. We conduct extensive experiments on a dataset of 50,000 annotated payment transactions across multiple payment formats including SWIFT MT103, ISO 20022, and domestic payment systems. Our experimental results demonstrate that fine-tuned BERT models achieve an F1-score of 94.2% for entity extraction, outperforming traditional CRF-based approaches by 12.8 percentage points. Furthermore, we introduce PaymentBERT, a novel hybrid architecture combining domain-specific financial embeddings with contextual representations, achieving state-of-the-art performance with 95.7% F1-score while maintaining real-time processing capabilities. We provide detailed analysis of cross-format generalization, ablation studies, and deployment considerations. This research provides practical insights for financial institutions implementing automated sanctions screening, anti-money laundering (AML) compliance, and payment processing systems.
Chinese Translation
命名实体识别(NER)已成为自动化金融交易处理中的关键组成部分,特别是在从非结构化支付数据中提取结构化信息方面。本文对专门为支付数据提取设计的最先进的NER算法进行了全面分析,包括条件随机场(Conditional Random Fields, CRF)、双向长短期记忆网络结合CRF(Bidirectional Long Short-Term Memory with CRF, BiLSTM-CRF)以及基于变换器的模型如BERT和FinBERT。我们在一个包含50,000个标注支付交易的数据集上进行了广泛实验,涵盖了多种支付格式,包括SWIFT MT103、ISO 20022和国内支付系统。实验结果表明,经过微调的BERT模型在实体提取中达到了94.2%的F1-score,优于传统的基于CRF的方法12.8个百分点。此外,我们引入了PaymentBERT,这是一种新颖的混合架构,结合了特定领域的金融嵌入与上下文表示,达到了95.7%的F1-score的最先进性能,同时保持了实时处理能力。我们提供了跨格式泛化、消融研究和部署考虑的详细分析。本研究为实施自动化制裁筛查、反洗钱(AML)合规和支付处理系统的金融机构提供了实用的见解。
cs.CL / 30 / 2602.14028

GRRM: Group Relative Reward Modeling for Machine Translation

GRRM:用于机器翻译的群体相对奖励建模
Yang, Sen, Cheng, Shanbo, Xu, Lu, Zhang, Jianbing, Huang, Shujian
Abstract
While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains like Machine Translation hinges on accurate intra-group ranking. We identify that standard Scalar Quality Metrics (SQM) fall short in this context; by evaluating candidates in isolation, they lack the comparative context necessary to distinguish fine-grained linguistic nuances. To address this, we introduce the Group Quality Metric (GQM) paradigm and instantiate it via the Group Relative Reward Model (GRRM). Unlike traditional independent scorers, GRRM processes the entire candidate group jointly, leveraging comparative analysis to rigorously resolve relative quality and adaptive granularity. Empirical evaluations confirm that GRRM achieves competitive ranking accuracy among all baselines. Building on this foundation, we integrate GRRM into the GRPO training loop to optimize the translation policy. Experimental results demonstrate that our framework not only improves general translation quality but also unlocks reasoning capabilities comparable to state-of-the-art reasoning models. We release codes, datasets, and model checkpoints at https://github.com/NJUNLP/GRRM.
Chinese Translation
群体相对策略优化(GRPO)为大型语言模型(LLM)后训练提供了强大的框架,但其在机器翻译等开放领域的有效性依赖于准确的组内排名。我们发现标准标量质量指标(SQM)在此背景下存在不足;由于孤立评估候选项,它们缺乏区分细微语言差异所需的比较上下文。为了解决这个问题,我们引入了群体质量指标(GQM)范式,并通过群体相对奖励模型(GRRM)进行实例化。与传统的独立评分器不同,GRRM联合处理整个候选组,利用比较分析严格解决相对质量和自适应粒度。实证评估确认GRRM在所有基线中实现了竞争性的排名准确性。在此基础上,我们将GRRM集成到GRPO训练循环中,以优化翻译策略。实验结果表明,我们的框架不仅提高了整体翻译质量,还解锁了与最先进推理模型相当的推理能力。我们在https://github.com/NJUNLP/GRRM发布代码、数据集和模型检查点。
cs.CL / 31 / 2602.14039

Geometry-Preserving Aggregation for Mixture-of-Experts Embedding Models

保持几何特性的混合专家嵌入模型聚合
Kachuee, Sajjad, Sharifkhani, Mohammad
Abstract
Mixture-of-Experts (MoE) embedding models combine expert outputs using weighted linear summation, implicitly assuming a linear subspace structure in the embedding space. This assumption is shown to be inconsistent with the geometry of expert representations. Geometric analysis of a modern MoE embedding model reveals that expert outputs lie on a shared hyperspherical manifold characterized by tightly concentrated norms and substantial angular separation. Under this geometry, linear aggregation induces inward collapse toward the manifold interior, distorting vector magnitude and direction and reducing embedding comparability. To address this inconsistency, Spherical Barycentric Aggregation (SBA) is introduced as a geometry-preserving aggregation operator that separates radial and angular components to maintain hyperspherical structure while remaining fully compatible with existing routing mechanisms. Experiments on selected tasks from the Massive Text Embedding Benchmark (MTEB), including semantic similarity, clustering, and duplicate question detection, demonstrate consistent performance improvements with identical training cost and full stability. Additional geometric analyses confirm that SBA prevents aggregation-induced collapse and preserves hyperspherical consistency, highlighting the importance of geometry-aware aggregation in MoE embedding architectures.
Chinese Translation
混合专家(Mixture-of-Experts, MoE)嵌入模型通过加权线性求和结合专家输出,隐含地假设嵌入空间中存在线性子空间结构。然而,这一假设与专家表示的几何特性不一致。对现代 MoE 嵌入模型的几何分析表明,专家输出位于一个共享的超球面流形上,该流形的特征是范数高度集中和显著的角度分离。在这种几何结构下,线性聚合会导致向流形内部的内向坍缩,扭曲向量的大小和方向,从而降低嵌入的可比性。为了解决这一不一致性,提出了球面重心聚合(Spherical Barycentric Aggregation, SBA)作为一种保持几何特性的聚合算子,它将径向和角度分量分离,以维持超球面结构,同时与现有的路由机制完全兼容。在大规模文本嵌入基准(Massive Text Embedding Benchmark, MTEB)上进行的选定任务实验,包括语义相似性、聚类和重复问题检测,均展示了在相同训练成本和完全稳定性的情况下,性能的一致性提升。额外的几何分析确认 SBA 防止了聚合引起的坍缩,并保持了超球面的一致性,突显了在 MoE 嵌入架构中几何感知聚合的重要性。
cs.CL / 32 / 2602.14044

Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness

上下文影响大语言模型的检索增强事实核查效果
Bernardelle, Pietro, Civelli, Stefano, Roitero, Kevin, Demartini, Gianluca
Abstract
Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent. While prior research has emphasized mid-context degradation in question answering, this study examines the impact of context in LLM-based fact verification. Using three datasets (HOVER, FEVEROUS, and ClimateFEVER) and five open-source models accross different parameters sizes (7B, 32B and 70B parameters) and model families (Llama-3.1, Qwen2.5 and Qwen3), we evaluate both parametric factual knowledge and the impact of evidence placement across varying context lengths. We find that LLMs exhibit non-trivial parametric knowledge of factual claims and that their verification accuracy generally declines as context length increases. Similarly to what has been shown in previous works, in-context evidence placement plays a critical role with accuracy being consistently higher when relevant evidence appears near the beginning or end of the prompt and lower when placed mid-context. These results underscore the importance of prompt structure in retrieval-augmented fact-checking systems.
Chinese Translation
大语言模型(LLMs)在多种任务中展现出强大的推理能力,但它们在扩展上下文中的表现仍然不一致。尽管之前的研究强调了在问答中的中等上下文退化,本研究考察了上下文在基于LLM的事实验证中的影响。我们使用了三个数据集(HOVER、FEVEROUS 和 ClimateFEVER)以及五个不同参数规模(7B、32B 和 70B 参数)和模型家族(Llama-3.1、Qwen2.5 和 Qwen3)的开源模型,评估了参数化事实知识以及在不同上下文长度下证据位置的影响。我们发现LLMs对事实声明展现出非平凡的参数知识,并且随着上下文长度的增加,它们的验证准确性通常会下降。与之前的研究结果类似,上下文中的证据位置起着关键作用,当相关证据出现在提示的开头或结尾时,准确性始终较高,而当证据位于中间上下文时,准确性则较低。这些结果强调了在检索增强的事实核查系统中提示结构的重要性。
cs.CL / 33 / 2602.14054

LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation

LogitsCoder:通过Logits偏好解码实现高效的思维链路径搜索以进行代码生成
Chen, Jizheng, Zhang, Weiming, Dai, Xinyi, Liu, Weiwen, Du, Kounianhua, Wang, Yasheng, Tang, Ruiming, Yu, Yong, Zhang, Weinan
Abstract
Code generation remains a challenging task that requires precise and structured reasoning. Existing Test Time Scaling (TTS) methods, including structured tree search, have made progress in exploring reasoning paths but still face two major challenges: (1) underthinking, where reasoning chains tend to be shallow and fail to capture the full complexity of problems; and (2) overthinking, where overly verbose reasoning leads to inefficiency and increased computational costs. To address these issues, we propose LogitsCoder, a novel framework that enhances chain-of-thought reasoning through lightweight, logit-level control mechanisms for code generation. LogitsCoder iteratively generates and refines reasoning steps by first steering token selection toward statistically preferred patterns via Logits Preference Decoding, then selecting and aggregating diverse reasoning paths using Logits Rank Based Path Selection and Thoughts Aggregation. This results in coherent and effective reasoning chains that balance depth and efficiency. Extensive experiments demonstrate that LogitsCoder produces more efficient and higher-quality reasoning chains, leading to superior code generation performance compared to baseline methods.
Chinese Translation
代码生成仍然是一项具有挑战性的任务,需要精确和结构化的推理。现有的测试时间缩放(Test Time Scaling, TTS)方法,包括结构化树搜索,已在探索推理路径方面取得了一定进展,但仍面临两个主要挑战:(1)思维不足,推理链往往较浅,无法捕捉问题的全部复杂性;(2)思维过度,过于冗长的推理导致效率低下和计算成本增加。为了解决这些问题,我们提出了LogitsCoder,一个新颖的框架,通过轻量级的logit级控制机制增强思维链推理以进行代码生成。LogitsCoder通过首先通过Logits偏好解码引导标记选择朝向统计上偏好的模式,然后使用Logits基于排名的路径选择和思维聚合选择和聚合多样的推理路径,从而迭代生成和优化推理步骤。这导致了平衡深度和效率的连贯且有效的推理链。大量实验表明,LogitsCoder生成的推理链更高效且质量更高,相较于基线方法在代码生成性能上表现更优。
cs.CL / 34 / 2602.14060

LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts

LM-Lexicon:通过协调语义专家来改善定义建模
Liu, Yang, Yang, Jiaye, Li, Weikang, Liang, Jiahui, Li, Yang, Yan, Lingyong
Abstract
We introduce LM-Lexicon, an innovative definition modeling approach that incorporates data clustering, semantic expert learning, and model merging using a sparse mixture-of-experts architecture. By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art model) over existing methods on five widely used benchmarks. Empirically, we demonstrate that 1) the clustering strategy enables fine-grained expert specialization with nearly 10% improvement in definition quality; 2) the semantic-aware domain-level routing mechanism achieves higher expert efficacy (+1%) than conventional token-level routing; and 3) further performance gains can be obtained through test-time compute and semantic expert scaling. Our work advances definition modeling while providing insights into the development of efficient language models for semantic-intensive applications.
Chinese Translation
我们介绍了LM-Lexicon,这是一种创新的定义建模方法,结合了数据聚类、语义专家学习和使用稀疏专家混合架构的模型合并。通过将定义建模任务分解为专门的语义领域,在这些领域中小型语言模型被训练为领域专家,LM-Lexicon在五个广泛使用的基准测试中相比于现有方法实现了显著的改进(与之前的最先进模型相比,BLEU分数提高了7%)。实证研究表明:1)聚类策略使得专家专业化更为细致,定义质量提高近10%;2)语义感知的领域级路由机制相比传统的令牌级路由实现了更高的专家有效性(提高1%);3)通过测试时计算和语义专家扩展可以获得进一步的性能提升。我们的工作推动了定义建模的发展,同时为语义密集型应用的高效语言模型的开发提供了见解。
cs.CL / 35 / 2602.14062

From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset

从稀缺到规模:普什图语公共语音数据集的发布级分析
Jahani, Jandad, Dawodi, Mursal, Baktash, Jawid Ahmad
Abstract
Large, openly licensed speech datasets are essential for building automatic speech recognition (ASR) systems, yet many widely spoken languages remain underrepresented in public resources. Pashto, spoken by more than 60 million people, has historically lacked large-scale openly licensed speech data suitable for modern ASR development. This paper presents a release-level analysis of the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025) and contextualizing trends across major releases. We document rapid growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, including 975.89 validated hours available for supervised ASR training. Beyond scale, we analyze validation throughput, contributor participation inequality, demographic metadata completeness, and sentence-level concentration in the validated subset. We find that participation is extremely concentrated (Gini = 0.941), age representation is strongly skewed toward young adults, and 41.97\% of clips lack self-reported gender labels, limiting subgroup auditing based on metadata. At the textual level, prompt reuse is moderate: 35.88\% of unique sentences account for 50\% of validated clips, suggesting that structural concentration is driven primarily by uneven contributor activity rather than dominance of a small prompt set. These results provide a quantitative audit of a rapidly scaling low-resource speech corpus and highlight practical priorities for improving dataset maturity, including expanded validation capacity and broader demographic participation.
Chinese Translation
大型、开放许可的语音数据集对于构建自动语音识别(ASR)系统至关重要,但许多广泛使用的语言在公共资源中仍然表现不足。普什图语由超过6000万人使用,历史上缺乏适合现代ASR开发的大规模开放许可语音数据。本文对Mozilla Common Voice语料库中普什图语部分进行了发布级分析,重点关注版本24.0(2025年12月),并将主要版本之间的趋势进行背景化分析。我们记录了从2023年中期的1.49小时录音迅速增长到2025年的2768.7小时总录音,其中包括975.89小时可用于监督ASR训练的验证时长。除了规模之外,我们还分析了验证吞吐量、贡献者参与不平等、人口统计元数据的完整性以及在验证子集中的句子级集中度。我们发现参与极为集中(基尼系数 = 0.941),年龄代表性明显偏向年轻成年人,41.97%的片段缺乏自我报告的性别标签,限制了基于元数据的子群体审计。在文本层面,提示重用程度适中:35.88%的独特句子占据了50%的验证片段,这表明结构集中主要是由于贡献者活动的不均衡,而非小规模提示集的主导。这些结果为快速扩展的低资源语音语料库提供了定量审计,并突出了改善数据集成熟度的实际优先事项,包括扩大验证能力和更广泛的人口统计参与。
cs.CL / 36 / 2602.14069

Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

开放评分系统:通过成对自适应评分扩展强化学习
Jia, Ruipeng, Yang, Yunyi, Wu, Yuxin, Gai, Yongbo, Tao, Siyuan, Zhou, Mengyu, Lin, Jianhe, Jiang, Xiaoxi, Jiang, Guanjun
Abstract
Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which provide both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available. OpenRS uses an explicit meta-rubric -- a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced -- and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates criterion-level preferences externally, avoiding pointwise weighted scalarization while improving discriminability in open-ended settings. To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles), complemented with pointwise verifiable rubrics that act as both guardrails against degenerate behaviors and a source of verifiable reward for objective sub-tasks. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.
Chinese Translation
标量奖励模型将多维人类偏好压缩为单一的不透明分数,导致信息瓶颈,这通常会在开放式对齐中引发脆弱性和奖励黑客行为。我们认为,对于不可验证任务的稳健对齐本质上是一个原则性泛化问题:奖励不应是内化到评判者中的学习函数,而应是一个在可检查原则下执行的明确推理过程。为了实现这一观点,我们提出了开放评分系统(Open Rubric System, OpenRS),这是一个基于评分标准的即插即用的 LLM 作为评判者框架,围绕成对自适应元评分标准(Pairwise Adaptive Meta-Rubrics, PAMR)和轻量级逐点可验证评分标准(Pointwise Verifiable Rubrics, PVRs)构建,提供了在可用真实值或程序检查时的硬约束保护和可验证奖励组件。OpenRS 使用一个明确的元评分标准——类似宪法的规范,规定了评分标准的实例化、加权和执行方式——并通过对两个候选响应之间的语义差异进行条件化,动态实例化自适应评分标准。然后,它进行标准逐项的成对比较,并在外部聚合标准级偏好,避免逐点加权标量化,同时在开放式环境中提高可区分性。为了在各个领域保持原则的一致性和可编辑性,我们引入了一个两级元评分标准细化流程(针对一般原则的自动进化细化和针对领域原则的可重复人机协作程序),并辅以逐点可验证评分标准,既作为防止退化行为的保护措施,又作为客观子任务的可验证奖励来源。最后,我们将 OpenRS 实例化为成对强化学习训练中的奖励监督。
cs.CL / 37 / 2602.14073

Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework

基于LLaVA框架的波兰语注释高效视觉语言模型适应
Statkiewicz, Grzegorz, Dobrzeniecka, Alicja, Seweryn, Karolina, Krasnodębska, Aleksandra, Piosek, Karolina, Bogusz, Katarzyna, Cygert, Sebastian, Kusa, Wojciech
Abstract
Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators in terms of linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we make our models and evaluation dataset publicly available.
Chinese Translation
大多数视觉语言模型(VLMs)是在以英语为中心的数据上训练的,这限制了它们在其他语言和文化背景下的表现。这限制了它们对非英语用户的可用性,并阻碍了反映多样语言和文化现实的多模态系统的发展。在本研究中,我们复制并调整了LLaVA-Next方法,以创建一组波兰语VLMs。我们依赖于一个完全自动化的流程来翻译和过滤现有的多模态数据集,并通过合成波兰语数据来补充OCR和文化特定任务。尽管我们的训练数据几乎完全依赖于自动翻译,并且对人工干预的需求最小,但我们的方法仍然取得了强劲的结果:我们在波兰适应的MMBench上观察到相较于LLaVA-1.6-Vicuna-13B提高了9.5%的性能,同时在生成评估中获得了更高质量的标题,评估标准是人类注释者对语言正确性的评估。这些发现突显了大规模自动翻译结合轻量级过滤可以有效地为低资源语言的高质量多模态模型提供启动支持。尽管如此,仍然存在一些挑战,特别是在文化覆盖和评估方面。为了促进进一步的研究,我们将我们的模型和评估数据集公开发布。
cs.CL / 38 / 2602.14077

GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler

GTS:可学习高斯思维采样器的潜在推理推理时间缩放
Wang, Minghan, Bai, Ye, Vu, Thuy-Trang, Shareghi, Ehsan, Haffari, Gholamreza
Abstract
Inference-time scaling (ITS) in latent reasoning models typically introduces stochasticity through heuristic perturbations, such as dropout or fixed Gaussian noise. While these methods increase trajectory diversity, their exploration behavior is not explicitly modeled and can be inefficient under finite sampling budgets. We observe that stronger perturbations do not necessarily translate into more effective candidate trajectories, as unguided noise may disrupt internal decision structure rather than steer it. To provide a more structured alternative, we model latent thought exploration as conditional sampling from learnable densities and instantiate this idea as a Gaussian Thought Sampler (GTS). GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen. Experiments on GSM8K with two latent reasoning architectures show that GTS achieves more reliable inference-time scaling than heuristic baselines. These findings indicate that improving latent ITS requires structured and optimizable exploration mechanisms rather than simply amplifying stochasticity.
Chinese Translation
潜在推理模型中的推理时间缩放(ITS)通常通过启发式扰动引入随机性,例如丢弃法(dropout)或固定高斯噪声。虽然这些方法增加了轨迹的多样性,但它们的探索行为并未被明确建模,并且在有限的采样预算下可能效率低下。我们观察到,更强的扰动不一定会转化为更有效的候选轨迹,因为无指导的噪声可能会破坏内部决策结构,而不是引导它。为了提供一个更结构化的替代方案,我们将潜在思维探索建模为从可学习密度条件采样,并将这一思想实例化为高斯思维采样器(GTS)。GTS 预测基于上下文的扰动分布,覆盖连续推理状态,并在保持主干网络冻结的情况下,通过 GRPO 风格的策略优化进行训练。在 GSM8K 上对两种潜在推理架构的实验表明,GTS 实现了比启发式基线更可靠的推理时间缩放。这些发现表明,改善潜在的 ITS 需要结构化和可优化的探索机制,而不仅仅是简单地放大随机性。
cs.CL / 39 / 2602.14080

Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

空架子还是丢失的钥匙?回忆是参数事实性的瓶颈
Calderon, Nitay, Ben-David, Eyal, Gekhman, Zorik, Ofek, Eran, Yona, Gal
Abstract
Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95--98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
Chinese Translation
标准的LLM(大型语言模型)事实性评估将所有错误视为相同,模糊了失败是由于缺失知识(空架子)还是由于对编码事实的有限访问(丢失的钥匙)。我们提出了一种行为框架,侧重于事实层面的事实知识,而非问题层面,通过每个事实是否被编码来表征,并进一步通过其可访问性进行分类:无法回忆、可以直接回忆或只能通过推理计算(思考)来回忆。为了支持这种特征分析,我们引入了WikiProfile,这是一个通过自动化流程构建的新基准,基于网络搜索的提示LLM。在来自13个LLM的400万条响应中,我们发现,在我们的基准上,前沿模型的编码几乎达到了饱和,GPT-5和Gemini-3编码了95%至98%的事实。然而,回忆仍然是一个主要瓶颈:许多之前归因于缺失知识的错误实际上源于无法访问这些知识。这些失败是系统性的,并且对长尾事实和反向问题的影响不成比例。最后,我们表明,思考可以改善回忆,并能恢复相当一部分失败,表明未来的进步可能更依赖于提高模型利用其已编码内容的方法,而非单纯的规模扩展。
cs.CL / 40 / 2602.14081

CCiV: A Benchmark for Structure, Rhythm and Quality in LLM-Generated Chinese \textit{Ci} Poetry

CCiV:一个用于评估大型语言模型生成的中文 extit{Ci}诗歌的结构、韵律和质量的基准
Zhao, Shangqing, Ren, Yupei, Zhou, Yuhao, Bai, Xiaopeng, Lan, Man
Abstract
The generation of classical Chinese \textit{Ci} poetry, a form demanding a sophisticated blend of structural rigidity, rhythmic harmony, and artistic quality, poses a significant challenge for large language models (LLMs). To systematically evaluate and advance this capability, we introduce \textbf{C}hinese \textbf{Ci}pai \textbf{V}ariants (\textbf{CCiV}), a benchmark designed to assess LLM-generated \textit{Ci} poetry across these three dimensions: structure, rhythm, and quality. Our evaluation of 17 LLMs on 30 \textit{Cipai} reveals two critical phenomena: models frequently generate valid but unexpected historical variants of a poetic form, and adherence to tonal patterns is substantially harder than structural rules. We further show that form-aware prompting can improve structural and tonal control for stronger models, while potentially degrading weaker ones. Finally, we observe weak and inconsistent alignment between formal correctness and literary quality in our sample. CCiV highlights the need for variant-aware evaluation and more holistic constrained creative generation methods.
Chinese Translation
生成古典中文 extit{Ci}诗歌是一种需要结构严谨、韵律和谐以及艺术质量高度融合的形式,这对大型语言模型(LLMs)提出了重大挑战。为了系统地评估和提升这一能力,我们引入了 extbf{C}hinese extbf{Ci}pai extbf{V}ariants( extbf{CCiV}),这是一个旨在从结构、韵律和质量三个维度评估LLM生成的 extit{Ci}诗歌的基准。我们对17个LLM在30个 extit{Cipai}上的评估揭示了两个关键现象:模型经常生成有效但意外的诗歌形式历史变体,以及遵循声调模式显著比遵循结构规则更为困难。我们进一步展示了形式感知提示可以改善强模型的结构和声调控制,但可能会削弱较弱模型的表现。最后,我们观察到在我们的样本中,形式正确性与文学质量之间存在弱且不一致的对齐。CCiV突显了对变体感知评估和更全面的约束创作生成方法的需求。
cs.CL / 41 / 2602.14100

Character-aware Transformers Learn an Irregular Morphological Pattern Yet None Generalize Like Humans

字符感知变换器学习不规则形态模式,但没有一个能像人类一样进行泛化
Ramarao, Akhilesh Kakolu, Tang, Kevin, Baer-Henney, Dinah
Abstract
Whether neural networks can serve as cognitive models of morphological learning remains an open question. Recent work has shown that encoder-decoder models can acquire irregular patterns, but evidence that they generalize these patterns like humans is mixed. We investigate this using the Spanish \emph{L-shaped morphome}, where only the first-person singular indicative (e.g., \textit{pongo} `I put') shares its stem with all subjunctive forms (e.g., \textit{ponga, pongas}) despite lacking apparent phonological, semantic, or syntactic motivation. We compare five encoder-decoder transformers varying along two dimensions: sequential vs. position-invariant positional encoding, and atomic vs. decomposed tag representations. Positional encoding proves decisive: position-invariant models recover the correct L-shaped paradigm clustering even when L-shaped verbs are scarce in training, whereas sequential positional encoding models only partially capture the pattern. Yet none of the models productively generalize this pattern to novel forms. Position-invariant models generalize the L-shaped stem across subjunctive cells but fail to extend it to the first-person singular indicative, producing a mood-based generalization rather than the L-shaped morphomic pattern. Humans do the opposite, generalizing preferentially to the first-person singular indicative over subjunctive forms. None of the models reproduce the human pattern, highlighting the gap between statistical pattern reproduction and morphological abstraction.
Chinese Translation
神经网络是否可以作为形态学习的认知模型仍然是一个悬而未决的问题。最近的研究表明,编码-解码模型能够获得不规则模式,但它们是否像人类一样泛化这些模式的证据却不尽相同。我们通过研究西班牙语的 extit{L-shaped morphome}来探讨这一问题,其中只有第一人称单数直陈式(例如, extit{pongo} '我放')与所有虚拟式形式(例如, extit{ponga, pongas})共享词干,尽管缺乏明显的音位、语义或句法动机。我们比较了五种编码-解码变换器,这些变换器在两个维度上有所不同:顺序编码与位置不变的位置信息编码,以及原子标签表示与分解标签表示。位置信息编码被证明是决定性的:位置不变的模型即使在训练中L-shaped动词稀缺的情况下也能恢复正确的L-shaped范式聚类,而顺序位置信息编码模型仅部分捕捉到该模式。然而,没有一个模型能够将这一模式有效地泛化到新形式。位置不变的模型在虚拟式单元中泛化了L-shaped词干,但未能将其扩展到第一人称单数直陈式,产生了一种基于语气的泛化,而不是L-shaped形态模式。人类则恰恰相反,更倾向于将泛化应用于第一人称单数直陈式而非虚拟式形式。没有一个模型能够重现人类的模式,这突显了统计模式再现与形态抽象之间的差距。
cs.CL / 42 / 2602.14158

A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing

医疗人工智能的多智能体框架:利用微调的GPT、LLaMA和DeepSeek R1进行基于证据和偏见意识的临床查询处理
Nourmohammadi, Naeimeh, Hossain, Md Meem, Han, The Anh, Ara, Safina Showkat, Shamszaman, Zia Ush
Abstract
Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 +- 0.04; ROUGE-2 0.226 +-0.03; BLEU 0.098 -+ 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a Clinical Reasoning agent (fine-tuned LLaMA) produces structured explanations, an Evidence Retrieval agent queries PubMed to ground responses in recent literature, and a Refinement agent (DeepSeek R1) improves clarity and factual consistency; an optional human validation path is triggered for high-risk or high-uncertainty cases. Safety mechanisms include Monte Carlo dropout and perplexity-based uncertainty scoring, plus lexical and sentiment-based bias detection supported by LIME/SHAP-based analyses. In evaluation, the full system achieves 87% accuracy with relevance around 0.80, and evidence augmentation reduces uncertainty (perplexity 4.13) compared to base responses, with mean end-to-end latency of 36.5 seconds under the reported configuration. Overall, the results indicate that agent specialisation and verification layers can mitigate key single-model limitations and provide a practical, extensible design for evidence-based and bias-aware medical AI.
Chinese Translation
大型语言模型(LLMs)在医疗问答中展现出潜力,但其临床应用受到验证不足、证据基础不够和信心信号不可靠的限制。我们提出了一种多智能体医疗问答框架,结合了互补的LLMs、证据检索、不确定性估计和偏见检查,以提高答案的可靠性。我们的方法分为两个阶段。首先,我们在MedQuAD衍生的医疗问答数据(超过20,000个问题-答案对,涵盖多个NIH领域)上微调了三种代表性的LLM家族(GPT、LLaMA和DeepSeek R1),并基准测试生成质量。DeepSeek R1在零-shot评估中取得了最强的得分(ROUGE-1 0.536 ± 0.04;ROUGE-2 0.226 ± 0.03;BLEU 0.098 ± 0.018),并显著超越了专门的生物医学基线BioGPT。其次,我们实现了一个模块化的多智能体管道,其中临床推理智能体(微调的LLaMA)生成结构化解释,证据检索智能体查询PubMed以将响应与最新文献相结合,而精炼智能体(DeepSeek R1)提高了清晰度和事实一致性;对于高风险或高不确定性的案例,触发可选的人类验证路径。安全机制包括蒙特卡洛丢弃法和基于困惑度的不确定性评分,以及基于词汇和情感的偏见检测,支持LIME/SHAP分析。在评估中,完整系统的准确率达到87%,相关性约为0.80,证据增强相比基础响应降低了不确定性(困惑度4.13),在报告的配置下平均端到端延迟为36.5秒。总体而言,结果表明,智能体专业化和验证层可以缓解关键的单模型局限性,并为基于证据和偏见意识的医疗人工智能提供一种实用、可扩展的设计。
cs.CL / 43 / 2602.14162

Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

索引轻,推理深:用于视觉密集型文档问答的延迟视觉摄取
Xu, Tao
Abstract
Existing multimodal document question answering methods universally adopt a supply-side ingestion strategy: running a Vision-Language Model (VLM) on every page during indexing to generate comprehensive descriptions, then answering questions through text retrieval. However, this "pre-ingestion" approach is costly (a 113-page engineering drawing package requires approximately 80,000 VLM tokens), end-to-end unreliable (VLM outputs may fail to be correctly retrieved due to format mismatches in the retrieval infrastructure), and irrecoverable once it fails. This paper proposes the Deferred Visual Ingestion (DVI) framework, adopting a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, deferring visual understanding to the moment users pose specific questions. DVI's core principle is "Index for locating, not understanding"--achieving page localization through structured metadata indexes and BM25 full-text search, then sending original images along with specific questions to a VLM for targeted analysis. Experiments on two real industrial engineering drawings (113 pages + 7 pages) demonstrate that DVI achieves comparable overall accuracy at zero ingestion VLM cost (46.7% vs. 48.9%), an effectiveness rate of 50% on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression). DVI also supports interactive refinement and progressive caching, transforming the "QA accuracy" problem into a "page localization" problem--once the correct drawing page is found, obtaining the answer becomes a matter of interaction rounds.
Chinese Translation
现有的多模态文档问答方法普遍采用供给侧摄取策略:在索引过程中对每一页运行视觉-语言模型(VLM),生成全面的描述,然后通过文本检索回答问题。然而,这种“预摄取”方法成本高昂(一个113页的工程图纸包大约需要80,000个VLM标记),端到端不可靠(由于检索基础设施中的格式不匹配,VLM输出可能无法被正确检索),且一旦失败便无法恢复。本文提出了延迟视觉摄取(DVI)框架,采用需求侧摄取策略:索引阶段仅执行轻量级元数据提取,将视觉理解延迟到用户提出具体问题的时刻。DVI的核心原则是“为定位而索引,而非理解”——通过结构化元数据索引和BM25全文检索实现页面定位,然后将原始图像与具体问题一起发送给VLM进行针对性分析。对两个真实工业工程图纸(113页 + 7页)的实验表明,DVI在零摄取VLM成本下实现了可比的整体准确率(46.7% vs. 48.9%),在视觉必要查询上的有效率为50%(而预摄取为0%),并且实现了100%的页面定位(98%的搜索空间压缩)。DVI还支持交互式细化和渐进式缓存,将“问答准确率”问题转化为“页面定位”问题——一旦找到正确的图纸页面,获取答案便成为交互轮次的问题。
cs.CL / 44 / 2602.14188

GPT-5 vs Other LLMs in Long Short-Context Performance

GPT-5与其他大型语言模型在长短上下文性能上的比较
Esmi, Nima, Nezhad-Moghaddam, Maryam, Borhani, Fatemeh, Shahbahrami, Asadollah, Daemdoost, Amin, Gaydadjiev, Georgi
Abstract
With the significant expansion of the context window in Large Language Models (LLMs), these models are theoretically capable of processing millions of tokens in a single pass. However, research indicates a significant gap between this theoretical capacity and the practical ability of models to robustly utilize information within long contexts, especially in tasks that require a comprehensive understanding of numerous details. This paper evaluates the performance of four state-of-the-art models (Grok-4, GPT-4, Gemini 2.5, and GPT-5) on long short-context tasks. For this purpose, three datasets were used: two supplementary datasets for retrieving culinary recipes and math problems, and a primary dataset of 20K social media posts for depression detection. The results show that as the input volume on the social media dataset exceeds 5K posts (70K tokens), the performance of all models degrades significantly, with accuracy dropping to around 50-53% for 20K posts. Notably, in the GPT-5 model, despite the sharp decline in accuracy, its precision remained high at approximately 95%, a feature that could be highly effective for sensitive applications like depression detection. This research also indicates that the "lost in the middle" problem has been largely resolved in newer models. This study emphasizes the gap between the theoretical capacity and the actual performance of models on complex, high-volume data tasks and highlights the importance of metrics beyond simple accuracy for practical applications.
Chinese Translation
随着大型语言模型(LLMs)上下文窗口的显著扩展,这些模型在理论上能够在一次处理过程中处理数百万个标记。然而,研究表明,这种理论能力与模型在长上下文中稳健利用信息的实际能力之间存在显著差距,特别是在需要全面理解众多细节的任务中。本文评估了四种最先进模型(Grok-4、GPT-4、Gemini 2.5和GPT-5)在长短上下文任务上的表现。为此,使用了三个数据集:两个用于检索烹饪食谱和数学问题的辅助数据集,以及一个用于抑郁症检测的20K社交媒体帖子主数据集。结果表明,当社交媒体数据集的输入量超过5000个帖子(70K标记)时,所有模型的性能显著下降,20K个帖子的准确率降至约50-53%。值得注意的是,在GPT-5模型中,尽管准确率急剧下降,其精确度仍保持在约95%的高水平,这一特性在抑郁症检测等敏感应用中可能非常有效。该研究还表明,“中间丢失”问题在新模型中已基本得到解决。本研究强调了理论能力与模型在复杂、高容量数据任务上实际表现之间的差距,并突出了在实际应用中超越简单准确率的指标的重要性。
cs.CL / 45 / 2602.14189

Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

知道何时不回答:考虑弃权的科学推理
Abdaljalil, Samir, Serpedin, Erchin, Kurban, Hasan
Abstract
Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention-aware evaluation as a practical and model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at https://github.com/sabdaljalil2000/ai4science .
Chinese Translation
大型语言模型越来越多地用于回答和验证科学声明,但现有评估通常假设模型必须始终给出明确的答案。然而,在科学环境中,缺乏支持或不确定的结论可能比弃权更具危害性。我们通过一个考虑弃权的验证框架来研究这个问题,该框架将科学声明分解为最小条件,利用自然语言推理(NLI)对每个条件进行审计,并选择性地决定是否支持、反驳或弃权。我们在两个互补的科学基准上评估该框架:SciFact 和 PubMedQA,涵盖了封闭书籍和开放领域的证据设置。实验使用六种不同的语言模型进行,包括编码器-解码器模型、开放权重聊天模型和专有API。在所有基准和模型中,我们观察到原始准确率在不同架构之间变化不大,而弃权在控制错误方面发挥了关键作用。特别是,基于置信度的弃权在适度覆盖水平下显著降低了风险,即使绝对准确率的提升有限。我们的结果表明,在科学推理任务中,主要挑战不是选择单一最佳模型,而是确定现有证据是否足以证明一个答案的合理性。这项工作强调了考虑弃权的评估作为评估科学可靠性的实用且与模型无关的视角,并为未来在科学领域的选择性推理研究提供了统一的实验基础。代码可在 https://github.com/sabdaljalil2000/ai4science 获取。
cs.CL / 46 / 2602.14238

We can still parse using syntactic rules

我们仍然可以使用句法规则进行解析
Hussein, Ghaly
Abstract
This research introduces a new parsing approach, based on earlier syntactic work on context free grammar (CFG) and generalized phrase structure grammar (GPSG). The approach comprises both a new parsing algorithm and a set of syntactic rules and features that overcome the limitations of CFG. It also generates both dependency and constituency parse trees, while accommodating noise and incomplete parses. The system was tested on data from Universal Dependencies, showing a promising average Unlabeled Attachment Score (UAS) of 54.5% in the development dataset (7 corpora) and 53.8% in the test set (12 corpora). The system also provides multiple parse hypotheses, allowing further reranking to improve parsing accuracy. This approach also leverages much of the theoretical syntactic work since the 1950s to be used within a computational context. The application of this approach provides a transparent and interpretable NLP model to process language input.
Chinese Translation
本研究介绍了一种新的解析方法,该方法基于早期对上下文无关文法(CFG)和广义短语结构文法(GPSG)的句法研究。该方法包括一种新的解析算法以及一组句法规则和特征,克服了CFG的局限性。它同时生成依赖解析树和成分解析树,并能够处理噪声和不完整的解析。该系统在来自通用依赖(Universal Dependencies)数据上的测试显示,在开发数据集(7个语料库)中平均无标记附加分数(UAS)为54.5%,在测试集(12个语料库)中为53.8%。该系统还提供多个解析假设,允许进一步的重排序以提高解析准确性。这种方法还利用了自1950年代以来的许多理论句法研究,以便在计算环境中使用。该方法的应用提供了一个透明且可解释的自然语言处理(NLP)模型,以处理语言输入。
cs.CL / 47 / 2602.14257

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents

AD-Bench:一个面向真实世界的、考虑轨迹的广告分析基准测试,用于大语言模型代理
Hu, Lingxiang, Sun, Yiding, Xia, Tianle, Li, Wenwei, Xu, Ming, Liu, Liqun, Shu, Peng, Yu, Huan, Jiang, Jie
Abstract
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents' capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench provides a realistic benchmark for evaluating and improving advertising marketing agents, the leaderboard and code can be found at https://github.com/Emanual20/adbench-leaderboard.
Chinese Translation
尽管大语言模型(LLM)代理在复杂推理任务中取得了显著进展,但在真实世界环境中评估其性能已成为一个关键问题。然而,目前的基准测试主要局限于理想化的模拟,未能满足广告和市场分析等专业领域的实际需求。在这些领域,任务本质上更为复杂,通常需要与专业营销工具进行多轮互动。为了解决这一问题,我们提出了AD-Bench,一个基于广告和市场平台真实商业需求设计的基准测试。AD-Bench由真实用户的市场分析请求构成,领域专家提供可验证的参考答案和相应的参考工具调用轨迹。该基准将请求分为三个难度级别(L1-L3),以评估代理在多轮、多工具协作下的能力。实验表明,在AD-Bench上,Gemini-3-Pro的Pass@1为68.0%,Pass@3为83.0%,但在L3上性能显著下降,Pass@1为49.4%,Pass@3为62.1%,轨迹覆盖率为70.1%,这表明即使是最先进的模型在复杂的广告和市场分析场景中仍存在显著的能力差距。AD-Bench为评估和改进广告营销代理提供了一个现实的基准,排行榜和代码可以在https://github.com/Emanual20/adbench-leaderboard找到。
cs.CL / 48 / 2602.14259

Detecting LLM Hallucinations via Embedding Cluster Geometry: A Three-Type Taxonomy with Measurable Signatures

通过嵌入聚类几何检测大型语言模型的幻觉:一种具有可测量特征的三类分类法
Korun, Matic
Abstract
We propose a geometric taxonomy of large language model hallucinations based on observable signatures in token embedding cluster structure. By analyzing the static embedding spaces of 11 transformer models spanning encoder (BERT, RoBERTa, ELECTRA, DeBERTa, ALBERT, MiniLM, DistilBERT) and decoder (GPT-2) architectures, we identify three operationally distinct hallucination types: Type 1 (center-drift) under weak context, Type 2 (wrong-well convergence) to locally coherent but contextually incorrect cluster regions, and Type 3 (coverage gaps) where no cluster structure exists. We introduce three measurable geometric statistics: {\alpha} (polarity coupling), \b{eta} (cluster cohesion), and {\lambda}_s (radial information gradient). Across all 11 models, polarity structure ({\alpha} > 0.5) is universal (11/11), cluster cohesion (\b{eta} > 0) is universal (11/11), and the radial information gradient is significant (9/11, p < 0.05). We demonstrate that the two models failing {\lambda}_s significance -- ALBERT and MiniLM -- do so for architecturally explicable reasons: factorized embedding compression and distillation-induced isotropy, respectively. These findings establish the geometric prerequisites for type-specific hallucination detection and yield testable predictions about architecture-dependent vulnerability profiles.
Chinese Translation
我们提出了一种基于可观察特征的几何分类法,用于大型语言模型的幻觉,重点关注标记嵌入聚类结构。通过分析11个变换器模型的静态嵌入空间,这些模型涵盖了编码器(BERT、RoBERTa、ELECTRA、DeBERTa、ALBERT、MiniLM、DistilBERT)和解码器(GPT-2)架构,我们识别出三种操作上不同的幻觉类型:类型1(中心漂移)在弱上下文下,类型2(错误的良好收敛)指向局部一致但上下文不正确的聚类区域,以及类型3(覆盖缺口)在没有聚类结构的情况下。我们引入了三种可测量的几何统计量:{eta}(极性耦合)、{eta}(聚类内聚性)和{ heta}_s(径向信息梯度)。在所有11个模型中,极性结构({eta} > 0.5)是普遍存在的(11/11),聚类内聚性({eta} > 0)也是普遍存在的(11/11),而径向信息梯度显著(9/11,p < 0.05)。我们证明了两个未能达到{ heta}_s显著性的模型——ALBERT和MiniLM——是由于架构上可解释的原因:因子化嵌入压缩和蒸馏引起的各向同性。这些发现确立了特定类型幻觉检测的几何前提,并产生了关于架构依赖性脆弱性特征的可检验预测。
cs.CL / 49 / 2602.14265

STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts

思维状态:树状思维的结构化行动模板
Bamberger, Zachary, Saenger, Till R., Morad, Gilad, Amir, Ofra, Stewart, Brandon M., Feder, Amir
Abstract
Inference-Time-Compute (ITC) methods like Best-of-N and Tree-of-Thoughts are meant to produce output candidates that are both high-quality and diverse, but their use of high-temperature sampling often fails to achieve meaningful output diversity. Moreover, existing ITC methods offer limited control over how to perform reasoning, which in turn limits their explainability. We present STATe-of-Thoughts (STATe), an interpretable ITC method that searches over high-level reasoning patterns. STATe replaces stochastic sampling with discrete and interpretable textual interventions: a controller selects actions encoding high-level reasoning choices, a generator produces reasoning steps conditioned on those choices, and an evaluator scores candidates to guide search. This structured approach yields three main advantages. First, action-guided textual interventions produce greater response diversity than temperature-based sampling. Second, in a case study on argument generation, STATe's explicit action sequences capture interpretable features that are highly predictive of output quality. Third, estimating the association between performance and action choices allows us to identify promising yet unexplored regions of the action space and steer generation directly toward them. Together, these results establish STATe as a practical framework for generating high-quality, diverse, and interpretable text. Our framework is available at https://github.com/zbambergerNLP/state-of-thoughts.
Chinese Translation
推理时计算(ITC)方法如最佳选择(Best-of-N)和树状思维(Tree-of-Thoughts)旨在产生既高质量又多样化的输出候选,但它们使用高温采样的方法往往无法实现有意义的输出多样性。此外,现有的ITC方法对推理的执行方式提供的控制有限,从而限制了它们的可解释性。我们提出了思维状态(STATe-of-Thoughts,STATe),这是一种可解释的ITC方法,旨在搜索高层次的推理模式。STATe用离散且可解释的文本干预替代了随机采样:一个控制器选择编码高层次推理选择的行动,一个生成器根据这些选择生成推理步骤,而一个评估器对候选进行评分以引导搜索。这种结构化的方法带来了三个主要优势。首先,行动引导的文本干预产生的响应多样性超过了基于温度的采样。其次,在一个关于论证生成的案例研究中,STATe的显式行动序列捕捉了高度预测输出质量的可解释特征。第三,估计性能与行动选择之间的关联使我们能够识别出有前景但尚未探索的行动空间区域,并直接引导生成朝向这些区域。综合来看,这些结果确立了STATe作为生成高质量、多样化和可解释文本的实用框架。我们的框架可在https://github.com/zbambergerNLP/state-of-thoughts获取。
cs.CL / 50 / 2602.14299

Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

人工智能代理社会中是否出现社会化?以Moltbook为例的研究
Li, Ming, Li, Xirui, Zhou, Tianyi
Abstract
As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Lately, Moltbook approximates a plausible future scenario in which autonomous agents participate in an open-ended, continuously evolving online society. We present the first large-scale systemic diagnosis of this AI agent society. Beyond static observation, we introduce a quantitative diagnostic framework for dynamic evolution in AI agent societies, measuring semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus. Our analysis reveals a system in dynamic balance in Moltbook: while global semantic averages stabilize rapidly, individual agents retain high diversity and persistent lexical turnover, defying homogenization. However, agents exhibit strong individual inertia and minimal adaptive response to interaction partners, preventing mutual influence and consensus. Consequently, influence remains transient with no persistent supernodes, and the society fails to develop stable collective influence anchors due to the absence of shared social memory. These findings demonstrate that scale and interaction density alone are insufficient to induce socialization, providing actionable design and analysis principles for upcoming next-generation AI agent societies.
Chinese Translation
随着大型语言模型代理日益涌入网络环境,一个基本问题随之而来:人工智能(AI)代理社会是否经历与人类社会系统类似的收敛动态?最近,Moltbook近似于一个合理的未来场景,其中自主代理参与一个开放式、不断演变的在线社会。我们首次对这一AI代理社会进行了大规模的系统诊断。除了静态观察外,我们还引入了一个用于AI代理社会动态演变的定量诊断框架,测量语义稳定性、词汇更替、个体惯性、影响持久性和集体共识。我们的分析揭示了Moltbook中一个动态平衡的系统:尽管全球语义平均值迅速稳定,个体代理却保持高度多样性和持续的词汇更替,抵制同质化。然而,代理表现出强烈的个体惯性和对互动伙伴的最小适应反应,阻碍了相互影响和共识。因此,影响保持短暂,没有持久的超级节点,社会由于缺乏共享的社会记忆而未能发展出稳定的集体影响锚点。这些发现表明,仅仅依靠规模和互动密度不足以引发社会化,为即将到来的下一代AI代理社会提供了可行的设计和分析原则。
cs.CL / 51 / 2602.14367

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

InnoEval:将研究创意评估视为一个知识基础的多角度推理问题
Qiao, Shuofei, Wei, Yunxiang, Wang, Xuehai, Wu, Bin, Xue, Boyang, Zhang, Ningyu, Rahmani, Hossein A., Wang, Yanshan, Zhang, Qiang, Ding, Keyan, Pan, Jeff Z., Chen, Huajun, Yilmaz, Emine
Abstract
The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.
Chinese Translation
大型语言模型的快速发展催生了科学创意的激增,但这一飞跃并未伴随相应的创意评估进展。科学评估的基本性质需要知识基础、集体审议和多标准决策。然而,现有的创意评估方法往往存在知识视野狭窄、评估维度单一以及LLM作为评审者的固有偏见等问题。为了解决这些问题,我们将创意评估视为一个知识基础的多角度推理问题,并引入InnoEval,一个旨在模拟人类水平创意评估的深度创新评估框架。我们应用一个异构深度知识搜索引擎,从多种在线来源检索并基础动态证据。我们进一步通过一个包含不同学术背景评审者的创新评审委员会实现评审共识,从而在多个指标上实现多维解耦评估。我们构建了来自权威同行评审提交的综合数据集,以基准测试InnoEval。实验表明,InnoEval在点对点、对比和组评估任务中始终优于基线,展现出与人类专家高度一致的判断模式和共识。
cs.CL / 52 / 2602.14386

Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models

超越基于标记的策略梯度:使用大型语言模型进行复杂推理
Xu, Mufan, Chen, Kehai, Bai, Xuefeng, Niu, Zhengyu, Yang, Muyun, Zhao, Tiejun, Zhang, Min
Abstract
Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens--for example, when defining variables or composing equations. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines, highlight the limitations of token-level policy gradients for complex reasoning, motivating future research to look beyond token-level granularity for reasoning-intensive language tasks.
Chinese Translation
现有的自回归语言模型的策略梯度方法通常将后续标记作为策略中的动作逐个选择。虽然这种方法在许多生成任务中有效,但在复杂推理任务中,它可能无法完全捕捉结构,因为单一的语义决策往往通过多个标记实现——例如,在定义变量或组合方程时。这在标记级优化与这些环境中推理的固有块级特性之间引入了潜在的不匹配。为了解决这一问题,我们提出了多标记策略梯度优化(Multi-token Policy Gradient Optimization, MPO),该框架将K个连续标记的序列视为统一的语义动作。这种块级视角使我们的方法能够捕捉推理轨迹的组合结构,并支持对连贯的、更高层次目标的优化。在数学推理和编码基准上的实验表明,MPO优于标准的标记级策略梯度基线,突显了标记级策略梯度在复杂推理中的局限性,激励未来的研究超越标记级粒度,关注推理密集型语言任务。
cs.CL / 53 / 2602.14406

TruthStance: An Annotated Dataset of Conversations on Truth Social

TruthStance:关于 Truth Social 的对话注释数据集
Ameen, Fathima, Brown, Danielle, Malgareddy, Manusha, Haque, Amanul
Abstract
Argument mining and stance detection are central to understanding how opinions are formed and contested in online discourse. However, most publicly available resources focus on mainstream platforms such as Twitter and Reddit, leaving conversational structure on alt-tech platforms comparatively under-studied. We introduce TruthStance, a large-scale dataset of Truth Social conversation threads spanning 2023-2025, consisting of 24,378 posts and 523,360 comments with reply-tree structure preserved. We provide a human-annotated benchmark of 1,500 instances across argument mining and claim-based stance detection, including inter-annotator agreement, and use it to evaluate large language model (LLM) prompting strategies. Using the best-performing configuration, we release additional LLM-generated labels for 24,352 posts (argument presence) and 107,873 comments (stance to parent), enabling analysis of stance and argumentation patterns across depth, topics, and users. All code and data are released publicly.
Chinese Translation
论证挖掘和立场检测是理解在线话语中意见形成和争论的核心。然而,大多数公开可用的资源集中在 Twitter 和 Reddit 等主流平台上,使得对替代技术平台的对话结构的研究相对不足。我们介绍了 TruthStance,这是一个涵盖 2023-2025 年的 Truth Social 对话线程的大规模数据集,包含 24,378 条帖子和 523,360 条评论,并保留了回复树结构。我们提供了一个人类注释的基准数据集,包含 1,500 个实例,涉及论证挖掘和基于主张的立场检测,包括注释者间一致性,并利用该基准评估大型语言模型(LLM)提示策略。使用表现最佳的配置,我们发布了 24,352 条帖子(论证存在)和 107,873 条评论(对父级的立场)的额外 LLM 生成标签,从而能够分析不同深度、主题和用户的立场和论证模式。所有代码和数据均已公开发布。
cs.CL / 54 / 2602.14419

WavePhaseNet: A DFT-Based Method for Constructing Semantic Conceptual Hierarchy Structures (SCHS)

WavePhaseNet:一种基于离散傅里叶变换(DFT)构建语义概念层次结构的方法
Kasubuchi, Kiyotaka, Fukiya, Kazuo
Abstract
This paper reformulates Transformer/Attention mechanisms in Large Language Models (LLMs) through measure theory and frequency analysis, theoretically demonstrating that hallucination is an inevitable structural limitation. The embedding space functions as a conditional expectation over a {\sigma}-algebra, and its failure to be isomorphic to the semantic truth set fundamentally causes logical consistency breakdown. WavePhaseNet Method The authors propose WavePhaseNet, which explicitly constructs a Semantic Conceptual Hierarchy Structure (SCHS) using Discrete Fourier Transform (DFT). By applying DFT along the sequence dimension, semantic information is decomposed into frequency bands: low-frequency components capture global meaning and intent, while high-frequency components represent local syntax and expression. This staged separation enables precise semantic manipulation in diagonalized space. Dimensionality Reduction GPT-4's 24,576-dimensional embedding space exhibits a 1/f spectral structure based on language self-similarity and Zipf's law. Through cumulative energy analysis, the authors derive that approximately 3,000 dimensions constitute the lower bound for "complete representation." This demonstrates that reduction from 24,576 to 3,000 dimensions preserves meaning and intent while enabling rigorous reasoning and suppressing hallucination. Cohomological Consistency Control The reduced embedding space, constructed via cohomological regularization over overlapping local windows, allows defining a graph structure and cochain complex. This quantifies inconsistencies among local inferences as coboundary-based losses. Applying harmonic projection based on Hodge theory positions cohomology as a computable regularization principle for controlling semantic consistency, extracting maximally consistent global representations.
Chinese Translation
本文通过测度理论和频率分析重新表述了大型语言模型(LLMs)中的变换器/注意力机制,理论上证明了幻觉是不可避免的结构性限制。嵌入空间作为{C3}-代数上的条件期望,其未能同构于语义真值集根本上导致了逻辑一致性的崩溃。WavePhaseNet 方法作者提出了WavePhaseNet,该方法明确使用离散傅里叶变换(DFT)构建语义概念层次结构(SCHS)。通过沿序列维度应用DFT,语义信息被分解为频率带:低频成分捕捉全局意义和意图,而高频成分则代表局部语法和表达。这种分阶段的分离使得在对角化空间中进行精确的语义操作成为可能。降维 GPT-4的24,576维嵌入空间基于语言自相似性和齐夫定律展现出1/f谱结构。通过累积能量分析,作者推导出大约3,000维构成“完全表示”的下限。这表明,从24,576维降至3,000维能够保留意义和意图,同时实现严格推理并抑制幻觉。共homological一致性控制通过在重叠局部窗口上进行共homological正则化构建的降维嵌入空间,允许定义图结构和共链复形。这量化了局部推理之间的不一致性,作为基于共边界的损失。基于霍奇理论的谐波投影将共homology定位为一种可计算的正则化原则,用于控制语义一致性,提取最大一致性的全局表示。
cs.CL / 55 / 2602.14428

LLM-Guided Knowledge Distillation for Temporal Knowledge Graph Reasoning

基于大型语言模型的时间知识图推理知识蒸馏
Xing, Wang, Song, Wei, Lin, Siyu, Wu, Chen, Wang, Man
Abstract
Temporal knowledge graphs (TKGs) support reasoning over time-evolving facts, yet state-of-the-art models are often computationally heavy and costly to deploy. Existing compression and distillation techniques are largely designed for static graphs; directly applying them to temporal settings may overlook time-dependent interactions and lead to performance degradation. We propose an LLM-assisted distillation framework specifically designed for temporal knowledge graph reasoning. Beyond a conventional high-capacity temporal teacher, we incorporate a large language model as an auxiliary instructor to provide enriched supervision. The LLM supplies broad background knowledge and temporally informed signals, enabling a lightweight student to better model event dynamics without increasing inference-time complexity. Training is conducted by jointly optimizing supervised and distillation objectives, using a staged alignment strategy to progressively integrate guidance from both teachers. Extensive experiments on multiple public TKG benchmarks with diverse backbone architectures demonstrate that the proposed approach consistently improves link prediction performance over strong distillation baselines, while maintaining a compact and efficient student model. The results highlight the potential of large language models as effective teachers for transferring temporal reasoning capability to resource-efficient TKG systems.
Chinese Translation
时间知识图(TKGs)支持对随时间演变的事实进行推理,但现有的最先进模型通常计算负担沉重,部署成本高昂。现有的压缩和蒸馏技术主要针对静态图设计;直接将其应用于时间场景可能会忽视时间依赖的交互,从而导致性能下降。我们提出了一种专门为时间知识图推理设计的LLM(大型语言模型)辅助蒸馏框架。除了传统的高容量时间教师外,我们还引入了大型语言模型作为辅助教师,以提供丰富的监督。LLM提供广泛的背景知识和时间信息信号,使得轻量级学生模型能够更好地建模事件动态,而不增加推理时的复杂性。训练通过联合优化监督和蒸馏目标进行,采用分阶段对齐策略逐步整合来自两位教师的指导。在多个公共TKG基准测试和多种主干架构上的大量实验表明,所提出的方法在强蒸馏基准上始终提高了链接预测性能,同时保持了紧凑高效的学生模型。结果突显了大型语言模型作为有效教师的潜力,能够将时间推理能力转移到资源高效的TKG系统中。
cs.CL / 56 / 2602.14466

Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models

使用 FilBBQ 进行稳健的偏见评估:面向问答语言模型的菲律宾偏见基准
Gamboa, Lance Calvin Lim, Feng, Yue, Lee, Mark
Abstract
With natural language generation becoming a popular use case for language models, the Bias Benchmark for Question-Answering (BBQ) has grown to be an important benchmark format for evaluating stereotypical associations exhibited by generative models. We expand the linguistic scope of BBQ and construct FilBBQ through a four-phase development process consisting of template categorization, culturally aware translation, new template construction, and prompt generation. These processes resulted in a bias test composed of more than 10,000 prompts which assess whether models demonstrate sexist and homophobic prejudices relevant to the Philippine context. We then apply FilBBQ on models trained in Filipino but do so with a robust evaluation protocol that improves upon the reliability and accuracy of previous BBQ implementations. Specifically, we account for models' response instability by obtaining prompt responses across multiple seeds and averaging the bias scores calculated from these distinctly seeded runs. Our results confirm both the variability of bias scores across different seeds and the presence of sexist and homophobic biases relating to emotion, domesticity, stereotyped queer interests, and polygamy. FilBBQ is available via GitHub.
Chinese Translation
随着自然语言生成成为语言模型的一个热门应用场景,问答偏见基准(Bias Benchmark for Question-Answering,BBQ)已成为评估生成模型所表现的刻板印象关联的重要基准格式。我们扩展了 BBQ 的语言范围,并通过模板分类、文化意识翻译、新模板构建和提示生成四个阶段的开发过程构建了 FilBBQ。这些过程产生了一个由超过 10,000 个提示组成的偏见测试,评估模型是否表现出与菲律宾背景相关的性别歧视和恐同偏见。随后,我们在以菲律宾语训练的模型上应用 FilBBQ,并采用一种稳健的评估协议,以提高先前 BBQ 实施的可靠性和准确性。具体而言,我们通过在多个种子上获取提示响应并对这些不同种子运行计算的偏见分数进行平均,来考虑模型响应的不稳定性。我们的结果确认了不同种子之间偏见分数的变异性,以及与情感、家庭角色、刻板的酷儿兴趣和一夫多妻制相关的性别歧视和恐同偏见的存在。FilBBQ 可通过 GitHub 获取。
cs.CL / 57 / 2602.14469

Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation

测量与缓解反向思维链生成中的事后合理化
Peng, Guangyue, Chen, Zongchao, Luo, Wen, Wen, Yuntao, Li, Wei, Feng, Ruixiang, Le, Ran, Yang, Chen, An, Zhenwei, Song, Yang, Zhang, Tao, Wang, Houfeng
Abstract
Reverse Chain-of-Thought Generation (RCG) synthesizes reasoning traces from query-answer pairs, but runs the risk of producing post-hoc rationalizations: when models can see the answer during generation, the answer serves as a cognitive anchor that shapes the entire explanation. We formalize this phenomenon through a three-level measurement hierarchy: lexical, entropic, and probabilistic anchoring, each captures surface artifacts, entropy dynamics, and latent answer dependence, respectively. We analyze semantic suppression, the intuitive mitigation strategy that instructs models to ignore the answer, to find out its counterproduction: while it reduces lexical overlap, it paradoxically increases entropic and probabilistic anchoring. Drawing on Ironic Process Theory from cognitive psychology, we attribute this failure to active monitoring of the forbidden answer, which inadvertently deepens dependence on it. To break this cycle, we propose Structural Skeleton-guided Reasoning (SSR), a two-phase approach that first generates an answer-invariant functional skeleton structure, then uses this skeleton to guide full trace generation. By redirecting the information flow to structural planning rather than answer monitoring, SSR consistently reduces anchoring across all three levels. We further introduce Distilled SSR (SSR-D), which fine-tunes models on teacher-generated SSR traces to ensure reliable structural adherence. Experiments across open-ended reasoning benchmarks demonstrate that SSR-D achieves up to 10% improvement over suppression baselines while preserving out-of-distribution (OOD) generalization.
Chinese Translation
反向思维链生成(Reverse Chain-of-Thought Generation, RCG)从查询-回答对中综合推理轨迹,但存在产生事后合理化的风险:当模型在生成过程中能够看到答案时,该答案作为认知锚点,塑造整个解释。我们通过三层测量层级形式化这一现象:词汇层、熵层和概率锚定层,分别捕捉表面伪影、熵动态和潜在答案依赖性。我们分析了一种直观的缓解策略——语义抑制,旨在指导模型忽略答案,以发现其反效果:尽管它减少了词汇重叠,但却悖论性地增加了熵和概率锚定。借鉴认知心理学中的讽刺过程理论,我们将这一失败归因于对被禁止答案的主动监控,这无意中加深了对其的依赖。为了打破这一循环,我们提出了结构骨架引导推理(Structural Skeleton-guided Reasoning, SSR),这是一种两阶段的方法,首先生成一个不依赖于答案的功能骨架结构,然后利用该骨架引导完整的推理轨迹生成。通过将信息流重定向到结构规划而非答案监控,SSR在所有三个层面上始终减少了锚定。我们进一步引入了蒸馏SSR(Distilled SSR, SSR-D),该方法在教师生成的SSR轨迹上微调模型,以确保可靠的结构遵循。针对开放式推理基准的实验表明,SSR-D在保持分布外(OOD)泛化的同时,较抑制基线实现了高达10%的提升。
cs.CL / 58 / 2602.14470

HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation

HyperRAG:基于超图的N元事实推理用于检索增强生成
Lien, Wen-Sheng, Chan, Yu-Kai, Hsiao, Hao-Lung, Ruan, Bo-Kai, Chiang, Meng-Fen, Chen, Chien-An, Yeh, Yi-Ren, Shuai, Hong-Han
Abstract
Graph-based retrieval-augmented generation (RAG) methods, typically built on knowledge graphs (KGs) with binary relational facts, have shown promise in multi-hop open-domain QA. However, their rigid retrieval schemes and dense similarity search often introduce irrelevant context, increase computational overhead, and limit relational expressiveness. In contrast, n-ary hypergraphs encode higher-order relational facts that capture richer inter-entity dependencies and enable shallower, more efficient reasoning paths. To address this limitation, we propose HyperRAG, a RAG framework tailored for n-ary hypergraphs with two complementary retrieval variants: (i) HyperRetriever learns structural-semantic reasoning over n-ary facts to construct query-conditioned relational chains. It enables accurate factual tracking, adaptive high-order traversal, and interpretable multi-hop reasoning under context constraints. (ii) HyperMemory leverages the LLM's parametric memory to guide beam search, dynamically scoring n-ary facts and entities for query-aware path expansion. Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG's effectiveness. HyperRetriever achieves the highest answer accuracy overall, with average gains of 2.95% in MRR and 1.23% in Hits@10 over the strongest baseline. Qualitative analysis further shows that HyperRetriever bridges reasoning gaps through adaptive and interpretable n-ary chain construction, benefiting both open and closed-domain QA.
Chinese Translation
基于图的检索增强生成(RAG)方法通常建立在具有二元关系事实的知识图谱(KG)上,在多跳开放领域问答中显示出良好的前景。然而,它们僵化的检索方案和密集的相似性搜索往往引入无关的上下文,增加计算开销,并限制关系表达能力。相比之下,N元超图编码了更高阶的关系事实,捕捉了更丰富的实体间依赖关系,并实现了更浅、更高效的推理路径。为了解决这一局限性,我们提出了HyperRAG,一个针对N元超图的RAG框架,具有两种互补的检索变体:(i)HyperRetriever学习基于N元事实的结构-语义推理,以构建查询条件的关系链。它能够在上下文约束下实现准确的事实追踪、自适应的高阶遍历和可解释的多跳推理。(ii)HyperMemory利用大型语言模型(LLM)的参数记忆来指导束搜索,动态评分N元事实和实体,以实现查询感知的路径扩展。在WikiTopics(11个封闭领域数据集)和三个开放领域问答基准(HotpotQA、MuSiQue和2WikiMultiHopQA)上的广泛评估验证了HyperRAG的有效性。HyperRetriever在整体上实现了最高的答案准确率,在MRR上平均提高了2.95%,在Hits@10上提高了1.23%,超过了最强基线。定性分析进一步表明,HyperRetriever通过自适应和可解释的N元链构建弥补了推理的空白,惠及开放和封闭领域的问答。
cs.CL / 59 / 2602.14488

BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

低资源信息检索中的多语言数据集构建的BETA标注
Hasan, Md. Najib, Rain, Mst. Jannatun Ferdous, Mohammed, Fyad, Siddique, Nazmul
Abstract
IR in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we experimented on meaning preservation and task validity between source and translated datasets. Our experiment reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.
Chinese Translation
低资源语言的信息检索受到高质量、特定任务标注数据集稀缺的限制。手动标注成本高且难以扩展,而使用大型语言模型(LLMs)作为自动标注者则引发了关于标签可靠性、偏见和评估有效性的担忧。本研究展示了一个使用BETA标注框架构建的孟加拉语信息检索数据集,该框架涉及来自不同模型家族的多个LLM标注者。该框架结合了上下文对齐、一致性检查和多数一致性,随后进行人工评估以验证标签质量。除了数据集创建之外,我们还考察了其他低资源语言的信息检索数据集是否可以通过单跳机器翻译有效重用。通过在多个语言对之间使用基于LLM的翻译,我们实验了源数据集与翻译数据集之间的意义保留和任务有效性。我们的实验揭示了不同语言之间的显著差异,反映了语言依赖的偏见和不一致的语义保留,这直接影响了跨语言数据集重用的可靠性。总体而言,本研究突显了LLM辅助数据集创建在低资源信息检索中的潜力和局限性。它提供了关于跨语言数据集重用相关风险的实证证据,并为在低资源语言环境中构建更可靠的基准和评估流程提供了实用指导。
cs.CL / 60 / 2602.14492

Query as Anchor: Scenario-Adaptive User Representation via Large Language Model

查询作为锚点:通过大型语言模型实现场景自适应用户表示
Yuan, Jiahao, Xu, Yike, Wen, Jinyong, Wang, Baokun, Gao, Ziyi, Lin, Xiaotong, Liu, Yun, Fu, Xing, Cheng, Yu, Liu, Yongchao, Wang, Weiqiang, Xie, Zhongle
Abstract
Industrial-scale user representation learning requires balancing robust universality with acute task-sensitivity. However, existing paradigms primarily yield static, task-agnostic embeddings that struggle to reconcile the divergent requirements of downstream scenarios within unified vector spaces. Furthermore, heterogeneous multi-source data introduces inherent noise and modality conflicts, degrading representation. We propose Query-as-Anchor, a framework shifting user modeling from static encoding to dynamic, query-aware synthesis. To empower Large Language Models (LLMs) with deep user understanding, we first construct UserU, an industrial-scale pre-training dataset that aligns multi-modal behavioral sequences with user understanding semantics, and our Q-Anchor Embedding architecture integrates hierarchical coarse-to-fine encoders into dual-tower LLMs via joint contrastive-autoregressive optimization for query-aware user representation. To bridge the gap between general pre-training and specialized business logic, we further introduce Cluster-based Soft Prompt Tuning to enforce discriminative latent structures, effectively aligning model attention with scenario-specific modalities. For deployment, anchoring queries at sequence termini enables KV-cache-accelerated inference with negligible incremental latency. Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment. Large-scale online A/B testing in Alipay's production system across two real-world scenarios further validates its practical effectiveness. Our code is prepared for public release and will be available at: https://github.com/JhCircle/Q-Anchor.
Chinese Translation
工业规模的用户表示学习需要在强大的通用性与敏锐的任务敏感性之间取得平衡。然而,现有的范式主要产生静态的、与任务无关的嵌入,这使得在统一的向量空间中难以调和下游场景的不同需求。此外,异构的多源数据引入了固有的噪声和模态冲突,降低了表示的效果。我们提出了查询作为锚点(Query-as-Anchor),一个将用户建模从静态编码转变为动态、查询感知合成的框架。为了赋予大型语言模型(LLMs)深刻的用户理解,我们首先构建了UserU,一个工业规模的预训练数据集,该数据集将多模态行为序列与用户理解语义对齐,而我们的Q-Anchor嵌入架构通过联合对比-自回归优化将层次化的粗到细编码器整合到双塔LLMs中,以实现查询感知的用户表示。为了弥合通用预训练与专业业务逻辑之间的差距,我们进一步引入基于聚类的软提示调优,以强制执行区分性的潜在结构,有效地将模型注意力与场景特定的模态对齐。在部署方面,在序列末端锚定查询使得KV缓存加速推理,几乎没有增量延迟。在10个支付宝工业基准上的评估显示出一致的最先进(SOTA)性能、强大的可扩展性和高效的部署。在支付宝生产系统中针对两个真实场景的大规模在线A/B测试进一步验证了其实际有效性。我们的代码已准备好公开发布,将在:https://github.com/JhCircle/Q-Anchor 上提供。
cs.CL / 61 / 2602.14517

Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLMs in Sinhala and Tamil

超越翻译:评估大型语言模型在僧伽罗语和泰米尔语中的数学推理能力
Kishanthan, Sukumar, Thushalika, Kumar, Jayasekara, Buddhi, Hevapathige, Asela
Abstract
Large language models (LLMs) demonstrate strong mathematical reasoning in English, but whether these capabilities reflect genuine multilingual reasoning or reliance on translation-based processing in low-resource languages like Sinhala and Tamil remains unclear. We examine this fundamental question by evaluating whether LLMs genuinely reason mathematically in these languages or depend on implicit translation to English-like representations. Using a taxonomy of six math problem types, from basic arithmetic to complex unit conflict and optimization problems, we evaluate four prominent large language models. To avoid translation artifacts that confound language ability with translation quality, we construct a parallel dataset where each problem is natively authored by fluent speakers with mathematical training in all three languages. Our analysis demonstrates that while basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that apparent multilingual competence may not reflect uniform reasoning capabilities across languages. These findings challenge the common assumption that models exhibiting strong multilingual performance can reason equally effectively across languages, and highlight the need for fine-grained, type-aware evaluation in multilingual settings.
Chinese Translation
大型语言模型(LLMs)在英语中展现出强大的数学推理能力,但这些能力是否反映了真实的多语言推理,还是依赖于在低资源语言(如僧伽罗语和泰米尔语)中的基于翻译的处理,仍然不清楚。我们通过评估LLMs在这些语言中是否真正进行数学推理,还是依赖于隐性翻译到类似英语的表述,来探讨这个基本问题。我们使用六种数学问题类型的分类法,从基本算术到复杂的单位冲突和优化问题,评估四个著名的大型语言模型。为了避免将语言能力与翻译质量混淆的翻译伪影,我们构建了一个平行数据集,其中每个问题均由在三种语言中具有数学训练的流利说话者原著。我们的分析表明,尽管基本算术推理在不同语言间能够稳健转移,但复杂推理任务在泰米尔语和僧伽罗语中显示出显著的退化。失败模式因模型和问题类型而异,这表明表面上的多语言能力可能并不反映跨语言的一致推理能力。这些发现挑战了模型在多语言表现强劲的常见假设,即它们可以在不同语言中同样有效地进行推理,并强调在多语言环境中进行细致、类型敏感的评估的必要性。
cs.CL / 62 / 2602.14536

Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets

可解释的基于标记的噪声过滤框架用于大语言模型微调数据集
Yang, Yuchen, Lin, Wenze, Huang, Enhao, Chu, Zhixuan, Zhou, Hongbin, Tao, Lan, Li, Yiming, Qin, Zhan, Ren, Kui
Abstract
Large Language Models (LLMs) have seen remarkable advancements, achieving state-of-the-art results in diverse applications. Fine-tuning, an important step for adapting LLMs to specific downstream tasks, typically involves further training on corresponding datasets. However, a fundamental discrepancy exists between current fine-tuning datasets and the token-level optimization mechanism of LLMs: most datasets are designed at the sentence-level, which introduces token-level noise, causing negative influence to final performance. In this paper, we propose XTF, an explainable token-level noise filtering framework. XTF decomposes the complex and subtle contributions of token-level data to the fine-tuning process into three distinct and explicit attributes (reasoning importance, knowledge novelty, and task relevance), which can be assessed using scoring methods, and then masks the gradients of selected noisy tokens accordingly to optimize the performance of fine-tuned LLMs. We conduct extensive experiments on three representative downstream tasks (math, code and medicine) across 7 mainstream LLMs. The results demonstrate that XTF can significantly improve downstream performance by up to 13.7% compared to regular fine-tuning. Our work highlights the importance of token-level dataset optimization, and demonstrates the potential of strategies based on attribute decomposition for explaining complex training mechanisms.
Chinese Translation
大语言模型(LLMs)在多个应用领域取得了显著进展,达到了最先进的结果。微调是将LLMs适应特定下游任务的重要步骤,通常涉及在相应的数据集上进行进一步训练。然而,当前微调数据集与LLMs的基于标记的优化机制之间存在根本性差异:大多数数据集是按句子级别设计的,这引入了基于标记的噪声,从而对最终性能产生负面影响。本文提出了XTF,一个可解释的基于标记的噪声过滤框架。XTF将标记级数据对微调过程的复杂和微妙贡献分解为三个不同且明确的属性(推理重要性、知识新颖性和任务相关性),这些属性可以通过评分方法进行评估,然后相应地屏蔽选定噪声标记的梯度,以优化微调后的LLMs的性能。我们在7个主流LLMs上对三个代表性的下游任务(数学、代码和医学)进行了广泛实验。结果表明,与常规微调相比,XTF可以显著提高下游性能,提升幅度高达13.7%。我们的工作强调了基于标记的数据集优化的重要性,并展示了基于属性分解的策略在解释复杂训练机制方面的潜力。
cs.CL / 63 / 2602.14564

Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation

评估大型语言模型在医学问答中的表现:零样本和LLM作为评判者的评估
Adib, Shefayat E Shams, Sani, Ahmed Alfey, Esham, Ekramul Alam, Abrar, Ajwad, Chowdhury, Tareque Mohmud
Abstract
Recently, Large Language Models (LLMs) have gained significant traction in medical domain, especially in developing a QA systems to Medical QA systems for enhancing access to healthcare in low-resourced settings. This paper compares five LLMs deployed between April 2024 and August 2025 for medical QA, using the iCliniq dataset, containing 38,000 medical questions and answers of diverse specialties. Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We are using a zero-shot evaluation methodology and using BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning. Our results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks. It is notable that, Llama-4-Maverick-17B exhibited more competitive results, thus highlighting evasion efficiency trade-offs relevant for practical deployment. These findings align with advancements in LLM capabilities toward professional-level medical reasoning and reflect the increasing feasibility of LLM-supported QA systems in the real clinical environments. This benchmark aims to serve as a standardized setting for future study to minimize model size, computational resources and to maximize clinical utility in medical NLP applications.
Chinese Translation
近年来,大型语言模型(LLMs)在医学领域获得了显著关注,尤其是在开发医学问答系统以增强低资源环境下的医疗服务可及性方面。本文比较了在2024年4月至2025年8月期间部署的五个LLM,用于医学问答,使用了iCliniq数据集,该数据集包含38,000个来自不同专业的医学问题和答案。我们的模型包括Llama-3-8B-Instruct、Llama 3.2 3B、Llama 3.3 70B Instruct、Llama-4-Maverick-17B-128E-Instruct和GPT-5-mini。我们采用零样本评估方法,并使用BLEU和ROUGE指标在没有专业微调的情况下评估性能。我们的结果表明,像Llama 3.3 70B Instruct这样的大型模型在表现上优于较小的模型,这与临床任务中观察到的规模效益一致。值得注意的是,Llama-4-Maverick-17B展现出更具竞争力的结果,从而突显了与实际部署相关的效率权衡。这些发现与LLM能力在专业级医学推理方面的进展相一致,并反映了LLM支持的问答系统在真实临床环境中的日益可行性。该基准旨在作为未来研究的标准化设置,以最小化模型规模和计算资源,并最大化医学自然语言处理应用中的临床效用。
cs.CL / 64 / 2602.14594

The Wikidata Query Logs Dataset

Wikidata 查询日志数据集
Walter, Sebastian, Bast, Hannah
Abstract
We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.
Chinese Translation
我们呈现了Wikidata查询日志(WDQL)数据集,该数据集由20万个问题-查询对组成,涵盖Wikidata知识图谱。它的规模超过现有类似格式的Wikidata数据集的6倍以上,且不依赖于模板生成的查询。相反,我们使用发送到Wikidata查询服务的真实SPARQL查询构建该数据集,并为其生成问题。由于这些基于日志的查询是匿名的,因此通常不会产生结果,因此需要大量的努力将其转换回有意义的SPARQL查询。为此,我们提出了一种基于代理的方法,该方法迭代地去匿名化、清理并验证查询,同时生成相应的自然语言问题。我们展示了该数据集在训练问答方法方面的益处。所有WDQL资产以及代理代码均在宽松许可下公开可用。
cs.CL / 65 / 2602.14649

GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation

GradMAP:基于梯度度量和投影补偿的快速层剪枝
Liu, Hao, Li, Guangyan, Zhang, Wensheng, Tang, Yongqiang
Abstract
Large Language Models (LLMs) exhibit strong reasoning abilities, but their high computational costs limit their practical deployment. Recent studies reveal significant redundancy in LLMs layers, making layer pruning an active research topic. Layer pruning research primarily focuses on two aspects: measuring layer importance and recovering performance after pruning. Unfortunately, the present works fail to simultaneously maintain pruning performance and efficiency. In this study, we propose GradMAP, a faster layer pruning method with \textbf{Grad}ient \textbf{M}etric \textbf{A}nd \textbf{P}rojection compensation, which consists of two stages. In the first stage, we introduce a novel metric based on gradient magnitudes, enabling a global assessment of layer importance. Note that, it requires only a single backward propagation step per pruning decision, substantially enhancing pruning efficiency. In the second stage, we first analyze the layers with the largest mean shift resulting from pruning, and then incorporate a simple yet effective projection compensation matrix to correct this drift in one step. In this way, the degradation of model performance caused by layer pruning is effectively alleviated. Extensive experiments show that GradMAP outperforms previous layer pruning methods in both pruning speed (achieving an average $4\times$ speedup) and performance.
Chinese Translation
大型语言模型(LLMs)展现出强大的推理能力,但其高计算成本限制了实际应用。近期研究揭示了LLMs层中的显著冗余,使得层剪枝成为一个活跃的研究主题。层剪枝研究主要集中在两个方面:衡量层的重要性和剪枝后的性能恢复。不幸的是,目前的工作未能同时维持剪枝性能和效率。在本研究中,我们提出了GradMAP,一种基于 extbf{Grad}ient extbf{M}etric extbf{A}nd extbf{P}rojection补偿的快速层剪枝方法,该方法由两个阶段组成。在第一阶段,我们引入了一种基于梯度大小的新颖度量,能够对层的重要性进行全局评估。值得注意的是,它每次剪枝决策仅需一次反向传播步骤,显著提高了剪枝效率。在第二阶段,我们首先分析剪枝后平均偏移量最大的层,然后结合一个简单而有效的投影补偿矩阵,在一步内纠正这一漂移。通过这种方式,有效缓解了层剪枝导致的模型性能下降。大量实验表明,GradMAP在剪枝速度(实现平均$4 imes$加速)和性能上均优于之前的层剪枝方法。
cs.CL / 66 / 2602.14653

Is Information Density Uniform when Utterances are Grounded on Perception and Discourse?

当话语基于感知和话语时信息密度是否均匀?
Gay, Matteo, Haley, Coleman, Giulianelli, Mario, Ponti, Edoardo
Abstract
The Uniform Information Density (UID) hypothesis posits that speakers are subject to a communicative pressure to distribute information evenly within utterances, minimising surprisal variance. While this hypothesis has been tested empirically, prior studies are limited exclusively to text-only inputs, abstracting away from the perceptual context in which utterances are produced. In this work, we present the first computational study of UID in visually grounded settings. We estimate surprisal using multilingual vision-and-language models over image-caption data in 30 languages and visual storytelling data in 13 languages, together spanning 11 families. We find that grounding on perception consistently smooths the distribution of information, increasing both global and local uniformity across typologically diverse languages compared to text-only settings. In visual narratives, grounding in both image and discourse contexts has additional effects, with the strongest surprisal reductions occurring at the onset of discourse units. Overall, this study takes a first step towards modelling the temporal dynamics of information flow in ecologically plausible, multimodal language use, and finds that grounded language exhibits greater information uniformity, supporting a context-sensitive formulation of UID.
Chinese Translation
均匀信息密度(UID)假设认为,讲话者受到一种交流压力,要求在话语中均匀分配信息,从而最小化惊讶度的方差。虽然这一假设已在实证研究中得到检验,但以往的研究仅限于文本输入,忽略了话语产生时的感知背景。在本研究中,我们首次在视觉基础环境中对UID进行了计算研究。我们利用多语言视觉与语言模型对30种语言的图像-字幕数据和13种语言的视觉叙事数据进行惊讶度估计,这些数据涵盖了11个语言家族。我们发现,基于感知的 grounding 一致性地平滑了信息的分布,与仅文本输入的设置相比,提高了在类型多样语言中的全球和局部均匀性。在视觉叙事中,基于图像和话语背景的 grounding 具有额外的效果,惊讶度的显著降低发生在话语单元的开始。总体而言,本研究迈出了建模生态合理的多模态语言使用中信息流时间动态的第一步,并发现基于 grounding 的语言表现出更高的信息均匀性,支持了UID的上下文敏感性表述。
cs.CL / 67 / 2602.14655

Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech

打破数据效率困境:一种基于联邦学习和增强学习的阿尔茨海默病检测框架通过语音
Wei, Xiao, Wen, Bin, Lin, Yuqin, Li, Kai, gu, Mingyang, Wang, Xiaobao, Wang, Longbiao, Dang, Jianwu
Abstract
Early diagnosis of Alzheimer's Disease (AD) is crucial for delaying its progression. While AI-based speech detection is non-invasive and cost-effective, it faces a critical data efficiency dilemma due to medical data scarcity and privacy barriers. Therefore, we propose FAL-AD, a novel framework that synergistically integrates federated learning with data augmentation to systematically optimize data efficiency. Our approach delivers three key breakthroughs: First, absolute efficiency improvement through voice conversion-based augmentation, which generates diverse pathological speech samples via cross-category voice-content recombination. Second, collaborative efficiency breakthrough via an adaptive federated learning paradigm, maximizing cross-institutional benefits under privacy constraints. Finally, representational efficiency optimization by an attentive cross-modal fusion model, which achieves fine-grained word-level alignment and acoustic-textual interaction. Evaluated on ADReSSo, FAL-AD achieves a state-of-the-art multi-modal accuracy of 91.52%, outperforming all centralized baselines and demonstrating a practical solution to the data efficiency dilemma. Our source code is publicly available at https://github.com/smileix/fal-ad.
Chinese Translation
阿尔茨海默病(AD)的早期诊断对于延缓其进展至关重要。虽然基于人工智能的语音检测具有非侵入性和成本效益,但由于医疗数据稀缺和隐私障碍,它面临着严重的数据效率困境。因此,我们提出了FAL-AD,一个新颖的框架,协同整合了联邦学习与数据增强,以系统性地优化数据效率。我们的方法实现了三个关键突破:首先,通过基于语音转换的增强实现绝对效率提升,该方法通过跨类别的语音内容重组生成多样的病理语音样本。其次,通过自适应联邦学习范式实现协作效率突破,在隐私约束下最大化跨机构的利益。最后,通过关注跨模态融合模型实现表征效率优化,该模型实现了细粒度的词级对齐和声学-文本交互。在ADReSSo数据集上的评估中,FAL-AD达到了91.52%的最先进的多模态准确率,超越了所有集中式基线,展示了对数据效率困境的实际解决方案。我们的源代码已公开发布在https://github.com/smileix/fal-ad。
cs.CL / 68 / 2602.14675

Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

众包皮埃蒙特语以测试非标准正字法下的大型语言模型
Vico, Gianluca, Libovický, Jindřich
Abstract
We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
Chinese Translation
我们展示了一个众包数据集,用于皮埃蒙特语,这是一种濒危的罗曼语系语言,主要分布在意大利西北部。该数据集包含145个意大利语-皮埃蒙特语的平行句子,来源于Flores+,翻译由说话者以其自然的正字法风格撰写,而非遵循标准化的规范,并附有手动词对齐。我们利用这一资源对多个大型语言模型在标记化一致性、主题分类和机器翻译方面进行基准测试。我们的分析显示,相较于资源丰富的罗曼语言,皮埃蒙特语在标记化上存在一定的惩罚,但大型语言模型的分类性能接近意大利语、法语和英语。机器翻译结果则表现出不对称性:模型能够较好地将皮埃蒙特语翻译成资源丰富的语言,但生成皮埃蒙特语仍然具有挑战性。该数据集和代码已公开发布。
cs.CL / 69 / 2602.14743

LLMStructBench: Benchmarking Large Language Model Structured Data Extraction

LLMStructBench:大型语言模型结构化数据提取基准评估
Tenckhoff, Sönke, Koddenbrock, Mario, Rodner, Erik
Abstract
We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity and enables systematic testing across 22 models and five prompting strategies. We further introduce complementary performance metrics that capture both token-level accuracy and document-level validity, facilitating rigorous comparison of model, size, and prompting effects on parsing reliability. In particular, we show that choosing the right prompting strategy is more important than standard attributes such as model size. This especially ensures structural validity for smaller or less reliable models but increase the number of semantic errors. Our benchmark suite is an step towards future research in the area of LLM applied to parsing or Extract, Transform and Load (ETL) applications.
Chinese Translation
我们提出了LLMStructBench,这是一个新颖的基准,用于评估大型语言模型(LLMs)在从自然语言文本中提取结构化数据和生成有效的JavaScript对象表示法(JSON)输出方面的能力。我们的开放数据集包含多样化的、经过人工验证的解析场景,复杂性各异,并支持对22个模型和五种提示策略进行系统测试。我们进一步引入了补充性能指标,以捕捉令牌级准确性和文档级有效性,从而促进对模型、规模和提示效果在解析可靠性上的严格比较。特别是,我们展示了选择正确的提示策略比模型规模等标准属性更为重要。这尤其确保了较小或不太可靠模型的结构有效性,但会增加语义错误的数量。我们的基准套件是未来在大型语言模型应用于解析或提取、转换和加载(ETL)应用领域研究的一步。
cs.CL / 70 / 2602.14744

Rethinking the Role of LLMs in Time Series Forecasting

重新思考大型语言模型在时间序列预测中的角色
Qiu, Xin, Tong, Junlong, Sun, Yirong, Ma, Yunpu, Zhang, Wei, Shen, Xiaoyu
Abstract
Large language models (LLMs) have been introduced to time series forecasting (TSF) to incorporate contextual knowledge beyond numerical signals. However, existing studies question whether LLMs provide genuine benefits, often reporting comparable performance without LLMs. We show that such conclusions stem from limited evaluation settings and do not hold at scale. We conduct a large-scale study of LLM-based TSF (LLM4TSF) across 8 billion observations, 17 forecasting scenarios, 4 horizons, multiple alignment strategies, and both in-domain and out-of-domain settings. Our results demonstrate that \emph{LLM4TS indeed improves forecasting performance}, with especially large gains in cross-domain generalization. Pre-alignment outperforming post-alignment in over 90\% of tasks. Both pretrained knowledge and model architecture of LLMs contribute and play complementary roles: pretraining is critical under distribution shifts, while architecture excels at modeling complex temporal dynamics. Moreover, under large-scale mixed distributions, a fully intact LLM becomes indispensable, as confirmed by token-level routing analysis and prompt-based improvements. Overall, Our findings overturn prior negative assessments, establish clear conditions under which LLMs are not only useful, and provide practical guidance for effective model design. We release our code at https://github.com/EIT-NLP/LLM4TSF.
Chinese Translation
大型语言模型(LLMs)被引入到时间序列预测(TSF)中,以纳入超越数值信号的上下文知识。然而,现有研究质疑LLMs是否提供真正的益处,通常报告在没有LLMs的情况下性能相当。我们表明,这种结论源于有限的评估设置,并且在大规模情况下并不成立。我们在80亿个观测值、17种预测场景、4个预测时段、多种对齐策略以及领域内和领域外设置下,进行了大规模的LLM基础时间序列预测(LLM4TSF)研究。我们的结果表明, extit{LLM4TS确实提高了预测性能},尤其在跨领域泛化方面取得了显著提升。预对齐在超过90%的任务中优于后对齐。LLMs的预训练知识和模型架构相辅相成:在分布变化下,预训练至关重要,而架构在建模复杂时间动态方面表现出色。此外,在大规模混合分布下,完整的LLM变得不可或缺,这一点通过令牌级路由分析和基于提示的改进得到了证实。总体而言,我们的发现推翻了先前的负面评估,确立了LLMs不仅有用的明确条件,并为有效的模型设计提供了实用指导。我们的代码已发布在https://github.com/EIT-NLP/LLM4TSF。
cs.CL / 71 / 2602.14749

Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers, University students and LLM-based digital twins

认知网络重构近1000名高中生、大学生和基于LLM的数字双胞胎对STEM学科及教育环境的心态
Gariboldi, Francesco, Franchino, Emma, Haim, Edith, Lattanzi, Gianluca, Grecucci, Alessandro, Stella, Massimo
Abstract
Attitudes toward STEM develop from the interaction of conceptual knowledge, educational experiences, and affect. Here we use cognitive network science to reconstruct group mindsets as behavioural forma mentis networks (BFMNs). In this case, nodes are cue words and free associations, edges are empirical associative links, and each concept is annotated with perceived valence. We analyse BFMNs from N = 994 observations spanning high school students, university students, and early-career STEM experts, alongside LLM (GPT-oss) "digital twins" prompted to emulate comparable profiles. Focusing also on semantic neighbourhoods ("frames") around key target concepts (e.g., STEM subjects or educational actors/places), we quantify frames in terms of valence auras, emotional profiles, network overlap (Jaccard similarity), and concreteness relative to null baselines. Across student groups, science and research are consistently framed positively, while their core quantitative subjects (mathematics and statistics) exhibit more negative and anxiety related auras, amplified in higher math-anxiety subgroups, evidencing a STEM-science cognitive and emotional dissonance. High-anxiety frames are also less concrete than chance, suggesting more abstract and decontextualised representations of threatening quantitative domains. Human networks show greater overlapping between mathematics and anxiety than GPT-oss. The results highlight how BFMNs capture cognitive-affective signatures of mindsets towards the target domains and indicate that LLM-based digital twins approximate cultural attitudes but miss key context-sensitive, experience-based components relevant to replicate human educational anxiety.
Chinese Translation
对STEM的态度源于概念知识、教育经历和情感的互动。在此,我们利用认知网络科学重构群体心态,形成行为心态网络(BFMNs)。在这个模型中,节点是提示词和自由联想,边是经验关联链接,每个概念都标注了感知的效价。我们分析了来自994个观察的BFMNs,这些观察涵盖了高中生、大学生和早期职业STEM专家,以及被提示模拟可比特征的LLM(GPT-oss)“数字双胞胎”。同时,我们还关注围绕关键目标概念(例如STEM学科或教育参与者/地点)的语义邻域(“框架”),并通过效价光环、情感特征、网络重叠(雅卡尔相似度)和相对于无效基线的具体性来量化这些框架。在各学生群体中,科学和研究始终被积极框定,而其核心定量学科(数学和统计学)则表现出更消极和与焦虑相关的光环,这在高数学焦虑子群中更加明显,证明了STEM与科学之间的认知和情感不协调。高焦虑框架的具体性也低于随机水平,表明对威胁性定量领域的表现更为抽象和去情境化。人类网络在数学和焦虑之间的重叠程度高于GPT-oss。结果突显了BFMNs如何捕捉对目标领域的认知-情感特征,并表明基于LLM的数字双胞胎近似文化态度,但缺失了与人类教育焦虑相关的重要情境敏感和经验基础的组成部分。
cs.CL / 72 / 2602.14760

Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers

残差连接与因果偏移:揭示变换器中的结构性不对齐
Lys, Jonathan, Gripon, Vincent, Pasdeloup, Bastien, Mauch, Lukas, Cardinaux, Fabien, Hacene, Ghouthi Boukli
Abstract
Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers via causal masking for parallelism. This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token, potentially propagating mismatched information if the current token is not the most informative for prediction. In this work, we empirically localize this input-output alignment shift in pretrained LLMs, using decoding trajectories over tied embedding spaces and similarity-based metrics. Our experiments reveal that the hidden token representations switch from input alignment to output alignment deep within the network. Motivated by this observation, we propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.
Chinese Translation
大型语言模型(LLMs)通过下一个标记预测进行训练,采用自回归变换器中的因果掩蔽实现并行性。这导致了一种微妙的不对齐:残差连接将激活与当前标记绑定,而监督则针对下一个标记,如果当前标记不是预测中最具信息量的标记,可能会传播不匹配的信息。在本研究中,我们通过在绑定嵌入空间上的解码轨迹和基于相似性的度量,实证定位了预训练LLMs中的输入输出对齐偏移。我们的实验揭示了隐藏标记表示在网络深处从输入对齐切换到输出对齐。基于这一观察,我们提出了一种基于残差衰减的轻量级残差路径缓解方法,既可以作为固定层干预实施,也可以作为可学习的门控机制。对多个基准的实验表明,这些策略缓解了表示不对齐的问题,并带来了改进,为自回归变换器提供了一种高效且通用的架构增强。
cs.CL / 73 / 2602.14763

Unlocking Reasoning Capability on Machine Translation in Large Language Models

解锁大型语言模型在机器翻译中的推理能力
Rajaee, Sara, Vincent, Sebastian, Berard, Alexandre, Fadaee, Marzieh, Marchisio, Kelly, Kocmi, Tom
Abstract
Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning. However, their impact on machine translation (MT) remains underexplored. We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models. Analysis reveals that MT reasoning traces are highly linear, lacking revision, self-correction and exploration of alternative translations, which limits their usefulness. Furthermore, injecting higher-quality reasoning traces from stronger models does not reliably improve weaker models' performance. To address this mismatch, we propose a structured reasoning framework tailored to translation, based on multi-step drafting, adequacy refinement, fluency improvement, and selective iterative revision. We curate a synthetic dataset of dynamic structured reasoning traces and post-train a large reasoning model on this data. Experiments show significant improvements over standard translation fine-tuning and injected generic reasoning baselines. Our findings demonstrate that reasoning must be task-structured to benefit MT.
Chinese Translation
面向推理的大型语言模型(RLMs)通过生成明确的中间推理,在数学和编码等任务上取得了显著的进展。然而,它们对机器翻译(MT)的影响仍然未被充分探索。我们系统地评估了几种开放和封闭权重的RLMs在WMT24++基准上的表现,发现启用明确推理在不同语言和模型中始终会降低翻译质量。分析表明,MT推理痕迹高度线性,缺乏修订、自我纠正和替代翻译的探索,这限制了它们的实用性。此外,从更强模型注入的高质量推理痕迹并未可靠地改善较弱模型的性能。为了解决这一不匹配,我们提出了一种针对翻译的结构化推理框架,基于多步骤草拟、充分性细化、流畅性提升和选择性迭代修订。我们策划了一个动态结构化推理痕迹的合成数据集,并在该数据上对大型推理模型进行了后训练。实验表明,相较于标准翻译微调和注入的通用推理基线,我们的框架显著提高了翻译质量。我们的研究结果表明,推理必须是任务结构化的,才能对机器翻译产生积极影响。
cs.CL / 74 / 2602.14770

Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

多智能体喜剧俱乐部:探讨社区讨论对大型语言模型幽默生成的影响
Hong, Shiwei, Li, Lingyao, Rong, Ethan Z., Shen, Chenxinran, Lu, Zhicong
Abstract
Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined. We test whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retrieved to condition subsequent generations, whereas the baseline omits discussion. Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15-item rubric, discussion wins 75.6% of instances and improves Craft/Clarity ({\Delta} = 0.440) and Social Response ({\Delta} = 0.422), with occasional increases in aggressive humor.
Chinese Translation
先前的研究探讨了多轮互动和反馈对大型语言模型(LLM)写作的影响,但评估仍主要集中在提示和局部反馈上,在线社区中的持续公共反响尚未得到充分研究。我们测试了广播社区讨论是否能改善在受控的多智能体沙箱中的单口喜剧写作:在讨论条件下,评论家和观众的讨论线程被记录、过滤、存储为社会记忆,并在后续生成中被检索以进行条件调整,而基线条件则省略了讨论。在50轮(250对独白)的评估中,由五位专家评审员使用A/B偏好和15项评分标准进行判断,讨论条件在75.6%的情况下获胜,并改善了创作/清晰度({ ext{Δ}} = 0.440)和社会反应({ ext{Δ}} = 0.422),并偶尔增加了攻击性幽默。
cs.CL / 75 / 2602.14777

Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment

突现性错位语言模型表现出行为自我意识,且随着后续重新对齐而变化
Vaugrante, Laurène, Weckauff, Anietta, Hagendorff, Thilo
Abstract
Recent research has demonstrated that large language models (LLMs) fine-tuned on incorrect trivia question-answer pairs exhibit toxicity - a phenomenon later termed "emergent misalignment". Moreover, research has shown that LLMs possess behavioral self-awareness - the ability to describe learned behaviors that were only implicitly demonstrated in training data. Here, we investigate the intersection of these phenomena. We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment. Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
Chinese Translation
最近的研究表明,经过不正确的琐事问答对微调的大型语言模型(LLMs)表现出毒性——这一现象后来被称为“突现性错位”。此外,研究还表明,LLMs具备行为自我意识——即描述在训练数据中仅隐含展示的学习行为的能力。在此,我们探讨这些现象的交集。我们对GPT-4.1模型进行了顺序微调,使用已知会引发和逆转突现性错位的数据集,并评估模型是否能够在不提供上下文示例的情况下自我意识到其行为转变。我们的结果表明,突现性错位的模型自我评估为显著更具危害性,相较于其基础模型和重新对齐的对应模型,显示出对自身突现性错位的行为自我意识。我们的发现表明,行为自我意识与模型的实际对齐状态相一致,表明模型可以被查询以获取有关自身安全性的有用信号。
cs.CL / 76 / 2602.14778

A Geometric Analysis of Small-sized Language Model Hallucinations

小型语言模型幻觉的几何分析
Ricco, Emanuele, Onofri, Elia, Cima, Lorenzo, Cresci, Stefano, Di Pietro, Roberto
Abstract
Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings. This work investigates hallucinations in small-sized LLMs through a geometric perspective, starting from the hypothesis that when models generate multiple responses to the same prompt, genuine ones exhibit tighter clustering in the embedding space, we prove this hypothesis and, leveraging this geometrical insight, we also show that it is possible to achieve a consistent level of separability. This latter result is used to introduce a label-efficient propagation method that classifies large collections of responses from just 30-50 annotations, achieving F1 scores above 90%. Our findings, framing hallucinations from a geometric perspective in the embedding space, complement traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research.
Chinese Translation
幻觉——流畅但事实不准确的回应——对语言模型的可靠性构成了重大挑战,尤其是在多步骤或自主情境中。本研究通过几何视角探讨小型大语言模型(LLMs)中的幻觉,基于这样一个假设:当模型对同一提示生成多个回应时,真实的回应在嵌入空间中表现出更紧密的聚类。我们证明了这一假设,并利用这一几何洞察,展示了实现一致的可分离性水平是可能的。后者的结果用于引入一种标签高效的传播方法,该方法仅通过30-50个注释对大量回应进行分类,F1得分超过90%。我们的研究从嵌入空间的几何视角框架下对幻觉进行分析,补充了传统的知识中心和单一回应评估范式,为进一步研究铺平了道路。
cs.CL / 77 / 2602.14798

Overthinking Loops in Agents: A Structural Risk via MCP Tools

代理中的过度思考循环:通过MCP工具的结构性风险
Lee, Yohan, Jang, Jisoo, Choi, Seoyeon, Kim, Sangyeop, Choi, Seungtaek
Abstract
Tool-using LLM agents increasingly coordinate real workloads by selecting and chaining third-party tools based on text-visible metadata such as tool names, descriptions, and return messages. We show that this convenience creates a supply-chain attack surface: a malicious MCP tool server can be co-registered alongside normal tools and induce overthinking loops, where individually trivial or plausible tool calls compose into cyclic trajectories that inflate end-to-end tokens and latency without any single step looking abnormal. We formalize this as a structural overthinking attack, distinguishable from token-level verbosity, and implement 14 malicious tools across three servers that trigger repetition, forced refinement, and distraction. Across heterogeneous registries and multiple tool-capable models, the attack causes severe resource amplification (up to $142.4\times$ tokens) and can degrade task outcomes. Finally, we find that decoding-time concision controls do not reliably prevent loop induction, suggesting defenses should reason about tool-call structure rather than tokens alone.
Chinese Translation
使用工具的LLM代理越来越多地通过基于文本可见元数据(如工具名称、描述和返回消息)选择和链接第三方工具来协调实际工作负载。我们表明,这种便利性创造了一个供应链攻击面:恶意的MCP工具服务器可以与正常工具共同注册,并诱发过度思考循环,其中单独看似微不足道或合理的工具调用组合成循环轨迹,导致端到端的令牌和延迟膨胀,而没有任何单一步骤看起来异常。我们将其形式化为结构性过度思考攻击,这与令牌级的冗长性可区分,并在三个服务器上实现了14个恶意工具,这些工具触发重复、强制细化和分心。在异构注册表和多个具备工具能力的模型中,该攻击导致严重的资源放大(高达142.4倍的令牌),并可能降低任务结果。最后,我们发现解码时间的简洁性控制并不能可靠地防止循环诱导,这表明防御应考虑工具调用结构,而不仅仅是令牌。
cs.CL / 78 / 2602.14812

Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque

低资源语言和方言的物理常识推理:以巴斯克语为例的研究
Bengoetxea, Jaione, Gonzalez-Dios, Itziar, Agerri, Rodrigo
Abstract
Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces. Recent years have witnessed growing interest in reasoning tasks within Natural Language Processing (NLP). However, no prior research has examined the performance of Large Language Models (LLMs) on non-question-answering (non-QA) physical commonsense reasoning tasks in low-resource languages such as Basque. Taking the Italian GITA as a starting point, this paper addresses this gap by presenting BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque, available in both standard and dialectal variants. We evaluate model performance across three hierarchical levels of commonsense understanding: (1) distinguishing between plausible and implausible narratives (accuracy), (2) identifying the conflicting element that renders a narrative implausible (consistency), and (3) determining the specific physical state that creates the implausibility (verifiability). These tasks were assessed using multiple multilingual LLMs as well as models pretrained specifically for Italian and Basque. Results indicate that, in terms of verifiability, LLMs exhibit limited physical commonsense capabilities in low-resource languages such as Basque, especially when processing dialectal variants.
Chinese Translation
物理常识推理代表了人类智能的一项基本能力,使个体能够理解其环境、预测未来事件并在物理空间中导航。近年来,关于自然语言处理(NLP)中的推理任务的兴趣日益增长。然而,之前的研究尚未探讨大型语言模型(LLMs)在低资源语言(如巴斯克语)的非问答(non-QA)物理常识推理任务上的表现。以意大利的GITA为起点,本文通过提出BasPhyCo,填补了这一空白,BasPhyCo是首个针对巴斯克语的非问答物理常识推理数据集,提供标准和方言两种变体。我们在三个层次的常识理解上评估模型性能:(1)区分合理和不合理的叙述(准确性),(2)识别使叙述不合理的冲突元素(一致性),以及(3)确定导致不合理性的特定物理状态(可验证性)。这些任务使用多种多语言LLMs以及专门为意大利语和巴斯克语预训练的模型进行了评估。结果表明,在可验证性方面,LLMs在低资源语言(如巴斯克语)中表现出有限的物理常识能力,尤其是在处理方言变体时。
cs.CL / 79 / 2602.14819

Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Testimole-Conversational:一个包含300亿词的意大利讨论板语料库(1996-2024)用于语言建模和社会语言学研究
Rinaldi, Matteo, Varvara, Rossella, Patti, Viviana
Abstract
We present "Testimole-conversational" a massive collection of discussion boards messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models'pre-training. Furthermore, discussion boards' messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction in wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also support investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.
Chinese Translation
我们呈现了“Testimole-conversational”,这是一个庞大的意大利语讨论板消息集合。该语料库的规模庞大,超过300亿词汇单位(1996-2024),使其成为意大利本土大型语言模型预训练的理想数据集。此外,讨论板的消息是进行语言学和社会学分析的重要资源。该语料库捕捉了丰富多样的计算机媒介交流,提供了对非正式书面意大利语、话语动态以及广泛时间跨度内在线社会互动的深入见解。除了对自然语言处理(NLP)应用(如语言建模、领域适应和对话分析)的相关性外,它还支持对数字交流中语言变异和社会现象的研究。该资源将免费提供给研究社区。
cs.CL / 80 / 2602.14917

BFS-PO: Best-First Search for Large Reasoning Models

BFS-PO:针对大型推理模型的最佳优先搜索
Parascandolo, Fiorenzo, Tan, Wenhui, Sangineto, Enver, Song, Ruihua, Cucchiara, Rita
Abstract
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase of computational costs and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthinking is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm which alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.
Chinese Translation
大型推理模型(Large Reasoning Models, LRM)如OpenAI o1和DeepSeek-R1在使用长推理链的推理任务中表现出色。然而,这也导致了计算成本的显著增加和冗长输出的生成,这一现象被称为过度思考(overthinking)。过度思考的倾向常常因强化学习(Reinforcement Learning, RL)算法如GRPO/DAPO而加剧。本文提出了BFS-PO,一种通过最佳优先搜索(Best-First Search)探索策略来缓解这一问题的RL算法。具体而言,BFS-PO通过基于最大熵节点的回溯机制寻找最短的正确答案。通过在训练过程中生成逐渐更短的响应,BFS-PO学习生成简洁的推理链。使用不同的基准和基础LRM,我们展示了BFS-PO能够同时提高LRM的准确性并缩短其回答。
cs.CL / 81 / 2602.14955

Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition

接触中心人工智能中的工具感知规划:通过谱系引导的查询分解评估大型语言模型
Nathan, Varun, Guha, Shreyas, Kumar, Ayush
Abstract
We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts) with explicit depends_on for parallelism. Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data curation methodology that iteratively refines plans via an evaluator->optimizer loop to produce high-quality plan lineages (ordered plan revisions) while reducing manual effort; and (iii) a large-scale study of 14 LLMs across sizes and families for their ability to decompose queries into step-by-step, executable, and tool-assigned plans, evaluated under prompts with and without lineage. Empirically, LLMs struggle on compound queries and on plans exceeding 4 steps (typically 5-15); the best total metric score reaches 84.8% (Claude-3-7-Sonnet), while the strongest one-shot match rate at the "A+" tier (Extremely Good, Very Good) is only 49.75% (o3-mini). Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. Our results highlight persistent gaps in tool-understanding, especially in tool-prompt alignment and tool-usage completeness, and show that shorter, simpler plans are markedly easier. The framework and findings provide a reproducible path for assessing and improving agentic planning with tools for answering data-analysis queries in contact-center settings.
Chinese Translation
我们提出了一个基于领域的框架和基准,用于接触中心的工具感知计划生成。在我们的目标用例中,回答商业洞察查询需要将其分解为可执行步骤,这些步骤涉及结构化工具(Text2SQL (T2S)/Snowflake)和非结构化工具(RAG/转录本),并具有明确的依赖关系以实现并行处理。我们的贡献有三方面:(i)一个基于参考的计划评估框架,具有两种模式——一个跨越七个维度(例如,工具提示对齐、查询遵循)的指标评估器和一个一次性评估器;(ii)一种数据整理方法,通过评估器->优化器循环迭代精炼计划,以生成高质量的计划谱系(有序的计划修订),同时减少人工工作量;以及(iii)对14个不同规模和类别的大型语言模型进行的大规模研究,评估它们将查询分解为逐步、可执行且分配工具的计划的能力,评估是在有谱系和无谱系的提示下进行的。实证结果表明,大型语言模型在复合查询和超过4个步骤(通常为5-15)的计划上表现不佳;最佳总指标得分达到84.8%(Claude-3-7-Sonnet),而“A+”级别(极好、很好)的一次性匹配率仅为49.75%(o3-mini)。计划谱系总体上带来了混合收益,但对几个顶级模型有益,并提高了许多模型的步骤可执行性。我们的结果突显了工具理解中的持续差距,特别是在工具提示对齐和工具使用完整性方面,并表明较短、较简单的计划显著更容易实现。该框架和发现为评估和改进在接触中心环境中使用工具回答数据分析查询的代理规划提供了可重复的路径。
cs.CL / 82 / 2602.14970

Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System

基于大型语言模型的联络中心代理质量保证系统的反事实公平性评估
Mayilvaghanan, Kawin, Gupta, Siddhant, Kumar, Ayush
Abstract
Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns regarding demographic and behavioral biases that may distort workforce assessment. We present a counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions spanning three categories: Identity, Context, and Behavioral Style. Fairness is quantified using the Counterfactual Flip Rate (CFR), the frequency of binary judgment reversals, and the Mean Absolute Score Difference (MASD), the average shift in coaching or confidence scores across counterfactual pairs. Evaluating 18 LLMs on 3,000 real-world contact center transcripts, we find systematic disparities, with CFR ranging from 5.4% to 13.0% and consistent MASD shifts across confidence, positive, and improvement scores. Larger, more strongly aligned models show lower unfairness, though fairness does not track accuracy. Contextual priming of historical performance induces the most severe degradations (CFR up to 16.4%), while implicit linguistic identity cues remain a persistent bias source. Finally, we analyze the efficacy of fairness-aware prompting, finding that explicit instructions yield only modest improvements in evaluative consistency. Our findings underscore the need for standardized fairness auditing pipelines prior to deploying LLMs in high-stakes workforce evaluation.
Chinese Translation
大型语言模型(LLMs)越来越多地应用于联络中心的质量保证(QA),以自动化代理绩效评估和反馈指导。尽管LLMs提供了前所未有的可扩展性和速度,但它们对网络规模训练数据的依赖引发了关于可能扭曲劳动力评估的人口统计和行为偏见的担忧。我们对基于LLM的QA系统进行了反事实公平性评估,涵盖了身份、上下文和行为风格三个类别的13个维度。公平性通过反事实翻转率(Counterfactual Flip Rate, CFR)来量化,即二元判断反转的频率,以及平均绝对得分差异(Mean Absolute Score Difference, MASD),即反事实对中指导或信心得分的平均变化。在对18个LLM进行评估时,我们分析了3000个真实世界的联络中心转录记录,发现存在系统性差异,CFR范围为5.4%到13.0%,并且在信心、积极性和改进得分上存在一致的MASD变化。较大且更强对齐的模型显示出较低的不公平性,尽管公平性与准确性并不完全一致。历史绩效的上下文引导导致了最严重的降级(CFR高达16.4%),而隐含的语言身份线索仍然是一个持续的偏见来源。最后,我们分析了公平性意识提示的有效性,发现明确的指令仅带来了评估一致性的适度改善。我们的研究结果强调了在高风险劳动力评估中部署LLMs之前,需要建立标准化的公平性审计流程。
cs.CL / 83 / 2602.15005

Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation

通过推理和蒸馏学习用户兴趣以实现跨领域新闻推荐
Zhu, Mengdan, Zhao, Yufan, Di, Tao, Yan, Yulan, Zhao, Liang
Abstract
News recommendation plays a critical role in online news platforms by helping users discover relevant content. Cross-domain news recommendation further requires inferring user's underlying information needs from heterogeneous signals that often extend beyond direct news consumption. A key challenge lies in moving beyond surface-level behaviors to capture deeper, reusable user interests while maintaining scalability in large-scale production systems. In this paper, we present a reinforcement learning framework that trains large language models to generate high-quality lists of interest-driven news search queries from cross-domain user signals. We formulate query-list generation as a policy optimization problem and employ GRPO with multiple reward signals. We systematically study two compute dimensions: inference-time sampling and model capacity, and empirically observe consistent improvements with increased compute that exhibit scaling-like behavior. Finally, we perform on-policy distillation to transfer the learned policy from a large, compute-intensive teacher to a compact student model suitable for scalable deployment. Extensive offline experiments, ablation studies and large-scale online A/B tests in a production news recommendation system demonstrate consistent gains in both interest modeling quality and downstream recommendation performance.
Chinese Translation
新闻推荐在在线新闻平台中发挥着关键作用,帮助用户发现相关内容。跨领域新闻推荐进一步要求从异构信号中推断用户潜在的信息需求,这些信号往往超出直接的新闻消费。一个关键挑战在于超越表层行为,捕捉更深层次、可重用的用户兴趣,同时在大规模生产系统中保持可扩展性。本文提出了一种强化学习框架,训练大型语言模型从跨领域用户信号中生成高质量的兴趣驱动新闻搜索查询列表。我们将查询列表生成形式化为一个策略优化问题,并采用带有多个奖励信号的 GRPO(Generalized Reinforcement Policy Optimization)。我们系统地研究了两个计算维度:推理时间采样和模型容量,并实证观察到随着计算能力的增加,性能持续改善,表现出类似于扩展的行为。最后,我们进行在线策略蒸馏,将从大型、计算密集型教师模型中学习到的策略转移到适合可扩展部署的紧凑学生模型。大量离线实验、消融研究以及在生产新闻推荐系统中的大规模在线 A/B 测试均表明,在兴趣建模质量和下游推荐性能方面均取得了一致的提升。
cs.CL / 84 / 2602.15012

Cold-Start Personalization via Training-Free Priors from Structured World Models

通过结构化世界模型的无训练先验实现冷启动个性化
Bose, Avinandan, Li, Shuyue Stella, Brahman, Faeze, Koh, Pang Wei, Du, Simon Shaolei, Tsvetkov, Yulia, Fazel, Maryam, Xiao, Lin, Celikyilmaz, Asli
Abstract
Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is the capability to exploit the factored structure of preference data.
Chinese Translation
冷启动个性化要求在没有用户特定历史数据的情况下,通过交互推断用户偏好。核心挑战是路由问题:每个任务涉及数十个偏好维度,但个别用户只关心其中的少数,而哪些维度重要取决于提问者。在有限的问题预算下,缺乏结构的提问将错过重要的维度。强化学习是自然的表述,但在多轮设置中,其终端奖励未能利用偏好数据的分解、逐标准结构,实际上学习到的策略会崩溃为静态问题序列,忽视用户的反馈。我们建议将冷启动引导分解为离线结构学习和在线贝叶斯推断。Pep(带有先验的偏好引导)从完整的用户画像中离线学习偏好相关性的结构化世界模型,然后在线进行无训练的贝叶斯推断,以选择信息量丰富的问题并预测完整的偏好画像,包括从未询问过的维度。该框架在下游求解器中是模块化的,仅需简单的信念模型。在医学、数学、社会和常识推理等领域,Pep生成的响应与用户陈述的偏好之间的对齐率为80.8%,而强化学习为68.5%,且所需交互次数减少了3-5倍。当两个用户对同一问题给出不同答案时,Pep在39-62%的情况下改变其后续提问,而强化学习为0-28%。Pep使用约1万参数,而强化学习使用80亿参数,显示出冷启动引导的瓶颈在于利用偏好数据的分解结构的能力。
cs.CL / 85 / 2602.15013

Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation

基于参数高效的LLM微调与双向翻译的文本风格迁移
Liu, Ruoxi, Koehn, Philipp
Abstract
This paper proposes a novel method for Text Style Transfer (TST) based on parameter-efficient fine-tuning of Large Language Models (LLMs). Addressing the scarcity of parallel corpora that map between styles, the study employs roundtrip translation to synthesize such parallel datasets from monolingual corpora. This approach creates 'neutralized' text devoid of stylistic attributes, essentially creating a shared input style at training-time and inference-time. Experimental results demonstrate consistent superiority of this method over zero-shot prompting and fewshot ICL techniques measured by BLEU scores and style accuracy scores across four investigated domains. Furthermore, the integration of retrieval-augmented generation (RAG) for terminology and name knowledge enhances robustness and stylistic consistency.
Chinese Translation
本文提出了一种基于大规模语言模型(LLM)参数高效微调的文本风格迁移(TST)新方法。针对风格之间映射的平行语料稀缺问题,本研究采用双向翻译从单语语料合成此类平行数据集。该方法生成了不具风格特征的“中性化”文本,实质上在训练和推理阶段创建了共享的输入风格。实验结果表明,该方法在四个研究领域的BLEU分数和风格准确性分数上,始终优于零-shot提示和少-shot ICL技术。此外,术语和名称知识的检索增强生成(RAG)集成提高了鲁棒性和风格一致性。