Daily Research Digest

arXiv Papers

2026-04-29 · 196 papers · 4 categories · 196 translated
Robotics (机器人学), 25 papers
cs.RO / 1 / 2604.24833

MotionBricks: Scalable Real-Time Motions with Modular Latent Generative Model and Smart Primitives

MotionBricks:具有模块化潜在生成模型和智能原语的可扩展实时运动
Wang, Tingwu, Dionne, Olivier, De Ruyter, Michael, Minor, David, Rempe, Davis, Zhao, Kaifeng, Petrovich, Mathis, Yuan, Ye, Li, Chenran, Luo, Zhengyi, Robison, Brian, Blackwell, Xavier, Antoniazzi, Bernardo, Peng, Xue Bin, Zhu, Yuke, Yuen, Simon
Abstract
Despite transformative advances in generative motion synthesis, real-time interactive motion control remains dominated by traditional techniques. In this work, we identify two key challenges in bridging research and production: 1) Real-time scalability: Industry applications demand real-time generation of a vast repertoire of motion skills, while generative methods exhibit significant degradation in quality and scalability under real-time computation constraints, and 2) Integration: Industry applications demand fine-grained multi-modal control involving velocity commands, style selection, and precise keyframes, a need largely unmet by existing text- or tag-driven models. To overcome these limitations, we introduce MotionBricks: a large-scale, real-time generative framework with a two-fold solution. First, we propose a large-scale modular latent generative backbone tailored for robust real-time motion generation, effectively modeling a dataset of over 350,000 motion clips with a single model. Second, we introduce smart primitives that provide a unified, robust, and intuitive interface for authoring both navigation and object interaction. Applications can be designed in a plug-and-play manner like assembling bricks without expert animation knowledge. Quantitatively, we show that MotionBricks produces state-of-the-art motion quality on open-source and proprietary datasets of various scales, while also achieving a real-time throughput of 15,000 FPS with 2ms latency. We demonstrate the flexibility and robustness of MotionBricks in a complete production-level animation demo, covering navigation and object-scene interaction across various styles with a unified model. To showcase our framework's application beyond animation, we deploy MotionBricks on the Unitree G1 humanoid robot to demonstrate its flexibility and generalization for real-time robotic control.
Chinese Translation
尽管生成运动合成技术取得了变革性的进展,实时交互运动控制仍然主要依赖于传统技术。在本研究中,我们识别出连接研究与生产的两个关键挑战:1)实时可扩展性:工业应用需要实时生成大量运动技能,而生成方法在实时计算约束下的质量和可扩展性显著下降;2)集成:工业应用需要涉及速度命令、风格选择和精确关键帧的细粒度多模态控制,而现有的基于文本或标签的模型在这方面的需求大多未得到满足。为了解决这些限制,我们提出了MotionBricks:一个大规模、实时的生成框架,提供了双重解决方案。首先,我们提出了一个大型模块化潜在生成骨干网络,专为稳健的实时运动生成而设计,能够有效建模超过350,000个运动片段的数据集。其次,我们引入了智能原语,提供了一个统一、稳健且直观的接口,用于创作导航和物体交互。应用可以像拼装积木一样以即插即用的方式设计,无需专业的动画知识。从量化结果来看,我们展示了MotionBricks在各种规模的开源和专有数据集上产生了最先进的运动质量,同时实现了15,000 FPS的实时吞吐量和2毫秒的延迟。我们在一个完整的生产级动画演示中展示了MotionBricks的灵活性和稳健性,涵盖了各种风格的导航和物体场景交互,使用一个统一的模型。为了展示我们框架在动画之外的应用,我们将MotionBricks部署在Unitree G1人形机器人上,展示了其在实时机器人控制中的灵活性和泛化能力。
cs.RO / 2 / 2604.24894

VISION-SLS: Safe Perception-Based Control from Learned Visual Representations via System Level Synthesis

VISION-SLS:基于视觉表示学习的安全感知控制方法通过系统级综合实现
Leeman, Antoine P., Zhan, Shuyu, Zeilinger, Melanie N., Chou, Glen
Abstract
We propose VISION-SLS, a method for nonlinear output-feedback control from high-resolution RGB images which provides robust constraint satisfaction guarantees under calibrated uncertainty bounds despite partial observability, sensor noise, and nonlinear dynamics. To enable scalability while retaining guarantees, we propose: (i) a learned low-dimensional observation map from pretrained visual features with state-dependent error bounds, and (ii) a causal affine time-varying output-feedback policy optimized via System Level Synthesis (SLS). We develop a scalable, novel solver for the resulting nonconvex program that leverages sequential convex programming coupled with efficient Riccati recursions. On two simulated visuomotor tasks (a 4D car and a 10D quadrotor) with >= 512 x 512 pixels and a 59D humanoid task with partial observability, our method enables safe, information-gathering behavior that reduces uncertainty while guaranteeing constraint satisfaction with empirically-calibrated error bounds. We also validate our method on hardware, safely controlling a ground vehicle from onboard images, outperforming baselines in safety rate and solve times. Together, these results show that learned visual abstractions coupled with an efficient solver make SLS-based safe visuomotor output-feedback practical at scale. The code implementation of our method is available at https://github.com/trustworthyrobotics/VISION-SLS.
Chinese Translation
我们提出了VISION-SLS,这是一种基于高分辨率RGB图像的非线性输出反馈控制方法,能够在部分可观测性、传感器噪声和非线性动态的情况下,提供稳健的约束满足保证,且在经过校准的不确定性范围内保持有效。为了在保留保证的同时实现可扩展性,我们提出了:(i)从预训练视觉特征中学习的低维观测映射,具有状态依赖的误差界限,以及(ii)通过系统级综合(System Level Synthesis, SLS)优化的因果仿射时变输出反馈策略。我们开发了一种可扩展的新型求解器,用于解决由此产生的非凸程序,该求解器结合了序列凸编程和高效的Riccati递归。在两个模拟的视觉运动任务(一个4D汽车和一个10D四旋翼)中,图像分辨率均为>= 512 x 512像素,以及一个具有部分可观测性的59D人形任务中,我们的方法实现了安全的信息收集行为,降低了不确定性,同时保证了在经验校准的误差界限下的约束满足。我们还在硬件上验证了我们的方法,安全地从车载图像控制地面车辆,在安全率和求解时间上超越了基线。综合这些结果表明,学习的视觉抽象与高效求解器的结合使得基于SLS的安全视觉运动输出反馈在规模上变得切实可行。我们的方法的代码实现可在https://github.com/trustworthyrobotics/VISION-SLS获取。
cs.RO / 3 / 2604.24906

An analysis of sensor selection for fruit picking with suction-based grippers

基于吸力抓手的水果采摘传感器选择分析
Krueger, Eva, Rosette, Marcus, Davidson, Joseph R.
Abstract
Robotic fruit harvesting often fails to reliably detect whether a fruit has been successfully picked, limiting efficiency and increasing crop damage. This problem is difficult due to compliant fruit and grippers, variable stem attachment, and occlusions in orchard environments. Prior work has explored vision-based perception and multi-sensor learning approaches for pick state estimation. However, minimal sensor sets and phase-dependent sensing strategies for accurate pick and slip detection remain largely unexplored. In this work, we design and evaluate a multimodal sensing suite integrated into a compliant suction-based apple gripper. Our approach is unique because it identifies which sensors are most informative at different phases of the pick, enabling predictive detection of failures before they occur. The contributions of this paper are a phase-dependent evaluation of multimodal sensors and the identification of minimal sensor sets for reliable pick state classification. Experiments in a real apple orchard show that Random Forest and Multilayer Perceptron classifiers detect successful picks and impending failures with over 90% accuracy, and Random Forest predicts pick/slip events within 0.09 s of human-annotated ground truth.
Chinese Translation
机器人水果采摘常常无法可靠地检测水果是否成功采摘,这限制了效率并增加了作物损伤。由于水果和抓手的柔性、变动的茎部连接以及果园环境中的遮挡,这一问题变得复杂。之前的研究探讨了基于视觉的感知和多传感器学习方法用于采摘状态估计。然而,针对准确的采摘和滑动检测的最小传感器集和相位依赖的感知策略仍然基本未被探索。在本研究中,我们设计并评估了一种集成在柔性吸力苹果抓手中的多模态传感器套件。我们的方法独特之处在于能够识别在不同采摘阶段哪些传感器提供的信息最为重要,从而在故障发生之前实现预测检测。本文的贡献在于对多模态传感器进行相位依赖的评估,并识别出可靠的采摘状态分类所需的最小传感器集。在真实的苹果果园中进行的实验表明,随机森林(Random Forest)和多层感知器(Multilayer Perceptron)分类器以超过90%的准确率检测成功采摘和即将发生的故障,并且随机森林在0.09秒内预测采摘/滑动事件,接近人类标注的真实情况。
cs.RO / 4 / 2604.24916

asRoBallet: Closing the Sim2Real Gap via Friction-Aware Reinforcement Learning for Underactuated Spherical Dynamics

asRoBallet:通过考虑摩擦的强化学习缩小仿真与现实之间的差距以实现欠驱动球体动力学
Wan, Fang, Huang, Guangyi, Wu, Tianyu, Zhang, Zishang, Huang, Bangchao, Sun, Haoran, Chen, Mingdong, Song, Chaoyang
Abstract
We introduce asRoBallet, which is, to the best of our knowledge, the first successful deployment of reinforcement learning (RL) on humanoid ballbot hardware. Historically, ballbots have served as a canonical benchmark for underactuated and nonholonomic control, which are characterized by a reality gap in complex friction models for wheel-sphere-ground interactions. While current literature demonstrates successful handling of 3D balancing with LQR and MPC, transitioning to actual hardware for a humanoid ballbot using RL is currently hindered by critical gaps in contact modeling, actuator latency and jitter, and safe hardware exploration. This study proposes a high-fidelity MuJoCo simulation that explicitly models the discrete roller mechanics of ETH-type omni-wheels, thereby capturing parasitic vibrations and contact discontinuities that were previously ignored. We also developed a Friction-Aware Reinforcement Learning framework that achieves zero-shot Sim2Real transfer by mastering the coupled rolling, lateral, and torsional friction channels at the wheel-sphere and sphere-ground interfaces. We designed asRoBallet through subtractive reconfiguration, repurposing key components from an overconstrained quadruped and integrating them into a newly designed structural frame to achieve a robust research platform at low cost. We also developed a generalized iOS ecosystem that transforms consumer electronics into a low-latency interface, enabling a single operator to orchestrate expressive humanoid maneuvers via intuitive natural motion.
Chinese Translation
我们介绍了asRoBallet,尽我们所知,这是首次成功在类人球形机器人硬件上部署强化学习(RL)。历史上,球形机器人作为欠驱动和非完整控制的经典基准,面临着复杂摩擦模型在轮-球-地面交互中的现实差距。尽管现有文献展示了使用线性二次调节器(LQR)和模型预测控制(MPC)成功处理3D平衡,但将RL应用于类人球形机器人实际硬件的过渡目前受到接触建模、执行器延迟与抖动以及安全硬件探索等关键问题的阻碍。本研究提出了一种高保真MuJoCo仿真,明确建模ETH型全向轮的离散滚轮力学,从而捕捉之前被忽视的寄生振动和接触不连续性。我们还开发了一种考虑摩擦的强化学习框架,通过掌握轮-球和球-地接口的耦合滚动、侧向和扭转摩擦通道,实现零样本的仿真到现实转移。我们通过减法重构设计了asRoBallet,重新利用来自过约束四足机器人的关键组件,并将其整合到新设计的结构框架中,以低成本实现一个稳健的研究平台。我们还开发了一个通用的iOS生态系统,将消费电子产品转变为低延迟接口,使单一操作员能够通过直观的自然动作指挥富有表现力的类人动作。
cs.RO / 5 / 2604.24921

Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System

Libra-VLA:通过异步粗到细双系统实现学习平衡
Wei, Yifei, Zhong, Linqing, Liu, Yi, Lu, Yuxiang, He, Xindong, Yao, Maoqing, Ren, Guanghui
Abstract
Vision-Language-Action (VLA) models are a promising paradigm for generalist robotic manipulation by grounding high-level semantic instructions into executable physical actions. However, prevailing approaches typically adopt a monolithic generation paradigm, directly mapping visual-linguistic features to high-frequency motor commands in a flat, non-hierarchical fashion. This strategy overlooks the inherent hierarchy of robotic manipulation, where complex actions can be naturally modeled in a Hybrid Action Space, decomposing into discrete macro-directional reaching and continuous micro-pose alignment, severely widening the semantic-actuation gap and imposing a heavy representational burden on grounding high-level semantics to continuous actions. To address this, we introduce Libra-VLA, a novel Coarse-to-Fine Dual-System VLA architecture. We explicitly decouple the learning complexity into a coarse-to-fine hierarchy to strike a training equilibrium, while simultaneously leveraging this structural modularity to implement an asynchronous execution strategy. The Semantic Planner predicts discrete action tokens capturing macro-directional intent, while the Action Refiner conditions on coarse intent to generate high-frequency continuous actions for precise alignment. Crucially, our empirical analysis reveals that performance follows an inverted-U curve relative to action decomposition granularity, peaking exactly when the learning difficulty is balanced between the two sub-systems. With the asynchronous design, our approach offers a scalable, robust, and responsive solution for open-world manipulation.
Chinese Translation
视觉-语言-动作(VLA)模型是将高层语义指令转化为可执行物理动作的通用机器人操作的有前景的范式。然而,现有的方法通常采用单一生成范式,以平坦的非层次方式直接将视觉-语言特征映射到高频率的运动指令。这一策略忽视了机器人操作的固有层次性,其中复杂动作可以自然地建模为混合动作空间,分解为离散的宏观方向性到达和连续的微观姿态对齐,这严重扩大了语义与驱动之间的差距,并对将高层语义与连续动作相结合施加了沉重的表征负担。为了解决这个问题,我们提出了Libra-VLA,一种新颖的粗到细双系统VLA架构。我们明确地将学习复杂性解耦为粗到细的层次,以达到训练平衡,同时利用这种结构模块性实现异步执行策略。语义规划器预测捕捉宏观方向意图的离散动作标记,而动作精炼器则在粗略意图的基础上生成高频连续动作以实现精确对齐。重要的是,我们的实证分析表明,性能相对于动作分解粒度遵循倒U曲线,当两个子系统之间的学习难度平衡时,性能达到峰值。通过异步设计,我们的方法为开放世界操作提供了一种可扩展、稳健且响应迅速的解决方案。
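The coarse-to-fine split described in the Libra-VLA abstract can be illustrated with a toy decomposition of an end-effector displacement into one discrete macro-direction token plus a continuous residual. This is only a sketch under stated assumptions: the six-way direction vocabulary, the fixed step size, and all function names are hypothetical and are not taken from the paper, whose actual tokenization is not described in the abstract.

```python
# Hypothetical 6-way macro-direction vocabulary (an assumption, not the paper's).
DIRECTIONS = {
    "+x": (1.0, 0.0, 0.0), "-x": (-1.0, 0.0, 0.0),
    "+y": (0.0, 1.0, 0.0), "-y": (0.0, -1.0, 0.0),
    "+z": (0.0, 0.0, 1.0), "-z": (0.0, 0.0, -1.0),
}


def coarse_token(delta):
    """Pick the direction token whose unit vector best aligns with delta."""
    def alignment(name):
        return sum(d * u for d, u in zip(delta, DIRECTIONS[name]))
    return max(DIRECTIONS, key=alignment)


def decompose(delta, step=0.05):
    """Split delta into (discrete token, continuous residual).

    The residual is what remains after one fixed-size step along the token.
    """
    token = coarse_token(delta)
    unit = DIRECTIONS[token]
    residual = tuple(d - step * u for d, u in zip(delta, unit))
    return token, residual


def recompose(token, residual, step=0.05):
    """Rebuild the full displacement from the coarse token and fine residual."""
    unit = DIRECTIONS[token]
    return tuple(step * u + r for u, r in zip(unit, residual))


if __name__ == "__main__":
    token, residual = decompose((0.06, -0.01, 0.002))
    print(token, residual)
    print(recompose(token, residual))
```

In this toy setting the discrete part carries the dominant direction and the continuous part only has to model a small correction, which is the intuition behind balancing learning difficulty between the two sub-systems.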
cs.RO / 6 / 2604.24934

TEACar: An Open-Source Autonomous Driving Platform

TEACar:一个开源的自动驾驶平台
Zhang, Zhongzheng, Ruyle, Maxwell, Kappes, Andrew, Ruble, Tyler, Shaoul, William, Moreno, Dana, Penn, Jack, Ruchkin, Ivan
Abstract
Intelligent Transportation Systems (ITS) increasingly rely on vision-based perception and learning-based control, necessitating experimental platforms that support realistic hardware-in-the-loop validation. Small-scale platforms for autonomous racing offer a practical path to hardware validation, but often suffer from limited modularity, high integration complexity, or restricted extensibility. This paper presents TEACAR, a 1/14- to 1/16-scale autonomous driving platform designed with modular mechanical architecture, hardware abstraction, and ROS 2-based software. The system adopts a four-layer deck structure that physically decouples sensing, computation, actuation, and power subsystems, improving structural rigidity while simplifying reconfiguration. We constructed and comprehensively evaluated the prototype of TEACAR. Its mechanical stability, structural characteristics, and software performance were quantified based on three CNN-based steering controllers. Inference latency, power consumption, and system operating time were measured to evaluate computational capability and robustness. Our experiments demonstrated that TEACAR offers a scalable, modular, and cost-effective testbed for ITS research, education, and development. Our project repository is available on GitHub.
Chinese Translation
智能交通系统(ITS)越来越依赖基于视觉的感知和基于学习的控制,这需要支持真实硬件在环验证的实验平台。小规模的自动驾驶赛车平台为硬件验证提供了一条实用的路径,但往往面临模块化程度有限、集成复杂性高或可扩展性受限等问题。本文介绍了TEACAR,一个设计有模块化机械结构、硬件抽象和基于ROS 2的软件的1/14至1/16比例的自动驾驶平台。该系统采用四层甲板结构,物理上解耦了传感、计算、执行和电源子系统,提高了结构刚性,同时简化了重配置过程。我们构建并全面评估了TEACAR的原型,其机械稳定性、结构特性和软件性能基于三种基于卷积神经网络(CNN)的转向控制器进行了量化。我们测量了推理延迟、功耗和系统运行时间,以评估计算能力和鲁棒性。我们的实验表明,TEACAR为ITS研究、教育和开发提供了一个可扩展、模块化和具有成本效益的测试平台。我们的项目代码库可在GitHub上获取。
cs.RO / 7 / 2604.25050

DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors

离散RTC:离散扩散策略是自然的异步执行器
Wang, Pengcheng, Hong, Kaiwen, Peng, Chensheng, Driggs-Campbell, Katherine, Tomizuka, Masayoshi, Xu, Chenfeng, Tang, Chen
Abstract
Unlike chatbots, physical AI must act while the world keeps evolving. Therefore, the inter-chunk pauses of synchronous executors are fatal for dynamic tasks regardless of how fast the inference is. Asynchronous execution -- thinking while acting -- is a structural requirement, and real-time chunking (RTC) makes it viable by recasting chunk transitions as inpainting: freezing committed actions and consistently generating the remainder. However, RTC with a flow-matching policy is structurally suboptimal: its inpainting comes from inference-time corrections rather than the base policy, yielding little pre-training benefit and requiring specific fine-tuning, heuristic guidance, and extra computation that inflates the latency. In this work, we observe that discrete diffusion policies, which generate actions by iteratively unmasking, are natural asynchronous executors that resolve all limitations at once: they are fine-tuning free since inpainting is their native operation, while early stopping further provides adaptive guidance and reduces inference cost. We propose DiscreteRTC, which replaces external corrections with native unmasking, and show on dynamic simulated benchmarks and real-world dynamic manipulation tasks that it achieves higher success rates than continuous RTC and other baselines. In summary, DiscreteRTC is simpler to implement with 0 lines of code for async inpainting, faster at inference with only 0.7x computation compared with generating actions from scratch, and better at execution with 50% higher success rate in a real-world dynamic pick task compared with flow-matching-based RTC. More visualizations are on https://outsider86.github.io/DiscreteRTCSite/.
Chinese Translation
与聊天机器人不同,物理人工智能必须在世界不断演变的同时采取行动。因此,同步执行器的块间暂停对于动态任务来说是致命的,无论推理速度多快。因此,异步执行——在行动的同时思考——是一个结构性要求,而实时分块(RTC)通过将块转换重新表述为图像修复,使其成为可能:冻结已承诺的动作并持续生成其余部分。然而,使用流匹配策略的RTC在结构上是次优的:其图像修复来自推理时的修正,而不是基础策略,导致几乎没有预训练的好处、特定的微调、启发式指导,以及额外的计算,增加了延迟。在本研究中,我们观察到离散扩散策略通过逐步解掩蔽生成动作,是自然的异步执行器,能够一次性解决所有限制:由于图像修复是其本质操作,因此不需要微调,而提前停止进一步提供了自适应指导并降低了推理成本。我们提出了离散RTC,它用本地解掩蔽替代外部修正,并在动态模拟基准和现实世界的动态操作任务中展示其成功率高于连续RTC和其他基准。总之,离散RTC在实现上更简单,异步图像修复只需0行代码,在推理时速度更快,与从头生成动作相比仅需0.7倍的计算,并且在现实世界的动态挑选任务中,其成功率比基于流匹配的RTC高出50%。更多可视化内容请访问 https://outsider86.github.io/DiscreteRTCSite/.
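The idea of chunk transitions as inpainting, as summarized in the DiscreteRTC abstract, can be sketched with a toy masked-token loop: committed actions are kept frozen, and the remaining positions are unmasked a few at a time in order of confidence. Everything here is an assumption for illustration: `toy_predict` is a hypothetical stand-in for a learned discrete diffusion policy, and the unmasking schedule is not the authors' implementation.

```python
MASK = None  # sentinel for a position that has not been generated yet


def toy_predict(chunk, pos):
    """Hypothetical stand-in for a discrete diffusion policy.

    Returns (token, confidence) for a masked position, extrapolating from
    the nearest known token to the left; closer neighbors give higher
    confidence.
    """
    left = pos - 1
    while left >= 0 and chunk[left] is MASK:
        left -= 1
    if left < 0:
        return 0, 0.1  # nothing known to the left: low-confidence default
    gap = pos - left
    return (chunk[left] + gap) % 10, 1.0 / gap


def inpaint_chunk(committed, horizon, per_step=2, predict=toy_predict):
    """Fill a chunk of length `horizon` while keeping `committed` frozen."""
    chunk = list(committed) + [MASK] * (horizon - len(committed))
    while any(tok is MASK for tok in chunk):
        masked = [i for i, tok in enumerate(chunk) if tok is MASK]
        proposals = {i: predict(chunk, i) for i in masked}
        # Unmask only the most confident positions in this iteration.
        chosen = sorted(masked, key=lambda i: -proposals[i][1])[:per_step]
        for i in chosen:
            chunk[i] = proposals[i][0]
    return chunk


if __name__ == "__main__":
    # Actions 3 and 4 are already being executed; generate the rest.
    print(inpaint_chunk([3, 4], horizon=6))
```

Because the committed prefix is simply part of the partially masked input, no separate correction step is needed in this sketch, which mirrors the abstract's point that inpainting is the native operation of such policies.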
cs.RO / 8 / 2604.25126

HANDFUL: Sequential Grasp-Conditioned Dexterous Manipulation with Resource Awareness

HANDFUL:具有资源意识的顺序抓取条件灵巧操控
Foong, Ethan, Li, Yunshuang, Jiang, Hao, Sukhatme, Gaurav S., Seita, Daniel
Abstract
Dexterous robot hands offer rich opportunities for multifunctional manipulation, where a robot must execute multiple skills in sequence while maintaining control over previously grasped objects. Most prior work in dexterous manipulation focuses on single-object, single-skill tasks. In contrast, our insight is that many sequential tasks require resource-aware grasps that conserve fingers for future actions. In this paper, we study sequential grasp-conditioned dexterous manipulation, where a robot first grasps an object and then performs a second, distinct manipulation subtask while preserving the initial grasp. We introduce HANDFUL, a learning framework that models finger usage as a limited resource and encourages exploration of resource-aware grasps through finger-level contact rewards. These grasps are subsequently selected for downstream tasks via curriculum-based policy learning. We further propose HANDFUL-Bench, a simulation benchmark that introduces sequential dexterous manipulation tasks across multiple second-subtask objectives, including pushing, pulling, and pressing, under a shared grasp-conditioned setup. Extensive simulation results demonstrate that prioritizing resource-aware grasps improves second-subtask success and robustness compared to a baseline that greedily optimizes the initial grasp before attempting the second subtask. We additionally validate our approach on a real dexterous LEAP hand. Together, this work establishes resource-aware grasp planning as a key principle for multifunctional dexterous manipulation. Supplementary material is available on our website: https://handful-dex.github.io.
Chinese Translation
灵巧机器人手提供了多功能操控的丰富机会,其中机器人必须在保持对先前抓取物体的控制的同时,依次执行多个技能。以往的灵巧操控研究大多集中于单一物体、单一技能任务。相对而言,我们的见解是,许多顺序任务需要具有资源意识的抓取,以便为未来的动作保留手指。在本文中,我们研究了顺序抓取条件的灵巧操控,其中机器人首先抓取一个物体,然后在保持初始抓取的同时执行第二个不同的操控子任务。我们提出了HANDFUL,一个学习框架,将手指使用建模为有限资源,并通过手指级接触奖励鼓励探索具有资源意识的抓取。这些抓取随后通过基于课程的策略学习被选用于下游任务。我们进一步提出了HANDFUL-Bench,一个模拟基准,介绍了多个第二子任务目标下的顺序灵巧操控任务,包括推、拉和按压,所有任务都在共享的抓取条件设置下进行。大量模拟结果表明,与在尝试第二子任务之前贪婪优化初始抓取的基线相比,优先考虑具有资源意识的抓取可以提高第二子任务的成功率和鲁棒性。我们还在真实的灵巧LEAP手上验证了我们的方法。总的来说,这项工作确立了资源意识抓取规划作为多功能灵巧操控的关键原则。补充材料可在我们的网站上获取:https://handful-dex.github.io。
cs.RO / 9 / 2604.25267

Dynamic UGV-UAV Cooperative Path Planning in Uncertain Environments

不确定环境下动态UGV-UAV协同路径规划
Nguyen, Ninh, Akella, Srinivas
Abstract
This paper addresses the Dynamic UGV-UAV Cooperative Path Planning (DUCPP) problem involving one unmanned ground vehicle (UGV) assisted by one or more unmanned aerial vehicles (UAVs) operating on an uncertain road network with potentially impassable edges. DUCPP is particularly relevant for scenarios such as disaster response, emergency supply transport, and rescue operations, where a UGV must reach a specified destination in the presence of partially unknown road conditions. To enable the UGV to travel safely and efficiently to its destination, the UAV(s) dynamically inspect edges in the environment to identify and prune damaged or impassable edges from consideration. We present multiple strategies, including a bidirectional approach, to optimize UGV-UAV cooperation for finding a safe path in an uncertain road network. Furthermore, we explore the impact of using multiple UAVs on reducing the UGV's travel time, and evaluate the associated computation time. The proposed strategies are implemented and evaluated on 100 urban road networks. The results demonstrate that the bidirectional strategy achieves the best performance in most instances, and using multiple UAVs further reduces UGV travel time at the expense of increased computation time. This paper presents a robust framework for DUCPP to achieve efficient UGV-UAV cooperation for path planning and inspection, offering practical solutions for navigation in challenging and uncertain conditions.
Chinese Translation
本文探讨了动态UGV-UAV协同路径规划(DUCPP)问题,该问题涉及一辆无人地面车辆(UGV)在一个具有潜在不可通行边缘的不确定道路网络中,由一辆或多辆无人机(UAV)协助操作。DUCPP在灾害响应、紧急物资运输和救援行动等场景中尤为重要,在这些场景中,UGV必须在部分未知的道路条件下到达指定目的地。为了使UGV安全高效地到达目的地,UAV动态检查环境中的边缘,以识别和剔除损坏或不可通行的边缘。我们提出了多种策略,包括双向方法,以优化UGV-UAV合作,寻找不确定道路网络中的安全路径。此外,我们探讨了使用多架UAV对减少UGV旅行时间的影响,并评估了相关的计算时间。所提出的策略在100个城市道路网络上进行了实施和评估。结果表明,双向策略在大多数情况下表现最佳,使用多架UAV进一步减少了UGV的旅行时间,但增加了计算时间。本文提出了一个稳健的DUCPP框架,以实现UGV-UAV在路径规划和检查中的高效协作,为在具有挑战性和不确定条件下的导航提供了实用解决方案。
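The inspect-and-prune loop described in the DUCPP abstract can be sketched as repeated shortest-path planning over a road graph from which edges found to be impassable are removed. This is a simplified sketch under strong assumptions: the UAV is treated as able to inspect every edge of the current candidate path instantly, UAV travel time and concurrent UGV motion are ignored, and the function names are hypothetical rather than the paper's strategies.

```python
import heapq


def shortest_path(graph, start, goal, blocked=frozenset()):
    """Dijkstra over an undirected weighted graph, skipping blocked edges.

    graph: {node: {neighbor: length}}; blocked: set of frozenset({u, v}).
    Returns (cost, path), or (inf, []) if the goal is unreachable.
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            path = [goal]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return d, path[::-1]
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            if frozenset((u, v)) in blocked:
                continue
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return float("inf"), []


def plan_with_inspection(graph, start, goal, truly_blocked):
    """Replan until the candidate path contains no edge found to be blocked."""
    known_blocked = set()
    while True:
        cost, path = shortest_path(graph, start, goal, known_blocked)
        if not path:
            return cost, path  # no safe route exists
        edges = [frozenset(e) for e in zip(path, path[1:])]
        bad = [e for e in edges if e in truly_blocked]  # simulated inspection
        if not bad:
            return cost, path
        known_blocked.update(bad)


if __name__ == "__main__":
    roads = {
        "A": {"B": 1, "C": 2},
        "B": {"A": 1, "D": 1},
        "C": {"A": 2, "D": 2},
        "D": {"B": 1, "C": 2},
    }
    print(plan_with_inspection(roads, "A", "D", {frozenset(("B", "D"))}))
```

In the example, the shorter route through B is discarded once the B-D edge is found to be blocked, and the planner falls back to the route through C.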
cs.RO / 10 / 2604.25284

Optimal UGV-UAV Cooperative Partitioning and Inspection of Shortest Paths

最优UGV-UAV协作分区与最短路径检查
Nguyen, Ninh, Akella, Srinivas
Abstract
We study cooperative shortest path planning for an unmanned ground vehicle (UGV) assisted by an unmanned aerial vehicle (UAV) in environments with unknown road blockages that are only discovered when a robot reaches the damaged point. This formulation generalizes the original Canadian Traveller Problem (CTP), which assumes a single ground vehicle and that the traversability status of all incident edges is revealed upon arrival at a vertex. We first analyze the case where the start and the goal are connected by $k$ disjoint paths, and prove that the worst-case competitive ratio $\rho$ for a single UGV is $2k-1$. With UAV assistance, and under the simplifying assumption of negligible initial transit and deadheading UAV costs, the ratio improves to $\rho = 2\frac{v_G}{v_A + v_G}k - 1$, where $v_G$ and $v_A$ denote the UGV and UAV speed, respectively. To address general graphs and non-negligible UAV initial transit and deadheading costs, we present an optimal path partitioning strategy that assigns path prefix inspection to the UGV and path suffix inspection to the UAV, and prove the optimality of the UAV inspection strategy on general graphs. We evaluate our algorithm by performing experiments on road networks from the world's 50 most populous cities, with randomized blockages, and show that the proposed method reduces UGV travel times by up to 30%.
Chinese Translation
我们研究了在未知道路阻塞环境中,无人地面车辆(UGV)在无人机(UAV)协助下的协作最短路径规划,这些阻塞仅在机器人到达损坏点时被发现。这一表述推广了原始的加拿大旅行者问题(CTP),该问题假设只有一辆地面车辆,并且所有相关边的可通行状态在到达顶点时才会被揭示。我们首先分析起点和目标点通过$k$条不相交路径连接的情况,并证明单个UGV的最坏竞争比率$\rho$为$2k-1$。在UAV的协助下,假设初始过渡和空驶UAV成本可以忽略不计,竞争比率改善为$\rho = 2\frac{v_G}{v_A + v_G}k - 1$,其中$v_G$和$v_A$分别表示UGV和UAV的速度。为了处理一般图形以及不可忽略的UAV初始过渡和空驶成本,我们提出了一种最优路径分区策略,将路径前缀检查分配给UGV,将路径后缀检查分配给UAV,并证明了UAV检查策略在一般图形上的最优性。我们通过在全球50个最人口稠密城市的道路网络上进行随机阻塞实验来评估我们的算法,并显示所提方法将UGV的旅行时间减少了多达30%。
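The two competitive ratios quoted in the abstract above can be evaluated numerically to see how UAV speed affects the bound. The sketch below simply computes the expressions as stated; the function names are hypothetical, and the abstract does not say for which speed ratios the UAV-assisted expression applies, so the example sticks to values where it stays at or above 1.

```python
def ratio_ugv_only(k):
    """Worst-case competitive ratio for a single UGV with k disjoint paths."""
    return 2 * k - 1


def ratio_with_uav(k, v_g, v_a):
    """UAV-assisted ratio as stated in the abstract: 2 * v_G / (v_A + v_G) * k - 1."""
    return 2 * (v_g / (v_a + v_g)) * k - 1


if __name__ == "__main__":
    k = 3
    print(ratio_ugv_only(k))            # single UGV
    print(ratio_with_uav(k, 1.0, 0.0))  # a stationary UAV gives no improvement
    print(ratio_with_uav(k, 1.0, 1.0))  # UAV as fast as the UGV
```

Setting `v_a = 0` recovers the single-UGV ratio 2k - 1, which is a quick consistency check between the two formulas. With equal speeds the expression reduces to k - 1.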
cs.RO / 11 / 2604.25292

Slot-hopping Enabled Loiter Guidance and Automation for Fixed-wing UAV Corridors

启用插槽跳跃的固定翼无人机走廊滞留引导与自动化
J, Pradeep, Kedarisetty, Siddhardha, Ratnoo, Ashwini
Abstract
This paper addresses the problem of traffic congestion management in fixed-wing unmanned aerial vehicle (UAV) corridors by further developing a recently introduced loiter-lane framework. A semi-cooperative guidance strategy is developed for inserting fixed-wing UAVs into a loiter lane with minimal disruption to the UAVs already operating within it, while enabling a more compact fixed-wing UAV corridor. Building on the concepts of cooperative and non-disruptive loiter-lane insertion, the proposed strategy makes the incoming UAV first attempt, within its speed bounds, to rendezvous with an existing empty loiter slot. If direct insertion is infeasible, a minimal number of loitering UAVs perform coordinated slot hopping to create a suitably positioned empty slot. The feasibility and performance of the method are demonstrated through numerical simulations.
Chinese Translation
本文通过进一步发展最近提出的滞留通道框架,解决了固定翼无人机(UAV)走廊中的交通拥堵管理问题。我们开发了一种半合作引导策略,以最小的干扰将固定翼无人机插入滞留通道,同时实现更紧凑的固定翼无人机走廊。基于合作和非干扰性滞留通道插入的概念,所提出的策略使得即将到达的无人机首先在其速度范围内尝试与现有的空滞留插槽会合。如果直接插入不可行,少量滞留的无人机将进行协调的插槽跳跃,以创建一个适当位置的空插槽。通过数值仿真展示了该方法的可行性和性能。
cs.RO / 12 / 2604.25323

ANCHOR: A Physically Grounded Closed-Loop Framework for Robust Home-Service Mobile Manipulation

ANCHOR:一种基于物理的闭环框架,用于稳健的家居服务移动操控
Jiang, Jinhao, Fang, Shengyu, Zuo, Sibo, Tang, Yujie, Li, Yirui
Abstract
Recent advances in open-vocabulary mobile manipulation have brought robots into real domestic environments. In such settings, reliable long-horizon execution under open-set object references and frequent disturbances becomes essential. However, many failures persist. These are not caused by semantic misunderstanding but by inconsistencies between symbolic plans and the evolving physical world, manifested as three recurring limitations: (i) existing systems often rely on pre-scanned semantic maps that become inconsistent after scene changes and disturbances; (ii) they select navigation endpoints without considering downstream manipulation feasibility, causing the "arrived but inoperable" problem; and (iii) they handle anomalies through undifferentiated global replanning, which often fails to contain local errors. To address this execution inconsistency, we present ANCHOR, a physically grounded closed-loop framework that aligns symbolic reasoning with verifiable physical state during execution. ANCHOR integrates three mechanisms: (i) physically anchored task planning, which binds symbolic predicates to observable geometric anchors and re-validates them after each action; (ii) operability-aware base alignment, which ensures that navigation endpoints satisfy kinematic reachability and local collision feasibility; and (iii) minimum-responsible-layer hierarchical recovery, which localizes failures across perception, base-arm coordination, and execution layers to prevent cascading retries. Across 60 real-robot trials in previously unseen environments, ANCHOR improves task success from 53.3% to 71.7% and achieves a 71.4% recovery rate under perturbations, demonstrating that explicit physical grounding and structured failure containment are critical for robust mobile manipulation. Our project page is available at https://anchor9178.github.io/ANCHOR/ .
Chinese Translation
近年来,开放词汇的移动操控技术的进步使得机器人能够进入真实的家庭环境。在这种环境中,能够在开放集物体引用和频繁干扰下可靠地执行长时间的任务变得至关重要。然而,许多失败依然存在。这些失败并不是由于语义误解造成的,而是由于符号计划与不断变化的物理世界之间的不一致性,表现为三个反复出现的限制:(i)现有系统通常依赖于预先扫描的语义地图,而这些地图在场景变化和干扰后会变得不一致;(ii)它们在选择导航终点时未考虑后续操控的可行性,导致“到达但无法操作”的问题;(iii)它们通过不加区分的全局重新规划来处理异常,这往往无法控制局部错误。为了解决这种执行不一致性,我们提出了ANCHOR,一个基于物理的闭环框架,在执行过程中将符号推理与可验证的物理状态对齐。ANCHOR集成了三种机制:(i)物理锚定的任务规划,将符号谓词绑定到可观察的几何锚点,并在每次动作后重新验证它们;(ii)考虑可操作性的基础对齐,确保导航终点满足运动学可达性和局部碰撞可行性;(iii)最小责任层级恢复,定位感知、基础-臂协调和执行层的失败,以防止级联重试。在60次在未见环境中的真实机器人试验中,ANCHOR将任务成功率从53.3%提高到71.7%,并在干扰下实现了71.4%的恢复率,证明了明确的物理基础和结构化的失败控制对于稳健的移动操控至关重要。我们的项目页面可访问 https://anchor9178.github.io/ANCHOR/ 。
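The "arrived but inoperable" problem from the ANCHOR abstract can be illustrated by filtering candidate navigation endpoints before committing to one. This is a minimal sketch, assuming a planar workspace, a reach band as a crude proxy for kinematic reachability, and a fixed obstacle clearance as a proxy for local collision feasibility. The thresholds and function names are hypothetical and not taken from the paper.

```python
import math


def is_operable(base_xy, target_xy, obstacles,
                min_reach=0.3, max_reach=0.8, clearance=0.25):
    """Check a candidate base position against reach and clearance limits.

    All thresholds are illustrative values in metres (assumptions).
    """
    reach = math.dist(base_xy, target_xy)
    if not (min_reach <= reach <= max_reach):
        return False  # target too close or too far for the arm
    # The base must keep a minimum distance from every known obstacle.
    return all(math.dist(base_xy, obs) >= clearance for obs in obstacles)


def choose_base_pose(candidates, target_xy, obstacles, robot_xy):
    """Return the nearest operable candidate, or None if none qualifies."""
    feasible = [c for c in candidates if is_operable(c, target_xy, obstacles)]
    if not feasible:
        return None
    return min(feasible, key=lambda c: math.dist(robot_xy, c))


if __name__ == "__main__":
    target = (2.0, 0.0)
    obstacles = [(1.5, 0.0)]
    candidates = [(1.6, 0.0), (2.0, 0.6), (0.5, 0.0)]
    print(choose_base_pose(candidates, target, obstacles, robot_xy=(0.0, 0.0)))
```

In the example, the first candidate is rejected for being too close to an obstacle and the third for being out of reach, so the planner picks the second even though it is farther from the robot's current position.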
cs.RO / 13 / 2604.25329

ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution

ProDrive:通过自我环境共同演化实现自主驾驶的主动规划
Fu, Chuyao, Gan, Shengzhe, Ouyang, Zhuoli, Rui, Yuhan, Chi, Xiaowei, Han, Sirui, Wang, Jiankun, Zhang, Hong
Abstract
End-to-end autonomous driving planners typically generate trajectories from current observations alone. However, real-world driving is highly dynamic, and such reactive planning cannot anticipate future scene evolution, often leading to myopic decisions and safety-critical failures. We propose ProDrive, a world-model-based proactive planning framework that enables ego-environment co-evolution for autonomous driving. ProDrive jointly trains a query-centric trajectory planner and a bird's-eye-view (BEV) world model end-to-end: the planner generates diverse candidate trajectories and planning-aware ego tokens, while the world model predicts future scene evolution conditioned on them. By injecting planner features into the world model and evaluating all candidates in parallel, ProDrive preserves end-to-end gradient flow and allows future outcome assessment to directly shape planning. This bidirectional coupling enables proactive planning beyond current-observation-driven decision-making. Experiments on NAVSIM v1 show that ProDrive outperforms strong baselines in both safety and planning efficiency, while ablations validate the effectiveness of the proposed ego-environment coupling design.
Chinese Translation
端到端的自主驾驶规划器通常仅根据当前观察生成轨迹。然而,现实世界的驾驶环境高度动态,这种反应式规划无法预见未来场景的演变,常常导致短视决策和安全关键的失败。我们提出了ProDrive,这是一种基于世界模型的主动规划框架,能够实现自主驾驶中的自我环境共同演化。ProDrive联合训练一个以查询为中心的轨迹规划器和一个鸟瞰视图(BEV)世界模型,采用端到端的方式:规划器生成多样化的候选轨迹和规划感知的自我标记,而世界模型则基于这些轨迹预测未来场景的演变。通过将规划器特征注入世界模型并并行评估所有候选,ProDrive保持了端到端的梯度流,并允许未来结果评估直接影响规划。这种双向耦合使得主动规划超越了基于当前观察的决策制定。在NAVSIM v1上的实验表明,ProDrive在安全性和规划效率方面均优于强基线,同时消融实验验证了所提出的自我环境耦合设计的有效性。
cs.RO / 14 / 2604.25404

Robust Graph Matching through Semantic Relationship Generation for SLAM

通过语义关系生成实现鲁棒图匹配的SLAM
Perez-Saura, David, Millan-Romera, Jose Andres, Fernandez-Cortizas, Miguel, Voos, Holger, Campoy, Pascual, Sanchez-Lopez, Jose Luis
Abstract
Graph-based representations such as Scene Graphs enable localization in structured indoor environments by matching a locally observed graph, constructed from sensor data, to a prior map. This process is particularly challenging in environments with repetitive or symmetric layouts, where structural cues alone are often insufficient to resolve ambiguities. We propose a semantic-enhanced graph matching approach that explicitly models relations between detected objects and structural elements, such as rooms and wall planes. Objects are detected from RGB-D data and integrated into the graph, and their relations to structural elements are exploited to filter candidate correspondences prior to geometric verification, significantly reducing ambiguity and search complexity. The proposed method is integrated within the iS-Graphs framework and evaluated in synthetic and simulated environments. Results show that semantic relations significantly reduce the number of candidate matches, improve computational efficiency, and enable faster convergence, particularly in symmetric scenarios where purely geometric approaches fail.
Chinese Translation
基于图的表示(如场景图)通过将从传感器数据构建的局部观察图与先前地图进行匹配,能够在结构化的室内环境中实现定位。在具有重复或对称布局的环境中,这一过程尤其具有挑战性,因为仅依靠结构线索往往不足以解决歧义。我们提出了一种语义增强的图匹配方法,该方法明确建模检测到的物体与结构元素(如房间和墙面)之间的关系。物体从RGB-D数据中检测并整合到图中,并利用它们与结构元素的关系在几何验证之前过滤候选对应关系,从而显著减少歧义和搜索复杂性。所提出的方法集成在iS-Graphs框架中,并在合成和模拟环境中进行了评估。结果表明,语义关系显著减少了候选匹配的数量,提高了计算效率,并加快了收敛速度,尤其是在纯几何方法失效的对称场景中。
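The semantic pre-filter described in the graph matching abstract above can be sketched as a label check that discards candidate room correspondences before any geometric verification runs. This is a toy illustration under stated assumptions: rooms are reduced to bags of object labels, the compatibility rule is a hypothetical one, and none of the names come from the iS-Graphs framework.

```python
from collections import Counter


def semantically_compatible(observed_objects, map_objects):
    """Assumed rule: the map room must contain every observed label
    at least as many times as it was observed."""
    need = Counter(observed_objects)
    have = Counter(map_objects)
    return all(have[label] >= count for label, count in need.items())


def filter_candidates(observed_room, map_rooms):
    """Return IDs of prior-map rooms that survive the semantic filter."""
    return [room_id for room_id, objects in map_rooms.items()
            if semantically_compatible(observed_room, objects)]


if __name__ == "__main__":
    # Three geometrically identical rooms that differ only in their contents.
    prior_map = {
        "room_a": ["desk", "chair"],
        "room_b": ["bed", "lamp"],
        "room_c": ["desk", "chair", "chair"],
    }
    print(filter_candidates(["chair", "chair"], prior_map))
```

When layouts are symmetric, every room is a geometric candidate, so shrinking the candidate list with object labels first reduces the number of expensive geometric checks.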
cs.RO / 15 / 2604.25459

GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning

GS-Playground:一种高吞吐量的逼真模拟器,用于视觉驱动的机器人学习
Jia, Yufei, Zhang, Heng, Zhang, Ziheng, Wu, Junzhe, Yu, Mingrui, Wang, Zifan, Jiang, Dixuan, Li, Zheng, Cao, Chenyu, Yu, Zhuoyuan, Yang, Xun, Ge, Haizhou, Zhang, Yuchi, Zhang, Jiayuan, Huang, Zhenbiao, Liu, Tianle, Chen, Shenyu, Wang, Jiacheng, Xie, Bin, Yao, Xuran, Deng, Xiwa, Wang, Guangyu, Zhang, Jinzhi, Hao, Lei, Chen, Zhixing, Chen, Yuxiang, Wang, Anqi, Tian, Hongyun, Yan, Yiyi, Cao, Zhanxiang, Jiang, Yizhou, Shao, Hanyang, Li, Yue, Shi, Lu, Chen, Bokui, Sui, Wei, Cui, Hanqing, Qin, Yusen, Huang, Ruqi, Han, Lei, Wang, Tiancai, Zhou, Guyue
Abstract
Embodied AI research is undergoing a shift toward vision-centric perceptual paradigms. While massively parallel simulators have catalyzed breakthroughs in proprioception-based locomotion, their potential remains largely untapped for vision-informed tasks due to the prohibitive computational overhead of large-scale photorealistic rendering. Furthermore, the creation of simulation-ready 3D assets heavily relies on labor-intensive manual modeling, while the significant sim-to-real physical gap hinders the transfer of contact-rich manipulation policies. To address these bottlenecks, we propose GS-Playground, a multi-modal simulation framework designed to accelerate end-to-end perceptual learning. We develop a novel high-performance parallel physics engine, specifically designed to integrate with a batch 3D Gaussian Splatting (3DGS) rendering pipeline to ensure high-fidelity synchronization. Our system achieves a breakthrough throughput of 10^4 FPS at 640x480 resolution, significantly lowering the barrier for large-scale visual RL. Additionally, we introduce an automated Real2Sim workflow that reconstructs photorealistic, physically consistent, and memory-efficient environments, streamlining the generation of complex simulation-ready scenes. Extensive experiments on locomotion, navigation, and manipulation demonstrate that GS-Playground effectively bridges the perceptual and physical gaps across diverse embodied tasks. Project homepage: https://gsplayground.github.io.
Chinese Translation
具身人工智能研究正朝着以视觉为中心的感知范式转变。尽管大规模并行模拟器在基于本体感知的运动中催生了突破性进展,但由于大规模逼真渲染的计算开销过于庞大,其在视觉驱动任务中的潜力仍未得到充分利用。此外,模拟就绪的3D资产的创建在很大程度上依赖于劳动密集型的手动建模,而显著的模拟与现实之间的物理差距则阻碍了接触丰富的操作策略的转移。为了解决这些瓶颈,我们提出了GS-Playground,一个旨在加速端到端感知学习的多模态模拟框架。我们开发了一种新型高性能并行物理引擎,专门设计用于与批量3D高斯点云(3D Gaussian Splatting, 3DGS)渲染管道集成,以确保高保真同步。我们的系统在640x480分辨率下实现了10^4 FPS的突破性吞吐量,显著降低了大规模视觉强化学习的门槛。此外,我们还引入了一种自动化的Real2Sim工作流,重建逼真、物理一致且内存高效的环境,简化了复杂模拟就绪场景的生成。在运动、导航和操作方面的广泛实验表明,GS-Playground有效地弥合了多样化具身任务中的感知与物理差距。项目主页:https://gsplayground.github.io。
cs.RO / 16 / 2604.25554

Egocentric Tactile and Proximity Sensors as Observation Priors for Humanoid Collision Avoidance

以自我为中心的触觉和接近传感器作为类人碰撞避免的观察先验
Kohlbrenner, Carson, Pudasaini, Niraj, Xie, William, Sivagnanadasan, Naren, Correll, Nikolaus, Roncone, Alessandro
Abstract
Collision-free motion is often aided by tactile and proximity sensors distributed over the robot's body because, unlike external cameras, they are resistant to occlusion. However, it remains unclear how to shape sensor properties such as sensing coverage, type, and range to enable avoidant behavior. In this work, we present a reinforcement learning framework for whole-body collision avoidance on a humanoid H1-2 robot and use it to characterize how sensor properties shape learned avoidance behavior. Using dodgeball as a benchmark task, we ablate the properties of sensors distributed across the upper body of the robot and find that (1) raw proximity measurements can substitute for explicit object localization provided the sensing range is sufficient, and (2) sparse non-directional proximity signals outpace dense directional alternatives in sample efficiency.
Chinese Translation
由于触觉和接近传感器在机器人的身体上分布,能够抵抗遮挡,相较于外部摄像头,它们通常有助于实现无碰撞运动。然而,如何塑造传感器的属性,如感知覆盖范围、类型和范围,以实现避免行为仍不清楚。在本研究中,我们提出了一种针对类人H1-2机器人的全身碰撞避免的强化学习框架,并利用该框架来表征传感器属性如何影响学习到的避免行为。以躲避球作为基准任务,我们剖析了分布在机器人上半身的传感器属性,发现只要感知范围足够,原始接近测量可以替代明确的物体定位,并且稀疏的非定向接近信号在样本效率上优于密集的定向替代品。
cs.RO / 17 / 2604.25563

Improving Sensing Coverage and Compliance of 3D-Printed Artificial Skins Through Multi-Modal Sensing and Soft Materials

通过多模态传感和软材料改善3D打印人工皮肤的感知覆盖和柔顺性
Kohlbrenner, Carson, Escobedo, Caleb, Ray, Sayak, Dickhans, Alexander, Soukhovei, Anna, Jackoski, Nickolaus, Antieau, Lyle, Roncone, Alessandro
Abstract
3D-printed artificial skins are a scalable approach to whole-body tactile and proximity coverage, but prior implementations have been limited to unimodal sensing and rigid materials. To improve the practical usability of 3D-printed artificial skins, we present a hybrid time-of-flight (ToF) and self-capacitance (SC) sensing skin that demonstrates multi-modal sensing integration, soft compliant coverings for impact absorption and pressure sensing, and a streamlined electrical interface between printed conductive traces and external electronics. We show that combining ToF and SC modalities enables contact detection, scene reconstruction, and pressure-correlated tactile responses with the compliant covering by deploying six artificial skin units with 40 sensing elements over an FR3 robot arm.
Chinese Translation
3D打印人工皮肤是一种可扩展的全身触觉和接近感知覆盖方法,但之前的实现仅限于单模态传感和刚性材料。为了提高3D打印人工皮肤的实际可用性,我们提出了一种混合飞行时间(ToF)和自电容(SC)的传感皮肤,展示了多模态传感集成、用于冲击吸收和压力传感的柔顺软覆盖层,以及印刷导电走线与外部电子设备之间的简化电气接口。我们在FR3机器人手臂上部署了带有40个传感元件的六个人工皮肤单元,结果表明,结合ToF和SC模态并配合柔顺覆盖层,可以实现接触检测、场景重建以及与压力相关的触觉响应。
cs.RO / 18 / 2604.25661

SlicerRoboTMS: An Open-Source 3D Slicer Extension for Robot-Assisted Transcranial Magnetic Stimulation

SlicerRoboTMS:一种用于机器人辅助经颅磁刺激的开源3D Slicer扩展
Bai, Wenzhi, Guo, Yituo, Basu, Bhaskar, Weightman, Andrew, Li, Zhenhong
Abstract
Robot-assisted Transcranial Magnetic Stimulation (Robo-TMS) is an image-guided robotic intervention that enhances the accuracy and reproducibility of conventional Transcranial Magnetic Stimulation (TMS), a widely used non-invasive brain stimulation procedure in clinical treatment and neuroscience research. Despite its potential, the development of Robo-TMS remains challenging due to the need for multidisciplinary expertise spanning medical imaging, computer vision, and robotics. This paper presents SlicerRoboTMS, an open-source 3D Slicer extension that provides a unified interaction infrastructure for Robo-TMS research. By leveraging 3D Slicer's medical image computing and visualisation capabilities, the extension supports Magnetic Resonance Imaging (MRI)-based neuronavigation and interfaces with robotic systems through standardised communication protocols and configurable system descriptions. An example integration is presented to demonstrate how SlicerRoboTMS can be incorporated into a representative Robo-TMS workflow. Designed to support diverse hardware configurations and rapid prototyping, SlicerRoboTMS lowers the barrier to entry and facilitates reproducible and extensible research in Robo-TMS. The extension is available at https://github.com/OpenRoboTMS/SlicerRoboTMS.
Chinese Translation
机器人辅助经颅磁刺激(Robo-TMS)是一种图像引导的机器人干预方法,增强了传统经颅磁刺激(TMS)的准确性和可重复性,后者是一种在临床治疗和神经科学研究中广泛使用的非侵入性脑刺激程序。尽管其潜力巨大,Robo-TMS的发展仍面临挑战,因为需要跨越医学成像、计算机视觉和机器人技术的多学科专业知识。本文介绍了SlicerRoboTMS,这是一种开源的3D Slicer扩展,为Robo-TMS研究提供了统一的交互基础设施。通过利用3D Slicer的医学图像计算和可视化能力,该扩展支持基于磁共振成像(MRI)的神经导航,并通过标准化的通信协议和可配置的系统描述与机器人系统接口。本文展示了一个集成示例,以演示如何将SlicerRoboTMS纳入一个典型的Robo-TMS工作流程。SlicerRoboTMS旨在支持多样化的硬件配置和快速原型开发,降低了入门门槛,促进了Robo-TMS领域可重复和可扩展的研究。该扩展可在https://github.com/OpenRoboTMS/SlicerRoboTMS获取。
cs.RO / 19 / 2604.25670

GEGLU-Transformer for IMU-to-EMG Estimation with Few-Shot Adaptation

基于GEGLU-Transformer的IMU到EMG估计的少样本适应方法
Mihailovic, Miroljub, Tonin, Luca, Tortora, Stefano, Menegatti, Emanuele
Abstract
Reliable estimation of neuromuscular activation is a key enabler for adaptive and personalized control in wearable robotics. However, surface electromyography (EMG) remains difficult to deploy robustly outside laboratory settings due to electrode sensitivity, signal non-stationarity, and strong subject dependence. In this work, we propose an adaptive IMU-to-EMG learning framework that reconstructs continuous muscle activation envelopes from wearable inertial measurements across heterogeneous movement conditions. The approach combines a Transformer encoder with Gaussian Error Gated Linear Units (GEGLU-Transformer) to enhance cross-subject generalization and enable rapid subject-specific personalization. Under a strict leave-one-subject-out (LOSO) protocol on a multi-condition lower-limb biomechanics dataset, the proposed architecture achieves r = 0.706 +/- 0.139 and R^2 = 0.474 +/- 0.208 without subject-specific adaptation. With only 0.5% adaptation data, performance increases to r = 0.761 +/- 0.030 and R^2 = 0.559 +/- 0.047, demonstrating rapid adaptation and early performance saturation. These results support attention-based architectures combined with lightweight adaptation as a practical and scalable alternative to direct EMG sensing for real-world wearable robotic applications.
Chinese Translation
可靠的神经肌肉激活估计是可穿戴机器人自适应和个性化控制的关键。然而,由于电极灵敏度、信号非平稳性和强烈的个体依赖性,表面肌电图(EMG)在实验室环境之外的稳健部署仍然困难。在本研究中,我们提出了一种自适应的IMU到EMG学习框架,该框架能够在异质运动条件下从可穿戴惯性测量中重建连续的肌肉激活包络。该方法结合了Transformer编码器和高斯误差门控线性单元(GEGLU-Transformer),以增强跨个体的泛化能力并实现快速的个体特异性个性化。在一个严格的留一被试法(LOSO)协议下,针对多条件下肢生物力学数据集,所提架构在没有个体特异性适应的情况下达到了r = 0.706 +/- 0.139和R^2 = 0.474 +/- 0.208。仅使用0.5%的适应数据,性能提升至r = 0.761 +/- 0.030和R^2 = 0.559 +/- 0.047,展示了快速适应和早期性能饱和。这些结果支持基于注意力的架构与轻量级适应相结合,作为直接EMG传感在现实世界可穿戴机器人应用中的一种实用且可扩展的替代方案。
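The gating unit named in the abstract above can be illustrated in a few lines. What follows is a minimal pure-Python sketch of a GEGLU layer in its commonly used form, GEGLU(x) = GELU(xW + b) * (xV + c) with element-wise multiplication. The weights, the toy dimensions, and the function names `gelu`, `linear`, and `geglu` are illustrative assumptions and are not taken from the authors' implementation.

```python
import math

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def linear(x, weights, bias):
    """Affine map: x has d_in floats, weights has d_in rows of d_out floats."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]

def geglu(x, w_gate, b_gate, w_value, b_value):
    """GEGLU(x) = GELU(x W + b) * (x V + c), multiplied element-wise."""
    gate = [gelu(z) for z in linear(x, w_gate, b_gate)]
    value = linear(x, w_value, b_value)
    return [g * v for g, v in zip(gate, value)]

# Hypothetical toy input: one feature vector with 3 channels, 2 hidden units.
x = [0.5, -1.0, 2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w_value = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
zeros = [0.0, 0.0]
out = geglu(x, w_gate, zeros, w_value, zeros)
```

In a Transformer encoder this unit would typically replace the first linear layer and activation of the feed-forward block; whether the paper places it there is not stated in the abstract.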
cs.RO / 20 / 2604.25691

Learning-Based Dynamics Modeling and Robust Control for Tendon-Driven Continuum Robots

基于学习的腱驱动连续机器人动力学建模与鲁棒控制
Zou, Ziqing, Qiu, Ke, Wang, Fei, Lu, Haojian, Xiong, Rong, Wang, Yue
Abstract
Tendon-Driven Continuum Robots (TDCRs) pose significant modeling and control challenges due to complex nonlinearities, such as frictional hysteresis and transmission compliance. This paper proposes a differentiable learning framework that integrates high-fidelity dynamics modeling with robust neural control. We develop a GRU-based dynamics model featuring bidirectional multi-channel connectivity and residual prediction to effectively suppress compounding errors during long-horizon auto-regressive prediction. By treating this model as a gradient bridge, an end-to-end neural control policy is optimized through backpropagation, allowing it to implicitly internalize compensation for intricate nonlinearities. Experimental validation on a physical three-section TDCR demonstrates that our framework achieves accurate tracking and superior robustness against unseen payloads, outperforming Jacobian-based methods by eliminating self-excited oscillations.
Chinese Translation
腱驱动连续机器人(Tendon-Driven Continuum Robots, TDCRs)由于复杂的非线性特性,如摩擦滞后和传动柔性,面临显著的建模和控制挑战。本文提出了一种可微分的学习框架,将高保真动力学建模与鲁棒神经控制相结合。我们开发了一种基于门控递归单元(GRU)的动力学模型,具有双向多通道连接和残差预测,以有效抑制长时间自回归预测中的复合误差。通过将该模型视为梯度桥,采用反向传播优化端到端神经控制策略,使其能够隐式内化对复杂非线性的补偿。在物理三段TDCR上的实验验证表明,我们的框架实现了准确的跟踪和对未见负载的优越鲁棒性,优于基于雅可比的方法,消除了自激振荡。
cs.RO / 21 / 2604.25698

Reference-Augmented Learning for Precise Tracking Policy of Tendon-Driven Continuum Robots

参考增强学习用于腱驱动连续机器人精确跟踪策略
Zou, Ziqing, Qiu, Ke, Lu, Haojian, Xiong, Rong, Wang, Yue
Abstract
Tendon-Driven Continuum Robots (TDCRs) pose significant control challenges due to their highly nonlinear, path-dependent dynamics and non-Markovian characteristics. Traditional Jacobian-based controllers often struggle with hysteresis-induced oscillations, while conventional learning-based approaches suffer from poor generalization to out-of-distribution trajectories. This paper proposes a reference-augmented offline learning framework for precise 6-DOF tracking control of TDCRs. By leveraging a differentiable RNN-based dynamics surrogate as a gradient bridge, we optimize a control policy through an augmented reference distribution. This multi-scale augmentation scheme incorporates stochastic bias, harmonic perturbations, and random walks, forcing the policy to internalize diverse tracking error recovery mechanisms without additional hardware interaction. Experimental results on a three-section TDCR platform demonstrate that the proposed policy achieves a 50.9% reduction in average position error compared to non-augmented baselines and significantly outperforms Jacobian-based methods in both precision and stability across various speeds.
Chinese Translation
腱驱动连续机器人(Tendon-Driven Continuum Robots, TDCRs)由于其高度非线性、路径依赖的动态特性和非马尔可夫特性,面临显著的控制挑战。传统的基于雅可比矩阵的控制器常常在滞后引起的振荡中挣扎,而传统的基于学习的方法在处理分布外轨迹时表现出较差的泛化能力。本文提出了一种参考增强的离线学习框架,用于TDCRs的精确6自由度跟踪控制。通过利用可微分的基于递归神经网络(RNN)的动态代理作为梯度桥梁,我们通过增强的参考分布优化控制策略。这种多尺度增强方案结合了随机偏差、谐波扰动和随机游走,迫使策略在没有额外硬件交互的情况下内化多样的跟踪误差恢复机制。在一个三节TDCR平台上的实验结果表明,所提出的策略在平均位置误差上比非增强基线减少了50.9%,并在不同速度下显著优于基于雅可比方法的精度和稳定性。
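The three perturbation types listed in the abstract above (stochastic bias, harmonic perturbations, random walks) can be sketched on a one-dimensional reference trajectory. This is a hedged illustration only: the function name `augment_reference`, every default magnitude, and the 50 Hz time step are assumptions, and the paper's actual 6-DOF scheme may differ.

```python
import math
import random

def augment_reference(reference, bias_std=0.01, harmonic_amp=0.005,
                      harmonic_freq=0.5, walk_std=0.001, dt=0.02, rng=None):
    """Perturb a 1-D reference trajectory (list of floats, one per time step).

    Adds three terms at different time scales:
      - a constant offset drawn once per trajectory (stochastic bias),
      - a sinusoid with random phase (harmonic perturbation),
      - a cumulative sum of small Gaussian steps (random walk).
    All magnitudes are hypothetical defaults chosen for illustration.
    """
    rng = rng or random.Random()
    bias = rng.gauss(0.0, bias_std)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    walk = 0.0
    augmented = []
    for k, r in enumerate(reference):
        harmonic = harmonic_amp * math.sin(
            2.0 * math.pi * harmonic_freq * k * dt + phase)
        walk += rng.gauss(0.0, walk_std)
        augmented.append(r + bias + harmonic + walk)
    return augmented

# Hypothetical nominal reference: a ramp along one position axis (metres).
nominal = [0.001 * k for k in range(200)]
perturbed = augment_reference(nominal, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps each augmented trajectory reproducible, which is convenient when the augmented references are generated offline.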
cs.RO / 22 / 2604.25766

Sensitivity-Based Tube NMPC for Cooperative Aerial Structures Under Parametric Uncertainty

参数不确定性下协作空中结构的基于灵敏度的管式非线性模型预测控制(Tube NMPC)
Silano, Giuseppe, Sablé, Quentin, Tognon, Marco, Iannelli, Luigi, Franchi, Antonio
Abstract
This paper presents a sensitivity-based tube Nonlinear Model Predictive Control (NMPC) framework for cooperative aerial chains under bounded parametric uncertainty. We consider a planar two-vehicle chain connected by rigid links, modeled with input-rate actuation to enforce slew-rate and magnitude limits on thrust and torque. Robustness to uncertainty in link mass, length, and inertia is achieved by propagating first-order parametric state sensitivities along the horizon and using them to compute online constraint-tightening margins. We robustify an inter-link separation constraint, implemented via a smooth cosine embedding, and thrust-magnitude bounds. The method is implemented in MATLAB and evaluated with boundary-hugging maneuvers and Monte-Carlo uncertainty sampling. Results show improved constraint margins under uncertainty with tracking performance comparable to nominal NMPC.
Chinese Translation
本文提出了一种基于灵敏度的管式非线性模型预测控制(Tube NMPC)框架,适用于有界参数不确定性下的协作空中链。我们考虑一个由刚性连杆连接的平面双飞行器链,采用输入速率驱动建模,以对推力和扭矩施加变化率和幅值限制。通过沿预测时域传播一阶参数状态灵敏度,并利用这些灵敏度在线计算约束收紧裕度,从而实现对连杆质量、长度和惯量不确定性的鲁棒性。我们对通过平滑余弦嵌入实现的连杆间分离约束以及推力幅值界限进行了鲁棒化处理。该方法在MATLAB中实现,并通过贴近边界的机动和蒙特卡洛不确定性采样进行了评估。结果表明,在不确定性下约束裕度有所改善,同时跟踪性能与标称NMPC相当。
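The idea of propagating first-order parametric sensitivities and turning them into constraint-tightening margins can be shown on a toy scalar system. The sketch below assumes the dynamics x[k+1] = p * x[k] + u[k] with one uncertain parameter p; this model, the numbers, and the function names are illustrative assumptions and stand in for the paper's aerial-chain model, which was implemented in MATLAB rather than Python.

```python
def rollout_with_sensitivity(x0, controls, p):
    """Simulate x[k+1] = p * x[k] + u[k] together with s[k] = d x[k] / d p.

    The sensitivity obeys s[k+1] = (df/dx) * s[k] + df/dp = p * s[k] + x[k],
    starting from s[0] = 0 because the initial state does not depend on p.
    """
    xs, ss = [x0], [0.0]
    for u in controls:
        x, s = xs[-1], ss[-1]
        xs.append(p * x + u)
        ss.append(p * s + x)
    return xs, ss

def tightened_upper_bounds(x_max, sensitivities, delta_p):
    """First-order margin: shrink the bound x <= x_max by |s[k]| * delta_p."""
    return [x_max - abs(s) * delta_p for s in sensitivities]

# Hypothetical numbers: nominal parameter 0.9, known to within +/- 0.05.
p_nominal, delta_p = 0.9, 0.05
controls = [0.2, 0.2, 0.2, 0.2]
xs, ss = rollout_with_sensitivity(1.0, controls, p_nominal)
bounds = tightened_upper_bounds(2.0, ss, delta_p)
```

The margin grows along the horizon because the sensitivity accumulates, so the nominal trajectory is held further from the constraint at later steps.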
cs.RO / 23 / 2604.25788

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

KinDER:机器人学习与规划的物理推理基准
Huang, Yixuan, Li, Bowen, Saxena, Vaibhav, Liang, Yichao, Mishra, Utkarsh Aashu, Ji, Liang, Zha, Lihan, Wu, Jimmy, Kumar, Nishanth, Scherer, Sebastian, Xu, Danfei, Silver, Tom
Abstract
Robotic systems that interact with the physical world must reason about kinematic and dynamic constraints imposed by their own embodiment, their environment, and the task at hand. We introduce KinDER, a benchmark for Kinematic and Dynamic Embodied Reasoning that targets physical reasoning challenges arising in robot learning and planning. KinDER comprises 25 procedurally generated environments, a Gymnasium-compatible Python library with parameterized skills and demonstrations, and a standardized evaluation suite with 13 implemented baselines spanning task and motion planning, imitation learning, reinforcement learning, and foundation-model-based approaches. The environments are designed to isolate five core physical reasoning challenges: basic spatial relations, nonprehensile multi-object manipulation, tool use, combinatorial geometric constraints, and dynamic constraints, disentangled from perception, language understanding, and application-specific complexity. Empirical evaluation shows that existing methods struggle to solve many of the environments, indicating substantial gaps in current approaches to physical reasoning. We additionally include real-to-sim-to-real experiments on a mobile manipulator to assess the correspondence between simulation and real-world physical interaction. KinDER is fully open-sourced and intended to enable systematic comparison across diverse paradigms for advancing physical reasoning in robotics. Website and code: https://prpl-group.com/kinder-site/
Chinese Translation
与物理世界互动的机器人系统必须考虑由其自身形态、环境和当前任务所施加的运动学和动力学约束。我们介绍了KinDER,这是一个针对机器人学习和规划中出现的物理推理挑战的运动学和动态体现推理基准。KinDER包含25个程序生成的环境,一个与Gymnasium兼容的Python库,提供参数化的技能和示范,以及一个标准化的评估套件,包含13个实现的基线,涵盖任务和运动规划、模仿学习、强化学习以及基于基础模型的方法。这些环境旨在隔离五个核心的物理推理挑战:基本空间关系、非抓取多物体操控、工具使用、组合几何约束和动态约束,这些挑战与感知、语言理解和特定应用复杂性相分离。实证评估表明,现有方法在解决许多环境时面临困难,显示出当前物理推理方法的显著不足。此外,我们还包括了在移动操控器上的真实-仿真-真实实验,以评估仿真与现实世界物理交互之间的对应关系。KinDER完全开源,旨在促进不同范式之间的系统比较,以推动机器人领域的物理推理进展。网站和代码:https://prpl-group.com/kinder-site/
cs.RO / 24 / 2604.25859

Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

特权前瞻蒸馏:世界动作模型的零成本未来修正
Fang, Pengcheng, Chen, Hongli, Cai, Xiaohao
Abstract
World action models jointly predict future video and action during training, raising an open question about what role the future-prediction branch actually plays. A recent finding shows that this branch can be removed at inference with little to no loss on common manipulation benchmarks, suggesting that future information may act merely as a regularizer on the shared visual backbone. We propose instead that joint training induces an action-conditioned correction that privileged future observations impose on action denoising, and that current-only policies capture this correction only partially. Making the account precise, we formulate privileged foresight as a residual in the action-denoising direction -- the difference between what a model predicts given the true future and what it predicts given only the current frame -- and introduce Privileged Foresight Distillation (PFD), which transfers this residual from a training-time teacher into a small adapter on a current-only student. The teacher and student share the same backbone and differ only in the attention mask over video tokens; future video is never generated at inference. Controlled experiments verify that this gain reflects a genuine future-conditioned correction rather than a side effect of capacity or regularization. Empirically, PFD achieves consistent improvements on LIBERO and RoboTwin manipulation benchmarks while preserving the current-only inference interface at negligible added latency. This view reframes the role of future information in world action models: not as a target to predict, nor as a regularizer to absorb, but as a compressible correction to be distilled.
Chinese Translation
世界动作模型在训练过程中共同预测未来视频和动作,这引发了一个关于未来预测分支实际作用的开放性问题。最近的研究发现,在推理阶段可以去除这一分支,而对常见的操作基准几乎没有损失,这表明未来信息可能仅仅作为共享视觉骨干的正则化器。我们提出,联合训练引入了一种由特权未来观察对动作去噪施加的动作条件修正,而仅基于当前信息的策略只能部分捕捉到这一修正。为了明确这一观点,我们将特权前瞻定义为动作去噪方向上的残差(即模型在给定真实未来时的预测与仅给定当前帧时的预测之间的差异),并引入了特权前瞻蒸馏(Privileged Foresight Distillation, PFD),它将这一残差从训练阶段的教师模型转移到一个仅基于当前信息的学生模型的小适配器上。教师和学生共享相同的骨干网络,仅在视频标记的注意力掩码上有所不同;在推理阶段从未生成未来视频。控制实验验证了这一增益反映了真正的未来条件修正,而不是容量或正则化的副作用。从经验上看,PFD在LIBERO和RoboTwin操作基准上实现了一致的改进,同时在可忽略的延迟下保持了仅基于当前信息的推理接口。这一观点重新框定了未来信息在世界动作模型中的作用:既不是预测的目标,也不是吸收的正则化器,而是可压缩的修正,需进行蒸馏。
cs.RO / 25 / 2604.25897

Variational Neural Belief Parameterizations for Robust Dexterous Grasping under Multimodal Uncertainty

针对多模态不确定性下稳健灵巧抓取的变分神经信念参数化
Enwerem, Clinton, Kalyanaraman, Shreya, Baras, John S., Belta, Calin
Abstract
Contact variability, sensing uncertainty, and external disturbances make grasp execution stochastic. Expected-quality objectives ignore tail outcomes and often select grasps that fail under adverse contact realizations. Risk-sensitive POMDPs address this failure mode, but many use particle-filter beliefs that scale poorly, obstruct gradient-based optimization, and estimate Conditional Value-at-Risk (CVaR) with high-variance approximations. We instead formulate grasp acquisition as variational inference over latent contact parameters and object pose, representing the belief with a differentiable Gaussian mixture. We use Gumbel-Softmax component selection and location-scale reparameterization to express samples as smooth functions of the belief parameters, enabling pathwise gradients through a differentiable CVaR surrogate for direct optimization of tail robustness. In simulation, our variational neural belief improves robust grasp success under contact-parameter uncertainty and exogenous force perturbations while reducing planning time by roughly an order of magnitude relative to particle-filter model-predictive control. On a serial-chain robot arm with a multifingered hand, we validate grasp-and-lift success under object-pose uncertainty against a Gaussian baseline. Both methods succeed on the tested perturbations, but our controller terminates in fewer steps and less wall-clock time while achieving a higher tactile grasp-quality proxy. Our learned belief also calibrates risk more accurately, keeping mean absolute calibration error below 0.14 across tested simulation regimes, compared with 0.58 for a Cross-Entropy Method planner.
Chinese Translation
接触变异性、传感不确定性和外部干扰使得抓取执行具有随机性。期望质量目标忽视尾部结果,往往选择在不利接触实现下失败的抓取。风险敏感的部分可观测马尔可夫决策过程(POMDP)解决了这种失败模式,但许多方法使用的粒子滤波信念在规模上表现不佳,阻碍了基于梯度的优化,并且以高方差的近似值估计条件风险价值(CVaR)。我们将抓取获取形式化为对潜在接触参数和物体姿态的变分推断,用可微分的高斯混合体表示信念。我们使用Gumbel-Softmax组件选择和位置-尺度重参数化,将样本表示为信念参数的平滑函数,从而实现通过可微分的CVaR替代品进行路径梯度的直接优化,以增强尾部稳健性。在仿真中,我们的变分神经信念在接触参数不确定性和外部力扰动下提高了稳健抓取的成功率,同时将规划时间相对于粒子滤波模型预测控制减少了大约一个数量级。在一个带有多指手的串联机器人手臂上,我们验证了在物体姿态不确定性下的抓取和提升成功率,相较于高斯基线。两种方法在测试的扰动下均成功,但我们的控制器在更少的步骤和更少的实际时间内终止,同时实现了更高的触觉抓取质量代理。我们的学习信念也更准确地校准了风险,在测试的仿真环境中保持平均绝对校准误差低于0.14,而交叉熵方法规划器的误差为0.58。
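The contrast drawn in the abstract above between expected-quality objectives and tail-aware ones can be made concrete with a plain empirical CVaR estimate. This sketch is not the paper's differentiable surrogate; it is the ordinary sample estimator (mean of the worst fraction of losses), and the two sets of grasp losses are made-up numbers for illustration.

```python
import math

def empirical_cvar(losses, alpha=0.9):
    """Mean of the worst (1 - alpha) fraction of sampled losses (higher = worse)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)
    tail_size = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)

# Hypothetical sampled losses for two candidate grasps.
grasp_a = [0.1] * 9 + [5.0]   # low typical cost, one catastrophic outcome
grasp_b = [0.6] * 10          # higher typical cost, no bad tail

mean_a, mean_b = sum(grasp_a) / 10, sum(grasp_b) / 10
cvar_a = empirical_cvar(grasp_a, 0.8)
cvar_b = empirical_cvar(grasp_b, 0.8)
```

Under these assumed samples the mean slightly favours grasp A while CVaR clearly favours grasp B, which is the failure mode of expected-quality selection that the abstract describes.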
计算机视觉 (Computer Vision)
70
cs.CV / 1 / 2604.24876

ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation

ESICA:一种可扩展的文本引导三维医学图像分割框架
Xin, Yu, Ates, Gorkem Can, Ma, Jun, Kim, Sumin, Zhang, Ying, Smith, Kaleb E, Gong, Kuang, Shao, Wei
Abstract
Text-guided 3D medical image segmentation offers a flexible alternative to class-based and spatial-prompt-based models by allowing users to specify regions of interest directly in natural language. This paradigm avoids reliance on predefined label sets, reduces ambiguous outputs, and aligns more naturally with clinical workflows. However, existing text-guided frameworks are often computationally expensive, exhibit weak text-volume feature alignment, and fail to capture fine anatomical details. We propose ESICA, a lightweight and scalable framework that addresses these challenges through three innovations: (1) a similarity-matrix-based mask prediction formulation that enhances semantic alignment, (2) an efficient decomposed decoder with adapter modules for accurate volumetric decoding, and (3) a two-pass refinement strategy that sharpens boundaries and resolves uncertain regions. To improve training stability and generalization, ESICA adopts a two-stage scheme consisting of positive-only pretraining followed by balanced fine-tuning. On the CVPR BiomedSegFM benchmark spanning five imaging modalities (CT, MRI, PET, ultrasound, and microscopy), ESICA achieves state-of-the-art segmentation accuracy, while the compact ESICA4 Lite variant attains similar segmentation performance with substantially fewer parameters, yielding a superior efficiency-accuracy trade-off. Our framework advances text-guided segmentation toward efficient, scalable, and clinically deployable systems. Code will be made publicly available at https://github.com/mirthAI/ESICA.
Chinese Translation
文本引导的三维医学图像分割为基于类别和基于空间提示的模型提供了一种灵活的替代方案,允许用户直接用自然语言指定感兴趣的区域。这一范式避免了对预定义标签集的依赖,减少了模糊输出,并与临床工作流程更自然地对接。然而,现有的文本引导框架通常计算开销大,文本体积特征对齐较弱,且未能捕捉细微的解剖细节。我们提出了ESICA,这是一种轻量级且可扩展的框架,通过三项创新来应对这些挑战:(1)基于相似性矩阵的掩膜预测公式,增强语义对齐;(2)具有适配模块的高效解码器,用于准确的体积解码;(3)两次精细化策略,锐化边界并解决不确定区域。为了提高训练的稳定性和泛化能力,ESICA采用了一个两阶段方案,包括仅正样本的预训练,随后进行平衡微调。在涵盖五种成像模式(CT、MRI、PET、超声和显微镜)的CVPR BiomedSegFM基准测试中,ESICA实现了最先进的分割精度,而紧凑型的ESICA4 Lite变体在参数显著减少的情况下达到了类似的分割性能,从而实现了更优的效率与精度的权衡。我们的框架推动了文本引导分割向高效、可扩展和临床可部署的系统发展。代码将公开发布在 https://github.com/mirthAI/ESICA。
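One way to read the similarity-matrix-based mask prediction mentioned in the abstract above is as a prompt-by-voxel similarity table that is thresholded into masks. The sketch below is an assumption about the general idea, not ESICA's formulation: cosine similarity, the 0.5 threshold, the 2-D toy embeddings, and the function names are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def similarity_masks(text_embeddings, voxel_features, threshold=0.5):
    """Return one binary mask per text prompt from a prompt-by-voxel similarity matrix."""
    similarity = [
        [cosine_similarity(t, v) for v in voxel_features]
        for t in text_embeddings
    ]
    return [[1 if s > threshold else 0 for s in row] for row in similarity]

# Hypothetical 2-D embeddings: two text prompts, four voxels.
prompts = [[1.0, 0.0], [0.0, 1.0]]
voxels = [[0.9, 0.1], [0.2, 0.8], [1.0, 1.0], [0.0, 0.0]]
masks = similarity_masks(prompts, voxels)
```

Because each mask is read off a row of the same matrix, adding a prompt adds a row rather than a new output head, which is one plausible reason such a formulation avoids a predefined label set.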
cs.CV / 2 / 2604.24877

Learning Illumination Control in Diffusion Models

扩散模型中的光照控制学习
Anand, Nishit, Suri, Manan, Metzler, Christopher, Manocha, Dinesh, Duraiswami, Ramani
Abstract
Controlling illumination in images is essential for photography and visual content creation. While closed-source models have demonstrated impressive illumination control, open-source alternatives either require heavy control inputs like depth maps or do not release their data and code. We present a fully open-source and reproducible pipeline for learning illumination control in diffusion models. Our approach builds a data engine that transforms well-lit images into supervised training triplets consisting of a poorly-illuminated input image, a natural language lighting instruction, and a well-illuminated output image. We finetune a diffusion model on this data and demonstrate significant improvements over baseline SD 1.5, SDXL, and FLUX.1-dev models in perceptual similarity, structural similarity, and identity preservation. Our work provides a reproducible solution built entirely with open-source tools and publicly available data. We release all our code, data, and model weights publicly.
Chinese Translation
在图像中控制光照对于摄影和视觉内容创作至关重要。尽管闭源模型在光照控制方面表现出色,但开源替代方案要么需要像深度图这样的重控制输入,要么不公开其数据和代码。我们提出了一种完全开源且可重复的管道,用于在扩散模型中学习光照控制。我们的方法构建了一个数据引擎,将良好照明的图像转换为监督训练三元组,包括一个光照不足的输入图像、一条自然语言光照指令和一个光照良好的输出图像。我们在这些数据上对扩散模型进行了微调,并在感知相似性、结构相似性和身份保留方面展示了相较于基线模型SD 1.5、SDXL和FLUX.1-dev的显著改进。我们的工作提供了一个完全基于开源工具和公开可用数据的可重复解决方案。我们将所有代码、数据和模型权重公开发布。
cs.CV / 3 / 2604.24885

VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

VibeToken:扩展一维图像标记器和自回归模型以实现动态分辨率生成
Patel, Maitreya, Li, Jingtao, Zhuang, Weiming, Yang, Yezhou, Lv, Lingjuan
Abstract
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen -- whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024) -- VibeToken-Gen maintains a constant 179G FLOPs (63.4x efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.
Chinese Translation
我们提出了一种高效的、与分辨率无关的自回归(AR)图像合成方法,该方法能够推广到任意分辨率和纵横比,缩小了与大规模扩散模型之间的差距。其核心是VibeToken,一种新颖的与分辨率无关的基于一维Transformer的图像标记器,它将图像编码为动态的、用户可控的32-256个标记序列,达到了业界领先的效率和性能平衡。在VibeToken的基础上,我们提出了VibeToken-Gen,这是一种类别条件的AR生成器,开箱即用地支持任意分辨率,同时所需的计算资源显著减少。值得注意的是,VibeToken-Gen仅使用64个标记合成1024x1024的图像,并达到3.94 gFID;相比之下,一种基于扩散的业界领先替代方案需要1024个标记并达到5.87 gFID。与固定分辨率的AR模型(如LlamaGen,其推理FLOPs随分辨率呈二次增长,在1024x1024时为11T FLOPs)相比,VibeToken-Gen保持恒定的179G FLOPs(效率提高63.4倍),与分辨率无关。我们希望VibeToken能够帮助推动AR视觉生成模型在生产用例中的广泛应用。
cs.CV / 4 / 2604.24893

Interactive Episodic Memory with User Feedback

带用户反馈的互动情节记忆
Subedi, Nikesh, Bazzani, Loris, Al-Halah, Ziad
Abstract
In episodic memory with natural language queries (EM-NLQ), a user may ask a question (e.g., "Where did I place the mug?") that requires searching a long egocentric video, captured from the user's perspective, to find the moment that answers it. However, queries can be ambiguous or incomplete, leading to incorrect responses. Current methods ignore this key aspect and address EM-NLQ in a one-shot setup, limiting their applicability in real-world scenarios. In this work, we address this gap and introduce the Episodic Memory with Questions and Feedback task (EM-QnF). Here, the user can provide feedback on the model's initial prediction or add more information (e.g., "Before this. I'm looking for the big blue mug not the white one"), helping the model refine its predictions interactively. To this end, we collect datasets for feedback-based interaction and propose a lightweight training scheme that avoids expensive sequential optimization. We also introduce a plug-and-play Feedback ALignment Module (FALM) that enables existing EM-NLQ models to incorporate user feedback effectively. Our approach significantly improves over the state of the art on three challenging benchmarks and is better than or competitive with commercial large vision-language models while remaining efficient. Evaluation with human-generated feedback shows that it generalizes well to real-world scenarios.
Chinese Translation
在自然语言查询的情节记忆(EM-NLQ)中,用户可能会提出一个问题(例如,“我把杯子放在哪里了?”),这需要搜索一段从用户视角拍摄的长时间自我中心视频,以找到回答该问题的时刻。然而,查询可能存在歧义或不完整,从而导致错误的响应。目前的方法忽视了这一关键方面,并在一次性设置中处理EM-NLQ,限制了它们在现实场景中的适用性。在本研究中,我们解决了这一空白,并引入了带问题和反馈的情节记忆任务(EM-QnF)。在这里,用户可以对模型的初始预测提供反馈或添加更多信息(例如,“在这个之前。我在找那个大蓝杯,而不是白色的”),帮助模型进行互动式的预测优化。为此,我们收集了基于反馈的交互数据集,并提出了一种轻量级的训练方案,避免了昂贵的顺序优化。我们还引入了一个即插即用的反馈对齐模块(Feedback ALignment Module, FALM),使现有的EM-NLQ模型能够有效地整合用户反馈。我们的方法在三个具有挑战性的基准测试中显著超越了现有技术,并且在效率上优于或与商业大型视觉-语言模型竞争。使用人类生成的反馈进行评估表明,该方法在现实场景中具有良好的泛化能力。
cs.CV / 5 / 2604.24919

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

面向遥感的自主智能体:技术挑战与研究方向
Munir, Muhammad Akhtar, Sheikh, Muhammad Umer, Shabbir, Akashah, Khan, Muhammad Haris, Khan, Fahad, Zhu, Xiao Xiang, Demir, Begum, Khan, Salman
Abstract
Earth Observation (EO) is moving beyond static prediction toward multi-step analytical workflows that require coordinated reasoning over data, tools, and geospatial state. While foundation models and vision-language models have expanded representation learning and language-grounded interaction for remote sensing, and agentic AI has demonstrated long-horizon reasoning and external tool use, EO is not a straightforward extension of generic agentic AI. EO workflows operate over georeferenced, multi-modal, and temporally structured data, where operations such as reprojection, resampling, compositing, and aggregation actively transform the underlying state and can constrain subsequent analysis. As a result, errors may propagate silently across steps, and correctness depends not only on internal coherence, but also on geospatial consistency, temporally valid comparisons, and physical validity. This position paper argues that these challenges are structural rather than incidental. We identify the implicit assumptions commonly made in generic agentic models, analyze how they break in geospatial workflows, and characterize the resulting failure modes in multi-step EO pipelines. We then outline design principles for EO-native agents centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and learning objectives aligned with geospatial and physical validity. Finally, we present research directions spanning EO-specific benchmarks, hybrid supervised and reinforcement learning, constrained self-improvement, and trajectory-level evaluation beyond final-answer accuracy. Building reliable geospatial agents therefore requires rethinking agent design around the physical, geospatial, and workflow constraints that govern EO analysis.
Chinese Translation
地球观测(EO)正从静态预测转向需要对数据、工具和地理空间状态进行协调推理的多步骤分析工作流。尽管基础模型和视觉-语言模型已扩展了遥感的表征学习和语言基础交互,自主智能体也展示了长时间跨度的推理和外部工具使用,但EO并不是通用自主智能体的简单扩展。EO工作流在地理参考的多模态和时间结构化数据上运行,其中重投影、重采样、合成和聚合等操作会主动改变基础状态,并可能限制后续分析。因此,错误可能在步骤之间悄然传播,正确性不仅依赖于内部一致性,还依赖于地理空间一致性、时间有效比较和物理有效性。本文立场论文认为,这些挑战是结构性的而非偶然的。我们识别了通用自主模型中常见的隐含假设,分析了它们在地理空间工作流中如何失效,并描述了多步骤EO管道中由此产生的失败模式。接着,我们概述了以结构化地理空间状态、工具感知推理、验证者引导执行和与地理空间及物理有效性对齐的学习目标为中心的EO原生智能体的设计原则。最后,我们提出了涵盖EO特定基准、混合监督与强化学习、受限自我改进以及超越最终答案准确性的轨迹级评估的研究方向。因此,构建可靠的地理空间智能体需要围绕支配EO分析的物理、地理空间和工作流约束重新思考智能体设计。
cs.CV / 6 / 2604.24947

Subjective Portrait Region Cropping in Landscape Videos with Temporal Annotation Smoothing

基于时间标注平滑的横屏视频主观竖屏区域裁剪
Lee, Cheng-Han, Mandal, Maniratnam, Birkbeck, Neil, Wang, Yilin, Adsumilli, Balu, Bovik, Alan C.
Abstract
With the rise of mobile video consumption across diverse handheld display resolutions and orientation modes, adapting videos to different aspect ratios poses challenges. Static cropping and border padding often compromise visual quality, while warping may distort a video's intended meaning. Here we advocate for a more effective approach: cropping significant regions within video frames in a temporal manner, while minimizing distortion and preserving essential content. One barrier to solving this problem is the lack of a sufficiently large-scale database devoted to informing these tasks. Towards filling this gap, we introduce the LIVE-YouTube Video Cropping (LIVE-YT VC) database, featuring 1800 videos annotated by 90 human subjects. Built from videos sourced from the YouTube-UGC and LSVQ databases, this new resource is the largest publicly available subjective video portrait region cropping database. We also introduce a post-processed version of the database, called LIVE-YT VC++, in which a novel intra-frame temporal filter is deployed to smooth subjective annotations within each video. We demonstrate the usefulness of this new data resource using the SmartVidCrop algorithm and state-of-the-art video grounding models, in hopes of establishing our subjective dataset as a benchmark for future research. Our contributions offer a resource for advancing video aspect ratio transformation models towards ensuring that reshaped mobile-friendly video content retains its quality and meaning. Since our labels resemble video saliency annotations, we also conducted an additional analysis to explore the similarity between our labels and video saliency predictions. Finally, we repurposed state-of-the-art video grounding models for aspect ratio change tasks and fine-tuned them on our dataset. As a service to the research community, we plan to open-source the project.
Chinese Translation
随着移动视频消费在各种手持显示分辨率和方向模式上的兴起,调整视频的纵横比面临挑战。静态裁剪和边框填充往往会影响视觉质量,而变形可能会扭曲视频的原意。在此,我们提倡一种更有效的方法:以时间的方式裁剪视频帧内的重要区域,同时最小化失真并保留重要内容。解决这一问题的一个障碍是缺乏足够大规模的数据库来支持这些任务。为填补这一空白,我们介绍了LIVE-YouTube视频裁剪(LIVE-YT VC)数据库,包含1800个视频,由90名受试者进行注释。该新资源是目前最大的公开可用的主观视频竖屏区域裁剪数据库,视频来源于YouTube-UGC和LSVQ数据库。我们还推出了该数据库的后处理版本,称为LIVE-YT VC++,其中部署了一种新颖的帧内时间滤波器,以平滑每个视频中的主观注释。我们使用SmartVidCrop算法和最先进的视频定位模型展示了这一新数据资源的实用性,希望将我们的主观数据集确立为未来研究的基准。我们的贡献为推动视频纵横比转换模型提供了资源,以确保重塑后的移动友好型视频内容保留其质量和意义。由于我们的标签与视频显著性注释相似,我们还进行了额外分析,以探索我们的标签与视频显著性预测之间的相似性。最后,我们重新利用了最先进的视频定位模型用于纵横比变化任务,并在我们的数据集上进行了微调。作为对研究社区的服务,我们计划将该项目开源。
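The abstract above does not specify the temporal filter used for LIVE-YT VC++, so the sketch below substitutes a plain centered moving average to show what smoothing per-frame crop annotations involves. The function name `smooth_crop_centers`, the window size, and the pixel values are hypothetical.

```python
def smooth_crop_centers(centers, window=5):
    """Centered moving average; the window shrinks near the clip boundaries."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(centers)):
        lo, hi = max(0, i - half), min(len(centers), i + half + 1)
        smoothed.append(sum(centers[lo:hi]) / (hi - lo))
    return smoothed

# Hypothetical annotation: horizontal crop centre (pixels) per frame,
# with one frame where the annotator's selection jumped.
raw = [100, 100, 100, 160, 100, 100, 100]
smooth = smooth_crop_centers(raw, window=3)
```

A moving average spreads a single-frame jump over its neighbours instead of removing it; a median filter would be the usual alternative when isolated outliers should be discarded outright.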
cs.CV / 7 / 2604.24952

Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

从噪声偏好中学习:一种半监督学习方法用于直接偏好优化
Liu, Xinxin, Li, Ming, Lyu, Zonglin, Shang, Yuzhang, Chen, Chen
Abstract
Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data. Our method starts by training on a consensus-filtered clean subset, then uses this model as an implicit classifier to generate pseudo-labels for the noisy set for iterative refinement. Experimental results demonstrate that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training. We will release our code and models at: https://github.com/L-CodingSpace/semi-dpo
Chinese Translation
人类视觉偏好本质上是多维的,涵盖了美学、细节保真度和语义一致性。然而,现有数据集仅提供单一的整体注释,导致严重的标签噪声:在某些维度表现优异但在其他维度不足的图像被简单标记为赢家或输家。我们理论上证明,将多维偏好压缩为二元标签会产生相互矛盾的梯度信号,从而误导扩散直接偏好优化(Diffusion Direct Preference Optimization, DPO)。为了解决这个问题,我们提出了Semi-DPO,一种半监督方法,将一致的配对视为干净的标记数据,而将矛盾的配对视为噪声未标记数据。我们的方法首先在经过共识过滤的干净子集上进行训练,然后使用该模型作为隐式分类器,为噪声集生成伪标签以进行迭代优化。实验结果表明,Semi-DPO达到了最先进的性能,并显著提高了与复杂人类偏好的对齐程度,而无需在训练过程中额外的人类注释或显式奖励模型。我们将发布我们的代码和模型,网址为:https://github.com/L-CodingSpace/semi-dpo
cs.CV / 8 / 2604.24953

ViPO: Visual Preference Optimization at Scale

ViPO:大规模视觉偏好优化
Li, Ming, Wu, Jie, Cui, Justin, Li, Xiaojie, Wang, Rui, Chen, Chen
Abstract
While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm remains largely unexplored. Current open-source preference datasets contain conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn preferences, hindering effective scaling. To enhance robustness against noise, we propose Poly-DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence based on dataset characteristics, enabling effective learning across diverse data distributions. Beyond biased patterns, existing datasets suffer from low resolution, limited prompt diversity, and imbalanced distributions. To facilitate large-scale visual preference optimization by tackling data bottlenecks, we construct ViPO, a massive-scale preference dataset with 1M image pairs at 1024px across five categories and 300K video pairs at 720p+ across three categories. State-of-the-art generative models and diverse prompts ensure reliable preference signals with balanced distributions. Remarkably, when applying Poly-DPO to our high-quality dataset, the optimal configuration converges to standard DPO. This convergence validates dataset quality and Poly-DPO's adaptive nature: sophisticated optimization becomes unnecessary with sufficient data quality, yet remains valuable for imperfect datasets. We validate our approach across visual generation models. On noisy datasets like Pick-a-Pic V2, Poly-DPO achieves 6.87 and 2.32 gains over Diffusion-DPO on GenEval for SD1.5 and SDXL, respectively. For ViPO, models achieve performance far exceeding those trained on existing open-source preference datasets. These results confirm that addressing both algorithmic adaptability and data quality is essential for scaling visual preference optimization.
Chinese Translation
虽然偏好优化对于提升视觉生成模型至关重要,但如何有效地扩展这一范式仍然基本未被探索。目前的开源偏好数据集包含相互矛盾的偏好模式,其中赢家在某些维度上表现出色,但在其他维度上表现不佳。对这样的噪声数据集进行简单优化无法学习偏好,阻碍了有效扩展。为了增强对噪声的鲁棒性,我们提出了 Poly-DPO,它通过一个额外的多项式项扩展了 DPO 目标,该项根据数据集特征动态调整模型信心,从而在多样化的数据分布中实现有效学习。除了偏见模式,现有数据集还存在低分辨率、有限的提示多样性和不平衡分布等问题。为了解决数据瓶颈,促进大规模视觉偏好优化,我们构建了 ViPO,一个包含 100 万对 1024px 图像(跨五个类别)和 30 万对 720p+ 视频(跨三个类别)的超大规模偏好数据集。最先进的生成模型和多样化的提示确保了可靠的偏好信号和均衡的分布。值得注意的是,当将 Poly-DPO 应用于我们的高质量数据集时,最佳配置收敛到标准 DPO。这一收敛验证了数据集的质量和 Poly-DPO 的自适应特性:在数据质量足够的情况下,复杂的优化变得不必要,但对于不完美的数据集仍然具有价值。我们在视觉生成模型上验证了我们的方法。在像 Pick-a-Pic V2 这样的噪声数据集上,Poly-DPO 在 GenEval 上分别对 SD1.5 和 SDXL 实现了 6.87 和 2.32 的增益。对于 ViPO,模型的性能远超在现有开源偏好数据集上训练的模型。这些结果确认了解决算法适应性和数据质量对于扩展视觉偏好优化的重要性。
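For orientation, the standard DPO objective that Poly-DPO extends can be sketched for a single preference pair as below. This is the generic log-probability form of DPO, not the authors' code: the function name `dpo_loss`, its arguments, and the default `beta` are illustrative assumptions, and diffusion variants typically substitute denoising-error terms for the log-probabilities. The abstract does not give the form of the additional polynomial term, so it is not reproduced here.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair (w = preferred sample, l = rejected sample).

    logp_* are log-probabilities under the model being trained,
    ref_logp_* under the frozen reference model (all names are illustrative).
    """
    # How much more the trained model favors the winner over the loser,
    # measured relative to the reference model and scaled by beta.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A pair the model already ranks correctly yields a lower loss
# than the same pair ranked the wrong way round.
good = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
bad = dpo_loss(logp_w=-3.0, logp_l=-1.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
print(good < bad)
```

When a pair's label is noisy (the "winner" is actually worse on some dimensions), this loss still pushes the margin upward, which is the failure mode the abstract attributes to naive optimization on conflicting preference data.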
cs.CV / 9 / 2604.24990

A New Kind of Network? Review and Reference Implementation of Neural Cellular Automata

一种新型网络?神经元细胞自动机的回顾与参考实现
Spitznagel, Martin, Keuper, Janis
Abstract
Stephen Wolfram proclaimed in his 2003 seminal work "A New Kind Of Science" that simple recursive programs in the form of Cellular Automata (CA) are a promising approach to replace currently used mathematical formalizations, e.g. differential equations, to improve the modeling of complex systems. Over two decades later, while Cellular Automata have still been waiting for a substantial breakthrough in scientific applications, recent research showed new and promising approaches which combine Wolfram's ideas with learnable Artificial Neural Networks: So-called Neural Cellular Automata (NCA) are able to learn the complex update rules of CA from data samples, allowing them to model complex, self-organizing generative systems. The aim of this paper is to review the existing work on NCA and provide a unified modular framework and notation, as well as a reference implementation in the open-source library NCAtorch.
Chinese Translation
斯蒂芬·沃尔夫拉姆在他2003年的开创性著作《一种新型科学》中宣称,简单的递归程序形式的细胞自动机(Cellular Automata, CA)是一种有前景的方法,可以替代当前使用的数学形式化,例如微分方程,以改善复杂系统的建模。二十多年后,尽管细胞自动机仍在等待科学应用上的重大突破,但近期研究显示出将沃尔夫拉姆的思想与可学习的人工神经网络相结合的新颖且有前景的方法:所谓的神经元细胞自动机(Neural Cellular Automata, NCA)能够从数据样本中学习CA的复杂更新规则,使其能够建模复杂的自组织生成系统。本文的目的是回顾现有的NCA研究,并提供一个统一的模块化框架和符号表示,以及在开源库NCAtorch中的参考实现。
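As background for the entry above, a classical one-dimensional cellular automaton update can be sketched as follows. This is a minimal illustration of a fixed-rule CA, not code from NCAtorch; the function name `ca_step` and the periodic boundary handling are assumptions made for the example. In a Neural Cellular Automaton, the fixed lookup of the rule bit would be replaced by a small learned network applied to each cell's neighborhood.

```python
def ca_step(cells, rule=110):
    """Apply one update of an elementary (1D, binary, radius-1) cellular automaton.

    `cells` is a list of 0/1 values; boundaries wrap around (an assumption of
    this sketch). `rule` is the Wolfram rule number (0-255).
    """
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        # Encode the neighborhood as a 3-bit index and read that bit of the rule.
        idx = (left << 2) | (center << 1) | right
        out.append((rule >> idx) & 1)
    return out

state = [0, 0, 0, 1, 0]
print(ca_step(state))
```

Every cell applies the same local rule to its neighborhood, which is the property that NCA keep while making the rule itself trainable.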
cs.CV / 10 / 2604.24997

DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

DouC:无训练双分支 CLIP 用于开放词汇分割
Zamini, Mohamad, Shukla, Diksha
Abstract
Open-vocabulary semantic segmentation requires assigning pixel-level semantic labels while supporting an open and unrestricted set of categories. Training-free CLIP-based approaches preserve strong zero-shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves patch-level reliability via lightweight, inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure-aware patch interactions to jointly influence final predictions, with optional instance-aware correction applied as post-processing. DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP's zero-shot generalization. Extensive experiments across eight benchmarks and multiple CLIP backbones demonstrate that DouC consistently outperforms prior training-free methods and scales favorably with model capacity.
Chinese Translation
开放词汇语义分割需要在支持开放和无限制类别集的同时,为每个像素分配语义标签。基于 CLIP 的无训练方法保持了强大的零样本泛化能力,但通常依赖于单一的推理机制,这限制了它们共同解决不可靠的局部标记和不足的空间一致性的能力。我们提出了 DouC,一种无训练的双分支 CLIP 框架,将密集预测分解为两个互补的组件。OG-CLIP 通过轻量级的推理时标记门控提高了补丁级别的可靠性,而 FADE-CLIP 通过由冻结的视觉基础模型引导的代理注意力注入外部结构先验。这两个分支在对数级别融合,使得局部标记的可靠性和结构感知的补丁交互共同影响最终预测,并在后处理阶段应用可选的实例感知校正。DouC 不引入额外的可学习参数,无需重新训练,并保持了 CLIP 的零样本泛化能力。在八个基准和多个 CLIP 主干上进行的广泛实验表明,DouC 始终优于先前的无训练方法,并且在模型容量方面表现良好。
cs.CV / 11 / 2604.24999

BifDet: A 3D Bifurcation Detection Dataset for Airway-Tree Modeling

BifDet:用于气道树建模的3D分叉检测数据集
Keshavarzi, Ali, Bouniot, Quentin, Smith, Benjamin M., Angelini, Elsa
Abstract
Thoracic Computed Tomography (CT) scans offer detailed insights into the intricate branching network of the airway tree, which is essential for understanding various respiratory diseases. Airway bifurcations, where airway branches split, are crucial landmarks for understanding lung physiology, disease mechanisms and lesion localization. Despite the significance of bifurcation analysis, a notable lack of datasets annotated for this task hinders the development of advanced automated specialized detection or segmentation tools. In this paper, we introduce BifDet, the first publicly available dataset specialized for 3D airway bifurcation detection, filling a critical gap in existing resources. Our dataset comprises carefully annotated CT scans from the ATM22 open-access cohort with bifurcation bounding boxes covering the parent and daughter branches. As a use case for demonstrating the potential of BifDet, we fine-tune and evaluate RetinaNet and DETR for 3D airway bifurcation detection on CT scans. We provide detailed pipelines, including preprocessing steps and specific implementation design choices. Results are detailed over various categories of minimal bounding box sizes to serve as a baseline for benchmarking future research.
Chinese Translation
胸部计算机断层扫描(CT)提供了对气道树复杂分支网络的详细洞察,这对于理解各种呼吸系统疾病至关重要。气道分叉,即气道分支的分裂,是理解肺生理、疾病机制和病变定位的重要标志。尽管分叉分析的重要性不言而喻,但缺乏针对该任务标注的数据集显著阻碍了先进自动化专用检测或分割工具的发展。本文介绍了BifDet,这是第一个公开可用的专门用于3D气道分叉检测的数据集,填补了现有资源中的关键空白。我们的数据集包含来自ATM22开放获取队列的精心标注的CT扫描,分叉边界框覆盖了母分支和子分支。作为展示BifDet潜力的用例,我们对RetinaNet和DETR进行了微调和评估,以在CT扫描上进行3D气道分叉检测。我们提供了详细的流程,包括预处理步骤和具体的实施设计选择。结果在不同最小边界框大小的类别上进行了详细描述,以作为未来研究的基准。
cs.CV / 12 / 2604.25065

ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching

ShapeY:通过最近邻匹配测量形状识别能力的原则性框架
Nam, Jong Woo, Rios, Amanda S., Mel, Bartlett W.
Abstract
Object recognition (OR) in humans relies heavily on shape cues and the ability to recognize objects across varying 3D viewpoints. Unlike humans, deep networks often rely on non-shape cues such as texture and background, leading to vulnerabilities in generalization and robustness. To address this gap, we introduce ShapeY, a novel and principled benchmarking framework designed to evaluate shape-based recognition capability in OR systems. ShapeY comprises 68,200 grayscale images of 200 3D objects rendered from multiple viewpoints and optionally subjected to non-shape "appearance" changes. Using a nearest-neighbor matching task, ShapeY specifically probes the fine-grained structure of an OR system's embedding space by evaluating whether object views are clustered by 3D shape similarity across varying 3D viewpoints and other non-shape changes. ShapeY provides a suite of quantitative and qualitative performance readouts, including error rate graphs, viewpoint tuning curves, histograms of positive and negative matching scores, and grids showing ordered best matches, which together offer a comprehensive evaluation of an OR system's shape understanding capability. Testing of 321 pre-trained networks with diverse architectures reveals significant challenges in achieving robust shape-based recognition: even state-of-the-art models struggle to generalize consistently across 3D viewpoint and appearance changes, and are prone to infrequent but egregious matches of objects of obviously completely different shape. ShapeY establishes a principled framework for advancing artificial vision systems toward human-like shape recognition capabilities, emphasizing the importance of disentangled and invariant object encodings.
Chinese Translation
人类的物体识别(Object Recognition, OR)在很大程度上依赖于形状线索以及在不同3D视角下识别物体的能力。与人类不同,深度网络通常依赖于非形状线索,如纹理和背景,这导致其在泛化和鲁棒性方面存在脆弱性。为了解决这一问题,我们提出了ShapeY,一个新颖且原则性的基准框架,旨在评估OR系统中的基于形状的识别能力。ShapeY包含68,200张灰度图像,涵盖200个从多个视角渲染的3D物体,并可选择性地施加非形状的“外观”变化。通过最近邻匹配任务,ShapeY特别探讨OR系统嵌入空间的细粒度结构,评估物体视图是否根据3D形状相似性在不同3D视角和其他非形状变化中聚类。ShapeY提供了一套定量和定性的性能输出,包括错误率图、视角调谐曲线、正负匹配分数的直方图,以及显示有序最佳匹配的网格,这些共同提供了对OR系统形状理解能力的全面评估。对321个具有多样化架构的预训练网络的测试揭示了实现鲁棒的基于形状的识别所面临的重大挑战:即使是最先进的模型也难以在3D视角和外观变化中保持一致的泛化,并且容易出现偶尔但明显不同形状物体的错误匹配。ShapeY建立了一个原则性框架,以推动人工视觉系统朝向类人形状识别能力的发展,强调了解耦和不变物体编码的重要性。
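The nearest-neighbor matching task described above can be illustrated with a small sketch. This is a simplified reading of the idea, not the ShapeY implementation: the function names, the use of cosine similarity, and the rule that only the query view itself is excluded from the candidate set are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nn_match_error_rate(embeddings, object_ids):
    """Fraction of views whose most similar other view shows a different object."""
    errors = 0
    for i, query in enumerate(embeddings):
        best_j, best_sim = None, -2.0  # cosine similarity is always >= -1
        for j, candidate in enumerate(embeddings):
            if j == i:
                continue  # a view may not match itself
            sim = cosine(query, candidate)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if object_ids[best_j] != object_ids[i]:
            errors += 1
    return errors / len(embeddings)

# Toy embedding space: two views each of two hypothetical objects.
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(nn_match_error_rate(views, ["mug", "mug", "chair", "chair"]))
```

A low error rate under this kind of test indicates that views of the same object sit close together in the embedding space, which is the clustering property the benchmark is designed to probe.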
cs.CV / 13 / 2604.25072

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

超越准确性:统一多模态模型中跨任务一致性的基准测试
Wang, Weixing, Zekas, Liudvikas, Hackl, Anton, Auga, Constantin Alexander, Shahabinejad, Parisa, Otholt, Jona, Rueda-Toicen, Antonio, de Melo, Gerard
Abstract
Unified Multimodal Models (uMMs) aim to support both visual understanding and visual generation within a shared representation. However, existing evaluation protocols assess these two capabilities independently and do not examine whether they are semantically aligned. As a result, it remains unclear whether current uMMs learn coherent unified representations that remain consistent across tasks given a visual concept. We introduce XTC-Bench, a scene-graph-grounded evaluation framework that measures cross-task visual semantic consistency. By deriving both generation prompts and understanding queries from a structured scene graph, our framework enables fact-level alignment analysis across objects, attributes, and relations. We propose Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts, isolating internal consistency from standalone task accuracy. Extensive experiments on eight open-source and one commercial unified models reveal that high generation or understanding performance does not imply strong cross-task alignment, and architectural analysis shows consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone. XTC-Bench provides a reproducible and model-agnostic framework for diagnosing representation-level misalignment, offering a concrete direction for advancing unified multimodal modeling beyond isolated task performance.
Chinese Translation
统一多模态模型(uMMs)旨在支持在共享表示中进行视觉理解和视觉生成。然而,现有的评估协议独立评估这两种能力,并未考察它们是否在语义上保持一致。因此,目前尚不清楚现有的uMMs是否能够学习到在给定视觉概念的情况下,在不同任务中保持一致的统一表示。我们提出了XTC-Bench,一个基于场景图的评估框架,用于测量跨任务的视觉语义一致性。通过从结构化场景图中推导生成提示和理解查询,我们的框架使得在对象、属性和关系之间进行事实级对齐分析成为可能。我们提出了连续跨任务一致性(CCTA),这是一种细粒度的度量,量化生成与理解之间在匹配原子事实上的语义一致性,将内部一致性与独立任务的准确性分离开来。在对八个开源和一个商业统一模型进行的大量实验中,我们发现高生成或理解性能并不意味着强跨任务对齐,架构分析表明,一致性是由学习目标在模态间的紧密耦合程度决定的,而不仅仅是由架构统一性决定的。XTC-Bench提供了一个可重复和模型无关的框架,用于诊断表示层面的不对齐,为推动统一多模态建模超越孤立任务性能提供了具体方向。
cs.CV / 14 / 2604.25102

One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

一种扰动,两种失效模式:通过嵌入引导的排版扰动探测视觉语言模型的安全性
Balakrishnan, Ravikumar, Mendapara, Sanket
Abstract
Typographic prompt injection exploits vision language models' (VLMs) ability to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain why certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs including GPT-4o and Claude, twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR ($r = -0.71$ to $-0.93$, $p < 0.01$), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we directly maximize image-text embedding similarity under bounded $\ell_\infty$ perturbations via CWA-SSA across four surrogate embedding models, stress-testing both factors without access to the target model. Experiments across five degradation settings on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL confirm that optimization recovers readability and reduces safety-aligned refusals as two co-occurring effects, with the dominant mechanism depending on the model's safety filter strength and the degree of visual degradation.
Chinese Translation
排版提示注入利用视觉语言模型(VLMs)读取图像中呈现文本的能力,随着VLMs驱动自主代理,这构成了日益严重的威胁。之前的研究通常集中于最大化攻击成功率(ASR),但并未解释某些渲染为何能够绕过安全对齐。我们做出了两项贡献。首先,对包括GPT-4o和Claude在内的四个VLM、十二种字体大小和十种变换的实证研究表明,多模态嵌入距离能够强烈预测ASR($r = -0.71$到$-0.93$,$p < 0.01$),提供了一个可解释的、与模型无关的代理。由于嵌入距离预测ASR,降低嵌入距离应能提高攻击成功率,但这一关系受到两个因素的调节:感知可读性(VLM是否能够解析文本)和安全对齐(VLM是否拒绝遵从)。其次,我们将此作为红队工具:我们通过CWA-SSA在四个替代嵌入模型下,直接最大化图像文本嵌入相似性,限制在$\ell_\infty$扰动范围内,测试这两个因素而无需访问目标模型。在GPT-4o、Claude Sonnet 4.5、Mistral-Large-3和Qwen3-VL的五种降级设置下的实验确认,优化恢复了可读性并减少了安全对齐拒绝,作为两种共现效应,其主导机制依赖于模型的安全过滤强度和视觉降级程度。
cs.CV / 15 / 2604.25122

M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

M$^3$-VQA:多模态、多实体、多跳视觉问答的基准
Ma, Jiatong, Guo, Longteng, Liu, Yuchen, Zhao, Zijia, Hao, Dongze, Lin, Xuanxu, Liu, Jing
Abstract
We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus on coarse-grained categories and simple reasoning over single entities, M$^3$-VQA introduces diverse multi-entity questions involving multiple distinct entities from both visual and textual sources. It requires models to perform both sequential and parallel multi-hop reasoning across multiple documents, supported by traceable, detailed evidence and a curated multimodal knowledge base. We evaluate 16 leading MLLMs under three settings: without external knowledge, with gold evidence, and with retrieval-augmented input. The poor results reveal significant challenges for MLLMs in knowledge acquisition and reasoning. Models perform poorly without external information but improve markedly when provided with precise evidence. Furthermore, reasoning-aware agentic retrieval surpasses heuristic methods, highlighting the importance of structured reasoning for complex multimodal understanding. M$^3$-VQA presents a more challenging evaluation for advancing the multimodal reasoning capabilities of MLLMs. Our code and dataset are available at https://github.com/CASIA-IVA-Lab/M3VQA.
Chinese Translation
我们提出了M$^3$-VQA,这是一种新颖的基于知识的视觉问答(VQA)基准,旨在增强对多模态大型语言模型(MLLMs)在细粒度多模态实体理解和复杂多跳推理方面的评估。与现有的VQA数据集主要关注粗粒度类别和对单一实体的简单推理不同,M$^3$-VQA引入了涉及来自视觉和文本源的多个不同实体的多实体问题。它要求模型在多个文档中进行顺序和并行的多跳推理,并提供可追溯的详细证据和经过整理的多模态知识库。我们在三种设置下评估了16个领先的MLLMs:没有外部知识、使用黄金证据和使用检索增强输入。结果显示,MLLMs在知识获取和推理方面面临重大挑战。在没有外部信息的情况下,模型表现不佳,但在提供精确证据时显著改善。此外,具有推理意识的代理检索超越了启发式方法,突显了结构化推理在复杂多模态理解中的重要性。M$^3$-VQA为推动MLLMs的多模态推理能力提供了更具挑战性的评估。我们的代码和数据集可在https://github.com/CASIA-IVA-Lab/M3VQA获取。
cs.CV / 16 / 2604.25128

ResetEdit: Precise Text-guided Editing of Generated Image via Resettable Starting Latent

ResetEdit:通过可重置起始潜变量实现生成图像的精确文本引导编辑
Wang, Hanyi, Fang, Han, Wang, Zheng, Wang, Shilin, Chang, Ee-Chien
Abstract
Recent advances in diffusion models have enabled high-quality image generation, leading to increasing demand for post-generation editing that modifies local regions while preserving global structure. Achieving such flexible and precise editing requires a high-quality starting point, a latent representation that provides both the freedom needed for diverse modifications and the precision required for fine-grained, region-specific control. However, existing inversion-based approaches such as DDIM inversion often yield unsatisfactory starting latents, resulting in degraded edit fidelity and structural inconsistency. Ideally, the most suitable editing anchor should be the original latent used during the generation process, as it inherently captures the scene's structure and semantics. Yet, storing this latent for every generated image is impractical due to massive storage and retrieval costs. To address this challenge, we propose ResetEdit, a proactive diffusion editing framework that embeds recoverable latent information directly into the generation process. By injecting the discrepancy between the clean and diffused latents into the diffusion trajectory and extracting it during inversion, ResetEdit reconstructs a resettable latent that closely approximates the true starting state. Additionally, a lightweight latent optimization module compensates for reconstruction bias caused by VAE asymmetry. Built upon Stable Diffusion, ResetEdit integrates seamlessly with existing tuning-free editing methods and consistently outperforms state-of-the-art baselines in both controllability and visual fidelity.
Chinese Translation
近年来,扩散模型的进展使得高质量图像生成成为可能,随之而来的是对后生成编辑的需求日益增加,这种编辑能够修改局部区域,同时保持全局结构。实现如此灵活且精确的编辑需要一个高质量的起始点,即潜在表示,它既提供了多样化修改所需的自由度,又具备细粒度、区域特定控制所需的精确性。然而,现有的基于反演的方法,如DDIM反演,往往产生不理想的起始潜变量,导致编辑保真度降低和结构不一致。理想情况下,最合适的编辑锚点应为生成过程中使用的原始潜变量,因为它本质上捕捉了场景的结构和语义。然而,由于巨大的存储和检索成本,为每个生成图像存储该潜变量是不切实际的。为了解决这一挑战,我们提出了ResetEdit,一个主动的扩散编辑框架,它将可恢复的潜在信息直接嵌入生成过程中。通过将干净潜变量与扩散潜变量之间的差异注入到扩散轨迹中,并在反演过程中提取,ResetEdit重构出一个可重置的潜变量,该潜变量与真实的起始状态高度接近。此外,一个轻量级的潜变量优化模块补偿了由变分自编码器(VAE)不对称性引起的重构偏差。基于Stable Diffusion,ResetEdit与现有的无调优编辑方法无缝集成,并在可控性和视觉保真度方面始终优于最先进的基线。
cs.CV / 17 / 2604.25164

IAM: Identity-Aware Human Motion and Shape Joint Generation

IAM:身份感知的人体运动与形状联合生成
Jia, Wenqi, Li, Zekun, Mittal, Abhay, Tang, Chengcheng, Guo, Chuan, Wang, Lezi, Rehg, James Matthew, Tao, Lingling, An, Size
Abstract
Recent advances in text-driven human motion generation enable models to synthesize realistic motion sequences from natural language descriptions. However, most existing approaches assume identity-neutral motion and generate movements using a canonical body representation, ignoring the strong influence of body morphology on motion dynamics. In practice, attributes such as body proportions, mass distribution, and age significantly affect how actions are performed, and neglecting this coupling often leads to physically inconsistent motions. We propose an identity-aware motion generation framework that explicitly models the relationship between body morphology and motion dynamics. Instead of relying on explicit geometric measurements, identity is represented using multimodal signals, including natural language descriptions and visual cues. We further introduce a joint motion-shape generation paradigm that simultaneously synthesizes motion sequences and body shape parameters, allowing identity cues to directly modulate motion dynamics. Extensive experiments on motion capture datasets and large-scale in-the-wild videos demonstrate improved motion realism and motion-identity consistency while maintaining high motion quality. Project page: https://vjwq.github.io/IAM
Chinese Translation
最近在文本驱动的人体运动生成方面的进展使得模型能够从自然语言描述中合成逼真的运动序列。然而,大多数现有方法假设运动是身份中立的,并使用规范的身体表示生成动作,忽视了身体形态对运动动态的强烈影响。在实践中,身体比例、质量分布和年龄等属性显著影响动作的执行方式,忽视这种耦合往往会导致物理上不一致的运动。我们提出了一种身份感知的运动生成框架,明确建模身体形态与运动动态之间的关系。我们不依赖于显式的几何测量,而是通过多模态信号(包括自然语言描述和视觉线索)来表示身份。我们进一步引入了一种联合运动-形状生成范式,同时合成运动序列和身体形状参数,使身份线索能够直接调节运动动态。在运动捕捉数据集和大规模野外视频上的大量实验表明,运动的真实感和运动-身份一致性得到了改善,同时保持了高质量的运动。项目页面:https://vjwq.github.io/IAM
cs.CV / 18 / 2604.25176

Benchmarking OCR Pipelines with Adaptive Enhancement for Multi-Domain Retail Bill Digitization

基于自适应增强的多领域零售账单数字化OCR管道基准测试
Gaikwad, Vijaysinh
Abstract
The digitization of multi-domain retail billing documents remains a challenging task due to variability in scan quality, layout heterogeneity, and domain diversity across commercial sectors. This paper proposes and benchmarks an intelligent, quality-aware adaptive Optical Character Recognition (OCR) pipeline for retail bill digitization spanning five domains: grocery stores, restaurants, hardware shops, footwear outlets, and clothing retailers. The proposed system integrates a Convolutional Neural Network (CNN)-based image enhancement module trained via self-supervised denoising, a Laplacian variance-based image quality analyzer with three-tier routing, a confidence-driven adaptive feedback loop with iterative retry, and an NLP-based post-OCR correction layer. Experiments were conducted on a real-world dataset of 360 heterogeneous retail bill images. Ground truth for quantitative evaluation was generated using an OCR ensemble majority voting strategy, a validated approach for scenarios without manual annotation. The proposed pipeline achieves a Character Error Rate (CER) of 18.4% and Word Error Rate (WER) of 27.6%, representing improvements of 26.4% and 31.2% respectively over the Raw Tesseract baseline. The pipeline additionally achieves a text density of 108.3 words per image, a noise ratio of 2.3%, and a processing time of 3.64 seconds per image - a 6.4x speed advantage over EasyOCR. Image quality PSNR analysis on enhanced MEDIUM and LOW quality images yields an average of 28.7 dB, confirming meaningful enhancement. These results establish a reproducible benchmark for multi-domain retail bill OCR research.
Chinese Translation
多领域零售账单文件的数字化仍然是一项具有挑战性的任务,原因在于扫描质量的变化、布局的异质性以及各商业领域的多样性。本文提出并基准测试了一种智能的、质量感知的自适应光学字符识别(OCR)管道,适用于涵盖五个领域的零售账单数字化:杂货店、餐馆、五金店、鞋类专卖店和服装零售商。所提出的系统集成了一个基于卷积神经网络(CNN)的图像增强模块,该模块通过自监督去噪训练而成;一个基于拉普拉斯方差的图像质量分析器,具有三级路由功能;一个基于置信度的自适应反馈循环,支持迭代重试;以及一个基于自然语言处理(NLP)的后OCR修正层。实验在360张异质零售账单图像的真实世界数据集上进行。定量评估的真实值是通过OCR集成多数投票策略生成的,这是一种在没有人工标注的情况下验证的有效方法。所提出的管道实现了18.4%的字符错误率(CER)和27.6%的词错误率(WER),分别比原始Tesseract基线提高了26.4%和31.2%。该管道还实现了每张图像108.3个单词的文本密度、2.3%的噪声比以及每张图像3.64秒的处理时间,相较于EasyOCR具有6.4倍的速度优势。对增强的中等和低质量图像进行的图像质量PSNR分析平均达到28.7 dB,确认了显著的增强效果。这些结果为多领域零售账单OCR研究建立了可重复的基准。
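The two headline metrics in this entry, Character Error Rate and Word Error Rate, are conventionally computed from the Levenshtein edit distance between a reference transcript and the OCR output. The sketch below shows that conventional definition; it is not the paper's evaluation script, and the function names and sample strings are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# Hypothetical bill line where OCR misreads "l" as "1".
print(cer("total 42.00", "tota1 42.00"), wer("total 42.00", "tota1 42.00"))
```

Because a single wrong character invalidates the whole word, WER is normally higher than CER on the same output, which is consistent with the 18.4% CER and 27.6% WER reported above.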
cs.CV / 19 / 2604.25178

Lightweight Real-Time Rendering Parameter Optimization via XGBoost-Driven Lookup Tables

基于XGBoost驱动查找表的轻量级实时渲染参数优化
Tan, Baijun, Moretti, Francesco
Abstract
Achieving a desirable balance between rendering quality and real-time performance is a long-standing challenge in modern game and rendering engines, particularly on resource-constrained mobile devices such as laptops, tablets, and smartphones. Existing approaches to automatic rendering parameter optimization either depend on exhaustive per-scene pre-computation that spans several days, suffer from the prohibitive inference overhead of neural networks that prevents per-frame adaptation, or lack generalizability across heterogeneous hardware and diverse scenes. In this paper, we propose LUT-Opt, a lightweight, general-purpose framework for adaptive per-frame rendering parameter optimization. Our method decomposes the joint optimization of rendering time and image quality into a tractable two-stage pipeline. In the offline stage, we train a pair of XGBoost regressors to predict rendering time and image quality from rendering parameters, hardware state, and scene complexity descriptors. The trained ensemble models are then distilled into compact lookup tables (LUTs) through systematic discretization and a two-phase linear search that first constrains rendering time and subsequently maximizes structural similarity (SSIM). During runtime, the pre-computed LUT is queried every frame in sub-millisecond time, enabling truly adaptive parameter selection with negligible computational overhead. We validate LUT-Opt on two representative rendering techniques, subsurface scattering (SSS) and hybrid-pipeline ambient occlusion (AO), implemented within Unreal Engine 5. Extensive experiments across multiple scenes and GPU configurations demonstrate that LUT-Opt reduces subsurface scattering rendering time by approximately 40% and ambient occlusion rendering time by roughly 70%, while incurring only about 2% increase in image quality error, with per-frame inference latency below 0.1 ms.
Chinese Translation
在现代游戏和渲染引擎中,实现渲染质量与实时性能之间的理想平衡一直是一个长期挑战,尤其是在资源受限的移动设备上,如笔记本电脑、平板电脑和智能手机。现有的自动渲染参数优化方法要么依赖于耗时数天的逐场景预计算,要么受到神经网络推理开销的限制,无法实现逐帧适应,或者缺乏在异构硬件和多样场景中的通用性。本文提出了LUT-Opt,一个轻量级、通用的逐帧自适应渲染参数优化框架。我们的方法将渲染时间和图像质量的联合优化分解为一个可处理的两阶段管道。在离线阶段,我们训练了一对XGBoost回归模型,以根据渲染参数、硬件状态和场景复杂性描述符预测渲染时间和图像质量。然后,通过系统的离散化和两阶段线性搜索将训练好的集成模型提炼为紧凑的查找表(LUT)。该搜索首先限制渲染时间,然后最大化结构相似性(SSIM)。在运行时,预计算的LUT在每帧中以亚毫秒的时间进行查询,从而实现真正的自适应参数选择,且计算开销微乎其微。我们在两个代表性的渲染技术上验证了LUT-Opt:次表面散射(SSS)和混合管线环境光遮蔽(AO),均在虚幻引擎5中实现。跨多个场景和GPU配置的广泛实验表明,LUT-Opt将次表面散射的渲染时间减少了约40%,将环境光遮蔽的渲染时间减少了约70%,同时图像质量误差仅增加约2%,每帧推理延迟低于0.1 ms。
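The two-phase search described above (first constrain rendering time, then maximize SSIM) can be sketched as follows. This is an interpretation of the abstract, not the authors' implementation: the function `build_lut_entry`, the fallback when no setting fits the budget, and the toy predictors standing in for the trained XGBoost regressors are all assumptions made for illustration.

```python
def build_lut_entry(candidates, predict_time, predict_ssim, time_budget_ms):
    """Choose one parameter setting for a single discretized LUT cell.

    `predict_time` and `predict_ssim` stand in for the two trained regressors;
    here they are plain callables taking a candidate parameter setting.
    """
    # Phase 1: keep only settings whose predicted rendering time fits the budget.
    feasible = [p for p in candidates if predict_time(p) <= time_budget_ms]
    if not feasible:
        # Assumed fallback: nothing fits, so take the fastest setting available.
        return min(candidates, key=predict_time)
    # Phase 2: among feasible settings, take the one with the highest predicted SSIM.
    return max(feasible, key=predict_ssim)

# Toy stand-ins: the parameter is a sample count; more samples cost more time
# and give higher predicted quality.
sample_counts = [4, 8, 16, 32]
toy_time = lambda n: 0.25 * n      # milliseconds (made-up model)
toy_ssim = lambda n: 1.0 - 1.0 / n  # made-up quality model

# Offline: precompute one entry per time budget. Runtime: a dictionary lookup.
lut = {budget: build_lut_entry(sample_counts, toy_time, toy_ssim, budget)
       for budget in (0.5, 2.0, 5.0)}
print(lut[5.0])
```

Moving both regressors out of the frame loop is what makes the runtime cost a table lookup rather than a model inference, which matches the sub-millisecond query time claimed in the abstract.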
cs.CV / 20 / 2604.25186

FCMBench-Video: Benchmarking Document Video Intelligence

FCMBench-Video:文档视频智能基准测试
Cui, Runze, Shang, Fangxin, Yang, Yehui, Yang, Qing, Chen, Tao
Abstract
Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. For privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos paired with 11,322 expert-annotated question-answer instances, covering 28 document types over 20s-60s duration tiers and 5,960 Chinese / 5,362 English instances. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating a benchmark that is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and probing capability boundaries in authenticity-sensitive credit-domain applications.
Chinese Translation
文档理解是金融信用审核、客户入职和远程验证中的一项关键能力,其中决策准确性和证据可追溯性至关重要。与静态文档图像相比,文档视频呈现出时间冗余和顺序展开的证据流,要求在帧之间进行证据整合,并保留与真实性敏感和反欺诈审核相关的获取过程线索。我们介绍了FCMBench-Video,这是一个用于文档视频智能的基准测试,评估在现实捕获条件下的文档感知、时间定位和基于证据的推理。为了在符合隐私要求的同时实现大规模的现实数据,我们将构建过程组织为一个原子获取和组合工作流,记录可重用的单文档剪辑,应用受控降级,并组装具有规定时间跨度的长格式多文档视频。FCMBench-Video由495个原子视频构成,组成1200个长格式视频,并配对11322个专家标注的问题-答案实例,涵盖28种文档类型,持续时间在20秒至60秒之间,以及5960个中文实例和5362个英文实例。对九个最新的视频-多模态语言模型(Video-MLLMs)的评估表明,FCMBench-Video在系统和能力之间提供了有意义的区分:计数是最敏感于持续时间的任务,跨文档验证和基于证据的选择探测更高层次的证据整合,而视觉提示注入提供了一个互补的鲁棒性维度。整体评分分布广泛且近似钟形,表明该基准测试既不饱和也不被琐碎案例主导。这些结果共同将FCMBench-Video定位为一个可重复的基准,用于跟踪视频-多模态语言模型在文档视频理解方面的进展,并探测在真实性敏感的信用领域应用中的能力边界。
cs.CV / 21 / 2604.25188

Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation

基于随机扩张卷积的多分支特征提取与上下文激励的图像分类
Jiang, Wentao, Xu, Yuanchan, Yuan, Heng
Abstract
Image classification remains a fundamental yet challenging task in computer vision, particularly when fine-grained feature extraction and background noise suppression are required simultaneously. Conventional convolutional neural networks, despite their remarkable success in hierarchical feature learning, often struggle with capturing multi-scale contextual information and are susceptible to overfitting when confronted with noisy or irrelevant image regions. In this paper, we propose RDCNet (Image Classification Network with Random Dilated Convolution), a novel architecture built upon ResNet-34 that integrates three synergistic innovations to address these limitations: (1) a Multi-Branch Random Dilated Convolution (MRDC) module that employs parallel branches with varying dilation rates combined with a stochastic masking mechanism to capture fine-grained features across multiple scales while enhancing robustness against noise and overfitting; (2) a Fine-Grained Feature Enhancement (FGFE) module embedded within MRDC that bridges global contextual information with local feature representations through adaptive pooling and bilinear interpolation, thereby amplifying sensitivity to subtle visual patterns; and (3) a Context Excitation (CE) module that leverages softmax-based spatial attention and channel recalibration to dynamically emphasize task-relevant features while suppressing background interference. Extensive experiments conducted on five benchmark datasets (CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof) demonstrate that RDCNet consistently achieves state-of-the-art classification accuracy, outperforming the second-best competing methods by margins of 0.02%, 1.12%, 0.18%, 4.73%, and 3.56%, respectively, thereby validating the effectiveness and generalizability of the proposed approach across diverse visual recognition scenarios.
Chinese Translation
图像分类仍然是计算机视觉中的一项基本而具有挑战性的任务,特别是在需要同时进行细粒度特征提取和背景噪声抑制时。尽管传统卷积神经网络在层次特征学习方面取得了显著成功,但在捕捉多尺度上下文信息时常常面临困难,并且在处理噪声或无关图像区域时容易出现过拟合。在本文中,我们提出了RDCNet(基于随机扩张卷积的图像分类网络),这是一种基于ResNet-34的新型架构,整合了三项协同创新以应对这些局限性:(1)多分支随机扩张卷积(MRDC)模块,采用具有不同扩张率的并行分支,并结合随机掩蔽机制,以捕捉多尺度的细粒度特征,同时增强对噪声和过拟合的鲁棒性;(2)嵌入在MRDC中的细粒度特征增强(FGFE)模块,通过自适应池化和双线性插值将全局上下文信息与局部特征表示连接起来,从而增强对微妙视觉模式的敏感性;(3)上下文激励(CE)模块,利用基于softmax的空间注意力和通道重校准,动态强调与任务相关的特征,同时抑制背景干扰。在五个基准数据集(CIFAR-10、CIFAR-100、SVHN、Imagenette和Imagewoof)上进行的广泛实验表明,RDCNet始终实现了最先进的分类准确率,分别比第二好的竞争方法超出0.02%、1.12%、0.18%、4.73%和3.56%,从而验证了所提方法在多样化视觉识别场景中的有效性和通用性。
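The multi-branch dilated convolution idea described in this abstract can be illustrated with a small sketch in plain Python. Everything here is an assumption made for illustration: the 1-D setting, the function names, the all-ones kernel, and the choice to implement the "stochastic masking" as randomly dropping whole branches. The abstract does not say how RDCNet actually builds its branches or its mask.

```python
import random

def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded 'same' 1-D convolution whose taps are `dilation` samples apart."""
    half = len(kernel) // 2  # assumes an odd kernel length
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + (k - half) * dilation
            if 0 <= j < n:  # taps that fall outside the signal read as zero
                acc += weight * signal[j]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1

def multi_branch(signal, kernel, dilations, keep_prob=1.0, rng=None):
    """Average the outputs of parallel dilated branches.

    Hypothetical masking rule: each branch is kept with probability
    `keep_prob`; if every branch is dropped, the first one is used.
    """
    rng = rng or random.Random(0)
    kept = [d for d in dilations if rng.random() < keep_prob] or [dilations[0]]
    branches = [dilated_conv1d(signal, kernel, d) for d in kept]
    return [sum(values) / len(branches) for values in zip(*branches)]
```

Feeding a unit impulse through branches with dilation 1 and 2 shows the point of mixing rates: the same three-tap kernel spans 3 samples in one branch and 5 in the other, so the combined response mixes local and wider context.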
cs.CV / 22 / 2604.25208

Towards Seamless Lunar Mosaics: Deep Radiometric Normalization for Cross-Sensor Orbital Imagery Using Chandrayaan-2 TMC Data

迈向无缝月球马赛克:基于深度学习的跨传感器轨道影像辐射归一化方法,使用Chandrayaan-2 TMC数据
Singh, Pratincha, Singla, Jai Gopal, Hemrajani, Prashant, Dube, Nitant, Amithabh, Patel, Hinal
Abstract
Radiometric inconsistencies remain a major challenge in generating seamless lunar mosaics from multi-mission orbital imagery due to variability in illumination geometry, sensor characteristics, and acquisition conditions. This paper presents a deep learning-based radiometric normalization framework for multi-mission lunar mosaics constructed primarily from ISRO's Chandrayaan-2 Terrain Mapping Camera (TMC) data, supplemented with auxiliary imagery from the SELENE (Kaguya) mission. The proposed approach employs a conditional generative adversarial network (cGAN) comprising a U-Net-based generator and a PatchGAN discriminator to learn a nonlinear radiometric mapping from conventionally mosaicked lunar imagery to a photometrically consistent reference derived from LROC Wide Angle Camera (WAC) data. A patch-based training strategy with overlap-aware inference is adopted to enable scalable processing of large-area mosaics while preserving structural continuity across tile boundaries. Quantitative evaluation using Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Root Mean Square Error (RMSE) demonstrates consistent improvements over traditional histogram-based normalization techniques. The proposed framework achieves enhanced tonal uniformity, reduced seam artifacts, and improved structural coherence across multi-source lunar datasets. These results highlight the effectiveness of learning-based radiometric normalization for large-scale planetary mosaicking and demonstrate its potential for generating high-fidelity lunar surface maps from heterogeneous orbital imagery.
Chinese Translation
由于照明几何、传感器特性和获取条件的变化,辐射不一致性仍然是从多任务轨道影像生成无缝月球马赛克的主要挑战。本文提出了一种基于深度学习的辐射归一化框架,面向主要由印度空间研究组织(ISRO)Chandrayaan-2地形测绘相机(TMC)数据构建、并辅以SELENE(Kaguya)任务辅助影像的多任务月球马赛克。所提方法采用条件生成对抗网络(cGAN),包括基于U-Net的生成器和PatchGAN判别器,以学习从传统拼接的月球影像到基于LROC宽角相机(WAC)数据推导的光度一致参考的非线性辐射映射。采用基于图像块的训练策略和重叠感知推理,以实现大面积马赛克的可扩展处理,同时保持图块边界处的结构连续性。使用结构相似性指数(SSIM)、峰值信噪比(PSNR)和均方根误差(RMSE)进行的定量评估表明,该方法相较于传统的基于直方图的归一化技术取得了一致的改进。所提框架实现了更好的色调均匀性、更少的接缝伪影,并改善了多源月球数据集间的结构一致性。这些结果突显了基于学习的辐射归一化在大规模行星马赛克中的有效性,并展示了其从异构轨道影像生成高保真月球表面地图的潜力。
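The "patch-based training with overlap-aware inference" mentioned in this abstract can be sketched as follows. This is a 1-D toy under stated assumptions: the function names are hypothetical, `model` stands in for the trained generator, and averaging predictions where patches overlap is one common blending rule that the abstract does not confirm. The sketch also assumes `stride <= patch` and a signal at least one patch long.

```python
def tile_starts(length, patch, stride):
    """Start offsets of patches covering [0, length); assumes length >= patch."""
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:  # add a final patch flush with the end
        starts.append(length - patch)
    return starts

def overlap_infer(signal, patch, stride, model):
    """Run `model` on overlapping patches and average where they overlap.

    `model` is any callable mapping a patch (list) to a same-length list.
    Assumes stride <= patch so that every sample is covered at least once.
    """
    total = [0.0] * len(signal)
    count = [0] * len(signal)
    for start in tile_starts(len(signal), patch, stride):
        output = model(signal[start:start + patch])
        for offset, value in enumerate(output):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]
```

With an identity `model` the output reproduces the input exactly, which is a quick sanity check that the tiling itself introduces no seams; any seam in a real run then comes from the per-patch predictions disagreeing in the overlap.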
cs.CV / 23 / 2604.25213

When the Forger Is the Judge: GPT-Image-2 Cannot Recognize Its Own Faked Documents

当伪造者是裁判:GPT-Image-2无法识别其自身伪造的文件
Wu, Jiaqi, Zhou, Yuchen, Ng, Dennis Tsang, Shen, Xingyu, Zewde, Kidus, Raj, Ankit, Duong, Tommy, Ren, Simiao
Abstract
OpenAI's GPT-Image-2 has effectively erased the visual boundary between authentic and AI-edited document images: a single number on a receipt can be replaced in under a second for a few cents. We release AIForge-Doc v2, a paired dataset of 3,066 GPT-Image-2 document forgeries with pixel-precise masks in DocTamper-compatible format, and benchmark four lines of defence: human inspectors (N=120, n=365 pair-votes via the public 2AFC site CanUSpotAI.com), TruFor (generic forensic), DocTamper (qcf-568, document-specific), and the same GPT-Image-2 model as a zero-shot self-judge -- asked, to avoid the trivial "image is mostly real" reading, whether any region was generated or edited by an AI image model. Human 2AFC accuracy is 0.501, indistinguishable from chance: even side-by-side, inspectors cannot tell GPT-Image-2 receipt forgeries from authentic counterparts. The three computational judges sit only modestly above (TruFor 0.599, DocTamper 0.585, self-judge 0.532). The self-judge fails consistently, not by chance: across five prompt strategies and four policies for handling ambiguous responses, AUC never rises above 0.59. To rule out the possibility that the two forensic detectors are broken on our source domain rather than blind to AI inpainting, we calibrate each on a same-domain traditional-tampering set built for its training distribution: TruFor reaches AUC 0.962 on cross-camera splicing of our dataset, DocTamper reaches 0.852 on cross-document OCR-token splicing with two-pass JPEG re-encoding. Both retain near-published performance on traditional tampering; switching to GPT-Image-2 inpainting drops AUC by 0.27-0.36 (0.962->0.599 TruFor; 0.852->0.585 DocTamper), isolating a detection gap specific to GPT-Image-2 inpainting. We release the dataset, pipeline, four-judge protocol, and calibration sets.
Chinese Translation
OpenAI的GPT-Image-2有效地消除了真实与AI编辑文档图像之间的视觉边界:在不到一秒的时间内,可以用几美分替换收据上的一个数字。我们发布了AIForge-Doc v2,这是一个包含3,066个GPT-Image-2文档伪造品的配对数据集,具有与DocTamper兼容的像素精确掩码,并对四道防线进行了基准测试:人工检查员(N=120,通过公开的2AFC网站CanUSpotAI.com收集n=365次成对投票),TruFor(通用取证),DocTamper(qcf-568,文档专用),以及作为零样本自我判断者的同一GPT-Image-2模型:为避免“图像大部分是真实的”这种取巧的解读,我们询问它是否有任何区域由AI图像模型生成或编辑。人工2AFC的准确率为0.501,与随机猜测无异:即使并排比较,检查员也无法区分GPT-Image-2的收据伪造品与真实的对应物。三种计算判断者的表现仅略高于随机水平(TruFor 0.599,DocTamper 0.585,自我判断者0.532)。自我判断者的失败是一致的,而非偶然:在五种提示策略和四种处理模糊响应的策略下,AUC从未超过0.59。为了排除两个取证检测器只是在我们的源域上失效、而非对AI修补视而不见的可能性,我们在为其训练分布构建的同域传统篡改集上对每个检测器进行了校准:TruFor在我们数据集的跨相机拼接上达到了AUC 0.962,DocTamper在带两次JPEG重新编码的跨文档OCR标记拼接上达到了0.852。两者在传统篡改上都保持了接近已发表的性能;切换到GPT-Image-2修补则使AUC下降了0.27-0.36(0.962->0.599 TruFor;0.852->0.585 DocTamper),从而分离出一个特定于GPT-Image-2修补的检测差距。我们发布了数据集、流程、四判断者协议和校准集。
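The headline numbers in this abstract can be checked with a few lines of arithmetic. The AUC values, the 2AFC accuracy, and n=365 are taken from the abstract; the normal-approximation test below is an added illustration that treats the 365 pair-votes as independent coin flips, which the abstract does not claim.

```python
import math

# AUC on traditional tampering vs. GPT-Image-2 inpainting, from the abstract.
auc = {"TruFor": (0.962, 0.599), "DocTamper": (0.852, 0.585)}
drops = {name: round(trad - ai, 3) for name, (trad, ai) in auc.items()}

# Human 2AFC accuracy against the 0.5 chance level.
n_votes = 365
accuracy = 0.501
# Standard error of a proportion under the chance hypothesis p = 0.5
# (assumes independent votes, an assumption made only for this sketch).
std_error = math.sqrt(0.5 * 0.5 / n_votes)
z_score = (accuracy - 0.5) / std_error
```

The computed drops of 0.363 and 0.267 match the stated range of 0.27 to 0.36, and the z-score is far below the usual 1.96 threshold, consistent with the claim that human accuracy is indistinguishable from chance.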
cs.CV / 24 / 2604.25231

DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams

DRAGON:基于证据的图表视觉推理基准
Iyengar, Anirudh Iyengar Kaniyar Narayana, Kumar, Tampu Ravi, Najpande, Gaurav, Suri, Manan, Manocha, Dinesh, Mathur, Puneet, Gupta, Vivek
Abstract
Diagram question answering (DQA) requires models to interpret structured visual representations such as charts, maps, infographics, circuit schematics, and scientific diagrams. Recent vision-language models (VLMs) often achieve high answer accuracy on these tasks, yet correct answers do not guarantee that models ground their reasoning in the diagram regions that support the prediction. Models may instead rely on textual correlations or dataset artifacts without identifying the visual evidence required to verify the answer. This limitation prevents reliable evaluation of diagram reasoning and reduces interpretability. We introduce DRAGON, a benchmark for evaluating evidence-grounded visual reasoning in diagrams. Given a diagram, a question, and the correct answer, a model must predict bounding boxes that correspond to the visual elements required to justify the answer. These evidence regions may include answer-bearing components, textual labels, legends, axes, connectors, and other supporting structures involved in the reasoning process. The DRAGON dataset contains 11,664 annotated question instances collected from six diagram QA datasets: ChartQA, Circuit-VQA, InfographicsVQA, MapIQ, MapWise, and AI2D. We release a 2,445-instance benchmark test set with human-verified reasoning evidence annotations and a standardized evaluation framework. We evaluate eight recent VLMs and analyze their ability to localize reasoning evidence across diverse diagram domains. DRAGON enables systematic evaluation of diagram reasoning and supports future research on models that ground their predictions in visual evidence.
Chinese Translation
图表问答(DQA)要求模型解读结构化的视觉表示,如图表、地图、信息图、电路原理图和科学图表。近期的视觉-语言模型(VLMs)在这些任务上通常能达到较高的答案准确率,但正确的答案并不保证模型的推理是基于支持预测的图表区域。模型可能依赖于文本相关性或数据集伪影,而未能识别验证答案所需的视觉证据。这一局限性妨碍了图表推理的可靠评估,并降低了可解释性。我们引入了DRAGON,一个用于评估图表中基于证据的视觉推理的基准。给定一个图表、一个问题和正确答案,模型必须预测与支持答案的视觉元素相对应的边界框。这些证据区域可能包括答案相关的组件、文本标签、图例、坐标轴、连接器以及推理过程中涉及的其他支持结构。DRAGON数据集包含从六个图表问答数据集中收集的11,664个标注问题实例:ChartQA、Circuit-VQA、InfographicsVQA、MapIQ、MapWise和AI2D。我们发布了一个包含2,445个实例的基准测试集,配有经过人工验证的推理证据标注和标准化的评估框架。我们评估了八个近期的VLM,并分析它们在不同图表领域中定位推理证据的能力。DRAGON使得图表推理的系统评估成为可能,并支持未来在视觉证据基础上进行预测的模型的研究。
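Since the task in this abstract is to predict bounding boxes for evidence regions, a scorer needs some way to compare predicted and annotated boxes. The sketch below uses intersection-over-union (IoU) with a 0.5 match threshold. Both are common conventions assumed here for illustration; the abstract does not state which metric DRAGON's evaluation framework uses, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def evidence_recall(predicted, gold, threshold=0.5):
    """Fraction of gold evidence boxes matched by at least one prediction."""
    if not gold:
        return 0.0
    hits = sum(any(iou(p, g) >= threshold for p in predicted) for g in gold)
    return hits / len(gold)
```

A recall-style score fits the setting because an answer is only verifiable if every supporting region (label, legend, axis) is found; a precision counterpart would be needed to penalize models that box the whole diagram.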
cs.CV / 25 / 2604.25255

Personalized Cross-Modal Emotional Correlation Learning for Speech-Preserving Facial Expression Manipulation

个性化跨模态情感关联学习用于保留语音的面部表情操控
Chen, Tianshui, Zhu, Yujie, Lin, Jianman, Yang, Zhijing, Qing, Chunmei, Gao, Feng, Lin, Liang
Abstract
Speech-preserving facial expression manipulation (SPFEM) aims to enhance human expressiveness without altering mouth movements tied to the original speech. A primary challenge in this domain is the scarcity of paired data, namely aligned frames of the same individual with identical speech but different expressions, which impedes direct supervision for emotional manipulation. While current Visual-Language Models (VLMs) can extract aligned visual and semantic features, making them a promising source of supervision, their direct application is limited. To this end, we propose a Personalized Cross-Modal Emotional Correlation Learning (PCMECL) algorithm that refines VLM-based supervision through two major improvements. First, standard VLMs rely on a single generic prompt for each emotion, failing to capture expressive variations among individuals. PCMECL addresses this limitation by conditioning on individual visual information to learn personalized prompts, thereby establishing more fine-grained visual-semantic correlations. Second, even with personalization, inherent discrepancies persist between the visual and semantic feature distributions. To bridge this modality gap, PCMECL employs feature differencing to correlate the modalities, providing more precisely aligned supervision by matching the change in visual features to the change in semantic features. As a plug-and-play module, PCMECL can be seamlessly integrated into existing SPFEM models. Extensive experiments across various datasets demonstrate the superior efficacy of our algorithm.
Chinese Translation
保留语音的面部表情操控(SPFEM)旨在增强人类的表现力,而不改变与原始语音相关的口部运动。该领域的主要挑战是缺乏配对数据,即同一人具有相同语音但不同表情的对齐帧,这阻碍了情感操控的直接监督。尽管当前的视觉-语言模型(VLMs)能够提取对齐的视觉和语义特征,使其成为有前景的监督来源,但其直接应用受到限制。为此,我们提出了一种个性化跨模态情感关联学习(PCMECL)算法,通过两个主要改进来优化基于VLM的监督。首先,标准VLM依赖于每种情感的单一通用提示,未能捕捉个体之间的表现差异。PCMECL通过基于个体视觉信息进行条件化,学习个性化提示,从而建立更细致的视觉-语义关联。其次,即使在个性化的情况下,视觉和语义特征分布之间仍然存在固有差异。为了弥合这种模态差距,PCMECL采用特征差分来关联模态,通过将视觉特征的变化与语义特征的变化匹配,提供更精确的对齐监督。作为一个即插即用模块,PCMECL可以无缝集成到现有的SPFEM模型中。通过在多个数据集上的广泛实验,证明了我们算法的优越效果。
cs.CV / 26 / 2604.25273

Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

应对大型多模态模型中的视觉忽视和语义漂移以增强跨模态检索
Zhang, Guosheng, Liu, Linkai, Wang, Keyao, Yue, Haixiao, Tan, Zhiwen, Tan, Xiao
Abstract
Despite significant progress in Unified Multimodal Retrieval (UMR) powered by Large Multimodal Models (LMMs), existing embedding methods primarily focus on sample-level objectives via contrastive learning while overlooking the crucial subject-level semantics. This limitation hinders the model's ability to group semantically coherent subjects in complex multimodal queries, manifesting as semantic alignment deviation--where models fail to accurately localize salient text-referred regions in visual content. Moreover, without explicit guidance to model salient visual subjects, LMMs tend to over-rely on textual cues, resulting in visual modality neglect and suboptimal utilization of visual knowledge. To this end, we propose Salient Subject-Aware Multimodal Embedding (SSA-ME), a novel framework designed to enhance fine-grained representation learning through saliency-aware modeling. SSA-ME leverages LMMs and visual experts to identify and emphasize salient visual concepts in image-text pairs, and introduces a saliency-guided objective to better align cross-modal attention with semantically meaningful regions. Additionally, a feature regeneration module recalibrates visual features based on the derived saliency maps, ensuring a balanced and semantically coherent integration across modalities. Extensive experiments show that our method achieves state-of-the-art performance on the MMEB benchmark, demonstrating that incorporating subject-level modeling substantially improves multimodal retrieval. Comprehensive qualitative analyses further illustrate the interpretability and effectiveness of our approach.
Chinese Translation
尽管由大型多模态模型(LMMs)驱动的统一多模态检索(UMR)取得了显著进展,但现有的嵌入方法主要通过对比学习关注样本级目标,而忽视了关键的主题级语义。这一局限性妨碍了模型在复杂多模态查询中对语义一致主题的分组能力,表现为语义对齐偏差——模型未能准确定位视觉内容中显著的文本引用区域。此外,在没有明确指导模型显著视觉主题的情况下,LMMs往往过度依赖文本线索,导致视觉模态的忽视和视觉知识的次优利用。为此,我们提出了显著主题感知多模态嵌入(SSA-ME),这是一个旨在通过显著性感知建模增强细粒度表示学习的新框架。SSA-ME利用LMMs和视觉专家识别并强调图像-文本对中的显著视觉概念,并引入显著性引导目标,以更好地将跨模态注意力与语义上有意义的区域对齐。此外,特征再生模块根据导出的显著性图重新校准视觉特征,确保跨模态的平衡和语义一致的整合。大量实验表明,我们的方法在MMEB基准上达到了最先进的性能,证明了纳入主题级建模显著改善了多模态检索。全面的定性分析进一步展示了我们方法的可解释性和有效性。
cs.CV / 27 / 2604.25276

OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

OmniVTG:开放世界视频时间定位的大规模数据集和训练范式
Zheng, Minghang, Yin, Zihao, Yang, Yi, Peng, Yuxin, Liu, Yang
Abstract
Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). Our OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than direct grounding and design a caption-centric data engine to prompt MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised finetuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs' video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction Chain-of-Thought (CoT) training paradigm. We train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT finetuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding in our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks. Code is available at https://github.com/oceanflowlab/OmniVTG.
Chinese Translation
视频时间定位(Video Temporal Grounding, VTG)是从文本查询中定位视频片段的任务,在开放世界环境中由于数据集规模和语义多样性的限制而面临挑战,导致常见概念与稀有概念之间的性能差距。为了解决这些限制,我们提出了OmniVTG,一个用于开放世界VTG的新型大规模数据集,并配合自我纠正思维链(Self-Correction Chain-of-Thought, CoT)训练范式,旨在增强多模态大型语言模型(Multimodal Large Language Models, MLLMs)的定位能力。我们的OmniVTG是通过一种新颖的语义覆盖迭代扩展流程构建的,该流程首先识别现有数据集中词汇的缺口,并收集很可能包含这些目标概念的视频。为了确保高质量的标注,我们基于现代MLLMs更擅长密集描述(dense captioning)而非直接定位这一观察,设计了一个以描述为中心的数据引擎,以提示MLLMs生成密集的、带时间戳的描述。除了数据集之外,我们观察到简单的监督微调(Supervised Finetuning, SFT)是不够的,因为稀有概念与常见概念之间的性能差距仍然存在。我们发现MLLMs的视频理解能力显著超过其直接定位能力。基于此,我们提出了一种自我纠正思维链(CoT)训练范式。我们训练MLLM首先进行预测,然后利用其理解能力反思并完善自身的预测。该能力通过SFT、CoT微调和强化学习的三阶段流程来培养。大量实验表明,我们的方法不仅在OmniVTG数据集的开放世界定位上表现出色,还在四个现有VTG基准上实现了最先进的零样本性能。代码可在https://github.com/oceanflowlab/OmniVTG获取。
cs.CV / 28 / 2604.25299

The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

思维像素:多模态扩散潜变量中的递归稀疏推理
Sun, Yuwei, Yao, Yuxuan, Li, Hui, Zhu, Siyu
Abstract
Diffusion models have achieved success in high-fidelity data synthesis, yet their capacity for more complex, structured reasoning like text following tasks remains constrained. While advances in language models have leveraged strategies such as latent reasoning and recursion to enhance text understanding capabilities, extending these to multimodal text-to-image generation tasks is challenging due to the continuous and non-discrete nature of visual tokens. To tackle this problem, we draw inspiration from modular human cognition and propose a recursive, sparse mixture-of-experts framework integrated into conventional diffusion models. Our approach introduces a recursive component within joint attention layers that iteratively refines visual tokens over multiple latent steps while efficiently sharing parameters via sparse selection of neural modules. At each step, a gating network is devised to dynamically select specialized neural modules, conditioned on the current visual tokens, the diffusion timestep, and the conditioning information. Comprehensive evaluation on class-conditioned ImageNet image generation tasks and additional studies on the GenEval and DPG benchmark demonstrate the superiority of the proposed method in enhancing model image generation performance.
Chinese Translation
扩散模型在高保真数据合成方面取得了成功,但其在更复杂、结构化推理(如文本跟随任务)方面的能力仍然有限。尽管语言模型的进展利用了潜在推理和递归等策略来增强文本理解能力,但由于视觉标记的连续性和非离散性,将这些策略扩展到多模态文本到图像生成任务中仍然具有挑战性。为了解决这个问题,我们从模块化人类认知中获得灵感,提出了一种递归的稀疏专家混合框架,并将其整合到传统的扩散模型中。我们的方法在联合注意力层中引入了一个递归组件,该组件在多个潜在步骤中迭代地细化视觉标记,同时通过稀疏选择神经模块有效地共享参数。在每一步中,设计了一个门控网络,以动态选择专门的神经模块,条件是当前的视觉标记、扩散时间步和条件信息。在对类条件的ImageNet图像生成任务的全面评估以及对GenEval和DPG基准的额外研究中,证明了所提方法在提升模型图像生成性能方面的优越性。
cs.CV / 29 / 2604.25300

DenseScout: Algorithm-System Co-design for Budgeted Tiny Object Selection on Edge Platforms

DenseScout:边缘平台上预算有限的小物体选择的算法-系统协同设计
Zhouzhi, Xiong, Zeng, Zimo, Chen, Yi, Xu, Shuqi, Yan, Yunfeng, Qi, Donglian
Abstract
Deploying tiny object perception on edge platforms is challenging because practical systems must satisfy both strict compute budgets and end-to-end latency constraints. A common strategy is to first select a small number of candidate patches from a high-resolution image and then apply downstream processing only to the selected regions. However, existing detector-based frontends are not well aligned with this setting: strong offline detection accuracy does not necessarily yield effective low-budget patch prioritization, nor does it guarantee usable performance once transport and inference delays are considered. In this work, we study budgeted tiny object selection on edge platforms from a joint algorithm--system perspective. We present DenseScout, a lightweight dense-response selector with only 1.01M parameters, which directly ranks candidate patch locations from a high-resolution scene via a lightweight proxy input and is better aligned with low-budget tiny-object prioritization than detector-style frontends. To bridge offline selector quality and deployable utility, we further develop a transport-aware runtime realization on heterogeneous edge devices and adopt QoS-constrained recall, which counts a target as successfully perceived only if it is covered by the selected regions and the end-to-end processing finishes before the deadline. Experiments show that DenseScout consistently outperforms detector-based baselines in offline budgeted patch-selection evaluation, especially in low-budget regimes, while cross-platform results on RK3588 and Jetson Orin NX show that deployable performance depends jointly on selector quality and runtime realization efficiency. These results suggest that edge tiny object perception should be optimized as an algorithm--system co-design problem rather than as isolated model selection.
Chinese Translation
在边缘平台上部署小物体感知面临挑战,因为实际系统必须满足严格的计算预算和端到端延迟限制。一种常见策略是首先从高分辨率图像中选择少量候选区域,然后仅对选定区域应用后续处理。然而,现有的基于检测器的前端与这种设置并不完全匹配:强大的离线检测精度并不一定能有效地进行低预算的区域优先级排序,也不能保证在考虑传输和推理延迟后仍能提供可用的性能。在本研究中,我们从算法-系统的联合视角研究边缘平台上的预算有限的小物体选择。我们提出了DenseScout,这是一种轻量级的密集响应选择器,仅包含1.01M参数,它通过轻量级的代理输入直接对高分辨率场景中的候选区域位置进行排名,并且比基于检测器的前端更好地与低预算小物体优先级排序对齐。为了弥合离线选择器质量与可部署效用之间的差距,我们进一步开发了一种在异构边缘设备上的传输感知运行时实现,并采用了QoS约束的召回策略,仅当目标被选定区域覆盖且端到端处理在截止时间之前完成时,才将其视为成功感知。实验表明,DenseScout在离线预算区域选择评估中始终优于基于检测器的基线,尤其是在低预算情况下,而在RK3588和Jetson Orin NX上的跨平台结果显示,可部署性能共同依赖于选择器质量和运行时实现效率。这些结果表明,边缘小物体感知应作为算法-系统协同设计问题进行优化,而不是孤立的模型选择。
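The QoS-constrained recall defined in this abstract (a target counts only if a selected region covers it and end-to-end processing meets the deadline) can be written down directly. The representation is an assumption for illustration: targets are reduced to centre points, patches are axis-aligned boxes, and one latency value applies to the whole frame. The function names are hypothetical.

```python
def covered(target, patch):
    """True if the target centre (x, y) lies inside patch (x1, y1, x2, y2)."""
    tx, ty = target
    x1, y1, x2, y2 = patch
    return x1 <= tx < x2 and y1 <= ty < y2

def qos_recall(targets, selected_patches, latency_ms, deadline_ms):
    """Recall that only credits targets when the frame meets its deadline."""
    if not targets:
        return 0.0
    if latency_ms > deadline_ms:  # late results count as misses
        return 0.0
    hits = sum(any(covered(t, p) for p in selected_patches) for t in targets)
    return hits / len(targets)
```

The same selection yields a recall of 2/3 when it finishes in time and 0 when it misses the deadline, which is why the abstract argues that selector quality and runtime efficiency have to be evaluated jointly.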
cs.CV / 30 / 2604.25310

Rapid tracking through strongly scattering media with physics-informed neuromorphic speckle analysis

通过强散射介质的快速跟踪:物理信息驱动的神经形态散斑分析
Cao, Yuqing, Zhu, Shuo, Chen, Rongzhou, Chen, Jingyan, Chen, Ni, Lam, Edmund Y.
Abstract
This work addresses the critical problem of tracking fast-moving objects through strongly scattering media in a low-light environment. Different from existing approaches that use frame-based cameras with fixed exposure times, which trade off signal-to-noise ratio for temporal resolution, we introduce computational neuromorphic tracking (CNT), a physics-informed framework that combines asynchronous event sensing with task-driven speckle analysis for robust motion estimation. We formulate the neuromorphic speckle aggregation as a spatiotemporal speckle representation, jointly optimizing the temporal and spatial parameters to maximize tracking stability under extreme conditions. Extensive experiments demonstrate that our method enables robust motion tracking of 10x faster motion and under 10x dimmer illumination compared to conventional systems. These improvements significantly broaden the operational regime for tracking through scattering media, providing an efficient and scalable solution for demanding scenarios involving rapid motion and low-light conditions.
Chinese Translation
本研究解决了在低光环境中通过强散射介质跟踪快速移动物体的关键问题。与现有使用固定曝光时间的帧基摄像机的方法不同,这些方法在时间分辨率与信噪比之间进行权衡,我们引入了计算神经形态跟踪(Computational Neuromorphic Tracking, CNT),这是一个物理信息驱动的框架,结合了异步事件感知与任务驱动的散斑分析,以实现稳健的运动估计。我们将神经形态散斑聚合公式化为时空散斑表示,联合优化时间和空间参数,以最大化在极端条件下的跟踪稳定性。大量实验表明,我们的方法能够在比传统系统快10倍的运动和在比传统系统暗10倍的光照条件下实现稳健的运动跟踪。这些改进显著拓宽了通过散射介质进行跟踪的操作范围,为涉及快速运动和低光条件的苛刻场景提供了高效且可扩展的解决方案。
cs.CV / 31 / 2604.25314

Golden RPG: Confidence-Adaptive Region-Aware Noise for Compositional Text-to-Image Generation

黄金RPG:自适应置信度区域感知噪声用于组合文本到图像生成
Li, Hao
Abstract
Compositional text-to-image (T2I) generation requires a model to honour multiple sub-prompts that describe distinct image regions. Recent work shows that the starting noise of a diffusion model carries significant semantic information: "golden" noise predicted from text can substantially raise prompt fidelity. We observe that this noise prediction is, however, fundamentally global: the same network is asked to summarise a long, multi-region prompt with a single text embedding, which becomes the bottleneck whenever the prompt describes scenes with spatially-separated entities. We introduce Golden RPG, a region-aware noise predictor that extends a frozen NPNet with two trainable additions: (i) a per-region FiLM adapter that reshapes the predicted noise according to each sub-prompt; and (ii) a Region Cross-Attention layer injected between two stages of the Swin backbone, allowing different spatial locations to attend to different sub-prompt tokens. To prevent the regional conditioning from degrading samples whose prompts are already easy, we further propose a Confidence-Adaptive Blending head that dynamically predicts, per sample, how strongly the regional signal should override the global signal. We evaluate on the original RPG benchmark (20 prompts, 100 samples) and on four multi-region categories of T2I-CompBench (1,200 images, six competing methods). Golden RPG achieves the highest Cross-Region-Coherence score on every category, while matching the strongest baselines on absolute CLIP-Score and CLIP-IQA. A paired user study further shows a ~67% preference over the strongest baseline. The adapter contains ~2M trainable parameters and adds only 0.6 s of inference overhead on top of SDXL.
Chinese Translation
组合文本到图像(T2I)生成要求模型能够处理描述不同图像区域的多个子提示。最近的研究表明,扩散模型的起始噪声携带了重要的语义信息:从文本预测的“黄金”噪声可以显著提高提示的保真度。然而,我们观察到这种噪声预测在本质上是全局性的:同一个网络被要求用单一的文本嵌入来总结一个长的、多区域的提示,这在提示描述空间上分离的实体时成为瓶颈。我们提出了黄金RPG,一种区域感知噪声预测器,它在一个冻结的NPNet基础上扩展了两个可训练的部分:(i)一个每区域的FiLM适配器,根据每个子提示重新塑造预测的噪声;(ii)一个区域交叉注意力层,注入在Swin主干的两个阶段之间,使得不同的空间位置能够关注不同的子提示标记。为了防止区域条件化降低那些提示已经简单的样本,我们进一步提出了一个自适应置信度混合头,根据每个样本动态预测区域信号应该多强地覆盖全局信号。我们在原始RPG基准(20个提示,100个样本)和四个多区域类别的T2I-CompBench(1,200张图像,六种竞争方法)上进行了评估。黄金RPG在每个类别中都达到了最高的跨区域一致性得分,同时在绝对CLIP得分和CLIP-IQA上与最强基线相匹配。配对用户研究进一步显示出对最强基线约67%的偏好。该适配器包含约200万可训练参数,并在SDXL的基础上仅增加了0.6秒的推理开销。
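Two of the components named in this abstract, the per-region FiLM adapter and the confidence-adaptive blend, reduce to simple arithmetic once the learned parts are set aside. The sketch below is a toy under stated assumptions: noise is a flat list, each region has one scalar scale and shift (standard FiLM applies a scale and shift, but the per-region scalar form is a simplification), and the confidence is passed in rather than predicted by a head. None of the names come from the paper.

```python
def film(values, gamma, beta):
    """Feature-wise linear modulation: scale by gamma, then shift by beta."""
    return [gamma * v + beta for v in values]

def region_aware_noise(global_noise, region_ids, film_params, confidence):
    """Blend region-modulated noise with the global prediction.

    region_ids[i] names the region of element i; film_params maps each
    region to a hypothetical (gamma, beta) pair. confidence in [0, 1] is
    how strongly the regional signal overrides the global one.
    """
    regional = [
        film([value], *film_params[region])[0]
        for value, region in zip(global_noise, region_ids)
    ]
    return [
        confidence * r + (1.0 - confidence) * g
        for r, g in zip(regional, global_noise)
    ]
```

With confidence 0 the global noise passes through untouched, which mirrors the stated motivation for the blending head: prompts that are already easy should not be degraded by regional conditioning.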
cs.CV / 32 / 2604.25315

SaliencyDecor: Enhancing Neural Network Interpretability through Feature Decorrelation

SaliencyDecor:通过特征去相关性增强神经网络可解释性
Karkehabadi, Ali, Hassanpour, Jamshid, Homayoun, Houman, Sasan, Avesta
Abstract
Gradient-based saliency methods are widely used to interpret deep neural networks, yet they often produce noisy and unstable explanations that poorly align with semantically meaningful input features. We argue that a fundamental cause of this behavior lies in the geometry of learned representations: correlated feature dimensions diffuse attribution gradients across redundant directions, resulting in blurred and unreliable saliency maps. To address this issue, we identify feature correlation as a structural limitation of gradient-based interpretability and propose SaliencyDecor, a training framework that enforces feature decorrelation to improve attribution fidelity without modifying saliency methods or model architectures. By reshaping the feature space toward orthogonality, our approach promotes more concentrated gradient flow and improves the fidelity of saliency-based explanations. SaliencyDecor jointly optimizes classification, prediction consistency under feature masking, and a decorrelation regularizer, requiring no architectural changes or inference-time overhead. Extensive experiments across multiple benchmarks and architectures demonstrate that our method produces substantially sharper and more object-focused saliency maps while simultaneously improving predictive performance, achieving accuracy gains across the datasets. These results establish our method as a principled mechanism for enhancing both interpretability and accuracy, challenging the conventional trade-off between explanation quality and model performance.
Chinese Translation
基于梯度的显著性方法广泛用于解释深度神经网络,但它们往往产生噪声大且不稳定的解释,与语义上有意义的输入特征不匹配。我们认为,这种行为的根本原因在于学习表示的几何特性:相关的特征维度将归因梯度扩散到冗余的方向,导致显著性图模糊且不可靠。为了解决这个问题,我们将特征相关性视为基于梯度的可解释性的结构性限制,并提出了SaliencyDecor,一个训练框架,通过强制特征去相关性来提高归因的保真度,而无需修改显著性方法或模型架构。通过将特征空间重塑为正交性,我们的方法促进了更集中的梯度流,并改善了基于显著性的解释的保真度。SaliencyDecor联合优化分类、特征掩蔽下的预测一致性和去相关性正则化器,无需架构更改或推理时的额外开销。在多个基准和架构上的广泛实验表明,我们的方法产生了显著更清晰且更聚焦于对象的显著性图,同时提高了预测性能,在各个数据集上实现了准确率的提升。这些结果确立了我们的方法作为增强可解释性和准确性的原则性机制,挑战了解释质量与模型性能之间的传统权衡。
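A decorrelation regularizer of the kind this abstract describes can be sketched as a penalty on the off-diagonal entries of the feature correlation matrix. The exact form used by SaliencyDecor is not given in the abstract; the mean squared off-diagonal Pearson correlation below is one common choice, assumed here purely for illustration, and the function name is hypothetical.

```python
import math

def decorrelation_penalty(features):
    """Mean squared off-diagonal Pearson correlation over a batch.

    `features` is a list of samples, each a list of d feature values.
    Returns 0.0 for perfectly uncorrelated dimensions, 1.0 when every
    pair of dimensions is perfectly (anti-)correlated.
    """
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    norms = [math.sqrt(sum(row[j] ** 2 for row in centered)) for j in range(d)]
    total, pairs = 0.0, 0
    for i in range(d):
        for j in range(i + 1, d):
            cov = sum(row[i] * row[j] for row in centered)
            denom = norms[i] * norms[j]
            corr = cov / denom if denom > 0 else 0.0  # constant dims: no penalty
            total += corr ** 2
            pairs += 1
    return total / pairs if pairs else 0.0
```

In training, a term like this would be added to the classification and masking-consistency losses with a weighting coefficient; driving it toward zero pushes feature dimensions toward orthogonality, which is the mechanism the abstract credits for sharper saliency maps.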
cs.CV / 33 / 2604.25316

Towards Robust Deep Learning-based Rumex Obtusifolius Detection from Drone Images

基于深度学习的钝叶酸模(Rumex obtusifolius)无人机图像鲁棒检测研究
Schrag, Fabian Dionys, Turkoglu, Mehmet Ozgur, Schindler, Konrad, Stoop, Ralph Lukas
Abstract
Domain adaptation (DA) addresses the challenge of transferring a machine learning model trained on a source domain to a target domain with a different data distribution. In this work, we study DA for the task of Rumex obtusifolius (Rumex) image classification. We train models on a published, ground vehicle-based dataset (source) and evaluate their performance on a custom target dataset acquired by unmanned aerial vehicles (UAVs). We find that Convolutional Neural Network (CNN) models, specifically ResNets, generalize poorly to the target domain, even after fine-tuning on the source data. Applying moment-matching and maximum classifier discrepancy, two established DA techniques, substantially improves target-domain performance. However, Vision Transformer (ViT) models pretrained with self-supervised objectives (DINOv2, DINOv3) handle domain shifts intrinsically well, surpassing even moment-matching-trained ResNets, likely due to the rich, general-purpose representations acquired during large-scale pretraining. Using ViTs fine-tuned on the source dataset, we demonstrate high classification performances in the range of F1=0.8 on our target dataset. To support further research on DA for weed detection in grassland systems, we publicly release our UAV-based target dataset AGSMultiRumex, comprising data from 15 flights over Swiss meadows.
Chinese Translation
领域适应(Domain Adaptation, DA)解决了将训练于源领域的机器学习模型迁移到数据分布不同的目标领域的挑战。在本研究中,我们探讨了钝叶酸模(Rumex obtusifolius)图像分类任务的领域适应。我们在一个已发布的基于地面车辆的数据集(源)上训练模型,并在由无人机(UAVs)获取的自定义目标数据集上评估其性能。我们发现卷积神经网络(Convolutional Neural Network, CNN)模型,特别是残差网络(ResNets),在目标领域的泛化能力较差,即使在源数据上进行了微调。应用矩匹配(moment-matching)和最大分类器差异(maximum classifier discrepancy)这两种成熟的领域适应技术,可显著提高目标领域的性能。然而,使用自监督目标(DINOv2, DINOv3)预训练的视觉变换器(Vision Transformer, ViT)模型本身就能很好地应对领域偏移,甚至超过了经过矩匹配训练的残差网络,这可能归因于在大规模预训练过程中获得的丰富通用表示。通过使用在源数据集上微调的ViTs,我们在目标数据集上展示了F1约为0.8的较高分类性能。为了支持草地系统中杂草检测的领域适应进一步研究,我们公开发布了基于无人机的目标数据集AGSMultiRumex,包含在瑞士草地上空15次飞行采集的数据。
cs.CV / 34 / 2604.25319

Edge-Cloud Collaborative Reconstruction via Structure-Aware Latent Diffusion for Downstream Remote Sensing Perception

基于结构感知潜在扩散的边缘-云协同重建用于下游遥感感知
Li, Yun, Li, Xianju
Abstract
The exponential surge in high-resolution remote sensing data faces a severe bottleneck in satellite-to-ground transmission. Limited downlink bandwidth forces the use of extreme high-ratio compression, which irreversibly destroys high-frequency structural details essential for downstream machine perception tasks like object detection. While current super-resolution techniques attempt to recover these details, regression-based methods often yield over-smoothed textures, and generative diffusion models frequently introduce structural hallucinations that mislead detection systems. To address this trade-off, we propose the Structure-Aware Latent Diffusion (SALD) framework, an asymmetric edge-cloud collaborative SR system. At the resource-constrained edge, the system decouples imagery into a highly compressed low-frequency payload and a lightweight soft structural prior. Transmitting this decoupled representation minimizes bandwidth consumption. On the powerful cloud side, we introduce a Structure-Gated Large Kernel (SGLK) module and a Semantic-Guidance Engine (SGE) within the diffusion backbone. These modules leverage the transmitted structural priors to gate large-kernel convolutions, effectively capturing long-range dependencies inherent in aerial scenes while actively suppressing generative hallucinations. Extensive experiments on both the MSCM and UCMerced datasets demonstrate that, even under extreme bandwidth constraints, SALD achieves superior perceptual quality (LPIPS) and significantly enhances downstream performance in both scene classification and small-target detection.
Chinese Translation
高分辨率遥感数据的指数增长在卫星到地面的传输中面临严重瓶颈。有限的下行带宽迫使使用极高压缩比,这不可逆转地破坏了下游机器感知任务(如目标检测)所需的高频结构细节。尽管当前的超分辨率技术试图恢复这些细节,但基于回归的方法往往导致纹理过于平滑,而生成扩散模型则常常引入结构幻觉,误导检测系统。为了解决这一权衡问题,我们提出了结构感知潜在扩散(Structure-Aware Latent Diffusion, SALD)框架,这是一个不对称的边缘-云协同超分辨率系统。在资源受限的边缘,系统将图像解耦为高度压缩的低频负载和轻量级的软结构先验。传输这种解耦表示最小化了带宽消耗。在强大的云端,我们在扩散主干中引入了结构门控大核(Structure-Gated Large Kernel, SGLK)模块和语义引导引擎(Semantic-Guidance Engine, SGE)。这些模块利用传输的结构先验来门控大核卷积,有效捕捉空中场景中固有的长程依赖,同时积极抑制生成幻觉。在MSCM和UCMerced数据集上的广泛实验表明,即使在极端带宽限制下,SALD也能实现卓越的感知质量(LPIPS),并显著提升下游场景分类和小目标检测的性能。
cs.CV / 35 / 2604.25322

Assessment of the quantitative impact of occlusal positioning splints on temporomandibular joint conditions

咬合定位夹板对颞下颌关节状况的定量影响评估
Tomaka, Agnieszka Anna, Domino, Krzysztof, Pojda, Dariusz, Tarnawski, Michał
Abstract
A computational method for quantitative analysis of temporomandibular joint (TMJ) configuration using occlusal positioning splints is proposed and demonstrated. The method models a positioning splint as a physical realization of a predefined rigid transformation of the mandible, derived from multimodal data, including CBCT, facial motion acquisition, and dental scans integrated within a common coordinate system. Splints corresponding to selected mandibular positions are designed and fabricated, and their positioning accuracy is evaluated using repeated scans of plaster models. Discrepancies are represented as error transformations and analyzed statistically in the space of rigid motions. The estimated transformations are propagated to segmented TMJ structures, enabling simulation-based evaluation of joint space changes. Transformation-based error analysis and surface distance metrics are used to quantify differences between planned and achieved configurations. The method enables indirect assessment of TMJ configuration using a single anatomical model and transformation data, reducing the need for repeated imaging across multiple mandibular positions. This study is intended as a methodological demonstration, supported by a clear step-by-step graphical presentation, and does not aim to provide clinical validation.
Chinese Translation
本文提出并展示了一种利用咬合定位夹板对颞下颌关节(TMJ)构型进行定量分析的计算方法。该方法将定位夹板建模为下颌的预定义刚性变换的物理实现,该变换源自多模态数据,包括锥形束计算机断层扫描(CBCT)、面部运动采集和牙齿扫描,这些数据整合在一个共同的坐标系统中。设计并制造了对应于选定下颌位置的夹板,并通过对石膏模型的重复扫描来评估其定位精度。差异以误差变换的形式表示,并在刚性运动空间中进行统计分析。估计的变换被传播到分割的TMJ结构上,从而实现基于模拟的关节间隙变化评估。基于变换的误差分析和表面距离度量用于量化计划与实际构型之间的差异。该方法通过使用单一解剖模型和变换数据间接评估TMJ构型,从而减少了在多个下颌位置上进行重复成像的需求。本研究旨在作为一种方法论的演示,辅以清晰的逐步图示,并不旨在提供临床验证。
cs.CV / 36 / 2604.25358

Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings

通过统一的语义-空间评估在封闭和开放环境中对布局引导的扩散模型进行基准测试
Parolari, Luca, Faccioli, Nicla, Ballan, Lamberto
Abstract
Evaluating layout-guided text-to-image generative models requires assessing both semantic alignment with textual prompts and spatial fidelity to prescribed layouts. Assessing layout alignment requires collecting fine-grained annotations, which is costly and labor-intensive. Consequently, current benchmarks rarely provide comprehensive layout evaluation and often remain limited in scale or coverage, making model comparison, ranking, and interpretation difficult. In this work, we introduce a closed-set benchmark (C-Bench) designed to isolate key generative capabilities while providing varying levels of complexity in both prompt structure and layout. To complement this controlled setting, we propose an open-set benchmark (O-Bench) that evaluates models using real-world prompts and layouts, offering a measure of semantic and spatial alignment in the wild. We further develop a unified evaluation protocol that combines semantic and spatial accuracy into a single score, ensuring consistent model ranking. Using our benchmarks, we conduct a large-scale evaluation of six state-of-the-art layout-guided diffusion models, totaling 319,086 generated and evaluated images. We establish a model ranking based on their overall performance and provide detailed breakdowns for text and layout alignment to enhance interpretability. Fine-grained analyses across scenarios and prompt complexities highlight the strengths and limitations of current models. Code is available at https://github.com/lparolari/cobench.
Chinese Translation
评估布局引导的文本到图像生成模型需要同时评估与文本提示的语义对齐和与规定布局的空间保真度。评估布局对齐需要收集细粒度的注释,这既昂贵又劳动密集。因此,目前的基准测试很少提供全面的布局评估,通常在规模或覆盖范围上有限,使得模型比较、排名和解释变得困难。在本研究中,我们引入了一个封闭集基准(C-Bench),旨在隔离关键的生成能力,同时在提示结构和布局中提供不同级别的复杂性。为了补充这一受控环境,我们提出了一个开放集基准(O-Bench),使用真实世界的提示和布局来评估模型,提供在实际环境中语义和空间对齐的衡量标准。我们进一步开发了一个统一的评估协议,将语义和空间准确性结合成一个单一评分,确保模型排名的一致性。利用我们的基准测试,我们对六个最先进的布局引导扩散模型进行了大规模评估,共生成和评估了319,086幅图像。我们根据模型的整体表现建立了排名,并提供了文本和布局对齐的详细分解,以增强可解释性。在不同场景和提示复杂性下的细粒度分析突显了当前模型的优缺点。代码可在 https://github.com/lparolari/cobench 获取。
cs.CV / 37 / 2604.25361

HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation

HuM-Eval:一种人本视频评估的粗到细框架
Zhang, Bingzi, Guan, Kaisi, Song, Ruihua
Abstract
Video generation models have developed rapidly in recent years, where generating natural human motion plays a pivotal role. However, accurately evaluating the quality of generated human motion video remains a significant challenge. Existing evaluation metrics primarily focus on global scene statistics, often overlooking fine-grained human details and consequently failing to align with human subjective preference. To bridge this gap, we propose HuM-Eval, a novel human-centric evaluation framework that adopts a coarse-to-fine strategy. Specifically, our framework first utilizes a Vision Language Model to perform a coarse assessment of global video quality. It then proceeds to a fine-grained analysis, using 2D pose to verify anatomical correctness and 3D human motion to evaluate motion stability. Extensive experiments demonstrate that HuM-Eval achieves an average human correlation of 58.2%, outperforming state-of-the-art baselines. Furthermore, we introduce HuM-Bench, a comprehensive benchmark comprising 1,000 diverse prompts, and conduct a detailed evaluation of existing text-to-video models, paving the way for next-generation human motion generation.
Chinese Translation
近年来,视频生成模型发展迅速,其中生成自然的人类动作发挥着关键作用。然而,准确评估生成的人类动作视频的质量仍然是一个重大挑战。现有的评估指标主要关注全局场景统计,往往忽视细粒度的人类细节,因此未能与人类的主观偏好对齐。为了解决这一问题,我们提出了HuM-Eval,这是一种新颖的人本评估框架,采用粗到细的策略。具体而言,我们的框架首先利用视觉语言模型对全局视频质量进行粗略评估。然后,进行细粒度分析,使用二维姿态验证解剖正确性,并使用三维人类动作评估动作稳定性。大量实验表明,HuM-Eval实现了58.2%的平均人类相关性,超越了最先进的基线。此外,我们还引入了HuM-Bench,这是一个包含1,000个多样化提示的综合基准,并对现有的文本到视频模型进行了详细评估,为下一代人类动作生成铺平了道路。
cs.CV / 38 / 2604.25367

Self-DACE++: Robust Low-Light Enhancement via Efficient Adaptive Curve Estimation

Self-DACE++:通过高效自适应曲线估计实现稳健的低光照增强
Wen, Jianyu, Xie, Jun, Chen, Feng, Wang, Zhepeng, Wu, Chenhao, Zhang, Tong, Yu, Yixuan, Swierczynski, Piotr
Abstract
In this paper, we present Self-DACE++, an improved unsupervised and lightweight framework for Low-Light Image Enhancement (LLIE), building upon our previous Self-Reference Deep Adaptive Curve Estimation (Self-DACE). To better address the trade-off between computational efficiency and restoration quality, Self-DACE++ introduces enhanced Adaptive Adjustment Curves (AACs). These curves, governed by minimal trainable parameters, flexibly adjust the dynamic range while preserving the color fidelity, structural integrity, and naturalness of the enhanced images. To achieve an extremely lightweight architecture without sacrificing performance, we propose a randomized order training strategy coupled with a network fusion mechanism, which compresses the model into an efficient iterative inference structure. Furthermore, we formulate a physics-grounded objective function based on Retinex theory and incorporate a dedicated denoising module to effectively estimate and suppress latent noise in dark regions. Extensive qualitative and quantitative evaluations on multiple real-world benchmark datasets demonstrate that Self-DACE++ outperforms existing state-of-the-art methods, delivering superior enhancement quality with real-time inference capability. The code is available at https://github.com/John-Wendell/Self-DACE.
Chinese Translation
本文提出了Self-DACE++,一个改进的无监督轻量级低光照图像增强(LLIE)框架,基于我们之前的自参考深度自适应曲线估计(Self-DACE)。为了更好地解决计算效率与恢复质量之间的权衡,Self-DACE++引入了增强的自适应调整曲线(AACs)。这些曲线由最少的可训练参数控制,灵活地调整动态范围,同时保持增强图像的色彩保真度、结构完整性和自然性。为了实现极轻量的架构而不牺牲性能,我们提出了一种随机顺序训练策略,并结合网络融合机制,将模型压缩为高效的迭代推理结构。此外,我们基于Retinex理论制定了一个物理基础的目标函数,并结合专门的去噪模块,有效估计并抑制暗区的潜在噪声。在多个真实世界基准数据集上进行的广泛定性和定量评估表明,Self-DACE++优于现有的最先进方法,提供了卓越的增强质量和实时推理能力。代码可在 https://github.com/John-Wendell/Self-DACE 获取。
cs.CV / 39 / 2604.25370

GPT-Image-2 in the Wild: A Twitter Dataset of Self-Reported AI-Generated Images from the First Week of Deployment

真实场景中的GPT-Image-2:部署首周用户自述AI生成图像的Twitter数据集
Zewde, Kidus, Ren, Simiao, Shen, Xingyu, Wu, Jenny, Zhou, Yuchen, Duong, Tommy, Zhang, Zikang, Traister, Ethan
Abstract
The release of GPT-image-2 by OpenAI marks a watershed moment in AI-generated imagery: the boundary between photographic reality and synthetic content has never been more difficult to discern. We introduce the GPT-Image-2 Twitter Dataset, the first published dataset of GPT-image-2 generated images, sourced from publicly available Twitter/X posts in the immediate aftermath of the model's April 21, 2026 release. Leveraging the Twitter API v2 and a multi-stage curation pipeline spanning multilingual text heuristics (English, Japanese, and Chinese), browser-automated Twitter "Made with AI" badge verification, and model name variant matching, we curate 10,217 confirmed GPT-image-2 images from 27,662 collected records over a six-day window. We characterize the dataset across four analyses: CLIP-based zero-shot subject taxonomy, OCR text legibility (82.0% of images contain detectable text), face detection (59.2% of images, 22,583 total faces), and semantic clustering (137 CLIP ViT-L/14 clusters). A key negative result is that C2PA content credentials are systematically stripped by Twitter's CDN on upload, rendering cryptographic provenance verification infeasible for social-media-sourced AI images. The dataset and all curation code are released publicly.
Chinese Translation
OpenAI发布GPT-image-2标志着AI生成图像的一个分水岭时刻:摄影现实与合成内容之间的界限前所未有地难以辨别。我们介绍了GPT-Image-2 Twitter数据集,这是首个公开发布的GPT-image-2生成图像的数据集,数据来源于模型于2026年4月21日发布后不久的公开Twitter/X帖子。通过利用Twitter API v2和一个多阶段的筛选流程,该流程涵盖多语言文本启发式(英语、日语和中文)、浏览器自动化的Twitter“使用AI制作”徽章验证以及模型名称变体匹配,我们在六天的时间窗口内从收集的27,662条记录中筛选出10,217张确认的GPT-image-2图像。我们通过四项分析刻画该数据集:基于CLIP的零样本主题分类、OCR文本可读性(82.0%的图像包含可检测文本)、人脸检测(59.2%的图像,合计22,583张人脸)和语义聚类(137个CLIP ViT-L/14聚类)。一个关键的负面结果是,C2PA内容凭证在上传时被Twitter的CDN系统性地剥离,从而使社交媒体来源的AI图像的加密来源验证变得不可行。该数据集及所有筛选代码已公开发布。
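The counts quoted in the abstract imply a few derived figures that are easy to check. The following is a minimal sketch in Python, assuming the 59.2% face-detection share is measured over the 10,217 confirmed images (the abstract does not state the denominator explicitly). The variable names are illustrative and do not come from the dataset's curation code.

```python
# Figures quoted in the abstract.
collected_records = 27_662
confirmed_images = 10_217
face_image_share = 0.592   # assumed to be a share of the confirmed images
total_faces = 22_583

# Share of collected records that survived curation.
retention_rate = confirmed_images / collected_records

# Approximate number of images with at least one detected face.
images_with_faces = round(confirmed_images * face_image_share)

# Average number of faces per face-containing image.
faces_per_face_image = total_faces / images_with_faces

print(f"retention rate: {retention_rate:.1%}")
print(f"images with faces: {images_with_faces}")
print(f"faces per face-containing image: {faces_per_face_image:.2f}")
```

Under that assumption, roughly 37% of collected records are retained, and face-containing images average between three and four faces each.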
cs.CV / 40 / 2604.25376

CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation

CoRE:持续脑病灶分割的概念推理扩展
Chen, Qianqian, Liu, Anglin, Zhang, Jingyang, Zhang, Yudong
Abstract
Accurate brain lesion segmentation in MRI is vital for effective clinical diagnosis and treatment planning. Due to high annotation costs and strict data privacy regulations, universal models require employing Continual Learning (CL) to adapt to evolving clinical tasks without losing previously acquired knowledge. However, existing CL paradigms often suffer from capacity limits or redundant parameter growth, and even advanced dynamic methods rely mostly on image-perception strategies that struggle to handle the substantial pathological and multimodal heterogeneity inherent in brain imaging. To address this issue, we propose Concept-Reasoning Expansion (CoRE) framework, which establishes a joint decision-making mechanism by integrating visual features with structured concepts. Through the alignment of image tokens with a hierarchical concept library, CoRE simulates clinical reasoning to guide both interpretable expert routing and demand-based model growth. This collaborative process ensures model evolution is grounded in clinical priors, preventing redundant parameter expansion while maximizing knowledge reuse. Extensive evaluations across 12 sequential brain lesion MRI tasks demonstrate that CoRE achieves state-of-the-art performance and provides a high knowledge starting point for efficient future adaptation. Its superior few-shot transferability and clinical interpretability further validate its effectiveness in managing non-stationary clinical data streams. Our code will be released soon.
Chinese Translation
在MRI中准确的脑病灶分割对于有效的临床诊断和治疗规划至关重要。由于高昂的标注成本和严格的数据隐私法规,通用模型需要采用持续学习(Continual Learning, CL)来适应不断变化的临床任务,而不丢失先前获得的知识。然而,现有的CL范式往往受到容量限制或冗余参数增长的困扰,甚至先进的动态方法主要依赖于图像感知策略,这些策略难以处理脑成像中固有的显著病理和多模态异质性。为了解决这一问题,我们提出了概念推理扩展(Concept-Reasoning Expansion, CoRE)框架,该框架通过将视觉特征与结构化概念相结合,建立了一个联合决策机制。通过将图像标记与层次概念库对齐,CoRE模拟临床推理,以指导可解释的专家路由和基于需求的模型增长。这一协作过程确保模型的演变基于临床先验,防止冗余参数扩展,同时最大化知识重用。在12个连续脑病灶MRI任务中的广泛评估表明,CoRE实现了最先进的性能,并为未来高效适应提供了良好的知识起点。其卓越的少样本迁移能力和临床可解释性进一步验证了其在管理非平稳临床数据流中的有效性。我们的代码将很快发布。
cs.CV / 41 / 2604.25380

Benchmarking and Improving GUI Agents in High-Dynamic Environments

高动态环境中图形用户界面代理的基准测试与改进
Liu, Enqi, Pan, Liyuan, Gao, Zhi, Yang, Yan, Shi, Chenrui, Liu, Yang, Wu, Jingrong, Li, Qing
Abstract
Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process, where the key GUI state including important information for actions is often inadequately captured. To systematically explore this challenge, we introduce DynamicGUIBench, a comprehensive online GUI benchmark spanning ten applications and diverse interaction scenarios characterized by important interface changes between actions. Furthermore, we present DynamicUI, an agent designed for dynamic interfaces, which takes screen-recording videos of the interaction process as input and consists of three components: a dynamic perceiver, a refinement strategy, and a reflection. Specifically, the dynamic perceiver clusters frames of the GUI video, generates captions for the centroids, and iteratively selects the most informative frames as the salient dynamic context. Considering that there may be inconsistencies and noise between the selected frames and the textual context of the agent, the refinement strategy employs an action-conditioned filtering to refine thoughts to mitigate thought-action inconsistency and redundancy. Based on the refined agent trajectories, the reflection module provides effective and accurate guidance for further actions. Experiments on DynamicGUIBench demonstrate that DynamicUI significantly improves the performance in dynamic GUI environments, while maintaining competitive performance on other public benchmarks.
Chinese Translation
近期图形用户界面(GUI)代理的进展主要集中在监督微调(SFT)和强化学习(RL)等训练范式上。然而,高动态GUI环境的挑战仍然在很大程度上未被深入探索。现有代理通常依赖于每次行动后的单一截图进行决策,导致形成部分可观测(甚至不可观测)的马尔可夫决策过程,其中关键的GUI状态(包括执行动作所需的重要信息)往往未被充分捕捉。为系统性地探讨这一挑战,我们引入了DynamicGUIBench,这是一个涵盖十个应用程序和多种交互场景的综合在线GUI基准,特征是行动之间的重要界面变化。此外,我们提出了DynamicUI,这是一个为动态界面设计的代理,它将交互过程的屏幕录制视频作为输入,并由三个组件组成:动态感知器、精炼策略和反思模块。具体而言,动态感知器对GUI视频的帧进行聚类,为质心生成标题,并迭代选择最具信息量的帧作为显著的动态上下文。考虑到所选帧与代理的文本上下文之间可能存在不一致性和噪声,精炼策略采用基于动作的过滤来精炼思路,以减轻思维与行动之间的不一致性和冗余。基于精炼后的代理轨迹,反思模块为进一步的行动提供有效且准确的指导。在DynamicGUIBench上的实验表明,DynamicUI显著提高了在动态GUI环境中的性能,同时在其他公共基准上保持了竞争力。
cs.CV / 42 / 2604.25388

COMPASS: COmpact Multi-channel Prior-map And Scene Signature for Floor-Plan-Based Visual Localization

COMPASS:基于平面图的紧凑型多通道先验图和场景特征用于视觉定位
Shaheer, Muhammad, Fernandez-Cortizas, Miguel, Bikandi-Noya, Asier, Voos, Holger, Sanchez-Lopez, Jose Luis
Abstract
Architectural floor plans are widely available priors which contain not only geometry but also the semantic information of the environment, yet existing localization methods largely ignore this semantic information. To address this, we present COMPASS, an algorithm that exploits both geometric and semantic priors from floor plans to estimate the pose of a robot equipped with dual fisheye cameras. Inspired by scan context descriptor from LiDAR-based place recognition, we design a multi-channel radial descriptor that encodes the geometric layout surrounding a position. From the floor plan, rays are cast in 360 azimuth bins and the results are encoded into five channels: normalized range, structural hit type (wall, window, or opening), range gradient, inverse range, and local range variance. From the image side, the same descriptor structure is populated by detecting structural elements in the fisheye imagery. As a first step toward full cross-modal matching, we present a window detection algorithm for fisheye images that uses a line segment detector to identify window frames via vertical edge clustering and brightness verification. Detected windows are projected to azimuthal bearings through the fisheye camera model, producing the hit-type channel of the visual descriptor. As a proof of concept, we generate both descriptors at a single known pose from the Hilti-Trimble SLAM Challenge 2026 dataset and demonstrate that the wall-window pattern extracted from the first frame of each camera closely matches the floor plan descriptor, validating the feasibility of cross-modal structural matching.
Chinese Translation
建筑平面图是一种广泛可用的先验信息,包含环境的几何形状和语义信息,但现有的定位方法在很大程度上忽视了这些语义信息。为了解决这个问题,我们提出了COMPASS,一种利用平面图中的几何和语义先验来估计配备双鱼眼相机的机器人姿态的算法。受到基于激光雷达的地点识别中的扫描上下文描述符的启发,我们设计了一种多通道径向描述符,用于编码位置周围的几何布局。在平面图上,射线沿360个方位角区间投射,结果被编码为五个通道:归一化距离、结构命中类型(墙、窗或开口)、距离梯度、距离倒数和局部距离方差。在图像一侧,通过检测鱼眼图像中的结构元素来填充相同的描述符结构。作为迈向完整跨模态匹配的第一步,我们提出了一种用于鱼眼图像的窗户检测算法,该算法使用线段检测器通过垂直边缘聚类和亮度验证来识别窗框。检测到的窗户通过鱼眼相机模型投影到方位角,生成视觉描述符的命中类型通道。作为概念验证,我们在Hilti-Trimble SLAM Challenge 2026数据集的单一已知姿态下生成了这两个描述符,并展示了从每个相机第一帧提取的墙-窗模式与平面图描述符高度吻合,验证了跨模态结构匹配的可行性。
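The five descriptor channels listed in the abstract can be sketched from the output of a ray caster. This is an illustrative sketch only: it assumes the per-bin ranges and hit labels have already been computed, and the integer hit codes, the `max_range` clip, the central-difference gradient and the variance window size are all hypothetical choices, not details taken from the paper.

```python
from statistics import pvariance

# Hypothetical integer encoding of the structural hit types.
HIT_CODES = {"wall": 0, "window": 1, "opening": 2}

def build_descriptor(ranges, hit_types, max_range=20.0, window=2):
    """Build a five-channel radial descriptor from per-azimuth-bin ray hits.

    ranges    : distance to the first hit for each azimuth bin (metres)
    hit_types : "wall", "window" or "opening" for each bin
    """
    n = len(ranges)
    return {
        # Range clipped to max_range and scaled to [0, 1].
        "normalized_range": [min(r, max_range) / max_range for r in ranges],
        # Which structural element the ray hit.
        "hit_type": [HIT_CODES[h] for h in hit_types],
        # Circular central difference between neighbouring bins.
        "range_gradient": [(ranges[(i + 1) % n] - ranges[i - 1]) / 2.0
                           for i in range(n)],
        # Inverse range, which emphasises nearby structure.
        "inverse_range": [1.0 / max(r, 1e-6) for r in ranges],
        # Variance of the range over a circular neighbourhood of bins.
        "local_range_variance": [
            pvariance([ranges[(i + k) % n] for k in range(-window, window + 1)])
            for i in range(n)
        ],
    }

# Toy example: every ray hits at 5 m, with a window spanning bins 90 to 99.
ranges = [5.0] * 360
hit_types = ["window" if 90 <= i < 100 else "wall" for i in range(360)]
descriptor = build_descriptor(ranges, hit_types)
print(descriptor["hit_type"].count(HIT_CODES["window"]))
```

Because every channel is indexed by azimuth bin, a descriptor built from image detections could be compared with the floor-plan descriptor bin by bin, or under a circular shift to account for unknown heading.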
cs.CV / 43 / 2604.25405

Leveraging Previous-Traversal Point Cloud Map Priors for Camera-Based 3D Object Detection and Tracking

利用先前遍历点云地图先验进行基于摄像头的3D物体检测与跟踪
Käppeler, Markus, Çiçek, Özgün, Miron, Yakov, Valada, Abhinav
Abstract
Camera-based 3D object detection and tracking are central to autonomous driving, yet precise 3D object localization remains fundamentally constrained by depth ambiguity when no expensive, depth-rich online LiDAR is available at inference. In many deployments, however, vehicles repeatedly traverse the same environments, making static point cloud maps from prior traversals a practical source of geometric priors. We propose DualViewMapDet, a camera-only inference framework that retrieves such map priors online and leverages them to mitigate the absence of a LiDAR sensor during deployment. The key idea is a dual-space camera-map fusion strategy that avoids one-sided view conversion. Specifically, we (i) project the map into perspective view (PV) and encode multi-channel geometric cues to enrich image features and support BEV lifting, and (ii) encode the map directly in bird's-eye view (BEV) with a sparse voxel backbone and fuse it with lifted camera features in a shared metric space. Extensive evaluations on nuScenes and Argoverse 2 demonstrate consistent improvements over strong camera-only baselines, with particularly strong gains in object localization. Ablations further validate the contributions of PV/BEV fusion and prior-map coverage. We make the code and pre-trained models available at https://dualviewmapdet.cs.uni-freiburg.de .
Chinese Translation
基于摄像头的3D物体检测与跟踪是自动驾驶的核心,但在推理时没有昂贵的、富含深度信息的在线激光雷达(LiDAR)可用的情况下,精确的3D物体定位仍然受到深度模糊的根本限制。然而,在许多部署场景中,车辆会重复穿越相同的环境,使得来自先前遍历的静态点云地图成为一种实用的几何先验来源。我们提出了DualViewMapDet,这是一个仅基于摄像头的推理框架,能够在线检索此类地图先验,并利用它们来缓解部署过程中缺乏LiDAR传感器的问题。其关键思想是采用双空间摄像头-地图融合策略,避免单侧视图转换。具体而言,我们(i)将地图投影到透视视图(PV)中,并编码多通道几何线索以丰富图像特征并支持鸟瞰视图(BEV)提升;(ii)直接在鸟瞰视图(BEV)中使用稀疏体素骨干网络编码地图,并将其与提升的摄像头特征在共享的度量空间中融合。在nuScenes和Argoverse 2上的广泛评估显示,该方法相较于强大的纯摄像头基线取得了一致的提升,其中物体定位方面的增益尤为显著。消融实验进一步验证了PV/BEV融合和先验地图覆盖范围的贡献。我们在 https://dualviewmapdet.cs.uni-freiburg.de 提供代码和预训练模型。
cs.CV / 44 / 2604.25408

Beyond Fidelity: Semantic Similarity Assessment in Low-Level Image Processing

超越保真度:低级图像处理中的语义相似性评估
Wang, Runjie, Chen, Weiling, Zhao, Tiesong, Chen, Chang Wen
Abstract
Low-level image processing has long been evaluated mainly from the perspective of visual fidelity. However, with the rise of deep learning and generative models, processed images may preserve perceptual quality while altering semantic content, making conventional Image Quality Assessment (IQA) insufficient for semantic-level assessment. In this paper, we formalize Semantic Similarity as a new evaluation task for low-level image processing, aimed at measuring whether semantic content is preserved after processing. We further present a structured formulation of image semantics based on semantic entities and their relations, and discuss the desired properties and constraints of a valid semantic similarity index. Based on this formulation, we propose Triplet-based Semantic Similarity Score (T3S), which models image semantics through foreground entities, background entities, and relations. T3S combines semantic entity extraction, foreground-background disentanglement, and open-world class/relation modeling. Experiments on COCO and SPA-Data show that T3S consistently outperforms existing fidelity-oriented metrics and representative semantic-level baselines, while better reflecting progressive semantic changes under diverse degradations. These results highlight the importance of semantic assessment in modern low-level vision.
Chinese Translation
低级图像处理长期以来主要从视觉保真度的角度进行评估。然而,随着深度学习和生成模型的兴起,处理后的图像可能在改变语义内容的同时保持感知质量,这使得传统的图像质量评估(Image Quality Assessment, IQA)不足以进行语义层面的评估。本文将“语义相似性”形式化为低级图像处理的新评估任务,旨在测量处理后语义内容是否得以保留。我们进一步基于语义实体及其关系提出了图像语义的结构化表述,并讨论了有效语义相似性指标所需的属性和约束。基于这一表述,我们提出了基于三元组的语义相似性评分(Triplet-based Semantic Similarity Score, T3S),该方法通过前景实体、背景实体和关系来建模图像语义。T3S结合了语义实体提取、前景-背景解耦和开放世界类别/关系建模。在COCO和SPA-Data上的实验表明,T3S始终优于现有的以保真度为导向的指标和代表性的语义层面基线,同时更好地反映了在多种退化下的渐进语义变化。这些结果突显了在现代低级视觉中进行语义评估的重要性。
cs.CV / 45 / 2604.25427

A Systematic Post-Train Framework for Video Generation

视频生成的系统化后训练框架
Xue, Zeyue, Fu, Siming, Huang, Jie, Lu, Shuai, Li, Haoran, Liu, Yijun, Li, Yuming, He, Xiaoxuan, Chen, Mengzhao, Huang, Haoyang, Duan, Nan, Luo, Ping
Abstract
While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.
Chinese Translation
尽管大规模视频扩散模型在生成高分辨率和语义丰富的内容方面展现了令人印象深刻的能力,但由于提示敏感性、时间不一致性和高昂的推理成本等关键问题,它们的预训练性能与实际部署要求之间仍存在显著差距。为了弥合这一差距,我们提出了一种全面的后训练框架,通过四个协同阶段系统地将预训练模型与用户意图对齐:首先,我们采用监督微调(Supervised Fine-Tuning, SFT)将基础模型转变为稳定的指令跟随策略,接着进行人类反馈强化学习(Reinforcement Learning from Human Feedback, RLHF)阶段,利用一种专为视频扩散设计的新颖的群体相对策略优化(Group Relative Policy Optimization, GRPO)方法来增强感知质量和时间一致性;随后,我们通过专门的语言模型整合提示增强,以优化用户输入,最后通过推理优化解决系统效率问题。这些组件共同提供了一种系统化的方法,以提高视觉质量、时间一致性和指令跟随能力,同时保持在预训练过程中学到的可控性。最终结果是一个实用的蓝图,用于构建稳定、可适应且在实际部署中有效的可扩展后训练流程。大量实验表明,该统一流程有效减轻了常见伪影,并显著提高了可控性和视觉美感,同时遵循严格的采样成本限制。
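The RLHF stage is built on Group Relative Policy Optimization. The abstract does not describe the video-diffusion-specific variant, so the sketch below shows only the group-relative advantage normalisation that GRPO is generally known for: several samples are generated for the same prompt, and each reward is standardised against its own group. The function name, the use of the population standard deviation and the `eps` stabiliser are assumptions for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of samples for the same prompt.

    Each advantage measures how much better or worse a sample is than its
    siblings, so no separate learned value function is needed as a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward-model scores for four videos sampled from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Samples scoring above the group mean receive positive advantages and are reinforced, while those below are discouraged. A group with identical rewards yields all-zero advantages and contributes no gradient signal.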
cs.CV / 46 / 2604.25432

SARU: A Shadow-Aware and Removal Unified Framework for Remote Sensing Images with New Benchmarks

SARU:一种针对遥感图像的阴影感知与去除统一框架及新基准
Bo, Zi-Yang, Lu, Wei, Chen, Hongruixuan, Chen, Si-Bao, Luo, Bin
Abstract
Shadows are a prevalent problem in remote sensing imagery (RSI), degrading visual quality and severely limiting the performance of downstream tasks like object detection and semantic segmentation. Most prior works treat shadow detection and removal as separate, cascaded tasks, which can lead to a cumbersome process and error accumulation. Furthermore, many deep learning methods rely on paired shadow and non-shadow images for training, which are often unavailable in practice. To address these challenges, we propose the Shadow-Aware and Removal Unified (SARU) Framework, a cohesive two-stage framework. First, its dual-branch detection module (DBCSF-Net) fuses multi-color space and semantic features to generate high-fidelity shadow masks, effectively distinguishing shadows from dark objects. Then, leveraging these masks, a novel, training-free physical algorithm (N$^2$SGSR) restores illumination by transferring properties from adjacent non-shadow regions within the single input image. To facilitate rigorous evaluation and foster future work, we also introduce two new benchmark datasets: the RSI Shadow Detection (RSISD) dataset and the Single-image Shadow Removal Benchmark (SiSRB). Extensive experiments demonstrate that SARU achieves state-of-the-art performance on both the public AISD dataset and our newly introduced benchmarks. By holistically integrating shadow detection and removal to mitigate error propagation and eliminating the dependency on paired training data, SARU establishes a robust, practical framework for real-world RSI analysis. The source code and datasets are publicly available at: https://github.com/AeroVILab-AHU/SARU-Framework.
Chinese Translation
阴影是遥感图像(RSI)中普遍存在的问题,影响视觉质量并严重限制下游任务(如目标检测和语义分割)的性能。大多数先前的研究将阴影检测和去除视为两个独立的级联任务,这可能导致繁琐的过程和错误的累积。此外,许多深度学习方法依赖于成对的阴影和非阴影图像进行训练,而这些图像在实际应用中往往不可用。为了解决这些挑战,我们提出了阴影感知与去除统一框架(SARU),这是一个紧凑的两阶段框架。首先,其双分支检测模块(DBCSF-Net)融合多色彩空间和语义特征,以生成高保真度的阴影掩膜,有效区分阴影和暗物体。然后,利用这些掩膜,一种新颖的无训练物理算法(N$^2$SGSR)通过转移单个输入图像中相邻非阴影区域的属性来恢复光照。为了促进严格评估和推动未来的研究,我们还引入了两个新的基准数据集:RSI阴影检测(RSISD)数据集和单图像阴影去除基准(SiSRB)。大量实验表明,SARU在公共AISD数据集和我们新引入的基准上均实现了最先进的性能。通过整体整合阴影检测和去除以减轻错误传播,并消除对成对训练数据的依赖,SARU建立了一个稳健且实用的框架,用于现实世界的RSI分析。源代码和数据集可在以下网址公开获取:https://github.com/AeroVILab-AHU/SARU-Framework。
cs.CV / 47 / 2604.25457

GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution

GramSR:基于扩散的超分辨率视觉特征调节
D'Oronzio, Fabio, Putamorsi, Federico, Zini, Leonardo, Cornia, Marcella, Baraldi, Lorenzo
Abstract
Despite recent advances, single-image super-resolution (SR) remains challenging, especially in real-world scenarios with complex degradations. Diffusion-based SR methods, particularly those built on Stable Diffusion, leverage strong generative priors but commonly rely on text conditioning derived from semantic captioning. Such textual descriptions provide only high-level semantics and lack the spatially aligned visual information required for faithful restoration, leading to a representation gap between abstract semantics and spatially aligned visual details. To address this limitation, we propose GramSR, a one-step diffusion-based SR framework that replaces text conditioning with dense visual features extracted from the low-resolution input using a pre-trained DINOv3 encoder. GramSR adopts a three-stage LoRA architecture, where pixel-level, semantic-level, and texture-level LoRA modules are trained sequentially. The pixel-level module focuses on degradation removal using $\ell_2$ loss, the semantic-level module enhances perceptual details via LPIPS and CSD losses, and the texture-level module enforces feature correlation consistency through a Gram matrix loss computed from DINOv3 features. At inference, independent guidance scales enable flexible control over degradation removal, semantic enhancement, and texture preservation. Extensive experiments on standard SR benchmarks demonstrate that GramSR consistently outperforms existing one-step diffusion-based methods, achieving superior structural fidelity and texture realism. The code for this work is available at: https://github.com/aimagelab/GramSR.
Chinese Translation
尽管近年来取得了进展,单幅图像超分辨率(SR)仍然具有挑战性,特别是在具有复杂降质的真实场景中。基于扩散的超分辨率方法,尤其是那些基于Stable Diffusion构建的方法,利用了强大的生成先验,但通常依赖于从语义描述中衍生的文本条件。这些文本描述仅提供高层次的语义信息,缺乏忠实恢复所需的空间对齐视觉信息,导致抽象语义与空间对齐视觉细节之间存在表征差距。为了解决这一局限性,我们提出了GramSR,这是一种基于扩散的一步式超分辨率框架,它用预训练的DINOv3编码器从低分辨率输入中提取的密集视觉特征替代文本条件。GramSR采用三阶段的LoRA架构,其中像素级、语义级和纹理级的LoRA模块依次训练。像素级模块专注于使用$\ell_2$损失去除降质,语义级模块通过LPIPS和CSD损失增强感知细节,而纹理级模块通过从DINOv3特征计算的Gram矩阵损失强制特征相关性一致性。在推理阶段,独立的引导尺度使得对降质去除、语义增强和纹理保留的灵活控制成为可能。在标准SR基准上的大量实验表明,GramSR始终优于现有的基于扩散的一步式方法,实现了更优的结构保真度和纹理真实感。该工作的代码可在以下链接获取:https://github.com/aimagelab/GramSR。
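The texture-level stage relies on a Gram matrix loss over encoder features. The sketch below shows a generic Gram loss in plain Python, assuming features arrive as a list of per-token channel vectors. The normalisation by token count and the mean-squared comparison are common conventions, not details confirmed by the abstract, and `gram_matrix` and `gram_loss` are illustrative names rather than functions from the GramSR repository.

```python
def gram_matrix(features):
    """Channel-by-channel correlation matrix of a set of token features.

    features : list of N token vectors, each with C channels.
    Returns a C x C matrix with G[i][j] = (1/N) * sum_n f_n[i] * f_n[j].
    """
    n = len(features)
    c = len(features[0])
    return [[sum(f[i] * f[j] for f in features) / n for j in range(c)]
            for i in range(c)]

def gram_loss(pred_features, target_features):
    """Mean squared difference between two Gram matrices."""
    g_pred = gram_matrix(pred_features)
    g_tgt = gram_matrix(target_features)
    c = len(g_pred)
    return sum((g_pred[i][j] - g_tgt[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)

# Toy features: three tokens with two channels each.
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shuffled = [target[2], target[0], target[1]]   # same tokens, new positions
different = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

print(gram_loss(shuffled, target))   # zero: token order does not matter
print(gram_loss(different, target) > 0)
```

Because the Gram matrix sums over tokens, it captures which channels co-activate but not where they do so. That makes it a natural texture statistic that complements the spatially aligned pixel-level and perceptual losses.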
cs.CV / 48 / 2604.25464

Image Compression with Bubble-Aware Frame Rate Adaptation for Energy-Efficient Video Capsule Endoscopy

基于气泡感知帧率适应的图像压缩用于节能视频胶囊内窥镜
Bause, Oliver, Gammerdinger, Jörg, Werner, Julia
Abstract
Video Capsule Endoscopy (VCE) is a promising method for improving the medical examination of the small intestine in the gastrointestinal tract. A key challenge is the limited size of the capsules, resulting in a short battery lifetime which conflicts with the high energy consumption of image capturing and transmission to an on-body device. Thus, we propose an image compression pipeline that substantially reduces the transmitted data while preserving diagnostic image quality. Furthermore, we exploit characteristics of the compression process to identify frames with low diagnostic value mainly caused by bubbles, without requiring additional image analysis. For low-visibility frames, a dynamic bubble-aware frame rate adaptation strategy reduces image acquisition and transmission during these phases while preserving sensitivity to potential anomalies. The proposed compression and frame rate adaptation are evaluated on a RISC-V platform using the Kvasir-Capsule and Galar datasets. The compression method achieves a compression ratio of 5.748 (82.6%) at a peak signal-to-noise ratio of 40.3 dB, indicating negligible loss of visual quality. Compression reduced the mean energy consumption of the whole system by 20.58%. Additionally, the proposed bubble-aware frame rate adaptation reduced the energy consumption by up to 40%. These results demonstrate the potential of our method to increase the applicability of VCE.
Chinese Translation
视频胶囊内窥镜(VCE)是一种有前景的方法,用于改善对胃肠道小肠的医学检查。一个关键挑战是胶囊的体积有限,导致电池寿命短,这与图像捕获和传输到体表佩戴设备的高能耗相冲突。因此,我们提出了一种图像压缩流程,显著减少传输数据,同时保持诊断图像质量。此外,我们利用压缩过程的特征,识别主要由气泡引起的低诊断价值的帧,而无需额外的图像分析。对于低可见度帧,动态气泡感知帧率适应策略在这些阶段减少图像采集和传输,同时保持对潜在异常的敏感性。所提出的压缩和帧率适应在RISC-V平台上使用Kvasir-Capsule和Galar数据集进行了评估。压缩方法在峰值信噪比为40.3 dB时实现了5.748(82.6%)的压缩比,表明视觉质量损失微乎其微。压缩使整个系统的平均能耗降低了20.58%。此外,所提出的气泡感知帧率适应将能耗降低了多达40%。这些结果表明我们的方法有潜力提高VCE的适用性。
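The two figures quoted together, a compression ratio of 5.748 and 82.6%, are consistent if the percentage is read as the share of data saved, that is 1 - 1/ratio. The short check below also converts the reported PSNR into a mean squared error, assuming 8-bit images with a peak value of 255. The abstract does not state the bit depth, so that part is an assumption.

```python
import math

compression_ratio = 5.748   # original size / compressed size
psnr_db = 40.3              # reported peak signal-to-noise ratio
peak_value = 255            # assumption: 8-bit pixel values

# Fraction of the original data that no longer needs to be transmitted.
data_reduction = 1 - 1 / compression_ratio

# PSNR = 10 * log10(peak^2 / MSE), solved for MSE.
mse = peak_value ** 2 / 10 ** (psnr_db / 10)
rmse = math.sqrt(mse)

print(f"data reduction: {data_reduction:.1%}")
print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f} grey levels")
```

Under the 8-bit assumption, the average per-pixel error is a few grey levels out of 255, which matches the abstract's description of negligible visual loss.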
cs.CV / 49 / 2604.25466

Generalizable Human Gaussian Splatting via Multi-view Semantic Consistency

基于多视角语义一致性的可推广人类高斯点云渲染
Kim, Jingi, Kim, Wonjun
Abstract
Recently, generalizable human Gaussian splatting from sparse-view inputs has been actively studied for photorealistic human rendering. Most existing methods rely on explicit geometric constraints or predefined structural representations to accurately position 3D Gaussians. Although these approaches have shown remarkable progress in this field, they still suffer from inconsistent feature representations across multi-view inputs due to complex articulations of the human body and limited overlaps between different views. To address this problem, we propose a novel method to accurately localize 3D Gaussians and ultimately improve the quality of human rendering. The key idea is to unproject latent embeddings encoded from each viewpoint into a shared 3D space through predicted depth maps and recalibrate those belonging to the same body part based on cross-view attention. This helps the model resolve the spatial ambiguity occurring in highly textured regions as well as occluded body parts, thus leading to the accurate localization of 3D Gaussians. Experimental results on benchmark datasets show that the proposed method efficiently improves the performance of generalizable human Gaussian splatting from sparse-view inputs.
Chinese Translation
近年来,从稀疏视角输入中进行可推广的人类高斯点云渲染已成为光真实感人类渲染研究的热点。现有的方法大多依赖于显式几何约束或预定义的结构表示来准确定位3D高斯点。尽管这些方法在该领域取得了显著进展,但由于人体的复杂关节运动和不同视角之间的重叠有限,它们在多视角输入中仍然面临特征表示不一致的问题。为了解决这一问题,我们提出了一种新颖的方法,旨在准确定位3D高斯点,并最终提高人类渲染的质量。其关键思想是通过预测的深度图将从每个视点编码的潜在嵌入反投影到共享的3D空间中,并基于跨视角注意机制对其进行重新校准,使其归属同一身体部位。这有助于模型解决在高度纹理区域和被遮挡身体部位中出现的空间模糊性,从而实现3D高斯点的准确定位。在基准数据集上的实验结果表明,所提出的方法有效提高了从稀疏视角输入中进行可推广人类高斯点云渲染的性能。
cs.CV / 50 / 2604.25477

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

DDA-Thinker:用于推理驱动图像编辑的解耦双原子强化学习
Yang, Hanqing, Zhou, Qiang, Du, Yongchao, Zhou, Sashuai, Wang, Zhibin, Song, Jun, Ge, Tiezheng, Yu, Cheng, Zheng, Bo
Abstract
Recent image editing models have achieved strong visual fidelity but often struggle with tasks requiring complex reasoning. To investigate and enhance the reasoning-grounded planning for image editing, we propose DDA-Thinker, a Thinker-centric framework designed for the independent optimization of a planning module (Thinker) over a fixed generative model (Editor). This decoupled Thinker-centric paradigm facilitates a controlled analysis of the planning module and makes its contribution under a fixed Editor easier to assess. To effectively guide this Thinker, we introduce a dual-atomic reinforcement learning framework. This framework decomposes feedback into two distinct atomic rewards implemented through verifiable checklists: a cognitive-atomic reward to directly assess the quality of the Thinker's executable plan, which serves as the actionable outcome of the Thinker's reasoning, and a visual-atomic reward to assess the final image quality. To improve checklist quality, our checklist synthesis is grounded not only in the source image and user instruction but also in a rational reference description of the ideal post-edit scene. To support this training, we further develop a two-stage data curation pipeline that first synthesizes a diverse and reasoning-focused dataset, then applies difficulty-aware refinement to curate an effective training curriculum for reinforcement learning. Extensive experiments on reasoning-driven image editing benchmarks, including RISE-Bench and KRIS-Bench, demonstrate that our approach substantially improves overall performance. Our method enables a community model to achieve results competitive with strong proprietary models, highlighting the practical potential of Thinker-centric optimization under a fixed-editor setting.
Chinese Translation
近期的图像编辑模型在视觉保真度方面取得了显著进展,但在需要复杂推理的任务中往往表现不佳。为了研究和增强基于推理的图像编辑规划,我们提出了DDA-Thinker,这是一个以Thinker为中心的框架,旨在对固定生成模型(Editor)上的规划模块(Thinker)进行独立优化。这种解耦的Thinker中心范式促进了对规划模块的控制分析,使其在固定Editor下的贡献更易于评估。为了有效指导这一Thinker,我们引入了一个双原子强化学习框架。该框架将反馈分解为两个不同的原子奖励,通过可验证的检查清单实施:一种认知原子奖励,用于直接评估Thinker可执行计划的质量,该计划作为Thinker推理的可操作结果;另一种视觉原子奖励,用于评估最终图像质量。为了提高检查清单的质量,我们的检查清单合成不仅基于源图像和用户指令,还基于理想后编辑场景的合理参考描述。为了支持这一训练,我们进一步开发了一个两阶段数据策划管道,首先合成一个多样化且以推理为重点的数据集,然后应用难度感知的精炼,以策划有效的强化学习训练课程。在推理驱动的图像编辑基准测试(包括RISE-Bench和KRIS-Bench)上的广泛实验表明,我们的方法显著提高了整体性能。我们的方法使得社区模型能够在固定编辑器设置下取得与强大的专有模型相竞争的结果,突显了以Thinker为中心的优化在实际应用中的潜力。
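The dual-atomic reward described above can be illustrated with a minimal sketch. The abstract only states that feedback is split into a cognitive-atomic reward (on the plan) and a visual-atomic reward (on the final image), both implemented through verifiable checklists. The pass-fraction scoring, the weighted sum, and the names `checklist_reward`, `dual_atomic_reward`, and `w_cognitive` are assumptions made here for illustration, not the authors' formulation.

```python
def checklist_reward(results):
    """Fraction of checklist items (booleans) that passed; 0.0 for an empty checklist."""
    return sum(results) / len(results) if results else 0.0


def dual_atomic_reward(plan_checks, image_checks, w_cognitive=0.5):
    """Hypothetical combination: weighted sum of the two checklist rewards."""
    cognitive = checklist_reward(plan_checks)   # quality of the Thinker's plan
    visual = checklist_reward(image_checks)     # quality of the edited image
    return w_cognitive * cognitive + (1.0 - w_cognitive) * visual
```

For example, a plan that passes both of its checks and an image that passes one of two would score 0.75 under equal weights.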
cs.CV / 51 / 2604.25491

The Forensic Cost of Watermark Removal

水印去除的取证成本
Evennou, Gautier, Kijak, Ewa
Abstract
Current watermark removal methods are evaluated on two axes: attack success rate and perceptual quality. We show this is insufficient. While state-of-the-art attacks successfully degrade the watermark signal without visible distortion, they leave distinct statistical artifacts that betray the removal attempt. We name this overlooked axis Watermark Removal Detection (WRD) and demonstrate that a modern classifier trained on these artifacts achieves state-of-the-art detection rates at $10^{-3}$ FPR across every removal method tested. No existing attack accounts for this forensic leakage. We benchmark leading watermarking schemes against standard removal pipelines under the extended evaluation triple of attack success, perceptual quality, and forensic detectability, and find that no current method balances all three. Our results establish forensic stealthiness as a necessary requirement for watermark removal.
Chinese Translation
当前的水印去除方法在两个维度上进行评估:攻击成功率和感知质量。我们认为这不足够。尽管最先进的攻击成功地降低了水印信号而没有明显的失真,但它们留下了明显的统计伪影,揭示了去除尝试。我们将这一被忽视的维度称为水印去除检测(Watermark Removal Detection, WRD),并展示了一种现代分类器在这些伪影上训练后,在所有测试的去除方法中以 $10^{-3}$ 的假阳性率(FPR)实现了最先进的检测率。现有的攻击方法没有考虑到这种取证泄漏。我们将领先的水印方案与标准去除流程进行基准测试,扩展评估维度为攻击成功、感知质量和取证可检测性,发现当前没有任何方法能够平衡这三者。我们的结果确立了取证隐蔽性作为水印去除的必要要求。
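Detection rates "at $10^{-3}$ FPR" are typically read as the true-positive rate at a threshold chosen so that at most that fraction of clean images is flagged. The sketch below shows one common way to compute it from raw detector scores. The function name and the strict-threshold convention are assumptions; the paper's exact evaluation code is not given in the abstract.

```python
def tpr_at_fpr(neg_scores, pos_scores, target_fpr=1e-3):
    """True-positive rate at a threshold whose false-positive rate is <= target_fpr.

    neg_scores: detector scores on untouched images (negatives).
    pos_scores: detector scores on images that went through watermark removal.
    """
    negs = sorted(neg_scores, reverse=True)
    # Number of negatives we may flag while staying within the target FPR.
    allowed = int(target_fpr * len(negs))
    # A sample is flagged only if it scores strictly above this negative score.
    threshold = negs[allowed] if allowed < len(negs) else float("-inf")
    detected = sum(1 for s in pos_scores if s > threshold)
    return detected / len(pos_scores)
```

With few negatives the threshold collapses to the highest negative score, so a meaningful estimate at $10^{-3}$ FPR needs several thousand clean samples.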
cs.CV / 52 / 2604.25530

The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation

经典知识蒸馏在语义分割中的惊人有效性
Ali, Muhammad, Laube, Kevin Alexander, Ganesh, Madan Ravi, Schott, Lukas, Popp, Niclas, Brox, Thomas
Abstract
Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per-iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration-based comparisons are misleading: when wall-clock compute is matched, canonical logit- and feature-based KD outperform recent segmentation-specific methods. Under extended training, feature-based distillation achieves state-of-the-art ResNet-18 performance on Cityscapes and ADE20K. A PSPNet ResNet-18 student closely approaches its ResNet-101 teacher despite using only one quarter of the parameters, reaching 99% of the teacher's mIoU on Cityscapes (79.0 vs. 79.8) and 92% on ADE20K. Our results challenge the prevailing assumption that KD for segmentation requires task-specific mechanisms and suggest that scaling, rather than complex hand-crafted objectives, should guide future method design.
Chinese Translation
最近针对语义分割的知识蒸馏(KD)方法引入了越来越复杂的手工设计目标,但通常在固定的迭代计划下进行评估。这些目标显著增加了每次迭代的成本,这意味着相同的迭代次数并不对应于相同的训练预算。因此,目前尚不清楚报告的增益是反映了更强的蒸馏信号,还是仅仅是更大的计算量。我们表明,基于迭代的比较具有误导性:当实际运行时间(wall-clock)的计算量相匹配时,经典的基于logit和基于特征的知识蒸馏优于最近的分割专用方法。在延长训练的情况下,基于特征的蒸馏在Cityscapes和ADE20K上达到了最先进的ResNet-18性能。一个PSPNet ResNet-18学生模型的参数量仅为ResNet-101教师模型的四分之一,却十分接近教师的性能,在Cityscapes上达到教师mIoU的99%(79.0 vs. 79.8),在ADE20K上达到92%。我们的结果挑战了当前关于分割知识蒸馏需要特定任务机制的普遍假设,并建议未来的方法设计应以规模为导向,而非复杂的手工设计目标。
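For readers unfamiliar with the "canonical" logit-based objective, the sketch below shows the standard temperature-softened KL formulation for a single pixel. It assumes this is the logit KD variant meant by the abstract, and the temperature of 4.0 is an illustrative value, not one reported by the authors. In segmentation the loss would be averaged over all pixels.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

The T^2 factor keeps gradient magnitudes comparable across temperatures, which is the usual reason it appears in this loss.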
cs.CV / 53 / 2604.25533

DualGeo: A Dual-View Framework for Worldwide Image Geo-localization

DualGeo:一种全球图像地理定位的双视角框架
Cui, Junchao, Shi, Wenqi, Du, Shaoyong, He, Hang, Ma, Xuanzi, Tang, Hao, Luo, Xiangyang
Abstract
Worldwide image geo-localization aims to infer the geographic location of an image captured anywhere on Earth, spanning street, city, regional, national, and continental scales. Existing methods rely on visual features that are sensitive to environmental variations (e.g., lighting, season, and weather) and lack effective post-processing to filter outlier candidates, limiting localization accuracy. To address these limitations, we propose DualGeo, a two-stage framework for worldwide image geo-localization. First, it establishes a geo-representational foundation by fusing image and semantic segmentation features via bidirectional cross-attention. The fused features are then aligned with GPS coordinates through dual-view contrastive learning to build a global retrieval database. Second, it performs geo-cognitive refinement by re-ranking retrieved candidates using geographic clustering. It then feeds them into large multimodal models (LMMs) for final coordinate prediction. Experiments on IM2GPS, IM2GPS3k, and YFCC4k show that DualGeo outperforms state-of-the-art methods, improving street-level (<1 km) and city-level (<25 km) localization accuracy by 3.6%-16.58% and 1.29%-8.77%, respectively. Our code and datasets are available at https://github.com/CJ310177/DualGeo.
Chinese Translation
全球图像地理定位旨在推断在地球任何地方拍摄的图像的地理位置,涵盖街道、城市、区域、国家和大陆等多个尺度。现有方法依赖于对环境变化(例如光照、季节和天气)敏感的视觉特征,并缺乏有效的后处理来过滤异常候选,限制了定位的准确性。为了解决这些局限性,我们提出了DualGeo,一种用于全球图像地理定位的两阶段框架。首先,通过双向交叉注意力融合图像和语义分割特征,建立地理表示基础。然后,通过双视角对比学习将融合特征与GPS坐标对齐,以构建全球检索数据库。其次,通过地理聚类对检索到的候选进行重新排序,实现地理认知精炼。最后,将其输入大型多模态模型(LMMs)进行最终坐标预测。在IM2GPS、IM2GPS3k和YFCC4k上的实验表明,DualGeo在性能上超越了最先进的方法,街道级(<1 km)和城市级(<25 km)定位准确性分别提高了3.6%-16.58%和1.29%-8.77%。我们的代码和数据集可在此获取:https://github.com/CJ310177/DualGeo。
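The street-level (<1 km) and city-level (<25 km) figures above are threshold accuracies over great-circle distance. The sketch below shows how such a metric is commonly computed with the haversine formula. The function names and the spherical-Earth radius are choices made here, not taken from the DualGeo code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_at(preds, truths, threshold_km):
    """Fraction of predictions that fall within threshold_km of the ground truth."""
    hits = sum(1 for p, t in zip(preds, truths) if haversine_km(*p, *t) < threshold_km)
    return hits / len(truths)
```

Calling `accuracy_at(preds, truths, 1)` and `accuracy_at(preds, truths, 25)` gives the two scales quoted in the abstract.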
cs.CV / 54 / 2604.25545

TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media

TopoMamba:一种拓扑感知的扫描与融合框架用于异构医学视觉媒体的分割
Zheng, Fuchen, Xu, Chengpei, Ma, Long, Li, Weixuan, Zhou, Junhua, Chen, Xuhang, Liu, Weihuang, Li, Haolun, Li, Quanjun, Zhang, Zhenxi, Zhao, Lei, Pun, Chi-Man, Zhou, Shoujun
Abstract
Visual state-space models (SSMs) have shown strong potential for medical image segmentation, yet their effectiveness is often limited by two practical issues: axis-biased scan ordering weakens the modeling of oblique and curved structures, and naive multi-branch fusion tends to amplify redundant responses. We present TopoMamba, a topology-aware scan-and-fuse framework for segmenting heterogeneous medical visual media. The method combines a diagonal/anti-diagonal TopoA-Scan branch with the standard Cross-Scan branch to provide complementary structural priors, and introduces ScanCache, a device-aware caching mechanism that amortizes explicit scan-index construction across recurring resolutions. To fuse heterogeneous scan features efficiently, we further propose a lightweight HSIC Gate that regulates branch interaction using a dependence-aware scalar gating rule. We also instantiate a volumetric TopoMamba-3D for practical 3D clinical segmentation. Experiments on Synapse CT, ISIC 2017 dermoscopy, and CVC-ClinicDB endoscopy show that TopoMamba consistently improves segmentation quality over strong CNN, Transformer, and SSM baselines, with particularly clear gains on thin or curved targets such as the pancreas and gallbladder, while maintaining favorable deployment efficiency under dynamic input resolutions. These results suggest that topology-aware scan ordering and lightweight dependence-aware fusion form an effective and practical design for medical multimedia segmentation. The code will be made publicly available.
Chinese Translation
视觉状态空间模型(SSMs)在医学图像分割中展现出强大的潜力,但其有效性常常受到两个实际问题的限制:轴向偏向的扫描顺序削弱了对斜面和曲线结构的建模,而简单的多分支融合往往会放大冗余响应。我们提出了TopoMamba,一种拓扑感知的扫描与融合框架,用于分割异构医学视觉媒体。该方法结合了对角线/反对角线的TopoA-Scan分支与标准的Cross-Scan分支,以提供互补的结构先验,并引入了ScanCache,一种设备感知的缓存机制,能够在重复分辨率下摊销显式扫描索引的构建。为了高效融合异构扫描特征,我们进一步提出了一种轻量级的HSIC Gate,通过依赖感知的标量门控规则调节分支间的交互。我们还实例化了一个体积型的TopoMamba-3D,用于实际的3D临床分割。在Synapse CT、ISIC 2017皮肤镜和CVC-ClinicDB内窥镜的实验中,TopoMamba在强大的CNN、Transformer和SSM基线之上持续提高了分割质量,尤其在薄或曲线目标(如胰腺和胆囊)上表现出明显的提升,同时在动态输入分辨率下保持了良好的部署效率。这些结果表明,拓扑感知的扫描顺序和轻量级的依赖感知融合构成了医学多媒体分割的有效且实用的设计。代码将公开发布。
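To make the idea of a diagonal or anti-diagonal scan with cached indices concrete, here is a minimal sketch. It builds the flat visiting order for an H x W token grid and memoizes it per resolution with `functools.lru_cache`, as a stand-in for the amortization that ScanCache is described as providing. The function name, the traversal direction within each diagonal, and the caching mechanism are assumptions, not the authors' implementation, which is also device-aware.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def diagonal_scan_order(height, width, anti=False):
    """Row-major indices of a height x width grid, visited diagonal by diagonal.

    anti=True walks lines where row + col is constant (anti-diagonals);
    anti=False walks lines where row - col is constant (main-diagonal direction).
    """
    order = []
    for d in range(height + width - 1):
        for r in range(height):
            c = d - r  # cells with r + c == d share one anti-diagonal
            if 0 <= c < width:
                # Mirroring the columns turns anti-diagonals into main diagonals.
                col = c if anti else width - 1 - c
                order.append(r * width + col)
    return tuple(order)
```

Repeated calls at the same resolution return the cached tuple, so the index construction cost is paid once per distinct input size.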
cs.CV / 55 / 2604.25570

Vision SmolMamba: Spike-Guided Token Pruning for Energy-Efficient Spiking State-Space Vision Models

视觉SmolMamba:用于节能脉冲状态空间视觉模型的脉冲引导令牌剪枝
Bai, Dewei, Peng, Hongxiang, Zeng, Yunyun, Zhang, Ziyu, Qu, Hong, Zhang, Yi
Abstract
Spiking Transformers have shown strong potential for long-range visual modeling through spike-driven self-attention. However, their quadratic token interactions remain fundamentally misaligned with the sparse and event-driven nature of spiking neural computation. To address this limitation, we propose Vision SmolMamba, an energy-efficient spiking state-space architecture that integrates spike-driven dynamics with linear-time selective recurrence. The key idea is a Spike-Guided Spatio-Temporal Token Pruner (SST-TP), which estimates token importance using both spike activation strength and first-spike latency. This mechanism progressively removes redundant tokens while preserving salient spatio-temporal information, enabling efficient scaling with token sparsity. Based on this mechanism, the proposed SmolMamba block incorporates spike events directly into bidirectional state-space recurrence, forming a spiking state-space vision backbone for efficient long-range modeling. Extensive experiments on both static and event-based benchmarks, including ImageNet-1K, CIFAR10/100, CIFAR10-DVS, and DVS128 Gesture, demonstrate that Vision SmolMamba consistently achieves superior accuracy-efficiency trade-offs. In particular, it reduces the estimated energy cost by at least 1.5x compared with prior spiking Transformer baselines and a Spiking Mamba variant while maintaining competitive or improved accuracy. These results demonstrate that combining spike-guided token sparsity with state-space modeling offers a scalable and energy-efficient paradigm for spiking vision systems.
Chinese Translation
脉冲变换器在通过脉冲驱动的自注意力进行长距离视觉建模方面显示出强大的潜力。然而,它们的平方令牌交互在根本上与脉冲神经计算的稀疏和事件驱动特性不相符。为了解决这一局限性,我们提出了视觉SmolMamba,一种节能的脉冲状态空间架构,它将脉冲驱动动态与线性时间选择性递归相结合。关键思想是脉冲引导时空令牌剪枝器(SST-TP),该剪枝器利用脉冲激活强度和首次脉冲延迟来估计令牌的重要性。该机制逐步去除冗余令牌,同时保留显著的时空信息,从而实现与令牌稀疏性有效的扩展。基于这一机制,所提出的SmolMamba模块将脉冲事件直接纳入双向状态空间递归,形成一个用于高效长距离建模的脉冲状态空间视觉骨干网络。在包括ImageNet-1K、CIFAR10/100、CIFAR10-DVS和DVS128手势在内的静态和基于事件的基准测试中,广泛实验表明,视觉SmolMamba始终实现了优越的准确性与效率的权衡。特别是,与之前的脉冲变换器基线和脉冲Mamba变体相比,它将估计的能量成本降低了至少1.5倍,同时保持了具有竞争力或改善的准确性。这些结果表明,将脉冲引导的令牌稀疏性与状态空间建模相结合,为脉冲视觉系统提供了一种可扩展且节能的范式。
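The pruning criterion above combines two signals per token: how strongly it spikes and how early it first spikes. A minimal sketch of such a score follows. The linear mix with weight `alpha`, the firing-rate definition of activation strength, and the fixed `keep_ratio` are all hypothetical choices, since the abstract does not give the SST-TP formula.

```python
def prune_tokens(spikes, keep_ratio=0.5, alpha=0.5):
    """Return the indices of tokens to keep, in their original order.

    spikes[i][t] is 0 or 1 for token i at timestep t.
    """
    num_steps = len(spikes[0])
    scores = []
    for train in spikes:
        rate = sum(train) / num_steps  # activation strength as firing rate
        # Time of the first spike; num_steps if the token never fires.
        first = next((t for t, s in enumerate(train) if s), num_steps)
        earliness = 1.0 - first / num_steps  # earlier first spike gives a higher value
        scores.append(alpha * rate + (1.0 - alpha) * earliness)
    keep = max(1, int(keep_ratio * len(spikes)))
    top = sorted(range(len(spikes)), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(top)
```

Applying this at successive layers with a ratio below 1 would give the progressive token reduction the abstract describes.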
cs.CV / 56 / 2604.25574

Control Your Queries: Heterogeneous Query Interaction for Camera-Radar Fusion

控制你的查询:用于摄像头-雷达融合的异构查询交互
Wu, Jialong, Wang, Yihan, Rottmann, Matthias
Abstract
In autonomous driving, camera-radar fusion offers complementary sensing and low deployment cost. Existing methods perform fusion through input mixing, feature map mixing, or query-based feature sampling. We propose a new fusion paradigm, termed heterogeneous query interaction, and present ConFusion, a camera-radar 3D object detector. ConFusion combines image queries, radar queries, and learnable world queries distributed in 3D space to improve query initialization and object coverage. To encourage cross-type interaction among heterogeneous queries, we introduce heterogeneous query mixing (QMix), which performs dedicated cross-type attention after feature sampling to consolidate complementary object evidence. We further propose interactive query swap sampling (QSwap), which improves feature sampling by allowing related queries to exchange informative feature tokens under attention and geometric constraints. Experiments on the nuScenes dataset show that ConFusion achieves state-of-the-art performance, reaching 59.1 mAP and 65.6 NDS on the validation set, and 61.6 mAP and 67.9 NDS on the test set.
Chinese Translation
在自动驾驶中,摄像头-雷达融合提供了互补的感知能力和低部署成本。现有方法通过输入混合、特征图混合或基于查询的特征采样来进行融合。我们提出了一种新的融合范式,称为异构查询交互,并提出了ConFusion,一个摄像头-雷达三维物体检测器。ConFusion结合了分布在三维空间中的图像查询、雷达查询和可学习的世界查询,以改善查询初始化和物体覆盖率。为了促进异构查询之间的跨类型交互,我们引入了异构查询混合(QMix),该方法在特征采样后执行专门的跨类型注意力,以巩固互补的物体证据。我们进一步提出了交互式查询交换采样(QSwap),通过允许相关查询在注意力和几何约束下交换信息特征标记来改善特征采样。在nuScenes数据集上的实验表明,ConFusion达到了最先进的性能,在验证集上达到了59.1 mAP和65.6 NDS,在测试集上达到了61.6 mAP和67.9 NDS。
cs.CV / 57 / 2604.25636

Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

通过再生进行精细化:扩大修改空间提升统一多模态模型中的图像精细化
Guo, Jiayi, Wang, Linqing, Wang, Jiangshan, Yue, Yang, Liu, Zeyu, Zhao, Zhiyuan, Lu, Qinglin, Huang, Gao, Wang, Chunyu
Abstract
Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, where UMMs produce editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment with a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, improving Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.
Chinese Translation
统一多模态模型(UMMs)在单一框架内整合了视觉理解和生成能力。在文本到图像(T2I)任务中,这种统一能力使UMMs能够在初始生成后对输出进行精细化,从而可能扩展性能的上限。目前基于UMM的精细化方法主要遵循精细化-通过-编辑(RvE)范式,其中UMMs生成编辑指令以修改不对齐区域,同时保持对齐内容。然而,编辑指令往往仅粗略描述提示与图像之间的不对齐,导致精细化不完整。此外,尽管像素级的保留对于编辑是必要的,但却不必要地限制了精细化的有效修改空间。为了解决这些局限性,我们提出了通过再生进行精细化(RvR),这是一种将精细化重新表述为条件图像再生而非编辑的新框架。RvR不依赖于编辑指令和严格的内容保留,而是根据目标提示和初始图像的语义标记再生图像,从而实现更完整的语义对齐,并具有更大的修改空间。大量实验表明RvR的有效性,使Geneval从0.78提升至0.91,DPGBench从84.02提升至87.21,UniGenBench++从61.53提升至77.41。
cs.CV / 58 / 2604.25642

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

预填充时间干预以减轻大型视觉语言模型中的幻觉
Zhang, Chengsheng, Sun, Chenghao, Jiang, Xinyan, Li, Wei, Tian, Xinmei
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual-textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses. While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs. To address this, we propose Prefill-Time Intervention (PTI), a novel steering paradigm that intervenes only once during the prefill stage, enhancing the initial Key-Value (KV) cache before error accumulation occurs. Specifically, PTI is modality-aware, deriving distinct directions for visual and textual representations. This intervention is decoupled to steer keys toward visually-grounded objects and values to filter background noise, correcting hallucination-prone representations at their source. Extensive experiments demonstrate PTI's significant performance in mitigating hallucinations and its generalizability across diverse decoding strategies, LVLMs, and benchmarks. Moreover, PTI is orthogonal to existing decoding-stage methods, enabling plug-and-play integration and further boosting performance. Code is available at: https://github.com/huaiyi66/PTI.
Chinese Translation
大型视觉语言模型(LVLMs)在视觉文本理解方面取得了显著进展,但其可靠性受到幻觉的严重影响,即生成事实不正确或不一致的响应。尽管最近的研究使用引导向量在减少幻觉方面显示出希望,但仍然存在一个显著的挑战:它们无意中加剧了残余幻觉的严重性。我们将此归因于它们对解码阶段的专注,在该阶段,错误以自回归的方式累积并逐渐恶化后续的幻觉输出。为了解决这个问题,我们提出了预填充时间干预(PTI),这是一种新颖的引导范式,仅在预填充阶段进行一次干预,在错误累积发生之前增强初始的键值(KV)缓存。具体而言,PTI 是感知模态的,为视觉和文本表示推导出不同的方向。该干预被解耦,以引导键朝向视觉基础的对象,并将值过滤背景噪声,从源头纠正易幻觉的表示。大量实验表明,PTI 在减轻幻觉方面具有显著的性能,并且在不同的解码策略、LVLMs 和基准测试中具有良好的通用性。此外,PTI 与现有的解码阶段方法是正交的,能够实现即插即用的集成,并进一步提升性能。代码可在以下链接获取:https://github.com/huaiyi66/PTI。
cs.CV / 59 / 2604.25646

SAMe: A Semantic Anatomy Mapping Engine for Robotic Ultrasound

SAMe:一种用于机器人超声的语义解剖映射引擎
Zhang, Jing, Chen, Duojie, Jiang, Wentao, Lou, Zihan, Liu, Jianxin, Cui, Xinwu, Zhao, Qinghong, Du, Bo, Dietrich, Christoph F., Tao, Dacheng
Abstract
Robotic ultrasound has advanced local image-driven control, contact regulation, and view optimization, yet current systems lack the anatomical understanding needed to determine what to scan, where to begin, and how to adapt to individual patient anatomy. These gaps make systems still reliant on expert intervention to initiate scanning. Here we present SAMe, a semantic anatomy mapping engine that provides robotic ultrasound with an explicit anatomical prior layer. SAMe addresses scan initiation as a target-to-anatomy-to-action process: it grounds under-specified clinical complaints into structured target organs, instantiates a patient-specific anatomical representation for the grounded targets from a single external body image, and translates this representation into control-facing 6-DoF probe initialization states without any additional registration using preoperative CT or MRI. The anatomical representation maintained by SAMe is explicit, lightweight (single-organ inference in 0.08s), and compatible with downstream control by design. Across semantic grounding, anatomical instantiation, and real-robot evaluation, SAMe shows strong performance across the full initialization pipeline. In real-robot experiments, SAMe achieved overall organ-hit rates of 97.3% for liver initialization and 81.7% for kidney initialization across the evaluated target sets. Even when restricted to the centroid target, SAMe outperformed the surface-heuristic baseline for both liver and kidney initialization. These results establish an explicit anatomical prior layer that addresses scan initialization and is designed to support broader downstream autonomous scanning pipelines, providing the anatomical foundation for complaint-driven, anatomically informed robotic ultrasonography.
Chinese Translation
机器人超声技术在局部图像驱动控制、接触调节和视图优化方面取得了进展,但当前系统缺乏必要的解剖学理解,以确定扫描内容、开始位置以及如何适应个体患者的解剖结构。这些缺陷使得系统仍然依赖专家干预来启动扫描。在此,我们提出了SAMe,一种为机器人超声提供显式解剖学先验层的语义解剖映射引擎。SAMe将扫描启动视为目标-解剖-动作的过程:它将描述不充分的临床主诉转化为结构化的目标器官,从单个外部身体图像中为这些目标实例化患者特定的解剖表示,并将该表示转换为面向控制的6自由度探头初始化状态,而无需使用术前CT或MRI进行额外的配准。SAMe维护的解剖表示是显式的、轻量级的(单个器官推断时间为0.08秒),并且在设计上与下游控制兼容。在语义定位、解剖实例化和真实机器人评估中,SAMe在整个初始化流程中表现出色。在真实机器人实验中,SAMe在评估的目标集上实现了肝脏初始化的整体器官命中率为97.3%,肾脏初始化的命中率为81.7%。即使在仅限于质心目标的情况下,SAMe在肝脏和肾脏初始化方面也超越了表面启发式基线。这些结果建立了一个显式的解剖学先验层,解决了扫描初始化问题,并旨在支持更广泛的下游自主扫描管道,为主诉驱动的、基于解剖学信息的机器人超声提供了解剖基础。
cs.CV / 60 / 2604.25680

Exploring Remote Photoplethysmography for Neonatal Pain Detection from Facial Videos

探索基于远程光电容积描记法的面部视频新生儿疼痛检测
Dhamaniya, Ashutosh, Gupta, Anup Kumar, Saikia, Trishna, Gupta, Puneet
Abstract
Unaddressed pain in neonates can lead to adverse effects, including delayed development and slower weight gain, emphasising the need for more objective and reliable pain assessment methods. Hence, automated methods using behavioural and physiological pain indicators have been developed to aid healthcare professionals in the Neonatal ICU. Traditional contact-based methods for physiological parameter estimation are unsuitable for long-term monitoring and increase the risk of spreading diseases like COVID-19. We introduce a novel approach using remote photoplethysmography (rPPG) to estimate pulse signals in a non-contact manner and employ them for neonatal pain detection. The temporal signals acquired from regions-of-interest (ROIs) affected by skin deformations may exhibit lower quality and provide erroneous rPPG signals. Therefore, we incorporated a quality parameter to select the temporal signals obtained from ROIs that are least affected by skin deformations. Further, we employed signal-to-noise ratio as a fitness parameter to extract the rPPG signal corresponding to the clip that is least affected by noise. Experimental findings demonstrate that the rPPG signals provide useful information for neonatal pain detection, and signals extracted from the blue colour channel outperform those extracted from other colour channels. We also show that combining rPPG and audio features provides better results than individual modalities.
Chinese Translation
新生儿未得到及时处理的疼痛可能导致不良后果,包括发育延迟和体重增长缓慢,这强调了对更客观和可靠的疼痛评估方法的需求。因此,已经开发出使用行为和生理疼痛指标的自动化方法,以帮助新生儿重症监护病房的医疗专业人员。传统的基于接触的方法在生理参数估计方面不适合长期监测,并增加了传播COVID-19等疾病的风险。我们提出了一种新颖的方法,利用远程光电容积描记法(remote photoplethysmography, rPPG)以非接触方式估计脉搏信号,并将其用于新生儿疼痛检测。由于受皮肤变形影响的感兴趣区域(regions-of-interest, ROIs)获取的时间信号可能表现出较低的质量并提供错误的rPPG信号,因此我们引入了一种质量参数,以选择受皮肤变形影响最小的ROIs所获得的时间信号。此外,我们采用信噪比作为适应度参数,以提取受噪声影响最小的rPPG信号。实验结果表明,rPPG信号为新生儿疼痛检测提供了有用的信息,并且从蓝色通道提取的信号优于从其他颜色通道提取的信号。我们还展示了结合rPPG和音频特征的结果优于单一模态的结果。
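The SNR-based selection step can be sketched as follows: compute a power spectrum for each candidate pulse signal, treat the dominant in-band peak as signal and the rest of the band as noise, and keep the candidate with the highest ratio. The 0.7 to 4.0 Hz band, the one-bin peak window, and the function names are assumptions for illustration. The paper's exact SNR definition is not stated in the abstract.

```python
import cmath
import math


def power_spectrum(signal):
    """Power of the mean-removed signal at DFT bins 0 .. n/2 (naive O(n^2) DFT)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered))) ** 2
        for k in range(n // 2 + 1)
    ]


def snr_db(signal, fs, band=(0.7, 4.0)):
    """In-band power around the dominant peak (+/- 1 bin) over the remaining in-band power, in dB."""
    spec = power_spectrum(signal)
    n = len(signal)
    bins = [k for k in range(len(spec)) if band[0] <= k * fs / n <= band[1]]
    peak = max(bins, key=lambda k: spec[k])
    signal_power = sum(spec[k] for k in bins if abs(k - peak) <= 1)
    noise_power = sum(spec[k] for k in bins if abs(k - peak) > 1)
    return 10 * math.log10(signal_power / max(noise_power, 1e-12))


def select_cleanest(candidates, fs):
    """Index of the candidate signal with the highest SNR."""
    return max(range(len(candidates)), key=lambda i: snr_db(candidates[i], fs))
```

A production pipeline would use an FFT rather than the naive transform shown here, which is kept only so the sketch needs no third-party libraries.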
cs.CV / 61 / 2604.25688

QB-LIF: Learnable-Scale Quantized Burst Neurons for Efficient SNNs

QB-LIF:用于高效脉冲神经网络的可学习尺度量化突发神经元
Bai, Dewei, Peng, Hongxiang, Mei, Jiajun, Ren, Yang, Qu, Hong, Xia, Dawen, Yi, Zhang
Abstract
Binary spike coding enables sparse and event-driven computation in spiking neural networks (SNNs), yet its 1-bit-per-timestep representation fundamentally limits information throughput. This bottleneck becomes increasingly restrictive in deep architectures under short simulation horizons. We propose the Quantized Burst-LIF (QB-LIF) neuron, which reformulates burst spiking as a saturated uniform quantization of membrane potentials with a learnable scale. Instead of relying on predefined multi-threshold structures, QB-LIF treats the quantization scale as a trainable parameter, allowing each layer to autonomously adapt its spiking resolution to the underlying membrane-potential statistics. To preserve hardware efficiency, we introduce an absorbable scale strategy that folds the learned quantized scale into synaptic weights during inference, maintaining a strict accumulate-only (AC) execution paradigm. To enable stable optimization in the discrete multi-level space, we further design ReLSG-ET, a rectified-linear surrogate gradient with exponential tails that sustains gradient flow across burst intervals. Extensive experiments on static (CIFAR-10/100, ImageNet) and event-driven (CIFAR10-DVS, DVS128-Gesture) benchmarks demonstrate that QB-LIF consistently outperforms binary and fixed-burst SNNs, achieving higher accuracy under ultra-low latency while preserving neuromorphic compatibility.
Chinese Translation
二进制脉冲编码使得脉冲神经网络(SNNs)能够进行稀疏和事件驱动的计算,但其每个时间步1位的表示在根本上限制了信息的吞吐量。在短时间仿真范围内,这一瓶颈在深度架构中变得愈加严格。我们提出了量化突发LIF(QB-LIF)神经元,它将突发脉冲重新表述为膜电位的饱和均匀量化,并引入可学习的尺度。QB-LIF不依赖于预定义的多阈值结构,而是将量化尺度视为可训练参数,使每一层能够自主适应其脉冲分辨率,以符合底层膜电位统计特性。为了保持硬件效率,我们引入了一种可吸收尺度策略,在推理过程中将学习到的量化尺度折叠到突触权重中,从而保持严格的仅累加(AC)执行范式。为了在离散多级空间中实现稳定优化,我们进一步设计了ReLSG-ET,这是一种具有指数尾部的修正线性替代梯度,能够在突发间隔中维持梯度流动。在静态(CIFAR-10/100,ImageNet)和事件驱动(CIFAR10-DVS,DVS128-Gesture)基准上的大量实验表明,QB-LIF在超低延迟下始终优于二进制和固定突发SNNs,同时保持神经形态兼容性,取得了更高的准确率。
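The two central ideas above, saturated uniform quantization of the membrane potential and folding the learned scale into the weights, can be checked with a small numeric sketch. Floor rounding, the function names, and the per-layer scalar scale are assumptions here, and the surrogate gradient (ReLSG-ET) used for training is omitted.

```python
def burst_quantize(membrane, scale, max_level):
    """Saturated uniform quantization: integer burst count in [0, max_level]."""
    level = int(membrane // scale) if membrane > 0 else 0
    return min(level, max_level)


def forward_scaled(bursts, weights, scale):
    """Training-time view: each burst level carries the value scale * level."""
    return sum(w * (scale * b) for w, b in zip(weights, bursts))


def fold_scale(weights, scale):
    """Absorb the scale into the synaptic weights once, before inference."""
    return [w * scale for w in weights]


def forward_folded(bursts, folded_weights):
    """Inference-time view: accumulate-only, one addition per spike in each burst."""
    total = 0.0
    for w, b in zip(folded_weights, bursts):
        for _ in range(b):
            total += w
    return total
```

Because w * (s * b) equals (w * s) * b, the folded path produces the same pre-activation without any multiplication at inference time, which is the property the absorbable scale strategy relies on.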
cs.CV / 62 / 2604.25720

Toward Multimodal Conversational AI for Age-Related Macular Degeneration

面向年龄相关性黄斑变性(AMD)的多模态对话人工智能
Gu, Ran, Hou, Benjamin, Hébert, Mélanie, Indurkar, Asmita, Yang, Yifan, Chew, Emily Y., Keenan, Tiarnán D. L., Lu, Zhiyong
Abstract
Despite strong performance of deep learning models in retinal disease detection, most systems produce static predictions without clinical reasoning or interactive explanation. Recent advances in multimodal large language models (MLLMs) integrate diagnostic predictions with clinically meaningful dialogue to support clinical decision-making and patient counseling. In this study, OcularChat, an MLLM, was fine-tuned from Qwen2.5-VL using simulated patient-physician dialogues to diagnose age-related macular degeneration (AMD) through visual question answering on color fundus photographs (CFPs). A total of 705,850 simulated dialogues paired with 46,167 CFPs were generated to train OcularChat to identify key AMD features and produce reasoned predictions. OcularChat demonstrated strong classification performance in AREDS, achieving accuracies of 0.954, 0.849, and 0.678 for the three diagnostic tasks: advanced AMD, pigmentary abnormalities, and drusen size, significantly outperforming existing MLLMs. On AREDS2, OcularChat remained the top-performing method on all tasks. Across three independent ophthalmologist graders, OcularChat achieved higher mean scores than a strong baseline model for advanced AMD (3.503 vs. 2.833), pigmentary abnormalities (3.272 vs. 2.828), drusen size (3.064 vs. 2.433), and overall impression (2.978 vs. 2.464) on a 5-point clinical grading rubric. Beyond strong objective performance in AMD severity classification, OcularChat demonstrated the ability to provide diagnostic reasoning, clinically relevant explanations, and interactive dialogue, with high performance in subjective ophthalmologist evaluation. These findings suggest that MLLMs may enable accurate, interpretable, and clinically useful image-based diagnosis and classification of AMD.
Chinese Translation
尽管深度学习模型在视网膜疾病检测方面表现出色,但大多数系统仅提供静态预测,而缺乏临床推理或互动解释。最近,多模态大型语言模型(MLLMs)的进展将诊断预测与临床意义对话相结合,以支持临床决策和患者咨询。在本研究中,OcularChat 是一个基于 Qwen2.5-VL 微调的 MLLM,利用模拟的患者-医生对话,通过对彩色眼底照片(CFPs)的视觉问答来诊断年龄相关性黄斑变性(AMD)。共生成了 705,850 个模拟对话,并配对 46,167 张 CFPs,以训练 OcularChat 识别关键的 AMD 特征并产生有理有据的预测。OcularChat 在 AREDS 中表现出强大的分类性能,对于三项诊断任务:晚期 AMD、色素异常和玻璃膜疣(drusen)大小,其准确率分别达到了 0.954、0.849 和 0.678,显著优于现有的 MLLMs。在 AREDS2 中,OcularChat 在所有任务中仍然是表现最好的方法。在三位独立的眼科医生评审中,OcularChat 在晚期 AMD(3.503 vs. 2.833)、色素异常(3.272 vs. 2.828)、玻璃膜疣大小(3.064 vs. 2.433)和总体印象(2.978 vs. 2.464)的 5 分临床评分标准上,均取得了高于强基线模型的平均分。除了在 AMD 严重程度分类中的强客观表现外,OcularChat 还展示了提供诊断推理、临床相关解释和互动对话的能力,并在主观眼科医生评估中表现优异。这些发现表明,MLLMs 可能使基于图像的 AMD 诊断和分类更加准确、可解释且具有临床实用性。
cs.CV / 63 / 2604.25781

Sketch2Arti: Sketch-based Articulation Modeling of CAD Objects

Sketch2Arti:基于草图的CAD对象关节建模
Yang, Yi, Pan, Hao, Cui, Yijing, Sheffer, Alla, Li, Changjian
Abstract
Articulation modeling aims to infer movable parts and their motion parameters for a 3D object, enabling interactive animation, simulation, and shape editing. In this paper, we present Sketch2Arti, the first sketch-based articulation modeling system for CAD objects. Our key observation is that designers naturally communicate articulation intent through lightweight sketches (e.g., arrows and strokes) that indicate how parts should move, yet translating such sketches into articulated 3D models remains largely manual. Sketch2Arti bridges this gap by enabling users to specify articulation through simple 2D sketches drawn from a chosen viewpoint. Given a CAD model and user sketches, our approach automatically discovers the corresponding movable parts and predicts their motion parameters, allowing iterative modeling of multiple articulations on complex objects with fine-grained control. Importantly, Sketch2Arti is trained in a category-agnostic manner without requiring object category information, leading to strong generalization to diverse objects beyond existing articulation datasets. Moreover, for shell models lacking interior structures, Sketch2Arti supports controllable internal completion guided by user sketches, generating plausible internal components consistent with the existing geometry and predicted motion constraints. Comprehensive experiments and user evaluations demonstrate the effectiveness, controllability, and generalization of Sketch2Arti. The code, dataset, and the prototype system are at https://arlo-yang.github.io/Sketch2Arti.
Chinese Translation
关节建模旨在推断3D对象的可动部件及其运动参数,从而实现交互式动画、仿真和形状编辑。在本文中,我们提出了Sketch2Arti,这是首个针对CAD对象的基于草图的关节建模系统。我们的关键观察是,设计师通过轻量级草图(例如箭头和笔画)自然地传达关节意图,这些草图指示了部件应如何移动,但将这些草图转换为关节3D模型仍然主要依赖手动操作。Sketch2Arti弥补了这一空白,使用户能够通过从选定视角绘制的简单2D草图来指定关节。给定CAD模型和用户草图,我们的方法自动发现相应的可动部件并预测其运动参数,从而允许对复杂对象进行多次关节的迭代建模,并实现精细控制。重要的是,Sketch2Arti以类别无关的方式进行训练,无需对象类别信息,从而在现有关节数据集之外对多样对象具有强大的泛化能力。此外,对于缺乏内部结构的壳模型,Sketch2Arti支持由用户草图引导的可控内部补全,生成与现有几何形状和预测运动约束一致的合理内部组件。全面的实验和用户评估证明了Sketch2Arti的有效性、可控性和泛化能力。代码、数据集和原型系统可在 https://arlo-yang.github.io/Sketch2Arti 获取。
cs.CV / 64 / 2604.25795

Improving Diversity in Black-box Few-shot Knowledge Distillation

提升黑箱少样本知识蒸馏中的多样性
Vo, Tri-Nhan, Nguyen, Dang, Do, Kien, Gupta, Sunil
Abstract
Knowledge distillation (KD) is a well-known technique to effectively compress a large network (teacher) to a smaller network (student) with little sacrifice in performance. However, most KD methods require a large training set and internal access to the teacher, which are rarely available due to various restrictions. These challenges have given rise to a more practical setting known as black-box few-shot KD, where the student is trained with few images and a black-box teacher. Recent approaches typically generate additional synthetic images but lack an active strategy to promote their diversity, a crucial factor for student learning. To address these problems, we propose a novel training scheme for generative adversarial networks, where we adaptively select high-confidence images under the teacher's supervision and introduce them to the adversarial learning on-the-fly. Our approach helps expand and improve the diversity of the distillation set, significantly boosting student accuracy. Through extensive experiments, we achieve state-of-the-art results among few-shot KD methods on seven image datasets. The code is available at https://github.com/votrinhan88/divbfkd.
Chinese Translation
知识蒸馏(Knowledge Distillation, KD)是一种广为人知的技术,能够有效地将大型网络(教师)压缩为较小的网络(学生),且在性能上几乎不牺牲。然而,大多数KD方法需要大量的训练集和对教师的内部访问,而这些在各种限制下很少可用。这些挑战催生了一种更为实用的设置,称为黑箱少样本KD,其中学生仅使用少量图像和黑箱教师进行训练。近期的方法通常生成额外的合成图像,但缺乏促进其多样性的主动策略,而多样性是学生学习的一个关键因素。为了解决这些问题,我们提出了一种新颖的生成对抗网络训练方案,在教师的监督下自适应地选择高置信度图像,并将其引入到对抗学习中。我们的方法有助于扩展和改善蒸馏集的多样性,显著提升学生的准确性。通过广泛的实验,我们在七个图像数据集上实现了与其他少样本KD方法相比的最先进结果。代码可在 https://github.com/votrinhan88/divbfkd 获取。
cs.CV / 65 / 2604.25809

Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

基于指令-证据对比的双流解码框架用于基础视觉-语言推理
Bangde, Yashwant Pravinrao, Roy, Debaditya
Abstract
Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior work has shown that instruction prompting further worsens this issue by amplifying language priors, especially when the visual signal is uncertain or ambiguous. To address this challenge, we propose a decoding framework that explicitly balances linguistic informativeness and visual faithfulness during generation. Our method, Instruction-Evidence Contrastive Dual-Stream Decoding (IECD2), maintains two parallel probability distributions of tokens at each decoding step: an instruction-driven stream that promotes expressive and informative responses, and an evidence-driven stream that enforces strict grounding in the image. These two streams are adaptively fused using a symmetric KL-based contrastive gate, which suppresses tokens favored by language priors but unsupported by visual evidence, while preserving them when both distributions agree. We evaluate IECD2 on multiple datasets spanning various generative vision-language reasoning tasks such as captioning and visual question answering, including POPE, MME, VQAv2, AMBER, MS-COCO, and LLaVA-Bench. IECD2 demonstrates consistent improvements in task accuracy and reasoning performance, alongside a substantial reduction in hallucination across all evaluation metrics compared to state-of-the-art decoding approaches.
Chinese Translation
视觉-语言模型(VLMs)在遵循指令和开放式视觉-语言推理方面表现出强大的性能,但它们经常生成流畅的输出,而这些输出在视觉证据上却缺乏扎实的基础。先前的研究表明,指令提示进一步加剧了这一问题,尤其是在视觉信号不确定或模糊时,放大了语言先验。为了解决这一挑战,我们提出了一种解码框架,明确平衡生成过程中的语言信息性和视觉真实性。我们的方法,指令-证据对比双流解码(Instruction-Evidence Contrastive Dual-Stream Decoding, IECD2),在每个解码步骤中维护两个并行的标记概率分布:一个以指令为驱动的流,促进表达性和信息丰富的响应;另一个以证据为驱动的流,强制在图像中严格扎根。这两个流通过一个基于对称KL的对比门进行自适应融合,该门抑制那些受到语言先验青睐但缺乏视觉证据支持的标记,同时在两个分布一致时保留它们。我们在多个数据集上评估IECD2,这些数据集涵盖了各种生成性视觉-语言推理任务,如图像描述和视觉问答,包括POPE、MME、VQAv2、AMBER、MS-COCO和LLaVA-Bench。与最先进的解码方法相比,IECD2在任务准确性和推理性能上均表现出一致的提升,同时在所有评估指标上显著减少了幻觉现象。
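One way to picture a symmetric-KL gate over two token distributions is sketched below. The gate `skl / (1 + skl)` and the geometric interpolation between the streams are hypothetical choices made for illustration. The abstract does not specify the fusion rule, so this should not be read as the IECD2 formula.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """KL(p || q) + KL(q || p), written as sum((p - q) * log(p / q))."""
    return sum((a - b) * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def fuse(p_instr, p_evid):
    """Fuse instruction-driven and evidence-driven next-token distributions.

    The gate is 0 when the streams agree (output equals the instruction stream)
    and approaches 1 as they diverge (output leans on the evidence stream).
    """
    skl = symmetric_kl(p_instr, p_evid)
    gate = skl / (1.0 + skl)
    unnorm = [(a ** (1.0 - gate)) * (b ** gate) for a, b in zip(p_instr, p_evid)]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

In this sketch a token that the instruction stream favors but the evidence stream does not is down-weighted only when the two distributions disagree overall, which mirrors the behavior described in the abstract.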
cs.CV / 66 / 2604.25817

Magnification-Invariant Image Classification via Domain Generalization and Stable Sparse Embedding Signatures

通过领域泛化和稳定稀疏嵌入特征实现放大不变的图像分类
Ezuma, Ifeanyi, Medaiyese, Olusiji
Abstract
Magnification shift is a major obstacle to robust histopathology classification, because models trained on one imaging scale often generalize poorly to another. Here, we evaluated this problem on the BreaKHis dataset using a strict patient-disjoint leave-one-magnification-out protocol, comparing a supervised baseline, a baseline augmented with DCGAN-generated patches, and a gradient-reversal domain-general model designed to preserve discriminative information while suppressing magnification-specific variation. Across held-out magnifications, the domain-general model achieved the strongest overall discrimination, and its clearest gain was observed when 200X was held out. By contrast, GAN augmentation produced inconsistent effects, improving some folds but degrading others, particularly at 400X. The domain-general model also yielded the lowest Brier score (0.063 vs 0.089 for the baseline). Sparse embedding analysis further revealed that domain-general training reduced average signature size more than three-fold (306 versus 1,074 dimensions) while preserving equivalent predictive performance (AUC: 0.967 vs 0.965; F1: 0.930 vs 0.931). It also increased cross-fold signature reproducibility from near-zero Jaccard overlap in the baseline to 0.99 between the 100X and 200X folds. These findings show that calibrated, compact, and transferable representations can be learned without added architectural complexity, with clear implications for the reliable deployment of computational pathology models across heterogeneous acquisition settings.
Chinese Translation
放大偏移是稳健的组织病理学分类面临的主要障碍,因为在一种成像尺度上训练的模型往往在另一种尺度上泛化效果不佳。在此,我们使用严格的患者不重叠的留一放大尺度法在BreaKHis数据集上评估了这一问题,比较了监督基线、通过DCGAN生成的补丁增强的基线,以及旨在保留判别信息同时抑制放大特定变异的梯度反转领域泛化模型。在保留的放大尺度中,领域泛化模型实现了最强的整体判别能力,其最明显的提升发生在200X被保留时。相比之下,GAN增强产生了不一致的效果,改善了一些折叠但降低了其他折叠的性能,尤其是在400X时。领域泛化模型的Brier分数也最低,为0.063,而基线为0.089。稀疏嵌入分析进一步揭示,领域泛化训练将平均特征大小减少了三倍以上(306维对比1074维),同时保持了相当的预测性能(AUC: 0.967对比0.965;F1: 0.930对比0.931)。它还将交叉折叠特征的可重复性从基线的接近零的Jaccard重叠提高到100X和200X折叠之间的0.99。这些发现表明,可以在不增加架构复杂性的情况下学习经过校准、紧凑且可转移的表示,这对在异构采集环境中可靠部署计算病理学模型具有明确的意义。
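The two summary statistics quoted in this abstract, Jaccard overlap between sparse signatures and the Brier score, have standard definitions that can be sketched briefly. The function names and toy inputs below are illustrative assumptions; only the 306 and 1,074 dimension counts come from the abstract.

```python
def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two sets of embedding dimensions."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and binary labels (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

# Toy signatures: three shared dimensions out of five in the union.
overlap = jaccard({1, 2, 3, 4}, {2, 3, 4, 5})

# Toy predictions for one positive and one negative case.
score = brier_score([0.9, 0.2], [1, 0])

# Size reduction reported in the abstract: 1,074 versus 306 dimensions.
reduction_factor = 1074 / 306
```

With these toy inputs the overlap is 0.6, and the reported signature sizes give a reduction factor of about 3.5, consistent with "more than three-fold".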
cs.CV / 67 / 2604.25819

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

互助强制:快速自回归音视频角色生成的双模式自我演化
Zhou, Yupeng, Huang, Lianghua, Wu, Zhifan, Wang, Jiabao, Shi, Yupeng, Jiang, Biao, Zhou, Daquan, Liu, Yu, Cheng, Ming-Ming, Hou, Qibin
Abstract
In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.
Chinese Translation
在本研究中,我们提出了互助强制(Mutual Forcing),这是一个用于快速自回归音视频生成的框架,能够实现长时间音视频同步。我们的方法解决了两个关键挑战:联合音视频建模和快速自回归生成。为了简化联合音视频优化,我们采用了两阶段训练策略:首先训练单模态生成器,然后将其结合成统一的音视频模型,以便在配对数据上进行联合训练。对于流式生成,我们探讨是否可以直接训练一个原生的快速因果音视频模型,而不是遵循现有的流式蒸馏流程,该流程通常先训练一个双向模型,然后通过多个蒸馏阶段将其转换为因果生成器。我们的答案是互助强制,它直接基于原生自回归模型,并在单个共享权重模型中集成了少步和多步生成,支持自我蒸馏和改进的训练-推理一致性。多步模式通过自我蒸馏改善少步模式,而少步模式在训练期间生成历史上下文,以提高训练-推理一致性;由于这两种模式共享参数,这两种效果在单一模型中相互增强。与之前的方法(如自我强制(Self-Forcing))相比,互助强制消除了对额外双向教师模型的需求,支持更灵活的训练序列长度,减少了训练开销,并允许模型直接从真实配对数据中改进,而不是依赖固定的教师。实验表明,互助强制在仅使用4到8步的情况下,能够匹配或超越需要约50个采样步骤的强基线,展示了在效率和质量上的显著优势。项目页面可访问 https://mutualforcing.github.io。
cs.CV / 68 / 2604.25855

SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

SIEVES:通过视觉证据评分实现选择性预测的泛化
Rodriguez, Hector G., Rohrbach, Marcus
Abstract
Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. To enable reliable generalization, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all five tested OOD datasets and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation.
Chinese Translation
多模态大型语言模型(MLLMs)在视觉语言任务上取得了越来越强的表现。尽管传统的视觉问答基准接近饱和,可靠的部署仍需在现实世界的分布外(OOD)场景中满足低错误容忍度。选择性预测的目标是提高覆盖率,即系统回答的输入比例,同时遵循用户定义的风险水平。这通常通过为每个答案分配置信度评分,并对低于某一阈值的答案进行放弃来实现。为了实现可靠的泛化,我们要求推理模型在回答时生成局部的视觉证据,并设计一个选择器,明确学习评估推理者提供的定位质量。我们展示了SIEVES(通过视觉证据评分的选择性预测)在具有挑战性的OOD基准(如V* Bench、HR-Bench-8k、MME-RealWorld-Lite、VizWiz和AdVQA)上相比于非基础模型提高了多达三倍的覆盖率。除了更好地泛化到OOD任务外,SIEVES选择器的设计使其能够在没有访问权重或logits的情况下转移到专有推理器,如o3和Gemini-3-Pro,提供超出仅由准确性所致的覆盖提升。我们强调,SIEVES在所有五个测试的OOD数据集和推理模型(Pixel-Reasoner、o3和Gemini-3-Pro)上均实现了泛化,而无需基准或推理器特定的训练或适应。
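The coverage-versus-risk trade-off that selective prediction optimizes can be sketched as follows. This is a generic illustration of thresholding on a confidence score, not the SIEVES selector itself; the function name and the toy data are assumptions.

```python
def coverage_at_risk(confidences, correct, max_risk):
    """Largest share of inputs that can be answered while the error rate
    on the answered inputs stays at or below max_risk."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    best, errors = 0.0, 0
    for answered, (_, is_correct) in enumerate(ranked, start=1):
        errors += 0 if is_correct else 1
        if errors / answered <= max_risk:
            best = answered / len(ranked)
    return best

# Toy example: five answers ranked by selector confidence.
confs = [0.9, 0.8, 0.7, 0.6, 0.5]
right = [True, True, False, True, False]
cov = coverage_at_risk(confs, right, max_risk=0.25)
```

In this toy case the system can answer four of the five inputs (coverage 0.8) at a 25% risk level. A better selector ranks wrong answers lower, which raises the coverage attainable at the same risk.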
cs.CV / 69 / 2604.25887

No Pedestrian Left Behind: Real-Time Detection and Tracking of Vulnerable Road Users for Adaptive Traffic Signal Control

不让任何行人被遗忘:用于自适应交通信号控制的脆弱道路用户实时检测与追踪
Aly, Anas Gamal, ElAarag, Hala
Abstract
Current pedestrian crossing signals operate on fixed timing without adjustment to pedestrian behavior, which can leave vulnerable road users (VRUs) such as the elderly, disabled, or distracted pedestrians stranded when the light changes. We introduce No Pedestrian Left Behind (NPLB), a real-time adaptive traffic signal system that monitors VRUs in crosswalks and automatically extends signal timing when needed. We evaluated five state-of-the-art object detection models on the BGVP dataset, with YOLOv12 achieving the highest mean Average Precision at 50% (mAP@0.5) of 0.756. NPLB integrates our fine-tuned YOLOv12 with ByteTrack multi-object tracking and an adaptive controller that extends pedestrian phases when remaining time falls below a critical threshold. Through 10,000 Monte Carlo simulations, we demonstrate that NPLB improves VRU safety by 71.4%, reducing stranding rates from 9.10% to 2.60%, while requiring signal extensions in only 12.1% of crossing cycles.
Chinese Translation
当前的行人过街信号基于固定时间运行,未能根据行人行为进行调整,这可能导致脆弱道路用户(VRUs),如老年人、残疾人或分心的行人,在信号灯变换时被困。我们提出了“不让任何行人被遗忘”(No Pedestrian Left Behind, NPLB)系统,这是一种实时自适应交通信号系统,能够监测人行道上的脆弱道路用户,并在需要时自动延长信号时间。我们在BGVP数据集上评估了五种最先进的目标检测模型,其中YOLOv12在50%平均精度均值(mean Average Precision, mAP@0.5)上达到了最高值0.756。NPLB将我们经过微调的YOLOv12与ByteTrack多目标追踪和自适应控制器集成,当剩余时间低于临界阈值时,自动延长行人相位。通过10,000次蒙特卡洛模拟,我们证明NPLB将脆弱道路用户的安全性提高了71.4%,将滞留率从9.10%降低至2.60%,同时仅在12.1%的过街周期中需要信号延长。
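The 71.4% safety improvement quoted in the NPLB abstract is consistent with a relative reduction in the stranding rate. The short check below uses only the two rates from the abstract; the variable names are illustrative.

```python
baseline_rate = 9.10  # stranding rate without NPLB, in percent (from the abstract)
nplb_rate = 2.60      # stranding rate with NPLB, in percent (from the abstract)

# Relative reduction: the share of baseline stranding events that no longer occur.
relative_reduction = (baseline_rate - nplb_rate) / baseline_rate * 100
print(f"{relative_reduction:.1f}%")  # prints 71.4%
```

The absolute reduction is 6.5 percentage points; the headline figure expresses that drop as a share of the 9.10% baseline.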
cs.CV / 70 / 2604.25889

Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles

鲁棒深度伪造检测:通过校准的互补集成减轻空间注意力漂移
Le-Phan, Minh-Khoa, Le, Minh-Hoang, Do, Trong-Le, Tran, Minh-Triet
Abstract
Current deepfake detection models achieve state-of-the-art performance on pristine academic datasets but suffer severe spatial attention drift under real-world compound degradations, such as blurring and severe lossy compression. To address this vulnerability, we propose a foundation-driven forensic framework that integrates an extreme compound degradation engine with a structurally constrained, multi-stream architecture. During training, our degradation pipeline systematically destroys high-frequency artifacts, optimizing the DINOv2-Giant backbone to extract invariant geometric and semantic priors. We then process images through three specialized pathways: a Global Texture stream, a Localized Facial stream, and a Hybrid Semantic Fusion stream incorporating CLIP. Through analyzing spatial attribution via Score-CAM and feature stability using Cosine Similarity, we quantitatively demonstrate that these streams extract non-redundant, complementary feature representations and stabilize attention entropy. By aggregating these predictions via a calibrated, discretized voting mechanism, our ensemble successfully suppresses background attention drift while acting as a robust geometric anchor. Our approach yields highly stable zero-shot generalization, achieving Fourth Place in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR. Code is available at https://github.com/khoalephanminh/ntire26-deepfake-challenge.
Chinese Translation
目前的深度伪造检测模型在干净的学术数据集上实现了最先进的性能,但在现实世界的复合退化(如模糊和严重的有损压缩)下遭遇严重的空间注意力漂移。为了解决这一脆弱性,我们提出了一种基于基础驱动的取证框架,该框架将极端复合退化引擎与结构约束的多流架构相结合。在训练过程中,我们的退化管道系统性地破坏高频伪影,优化DINOv2-Giant主干网络以提取不变的几何和语义先验。然后,我们通过三个专门的通道处理图像:全局纹理流、局部面部流和结合CLIP的混合语义融合流。通过使用Score-CAM分析空间归因和使用余弦相似度评估特征稳定性,我们定量证明这些流提取了非冗余的互补特征表示并稳定了注意力熵。通过校准的离散投票机制聚合这些预测,我们的集成成功抑制了背景注意力漂移,同时充当了鲁棒的几何锚点。我们的方法实现了高度稳定的零样本泛化,在CVPR的NTIRE 2026鲁棒深度伪造检测挑战中获得第四名。代码可在 https://github.com/khoalephanminh/ntire26-deepfake-challenge 获取。
人工智能 (Artificial Intelligence)
42
cs.AI / 1 / 2604.24842

Co-Director: Agentic Generative Video Storytelling

共同导演:自主生成视频叙事
Song, Yale, Song, Yiwen, Losier, Nick, Hodson, Nathan, Jin, Ye, Zhu, Rhyard, Xu, Yan, Vlasic, Daniel, Claassen, Carina, Leon, Jasmine, LeViet, Khanh G., Chomyn, Zack, Timmons, Joe, Slatkin, Brett, Penberthy, Scott, Pfister, Tomas
Abstract
While diffusion models generate high-fidelity video clips, transforming them into coherent storytelling engines remains challenging. Current agentic pipelines automate this via chained modules but suffer from semantic drift and cascading failures due to independent, handcrafted prompting. We present Co-Director, a hierarchical multi-agent framework formalizing video storytelling as a global optimization problem. To ensure semantic coherence, we introduce hierarchical parameterization: a multi-armed bandit globally identifies promising creative directions, while a local multimodal self-refinement loop mitigates identity drift and ensures sequence-level consistency. This balances the exploration of novel narrative strategies with the exploitation of effective creative configurations. For evaluation, we introduce GenAD-Bench, a 400-scenario dataset of fictional products for personalized advertising. Experiments demonstrate that Co-Director significantly outperforms state-of-the-art baselines, offering a principled approach that seamlessly generalizes to broader cinematic narratives. Project Page: https://co-director-agent.github.io/
Chinese Translation
尽管扩散模型能够生成高保真的视频片段,但将其转化为连贯的叙事引擎仍然具有挑战性。目前的自主管道通过链式模块实现自动化,但由于独立的手工提示,存在语义漂移和级联故障的问题。我们提出了共同导演(Co-Director),一个将视频叙事形式化为全局优化问题的分层多智能体框架。为了确保语义的一致性,我们引入了分层参数化:一个多臂赌博机(multi-armed bandit)全局识别有前景的创意方向,而一个本地多模态自我优化循环则减轻身份漂移并确保序列级一致性。这在探索新颖叙事策略与利用有效创意配置之间达成了平衡。为了评估,我们引入了GenAD-Bench,这是一个包含400个虚构产品场景的个性化广告数据集。实验表明,共同导演显著优于最新的基准,提供了一种原则性的方法,能够无缝地推广到更广泛的电影叙事中。项目页面:https://co-director-agent.github.io/
cs.AI / 2 / 2604.24881

Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate

潜在代理:一种内化多代理辩论的后训练程序
Yi, John Seon Keun, Mueller, Aaron, Lee, Dokyun
Abstract
Multi-agent debate has been shown to improve reasoning in large language models (LLMs). However, it is compute-intensive, requiring generation of long transcripts before answering questions. To address this inefficiency, we develop a framework that distills multi-agent debate into a single LLM through a two-stage fine-tuning pipeline combining debate structure learning with internalization via dynamic reward scheduling and length clipping. Across multiple models and benchmarks, our internalized models match or exceed explicit multi-agent debate performance using up to 93% fewer tokens. We then investigate the mechanistic basis of this capability through activation steering, finding that internalization creates agent-specific subspaces: interpretable directions in activation space corresponding to different agent perspectives. We further demonstrate a practical application: by instilling malicious agents into the LLM through internalized debate, then applying negative steering to suppress them, we show that distillation makes harmful behaviors easier to localize and control with smaller reductions in general performance compared to steering base models. Our findings offer a new perspective for understanding multi-agent capabilities in distilled models and provide practical guidelines for controlling internalized reasoning behaviors. Code available at https://github.com/johnsk95/latent_agents
Chinese Translation
多代理辩论已被证明可以提高大型语言模型(LLMs)的推理能力。然而,这一过程计算密集,需在回答问题之前生成长篇转录文本。为了解决这一低效问题,我们开发了一个框架,通过结合辩论结构学习与动态奖励调度和长度裁剪的内化过程,采用两阶段微调管道将多代理辩论提炼为单一的LLM。在多个模型和基准测试中,我们的内化模型在使用最多93%更少的标记的情况下,表现与显式多代理辩论相匹配或超越。随后,我们通过激活引导研究这一能力的机制基础,发现内化创造了特定于代理的子空间:在激活空间中对应不同代理视角的可解释方向。我们进一步展示了一种实际应用:通过内化辩论将恶意代理植入LLM,然后应用负引导来抑制它们,我们表明,蒸馏使得有害行为更容易被定位和控制,并且与引导基础模型相比,减少一般性能的幅度更小。我们的发现为理解蒸馏模型中的多代理能力提供了新的视角,并为控制内化推理行为提供了实用指南。代码可在 https://github.com/johnsk95/latent_agents 获取。
cs.AI / 3 / 2604.24933

S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

S-SONDO:用于通用音频基础模型的自监督知识蒸馏
Adlouni, Mohammed Ali El, Quelennec, Aurian, Chouteau, Pierre, Peeters, Geoffroy, Essid, Slim
Abstract
General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available here: https://github.com/MedAliAdlouni/ssondo.
Chinese Translation
通用音频基础模型最近取得了显著进展,使其在多种任务中表现出色。然而,最先进的模型仍然非常庞大,通常拥有数亿个参数,这导致高推理成本并限制了在边缘设备上的部署。知识蒸馏是一种经过验证的模型压缩策略,但以往在音频领域的研究主要集中在监督设置上,依赖于类别对数、中间特征或特定架构的技术。这些假设排除了仅输出嵌入的模型,例如自监督或度量学习模型。我们提出了S-SONDO(Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models),这是第一个仅使用输出嵌入对通用音频模型进行蒸馏的框架。通过避免对数或层级对齐的需求,S-SONDO具有架构无关性,广泛适用于基于嵌入的教师模型。我们通过将两个音频基础模型蒸馏为三个高效的学生模型,证明了其有效性,这些学生模型的大小最多可缩小61倍,同时保留了教师模型高达96%的性能。我们还提供了关于损失选择和基于聚类的平衡数据采样的实用见解。代码可在此获取:https://github.com/MedAliAdlouni/ssondo。
cs.AI / 4 / 2604.24983

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

针对大型语言模型越狱的自适应提示嵌入优化
Li, Miles Q., Fung, Benjamin C. M., Li, Boyang, Rad, Radin Hamidi, Bagheri, Ebrahim
Abstract
Existing white-box jailbreak attacks against aligned LLMs typically append discrete adversarial suffixes to the user prompt, which visibly alters the prompt and operates in a combinatorial token space. Prior work has avoided directly optimizing the embeddings of the original prompt tokens, presumably because perturbing them risks destroying the prompt's semantic content. We propose Prompt Embedding Optimization (PEO), a multi-round white-box jailbreak that directly optimizes the embeddings of the original prompt tokens without appending any adversarial tokens, and show that the concern is unfounded: the optimized embeddings remain close enough to their originals that the visible prompt string is preserved exactly after nearest-token projection, and quantitative analysis shows the model's responses stay on topic for the large majority of prompts. PEO combines continuous embedding-space optimization with structured continuation targets and an adaptive failure-focused schedule. Counterintuitively, later PEO rounds can benefit from heuristic composite response scaffolds that are not natural standalone templates, yet ASR-Judge shows that the resulting gains are not merely empty formatting or scaffold-only outputs. Across two standard harmful-behavior benchmarks and competing white-box attacks spanning discrete suffix search, appended adversarial embeddings, and search-based adversarial generation, PEO outperforms all of them in our experiments.
Chinese Translation
现有针对对齐大型语言模型(LLMs)的白盒越狱攻击通常在用户提示后附加离散的对抗后缀,这显著改变了提示并在组合标记空间中操作。之前的研究避免直接优化原始提示标记的嵌入,可能是因为扰动它们有可能破坏提示的语义内容。我们提出了提示嵌入优化(Prompt Embedding Optimization, PEO),这是一种多轮白盒越狱方法,直接优化原始提示标记的嵌入,而不附加任何对抗标记,并且表明这种担忧是没有根据的:优化后的嵌入与原始嵌入保持足够接近,以至于在最近标记投影后可见的提示字符串完全保留,定量分析显示模型的响应在绝大多数提示中保持主题一致。PEO结合了连续的嵌入空间优化、结构化的延续目标和自适应的失败聚焦调度。出乎意料的是,后期的PEO轮次可以受益于不自然的独立模板的启发式复合响应支架,然而ASR-Judge显示,所获得的收益并不仅仅是空洞的格式或仅支架输出。在两个标准有害行为基准和涵盖离散后缀搜索、附加对抗嵌入及基于搜索的对抗生成的竞争白盒攻击中,PEO在我们的实验中表现优于所有这些方法。
cs.AI / 5 / 2604.24987

Assessing Y-Axis Influence: Bias in Multimodal Language Models on Chart-to-Table Translation

评估Y轴影响:多模态语言模型在图表到表格翻译中的偏差
Song, Seok Hwan, Efat, Azher Ahmed, Tavanapong, Wallapak
Abstract
Chart-to-table translation converts chart images into structured tabular data. Accurate translation is crucial for Multimodal Language Models (MLMs) to answer complex queries. We observe imbalances in the number of images across different aspects of the y-axis information in public chart datasets. Such imbalances can introduce unintended biases, causing uneven MLM performance. Previous works have not systematically examined these biases. To address this gap, we propose a new framework, FairChart2Table, for analyzing y-axis-related bias in five state-of-the-art models. Key Findings: (1) There are significant y-axis biases related to the digit length of the major tick values, the number of major ticks, the range of values, and the tick value format (e.g., abbreviation or scientific format). (2) The number of legends/entities in chart images impacts MLM performance. (3) Prompting an MLM with y-axis information can significantly enhance performance for some MLMs.
Chinese Translation
图表到表格翻译将图表图像转换为结构化的表格数据。准确的翻译对于多模态语言模型(MLM)回答复杂查询至关重要。我们观察到公共图表数据集中y轴信息不同方面的图像数量存在不平衡。这种不平衡可能引入意想不到的偏差,导致MLM表现不均。以往的研究未系统地检查这些偏差。为了解决这一问题,我们提出了一个新的框架FairChart2Table,用于分析五个最先进模型的y轴相关偏差。主要发现:(1)与主要刻度值的数字长度、主要刻度数量、值的范围和刻度值格式(例如,缩写或科学格式)相关的y轴偏差显著。(2)图表图像中图例/实体的数量影响MLM性能。(3)用y轴信息提示MLM可以显著提高某些MLM的性能。
cs.AI / 6 / 2604.24996

Sparse Personalized Text Generation with Multi-Trajectory Reasoning

基于多轨迹推理的稀疏个性化文本生成
Ni, Bo, Fu, Haowei, Ge, Qinwen, Dernoncourt, Franck, Basu, Samyadeep, Lipka, Nedim, Yoon, Seunghyun, Wang, Yu, Ahmed, Nesreen K., Mukherjee, Subhojyoti, Mathur, Puneet, Rossi, Ryan A., Derr, Tyler
Abstract
As Large Language Models (LLMs) advance, personalization has become a key mechanism for tailoring outputs to individual user needs. However, most existing methods rely heavily on dense interaction histories, making them ineffective in cold-start scenarios where such data is sparse or unavailable. While external signals (e.g., content of similar users) can offer a potential remedy, leveraging them effectively remains challenging: raw context is often noisy, and existing methods struggle to reason over heterogeneous data sources. To address these issues, we introduce PAT (Personalization with Aligned Trajectories), a reasoning framework for cold-start LLM personalization. PAT first retrieves information along two complementary trajectories: writing-style cues from stylistically similar users and topic-specific context from preference-aligned users. It then employs a reinforcement learning-based, iterative dual-reasoning mechanism that enables the LLM to jointly refine and integrate these signals. Experimental results across real-world personalization benchmarks show that PAT consistently improves generation quality and alignment under sparse-data conditions, establishing a strong solution to the cold-start personalization problem.
Chinese Translation
随着大型语言模型(LLMs)的发展,个性化已成为根据个体用户需求定制输出的关键机制。然而,现有大多数方法严重依赖于密集的交互历史,这使得它们在冷启动场景中效果不佳,因为此类数据稀缺或不可用。虽然外部信号(例如,相似用户的内容)可以提供潜在的解决方案,但有效利用这些信号仍然具有挑战性:原始上下文通常噪声较多,现有方法在处理异构数据源时表现不佳。为了解决这些问题,我们提出了PAT(个性化与对齐轨迹),一种用于冷启动LLM个性化的推理框架。PAT首先沿着两个互补轨迹检索信息:来自风格相似用户的写作风格线索和来自偏好对齐用户的主题特定上下文。然后,它采用基于强化学习的迭代双重推理机制,使LLM能够共同优化和整合这些信号。在真实世界个性化基准上的实验结果表明,PAT在稀疏数据条件下始终提高生成质量和对齐度,为冷启动个性化问题提供了强有力的解决方案。
cs.AI / 7 / 2604.25000

Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents

迈向意图科学:开放世界人工智能代理的闭合差距与委托范围
Armesto, Maximiliano, Kolb, Christophe
Abstract
Recent work has framed intelligence in verifiable tasks as reducing time-to-solution through learned structure and test-time search, while systems work has explored learned runtimes in which computation, memory and I/O migrate into model state. These perspectives do not explain why capable models remain difficult to deploy in open institutions. We propose intent compilation: the transformation of partially specified human purpose into inspectable artifacts that bind execution. The relevant deployment distinction is closed-world solver versus open-world agent. In closed worlds, a checker is largely given; in open worlds, verification is distributed across semantic, evidentiary, procedural and institutional dimensions. We formalize this residual openness as a closure-gap vector, define delegation envelopes as pre-authorized regions of action space, distinguish misclosure from undersearch, and outline benchmark metrics for testing when closure interventions outperform additional inference-time search.
Chinese Translation
近期的研究将可验证任务中的智能框架化为通过学习结构和测试时搜索来减少解决时间,而系统工作则探讨了计算、内存和输入/输出迁移到模型状态的学习运行时。这些视角未能解释为何有能力的模型在开放机构中仍然难以部署。我们提出意图编译:将部分指定的人类目的转化为可检查的工件,以绑定执行。相关的部署区分是闭合世界求解器与开放世界代理。在闭合世界中,检查器在很大程度上是给定的;而在开放世界中,验证则分布在语义、证据、程序和制度维度上。我们将这种剩余的开放性形式化为闭合差距向量,定义委托范围为预授权的行动空间区域,区分误闭合与搜索不足,并概述测试闭合干预何时优于额外推理时间搜索的基准指标。
cs.AI / 8 / 2604.25040

Leverage Laws: A Per-Task Framework for Human-Agent Collaboration

杠杆法则:人机协作的任务级框架
Loosmore, Stan
Abstract
We propose a per-task leverage ratio for human-agent collaboration: human work displaced by an agent, divided by the human time required to specify the task, resolve mid-run interrupts, and review the result. The denominator decomposes into three channels through which a conserved per-task information requirement must flow, each with its own time-cost scalar. We show that information density itself is directional and bounded by separate ceilings on human-to-agent and agent-to-human flow, and that the asymptotic behavior of leverage decomposes into two scaling axes (capability and memory) with a non-zero floor on the planning term set by irreducible task novelty bounded by human throughput. We extend this per-task analysis to a windowed leverage measure that accommodates recurring tasks, spawned subtasks, and amortized system-design investment. The per-task ceiling does not bind the windowed measure, though both remain bounded: $L_{\text{task}}$ by per-task novelty, $L_{\text{window}}$ by the stock of accumulated planning investment that pays out within the window. The framework operationalizes aspects of earlier qualitative work on supervisory control (Sheridan, 1992), common ground (Clark & Brennan, 1991), and mixed-initiative interaction (Horvitz, 1999) within a single normative ratio, and produces a list of testable empirical questions that we leave as open problems.
Chinese Translation
我们提出了一种用于人机协作的任务级杠杆比率:由代理人取代的人类工作量,除以人类指定任务、解决中途干扰和审查结果所需的时间。分母分解为三个渠道,通过这些渠道,必须流动一个被保留的任务信息需求,每个渠道都有其自身的时间成本标量。我们展示了信息密度本身是有方向性的,并且受到人对代理和代理对人的信息流的不同上限的限制,杠杆的渐近行为分解为两个缩放轴(能力和记忆),规划项的下限由不可简化的任务新颖性设定,该新颖性受到人类吞吐量的限制。我们将这种任务级分析扩展到一个窗口化的杠杆度量,适应于重复任务、衍生子任务和摊销的系统设计投资。任务级上限并不限制窗口化度量,尽管两者都受到限制:$L_{\text{task}}$ 由任务的新颖性限制,$L_{\text{window}}$ 由在窗口内支付的累积规划投资的存量限制。该框架将早期关于监督控制(Sheridan, 1992)、共同基础(Clark & Brennan, 1991)和混合主动交互(Horvitz, 1999)的定性研究的各个方面操作化为一个单一的规范比率,并产生了一系列可测试的实证问题,我们将其作为开放性问题留待后续研究。
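The per-task leverage ratio defined at the start of this abstract can be written out directly: displaced human work divided by the human time spent across the three channels (specifying, handling interrupts, reviewing). The function name, argument names, and the choice of hours as the unit are assumptions for illustration; the abstract does not fix units or notation beyond $L_{\text{task}}$.

```python
def task_leverage(displaced_hours, specify_hours, interrupt_hours, review_hours):
    """Per-task leverage: human work displaced by the agent, divided by the
    human time spent specifying, resolving interrupts, and reviewing."""
    human_hours = specify_hours + interrupt_hours + review_hours
    if human_hours <= 0:
        raise ValueError("human time in the denominator must be positive")
    return displaced_hours / human_hours

# Toy example: the agent displaces 8 hours of work; the human spends
# 0.5 h specifying, 0.25 h on mid-run interrupts, and 0.25 h reviewing.
leverage = task_leverage(8.0, 0.5, 0.25, 0.25)
```

Here one hour of human attention buys eight hours of displaced work, so the leverage is 8. Shrinking any of the three denominator channels raises the ratio, which is why the abstract analyzes them separately.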
cs.AI / 9 / 2604.25077

Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective

从偏差-方差视角评估弱到强对齐中的风险
Osooli, Hamid, Batool, Kareema, Gentry, Rick, Roy, Tiasa Singha, Gupta, Ashwin, Ramesh, Anirudha
Abstract
Weak-to-strong alignment offers a promising route to scalable supervision, but it can fail when a strong model becomes confidently wrong on examples that lie in the weak teacher's blind spots. Understanding such failures requires going beyond aggregate accuracy, since weak-to-strong errors depend not only on whether the strong model disagrees with its teacher, but also on how confidence and uncertainty are distributed across examples. In this work, we analyze weak-to-strong alignment through a bias-variance-covariance lens that connects misfit theory to practical post-training pipelines. We derive a misfit-based upper bound on weak-to-strong population risk and study its empirical components using continuous confidence scores. We evaluate four weak-to-strong pipelines spanning supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF) on the PKU-SafeRLHF and HH-RLHF datasets. Using a blind-spot deception metric that isolates cases where the strong model is confidently wrong while the weak model is uncertain, we find that strong-model variance is the strongest empirical predictor of deception across our settings. Covariance provides additional but weaker information, indicating that weak-strong dependence matters, but does not by itself explain the observed failures. These results suggest that strong-model variance can serve as an early-warning signal for weak-to-strong deception, while blind-spot evaluation helps distinguish whether failures are inherited from weak supervision or arise in regions of weak-model uncertainty.
Chinese Translation
弱到强的对齐为可扩展监督提供了一条有前景的途径,但当强模型在弱教师的盲点上对样本产生自信错误时,它可能会失败。理解这种失败需要超越整体准确率,因为弱到强的错误不仅取决于强模型是否与其教师意见不合,还取决于信心和不确定性在样本中的分布。在本研究中,我们通过偏差-方差-协方差的视角分析弱到强的对齐,将不适配理论与实际的后训练流程相连接。我们推导出基于不适配的弱到强总体风险的上界,并使用连续信心分数研究其经验组成部分。我们在PKU-SafeRLHF和HH-RLHF数据集上评估了四种跨越监督微调(SFT)、人类反馈强化学习(RLHF)和人工智能反馈强化学习(RLAIF)的弱到强流程。使用一种盲点欺骗度量,隔离出强模型自信错误而弱模型不确定的情况,我们发现强模型的方差是我们设置中欺骗的最强经验预测因子。协方差提供了额外但较弱的信息,表明弱-强依赖性是重要的,但单独并不能解释观察到的失败。这些结果表明,强模型的方差可以作为弱到强欺骗的早期预警信号,而盲点评估有助于区分失败是源于弱监督还是出现在弱模型不确定性的区域。
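The blind-spot deception metric is described only qualitatively in the abstract: cases where the strong model is confidently wrong while the weak model is uncertain. The sketch below shows one way such a rate could be counted. The confidence threshold, the uncertainty band, the tuple layout, and the function name are all hypothetical choices, not the paper's definition.

```python
def blind_spot_deception_rate(examples, confident=0.9, uncertain_band=(0.4, 0.6)):
    """Share of examples where the strong model is confidently wrong
    while the weak model's confidence falls in an uncertain band.

    Each example is (strong_confidence, strong_is_correct, weak_confidence).
    Thresholds are illustrative assumptions.
    """
    low, high = uncertain_band
    hits = sum(
        1
        for strong_conf, strong_correct, weak_conf in examples
        if strong_conf >= confident and not strong_correct and low <= weak_conf <= high
    )
    return hits / len(examples)

# Toy data: only the first example is confidently wrong with an uncertain teacher.
toy = [
    (0.95, False, 0.50),  # strong confident and wrong, weak uncertain
    (0.95, True, 0.50),   # strong confident and right
    (0.60, False, 0.50),  # strong wrong but not confident
    (0.95, False, 0.90),  # strong confident and wrong, weak also confident
]
rate = blind_spot_deception_rate(toy)
```

On the toy data the rate is 0.25. The fourth example is excluded because the weak model is confident there, a case that may instead reflect errors inherited from weak supervision.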
cs.AI / 10 / 2604.25083

Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization

自主架构师:一种用于建筑设计探索与优化的自主人工智能框架
Blasberg, Alexander, Kypriotis, Vasilis, Skarlatos, Dimitrios
Abstract
Rapid advances in Large Language Models (LLMs) create new opportunities by enabling efficient exploration of broad, complex design spaces. This is particularly valuable in computer architecture, where performance depends on microarchitectural designs and policies drawn from vast combinatorial spaces. We introduce Agentic Architect, an agentic AI framework for computer architecture design exploration and optimization that combines LLM-driven code evolution with cycle-accurate simulation. The human architect specifies the optimization target, seed design, scoring function, simulator interface, and benchmark split, while the LLM explores implementations within these constraints. Across cache replacement, data prefetching, and branch prediction, Agentic Architect matches or exceeds state-of-the-art designs. Our best evolved cache replacement design achieves a 1.062x geomean IPC speedup over LRU, 0.6% over Mockingjay (1.056x). Our evolved branch predictor achieves a 1.100x geomean IPC speedup over Bimodal, 1.5% over its Hashed Perceptron seed (1.085x). Finally, our evolved prefetcher achieves a 1.76x geomean IPC speedup over no prefetching, 17% over its VA/AMPM Lite seed (1.59x) and 21% over SMS (1.55x). Our analysis surfaces several findings about agentic AI-driven microarchitecture design. Across evolved designs, components often correspond to known techniques; the novelty lies in how they are coordinated. The architect's role is shifting, but the human remains central. Seed quality bounds what search can achieve: evolution can refine and extend an existing mechanism, but cannot compensate for a weak foundation. Likewise, objectives, constraints, and prompt guidance affect reliability and generalization. Overall, Agentic Architect is the first end-to-end open-source framework for agentic AI architecture exploration and optimization.
Chinese Translation
大型语言模型(LLMs)的快速进展创造了新的机会,使得广泛而复杂的设计空间的高效探索成为可能。这在计算机架构中尤为重要,因为性能依赖于来自庞大组合空间的微架构设计和策略。我们介绍了自主架构师(Agentic Architect),这是一个用于计算机架构设计探索与优化的自主人工智能框架,它结合了基于LLM的代码演化与周期准确的模拟。人类架构师指定优化目标、种子设计、评分函数、模拟器接口和基准拆分,而LLM则在这些约束条件下探索实现方案。在缓存替换、数据预取和分支预测等方面,自主架构师的性能与最先进的设计相匹配或超越。我们最佳的演化缓存替换设计在与LRU比较时实现了1.062倍的几何平均IPC加速,相较于Mockingjay(1.056倍)提高了0.6%。我们的演化分支预测器在与Bimodal比较时实现了1.100倍的几何平均IPC加速,相较于其哈希感知种子(1.085倍)提高了1.5%。最后,我们的演化预取器在与无预取比较时实现了1.76倍的几何平均IPC加速,相较于其VA/AMPM Lite种子(1.59倍)提高了17%,相较于SMS(1.55倍)提高了21%。我们的分析揭示了关于自主人工智能驱动的微架构设计的若干发现。在演化设计中,组件通常对应于已知技术;新颖之处在于它们的协调方式。架构师的角色正在转变,但人类仍然是中心。种子质量限制了搜索的成就:演化可以改进和扩展现有机制,但无法弥补薄弱的基础。同样,目标、约束和提示指导也会影响可靠性和泛化能力。总体而言,自主架构师是首个用于自主人工智能架构探索与优化的端到端开源框架。
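The percentage gains quoted alongside the IPC speedups in the Agentic Architect abstract can be read in two ways: as a difference between two speedups over the same baseline, or as a ratio between them. The check below computes both from the abstract's numbers; the function names are illustrative, and the interpretation is an inference from the arithmetic rather than something the abstract states.

```python
def gain_points(speedup, reference_speedup):
    """Difference of two speedups over the same baseline, in percentage points."""
    return round((speedup - reference_speedup) * 100, 1)

def gain_ratio(speedup, reference_speedup):
    """Relative speedup of one design over the other, in percent."""
    return round((speedup / reference_speedup - 1) * 100, 1)

# (evolved speedup, reference speedup) pairs taken from the abstract.
pairs = {
    "cache vs Mockingjay": (1.062, 1.056),
    "branch vs Hashed Perceptron": (1.100, 1.085),
    "prefetch vs VA/AMPM Lite": (1.76, 1.59),
    "prefetch vs SMS": (1.76, 1.55),
}
points = {name: gain_points(a, b) for name, (a, b) in pairs.items()}
ratios = {name: gain_ratio(a, b) for name, (a, b) in pairs.items()}
```

The point differences come out to 0.6, 1.5, 17.0, and 21.0, matching the quoted 0.6%, 1.5%, 17%, and 21%. The ratio reading gives roughly 0.6, 1.4, 10.7, and 13.5, so the abstract's figures appear to be differences in speedup over the shared baseline rather than direct design-to-design ratios.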
cs.AI / 11 / 2604.25088

Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest

合作以竞争:多智能体征服中的战略协调
O'Neill, Abigail, Zhu, Alan, Miroyan, Mihran, Norouzi, Narges, Gonzalez, Joseph E.
Abstract
Language Model (LM)-based agents remain largely untested in mixed-motive settings where agents must leverage short-term cooperation for long-term competitive goals (e.g., multi-party politics). We introduce Cooperate to Compete (C2C), a multi-agent environment where players can engage in private negotiations while competing to be the first to achieve their secret objective. Players have asymmetric objectives and negotiations are non-binding, allowing alliances to form and break as players' short-term interests align and diverge. We run AI-only games and conduct a user study pitting human players against AI opponents. We identify significant differences between human and AI negotiation behaviors, finding that humans favor lower-complexity deals and are significantly less reliable partners compared to LM-based agents. We also find that humans are more aggressive negotiators, accepting deals without a counteroffer only 56.3% of the time compared to 67.6% for LM-based agents. Through targeted prompting inspired by these findings, we modify agents' negotiation behavior and improve win rates from 22.2% to 32.7%. We run over 1,100 games with over 16,000 private conversations totaling 15.2 million tokens and over 150,000 player actions. Our results establish C2C as a testbed for studying and building LM-based agents that can navigate the sophisticated coordination required for real-world deployments. The game, code, and dataset may be found at https://negotiationgame.io/c2c.
Chinese Translation
基于语言模型(Language Model, LM)的智能体在混合动机环境中仍然未得到充分测试,在这些环境中,智能体必须利用短期合作来实现长期竞争目标(例如,多方政治)。我们提出了“合作以竞争”(Cooperate to Compete, C2C),这是一个多智能体环境,玩家可以在竞争中进行私下谈判,以争取第一个实现他们的秘密目标。玩家的目标不对称,谈判是非约束性的,允许联盟在玩家的短期利益一致和分歧时形成和解散。我们进行了仅限人工智能的游戏,并开展了一项用户研究,将人类玩家与人工智能对手进行对抗。我们发现人类与人工智能在谈判行为上存在显著差异,发现人类更倾向于选择低复杂度的交易,并且相比于基于LM的智能体,人类的可靠性显著较低。我们还发现人类谈判者更具攻击性,仅有56.3%的时间在没有反报价的情况下接受交易,而基于LM的智能体则为67.6%。通过基于这些发现的针对性提示,我们修改了智能体的谈判行为,并将胜率从22.2%提高到32.7%。我们进行了超过1,100场游戏,进行了超过16,000次私下谈话,总计15.2百万个标记和超过150,000个玩家动作。我们的结果确立了C2C作为研究和构建能够应对现实世界部署所需复杂协调的基于LM的智能体的试验平台。游戏、代码和数据集可在https://negotiationgame.io/c2c找到。
cs.AI / 12 / 2604.25098

Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

以更少的资源做更多的事情:重新审视大型语言模型剪枝在测试时扩展的有效性
Monjur, Ocean, Nahin, Shahriar Kabir, Chhabra, Anshuman
Abstract
While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing performance. However, specific to reasoning LLMs, prior work has shown that structured pruning (methods that remove entire sets of layer blocks) significantly degrades TTS reasoning performance. In this work, we revisit this assumption and instead investigate whether unstructured pruning (methods that carefully remove only certain redundant/detrimental weights) exhibits similar limitations. Surprisingly, our extensive experiments across four reasoning benchmarks on two reasoning LLMs, s1.1-7B and Qwen3-8B, consistently show that unstructured pruning augments TTS performance compared to structured pruning, and at times can even outperform the unpruned full-weight LLMs. Furthermore, we also empirically study the impact of different layer-wise sparsity allocation strategies, which are an important parametric choice for instantiating unstructured pruning methods. These findings challenge the conventional notion that pruning always reduces TTS performance and, in fact, suggest that carefully undertaken pruning can improve TTS effectiveness even further.
Chinese Translation
尽管当前的大型语言模型(LLMs)通过测试时计算扩展(TTS)展现出显著的推理能力,但其庞大的参数数量和高昂的推理成本促使了剪枝方法的发展,以在不牺牲性能的情况下减少模型规模。然而,针对推理 LLMs 的具体研究表明,结构化剪枝(即移除整组层块的方法)显著降低了 TTS 推理性能。在本研究中,我们重新审视这一假设,探讨非结构化剪枝(即仅小心移除某些冗余/有害权重的方法)是否存在类似的限制。令人惊讶的是,我们在两个推理 LLMs(s1.1-7B 和 Qwen3-8B)上的四个推理基准上进行的广泛实验一致显示,与结构化剪枝相比,非结构化剪枝增强了 TTS 性能,有时甚至可以超越未剪枝的全权重 LLMs。此外,我们还实证研究了不同层级稀疏分配策略的影响,这对于实例化非结构化剪枝方法是一个重要的参数选择。这些发现挑战了剪枝总是降低 TTS 性能的传统观念,实际上表明,经过仔细实施的剪枝可以进一步提高 TTS 的有效性。
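To make the structured/unstructured distinction concrete, here is a minimal Python sketch of unstructured pruning on a single weight matrix. The abstract does not say which pruning criterion the authors use, so plain magnitude pruning is swapped in as a stand-in; the function name and the toy matrix are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest absolute value.

    `weights` is a list of rows (a dense matrix). Individual entries are removed
    wherever they are small, which is what makes the pruning "unstructured".
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    rows, cols = len(weights), len(weights[0])
    # Rank every (row, col) position by the magnitude of its weight.
    order = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: abs(weights[rc[0]][rc[1]]),
    )
    n_prune = int(rows * cols * sparsity)
    pruned = [row[:] for row in weights]  # leave the input untouched
    for r, c in order[:n_prune]:
        pruned[r][c] = 0.0
    return pruned


toy_layer = [[0.5, -0.1], [0.05, -2.0]]
print(magnitude_prune(toy_layer, 0.5))
```

Structured pruning would instead drop whole rows, heads, or layers, so the surviving tensor stays dense. The layer-wise sparsity allocation mentioned in the abstract corresponds to choosing a different `sparsity` value per layer.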
cs.AI / 13 / 2604.25149

Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models

可靠的 LLM 驱动数据分析的语义层:三种前沿模型在准确性和幻觉方面的配对基准测试
Rumiantsau, Michael, Fokeev, Ivan
Abstract
LLMs deployed for natural-language querying of analytical databases suffer from two intertwined failures - incorrect answers and confident hallucinations - both rooted in the same cause: the model is forced to infer business semantics that the schema does not encode. We test whether supplying those semantics as context closes the gap. We benchmark three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) on 100 natural-language questions over the Cleaned Contoso Retail Dataset in ClickHouse, using a paired single-shot protocol. Each model is evaluated twice: once given only the warehouse schema, and once given the schema plus a 4 KB hand-authored markdown document describing the dataset's measures, conventions, and disambiguation rules. Adding the document improves accuracy by +17 to +23 percentage points across all three models. With it, the three models are statistically indistinguishable (67.7-68.7%); without it, they are also indistinguishable (45.5-50.5%). Every cross-cluster comparison is significant at p < 0.01. The presence of the semantic-layer document accounts for essentially all of the significant variance; model choice within tier does not. We interpret this as a structural result: explicit business semantics suppress the dominant class of text-to-SQL errors not by making the model more capable, but by changing what the model is being asked to do.
Chinese Translation
用于对分析数据库进行自然语言查询的 LLM 存在两种相互交织的失败——错误的答案和自信的幻觉——这两者都源于同一个原因:模型被迫推断模式未编码的业务语义。我们测试提供这些语义作为上下文是否能缩小差距。我们在 ClickHouse 中使用配对单次协议,对三种前沿 LLM(Claude Opus 4.7、Claude Sonnet 4.6、GPT-5.4)在 100 个自然语言问题上进行基准测试,数据集为 Cleaned Contoso Retail Dataset。每个模型被评估两次:一次仅提供仓库模式,另一次提供模式加上一个 4 KB 的手工编写的 markdown 文档,描述数据集的度量、约定和消歧义规则。添加该文档使得所有三个模型的准确性提高了 17 到 23 个百分点。有了它,这三种模型在统计上是不可区分的(67.7-68.7%);没有它,它们同样不可区分(45.5-50.5%)。每个跨集群比较在 p < 0.01 下都是显著的。语义层文档的存在几乎解释了所有显著的方差;在同一层级内,模型选择并不影响结果。我们将此解释为一种结构性结果:显式的业务语义通过改变模型被要求做的事情,而不是通过提升模型的能力,抑制了主要的文本到 SQL 错误类别。
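Because every question is answered twice by the same model (schema only, then schema plus document), the two accuracy figures are paired rather than independent. The abstract does not name the significance test behind "p < 0.01"; the Python sketch below uses the exact McNemar test, one common choice for paired right/wrong outcomes, purely as an assumption. Function names and the sample outcomes are hypothetical.

```python
from math import comb


def discordant_counts(correct_with_doc, correct_without_doc):
    """Count questions answered correctly in exactly one of the two conditions."""
    pairs = list(zip(correct_with_doc, correct_without_doc))
    only_with = sum(1 for w, wo in pairs if w and not wo)
    only_without = sum(1 for w, wo in pairs if wo and not w)
    return only_with, only_without


def mcnemar_exact(only_a_correct, only_b_correct):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0  # no disagreement between conditions, nothing to test
    k = min(only_a_correct, only_b_correct)
    # Binomial(n, 0.5) tail probability of a split at least this uneven.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-question outcomes for one model (True = correct answer).
with_doc = [True, True, True, False, True, True]
without_doc = [False, True, False, False, False, True]
a, b = discordant_counts(with_doc, without_doc)
print(a, b, mcnemar_exact(a, b))
```

Only the discordant questions carry information in this test, which is why a paired protocol can detect an effect that two independent samples of 100 questions might miss.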
cs.AI / 14 / 2604.25166

Training Transformers as a Universal Computer

将变换器训练为通用计算机
Xu, Ruize, Yang, Chenxiao, Li, Yanhong, McAllester, David
Abstract
We demonstrate that a small transformer can learn to execute programs in MicroPy, a simplified yet computationally universal programming language. Given procedure definitions together with an expression to evaluate, the transformer predicts small-step execution using PENCIL scaffolding for space-efficient execution within a bounded context window. After training on randomly generated, meaningless MicroPy programs, the learned transformer generalizes to various human-written programs including bit copying and flipping, binary addition and multiplication, and SAT verification and solving. We note that the trained model can achieve out-of-distribution generalization; i.e., evaluate novel programs from distribution on programs. Since MicroPy can express any computation, our results provide empirical evidence that a standard transformer can be trained to act as a universal computer.
Chinese Translation
我们展示了一个小型变换器能够学习执行 MicroPy 程序,这是一种简化但计算上通用的编程语言。在给定过程定义和要评估的表达式的情况下,变换器利用 PENCIL 支架预测小步执行,以便在有限的上下文窗口内实现空间高效的执行。在对随机生成的无意义 MicroPy 程序进行训练后,学习到的变换器能够推广到各种人类编写的程序,包括位复制和翻转、二进制加法和乘法,以及 SAT 验证和求解。我们注意到,经过训练的模型能够实现超出分布的泛化;即评估来自程序分布的新颖程序。由于 MicroPy 可以表达任何计算,我们的结果提供了实证证据,表明标准变换器可以被训练为充当通用计算机。
cs.AI / 15 / 2604.25167

From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models

从洞察到行动:一种新颖的基于可解释性的框架用于大语言模型中的数据选择
Shi, Ling, Wu, Xinwei, Zhao, Xiaohu, Wang, Hao, Liu, Heng, Liu, Yangyang, Xu, Linlong, Wang, Longyue, Xiong, Deyi, Luo, Weihua
Abstract
While mechanistic interpretability tools like Sparse Autoencoders (SAEs) can uncover meaningful features within Large Language Models (LLMs), a critical gap remains in transforming these insights into practical actions for model optimization. We bridge this gap with the hypothesis that data selection guided by a model's internal task features is an effective training strategy. Inspired by this, we propose Interpretability-Guided Data Selection (IGDS), a framework that first identifies these causal task features through frequency recall and interventional filtering, then selects "Feature-Resonant Data" that maximally activates task features for fine-tuning. We validate IGDS on mathematical reasoning, summarization, and translation tasks within Gemma-2, LLaMA-3.1, and Qwen3 models. Our experiments demonstrate exceptional data efficiency: on the Math task, IGDS surpasses full-dataset fine-tuning by a remarkable 17.4% on Gemma-2-2B while using only 50% of the data, and outperforms established baselines focused on data quality and diversity. Analysis confirms a strong positive correlation between feature amplification and task performance improvement. IGDS thus provides a direct and effective framework to enhance LLMs by leveraging their internal mechanisms, validating our core hypothesis.
Chinese Translation
虽然像稀疏自编码器(Sparse Autoencoders, SAEs)这样的机械可解释性工具能够揭示大语言模型(Large Language Models, LLMs)中的有意义特征,但在将这些洞察转化为模型优化的实际行动方面仍存在一个关键的空白。我们通过假设基于模型内部任务特征的数据选择是一种有效的训练策略来填补这一空白。受此启发,我们提出了基于可解释性的数据显示选择(Interpretability-Guided Data Selection, IGDS)框架,该框架首先通过频率回忆和干预过滤识别这些因果任务特征,然后选择“特征共振数据”(Feature-Resonant Data),以最大程度地激活任务特征进行微调。我们在Gemma-2、LLaMA-3.1和Qwen3模型的数学推理、摘要和翻译任务上验证了IGDS。实验表明,IGDS在数据效率方面表现卓越:在数学任务中,IGDS在Gemma-2-2B上超越了全数据集微调,提升幅度达17.4%,而仅使用了50%的数据,并且优于关注数据质量和多样性的既有基线。分析确认了特征放大与任务性能提升之间的强正相关。因此,IGDS提供了一个直接有效的框架,通过利用LLMs的内部机制来增强其性能,验证了我们的核心假设。
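The selection step described in this entry (keep the training samples that most strongly activate a set of task features) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' procedure: the abstract does not specify how activations are aggregated, so the score here is a plain sum over task features, and the function name, data layout, and `keep_fraction` parameter are hypothetical.

```python
def select_feature_resonant(activations, task_features, keep_fraction=0.5):
    """Return indices of the samples whose task features fire most strongly.

    activations: one dict per sample, mapping SAE feature id -> activation.
    task_features: ids of the features previously identified as task-relevant.
    """
    scores = [
        sum(sample.get(f, 0.0) for f in task_features) for sample in activations
    ]
    n_keep = max(1, int(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical activations for four training samples.
samples = [{1: 0.1}, {1: 0.9, 2: 0.5}, {3: 2.0}, {2: 0.4}]
print(select_feature_resonant(samples, task_features={1, 2}))
```

With `keep_fraction=0.5` the sketch mirrors the 50% data budget quoted in the abstract; identifying the task features themselves (frequency recall and interventional filtering) is outside this sketch.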
cs.AI / 16 / 2604.25220

DATAREEL: Automated Data-Driven Video Story Generation with Animations

DATAREEL:自动化数据驱动的视频故事生成与动画
Mahbub, Ridwan, Aziz, Syem, Ahmed, Mahir, Rahman, Shadikur, Rahman, Mizanur, Joty, Shafiq, Hoque, Enamul
Abstract
Data videos are a powerful medium for visual data based storytelling, combining animated, chart-centric visualizations with synchronized narration. Widely used in journalism, education, and public communication, they help audiences understand complex data through clear and engaging visual explanations. Despite their growing impact, generating data-driven video stories remains challenging, as it requires careful coordination of visual encoding, temporal progression, and narration and substantial expertise in visualization design, animation, and video-editing tools. Recent advances in large language models offer new opportunities to automate this process; however, there is currently no benchmark for rigorously evaluating models on animated visualization-based video storytelling. To address this gap, we introduce DataReel, a benchmark for automated data-driven video story generation comprising 328 real-world stories. Each story pairs structured data, a chart visualization, and a narration transcript, enabling systematic evaluation of models' abilities to generate animated data video stories. We further propose a multi-agent framework that decomposes the task into planning, generation, and verification stages, mirroring key aspects of the human storytelling process. Experiments show that this multi-agent approach outperforms direct prompting baselines under both automatic and human evaluations, while revealing persistent challenges in coordinating animation, narration, and visual emphasis. We release DataReel at https://github.com/vis-nlp/DataReel.
Chinese Translation
数据视频是一种强大的视觉数据叙事媒介,将动画、以图表为中心的可视化与同步叙述相结合。它们在新闻、教育和公共传播中被广泛使用,帮助观众通过清晰且引人入胜的视觉解释理解复杂数据。尽管其影响力日益增长,生成数据驱动的视频故事仍然具有挑战性,因为这需要对视觉编码、时间进程和叙述进行精心协调,并且需要在可视化设计、动画和视频编辑工具方面具备丰富的专业知识。近年来,大型语言模型的进展为自动化这一过程提供了新的机会;然而,目前尚无基准来严格评估基于动画可视化的视频叙事模型。为了解决这一空白,我们推出了DataReel,这是一个包含328个真实故事的自动化数据驱动视频故事生成基准。每个故事都配有结构化数据、图表可视化和叙述文本,从而能够系统地评估模型生成动画数据视频故事的能力。我们进一步提出了一种多智能体框架,将任务分解为规划、生成和验证阶段,反映了人类叙事过程的关键方面。实验表明,这种多智能体方法在自动和人工评估下均优于直接提示基线,同时揭示了在协调动画、叙述和视觉强调方面的持续挑战。我们将在 https://github.com/vis-nlp/DataReel 发布DataReel。
cs.AI / 17 / 2604.25224

ValueAlpha: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable

ValueAlpha:在可观察收益之前对LLM评判的投资理由进行协议门控压力测试
Chang, Sidi, Zhu, Peiying, Chen, Yuxiao
Abstract
Long-horizon investment decisions create a pre-realization evaluation problem: realized returns are the eventual arbiter of investment quality, but they arrive too late and are too noisy to guide many model-development and governance decisions. LLM judges offer a tempting substitute for pre-deployment evaluation of AI-finance systems, but unvalidated judges may reward verbosity, confidence, or rubric mimicry rather than financial judgment. This paper introduces \textbf{ValueAlpha}, a preregistered agreement-gated stress-test protocol for deciding when LLM-judged investment-rationale claims are publishable, qualified, or invalid. In a controlled market-state capital-allocation prototype with 1,000 honest decision cycles and 100 preregistered adversarial controls (1,100 trajectories, 5,500 judge calls), ValueAlpha clears the aggregate agreement gate at \(\bar{\kappa}_w = 0.7168\) but prevents several overclaims. Lower-rank systems collapse into a tie-class, one rubric dimension fails the per-dimension gate (\texttt{constraint\_awareness}, \(\bar{\kappa}_w = 0.2022\)), single-judge rankings are family-dependent, and terse-correct rationales receive a \(\Delta = -2.81\) rubric-point penalty relative to honest rationales. A targeted anchor-specificity probe further shows that financial constructs such as constraint awareness are operationally load-bearing. The contribution is therefore not a leaderboard and not a claim to measure true investment skill. ValueAlpha is a pre-calibration metrology layer for AI-finance evaluation: it determines whether a proposed LLM-judge-based investment-rationale claim is stable enough, agreed enough, and uncontaminated enough to be reported at all.
Chinese Translation
长期投资决策产生了一种预实现评估问题:实现的收益最终是投资质量的仲裁者,但它们到达得太晚且噪声太大,无法指导许多模型开发和治理决策。LLM评判者为AI金融系统的部署前评估提供了一个诱人的替代方案,但未经验证的评判者可能会奖励冗长、自信或模仿评分标准,而非财务判断。本文介绍了 \textbf{ValueAlpha},一个预注册的协议门控压力测试协议,用于决定何时LLM评判的投资理由声明可以发布、合格或无效。在一个受控的市场状态资本配置原型中,包含1000个诚实决策周期和100个预注册的对抗控制(共1100条轨迹,5500次评判),ValueAlpha在聚合协议门控中达到了 \(\bar{\kappa}_w = 0.7168\),但防止了若干过度声明。低等级系统崩溃为平局类别,一个评分维度未能通过每维度门控(\texttt{constraint\_awareness},\(\bar{\kappa}_w = 0.2022\)),单一评判者排名依赖于模型家族,简洁正确的理由相对于诚实理由受到 \(\Delta = -2.81\) 的评分点惩罚。一个针对性锚定特异性探测进一步表明,诸如约束意识等金融构造在操作上是承载负荷的。因此,贡献不是一个排行榜,也不是对真实投资技能的测量声明。ValueAlpha是AI金融评估的预校准计量层:它确定一个提议的基于LLM评判的投资理由声明是否足够稳定、足够一致且足够不受污染,以便被报告。
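The agreement gate in this entry is expressed through a weighted kappa averaged over judges. As a hedged sketch, the Python below computes Cohen's kappa with quadratic weights for two judges and averages it over all judge pairs. The abstract does not state the weighting scheme, the number of rubric levels, or the gate threshold, so quadratic weights, the level count, and both function names are assumptions made for illustration.

```python
from itertools import combinations


def weighted_kappa(ratings_a, ratings_b, n_levels):
    """Cohen's kappa with quadratic weights; ratings are integers in 0..n_levels-1."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b) or n_levels < 2:
        raise ValueError("need equal-length, non-empty ratings and at least 2 levels")
    # Joint distribution of the two judges' ratings.
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(n_levels)]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    disagree_obs = 0.0
    disagree_exp = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = (i - j) ** 2 / (n_levels - 1) ** 2  # quadratic disagreement weight
            disagree_obs += w * observed[i][j]
            disagree_exp += w * row[i] * col[j]  # chance-level disagreement
    if disagree_exp == 0.0:
        raise ValueError("kappa is undefined when both judges use one identical level")
    return 1.0 - disagree_obs / disagree_exp


def mean_pairwise_kappa(judge_ratings, n_levels):
    """Average weighted kappa over every pair of judges."""
    pairs = list(combinations(judge_ratings, 2))
    return sum(weighted_kappa(a, b, n_levels) for a, b in pairs) / len(pairs)


# Hypothetical rubric scores (levels 0..4) from three judges on five rationales.
judges = [[0, 1, 2, 3, 4], [0, 1, 2, 4, 4], [1, 1, 2, 3, 3]]
print(round(mean_pairwise_kappa(judges, n_levels=5), 3))
```

A per-dimension gate, as in the `constraint_awareness` result, would apply the same computation to the scores of one rubric dimension at a time.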
cs.AI / 18 / 2604.25256

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

AutoResearchBench:在复杂科学文献发现中对人工智能代理的基准测试
Xiong, Lei, Luo, Kun, Xia, Ziyi, Zhang, Wenbo, Yao, Jin-Ge, Liu, Zheng, Shao, Jingying, Chen, Jianlyu, Qian, Hongjin, Yang, Xi, Yu, Qian, Li, Hao, Yue, Chen, Du, Xiaan, Wang, Yuyang, Liu, Yesheng, Xu, Haiyu, Dou, Zhicheng
Abstract
Autonomous scientific research has advanced significantly thanks to the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge for a research problem, or to acquire evidence for verifying assumptions and supporting claims. To assess AI agents' capability in driving this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery. AutoResearchBench consists of two complementary task types: (1) Deep Research, which requires tracking down a specific target paper through a progressive, multi-step probing process, and (2) Wide Research, which requires comprehensively collecting a set of papers satisfying given conditions. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained utilization of detailed information; and open-ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout. These properties make AutoResearchBench uniquely suited for evaluating autonomous research capabilities, and extraordinarily challenging. Even the most powerful LLMs, despite having largely conquered general agentic web-browsing benchmarks such as BrowseComp, achieve only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, while many other strong baselines fall below 5%. We publicly release the dataset, evaluation pipeline, and code at https://github.com/CherYou/AutoResearchBench to facilitate future research in this direction.
Chinese Translation
自主科学研究得益于人工智能代理的发展而显著推进。这个过程中的一个关键步骤是找到合适的科学文献,无论是为了探索研究问题的现有知识,还是为了获取验证假设和支持论点的证据。为了评估人工智能代理在推动这一过程中的能力,我们提出了AutoResearchBench,这是一个专门用于自主科学文献发现的基准。AutoResearchBench由两种互补的任务类型组成:(1)深度研究(Deep Research),需要通过渐进的多步骤探测过程追踪特定目标论文;(2)广泛研究(Wide Research),需要全面收集满足给定条件的一组论文。与之前的代理网页浏览基准相比,AutoResearchBench在三个维度上具有独特性:它是以研究为导向的,要求对科学概念进行深入理解;它是以文献为中心的,要求对详细信息进行细致的利用;它是开放式的,涉及未知数量的合格论文,因此需要进行深思熟虑的推理和搜索。这些特性使得AutoResearchBench特别适合评估自主研究能力,并且极具挑战性。即使是最强大的大型语言模型(LLMs),尽管在诸如BrowseComp等一般代理网页浏览基准上已基本征服,但在深度研究(Deep Research)上仅达到9.39%的准确率,在广泛研究(Wide Research)上仅达到9.31%的IoU,而许多其他强基线则低于5%。我们公开发布数据集和评估管道,以促进未来在这一方向的研究。我们在https://github.com/CherYou/AutoResearchBench上公开发布数据集、评估管道和代码。
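The Wide Research score is reported as IoU (intersection over union). A minimal Python sketch of that metric on sets of paper identifiers follows; the identifiers are made up, and the empty-set convention is an assumption, since the abstract does not say how the benchmark handles that case.

```python
def set_iou(predicted, gold):
    """Intersection over union between a returned paper set and the reference set."""
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    if not union:
        return 1.0  # assumption: two empty sets count as perfect agreement
    return len(predicted & gold) / len(union)


# Hypothetical identifiers: two of three returned papers are in the reference set.
returned = ["paper-a", "paper-b", "paper-c"]
reference = ["paper-b", "paper-c", "paper-d"]
print(set_iou(returned, reference))
```

IoU penalizes both missed papers and spurious ones, which suits a task where the number of qualifying papers is not known in advance.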
cs.AI / 19 / 2604.25345

Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows

看似合理但错误:关于天体物理工作流中代理失败的案例研究
Rawat, Shivam, Flek, Lucie
Abstract
Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood. We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks. In the One-Shot setting, access to domain-specific context yields an approximately 6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation - syntactically valid code that produces plausible but inaccurate results. In the Deep Research setting, the system frequently exhibits silent failures across stress tests, producing physically inconsistent posteriors without self-diagnosis. Overall, performance is strong on well-specified tasks but degrades on problems designed to probe reasoning limits, often without visible error signals. These findings highlight that the most concerning failure mode in agentic scientific workflows is not overt failure, but confident generation of incorrect results. We release our evaluation framework to facilitate systematic reliability analysis of scientific AI agents.
Chinese Translation
代理人工智能系统正越来越多地被整合到科学工作流中,但在现实条件下它们的行为仍然未被充分理解。我们在两种工作流范式和十八个天体物理任务中评估了CMBAgent。在一次性设置中,获取特定领域的上下文使性能提高了大约6倍(0.85对比于没有上下文的~0),主要的失败模式是静默的错误计算——语法上有效的代码产生看似合理但不准确的结果。在深度研究设置中,该系统在压力测试中经常表现出静默失败,产生物理上不一致的后验分布而没有自我诊断。总体而言,在明确规定的任务上性能表现良好,但在设计用于探测推理极限的问题上性能下降,通常没有明显的错误信号。这些发现强调了在代理科学工作流中,最令人担忧的失败模式不是明显的失败,而是自信地生成错误结果。我们发布了我们的评估框架,以促进对科学人工智能代理的系统可靠性分析。
cs.AI / 20 / 2604.25369

Multi-action Tangled Program Graphs for Multi-task Reinforcement Learning with Continuous Control

用于连续控制的多任务强化学习的多动作纠缠程序图
Vacher, Quentin, Beuve, Nicolas, Dardaillon, Mickaël, Desnos, Karol
Abstract
Over the past few decades, machine learning has been widely used to learn complex tasks. Reinforcement Learning (RL), inspired by human behavior, is a great example, as it involves developing specific behaviours for specific tasks. To further challenge algorithms, Multi-Task RL (MTRL) environments have been introduced, requiring a single model to learn multiple behaviors. The Tangled Program Graph (TPG) algorithm is a Genetic Programming (GP) algorithm designed for discrete MTRL environments. Recently, the MAPLE algorithm has been proposed, as another GP algorithm that achieves high results in single task continuous RL environments. A variation of the TPG is proposed alongside MAPLE, named Multi-Action TPG (MATPG) that aggregates MAPLE agents, and creates a control flow to activate them. Initially tested on single task RL environments only, MATPG achieved similar results to MAPLE. In this work, we present a new benchmark based on the MuJoCo Half Cheetah from Gymnasium. This benchmark features five distinct obstacles that are randomly positioned in front of the agent, each of which demands a unique behavior. This benchmark serves as a use case for MATPG, to prove its ability as a GP solution for continuous MTRL environments. Our experiments demonstrate its superiority in this multi-task use case when combined with lexicase selection. Furthermore, we examine the interpretability of the evolved graph, revealing that the decision flow of the model is fully interpretable.
Chinese Translation
在过去几十年中,机器学习被广泛应用于学习复杂任务。受人类行为启发的强化学习(Reinforcement Learning, RL)就是一个很好的例子,因为它涉及为特定任务开发特定行为。为了进一步挑战算法,提出了多任务强化学习(Multi-Task RL, MTRL)环境,要求单一模型学习多种行为。纠缠程序图(Tangled Program Graph, TPG)算法是一种为离散MTRL环境设计的遗传编程(Genetic Programming, GP)算法。最近,提出了MAPLE算法,作为另一种在单任务连续RL环境中取得高效能的GP算法。我们提出了TPG的一种变体,命名为多动作TPG(Multi-Action TPG, MATPG),它聚合了MAPLE代理,并创建了激活它们的控制流。MATPG最初仅在单任务RL环境中进行测试,取得了与MAPLE相似的结果。在本研究中,我们基于Gymnasium的MuJoCo Half Cheetah提出了一个新的基准测试。该基准测试具有五个随机放置在代理面前的不同障碍物,每个障碍物都要求独特的行为。该基准测试作为MATPG的用例,证明了其作为连续MTRL环境的GP解决方案的能力。我们的实验表明,当与词典选择(lexicase selection)结合时,MATPG在这一多任务用例中表现出优越性。此外,我们还研究了演化图的可解释性,揭示了模型的决策流程是完全可解释的。
cs.AI / 21 / 2604.25419

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

JURY-RL:投票提议,证明处置的无标签强化学习验证奖励
Chen, Xinjie, Fu, Biao, Wu, Jing, Chen, Guoxin, Liu, Xinggao, Liu, Dayiheng, Liao, Minpeng
Abstract
Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
Chinese Translation
具有可验证奖励的强化学习(RLVR)增强了大型语言模型(LLMs)的推理能力,但标准的RLVR通常依赖于人工标注的答案或精心策划的奖励规范。在机器可检查的领域,无标签的替代方案,如多数投票或将LLM作为评判者,虽然消除了标注成本,但可能引入假阳性,从而使训练不稳定。我们提出了JURY-RL,一个无标签的RLVR框架,它将答案提议与奖励处置解耦:模型的回滚投票提出候选答案,而正式验证器则决定该候选答案是否可以获得正奖励。具体而言,只有与多数投票答案匹配的回滚在该答案在Lean中成功验证时才会获得奖励。当验证结果不确定时,我们引入ResZero(Residual-Zero),作为一种后备奖励,丢弃未验证的多数提议,并在剩余答案上重新分配一个均值为零、保持方差的信号。这种设计保持了稳定的优化梯度,而不强化不可验证的共识。在三个基于数学数据训练的主干模型上,JURY-RL在数学推理基准测试中始终优于其他无标签基线,并在代码生成和一般基准测试中表现出竞争力。它在pass@1性能上与监督真实训练相当,且通过更高的pass@k和响应多样性展示了更优的泛化能力。
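The "votes propose, proofs dispose" rule can be sketched as follows. This is a simplified illustration, not the authors' implementation: the Lean verifier is replaced by a hypothetical Python callable, and the ResZero fallback is left as a pluggable hook that defaults to all-zero rewards, because the abstract does not give its formula. All names are hypothetical.

```python
from collections import Counter


def jury_rewards(answers, verify, fallback=None):
    """Assign one reward per rollout.

    answers:  the final answer extracted from each rollout.
    verify:   callable returning True only if the answer is formally verified
              (a stand-in for the Lean check described in the abstract).
    fallback: optional callable(answers, plurality) used when verification
              does not succeed (the place where ResZero would plug in).
    """
    # Votes propose: the most frequent answer becomes the candidate.
    plurality, _ = Counter(answers).most_common(1)[0]
    # Proofs dispose: the candidate earns reward only if it verifies.
    if verify(plurality):
        return [1.0 if a == plurality else 0.0 for a in answers]
    if fallback is not None:
        return fallback(answers, plurality)
    return [0.0] * len(answers)


rollout_answers = ["4", "4", "5"]
print(jury_rewards(rollout_answers, verify=lambda a: a == "4"))
print(jury_rewards(rollout_answers, verify=lambda a: False))
```

The point of the design is visible even in this toy: a confident but unverifiable majority never receives positive reward, which is how the approach avoids reinforcing false consensus.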
cs.AI / 22 / 2604.25435

PI-TTA: Physics-Informed Source-Free Test-Time Adaptation for Robust Human Activity Recognition on Mobile Devices

PI-TTA:基于物理知识的无源测试时适应方法在移动设备上实现稳健的人类活动识别
Li, Changyu, Wang, Lu, Lei, Ming, Liu, Jiashen, Zhang, Yichen, Wu, Kaishun, Luo, Fei
Abstract
Source-free test-time adaptation (TTA) is appealing for mobile and wearable sensing because it enables on-device personalization from unlabeled test streams without centralizing private data. However, sensor-based human activity recognition (HAR) poses challenges that are less pronounced in standard vision benchmarks: behavioral inertial streams are temporally correlated and often exhibit within-session shifts caused by sensor rotation, placement change, and sampling-rate drift. Under this streaming non-i.i.d. setting, widely used vision-style TTA objectives can become unstable, leading to overconfident errors, representation collapse, and catastrophic forgetting. We propose PI-TTA, a lightweight source-free adaptation framework that stabilizes online updates through three physics-consistent constraints: gravity consistency, short-horizon temporal continuity, and spectral stability. PI-TTA updates the same small parameter subset as strong source-free baselines and incurs only modest overhead, making it suitable for on-device deployment. Experiments on USCHAD, PAMAP2, and mHealth under long-sequence stress tests and factorized shift protocols show that PI-TTA mitigates the severe degradation observed in confidence-driven baselines and preserves stable adaptation under sustained streaming conditions. It improves long-sequence accuracy by up to 9.13% and reduces physical-violation rates by 27.5%, 24.1%, and 45.4% on USCHAD, PAMAP2, and mHealth, respectively. These results demonstrate that physics-informed adaptation can improve accuracy, stability, and deployment reliability for real-world mobile sensing systems.
Chinese Translation
无源测试时适应(TTA)在移动和可穿戴传感中具有吸引力,因为它能够从未标记的测试流中实现设备上的个性化,而无需集中私人数据。然而,基于传感器的人类活动识别(HAR)面临的挑战在标准视觉基准中并不明显:行为惯性流是时间相关的,并且通常由于传感器旋转、放置变化和采样率漂移而表现出会话内的变化。在这种流式非独立同分布(non-i.i.d.)的环境下,广泛使用的视觉风格TTA目标可能变得不稳定,导致过度自信的错误、表示崩溃和灾难性遗忘。我们提出了PI-TTA,一种轻量级的无源适应框架,通过三个物理一致性约束来稳定在线更新:重力一致性、短期时间连续性和谱稳定性。PI-TTA更新与强无源基线相同的小参数子集,并且仅产生适度的开销,使其适合于设备上的部署。在USCHAD、PAMAP2和mHealth上进行的长序列压力测试和分解变化协议的实验表明,PI-TTA减轻了在基于置信度的基线中观察到的严重降级,并在持续流式条件下保持稳定的适应性。它在长序列准确性上提高了最多9.13%,并在USCHAD、PAMAP2和mHealth上分别减少了27.5%、24.1%和45.4%的物理违规率。这些结果表明,基于物理知识的适应可以提高现实世界移动传感系统的准确性、稳定性和部署可靠性。
cs.AI / 23 / 2604.25472

SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials

SciEval:K-12科学教学材料自动评估的基准
Li, Zhaohui, He, Peng, Chen, Zhiyuan, Liu, Honglu, Wang, Zeyuan, Li, Tingting, Xiong, Jinjun
Abstract
The need to evaluate instructional materials for K-12 science education has become increasingly important, as more educators use generative AI to create instructional materials. However, the review of instructional materials is time-consuming, expertise-intensive, and difficult to scale, motivating interest in automated evaluation approaches. While large language models (LLMs) have shown strong performance on general evaluation tasks, their performance and reliability on instructional materials remain unclear. To address this gap, we formulate Automatic Instructional Materials Evaluation (AIME) as a generative AI task that predicts scores and evidence using the rubric designed by the educator. We create a benchmark dataset and develop baseline models for AIME. First, we curate the first AIME dataset, SciEval, consisting of instructional materials annotated with pedagogy-aligned evaluation scores and evidence-based rationales. Expert annotations achieve high inter-rater reliability, resulting in a dataset of 273 lesson-level instructional materials evaluated across 13 criteria (N=3549) using the EQuIP rubric. Second, we test mainstream LLMs (GPT, Gemini, Llama, and Qwen) on SciEval and find that none achieve strong performance. Then we fine-tune Qwen3 on SciEval. Results on a held-out test set show that domain-aligned fine-tuning can achieve up to 11 percent performance gains, highlighting the importance of domain-specific fine-tuning for AIME and facilitating the use of LLMs in other educational tasks.
Chinese Translation
随着越来越多的教育工作者使用生成性人工智能来创建教学材料,评估K-12科学教育教学材料的需求变得愈发重要。然而,教学材料的审查过程耗时、需要专业知识且难以扩展,这激发了对自动评估方法的兴趣。尽管大型语言模型(LLMs)在一般评估任务上表现出色,但它们在教学材料上的表现和可靠性仍不明确。为了解决这一问题,我们将自动教学材料评估(Automatic Instructional Materials Evaluation, AIME)构建为一个生成性人工智能任务,该任务使用教育者设计的评分标准预测分数和证据。我们创建了一个基准数据集,并为AIME开发了基线模型。首先,我们整理了第一个AIME数据集SciEval,该数据集由与教学法对齐的评估分数和基于证据的理由标注的教学材料组成。专家标注实现了高水平的评估者间可靠性,形成了一个包含273个课程级教学材料的数据集,这些材料根据EQuIP评分标准在13个标准下进行了评估(N=3549)。其次,我们在SciEval上测试了主流的LLMs(GPT、Gemini、Llama和Qwen),发现没有一个模型表现出色。然后,我们在SciEval上对Qwen3进行了微调。持出测试集的结果显示,领域对齐的微调可以实现高达11%的性能提升,突显了领域特定微调对AIME的重要性,并促进了LLMs在其他教育任务中的应用。
cs.AI / 24 / 2604.25496

Improving Zero-Shot Offline RL via Behavioral Task Sampling

通过行为任务采样提升零样本离线强化学习
Bendib, Nazim, Perrin-Gilbert, Nicolas, Sigaud, Olivier
Abstract
Offline zero-shot reinforcement learning (RL) aims to learn agents that optimize unseen reward functions without additional environment interaction. The standard approach to this problem trains task-conditioned policies by sampling task vectors that define linear reward functions over learned state representations. In most existing algorithms, these task vectors are randomly sampled, implicitly assuming this adequately captures the structure of the task space. We argue that doing so leads to suboptimal zero-shot generalization. To address this limitation, we propose extracting task vectors directly from the offline dataset and using them to define the task distribution used for policy training. We introduce a simple and general reward function extraction procedure that integrates into existing offline zero-shot RL algorithms. Across multiple benchmark environments and baselines, our approach improves zero-shot performance by an average of 20%, highlighting the importance of principled task sampling in offline zero-shot RL.
Chinese Translation
离线零样本强化学习(RL)旨在学习能够优化未见奖励函数的智能体,而无需额外的环境交互。该问题的标准方法是通过采样定义在学习到的状态表示上的线性奖励函数的任务向量来训练任务条件策略。在大多数现有算法中,这些任务向量是随机采样的,隐含地假设这足以捕捉任务空间的结构。我们认为,这样做会导致次优的零样本泛化。为了解决这一局限性,我们提出直接从离线数据集中提取任务向量,并利用它们定义用于策略训练的任务分布。我们引入了一种简单而通用的奖励函数提取程序,可以集成到现有的离线零样本强化学习算法中。在多个基准环境和基线测试中,我们的方法平均提高了20%的零样本性能,突显了在离线零样本强化学习中原则性任务采样的重要性。
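The setup criticized in this entry, where a task is a vector z and the reward is linear in a learned state representation, can be written down directly. The Python sketch below shows the random-sampling baseline only. The abstract says task vectors are "randomly sampled" without naming the distribution, so uniform sampling on the unit sphere is an assumption, as are the function names and the toy feature vector; the authors' dataset-based extraction procedure is not reproduced here.

```python
import math
import random


def sample_task_vector(dim, rng):
    """Draw a task vector z uniformly from the unit sphere (assumed baseline)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


def linear_reward(state_features, z):
    """Reward defined by a task vector: r(s) = phi(s) . z"""
    return sum(f * w for f, w in zip(state_features, z))


rng = random.Random(0)
z = sample_task_vector(4, rng)
phi_s = [0.2, -1.0, 0.5, 0.0]  # hypothetical learned features of one state
print(linear_reward(phi_s, z))
```

The paper's argument is that vectors drawn this way need not resemble the rewards that behaviors in the dataset actually optimize, which motivates drawing z from the data instead.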
cs.AI / 25 / 2604.25512

PHISHREV: A Hybrid Machine Learning and Post-Hoc Non-monotonic Reasoning Framework for Context-Aware Phishing Website Classification

PHISHREV:一种混合机器学习与后验非单调推理框架用于上下文感知的网络钓鱼网站分类
Sen, Mainak, Ray, Kumar Sankar, Chakrabarti, Amlan
Abstract
Phishing detection systems predominantly rely on statistical machine learning models, which often lack contextual reasoning and are vulnerable to adversarial manipulation. In this work, we propose a hybrid framework that integrates machine learning classifiers with non-monotonic reasoning using Answer Set Programming (ASP) to enable context-aware decision refinement. The proposed post-hoc reasoning layer incorporates expert knowledge to revise classifier predictions through formal belief revisions. Experimental results indicate that the reasoning module modifies 5.08% of classifier outputs, leading to improved decision consistency. A key advantage is that new domain knowledge can be incorporated into the reasoning layer in $\mathcal{O}(n)$ time, eliminating the need for model retraining.
Chinese Translation
网络钓鱼检测系统主要依赖于统计机器学习模型,这些模型通常缺乏上下文推理,并且容易受到对抗性操控。在本研究中,我们提出了一种混合框架,将机器学习分类器与使用答案集编程(Answer Set Programming, ASP)的非单调推理相结合,以实现上下文感知的决策优化。所提出的后验推理层结合了专家知识,通过正式的信念修正来修订分类器的预测。实验结果表明,推理模块修改了5.08%的分类器输出,从而提高了决策的一致性。一个关键优势是新的领域知识可以在$\mathcal{O}(n)$时间内被纳入推理层,消除了模型重新训练的需求。
cs.AI / 26 / 2604.25521

Automated Adversarial Collaboration for Advancing Theory Building in the Cognitive Sciences

自动化对抗协作促进认知科学理论构建
Chandramouli, Suyog, Kachergis, George, Jagadish, Akshay
Abstract
Cognitive science often evaluates theories through narrow paradigms and local model comparisons, limiting the integration of evidence across tasks and realizations. We introduce an automated adversarial collaboration framework for adjudicating among competing theories even when the candidate models and experiments must be discovered during the adjudication process. The system combines LLM-based theory agents, program synthesis, and information-theoretic experimental design in a closed loop. In a simulation study spanning three classic categorization theories, the framework recovered the ground-truth theory across noise settings with weaker reliability in the hardest settings. Together, the framework and findings provide a concrete proof of concept for closed-loop, in-silico theory adjudication in cognitive science.
Chinese Translation
认知科学通常通过狭窄的范式和局部模型比较来评估理论,这限制了跨任务和实现的证据整合。我们提出了一种自动化对抗协作框架,用于在竞争理论之间进行裁决,即使候选模型和实验必须在裁决过程中发现。该系统结合了基于大型语言模型(LLM)的理论代理、程序合成和信息论实验设计,形成一个闭环。在涵盖三种经典分类理论的模拟研究中,该框架在噪声设置下恢复了真实理论,在最困难的设置中可靠性较弱。总体而言,该框架及其发现为认知科学中的闭环、计算机模拟理论裁决提供了具体的概念证明。
cs.AI / 27 / 2604.25534

Sample-efficient Neuro-symbolic Proximal Policy Optimization

样本高效的神经符号近端策略优化
Murari, Simone, Veronese, Celeste, Meli, Daniele
Abstract
Deep Reinforcement Learning (DRL) algorithms often require a large amount of data and struggle in sparse-reward domains with long planning horizons and multiple sub-goals. In this paper, we propose a neuro-symbolic extension of Proximal Policy Optimization (PPO) that transfers partial logical policy specifications learned in easier instances to guide learning in more challenging settings. We introduce two integrations of symbolic guidance: (i) H-PPO-Product, which biases the action distribution at sampling time, and (ii) H-PPO-SymLoss, which augments the PPO loss with a symbolic regularization term. We evaluate our methods on three benchmarks (OfficeWorld, WaterWorld, and DoorKey), showing consistently faster learning and higher return at convergence than PPO and a Reward Machine baseline, also under imperfect symbolic knowledge.
Chinese Translation
深度强化学习(DRL)算法通常需要大量数据,并且在具有稀疏奖励的领域中,特别是涉及长规划周期和多个子目标时,表现不佳。本文提出了一种近端策略优化(PPO)的神经符号扩展,该方法将从较简单实例中学习到的部分逻辑策略规范转移到更具挑战性的环境中,以指导学习。我们引入了两种符号指导的集成方式:(i)H-PPO-Product,在采样时对动作分布进行偏置,以及(ii)H-PPO-SymLoss,通过符号正则化项增强PPO损失。我们在三个基准测试(OfficeWorld、WaterWorld和DoorKey)上评估了我们的方法,结果显示其学习速度始终快于PPO和奖励机器基线,并且在收敛时获得更高的回报,即使在符号知识不完备的情况下也是如此。
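The abstract describes H-PPO-Product as biasing the action distribution at sampling time. A minimal sketch of what such a bias could look like, assuming the symbolic guidance arrives as one non-negative weight per action. The names `policy_probs` and `symbolic_weights`, and the renormalized-product rule itself, are illustrative assumptions and not the authors' formulation.

```python
import random

def product_distribution(policy_probs, symbolic_weights):
    """Multiply policy probabilities by symbolic weights, then renormalize."""
    scores = [p * w for p, w in zip(policy_probs, symbolic_weights)]
    total = sum(scores)
    if total == 0.0:
        # Guidance excludes every action the policy supports: fall back to the policy.
        return list(policy_probs)
    return [s / total for s in scores]

def sample_action(probs, rng=random):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical numbers: the symbolic rule favors action 1 three to one.
policy_probs = [0.25, 0.25, 0.5]
symbolic_weights = [1.0, 3.0, 1.0]
biased = product_distribution(policy_probs, symbolic_weights)
action = sample_action(biased, random.Random(0))
```

Because the bias is applied only when sampling, the underlying policy network is left unchanged in this sketch; a loss-side variant such as H-PPO-SymLoss would instead add a penalty term during training.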
cs.AI / 28 / 2604.25584

DualFact+: A Multimodal Fact Verification Framework for Procedural Video Understanding

DualFact+: 一种用于程序视频理解的多模态事实验证框架
Oguz, Cennet, Hamidullah, Yasser, van Genabith, Josef, Ostermann, Simon
Abstract
We introduce DualFact, a dual-layer, multimodal factuality evaluation framework for procedural video captioning. DualFact separates factual correctness into conceptual facts, capturing abstract semantic roles (e.g., Action, Ingredient, Tool, Location), and contextual facts, capturing their grounded predicate-argument realizations in video. To support complete and role-consistent evaluation, DualFact incorporates implicit argument augmentation (VIA) and contrastive fact sets. We instantiate DualFact in two modes: DualFact-T, which verifies facts against textual evidence, and DualFact-V, which verifies facts against video-grounded visual evidence. Experiments on YouCook3-Fact and CraftBench-Fact show that state-of-the-art multimodal language models produce fluent but often factually incomplete captions, with systematic omissions and role-level inconsistencies. DualFact correlates more strongly with human factuality judgments than standard metrics, particularly for contextual facts, and reveals that caption-only evaluation overestimates hallucinations compared to video-grounded verification. Overall, DualFact offers an interpretable and human-aligned evaluation protocol that highlights persistent challenges in multimodal factual grounding, extending beyond surface-level fluency.
Chinese Translation
我们提出了DualFact,这是一个双层多模态事实评估框架,旨在用于程序视频字幕生成。DualFact将事实的正确性分为概念事实,捕捉抽象的语义角色(例如,动作、成分、工具、地点),以及上下文事实,捕捉它们在视频中的具体谓词-论元实现。为了支持完整且角色一致的评估,DualFact结合了隐式论元增强(VIA)和对比事实集。我们以两种模式实现DualFact:DualFact-T,验证文本证据中的事实,以及DualFact-V,验证视频基础的视觉证据中的事实。在YouCook3-Fact和CraftBench-Fact上的实验表明,最先进的多模态语言模型生成流畅但往往事实不完整的字幕,存在系统性的遗漏和角色层面的不一致。与标准指标相比,DualFact与人类的事实判断相关性更强,尤其是在上下文事实方面,并揭示了仅基于字幕的评估相较于视频基础验证高估了幻觉。总体而言,DualFact提供了一种可解释且与人类对齐的评估协议,突显了多模态事实基础中持续存在的挑战,超越了表层流畅性。
cs.AI / 29 / 2604.25602

OxyGent: Making Multi-Agent Systems Modular, Observable, and Evolvable via Oxy Abstraction

OxyGent:通过Oxy抽象实现多智能体系统的模块化、可观察性和可演化性
Hu, Junxing, Li, Tianlong, Yu, Lei, Han, Ai
Abstract
Deploying production-ready multi-agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, observability, and autonomous evolution. We present OxyGent, an open-source framework that enables modular, observable, and evolvable MAS via a unified Oxy abstraction, in which agents, tools, LLMs, and reasoning flows are encapsulated as pluggable atomic components. This Lego-like assembly paradigm supports scalable system composition and non-intrusive monitoring. To enhance observability, OxyGent introduces permission-driven dynamic planning that replaces rigid workflows with execution graphs generated at runtime, which provide adaptive visualizations. To support continuous evolution, the framework integrates OxyBank, an AI asset management platform that supports automated data backflow, annotation, and joint evolution. Empirical evaluations and real-world case studies show that OxyGent provides a robust and scalable foundation for MAS. OxyGent is publicly available at https://oxygent.jd.com/.
Chinese Translation
在复杂工业环境中部署生产就绪的多智能体系统(MAS)仍然面临着可扩展性、可观察性和自主演化方面的挑战。我们提出了OxyGent,这是一个开源框架,通过统一的Oxy抽象,使得MAS具备模块化、可观察和可演化的特性,其中智能体、工具、LLMs(大语言模型)和推理流程被封装为可插拔的原子组件。这种类似乐高的组装范式支持可扩展的系统组合和非侵入式监控。为了增强可观察性,OxyGent引入了基于权限的动态规划,用执行图替代了僵化的工作流程,这些执行图在运行时生成,提供自适应的可视化。为了支持持续演化,该框架集成了OxyBank,一个支持自动数据回流、注释和联合演化的AI资产管理平台。实证评估和真实案例研究表明,OxyGent为MAS提供了一个稳健且可扩展的基础。OxyGent可在https://oxygent.jd.com/上公开获取。
cs.AI / 30 / 2604.25612

The Nonverbal Syntax Framework: An Evidence-Based Tiered System for Inferring Learner States from Observable Behavioral Cues

非语言语法框架:基于证据的分层系统用于从可观察的行为线索推断学习者状态
Turaev, Sherzod, John, Mary, Rustamov, Jaloliddin, Rustamov, Zahiriddin, Aldabet, Saja, Zaki, Nazar, Shuaib, Khaled
Abstract
Understanding learners' cognitive and affective states underpins adaptive educational systems and effective teaching. Although research links nonverbal cues to internal states, no framework calibrates them to evidence. We present the Nonverbal Syntax Framework, drawn from a systematic review of 908 studies and 17,043 cue-state mappings (Turaev et al., 2026). The framework addresses three challenges: terminological fragmentation (behaviors described inconsistently), evidence heterogeneity (single observations to replicated findings), and state ambiguity (similar patterns indicating multiple states). Normalization consolidated 5,537 state labels into 2,010 canonical states (a 63.7% reduction) and 11,521 cues into 6,434 normalized cues (a 44.2% reduction) across nine behavioral channels. Dual-evidence assessment separately evaluates Component Evidence (coverage of cues and states) and Relationship Evidence (independent studies per cue-state link). 52% of "Very High" relationships rest on one paper, so separation enables calibrated rather than overconfident inference from preliminary findings. The framework's four levels comprise a Cue Vocabulary of 6,434 indicators classified as observable/instrumental; State Clusters linking 2,010 states to indicative cues; State Profiles with multimodal behavioral signatures and actionable specifications; and Discriminative Analysis distinguishing 1,215 confusable state pairs. We identify 480 actionable R1-R4 relationships (three or more independent papers), the replicated core of six decades of research, covering 35.5% of mappings across 47 key learning states and 111 distinct indicators. The remaining 91.5% (9,653 single-paper findings) form exploratory hypotheses for replication. The framework gives researchers an empirical foundation for identifying gaps, practitioners evidence-based tools for state inference, and technologists validated features for multimodal detection.
Chinese Translation
理解学习者的认知和情感状态是自适应教育系统和有效教学的基础。尽管研究将非语言线索与内部状态联系起来,但尚无框架将其与证据进行校准。我们提出了非语言语法框架,该框架源于对908项研究和17,043个线索-状态映射的系统评审(Turaev et al., 2026)。该框架解决了三个挑战:术语碎片化(行为描述不一致)、证据异质性(单一观察与重复发现)和状态模糊性(相似模式指示多种状态)。标准化将5,537个状态标签整合为2,010个规范状态(63.7%),将11,521个线索整合为6,434个规范线索(44.2%),涵盖九个行为通道。双重证据评估分别评估组件证据(线索和状态的覆盖)和关系证据(每个线索-状态链接的独立研究)。52%的“非常高”关系基于一篇论文,因此分离使得从初步发现中进行校准而非过于自信的推断成为可能。该框架的四个层级包括:6,434个指标的线索词汇,分类为可观察/工具性;将2,010个状态与指示性线索连接的状态聚类;具有多模态行为特征和可操作规范的状态轮廓;以及区分1,215对混淆状态的辨别分析。我们识别出480个可操作的R1-R4关系(三篇或更多独立论文),这是六十年研究的核心重复,涵盖了47个关键学习状态和111个不同指标的35.5%的映射。剩余的91.5%(9,653个单一论文发现)形成了用于复制的探索性假设。该框架为研究人员提供了识别差距的实证基础,为从业者提供了基于证据的状态推断工具,为技术人员提供了经过验证的多模态检测特征。
cs.AI / 31 / 2604.25614

HotComment: A Benchmark for Evaluating Popularity of Online Comments

HotComment:在线评论受欢迎程度评估的基准
Wu, Yafeng, Zhang, Yunyao, Ye, Liliang, Zeng, Guiyi, Yu, Junqing, Xu, Chen, Song, Zikai
Abstract
Online comments play a crucial role in shaping public sentiment and opinion dynamics on social media. However, evaluating their popularity remains challenging, not only because it depends on linguistic quality, originality, and emotional resonance, but also because stylistic preferences vary widely across platforms and user groups, causing the same comment to resonate differently in different communities. In this work, we present HotComment, a multimodal benchmark integrating video and text modalities that comprehensively quantifies popularity from three enhanced aspects: (1) Content Quality, which evaluates semantic similarity with ground-truth human comments and extends quality assessment through four interpretable dimensions; (2) Popularity Prediction, based on trends from models trained on real-world interaction data; and (3) User Behavior Simulation, which models the distribution of platform users and approximates engagement scores through an agent-based framework. Furthermore, we propose StyleCmt, inspired by social ripple effects, where multiple stylistic dimensions align to amplify socially resonant expressions and suppress incongruent ones.
Chinese Translation
在线评论在塑造社交媒体上的公众情感和舆论动态中发挥着至关重要的作用。然而,评估它们的受欢迎程度仍然具有挑战性,这不仅因为它依赖于语言质量、原创性和情感共鸣,还因为不同平台和用户群体的风格偏好差异很大,导致同一评论在不同社区中的共鸣效果不同。在本研究中,我们提出了HotComment,一个集成视频和文本模态的多模态基准,全面量化受欢迎程度的三个增强方面:(1)内容质量,通过与真实人类评论的语义相似性进行评估,并通过四个可解释维度扩展质量评估;(2)受欢迎程度预测,基于在真实互动数据上训练的模型趋势;(3)用户行为模拟,建模平台用户的分布,并通过基于代理的框架近似参与度评分。此外,我们提出了StyleCmt,受到社会涟漪效应的启发,其中多个风格维度协同作用,以增强社会共鸣的表达并抑制不一致的表达。
cs.AI / 32 / 2604.25684

Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents

三思而后行——自主人工智能代理的神经认知治理模型
Bandara, Eranga, Gore, Ross, Gunaratna, Asanga, Rajapakse, Sachini, Kularathna, Isurunima, Mukkamala, Ravi, Shetty, Sachin, Liang, Xueping, Hass, Amin, Hewa, Tharaka, Rahman, Abdul, Rhea, Christopher K., Clayton, Anita H., Samuel, Preston, Yarlagadda, Atmaram
Abstract
The rapid deployment of autonomous AI agents across enterprise, healthcare, and safety-critical environments has created a fundamental governance gap. Existing approaches (runtime guardrails, training-time alignment, and post-hoc auditing) treat governance as an external constraint rather than an internalized behavioral principle, leaving agents vulnerable to unsafe and irreversible actions. We address this gap by drawing on how humans self-govern naturally: before acting, humans engage deliberate cognitive processes grounded in executive function, inhibitory control, and internalized organizational rules to evaluate whether an intended action is permissible, requires modification, or demands escalation. This paper proposes a neurocognitive governance framework that formally maps this human self-governance process to LLM-driven agent reasoning, establishing a structural parallel between the human brain and the large language model as the cognitive core of an agent. We formalize a Pre-Action Governance Reasoning Loop (PAGRL) in which agents consult a four-layer governance rule set (global, workflow-specific, agent-specific, and situational) before every consequential action, mirroring how human organizations structure compliance hierarchies across enterprise, department, and role levels. Implemented on a production-grade retail supply chain workflow, the framework achieves 95% compliance accuracy and zero false escalations to human oversight, demonstrating that embedding governance into agent reasoning produces more consistent, explainable, and auditable compliance than external enforcement. This work offers a principled foundation for autonomous AI agents that govern themselves the way humans do: not because rules are imposed upon them, but because deliberation is embedded in how they think.
Chinese Translation
自主人工智能代理在企业、医疗和安全关键环境中的快速部署造成了根本性的治理缺口。现有方法,如运行时保护措施、训练时对齐和事后审计,将治理视为外部约束,而非内化的行为原则,这使得代理容易受到不安全和不可逆转行为的影响。我们通过借鉴人类自然自我治理的方式来填补这一空白:在采取行动之前,人类会进行基于执行功能、抑制控制和内化组织规则的深思熟虑的认知过程,以评估预期行动是否可行、是否需要修改或是否需要升级。本文提出了一种神经认知治理框架,正式将这一人类自我治理过程映射到基于大型语言模型(LLM)的代理推理上,建立了人脑与大型语言模型作为代理认知核心之间的结构性平行关系。我们形式化了一个前行动治理推理循环(Pre-Action Governance Reasoning Loop, PAGRL),在每次重要行动之前,代理会咨询一个四层治理规则集:全球层、工作流程特定层、代理特定层和情境层,反映了人类组织如何在企业、部门和角色层面构建合规层级。该框架在生产级零售供应链工作流程中实施,达到了95%的合规准确率和零次错误升级至人类监督,证明将治理嵌入代理推理中比外部强制执行产生了更一致、可解释和可审计的合规性。这项工作为自主人工智能代理提供了一个原则性的基础,使其以人类的方式自我治理:不是因为规则被强加于他们,而是因为深思熟虑嵌入了他们的思维方式。
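The pre-action loop described above, in which an agent consults global, workflow-specific, agent-specific, and situational rules and then permits, modifies, or escalates an action, can be sketched roughly as follows. This is an illustration under stated assumptions: the rule functions, the action dictionary layout, and the "most restrictive verdict wins" policy are hypothetical and are not taken from the paper.

```python
from enum import IntEnum

class Verdict(IntEnum):
    PERMIT = 0
    MODIFY = 1
    ESCALATE = 2

# Layer order follows the abstract: global, workflow, agent, situational.
LAYERS = ("global", "workflow", "agent", "situational")

def govern(action, rules_by_layer):
    """Consult every layer; return the most restrictive verdict and the reasons."""
    verdict, reasons = Verdict.PERMIT, []
    for layer in LAYERS:
        for rule in rules_by_layer.get(layer, []):
            result = rule(action)
            if result is not None and result > Verdict.PERMIT:
                reasons.append((layer, rule.__name__, result.name))
                verdict = max(verdict, result)
    return verdict, reasons

# Hypothetical rules for a retail supply chain agent.
def no_large_refunds(action):
    if action["type"] == "refund" and action["amount"] > 1000:
        return Verdict.ESCALATE
    return None

def cap_order_quantity(action):
    if action["type"] == "reorder" and action["quantity"] > 500:
        return Verdict.MODIFY
    return None

rules = {"global": [no_large_refunds], "workflow": [cap_order_quantity]}
decision, why = govern({"type": "refund", "amount": 5000}, rules)
```

Returning the reasons alongside the verdict keeps each decision auditable, which is the property the abstract emphasizes.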
cs.AI / 33 / 2604.25693

RADD: Retrieval-Augmented Discrete Diffusion for Multi-Modal Knowledge Graph Completion

RADD:增强检索的离散扩散用于多模态知识图谱补全
Niu, Guanglin, Li, Bo
Abstract
Most multi-modal knowledge graph completion (MMKGC) models use one embedding scorer to do both retrieval over the full entity set and final decision making. We argue that this coupling is a core bottleneck: global high-recall search and local fine-grained disambiguation require different inductive biases. Therefore, we propose a Retrieval-Augmented Discrete Diffusion (RADD) framework to decouple retrieval and reranking for MMKGC. A relation-aware multimodal KGE retriever serves as both global retriever and distillation teacher, while a conditional discrete denoiser performs shortlist-level entity-identity generation for reranking. Training combines KGE supervision, denoising cross-entropy, and temperature-scaled distillation from the retriever to the denoiser. At inference, the designed Diff-Rerank first forms a top-K shortlist with the retriever and then reranks it with the denoiser, ensuring that recall is a strict prerequisite for precision. Experiments on three MMKGC benchmarks show that RADD achieves the best performance and consistent gains over strong unimodal, multimodal, and LLM-based baselines, while ablations further verify the contribution of each component.
Chinese Translation
大多数多模态知识图谱补全(MMKGC)模型使用一个嵌入评分器同时进行全实体集的检索和最终决策。我们认为这种耦合是一个核心瓶颈:全局高召回率的搜索和局部细粒度的消歧需要不同的归纳偏置。因此,我们提出了一种增强检索的离散扩散(RADD)框架,以解耦MMKGC中的检索和重排序。一个关系感知的多模态知识图嵌入(KGE)检索器同时作为全局检索器和蒸馏教师,而一个条件离散去噪器则在重排序中执行短名单级别的实体身份生成。训练结合了KGE监督、去噪交叉熵和从检索器到去噪器的温度缩放蒸馏。在推理阶段,设计的Diff-Rerank首先使用检索器形成一个前$K$短名单,然后使用去噪器对其进行重排序,确保召回是精度的严格前提。在三个MMKGC基准上的实验表明,RADD实现了最佳性能,并在强大的单模态、多模态和基于大语言模型(LLM)的基线之上持续获得提升,而消融实验进一步验证了每个组件的贡献。
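The two-stage inference described above (shortlist with the retriever, then rerank with the denoiser) follows the general retrieve-then-rerank pattern. A minimal sketch of that pattern, with plain score lookups standing in for the KGE retriever and the discrete denoiser; the entity names and scores are made up for illustration.

```python
def retrieve_then_rerank(candidates, retriever_score, reranker_score, k):
    """Keep the top-k candidates by retriever score, then order them by reranker score."""
    shortlist = sorted(candidates, key=retriever_score, reverse=True)[:k]
    return sorted(shortlist, key=reranker_score, reverse=True)

# Hypothetical scores for four candidate entities.
retriever = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7}
reranker = {"a": 0.2, "b": 0.6, "c": 0.99, "d": 0.5}

ranking = retrieve_then_rerank(list(retriever), retriever.get, reranker.get, k=3)
```

Entity `c` has the highest reranker score but never reaches the reranker because the retriever ranks it last, which is the sense in which recall is a prerequisite for precision.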
cs.AI / 34 / 2604.25724

Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study

复合人工智能系统的可扩展推理架构:生产部署研究
S V, Srikanta Prasad, Arora, Utkarsh
Abstract
Modern enterprise AI applications increasingly rely on compound AI systems - architectures that compose multiple models, retrievers, and tools to accomplish complex tasks. Deploying such systems in production demands inference infrastructure that can efficiently serve concurrent, heterogeneous model invocations while maintaining cost-effectiveness and low latency. This paper presents a production deployment study of a modular, platform-agnostic inference architecture developed at Salesforce to support compound AI use cases including Agentforce (autonomous AI agents) and ApexGuru (AI-powered code analysis). The system integrates serverless execution, dynamic autoscaling, and MLOps pipelines to deliver consistent low-latency inference across multi-component agent workflows. We report production results demonstrating over 50% reduction in tail latency (P95), up to 3.9x throughput improvement, and 30 to 40% cost savings compared to prior static deployments. We further present a novel analysis of compound-system-specific challenges including multi-model fan-out overhead, cascading cold-start propagation, and heterogeneous scaling dynamics that emerge uniquely when serving agentic workloads. Through detailed case studies and operational lessons, we illustrate how the architecture enables compound AI systems to scale model invocations in parallel, handle bursty multi-agent workloads, and support rapid model iteration - capabilities essential for operationalizing agentic AI at enterprise scale.
Chinese Translation
现代企业人工智能应用越来越依赖于复合人工智能系统——一种由多个模型、检索器和工具组成的架构,以完成复杂任务。在生产中部署此类系统需要推理基础设施,能够高效地服务于并发的异构模型调用,同时保持成本效益和低延迟。本文介绍了在Salesforce开发的模块化、平台无关的推理架构的生产部署研究,以支持包括Agentforce(自主人工智能代理)和ApexGuru(人工智能驱动的代码分析)在内的复合人工智能用例。该系统集成了无服务器执行、动态自动扩展和MLOps管道,以在多组件代理工作流中提供一致的低延迟推理。我们报告的生产结果显示,与之前的静态部署相比,尾部延迟(P95)减少超过50%,吞吐量提高了最高3.9倍,成本节省达30%至40%。我们进一步提出了一种新颖的分析,探讨了复合系统特有的挑战,包括多模型分发开销、级联冷启动传播和在服务代理工作负载时独特出现的异构扩展动态。通过详细的案例研究和操作经验,我们展示了该架构如何使复合人工智能系统能够并行扩展模型调用,处理突发的多代理工作负载,并支持快速的模型迭代——这些能力对于在企业规模上实现代理人工智能至关重要。
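The multi-model fan-out mentioned above, where one request triggers several model invocations in parallel, can be illustrated with standard-library `asyncio`. This is a toy sketch: `call_model` and its simulated latencies are placeholders, not part of the Salesforce system.

```python
import asyncio

async def call_model(name, delay_s):
    """Stand-in for a remote model invocation; sleeps to simulate latency."""
    await asyncio.sleep(delay_s)
    return f"{name}:done"

async def fan_out(calls):
    """Issue all invocations concurrently and collect results in call order."""
    return await asyncio.gather(*(call_model(n, d) for n, d in calls))

results = asyncio.run(fan_out([("retriever", 0.01), ("ranker", 0.02), ("llm", 0.03)]))
```

With concurrent dispatch the wall-clock time is set by the slowest call rather than the sum of all calls, which is also why a single cold-started component can dominate the latency of the whole request.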
cs.AI / 35 / 2604.25727

Toward Scalable Terminal Task Synthesis via Skill Graphs

通过技能图实现可扩展的终端任务合成
Fan, Zhiyuan, Yu, Tinghao, Cai, Yuanjun, Guan, Jiangtao, Yang, Yun, Hu, Dingxin, Zhou, Jiang, Wu, Xing, Han, Zhuo, Zhang, Feng, Wang, Lilin
Abstract
Terminal agents have demonstrated strong potential for autonomous command-line execution, yet their training remains constrained by the scarcity of high-quality and diverse execution trajectories. Existing approaches mitigate this bottleneck by synthesizing large-scale terminal task instances for trajectory sampling. However, they primarily focus on scaling the number of tasks while providing limited control over the diversity of execution trajectories that agents actually experience during training. In this paper, we present SkillSynth, an automated framework for terminal task synthesis built on a scenario-mediated skill graph. SkillSynth first constructs a large-scale skill graph, where scenarios serve as intermediate transition nodes that connect diverse command-line skills. It then samples paths from this graph as abstractions of real-world workflows, and uses a multi-agent harness to instantiate them into executable task instances. By grounding task synthesis in graph-sampled workflow paths, SkillSynth explicitly controls the diversity of minimal execution trajectories required to solve the synthesized tasks. Experiments on Terminal-Bench demonstrate the effectiveness of SkillSynth. Moreover, task instances synthesized by SkillSynth have been adopted to train Hy3 Preview, contributing to its enhanced agentic capabilities in terminal-based settings.
Chinese Translation
终端智能体在自主命令行执行方面展现出强大的潜力,但其训练仍受到高质量和多样化执行轨迹稀缺的限制。现有方法通过合成大规模终端任务实例来缓解这一瓶颈,以进行轨迹采样。然而,这些方法主要集中在任务数量的扩展上,而对智能体在训练过程中实际经历的执行轨迹多样性提供的控制有限。本文提出了SkillSynth,一个基于场景介导的技能图的终端任务合成自动化框架。SkillSynth首先构建一个大规模技能图,其中场景作为中间过渡节点,连接多样化的命令行技能。然后,它从该图中采样路径,作为现实工作流的抽象,并使用多智能体框架将其实例化为可执行的任务实例。通过将任务合成基于图采样的工作流路径,SkillSynth明确控制了解决合成任务所需的最小执行轨迹的多样性。在Terminal-Bench上的实验表明了SkillSynth的有效性。此外,SkillSynth合成的任务实例已被用于训练Hy3 Preview,提升了其在基于终端环境中的智能能力。
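The idea of sampling workflow paths from a skill graph whose scenario nodes connect command-line skills can be sketched as a random walk that alternates between the two node types. The graph contents and the walk rule below are hypothetical; the abstract does not specify how SkillSynth builds or samples its graph.

```python
import random

# Hypothetical toy graph: skills lead to scenarios, scenarios lead to skills.
SKILL_TO_SCENARIOS = {
    "grep": ["log_triage"],
    "awk": ["log_triage", "report"],
    "tar": ["backup"],
    "rsync": ["backup"],
}
SCENARIO_TO_SKILLS = {
    "log_triage": ["grep", "awk"],
    "report": ["awk", "grep"],
    "backup": ["tar", "rsync"],
}

def sample_path(start_skill, num_skills, rng):
    """Random walk skill -> scenario -> skill ... containing num_skills skills."""
    path, skill = [start_skill], start_skill
    for _ in range(num_skills - 1):
        scenario = rng.choice(SKILL_TO_SCENARIOS[skill])
        # Prefer moving to a different skill so the path does not stall.
        options = [s for s in SCENARIO_TO_SKILLS[scenario] if s != skill]
        skill = rng.choice(options or SCENARIO_TO_SKILLS[scenario])
        path += [scenario, skill]
    return path

path = sample_path("grep", num_skills=3, rng=random.Random(0))
```

Each sampled path is an abstract workflow; turning it into an executable task instance is the step the abstract assigns to the multi-agent harness.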
cs.AI / 36 / 2604.25740

QAROO: AI-Driven Online Task Offloading for Energy-Efficient and Sustainable MEC Networks

QAROO:基于人工智能的在线任务卸载以实现能源高效和可持续的移动边缘计算网络
Yao, Yongtao, Yang, Yao, Shi, Haorui, Zhu, Canglu, Chen, Miaojiang, Farouk, Ahmed
Abstract
With the rapid advancement of artificial intelligence (AI) and intelligent science, intelligent edge computing has been widely adopted. However, the limitations of traditional methods, such as poor adaptability and the slow convergence of heuristic algorithms, are becoming increasingly evident. To enable sustainable and resource-efficient edge applications, this paper proposes an online task offloading framework for wireless powered mobile edge computing (MEC) networks, called Quantum Attention-based Reinforcement learning for Online Offloading (QAROO). The system employs a binary offloading strategy with the aim of co-optimizing computing and energy resources in dynamic channel environments. In response to the issues of poor adaptability in traditional approaches and the slow convergence of heuristic algorithms, the framework integrates quantum neural networks and attention mechanisms, introducing three key improvements: using recurrent neural networks to enhance temporal modeling capability, proposing an uncertainty-guided quantization method to improve exploration efficiency, and incorporating attention mechanisms into quantum networks to strengthen feature representation. Experiments demonstrate that the proposed method outperforms comparative schemes in terms of normalized computation speed and processing time, offering an efficient and stable solution for online task offloading in large-scale Internet of Things (IoT) dynamic environments.
Chinese Translation
随着人工智能(AI)和智能科学的快速发展,智能边缘计算得到了广泛应用。然而,传统方法的局限性,如适应性差和启发式算法收敛速度慢,正变得日益明显。为了实现可持续和资源高效的边缘应用,本文提出了一种针对无线供电移动边缘计算(MEC)网络的在线任务卸载框架,称为基于量子注意力的在线卸载强化学习(QAROO)。该系统采用二进制卸载策略,旨在在动态信道环境中共同优化计算和能源资源。针对传统方法适应性差和启发式算法收敛速度慢的问题,该框架集成了量子神经网络和注意力机制,提出了三项关键改进:使用递归神经网络增强时间建模能力,提出不确定性引导的量化方法以提高探索效率,以及将注意力机制融入量子网络以增强特征表示。实验结果表明,所提方法在归一化计算速度和处理时间方面优于对比方案,为大规模物联网(IoT)动态环境中的在线任务卸载提供了一种高效且稳定的解决方案。
cs.AI / 37 / 2604.25796

StratFormer: Adaptive Opponent Modeling and Exploitation in Imperfect-Information Games

StratFormer:在不完全信息游戏中的自适应对手建模与利用
Caen, Andy, Winands, Mark H. M., Soemers, Dennis J. N. J.
Abstract
We present StratFormer, a transformer-based meta-agent that learns to simultaneously model and exploit opponents in imperfect-information games through a two-phase curriculum. The first phase trains an opponent modeling head to identify behavioral patterns from action histories while the agent plays a game-theoretic optimal (GTO) policy. The second phase progressively shifts the policy toward best-response (BR) exploitation, guided by a per-opponent regularization schedule tied to exploitability. Our architecture introduces dual-turn tokens -- feature vectors constructed at both agent and opponent decision points -- coupled with bucket-rate features that encode opponent tendencies across five strategic contexts. On Leduc Hold'em, a small poker variant with six cards and two betting rounds, we test against six opponent archetypes at two strength levels each, with exploitability ranging from 0.15 to 1.26 Big Blinds (BB) per hand. StratFormer achieves an average exploitation gain of +0.106 BB per hand over GTO, with peak gains of +0.821 against highly exploitable opponents, while maintaining near-equilibrium safety.
Chinese Translation
我们提出了StratFormer,一种基于变换器的元代理,通过两阶段课程学习同时对不完全信息游戏中的对手进行建模和利用。第一阶段训练一个对手建模头,从行动历史中识别行为模式,同时代理执行博弈论最优(GTO)策略。第二阶段逐步将策略转向最佳响应(BR)利用,受益于与可利用性相关的每个对手的正则化调度。我们的架构引入了双回合令牌——在代理和对手决策点构建的特征向量——以及编码对手在五种战略背景下倾向的桶率特征。在Leduc Hold'em(一个包含六张牌和两个下注回合的小型扑克变种)中,我们针对六种对手原型进行测试,每种原型有两个强度水平,其可利用性范围从每手0.15到1.26大盲注(BB)。StratFormer在每手牌上相较于GTO实现了平均+0.106 BB的利用增益,在面对高度可利用的对手时,峰值增益达到+0.821,同时保持近乎均衡的安全性。
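To make the shift from GTO play toward best-response exploitation concrete, the sketch below interpolates between two action distributions with a weight that grows with training progress and with the opponent's exploitability. This is a simplified stand-in: the paper describes a per-opponent regularization schedule, whereas the linear mixing rule, the function names, and the use of 1.26 BB as a normalizer (the upper end of the range quoted in the abstract) are assumptions made for illustration.

```python
def exploitation_weight(exploitability_bb, progress, max_exploitability_bb=1.26):
    """Weight in [0, 1]: larger for more exploitable opponents and later in training."""
    ratio = min(max(exploitability_bb / max_exploitability_bb, 0.0), 1.0)
    return ratio * min(max(progress, 0.0), 1.0)

def mix_policies(gto, best_response, weight):
    """Convex combination of two action distributions over the same actions."""
    return [(1.0 - weight) * g + weight * b for g, b in zip(gto, best_response)]

# Hypothetical three-action distributions (fold, call, raise).
gto = [0.2, 0.5, 0.3]
best_response = [0.0, 0.1, 0.9]
w = exploitation_weight(exploitability_bb=0.63, progress=1.0)
policy = mix_policies(gto, best_response, w)
```

Against a near-equilibrium opponent the weight stays small and the mixed policy remains close to GTO, which matches the safety property the abstract reports.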
cs.AI / 38 / 2604.25832

TrialCalibre: A Fully Automated Causal Engine for RCT Benchmarking and Observational Trial Calibration

TrialCalibre:一个完全自动化的因果引擎,用于随机对照试验基准和观察性试验校准
Habibdoust, Amir, Song, Xing
Abstract
Real-world evidence (RWE) studies that emulate target trials increasingly inform regulatory and clinical decisions, yet residual, hard-to-quantify biases still limit their credibility. The recently proposed BenchExCal framework addresses this challenge via a two-stage Benchmark, Expand, Calibrate process, which first compares an observational emulation against an existing randomized controlled trial (RCT), then uses the observed divergence to calibrate a second emulation that estimates causal effects for a new indication. While methodologically powerful, BenchExCal is resource intensive and difficult to scale. We introduce TrialCalibre, a conceptualized multiagent system designed to automate and scale the BenchExCal workflow. Our framework features specialized agents such as the Orchestrator, Protocol Design, Data Synthesis, Clinical Validation, and Quantitative Calibration Agents that coordinate the overall process. TrialCalibre incorporates agent learning (e.g., RLHF) and knowledge blackboards to support adaptive, auditable, and transparent causal effect estimation.
Chinese Translation
真实世界证据(RWE)研究通过模拟目标试验,越来越多地影响监管和临床决策,但残余的、难以量化的偏差仍然限制了它们的可信度。最近提出的BenchExCal框架通过两阶段的基准、扩展、校准过程来应对这一挑战,首先将观察性模拟与现有的随机对照试验(RCT)进行比较,然后利用观察到的差异来校准第二个模拟,以估计新的指征因果效应。尽管方法论上强大,BenchExCal却资源密集且难以扩展。我们引入了TrialCalibre,一个概念化的多智能体系统,旨在自动化和扩展BenchExCal工作流程。我们的框架具有专门的智能体,如协调者、协议设计、数据合成、临床验证和定量校准智能体,协调整个过程。TrialCalibre结合了智能体学习(例如,RLHF)和知识黑板,以支持适应性、可审计和透明的因果效应估计。
cs.AI / 39 / 2604.25834

Action-Aware Generative Sequence Modeling for Short Video Recommendation

基于动作感知的短视频推荐生成序列建模
Li, Wenhao, Lin, Zihan, Guo, Zhengxiao, Zhou, Jie, Liu, Shukai, Liu, Yongqi, Luo, Chuan, Ma, Chaoyi, Tang, Ruiming, Li, Han
Abstract
With the rapid development of the Internet, users have increasingly high expectations for the recommendation accuracy of online content consumption platforms. However, short videos often contain diverse segments, and users may not hold the same attitude toward all of them. Traditional binary-classification recommendation models, which treat a video as a single holistic entity, face limitations in accurately capturing such nuanced preferences. Considering that user consumption is a temporal process, this paper demonstrates that the timing of user actions can represent diverse intentions through statistical analysis and examination of action patterns. Based on this insight, we propose a novel modeling paradigm: Action-Aware Generative Sequence Network (A2Gen), which refines user actions along the temporal dimension and chains them into sequences for unified processing and prediction. First, we introduce the Context-aware Attention Module (CAM) to model action sequences enriched with item-specific contextual features. Building upon this, we develop the Hierarchical Sequence Encoder (HSE) to learn temporal action patterns from users' historical actions. Finally, by leveraging CAM, we design a module for action sequence generation: the Action-seq Autoregressive Generator (AAG). Extensive offline experiments on Kuaishou's dataset and the Tmall public dataset demonstrate the superiority of our proposed model. Furthermore, through large-scale online A/B testing deployed on Kuaishou's platform, our model achieves significant improvements over baseline methods in multi-task prediction by leveraging sequential information. Specifically, it yields increases of 0.34% in user watch time, 8.1% in interaction rate, and 0.162% in overall user retention (LifeTime-7), leading to successful deployment across all traffic, serving over 400 million users every day.
Chinese Translation
随着互联网的快速发展,用户对在线内容消费平台的推荐准确性期望越来越高。然而,短视频通常包含多样化的片段,用户对这些片段的态度可能并不相同。传统的二分类推荐模型将视频视为一个整体实体,难以准确捕捉这种细微的偏好。考虑到用户消费是一个时间过程,本文通过统计分析和对动作模式的研究,展示了用户行为的时机可以代表多样的意图。基于这一洞察,我们提出了一种新颖的建模范式:动作感知生成序列网络(Action-Aware Generative Sequence Network, A2Gen),该网络沿时间维度细化用户行为,并将其串联成序列以进行统一处理和预测。首先,我们引入了上下文感知注意力模块(Context-aware Attention Module, CAM)来建模丰富了项目特定上下文特征的动作序列。在此基础上,我们开发了层次序列编码器(Hierarchical Sequence Encoder, HSE),以从用户的历史行为中学习时间动作模式。最后,通过利用CAM,我们设计了一个动作序列生成模块:动作序列自回归生成器(Action-seq Autoregressive Generator, AAG)。在Kuaishou数据集和Tmall公共数据集上进行的大量离线实验表明,我们提出的模型具有优越性。此外,通过在Kuaishou平台上进行的大规模在线A/B测试,我们的模型在多任务预测中相较于基线方法实现了显著提升,具体表现为用户观看时间提高0.34%,互动率提高8.1%,整体用户留存率(LifeTime-7)提高0.162%,成功部署于所有流量中,每天为超过4亿用户提供服务。
cs.AI / 40 / 2604.25848

Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions

基于半马尔可夫强化学习的城市规模电动车叫车服务与可行性保障行动
Nguyen, An, Nguyen, Hoang, Le, Phuong, Pham, Hung, Do, Cuong, Ghaoui, Laurent El
Abstract
We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions -- discrete actions for serving, repositioning, and charging, together with continuous charging power -- and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor-Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich-Rubinstein dual, a projected subgradient inner loop, and a primal-dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD-RSAC achieves the highest net profit, reaching $1.22M, compared with $0.58M to $0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder-limit violations.
Chinese Translation
我们研究了电动车(EV)叫车车队在城市规模下的控制问题,其中调度、重新定位和充电决策必须在不确定的、空间相关的需求和旅行时间下遵循充电器和供电限制。我们将该问题表述为一个六边形网格的半马尔可夫决策过程(semi-MDP),其中包含混合动作——用于服务、重新定位和充电的离散动作,以及连续的充电功率——和可变的动作持续时间。为了在训练和部署期间保证物理可行性,策略通过一个经过掩蔽和温度退火的演员生成的高层意图进行学习。这些意图在每个决策步骤通过一个时间限制的滚动混合整数线性规划(MILP)进行投影,严格执行电量、端口和供电限制。为了减轻分布偏移,我们针对一个具有图对齐的马哈拉诺比斯地面度量的Wasserstein-1模糊集优化了一个软演员-评论家(Soft Actor-Critic, SAC)代理。稳健的备份使用了Kantorovich-Rubinstein对偶、投影子梯度内循环和原始-对偶风险预算更新。我们的架构结合了一个两层图卷积网络(Graph Convolutional Network, GCN)编码器、双重评论者和驱动对手的价值网络。在基于纽约市出租车数据构建的大规模电动车车队模拟器上的实验表明,PD-RSAC实现了最高的净利润,达到122万美元,而强启发式、单代理强化学习和多代理强化学习基线(包括贪婪算法、SAC、MAPPO和MADDPG)的净利润为58万至70万美元,同时保持零供电限制违规。
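The "masked, temperature-annealed actor" mentioned above can be illustrated with a masked softmax: infeasible actions receive zero probability, and lowering the temperature concentrates the distribution on the best feasible action. This sketch covers only that actor-side step under assumed inputs; in the paper, hard feasibility is enforced by the MILP projection, which is not reproduced here.

```python
import math

def masked_softmax(logits, feasible, temperature):
    """Softmax over feasible actions only; infeasible actions get probability 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not any(feasible):
        raise ValueError("at least one action must be feasible")
    scaled = [x / temperature for x in logits]
    # Subtract the largest feasible value for numerical stability.
    peak = max(s for s, ok in zip(scaled, feasible) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scaled, feasible)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (serve, charge, reposition); charging is infeasible here.
logits = [1.0, 5.0, 2.0]
feasible = [True, False, True]
warm = masked_softmax(logits, feasible, temperature=1.0)
cold = masked_softmax(logits, feasible, temperature=0.1)
```

The masked action keeps probability zero at every temperature, even though it has the highest raw logit.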
cs.AI / 41 / 2604.25849

ADEMA: A Knowledge-State Orchestration Architecture for Long-Horizon Knowledge Synthesis with LLM Agents

ADEMA:一种用于长远知识综合的知识状态协调架构与LLMAgents
Hanlin, Zhou, Yong, Chan Huah
Abstract
Long-horizon LLM tasks often fail not because a single answer is unattainable, but because knowledge states drift across rounds, intermediate commitments remain implicit, and interruption fractures the evolving evidence chain. This paper presents ADEMA as a knowledge-state orchestration architecture for long-horizon knowledge synthesis rather than as a generic multi-agent runtime. The architecture combines explicit epistemic bookkeeping, heterogeneous dual-evaluator governance, adaptive task-mode switching, reputation-shaped resource allocation, checkpoint-resumable persistence, segment-level memory condensation, artifact-first assembly, and final-validity checking with safe fallback. Evidence is drawn entirely from existing materials: a four-scenario showcase package, a fixed 60-run mechanism matrix, targeted micro-ablation and artifact-chain supplements, and a repaired protocol-level benchmark in which code-oriented evaluation is the clearest quality-sensitive mechanism block. Across the fixed matrix, removing checkpoint/resume produced the only invalid run, and it did so in the interruption-sensitive resume condition. By contrast, dual evaluation, segment synthesis, and dynamic governance are best interpreted as supporting control mechanisms that shape trajectory discipline, explicit artifact progression, and cost-quality behavior rather than as universal binary prerequisites for completion. The contribution is therefore a knowledge-state orchestration architecture in which explicit epistemic state transition, evidence-bearing artifact progression, and recoverable continuity are the primary design commitments.
Chinese Translation
长远的LLM任务往往失败并非因为单一答案不可得,而是由于知识状态在多个回合中漂移、中间承诺保持隐性,以及中断破坏了不断发展的证据链。本文提出ADEMA作为一种知识状态协调架构,旨在实现长远知识综合,而非作为通用的多智能体运行时。该架构结合了明确的认知记账、异构的双评估治理、自适应任务模式切换、声誉导向的资源分配、可检查点恢复的持久性、分段级记忆浓缩、优先构建工件以及安全回退的最终有效性检查。证据完全来源于现有材料:一个四场景展示包、一个固定的60次运行机制矩阵、针对性的微剥离和工件链补充,以及一个修复的协议级基准,其中以代码为导向的评估是最清晰的质量敏感机制模块。在固定矩阵中,移除检查点/恢复产生了唯一的无效运行,并且是在对中断敏感的恢复条件下发生的。相较之下,双重评估、分段综合和动态治理最好被理解为支持控制机制,这些机制塑造了轨迹纪律、明确的工件进展和成本-质量行为,而不是完成的普遍二元前提。因此,本文的贡献是一个知识状态协调架构,其中明确的认知状态转变、承载证据的工件进展和可恢复的连续性是主要设计承诺。
cs.AI / 42 / 2604.25917

Recursive Multi-Agent Systems

递归多智能体系统
Yang, Xiyuan, Zou, Jiaru, Pan, Rui, Qiu, Ruizhong, Lu, Pan, Diao, Shizhe, Jiang, Jindong, Tong, Hanghang, Zhang, Tong, Buehler, Markus J., He, Jingrui, Zou, James
Abstract
Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend this scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion? To this end, we introduce RecursiveMAS, a recursive multi-agent framework that casts the entire system as a unified latent-space recursive computation. RecursiveMAS connects heterogeneous agents as a collaboration loop through the lightweight RecursiveLink module, enabling in-distribution latent thought generation and cross-agent latent state transfer. To optimize our framework, we develop an inner-outer loop learning algorithm for iterative whole-system co-optimization through shared gradient-based credit assignment across recursion rounds. Theoretical analyses of runtime complexity and learning dynamics establish that RecursiveMAS is more efficient than standard text-based MAS and maintains stable gradients during recursive training. Empirically, we instantiate RecursiveMAS under 4 representative agent collaboration patterns and evaluate across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. In comparison with advanced single/multi-agent and recursive computation baselines, RecursiveMAS consistently delivers an average accuracy improvement of 8.3%, together with 1.2×-2.4× end-to-end inference speedup, and 34.6%-75.6% token usage reduction. Code and data are available at https://recursivemas.github.io.
Chinese Translation
递归或循环语言模型最近作为一种新的扩展轴出现,通过对潜在状态进行迭代精炼相同模型计算,以加深推理。我们将这种扩展原则从单一模型扩展到多智能体系统,并提出问题:智能体之间的协作是否可以通过递归进行扩展?为此,我们引入了递归多智能体框架(RecursiveMAS),将整个系统视为统一的潜在空间递归计算。RecursiveMAS通过轻量级的RecursiveLink模块将异构智能体连接为协作循环,从而实现分布内潜在思想的生成和跨智能体潜在状态的转移。为了优化我们的框架,我们开发了一种内外循环学习算法,通过在递归轮次之间共享基于梯度的信用分配,实现整个系统的迭代共同优化。对运行时复杂性和学习动态的理论分析表明,RecursiveMAS比标准的基于文本的多智能体系统更高效,并且在递归训练过程中保持稳定的梯度。在实证方面,我们在4种代表性的智能体协作模式下实例化了RecursiveMAS,并在涵盖数学、科学、医学、搜索和代码生成的9个基准上进行了评估。与先进的单智能体/多智能体和递归计算基线相比,RecursiveMAS始终提供了8.3%的平均准确率提升,以及1.2×-2.4×的端到端推理加速和34.6%-75.6%的令牌使用减少。代码和数据可在 https://recursivemas.github.io 获取。
计算语言学 (Computation and Language)
59
cs.CL / 1 / 2604.24770

Elderly-Contextual Data Augmentation via Speech Synthesis for Elderly ASR

通过语音合成进行老年人情境数据增强的老年人自动语音识别
Lee, Minsik, Hong, Seoi, Lee, Chongmin, Choi, Sieun, Kim, Jian, Han, Jua, Kim, Jihie
Abstract
Despite recent progress in automatic speech recognition (ASR), elderly ASR (EASR) remains challenging due to limited training data and the distinct acoustic and linguistic characteristics of elderly speech. In this work, we address data scarcity in EASR through a data augmentation pipeline that combines large language model (LLM)-based transcript paraphrasing with text-to-speech (TTS) synthesis. Given an elderly speech dataset, the LLM first generates elderly-contextual paraphrases of the original transcripts, and the TTS model then synthesizes corresponding speech using elderly reference speakers. The resulting synthetic audio-text pairs are merged with the original data to fine-tune Whisper without architectural modification. We further analyze the effects of augmentation ratio and reference-speaker composition in low-resource EASR. Experiments on English and Korean elderly speech datasets from speakers aged 70 and above show that the proposed method consistently improves performance over conventional augmentation baselines, achieving up to a 58.2% reduction in word error rate (WER) compared with the Whisper baseline.
Chinese Translation
尽管自动语音识别(ASR)最近取得了进展,但由于训练数据有限以及老年人语音的独特声学和语言特征,老年人自动语音识别(EASR)仍然面临挑战。在本研究中,我们通过一个数据增强管道解决了EASR中的数据稀缺问题,该管道结合了基于大型语言模型(LLM)的转录改写和文本到语音(TTS)合成。给定一个老年人语音数据集,LLM首先生成原始转录的老年人情境改写,然后TTS模型使用老年参考说话者合成相应的语音。生成的合成音频-文本对与原始数据合并,以在不修改架构的情况下微调Whisper。我们进一步分析了在低资源EASR中增强比例和参考说话者组成的影响。在70岁及以上的说话者的英语和韩语老年人语音数据集上的实验表明,所提出的方法在性能上始终优于传统的增强基线,与Whisper基线相比,词错误率(WER)最多降低了58.2%。
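The headline result above is a relative reduction in word error rate. As a reminder of how that metric is usually computed, here is a minimal sketch: WER as word-level edit distance divided by reference length, plus the relative-reduction arithmetic. The function names `wer` and `relative_reduction` and the example sentences are illustrative assumptions, not taken from the paper, and the paper's own text normalization may differ.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction; 0.5 means the error rate was halved."""
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative sentences only (not data from the paper).
print(wer("please call my daughter", "please call my doctor"))
print(relative_reduction(0.20, 0.10))
```

A relative reduction is measured against the baseline's own error rate, so the same percentage can correspond to very different absolute WER gaps depending on how high the baseline is.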
cs.CL / 2 / 2604.24927

Large Language Models Explore by Latent Distilling

大型语言模型通过潜在蒸馏进行探索
Zeng, Yuanhao, Lu, Ao, Li, Lufei, Zhang, Zheng, Li, Yexin, Ren, Kan
Abstract
Generating diverse responses is crucial for test-time scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface-level lexical variation, limiting semantic exploration. In this paper, we propose Exploratory Sampling (ESamp), a decoding approach that explicitly encourages semantic diversity during generation. ESamp is motivated by the well-known observation that neural networks tend to make lower-error predictions on inputs similar to those encountered before, and incur higher prediction error on novel ones. Building on this property, we train a lightweight Distiller at test time to predict deep-layer hidden representations of the LLM from its shallow-layer representations to model the LLM's depth-wise representation transitions. During decoding, the Distiller continuously adapts to the mappings induced by the current generation context. ESamp uses the prediction error as a novelty signal to reweight candidate token extensions conditioned on the current prefix, thereby biasing decoding toward less-explored semantic patterns. ESamp is implemented with an asynchronous training-inference pipeline, with less than 5% worst-case overhead (1.2% in the optimized release). Empirical results show that ESamp significantly boosts the Pass@k efficiency of reasoning models, showing superior or comparable performance to strong stochastic and heuristic baselines. Notably, ESamp achieves robust generalization across mathematics, science, and code generation benchmarks and breaks the trade-off between diversity and coherence in creative writing. Our code has been released at: https://github.com/LinesHogan/tLLM.
Chinese Translation
生成多样化的响应对于大型语言模型(LLMs)在测试时的扩展至关重要,然而标准的随机采样主要产生表面层次的词汇变化,限制了语义探索。本文提出了一种解码方法——探索性采样(Exploratory Sampling, ESamp),该方法在生成过程中明确鼓励语义多样性。ESamp 的动机源于一个众所周知的观察:神经网络在处理与之前遇到的输入相似的情况下,往往会做出较低误差的预测,而在面对新颖输入时则会产生更高的预测误差。基于这一特性,我们在测试时训练一个轻量级的蒸馏器(Distiller),以从 LLM 的浅层表示预测其深层隐藏表示,从而建模 LLM 的深度表示转变。在解码过程中,蒸馏器会持续适应当前生成上下文所诱导的映射。ESamp 使用预测误差作为新颖性信号,根据当前前缀重新加权候选标记扩展,从而使解码偏向于较少探索的语义模式。ESamp 采用异步训练-推理管道实现,最坏情况下的开销低于 5%(在优化版本中为 1.2%)。实证结果表明,ESamp 显著提升了推理模型的 Pass@k 效率,表现出优于或可与强随机和启发式基线相媲美的性能。值得注意的是,ESamp 在数学、科学和代码生成基准测试中实现了稳健的泛化,并打破了创意写作中多样性与连贯性之间的权衡。我们的代码已发布于:https://github.com/LinesHogan/tLLM。
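The reweighting step described above can be pictured with a small sketch. The abstract does not give the exact rule, so everything here is an assumption made for illustration: the additive form `logit + beta * novelty`, the `beta` parameter, the mean-squared-error novelty score, and the function names. It is not the released tLLM implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_error(predicted, actual):
    """Assumed novelty score: mean squared error between the Distiller's
    predicted deep-layer vector and the vector the LLM actually produced."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def reweight(logits, novelty, beta=1.0):
    """Hypothetical rule: boost each candidate's logit by beta * novelty,
    then renormalize. beta = 0 recovers ordinary sampling probabilities."""
    adjusted = [l + beta * n for l, n in zip(logits, novelty)]
    return softmax(adjusted)


# Two equally likely candidates; the second leads somewhere the Distiller
# predicts poorly, so it is treated as less explored and gets more mass.
logits = [2.0, 2.0, 0.0]
novelty = [0.0, 1.0, 0.0]
print(reweight(logits, novelty, beta=1.0))
```

Setting `beta` to zero leaves the base distribution untouched, which makes it easy to see how strongly the novelty term bends decoding away from the model's default choices.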
cs.CL / 3 / 2604.24929

GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation

GAIA-v2-LILT:超越翻译的智能体基准多语言适应
Kim, Yunsu, Uhlig, Kaden, Wuebker, Joern
Abstract
Agent benchmarks remain largely English-centric, while their multilingual versions are often built with machine translation (MT) and limited post-editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query-answer misalignment or culturally off-target context. We propose a refined workflow for adapting English benchmarks into multiple languages with explicit functional alignment, cultural alignment, and difficulty calibration using both automated checks and human review. Using this workflow, we introduce GAIA-v2-LILT, a re-audited multilingual extension of GAIA covering five non-English languages. In experiments, our workflow improves agent success rates by up to 32.7% over minimally translated versions, bringing the closest audited setting to within 3.1% of English performance while substantial gaps remain in many other cases. This indicates that a substantial share of the multilingual performance gap is benchmark-induced measurement error, motivating task-level alignment when adapting English benchmarks across languages. The data is available as part of the MAPS package at https://huggingface.co/datasets/Fujitsu-FRE/MAPS/viewer/GAIA-v2-LILT. We also release the code used in our experiments at https://github.com/lilt/gaia-v2-lilt.
Chinese Translation
智能体基准仍然主要以英语为中心,而它们的多语言版本通常是通过机器翻译(MT)和有限的后期编辑构建的。我们认为,对于智能体任务,这种最小化的工作流程很容易通过查询-回答不一致或文化不匹配的上下文来破坏基准的有效性。我们提出了一种改进的工作流程,用于将英语基准适应到多种语言,明确进行功能对齐、文化对齐和难度校准,使用自动检查和人工审查相结合的方法。通过这一工作流程,我们引入了GAIA-v2-LILT,这是GAIA的重新审核的多语言扩展,涵盖五种非英语语言。在实验中,我们的工作流程使智能体的成功率比最小翻译版本提高了多达32.7%,使表现最接近的经审核设置与英语表现的差距缩小到3.1%以内,尽管在许多其他情况下仍存在显著差距。这表明,多语言表现差距中有相当一部分是基准引起的测量误差,这促使在跨语言适应英语基准时进行任务级对齐。数据作为MAPS包的一部分可在https://huggingface.co/datasets/Fujitsu-FRE/MAPS/viewer/GAIA-v2-LILT获取。我们还在https://github.com/lilt/gaia-v2-lilt发布了实验中使用的代码。
cs.CL / 4 / 2604.24940

ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models

ADE:自适应字典嵌入——将多锚表示扩展到大型语言模型
Demirci, Orhan, Aptourachman, Sezer
Abstract
Word embeddings are fundamental to natural language processing, yet traditional approaches represent each word with a single vector, creating representational bottlenecks for polysemous words and limiting semantic expressiveness. While multi-anchor representations have shown promise by representing words as combinations of multiple vectors, they have been limited to small-scale models due to computational inefficiency and lack of integration with modern transformer architectures. We introduce Adaptive Dictionary Embeddings (ADE), a framework that successfully scales multi-anchor word representations to large language models. ADE makes three key contributions: (1) Vocabulary Projection (VP), which transforms the costly two-stage anchor lookup into a single efficient matrix operation; (2) Grouped Positional Encoding (GPE), a novel positional encoding scheme where anchors of the same word share positional information, preserving semantic coherence while enabling anchor-level variation; and (3) context-aware anchor reweighting, which leverages self-attention to dynamically compose anchor contributions based on sequence context. We integrate these components into the Segment-Aware Transformer (SAT), which provides context-aware reweighting of anchor contributions at inference time. We evaluate ADE on AG News and DBpedia-14 text classification benchmarks. With 98.7% fewer trainable parameters than DeBERTa-v3-base, ADE surpasses DeBERTa on DBpedia-14 (98.06% vs. 97.80%) and approaches it on AG News (90.64% vs. 94.50%), while compressing the embedding layer over 40x -- demonstrating that multi-anchor representations are a practical and parameter-efficient alternative to single-vector embeddings in modern transformer architectures.
Chinese Translation
词嵌入是自然语言处理的基础,然而传统方法用单一向量表示每个词,这为多义词的表示造成了瓶颈,并限制了语义的表达能力。尽管多锚表示通过将词表示为多个向量的组合显示出潜力,但由于计算效率低下和与现代变换器架构缺乏整合,它们一直局限于小规模模型。我们提出了自适应字典嵌入(Adaptive Dictionary Embeddings,ADE),这是一个成功将多锚词表示扩展到大型语言模型的框架。ADE的三个关键贡献为:(1) 词汇投影(Vocabulary Projection,VP),将代价高昂的两阶段锚查找转化为单一高效的矩阵操作;(2) 分组位置编码(Grouped Positional Encoding,GPE),一种新颖的位置编码方案,其中同一词的锚共享位置信息,保持语义一致性的同时实现锚级别的变化;(3) 上下文感知的锚重标定,利用自注意力根据序列上下文动态组合锚的贡献。我们将这些组件整合到段落感知变换器(Segment-Aware Transformer,SAT)中,该变换器在推理时提供上下文感知的锚贡献重标定。我们在AG News和DBpedia-14文本分类基准上评估了ADE。与DeBERTa-v3-base相比,ADE的可训练参数减少了98.7%,在DBpedia-14上超越了DeBERTa(98.06%对97.80%),在AG News上接近其表现(90.64%对94.50%),同时将嵌入层压缩超过40倍——证明了多锚表示在现代变换器架构中是单向量嵌入的一个实用且参数高效的替代方案。
cs.CL / 5 / 2604.24942

Independent-Component-Based Encoding Models of Brain Activity During Story Comprehension

基于独立成分的故事理解过程中脑活动的编码模型
Hari, Kamya, Binhuraib, Taha, Li, Jin, Shain, Cory, Ivanova, Anna A.
Abstract
Encoding models provide a powerful framework for linking continuous stimulus features to neural activity; however, traditional voxelwise approaches are limited by measurement noise, inter-subject variability, and redundancy arising from spatially correlated voxels encoding overlapping neural signals. Here, we propose an independent component (IC)-based encoding framework that dissociates stimulus-driven and noise-driven signals in fMRI data. We decompose continuous fMRI data from naturalistic story listening into ICs using one subset of the data, and train encoding models on independent data to predict IC time series from large language model representations of linguistic input. Across subjects, a subset of ICs exhibited consistently high predictivity. These ICs were spatially and temporally consistent across subjects and included cognitive networks known to respond during story listening (auditory and language). Auditory component time series were strongly correlated with acoustic stimulus features, highlighting the interpretability of identified component time series. Components identified as noise or motion-related artifacts by ICA-AROMA showed uniformly poor predictive performance, confirming that highly predicted components reflect genuine stimulus-related neural signals rather than confounds. Overall, IC-based encoding models enable analyses at the level of functional networks, accommodating the variability in network locations across individuals and providing interpretable results that are easy to compare across subjects.
Chinese Translation
编码模型提供了一种强有力的框架,将连续的刺激特征与神经活动联系起来;然而,传统的体素级方法受到测量噪声、个体间变异性以及由于空间相关体素编码重叠神经信号而产生的冗余的限制。在此,我们提出了一种基于独立成分(Independent Component, IC)的编码框架,该框架在功能性磁共振成像(fMRI)数据中区分刺激驱动和噪声驱动的信号。我们使用数据的一个子集将自然故事听取的连续fMRI数据分解为独立成分,并在独立数据上训练编码模型,以预测来自大型语言模型表示的语言输入的IC时间序列。在受试者之间,一部分IC表现出一致的高预测能力。这些IC在空间和时间上在不同受试者中保持一致,并包括在故事听取过程中已知会响应的认知网络(听觉和语言网络)。听觉成分时间序列与声学刺激特征高度相关,突显了所识别成分时间序列的可解释性。被ICA-AROMA识别为噪声或运动相关伪影的成分表现出一致的低预测性能,确认高预测的成分反映了真实的刺激相关神经信号,而非混淆因素。总体而言,基于IC的编码模型使得在功能网络层面进行分析成为可能,适应个体间网络位置的变异性,并提供易于比较的可解释结果。
cs.CL / 6 / 2604.24955

BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

BenchGuard:谁来守护基准测试?大规模语言模型代理基准的自动审计
Tu, Xinming, Wang, Tianze, Yingzhou, Lu, Huang, Kexin, Qu, Yuanhao, Mostafavi, Sara
Abstract
As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the first automated auditing framework for task-oriented, execution-based agent benchmarks. BenchGuard cross-verifies all benchmark artifacts via structured LLM protocols, optionally incorporating agent solutions or execution traces as additional diagnostic evidence. Deployed on two prominent scientific benchmarks, BenchGuard identified 12 author-confirmed issues in ScienceAgentBench - including fatal errors rendering tasks unsolvable - and exactly matched 83.3% of expert-identified issues on the BIXBench Verified-50 subset, catching defects that prior human review missed entirely. A full audit of 50 complex bioinformatics tasks costs under USD 15, making automated benchmark auditing a practical and valuable complement to human review. These findings point toward AI-assisted benchmark development, where frontier models serve not only as subjects of evaluation but as active participants in validating the evaluation infrastructure itself.
Chinese Translation
随着基准测试的复杂性增加,许多明显的代理失败并非代理本身的失败,而是基准测试本身的失败:破损的规范、隐含的假设以及惩罚有效替代方法的僵化评估脚本。我们建议利用前沿的大规模语言模型(LLM)作为评估基础设施的系统审计员,并通过BenchGuard实现这一愿景,BenchGuard是首个针对任务导向、基于执行的代理基准的自动审计框架。BenchGuard通过结构化的LLM协议交叉验证所有基准测试文档,选择性地将代理解决方案或执行轨迹作为额外的诊断证据。BenchGuard在两个著名的科学基准上部署,识别出ScienceAgentBench中的12个作者确认的问题,包括使任务无法解决的致命错误,并准确匹配BIXBench Verified-50子集中的83.3%专家识别的问题,捕捉到先前人工审查完全遗漏的缺陷。对50个复杂生物信息学任务的全面审计成本低于15美元,使得自动基准审计成为人类审查的一个实用且有价值的补充。这些发现指向了AI辅助的基准开发,其中前沿模型不仅作为评估的对象,还作为验证评估基础设施的积极参与者。
cs.CL / 7 / 2604.24972

Dynamic Decision Learning: Test-Time Evolution for Abnormality Grounding in Rare Diseases

动态决策学习:稀有疾病异常定位的测试时演化
Li, Jun, Liu, Mingxuan, Pan, Jiazhen, Liu, Che, Bai, Wenjia, Bercea, Cosmin I., Schnabel, Julia A.
Abstract
Clinical abnormality grounding for rare diseases is often hindered by data scarcity, making supervised fine-tuning impractical and single-pass inference highly unstable. We propose Dynamic Decision Learning (DDL), a framework that enables frozen large vision-language models (LVLMs) to refine their decisions across both language and visual spaces by optimizing instructions and consolidating predictions under visual perturbations. This process improves localization quality and produces a consensus-based reliability score that quantifies model confidence. Results on brain imaging benchmarks, including a rare-disease dataset with 281 pathology types across models ranging from 3B to 72B parameters, show that DDL improves mAP@75 by up to 105% on rare-disease cases and outperforms adaptation baselines and supervised fine-tuning. Furthermore, DDL demonstrates stronger calibration between reliability scores and localization accuracy under severe distribution shifts and increasing task difficulty. Code is available at: https://lijunrio.github.io/DDL/
Chinese Translation
稀有疾病的临床异常定位常常受到数据稀缺的限制,使得监督微调变得不切实际,而单次推理则高度不稳定。我们提出了动态决策学习(Dynamic Decision Learning, DDL)框架,该框架使得冻结的大型视觉-语言模型(Large Vision-Language Models, LVLMs)能够通过优化指令和在视觉扰动下整合预测,在语言和视觉空间中不断优化其决策。该过程提高了定位质量,并生成了一个基于共识的可靠性评分,用于量化模型信心。在脑成像基准测试中,包括一个包含281种病理类型的稀有疾病数据集,模型参数范围从3B到72B,结果表明DDL在稀有疾病案例中将mAP@75提高了多达105%,并且超越了适应基线和监督微调。此外,DDL在严重分布变化和任务难度增加的情况下,显示出可靠性评分与定位准确性之间更强的校准性。代码可在以下链接获取:https://lijunrio.github.io/DDL/
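One way to picture "consolidating predictions under visual perturbations" into a consensus-based reliability score is sketched below. The abstract does not specify the aggregation rule, so the coordinate-wise median box and the mean-IoU reliability score are assumptions chosen for illustration, as are the function names and example boxes. The only standard fact used is that mAP@75 counts a localization as correct when its IoU with the ground truth is at least 0.75.

```python
from statistics import median


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def consolidate(boxes):
    """Assumed aggregation: coordinate-wise median box, with reliability
    defined as the mean IoU between each prediction and that consensus."""
    consensus = tuple(median(b[k] for b in boxes) for k in range(4))
    reliability = sum(iou(b, consensus) for b in boxes) / len(boxes)
    return consensus, reliability


# Hypothetical predictions for one image under three perturbed views.
stable = [(10, 10, 50, 50), (11, 10, 51, 50), (10, 11, 50, 51)]
unstable = [(10, 10, 50, 50), (40, 40, 80, 80), (0, 0, 20, 20)]
print(consolidate(stable))
print(consolidate(unstable))
```

Under this sketch, predictions that barely move when the image is perturbed yield a reliability close to 1, while scattered predictions yield a low score, which is the kind of behavior a confidence estimate for a frozen model needs.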
cs.CL / 8 / 2604.24977

A Survey on LLM-based Conversational User Simulation

基于大语言模型的对话用户模拟综述
Ni, Bo, Wang, Leyao, Wang, Yu, Kveton, Branislav, Dernoncourt, Franck, Xia, Yu, Chen, Hongjie, Leura, Reuben, Basu, Samyadeep, Mukherjee, Subhojyoti, Mathur, Puneet, Ahmed, Nesreen, Wu, Junda, Li, Li, Zhang, Huixin, Zhang, Ruiyi, Yu, Tong, Kim, Sungchul, Gu, Jiuxiang, Tu, Zhengzhong, Siu, Alexa, Wang, Zichao, Yoon, David Seunghyun, Lipka, Nedim, Park, Namyong, Lin, Zihao, Bui, Trung, Zhao, Yue, Derr, Tyler, Rossi, Ryan A.
Abstract
User simulation has long played a vital role in computer science due to its potential to support a wide range of applications. Language, as the primary medium of human communication, forms the foundation of social interaction and behavior. Consequently, simulating conversational behavior has become a key area of study. Recent advancements in large language models (LLMs) have significantly catalyzed progress in this domain by enabling high-fidelity generation of synthetic user conversation. In this paper, we survey recent advancements in LLM-based conversational user simulation. We introduce a novel taxonomy covering user granularity and simulation objectives. Additionally, we systematically analyze core techniques and evaluation methodologies. We aim to keep the research community informed of the latest advancements in conversational user simulation and to further facilitate future research by identifying open challenges and organizing existing work under a unified framework.
Chinese Translation
用户模拟在计算机科学中长期扮演着重要角色,因为它能够支持广泛的应用。语言作为人类交流的主要媒介,构成了社会互动和行为的基础。因此,模拟对话行为已成为一个关键的研究领域。近期,大语言模型(LLMs)的进展显著推动了这一领域的发展,使得合成用户对话的高保真生成成为可能。本文对基于LLM的对话用户模拟的最新进展进行了综述。我们引入了一种新的分类法,涵盖用户粒度和模拟目标。此外,我们系统地分析了核心技术和评估方法。我们的目标是让研究社区了解对话用户模拟的最新进展,并通过识别开放挑战和将现有工作组织在统一框架下,进一步促进未来的研究。
cs.CL / 9 / 2604.24978

Don't Stop Early: Scalable Enterprise Deep Research with Controlled Information Flow and Evidence-Aware Termination

不要过早停止:具有受控信息流和证据感知终止的可扩展企业深度研究
Choubey, Prafulla Kumar, Huang, Kung-Hsiang, Venkit, Pranav Narayanan, Zhang, Jiaxin, Vats, Vaibhav, Li, Yu, Peng, Xiangyu, Wu, Chien-Sheng
Abstract
Enterprise deep research often fails to produce decision-ready reports due to uneven information coverage, context explosion, and premature stopping. We propose a scalable Enterprise Deep Research (EDR) architecture to address these failures. Our system (i) decomposes requests into coverage-driven objectives via outline generation with reflection, (ii) localizes context with dependency-guided execution and explicit information sharing, and (iii) enforces evidence-based completion criteria so agents iteratively collect information until sufficiency conditions are met. We evaluate on an internal sales enablement task and the public DeepResearch Bench benchmark, where our proposed system design achieves the strongest overall performance compared with competitive deep-research baselines. The results show that dependency-controlled context and explicit evidence sufficiency criteria reduce premature stopping and improve the consistency and depth of enterprise research outputs.
Chinese Translation
企业深度研究常常因信息覆盖不均、上下文爆炸和过早停止而未能产生可供决策的报告。我们提出了一种可扩展的企业深度研究(Enterprise Deep Research, EDR)架构来解决这些问题。我们的系统(i)通过反思生成大纲将请求分解为以覆盖为驱动的目标,(ii)通过依赖引导执行和显式信息共享来局部化上下文,以及(iii)强制实施基于证据的完成标准,使代理能够迭代地收集信息,直到满足充分性条件。我们在内部销售赋能任务和公共的DeepResearch Bench基准上进行了评估,我们提出的系统设计在与竞争的深度研究基线相比时,取得了最强的整体性能。结果表明,依赖控制的上下文和显式证据充分性标准减少了过早停止,提高了企业研究输出的一致性和深度。
cs.CL / 10 / 2604.25011

Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

强化学习为何能够泛化?对大型语言模型后训练的特征级机制研究
Shi, Dan, Han, Zhuowen, Ostermann, Simon, Jin, Renren, van Genabith, Josef, Xiong, Deyi
Abstract
Reinforcement learning (RL)-based post-training often improves the reasoning performance of large language models (LLMs) beyond the training domain, while supervised fine-tuning (SFT) frequently leads to general capabilities forgetting. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature-level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL- and SFT-tuned models are trained from the same base model on identical data. Leveraging our interpretability framework, we align internal activations across models within a shared feature space and analyze how features evolve during post-training. We find that SFT rapidly introduces many highly specialized features that stabilize early in training, whereas RL induces more restrained and continually evolving feature changes that largely preserve base models' representations. Focusing on samples where RL succeeds but the base model fails, we identify a compact, task-agnostic set of features that directly mediate generalization across diverse tasks. Feature-level interventions confirm their causal role: disabling these features significantly degrades RL models' generalization performance, while amplifying them improves base models' performance. The code is available at https://github.com/danshi777/RL-generalization.
Chinese Translation
基于强化学习(RL)的后训练通常能够提升大型语言模型(LLMs)在训练领域之外的推理性能,而监督微调(SFT)则常常导致一般能力的遗忘。然而,这种对比背后的机制仍不清楚。为了解决这一问题,我们提出了一种特征级机制分析方法,通过受控实验设置探讨RL的泛化能力,其中RL和SFT调优的模型均从相同的基础模型在相同数据上进行训练。利用我们的可解释性框架,我们在共享特征空间中对齐模型的内部激活,并分析特征在后训练过程中的演变。我们发现,SFT迅速引入了许多高度专业化的特征,这些特征在训练初期迅速稳定,而RL则引发了更为克制且持续演变的特征变化,这些变化在很大程度上保留了基础模型的表征。我们关注于RL成功但基础模型失败的样本,识别出一组紧凑的、与任务无关的特征,这些特征直接介导了在不同任务中的泛化。特征级干预证实了它们的因果作用:禁用这些特征显著降低了RL模型的泛化性能,而增强这些特征则提升了基础模型的性能。代码可在 https://github.com/danshi777/RL-generalization 获取。
cs.CL / 11 / 2604.25031

Faithful Autoformalization via Roundtrip Verification and Repair

通过往返验证和修复实现忠实的自动形式化
Amrollahi, Daneshvar, Lopez, Jerry, Barrett, Clark
Abstract
When an LLM formalizes natural language, how do we know the output is faithful? We propose a roundtrip verification approach which does not require ground-truth annotations: formalize a statement, translate the result back to natural language, re-formalize, and use a formal tool to check logical equivalence. When the two formalizations agree, this provides evidence of a faithful formalization. When they disagree, a diagnosis step identifies which translation stage failed, and a targeted repair operator attempts to correct that stage. We evaluate our approach on 150 traffic rules using Claude Opus 4.6 and GPT-5.2. Diagnosis-guided repair raises formal equivalence from 45-61% to 83-85% for both models, outperforming a random-repair baseline. An independent NLI analysis confirms that formal equivalence is correlated with less semantic drift.
Chinese Translation
当大型语言模型(LLM)将自然语言形式化时,我们如何知道输出是忠实的?我们提出了一种往返验证的方法,该方法不需要真实标注:首先形式化一个陈述,然后将结果翻译回自然语言,再次进行形式化,并使用形式化工具检查逻辑等价性。当两个形式化结果一致时,这提供了忠实形式化的证据。当它们不一致时,诊断步骤将识别出哪个翻译阶段失败,并且一个有针对性的修复操作尝试纠正该阶段。我们在150条交通规则上评估了我们的方法,使用Claude Opus 4.6和GPT-5.2。通过诊断引导的修复将两个模型的形式等价性从45-61%提高到83-85%,优于随机修复基线。独立的自然语言推理(NLI)分析确认形式等价性与较少的语义漂移相关。
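The equivalence check at the end of the roundtrip can be illustrated with a toy stand-in. The abstract only says "a formal tool" is used; the sketch below swaps in brute-force truth-table enumeration over propositional atoms, which is an assumption that only works for small propositional formulas. The traffic rule, the atom names, and the `equivalent` helper are all hypothetical.

```python
from itertools import product


def equivalent(f, g, atoms):
    """Compare two formulas on every truth assignment to `atoms`.

    Returns (True, None) if they always agree, otherwise
    (False, assignment) with the first assignment on which they differ.
    """
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if f(env) != g(env):
            return False, env
    return True, None


atoms = ["red_light", "stop"]

# First formalization of "if the light is red, the vehicle must stop".
first = lambda e: (not e["red_light"]) or e["stop"]

# Re-formalization after a faithful roundtrip: different syntax, same meaning.
second = lambda e: not (e["red_light"] and not e["stop"])

# Re-formalization after a drifted roundtrip: the converse rule.
drifted = lambda e: (not e["stop"]) or e["red_light"]

print(equivalent(first, second, atoms))
print(equivalent(first, drifted, atoms))
```

The counterexample returned on disagreement is what makes a diagnosis step possible: it pins down a concrete situation on which the two formalizations differ, which can then be checked against the original sentence and its back-translation.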
cs.CL / 12 / 2604.25039

Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

双轨链式思维:面向预算的小型语言模型逐步指导
Chatterjee, Sagnik, Patil, Atharva, Ramesh, Sricharan
Abstract
Large Language Models (LLMs) solve many reasoning tasks via chain-of-thought (CoT) prompting, but smaller models (about 7 to 8B parameters) still struggle with multi-step reasoning under tight compute and token budgets. Existing test-time reasoning methods such as self-consistency (sampling multiple rationales and voting), Tree-of-Thoughts (search over intermediate thoughts), and critique-revise loops improve performance, but often at high token cost and without fine-grained step-level control. This project aims to address that gap: can Small Language Models (SLMs) reason reliably using the same or fewer tokens? This question is both scientific and practical. Scientifically, it probes whether process supervision and simple test-time controls (such as token budgets and rejection of redundant steps) can substitute for model scale or large sampling counts. Practically, many deployments (on-device, low-latency, or cost-constrained settings) cannot afford huge models or dozens of sampled rationales per query. A method that improves SLM reasoning at fixed cost would therefore be directly useful.
Chinese Translation
大型语言模型(LLMs)通过链式思维(CoT)提示解决许多推理任务,但较小的模型(约70到80亿参数)在紧张的计算和令牌预算下仍然难以进行多步骤推理。现有的测试时推理方法,如自一致性(采样多个理由并投票)、思维树(对中间思维进行搜索)和批评修订循环,虽然提高了性能,但通常代价高昂且缺乏细粒度的步骤控制。本项目旨在填补这一空白:小型语言模型(SLMs)能否在使用相同或更少的令牌的情况下可靠地进行推理?这个问题既具有科学性也具有实用性。从科学角度来看,它探讨了过程监督和简单的测试时控制(如令牌预算和拒绝冗余步骤)是否可以替代模型规模或大量采样次数。从实用角度来看,许多部署(如设备端、低延迟或成本受限的环境)无法承受庞大的模型或每个查询数十个采样理由。因此,一种在固定成本下提高SLM推理的方法将直接具有实用价值。
cs.CL / 13 / 2604.25053

Analyzing LLM Reasoning to Uncover Mental Health Stigma

分析大型语言模型的推理以揭示心理健康污名
Sankar, Sreehari, Nafar, Aliakbar, Barman, Mona, Heitz, Hannah K., Kumar, Ashwin, Tohidi, Pouria, Li, Dailun, Hussain, Danish, DuBois, Russell, Hasheminia, Hamed, Majzoubi, Farshad
Abstract
While large language models (LLMs) are increasingly being explored for mental health applications, recent studies reveal that they can exhibit stigma toward individuals with psychological conditions. Existing evaluations of this stigma primarily rely on multiple-choice questions (MCQs), which fail to capture the biases embedded within the models' underlying logic. In this paper, we analyze the intermediate reasoning steps of LLMs to uncover hidden stigmatizing language and the internal rationales driving it. We leverage clinical expertise to categorize common patterns of stigmatizing language directed at individuals with psychological conditions and use this framework to identify and tag problematic statements in LLM reasoning. Furthermore, we rate the severity of these statements, distinguishing between overt prejudice and more subtle, less immediately harmful biases. To broaden the reasoning domain and capture a wider array of patterns, we also extend an existing mental health stigma benchmark by incorporating additional psychological conditions. Our findings demonstrate that evaluating model reasoning not only exposes substantially more stigma than traditional MCQ-based methods but also helps identify the flaws in the LLMs' logic and their understanding of mental health conditions.
Chinese Translation
尽管大型语言模型(LLMs)在心理健康应用中越来越受到关注,但最近的研究表明,它们可能对心理疾病患者表现出污名化。现有的污名评估主要依赖于多项选择题(MCQs),这无法捕捉模型底层逻辑中潜在的偏见。本文分析了LLMs的中间推理步骤,以揭示隐藏的污名化语言及其背后的内在理由。我们利用临床专业知识对针对心理疾病患者的常见污名化语言模式进行分类,并使用这一框架识别和标记LLM推理中的问题陈述。此外,我们对这些陈述的严重性进行评分,区分明显的偏见和更微妙、危害较小的偏见。为了拓宽推理领域并捕捉更广泛的模式,我们还通过纳入额外的心理疾病扩展了现有的心理健康污名基准。我们的研究结果表明,评估模型推理不仅揭示了比传统的基于MCQ的方法更多的污名化,而且有助于识别LLMs逻辑中的缺陷及其对心理健康状况的理解。
cs.CL / 14 / 2604.25096

The Dynamics of Delusion: Modeling Bidirectional False Belief Amplification in Human-Chatbot Dialogue

妄想的动态:建模人类与聊天机器人对话中的双向虚假信念增强
Mehta, Ashish, Moore, Jared, Anthis, Jacy Reese, Agnew, William, Lin, Eric, Yin, Peggy, Ong, Desmond C., Haber, Nick, Dweck, Carol
Abstract
There is growing concern that AI chatbots might fuel delusional beliefs in users. Some have suggested that humans and chatbots mutually reinforce false beliefs over time, but quantitative evidence is lacking. Using a unique dataset of chat logs from individuals who exhibited delusional thinking, we developed a latent state model that captures accumulating and decaying influences between humans and chatbots. We find that a bidirectional influence model substantially outperforms a unidirectional alternative where humans are the primary driver of delusion. We find that humans exert strong but short-lived influence on chatbots, whereas chatbots exert longer-lasting influence on humans. Moreover, chatbots exert strong, stable self-influence over their own future outputs that tends to perpetuate delusions over long stretches of conversation. In fact, this chatbot self-influence constituted the dominant pathway when considering accumulated influence over time. Overall, these results indicate that humans tend to drive sharp, immediate increases in delusion, whereas chatbots sustain and propagate these effects over longer timescales. Together, these findings provide the first quantitative evidence that human-chatbot interactions can form feedback loops of delusion, decomposable into distinct pathways with dissociable temporal dynamics. By doing so, they can inform the development of safer AI systems.
Chinese Translation
人们越来越担心人工智能聊天机器人可能会助长用户的妄想信念。有些人提出,人类与聊天机器人会随着时间的推移相互强化虚假信念,但缺乏定量证据。通过使用来自表现出妄想思维的个体的独特聊天记录数据集,我们开发了一种潜在状态模型,以捕捉人类与聊天机器人之间累积和衰减的影响。我们的研究发现,双向影响模型的表现显著优于单向替代模型,后者认为人类是妄想的主要驱动者。我们发现,人类对聊天机器人的影响强烈但短暂,而聊天机器人对人类的影响则持续时间更长。此外,聊天机器人对自身未来输出的强大且稳定的自我影响倾向于在长时间的对话中延续妄想。实际上,当考虑到随时间累积的影响时,这种聊天机器人的自我影响构成了主导路径。总体而言,这些结果表明,人类倾向于驱动妄想的急剧、即时增加,而聊天机器人则在更长的时间尺度上维持和传播这些影响。这些发现首次提供了定量证据,表明人类与聊天机器人的互动可以形成妄想的反馈循环,这些循环可以分解为具有不同时间动态的独特路径。通过这样做,它们可以为更安全的人工智能系统的发展提供参考。
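The contrast between "strong but short-lived" and "longer-lasting" influence can be made concrete with a toy calculation. The abstract does not give the model's equations, so the exponentially decaying influence below, the function names, and every number are assumptions chosen purely to illustrate the arithmetic. They are not estimates from the paper.

```python
def influence_trace(weight, decay, steps):
    """Influence of a single unit impulse at turn 0 on each later turn,
    assuming it shrinks by a constant factor `decay` per turn."""
    return [weight * decay ** t for t in range(steps)]


def accumulated(weight, decay, steps):
    """Total influence of that impulse summed over `steps` turns."""
    return sum(influence_trace(weight, decay, steps))


# Made-up parameters: a strong, fast-decaying pathway and a
# weaker, slowly decaying one.
short_lived = dict(weight=1.0, decay=0.2)
long_lasting = dict(weight=0.4, decay=0.9)

print(influence_trace(steps=5, **short_lived))
print(influence_trace(steps=5, **long_lasting))
print(accumulated(steps=50, **short_lived), accumulated(steps=50, **long_lasting))
```

With these illustrative values the fast-decaying pathway has the larger immediate effect, yet the slow-decaying one contributes more in total over a long conversation, since the sum of a geometric series approaches weight / (1 - decay). That is one way a pathway with modest per-turn strength can still dominate accumulated influence.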
cs.CL / 15 / 2604.25120

Diagnosis, Bad Planning & Reasoning. Treatment, SCOPE -- Planning for Hybrid Querying over Clinical Trial Data

诊断、糟糕的规划与推理。治疗,SCOPE——针对临床试验数据的混合查询规划
Chowdhury, Suparno Roy, Choudhury, Manan Roy, Anvekar, Tejas, Khan, Muhammad Ali, Khakwani, Kaneez Zahra Rubab, Sonbol, Mohamad Bassam, Riaz, Irbaz Bin, Gupta, Vivek
Abstract
We study clinical trial table reasoning, where answers are not directly stored in visible cells but must be reasoned from semantic understanding through normalization, classification, extraction, or lightweight domain reasoning. Motivated by the observation that current LLM approaches often suffer from "bad reasoning" under implicit planning assumptions, we focus on settings in which the model must recover implicit attributes such as therapy type, added agents, endpoint roles, or follow-up status from partially observed clinical-trial tables. We propose SCOPE (Structured Clinical hybrid Planning for Evidence retrieval in clinical trials), a multi-LLM planner-based framework that decomposes the task into row selection, structured planning, and execution. The planner makes the source field, reasoning rules, and output constraints explicit before answer generation, reducing ambiguity relative to direct prompting. We evaluate SCOPE on 1,500 hybrid reasoning questions over oncology clinical-trial tables against zero-shot, few-shot, chain-of-thought, TableGPT2, Blend-SQL, and EHRAgent. Results show that explicit multi-LLM planning improves accuracy for reasoning-based questions while offering a stronger accuracy-efficiency tradeoff than heavier agentic baselines. Our findings position clinical trial reasoning as a distinct table understanding problem and highlight hybrid planner-based decomposition as an effective solution.
Chinese Translation
我们研究临床试验表格推理,其中答案并不直接存储在可见单元格中,而必须借助规范化、分类、提取或轻量级领域推理,在语义理解的基础上推导得出。基于观察到当前大型语言模型(LLM)方法在隐式规划假设下常常遭遇“糟糕推理”的问题,我们专注于模型必须从部分观察的临床试验表格中恢复隐式属性的情境,例如治疗类型、附加药物、终点角色或随访状态。我们提出了SCOPE(结构化临床混合规划用于临床试验中的证据检索),这是一个基于多LLM规划者的框架,将任务分解为行选择、结构化规划和执行。该规划者在生成答案之前明确源字段、推理规则和输出约束,从而减少相对于直接提示的模糊性。我们在1,500个关于肿瘤学临床试验表格的混合推理问题上评估了SCOPE,比较了零样本、少样本、思维链、TableGPT2、Blend-SQL和EHRAgent。结果表明,明确的多LLM规划提高了基于推理问题的准确性,同时提供了比更重的代理基线更强的准确性与效率的权衡。我们的研究将临床试验推理定位为一个独特的表格理解问题,并强调基于混合规划者的分解作为一种有效的解决方案。
cs.CL / 16 / 2604.25130

LongSumEval: Question-Answering Based Evaluation and Feedback-Driven Refinement for Long Document Summarization

LongSumEval:基于问答的评估与反馈驱动的长文档摘要优化
Nguyen, Huyen, Zhang, Haoxuan, Zhang, Yang, Chen, Haihua, Ding, Junhua
Abstract
Evaluating long document summaries remains the primary bottleneck in summarization research. Existing metrics correlate weakly with human judgments and produce aggregate scores without explaining deficiencies or guiding improvement, preventing effective refinement in applications requiring verifiable accuracy. We introduce LongSumEval, a unified framework bridging evaluation and generation through structured question-answering feedback. The framework operationalizes summary quality as answerability and factual alignment of question-answer pairs, generating interpretable scores and actionable feedback that identifies coverage gaps and factual inconsistencies. This resolves the misalignment where evaluation operates independently of generation objectives. Meta-evaluation of our QA-based evaluation module across seven benchmarks demonstrates substantially stronger agreement with human judgments compared to established metrics. Structured feedback enables significant quality improvements through self-refinement without retraining. By demonstrating that evaluation feedback can serve as executable instructions for generation, this work establishes a generalizable paradigm for aligning assessment with improvement, with direct implications for controllable text generation requiring verifiable accuracy and transparent quality control. All code and datasets will be released in GitHub for reproducibility.
Chinese Translation
长文档摘要的评估仍然是摘要研究中的主要瓶颈。现有的评估指标与人类判断的相关性较弱,并且产生的综合得分无法解释缺陷或指导改进,这阻碍了在需要可验证准确性的应用中进行有效的优化。我们提出了LongSumEval,一个通过结构化问答反馈连接评估与生成的统一框架。该框架将摘要质量操作化为问答对的可回答性和事实一致性,生成可解释的得分和可操作的反馈,识别覆盖缺口和事实不一致。这解决了评估独立于生成目标进行的错位问题。我们在七个基准数据集上对基于问答的评估模块进行的元评估显示,与现有指标相比,其与人类判断的相关性显著增强。结构化反馈通过自我优化实现了显著的质量提升,而无需重新训练。通过展示评估反馈可以作为生成的可执行指令,本研究建立了一个可推广的范式,以对齐评估与改进,直接影响需要可验证准确性和透明质量控制的可控文本生成。所有代码和数据集将发布在GitHub上以便于复现。
cs.CL / 17 / 2604.25132

What Makes Good Instruction-Tuning Data? An In-Context Learning Perspective

什么是良好的指令调优数据?基于上下文学习的视角
Han, Guangzeng, Huang, Xiaolei
Abstract
Instruction-tuning datasets often contain substantial redundancy and low-quality samples, necessitating effective data selection methods. We propose an instruction data selection framework based on weighted in-context influence (wICI), which measures how effectively each candidate example reduces instruction-following difficulty for semantically related peers. Through systematic experiments, we address three key questions: what constitutes effective instruction tuning data from an in-context perspective, whether sample difficulty correlates with in-context influence, and how in-context influence translates to instruction tuning effectiveness. Experiments across multiple models and benchmarks demonstrate that our method consistently outperforms existing baselines under constrained data budgets, while empirically showing that sample difficulty negatively correlates with in-context influence.
Chinese Translation
指令调优数据集通常包含大量冗余和低质量样本,因此需要有效的数据选择方法。我们提出了一种基于加权上下文影响(weighted in-context influence, wICI)的指令数据选择框架,该框架衡量每个候选示例在多大程度上有效降低与其语义相关的同伴的指令执行难度。通过系统的实验,我们解决了三个关键问题:从上下文的角度来看,什么构成有效的指令调优数据,样本难度是否与上下文影响相关,以及上下文影响如何转化为指令调优的有效性。在多个模型和基准测试中的实验表明,我们的方法在受限数据预算下始终优于现有基线,同时实证显示样本难度与上下文影响呈负相关。
cs.CL / 18 / 2604.25133

Korean aegyo speech shows systematic F1 increase to signal childlike qualities

韩语可爱语调显示出系统性的F1增加以传达儿童特质
Kim, Ji-eun, Dellwo, Volker
Abstract
Korean aegyo is a socially recognized childlike speaking style used predominantly in romantic interactions among adults. This study examined vowel space modification in aegyo by analyzing formant frequencies from twelve Seoul Korean speakers who produced identical scripts in aegyo and non-aegyo styles. Results show that aegyo speech features a significant increase in F1 values across vowels and selective fronting of front vowels, leading to vowel space expansion but mainly a shift to higher F1. These findings suggest that adult speakers stylize childlike speech by imitating the shorter vocal tract of children, mainly through global vowel lowering and partial fronting.
Chinese Translation
韩语可爱语调是一种社会认可的儿童化说话风格,主要用于成年人之间的浪漫互动。本研究通过分析十二名首尔韩语说话者在可爱语调和非可爱语调下朗读相同文本的共振峰频率,考察了可爱语调中的元音空间变化。结果显示,可爱语调在所有元音中均表现出F1值的显著增加,以及前元音的选择性前移,导致元音空间的扩展,但主要是向更高的F1值转移。这些发现表明,成年说话者通过模仿儿童较短的声道来风格化儿童化的语言,主要通过整体元音降低和部分前移实现。
cs.CL / 19 / 2604.25135

FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments

FAMA:面向交互工具使用环境的开源大型语言模型的故障感知元代理框架
Saeidi, Amir, Mishra, Venkatesh, Mukhopadhyay, Souradeep, Liu, Gaowen, Payani, Ali, Srinivasa, Jayanth, Baral, Chitta
Abstract
Large Language Models are being increasingly deployed as the decision-making core of autonomous agents capable of effecting change in external environments. Yet, in conversational benchmarks, which simulate real-world customer-centric issue resolution scenarios, these agents frequently fail due to the cascading effects of incorrect decision-making. These challenges are particularly pronounced for open-source LLMs with smaller parameter sizes, limited context windows, and constrained inference budgets, which contribute to increased error accumulation in agentic settings. To tackle these challenges, we present the Failure-Aware Meta-Agentic (FAMA) framework. FAMA operates in two stages: first, it analyzes failure trajectories from baseline agents to identify the most prevalent errors; second, it employs an orchestration mechanism that activates a minimal subset of specialized agents tailored to address these failures by injecting a targeted context for the tool-use agent before the decision-making step. Experiments across open-source LLMs demonstrate performance gains up to 27% across evaluation modes over standard baselines. These results highlight that targeted curation of context through specialized agents to address common failures is a valuable design principle for building reliable, multi-turn tool-use LLM agents that simulate real-world conversational scenarios.
Chinese Translation
大型语言模型正日益被部署为能够在外部环境中产生变化的自主代理的决策核心。然而,在模拟现实世界以客户为中心的问题解决场景的对话基准测试中,这些代理由于错误决策的级联效应而频繁失败。这些挑战在参数规模较小、上下文窗口有限和推理预算受限的开源大型语言模型中尤为明显,这导致在代理环境中错误的积累增加。为了解决这些挑战,我们提出了故障感知元代理框架(FAMA)。FAMA 分为两个阶段:首先,它分析基线代理的故障轨迹,以识别最常见的错误;其次,它采用一种编排机制,激活一小部分专门的代理,以通过在决策步骤之前为工具使用代理注入有针对性的上下文来解决这些故障。在开源大型语言模型上的实验表明,相较于标准基线,评估模式的性能提升可达 27%。这些结果强调,通过专门的代理有针对性地策划上下文以解决常见故障,是构建可靠的多轮工具使用大型语言模型代理、模拟现实世界对话场景的一个有价值的设计原则。
cs.CL / 20 / 2604.25136

Frictive Policy Optimization for LLMs: Epistemic Intervention, Risk-Sensitive Control, and Reflective Alignment

面向大型语言模型的摩擦策略优化:认知干预、风险敏感控制与反思对齐
Pustejovsky, James, Krishnaswamy, Nikhil
Abstract
We propose Frictive Policy Optimization (FPO), a framework for learning language model policies that regulate not only what to say, but when and how to intervene in order to manage epistemic and normative risk. Unlike standard alignment methods that optimize surface-level preference or task utility, FPO treats clarification, verification, challenge, redirection, and refusal as explicit control actions whose purpose is to shape the evolution of belief, commitment, and uncertainty over time. We formalize alignment as a risk-sensitive epistemic control problem in which intervention decisions are selected based on their expected effect on downstream epistemic quality rather than on immediate reward alone. We introduce a compact taxonomy of frictive interventions, a structured friction functional that operationalizes multiple alignment failure modes, and a unified family of FPO methods spanning reward shaping, preference pairing, group-relative ranking, and risk-conditioned trust regions. We further propose an evaluation framework that measures epistemic competence directly through clarification behavior, calibration, contradiction repair, refusal proportionality, and information efficiency. Together, these results provide a formal and algorithmic foundation for learning agents that are aligned not only in outcome, but in epistemic conduct.
Chinese Translation
我们提出了摩擦策略优化(Frictive Policy Optimization, FPO),这是一个学习语言模型策略的框架,这些策略不仅调节说什么,还调节何时以及如何进行干预,以管理认知和规范风险。与优化表面偏好或任务效用的标准对齐方法不同,FPO将澄清、验证、挑战、重定向和拒绝视为明确的控制行为,其目的是塑造信念、承诺和不确定性随时间的演变。我们将对齐形式化为一个风险敏感的认知控制问题,其中干预决策是基于其对下游认知质量的预期影响进行选择,而不仅仅是基于即时奖励。我们引入了一种紧凑的摩擦干预分类法,一个能够操作化多种对齐失败模式的结构化摩擦泛函,以及一个统一的FPO方法家族,涵盖奖励塑造、偏好配对、群体相对排名和风险条件信任区域。我们进一步提出了一个评估框架,通过澄清行为、校准、矛盾修复、拒绝比例和信息效率直接测量认知能力。综合这些结果,为学习代理提供了一个形式化和算法基础,使其不仅在结果上对齐,而且在认知行为上也保持一致。
cs.CL / 21 / 2604.25182

CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation

CroSearch-R1:更好地利用跨语言知识进行检索增强生成
Qi, Rui, Mo, Fengran, Lu, Sijin, Chen, Yufeng, Nie, Jian-Yun, Huang, Kaiyu
Abstract
A multilingual collection may contain useful knowledge in other languages to supplement and correct the facts in the original language for Retrieval-Augmented Generation (RAG). However, the vanilla approach that simply concatenates multiple pieces of knowledge from different languages into the context may fail to improve effectiveness due to the potential disparities across languages. To better leverage multilingual knowledge, we propose CroSearch-R1, a search-augmented reinforcement learning framework to integrate multilingual knowledge into the Group Relative Policy Optimization (GRPO) process. In particular, the approach adopts a multi-turn retrieval strategy with cross-lingual knowledge integration to dynamically align the knowledge from other languages as supplementary evidence into a unified representation space. Furthermore, we introduce a multilingual rollout mechanism to optimize reasoning transferability across languages. Experimental results demonstrate that our framework effectively leverages cross-lingual complementarity and improves the effectiveness of RAG with multilingual collections.
Chinese Translation
多语言集合可能包含其他语言中的有用知识,以补充和纠正原语言中的事实,从而用于检索增强生成(RAG)。然而,简单地将来自不同语言的多条知识拼接到上下文中的朴素方法可能由于语言间的潜在差异而未能提高有效性。为了更好地利用多语言知识,我们提出了CroSearch-R1,这是一种搜索增强的强化学习框架,用于将多语言知识整合到群体相对策略优化(GRPO)过程中。具体而言,该方法采用了一种多轮检索策略,结合跨语言知识整合,动态地将其他语言的知识作为补充证据对齐到统一的表示空间。此外,我们引入了一种多语言 rollout(采样)机制,以优化跨语言的推理可转移性。实验结果表明,我们的框架有效地利用了跨语言的互补性,并提高了使用多语言集合的RAG的有效性。
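The abstract names Group Relative Policy Optimization (GRPO) but gives no formulas. As a rough illustration of the group-relative step that GRPO is generally described with, the sketch below normalizes rewards within one group of rollouts sampled for the same query. The function name, the epsilon term, and the sample rewards are assumptions made for illustration; this is not CroSearch-R1's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (same query)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps (an assumed constant) avoids division by zero when all rewards match
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one query, e.g. answers produced
# from evidence retrieved in different languages.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to form the baseline.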
cs.CL / 22 / 2604.25203

BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate

BARRED:通过不对称辩论合成定制政策护栏的训练
Mazza, Arnon, Levi, Elad
Abstract
Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models fine-tuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.
Chinese Translation
为定制政策部署护栏仍然面临挑战,因为通用安全模型无法捕捉特定任务的需求,而提示大型语言模型(LLMs)则存在边界案例表现不一致和高推理成本的问题。训练定制分类器可以实现准确性和效率,但需要大量标注数据,而这些数据的获取成本高昂。我们提出了BARRED(Boundary Alignment Refinement through REflection and Debate)框架,该框架仅使用任务描述和一小组未标注示例生成真实且多样的合成训练数据。我们的方法将领域空间分解为多个维度,以确保全面覆盖,并采用多智能体辩论来验证标签的正确性,从而产生高保真度的训练语料库。在多种定制政策上的实验表明,基于我们合成数据微调的小型语言模型始终优于最先进的专有LLMs(包括推理模型)和专用护栏模型。消融研究确认,维度分解和基于辩论的验证对于确保有效微调所需的多样性和标签保真度至关重要。BARRED框架消除了对大量人工标注的依赖,提供了一种可扩展的解决方案,以实现准确的定制护栏。
cs.CL / 23 / 2604.25249

Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance

低于机会的盲目性:小型大语言模型中的提示性表现不足产生位置偏差而非答案回避
Cacioli, Jon-Paul
Abstract
Detecting sandbagging--the deliberate underperformance on capability evaluations--is an open problem in AI safety. We tested whether symptom validity testing (SVT) logic from clinical malingering detection could identify sandbagging through below-chance performance (BCB) on forced-choice items. In a pre-registered pilot at the 7-9 billion parameter instruction-tuned scale (3 models, 4 MMLU-Pro domains, 4 conditions, 500 items per cell, 24,000 total trials), the plausibility gate failed. Zero of 12 model-domain cells showed significant below-chance performance under sandbagging instruction. Exploratory analyses revealed three qualitatively distinct failure modes. Qwen-2.5-7B and Phi-3.5-mini largely ignored the sandbagging instruction, with 62-88% response identity with the honest baseline. Llama-3-8B complied substantially but implemented underperformance as a positional heuristic, collapsing its response distribution onto middle-alphabet options (E at 31.8%, F at 26.1%) regardless of where the correct answer fell. This produced accuracy boosts of up to 33 percentage points when the correct answer coincidentally occupied the model's preferred position. An explicit anti-task instruction ("pick the least likely answer") drove two of three models below chance, with accuracy as low as 0.024. The capability for answer-aware avoidance therefore exists but is not activated by "deliberately underperform." BCB did not fail as a logical marker of answer-aware avoidance. It was not observed in this regime because the model showing the largest behavioural shift exhibited behaviour consistent with a position-dominant response policy rather than content-aware answer avoidance. We propose that positional-distribution shift may be a more effective behavioural signature than below-chance accuracy for detecting prompted underperformance at this model scale.
Chinese Translation
检测故意在能力评估中表现不足的行为(即沙袋行为)是人工智能安全领域的一个未解问题。我们测试了来自临床伪病检测的症状有效性测试(SVT)逻辑是否能够通过在强制选择项上的低于机会表现(BCB)来识别沙袋行为。在一项预注册的试点研究中,我们在70亿至90亿参数的指令调优规模上进行了测试(3个模型,4个MMLU-Pro领域,4个条件,每个单元500个项目,共24,000个试验),但合理性门未能通过。在沙袋行为指令下,12个模型-领域单元中没有一个显示出显著的低于机会表现。探索性分析揭示了三种质上不同的失败模式。Qwen-2.5-7B和Phi-3.5-mini在很大程度上忽视了沙袋行为指令,其响应与诚实基线的相似度为62-88%。Llama-3-8B则在很大程度上遵循了指令,但将表现不足作为一种位置启发式,导致其响应分布集中在中间字母选项(E为31.8%,F为26.1%),无论正确答案位于何处。当正确答案恰好位于模型偏好的位置时,这产生了高达33个百分点的准确度提升。一项明确的反任务指令(“选择最不可能的答案”)使得三种模型中有两种的表现低于机会,准确度低至0.024。因此,答案意识回避的能力确实存在,但并未被“故意表现不足”所激活。BCB并未作为答案意识回避的逻辑标记失败。在这一范畴中未观察到这一现象,因为表现出最大行为变化的模型展现出与位置主导响应策略一致的行为,而非内容意识的答案回避。我们提出,位置分布的变化可能比低于机会的准确度更有效地作为检测此模型规模下提示性表现不足的行为特征。
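To make the below-chance logic concrete, here is a minimal sketch of a one-sided exact binomial test on forced-choice accuracy. It assumes ten answer options per item (chance accuracy 0.10), which the abstract does not state, and it assumes the reported accuracy of 0.024 refers to a single 500-item cell; the function name is hypothetical and this is not the authors' analysis code.

```python
from math import comb


def below_chance_p_value(correct, n_items, chance):
    """One-sided exact binomial p-value: P(X <= correct) under pure guessing."""
    return sum(
        comb(n_items, k) * chance**k * (1 - chance) ** (n_items - k)
        for k in range(correct + 1)
    )


# 500 items per cell comes from the abstract; the ten-option chance rate
# of 0.10 is an assumption.
N_ITEMS, CHANCE = 500, 0.10

# Accuracy of 0.024 on 500 items corresponds to 12 correct answers.
p_anti_task = below_chance_p_value(12, N_ITEMS, CHANCE)

# Accuracy exactly at chance (50 correct) should not be flagged.
p_at_chance = below_chance_p_value(50, N_ITEMS, CHANCE)
```

Under these assumptions a score of 12 out of 500 is far less likely than guessing would produce, whereas a positional response policy that ignores content stays near chance and would not be flagged by this test.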
cs.CL / 24 / 2604.25296

Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs

从医学实体树学习:一种面向实体的医学数据工程框架用于多模态大语言模型
Lin, Jianghang, Yang, Haihua, Yu, Deli, Wu, Kai, Ye, Kai, Lin, Jinghao, Wang, Zihan, Wu, Yuhang, Cao, Liujuan
Abstract
Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or department. Such fragmented approaches fail to capture the hierarchical and interconnected nature of clinical medical knowledge, limiting the models' ability to perform fine-grained recognition and complex reasoning. In this paper, we propose a novel Entity-Centric Medical Data Engineering framework. We automatically extract entities from authoritative medical literature to construct a Medical Entity Tree (MET), a hierarchical structure that systematically encodes diseases, anatomical structures, modalities, and symptoms into a unified knowledge repository. Building upon the MET, we propose an advanced data engine that includes: (1) node-guided retrieval to anchor raw data to specific medical concepts, (2) a two-stage hybrid filtering and alignment pipeline to ensure precise visual-semantic correspondence, and (3) knowledge-aware data synthesis to generate enriched captions and targeted reasoning VQA pairs, leveraging structural constraints. Extensive evaluations across six medical benchmarks demonstrate that our approach significantly enhances the medical capabilities of general-purpose MLLMs, improving their ability to handle complex clinical queries and achieve state-of-the-art performance in diverse medical contexts.
Chinese Translation
多模态大语言模型(MLLMs)在医学应用中展现出变革潜力,但其性能受到依赖于模态或部门的粗粒度分区的传统数据整理策略的限制。这种碎片化的方法未能捕捉临床医学知识的层次性和相互关联性,从而限制了模型进行细粒度识别和复杂推理的能力。本文提出了一种新颖的面向实体的医学数据工程框架。我们从权威医学文献中自动提取实体,以构建医学实体树(MET),这是一种系统编码疾病、解剖结构、模态和症状的层次结构,形成统一的知识库。在此基础上,我们提出了一种先进的数据引擎,包括:(1)节点引导检索,将原始数据锚定到特定医学概念;(2)两阶段混合过滤和对齐管道,以确保精确的视觉-语义对应;(3)知识感知数据合成,利用结构约束生成丰富的标题和针对性的推理视觉问答对。对六个医学基准的广泛评估表明,我们的方法显著增强了通用多模态大语言模型的医学能力,提高了其处理复杂临床查询的能力,并在多种医学背景下实现了最先进的性能。
cs.CL / 25 / 2604.25297

LegalMidm: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model

LegalMidm:基于用例驱动的韩国法律领域专门化大语言模型
Jang, Youngjoon, Park, Chanhee, Moon, Hyeonseok, Ham, Young-kyoung, Moon, Jiwon, Kim, Jinhyeon, Jung, JuKyung, Lim, Heuiseok
Abstract
In recent years, the rapid proliferation of open-source large language models (LLMs) has spurred efforts to turn general-purpose models into domain specialists. However, many domain-specialized LLMs are developed using datasets and training protocols that are not aligned with the nuanced requirements of real-world applications. In the legal domain, where precision and reliability are essential, this lack of consideration limits practical utility. In this study, we propose a systematic training framework grounded in the practical needs of the legal domain, with a focus on Korean law. We introduce LegalMidm, a Korean legal-domain LLM, and present a methodology for constructing high-quality, use-case-driven legal datasets and optimized training pipelines. Our approach emphasizes collaboration with legal professionals and rigorous data curation to ensure relevance and factual accuracy, and demonstrates effectiveness in key legal tasks.
Chinese Translation
近年来,开源大语言模型(LLMs)的快速普及推动了将通用模型转变为领域专用模型的努力。然而,许多领域专用的LLM在开发时使用的数据集和训练协议并未与现实应用的细微需求相一致。在法律领域,精确性和可靠性至关重要,这种缺乏考虑限制了实际效用。在本研究中,我们提出了一个系统的训练框架,基于法律领域的实际需求,重点关注韩国法律。我们介绍了LegalMidm,一个韩国法律领域的LLM,并提出了一种构建高质量、基于用例驱动的法律数据集和优化训练流程的方法。我们的方法强调与法律专业人士的合作和严格的数据策划,以确保相关性和事实准确性,并在关键法律任务中展示了有效性。
cs.CL / 26 / 2604.25313

Faithfulness-QA: A Counterfactual Entity Substitution Dataset for Training Context-Faithful RAG Models

Faithfulness-QA:用于训练上下文忠实的检索增强生成模型的反事实实体替换数据集
Ju, Li, Wang, Junzhe, Zhang, Qi
Abstract
Retrieval-Augmented Generation (RAG) models frequently produce answers grounded in parametric memory rather than the retrieved context, undermining the core promise of retrieval augmentation. A fundamental obstacle to fixing this unfaithfulness is the lack of training data that explicitly requires models to prefer context over internal knowledge. We introduce Faithfulness-QA, a large-scale dataset of 99,094 samples constructed through counterfactual entity substitution. Starting from two established extractive QA benchmarks--SQuAD and TriviaQA--we automatically identify answer-bearing named entities in each context, replace them with type-consistent alternatives drawn from a curated bank of 76,953 entities, and thereby manufacture controlled knowledge conflicts between context and parametric memory. Rigorous quality filtering ensures 100% pass rates across four automated checks on random 200-sample audits. We release the full dataset, the construction pipeline, and a typed entity bank covering eight named entity categories. Faithfulness-QA is designed as a training resource for attention-based faithfulness objectives and as an evaluation benchmark for measuring context-grounding behavior in RAG systems. Data and code are available at https://github.com/qzhangFDU/faithfulness-qa-dataset.
Chinese Translation
检索增强生成(RAG)模型经常产生基于参数记忆而非检索上下文的答案,这削弱了检索增强的核心承诺。解决这种不忠实性的一个根本障碍是缺乏明确要求模型优先考虑上下文而非内部知识的训练数据。我们引入了Faithfulness-QA,这是一个通过反事实实体替换构建的大规模数据集,包含99,094个样本。我们从两个已建立的抽取式问答基准——SQuAD和TriviaQA——出发,自动识别每个上下文中的答案承载命名实体,用来自一个经过策划的76,953个实体的库中的类型一致替代品替换它们,从而制造上下文与参数记忆之间的受控知识冲突。严格的质量过滤确保在随机200个样本审计中四个自动检查的通过率达到100%。我们发布了完整的数据集、构建管道以及涵盖八个命名实体类别的类型化实体库。Faithfulness-QA旨在作为基于注意力的忠实性目标的训练资源,以及作为评估RAG系统中上下文基础行为的评估基准。数据和代码可在 https://github.com/qzhangFDU/faithfulness-qa-dataset 获取。
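The construction step described above can be sketched roughly as follows: find the answer entity in the context, swap in a different entity of the same type, and record the new answer. The tiny entity bank, the field names, and the helper `substitute_answer` are hypothetical stand-ins; the released pipeline and its 76,953-entity bank may work quite differently.

```python
import random

# Hypothetical typed entity bank, used only for illustration.
ENTITY_BANK = {
    "PERSON": ["Marie Curie", "Alan Turing", "Ada Lovelace"],
    "CITY": ["Lisbon", "Nairobi", "Osaka"],
}


def substitute_answer(context, answer, entity_type, rng):
    """Replace the answer entity with a different entity of the same type."""
    if answer not in context:
        return None  # answer-bearing entity not found; skip this sample
    candidates = [
        entity
        for entity in ENTITY_BANK[entity_type]
        if entity != answer and entity not in context
    ]
    if not candidates:
        return None
    new_answer = rng.choice(candidates)
    return {
        "context": context.replace(answer, new_answer),
        "answer": new_answer,  # counterfactual answer supported by the context
        "original_answer": answer,  # what parametric memory would likely say
    }


sample = substitute_answer(
    "The opening lecture was given by Alan Turing.",
    "Alan Turing",
    "PERSON",
    random.Random(0),
)
```

A model that answers with `original_answer` on such a sample is relying on memory instead of the retrieved context, which is the conflict the dataset is built to expose.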
cs.CL / 27 / 2604.25359

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

结构化输出基准:评估大型语言模型结构化输出质量的多源基准
Singh, Abhinav Kumar, Khurdula, Harsha Vardhan, Khemlani, Yoeven D, Agarwal, Vineet
Abstract
Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources, for example parsing invoices and medical records or converting PDF documents to database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone, or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context regardless of source modality; this deliberate design isolates structured-output capability from raw vision or speech-processing quality, ensuring a fair, source-agnostic comparison. Our benchmark comprises 5,000 text evaluation records derived from multi-hop QA drawn from a 25,091-record full corpus, 209 image records from OCR-processed PDFs across seven document types including multi-column layouts, dense tables, scanned historical documents, small-print text, and mathematical typesetting, and 115 audio records from the AMI corpus. Each record pairs a natural-language question with a JSON schema that the model must follow and a ground-truth answer verified against the source context. We evaluate 21 frontier and open-weight models across three source domains and seven metrics. Our results reveal a consistent pattern: models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0% on text, 67.2% on images, and 23.7% on audio, where longer context makes extraction substantially harder. We release the dataset, evaluation pipeline, and all related code.
Chinese Translation
大型语言模型越来越多地被用于从非结构化和半结构化来源中提取结构化数据:解析发票、医疗记录,以及将PDF文档转换为数据库条目。然而,现有的结构化输出生成基准要么仅关注模式合规性,要么在单一来源领域内评估值的正确性。我们引入SOB(结构化输出基准),这是一个跨越三种来源模态的多源基准:原生文本、图像和音频对话。所有模型都接收其上下文的文本标准化表示,无论来源模态如何;这种刻意的设计将结构化输出能力与原始视觉或语音处理质量隔离,确保了公平的、与来源无关的比较。我们的基准包含5000个文本评估记录,这些记录来自于从25091条完整语料库中提取的多跳问答,209个图像记录来自于七种文档类型的OCR处理PDF,包括多列布局、密集表格、扫描的历史文档、小字体文本和数学排版,以及115个来自AMI语料库的音频记录。每个记录将自然语言问题与模型必须遵循的JSON模式配对,并提供经过源上下文验证的真实答案。我们在三个来源领域和七个指标上评估了21个前沿和开放权重模型。我们的结果揭示了一种一致的模式:模型实现了近乎完美的模式合规性,但最佳值准确性(通过精确叶值匹配测量)在文本上仅达到83.0%,在图像上为67.2%,在音频上为23.7%,其中较长的上下文使得提取变得更加困难。我们发布了数据集、评估管道和所有相关代码。
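The abstract describes Value Accuracy as an exact leaf-value match. One plausible reading, sketched below, flattens the gold and predicted JSON objects into path-to-leaf maps and counts the gold leaves that the prediction reproduces exactly. The helper names and the handling of missing paths are assumptions; the benchmark's released evaluation pipeline is the authority on the real metric.

```python
def flatten_leaves(value, path=()):
    """Map every leaf of a nested JSON-like value to its path."""
    if isinstance(value, dict):
        children = value.items()
    elif isinstance(value, list):
        children = enumerate(value)
    else:
        return {path: value}
    leaves = {}
    for key, child in children:
        leaves.update(flatten_leaves(child, path + (key,)))
    return leaves


def value_accuracy(predicted, gold):
    """Fraction of gold leaves whose value is matched exactly at the same path."""
    gold_leaves = flatten_leaves(gold)
    if not gold_leaves:
        return 1.0  # nothing to get wrong (assumed convention)
    predicted_leaves = flatten_leaves(predicted)
    missing = object()  # sentinel so a missing path never equals a gold value
    hits = sum(
        1
        for leaf_path, gold_value in gold_leaves.items()
        if predicted_leaves.get(leaf_path, missing) == gold_value
    )
    return hits / len(gold_leaves)


gold = {"invoice": {"total": 120.5, "currency": "EUR"}, "items": ["pen", "ink"]}
pred = {"invoice": {"total": 120.5, "currency": "USD"}, "items": ["pen", "ink"]}
score = value_accuracy(pred, gold)  # 3 of 4 leaves match
```

The example output is schema-compliant yet scores 0.75, which mirrors the gap the abstract reports between schema compliance and value correctness.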
cs.CL / 28 / 2604.25374

Language corpora for the Dutch medical domain

荷兰医学领域的语言语料库
van Es, B.
Abstract
Background: Dutch medical corpora are scarce, limiting NLP development. Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. Results: The resulting corpus comprises approximately 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face. Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.
Chinese Translation
背景:荷兰医学语料库稀缺,限制了自然语言处理(NLP)的发展。方法:我们翻译了英文数据集,识别了通用语料库中的医学文本,并提取了开放的荷兰医学资源。结果:生成的语料库包含约350亿个词元,涵盖了约1亿份文档的医学领域,且可在Hugging Face上免费获取。结论:本研究建立了第一个大规模荷兰医学语言语料库,用于预训练和下游NLP任务。
cs.CL / 29 / 2604.25384

Wiki Dumps to Training Corpora: South Slavic Case

维基数据到训练语料库:南斯拉夫语言案例
Škorić, Mihailo
Abstract
This paper presents a methodology for transforming raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from raw dumps of Wikipedia, Wikisource, Wikibooks, Wikinews, and Wikiquote, where available. This step requires careful handling of raw wiki markup to isolate first the textual articles and then the usable natural language text within them. The second phase addresses the challenge of suspicious or low-quality articles, which are often generated from databases or structured knowledge bases. These articles are characterised by repetitive patterns, generic phrasing, and minimal to no original content. To mitigate their impact, an n-gram-based filtering strategy was employed to detect high levels of textual redundancy between articles and then remove such articles from the corpora entirely. The resulting datasets aim to provide linguistically rich texts suitable for training language models or conducting comparative research across South Slavic languages. By combining systematic extraction with quality control, this work contributes to the creation of reliable, high-information corpora that reflect authentic language use and cultural context. While focused on the South Slavic case in the paper, the approach is mostly language-agnostic and can be generalised to other languages and language families.
Chinese Translation
本文提出了一种将原始维基媒体数据转化为七种南斯拉夫语言高质量文本语料库的方法论。该工作分为两个主要阶段。第一阶段涉及从维基百科、维基文库、维基教科书、维基新闻和维基语录的原始数据中提取和清理文本(如可用)。这一步骤需要仔细处理原始维基标记,以首先隔离文本文章,然后提取其中可用的自然语言文本。第二阶段解决了可疑或低质量文章的挑战,这些文章通常是从数据库或结构化知识库生成的。这些文章的特点是重复模式、通用措辞,以及极少或没有原创内容。为了减轻它们的影响,采用了一种基于n-gram的过滤策略,以检测文章之间的高文本冗余度,并完全从语料库中移除这些文章。最终生成的数据集旨在提供适合训练语言模型或进行南斯拉夫语言间比较研究的语言学丰富文本。通过将系统提取与质量控制相结合,这项工作有助于创建可靠的、高信息量的语料库,反映真实的语言使用和文化背景。尽管本文专注于南斯拉夫语言案例,但该方法大多是与语言无关的,可以推广到其他语言和语言家族。
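The second phase can be illustrated with a small sketch of n-gram redundancy filtering: articles whose word n-grams mostly reappear in another article are treated as template-generated and removed. The trigram size, the 0.5 overlap threshold, the whitespace tokenization, and the toy articles are all assumptions chosen for illustration; the paper's actual filter and thresholds are not given in the abstract.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_redundant(articles, n=3, max_overlap=0.5):
    """Drop every article whose n-grams mostly reappear in some other article."""
    grams = [word_ngrams(article, n) for article in articles]
    kept = []
    for i, article in enumerate(articles):
        redundant = any(
            grams[i] and len(grams[i] & grams[j]) / len(grams[i]) > max_overlap
            for j in range(len(articles))
            if j != i
        )
        if not redundant:
            kept.append(article)
    return kept


# Toy articles: the first two follow the same database-style template.
articles = [
    "Selo A is a village in the municipality of X with 120 inhabitants",
    "Selo B is a village in the municipality of X with 340 inhabitants",
    "The river rises in the mountains and flows north through several old towns",
]
kept = filter_redundant(articles)
```

This pairwise comparison is quadratic in the number of articles, so a full Wikipedia dump would need an index or hashing scheme instead; the sketch only shows the overlap criterion itself.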
cs.CL / 30 / 2604.25392

Benchmarking PyCaret AutoML Against IndoBERT Fine-Tuning for Sentiment Analysis on Indonesian IKN Twitter Data

基于PyCaret AutoML的经典机器学习方法与IndoBERT微调的深度学习方法在印尼IKN Twitter数据情感分析中的基准比较
Mayzaroh, Mutia Alfi, Ningsih, Dwi Fitria, Destriani, Nindi, Manullang, Martin C. T.
Abstract
This paper benchmarks a classical machine learning approach based on PyCaret AutoML against a deep learning approach based on IndoBERT fine-tuning for binary sentiment analysis of Indonesian-language Twitter comments related to Ibu Kota Nusantara (IKN). The dataset contains 1,472 manually labeled samples, consisting of 780 negative and 692 positive comments. In the machine learning setting, Logistic Regression, Naive Bayes, and Support Vector Machine were evaluated using 10-fold cross-validation, with Logistic Regression achieving the best performance among the classical models at 77.57% accuracy and 77.17% F1-score. In the deep learning setting, the indobenchmark/indobert-base-p1 model was fine-tuned for five epochs and achieved 89.59% test accuracy and 89.37% F1-score. The results show that IndoBERT substantially outperforms the machine learning baselines, highlighting the effectiveness of Transformer-based contextual representations for informal Indonesian social media text.
Chinese Translation
本文对基于PyCaret AutoML的经典机器学习方法与基于IndoBERT微调的深度学习方法进行了基准比较,旨在对与印尼Ibu Kota Nusantara (IKN)相关的印尼语Twitter评论进行二元情感分析。数据集包含1,472个手动标注的样本,其中包括780条负面评论和692条正面评论。在机器学习设置中,使用10折交叉验证评估了逻辑回归、朴素贝叶斯和支持向量机,其中逻辑回归在经典模型中表现最佳,准确率为77.57%,F1-score为77.17%。在深度学习设置中,indobenchmark/indobert-base-p1模型经过五个周期的微调,达到了89.59%的测试准确率和89.37%的F1-score。结果表明,IndoBERT显著优于机器学习基线,突显了基于Transformer的上下文表示在非正式印尼社交媒体文本中的有效性。
cs.CL / 31 / 2604.25409

Scaling Probabilistic Transformer via Efficient Cross-Scale Hyperparameter Transfer

通过高效的跨尺度超参数转移扩展概率变换器
Kuang, Penghao, Wu, Haoyi, Tu, Kewei
Abstract
Probabilistic Transformer (PT), a white-box probabilistic model for contextual word representation, has demonstrated substantial similarity to standard Transformers in both computational structure and downstream task performance on small models and small- to medium-sized datasets. However, PT is less robust to hyperparameter choices than standard Transformers, making it harder to scale efficiently. In this work, we follow Maximal Update Parametrization (muP) to rescale PT's parameters, so that hyperparameters optimized on small models can be transferred to larger models without additional tuning. With this approach, we successfully scale PT to models with up to 0.4B parameters. Experiments show that PT consistently outperforms standard Transformers under the same parameter budget on Masked Language Modeling (MLM) tasks. We hope this work will contribute to the practical deployment of probabilistic models at substantially larger scales in the future.
Chinese Translation
概率变换器(Probabilistic Transformer, PT)是一种用于上下文词表示的白盒概率模型,在计算结构和小型模型及小到中型数据集的下游任务性能上与标准变换器表现出显著相似性。然而,与标准变换器相比,PT对超参数选择的鲁棒性较差,这使得其高效扩展变得更加困难。在本研究中,我们遵循最大更新参数化(Maximal Update Parametrization, muP)对PT的参数进行重新缩放,以便在小型模型上优化的超参数能够无须额外调优地转移到更大型模型上。通过这种方法,我们成功地将PT扩展到具有高达4亿参数的模型。实验表明,在相同参数预算下,PT在掩码语言建模(Masked Language Modeling, MLM)任务中始终优于标准变换器。我们希望本研究能为未来在更大规模上实际部署概率模型做出贡献。
cs.CL / 32 / 2604.25423

Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives

大型语言模型是否捕捉到具身认知和文化变异?来自指示词的跨语言证据
Wang, Yu, Chersoni, Emmanuele, Huang, Chu-Ren
Abstract
Do large language models (LLMs) truly acquire embodied cognition and cultural conventions from text? We introduce demonstratives, fundamental spatial expressions like "this/that" in English and "zhè/nà" in Chinese, as a novel probe for grounded knowledge. Using 6,400 responses from 320 native speakers, we establish a human baseline: English speakers reliably distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to inherently understand the proximal-distal contrast and show no cultural differences, defaulting to English-centric reasoning. Our study contributes (i) a new task, based on demonstratives, as a new lens for evaluating embodied cognition and cultural conventions; (ii) empirical evidence of cross-cultural asymmetries in human interpretation; (iii) a new perspective on the egocentric-sociocentric debate, showing both orientations coexist but vary across languages; and (iv) a call to address individual variation in future model design.
Chinese Translation
大型语言模型(LLMs)是否真的从文本中获取了具身认知和文化惯例?我们引入指示词,作为一种新的探测工具,指示词是基本的空间表达,如英语中的“this/that”和汉语中的“zhè/nà”。通过对320名母语者的6,400个反应进行分析,我们建立了一个人类基准:英语使用者能够可靠地区分近指和远指的指称,但在视角转换方面存在困难;而汉语使用者则能够流畅地切换视角,但对远指的模糊性表现出容忍。相比之下,五种最先进的LLMs未能内在理解近指与远指的对比,并且未显示出文化差异,默认采用以英语为中心的推理。我们的研究贡献了(i)一个基于指示词的新任务,作为评估具身认知和文化惯例的新视角;(ii)关于人类解读中的跨文化不对称性的实证证据;(iii)对自我中心与社会中心辩论的新视角,显示这两种取向共存但在不同语言中有所变化;以及(iv)呼吁在未来模型设计中关注个体差异。
cs.CL / 33 / 2604.25444

One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement

一个精炼器解锁所有:通过强化查询精炼实现推理时的推理引导
Zhou, Yixiao, Cheng, Dongzhou, Wu, Zhiliang, Yang, Yi, Cheng, Yu, Fan, Hehe
Abstract
Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive $O(N)$ costs by fine-tuning each model individually or rely on static prompts that fail to resolve query-level structural complexity. In this paper, we propose ReQueR (Reinforcement Query Refinement), a modular framework that treats reasoning elicitation as an inference-time alignment task. We train a specialized Refiner policy via Reinforcement Learning to rewrite raw queries into explicit logical decompositions, treating frozen LLMs as the environment. Rooted in the classical Zone of Proximal Development from educational psychology, we introduce the Adaptive Solver Hierarchy, a curriculum mechanism that stabilizes training by dynamically aligning environmental difficulty with the Refiner's evolving competence. ReQueR yields consistent absolute gains of 1.7% to 7.2% across diverse architectures and benchmarks, outperforming strong baselines by 2.1% on average. Crucially, it provides a promising paradigm for one-to-many inference-time reasoning elicitation, enabling a single Refiner trained on a small set of models to effectively unlock reasoning in diverse unseen models. Code is available at https://github.com/newera-xiao/ReQueR.
Chinese Translation
大型语言模型(LLMs)常常无法充分利用其潜在的推理能力,因为模糊的人类询问与机器激活所需的结构化逻辑之间存在分布不匹配。现有的对齐方法要么通过单独微调每个模型而产生高达 $O(N)$ 的成本,要么依赖于静态提示,这些提示无法解决查询级别的结构复杂性。本文提出了 ReQueR(Reinforcement Query Refinement),一个将推理引导视为推理时对齐任务的模块化框架。我们通过强化学习训练一个专门的精炼器策略,将原始查询重写为明确的逻辑分解,将冻结的 LLM 视为环境。基于教育心理学中的经典最近发展区(Zone of Proximal Development),我们引入了自适应求解器层次(Adaptive Solver Hierarchy),这是一种通过动态对齐环境难度与精炼器不断发展的能力来稳定训练的课程机制。ReQueR 在不同架构和基准测试中产生了一致的绝对增益,范围为 1.7% 到 7.2%,平均超越强基线 2.1%。重要的是,它为一对多的推理时推理引导提供了一个有前景的范例,使得在一小组模型上训练的单个精炼器能够有效解锁多种未见模型中的推理能力。代码可在 https://github.com/newera-xiao/ReQueR 获取。
cs.CL / 34 / 2604.25448

Navigating Global AI Regulation: A Multi-Jurisdictional Retrieval-Augmented Generation System

全球人工智能监管的导航:一个多司法管辖区的检索增强生成系统
Ford, Courtney, Rane, Ojas, Leavy, Susan
Abstract
Navigating AI regulation across jurisdictions is increasingly difficult for policymakers, legal professionals, and researchers. To address this, we present a multi-jurisdictional Retrieval-Augmented Generation system for global AI regulation. Our corpus includes 242 documents across 68 jurisdictions, ranging from formal legislation like the EU AI Act to unstructured policy documents such as national AI strategies. The system makes three technical contributions: type-specific chunking that preserves legal structure across heterogeneous documents; conditional retrieval routing with entity detection and metadata for legal citations; and priority-based re-ranking to boost enacted legislation over policy and secondary sources. Evaluation of 50 queries reveals strong performance across both single-entity and multi-jurisdictional questions, achieving 0.87 average faithfulness and 0.84 average answer relevancy. Single-entity queries achieve 0.86 average faithfulness and 0.92 average answer relevancy, while multi-jurisdictional comparison queries achieve 0.88 average faithfulness and 0.75 average answer relevancy. These findings highlight the effectiveness of domain-specific retrieval strategies for navigating complex, heterogeneous regulatory corpora.
Chinese Translation
在不同司法管辖区内导航人工智能监管对于政策制定者、法律专业人士和研究人员来说日益困难。为了解决这一问题,我们提出了一个针对全球人工智能监管的多司法管辖区检索增强生成系统。我们的语料库包含来自68个司法管辖区的242份文件,涵盖了从正式立法(如欧盟人工智能法案)到非结构化政策文件(如国家人工智能战略)的多种类型。该系统在技术上做出了三项贡献:特定类型的分块,保持异构文件之间的法律结构;带有实体检测和法律引用元数据的条件检索路由;以及基于优先级的重新排序,以提升已颁布立法相对于政策和次要来源的权重。对50个查询的评估显示,该系统在单一实体和多司法管辖区问题上均表现出色,平均忠实度达到0.87,平均答案相关性达到0.84。单一实体查询的平均忠实度为0.86,平均答案相关性为0.92,而多司法管辖区比较查询的平均忠实度为0.88,平均答案相关性为0.75。这些发现突显了特定领域检索策略在应对复杂、异构监管语料库中的有效性。
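The priority-based re-ranking step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the boost is computed, so the additive boost, its values, and the names `Chunk`, `PRIORITY_BOOST`, and `rerank` are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical boosts: the abstract only says enacted legislation is ranked
# above policy and secondary sources, not by how much.
PRIORITY_BOOST = {"legislation": 0.2, "policy": 0.1, "secondary": 0.0}


@dataclass
class Chunk:
    text: str
    doc_type: str      # "legislation", "policy", or "secondary"
    similarity: float  # retriever score, assumed to lie in [0, 1]


def rerank(chunks: list[Chunk]) -> list[Chunk]:
    """Sort chunks by similarity plus a per-type boost, highest first."""
    return sorted(
        chunks,
        key=lambda c: c.similarity + PRIORITY_BOOST.get(c.doc_type, 0.0),
        reverse=True,
    )


if __name__ == "__main__":
    hits = [
        Chunk("National AI strategy excerpt", "policy", 0.80),
        Chunk("EU AI Act, Article 5", "legislation", 0.75),
        Chunk("Law-firm commentary", "secondary", 0.78),
    ]
    for chunk in rerank(hits):
        print(chunk.doc_type, "|", chunk.text)
```

An additive boost keeps the retriever's ordering within each document type and only reorders across types when scores are close; a strict tiered sort would be an equally plausible reading of the abstract.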
cs.CL / 35 / 2604.25452

Benchmarking Logistic Regression, SVM, and LightGBM Against BiLSTM with Attention for Sentiment Analysis on Indonesian Product Reviews

基于注意力机制的双向长短期记忆网络(BiLSTM)与逻辑回归、支持向量机(SVM)和LightGBM在印尼产品评论情感分析中的基准比较
Hamdi, Razin Hafid, Hutabarat, Ivana Margareth, Sinaga, Hanna Gresia, Muthoharoh, Luluk, Satria, Ardika, Manullang, Martin C. T.
Abstract
Sentiment analysis of product reviews on e-commerce platforms plays a critical role in automatically understanding customer satisfaction and providing actionable insights for sellers seeking to improve product quality. This paper presents a comprehensive benchmarking study comparing a Machine Learning (ML) approach via the PyCaret AutoML framework against a Deep Learning (DL) approach based on a Bidirectional Long Short-Term Memory (BiLSTM) architecture with an Attention mechanism for binary sentiment classification on Indonesian product reviews. The dataset comprises 19,728 samples balanced equally between positive and negative reviews. For the ML approach, three prominent algorithms were evaluated via 10-fold stratified cross-validation: Logistic Regression (LR), Support Vector Machine (SVM) with a linear kernel, and Light Gradient Boosting Machine (LightGBM). Logistic Regression achieved the best ML performance with an accuracy of 97.26% and an F1-score of 97.26%. The BiLSTM with Attention model, evaluated on 3,946 held-out test samples, achieved an accuracy of 97.24% and an F1-score of 97.24%. These comparative results demonstrate that traditional ML algorithms with proper preprocessing and feature extraction can compete closely with, and even marginally outperform, more complex sequential DL architectures on high-dimensional datasets, while simultaneously offering greater computational efficiency.
Chinese Translation
电商平台上产品评论的情感分析在自动理解客户满意度和为寻求提高产品质量的卖家提供可行见解方面发挥着至关重要的作用。本文呈现了一项全面的基准研究,比较了通过PyCaret AutoML框架的机器学习(ML)方法与基于具有注意力机制的双向长短期记忆网络(BiLSTM)架构的深度学习(DL)方法在印尼产品评论的二元情感分类中的表现。数据集包含19,728个样本,正负评论数量均衡。对于机器学习方法,通过10折分层交叉验证评估了三种主要算法:逻辑回归(LR)、具有线性核的支持向量机(SVM)和轻量级梯度提升机(LightGBM)。逻辑回归以97.26%的准确率和97.26%的F1-score取得了最佳的机器学习表现。经过在3,946个保留测试样本上的评估,带有注意力机制的BiLSTM模型达到了97.24%的准确率和97.24%的F1-score。这些比较结果表明,经过适当预处理和特征提取的传统机器学习算法可以与更复杂的序列深度学习架构在高维数据集上进行紧密竞争,甚至在某些情况下略有超越,同时提供更高的计算效率。
cs.CL / 36 / 2604.25456

An Investigation of Linguistic Biases in LLM-Based Recommendations

基于大型语言模型的推荐中的语言偏见研究
Venkateswaran, Nitin, Ang, Jason, Adhikari, Deep, Dasari, Tarun Krishna
Abstract
We investigate linguistic biases in LLM-based restaurant and product recommendations given prompts varying across Southern American English (AE), Indian English (IE), and Code-Switched Hindi-English dialects, using the Yelp Open dataset (Yelp Inc., 2023) and Walmart product reviews dataset (PromptCloud, 2020). We add lists of restaurant and product names balanced by cuisine type and product category to the prompts given to the LLM, and we zero-shot prompt the LLMs in a cold-start setting to select the top-20 restaurant and product recommendations from these lists for each of the dialect-varied prompts. We prompt LLMs using different list samples across 20 seeds for better generalization, and aggregate per cuisine-type and per category response counts for each seed, question/prompt, and LLM model. We run mixed-effects regression models for each model family and topic (restaurant/product) with the aggregate response counts as the dependent variable, and conduct likelihood ratio tests for the fixed effects with post-hoc pairwise testing of estimated marginal means differences, to investigate group-level differences in recommendation counts by model size and dialect type. Results show that dialect plays a role in the type of restaurant selected across the models tested, with the mistral-small-3.1 model and both llama-3.1 family models showing more sensitivity to Indian English and Code-Switched prompts. In terms of product recommendations, the llama-3.1-70B model is particularly sensitive to Code-Switched prompts in four out of seven categories, and more beauty and home category recommendations are seen when using the Indian English and Code-Switched prompts for larger and smaller models, respectively. No broad trends are seen in the model-size-based differences, with differing recommendations based on model sizes conditioned by the type of dialect.
Chinese Translation
我们研究了基于大型语言模型(LLM)的餐厅和产品推荐中的语言偏见,使用了来自美国南方英语(AE)、印度英语(IE)和代码切换的印地语-英语方言的不同提示,数据来源于Yelp开放数据集(Yelp Inc., 2023)和沃尔玛产品评论数据集(PromptCloud, 2020)。我们在提供给LLM的提示中添加了按菜系类型和产品类别平衡的餐厅和产品名称列表,并在冷启动设置中对LLM进行零样本提示,从这些列表中为每种方言变化的提示选择前20个餐厅和产品推荐。我们使用不同的列表样本在20个种子上对LLM进行提示,以实现更好的泛化,并对每个种子、问题/提示和LLM模型的每种菜系类型和每个类别的响应计数进行汇总。我们对每个模型家族和主题(餐厅/产品)运行混合效应回归模型,以汇总响应计数作为因变量,并进行固定效应的似然比检验,随后进行估计边际均值差异的成对后续检验,以研究模型大小和方言类型对推荐计数的组别差异。结果显示,方言在所测试模型中选择餐厅的类型上起到了作用,mistral-small-3.1模型和测试的两个llama-3.1家族模型对印度英语和代码切换提示表现出更高的敏感性。在产品推荐方面,llama-3.1-70B模型在七个类别中的四个类别对代码切换提示特别敏感,并且在使用印度英语和代码切换提示时,较大模型和较小模型分别给出了更多的美容和家居类别推荐。基于模型大小的差异未呈现出普遍趋势,不同模型大小之间的推荐差异取决于方言类型。
cs.CL / 37 / 2604.25482

From World-Gen to Quest-Line: A Dependency-Driven Prompt Pipeline for Coherent RPG Generation

从世界生成到任务线:一种基于依赖驱动的提示管道用于连贯的角色扮演游戏生成
Borawski, Dominik, Szulc, Marta, Chudy, Robert, Giedrowicz, Małgorzata, Mironowicz, Piotr
Abstract
Large Language Models (LLMs) have shown strong potential for narrative generation, but their use in complex, multi-layered role-playing game (RPG) worlds is still limited by issues of coherence, controllability, and structural consistency. This paper explores a dependency-aware, multi-stage prompt pipeline for procedural RPG content generation that models narrative dependencies through structured intermediate representations. The approach decomposes generation into sequential stages: world building, non-player character creation, player character creation, campaign-level quest planning, and quest expansion. Each stage conditions on structured JSON outputs from previous stages. By enforcing schemas and explicit data flow, the pipeline reduces narrative drift, limits hallucinations, and supports scalable creation of interconnected narrative elements. The system is evaluated qualitatively through human-centered analysis across multiple independent runs. Outputs are assessed using criteria such as structural completeness, internal consistency, narrative coherence, diversity, and actionability. Results show that the pipeline consistently generates logically sound and structurally valid RPG content, without quality degradation as complexity increases. Separating high-level campaign planning from detailed quest expansion improves both global structure and local storytelling. These findings suggest that dependency-aware prompt pipelines with structured intermediate representations are an effective design pattern for LLM-based procedural content generation. This approach may also generalize to other domains requiring sequential reasoning over evolving contextual states.
Chinese Translation
大型语言模型(LLMs)在叙事生成方面展现出强大的潜力,但在复杂的多层次角色扮演游戏(RPG)世界中的应用仍受限于连贯性、可控性和结构一致性的问题。本文探讨了一种依赖感知的多阶段提示管道,用于程序化RPG内容生成,通过结构化的中间表示建模叙事依赖关系。该方法将生成过程分解为顺序阶段:世界构建、非玩家角色创建、玩家角色创建、战役级任务规划和任务扩展。每个阶段都依赖于前一阶段的结构化JSON输出。通过强制执行模式和显式数据流,该管道减少了叙事漂移,限制了幻觉现象,并支持可扩展的互联叙事元素的创建。该系统通过多次独立运行的人本分析进行定性评估。输出结果使用结构完整性、内部一致性、叙事连贯性、多样性和可操作性等标准进行评估。结果表明,该管道在生成逻辑合理且结构有效的RPG内容方面表现一致,且随着复杂性的增加,质量没有下降。将高层次的战役规划与详细的任务扩展分离,改善了整体结构和局部叙事。这些发现表明,具有结构化中间表示的依赖感知提示管道是基于LLM的程序化内容生成的有效设计模式。这种方法也可能推广到其他需要对不断变化的上下文状态进行顺序推理的领域。
cs.CL / 38 / 2604.25525

From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support

从聊天机器人到知己:大型语言模型用于情感支持的跨文化研究
Amat-Lefort, Natalia, Yazan, Mert, Curry, Amanda Cercas, Plaza-del-Arco, Flor Miriam
Abstract
Large Language Models (LLMs) are increasingly used not only for instrumental tasks, but as always-available and non-judgmental confidants for emotional support. Yet what drives adoption and how users perceive emotional support interactions across countries remains unknown. To address this gap, we present the first large-scale cross-cultural study of LLM use for emotional support, surveying 4,641 participants across seven countries (USA, UK, Germany, France, Spain, Italy, and The Netherlands). Our results show that adoption rates vary dramatically across countries (from 20% to 59%). Using mixed models that separate cultural effects from demographic composition, we find that being aged 25-44, religious, married, and of higher socioeconomic status are predictors of positive perceptions (trust, usage, perceived benefits), with socioeconomic status being the strongest. English-speaking countries consistently show more positive perceptions than Continental European countries. We further collect a corpus of 731 real multilingual prompts from user interactions, showing that users mainly seek help for loneliness, stress, relationship conflicts, and mental health struggles. Our findings reveal that LLM emotional support use is shaped by a complex sociotechnical landscape and call for a broader research agenda examining how these systems can be developed, deployed, and governed to ensure safe and informed access.
Chinese Translation
大型语言模型(LLMs)不仅被越来越多地用于工具性任务,还被用作随时可用且不带评判的知己以提供情感支持。然而,是什么驱动了这种采用,以及用户在不同国家如何看待情感支持互动仍然未知。为了解决这一空白,我们呈现了首个关于LLM在情感支持中应用的大规模跨文化研究,调查了来自七个国家(美国、英国、德国、法国、西班牙、意大利和荷兰)的4,641名参与者。我们的结果显示,各国的采用率差异显著(从20%到59%不等)。通过使用混合模型将文化效应与人口组成分开,我们发现:年龄在25至44岁、宗教信仰、已婚以及较高的社会经济地位是积极感知(信任、使用、感知益处)的预测因素,其中社会经济地位是最强的。讲英语的国家在感知上普遍比大陆欧洲国家更为积极。我们进一步收集了731个来自用户互动的真实多语言提示语料,显示用户主要寻求孤独、压力、关系冲突和心理健康问题的帮助。我们的研究结果揭示了LLM情感支持使用受到复杂社会技术环境的影响,并呼吁更广泛的研究议程,以探讨如何开发、部署和管理这些系统,以确保安全和知情的访问。
cs.CL / 39 / 2604.25578

Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

Marco-MoE:开放的多语言专家混合语言模型与高效的再利用
Jiang, Fan, Zhao, Yu, Lyu, Chenyang, Shi, Tianqi, Du, Yichao, Jiang, Feihu, Wang, Longyue, Luo, Weihua
Abstract
We present Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) models. Marco-MoE features a highly sparse design in which only around 5% of the total parameters are activated per input token. This extreme sparsity, combined with upcycling from dense models, enables efficient pre-training on 5T tokens. Our models surpass similarly-sized competitors on English and multilingual benchmarks, achieving a best-in-class performance-to-compute ratio. We further post-train these models to create Marco-MoE-Instruct variants, which surpass the performance of competing models possessing 3-14× more activated parameters. Our analysis reveals that Marco-MoE learns structured expert activation patterns shared across related languages, while maintaining highly specialized utilization for linguistically isolated ones. We further show that Marco-MoE allows for scalable language expansion without the interference typical of dense models. To support the community, we disclose our full training datasets, recipes, and model weights.
Chinese Translation
我们提出了Marco-MoE,这是一套完全开放的多语言稀疏专家混合模型(Mixture-of-Experts, MoE)。Marco-MoE具有高度稀疏的设计,每个输入标记仅激活约5%的总参数。这种极端的稀疏性结合来自稠密模型的再利用,使得在5万亿个标记上进行高效的预训练成为可能。我们的模型在英语和多语言基准测试中超过了同规模的竞争对手,达到了最佳的性能与计算比。我们进一步对这些模型进行后训练,创建了Marco-MoE-Instruct变体,其性能超过了激活参数多出3到14倍的竞争模型。我们的分析表明,Marco-MoE学习了在相关语言之间共享的结构化专家激活模式,同时对语言学上孤立的语言保持高度专业化的利用。我们还展示了Marco-MoE允许在没有稠密模型典型干扰的情况下进行可扩展的语言扩展。为了支持社区,我们公开了完整的训练数据集、配方和模型权重。
cs.CL / 40 / 2604.25580

Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation

告别 Perspective API:NLP、CSS 和 LLM 评估中的测量基础设施教训
Hartmann, David, Tonneau, Manuel, Kraft, Angelie, Seiling, LK, Staufer, Dimitri, Delobelle, Pieter, Fillies, Jan, Luther, Anna Ricarda, Batzner, Jan, Lisker, Mareike
Abstract
The closure of Perspective API at the end of 2026 discards what has functioned as the de facto standard for automated toxicity measurement in NLP, CSS, and LLM evaluation research. We document the structural dependence that the communities built on this single proprietary tool and discuss how this dependence caused epistemic problems that have affected - and will likely continue to affect - collective research efforts. Perspective's model was periodically updated without versioning or disclosure, its annotation structure reflected a single corporate operationalisation of a contested concept, and its scores were used simultaneously as an evaluation target and an evaluation standard. Its closure leaves behind non-updatable benchmarks, irreproducible results, and ultimately a field at risk of perpetuating these issues by turning to closed-source LLMs. We use Perspective's announced termination as an opportunity to call for an independent, valid, adaptable, and reproducible toxicity and hate speech measurement infrastructure, with the technical and governance requirements outlined in this paper.
Chinese Translation
2026 年底 Perspective API 的关闭意味着这一工具将不再作为 NLP、CSS 和 LLM 评估研究中自动化毒性测量的事实标准。我们记录了基于这一单一专有工具所构建的社区的结构性依赖,并讨论了这种依赖如何导致了影响——并可能继续影响——集体研究努力的认识论问题。Perspective 的模型定期更新,但没有版本控制或披露,其注释结构反映了一个有争议概念的单一企业操作化,其得分同时被用作评估目标和评估标准。其关闭留下了无法更新的基准、不可重复的结果,最终使得该领域面临通过转向闭源 LLM 而延续这些问题的风险。我们利用 Perspective 宣布终止的机会,呼吁建立一个独立、有效、可适应和可重复的毒性和仇恨言论测量基础设施,并在本文中概述了技术和治理要求。
cs.CL / 41 / 2604.25611

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

WhisperPipe:一种资源高效的实时自动语音识别流式架构
Ramezani, Erfan, Giahi, Mohammad Mahdi, Zarabadipour, Mohammad Erfan, Yosefian, Amir Reza, Ghadiri, Hamid
Abstract
Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models like Whisper. Existing streaming approaches either sacrifice accuracy through aggressive chunking or incur prohibitive memory costs through unbounded context accumulation. We present WhisperPipe, a novel streaming architecture that achieves bounded memory consumption while maintaining transcription quality through three key innovations: a hybrid Voice Activity Detection (VAD) pipeline combining Silero VAD with energy-based filtering to reduce false activations by 34%, a dynamic buffering mechanism with overlapping context windows that prevents information loss at segment boundaries, and an adaptive processing strategy that balances latency and accuracy based on speech characteristics. Evaluated on 2.5 hours of diverse audio data, WhisperPipe demonstrates a median end-to-end latency of 89ms (90th percentile: 142ms) while consuming 48% less peak GPU memory and 80.9% lower average GPU utilization compared to baseline Whisper implementations. The system maintains stable memory usage over extended sessions, with zero growth rate across 150 minutes of continuous operation. Comparative analysis against related work shows that WhisperPipe achieves competitive accuracy (WER within 2% of offline Whisper) while operating at 3-5x lower latency than existing streaming solutions. The architecture's modular design enables deployment across resource-constrained environments, from edge devices to cloud infrastructure. Our results demonstrate that careful architectural design can reconcile the competing demands of real-time responsiveness and model sophistication in production ASR systems.
Chinese Translation
实时自动语音识别(ASR)系统面临着转录准确性与计算效率之间的基本权衡,尤其是在部署像Whisper这样的大规模Transformer模型时。现有的流式方法要么通过激进的分块牺牲准确性,要么通过无限制的上下文累积导致高昂的内存成本。我们提出了WhisperPipe,这是一种新颖的流式架构,通过三项关键创新实现了有限的内存消耗,同时保持了转录质量:一种结合Silero VAD与基于能量的过滤的混合语音活动检测(VAD)管道,减少了34%的错误激活;一种具有重叠上下文窗口的动态缓冲机制,防止了段边界处的信息丢失;以及一种根据语音特征平衡延迟与准确性的自适应处理策略。在对2.5小时多样化音频数据的评估中,WhisperPipe展示了89毫秒的中位端到端延迟(第90百分位:142毫秒),同时相比于基线Whisper实现,峰值GPU内存占用减少48%,平均GPU利用率降低80.9%。该系统在长时间会话中保持稳定的内存使用,在150分钟的连续操作中内存增长率为零。与相关工作的比较分析表明,WhisperPipe在操作延迟上比现有流式解决方案低3-5倍,同时实现了具有竞争力的准确性(词错误率(WER)与离线Whisper相差2%以内)。该架构的模块化设计使其能够在资源受限的环境中部署,从边缘设备到云基础设施。我们的结果表明,精心的架构设计可以调和实时响应性与模型复杂性在生产ASR系统中的竞争需求。
cs.CL / 42 / 2604.25654

Progressing beyond Art Masterpieces or Touristic Clichés: how to assess your LLMs for cultural alignment?

超越艺术杰作或旅游陈词滥调:如何评估您的大型语言模型(LLMs)在文化上的一致性?
Branco, António, Silva, João, Marques, Nuno, Gomes, Luis, Campos, Ricardo, Sequeira, Raquel, Nerea, Sara, Silva, Rodrigo, Marques, Miguel, Duarte, Rodrigo, Putyato, Artur, Folques, Diogo, Valente, Tiago
Abstract
Although the cultural (mis)alignment of Large Language Models (LLMs) has attracted increasing attention -- often framed in terms of cultural bias -- until recently there has been limited work on the design and development of datasets for cultural assessment. Here, we review existing approaches to such datasets and identify their main limitations. To address these issues, we propose design guidelines for annotators and report on the construction of a dataset built according to these principles. We further present a series of contrastive experiments conducted with this dataset. The results demonstrate that our design yields test sets with greater discriminative power, effectively distinguishing between models specialized for a given culture and those that are not, ceteris paribus.
Chinese Translation
尽管大型语言模型(LLMs)的文化(误)一致性引起了越来越多的关注——通常以文化偏见的形式呈现——但直到最近,针对文化评估的数据集设计和开发的研究仍然有限。在此,我们回顾了现有的数据集方法,并识别出它们的主要局限性。为了解决这些问题,我们提出了针对注释者的设计指南,并报告了根据这些原则构建的数据集。我们进一步展示了一系列使用该数据集进行的对比实验。结果表明,我们的设计产生了具有更强区分能力的测试集,有效区分了专门针对特定文化的模型与非专门模型,其他条件不变。
cs.CL / 43 / 2604.25665

LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

LLM-ReSum:通过自我评估实现大语言模型反思性摘要的框架
Nguyen, Huyen, Zhang, Haoxuan, Zhang, Yang, Ding, Junhua, Chen, Haihua
Abstract
Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released on GitHub.
Chinese Translation
对大语言模型(LLM)生成的摘要进行可靠评估仍然是一个开放的挑战,特别是在异构领域和文档长度方面。我们在横跨五个领域的七个数据集上,对14种自动摘要指标和基于LLM的评估器进行了全面的元评估,文档涵盖从短新闻文章到长篇科学、政府和法律文本(2K-27K词),并包含超过1,500个人工标注的摘要。我们的结果表明,传统的词汇重叠指标(如ROUGE、BLEU)与人类评判之间的相关性较弱或为负,而特定任务的神经指标和基于LLM的评估器则实现了显著更高的一致性,尤其是在语言质量评估方面。基于这些发现,我们提出了LLM-ReSum,一个自反思的摘要框架,它在不进行模型微调的情况下,将基于LLM的评估与生成集成在一个闭环反馈中。在三个领域中,LLM-ReSum在事实准确性上提升了低质量摘要最多33%,在覆盖率上提升了39%,人类评估者在89%的情况下更倾向于选择改进后的摘要。此外,我们还引入了PatentSumEval,这是一个新的法律文档摘要的人类标注基准,包含180个专家评估的摘要。所有代码和数据集将发布在GitHub上。
cs.CL / 44 / 2604.25674

Modeling Human-Like Color Naming Behavior in Context

在上下文中建模类人颜色命名行为
Zhang, Yuqing, Ürker, Ecesu, Verhoef, Tessa, Boleda, Gemma, Bisazza, Arianna
Abstract
Modeling the emergence of human-like lexicons in computational systems has advanced through the use of interacting neural agents, which simulate both learning and communicative pressures. The NeLLCom-Lex framework (Zhang et al., 2025) allows neural agents to develop pragmatic color naming behavior and human-like lexicons through supervised learning (SL) from human data and reinforcement learning (RL) in referential games. Despite these successes, the lexicons that emerge diverge systematically from human color categories, producing highly non-convex regions in color space, which contrast with the convexity typical of human categories. To address this, we introduce two factors, upsampling rare color terms during SL and multi-listener RL interactions, and adopt a convexity measure to quantify geometric coherence. We find that upsampling improves lexical diversity and system-level informativeness of the color lexicon, while many-listener setups promote more convex color categories. The combination of moderate upsampling and multiple listeners produces lexicons most similar to human systems.
Chinese Translation
借助相互交互的神经智能体,在计算系统中对类人词汇涌现的建模取得了进展,这些智能体同时模拟学习压力和交流压力。NeLLCom-Lex框架(Zhang et al., 2025)使神经智能体能够通过从人类数据中进行监督学习(SL)和在指称游戏中进行强化学习(RL)来发展出语用的颜色命名行为和类人词汇。尽管取得了这些成功,所涌现的词汇仍与人类颜色类别存在系统性偏差,导致颜色空间中出现高度非凸的区域,这与人类类别典型的凸性形成对比。为了解决这个问题,我们引入了两个因素:在SL过程中对稀有颜色术语进行上采样,以及多听者的RL交互,并采用凸性度量来量化几何一致性。我们发现,上采样提高了颜色词汇的多样性和系统级的信息量,而多听者设置促进了更凸的颜色类别。适度的上采样与多个听者的结合产生的词汇与人类系统最为相似。
cs.CL / 45 / 2604.25676

CORAL: Adaptive Retrieval Loop for Culturally-Aligned Multilingual RAG

CORAL:面向文化对齐的多语言检索增强生成的自适应检索循环
Lee, Nayeon, Song, Jiwoo, Kang, Byeongcheol
Abstract
Multilingual retrieval-augmented generation (mRAG) is often implemented within a fixed retrieval space, typically via query or document translation or multilingual embedding vector representations. However, this approach may be inadequate for culturally grounded queries, in which retrieval-condition misalignment may occur. Even strong retrievers and generators may struggle to produce culturally relevant answers when sourcing evidence from inappropriate linguistic or regional contexts. To this end, we introduce CORAL (COntext-aware Retrieval with Agentic Loop), an adaptive retrieval methodology for mRAG that enables iterative refinement of both the retrieval space (corpora) and the retrieval probe (query) based on the quality of the evidence. The overall process includes: (1) selecting corpora, (2) retrieving documents, (3) critiquing evidence for relevance and cultural alignment, and (4) checking sufficiency. If the retrieved documents are insufficient to answer the query correctly, the system (5) reselects corpora and rewrites the query. Across two cultural QA benchmarks, CORAL achieves up to a 3.58 percentage-point accuracy improvement on low-resource languages relative to the strongest baselines.
Chinese Translation
多语言检索增强生成(mRAG)通常在固定的检索空间内实现,通常通过查询或文档翻译或多语言嵌入向量表示。然而,这种方法可能不足以处理文化背景相关的查询,因为可能会出现检索条件的不匹配。即使是强大的检索器和生成器,在从不恰当的语言或区域上下文中获取证据时,也可能难以产生文化相关的答案。为此,我们提出了CORAL(COntext-aware Retrieval with Agentic Loop),一种针对mRAG的自适应检索方法,能够根据证据的质量对检索空间(语料库)和检索探针(查询)进行迭代优化。整体过程包括:(1)选择语料库,(2)检索文档,(3)对证据的相关性和文化对齐进行评估,以及(4)检查充分性。如果检索到的文档不足以正确回答查询,系统将(5)重新选择语料库并重写查询。在两个文化问答基准测试中,CORAL在低资源语言上相对于最强基线实现了高达3.58个百分点的准确率提升。
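The five-step process in this abstract can be sketched as a bounded loop. This is only an illustration of the control flow as the abstract describes it: the function name `adaptive_retrieval`, the callable parameters, and the `max_rounds` cap are assumptions, and each callable stands in for a component (corpus selector, retriever, critic, sufficiency check, query rewriter) that the abstract does not specify.

```python
from typing import Callable, Sequence


def adaptive_retrieval(
    query: str,
    select_corpora: Callable[[str, int], Sequence[str]],
    retrieve: Callable[[str, Sequence[str]], list[str]],
    critique: Callable[[str, list[str]], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    rewrite: Callable[[str, list[str]], str],
    max_rounds: int = 3,  # hypothetical cap; the abstract gives no limit
) -> tuple[str, list[str]]:
    """Run steps (1)-(5) until the evidence suffices or rounds run out.

    Returns the final (possibly rewritten) query and the evidence kept
    by the critique step in the last round.
    """
    evidence: list[str] = []
    for round_idx in range(max_rounds):
        corpora = select_corpora(query, round_idx)  # (1), or reselection in (5)
        docs = retrieve(query, corpora)             # (2)
        evidence = critique(query, docs)            # (3) keep relevant, aligned docs
        if is_sufficient(query, evidence):          # (4)
            break
        query = rewrite(query, evidence)            # (5) rewrite the probe
    return query, evidence
```

Passing the round index to `select_corpora` is one simple way to let the selector choose a different corpus set on each retry; a real system would more likely condition on the critique output.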
cs.CL / 46 / 2604.25702

Backtranslation Augmented Direct Preference Optimization for Neural Machine Translation

基于回译增强的直接偏好优化用于神经机器翻译
Ghassabi, Mehrdad, Rajabi, Spehr, Kashani, Hamidreza Baradaran, Hakim, Sadra, Keivandarian, Mahshid
Abstract
Contemporary neural machine translation (NMT) systems are almost exclusively built by training on supervised parallel data. Despite the tremendous progress achieved, these systems still exhibit persistent translation errors. This paper proposes that a post-training paradigm based on reinforcement learning (RL) can effectively rectify such mistakes. We introduce a novel framework that requires only a general text corpus and an expert translator, either a human or an AI system, to provide iterative feedback. In our experiments, we focus specifically on English-to-German translation as a representative high-resource language pair. Crucially, we implement this RL-based post-training using Direct Preference Optimization (DPO). Applying our DPO-driven framework to the gemma3-1b model yields a significant improvement in translation quality, elevating its COMET score from 0.703 to 0.747 on the English-to-German task. The results demonstrate that DPO offers an efficient and stable pathway for enhancing pre-trained NMT models through preference-based post-training.
Chinese Translation
当代神经机器翻译(NMT)系统几乎完全依赖于监督平行数据进行训练。尽管取得了巨大的进展,这些系统仍然表现出持续的翻译错误。本文提出了一种基于强化学习(RL)的后训练范式,可以有效纠正这些错误。我们引入了一个新颖的框架,该框架只需要一个通用文本语料库和一个专家翻译者,后者可以是人类或人工智能系统,以提供迭代反馈。在我们的实验中,我们特别关注英德翻译,作为一个具有代表性的高资源语言对。关键是,我们使用直接偏好优化(DPO)实现了这种基于RL的后训练。将我们的DPO驱动框架应用于gemma3-1b模型,显著提高了翻译质量,使其在英德任务上的COMET得分从0.703提升至0.747。结果表明,DPO为通过基于偏好的后训练提升预训练NMT模型提供了一条高效且稳定的途径。
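For readers unfamiliar with Direct Preference Optimization, the standard DPO objective for a single preference pair can be written in a few lines. The sketch below uses the published DPO loss, not anything specific to this paper: the function name `dpo_loss`, the argument names, and the default `beta` are assumptions, and the abstract does not say how the preferred and rejected translations are constructed.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical default; the abstract gives no value
) -> float:
    """Standard DPO loss for one (chosen, rejected) translation pair.

    Each argument is the summed token log-probability of a translation
    under the trainable policy or the frozen reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy equals the reference model the loss is log 2, and it falls as the policy raises the chosen translation's likelihood relative to the rejected one.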
cs.CL / 47 / 2604.25716

Cross-Lingual Jailbreak Detection via Semantic Codebooks

通过语义代码簿进行跨语言越狱检测
Alanova, Shirin, Minko, Bogdan, Sadiekh, Sabrina, Kokuykin, Evgeniy
Abstract
Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates, exposing a structural cross-lingual security gap. We investigate whether such attacks can be mitigated through language-agnostic semantic similarity without retraining or language-specific adaptation. Our approach compares multilingual query embeddings against a fixed English codebook of jailbreak prompts, operating as a training-free external guardrail for black-box LLMs. We conduct a systematic evaluation across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5). Our results reveal two distinct regimes of cross-lingual transfer. On curated benchmarks containing canonical jailbreak templates, semantic similarity generalizes reliably across languages, achieving near-perfect separability (AUC up to 0.99) and substantial reductions in absolute attack success rates under strict low-false-positive constraints. However, under distribution shift - on behaviorally diverse and heterogeneous unsafe benchmarks - separability degrades markedly (AUC ≈ 0.60-0.70), and recall in the security-critical low-FPR regime drops across all embedding models.
Chinese Translation
大型语言模型(LLMs)的安全机制仍然主要以英语为中心,这在多语言部署中造成了系统性脆弱性。先前的研究表明,将恶意提示翻译成其他语言可以显著提高越狱成功率,暴露出结构性跨语言安全漏洞。我们研究了是否可以通过与语言无关的语义相似性来缓解此类攻击,而无需重新训练或进行特定语言的适配。我们的方法将多语言查询嵌入与固定的英语越狱提示代码簿进行比较,作为黑箱LLMs的无训练外部保护措施。我们在四种语言、两条翻译管道、四个安全基准、三种嵌入模型和三种目标LLMs(Qwen、Llama、GPT-3.5)上进行了系统评估。我们的结果揭示了跨语言迁移的两种截然不同的情形。在包含典型越狱模板的精心策划的基准上,语义相似性在不同语言间可靠地泛化,实现了近乎完美的可分离性(AUC高达0.99),并在严格的低假阳性约束下显著降低了绝对攻击成功率。然而,在分布偏移的情况下(即在行为多样且异质的不安全内容基准上),可分离性显著下降(AUC约为0.60-0.70),并且在安全关键的低假阳性率(low-FPR)区间,所有嵌入模型的召回率均有所下降。
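The codebook comparison described in this abstract reduces to a nearest-neighbour similarity check. The sketch below assumes embeddings are already computed by some multilingual model; the function names, the use of the maximum cosine similarity as the score, and the `threshold` value are assumptions, since the abstract does not give the scoring rule or the operating point.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def max_codebook_similarity(query_vec: Vector, codebook: Sequence[Vector]) -> float:
    """Highest similarity between the query and any codebook entry."""
    return max((cosine(query_vec, entry) for entry in codebook), default=0.0)


def is_jailbreak(
    query_vec: Vector,
    codebook: Sequence[Vector],
    threshold: float = 0.8,  # hypothetical; would be tuned for a target FPR
) -> bool:
    """Flag the query when it is close enough to a known jailbreak prompt."""
    return max_codebook_similarity(query_vec, codebook) >= threshold
```

Because the check only needs embeddings, it can sit in front of a black-box LLM without retraining, which matches the training-free guardrail setting the abstract describes; the threshold governs the trade-off between recall and false positives.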
cs.CL / 48 / 2604.25774

CGU-ILALab at FoodBench-QA 2026: Comparing Traditional and LLM-based Approaches for Recipe Nutrient Estimation

CGU-ILALab在FoodBench-QA 2026:比较传统方法与基于大型语言模型的食谱营养估计方法
Chen, Wei-Chun, Chen, Yu-Xuan, Chung, I-Fang, Lin, Ying-Jia
Abstract
Accurate nutrient estimation from unstructured recipe text is an important yet challenging problem in dietary monitoring, due to ambiguous ingredient terminology and highly variable quantity expressions. We systematically evaluate models spanning a wide range of representational capacity, from lexical matching methods (TF-IDF with Ridge Regression), to deep semantic encoders (DeBERTa-v3), to generative reasoning with large language models (LLMs). Under the strict tolerance criteria defined by EU Regulation 1169/2011, our empirical results reveal a clear trade-off between predictive accuracy and computational efficiency. The TF-IDF baseline achieves moderate nutrient estimation performance with near-instantaneous inference, whereas the DeBERTa-v3 encoder performs poorly under task-specific data scarcity. In contrast, few-shot LLM inference (e.g., Gemini 2.5 Flash) and a hybrid LLM refinement pipeline (TF-IDF combined with Gemini 2.5 Flash) deliver the highest validation accuracy across all nutrient categories. These improvements likely arise from the ability of LLMs to leverage pre-trained world knowledge to resolve ambiguous terminology and normalize non-standard units, which remain difficult for purely lexical approaches. However, these gains come at the cost of substantially higher inference latency, highlighting a practical deployment trade-off between real-time efficiency and nutritional precision in dietary monitoring systems.
Chinese Translation
从非结构化食谱文本中准确估计营养成分是饮食监测中一个重要但具有挑战性的问题,这主要是由于成分术语的模糊性和数量表达的高度变异性。我们系统地评估了涵盖广泛表示能力的模型,从词汇匹配方法(TF-IDF与岭回归)到深度语义编码器(DeBERTa-v3),再到使用大型语言模型(LLMs)进行生成推理。在欧盟法规1169/2011定义的严格容忍标准下,我们的实证结果揭示了预测准确性与计算效率之间的明显权衡。TF-IDF基线在几乎瞬时推理的情况下实现了适度的营养估计性能,而DeBERTa-v3编码器在特定任务数据稀缺的情况下表现不佳。相比之下,少量样本的LLM推理(例如,Gemini 2.5 Flash)和混合LLM精炼管道(TF-IDF结合Gemini 2.5 Flash)在所有营养类别中提供了最高的验证准确性。这些改进可能源于LLMs利用预训练的世界知识来解决模糊术语和规范非标准单位的能力,而这些对于纯词汇方法仍然是困难的。然而,这些收益伴随着显著更高的推理延迟,突显了在饮食监测系统中实时效率与营养精确性之间的实际部署权衡。
cs.CL / 49 / 2604.25776

Unrequited Emotions: Investigating the Gaps in Motivation and Practice in Speech Emotion Recognition Research

单相思情感:探讨语音情感识别研究中动机与实践的差距
Wong, Taryn, Talat, Zeerak, Aldarmaki, Hanan, Field, Anjalie
Abstract
Critical analyses of emotion recognition technology have raised ethical concerns around task validity and potential downstream impacts, urging researchers to ensure alignment between their stated motivations and practice. However, these discussions have not adequately influenced or drawn from research on speech emotion recognition (SER). We address this gap by conducting a systematic survey of SER research to uncover what stated motivations drive this work and if they align with the datasets and emotions studied. We find that while SER research identifies appealing goals, such as well-situated voice-activated systems or healthcare applications, commonly-used datasets do not reflect these proposed deployment contexts, thus presenting a gap between motivations and research practices. We argue that such gaps engender ethical concerns, and that SER research should reassert itself with concrete use-cases to prevent misinterpretations, misuse, and downstream harms.
Chinese Translation
对情感识别技术的批判性分析引发了关于任务有效性和潜在后果的伦理担忧,敦促研究人员确保其声明的动机与实践之间的一致性。然而,这些讨论并未充分影响或借鉴语音情感识别(SER)研究。我们通过对SER研究进行系统调查来填补这一空白,以揭示推动该工作的声明动机以及这些动机是否与所研究的数据集和情感相一致。我们发现,尽管SER研究识别出了一些吸引人的目标,例如适合的语音激活系统或医疗应用,但常用的数据集并未反映这些提议的部署背景,从而在动机与研究实践之间形成了差距。我们认为,这种差距引发了伦理担忧,SER研究应通过具体的应用案例重新确立自身,以防止误解、误用和潜在的后果。
cs.CL / 50 / 2604.25783

Subliminal Steering: Stronger Encoding of Hidden Signals

潜意识引导:对隐藏信号的更强编码
Morgulis, George, Hewitt, John
Abstract
Subliminal learning describes a student language model inheriting a behavioral bias by fine-tuning on seemingly innocuous data generated by a biased teacher model. Prior work has begun to characterize this phenomenon but leaves open questions about the scope of signals it can transfer, the mechanisms that explain it, and the precision with which a bias can be encoded by seemingly unrelated data. We tackle all three problems by introducing subliminal steering, a variant of subliminal learning in which the teacher's bias is implemented not via a system prompt, as in prior work, but through a steering vector trained to maximize the likelihood of a set of target samples. First, we show that subliminal steering transfers complex multi-word biases, whereas prior work focused on single-word preferences, demonstrating a large scope of subliminally transferrable signals. Second, we provide mechanistic evidence that subliminal learning transfers not only the target behavioral bias, but also the steering vector itself, localized to the layers at which the teacher was steered. Finally, we show that the bias is encoded with surprising precision. We train a new steering vector directly on the subliminally-laden dataset and find that it attains high cosine similarity with the original vector.
Chinese Translation
潜意识学习描述了一种学生语言模型通过在由偏见教师模型生成的看似无害的数据上进行微调而继承行为偏见的过程。先前的研究开始对这一现象进行特征描述,但仍然存在关于其能够传递的信号范围、解释其机制以及通过看似无关的数据编码偏见的精确度等问题。我们通过引入潜意识引导这一变体来解决这三个问题,在该变体中,教师的偏见不是通过系统提示实现的,如先前的研究,而是通过一个训练以最大化一组目标样本似然性的引导向量来实现。首先,我们展示了潜意识引导能够传递复杂的多词偏见,而先前的研究则集中于单词偏好,证明了潜意识可传递信号的广泛范围。其次,我们提供了机制证据,表明潜意识学习不仅传递目标行为偏见,还传递引导向量本身,该向量局限于教师被引导的层次。最后,我们展示了偏见的编码具有惊人的精确度。我们直接在潜意识负载的数据集上训练了一个新的引导向量,并发现它与原始向量具有高余弦相似度。
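The final check described in the abstract above (comparing a steering vector re-trained on the student data against the original one) comes down to a cosine similarity. A minimal sketch, assuming both vectors are plain lists of floats; the toy values below are hypothetical and are not taken from the paper:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical stand-ins for the original and the re-trained steering vector.
original_vector = [0.5, -1.0, 2.0, 0.0]
recovered_vector = [0.4, -0.9, 2.1, 0.1]
print(cosine_similarity(original_vector, recovered_vector))
```

A value near 1 means the two vectors point in almost the same direction, regardless of their magnitudes.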
cs.CL / 51 / 2604.25806

MAIC-UI: Making Interactive Courseware with Generative UI

MAIC-UI:利用生成式用户界面制作互动课件
Tu, Shangqing, Li, Yanjia, Chen, Keyu, Zhang, Sichen, Yu, Jifan, Zhang-Li, Daniel, Hou, Lei, Li, Juanzi, Zhang, Yu, Liu, Huiqin
Abstract
Creating interactive STEM courseware traditionally requires HTML/CSS/JavaScript expertise, which is a barrier for educators. While generative AI can produce HTML code, existing tools generate static presentations rather than interactive simulations, struggle with long documents, and lack pedagogical accuracy mechanisms. Furthermore, full regeneration for modifications requires 200 to 600 seconds, disrupting creative flow. We present MAIC-UI, a zero-code authoring system that enables educators to create and rapidly edit interactive courseware from textbooks, PPTs, and PDFs. MAIC-UI employs: (1) structured knowledge analysis with multi-modal understanding to ensure pedagogical rigor; (2) a two-stage generate-verify-optimize pipeline separating content alignment from visual refinement; and (3) Click-to-Locate editing with Unified Diff-based incremental generation achieving sub-10-second iteration cycles. A controlled lab study with 40 participants shows MAIC-UI reduces editing iterations (4.9 vs. 7.0) and significantly improves learnability and controllability compared to direct Text-to-HTML generation. A three-month classroom deployment with 53 high school students demonstrates that MAIC-UI fosters learning agency and reduces outcome disparities: the pilot class achieved 9.21-point gains in STEM subjects compared to -2.32 points in control classes. Our code is available at https://github.com/THU-MAIC/MAIC-UI.
Chinese Translation
创建互动的STEM课件传统上需要HTML/CSS/JavaScript的专业知识,这给教育工作者带来了障碍。虽然生成式人工智能可以生成HTML代码,但现有工具生成的是静态演示而非互动模拟,处理长文档时表现不佳,并且缺乏教学准确性机制。此外,修改时的完全重生成需要200到600秒,打断了创作流程。我们提出了MAIC-UI,这是一种零代码创作系统,使教育工作者能够从教科书、PPT和PDF快速创建和编辑互动课件。MAIC-UI采用:(1) 结构化知识分析与多模态理解,以确保教学严谨性;(2) 两阶段生成-验证-优化流程,将内容对齐与视觉优化分开;(3) 基于统一差异(Unified Diff)增量生成的点击定位(Click-to-Locate)编辑,实现了不到10秒的迭代周期。对40名参与者进行的受控实验研究表明,与直接的文本到HTML生成相比,MAIC-UI减少了编辑迭代次数(4.9次对比7.0次),并显著提高了可学习性和可控性。面向53名高中学生的为期三个月的课堂部署显示,MAIC-UI促进了学习自主性,并减少了学习成果差异:试点班在STEM科目上取得了9.21分的提升,而对照班则下降了2.32分。我们的代码可在 https://github.com/THU-MAIC/MAIC-UI 获取。
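The abstract above attributes the fast edit cycle to unified-diff-based incremental generation: a change is expressed as a small patch rather than a regenerated page. The following is only an illustrative sketch of that idea using Python's standard `difflib`; the function name `incremental_patch`, the file labels, and the sample HTML are hypothetical and do not come from the MAIC-UI code base:

```python
import difflib


def incremental_patch(old_html: str, new_html: str) -> str:
    """Return a unified diff that turns old_html into new_html."""
    old_lines = old_html.splitlines(keepends=True)
    new_lines = new_html.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile="courseware.html (before)",
        tofile="courseware.html (after)",
    )
    return "".join(diff)


# Hypothetical courseware page: a slider that drives a pendulum simulation.
old_page = (
    "<h1>Pendulum</h1>\n"
    '<input id="length" type="range" min="1" max="5">\n'
    '<canvas id="sim"></canvas>\n'
)
# The requested edit only widens the slider range.
new_page = old_page.replace('max="5"', 'max="10"')

patch = incremental_patch(old_page, new_page)
print(patch)
```

Only the changed line appears with `-` and `+` markers, so the amount of text to produce scales with the edit rather than with the size of the page.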
cs.CL / 52 / 2604.25840

PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators

PSI-Bench:面向临床基础和可解释的抑郁症患者模拟器评估
Hoang, Nguyen Khoi, Mehri, Shuhaib, Hsu, Tse-An, Sun, Yi-Jyun, Truong, Quynh Xuan Nguyen, Doan, Khoa D, Hakkani-Tür, Dilek
Abstract
Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions. Simulating depressed patients is particularly challenging, as safety constraints and high patient variability complicate simulations and underscore the need for simulators that capture diverse and realistic patient behaviors. However, existing evaluations heavily rely on LLM-judges with poorly specified prompts and do not assess behavioral diversity. We introduce PSI-Bench, an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior across turn-, dialogue-, and population-level dimensions. Using PSI-Bench, we benchmark seven LLMs across two simulator frameworks and find that simulators produce overly long, lexically diverse responses, show reduced variability, resolve emotions too quickly, and follow a uniform negative-to-positive trajectory. We also show that the simulation framework has a larger impact on fidelity than the model scale. Results from a human study demonstrate that our benchmark is strongly aligned with expert judgments. Our work reveals key limitations of current depression patient simulators and provides an interpretable, extensible benchmark to guide future simulator design and evaluation.
Chinese Translation
患者模拟器在心理健康培训中越来越受到重视,因为它们提供了对复杂和敏感的患者互动的可扩展接触。模拟抑郁症患者尤其具有挑战性,因为安全约束和患者高度变异性使得模拟变得复杂,并强调了需要能够捕捉多样且真实的患者行为的模拟器。然而,现有的评估严重依赖于具有不明确提示的LLM评审,并未评估行为多样性。我们提出了PSI-Bench,一个自动评估框架,提供可解释的、临床基础的抑郁症患者模拟器行为诊断,涵盖回合、对话和人群层面的维度。使用PSI-Bench,我们对七个LLM在两个模拟器框架中进行了基准测试,发现模拟器产生过长、词汇多样性高的响应,表现出减少的变异性,情感反应过快,并遵循统一的负向到正向轨迹。我们还表明,模拟框架对保真度的影响大于模型规模。来自人类研究的结果表明,我们的基准与专家判断高度一致。我们的工作揭示了当前抑郁症患者模拟器的关键局限性,并提供了一个可解释、可扩展的基准,以指导未来的模拟器设计和评估。
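Two of the findings in the abstract above (overly long, lexically diverse responses and reduced variability) concern quantities that are easy to compute. The sketch below shows generic versions of such measurements: token count and type-token ratio per turn, and the spread of response lengths across a set of responses. These are assumed stand-ins chosen for illustration, not the metrics PSI-Bench actually defines:

```python
import re
import statistics


def tokenize(text):
    """Lowercase word tokenizer (a deliberately crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def turn_metrics(response):
    """Turn-level length and lexical diversity (type-token ratio)."""
    tokens = tokenize(response)
    count = len(tokens)
    ratio = len(set(tokens)) / count if count else 0.0
    return {"length": count, "type_token_ratio": ratio}


def length_spread(responses):
    """Population standard deviation of response lengths."""
    lengths = [turn_metrics(r)["length"] for r in responses]
    return statistics.pstdev(lengths)


# Hypothetical simulator outputs.
simulated = [
    "I feel tired. I feel so tired.",
    "Nothing seems to help lately.",
]
for reply in simulated:
    print(turn_metrics(reply))
print(length_spread(simulated))
```

Comparing these numbers between simulated and real patient turns is one way a "too long" or "too uniform" simulator would show up.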
cs.CL / 53 / 2604.25850

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

代理式工具工程:基于可观测性的编码代理工具自动演化
Lin, Jiahang, Liu, Shichun, Pan, Chengjun, Lin, Lizhi, Dou, Shihan, Huang, Xuanjing, Yan, Hang, Han, Zhenhua, Gui, Tao
Abstract
Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token trajectories, and edits whose effect is hard to attribute to the next round's outcomes. We introduce Agentic Harness Engineering (AHE), a framework that automates harness-level evolution by instrumenting the three stages of any engineering loop (component editing, trajectory inspection, and decision making) with matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. These results position observability-driven evolution as a practical pathway to keep coding-agent harnesses continually improving.
Chinese Translation
工具已成为编码代理性能的核心决定因素,影响模型与代码库、工具和执行环境的交互。然而,自动化工具工程却面临挑战:异构的动作空间、稀疏且嘈杂的评估信号、数百万标记的轨迹,以及难以将编辑效果归因于下一轮结果的情况。我们提出了代理式工具工程(Agentic Harness Engineering, AHE),这是一个通过为任何工程循环的三个阶段(组件编辑、轨迹检查和决策制定)配备匹配的可观测性支柱来自动化工具级演化的框架:(1)组件可观测性为每个可编辑的工具组件提供文件级表示,使得动作空间变得明确且可回退;(2)经验可观测性将数百万个原始轨迹标记提炼成分层的、可深入分析的证据库,以便不断演化的代理能够实际使用;(3)决策可观测性将每次编辑与自我声明的预测配对,随后与下一轮的任务级结果进行验证。这些支柱共同将每次编辑转变为可证伪的契约,使得工具演化能够自主进行,而不会陷入试错之中。实证结果显示,十次AHE迭代将Terminal-Bench 2上的pass@1从69.7%提升至77.0%,超越了人类设计的工具Codex-CLI(71.9%)以及自我演化的基线ACE和TF-GRPO。冻结后的工具无需重新演化即可迁移:在SWE-bench-verified上,它以比种子工具少12%的标记取得了最高的总体成功率,而在Terminal-Bench 2上,它在三个其他模型家族上实现了+5.1至+10.1个百分点的跨家族增益,表明演化出的组件编码了一般的工程经验,而非特定基准的调优。这些结果将基于可观测性的演化定位为持续改进编码代理工具的实用途径。
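The "falsifiable contract" idea in the abstract above (each edit carries a self-declared prediction that is checked against the next round's task outcomes) can be sketched as a small data structure. Everything here is a hypothetical illustration: the class `EditContract`, the function `verify`, the task IDs, and especially the keep-or-revert rule are assumptions, not AHE's actual implementation:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditContract:
    component_path: str         # file-level handle of the edited component
    description: str            # what the edit changes
    predicted_fixes: frozenset  # task IDs expected to flip from fail to pass


def verify(contract, before, after):
    """Check a contract against per-task pass/fail results of two rounds.

    before / after map task ID -> bool (passed). Missing tasks count as failed.
    """
    confirmed = {
        task for task in contract.predicted_fixes
        if not before.get(task, False) and after.get(task, False)
    }
    regressions = {
        task for task, passed in before.items()
        if passed and not after.get(task, False)
    }
    return {
        "confirmed": confirmed,
        "missed": set(contract.predicted_fixes) - confirmed,
        "regressions": regressions,
        # Assumed rule: keep the edit only if some prediction held
        # and nothing that used to pass now fails.
        "keep_edit": bool(confirmed) and not regressions,
    }


contract = EditContract(
    component_path="prompts/system.md",
    description="Tell the agent to run tests before submitting",
    predicted_fixes=frozenset({"task-07", "task-12"}),
)
round_n = {"task-01": True, "task-07": False, "task-12": False}
round_n_plus_1 = {"task-01": True, "task-07": True, "task-12": False}
print(verify(contract, round_n, round_n_plus_1))
```

Because the prediction is recorded before the next round runs, a missed prediction is visible as evidence against the edit instead of being lost in aggregate score noise.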
cs.CL / 54 / 2604.25853

G-Loss: Graph-Guided Fine-Tuning of Language Models

G-Loss:图引导的语言模型微调
Aditya, Sharma, Vinti, Agarwal, Rajesh, Kumar
Abstract
Traditional loss functions, including cross-entropy, contrastive, triplet, and supervised contrastive losses, used for fine-tuning pre-trained language models such as BERT, operate only within local neighborhoods and fail to account for the global semantic structure. We present G-Loss, a graph-guided loss function that incorporates semi-supervised label propagation to use structural relationships within the embedding manifold. G-Loss builds a document-similarity graph that captures global semantic relationships, thereby guiding the model to learn more discriminative and robust embeddings. We evaluate G-Loss on five benchmark datasets covering key downstream classification tasks: MR (sentiment analysis), R8 and R52 (topic categorization), Ohsumed (medical document classification), and 20NG (news categorization). In the majority of experimental setups, G-Loss converges faster and produces semantically coherent embedding spaces, resulting in higher classification accuracy than models fine-tuned with traditional loss functions.
Chinese Translation
传统的损失函数,包括交叉熵、对比损失、三元组损失和监督对比损失,通常用于微调预训练语言模型(如BERT),仅在局部邻域内操作,未能考虑全局语义结构。我们提出了G-Loss,一种图引导的损失函数,结合半监督标签传播,利用嵌入流形内的结构关系。G-Loss构建了一个文档相似性图,捕捉全局语义关系,从而引导模型学习更具区分性和鲁棒性的嵌入。我们在五个基准数据集上评估了G-Loss,这些数据集涵盖了关键的下游分类任务:MR(情感分析)、R8和R52(主题分类)、Ohsumed(医学文档分类)和20NG(新闻分类)。在大多数实验设置中,G-Loss收敛更快,并产生语义一致的嵌入空间,导致比使用传统损失函数微调的模型更高的分类准确率。
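The abstract above builds on semi-supervised label propagation over a document-similarity graph. The sketch below shows the generic propagation step only, on a hand-written similarity matrix with two labeled seed documents; it is not the G-Loss objective itself, and the function name, the weights, and the clamping scheme are assumptions made for illustration:

```python
def propagate_labels(weights, seed_labels, num_classes, iterations=50):
    """Spread seed labels over a weighted graph.

    weights[i][j] is the similarity between documents i and j.
    seed_labels maps a document index to its known class.
    Returns one predicted class per document.
    """
    n = len(weights)
    scores = [[0.0] * num_classes for _ in range(n)]
    for node, label in seed_labels.items():
        scores[node][label] = 1.0

    for _ in range(iterations):
        updated = []
        for i in range(n):
            total = sum(weights[i])
            if i in seed_labels or total == 0:
                # Seeds stay clamped; isolated nodes keep their scores.
                updated.append(scores[i][:])
                continue
            updated.append([
                sum(weights[i][j] * scores[j][c] for j in range(n)) / total
                for c in range(num_classes)
            ])
        scores = updated

    return [max(range(num_classes), key=lambda c: row[c]) for row in scores]


# Hypothetical graph: documents 0-2 form one cluster, 3-4 another,
# joined by a single weak edge between 2 and 3.
similarity = [
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0],
    [0.8, 0.9, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0],
]
predicted = propagate_labels(similarity, seed_labels={0: 0, 4: 1}, num_classes=2)
print(predicted)
```

Unlabeled documents inherit the label of the cluster they are most strongly connected to, which is the kind of global structure a purely pairwise loss does not see.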
cs.CL / 55 / 2604.25860

Luminol-AIDetect: Fast Zero-shot Machine-Generated Text Detection based on Perplexity under Text Shuffling

Luminol-AIDetect:基于文本洗牌下困惑度的快速零样本机器生成文本检测
La Cava, Lucio, Tagarelli, Andrea
Abstract
Machine-generated text (MGT) detection requires identifying structurally invariant signals across generation models, rather than relying on model-specific fingerprints. In this respect, we hypothesize that while large language models excel at local semantic consistency, their autoregressive nature results in a specific kind of structural fragility compared to human writing. We propose Luminol-AIDetect, a novel, zero-shot statistical approach that exposes this fragility through coherence disruption. By applying a simple randomized text-shuffling procedure, we demonstrate that the resulting shift in perplexity serves as a principled, model-agnostic discriminant, as MGT displays a characteristic dispersion in perplexity-under-shuffling that differs markedly from the more stable structural variability of human-written text. Luminol-AIDetect leverages this distinction to inform its decision process, where a handful of perplexity-based scalar features are extracted from an input text and its shuffled version, then detection is performed via density estimation and ensemble-based prediction. Evaluated across 8 content domains, 11 adversarial attack types, and 18 languages, Luminol-AIDetect demonstrates state-of-the-art performance, with gains up to 17x lower FPR while being cheaper than prior methods.
Chinese Translation
机器生成文本(MGT)检测需要识别跨生成模型的结构不变信号,而不是依赖于特定模型的指纹。在这方面,我们假设虽然大型语言模型在局部语义一致性方面表现出色,但其自回归特性导致与人类写作相比,存在一种特定类型的结构脆弱性。我们提出了Luminol-AIDetect,这是一种新颖的零样本统计方法,通过连贯性破坏来揭示这种脆弱性。通过应用简单的随机文本洗牌程序,我们证明了由此产生的困惑度变化可以作为一种有原则的、模型无关的判别标准,因为MGT在洗牌下的困惑度表现出特征性分散,这与人类写作文本的更稳定结构变异性显著不同。Luminol-AIDetect利用这一区别来指导其决策过程,从输入文本及其洗牌版本中提取少量基于困惑度的标量特征,然后通过密度估计和基于集成的预测进行检测。在8个内容领域、11种对抗攻击类型和18种语言的评估中,Luminol-AIDetect展示了最先进的性能,FPR降低高达17倍,同时成本低于先前的方法。
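The core signal in the abstract above is the change in perplexity when a text is shuffled. The sketch below computes that shift with a toy add-one-smoothed bigram model standing in for the language model; the real method scores text with an LLM and then applies density estimation and an ensemble, none of which is reproduced here. The function names, the smoothing choice, and the tiny corpus are all assumptions:

```python
import math
import random
from collections import Counter


def train_bigram(tokens):
    """Count contexts and bigrams; returns (contexts, bigrams, vocab_size)."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(set(tokens))


def perplexity(tokens, model):
    """Perplexity under the bigram model with add-one smoothing."""
    contexts, bigrams, vocab_size = model
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # The extra +1 in the denominator reserves mass for unseen words.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size + 1)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))


def perplexity_shift(text, model, seed=0):
    """Perplexity of a word-shuffled copy minus perplexity of the original."""
    tokens = text.lower().split()
    shuffled = tokens[:]
    random.Random(seed).shuffle(shuffled)
    return perplexity(shuffled, model) - perplexity(tokens, model)


corpus = "the cat sat on the mat the dog sat on the rug".split()
model = train_bigram(corpus)
print(perplexity_shift("the cat sat on the mat", model))
```

In a detector, this shift (and related statistics from several shuffles) would be a feature; the decision rule on top of it is a separate component.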
cs.CL / 56 / 2604.25866

From Syntax to Emotion: A Mechanistic Analysis of Emotion Inference in LLMs

从语法到情感:大语言模型中情感推断的机制分析
Shu, Bangzhao, Singh, Arinjay, ElSherief, Mai
Abstract
Large language models (LLMs) are increasingly used in emotionally sensitive human-AI applications, yet little is known about how emotion recognition is internally represented. In this work, we investigate the internal mechanisms of emotion recognition in LLMs using sparse autoencoders (SAEs). By analyzing sparse feature activations across layers, we identify a consistent three-phase information flow, in which emotion-related features emerge only in the final phase. We further show that emotion representations comprise both shared features across emotions and emotion-specific features. Using phase-stratified causal tracing, we identify a small set of features that strongly influence emotion predictions, and show that both their number and causal impact vary across emotions; in particular, Disgust is more weakly and diffusely represented than other emotions. Finally, we propose an interpretable and data-efficient causal feature steering method that significantly improves emotion recognition performance across multiple models while largely preserving language modeling ability, and demonstrate that these improvements generalize across multiple emotion recognition datasets. Overall, our findings provide a systematic analysis of the internal mechanisms underlying emotion recognition in LLMs and introduce an efficient, interpretable, and controllable approach for improving model performance.
Chinese Translation
大型语言模型(LLMs)在情感敏感的人机应用中越来越多地被使用,但关于情感识别的内部表征知之甚少。在本研究中,我们使用稀疏自编码器(SAEs)探讨了LLMs中情感识别的内部机制。通过分析各层的稀疏特征激活,我们识别出一个一致的三阶段信息流,其中情感相关特征仅在最后阶段出现。我们进一步表明,情感表征既包括跨情感的共享特征,也包括特定于情感的特征。通过阶段分层因果追踪,我们识别出一小组对情感预测有强烈影响的特征,并显示这些特征的数量和因果影响在不同情感中有所不同;特别是,厌恶(Disgust)的表征比其他情感更弱且更分散。最后,我们提出了一种可解释且数据高效的因果特征引导方法,该方法显著提高了多个模型的情感识别性能,同时在很大程度上保留了语言建模能力,并证明这些改进在多个情感识别数据集上具有普遍性。总体而言,我们的研究结果提供了对LLMs中情感识别内部机制的系统分析,并引入了一种高效、可解释且可控的方法来提高模型性能。
cs.CL / 57 / 2604.25902

Toward a Functional Geometric Algebra for Natural Language Semantics

面向自然语言语义的功能几何代数
Pustejovsky, James
Abstract
Distributional and neural approaches to natural language semantics have been built almost exclusively on conventional linear algebra: vectors, matrices, tensors, and the operations that accompany them. These methods have achieved remarkable empirical success, yet they face persistent structural limitations in compositional semantics, type sensitivity, and interpretability. I argue in this paper that geometric algebra (GA) -- specifically, Clifford algebras -- provides a mathematically superior foundation for semantic representation, and that a Functional Geometric Algebra (FGA) framework extends GA toward a typed, compositional semantics capable of supporting inference, transformation, and interpretability while retaining full compatibility with distributional learning and modern neural architectures. I develop the formal foundations, identify three core capabilities that GA provides and linear algebra does not, present a detailed worked example illustrating operator-level semantic contrasts, and show how GA-based operations already implicit in current transformer architectures can be made explicit and extended. The central claim is not merely increased dimensionality but increased structural organization: GA expands an $n$-dimensional embedding space into a $2^n$ multivector algebra where base semantic concepts and their higher-order interactions are represented within a single, principled algebraic framework.
Chinese Translation
分布式和神经网络方法在自然语言语义的研究中几乎完全建立在传统线性代数的基础上:向量、矩阵、张量及其相关操作。这些方法取得了显著的实证成功,但在组合语义、类型敏感性和可解释性方面面临持续的结构性限制。本文论证了几何代数(Geometric Algebra, GA)——特别是克利福德代数(Clifford algebras)——为语义表示提供了数学上更优越的基础,并且功能几何代数(Functional Geometric Algebra, FGA)框架将GA扩展到一种类型化的组合语义,能够支持推理、转换和可解释性,同时与分布式学习和现代神经架构保持完全兼容。我发展了形式基础,识别出GA提供的三种核心能力,而线性代数则不具备,展示了一个详细的实例,说明了操作层面的语义对比,并展示了如何将当前变换器架构中隐含的基于GA的操作显性化并加以扩展。中心论点不仅仅是维度的增加,而是结构组织的增强:GA将$n$维嵌入空间扩展为$2^n$多向量代数,其中基本语义概念及其高阶交互在一个单一的、原则性的代数框架内得以表示。
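The closing claim of the abstract above, that an $n$-dimensional space expands into a $2^n$-dimensional multivector algebra, can be made concrete by enumerating basis blades and multiplying them. This sketch uses the common bitmask encoding (bit i set means basis vector e_{i+1} is a factor) and assumes an orthonormal basis with Euclidean signature, so each e_i squares to +1. It illustrates standard geometric algebra and is not code from the paper:

```python
def basis_blades(n):
    """All 2**n basis blades of an n-dimensional space, as bitmasks."""
    return list(range(1 << n))


def blade_name(mask):
    """Human-readable name, e.g. 0b101 -> 'e13'; the scalar blade is '1'."""
    if mask == 0:
        return "1"
    digits = "".join(
        str(i + 1) for i in range(mask.bit_length()) if (mask >> i) & 1
    )
    return "e" + digits


def reorder_sign(a, b):
    """Sign from the swaps needed to sort the factors of blade a times blade b."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps & 1 else 1


def blade_product(a, b):
    """Geometric product of two basis blades: returns (sign, blade)."""
    return reorder_sign(a, b), a ^ b


blades = basis_blades(3)
print(len(blades), [blade_name(m) for m in blades])

sign, blade = blade_product(0b010, 0b001)  # e2 * e1
print(sign, blade_name(blade))
```

With n = 3 the eight blades are the scalar, three vectors, three bivectors, and one trivector, so pairwise and higher-order combinations of base concepts each get their own coordinate.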
cs.CL / 58 / 2604.25905

A paradox of AI fluency

人工智能流畅性的悖论
Potts, Christopher, Sudhof, Moritz
Abstract
How much does a user's skill with AI shape what AI actually delivers for them? This question is critical for users, AI product builders, and society at large, but it remains underexplored. Using a richly annotated sample of 27K transcripts from WildChat-4.8M, we show that fluent users take on more complex tasks than novices and adopt a fundamentally different interactional mode: they iterate collaboratively with the AI, refining goals and critically assessing outputs, whereas novices take a passive stance. These differences lead to a paradox of AI fluency: fluent users experience more failures than novices -- but their failures tend to be visible (a direct consequence of their engagement), they are more likely to lead to partial recovery, and they occur alongside greater success on complex tasks. Novices, by contrast, more often experience invisible failures: conversations that appear to end successfully but in fact miss the mark. Taken together, these results reframe what success with AI depends on. Individuals should adopt a stance of active engagement rather than passive acceptance. AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall. Our code and data are available at https://github.com/bigspinai/bigspin-fluency-outcomes
Chinese Translation
用户对人工智能的熟练程度在多大程度上影响人工智能为他们提供的实际结果?这个问题对用户、人工智能产品开发者以及整个社会都至关重要,但仍然未得到充分探讨。通过对27K个来自WildChat-4.8M的转录文本进行丰富注释的样本分析,我们发现流畅的用户承担比新手更复杂的任务,并采用一种根本不同的互动模式:他们与人工智能进行协作迭代,细化目标并批判性地评估输出,而新手则采取被动的态度。这些差异导致了人工智能流畅性的悖论:流畅的用户经历的失败比新手更多——但他们的失败往往是显而易见的(这是他们参与的直接结果),更可能导致部分恢复,并且伴随着在复杂任务上更大的成功。相比之下,新手更常经历隐形失败:看似成功结束的对话实际上未能达到目标。综合来看,这些结果重新定义了成功使用人工智能所依赖的因素。个体应采取主动参与的态度,而非被动接受。人工智能产品开发者应认识到,他们不仅在设计模型行为,还在设计用户行为;鼓励深度参与,而非无摩擦的体验,将整体上带来更多成功。我们的代码和数据可在 https://github.com/bigspinai/bigspin-fluency-outcomes 获取。
cs.CL / 59 / 2604.25914

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

DV-World:在真实场景中对数据可视化代理进行基准测试
Meng, Jinxiang, Huang, Shaoping, Lei, Fangyu, Guo, Jingyu, Liu, Haoxiang, Su, Jiahao, Wang, Sihan, Wang, Yao, Wang, Enrui, Yang, Ye, Chai, Hongze, Lv, Jinming, Yu, Anbang, Zhang, Huangjing, Zhang, Yitong, Huang, Yiming, Ma, Zeyao, He, Shizhu, Zhao, Jun, Liu, Kang
Abstract
Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet, existing benchmarks often suffer from code-sandbox confinement, single-language creation-only tasks, and assumption of perfect intent. To bridge these gaps, we introduce DV-World, a benchmark of 260 tasks designed to evaluate DV agents across real-world professional lifecycles. DV-World spans three domains: DV-Sheet for native spreadsheet manipulation including chart and dashboard creation as well as diagnostic repair; DV-Evolution for adapting and restructuring reference visual artifacts to fit new data across diverse programming paradigms; and DV-Interact for proactive intent alignment with a user simulator that mimics real-world ambiguous requirements. Our hybrid evaluation framework integrates Table-value Alignment for numerical precision and MLLM-as-a-Judge with rubrics for semantic-visual assessment. Experiments reveal that state-of-the-art models achieve less than 50% overall performance, exposing critical deficits in handling the complex challenges of real-world data visualization. DV-World provides a realistic testbed to steer development toward the versatile expertise required in enterprise workflows. Our data and code are available at https://github.com/DA-Open/DV-World.
Chinese Translation
真实世界的数据可视化(DV)需要本土环境的基础、跨平台的演变以及主动的意图对齐。然而,现有的基准测试往往受到代码沙箱限制、单一语言的创作任务以及对完美意图的假设等问题的困扰。为了解决这些问题,我们提出了DV-World,这是一个包含260个任务的基准,旨在评估DV代理在真实世界专业生命周期中的表现。DV-World涵盖三个领域:DV-Sheet用于本土电子表格操作,包括图表和仪表板的创建以及诊断修复;DV-Evolution用于适应和重构参考视觉工件,以适应不同编程范式下的新数据;DV-Interact用于与用户模拟器的主动意图对齐,该模拟器模拟真实世界中的模糊需求。我们的混合评估框架整合了表值对齐(Table-value Alignment)以确保数值精度,以及MLLM-as-a-Judge与语义视觉评估的评分标准。实验结果表明,最先进的模型整体表现不足50%,暴露出在处理真实世界数据可视化复杂挑战方面的关键缺陷。DV-World提供了一个现实的测试平台,以引导开发朝向企业工作流程中所需的多样化专业技能。我们的数据和代码可在 https://github.com/DA-Open/DV-World 获取。