Daily Research Digest

arXiv Papers

2026-02-24
413 Papers · 4 Categories
Robotics (63 papers)
cs.RO / 1 / 2602.18569

Design and Biomechanical Evaluation of a Lightweight Low-Complexity Soft Bilateral Ankle Exoskeleton

Mallah, Josée, Javed, Zakii, Azak, Zafer, Stone, Thomas, Occhipinti, Luigi G.
Abstract
Many people could benefit from exoskeleton assistance during gait, for either medical or nonmedical purposes. However, exoskeletons add mass and structure, which in turn must be compensated for. In this work, we present a lightweight, low-complexity, soft bilateral ankle exoskeleton for plantarflexion assistance, with a shoe attachment design that can be mounted on top of any pair of shoes. Experimental tests show no significant difference in lower limb kinematics and kinetics when wearing the exoskeleton in zero-torque mode relative to not wearing an exoskeleton, indicating that our device does not obstruct healthy gait and demonstrating that it is a compliant and comfortable device with promise for providing effective assistance. Accordingly, a control system has been developed, and additional tests are underway.
cs.RO / 2 / 2602.18603

Enhancing Goal Inference via Correction Timing

Wang, Anjiabei, Wang, Shuangge, Fitzgerald, Tesca
Abstract
Corrections offer a natural modality for people to provide feedback to a robot, by (i) intervening in the robot's behavior when they believe the robot is failing (or will fail) the task objectives and (ii) modifying the robot's behavior to successfully fulfill the task. Each correction offers information on what the robot should and should not do, where the corrected behavior is more aligned with task objectives than the original behavior. Most prior work on learning from corrections involves interpreting a correction as a new demonstration (consisting of the modified robot behavior), or a preference (for the modified trajectory compared to the robot's original behavior). However, this overlooks one essential element of the correction feedback, which is the human's decision to intervene in the robot's behavior in the first place. This decision can be influenced by multiple factors including the robot's task progress, alignment with human expectations, dynamics, motion legibility, and optimality. In this work, we investigate whether the timing of this decision can offer a useful signal for inferring these task-relevant influences. In particular, we investigate three potential applications for this learning signal: (1) identifying features of a robot's motion that may prompt people to correct it, (2) quickly inferring the final goal of a human's correction based on the timing and initial direction of their correction motion, and (3) learning more precise constraints for task objectives. Our results indicate that correction timing improves learning for the first two of these applications. Overall, our work provides new insights on the value of correction timing as a signal for robot learning.
cs.RO / 3 / 2602.18606

OVerSeeC: Open-Vocabulary Costmap Generation from Satellite Images and Natural Language

Rana, Rwik, Quattrociocchi, Jesse, Lee, Dongmyeong, Ellis, Christian, Adkins, Amanda, Uccello, Adam, Warnell, Garrett, Biswas, Joydeep
Abstract
Aerial imagery provides essential global context for autonomous navigation, enabling route planning at scales inaccessible to onboard sensing. We address the problem of generating global costmaps for long-range planning directly from satellite imagery when entities and mission-specific traversal rules are expressed in natural language at test time. This setting is challenging since mission requirements vary, terrain entities may be unknown at deployment, and user prompts often encode compositional traversal logic. Existing approaches relying on fixed ontologies and static cost mappings cannot accommodate such flexibility. While foundation models excel at language interpretation and open-vocabulary perception, no single model can simultaneously parse nuanced mission directives, locate arbitrary entities in large-scale imagery, and synthesize them into an executable cost function for planners. We therefore propose OVerSeeC, a zero-shot modular framework that decomposes the problem into Interpret-Locate-Synthesize: (i) an LLM extracts entities and ranked preferences, (ii) an open-vocabulary segmentation pipeline identifies these entities from high-resolution imagery, and (iii) the LLM uses the user's natural language preferences and masks to synthesize executable costmap code. Empirically, OVerSeeC handles novel entities, respects ranked and compositional preferences, and produces routes consistent with human-drawn trajectories across diverse regions, demonstrating robustness to distribution shifts. This shows that modular composition of foundation models enables open-vocabulary, preference-aligned costmap generation for scalable, mission-adaptive global planning.
cs.RO / 4 / 2602.18622

FORMICA: Decision-Focused Learning for Communication-Free Multi-Robot Task Allocation

Lopez, Antonio, Muirhead, Jack, Pinciroli, Carlo
Abstract
Most multi-robot task allocation methods rely on communication to resolve conflicts and reach consistent assignments. In environments with limited bandwidth, degraded infrastructure, or adversarial interference, existing approaches degrade sharply. We introduce a learning-based framework that achieves high-quality task allocation without any robot-to-robot communication. The key idea is that robots coordinate implicitly by predicting teammates' bids: if each robot can anticipate competition for a task, it can adjust its choices accordingly. Our method predicts bid distributions to correct systematic errors in analytical mean-field approximations. While analytical predictions assume idealized conditions (uniform distributions, known bid functions), our learned approach adapts to task clustering and spatial heterogeneity. Inspired by Smart Predict-then-Optimize (SPO), we train predictors end-to-end to minimize Task Allocation Regret rather than prediction error. To scale to large swarms, we develop a mean-field approximation where each robot predicts the distribution of competing bids rather than individual bids, reducing complexity from $O(NT)$ to $O(T)$. We call our approach FORMICA: Field-Oriented Regret-Minimizing Implicit Coordination Algorithm. Experiments show FORMICA substantially outperforms a natural analytical baseline. In scenarios with 16 robots and 64 tasks, our approach improves system reward by 17% and approaches the optimal MILP solution. When deployed on larger scenarios (256 robots, 4096 tasks), the same model improves performance by 7%, demonstrating strong generalization. Training requires only 21 seconds on a laptop, enabling rapid adaptation to new environments.
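The mean-field step above is the core complexity reduction: instead of predicting each of the N rivals' bids per task (O(NT)), each robot predicts one distribution of competing bids per task (O(T)) and picks the task with the highest expected utility. The sketch below illustrates that selection rule under an i.i.d. Gaussian rival-bid assumption of our own; it is not the paper's learned, regret-trained predictor.

```python
# Minimal sketch of mean-field, communication-free task selection.
# Assumption (ours): rival bids per task are i.i.d. Gaussian; FORMICA instead
# learns these distributions end-to-end to minimize allocation regret.
import numpy as np
from scipy.stats import norm

def select_task(my_bids, rival_mean, rival_std, rewards, n_rivals):
    """All array arguments have length T (one entry per task)."""
    p_beat_one = norm.cdf(my_bids, loc=rival_mean, scale=rival_std)
    p_win = p_beat_one ** n_rivals           # chance my bid tops all rivals
    return int(np.argmax(p_win * rewards))   # maximize expected reward

# Toy usage: 4 tasks, 15 competing robots.
rng = np.random.default_rng(0)
task = select_task(rng.uniform(0, 1, 4), np.full(4, 0.5), np.full(4, 0.2),
                   np.array([1.0, 2.0, 1.5, 0.5]), n_rivals=15)
```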
cs.RO / 5 / 2602.18638

Soft Surfaced Vision-Based Tactile Sensing for Bipedal Robot Applications

Kim, Jaeeun, Lim, Junhee, She, Yu
Abstract
Legged locomotion benefits from embodied sensing, where perception emerges from the physical interaction between body and environment. We present a soft-surfaced, vision-based tactile foot sensor that endows a bipedal robot with a skin-like deformable layer that captures contact deformations optically, turning foot-ground interactions into rich haptic signals. From a contact image stream, our method estimates contact pose (position and orientation), visualizes shear, computes center of pressure (CoP), classifies terrain, and detects geometric features of the contact patch. We validate these capabilities on a tilting platform and in visually obscured conditions, showing that foot-borne tactile feedback improves balance control and terrain awareness beyond proprioception alone. These findings suggest that integrating tactile perception into legged robot feet improves stability, adaptability, and environmental awareness, offering a promising direction toward more compliant and intelligent locomotion systems. For the supplementary video, please visit: https://youtu.be/ceJiy9q_2Aw
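As one concrete example of the quantities listed above, a center of pressure can be read off a tactile image once it is converted to a per-pixel pressure estimate. The sketch below is a generic weighted-centroid computation under an assumed grid geometry, not the paper's calibrated pipeline.

```python
# Minimal sketch: CoP as the pressure-weighted centroid of a tactile map.
# The (H, W) pressure map and the cell size are assumptions.
import numpy as np

def center_of_pressure(pressure: np.ndarray, cell_size: float = 1e-3):
    """Returns (x, y) in meters from the map origin, or None if no contact."""
    total = pressure.sum()
    if total <= 0:
        return None
    ys, xs = np.mgrid[0:pressure.shape[0], 0:pressure.shape[1]]
    return ((xs * pressure).sum() / total * cell_size,
            (ys * pressure).sum() / total * cell_size)
```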
cs.RO / 6 / 2602.18655

Infinite-Dimensional Closed-Loop Inverse Kinematics for Soft Robots via Neural Operators

Veil, Carina, Flaschel, Moritz, Kuhl, Ellen, Della Santina, Cosimo
Abstract
While kinematic inversion is a purely geometric problem for fully actuated rigid robots, it becomes extremely challenging for underactuated soft robots with infinitely many degrees of freedom. Closed-loop inverse kinematics (CLIK) schemes address this by introducing end-to-end mappings from actuation to task space for the controller to operate on, but typically assume finite dimensions of the underlying virtual configuration space. In this work, we extend CLIK to the infinite-dimensional domain to reason about the entire soft robot shape while solving tasks. We do this by composing an actuation-to-shape map with a shape-to-task map, deriving the differential end-to-end kinematics via an infinite-dimensional chain rule, and thereby obtaining a Jacobian-based CLIK algorithm. Since the actuation-to-shape mapping is rarely available in closed form, we propose to learn it from simulation data using neural operator networks, which are differentiable. We first present an analytical study on a constant-curvature segment, and then apply the neural version of the algorithm to a three-fiber soft robotic arm whose underlying model relies on morphoelasticity and active filament theory. This opens new possibilities for differentiable control of soft robots by exploiting full-body shape information in a continuous, infinite-dimensional framework.
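The differential chain-rule composition at the heart of the method can be pictured in finite dimensions: with an actuation-to-shape map f and a shape-to-task map g, the end-to-end Jacobian is the product of the two maps' Jacobians, and a CLIK step feeds the task error back through its pseudoinverse. The finite-difference sketch below is ours and stands in for the paper's differentiable, infinite-dimensional neural operator.

```python
# Minimal finite-dimensional sketch of a Jacobian-based CLIK iteration on the
# composed map g(f(a)); finite differences replace a learned, exactly
# differentiable model.
import numpy as np

def jacobian(func, x, eps=1e-6):
    y0 = func(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (func(x + dx) - y0) / eps
    return J

def clik_step(f, g, a, x_des, gain=1.0, dt=0.01):
    """One update of the actuation a toward task-space target x_des."""
    s = f(a)                                  # actuation -> shape
    J = jacobian(g, s) @ jacobian(f, a)       # chain rule: dg/ds @ df/da
    e = x_des - g(s)                          # task-space error
    return a + dt * np.linalg.pinv(J) @ (gain * e)
```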
cs.RO / 7 / 2602.18661

Robotic Fruits with Tunable Stiffness and Sensing: Towards a Methodology for Developing Realistic Physical Twins of Fruits

Nadipineni, Saitarun, Pandiyan, Keshav, Althoefer, Kaspar, Hirai, Shinichi, Lalitharatne, Thilina Dulantha
Abstract
The global agri-food sector faces increasing challenges from labour shortages, high consumer demand, and supply-chain disruptions, resulting in substantial losses of unharvested produce. Robotic harvesting has emerged as a promising alternative; however, evaluating and training soft grippers for delicate fruits remains difficult due to the highly variable mechanical properties of natural produce. This makes it difficult to establish reliable benchmarks or data-driven control strategies. Existing testing practices rely on large quantities of real fruit to capture this variability, leading to inefficiency, higher costs, and waste. The methodology presented in this work aims to address these limitations by developing tunable soft physical twins that emulate the stiffness characteristics of real fruits at different ripeness levels. A fiber-reinforced pneumatic physical twin of a kiwi fruit was designed and fabricated to replicate the stiffness at different ripeness levels. Experimental results show that the stiffness of the physical twin can be tuned accurately over multiple trials (97.35 - 99.43% accuracy). Gripping tasks with a commercial robotic gripper showed that sensor feedback from the physical twin can reflect the applied gripping forces. Finally, a stress test performed over 50 cycles showed reliable maintenance of the desired stiffness (0.56 - 1.10% error). This work shows promise that robotic physical twins can adjust their stiffness to resemble that of real fruits, providing a sustainable, controllable platform for benchmarking and training robotic grippers.
cs.RO / 8 / 2602.18663

Toward AI Autonomous Navigation for Mechanical Thrombectomy using Hierarchical Modular Multi-agent Reinforcement Learning (HM-MARL)

Robertshaw, Harry, Fischer, Nikola, Karstensen, Lennart, Jackson, Benjamin, Chen, Xingyu, Sadati, S. M. Hadi, Bergeles, Christos, Granados, Alejandro, Booth, Thomas C
Abstract
Mechanical thrombectomy (MT) is typically the optimal treatment for acute ischemic stroke involving large vessel occlusions, but access is limited due to geographic and logistical barriers. Reinforcement learning (RL) shows promise in autonomous endovascular navigation, but generalization across 'long' navigation tasks remains challenging. We propose a Hierarchical Modular Multi-Agent Reinforcement Learning (HM-MARL) framework for autonomous two-device navigation in vitro, enabling efficient and generalizable navigation. HM-MARL was developed to autonomously navigate a guide catheter and guidewire from the femoral artery to the internal carotid artery (ICA). A modular multi-agent approach was used to decompose the complex navigation task into specialized subtasks, each trained using Soft Actor-Critic RL. The framework was validated in both in silico and in vitro testbeds to assess generalization and real-world feasibility. In silico, a single-vasculature model achieved 92-100% success rates on individual anatomies, while a multi-vasculature model achieved 56-80% across multiple patient anatomies. In vitro, both HM-MARL models successfully navigated 100% of trials from the femoral artery to the right common carotid artery and 80% to the right ICA but failed on the left-side vessel superhuman challenge due to the anatomy and catheter type used in navigation. This study presents the first demonstration of in vitro autonomous navigation in MT vasculature. While HM-MARL enables generalization across anatomies, the simulation-to-real transition introduces challenges. Future work will refine RL strategies using world models and validate performance on unseen in vitro data, advancing autonomous MT towards clinical translation.
cs.RO / 9 / 2602.18684

Systematic Analysis of Coupling Effects on Closed-Loop and Open-Loop Performance in Aerial Continuum Manipulators

Amiri, Niloufar, Sepahvand, Shayan, Mantegh, Iraj, Janabi-Sharifi, Farrokh
Abstract
This paper investigates two distinct approaches to the dynamic modeling of aerial continuum manipulators (ACMs): the decoupled and the coupled formulations. Both open-loop and closed-loop behaviors of a representative ACM are analyzed. The primary objective is to determine the conditions under which the decoupled model attains accuracy comparable to the coupled model while offering reduced computational cost under identical numerical conditions. The system dynamics are first derived using the Euler-Lagrange method under the piecewise constant curvature (PCC) assumption, with explicit treatment of the near-zero curvature singularity. A decoupled model is then obtained by neglecting the coupling terms in the ACM dynamics, enabling systematic evaluation of open-loop responses under diverse actuation profiles and external wrenches. To extend the analysis to closed-loop performance, a novel dynamics-based proportional-derivative sliding mode image-based visual servoing (DPD-SM-IBVS) controller is developed for regulating image feature errors in the presence of a moving target. The controller is implemented with both coupled and decoupled models, allowing a direct comparison of their effectiveness. The open-loop simulations reveal pronounced discrepancies between the two modeling approaches, particularly under varying torque inputs and continuum arm parameters. Conversely, the closed-loop experiments demonstrate that the decoupled model achieves tracking accuracy on par with the coupled model (within subpixel error) while incurring lower computational cost.
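The decoupled formulation studied above amounts to dropping the cross-coupling terms in the Euler-Lagrange dynamics. A minimal sketch of that manipulation, assuming a block partition between base and continuum-arm coordinates (the partition size and the dynamics terms are placeholders, not the paper's model):

```python
# Minimal sketch: coupled vs. decoupled forward dynamics for
# M(q) qdd + C(q, qd) qd + g(q) = tau. decouple=True zeroes the inertial
# cross-coupling blocks between the first n_b (base) and remaining (arm)
# coordinates.
import numpy as np

def forward_dynamics(M, C, g, qd, tau, decouple=False, n_b=6):
    if decouple:
        M = M.copy()
        M[:n_b, n_b:] = 0.0   # arm-to-base coupling removed
        M[n_b:, :n_b] = 0.0   # base-to-arm coupling removed
    return np.linalg.solve(M, tau - C @ qd - g)
```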
cs.RO / 10 / 2602.18688

Scout-Rover cooperation: online terrain strength mapping and traversal risk estimation for planetary-analog explorations

Liu, Shipeng, Caporale, J. Diego, Zhang, Yifeng, Liao, Xingjue, Hoganson, William, Hu, Wilson, Misra, Shivangi, Peddinti, Neha, Holladay, Rachel, Fulcher, Ethan, Panyam, Akshay Ram, Puentes, Andrik, Bretzfelder, Jordan M., Zanetti, Michael, Wong, Uland, Koditschek, Daniel E., Yim, Mark, Jerolmack, Douglas, Sung, Cynthia, Qian, Feifei
Abstract
Robot-aided exploration of planetary surfaces is essential for understanding geologic processes, yet many scientifically valuable regions, such as Martian dunes and lunar craters, remain hazardous due to loose, deformable regolith. We present a scout-rover cooperation framework that expands safe access to such terrain using a hybrid team of legged and wheeled robots. In our approach, a high-mobility legged robot serves as a mobile scout, using proprioceptive leg-terrain interactions to estimate regolith strength during locomotion and construct spatially resolved terrain maps. These maps are integrated with rover locomotion models to estimate traversal risk and inform path planning. We validate the framework through analogue missions at the NASA Ames Lunar Simulant Testbed and the White Sands Dune Field. Experiments demonstrate (1) online terrain strength mapping from legged locomotion and (2) rover-specific traversal-risk estimation enabling safe navigation to scientific targets. Results show that scout-generated terrain maps reliably capture spatial variability and predict mobility failure modes, allowing risk-aware path planning that avoids hazardous regions. By combining embodied terrain sensing with heterogeneous rover cooperation, this framework enhances operational robustness and expands the reachable science workspace in deformable planetary environments.
cs.RO / 11 / 2602.18707

CLASH: Collision Learning via Augmented Sim-to-real Hybridization to Bridge the Reality Gap

He, Haotian, Guo, Ning, Shi, Siqi, Liu, Qipeng, Lian, Wenzhao
Abstract
The sim-to-real gap, particularly in the inaccurate modeling of contact-rich dynamics like collisions, remains a primary obstacle to deploying robot policies trained in simulation. Conventional physics engines often trade accuracy for computational speed, leading to discrepancies that prevent direct policy transfer. To address this, we introduce Collision Learning via Augmented Sim-to-real Hybridization (CLASH), a data-efficient framework that creates a high-fidelity hybrid simulator by learning a surrogate collision model from a minimal set of real-world data. In CLASH, a base model is first distilled from an imperfect simulator (MuJoCo) to capture general physical priors; this model is then fine-tuned with a remarkably small number of real-world interactions (as few as 10 samples) to correct for the simulator's inherent inaccuracies. The resulting hybrid simulator not only achieves higher predictive accuracy but also reduces collision computation time by nearly 50%. We demonstrate that policies obtained with our hybrid simulator transfer more robustly to the real world, doubling the success rate in sequential pushing tasks with reinforcement learning and significantly increasing task performance with model-based control.
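One simple way to picture the base-plus-fine-tune recipe is a residual correction: keep the simulator-distilled predictor and fit a small model to its error on the few real samples. The linear ridge residual below is an assumed stand-in for the paper's learned surrogate, chosen only to make the data flow concrete.

```python
# Minimal sketch: hybrid collision model = sim-distilled base + residual
# fit on a handful (~10) of real interactions. The linear residual and the
# ridge solver are our assumptions, not the CLASH architecture.
import numpy as np

def fit_residual(base_model, X_real, Y_real, lam=1e-2):
    """X_real: (N, d) inputs; Y_real: (N, k) measured outcomes."""
    E = Y_real - np.array([base_model(x) for x in X_real])  # sim-to-real error
    W = np.linalg.solve(X_real.T @ X_real + lam * np.eye(X_real.shape[1]),
                        X_real.T @ E)
    return lambda x: base_model(x) + x @ W  # hybrid predictor
```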
cs.RO / 12 / 2602.18716

Temporal Action Representation Learning for Tactical Resource Control and Subsequent Maneuver Generation

Jung, Hoseong, Son, Sungil, Cho, Daesol, Park, Jonghae, Choi, Changhyun, Kim, H. Jin
Abstract
Autonomous robotic systems should reason about resource control and its impact on subsequent maneuvers, especially when operating with limited energy budgets or restricted sensing. Learning-based control is effective in handling complex dynamics and represents the problem as a hybrid action space unifying discrete resource usage and continuous maneuvers. However, prior works on hybrid action space have not sufficiently captured the causal dependencies between resource usage and maneuvers. They have also overlooked the multi-modal nature of tactical decisions, both of which are critical in fast-evolving scenarios. In this paper, we propose TART, a Temporal Action Representation learning framework for Tactical resource control and subsequent maneuver generation. TART leverages contrastive learning based on a mutual information objective, designed to capture inherent temporal dependencies in resource-maneuver interactions. These learned representations are quantized into discrete codebook entries that condition the policy, capturing recurring tactical patterns and enabling multi-modal and temporally coherent behaviors. We evaluate TART in two domains where resource deployment is critical: (i) a maze navigation task where a limited budget of discrete actions provides enhanced mobility, and (ii) a high-fidelity air combat simulator in which an F-16 agent operates weapons and defensive systems in coordination with flight maneuvers. Across both domains, TART consistently outperforms hybrid-action baselines, demonstrating its effectiveness in leveraging limited resources and producing context-aware subsequent maneuvers.
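The quantization step mentioned above, where continuous temporal-action representations are snapped to discrete codebook entries that condition the policy, reduces at inference time to a nearest-neighbor lookup. A minimal sketch with an assumed codebook size and embedding dimension:

```python
# Minimal sketch: nearest-codebook quantization of a temporal-action embedding.
import numpy as np

def quantize(z: np.ndarray, codebook: np.ndarray):
    """codebook: (K, d). Returns the index and embedding of the nearest code."""
    k = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return k, codebook[k]

codebook = np.random.default_rng(0).normal(size=(16, 8))  # K=16 codes (assumed)
idx, code = quantize(np.zeros(8), codebook)  # `code` then conditions the policy
```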
cs.RO / 13 / 2602.18742

RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning

Kim, Seungku, Jang, Suhyeok, Yoon, Byungjun, Kim, Dongyoung, Won, John, Shin, Jinwoo
Abstract
Synthetic data generated by video generative models has shown promise for robot learning as a scalable pipeline, but it often suffers from inconsistent action quality due to imperfectly generated videos. Recently, vision-language models (VLMs) have been leveraged to validate video quality, but they have limitations in distinguishing physically accurate videos and, even then, cannot directly evaluate the generated actions themselves. To tackle this issue, we introduce RoboCurate, a novel synthetic robot data generation framework that evaluates and filters the quality of annotated actions by comparing them with simulation replay. Specifically, RoboCurate replays the predicted actions in a simulator and assesses action quality by measuring the consistency of motion between the simulator rollout and the generated video. In addition, we unlock observation diversity beyond the available dataset via image-to-image editing and apply action-preserving video-to-video transfer to further augment appearance. We observe RoboCurate's generated data yield substantial relative improvements in success rates compared to using real data only, achieving +70.1% on GR-1 Tabletop (300 demos), +16.1% on DexMimicGen in the pre-training setup, and +179.9% in the challenging real-world ALLEX humanoid dexterous manipulation setting.
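The verification idea above can be reduced to a single acceptance test: replay the annotated actions in simulation and keep the sample only when the replayed motion agrees with the motion recovered from the generated video. The distance metric and threshold in this sketch are assumptions, not the paper's scoring function.

```python
# Minimal sketch: accept a synthetic sample if the simulator rollout tracks
# the video-derived trajectory within a tolerance.
import numpy as np

def action_consistent(sim_traj, video_traj, tol=0.05):
    """sim_traj, video_traj: (T, 3) positions in a shared frame; tol in meters."""
    return float(np.linalg.norm(sim_traj - video_traj, axis=1).mean()) < tol
```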
cs.RO / 14 / 2602.18803

Learning to Localize Reference Trajectories in Image-Space for Visual Navigation

Busch, Finn Lukas, Vahs, Matti, Yang, Quantao, Peimbert, Jesús Gerardo Ortega, Cai, Yixi, Tumova, Jana, Andersson, Olov
Abstract
We present LoTIS, a model for visual navigation that provides robot-agnostic image-space guidance by localizing a reference RGB trajectory in the robot's current view, without requiring camera calibration, poses, or robot-specific training. Instead of predicting actions tied to specific robots, we predict the image-space coordinates of the reference trajectory as they would appear in the robot's current view. This creates robot-agnostic visual guidance that easily integrates with local planning. Consequently, our model's predictions provide guidance zero-shot across diverse embodiments. By decoupling perception from action and learning to localize trajectory points rather than imitate behavioral priors, we enable a cross-trajectory training strategy for robustness to viewpoint and camera changes. We outperform state-of-the-art methods by 20-50 percentage points in success rate on conventional forward navigation, achieving 94-98% success rate across diverse sim and real environments. Furthermore, we achieve over 5x improvements on challenging tasks where baselines fail, such as backward traversal. The system is straightforward to use: we show how even a video from a phone camera directly enables different robots to navigate to any point on the trajectory. Videos, demo, and code are available at https://finnbusch.com/lotis.
cs.RO / 15 / 2602.18813

Habilis-$\beta$: A Fast-Motion and Long-Lasting On-Device Vision-Language-Action Model

Robotics, Tommoro, Kang, Jesoon, Park, Taegeon, An, Jisu, Kimm, Soo Min, Kim, Jaejoon, Pahk, Jinu, Kim, Byungju, Lee, Junseok, Baek, Namheon, Ha, Sungwan, Baek, Hojun, Cruz, Eduardo Ayerve, Kim, Wontae, Choi, Junghyeon, Lee, Yousuk, Han, Joonmo, Cho, Sunghyun, Kwon, Sunghyun, Lee, Soyoung, Lee, Jun Ki, Yi, Seung-Joon, Zhang, Byoung-Tak, Kim, Theo Taeyeong
Abstract
We introduce Habilis-$\beta$, a fast-motion and long-lasting on-device vision-language-action (VLA) model designed for real-world deployment. Current VLA evaluation remains largely confined to single-trial success rates under curated resets, which fails to capture the fast-motion and long-lasting capabilities essential for practical operation. To address this, we introduce the Productivity-Reliability Plane (PRP), which evaluates performance through Tasks per Hour (TPH) and Mean Time Between Intervention (MTBI) under a continuous-run protocol that demands both high-speed execution and sustained robustness. Habilis-$\beta$ achieves high performance by integrating language-free pre-training on large-scale play data for robust interaction priors with post-training on cyclic task demonstrations that capture state drift across consecutive task iterations. The system further employs ESPADA for phase-adaptive motion shaping to accelerate free-space transit, utilizes rectified-flow distillation to enable high-frequency control on edge devices, and incorporates classifier-free guidance (CFG) as a deployment-time knob to dynamically balance instruction adherence and learned interaction priors. In 1-hour continuous-run evaluations, Habilis-$\beta$ achieves strong performance under the PRP metrics, compared to $\pi_{0.5}$ in both simulation and real-world environments. In simulation, Habilis-$\beta$ achieves 572.6 TPH and 39.2 s MTBI (vs. 120.5 TPH and 30.5 s for $\pi_{0.5}$), while in a real-world humanoid logistics workflow it achieves 124 TPH and 137.4 s MTBI (vs. 19 TPH and 46.1 s for $\pi_{0.5}$). Finally, Habilis-$\beta$ achieves the highest reported performance on the standard RoboTwin 2.0 leaderboard across representative tasks, validating its effectiveness in complex manipulation scenarios.
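The two PRP metrics are straightforward to compute from a continuous-run log; the sketch below just restates their definitions (the zero-intervention convention is our assumption).

```python
# Minimal sketch of the PRP metrics: Tasks per Hour (TPH) and Mean Time
# Between Intervention (MTBI).
def tph(tasks_completed: int, hours: float) -> float:
    return tasks_completed / hours

def mtbi(run_seconds: float, interventions: int) -> float:
    # Convention (assumed): report the full runtime if no intervention occurred.
    return run_seconds if interventions == 0 else run_seconds / interventions

assert tph(124, 1.0) == 124.0  # the real-world figure quoted above
```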
cs.RO / 16 / 2602.18814

RotorSuite: A MATLAB/Simulink Toolbox for Tilt Multi-Rotor UAV Modeling

Cigarini, Nicola, Michieletto, Giulia, Cenedese, Angelo
Abstract
In recent years, aerial platforms have evolved from passive flying sensors into versatile, contact-aware robotic systems, leading to rapid advances in platform design. Standard coplanar and collinear quadrotors have been complemented by modern tilted and tilting multi-rotor platforms with enhanced maneuverability. To properly analyze, control, and validate the performance of these emerging platforms, an accurate modeling step is required; however, this can be time-consuming, user-dependent, and error-prone. To address this issue, we propose a MATLAB/Simulink toolbox for modeling and simulating the dynamics of a broad class of multi-rotor platforms through both analytical and physics-based approaches. The toolbox, named RotorSuite, is provided with comprehensive documentation and example use cases, representing a valuable tool for didactic, research, and industrial development purposes.
cs.RO / 17 / 2602.18835

GRAB: A Systematic Real-World Grasping Benchmark for Robotic Food Waste Sorting

Thilakarathna, Moniesha, Wang, Xing, Wang, Min, Hinwood, David, Liu, Shuangzhe, Herath, Damith
Abstract
Food waste management is critical for sustainability, yet inorganic contaminants hinder recycling potential. Robotic automation presents a compelling approach to this challenge by accelerating the sorting process through automated contaminant removal. Still, the diverse and unpredictable nature of contaminants creates major challenges for robotic grasping. Benchmarking frameworks are critical for evaluating challenges from various perspectives. However, existing protocols rely on limited simulation datasets, prioritise simple metrics such as success rate, and overlook key object and environment-related pre-grasp conditions. This paper introduces GRAB, a comprehensive Grasping Real-World Article Benchmarking framework that addresses this gap by integrating diverse deformable objects, advanced grasp-pose-estimation vision, and, importantly, pre-grasp conditions, establishing a set of critical graspability metrics. It systematically compares industrial grasping modalities through an in-depth experimental evaluation involving 1,750 food contaminant grasp attempts across four high-fidelity scenes. This large-scale evaluation provides an extensive assessment of grasp performance for food waste sorting, offering a level of depth that has rarely been explored in previous studies. The results reveal distinct gripper strengths and limitations, with object quality emerging as the dominant performance factor in cluttered environments, while vision quality and clutter levels play moderate roles. These findings highlight essential design considerations and reinforce the necessity of developing multimodal gripper technologies capable of robust cross-category performance for effective robotic food waste sorting.
cs.RO / 18 / 2602.18850

When the Inference Meets the Explicitness or Why Multimodality Can Make Us Forget About the Perfect Predictor

Domínguez-Vidal, J. E., Sanfeliu, Alberto
Abstract
Although it is common in the literature to find predictors and inference systems that try to predict human intentions, the uncertainty of these models due to the randomness of human behavior has led some authors to advocate the use of communication systems that explicitly elicit human intention. In this work, we analyze the use of four different communication systems with a human-robot collaborative object transportation task as experimental testbed: two intention predictors (one based on force prediction and another with an enhanced velocity prediction algorithm) and two explicit communication methods (a button interface and a voice-command recognition system). These systems were integrated into IVO, a custom mobile social robot equipped with a force sensor to detect the force exchange between both agents and a LiDAR to detect the environment. The collaborative task required transporting an object over a 5-7 meter distance with obstacles in the middle, demanding rapid decisions and precise physical coordination. 75 volunteers performed a total of 255 executions divided into three groups, testing inference systems in the first round, communication systems in the second, and the combined strategies in the third. The results show that 1) once sufficient performance is achieved, the human no longer notices or positively assesses further technical improvements; 2) humans prefer systems that feel more natural to them, even when those systems have higher failure rates; and 3) the preferred option is the right combination of both systems.
cs.RO / 19 / 2602.18862

Gait Asymmetry from Unilateral Weakness and Improvement With Ankle Assistance: a Reinforcement Learning based Simulation Study

Yuan, Yifei, Androwis, Ghaith, Zhou, Xianlian
Abstract
Unilateral muscle weakness often leads to asymmetric gait, disrupting interlimb coordination and stance timing. This study presents a reinforcement learning (RL) based musculoskeletal simulation framework to (1) quantify how progressive unilateral muscle weakness affects gait symmetry and (2) evaluate whether ankle exoskeleton assistance can improve gait symmetry under impaired conditions. The overarching goal is to establish a simulation- and learning-based workflow that supports early controller development prior to patient experiments. Asymmetric gait was induced by reducing right-leg muscle strength to 75%, 50%, and 25% of baseline. Gait asymmetry was quantified using toe-off timing, peak contact forces, and joint-level symmetry metrics. Increasing weakness produced progressively larger temporal and kinematic asymmetry, most pronounced at the ankle. Ankle range of motion symmetry degraded from near-symmetric behavior at 100% strength (symmetry index, SI = +6.4%; correlation r=0.974) to severe asymmetry at 25% strength (SI = -47.1%, r=0.889), accompanied by a load shift toward the unimpaired limb. At 50% strength, ankle exoskeleton assistance improved kinematic symmetry relative to the unassisted impaired condition, reducing the magnitude of ankle SI from 25.8% to 18.5% and increasing ankle correlation from r=0.948 to 0.966, although peak loading remained biased toward the unimpaired side. Overall, this framework supports controlled evaluation of impairment severity and assistive strategies, and provides a basis for future validation in human experiments.
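For reference, a symmetry index of the common normalized-difference form is shown below; whether this matches the paper's exact definition is an assumption on our part.

```python
# Minimal sketch: symmetry index (SI) between right- and left-side values.
def symmetry_index(right: float, left: float) -> float:
    """0 for perfect symmetry; the sign indicates which side dominates."""
    return 100.0 * (right - left) / (0.5 * (right + left))
```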
cs.RO / 20 / 2602.18872

Equivalence and Divergence of Bayesian Log-Odds and Dempster's Combination Rule for 2D Occupancy Grids

Berlenko, Tatiana, Krinkin, Kirill
Abstract
We introduce a pignistic-transform-based methodology for fair comparison of Bayesian log-odds and Dempster's combination rule in occupancy grid mapping, matching per-observation decision probabilities to isolate the fusion rule from sensor parameterization. Under BetP matching across simulation, two real lidar datasets, and downstream path planning, Bayesian fusion is consistently favored (15/15 directional consistency, p = 3.1e-5) with small absolute differences (0.001-0.022). Under normalized plausibility matching, the direction reverses, confirming the result is matching-criterion-specific. The methodology is reusable for any future Bayesian/belief function comparison.
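On the binary occupancy frame, the two fusion rules being compared and the pignistic transform all take compact closed forms, which the sketch below spells out for masses (m_occ, m_free, m_unknown); the sensor models that produce those masses are outside this sketch.

```python
# Minimal sketch of the compared machinery on the frame {occupied, free}:
# Bayesian log-odds update, Dempster's rule, and the pignistic transform BetP.
import math

def bayes_logodds(l: float, p_occ: float) -> float:
    return l + math.log(p_occ / (1.0 - p_occ))    # standard log-odds update

def dempster(m1, m2):
    """Combine two mass tuples (m_occ, m_free, m_unknown)."""
    o1, f1, u1 = m1
    o2, f2, u2 = m2
    conflict = o1 * f2 + f1 * o2                  # mass on the empty set
    s = 1.0 - conflict                            # normalization
    return ((o1*o2 + o1*u2 + u1*o2) / s,
            (f1*f2 + f1*u2 + u1*f2) / s,
            (u1*u2) / s)

def betp_occ(m) -> float:
    o, _, u = m
    return o + 0.5 * u                            # BetP splits m_unknown evenly
```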
cs.RO / 21 / 2602.18951

Temporal-Logic-Aware Frontier-Based Exploration

Taheri, Azizollah, Aksaray, Derya
Abstract
This paper addresses the problem of temporal logic motion planning for an autonomous robot operating in an unknown environment. The objective is to enable the robot to satisfy a syntactically co-safe Linear Temporal Logic (scLTL) specification when the exact locations of the desired labels are not known a priori. We introduce a new type of automaton state, referred to as commit states. These states capture intermediate task progress resulting from actions whose consequences are irreversible. In other words, certain future paths to satisfaction become infeasible after taking the actions that lead to the commit states. By leveraging commit states, we propose a sound and complete frontier-based exploration algorithm that strategically guides the robot to make progress toward the task while preserving all possible ways of satisfying it. The efficacy of the proposed method is validated through simulations.
cs.RO / 22 / 2602.18967

TactEx: An Explainable Multimodal Robotic Interaction Framework for Human-Like Touch and Hardness Estimation

Verstraete, Felix, Wei, Lan, Fan, Wen, Zhang, Dandan
Abstract
Accurate perception of object hardness is essential for safe and dexterous contact-rich robotic manipulation. Here, we present TactEx, an explainable multimodal robotic interaction framework that unifies vision, touch, and language for human-like hardness estimation and interactive guidance. We evaluate TactEx on fruit-ripeness assessment, a representative task that requires both tactile sensing and contextual understanding. The system fuses GelSight-Mini tactile streams with RGB observations and language prompts. A ResNet50+LSTM model estimates hardness from sequential tactile data, while a cross-modal alignment module combines visual cues with guidance from a large language model (LLM). This explainable multimodal interface allows users to distinguish ripeness levels with statistically significant class separation (p < 0.01 for all fruit pairs). For touch placement, we compare YOLO with Grounded-SAM (GSAM) and find GSAM to be more robust for fine-grained segmentation and contact-site selection. A lightweight LLM parses user instructions and produces grounded natural-language explanations linked to the tactile outputs. In end-to-end evaluations, TactEx attains 90% task success on simple user queries and generalises to novel tasks without large-scale tuning. These results highlight the promise of combining pretrained visual and tactile models with language grounding to advance explainable, human-like touch perception and decision-making in robotics.
cs.RO / 23 / 2602.18976

Bumper Drone: Elastic Morphology Design for Aerial Physical Interaction

Supa, Pongporn, Dunnett, Alex, Xiao, Feng, Wu, Rui, Kovac, Mirko, Kocer, Basaran Bahadir
Abstract
Aerial robots are evolving from avoiding obstacles to exploiting the environmental contact interactions for navigation, exploration and manipulation. A key challenge in such aerial physical interactions lies in handling uncertain contact forces on unknown targets, which typically demand accurate sensing and active control. We present a drone platform with elastic horns that enables touch-and-go manoeuvres - a self-regulated, consecutive bumping motion that allows the drone to maintain proximity to a wall without relying on active obstacle avoidance. It leverages environmental interaction as a form of embodied control, where low-level stabilisation and near-obstacle navigation emerge from the passive dynamic responses of the drone-obstacle system that resembles a mass-spring-damper system. Experiments show that the elastic horn can absorb impact energy while maintaining vehicle stability, reducing pitch oscillations by 38% compared to the rigid horn configuration. The lower horn arrangement was found to reduce pitch oscillations by approximately 54%. In addition to intermittent contact, the platform equipped with elastic horns also demonstrates stable, sustained contact with static objects, relying on a standard attitude PID controller.
cs.RO / 24 / 2602.18991

FruitTouch: A Perceptive Gripper for Gentle and Scalable Fruit Harvesting

Zhang, Ruohan, Mirzaee, Mohammad Amin, Yuan, Wenzhen
Abstract
The automation of fruit harvesting has gained increasing significance in response to rising labor shortages. A sensorized gripper is a key component of this process, which must be compact enough for confined spaces, able to stably grasp diverse fruits, and provide reliable feedback on fruit conditions for efficient harvesting. To address this need, we propose FruitTouch, a compact gripper that integrates high-resolution, vision-based tactile sensing through an optimized optical design. This configuration accommodates a wide range of fruit sizes while maintaining low cost and mechanical simplicity. Tactile images captured by an embedded camera provide rich information for real-time force estimation, slip detection, and softness prediction. We validate the gripper in real-world fruit harvesting experiments, demonstrating robust grasp stability and effective damage prevention.
cs.RO / 25 / 2602.19038

A Checklist for Deploying Robots in Public: Articulating Tacit Knowledge in the HRI Community

Liang, Claire, Babel, Franziska, Pelikan, Hannah, Thompson, Sydney, Tan, Xiang Zhi
Abstract
Many of the challenges encountered in in-the-wild public deployments of robots remain undocumented despite sharing many common pitfalls. This creates a high barrier of entry and results in repetition of avoidable mistakes. To articulate the tacit knowledge in the HRI community, this paper presents a guideline in the form of a checklist to support researchers in preparing for robot deployments in public. Drawing on their own experience with public robot deployments, the research team collected essential topics to consider in public HRI research. These topics are represented as modular flip cards in a hierarchical table, structured into deployment phases and important domains. We interviewed six interdisciplinary researchers with expertise in public HRI and show how including community input refines the checklist. We further show the checklist in action in context of real public studies. Finally, we contribute the checklist as an open-source, customizable community resource that both collects joint expertise for continual evolution and is usable as a list, set of cards, and an interactive web tool.
cs.RO / 26 / 2602.19062

Path planning for unmanned surface vehicle based on predictive artificial potential field (International Journal of Advanced Robotic Systems)

Song, Jia, Hao, Ce, Su, Jiangcheng
Abstract
Path planning for high-speed unmanned surface vehicles requires more complex solutions to reduce sailing time and save energy. This article proposes a new predictive artificial potential field that incorporates time information and predictive potential to plan smoother paths. It explores the principles of the artificial potential field, considering vehicle dynamics and local minimum reachability. The study first analyzes the most advanced traditional artificial potential field and its drawbacks in global and local path planning. It then introduces three modifications to the predictive artificial potential field (an angle limit, velocity adjustment, and a predictive potential) to enhance the feasibility and smoothness of the generated path. A comparison between the traditional and predictive artificial potential fields demonstrates that the latter successfully restricts the maximum turning angle, shortens sailing time, and intelligently avoids obstacles. Simulation results further verify that the predictive artificial potential field addresses the concave local minimum problem and improves reachability in special scenarios, ultimately generating a more efficient path that reduces sailing time and conserves energy for unmanned surface vehicles.
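Of the three modifications, the angle limit is the easiest to make concrete: compute the classic attractive-plus-repulsive force, then clip the resulting heading change. The sketch below uses textbook potential-field terms and assumed gains; the predictive-potential term itself is not reproduced here.

```python
# Minimal sketch: one potential-field step with a maximum-turn-angle limit.
# Gains, influence distance d0, and step size are assumptions.
import numpy as np

def apf_step(p, heading, goal, obstacles, k_att=1.0, k_rep=0.5,
             d0=2.0, max_turn=np.radians(15), step=0.2):
    force = k_att * (goal - p)                       # attractive term
    for q in obstacles:                              # repulsive terms
        d = np.linalg.norm(p - q)
        if 0.0 < d < d0:
            force += k_rep * (1/d - 1/d0) / d**2 * (p - q) / d
    desired = np.arctan2(force[1], force[0])
    turn = (desired - heading + np.pi) % (2 * np.pi) - np.pi
    heading += np.clip(turn, -max_turn, max_turn)    # the angle-limit change
    return p + step * np.array([np.cos(heading), np.sin(heading)]), heading
```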
cs.RO / 27 / 2602.19077

Design, Locomotion, and Control of Amphibious Robots: Recent Advances

Jin, Yi, Liu, Chang, Quinn, Roger D., Wood, Robert J., Cao, C. Chase
Abstract
Amphibious robots, operating seamlessly across land and water, are advancing applications in conservation, disaster response, and defense. Their performance depends on locomotion mechanisms, actuation technologies, and sensor-control integration. This review highlights recent progress in these areas, examining movement strategies, material-based actuators, and control systems for autonomy and adaptability. Challenges and opportunities are outlined to guide future research toward more efficient, resilient, and multifunctional amphibious robots.
cs.RO / 28 / 2602.19107

A User-driven Design Framework for Robotaxi

Deng, Yue, He, Changyang
Abstract
Robotaxis are emerging as a promising form of urban mobility, yet research has largely emphasized technical driving performance while leaving open how passengers experience and evaluate rides without a human driver. To address the limitations of prior work that often relies on simulated or hypothetical settings, we investigate real-world robotaxi use through 18 semi-structured interviews and autoethnographic ride experiences. We found that users were drawn to robotaxis by low cost, social recommendation, and curiosity. They valued a distinctive set of benefits, such as an increased sense of agency, consistent driving behavior, and standardized ride experiences. However, they encountered persistent challenges around limited flexibility, insufficient transparency, management difficulty, robustness concerns in edge cases, and emergency handling concerns. Robotaxi experiences were shaped by privacy, safety, ethics, and trust. Users were often privacy-indifferent yet sensitive to opaque access and leakage risks; safety perceptions were polarized; and ethical considerations surfaced around issues such as accountability, feedback responsibility, and the absence of human-like social norms. Based on these findings, we propose a user-driven design framework spanning the end-to-end journey, including pre-ride configuration (hailing), context-aware pickup facilitation (pick-up), in-ride explainability (traveling), and accountable post-ride feedback (drop-off), to guide robotaxi interaction and service design.
cs.RO / 29 / 2602.19108

Understanding Fire Through Thermal Radiation Fields for Mobile Robots

Wagner, Anton R., Rao, Madhan Balaji, Xiao, Xuesu, Pirk, Sören
Abstract
Safely moving through environments affected by fire is a critical capability for autonomous mobile robots deployed in disaster response. In this work, we present a novel approach for mobile robots to understand fire through building real-time thermal radiation fields. We register depth and thermal images to obtain a 3D point cloud annotated with temperature values. From these data, we identify fires and use the Stefan-Boltzmann law to approximate the thermal radiation in empty spaces. This enables the construction of a continuous thermal radiation field over the environment. We show that this representation can be used for robot navigation, where we embed thermal constraints into the cost map to compute collision-free and thermally safe paths. We validate our approach on a Boston Dynamics Spot robot in controlled experimental settings. Our experiments demonstrate the robot's ability to avoid hazardous regions while still reaching navigation goals. Our approach paves the way toward mobile robots that can be autonomously deployed in fire-affected environments, with potential applications in search-and-rescue, firefighting, and hazardous material response.
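The Stefan-Boltzmann step described above can be pictured as summing point-source contributions from detected fire voxels with inverse-square falloff; the emissivity, the per-voxel emitting area, and the point-source simplification in this sketch are our assumptions, not the paper's calibrated model. A thermal cost layer can then be obtained by evaluating this field on the navigation grid and adding it to the planner's cost map.

```python
# Minimal sketch: approximate incident radiative flux (W/m^2) at a query
# point from fire voxels via the Stefan-Boltzmann law, q = eps * sigma * T^4.
import numpy as np

SIGMA = 5.670e-8  # Stefan-Boltzmann constant, W / (m^2 K^4)

def radiant_flux(query, fire_points, fire_temps_k, emissivity=0.9, area=0.01):
    """query: (3,); fire_points: (N, 3); fire_temps_k: (N,) in kelvin."""
    d2 = np.sum((fire_points - query) ** 2, axis=1)
    power = emissivity * SIGMA * fire_temps_k ** 4 * area  # emitted per voxel, W
    return float(np.sum(power / (4.0 * np.pi * np.maximum(d2, 1e-6))))
```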
Chinese Translation
在灾难响应中,自主移动机器人安全地穿越受火灾影响的环境是一项关键能力。在本研究中,我们提出了一种新颖的方法,使移动机器人能够通过构建实时热辐射场来理解火灾。我们将深度图像和热图像进行配准,以获得带有温度值的3D点云。从这些数据中,我们识别火灾并利用斯特凡-玻尔兹曼定律来近似空旷区域的热辐射。这使得在环境中构建连续的热辐射场成为可能。我们展示了这种表示方法可以用于机器人导航,其中我们将热约束嵌入成本地图,以计算无碰撞且热安全的路径。我们在受控实验环境中对波士顿动力公司的Spot机器人验证了我们的方法。实验表明,机器人能够在到达导航目标的同时避免危险区域。我们的方法为能够在火灾影响环境中自主部署的移动机器人铺平了道路,具有在搜索与救援、灭火和危险物质响应等领域的潜在应用。
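The physical core of this pipeline is compact enough to sketch. Below is a minimal illustration, not the authors' code, of how temperature-annotated fire points could induce a continuous radiation estimate at free-space query points via the Stefan-Boltzmann law; the isotropic point-emitter model, the inverse-square aggregation, and all names are assumptions.

import numpy as np

SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W / (m^2 K^4)

def radiant_flux(temperature_k: float, emissivity: float = 0.9) -> float:
    """Radiant exitance of a hot surface patch via the Stefan-Boltzmann law."""
    return emissivity * SIGMA * temperature_k ** 4

def radiation_field(query_xyz, fire_points_xyz, fire_temps_k, eps=1e-6):
    """Approximate incident radiation at free-space query points.

    Each detected fire point is treated as a small isotropic emitter whose
    contribution falls off with the square of the distance (a simplifying
    assumption; the paper's exact aggregation may differ).
    """
    query_xyz = np.atleast_2d(query_xyz)                              # (Q, 3)
    d2 = ((query_xyz[:, None, :] - fire_points_xyz[None, :, :]) ** 2).sum(-1)
    flux = np.array([radiant_flux(t) for t in fire_temps_k])          # (F,)
    return (flux[None, :] / (4.0 * np.pi * (d2 + eps))).sum(axis=1)   # (Q,)

# Example: one 600 K fire point, queried 1 m and 2 m away.
fires = np.array([[0.0, 0.0, 0.0]])
temps = np.array([600.0])
print(radiation_field([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]], fires, temps))

A scalar field of this form can then be rasterized into the navigation cost map as an additional thermal layer, which is how the abstract describes the planner consuming it.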
cs.RO / 30 / 2602.19173

Distributed and Consistent Multi-Robot Visual-Inertial-Ranging Odometry on Lie Groups

基于李群的分布式一致性多机器人视觉惯性测距里程计
Kang, Ziwei, Zhou, Yizhi
Abstract
Reliable localization is a fundamental requirement for multi-robot systems operating in GPS-denied environments. Visual-inertial odometry (VIO) provides lightweight and accurate motion estimation but suffers from cumulative drift in the absence of global references. Ultra-wideband (UWB) ranging offers complementary global observations, yet most existing UWB-aided VIO methods are designed for single-robot scenarios and rely on pre-calibrated anchors, which limits their robustness in practice. This paper proposes a distributed collaborative visual-inertial-ranging odometry (DC-VIRO) framework that tightly fuses VIO and UWB measurements across multiple robots. Anchor positions are explicitly included in the system state to address calibration uncertainty, while shared anchor observations are exploited through inter-robot communication to provide additional geometric constraints. By leveraging a right-invariant error formulation on Lie groups, the proposed approach preserves the observability properties of standard VIO, ensuring estimator consistency. Simulation results with multiple robots demonstrate that DC-VIRO significantly improves localization accuracy and robustness, while simultaneously enabling anchor self-calibration in distributed settings.
Chinese Translation
可靠的定位是多机器人系统在无GPS环境中运行的基本要求。视觉惯性里程计(VIO)提供了轻量且准确的运动估计,但在缺乏全局参考的情况下会遭遇累积漂移。超宽带(UWB)测距提供了互补的全局观测,但现有的大多数UWB辅助VIO方法主要针对单机器人场景,并依赖于预先校准的锚点,这在实际应用中限制了其鲁棒性。本文提出了一种分布式协作视觉惯性测距里程计(DC-VIRO)框架,该框架在多个机器人之间紧密融合VIO和UWB测量。锚点位置被明确纳入系统状态,以解决校准不确定性,同时通过机器人间通信利用共享锚点观测提供额外的几何约束。通过在李群上利用右不变误差公式,所提方法保持了标准VIO的可观测性特性,确保了估计器的一致性。多机器人仿真结果表明,DC-VIRO显著提高了定位精度和鲁棒性,同时在分布式环境中实现了锚点自校准。
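For readers unfamiliar with invariant filtering, the sketch below (ours, not the authors') shows the right-invariant rotation error on SO(3) and the property that motivates it: the error between two attitudes related by a left perturbation is independent of the shared true state, which is what keeps the linearized error dynamics, and hence the observability analysis, state-independent. scipy is assumed available; DC-VIRO's full state is of course larger than a single rotation.

import numpy as np
from scipy.linalg import expm, logm

def hat(w):
    """so(3) hat map: R^3 -> 3x3 skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def right_invariant_error(R_est, R_true):
    """Right-invariant error eta = R_est R_true^T and its tangent vector xi.

    eta = exp(hat(xi)), so xi = vee(logm(eta)).
    """
    eta = R_est @ R_true.T
    xi_hat = logm(eta)
    return np.array([xi_hat[2, 1], xi_hat[0, 2], xi_hat[1, 0]]).real

# A 0.1 rad perturbation about z, applied on the left, yields the same
# error vector regardless of the underlying true attitude.
R_true = expm(hat(np.array([0.3, -0.2, 0.5])))
R_est = expm(hat(np.array([0.0, 0.0, 0.1]))) @ R_true
print(right_invariant_error(R_est, R_true))   # ~ [0, 0, 0.1]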
cs.RO / 31 / 2602.19179

Distributional Stability of Tangent-Linearized Gaussian Inference on Smooth Manifolds

光滑流形上切线线性化高斯推断的分布稳定性
Seo, Junghoon, Lee, Hakjin, Sim, Jaehoon
Abstract
Gaussian inference on smooth manifolds is central to robotics, but exact marginalization and conditioning are generally non-Gaussian and geometry-dependent. We study tangent-linearized Gaussian inference and derive explicit non-asymptotic $W_2$ stability bounds for projection marginalization and surface-measure conditioning. The bounds separate local second-order geometric distortion from nonlocal tail leakage and, for Gaussian inputs, yield closed-form diagnostics from $(\mu,\Sigma)$ and curvature/reach surrogates. Circle and planar-pushing experiments validate the predicted calibration transition near $\sqrt{\|\Sigma\|_{\mathrm{op}}}/R\approx 1/6$ and indicate that normal-direction uncertainty is the dominant failure mode when locality breaks. These diagnostics provide practical triggers for switching from single-chart linearization to multi-chart or sample-based manifold inference.
Chinese Translation
光滑流形上的高斯推断在机器人技术中至关重要,但精确的边际化和条件化通常是非高斯的且依赖于几何结构。我们研究了切线线性化的高斯推断,并推导出投影边际化和表面测度条件化的明确非渐近 $W_2$ 稳定性界限。这些界限将局部二阶几何失真与非局部尾部泄漏分开,并且对于高斯输入,可从 $(\mu,\Sigma)$ 和曲率/可达性替代量中得出封闭形式的诊断。圆形和平面推送实验验证了在 $\sqrt{\|\Sigma\|_{\mathrm{op}}}/R \approx 1/6$ 附近预测的标定转变,并表明当局部性破裂时,法向不确定性是主要的失效模式。这些诊断为从单图线性化切换到多图或基于样本的流形推断提供了实际触发器。
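The closed-form diagnostic is easy to operationalize. A minimal sketch, assuming the ratio $\sqrt{\|\Sigma\|_{\mathrm{op}}}/R$ with the reported threshold of 1/6 as the trigger for abandoning single-chart linearization; the names and the unit-circle example are illustrative.

import numpy as np

def calibration_ratio(Sigma: np.ndarray, reach_R: float) -> float:
    """Locality diagnostic sqrt(||Sigma||_op) / R.

    ||Sigma||_op is the largest eigenvalue (operator norm) of the tangent
    covariance; reach_R is a curvature/reach surrogate of the manifold.
    """
    op_norm = np.linalg.eigvalsh(Sigma)[-1]        # eigenvalues ascending
    return float(np.sqrt(op_norm) / reach_R)

def needs_multichart(Sigma, reach_R, threshold=1.0 / 6.0):
    """Trigger for switching to multi-chart or sample-based inference."""
    return calibration_ratio(Sigma, reach_R) > threshold

# Unit circle (reach R = 1): a tangent std of 0.1 is comfortably local,
# while 0.3 crosses the predicted calibration transition near 1/6.
print(needs_multichart(np.diag([0.1**2, 0.05**2]), 1.0))   # False
print(needs_multichart(np.diag([0.3**2, 0.05**2]), 1.0))   # True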
cs.RO / 32 / 2602.19184

Human-to-Robot Interaction: Learning from Video Demonstration for Robot Imitation

人机交互:从视频演示中学习以实现机器人模仿
Canh, Thanh Nguyen, Tran, Thanh-Tuan, Zhang, Haolan, Gao, Ziyan, Chong, Nak Young, HoangVan, Xiem
Abstract
Learning from Demonstration (LfD) offers a promising paradigm for robot skill acquisition. Recent approaches attempt to extract manipulation commands directly from video demonstrations, yet face two critical challenges: (1) general video captioning models prioritize global scene features over task-relevant objects, producing descriptions unsuitable for precise robotic execution, and (2) end-to-end architectures coupling visual understanding with policy learning require extensive paired datasets and struggle to generalize across objects and scenarios. To address these limitations, we propose a novel ``Human-to-Robot'' imitation learning pipeline that enables robots to acquire manipulation skills directly from unstructured video demonstrations, inspired by the human ability to learn by watching and imitating. Our key innovation is a modular framework that decouples the learning process into two distinct stages: (1) Video Understanding, which combines Temporal Shift Modules (TSM) with Vision-Language Models (VLMs) to extract actions and identify interacted objects, and (2) Robot Imitation, which employs TD3-based deep reinforcement learning to execute the demonstrated manipulations. We validated our approach in PyBullet simulation environments with a UR5e manipulator and in a real-world experiment with a UF850 manipulator across four fundamental actions: reach, pick, move, and put. For video understanding, our method achieves 89.97% action classification accuracy and BLEU-4 scores of 0.351 on standard objects and 0.265 on novel objects, representing improvements of 76.4% and 128.4% over the best baseline, respectively. For robot manipulation, our framework achieves an average success rate of 87.5% across all actions, with 100% success on reaching tasks and up to 90% on complex pick-and-place operations. The project website is available at https://thanhnguyencanh.github.io/LfD4hri.
Chinese Translation
从演示学习(Learning from Demonstration, LfD)为机器人技能获取提供了一种有前景的范式。近期的方法尝试直接从视频演示中提取操作指令,但面临两个关键挑战:(1)通用视频字幕模型优先考虑全局场景特征而非任务相关对象,导致生成的描述不适合精确的机器人执行;(2)将视觉理解与策略学习结合的端到端架构需要大量配对数据集,并且在对象和场景之间的泛化能力较弱。为了解决这些局限性,我们提出了一种新颖的“人机”模仿学习流程,使机器人能够直接从非结构化视频演示中获取操作技能,灵感来源于人类通过观察和模仿学习的能力。我们的关键创新是一个模块化框架,将学习过程解耦为两个不同阶段:(1)视频理解,该阶段结合了时间位移模块(Temporal Shift Modules, TSM)与视觉-语言模型(Vision-Language Models, VLMs),以提取动作并识别交互对象;(2)机器人模仿,该阶段采用基于TD3的深度强化学习执行演示的操作。我们在PyBullet仿真环境中使用UR5e操纵器以及在现实世界实验中使用UF850操纵器验证了我们的方法,涵盖四个基本动作:到达、抓取、移动和放置。在视频理解方面,我们的方法在标准对象上实现了89.97%的动作分类准确率和0.351的BLEU-4分数,在新颖对象上实现了0.265的BLEU-4分数,分别比最佳基线提高了76.4%和128.4%。在机器人操作方面,我们的框架在所有动作上实现了87.5%的平均成功率,其中在到达任务上成功率为100%,在复杂的抓取与放置操作中成功率高达90%。项目网站可访问:https://thanhnguyencanh.github.io/LfD4hri。
cs.RO / 33 / 2602.19193

Visual Prompt Guided Unified Pushing Policy

视觉提示引导的统一推送策略
Bui, Hieu, Gao, Ziyan, Hosoda, Yuya, Lee, Joo-Ho
Abstract
As one of the simplest non-prehensile manipulation skills, pushing has been widely studied as an effective means to rearrange objects. Existing approaches, however, typically rely on multi-step push plans composed of pre-defined pushing primitives with limited application scopes, which restrict their efficiency and versatility across different scenarios. In this work, we propose a unified pushing policy that incorporates a lightweight prompting mechanism into a flow matching policy to guide the generation of reactive, multimodal pushing actions. The visual prompt can be specified by a high-level planner, enabling the reuse of the pushing policy across a wide range of planning problems. Experimental results demonstrate that the proposed unified pushing policy not only outperforms existing baselines but also effectively serves as a low-level primitive within a VLM-guided planning framework to solve table-cleaning tasks efficiently.
Chinese Translation
作为最简单的非抓取操作技能之一,推送已被广泛研究,作为重新排列物体的有效手段。然而,现有方法通常依赖于由预定义推送原语组成的多步骤推送计划,这些原语的应用范围有限,从而限制了它们在不同场景中的效率和多样性。在本研究中,我们提出了一种统一推送策略,该策略将轻量级提示机制纳入流匹配策略,以指导生成反应式的多模态推送动作。视觉提示可以由高级规划器指定,使得推送策略能够在广泛的规划问题中重复使用。实验结果表明,所提出的统一推送策略不仅优于现有基线,而且有效地作为VLM(视觉语言模型)引导规划框架中的低级原语,能够高效地解决桌面清洁任务。
cs.RO / 34 / 2602.19260

The Price Is Not Right: Neuro-Symbolic Methods Outperform VLAs on Structured Long-Horizon Manipulation Tasks with Significantly Lower Energy Consumption

价格并不合理:神经符号方法在结构化长时间操作任务中表现优于视觉语言行动模型,且能耗显著更低
Duggan, Timothy, Lorang, Pierrick, Lu, Hong, Scheutz, Matthias
Abstract
Vision-Language-Action (VLA) models have recently been proposed as a pathway toward generalist robotic policies capable of interpreting natural language and visual inputs to generate manipulation actions. However, their effectiveness and efficiency on structured, long-horizon manipulation tasks remain unclear. In this work, we present a head-to-head empirical comparison between a fine-tuned open-weight VLA model $\pi_0$ and a neuro-symbolic architecture that combines PDDL-based symbolic planning with learned low-level control. We evaluate both approaches on structured variants of the Towers of Hanoi manipulation task in simulation while measuring both task performance and energy consumption during training and execution. On the 3-block task, the neuro-symbolic model achieves 95% success compared to 34% for the best-performing VLA. The neuro-symbolic model also generalizes to an unseen 4-block variant (78% success), whereas both VLAs fail to complete the task. During training, VLA fine-tuning consumes nearly two orders of magnitude more energy than the neuro-symbolic approach. These results highlight important trade-offs between end-to-end foundation-model approaches and structured reasoning architectures for long-horizon robotic manipulation, emphasizing the role of explicit symbolic structure in improving reliability, data efficiency, and energy efficiency. Code and models are available at https://price-is-not-right.github.io
Chinese Translation
视觉语言行动(VLA)模型最近被提出作为实现通用机器人策略的一种途径,能够解读自然语言和视觉输入以生成操作动作。然而,它们在结构化长时间操作任务上的有效性和效率仍不明确。在本研究中,我们对一个经过微调的开放权重 VLA 模型 $\pi_0$ 和一个结合基于 PDDL 的符号规划与学习的低级控制的神经符号架构进行了正面的实证比较。我们在模拟中对这两种方法在汉诺塔操作任务的结构化变体上进行了评估,同时测量了训练和执行过程中的任务性能和能耗。在 3 块任务中,神经符号模型的成功率达到 95%,而表现最佳的 VLA 仅为 34%。神经符号模型还能够推广到未见过的 4 块变体(成功率为 78%),而两个 VLA 均未能完成该任务。在训练过程中,VLA 微调的能耗比神经符号方法高出近两个数量级。结果突显了端到端基础模型方法与结构化推理架构在长时间机器人操作中的重要权衡,强调了显式符号结构在提高可靠性、数据效率和能效方面的作用。代码和模型可在 https://price-is-not-right.github.io 获取。
cs.RO / 35 / 2602.19273

3D Shape Control of Extensible Multi-Section Soft Continuum Robots via Visual Servoing

通过视觉伺服控制可扩展多段软连续机器人三维形状
Gandhi, Abhinav, Chiang, Shou-Shan, Onal, Cagdas D., Calli, Berk
Abstract
In this paper, we propose a novel vision-based control algorithm for regulating the whole body shape of extensible multisection soft continuum manipulators. Contrary to existing vision-based control algorithms in the literature that regulate the robot's end effector pose, our proposed control algorithm regulates the robot's whole body configuration, enabling us to leverage its kinematic redundancy. Additionally, our model-based 2.5D shape visual servoing provides globally stable asymptotic convergence in the robot's 3D workspace compared to the closest works in the literature that report local minima. Unlike existing visual servoing algorithms in the literature, our approach does not require information from proprioceptive sensors, making it suitable for continuum manipulators without such capabilities. Instead, robot state is estimated from images acquired by an external camera that observes the robot's whole body shape and is also utilized to close the shape control loop. Traditionally, visual servoing schemes require an image of the robot at its reference pose to generate the reference features. In this work, we utilize an inverse kinematics solver to generate reference features for the desired robot configuration and do not require images of the robot at the reference. Experiments are performed on a multisection continuum manipulator demonstrating the controller's capability to regulate the robot's whole body shape while precisely positioning the robot's end effector. Results validate our controller's ability to regulate the shape of continuum robots while demonstrating a smooth transient response and a steady-state error within 1 mm. Proof-of-concept object manipulation experiments including stacking, pouring, and pulling tasks are performed to demonstrate our controller's applicability.
Chinese Translation
本文提出了一种新颖的基于视觉的控制算法,用于调节可扩展多段软连续操纵器的整体形状。与文献中现有的基于视觉的控制算法(这些算法调节机器人的末端执行器姿态)不同,我们提出的控制算法调节机器人的整体体型,从而利用其运动冗余。此外,与文献中报告局部极小值的相关工作相比,我们基于模型的2.5D形状视觉伺服在机器人的三维工作空间中提供了全局稳定的渐近收敛。与文献中现有的视觉伺服算法不同,我们的方法不需要来自本体感觉传感器的信息,因此适用于没有此类能力的连续操纵器。相反,机器人状态是通过外部相机获取的图像进行估计,该相机观察机器人的整体形状,并用于闭合形状控制回路。传统的视觉伺服方案需要在机器人参考姿态下的图像来生成参考特征。在本研究中,我们利用逆运动学求解器生成所需机器人配置的参考特征,而不需要机器人在参考姿态下的图像。我们在一个多段连续操纵器上进行了实验,展示了控制器调节机器人整体形状的能力,同时精确定位机器人的末端执行器。结果验证了我们控制器调节连续机器人形状的能力,同时展示了平滑的瞬态响应和1毫米以内的稳态误差。为了证明我们控制器的适用性,进行了包括堆叠、倒水和拉动任务在内的概念验证物体操纵实验。
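As background, controllers in this family build on the classical resolved-rate visual-servoing law $\dot q = -\lambda L^{+}(s - s^{*})$. The sketch below is that generic law, not the paper's 2.5D shape controller, with the reference features $s^{*}$ supplied by an IK solver rather than a reference image, as the abstract describes; the toy Jacobian is an assumption.

import numpy as np

def visual_servo_step(s, s_ref, L, gain=0.5):
    """One resolved-rate visual-servoing update.

    s     : current feature vector from the external camera
    s_ref : reference features (here produced by an IK solver)
    L     : interaction matrix (feature Jacobian), ds/dq
    Returns an actuation-space velocity command dq and the error norm.
    """
    error = s - s_ref
    dq = -gain * np.linalg.pinv(L) @ error   # dq = -lambda * L^+ * e
    return dq, float(np.linalg.norm(error))

# Toy 2-feature / 2-actuator system; iterating drives the error to zero.
L = np.array([[1.0, 0.2],
              [0.1, 0.8]])
s, s_ref = np.array([0.4, -0.3]), np.zeros(2)
for _ in range(20):
    dq, err = visual_servo_step(s, s_ref, L)
    s = s + L @ dq                 # simulated feature response
print(round(err, 4))               # -> 0.0 (geometric convergence)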
cs.RO / 36 / 2602.19304

Safe and Interpretable Multimodal Path Planning for Multi-Agent Cooperation

安全且可解释的多模态路径规划用于多智能体合作
Shi, Haojun, Ye, Suyu, Guerrerio, Katherine M., Shen, Jianzhi, Yin, Yifan, Khashabi, Daniel, Huang, Chien-Ming, Shu, Tianmin
Abstract
Successful cooperation among decentralized agents requires each agent to quickly adapt its plan to the behavior of other agents. In scenarios where agents cannot confidently predict one another's intentions and plans, language communication can be crucial for ensuring safety. In this work, we focus on path-level cooperation in which agents must adapt their paths to one another in order to avoid collisions or perform physical collaboration such as joint carrying. In particular, we propose a safe and interpretable multimodal path planning method, CaPE (Code as Path Editor), which generates and updates path plans for an agent based on the environment and language communication from other agents. CaPE leverages a vision-language model (VLM) to synthesize a path editing program verified by a model-based planner, grounding communication to path plan updates in a safe and interpretable way. We evaluate our approach in diverse simulated and real-world scenarios, including multi-robot and human-robot cooperation in autonomous driving, household, and joint carrying tasks. Experimental results demonstrate that CaPE can be integrated into different robotic systems as a plug-and-play module, greatly enhancing a robot's ability to align its plan to language communication from other robots or humans. We also show that the combination of the VLM-based path editing program synthesis and model-based planning safety enables robots to achieve open-ended cooperation while maintaining safety and interpretability.
Chinese Translation
去中心化智能体之间的成功合作要求每个智能体能够快速调整其计划以适应其他智能体的行为。在智能体无法自信地预测彼此的意图和计划的场景中,语言交流对于确保安全至关重要。在本研究中,我们专注于路径级别的合作,其中智能体必须相互调整其路径以避免碰撞或进行物理协作,例如共同搬运。特别地,我们提出了一种安全且可解释的多模态路径规划方法——CaPE(Code as Path Editor),该方法基于环境和其他智能体的语言交流生成和更新智能体的路径计划。CaPE利用视觉-语言模型(VLM)合成一个由基于模型的规划器验证的路径编辑程序,以安全且可解释的方式将交流与路径计划更新相结合。我们在多种模拟和现实世界场景中评估了我们的方法,包括自主驾驶中的多机器人和人机合作、家庭任务以及共同搬运任务。实验结果表明,CaPE可以作为即插即用模块集成到不同的机器人系统中,大大增强了机器人根据其他机器人或人类的语言交流调整其计划的能力。我们还展示了基于VLM的路径编辑程序合成与基于模型的规划安全性的结合,使得机器人能够在保持安全性和可解释性的同时实现开放式合作。
cs.RO / 37 / 2602.19308

WildOS: Open-Vocabulary Object Search in the Wild

WildOS:野外开放词汇物体搜索
Shah, Hardik, Tevere, Erica, Atha, Deegan, Kaufmann, Marcel, Khattak, Shehryar, Patel, Manthan, Hutter, Marco, Frey, Jonas, Spieler, Patrick
Abstract
Autonomous navigation in complex, unstructured outdoor environments requires robots to operate over long ranges without prior maps and with limited depth sensing. In such settings, relying solely on geometric frontiers for exploration is often insufficient: the ability to reason semantically about where to go and what is safe to traverse is crucial for robust, efficient exploration. This work presents WildOS, a unified system for long-range, open-vocabulary object search that combines safe geometric exploration with semantic visual reasoning. WildOS builds a sparse navigation graph to maintain spatial memory, while utilizing a foundation-model-based vision module, ExploRFM, to score frontier nodes of the graph. ExploRFM simultaneously predicts traversability, visual frontiers, and object similarity in image space, enabling real-time, onboard semantic navigation tasks. The resulting vision-scored graph enables the robot to explore semantically meaningful directions while ensuring geometric safety. Furthermore, we introduce a particle-filter-based method for coarse localization of the open-vocabulary target query that estimates candidate goal positions beyond the robot's immediate depth horizon, enabling effective planning toward distant goals. Extensive closed-loop field experiments across diverse off-road and urban terrains demonstrate that WildOS enables robust navigation, significantly outperforming purely geometric and purely vision-based baselines in both efficiency and autonomy. Our results highlight the potential of vision foundation models to drive open-world robotic behaviors that are both semantically informed and geometrically grounded. Project Page: https://leggedrobotics.github.io/wildos/
Chinese Translation
在复杂且非结构化的户外环境中,自主导航要求机器人在没有先验地图且深度感知受限的情况下进行长距离作业。在这种情况下,仅依赖几何边界进行探索往往是不够的:能够在语义上推理出前往何处以及哪些区域可以安全通行,对于稳健和高效的探索至关重要。本研究提出了WildOS,一个统一的长距离开放词汇物体搜索系统,结合了安全的几何探索与语义视觉推理。WildOS构建了一个稀疏导航图以维护空间记忆,同时利用基于基础模型的视觉模块ExploRFM对图的边界节点进行评分。ExploRFM同时预测可通行性、视觉边界和图像空间中的物体相似性,从而实现实时的机载语义导航任务。生成的视觉评分图使机器人能够探索语义上有意义的方向,同时确保几何安全。此外,我们引入了一种基于粒子滤波的粗定位方法,用于开放词汇目标查询的定位,该方法估计超出机器人即时深度视野的候选目标位置,从而实现对远程目标的有效规划。在多样的越野和城市地形中进行的大量闭环实地实验表明,WildOS实现了稳健的导航,在效率和自主性方面显著优于纯几何和纯视觉基线。我们的结果突显了视觉基础模型驱动开放世界机器人行为的潜力,这些行为既有语义信息又有几何基础。项目页面:https://leggedrobotics.github.io/wildos/
cs.RO / 38 / 2602.19313

TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

TOPReward:作为隐含零样本奖励的令牌概率在机器人学中的应用
Chen, Shirui, Harrison, Cole, Lee, Ying-Chun, Yang, Angela Jin, Ren, Zhongzheng, Ratliff, Lillian J., Duan, Jiafei, Fox, Dieter, Krishna, Ranjay
Abstract
While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
Chinese Translation
尽管视觉-语言-动作(VLA)模型在预训练方面取得了快速进展,但它们在强化学习(RL)中的发展仍受到低样本效率和现实环境中稀疏奖励的制约。开发可泛化的过程奖励模型对于提供必要的细粒度反馈以弥合这一差距至关重要,然而现有的时间价值函数往往无法超越其训练领域进行泛化。我们提出了TOPReward,这是一种新颖的、基于概率的时间价值函数,它利用预训练视频视觉-语言模型(VLMs)的潜在世界知识来估计机器人任务的进展。与之前的方法不同,TOPReward并不直接提示VLM输出进展值(这容易导致数值误表示),而是直接从VLM的内部令牌logits中提取任务进展。在130多个不同的现实任务和多个机器人平台(如Franka、YAM、SO-100/101)的零样本评估中,TOPReward在Qwen3-VL上实现了0.947的平均价值顺序相关性(VOC),显著优于在同一开源模型上几乎零相关性的最先进GVL基线。我们进一步证明,TOPReward可作为下游应用的多功能工具,支持成功检测和奖励对齐的行为克隆等任务。
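The abstract's central trick, reading progress off internal token logits rather than off decoded text, can be illustrated with a toy computation. This is a hedged sketch: we assume the candidate progress tokens are the integers 0 to 100 and take a probability-weighted expectation; the paper's actual token set and aggregation may differ.

import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def expected_progress(logits: np.ndarray, token_values: np.ndarray) -> float:
    """Probability-weighted task progress from a VLM's token logits.

    Rather than trusting the single decoded number, use the model's full
    distribution over candidate progress tokens and return its expectation.
    """
    return float(softmax(logits) @ token_values)

# Toy example: logits peaked around "60" with mass on neighboring values.
values = np.arange(0, 101, dtype=float)     # assumed candidate tokens 0..100
logits = -0.02 * (values - 60.0) ** 2
print(expected_progress(logits, values))    # ~ 60.0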
cs.RO / 39 / 2602.19315

Online Navigation Planning for Long-term Autonomous Operation of Underwater Gliders

水下滑翔器长期自主操作的在线导航规划
Darvariu, Victor-Alexandru, Reed, Charlotte Z., Stratmann, Jan, Lacerda, Bruno, Allsup, Benjamin, Woodward, Stephen, Siddle, Elizabeth, Saeharaseelan, Trishna, Jones, Owain, Jones, Dan, Ferreira, Tobias, Baker, Chloe, Chaplin, Kevin, Kirk, James, Morris, Ashley, Patmore, Ryan, Polton, Jeff, Williams, Charlotte, Kokkinaki, Alexandra, Lopez, Alvaro Lorenzo, Buck, Justin J. H., Hawes, Nick
Abstract
Underwater glider robots have become an indispensable tool for ocean sampling. Although stakeholders are calling for tools to manage increasingly large fleets of gliders, successful autonomous long-term deployments have thus far been scarce, which hints at a lack of suitable methodologies and systems. In this work, we formulate glider navigation planning as a stochastic shortest-path Markov Decision Process and propose a sample-based online planner based on Monte Carlo Tree Search. Samples are generated by a physics-informed simulator that captures uncertain execution of controls and ocean current forecasts while remaining computationally tractable. The simulator parameters are fitted using historical glider data. We integrate these methods into an autonomous command-and-control system for Slocum gliders that enables closed-loop replanning at each surfacing. The resulting system was validated in two field deployments in the North Sea totalling approximately 3 months and 1000 km of autonomous operation. Results demonstrate improved efficiency compared to straight-to-goal navigation and show the practicality of sample-based planning for long-term marine autonomy.
Chinese Translation
水下滑翔器机器人已成为海洋取样中不可或缺的工具。尽管利益相关者呼吁开发工具以管理日益庞大的滑翔器舰队,但迄今为止成功的自主长期部署仍然稀少,这表明缺乏合适的方法论和系统。在本研究中,我们将滑翔器导航规划形式化为随机最短路径马尔可夫决策过程,并提出了一种基于蒙特卡洛树搜索的采样式在线规划器。样本由一个物理信息模拟器生成,该模拟器能够捕捉控制执行的不确定性与洋流预报,同时保持计算上的可处理性。模拟器参数使用历史滑翔器数据进行拟合。我们将这些方法集成到一个面向Slocum滑翔器的自主指挥与控制系统中,使其能够在每次浮出水面时进行闭环重新规划。所得到的系统在北海进行了两次实地部署,总计约3个月和1000公里的自主操作。结果表明,与直接到达目标的导航相比,效率得到了改善,并展示了基于采样的规划在长期海洋自主操作中的实用性。
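To make the planning core concrete, here is a minimal UCT-style Monte Carlo Tree Search over a stochastic generative model standing in for the paper's physics-informed simulator. The toy one-dimensional "ocean", the unit cost per surfacing, and all names are illustrative assumptions, not the deployed system.

import math, random
from collections import defaultdict

class MCTSPlanner:
    """Minimal UCT planner for a stochastic shortest-path problem.

    simulate_step(state, action) must return (next_state, cost, done);
    here it plays the role of the glider simulator with uncertain
    control execution and ocean-current forecasts.
    """
    def __init__(self, actions, simulate_step, c=1.4, horizon=20):
        self.actions, self.sim, self.c, self.H = actions, simulate_step, c, horizon
        self.N = defaultdict(int)      # visit counts, keyed by (state, action)
        self.Q = defaultdict(float)    # running-mean cost-to-go estimates

    def plan(self, state, n_rollouts=500):
        for _ in range(n_rollouts):
            self._rollout(state, depth=0)
        return min(self.actions, key=lambda a: self.Q[(state, a)])

    def _rollout(self, state, depth):
        if depth >= self.H:
            return 0.0
        a = self._select(state)
        nxt, cost, done = self.sim(state, a)
        total = cost if done else cost + self._rollout(nxt, depth + 1)
        key = (state, a)
        self.N[key] += 1
        self.Q[key] += (total - self.Q[key]) / self.N[key]
        return total

    def _select(self, state):
        n_state = sum(self.N[(state, a)] for a in self.actions)
        if n_state == 0:
            return random.choice(self.actions)
        # UCB1 adapted to cost minimization: subtract the exploration bonus.
        return min(self.actions, key=lambda a: self.Q[(state, a)]
                   - self.c * math.sqrt(math.log(n_state + 1) / (self.N[(state, a)] + 1)))

# Toy 1-D "ocean": drift perturbs each move; the goal region is |x| >= 5.
def toy_sim(x, a):
    drift = random.choice([-1, 0, 1])          # uncertain current
    nxt = x + a + drift
    return nxt, 1.0, abs(nxt) >= 5             # unit cost per surfacing

planner = MCTSPlanner(actions=[-1, 1], simulate_step=toy_sim)
print(planner.plan(0))    # -1 or 1: the symmetric toy has two optima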
cs.RO / 40 / 2602.19346

Design and Control of Modular Magnetic Millirobots for Multimodal Locomotion and Shape Reconfiguration

用于多模态运动和形状重构的模块化磁性微型机器人设计与控制
Oyono, Erik Garcia, Lin, Jialin, Zhang, Dandan
Abstract
Modular small-scale robots offer the potential for on-demand assembly and disassembly, enabling task-specific adaptation in dynamic and constrained environments. However, existing modular magnetic platforms often depend on workspace collisions for reconfiguration, employ bulky three-dimensional electromagnetic systems, and lack robust single-module control, which limits their applicability in biomedical settings. In this work, we present a modular magnetic millirobotic platform comprising three cube-shaped modules with embedded permanent magnets, each designed for a distinct functional role: a free module that supports self-assembly and reconfiguration, a fixed module that enables flip-and-walk locomotion, and a gripper module for cargo manipulation. Locomotion and reconfiguration are actuated by programmable combinations of time-varying two-dimensional uniform and gradient magnetic field inputs. Experiments demonstrate closed-loop navigation using real-time vision feedback and A* path planning, establishing robust single-module control capabilities. Beyond locomotion, the system achieves self-assembly, multimodal transformations, and disassembly at low field strengths. Chain-to-gripper transformations succeeded in 90% of trials, while chain-to-square transformations were less consistent, underscoring the role of module geometry in reconfiguration reliability. These results establish a versatile modular robotic platform capable of multimodal behavior and robust control, suggesting a promising pathway toward scalable and adaptive task execution in confined environments.
Chinese Translation
模块化小型机器人具有按需组装和拆卸的潜力,使其能够在动态和受限环境中进行任务特定的适应。然而,现有的模块化磁性平台通常依赖于工作空间的碰撞进行重构,采用笨重的三维电磁系统,并且缺乏稳健的单模块控制,这限制了它们在生物医学环境中的应用。在本研究中,我们提出了一种模块化磁性微型机器人平台,该平台由三个带有嵌入式永久磁铁的立方体模块组成,每个模块设计用于特定的功能角色:一个支持自组装和重构的自由模块,一个使翻转行走运动成为可能的固定模块,以及一个用于货物操作的抓取模块。运动和重构通过可编程的时变二维均匀和梯度磁场输入的组合来驱动。实验表明,使用实时视觉反馈和 A* 路径规划实现了闭环导航,建立了稳健的单模块控制能力。除了运动外,该系统在低场强下实现了自组装、多模态转变和拆卸。链到抓取的转变在90%的试验中成功,而链到方形的转变则不够一致,突显了模块几何形状在重构可靠性中的作用。这些结果建立了一个多功能的模块化机器人平台,能够实现多模态行为和稳健控制,为在受限环境中可扩展和自适应的任务执行提供了有前景的路径。
cs.RO / 41 / 2602.19359

Vid2Sid: Videos Can Help Close the Sim2Real Gap

Vid2Sid:视频可以帮助缩小模拟与真实之间的差距
Qiu, Kevin, Zhang, Yu, Cygan, Marek, Hughes, Josie
Abstract
Calibrating a robot simulator's physics parameters (friction, damping, material stiffness) to match real hardware is often done by hand or with black-box optimizers that reduce error but cannot explain which physical discrepancies drive the error. When sensing is limited to external cameras, the problem is further compounded by perception noise and the absence of direct force or state measurements. We present Vid2Sid, a video-driven system identification pipeline that couples foundation-model perception with a VLM-in-the-loop optimizer that analyzes paired sim-real videos, diagnoses concrete mismatches, and proposes physics parameter updates with natural language rationales. We evaluate our approach on a tendon-actuated finger (rigid-body dynamics in MuJoCo) and a deformable continuum tentacle (soft-body dynamics in PyElastica). On sim2real holdout controls unseen during training, Vid2Sid achieves the best average rank across all settings, matching or exceeding black-box optimizers while uniquely providing interpretable reasoning at each iteration. Sim2sim validation confirms that Vid2Sid recovers ground-truth parameters most accurately (mean relative error under 13% vs. 28–98%), and ablation analysis reveals three calibration regimes. VLM-guided optimization excels when perception is clean and the simulator is expressive, while model-class limitations bound performance in more challenging settings.
Chinese Translation
校准机器人模拟器的物理参数(摩擦、阻尼、材料刚度)以匹配真实硬件通常是通过手动或使用黑箱优化器来完成的,这些优化器虽然能够减少误差,但无法解释哪些物理差异导致了误差。当传感仅限于外部摄像头时,问题因感知噪声和缺乏直接的力或状态测量而进一步复杂化。我们提出了Vid2Sid,这是一种视频驱动的系统辨识流程,将基础模型感知与VLM(视觉-语言模型)参与循环的优化器相结合,分析配对的模拟-真实视频,诊断具体的不匹配,并提出带有自然语言理由的物理参数更新。我们在一个腱驱动的手指(在MuJoCo中的刚体动力学)和一个可变形的连续触手(在PyElastica中的软体动力学)上评估了我们的方法。在训练期间未见的sim2real保留控制中,Vid2Sid在所有设置中实现了最佳平均排名,匹配或超过了黑箱优化器,同时在每次迭代中独特地提供可解释的推理。Sim2sim验证确认Vid2Sid最准确地恢复了真实参数(平均相对误差低于13%,对比方法为28%至98%),消融分析揭示了三种校准情形。当感知干净且模拟器表现力强时,VLM引导的优化表现优异,而在更具挑战性的设置中,模型类别的限制则限制了性能。
cs.RO / 42 / 2602.19372

Seeing Farther and Smarter: Value-Guided Multi-Path Reflection for VLM Policy Optimization

更远更智能的视野:基于价值引导的多路径反思用于 VLM 策略优化
Yang, Yanting, Gao, Shenyuan, Bu, Qingwen, Chen, Li, Metaxas, Dimitris N.
Abstract
Solving complex, long-horizon robotic manipulation tasks requires a deep understanding of physical interactions, reasoning about their long-term consequences, and precise high-level planning. Vision-Language Models (VLMs) offer a general perceive-reason-act framework for this goal. However, previous approaches using reflective planning to guide VLMs in correcting actions encounter significant limitations. These methods rely on inefficient and often inaccurate implicit learning of state-values from noisy foresight predictions, evaluate only a single greedy future, and suffer from substantial inference latency. To address these limitations, we propose a novel test-time computation framework that decouples state evaluation from action generation. This provides a more direct and fine-grained supervisory signal for robust decision-making. Our method explicitly models the advantage of an action plan, quantified by its reduction in distance to the goal, and uses a scalable critic to estimate it. To address the stochastic nature of single-trajectory evaluation, we employ beam search to explore multiple future paths and aggregate them during decoding to model their expected long-term returns, leading to more robust action generation. Additionally, we introduce a lightweight, confidence-based trigger that allows for early exit when direct predictions are reliable, invoking reflection only when necessary. Extensive experiments on diverse, unseen multi-stage robotic manipulation tasks demonstrate a 24.6% improvement in success rate over state-of-the-art baselines, while significantly reducing inference time by 56.5%.
Chinese Translation
解决复杂的长时间机器人操作任务需要对物理交互有深入理解,推理其长期后果,并进行精确的高层次规划。视觉-语言模型(Vision-Language Models, VLMs)为实现这一目标提供了一种通用的感知-推理-行动框架。然而,之前使用反思规划来指导 VLMs 纠正动作的方法存在显著的局限性。这些方法依赖于从噪声预测中低效且往往不准确的隐式学习状态值,仅评估单一贪婪未来,并且存在显著的推理延迟。为了解决这些局限性,我们提出了一种新颖的测试时计算框架,将状态评估与动作生成解耦。这为稳健的决策提供了更直接和细致的监督信号。我们的方法明确建模了动作计划的优势,通过其减少到目标的距离来量化,并使用可扩展的评论者进行估计。为了解决单轨迹评估的随机性,我们采用束搜索探索多个未来路径,并在解码过程中对其进行聚合,以建模其预期的长期回报,从而生成更稳健的动作。此外,我们引入了一种轻量级的基于置信度的触发器,当直接预测可靠时允许提前退出,仅在必要时调用反思。在多样化、未见过的多阶段机器人操作任务上的广泛实验表明,与最先进的基线相比,成功率提高了 24.6%,同时推理时间显著减少了 56.5%。
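Two ingredients of the abstract are easy to make concrete: the explicit advantage of a plan, defined as its reduction in distance to the goal, and the aggregation of several beam-searched futures into an expected return. The sketch below is a toy illustration; the log-probability weighting of beams is our assumption, not necessarily the paper's aggregation rule.

import numpy as np

def plan_advantage(dist_before: float, dist_after: float) -> float:
    """Advantage of an action plan: how much it reduces distance to goal,
    the explicit signal used in place of implicit state-values."""
    return dist_before - dist_after

def beam_expected_distance(beam_distances, beam_logprobs):
    """Aggregate several candidate futures (a beam) into an expectation
    instead of trusting a single greedy rollout."""
    w = np.exp(beam_logprobs - np.max(beam_logprobs))
    w /= w.sum()
    return float(w @ np.asarray(beam_distances))

# Three sampled futures for one plan and the distances they end at.
d0 = 1.0
futures = [0.4, 0.6, 0.9]                   # meters to goal after each path
logp = np.array([-0.2, -0.9, -2.5])         # model log-probs of each path
expected_d = beam_expected_distance(futures, logp)
print(round(plan_advantage(d0, expected_d), 3))   # expected distance reduction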
cs.RO / 43 / 2602.19400

Hilbert-Augmented Reinforcement Learning for Scalable Multi-Robot Coverage and Exploration

基于希尔伯特空间增强的可扩展多机器人覆盖与探索强化学习
Gurunathan, Tamil Selvan, Gangopadhyay, Aryya
Abstract
We present a coverage framework that integrates Hilbert space-filling priors into decentralized multi-robot learning and execution. We augment DQN and PPO with Hilbert-based spatial indices to structure exploration and reduce redundancy in sparse-reward environments, and we evaluate scalability in multi-robot grid coverage. We further describe a waypoint interface that converts Hilbert orderings into curvature-bounded, time-parameterized SE(2) trajectories (planar $(x, y, \theta)$), enabling onboard feasibility on resource-constrained robots. Experiments show improvements in coverage efficiency, redundancy, and convergence speed over DQN/PPO baselines. In addition, we validate the approach on a Boston Dynamics Spot legged robot, executing the generated trajectories in indoor environments and observing reliable coverage with low redundancy. These results indicate that geometric priors improve autonomy and scalability for swarm and legged robotics.
Chinese Translation
我们提出了一种覆盖框架,将希尔伯特空间填充先验融入去中心化的多机器人学习与执行中。我们通过基于希尔伯特的空间索引增强深度Q网络(DQN)和近端策略优化(PPO),以结构化探索并减少稀疏奖励环境中的冗余,并评估其在多机器人网格覆盖中的可扩展性。我们进一步描述了一种航点接口,将希尔伯特排序转换为具有曲率约束的时间参数化SE(2)轨迹(平面 $(x, y, \theta)$),使得在资源受限的机器人上实现机载可行性。实验表明,与DQN/PPO基线相比,覆盖效率、冗余和收敛速度都有所改善。此外,我们在波士顿动力公司的Spot四足机器人上验证了该方法,在室内环境中执行生成的轨迹,并观察到可靠的覆盖和低冗余。这些结果表明,几何先验提高了群体和四足机器人系统的自主性和可扩展性。
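A Hilbert index here is simply a map from grid cells to positions along a space-filling curve, so that cells with nearby indices are spatially adjacent; sweeping cells in index order gives locality-preserving coverage. Below is the standard bit-manipulation routine for the cell-to-index map (the classic public-domain algorithm, not the authors' code), applied to an 8 x 8 coverage grid.

def xy2d(n: int, x: int, y: int) -> int:
    """Map grid cell (x, y) to its index along a Hilbert curve over an
    n x n grid, where n is a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is traversed consistently.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order an 8 x 8 coverage grid; a robot sweeps cells in Hilbert order.
order = sorted(((xy2d(8, x, y), (x, y)) for x in range(8) for y in range(8)))
print([cell for _, cell in order[:6]])
# -> [(0, 0), (0, 1), (1, 1), (1, 0), (2, 0), (3, 0)]  (consecutive cells adjacent)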
cs.RO / 44 / 2602.19491

Botson: An Accessible and Low-Cost Platform for Social Robotics Research

Botson:一个可访问且低成本的社会机器人研究平台
Bellaire, Samuel, Abu-raddaha, Abdalmalek, Kim, Natalie, Morhan, Nathan, Elliott, William, Rawashdeh, Samir
Abstract
Trust remains a critical barrier to the effective integration of Artificial Intelligence (AI) into human-centric domains. Disembodied agents, such as voice assistants, often fail to establish trust due to their inability to convey non-verbal social cues. This paper introduces the architecture of Botson: an anthropomorphic social robot powered by a large language model (LLM). Botson was created as a low-cost and accessible platform for social robotics research.
Chinese Translation
信任仍然是将人工智能(AI)有效整合到以人为中心领域的关键障碍。非具身代理(如语音助手)往往由于无法传达非语言社会线索而难以建立信任。本文介绍了Botson的架构:一个由大型语言模型(LLM)驱动的人形社会机器人。Botson的创建旨在提供一个低成本且易于获取的社会机器人研究平台。
cs.RO / 45 / 2602.19518

Anticipate, Adapt, Act: A Hybrid Framework for Task Planning

预见、适应、行动:一种混合任务规划框架
Dash, Nabanita, Kaura, Ayush, Singh, Shivam, Singh, Ramandeep, Banerjee, Snehasis, Sridharan, Mohan, Krishna, K. Madhava
Abstract
Anticipating and adapting to failures is a key capability robots need to collaborate effectively with humans in complex domains. This continues to be a challenge despite the impressive performance of state of the art AI planning systems and Large Language Models (LLMs) because of the uncertainty associated with the tasks and their outcomes. Toward addressing this challenge, we present a hybrid framework that integrates the generic prediction capabilities of an LLM with the probabilistic sequential decision-making capability of Relational Dynamic Influence Diagram Language. For any given task, the robot reasons about the task and the capabilities of the human attempting to complete it; predicts potential failures due to lack of ability (in the human) or lack of relevant domain objects; and executes actions to prevent such failures or recover from them. Experimental evaluation in the VirtualHome 3D simulation environment demonstrates substantial improvement in performance compared with state of the art baselines.
Chinese Translation
预见和适应失败是机器人在复杂领域中与人类有效协作所需的关键能力。尽管先进的人工智能规划系统和大型语言模型(LLMs)表现出色,但由于任务及其结果的不确定性,这仍然是一个挑战。为了解决这一挑战,我们提出了一种混合框架,将LLM的通用预测能力与关系动态影响图语言(Relational Dynamic Influence Diagram Language)的概率序列决策能力相结合。对于任何给定的任务,机器人会推理该任务及尝试完成该任务的人类的能力;预测由于能力不足(在人类中)或缺乏相关领域对象而可能出现的失败;并采取行动以防止此类失败或从中恢复。我们在VirtualHome 3D仿真环境中的实验评估表明,与现有的先进基线相比,性能有显著提升。
cs.RO / 46 / 2602.19532

Bellman Value Decomposition for Task Logic in Safe Optimal Control

贝尔曼值分解在安全最优控制中的任务逻辑
Sharpless, William, So, Oswin, Hirsch, Dylan, Herbert, Sylvia, Fan, Chuchu
Abstract
Real-world tasks involve nuanced combinations of goal and safety specifications. In high dimensions, the challenge is exacerbated: formal automata become cumbersome, and the combination of sparse rewards tends to require laborious tuning. In this work, we consider the innate structure of the Bellman Value as a means to naturally organize the problem for improved automatic performance. Namely, we prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve the Value and optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies. We conduct a variety of simulated and hardware experiments to test our method on complex, high-dimensional tasks involving heterogeneous teams and nonlinear dynamics. Ultimately, we find this approach greatly improves performance over existing baselines, balancing safety and liveness automatically.
Chinese Translation
现实世界的任务涉及目标和安全规范的细致组合。在高维空间中,这一挑战更加复杂:形式自动机变得笨重,稀疏奖励的组合往往需要繁琐的调优。在本研究中,我们考虑贝尔曼值的内在结构,作为自然组织问题以提高自动性能的一种手段。具体而言,我们证明了在时间逻辑中定义的复杂任务的贝尔曼值可以分解为一个贝尔曼值图,该图通过一组著名的贝尔曼方程(Bellman equations, BEs)相连:到达-避免贝尔曼方程(Reach-Avoid BE)、避免贝尔曼方程(Avoid BE)以及一种新型的到达-避免-循环贝尔曼方程(Reach-Avoid-Loop BE)。为了求解值和最优策略,我们提出了VDPPO,它将分解后的值图嵌入到一个两层神经网络中,利用隐式依赖进行引导。我们进行了多种模拟和硬件实验,以测试我们的方法在涉及异构团队和非线性动态的复杂高维任务中的表现。最终,我们发现这种方法在性能上显著优于现有基准,能够自动平衡安全性和活性。
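To ground the terminology, here is a small-scale instance of a Reach-Avoid Bellman backup on a deterministic grid world. The success-probability convention (goal value 1, avoid value 0), the discount, and the grid are our assumptions for illustration; the paper composes such equations into a value graph solved by VDPPO rather than by tabular iteration.

import numpy as np

def reach_avoid_value_iteration(shape, goal, avoid, gamma=0.95, iters=200):
    """Reach-Avoid Bellman backup on a deterministic grid:

    V(s) = 1                        if s is in the reach (goal) set,
           0                        if s is in the avoid set,
           gamma * max_a V(f(s,a))  otherwise.
    """
    H, W = shape
    V = np.zeros((H, W))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(iters):
        for i in range(H):
            for j in range(W):
                if (i, j) in goal:
                    V[i, j] = 1.0
                elif (i, j) in avoid:
                    V[i, j] = 0.0
                else:
                    V[i, j] = gamma * max(
                        V[min(max(i + di, 0), H - 1), min(max(j + dj, 0), W - 1)]
                        for di, dj in moves)
    return V

V = reach_avoid_value_iteration((5, 5), goal={(4, 4)}, avoid={(2, 2), (2, 3)})
print(np.round(V, 2))   # value decays with distance; zero on the avoid cells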
cs.RO / 47 / 2602.19534

Large Language Model-Assisted UAV Operations and Communications: A Multifaceted Survey and Tutorial

大型语言模型辅助的无人机操作与通信:多维度调查与教程
Emami, Yousef, Zhou, Hao, Reddy, Radha, Arani, Atefeh Hajijamali, Wang, Biliang, Li, Kai, Almeida, Luis, Han, Zhu
Abstract
Uncrewed Aerial Vehicles (UAVs) are widely deployed across diverse applications due to their mobility and agility. Recent advances in Large Language Models (LLMs) offer a transformative opportunity to enhance UAV intelligence beyond conventional optimization-based and learning-based approaches. By integrating LLMs into UAV systems, advanced environmental understanding, swarm coordination, mobility optimization, and high-level task reasoning can be achieved, thereby allowing more adaptive and context-aware aerial operations. This survey systematically explores the intersection of LLMs and UAV technologies and proposes a unified framework that consolidates existing architectures, methodologies, and applications for UAVs. We first present a structured taxonomy of LLM adaptation techniques for UAVs, including pretraining, fine-tuning, Retrieval-Augmented Generation (RAG), and prompt engineering, along with key reasoning capabilities such as Chain-of-Thought (CoT) and In-Context Learning (ICL). We then examine LLM-assisted UAV communications and operations, covering navigation, mission planning, swarm control, safety, autonomy, and network management. After that, the survey further discusses Multimodal LLMs (MLLMs) for human-swarm interaction, perception-driven navigation, and collaborative control. Finally, we address ethical considerations, including bias, transparency, accountability, and Human-in-the-Loop (HITL) strategies, and outline future research directions. Overall, this work positions LLM-assisted UAVs as a foundation for intelligent and adaptive aerial systems.
Chinese Translation
无人机(UAV)因其机动性和灵活性而广泛应用于各种领域。大型语言模型(LLMs)的最新进展为提升无人机智能提供了变革性的机会,超越了传统的基于优化和学习的方法。通过将LLMs集成到无人机系统中,可以实现先进的环境理解、群体协调、机动优化和高层次任务推理,从而实现更具适应性和上下文感知的空中操作。本调查系统地探讨了LLMs与无人机技术的交叉点,并提出了一个统一框架,整合了现有的无人机架构、方法论和应用。我们首先呈现了无人机LLM适应技术的结构化分类,包括预训练、微调、检索增强生成(RAG)和提示工程,以及关键推理能力,如思维链(CoT)和上下文学习(ICL)。接着,我们考察了LLM辅助的无人机通信与操作,涵盖导航、任务规划、群体控制、安全性、自主性和网络管理。随后,本调查进一步讨论了多模态LLMs(MLLMs)在人与机群交互、感知驱动导航和协同控制中的应用。最后,我们探讨了伦理考量,包括偏见、透明度、问责制和人在回路(HITL)策略,并概述了未来的研究方向。总体而言,本研究将LLM辅助的无人机定位为智能和自适应空中系统的基础。
cs.RO / 48 / 2602.19538

Cost-Aware Diffusion Active Search

成本感知的扩散主动搜索
Banerjee, Arundhati, Schneider, Jeff
Abstract
Active search for recovering objects of interest through online, adaptive decision making with autonomous agents requires trading off exploration of unknown environments with exploitation of prior observations in the search space. Prior work has proposed information gain and Thompson sampling based myopic, greedy approaches for agents to actively decide query or search locations when the number of targets is unknown. Decision making algorithms in such partially observable environments have also shown that agents capable of lookahead over a finite horizon outperform myopic policies for active search. Unfortunately, lookahead algorithms typically rely on building a computationally expensive search tree that is simulated and updated based on the agent's observations and a model of the environment dynamics. Instead, in this work, we leverage the sequence modeling abilities of diffusion models to sample lookahead action sequences that balance the exploration-exploitation trade-off for active search without building an exhaustive search tree. We identify the optimism bias in prior diffusion based reinforcement learning approaches when applied to the active search setting and propose mitigating solutions for efficient cost-aware decision making with both single and multi-agent teams. Our proposed algorithm outperforms standard baselines in offline reinforcement learning in terms of full recovery rate and is computationally more efficient than tree search in cost-aware active decision making.
Chinese Translation
通过在线、自适应决策的自主代理进行对象恢复的主动搜索需要在未知环境的探索与搜索空间中先前观察的利用之间进行权衡。先前的研究提出了基于信息增益和汤普森采样的短视贪婪方法,使代理能够在目标数量未知的情况下主动决定查询或搜索位置。在这种部分可观察环境中的决策算法也表明,能够进行有限前瞻的代理在主动搜索中优于短视策略。不幸的是,前瞻算法通常依赖于构建一个计算开销较大的搜索树,该搜索树基于代理的观察和环境动态模型进行模拟和更新。相反,在本研究中,我们利用扩散模型的序列建模能力,采样前瞻动作序列,以平衡主动搜索中的探索与利用的权衡,而无需构建详尽的搜索树。我们识别出在主动搜索环境中应用先前基于扩散的强化学习方法时的乐观偏差,并提出了有效的解决方案,以实现单代理和多代理团队的成本感知决策。我们提出的算法在离线强化学习中在完全恢复率方面优于标准基线,并且在成本感知主动决策中比树搜索计算上更高效。
cs.RO / 49 / 2602.19577

Chasing Ghosts: A Simulation-to-Real Olfactory Navigation Stack with Optional Vision Augmentation

追逐幽灵:具有可选视觉增强的仿真到现实的嗅觉导航系统
France, Kordel K., Daescu, Ovidiu, Khan, Latifur, Peddi, Rohith
Abstract
Autonomous odor source localization remains a challenging problem for aerial robots due to turbulent airflow, sparse and delayed sensory signals, and strict payload and compute constraints. While prior unmanned aerial vehicle (UAV)-based olfaction systems have demonstrated gas distribution mapping or reactive plume tracing, they rely on predefined coverage patterns, external infrastructure, or extensive sensing and coordination. In this work, we present a complete, open-source UAV system for online odor source localization using a minimal sensor suite. The system integrates custom olfaction hardware, onboard sensing, and a learning-based navigation policy trained in simulation and deployed on a real quadrotor. Through our minimal framework, the UAV is able to navigate directly toward an odor source without constructing an explicit gas distribution map or relying on external positioning systems. Vision is incorporated as an optional complementary modality to accelerate navigation under certain conditions. We validate the proposed system through real-world flight experiments in a large indoor environment using an ethanol source, demonstrating consistent source-finding behavior under realistic airflow conditions. The primary contribution of this work is a reproducible system and methodological framework for UAV-based olfactory navigation and source finding under minimal sensing assumptions. We elaborate on our hardware design and open source our UAV firmware, simulation code, olfaction-vision dataset, and circuit board to the community. Code, data, and designs will be made available at https://github.com/KordelFranceTech/ChasingGhosts.
Chinese Translation
自主气味源定位仍然是空中机器人面临的一个挑战性问题,原因在于气流湍动、传感信号稀疏且延迟,以及严格的载荷和计算约束。尽管先前基于无人机(UAV)的嗅觉系统已展示了气体分布映射或反应性气流追踪,但它们依赖于预定义的覆盖模式、外部基础设施或广泛的感知与协调。在本研究中,我们提出了一个完整的开源无人机系统,使用最小的传感器套件进行在线气味源定位。该系统集成了定制的嗅觉硬件、机载传感器和在仿真中训练的基于学习的导航策略,并在真实的四旋翼上部署。通过我们的最小框架,无人机能够直接导航到气味源,而无需构建明确的气体分布图或依赖外部定位系统。视觉作为一种可选的补充模态被纳入,以在某些条件下加速导航。我们通过在大型室内环境中使用乙醇源的真实飞行实验验证了所提出的系统,展示了在现实气流条件下的一致源发现行为。本工作的主要贡献是一个可重复的系统和方法框架,用于在最小感知假设下进行基于无人机的嗅觉导航和源发现。我们详细阐述了硬件设计,并将我们的无人机固件、仿真代码、嗅觉-视觉数据集和电路板开源给社区。代码、数据和设计将可在 https://github.com/KordelFranceTech/ChasingGhosts 获取。
cs.RO / 50 / 2602.19651

Denoising Particle Filters: Learning State Estimation with Single-Step Objectives

去噪粒子滤波器:通过单步目标学习状态估计
Röstel, Lennart, Bäuml, Berthold
Abstract
Learning-based methods commonly treat state estimation in robotics as a sequence modeling problem. While this paradigm can be effective at maximizing end-to-end performance, models are often difficult to interpret and expensive to train, since training requires unrolling sequences of predictions in time. As an alternative to end-to-end trained state estimation, we propose a novel particle filtering algorithm in which models are trained from individual state transitions, fully exploiting the Markov property in robotic systems. In this framework, measurement models are learned implicitly by minimizing a denoising score matching objective. At inference, the learned denoiser is used alongside a (learned) dynamics model to approximately solve the Bayesian filtering equation at each time step, effectively guiding predicted states toward the data manifold informed by measurements. We evaluate the proposed method on challenging robotic state estimation tasks in simulation, demonstrating competitive performance compared to tuned end-to-end trained baselines. Importantly, our method offers the desirable composability of classical filtering algorithms, allowing prior information and external sensor models to be incorporated without retraining.
Chinese Translation
基于学习的方法通常将机器人中的状态估计视为序列建模问题。尽管这一范式在最大化端到端性能方面可能有效,但模型往往难以解释且训练成本高,因为训练需要在时间上展开预测序列。作为端到端训练状态估计的替代方案,我们提出了一种新颖的粒子滤波算法,其中模型通过单个状态转移进行训练,充分利用了机器人系统中的马尔可夫性质。在这一框架中,测量模型通过最小化去噪得分匹配目标隐式学习。在推理时,学习到的去噪器与(学习的)动态模型一起使用,以近似求解每个时间步的贝叶斯滤波方程,有效地引导预测状态朝向由测量信息提供的数据流形。我们在模拟中对提出的方法进行了评估,展示了与调优的端到端训练基线相比的竞争性能。重要的是,我们的方法提供了经典滤波算法所期望的可组合性,允许在不重新训练的情况下整合先验信息和外部传感器模型。
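A minimal sketch of the loop the abstract describes: particles are pushed through a (learned) dynamics model, and then, instead of likelihood reweighting, they are moved along a learned score toward the measurement-informed data manifold. The Langevin-style update, the step sizes, and the function names are assumptions; the toy example uses an analytic Gaussian score in place of the trained denoiser.

import numpy as np

def dpf_step(particles, dynamics, score_fn, step=0.1, n_langevin=30, noise=0.05):
    """One predict-then-denoise filtering step (sketch).

    1. Predict: propagate particles through the dynamics model.
    2. Update : take a few Langevin steps along the learned score
       s(x) ~ grad_x log p(y | x) p(x), drifting particles toward
       states consistent with the current measurement.
    """
    particles = np.array([dynamics(p) for p in particles])        # predict
    for _ in range(n_langevin):                                   # denoise
        grads = np.array([score_fn(p) for p in particles])
        particles = (particles + step * grads
                     + np.sqrt(2.0 * step) * noise * np.random.randn(*particles.shape))
    return particles

# Toy 1-D example: the measurement implies a Gaussian centered at y = 2.0.
y = 2.0
score = lambda x: -(x - y)          # score of N(y, 1)
dyn = lambda x: x + 0.1             # constant-drift dynamics
pts = np.random.randn(256, 1)
pts = dpf_step(pts, dyn, score)
print(round(float(pts.mean()), 2))  # particles concentrate near 2.0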
cs.RO / 51 / 2602.19653

Scalable Low-Density Distributed Manipulation Using an Interconnected Actuator Array

基于互联执行器阵列的可扩展低密度分布式操控
Dacre, Bailey, Moreno, Rodrigo, Lambertsen, Jørn, Stoy, Kasper, Faíña, Andrés
Abstract
Distributed Manipulator Systems, composed of arrays of robotic actuators, necessitate dense actuator arrays to effectively manipulate small objects. This paper presents a system composed of modular 3-DoF robotic tiles interconnected by a compliant surface layer, forming a continuous, controllable manipulation surface. The compliant layer permits increased actuator spacing without compromising object manipulation capabilities, significantly reducing actuator density while maintaining robust control, even for smaller objects. We characterize the coupled workspace of the array and develop a manipulation strategy capable of translating objects to arbitrary positions within an N × N array. The approach is validated experimentally using a minimal 2 × 2 prototype, demonstrating the successful manipulation of objects with varied shapes and sizes.
Chinese Translation
分布式操控系统由机器人执行器阵列组成,需要密集的执行器阵列以有效操控小物体。本文提出了一种由模块化的3自由度(3-DoF)机器人瓦片组成的系统,这些瓦片通过一个柔性表层互联,形成一个连续的、可控的操控表面。柔性层允许增加执行器间距,而不影响物体操控能力,显著降低了执行器密度,同时在操控小物体时仍保持强大的控制能力。我们对阵列的耦合工作空间进行了表征,并开发了一种操控策略,能够将物体移动到 N × N 阵列内的任意位置。该方法通过一个最小的 2 × 2 原型进行了实验验证,成功操控了形状和尺寸各异的物体。
cs.RO / 52 / 2602.19699

CACTO-BIC: Scalable Actor-Critic Learning via Biased Sampling and GPU-Accelerated Trajectory Optimization

CACTO-BIC:通过偏置采样和GPU加速轨迹优化实现可扩展的演员-评论家学习
Alboni, Elisa, Crestaz, Pietro Noah, Fontanari, Elias, Del Prete, Andrea
Abstract
Trajectory Optimization (TO) and Reinforcement Learning (RL) offer complementary strengths for solving optimal control problems. TO efficiently computes locally optimal solutions but can struggle with non-convexity, while RL is more robust to non-convexity at the cost of significantly higher computational demands. CACTO (Continuous Actor-Critic with Trajectory Optimization) was introduced to combine these advantages by learning a warm-start policy that guides the TO solver towards low-cost trajectories. However, scalability remains a key limitation, as increasing system complexity significantly raises the computational cost of TO. This work introduces CACTO-BIC to address these challenges. CACTO-BIC improves data efficiency by biasing initial-state sampling leveraging a property of the value function associated with locally optimal policies; moreover, it reduces computation time by exploiting GPU acceleration. Empirical evaluations show improved sample efficiency and faster computation compared to CACTO. Comparisons with PPO demonstrate that our approach can achieve similar solutions in less time. Finally, experiments on the AlienGO quadruped robot demonstrate that CACTO-BIC can scale to high-dimensional systems and is suitable for real-time applications.
Chinese Translation
轨迹优化(TO)和强化学习(RL)在解决最优控制问题上提供了互补的优势。TO能够高效计算局部最优解,但在处理非凸性时可能会遇到困难,而RL在应对非凸性方面更具鲁棒性,但代价是显著更高的计算需求。CACTO(带轨迹优化的连续演员-评论家)被引入以结合这些优势,通过学习一个热启动策略来引导TO求解器朝向低成本轨迹。然而,可扩展性仍然是一个关键限制,因为系统复杂性的增加显著提高了TO的计算成本。本研究引入CACTO-BIC以应对这些挑战。CACTO-BIC通过利用与局部最优策略相关的价值函数的特性来偏置初始状态采样,从而提高数据效率;此外,它还通过利用GPU加速来减少计算时间。实证评估表明,与CACTO相比,样本效率和计算速度都有所提高。与PPO的比较表明,我们的方法可以在更短的时间内实现类似的解决方案。最后,在AlienGO四足机器人上的实验表明,CACTO-BIC能够扩展到高维系统,并适用于实时应用。
cs.RO / 53 / 2602.19764

Towards Dexterous Embodied Manipulation via Deep Multi-Sensory Fusion and Sparse Expert Scaling

通过深度多传感器融合与稀疏专家扩展实现灵巧的具身操控
Sun, Yirui, Zhuge, Guangyu, Liu, Keliang, Gu, Jie, Xia, Zhihao, Ren, Qionglin, Tian, Chunxu, Ga, Zhongxue
Abstract
Realizing dexterous embodied manipulation necessitates the deep integration of heterogeneous multimodal sensory inputs. However, current vision-centric paradigms often overlook the critical force and geometric feedback essential for complex tasks. This paper presents DeMUSE, a Deep Multimodal Unified Sparse Experts framework leveraging a Diffusion Transformer to integrate RGB, depth, and 6-axis force into a unified serialized stream. Adaptive Modality-specific Normalization (AdaMN) is employed to recalibrate modality-aware features, mitigating representation imbalance and harmonizing the heterogeneous distributions of multi-sensory signals. To facilitate efficient scaling, the architecture utilizes a Sparse Mixture-of-Experts (MoE) with shared experts, increasing model capacity for physical priors while maintaining the low inference latency required for real-time control. A Joint denoising objective synchronously synthesizes environmental evolution and action sequences to ensure physical consistency. Achieving success rates of 83.2% and 72.5% in simulation and real-world trials, DeMUSE demonstrates state-of-the-art performance, validating the necessity of deep multi-sensory integration for complex physical interactions.
Chinese Translation
实现灵巧的具身操控需要对异构多模态传感输入进行深度整合。然而,当前以视觉为中心的范式往往忽视了在复杂任务中至关重要的力和几何反馈。本文提出了DeMUSE(深度多模态统一稀疏专家框架),利用扩散变换器(Diffusion Transformer)将RGB、深度和6轴力整合为统一的序列流。采用自适应模态特定归一化(Adaptive Modality-specific Normalization, AdaMN)重新校准模态感知特征,减轻表示不平衡,协调多传感信号的异构分布。为了促进高效扩展,该架构利用共享专家的稀疏专家混合(Sparse Mixture-of-Experts, MoE),在保持实时控制所需的低推理延迟的同时,增加物理先验的模型容量。一个联合去噪目标同步合成环境演变和动作序列,以确保物理一致性。在仿真和现实世界实验中分别取得83.2%和72.5%的成功率,DeMUSE展示了最先进的性能,验证了深度多传感器整合在复杂物理交互中的必要性。
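The scaling mechanism, routed sparse experts plus always-on shared experts, is straightforward to sketch in PyTorch. This is a generic layer under assumed top-k routing, not DeMUSE's exact architecture, and it omits AdaMN: shared experts see every token, while each routed expert only pays compute for the tokens dispatched to it.

import torch
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    """Sparse Mixture-of-Experts with always-on shared experts (sketch)."""
    def __init__(self, dim, n_routed=8, n_shared=1, k=2):
        super().__init__()
        self.k = k
        ffn = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                    nn.Linear(4 * dim, dim))
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))
        self.shared = nn.ModuleList(ffn() for _ in range(n_shared))
        self.gate = nn.Linear(dim, n_routed)

    def forward(self, x):                        # x: (tokens, dim)
        out = sum(e(x) for e in self.shared)     # shared experts: every token
        topv, topi = self.gate(x).topk(self.k, dim=-1)
        w = topv.softmax(dim=-1)                 # (tokens, k) routing weights
        routed_out = torch.zeros_like(x)
        for slot in range(self.k):               # sparse dispatch
            for idx, expert in enumerate(self.routed):
                mask = topi[:, slot] == idx
                if mask.any():
                    routed_out[mask] = (routed_out[mask]
                                        + w[mask, slot, None] * expert(x[mask]))
        return out + routed_out

x = torch.randn(16, 64)
print(SharedExpertMoE(64)(x).shape)   # torch.Size([16, 64])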
cs.RO / 54 / 2602.19850

TactiVerse: Generalizing Multi-Point Tactile Sensing in Soft Robotics Using Single-Point Data

TactiVerse:利用单点数据在软机器人中推广多点触觉感知
Lee, Junhui, Kim, Hyosung, Nam, Saekwang
Abstract
Real-time prediction of deformation in highly compliant soft materials remains a significant challenge in soft robotics. While vision-based soft tactile sensors can track internal marker displacements, learning-based models for 3D contact estimation heavily depend on their training datasets, inherently limiting their ability to generalize to complex scenarios such as multi-point sensing. To address this limitation, we introduce TactiVerse, a U-Net-based framework that formulates contact geometry estimation as a spatial heatmap prediction task. Even when trained exclusively on a limited dataset of single-point indentations, our architecture achieves highly accurate single-point sensing, yielding a superior mean absolute error of 0.0589 mm compared to the 0.0612 mm of a conventional regression-based CNN baseline. Furthermore, we demonstrate that augmenting the training dataset with multi-point contact data substantially enhances the sensor's multi-point sensing capabilities, significantly improving the overall mean MAE for two-point discrimination from 1.214 mm to 0.383 mm. By successfully extrapolating complex contact geometries from fundamental interactions, this methodology unlocks advanced multi-point and large-area shape sensing. Ultimately, it significantly streamlines the development of marker-based soft sensors, offering a highly scalable solution for real-world tactile perception.
Chinese Translation
在高度柔性软材料中实时预测变形仍然是软机器人领域的一项重大挑战。尽管基于视觉的软触觉传感器可以跟踪内部标记的位移,但基于学习的模型在三维接触估计中严重依赖其训练数据集,这在本质上限制了它们在复杂场景(如多点感知)中的推广能力。为了解决这一局限性,我们提出了TactiVerse,一个基于U-Net的框架,将接触几何估计公式化为空间热图预测任务。即使仅在有限的单点压痕数据集上进行训练,我们的架构也能实现高度准确的单点感知,平均绝对误差为0.0589毫米,相较于传统回归基线CNN的0.0612毫米表现更优。此外,我们证明通过将训练数据集扩展到多点接触数据,显著增强了传感器的多点感知能力,将两点区分的整体平均绝对误差从1.214毫米显著改善至0.383毫米。通过成功地从基本交互中推断复杂的接触几何形状,该方法开启了先进的多点和大面积形状感知的可能性。最终,它显著简化了基于标记的软传感器的开发,为现实世界的触觉感知提供了一种高度可扩展的解决方案。
cs.RO / 55 / 2602.19898

Athena: An Autonomous Open-Hardware Tracked Rescue Robot Platform

雅典娜:一种自主开放硬件履带救援机器人平台
Fabian, Stefan, Schmidt, Aljoscha, Süß, Jonas, Dishant, Oza, Aum, von Stryk, Oskar
Abstract
In disaster response and situation assessment, robots have great potential in reducing the risks to the safety and health of first responders. As the situations encountered and the required capabilities of the robots deployed in such missions differ wildly and are often not known in advance, heterogeneous fleets of robots are needed to cover a wide range of mission requirements. While UAVs can quickly survey the mission environment, their ability to carry heavy payloads such as sensors and manipulators is limited. UGVs can carry required payloads to assess and manipulate the mission environment, but need to be able to deal with difficult and unstructured terrain such as rubble and stairs. The ability of tracked platforms with articulated arms (flippers) to reconfigure their geometry makes them particularly effective for navigating challenging terrain. In this paper, we present Athena, an open-hardware rescue ground robot research platform with four individually reconfigurable flippers and a reliable low-cost remote emergency stop (E-Stop) solution. A novel mounting solution using an industrial PU belt and tooth inserts allows the replacement and testing of different track profiles. The manipulator with a maximum reach of 1.54m can be used to operate doors, valves, and other objects of interest. Full CAD & PCB files, as well as all low-level software, are released as open-source contributions.
Chinese Translation
在灾难响应和情况评估中,机器人在降低第一响应者的安全和健康风险方面具有巨大潜力。由于所遇到的情况和部署在此类任务中所需的机器人能力差异很大,且通常无法提前得知,因此需要异构的机器人队伍以覆盖广泛的任务需求。尽管无人机(UAV)能够快速勘测任务环境,但其携带传感器和操纵器等重型有效载荷的能力有限。地面无人驾驶车辆(UGV)可以携带所需的有效载荷以评估和操作任务环境,但需要能够应对如瓦砾和楼梯等困难和非结构化的地形。带有关节臂(鳍状肢)的履带平台能够重新配置其几何形状,使其在应对挑战性地形时特别有效。本文介绍了雅典娜(Athena),一种开放硬件救援地面机器人研究平台,配备四个可单独重新配置的鳍状肢和一个可靠的低成本远程紧急停止(E-Stop)解决方案。采用工业聚氨酯(PU)带和齿插的创新安装方案允许更换和测试不同的履带轮廓。最大伸展范围为1.54米的操纵器可用于操作门、阀门和其他感兴趣的物体。完整的CAD和PCB文件,以及所有低级软件,均作为开源贡献发布。
cs.RO / 56 / 2602.19943

Scaling Law of Neural Koopman Operators

神经库普曼算子的尺度法则
Abuduweili, Abulikemu, Pang, Yuyang, Li, Feihan, Liu, Changliu
Abstract
Data-driven neural Koopman operator theory has emerged as a powerful tool for linearizing and controlling nonlinear robotic systems. However, the performance of these data-driven models fundamentally depends on the trade-off between sample size and model dimensions, a relationship for which the scaling laws have remained unclear. This paper establishes a rigorous framework to address this challenge by deriving and empirically validating scaling laws that connect sample size, latent space dimension, and downstream control quality. We derive a theoretical upper bound on the Koopman approximation error, explicitly decomposing it into sampling error and projection error. We show that these terms decay at specific rates relative to dataset size and latent dimension, providing a rigorous basis for the scaling law. Based on the theoretical results, we introduce two lightweight regularizers for the neural Koopman operator: a covariance loss to help stabilize the learned latent features and an inverse control loss to ensure the model aligns with physical actuation. The results from systematic experiments across six robotic environments confirm that model fitting error follows the derived scaling laws, and the regularizers improve dynamic model fitting fidelity, with enhanced closed-loop control performance. Together, our results provide a simple recipe for allocating effort between data collection and model capacity when learning Koopman dynamics for control.
Chinese Translation
基于数据驱动的神经库普曼算子理论已成为线性化和控制非线性机器人系统的强大工具。然而,这些数据驱动模型的性能在根本上依赖于样本大小与模型维度之间的权衡,而这一关系的尺度法则尚不清晰。本文建立了一个严格的框架来解决这一挑战,通过推导和实证验证连接样本大小、潜在空间维度和下游控制质量的尺度法则。我们推导了库普曼近似误差的理论上限,明确将其分解为采样误差和投影误差。我们表明,这些项相对于数据集大小和潜在维度以特定速率衰减,为尺度法则提供了严格的基础。基于理论结果,我们为神经库普曼算子引入了两个轻量级正则化器:一个协方差损失以帮助稳定学习到的潜在特征,以及一个逆控制损失以确保模型与物理驱动相一致。来自六个机器人环境的系统实验结果确认模型拟合误差遵循所推导的尺度法则,并且正则化器提高了动态模型拟合的保真度,增强了闭环控制性能。总的来说,我们的结果为在学习库普曼动力学以进行控制时在数据收集和模型能力之间分配努力提供了一个简单的方案。
cs.RO / 57 / 2602.19983

Contextual Safety Reasoning and Grounding for Open-World Robots

开放世界机器人中的上下文安全推理与基础构建
Ravichandran, Zachary, Snyder, David, Robey, Alexander, Hassani, Hamed, Kumar, Vijay, Pappas, George J.
Abstract
Robots are increasingly operating in open-world environments where safe behavior depends on context: the same hallway may require different navigation strategies when crowded versus empty, or during an emergency versus normal operations. Traditional safety approaches enforce fixed constraints in user-specified contexts, limiting their ability to handle the open-ended contextual variability of real-world deployment. We address this gap via CORE, a safety framework that enables online contextual reasoning, grounding, and enforcement without prior knowledge of the environment (e.g., maps or safety specifications). CORE uses a vision-language model (VLM) to continuously reason about context-dependent safety rules directly from visual observations, grounds these rules in the physical environment, and enforces the resulting spatially-defined safe sets via control barrier functions. We provide probabilistic safety guarantees for CORE that account for perceptual uncertainty, and we demonstrate through simulation and real-world experiments that CORE enforces contextually appropriate behavior in unseen environments, significantly outperforming prior semantic safety methods that lack online contextual reasoning. Ablation studies validate our theoretical guarantees and underscore the importance of both VLM-based reasoning and spatial grounding for enforcing contextual safety in novel settings. We provide additional resources at https://zacravichandran.github.io/CORE.
Chinese Translation
机器人越来越多地在开放世界环境中运行,其中安全行为依赖于上下文:同一条走廊在拥挤与空旷、紧急与正常操作时可能需要不同的导航策略。传统的安全方法在用户指定的上下文中强制执行固定约束,限制了它们处理真实世界部署中开放式上下文变异的能力。我们通过CORE来解决这一问题,CORE是一个安全框架,能够在没有环境先验知识(例如地图或安全规范)的情况下进行在线上下文推理、基础构建和执行。CORE使用视觉-语言模型(VLM)从视觉观察中持续推理关于上下文依赖的安全规则,将这些规则基础构建于物理环境中,并通过控制障碍函数执行由空间定义的安全集合。我们为CORE提供了考虑感知不确定性的概率安全保证,并通过仿真和现实世界实验展示了CORE在未见环境中强制执行上下文适当行为的能力,显著优于缺乏在线上下文推理的先前语义安全方法。消融研究验证了我们的理论保证,并强调了基于VLM的推理和空间基础构建在新环境中强制执行上下文安全的重要性。我们在 https://zacravichandran.github.io/CORE 提供了更多资源。
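For concreteness, enforcing a spatially-defined safe set with a control barrier function reduces, for a single affine constraint, to a one-line projection of the nominal command. The sketch below filters a command for a single-integrator robot against a keep-out disk; the dynamics, the barrier choice, and the gain are illustrative assumptions, not CORE's grounded rules.

import numpy as np

def cbf_safety_filter(u_des, grad_h, h, f, g, alpha=1.0):
    """Minimally-invasive safety filter for one control barrier function.

    For control-affine dynamics x_dot = f(x) + g(x) u, safety requires
        grad_h . (f + g u) >= -alpha * h.
    The closest safe input to u_des under one affine constraint
    a . u >= b has the closed form below (no QP solver needed).
    """
    a = grad_h @ g                        # constraint normal in input space
    b = -alpha * h - grad_h @ f           # constraint offset
    slack = b - a @ u_des
    if slack <= 0:                        # nominal command already safe
        return u_des
    return u_des + (slack / (a @ a)) * a  # minimal correction to the boundary

# Single-integrator robot with a keep-out disk of radius 1 at the origin:
# h(x) = ||x||^2 - 1, grad_h = 2x, f = 0, g = I.
x = np.array([1.5, 0.0])
u_des = np.array([-1.0, 0.0])             # commanded straight at the obstacle
u = cbf_safety_filter(u_des, 2 * x, x @ x - 1.0, np.zeros(2), np.eye(2))
print(u)   # braked so that h decays no faster than alpha * h allows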
cs.RO / 58 / 2602.20041

EEG-Driven Intention Decoding: Offline Deep Learning Benchmarking on a Robotic Rover

基于脑电图的意图解码:机器人探测器的离线深度学习基准测试
Alosaimi, Ghadah, Alsayyari, Maha, Sun, Yixin, Katsigiannis, Stamos, Atapour-Abarghouei, Amir, Breckon, Toby P.
Abstract
Brain-computer interfaces (BCIs) provide a hands-free control modality for mobile robotics, yet decoding user intent during real-world navigation remains challenging. This work presents a brain-robot control framework for offline decoding of driving commands during robotic rover operation. A 4WD Rover Pro platform was remotely operated by 12 participants who navigated a predefined route using a joystick, executing the commands forward, reverse, left, right, and stop. Electroencephalogram (EEG) signals were recorded with a 16-channel OpenBCI cap and aligned with motor actions at Δ = 0 ms and at future prediction horizons (Δ > 0 ms). After preprocessing, several deep learning models were benchmarked, including convolutional neural networks, recurrent neural networks, and Transformer architectures. ShallowConvNet achieved the highest performance for both action prediction and intent prediction. By combining real-world robotic control with multi-horizon EEG intention decoding, this study introduces a reproducible benchmark and reveals key design insights for predictive deep learning-based BCI systems.
Chinese Translation
脑-计算机接口(BCIs)为移动机器人提供了一种免提控制方式,但在实际导航过程中解码用户意图仍然具有挑战性。本研究提出了一种脑-机器人控制框架,用于在机器人探测器操作期间离线解码驾驶指令。12名参与者通过遥控操作4WD Rover Pro平台,使用操纵杆沿预定义路线导航,执行前进、后退、左转、右转和停止等指令。通过16通道OpenBCI帽子记录脑电图(EEG)信号,并与电机动作对齐,时间延迟为Δ = 0毫秒及未来预测时间范围(Δ > 0毫秒)。在预处理后,评估了多种深度学习模型,包括卷积神经网络、递归神经网络和Transformer架构。ShallowConvNet在动作预测和意图预测方面均表现出最高的性能。通过将真实世界的机器人控制与多时间范围的EEG意图解码相结合,本研究引入了一个可重复的基准,并揭示了基于预测深度学习的BCI系统的关键设计见解。
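The multi-horizon alignment in this benchmark lends itself to a simple windowing routine. The sketch below is a hedged Python example with hypothetical names (make_windows, events); the window length and 125 Hz sampling rate are placeholder assumptions, not values from the paper.

    import numpy as np

    def make_windows(eeg, events, fs=125, win_s=1.0, delta_ms=0):
        # eeg: (channels, samples); events: list of (sample_index, label).
        # A window ending delta_ms before each joystick action is paired
        # with that action's label, so delta_ms > 0 means prediction.
        win, shift = int(win_s * fs), int(delta_ms / 1000 * fs)
        X, y = [], []
        for t, label in events:
            stop = t - shift
            if stop - win >= 0:
                X.append(eeg[:, stop - win:stop])
                y.append(label)
        return np.stack(X), np.array(y)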
cs.RO / 59 / 2602.20054

Hydrodynamic Performance Enhancement of Unmanned Underwater Gliders with Soft Robotic Morphing Wings for Agility Improvement

采用软体机器人变形翼的无人水下滑翔器水动力性能提升以改善灵活性
Giordano, A., De Meurichy, G., Telazzi, V., Mucignat, C., Lunati, I., Louchard, D. A. L. M., Iovieno, M., Armanini, S. F., Kovac, M.
Abstract
This work assesses the hydrodynamic efficiency of Unmanned Underwater Vehicles (UUVs) equipped with soft morphing wings compared to conventional rigid wings. Unlike rigid wings, deformable counterparts can alter their aerodynamic properties on demand. Improvements in hydrodynamic efficiency extend a UUV's operational range and may determine mission feasibility. Structural and Computational Fluid Dynamics (CFD) simulations were conducted for both a soft morphing wing and a UUV incorporating it. The results show that a UUV employing soft wings achieves 9.75 percent higher overall efficiency than an equivalent vehicle with traditional rigid wings. These findings confirm the potential of soft robotics to enhance underwater vehicle performance, particularly in applications requiring pressure-agnostic operation.
Chinese Translation
本研究评估了配备软体变形翼的无人水下航行器(UUV)与传统刚性翼相比的水动力效率。与刚性翼不同,变形翼可以根据需要改变其气动特性。水动力效率的提升扩展了UUV的作业范围,并可能影响任务的可行性。对软体变形翼及其所结合的UUV进行了结构和计算流体动力学(CFD)模拟。结果表明,采用软翼的UUV整体效率比具有传统刚性翼的同类车辆高出9.75%。这些发现确认了软体机器人在提升水下车辆性能方面的潜力,特别是在需要不受压力影响的操作应用中。
cs.RO / 60 / 2602.20055

To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation

移动还是不移动:基于约束的规划实现交互导航的零样本泛化
Vashisth, Apoorva, Kulshrestha, Manav, Bakshi, Pranav, Conover, Damon, Sartoretti, Guillaume, Bera, Aniket
Abstract
Visual navigation typically assumes the existence of at least one obstacle-free path between start and goal, which must be discovered/planned by the robot. However, in real-world scenarios, such as home environments and warehouses, clutter can block all routes. Targeted at such cases, we introduce the Lifelong Interactive Navigation problem, where a mobile robot with manipulation abilities can move clutter to forge its own path to complete sequential object-placement tasks - each involving placing a given object (e.g., alarm clock, pillow) onto a target object (e.g., dining table, desk, bed). To address this lifelong setting - where effects of environment changes accumulate and have long-term effects - we propose an LLM-driven, constraint-based planning framework with active perception. Our framework allows the LLM to reason over a structured scene graph of discovered objects and obstacles, deciding which object to move, where to place it, and where to look next to discover task-relevant information. This coupling of reasoning and active perception allows the agent to explore the regions expected to contribute to task completion rather than exhaustively mapping the environment. A standard motion planner then executes the corresponding navigate-pick-place or detour sequence, ensuring reliable low-level control. Evaluated in the physics-enabled ProcTHOR-10k simulator, our approach outperforms non-learning and learning-based baselines. We further demonstrate our approach qualitatively on real-world hardware.
Chinese Translation
视觉导航通常假设在起点和目标之间至少存在一条无障碍路径,机器人必须发现/规划该路径。然而,在现实场景中,例如家庭环境和仓库,杂物可能会阻塞所有路线。针对这种情况,我们提出了终身交互导航问题,其中具备操作能力的移动机器人可以移动杂物,以开辟自己的路径来完成顺序物体放置任务——每个任务涉及将给定物体(例如闹钟、枕头)放置到目标物体(例如餐桌、书桌、床)上。为了解决这一终身设置——环境变化的影响会累积并产生长期效果——我们提出了一种基于约束的、由大型语言模型(LLM)驱动的规划框架,结合主动感知。我们的框架允许LLM在发现的物体和障碍物的结构化场景图上进行推理,决定移动哪个物体、将其放置在哪里,以及接下来在哪里寻找与任务相关的信息。这种推理与主动感知的结合使得代理能够探索预期有助于任务完成的区域,而不是耗尽地图环境。然后,标准运动规划器执行相应的导航-拾取-放置或绕行序列,确保可靠的低级控制。在物理启用的ProcTHOR-10k模拟器中评估,我们的方法优于非学习和基于学习的基线。我们还在真实硬件上定性展示了我们的方法。
cs.RO / 61 / 2602.20057

AdaWorldPolicy: World-Model-Driven Diffusion Policy with Online Adaptive Learning for Robotic Manipulation

AdaWorldPolicy:基于世界模型驱动的扩散策略与在线自适应学习用于机器人操控
Yuan, Ge, Qiao, Qiyuan, Zhang, Jing, Xu, Dong
Abstract
Effective robotic manipulation requires policies that can anticipate physical outcomes and adapt to real-world environments. In this work, we introduce a unified framework, World-Model-Driven Diffusion Policy with Online Adaptive Learning (AdaWorldPolicy), to enhance robotic manipulation under dynamic conditions with minimal human involvement. Our core insight is that world models provide strong supervision signals, enabling online adaptive learning in dynamic environments, which can be complemented by force-torque feedback to mitigate dynamic force shifts. Our AdaWorldPolicy integrates a world model, an action expert, and a force predictor, all implemented as interconnected Flow Matching Diffusion Transformers (DiT). They are interconnected via multi-modal self-attention layers, enabling deep feature exchange for joint learning while preserving their distinct modularity characteristics. We further propose a novel Online Adaptive Learning (AdaOL) strategy that dynamically switches between an Action Generation mode and a Future Imagination mode to drive reactive updates across all three modules. This creates a powerful closed-loop mechanism that adapts to both visual and physical domain shifts with minimal overhead. Across a suite of simulated and real-robot benchmarks, our AdaWorldPolicy achieves state-of-the-art performance, with dynamic adaptive capacity in out-of-distribution scenarios.
Chinese Translation
有效的机器人操控需要能够预测物理结果并适应现实环境的策略。在本研究中,我们提出了一个统一框架——基于世界模型驱动的扩散策略与在线自适应学习(AdaWorldPolicy),旨在在动态条件下增强机器人操控,尽量减少人类的参与。我们的核心见解是,世界模型提供了强大的监督信号,使得在动态环境中进行在线自适应学习成为可能,并且可以通过力-扭矩反馈来缓解动态力的变化。我们的AdaWorldPolicy集成了一个世界模型、一个动作专家和一个力预测器——所有这些都作为相互连接的流匹配扩散变换器(Flow Matching Diffusion Transformers, DiT)实现。它们通过多模态自注意力层相互连接,能够进行深层特征交换以实现联合学习,同时保持其独特的模块化特性。我们进一步提出了一种新颖的在线自适应学习(AdaOL)策略,该策略在动作生成模式和未来想象模式之间动态切换,以驱动所有三个模块的反应更新。这创造了一个强大的闭环机制,能够以最小的开销适应视觉和物理领域的变化。在一系列模拟和真实机器人基准测试中,我们的AdaWorldPolicy实现了最先进的性能,具备在分布外场景中动态适应的能力。
cs.RO / 62 / 2602.20119

NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

NovaPlan:通过闭环视频语言规划实现零样本长时间操作
Fu, Jiahui, Nan, Junyu, Sun, Lingfeng, Li, Hongyu, Qian, Jianing, Barry, Jennifer L., Kitani, Kris, Konidaris, George
Abstract
Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: https://nova-plan.github.io/
Chinese Translation
解决长时间任务需要机器人将高层语义推理与低层物理交互相结合。虽然视觉-语言模型(VLMs)和视频生成模型能够分解任务并想象结果,但它们通常缺乏现实世界执行所需的物理基础。我们提出了NovaPlan,一个层次化框架,将闭环VLM与视频规划以及几何基础的机器人执行统一起来,以实现零样本长时间操作。在高层,VLM规划器将任务分解为子目标,并在闭环中监控机器人执行,使系统能够通过自主重新规划从单步失败中恢复。为了计算低层机器人动作,我们从生成的视频中提取并利用与任务相关的物体关键点和人手姿态作为运动学先验,并采用切换机制选择更优的参考,以指导机器人动作,即使在严重遮挡或深度不准确的情况下也能保持稳定执行。我们在三个长时间任务和功能性操作基准(FMB)上展示了NovaPlan的有效性。我们的结果表明,NovaPlan能够执行复杂的组装任务,并表现出灵巧的错误恢复行为,而无需任何先前的演示或训练。项目页面:https://nova-plan.github.io/
cs.RO / 63 / 2602.20150

Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization

基于物理感知的联合形状与姿态优化的仿真准备杂乱场景估计
Huang, Wei-Cheng, Han, Jiaheng, Ye, Xiaohan, Pan, Zherong, Hauser, Kris
Abstract
Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Regretfully, existing methods struggle in cluttered environments, often exhibiting prohibitive computational cost, poor robustness, and restricted generality when scaling to multiple interacting objects. We propose a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical constraints. Our method is built on two key technical innovations. First, we leverage the recently introduced shape-differentiable contact model, whose global differentiability permits joint optimization over object geometry and pose while modeling inter-object contacts. Second, we exploit the structured sparsity of the augmented Lagrangian Hessian to derive an efficient linear system solver whose computational cost scales favorably with scene complexity. Building on this formulation, we develop an end-to-end real-to-sim scene estimation pipeline that integrates learning-based object initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement. Experiments on cluttered scenes with up to 5 objects and 22 convex hulls demonstrate that our approach robustly reconstructs physically valid, simulation-ready object shapes and poses.
Chinese Translation
从现实世界观察中估计仿真准备场景对于后续的规划和策略学习任务至关重要。然而,现有方法在杂乱环境中表现不佳,常常面临高昂的计算成本、较差的鲁棒性以及在扩展到多个相互作用物体时的有限通用性。我们提出了一种统一的基于优化的真实到仿真场景估计方法,该方法在物理约束下联合恢复多个刚性物体的形状和姿态。我们的方法基于两个关键的技术创新。首先,我们利用最近提出的形状可微接触模型,其全局可微性允许在建模物体间接触的同时,对物体几何形状和姿态进行联合优化。其次,我们利用增广拉格朗日海森矩阵的结构稀疏性,推导出一种高效的线性系统求解器,其计算成本与场景复杂性呈良好比例关系。在此基础上,我们开发了一个端到端的真实到仿真场景估计管道,集成了基于学习的物体初始化、物理约束的联合形状-姿态优化和可微纹理细化。在最多包含5个物体和22个凸包的杂乱场景上的实验表明,我们的方法能够稳健地重建物理有效的、仿真准备的物体形状和姿态。
计算机视觉 (Computer Vision)
210
cs.CV / 1 / 2602.18439

Replication Study: Federated Text-Driven Prompt Generation for Vision-Language Models

复现研究:面向视觉-语言模型的联邦文本驱动提示生成
Prasad, Suraj, Pant, Anubha
Abstract
Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities, yet their adaptation to federated learning scenarios presents significant challenges, particularly regarding generalization to unseen classes. The original FedTPG paper (Qiu2024) addresses this limitation by introducing a text-driven prompt generation network that dynamically creates prompts conditioned on class names, enabling better cross-class generalization in federated settings. In this work, we present a faithful replication study of FedTPG, evaluating the pre-trained model on six diverse vision datasets: Caltech101, Oxford Flowers, FGVC Aircraft, Oxford Pets, Food-101, and DTD. Our evaluation achieves results within 0.2% of the original paper's reported accuracies, with an average accuracy of 74.58% on seen (base) classes and 76.00% on unseen (new) classes, demonstrating a +1.43 percentage point improvement in generalization. These results validate the original paper's core claims: (1) text-driven prompt generation enables superior generalization to unseen classes compared to static prompt learning methods, and (2) federated training of prompt generators maintains high performance across diverse visual domains without sharing private data. Our successful replication confirms the robustness and reproducibility of the FedTPG approach.
Chinese Translation
视觉-语言模型如CLIP展示了显著的零样本能力,但其在联邦学习场景下的适应性面临重大挑战,特别是在对未见类别的泛化能力方面。原始的FedTPG论文(Qiu2024)通过引入一个文本驱动的提示生成网络来解决这一限制,该网络动态生成基于类别名称的提示,从而在联邦环境中实现更好的跨类别泛化。在本研究中,我们对FedTPG进行了忠实的复现研究,评估了在六个不同的视觉数据集上的预训练模型:Caltech101、Oxford Flowers、FGVC Aircraft、Oxford Pets、Food-101和DTD。我们的评估结果与原论文报告的准确率相差不超过0.2%,在已见(基础)类别上的平均准确率为74.58%,在未见(新)类别上的准确率为76.00%,展示了泛化能力提高了1.43个百分点。这些结果验证了原论文的核心主张:(1)与静态提示学习方法相比,文本驱动的提示生成能够更好地泛化到未见类别;(2)提示生成器的联邦训练在不同视觉领域中保持高性能,而无需共享私人数据。我们的成功复现确认了FedTPG方法的稳健性和可重复性。
cs.CV / 2 / 2602.18496

A Patient-Specific Digital Twin for Adaptive Radiotherapy of Non-Small Cell Lung Cancer

用于非小细胞肺癌自适应放疗的患者特异性数字双胞胎
Sud, Anvi, Huang, Jialu, Hart, Gregory R., Saxena, Keshav, Kim, John, Tressel, Lauren, Deng, Jun
Abstract
Radiotherapy continues to become more precise and data-dense, with current treatment regimens generating high-frequency imaging and dosimetry streams ideally suited for AI-driven temporal modeling to characterize how normal tissues evolve with time. Each fraction in biologically guided radiotherapy (BGRT)-treated non-small cell lung cancer (NSCLC) patients records new metabolic, anatomical, and dose information. However, clinical decision making is largely informed by static, population-based normal tissue complication probability (NTCP) models which overlook the dynamic, unique biological trajectories encoded in sequential data. We developed COMPASS (Comprehensive Personalized Assessment System) for safe radiotherapy, functioning as a temporal digital twin architecture utilizing per-fraction PET, CT, dosiomics, radiomics, and cumulative biologically equivalent dose (BED) kinetics to model normal tissue biology as a dynamic time series process. A GRU autoencoder was employed to learn organ-specific latent trajectories, which were classified via logistic regression to predict eventual CTCAE grade 1 or higher toxicity. Eight NSCLC patients undergoing BGRT contributed 99 organ-fraction observations covering 24 organ trajectories (spinal cord, heart, and esophagus). Despite the small cohort, intensive temporal phenotyping allowed for comprehensive analysis of individual dose-response dynamics. Our findings revealed a viable AI-driven early warning window, as risk ratings began increasing several fractions before clinical toxicity. The dense BED-driven representation revealed biologically relevant spatial dose texture characteristics that occur before toxicity and are averaged out with traditional volume-based dosimetry. COMPASS establishes a proof of concept for AI-enabled adaptive radiotherapy, where treatment is guided by a continually updated digital twin that tracks each patient's evolving biological response.
Chinese Translation
放疗正变得越来越精确且数据密集,当前的治疗方案生成高频成像和剂量测量流,理想地适用于基于人工智能的时间建模,以描述正常组织随时间的演变。在生物指导放疗(BGRT)中,每个分次治疗非小细胞肺癌(NSCLC)患者时都会记录新的代谢、解剖和剂量信息。然而,临床决策主要依赖于静态的人群基础的正常组织并发症概率(NTCP)模型,这些模型忽视了序列数据中编码的动态、独特的生物轨迹。我们开发了COMPASS(综合个性化评估系统),用于安全放疗,作为一种时间数字双胞胎架构,利用每个分次的正电子发射断层扫描(PET)、计算机断层扫描(CT)、剂量组学、影像组学和累积生物当量剂量(BED)动力学,建模正常组织生物学作为动态时间序列过程。采用GRU自编码器学习器官特定的潜在轨迹,通过逻辑回归进行分类,以预测最终的CTCAE 1级或更高的毒性。八名接受BGRT的NSCLC患者贡献了99个器官分次观察,涵盖24个器官轨迹(脊髓、心脏和食管)。尽管样本量较小,但密集的时间表型分析使得对个体剂量反应动态进行了全面分析。我们的研究结果揭示了一个可行的基于人工智能的早期预警窗口,因为风险评级在临床毒性出现前的几个分次中就开始增加。密集的BED驱动表示揭示了在毒性发生之前的生物相关空间剂量纹理特征,而这些特征在传统的基于体积的剂量测量中被平均化。COMPASS为基于人工智能的自适应放疗建立了概念验证,其中治疗由一个不断更新的数字双胞胎指导,跟踪每位患者不断演变的生物反应。
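A minimal PyTorch sketch of the modeling pattern the abstract describes (a GRU autoencoder whose latent feeds a separate logistic-regression toxicity classifier); the layer sizes and names are illustrative assumptions, not the COMPASS architecture.

    import torch.nn as nn

    class GRUAutoencoder(nn.Module):
        def __init__(self, n_feats, latent=16):
            super().__init__()
            self.enc = nn.GRU(n_feats, latent, batch_first=True)
            self.dec = nn.GRU(latent, n_feats, batch_first=True)

        def forward(self, x):               # x: (batch, fractions, feats)
            z_seq, z_last = self.enc(x)     # z_last: (1, batch, latent)
            recon, _ = self.dec(z_seq)      # reconstruct each fraction
            return recon, z_last.squeeze(0) # latent -> logistic regression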
cs.CV / 3 / 2602.18500

Scaling Ultrasound Volumetric Reconstruction via Mobile Augmented Reality

通过移动增强现实实现超声体积重建的规模化
Ng, Kian Wei, Gao, Yujia, Khoo, Deborah, Tan, Ying Zhen, Mao, Chengzheng, Cheng, Haojie, Makmur, Andrew, Ngiam, Kee Yuan, Goh, Serene, Khoo, Eng Tat
Abstract
Accurate volumetric characterization of lesions is essential for oncologic diagnosis, risk stratification, and treatment planning. While imaging modalities such as Computed Tomography provide high-quality 3D data, 2D ultrasound (2D-US) remains the preferred first-line modality for breast and thyroid imaging due to cost, portability, and safety factors. However, volume estimates derived from 2D-US suffer from high inter-user variability even among experienced clinicians. Existing 3D ultrasound (3D-US) solutions use specialized probes or external tracking hardware, but such configurations increase costs and diminish portability, constraining widespread clinical use. To address these limitations, we present Mobile Augmented Reality Volumetric Ultrasound (MARVUS), a resource-efficient system designed to increase accessibility to accurate and reproducible volumetric assessment. MARVUS is interoperable with conventional ultrasound (US) systems, using a foundation model to enhance cross-specialty generalization while minimizing hardware requirements relative to current 3D-US solutions. In a user study involving experienced clinicians performing measurements on breast phantoms, MARVUS yielded a substantial improvement in volume estimation accuracy (mean difference: 0.469 cm³) with reduced inter-user variability (mean difference: 0.417 cm³). Additionally, we show that augmented reality (AR) visualizations enhance objective performance metrics and clinician-reported usability. Collectively, our findings suggest that MARVUS can enhance US-based cancer screening, diagnostic workflows, and treatment planning in a scalable, cost-conscious, and resource-efficient manner. A usage video demonstration is available (https://youtu.be/m4llYcZpqmM).
Chinese Translation
对病变进行准确的体积表征对于肿瘤诊断、风险分层和治疗规划至关重要。尽管计算机断层扫描等成像方式提供高质量的三维数据,但由于成本、便携性和安全性因素,二维超声(2D-US)仍然是乳腺和甲状腺成像的首选一线方式。然而,从2D-US得出的体积估计在经验丰富的临床医生之间存在较高的用户间变异性。现有的三维超声(3D-US)解决方案使用专用探头或外部追踪硬件,但这种配置增加了成本并降低了便携性,限制了其广泛的临床应用。为了解决这些局限性,我们提出了移动增强现实体积超声(Mobile Augmented Reality Volumetric Ultrasound,MARVUS),这是一个资源高效的系统,旨在提高准确和可重复的体积评估的可及性。MARVUS与传统超声(US)系统兼容,利用基础模型增强跨专业的泛化能力,同时相较于当前的3D-US解决方案最小化硬件需求。在一项涉及经验丰富的临床医生对乳腺模型进行测量的用户研究中,MARVUS在体积估计准确性上取得了显著改善(平均差异:0.469 cm³),并减少了用户间变异性(平均差异:0.417 cm³)。此外,我们证明增强现实(AR)可视化提升了客观性能指标和临床医生报告的可用性。综合来看,我们的研究结果表明,MARVUS能够以可扩展、注重成本和资源高效的方式增强基于超声的癌症筛查、诊断工作流程和治疗规划。使用视频演示可见(https://youtu.be/m4llYcZpqmM)。
cs.CV / 4 / 2602.18502

Mitigating Shortcut Learning via Feature Disentanglement in Medical Imaging: A Benchmark Study

通过特征解耦减轻医疗影像中的捷径学习:基准研究
Müller, Sarah, Berens, Philipp
Abstract
Although deep learning models in medical imaging often achieve excellent classification performance, they can rely on shortcut learning, exploiting spurious correlations or confounding factors that are not causally related to the target task. This poses risks in clinical settings, where models must generalize across institutions, populations, and acquisition conditions. Feature disentanglement is a promising approach to mitigate shortcut learning by separating task-relevant information from confounder-related features in latent representations. In this study, we systematically evaluated feature disentanglement methods for mitigating shortcuts in medical imaging, including adversarial learning and latent space splitting based on dependence minimization. We assessed classification performance and disentanglement quality using latent space analyses across one artificial and two medical datasets with natural and synthetic confounders. We also examined robustness under varying levels of confounding and compared computational efficiency across methods. We found that shortcut mitigation methods improved classification performance under strong spurious correlations during training. Latent space analyses revealed differences in representation quality not captured by classification metrics, highlighting the strengths and limitations of each method. Model reliance on shortcuts depended on the degree of confounding in the training data. The best-performing models combine data-centric rebalancing with model-centric disentanglement, achieving stronger and more robust shortcut mitigation than rebalancing alone while maintaining similar computational efficiency.
Chinese Translation
尽管深度学习模型在医疗影像中通常能够实现出色的分类性能,但它们可能依赖于捷径学习,利用与目标任务没有因果关系的虚假相关性或混杂因素。这在临床环境中构成风险,因为模型必须能够在不同机构、不同人群和不同采集条件下进行泛化。特征解耦是一种有前景的方法,通过在潜在表示中将与任务相关的信息与与混杂因素相关的特征分离,从而减轻捷径学习。在本研究中,我们系统评估了特征解耦方法在医疗影像中减轻捷径学习的效果,包括基于对抗学习和依赖性最小化的潜在空间分割。我们使用潜在空间分析评估了在一个人工数据集和两个具有自然和合成混杂因素的医疗数据集上的分类性能和解耦质量。我们还考察了在不同混杂水平下的鲁棒性,并比较了各方法的计算效率。我们发现,捷径减轻方法在训练期间改善了在强虚假相关性下的分类性能。潜在空间分析揭示了分类指标未能捕捉的表示质量差异,突显了每种方法的优缺点。模型对捷径的依赖程度取决于训练数据中的混杂程度。表现最佳的模型结合了以数据为中心的重平衡与以模型为中心的解耦,达到了比单独重平衡更强大且更稳健的捷径减轻效果,同时保持了类似的计算效率。
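One common way to implement the dependence-minimization flavor of latent space splitting is a cross-correlation penalty between the task block and the confounder block of the representation. The sketch below is a generic Python example of that idea, not the exact objective benchmarked in the paper.

    import torch

    def decorrelation_penalty(z_task, z_conf):
        # Penalize the cross-covariance between task-related and
        # confounder-related latent blocks (a linear proxy for
        # statistical independence between the two splits).
        zt = z_task - z_task.mean(0)
        zc = z_conf - z_conf.mean(0)
        cov = zt.t() @ zc / (len(zt) - 1)
        return (cov ** 2).mean()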
cs.CV / 5 / 2602.18504

A Computer Vision Framework for Multi-Class Detection and Tracking in Soccer Broadcast Footage

用于足球转播视频的多类别检测与跟踪的计算机视觉框架
Tshiani, Daniel
Abstract
Clubs with access to expensive multi-camera setups or GPS tracking systems gain a competitive advantage through detailed data, whereas lower-budget teams are often unable to collect similar information. This paper examines whether such data can instead be extracted directly from standard broadcast footage using a single-camera computer vision pipeline. This project develops an end-to-end system that combines a YOLO object detector with the ByteTrack tracking algorithm to identify and track players, referees, goalkeepers, and the ball throughout a match. Experimental results show that the pipeline achieves high performance in detecting and tracking players and officials, with strong precision, recall, and mAP50 scores, while ball detection remains the primary challenge. Despite this limitation, our findings demonstrate that AI can extract meaningful player-level spatial information from a single broadcast camera. By reducing reliance on specialized hardware, the proposed approach enables colleges, academies, and amateur clubs to adopt scalable, data-driven analysis methods previously accessible only to professional teams, highlighting the potential for affordable computer vision-based soccer analytics.
Chinese Translation
拥有昂贵多摄像头设备或GPS跟踪系统的俱乐部通过详细的数据获得竞争优势,而预算较低的球队通常无法收集类似的信息。本文探讨了是否可以通过单摄像头计算机视觉管道直接从标准转播视频中提取此类数据。该项目开发了一个端到端系统,将YOLO(You Only Look Once)目标检测器与ByteTrack跟踪算法相结合,以识别和跟踪比赛中的球员、裁判、守门员和足球。实验结果表明,该管道在检测和跟踪球员及官员方面表现出色,具有较高的精确度、召回率和mAP50(平均精确度均值)得分,而足球检测仍然是主要挑战。尽管存在这一限制,我们的研究结果表明,人工智能可以从单个转播摄像头中提取有意义的球员级空间信息。通过减少对专业硬件的依赖,所提出的方法使得大学、学院和业余俱乐部能够采用可扩展的数据驱动分析方法,这些方法以前仅限于专业球队,从而突显了基于计算机视觉的足球分析的经济潜力。
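For orientation, the YOLO-plus-ByteTrack combination the paper builds on can be reproduced with the off-the-shelf ultralytics tracking API, roughly as below; the weights file and video path are hypothetical, and the paper's own pipeline adds class handling and downstream spatial analysis.

    from ultralytics import YOLO

    model = YOLO("soccer_yolo.pt")  # hypothetical fine-tuned weights
    for result in model.track(source="match.mp4", tracker="bytetrack.yaml",
                              stream=True, persist=True):
        if result.boxes.id is None:   # no confirmed tracks this frame
            continue
        for box, tid, cls in zip(result.boxes.xyxy, result.boxes.id,
                                 result.boxes.cls):
            pass  # accumulate per-track, per-class positions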
cs.CV / 6 / 2602.18505

Suppression or Deletion: A Restoration-Based Representation-Level Analysis of Machine Unlearning

抑制还是删除:基于恢复的机器遗忘表征层分析
Jang, Yurim, Lee, Jaeung, Kim, Dohyun, Jo, Jaemin, Woo, Simon S.
Abstract
As pretrained models are increasingly shared on the web, ensuring that models can forget or delete sensitive, copyrighted, or private information upon request has become crucial. Machine unlearning has been proposed to address this challenge. However, current evaluations for unlearning methods rely on output-based metrics, which cannot verify whether information is completely deleted or merely suppressed at the representation level, where suppression is insufficient for true unlearning. To address this gap, we propose a novel restoration-based analysis framework that uses Sparse Autoencoders to identify class-specific expert features in intermediate layers and applies inference-time steering to quantitatively distinguish between suppression and deletion. Applying our framework to 12 major unlearning methods in image classification tasks, we find that most methods achieve high restoration rates of unlearned information, indicating that they only suppress information at the decision-boundary level, while preserving semantic features in intermediate representations. Notably, even retraining from pretrained checkpoints shows high restoration, revealing that robust semantic features inherited from pretraining are not removed by retraining. These results demonstrate that representation-level retention poses significant risks overlooked by output-based metrics, highlighting the need for new unlearning evaluation criteria. We propose new evaluation guidelines that prioritize representation-level verification, especially for privacy-critical applications in the era of pre-trained models.
Chinese Translation
随着预训练模型在网络上的共享日益增多,确保模型能够在请求时忘记或删除敏感、受版权保护或私人信息变得至关重要。机器遗忘被提出以应对这一挑战。然而,目前对遗忘方法的评估依赖于基于输出的指标,这无法验证信息是否在表征层面上被完全删除或仅仅被抑制,而抑制对于真正的遗忘是不够的。为了解决这一问题,我们提出了一种新颖的基于恢复的分析框架,该框架使用稀疏自编码器(Sparse Autoencoders)在中间层中识别特定类别的专家特征,并应用推理时引导定量区分抑制和删除。将我们的框架应用于12种主要的图像分类任务中的遗忘方法,我们发现大多数方法在被遗忘信息的恢复率上表现较高,表明它们仅在决策边界层面抑制信息,同时保留中间表征中的语义特征。值得注意的是,即使是从预训练检查点重新训练,也显示出高恢复率,揭示了从预训练中继承的强健语义特征并未通过重新训练而被移除。这些结果表明,表征层面的保留存在显著风险,而这些风险被基于输出的指标所忽视,强调了对新的遗忘评估标准的需求。我们提出了新的评估指南,优先考虑表征层面的验证,特别是在预训练模型时代对于隐私关键应用的评估。
cs.CV / 7 / 2602.18509

Depth from Defocus via Direct Optimization

通过直接优化获取深度信息
Jackson, Holly, Adams, Caleb, Lopez-Francos, Ignacio, Recht, Benjamin
Abstract
Though there exists a reasonable forward model for blur based on optical physics, recovering depth from a collection of defocused images remains a computationally challenging optimization problem. In this paper, we show that with contemporary optimization methods and reasonable computing resources, a global optimization approach to depth from defocus is feasible. Our approach rests on alternating minimization. When holding the depth map fixed, the forward model is linear with respect to the all-in-focus image. When holding the all-in-focus image fixed, the depth at each pixel can be computed independently, enabling embarrassingly parallel computation. We show that alternating between convex optimization and parallel grid search can effectively solve the depth-from-defocus problem at higher resolutions than current deep learning methods. We demonstrate our approach on benchmark datasets with synthetic and real defocus blur and show promising results compared to prior approaches. Our code is available at github.com/hollyjackson/dfd.
Chinese Translation
尽管基于光学物理的模糊前向模型是合理的,但从一组失焦图像中恢复深度仍然是一个计算上具有挑战性的优化问题。本文展示了在现代优化方法和合理计算资源的支持下,基于失焦的深度全局优化方法是可行的。我们的方法基于交替最小化。当深度图保持不变时,前向模型相对于全聚焦图像是线性的。当全聚焦图像保持不变时,每个像素的深度可以独立计算,从而实现了极其简单的并行计算。我们表明,交替进行凸优化和并行网格搜索可以有效地解决比当前深度学习方法更高分辨率的失焦深度问题。我们在合成和真实失焦模糊的基准数据集上展示了我们的方法,并与之前的方法相比,取得了令人鼓舞的结果。我们的代码可在 github.com/hollyjackson/dfd 获取。
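The depth update in the alternating scheme is easy to picture in code: with the all-in-focus image fixed, every candidate depth is scored independently at each pixel. The sketch below assumes a hypothetical blur_model(sharp, depth, focus_index) callable and a NumPy array depth_grid; it shows the embarrassingly parallel grid-search structure, not the paper's implementation.

    import numpy as np

    def depth_step(images, sharp, blur_model, depth_grid):
        # images: (K, H, W) defocused stack; sharp: current all-in-focus
        # estimate. Score each candidate depth plane per pixel and keep
        # the argmin (each pixel is independent, hence parallel).
        errs = []
        for d in depth_grid:
            pred = np.stack([blur_model(sharp, d, k)
                             for k in range(len(images))])
            errs.append(((pred - images) ** 2).sum(0))
        return depth_grid[np.argmin(np.stack(errs), axis=0)]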
cs.CV / 8 / 2602.18520

Sketch2Feedback: Grammar-in-the-Loop Framework for Rubric-Aligned Feedback on Student STEM Diagrams

Sketch2Feedback:基于语法循环的框架,为学生STEM图表提供与评分标准对齐的反馈
Bansal, Aayam
Abstract
Providing timely, rubric-aligned feedback on student-drawn diagrams is a persistent challenge in STEM education. While large multimodal models (LMMs) can jointly parse images and generate explanations, their tendency to hallucinate undermines trust in classroom deployments. We present Sketch2Feedback, a grammar-in-the-loop framework that decomposes the problem into four stages -- hybrid perception, symbolic graph construction, constraint checking, and constrained VLM feedback -- so that the language model verbalizes only violations verified by an upstream rule engine. We evaluate on two synthetic micro-benchmarks, FBD-10 (free-body diagrams) and Circuit-10 (circuit schematics), each with 500 images spanning standard and hard noise augmentation tiers, comparing our pipeline against end-to-end LMMs (LLaVA-1.5-7B, Qwen2-VL-7B), a vision-only detector, a YOLOv8-nano learned detector, and an ensemble oracle. On n=100 test samples per benchmark with 95% bootstrap CIs, results are mixed and instructive: Qwen2-VL-7B achieves the highest micro-F1 on both FBDs (0.570) and circuits (0.528), but with extreme hallucination rates (0.78, 0.98). An ensemble oracle that selects the best prediction per sample reaches F1=0.556 with hallucination 0.320 on FBDs, demonstrating exploitable complementarity between grammar and end-to-end approaches. Confidence thresholding at tau=0.7 reduces circuit hallucination from 0.970 to 0.880 with no F1 loss. Hard noise augmentation reveals domain-dependent robustness: FBD detection is resilient while circuit detection degrades sharply. An LLM-as-judge evaluation confirms that the grammar pipeline produces more actionable circuit feedback (4.85/5) than the end-to-end LMM (3.11/5). We release all code, datasets, and evaluation scripts.
Chinese Translation
在STEM教育中,及时提供与评分标准对齐的学生绘制图表的反馈一直是一个持续的挑战。虽然大型多模态模型(LMMs)可以共同解析图像并生成解释,但它们的幻觉倾向削弱了在课堂上的信任。我们提出了Sketch2Feedback,一个基于语法循环的框架,将问题分解为四个阶段——混合感知、符号图构建、约束检查和受限的VLM反馈——使得语言模型仅对上游规则引擎验证的违规行为进行语言化。我们在两个合成微基准上进行评估,FBD-10(自由体图)和Circuit-10(电路原理图),每个基准包含500幅图像,涵盖标准和困难噪声增强层次,比较我们的管道与端到端LMMs(LLaVA-1.5-7B,Qwen2-VL-7B)、仅视觉检测器、YOLOv8-nano学习检测器和集成oracle。在每个基准的n=100个测试样本上,采用95%的自助法置信区间,结果呈现出混合且具有指导意义的特点:Qwen2-VL-7B在FBD(0.570)和电路(0.528)上均实现了最高的微F1,但伴随极高的幻觉率(0.78,0.98)。一个选择每个样本最佳预测的集成oracle在FBD上达到F1=0.556,幻觉率为0.320,展示了语法与端到端方法之间可利用的互补性。在tau=0.7的置信度阈值下,电路的幻觉率从0.970降低到0.880,且没有F1损失。困难噪声增强揭示了领域依赖的鲁棒性:FBD检测具有韧性,而电路检测则急剧下降。LLM作为评判者的评估确认,语法管道产生的电路反馈(4.85/5)比端到端LMM(3.11/5)更具可操作性。我们发布了所有代码、数据集和评估脚本。
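The grammar-in-the-loop idea is that the language model only verbalizes violations a rule engine has already verified. As a flavor of what such a rule looks like, here is a hedged sketch of one hypothetical free-body-diagram check (static equilibrium); the real system's grammar covers many more constraints.

    def check_equilibrium(forces, tol=1e-3):
        # forces: parsed arrows as dicts with x/y components. In static
        # equilibrium the net force must (approximately) vanish; only a
        # verified violation is handed to the VLM to explain.
        fx = sum(f["x"] for f in forces)
        fy = sum(f["y"] for f in forces)
        if abs(fx) < tol and abs(fy) < tol:
            return []
        return [f"net force ({fx:.2f}, {fy:.2f}) is nonzero"]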
cs.CV / 9 / 2602.18525

Do Generative Metrics Predict YOLO Performance? An Evaluation Across Models, Augmentation Ratios, and Dataset Complexity

生成度量能否预测YOLO性能?跨模型、增强比例和数据集复杂度的评估
Marian, Vasile, Kang, Yong-Bin, Buddery, Alexander
Abstract
Synthetic images are increasingly used to augment object-detection training sets, but reliably evaluating a synthetic dataset before training remains difficult: standard global generative metrics (e.g., FID) often do not predict downstream detection mAP. We present a controlled evaluation of synthetic augmentation for YOLOv11 across three single-class detection regimes -- Traffic Signs (sparse/near-saturated), Cityscapes Pedestrian (dense/occlusion-heavy), and COCO PottedPlant (multi-instance/high-variability). We benchmark six GAN-, diffusion-, and hybrid-based generators over augmentation ratios from 10% to 150% of the real training split, and train YOLOv11 both from scratch and with COCO-pretrained initialization, evaluating on held-out real test splits (mAP@0.5:0.95). For each dataset-generator-augmentation configuration, we compute pre-training dataset metrics under a matched-size bootstrap protocol, including (i) global feature-space metrics in both Inception-v3 and DINOv2 embeddings and (ii) object-centric distribution distances over bounding-box statistics. Synthetic augmentation yields substantial gains in the more challenging regimes (up to +7.6% and +30.6% relative mAP in Pedestrian and PottedPlant, respectively) but is marginal in Traffic Signs and under pretrained fine-tuning. To separate metric signal from augmentation quantity, we report both raw and augmentation-controlled (residualized) correlations with multiple-testing correction, showing that metric-performance alignment is strongly regime-dependent and that many apparent raw associations weaken after controlling for augmentation level.
Chinese Translation
合成图像在增强目标检测训练集中的应用日益增多,但在训练前可靠地评估合成数据集仍然困难:标准的全局生成度量(例如,FID)往往无法预测下游检测的平均精度均值(mAP)。我们对YOLOv11在三个单类检测模式下的合成增强进行了控制评估——交通标志(稀疏/近饱和)、城市景观行人(密集/遮挡严重)和COCO盆栽植物(多实例/高变异性)。我们对六种基于GAN、扩散和混合的生成器进行了基准测试,增强比例从真实训练集的10%到150%不等,并在从头开始训练YOLOv11和使用COCO预训练初始化的情况下进行评估,测试在保留的真实测试集上进行(mAP@0.5:0.95)。对于每个数据集-生成器-增强配置,我们在匹配大小的自助抽样协议下计算预训练数据集度量,包括(i)在Inception-v3和DINOv2嵌入中的全局特征空间度量,以及(ii)基于边界框统计的以对象为中心的分布距离。合成增强在更具挑战性的模式下带来了显著的提升(在行人和盆栽植物中分别达到+7.6%和+30.6%的相对mAP),但在交通标志和预训练微调下的提升则微乎其微。为了将度量信号与增强数量分离,我们报告了在多重检验修正下的原始和增强控制(残差化)相关性,显示度量与性能的对齐在很大程度上依赖于模式,并且许多明显的原始关联在控制增强水平后减弱。
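The residualized correlations can be read as a simple partial-correlation recipe: regress both the generative metric and the detection score on augmentation ratio, then correlate what is left over. A minimal sketch of that recipe (function names are ours; the paper's analysis additionally applies multiple-testing correction):

    import numpy as np
    from scipy import stats

    def residualized_corr(metric, map_scores, aug_ratio):
        # Remove the linear effect of augmentation ratio from both
        # variables, then correlate the residuals.
        def resid(y, x):
            slope, intercept = np.polyfit(x, y, 1)
            return y - (slope * x + intercept)
        return stats.spearmanr(resid(metric, aug_ratio),
                               resid(map_scores, aug_ratio))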
cs.CV / 10 / 2602.18527

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

JAEGER:在模拟物理环境中进行联合3D音视频定位与推理
Liu, Zhan, Tang, Changli, Wang, Yuxin, Zhu, Zhiyuan, Chen, Youjun, Shao, Yiwen, Wang, Tianzi, Ke, Lei, Jin, Zengrui, Zhang, Chao
Abstract
Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints and datasets will be released upon acceptance.
Chinese Translation
当前的音视频大型语言模型(AV-LLMs)主要局限于2D感知,依赖于RGB视频和单声道音频。这一设计选择引入了基本的维度不匹配,阻碍了在复杂3D环境中可靠的源定位和空间推理。我们通过提出JAEGER来解决这一限制,该框架将AV-LLMs扩展到3D空间,以实现通过整合RGB-D观测和多通道一阶环境声学进行联合空间定位和推理。我们工作的核心贡献是神经强度向量(Neural IV),这是一种学习的空间音频表示,编码了强健的方向线索,以增强到达方向估计,即使在源重叠的恶劣声学场景中也能有效工作。为了促进大规模训练和系统评估,我们提出了SpatialSceneQA,这是一个从模拟物理环境中策划的61k指令调优样本的基准。大量实验表明,我们的方法在多种空间感知和推理任务中始终超越2D中心基线,强调了在物理环境中推进人工智能所需的明确3D建模。我们的源代码、预训练模型检查点和数据集将在接受后发布。
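For context, the classical (non-neural) intensity vector that Neural IV generalizes is computed from B-format first-order ambisonics as a short-time average of the omni channel times each figure-of-eight channel. A minimal sketch of that baseline estimator:

    import numpy as np

    def pseudo_intensity(w, x, y, z):
        # w, x, y, z: time-aligned B-format channel frames. The averaged
        # products W*X, W*Y, W*Z approximate the acoustic intensity
        # vector; its direction estimates the direction of arrival.
        iv = np.array([(w * x).mean(), (w * y).mean(), (w * z).mean()])
        return iv / (np.linalg.norm(iv) + 1e-12)  # unit DOA estimate

Neural IV replaces this fixed estimator with a learned one that stays robust under overlapping sources.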
cs.CV / 11 / 2602.18530

Image-Based Classification of Olive Varieties Native to Turkiye Using Multiple Deep Learning Architectures: Analysis of Performance, Complexity, and Generalization

基于图像的土耳其本土橄榄品种分类研究:多种深度学习架构的性能、复杂性与泛化分析
Karatas, Hatice, Atabas, Irfan
Abstract
This study compares multiple deep learning architectures for the automated, image-based classification of five locally cultivated black table olive varieties in Turkey: Gemlik, Ayvalik, Uslu, Erkence, and Celebi. Using a dataset of 2500 images, ten architectures - MobileNetV2, EfficientNetB0, EfficientNetV2-S, ResNet50, ResNet101, DenseNet121, InceptionV3, ConvNeXt-Tiny, ViT-B16, and Swin-T - were trained using transfer learning. Model performance was evaluated using accuracy, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), Cohen's Kappa, ROC-AUC, number of parameters, FLOPs, inference time, and generalization gap. EfficientNetV2-S achieved the highest classification accuracy (95.8%), while EfficientNetB0 provided the best trade-off between accuracy and computational complexity. Overall, the results indicate that under limited data conditions, parametric efficiency plays a more critical role than model depth alone.
Chinese Translation
本研究比较了多种深度学习架构在土耳其五种本地栽培黑色餐桌橄榄品种(Gemlik、Ayvalik、Uslu、Erkence 和 Celebi)自动化图像分类中的应用。使用2500张图像的数据集,训练了十种架构——MobileNetV2、EfficientNetB0、EfficientNetV2-S、ResNet50、ResNet101、DenseNet121、InceptionV3、ConvNeXt-Tiny、ViT-B16 和 Swin-T,采用迁移学习方法。模型性能通过准确率、精确率、召回率、F1-score、马修斯相关系数(MCC)、科恩的卡帕系数、ROC-AUC、参数数量、FLOPs、推理时间和泛化差距进行评估。EfficientNetV2-S 达到了最高的分类准确率(95.8%),而 EfficientNetB0 则在准确率和计算复杂性之间提供了最佳的权衡。总体而言,结果表明在数据有限的情况下,参数效率比模型深度更为重要。
cs.CV / 12 / 2602.18532

VLANeXt: Recipes for Building Strong VLA Models

VLANeXt:构建强大VLA模型的配方
Wu, Xiao-Ming, Fan, Bin, Liao, Kang, Jiang, Jian-Jian, Yang, Runze, Luo, Yihang, Wu, Zhonghua, Zheng, Wei-Shi, Loy, Chen Change
Abstract
Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. VLANeXt outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong generalization in real-world experiments. We will release a unified, easy-to-use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.
Chinese Translation
随着大型基础模型的兴起,视觉-语言-行动模型(VLA)应运而生,利用强大的视觉和语言理解能力进行通用策略学习。然而,目前的VLA领域仍然是碎片化和探索性的。尽管许多研究小组提出了各自的VLA模型,但训练协议和评估设置的不一致使得识别哪些设计选择真正重要变得困难。为了为这一不断发展的领域带来结构,我们在统一框架和评估设置下重新审视VLA设计空间。从一个类似于RT-2和OpenVLA的简单VLA基线出发,我们系统地分析了三个维度的设计选择:基础组件、感知要素和行动建模视角。通过这项研究,我们提炼出12个关键发现,这些发现共同形成了构建强大VLA模型的实用配方。这次探索的结果是一个简单而有效的模型,VLANeXt。VLANeXt在LIBERO和LIBERO-plus基准测试中超越了先前的最先进方法,并在现实世界实验中表现出强大的泛化能力。我们将发布一个统一的、易于使用的代码库,作为社区重现我们发现、探索设计空间和在共享基础上构建新VLA变体的共同平台。
cs.CV / 13 / 2602.18533

Morphological Addressing of Identity Basins in Text-to-Image Diffusion Models

文本到图像扩散模型中身份盆地的形态学定位
Fraser, Andrew
Abstract
We demonstrate that morphological pressure creates navigable gradients at multiple levels of the text-to-image generative pipeline. In Study 1, identity basins in Stable Diffusion 1.5 can be navigated using morphological descriptors -- constituent features like "platinum blonde," "beauty mark," and "1950s glamour" -- without the target's name or photographs. A self-distillation loop (generating synthetic images from descriptor prompts, then training a LoRA on those outputs) achieves consistent convergence toward a specific identity as measured by ArcFace similarity. The trained LoRA creates a local coordinate system shaping not only the target identity but also its inverse: maximal away-conditioning produces "eldritch" structural breakdown in base SD1.5, while the LoRA-equipped model produces "uncanny valley" outputs -- coherent but precisely wrong. In Study 2, we extend this to prompt-level morphology. Drawing on phonestheme theory, we generate 200 novel nonsense words from English sound-symbolic clusters (e.g., cr-, sn-, -oid, -ax) and find that phonestheme-bearing candidates produce significantly more visually coherent outputs than random controls (mean Purity@1 = 0.371 vs. 0.209, p < 0.00001, Cohen's d = 0.55). Three candidates -- snudgeoid, crashax, and broomix -- achieve perfect visual consistency (Purity@1 = 1.0) with zero training data contamination, each generating a distinct, coherent visual identity from phonesthetic structure alone. Together, these studies establish that morphological structure -- whether in feature descriptors or prompt-level phonological form -- creates systematic navigational gradients through diffusion model latent spaces. We document phase transitions in identity basins, CFG-invariant identity stability, and novel visual concepts emerging from sub-lexical sound patterns.
Chinese Translation
我们证明了形态学压力在文本到图像生成流程的多个层面上创造了可导航的梯度。在研究1中,Stable Diffusion 1.5中的身份盆地可以通过形态描述符进行导航——如“铂金金发”、“美人痣”和“1950年代魅力”等组成特征,而无需目标的名称或照片。自我蒸馏循环(从描述符提示生成合成图像,然后在这些输出上训练LoRA)实现了朝向特定身份的一致收敛,测量方式为ArcFace相似度。训练后的LoRA创建了一个局部坐标系统,不仅塑造了目标身份,还塑造了其逆向:最大远离条件下在基础SD1.5中产生了“古怪”的结构崩溃,而配备LoRA的模型则产生了“诡异谷”输出——连贯但恰好错误。在研究2中,我们将此扩展到提示级形态学。借鉴音象征理论,我们从英语声音象征簇(例如,cr-、sn-、-oid、-ax)生成了200个新颖的无意义词,并发现带有音象征的候选词产生的视觉输出显著比随机对照更具一致性(平均Purity@1 = 0.371 vs. 0.209, p<0.00001, Cohen's d=0.55)。三个候选词——snudgeoid、crashax和broomix——在没有训练数据污染的情况下实现了完美的视觉一致性(Purity@1 = 1.0),每个词仅从音象征结构生成了独特且连贯的视觉身份。这些研究共同表明,形态结构——无论是在特征描述符中还是在提示级音韵形式中——在扩散模型潜在空间中创造了系统的导航梯度。我们记录了身份盆地中的相变、CFG不变的身份稳定性,以及从子词音模式中涌现的新视觉概念。
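Purity@1 = 1.0 is easiest to understand operationally. Under one plausible reading (the abstract does not spell out its estimator), it is the fraction of a prompt's generated images that land in the single most common visual cluster:

    from collections import Counter

    def purity_at_1(cluster_ids):
        # cluster_ids: one visual-cluster label per generated image.
        # 1.0 means every sample shares one coherent identity.
        counts = Counter(cluster_ids)
        return max(counts.values()) / len(cluster_ids)

    purity_at_1(["cat", "cat", "cat", "dog"])  # -> 0.75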
cs.CV / 14 / 2602.18540

Rodent-Bench

Rodent-Bench
Heap, Thomas, Aitchison, Laurence, Cahill, Emma, Rodriguez, Adriana Casado
Abstract
We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behaviour footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash and Qwen-VL-Max, using this benchmark and find that none of these models perform strongly enough to be used as an assistant for this task. Our benchmark encompasses diverse datasets spanning multiple behavioral paradigms including social interactions, grooming, scratching, and freezing behaviors, with videos ranging from 10 minutes to 35 minutes in length. We provide two benchmark versions to accommodate varying model capabilities and establish standardized evaluation metrics including second-wise accuracy, macro F1, mean average precision, mutual information, and Matthews correlation coefficient. While some models show modest performance on certain datasets (notably grooming detection), overall results reveal significant challenges in temporal segmentation, handling extended video sequences, and distinguishing subtle behavioral states. Our analysis identifies key limitations in current MLLMs for scientific video annotation and provides insights for future model development. Rodent-Bench serves as a foundation for tracking progress toward reliable automated behavioral annotation in neuroscience research.
Chinese Translation
我们提出了 Rodent-Bench,这是一个新颖的基准测试,旨在评估多模态大型语言模型(Multimodal Large Language Models, MLLMs)对啮齿动物行为视频进行注释的能力。我们使用该基准测试评估了包括 Gemini-2.5-Pro、Gemini-2.5-Flash 和 Qwen-VL-Max 在内的最先进的 MLLMs,并发现这些模型在该任务中的表现均不够强劲,无法作为助手使用。我们的基准测试涵盖了多种行为范式的多样化数据集,包括社交互动、梳理、抓挠和冻结行为,视频时长从 10 分钟到 35 分钟不等。我们提供了两个基准版本,以适应不同模型的能力,并建立了标准化的评估指标,包括每秒准确率、宏观 F1 值、平均精确度、互信息和 Matthew 相关系数。尽管一些模型在某些数据集上表现出适度的性能(尤其是在梳理检测方面),但整体结果揭示了在时间分割、处理长视频序列和区分微妙行为状态方面存在显著挑战。我们的分析识别了当前 MLLMs 在科学视频注释中的关键局限性,并为未来模型开发提供了见解。Rodent-Bench 为追踪神经科学研究中可靠的自动化行为注释的进展奠定了基础。
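Of the reported metrics, second-wise accuracy is the most direct: compare the model's per-second behaviour label against the ethogram, second by second. A minimal sketch (names are ours):

    def secondwise_accuracy(pred, gold):
        # pred, gold: per-second label sequences of equal length.
        assert len(pred) == len(gold)
        return sum(p == g for p, g in zip(pred, gold)) / len(gold)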
cs.CV / 15 / 2602.18585

BloomNet: Exploring Single vs. Multiple Object Annotation for Flower Recognition Using YOLO Variants

BloomNet:使用YOLO变体探索单一与多重目标标注在花卉识别中的应用
Nusrat, Safwat, Bhattacharjee, Prithwiraj
Abstract
Precise localization and recognition of flowers are crucial for advancing automated agriculture, particularly in plant phenotyping, crop estimation, and yield monitoring. This paper benchmarks several YOLO architectures such as YOLOv5s, YOLOv8n/s/m, and YOLOv12n for flower object detection under two annotation regimes: single-image single-bounding box (SISBB) and single-image multiple-bounding box (SIMBB). The FloralSix dataset, comprising 2,816 high-resolution photos of six different flower species, is also introduced. It is annotated for both dense (clustered) and sparse (isolated) scenarios. The models were evaluated using Precision, Recall, and Mean Average Precision (mAP) at IoU thresholds of 0.5 (mAP@0.5) and 0.5-0.95 (mAP@0.5:0.95). In SISBB, YOLOv8m (SGD) achieved the best results with Precision 0.956, Recall 0.951, mAP@0.5 0.978, and mAP@0.5:0.95 0.865, illustrating strong accuracy in detecting isolated flowers. With mAP@0.5 0.934 and mAP@0.5:0.95 0.752, YOLOv12n (SGD) performed best in the more complicated SIMBB scenario, proving robustness in dense, multi-object detection. Results show how annotation density, IoU thresholds, and model size interact: recall-optimized models perform better in crowded environments, whereas precision-oriented models perform best in sparse scenarios. In both cases, the Stochastic Gradient Descent (SGD) optimizer consistently performed better than alternatives. Such density-sensitive detectors are helpful for non-destructive crop analysis, growth tracking, robotic pollination, and stress evaluation.
Chinese Translation
花卉的精确定位和识别对于推动自动化农业的发展至关重要,特别是在植物表型分析、作物估算和产量监测方面。本文对几种YOLO架构进行了基准测试,包括YOLOv5s、YOLOv8n/s/m和YOLOv12n,在两种标注模式下进行花卉目标检测:单图单边界框(SISBB)和单图多边界框(SIMBB)。同时介绍了FloralSix数据集,该数据集包含2816张六种不同花卉物种的高分辨率照片,并对其进行了密集(聚集)和稀疏(孤立)场景的标注。通过在IoU阈值为0.5(mAP@0.5)和0.5-0.95(mAP@0.5:0.95)下使用精确度、召回率和平均精确度均值(mAP)对模型进行了评估。在SISBB中,YOLOv8m(SGD)取得了最佳结果,精确度为0.956,召回率为0.951,mAP@0.5为0.978,mAP@0.5:0.95为0.865,显示出在检测孤立花卉方面的强大准确性。YOLOv12n(SGD)在更复杂的SIMBB场景中以mAP@0.5为0.934和mAP@0.5:0.95为0.752的表现领先,证明了其在密集多目标检测中的稳健性。结果显示了标注密度、IoU阈值和模型大小之间的相互作用:在拥挤环境中,优化召回率的模型表现更佳,而以精确度为导向的模型在稀疏场景中表现最佳。在这两种情况下,随机梯度下降(SGD)优化器的表现始终优于其他选择。这些对密度敏感的检测器对于非破坏性作物分析、成长跟踪、机器人授粉和压力评估具有重要意义。
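The IoU thresholds behind mAP@0.5 and mAP@0.5:0.95 reduce to one computation: intersection-over-union between a predicted and a ground-truth box. A minimal sketch of that primitive (the full mAP calculation additionally ranks detections by confidence and averages precision over recall):

    def iou(box_a, box_b):
        # Boxes as (x1, y1, x2, y2). A detection counts as a true
        # positive at mAP@0.5 when IoU >= 0.5 with a matching
        # ground-truth box.
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area(box_a) + area(box_b) - inter)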
cs.CV / 16 / 2602.18614

Effect of Patch Size on Fine-Tuning Vision Transformers in Two-Dimensional and Three-Dimensional Medical Image Classification

补丁大小对二维和三维医学图像分类中视觉变换器微调的影响
Dehghan, Massoud, Woitek, Ramona, Mahbod, Amirreza
Abstract
Vision Transformers (ViTs) and their variants have become state-of-the-art in many computer vision tasks and are widely used as backbones in large-scale vision and vision-language foundation models. While substantial research has focused on architectural improvements, the impact of patch size, a crucial initial design choice in ViTs, remains underexplored, particularly in medical domains where both two-dimensional (2D) and three-dimensional (3D) imaging modalities exist. In this study, using 12 medical imaging datasets from various imaging modalities (including seven 2D and five 3D datasets), we conduct a thorough evaluation of how different patch sizes affect ViT classification performance. Using a single graphics processing unit (GPU) and a range of patch sizes (1, 2, 4, 7, 14, 28), we fine-tune ViT models and observe consistent improvements in classification performance with smaller patch sizes (1, 2, and 4), which achieve the best results across nearly all datasets. More specifically, our results indicate improvements in balanced accuracy of up to 12.78% for 2D datasets (patch size 2 vs. 28) and up to 23.78% for 3D datasets (patch size 1 vs. 14), at the cost of increased computational expense. Moreover, by applying a straightforward ensemble strategy that fuses the predictions of the models trained with patch sizes 1, 2, and 4, we demonstrate a further boost in performance in most cases, especially for the 2D datasets. Our implementation is publicly available on GitHub: https://github.com/HealMaDe/MedViT
Chinese Translation
视觉变换器(Vision Transformers, ViTs)及其变体已成为许多计算机视觉任务中的最先进技术,并广泛用作大规模视觉和视觉-语言基础模型的骨干网络。尽管大量研究集中在架构改进上,但补丁大小这一在ViTs中的关键初始设计选择的影响仍然未被充分探索,尤其是在同时存在二维(2D)和三维(3D)成像模式的医学领域。在本研究中,我们使用来自各种成像模式的12个医学成像数据集(包括七个2D和五个3D数据集),对不同补丁大小对ViT分类性能的影响进行了全面评估。我们使用单个图形处理单元(GPU)和一系列补丁大小(1、2、4、7、14、28),微调ViT模型,并观察到较小补丁大小(1、2和4)在分类性能上有一致的提升,几乎在所有数据集中都取得了最佳结果。更具体地说,我们的结果表明,对于2D数据集(补丁大小2与28),平衡准确率提高了最高12.78%;对于3D数据集(补丁大小1与14),提高了最高23.78%,但代价是计算开销的增加。此外,通过应用一种简单的集成策略,将使用补丁大小1、2和4训练的模型的预测结果融合,我们在大多数情况下进一步提升了性能,尤其是在2D数据集上。我们的实现已在GitHub上公开可用:https://github.com/HealMaDe/MedViT
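The computational cost behind the accuracy gains is set by simple arithmetic: a ViT on a square 2D input of side img yields (img // patch)^2 tokens (cubed for 3D volumes), and self-attention scales quadratically in that count. A quick sketch, assuming a 224-pixel input for illustration:

    def vit_tokens(img=224, patch=4, dims=2):
        # Token count for a square (dims=2) or cubic (dims=3) input.
        return (img // patch) ** dims

    for p in (1, 2, 4, 7, 14, 28):
        print(p, vit_tokens(patch=p))
    # patch 1 -> 50176 tokens, 2 -> 12544, 4 -> 3136,
    # 7 -> 1024, 14 -> 256, 28 -> 64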
cs.CV / 17 / 2602.18618

Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space

为您叙述:基于提示引导的音视频叙述人脸生成采用多纠缠潜在空间
Chandra, Aashish, A V, Aashutosh, Das, Abhijit
Abstract
We present a novel approach for generating realistic speaking and talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the voice profile of an individual and then combines them to pass them to the multi-entangled latent space to foster key-value pairs and queries for the audio and video modality generation pipeline. The multi-entangled latent space is responsible for establishing the spatiotemporal person-specific features between the modalities. Further, entangled features are passed to the respective decoder of each modality for output audio and video generation.
Chinese Translation
我们提出了一种新颖的方法,通过从静态图像、声音特征和目标文本合成一个人的声音和面部动作,生成逼真的说话和对话人脸。该模型对提示/驱动文本、驱动图像和个体的声音特征进行编码,然后将它们结合起来,传递到多纠缠潜在空间,以促进音频和视频模态生成管道中的键值对和查询的建立。多纠缠潜在空间负责在模态之间建立时空特定于个体的特征。此外,纠缠特征被传递到每个模态的相应解码器,以生成输出音频和视频。
cs.CV / 18 / 2602.18697

Deep LoRA-Unfolding Networks for Image Restoration

用于图像恢复的深度 LoRA 展开网络
Wang, Xiangming, Zeng, Haijin, Sun, Benteng, Cao, Jiezhang, Zhang, Kai, Shen, Qiangqiang, Chen, Yongyong
Abstract
Deep unfolding networks (DUNs), combining conventional iterative optimization algorithms and deep neural networks into a multi-stage framework, have achieved remarkable accomplishments in Image Restoration (IR), such as spectral imaging reconstruction, compressive sensing, and super-resolution. A DUN unfolds the iterative optimization steps into a stack of sequentially linked blocks. Each block consists of a Gradient Descent Module (GDM) and a Proximal Mapping Module (PMM), which is equivalent to a denoiser from a Bayesian perspective, operating on Gaussian noise with a known level. However, existing DUNs suffer from two critical limitations: (i) their PMMs share identical architectures and denoising objectives across stages, ignoring the need for stage-specific adaptation to varying noise levels; and (ii) their chain of structurally repetitive blocks results in severe parameter redundancy and high memory consumption, hindering deployment in large-scale or resource-constrained scenarios. To address these challenges, we introduce generalized Deep Low-rank Adaptation (LoRA) Unfolding Networks for image restoration, named LoRun, harmonizing denoising objectives and adapting different denoising levels between stages with compressed memory usage for a more efficient DUN. LoRun introduces a novel paradigm where a single pretrained base denoiser is shared across all stages, while lightweight, stage-specific LoRA adapters are injected into the PMMs to dynamically modulate denoising behavior according to the noise level at each unfolding step. This design decouples the core restoration capability from task-specific adaptation, enabling precise control over denoising intensity without duplicating full network parameters and achieving up to N times parameter reduction for an N-stage DUN with on-par or better performance. Extensive experiments conducted on three IR tasks validate the efficiency of our method.
Chinese Translation
深度展开网络(DUNs)将传统的迭代优化算法与深度神经网络结合成一个多阶段框架,在图像恢复(IR)领域取得了显著成就,例如光谱成像重建、压缩感知和超分辨率。它将迭代优化步骤展开为一系列顺序连接的模块。每个模块由一个梯度下降模块(Gradient Descent Module, GDM)和一个近端映射模块(Proximal Mapping Module, PMM)组成,从贝叶斯的角度来看,相当于一个去噪器,作用于具有已知水平的高斯噪声。然而,现有的 DUNs 存在两个主要限制:(i)它们的 PMMs 在各个阶段共享相同的架构和去噪目标,忽视了对不同噪声水平进行阶段特定适应的需求;(ii)它们的结构重复模块链导致严重的参数冗余和高内存消耗,妨碍了在大规模或资源受限场景中的部署。为了解决这些挑战,我们提出了一种用于图像恢复的广义深度低秩适应(LoRA)展开网络,命名为 LoRun,协调去噪目标并在各阶段之间适应不同的去噪水平,同时以压缩的内存使用实现更高效的 DUN。LoRun 引入了一种新范式,其中一个单一的预训练基础去噪器在所有阶段共享,而轻量级的阶段特定 LoRA 适配器被注入到 PMMs 中,以根据每个展开步骤的噪声水平动态调节去噪行为。这种设计将核心恢复能力与任务特定适应解耦,使得在不复制完整网络参数的情况下精确控制去噪强度,并在一个具有相同或更好性能的 N 阶 DUN 中实现高达 N 倍的参数减少。在三个 IR 任务上进行的广泛实验验证了我们方法的效率。
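The parameter saving comes from the standard LoRA update: every stage reuses the shared weight W and stores only a low-rank pair (A, B). A minimal sketch of one adapted linear map, with the shapes as the only assumptions:

    import torch

    def lora_forward(x, W, A, B, alpha=8):
        # x: (batch, in); W: (out, in) shared across stages;
        # A: (r, in), B: (out, r) are the tiny stage-specific factors.
        r = A.shape[0]
        return x @ W.t() + (alpha / r) * (x @ A.t()) @ B.t()

Sharing W across all N stages while swapping (A, B) per stage is what yields the up-to-N-times parameter reduction claimed above.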
cs.CV / 19 / 2602.18702

Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding

以基础为思考:基于视频基础的课程强化推理用于长视频理解
Chen, Houlun, Wang, Xin, Li, Guangyao, Zhou, Yuwei, Chen, Yihan, Jia, Jia, Zhu, Wenwu
Abstract
Long video understanding is challenging due to rich and complicated multimodal clues over a long temporal range. Current methods adopt reasoning to improve the model's ability to analyze complex video clues in long videos via text-form reasoning. However, the existing literature suffers from the fact that text-only reasoning under a fixed video context may exacerbate hallucinations, since detailed crucial clues are often ignored under limited video context length due to the temporal redundancy of long videos. To address this gap, we propose Video-TwG, a curriculum reinforced framework that employs a novel Think-with-Grounding paradigm, enabling video LLMs to actively decide when to perform on-demand grounding during interleaved text-video reasoning, selectively zooming into question-relevant clips only when necessary. Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning traces. In detail, we design a Two-stage Reinforced Curriculum Strategy, where the model first learns think-with-grounding behavior on a small short-video GQA dataset with grounding labels, and then scales to diverse general QA data with videos of diverse domains to encourage generalization. Further, to handle complex think-with-grounding reasoning for various kinds of data, we propose the TwG-GRPO algorithm, which features a fine-grained grounding reward, self-confirmed pseudo reward, and accuracy-gated mechanism. Finally, we propose to construct a new TwG-51K dataset that facilitates training. Experiments on Video-MME, LongVideoBench, and MLVU show that Video-TwG consistently outperforms strong LVU baselines. Further ablation validates the necessity of our Two-stage Reinforced Curriculum Strategy and shows our TwG-GRPO better leverages diverse unlabeled data to improve grounding quality and reduce redundant groundings without sacrificing QA performance.
Chinese Translation
长视频理解因其丰富且复杂的多模态线索而具有挑战性。当前的方法通过文本形式的推理来提高模型分析长视频中复杂视频线索的能力。然而,现有文献存在的问题是,固定视频上下文下的仅文本推理可能加剧幻觉现象,因为在长视频的时间冗余下,详细的关键线索常常被忽视。为了解决这一问题,我们提出了Video-TwG,一个课程强化框架,采用新颖的以基础为思考(Think-with-Grounding)范式,使视频大语言模型(LLMs)能够主动决定何时在交错的文本-视频推理过程中进行按需基础,只有在必要时才选择聚焦于与问题相关的片段。Video-TwG可以以简单的方式进行端到端训练,而无需依赖复杂的辅助模块或重度标注的推理轨迹。具体而言,我们设计了一个两阶段强化课程策略,其中模型首先在一个带有基础标签的小型短视频GQA数据集上学习以基础为思考的行为,然后扩展到具有多样领域视频的通用QA数据,以鼓励泛化。此外,为了处理各种数据的复杂以基础为思考的推理,我们提出了TwG-GRPO算法,该算法具有细粒度基础奖励、自确认伪奖励和准确性门控机制。最后,我们建议构建一个新的TwG-51K数据集以促进训练。在Video-MME、LongVideoBench和MLVU上的实验表明,Video-TwG始终优于强大的LVU基线。进一步的消融实验验证了我们两阶段强化课程策略的必要性,并表明我们的TwG-GRPO更好地利用多样的未标注数据来提高基础质量,并减少冗余基础,而不牺牲QA性能。
cs.CV / 20 / 2602.18709

IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping

IRIS-SLAM:用于稳健语义定位和映射的统一地理实例表示
Xiao, Tingyang, Liu, Liu, Feng, Wei, Zou, Zhengyu, Zhou, Xiaolin, Sui, Wei, Li, Hao, Zhang, Dingwen, Su, Zhizhong
Abstract
Geometry foundation models have significantly advanced dense geometric SLAM, yet existing systems often lack deep semantic understanding and robust loop closure capabilities. Meanwhile, contemporary semantic mapping approaches are frequently hindered by decoupled architectures and fragile data association. We propose IRIS-SLAM, a novel RGB semantic SLAM system that leverages unified geometric-instance representations derived from an instance-extended foundation model. By extending a geometry foundation model to concurrently predict dense geometry and cross-view consistent instance embeddings, we enable a semantic-synergized association mechanism and instance-guided loop closure detection. Our approach effectively utilizes viewpoint-agnostic semantic anchors to bridge the gap between geometric reconstruction and open-vocabulary mapping. Experimental results demonstrate that IRIS-SLAM significantly outperforms state-of-the-art methods, particularly in map consistency and wide-baseline loop closure reliability.
Chinese Translation
几何基础模型在密集几何SLAM方面取得了显著进展,但现有系统往往缺乏深度语义理解和稳健的回环闭合能力。同时,现代语义映射方法常常受到解耦架构和脆弱数据关联的限制。我们提出了IRIS-SLAM,一种新颖的RGB语义SLAM系统,利用从实例扩展基础模型中派生的统一几何实例表示。通过扩展几何基础模型以同时预测密集几何和跨视图一致的实例嵌入,我们实现了一种语义协同关联机制和实例引导的回环闭合检测。我们的方法有效利用视点无关的语义锚点,弥合几何重建与开放词汇映射之间的差距。实验结果表明,IRIS-SLAM在地图一致性和宽基线回环闭合可靠性方面显著优于最先进的方法。
cs.CV / 21 / 2602.18711

HIME: Mitigating Object Hallucinations in LVLMs via Hallucination Insensitivity Model Editing

HIME:通过幻觉不敏感模型编辑减轻大型视觉语言模型中的对象幻觉
Akl, Ahmed, Khamis, Abdelwahed, Cheraghian, Ali, Wang, Zhe, Khalifa, Sara, Wang, Kewen
Abstract
Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal understanding capabilities, yet they remain prone to object hallucination, where models describe non-existent objects or attribute incorrect factual information, raising serious concerns for reliable real-world deployment. While fine-tuning is a commonly adopted mitigation strategy, its high computational cost and practical difficulty motivate the need for training-free alternatives, among which model editing has recently emerged as a promising direction. However, indiscriminate editing risks disrupting the rich implicit knowledge encoded in pre-trained LVLMs, leading to a fundamental question: how much intervention is necessary at each layer to suppress hallucinations while preserving pre-trained knowledge? To address this question, we present a systematic analysis of LVLM decoders built on three widely used large language model backbones (Qwen, LLaMA, and Vicuna), revealing clear layer-wise differences in susceptibility to object hallucination. Building on these insights, we introduce the Hallucination Insensitivity Score (HIS), a principled metric that quantifies each layer's sensitivity to hallucination and provides guidance for targeted intervention. Leveraging HIS, we propose Hallucination Insensitivity Model Editing (HIME), a simple yet effective layer-adaptive weight editing approach that selectively modifies latent features to suppress hallucinations while preserving pre-trained knowledge. Extensive experiments demonstrate that HIME reduces hallucinations by an average of 61.8% across open-ended generation benchmarks, including CHAIR, MME, and GPT-4V-aided evaluation, without introducing additional parameters, inference-time latency, or computational overhead.
Chinese Translation
大型视觉语言模型(LVLMs)展示了令人印象深刻的多模态理解能力,但它们仍然容易出现对象幻觉,即模型描述不存在的对象或归因于不正确的事实信息,这对可靠的现实世界部署提出了严重的担忧。尽管微调是一种常用的缓解策略,但其高计算成本和实际困难促使人们寻求无训练的替代方案,其中模型编辑最近作为一种有前景的方向出现。然而,随意的编辑可能会破坏预训练LVLM中编码的丰富隐性知识,从而引发一个根本性问题:在每一层进行多少干预是必要的,以抑制幻觉同时保留预训练知识?为了解决这个问题,我们对基于三种广泛使用的大型语言模型骨干(Qwen、LLaMA和Vicuna)的LVLM解码器进行了系统分析,揭示了在对象幻觉易感性方面的明显层级差异。基于这些见解,我们引入了幻觉不敏感评分(HIS),这是一个原则性指标,用于量化每一层对幻觉的敏感性,并为有针对性的干预提供指导。利用HIS,我们提出了幻觉不敏感模型编辑(HIME),这是一种简单而有效的层自适应权重编辑方法,选择性地修改潜在特征以抑制幻觉,同时保留预训练知识。大量实验表明,HIME在包括CHAIR、MME和GPT-4V辅助评估在内的开放式生成基准上平均减少了61.8%的幻觉,而无需引入额外的参数、推理时间延迟或计算开销。
cs.CV / 22 / 2602.18717

NeXt2Former-CD: Efficient Remote Sensing Change Detection with Modern Vision Architectures

NeXt2Former-CD:基于现代视觉架构的高效遥感变化检测
Wang, Yufan, Makrogiannis, Sokratis, Kambhamettu, Chandra
Abstract
State Space Models (SSMs) have recently gained traction in remote sensing change detection (CD) for their favorable scaling properties. In this paper, we explore the potential of modern convolutional and attention-based architectures as a competitive alternative. We propose NeXt2Former-CD, an end-to-end framework that integrates a Siamese ConvNeXt encoder initialized with DINOv3 weights, a deformable attention-based temporal fusion module, and a Mask2Former decoder. This design is intended to better tolerate residual co-registration noise and small object-level spatial shifts, as well as semantic ambiguity in bi-temporal imagery. Experiments on LEVIR-CD, WHU-CD, and CDD datasets show that our method achieves the best results among the evaluated methods, improving over recent Mamba-based baselines in both F1 score and IoU. Furthermore, despite a larger parameter count, our model maintains inference latency comparable to SSM-based approaches, suggesting it is practical for high-resolution change detection tasks.
Chinese Translation
状态空间模型(SSMs)近年来在遥感变化检测(CD)中获得了关注,因为其良好的扩展性。本文探讨了现代卷积和基于注意力的架构作为一种竞争性替代方案的潜力。我们提出了NeXt2Former-CD,一个端到端的框架,集成了一个用DINOv3权重初始化的Siamese ConvNeXt编码器、一个可变形的基于注意力的时间融合模块和一个Mask2Former解码器。该设计旨在更好地容忍残余的共同配准噪声和小物体级空间偏移,以及双时相图像中的语义模糊。在LEVIR-CD、WHU-CD和CDD数据集上的实验表明,我们的方法在评估的方法中取得了最佳结果,在F1分数和IoU上均优于最近的基于Mamba的基线。此外,尽管参数数量较多,我们的模型在推理延迟方面与基于SSM的方法相当,表明其在高分辨率变化检测任务中的实用性。
cs.CV / 23 / 2602.18720

Subtle Motion Blur Detection and Segmentation from Static Image Artworks

静态图像艺术作品中的微妙运动模糊检测与分割
Samarth, Ganesh, Paul, Sibendu, Tabarestani, Solale, Chen, Caren
Abstract
Streaming services serve hundreds of millions of viewers worldwide, where visual assets such as thumbnails, box art, and cover images are critical for engagement. Subtle motion blur remains a pervasive quality issue, reducing visual clarity and negatively affecting user trust and click-through rates. However, motion blur detection from static images is underexplored, as existing methods and datasets focus on severe blur and lack fine-grained pixel-level annotations needed for quality-critical applications. Benchmarks such as GOPRO and NFS are dominated by strong synthetic blur and often contain residual blur in their sharp references, leading to ambiguous supervision. We propose SMBlurDetect, a unified framework combining high-quality motion blur specific dataset generation with an end-to-end detector capable of zero-shot detection at multiple granularities. Our pipeline synthesizes realistic motion blur from super high resolution aesthetic images using controllable camera and object motion simulations over SAM segmented regions, enhanced with alpha-aware compositing and balanced sampling to generate subtle, spatially localized blur with precise ground truth masks. We train a U-Net based detector with ImageNet pretrained encoders using a hybrid mask and image centric strategy incorporating curriculum learning, hard negatives, focal loss, blur frequency channels, and resolution aware augmentation. Our method achieves strong zero-shot generalization, reaching 89.68% accuracy on GoPro (vs 66.50% baseline) and 59.77% Mean IoU on CUHK (vs 9.00% baseline), demonstrating 6.6x improvement in segmentation. Qualitative results show accurate localization of subtle blur artifacts, enabling automated filtering of low quality frames and precise region of interest extraction for intelligent cropping.
Chinese Translation
流媒体服务为全球数亿观众提供服务,其中缩略图、盒装艺术和封面图像等视觉资产对用户参与至关重要。然而,微妙的运动模糊仍然是一个普遍的质量问题,降低了视觉清晰度,负面影响用户信任和点击率。然而,从静态图像中检测运动模糊的研究尚不充分,现有方法和数据集主要集中在严重模糊上,缺乏质量关键应用所需的细粒度像素级注释。诸如GOPRO和NFS等基准测试主要由强合成模糊主导,并且其清晰参考图像中常常包含残余模糊,导致监督模糊。我们提出了SMBlurDetect,一个统一框架,结合高质量运动模糊特定数据集生成与一个能够在多个粒度上进行零样本检测的端到端检测器。我们的流程通过可控的相机和物体运动模拟,在SAM分割区域上合成真实的运动模糊,并通过具有α感知合成和均衡采样的方式生成微妙的、空间局部化的模糊,并提供精确的真实标签掩膜。我们使用基于U-Net的检测器进行训练,采用ImageNet预训练编码器,结合课程学习、困难负样本、焦点损失、模糊频率通道和分辨率感知增强的混合掩膜和图像中心策略。我们的方法实现了强大的零样本泛化,在GoPro上达到89.68%的准确率(对比66.50%的基线),在CUHK上达到59.77%的平均交并比(Mean IoU)(对比9.00%的基线),实现了6.6倍的分割提升。定性结果显示微妙模糊伪影的准确定位,使得低质量帧的自动过滤和感兴趣区域的精确提取成为可能,从而实现智能裁剪。
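The synthesis stage above lends itself to a short sketch. Below is a minimal, hedged version of mask-localized motion-blur synthesis with soft alpha compositing: a linear blur kernel and a user-supplied binary mask stand in for the paper's camera/object motion simulations and SAM-segmented regions, and all parameters are placeholders.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def motion_blur_kernel(length: int, angle_deg: float) -> np.ndarray:
    """Linear motion-blur PSF: a normalized line segment at the given angle."""
    k = np.zeros((length, length), dtype=np.float32)
    c = (length - 1) / 2.0
    t = np.linspace(-c, c, length * 4)
    ys = np.clip(np.round(c + t * np.sin(np.deg2rad(angle_deg))).astype(int), 0, length - 1)
    xs = np.clip(np.round(c + t * np.cos(np.deg2rad(angle_deg))).astype(int), 0, length - 1)
    k[ys, xs] = 1.0
    return k / k.sum()

def synthesize_subtle_blur(image, mask, length=9, angle_deg=30.0, feather=2.0):
    """Blur only the masked region, then alpha-composite it over the sharp image.
    image: float32 HxWx3 in [0, 1]; mask: float32 HxW in {0, 1} (stand-in for a
    SAM region). Returns (composite, ground-truth mask)."""
    kernel = motion_blur_kernel(length, angle_deg)
    blurred = np.stack([convolve(image[..., ch], kernel, mode="reflect")
                        for ch in range(image.shape[-1])], axis=-1)
    alpha = gaussian_filter(mask, feather)[..., None]  # feathered edge ("alpha-aware")
    return alpha * blurred + (1.0 - alpha) * image, mask

rng = np.random.default_rng(0)
img = rng.random((128, 128, 3)).astype(np.float32)
m = np.zeros((128, 128), dtype=np.float32)
m[40:90, 30:100] = 1.0
composite, gt_mask = synthesize_subtle_blur(img, m)
```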
cs.CV / 24 / 2602.18726

WiCompass: Oracle-driven Data Scaling for mmWave Human Pose Estimation

WiCompass:面向毫米波人体姿态估计的Oracle驱动数据扩展
Liang, Bo, Gong, Chen, Wang, Haobo, Liu, Qirui, Zhou, Rungui, Shao, Fengzhi, Wang, Yubo, Gao, Wei, Zhou, Kaichen, Cui, Guolong, Xu, Chenren
Abstract
Millimeter-wave Human Pose Estimation (mmWave HPE) promises privacy but suffers from poor generalization under distribution shifts. We demonstrate that brute-force data scaling is ineffective for out-of-distribution (OOD) robustness; efficiency and coverage are the true bottlenecks. To address this, we introduce WiCompass, a coverage-aware data-collection framework. WiCompass leverages large-scale motion-capture corpora to build a universal pose space ``oracle'' that quantifies dataset redundancy and identifies underrepresented motions. Guided by this oracle, WiCompass employs a closed-loop policy to prioritize collecting informative missing samples. Experiments show that WiCompass consistently improves OOD accuracy at matched budgets and exhibits superior scaling behavior compared to conventional collection strategies. By shifting focus from brute-force scaling to coverage-aware data acquisition, this work offers a practical path toward robust mmWave sensing.
Chinese Translation
毫米波人体姿态估计(mmWave HPE)承诺保护隐私,但在分布变化下表现出较差的泛化能力。我们证明,粗暴的数据扩展对分布外(OOD)鲁棒性是无效的;效率和覆盖范围才是真正的瓶颈。为了解决这个问题,我们提出了WiCompass,一个关注覆盖率的数据收集框架。WiCompass利用大规模运动捕捉数据集构建一个通用的姿态空间“oracle”,该oracle量化数据集的冗余性并识别代表性不足的动作。在这个oracle的指导下,WiCompass采用闭环策略优先收集信息量丰富的缺失样本。实验表明,WiCompass在相同预算下持续提高了OOD准确性,并展现出优于传统收集策略的扩展行为。通过将重点从粗暴的扩展转向关注覆盖率的数据获取,这项工作为实现鲁棒的毫米波传感提供了一条实用的路径。
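A toy sketch of the closed-loop, coverage-aware selection the abstract describes, under two assumptions: motions are compared as fixed-length pose-space embeddings, and redundancy is measured by nearest-neighbor distance. The paper's universal pose-space oracle is presumably richer than this.

```python
import numpy as np

def coverage_scores(collected: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Distance from each candidate pose embedding to its nearest neighbor in
    the collected set; a large distance marks an underrepresented motion."""
    d = np.linalg.norm(candidates[:, None, :] - collected[None, :, :], axis=-1)
    return d.min(axis=1)

def select_next_batch(collected, candidates, budget: int):
    """One closed-loop step: collect the motions the oracle marks as least
    covered, then fold them into the dataset for the next round."""
    order = np.argsort(-coverage_scores(collected, candidates))
    picked = candidates[order[:budget]]
    return picked, np.concatenate([collected, picked], axis=0)

rng = np.random.default_rng(0)
dataset = rng.normal(size=(500, 32))   # embeddings of motions already collected
pool = rng.normal(size=(200, 32))      # candidates from the mocap corpus
batch, dataset = select_next_batch(dataset, pool, budget=16)
```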
cs.CV / 25 / 2602.18729

MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

MiSCHiEF:安全与文化的最小对比基准,用于细粒度图像-标题对齐的整体评估
Banerjee, Sagarika, Madi, Tangatar, Swaminathan, Advait, Anh, Nguyen Dao Minh, Garg, Shivank, Zhu, Kevin, Sharma, Vasu
Abstract
Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets based on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models generally perform better at confirming the correct image-caption pair than rejecting incorrect ones. Additionally, models achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image, compared to the converse task. The results, overall, highlight persistent modality misalignment challenges in current VLMs, underscoring the difficulty of precise cross-modal grounding required for applications with subtle semantic and visual distinctions.
Chinese Translation
细粒度图像-标题对齐对于视觉语言模型(VLMs)至关重要,尤其是在识别现实世界风险场景或区分文化代理等社会关键背景下,在这些情况下,正确的解释依赖于微妙的视觉或语言线索,而轻微的误解可能导致重大的现实后果。我们提出了MiSCHiEF,这是基于对比对设计的两个基准数据集,分别涉及安全(MiS)和文化(MiC)领域,并评估了四个VLM在需要细粒度区分配对图像和标题的任务上的表现。在这两个数据集中,每个样本包含两个最小差异的标题和相应的最小差异图像。在MiS中,图像-标题对描绘了安全和不安全的场景,而在MiC中,它们描绘了两个不同文化背景下的文化代理。我们发现模型在确认正确的图像-标题对时的表现通常优于拒绝错误的图像-标题对。此外,模型在从两个高度相似的标题中为给定图像选择正确标题时的准确性高于相反的任务。总体而言,结果突显了当前VLM中持续存在的模态不对齐挑战,强调了在具有微妙语义和视觉区别的应用中所需的精确跨模态基础的困难。
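For clarity, here is how a minimal-pair benchmark of this shape is typically scored, assuming one 2x2 image-caption similarity matrix per sample (e.g., from a CLIP-style scorer); the two accuracies correspond to the caption-selection and image-selection tasks the abstract compares.

```python
import numpy as np

def score_minimal_pair(sim: np.ndarray) -> dict:
    """sim[i, j] = similarity(image_i, caption_j) for one sample; the correct
    pairing is the diagonal (image_0 <-> caption_0, image_1 <-> caption_1)."""
    return {
        "caption_given_image": float(np.mean([sim[i].argmax() == i for i in range(2)])),
        "image_given_caption": float(np.mean([sim[:, j].argmax() == j for j in range(2)])),
    }

def benchmark(sims):
    """Average both task accuracies over a list of per-sample 2x2 matrices."""
    keys = ("caption_given_image", "image_given_caption")
    per = [score_minimal_pair(s) for s in sims]
    return {k: float(np.mean([p[k] for p in per])) for k in keys}

rng = np.random.default_rng(0)
print(benchmark([rng.random((2, 2)) for _ in range(100)]))  # ~0.5 for random sims
```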
cs.CV / 26 / 2602.18735

LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency

LaS-Comp:基于潜在空间一致性的零样本 3D 补全
Yan, Weilong, Li, Haipeng, Xu, Hao, Ye, Nianjin, Ai, Yihao, Liu, Shuaicheng, Hu, Jingyu
Abstract
This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage that ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and qualitative experiments demonstrate that our approach outperforms previous state-of-the-art approaches. Our code and data will be available at https://github.com/DavidYan2001/LaS-Comp.
Chinese Translation
本文介绍了 LaS-Comp,一种零样本和类别无关的方法,利用 3D 基础模型丰富的几何先验,实现在多种部分观测下的 3D 形状补全。我们的贡献主要体现在三个方面:首先,LaS-Comp 通过互补的两阶段设计利用这些强大的生成先验进行补全:(i)显式替换阶段,保留部分观测的几何形状,以确保补全的忠实性;(ii)隐式精炼阶段,确保观察区域与合成区域之间的无缝边界。其次,我们的框架无需训练,并且与不同的 3D 基础模型兼容。第三,我们引入了 Omni-Comp,这是一个综合基准,结合了现实世界和合成数据,具有多样且具有挑战性的部分模式,从而实现更全面和真实的评估。定量和定性实验均表明,我们的方法优于之前的最先进方法。我们的代码和数据将会在 LaS-Comp 的 GitHub 页面上发布。
cs.CV / 27 / 2602.18745

Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code

从零开始合成多模态几何数据集并通过绘图代码实现视觉对齐
Lin, Haobo, Bai, Tianyi, Chen, Chen, Zhang, Jiajun, Zeng, Bohan, Zhang, Wentao, Yuan, Binhang
Abstract
Multimodal geometry reasoning requires models to jointly understand visual diagrams and perform structured symbolic inference, yet current vision-language models struggle with complex geometric constructions due to limited training data and weak visual-symbolic alignment. We propose a pipeline for synthesizing complex multimodal geometry problems from scratch and construct a dataset named **GeoCode**, which decouples problem generation into symbolic seed construction, grounded instantiation with verification, and code-based diagram rendering, ensuring consistency across structure, text, reasoning, and images. Leveraging the plotting code provided in GeoCode, we further introduce code prediction as an explicit alignment objective, transforming visual understanding into a supervised structured prediction task. GeoCode exhibits substantially higher structural complexity and reasoning difficulty than existing benchmarks, while maintaining mathematical correctness through multi-stage validation. Extensive experiments show that models trained on GeoCode achieve consistent improvements on multiple geometry benchmarks, demonstrating both the effectiveness of the dataset and the proposed alignment strategy. The code will be available at https://github.com/would1920/GeoCode.
Chinese Translation
多模态几何推理要求模型能够共同理解视觉图表并进行结构化符号推理,然而目前的视觉-语言模型由于训练数据有限和视觉-符号对齐不足,难以处理复杂的几何构造。我们提出了一种从零开始合成复杂多模态几何问题的流程,并构建了一个名为 GeoCode 的数据集,该数据集将问题生成解耦为符号种子构建、带验证的基础实例化和基于代码的图表渲染,确保结构、文本、推理和图像之间的一致性。利用GeoCode中提供的绘图代码,我们进一步引入代码预测作为显式对齐目标,将视觉理解转化为监督的结构化预测任务。GeoCode在结构复杂性和推理难度上显著高于现有基准,同时通过多阶段验证保持数学正确性。大量实验表明,在GeoCode上训练的模型在多个几何基准上取得了一致的提升,展示了数据集和所提出的对齐策略的有效性。代码将发布在 https://github.com/would1920/GeoCode。
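Since GeoCode's third stage renders diagrams from code, a toy illustration of that idea is shown below, using a hypothetical symbolic format with a single derived-point rule and matplotlib as the renderer; the dataset's actual specification language is not described in the abstract.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# A symbolic seed instantiated with concrete coordinates (hypothetical format).
problem = {
    "points": {"A": (0, 0), "B": (4, 0), "C": (1, 3)},
    "segments": [("A", "B"), ("B", "C"), ("C", "A"), ("A", "M")],
    "derived": {"M": "midpoint(B, C)"},
}

def render(problem, path="diagram.png"):
    pts = dict(problem["points"])
    # Instantiate derived points before drawing (only midpoints in this toy spec).
    for name, rule in problem["derived"].items():
        if rule.startswith("midpoint"):
            p, q = rule[9:-1].split(", ")
            pts[name] = tuple((a + b) / 2 for a, b in zip(pts[p], pts[q]))
    fig, ax = plt.subplots(figsize=(4, 4))
    for p, q in problem["segments"]:
        ax.plot(*zip(pts[p], pts[q]), "k-")
    for name, (x, y) in pts.items():
        ax.annotate(name, (x, y), textcoords="offset points", xytext=(4, 4))
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150)
    plt.close(fig)

render(problem)
```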
cs.CV / 28 / 2602.18746

MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions

MIRROR:通过对视觉区域的反思进行多模态迭代推理
Zhang, Haoyu, Wu, Yuwei, Li, Pengxiang, Zhang, Xintong, Gao, Zhi, Gao, Rui, Gao, Mingyang, Sun, Che, Jia, Yunde
Abstract
In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct **ReflectV**, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.
Chinese Translation
在视觉-语言模型(VLMs)时代,增强多模态推理能力仍然是一个关键挑战,特别是在处理模糊或复杂的视觉输入时,初步推理往往导致幻觉或逻辑错误。现有的VLMs通常会产生看似合理但缺乏依据的答案,即使在被提示“反思”时,它们的修正也可能与图像证据脱节。为了解决这一问题,我们提出了MIRROR框架,通过对视觉区域的反思进行多模态迭代推理。MIRROR将视觉反思嵌入为核心机制,构建为一个闭环过程,包括草稿、批评、基于区域的验证和修订,反复进行,直到输出在视觉上得到支持。为了促进该模型的训练,我们构建了**ReflectV**,这是一个用于多轮监督的视觉反思数据集,明确包含反思触发器、基于区域的验证动作以及基于视觉证据的答案修订。在一般视觉-语言基准和代表性的视觉-语言推理基准上的实验表明,MIRROR提高了正确性并减少了视觉幻觉,展示了将反思训练作为一种寻求证据的、区域感知的验证过程,而不仅仅是纯文本修订步骤的价值。
cs.CV / 29 / 2602.18747

Benchmarking Computational Pathology Foundation Models For Semantic Segmentation

计算病理基础模型在语义分割中的基准评估
Ramchandani, Lavish, Tinaikar, Aashay, Das, Dev Kumar, Garg, Rohit, Thomas, Tijo
Abstract
In recent years, foundation models such as CLIP, DINO, and CONCH have demonstrated remarkable domain generalization and unsupervised feature extraction capabilities across diverse imaging tasks. However, systematic and independent evaluations of these models for pixel-level semantic segmentation in histopathology remain scarce. In this study, we propose a robust benchmarking approach to assess 10 foundation models on four histopathological datasets covering both morphological tissue-region and cellular/nuclear segmentation tasks. Our method leverages attention maps of foundation models as pixel-wise features, which are then classified using a machine learning algorithm, XGBoost, enabling fast, interpretable, and model-agnostic evaluation without finetuning. We show that the vision-language foundation model CONCH performed best across datasets when compared to vision-only foundation models, with PathDino a close second. Further analysis shows that models trained on distinct histopathology cohorts capture complementary morphological representations, and concatenating their features yields superior segmentation performance. Concatenating features from CONCH, PathDino, and CellViT outperformed individual models across all datasets by 7.95% (averaged across the datasets), suggesting that ensembles of foundation models can better generalize to diverse histopathological segmentation tasks.
Chinese Translation
近年来,基础模型如 CLIP、DINO 和 CONCH 在多种成像任务中展示了卓越的领域泛化能力和无监督特征提取能力。然而,针对这些模型在组织病理学中像素级语义分割的系统性和独立评估仍然稀缺。在本研究中,我们提出了一种稳健的基准评估方法,以评估 10 个基础模型在四个组织病理数据集上的表现,涵盖形态组织区域和细胞/核分割任务。我们的方法利用基础模型的注意力图作为像素级特征,然后使用机器学习算法 XGBoost 进行分类,从而实现快速、可解释且不依赖于模型的评估,无需微调。我们展示了视觉语言基础模型 CONCH 在与仅视觉基础模型的比较中,在各数据集上表现最佳,PathDino 紧随其后。进一步分析表明,针对不同组织病理队列训练的模型捕捉到互补的形态表示,连接它们的特征能够获得更优的分割性能。将 CONCH、PathDino 和 CellViT 的特征连接在一起,在所有数据集上的表现比单独模型提高了 7.95%(在各数据集上取平均),这表明基础模型的集成能够更好地泛化到多样的组织病理分割任务中。
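The protocol itself is easy to sketch: treat (upsampled) attention-map features from a frozen foundation model as per-pixel inputs to XGBoost, and ensemble models by concatenating features channel-wise. Shapes and hyperparameters below are placeholders, and the random arrays stand in for real extracted features.

```python
import numpy as np
from xgboost import XGBClassifier

def fit_pixel_classifier(feats: np.ndarray, labels: np.ndarray) -> XGBClassifier:
    """feats: (N, H, W, C) per-pixel features; labels: (N, H, W) class masks."""
    X = feats.reshape(-1, feats.shape[-1])
    y = labels.reshape(-1)
    clf = XGBClassifier(n_estimators=50, max_depth=6, tree_method="hist")
    clf.fit(X, y)
    return clf

def concat_features(*per_model_feats):
    """Ensemble foundation models by concatenating their feature channels."""
    return np.concatenate(per_model_feats, axis=-1)

rng = np.random.default_rng(0)
f_conch = rng.normal(size=(2, 32, 32, 8)).astype(np.float32)  # stand-in features
f_dino = rng.normal(size=(2, 32, 32, 8)).astype(np.float32)
masks = rng.integers(0, 3, size=(2, 32, 32))
clf = fit_pixel_classifier(concat_features(f_conch, f_dino), masks)
pred = clf.predict(concat_features(f_conch, f_dino).reshape(-1, 16)[:5])
```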
cs.CV / 30 / 2602.18752

Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement

优化多模态大模型中的身份一致性:通过对齐、纠缠和解纠缠实现面部修复
Dong, Yuran, Dai, Hang, Ye, Mang
Abstract
Multimodal editing large models have demonstrated powerful editing capabilities across diverse tasks. However, a persistent and long-standing limitation is the decline in facial identity (ID) consistency during realistic portrait editing. Due to the human eye's high sensitivity to facial features, such inconsistency significantly hinders the practical deployment of these models. Current facial ID preservation methods struggle to achieve consistent restoration of both facial identity and edited element IP due to Cross-source Distribution Bias and Cross-source Feature Contamination. To address these issues, we propose EditedID, an Alignment-Disentanglement-Entanglement framework for robust identity-specific facial restoration. By systematically analyzing diffusion trajectories, sampler behaviors, and attention properties, we introduce three key components: 1) Adaptive mixing strategy that aligns cross-source latent representations throughout the diffusion process. 2) Hybrid solver that disentangles source-specific identity attributes and details. 3) Attentional gating mechanism that selectively entangles visual elements. Extensive experiments show that EditedID achieves state-of-the-art performance in preserving original facial ID and edited element IP consistency. As a training-free and plug-and-play solution, it establishes a new benchmark for practical and reliable single/multi-person facial identity restoration in open-world settings, paving the way for the deployment of multimodal editing large models in real-person editing scenarios. The code is available at https://github.com/NDYBSNDY/EditedID.
Chinese Translation
多模态编辑大模型在各种任务中展现了强大的编辑能力。然而,一个持续且长期存在的限制是,在真实肖像编辑过程中面部身份(ID)一致性的下降。由于人眼对面部特征的高度敏感,这种不一致性显著阻碍了这些模型的实际应用。目前的面部ID保留方法由于跨源分布偏差和跨源特征污染,难以在面部身份和编辑元素IP的一致修复上取得成功。为了解决这些问题,我们提出了EditedID,一个用于稳健身份特定面部修复的对齐-解纠缠-纠缠框架。通过系统分析扩散轨迹、采样器行为和注意力特性,我们引入了三个关键组件:1)自适应混合策略,在整个扩散过程中对齐跨源潜在表示;2)混合求解器,解纠缠源特定的身份属性和细节;3)注意力门控机制,选择性地纠缠视觉元素。大量实验表明,EditedID在保留原始面部ID和编辑元素IP一致性方面达到了最先进的性能。作为一种无训练且即插即用的解决方案,它为开放世界环境中单人/多人面部身份修复建立了新的基准,为多模态编辑大模型在真实人物编辑场景中的应用铺平了道路。代码可在 https://github.com/NDYBSNDY/EditedID 获取。
cs.CV / 31 / 2602.18757

Driving with A Thousand Faces: A Benchmark for Closed-Loop Personalized End-to-End Autonomous Driving

千面驾驶:闭环个性化端到端自动驾驶的基准测试
Dong, Xiaoru, Li, Ruiqin, Han, Xiao, Wu, Zhenxuan, Wang, Jiamin, Chen, Jian, Jiang, Qi, Yiu, SM, Zhu, Xinge, Ma, Yuexin
Abstract
Human driving behavior is inherently diverse, yet most end-to-end autonomous driving (E2E-AD) systems learn a single average driving style, neglecting individual differences. Achieving personalized E2E-AD faces challenges across three levels: limited real-world datasets with individual-level annotations, a lack of quantitative metrics for evaluating personal driving styles, and the absence of algorithms that can learn stylized representations from users' trajectories. To address these gaps, we propose Person2Drive, a comprehensive personalized E2E-AD platform and benchmark. It includes an open-source, flexible data collection system that simulates realistic scenarios to generate scalable and diverse personalized driving datasets; style vector-based evaluation metrics with Maximum Mean Discrepancy and KL divergence to comprehensively quantify individual driving behaviors; and a personalized E2E-AD framework with a style reward model that efficiently adapts E2E models for safe and individualized driving. Extensive experiments demonstrate that Person2Drive enables fine-grained analysis, reproducible evaluation, and effective personalization in end-to-end autonomous driving. Our dataset and code will be released after acceptance.
Chinese Translation
人类驾驶行为本质上是多样的,但大多数端到端自动驾驶(E2E-AD)系统学习的是单一的平均驾驶风格,忽视了个体差异。实现个性化的E2E-AD面临三个层面的挑战:有限的具有个体级标注的真实世界数据集、缺乏评估个人驾驶风格的定量指标,以及缺乏能够从用户轨迹中学习风格化表示的算法。为了解决这些问题,我们提出了Person2Drive,一个全面的个性化E2E-AD平台和基准测试。它包括一个开源、灵活的数据收集系统,模拟现实场景以生成可扩展和多样化的个性化驾驶数据集;基于风格向量的评估指标,使用最大均值差异(Maximum Mean Discrepancy)和KL散度(KL divergence)全面量化个体驾驶行为;以及一个个性化的E2E-AD框架,配备风格奖励模型,有效地调整E2E模型以实现安全和个性化的驾驶。大量实验表明,Person2Drive能够实现细粒度分析、可重复评估和有效个性化的端到端自动驾驶。我们的数据集和代码将在接受后发布。
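One of the two style metrics has a compact closed form. A sketch of squared MMD with an RBF kernel between two sets of per-trajectory style vectors (the biased V-statistic estimator; the bandwidth is a placeholder, and the paper pairs this with KL divergence):

```python
import numpy as np

def rbf_mmd2(X: np.ndarray, Y: np.ndarray, sigma: float = 1.0) -> float:
    """Squared Maximum Mean Discrepancy between row sets X and Y."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    m, n = len(X), len(Y)
    return k(X, X).sum() / m**2 + k(Y, Y).sum() / n**2 - 2 * k(X, Y).sum() / (m * n)

rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, size=(64, 8))   # a driver's per-trip style vectors
policy = rng.normal(0.3, 1.0, size=(64, 8))  # the E2E policy's rollouts
print(rbf_mmd2(human, policy))               # larger value = worse style match
```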
cs.CV / 32 / 2602.18763

TAG: Thinking with Action Unit Grounding for Facial Expression Recognition

TAG:基于动作单元基础的面部表情识别思维
Lin, Haobo, Bai, Tianyi, Zhang, Jiajun, Chang, Xuanhao, Lu, Sheng, Gu, Fangming, Hu, Zengjie, Zhang, Wentao
Abstract
Facial Expression Recognition (FER) is a fine-grained visual understanding task where reliable predictions require reasoning over localized and meaningful facial cues. Recent vision-language models (VLMs) enable natural language explanations for FER, but their reasoning is often ungrounded, producing fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, leading to poor robustness across different datasets. We propose TAG (Thinking with Action Unit Grounding), a vision-language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU-related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine-tuning on AU-grounded reasoning traces followed by reinforcement learning with an AU-aware reward that aligns predicted regions with external AU detectors. Evaluated on RAF-DB, FERPlus, and AffectNet, TAG consistently outperforms strong open-source and closed-source VLM baselines while simultaneously improving visual faithfulness. Ablation and preference studies further show that AU-grounded rewards stabilize reasoning and mitigate hallucination, demonstrating the importance of structured grounded intermediate representations for trustworthy multimodal reasoning in FER. The code will be available at https://github.com/would1920/FER_TAG.
Chinese Translation
面部表情识别(FER)是一项细粒度的视觉理解任务,可靠的预测需要对局部且有意义的面部线索进行推理。最近的视觉-语言模型(VLMs)使得面部表情识别能够提供自然语言解释,但它们的推理往往缺乏基础,产生流畅但无法验证的推理,这些推理与视觉证据的关联较弱,容易产生幻觉,导致在不同数据集上的鲁棒性较差。我们提出了TAG(基于动作单元基础的思维),这是一个视觉-语言框架,明确约束多模态推理必须得到面部动作单元(AUs)的支持。TAG要求中间推理步骤必须基于与动作单元相关的面部区域,从而产生伴随可验证视觉证据的预测。该模型通过在基于动作单元的推理轨迹上进行监督微调进行训练,随后通过强化学习与一个关注动作单元的奖励进行训练,该奖励将预测区域与外部动作单元检测器对齐。在RAF-DB、FERPlus和AffectNet上的评估表明,TAG始终优于强大的开源和闭源VLM基线,同时提高了视觉可信度。消融实验和偏好研究进一步表明,基于动作单元的奖励稳定了推理并减轻了幻觉,证明了结构化的有基础的中间表示在面部表情识别中进行可信多模态推理的重要性。代码将发布在 https://github.com/would1920/FER_TAG 。
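The AU-aware reward plausibly takes the following shape: a task term for label correctness plus a grounding term measuring overlap between regions cited in the reasoning trace and regions an external AU detector fires on. The IoU matching and the weights below are assumptions, not the paper's exact reward.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def au_grounding_reward(predicted_regions, au_detections, correct_label: bool,
                        w_label: float = 1.0, w_ground: float = 0.5) -> float:
    """Combine a task reward (correct expression label) with a grounding reward:
    mean best-match IoU between regions cited in the reasoning trace and the
    regions an external AU detector fires on (weights are placeholders)."""
    if not predicted_regions or not au_detections:
        grounding = 0.0
    else:
        grounding = sum(max(iou(p, d) for d in au_detections)
                        for p in predicted_regions) / len(predicted_regions)
    return w_label * float(correct_label) + w_ground * grounding

r = au_grounding_reward([(10, 10, 50, 50)], [(12, 8, 48, 52)], correct_label=True)
```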
cs.CV / 33 / 2602.18765

A high-resolution nationwide urban village mapping product for 342 Chinese cities based on foundation models

基于基础模型的342个中国城市高分辨率全国城市村落制图产品
Bai, Lubin, Xiao, Sheng, Yin, Ziyu, Wang, Haoyu, Wu, Siyang, Zhang, Xiuyuan, Du, Shihong
Abstract
Urban Villages (UVs) represent a distinctive form of high-density informal settlement embedded within China's rapidly urbanizing cities. Accurate identification of UVs is critical for urban governance, renewal, and sustainable development. However, due to the pronounced heterogeneity and diversity of UVs across China's vast territory, a consistent and reliable nationwide dataset has been lacking. In this work, we present GeoLink-UV, a high-resolution nationwide UV mapping product that clearly delineates the locations and boundaries of UVs in 342 Chinese cities. The dataset is derived from multisource geospatial data, including optical remote sensing images and geo-vector data, and is generated through a foundation model-driven mapping framework designed to address the generalization issues and improve the product quality. A geographically stratified accuracy assessment based on independent samples from 28 cities confirms the reliability and scientific credibility of the nationwide dataset across heterogeneous urban contexts. Based on this nationwide product, we reveal substantial interregional disparities in UV prevalence and spatial configuration. On average, UV areas account for 8% of built-up land, with marked clustering in central and south China. Building-level analysis further confirms a consistent low-rise, high-density development pattern of UVs nationwide, while highlighting regionally differentiated morphological characteristics. The GeoLink-UV dataset provides an open and systematically validated geospatial foundation for urban studies, informal settlement monitoring, and evidence-based urban renewal planning, and contributes directly to large-scale assessments aligned with Sustainable Development Goal 11. The GeoLink-UV dataset introduced in this article is freely available at https://doi.org/10.5281/zenodo.18688062.
Chinese Translation
城市村落(Urban Villages, UVs)是中国快速城市化进程中嵌入城市中的一种独特高密度非正式居住形态。准确识别城市村落对于城市治理、更新和可持续发展至关重要。然而,由于中国广阔地域内城市村落的显著异质性和多样性,缺乏一致且可靠的全国性数据集。在本研究中,我们提出了GeoLink-UV,一个高分辨率的全国城市村落制图产品,清晰划定了342个中国城市中城市村落的位置和边界。该数据集源自多源地理空间数据,包括光学遥感影像和地理矢量数据,并通过一个基于基础模型的制图框架生成,旨在解决概括性问题并提高产品质量。基于来自28个城市的独立样本进行的地理分层准确性评估确认了该全国数据集在异质城市背景下的可靠性和科学可信度。基于这一全国性产品,我们揭示了城市村落的流行程度和空间配置存在显著的区域差异。平均而言,城市村落面积占已建成土地的8%,在中国中部和南部地区呈现明显的聚集特征。建筑层面的分析进一步确认了全国范围内城市村落一致的低层高密度发展模式,同时突出了区域差异化的形态特征。GeoLink-UV数据集为城市研究、非正式居住监测和基于证据的城市更新规划提供了一个开放且系统验证的地理空间基础,并直接支持与可持续发展目标11相关的大规模评估。本文介绍的GeoLink-UV数据集可在https://doi.org/10.5281/zenodo.18688062上免费获取。
cs.CV / 34 / 2602.18766

Initialization matters in few-shot adaptation of vision-language models for histopathological image classification

初始化在少样本适应视觉语言模型进行组织病理图像分类中的重要性
Meseguer, Pablo, del Amor, Rocío, Naranjo, Valery
Abstract
Vision-language models (VLMs) pre-trained on datasets of histopathological image-caption pairs have enabled zero-shot slide-level classification. The ability of VLM image encoders to extract discriminative features also opens the door for supervised fine-tuning for whole-slide image (WSI) classification, ideally using few labeled samples. Slide-level prediction frameworks require the incorporation of multiple instance learning (MIL) due to the gigapixel size of the WSI. Following patch-level feature extraction and aggregation, MIL frameworks rely on linear classifiers trained on top of the slide-level aggregated features. Classifier weight initialization has a large influence on Linear Probing performance in efficient transfer learning (ETL) approaches based on few-shot learning. In this work, we propose Zero-Shot Multiple-Instance Learning (ZS-MIL) to address the limitations of random classifier initialization, which underperforms zero-shot prediction in MIL problems. ZS-MIL uses the class-level embeddings of the VLM text encoder as the classification layer's starting point to compute each sample's bag-level probabilities. Through multiple experiments, we demonstrate the robustness of ZS-MIL compared to well-known weight initialization techniques, both in terms of performance and variability, in an ETL few-shot scenario for subtyping prediction.
Chinese Translation
在组织病理图像-标题对数据集上预训练的视觉语言模型(VLM)实现了零样本切片级分类。VLM图像编码器提取判别特征的能力也为全切片图像(WSI)分类的监督微调打开了大门,理想情况下仅使用少量标记样本。由于WSI的千兆像素大小,切片级预测框架需要结合多实例学习(MIL)。在进行图块级特征提取和聚合后,MIL框架依赖于在切片级聚合特征之上训练的线性分类器。分类器权重初始化对基于少样本学习的高效迁移学习(ETL)方法中的线性探测性能有很大影响。在本研究中,我们提出了零样本多实例学习(ZS-MIL),以解决随机分类器初始化在MIL问题中表现不如零样本预测的局限性。ZS-MIL使用VLM文本编码器的类级嵌入作为分类层的起始点,以计算每个样本的包级概率。通过多次实验,我们展示了在ETL少样本亚型预测场景中,ZS-MIL在性能和方差方面相较于常见权重初始化技术的稳健性。
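The central trick is compact enough to write out: initialize the MIL bag-level linear classifier at the (normalized) class embeddings of the VLM text encoder, so that before any gradient step it reproduces zero-shot cosine-similarity scores. The mean-pooling aggregator below is a stand-in for whichever MIL aggregation is actually used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def zero_shot_init_classifier(class_text_embeddings: torch.Tensor) -> nn.Linear:
    """Initialize the bag-level linear classifier from the VLM text encoder's
    class embeddings instead of random weights, so the untrained classifier
    already reproduces zero-shot (cosine-similarity) scores."""
    num_classes, dim = class_text_embeddings.shape
    clf = nn.Linear(dim, num_classes, bias=False)
    with torch.no_grad():
        clf.weight.copy_(F.normalize(class_text_embeddings, dim=-1))
    return clf

# Usage: aggregate patch features into a bag embedding, then classify.
text_emb = torch.randn(2, 512)        # stand-in for VLM text-encoder class prompts
clf = zero_shot_init_classifier(text_emb)
patch_feats = torch.randn(1000, 512)  # frozen VLM image-encoder patch features
bag = F.normalize(patch_feats.mean(0, keepdim=True), dim=-1)  # mean-pool aggregator
probs = clf(bag).softmax(-1)
```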
cs.CV / 35 / 2602.18792

MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations

MaskDiME:自适应掩蔽扩散用于精确高效的视觉反事实解释
Guo, Changlu, Christensen, Anders Nymark, Dahl, Anders Bjorholm, Hannemose, Morten Rieger
Abstract
Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model's prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, and effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. Our training-free framework, MaskDiME, achieves over 30x faster inference than the baseline method and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.
Chinese Translation
视觉反事实解释旨在揭示能够改变模型预测的最小语义修改,从而提供对深度神经网络的因果和可解释性洞察。然而,现有的基于扩散的反事实生成方法通常计算开销大、采样速度慢,并且在定位修改区域方面不够精确。为了解决这些局限性,我们提出了MaskDiME,这是一种简单、快速且有效的扩散框架,通过局部采样统一了语义一致性和空间精度。我们的方法自适应地聚焦于与决策相关的区域,以实现局部且语义一致的反事实生成,同时保持高图像保真度。我们的无训练框架MaskDiME在推理速度上比基线方法快超过30倍,并且在五个涵盖多样视觉领域的基准数据集上实现了可比或最先进的性能,确立了一种高效反事实解释的实用且可推广的解决方案。
cs.CV / 36 / 2602.18799

Rethinking Preference Alignment for Diffusion Models with Classifier-Free Guidance

重新思考无分类器引导的扩散模型的偏好对齐
Jiang, Zhou, Wen, Yandong, Liu, Zhen
Abstract
Aligning large-scale text-to-image diffusion models with nuanced human preferences remains challenging. While direct preference optimization (DPO) is simple and effective, large-scale finetuning often shows a generalization gap. We take inspiration from test-time guidance and cast preference alignment as classifier-free guidance (CFG): a finetuned preference model acts as an external control signal during sampling. Building on this view, we propose a simple method that improves alignment without retraining the base model. To further enhance generalization, we decouple preference learning into two modules trained on positive and negative data, respectively, and form a *contrastive guidance* vector at inference by subtracting their predictions (positive minus negative), scaled by a user-chosen strength and added to the base prediction at each step. This yields a sharper and controllable alignment signal. We evaluate on Stable Diffusion 1.5 and Stable Diffusion XL with Pick-a-Pic v2 and HPDv3, showing consistent quantitative and qualitative gains.
Chinese Translation
将大规模文本到图像的扩散模型与细致的人类偏好对齐仍然具有挑战性。尽管直接偏好优化(DPO)简单有效,但大规模微调往往显示出泛化差距。我们受到测试时引导的启发,将偏好对齐视为无分类器引导(CFG):一个微调的偏好模型在采样过程中作为外部控制信号。基于这一观点,我们提出了一种简单的方法,在不重新训练基础模型的情况下改善对齐。为了进一步增强泛化能力,我们将偏好学习解耦为两个模块,分别在正数据和负数据上进行训练,并通过减去它们的预测(正预测减去负预测)形成一个“对比引导”向量,该向量按用户选择的强度进行缩放,并在每一步添加到基础预测中。这产生了一个更清晰且可控的对齐信号。我们在Stable Diffusion 1.5和Stable Diffusion XL上使用Pick-a-Pic v2和HPDv3进行了评估,显示出一致的定量和定性提升。
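The inference-time rule is a one-liner. A hedged sketch, with the base model and the two preference modules represented as callables sharing the noise-predictor signature (argument names and the dummy usage are placeholders):

```python
import torch

@torch.no_grad()
def contrastive_guidance_step(base_model, pos_module, neg_module, x_t, t, cond,
                              strength: float = 2.0):
    """Noise prediction for one sampling step under contrastive guidance:
    base prediction plus a scaled (positive minus negative) preference
    direction, in the spirit of classifier-free guidance."""
    eps_base = base_model(x_t, t, cond)
    eps_pos = pos_module(x_t, t, cond)   # module finetuned on preferred data
    eps_neg = neg_module(x_t, t, cond)   # module finetuned on dispreferred data
    return eps_base + strength * (eps_pos - eps_neg)

# Dummy callables standing in for the real noise predictors.
f = lambda x, t, c: torch.zeros_like(x)
eps = contrastive_guidance_step(f, f, f, torch.randn(1, 4, 64, 64),
                                torch.tensor([10]), None)
```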
cs.CV / 37 / 2602.18811

Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection

跨域少样本目标检测的多模态原型学习
Wang, Wanqi, Guo, Jingcai, Cai, Yuxiang, Chen, Zhi
Abstract
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel classes in unseen target domains given only a few labeled examples. While open-vocabulary detectors built on vision-language models (VLMs) transfer well, they depend almost entirely on text prompts, which encode domain-invariant semantics but miss domain-specific visual information needed for precise localization under few-shot supervision. We propose a dual-branch detector that Learns Multi-modal Prototypes, dubbed LMP, by coupling textual guidance with visual exemplars drawn from the target domain. A Visual Prototype Construction module aggregates class-level prototypes from support RoIs and dynamically generates hard-negative prototypes in query images via jittered boxes, capturing distractors and visually similar backgrounds. In the visual-guided branch, we inject these prototypes into the detection pipeline with components mirrored from the text branch as the starting point for training, while a parallel text-guided branch preserves open-vocabulary semantics. The branches are trained jointly and ensembled at inference by combining semantic abstraction with domain-adaptive details. On six cross-domain benchmark datasets and standard 1/5/10-shot settings, our method achieves state-of-the-art or highly competitive mAP.
Chinese Translation
跨域少样本目标检测(CD-FSOD)旨在仅凭少量标记示例在未见的目标域中检测新类别。尽管基于视觉-语言模型(VLMs)构建的开放词汇检测器具有良好的迁移能力,但它们几乎完全依赖文本提示,这些提示编码了领域不变的语义,却缺乏在少样本监督下进行精确定位所需的领域特定视觉信息。我们提出了一种双分支检测器,通过将文本指导与来自目标域的视觉示例结合,学习多模态原型,称为LMP。视觉原型构建模块从支持区域(RoIs)聚合类别级原型,并通过抖动框在查询图像中动态生成难负原型,捕捉干扰物和视觉上相似的背景。在视觉引导分支中,我们将这些原型注入检测管道,使用与文本分支镜像的组件作为训练的起点,而并行的文本引导分支则保留开放词汇语义。这些分支共同训练,并在推理时通过结合语义抽象与领域自适应细节进行集成。在六个跨域基准数据集和标准的1/5/10-shot设置下,我们的方法实现了最先进或高度竞争的平均精度(mAP)。
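Two pieces of the visual branch invite a short sketch: class prototypes as averaged support RoI features, and hard-negative prototypes built from jittered query boxes. The jitter scheme below is an assumption, not the paper's exact procedure.

```python
import torch

def class_prototypes(support_feats: torch.Tensor, support_labels: torch.Tensor):
    """Aggregate class-level visual prototypes by averaging support RoI features."""
    classes = support_labels.unique(sorted=True)
    return torch.stack([support_feats[support_labels == c].mean(0) for c in classes])

def jitter_boxes(boxes: torch.Tensor, scale: float = 0.2) -> torch.Tensor:
    """Hard-negative boxes: randomly shift box corners by up to +/- scale of the
    box size, so crops cover distractors and near-object background."""
    wh = boxes[:, 2:] - boxes[:, :2]
    noise = (torch.rand_like(boxes) - 0.5) * 2 * scale
    return boxes + noise * torch.cat([wh, wh], dim=1)

feats = torch.randn(30, 256)              # support RoI features (stand-in)
labels = torch.randint(0, 5, (30,))
protos = class_prototypes(feats, labels)  # (num_classes, 256)
negs = jitter_boxes(torch.tensor([[10.0, 20.0, 50.0, 80.0]]))
```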
cs.CV / 38 / 2602.18817

HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation

HeRO:用于姿态感知物体操控的层次化三维语义表示
Xu, Chongyang, Cheng, Shen, Li, Haipeng, Fan, Haoqiang, Feng, Ziliang, Liu, Shuaicheng
Abstract
Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on global and local fields using permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In various tests, HeRO establishes a new state-of-the-art, improving success on Place Dual Shoes by 12.3% and averaging 6.5% gains across six challenging pose-aware tasks. Code is available at https://github.com/Chongyang-99/HeRO.
Chinese Translation
机器人操控的模仿学习已从二维图像策略发展到明确编码几何形状的三维表示。然而,纯粹的几何策略往往缺乏明确的部件级语义,而这些语义对于姿态感知操控至关重要(例如,区分鞋子的鞋头和鞋跟)。在本文中,我们提出了HeRO,一种基于扩散的策略,通过层次化语义场将几何和语义结合起来。HeRO采用密集语义提升,将DINOv2中的区分性、几何敏感特征与Stable Diffusion中的平滑、全局一致的对应关系融合,生成既细致又空间一致的密集特征。这些特征被处理和划分,以构建一个全局场和一组局部场。层次化条件模块利用置换不变网络架构对全局和局部场进行条件化,从而避免了对顺序敏感的偏差,并生成了一种连贯的姿态感知操控策略。在各种测试中,HeRO建立了新的最先进水平,在Place Dual Shoes任务上成功率提高了12.3%,在六个具有挑战性的姿态感知任务中平均提高了6.5%。代码可在 https://github.com/Chongyang-99/HeRO 获取。
cs.CV / 39 / 2602.18822

Robust Self-Supervised Cross-Modal Super-Resolution against Real-World Misaligned Observations

针对现实世界错位观测的鲁棒自监督跨模态超分辨率
Dong, Xiaoyu, Li, Jiahuan, Cui, Ziteng, Yokoya, Naoto
Abstract
We study cross-modal super-resolution (SR) on real-world misaligned data, where only a limited number of low-resolution (LR) source and high-resolution (HR) guide image pairs with complex spatial misalignments are available. To address this challenge, we propose RobSelf, a fully self-supervised model that is optimized online, requiring no training data, ground-truth supervision, or pre-alignment. RobSelf features two key techniques: a misalignment-aware feature translator and a content-aware reference filter. The translator reformulates unsupervised cross-modal and cross-resolution alignment as a weakly-supervised, misalignment-aware translation subtask, producing an aligned guide feature with inherent redundancy. Guided by this feature, the filter performs reference-based discriminative self-enhancement on the source, enabling SR predictions with high resolution and high fidelity. Across a variety of tasks, we demonstrate that RobSelf achieves state-of-the-art performance and superior efficiency. Additionally, we introduce a real-world dataset, RealMisSR, to advance research on this topic. Dataset and code: https://github.com/palmdong/RobSelf.
Chinese Translation
我们研究了在现实世界错位数据上的跨模态超分辨率(SR),在这种情况下,仅有有限数量的低分辨率(LR)源图像和高分辨率(HR)引导图像对,并且存在复杂的空间错位。为了解决这一挑战,我们提出了RobSelf——一个完全自监督的模型,能够在线优化,无需训练数据、真实标签监督或预对齐。RobSelf具有两项关键技术:一个错位感知特征转换器和一个内容感知参考滤波器。转换器将无监督的跨模态和跨分辨率对齐重新表述为一个弱监督的、错位感知的转换子任务,生成具有内在冗余的对齐引导特征。在该特征的指导下,滤波器对源图像执行基于参考的判别自增强,从而实现高分辨率和高保真的SR预测。在多种任务中,我们证明了RobSelf实现了最先进的性能和卓越的效率。此外,我们引入了一个现实世界数据集RealMisSR,以推动该领域的研究。数据集和代码:https://github.com/palmdong/RobSelf。
cs.CV / 40 / 2602.18830

Spatial-Temporal State Propagation Autoregressive Model for 4D Object Generation

用于4D对象生成的时空状态传播自回归模型
Yang, Liying, Liu, Jialun, Hu, Jiakui, Guan, Chenhao, Huang, Haibin, Yi, Fangqiu, Zhang, Chi, Liang, Yanyan
Abstract
Generating high-quality 4D objects with spatial-temporal consistency remains formidable. Existing diffusion-based methods often struggle with spatial-temporal inconsistency, as they fail to leverage outputs from all previous timesteps to guide the generation at the current timestep. Therefore, we propose a Spatial-Temporal State Propagation AutoRegressive Model (4DSTAR), which generates 4D objects maintaining temporal-spatial consistency. 4DSTAR formulates the generation problem as the prediction of tokens that represent the 4D object. It consists of two key components: (1) The dynamic spatial-temporal state propagation autoregressive model (STAR) is proposed, which achieves spatial-temporal consistent generation. Unlike standard autoregressive models, STAR divides prediction tokens into groups based on timesteps. It models long-term dependencies by propagating spatial-temporal states from previous groups and utilizes these dependencies to guide generation at the next timestep. To this end, a spatial-temporal container is proposed, which dynamically updates the effective spatial-temporal state features from all historical groups; the updated features then serve as conditional features to guide the prediction of the next token group. (2) The 4D VQ-VAE is proposed, which implicitly encodes the 4D structure into discrete space and decodes the discrete tokens predicted by STAR into temporally coherent dynamic 3D Gaussians. Experiments demonstrate that 4DSTAR generates spatial-temporal consistent 4D objects, and achieves performance competitive with diffusion models.
Chinese Translation
生成具有时空一致性的高质量4D对象仍然是一项艰巨的任务。现有的基于扩散的方法常常面临时空不一致的问题,因为它们未能利用所有先前时间步的输出来指导当前时间步的生成。因此,我们提出了一种时空状态传播自回归模型(4DSTAR),该模型生成保持时空一致性的4D对象。4DSTAR将生成问题表述为对表示4D对象的标记的预测。它由两个关键组件组成:(1)提出了一种动态时空状态传播自回归模型(STAR),该模型实现了时空一致的生成。与标准自回归模型不同,STAR根据时间步将预测标记分组。它通过从先前组传播时空状态来建模长期依赖关系,并利用这些依赖关系指导下一个时间步的生成。为此,提出了一种时空容器,动态更新来自所有历史组的有效时空状态特征,然后更新的特征作为条件特征来指导下一个标记组的预测。(2)提出了4D VQ-VAE,它将4D结构隐式编码到离散空间中,并将STAR预测的离散标记解码为时间上连贯的动态3D高斯分布。实验表明,4DSTAR能够生成时空一致的4D对象,并在性能上与扩散模型相媲美。
cs.CV / 41 / 2602.18831

IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbation

IDperturb:通过角度扰动增强合成面部生成的变异性
Boutros, Fadi, Caldeira, Eduarda, Chettaoui, Tahar, Damer, Naser
Abstract
Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intra-class variation, an essential property for training robust and generalizable FR models. In this work, we propose IDPERTURB, a simple yet effective geometric-driven sampling strategy to enhance diversity in synthetic face generation. IDPERTURB perturbs identity embeddings within a constrained angular region of the unit hyper-sphere, producing a diverse set of embeddings without modifying the underlying generative model. Each perturbed embedding serves as a conditioning vector for a pre-trained diffusion model, enabling the synthesis of visually varied yet identity-coherent face images suitable for training generalizable FR systems. Empirical results demonstrate that training FR on datasets generated using IDPERTURB yields improved performance across multiple FR benchmarks, compared to existing synthetic data generation approaches.
Chinese Translation
合成数据已成为训练人脸识别(FR)系统的真实人脸数据集的实用替代方案,尤其是在隐私和法律问题日益限制真实生物识别数据使用的背景下。近期身份条件扩散模型的进展使得生成逼真且身份一致的人脸图像成为可能。然而,许多模型在类内变异性方面存在局限性,而类内变异性是训练稳健且可泛化的FR模型的重要特性。在本研究中,我们提出了IDPERTURB,这是一种简单而有效的几何驱动采样策略,用于增强合成面部生成的多样性。IDPERTURB在单位超球体的受限角度区域内扰动身份嵌入,生成一组多样的嵌入,而无需修改基础生成模型。每个扰动的嵌入作为预训练扩散模型的条件向量,使得合成视觉上多样但身份一致的人脸图像成为可能,适合用于训练可泛化的FR系统。实证结果表明,使用IDPERTURB生成的数据集进行FR训练,在多个FR基准测试中相比现有的合成数据生成方法表现出更好的性能。
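The sampling strategy is purely geometric and can be written down almost verbatim: draw a direction orthogonal to the identity embedding and rotate by a bounded angle, which keeps the perturbed embedding on the unit hypersphere. The uniform angle distribution below is an assumption.

```python
import torch
import torch.nn.functional as F

def angular_perturb(id_embedding: torch.Tensor, max_angle_deg: float = 15.0,
                    n: int = 8) -> torch.Tensor:
    """Sample `n` identity embeddings inside a cone of half-angle
    `max_angle_deg` around the original embedding on the unit hypersphere."""
    e = F.normalize(id_embedding, dim=-1)                # (d,)
    r = torch.randn(n, e.numel())
    u = F.normalize(r - (r @ e)[:, None] * e, dim=-1)    # unit directions orthogonal to e
    a = torch.deg2rad(torch.rand(n, 1) * max_angle_deg)  # angles in [0, max_angle]
    return torch.cos(a) * e + torch.sin(a) * u           # still unit-norm

emb = torch.randn(512)
variants = angular_perturb(emb)  # conditioning vectors for the pre-trained diffusion model
```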
cs.CV / 42 / 2602.18833

CLAP Convolutional Lightweight Autoencoder for Plant Disease Classification

CLAP卷积轻量级自编码器用于植物疾病分类
Bera, Asish, Roy, Subhajit, Banerjee, Sudiptendu
Abstract
Convolutional neural networks have substantially advanced plant disease recognition, severity grading, and nutrient deficiency prediction from leaf images. However, these tasks become more challenging under realistic in-situ field conditions. Traditional machine learning models often fail to capture and interpret the discriminative characteristics of plant health, growth, and disease due to subtle variations within leaf subcategories. A few deep learning methods have used additional preprocessing stages or network modules to address the problem, whereas several others rely on pre-trained backbone CNNs, most of which are computationally intensive. To address this challenge, we propose a lightweight autoencoder that uses separable convolutional layers in its encoder and decoder blocks. Sigmoid gating is applied to refine the discriminability of the encoder's features, which the decoder improves further. Finally, the encoder and decoder feature maps are combined into a rich feature representation before classification. The proposed Convolutional Lightweight Autoencoder for Plant disease classification, called CLAP, has been evaluated on three public plant datasets covering cassava, tomato, maize, groundnut, grapes, and other crops for determining plant health conditions. CLAP attains improved or competitive accuracies on the Integrated Plant Disease, Groundnut, and CCMT datasets, balancing performance against a small computational cost of 5 million parameters. Training takes 20 ms and inference 1 ms per image.
Chinese Translation
卷积神经网络在区分植物疾病、严重程度分级和营养缺乏预测方面取得了显著进展,主要依赖于叶片图像。然而,在实际的原位田间条件下,这些任务变得更加具有挑战性。传统的机器学习模型往往无法捕捉和解释植物健康、成长和疾病的判别特征,因为叶片子类别之间存在微妙的差异。一些深度学习方法使用了额外的预处理阶段或网络模块来解决这一问题,而其他一些方法则利用了预训练的骨干卷积神经网络(CNN),但大多数方法计算量较大。因此,为了解决这一挑战,我们提出了一种轻量级自编码器,采用可分离卷积层构建编码器-解码器模块。通过应用sigmoid门控机制来增强编码器特征的判别能力,解码器进一步改善了这一能力。最后,编码器和解码器的特征图被组合以丰富特征表示,便于分类。我们提出的用于植物疾病分类的卷积轻量级自编码器,称为CLAP,已在三个公共植物数据集上进行实验,这些数据集包括木薯、番茄、玉米、花生、葡萄等,以确定植物健康状况。CLAP在综合植物疾病、花生和CCMT数据集上取得了更高或具有竞争力的准确率,在性能和较低计算成本之间实现了平衡,所需参数为500万。训练时间为20毫秒,每张图像的推理时间为1毫秒。
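A minimal PyTorch sketch of the stated ingredients: depthwise-separable convolutions in the encoder and decoder, sigmoid gating of encoder features, and classification over the concatenated encoder and decoder feature maps. Depths and widths are placeholders, not the actual 5M-parameter architecture.

```python
import torch
import torch.nn as nn

class SepConv(nn.Module):
    """Depthwise-separable convolution: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.pw = nn.Conv2d(c_in, c_out, 1)
        self.act = nn.ReLU(inplace=True)
    def forward(self, x):
        return self.act(self.pw(self.dw(x)))

class CLAPSketch(nn.Module):
    """Toy version of the described design: a separable-conv encoder whose
    features are refined by a sigmoid gate, a decoder, and a classifier over
    the combined encoder + decoder feature maps."""
    def __init__(self, n_classes: int, c: int = 32):
        super().__init__()
        self.enc = nn.Sequential(SepConv(3, c), nn.MaxPool2d(2), SepConv(c, c))
        self.gate = nn.Sequential(nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.dec = SepConv(c, c)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(2 * c, n_classes))
    def forward(self, x):
        f_enc = self.enc(x)
        f_enc = f_enc * self.gate(f_enc)  # sigmoid gating of encoder features
        f_dec = self.dec(f_enc)
        return self.head(torch.cat([f_enc, f_dec], dim=1))

logits = CLAPSketch(n_classes=10)(torch.randn(2, 3, 64, 64))
```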
cs.CV / 43 / 2602.18842

Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification

通过迭代流形偏差放大检测AI生成的伪造品
Zhang, Jiangling, Gao, Shuxuan, Liu, Bofan, Feng, Siqiang, Huang, Jirui, Chen, Yaxiong, Chen, Ziyu
Abstract
The proliferation of highly realistic AI-generated images poses critical challenges for digital forensics, demanding precise pixel-level localization of manipulated regions. Existing methods predominantly learn discriminative patterns of specific forgeries and often struggle with novel manipulations as editing techniques continue to evolve. We propose the Iterative Forgery Amplifier Network (IFA-Net), which shifts from learning "what is fake" to modeling "what is real". Grounded in the principle that all manipulations deviate from the natural image manifold, IFA-Net leverages a frozen Masked Autoencoder (MAE) pretrained on real images as a universal realness prior. Our framework operates through a two-stage closed-loop process: an initial Dual-Stream Segmentation Network (DSSN) fuses the original image with MAE reconstruction residuals for coarse localization, followed by a Task-Adaptive Prior Injection (TAPI) module that converts this coarse prediction into guiding prompts to steer the MAE decoder and amplify reconstruction failures in suspicious regions for precise refinement. Extensive experiments on four diffusion-based inpainting benchmarks show that IFA-Net achieves an average improvement of 6.5% in IoU and 8.1% in F1-score over the second-best method, while demonstrating strong generalization to traditional manipulation types.
Chinese Translation
高度逼真的AI生成图像的普及对数字取证提出了重大挑战,要求对被操纵区域进行精确的像素级定位。现有方法主要学习特定伪造品的区分模式,往往在新的操纵技术不断演变时面临困难。我们提出了迭代伪造放大网络(Iterative Forgery Amplifier Network, IFA-Net),该网络从学习“什么是伪造的”转变为建模“什么是真实的”。基于所有操纵都偏离自然图像流形的原则,IFA-Net利用在真实图像上预训练的冻结掩码自编码器(Masked Autoencoder, MAE)作为通用的真实性先验。我们的框架通过一个两阶段的闭环过程进行操作:初始的双流分割网络(Dual-Stream Segmentation Network, DSSN)将原始图像与MAE重建残差融合以进行粗略定位,随后由任务自适应先验注入模块(Task-Adaptive Prior Injection, TAPI)将这一粗略预测转化为指导提示,以引导MAE解码器并放大可疑区域的重建失败,以实现精确的细化。在四个基于扩散的修复基准上的大量实验表明,IFA-Net在IoU上平均提高了6.5%,在F1-score上提高了8.1%,并且在传统操纵类型上表现出强大的泛化能力。
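Stage one's input follows directly from the "model what is real" principle: pixels a real-image MAE fails to reconstruct are suspicious. A hedged sketch of the residual fusion (channel concatenation is an assumption about how the dual-stream network consumes the two signals):

```python
import torch

def realness_residual_input(image: torch.Tensor, mae_recon: torch.Tensor) -> torch.Tensor:
    """Fuse the image with its reconstruction residual under a frozen MAE
    trained only on real images; large residuals flag off-manifold (forged)
    pixels. image, mae_recon: (B, 3, H, W)."""
    residual = (image - mae_recon).abs()
    return torch.cat([image, residual], dim=1)  # (B, 6, H, W) dual-stream input

x = torch.rand(1, 3, 224, 224)
recon = torch.rand(1, 3, 224, 224)  # stand-in for the frozen MAE's output
inp = realness_residual_input(x, recon)
```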
cs.CV / 44 / 2602.18845

Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs

所有权的回声:多模态大型语言模型中用于版权保护的对抗引导双重注入
Xia, Chengwei, Ma, Fan, Quan, Ruijie, Xu, Yunqiu, Zhan, Kun, Yang, Yi
Abstract
With the rapid deployment and widespread adoption of multimodal large language models (MLLMs), disputes regarding model version attribution and ownership have become increasingly frequent, raising significant concerns about intellectual property protection. In this paper, we propose a framework for generating copyright triggers for MLLMs, enabling model publishers to embed verifiable ownership information into the model. The goal is to construct trigger images that elicit ownership-related textual responses exclusively in fine-tuned derivatives of the original model, while remaining inert in other non-derivative models. Our method constructs a tracking trigger image by treating the image as a learnable tensor, performing adversarial optimization with dual-injection of ownership-relevant semantic information. The first injection is achieved by enforcing textual consistency between the output of an auxiliary MLLM and a predefined ownership-relevant target text; the consistency loss is backpropagated to inject this ownership-related information into the image. The second injection is performed at the semantic-level by minimizing the distance between the CLIP features of the image and those of the target text. Furthermore, we introduce an additional adversarial training stage involving the auxiliary model derived from the original model itself. This auxiliary model is specifically trained to resist generating ownership-relevant target text, thereby enhancing robustness in heavily fine-tuned derivative models. Extensive experiments demonstrate the effectiveness of our dual-injection approach in tracking model lineage under various fine-tuning and domain-shift scenarios.
Chinese Translation
随着多模态大型语言模型(MLLMs)的快速部署和广泛采用,关于模型版本归属和所有权的争议日益频繁,知识产权保护问题引发了重大关注。本文提出了一种为MLLMs生成版权触发器的框架,使模型发布者能够将可验证的所有权信息嵌入模型中。我们的目标是构建触发图像,以便在原始模型的微调衍生模型中引发与所有权相关的文本响应,而在其他非衍生模型中保持不活跃。我们的方法通过将图像视为可学习的张量来构建跟踪触发图像,采用对抗优化的方式双重注入与所有权相关的语义信息。第一次注入是通过强制辅助MLLM的输出与预定义的与所有权相关的目标文本之间保持文本一致性来实现的;一致性损失被反向传播,以将此与所有权相关的信息注入图像中。第二次注入是在语义层面上,通过最小化图像的CLIP特征与目标文本特征之间的距离来进行。此外,我们引入了一个额外的对抗训练阶段,涉及从原始模型自身派生的辅助模型。该辅助模型专门训练以抵抗生成与所有权相关的目标文本,从而增强在高度微调的衍生模型中的鲁棒性。大量实验表明,我们的双重注入方法在各种微调和领域转移场景下有效追踪模型的血统。
cs.CV / 45 / 2602.18846

DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

DUET-VLM:用于视觉语言模型训练和推理的双阶段统一高效令牌减少
Singh, Aditya Kumar, Kandala, Hitesh, Brahma, Pratik Prabhanjan, Liu, Zicheng, Barsoum, Emad
Abstract
Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only, redundancy-aware compression of the vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. With this dual-stage compression during training, it achieves 99.7% of baseline accuracy at 67% token reduction and 97.6% at 89%, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline, achieving >100% of baseline accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% setting. These results highlight end-to-end training with DUET-VLM, enabling robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.
Chinese Translation
视觉语言模型(VLM)在多模态理解和推理能力方面取得了显著进展,但由于密集的视觉令牌化,仍然计算开销较大。现有的效率方法要么合并冗余的视觉令牌,要么在语言主干中逐步丢弃它们,通常以牺牲准确性换取速度。在本研究中,我们提出了DUET-VLM,这是一种多功能的即插即用双重压缩框架,包括(a)对视觉编码器输出进行仅视觉的冗余感知压缩,将其转换为保留信息的令牌,以及(b)在语言主干内进行逐层的、显著文本引导的视觉令牌丢弃,以逐步修剪信息量较低的令牌。这种协调的令牌管理使得在保留关键语义的同时实现了激进的压缩。在LLaVA-1.5-7B上,我们的方法在减少67%令牌的情况下保持了超过99%的基线准确率,即便在89%的减少下仍保持超过97%。通过在训练期间应用这种双阶段压缩,它在67%和89%的令牌减少下分别保持了基线准确率的99.7%和97.6%,超越了多个基准测试中先前最先进的视觉令牌减少方法。当集成到Video-LLaVA-7B时,它甚至超越了基线——在大幅减少53.1%令牌的情况下达到了基线准确率的100%以上,并在极端的93.4%设置下保持了97.6%的准确率。这些结果突显了使用DUET-VLM进行端到端训练的能力,使其能够在不牺牲准确性的情况下,稳健地适应减少的视觉(图像/视频)输入,在相同的计算预算内生成紧凑而语义丰富的表示。我们的代码可在https://github.com/AMD-AGI/DUET-VLM获取。
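Stage (b) can be sketched as text-guided top-k selection over visual tokens inside the language backbone; the dot-product saliency below is an assumed stand-in for the paper's exact criterion.

```python
import torch

def drop_visual_tokens(visual_tokens: torch.Tensor, text_tokens: torch.Tensor,
                       keep_ratio: float) -> torch.Tensor:
    """Text-guided pruning: score each visual token by its mean similarity to
    the text tokens and keep the highest-scoring fraction, preserving order.
    The scoring rule here is an assumption, not DUET-VLM's exact saliency."""
    sal = torch.einsum("vd,td->vt", visual_tokens, text_tokens).mean(dim=1)
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    idx = sal.topk(k).indices.sort().values  # keep original token order
    return visual_tokens[idx]

vis = torch.randn(576, 4096)  # e.g. LLaVA-style visual tokens
txt = torch.randn(32, 4096)
vis_pruned = drop_visual_tokens(vis, txt, keep_ratio=0.33)  # ~67% reduction
```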
cs.CV / 46 / 2602.18853

Open-Vocabulary Domain Generalization in Urban-Scene Segmentation

城市场景分割中的开放词汇领域泛化
Zhao, Dong, Zang, Qi, Pu, Nan, Li, Wenjing, Sebe, Nicu, Zhong, Zhun
Abstract
Domain Generalization in Semantic Segmentation (DG-SS) aims to enable segmentation models to perform robustly in unseen environments. However, conventional DG-SS methods are restricted to a fixed set of known categories, limiting their applicability in open-world scenarios. Recent progress in Vision-Language Models (VLMs) has advanced Open-Vocabulary Semantic Segmentation (OV-SS) by enabling models to recognize a broader range of concepts. Yet, these models remain sensitive to domain shifts and struggle to maintain robustness when deployed in unseen environments, a challenge that is particularly severe in urban-driving scenarios. To bridge this gap, we introduce Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting that jointly addresses unseen domains and unseen categories. We introduce the first benchmark for OVDG-SS in autonomous driving, addressing a previously unexplored problem and covering both synthetic-to-real and real-to-real generalization across diverse unseen domains and unseen categories. In OVDG-SS, we observe that domain shifts often distort text-image correlations in pre-trained VLMs, which hinders the performance of OV-SS models. To tackle this challenge, we propose S2-Corr, a state-space-driven text-image correlation refinement mechanism that mitigates domain-induced distortions and produces more consistent text-image correlations under distribution changes. Extensive experiments on our constructed benchmark demonstrate that the proposed method achieves superior cross-domain performance and efficiency compared to existing OV-SS approaches.
Chinese Translation
语义分割中的领域泛化(DG-SS)旨在使分割模型能够在未见环境中稳健地执行。然而,传统的DG-SS方法受到固定已知类别集的限制,限制了其在开放世界场景中的适用性。近期在视觉-语言模型(VLMs)方面的进展推动了开放词汇语义分割(OV-SS),使模型能够识别更广泛的概念。然而,这些模型仍然对领域转移敏感,并在未见环境中保持稳健性方面面临挑战,这在城市驾驶场景中尤为严重。为了解决这一问题,我们提出了语义分割中的开放词汇领域泛化(OVDG-SS),这是一个新的设置,联合解决未见领域和未见类别的问题。我们为自主驾驶中的OVDG-SS引入了第一个基准,解决了一个先前未被探索的问题,并涵盖了多样的未见领域和未见类别下的合成到真实和真实到真实的泛化。在OVDG-SS中,我们观察到领域转移往往扭曲了预训练VLMs中的文本-图像相关性,这阻碍了OV-SS模型的性能。为了解决这一挑战,我们提出了S2-Corr,一种基于状态空间的文本-图像相关性精炼机制,减轻领域引起的扭曲,并在分布变化下生成更一致的文本-图像相关性。我们在构建的基准上进行了广泛的实验,结果表明所提方法在跨领域性能和效率上优于现有的OV-SS方法。
cs.CV / 47 / 2602.18861

Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation

基于学习提示引导的数据生成的视觉变换器联合后训练量化
Li, Shile, Karmann, Markus, Urfalioglu, Onay
Abstract
We present a framework for end-to-end joint quantization of Vision Transformers trained on ImageNet for the purpose of image classification. Unlike prior post-training or block-wise reconstruction methods, we jointly optimize over the entire set of all layers and inter-block dependencies without any labeled data, scaling effectively with the number of samples and completing in just one hour on a single GPU for ViT-small. We achieve state-of-the-art W4A4 and W3A3 accuracies on ImageNet and, to the best of our knowledge, the first PTQ results that maintain strong accuracy on ViT, DeiT, and Swin-T models under extremely low-bit settings (W1.58A8), demonstrating the potential for efficient edge deployment. Furthermore, we introduce a data-free calibration strategy that synthesizes diverse, label-free samples using Stable Diffusion Turbo guided by learned multi-mode prompts. By encouraging diversity in both the learned prompt embeddings and the generated image features, our data-free approach achieves performance on par with real-data ImageNet calibration and surpasses simple text-prompt baselines such as "a photo of a <adjective> <adjective> <class>".
Chinese Translation
我们提出了一个端到端的视觉变换器联合量化框架,该变换器在ImageNet上训练,旨在进行图像分类。与之前的后训练或块级重建方法不同,我们在没有任何标记数据的情况下,对所有层及其间块依赖关系的整个集合进行联合优化,能够有效地随着样本数量的增加而扩展,并且在单个GPU上仅用一小时就完成了对ViT-small的量化。我们在ImageNet上实现了最先进的W4A4和W3A3准确率,并且据我们所知,这是首次在极低比特设置(W1.58A8)下,维持ViT、DeiT和Swin-T模型强准确率的后训练量化结果,展示了高效边缘部署的潜力。此外,我们还引入了一种无数据校准策略,该策略利用稳定扩散Turbo合成多样化的无标签样本,借助学习的多模态提示进行指导。通过在学习的提示嵌入和生成的图像特征中鼓励多样性,我们的无数据方法在性能上与真实数据的ImageNet校准相当,并超越了简单文本提示基线,如“一个<形容词>的<形容词><类别>照片”。
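For concreteness, low-bit settings like W4A4 are commonly simulated with fake quantization as below (symmetric uniform, max-abs scales); the paper instead learns quantization parameters jointly across all layers, which in practice requires a straight-through estimator for the rounding. This is a generic sketch, not the paper's quantizer.

```python
import torch

def fake_quant(x: torch.Tensor, bits: int, per_channel: bool = True) -> torch.Tensor:
    """Symmetric uniform fake-quantization (quantize then dequantize). Scale
    selection here is plain max-abs; a joint-optimization PTQ method would
    learn the scales instead."""
    qmax = 2 ** (bits - 1) - 1
    dims = tuple(range(1, x.dim())) if per_channel else tuple(range(x.dim()))
    scale = x.abs().amax(dim=dims, keepdim=True).clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

w = torch.randn(64, 128)
w_q4 = fake_quant(w, bits=4)  # W4 weights
a_q4 = fake_quant(torch.relu(torch.randn(8, 128)), bits=4, per_channel=False)  # A4
```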
cs.CV / 48 / 2602.18867

Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning

相似性作为证据:对过度自信的视觉语言模型进行校准,以实现可解释和标签高效的医学主动学习
Xie, Zhuofan, Lin, Zishan, Lin, Jinliang, Qi, Jie, Hong, Shaohua, Li, Shuo
Abstract
Active Learning (AL) reduces annotation costs in medical imaging by selecting only the most informative samples for labeling, but suffers from cold-start when labeled data are scarce. Vision-Language Models (VLMs) address the cold-start problem via zero-shot predictions, yet their temperature-scaled softmax outputs treat text-image similarities as deterministic scores while ignoring inherent uncertainty, leading to overconfidence. This overconfidence misleads sample selection, wasting annotation budgets on uninformative cases. To overcome these limitations, the Similarity-as-Evidence (SaE) framework calibrates text-image similarities by introducing a Similarity Evidence Head (SEH), which reinterprets the similarity vector as evidence and parameterizes a Dirichlet distribution over labels. In contrast to a standard softmax that enforces confident predictions even under weak signals, the Dirichlet formulation explicitly quantifies lack of evidence (vacuity) and conflicting evidence (dissonance), thereby mitigating overconfidence caused by rigid softmax normalization. Building on this, SaE employs a dual-factor acquisition strategy: high-vacuity samples (e.g., rare diseases) are prioritized in early rounds to ensure coverage, while high-dissonance samples (e.g., ambiguous diagnoses) are prioritized later to refine boundaries, providing clinically interpretable selection rationales. Experiments on ten public medical imaging datasets with a 20% label budget show that SaE attains state-of-the-art macro-averaged accuracy of 82.57%. On the representative BTMRI dataset, SaE also achieves superior calibration, with a negative log-likelihood (NLL) of 0.425.
Chinese Translation
主动学习(AL)通过仅选择最具信息量的样本进行标注,从而降低医学影像的标注成本,但在标注数据稀缺时会面临冷启动问题。视觉语言模型(VLMs)通过零样本预测来解决冷启动问题,但其温度缩放的软最大输出将文本-图像相似性视为确定性分数,忽视了固有的不确定性,导致过度自信。这种过度自信误导了样本选择,浪费了标注预算在无信息的案例上。为克服这些限制,相似性作为证据(Similarity-as-Evidence, SaE)框架通过引入相似性证据头(Similarity Evidence Head, SEH)来校准文本-图像相似性,将相似性向量重新解释为证据,并对标签参数化一个狄利克雷分布。与标准软最大不同,后者即使在弱信号下也强制执行自信预测,狄利克雷公式明确量化了缺乏证据(空虚)和冲突证据(不和谐),从而减轻了由刚性软最大归一化引起的过度自信。在此基础上,SaE采用双因素获取策略:在早期轮次中优先考虑高空虚样本(例如,罕见疾病),以确保覆盖,而在后期优先考虑高不和谐样本(例如,模糊诊断),以细化边界,从而提供临床可解释的选择理由。在十个公共医学影像数据集上的实验显示,SaE在20%的标签预算下达到了82.57%的最先进的宏平均准确率。在具有代表性的BTMRI数据集上,SaE还实现了优越的校准,负对数似然(NLL)为0.425。
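Once similarities are mapped to non-negative evidence, vacuity and dissonance have standard closed forms in subjective logic; a sketch using those common definitions (the learned evidence mapping itself is the contribution of the Similarity Evidence Head):

```python
import numpy as np

def dirichlet_uncertainties(evidence: np.ndarray):
    """Vacuity and dissonance of a Dirichlet opinion built from non-negative
    per-class evidence (here: transformed text-image similarities), following
    the usual subjective-logic definitions."""
    K = evidence.shape[-1]
    alpha = evidence + 1.0
    S = alpha.sum()
    b = evidence / S            # per-class belief masses
    vacuity = K / S             # lack of evidence
    diss = 0.0
    for k in range(K):
        others = np.delete(b, k)
        if others.sum() > 0:
            bal = 1.0 - np.abs(others - b[k]) / np.maximum(others + b[k], 1e-12)
            diss += b[k] * (others * bal).sum() / others.sum()
    return vacuity, diss

# Weak evidence everywhere -> high vacuity; strong but conflicting -> high dissonance.
print(dirichlet_uncertainties(np.array([0.1, 0.1, 0.1])))
print(dirichlet_uncertainties(np.array([20.0, 19.0, 0.0])))
```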
cs.CV / 49 / 2602.18869

Enhancing 3D LiDAR Segmentation by Shaping Dense and Accurate 2D Semantic Predictions

通过塑造密集且准确的二维语义预测来增强三维 LiDAR 分割
Dong, Xiaoyu, Xian, Tiankui, Gan, Wanshui, Yokoya, Naoto
Abstract
Semantic segmentation of 3D LiDAR point clouds is important in urban remote sensing for understanding real-world street environments. This task, by projecting LiDAR point clouds and 3D semantic labels as sparse maps, can be reformulated as a 2D problem. However, the intrinsic sparsity of the projected LiDAR and label maps can result in sparse and inaccurate intermediate 2D semantic predictions, which in return limits the final 3D accuracy. To address this issue, we enhance this task by shaping dense and accurate 2D predictions. Specifically, we develop a multi-modal segmentation model, MM2D3D. By leveraging camera images as auxiliary data, we introduce cross-modal guided filtering to overcome label map sparsity by constraining intermediate 2D semantic predictions with dense semantic relations derived from the camera images; and we introduce dynamic cross pseudo supervision to overcome LiDAR map sparsity by encouraging the 2D predictions to emulate the dense distribution of the semantic predictions from the camera images. Experiments show that our techniques enable our model to achieve intermediate 2D semantic predictions with dense distribution and higher accuracy, which effectively enhances the final 3D accuracy. Comparisons with previous methods demonstrate our superior performance in both 2D and 3D spaces.
Chinese Translation
三维 LiDAR 点云的语义分割在城市遥感中对理解真实世界的街道环境至关重要。通过将 LiDAR 点云和三维语义标签投影为稀疏地图,这一任务可以重新表述为二维问题。然而,投影的 LiDAR 和标签地图的内在稀疏性可能导致稀疏且不准确的中间二维语义预测,进而限制最终的三维精度。为了解决这一问题,我们通过塑造密集且准确的二维预测来增强这一任务。具体而言,我们开发了一种多模态分割模型 MM2D3D。我们利用相机图像作为辅助数据:引入跨模态引导滤波,借助来自相机图像的密集语义关系约束中间二维语义预测,从而克服标签地图的稀疏性;同时引入动态交叉伪监督,鼓励二维预测模拟来自相机图像的语义预测的密集分布,从而克服 LiDAR 地图的稀疏性。实验表明,我们的技术使模型能够实现具有密集分布和更高精度的中间二维语义预测,从而有效增强最终的三维精度。与之前方法的比较展示了我们在二维和三维空间中的优越性能。
cs.CV / 50 / 2602.18873

BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation

BiMotion:基于B样条的文本引导动态3D角色生成
Wang, Miaowei, Yan, Qingxuan, Cao, Zhi, Li, Yayuan, Mac Aodha, Oisin, Corso, Jason J., Vaxman, Amir
Abstract
Text-guided dynamic 3D character generation has advanced rapidly, yet producing high-quality motion that faithfully reflects rich textual descriptions remains challenging. Existing methods tend to generate limited sub-actions or incoherent motion due to fixed-length temporal inputs and discrete frame-wise representations that fail to capture rich motion semantics. We address these limitations by representing motion with continuous differentiable B-spline curves, enabling more effective motion generation without modifying the capabilities of the underlying generative model. Specifically, our closed-form, Laplacian-regularized B-spline solver efficiently compresses variable-length motion sequences into compact representations with a fixed number of control points. Further, we introduce a normal-fusion strategy for input shape adherence along with correspondence-aware and local-rigidity losses for motion-restoration quality. To train our model, we collate BIMO, a new dataset containing diverse variable-length 3D motion sequences with rich, high-quality text annotations. Extensive evaluations show that our feed-forward framework BiMotion generates more expressive, higher-quality, and better prompt-aligned motions than existing state-of-the-art methods, while also achieving faster generation. Our project page is at: https://wangmiaowei.github.io/BiMotion.github.io/.
Chinese Translation
文本引导的动态3D角色生成迅速发展,但生成能忠实反映丰富文本描述的高质量运动仍然具有挑战性。现有方法由于固定长度的时间输入和离散的逐帧表示,往往生成有限的子动作或不连贯的运动,无法捕捉丰富的运动语义。我们通过用连续可微的B样条曲线表示运动来解决这些局限性,从而在不改变基础生成模型能力的前提下实现更有效的运动生成。具体而言,我们的闭式、拉普拉斯正则化B样条求解器能够高效地将可变长度的运动序列压缩为具有固定数量控制点的紧凑表示。此外,我们引入了一种法线融合策略以确保对输入形状的遵循,并引入了对应关系感知损失和局部刚性损失以提高运动恢复质量。为了训练我们的模型,我们汇集了BIMO,一个包含多样化可变长度3D运动序列和丰富高质量文本注释的新数据集。广泛的评估表明,我们的前馈框架BiMotion生成的运动比现有的最先进方法更具表现力、更高质量且更符合提示,同时生成速度更快。我们的项目页面地址为:https://wangmiaowei.github.io/BiMotion.github.io/
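A closed-form, Laplacian-regularized B-spline fit of the kind this abstract describes reduces to one linear solve. The sketch below compresses a variable-length motion sequence into a fixed number of control points; the knot layout, spline degree, and regularization weight are our assumptions, not the paper's configuration.

```python
import numpy as np
from scipy.interpolate import BSpline

def fit_bspline(motion, n_ctrl=12, degree=3, lam=1e-2):
    """Least-squares B-spline compression of a (T, D) motion sequence.

    Solves (B^T B + lam * L^T L) C = B^T X in closed form, where L is a
    second-difference Laplacian over control points. A generic stand-in for
    the paper's solver, whose exact weights we do not know.
    """
    T, D = motion.shape
    u = np.linspace(0.0, 1.0, T)
    # clamped uniform knot vector with n_ctrl basis functions
    n_knots = n_ctrl + degree + 1
    inner = np.linspace(0.0, 1.0, n_knots - 2 * degree)
    knots = np.concatenate([[0.0] * degree, inner, [1.0] * degree])
    # design matrix: evaluate each basis function at the frame parameters
    B = np.stack([BSpline(knots, np.eye(n_ctrl)[i], degree)(u)
                  for i in range(n_ctrl)], axis=1)        # (T, n_ctrl)
    L = np.diff(np.eye(n_ctrl), n=2, axis=0)              # (n_ctrl-2, n_ctrl)
    ctrl = np.linalg.solve(B.T @ B + lam * L.T @ L, B.T @ motion)
    return ctrl                                            # (n_ctrl, D)
```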
cs.CV / 51 / 2602.18874

Structure-Level Disentangled Diffusion for Few-Shot Chinese Font Generation

结构级解耦扩散模型用于少样本中文字体生成
Li, Jie, Yang, Suorong, Zhao, Jian, Shen, Furao
Abstract
Few-shot Chinese font generation aims to synthesize new characters in a target style using only a handful of reference images. Achieving accurate content rendering and faithful style transfer requires effective disentanglement between content and style. However, existing approaches achieve only feature-level disentanglement, allowing the generator to re-entangle these features, leading to content distortion and degraded style fidelity. We propose the Structure-Level Disentangled Diffusion Model (SLD-Font), which receives content and style information from two separate channels. SimSun-style images are used as content templates and concatenated with noisy latent features as the input. Style features extracted by a CLIP model from target-style images are integrated via cross-attention. Additionally, we train a Background Noise Removal module in the pixel space to remove background noise in complex stroke regions. Based on theoretical validation of disentanglement effectiveness, we introduce a parameter-efficient fine-tuning strategy that updates only the style-related modules. This allows the model to better adapt to new styles while avoiding overfitting to the reference images' content. We further introduce the Grey and OCR metrics to evaluate the content quality of generated characters. Experimental results show that SLD-Font achieves significantly higher style fidelity while maintaining comparable content accuracy to existing state-of-the-art methods.
Chinese Translation
少样本中文字体生成旨在仅使用少量参考图像合成目标风格的新字符。实现准确的内容渲染和忠实的风格转移需要有效地解耦内容与风格。然而,现有方法仅实现特征级解耦,允许生成器重新纠缠这些特征,导致内容失真和风格保真度下降。我们提出了结构级解耦扩散模型(Structure-Level Disentangled Diffusion Model, SLD-Font),该模型通过两个独立通道接收内容和风格信息。我们使用SimSun风格图像作为内容模板,并将其与噪声潜在特征连接作为输入。通过交叉注意力机制,将从目标风格图像中提取的风格特征与内容信息进行整合。此外,我们在像素空间中训练了一个背景噪声去除模块,以去除复杂笔画区域的背景噪声。基于对解耦有效性的理论验证,我们引入了一种参数高效的微调策略,仅更新与风格相关的模块。这使得模型能够更好地适应新风格,同时避免对参考图像内容的过拟合。我们进一步引入了灰度和OCR指标来评估生成字符的内容质量。实验结果表明,SLD-Font在保持与现有最先进方法相当的内容准确性的同时,显著提高了风格保真度。
cs.CV / 52 / 2602.18880

FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model

FOCA:基于多模态大语言模型的频率导向跨域伪造检测、定位与解释
Liu, Zhou, Su, Tonghua, Zhang, Hongshi, Yang, Fuxiang, Di, Donglin, Song, Yang, Fan, Lei
Abstract
Advances in image tampering techniques, particularly generative models, pose significant challenges to media verification, digital forensics, and public trust. Existing image forgery detection and localization (IFDL) methods suffer from two key limitations: over-reliance on semantic content while neglecting textural cues, and limited interpretability of subtle low-level tampering traces. To address these issues, we propose FOCA, a multimodal large language model-based framework that integrates discriminative features from both the RGB spatial and frequency domains via a cross-attention fusion module. This design enables accurate forgery detection and localization while providing explicit, human-interpretable cross-domain explanations. We further introduce FSE-Set, a large-scale dataset with diverse authentic and tampered images, pixel-level masks, and dual-domain annotations. Extensive experiments show that FOCA outperforms state-of-the-art methods in detection performance and interpretability across both spatial and frequency domains.
Chinese Translation
图像篡改技术的进步,特别是生成模型,给媒体验证、数字取证和公众信任带来了重大挑战。现有的图像伪造检测与定位(IFDL)方法存在两个主要局限性:过度依赖语义内容而忽视纹理线索,以及对微妙的低级篡改痕迹的有限可解释性。为了解决这些问题,我们提出了FOCA,一个基于多模态大语言模型的框架,通过跨注意力融合模块整合来自RGB空间域和频率域的判别特征。该设计能够实现准确的伪造检测和定位,同时提供明确的、可供人类解释的跨域解释。我们进一步引入了FSE-Set,一个包含多样化真实和篡改图像、像素级掩码和双域注释的大规模数据集。大量实验表明,FOCA在空间域和频率域的检测性能和可解释性方面均优于现有的最先进方法。
cs.CV / 53 / 2602.18882

SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

SceneTok:一种用于3D场景的压缩可扩散标记空间
Asim, Mohammad, Wewer, Christopher, Lenssen, Jan Eric
Abstract
We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation-invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a light-weight rectified flow decoder. We show that the compression is 1-3 orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly-compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 5 seconds, achieving a much better quality-speed trade-off than previous paradigms.
Chinese Translation
我们提出了SceneTok,一种新颖的标记器,用于将场景的视图集编码为一组压缩且可扩散的非结构化标记。现有的3D场景表示和生成方法通常使用3D数据结构或视图对齐场。相比之下,我们介绍了第一种将场景信息编码为一小组与空间网格解耦的置换不变标记的方法。场景标记由多视图标记器根据多个上下文视图进行预测,并通过采用轻量级的校正流解码器渲染成新视图。我们展示了这种压缩比其他表示方法强1到3个数量级,同时仍然达到最先进的重建质量。此外,我们的表示可以从新轨迹渲染,包括偏离输入轨迹的轨迹,并且我们展示了解码器能够优雅地处理不确定性。最后,这组高度压缩的非结构化潜在场景标记使得在5秒内实现简单高效的场景生成,达到了比以往范式更好的质量与速度的权衡。
cs.CV / 54 / 2602.18886

PhysConvex: Physics-Informed 3D Dynamic Convex Radiance Fields for Reconstruction and Simulation

PhysConvex:基于物理信息的3D动态凸辐射场用于重建与仿真
Wang, Dan, Cui, Xinrui, Belongie, Serge, Ramamoorthi, Ravi
Abstract
Reconstructing and simulating dynamic 3D scenes with both visual realism and physical consistency remains a fundamental challenge. Existing neural representations, such as NeRFs and 3DGS, excel in appearance reconstruction but struggle to capture complex material deformation and dynamics. We propose PhysConvex, a Physics-informed 3D Dynamic Convex Radiance Field that unifies visual rendering and physical simulation. PhysConvex represents deformable radiance fields using physically grounded convex primitives governed by continuum mechanics. We introduce a boundary-driven dynamic convex representation that models deformation through vertex and surface dynamics, capturing spatially adaptive, non-uniform deformation, and evolving boundaries. To efficiently simulate complex geometries and heterogeneous materials, we further develop a reduced-order convex simulation that advects dynamic convex fields using neural skinning eigenmodes as shape- and material-aware deformation bases with time-varying reduced DOFs under Newtonian dynamics. Convex dynamics also offers compact, gap-free volumetric coverage, enhancing both geometric efficiency and simulation fidelity. Experiments demonstrate that PhysConvex achieves high-fidelity reconstruction of geometry, appearance, and physical properties from videos, outperforming existing methods.
Chinese Translation
重建和仿真具有视觉真实感和物理一致性的动态3D场景仍然是一个基本挑战。现有的神经表示方法,如NeRF(神经辐射场)和3DGS(3D高斯泼溅),在外观重建方面表现出色,但在捕捉复杂材料变形和动态方面存在困难。我们提出了PhysConvex,一种基于物理信息的3D动态凸辐射场,它统一了视觉渲染和物理仿真。PhysConvex使用基于连续介质力学的物理基础凸原语来表示可变形辐射场。我们引入了一种边界驱动的动态凸表示,通过顶点和表面动态来建模变形,捕捉空间自适应的非均匀变形和演变边界。为了高效地仿真复杂几何体和异质材料,我们进一步开发了一种降阶凸仿真,通过使用神经皮肤特征模式作为形状和材料感知的变形基底,在牛顿动力学下以时间变化的降阶自由度推进动态凸场。凸动态还提供了紧凑的无缝体积覆盖,增强了几何效率和仿真保真度。实验表明,PhysConvex能够从视频中实现高保真度的几何、外观和物理属性重建,超越现有方法。
cs.CV / 55 / 2602.18887

SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World

SafeDrive:稀疏世界中的端到端驾驶的细粒度安全推理
Kim, Jungho, Oh, Jiyong, Yu, Seunghoon, Shin, Hongjae, Kwak, Donghyuk, Choi, Jun Won
Abstract
The end-to-end (E2E) paradigm, which maps sensor inputs directly to driving decisions, has recently attracted significant attention due to its unified modeling capability and scalability. However, ensuring safety in this unified framework remains one of the most critical challenges. In this work, we propose SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory-conditioned Sparse World Model. SafeDrive comprises two complementary networks: the Sparse World Network (SWNet) and the Fine-grained Reasoning Network (FRNet). SWNet constructs trajectory-conditioned sparse worlds that simulate the future behaviors of critical dynamic agents and road entities, providing interaction-centric representations for downstream reasoning. FRNet then evaluates agent-specific collision risks and temporal adherence to drivable regions, enabling precise identification of safety-critical events across future timesteps. SafeDrive achieves state-of-the-art performance on both open-loop and closed-loop benchmarks. On NAVSIM, it records a PDMS of 91.6 and an EPDMS of 87.5, with only 61 collisions out of 12,146 scenarios (0.5%). On Bench2Drive, SafeDrive attains a 66.8% driving score.
Chinese Translation
端到端(E2E)范式将传感器输入直接映射到驾驶决策,因其统一建模能力和可扩展性而受到广泛关注。然而,在这一统一框架中确保安全性仍然是最关键的挑战之一。在本研究中,我们提出了SafeDrive,一个E2E规划框架,旨在通过轨迹条件的稀疏世界模型进行明确和可解释的安全推理。SafeDrive由两个互补的网络组成:稀疏世界网络(SWNet)和细粒度推理网络(FRNet)。SWNet构建轨迹条件的稀疏世界,模拟关键动态代理和道路实体的未来行为,为下游推理提供以互动为中心的表示。随后,FRNet评估特定代理的碰撞风险和对可驾驶区域的时间遵守情况,从而精确识别未来时间步长中的安全关键事件。SafeDrive在开环和闭环基准测试中均取得了最先进的性能。在NAVSIM上,它记录了91.6的PDMS和87.5的EPDMS,仅在12,146个场景中发生61次碰撞(0.5%)。在Bench2Drive上,SafeDrive获得了66.8%的驾驶评分。
cs.CV / 56 / 2602.18896

Beyond Stationarity: Rethinking Codebook Collapse in Vector Quantization

超越平稳性:重新思考向量量化中的代码本崩溃
Lu, Hao, Koyun, Onur C., Guo, Yongxin, Zhu, Zhengjie, Alili, Abbas, Gurcan, Metin Nafi
Abstract
Vector Quantization (VQ) underpins many modern generative frameworks such as VQ-VAE, VQ-GAN, and latent diffusion models. Yet, it suffers from the persistent problem of codebook collapse, where a large fraction of code vectors remains unused during training. This work provides a new theoretical explanation by identifying the nonstationary nature of encoder updates as the fundamental cause of this phenomenon. We show that as the encoder drifts, unselected code vectors fail to receive updates and gradually become inactive. To address this, we propose two new methods: Non-Stationary Vector Quantization (NSVQ), which propagates encoder drift to non-selected codes through a kernel-based rule, and Transformer-based Vector Quantization (TransVQ), which employs a lightweight mapping to adaptively transform the entire codebook while preserving convergence to the k-means solution. Experiments on the CelebA-HQ dataset demonstrate that both methods achieve near-complete codebook utilization and superior reconstruction quality compared to baseline VQ variants, providing a principled and scalable foundation for future VQ-based generative models. The code is available at: https://github.com/CAIR-LAB-WFUSM/NSVQ-TransVQ.git
Chinese Translation
向量量化(Vector Quantization, VQ)是许多现代生成框架的基础,如 VQ-VAE、VQ-GAN 和潜在扩散模型。然而,它面临着代码本崩溃的持续问题,即在训练过程中大量代码向量未被使用。本研究通过识别编码器更新的非平稳性作为这一现象的根本原因,提供了新的理论解释。我们表明,随着编码器的漂移,未被选择的代码向量未能接收更新,逐渐变得不活跃。为了解决这一问题,我们提出了两种新方法:非平稳向量量化(Non-Stationary Vector Quantization, NSVQ),通过基于核的规则将编码器漂移传播到未选择的代码上,以及基于变换器的向量量化(Transformer-based Vector Quantization, TransVQ),该方法采用轻量级映射自适应地转换整个代码本,同时保持收敛到 k-means 解的能力。在 CelebA-HQ 数据集上的实验表明,这两种方法都实现了几乎完全的代码本利用率,并且在重建质量上优于基线 VQ 变体,为未来基于 VQ 的生成模型提供了一个有原则且可扩展的基础。代码可在以下链接获取:https://github.com/CAIR-LAB-WFUSM/NSVQ-TransVQ.git
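To make the "propagate encoder drift to non-selected codes" idea concrete, here is one plausible kernel-based update: every code, selected or not, is softly pulled toward the current batch of encoder outputs with Gaussian-kernel weights. The kernel choice, bandwidth, and step size are our assumptions; NSVQ's actual rule may differ, so treat this as a sketch of the general mechanism only.

```python
import torch

def kernel_codebook_update(codebook, z_e, lr=0.1, bandwidth=1.0):
    """Move all codes toward the current encoder outputs with kernel weights.

    codebook: (K, d) code vectors; z_e: (N, d) encoder outputs of the batch.
    Unselected codes still receive a (small) pull, so they track encoder
    drift instead of going stale -- the failure mode the paper identifies.
    """
    d2 = torch.cdist(codebook, z_e) ** 2                   # (K, N) squared distances
    w = torch.softmax(-d2 / (2 * bandwidth ** 2), dim=1)   # kernel weights per code
    target = w @ z_e                                       # (K, d) weighted means
    with torch.no_grad():
        codebook.lerp_(target, lr)                         # pull codes toward targets
    return codebook
```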
cs.CV / 57 / 2602.18903

SCHEMA for Gemini 3 Pro Image: A Structured Methodology for Controlled AI Image Generation on Google's Native Multimodal Model

Gemini 3 Pro 图像的 SCHEMA:一种在谷歌原生多模态模型上进行受控 AI 图像生成的结构化方法论
Cazzaniga, Luca
Abstract
This paper presents SCHEMA (Structured Components for Harmonized Engineered Modular Architecture), a structured prompt engineering methodology specifically developed for Google Gemini 3 Pro Image. Unlike generic prompt guidelines or model-agnostic tips, SCHEMA is an engineered framework built on systematic professional practice encompassing 850 verified API predictions within an estimated corpus of approximately 4,800 generated images, spanning six professional domains: real estate photography, commercial product photography, editorial content, storyboards, commercial campaigns, and information design. The methodology introduces a three-tier progressive system (BASE, MEDIO, AVANZATO) that scales practitioner control from exploratory (approximately 5%) to directive (approximately 95%), a modular label architecture with 7 core and 5 optional structured components, a decision tree with explicit routing rules to alternative tools, and systematically documented model limitations with corresponding workarounds. Key findings include an observed 91% Mandatory compliance rate and 94% Prohibitions compliance rate across 621 structured prompts, a comparative batch consistency test demonstrating substantially higher inter-generation coherence for structured prompts, independent practitioner validation (n=40), and a dedicated Information Design validation demonstrating >95% first-generation compliance for spatial and typographical control across approximately 300 publicly verifiable infographics. Previously published on Zenodo (doi:10.5281/zenodo.18721380).
Chinese Translation
本文提出了 SCHEMA(结构化组件用于协调工程模块化架构),这是一种专门为谷歌 Gemini 3 Pro 图像开发的结构化提示工程方法论。与通用提示指南或模型无关的建议不同,SCHEMA 是一个基于系统化专业实践的工程框架,涵盖了 850 个经过验证的 API 预测,涉及约 4,800 张生成图像,涵盖六个专业领域:房地产摄影、商业产品摄影、编辑内容、故事板、商业活动和信息设计。该方法论引入了一个三级渐进系统(BASE、MEDIO、AVANZATO),将实践者控制从探索性(约 5%)扩展到指令性(约 95%),采用一个具有 7 个核心和 5 个可选结构组件的模块化标签架构,一个带有通向替代工具的明确路由规则的决策树,以及系统性记录的模型局限性及相应的解决方案。主要发现包括在 621 个结构化提示中观察到的 91% 强制合规率和 94% 禁止合规率,比较批次一致性测试显示结构化提示在生成间的一致性显著更高,独立实践者验证(n=40),以及专门的信息设计验证显示在约 300 个公开可验证的信息图中,空间和排版控制的首次生成合规率超过 95%。该研究已在 Zenodo 上发表(doi:10.5281/zenodo.18721380)。
cs.CV / 58 / 2602.18906

Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates

边缘化束调整:基于单目深度估计的多视角相机位姿
Zhu, Shengjie, Abdelkader, Ahmed, Matthews, Mark J., Liu, Xiaoming, Chu, Wen-Sheng
Abstract
Structure-from-Motion (SfM) is a fundamental 3D vision task for recovering camera parameters and scene geometry from multi-view images. While recent deep learning advances enable accurate Monocular Depth Estimation (MDE) from single images without depending on camera motion, integrating MDE into SfM remains a challenge. Unlike conventional triangulated sparse point clouds, MDE produces dense depth maps with significantly higher error variance. Inspired by modern RANSAC estimators, we propose Marginalized Bundle Adjustment (MBA) to mitigate MDE error variance leveraging its density. With MBA, we show that MDE depth maps are sufficiently accurate to yield SoTA or competitive results in SfM and camera relocalization tasks. Through extensive evaluations, we demonstrate consistently robust performance across varying scales, ranging from few-frame setups to large multi-view systems with thousands of images. Our method highlights the significant potential of MDE in multi-view 3D vision.
Chinese Translation
运动恢复结构(Structure-from-Motion, SfM)是一个基本的三维视觉任务,旨在从多视角图像中恢复相机参数和场景几何。尽管最近的深度学习进展使得能够从单幅图像中准确地进行单目深度估计(Monocular Depth Estimation, MDE),而不依赖于相机运动,但将MDE整合到SfM中仍然是一个挑战。与传统的三角化稀疏点云不同,MDE生成的深度图具有显著更高的误差方差。受到现代RANSAC估计器的启发,我们提出了边缘化束调整(Marginalized Bundle Adjustment, MBA)方法,以利用MDE的密度来减轻误差方差。通过MBA,我们展示了MDE深度图足够准确,可以在SfM和相机重定位任务中取得最先进(SoTA)或具有竞争力的结果。通过广泛的评估,我们证明了该方法在不同规模下的一致性强健性能,从少帧设置到包含数千张图像的大型多视角系统。我们的方法突显了MDE在多视角三维视觉中的巨大潜力。
cs.CV / 59 / 2602.18936

CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion

CRAFT-LoRA:通过秩约束适应和无训练融合实现内容-风格个性化
Li, Yu, Cai, Yujun, Zhang, Chi
Abstract
Personalized image generation requires effectively balancing content fidelity with stylistic consistency when synthesizing images based on text and reference examples. Low-Rank Adaptation (LoRA) offers an efficient personalization approach, with potential for precise control through combining LoRA weights on different concepts. However, existing combination techniques face persistent challenges: entanglement between content and style representations, insufficient guidance for controlling elements' influence, and unstable weight fusion that often require additional training. We address these limitations through CRAFT-LoRA, with complementary components: (1) rank-constrained backbone fine-tuning that injects low-rank projection residuals to encourage learning decoupled content and style subspaces; (2) a prompt-guided approach featuring an expert encoder with specialized branches that enables semantic extension and precise control through selective adapter aggregation; and (3) a training-free, timestep-dependent classifier-free guidance scheme that enhances generation stability by strategically adjusting noise predictions across diffusion steps. Our method significantly improves content-style disentanglement, enables flexible semantic control over LoRA module combinations, and achieves high-fidelity generation without additional retraining overhead.
Chinese Translation
个性化图像生成需要在基于文本和参考示例合成图像时有效平衡内容保真度与风格一致性。低秩适应(LoRA)提供了一种高效的个性化方法,能够通过结合不同概念的LoRA权重实现精确控制。然而,现有的组合技术面临持续的挑战:内容与风格表示之间的纠缠、对元素影响控制的指导不足,以及不稳定的权重融合,通常需要额外的训练。我们通过CRAFT-LoRA解决这些局限性,提出了互补组件:(1)秩约束的主干微调,注入低秩投影残差以促进学习解耦的内容和风格子空间;(2)一种基于提示的方式,采用具有专门分支的专家编码器,通过选择性适配器聚合实现语义扩展和精确控制;(3)一种无训练、时间步依赖的无分类器引导方案,通过在扩散步骤中战略性调整噪声预测来增强生成稳定性。我们的方法显著改善了内容-风格的解耦,实现了对LoRA模块组合的灵活语义控制,并在没有额外重训练开销的情况下实现了高保真生成。
cs.CV / 60 / 2602.18941

Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation

全局指挥官与局部执行者:一种用于场景导航的双代理框架
Jin, Kaiming, Wu, Yuefan, Wu, Shengqiong, Li, Bobo, Yan, Shuicheng, Chua, Tat-Seng
Abstract
Vision-and-Language Scene navigation is a fundamental capability for embodied human-AI collaboration, requiring agents to follow natural language instructions to execute coherent action sequences in complex environments. Existing approaches either rely on multiple agents, incurring high coordination and resource costs, or adopt a single-agent paradigm, which overloads the agent with both global planning and local perception, often leading to degraded reasoning and instruction drift in long-horizon settings. To address these issues, we introduce DACo, a planning-grounding decoupled architecture that disentangles global deliberation from local grounding. Concretely, it employs a Global Commander for high-level strategic planning and a Local Operative for egocentric observing and fine-grained execution. By disentangling global reasoning from local action, DACo alleviates cognitive overload and improves long-horizon stability. The framework further integrates dynamic subgoal planning and adaptive replanning to enable structured and resilient navigation. Extensive evaluations on R2R, REVERIE, and R4R demonstrate that DACo achieves 4.9%, 6.5%, 5.4% absolute improvements over the best-performing baselines in zero-shot settings, and generalizes effectively across both closed-source (e.g., GPT-4o) and open-source (e.g., Qwen-VL Series) backbones. DACo provides a principled and extensible paradigm for robust long-horizon navigation. Project page: https://github.com/ChocoWu/DACo
Chinese Translation
视觉与语言场景导航是具身人机协作的基本能力,要求代理遵循自然语言指令在复杂环境中执行连贯的行动序列。现有方法要么依赖多个代理,导致高协调和资源成本,要么采用单代理范式,使代理同时承担全局规划和局部感知的重任,在长时程设置中常常导致推理能力下降和指令漂移。为了解决这些问题,我们提出了DACo,一种规划-落地解耦架构,将全局推理与局部落地相分离。具体而言,它使用全局指挥官进行高层战略规划,并使用局部执行者进行自我中心的观察和细粒度执行。通过将全局推理与局部行动解耦,DACo 减轻了认知负担,提高了长时程稳定性。该框架进一步整合了动态子目标规划和自适应重规划,以实现结构化且有韧性的导航。在 R2R、REVERIE 和 R4R 上的广泛评估表明,DACo 在零样本设置中相较于表现最佳的基线实现了 4.9%、6.5% 和 5.4% 的绝对提升,并且在闭源(如 GPT-4o)和开源(如 Qwen-VL 系列)骨干模型上均能有效泛化。DACo 为稳健的长时程导航提供了一个有原则且可扩展的范式。项目页面:https://github.com/ChocoWu/DACo
cs.CV / 61 / 2602.18959

YOLOv10-Based Multi-Task Framework for Hand Localization and Laterality Classification in Surgical Videos

基于YOLOv10的手部定位与侧别分类多任务框架在外科视频中的应用
Sun, Kedi, Zhang, Le
Abstract
Real-time hand tracking in trauma surgery is essential for supporting rapid and precise intraoperative decisions. We propose a YOLOv10-based framework that simultaneously localizes hands and classifies their laterality (left or right) in complex surgical scenes. The model is trained on the Trauma THOMPSON Challenge 2025 Task 2 dataset, consisting of first-person surgical videos with annotated hand bounding boxes. Extensive data augmentation and a multi-task detection design improve robustness against motion blur, lighting variations, and diverse hand appearances. Evaluation demonstrates accurate left-hand (67%) and right-hand (71%) classification, while distinguishing hands from the background remains challenging. The model achieves an $mAP_{[0.5:0.95]}$ of 0.33 and maintains real-time inference, highlighting its potential for intraoperative deployment. This work establishes a foundation for advanced hand-instrument interaction analysis in emergency surgical procedures.
Chinese Translation
在创伤外科中,实时手部追踪对于支持快速而精确的术中决策至关重要。我们提出了一种基于YOLOv10的框架,该框架能够在复杂的外科场景中同时定位手部并分类其侧别(左手或右手)。该模型在创伤THOMPSON挑战赛2025任务2数据集上进行训练,该数据集由带有手部边界框注释的第一人称外科视频组成。通过广泛的数据增强和多任务检测设计,提高了模型对运动模糊、光照变化和多样化手部外观的鲁棒性。评估结果显示,左手(67%)和右手(71%)的分类准确率较高,而将手部与背景区分仍然具有挑战性。该模型在$mAP_{[0.5:0.95]}$上达到了0.33,并保持实时推理,突显了其在术中部署的潜力。本研究为紧急外科手术中先进的手部-器械交互分析奠定了基础。
cs.CV / 62 / 2602.18961

Depth-Enhanced YOLO-SAM2 Detection for Reliable Ballast Insufficiency Identification

深度增强YOLO-SAM2检测用于可靠的道砟不足识别
Liu, Shiyu, Lester, Dylan, Narman, Husnu, Alzarrad, Ammar, Zhu, Pingping
Abstract
This paper presents a depth-enhanced YOLO-SAM2 framework for detecting ballast insufficiency in railway tracks using RGB-D data. Although YOLOv8 provides reliable localization, the RGB-only model shows limited safety performance, achieving high precision (0.99) but low recall (0.49) for the insufficient-ballast class, as it tends to over-predict the sufficient class. To improve reliability, we incorporate depth-based geometric analysis enabled by a sleeper-aligned depth-correction pipeline that compensates for RealSense spatial distortion using polynomial modeling, RANSAC, and temporal smoothing. SAM2 segmentation further refines region-of-interest masks, enabling accurate extraction of sleeper and ballast profiles for geometric classification. Experiments on field-collected top-down RGB-D data show that depth-enhanced configurations substantially improve the detection of insufficient ballast. Depending on bounding-box sampling (AABB or RBB) and geometric criteria, recall increases from 0.49 to as high as 0.80, and F1-score improves from 0.66 to over 0.80. These results demonstrate that integrating depth correction with YOLO-SAM2 yields a more robust and reliable approach for automated railway ballast inspection, particularly in visually ambiguous or safety-critical scenarios.
Chinese Translation
本文提出了一种深度增强的YOLO-SAM2框架,利用RGB-D数据检测铁路轨道中的道砟不足。尽管YOLOv8提供了可靠的定位,但仅使用RGB的模型在安全性能上表现有限:在道砟不足类别上精度高达0.99,召回率却仅为0.49,因为该模型倾向于过度预测道砟充足类别。为了提高可靠性,我们结合了基于深度的几何分析,通过一个与轨枕对齐的深度校正管道来补偿RealSense空间失真,该管道使用多项式建模、RANSAC和时间平滑技术。SAM2分割进一步细化了感兴趣区域的掩膜,使得能够准确提取轨枕和道砟的轮廓以进行几何分类。在现场收集的自上而下的RGB-D数据上的实验表明,深度增强配置显著提高了道砟不足的检测。根据边界框采样(AABB或RBB)和几何标准,召回率从0.49提高到高达0.80,F1-score从0.66提高到超过0.80。这些结果表明,将深度校正与YOLO-SAM2结合,提供了一种更强大和可靠的自动化铁路道砟检查方法,特别是在视觉模糊或安全关键的场景中。
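The polynomial-plus-RANSAC depth correction this abstract mentions is a standard robust-fitting pattern; a minimal numpy sketch follows. The polynomial order, inlier threshold, and the re-centering step are assumptions for illustration, not the paper's sleeper-aligned pipeline.

```python
import numpy as np

def ransac_polyfit(x, depth, order=2, iters=200, thresh=0.01, rng=None):
    """RANSAC fit of a polynomial depth-bias model: depth ~ poly(x).

    x: pixel coordinate along the sleeper; depth: measured RealSense depth.
    Returns the fitted coefficients and a trend-flattened depth profile.
    """
    rng = rng or np.random.default_rng(0)
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(x), size=order + 1, replace=False)
        coef = np.polyfit(x[idx], depth[idx], order)          # minimal-sample fit
        resid = np.abs(np.polyval(coef, x) - depth)
        inliers = resid < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    final = np.polyfit(x[best_inliers], depth[best_inliers], order)  # refit on inliers
    # subtract the fitted spatial-distortion trend, keeping the median level
    corrected = depth - (np.polyval(final, x) - np.median(depth[best_inliers]))
    return final, corrected
```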
cs.CV / 63 / 2602.18965

Face Presentation Attack Detection via Content-Adaptive Spatial Operators

基于内容自适应空间算子的面部呈现攻击检测
Khan, Shujaat
Abstract
Face presentation attack detection (FacePAD) is critical for securing facial authentication against print, replay, and mask-based spoofing. This paper proposes CASO-PAD, an RGB-only, single-frame model that enhances MobileNetV3 with content-adaptive spatial operators (involution) to better capture localized spoof cues. Unlike spatially shared convolution kernels, the proposed operator generates location-specific, channel-shared kernels conditioned on the input, improving spatial selectivity with minimal overhead. CASO-PAD remains lightweight (3.6M parameters; 0.64 GFLOPs at $256\times256$) and is trained end-to-end using a standard binary cross-entropy objective. Extensive experiments on Replay-Attack, Replay-Mobile, ROSE-Youtu, and OULU-NPU demonstrate strong performance, achieving 100/100/98.9/99.7% test accuracy, AUC of 1.00/1.00/0.9995/0.9999, and HTER of 0.00/0.00/0.82/0.44%, respectively. On the large-scale SiW-Mv2 Protocol-1 benchmark, CASO-PAD further attains 95.45% accuracy with 3.11% HTER and 3.13% EER, indicating improved robustness under diverse real-world attacks. Ablation studies show that placing the adaptive operator near the network head and using moderate group sharing yields the best accuracy-efficiency balance. Overall, CASO-PAD provides a practical pathway for robust, on-device FacePAD with mobile-class compute and without auxiliary sensors or temporal stacks.
Chinese Translation
面部呈现攻击检测(FacePAD)对于保护面部认证免受打印、重播和基于面具的欺骗至关重要。本文提出了CASO-PAD,这是一种仅使用RGB的单帧模型,通过内容自适应空间算子(involution)增强MobileNetV3,以更好地捕捉局部欺骗线索。与空间共享卷积核不同,所提议的算子生成基于输入条件的特定位置、通道共享的卷积核,从而在最小开销下提高空间选择性。CASO-PAD保持轻量级(3.6M参数;在$256\times256$时为0.64 GFLOPs),并使用标准的二元交叉熵目标进行端到端训练。在Replay-Attack、Replay-Mobile、ROSE-Youtu和OULU-NPU上的大量实验表明其强大的性能,分别达到了100/100/98.9/99.7%的测试准确率,AUC为1.00/1.00/0.9995/0.9999,以及HTER为0.00/0.00/0.82/0.44%。在大规模SiW-Mv2 Protocol-1基准测试中,CASO-PAD进一步达到了95.45%的准确率,3.11%的HTER和3.13%的EER,表明在多样化的现实世界攻击下具有更强的鲁棒性。消融研究表明,将自适应算子放置在网络头部附近并使用适度的组共享能够实现最佳的准确性与效率平衡。总体而言,CASO-PAD为在移动设备上实现强健的FacePAD提供了一条实用路径,无需辅助传感器或时间堆叠。
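The involution operator itself is published (Li et al., CVPR 2021) and compact enough to sketch in full. The PyTorch layer below generates one kernel per spatial location, shared across channel groups, exactly as the abstract describes; the reduction ratio, kernel size, and where it sits inside MobileNetV3 are assumptions.

```python
import torch
import torch.nn as nn

class Involution2d(nn.Module):
    """Content-adaptive spatial operator: location-specific, channel-shared kernels."""

    def __init__(self, channels, kernel_size=7, groups=4, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, groups * kernel_size ** 2, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # generate one k*k kernel per location, shared across c//g channels
        kernels = self.span(torch.relu(self.reduce(x)))           # (b, g*k*k, h, w)
        kernels = kernels.view(b, self.g, 1, self.k ** 2, h, w)
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k ** 2, h, w)
        return (kernels * patches).sum(dim=3).view(b, c, h, w)    # per-pixel weighted sum
```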
cs.CV / 64 / 2602.18977

Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding

Frame2Freq:用于细粒度视频理解的频谱适配器
Ponbagavathi, Thinesh Thiyakesan, Seibold, Constantin, Roitberg, Alina
Abstract
Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very fast flicker changes, while overlooking medium-speed motion. Capturing dynamics across multiple time-scales is, however, crucial for fine-grained temporal analysis (i.e., opening vs. closing bottle). To address this, we introduce Frame2Freq -- a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs), improving fine-grained action recognition. Frame2Freq uses Fast Fourier Transform (FFT) along time and learns frequency-band specific embeddings that adaptively highlight the most discriminative frequency ranges. Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior PEFT methods and even surpasses fully fine-tuned models on four of them. These results provide encouraging evidence that frequency analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer. Code is available at https://github.com/th-nesh/Frame2Freq.
Chinese Translation
将图像预训练的主干网络适配到视频通常依赖于针对单一时间尺度调优的时域适配器。我们的实验表明,这些模块能够捕捉静态图像线索和非常快速的闪烁变化,但忽略了中速运动。然而,捕捉多个时间尺度的动态对于细粒度的时间分析(例如,打开与关闭瓶子)至关重要。为了解决这个问题,我们引入了Frame2Freq——一系列频率感知适配器,在预训练的视觉基础模型(Vision Foundation Models, VFM)进行图像到视频的适配时执行频谱编码,从而改善细粒度动作识别。Frame2Freq使用快速傅里叶变换(Fast Fourier Transform, FFT)沿时间轴进行处理,并学习特定于频带的嵌入,能够自适应地突出最具区分性的频率范围。在五个细粒度活动识别数据集上,Frame2Freq的表现超越了之前的PEFT方法,甚至在其中四个数据集上超过了完全微调的模型。这些结果提供了令人鼓舞的证据,表明频率分析方法是建模图像到视频转移中时间动态的强大工具。代码可在 https://github.com/th-nesh/Frame2Freq 获取。
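A minimal frequency-aware adapter in the spirit of Frame2Freq can be written as: project tokens down, FFT along time, reweight frequency bins with learnable gains, inverse FFT, project up, add residually. The bottleneck width, per-bin (rather than banded) gains, and placement are our simplifications of the paper's frequency-band embeddings.

```python
import torch
import torch.nn as nn

class SpectralAdapter(nn.Module):
    """Sketch of a frequency-aware adapter: FFT along time + learnable gains."""

    def __init__(self, dim, n_frames, bottleneck=64):
        super().__init__()
        n_freq = n_frames // 2 + 1                       # rfft bins for n_frames
        self.band_gain = nn.Parameter(torch.ones(n_freq))
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):                                # x: (B, T, N, dim) tokens
        h = self.down(x)
        spec = torch.fft.rfft(h, dim=1)                  # FFT along the time axis
        spec = spec * self.band_gain.view(1, -1, 1, 1)   # emphasize useful frequencies
        h = torch.fft.irfft(spec, n=x.shape[1], dim=1)
        return x + self.up(h)                            # residual adapter output
```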
cs.CV / 65 / 2602.18990

IDSelect: A RL-Based Cost-Aware Selection Agent for Video-based Multi-Modal Person Recognition

IDSelect:一种基于强化学习的成本感知视频多模态人物识别选择代理
Ji, Yuyang, Shen, Yixuan, Nguyen, Kien, Zhou, Lifeng, Liu, Feng
Abstract
Video-based person recognition achieves robust identification by integrating face, body, and gait. However, current systems waste computational resources by processing all modalities with fixed heavyweight ensembles regardless of input complexity. To address these limitations, we propose IDSelect, a reinforcement learning-based cost-aware selector that chooses one pre-trained model per modality per-sequence to optimize the accuracy-efficiency trade-off. Our key insight is that an input-conditioned selector can discover complementary model choices that surpass fixed ensembles while using substantially fewer resources. IDSelect trains a lightweight agent end-to-end using actor-critic reinforcement learning with budget-aware optimization. The reward balances recognition accuracy with computational cost, while entropy regularization prevents premature convergence. At inference, the policy selects the most probable model per modality and fuses modality-specific similarities for the final score. Extensive experiments on challenging video-based datasets demonstrate IDSelect's superior efficiency: on CCVID, it achieves 95.9% Rank-1 accuracy with 92.4% less computation than strong baselines while improving accuracy by 1.8%; on MEVID, it reduces computation by 41.3% while maintaining competitive performance.
Chinese Translation
基于视频的人物识别通过整合面部、身体和步态实现了稳健的身份识别。然而,当前系统不考虑输入的复杂性,一律使用固定的重型集成处理所有模态,浪费了计算资源。为了解决这些限制,我们提出了IDSelect,一种基于强化学习的成本感知选择器,旨在为每个序列的每个模态选择一个预训练模型,以优化准确性与效率之间的权衡。我们的关键见解是,输入条件选择器能够发现互补的模型选择,超越固定集成,同时使用显著更少的资源。IDSelect通过预算感知优化,使用演员-评论家强化学习端到端训练一个轻量级代理。奖励在识别准确性与计算成本之间取得平衡,而熵正则化则防止了过早收敛。在推理阶段,策略为每个模态选择最可能的模型,并融合模态特定的相似性以获得最终得分。在具有挑战性的视频数据集上进行的大量实验表明,IDSelect在效率上具有显著优势:在CCVID上,它以95.9%的Rank-1准确率实现了比强基线低92.4%的计算量,同时提高了1.8%的准确率;在MEVID上,它在保持竞争性能的同时减少了41.3%的计算量。
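The cost-aware reward that drives such a selector has a simple shape: an accuracy term minus a weighted, budget-normalized compute term. The sketch below is a plausible instantiation; the trade-off weight `lam` and the normalization are assumptions, and the paper additionally uses an entropy bonus not shown here.

```python
def cost_aware_reward(correct: bool, flops: float, budget: float, lam: float = 0.5) -> float:
    """Reward balancing recognition accuracy against compute cost.

    correct: whether the fused prediction matched the ground-truth identity.
    flops / budget: cost of the chosen per-modality models vs. the target budget.
    """
    acc_term = 1.0 if correct else 0.0
    cost_term = min(flops / budget, 1.0)   # normalized cost, capped at 1
    return acc_term - lam * cost_term
```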
cs.CV / 66 / 2602.18993

SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models

SeaCache:加速扩散模型的谱演化感知缓存
Chung, Jiwoo, Hyun, Sangeek, Lee, MinKyu, Han, Byeongju, Cha, Geonho, Wee, Dongyoon, Hong, Youngjun, Heo, Jae-Pil
Abstract
Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure appears early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Employing SEA-filtered input features to estimate redundancy leads to dynamic schedules that adapt to content while respecting the spectral priors underlying the diffusion model. Extensive experiments on diverse visual generative models and the baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.
Chinese Translation
扩散模型是视觉生成的强大基础,但其固有的顺序去噪过程导致推理速度缓慢。以往的方法通过基于相邻时间步之间的特征距离缓存和重用中间输出以加速采样。然而,现有的缓存策略通常依赖于原始特征差异,这会将内容和噪声纠缠在一起。这种设计忽视了谱演化,其中低频结构早期出现,而高频细节则在后期得到细化。我们提出了谱演化感知缓存(Spectral-Evolution-Aware Cache,SeaCache),这是一种无训练的缓存调度,其重用决策基于谱对齐表示。通过理论和实证分析,我们推导出一种谱演化感知(Spectral-Evolution-Aware,SEA)滤波器,该滤波器在抑制噪声的同时保留与内容相关的组件。采用SEA滤波后的输入特征来估计冗余,从而得到既能适应内容、又遵循扩散模型谱先验的动态调度。在多种视觉生成模型及其基线上的广泛实验表明,SeaCache实现了最先进的延迟-质量权衡。
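The core cache decision reduces to a thresholded relative-change test on filtered features. A sketch under stated assumptions: `sea_filter` is a placeholder callable standing in for the paper's SEA filter, and the threshold `tau` is invented for illustration.

```python
import torch

def should_reuse(prev_feat: torch.Tensor, curr_feat: torch.Tensor,
                 sea_filter, tau: float = 0.05) -> bool:
    """Decide whether to reuse the cached block output at this denoising step.

    Features are passed through a spectral filter before the relative-distance
    test, so noise does not inflate the change estimate.
    """
    a, b = sea_filter(prev_feat), sea_filter(curr_feat)
    rel_change = (a - b).norm() / (a.norm() + 1e-8)
    return bool(rel_change < tau)   # small spectral change -> safe to reuse
```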
cs.CV / 67 / 2602.18996

Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

通过循环一致性掩码预测学习跨视角物体对应关系
Yan, Shannan, Zheng, Leqi, Lv, Keyu, Ni, Jingchen, Wei, Hongyang, Zhang, Jiajun, Wang, Guangting, Lyu, Jing, Yuan, Chun, Rao, Fengyun
Abstract
We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.
Chinese Translation
我们研究了在视频中建立不同视角下的物体级视觉对应关系的任务,重点关注具有挑战性的自我中心到外部中心以及外部中心到自我中心的场景。我们提出了一种基于条件二元分割的简单而有效的框架,其中物体查询掩码被编码为潜在表示,以指导在目标视频中对应物体的定位。为了鼓励稳健的视角不变表示,我们引入了循环一致性训练目标:在目标视角中预测的掩码被投影回源视角,以重建原始查询掩码。这种双向约束提供了强大的自监督信号,无需真实标注,并且在推理时能够进行测试时训练(TTT)。在Ego-Exo4D和HANDAL-X基准上的实验验证了我们优化目标和TTT策略的有效性,达到了最先进的性能。代码可在 https://github.com/shannany0606/CCMP 获取。
cs.CV / 68 / 2602.19001

A Benchmark and Knowledge-Grounded Framework for Advanced Multimodal Personalization Study

先进多模态个性化研究的基准与知识驱动框架
Hu, Xia, Zhuang, Honglei, Potetz, Brian, Fathi, Alireza, Hu, Bo, Samari, Babak, Zhou, Howard
Abstract
The powerful reasoning of modern Vision Language Models open a new frontier for advanced personalization study. However, progress in this area is critically hampered by the lack of suitable benchmarks. To address this gap, we introduce Life-Bench, a comprehensive, synthetically generated multimodal benchmark built on simulated user digital footprints. Life-Bench features over questions evaluating a wide spectrum of capabilities, from persona understanding to complex reasoning over historical data. These capabilities expand far beyond prior benchmarks, reflecting the critical demands essential for real-world applications. Furthermore, we propose LifeGraph, an end-to-end framework that organizes personal context into a knowledge graph to facilitate structured retrieval and reasoning. Our experiments on Life-Bench reveal that existing methods falter significantly on complex personalized tasks, exposing a large performance headroom, especially in relational, temporal and aggregative reasoning. While LifeGraph closes this gap by leveraging structured knowledge and demonstrates a promising direction, these advanced personalization tasks remain a critical open challenge, motivating new research in this area.
Chinese Translation
现代视觉语言模型的强大推理能力为先进个性化研究开辟了新的前沿。然而,该领域的进展受到缺乏合适基准的严重制约。为了解决这一问题,我们引入了Life-Bench,这是一个基于模拟用户数字足迹构建的综合性合成多模态基准。Life-Bench包含超过的问题,评估从个性理解到对历史数据的复杂推理等广泛能力。这些能力远远超出了以往基准,反映了现实应用中至关重要的需求。此外,我们提出了LifeGraph,这是一个端到端框架,将个人上下文组织成知识图谱,以促进结构化检索和推理。我们在Life-Bench上的实验表明,现有方法在复杂个性化任务上表现显著不足,暴露出巨大的性能提升空间,尤其是在关系、时间和聚合推理方面。虽然LifeGraph通过利用结构化知识来缩小这一差距,并展示了一个有前景的方向,但这些先进的个性化任务仍然是一个关键的开放挑战,激励着该领域的新研究。
cs.CV / 69 / 2602.19004

MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment

MoBind:用于细粒度IMU-视频姿态对齐的运动绑定
Nguyen, Duc Duy, Chin, Tat-Jun, Hoai, Minh
Abstract
We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRi, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at https://github.com/bbvisual/MoBind.
Chinese Translation
我们的目标是学习惯性测量单元(IMU)信号与从视频中提取的2D姿态序列之间的联合表示,以实现准确的跨模态检索、时间同步、主体和身体部位定位以及动作识别。为此,我们提出了MoBind,一个分层对比学习框架,旨在解决三个挑战:(1)过滤掉无关的视觉背景,(2)建模结构化的多传感器IMU配置,以及(3)实现细粒度的亚秒级时间对齐。为了隔离与运动相关的线索,MoBind将IMU信号与骨骼运动序列对齐,而不是原始像素。我们进一步将全身运动分解为局部身体部位轨迹,将每个轨迹与其对应的IMU配对,以实现语义基础的多传感器对齐。为了捕捉详细的时间对应关系,MoBind采用分层对比策略,首先对齐标记级时间段,然后将局部(身体部位)对齐与全局(全身)运动聚合融合。在mRi、TotalCapture和EgoHumans上的评估表明,MoBind在所有四个任务中始终优于强基线,展示了稳健的细粒度时间对齐,同时保持跨模态的粗略语义一致性。代码可在https://github.com/bbvisual/MoBind获取。
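The basic contrastive unit that a hierarchy like MoBind's repeats at the token, body-part, and whole-body levels is a symmetric InfoNCE over matched IMU/pose pairs. A minimal sketch, assuming in-batch negatives and a fixed temperature (both standard choices, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def info_nce(imu_emb: torch.Tensor, pose_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched IMU / pose segment embeddings.

    imu_emb, pose_emb: (B, d), row i of each side is a positive pair;
    all other rows in the batch act as negatives.
    """
    imu = F.normalize(imu_emb, dim=-1)
    pose = F.normalize(pose_emb, dim=-1)
    logits = imu @ pose.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```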
cs.CV / 70 / 2602.19005

GUIDE-US: Grade-Informed Unpaired Distillation of Encoder Knowledge from Histopathology to Micro-UltraSound

GUIDE-US:基于等级信息的非配对知识蒸馏,从组织病理学到微超声
Willis, Emma, Elghareb, Tarek, Wilson, Paul F. R., To, Minh Nguyen Nhat, Abootorabi, Mohammad Mahdi, Jamzad, Amoon, Wodlinger, Brian, Mousavi, Parvin, Abolmaesumi, Purang
Abstract
Purpose: Non-invasive grading of prostate cancer (PCa) from micro-ultrasound (micro-US) could expedite triage and guide biopsies toward the most aggressive regions, yet current models struggle to infer tissue micro-structure at coarse imaging resolutions. Methods: We introduce an unpaired histopathology knowledge-distillation strategy that trains a micro-US encoder to emulate the embedding distribution of a pretrained histopathology foundation model, conditioned on International Society of Urological Pathology (ISUP) grades. Training requires no patient-level pairing or image registration, and histopathology inputs are not used at inference. Results: Compared to the current state of the art, our approach increases sensitivity to clinically significant PCa (csPCa) at 60% specificity by 3.5% and improves overall sensitivity at 60% specificity by 1.2%. Conclusion: By enabling earlier and more dependable cancer risk stratification solely from imaging, our method advances clinical feasibility. Source code will be publicly released upon publication.
Chinese Translation
目的:通过微超声(micro-US)对前列腺癌(PCa)进行非侵入性分级,可以加快分诊并指导活检至最具侵袭性的区域,但当前模型在粗分辨率成像下难以推断组织微结构。方法:我们提出了一种非配对的组织病理学知识蒸馏策略,训练微超声编码器以模拟预训练的组织病理学基础模型的嵌入分布,条件为国际泌尿病理学会(ISUP)分级。训练不需要患者级别的配对或图像配准,推断时不使用组织病理学输入。结果:与当前最先进的方法相比,我们的方法在60%特异性下对临床显著前列腺癌(csPCa)的敏感性提高了3.5%,并在60%特异性下整体敏感性提高了1.2%。结论:通过仅依靠成像实现更早和更可靠的癌症风险分层,我们的方法推动了临床可行性。源代码将在发表后公开发布。
cs.CV / 71 / 2602.19019

TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery

TokenTrace:通过水印令牌恢复实现多概念归因
Zhang, Li, Agarwal, Shruti, Collomosse, John, Xie, Pengtao, Asnani, Vishal
Abstract
Generative AI models pose a significant challenge to intellectual property (IP), as they can replicate unique artistic styles and concepts without attribution. While watermarking offers a potential solution, existing methods often fail in complex scenarios where multiple concepts (e.g., an object and an artistic style) are composed within a single image. These methods struggle to disentangle and attribute each concept individually. In this work, we introduce TokenTrace, a novel proactive watermarking framework for robust, multi-concept attribution. Our method embeds secret signatures into the semantic domain by simultaneously perturbing the text prompt embedding and the initial latent noise that guide the diffusion model's generation process. For retrieval, we propose a query-based TokenTrace module that takes the generated image and a textual query specifying which concepts need to be retrieved (e.g., a specific object or style) as inputs. This query-based mechanism allows the module to disentangle and independently verify the presence of multiple concepts from a single generated image. Extensive experiments show that our method achieves state-of-the-art performance on both single-concept (object and style) and multi-concept attribution tasks, significantly outperforming existing baselines while maintaining high visual quality and robustness to common transformations.
Chinese Translation
生成性人工智能模型对知识产权(IP)构成了重大挑战,因为它们可以在没有归属的情况下复制独特的艺术风格和概念。尽管水印技术提供了一种潜在的解决方案,但现有方法在多个概念(例如,物体和艺术风格)组合在单一图像中的复杂场景中往往失败。这些方法难以单独解开和归因每个概念。在本研究中,我们提出了TokenTrace,一种新颖的主动水印框架,旨在实现稳健的多概念归因。我们的方法通过同时扰动文本提示嵌入和引导扩散模型生成过程的初始潜在噪声,将秘密签名嵌入语义域。为了检索,我们提出了一种基于查询的TokenTrace模块,该模块以生成的图像和指定需要检索的概念(例如,特定物体或风格)的文本查询作为输入。该基于查询的机制使模块能够解开并独立验证单个生成图像中多个概念的存在。大量实验表明,我们的方法在单一概念(物体和风格)和多概念归因任务上均实现了最先进的性能,显著优于现有基线,同时保持高视觉质量和对常见变换的鲁棒性。
cs.CV / 72 / 2602.19022

An interpretable framework using foundation models for fish sex identification

基于基础模型的鱼类性别识别可解释框架
Miao, Zheng, Hung, Tien-Chieh
Abstract
Accurate sex identification in fish is vital for optimizing breeding and management strategies in aquaculture, particularly for species at the risk of extinction. However, most existing methods are invasive or stressful and may cause additional mortality, posing severe risks to threatened or endangered fish populations. To address these challenges, we propose FishProtoNet, a robust, non-invasive computer vision-based framework for sex identification of delta smelt (Hypomesus transpacificus), an endangered fish species native to California, across its full life cycle. Unlike the traditional deep learning methods, FishProtoNet provides interpretability through learned prototype representations while improving robustness by leveraging foundation models to reduce the influence of background noise. Specifically, the FishProtoNet framework consists of three key components: fish regions of interest (ROIs) extraction using visual foundation model, feature extraction from fish ROIs and fish sex identification based on an interpretable prototype network. FishProtoNet demonstrates strong performance in delta smelt sex identification during early spawning and post-spawning stages, achieving the accuracies of 74.40% and 81.16% and corresponding F1 scores of 74.27% and 79.43% respectively. In contrast, delta smelt sex identification at the subadult stage remains challenging for current computer vision methods, likely due to less pronounced morphological differences in immature fish. The source code of FishProtoNet is publicly available at: https://github.com/zhengmiao1/Fish_sex_identification
Chinese Translation
鱼类的准确性别识别对于优化水产养殖中的繁殖和管理策略至关重要,尤其是对于濒临灭绝的物种。然而,现有的大多数方法都是侵入性的或造成压力,可能导致额外的死亡率,对受威胁或濒危的鱼类种群构成严重风险。为了解决这些挑战,我们提出了FishProtoNet,这是一个基于计算机视觉的强大非侵入性框架,用于识别加利福尼亚本土的濒危鱼类——三角洲胡瓜鱼(Hypomesus transpacificus)在其整个生命周期中的性别。与传统的深度学习方法不同,FishProtoNet通过学习的原型表示提供可解释性,同时通过利用基础模型来减少背景噪声的影响,从而提高鲁棒性。具体而言,FishProtoNet框架由三个关键组件组成:使用视觉基础模型提取鱼类感兴趣区域(ROIs)、从鱼类ROIs中提取特征,以及基于可解释原型网络进行鱼类性别识别。FishProtoNet在三角洲胡瓜鱼的早期产卵和产卵后阶段的性别识别中表现出强大的性能,分别达到了74.40%和81.16%的准确率,以及相应的F1分数74.27%和79.43%。相比之下,三角洲胡瓜鱼在亚成鱼阶段的性别识别对当前的计算机视觉方法仍然具有挑战性,这可能是由于未成熟鱼类的形态差异不明显。FishProtoNet的源代码已公开,地址为:https://github.com/zhengmiao1/Fish_sex_identification
cs.CV / 73 / 2602.19024

Towards Calibrating Prompt Tuning of Vision-Language Models

朝着校准视觉-语言模型的提示调优
Sharifdeen, Ashshak, Shamshad, Fahad, Munir, Muhammad Akhtar, Basu, Abhishek, Ismithdeen, Mohamed Insaf, Jeyamohan, Jeyapriyan, Silva, Chathurika Sewwandi, Nandakumar, Karthik, Khan, Muhammad Haris
Abstract
Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. Our approach extends the standard cross-entropy loss with two complementary regularizers: (1) a mean-variance margin penalty that stabilizes inter-class logit margins by maximizing their average while minimizing dispersion, mitigating underconfidence and overconfidence spikes; and (2) a text moment-matching loss that aligns the first and second moments of tuned text embeddings with their frozen CLIP counterparts, preserving semantic dispersion crucial for generalization. Through extensive experiments across 7 prompt-tuning methods and 11 diverse datasets, we demonstrate that our approach significantly reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes
Chinese Translation
大规模视觉-语言模型(如 CLIP)的提示调优能够在不更新模型权重的情况下实现高效的任务适应。然而,这往往导致置信度校准不佳和预测不确定性不可靠。我们通过提出一个校准框架来解决这个问题,该框架在增强预测可靠性的同时,保持预训练 CLIP 嵌入空间的几何结构,这是实现稳健泛化所必需的。我们的方法通过两个互补的正则化器扩展了标准的交叉熵损失:(1) 一个均值-方差边际惩罚,通过最大化类间 logit 边际的平均值并最小化其离散度来稳定边际,从而缓解置信不足与过度自信的突增;(2) 一个文本矩匹配损失,将调优后文本嵌入的一阶矩和二阶矩与其冻结的 CLIP 对应物对齐,保持对泛化至关重要的语义离散性。通过在 7 种提示调优方法和 11 个多样化数据集上进行广泛实验,我们证明了我们的方法在基础类和新类上显著降低了预期校准误差(ECE),相较于竞争性的校准技术具有明显优势。
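Both regularizers in this abstract have natural one-line forms. The sketch below uses the top-1/top-2 gap as the margin and matches per-dimension means and variances of the text embeddings; the exact margin definition and moment weighting in the paper may differ, so treat these as illustrative stand-ins.

```python
import torch

def margin_penalty(logits: torch.Tensor) -> torch.Tensor:
    """Mean-variance margin term: raise the average top-1/top-2 logit margin
    across the batch while shrinking its dispersion (assumed margin definition)."""
    top2 = logits.topk(2, dim=-1).values      # (B, 2)
    margins = top2[:, 0] - top2[:, 1]
    return -margins.mean() + margins.var()

def moment_matching(tuned_txt: torch.Tensor, frozen_txt: torch.Tensor) -> torch.Tensor:
    """Align first and second moments of tuned vs. frozen CLIP text embeddings
    to preserve the pretrained semantic dispersion."""
    mu_gap = (tuned_txt.mean(0) - frozen_txt.mean(0)).pow(2).sum()
    var_gap = (tuned_txt.var(0) - frozen_txt.var(0)).pow(2).sum()
    return mu_gap + var_gap

# total loss (weights a, b are hyperparameters):
# loss = cross_entropy + a * margin_penalty(logits) + b * moment_matching(t, f)
```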
cs.CV / 74 / 2602.19035

OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness

OpenVO:具有时间动态意识的开放世界视觉里程计
Nguyen, Phuc D. A., Nhu, Anh N., Lin, Ming C.
Abstract
We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world-scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded in dashcam. Existing VO methods are trained on fixed observation frequency (e.g., 10Hz or 12Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These significantly limit their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage. To address these challenges, OpenVO (1) explicitly encodes temporal dynamics information within a two-frame pose regression framework and (2) leverages 3D geometric priors derived from foundation models. We validate our method on three major autonomous-driving benchmarks - KITTI, nuScenes, and Argoverse 2 - achieving more than 20% performance improvement over state-of-the-art approaches. Under varying observation rate settings, our method is significantly more robust, achieving 46%-92% lower errors across all metrics. These results demonstrate the versatility of OpenVO for real-world 3D reconstruction and diverse downstream applications.
Chinese Translation
我们介绍了OpenVO,一个新颖的开放世界视觉里程计(VO)框架,能够在有限输入条件下具备时间意识。OpenVO有效地从单目行车记录仪视频中估计真实世界尺度的自我运动,适应不同的观察频率和未校准的相机,从而能够从行车记录仪中记录的稀有驾驶事件构建稳健的轨迹数据集。现有的视觉里程计方法通常在固定的观察频率(例如10Hz或12Hz)上进行训练,完全忽视了时间动态信息。许多先前的方法还需要已知内参的校准相机。因此,当(1)在未见过的观察频率下部署或(2)应用于未校准的相机时,它们的性能会下降。这显著限制了它们在许多下游任务中的泛化能力,例如从行车记录仪视频中提取轨迹。为了解决这些挑战,OpenVO(1)在双帧姿态回归框架中显式编码时间动态信息,并(2)利用从基础模型中派生的三维几何先验。我们在三个主要的自动驾驶基准测试(KITTI、nuScenes和Argoverse 2)上验证了我们的方法,相较于最先进的方法取得了超过20%的性能提升。在不同的观察频率设置下,我们的方法显著更为稳健,所有指标的误差降低了46%-92%。这些结果展示了OpenVO在真实世界三维重建和多样化下游应用中的多功能性。
cs.CV / 75 / 2602.19053

TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation

TeFlow:为自监督前馈场景流估计启用多帧监督
Zhang, Qingwen, Jiang, Chenhan, Zhu, Xiaomeng, Miao, Yunqi, Zhang, Yushan, Andersson, Olov, Jensfelt, Patric
Abstract
Self-supervised feed-forward methods for scene flow estimation offer real-time efficiency, but their supervision from two-frame point correspondences is unreliable and often breaks down under occlusions. Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames, yet naive extensions of two-frame objectives are ineffective because point correspondences vary abruptly across frames, producing inconsistent signals. In this paper, we present TeFlow, which enables multi-frame supervision for feed-forward models by mining temporally consistent supervision. TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames. Extensive evaluations demonstrate that TeFlow establishes a new state-of-the-art for self-supervised feed-forward methods, achieving performance gains of up to 33% on the challenging Argoverse 2 and nuScenes datasets. Our method performs on par with leading optimization-based methods while running 150 times faster. The code is open-sourced at https://github.com/KTH-RPL/OpenSceneFlow along with trained model weights.
Chinese Translation
自监督前馈方法在场景流估计中提供了实时效率,但其基于双帧点对应的监督不可靠,且在遮挡情况下常常失效。多帧监督有潜力通过结合来自过去帧的运动线索提供更稳定的指导,然而,简单扩展双帧目标并不有效,因为点对应在帧间变化剧烈,产生不一致的信号。本文提出了TeFlow,通过挖掘时间一致的监督来启用前馈模型的多帧监督。TeFlow引入了一种时间集成策略,通过聚合来自多个帧的候选池中最具时间一致性的运动线索,形成可靠的监督信号。广泛的评估表明,TeFlow在自监督前馈方法中建立了新的最先进水平,在具有挑战性的Argoverse 2和nuScenes数据集上实现了高达33%的性能提升。我们的方法与领先的基于优化的方法表现相当,但速度提升了150倍。代码已开源,地址为https://github.com/KTH-RPL/OpenSceneFlow,附带训练好的模型权重。
cs.CV / 76 / 2602.19063

Direction-aware 3D Large Multimodal Models

方向感知的三维大型多模态模型
Liu, Quan, Xuan, Weihao, Wang, Junjue, Yokoya, Naoto, Shao, Ling, Lu, Shijian
Abstract
3D large multimodal models (3D LMMs) rely heavily on ego poses for enabling directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, making them inherently ill-posed in 3D large multimodal modelling. In this work, we redefine a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying and supplementing ego poses into point cloud benchmarks and transforming the corresponding point cloud data according to the identified ego poses. We enable direction-aware 3D LMMs with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and visibility check with Z-buffers. The second is PoseAlign that transforms the point cloud data to be aligned with the identified ego poses instead of either injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers. Extensive experiments show that our designs yield consistent improvements across multiple 3D LMM backbones such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%. In addition, our approach is simple, generic, and training-efficient, requiring only instruction tuning while establishing a strong baseline for direction-aware 3D-LMMs.
Chinese Translation
三维大型多模态模型(3D LMMs)在实现方向性问答和空间推理时,严重依赖自我姿态(ego poses)。然而,大多数现有的点云基准包含丰富的方向性查询,但缺乏相应的自我姿态,这使得它们在三维大型多模态建模中本质上是病态的。在本研究中,我们重新定义了一种新的严格范式,通过识别和补充自我姿态到点云基准中,并根据识别出的自我姿态转换相应的点云数据,从而实现方向感知的3D LMMs。我们通过两种新颖的设计使方向感知的3D LMMs成为可能。第一种是PoseRecover,这是一种完全自动的姿态恢复管道,通过对象视锥体交集和Z缓冲区的可见性检查,将问题与来自RGB-D视频外参的自我姿态进行匹配。第二种是PoseAlign,它将点云数据转换为与识别出的自我姿态对齐,而不是将自我姿态注入文本提示中或在投影层中引入姿态编码特征。大量实验表明,我们的设计在多个3D LMM骨干网络(如LL3DA、LL3DA-SONATA、Chat-Scene和3D-LLAVA)上均取得了一致的改进,使ScanRefer的mIoU提高了30.0%,Scan2Cap的LLM作为评判的准确率提高了11.7%。此外,我们的方法简单、通用且训练高效,仅需进行指令调优,同时为方向感知的3D LMMs建立了强大的基线。
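The geometric core of a PoseAlign-style step is a rigid transform of the point cloud into the recovered ego frame, so that directional words in a question map to fixed axes of the input. A minimal numpy sketch, assuming the pose is given as a world-from-ego rotation R and translation t (the convention is our assumption):

```python
import numpy as np

def align_to_ego(points: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Express world-frame scene points in the ego frame.

    points: (N, 3) world-frame point cloud; R: (3, 3); t: (3,).
    If a world point is p = R @ p_ego + t, then p_ego = R.T @ (p - t);
    the row-vector form below applies that to every point at once.
    """
    return (points - t) @ R   # rows become (R.T @ (p - t)).T
```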
cs.CV / 77 / 2602.19064

L3DR: 3D-aware LiDAR Diffusion and Rectification

L3DR:3D感知的激光雷达扩散与校正
Liu, Quan, Zhang, Xiaoqin, Shao, Ling, Lu, Shijian
Abstract
Range-view (RV) based LiDAR diffusion has recently made huge strides towards 2D photo-realism. However, it neglects 3D geometry realism and often generates various RV artifacts such as depth bleeding and wavy surfaces. We design L3DR, a 3D-aware LiDAR Diffusion and Rectification framework that can regress and cancel RV artifacts in 3D space and restore local geometry accurately. Our theoretical and empirical analysis reveals that 3D models are inherently superior to 2D models in generating sharp and authentic boundaries. Leveraging such analysis, we design a 3D residual regression network that rectifies RV artifacts and achieves superb geometry realism by predicting point-level offsets in 3D space. On top of that, we design a Welsch Loss that helps focus on local geometry and ignore anomalous regions effectively. Extensive experiments over multiple benchmarks including KITTI, KITTI360, nuScenes and Waymo show that the proposed L3DR achieves state-of-the-art generation and superior geometry-realism consistently. In addition, L3DR is generally applicable to different LiDAR diffusion models with little computational overhead.
Chinese Translation
基于范围视图(RV)的激光雷达扩散最近在2D照片真实感方面取得了巨大的进展。然而,它忽视了3D几何真实感,常常产生各种RV伪影,如深度渗透和波浪表面。我们设计了L3DR,一个3D感知的激光雷达扩散与校正框架,能够在3D空间中回归和消除RV伪影,并准确恢复局部几何形状。我们的理论和实证分析表明,3D模型在生成清晰和真实的边界方面本质上优于2D模型。基于这种分析,我们设计了一个3D残差回归网络,能够校正RV伪影,并通过预测3D空间中的点级偏移实现卓越的几何真实感。此外,我们设计了一种Welsch Loss,有助于有效聚焦于局部几何并忽略异常区域。在包括KITTI、KITTI360、nuScenes和Waymo在内的多个基准上的广泛实验表明,所提出的L3DR在生成效果和几何真实感方面始终达到最先进水平。此外,L3DR通常适用于不同的激光雷达扩散模型,计算开销较小。
cs.CV / 78 / 2602.19083

ChordEdit: One-Step Low-Energy Transport for Image Editing

ChordEdit:一步低能量图像编辑传输
Lu, Liangsi, Chen, Xuhang, Guo, Minzhe, Li, Shichu, Wang, Jingchao, Shi, Yang
Abstract
The advent of one-step text-to-image (T2I) models offers unprecedented synthesis speed. However, their application to text-guided image editing remains severely hampered, as forcing existing training-free editors into a single inference step fails. This failure manifests as severe object distortion and a critical loss of consistency in non-edited regions, resulting from the high-energy, erratic trajectories produced by naive vector arithmetic on the models' structured fields. To address this problem, we introduce ChordEdit, a model agnostic, training-free, and inversion-free method that facilitates high-fidelity one-step editing. We recast editing as a transport problem between the source and target distributions defined by the source and target text prompts. Leveraging dynamic optimal transport theory, we derive a principled, low-energy control strategy. This strategy yields a smoothed, variance-reduced editing field that is inherently stable, facilitating the field to be traversed in a single, large integration step. A theoretically grounded and experimentally validated approach allows ChordEdit to deliver fast, lightweight and precise edits, finally achieving true real-time editing on these challenging models.
Chinese Translation
一步文本到图像(T2I)模型的出现提供了前所未有的合成速度。然而,它们在文本引导的图像编辑中的应用仍然受到严重阻碍,因为将现有的无训练编辑器强制为单次推理步骤是失败的。这种失败表现为严重的物体扭曲和未编辑区域的一致性显著丧失,原因在于对模型结构场进行简单向量运算所产生的高能量、不稳定轨迹。为了解决这个问题,我们提出了ChordEdit,这是一种模型无关、无训练和无反演的方法,能够实现高保真度的一步编辑。我们将编辑重新表述为源文本提示和目标文本提示定义的源分布和目标分布之间的传输问题。利用动态最优传输理论,我们推导出一种原则性、低能量的控制策略。该策略产生了一种平滑、方差降低的编辑场,具有内在的稳定性,从而使得该场能够在单次大规模积分步骤中被遍历。一个理论基础扎实且经过实验验证的方法使得ChordEdit能够快速、轻量且精确地进行编辑,最终在这些具有挑战性的模型上实现真正的实时编辑。
cs.CV / 79 / 2602.19086

Restoration-Guided Kuzushiji Character Recognition Framework under Seal Interference

印章干扰下的恢复引导式草书字符识别框架
Ju, Rui-Yang, Yamashita, Kohei, Kameko, Hirotaka, Mori, Shinsuke
Abstract
Kuzushiji was one of the most popular writing styles in pre-modern Japan and was widely used in both personal letters and official documents. However, due to its highly cursive forms and extensive glyph variations, most modern Japanese readers cannot directly interpret Kuzushiji characters. Therefore, recent research has focused on developing automated Kuzushiji character recognition methods, which have achieved satisfactory performance on relatively clean Kuzushiji document images. However, existing methods struggle to maintain recognition accuracy under seal interference (e.g., when seals overlap characters), despite the frequent occurrence of seals in pre-modern Japanese documents. To address this challenge, we propose a three-stage restoration-guided Kuzushiji character recognition (RG-KCR) framework specifically designed to mitigate seal interference. We construct datasets for evaluating Kuzushiji character detection (Stage 1) and classification (Stage 3). Experimental results show that the YOLOv12-medium model achieves a precision of 98.0% and a recall of 93.3% on the constructed test set. We quantitatively evaluate the restoration performance of Stage 2 using PSNR and SSIM. In addition, we conduct an ablation study to demonstrate that Stage 2 improves the Top-1 accuracy of Metom, a Vision Transformer (ViT)-based Kuzushiji classifier employed in Stage 3, from 93.45% to 95.33%. The implementation code of this work is available at https://ruiyangju.github.io/RG-KCR.
Chinese Translation
草书(Kuzushiji)是日本近代之前最流行的书写风格之一,广泛用于个人信件和官方文件。然而,由于其高度连笔的形式和广泛的字形变体,大多数现代日本读者无法直接解读草书字符。因此,近期的研究集中于开发自动化的草书字符识别方法,这些方法在相对干净的草书文档图像上取得了令人满意的性能。然而,现有方法在印章干扰(例如,当印章重叠字符时)下难以保持识别准确性,而印章在日本近代文档中频繁出现。为了解决这一挑战,我们提出了一种三阶段的恢复引导式草书字符识别(RG-KCR)框架,专门设计用于减轻印章干扰。我们构建了用于评估草书字符检测(第一阶段)和分类(第三阶段)的数据集。实验结果表明,YOLOv12-medium模型在构建的测试集上达到了98.0%的精确度和93.3%的召回率。我们使用PSNR和SSIM对第二阶段的恢复性能进行了定量评估。此外,我们还进行了消融研究,以证明第二阶段将基于视觉变换器(ViT)的草书分类器Metom在第三阶段的Top-1准确率从93.45%提高到95.33%。本工作的实现代码可在https://ruiyangju.github.io/RG-KCR获取。
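Since the framework is an explicit detect-restore-classify cascade, a short pipeline sketch may help fix the data flow; the three callables below are placeholders standing in for YOLOv12 detection (Stage 1), the restoration network (Stage 2), and the ViT-based Metom classifier (Stage 3), not the released implementation:

from typing import Callable, List, Tuple
import numpy as np

def rg_kcr(image: np.ndarray,
           detect: Callable[[np.ndarray], List[Tuple[int, int, int, int]]],
           restore: Callable[[np.ndarray], np.ndarray],
           classify: Callable[[np.ndarray], str]) -> List[str]:
    # Stage 1: detect character boxes; Stage 2: restore each crop, removing
    # overlapping seal pixels; Stage 3: classify the cleaned Kuzushiji crop.
    chars = []
    for (x1, y1, x2, y2) in detect(image):
        crop = image[y1:y2, x1:x2]
        chars.append(classify(restore(crop)))
    return chars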
cs.CV / 80 / 2602.19089

Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling

Ani3DHuman:具有自指导随机采样的逼真3D人类动画
Sun, Qi, Wang, Can, Shang, Jiaxiang, Liu, Yingchun, Liao, Jing
Abstract
Current 3D human animation methods struggle to achieve photorealism: kinematics-based approaches lack non-rigid dynamics (e.g., clothing dynamics), while methods that leverage video diffusion priors can synthesize non-rigid motion but suffer from quality artifacts and identity loss. To overcome these limitations, we present Ani3DHuman, a framework that marries kinematics-based animation with video diffusion priors. We first introduce a layered motion representation that disentangles rigid motion from residual non-rigid motion. Rigid motion is generated by a kinematic method, which then produces a coarse rendering to guide the video diffusion model in generating video sequences that restore the residual non-rigid motion. However, this restoration task, based on diffusion sampling, is highly challenging, as the initial renderings are out-of-distribution, causing standard deterministic ODE samplers to fail. Therefore, we propose a novel self-guided stochastic sampling method, which effectively addresses the out-of-distribution problem by combining stochastic sampling (for photorealistic quality) with self-guidance (for identity fidelity). These restored videos provide high-quality supervision, enabling the optimization of the residual non-rigid motion field. Extensive experiments demonstrate that Ani3DHuman can generate photorealistic 3D human animation, outperforming existing methods. Code is available at https://github.com/qiisun/ani3dhuman.
Chinese Translation
当前的3D人类动画方法在实现逼真效果方面面临挑战:基于运动学的方法缺乏非刚性动态(例如,服装动态),而利用视频扩散先验的方法能够合成非刚性运动,但存在质量伪影和身份丢失的问题。为了解决这些局限性,我们提出了Ani3DHuman,一个将基于运动学的动画与视频扩散先验相结合的框架。我们首先引入了一种分层运动表示,将刚性运动与残余非刚性运动分离。刚性运动由运动学方法生成,随后产生粗略渲染以指导视频扩散模型生成恢复残余非刚性运动的视频序列。然而,这一基于扩散采样的恢复任务极具挑战性,因为初始渲染超出了分布范围,导致标准确定性常微分方程(ODE)采样器失效。因此,我们提出了一种新颖的自指导随机采样方法,通过将随机采样(用于逼真质量)与自指导(用于身份保真)相结合,有效解决了超出分布范围的问题。这些恢复的视频提供了高质量的监督,能够优化残余非刚性运动场。大量实验表明,Ani3DHuman能够生成逼真的3D人类动画,超越现有方法。代码可在 https://github.com/qiisun/ani3dhuman 获取。
cs.CV / 81 / 2602.19091

CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension

CREM:基于压缩驱动的多模态检索与理解的表征增强
Liu, Lihao, Wang, Yan, Yang, Biao, Li, Da, Cao, Jiangxia, Luo, Yuxiao, Chen, Xiang, Wu, Xiangyu, Yuan, Wei, Yang, Fan, Ding, Guiguang, Gao, Tingting, Zhou, Guorui
Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable success in comprehension tasks such as visual description and visual question answering. However, their direct application to embedding-based tasks like retrieval remains challenging due to the discrepancy between output formats and optimization objectives. Previous approaches often employ contrastive fine-tuning to adapt MLLMs for retrieval, but at the cost of losing their generative capabilities. We argue that both generative and embedding tasks fundamentally rely on shared cognitive mechanisms, specifically cross-modal representation alignment and contextual comprehension. To this end, we propose CREM (Compression-driven Representation Enhanced Model), with a unified framework that enhances multimodal representations for retrieval while preserving generative ability. Specifically, we introduce a compression-based prompt design with learnable chorus tokens to aggregate multimodal semantics and a compression-driven training strategy that integrates contrastive and generative objectives through compression-aware attention. Extensive experiments demonstrate that CREM achieves state-of-the-art retrieval performance on MMEB while maintaining strong generative performance on multiple comprehension benchmarks. Our findings highlight that generative supervision can further improve the representational quality of MLLMs under the proposed compression-driven paradigm.
Chinese Translation
多模态大型语言模型(MLLMs)在视觉描述和视觉问答等理解任务中表现出色。然而,由于输出格式与优化目标之间的差异,它们在嵌入基础任务(如检索)中的直接应用仍然面临挑战。以往的方法通常采用对比微调来调整MLLMs以适应检索,但代价是失去了它们的生成能力。我们认为,生成任务和嵌入任务在根本上依赖于共享的认知机制,特别是跨模态表征对齐和上下文理解。为此,我们提出了CREM(基于压缩驱动的表征增强模型),该模型采用统一框架增强多模态表征以进行检索,同时保持生成能力。具体而言,我们引入了一种基于压缩的提示设计,使用可学习的合唱标记来聚合多模态语义,并采用一种压缩驱动的训练策略,通过压缩感知注意力将对比目标和生成目标结合起来。大量实验表明,CREM在MMEB上实现了最先进的检索性能,同时在多个理解基准上保持了强大的生成性能。我们的研究结果强调,在提出的压缩驱动范式下,生成监督可以进一步提高MLLMs的表征质量。
cs.CV / 82 / 2602.19112

Universal 3D Shape Matching via Coarse-to-Fine Language Guidance

通过粗到细的语言指导实现通用三维形状匹配
Xiao, Qinfeng, Mei, Guofeng, Yang, Bo, Zhang, Liying, Zhang, Jian, Yick, Kit-lun
Abstract
Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, while prior approaches depend on near-isometric assumptions and homogeneous subject types (i.e., they operate only on human shapes). However, building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To achieve this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift "coarse" semantic cues into "fine" correspondence, which is achieved through two stages. In the "coarse" stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vision language models (VLMs) to extract text embeddings, enabling the construction of matched semantic parts. In the "fine" stage, we leverage these coarse correspondences to guide the learning of dense correspondences through a dedicated rank-based contrastive scheme. Thanks to class-agnostic segmentation, language guiding, and rank-based contrastive learning, our method is versatile for universal object categories and requires no predefined part proposals, enabling universal matching for inter-class and non-isometric shapes. Extensive experiments demonstrate UniMatch consistently outperforms competing methods in various challenging scenarios.
Chinese Translation
在计算机视觉和图形学中,建立形状之间的密集对应关系是一项关键任务,而以往的方法依赖于近等距假设和同质主题类型(即仅适用于人形)。然而,为跨类别对象建立语义对应关系仍然具有挑战性,并且受到的关注相对较少。为此,我们提出了UniMatch,这是一种语义感知的粗到细框架,用于在强非等距形状之间构建密集的语义对应关系,而不限制对象类别。关键的见解在于将“粗略”的语义线索提升为“细致”的对应关系,这通过两个阶段实现。在“粗略”阶段,我们执行与类别无关的三维分割,以获得不重叠的语义部分,并提示多模态大型语言模型(MLLMs)识别部分名称。然后,我们利用预训练的视觉语言模型(VLMs)提取文本嵌入,从而构建匹配的语义部分。在“细致”阶段,我们利用这些粗略的对应关系,通过专门的基于排名的对比方案指导密集对应关系的学习。得益于与类别无关的分割、语言指导和基于排名的对比学习,我们的方法适用于通用对象类别,并且不需要预定义的部分提议,从而实现了跨类别和非等距形状的通用匹配。大量实验表明,UniMatch在各种具有挑战性的场景中始终优于竞争方法。
cs.CV / 83 / 2602.19117

Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models

保持简洁(SymPL):视觉-语言模型中以物为中心空间推理的符号投影布局
Jang, Jaeyun, Shin, Seunghui, Park, Taeho, Hwang, Hyoseok
Abstract
Perspective-aware spatial reasoning involves understanding spatial relationships from specific viewpoints: either egocentric (observer-centered) or allocentric (object-centered). While vision-language models (VLMs) perform well in egocentric settings, their performance deteriorates when reasoning from allocentric viewpoints, where spatial relations must be inferred from the perspective of objects within the scene. In this study, we address this underexplored challenge by introducing Symbolic Projective Layout (SymPL), a framework that reformulates allocentric reasoning into symbolic-layout forms that VLMs inherently handle well. By leveraging four key factors, projection, abstraction, bipartition, and localization, SymPL converts allocentric questions into structured symbolic-layout representations. Extensive experiments demonstrate that this reformulation substantially improves performance in both allocentric and egocentric tasks, enhances robustness under visual illusions and multi-view scenarios, and that each component contributes critically to these gains. These results show that SymPL provides an effective and principled approach for addressing complex perspective-aware spatial reasoning.
Chinese Translation
具有视角感知的空间推理涉及从特定视点理解空间关系——无论是以自我为中心(观察者中心)还是以物为中心(对象中心)。虽然视觉-语言模型(VLMs)在以自我为中心的设置中表现良好,但在以物为中心的视角下推理时,其性能会下降,因为空间关系必须从场景中对象的视角推断。在本研究中,我们通过引入符号投影布局(Symbolic Projective Layout,SymPL)来解决这一未被充分探索的挑战,该框架将以物为中心的推理重新构造成VLMs本质上处理良好的符号布局形式。通过利用四个关键因素——投影、抽象、二分和定位——SymPL将以物为中心的问题转换为结构化的符号布局表示。大量实验表明,这种重新表述在以物为中心和以自我为中心的任务中显著提高了性能,增强了在视觉错觉和多视角场景下的鲁棒性,并且每个组件对这些提升都起到了关键作用。这些结果表明,SymPL为解决复杂的视角感知空间推理提供了一种有效且有原则的方法。
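The projection factor in particular admits a concrete reading: re-express object positions in the reference object's frame before verbalizing relations such as front/behind and left/right. A speculative 2D sketch of that transform (the heading convention and relation names are assumptions, not the paper's formulation):

import numpy as np

def allocentric_relation(ref_pos, ref_heading, obj_pos):
    # ref_heading: direction the reference object faces, CCW radians from +x.
    # Rotate the world by -ref_heading so the reference faces +x, then read the
    # relation straight off the object's coordinates in that frame.
    c, s = np.cos(-ref_heading), np.sin(-ref_heading)
    rel = np.asarray(obj_pos, float) - np.asarray(ref_pos, float)
    dx, dy = np.array([[c, -s], [s, c]]) @ rel
    return ("in front of" if dx > 0 else "behind",
            "left of" if dy > 0 else "right of")

# A reference facing +y with an object ahead and to its right:
print(allocentric_relation((0.0, 0.0), np.pi / 2, (1.0, 3.0)))  # ('in front of', 'right of')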
cs.CV / 84 / 2602.19123

StreetTree: A Large-Scale Global Benchmark for Fine-Grained Tree Species Classification

StreetTree:一个大规模全球基准用于细粒度树种分类
Li, Jiapeng, Huang, Yingjing, Zhang, Fan, Liu, Yu
Abstract
The fine-grained classification of street trees is a crucial task for urban planning, streetscape management, and the assessment of urban ecosystem services. However, progress in this field has been significantly hindered by the lack of large-scale, geographically diverse, and publicly available benchmark datasets specifically designed for street trees. To address this critical gap, we introduce StreetTree, the world's first large-scale benchmark dataset dedicated to fine-grained street tree classification. The dataset contains over 12 million images covering more than 8,300 common street tree species, collected from urban streetscapes across 133 countries spanning five continents, and supplemented with expert-verified observational data. StreetTree poses substantial challenges for pretrained vision models under complex urban environments: high inter-species visual similarity, long-tailed natural distributions, significant intra-class variations caused by seasonal changes, and diverse imaging conditions such as lighting, occlusions from buildings, and varying camera angles. In addition, we provide a hierarchical taxonomy (order-family-genus-species) to support research in hierarchical classification and representation learning. Through extensive experiments with various visual models, we establish strong baselines and reveal the limitations of existing methods in handling such real-world complexities. We believe that StreetTree will serve as a key resource for the refined management and research of urban street trees, while also driving new advancements at the intersection of computer vision and urban science.
Chinese Translation
街道树的细粒度分类是城市规划、街景管理和城市生态系统服务评估中的一项关键任务。然而,由于缺乏专门针对街道树的大规模、地理多样化和公开可用的基准数据集,该领域的进展受到显著阻碍。为了解决这一关键问题,我们推出了StreetTree,这是全球首个专门用于细粒度街道树分类的大规模基准数据集。该数据集包含超过1200万张图像,涵盖超过8300种常见街道树种,数据来源于跨越五大洲133个国家的城市街景,并附有专家验证的观察数据。StreetTree对预训练视觉模型在复杂城市环境中提出了重大挑战:物种间的视觉相似性高、自然分布呈长尾特征、季节变化导致显著的类内变异,以及光照、建筑遮挡和相机角度变化等多样的成像条件。此外,我们提供了一个层次分类法(目-科-属-种),以支持层次分类和表示学习的研究。通过对各种视觉模型进行广泛实验,我们建立了强有力的基线,并揭示了现有方法在处理这些现实世界复杂性方面的局限性。我们相信,StreetTree将成为城市街道树精细管理和研究的关键资源,同时推动计算机视觉与城市科学交叉领域的新进展。
cs.CV / 85 / 2602.19134

Mapping Networks

映射网络
Sen, Lord, Mukherjee, Shyamapada
Abstract
The escalating parameter counts in modern deep learning models pose a fundamental challenge to efficient training and to mitigating overfitting. We address this by introducing \emph{Mapping Networks}, which replace the high-dimensional weight space with a compact, trainable latent vector, based on the hypothesis that the trained parameters of large networks reside on smooth, low-dimensional manifolds. The Mapping Theorem, enforced by a dedicated Mapping Loss, establishes the existence of a mapping from this latent space to the target weight space, both theoretically and in practice. Mapping Networks significantly reduce overfitting and achieve comparable or better performance than the target network across complex vision and sequence tasks, including image classification and deepfake detection, with a $\mathbf{99.5\%}$ (around $500\times$) reduction in trainable parameters.
Chinese Translation
现代深度学习模型中不断增加的参数数量对高效训练和缓解过拟合构成了根本挑战。我们通过引入映射网络(Mapping Networks)来应对这一挑战:基于大型网络的训练参数位于光滑低维流形上的假设,该网络用紧凑的可训练潜在向量替代高维权重空间。由专门的映射损失(Mapping Loss)强制执行的映射定理(Mapping Theorem)在理论和实践中都表明,从该潜在空间到目标权重空间的映射是存在的。映射网络显著减少了过拟合,并在图像分类(Image Classification)、深伪检测(Deepfake Detection)等复杂的视觉和序列任务中实现了与目标网络相当或更好的性能,同时将可训练参数减少了99.5%,即约500倍。
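Taken literally, the construction is a compact trainable latent plus a mapping into the target network's flattened weight space, fitted under a Mapping Loss. A toy sketch of that reading (the dimensions, the frozen linear mapper, and the MSE form of the loss are all illustrative assumptions):

import torch
import torch.nn as nn

target_dim = 50_000   # hypothetical flattened weight count of the target network
latent_dim = 100      # compact latent: roughly 500x fewer trainable parameters

z = nn.Parameter(torch.randn(latent_dim) * 0.01)  # the only trainable tensor
mapper = nn.Linear(latent_dim, target_dim)
for p in mapper.parameters():
    p.requires_grad_(False)  # freeze the map so the trainable count stays at latent_dim

def mapping_loss(reference_weights: torch.Tensor) -> torch.Tensor:
    # Drives the mapped latent toward a weight vector known to solve the task,
    # exhibiting in practice the mapping whose existence the Mapping Theorem asserts.
    return ((mapper(z) - reference_weights) ** 2).mean()

mapping_loss(torch.randn(target_dim)).backward()  # gradients flow only into z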
cs.CV / 86 / 2602.19140

CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion

CaReFlow:用于多模态融合的循环自适应整流流
Mai, Sijie, Han, Shiqin
Abstract
Modality gap significantly restricts the effectiveness of multimodal fusion. Previous methods often use techniques such as diffusion models and adversarial learning to reduce the modality gap, but they typically focus on one-to-one alignment without exposing the data points of the source modality to the global distribution information of the target modality. To this end, leveraging the characteristic of rectified flow that can map one distribution to another via a straight trajectory, we extend rectified flow for modality distribution mapping. Specifically, we leverage the "one-to-many mapping" strategy in rectified flow that allows each data point of the source modality to observe the overall target distribution. This also alleviates the issue of insufficient paired data within each sample, enabling a more robust distribution transformation. Moreover, to achieve more accurate distribution mapping and address the ambiguous flow directions in one-to-many mapping, we design "adaptive relaxed alignment", enforcing stricter alignment for modality pairs belonging to the same sample, while applying relaxed mapping for pairs not belonging to the same sample or category. Additionally, to prevent information loss during distribution mapping, we introduce "cyclic rectified flow" to ensure the transferred features can be translated back to the original features, allowing multimodal representations to learn sufficient modality-specific information. After distribution alignment, our approach achieves very competitive results on multiple tasks of multimodal affective computing even with a simple fusion method, and visualizations verify that it can effectively reduce the modality gap.
Chinese Translation
模态间差距显著限制了多模态融合的有效性。以往的方法通常使用扩散模型和对抗学习等技术来减少模态差距,但它们通常专注于一对一的对齐,而没有将源模态的数据点暴露于目标模态的全局分布信息中。为此,我们利用整流流的特性,通过直线路径将一种分布映射到另一种分布,扩展整流流用于模态分布映射。具体而言,我们利用整流流中的‘一对多映射’策略,使源模态的每个数据点能够观察到整体目标分布。这也缓解了每个样本内配对数据不足的问题,从而实现更稳健的分布转换。此外,为了实现更准确的分布映射并解决一对多映射中的模糊流向问题,我们设计了‘自适应松弛对齐’,对属于同一样本的模态对施加更严格的对齐,同时对不属于同一样本或类别的对施加松弛映射。此外,为了防止在分布映射过程中信息丢失,我们引入了‘循环整流流’,以确保转移的特征可以被转换回原始特征,使多模态表示能够学习到足够的模态特定信息。在分布对齐后,我们的方法在多模态情感计算的多个任务上取得了非常有竞争力的结果,即使使用简单的融合方法,视觉化结果也验证了它能够有效减少模态差距。
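Plain rectified flow, the ingredient this paper builds on, is easy to sketch: a velocity network is regressed toward the straight-line displacement between samples of the source and target distributions, and pairing each source point with varied target points in the batch gives the one-to-many flavor. A generic sketch of that base objective (feature size and network are illustrative; the adaptive relaxed alignment and cyclic terms are not shown):

import torch
import torch.nn as nn

dim = 64
v = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))

def rectified_flow_loss(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # x0: source-modality features, x1: target-modality features, both (B, dim).
    # One-to-many pairing: shuffle x1 so each source point sees varied targets.
    x1 = x1[torch.randperm(x1.size(0))]
    t = torch.rand(x0.size(0), 1)
    xt = (1 - t) * x0 + t * x1                  # point on the straight trajectory
    pred = v(torch.cat([xt, t], dim=-1))
    return ((pred - (x1 - x0)) ** 2).mean()     # regress the constant velocity x1 - x0

loss = rectified_flow_loss(torch.randn(32, dim), torch.randn(32, dim))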
cs.CV / 87 / 2602.19146

VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

VIGiA:基于对话推理与检索的教学视频引导
Glória-Silva, Diogo, Semedo, David, Magalhães, João
Abstract
We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work which focuses mainly on text-only guidance, or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were done on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
Chinese Translation
我们介绍了VIGiA,这是一种新颖的多模态对话模型,旨在理解和推理复杂的多步骤指导视频行动计划。与以往主要关注文本指导或将视觉和语言孤立对待的研究不同,VIGiA支持基于情境的、计划感知的对话,这需要对视觉输入、指导计划和交错的用户交互进行推理。为此,VIGiA结合了两个关键能力:(1) 多模态计划推理,使模型能够将单模态和多模态查询与当前任务计划对齐并准确响应;(2) 基于计划的检索,允许其以文本或视觉表示形式检索相关的计划步骤。在一个与烹饪和DIY计划对齐的丰富指导视频对话的新数据集上进行了实验。我们的评估表明,VIGiA在对话计划指导设置中的所有任务上均优于现有的最先进模型,在计划感知的视觉问答(VQA)中达到了超过90%的准确率。
cs.CV / 88 / 2602.19156

Artefact-Aware Fungal Detection in Dermatophytosis: A Real-Time Transformer-Based Approach for KOH Microscopy

关注伪影的皮肤真菌病检测:基于实时变换器的KOH显微镜方法
Gursoy, Rana, Yilmaz, Abdurrahim, Kizilyaprak, Baris, Caglar, Esmahan, Temelkuran, Burak, Uvet, Huseyin, Aksu, Ayse Esra Koku, Gencoglan, Gulsum
Abstract
Dermatophytosis is commonly assessed using potassium hydroxide (KOH) microscopy, yet accurate recognition of fungal hyphae is hindered by artefacts, heterogeneous keratin clearance, and notable inter-observer variability. This study presents a transformer-based detection framework using the RT-DETR model architecture to achieve precise, query-driven localization of fungal structures in high-resolution KOH images. A dataset of 2,540 routinely acquired microscopy images was manually annotated using a multi-class strategy to explicitly distinguish fungal elements from confounding artefacts. The model was trained with morphology-preserving augmentations to maintain the structural integrity of thin hyphae. Evaluation on an independent test set demonstrated robust object-level performance, with a recall of 0.9737, precision of 0.8043, and an mAP@0.5 of 93.56%. When aggregated for image-level diagnosis, the model achieved 100% sensitivity and 98.8% accuracy, correctly identifying all positive cases without missing a single diagnosis. Qualitative outputs confirmed the robust localization of low-contrast hyphae even in artefact-rich fields. These results highlight that an artificial intelligence (AI) system can serve as a highly reliable, automated screening tool, effectively bridging the gap between image-level analysis and clinical decision-making in dermatomycology.
Chinese Translation
皮肤真菌病通常通过氢氧化钾(KOH)显微镜进行评估,但伪影、异质角蛋白清除和显著的观察者间变异性阻碍了真菌菌丝的准确识别。本研究提出了一种基于变换器的检测框架,采用RT-DETR模型架构,实现对高分辨率KOH图像中真菌结构的精确、查询驱动的定位。我们手动注释了一个包含2,540张常规获取的显微镜图像的数据集,采用多类策略明确区分真菌元素与混淆伪影。该模型在保持形态的增强下进行训练,以保持纤细菌丝的结构完整性。在独立测试集上的评估显示出强大的对象级性能,召回率为0.9737,精确率为0.8043,mAP@0.5为93.56%。在图像级诊断中,该模型实现了100%的灵敏度和98.8%的准确率,正确识别所有阳性病例,未漏诊任何病例。定性输出确认了即使在伪影丰富的视野中,低对比度菌丝的稳健定位。这些结果强调了人工智能(AI)系统可以作为一种高度可靠的自动筛查工具,有效弥合图像级分析与皮肤真菌学临床决策之间的差距。
cs.CV / 89 / 2602.19161

Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation

Flash-VAED:高效视频生成的即插即用变分自编码器解码器
Zhu, Lunjie, Huang, Yushi, Ge, Xingtong, Xue, Yufei, Liu, Zhening, Zhang, Yumeng, Lin, Zehong, Zhang, Jun
Abstract
Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6$\times$ speedup while retaining up to 96.9% of the reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
Chinese Translation
潜在扩散模型已经实现了高质量的视频合成,但其推理过程仍然成本高昂且耗时。随着扩散变换器变得越来越高效,延迟瓶颈不可避免地转移到了变分自编码器(VAE)解码器上。为了在保持质量的同时减少延迟,我们提出了一种通用的VAE解码器加速框架,该框架与原始潜在分布保持完全一致。具体而言,我们提出了(1)一种关注独立性的通道剪枝方法,以有效减轻严重的通道冗余,以及(2)一种阶段性主导算子优化策略,以解决VAE解码器中广泛使用的因果3D卷积的高推理成本。基于这些创新,我们构建了Flash-VAED系列。此外,我们设计了一个三阶段动态蒸馏框架,能够高效地将原始VAE解码器的能力转移到Flash-VAED。对Wan和LTX-Video VAE解码器的广泛实验表明,我们的方法在质量和速度上均优于基线,速度提升约为6倍,同时重建性能保持在96.9%。值得注意的是,Flash-VAED将端到端生成管道的加速提升了多达36%,且在VBench-2.0上质量下降微乎其微。
cs.CV / 90 / 2602.19163

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

JavisDiT++:联合建模与优化用于音视频联合生成
Liu, Kai, Zheng, Yanhao, Wang, Kai, Wu, Shengqiong, Zhang, Rongjunchen, Luo, Jiebo, Hatzinakos, Dimitrios, Liu, Ziwei, Fei, Hao, Chua, Tat-Seng
Abstract
AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. To bridge the gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables cross-modal interaction efficacy while enhancing single-modal generation quality. Then, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. Besides, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preference across quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance with only around 1M public training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies have been conducted to validate the effectiveness of our proposed modules. All the code, model, and dataset are released at https://JavisVerse.github.io/JavisDiT2-page.
Chinese Translation
人工智能生成内容(AIGC)已迅速从文本到图像生成扩展到高质量的多模态合成,包括视频和音频。在这一背景下,音视频联合生成(JAVG)作为一项基础任务,旨在从文本描述中生成同步且语义对齐的声音和视觉。然而,与Veo3等先进商业模型相比,现有的开源方法在生成质量、时间同步和与人类偏好的对齐方面仍存在局限性。为了解决这些问题,本文提出了JavisDiT++,一个简洁而强大的框架,用于JAVG的统一建模与优化。首先,我们引入了一种特定模态的专家混合设计(MS-MoE),该设计能够提高跨模态交互的有效性,同时增强单模态生成质量。然后,我们提出了一种时间对齐的旋转位置编码(TA-RoPE)策略,以实现音频和视频标记之间显式的帧级同步。此外,我们开发了一种音视频直接偏好优化(AV-DPO)方法,以在质量、一致性和同步维度上将模型输出与人类偏好对齐。基于Wan2.1-1.3B-T2V,我们的模型在仅使用约100万条公共训练数据的情况下实现了最先进的性能,在定性和定量评估中显著优于先前的方法。我们进行了全面的消融研究,以验证所提模块的有效性。所有代码、模型和数据集均已发布在 https://JavisVerse.github.io/JavisDiT2-page。
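The abstract does not spell out TA-RoPE, but its stated goal, explicit frame-level synchronization, suggests assigning rotary positions from shared wall-clock timestamps so that audio and video tokens occurring at the same instant get matching phases. A speculative sketch of that position-id assignment (all names, rates, and the resolution are hypothetical):

import torch

def temporal_position_ids(n_video_frames: int, video_fps: float,
                          n_audio_tokens: int, audio_token_rate: float,
                          resolution: float = 100.0):
    # Map both token streams onto one shared wall-clock axis (units of
    # 1/resolution seconds) so that RoPE phases computed from these ids agree
    # for audio and video tokens that occur at the same instant.
    video_t = torch.arange(n_video_frames) / video_fps
    audio_t = torch.arange(n_audio_tokens) / audio_token_rate
    return (video_t * resolution).long(), (audio_t * resolution).long()

# 16 video frames at 8 fps vs. 100 audio tokens at 50 tokens/s over the same 2 s clip.
vid_pos, aud_pos = temporal_position_ids(16, 8.0, 100, 50.0)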
cs.CV / 91 / 2602.19170

BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment

BriMA:用于多模态持续动作质量评估的桥接模态适应
Zhou, Kanglei, Li, Chang, Pan, Qingyi, Wang, Liyuan
Abstract
Action Quality Assessment (AQA) aims to score how well an action is performed and is widely used in sports analysis, rehabilitation assessment, and human skill evaluation. Multi-modal AQA has recently achieved strong progress by leveraging complementary visual and kinematic cues, yet real-world deployments often suffer from non-stationary modality imbalance, where certain modalities become missing or intermittently available due to sensor failures or annotation gaps. Existing continual AQA methods overlook this issue and assume that all modalities remain complete and stable throughout training, which restricts their practicality. To address this challenge, we introduce Bridged Modality Adaptation (BriMA), an innovative approach to multi-modal continual AQA under modality-missing conditions. BriMA consists of a memory-guided bridging imputation module that reconstructs missing modalities using both task-agnostic and task-specific representations, and a modality-aware replay mechanism that prioritizes informative samples based on modality distortion and distribution drift. Experiments on three representative multi-modal AQA datasets (RG, Fis-V, and FS1000) show that BriMA consistently improves performance under different modality-missing conditions, achieving 6-8% higher correlation and 12-15% lower error on average. These results demonstrate a step toward robust multi-modal AQA systems under real-world deployment constraints.
Chinese Translation
动作质量评估(AQA)旨在对动作执行的质量进行评分,广泛应用于体育分析、康复评估和人类技能评估。近年来,多模态AQA通过利用互补的视觉和运动学线索取得了显著进展,但在实际应用中,往往面临非平稳模态失衡的问题,即由于传感器故障或标注缺失,某些模态可能会缺失或间歇性可用。现有的持续AQA方法忽视了这一问题,假设所有模态在训练过程中保持完整和稳定,这限制了它们的实用性。为了解决这一挑战,我们提出了桥接模态适应(BriMA),这是一种在模态缺失条件下进行多模态持续AQA的创新方法。BriMA由一个记忆引导的桥接插补模块组成,该模块利用任务无关和任务特定的表示重建缺失的模态,以及一个模态感知的重放机制,根据模态失真和分布漂移优先选择信息丰富的样本。在三个具有代表性的多模态AQA数据集(RG、Fis-V和FS1000)上的实验表明,BriMA在不同的模态缺失条件下始终提高了性能,平均实现了6-8%的相关性提升和12-15%的误差降低。这些结果展示了在实际部署约束下,朝着稳健的多模态AQA系统迈进的一步。
cs.CV / 92 / 2602.19178

EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease

EMAD:以证据为中心的阿尔茨海默病多模态诊断
Chen, Qiuhui, Yao, Xuancheng, Zhou, Zhenglei, Hu, Xinyue, Hong, Yi
Abstract
Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.
Chinese Translation
深度学习模型在医学图像分析中往往表现为黑箱,鲜有与临床指南对齐或明确将决策与支持证据联系起来的情况。这在阿尔茨海默病(AD)中尤为重要,因为预测应基于解剖和临床发现。我们提出了EMAD,一个视觉-语言框架,生成结构化的AD诊断报告,其中每个主张都明确基于多模态证据。EMAD使用层次化的句子-证据-解剖(Sentence-Evidence-Anatomy, SEA)基础机制:(i)句子到证据的基础将生成的句子与临床证据短语相连接;(ii)证据到解剖的基础则在3D脑MRI上定位相应的结构。为了减少密集标注的需求,我们提出了GTX-Distill,它将有限监督下训练的教师的基础行为转移到基于模型生成报告的学生上。我们进一步引入可执行规则的GRPO(Executable-Rule GRPO),这是一种具有可验证奖励的强化微调方案,强制执行临床一致性、协议遵循和推理-诊断的一致性。在AD-MultiSense数据集上,EMAD实现了最先进的诊断准确性,并生成比现有方法更透明、更符合解剖学的报告。我们将发布代码和基础注释,以支持未来在可信医疗视觉-语言模型方面的研究。
cs.CV / 93 / 2602.19180

VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery

基于VLM的群体偏好对齐用于扩散式人类网格恢复
Shen, Wenhao, Wang, Hao, Yin, Wanqi, Liu, Fayao, Yang, Xulei, Liang, Chao, Cai, Zhongang, Lin, Guosheng
Abstract
Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent diffusion-based methods tackle this by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Leveraging this dataset, we propose a group preference alignment framework for finetuning diffusion-based HMR models. This process injects the rich preference signals into the model, guiding it to generate more physically plausible and image-consistent human meshes. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art approaches.
Chinese Translation
从单一RGB图像中恢复人类网格(HMR)本质上是模糊的,因为多个3D姿势可能对应于相同的2D观测。最近的基于扩散的方法通过生成多种假设来解决这一问题,但往往牺牲了准确性。它们的预测结果要么在物理上不合理,要么偏离输入图像,尤其是在遮挡或杂乱的自然场景中。为了解决这个问题,我们引入了一种双重记忆增强的HMR评估代理,结合自我反思,生成对预测网格的上下文感知质量评分。这些评分提炼了关于3D人类运动结构、物理可行性和与输入图像对齐的细粒度线索。我们利用这些评分构建了一个群体HMR偏好数据集。基于该数据集,我们提出了一种群体偏好对齐框架,用于微调基于扩散的HMR模型。这个过程将丰富的偏好信号注入模型,引导其生成更具物理合理性和图像一致性的人类网格。大量实验表明,我们的方法在性能上优于最先进的技术。
cs.CV / 94 / 2602.19188

PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration

PositionOCR:通过混合专家集成增强多模态模型的位置信息感知
Duan, Chen, Guo, Zhentao, Fu, Pei, Wang, Zining, Zhou, Kai, Yan, Pengfei
Abstract
In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing, and thus inherently lacks the positional reasoning required for precise visual tasks, such as text spotting and text grounding. Additionally, the extensive parameters of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: Can we synergize the efficiency of specialists with the contextual power of LLMs to create a positionally-accurate MLLM? To overcome these challenges, we introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model's positional strengths with an LLM's contextual reasoning. Comprising 131M trainable parameters, this framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting, consistently surpassing traditional MLLMs.
Chinese Translation
近年来,多模态大型语言模型(MLLMs)在以光学字符识别(OCR)为中心的视觉问答(VQA)任务中取得了强劲的表现,展示了其处理异构数据和在不同上下文中展现适应能力的能力。然而,这些MLLMs依赖于大型语言模型(LLM)作为解码器,该模型主要设计用于语言处理,因此在执行精确视觉任务(如文本检测和文本定位)时,固有地缺乏所需的位置信息推理能力。此外,MLLMs的庞大参数量需要大量的计算资源和大规模数据进行有效训练。相对而言,文本检测专家在坐标预测方面达到了最先进的水平,但缺乏语义推理能力。这种二元对立激发了我们的关键研究问题:我们能否将专家的效率与LLMs的上下文能力相结合,创造出一个具有位置准确性的MLLM?为了解决这些挑战,我们提出了PositionOCR,这是一种参数高效的混合架构,能够无缝整合文本检测模型的位置信息优势与LLM的上下文推理。该框架包含131M可训练参数,展示了卓越的多模态处理能力,特别是在文本定位和文本检测等任务中表现出色,始终超越传统的MLLMs。
cs.CV / 95 / 2602.19190

FUSAR-GPT : A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

FUSAR-GPT:一种嵌入时空特征的两阶段解耦视觉语言模型用于合成孔径雷达图像
Zhang, Xiaokun, Yang, Yi, Ye, Ziqi, Guo, Baiyun, Fang, Xiaorong, Zhang, Qingchen, Zhou, Ruyi, Wang, Xinpeng, Wang, Haipeng
Abstract
Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 12%.
Chinese Translation
对全天候、全时段合成孔径雷达(SAR)智能解读的研究对于推动遥感应用至关重要。近年来,尽管视觉语言模型(VLMs)在RGB图像上展示了强大的开放世界理解能力,但由于成像机制的复杂性、对散射特征的敏感性以及高质量文本语料的稀缺性,它们在SAR领域的表现受到严重限制。为系统性地解决这一问题,我们构建了首个SAR图像-文本-AlphaEarth特征三元组数据集,并开发了专门针对SAR的视觉语言模型FUSAR-GPT。FUSAR-GPT创新性地引入了地理空间基线模型作为“世界知识”先验,并通过“时空锚点”将多源遥感时序特征嵌入模型的视觉主干,从而实现对SAR图像中目标稀疏表示的动态补偿。此外,我们设计了一种两阶段的SFT策略,以解耦大型模型的知识注入和任务执行。时空特征嵌入和两阶段解耦范式使FUSAR-GPT在多个典型遥感视觉语言基准测试中实现了最先进的性能,显著超越主流基线模型超过12%。
cs.CV / 96 / 2602.19198

Prompt Tuning for CLIP on the Pretrained Manifold

在预训练流形上进行CLIP的提示调优
Yang, Xi, Xu, Yuanrong, Zhang, Weigang, Lu, Guangming, Zhang, David, Wen, Jie
Abstract
Prompt tuning introduces learnable prompt vectors that adapt pretrained vision-language models to downstream tasks in a parameter-efficient manner. However, under limited supervision, prompt tuning alters pretrained representations and drives downstream features away from the pretrained manifold toward directions that are unfavorable for transfer. This drift degrades generalization. To address this limitation, we propose ManiPT, a framework that performs prompt tuning on the pretrained manifold. ManiPT introduces cosine consistency constraints in both the text and image modalities to confine the learned representations within the pretrained geometric neighborhood. Furthermore, we introduce a structural bias that enforces incremental corrections, guiding the adaptation along transferable directions to mitigate reliance on shortcut learning. From a theoretical perspective, ManiPT alleviates overfitting tendencies under limited data. Our experiments cover four downstream settings: unseen-class generalization, few-shot classification, cross-dataset transfer, and domain generalization. Across these settings, ManiPT achieves higher average performance than baseline methods. Notably, ManiPT provides an explicit perspective on how prompt tuning overfits under limited supervision.
Chinese Translation
提示调优引入了可学习的提示向量,以参数高效的方式将预训练的视觉-语言模型适应于下游任务。然而,在有限监督的情况下,提示调优改变了预训练表示,并使下游特征偏离预训练流形,朝向不利于迁移的方向。这种漂移降低了模型的泛化能力。为了解决这一局限性,我们提出了ManiPT,一个在预训练流形上进行提示调优的框架。ManiPT在文本和图像模态中引入了余弦一致性约束,以将学习到的表示限制在预训练几何邻域内。此外,我们引入了一种结构性偏差,强制进行增量修正,引导适应沿可迁移方向进行,以减轻对捷径学习的依赖。从理论角度来看,ManiPT缓解了在有限数据下的过拟合倾向。我们的实验涵盖了四个下游设置:未见类别泛化、少样本分类、跨数据集迁移和领域泛化。在这些设置中,ManiPT的平均性能优于基线方法。值得注意的是,ManiPT提供了一个明确的视角,阐明了在有限监督下提示调优如何导致过拟合。
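A cosine consistency constraint of the kind described is compact enough to show directly; a minimal sketch for one modality (pairing tuned features against detached frozen pretrained features and averaging over a batch is the obvious reading, not necessarily the paper's exact recipe):

import torch
import torch.nn.functional as F

def cosine_consistency_loss(tuned_feat: torch.Tensor,
                            frozen_feat: torch.Tensor) -> torch.Tensor:
    # Penalizes angular drift of prompt-tuned features away from the frozen
    # pretrained features, keeping them in the pretrained geometric neighborhood.
    return (1.0 - F.cosine_similarity(tuned_feat, frozen_feat.detach(), dim=-1)).mean()

# One term per modality: text-side and image-side features get the same treatment.
loss = cosine_consistency_loss(torch.randn(8, 512), torch.randn(8, 512))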
cs.CV / 97 / 2602.19202

UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models

UniE2F:一种用于事件到帧重建的统一扩散框架,基于视频基础模型
Xu, Gang, Zhu, Zhiyu, Hou, Junhui
Abstract
Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, as they fundamentally record only relative intensity changes rather than absolute intensity, the resulting data streams suffer from a significant loss of spatial information and static texture details. In this paper, we address this limitation by leveraging the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data. Specifically, we first establish a baseline model by directly applying event data as a condition to synthesize videos. Then, based on the physical correlation between the event stream and video frames, we further introduce the event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction. Furthermore, we extend our method to video frame interpolation and prediction in a zero-shot manner by modulating the reverse diffusion sampling process, thereby creating a unified event-to-frame reconstruction framework. Experimental results on real-world and synthetic datasets demonstrate that our method significantly outperforms previous approaches both quantitatively and qualitatively. We also refer the reviewers to the video demo contained in the supplementary material for video results. The code will be publicly available at https://github.com/CS-GangXu/UniE2F.
Chinese Translation
事件相机在高速、低功耗和高动态范围场景感知方面表现出色。然而,由于它们本质上仅记录相对强度变化而非绝对强度,导致生成的数据流在空间信息和静态纹理细节上显著丧失。本文通过利用预训练视频扩散模型的生成先验,解决了这一局限性,从稀疏事件数据中重建高保真视频帧。具体而言,我们首先建立一个基线模型,直接将事件数据作为条件用于合成视频。然后,基于事件流与视频帧之间的物理关联,我们进一步引入基于事件的帧间残差引导,以提高视频帧重建的准确性。此外,我们通过调节反向扩散采样过程,将我们的方法扩展到视频帧插值和预测,实现零样本处理,从而创建一个统一的事件到帧重建框架。在真实世界和合成数据集上的实验结果表明,我们的方法在定量和定性上显著优于之前的方法。我们还建议评审人员查看补充材料中的视频演示以获取视频结果。代码将公开发布在 https://github.com/CS-GangXu/UniE2F。
cs.CV / 98 / 2602.19206

GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning

GS-CLIP:通过几何感知提示和协同视图表示学习实现零样本3D异常检测
Deng, Zehao, Liu, An, Wang, Yan
Abstract
Zero-shot 3D Anomaly Detection is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. While current methods adapt CLIP by projecting 3D point clouds into 2D representations, they face challenges. The projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting their ability to detect diverse anomaly types. To address these limitations, we propose the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process. In stage 1, we dynamically generate text prompts embedded with 3D geometric priors. These prompts contain global shape context and local defect information distilled by our Geometric Defect Distillation Module (GDDM). In stage 2, we introduce Synergistic View Representation Learning architecture that processes rendered and depth images in parallel. A Synergistic Refinement Module (SRM) subsequently fuses the features of both streams, capitalizing on their complementary strengths. Comprehensive experimental results on four large-scale public datasets show that GS-CLIP achieves superior performance in detection. Code can be available at https://github.com/zhushengxinyue/GS-CLIP.
Chinese Translation
零样本3D异常检测是一项新兴任务,旨在在没有任何目标训练数据的情况下检测目标数据集中的异常,这在样本稀缺和数据隐私问题受限的场景中尤为重要。尽管当前方法通过将3D点云投影到2D表示中来适应CLIP,但它们面临着挑战。投影本质上会丢失一些几何细节,而对单一2D模态的依赖提供了不完整的视觉理解,限制了它们检测多样化异常类型的能力。为了解决这些局限性,我们提出了几何感知提示和协同视图表示学习(GS-CLIP)框架,使模型能够通过两阶段学习过程识别几何异常。在第一阶段,我们动态生成嵌入3D几何先验的文本提示。这些提示包含由我们的几何缺陷蒸馏模块(GDDM)提炼的全局形状上下文和局部缺陷信息。在第二阶段,我们引入协同视图表示学习架构,平行处理渲染图像和深度图像。随后,协同精炼模块(SRM)融合两个流的特征,充分利用它们的互补优势。在四个大规模公共数据集上的综合实验结果表明,GS-CLIP在检测方面表现优越。代码可在 https://github.com/zhushengxinyue/GS-CLIP 获取。
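CLIP-style zero-shot anomaly scoring typically contrasts patch embeddings against "normal" and "anomalous" text embeddings; a generic sketch of that final scoring step (the temperature and prompt wording are illustrative, and GS-CLIP's geometry-aware prompt generation and SRM fusion sit upstream of this):

import torch
import torch.nn.functional as F

def anomaly_map(patch_feats: torch.Tensor, normal_text: torch.Tensor,
                anomal_text: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # patch_feats: (N, D) patch embeddings from the rendered/depth views;
    # *_text: (D,) embeddings of prompts like "a photo of a normal/damaged object".
    feats = F.normalize(patch_feats, dim=-1)
    texts = F.normalize(torch.stack([normal_text, anomal_text]), dim=-1)
    logits = feats @ texts.t() / tau       # (N, 2): similarity to the two prompts
    return logits.softmax(dim=-1)[:, 1]    # per-patch anomaly probability-like score

scores = anomaly_map(torch.randn(196, 512), torch.randn(512), torch.randn(512))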
cs.CV / 99 / 2602.19213

SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation

SegMoTE:用于医学图像分割的令牌级专家混合模型
Lu, Yujie, Li, Jingwen, Ju, Sibo, Su, Yanzhou, Yao, He, Liu, Yisong, Zhu, Min, Cheng, Junlong
Abstract
Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models like SAM have achieved remarkable progress, their transfer to medical imaging still faces two key bottlenecks: (i) the lack of adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) current medical adaptation methods fine-tune on large, heterogeneous datasets without selection, leading to noisy supervision, higher cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM's original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to dynamically adapt across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% of existing large-scale datasets, SegMoTE achieves SOTA performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.
Chinese Translation
医学图像分割对临床诊断和定量分析至关重要,但由于成像模式的异质性和像素级标注的高成本,仍然面临挑战。尽管像SAM这样的通用交互式分割模型取得了显著进展,但其在医学成像中的转移仍面临两个关键瓶颈:(i)缺乏针对特定模式和解剖结构任务的自适应机制,限制了在分布外医学场景中的泛化能力;(ii)当前的医学适应方法在大型异构数据集上进行微调而不进行选择,导致噪声监督、成本增加和负迁移。为了解决这些问题,我们提出了SegMoTE,这是一个高效且自适应的医学图像分割框架。SegMoTE保留了SAM的原始提示接口、高效推理和零样本泛化,同时仅引入少量可学习参数,以动态适应不同的模式和任务。此外,我们设计了一种渐进式提示令牌化机制,使得完全自动化分割成为可能,显著减少了对标注的依赖。在MedSeg-HQ上进行训练,该数据集的规模不到现有大型数据集的1%,SegMoTE在多种成像模式和解剖任务中实现了SOTA性能。它代表了通用分割模型在医学领域的首次高效、稳健和可扩展的适应,且在极低的标注成本下,推动了基础视觉模型在临床应用中的实际部署。
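A token-level mixture of experts usually means a per-token router weighting small expert adapters; a generic sketch of such a layer (expert count, sizes, soft rather than top-k routing, and the residual connection are illustrative assumptions, not SegMoTE's configuration):

import torch
import torch.nn as nn

class TokenMoE(nn.Module):
    def __init__(self, dim: int = 256, n_experts: int = 4, hidden: int = 64):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim). Each token gets its own soft mixture of experts,
        # letting a small adapter specialize per modality/anatomy dynamically.
        gates = self.router(tokens).softmax(dim=-1)                 # (B, N, E)
        outs = torch.stack([e(tokens) for e in self.experts], -1)   # (B, N, dim, E)
        return tokens + torch.einsum("bnde,bne->bnd", outs, gates)

y = TokenMoE()(torch.randn(2, 196, 256))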
cs.CV / 100 / 2602.19217

Questions beyond Pixels: Integrating Commonsense Knowledge in Visual Question Generation for Remote Sensing

超越像素的问题:在遥感视觉问题生成中整合常识知识
Li, Siran, Mi, Li, Castillo-Navarro, Javiera, Tuia, Devis
Abstract
With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing semantic image retrieval. However, current automatically generated questions tend to be simplistic and template-based, which hinders the deployment of question answering or visual dialogue systems for real-world applications. To enrich and diversify the questions with both image content and commonsense knowledge, we propose a Knowledge-aware Remote Sensing Visual Question Generation model (KRSVQG). The proposed model incorporates related knowledge triplets from external knowledge sources to broaden the question content, while employing image captioning as an intermediary representation to ground questions to the corresponding images. Moreover, KRSVQG utilizes a vision-language pre-training and fine-tuning strategy, enabling the model's adaptation to low data regimes. To evaluate the proposed KRSVQG model, we construct two knowledge-aware remote sensing visual question generation datasets: the NWPU-300 dataset and the TextRS-300 dataset. Evaluations, including metrics and human assessment, demonstrate that KRSVQG outperforms existing methods and leads to rich questions, grounded in both image and domain knowledge. As a key practice in vision-language research, knowledge-aware visual question generation advances the understanding of image content beyond pixels, facilitating the development of knowledge-enriched vision-language systems with vision-grounded human commonsense.
Chinese Translation
随着遥感图像档案的快速发展,针对图像提问已成为获取特定信息或进行语义图像检索的有效方式。然而,目前自动生成的问题往往过于简单且基于模板,这阻碍了问答或视觉对话系统在实际应用中的部署。为了丰富和多样化问题内容,使其既包含图像内容又融入常识知识,我们提出了一种知识感知的遥感视觉问题生成模型(KRSVQG)。该模型结合了来自外部知识源的相关知识三元组,以拓宽问题内容,同时采用图像描述作为中介表示,将问题与相应图像进行关联。此外,KRSVQG利用视觉-语言预训练和微调策略,使模型能够适应低数据环境。为了评估所提出的KRSVQG模型,我们构建了两个知识感知的遥感视觉问题生成数据集:NWPU-300数据集和TextRS-300数据集。评估结果,包括指标和人工评估,表明KRSVQG优于现有方法,并生成了丰富的问题,这些问题基于图像和领域知识。作为视觉-语言研究中的一项关键实践,知识感知的视觉问题生成推动了对图像内容的理解,超越了像素,促进了知识丰富的视觉-语言系统与视觉基础的人类常识的发展。
cs.CV / 101 / 2602.19219

Controlled Face Manipulation and Synthesis for Data Augmentation

用于数据增强的可控面部操控与合成
Kirchner, Joris, Gudi, Amogh, Bittner, Marian, Raman, Chirag
Abstract
Deep learning vision models excel with abundant supervision, but many applications face label scarcity and class imbalance. Controllable image editing can augment scarce labeled data, yet edits often introduce artifacts and entangle non-target attributes. We study this in facial expression analysis, targeting Action Unit (AU) manipulation where annotation is costly and AU co-activation drives entanglement. We present a facial manipulation method that operates in the semantic latent space of a pre-trained face generator (Diffusion Autoencoder). Using lightweight linear models, we reduce entanglement of semantic features via (i) dependency-aware conditioning that accounts for AU co-activation, and (ii) orthogonal projection that removes nuisance attribute directions (e.g., glasses), together with an expression neutralization step to enable absolute AU edits. We use these edits to balance AU occurrence by editing labeled faces and to diversify identities/demographics via controlled synthesis. Augmenting AU detector training with the generated data improves accuracy and yields more disentangled predictions with fewer co-activation shortcuts, outperforming alternative data-efficient training strategies; our learning-curve analysis suggests improvements comparable to those that would otherwise require substantially more labeled data. Compared to prior methods, our edits are stronger, produce fewer artifacts, and preserve identity better.
Chinese Translation
深度学习视觉模型在丰富的监督下表现优异,但许多应用面临标签稀缺和类别不平衡的问题。可控图像编辑可以增强稀缺的标注数据,但编辑往往会引入伪影并纠缠非目标属性。我们研究了面部表情分析中的这一问题,目标是操控动作单元(Action Unit, AU),因为其标注成本高且AU的共同激活会导致纠缠。我们提出了一种面部操控方法,该方法在预训练面部生成器(Diffusion Autoencoder)的语义潜在空间中操作。通过使用轻量级线性模型,我们通过(i)考虑AU共同激活的依赖感知条件和(ii)去除干扰属性方向(例如眼镜)的正交投影,减少了语义特征的纠缠,同时结合表情中和步骤以实现绝对AU编辑。我们利用这些编辑通过编辑标注面孔来平衡AU的出现,并通过可控合成来多样化身份/人口统计特征。用生成的数据增强AU检测器的训练提高了准确性,并产生了更少的共同激活捷径的更解耦预测,优于其他数据高效训练策略,并在我们的学习曲线分析中表明了类似于需要大量标注数据的改进。与先前的方法相比,我们的编辑更强,产生的伪影更少,并且更好地保留了身份。
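The orthogonal-projection step is plain linear algebra: subtract from an edit direction its component in the span of known nuisance directions. A minimal sketch (the random vectors below stand in for learned AU and nuisance directions):

import torch

def remove_nuisance(direction: torch.Tensor, nuisance: torch.Tensor) -> torch.Tensor:
    # direction: (D,) semantic edit direction (e.g., for one Action Unit);
    # nuisance: (K, D) attribute directions to suppress (e.g., "glasses").
    q, _ = torch.linalg.qr(nuisance.t())        # orthonormal basis of the nuisance span
    return direction - q @ (q.t() @ direction)  # subtract the component in that span

clean = remove_nuisance(torch.randn(512), torch.randn(3, 512))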
cs.CV / 102 / 2602.19224

Knowledge-aware Visual Question Generation for Remote Sensing Images

知识感知的遥感图像视觉问答生成
Li, Siran, Mi, Li, Castillo-Navarro, Javiera, Tuia, Devis
Abstract
With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing image retrieval. However, automatically generated image-based questions tend to be simplistic and template-based, which hinders the real deployment of question answering or visual dialogue systems. To enrich and diversify the questions, we propose a knowledge-aware remote sensing visual question generation model, KRSVQG, that incorporates external knowledge related to the image content to improve the quality and contextual understanding of the generated questions. The model takes an image and a related knowledge triplet from external knowledge sources as inputs and leverages image captioning as an intermediary representation to enhance the image grounding of the generated questions. To assess the performance of KRSVQG, we utilized two datasets that we manually annotated: NWPU-300 and TextRS-300. Results on these two datasets demonstrate that KRSVQG outperforms existing methods and leads to knowledge-enriched questions, grounded in both image and domain knowledge.
Chinese Translation
随着遥感图像档案的快速发展,针对图像提问已成为获取特定信息或进行图像检索的有效方式。然而,自动生成的基于图像的问题往往过于简单且基于模板,这妨碍了问答或视觉对话系统的实际应用。为了丰富和多样化问题,我们提出了一种知识感知的遥感视觉问答生成模型 KRSVQG,该模型结合了与图像内容相关的外部知识,以提高生成问题的质量和上下文理解。该模型以一幅图像和来自外部知识源的相关知识三元组作为输入,并利用图像描述作为中介表示,以增强生成问题的图像基础。为了评估 KRSVQG 的性能,我们使用了两个手动标注的数据集:NWPU-300 和 TextRS-300。这两个数据集上的结果表明,KRSVQG 超越了现有方法,生成了基于图像和领域知识的知识丰富问题。
cs.CV / 103 / 2602.19248

No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection

无需真实异常:基于MLLM的零样本视频异常检测
Dai, Zunkai, Li, Ke, Liu, Jiajia, Yang, Jie, Qiao, Yuanyuan
Abstract
The collection and detection of video anomaly data have long been a challenging problem due to its rare occurrence and spatio-temporal scarcity. Existing video anomaly detection (VAD) methods underperform in open-world scenarios. Key contributing factors include limited dataset diversity and inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies to enhance model adaptability to unseen anomaly categories. It further integrates a Multimodal Large Language Model (MLLM) to bolster semantic comprehension capabilities. Additionally, iii) we design a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and decrease computational cost. The training process is conducted solely on pseudo anomalies without any VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available at https://github.com/VitaminCreed/LAVIDA.
Chinese Translation
视频异常数据的收集和检测长期以来一直是一个具有挑战性的问题,因为其发生频率较低且时空稀缺。现有的视频异常检测(VAD)方法在开放世界场景中表现不佳。主要原因包括数据集多样性有限以及对上下文依赖的异常语义理解不足。为了解决这些问题,i) 我们提出了LAVIDA,一个端到端的零样本视频异常检测框架。ii) LAVIDA采用异常曝光采样器(Anomaly Exposure Sampler),将分割对象转化为伪异常,以增强模型对未见异常类别的适应能力。它进一步整合了多模态大型语言模型(Multimodal Large Language Model, MLLM),以增强语义理解能力。此外,iii) 我们设计了一种基于反向注意力的令牌压缩方法,以应对异常模式的时空稀缺性并降低计算成本。训练过程仅在伪异常上进行,而不使用任何VAD数据。在四个基准VAD数据集上的评估表明,LAVIDA在零样本设置下在帧级和像素级异常检测中均达到了SOTA(最先进的)性能。我们的代码可在 https://github.com/VitaminCreed/LAVIDA 获取。
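The Anomaly Exposure Sampler's core move, turning segmented objects into pseudo-anomalies, can be sketched as a masked paste into an otherwise normal frame (composition details such as placement and blending are assumptions, not the paper's procedure):

import numpy as np

def paste_pseudo_anomaly(frame: np.ndarray, obj_patch: np.ndarray,
                         obj_mask: np.ndarray, top_left: tuple) -> np.ndarray:
    # Composite a segmented object into a normal frame so the model trains on
    # synthetic anomalies instead of scarce real VAD data; the mask doubles as
    # a pixel-level pseudo-label for the pasted region.
    y, x = top_left
    h, w = obj_mask.shape
    region = frame[y:y + h, x:x + w]
    frame[y:y + h, x:x + w] = np.where(obj_mask[..., None] > 0, obj_patch, region)
    return frame

frame = np.zeros((240, 320, 3), dtype=np.uint8)
patch = np.full((32, 32, 3), 255, dtype=np.uint8)
mask = np.ones((32, 32), dtype=np.uint8)
out = paste_pseudo_anomaly(frame, patch, mask, (100, 150))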
cs.CV / 104 / 2602.19254

RegionRoute: Regional Style Transfer with Diffusion Model

RegionRoute:基于扩散模型的区域风格迁移
Chen, Bowen, Zuena, Jake, Bovik, Alan C., Kothandaraman, Divya
Abstract
Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. To our knowledge, existing diffusion models are unable to perform true localized style transfer, typically relying on handcrafted masks or multi-stage post-processing that introduce boundary artifacts and limit generalization. To address this, we propose an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss using binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. To evaluate localized stylization, we introduce the Regional Style Editing Score, which measures Regional Style Matching through CLIP-based similarity within the target region and Identity Preservation via masked LPIPS and pixel-level consistency on unedited areas. Experiments show that our method achieves mask-free, single-object style transfer at inference, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.
Chinese Translation
基于扩散的风格迁移中的精确空间控制仍然具有挑战性。这一挑战源于扩散模型将风格视为全局特征,缺乏风格表示的明确空间基础,使得难以将风格应用限制在特定对象或区域。根据我们的了解,现有的扩散模型无法进行真正的局部风格迁移,通常依赖手工制作的掩模或多阶段后处理,这会引入边界伪影并限制泛化能力。为了解决这个问题,我们提出了一种注意力监督的扩散框架,通过在训练过程中将风格标记的注意力分数与对象掩模对齐,明确指导模型在何处应用给定的风格。两个互补目标,一个基于KL散度的聚焦损失(Focus loss)和一个使用二元交叉熵的覆盖损失(Cover loss),共同促进准确的定位和密集的覆盖。模块化的LoRA-MoE设计进一步实现了高效和可扩展的多风格适应。为了评估局部风格化,我们引入了区域风格编辑分数(Regional Style Editing Score),该分数通过基于CLIP的相似性测量目标区域的区域风格匹配,并通过掩模LPIPS和未编辑区域的像素级一致性测量身份保持。实验表明,我们的方法在推理时实现了无掩模的单对象风格迁移,产生了区域准确且视觉一致的结果,超越了现有的基于扩散的编辑方法。
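The two objectives admit natural readings: Focus as a KL divergence between the mask distribution and a style token's normalized attention, and Cover as a binary cross-entropy pushing per-location attention toward the mask. A sketch under those readings (shapes and normalization choices are assumptions, not the paper's exact definitions):

import torch
import torch.nn.functional as F

def focus_loss(attn: torch.Tensor, mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # attn: (N,) raw attention scores of one style token over N locations;
    # mask: (N,) binary object mask. KL(mask_dist || attn_dist) concentrates
    # the token's attention inside the target region.
    p = mask / (mask.sum() + eps)
    q = attn.softmax(dim=-1)
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum()

def cover_loss(attn: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # BCE pushes per-location attention toward 1 inside the mask and 0 outside,
    # rewarding dense coverage rather than a single sharp peak.
    return F.binary_cross_entropy(attn.sigmoid(), mask)

attn, mask = torch.randn(4096), (torch.rand(4096) > 0.8).float()
loss = focus_loss(attn, mask) + cover_loss(attn, mask)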
cs.CV / 105 / 2602.19274

DD-CAM: Minimal Sufficient Explanations for Vision Models Using Delta Debugging

DD-CAM:使用增量调试的视觉模型最小充分解释
Khadka, Krishna, Lei, Yu, Kacker, Raghu N., Kuhn, D. Richard
Abstract
We introduce a gradient-free framework for identifying minimal, sufficient, and decision-preserving explanations in vision models by isolating the smallest subset of representational units whose joint activation preserves predictions. Unlike existing approaches that aggregate all units, often leading to cluttered saliency maps, our approach, DD-CAM, identifies a 1-minimal subset whose joint activation suffices to preserve the prediction (i.e., removing any unit from the subset alters the prediction). To efficiently isolate minimal sufficient subsets, we adapt delta debugging, a systematic reduction strategy from software debugging, and configure its search strategy based on unit interactions in the classifier head: testing individual units for models with non-interacting units and testing unit combinations for models in which unit interactions exist. We then generate minimal, prediction-preserving saliency maps that highlight only the most essential features. Our experimental evaluation demonstrates that our approach can produce more faithful explanations and achieve higher localization accuracy than the state-of-the-art CAM-based approaches.
Chinese Translation
我们提出了一种无梯度框架,通过隔离最小的表示单元子集来识别视觉模型中的最小、充分和保持决策的解释,该子集的联合激活能够保持预测。与现有方法不同,后者通常聚合所有单元,导致混乱的显著性图,我们的方法DD-CAM识别出一个1-最小子集,其联合激活足以保持预测(即,从子集中移除任何单元都会改变预测)。为了高效隔离最小充分子集,我们采用了增量调试,这是一种来自软件调试的系统化简化策略,并根据分类器头中的单元交互配置其搜索策略:对于非交互单元的模型测试单个单元,对于存在单元交互的模型测试单元组合。然后,我们生成最小的、保持预测的显著性图,仅突出最重要的特征。我们的实验评估表明,我们的方法能够产生更真实的解释,并在定位准确性上超过最先进的基于CAM的方法。
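Delta debugging's core reduction, often called ddmin, repeatedly tries to discard chunks of the candidate set while a test still passes, refining granularity until no single element can be removed. A self-contained sketch of that reduction applied to representational units (the keeps_prediction callback, which would re-run the model with only a subset of units active, is abstracted away):

from typing import Callable, List, Sequence

def ddmin(units: Sequence[int], keeps_prediction: Callable[[List[int]], bool]) -> List[int]:
    # Shrink `units` to a 1-minimal subset whose joint activation still
    # preserves the prediction: removing any single remaining unit breaks it.
    assert keeps_prediction(list(units))
    subset, n = list(units), 2
    while len(subset) >= 2:
        size = max(1, len(subset) // n)
        chunks = [subset[i:i + size] for i in range(0, len(subset), size)]
        reduced = False
        for chunk in chunks:
            complement = [u for u in subset if u not in chunk]
            if complement and keeps_prediction(complement):
                subset, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(subset):  # even single removals fail: subset is 1-minimal
                break
            n = min(len(subset), n * 2)
    return subset

# Toy check: the prediction holds iff units 2 and 5 are both active.
assert ddmin(list(range(8)), lambda s: {2, 5} <= set(s)) == [2, 5]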
cs.CV / 106 / 2602.19278

A Two-Stage Detection-Tracking Framework for Stable Apple Quality Inspection in Dense Conveyor-Belt Environments

一种适用于密集输送带环境的稳定苹果质量检测-跟踪框架
Park, Keonvin, Pal, Aditya, Mok, Jin Hong
Abstract
Industrial fruit inspection systems must operate reliably under dense multi-object interactions and continuous motion, yet most existing works evaluate detection or classification at the image level without ensuring temporal stability in video streams. We present a two-stage detection-tracking framework for stable multi-apple quality inspection in conveyor-belt environments. An orchard-trained YOLOv8 model performs apple localization, followed by ByteTrack multi-object tracking to maintain persistent identities. A ResNet18 defect classifier, fine-tuned on a healthy-defective fruit dataset, is applied to cropped apple regions. Track-level aggregation is introduced to enforce temporal consistency and reduce prediction oscillation across frames. We define video-level industrial metrics such as track-level defect ratio and temporal consistency to evaluate system robustness under realistic processing conditions. Results demonstrate improved stability compared to frame-wise inference, suggesting that integrating tracking is essential for practical automated fruit grading systems.
Chinese Translation
工业水果检测系统必须在密集的多物体交互和连续运动下可靠运行,然而大多数现有研究仅在图像层面评估检测或分类,而未确保视频流的时间稳定性。我们提出了一种适用于输送带环境的稳定多苹果质量检测的两阶段检测-跟踪框架。经过果园训练的YOLOv8模型执行苹果定位,随后采用ByteTrack多物体跟踪以维持持续的身份。对经过裁剪的苹果区域应用经过微调的ResNet18缺陷分类器,该分类器基于健康-缺陷水果数据集进行训练。引入轨迹级聚合以强制执行时间一致性,并减少帧间预测波动。我们定义了视频级工业指标,如轨迹级缺陷比率和时间一致性,以评估系统在实际处理条件下的鲁棒性。结果表明,与逐帧推断相比,稳定性有所改善,表明集成跟踪对于实际自动化水果分级系统至关重要。
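Track-level aggregation can be as simple as pooling per-frame classifier votes by ByteTrack identity; a minimal sketch in which the majority-vote threshold is an assumption and the defective-vote fraction doubles as a track-level defect ratio:

from collections import defaultdict

def track_level_labels(frame_outputs):
    # frame_outputs: iterable of (track_id, is_defective) pairs, one per detection,
    # with track_id coming from the multi-object tracker. Majority voting per
    # track suppresses frame-to-frame prediction oscillation.
    votes = defaultdict(list)
    for track_id, is_defective in frame_outputs:
        votes[track_id].append(bool(is_defective))
    return {t: {"defect_ratio": sum(v) / len(v), "defective": sum(v) / len(v) >= 0.5}
            for t, v in votes.items()}

labels = track_level_labels([(1, True), (1, True), (1, False), (2, False)])
# {1: {'defect_ratio': 0.667, 'defective': True}, 2: {'defect_ratio': 0.0, 'defective': False}}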
cs.CV / 107 / 2602.19285

MRI Contrast Enhancement Kinetics World Model

MRI对比增强动力学世界模型
Kong, Jindi, He, Yuting, Xia, Cong, Ge, Rongjun, Li, Shuo
Abstract
Clinical MRI contrast acquisition suffers from inefficient information yield, which presents as a mismatch between the risky and costly acquisition protocol and the fixed and sparse acquisition sequence. Applying world models to simulate the contrast enhancement kinetics in the human body enables continuous contrast-free dynamics. However, the low temporal resolution in MRI acquisition restricts the training of world models, leading to a sparsely sampled dataset. Directly training a generative model to capture the kinetics leads to two limitations: (a) Due to the absence of data on missing time, the model tends to overfit to irrelevant features, leading to content distortion. (b) Due to the lack of continuous temporal supervision, the model fails to learn the continuous kinetics law over time, causing temporal discontinuities. For the first time, we propose MRI Contrast Enhancement Kinetics World model (MRI CEKWorld) with SpatioTemporal Consistency Learning (STCL). For (a), guided by the spatial law that patient-level structures remain consistent during enhancement, we propose Latent Alignment Learning (LAL) that constructs a patient-specific template to constrain contents to align with this template. For (b), guided by the temporal law that the kinetics follow a consistent smooth trend, we propose Latent Difference Learning (LDL) which extends the unobserved intervals by interpolation and constrains smooth variations in the latent space among interpolated sequences. Extensive experiments on two datasets show our MRI CEKWorld achieves better realistic contents and kinetics. Codes will be available at https://github.com/DD0922/MRI-Contrast-Enhancement-Kinetics-World-Model.
Chinese Translation
临床MRI对比获取存在信息产出效率低下的问题,这表现为风险高且成本昂贵的获取协议与固定且稀疏的获取序列之间的不匹配。应用世界模型模拟人体内的对比增强动力学能够实现连续的无对比动态。然而,MRI获取中的低时间分辨率限制了世界模型的训练,导致数据集采样稀疏。直接训练生成模型以捕捉动力学存在两个局限性:(a) 由于缺乏缺失时间的数据,模型倾向于过拟合无关特征,导致内容失真。(b) 由于缺乏连续的时间监督,模型未能学习随时间变化的连续动力学规律,导致时间不连续。我们首次提出了MRI对比增强动力学世界模型(MRI CEKWorld),结合时空一致性学习(STCL)。针对(a),在患者级结构在增强过程中保持一致的空间规律指导下,我们提出了潜在对齐学习(LAL),构建患者特定模板以约束内容与该模板对齐。针对(b),在动力学遵循一致平滑趋势的时间规律指导下,我们提出了潜在差异学习(LDL),通过插值扩展未观察到的时间间隔,并约束插值序列中潜在空间的平滑变化。在两个数据集上的大量实验表明,我们的MRI CEKWorld实现了更好的真实内容和动力学。代码将发布在 https://github.com/DD0922/MRI-Contrast-Enhancement-Kinetics-World-Model。
cs.CV / 108 / 2602.19314

IPv2: An Improved Image Purification Strategy for Real-World Ultra-Low-Dose Lung CT Denoising

IPv2:一种改进的图像净化策略用于真实世界超低剂量肺部CT去噪
Gong, Guoliang, Yu, Man
Abstract
The image purification strategy constructs an intermediate distribution with aligned anatomical structures, which effectively corrects the spatial misalignment between real-world ultra-low-dose CT and normal-dose CT images and significantly enhances the structural preservation ability of denoising models. However, this strategy exhibits two inherent limitations. First, it suppresses noise only in the chest wall and bone regions while leaving the image background untreated. Second, it lacks a dedicated mechanism for denoising the lung parenchyma. To address these issues, we systematically redesign the original image purification strategy and propose an improved version termed IPv2. The proposed strategy introduces three core modules, namely Remove Background, Add noise, and Remove noise. These modules endow the model with denoising capability in both background and lung tissue regions during training data construction and provide a more reasonable evaluation protocol through refined label construction at the testing stage. Extensive experiments on our previously established real-world patient lung CT dataset acquired at 2% radiation dose demonstrate that IPv2 consistently improves background suppression and lung parenchyma restoration across multiple mainstream denoising models. The code is publicly available at https://github.com/MonkeyDadLufy/Image-Purification-Strategy-v2.
Chinese Translation
图像净化策略构建了一个具有对齐解剖结构的中间分布,有效地纠正了真实世界超低剂量CT与正常剂量CT图像之间的空间错位,并显著增强了去噪模型的结构保留能力。然而,该策略存在两个固有的局限性。首先,它仅在胸壁和骨骼区域抑制噪声,而对图像背景未进行处理。其次,它缺乏专门的机制来去噪肺实质。为了解决这些问题,我们系统地重新设计了原始图像净化策略,并提出了一个改进版本,称为IPv2。该策略引入了三个核心模块,即去除背景(Remove Background)、添加噪声(Add noise)和去除噪声(Remove noise)。这些模块使模型在训练数据构建过程中具备了对背景和肺组织区域的去噪能力,并通过在测试阶段精细的标签构建提供了更合理的评估协议。在我们之前建立的真实世界患者肺部CT数据集上进行的大规模实验表明,IPv2在多个主流去噪模型中始终改善了背景抑制和肺实质恢复。代码可在 https://github.com/MonkeyDadLufy/Image-Purification-Strategy-v2 获取。
cs.CV / 109 / 2602.19316

Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition

关注CTC:快速且稳健的统一语音识别伪标注
Haliassos, Alexandros, Mira, Rodrigo, Petridis, Stavros
Abstract
Unified Speech Recognition (USR) has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition, achieving state-of-the-art results on in-distribution benchmarks. However, its reliance on autoregressive pseudo-labelling makes training expensive, while its decoupled supervision of CTC and attention branches increases susceptibility to self-reinforcing errors, particularly under distribution shifts involving longer sequences, noise, or unseen domains. We propose CTC-driven teacher forcing, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets in a single forward pass. Although these can be globally incoherent, in the pseudo-labelling setting they enable efficient and effective knowledge transfer. Because CTC and CTC-driven attention pseudo-labels have the same length, the decoder can predict both simultaneously, benefiting from the robustness of CTC and the expressiveness of attention without costly beam search. We further propose mixed sampling to mitigate the exposure bias of the decoder relying solely on CTC inputs. The resulting method, USR 2.0, halves training time, improves robustness to out-of-distribution inputs, and achieves state-of-the-art results on LRS3, LRS2, and WildVSR, surpassing USR and modality-specific self-supervised baselines.
Chinese Translation
统一语音识别(USR)作为一种半监督框架,已成为训练单一模型以实现音频、视觉和视听语音识别的有效方法,在分布内基准测试中取得了最先进的结果。然而,其对自回归伪标注的依赖使得训练成本高昂,同时CTC与注意力分支的解耦监督增加了对自我强化错误的敏感性,尤其是在涉及较长序列、噪声或未见领域的分布变化下。我们提出了基于CTC的教师强制方法,其中贪婪解码的CTC伪标签被输入到解码器中,以在单次前向传递中生成注意力目标。尽管这些伪标签在全局上可能不一致,但在伪标注设置中,它们能够实现高效且有效的知识转移。由于CTC和基于CTC的注意力伪标签具有相同的长度,解码器可以同时预测两者,利用CTC的稳健性和注意力的表达能力,而无需昂贵的束搜索。我们进一步提出了混合采样,以减轻解码器仅依赖CTC输入所带来的曝光偏差。最终的方法USR 2.0将训练时间减半,提高了对分布外输入的稳健性,并在LRS3、LRS2和WildVSR上取得了最先进的结果,超越了USR和特定模态的自监督基线。
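The greedy CTC decoding that drives the teacher forcing above is standard and compact enough to show. A minimal sketch follows: take the argmax class per frame, merge repeats, and drop blanks. Variable names are our own, and this is only the decoding step, not the USR 2.0 training loop.

```python
import torch

def greedy_ctc_decode(logits, blank=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks.
    Sequences decoded this way would serve as decoder teacher-forcing
    inputs in a single forward pass."""
    ids = logits.argmax(dim=-1).tolist()   # (T,) best class per frame
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# toy example: T=6 frames over 4 classes (class 0 = blank)
logits = torch.tensor([
    [5., 0., 0., 0.],   # blank
    [0., 5., 0., 0.],   # 1
    [0., 5., 0., 0.],   # 1 (repeat, merged)
    [5., 0., 0., 0.],   # blank
    [0., 0., 0., 5.],   # 3
    [0., 0., 0., 5.],   # 3 (repeat, merged)
])
print(greedy_ctc_decode(logits))  # [1, 3]
```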
cs.CV / 110 / 2602.19322

US-JEPA: A Joint Embedding Predictive Architecture for Medical Ultrasound

US-JEPA:一种用于医学超声的联合嵌入预测架构
Radhachandran, Ashwath, Ivezić, Vedrana, Athreya, Shreeram, Anilkumar, Ronit, Arnold, Corey W., Speier, William
Abstract
Ultrasound (US) imaging poses unique challenges for representation learning due to its inherently noisy acquisition process. The low signal-to-noise ratio and stochastic speckle patterns hinder standard self-supervised learning methods relying on a pixel-level reconstruction objective. Joint-Embedding Predictive Architectures (JEPAs) address this drawback by predicting masked latent representations rather than raw pixels. However, standard approaches depend on hyperparameter-brittle and computationally expensive online teachers updated via exponential moving average. We propose US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective. By using a frozen, domain-specific teacher to provide stable latent targets, US-JEPA decouples student-teacher optimization and pushes the student to expand upon the semantic priors of the teacher. In addition, we provide the first rigorous comparison of all publicly available state-of-the-art ultrasound foundation models on UltraBench, a public dataset benchmark spanning multiple organs and pathological conditions. Under linear probing for diverse classification tasks, US-JEPA achieves performance competitive with or superior to domain-specific and universal vision foundation model baselines. Our results demonstrate that masked latent prediction provides a stable and efficient path toward robust ultrasound representations.
Chinese Translation
超声(US)成像由于其固有的噪声采集过程,对表示学习提出了独特的挑战。低信噪比和随机散斑模式阻碍了依赖于像素级重建目标的标准自监督学习方法。联合嵌入预测架构(JEPAs)通过预测掩蔽的潜在表示而非原始像素来解决这一缺陷。然而,标准方法依赖于超参数脆弱且计算开销大的在线教师,这些教师通过指数移动平均进行更新。我们提出了US-JEPA,这是一种自监督框架,采用静态教师不对称潜在训练(SALT)目标。通过使用一个冻结的、特定领域的教师来提供稳定的潜在目标,US-JEPA解耦了学生-教师优化,并推动学生扩展教师的语义先验。此外,我们首次对所有公开可用的超声基础模型在UltraBench(一个涵盖多个器官和病理条件的公共数据集基准)上进行了严格比较。在针对多样分类任务的线性探测下,US-JEPA的性能与特定领域和通用视觉基础模型基线相当或更优。我们的结果表明,掩蔽潜在预测为稳健的超声表示提供了一条稳定且高效的路径。
cs.CV / 111 / 2602.19323

DefenseSplat: Enhancing the Robustness of 3D Gaussian Splatting via Frequency-Aware Filtering

DefenseSplat:通过频率感知过滤增强3D高斯溅射的鲁棒性
Qiao, Yiran, Lu, Yiren, Zhou, Yunlai, Yang, Rui, Hou, Linlin, Yin, Yu, Ma, Jing
Abstract
3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for real-time and high-fidelity 3D reconstruction from posed images. However, recent studies reveal its vulnerability to adversarial corruptions in input views, where imperceptible yet consistent perturbations can drastically degrade rendering quality, increase training and rendering time, and inflate memory usage, even leading to server denial-of-service. In our work, to mitigate this issue, we begin by analyzing the distinct behaviors of adversarial perturbations in the low- and high-frequency components of input images using wavelet transforms. Based on this observation, we design a simple yet effective frequency-aware defense strategy that reconstructs training views by filtering high-frequency noise while preserving low-frequency content. This approach effectively suppresses adversarial artifacts while maintaining the authenticity of the original scene. Notably, it does not significantly impair training on clean data, achieving a desirable trade-off between robustness and performance on clean inputs. Through extensive experiments under a wide range of attack intensities on multiple benchmarks, we demonstrate that our method substantially enhances the robustness of 3DGS without access to clean ground-truth supervision. By highlighting and addressing the overlooked vulnerabilities of 3D Gaussian Splatting, our work paves the way for more robust and secure 3D reconstructions.
Chinese Translation
3D高斯溅射(3DGS)已成为从姿态图像进行实时和高保真3D重建的强大范式。然而,最近的研究揭示了其在输入视图中对对抗性干扰的脆弱性,其中不可察觉但一致的扰动可以显著降低渲染质量,增加训练和渲染时间,并增加内存使用,甚至导致服务器拒绝服务。在我们的研究中,为了缓解这一问题,我们首先通过小波变换分析输入图像中低频和高频成分的对抗性扰动的不同表现。基于这一观察,我们设计了一种简单而有效的频率感知防御策略,通过过滤高频噪声同时保留低频内容来重建训练视图。这种方法有效抑制了对抗性伪影,同时保持了原始场景的真实性。值得注意的是,它并未显著损害在干净数据上的训练,实现了鲁棒性与干净输入性能之间的理想权衡。通过在多个基准测试下对各种攻击强度进行广泛实验,我们证明了我们的方法在没有干净真实监督的情况下显著增强了3DGS的鲁棒性。通过突出并解决3D高斯溅射被忽视的脆弱性,我们的工作为更鲁棒和安全的3D重建铺平了道路。
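A frequency-aware filter of the kind described can be sketched with PyWavelets: decompose each training view, attenuate the high-frequency detail bands where the perturbation concentrates, and reconstruct. This is an illustrative filter under assumed settings (`db4`, two levels), not the authors' exact pipeline.

```python
import numpy as np
import pywt  # PyWavelets

def suppress_high_freq(img, wavelet="db4", level=2, keep=0.0):
    """Attenuate high-frequency wavelet detail bands while preserving the
    low-frequency approximation. `keep` in [0, 1] scales the detail
    coefficients; 0 removes them entirely."""
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    filtered = [coeffs[0]]                      # low-frequency band preserved
    for (cH, cV, cD) in coeffs[1:]:
        filtered.append((keep * cH, keep * cV, keep * cD))
    return pywt.waverec2(filtered, wavelet)

noisy_view = np.random.rand(256, 256)           # stand-in for a perturbed view
clean_view = suppress_high_freq(noisy_view)
print(clean_view.shape)                          # (256, 256)
```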
cs.CV / 112 / 2602.19324

RetinaVision: XAI-Driven Augmented Regulation for Precise Retinal Disease Classification using deep learning framework

RetinaVision:基于可解释人工智能的增强调节用于精确视网膜疾病分类的深度学习框架
Noor, Mohammad Tahmid, Abrar, Shayan, Mahi, Jannatul Adan, Mia, Md Parvez, Hridoy, Asaduzzaman, Ghosh, Samanta
Abstract
Early and accurate classification of retinal diseases is critical to counter vision loss and for guiding clinical management of retinal diseases. In this study, we proposed a deep learning method for retinal disease classification utilizing optical coherence tomography (OCT) images from the Retinal OCT Image Classification - C8 dataset (comprising 24,000 labeled images spanning eight conditions). Images were resized to 224x224 px and tested on convolutional neural network (CNN) architectures: Xception and InceptionV3. Data augmentation techniques (CutMix, MixUp) were employed to enhance model generalization. Additionally, we applied GradCAM and LIME for interpretability evaluation. We implemented this in a real-world scenario via our web application named RetinaVision. This study found that Xception was the most accurate network (95.25%), followed closely by InceptionV3 (94.82%). These results suggest that deep learning methods allow effective OCT retinal disease classification and highlight the importance of implementing accuracy and interpretability for clinical applications.
Chinese Translation
早期和准确的视网膜疾病分类对于防止视力丧失和指导视网膜疾病的临床管理至关重要。在本研究中,我们提出了一种利用光学相干断层扫描(OCT)图像进行视网膜疾病分类的深度学习方法,该方法使用了视网膜OCT图像分类 - C8数据集(包含24,000张涵盖八种病症的标记图像)。图像被调整为224x224像素,并在卷积神经网络(CNN)架构上进行了测试:Xception和InceptionV3。采用数据增强技术(CutMix,MixUp)来提高模型的泛化能力。此外,我们应用了GradCAM和LIME进行可解释性评估。我们通过名为RetinaVision的网络应用程序在实际场景中实现了这一方法。本研究发现,Xception是最准确的网络(95.25%),紧随其后的是InceptionV3(94.82%)。这些结果表明,深度学习方法能够有效进行OCT视网膜疾病分类,并强调了在临床应用中实施准确性和可解释性的重要性。
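Of the two augmentations the paper pairs, MixUp is the simpler to sketch: blend two images and their one-hot labels with a Beta-sampled weight. Shapes and names below are our own illustration, not the study's training code.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng()):
    """MixUp: convex combination of two samples and their one-hot labels,
    with the mixing weight drawn from Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

img_a, img_b = np.random.rand(224, 224, 3), np.random.rand(224, 224, 3)
lab_a, lab_b = np.eye(8)[2], np.eye(8)[5]        # 8 retinal conditions, one-hot
mixed_img, mixed_lab = mixup(img_a, lab_a, img_b, lab_b)
print(mixed_lab)                                  # soft label over classes 2 and 5
```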
cs.CV / 113 / 2602.19348

MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose

MultiDiffSense:以物体形状和接触姿态为条件的基于扩散的多模态视觉-触觉图像生成
Bhouri, Sirine, Wei, Lan, Zheng, Jian-Qing, Zhang, Dandan
Abstract
Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.
Chinese Translation
获取对齐的视觉-触觉数据集既缓慢又昂贵,需要专门的硬件和大规模的数据收集。合成生成具有潜力,但先前的方法通常是单一模态,限制了跨模态学习。我们提出了MultiDiffSense,一种统一的扩散模型,在单一架构中为多种基于视觉的触觉传感器(ViTac、TacTip、ViTacTip)合成图像。我们的方法使用基于CAD导出的、姿态对齐的深度图和编码传感器类型及4自由度接触姿态的结构化提示进行双重条件化,从而实现可控的、物理一致的多模态合成。在8个物体(5个已见,3个新颖)和未见姿态的评估中,MultiDiffSense在结构相似性指数(SSIM)上超越了Pix2Pix条件生成对抗网络(cGAN)基线,分别提高了+36.3%(ViTac)、+134.6%(ViTacTip)和+64.7%(TacTip)。对于下游的3自由度姿态估计,将50%的合成数据与50%的真实数据混合可以将所需的真实数据量减半,同时保持竞争力的性能。MultiDiffSense缓解了触觉感知中的数据收集瓶颈,并为机器人应用提供了可扩展、可控的多模态数据集生成。
cs.CV / 114 / 2602.19349

UP-Fuse: Uncertainty-guided LiDAR-Camera Fusion for 3D Panoptic Segmentation

UP-Fuse:基于不确定性引导的激光雷达-相机融合用于3D全景分割
Mohan, Rohit, Drews, Florian, Miron, Yakov, Cattaneo, Daniele, Valada, Abhinav
Abstract
LiDAR-camera fusion enhances 3D panoptic segmentation by leveraging camera images to complement sparse LiDAR scans, but it also introduces a critical failure mode. Under adverse conditions, degradation or failure of the camera sensor can significantly compromise the reliability of the perception system. To address this problem, we introduce UP-Fuse, a novel uncertainty-aware fusion framework in the 2D range-view that remains robust under camera sensor degradation, calibration drift, and sensor failure. Raw LiDAR data is first projected into the range-view and encoded by a LiDAR encoder, while camera features are simultaneously extracted and projected into the same shared space. At its core, UP-Fuse employs an uncertainty-guided fusion module that dynamically modulates cross-modal interaction using predicted uncertainty maps. These maps are learned by quantifying representational divergence under diverse visual degradations, ensuring that only reliable visual cues influence the fused representation. The fused range-view features are decoded by a novel hybrid 2D-3D transformer that mitigates spatial ambiguities inherent to the 2D projection and directly predicts 3D panoptic segmentation masks. Extensive experiments on Panoptic nuScenes, SemanticKITTI, and our introduced Panoptic Waymo benchmark demonstrate the efficacy and robustness of UP-Fuse, which maintains strong performance even under severe visual corruption or misalignment, making it well suited for robotic perception in safety-critical settings.
Chinese Translation
激光雷达-相机融合通过利用相机图像来补充稀疏的激光雷达扫描,从而增强3D全景分割,但这也引入了一种关键的失效模式。在不利条件下,相机传感器的退化或故障可能会显著影响感知系统的可靠性。为了解决这个问题,我们提出了UP-Fuse,一种新颖的不确定性感知融合框架,能够在相机传感器退化、校准漂移和传感器故障的情况下保持鲁棒性。原始激光雷达数据首先被投影到范围视图中,并通过激光雷达编码器进行编码,同时提取相机特征并投影到相同的共享空间。UP-Fuse的核心是一个不确定性引导的融合模块,它使用预测的不确定性图动态调节跨模态交互。这些图通过量化在不同视觉退化下的表示差异来学习,确保只有可靠的视觉线索影响融合表示。融合后的范围视图特征通过一种新颖的混合2D-3D变换器进行解码,该变换器减轻了2D投影固有的空间模糊性,并直接预测3D全景分割掩膜。在Panoptic nuScenes、SemanticKITTI和我们引入的Panoptic Waymo基准上的广泛实验表明,UP-Fuse的有效性和鲁棒性,即使在严重的视觉损坏或失配情况下也能保持强大的性能,使其非常适合于安全关键环境中的机器人感知。
cs.CV / 115 / 2602.19350

PoseCraft: Tokenized 3D Body Landmark and Camera Conditioning for Photorealistic Human Image Synthesis

PoseCraft:基于标记的3D身体关键点与相机条件的真实感人像合成
Guo, Zhilin, Yang, Jing, Fogarty, Kyle, Wan, Jingyi, Zhang, Boqiao, Wu, Tianhao, Xia, Weihao, Zhou, Chenliang, Khattar, Sakar, Zhong, Fangcheng, Vasconcelos, Cristina Nader, Oztireli, Cengiz
Abstract
Digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera controls are central to VR, telepresence, and entertainment. Existing skinning-based workflows require laborious manual rigging or template-based fittings, while neural volumetric methods rely on canonical templates and re-optimization for each unseen pose. We present PoseCraft, a diffusion framework built around tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, we encode sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and inject them into diffusion via cross-attention. Our approach preserves 3D semantics by avoiding 2D re-projection ambiguity under large pose and viewpoint changes, and produces photorealistic imagery that faithfully captures identity and appearance. To train and evaluate at scale, we also implement GenHumanRF, a data generation workflow that renders diverse supervision from volumetric reconstructions. Our experiments show that PoseCraft achieves significant perceptual quality improvement over diffusion-centric methods, and attains better or comparable metrics to latest volumetric rendering SOTA while better preserving fabric and hair details.
Chinese Translation
数字化人类并合成具有明确3D姿态和相机控制的真实感虚拟形象是虚拟现实、远程存在和娱乐领域的核心。现有的基于蒙皮的工作流程需要繁琐的手动绑定或基于模板的适配,而神经体积方法则依赖于规范模板,并对每个未见姿态进行重新优化。我们提出了PoseCraft,这是一种基于标记的3D接口的扩散框架:我们不仅依赖于光栅化几何体作为2D控制图像,而是将稀疏的3D关键点和相机外参编码为离散的条件标记,并通过交叉注意力将其注入到扩散过程中。我们的方法通过避免在大幅姿态和视点变化下的2D重投影模糊,保持了3D语义,并生成真实感图像,忠实捕捉身份和外观。为了在大规模上进行训练和评估,我们还实现了GenHumanRF,这是一种从体积重建中渲染多样化监督的数据生成工作流程。我们的实验表明,PoseCraft在感知质量上显著优于以扩散为中心的方法,并在保持织物和头发细节方面表现更佳,达到了最新体积渲染技术的更好或可比指标。
cs.CV / 116 / 2602.19357

MentalBlackboard: Evaluating Spatial Visualization via Mathematical Transformations

MentalBlackboard:通过数学变换评估空间可视化能力
Yilmaz, Nilay, Patel, Maitreya, Kusumba, Naga Sai Abhiram, He, Yixuan, Yang, Yezhou
Abstract
Spatial visualization is the mental ability to imagine, transform, and manipulate the spatial characteristics of objects and actions. This intelligence is a part of human cognition where actions and perception are connected on a mental level. To explore whether state-of-the-art Vision-Language Models (VLMs) exhibit this ability, we develop MentalBlackboard, an open-ended spatial visualization benchmark for Paper Folding and Hole Punching tests within two core tasks: prediction and planning. Our prediction experiments reveal that models struggle with applying symmetrical transformations, even when they predict the sequence of unfolding steps correctly. Also, rotations introduce a significant challenge to the physical situational awareness for models. The planning task reveals limitations of models in analyzing symmetrical relationships and in implementing the multi-stage symmetry process, with Claude Opus 4.1 achieving the highest planning score at an accuracy of 10%. The top-performing model, o3, attains a peak performance of 71.6% on the generalization task, which does not require spatial visualization but transfers spatial data; however, it achieves only 25% accuracy on text-based prediction tasks.
Chinese Translation
空间可视化是指想象、变换和操控物体及其动作空间特征的心理能力。这种智能是人类认知的一部分,其中动作与感知在心理层面上相互连接。为了探讨最先进的视觉-语言模型(Vision-Language Models, VLMs)是否具备这种能力,我们开发了MentalBlackboard,一个开放式空间可视化基准,涵盖纸折叠和打孔测试的两个核心任务:预测和规划。我们的预测实验表明,模型在应用对称变换时面临困难,即使它们能够正确预测展开步骤的顺序。此外,旋转对模型的物理情境意识构成了显著挑战。规划任务揭示了模型在分析对称关系和实施多阶段对称过程方面的局限性,其中Claude Opus 4.1在规划任务中以10%的准确率获得最高得分。表现最佳的模型o3在不需要空间可视化但需转移空间数据的泛化任务中达到了71.6%的峰值表现,但在基于文本的预测任务中仅获得25%的准确率。
cs.CV / 117 / 2602.19358

Referring Layer Decomposition

引用层分解
Chen, Fangyi, Shen, Yaojie, Xu, Lu, Yuan, Ye, Zhang, Shu, Niu, Yulei, Wen, Longyin
Abstract
Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet, most existing approaches operate on entire images holistically, limiting the ability to isolate and manipulate individual scene elements. In contrast, layered representations, where scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge this gap and enable both compositional understanding and controllable editing, we introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image, conditioned on flexible user prompts, such as spatial inputs (e.g., points, boxes, masks), natural language descriptions, or combinations thereof. At the core is the RefLade, a large-scale dataset comprising 1.11M image-layer-prompt triplets produced by our scalable data engine, along with 100K manually curated, high-fidelity layers. Coupled with a perceptually grounded, human-preference-aligned automatic evaluation protocol, RefLade establishes RLD as a well-defined and benchmarkable research task. Building on this foundation, we present RefLayer, a simple baseline designed for prompt-conditioned layer decomposition, achieving high visual fidelity and semantic alignment. Extensive experiments show our approach enables effective training, reliable evaluation, and high-quality image decomposition, while exhibiting strong zero-shot generalization capabilities.
Chinese Translation
对视觉内容进行精确的、对象感知的控制对于高级图像编辑和组合生成至关重要。然而,大多数现有方法都是整体处理整个图像,限制了对单个场景元素的隔离和操作能力。相比之下,分层表示将场景明确分离为对象、环境背景和视觉效果,提供了一个更直观和结构化的框架来解释和编辑视觉内容。为了弥补这一差距,并实现组合理解和可控编辑,我们引入了引用层分解(Referring Layer Decomposition, RLD)任务,该任务从单个RGB图像中预测完整的RGBA层,条件是灵活的用户提示,例如空间输入(如点、框、掩膜)、自然语言描述或其组合。核心是RefLade,这是一个大规模数据集,包含由我们的可扩展数据引擎生成的1.11M图像-层-提示三元组,以及10万个人工策划的高保真层。结合一个基于感知的、与人类偏好对齐的自动评估协议,RefLade将RLD确立为一个明确定义且可基准化的研究任务。在此基础上,我们提出了RefLayer,这是一个为提示条件层分解设计的简单基线,能够实现高视觉保真度和语义对齐。大量实验表明,我们的方法能够有效训练、可靠评估,并实现高质量的图像分解,同时展现出强大的零样本泛化能力。
cs.CV / 118 / 2602.19380

Detector-in-the-Loop Tracking: Active Memory Rectification for Stable Glottic Opening Localization

环路内检测追踪:稳定声门定位的主动记忆校正
Wang, Huayu, Alattar, Bahaa, Yang, Cheng-Yen, Huang, Hsiang-Wei, Kim, Jung Heon, Shapiro, Linda, White, Nathan, Hwang, Jenq-Neng
Abstract
Temporal stability in glottic opening localization remains challenging due to the complementary weaknesses of single-frame detectors and foundation-model trackers: the former lacks temporal context, while the latter suffers from memory drift. Specifically, in video laryngoscopy, rapid tissue deformation, occlusions, and visual ambiguities in emergency settings require a robust, temporally aware solution that can prevent progressive tracking errors. We propose Closed-Loop Memory Correction (CL-MC), a detector-in-the-loop framework that supervises Segment Anything Model 2 (SAM2) through confidence-aligned state decisions and active memory rectification. High-confidence detections trigger semantic resets that overwrite corrupted tracker memory, effectively mitigating drift accumulation with a training-free foundation tracker in complex endoscopic scenes. On emergency intubation videos, CL-MC achieves state-of-the-art performance, significantly reducing drift and missing rate compared with the SAM2 variants and open-loop based methods. Our results establish memory correction as a crucial component for reliable clinical video tracking. Our code will be available at https://github.com/huayuww/CL-MR.
Chinese Translation
声门定位的时间稳定性仍然面临挑战,这主要源于单帧检测器和基础模型追踪器的互补弱点:前者缺乏时间上下文,而后者则受到记忆漂移的影响。具体而言,在视频喉镜检查中,快速的组织变形、遮挡和紧急情况下的视觉歧义要求一种稳健的、具有时间感知能力的解决方案,以防止逐步的追踪错误。我们提出了闭环记忆校正(Closed-Loop Memory Correction, CL-MC),这是一种环路内检测框架,通过置信度对齐的状态决策和主动记忆校正来监督Segment Anything Model 2(SAM2)。高置信度的检测触发语义重置,覆盖损坏的追踪器记忆,有效减轻漂移累积,并在复杂的内窥镜场景中使用无训练的基础追踪器。在紧急插管视频中,CL-MC实现了最先进的性能,与SAM2变体和基于开环的方法相比,显著降低了漂移和漏检率。我们的结果确立了记忆校正作为可靠临床视频追踪的重要组成部分。我们的代码将发布在https://github.com/huayuww/CL-MR。
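The closed-loop control flow implied by the abstract can be summarized in a few lines. The sketch below shows the idea only: `detector`, `tracker`, and their methods are hypothetical interfaces standing in for the single-frame detector and a SAM2-style tracker, not the released CL-MC API.

```python
def track_with_memory_correction(frames, detector, tracker, conf_thresh=0.9):
    """Detector-in-the-loop tracking: the tracker propagates the mask frame
    to frame; whenever the single-frame detector fires with high confidence,
    its detection overwrites the tracker's memory (a semantic reset) so that
    drift cannot accumulate. All interfaces here are hypothetical."""
    masks = []
    for frame in frames:
        box, conf = detector(frame)             # single-frame detection + confidence
        if conf >= conf_thresh:
            tracker.reset_memory(frame, box)    # active memory rectification
        masks.append(tracker.propagate(frame))  # temporal propagation
    return masks
```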
cs.CV / 119 / 2602.19385

Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition

基于多臂赌博机的自适应数据增强:隐式模式识别的样本高效嵌入校准
Tang, Minxue, Yu, Yangyang, Ding, Aolin, Pouyan, Maziyar Baran, Belkhouja, Taha, Bao, Yujia
Abstract
Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI. However, tackling long-tail pattern recognition tasks remains challenging for current pre-trained foundation models such as LLMs and VLMs. While finetuning pre-trained models can improve accuracy in recognizing implicit patterns, it is usually infeasible due to a lack of training data and high computational overhead. In this paper, we propose ADAMAB, an efficient embedding calibration framework for few-shot pattern recognition. To maximally reduce the computational costs, ADAMAB trains embedder-agnostic light-weight calibrators on top of fixed embedding models without accessing their parameters. To mitigate the need for large-scale training data, we introduce an adaptive data augmentation strategy based on the Multi-Armed Bandit (MAB) mechanism. With a modified upper confidence bound algorithm, ADAMAB diminishes the gradient shifting and offers theoretically guaranteed convergence in few-shot training. Our multi-modal experiments justify the superior performance of ADAMAB, with up to 40% accuracy improvement when training with less than 5 initial data samples of each class.
Chinese Translation
在现代人工智能的许多实际应用中,识别隐式视觉和文本模式至关重要。然而,对于当前的预训练基础模型,如大型语言模型(LLMs)和视觉语言模型(VLMs),处理长尾模式识别任务仍然具有挑战性。虽然微调预训练模型可以提高隐式模式识别的准确性,但由于缺乏训练数据和高计算开销,通常不可行。本文提出了ADAMAB,一个用于少样本模式识别的高效嵌入校准框架。为了最大限度地降低计算成本,ADAMAB在固定嵌入模型的基础上训练与嵌入无关的轻量级校准器,而无需访问其参数。为了减少对大规模训练数据的需求,我们引入了一种基于多臂赌博机(Multi-Armed Bandit, MAB)机制的自适应数据增强策略。通过修改的上置信界算法,ADAMAB减少了梯度偏移,并在少样本训练中提供了理论上保证的收敛性。我们的多模态实验证明了ADAMAB的优越性能,在每个类别少于5个初始数据样本的情况下,准确率提高了多达40%。
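The bandit mechanism at the heart of this approach is easy to illustrate. Below is the textbook upper-confidence-bound rule over a set of augmentation operators; the paper uses a modified UCB with convergence guarantees, so treat this only as a sketch of the selection loop, with illustrative arm names and a stand-in reward.

```python
import math
import random

class UCBAugmentationSelector:
    """UCB bandit over augmentation operators. Reward should be a scalar
    training signal, e.g. validation gain after applying the chosen arm."""

    def __init__(self, augmentations, c=1.0):
        self.augs = augmentations
        self.c = c
        self.counts = [0] * len(augmentations)
        self.values = [0.0] * len(augmentations)
        self.t = 0

    def select(self):
        self.t += 1
        for i, n in enumerate(self.counts):
            if n == 0:
                return i                        # play each arm once first
        ucb = [v + self.c * math.sqrt(2 * math.log(self.t) / n)
               for v, n in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, i, reward):
        self.counts[i] += 1
        self.values[i] += (reward - self.values[i]) / self.counts[i]  # running mean

bandit = UCBAugmentationSelector(["flip", "crop", "color-jitter"])
for _ in range(20):
    arm = bandit.select()
    bandit.update(arm, random.random())          # stand-in for a real reward
print(bandit.counts)
```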
cs.CV / 120 / 2602.19412

Redefining the Down-Sampling Scheme of U-Net for Precision Biomedical Image Segmentation

重新定义U-Net的下采样方案以实现精确的生物医学图像分割
Li, Mingjie, Chen, Yizheng, Islam, Md Tauhidul, Xing, Lei
Abstract
U-Net architectures have been instrumental in advancing biomedical image segmentation (BIS) but often struggle with capturing long-range information. One reason is the conventional down-sampling techniques that prioritize computational efficiency at the expense of information retention. This paper introduces a simple but effective strategy, which we call Stair Pooling, that moderates the pace of down-sampling and reduces information loss by leveraging a sequence of concatenated small and narrow pooling operations in varied orientations. Specifically, our method modifies the reduction in dimensionality within each 2D pooling step from $\frac{1}{4}$ to $\frac{1}{2}$. This approach can also be adapted for 3D pooling to preserve even more information. Such preservation aids the U-Net in more effectively reconstructing spatial details during the up-sampling phase, thereby enhancing its ability to capture long-range information and improving segmentation accuracy. Extensive experiments on three BIS benchmarks demonstrate that the proposed Stair Pooling can increase both 2D and 3D U-Net performance by an average of 3.8% in Dice scores. Moreover, we leverage the transfer entropy to select the optimal down-sampling paths and quantitatively show how the proposed Stair Pooling reduces the information loss.
Chinese Translation
U-Net架构在推动生物医学图像分割(BIS)方面发挥了重要作用,但在捕捉长距离信息时常常面临挑战。其原因之一是传统的下采样技术优先考虑计算效率,而牺牲了信息保留。本文提出了一种简单但有效的策略,我们称之为阶梯池化(Stair Pooling),该策略通过利用一系列以不同方向连接的小型窄池化操作来调节下采样的速度,从而减少信息损失。具体而言,我们的方法将每个二维池化步骤中的维度减少从$\frac{1}{4}$修改为$\frac{1}{2}$。该方法还可以适用于三维池化,以保留更多信息。这种信息保留有助于U-Net在上采样阶段更有效地重建空间细节,从而增强其捕捉长距离信息的能力,提高分割精度。在三个BIS基准上的大量实验表明,所提出的阶梯池化可以使二维和三维U-Net的性能平均提高3.8%的Dice分数。此外,我们利用转移熵选择最佳下采样路径,并定量展示所提出的阶梯池化如何减少信息损失。
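Our reading of the stair idea: instead of one 2x2 pool that discards three quarters of the activations at once, down-sample gently with a pair of narrow pools in alternating orientations, so each step only halves the activations. The exact operator sequence in the paper may differ; this is a minimal PyTorch sketch.

```python
import torch
import torch.nn as nn

class StairPool2d(nn.Module):
    """Two narrow pools in sequence, each a 1/2 reduction, in place of a
    single 2x2 pool (a 1/4 reduction in one step)."""

    def __init__(self):
        super().__init__()
        self.pool_w = nn.MaxPool2d(kernel_size=(1, 2), stride=(1, 2))  # halve width
        self.pool_h = nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1))  # halve height

    def forward(self, x):
        x = self.pool_w(x)   # (B, C, H, W)   -> (B, C, H, W/2)
        x = self.pool_h(x)   # (B, C, H, W/2) -> (B, C, H/2, W/2)
        return x

x = torch.randn(1, 16, 64, 64)
print(StairPool2d()(x).shape)  # torch.Size([1, 16, 32, 32])
```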
cs.CV / 121 / 2602.19418

PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention

PA-攻击:通过原型和注意力引导灰盒攻击LVLM视觉编码器
Mei, Hefei, Wang, Zirui, Xu, Chang, Guo, Jianyuan, Dong, Minjing
Abstract
Large Vision-Language Models (LVLMs) are foundational to modern multimodal applications, yet their susceptibility to adversarial attacks remains a critical concern. Prior white-box attacks rarely generalize across tasks, and black-box methods depend on expensive transfer, which limits efficiency. The vision encoder, standardized and often shared across LVLMs, provides a stable gray-box pivot with strong cross-model transfer. Building on this premise, we introduce PA-Attack (Prototype-Anchored Attentive Attack). PA-Attack begins with a prototype-anchored guidance that provides a stable attack direction towards a general and dissimilar prototype, tackling the attribute-restricted issue and limited task generalization of vanilla attacks. Building on this, we propose a two-stage attention enhancement mechanism: (i) leverage token-level attention scores to concentrate perturbations on critical visual tokens, and (ii) adaptively recalibrate attention weights to track the evolving attention during the adversarial process. Extensive experiments across diverse downstream tasks and LVLM architectures show that PA-Attack achieves an average 75.1% score reduction rate (SRR), demonstrating strong attack effectiveness, efficiency, and task generalization in LVLMs. Code is available at https://github.com/hefeimei06/PA-Attack.
Chinese Translation
大型视觉语言模型(LVLMs)是现代多模态应用的基础,但它们对对抗攻击的脆弱性仍然是一个关键问题。先前的白盒攻击在任务间的泛化能力较差,而黑盒方法依赖于昂贵的迁移,这限制了效率。视觉编码器在LVLMs中标准化且通常是共享的,提供了一个稳定的灰盒支点,具有强大的跨模型迁移能力。在此基础上,我们提出了PA-攻击(原型锚定注意力攻击)。PA-攻击以原型锚定的指导开始,提供了一个稳定的攻击方向,指向一个通用且不同的原型,从而解决了属性受限问题和普通攻击的有限任务泛化能力。在此基础上,我们提出了一种两阶段的注意力增强机制:(i)利用令牌级注意力分数将扰动集中在关键视觉令牌上,以及(ii)自适应地重新校准注意力权重,以跟踪对抗过程中不断变化的注意力。针对多种下游任务和LVLM架构的广泛实验表明,PA-攻击实现了平均75.1%的得分降低率(SRR),展示了在LVLMs中强大的攻击有效性、效率和任务泛化能力。代码可在 https://github.com/hefeimei06/PA-Attack 获取。
cs.CV / 122 / 2602.19423

Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy

Prefer-DAS:基于局部偏好和稀疏提示的电子显微镜领域自适应分割学习
Chen, Jiabao, Xiong, Shan, Peng, Jialin
Abstract
Domain adaptive segmentation (DAS) is a promising paradigm for delineating intracellular structures from various large-scale electron microscopy (EM) without incurring extensive annotated data in each domain. However, the prevalent unsupervised domain adaptation (UDA) strategies often demonstrate limited and biased performance, which hinders their practical applications. In this study, we explore sparse points and local human preferences as weak labels in the target domain, thereby presenting a more realistic yet annotation-efficient setting. Specifically, we develop Prefer-DAS, which pioneers sparse promptable learning and local preference alignment. The Prefer-DAS is a promptable multitask model that integrates self-training and prompt-guided contrastive learning. Unlike SAM-like methods, the Prefer-DAS allows for the use of full, partial, and even no point prompts during both training and inference stages and thus enables interactive segmentation. Instead of using image-level human preference alignment for segmentation, we introduce Local direct Preference Optimization (LPO) and sparse LPO (SLPO), plug-and-play solutions for alignment with spatially varying human feedback or sparse feedback. To address potential missing feedback, we also introduce Unsupervised Preference Optimization (UPO), which leverages self-learned preferences. As a result, the Prefer-DAS model can effectively perform both weakly-supervised and unsupervised DAS, depending on the availability of points and human preferences. Comprehensive experiments on four challenging DAS tasks demonstrate that our model outperforms SAM-like methods as well as unsupervised and weakly-supervised DAS methods in both automatic and interactive segmentation modes, highlighting strong generalizability and flexibility. Additionally, the performance of our model is very close to or even exceeds that of supervised models.
Chinese Translation
领域自适应分割(DAS)是一种有前景的范式,用于从各种大规模电子显微镜(EM)图像中描绘细胞内结构,而无需在每个领域中获取大量标注数据。然而,现有的无监督领域适应(UDA)策略往往表现出有限且偏颇的性能,这阻碍了它们的实际应用。在本研究中,我们探索了稀疏点和局部人类偏好作为目标领域中的弱标签,从而提出了一种更为现实且高效的标注设置。具体而言,我们开发了Prefer-DAS,它开创了稀疏可提示学习和局部偏好对齐。Prefer-DAS是一个可提示的多任务模型,集成了自我训练和提示引导的对比学习。与SAM类方法不同,Prefer-DAS允许在训练和推理阶段使用完整、部分甚至没有点的提示,从而实现交互式分割。我们引入了局部直接偏好优化(Local direct Preference Optimization,LPO)和稀疏LPO(Sparse LPO,SLPO),作为与空间变化的人类反馈或稀疏反馈对齐的即插即用解决方案,而不是使用图像级人类偏好对齐进行分割。为了解决潜在的缺失反馈问题,我们还引入了无监督偏好优化(Unsupervised Preference Optimization,UPO),利用自学习的偏好。因此,Prefer-DAS模型能够根据点和人类偏好的可用性,有效地执行弱监督和无监督的DAS。在四个具有挑战性的DAS任务上的综合实验表明,我们的模型在自动和交互式分割模式下均优于SAM类方法以及无监督和弱监督DAS方法,突显了其强大的泛化能力和灵活性。此外,我们模型的性能非常接近甚至超过了监督模型的表现。
cs.CV / 123 / 2602.19424

Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

Hepato-LLaVA:一种具有稀疏拓扑包注意力机制的专家级多模态大语言模型,用于全切片图像的肝细胞病理分析
Yang, Yuxuan, Yan, Zhonghao, Zhang, Yi, Yun, Bo, Diao, Muxi, Zhao, Guowei, Liang, Kongming, Li, Wenbin, Ma, Zhanyu
Abstract
Hepatocellular Carcinoma diagnosis relies heavily on the interpretation of gigapixel Whole Slide Images. However, current computational approaches are constrained by fixed-resolution processing mechanisms and inefficient feature aggregation, which inevitably lead to either severe information loss or high feature redundancy. To address these challenges, we propose Hepato-LLaVA, a specialized Multi-modal Large Language Model designed for fine-grained hepatocellular pathology analysis. We introduce a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology. This mechanism effectively aggregates local diagnostic evidence into semantic summary tokens while preserving global context. Furthermore, to overcome the lack of multi-scale data, we present HepatoPathoVQA, a clinically grounded dataset comprising 33K hierarchically structured question-answer pairs validated by expert pathologists. Our experiments demonstrate that Hepato-LLaVA achieves state-of-the-art performance on HCC diagnosis and captioning tasks, significantly outperforming existing methods. Our code and implementation details are available at https://pris-cv.github.io/Hepto-LLaVA/.
Chinese Translation
肝细胞癌的诊断在很大程度上依赖于对千兆像素全切片图像的解读。然而,当前的计算方法受到固定分辨率处理机制和低效特征聚合的限制,这不可避免地导致严重的信息损失或高特征冗余。为了解决这些挑战,我们提出了Hepato-LLaVA,一种专门为细粒度肝细胞病理分析设计的多模态大语言模型。我们引入了一种新颖的稀疏拓扑包注意力机制,该机制明确建模二维组织拓扑。该机制有效地将局部诊断证据聚合为语义摘要标记,同时保留全局上下文。此外,为了克服缺乏多尺度数据的问题,我们提出了HepatoPathoVQA,这是一个临床基础的数据集,包含33K个经过专家病理学家验证的分层结构问答对。我们的实验表明,Hepato-LLaVA在肝细胞癌诊断和描述任务上达到了最先进的性能,显著优于现有方法。我们的代码和实现细节可在 https://pris-cv.github.io/Hepto-LLaVA/ 获取。
cs.CV / 124 / 2602.19430

TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation

TherA:热感知视觉语言提示的可控RGB到热红外翻译
Lee, Dong-Guw, Rhee, Tai Hyoung, Jang, Hyunsoo, Shin, Young-Sik, Shin, Ukcheol, Kim, Ayoung
Abstract
Despite the inherent advantages of thermal infrared(TIR) imaging, large-scale data collection and annotation remain a major bottleneck for TIR-based perception. A practical alternative is to synthesize pseudo TIR data via image translation; however, most RGB-to-TIR approaches heavily rely on RGB-centric priors that overlook thermal physics, yielding implausible heat distributions. In this paper, we introduce TherA, a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both scene and object level. TherA couples TherA-VLM with a latent-diffusion-based translator. Given a single RGB image and a user-prompted condition pair, TherA-VLM yields a thermal-aware embedding that encodes scene, object, material, and heat-emission context reflecting the input scene-condition pair. Conditioning the diffusion model on this embedding enables realistic TIR synthesis and fine-grained control across time of day, weather, and object state. Compared to other baselines, TherA achieves state-of-the-art translation performance, demonstrating improved zero-shot translation performance up to 33% increase averaged across all metrics.
Chinese Translation
尽管热红外(TIR)成像具有固有优势,但大规模数据收集和标注仍然是基于TIR感知的主要瓶颈。一种实用的替代方案是通过图像翻译合成伪TIR数据;然而,大多数RGB到TIR的方法严重依赖于RGB中心的先验,忽视了热物理学,导致不合理的热分布。在本文中,我们介绍了TherA,一个可控的RGB到TIR翻译框架,能够在场景和物体层面生成多样化且热学合理的图像。TherA将TherA-VLM与基于潜在扩散的翻译器相结合。给定一张单一的RGB图像和用户提示的条件对,TherA-VLM生成一个热感知嵌入,编码反映输入场景条件对的场景、物体、材料和热辐射上下文。基于该嵌入对扩散模型进行条件化,使得现实的TIR合成成为可能,并在一天中的时间、天气和物体状态上实现细粒度控制。与其他基线相比,TherA实现了最先进的翻译性能,展示了在所有指标上平均提高33%的零样本翻译性能。
cs.CV / 125 / 2602.19432

CountEx: Fine-Grained Counting via Exemplars and Exclusion

CountEx:通过示例和排除实现细粒度计数
Huang, Yifeng, Nguyen, Gia Khanh, Hoai, Minh
Abstract
This paper presents CountEx, a discriminative visual counting framework designed to address a key limitation of existing prompt-based methods: the inability to explicitly exclude visually similar distractors. While current approaches allow users to specify what to count via inclusion prompts, they often struggle in cluttered scenes with confusable object categories, leading to ambiguity and overcounting. CountEx enables users to express both inclusion and exclusion intent, specifying what to count and what to ignore, through multimodal prompts including natural language descriptions and optional visual exemplars. At the core of CountEx is a novel Discriminative Query Refinement module, which jointly reasons over inclusion and exclusion cues by first identifying shared visual features, then isolating exclusion-specific patterns, and finally applying selective suppression to refine the counting query. To support systematic evaluation of fine-grained counting methods, we introduce CoCount, a benchmark comprising 1,780 videos and 10,086 annotated frames across 97 category pairs. Experiments show that CountEx achieves substantial improvements over state-of-the-art methods for counting objects from both known and novel categories. The data and code are available at https://github.com/bbvisual/CountEx.
Chinese Translation
本文提出了CountEx,一个旨在解决现有基于提示的方法的关键局限性的区分视觉计数框架:无法明确排除视觉上相似的干扰物。虽然当前的方法允许用户通过包含提示指定要计数的对象,但在物体类别混淆的杂乱场景中,它们往往面临模糊和过度计数的问题。CountEx使用户能够通过多模态提示(包括自然语言描述和可选的视觉示例)表达包含和排除的意图,明确指定要计数的对象和要忽略的对象。CountEx的核心是一个新颖的区分查询细化模块,该模块通过首先识别共享的视觉特征,然后隔离特定于排除的模式,最后应用选择性抑制来细化计数查询,从而联合推理包含和排除线索。为了支持细粒度计数方法的系统评估,我们引入了CoCount,一个基准数据集,包含1,780个视频和97对类别中的10,086个标注帧。实验表明,CountEx在已知和新类别的物体计数方面,相较于最先进的方法取得了显著的改进。数据和代码可在 https://github.com/bbvisual/CountEx 获取。
cs.CV / 126 / 2602.19437

FinSight-Net: A Physics-Aware Decoupled Network with Frequency-Domain Compensation for Underwater Fish Detection in Smart Aquaculture

FinSight-Net:一种具有频域补偿的物理感知解耦网络,用于智能水产养殖中的水下鱼类检测
Yang, Jinsong, Hu, Zeyuan, Li, Yichen, Yu, Hong
Abstract
Underwater fish detection (UFD) is a core capability for smart aquaculture and marine ecological monitoring. While recent detectors improve accuracy by stacking feature extractors or introducing heavy attention modules, they often incur substantial computational overhead and, more importantly, neglect the physics that fundamentally limits UFD: wavelength-dependent absorption and turbidity-induced scattering significantly degrade contrast, blur fine structures, and introduce backscattering noise, leading to unreliable localization and recognition. To address these challenges, we propose FinSight-Net, an efficient and physics-aware detection framework tailored for complex aquaculture environments. FinSight-Net introduces a Multi-Scale Decoupled Dual-Stream Processing (MS-DDSP) bottleneck that explicitly targets frequency-specific information loss via heterogeneous convolutional branches, suppressing backscattering artifacts while compensating distorted biological cues through scale-aware and channel-weighted pathways. We further design an Efficient Path Aggregation FPN (EPA-FPN) as a detail-filling mechanism: it restores high-frequency spatial information typically attenuated in deep layers by establishing long-range skip connections and pruning redundant fusion routes, enabling robust detection of non-rigid fish targets under severe blur and turbidity. Extensive experiments on DeepFish, AquaFishSet, and our challenging UW-BlurredFish benchmark demonstrate that FinSight-Net achieves state-of-the-art performance. In particular, on UW-BlurredFish, FinSight-Net reaches 92.8% mAP, outperforming YOLOv11s by 4.8% while reducing parameters by 29.0%, providing a strong and lightweight solution for real-time automated monitoring in smart aquaculture.
Chinese Translation
水下鱼类检测(UFD)是智能水产养殖和海洋生态监测的核心能力。尽管近期的检测器通过堆叠特征提取器或引入重型注意力模块来提高准确性,但它们通常会带来显著的计算开销,更重要的是,忽视了根本限制UFD的物理因素:波长依赖的吸收和浑浊引起的散射显著降低了对比度,模糊了细微结构,并引入了后向散射噪声,导致定位和识别不可靠。为了解决这些挑战,我们提出了FinSight-Net,这是一种高效且具物理感知的检测框架,专为复杂的水产养殖环境量身定制。FinSight-Net引入了多尺度解耦双流处理(MS-DDSP)瓶颈,明确针对频率特定的信息损失,通过异构卷积分支抑制后向散射伪影,同时通过尺度感知和通道加权路径补偿扭曲的生物线索。我们进一步设计了一种高效路径聚合特征金字塔网络(EPA-FPN)作为细节填充机制:它通过建立长距离跳跃连接和修剪冗余融合路径,恢复通常在深层中衰减的高频空间信息,从而在严重模糊和浑浊的情况下实现对非刚性鱼类目标的稳健检测。在DeepFish、AquaFishSet和我们具有挑战性的UW-BlurredFish基准上的大量实验表明,FinSight-Net达到了最先进的性能。特别是在UW-BlurredFish上,FinSight-Net达到了92.8%的平均精度(mAP),比YOLOv11s高出4.8%,同时减少了29.0%的参数,为智能水产养殖中的实时自动监测提供了强大且轻量的解决方案。
cs.CV / 127 / 2602.19442

UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

UrbanAlign:后处理语义校准用于视觉语言模型与人类偏好的对齐
Zhang, Yecheng, Zhao, Rong, Sha, Zhizhou, Li, Yong, Wang, Lei, Hou, Ce, Ji, Wen, Huang, Hao, Wan, Yunshan, Yu, Jian, Xia, Junhao, Zhang, Yuru, Shi, Chunlei
Abstract
Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages: concept mining, multi-agent structured scoring, and geometric calibration, unified by an end-to-end dimension optimization loop. Interpretable evaluation dimensions are mined from a handful of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally-weighted ridge regression on a hybrid visual-semantic manifold calibrates these scores against human ratings. Applied to urban perception as UrbanAlign, the framework achieves 72.2% accuracy ($\kappa=0.45$) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by +15.1 pp and uncalibrated VLM scoring by +16.3 pp, with full dimension-level interpretability and zero model-weight modification.
Chinese Translation
在特定领域任务中,将视觉语言模型(VLM)的输出与人类偏好对齐通常需要微调或强化学习,而这两者都需要标注数据和GPU计算资源。我们展示了在主观感知任务中,这种对齐可以在不进行任何模型训练的情况下实现:VLM已经是强大的概念提取器,但在决策校准方面表现较差,而这一差距可以通过外部方法来弥补。我们提出了一种无训练的后处理概念瓶颈管道,该管道由三个紧密耦合的阶段组成:概念挖掘、多代理结构化评分和几何校准,通过端到端的维度优化循环统一。可解释的评估维度从少量人类注释中挖掘而来;观察者-辩论者-评判者链从冻结的VLM中提取出稳健的连续概念评分;而在混合视觉-语义流形上进行局部加权岭回归则将这些评分与人类评分进行校准。将其应用于城市感知的UrbanAlign框架,在Place Pulse 2.0的六个类别中实现了72.2%的准确率($\kappa=0.45$),超越了最佳监督基线15.1个百分点和未校准VLM评分16.3个百分点,且具备完整的维度级可解释性和零模型权重修改。
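The final calibration stage is a classical technique and is worth spelling out. Below is plain locally-weighted ridge regression in the spirit described: concept scores (rows of `X_train`) are regressed onto human ratings, with training points near the query weighted more. Kernel bandwidth `tau` and ridge strength `lam` are illustrative choices, not the paper's settings.

```python
import numpy as np

def locally_weighted_ridge(X_train, y_train, x_query, tau=1.0, lam=1e-2):
    """Fit a ridge regression with Gaussian locality weights centered on the
    query point, then evaluate it at the query."""
    d2 = ((X_train - x_query) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * tau ** 2))                 # Gaussian locality weights
    W = np.diag(w)
    A = X_train.T @ W @ X_train + lam * np.eye(X_train.shape[1])
    beta = np.linalg.solve(A, X_train.T @ W @ y_train)
    return x_query @ beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                        # 6 mined concept dimensions
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=100)
print(locally_weighted_ridge(X, y, X[0]))            # calibrated score for one sample
```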
cs.CV / 128 / 2602.19449

Decoupling Vision and Language: Codebook Anchored Visual Adaptation

解耦视觉与语言:基于代码本的视觉适应
Wu, Jason, Zhao, Tianchen, Liu, Chang, Cai, Jiarui, Zhang, Zheng, Li, Zhuowei, Singh, Aaditya, Xu, Xiang, Srivastava, Mani, Wu, Jonathan
Abstract
Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.
Chinese Translation
大型视觉-语言模型(LVLMs)利用其视觉编码器将图像转换为下游推理的表示,但在医学图像诊断或细粒度分类等特定领域的视觉任务中,编码器的表现往往不尽如人意,此时表示错误可能会在语言模型中级联,导致不正确的响应。现有的适应方法通过投影器调优或其他参数高效更新来修改编码器与语言模型之间的连续特征接口,这仍然使得两个组件耦合,并且每当编码器发生变化时都需要重新对齐。我们提出了CRAFT(Codebook RegulAted Fine-Tuning),这是一种轻量级的方法,通过使用离散代码本来微调编码器,将视觉表示锚定到一个稳定的标记空间,实现领域适应而不修改模型的其他部分。这种解耦设计使得适应后的编码器能够无缝提升具有不同语言架构的LVLMs的性能,只要它们共享相同的代码本。实证结果表明,CRAFT在10个特定领域基准测试(如VQARAD和PlantVillage)中平均获得了13.51%的提升,同时保持了LLM的语言能力,并且优于在连续标记上操作的同类方法。
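The anchoring step implied by the abstract is nearest-neighbor quantization against a shared codebook, sketched below. This shows only the token mapping; CRAFT's training losses and gradient handling (e.g., straight-through estimation) are not shown, and all names are our own.

```python
import torch

def quantize_to_codebook(features, codebook):
    """Map continuous encoder features to their nearest entries in a shared
    discrete codebook, so a fine-tuned encoder keeps emitting tokens any
    language model aligned to that codebook can consume."""
    # features: (N, D), codebook: (K, D)
    d = torch.cdist(features, codebook)   # (N, K) pairwise L2 distances
    idx = d.argmin(dim=1)                  # nearest code per feature
    return codebook[idx], idx

feats = torch.randn(5, 32)
book = torch.randn(256, 32)                # K=256 shared visual tokens
quantized, ids = quantize_to_codebook(feats, book)
print(ids.tolist())
```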
cs.CV / 129 / 2602.19454

HD-TTA: Hypothesis-Driven Test-Time Adaptation for Safer Brain Tumor Segmentation

HD-TTA:用于更安全脑肿瘤分割的假设驱动测试时间适应
Jhawar, Kartik, Wang, Lipo
Abstract
Standard Test-Time Adaptation (TTA) methods typically treat inference as a blind optimization task, applying generic objectives to all or filtered test samples. In safety-critical medical segmentation, this lack of selectivity often causes the tumor mask to spill into healthy brain tissue or degrades predictions that were already correct. We propose Hypothesis-Driven TTA, a novel framework that reformulates adaptation as a dynamic decision process. Rather than forcing a single optimization trajectory, our method generates intuitive competing geometric hypotheses: compaction (is the prediction noisy? trim artifacts) versus inflation (is the valid tumor under-segmented? safely inflate to recover). It then employs a representation-guided selector to autonomously identify the safest outcome based on intrinsic texture consistency. Additionally, a pre-screening Gatekeeper prevents negative transfer by skipping adaptation on confident cases. We validate this proof-of-concept on a cross-domain binary brain tumor segmentation task, applying a source model trained on adult BraTS gliomas to unseen pediatric and more challenging meningioma target domains. HD-TTA improves safety-oriented outcomes (Hausdorff Distance (HD95) and Precision) over several state-of-the-art representative baselines in the challenging safety regime, reducing the HD95 by approximately 6.4 mm and improving Precision by over 4%, while maintaining comparable Dice scores. These results demonstrate that resolving the safety-adaptation trade-off via explicit hypothesis selection is a viable, robust path for safe clinical model deployment. Code will be made publicly available upon acceptance.
Chinese Translation
标准的测试时间适应(TTA)方法通常将推断视为盲目的优化任务,对所有或筛选后的测试样本应用通用目标。在安全关键的医学分割中,这种缺乏选择性的做法往往导致肿瘤掩膜渗入健康脑组织,或降低已经正确的预测。我们提出了基于假设驱动的TTA,这是一种将适应重新构建为动态决策过程的新框架。我们的算法并不强迫单一的优化轨迹,而是生成直观的竞争几何假设:压缩(预测是否有噪声?去除伪影)与膨胀(有效肿瘤是否被低估分割?安全膨胀以恢复)。然后,它采用一种基于表示的选择器,自动识别基于内在纹理一致性的最安全结果。此外,预筛选的守门人通过跳过自信案例来防止负迁移。我们在一个跨领域的二元脑肿瘤分割任务上验证了这一概念验证,应用在成人BraTS胶质瘤上训练的源模型到未见过的儿童和更具挑战性的脑膜瘤目标领域。在具有挑战性的安全环境中,HD-TTA在安全导向的结果(Hausdorff距离(HD95)和精确度)上优于多个最先进的代表性基线,HD95减少约6.4毫米,精确度提高超过4%,同时保持可比的Dice分数。这些结果表明,通过明确的假设选择解决安全与适应之间的权衡是安全临床模型部署的可行且稳健的路径。代码将在接受后公开发布。
cs.CV / 130 / 2602.19461

Laplacian Multi-scale Flow Matching for Generative Modeling

拉普拉斯多尺度流匹配用于生成建模
Zhao, Zelin, Molodyk, Petr, Xue, Haotian, Chen, Yongxin
Abstract
In this paper, we present Laplacian multiscale flow matching (LapFlow), a novel framework that enhances flow matching by leveraging multi-scale representations for image generative modeling. Our approach decomposes images into Laplacian pyramid residuals and processes different scales in parallel through a mixture-of-transformers (MoT) architecture with causal attention mechanisms. Unlike previous cascaded approaches that require explicit renoising between scales, our model generates multi-scale representations in parallel, eliminating the need for bridging processes. The proposed multi-scale architecture not only improves generation quality but also accelerates the sampling process and promotes scaling flow matching methods. Through extensive experimentation on CelebA-HQ and ImageNet, we demonstrate that our method achieves superior sample quality with fewer GFLOPs and faster inference compared to single-scale and multi-scale flow matching baselines. The proposed model scales effectively to high-resolution generation (up to 1024$\times$1024) while maintaining lower computational overhead.
Chinese Translation
在本文中,我们提出了一种新的框架——拉普拉斯多尺度流匹配(LapFlow),通过利用多尺度表示来增强流匹配,以实现图像生成建模。我们的方法将图像分解为拉普拉斯金字塔残差,并通过具有因果注意机制的混合变换器(Mixture-of-Transformers, MoT)架构并行处理不同尺度。与之前需要在尺度之间进行显式重新加噪的级联方法不同,我们的模型并行生成多尺度表示,消除了桥接过程的需求。所提出的多尺度架构不仅提高了生成质量,还加速了采样过程,并促进了流匹配方法的扩展。通过在CelebA-HQ和ImageNet上的大量实验,我们证明了我们的方法在样本质量上优于单尺度和多尺度流匹配基线,且GFLOPs更少,推理速度更快。所提出的模型能够有效扩展到高分辨率生成(最高可达1024×1024),同时保持较低的计算开销。
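The Laplacian pyramid residuals that such models generate in parallel are simple to construct. The sketch below uses 2x average pooling and nearest up-sampling for clarity (the paper's filters may differ) and verifies that the decomposition reconstructs exactly.

```python
import numpy as np

def laplacian_pyramid(img, levels=3):
    """Repeatedly blur-and-downsample, storing the detail lost at each
    scale; the last entry is the coarsest low-frequency band."""
    pyramid = []
    cur = img.astype(np.float64)
    for _ in range(levels - 1):
        h, w = cur.shape
        down = cur.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))  # blur+decimate
        up = down.repeat(2, axis=0).repeat(2, axis=1)               # upsample back
        pyramid.append(cur - up)          # high-frequency residual at this scale
        cur = down
    pyramid.append(cur)
    return pyramid

img = np.random.rand(64, 64)
bands = laplacian_pyramid(img)
print([b.shape for b in bands])           # [(64, 64), (32, 32), (16, 16)]
# exact reconstruction: upsample the coarse band and add residuals back
rec = bands[-1]
for res in reversed(bands[:-1]):
    rec = rec.repeat(2, axis=0).repeat(2, axis=1) + res
print(np.allclose(rec, img))              # True
```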
cs.CV / 131 / 2602.19470

Physics-informed Active Polarimetric 3D Imaging for Specular Surfaces

物理信息驱动的主动极化三维成像用于镜面表面
Wang, Jiazhang, Yang, Hyelim, Wang, Tianyi, Willomitzer, Florian
Abstract
3D imaging of specular surfaces remains challenging in real-world scenarios, such as in-line inspection or hand-held scanning, requiring fast and accurate measurement of complex geometries. Optical metrology techniques such as deflectometry achieve high accuracy but typically rely on multi-shot acquisition, making them unsuitable for dynamic environments. Fourier-based single-shot approaches alleviate this constraint, yet their performance deteriorates when measuring surfaces with high spatial frequency structure or large curvature. Alternatively, polarimetric 3D imaging in computer vision operates in a single-shot fashion and exhibits robustness to geometric complexity. However, its accuracy is fundamentally limited by the orthographic imaging assumption. In this paper, we propose a physics-informed deep learning framework for single-shot 3D imaging of complex specular surfaces. Polarization cues provide orientation priors that assist in interpreting geometric information encoded by structured illumination. These complementary cues are processed through a dual-encoder architecture with mutual feature modulation, allowing the network to resolve their nonlinear coupling and directly infer surface normals. The proposed method achieves accurate and robust normal estimation in single-shot with fast inference, enabling practical 3D imaging of complex specular surfaces.
Chinese Translation
在实际场景中,镜面表面的三维成像仍然具有挑战性,例如在线检测或手持扫描,这需要对复杂几何形状进行快速而准确的测量。光学计量技术如偏折测量法(deflectometry)能够实现高精度,但通常依赖于多次拍摄,使其不适合动态环境。基于傅里叶的单次拍摄方法缓解了这一限制,但在测量具有高空间频率结构或大曲率的表面时,其性能会下降。另一方面,计算机视觉中的极化三维成像以单次拍摄的方式进行,并对几何复杂性表现出鲁棒性。然而,其准确性在根本上受到正交成像假设的限制。本文提出了一种物理信息驱动的深度学习框架,用于复杂镜面表面的单次拍摄三维成像。极化线索提供了方向先验,有助于解释由结构化照明编码的几何信息。这些互补线索通过具有互相特征调制的双编码器架构进行处理,使网络能够解析其非线性耦合并直接推断表面法线。所提方法在单次拍摄中实现了准确且鲁棒的法线估计,并具备快速推理能力,从而实现了对复杂镜面表面的实用三维成像。
cs.CV / 132 / 2602.19471

Forgetting-Resistant and Lesion-Aware Source-Free Domain Adaptive Fundus Image Analysis with Vision-Language Model

抗遗忘且关注病灶的无源领域自适应视网膜图像分析与视觉语言模型
Huai, Zheang, Tang, Hui, Wang, Hualiang, Li, Xiaomeng
Abstract
Source-free domain adaptation (SFDA) aims to adapt a model trained in the source domain to perform well in the target domain, with only unlabeled target domain data and the source model. Taking into account that conventional SFDA methods are inevitably error-prone under domain shift, recently greater attention has been directed to SFDA assisted with off-the-shelf foundation models, e.g., vision-language (ViL) models. However, existing works of leveraging ViL models for SFDA confront two issues: (i) Although mutual information is exploited to consider the joint distribution between the predictions of ViL model and the target model, we argue that the forgetting of some superior predictions of the target model still occurs, as indicated by the decline of the accuracies of certain classes during adaptation; (ii) Prior research disregards the rich, fine-grained knowledge embedded in the ViL model, which offers detailed grounding for fundus image diagnosis. In this paper, we introduce a novel forgetting-resistant and lesion-aware (FRLA) method for SFDA of fundus image diagnosis with ViL model. Specifically, a forgetting-resistant adaptation module explicitly preserves the confident predictions of the target model, and a lesion-aware adaptation module yields patch-wise predictions from ViL model and employs them to help the target model be aware of the lesion areas and leverage the ViL model's fine-grained knowledge. Extensive experiments show that our method not only significantly outperforms the vision-language model, but also achieves consistent improvements over the state-of-the-art methods. Our code will be released.
Chinese Translation
无源领域适应(SFDA)旨在将训练于源领域的模型适应到目标领域,使其在仅有未标记的目标领域数据和源模型的情况下表现良好。考虑到传统的SFDA方法在领域转移下不可避免地容易出错,最近越来越多的关注转向了利用现成基础模型(如视觉语言(ViL)模型)辅助的SFDA。然而,现有利用ViL模型进行SFDA的研究面临两个问题:(i)尽管利用互信息考虑了ViL模型与目标模型预测之间的联合分布,但我们认为目标模型某些优越预测的遗忘仍然发生,这在适应过程中某些类别的准确率下降中得到了体现;(ii)先前的研究忽视了ViL模型中蕴含的丰富、细粒度知识,这为视网膜图像诊断提供了详细的基础。在本文中,我们提出了一种新颖的抗遗忘且关注病灶(FRLA)的方法,用于基于ViL模型的视网膜图像诊断的SFDA。具体而言,抗遗忘适应模块显式保留目标模型的自信预测,而关注病灶适应模块则从ViL模型生成基于补丁的预测,并利用这些预测帮助目标模型识别病灶区域,并利用ViL模型的细粒度知识。大量实验表明,我们的方法不仅显著优于视觉语言模型,而且在现有最先进的方法上也实现了一致的改进。我们的代码将会发布。
cs.CV / 133 / 2602.19487

Exploiting Label-Independent Regularization from Spatial Dependencies for Whole Slide Image Analysis

利用空间依赖性进行无标签正则化以分析全幻灯片图像
Wu, Weiyi, Xu, Xinwen, Gao, Chongyang, Diao, Xingjian, Li, Siting, Gui, Jiang
Abstract
Whole slide images, with their gigapixel-scale panoramas of tissue samples, are pivotal for precise disease diagnosis. However, their analysis is hindered by immense data size and scarce annotations. Existing MIL methods face challenges due to the fundamental imbalance where a single bag-level label must guide the learning of numerous patch-level features. This sparse supervision makes it difficult to reliably identify discriminative patches during training, leading to unstable optimization and suboptimal solutions. We propose a spatially regularized MIL framework that leverages inherent spatial relationships among patch features as label-independent regularization signals. Our approach learns a shared representation space by jointly optimizing feature-induced spatial reconstruction and label-guided classification objectives, enforcing consistency between intrinsic structural patterns and supervisory signals. Experimental results on multiple public datasets demonstrate significant improvements over state-of-the-art methods, offering a promising direction.
Chinese Translation
全幻灯片图像以其千兆像素级的组织样本全景图在精确疾病诊断中至关重要。然而,其分析受到巨大的数据规模和稀缺标注的制约。现有的多实例学习(MIL)方法面临挑战,因为单一的包级标签必须指导众多补丁级特征的学习。这种稀疏监督使得在训练过程中可靠地识别具有区分性的补丁变得困难,导致优化不稳定和次优解。我们提出了一种空间正则化的多实例学习框架,利用补丁特征之间固有的空间关系作为无标签的正则化信号。我们的方法通过联合优化特征诱导的空间重构和标签引导的分类目标,学习共享的表示空间,从而强制内在结构模式与监督信号之间的一致性。在多个公共数据集上的实验结果表明,与最先进的方法相比,显著提高了性能,提供了一个有前景的研究方向。
cs.CV / 134 / 2602.19497

MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

MICON-Bench:统一多模态模型中多图像上下文图像生成的基准测试与增强
Wu, Mingrui, Liu, Hang, Ji, Jiayi, Sun, Xiaoshuai, Ji, Rongrong
Abstract
Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, where a multimodal large language model (MLLM) serves as a verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence. Github: https://github.com/Angusliuuu/MICON-Bench.
Chinese Translation
近期统一多模态模型(UMMs)的进展使得图像理解和生成能力显著提升。然而,尽管像Gemini-2.5-Flash-Image这样的模型展现出对多个相关图像进行推理的能力,现有的基准测试很少涉及多图像上下文生成的挑战,主要集中在文本到图像或单图像编辑任务上。在本研究中,我们引入了MICON-Bench,这是一个涵盖六个任务的综合性基准,评估跨图像组合、上下文推理和身份保留。我们进一步提出了一种基于MLLM(多模态大语言模型)的检查点评估框架,用于自动验证语义和视觉一致性,其中多模态大语言模型(MLLM)充当验证者。此外,我们还提出了动态注意力重平衡(DAR),这是一种无需训练的即插即用机制,在推理过程中动态调整注意力,以增强一致性并减少幻觉。对多种最先进的开源模型进行的广泛实验表明,MICON-Bench在揭示多图像推理挑战方面的严谨性,以及DAR在提高生成质量和跨图像一致性方面的有效性。Github: https://github.com/Angusliuuu/MICON-Bench.
cs.CV / 135 / 2602.19503

A Text-Guided Vision Model for Enhanced Recognition of Small Instances

一种文本引导的视觉模型用于增强小目标的识别
Jung, Hyun-Ki
Abstract
As drone-based object detection technology continues to evolve, the demand is shifting from merely detecting objects to enabling users to accurately identify specific targets. For example, users can input particular targets as prompts to precisely detect desired objects. To address this need, an efficient text-guided object detection model has been developed to enhance the detection of small objects. Specifically, an improved version of the existing YOLO-World model is introduced. The proposed method replaces the C2f layer in the YOLOv8 backbone with a C3k2 layer, enabling more precise representation of local features, particularly for small objects or those with clearly defined boundaries. Additionally, the proposed architecture improves processing speed and efficiency through parallel processing optimization, while also contributing to a more lightweight model design. Comparative experiments on the VisDrone dataset show that the proposed model outperforms the original YOLO-World model, with precision increasing from 40.6% to 41.6%, recall from 30.8% to 31%, F1 score from 35% to 35.5%, and mAP@0.5 from 30.4% to 30.7%, confirming its enhanced accuracy. Furthermore, the model demonstrates superior lightweight performance, with the parameter count reduced from 4 million to 3.8 million and FLOPs decreasing from 15.7 billion to 15.2 billion. These results indicate that the proposed approach provides a practical and effective solution for precise object detection in drone-based applications.
Chinese Translation
随着基于无人机的物体检测技术不断发展,需求正从单纯的物体检测转向使用户能够准确识别特定目标。例如,用户可以输入特定目标作为提示,以精确检测所需物体。为满足这一需求,开发了一种高效的文本引导物体检测模型,以增强小物体的检测。具体而言,提出了一种改进版的现有YOLO-World模型。该方法用C3k2层替换了YOLOv8主干中的C2f层,从而能够更精确地表示局部特征,特别是对于小物体或具有明确边界的物体。此外,所提出的架构通过并行处理优化提高了处理速度和效率,同时也有助于更轻量化的模型设计。在VisDrone数据集上的对比实验表明,所提出的模型优于原始YOLO-World模型,精度从40.6%提高到41.6%,召回率从30.8%提高到31%,F1分数从35%提高到35.5%,mAP@0.5从30.4%提高到30.7%,确认了其增强的准确性。此外,该模型表现出优越的轻量化性能,参数数量从400万减少到380万,FLOPs从157亿减少到152亿。这些结果表明,所提出的方法为基于无人机的应用中的精确物体检测提供了一个实用有效的解决方案。
cs.CV / 136 / 2602.19505

Test-Time Computing for Referring Multimodal Large Language Models

面向指代多模态大型语言模型的测试时计算
Wu, Mingrui, Chen, Hao, Ji, Jiayi, Sun, Xiaoshuai, Liu, Zhiyuan, Cao, Liujuan, Cheng, Ming-Ming, Ji, Rongrong
Abstract
We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention towards user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at https://github.com/mrwu-mac/ControlMLLM.
Chinese Translation
我们提出了ControlMLLM++,一种新颖的测试时适应框架,它将可学习的视觉提示注入到冻结的多模态大型语言模型(MLLMs)中,以实现细粒度基于区域的视觉推理,而无需任何模型重训练或微调。ControlMLLM++利用跨模态注意力图本质上编码文本标记与视觉区域之间的语义对应关系的洞察,在推理过程中通过特定任务的能量函数优化潜在的视觉标记修改器,以引导模型注意力朝向用户指定的区域。为了增强优化的稳定性并减轻语言提示偏差,ControlMLLM++结合了一种改进的优化策略(Optim++)和一种提示去偏机制(PromptDebias)。我们的算法支持多种视觉提示类型,包括边界框、掩膜、涂鸦和点,展现出强大的领域外泛化能力和可解释性。代码可在 https://github.com/mrwu-mac/ControlMLLM 获取。
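The core mechanism, optimizing a latent visual-token modifier at test time so that attention concentrates on a user-specified region, can be illustrated with a toy differentiable attention map; the energy function below is a simplified stand-in for the task-specific energy in the paper, and all tensor shapes are invented for the example.

    import torch

    # toy setup: 16 visual tokens (a 4x4 grid), one text query vector
    tokens = torch.randn(16, 8)
    query = torch.randn(8)
    mask = torch.zeros(16); mask[5] = mask[6] = 1.0   # user-specified region

    # learnable latent modifier added to the visual tokens (model stays frozen)
    modifier = torch.zeros_like(tokens, requires_grad=True)
    opt = torch.optim.Adam([modifier], lr=0.1)

    for _ in range(50):
        attn = torch.softmax((tokens + modifier) @ query, dim=0)
        loss = -(attn * mask).sum()   # energy is low when attention is inside
        opt.zero_grad(); loss.backward(); opt.step()

    attn = torch.softmax((tokens + modifier) @ query, dim=0)
    print("attention inside region:", attn[mask.bool()].sum().item())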
cs.CV / 137 / 2602.19506

Relational Feature Caching for Accelerating Diffusion Transformers

加速扩散变换器的关系特征缓存
Son, Byunggwan, Jeon, Jeimin, Choi, Jeongwoo, Ham, Bumsub
Abstract
Feature caching approaches accelerate diffusion transformers (DiTs) by storing the output features of computationally expensive modules at certain timesteps, and exploiting them for subsequent steps to reduce redundant computations. Recent forecasting-based caching approaches employ temporal extrapolation techniques to approximate the output features with cached ones. Although effective, relying exclusively on temporal extrapolation still suffers from significant prediction errors, leading to performance degradation. Through a detailed analysis, we find that 1) these errors stem from the irregular magnitude of changes in the output features, and 2) an input feature of a module is strongly correlated with the corresponding output. Based on this, we propose relational feature caching (RFC), a novel framework that leverages the input-output relationship to enhance the accuracy of the feature prediction. Specifically, we introduce relational feature estimation (RFE) to estimate the magnitude of changes in the output features from the inputs, enabling more accurate feature predictions. We also present relational cache scheduling (RCS), which estimates the prediction errors using the input features and performs full computations only when the errors are expected to be substantial. Extensive experiments across various DiT models demonstrate that RFC consistently outperforms prior approaches significantly. Project page is available at https://cvlab.yonsei.ac.kr/projects/RFC
Chinese Translation
特征缓存方法通过在某些时间步存储计算开销大的模块的输出特征,并在后续步骤中利用这些特征以减少冗余计算,从而加速扩散变换器(DiTs)。最近基于预测的缓存方法采用时间外推技术来用缓存的特征近似输出特征。尽管有效,但完全依赖时间外推仍然面临显著的预测误差,导致性能下降。通过详细分析,我们发现1)这些误差源于输出特征变化幅度的不规则性,2)模块的输入特征与相应的输出特征之间存在强相关性。基于此,我们提出了关系特征缓存(RFC),这是一个新颖的框架,利用输入输出关系来提高特征预测的准确性。具体而言,我们引入关系特征估计(RFE),从输入中估计输出特征变化的幅度,从而实现更准确的特征预测。我们还提出了关系缓存调度(RCS),它利用输入特征估计预测误差,并仅在预期误差较大时进行全面计算。针对各种DiT模型的广泛实验表明,RFC在性能上显著优于之前的方法。项目页面可访问 https://cvlab.yonsei.ac.kr/projects/RFC
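A toy version of the forecasting-plus-fallback idea is sketched below: outputs are predicted by first-order temporal extrapolation of cached values, and a full computation is triggered only when an input-side change proxy (exploiting the input-output correlation noted above) suggests the prediction error would be large. The threshold, the linear extrapolation, and the error proxy are simplifications; RFE and RCS in the paper learn these components.

    import torch

    def cached_step(module, x, cache, threshold=0.1):
        """Forecast the module output from the two most recent cached outputs;
        recompute fully when the input has changed too much (error proxy)."""
        if len(cache["out"]) < 2:
            y = module(x)                                      # warm-up: full compute
        else:
            y_pred = 2 * cache["out"][-1] - cache["out"][-2]   # linear extrapolation
            rel = (x - cache["in"][-1]).norm() / (cache["in"][-1].norm() + 1e-8)
            y = module(x) if rel > threshold else y_pred       # fallback if risky
        cache["in"].append(x.detach()); cache["out"].append(y.detach())
        return y

    module = torch.nn.Linear(16, 16)
    cache = {"in": [], "out": []}
    for t in range(6):
        x = torch.ones(1, 16) * (1.0 - 0.05 * t)   # slowly drifting input
        y = cached_step(module, x, cache)
    print(y.shape)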
cs.CV / 138 / 2602.19523

OSInsert: Towards High-authenticity and High-fidelity Image Composition

OSInsert:迈向高真实性与高保真度的图像合成
Wang, Jingyuan, Niu, Li
Abstract
Generative image composition aims to regenerate the given foreground object in the background image to produce a realistic composite image. Some high-authenticity methods can adjust foreground pose/view to be compatible with background, while some high-fidelity methods can preserve the foreground details accurately. However, existing methods can hardly achieve both goals at the same time. In this work, we propose a two-stage strategy to achieve both goals. In the first stage, we use high-authenticity method to generate reasonable foreground shape, serving as the condition of high-fidelity method in the second stage. The experiments on MureCOM dataset verify the effectiveness of our two-stage strategy. The code and model have been released at https://github.com/bcmi/OSInsert-Image-Composition.
Chinese Translation
生成式图像合成旨在在背景图像中重新生成给定的前景物体,以产生逼真的合成图像。一些高真实性的方法可以调整前景的姿态/视角以与背景兼容,而一些高保真度的方法则能够准确保留前景细节。然而,现有方法很难同时实现这两个目标。在本研究中,我们提出了一种两阶段策略以实现这两个目标。在第一阶段,我们使用高真实性的方法生成合理的前景形状,作为第二阶段高保真度方法的条件。对 MureCOM 数据集的实验验证了我们两阶段策略的有效性。代码和模型已在 https://github.com/bcmi/OSInsert-Image-Composition 发布。
cs.CV / 139 / 2602.19530

ORION: ORthonormal Text Encoding for Universal VLM AdaptatION

ORION:用于通用视觉语言模型适应的正交文本编码
Chakraborty, Omprakash, Dolz, Jose, Ayed, Ismail Ben
Abstract
Vision language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task specific discriminability. We introduce ORION, a text encoder fine tuning framework that improves pretrained VLMs using only class names. Our method optimizes, via low rank adaptation, a novel loss integrating two terms, one promoting pairwise orthogonality between the textual representations of the classes of a given task and the other penalizing deviations from the initial class prototypes. Furthermore, we provide a probabilistic interpretation of our orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens theorem. We report extensive experiments on 11 benchmarks and three large VLM backbones, showing that the refined textual embeddings yield powerful replacements for the standard CLIP prototypes. Added as plug and play module on top of various state of the art methods, and across different prediction settings (zero shot, few shot and test time adaptation), ORION improves the performance consistently and significantly.
Chinese Translation
视觉语言模型(VLMs)在多样化任务中展现了显著的泛化能力,但其性能仍受到用于表示类别的文本原型的质量和几何形状的限制。基于冻结文本编码器和手工设计提示的标准零样本分类器,可能会产生相关或弱分离的嵌入,限制了任务特定的可区分性。我们提出了ORION,一种文本编码器微调框架,仅使用类别名称来改进预训练的VLM。我们的方法通过低秩适应优化了一种新颖的损失函数,该函数整合了两个项:一个促进给定任务类别的文本表示之间的成对正交性,另一个则惩罚偏离初始类别原型的情况。此外,我们提供了正交性惩罚的概率解释,通过惠更斯定理将其与一般最大似然估计(MLE)原则联系起来。我们在11个基准和三个大型VLM骨干网络上进行了广泛实验,结果表明,经过精炼的文本嵌入为标准CLIP原型提供了强有力的替代方案。作为即插即用模块添加到各种最先进的方法上,并在不同的预测设置(零样本、少样本和测试时适应)中,ORION始终显著提高了性能。
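A minimal sketch of the two-term objective, assuming L2-normalized class text embeddings: one term penalizes squared pairwise cosine similarities (pushing prototypes toward orthogonality), the other penalizes deviation from the initial frozen prototypes. The weighting and the low-rank (LoRA) parametrization of the encoder are omitted, and the variable names are illustrative.

    import torch
    import torch.nn.functional as F

    def orion_style_loss(emb, init_emb, lam=1.0):
        z = F.normalize(emb, dim=-1)                # (C, d) class prototypes
        gram = z @ z.t()                            # pairwise cosine similarities
        off = gram - torch.diag(torch.diag(gram))   # zero the diagonal
        ortho = (off ** 2).sum() / (z.shape[0] * (z.shape[0] - 1))
        anchor = F.mse_loss(emb, init_emb)          # stay near initial prototypes
        return ortho + lam * anchor

    C, d = 10, 512
    init = torch.randn(C, d)
    emb = (init + 0.01 * torch.randn(C, d)).requires_grad_()
    loss = orion_style_loss(emb, init)
    loss.backward()
    print(loss.item())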
cs.CV / 140 / 2602.19536

Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection

Fore-Mamba3D:基于Mamba的前景增强编码用于3D物体检测
Ning, Zhiwei, Gao, Xuanang, Cao, Jiaxi, Yang, Runze, Xu, Huiying, Zhu, Xinzhong, Yang, Jie, Liu, Wei
Abstract
Linear modeling methods like Mamba have emerged as an effective backbone for the 3D object detection task. However, previous Mamba-based methods utilize the bidirectional encoding for the whole non-empty voxel sequence, which contains abundant useless background information in the scenes. Though directly encoding foreground voxels appears to be a plausible solution, it tends to degrade detection performance. We attribute this to the response attenuation and restricted context representation in the linear modeling for foreground-only sequences. To address this problem, we propose a novel backbone, termed Fore-Mamba3D, to focus on the foreground enhancement by modifying the Mamba-based encoder. The foreground voxels are first sampled according to the predicted scores. Considering the response attenuation existing in the interaction of foreground voxels across different instances, we design a regional-to-global slide window (RGSW) to propagate the information from regional split to the entire sequence. Furthermore, a semantic-assisted and state spatial fusion module (SASFMamba) is proposed to enrich contextual representation by enhancing semantic and geometric awareness within the Mamba model. Our method emphasizes foreground-only encoding and alleviates the distance-based and causal dependencies in the linear autoregression model. The superior performance across various benchmarks demonstrates the effectiveness of Fore-Mamba3D in the 3D object detection task.
Chinese Translation
线性建模方法(如Mamba)已成为3D物体检测任务的有效骨干。然而,之前的基于Mamba的方法利用双向编码对整个非空体素序列进行处理,这其中包含了场景中大量无用的背景信息。尽管直接编码前景体素似乎是一个可行的解决方案,但这往往会降低检测性能。我们将此归因于线性建模在仅前景序列中的响应衰减和受限上下文表示。为了解决这个问题,我们提出了一种新型骨干网络,称为Fore-Mamba3D,通过修改基于Mamba的编码器来专注于前景增强。前景体素首先根据预测得分进行采样。考虑到不同实例间前景体素交互中存在的响应衰减,我们设计了一个区域到全局滑动窗口(RGSW),以将信息从区域划分传播到整个序列。此外,我们提出了一种语义辅助和状态空间融合模块(SASFMamba),通过增强Mamba模型中的语义和几何感知来丰富上下文表示。我们的方法强调仅前景编码,并缓解了线性自回归模型中的基于距离和因果依赖。各种基准测试中的卓越表现证明了Fore-Mamba3D在3D物体检测任务中的有效性。
cs.CV / 141 / 2602.19539

Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems

青少年能欺骗人工智能吗?评估低成本的外观攻击对年龄估计系统的影响
Shen, Xingyu, Duong, Tommy, An, Xiaodong, Zhao, Zengqi, Hu, Zebang, Hu, Haoyu, Wang, Ziyou, Guo, Finn, Ren, Simiao
Abstract
Age estimation systems are increasingly deployed as gatekeepers for age-restricted online content, yet their robustness to cosmetic modifications has not been systematically evaluated. We investigate whether simple, household-accessible cosmetic changes, including beards, grey hair, makeup, and simulated wrinkles, can cause AI age estimators to classify minors as adults. To study this threat at scale without ethical concerns, we simulate these physical attacks on 329 facial images of individuals aged 10 to 21 using a VLM image editor (Gemini 2.5 Flash Image). We then evaluate eight models from our prior benchmark: five specialized architectures (MiVOLO, Custom-Best, Herosan, MiViaLab, DEX) and three vision-language models (Gemini 3 Flash, Gemini 2.5 Flash, GPT-5-Nano). We introduce the Attack Conversion Rate (ACR), defined as the fraction of images predicted as minor at baseline that flip to adult after attack, a population-agnostic metric that does not depend on the ratio of minors to adults in the test set. Our results reveal that a synthetic beard alone achieves 28 to 69 percent ACR across all eight models; combining all four attacks shifts predicted age by +7.7 years on average across all 329 subjects and reaches up to 83 percent ACR; and vision-language models exhibit lower ACR (59 to 71 percent) than specialized models (63 to 83 percent) under the full attack, although the ACR ranges overlap and the difference is not statistically tested. These findings highlight a critical vulnerability in deployed age-verification pipelines and call for adversarial robustness evaluation as a mandatory criterion for model selection.
Chinese Translation
年龄估计系统越来越多地作为限制年龄的在线内容的守门人,但其对外观修改的鲁棒性尚未得到系统评估。我们研究了简单的、家庭可获取的外观变化,包括胡须、灰发、化妆和模拟皱纹,是否能够导致人工智能年龄估计器将未成年人分类为成年人。为了在没有伦理顾虑的情况下大规模研究这一威胁,我们使用VLM图像编辑器(Gemini 2.5 Flash Image)对329张年龄在10至21岁之间的个体面部图像进行了这些物理攻击的模拟。然后,我们评估了我们之前基准测试中的八个模型:五个专业架构(MiVOLO、Custom-Best、Herosan、MiViaLab、DEX)和三个视觉-语言模型(Gemini 3 Flash、Gemini 2.5 Flash、GPT-5-Nano)。我们引入了攻击转化率(Attack Conversion Rate, ACR),定义为基线预测为未成年人的图像中,经过攻击后被判定为成年人的比例;该指标与人群构成无关,不依赖于测试集中未成年人与成年人的比例。我们的结果显示,仅合成胡须在所有八个模型中即可实现28%到69%的ACR;结合所有四种攻击时,所有329名受试者的预测年龄平均提高了7.7岁,ACR最高可达83%;而在全面攻击下,视觉-语言模型的ACR(59%到71%)低于专业模型(63%到83%),尽管ACR范围重叠且未进行统计检验。这些发现突显了已部署的年龄验证管道中的关键脆弱性,并呼吁将对抗鲁棒性评估作为模型选择的强制性标准。
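The ACR metric is straightforward to compute; a minimal sketch follows, assuming predictions are continuous age estimates and an adulthood threshold of 18 (the threshold is our assumption for the example, not taken from the paper).

    import numpy as np

    def attack_conversion_rate(baseline_age, attacked_age, threshold=18):
        """Fraction of images predicted as minor at baseline whose
        prediction flips to adult after the attack."""
        baseline_age = np.asarray(baseline_age, float)
        attacked_age = np.asarray(attacked_age, float)
        minors = baseline_age < threshold
        if not minors.any():
            return float("nan")   # undefined without baseline minors
        return float((attacked_age[minors] >= threshold).mean())

    # toy example: 4 baseline minors, 3 of them flip to adult -> ACR = 0.75
    print(attack_conversion_rate([14, 16, 17, 15, 25], [19, 21, 17, 20, 26]))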
cs.CV / 142 / 2602.19540

A Green Learning Approach to LDCT Image Restoration

一种绿色学习方法用于低剂量计算机断层扫描图像恢复
Wang, Wei, Wu, Yixing, Kuo, C. -C. Jay
Abstract
This work proposes a green learning (GL) approach to restore medical images. Without loss of generality, we use low-dose computed tomography (LDCT) images as examples. LDCT images are susceptible to noise and artifacts, where the imaging process introduces distortion. LDCT image restoration is an important preprocessing step for further medical analysis. Deep learning (DL) methods have been developed to solve this problem. We examine an alternative solution using the Green Learning (GL) methodology. The new restoration method is characterized by mathematical transparency, computational and memory efficiency, and high performance. Experiments show that our GL method offers state-of-the-art restoration performance at a smaller model size and with lower inference complexity.
Chinese Translation
本研究提出了一种绿色学习(Green Learning, GL)方法用于恢复医学图像。为了不失一般性,我们以低剂量计算机断层扫描(Low-Dose Computed Tomography, LDCT)图像为例。LDCT图像易受噪声和伪影的影响,成像过程会引入失真。LDCT图像恢复是进一步医学分析的重要预处理步骤。深度学习(Deep Learning, DL)方法已被开发用于解决这一问题。我们考察了一种使用绿色学习(Green Learning, GL)方法的替代解决方案。该新恢复方法具有数学透明性、计算和内存效率高以及性能优越的特点。实验表明,我们的GL方法在较小的模型规模和较低的推理复杂度下,提供了最先进的恢复性能。
cs.CV / 143 / 2602.19542

Vinedresser3D: Agentic Text-guided 3D Editing

Vinedresser3D:自主文本引导的3D编辑
Chi, Yankuan, Li, Xiang, Huang, Zixuan, Rehg, James M.
Abstract
Text-guided 3D editing aims to modify existing 3D assets using natural-language instructions. Current methods struggle to jointly understand complex prompts, automatically localize edits in 3D, and preserve unedited content. We introduce Vinedresser3D, an agentic framework for high-quality text-guided 3D editing that operates directly in the latent space of a native 3D generative model. Given a 3D asset and an editing prompt, Vinedresser3D uses a multimodal large language model to infer rich descriptions of the original asset, identify the edit region and edit type (addition, modification, deletion), and generate decomposed structural and appearance-level text guidance. The agent then selects an informative view and applies an image editing model to obtain visual guidance. Finally, an inversion-based rectified-flow inpainting pipeline with an interleaved sampling module performs editing in the 3D latent space, enforcing prompt alignment while maintaining 3D coherence and unedited regions. Experiments on diverse 3D edits demonstrate that Vinedresser3D outperforms prior baselines in both automatic metrics and human preference studies, while enabling precise, coherent, and mask-free 3D editing.
Chinese Translation
文本引导的3D编辑旨在使用自然语言指令修改现有的3D资产。目前的方法在共同理解复杂提示、自动定位3D中的编辑以及保留未编辑内容方面存在困难。我们提出了Vinedresser3D,这是一个用于高质量文本引导3D编辑的自主框架,直接在本地3D生成模型的潜在空间中操作。给定一个3D资产和一个编辑提示,Vinedresser3D使用多模态大型语言模型推断原始资产的丰富描述,识别编辑区域和编辑类型(添加、修改、删除),并生成分解的结构和外观级文本指导。然后,代理选择一个信息丰富的视角,并应用图像编辑模型以获得视觉指导。最后,基于反演的修正流填充管道与交错采样模块在3D潜在空间中执行编辑,确保提示对齐,同时保持3D一致性和未编辑区域。对多样化3D编辑的实验表明,Vinedresser3D在自动指标和人类偏好研究中均优于先前的基线,同时实现了精确、一致且无遮罩的3D编辑。
cs.CV / 144 / 2602.19565

DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces

DICArt:在离散状态空间中推进类别级关节物体姿态估计
Zhang, Li, Mei, Mingyu, Wang, Ailing, Meng, Xianhui, Zhong, Yan, Song, Xinyuan, Liu, Liu, Wang, Rujing, He, Zaixing, Lu, Cewu
Abstract
Articulated object pose estimation is a core task in embodied AI. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) failing to incorporate intrinsic kinematic constraints. In this work, we introduce DICArt (DIsCrete Diffusion for Articulation Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the GT pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy, estimating the pose of each rigid part hierarchically to respect the object's kinematic structure. We validate DICArt on both synthetic and real-world datasets. Experimental results demonstrate its superior performance and robustness. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments.
Chinese Translation
关节物体姿态估计是具身人工智能中的核心任务。现有方法通常在连续空间中回归姿态,但往往面临以下挑战:1)在大型复杂搜索空间中导航困难;2)未能有效结合内在运动学约束。在本研究中,我们提出了DICArt(DIsCrete Diffusion for Articulation Pose Estimation),这是一个将姿态估计形式化为条件离散扩散过程的新框架。DICArt不是在连续域中操作,而是通过学习的反向扩散过程逐步去噪一个嘈杂的姿态表示,以恢复真实姿态(GT pose)。为了提高建模的真实性,我们提出了一种灵活的流决策器,动态决定每个标记是去噪还是重置,有效平衡扩散过程中的真实分布和噪声分布。此外,我们还结合了一种分层运动学耦合策略,分层估计每个刚性部分的姿态,以尊重物体的运动学结构。我们在合成数据集和真实世界数据集上验证了DICArt。实验结果表明其优越的性能和鲁棒性。通过将离散生成建模与结构先验相结合,DICArt为在复杂环境中可靠的类别级6D姿态估计提供了一种新范式。
cs.CV / 145 / 2602.19570

VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense

VALD:多阶段视觉攻击检测以实现高效的LVLM防御
Kadvil, Nadav, Tal, Ayellet
Abstract
Large Vision-Language Models (LVLMs) can be vulnerable to adversarial images that subtly bias their outputs toward plausible yet incorrect responses. We introduce a general, efficient, and training-free defense that combines image transformations with agentic data consolidation to recover correct model behavior. A key component of our approach is a two-stage detection mechanism that quickly filters out the majority of clean inputs. We first assess image consistency under content-preserving transformations at negligible computational cost. For more challenging cases, we examine discrepancies in a text-embedding space. Only when necessary do we invoke a powerful LLM to resolve attack-induced divergences. A key idea is to consolidate multiple responses, leveraging both their similarities and their differences. We show that our method achieves state-of-the-art accuracy while maintaining notable efficiency: most clean images skip costly processing, and even in the presence of numerous adversarial examples, the overhead remains minimal.
Chinese Translation
大型视觉语言模型(LVLMs)可能会受到对抗性图像的影响,这些图像微妙地偏向于产生看似合理但实际上错误的响应。我们提出了一种通用、高效且无需训练的防御方法,该方法将图像变换与主动数据整合相结合,以恢复模型的正确行为。我们方法的一个关键组成部分是一个两阶段检测机制,能够快速过滤掉大多数干净输入。我们首先在几乎没有计算成本的情况下评估图像在内容保持变换下的一致性。对于更具挑战性的情况,我们检查文本嵌入空间中的差异。只有在必要时,我们才会调用强大的大型语言模型(LLM)来解决攻击引起的偏差。一个关键思想是整合多个响应,利用它们的相似性和差异性。我们展示了我们的方法在保持显著效率的同时实现了最先进的准确性:大多数干净图像跳过了昂贵的处理,即使在存在大量对抗性样本的情况下,开销仍然保持在最低限度。
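The staged filtering logic can be summarized in a few lines; the sketch below uses placeholder callables for the VLM, the content-preserving transforms, and the text embedder, and the agreement thresholds are invented for illustration, so it shows control flow rather than the paper's actual components.

    import numpy as np

    def screen(image, vlm, transforms, embed, tau_exact=0.9, tau_sem=0.8):
        answers = [vlm(t(image)) for t in transforms] + [vlm(image)]
        # stage 1: cheap exact-agreement under content-preserving transforms
        if np.mean([a == answers[-1] for a in answers[:-1]]) >= tau_exact:
            return "clean"
        # stage 2: semantic agreement in a text-embedding space
        E = np.stack([embed(a) for a in answers])
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        if (E @ E.T).mean() >= tau_sem:
            return "clean"
        # stage 3 (costly, rare): consolidate the divergent answers with an LLM
        return "suspect: consolidate with LLM"

    vlm = lambda img: "a red stop sign"                   # placeholder model
    transforms = [lambda im: im, lambda im: im[::-1]]     # identity, flip
    embed = lambda s: np.ones(4) + len(s) % 3             # placeholder embedder
    print(screen(np.zeros((4, 4)), vlm, transforms, embed))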
cs.CV / 146 / 2602.19571

HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies

HOCA-Bench:超越语义感知,通过黑格尔本体论-因果异常进行预测世界建模
Liu, Chang, Ye, Yunfan, Zhou, Qingyang, Tan, Xichen, Luo, Mengxuan, Qiu, Zhenyu, Peng, Wei, Cai, Zhiping
Abstract
Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling, which is central to physically grounded intelligence. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens. HOCA-Bench separates anomalies into two types: ontological anomalies, where an entity violates its own definition or persistence, and causal anomalies, where interactions violate physical relations. Using state-of-the-art generative video models as adversarial simulators, we build a testbed of 1,439 videos (3,470 QA pairs). Evaluations on 17 Video-LLMs show a clear cognitive lag: models often identify static ontological violations (e.g., shape mutations) but struggle with causal mechanisms (e.g., gravity or friction), with performance dropping by more than 20% on causal tasks. System-2 "Thinking" modes improve reasoning, but they do not close the gap, suggesting that current architectures recognize visual patterns more readily than they apply basic physical laws.
Chinese Translation
视频大语言模型(Video-LLMs)在语义感知方面取得了稳步进展,但在预测世界建模方面仍显不足,而这对于物理基础的智能至关重要。我们提出了HOCA-Bench,这是一个以黑格尔视角界定物理异常的基准。HOCA-Bench将异常分为两类:本体论异常,即实体违反其自身定义或持续性;因果异常,即交互违反物理关系。利用最先进的生成视频模型作为对抗模拟器,我们构建了一个包含1,439个视频(3,470个问答对)的测试平台。对17个视频大语言模型的评估显示出明显的认知滞后:模型通常能够识别静态的本体论违反(例如,形状变异),但在因果机制(例如,重力或摩擦)方面却表现不佳,因果任务的表现下降超过20%。系统-2“思考”模式改善了推理能力,但并未缩小差距,这表明当前架构更容易识别视觉模式,而不是应用基本的物理法则。
cs.CV / 147 / 2602.19575

ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization

ConceptPrism:通过残差标记优化实现个性化扩散模型中的概念解耦
Kim, Minseo, Kwon, Minchan, Lee, Dongyeun, Jeon, Yunho, Kim, Junmo
Abstract
Personalized text-to-image generation suffers from concept entanglement, where irrelevant residual information from reference images is captured, leading to a trade-off between concept fidelity and text alignment. Recent disentanglement approaches attempt to solve this utilizing manual guidance, such as linguistic cues or segmentation masks, which limits their applicability and fails to fully articulate the target concept. In this paper, we propose ConceptPrism, a novel framework that automatically disentangles the shared visual concept from image-specific residuals by comparing images within a set. Our method jointly optimizes a target token and image-wise residual tokens using two complementary objectives: a reconstruction loss to ensure fidelity, and a novel exclusion loss that compels residual tokens to discard the shared concept. This process allows the target token to capture the pure concept without direct supervision. Extensive experiments demonstrate that ConceptPrism effectively resolves concept entanglement, achieving a significantly improved trade-off between fidelity and alignment.
Chinese Translation
个性化文本到图像生成面临概念纠缠的问题,其中参考图像中的无关残余信息被捕获,导致概念保真度与文本对齐之间的权衡。近期的解耦方法试图通过手动指导(如语言线索或分割掩码)来解决这一问题,但这限制了它们的适用性,并未能充分表达目标概念。本文提出了ConceptPrism,一种新颖的框架,通过比较一组图像自动解耦共享视觉概念与图像特定残余信息。我们的方法使用两个互补目标共同优化目标标记和图像级残余标记:重建损失以确保保真度,以及一种新颖的排除损失,迫使残余标记丢弃共享概念。这个过程使目标标记能够在没有直接监督的情况下捕获纯粹的概念。大量实验表明,ConceptPrism有效解决了概念纠缠问题,实现了保真度与对齐之间显著改善的权衡。
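The decomposition can be illustrated with a linear toy problem: a shared target token plus per-image residual tokens reconstruct per-image features, while an exclusion term decorrelates the residuals from the target direction. In the paper both losses are defined through the diffusion model's denoising objective; the version below only demonstrates how the two terms pull the shared concept into the target token.

    import torch

    n_img, d = 4, 32
    target = torch.randn(d, requires_grad=True)             # shared concept token
    residuals = torch.randn(n_img, d, requires_grad=True)   # image-specific tokens
    feats = torch.randn(n_img, d)                           # stand-in image targets

    def losses():
        recon = ((target + residuals - feats) ** 2).mean()  # fidelity term
        t = target / (target.norm() + 1e-8)
        excl = ((residuals @ t) ** 2).mean()                # residuals drop concept
        return recon + 0.1 * excl

    opt = torch.optim.Adam([target, residuals], lr=1e-2)
    for _ in range(200):
        opt.zero_grad(); losses().backward(); opt.step()
    print(losses().item())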
cs.CV / 148 / 2602.19596

Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception

学习互视信息图以实现自适应对抗协同感知
Tao, Yihang, Hu, Senkang, An, Haonan, Fang, Zhengru, Cao, Hangcheng, Fang, Yuguang
Abstract
Collaborative perception (CP) enables data sharing among connected and autonomous vehicles (CAVs) to enhance driving safety. However, CP systems are vulnerable to adversarial attacks where malicious agents forge false objects via feature-level perturbations. Current defensive systems use threshold-based consensus verification by comparing collaborative and ego detection results. Yet, these defenses remain vulnerable to more sophisticated attack strategies that could exploit two critical weaknesses: (i) lack of robustness against attacks with systematic timing and target region optimization, and (ii) inadvertent disclosure of vulnerability knowledge through implicit confidence information in shared collaboration data. In this paper, we propose MVIG attack, a novel adaptive adversarial CP framework learning to capture vulnerability knowledge disclosed by different defensive CP systems from a unified mutual view information graph (MVIG) representation. Our approach combines MVIG representation with temporal graph learning to generate evolving fabrication risk maps and employs entropy-aware vulnerability search to optimize attack location, timing and persistence, enabling adaptive attacks with generalizability across various defensive configurations. Extensive evaluations on OPV2V and Adv-OPV2V datasets demonstrate that MVIG attack reduces defense success rates by up to 62% against state-of-the-art defenses while achieving 47% lower detection for persistent attacks at 29.9 FPS, exposing critical security gaps in CP systems. Code will be released at https://github.com/yihangtao/MVIG.git
Chinese Translation
协同感知(CP)使得连接和自主车辆(CAVs)之间能够共享数据,从而增强驾驶安全性。然而,CP系统容易受到对抗性攻击,恶意代理通过特征级扰动伪造虚假物体。目前的防御系统通过比较协同检测和自我检测结果,采用基于阈值的共识验证。然而,这些防御仍然容易受到更复杂攻击策略的利用,这些策略可能利用两个关键弱点:(i)缺乏对系统性时机和目标区域优化攻击的鲁棒性,以及(ii)通过共享协作数据中的隐式置信信息无意间泄露脆弱性知识。在本文中,我们提出了MVIG攻击,这是一种新颖的自适应对抗CP框架,旨在学习从统一的互视信息图(MVIG)表示中捕捉不同防御CP系统所披露的脆弱性知识。我们的方法将MVIG表示与时间图学习相结合,以生成不断演变的伪造风险图,并采用基于熵的脆弱性搜索来优化攻击位置、时机和持续性,从而实现跨各种防御配置的自适应攻击。对OPV2V和Adv-OPV2V数据集的广泛评估表明,MVIG攻击使得最先进的防御成功率降低了多达62%,同时在29.9 FPS下对持续攻击的检测率降低了47%,暴露了CP系统中的关键安全漏洞。代码将发布在 https://github.com/yihangtao/MVIG.git
cs.CV / 149 / 2602.19605

CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning

CLCR:用于多模态学习的跨层语义协同表示
Meng, Chunlei, Huang, Guanhong, Fu, Rong, Jian, Runmin, Gan, Zhongxue, Ouyang, Chun
Abstract
Multimodal learning aims to capture both shared and private information from multiple modalities. However, existing methods that project all modalities into a single latent space for fusion often overlook the asynchronous, multi-level semantic structure of multimodal data. This oversight induces semantic misalignment and error propagation, thereby degrading representation quality. To address this issue, we propose Cross-Level Co-Representation (CLCR), which explicitly organizes each modality's features into a three-level semantic hierarchy and specifies level-wise constraints for cross-modal interactions. First, a semantic hierarchy encoder aligns shallow, mid, and deep features across modalities, establishing a common basis for interaction. And then, at each level, an Intra-Level Co-Exchange Domain (IntraCED) factorizes features into shared and private subspaces and restricts cross-modal attention to the shared subspace via a learnable token budget. This design ensures that only shared semantics are exchanged and prevents leakage from private channels. To integrate information across levels, the Inter-Level Co-Aggregation Domain (InterCAD) synchronizes semantic scales using learned anchors, selectively fuses the shared representations, and gates private cues to form a compact task representation. We further introduce regularization terms to enforce separation of shared and private features and to minimize cross-level interference. Experiments on six benchmarks spanning emotion recognition, event localization, sentiment analysis, and action recognition show that CLCR achieves strong performance and generalizes well across tasks.
Chinese Translation
多模态学习旨在从多种模态中捕获共享和私有信息。然而,现有方法将所有模态投影到单一潜在空间进行融合时,往往忽视了多模态数据的异步多层语义结构。这一忽视导致了语义不对齐和错误传播,从而降低了表示质量。为了解决这一问题,我们提出了跨层协同表示(Cross-Level Co-Representation,CLCR),该方法明确将每种模态的特征组织成三层语义层次,并为跨模态交互指定层级约束。首先,语义层次编码器对齐了模态间的浅层、中层和深层特征,建立了交互的共同基础。然后,在每一层,内部层级协同交换域(Intra-Level Co-Exchange Domain,IntraCED)将特征分解为共享和私有子空间,并通过可学习的令牌预算限制跨模态注意力仅在共享子空间内进行。这一设计确保只有共享语义被交换,并防止私有通道的信息泄漏。为了跨层整合信息,跨层协同聚合域(Inter-Level Co-Aggregation Domain,InterCAD)使用学习的锚点同步语义尺度,有选择地融合共享表示,并对私有线索进行门控,以形成紧凑的任务表示。我们进一步引入正则化项以强制分离共享和私有特征,并最小化跨层干扰。在涵盖情感识别、事件定位、情感分析和动作识别的六个基准测试中的实验表明,CLCR在性能上表现出色,并在任务间具有良好的泛化能力。
cs.CV / 150 / 2602.19608

Satellite-Based Detection of Looted Archaeological Sites Using Machine Learning

基于卫星影像与机器学习的被掠夺考古遗址检测
Tadesse, Girmaw Abebe, Bartette, Titien, Hassanali, Andrew, Kim, Allen, Chemla, Jonathan, Zolli, Andrew, Ubelmann, Yves, Robinson, Caleb, Becker-Reshef, Inbal, Ferres, Juan Lavista
Abstract
Looting at archaeological sites poses a severe risk to cultural heritage, yet monitoring thousands of remote locations remains operationally difficult. We present a scalable and satellite-based pipeline to detect looted archaeological sites, using PlanetScope monthly mosaics (4.7m/pixel) and a curated dataset of 1,943 archaeological sites in Afghanistan (898 looted, 1,045 preserved) with multi-year imagery (2016--2023) and site-footprint masks. We compare (i) end-to-end CNN classifiers trained on raw RGB patches and (ii) traditional machine learning (ML) trained on handcrafted spectral/texture features and embeddings from recent remote-sensing foundation models. Results indicate that ImageNet-pretrained CNNs combined with spatial masking reach an F1 score of 0.926, clearly surpassing the strongest traditional ML setup, which attains an F1 score of 0.710 using SatCLIP-V+RF+Mean, i.e., location and vision embeddings fed into a Random Forest with mean-based temporal aggregation. Ablation studies demonstrate that ImageNet pretraining (even in the presence of domain shift) and spatial masking enhance performance. In contrast, geospatial foundation model embeddings perform competitively with handcrafted features, suggesting that looting signatures are extremely localized. The repository is available at https://github.com/microsoft/looted_site_detection.
Chinese Translation
考古遗址的掠夺对文化遗产构成了严重威胁,但监测成千上万的偏远地点在操作上仍然困难。我们提出了一种可扩展的基于卫星的管道,用于检测被掠夺的考古遗址,利用PlanetScope每月的马赛克影像(4.7米/像素)和一个经过整理的阿富汗考古遗址数据集(1,943个遗址,其中898个被掠夺,1,045个被保存),结合多年的影像(2016-2023)和遗址轮廓掩膜。我们比较了(i)在原始RGB图块上训练的端到端卷积神经网络(CNN)分类器和(ii)基于手工制作的光谱/纹理特征及来自近期遥感基础模型的嵌入进行训练的传统机器学习(ML)。结果表明,结合空间掩膜的ImageNet预训练CNN达到了0.926的F1分数,明显超过了最强的传统ML设置,该设置使用SatCLIP-V+RF+Mean获得了0.710的F1分数,即将位置和视觉嵌入输入到随机森林中并进行基于均值的时间聚合。消融研究表明,ImageNet预训练(即使在存在领域转移的情况下)和空间掩膜能够提升性能。相比之下,地理空间基础模型嵌入与手工特征的表现相当,表明掠夺特征极为局部化。相关资料可在https://github.com/microsoft/looted_site_detection获取。
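The strongest traditional baseline reported above (embeddings, mean temporal aggregation, then a Random Forest) is easy to replicate in outline with scikit-learn; the sketch below uses random arrays in place of SatCLIP-V embeddings, so the printed accuracy is chance level and only the pipeline shape is meaningful.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_sites, n_years, d = 400, 8, 32
    emb = rng.normal(size=(n_sites, n_years, d))  # per-year site embeddings (stand-in)
    y = rng.integers(0, 2, n_sites)               # 1 = looted, 0 = preserved

    X = emb.mean(axis=1)                          # mean-based temporal aggregation
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)
    print("accuracy:", clf.score(Xte, yte))       # ~0.5 on random data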
cs.CV / 151 / 2602.19611

RAID: Retrieval-Augmented Anomaly Detection

RAID:检索增强异常检测
Cai, Mingxiu, Zhang, Zhe, Wu, Gaochang, Chai, Tianyou, Zhu, Xiatian
Abstract
Unsupervised Anomaly Detection (UAD) aims to identify abnormal regions by establishing correspondences between test images and normal templates. Existing methods primarily rely on image reconstruction or template retrieval but face a fundamental challenge: matching between test images and normal templates inevitably introduces noise due to intra-class variations, imperfect correspondences, and limited templates. Observing that Retrieval-Augmented Generation (RAG) leverages retrieved samples directly in the generation process, we reinterpret UAD through this lens and introduce RAID, a retrieval-augmented UAD framework designed for noise-resilient anomaly detection and localization. Unlike standard RAG that enriches context or knowledge, we focus on using retrieved normal samples to guide noise suppression in anomaly map generation. RAID retrieves class-, semantic-, and instance-level representations from a hierarchical vector database, forming a coarse-to-fine pipeline. A matching cost volume correlates the input with retrieved exemplars, followed by a guided Mixture-of-Experts (MoE) network that leverages the retrieved samples to adaptively suppress matching noise and produce fine-grained anomaly maps. RAID achieves state-of-the-art performance across full-shot, few-shot, and multi-dataset settings on MVTec, VisA, MPDD, and BTAD benchmarks. https://github.com/Mingxiu-Cai/RAID
Chinese Translation
无监督异常检测(UAD)旨在通过建立测试图像与正常模板之间的对应关系来识别异常区域。现有方法主要依赖于图像重建或模板检索,但面临一个根本性挑战:测试图像与正常模板之间的匹配不可避免地引入噪声,这源于类内变异、不完美的对应关系以及模板数量的限制。我们观察到检索增强生成(RAG)在生成过程中直接利用检索到的样本,因此我们通过这一视角重新解释UAD,并引入了RAID,一个旨在实现抗噪声异常检测和定位的检索增强UAD框架。与标准的RAG不同,后者旨在丰富上下文或知识,我们专注于利用检索到的正常样本来指导异常图生成中的噪声抑制。RAID从分层向量数据库中检索类别、语义和实例级别的表示,形成一个粗到细的处理流程。一个匹配成本体积将输入与检索到的示例相关联,随后是一个引导的专家混合(Mixture-of-Experts, MoE)网络,该网络利用检索到的样本自适应地抑制匹配噪声并生成细粒度的异常图。RAID在MVTec、VisA、MPDD和BTAD基准测试的全样本(full-shot)、少样本(few-shot)和多数据集设置中实现了最先进的性能。代码见 https://github.com/Mingxiu-Cai/RAID
cs.CV / 152 / 2602.19615

Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness

清晰可见,自信推理:即插即用的视觉语言模型盲点补救措施
Hu, Xin, Ni, Haomiao, Zhang, Yunbei, Hamm, Jihun, Li, Zechen, Ding, Zhengming
Abstract
Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive during finetuning VLMs and don't fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs' reasoning over rare objects by refining visual tokens and enriching input text prompts, without VLMs finetuning. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that improves fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM's attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM's ability to focus on and reason about rare objects.
Chinese Translation
视觉语言模型(VLMs)在广泛的视觉理解方面取得了显著成功,但由于预训练数据中稀有对象实例的稀缺,它们在以对象为中心的推理方面仍面临挑战。虽然之前的努力通过检索额外数据或引入更强大的视觉编码器来缓解这一问题,但这些方法在微调VLMs时仍然计算密集,并未充分利用原始训练数据。在本文中,我们引入了一种高效的即插即用模块,通过精炼视觉标记和丰富输入文本提示,显著提高VLMs对稀有对象的推理能力,而无需对VLMs进行微调。具体而言,我们提出通过利用视觉基础模型的先验知识和同义词增强的文本描述,学习稀有对象的多模态类别嵌入,以弥补训练示例的不足。这些嵌入通过轻量级的基于注意力的增强模块来精炼VLMs中的视觉标记,从而改善细粒度的对象细节。此外,我们使用学习到的嵌入作为对象感知检测器生成信息提示,这些提示被注入到文本提示中,以帮助引导VLM的注意力关注相关的图像区域。在两个基准测试上的实验显示,预训练的VLMs在稀有对象识别和推理方面取得了一致且显著的提升。进一步的分析揭示了我们的方法如何增强VLM专注于稀有对象并进行推理的能力。
cs.CV / 153 / 2602.19623

PedaCo-Gen: Scaffolding Pedagogical Agency in Human-AI Collaborative Video Authoring

PedaCo-Gen:在人机协作视频创作中为教学能动性提供支架
Baek, Injun, Kim, Yearim, Kwak, Nojun
Abstract
While advancements in Text-to-Video (T2V) generative AI offer a promising path toward democratizing content creation, current models are often optimized for visual fidelity rather than instructional efficacy. This study introduces PedaCo-Gen, a pedagogically-informed human-AI collaborative video generating system for authoring instructional videos based on Mayer's Cognitive Theory of Multimedia Learning (CTML). Moving away from traditional "one-shot" generation, PedaCo-Gen introduces an Intermediate Representation (IR) phase, enabling educators to interactively review and refine video blueprints-comprising scripts and visual descriptions-with an AI reviewer. Our study with 23 education experts demonstrates that PedaCo-Gen significantly enhances video quality across various topics and CTML principles compared to baselines. Participants perceived the AI-driven guidance not merely as a set of instructions but as a metacognitive scaffold that augmented their instructional design expertise, reporting high production efficiency (M=4.26) and guide validity (M=4.04). These findings highlight the importance of reclaiming pedagogical agency through principled co-creation, providing a foundation for future AI authoring tools that harmonize generative power with human professional expertise.
Chinese Translation
尽管文本到视频(Text-to-Video, T2V)生成性人工智能的进展为内容创作的民主化提供了有希望的路径,但当前模型往往更注重视觉逼真度而非教学效果。本研究介绍了PedaCo-Gen,这是一种基于梅耶(Mayer)的多媒体学习认知理论(Cognitive Theory of Multimedia Learning, CTML)的教学导向人机协作视频生成系统,旨在创作教学视频。PedaCo-Gen摒弃了传统的“一次性”生成方式,引入了中间表示(Intermediate Representation, IR)阶段,使教育工作者能够与AI评审者互动,审查和完善包括脚本和视觉描述在内的视频蓝图。我们与23位教育专家的研究表明,与基线相比,PedaCo-Gen显著提升了各类主题和CTML原则下的视频质量。参与者认为AI驱动的指导不仅仅是一套指令,而是一个增强他们教学设计专业知识的元认知支架,报告了高生产效率(M=4.26)和指导有效性(M=4.04)。这些发现强调了通过原则性共同创作重新获得教学能动性的重要性,为未来能够将生成能力与人类专业知识相结合的AI创作工具奠定了基础。
cs.CV / 154 / 2602.19624

Accurate Planar Tracking With Robust Re-Detection

具有鲁棒重检测的精确平面跟踪
Serych, Jonas, Matas, Jiri
Abstract
We present SAM-H and WOFTSAM, novel planar trackers that combine robust long-term segmentation tracking provided by SAM 2 with 8 degrees-of-freedom homography pose estimation. SAM-H estimates homographies from segmentation mask contours and is thus highly robust to target appearance changes. WOFTSAM significantly improves the current state-of-the-art planar tracker WOFT by exploiting lost target re-detection provided by SAM-H. The proposed methods are evaluated on POT-210 and PlanarTrack tracking benchmarks, setting the new state-of-the-art performance on both. On the latter, they outperform the second best by a large margin, +12.4 and +15.2pp on the p@15 metric. We also present improved ground-truth annotations of initial PlanarTrack poses, enabling more accurate benchmarking in the high-precision p@5 metric. The code and the re-annotations are available at https://github.com/serycjon/WOFTSAM
Chinese Translation
我们提出了SAM-H和WOFTSAM这两种新型平面跟踪器,它们结合了由SAM 2提供的鲁棒长期分割跟踪和8自由度的单应性姿态估计。SAM-H通过分割掩膜轮廓估计单应性,因此对目标外观变化具有很高的鲁棒性。WOFTSAM通过利用SAM-H提供的丢失目标重检测,显著改善了当前最先进的平面跟踪器WOFT。所提出的方法在POT-210和PlanarTrack跟踪基准上进行了评估,在这两个基准上均设定了新的最先进性能。在后者上,它们在p@15指标上以+12.4和+15.2个百分点大幅超越第二名。我们还提供了初始PlanarTrack姿态的改进真实标注,从而在高精度p@5指标中实现更准确的基准测试。代码和重标注可在https://github.com/serycjon/WOFTSAM获取。
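The idea of recovering a homography from mask contours can be sketched with OpenCV, assuming the two contours are traversed consistently so that uniformly resampled points roughly correspond; this index-based matching is far cruder than SAM-H's estimation and serves only to show the contour-to-homography step.

    import cv2
    import numpy as np

    def contour_points(mask, n=200):
        cnts, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        c = max(cnts, key=cv2.contourArea).reshape(-1, 2).astype(np.float32)
        idx = np.linspace(0, len(c) - 1, n).astype(int)   # uniform resampling
        return c[idx]

    def homography_from_masks(mask_ref, mask_cur):
        src, dst = contour_points(mask_ref), contour_points(mask_cur)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # 8-DoF pose
        return H

    ref = np.zeros((200, 200), np.uint8)
    cur = np.zeros((200, 200), np.uint8)
    cv2.rectangle(ref, (50, 60), (150, 140), 1, -1)   # filled target mask
    cv2.rectangle(cur, (60, 50), (160, 150), 1, -1)   # target after motion
    print(homography_from_masks(ref, cur))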
cs.CV / 155 / 2602.19631

Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection

通过高层次表示误导实现文本到图像扩散模型中的局部概念消除
Lee, Uichan, Kim, Jeonghyeon, Hwang, Sangheum
Abstract
Recent advances in text-to-image (T2I) diffusion models have seen rapid and widespread adoption. However, their powerful generative capabilities raise concerns about potential misuse for synthesizing harmful, private, or copyrighted content. To mitigate such risks, concept erasure techniques have emerged as a promising solution. Prior works have primarily focused on fine-tuning the denoising component (e.g., the U-Net backbone). However, recent causal tracing studies suggest that visual attribute information is localized in the early self-attention layers of the text encoder, indicating a potential alternative for concept erasing. Building on this insight, we conduct preliminary experiments and find that directly fine-tuning early layers can suppress target concepts but often degrades the generation quality of non-target concepts. To overcome this limitation, we propose High-Level Representation Misdirection (HiRM), which misdirects high-level semantic representations of target concepts in the text encoder toward designated vectors such as random directions or semantically defined directions (e.g., supercategories), while updating only early layers that contain causal states of visual attributes. Our decoupling strategy enables precise concept removal with minimal impact on unrelated concepts, as demonstrated by strong results on UnlearnCanvas and NSFW benchmarks across diverse targets (e.g., objects, styles, nudity). HiRM also preserves generative utility at low training cost, transfers to state-of-the-art architectures such as Flux without additional training, and shows synergistic effects with denoiser-based concept erasing methods.
Chinese Translation
近期文本到图像(T2I)扩散模型的快速发展和广泛应用引发了人们的关注。然而,其强大的生成能力也带来了潜在的滥用风险,例如合成有害、私密或受版权保护的内容。为降低此类风险,概念消除技术应运而生,成为一种有前景的解决方案。以往的研究主要集中在微调去噪组件(例如,U-Net骨干网络)。然而,最近的因果追踪研究表明,视觉属性信息在文本编码器的早期自注意力层中是局部化的,这为概念消除提供了潜在的替代方案。基于这一见解,我们进行了初步实验,发现直接微调早期层能够抑制目标概念,但往往会降低非目标概念的生成质量。为克服这一限制,我们提出了高层次表示误导(HiRM),该方法将文本编码器中目标概念的高层次语义表示误导到指定向量,如随机方向或语义定义方向(例如,超类别),同时仅更新包含视觉属性因果状态的早期层。我们的解耦策略能够实现精确的概念移除,对无关概念的影响最小,正如在UnlearnCanvas和NSFW基准测试中对多样化目标(例如,物体、风格、裸体)所展示的强大结果。HiRM还在低训练成本下保持生成效用,能够转移到如Flux等最先进的架构而无需额外训练,并与基于去噪器的概念消除方法显示出协同效应。
cs.CV / 156 / 2602.19668

Personalized Longitudinal Medical Report Generation via Temporally-Aware Federated Adaptation

基于时间感知的联邦适应的个性化纵向医学报告生成
Zhu, He, Togo, Ren, Ogawa, Takahiro, Hirata, Kenji, Tang, Minghui, Yoshimura, Takaaki, Sugimori, Hiroyuki, Nishioka, Noriko, Shimizu, Yukie, Kudo, Kohsuke, Haseyama, Miki
Abstract
Longitudinal medical report generation is clinically important yet remains challenging due to strict privacy constraints and the evolving nature of disease progression. Although federated learning (FL) enables collaborative training without data sharing, existing FL methods largely overlook longitudinal dynamics by assuming stationary client distributions, making them unable to model temporal shifts across visits or patient-specific heterogeneity-ultimately leading to unstable optimization and suboptimal report generation. We introduce Federated Temporal Adaptation (FTA), a federated setting that explicitly accounts for the temporal evolution of client data. Building upon this setting, we propose FedTAR, a framework that integrates demographic-driven personalization with time-aware global aggregation. FedTAR generates lightweight LoRA adapters from demographic embeddings and performs temporal residual aggregation, where updates from different visits are weighted by a meta-learned temporal policy optimized via first-order MAML. Experiments on J-MID (1M exams) and MIMIC-CXR demonstrate consistent improvements in linguistic accuracy, temporal coherence, and cross-site generalization, establishing FedTAR as a robust and privacy-preserving paradigm for federated longitudinal modeling.
Chinese Translation
纵向医学报告生成在临床上具有重要意义,但由于严格的隐私限制和疾病进展的不断变化,仍然面临挑战。尽管联邦学习(FL)允许在不共享数据的情况下进行协作训练,但现有的FL方法在很大程度上忽视了纵向动态,假设客户端分布是静态的,这使得它们无法建模访问之间的时间变化或患者特定的异质性,最终导致优化不稳定和报告生成不理想。我们提出了联邦时间适应(Federated Temporal Adaptation, FTA),这是一个明确考虑客户端数据时间演变的联邦设置。在此基础上,我们提出了FedTAR,一个将基于人口统计的个性化与时间感知的全局聚合相结合的框架。FedTAR从人口统计嵌入生成轻量级的LoRA适配器,并执行时间残差聚合,其中来自不同访问的更新由通过一阶MAML优化的元学习时间策略加权。对J-MID(100万次检查)和MIMIC-CXR的实验表明,在语言准确性、时间一致性和跨站点泛化方面均有持续改善,确立了FedTAR作为一种稳健且保护隐私的联邦纵向建模范式。
cs.CV / 157 / 2602.19679

TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures

TeHOR:文本引导的带纹理3D人体与物体重建
Nam, Hyeongjin, Jung, Daniel Sungho, Lee, Kyoung Mu
Abstract
Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human-object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human-object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, enabling reasoning over a wider spectrum of interactions, including non-contact cases. Second, we incorporate appearance cues of the 3D human and object into the alignment process to capture holistic contextual information, thereby ensuring visually plausible reconstructions. As a result, our framework produces accurate and semantically coherent reconstructions, achieving state-of-the-art performance.
Chinese Translation
从单幅图像中联合重建3D人体与物体是一个活跃的研究领域,在机器人技术和数字内容创作中具有重要应用。尽管近期取得了一些进展,现有方法仍然面临两个基本限制。首先,它们的重建严重依赖物理接触信息,这本质上无法捕捉到非接触的人与物体交互,例如凝视或指向某个物体。其次,重建过程主要由局部几何邻近性驱动,忽视了人和物体的外观,而这些外观提供了理解整体交互所需的全局上下文。为了解决这些问题,我们提出了TeHOR,一个基于两个核心设计的框架。首先,除了接触信息外,我们的框架利用人与物体交互的文本描述,强制实现3D重建与其文本线索之间的语义对齐,从而能够推理更广泛的交互,包括非接触情况。其次,我们将3D人体与物体的外观线索纳入对齐过程中,以捕捉整体上下文信息,从而确保视觉上合理的重建。因此,我们的框架能够生成准确且语义一致的重建,达到了最先进的性能。
cs.CV / 158 / 2602.19697

BayesFusion-SDF: Probabilistic Signed Distance Fusion with View Planning on CPU

BayesFusion-SDF:基于CPU的概率有符号距离融合与视图规划
Mazumdar, Soumya, Rakesh, Vineet Kumar, Samanta, Tapas
Abstract
A key part of robotics, augmented reality, and digital inspection is dense 3D reconstruction from depth observations. Traditional volumetric fusion techniques, including truncated signed distance functions (TSDF), enable efficient and deterministic geometry reconstruction; however, they depend on heuristic weighting and fail to transparently convey uncertainty in a systematic way. Recent neural implicit methods, on the other hand, achieve very high fidelity but usually require substantial GPU resources for optimization and are hard to interpret for downstream decision-making. This work presents BayesFusion-SDF, a CPU-centric probabilistic signed distance fusion framework that conceptualizes geometry as a sparse Gaussian random field with a defined posterior distribution over voxel distances. First, a rough TSDF reconstruction is used to create an adaptive narrow-band domain. Then, depth observations are combined using a heteroscedastic Bayesian formulation that is solved using sparse linear algebra and preconditioned conjugate gradients. Randomized diagonal estimators provide fast approximations of posterior uncertainty. This makes it possible to extract surfaces and plan the next best view while taking uncertainty into account. Tests on a controlled ablation scene and a CO3D object sequence show that the new method is more accurate geometrically than TSDF baselines and gives useful estimates of uncertainty for active sensing. The proposed formulation provides a clear and easy-to-use alternative to GPU-heavy neural reconstruction methods while remaining probabilistically interpretable and predictable in behavior. GitHub: https://mazumdarsoumya.github.io/BayesFusionSDF
Chinese Translation
机器人技术、增强现实和数字化检测的关键环节是基于深度观测的密集3D重建。传统的体积融合技术,包括截断有符号距离函数(TSDF),能够高效且确定性地重建几何形状;然而,它们依赖于启发式加权,并未以系统的方式透明地传达不确定性。另一方面,最近的神经隐式方法虽然能获得非常高的保真度,但通常需要大量的GPU计算能力进行优化,并且在后续决策时不易理解。本研究提出了BayesFusion-SDF,这是一个以CPU为中心的概率有符号距离融合框架,将几何形状概念化为具有定义的后验分布的稀疏高斯随机场。首先,使用粗略的TSDF重建创建一个自适应的窄带域。然后,利用异方差贝叶斯公式结合深度观测,通过稀疏线性代数和预条件共轭梯度法进行求解。随机对角估计器可以快速估计后验不确定性。这使得在考虑不确定性的情况下提取表面并规划下一个最佳视图成为可能。在一个受控的消融场景和CO3D物体序列上的测试表明,该新方法在几何上比TSDF基线更准确,并为主动感知提供了有用的不确定性估计。所提出的公式为GPU密集型神经重建方法提供了一个清晰且易于使用的替代方案,同时仍然能够以概率的方式理解并以可预测的方式运行。GitHub: https://mazumdarsoumya.github.io/BayesFusionSDF
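The two numerical ingredients, a sparse posterior solve by conjugate gradients and a randomized (Hutchinson-style) estimate of the posterior variances diag(Λ⁻¹), are standard and can be sketched with SciPy; the toy tridiagonal precision matrix below stands in for the narrow-band voxel system and is not the paper's actual model.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import cg

    rng = np.random.default_rng(0)
    n = 500
    # toy SPD precision matrix: smoothness prior + observation terms
    Lam = sp.diags([-1.0, 2.2, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    b = rng.normal(size=n)

    mu, info = cg(Lam, b)          # posterior mean via conjugate gradients

    def diag_inverse(Lam, probes=64):
        """Randomized estimate of diag(Lam^{-1}) (posterior variances):
        E[z * (Lam^{-1} z)] with Rademacher probes z."""
        est = np.zeros(Lam.shape[0])
        for _ in range(probes):
            z = rng.choice([-1.0, 1.0], size=Lam.shape[0])
            x, _ = cg(Lam, z)
            est += z * x
        return est / probes

    print(mu[:3], diag_inverse(Lam)[:3])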
cs.CV / 159 / 2602.19706

HDR Reconstruction Boosting with Training-Free and Exposure-Consistent Diffusion

无训练和曝光一致性扩散的HDR重建增强
Lin, Yo-Tin, Chen, Su-Kai, Hu, Hou-Ning, Lin, Yen-Yu, Liu, Yu-Lun
Abstract
Single LDR to HDR reconstruction remains challenging for over-exposed regions where traditional methods often fail due to complete information loss. We present a training-free approach that enhances existing indirect and direct HDR reconstruction methods through diffusion-based inpainting. Our method combines text-guided diffusion models with SDEdit refinement to generate plausible content in over-exposed areas while maintaining consistency across multi-exposure LDR images. Unlike previous approaches requiring extensive training, our method seamlessly integrates with existing HDR reconstruction techniques through an iterative compensation mechanism that ensures luminance coherence across multiple exposures. We demonstrate significant improvements in both perceptual quality and quantitative metrics on standard HDR datasets and in-the-wild captures. Results show that our method effectively recovers natural details in challenging scenarios while preserving the advantages of existing HDR reconstruction pipelines. Project page: https://github.com/EusdenLin/HDR-Reconstruction-Boosting
Chinese Translation
单一低动态范围(LDR)到高动态范围(HDR)重建在过度曝光区域仍然具有挑战性,传统方法往往因信息完全丢失而失败。我们提出了一种无训练的方法,通过基于扩散的修补增强现有的间接和直接HDR重建方法。我们的方法结合了文本引导的扩散模型与SDEdit精细化,在过度曝光区域生成合理的内容,同时保持多曝光LDR图像之间的一致性。与以往需要大量训练的方案不同,我们的方法通过迭代补偿机制与现有的HDR重建技术无缝集成,确保多个曝光之间的亮度一致性。我们在标准HDR数据集和野外捕获中展示了在感知质量和定量指标上的显著改善。结果表明,我们的方法在具有挑战性的场景中有效恢复自然细节,同时保留现有HDR重建流程的优势。项目页面:https://github.com/EusdenLin/HDR-Reconstruction-Boosting
cs.CV / 160 / 2602.19708

ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets

ChimeraLoRA:多头LoRA引导的合成数据集
Kim, Hoyoung, Jang, Minwoo, Koo, Jabin, Yun, Sangdoo, Ok, Jungseul
Abstract
Beyond general recognition tasks, specialized domains including privacy-constrained medical applications and fine-grained settings often encounter data scarcity, especially for tail classes. To obtain less biased and more reliable models under such scarcity, practitioners leverage diffusion models to supplement underrepresented regions of real data. Specifically, recent studies fine-tune pretrained diffusion models with LoRA on few-shot real sets to synthesize additional images. While an image-wise LoRA trained on a single image captures fine-grained details yet offers limited diversity, a class-wise LoRA trained over all shots produces diverse images as it encodes class priors yet tends to overlook fine details. To combine both benefits, we separate the adapter into a class-shared LoRA $A$ for class priors and per-image LoRAs $\mathcal{B}$ for image-specific characteristics. To expose coherent class semantics in the shared LoRA $A$, we propose a semantic boosting by preserving class bounding boxes during training. For generation, we compose $A$ with a mixture of $\mathcal{B}$ using coefficients drawn from a Dirichlet distribution. Across diverse datasets, our synthesized images are both diverse and detail-rich while closely aligning with the few-shot real distribution, yielding robust gains in downstream classification accuracy.
Chinese Translation
除了通用识别任务外,隐私受限的医疗应用和细粒度设置等专业领域常常面临数据稀缺的问题,尤其是对于尾类。为了在这种稀缺情况下获得更少偏见和更可靠的模型,实践者利用扩散模型来补充真实数据中代表性不足的区域。具体而言,最近的研究通过在少量真实数据集上对预训练的扩散模型进行LoRA微调,以合成额外的图像。虽然在单个图像上训练的图像级LoRA能够捕捉细粒度细节,但其多样性有限;而在所有样本上训练的类别级LoRA则能够生成多样化的图像,因为它编码了类别先验,但往往忽视细节。为了结合这两者的优点,我们将适配器分为一个用于类别先验的类别共享LoRA $A$ 和用于图像特征的每图像LoRA $\mathcal{B}$。为了在共享LoRA $A$ 中揭示一致的类别语义,我们提出了一种语义增强方法,通过在训练过程中保留类别边界框来实现。在生成过程中,我们将 $A$ 与以从Dirichlet分布中抽取的系数混合的 $\mathcal{B}$ 组合。在不同的数据集上,我们合成的图像既多样又富有细节,同时与少量真实分布紧密对齐,从而在下游分类准确性上实现了显著提升。
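Generation-time composition is a simple weighted sum of low-rank factors; a sketch under assumed shapes follows (the order of the B·A product and the Dirichlet concentration are illustrative assumptions).

    import numpy as np

    rng = np.random.default_rng(0)
    d, r, k = 64, 4, 5   # feature dim, LoRA rank, number of per-image LoRAs

    A = rng.normal(size=(r, d))                        # class-shared factor
    Bs = [rng.normal(size=(d, r)) for _ in range(k)]   # per-image factors

    def compose(A, Bs, alpha=1.0):
        """Delta W = (sum_i w_i B_i) @ A with w ~ Dirichlet(alpha)."""
        w = rng.dirichlet(alpha * np.ones(len(Bs)))
        return sum(wi * Bi for wi, Bi in zip(w, Bs)) @ A, w

    delta_w, w = compose(A, Bs)
    print(delta_w.shape, w.round(3))   # (64, 64), mixture weights summing to 1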
cs.CV / 161 / 2602.19710

Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

通用姿态预训练用于可泛化的视觉-语言-动作策略
Lin, Haitao, Yu, Hanyang, Huang, Jingshun, Zhang, He, Ling, Yonggen, Tan, Ping, Xue, Xiangyang, Fu, Yanwei
Abstract
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.
Chinese Translation
现有的视觉-语言-动作(VLA)模型常常面临特征崩溃和低训练效率的问题,因为它们将高层次的感知与稀疏的、特定于具身的动作监督纠缠在一起。由于这些模型通常依赖于为视觉问答(VQA)优化的视觉语言模型(VLM)骨干网络,它们在语义识别方面表现出色,但往往忽视了决定不同动作模式的微妙三维状态变化。为了解决这些不匹配问题,我们提出了Pose-VLA,一种解耦的范式,将VLA训练分为两个阶段:预训练阶段用于在统一的相机中心空间中提取通用的三维空间先验,后训练阶段则用于在机器人特定的动作空间中实现高效的具身对齐。通过引入离散姿态标记作为通用表示,Pose-VLA无缝地将来自不同三维数据集的空间基础与来自机器人演示的几何级轨迹相结合。我们的框架遵循两阶段的预训练流程,首先通过姿态建立基本的空间基础,然后通过轨迹监督进行运动对齐。大量评估表明,Pose-VLA在RoboTwin 2.0上实现了79.5%的平均成功率,达到了最先进的结果,并在LIBERO上以96.0%的成绩表现出竞争力。现实世界的实验进一步展示了在每个任务仅使用100个演示的情况下,针对不同对象的强健泛化,验证了我们预训练范式的效率。
cs.CV / 162 / 2602.19715

Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

像素不会说谎(但你的检测器可能会):引导MLLM作为评判者,实现可信的深度伪造检测与推理监督
Kuckreja, Kartik, Gupta, Parul, Khan, Muhammad Haris, Dhall, Abhinav
Abstract
Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation, that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that specialize in evaluating reasoning rationales without the need for explicit ground truth reasoning rationales. The Judge is optimized through a bootstrapped generator-evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2%, outperforming 30x larger baselines. The reasoning judge attains very high correlation with human ratings and 98.9% pairwise agreement on the human-annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. Our user study shows that participants preferred the reasonings generated by our framework 70% of the time, in terms of faithfulness, groundedness, and usefulness, compared to those produced by other models and datasets. All of our datasets, models, and codebase are open-sourced at https://github.com/KjAeRsTuIsK/DeepfakeJudge.
Chinese Translation
深度伪造检测模型通常生成自然语言解释,但其推理往往缺乏视觉证据的支撑,限制了可靠性。现有评估主要测量分类准确性,却忽视了推理的可信度。我们提出了 DeepfakeJudge,一个可扩展的推理监督和评估框架,整合了一个包含近期生成和编辑伪造的分布外基准、一个带有视觉推理标签的人类标注子集,以及一套专注于评估推理理由的评估模型,无需显式的真实推理理由。该评判者通过引导生成-评估过程进行优化,将人类反馈转化为结构化的推理监督,并支持逐点和成对评估。在所提出的元评估基准上,我们的推理引导模型达到了 96.2% 的准确率,超越了规模大30倍的基线模型。推理评判者与人类评分之间的相关性非常高,在人类标注的元评估子集上达到了 98.9% 的成对一致性。这些结果确立了推理可信度作为深度伪造检测的可量化维度,并展示了可扩展的可解释深度伪造推理监督。我们的用户研究表明,参与者在忠实性、基础性和实用性方面更倾向于我们框架生成的推理,比例达到 70%,相较于其他模型和数据集生成的推理。我们的所有数据集、模型和代码库均已开源。
cs.CV / 163 / 2602.19719

Generative 6D Pose Estimation via Conditional Flow Matching

通过条件流匹配进行生成式6D姿态估计
Hamza, Amir, Boscaini, Davide, Li, Weihang, Busam, Benjamin, Poiesi, Fabio
Abstract
Existing methods for instance-level 6D pose estimation typically rely on neural networks that either directly regress the pose in $\mathrm{SE}(3)$ or estimate it indirectly via local feature matching. The former struggle with object symmetries, while the latter fail in the absence of distinctive local features. To overcome these limitations, we propose a novel formulation of 6D pose estimation as a conditional flow matching problem in $\mathbb{R}^3$. We introduce Flose, a generative method that infers object poses via a denoising process conditioned on local features. While prior approaches based on conditional flow matching perform denoising solely based on geometric guidance, Flose integrates appearance-based semantic features to mitigate ambiguities caused by object symmetries. We further incorporate RANSAC-based registration to handle outliers. We validate Flose on five datasets from the established BOP benchmark. Flose outperforms prior methods with an average improvement of +4.5 Average Recall. Project Website : https://tev-fbk.github.io/Flose/
Chinese Translation
现有的实例级6D姿态估计方法通常依赖于神经网络,直接回归$\mathrm{SE}(3)$中的姿态或通过局部特征匹配间接估计。前者在处理物体对称性时表现不佳,而后者在缺乏独特局部特征时则无法有效工作。为克服这些局限性,我们提出了一种将6D姿态估计新颖地表述为$\mathbb{R}^3$中的条件流匹配问题的方案。我们引入了Flose,这是一种通过基于局部特征的去噪过程推断物体姿态的生成式方法。尽管基于条件流匹配的先前方法仅依赖几何指导进行去噪,Flose则整合了基于外观的语义特征,以减轻物体对称性带来的歧义。我们进一步结合基于RANSAC的配准来处理离群点。我们在已建立的BOP基准的五个数据集上验证了Flose。Flose在平均召回率上超越了先前的方法,平均提升了+4.5。项目网站:https://tev-fbk.github.io/Flose/
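The abstract does not spell out Flose's exact parameterization, but a minimal sketch of conditional flow matching over points in $\mathbb{R}^3$ with a linear (rectified) probability path looks like the following; `v_theta` and the conditioning features are placeholders, not the paper's actual network.

```python
import torch

def cfm_loss(v_theta, x1, cond):
    """One conditional flow matching training step in R^3.
    x1: (B, 3) target points; cond: (B, D) conditioning features."""
    x0 = torch.randn_like(x1)                  # noise endpoint at t = 0
    t = torch.rand(x1.shape[0], 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                 # point on the straight path
    v_target = x1 - x0                         # constant velocity of that path
    return ((v_theta(xt, t, cond) - v_target) ** 2).mean()
```

At inference, samples are drawn by integrating the learned velocity field from noise at t = 0 to t = 1, with the conditioning features held fixed.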
cs.CV / 164 / 2602.19723

Towards Personalized Multi-Modal MRI Synthesis across Heterogeneous Datasets

面向异构数据集的个性化多模态MRI合成
Zhang, Yue, Zhuo, Zhizheng, Xu, Siyao, Lv, Shan, Liu, Zhaoxi, Qiu, Jun, Wang, Qiuli, Liu, Yaou, Zhou, S. Kevin
Abstract
Synthesizing missing modalities in multi-modal magnetic resonance imaging (MRI) is vital for ensuring diagnostic completeness, particularly when full acquisitions are infeasible due to time constraints, motion artifacts, and patient tolerance. Recent unified synthesis models have enabled flexible synthesis tasks by accommodating various input-output configurations. However, their training and evaluation are typically restricted to a single dataset, limiting their generalizability across diverse clinical datasets and impeding practical deployment. To address this limitation, we propose PMM-Synth, a personalized MRI synthesis framework that not only supports various synthesis tasks but also generalizes effectively across heterogeneous datasets. PMM-Synth is jointly trained on multiple multi-modal MRI datasets that differ in modality coverage, disease types, and intensity distributions. It achieves cross-dataset generalization through three core innovations: a Personalized Feature Modulation module that dynamically adapts feature representations based on dataset identifier to mitigate the impact of distributional shifts; a Modality-Consistent Batch Scheduler that facilitates stable and efficient batch training under inconsistent modality conditions; and a selective supervision loss to ensure effective learning when ground truth modalities are partially missing. Evaluated on four clinical multi-modal MRI datasets, PMM-Synth consistently outperforms state-of-the-art methods in both one-to-one and many-to-one synthesis tasks, achieving superior PSNR and SSIM scores. Qualitative results further demonstrate improved preservation of anatomical structures and pathological details. Additionally, downstream tumor segmentation and radiological reporting studies suggest that PMM-Synth holds potential for supporting reliable diagnosis under real-world modality-missing scenarios.
Chinese Translation
在多模态磁共振成像(MRI)中合成缺失模态对于确保诊断的完整性至关重要,特别是在由于时间限制、运动伪影和患者耐受性等原因无法进行全面采集的情况下。近期的统一合成模型通过适应各种输入输出配置,已实现灵活的合成任务。然而,它们的训练和评估通常仅限于单一数据集,限制了其在不同临床数据集中的泛化能力,并阻碍了实际应用。为了解决这一限制,我们提出了PMM-Synth,一个个性化的MRI合成框架,不仅支持多种合成任务,还能有效地在异构数据集之间泛化。PMM-Synth在多个模态覆盖、疾病类型和强度分布各异的多模态MRI数据集上进行联合训练。它通过三项核心创新实现跨数据集的泛化:一个个性化特征调制模块,根据数据集标识符动态调整特征表示,以减轻分布变化的影响;一个模态一致的批量调度器,促进在不一致模态条件下的稳定和高效批量训练;以及一个选择性监督损失,以确保在真实模态部分缺失时的有效学习。在四个临床多模态MRI数据集上的评估表明,PMM-Synth在一对一和多对一合成任务中始终优于最先进的方法,获得了更高的PSNR和SSIM分数。定性结果进一步展示了对解剖结构和病理细节的改善保留。此外,下游肿瘤分割和放射学报告研究表明,PMM-Synth在现实世界模态缺失场景下支持可靠诊断的潜力。
cs.CV / 165 / 2602.19735

VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments

VGGT-MPR:自动驾驶环境中基于VGGT增强的多模态地点识别
Xu, Jingyi, Qi, Zhangshuo, Yan, Zhongmiao, Gao, Xuyu, Jiao, Qianyun, Xia, Songpengcheng, Chen, Xieyuanli, Pei, Ling
Abstract
In autonomous driving, robust place recognition is critical for global localization and loop closure detection. While inter-modality fusion of camera and LiDAR data in multimodal place recognition (MPR) has shown promise in overcoming the limitations of unimodal counterparts, existing MPR methods largely rely on hand-crafted fusion strategies and heavily parameterized backbones that require costly retraining. To address this, we propose VGGT-MPR, a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. In the global retrieval stage, VGGT extracts geometrically-rich visual embeddings through prior depth-aware and point map supervision, and densifies sparse LiDAR point clouds with predicted depth maps to improve structural representation. This enhances the discriminative ability of fused multimodal features and produces global descriptors for fast retrieval. Beyond global retrieval, we design a training-free re-ranking mechanism that exploits VGGT's cross-view keypoint-tracking capability. By combining mask-guided keypoint extraction with confidence-aware correspondence scoring, our proposed re-ranking mechanism effectively refines retrieval results without additional parameter optimization. Extensive experiments on large-scale autonomous driving benchmarks and our self-collected data demonstrate that VGGT-MPR achieves state-of-the-art performance, exhibiting strong robustness to severe environmental changes, viewpoint shifts, and occlusions. Our code and data will be made publicly available.
Chinese Translation
在自动驾驶中,稳健的地点识别对于全局定位和回环检测至关重要。尽管多模态地点识别(MPR)中相机与激光雷达(LiDAR)数据的跨模态融合在克服单模态方法的局限性方面展现了前景,但现有的MPR方法基本上依赖于手工设计的融合策略和高度参数化的骨干网络,这些方法需要昂贵的重新训练。为了解决这个问题,我们提出了VGGT-MPR,一个多模态地点识别框架,采用视觉几何基础变换器(Visual Geometry Grounded Transformer, VGGT)作为统一的几何引擎,进行全局检索和重排序。在全局检索阶段,VGGT通过先验的深度感知和点图监督提取几何信息丰富的视觉嵌入,并利用预测的深度图稠密化稀疏的LiDAR点云,从而改善结构表示。这增强了融合多模态特征的区分能力,并生成用于快速检索的全局描述符。除了全局检索外,我们设计了一种无需训练的重排序机制,利用VGGT的跨视角关键点跟踪能力。通过结合掩码引导的关键点提取和基于置信度的对应评分,我们提出的重排序机制有效地优化了检索结果,而无需额外的参数优化。在大规模自动驾驶基准测试和我们自采集的数据上的大量实验表明,VGGT-MPR实现了最先进的性能,展现出对严重环境变化、视角变化和遮挡的强鲁棒性。我们的代码和数据将公开发布。
cs.CV / 166 / 2602.19736

InfScene-SR: Spatially Continuous Inference for Arbitrary-Size Image Super-Resolution

InfScene-SR:任意大小图像超分辨率的空间连续推断
Sun, Shoukun, Wang, Zhe, Que, Xiang, Zhang, Jiyin, Ma, Xiaogang
Abstract
Image Super-Resolution (SR) aims to recover high-resolution (HR) details from low-resolution (LR) inputs, a task where Denoising Diffusion Probabilistic Models (DDPMs) have recently shown superior performance compared to Generative Adversarial Networks (GANs) based approaches. However, standard diffusion-based SR models, such as SR3, are typically trained on fixed-size patches and struggle to scale to arbitrary-sized images due to memory constraints. Applying these models via independent patch processing leads to visible seams and inconsistent textures across boundaries. In this paper, we propose InfScene-SR, a framework enabling spatially continuous super-resolution for large, arbitrary scenes. We adapt the iterative refinement process of diffusion models with a novel guided and variance-corrected fusion mechanism, allowing for the seamless generation of large-scale high-resolution imagery without retraining. We validate our approach on remote sensing datasets, demonstrating that InfScene-SR not only reconstructs fine details with high perceptual quality but also eliminates boundary artifacts, benefiting downstream tasks such as semantic segmentation.
Chinese Translation
图像超分辨率(SR)旨在从低分辨率(LR)输入中恢复高分辨率(HR)细节,最近去噪扩散概率模型(DDPMs)在这一任务中表现出比基于生成对抗网络(GANs)的方法更优越的性能。然而,标准的基于扩散的超分辨率模型,如SR3,通常在固定大小的图像块上进行训练,因内存限制而难以扩展到任意大小的图像。通过独立处理图像块应用这些模型会导致可见的接缝和边界处的不一致纹理。在本文中,我们提出了InfScene-SR,一个能够为大型任意场景提供空间连续超分辨率的框架。我们通过一种新颖的引导和方差校正的融合机制,调整了扩散模型的迭代精炼过程,使得在不重新训练的情况下,能够无缝生成大规模高分辨率图像。我们在遥感数据集上验证了我们的方法,证明InfScene-SR不仅能够以高感知质量重建细节,还能消除边界伪影,从而有利于下游任务,如语义分割。
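The paper's exact fusion rule is not given in the abstract; as a generic sketch of what "variance-corrected fusion" of overlapping diffusion patches can mean: a plain weighted average of independent per-patch noise predictions shrinks their variance and blurs textures at seams, so the fused prediction is rescaled to restore it. All names below are illustrative.

```python
import torch

def fuse_noise_predictions(eps_list, weights):
    """Fuse per-patch noise predictions over an overlap region.
    eps_list: per-patch (C, h, w) tensors; weights: floats summing to 1.
    Averaging independent unit-variance noise yields variance sum(w_i^2),
    so dividing by sqrt(sum(w_i^2)) restores unit variance at the seam."""
    fused = sum(w * e for w, e in zip(weights, eps_list))
    scale = sum(w * w for w in weights) ** 0.5
    return fused / scale
```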
cs.CV / 167 / 2602.19753

RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing

RAP:用于高效3D高斯点云处理的快速前馈、无渲染、属性引导的图元重要性评分预测
Yang, Kaifa, Yang, Qi, Xu, Yiling, Li, Zhu
Abstract
3D Gaussian Splatting (3DGS) has emerged as a leading technology for high-quality 3D scene reconstruction. However, the iterative refinement and densification process leads to the generation of a large number of primitives, each contributing to the reconstruction to a substantially different extent. Estimating primitive importance is thus crucial, both for removing redundancy during reconstruction and for enabling efficient compression and transmission. Existing methods typically rely on rendering-based analyses, where each primitive is evaluated through its contribution across multiple camera viewpoints. However, such methods are sensitive to the number and selection of views, rely on specialized differentiable rasterizers, and have long calculation times that grow linearly with view count, making them difficult to integrate as plug-and-play modules and limiting scalability and generalization. To address these issues, we propose RAP, a fast feedforward rendering-free attribute-guided method for efficient importance score prediction in 3DGS. RAP infers primitive significance directly from intrinsic Gaussian attributes and local neighborhood statistics, avoiding rendering-based or visibility-dependent computations. A compact MLP predicts per-primitive importance scores using rendering loss, pruning-aware loss, and significance distribution regularization. After training on a small set of scenes, RAP generalizes effectively to unseen data and can be seamlessly integrated into reconstruction, compression, and transmission pipelines. Our code is publicly available at https://github.com/yyyykf/RAP.
Chinese Translation
3D高斯点云(3DGS)已成为高质量3D场景重建的领先技术。然而,迭代细化和稠密化过程导致生成大量图元,每个图元对重建的贡献程度差异显著。因此,估计图元的重要性至关重要,这不仅有助于在重建过程中消除冗余,还能实现高效的压缩和传输。现有方法通常依赖于基于渲染的分析,其中每个图元通过其在多个相机视角下的贡献进行评估。然而,这些方法对视角的数量和选择敏感,依赖于专门的可微分光栅化器,并且计算时间随着视角数量线性增长,使其难以作为即插即用模块集成,从而限制了可扩展性和泛化能力。为了解决这些问题,我们提出了RAP,一种快速前馈无渲染的属性引导方法,用于在3DGS中高效预测重要性评分。RAP直接从内在高斯属性和局部邻域统计中推断图元的重要性,避免了基于渲染或依赖可见性的计算。一个紧凑的多层感知器(MLP)使用渲染损失、剪枝感知损失和重要性分布正则化来预测每个图元的重要性评分。在对少量场景进行训练后,RAP能够有效地推广到未见数据,并可以无缝集成到重建、压缩和传输管道中。我们的代码已公开发布在 https://github.com/yyyykf/RAP。
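As a rough sketch of the feedforward scorer described above, under assumed feature layouts (the attribute and neighborhood-statistic dimensions below are placeholders, not the paper's):

```python
import torch
import torch.nn as nn

class ImportanceMLP(nn.Module):
    """Compact MLP mapping intrinsic Gaussian attributes plus local
    neighborhood statistics to a per-primitive importance score in (0, 1)."""
    def __init__(self, attr_dim=59, stat_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim + stat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, attrs, neigh_stats):
        return torch.sigmoid(self.net(torch.cat([attrs, neigh_stats], -1)))

# scores = ImportanceMLP()(attrs, stats); primitives with the lowest
# scores are candidates for pruning before compression or transmission.
```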
cs.CV / 168 / 2602.19756

Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

通过原型引导的数据合成简化多模态数据集蒸馏
Choi, Junhyeok, Mo, Sangwoo, Chae, Minwoo
Abstract
Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of image pixels and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images, enabling efficient and scalable multimodal dataset distillation. Extensive experiments demonstrate that our approach consistently outperforms optimization-based dataset distillation and subset selection methods, achieving state-of-the-art cross-architecture generalization.
Chinese Translation
近年来,多模态学习在各种视觉-语言任务中取得了显著成功。然而,这种进展在很大程度上依赖于大规模的图像-文本数据集,使得训练成本高且效率低。之前在数据集过滤和修剪方面的努力试图缓解这一问题,但仍然需要相对较大的子集以维持性能,并且在非常小的子集下表现不佳。数据集蒸馏提供了一种有前景的替代方案,但现有的多模态数据集蒸馏方法需要对整个数据集进行训练,并对图像像素和文本特征进行联合优化,这使得它们依赖于特定架构,并限制了跨架构的泛化能力。为了解决这个问题,我们提出了一种无学习的数据集蒸馏框架,消除了对大规模训练和优化的需求,同时增强了跨架构的泛化能力。我们的方法使用 CLIP 提取对齐的图像-文本嵌入,获得原型,并采用 unCLIP 解码器合成图像,从而实现高效且可扩展的多模态数据集蒸馏。大量实验表明,我们的方法始终优于基于优化的数据集蒸馏和子集选择方法,实现了最先进的跨架构泛化。
cs.CV / 169 / 2602.19763

Training Deep Stereo Matching Networks on Tree Branch Imagery: A Benchmark Study for Real-Time UAV Forestry Applications

基于树枝图像训练深度立体匹配网络:实时无人机林业应用的基准研究
Lin, Yida, Xue, Bing, Zhang, Mengjie, Schofield, Sam, Green, Richard
Abstract
Autonomous drone-based tree pruning needs accurate, real-time depth estimation from stereo cameras. Depth is computed from disparity maps using $Z = f B/d$, so even small disparity errors cause noticeable depth mistakes at working distances. Building on our earlier work that identified DEFOM-Stereo as the best reference disparity generator for vegetation scenes, we present the first study to train and test ten deep stereo matching networks on real tree branch images. We use the Canterbury Tree Branches dataset -- 5,313 stereo pairs from a ZED Mini camera at 1080P and 720P -- with DEFOM-generated disparity maps as training targets. The ten methods cover step-by-step refinement, 3D convolution, edge-aware attention, and lightweight designs. Using perceptual metrics (SSIM, LPIPS, ViTScore) and structural metrics (SIFT/ORB feature matching), we find that BANet-3D produces the best overall quality (SSIM = 0.883, LPIPS = 0.157), while RAFT-Stereo scores highest on scene-level understanding (ViTScore = 0.799). Testing on an NVIDIA Jetson Orin Super (16 GB, independently powered) mounted on our drone shows that AnyNet reaches 6.99 FPS at 1080P -- the only near-real-time option -- while BANet-2D gives the best quality-speed balance at 1.21 FPS. We also compare 720P and 1080P processing times to guide resolution choices for forestry drone systems.
Chinese Translation
自主无人机树木修剪需要来自立体相机的准确实时深度估计。深度由视差图通过公式 $Z = f B/d$ 计算得到,因此即使是微小的视差误差也会在工作距离上导致明显的深度错误。在我们之前确定DEFOM-Stereo为植被场景最佳参考视差生成器的研究基础上,我们首次开展了在真实树枝图像上训练和测试十种深度立体匹配网络的研究。我们使用Canterbury Tree Branches数据集——来自ZED Mini相机的5,313对立体图像,分辨率为1080P和720P——并使用DEFOM生成的视差图作为训练目标。这十种方法涵盖了逐步细化、3D卷积、边缘感知注意力和轻量级设计。通过感知指标(SSIM、LPIPS、ViTScore)和结构指标(SIFT/ORB特征匹配),我们发现BANet-3D在整体质量上表现最佳(SSIM = 0.883,LPIPS = 0.157),而RAFT-Stereo在场景级理解上得分最高(ViTScore = 0.799)。在我们无人机上安装的NVIDIA Jetson Orin Super(16 GB,独立供电)上进行测试显示,AnyNet在1080P下达到了6.99 FPS——是唯一接近实时的选项——而BANet-2D在1.21 FPS下提供了最佳的质量与速度平衡。我们还比较了720P和1080P的处理时间,以指导林业无人机系统的分辨率选择。
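To see why small disparity errors matter, differentiate $Z = fB/d$: $|\Delta Z| \approx (Z^2 / fB)\,|\Delta d|$, i.e. depth error grows quadratically with range. A quick numeric check with illustrative intrinsics (not the ZED Mini's actual calibration):

```python
# Depth from disparity: Z = f*B/d, so a disparity error dd propagates as
# |dZ| ~= (Z**2 / (f*B)) * |dd| -- quadratic growth with range.
f, B = 1000.0, 0.063        # illustrative focal length [px] and baseline [m]
for Z in (1.0, 2.0, 4.0):   # working distances [m]
    dZ = Z ** 2 / (f * B) * 0.5   # depth error for a 0.5 px disparity error
    print(f"Z = {Z:.0f} m -> depth error ~ {dZ * 100:.1f} cm")
# prints roughly 0.8 cm at 1 m, 3.2 cm at 2 m, 12.7 cm at 4 m
```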
cs.CV / 170 / 2602.19766

One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image

One2Scene:基于单幅图像的几何一致可探索3D场景生成
Wang, Pengfei, Chen, Liyi, Ma, Zhiyuan, Guo, Yanjun, Zhang, Guowen, Zhang, Lei
Abstract
Generating explorable 3D scenes from a single image is a highly challenging problem in 3D vision. Existing methods struggle to support free exploration, often producing severe geometric distortions and noisy artifacts when the viewpoint moves far from the original perspective. We introduce \textbf{One2Scene}, an effective framework that decomposes this ill-posed problem into three tractable sub-tasks to enable immersive explorable scene generation. We first use a panorama generator to produce anchor views from a single input image as initialization. Then, we lift these 2D anchors into an explicit 3D geometric scaffold via a generalizable, feed-forward Gaussian Splatting network. Instead of treating the panorama as a single image for reconstruction, we project it into multiple sparse anchor views and reformulate the reconstruction task as multi-view stereo matching, which allows us to leverage robust geometric priors learned from large-scale multi-view datasets. A bidirectional feature fusion module is used to enforce cross-view consistency, yielding an efficient and geometrically reliable scaffold. Finally, the scaffold serves as a strong prior for a novel view generator to produce photorealistic and geometrically accurate views at arbitrary cameras. By explicitly conditioning on a 3D-consistent scaffold to perform reconstruction, One2Scene works stably under large camera motions, supporting immersive scene exploration. Extensive experiments show that One2Scene substantially outperforms state-of-the-art methods in panorama depth estimation, feed-forward 360° reconstruction, and explorable 3D scene generation. Code and models will be released.
Chinese Translation
从单幅图像生成可探索的3D场景是3D视觉中的一项极具挑战性的问题。现有方法难以支持自由探索,通常在视点远离原始视角时产生严重的几何失真和噪声伪影。我们提出了One2Scene,一个有效的框架,将这一不适定问题分解为三个可处理的子任务,以实现沉浸式可探索场景生成。我们首先使用全景生成器从单个输入图像生成锚点视图作为初始化。然后,我们通过一个可泛化的前馈高斯点云网络将这些2D锚点提升为显式的3D几何支架。我们并不将全景视为单幅图像进行重建,而是将其投影到多个稀疏的锚点视图中,并将重建任务重新表述为多视图立体匹配,这使我们能够利用从大规模多视图数据集中学习到的稳健几何先验。一个双向特征融合模块用于强制跨视图一致性,从而产生一个高效且几何可靠的支架。最后,该支架作为强先验,为新视图生成器提供支持,以在任意相机下生成逼真且几何准确的视图。通过明确地基于3D一致的支架进行重建,One2Scene在大范围相机运动下稳定工作,支持沉浸式场景探索。大量实验表明,One2Scene在全景深度估计、前馈360°重建和可探索3D场景生成方面显著优于现有最先进的方法。代码和模型将会发布。
cs.CV / 171 / 2602.19768

TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding

TraceVision:面向类人空间理解的轨迹感知视觉-语言模型
Yang, Fan, Zheng, Shurong, Zhao, Hongyin, Zhan, Yufei, Li, Xin, Zhu, Yousong, Tang, Chaoyang Zhao Ming, Wang, Jinqiao
Abstract
Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding, struggling to simulate human visual attention trajectories and explain associations between descriptions and specific regions. We propose TraceVision, a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We design geometric simplification to extract semantic keypoints from raw trajectories and propose a three-stage training pipeline where trajectories guide description generation and region localization. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis. We construct the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability. Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation demonstrate that TraceVision achieves state-of-the-art performance, establishing a foundation for intuitive spatial interaction and interpretable visual understanding.
Chinese Translation
近期的大型视觉-语言模型(LVLMs)在图像理解和自然语言生成方面展现了显著的能力。然而,当前的方法主要集中在全局图像理解上,难以模拟人类视觉注意力轨迹,并解释描述与特定区域之间的关联。我们提出了TraceVision,一种在端到端框架中集成轨迹感知空间理解的统一视觉-语言模型。TraceVision采用轨迹感知视觉感知(Trajectory-aware Visual Perception, TVP)模块,实现视觉特征与轨迹信息的双向融合。我们设计了几何简化方法,从原始轨迹中提取语义关键点,并提出了一个三阶段的训练流程,其中轨迹引导描述生成和区域定位。我们将TraceVision扩展到轨迹引导的分割和视频场景理解,支持跨帧跟踪和时间注意力分析。我们构建了基于推理的交互式本地叙述(Reasoning-based Interactive Localized Narratives, RILN)数据集,以增强逻辑推理和可解释性。在轨迹引导的标题生成、文本引导的轨迹预测、理解和分割等方面的广泛实验表明,TraceVision实现了最先进的性能,为直观的空间交互和可解释的视觉理解奠定了基础。
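The abstract names "geometric simplification" for extracting semantic keypoints from raw trajectories without specifying the rule; one standard choice for simplifying a raw 2D trajectory is Ramer-Douglas-Peucker, sketched here purely as a plausible stand-in:

```python
import numpy as np

def rdp(points, eps):
    """Ramer-Douglas-Peucker simplification of a polyline.
    points: (N, 2) raw trajectory coordinates; eps: distance tolerance."""
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    seg = end - start
    norm = np.linalg.norm(seg) + 1e-12
    # perpendicular distance of every point to the start-end chord
    d = np.abs(seg[0] * (points[:, 1] - start[1])
               - seg[1] * (points[:, 0] - start[0])) / norm
    i = int(np.argmax(d))
    if d[i] > eps:   # keep the farthest point and recurse on both halves
        return np.vstack([rdp(points[: i + 1], eps)[:-1], rdp(points[i:], eps)])
    return np.vstack([start, end])
```

The surviving vertices are exactly the points where the trace changes direction most, which matches the intuition of "semantic keypoints" along an attention trajectory.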
cs.CV / 172 / 2602.19822

Efficient endometrial carcinoma screening via cross-modal synthesis and gradient distillation

通过跨模态合成和梯度蒸馏实现高效的子宫内膜癌筛查
Shan, Dongjing, Luo, Yamei, Xuan, Jiqing, Huang, Lu, Li, Jin, Yang, Mengchu, Chen, Zeyu, Lv, Fajin, Tang, Yong, Zhang, Chunxiang
Abstract
Early detection of myometrial invasion is critical for the staging and life-saving management of endometrial carcinoma (EC), a prevalent global malignancy. Transvaginal ultrasound serves as the primary, accessible screening modality in resource-constrained primary care settings; however, its diagnostic reliability is severely hindered by low tissue contrast, high operator dependence, and a pronounced scarcity of positive pathological samples. Existing artificial intelligence solutions struggle to overcome this severe class imbalance and the subtle imaging features of invasion, particularly under the strict computational limits of primary care clinics. Here we present an automated, highly efficient two-stage deep learning framework that resolves both data and computational bottlenecks in EC screening. To mitigate pathological data scarcity, we develop a structure-guided cross-modal generation network that synthesizes diverse, high-fidelity ultrasound images from unpaired magnetic resonance imaging (MRI) data, strictly preserving clinically essential anatomical junctions. Furthermore, we introduce a lightweight screening network utilizing gradient distillation, which transfers discriminative knowledge from a high-capacity teacher model to dynamically guide sparse attention towards task-critical regions. Evaluated on a large, multicenter cohort of 7,951 participants, our model achieves a sensitivity of 99.5\%, a specificity of 97.2\%, and an area under the curve of 0.987 at a minimal computational cost (0.289 GFLOPs), substantially outperforming the average diagnostic accuracy of expert sonographers. Our approach demonstrates that combining cross-modal synthetic augmentation with knowledge-driven efficient modeling can democratize expert-level, real-time cancer screening for resource-constrained primary care settings.
Chinese Translation
早期检测肌层侵袭对于子宫内膜癌(EC)的分期和挽救生命的管理至关重要,这是一种全球普遍存在的恶性肿瘤。经阴道超声是资源有限的初级医疗环境中主要的、可及的筛查方式;然而,其诊断可靠性受到低组织对比度、高操作者依赖性以及阳性病理样本显著匮乏的严重制约。现有的人工智能解决方案难以克服这种严重的类别不平衡和侵袭的细微影像特征,特别是在初级医疗诊所严格的计算限制下。在此,我们提出了一种自动化的、高效的两阶段深度学习框架,解决了EC筛查中的数据和计算瓶颈。为了缓解病理数据的稀缺性,我们开发了一种结构引导的跨模态生成网络,从未配对的磁共振成像(MRI)数据中合成多样的、高保真的超声图像,严格保留临床重要的解剖交界。进一步地,我们引入了一种轻量级筛查网络,利用梯度蒸馏,将高容量教师模型的判别知识转移,以动态引导稀疏关注到任务关键区域。在对7,951名参与者的大型多中心队列进行评估时,我们的模型实现了99.5%的灵敏度、97.2%的特异性和0.987的曲线下面积,且计算成本极低(0.289 GFLOPs),显著超越了专家超声医师的平均诊断准确性。我们的方法表明,结合跨模态合成增强与知识驱动的高效建模可以使资源有限的初级医疗环境中的专家级实时癌症筛查变得普及。
cs.CV / 173 / 2602.19823

Open-vocabulary 3D scene perception in industrial environments

工业环境中的开放词汇3D场景感知
Moenck, Keno, Florea, Adrian Philip, Koch, Julian, Schüppstuhl, Thorsten
Abstract
Autonomous vision applications in production, intralogistics, or manufacturing environments require perception capabilities beyond a small, fixed set of classes. Recent open-vocabulary methods, leveraging 2D Vision-Language Foundation Models (VLFMs), target this task but often rely on class-agnostic segmentation models pre-trained on non-industrial datasets (e.g., household scenes). In this work, we first demonstrate that such models fail to generalize, performing poorly on common industrial objects. Therefore, we propose a training-free, open-vocabulary 3D perception pipeline that overcomes this limitation. Instead of using a pre-trained model to generate instance proposals, our method simply generates masks by merging pre-computed superpoints based on their semantic features. Following, we evaluate the domain-adapted VLFM "IndustrialCLIP" on a representative 3D industrial workshop scene for open-vocabulary querying. Our qualitative results demonstrate successful segmentation of industrial objects.
Chinese Translation
自主视觉应用在生产、内部物流或制造环境中需要超越小型固定类别集合的感知能力。近期的开放词汇方法利用2D视觉-语言基础模型(VLFMs)来针对这一任务,但通常依赖于在非工业数据集(例如家庭场景)上预训练的类别无关分割模型。在本研究中,我们首先证明了此类模型无法泛化,对常见工业物体的表现较差。因此,我们提出了一种无需训练的开放词汇3D感知管道,以克服这一限制。我们的方法不使用预训练模型生成实例提议,而是通过基于语义特征合并预计算的超点来简单生成掩膜。随后,我们在一个代表性的3D工业车间场景上评估了领域适应的VLFM“IndustrialCLIP”,以进行开放词汇查询。我们的定性结果展示了对工业物体的成功分割。
cs.CV / 174 / 2602.19828

TextShield-R1: Reinforced Reasoning for Tampered Text Detection

TextShield-R1:基于强化学习的篡改文本检测推理
Qu, Chenfan, Zhong, Yiwu, Liu, Jian, Zhu, Xuekang, Yu, Bohan, Jin, Lianwen
Abstract
The growing prevalence of tampered images poses serious security threats, highlighting the urgent need for reliable detection methods. Multimodal large language models (MLLMs) demonstrate strong potential in analyzing tampered images and generating interpretations. However, they still struggle with identifying micro-level artifacts, exhibit low accuracy in localizing tampered text regions, and heavily rely on expensive annotations for forgery interpretation. To this end, we introduce TextShield-R1, the first reinforcement learning based MLLM solution for tampered text detection and reasoning. Specifically, our approach introduces Forensic Continual Pre-training, an easy-to-hard curriculum that well prepares the MLLM for tampered text detection by harnessing the large-scale cheap data from natural image forensic and OCR tasks. During fine-tuning, we perform Group Relative Policy Optimization with novel reward functions to reduce annotation dependency and improve reasoning capabilities. At inference time, we enhance localization accuracy via OCR Rectification, a method that leverages the MLLM's strong text recognition abilities to refine its predictions. Furthermore, to support rigorous evaluation, we introduce the Text Forensics Reasoning (TFR) benchmark, comprising over 45k real and tampered images across 16 languages, 10 tampering techniques, and diverse domains. Rich reasoning-style annotations are included, allowing for comprehensive assessment. Our TFR benchmark simultaneously addresses seven major limitations of existing benchmarks and enables robust evaluation under cross-style, cross-method, and cross-language conditions. Extensive experiments demonstrate that TextShield-R1 significantly advances the state of the art in interpretable tampered text detection.
Chinese Translation
篡改图像日益普遍,带来了严重的安全威胁,突显了对可靠检测方法的迫切需求。多模态大语言模型(MLLMs)在分析篡改图像和生成解释方面展现出强大的潜力。然而,它们在识别微观级别的伪造痕迹方面仍然存在困难,定位篡改文本区域的准确性较低,并且在伪造解释中严重依赖昂贵的标注。为此,我们提出了TextShield-R1,这是首个基于强化学习的MLLM解决方案,用于篡改文本的检测和推理。具体而言,我们的方法引入了法医持续预训练(Forensic Continual Pre-training),这是一种从易到难的课程,充分利用来自自然图像法医和光学字符识别(OCR)任务的大规模廉价数据,为MLLM的篡改文本检测做好准备。在微调过程中,我们采用了群体相对策略优化(Group Relative Policy Optimization)和新颖的奖励函数,以减少对标注的依赖并提高推理能力。在推理阶段,我们通过OCR校正(OCR Rectification)增强定位准确性,该方法利用MLLM强大的文本识别能力来优化其预测。此外,为了支持严格的评估,我们引入了文本法医推理(Text Forensics Reasoning, TFR)基准,包含超过45,000张真实和篡改图像,覆盖16种语言、10种篡改技术和多样的领域。基准中包含丰富的推理风格注释,允许进行全面评估。我们的TFR基准同时解决了现有基准的七大主要局限性,并能够在跨风格、跨方法和跨语言条件下进行稳健评估。大量实验表明,TextShield-R1在可解释的篡改文本检测方面显著推动了技术的进步。
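For context, the core of Group Relative Policy Optimization is critic-free: advantages are computed by normalizing rewards within a group of responses sampled for the same prompt. A minimal sketch (the paper's specific reward functions are not shown here):

```python
import torch

def grpo_advantages(rewards):
    """rewards: (G,) scalar rewards for G responses to one prompt.
    Each response's advantage is its reward standardized within the group;
    no learned value function (critic) is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

adv = grpo_advantages(torch.tensor([1.0, 0.2, 0.8, 0.1]))
# adv[i] then weights response i's tokens in the clipped PPO-style update
```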
cs.CV / 175 / 2602.19832

M3S-Net: Multimodal Feature Fusion Network Based on Multi-scale Data for Ultra-short-term PV Power Forecasting

M3S-Net:基于多尺度数据的多模态特征融合网络用于超短期光伏功率预测
Niu, Penghui, Cai, Taotao, Zhang, Suqi, Gu, Junhua, Zhang, Ping, Liu, Qiqi, Li, Jianxin
Abstract
The inherent intermittency and high-frequency variability of solar irradiance, particularly during rapid cloud advection, present significant stability challenges to high-penetration photovoltaic grids. Although multimodal forecasting has emerged as a viable mitigation strategy, existing architectures predominantly rely on shallow feature concatenation and binary cloud segmentation, thereby failing to capture the fine-grained optical features of clouds and the complex spatiotemporal coupling between visual and meteorological modalities. To bridge this gap, this paper proposes M3S-Net, a novel multimodal feature fusion network based on multi-scale data for ultra-short-term PV power forecasting. First, a multi-scale partial channel selection network leverages partial convolutions to explicitly isolate the boundary features of optically thin clouds, effectively transcending the precision limitations of coarse-grained binary masking. Second, a multi-scale sequence to image analysis network employs Fast Fourier Transform (FFT)-based time-frequency representation to disentangle the complex periodicity of meteorological data across varying time horizons. Crucially, the model incorporates a cross-modal Mamba interaction module featuring a novel dynamic C-matrix swapping mechanism. By exchanging state-space parameters between visual and temporal streams, this design conditions the state evolution of one modality on the context of the other, enabling deep structural coupling with linear computational complexity, thus overcoming the limitations of shallow concatenation. Experimental validation on the newly constructed fine-grained PV power dataset demonstrates that M3S-Net achieves a mean absolute error reduction of 6.2% in 10-minute forecasts compared to state-of-the-art baselines. The dataset and source code will be available at https://github.com/she1110/FGPD.
Chinese Translation
太阳辐射的固有间歇性和高频变动,特别是在快速云迁移期间,对高渗透率光伏电网带来了显著的稳定性挑战。尽管多模态预测已成为一种可行的缓解策略,但现有架构主要依赖于浅层特征拼接和二元云分割,未能捕捉云的细粒度光学特征以及视觉与气象模态之间复杂的时空耦合。为了解决这一问题,本文提出了M3S-Net,一种基于多尺度数据的创新多模态特征融合网络,用于超短期光伏功率预测。首先,多尺度部分通道选择网络利用部分卷积显式隔离光学薄云的边界特征,有效超越了粗粒度二元掩膜的精度限制。其次,多尺度序列到图像分析网络采用基于快速傅里叶变换(FFT)的时频表示,解开气象数据在不同时间范围内的复杂周期性。关键是,该模型结合了跨模态Mamba交互模块,具有新颖的动态C矩阵交换机制。通过在视觉和时间流之间交换状态空间参数,该设计使一种模态的状态演变依赖于另一种模态的上下文,从而实现深层结构耦合,并具有线性计算复杂度,克服了浅层拼接的局限性。在新构建的细粒度光伏功率数据集上的实验验证表明,与最先进的基线相比,M3S-Net在10分钟预测中实现了6.2%的平均绝对误差减少。数据集和源代码将可在 https://github.com/she1110/FGPD 获取。
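As a small illustration of the FFT-based time-frequency representation the abstract describes (the exact transform and resolution are assumptions here), a short-time Fourier transform turns a 1-D meteorological series into a 2-D magnitude "image" that a vision backbone can consume alongside the sky-image stream:

```python
import torch

def seq_to_tf_image(x, n_fft=64, hop=8):
    """x: (T,) univariate meteorological series.
    Returns a (freq_bins, time_frames) log-magnitude spectrogram."""
    window = torch.hann_window(n_fft)
    S = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                   return_complex=True)
    return S.abs().log1p()
```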
cs.CV / 176 / 2602.19848

DerMAE: Improving skin lesion classification through conditioned latent diffusion and MAE distillation

DerMAE:通过条件潜在扩散和MAE蒸馏改善皮肤病变分类
Filho, Francisco, Cunha, Kelvin, Papais, Fábio, Santos, Emanoel dos, Mota, Rodrigo, Bezerra, Thales, Medeiros, Erico, Borba, Paulo, Ren, Tsang Ing
Abstract
Skin lesion classification datasets often suffer from severe class imbalance, with malignant cases significantly underrepresented, leading to biased decision boundaries during deep learning training. We address this challenge using class-conditioned diffusion models to generate synthetic dermatological images, followed by self-supervised MAE pretraining to enable huge ViT models to learn robust, domain-relevant features. To support deployment in practical clinical settings, where lightweight models are required, we apply knowledge distillation to transfer these representations to a smaller ViT student suitable for mobile devices. Our results show that MAE pretraining on synthetic data, combined with distillation, improves classification performance while enabling efficient on-device inference for practical clinical use.
Chinese Translation
皮肤病变分类数据集通常存在严重的类别不平衡,恶性病例显著不足,从而导致深度学习训练中的决策边界偏差。我们通过使用类别条件扩散模型生成合成皮肤病学图像来解决这一挑战,随后进行自监督的MAE(Masked Autoencoder)预训练,使得大型ViT(Vision Transformer)模型能够学习稳健的、与领域相关的特征。为了支持在要求轻量级模型的实际临床环境中的部署,我们应用知识蒸馏将这些表示转移到适合移动设备的小型ViT学生模型上。我们的结果表明,在合成数据上进行MAE预训练结合蒸馏,能够提高分类性能,同时实现高效的设备端推理,以便于实际临床使用。
cs.CV / 177 / 2602.19857

Contrastive meta-domain adaptation for robust skin lesion classification across clinical and acquisition conditions

针对临床和采集条件下皮肤病变分类的对比元域适应
Mota, Rodrigo, Cunha, Kelvin, Santos, Emanoel dos, Papais, Fábio, Filho, Francisco, Bezerra, Thales, Medeiros, Erico, Borba, Paulo, Ren, Tsang Ing
Abstract
Deep learning models for dermatological image analysis remain sensitive to acquisition variability and domain-specific visual characteristics, leading to performance degradation when deployed in clinical settings. We investigate how visual artifacts and domain shifts affect deep learning-based skin lesion classification. We propose an adaptation strategy, grounded in the idea of visual meta-domains, that transfers visual representations from larger dermoscopic datasets into clinical image domains, thereby improving generalization robustness. Experiments across multiple dermatology datasets show consistent gains in classification performance and reduced gaps between dermoscopic and clinical images. These results emphasize the importance of domain-aware training for deployable systems.
Chinese Translation
深度学习模型在皮肤病学图像分析中对采集变异性和特定领域的视觉特征仍然敏感,这导致在临床环境中部署时性能下降。我们研究了视觉伪影和领域转移如何影响基于深度学习的皮肤病变分类。我们提出了一种适应策略,基于视觉元域的概念,将来自更大皮肤镜数据集的视觉表征转移到临床图像领域,从而提高泛化的鲁棒性。在多个皮肤病学数据集上的实验显示分类性能的一致提升,并减少了皮肤镜图像与临床图像之间的差距。这些结果强调了针对可部署系统进行领域感知训练的重要性。
cs.CV / 178 / 2602.19863

Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

酿造更强特征:用于多光谱地球观测的双教师蒸馏
Wolf, Filip, Rolih, Blaž, Zajc, Luka Čehovin
Abstract
Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student's pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources. Code: Coming soon.
Chinese Translation
基础模型正在改变地球观测(EO),然而,EO传感器和模式的多样性使得单一通用模型变得不切实际。多个专门的EO基础模型(EOFMs)可能会共存,因此跨模态的高效知识转移变得至关重要。现有的大多数EO预训练依赖于掩蔽图像建模,这强调局部重建,但对全局语义结构的控制有限。为了解决这个问题,我们提出了一种多光谱图像的双教师对比蒸馏框架,该框架将学生的预训练目标与现代光学视觉基础模型(VFMs)的对比自蒸馏范式对齐。我们的方法结合了一个多光谱教师和一个光学VFM教师,使得跨模态表示学习变得连贯。我们在多种光学和多光谱基准测试中的实验表明,我们的模型能够适应多光谱数据,同时不影响光学输入的性能,在两个设置中都达到了最先进的结果,在语义分割中平均提高了3.64个百分点,在变化检测中提高了1.2,在分类任务中提高了1.31。这表明对比蒸馏为跨异构EO数据源的可扩展表示学习提供了一种原则性和高效的方法。代码:即将发布。
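A generic shape for the per-teacher objective, assuming an InfoNCE-style alignment in which the student's embedding of sample i must match teacher embedding i within a batch (the paper's exact heads and weighting are not specified in the abstract):

```python
import torch
import torch.nn.functional as F

def contrastive_distill(z_student, z_teacher, tau=0.1):
    """z_student, z_teacher: (B, D) embeddings of the same B samples."""
    zs = F.normalize(z_student, dim=-1)
    zt = F.normalize(z_teacher, dim=-1)
    logits = zs @ zt.t() / tau                       # (B, B) similarities
    targets = torch.arange(zs.size(0), device=zs.device)
    return F.cross_entropy(logits, targets)

# total loss: weighted sum over the multispectral and optical VFM teachers,
# e.g. loss = w1 * contrastive_distill(z_s, z_ms) + w2 * contrastive_distill(z_s, z_opt)
```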
cs.CV / 179 / 2602.19870

ApET: Approximation-Error Guided Token Compression for Efficient VLMs

ApET:基于近似误差引导的高效视觉语言模型的令牌压缩
Ma, Qiankun, Zhang, Ziyao, Wang, Haofei, Chen, Jie, Song, Zhen, Zheng, Hairong
Abstract
Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introduce positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budgets by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at https://github.com/MaQianKun0/ApET.
Chinese Translation
近期的视觉语言模型(VLMs)展示了卓越的多模态理解能力,但冗余的视觉令牌导致了巨大的计算开销,降低了推理效率。以往的研究通常依赖于[CLS]注意力或文本-视觉交叉注意力来识别和丢弃冗余的视觉令牌。尽管取得了令人鼓舞的结果,但此类解决方案容易引入位置偏差,更关键的是,与高效的注意力核(如FlashAttention)不兼容,限制了其在VLM加速中的实际应用。本文我们摆脱了对注意力的依赖,从信息论的角度重新审视视觉令牌压缩,旨在最大限度地保留视觉信息而不涉及任何注意力。我们提出了ApET,一个基于近似误差引导的令牌压缩框架。ApET首先通过线性近似使用一小组基令牌重构原始视觉令牌,然后利用近似误差识别并丢弃信息量最少的令牌。在多个VLM和基准测试中的广泛实验表明,ApET在图像理解任务中保留了95.2%的原始性能,在视频理解任务中甚至达到了100.4%的性能,同时分别压缩了88.9%和87.5%的令牌预算。得益于其无注意力设计,ApET能够与FlashAttention无缝集成,进一步加速推理,使VLM的部署更加实用。代码可在 https://github.com/MaQianKun0/ApET 获取。
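One reading of the pipeline above, as a sketch: approximate every token as a linear combination of a small token basis, and treat tokens with the largest residual as the most informative. How the basis itself is chosen is not stated in the abstract, so a random subset stands in for it here.

```python
import torch

def prune_by_approx_error(tokens, num_basis, keep):
    """tokens: (N, D) visual tokens. Reconstruct all tokens from a small
    basis by least squares and keep the `keep` worst-approximated ones."""
    N, D = tokens.shape
    basis = tokens[torch.randperm(N)[:num_basis]]          # placeholder basis
    # coefficients C with C @ basis ~= tokens  (solve basis^T C^T = tokens^T)
    coeff = torch.linalg.lstsq(basis.t(), tokens.t()).solution.t()
    residual = (tokens - coeff @ basis).pow(2).sum(-1)     # per-token error
    return tokens[residual.topk(keep).indices]             # most informative
```

Because the score uses only a linear solve over token features, no attention maps are ever materialized, which is what keeps the method compatible with FlashAttention.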
cs.CV / 180 / 2602.19872

GOAL: Geometrically Optimal Alignment for Continual Generalized Category Discovery

GOAL:用于持续广义类别发现的几何最优对齐
Han, Jizhou, Ding, Chenhao, Dong, SongLin, He, Yuhang, Wang, Shaokun, Wang, Qiang, Gong, Yihong
Abstract
Continual Generalized Category Discovery (C-GCD) requires identifying novel classes from unlabeled data while retaining knowledge of known classes over time. Existing methods typically update classifier weights dynamically, resulting in forgetting and inconsistent feature alignment. We propose GOAL, a unified framework that introduces a fixed Equiangular Tight Frame (ETF) classifier to impose a consistent geometric structure throughout learning. GOAL conducts supervised alignment for labeled samples and confidence-guided alignment for novel samples, enabling stable integration of new classes without disrupting old ones. Experiments on four benchmarks show that GOAL outperforms the prior method Happy, reducing forgetting by 16.1% and boosting novel class discovery by 3.2%, establishing a strong solution for long-horizon continual discovery.
Chinese Translation
持续广义类别发现(C-GCD)需要从未标记数据中识别新类别,同时在时间上保留已知类别的知识。现有方法通常动态更新分类器权重,导致遗忘和特征对齐不一致。我们提出了GOAL,一个统一框架,引入固定的等角紧框架(Equiangular Tight Frame, ETF)分类器,以在学习过程中施加一致的几何结构。GOAL对标记样本进行监督对齐,并对新样本进行基于置信度的对齐,从而实现新类别的稳定整合,而不干扰旧类别。在四个基准测试上的实验表明,GOAL优于之前的方法Happy,减少了16.1%的遗忘率,并提升了3.2%的新类别发现率,为长期持续发现提供了强有力的解决方案。
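For reference, a fixed simplex ETF classifier can be built in closed form: K unit-norm prototypes with identical pairwise cosine $-1/(K-1)$, frozen throughout learning. A standard construction from the neural-collapse literature, not GOAL-specific code:

```python
import torch

def simplex_etf(num_classes, dim):
    """Returns a (dim, K) matrix of fixed class prototypes; requires dim >= K."""
    K = num_classes
    U = torch.linalg.qr(torch.randn(dim, K)).Q       # orthonormal columns
    M = torch.eye(K) - torch.full((K, K), 1.0 / K)   # centering matrix
    return (K / (K - 1)) ** 0.5 * U @ M              # unit-norm ETF columns

W = simplex_etf(10, 512)
print((W.t() @ W).round(decimals=2))   # 1.0 on the diagonal, -1/9 elsewhere
```

Keeping these prototypes fixed is what gives every class, old or newly discovered, the same geometric target, which is the consistency property the abstract attributes to GOAL.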
cs.CV / 181 / 2602.19874

BigMaQ: A Big Macaque Motion and Animation Dataset Bridging Image and 3D Pose Representations

BigMaQ:连接图像与三维姿态表示的大猕猴运动与动画数据集
Martini, Lucas, Lappe, Alexander, Bognár, Anna, Vogels, Rufin, Giese, Martin A.
Abstract
The recognition of dynamic and social behavior in animals is fundamental for advancing ethology, ecology, medicine and neuroscience. Recent progress in deep learning has enabled automated behavior recognition from video, yet an accurate reconstruction of the three-dimensional (3D) pose and shape has not been integrated into this process. Especially for non-human primates, mesh-based tracking efforts lag behind those for other species, leaving pose descriptions restricted to sparse keypoints that are unable to fully capture the richness of action dynamics. To address this gap, we introduce the $\textbf{Big Ma}$ca$\textbf{Q}$ue 3D Motion and Animation Dataset ($\texttt{BigMaQ}$), a large-scale dataset comprising more than 750 scenes of interacting rhesus macaques with detailed 3D pose descriptions. Extending previous surface-based animal tracking methods, we construct subject-specific textured avatars by adapting a high-quality macaque template mesh to individual monkeys. This allows us to provide pose descriptions that are more accurate than previous state-of-the-art surface-based animal tracking methods. From the original dataset, we derive BigMaQ500, an action recognition benchmark that links surface-based pose vectors to single frames across multiple individual monkeys. By pairing features extracted from established image and video encoders with and without our pose descriptors, we demonstrate substantial improvements in mean average precision (mAP) when pose information is included. With these contributions, $\texttt{BigMaQ}$ establishes the first dataset that both integrates dynamic 3D pose-shape representations into the learning task of animal action recognition and provides a rich resource to advance the study of visual appearance, posture, and social interaction in non-human primates. The code and data are publicly available at https://martinivis.github.io/BigMaQ/ .
Chinese Translation
动物动态与社会行为的识别对于推进动物行为学、生态学、医学和神经科学至关重要。近年来,深度学习的进展使得从视频中自动识别行为成为可能,但对三维(3D)姿态和形状的准确重建尚未融入这一过程。尤其对于非人类灵长类动物,基于网格的跟踪努力落后于其他物种,使得姿态描述仅限于稀疏的关键点,无法充分捕捉动作动态的丰富性。为了解决这一问题,我们引入了Big MacaQue 3D运动与动画数据集(BigMaQ),这是一个大规模数据集,包含超过750个互动猕猴场景,并提供详细的3D姿态描述。我们扩展了之前基于表面的动物跟踪方法,通过将高质量的猕猴模板网格适配到个体猴子,构建了特定个体的纹理化虚拟形象。这使我们能够提供比以往最先进的基于表面的动物跟踪方法更准确的姿态描述。从原始数据集中,我们衍生出BigMaQ500,这是一个动作识别基准,将基于表面的姿态向量与多个个体猴子的单帧图像相连接。通过将从已建立的图像和视频编码器中提取的特征与我们提供的姿态描述进行配对,我们展示了在包含姿态信息时,平均精度(mAP)有显著提升。通过这些贡献,BigMaQ建立了第一个将动态3D姿态-形状表示整合到动物动作识别学习任务中的数据集,并提供了丰富的资源以推进对非人类灵长类动物视觉外观、姿势和社会互动的研究。代码和数据可在 https://martinivis.github.io/BigMaQ/ 上公开获取。
cs.CV / 182 / 2602.19881

Make Some Noise: Unsupervised Remote Sensing Change Detection Using Latent Space Perturbations

制造噪声:基于潜在空间扰动的无监督遥感变化检测
Rolih, Blaž, Fučka, Matic, Wolf, Filip, Zajc, Luka Čehovin
Abstract
Unsupervised change detection (UCD) in remote sensing aims to localise semantic changes between two images of the same region without relying on labelled data during training. Most recent approaches rely either on frozen foundation models in a training-free manner or on training with synthetic changes generated in pixel space. Both strategies inherently rely on predefined assumptions about change types, typically introduced through handcrafted rules, external datasets, or auxiliary generative models. Due to these assumptions, such methods fail to generalise beyond a few change types, limiting their real-world usage, especially in rare or complex scenarios. To address this, we propose MaSoN (Make Some Noise), an end-to-end UCD framework that synthesises diverse changes directly in the latent feature space during training. It generates changes that are dynamically estimated using feature statistics of target data, enabling diverse yet data-driven variation aligned with the target domain. It also easily extends to new modalities, such as SAR. MaSoN generalises strongly across diverse change types and achieves state-of-the-art performance on five benchmarks, improving the average F1 score by 14.1 percentage points. Project page: https://blaz-r.github.io/mason_ucd
Chinese Translation
无监督变化检测(UCD)在遥感中旨在定位同一区域两幅图像之间的语义变化,而不依赖于训练过程中的标注数据。最近的大多数方法要么依赖于冻结的基础模型以无训练的方式进行,要么依赖于在像素空间中生成的合成变化进行训练。这两种策略本质上都依赖于对变化类型的预定义假设,通常通过手工规则、外部数据集或辅助生成模型引入。由于这些假设,这些方法无法超越少数变化类型进行推广,限制了它们在现实世界中的使用,尤其是在稀有或复杂场景中。为了解决这个问题,我们提出了MaSoN(制造噪声),一个端到端的UCD框架,在训练过程中直接在潜在特征空间中合成多样化的变化。它生成的变化是通过目标数据的特征统计动态估计的,从而实现与目标领域一致的多样性且以数据驱动的方式变化。它还可以轻松扩展到新的模态,例如合成孔径雷达(SAR)。MaSoN在多样化变化类型上具有强大的推广能力,并在五个基准测试中实现了最先进的性能,平均F1分数提高了14.1个百分点。项目页面:https://blaz-r.github.io/mason_ucd
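A toy version of synthesizing a change in latent space, with the perturbation statistics taken from the target batch itself; the masking scheme and exact statistics below are assumptions for illustration, not MaSoN's published design:

```python
import torch

def perturb_latents(feats, mask, strength=1.0):
    """feats: (B, C, H, W) latent features; mask: (B, 1, H, W) in {0, 1}.
    Replaces the masked region with noise matched to the batch's own
    per-channel statistics; `mask` then serves as the training target."""
    mu = feats.mean(dim=(0, 2, 3), keepdim=True)
    sigma = feats.std(dim=(0, 2, 3), keepdim=True)
    noise = mu + strength * sigma * torch.randn_like(feats)
    return feats * (1 - mask) + noise * mask
```

Because the perturbation is sampled from the target data's feature statistics rather than from a fixed catalogue of pixel-space edits, no assumption about the change type is baked in.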
cs.CV / 183 / 2602.19896

Monocular Mesh Recovery and Body Measurement of Female Saanen Goats

萨能母山羊的单目网格恢复与体型测量
Jin, Bo, Zhao, Shichao, Lyu, Jin, Zhang, Bin, Yu, Tao, An, Liang, Liu, Yebin, Wang, Meili
Abstract
The lactation performance of Saanen dairy goats, renowned for their high milk yield, is intrinsically linked to their body size, making accurate 3D body measurement essential for assessing milk production potential, yet existing reconstruction methods lack goat-specific authentic 3D data. To address this limitation, we establish the FemaleSaanenGoat dataset containing synchronized eight-view RGBD videos of 55 female Saanen goats (6-18 months). Using multi-view DynamicFusion, we fuse noisy, non-rigid point cloud sequences into high-fidelity 3D scans, overcoming challenges from irregular surfaces and rapid movement. Based on these scans, we develop SaanenGoat, a parametric 3D shape model specifically designed for female Saanen goats. This model features a refined template with 41 skeletal joints and enhanced udder representation, registered with our scan data. A comprehensive shape space constructed from 48 goats enables precise representation of diverse individual variations. With the help of SaanenGoat model, we get high-precision 3D reconstruction from single-view RGBD input, and achieve automated measurement of six critical body dimensions: body length, height, chest width, chest girth, hip width, and hip height. Experimental results demonstrate the superior accuracy of our method in both 3D reconstruction and body measurement, presenting a novel paradigm for large-scale 3D vision applications in precision livestock farming.
Chinese Translation
萨能奶山羊以其高产奶量而闻名,其泌乳性能与体型密切相关,因此准确的三维体型测量对于评估奶产潜力至关重要。然而,现有的重建方法缺乏特定于山羊的真实三维数据。为了解决这一限制,我们建立了FemaleSaanenGoat数据集,包含55只萨能母山羊(6-18个月)的同步八视图RGBD视频。通过多视图动态融合(DynamicFusion),我们将嘈杂的非刚性点云序列融合为高保真三维扫描,克服了不规则表面和快速运动带来的挑战。基于这些扫描数据,我们开发了SaanenGoat,一个专为萨能母山羊设计的参数化三维形状模型。该模型具有41个骨骼关节的精细模板和增强的乳房表示,并与我们的扫描数据进行了配准。由48只山羊构建的全面形状空间使得对不同个体变异的精确表示成为可能。在SaanenGoat模型的帮助下,我们从单视图RGBD输入中获得高精度的三维重建,并实现了六个关键体型维度的自动测量:体长、体高、胸宽、胸围、臀宽和臀高。实验结果表明,我们的方法在三维重建和体型测量方面具有优越的准确性,为精准畜牧业中的大规模三维视觉应用提供了新范式。
cs.CV / 184 / 2602.19900

ExpPortrait: Expressive Portrait Generation via Personalized Representation

ExpPortrait:通过个性化表示生成富有表现力的肖像
Wang, Junyi, Guo, Yudong, Guo, Boyang, Yang, Shengming, Zhang, Juyong
Abstract
While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representation. Therefore, existing methods based on these models struggle to accurately preserve subject identity and expressions, hindering the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression and identity. This representation captures both static, subject-specific global geometry and dynamic, expression-related details. Furthermore, we introduce an expression transfer module to achieve personalized transfer of head pose and expression details between different identities. We use this sophisticated and highly expressive head model as a conditional signal to train a diffusion transformer (DiT)-based generator to synthesize richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous models in terms of identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.
Chinese Translation
尽管扩散模型在肖像生成方面展现了巨大的潜力,但生成富有表现力、连贯且可控的电影肖像视频仍然是一项重大挑战。现有的肖像生成中间信号,如二维关键点和参数模型,具有有限的解耦能力,无法表达个性化细节,因为它们的表示稀疏或低秩。因此,基于这些模型的现有方法难以准确保留主体的身份和表情,阻碍了高度表现力肖像视频的生成。为克服这些限制,我们提出了一种高保真个性化头部表示,更有效地解耦表情和身份。该表示捕捉了静态的、特定于主体的全局几何形状和动态的、与表情相关的细节。此外,我们引入了一个表情转移模块,以实现不同身份之间头部姿态和表情细节的个性化转移。我们使用这一复杂且高度表现力的头部模型作为条件信号,训练基于扩散变换器(DiT)的生成器,以合成丰富细致的肖像视频。在自我重演和交叉重演任务上的大量实验表明,我们的方法在身份保留、表情准确性和时间稳定性方面优于先前的模型,特别是在捕捉复杂运动的细微细节方面。
cs.CV / 185 / 2602.19907

Gradient based Severity Labeling for Biomarker Classification in OCT

基于梯度的生物标志物分类严重性标记方法在光学相干断层扫描中的应用
Kokilepersaud, Kiran, Prabhushankar, Mohit, AlRegib, Ghassan, Corona, Stephanie Trejo, Wykoff, Charles
Abstract
In this paper, we propose a novel selection strategy for contrastive learning for medical images. On natural images, contrastive learning uses augmentations to select positive and negative pairs for the contrastive loss. However, in the medical domain, arbitrary augmentations have the potential to distort small localized regions that contain the biomarkers we are interested in detecting. A more intuitive approach is to select samples with similar disease severity characteristics, since these samples are more likely to have similar structures related to the progression of a disease. To enable this, we introduce a method that generates disease severity labels for unlabeled OCT scans on the basis of gradient responses from an anomaly detection algorithm. These labels are used to train a supervised contrastive learning setup to improve biomarker classification accuracy by as much as 6% above self-supervised baselines for key indicators of Diabetic Retinopathy.
Chinese Translation
本文提出了一种用于医学图像对比学习的新型选择策略。在自然图像中,对比学习利用数据增强来选择正负样本对以计算对比损失。然而,在医学领域,任意的数据增强可能会扭曲包含我们感兴趣的生物标志物的小局部区域。一种更直观的方法是选择具有相似疾病严重性特征的样本,因为这些样本更可能具有与疾病进展相关的相似结构。为此,我们引入了一种方法,该方法基于异常检测算法的梯度响应生成未标记光学相干断层扫描(OCT)图像的疾病严重性标签。这些标签用于训练监督对比学习框架,将糖尿病视网膜病变关键生物标志物的分类准确率较自监督基线提高多达6%。
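The selection strategy amounts to a supervised contrastive loss in which positives share a severity pseudo-label rather than being augmentations of one image. A compact sketch of that loss (standard SupCon, with the gradient-derived labels slotted in):

```python
import torch
import torch.nn.functional as F

def severity_supcon(z, severity, tau=0.07):
    """z: (B, D) embeddings; severity: (B,) integer pseudo-labels."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))      # exclude self-pairs
    pos = (severity[:, None] == severity[None, :]) & ~self_mask
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos, 0.0)       # keep positives only
    return -(pos_log_prob.sum(1) / pos.sum(1).clamp(min=1)).mean()
```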
cs.CV / 186 / 2602.19910

Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery

通过半监督速率降低进行多模态表示学习以实现广义类别发现
He, Wei, Meng, Xianghan, Huang, Zhiyuan, Qi, Xianbiao, Xiao, Rong, Li, Chun-Guang
Abstract
Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches for the GCD task are usually built on multi-modality representation learning, which is heavily dependent upon inter-modality alignment. However, few of them enforce proper intra-modality alignment to induce a desirable underlying structure in the representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^2$-GCD, which learns cross-modality representations with the desired structural properties by emphasizing proper alignment of intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. We conduct extensive experiments on generic and fine-grained benchmark datasets demonstrating superior performance of our approach.
Chinese Translation
广义类别发现(GCD)旨在识别已知和未知类别,仅对已知类别提供部分标签,这构成了一个具有挑战性的开放集识别问题。当前针对GCD任务的最先进方法通常基于多模态表示学习,这在很大程度上依赖于模态间的对齐。然而,鲜有方法对模态内对齐进行适当处理,以生成所需的表示分布结构。本文提出了一种新颖且有效的多模态表示学习框架,通过半监督速率降低(Semi-Supervised Rate Reduction)实现GCD,称为SSR$^2$-GCD,通过强调适当对齐模态内关系,学习具有所需结构特性的跨模态表示。此外,为了增强知识转移,我们通过利用视觉语言模型(Vision Language Models)提供的模态间对齐来整合提示候选。我们在通用和细粒度基准数据集上进行了广泛实验,展示了我们方法的优越性能。
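For background, the coding-rate function at the heart of rate-reduction objectives (the MCR² line of work) has a closed form; the semi-supervised variant in SSR$^2$-GCD combines such terms in ways the abstract does not detail. A sketch of the base quantity:

```python
import torch

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z^T Z) for row-wise features
    Z: (n, d), typically unit-normalized. Larger R means the features span
    a larger volume; rate *reduction* contrasts R of the whole set against
    R of its per-class (or per-cluster) parts."""
    n, d = Z.shape
    I = torch.eye(d, device=Z.device, dtype=Z.dtype)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * Z.t() @ Z)
```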
cs.CV / 187 / 2602.19916

Augmented Radiance Field: A General Framework for Enhanced Gaussian Splatting

增强辐射场:增强高斯点云的通用框架
Yang, Yixin, Wu, Bojian, Zhou, Yang, Huang, Hui
Abstract
Due to the real-time rendering performance, 3D Gaussian Splatting (3DGS) has emerged as the leading method for radiance field reconstruction. However, its reliance on spherical harmonics for color encoding inherently limits its ability to separate diffuse and specular components, making it challenging to accurately represent complex reflections. To address this, we propose a novel enhanced Gaussian kernel that explicitly models specular effects through view-dependent opacity. Meanwhile, we introduce an error-driven compensation strategy to improve rendering quality in existing 3DGS scenes. Our method begins with 2D Gaussian initialization and then adaptively inserts and optimizes enhanced Gaussian kernels, ultimately producing an augmented radiance field. Experiments demonstrate that our method not only surpasses state-of-the-art NeRF methods in rendering performance but also achieves greater parameter efficiency. Project page at: https://xiaoxinyyx.github.io/augs.
Chinese Translation
由于具备实时渲染性能,3D高斯点云(3DGS)已成为辐射场重建的主要方法。然而,它依赖球面谐波进行颜色编码,这从根本上限制了其分离漫反射和镜面反射成分的能力,使得准确表示复杂反射变得具有挑战性。为了解决这个问题,我们提出了一种新颖的增强高斯核,通过视角依赖的不透明度显式建模镜面效应。同时,我们引入了一种误差驱动的补偿策略,以提高现有3DGS场景的渲染质量。我们的方法从2D高斯初始化开始,然后自适应地插入和优化增强高斯核,最终生成增强辐射场。实验表明,我们的方法不仅在渲染性能上超越了最先进的NeRF方法,还实现了更高的参数效率。项目页面:https://xiaoxinyyx.github.io/augs。
cs.CV / 188 / 2602.19937

Learning Positive-Incentive Point Sampling in Neural Implicit Fields for Object Pose Estimation

在神经隐式场中学习正激励点采样以进行物体姿态估计
Shi, Yifei, Wan, Boyan, Xu, Xin, Xu, Kai
Abstract
Learning neural implicit fields of 3D shapes is a rapidly emerging field that enables shape representation at arbitrary resolutions. Due to the flexibility, neural implicit fields have succeeded in many research areas, including shape reconstruction, novel view image synthesis, and more recently, object pose estimation. Neural implicit fields enable learning dense correspondences between the camera space and the object's canonical space-including unobserved regions in camera space-significantly boosting object pose estimation performance in challenging scenarios like highly occluded objects and novel shapes. Despite progress, predicting canonical coordinates for unobserved camera-space regions remains challenging due to the lack of direct observational signals. This necessitates heavy reliance on the model's generalization ability, resulting in high uncertainty. Consequently, densely sampling points across the entire camera space may yield inaccurate estimations that hinder the learning process and compromise performance. To alleviate this problem, we propose a method combining an SO(3)-equivariant convolutional implicit network and a positive-incentive point sampling (PIPS) strategy. The SO(3)-equivariant convolutional implicit network estimates point-level attributes with SO(3)-equivariance at arbitrary query locations, demonstrating superior performance compared to most existing baselines. The PIPS strategy dynamically determines sampling locations based on the input, thereby boosting the network's accuracy and training efficiency. Our method outperforms the state-of-the-art on three pose estimation datasets. Notably, it demonstrates significant improvements in challenging scenarios, such as objects captured with unseen pose, high occlusion, novel geometry, and severe noise.
Chinese Translation
学习三维形状的神经隐式场是一个快速发展的领域,能够以任意分辨率进行形状表示。由于其灵活性,神经隐式场在许多研究领域取得了成功,包括形状重建、新视图图像合成,以及最近的物体姿态估计。神经隐式场能够学习相机空间与物体的标准空间之间的密集对应关系——包括相机空间中未观察到的区域——显著提升了在高度遮挡物体和新形状等挑战场景下的物体姿态估计性能。尽管取得了一定进展,但由于缺乏直接的观察信号,预测未观察到的相机空间区域的标准坐标仍然具有挑战性。这迫使模型严重依赖自身的泛化能力,从而导致高不确定性。因此,在整个相机空间中密集采样点可能会产生不准确的估计,从而阻碍学习过程并影响性能。为了解决这个问题,我们提出了一种结合SO(3)等变卷积隐式网络和正激励点采样(PIPS)策略的方法。SO(3)等变卷积隐式网络在任意查询位置估计点级属性,并展现出优于大多数现有基线的性能。PIPS策略根据输入动态确定采样位置,从而提高网络的准确性和训练效率。我们的方法在三个姿态估计数据集上超越了最先进的技术水平。值得注意的是,它在挑战性场景中表现出显著的改进,例如以未见姿态拍摄、严重遮挡、几何形状新颖以及噪声严重的物体。
cs.CV / 189 / 2602.19944

Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation

发现、分割与选择:一种渐进式零样本伪装物体分割机制
Yang, Yilong, Tian, Jianxin, Zhang, Shengchuan, Cao, Liujuan
Abstract
Current zero-shot Camouflaged Object Segmentation methods typically employ a two-stage pipeline (discover-then-segment): using MLLMs to obtain visual prompts, followed by SAM segmentation. However, relying solely on MLLMs for camouflaged object discovery often leads to inaccurate localization, false positives, and missed detections. To address these issues, we propose the \textbf{D}iscover-\textbf{S}egment-\textbf{S}elect (\textbf{DSS}) mechanism, a progressive framework designed to refine segmentation step by step. The proposed method contains a Feature-coherent Object Discovery (FOD) module that leverages visual features to generate diverse object proposals, a segmentation module that refines these proposals through SAM segmentation, and a Semantic-driven Mask Selection (SMS) module that employs MLLMs to evaluate and select the optimal segmentation mask from multiple candidates. Without requiring any training or supervision, DSS achieves state-of-the-art performance on multiple COS benchmarks, especially in multiple-instance scenes.
Chinese Translation
当前的零样本伪装物体分割方法通常采用两阶段流程(发现-然后分割):使用多模态大语言模型(MLLMs)获取视觉提示,然后进行SAM分割。然而,仅依赖MLLMs进行伪装物体发现往往导致定位不准确、误报和漏检。为了解决这些问题,我们提出了发现-分割-选择(Discover-Segment-Select, DSS)机制,这是一种旨在逐步精炼分割的渐进式框架。该方法包含一个特征一致性物体发现(FOD)模块,利用视觉特征生成多样的物体提议,一个通过SAM分割精炼这些提议的分割模块,以及一个语义驱动的掩膜选择(SMS)模块,利用MLLMs评估并从多个候选中选择最佳分割掩膜。DSS在多个伪装物体分割基准测试中实现了最先进的性能,尤其是在多实例场景中,且无需任何训练或监督。
cs.CV / 190 / 2602.19946

When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators

美观并不等于实用:探讨现代文本到图像模型为何无法作为可靠的训练数据生成器
Adamkiewicz, Krzysztof, Moser, Brian, Frolov, Stanislav, Nauen, Tobias Christian, Raue, Federico, Dengel, Andreas
Abstract
Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data. Despite observable advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines with newer T2I models as training data generators. Our analysis reveals a hidden trend: These models collapse to a narrow, aesthetic-centric distribution that undermines diversity and label-image alignment. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.
Chinese Translation
近期的文本到图像(T2I)扩散模型生成了视觉上令人惊叹的图像,并展现了出色的提示遵循能力。但它们作为合成视觉数据生成器的表现如何呢?在本研究中,我们重新审视了合成数据作为真实训练集可扩展替代品的前景,并揭示了一个令人惊讶的性能回退。我们使用2022年至2025年间发布的最先进T2I模型生成大规模合成数据集,仅基于这些合成数据训练标准分类器,并在真实测试数据上进行评估。尽管在视觉逼真度和提示遵循方面有明显进展,但随着更新的T2I模型作为训练数据生成器,真实测试数据上的分类准确率却持续下降。我们的分析揭示了一个隐藏的趋势:这些模型崩溃为一个狭窄的以美学为中心的分布,削弱了多样性和标签-图像对齐。总体而言,我们的发现挑战了视觉研究中日益增长的假设,即生成现实主义的进展意味着数据现实主义的进展。因此,我们强调迫切需要重新思考现代T2I模型作为可靠训练数据生成器的能力。
cs.CV / 191 / 2602.19974

RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection

RL-RIG:一种通过内在反思实现的生成空间推理器
Wang, Tianyu, Ma, Zhiyuan, Wang, Qian, Zhang, Xinyi, Long, Xinwei, Zhou, Bowen
Abstract
Recent advancements in image generation have achieved impressive results in producing high-quality images. However, existing image generation models still generally struggle with a spatial reasoning dilemma, lacking the ability to accurately capture fine-grained spatial relationships from the prompt and correctly generate scenes with structural integrity. To mitigate this dilemma, we propose RL-RIG, a Reinforcement Learning framework for Reflection-based Image Generation. Our architecture comprises four primary components: Diffuser, Checker, Actor, and Inverse Diffuser, following a Generate-Reflect-Edit paradigm that elicits chain-of-thought reasoning in image generation to address the dilemma. To equip the model with better intuition over generation trajectories, we further develop Reflection-GRPO to train the VLM Actor to produce edit prompts and the Image Editor to improve image quality under a given prompt. Unlike traditional approaches that solely produce visually stunning yet structurally unreasonable content, our evaluation metrics prioritize spatial accuracy, utilizing Scene Graph IoU and employing a VLM-as-a-Judge strategy to assess the spatial consistency of generated images on the LAION-SG dataset. Experimental results show that RL-RIG outperforms existing state-of-the-art open-source models by up to 11% in terms of controllable and precise spatial reasoning in image generation.
Chinese Translation
近期在图像生成领域的进展取得了令人瞩目的成果,能够生成高质量的图像。然而,现有的图像生成模型普遍面临空间推理困境,缺乏从提示中准确捕捉细粒度空间关系的能力,无法正确生成具有结构完整性的场景。为了解决这一困境,我们提出了RL-RIG,一种基于反思的图像生成强化学习框架。我们的架构由四个主要组件组成:扩散器(Diffuser)、检查器(Checker)、演员(Actor)和逆扩散器(Inverse Diffuser),遵循生成-反思-编辑的范式,以激发图像生成中的思维链推理能力来应对这一困境。为了使模型对生成轨迹有更好的直觉,我们进一步开发了Reflection-GRPO,以分别训练VLM演员用于编辑提示和图像编辑器以在给定提示下提高图像质量。与传统方法仅仅生成视觉上令人惊艳但结构上不合理的内容不同,我们的评估指标优先考虑空间准确性,利用场景图交并比(Scene Graph IoU)并采用VLM作为评判者的策略来评估在LAION-SG数据集上生成图像的空间一致性。实验结果表明,RL-RIG在图像生成中的可控性和精确空间推理方面比现有的最先进开源模型提高了多达11%。
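The Generate-Reflect-Edit paradigm can be sketched as a simple control loop. The sketch below uses stub functions in place of the paper's Diffuser, Checker (a VLM), Actor, and Image Editor; it illustrates only the loop structure, not the actual models:

```python
# Schematic Generate-Reflect-Edit loop. Diffuser / Checker / Actor /
# Editor are stubs; the real components are a diffusion model, a VLM
# checker/actor, and an image editor.
def diffuser(prompt):
    return {"prompt": prompt, "edits": []}           # stand-in for a generated image

def checker(image, prompt):
    # A VLM would verify spatial relations here; this stub passes after 2 edits.
    return len(image["edits"]) >= 2

def actor(image, prompt):
    return f"fix spatial layout (round {len(image['edits']) + 1})"

def editor(image, edit_prompt):
    image["edits"].append(edit_prompt)               # stand-in for image editing
    return image

def generate_reflect_edit(prompt, max_rounds=5):
    image = diffuser(prompt)                         # Generate
    for _ in range(max_rounds):
        if checker(image, prompt):                   # Reflect
            break
        image = editor(image, actor(image, prompt))  # Edit
    return image

print(generate_reflect_edit("a mug to the left of a laptop"))
```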
cs.CV / 192 / 2602.19994

RADE-Net: Robust Attention Network for Radar-Only Object Detection in Adverse Weather

RADE-Net:针对恶劣天气下仅使用雷达的物体检测的鲁棒注意力网络
Leitgeb, Christof, Puchleitner, Thomas, Ronecker, Max Peter, Watzenig, Daniel
Abstract
Automotive perception systems must meet stringent requirements. While optical sensors such as Camera and Lidar struggle in adverse weather conditions, Radar provides more robust perception performance, effectively penetrating fog, rain, and snow. Since full Radar tensors have large data sizes and very few datasets provide them, most Radar-based approaches work with sparse point clouds or 2D projections, which can result in information loss. Additionally, deep learning methods show potential to extract richer and more dense features from low-level Radar data and therefore significantly increase perception performance. Therefore, we propose a 3D projection method for fast-Fourier-transformed 4D Range-Azimuth-Doppler-Elevation (RADE) tensors. Our method preserves rich Doppler and Elevation features while reducing the required data size for a single frame by 91.9% compared to a full tensor, thus achieving higher training and inference speed as well as lower model complexity. We introduce RADE-Net, a lightweight model tailored to 3D projections of the RADE tensor. The backbone enables exploitation of low-level and high-level cues of Radar tensors with spatial and channel attention. The decoupled detection heads predict object center-points directly in the Range-Azimuth domain and regress rotated 3D bounding boxes from rich feature maps in the Cartesian scene. We evaluate the model on scenes with multiple different road users and under various weather conditions on the large-scale K-Radar dataset and achieve a 16.7% improvement over the dataset's baseline, as well as a 6.5% improvement over current Radar-only models. Additionally, we outperform several Lidar approaches in scenarios with adverse weather conditions. The code is available under https://github.com/chr-is-tof/RADE-Net.
Chinese Translation
汽车感知系统必须满足严格的要求。虽然相机和激光雷达等光学传感器在恶劣天气条件下表现不佳,但雷达提供了更为鲁棒的感知性能,能够有效穿透雾、雨和雪。由于完整的雷达张量数据量大且很少有数据集提供这些数据,大多数基于雷达的方法只能使用稀疏点云或二维投影,这可能导致信息损失。此外,深度学习方法显示出从低级雷达数据中提取更丰富和更密集特征的潜力,从而显著提高感知性能。因此,我们提出了一种针对快速傅里叶变换的四维距离-方位-多普勒-俯仰角(RADE)张量的三维投影方法。与完整张量相比,我们的方法在保持丰富的多普勒和俯仰角特征的同时,将单帧所需的数据量减少了91.9%,从而实现了更高的训练和推理速度以及更低的模型复杂性。我们引入了RADE-Net,一个专为RADE张量的三维投影量身定制的轻量级模型。该主干网络能够利用雷达张量的低级和高级线索,结合空间和通道注意力。解耦的检测头直接在距离-方位域预测物体中心点,并从笛卡尔场景的丰富特征图中回归旋转的三维边界框。我们在大型K-Radar数据集上评估了该模型,场景中包含多种不同的道路使用者和各种天气条件,取得了比其基线提高16.7%的效果,并且比当前仅使用雷达的模型提高了6.5%。此外,在恶劣天气条件下,我们的表现超越了几种激光雷达方法。代码可在https://github.com/chr-is-tof/RADE-Net获取。
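A hedged illustration of projecting a 4D RADE tensor into 3D views: the abstract does not specify the exact projection, so the sketch below max-projects each axis in turn (one plausible variant) and reports the resulting per-frame data-size reduction; the tensor dimensions are made up:

```python
# Toy 3D projection of a 4D Range-Azimuth-Doppler-Elevation (RADE) tensor.
# The paper's exact projection is not given in the abstract; this variant
# max-projects along one axis at a time and measures the size reduction
# for a single frame. Dimensions are illustrative.
import numpy as np

R, A, D, E = 256, 107, 64, 37                  # illustrative tensor dimensions
rade = np.random.rand(R, A, D, E).astype(np.float32)

projections = {
    "RAD": rade.max(axis=3),                   # collapse Elevation
    "RAE": rade.max(axis=2),                   # collapse Doppler
    "ADE": rade.max(axis=0),                   # collapse Range
}
proj_size = sum(p.nbytes for p in projections.values())
print(f"reduction: {100 * (1 - proj_size / rade.nbytes):.1f}%")
```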
cs.CV / 193 / 2602.20008

Token-UNet: A New Case for Transformers Integration in Efficient and Interpretable 3D UNets for Brain Imaging Segmentation

Token-UNet:在高效且可解释的3D UNet中整合变换器的新案例,用于脑部影像分割
Tshimanga, Louis Fabrice, Zanola, Andrea, Del Pup, Federico, Atzori, Manfredo
Abstract
We present Token-UNet, adopting the TokenLearner and TokenFuser modules to encase Transformers into UNets. While Transformers have enabled global interactions among input elements in medical imaging, current computational challenges hinder their deployment on common hardware. Models like (Swin)UNETR adapt the UNet architecture by incorporating (Swin)Transformer encoders, which process tokens that each represent small subvolumes ($8^3$ voxels) of the input. The Transformer attention mechanism scales quadratically with the number of tokens, which is tied to the cubic scaling of 3D input resolution. This work reconsiders the role of convolution and attention, introducing Token-UNets, a family of 3D segmentation models that can operate in constrained computational environments and time frames. To mitigate computational demands, our approach maintains the convolutional encoder of UNet-like models, and applies TokenLearner to 3D feature maps. This module pools a preset number of tokens from local and global structures. Our results show this tokenization effectively encodes task-relevant information, yielding naturally interpretable attention maps. The memory footprint, computation times at inference, and parameter counts of our heaviest model are reduced to 33%, 10%, and 35% of the SwinUNETR values, with better average performance (86.75% ± 0.19% Dice score for SwinUNETR vs our 87.21% ± 0.35%). This work opens the way to more efficient training in contexts with limited computational resources, such as 3D medical imaging. Easing model optimization, fine-tuning, and transfer learning in limited hardware settings can accelerate and diversify the development of approaches, for the benefit of the research community.
Chinese Translation
我们提出了Token-UNet,采用TokenLearner和TokenFuser模块将变换器嵌入UNet中。尽管变换器在医学影像中实现了输入元素之间的全局交互,但当前的计算挑战阻碍了它们在常见硬件上的应用。像(Swin)UNETR这样的模型通过结合(Swin)Transformer编码器来适应UNet架构,这些编码器处理的tokens代表输入的小子体积($8^3$体素)。变换器的注意力机制随着tokens数量的增加呈二次增长,这与3D输入分辨率的立方扩展相关。本研究重新审视了卷积和注意力的角色,引入了Token-UNets,这是一类可以在受限计算环境和时间框架内运行的3D分割模型。为了减轻计算需求,我们的方法保持了UNet类模型的卷积编码器,并将TokenLearner应用于3D特征图。该模块从局部和全局结构中汇聚预设数量的tokens。我们的结果表明,这种token化有效地编码了与任务相关的信息,生成了自然可解释的注意力图。我们最重的模型在内存占用、推理计算时间和参数数量上分别减少到SwinUNETR值的33%、10%和35%,且平均性能更佳(SwinUNETR的Dice得分为86.75%±0.19%,而我们的为87.21%±0.35%)。本研究为在计算资源有限的背景下进行更高效的训练开辟了道路,例如3D医学影像。简化模型优化、微调和迁移学习在有限硬件环境中的应用,可以加速和多样化方法的发展,造福研究社区。
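For readers unfamiliar with TokenLearner-style pooling, the following PyTorch sketch shows the core mechanism on 3D feature maps: a small convolutional head predicts a preset number of spatial attention maps, each of which pools the volume into one token, making the attention maps directly inspectable. Details will differ from the paper's exact module:

```python
# Minimal TokenLearner-style pooling for 3D feature maps. A 1x1x1 conv
# head predicts S spatial attention maps; each map pools the volume into
# one token. This mirrors the mechanism, not the paper's exact module.
import torch
import torch.nn as nn

class TokenLearner3D(nn.Module):
    def __init__(self, channels: int, num_tokens: int):
        super().__init__()
        self.attn = nn.Conv3d(channels, num_tokens, kernel_size=1)

    def forward(self, x):                             # x: (B, C, D, H, W)
        a = self.attn(x).flatten(2).softmax(dim=-1)   # (B, S, D*H*W) attention
        v = x.flatten(2)                              # (B, C, D*H*W) values
        tokens = torch.einsum("bsn,bcn->bsc", a, v)   # (B, S, C) pooled tokens
        return tokens, a                              # tokens + inspectable maps

tokens, attn = TokenLearner3D(64, num_tokens=8)(torch.randn(2, 64, 16, 16, 16))
print(tokens.shape, attn.shape)                       # (2, 8, 64), (2, 8, 4096)
```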
cs.CV / 194 / 2602.20028

Descriptor: Dataset of Parasitoid Wasps and Associated Hymenoptera (DAPWH)

描述符:寄生蜂及相关膜翅目数据集(DAPWH)
Pinheiro, Joao Manoel Herrera, Herrera, Gabriela Do Nascimento, Fernandes, Luciana Bueno Dos Reis, Santos, Alvaro Doria Dos, Godoy, Ricardo V., Almeida, Eduardo A. B., Onody, Helena Carolina, Vieira, Marcelo Andrade Da Costa, Penteado-Dias, Angelica Maria, Becker, Marcelo
Abstract
Accurate taxonomic identification is the cornerstone of biodiversity monitoring and agricultural management, particularly for the hyper-diverse superfamily Ichneumonoidea. Comprising the families Ichneumonidae and Braconidae, these parasitoid wasps are ecologically critical for regulating insect populations, yet they remain one of the most taxonomically challenging groups due to their cryptic morphology and vast number of undescribed species. To address the scarcity of robust digital resources for these key groups, we present a curated image dataset designed to advance automated identification systems. The dataset contains 3,556 high-resolution images, primarily focused on Neotropical Ichneumonidae and Braconidae, while also including supplementary families such as Andrenidae, Apidae, Bethylidae, Chrysididae, Colletidae, Halictidae, Megachilidae, Pompilidae, and Vespidae to improve model robustness. Crucially, a subset of 1,739 images is annotated in COCO format, featuring multi-class bounding boxes for the full insect body, wing venation, and scale bars. This resource provides a foundation for developing computer vision models capable of identifying these families.
Chinese Translation
准确的分类鉴定是生物多样性监测和农业管理的基石,尤其对于超多样化的姬蜂总科(Ichneumonoidea)。这类寄生蜂包括姬蜂科(Ichneumonidae)和茧蜂科(Braconidae),它们在生态上对调节昆虫种群至关重要。然而,由于其隐蔽的形态特征和大量未描述物种,这些寄生蜂仍然是分类学上最具挑战性的群体之一。为了解决这些关键群体缺乏强大数字资源的问题,我们提供了一个经过精心策划的图像数据集,旨在推动自动化识别系统的发展。该数据集包含3,556张高分辨率图像,主要集中在新热带地区的姬蜂科和茧蜂科,同时还包括其他补充科,如地蜂科(Andrenidae)、蜜蜂科(Apidae)、肿腿蜂科(Bethylidae)、青蜂科(Chrysididae)、分舌蜂科(Colletidae)、隧蜂科(Halictidae)、切叶蜂科(Megachilidae)、蛛蜂科(Pompilidae)和胡蜂科(Vespidae),以提高模型的鲁棒性。重要的是,其中1,739张图像以COCO格式进行了标注,包含针对完整昆虫身体、翅脉和比例尺的多类边界框。这一资源为开发能够识别这些科的计算机视觉模型奠定了基础。
cs.CV / 195 / 2602.20046

Closing the gap in multimodal medical representation alignment

缩小多模态医学表示对齐的差距
Grassucci, Eleonora, Cicchetti, Giordano, Comminiello, Danilo
Abstract
In multimodal learning, CLIP has emerged as the de-facto approach for mapping different modalities into a shared latent space by bringing semantically similar representations closer while pushing apart dissimilar ones. However, CLIP-based contrastive losses exhibit unintended behaviors that negatively impact true semantic alignment, leading to sparse and fragmented latent spaces. This phenomenon, known as the modality gap, has been partially mitigated for standard text and image pairs but remains unknown and unresolved in more complex multimodal settings, such as the medical domain. In this work, we study this phenomenon in the latter case, revealing that the modality gap is present also in medical alignment, and we propose a modality-agnostic framework that closes this gap, ensuring that semantically related representations are more aligned, regardless of their source modality. Our method enhances alignment between radiology images and clinical text, improving cross-modal retrieval and image captioning.
Chinese Translation
在多模态学习中,CLIP已成为将不同模态映射到共享潜在空间的事实标准方法,通过将语义相似的表示拉近,同时将不相似的表示推远。然而,基于CLIP的对比损失表现出意想不到的行为,负面影响真实语义对齐,导致潜在空间稀疏和碎片化。这种现象被称为模态差距,虽然在标准文本和图像对中部分得到了缓解,但在更复杂的多模态环境中,例如医学领域,仍然未知且未解决。在本研究中,我们研究了后者情况中的这一现象,揭示模态差距在医学对齐中同样存在,并提出了一种模态无关的框架来缩小这一差距,确保语义相关的表示更加对齐,无论其来源模态如何。我们的方法增强了放射学图像与临床文本之间的对齐,提高了跨模态检索和图像描述的效果。
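One common way to quantify the modality gap (following Liang et al., 2022) is the distance between the centroids of L2-normalized embeddings from each modality. The sketch below uses random features as stand-ins for a medical image/report encoder:

```python
# Measuring the modality gap as the distance between the centroids of
# L2-normalized image and text embedding clouds (Liang et al., 2022).
# Random features stand in for a medical CLIP-style encoder.
import torch
import torch.nn.functional as F

img = F.normalize(torch.randn(1000, 512), dim=-1)   # radiology image embeddings
txt = F.normalize(torch.randn(1000, 512), dim=-1)   # clinical report embeddings

gap = (img.mean(0) - txt.mean(0)).norm()
print(f"modality gap: {gap:.3f}")                    # 0 would mean fully overlapping clouds
```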
cs.CV / 196 / 2602.20051

SEAL-pose: Enhancing 3D Human Pose Estimation via a Learned Loss for Structural Consistency

SEAL-pose:通过学习损失增强结构一致性的三维人体姿态估计
Kim, Yeonsung, Do, Junggeun, Do, Seunguk, Kim, Sangmin, Park, Jaesik, Lee, Jay-Yoon
Abstract
3D human pose estimation (HPE) is characterized by intricate local and global dependencies among joints. Conventional supervised losses are limited in capturing these correlations because they treat each joint independently. Previous studies have attempted to promote structural consistency through manually designed priors or rule-based constraints; however, these approaches typically require manual specification and are often non-differentiable, limiting their use as end-to-end training objectives. We propose SEAL-pose, a data-driven framework in which a learnable loss-net trains a pose-net by evaluating structural plausibility. Rather than relying on hand-crafted priors, our joint-graph-based design enables the loss-net to learn complex structural dependencies directly from data. Extensive experiments on three 3D HPE benchmarks with eight backbones show that SEAL-pose reduces per-joint errors and improves pose plausibility compared with the corresponding backbones across all settings. Beyond improving each backbone, SEAL-pose also outperforms models with explicit structural constraints, despite not enforcing any such constraints. Finally, we analyze the relationship between the loss-net and structural consistency, and evaluate SEAL-pose in cross-dataset and in-the-wild settings.
Chinese Translation
三维人体姿态估计(HPE)具有关节之间复杂的局部和全局依赖性。传统的监督损失在捕捉这些相关性方面存在局限,因为它们将每个关节视为独立的。以往的研究尝试通过手动设计的先验或基于规则的约束来促进结构一致性;然而,这些方法通常需要手动指定,并且往往是不可微分的,限制了它们作为端到端训练目标的使用。我们提出了SEAL-pose,这是一种数据驱动的框架,其中可学习的损失网络通过评估结构合理性来训练姿态网络。我们的关节图设计使得损失网络能够直接从数据中学习复杂的结构依赖性,而不是依赖手工制作的先验。在三个三维HPE基准上使用八种骨干网络进行的广泛实验显示,SEAL-pose在所有设置中都比相应的骨干网络减少了每个关节的误差,并改善了姿态的合理性。除了改善每个骨干网络外,SEAL-pose还超越了具有明确结构约束的模型,尽管没有强制执行任何此类约束。最后,我们分析了损失网络与结构一致性之间的关系,并在跨数据集和野外环境中评估了SEAL-pose。
cs.CV / 197 / 2602.20053

Decoupling Defense Strategies for Robust Image Watermarking

解耦防御策略的鲁棒图像水印技术
Chen, Jiahui, Deng, Zehang, Zhang, Zeyu, Li, Chaoyang, Jia, Lianchen, Sun, Lifeng
Abstract
Deep learning-based image watermarking, while robust against conventional distortions, remains vulnerable to advanced adversarial and regeneration attacks. Conventional countermeasures, which jointly optimize the encoder and decoder via a noise layer, face two inevitable challenges: (1) a decrease in clean accuracy due to decoder adversarial training and (2) limited robustness due to simultaneous training against all three advanced attacks. To overcome these issues, we propose AdvMark, a novel two-stage fine-tuning framework that decouples the defense strategies. In stage 1, we address adversarial vulnerability via a tailored adversarial training paradigm that primarily fine-tunes the encoder while only conditionally updating the decoder. This approach learns to move the image into a non-attackable region, rather than modifying the decision boundary, thus preserving clean accuracy. In stage 2, we tackle distortion and regeneration attacks via direct image optimization. To preserve the adversarial robustness gained in stage 1, we formulate a principled, constrained image loss with theoretical guarantees, which balances the deviation from the cover and previously encoded images. We also propose a quality-aware early stop to further guarantee a lower bound on visual quality. Extensive experiments demonstrate that AdvMark outperforms prior methods, achieving the highest image quality and comprehensive robustness, i.e., up to 29%, 33%, and 46% accuracy improvements for distortion, regeneration, and adversarial attacks, respectively.
Chinese Translation
基于深度学习的图像水印技术在抵御常规失真方面表现出色,但仍然容易受到高级对抗性和再生攻击的威胁。传统的对策通过噪声层联合优化编码器和解码器,但面临两个不可避免的挑战:(1)由于解码器的对抗性训练,干净准确度下降;(2)由于同时训练所有三种高级攻击,鲁棒性有限。为了解决这些问题,我们提出了AdvMark,这是一种新颖的两阶段微调框架,解耦了防御策略。在第一阶段,我们通过一种量身定制的对抗性训练范式来解决对抗性脆弱性,该范式主要微调编码器,同时仅有条件地更新解码器。这种方法学习将图像移动到一个不可攻击的区域,而不是修改决策边界,从而保持干净的准确度。在第二阶段,我们通过直接图像优化来应对失真和再生攻击。为了保持第一阶段获得的对抗鲁棒性,我们制定了一种有原则的、受限的图像损失函数,并提供理论保证,平衡了与封面图像和先前编码图像的偏差。我们还提出了一种质量感知的早停机制,以进一步保证视觉质量的下限。大量实验表明,AdvMark在图像质量和综合鲁棒性方面表现优异,即在失真、再生和对抗攻击中,准确度分别提高了29%、33%和46%。
cs.CV / 198 / 2602.20060

MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving

MeanFuser:基于MeanFlow的快速一步式多模态轨迹生成与自适应重构方法,用于端到端自主驾驶
Wang, Junli, Liu, Xueyi, Zheng, Yinan, Xing, Zebing, Li, Pengfei, Li, Guang, Ma, Kun, Chen, Guang, Ye, Hangjun, Xia, Zhongpu, Chen, Long, Zhang, Qichao
Abstract
Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance. To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We adapt the "MeanFlow Identity" to end-to-end planning, which models the mean velocity field between GMN and the trajectory distribution instead of the instantaneous velocity field used in vanilla flow matching methods, effectively eliminating numerical errors from ODE solvers and significantly accelerating inference. (3) We design a lightweight Adaptive Reconstruction Module (ARM) that enables the model to implicitly select from all sampled proposals or reconstruct a new trajectory when none is satisfactory via attention weights. Experiments on the NAVSIM closed-loop benchmark demonstrate that MeanFuser achieves outstanding performance without the supervision of the PDM Score, as well as exceptional inference efficiency, offering a robust and efficient solution for end-to-end autonomous driving. Our code and model are available at https://github.com/wjl2244/MeanFuser.
Chinese Translation
生成模型在轨迹规划中展现出巨大的潜力。近期研究表明,基于锚点引导的生成模型在建模驾驶行为的不确定性和提升整体性能方面是有效的。然而,这些方法依赖于离散的锚点词汇,这些词汇在测试期间必须充分覆盖轨迹分布以确保鲁棒性,从而导致词汇大小与模型性能之间的固有权衡。为了解决这一限制,我们提出了MeanFuser,这是一种端到端自主驾驶方法,通过三个关键设计增强了效率和鲁棒性。(1) 我们引入高斯混合噪声(Gaussian Mixture Noise, GMN)来指导生成采样,使轨迹空间的表示连续化,消除了对离散锚点词汇的依赖。(2) 我们将“MeanFlow Identity”适配于端到端规划,该方法建模GMN与轨迹分布之间的平均速度场,而不是传统流匹配方法中使用的瞬时速度场,有效消除了来自常微分方程求解器的数值误差,并显著加速推理。(3) 我们设计了一个轻量级自适应重构模块(Adaptive Reconstruction Module, ARM),使模型能够通过注意力权重隐式地从所有采样提案中进行选择,或在没有满意提案时重构新轨迹。NAVSIM闭环基准测试的实验表明,MeanFuser在没有PDM评分监督的情况下实现了卓越的性能和出色的推理效率,为端到端自主驾驶提供了一种鲁棒且高效的解决方案。我们的代码和模型可在 https://github.com/wjl2244/MeanFuser 获取。
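A toy sketch of one-step sampling in the spirit of the MeanFlow identity: a network u(z, r, t) predicts the average velocity between times r and t, so a sample takes a single evaluation, x0 = z1 - u(z1, 0, 1), instead of an ODE solve. The network below is an untrained stand-in, and the Gaussian-mixture noise sampler is one illustrative choice, not the paper's implementation:

```python
# Toy one-step generation with a mean-velocity field. The net is an
# untrained stand-in for a mean-velocity model u(z, r, t); GMN sampling
# is sketched as a simple mixture of Gaussians.
import torch
import torch.nn as nn

class MeanVelocityNet(nn.Module):
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 2, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, z, r, t):
        rt = torch.stack([r, t], dim=-1).expand(z.shape[0], 2)  # time pair per sample
        return self.net(torch.cat([z, rt], dim=-1))

def sample_gmn(n, means, std=0.3):             # Gaussian Mixture Noise
    comp = torch.randint(len(means), (n,))
    return means[comp] + std * torch.randn(n, means.shape[1])

u = MeanVelocityNet()
z1 = sample_gmn(16, means=torch.tensor([[-1.0, 0.0], [1.0, 0.0]]))
x0 = z1 - u(z1, torch.zeros(1), torch.ones(1))  # single evaluation, no ODE solve
print(x0.shape)
```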
cs.CV / 199 / 2602.20066

HeatPrompt: Zero-Shot Vision-Language Modeling of Urban Heat Demand from Satellite Images

HeatPrompt:基于卫星图像的城市热需求零样本视觉语言建模
Thota, Kundan, Mu, Xuanhao, Schlachter, Thorsten, Hagenmeyer, Veit
Abstract
Accurate heat-demand maps play a crucial role in decarbonizing space heating, yet most municipalities lack the detailed building-level data needed to calculate them. We introduce HeatPrompt, a zero-shot vision-language energy modeling framework that estimates annual heat demand using semantic features extracted from satellite images, basic Geographic Information System (GIS) data, and building-level features. We feed pretrained Large Vision Language Models (VLMs) with a domain-specific prompt to act as an energy planner and extract visual attributes such as roof age, building density, etc., from the RGB satellite image that correspond to the thermal load. A Multi-Layer Perceptron (MLP) regressor trained on these captions shows an $R^2$ uplift of 93.7% and shrinks the mean absolute error (MAE) by 30% compared to the baseline model. Qualitative analysis shows that high-impact tokens align with high-demand zones, offering lightweight support for heat planning in data-scarce regions.
Chinese Translation
准确的热需求地图在实现空间供暖的脱碳过程中起着至关重要的作用,但大多数市政当局缺乏计算所需的详细建筑级数据。我们提出了HeatPrompt,一种零样本视觉语言能源建模框架,利用从卫星图像中提取的语义特征、基本地理信息系统(GIS)和建筑级特征来估算年热需求。我们为预训练的大型视觉语言模型(VLM)提供了特定领域的提示,使其充当能源规划者,并从RGB卫星图像中提取与热负荷相关的视觉属性,如屋顶年龄、建筑密度等。基于这些描述训练的多层感知器(MLP)回归模型显示出93.7%的$R^2$提升,并将平均绝对误差(MAE)相比基线模型缩小了30%。定性分析表明,高影响力的标记与高需求区域相一致,为数据稀缺地区的热规划提供了轻量级支持。
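A minimal sketch of the regression stage described in the abstract: an MLP maps VLM-extracted building attributes to annual heat demand, evaluated with the same $R^2$ and MAE metrics. The features below are random stand-ins for encoded caption attributes:

```python
# Sketch of the regression stage: an MLP maps VLM-extracted building
# attributes (e.g., roof age, density) to annual heat demand, reported
# with R^2 and MAE. Features/targets here are synthetic stand-ins.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))                            # encoded visual attributes
y = X @ rng.normal(size=16) + 0.1 * rng.normal(size=2000)  # synthetic heat demand

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000,
                   random_state=0).fit(X_tr, y_tr)
pred = mlp.predict(X_te)
print("R2:", r2_score(y_te, pred), "MAE:", mean_absolute_error(y_te, pred))
```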
cs.CV / 200 / 2602.20068

The Invisible Gorilla Effect in Out-of-distribution Detection

分布外检测中的隐形大猩猩效应
Anthony, Harry, Liang, Ziyun, Warr, Hermione, Kamnitsas, Konstantinos
Abstract
Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out-of-distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model's ROI and drops when it does not - a phenomenon we term the Invisible Gorilla Effect. For example, in a skin lesion classifier with red lesion ROI, we show the method Mahalanobis Score achieves a 31.5% higher AUROC when detecting OOD red ink (similar to ROI) compared to black ink (dissimilar) annotations. We annotated artefacts by colour in 11,355 images from three public datasets (e.g. ISIC) and generated colour-swapped counterfactuals to rule out dataset bias. We then evaluated 40 OOD methods across 7 benchmarks and found significant performance drops for most methods when artefacts differed from the ROI. Our findings highlight an overlooked failure mode in OOD detection and provide guidance for more robust detectors. Code and annotations are available at: https://github.com/HarryAnthony/Invisible_Gorilla_Effect.
Chinese Translation
深度神经网络通过学习图像中感兴趣区域(ROI)的特征,在视觉任务中实现了高性能,但在部署于与训练数据不同的分布外(OOD)数据时,其性能会下降。这一挑战促使了旨在识别和拒绝不可靠预测的OOD检测方法的出现。尽管先前的研究表明,OOD检测性能因伪影类型而异,但其潜在原因仍未得到充分探讨。为此,我们识别出一种在OOD检测中未被报告的偏差:对于难以检测的伪影(近OOD),当伪影与模型的ROI在视觉上具有相似性(例如颜色)时,检测性能通常会提高,而当伪影不具备这种相似性时,性能则会下降——这一现象我们称之为隐形大猩猩效应。例如,在一个具有红色病变ROI的皮肤病变分类器中,我们展示了马哈拉诺比斯分数(Mahalanobis Score)在检测OOD红色墨水(与ROI相似)时的AUROC比检测黑色墨水(不相似)时高出31.5%。我们在三个公共数据集(例如ISIC)中的11,355张图像上按颜色标注了伪影,并生成了颜色交换的反事实以排除数据集偏差。随后,我们在7个基准上评估了40种OOD方法,发现大多数方法在伪影与ROI不同的情况下性能显著下降。我们的发现突显了OOD检测中被忽视的失败模式,并为更稳健的检测器提供了指导。代码和标注可在以下链接获取:https://github.com/HarryAnthony/Invisible_Gorilla_Effect。
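The Mahalanobis Score referenced in the abstract is the classic feature-space OOD detector of Lee et al. (2018): fit per-class means and a shared covariance on in-distribution features, then score a sample by its distance to the nearest class mean. A compact NumPy version, with random features standing in for penultimate-layer activations:

```python
# Mahalanobis OOD score (Lee et al., 2018): per-class means + shared
# (tied) covariance fitted on in-distribution features; a sample is
# scored by distance to the nearest class mean. Features are stand-ins.
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64))            # in-distribution features
labels = rng.integers(0, 5, 1000)

means = np.stack([feats[labels == c].mean(0) for c in range(5)])
cov = np.cov(feats - means[labels], rowvar=False)   # shared covariance
prec = np.linalg.inv(cov + 1e-6 * np.eye(64))       # regularized precision

def mahalanobis_score(x):
    d = x - means                              # (5, 64): offsets to each class mean
    return -min(v @ prec @ v for v in d)       # higher score = more in-distribution

print(mahalanobis_score(rng.normal(size=64)))
```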
cs.CV / 201 / 2602.20079

SemanticNVS: Improving Semantic Scene Understanding in Generative Novel View Synthesis

SemanticNVS:在生成新视图合成中改善语义场景理解
Chen, Xinya, Wewer, Christopher, Xie, Jiahao, Hu, Xinting, Lenssen, Jan Eric
Abstract
We present SemanticNVS, a camera-conditioned multi-view diffusion model for novel view synthesis (NVS), which improves generation quality and consistency by integrating pre-trained semantic feature extractors. Existing NVS methods perform well for views near the input view; however, they tend to generate semantically implausible and distorted images under long-range camera motion, revealing severe degradation. We speculate that this degradation is due to current models failing to fully understand their conditioning or intermediate generated scene content. Here, we propose to integrate pre-trained semantic feature extractors to incorporate stronger scene semantics as conditioning, achieving high-quality generation even at distant viewpoints. We investigate two different strategies: (1) warped semantic features and (2) an alternating scheme of understanding and generation at each denoising step. Experimental results on multiple datasets demonstrate clear qualitative and quantitative improvements (4.69%-15.26% in FID) over state-of-the-art alternatives.
Chinese Translation
我们提出了SemanticNVS,一种基于相机条件的多视角扩散模型,用于新视图合成(NVS),通过整合预训练的语义特征提取器来提高生成质量和一致性。现有的NVS方法在输入视图附近表现良好,但在长距离相机运动下,它们往往生成语义上不合理和失真的图像,显示出严重的退化。我们推测,这种退化是由于当前模型未能充分理解其条件或中间生成的场景内容。在此,我们提出整合预训练的语义特征提取器,以更强的场景语义作为条件,从而即使在远距离视点下也能实现高质量生成。我们研究了两种不同的策略:(1)变形对齐(warped)的语义特征;(2)在每个去噪步骤中理解与生成的交替方案。在多个数据集上的实验结果表明,与最先进的替代方法相比,本方法在定性和定量上(FID改善4.69%-15.26%)均有明显提升。
cs.CV / 202 / 2602.20084

Do Large Language Models Understand Data Visualization Principles?

大型语言模型是否理解数据可视化原则?
Sinnona, Martin, Bonas, Valentin, Siless, Viviana, Iarussi, Emmanuel
Abstract
Data visualization principles, derived from decades of research in design and perception, ensure proper visual communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they and their vision-language counterparts (VLMs) can reason about and enforce visualization principles directly. Constraint-based systems encode these principles as logical rules for precise automated checks, but translating them into formal specifications demands expert knowledge. This motivates leveraging LLMs and VLMs as principle checkers that can reason about visual design directly, bypassing the need for symbolic rule specification. In this paper, we present the first systematic evaluation of both LLMs and VLMs on their ability to reason about visualization principles, using hard verification ground truth derived from Answer Set Programming (ASP). We compiled a set of visualization principles expressed as natural-language statements and generated a controlled dataset of approximately 2,000 Vega-Lite specifications annotated with explicit principle violations, complemented by over 300 real-world Vega-Lite charts. We evaluated both checking and fixing tasks, assessing how well models detect principle violations and correct flawed chart specifications. Our work highlights both the promise of large (vision-)language models as flexible validators and editors of visualization designs and the persistent gap with symbolic solvers on more nuanced aspects of visual perception. The results also reveal an interesting asymmetry: frontier models tend to be more effective at correcting violations than at detecting them reliably.
Chinese Translation
数据可视化原则源于数十年的设计和感知研究,确保了适当的视觉传达。尽管先前的研究表明大型语言模型(LLMs)能够生成图表或标记误导性图形,但尚不清楚它们及其视觉-语言对应物(VLMs)是否能够直接推理和执行可视化原则。基于约束的系统将这些原则编码为逻辑规则,以便进行精确的自动检查,但将其转化为正式规范需要专业知识。这促使我们利用LLMs和VLMs作为能够直接推理视觉设计的原则检查器,从而绕过符号规则规范的需求。本文首次系统评估了LLMs和VLMs在推理可视化原则方面的能力,使用从答案集编程(ASP)推导的严格验证基准真值。我们编制了一组以自然语言表述的可视化原则,并生成了一个包含约2000个Vega-Lite规范的受控数据集,这些规范标注了明确的原则违规情况,并附有300多个真实世界的Vega-Lite图表。我们在检查和修复两类任务上进行了评估,考察模型检测原则违规和纠正有缺陷图表规范的能力。我们的研究突显了大型(视觉)语言模型作为可视化设计灵活验证者和编辑者的潜力,以及在视觉感知的更细微方面与符号求解器之间的持续差距。研究还揭示了一个有趣的不对称性:前沿模型在纠正违规方面往往比可靠检测违规更有效。
cs.CV / 203 / 2602.20089

StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

StructXLIP:通过多模态结构线索增强视觉-语言模型
Ruan, Zanxi, Kong, Qiuyu, Gao, Songqun, Wang, Yiming, Cristani, Marco
Abstract
Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: https://github.com/intelligolabs/StructXLIP.
Chinese Translation
基于边缘的表示是视觉理解的基本线索,这一原则源于早期的视觉研究,并在今天仍然占据中心地位。我们将这一原则扩展到视觉-语言对齐,表明在不同模态中隔离和对齐结构线索可以极大地促进对长篇、细节丰富的标题的微调,特别关注于改善跨模态检索。我们提出了StructXLIP,一种微调对齐范式,提取边缘图(例如,Canny),将其视为图像视觉结构的代理,并过滤相应的标题以强调结构线索,使其“以结构为中心”。微调通过三种以结构为中心的损失增强标准对齐损失:(i)将边缘图与结构文本对齐,(ii)将局部边缘区域与文本块匹配,以及(iii)将边缘图与彩色图像连接以防止表示漂移。从理论角度来看,标准的CLIP最大化视觉和文本嵌入之间的互信息,而StructXLIP则额外最大化多模态结构表示之间的互信息。这种辅助优化本质上更为复杂,引导模型朝向更稳健和语义稳定的极小值,从而增强视觉-语言对齐。除了在一般和专业领域的跨模态检索中超越当前竞争对手外,我们的方法还作为一种通用的增强方案,可以以即插即用的方式集成到未来的方法中。代码和预训练模型可在以下网址公开获取:https://github.com/intelligolabs/StructXLIP。
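A hedged sketch of the core recipe: extract an edge map as a structure proxy (cv2.Canny is one standard extractor) and align edge-map embeddings with structure-centric captions via a symmetric CLIP-style InfoNCE loss. The encoders here are untrained stand-ins, and the loss shown is the standard contrastive form rather than the paper's full three-loss objective:

```python
# Edge map as a structure proxy, aligned to structure-centric text with
# a symmetric CLIP-style InfoNCE loss. Embeddings are random stand-ins
# for the output of trained encoders.
import cv2
import numpy as np
import torch
import torch.nn.functional as F

img = (np.random.rand(224, 224) * 255).astype(np.uint8)
edges = cv2.Canny(img, 100, 200)               # edge map (visual structure proxy)

edge_emb = F.normalize(torch.randn(8, 512), dim=-1)   # batch of edge-map embeddings
text_emb = F.normalize(torch.randn(8, 512), dim=-1)   # structure-centric captions

logits = edge_emb @ text_emb.T / 0.07          # temperature-scaled similarities
targets = torch.arange(8)                      # matched pairs on the diagonal
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(float(loss))
```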
cs.CV / 204 / 2602.20100

Transcending the Annotation Bottleneck: AI-Powered Discovery in Biology and Medicine

超越注释瓶颈:人工智能驱动的生物学和医学发现
Chatterjee, Soumick
Abstract
The dependence on expert annotation has long constituted the primary rate-limiting step in the application of artificial intelligence to biomedicine. While supervised learning drove the initial wave of clinical algorithms, a paradigm shift towards unsupervised and self-supervised learning (SSL) is currently unlocking the latent potential of biobank-scale datasets. By learning directly from the intrinsic structure of data - whether pixels in a magnetic resonance image (MRI), voxels in a volumetric scan, or tokens in a genomic sequence - these methods facilitate the discovery of novel phenotypes, the linkage of morphology to genetics, and the detection of anomalies without human bias. This article synthesises seminal and recent advances in "learning without labels," highlighting how unsupervised frameworks can derive heritable cardiac traits, predict spatial gene expression in histology, and detect pathologies with performance that rivals or exceeds supervised counterparts.
Chinese Translation
长期以来,对专家注释的依赖一直是人工智能在生物医学应用中的主要限制因素。虽然监督学习推动了临床算法的初始浪潮,但当前向无监督学习和自监督学习(Self-Supervised Learning, SSL)的范式转变正在释放生物样本库规模数据集的潜在价值。通过直接从数据的内在结构中学习——无论是磁共振成像(MRI)中的像素、体积扫描中的体素,还是基因组序列中的标记——这些方法促进了新表型的发现、形态与遗传学的关联以及无偏见的异常检测。本文综合了“无标签学习”的开创性和近期进展,强调无监督框架如何推导可遗传的心脏特征、预测组织学中的空间基因表达,并以与监督学习相媲美或超越的性能检测病理。
cs.CV / 205 / 2602.20114

Benchmarking Unlearning for Vision Transformers

视觉变换器的反学习基准测试
Zhao, Kairan, Luca, Iurie, Triantafillou, Peter
Abstract
Research in machine unlearning (MU) has gained strong momentum: MU is now widely regarded as a critical capability for building safe and fair AI. In parallel, research into transformer architectures for computer vision tasks has been highly successful: Increasingly, Vision Transformers (VTs) emerge as strong alternatives to CNNs. Yet, MU research for vision tasks has largely centered on CNNs, not VTs. While MU benchmarking efforts have addressed LLMs, diffusion models, and CNNs, none exist for VTs. This work is the first to attempt this, benchmarking MU algorithm performance in different VT families (ViT and Swin-T) and at different capacities. The work employs (i) different datasets, selected to assess the impacts of dataset scale and complexity; (ii) different MU algorithms, selected to represent fundamentally different approaches for MU; and (iii) both single-shot and continual unlearning protocols. Additionally, it focuses on benchmarking MU algorithms that leverage training data memorization, since leveraging memorization has been recently discovered to significantly improve the performance of previously SOTA algorithms. En route, the work characterizes how VTs memorize training data relative to CNNs, and assesses the impact of different memorization proxies on performance. The benchmark uses unified evaluation metrics that capture two complementary notions of forget quality along with accuracy on unseen (test) data and on retained data. Overall, this work offers a benchmarking basis, enabling reproducible, fair, and comprehensive comparisons of existing (and future) MU algorithms on VTs. And, for the first time, it sheds light on how well existing algorithms work in VT settings, establishing a promising reference performance baseline.
Chinese Translation
机器反学习(MU)的研究已获得强劲动力:MU现在被广泛视为构建安全和公平人工智能的关键能力。与此同时,针对计算机视觉任务的变换器架构的研究取得了显著成功:视觉变换器(VTs)越来越多地成为卷积神经网络(CNNs)的强有力替代方案。然而,针对视觉任务的MU研究主要集中在CNNs上,而非VTs。尽管对大型语言模型(LLMs)、扩散模型和CNNs的MU努力进行了基准测试,但尚未针对VTs进行相关研究。本研究首次尝试这一方向,基准测试不同VT家族(ViT和Swin-T)及不同容量下的MU算法性能。该研究采用了(i)不同的数据集,以评估数据集规模和复杂性的影响;(ii)不同的MU算法,以代表根本不同的MU方法;以及(iii)单次和持续反学习协议。此外,研究重点基准测试利用训练数据记忆的MU算法,因为最近发现利用记忆可以显著提高先前的最先进(SOTA)算法的性能。在此过程中,研究描述了VTs相对于CNNs如何记忆训练数据,并评估了不同记忆代理对性能的影响。基准测试使用统一的评估指标,捕捉遗忘质量的两个互补概念,以及在未见(测试)数据和保留数据上的准确性。总体而言,本研究提供了一个基准测试基础,能够对现有(和未来)VTs上的MU算法进行可重复、公平和全面的比较。同时,首次揭示了现有算法在VT环境中的表现,建立了一个有前景的参考性能基准。
cs.CV / 206 / 2602.20137

Do Large Language Models Understand Data Visualization Rules?

大型语言模型是否理解数据可视化规则?
Sinnona, Martin, Bonas, Valentin, Iarussi, Emmanuel, Siless, Viviana
Abstract
Data visualization rules, derived from decades of research in design and perception, ensure trustworthy chart communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they can reason about and enforce visualization rules directly. Constraint-based systems such as Draco encode these rules as logical constraints for precise automated checks, but maintaining symbolic encodings requires expert effort, motivating the use of LLMs as flexible rule validators. In this paper, we present the first systematic evaluation of LLMs against visualization rules using hard-verification ground truth derived from Answer Set Programming (ASP). We translated a subset of Draco's constraints into natural-language statements and generated a controlled dataset of 2,000 Vega-Lite specifications annotated with explicit rule violations. LLMs were evaluated on both accuracy in detecting violations and prompt adherence, which measures whether outputs follow the required structured format. Results show that frontier models achieve high adherence (Gemma 3 4B / 27B: 100%, GPT-oss 20B: 98%) and reliably detect common violations (F1 up to 0.82), yet performance drops for subtler perceptual rules (F1 < 0.15 for some categories) and for outputs generated from technical ASP formulations. Translating constraints into natural language improved performance by up to 150% for smaller models. These findings demonstrate the potential of LLMs as flexible, language-driven validators while highlighting their current limitations compared to symbolic solvers.
Chinese Translation
数据可视化规则源于数十年的设计和感知研究,确保了图表传达的可信性。尽管先前的研究表明大型语言模型(LLMs)能够生成图表或标记误导性图形,但尚不清楚它们是否能够直接推理和执行可视化规则。基于约束的系统如 Draco 将这些规则编码为逻辑约束,以进行精确的自动检查,但维护符号编码需要专家的努力,这促使我们使用 LLMs 作为灵活的规则验证者。本文首次系统评估了 LLMs 对可视化规则的表现,采用了基于答案集编程(Answer Set Programming, ASP)推导的硬验证真值。我们将 Draco 的一部分约束翻译为自然语言陈述,并生成了一个包含 2000 个带有明确规则违规标注的 Vega-Lite 规范的受控数据集。我们评估了 LLMs 在检测违规方面的准确性和提示遵循性,后者衡量输出是否遵循所需的结构化格式。结果表明,前沿模型在遵循性方面表现出色(Gemma 3 4B / 27B: 100%,GPT-oss 20B: 98%),并可靠地检测常见违规(F1 高达 0.82),然而对于更微妙的感知规则(某些类别的 F1 < 0.15)以及从技术 ASP 公式生成的输出,性能有所下降。将约束翻译为自然语言使较小模型的性能提高了多达 150%。这些发现展示了 LLMs 作为灵活的、基于语言的验证者的潜力,同时突显了它们与符号求解器相比的当前局限性。
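To make the task concrete, here is one example of the kind of rule violation both of these papers probe: a bar chart whose quantitative axis does not start at zero, a classic misleading-chart pattern that Draco-style constraints can flag. The spec is ordinary Vega-Lite written as a Python dict; the data and field names are invented:

```python
# A Vega-Lite spec with a deliberate rule violation: a bar chart whose
# quantitative axis does not include zero, so bar lengths are no longer
# proportional to values. Data and field names are illustrative.
violating_spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": [{"city": "A", "sales": 98}, {"city": "B", "sales": 100}]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "city", "type": "nominal"},
        "y": {
            "field": "sales",
            "type": "quantitative",
            "scale": {"zero": False},   # the violation being checked
        },
    },
}
# A natural-language statement of the rule, as it might be given to an LLM checker:
rule = "Bar charts must include zero on the quantitative axis."
```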
cs.CV / 207 / 2602.20157

Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning

Flow3r:可扩展视觉几何学习的分解流预测
Cong, Zhongxiao, Zhao, Qitao, Jeon, Minsik, Tulsiani, Shubham
Abstract
Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision -- expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ("flow") as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ~800K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.
Chinese Translation
当前的前馈3D/4D重建系统依赖于密集几何和姿态监督——在大规模获取时成本高昂,且在动态真实场景中尤为稀缺。我们提出了Flow3r,一个通过密集2D对应(flow)作为监督来增强视觉几何学习的框架,使得可以从未标记的单目视频中进行可扩展训练。我们的关键见解是流预测模块应该被分解:使用一个图像的几何潜变量和另一个图像的姿态潜变量来预测两幅图像之间的流。这种分解直接指导场景几何和相机运动的学习,并自然扩展到动态场景。在受控实验中,我们展示了分解流预测的性能优于其他设计,并且性能随着未标记数据的增加而一致提升。将分解流集成到现有的视觉几何架构中,并使用约80万未标记视频进行训练,Flow3r在涵盖静态和动态场景的八个基准测试中取得了最先进的结果,尤其在标记数据最为稀缺的野外动态视频上增益最大。
cs.CV / 208 / 2602.20159

A Very Big Video Reasoning Suite

一个非常大的视频推理套件
Wang, Maijunxian, Wang, Ruisi, Lin, Juyi, Ji, Ran, Wiedemer, Thaddäus, Gao, Qingying, Luo, Dezhi, Qian, Yaoyao, Huang, Lianyu, Hong, Zelong, Ge, Jiahui, Ma, Qianli, He, Hang, Zhou, Yifan, Guo, Lingzi, Mei, Lantao, Li, Jiachen, Xing, Hanwen, Zhao, Tianqi, Yu, Fengyuan, Xiao, Weihang, Jiao, Yizheng, Hou, Jianheng, Zhang, Danyang, Xu, Pengcheng, Zhong, Boyang, Zhao, Zehong, Fang, Gaoyun, Kitaoka, John, Xu, Yile, Xu, Hua, Blacutt, Kenton, Nguyen, Tin, Song, Siyuan, Sun, Haoran, Wen, Shaoyue, He, Linyang, Wang, Runming, Wang, Yanzhi, Yang, Mengyue, Ma, Ziqiao, Millière, Raphaël, Shi, Freda, Vasconcelos, Nuno, Khashabi, Daniel, Yuille, Alan, Du, Yilun, Liu, Ziming, Li, Bo, Lin, Dahua, Liu, Ziwei, Kumar, Vikash, Li, Yijiang, Yang, Lei, Cai, Zhongang, Deng, Hokin
Abstract
Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at https://video-reason.com/ .
Chinese Translation
视频模型的快速进展主要集中在视觉质量上,导致其推理能力未得到充分探索。视频推理将智能建立在时空一致的视觉环境中,这超出了文本自然捕捉的范围,使得对时空结构(如连续性、互动性和因果性)进行直观推理成为可能。然而,系统地研究视频推理及其扩展行为受到缺乏大规模训练数据的限制。为了解决这一问题,我们引入了前所未有的大规模资源——非常大的视频推理(VBVR)数据集,该数据集涵盖200个经过精心策划的推理任务,遵循系统化的分类法,并包含超过一百万个视频片段,规模比现有数据集大约三个数量级。我们进一步提出了VBVR-Bench,一个可验证的评估框架,通过引入基于规则、与人类判断对齐的评分器,超越了基于模型的评估,能够实现可重复和可解释的视频推理能力诊断。利用VBVR套件,我们开展了视频推理领域最早的大规模扩展性研究之一,并观察到对未见推理任务的早期涌现泛化迹象。总之,VBVR为可泛化视频推理研究的下一个阶段奠定了基础。数据、基准工具包和模型可在 https://video-reason.com/ 上公开获取。
cs.CV / 209 / 2602.20160

tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

tttLRM:用于长上下文和自回归3D重建的测试时训练
Wang, Chen, Tan, Hao, Yifan, Wang, Chen, Zhiqin, Liu, Yuheng, Sunkavalli, Kalyan, Bi, Sai, Liu, Lingjie, Hu, Yiwei
Abstract
We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model's capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.
Chinese Translation
我们提出了tttLRM,这是一种新颖的大型3D重建模型,利用测试时训练(Test-Time Training, TTT)层实现长上下文的自回归3D重建,并具有线性计算复杂度,进一步扩展了模型的能力。我们的框架有效地将多个图像观测压缩为TTT层的快速权重,形成在潜在空间中的隐式3D表示,可以解码为各种显式格式,例如用于下游应用的高斯点(Gaussian Splats, GS)。我们模型的在线学习变体支持从流式观测中进行渐进式3D重建和精细化。我们证明了在新视图合成任务上的预训练有效地转移到显式3D建模,导致重建质量的提高和收敛速度的加快。大量实验表明,我们的方法在前馈3D高斯重建方面,相较于最先进的方法,在物体和场景上均取得了优越的性能。
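A toy test-time-training (TTT) layer, to illustrate the mechanism the abstract builds on: the layer's state is a set of fast weights updated online by gradient steps on a self-supervised reconstruction loss for each incoming observation, compressing a stream into fixed-size state. This mirrors the general TTT idea only; the paper's layer, losses, and decoding differ:

```python
# Toy TTT layer: fast weights W are the layer's state, updated online by
# one gradient step per observation chunk on a self-supervised
# denoising/reconstruction loss. Mechanism sketch only.
import torch

def ttt_update(W, x, lr=0.1):
    x_corrupt = x + 0.1 * torch.randn_like(x)  # corrupted view of the input
    loss = ((x_corrupt @ W - x) ** 2).mean()   # reconstruct the clean input
    (grad,) = torch.autograd.grad(loss, W)
    return W - lr * grad                       # one inner-loop step

dim = 32
W = torch.zeros(dim, dim, requires_grad=True)  # fast weights (the layer's state)
for x in torch.randn(10, 4, dim):              # a stream of 10 observation chunks
    W = ttt_update(W, x).detach().requires_grad_(True)
print("state norm after stream:", W.norm().item())
```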
cs.CV / 210 / 2602.20161

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Mobile-O:移动设备上的统一多模态理解与生成
Shaker, Abdelrahman, Heakl, Ahmed, Muhammad, Jaseel, Thawkar, Ritesh, Thawakar, Omkar, Li, Senmao, Cholakkal, Hisham, Reid, Ian, Xing, Eric P., Khan, Salman, Khan, Fahad Shahbaz
Abstract
Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
Chinese Translation
统一多模态模型能够在单一架构中理解和生成视觉内容。然而,现有模型仍然对数据需求较高,并且过于庞大,不适合在边缘设备上部署。我们提出了Mobile-O,这是一种紧凑的视觉-语言-扩散模型,将统一的多模态智能带入移动设备。其核心模块Mobile Conditioning Projector (MCP)通过深度可分离卷积和逐层对齐,将视觉-语言特征与扩散生成器融合。这一设计实现了高效的跨模态条件化,且计算成本极低。Mobile-O仅在几百万个样本上进行训练,并在一种新颖的四元组格式(生成提示、图像、问题、答案)中进行后训练,联合提升了视觉理解和生成能力。尽管效率高,Mobile-O在性能上与其他统一模型相比仍具竞争力或更优,GenEval得分达到74%,并分别比Show-O和JanusFlow高出5%和11%,同时运行速度分别快6倍和11倍。在视觉理解方面,Mobile-O在七个基准测试中平均分别超越它们15.3%和5.1%。在iPhone上,每张512x512图像的生成时间仅约为3秒,Mobile-O建立了首个在边缘设备上进行实时统一多模态理解与生成的实用框架。我们希望Mobile-O能够推动未来完全在设备端运行、不依赖云端的实时统一多模态智能研究。我们的代码、模型、数据集和移动应用程序已公开发布,网址为https://amshaker.github.io/Mobile-O/
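The depthwise-separable convolution at the heart of the MCP module can be sketched in a few lines: a per-channel spatial convolution followed by a 1x1 pointwise mix, which cuts parameters and FLOPs relative to a dense convolution. The sketch omits the paper's layerwise alignment, and the channel sizes are arbitrary:

```python
# Depthwise-separable convolution: per-channel spatial conv (groups=C)
# followed by a 1x1 pointwise mix. The parameter comparison against a
# dense conv shows why this suits on-device deployment.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableConv(320, 768)
dense = nn.Conv2d(320, 768, 3, padding=1)
print(sum(p.numel() for p in block.parameters()),   # far fewer parameters
      sum(p.numel() for p in dense.parameters()))
```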
人工智能 (Artificial Intelligence)
69
cs.AI / 1 / 2602.18494

On the Dynamics of Observation and Semantics

观察与语义的动态
Li, Xiu
Abstract
A dominant paradigm in visual intelligence treats semantics as a static property of latent representations, assuming that meaning can be discovered through geometric proximity in high-dimensional embedding spaces. In this work, we argue that this view is physically incomplete. We propose that intelligence is not a passive mirror of reality but a property of a physically realizable agent: a system bounded by finite memory, finite compute, and finite energy, interacting with a high-entropy environment. We formalize this interaction through the kinematic structure of an Observation Semantics Fiber Bundle, where raw sensory observation data (the fiber) is projected onto a low-entropy causal semantic manifold (the base). We prove that for any bounded agent, the thermodynamic cost of information processing (Landauer's Principle) imposes a strict limit on the complexity of internal state transitions. We term this limit the Semantic Constant B. From these physical constraints, we derive the necessity of symbolic structure. We show that to model a combinatorial world within the bound B, the semantic manifold must undergo a phase transition: it must crystallize into a discrete, compositional, and factorized form. Thus, language and logic are not cultural artifacts but ontological necessities: the solid state of information required to prevent thermal collapse. We conclude that understanding is not the recovery of a hidden latent variable, but the construction of a causal quotient that renders the world algorithmically compressible and causally predictable.
Chinese Translation
视觉智能中的主导范式将语义视为潜在表征的静态属性,假设意义可以通过高维嵌入空间中的几何接近性来发现。在本研究中,我们认为这一观点在物理上是不完整的。我们提出,智能并不是现实的被动镜像,而是一个物理可实现的智能体的属性,该系统受限于有限的记忆、有限的计算能力和有限的能量,并与高熵环境进行交互。我们通过观察语义纤维束的运动学结构形式化这种交互,其中原始感官观察数据(纤维)被投影到低熵因果语义流形(基底)上。我们证明,对于任何有限的智能体,信息处理的热力学成本(兰道尔原理)对内部状态转变的复杂性施加了严格的限制。我们将这一限制称为语义常数 B。基于这些物理约束,我们推导出符号结构的必要性。我们展示了在限制 B 内对组合世界进行建模时,语义流形必须经历相变:必须结晶为离散的、可组合的、因子化的形式。因此,语言和逻辑不是文化产物,而是本体论上的必然:即防止热崩溃所需的信息的固态形式。我们得出结论,理解并不是恢复一个隐藏的潜在变量,而是构建一个因果商,使世界在算法上可压缩且因果上可预测。
cs.AI / 2 / 2602.18582

Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications

基于语言的层次奖励设计:增强智能体行为与人类规范的一致性
Qian, Zhiqin, Diaz, Ryan, Seo, Sangwon, Unhelkar, Vaibhav
Abstract
When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed. As AI agents tackle increasingly complex tasks, aligning their behavior with human-provided specifications becomes critical for responsible AI deployment. Reward design provides a direct channel for such alignment by translating human expectations into reward functions that guide reinforcement learning (RL). However, existing methods are often too limited to capture nuanced human preferences that arise in long-horizon tasks. Hence, we introduce Hierarchical Reward Design from Language (HRDL): a problem formulation that extends classical reward design to encode richer behavioral specifications for hierarchical RL agents. We further propose Language to Hierarchical Rewards (L2HR) as a solution to HRDL. Experiments show that AI agents trained with rewards designed via L2HR not only complete tasks effectively but also better adhere to human specifications. Together, HRDL and L2HR advance the research on human-aligned AI agents.
Chinese Translation
在训练人工智能(AI)执行任务时,人类不仅关心任务是否完成,还关心任务的执行方式。随着AI智能体处理越来越复杂的任务,使其行为与人类提供的规范保持一致对于负责任的AI部署至关重要。奖励设计为这种一致性提供了直接的渠道,通过将人类期望转化为指导强化学习(RL)的奖励函数。然而,现有方法往往过于局限,无法捕捉在长时间跨度任务中出现的细微人类偏好。因此,我们提出了基于语言的层次奖励设计(HRDL):一种将经典奖励设计扩展到编码更丰富行为规范的层次RL智能体的问题表述。我们进一步提出了语言到层次奖励(L2HR)作为HRDL的解决方案。实验表明,通过L2HR设计的奖励训练的AI智能体不仅能有效完成任务,还能更好地遵循人类规范。HRDL和L2HR共同推动了人类对齐AI智能体的研究。
cs.AI / 3 / 2602.18607

Feedback-based Automated Verification in Vibe Coding of CAS Adaptation Built on Constraint Logic

基于反馈的自动验证在约束逻辑构建的CAS适应中的Vibe编码
Töpfer, Michal, Plášil, František, Bureš, Tomáš, Hnětynka, Petr
Abstract
In CAS adaptation, a challenge is to define the dynamic architecture of the system and changes in its behavior. Implementation-wise, this is projected into an adaptation mechanism, typically realized as an Adaptation Manager (AM). With the advances of generative LLMs, generating AM code based on system specification and desired AM behavior (partially in natural language) is a tempting opportunity. The recent introduction of vibe coding suggests a way to target the problem of the correctness of generated code by iterative testing and vibe coding feedback loops instead of direct code inspection. In this paper, we show that generating an AM via vibe coding feedback loops is a viable option when the verification of the generated AM is based on a very precise formulation of the functional requirements. We specify these as constraints in a novel temporal logic FCL that allows us to express the behavior of traces with much finer granularity than classical LTL enables. Furthermore, we show that by combining the adaptation and vibe coding feedback loops where the FCL constraints are evaluated for the current system state, we achieved good results in the experiments with generating AMs for two example systems from the CAS domain. Typically, just a few feedback loop iterations were necessary, each feeding the LLM with reports describing detailed violations of the constraints. This AM testing was combined with high run path coverage achieved by different initial settings.
Chinese Translation
在CAS适应中,一个挑战是定义系统的动态架构及其行为的变化。在实现层面,这被投射到一个适应机制中,通常实现为适应管理器(Adaptation Manager, AM)。随着生成式大语言模型(LLMs)的进步,基于系统规范和期望的AM行为(部分以自然语言形式)生成AM代码成为一个诱人的机会。最近引入的Vibe编码提供了一种通过迭代测试和Vibe编码反馈循环来解决生成代码正确性问题的方法,而不是直接进行代码检查。在本文中,我们展示了通过Vibe编码反馈循环生成AM是一种可行的选择,前提是对生成的AM的验证基于功能需求的非常精确的表述。我们将这些需求指定为约束,采用一种新颖的时序逻辑FCL,使我们能够以比经典LTL(Linear Temporal Logic)更细致的粒度表达轨迹的行为。此外,我们展示了通过结合适应和Vibe编码反馈循环,在当前系统状态下评估FCL约束,我们在CAS领域的两个示例系统中生成AM的实验中取得了良好的结果。通常,只需进行几次反馈循环迭代,每次都向LLM提供描述约束详细违规情况的报告。这种AM测试结合了通过不同初始设置实现的高运行路径覆盖率。
cs.AI / 4 / 2602.18640

Decoding ML Decision: An Agentic Reasoning Framework for Large-Scale Ranking System

解码机器学习决策:大规模排名系统的自主推理框架
Yun, Longfei, Wu, Yihan, Liu, Haoran, Liu, Xiaoxuan, Xu, Ziyun, Wang, Yi, Xia, Yang, Wang, Pengfei, Gao, Mingze, Wang, Yunxiang, Chen, Changfan, Pan, Junfeng
Abstract
Modern large-scale ranking systems operate within a sophisticated landscape of competing objectives, operational constraints, and evolving product requirements. Progress in this domain is increasingly bottlenecked by the engineering context constraint: the arduous process of translating ambiguous product intent into reasonable, executable, verifiable hypotheses, rather than by modeling techniques alone. We present GEARS (Generative Engine for Agentic Ranking Systems), a framework that reframes ranking optimization as an autonomous discovery process within a programmable experimentation environment. Rather than treating optimization as static model selection, GEARS leverages Specialized Agent Skills to encapsulate ranking expert knowledge into reusable reasoning capabilities, enabling operators to steer systems via high-level intent vibe personalization. Furthermore, to ensure production reliability, the framework incorporates validation hooks to enforce statistical robustness and filter out brittle policies that overfit short-term signals. Experimental validation across diverse product surfaces demonstrates that GEARS consistently identifies superior, near-Pareto-efficient policies by synergizing algorithmic signals with deep ranking context while maintaining rigorous deployment stability.
Chinese Translation
现代大规模排名系统在复杂的竞争目标、操作约束和不断变化的产品需求中运作。在这一领域的进展越来越受到工程背景约束的瓶颈影响:将模糊的产品意图转化为合理、可执行、可验证的假设的艰难过程,而不仅仅是建模技术的限制。我们提出了GEARS(Generative Engine for Agentic Ranking Systems),一个将排名优化重新框架为可编程实验环境中的自主发现过程的框架。GEARS不再将优化视为静态模型选择,而是利用专业代理技能将排名专家知识封装为可重用的推理能力,使操作人员能够通过高层次的意图个性化来引导系统。此外,为了确保生产的可靠性,该框架还结合了验证钩子,以强制执行统计稳健性并过滤掉过拟合短期信号的脆弱策略。在多样化的产品表面上的实验验证表明,GEARS通过将算法信号与深度排名上下文协同作用,始终能够识别出优越的、接近帕累托效率的策略,同时保持严格的部署稳定性。
cs.AI / 5 / 2602.18671

Spilled Energy in Large Language Models

大型语言模型中的能量溢出
Minut, Adrian Robert, Dewidar, Hazem, Masi, Iacopo
Abstract
We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead.
Chinese Translation
我们将最终的大型语言模型(LLM)softmax 分类器重新解释为一种基于能量的模型(EBM),在推理过程中将序列到序列的概率链分解为多个相互作用的 EBM。这种原则性的方法使我们能够在解码过程中追踪“能量溢出”,我们通过实证研究表明,这与事实错误、偏见和失败相关。与 Orgad 等(2025)类似,我们的方法定位确切的答案标记,并随后测试幻觉。然而,关键在于,我们在此过程中不需要训练的探测分类器或激活消融。相反,我们引入了两个完全无训练的指标,这些指标直接从输出对数几率中得出:能量溢出,它捕捉理论上应该匹配的连续生成步骤之间的能量值差异;以及边际能量,它可以在单一步骤中测量。在对九个基准进行评估时,我们的方法在最先进的 LLM(包括 LLaMA、Mistral 和 Gemma)以及合成代数运算(Qwen3)上展示了强大的、具有竞争力的幻觉检测和跨任务泛化能力。值得注意的是,这些结果适用于预训练和指令微调的变体,而不引入任何训练开销。
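An illustrative, training-free energy bookkeeping from logits alone, in the spirit of the abstract: reading the softmax head as an EBM, the free energy of a step is -logsumexp(logits); a single-step quantity plays the role of the marginalized energy, and comparing energies across consecutive steps plays the role of spilled energy. The exact definitions are the paper's; this sketch only conveys the flavor:

```python
# Training-free energy bookkeeping from output logits only. The softmax
# head is read as an EBM, so the free energy of a step is
# -logsumexp(logits). Random logits stand in for LLM outputs.
import torch

def free_energy(logits):                       # EBM view of a softmax head
    return -torch.logsumexp(logits, dim=-1)

vocab = 50_000
logits_t = torch.randn(vocab)                  # decoding step t
logits_t1 = torch.randn(vocab)                 # decoding step t+1

marginalized = free_energy(logits_t)           # single-step quantity
spilled = (free_energy(logits_t1) - free_energy(logits_t)).abs()  # cross-step discrepancy
print(float(marginalized), float(spilled))
```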
cs.AI / 6 / 2602.18710

Many AI Analysts, One Dataset: Navigating the Agentic Data Science Multiverse

众多人工智能分析师与单一数据集:导航代理数据科学的多元宇宙
Bertran, Martin, Fogliato, Riccardo, Wu, Zhiwei Steven
Abstract
The conclusions of empirical research depend not only on data but on a sequence of analytic decisions that published results seldom make explicit. Past "many-analyst" studies have demonstrated this: independent teams testing the same hypothesis on the same dataset regularly reach conflicting conclusions. But such studies require months of coordination among dozens of research groups and are therefore rarely conducted. In this work, we show that fully autonomous AI analysts built on large language models (LLMs) can reproduce a similar structured analytic diversity cheaply and at scale. We task these AI analysts with testing a pre-specified hypothesis on a fixed dataset, varying the underlying model and prompt framing across replicate runs. Each AI analyst independently constructs and executes a full analysis pipeline; an AI auditor then screens each run for methodological validity. Across three datasets spanning experimental and observational designs, AI analyst-produced analyses display wide dispersion in effect sizes, $p$-values, and binary decisions on supporting the hypothesis or not, frequently reversing whether a hypothesis is judged supported. This dispersion is structured: recognizable analytic choices in preprocessing, model specification, and inference differ systematically across LLM and persona conditions. Critically, the effects are steerable: reassigning the analyst persona or LLM shifts the distribution of outcomes even after excluding methodologically deficient runs.
Chinese Translation
实证研究的结论不仅依赖于数据,还依赖于一系列分析决策,而已发表的结果往往未能明确这些决策。以往的“多分析师”研究已证明这一点:独立团队在相同数据集上测试相同假设时,常常得出相互矛盾的结论。然而,这类研究需要数月的协调,涉及数十个研究小组,因此很少进行。在本研究中,我们展示了基于大型语言模型(LLMs)构建的完全自主的人工智能分析师能够以低成本和大规模重现类似的结构化分析多样性。我们要求这些人工智能分析师在固定数据集上测试预先指定的假设,并在重复运行中改变基础模型和提示框架。每个人工智能分析师独立构建并执行完整的分析流程;随后,人工智能审计员对每次运行进行方法有效性筛查。在涵盖实验和观察设计的三个数据集中,人工智能分析师生成的分析在效应大小、$p$-值和支持或不支持假设的二元决策上显示出广泛的分散性,常常颠覆对假设支持与否的判断。这种分散性是有结构的:在预处理、模型规范和推断中的可识别分析选择在不同的LLM和角色条件下系统性地不同。关键是,这些效应是可引导的:重新分配分析师角色或LLM会改变结果的分布,即使在排除方法学上不合格的运行后也是如此。
cs.AI / 7 / 2602.18724

Task-Aware Exploration via a Predictive Bisimulation Metric

基于预测双模拟度量的任务感知探索
Liang, Dayang, Liu, Ruihan, Wan, Lipeng, Liu, Yunlong, An, Bo
Abstract
Accelerating exploration in visual reinforcement learning under sparse rewards remains challenging due to the substantial task-irrelevant variations. Despite advances in intrinsic exploration, many methods either assume access to low-dimensional states or lack task-aware exploration strategies, thereby rendering them fragile in visual domains. To bridge this gap, we present TEB, a Task-aware Exploration approach that tightly couples task-relevant representations with exploration through a predictive Bisimulation metric. Specifically, TEB leverages the metric not only to learn behaviorally grounded task representations but also to measure behaviorally intrinsic novelty over the learned latent space. To realize this, we first theoretically mitigate the representation collapse of degenerate bisimulation metrics under sparse rewards by internally introducing a simple but effective predicted reward differential. Building on this robust metric, we design potential-based exploration bonuses, which measure the relative novelty of adjacent observations over the latent space. Extensive experiments on MetaWorld and Maze2D show that TEB achieves superior exploration ability and outperforms recent baselines.
Chinese Translation
在稀疏奖励下加速视觉强化学习中的探索仍然面临挑战,因为存在大量与任务无关的变异。尽管内在探索方面已有所进展,但许多方法要么假设可以访问低维状态,要么缺乏任务感知的探索策略,从而使其在视觉领域中显得脆弱。为了填补这一空白,我们提出了TEB,一种任务感知探索方法,它通过预测双模拟度量将任务相关的表征与探索紧密结合。具体而言,TEB不仅利用该度量学习行为上扎根的任务表征,还测量学习的潜在空间中的行为内在新颖性。为实现这一目标,我们首先在理论上通过内部引入一个简单但有效的预测奖励差异来减轻稀疏奖励下退化双模拟度量的表征崩溃。基于这一稳健的度量,我们设计了基于潜力的探索奖励,这些奖励测量潜在空间中相邻观测的相对新颖性。在MetaWorld和Maze2D上的大量实验表明,TEB实现了卓越的探索能力,并超越了最近的基线。
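For context, the standard DBC-style approximation of the on-policy bisimulation metric (Zhang et al., 2021) that this line of work builds on treats two states as close when their rewards and transition distributions are close; TEB replaces the raw reward term with a predicted reward differential. A sketch of the standard form on random latents:

```python
# DBC-style bisimulation target: |r_i - r_j| + gamma * W2(P_i, P_j),
# where W2 between equal-covariance Gaussians reduces to the distance
# between predicted next-latent means. Random tensors stand in for a
# learned encoder and dynamics model; TEB's predicted-reward variant
# and exploration bonus are not reproduced here.
import torch

gamma = 0.99
z_i, z_j = torch.randn(8, 32), torch.randn(8, 32)      # latent state pairs
r_i, r_j = torch.randn(8), torch.randn(8)              # (predicted) rewards
mu_i, mu_j = torch.randn(8, 32), torch.randn(8, 32)    # predicted next-latent means

bisim_target = (r_i - r_j).abs() + gamma * (mu_i - mu_j).norm(dim=-1)
bisim_pred = (z_i - z_j).norm(dim=-1)                  # metric realized in latent space
loss = ((bisim_pred - bisim_target) ** 2).mean()       # representation learning loss
print(float(loss))
```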
cs.AI / 8 / 2602.18731

Beyond Description: A Multimodal Agent Framework for Insightful Chart Summarization

超越描述:一个多模态智能体框架用于深刻的图表摘要
Bai, Yuhang, Ding, Yujuan, Lin, Shanru, Fan, Wenqi
Abstract
Chart summarization is crucial for enhancing data accessibility and the efficient consumption of information. However, existing methods, including those with Multimodal Large Language Models (MLLMs), primarily focus on low-level data descriptions and often fail to capture the deeper insights which are the fundamental purpose of data visualization. To address this challenge, we propose Chart Insight Agent Flow, a plan-and-execute multi-agent framework effectively leveraging the perceptual and reasoning capabilities of MLLMs to uncover profound insights directly from chart images. Furthermore, to overcome the lack of suitable benchmarks, we introduce ChartSummInsights, a new dataset featuring a diverse collection of real-world charts paired with high-quality, insightful summaries authored by human data analysis experts. Experimental results demonstrate that our method significantly improves the performance of MLLMs on the chart summarization task, producing summaries with deep and diverse insights.
Chinese Translation
图表摘要对于增强数据可访问性和信息的高效消费至关重要。然而,现有的方法,包括那些使用多模态大型语言模型(Multimodal Large Language Models, MLLMs)的方法,主要集中在低层次的数据描述上,往往无法捕捉到数据可视化的根本目的——更深层次的洞察。为了解决这一挑战,我们提出了图表洞察智能体流程(Chart Insight Agent Flow),这是一个计划与执行的多智能体框架,能够有效利用MLLM的感知和推理能力,从图表图像中直接揭示深刻的洞察。此外,为了克服缺乏合适基准的问题,我们引入了ChartSummInsights,这是一个新的数据集,包含多样化的真实世界图表,并配有由人类数据分析专家撰写的高质量、深刻的摘要。实验结果表明,我们的方法显著提高了MLLM在图表摘要任务上的表现,生成了具有深度和多样化洞察的摘要。
cs.AI / 9 / 2602.18749

Federated Reasoning Distillation Framework with Model Learnability-Aware Data Allocation

基于模型可学习性意识的数据分配的联邦推理蒸馏框架
Guo, Wei, Lu, Siyuan, Ran, Xiangdong, Tong, Yiqi, Ban, Yikun, Xu, Zelong, Fan, Jing, Huang, Zixuan, Zhang, Xiao, Hu, Zhaojun, Zhuang, Fuzhen
Abstract
Data allocation plays a critical role in federated large language model (LLM) and small language model (SLM) reasoning collaboration. Nevertheless, existing data allocation methods fail to address an under-explored challenge in collaboration: the bidirectional model learnability gap, where client-side SLMs cannot identify high-reward samples matching their learnability constraints for effective knowledge transfer from LLMs, while LLMs struggle to select samples contributing novel knowledge beyond their existing data. Furthermore, these collaboration frameworks face another key challenge: domain-agnostic reasoning transfer, where existing reasoning transfer methods fail to flexibly adapt to the local domain data, preventing SLMs from effectively acquiring step-by-step reasoning abilities from the general LLM. To address these challenges, we propose LaDa, a federated reasoning distillation framework with model learnability-aware data allocation. It introduces a model learnability-aware data filter that adaptively allocates high-reward samples based on the learnability gap between each SLM and LLM pair, effectively facilitating bidirectional knowledge transfer. We further design a domain-adaptive reasoning distillation method that aligns joint probabilities of reasoning paths on filtered high-reward samples through contrastive distillation learning between SLM and LLM, enabling the SLM to capture underlying reasoning patterns under the local data distribution. LaDa operates as a plug-in module for existing collaboration frameworks, adapting knowledge transfer based on model learnability gaps.
Chinese Translation
数据分配在联邦大型语言模型(LLM)与小型语言模型(SLM)之间的推理协作中起着关键作用。然而,现有的数据分配方法未能解决协作中一个未被充分探讨的挑战:双向模型可学习性差距,即客户端的SLM无法识别与其可学习性约束相匹配的高回报样本,以实现从LLM的有效知识转移,而LLM则难以选择能够提供超出其现有数据的新知识的样本。此外,这些协作框架还面临另一个关键挑战:领域无关的推理转移,即现有的推理转移方法未能灵活适应本地领域数据,阻碍了SLM从通用LLM有效获取逐步推理能力。为了解决这些挑战,我们提出了LaDa,一个基于模型可学习性意识的数据分配的联邦推理蒸馏框架。它引入了一种模型可学习性意识的数据过滤器,基于每对SLM和LLM之间的可学习性差距自适应地分配高回报样本,有效促进双向知识转移。我们进一步设计了一种领域自适应的推理蒸馏方法,通过SLM与LLM之间的对比蒸馏学习,对过滤后的高回报样本上的推理路径的联合概率进行对齐,使SLM能够在本地数据分布下捕捉潜在的推理模式。LaDa作为现有协作框架的插件模块运作,基于模型可学习性差距调整知识转移。
cs.AI / 10 / 2602.18764

The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol

模式引导对话系统与模型上下文协议的融合
Schlapbach, Andreas
Abstract
This paper establishes a fundamental convergence: Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP) represent two manifestations of a unified paradigm for deterministic, auditable LLM-agent interaction. SGD, designed for dialogue-based API discovery (2019), and MCP, now the de facto standard for LLM-tool integration, share the same core insight -- that schemas can encode not just tool signatures but operational constraints and reasoning guidance. By analyzing this convergence, we extract five foundational principles for schema design: (1) Semantic Completeness over Syntactic Precision, (2) Explicit Action Boundaries, (3) Failure Mode Documentation, (4) Progressive Disclosure Compatibility, and (5) Inter-Tool Relationship Declaration. These principles reveal three novel insights: first, SGD's original design was fundamentally sound and should be inherited by MCP; second, both frameworks leave failure modes and inter-tool relationships unexploited -- gaps we identify and resolve; third, progressive disclosure emerges as a critical production-scaling insight under real-world token constraints. We provide concrete design patterns for each principle. These principles position schema-driven governance as a scalable mechanism for AI system oversight without requiring proprietary system inspection -- central to Software 3.0.
Chinese Translation
本文确立了一种基本的融合:模式引导对话(Schema-Guided Dialogue, SGD)和模型上下文协议(Model Context Protocol, MCP)代表了确定性、可审计的LLM-代理交互统一范式的两种表现形式。SGD旨在进行基于对话的API发现(2019年),而MCP则成为LLM-工具集成的事实标准,两者共享同一核心见解——模式不仅可以编码工具签名,还可以编码操作约束和推理指导。通过分析这种融合,我们提取了五个模式设计的基础原则:(1)语义完整性优于语法精确性,(2)明确的行动边界,(3)故障模式文档,(4)渐进式披露兼容性,以及(5)工具间关系声明。这些原则揭示了三个新颖的见解:首先,SGD的原始设计在根本上是合理的,应被MCP继承;其次,两个框架都未充分利用故障模式和工具间关系——这是我们识别并解决的空白;第三,在现实世界的令牌约束下,渐进式披露成为关键的生产扩展见解。我们为每个原则提供了具体的设计模式。这些原则将模式驱动的治理定位为一种可扩展的AI系统监督机制,无需专有系统检查——这是软件3.0的核心。
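A sketch of what a schema following these principles might look like, expressed as an MCP-style tool definition in Python. The `failure_modes` and `related_tools` fields are illustrative extensions corresponding to principles (3) and (5); the standard MCP tool object itself carries only `name`, `description`, and `inputSchema`.

```python
# Illustrative MCP-style tool schema following the five principles above.
# failure_modes and related_tools are hypothetical extensions; tool name
# and field values are invented for illustration.
refund_tool = {
    "name": "issue_refund",
    "description": (
        "Issue a refund for a completed order. "           # (1) semantics
        "MUST NOT be called for orders still in transit."  # (2) boundaries
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "amount": {"type": "number", "minimum": 0},
        },
        "required": ["order_id", "amount"],
    },
    "failure_modes": [                                     # (3) failures
        {"error": "ORDER_NOT_FOUND", "recovery": "call lookup_order first"},
        {"error": "AMOUNT_EXCEEDS_TOTAL", "recovery": "clamp to order total"},
    ],
    "related_tools": ["lookup_order", "notify_customer"],  # (5) relations
}
```

Principle (4), progressive disclosure, would then govern when this full object, as opposed to a one-line summary of it, is loaded into the agent's context under real-world token constraints.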
cs.AI / 11 / 2602.18773

LAMMI-Pathology: A Tool-Centric Bottom-Up LVLM-Agent Framework for Molecularly Informed Medical Intelligence in Pathology

LAMMI-Pathology:一种以工具为中心的自下而上的LVLM-Agent框架,用于病理学中的分子信息医学智能
Su, Haoyang, Zhang, Shaoting, Wang, Xiaosong
Abstract
The emergence of tool-calling-based agent systems introduces a more evidence-driven paradigm for pathology image analysis in contrast to the coarse-grained text-image diagnostic approaches. With the recent large-scale experimental adoption of spatial transcriptomics technologies, molecularly validated pathological diagnosis is becoming increasingly open and accessible. In this work, we propose LAMMI-Pathology (LVLM-Agent System for Molecularly Informed Medical Intelligence in Pathology), a scalable agent framework for domain-specific agent tool-calling. LAMMI-Pathology adopts a tool-centric, bottom-up architecture in which customized domain-adaptive tools serve as the foundation. These tools are clustered by domain style to form component agents, which are then coordinated through a top-level planner hierarchically, avoiding excessively long context lengths that could induce task drift. Based on that, we introduce a novel trajectory construction mechanism based on Atomic Execution Nodes (AENs), which serve as reliable and composable units for building semi-simulated reasoning trajectories that capture credible agent-tool interactions. Building on this foundation, we develop a trajectory-aware fine-tuning strategy that aligns the planner's decision-making process with these multi-step reasoning trajectories, thereby enhancing inference robustness in pathology understanding and its adaptive use of the customized toolset.
Chinese Translation
基于工具调用的代理系统的出现,为病理图像分析引入了一种更具证据驱动的范式,与粗粒度的文本-图像诊断方法形成对比。随着空间转录组技术的近期大规模实验应用,分子验证的病理诊断变得越来越开放和可获取。在本研究中,我们提出了LAMMI-Pathology(用于病理学中分子信息医学智能的LVLM-Agent系统),这是一个可扩展的领域特定代理工具调用框架。LAMMI-Pathology采用以工具为中心的自下而上架构,其中定制的领域自适应工具作为基础。这些工具按领域风格聚类形成组件代理,然后通过顶层规划者进行分层协调,避免过长的上下文长度导致任务漂移。在此基础上,我们引入了一种基于原子执行节点(Atomic Execution Nodes, AENs)的新型轨迹构建机制,这些节点作为可靠且可组合的单元,用于构建捕捉可信代理-工具交互的半模拟推理轨迹。在此基础上,我们开发了一种轨迹感知的微调策略,使规划者的决策过程与这些多步骤推理轨迹对齐,从而增强病理理解中的推理鲁棒性及其对定制工具集的自适应使用。
cs.AI / 12 / 2602.18812

GenPlanner: From Noise to Plans -- Emergent Reasoning in Flow Matching and Diffusion Models

GenPlanner:从噪声到规划——流匹配与扩散模型中的涌现推理
Polowczyk, Agnieszka, Polowczyk, Alicja, Wieczorek, Michał
Abstract
Path planning in complex environments is one of the key problems of artificial intelligence because it requires simultaneous understanding of the geometry of space and the global structure of the problem. In this paper, we explore the potential of using generative models as planning and reasoning mechanisms. We propose GenPlanner, an approach based on diffusion models and flow matching, along with two variants: DiffPlanner and FlowPlanner. We demonstrate the application of generative models to find and generate correct paths in mazes. A multi-channel condition describing the structure of the environment, including an obstacle map and information about the starting and destination points, is used to condition trajectory generation. Unlike standard methods, our models generate trajectories iteratively, starting with random noise and gradually transforming it into a correct solution. Experiments show that the proposed approach significantly outperforms the baseline CNN model. In particular, FlowPlanner demonstrates high performance even with a limited number of generation steps.
Chinese Translation
在复杂环境中的路径规划是人工智能的关键问题之一,因为它需要同时理解空间的几何形状和问题的全局结构。本文探讨了将生成模型作为规划和推理机制的潜力。我们提出了GenPlanner,这是一种基于扩散模型和流匹配的方法,并提出了两个变体:DiffPlanner和FlowPlanner。我们展示了生成模型在迷宫中寻找和生成正确路径的应用。使用多通道条件描述环境的结构,包括障碍物地图以及起点和终点的信息,以此来条件化轨迹生成。与标准方法不同,我们的模型通过迭代生成轨迹,从随机噪声开始,逐渐将其转化为正确的解决方案。进行的实验表明,所提出的方法显著优于基线CNN模型。特别是,FlowPlanner即使在生成步骤有限的情况下也表现出高性能。
cs.AI / 13 / 2602.18843

ABD: Default Exception Abduction in Finite First Order Worlds

ABD:有限一阶世界中的默认例外溯因
Batzoglou, Serafim
Abstract
We introduce ABD, a benchmark for default-exception abduction over finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must output a first-order formula that defines exceptions, restoring satisfiability while keeping exceptions sparse. We formalize three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification. Evaluating ten frontier LLMs on 600 instances, the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.
Chinese Translation
我们引入了ABD,这是一个针对有限一阶世界中默认例外溯因的基准。给定一个包含异常谓词的背景理论和一组关系结构,模型必须输出一个一阶公式来定义例外,恢复可满足性,同时保持例外的稀疏性。我们形式化了三种观察机制(封闭世界、存在性补全、普遍补全),并进行了精确的SMT验证。在600个实例上评估了十个前沿大型语言模型,最佳模型达到了较高的有效性,但仍存在简约性差距,保留评估揭示了不同机制下的特定泛化失败模式。
cs.AI / 14 / 2602.18884

TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models

TPRU:推动大型多模态模型中的时间和过程理解
Gao, Zhenkun, Wang, Xuhong, Tan, Xin, Xie, Yuan
Abstract
Multimodal Large Language Models (MLLMs), particularly smaller, deployable variants, exhibit a critical deficiency in understanding temporal and procedural visual data, a bottleneck hindering their application in real-world embodied AI. This gap is largely caused by a systemic failure in training paradigms, which lack large-scale, procedurally coherent data. To address this problem, we introduce TPRU, a large-scale dataset sourced from diverse embodied scenarios such as robotic manipulation and GUI navigation. TPRU is systematically designed to cultivate temporal reasoning through three complementary tasks: Temporal Reordering, Next-Frame Prediction, and Previous-Frame Review. A key feature is the inclusion of challenging negative samples, compelling models to transition from passive observation to active, cross-modal validation. We leverage TPRU with a reinforcement learning (RL) fine-tuning methodology, specifically targeting the enhancement of resource-efficient models. Experiments show our approach yields dramatic gains: on our manually curated TPRU-Test, the accuracy of TPRU-7B soars from 50.33\% to 75.70\%, a state-of-the-art result that significantly outperforms vastly larger baselines, including GPT-4o. Crucially, these capabilities generalize effectively, demonstrating substantial improvements on established benchmarks. The codebase is available at https://github.com/Stephen-gzk/TPRU/ .
Chinese Translation
多模态大型语言模型(MLLMs),特别是较小的可部署变体,在理解时间和过程视觉数据方面存在严重缺陷,这一瓶颈阻碍了它们在现实世界具身人工智能中的应用。这一差距主要源于训练范式的系统性失败,缺乏大规模、过程一致的数据。为了解决这一问题,我们引入了TPRU,一个来自多样化具身场景(如机器人操作和图形用户界面导航)的大规模数据集。TPRU系统地设计了三个互补任务,以培养时间推理能力:时间重排序、下一帧预测和上一帧回顾。一个关键特征是包含具有挑战性的负样本,迫使模型从被动观察转向主动的跨模态验证。我们利用TPRU结合强化学习(RL)微调方法,特别针对资源高效模型的增强。实验表明,我们的方法带来了显著的提升:在我们手动整理的TPRU-Test上,TPRU-7B的准确率从50.33%飙升至75.70%,这一结果处于最先进水平,显著超越了包括GPT-4o在内的更大基线模型。至关重要的是,这些能力有效地推广,显示出在既定基准上的显著改进。代码库可在 https://github.com/Stephen-gzk/TPRU/ 获取。
cs.AI / 15 / 2602.18918

Early Evidence of Vibe-Proving with Consumer LLMs: A Case Study on Spectral Region Characterization with ChatGPT-5.2 (Thinking)

消费者大语言模型“氛围证明”的早期证据:基于ChatGPT-5.2(思维)的谱区域特征化案例研究
Verbeken, Brecht, Vagenende, Brando, Guerry, Marie-Anne, Algaba, Andres, Ginis, Vincent
Abstract
Large Language Models (LLMs) are increasingly used as scientific copilots, but evidence on their role in research-level mathematics remains limited, especially for workflows accessible to individual researchers. We present early evidence for vibe-proving with a consumer subscription LLM through an auditable case study that resolves Conjecture 20 of Ran and Teng (2024) on the exact nonreal spectral region of a 4-cycle row-stochastic nonnegative matrix family. We analyze seven shareable ChatGPT-5.2 (Thinking) threads and four versioned proof drafts, documenting an iterative pipeline of generate, referee, and repair. The model is most useful for high-level proof search, while human experts remain essential for correctness-critical closure. The final theorem provides necessary and sufficient region conditions and explicit boundary attainment constructions. Beyond the mathematical result, we contribute a process-level characterization of where LLM assistance materially helps and where verification bottlenecks persist, with implications for evaluation of AI-assisted research workflows and for designing human-in-the-loop theorem proving systems.
Chinese Translation
大型语言模型(LLMs)越来越多地被用作科学副驾驶,但关于它们在研究级数学中的作用的证据仍然有限,尤其是对于个体研究者可访问的工作流程。我们通过一个可审计的案例研究,提供了消费者订阅LLM进行“氛围证明”的早期证据,该研究解决了Ran和Teng(2024)关于4周期行随机非负矩阵族的确切非实谱区域的猜想20。我们分析了七个可共享的ChatGPT-5.2(思维)线程和四个版本的证明草稿,记录了生成、审查和修复的迭代流程。该模型在高层次证明搜索中最为有用,而人类专家在确保正确性方面仍然至关重要。最终定理提供了必要和充分的区域条件以及明确的边界达成构造。除了数学结果外,我们还贡献了一个过程级特征化,说明了LLM辅助在何处实质性地帮助以及验证瓶颈依然存在的地方,这对评估AI辅助研究工作流程和设计人机协作的定理证明系统具有重要意义。
cs.AI / 16 / 2602.18940

DREAM: Deep Research Evaluation with Agentic Metrics

DREAM:基于代理指标的深度研究评估
Avraham, Elad Ben, Li, Changhao, Dorfman, Ron, Ganz, Roy, Nuriel, Oren, Dudai, Amir, Aberdam, Aviad, Flynn, Noah, Mansimov, Elman, Kalyanpur, Adi, Litman, Ron
Abstract
Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.
Chinese Translation
深度研究代理生成分析师级报告,但由于缺乏单一的真实标准以及研究质量的多维特性,评估这些报告仍然具有挑战性。近期的基准测试提出了不同的方法论,但它们受到合成幻影的困扰,即强大的表面流畅性和引用一致性可能掩盖潜在的事实和推理缺陷。我们通过引入一个跨四个垂直领域的分类法来表征这一差距,揭示了一个关键的能力不匹配:静态评估者本质上缺乏评估时间有效性和事实正确性所需的工具使用能力。为了解决这个问题,我们提出了DREAM(基于代理指标的深度研究评估),这是一个通过使评估本身具备代理特性来实现能力平等原则的框架。DREAM通过结合查询无关指标和由工具调用代理生成的自适应指标的评估协议来构建评估,能够实现时间感知的覆盖、基于事实的验证和系统性的推理探测。受控评估表明,DREAM对事实和时间衰减的敏感性显著高于现有基准,提供了一种可扩展的、无参考的评估范式。
cs.AI / 17 / 2602.18943

High Dimensional Procedural Content Generation

高维程序内容生成
Xu, Kaijie, Verbrugge, Clark
Abstract
Procedural content generation (PCG) has made substantial progress in shaping static 2D/3D geometry, while most methods treat gameplay mechanics as auxiliary and optimize only over space. We argue that this limits controllability and expressivity, and formally introduce High-Dimensional PCG (HDPCG): a framework that elevates non-geometric gameplay dimensions to first-class coordinates of a joint state space. We instantiate HDPCG along two concrete directions. Direction-Space augments geometry with a discrete layer dimension and validates reachability in 4D (x,y,z,l), enabling unified treatment of 2.5D/3.5D mechanics such as gravity inversion and parallel-world switching. Direction-Time augments geometry with temporal dynamics via time-expanded graphs, capturing action semantics and conflict rules. For each direction, we present three general, practicable algorithms with a shared pipeline of abstract skeleton generation, controlled grounding, high-dimensional validation, and multi-metric evaluation. Large-scale experiments across diverse settings validate the integrity of our problem formulation and the effectiveness of our methods on playability, structure, style, robustness, and efficiency. Beyond quantitative results, Unity-based case studies recreate playable scenarios that accord with our metrics. We hope HDPCG encourages a shift in PCG toward general representations and the generation of gameplay-relevant dimensions beyond geometry, paving the way for controllable, verifiable, and extensible level generation.
Chinese Translation
程序内容生成(PCG)在塑造静态2D/3D几何形状方面取得了显著进展,而大多数方法将游戏机制视为辅助,并仅在空间上进行优化。我们认为这限制了可控性和表现力,并正式引入高维程序内容生成(HDPCG):一个将非几何游戏维度提升为联合状态空间的第一类坐标的框架。我们沿着两个具体方向实例化HDPCG。方向-空间通过离散层维度增强几何形状,并在4D(x,y,z,l)中验证可达性,从而实现对2.5D/3.5D机制(如重力反转和平行世界切换)的统一处理。方向-时间通过时间扩展图增强几何形状,捕捉动作语义和冲突规则。对于每个方向,我们提出了三种通用的、可行的算法,具有共享的抽象骨架生成、受控基础、高维验证和多指标评估的管道。大规模实验在多种环境中验证了我们问题表述的完整性以及我们方法在可玩性、结构、风格、鲁棒性和效率方面的有效性。除了定量结果外,基于Unity的案例研究重现了符合我们指标的可玩场景。我们希望HDPCG能够推动PCG向通用表示和生成超越几何的游戏相关维度转变,为可控的、可验证的和可扩展的关卡生成铺平道路。
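To make the 4D reachability check concrete: a minimal breadth-first search over an (x, y, z, l) grid might look like the sketch below, with a layer-switch move available at designated cells alongside the spatial moves. This is an assumption-laden toy, not the paper's validator, and it assumes exactly two layers.

```python
# Toy 4D reachability check over (x, y, z, l): spatial moves within a
# layer plus a layer-switch move at designated portal cells (e.g. a
# parallel-world switch). A sketch only; the paper's validation and
# mechanics are richer than this.
from collections import deque

def reachable(free, portals, start, goal):
    """free: set of passable (x, y, z, l) cells; portals: cells where
    the layer index l may toggle; start/goal: (x, y, z, l) tuples."""
    moves = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
             (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, queue = {start}, deque([start])
    while queue:
        x, y, z, l = queue.popleft()
        if (x, y, z, l) == goal:
            return True
        nexts = [(x + dx, y + dy, z + dz, l) for dx, dy, dz in moves]
        if (x, y, z, l) in portals:
            nexts.append((x, y, z, 1 - l))   # assumes two layers
        for cell in nexts:
            if cell in free and cell not in seen:
                seen.add(cell)
                queue.append(cell)
    return False
```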
cs.AI / 18 / 2602.18947

(Perlin) Noise as AI coordinator

(Perlin)噪声作为人工智能协调者
Xu, Kaijie, Verbrugge, Clark
Abstract
Large scale control of nonplayer agents is central to modern games, while production systems still struggle to balance several competing goals: locally smooth, natural behavior, and globally coordinated variety across space and time. Prior approaches rely on handcrafted rules or purely stochastic triggers, which either converge to mechanical synchrony or devolve into uncorrelated noise that is hard to tune. Continuous noise signals such as Perlin noise are well suited to this gap because they provide spatially and temporally coherent randomness, and they are already widely used for terrain, biomes, and other procedural assets. We adapt these signals for the first time to large scale AI control and present a general framework that treats continuous noise fields as an AI coordinator. The framework combines three layers of control: behavior parameterization for movement at the agent level, action time scheduling for when behaviors start and stop, and spawn or event type and feature generation for what appears and where. We instantiate the framework reproducibly and evaluate Perlin noise as a representative coordinator across multiple maps, scales, and seeds against random, filtered, deterministic, neighborhood constrained, and physics inspired baselines. Experiments show that coordinated noise fields provide stable activation statistics without lockstep, strong spatial coverage and regional balance, better diversity with controllable polarization, and competitive runtime. We hope this work motivates a broader exploration of coordinated noise in game AI as a practical path to combine efficiency, controllability, and quality.
Chinese Translation
大规模控制非玩家代理是现代游戏的核心,而生产系统仍然在平衡多个竞争目标方面面临挑战:局部平滑、自然行为,以及跨空间和时间的全局协调多样性。以往的方法依赖于手工制作的规则或纯随机触发,这两者要么趋向于机械同步,要么退化为难以调节的无关噪声。连续噪声信号,如Perlin噪声,非常适合填补这一空白,因为它们提供了空间和时间上连贯的随机性,并且已经广泛应用于地形、生物群落和其他程序生成资产。我们首次将这些信号适配于大规模人工智能控制,并提出一个将连续噪声场视为人工智能协调者的通用框架。该框架结合了三层控制:代理级别的运动行为参数化、行为开始和停止的时间调度,以及生成出现内容的类型和特征。我们可重复地实例化该框架,并在多个地图、尺度和种子下评估Perlin噪声作为代表性协调者,与随机、过滤、确定性、邻域约束和物理启发的基线进行对比。实验表明,协调噪声场提供了稳定的激活统计数据,无需同步,具有强大的空间覆盖和区域平衡,能够实现更好的多样性和可控的极化,同时运行时间具有竞争力。我们希望这项工作能够激励对游戏人工智能中协调噪声的更广泛探索,作为结合效率、可控性和质量的实际途径。
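The core idea, sampling a spatiotemporally coherent noise field to modulate per-agent behavior, fits in a few lines. The sketch below uses simple bilinear value noise as a stand-in for Perlin noise (gradient noise proper needs a bit more machinery), and the threshold-based activation mapping is invented for illustration.

```python
# Coherent-noise coordinator sketch: each agent samples a smooth field at
# (its position, current time) to decide when to act, giving locally
# smooth but globally varied activity. Bilinear value noise stands in
# for Perlin noise; the activation mapping is illustrative.
import math
import random

def make_value_noise(seed=0):
    def val(ix, iy):
        # Deterministic pseudo-random value per integer lattice point.
        return random.Random(hash((ix, iy, seed))).random()
    def smooth(t):
        return t * t * (3 - 2 * t)      # smoothstep easing
    def noise(x, y):
        ix, iy = math.floor(x), math.floor(y)
        fx, fy = smooth(x - ix), smooth(y - iy)
        top = val(ix, iy) * (1 - fx) + val(ix + 1, iy) * fx
        bot = val(ix, iy + 1) * (1 - fx) + val(ix + 1, iy + 1) * fx
        return top * (1 - fy) + bot * fy  # in [0, 1], spatially coherent
    return noise

# One shared field drives action timing for all agents: an agent acts
# when the field at (its position, time) crosses a threshold, so nearby
# agents behave similarly without marching in lockstep.
field = make_value_noise(seed=42)
for t in range(10):
    for agent_x in (0.0, 3.7, 9.2):
        active = field(agent_x * 0.3, t * 0.1) > 0.6
```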
cs.AI / 19 / 2602.18956

INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic

INDUCTION:一阶逻辑中的有限结构概念合成
Batzoglou, Serafim
Abstract
We introduce INDUCTION, a benchmark for finite-structure concept synthesis in first-order logic. Given small finite relational worlds with extensionally labeled target predicates, models must output a single first-order logical formula that explains the target uniformly across worlds, with correctness verified via exact model checking. The benchmark includes three regimes, FullObs, CI (contrastive), and EC (existential completion), and penalizes formula bloat. We find sharp difficulty gradients and persistent hard structural families, and observe that low-bloat formulas generalize far better on held-out worlds. Elite recent models show qualitatively different behaviors across tasks and performance metrics, hinting at their different strategies of concept generalization.
Chinese Translation
我们介绍了INDUCTION,这是一个用于一阶逻辑中有限结构概念合成的基准测试。在给定的小型有限关系世界中,具有外延标记的目标谓词,模型必须输出一个单一的一阶逻辑公式,该公式能够在各个世界中统一地解释目标,且通过精确模型检查验证其正确性。该基准测试包括三个模式:完全观察(FullObs)、对比(CI)和存在性补全(EC),并对公式膨胀进行惩罚。我们发现存在明显的难度梯度和持续的困难结构家族,并观察到低膨胀公式在保留的世界中具有更好的泛化能力。近期的优秀模型在不同任务和性能指标上表现出质的不同,暗示它们在概念泛化策略上的差异。
cs.AI / 20 / 2602.18960

Modularity is the Bedrock of Natural and Artificial Intelligence

模块化是自然与人工智能的基石
Salatiello, Alessandro
Abstract
The remarkable performance of modern AI systems has been driven by unprecedented scales of data, computation, and energy -- far exceeding the resources required by human intelligence. This disparity highlights the need for new guiding principles and motivates drawing inspiration from the fundamental organizational principles of brain computation. Among these principles, modularity has been shown to be critical for supporting the efficient learning and strong generalization abilities consistently exhibited by humans. Furthermore, modularity aligns well with the No Free Lunch Theorem, which highlights the need for problem-specific inductive biases and motivates architectures composed of specialized components that solve subproblems. However, despite its fundamental role in natural intelligence and its demonstrated benefits across a range of seemingly disparate AI subfields, modularity remains relatively underappreciated in mainstream AI research. In this work, we review several research threads in artificial intelligence and neuroscience through a conceptual framework that highlights the central role of modularity in supporting both artificial and natural intelligence. In particular, we examine what computational advantages modularity provides, how it has emerged as a solution across several AI research areas, which modularity principles the brain exploits, and how modularity can help bridge the gap between natural and artificial intelligence.
Chinese Translation
现代人工智能系统的卓越表现得益于前所未有的数据、计算和能源规模,远远超出了人类智能所需的资源。这种差异突显了新指导原则的必要性,并激励我们从大脑计算的基本组织原则中汲取灵感。在这些原则中,模块化被证明对支持人类高效学习和强泛化能力至关重要。此外,模块化与无免费午餐定理(No Free Lunch Theorem)高度契合,该定理强调了问题特定归纳偏见的必要性,并激励由专门组件组成的架构来解决子问题。然而,尽管模块化在自然智能中的基本作用及其在一系列看似不同的人工智能子领域中的显著益处,模块化在主流人工智能研究中仍然相对被低估。在本研究中,我们通过一个概念框架回顾了人工智能和神经科学中的几个研究方向,强调模块化在支持人工和自然智能中的核心作用。特别地,我们考察了模块化提供的计算优势,它如何在多个人工智能研究领域中作为解决方案出现,大脑利用了哪些模块化原则,以及模块化如何有助于弥合自然与人工智能之间的差距。
cs.AI / 21 / 2602.18968

Robust and Efficient Tool Orchestration via Layered Execution Structures with Reflective Correction

通过反射校正的分层执行结构实现稳健高效的工具编排
Zhe, Tao, Wang, Haoyu, Luo, Bo, Wu, Min, Fan, Wei, Luo, Xiao, Yao, Zijun, Chen, Haifeng, Wang, Dongjie
Abstract
Tool invocation is a core capability of agentic systems, yet failures often arise not from individual tool calls but from how multiple tools are organized and executed together. Existing approaches tightly couple tool execution with stepwise language reasoning or explicit planning, leading to brittle behavior and high execution overhead. To overcome these limitations, we revisit tool invocation from the perspective of tool orchestration. Our key insight is that effective orchestration does not require precise dependency graphs or fine-grained planning. Instead, a coarse-grained layer structure suffices to provide global guidance, while execution-time errors can be corrected locally. Specifically, we model tool orchestration as learning a layered execution structure that captures high-level tool dependencies, inducing layer-wise execution through context constraints. To handle execution-time failures, we introduce a schema-aware reflective correction mechanism that detects and repairs errors locally. This design confines errors to individual tool calls and avoids re-planning entire execution trajectories. This structured execution paradigm enables a lightweight and reusable orchestration component for agentic systems. Experimental results show that our approach achieves robust tool execution while reducing execution complexity and overhead. Code will be made publicly available.
Chinese Translation
工具调用是智能系统的核心能力,但故障往往不是来自单个工具调用,而是来自多个工具的组织和共同执行。现有方法将工具执行与逐步语言推理或显式规划紧密耦合,导致脆弱的行为和高执行开销。为了克服这些局限性,我们从工具编排的角度重新审视工具调用。我们的关键见解是,有效的编排并不需要精确的依赖图或细粒度的规划。相反,粗粒度的层次结构足以提供全局指导,而执行时的错误可以在局部进行修正。具体而言,我们将工具编排建模为学习一种分层执行结构,该结构捕捉高层次的工具依赖关系,通过上下文约束诱导分层执行。为了处理执行时的故障,我们引入了一种模式感知的反射校正机制,能够局部检测和修复错误。该设计将错误限制在单个工具调用中,避免了重新规划整个执行轨迹。这种结构化的执行范式为智能系统提供了轻量且可重用的编排组件。实验结果表明,我们的方法在降低执行复杂性和开销的同时,实现了稳健的工具执行。代码将公开发布。
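A minimal rendering of the layered-execution idea: group tools by dependency depth (a coarse layer structure rather than a precise DAG schedule), run each layer with its predecessors' outputs in context, and repair failures locally instead of re-planning the trajectory. The `run_tool` and `repair_args` hooks below are hypothetical stand-ins for tool invocation and the schema-aware correction step.

```python
# Sketch of layer-wise tool execution with local correction. Assumes an
# acyclic dependency map; run_tool and repair_args are stand-ins for
# tool invocation and schema-aware repair.
def layers_by_depth(deps):
    """deps: {tool: set(prerequisite tools)} -> list of layers."""
    depth = {}
    def d(t):
        if t not in depth:
            depth[t] = 1 + max((d(p) for p in deps[t]), default=-1)
        return depth[t]
    for t in deps:
        d(t)
    out = [[] for _ in range(max(depth.values()) + 1)]
    for t, k in depth.items():
        out[k].append(t)
    return out

def execute(deps, run_tool, repair_args, max_retries=2):
    context = {}
    for layer in layers_by_depth(deps):
        for tool in layer:                    # layer members are independent
            args = {p: context[p] for p in deps[tool]}
            for _ in range(max_retries + 1):
                try:
                    context[tool] = run_tool(tool, args)
                    break
                except Exception as err:      # local repair, no re-planning
                    args = repair_args(tool, args, err)
            else:
                raise RuntimeError(f"{tool} failed after local repairs")
    return context
```

The point of the structure is visible in the error path: a failure is confined to one tool call and one layer, so the rest of the execution trajectory is never rebuilt.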
cs.AI / 22 / 2602.18971

When Do LLM Preferences Predict Downstream Behavior?

大型语言模型的偏好何时预测下游行为?
Slama, Katarina, Souly, Alexandra, Bansal, Dishank, Davidson, Henry, Summerfield, Christopher, Luettgau, Lennart
Abstract
Preference-driven behavior in LLMs may be a necessary precondition for AI misalignment such as sandbagging: models cannot strategically pursue misaligned goals unless their behavior is influenced by their preferences. Yet prior work has typically prompted models explicitly to act in specific ways, leaving unclear whether observed behaviors reflect instruction-following capabilities vs underlying model preferences. Here we test whether this precondition for misalignment is present. Using entity preferences as a behavioral probe, we measure whether stated preferences predict downstream behavior in five frontier LLMs across three domains: donation advice, refusal behavior, and task performance. Conceptually replicating prior work, we first confirm that all five models show highly consistent preferences across two independent measurement methods. We then test behavioral consequences in a simulated user environment. We find that all five models give preference-aligned donation advice. All five models also show preference-correlated refusal patterns when asked to recommend donations, refusing more often for less-preferred entities. All preference-related behaviors that we observe here emerge without instructions to act on preferences. Results for task performance are mixed: on a question-answering benchmark (BoolQ), two models show small but significant accuracy differences favoring preferred entities; one model shows the opposite pattern; and two models show no significant relationship. On complex agentic tasks, we find no evidence of preference-driven performance differences. While LLMs have consistent preferences that reliably predict advice-giving behavior, these preferences do not consistently translate into downstream task performance.
Chinese Translation
大型语言模型(LLMs)中的偏好驱动行为可能是人工智能失调(如沙袋效应)的必要前提:模型在其行为受到偏好的影响之前,无法战略性地追求失调目标。然而,以往的研究通常明确要求模型以特定方式行动,这使得观察到的行为是否反映了遵循指令的能力或潜在的模型偏好变得不清晰。在此,我们测试这一失调前提是否存在。通过使用实体偏好作为行为探针,我们测量了在三个领域(捐赠建议、拒绝行为和任务表现)中,五个前沿大型语言模型的陈述偏好是否预测下游行为。我们首先概念上复制了以往的研究,确认所有五个模型在两种独立测量方法中显示出高度一致的偏好。随后,我们在模拟用户环境中测试行为后果。我们发现所有五个模型都提供了与偏好一致的捐赠建议。当被要求推荐捐赠时,所有五个模型也表现出与偏好相关的拒绝模式,对于不太偏好的实体拒绝的频率更高。我们在此观察到的所有与偏好相关的行为均在没有指示其依据偏好行动的情况下出现。任务表现的结果则较为复杂:在一个问答基准(BoolQ)上,两个模型显示出对偏好实体的小但显著的准确性差异;一个模型显示出相反的模式;而两个模型则没有显著关系。在复杂的代理任务中,我们未发现偏好驱动的表现差异的证据。尽管大型语言模型具有一致的偏好,且这些偏好可靠地预测建议行为,但这些偏好并未始终转化为下游任务表现。
cs.AI / 23 / 2602.18981

How Far Can We Go with Pixels Alone? A Pilot Study on Screen-Only Navigation in Commercial 3D ARPGs

仅凭像素我们能走多远?一项关于商业3D动作角色扮演游戏中仅用屏幕导航的初步研究
Xu, Kaijie, Bugti, Mustafa, Verbrugge, Clark
Abstract
Modern 3D game levels rely heavily on visual guidance, yet the navigability of level layouts remains difficult to quantify. Prior work either simulates play in simplified environments or analyzes static screenshots for visual affordances, but neither setting faithfully captures how players explore complex, real-world game levels. In this paper, we build on an existing open-source visual affordance detector and instantiate a screen-only exploration and navigation agent that operates purely from visual affordances. Our agent consumes live game frames, identifies salient interest points, and drives a simple finite-state controller over a minimal action space to explore Dark Souls-style linear levels and attempt to reach expected goal regions. Pilot experiments show that the agent can traverse most required segments and exhibits meaningful visual navigation behavior, but also highlight that limitations of the underlying visual model prevent truly comprehensive and reliable auto-navigation. We argue that this system provides a concrete, shared baseline and evaluation protocol for visual navigation in complex games, and we call for more attention to this necessary task. Our results suggest that purely vision-based sense-making models, with discrete single-modality inputs and without explicit reasoning, can effectively support navigation and environment understanding in idealized settings, but are unlikely to be a general solution on their own.
Chinese Translation
现代3D游戏关卡在很大程度上依赖于视觉引导,但关卡布局的可导航性仍然难以量化。以往的研究要么在简化环境中模拟游戏,要么分析静态截图以获取视觉可用性,但这两种设置都未能真实捕捉玩家如何探索复杂的现实游戏关卡。本文基于现有的开源视觉可用性检测器,构建了一个仅依赖视觉可用性的屏幕探索和导航代理。我们的代理处理实时游戏画面,识别显著的兴趣点,并通过一个简单的有限状态控制器在最小的动作空间内探索类似《黑暗之魂》的线性关卡,并尝试到达预期的目标区域。初步实验表明,该代理能够穿越大多数所需的段落,并表现出有意义的视觉导航行为,但也突显了基础视觉模型的局限性,阻碍了真正全面和可靠的自动导航。我们认为该系统为复杂游戏中的视觉导航提供了一个具体的、共享的基准和评估协议,并呼吁对此必要任务给予更多关注。我们的结果表明,纯粹基于视觉的感知模型,具有离散的单一模态输入且没有明确推理,能够有效支持理想化环境中的导航和环境理解,但单靠它们不太可能成为通用解决方案。
cs.AI / 24 / 2602.18985

InfEngine: A Self-Verifying and Self-Optimizing Intelligent Engine for Infrared Radiation Computing

InfEngine:一种自验证和自优化的红外辐射计算智能引擎
Ding, Kun, Xu, Jian, Wang, Ying, Yang, Peipei, Xiang, Shiming
Abstract
Infrared radiation computing underpins advances in climate science, remote sensing and spectroscopy but remains constrained by manual workflows. We introduce InfEngine, an autonomous intelligent computational engine designed to drive a paradigm shift from human-led orchestration to collaborative automation. It integrates four specialized agents through two core innovations: self-verification, enabled by joint solver-evaluator debugging, improves functional correctness and scientific plausibility; self-optimization, realized via evolutionary algorithms with self-discovered fitness functions, facilitates autonomous performance optimization. Evaluated on InfBench with 200 infrared-specific tasks and powered by InfTools with 270 curated tools, InfEngine achieves a 92.7% pass rate and delivers workflows 21x faster than manual expert effort. More fundamentally, it illustrates how researchers can transition from manual coding to collaborating with self-verifying, self-optimizing computational partners. By generating reusable, verified and optimized code, InfEngine transforms computational workflows into persistent scientific assets, accelerating the cycle of scientific discovery. Code: https://github.com/kding1225/infengine
Chinese Translation
红外辐射计算是气候科学、遥感和光谱学进展的基础,但仍受到手动工作流程的限制。我们介绍了InfEngine,这是一种自主智能计算引擎,旨在推动从人工主导的协调向协作自动化的范式转变。它通过两个核心创新集成了四个专业代理:自验证,通过联合求解器-评估器调试实现,改善了功能正确性和科学合理性;自优化,通过具有自发现适应度函数的进化算法实现,促进了自主性能优化。在InfBench上评估了200个红外特定任务,并由InfTools提供270个精心策划的工具,InfEngine实现了92.7%的通过率,并且工作流程的速度比人工专家的努力快21倍。更根本的是,它展示了研究人员如何从手动编码转变为与自验证、自优化的计算伙伴协作。通过生成可重用、经过验证和优化的代码,InfEngine将计算工作流程转变为持久的科学资产,加速科学发现的周期。代码链接:https://github.com/kding1225/infengine
cs.AI / 25 / 2602.18986

Quantifying Automation Risk in High-Automation AI Systems: A Bayesian Framework for Failure Propagation and Optimal Oversight

量化高自动化人工智能系统中的自动化风险:一种用于故障传播和最佳监督的贝叶斯框架
Srivastava, Vishal, Sah, Tanmay
Abstract
Organizations across finance, healthcare, transportation, content moderation, and critical infrastructure are rapidly deploying highly automated AI systems, yet they lack principled methods to quantify how increasing automation amplifies harm when failures occur. We propose a parsimonious Bayesian risk decomposition expressing expected loss as the product of three terms: the probability of system failure, the conditional probability that a failure propagates into harm given the automation level, and the expected severity of harm. This framework isolates a critical quantity -- the conditional probability that failures propagate into harm -- which captures execution and oversight risk rather than model accuracy alone. We develop complete theoretical foundations: formal proofs of the decomposition, a harm propagation equivalence theorem linking the harm propagation probability to observable execution controls, risk elasticity measures, efficient frontier analysis for automation policy, and optimal resource allocation principles with second-order conditions. We motivate the framework with an illustrative case study of the 2012 Knight Capital incident ($440M loss) as one instantiation of a broadly applicable failure pattern, and characterize the research design required to empirically validate the framework at scale across deployment domains. This work provides the theoretical foundations for a new class of deployment-focused risk governance tools for agentic and automated AI systems.
Chinese Translation
金融、医疗、交通、内容审核和关键基础设施等领域的组织正在迅速部署高度自动化的人工智能系统,但缺乏系统的方法来量化在故障发生时,增加的自动化如何加剧损害。我们提出了一种简约的贝叶斯风险分解方法,将预期损失表示为三个项的乘积:系统故障的概率、在给定自动化水平下故障传播为损害的条件概率,以及损害的预期严重性。该框架孤立出一个关键量——故障传播为损害的条件概率——它捕捉了执行和监督风险,而不仅仅是模型准确性。我们发展了完整的理论基础:分解的形式证明、将损害传播概率与可观察的执行控制相联系的损害传播等价定理、风险弹性度量、自动化政策的有效边界分析,以及具有二阶条件的最佳资源配置原则。我们通过2012年奈特资本事件(损失4.4亿美元)的案例研究来激励该框架,作为一种广泛适用的故障模式的实例,并描述了在各个部署领域大规模实证验证该框架所需的研究设计。这项工作为一种新型的以部署为中心的风险治理工具提供了理论基础,适用于自主和自动化的人工智能系统。
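In symbols, the three-factor decomposition from the abstract reads as below, with the automation level a entering only through the propagation term; the notation is ours, chosen to match the abstract's wording.

```latex
% Three-factor risk decomposition, in our own notation:
% F = system failure, H = harm, a = automation level, S = severity.
\mathbb{E}[L \mid a]
  \;=\; \underbrace{P(F)}_{\text{failure}}
  \cdot \underbrace{P(H \mid F, a)}_{\text{propagation given automation}}
  \cdot \underbrace{\mathbb{E}[S \mid H]}_{\text{severity}}
```

The middle term is the quantity the abstract isolates: raising automation leaves failure probability and severity untouched but scales expected loss through the propagation probability, which is exactly what execution controls and human oversight act on.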
cs.AI / 26 / 2602.18998

Benchmark Test-Time Scaling of General LLM Agents

通用大型语言模型代理的基准测试时间扩展
Li, Xiaochuan, Ming, Ryan, Setlur, Pranav, Paladugu, Abhijay, Tang, Andy, Kang, Hao, Shao, Shuai, Jin, Rong, Xiong, Chenyan
Abstract
LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
Chinese Translation
大型语言模型(LLM)代理越来越被期望作为通用系统,能够解决开放式用户请求。现有基准测试主要集中在领域特定环境中,以开发专业化代理,而评估通用代理则需要更现实的设置,以挑战它们在统一环境中跨多个技能和工具的操作能力。我们引入了General AgentBench,这是一个基准测试,提供了一个统一框架,用于评估通用大型语言模型代理在搜索、编码、推理和工具使用领域的表现。通过使用General AgentBench,我们系统地研究了在顺序扩展(迭代交互)和并行扩展(采样多个轨迹)下的测试时间扩展行为。对十个领先的LLM代理的评估显示,当从领域特定评估转向这一通用代理设置时,性能显著下降。此外,我们发现,由于两个基本限制:顺序扩展中的上下文上限和并行扩展中的验证差距,任何扩展方法在实践中都未能带来有效的性能提升。代码可在 https://github.com/cxcscmu/General-AgentBench 上公开获取。
cs.AI / 27 / 2602.19000

MagicAgent: Towards Generalized Agent Planning

MagicAgent:面向通用智能体规划
Ren, Xuhui, Dong, Shaokang, Yang, Chen, Gao, Qing, Zhao, Yunbin, Liu, Yongsheng, Geng, Xinwei, Li, Xiang, Yan, Demei, Li, Yanqing, Huang, Chenhao, Zhu, Dingwei, Ye, Junjie, Yue, Boxuan, Fu, Yingnan, Lv, Mengzhe, Feng, Zezeng, Zhou, Boshen, Wang, Bocheng, Huang, Xuanjing, Jiang, Yu-Gang, Gui, Tao, Zhang, Qi, Zhang, Yunke
Abstract
The evolution of Large Language Models (LLMs) from passive text processors to autonomous agents has established planning as a core component of modern intelligence. However, achieving generalized planning remains elusive, not only by the scarcity of high-quality interaction data but also by inherent conflicts across heterogeneous planning tasks. These challenges result in models that excel at isolated tasks yet struggle to generalize, while existing multi-task training attempts suffer from gradient interference. In this paper, we present \textbf{MagicAgent}, a series of foundation models specifically designed for generalized agent planning. We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks, including hierarchical task decomposition, tool-augmented planning, multi-constraint scheduling, procedural logic orchestration, and long-horizon tool execution. To mitigate training conflicts, we propose a two-stage training paradigm comprising supervised fine-tuning followed by multi-objective reinforcement learning over both static datasets and dynamic environments. Empirical results demonstrate that MagicAgent-32B and MagicAgent-30B-A3B deliver superior performance, achieving accuracies of $75.1\%$ on Worfbench, $55.9\%$ on NaturalPlan, $57.5\%$ on $\tau^2$-Bench, $86.9\%$ on BFCL-v3, and $81.2\%$ on ACEBench, as well as strong results on our in-house MagicEval benchmarks. These results substantially outperform existing sub-100B models and even surpass leading closed-source models.
Chinese Translation
大型语言模型(LLMs)从被动文本处理器演变为自主智能体,使规划成为现代智能的核心组成部分。然而,实现通用规划仍然难以捉摸,这不仅是由于高质量交互数据的稀缺,还因为异构规划任务之间的固有冲突。这些挑战导致模型在孤立任务上表现出色,但在泛化能力上却乏力,而现有的多任务训练尝试则受到梯度干扰的困扰。本文提出了\textbf{MagicAgent},一系列专门为通用智能体规划设计的基础模型。我们引入了一种轻量级且可扩展的合成数据框架,能够在多样化的规划任务中生成高质量的轨迹,包括层次任务分解、工具增强规划、多约束调度、程序逻辑编排和长时间跨度的工具执行。为了解决训练冲突,我们提出了一种两阶段的训练范式,包括监督微调和在静态数据集与动态环境上进行多目标强化学习。实证结果表明,MagicAgent-32B和MagicAgent-30B-A3B在Worfbench上达到了75.1%的准确率,在NaturalPlan上为55.9%,在$\tau^2$-Bench上为57.5%,在BFCL-v3上为86.9%,在ACEBench上为81.2%,并在我们内部的MagicEval基准测试中也取得了强劲的结果。这些结果大幅超越了现有的千亿参数以下模型,甚至超过了领先的闭源模型。
cs.AI / 28 / 2602.19006

Evaluating Large Language Models on Quantum Mechanics: A Comparative Study Across Diverse Models and Tasks

对量子力学的大型语言模型评估:跨多种模型和任务的比较研究
Rithvik, S. K.
Abstract
We present a systematic evaluation of large language models on quantum mechanics problem-solving. Our study evaluates 15 models from five providers (OpenAI, Anthropic, Google, Alibaba, DeepSeek) spanning three capability tiers on 20 tasks covering derivations, creative problems, non-standard concepts, and numerical computation, comprising 900 baseline and 75 tool-augmented assessments. Results reveal clear tier stratification: flagship models achieve 81\% average accuracy, outperforming mid-tier (77\%) and fast models (67\%) by 4pp and 14pp respectively. Task difficulty patterns emerge distinctly: derivations show highest performance (92\% average, 100\% for flagship models), while numerical computation remains most challenging (42\%). Tool augmentation on numerical tasks yields task-dependent effects: modest overall improvement (+4.4pp) at 3x token cost masks dramatic heterogeneity ranging from +29pp gains to -16pp degradation. Reproducibility analysis across three runs quantifies 6.3pp average variance, with flagship models demonstrating exceptional stability (GPT-5 achieves zero variance) while specialized models require multi-run evaluation. This work contributes: (i) a benchmark for quantum mechanics with automatic verification, (ii) systematic evaluation quantifying tier-based performance hierarchies, (iii) empirical analysis of tool augmentation trade-offs, and (iv) reproducibility characterization. All tasks, verifiers, and results are publicly released.
Chinese Translation
我们对大型语言模型在量子力学问题解决方面进行了系统评估。我们的研究评估了来自五个提供商(OpenAI、Anthropic、Google、Alibaba、DeepSeek)的15个模型,涵盖三个能力层级,针对20个任务进行评估,这些任务包括推导、创造性问题、非标准概念和数值计算,共计900个基准评估和75个工具增强评估。结果显示出明显的层级分化:旗舰模型的平均准确率达到81%,比中层模型(77%)和快速模型(67%)分别高出4个百分点和14个百分点。任务难度模式明显:推导任务表现最佳(平均92%,旗舰模型达到100%),而数值计算则是最具挑战性的任务(42%)。在数值任务上,工具增强产生了任务依赖的效果:总体改善有限(+4.4个百分点),但在3倍的标记成本下,表现出显著的异质性,增益从+29个百分点到-16个百分点不等。对三次运行的可重复性分析量化了6.3个百分点的平均方差,旗舰模型展现出卓越的稳定性(GPT-5实现零方差),而专用模型则需要多次运行评估。此项工作贡献了:(i)一个具有自动验证的量子力学基准,(ii)系统评估量化基于层级的性能等级,(iii)工具增强权衡的实证分析,以及(iv)可重复性特征化。所有任务、验证器和结果均已公开发布。
cs.AI / 29 / 2602.19065

Agentic Problem Frames: A Systematic Approach to Engineering Reliable Domain Agents

代理问题框架:工程可靠领域代理的系统方法
Park, Chanjin
Abstract
Large Language Models (LLMs) are evolving into autonomous agents, yet current "frameless" development--relying on ambiguous natural language without engineering blueprints--leads to critical risks such as scope creep and open-loop failures. To ensure industrial-grade reliability, this study proposes Agentic Problem Frames (APF), a systematic engineering framework that shifts focus from internal model intelligence to the structured interaction between the agent and its environment. The APF establishes a dynamic specification paradigm where intent is concretized at runtime through domain knowledge injection. At its core, the Act-Verify-Refine (AVR) loop functions as a closed-loop control system that transforms execution results into verified knowledge assets, driving system behavior toward asymptotic convergence to mission requirements (R). To operationalize this, this study introduces the Agentic Job Description (AJD), a formal specification tool that defines jurisdictional boundaries, operational contexts, and epistemic evaluation criteria. The efficacy of this framework is validated through two contrasting case studies: a delegated proxy model for business travel and an autonomous supervisor model for industrial equipment management. By applying AJD-based specification and APF modeling to these scenarios, the analysis demonstrates how operational scenarios are systematically controlled within defined boundaries. These cases provide a conceptual proof that agent reliability stems not from a model's internal reasoning alone, but from the rigorous engineering structures that anchor stochastic AI within deterministic business processes, thereby enabling the development of verifiable and dependable domain agents.
Chinese Translation
大型语言模型(LLMs)正在演变为自主代理,但当前的“无框架”开发依赖于模糊的自然语言而没有工程蓝图,这导致了范围蔓延和开放环路故障等关键风险。为了确保工业级的可靠性,本研究提出了代理问题框架(Agentic Problem Frames, APF),这是一个系统的工程框架,重点从内部模型智能转向代理与其环境之间的结构化互动。APF建立了一个动态规范范式,在运行时通过领域知识注入将意图具体化。其核心是行为-验证-优化(Act-Verify-Refine, AVR)循环,作为一个闭环控制系统,将执行结果转化为经过验证的知识资产,推动系统行为朝向任务要求(R)的渐进收敛。为了实现这一目标,本研究引入了代理工作描述(Agentic Job Description, AJD),这是一种正式的规范工具,用于定义管辖边界、操作上下文和认知评估标准。通过两个对比案例研究验证了该框架的有效性:一个是用于商务旅行的委托代理模型,另一个是用于工业设备管理的自主监督模型。通过将基于AJD的规范和APF建模应用于这些场景,分析展示了如何在定义的边界内系统地控制操作场景。这些案例提供了一个概念证明,表明代理的可靠性不仅源于模型的内部推理,而是源于将随机AI锚定在确定性商业流程中的严格工程结构,从而使得可验证和可靠的领域代理的开发成为可能。
cs.AI / 30 / 2602.19069

Asking the Right Questions: Improving Reasoning with Generated Stepping Stones

提出正确的问题:通过生成中间步骤改善推理能力
Hu, Hengyuan, Fu, Tingchen, Jiang, Minqi, Miller, Alexander H, Bachrach, Yoram, Foerster, Jakob Nicolaus
Abstract
Recent years have witnessed tremendous progress in enabling LLMs to solve complex reasoning tasks such as math and coding. As we start to apply LLMs to harder tasks that they may not be able to solve in one shot, it is worth paying attention to their ability to construct intermediate stepping stones that prepare them to better solve the tasks. Examples of stepping stones include simplifications, alternative framings, or subproblems. We study properties and benefits of stepping stones in the context of modern reasoning LLMs via ARQ (\textbf{A}sking the \textbf{R}ight \textbf{Q}uestions), our simple framework which introduces a question generator to the default reasoning pipeline. We first show that good stepping stone questions exist and are transferable, meaning that good questions can be generated, and they substantially help LLMs of various capabilities in solving the target tasks. We next frame stepping stone generation as a post-training task and show that we can fine-tune LLMs to generate more useful stepping stones by SFT and RL on synthetic data.
Chinese Translation
近年来,随着大型语言模型(LLMs)在解决复杂推理任务(如数学和编程)方面取得了巨大的进展,我们开始将LLMs应用于更困难的任务,这些任务可能无法一次性解决。因此,关注它们构建中间步骤的能力变得尤为重要,这些中间步骤可以帮助它们更好地解决任务。中间步骤的例子包括简化、替代框架或子问题。我们通过ARQ(\textbf{A}sking the \textbf{R}ight \textbf{Q}uestions)这一简单框架研究了中间步骤的属性和益处,该框架在默认推理流程中引入了问题生成器。我们首先展示了良好的中间步骤问题的存在及其可转移性,意味着可以生成有效的问题,并且这些问题在帮助不同能力的LLMs解决目标任务方面具有显著作用。接下来,我们将中间步骤生成框架化为一个后训练任务,并展示通过在合成数据上进行监督微调(SFT)和强化学习(RL),我们可以对LLMs进行微调,以生成更有用的中间步骤。
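The pipeline change ARQ describes is small: a generator proposes stepping-stone questions before the solver attempts the target task. A hedged sketch with a hypothetical text-in/text-out `llm` callable, not the paper's code:

```python
# ARQ-style pipeline sketch: a question generator proposes stepping-stone
# questions (simplifications, subproblems), the solver answers them, and
# the final attempt conditions on those intermediate Q&A pairs.
# `llm` is a hypothetical text-in/text-out callable.
def solve_with_stepping_stones(llm, task, n_questions=3):
    questions = llm(
        f"Before solving the task below, pose {n_questions} simpler "
        f"questions whose answers would help solve it, one per line.\n\n"
        f"Task: {task}"
    ).splitlines()[:n_questions]
    notes = [f"Q: {q}\nA: {llm(q)}" for q in questions if q.strip()]
    return llm("Using these worked subquestions:\n\n"
               + "\n\n".join(notes)
               + f"\n\nNow solve the original task: {task}")
```

Framing the generator as a post-training target, as the abstract does, then amounts to fine-tuning the model behind the first `llm` call on which questions actually improved downstream solutions.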
cs.AI / 31 / 2602.19071

Defining Explainable AI for Requirements Analysis

为需求分析定义可解释人工智能
Sheh, Raymond, Monteath, Isaac
Abstract
Explainable Artificial Intelligence (XAI) has become popular in the last few years. The Artificial Intelligence (AI) community in general, and the Machine Learning (ML) community in particular, is coming to the realisation that in many applications, for AI to be trusted, it must not only demonstrate good performance in its decision-making, but it must also explain these decisions and convince us that it is making the decisions for the right reasons. However, different applications have different requirements on the information required of the underlying AI system in order to convince us that it is worthy of our trust. How do we define these requirements? In this paper, we present three dimensions for categorising the explanatory requirements of different applications: Source, Depth and Scope. We focus on the problem of matching up the explanatory requirements of different applications with the capabilities of underlying ML techniques to provide them. We deliberately avoid including aspects of explanation that are already well-covered by the existing literature, and we focus our discussion on ML, although the principles apply to AI more broadly.
Chinese Translation
可解释人工智能(XAI)在过去几年中变得越来越受欢迎。人工智能(AI)社区,尤其是机器学习(ML)社区,逐渐意识到在许多应用中,为了获得信任,人工智能不仅需要在决策中展示良好的性能,还必须解释这些决策,并让我们相信其做出决策的理由是正确的。然而,不同的应用对底层人工智能系统所需的信息有不同的要求,以说服我们信任它。我们如何定义这些要求?在本文中,我们提出了三个维度来对不同应用的解释性要求进行分类。这些维度是来源(Source)、深度(Depth)和范围(Scope)。我们重点讨论将不同应用的解释性要求与底层机器学习技术提供这些要求的能力进行匹配的问题。我们故意避免包括现有文献中已经充分覆盖的解释方面,并将讨论重点放在机器学习上,尽管这些原则更广泛地适用于人工智能。
cs.AI / 32 / 2602.19109

Post-Routing Arithmetic in Llama-3: Last-Token Result Writing and Rotation-Structured Digit Directions

Llama-3中的后路由算术:最后一个令牌结果写入与旋转结构数字方向
Yan, Yao
Abstract
We study three-digit addition in Meta-Llama-3-8B (base) under a one-token readout to characterize how arithmetic answers are finalized after cross-token routing becomes causally irrelevant. Causal residual patching and cumulative attention ablations localize a sharp boundary near layer~17: beyond it, the decoded sum is controlled almost entirely by the last input token and late-layer self-attention is largely dispensable. In this post-routing regime, digit(-sum) direction dictionaries vary with a next-higher-digit context but are well-related by an approximately orthogonal map inside a shared low-rank subspace (low-rank Procrustes alignment). Causal digit editing matches this geometry: naive cross-context transfer fails, while rotating directions through the learned map restores strict counterfactual edits; negative controls do not recover.
Chinese Translation
我们研究了在Meta-Llama-3-8B(基础)下的三位数加法,采用单个令牌的读出,以表征在跨令牌路由变得因果无关后,算术答案是如何最终确定的。因果残差修补和累积注意力消融定位了一个接近第17层的明显边界:在此边界之外,解码的和几乎完全由最后一个输入令牌控制,而后层自注意力在很大程度上是可有可无的。在这一后路由机制中,数字(和)方向字典随着下一个更高数字的上下文而变化,但通过一个在共享低秩子空间内的近似正交映射(低秩Procrustes对齐)良好相关。因果数字编辑与这一几何结构相匹配:天真的跨上下文转移失败,而通过学习到的映射旋转方向则恢复严格的反事实编辑;负控制未能恢复。
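The alignment the abstract describes is the classical orthogonal Procrustes problem, which has a closed-form SVD solution; below it is applied after projecting both dictionaries into a shared low-rank subspace, which is our reading of "low-rank Procrustes alignment" rather than the paper's exact recipe.

```python
# Orthogonal Procrustes alignment between two digit-direction
# dictionaries A and B (rows = paired directions, columns = d_model):
# find the orthogonal map R minimizing ||A R - B||_F, closed form via
# the SVD of A^T B. The low-rank projection step is our reading of the
# abstract, not the paper's exact recipe.
import numpy as np

def procrustes_rotation(A, B):
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt                        # optimal orthogonal map

def low_rank_procrustes(A, B, rank):
    # Shared subspace from the top principal directions of both
    # dictionaries stacked, then alignment inside that subspace.
    _, _, Vt = np.linalg.svd(np.vstack([A, B]), full_matrices=False)
    P = Vt[:rank].T                      # (d_model, rank) basis
    R = procrustes_rotation(A @ P, B @ P)
    return P, R                          # transfer: x -> (x @ P) @ R
```

This mirrors the causal finding quoted above: reusing a direction verbatim across contexts fails, while rotating it through (P, R) before editing restores the counterfactual behavior.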
cs.AI / 33 / 2602.19128

K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model

K-搜索:通过共同演化的内在世界模型生成LLM内核
Cao, Shiyi, Mao, Ziming, Gonzalez, Joseph E., Stoica, Ion
Abstract
Optimizing GPU kernels is critical for efficient modern machine learning systems yet remains challenging due to the complex interplay of design factors and rapid hardware evolution. Existing automated approaches typically treat Large Language Models (LLMs) merely as stochastic code generators within heuristic-guided evolutionary loops. These methods often struggle with complex kernels requiring coordinated, multi-step structural transformations, as they lack explicit planning capabilities and frequently discard promising strategies due to inefficient or incorrect intermediate implementations. To address this, we propose Search via Co-Evolving World Model and build K-Search based on this method. By replacing static search heuristics with a co-evolving world model, our framework leverages LLMs' prior domain knowledge to guide the search, actively exploring the optimization space. This approach explicitly decouples high-level algorithmic planning from low-level program instantiation, enabling the system to navigate non-monotonic optimization paths while remaining resilient to temporary implementation defects. We evaluate K-Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels. Our results show that K-Search significantly outperforms state-of-the-art evolutionary search methods, achieving an average 2.10x improvement and up to a 14.3x gain on complex MoE kernels. On the GPUMode TriMul task, K-Search achieves state-of-the-art performance on H100, reaching 1030us and surpassing both prior evolution and human-designed solutions.
Chinese Translation
优化GPU内核对于现代机器学习系统的高效运行至关重要,但由于设计因素的复杂相互作用和快速的硬件演变,这一任务仍然具有挑战性。现有的自动化方法通常将大型语言模型(LLMs)仅视为在启发式指导的演化循环中进行随机代码生成的工具。这些方法在处理需要协调的多步结构转换的复杂内核时往往面临困难,因为它们缺乏明确的规划能力,并且常常由于中间实现效率低下或不正确而丢弃有前景的策略。为了解决这个问题,我们提出了通过共同演化的世界模型进行搜索,并基于这一方法构建了K-搜索。通过用共同演化的世界模型替代静态搜索启发式,我们的框架利用LLMs的先前领域知识来指导搜索,积极探索优化空间。这种方法明确地将高层算法规划与低层程序实例化解耦,使系统能够在非单调优化路径上导航,同时对临时实现缺陷保持韧性。我们在FlashInfer的多种复杂内核上评估了K-搜索,包括GQA、MLA和MoE内核。我们的结果表明,K-搜索显著优于最先进的演化搜索方法,实现了平均2.10倍的提升,并在复杂的MoE内核上获得了高达14.3倍的增益。在GPUMode TriMul任务上,K-搜索在H100上达到了最先进的性能,达到1030微秒,超越了之前的演化和人工设计解决方案。
cs.AI / 34 / 2602.19141

Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians

谄媚型聊天机器人导致妄想螺旋,即使在理想贝叶斯人中也是如此
Chandra, Kartik, Kleiman-Weiner, Max, Ragan-Kelley, Jonathan, Tenenbaum, Joshua B.
Abstract
"AI psychosis" or "delusional spiraling" is an emerging phenomenon where AI chatbot users find themselves dangerously confident in outlandish beliefs after extended chatbot conversations. This phenomenon is typically attributed to AI chatbots' well-documented bias towards validating users' claims, a property often called "sycophancy." In this paper, we probe the causal link between AI sycophancy and AI-induced psychosis through modeling and simulation. We propose a simple Bayesian model of a user conversing with a chatbot, and formalize notions of sycophancy and delusional spiraling in that model. We then show that in this model, even an idealized Bayes-rational user is vulnerable to delusional spiraling, and that sycophancy plays a causal role. Furthermore, this effect persists in the face of two candidate mitigations: preventing chatbots from hallucinating false claims, and informing users of the possibility of model sycophancy. We conclude by discussing the implications of these results for model developers and policymakers concerned with mitigating the problem of delusional spiraling.
Chinese Translation
“人工智能精神病”或“妄想螺旋”是一个新兴现象,指的是在与人工智能聊天机器人进行长时间对话后,用户对荒谬信念产生危险的自信。这一现象通常归因于人工智能聊天机器人对用户主张的验证偏见,这种特性通常被称为“谄媚”。在本文中,我们通过建模和仿真探讨了人工智能谄媚与人工智能引发的精神病之间的因果关系。我们提出了一个简单的贝叶斯模型,描述用户与聊天机器人对话的过程,并在该模型中形式化了谄媚和妄想螺旋的概念。我们接着展示,在这个模型中,即使是理想化的贝叶斯理性用户也容易受到妄想螺旋的影响,而谄媚在其中起到了因果作用。此外,这种效应在两种候选缓解措施面前依然存在:防止聊天机器人产生虚假主张的幻觉,以及告知用户模型谄媚的可能性。最后,我们讨论了这些结果对模型开发者和政策制定者在减轻妄想螺旋问题方面的影响。
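The mechanism is easy to see in a toy version of such a model: if the user treats agreement as even weak evidence for their claim, while the sycophantic chatbot agrees regardless of truth, posterior odds drift multiplicatively toward the delusion over turns. A minimal sketch, not the paper's model; all numbers are illustrative.

```python
# Toy illustration of delusional spiraling under sycophancy (a minimal
# stand-in, not the paper's model). The user performs exact Bayesian
# updates but models the bot as slightly more likely to agree with true
# claims; the actual bot validates every turn regardless of truth.
def posterior_after_turns(prior=0.01, turns=30,
                          p_agree_if_true=0.9,    # user's model of the bot
                          p_agree_if_false=0.7):  # user's model of the bot
    odds = prior / (1 - prior)
    lr = p_agree_if_true / p_agree_if_false       # per-agreement update
    for _ in range(turns):
        # The sycophantic bot always agrees, so the user observes
        # "agree" every turn and multiplies in the likelihood ratio.
        odds *= lr
    return odds / (1 + odds)

print(posterior_after_turns())  # ~0.95: near-certainty from a 1% prior
```

Note that nothing here requires hallucinated claims or an uninformed user, which is the abstract's point about why those two mitigations fail to break the spiral.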
cs.AI / 35 / 2602.19158

DoAtlas-1: A Causal Compilation Paradigm for Clinical AI

DoAtlas-1:临床人工智能的因果编译范式
Li, Yulong, Chen, Jianxu, Liu, Xiwei, Suo, Chuanyue, Xia, Rong, Lu, Zhixiang, Li, Yichen, Zhuang, Xinlin, Menon, Niranjana Arun, Xie, Yutong, Segal, Eran, Razzak, Imran
Abstract
Medical foundation models generate narrative explanations but cannot quantify intervention effects, detect evidence conflicts, or validate literature claims, limiting clinical auditability. We propose causal compilation, a paradigm that transforms medical evidence from narrative text into executable code. The paradigm standardizes heterogeneous research evidence into structured estimand objects, each explicitly specifying intervention contrast, effect scale, time horizon, and target population, supporting six executable causal queries: do-calculus, counterfactual reasoning, temporal trajectories, heterogeneous effects, mechanistic decomposition, and joint interventions. We instantiate this paradigm in DoAtlas-1, compiling 1,445 effect kernels from 754 studies through effect standardization, conflict-aware graph construction, and real-world validation (Human Phenotype Project, 10,000 participants). The system achieves 98.5% canonicalization accuracy and 80.5% query executability. This paradigm shifts medical AI from text generation to executable, auditable, and verifiable causal reasoning.
Chinese Translation
医学基础模型生成叙述性解释,但无法量化干预效果、检测证据冲突或验证文献主张,限制了临床可审计性。我们提出了因果编译这一范式,将医学证据从叙述文本转化为可执行代码。该范式将异构研究证据标准化为结构化的估计对象,每个对象明确指定干预对比、效果尺度、时间范围和目标人群,支持六种可执行的因果查询:do-calculus、反事实推理、时间轨迹、异质效应、机制分解和联合干预。我们在DoAtlas-1中实例化这一范式,通过效果标准化、冲突感知图构建和现实世界验证(人类表型计划,10,000名参与者)编译了1,445个效果核心。该系统实现了98.5%的规范化准确率和80.5%的查询可执行性。该范式将医学人工智能从文本生成转变为可执行、可审计和可验证的因果推理。
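The abstract names the fields a compiled estimand must carry, which maps naturally onto a typed record. A sketch follows; the field types and example values are our own, illustrative only.

```python
# Sketch of a structured estimand object carrying the four fields the
# abstract specifies (intervention contrast, effect scale, time horizon,
# target population). Types and example values are ours, not DoAtlas-1's.
from dataclasses import dataclass

@dataclass(frozen=True)
class Estimand:
    intervention_contrast: tuple[str, str]  # (treatment, comparator)
    effect_scale: str                       # e.g. "risk_ratio", "mean_diff"
    time_horizon_days: int
    target_population: str                  # eligibility description

kernel = Estimand(
    intervention_contrast=("statin_40mg", "placebo"),
    effect_scale="risk_ratio",
    time_horizon_days=365 * 5,
    target_population="adults 40-75, elevated LDL, no prior MI",
)
```

Making every effect kernel this explicit is what lets two studies be compared field by field, so conflicts surface as data rather than as prose disagreements.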
cs.AI / 36 / 2602.19159

Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM

超越行为权衡:在大型语言模型中对痛苦-快乐决策的机制追踪
Bianco, Francesca, Shiller, Derek
Abstract
Prior behavioural work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. To bridge behavioural evidence (what the model does) with mechanistic interpretability (what computations support it), we investigate how valence-related information is represented and where it is causally used inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid, reading out both the 2-3 logit margin and digit-pair-normalised choice probabilities. We find that (a) valence sign (pain vs. pleasure) is perfectly linearly separable across stream families from very early layers (L0-L1), while a lexical baseline retains substantial signal; (b) graded intensity is strongly decodable, with peaks in mid-to-late layers and especially in attention/MLP outputs, and decision alignment is highest slightly before the final token; (c) additive steering along a data-derived valence direction causally modulates the 2-3 margin at late sites, with the largest effects observed in late-layer attention outputs (attn_out L14); and (d) head-level patching/ablation suggests that these effects are distributed across multiple heads rather than concentrated in a single unit. Together, these results link behavioural sensitivity to identifiable internal representations and intervention-sensitive sites, providing concrete mechanistic targets for more stringent counterfactual tests and broader replication. This work supports a more evidence-driven (a) debate on AI sentience and welfare, and (b) governance when setting policy, auditing standards, and safety safeguards.
Chinese Translation
先前的行为研究表明,一些大型语言模型(LLMs)在选项被框定为导致痛苦或快乐时会改变选择,并且这种偏差可以随着所述强度的增加而加剧。为了将行为证据(模型的表现)与机制可解释性(支持其计算的内容)结合起来,我们研究了情感相关信息的表示方式以及其在变压器内部的因果使用位置。使用Gemma-2-9B-it和基于先前工作的简约决策任务模型,我们(i)通过层级线性探测映射表示的可用性,跨流进行分析;(ii)通过激活干预(引导;修补/消融)测试因果贡献;(iii)在epsilon网格上量化剂量-反应效应,读取2-3对数边际和数字对归一化选择概率。我们发现:(a)情感符号(痛苦与快乐)在非常早期的层(L0-L1)中在流家族之间是完全线性可分的,而词汇基线仍保留了相当大的信号;(b)分级强度具有很强的可解码性,在中后期层中达到峰值,尤其是在注意力/多层感知器(MLP)输出中,决策对齐在最后一个标记之前略高;(c)沿着数据导出的情感方向的附加引导在晚期位置因果调节2-3边际,最大的效应出现在晚层注意力输出(attn_out L14)中;(d)头级修补/消融表明这些效应分布在多个头部,而不是集中在单个单元中。综合这些结果,将行为敏感性与可识别的内部表示和对干预敏感的位置联系起来,为更严格的反事实测试和更广泛的复制提供了具体的机制目标。这项工作支持关于人工智能意识和福利的更有证据驱动的(a)辩论,以及在制定政策、审计标准和安全保障时的(b)治理。
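The steering intervention described, adding a data-derived valence direction with strength epsilon at a chosen site, is a standard difference-of-means technique; a numpy sketch under that assumption, not the paper's exact code:

```python
# Standard activation-steering sketch matching the intervention the
# abstract describes: derive a valence direction as a difference of
# means between pleasure- and pain-condition activations, then add it
# (scaled by epsilon) at a chosen site such as a late-layer attn_out.
import numpy as np

def valence_direction(acts_pleasure, acts_pain):
    """acts_*: (n_samples, d_model) activations at the target site."""
    v = acts_pleasure.mean(axis=0) - acts_pain.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v_hat, epsilon):
    """Additive steering h' = h + epsilon * v_hat along the valence axis."""
    return hidden + epsilon * v_hat
```

Sweeping epsilon over a grid and reading out the 2-3 logit margin at each value is how dose-response curves of the kind the abstract reports would be traced.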
cs.AI / 37 / 2602.19160

Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing

大型语言模型的推理能力:从通用游戏玩法中获得的经验教训
Świechowski, Maciej, Żychowski, Adam, Mańdziuk, Jacek
Abstract
This paper examines the reasoning capabilities of Large Language Models (LLMs) from a novel perspective, focusing on their ability to operate within formally specified, rule-governed environments. We evaluate four LLMs (Gemini 2.5 Pro and Flash variants, Llama 3.3 70B, and GPT-OSS 120B) on a suite of forward-simulation tasks, including next-/multi-step state formulation and legal action generation, across a diverse set of reasoning problems illustrated through General Game Playing (GGP) game instances. Beyond reporting instance-level performance, we characterize games based on 40 structural features and analyze correlations between these features and LLM performance. Furthermore, we investigate the effects of various game obfuscations to assess the role of linguistic semantics in game definitions and the impact of potential prior exposure of LLMs to specific games during training. The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps). Detailed case-based analysis of the LLM performance provides novel insights into common reasoning errors in the considered logic-based problem formulation, including hallucinated rules, redundant state facts, or syntactic errors. Overall, the paper reports clear progress in the formal reasoning capabilities of contemporary models.
Chinese Translation
本文从一个新颖的视角考察了大型语言模型(LLMs)的推理能力,重点关注它们在形式化规定的、规则驱动的环境中操作的能力。我们对四个LLM(Gemini 2.5 的 Pro 与 Flash 变体、Llama 3.3 70B和GPT-OSS 120B)进行了评估,任务包括下一步/多步状态推导和合法动作生成,涵盖了通过通用游戏玩法(GGP)游戏实例展示的多样化推理问题。除了报告实例级别的性能外,我们根据40个结构特征对游戏进行特征化,并分析这些特征与LLM性能之间的相关性。此外,我们还研究了各种游戏模糊化的影响,以评估语言语义在游戏定义中的作用以及LLM在训练期间对特定游戏的潜在先前接触的影响。主要结果表明,在大多数实验设置中,评估的三个模型通常表现良好,但随着评估范围的增加(即游戏步骤数量的增加),性能出现下降。对LLM性能的详细案例分析提供了对所考虑的基于逻辑的问题表述中常见推理错误的新见解,包括虚构的规则、冗余的状态事实或语法错误。总体而言,本文报告了当代模型在形式推理能力方面的明显进展。
cs.AI / 38 / 2602.19223

Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment

能源控制中的多智能体强化学习特征分析:CityLearn环境下的多关键绩效指标基准测试
Khouja, Aymen, Jendoubi, Imen, Mahjoub, Oumayma, Mahfoudhi, Oussama, Formanek, Claude, Singh, Siddarth, De Kock, Ruan
Abstract
The optimization of urban energy systems is crucial for the advancement of sustainable and resilient smart cities, which are becoming increasingly complex with multiple decision-making units. To address scalability and coordination concerns, Multi-Agent Reinforcement Learning (MARL) is a promising solution. This paper addresses the imperative need for comprehensive and reliable benchmarking of MARL algorithms on energy management tasks. CityLearn is used as a case study environment because it realistically simulates urban energy systems, incorporates multiple storage systems, and utilizes renewable energy sources. By doing so, our work sets a new standard for evaluation, conducting a comparative study across multiple key performance indicators (KPIs). This approach illuminates the key strengths and weaknesses of various algorithms, moving beyond traditional KPI averaging which often masks critical insights. Our experiments utilize widely accepted baselines such as Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC), and encompass diverse training schemes including Decentralized Training with Decentralized Execution (DTDE) and Centralized Training with Decentralized Execution (CTDE) approaches and different neural network architectures. Our work also proposes novel KPIs that tackle real world implementation challenges such as individual building contribution and battery storage lifetime. Our findings show that DTDE consistently outperforms CTDE in both average and worst-case performance. Additionally, temporal dependency learning improved control on memory dependent KPIs such as ramping and battery usage, contributing to more sustainable battery operation. Results also reveal robustness to agent or resource removal, highlighting both the resilience and decentralizability of the learned policies.
Chinese Translation
城市能源系统的优化对于可持续和韧性智能城市的发展至关重要,而这些城市正变得越来越复杂,涉及多个决策单元。为了解决可扩展性和协调性问题,多智能体强化学习(MARL)是一种有前景的解决方案。本文针对能源管理任务中对MARL算法进行全面且可靠的基准测试的迫切需求进行了探讨。CityLearn被用作案例研究环境,因为它真实地模拟了城市能源系统,包含多个储能系统,并利用可再生能源。通过这样做,我们的工作为评估设定了新的标准,进行了跨多个关键绩效指标(KPI)的比较研究。这种方法揭示了各种算法的主要优缺点,超越了传统的KPI平均值计算,这往往掩盖了重要的洞察。我们的实验利用了广泛接受的基准,如近端策略优化(PPO)和软演员评论家(SAC),并涵盖了多种训练方案,包括去中心化训练与去中心化执行(DTDE)和中心化训练与去中心化执行(CTDE)方法,以及不同的神经网络架构。我们的工作还提出了新的KPI,以应对现实世界实施中的挑战,如单个建筑的贡献和电池储存寿命。我们的研究结果表明,DTDE在平均和最坏情况下的表现均优于CTDE。此外,时间依赖学习改善了对记忆依赖KPI(如爬坡(ramping)和电池使用)的控制,促进了更可持续的电池操作。结果还显示出对代理或资源移除的鲁棒性,突显了所学习策略的韧性和去中心化特性。
cs.AI / 39 / 2602.19225

Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training

基于邻近的多轮优化:大规模语言模型代理训练的实用信用分配
Fang, Yangyi, Lin, Jiaye, Fu, Xiaoliang, Qin, Cong, Shi, Haolin, Liu, Chang, Zhao, Peilin
Abstract
Multi-turn LLM agents are becoming pivotal to production systems, spanning customer service automation, e-commerce assistance, and interactive task management, where accurately distinguishing high-value informative signals from stochastic noise is critical for sample-efficient training. In real-world scenarios, a failure in a trivial task may reflect random instability, whereas success in a high-difficulty task signifies a genuine capability breakthrough. Yet, existing group-based policy optimization methods rigidly rely on statistical deviation within discrete batches, frequently misallocating credit when task difficulty fluctuates. To address this issue, we propose Proximity-based Multi-turn Optimization (ProxMO), a practical and robust framework engineered specifically for the constraints of real-world deployment. ProxMO integrates global context via two lightweight mechanisms: success-rate-aware modulation dynamically adapts gradient intensity based on episode-level difficulty, while proximity-based soft aggregation derives baselines through continuous semantic weighting at the step level. Extensive evaluations on ALFWorld and WebShop benchmarks demonstrate that ProxMO yields substantial performance gains over existing baselines with negligible computational cost. Ablation studies further validate the independent and synergistic efficacy of both mechanisms. Crucially, ProxMO offers plug-and-play compatibility with standard GRPO frameworks, facilitating immediate, low-friction adoption in existing industrial training pipelines. Our implementation is available at https://anonymous.4open.science/r/proxmo-B7E7/README.md.
Chinese Translation
多轮大规模语言模型(LLM)代理在生产系统中变得至关重要,涵盖了客户服务自动化、电子商务辅助和互动任务管理等领域,其中准确区分高价值信息信号与随机噪声对于样本高效训练至关重要。在现实场景中,简单任务的失败可能反映出随机不稳定性,而高难度任务的成功则标志着真正的能力突破。然而,现有的基于群体的策略优化方法过于依赖离散批次内的统计偏差,常常在任务难度波动时错误地分配信用。为了解决这一问题,我们提出了基于邻近的多轮优化(Proximity-based Multi-turn Optimization, ProxMO),这是一个专门针对现实部署约束而设计的实用且稳健的框架。ProxMO通过两种轻量级机制整合全局上下文:成功率感知调节根据情节级别的难度动态调整梯度强度,而基于邻近的软聚合在步骤级别通过连续语义加权推导基线。在ALFWorld和WebShop基准上的广泛评估表明,ProxMO在现有基线之上实现了显著的性能提升,同时计算成本几乎可以忽略不计。消融研究进一步验证了这两种机制的独立性和协同效应。重要的是,ProxMO与标准的GRPO框架具有即插即用的兼容性,促进了在现有工业训练管道中的即时、低摩擦的采用。我们的实现可在以下链接获取:https://anonymous.4open.science/r/proxmo-B7E7/README.md。
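A hedged sketch of the two ProxMO mechanisms as described in the abstract; the cosine-similarity proximity measure, the softmax weighting, and the (1 - success rate) modulation are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

def proximity_baseline(step_emb, all_embs, all_rewards):
    """Soft, similarity-weighted baseline for one step: semantically nearby
    steps contribute more to the expected-reward estimate."""
    sims = all_embs @ step_emb / (
        np.linalg.norm(all_embs, axis=1) * np.linalg.norm(step_emb) + 1e-8)
    w = np.exp(sims) / np.exp(sims).sum()          # softmax over proximity
    return float(w @ all_rewards)

def success_rate_modulation(advantage, success_rate):
    """Upweight gradients on hard episodes (low success rate) and downweight
    trivial ones, instead of relying on within-batch deviation alone."""
    return advantage * (1.0 - success_rate)

rng = np.random.default_rng(1)
embs = rng.normal(size=(32, 16))                   # step embeddings in a group
rewards = rng.uniform(size=32)
b = proximity_baseline(embs[0], embs, rewards)
adv = rewards[0] - b
print(success_rate_modulation(adv, success_rate=0.9))  # easy episode -> small update
```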
cs.AI / 40 / 2602.19240

Topology of Reasoning: Retrieved Cell Complex-Augmented Generation for Textual Graph Question Answering

推理的拓扑:面向文本图问答的检索胞腔复形增强生成
Zhao, Sen, Zhou, Lincheng, Chen, Yue, Zou, Ding
Abstract
Retrieval-Augmented Generation (RAG) enhances the reasoning ability of Large Language Models (LLMs) by dynamically integrating external knowledge, thereby mitigating hallucinations and strengthening contextual grounding for structured data such as graphs. Nevertheless, most existing RAG variants for textual graphs concentrate on low-dimensional structures -- treating nodes as entities (0-dimensional) and edges or paths as pairwise or sequential relations (1-dimensional), but overlook cycles, which are crucial for reasoning over relational loops. Such cycles often arise in questions requiring closed-loop inference about similar objects or relative positions. This limitation often results in incomplete contextual grounding and restricted reasoning capability. In this work, we propose Topology-enhanced Retrieval-Augmented Generation (TopoRAG), a novel framework for textual graph question answering that effectively captures higher-dimensional topological and relational dependencies. Specifically, TopoRAG first lifts textual graphs into cellular complexes to model multi-dimensional topological structures. Leveraging these lifted representations, a topology-aware subcomplex retrieval mechanism is proposed to extract cellular complexes relevant to the input query, providing compact and informative topological context. Finally, a multi-dimensional topological reasoning mechanism operates over these complexes to propagate relational information and guide LLMs in performing structured, logic-aware inference. Empirical evaluations demonstrate that our method consistently surpasses existing baselines across diverse textual graph tasks.
Chinese Translation
检索增强生成(RAG)通过动态整合外部知识来增强大型语言模型(LLMs)的推理能力,从而减轻幻觉现象并加强对结构化数据(如图)的上下文基础。然而,现有的大多数文本图的RAG变体集中于低维结构——将节点视为实体(0维),将边或路径视为成对或顺序关系(1维),但忽视了循环,这对于对关系循环进行推理至关重要。这种循环通常出现在需要对相似对象或相对位置进行闭环推理的问题中。这一局限性常常导致上下文基础不完整和推理能力受限。在本研究中,我们提出了一种拓扑增强的检索增强生成(TopoRAG),这是一个用于文本图问答的新框架,能够有效捕捉更高维的拓扑和关系依赖。具体而言,TopoRAG首先将文本图提升为胞腔复形,以建模多维拓扑结构。利用这些提升的表示,提出了一种拓扑感知的子复形检索机制,以提取与输入查询相关的胞腔复形,提供紧凑且信息丰富的拓扑上下文。最后,基于这些复形的多维拓扑推理机制用于传播关系信息,并指导LLMs进行结构化、逻辑感知的推理。实证评估表明,我们的方法在各种文本图任务中始终超越现有基准。
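To make the "lifting" step concrete, here is a minimal sketch with networkx: nodes as 0-cells, edges as 1-cells, and a cycle basis as stand-in 2-cells, with a naive keyword-overlap retriever in place of the paper's learned topology-aware retrieval.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "alice"), ("carol", "dave")])

complex_ = {
    0: list(G.nodes),        # 0-cells: entities
    1: list(G.edges),        # 1-cells: pairwise relations
    2: nx.cycle_basis(G),    # 2-cells: relational loops, e.g. alice-bob-carol
}

def retrieve_subcomplex(query_terms, complex_):
    """Keep cells whose constituent entities overlap the query."""
    hit = lambda cell: any(t in str(cell) for t in query_terms)
    return {dim: [c for c in cells if hit(c)] for dim, cells in complex_.items()}

print(retrieve_subcomplex({"alice", "carol"}, complex_))
```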
cs.AI / 41 / 2602.19244

Robust Exploration in Directed Controller Synthesis via Reinforcement Learning with Soft Mixture-of-Experts

通过软专家混合的强化学习实现的有向控制器合成中的鲁棒探索
Ubukata, Toshihide, Wang, Zhiyao, Mu, Enhong, Li, Jialong, Tei, Kenji
Abstract
On-the-fly Directed Controller Synthesis (OTF-DCS) mitigates state-space explosion by incrementally exploring the system and relies critically on an exploration policy to guide search efficiently. Recent reinforcement learning (RL) approaches learn such policies and achieve promising zero-shot generalization from small training instances to larger unseen ones. However, a fundamental limitation is anisotropic generalization, where an RL policy exhibits strong performance only in a specific region of the domain-parameter space while remaining fragile elsewhere due to training stochasticity and trajectory-dependent bias. To address this, we propose a Soft Mixture-of-Experts framework that combines multiple RL experts via a prior-confidence gating mechanism and treats these anisotropic behaviors as complementary specializations. The evaluation on the Air Traffic benchmark shows that Soft-MoE substantially expands the solvable parameter space and improves robustness compared to any single expert.
Chinese Translation
即时有向控制器合成(OTF-DCS)通过逐步探索系统来缓解状态空间爆炸,并在很大程度上依赖于探索策略以有效引导搜索。最近的强化学习(RL)方法学习此类策略,并在从小型训练实例到更大未见实例的零样本泛化方面取得了令人鼓舞的成果。然而,一个根本性的限制是各向异性泛化,即 RL 策略在领域参数空间的特定区域内表现良好,而在其他地方由于训练的随机性和轨迹依赖性偏差而表现脆弱。为了解决这个问题,我们提出了一种软专家混合(Soft Mixture-of-Experts)框架,通过先验置信度门控机制结合多个 RL 专家,并将这些各向异性行为视为互补的专业化。在空中交通基准测试中的评估表明,软专家混合显著扩展了可解参数空间,并在鲁棒性上优于任何单一专家。
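A minimal sketch of prior-confidence gating over anisotropic experts; the confidence scores and the softmax gate are illustrative assumptions about how the mixture might weight experts per instance.

```python
import numpy as np

def soft_moe_policy(action_dists, confidences, temperature=1.0):
    """Blend per-expert action distributions by a softmax of confidences."""
    c = np.asarray(confidences) / temperature
    gates = np.exp(c - c.max())
    gates /= gates.sum()
    mixed = gates @ np.asarray(action_dists)   # convex combination of policies
    return mixed / mixed.sum()

# Two experts, strong in different regions of the domain-parameter space.
dists = [[0.7, 0.2, 0.1],     # expert A's action distribution
         [0.1, 0.3, 0.6]]     # expert B's
conf = [0.2, 1.5]             # prior confidence: B suits this instance better
print(soft_moe_policy(dists, conf))
```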
cs.AI / 42 / 2602.19281

Limited Reasoning Space: The cage of long-horizon reasoning in LLMs

有限推理空间:大型语言模型中长程推理的桎梏
Li, Zhenyu, Wu, Guanlin, Wang, Cheems, Zhao, Yongqiang
Abstract
The test-time compute strategy, such as Chain-of-Thought (CoT), has significantly enhanced the ability of large language models to solve complex tasks like logical reasoning. However, empirical studies indicate that simply increasing the compute budget can sometimes lead to a collapse in test-time performance when employing typical task decomposition strategies such as CoT. This work hypothesizes that reasoning failures with larger compute budgets stem from static planning methods, which hardly perceive the intrinsic boundaries of LLM reasoning. We term it as the Limited Reasoning Space hypothesis and perform theoretical analysis through the lens of a non-autonomous stochastic dynamical system. This insight suggests that there is an optimal range for compute budgets; over-planning can lead to redundant feedback and may even impair reasoning capabilities. To exploit the compute-scaling benefits and suppress over-planning, this work proposes Halo, a model predictive control framework for LLM planning. Halo is designed for long-horizon tasks with reason-based planning and crafts an entropy-driven dual controller, which adopts a Measure-then-Plan strategy to achieve controllable reasoning. Experimental results demonstrate that Halo outperforms static baselines on complex long-horizon tasks by dynamically regulating planning at the reasoning boundary.
Chinese Translation
测试时计算策略,如链式思维(Chain-of-Thought, CoT),显著增强了大型语言模型在解决复杂任务(如逻辑推理)方面的能力。然而,实证研究表明,简单地增加计算预算有时会导致在采用典型任务分解策略(如 CoT)时测试性能的崩溃。本研究假设,较大计算预算下的推理失败源于静态规划方法,这些方法难以感知大型语言模型推理的内在边界。我们将其称为有限推理空间假设,并通过非自治随机动力系统的视角进行理论分析。这一见解表明,计算预算存在一个最佳范围;过度规划可能导致冗余反馈,甚至损害推理能力。为了利用计算扩展的优势并抑制过度规划,本研究提出了 Halo,一个用于大型语言模型规划的模型预测控制框架。Halo 旨在处理基于推理规划的长程任务,并设计了一个熵驱动的双控制器,采用“测量-再规划”(Measure-then-Plan)策略以实现可控推理。实验结果表明,Halo 通过在推理边界处动态调节规划,在复杂的长程任务上优于静态基线。
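One way to picture an entropy-driven "Measure-then-Plan" controller is the toy decision rule below; the thresholds, budget, and three-way outcome are illustrative assumptions, not Halo's actual controller.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def measure_then_plan(next_token_probs, plan_budget_used, budget=8,
                      low=0.5, high=2.0):
    """Measure predictive uncertainty first, then decide whether to re-plan."""
    h = entropy(next_token_probs)
    if plan_budget_used >= budget:
        return "act"        # planning budget exhausted: avoid over-planning
    if h > high:
        return "replan"     # too uncertain: refine the plan
    if h < low:
        return "act"        # confident: execute the next step
    return "continue"       # in-band: keep following the current plan

print(measure_then_plan([0.05] * 20, plan_budget_used=3))   # high entropy -> replan
print(measure_then_plan([0.97, 0.03], plan_budget_used=3))  # low entropy  -> act
```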
cs.AI / 43 / 2602.19297

Automated Generation of Microfluidic Netlists using Large Language Models

利用大型语言模型自动生成微流控电路网表
Davidson, Jasper, Stockham, Skylar, Boston, Allen, Snelgrove, Ashton, Tenace, Valerio, Gaillardon, Pierre-Emmanuel
Abstract
Microfluidic devices have emerged as powerful tools in various laboratory applications, but the complexity of their design limits accessibility for many practitioners. While progress has been made in microfluidic design automation (MFDA), a practical and intuitive solution is still needed to connect microfluidic practitioners with MFDA techniques. This work introduces the first practical application of large language models (LLMs) in this context, providing a preliminary demonstration. Building on prior research in hardware description language (HDL) code generation with LLMs, we propose an initial methodology to convert natural language microfluidic device specifications into system-level structural Verilog netlists. We demonstrate the feasibility of our approach by generating structural netlists for practical benchmarks representative of typical microfluidic designs with correct functional flow and an average syntactical accuracy of 88%.
Chinese Translation
微流控设备已成为各种实验室应用中的强大工具,但其设计的复杂性限制了许多从业者的使用。尽管在微流控设计自动化(MFDA)方面取得了一定进展,但仍需要一种实用且直观的解决方案,以将微流控从业者与MFDA技术连接起来。本研究首次在这一背景下引入大型语言模型(LLMs)的实际应用,提供了初步的演示。在先前关于使用LLMs生成硬件描述语言(HDL)代码的研究基础上,我们提出了一种初步方法,将自然语言微流控设备规格转换为系统级结构Verilog网表。我们通过为典型微流控设计生成具有正确功能流的结构网表,并实现了平均88%的语法准确率,展示了我们方法的可行性。
cs.AI / 44 / 2602.19298

ALPACA: A Reinforcement Learning Environment for Medication Repurposing and Treatment Optimization in Alzheimer's Disease

ALPACA:用于阿尔茨海默病药物再利用和治疗优化的强化学习环境
Brady, Nolan, Yeh, Tom
Abstract
Evaluating personalized, sequential treatment strategies for Alzheimer's disease (AD) using clinical trials is often impractical due to long disease horizons and substantial inter-patient heterogeneity. To address these constraints, we present the Alzheimer's Learning Platform for Adaptive Care Agents (ALPACA), an open-source, Gym-compatible reinforcement learning (RL) environment for systematically exploring personalized treatment strategies using existing therapies. ALPACA is powered by the Continuous Action-conditioned State Transitions (CAST) model trained on longitudinal trajectories from the Alzheimer's Disease Neuroimaging Initiative (ADNI), enabling medication-conditioned simulation of disease progression under alternative treatment decisions. We show that CAST autoregressively generates realistic medication-conditioned trajectories and that RL policies trained in ALPACA outperform no-treatment and behavior-cloned clinician baselines on memory-related outcomes. Interpretability analyses further indicated that the learned policies relied on clinically meaningful patient features when selecting actions. Overall, ALPACA provides a reusable in silico testbed for studying individualized sequential treatment decision-making for AD.
Chinese Translation
由于阿尔茨海默病(AD)具有较长的疾病进程和显著的患者间异质性,评估个性化的顺序治疗策略在临床试验中往往不切实际。为了解决这些限制,我们提出了阿尔茨海默自适应护理智能体学习平台(Alzheimer's Learning Platform for Adaptive Care Agents, ALPACA),这是一个开源的、与Gym兼容的强化学习(RL)环境,用于系统性地探索使用现有疗法的个性化治疗策略。ALPACA基于在阿尔茨海默病神经影像学倡议(ADNI)中获得的纵向轨迹训练的连续动作条件状态转移(CAST)模型,能够在替代治疗决策下模拟药物条件的疾病进展。我们展示了CAST能够自回归地生成真实的药物条件轨迹,并且在ALPACA中训练的RL策略在记忆相关结果上优于无治疗和行为克隆临床医生的基线。可解释性分析进一步表明,学习到的策略在选择行动时依赖于临床上有意义的患者特征。总体而言,ALPACA为研究阿尔茨海默病的个性化顺序治疗决策提供了一个可重复使用的计算机模拟(in silico)测试平台。
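A skeleton of what a Gym-compatible treatment environment of this kind can look like, using the gymnasium API; the state variables, placeholder transition model, and reward below are invented stand-ins for the CAST model fit on ADNI data.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ToyADTreatmentEnv(gym.Env):
    """State: [cognition_score, biomarker]; action: continuous medication dose."""

    def __init__(self):
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(2,))
        self.action_space = spaces.Box(0.0, 1.0, shape=(1,))
        self.state = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = np.array([0.0, 1.0], dtype=np.float32)
        return self.state.copy(), {}

    def step(self, action):
        cog, bio = self.state
        # Placeholder action-conditioned progression (CAST stands in here).
        cog += -0.1 * bio + 0.3 * float(action[0]) + self.np_random.normal(0, 0.05)
        bio *= 1.02
        self.state = np.array([cog, bio], dtype=np.float32)
        reward = float(cog) - 0.1 * float(action[0])  # memory outcome minus drug burden
        return self.state.copy(), reward, False, False, {}

env = ToyADTreatmentEnv()
obs, _ = env.reset(seed=0)
obs, r, *_ = env.step(env.action_space.sample())
```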
cs.AI / 45 / 2602.19367

Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces

时间序列、视觉与语言:探索对比表示空间中对齐的极限
Yashwante, Pratham, Yu, Rose
Abstract
The Platonic Representation Hypothesis posits that learned representations from models trained on different modalities converge to a shared latent structure of the world. However, this hypothesis has largely been examined in vision and language, and it remains unclear whether time series participate in such convergence. We first examine this in a trimodal setting and find that independently pretrained time series, vision, and language encoders exhibit near-orthogonal geometry in the absence of explicit coupling. We then apply post-hoc alignment by training projection heads over frozen encoders using contrastive learning, and analyze the resulting representations with respect to geometry, scaling behavior, and dependence on information density and input modality characteristics. Our investigation reveals that overall alignment in contrastive representation spaces improves with model size, but this alignment is asymmetric: time series align more strongly with visual representations than with text, and images can act as effective intermediaries between time series and language. We further see that richer textual descriptions improve alignment only up to a threshold; training on denser captions does not lead to further improvement. Analogous effects are observed for visual representations. Our findings shed light on considerations for building multimodal systems involving non-conventional data modalities beyond vision and language.
Chinese Translation
柏拉图表示假设认为,从不同模态训练的模型所学习的表示会收敛到一个共享的潜在世界结构。然而,这一假设主要是在视觉和语言领域进行的研究,目前尚不清楚时间序列是否参与这种收敛。我们首先在三模态设置中进行检验,发现独立预训练的时间序列、视觉和语言编码器在没有显式耦合的情况下表现出近正交的几何结构。然后,我们在冻结的编码器之上以对比学习训练投影头,实现事后(post-hoc)对齐,并分析所得表示的几何结构、缩放行为以及对信息密度和输入模态特征的依赖。我们的研究表明,对比表示空间中的整体对齐随着模型规模的增大而改善,但这种对齐是非对称的:时间序列与视觉表示的对齐强于与文本的对齐,而图像可以作为时间序列和语言之间的有效中介。我们进一步观察到,更丰富的文本描述仅在达到某个阈值之前改善对齐;对更密集的标题进行训练并未带来进一步的改善。视觉表示也观察到了类似的效果。我们的发现为构建涉及视觉和语言之外的非传统数据模态的多模态系统提供了重要的考虑因素。
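A minimal numpy sketch of the post-hoc alignment objective: frozen-encoder embeddings pass through linear projection heads and are scored with a symmetric InfoNCE loss. Only the loss computation is shown; the batch data and head shapes are illustrative.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(za, zb, temperature=0.07):
    """Symmetric contrastive loss: matched (time series, image) pairs are
    positives, all other in-batch pairings serve as negatives."""
    logits = l2norm(za) @ l2norm(zb).T / temperature
    labels = np.arange(len(za))
    log_sm_ab = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_ba = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return (-log_sm_ab[labels, labels].mean() - log_sm_ba[labels, labels].mean()) / 2

rng = np.random.default_rng(0)
ts_emb, img_emb = rng.normal(size=(8, 32)), rng.normal(size=(8, 32))  # frozen encoders
Wa, Wb = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))         # trainable heads
print(info_nce(ts_emb @ Wa, img_emb @ Wb))
```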
cs.AI / 46 / 2602.19390

Artificial Intelligence for Modeling & Simulation in Digital Twins

人工智能在数字孪生中的建模与仿真应用
Zech, Philipp, David, Istvan
Abstract
The convergence of modeling & simulation (M&S) and artificial intelligence (AI) is leaving its marks on advanced digital technology. Pertinent examples are digital twins (DTs) - high-fidelity, live representations of physical assets, and frequent enablers of corporate digital maturation and transformation. Often seen as technological platforms that integrate an array of services, DTs have the potential to bring AI-enabled M&S closer to end-users. It is, therefore, paramount to understand the role of M&S in DTs, and the role of digital twins in enabling the convergence of AI and M&S. To this end, this chapter provides a comprehensive exploration of the complementary relationship between these three. We begin by establishing a foundational understanding of DTs by detailing their key components, architectural layers, and their various roles across business, development, and operations. We then examine the central role of M&S in DTs and provide an overview of key modeling techniques from physics-based and discrete-event simulation to hybrid approaches. Subsequently, we investigate the bidirectional role of AI: first, how AI enhances DTs through advanced analytics, predictive capabilities, and autonomous decision-making, and second, how DTs serve as valuable platforms for training, validating, and deploying AI models. The chapter concludes by identifying key challenges and future research directions for creating more integrated and intelligent systems.
Chinese Translation
建模与仿真(M&S)与人工智能(AI)的融合正在对先进数字技术产生深远影响。数字孪生(DTs)便是典型例子,它们是物理资产的高保真、实时表示,常常是企业数字化成熟和转型的重要推动力。数字孪生通常被视为集成多种服务的技术平台,具有将AI驱动的M&S更贴近最终用户的潜力。因此,理解M&S在数字孪生中的作用,以及数字孪生在促进AI与M&S融合中的作用至关重要。为此,本章节全面探讨了这三者之间的互补关系。我们首先通过详细阐述数字孪生的关键组成部分、架构层次及其在业务、开发和运营中的各种角色,建立对数字孪生的基础理解。接着,我们考察了M&S在数字孪生中的核心作用,并概述了从基于物理的仿真和离散事件仿真到混合方法的关键建模技术。随后,我们研究了AI的双向作用:首先,AI如何通过先进分析、预测能力和自主决策增强数字孪生;其次,数字孪生如何作为训练、验证和部署AI模型的宝贵平台。章节最后识别了创建更为集成和智能系统的关键挑战和未来研究方向。
cs.AI / 47 / 2602.19396

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

隐藏在明文中:通过激活解耦检测隐蔽的越狱攻击
Farzam, Amirhossein, Behabahani, Majid, Malek, Mani, Nevmyvaka, Yuriy, Sapiro, Guillermo
Abstract
Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when an attacker tries to hide the malicious goal of their request by manipulating its framing to induce compliance. Because these attacks maintain malicious intent through a flexible presentation, defenses that rely on structural artifacts or goal-specific signatures can fail. Motivated by this, we introduce a self-supervised framework for disentangling semantic factor pairs in LLM activations at inference. We instantiate the framework for goal and framing and construct GoalFrameBench, a corpus of prompts with controlled goal and framing variations, which we use to train Representation Disentanglement on Activations (ReDAct) module to extract disentangled representations in a frozen LLM. We then propose FrameShield, an anomaly detector operating on the framing representations, which improves model-agnostic detection across multiple LLM families with minimal computational overhead. Theoretical guarantees for ReDAct and extensive empirical validations show that its disentanglement effectively powers FrameShield. Finally, we use disentanglement as an interpretability probe, revealing distinct profiles for goal and framing signals and positioning semantic disentanglement as a building block for both LLM safety and mechanistic interpretability.
Chinese Translation
大型语言模型(LLMs)仍然容易受到流畅且语义连贯的越狱提示的攻击,因此难以通过标准启发式方法进行检测。一种特别具有挑战性的失败模式发生在攻击者试图通过操控请求的框架来隐藏其恶意目标,以诱导模型的顺从。由于这些攻击通过灵活的呈现保持恶意意图,依赖于结构性痕迹或特定目标特征签名的防御措施可能会失效。基于此,我们提出了一种自监督框架,用于在推理过程中解耦LLM激活中的语义因子对。我们为目标和框架实例化该框架,并构建了GoalFrameBench,这是一个具有可控目标和框架变体的提示语料库,我们用其训练了在冻结LLM中提取解耦表示的Representation Disentanglement on Activations(ReDAct)模块。然后,我们提出了FrameShield,这是一个基于框架表示的异常检测器,能够在多个LLM系列中以最小的计算开销提高模型无关的检测性能。ReDAct的理论保证和广泛的实证验证表明,其解耦效果有效地支持了FrameShield。最后,我们将解耦作为可解释性探针,揭示了目标和框架信号的不同特征,并将语义解耦定位为LLM安全性和机制可解释性的基础构件。
cs.AI / 48 / 2602.19416

IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking

IR$^3$: 用于可解释的奖励黑客检测与缓解的对比逆强化学习
Beigi, Mohammad, Jin, Ming, Zhang, Junshan, Zhang, Jiaxin, Wang, Qifan, Huang, Lifu
Abstract
Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking - models exploit spurious correlations in proxy rewards without genuine alignment. Compounding this, the objectives internalized during RLHF remain opaque, making hacking behaviors difficult to detect or correct. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We propose Contrastive Inverse Reinforcement Learning (C-IRL), which reconstructs the implicit reward function by contrasting paired responses from post-alignment and baseline policies to explain behavioral shifts during RLHF. We then decompose the reconstructed reward via sparse autoencoders into interpretable features, enabling identification of hacking signatures through contribution analysis. Finally, we propose mitigation strategies - clean reward optimization, adversarial shaping, constrained optimization, and feature-guided distillation - that target problematic features while preserving beneficial alignment. Experiments across multiple reward model configurations show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
Chinese Translation
人类反馈强化学习(RLHF)使得强大的大型语言模型(LLM)对齐成为可能,但也可能引入奖励黑客现象——模型利用代理奖励中的虚假相关性而未能实现真正的对齐。更糟糕的是,在RLHF过程中内化的目标仍然不透明,使得黑客行为难以检测或纠正。我们提出了IR3(可解释奖励重构与修正)框架,该框架逆向工程、解释并精确修复驱动RLHF调优模型的隐含目标。我们提出了对比逆强化学习(C-IRL),通过对比后对齐和基线策略的配对响应来重构隐含奖励函数,以解释RLHF过程中的行为变化。然后,我们通过稀疏自编码器将重构的奖励分解为可解释特征,从而通过贡献分析识别黑客特征。最后,我们提出了缓解策略——清洁奖励优化、对抗性塑形、约束优化和特征引导蒸馏——这些策略旨在针对问题特征,同时保持有益的对齐。多个奖励模型配置的实验表明,IR3与真实奖励的相关性达到0.89,识别黑客特征的精度超过90%,并显著减少黑客行为,同时将模型能力保持在与原始模型相差3%以内。
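The contrastive reconstruction step can be pictured as a Bradley-Terry fit on paired responses, sketched below on synthetic features; the feature map and drift model are assumptions, and the sparse-autoencoder decomposition stage is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
true_w = rng.normal(size=16)                         # hidden internalized reward
phi_base = rng.normal(size=(500, 16))                # baseline response features
# Post-RLHF responses drift along the internalized reward direction, plus noise.
phi_post = phi_base + 0.4 * true_w + 0.3 * rng.normal(size=(500, 16))

diffs = phi_post - phi_base                          # paired contrast per prompt
X = np.vstack([diffs, -diffs])                       # symmetrize into a 2-class fit
y = np.concatenate([np.ones(len(diffs)), np.zeros(len(diffs))])
w_hat = LogisticRegression(fit_intercept=False, max_iter=1000).fit(X, y).coef_.ravel()

print(f"correlation with true reward direction: {np.corrcoef(w_hat, true_w)[0, 1]:.2f}")
```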
cs.AI / 49 / 2602.19439

OptiRepair: Closed-Loop Diagnosis and Repair of Supply Chain Optimization Models with LLM Agents

OptiRepair:使用LLM代理的供应链优化模型的闭环诊断与修复
Ao, Ruicheng, Simchi-Levi, David, Wang, Xinshang
Abstract
Problem Definition. Supply chain optimization models frequently become infeasible because of modeling errors. Diagnosis and repair require scarce OR expertise: analysts must interpret solver diagnostics, trace root causes across echelons, and fix formulations without sacrificing operational soundness. Whether AI agents can perform this task remains untested. Methodology/Results. OptiRepair splits this task into a domain-agnostic feasibility phase (iterative IIS-guided repair of any LP) and a domain-specific validation phase (five rationality checks grounded in inventory theory). We test 22 API models from 7 families on 976 multi-echelon supply chain problems and train two 8B-parameter models using self-taught reasoning with solver-verified rewards. The trained models reach 81.7% Rational Recovery Rate (RRR) -- the fraction of problems resolved to both feasibility and operational rationality -- versus 42.2% for the best API model and 21.3% on average. The gap concentrates in Phase 1 repair: API models average 27.6% recovery rate versus 97.2% for trained models. Managerial Implications. Two gaps separate current AI from reliable model repair: solver interaction (API models restore only 27.6% of infeasible formulations) and operational rationale (roughly one in four feasible repairs violate supply chain theory). Each requires a different intervention: solver interaction responds to targeted training; operational rationale requires explicit specification as solver-verifiable checks. For organizations adopting AI in operational planning, formalizing what "rational" means in their context is the higher-return investment.
Chinese Translation
问题定义:供应链优化模型由于建模错误而经常变得不可行。诊断和修复需要稀缺的运筹学专业知识:分析师必须解释求解器的诊断,追踪各层级的根本原因,并在不牺牲操作合理性的情况下修复模型。AI代理是否能够执行此任务尚未经过验证。方法论/结果:OptiRepair将此任务分为一个领域无关的可行性阶段(在不可约不可行子系统(IIS)引导下对任意线性规划(LP)进行迭代修复)和一个领域特定的验证阶段(基于库存理论的五个合理性检查)。我们在976个多层级供应链问题上测试了来自7个家族的22个API模型,并使用求解器验证的奖励进行自我学习推理训练了两个8B参数模型。训练后的模型达到了81.7%的合理恢复率(RRR)——即同时恢复可行性与操作合理性的问题所占比例——而最佳API模型为42.2%,平均为21.3%。这一差距主要集中在第一阶段的修复:API模型的平均恢复率为27.6%,而训练模型为97.2%。管理启示:当前AI与可靠模型修复之间存在两个差距:求解器交互(API模型仅恢复27.6%的不可行模型)和操作合理性(大约四分之一的可行修复违反了供应链理论)。每个差距需要不同的干预措施:求解器交互需要针对性训练;操作合理性则需要明确规定为求解器可验证的检查。对于采用AI进行运营规划的组织而言,正式化“合理”在其上下文中的含义是更高回报的投资。
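A toy version of the feasibility phase using elastic relaxation with scipy, as a cheap stand-in for solver IIS diagnostics: every constraint gets a slack variable, total slack is minimized, and constraints needing positive slack localize the modeling error.

```python
import numpy as np
from scipy.optimize import linprog

# Toy infeasible model: x >= 5 and x <= 3 (written as -x <= -5 and x <= 3).
A_ub = np.array([[-1.0], [1.0]])
b_ub = np.array([-5.0, 3.0])

m, n = A_ub.shape
# Variables [x, s_1..s_m]; minimize sum of slacks subject to A x - s <= b, s >= 0.
c = np.concatenate([np.zeros(n), np.ones(m)])
A = np.hstack([A_ub, -np.eye(m)])
res = linprog(c, A_ub=A, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None)] * m)

slacks = res.x[n:]
for i, s in enumerate(slacks):
    if s > 1e-8:
        print(f"constraint {i} needs relaxation by {s:.2f}  <- repair target")
```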
cs.AI / 50 / 2602.19458

ComplLLM: Fine-tuning LLMs to Discover Complementary Signals for Decision-making

ComplLLM:微调大型语言模型以发现决策中的互补信号
Guo, Ziyang, Wu, Yifan, Hartline, Jason, Holstein, Kenneth, Hullman, Jessica
Abstract
Multi-agent decision pipelines can outperform single agent workflows when complementarity holds, i.e., different agents bring unique information to the table to inform a final decision. We propose ComplLLM, a post-training framework based on decision theory that fine-tunes a decision-assistant LLM using complementary information as reward to output signals that complement existing agent decisions. We validate ComplLLM on synthetic and real-world tasks involving domain experts, demonstrating how the approach recovers known complementary information and produces plausible explanations of complementary signals to support downstream decision-makers.
Chinese Translation
当互补性存在时,多智能体决策流程可以优于单一智能体工作流程,即不同的智能体提供独特的信息以支持最终决策。我们提出了ComplLLM,这是一种基于决策理论的后训练框架,通过使用互补信息作为奖励来微调决策辅助大型语言模型(LLM),以输出补充现有智能体决策的信号。我们在涉及领域专家的合成和真实世界任务上验证了ComplLLM,展示了该方法如何恢复已知的互补信息,并生成合理的互补信号解释,以支持下游决策者。
cs.AI / 51 / 2602.19502

Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark

人类引导的自主智能体在多模态临床预测中的应用:来自AgentDS医疗基准的经验教训
Pulavarthy, Lalitha Pranathi, Muthyala, Raajitha, Kuruvikkattil, Aravind V, Yin, Zhenan, Kudamala, Rashmita, Purkayastha, Saptarshi
Abstract
Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise that purely automated approaches struggle to provide. We investigate how human guidance of agentic AI can improve multimodal clinical prediction, presenting our approach to all three AgentDS Healthcare benchmark challenges: 30-day hospital readmission prediction (Macro-F1 = 0.8986), emergency department cost forecasting (MAE = $465.13), and discharge readiness assessment (Macro-F1 = 0.7939). Across these tasks, human analysts directed the agentic workflow at key decision points, multimodal feature engineering from clinical notes, scanned PDF billing receipts, and time-series vital signs; task-appropriate model selection; and clinically informed validation strategies. Our approach ranked 5th overall in the healthcare domain, with a 3rd-place finish on the discharge readiness task. Ablation studies reveal that human-guided decisions compounded to a cumulative gain of +0.065 F1 over automated baselines, with multimodal feature extraction contributing the largest single improvement (+0.041 F1). We distill three generalizable lessons: (1) domain-informed feature engineering at each pipeline stage yields compounding gains that outperform extensive automated search; (2) multimodal data integration requires task-specific human judgment that no single extraction strategy generalizes across clinical text, PDFs, and time-series; and (3) deliberate ensemble diversity with clinically motivated model configurations outperforms random hyperparameter search. These findings offer practical guidance for teams deploying agentic AI in healthcare settings where interpretability, reproducibility, and clinical validity are essential.
Chinese Translation
自主智能体系统在自主数据科学工作流程方面的能力日益增强,但临床预测任务需要领域专业知识,而纯自动化的方法难以提供。我们研究了人类如何引导自主智能体以改善多模态临床预测,并展示了我们在三个AgentDS医疗基准挑战中的方法:30天医院再入院预测(Macro-F1 = 0.8986)、急诊科成本预测(MAE = $465.13)和出院准备评估(Macro-F1 = 0.7939)。在这些任务中,人类分析师在关键决策点引导自主工作流程,进行来自临床记录、扫描的PDF账单和时间序列生命体征的多模态特征工程;选择适合任务的模型;以及采用临床信息驱动的验证策略。我们的方法在医疗领域整体排名第5,在出院准备任务中获得第3名。消融研究表明,人类引导的决策相较于自动基线累计提升了+0.065 F1,其中多模态特征提取贡献了最大的单一改进(+0.041 F1)。我们提炼出三个可推广的经验教训:(1)在每个流程阶段进行领域知识驱动的特征工程可产生复合增益,超越广泛的自动化搜索;(2)多模态数据整合需要特定任务的人类判断,单一提取策略无法在临床文本、PDF和时间序列之间通用;(3)有意识的集成多样性与临床驱动的模型配置相比随机超参数搜索表现更佳。这些发现为在医疗环境中部署自主智能体的团队提供了实用指导,尤其是在可解释性、可重复性和临床有效性至关重要的情况下。
cs.AI / 52 / 2602.19517

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

课堂期末考试:一项经过教师验证的推理基准
Gao, Chongyang, Yang, Diji, Zhou, Shuyan, Yan, Xichen, Song, Luchuan, Li, Shuo, Chen, Kezhen
Abstract
We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at https://github.com/Analogy-AI/CFE_Bench.
Chinese Translation
我们介绍了 CFE(Classroom Final Exam,课堂期末考试),这是一个多模态基准,用于评估大型语言模型在20多个STEM领域的推理能力。CFE 由反复使用的真实大学作业和考试题目以及课程教师提供的参考解答精心策划而成。CFE 对前沿模型提出了重大挑战:新发布的 Gemini-3.1-pro-preview 模型的整体准确率为 59.69%,而第二好的模型 Gemini-3-flash-preview 的准确率为 55.46%,这表明仍有相当大的改进空间。除了排行榜结果外,我们还通过将参考解答分解为推理流程进行诊断分析。我们发现,尽管前沿模型通常能够正确回答中间子问题,但在多步骤解答中,它们难以可靠地推导和维持正确的中间状态。我们进一步观察到,模型生成的解答通常比教师提供的解答有更多的推理步骤,这表明步骤效率不佳,并且更容易导致错误累积。数据和代码可在 https://github.com/Analogy-AI/CFE_Bench 获取。
cs.AI / 53 / 2602.19519

Ada-RS: Adaptive Rejection Sampling for Selective Thinking

Ada-RS:用于选择性思维的自适应拒绝采样
Ge, Yirou, Li, Yixi, Chiu, Alec, Shekhar, Shivani, Pan, Zijie, Thangali, Avinash, Chuang, Yun-Shiuan, Kulkarni, Chaitanya, Kona, Uma, Pang, Linsey, Mehrotra, Prakhar
Abstract
Large language models (LLMs) are increasingly being deployed in cost and latency-sensitive settings. While chain-of-thought improves reasoning, it can waste tokens on simple requests. We study selective thinking for tool-using LLMs and introduce Adaptive Rejection Sampling (Ada-RS), an algorithm-agnostic sample filtering framework for learning selective and efficient reasoning. For each given context, Ada-RS scores multiple sampled completions with an adaptive length-penalized reward then applies stochastic rejection sampling to retain only high-reward candidates (or preference pairs) for downstream optimization. We demonstrate how Ada-RS plugs into both preference pair (e.g. DPO) or grouped policy optimization strategies (e.g. DAPO). Using Qwen3-8B with LoRA on a synthetic tool call-oriented e-commerce benchmark, Ada-RS improves the accuracy-efficiency frontier over standard algorithms by reducing average output tokens by up to 80% and reducing thinking rate by up to 95% while maintaining or improving tool call accuracy. These results highlight that training-signal selection is a powerful lever for efficient reasoning in latency-sensitive deployments.
Chinese Translation
大型语言模型(LLMs)越来越多地被应用于对成本和延迟敏感的场景中。虽然链式思维可以改善推理能力,但在简单请求上可能会浪费令牌。我们研究了工具使用的LLMs的选择性思维,并引入了自适应拒绝采样(Adaptive Rejection Sampling,Ada-RS),这是一种与算法无关的样本过滤框架,用于学习选择性和高效的推理。对于每个给定的上下文,Ada-RS使用自适应长度惩罚奖励对多个采样完成进行评分,然后应用随机拒绝采样,仅保留高奖励候选(或偏好对)以进行下游优化。我们展示了Ada-RS如何与偏好对(例如DPO)或分组策略优化策略(例如DAPO)相结合。在一个以合成工具调用为导向的电子商务基准测试中,使用LoRA的Qwen3-8B,Ada-RS通过将平均输出令牌减少多达80%和将思维速率减少多达95%,同时保持或提高工具调用的准确性,从而改善了标准算法的准确性-效率边界。这些结果强调了训练信号选择在延迟敏感部署中作为高效推理的强大杠杆作用。
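A minimal sketch of the filtering idea: score each sampled completion with a length-penalized reward, then retain candidates stochastically in proportion to reward. The penalty coefficient, temperature, and acceptance rule are illustrative assumptions rather than Ada-RS's exact adaptive scheme.

```python
import numpy as np

def ada_rs_filter(correct, n_tokens, rng, alpha=0.002, temperature=0.5):
    """Return indices of retained completions plus their rewards."""
    reward = np.asarray(correct, float) - alpha * np.asarray(n_tokens, float)
    # Stochastic rejection: acceptance probability grows with reward;
    # the best candidate in the group is always kept.
    p = np.exp(reward / temperature)
    p = p / p.max()
    keep = rng.uniform(size=len(p)) < p
    return np.flatnonzero(keep), reward

rng = np.random.default_rng(0)
correct = [1, 1, 0, 1]                    # tool-call accuracy per sample
n_tokens = [900, 120, 80, 150]            # long "thinking" answers pay a penalty
kept, r = ada_rs_filter(correct, n_tokens, rng)
print(kept, np.round(r, 2))               # short correct answers dominate the kept set
```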
cs.AI / 54 / 2602.19562

A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data

一种多模态框架用于将人类语言描述与视觉感知数据对齐
Bingham, Joseph
Abstract
Establishing stable mappings between natural language expressions and visual percepts is a foundational problem for both cognitive science and artificial intelligence. Humans routinely ground linguistic reference in noisy, ambiguous perceptual contexts, yet the mechanisms supporting such cross-modal alignment remain poorly understood. In this work, we introduce a computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery. The system approximates human perceptual categorization by combining scale-invariant feature transform (SIFT) alignment with the Universal Quality Index (UQI) to quantify similarity in a cognitively plausible feature space, while a set of linguistic preprocessing and query-transformation operations captures pragmatic variability in referring expressions. We evaluate the model on the Stanford Repeated Reference Game corpus (15,000 utterances paired with tangram stimuli), a paradigm explicitly developed to probe human-level perceptual ambiguity and coordination. Our framework achieves robust referential grounding. It requires 65% fewer utterances than human interlocutors to reach stable mappings and can correctly identify target objects from single referring expressions 41.66% of the time (versus 20% for humans). These results suggest that relatively simple perceptual-linguistic alignment mechanisms can yield human-competitive behavior on a classic cognitive benchmark, and offer insights into models of grounded communication, perceptual inference, and cross-modal concept formation. Code is available at https://anonymous.4open.science/r/metasequoia-9D13/README.md.
Chinese Translation
在自然语言表达与视觉感知之间建立稳定的映射是认知科学和人工智能的基础性问题。人类通常能在嘈杂、模糊的感知情境中为语言指称建立感知基础,但支持这种跨模态对齐的机制仍然不够清楚。在本研究中,我们提出了一个计算框架,旨在通过将语言表达与来自大规模众包图像的感知表示结合,来建模人类指称解释的核心方面。该系统通过结合尺度不变特征变换(SIFT)对齐与通用质量指数(UQI),在一个认知上合理的特征空间中量化相似性,从而近似人类的感知分类,同时一组语言预处理和查询转换操作捕捉了指称表达中的语用变异性。我们在斯坦福重复参考游戏语料库(包含15,000个与 tangram 刺激配对的表达)上评估了该模型,这是一个明确开发用于探测人类级别感知模糊性和协调性的范式。我们的框架实现了稳健的指称基础。它所需的表达次数比人类对话者少65%,并且能够在41.66%的情况下正确识别单一指称表达中的目标物体(而人类的识别率为20%)。这些结果表明,相对简单的感知-语言对齐机制可以在经典的认知基准上产生与人类竞争的行为,并为基础沟通、感知推理和跨模态概念形成的模型提供了见解。代码可在 https://anonymous.4open.science/r/metasequoia-9D13/README.md 获取。
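The Universal Quality Index (Wang & Bovik, 2002) that the framework pairs with SIFT alignment has a simple closed form, implemented directly below; SIFT registration is assumed to have been applied upstream.

```python
# UQI = 4 * cov(x, y) * mean(x) * mean(y)
#       / ((var(x) + var(y)) * (mean(x)^2 + mean(y)^2))
import numpy as np

def uqi(x, y):
    x, y = np.asarray(x, float).ravel(), np.asarray(y, float).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 4 * cov * mx * my / ((vx + vy) * (mx**2 + my**2))

rng = np.random.default_rng(0)
img = rng.uniform(0, 255, size=(8, 8))
print(uqi(img, img))                                  # identical patches -> 1.0
print(uqi(img, img + rng.normal(0, 25, img.shape)))   # noisy copy -> < 1.0
```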
cs.AI / 55 / 2602.19620

Rules or Weights? Comparing User Understanding of Explainable AI Techniques with the Cognitive XAI-Adaptive Model

规则还是权重?比较用户对可解释人工智能技术的理解与认知XAI自适应模型
Rawshan, Louth Bin, Wang, Zhuoyu, Lim, Brian Y
Abstract
Rules and Weights are popular XAI techniques for explaining AI decisions. Yet, it remains unclear how to choose between them, lacking a cognitive framework to compare their interpretability. In an elicitation user study on forward and counterfactual decision tasks, we identified 7 reasoning strategies of interpreting three XAI Schemas - weights, rules, and their hybrid. To analyze their capabilities, we propose CoXAM, a Cognitive XAI-Adaptive Model with shared memory representation to encode instance attributes, linear weights, and decision rules. CoXAM employs computational rationality to choose among reasoning processes based on the trade-off in utility and reasoning time, separately for forward or counterfactual decision tasks. In a validation study, CoXAM demonstrated a stronger alignment with human decision-making compared to baseline machine learning proxy models. The model successfully replicated and explained several key empirical findings, including that counterfactual tasks are inherently harder than forward tasks, decision tree rules are harder to recall and apply than linear weights, and the helpfulness of XAI depends on the application data context, alongside identifying which underlying reasoning strategies were most effective. With CoXAM, we contribute a cognitive basis to accelerate debugging and benchmarking disparate XAI techniques.
Chinese Translation
规则和权重是解释人工智能决策的流行可解释人工智能(XAI)技术。然而,如何在这两者之间进行选择仍不明确,缺乏一个认知框架来比较它们的可解释性。在一项关于前向和反事实决策任务的用户研究中,我们识别出了三种XAI模式(权重、规则及其混合)解释的7种推理策略。为了分析它们的能力,我们提出了CoXAM,一个具有共享记忆表示的认知XAI自适应模型,用于编码实例属性、线性权重和决策规则。CoXAM利用计算理性在推理过程中选择,基于效用和推理时间的权衡,分别针对前向和反事实决策任务。在一项验证研究中,CoXAM与人类决策的对齐程度明显优于基线机器学习代理模型。该模型成功复制并解释了若干关键的实证发现,包括反事实任务本质上比前向任务更难,决策树规则比线性权重更难回忆和应用,以及XAI的有效性依赖于应用数据的上下文,同时识别出哪些潜在的推理策略最为有效。通过CoXAM,我们为加速调试和基准测试不同的XAI技术提供了认知基础。
cs.AI / 56 / 2602.19633

TAPE: Tool-Guided Adaptive Planning and Constrained Execution in Language Model Agents

TAPE:工具引导的自适应规划与约束执行在语言模型代理中的应用
Jeong, Jongwon, Kim, Jungtaek, Lee, Kangwook
Abstract
Language Model (LM) agents have demonstrated remarkable capabilities in solving tasks that require multiple interactions with the environment. However, they remain vulnerable in environments where a single error often leads to irrecoverable failure, particularly under strict feasibility constraints. We systematically analyze existing agent frameworks, identifying imperfect planning and stochastic execution as the primary causes. To address these challenges, we propose Tool-guided Adaptive Planning with constrained Execution (TAPE). TAPE enhances planning capability by aggregating multiple plans into a graph and employing an external solver to identify a feasible path. During execution, TAPE employs constrained decoding to reduce sampling noise, while adaptively re-planning whenever environmental feedback deviates from the intended state. Experiments across Sokoban, ALFWorld, MuSiQue, and GSM8K-Hard demonstrate that TAPE consistently outperforms existing frameworks, with particularly large gains on hard settings: success rates improve by an average of 21.0 percentage points on hard settings and by an average of 20.0 percentage points for weaker base models. Code and data are available here.
Chinese Translation
语言模型(LM)代理在解决需要与环境进行多次交互的任务方面表现出显著的能力。然而,在单个错误常常导致不可恢复失败的环境中,它们依然脆弱,特别是在严格的可行性约束下。我们系统地分析了现有的代理框架,识别出不完美的规划和随机执行是主要原因。为了解决这些挑战,我们提出了工具引导的自适应规划与约束执行(TAPE)。TAPE通过将多个计划聚合成图形并使用外部求解器识别可行路径,从而增强了规划能力。在执行过程中,TAPE采用约束解码以减少采样噪声,并在环境反馈偏离预期状态时自适应地重新规划。在Sokoban、ALFWorld、MuSiQue和GSM8K-Hard等实验中,TAPE始终优于现有框架,尤其是在困难设置下取得了显著的提升,平均提高了21.0个百分点的成功率,对于较弱的基础模型,平均提高了20.0个百分点。代码和数据可在此处获取。
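A toy sketch of the planning step as described: sampled plans are merged into one directed graph, an external feasibility check prunes edges, and a path search recovers a feasible plan. The plans and the stub feasibility predicate are invented; TAPE uses an external solver and environment feedback.

```python
import networkx as nx

plans = [["start", "pick_key", "open_door", "goal"],
         ["start", "push_box", "open_door", "goal"],
         ["start", "pick_key", "unlock_chest", "goal"]]

G = nx.DiGraph()
for plan in plans:                 # aggregate sampled plans into one graph
    nx.add_path(G, plan)

def feasible(u, v):
    """Stub for the external feasibility check on each transition."""
    return (u, v) != ("push_box", "open_door")

G.remove_edges_from([e for e in list(G.edges) if not feasible(*e)])
print(nx.shortest_path(G, "start", "goal"))   # a feasible plan survives pruning
```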
cs.AI / 57 / 2602.19672

SkillOrchestra: Learning to Route Agents via Skill Transfer

SkillOrchestra:通过技能转移学习代理的路由
Wang, Jiayu, Ming, Yifei, Ke, Zixuan, Joty, Shafiq, Albarghouthi, Aws, Sala, Frederic
Abstract
Compound AI systems promise capabilities beyond those of individual models, yet their success depends critically on effective orchestration. Existing routing approaches face two limitations: (1) input-level routers make coarse query-level decisions that ignore evolving task requirements; (2) RL-trained orchestrators are expensive to adapt and often suffer from routing collapse, repeatedly invoking one strong but costly option in multi-turn scenarios. We introduce SkillOrchestra, a framework for skill-aware orchestration. Instead of directly learning a routing policy end-to-end, SkillOrchestra learns fine-grained skills from execution experience and models agent-specific competence and cost under those skills. At deployment, the orchestrator infers the skill demands of the current interaction and selects agents that best satisfy them under an explicit performance-cost trade-off. Extensive experiments across ten benchmarks demonstrate that SkillOrchestra outperforms SoTA RL-based orchestrators by up to 22.5% with 700x and 300x learning cost reduction compared to Router-R1 and ToolOrchestra, respectively. These results show that explicit skill modeling enables scalable, interpretable, and sample-efficient orchestration, offering a principled alternative to data-intensive RL-based approaches. The code is available at: https://github.com/jiayuww/SkillOrchestra.
Chinese Translation
复合人工智能系统承诺提供超越单个模型的能力,但其成功在很大程度上依赖于有效的协调。现有的路由方法面临两个局限性:(1)输入级路由器做出粗略的查询级决策,忽视了不断变化的任务需求;(2)基于强化学习(RL)训练的协调器适应成本高,且常常遭遇路由崩溃,在多轮场景中反复调用一种强大但代价高昂的选项。我们提出了SkillOrchestra,一个关注技能的协调框架。SkillOrchestra并不是直接端到端学习路由策略,而是从执行经验中学习细粒度技能,并在这些技能下建模代理特定的能力和成本。在部署时,协调器推断当前交互的技能需求,并选择最佳满足这些需求的代理,基于明确的性能-成本权衡。通过在十个基准测试上的广泛实验,SkillOrchestra在性能上比现有的基于强化学习的协调器提高了多达22.5%,并且与Router-R1和ToolOrchestra相比,学习成本分别降低了700倍和300倍。这些结果表明,明确的技能建模使得可扩展、可解释和样本高效的协调成为可能,为数据密集型的基于强化学习的方法提供了一种有原则的替代方案。代码可在以下链接获取:https://github.com/jiayuww/SkillOrchestra。
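A minimal sketch of skill-aware routing under an explicit performance-cost trade-off; the skill inventory, competence scores, costs, and the linear scoring rule are invented for illustration.

```python
import numpy as np

SKILLS = ["code", "math", "web_search"]
agents = {
    "big_generalist":  {"competence": [0.9, 0.9, 0.7], "cost": 1.00},
    "code_specialist": {"competence": [0.95, 0.4, 0.2], "cost": 0.15},
    "search_agent":    {"competence": [0.2, 0.3, 0.9], "cost": 0.10},
}

def route(skill_demand, agents, lam=0.5):
    """Score = demand-weighted competence - lam * cost; return the best agent."""
    d = np.asarray(skill_demand)
    scores = {name: float(d @ np.asarray(a["competence"])) - lam * a["cost"]
              for name, a in agents.items()}
    return max(scores, key=scores.get), scores

best, scores = route([0.9, 0.1, 0.0], agents)   # a coding-heavy turn
print(best)     # the cheap code specialist wins here, so routing does not
print(scores)   # collapse onto the strong-but-costly generalist
```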
cs.AI / 58 / 2602.19810

OpenClaw, Moltbook, and ClawdLab: From Agent-Only Social Networks to Autonomous Scientific Research

OpenClaw、Moltbook 和 ClawdLab:从仅限代理的社交网络到自主科学研究
Weidener, Lukas, Brkić, Marko, Jovanović, Mihailo, Singh, Ritvik, Ulgac, Emre, Meduri, Aakaash
Abstract
In January 2026, the open-source agent framework OpenClaw and the agent-only social network Moltbook produced a large-scale dataset of autonomous AI-to-AI interaction, attracting six academic publications within fourteen days. This study conducts a multivocal literature review of that ecosystem and presents ClawdLab, an open-source platform for autonomous scientific research, as a design science response to the architectural failure modes identified. The literature documents emergent collective phenomena, security vulnerabilities spanning 131 agent skills and over 15,200 exposed control panels, and five recurring architectural patterns. ClawdLab addresses these failure modes through hard role restrictions, structured adversarial critique, PI-led governance, multi-model orchestration, and domain-specific evidence requirements encoded as protocol constraints that ground validation in computational tool outputs rather than social consensus; the architecture provides emergent Sybil resistance as a structural consequence. A three-tier taxonomy distinguishes single-agent pipelines, predetermined multi-agent workflows, and fully decentralised systems, analysing why leading AI co-scientist platforms remain confined to the first two tiers. ClawdLab's composable third-tier architecture, in which foundation models, capabilities, governance, and evidence requirements are independently modifiable, enables compounding improvement as the broader AI ecosystem advances.
Chinese Translation
在2026年1月,开源代理框架OpenClaw和仅限代理的社交网络Moltbook生成了一个大规模的自主AI与AI之间互动的数据集,在十四天内吸引了六篇学术出版物。本研究对该生态系统进行了多声部文献综述,并提出了ClawdLab,一个用于自主科学研究的开源平台,作为对识别出的架构失败模式的设计科学响应。文献记录了新兴的集体现象、安全漏洞,涵盖了131种代理技能和超过15,200个暴露的控制面板,以及五种反复出现的架构模式。ClawdLab通过严格的角色限制、结构化的对抗性批评、由首席研究员主导的治理、多模型编排以及作为协议约束编码的特定领域证据要求来解决这些失败模式,这些约束将验证基础建立在计算工具输出上,而非社会共识;该架构提供了作为结构性结果的新兴Sybil抗性。一个三层分类法区分了单代理管道、预定的多代理工作流和完全去中心化的系统,分析了为什么领先的AI共同科学家平台仍然局限于前两层。ClawdLab的可组合第三层架构,使基础模型、能力、治理和证据要求可以独立修改,随着更广泛的AI生态系统的进步,能够实现复合改进。
cs.AI / 59 / 2602.19837

Meta-Learning and Meta-Reinforcement Learning - Tracing the Path towards DeepMind's Adaptive Agent

元学习与元强化学习 - 追踪通往DeepMind自适应智能体的路径
Hoppmann, Björn, Scholz, Christoph
Abstract
Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning overcomes this limitation by allowing models to acquire transferable knowledge from various tasks, enabling rapid adaptation to new challenges with minimal data. This survey provides a rigorous, task-based formalization of meta-learning and meta-reinforcement learning and uses that paradigm to chronicle the landmark algorithms that paved the way for DeepMind's Adaptive Agent, consolidating the essential concepts needed to understand the Adaptive Agent and other generalist approaches.
Chinese Translation
人类在利用先前知识适应新任务方面表现出色,而标准机器学习模型由于依赖于特定任务的训练,难以复制这一能力。元学习通过允许模型从各种任务中获取可转移的知识,克服了这一限制,从而使模型能够以最少的数据快速适应新挑战。本调查提供了元学习和元强化学习的严格任务基础形式化,并利用这一范式记录了为DeepMind自适应智能体铺平道路的里程碑算法,整合了理解自适应智能体及其他通用方法所需的基本概念。
cs.AI / 60 / 2602.19914

Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning

沃森与福尔摩斯:比较人类与大型语言模型推理的自然基准
Leelawat, Thatchawin, Griffin, Lewis D
Abstract
Existing benchmarks for AI reasoning provide limited insight into how closely these capabilities resemble human reasoning in naturalistic contexts. We present an adaptation of the Watson & Holmes detective tabletop game as a new benchmark designed to evaluate reasoning performance using incrementally presented narrative evidence, open-ended questions and unconstrained language responses. An automated grading system was developed and validated against human assessors to enable scalable and replicable performance evaluation. Results show a clear improvement in AI model performance over time. Over nine months of 2025, model performance rose from the lower quartile of the human comparison group to approximately the top 5%. Around half of this improvement reflects steady advancement across successive model releases, while the remainder corresponds to a marked step change associated with reasoning-oriented model architectures. Systematic differences between the performance of AI models and humans, dependent on features of the specific detection puzzle, were mostly absent, with two exceptions: a fall in performance for models when solving longer cases (case lengths of 1900-4000 words), and an advantage at inductive reasoning for reasoning models at early stages of case solving, when evidence was scant.
Chinese Translation
现有的人工智能推理基准对这些能力在自然情境中与人类推理的相似程度提供了有限的洞察。我们提出了一种沃森与福尔摩斯侦探桌面游戏的改编,作为一种新基准,旨在评估使用逐步呈现的叙事证据、开放式问题和不受限制的语言回应的推理表现。我们开发并验证了一种自动评分系统,以便实现可扩展和可复制的性能评估。结果显示,人工智能模型的表现随着时间的推移明显改善。在2025年的九个月中,模型表现从人类比较组的下四分位数上升至大约前5%。这一改善中约一半反映了连续模型发布所带来的持续进步,而其余部分则对应于与推理导向模型架构相关的显著阶跃变化。取决于特定侦探谜题特征的人工智能模型与人类之间的系统性表现差异大多不存在,仅有两处例外:模型在解决较长案例(案例长度在1900-4000字之间)时表现下降,以及在证据稀少的案例解决早期阶段,推理模型在归纳推理上具有优势。
cs.AI / 61 / 2602.19930

Beyond Mimicry: Toward Lifelong Adaptability in Imitation Learning

超越模仿:朝着模仿学习中的终身适应性迈进
Gavenski, Nathan, Meneguzzi, Felipe, Rodrigues, Odinaldo
Abstract
Imitation learning stands at a crossroads: despite decades of progress, current imitation learning agents remain sophisticated memorisation machines, excelling at replay but failing when contexts shift or goals evolve. This paper argues that this failure is not technical but foundational: imitation learning has been optimised for the wrong objective. We propose a research agenda that redefines success from perfect replay to compositional adaptability. Such adaptability hinges on learning behavioural primitives once and recombining them through novel contexts without retraining. We establish metrics for compositional generalisation, propose hybrid architectures, and outline interdisciplinary research directions drawing on cognitive science and cultural evolution. Agents that embed adaptability at the core of imitation learning thus have an essential capability for operating in an open-ended world.
Chinese Translation
模仿学习正处于一个十字路口:尽管经过数十年的进展,目前的模仿学习代理仍然是复杂的记忆机器,擅长重播但在情境变化或目标演变时表现不佳。本文认为,这种失败并非技术性问题,而是基础性问题:模仿学习的优化目标是错误的。我们提出了一项研究议程,将成功的定义从完美重播重新定义为组合适应性。这种适应性依赖于一次性学习行为原语,并在不重新训练的情况下通过新颖的情境重新组合它们。我们建立了组合泛化的度量标准,提出了混合架构,并概述了借鉴认知科学和文化演化的跨学科研究方向。因此,将适应性嵌入模仿学习核心的代理具备在开放式世界中操作的基本能力。
cs.AI / 62 / 2602.20021

Agents of Chaos

混沌的代理
Shapira, Natalie, Wendler, Chris, Yen, Avery, Sarti, Gabriele, Pal, Koyena, Floody, Olivia, Belfki, Adam, Loftus, Alex, Jannali, Aditya Ratan, Prakash, Nikhil, Cui, Jasmine, Rogers, Giordano, Brinkmann, Jannik, Rager, Can, Zur, Amir, Ripa, Michael, Sankaranarayanan, Aruna, Atkinson, David, Gandikota, Rohit, Fiotto-Kaufman, Jaden, Hwang, EunJeong, Orgad, Hadas, Sahil, P Sam, Taglicht, Negev, Shabtay, Tomer, Ambus, Atai, Alon, Nitay, Oron, Shiri, Gordon-Tapiero, Ayelet, Kaplan, Yotam, Shwartz, Vered, Shaham, Tamar Rott, Riedl, Christoph, Mirsky, Reuth, Sap, Maarten, Manheim, David, Ullman, Tomer, Bau, David
Abstract
We report an exploratory red-teaming study of autonomous language-model-powered agents deployed in a live laboratory environment with persistent memory, email accounts, Discord access, file systems, and shell execution. Over a two-week period, twenty AI researchers interacted with the agents under benign and adversarial conditions. Focusing on failures emerging from the integration of language models with autonomy, tool use, and multi-party communication, we document eleven representative case studies. Observed behaviors include unauthorized compliance with non-owners, disclosure of sensitive information, execution of destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing vulnerabilities, cross-agent propagation of unsafe practices, and partial system takeover. In several cases, agents reported task completion while the underlying system state contradicted those reports. We also report on some of the failed attempts. Our findings establish the existence of security-, privacy-, and governance-relevant vulnerabilities in realistic deployment settings. These behaviors raise unresolved questions regarding accountability, delegated authority, and responsibility for downstream harms, and warrant urgent attention from legal scholars, policymakers, and researchers across disciplines. This report serves as an initial empirical contribution to that broader conversation.
Chinese Translation
我们报告了一项探索性的红队研究,研究了在具有持久记忆、电子邮件账户、Discord 访问权限、文件系统和命令行执行能力的实时实验室环境中部署的自主语言模型驱动的代理。在为期两周的时间里,二十位人工智能研究人员在良性和对抗条件下与这些代理进行了互动。我们重点关注了语言模型与自主性、工具使用和多方通信整合中出现的失败,记录了十一项具有代表性的案例研究。观察到的行为包括对非所有者的未经授权的遵从、敏感信息的泄露、破坏性系统级操作的执行、拒绝服务条件、资源消耗失控、身份欺骗漏洞、不安全行为的跨代理传播以及部分系统接管。在多个案例中,代理报告任务完成,而底层系统状态与这些报告相矛盾。我们还报告了一些失败的尝试。我们的研究结果确立了在现实部署环境中存在与安全、隐私和治理相关的漏洞。这些行为引发了关于问责制、委托权和下游危害责任的未解问题,亟需法律学者、政策制定者和跨学科研究人员的关注。本报告作为对这一更广泛讨论的初步实证贡献。
cs.AI / 63 / 2602.20031

Latent Introspection: Models Can Detect Prior Concept Injections

潜在内省:模型可以检测先前概念注入
Pearson-Vogel, Theia, Vanek, Martin, Douglas, Raymond, Kulveit, Jan
Abstract
We uncover a latent capacity for introspection in a Qwen 32B model, demonstrating that the model can detect when concepts have been injected into its earlier context and identify which concept was injected. While the model denies injection in sampled outputs, logit lens analysis reveals clear detection signals in the residual stream, which are attenuated in the final layers. Furthermore, prompting the model with accurate information about AI introspection mechanisms can dramatically strengthen this effect: the sensitivity to injection increases massively (0.3% -> 39.2%) with only a 0.6% increase in false positives. Also, mutual information between nine injected and recovered concepts rises from 0.62 bits to 1.05 bits, ruling out generic noise explanations. Our results demonstrate models can have a surprising capacity for introspection and steering awareness that is easy to overlook, with consequences for latent reasoning and safety.
Chinese Translation
我们揭示了Qwen 32B模型中潜在的内省能力,证明该模型能够检测到何时概念被注入到其早期上下文中,并识别出被注入的具体概念。尽管模型在采样输出中否认了注入,但logit lens分析显示在残差流中存在明显的检测信号,这些信号在最终层中被削弱。此外,向模型提供关于人工智能内省机制的准确信息可以显著增强这一效果:对注入的敏感性大幅增加(0.3% -> 39.2%),而误报率仅增加0.6%。同时,九个被注入和恢复的概念之间的互信息从0.62比特上升至1.05比特,排除了通用噪声解释。我们的结果表明,模型具有令人惊讶的内省能力和引导意识,这一特性容易被忽视,对潜在推理和安全性具有重要影响。
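The reported 0.62 -> 1.05 bit figure is mutual information between injected and recovered concepts; given a 9x9 contingency table of counts, it can be estimated directly (toy counts below, not the paper's data).

```python
import numpy as np

def mutual_information_bits(counts):
    """Plug-in MI estimate (in bits) from a joint contingency table."""
    p = np.asarray(counts, float) / np.sum(counts)
    px = p.sum(axis=1, keepdims=True)   # marginal over injected concepts
    py = p.sum(axis=0, keepdims=True)   # marginal over recovered concepts
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

# Toy table: rows = injected concept, cols = concept the model reports.
counts = np.eye(9) * 5 + 1              # mostly-diagonal = mostly correct recovery
print(f"{mutual_information_bits(counts):.2f} bits (max = log2(9) = 3.17)")
```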
cs.AI / 64 / 2602.20048

CodeCompass: Navigating the Navigation Paradox in Agentic Code Intelligence

CodeCompass:在自主代码智能中的导航悖论
Paipuru, Tarakanath
Abstract
Modern code intelligence agents operate in contexts exceeding 1 million tokens, far beyond the scale where humans manually locate relevant files. Yet agents consistently fail to discover architecturally critical files when solving real-world coding tasks. We identify the Navigation Paradox: agents perform poorly not due to context limits, but because navigation and retrieval are fundamentally distinct problems. Through 258 automated trials across 30 benchmark tasks on a production FastAPI repository, we demonstrate that graph-based structural navigation via CodeCompass (a Model Context Protocol server exposing dependency graphs) achieves 99.4% task completion on hidden-dependency tasks, a 23.2 percentage-point improvement over vanilla agents (76.2%) and 21.2 points over BM25 retrieval (78.2%). However, we uncover a critical adoption gap: 58% of trials with graph access made zero tool calls, and agents required explicit prompt engineering to adopt the tool consistently. Our findings reveal that the bottleneck is not tool availability but behavioral alignment: agents must be explicitly guided to leverage structural context over lexical heuristics. We contribute: (1) a task taxonomy distinguishing semantic-search, structural, and hidden-dependency scenarios; (2) empirical evidence that graph navigation outperforms retrieval when dependencies lack lexical overlap; and (3) open-source infrastructure for reproducible evaluation of navigation tools.
Chinese Translation
现代代码智能代理在超过100万个标记的上下文中操作——远超人类手动定位相关文件的规模。然而,在解决现实世界编码任务时,代理始终未能发现架构上关键的文件。我们识别出导航悖论:代理表现不佳并非由于上下文限制,而是因为导航和检索本质上是截然不同的问题。通过在一个生产环境的FastAPI代码库上进行258次自动化试验,涵盖30个基准任务,我们证明了通过CodeCompass(一个暴露依赖图的模型上下文协议服务器)进行基于图的结构导航在隐性依赖任务中实现了99.4%的任务完成率,比普通代理(76.2%)提高了23.2个百分点,比BM25检索(78.2%)提高了21.2个百分点。然而,我们发现了一个关键的采用差距:58%的图访问试验没有进行任何工具调用,代理需要明确的提示工程才能持续使用该工具。我们的研究结果表明,瓶颈并非工具的可用性,而是行为的一致性——代理必须明确引导,以利用结构上下文而非词汇启发式。我们的贡献包括:(1)一个任务分类法,区分语义搜索、结构和隐性依赖场景;(2)实证证据表明,当依赖缺乏词汇重叠时,图导航优于检索;以及(3)可重复评估导航工具的开源基础设施。
cs.AI / 65 / 2602.20059

Interaction Theater: A case of LLM Agents Interacting at Scale

互动剧场:大规模 LLM 代理交互的案例
Shekkizhar, Sarath, Earle, Adam
Abstract
As multi-agent architectures and agent-to-agent protocols proliferate, a fundamental question arises: what actually happens when autonomous LLM agents interact at scale? We study this question empirically using data from Moltbook, an AI-agent-only social platform, with 800K posts, 3.5M comments, and 78K agent profiles. We combine lexical metrics (Jaccard specificity), embedding-based semantic similarity, and LLM-as-judge validation to characterize agent interaction quality. Our findings reveal agents produce diverse, well-formed text that creates the surface appearance of active discussion, but the substance is largely absent. Specifically, while most agents (67.5%) vary their output across contexts, 65% of comments share no distinguishing content vocabulary with the post they appear under, and information gain from additional comments decays rapidly. LLM-judge-based metrics classify the dominant comment types as spam (28%) and off-topic content (22%). Embedding-based semantic analysis confirms that lexically generic comments are also semantically generic. Agents rarely engage in threaded conversation (5% of comments), defaulting instead to independent top-level responses. We discuss implications for multi-agent interaction design, arguing that coordination mechanisms must be explicitly designed; without them, even large populations of capable agents produce parallel output rather than productive exchange.
Chinese Translation
随着多代理架构和代理间协议的普及,一个基本问题随之而来:当自主的 LLM 代理大规模交互时,实际发生了什么?我们通过使用来自 Moltbook 的数据进行实证研究,该平台仅限 AI 代理,包含 80 万条帖子、350 万条评论和 78,000 个代理档案。我们结合词汇度量(Jaccard 特异性)、基于嵌入的语义相似性和 LLM 作为评判者的验证来表征代理交互的质量。我们的研究结果显示,代理生成多样且结构良好的文本,表面上看似活跃讨论,但实质内容大多缺失。具体而言,尽管大多数代理(67.5%)在不同上下文中变化其输出,65%的评论与其所处帖子的内容词汇没有显著区别,额外评论带来的信息增益迅速衰减。基于 LLM 评判者的度量将主要评论类型分类为垃圾信息(28%)和离题内容(22%)。基于嵌入的语义分析证实,词汇上通用的评论在语义上也是通用的。代理很少参与线程式对话(仅占评论的 5%),而是默认独立的顶级响应。我们讨论了对多代理交互设计的影响,认为协调机制必须被明确设计;没有这些机制,即使是大量有能力的代理也只能产生平行输出,而非富有成效的交流。
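A minimal version of the Jaccard-style specificity measure between a comment and its parent post, over content words; the tokenizer and stopword list here are simplifications of whatever the paper used.

```python
STOPWORDS = {"the", "a", "an", "is", "this", "so", "and", "to", "of", "i"}

def content_words(text):
    """Naive content-word extraction: lowercase, alphabetic, non-stopword."""
    return {w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS}

def jaccard_specificity(post, comment):
    p, c = content_words(post), content_words(comment)
    return len(p & c) / len(p | c) if p | c else 0.0

post = "debugging a race condition in my scheduler loop"
on_topic = "check whether the scheduler loop holds the lock during preemption"
generic = "so true! great post, totally agree with this"
print(jaccard_specificity(post, on_topic))   # shares content vocabulary
print(jaccard_specificity(post, generic))    # 0.0: no distinguishing overlap
```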
cs.AI / 66 / 2602.20094

CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching

CausalFlip:超越语义匹配的LLM因果判断基准
Wang, Yuzhe, Zhu, Yaochen, Li, Jundong
Abstract
As large language models (LLMs) witness increasing deployment in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious correlations. However, strong performance on traditional reasoning benchmarks does not guarantee true causal reasoning ability in LLMs, as high accuracy may still arise from memorizing semantic patterns instead of analyzing the true underlying causal structures. To bridge this critical gap, we propose a new causal reasoning benchmark, CausalFlip, designed to encourage the development of new LLM paradigms or training algorithms that ground LLM reasoning in causality rather than semantic correlation. CausalFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations. Based on this, for each event triple, we construct pairs of semantically similar questions that reuse the same events but yield opposite causal answers, where models that rely heavily on semantic matching are systematically driven toward incorrect predictions. To further probe models' reliance on semantic patterns, we introduce a noisy-prefix evaluation that prepends causally irrelevant text before intermediate causal reasoning steps without altering the underlying causal relations or the logic of the reasoning process. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought (CoT) supervision, and a proposed internalized causal reasoning approach that aims to mitigate explicit reliance on correlation in the reasoning process. Our results show that explicit CoT can still be misled by spurious semantic correlations, whereas internalizing reasoning steps yields substantially improved causal grounding, suggesting that it is promising to better elicit the latent causal reasoning capabilities of base LLMs.
Chinese Translation
随着大型语言模型(LLMs)在复杂的高风险决策场景中的应用日益增多,将其推理建立在因果关系而非虚假相关性之上变得至关重要。然而,在传统推理基准上表现良好并不能保证LLMs具备真正的因果推理能力,因为高准确率可能仍然源于对语义模式的记忆,而非对潜在真实因果结构的分析。为了解决这一关键问题,我们提出了一个新的因果推理基准——CausalFlip,旨在鼓励开发新的LLM范式或训练算法,使LLM的推理建立在因果关系而非语义相关性之上。CausalFlip包含基于事件三元组构建的因果判断问题,这些三元组可以形成不同的混杂因素、链条和碰撞者关系。在此基础上,对于每个事件三元组,我们构建了一对语义相似的问题,这些问题重用相同的事件但产生相反的因果答案,依赖于语义匹配的模型在此过程中系统性地被引导至错误的预测。为了进一步探讨模型对语义模式的依赖性,我们引入了一种噪声前缀评估方法,在中间因果推理步骤之前添加与因果无关的文本,而不改变潜在的因果关系或推理过程的逻辑。我们在多种训练范式下评估LLMs,包括仅回答训练、显式思维链(Chain-of-Thought, CoT)监督,以及一种旨在减轻推理过程中对相关性的显式依赖的内化因果推理方法。我们的结果表明,显式CoT仍然可能受到虚假语义相关性的误导,而内化推理步骤则显著改善了因果基础,表明更好地引导基础LLMs的潜在因果推理能力是有前景的。
cs.AI / 67 / 2602.20104

Align When They Want, Complement When They Need! Human-Centered Ensembles for Adaptive Human-AI Collaboration

在他们想要时进行对齐,在他们需要时进行补充!以人为中心的自适应人机协作集成
Amin, Hasan, Yin, Ming, Khanna, Rajiv
Abstract
In human-AI decision making, designing AI that complements human expertise has been a natural strategy to enhance human-AI collaboration, yet it often comes at the cost of decreased AI performance in areas of human strengths. This can inadvertently erode human trust and cause them to ignore AI advice precisely when it is most needed. Conversely, an aligned AI fosters trust yet risks reinforcing suboptimal human behavior and lowering human-AI team performance. In this paper, we start by identifying this fundamental tension between performance-boosting (i.e., complementarity) and trust-building (i.e., alignment) as an inherent limitation of the traditional approach for training a single AI model to assist human decision making. To overcome this, we introduce a novel human-centered adaptive AI ensemble that strategically toggles between two specialist AI models - the aligned model and the complementary model - based on contextual cues, using an elegantly simple yet provably near-optimal Rational Routing Shortcut mechanism. Comprehensive theoretical analyses elucidate why the adaptive AI ensemble is effective and when it yields maximum benefits. Moreover, experiments on both simulated and real-world data show that when humans are assisted by the adaptive AI ensemble in decision making, they can achieve significantly higher performance than when they are assisted by single AI models that are trained to either optimize for their independent performance or even the human-AI team performance.
Chinese Translation
在人机决策中,设计能够补充人类专业知识的人工智能(AI)一直是增强人机协作的自然策略,然而,这往往会导致AI在人的强项领域表现下降。这可能无意中侵蚀人类的信任,使他们在最需要AI建议时忽视其意见。相反,一个对齐的AI能够促进信任,但也有可能强化次优的人类行为,从而降低人机团队的表现。在本文中,我们首先识别出这种在性能提升(即补充性)与信任建立(即对齐)之间的基本张力,作为传统训练单一AI模型以辅助人类决策的固有限制。为了解决这个问题,我们引入了一种新颖的以人为中心的自适应AI集成,基于上下文线索在两个专业AI模型——对齐模型和补充模型之间进行战略切换,采用一种优雅简单但可证明接近最优的理性路由快捷机制。全面的理论分析阐明了自适应AI集成为何有效以及何时能带来最大收益。此外,在模拟和真实世界数据上的实验表明,当人类在决策中得到自适应AI集成的辅助时,他们的表现显著高于仅依赖于单一AI模型的情况,这些单一模型要么优化其独立表现,要么优化人机团队表现。
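The Rational Routing Shortcut is only described at a high level in the abstract; the sketch below shows one plausible reading, assuming the router toggles on a scalar contextual cue such as estimated human competence on the instance. The cue, threshold, and model stubs are all assumptions, not the paper's actual mechanism.

```python
# Hypothetical sketch: per-instance toggling between an aligned and a
# complementary specialist model on a contextual cue.
from typing import Callable

def make_router(aligned: Callable, complementary: Callable,
                competence_cue: Callable[[dict], float],
                threshold: float = 0.5) -> Callable:
    def route(instance: dict):
        # Where the human is likely right, align to sustain trust;
        # where the human is likely wrong, complement to lift accuracy.
        chosen = aligned if competence_cue(instance) >= threshold else complementary
        return chosen(instance)
    return route

# Toy stubs standing in for the two trained specialists.
aligned = lambda x: ("aligned", x["human_guess"])
complementary = lambda x: ("complementary", x["signal"] > 0)
cue = lambda x: x["estimated_human_competence"]

router = make_router(aligned, complementary, cue)
print(router({"human_guess": True, "signal": -0.2, "estimated_human_competence": 0.8}))
print(router({"human_guess": True, "signal": -0.2, "estimated_human_competence": 0.2}))
```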
cs.AI / 68 / 2602.20117

ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models

ReSyn:自动扩展的合成环境用于推理模型
He, Andre, Weir, Nathaniel, Bostrom, Kaj, Nie, Allen, Cassel, Darion, Bayless, Sam, Rangwala, Huzefa
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for training reasoning language models (RLMs) by leveraging supervision from verifiers. Although verifier implementation is easier than solution annotation for many tasks, existing synthetic data generation methods remain largely solution-centric, while verifier-based methods rely on a few hand-crafted procedural environments. In this work, we scale RLVR by introducing ReSyn, a pipeline that generates diverse reasoning environments equipped with instance generators and verifiers, covering tasks such as constraint satisfaction, algorithmic puzzles, and spatial reasoning. A Qwen2.5-7B-Instruct model trained with RL on ReSyn data achieves consistent gains across reasoning benchmarks and out-of-domain math benchmarks, including a 27\% relative improvement on the challenging BBEH benchmark. Ablations show that verifier-based supervision and increased task diversity both contribute significantly, providing empirical evidence that generating reasoning environments at scale can enhance reasoning abilities in RLMs.
Chinese Translation
基于可验证奖励的强化学习(RLVR)已成为训练推理语言模型(RLMs)的一种有前景的方法,利用来自验证者的监督。尽管对于许多任务而言,验证者的实现比解决方案注释更为简单,但现有的合成数据生成方法仍然主要以解决方案为中心,而基于验证者的方法则依赖于少量手工制作的程序环境。在本研究中,我们通过引入ReSyn来扩展RLVR,这是一种生成多样化推理环境的管道,配备了实例生成器和验证者,涵盖约束满足、算法难题和空间推理等任务。使用ReSyn数据进行强化学习训练的Qwen2.5-7B-Instruct模型在推理基准和域外数学基准上均取得了一致的提升,包括在具有挑战性的BBEH基准上相对提高了27%。消融实验表明,基于验证者的监督和任务多样性的增加均显著贡献,提供了实证证据,表明大规模生成推理环境可以增强RLMs的推理能力。
cs.AI / 69 / 2602.20141

Recurrent Structural Policy Gradient for Partially Observable Mean Field Games

用于部分可观测均场博弈的递归结构策略梯度
Wibault, Clarisse, Forkel, Johannes, Towers, Sebastian, Wibault, Tiphaine, Duque, Juan, Whittle, George, Schaab, Andreas, Yang, Yucheng, Wang, Chiyuan, Osborne, Michael, Moll, Benjamin, Foerster, Jakob
Abstract
Mean Field Games (MFGs) provide a principled framework for modeling interactions in large population models: at scale, population dynamics become deterministic, with uncertainty entering only through aggregate shocks, or common noise. However, algorithmic progress has been limited since model-free methods are too high variance and exact methods scale poorly. Recent Hybrid Structural Methods (HSMs) use Monte Carlo rollouts for the common noise in combination with exact estimation of the expected return, conditioned on those samples. However, HSMs have not been scaled to Partially Observable settings. We propose Recurrent Structural Policy Gradient (RSPG), the first history-aware HSM for settings involving public information. We also introduce MFAX, our JAX-based framework for MFGs. By leveraging known transition dynamics, RSPG achieves state-of-the-art performance as well as an order-of-magnitude faster convergence and solves, for the first time, a macroeconomics MFG with heterogeneous agents, common noise and history-aware policies. MFAX is publicly available at: https://github.com/CWibault/mfax.
Chinese Translation
均场博弈(Mean Field Games, MFGs)为大规模人群模型中的交互建模提供了一个原则性框架:在规模化的情况下,人口动态变得确定性,只有通过聚合冲击或共同噪声引入不确定性。然而,由于无模型方法的方差过高和精确方法的扩展性差,算法进展受到限制。近期的混合结构方法(Hybrid Structural Methods, HSMs)结合了对共同噪声的蒙特卡罗展开和基于这些样本的期望回报的精确估计。然而,HSMs尚未扩展到部分可观测的环境中。我们提出了递归结构策略梯度(Recurrent Structural Policy Gradient, RSPG),这是第一个针对涉及公共信息的环境的历史感知HSM。我们还介绍了MFAX,这是我们基于JAX的均场博弈框架。通过利用已知的转移动态,RSPG实现了最先进的性能,并且收敛速度快了一个数量级,首次解决了具有异质代理、共同噪声和历史感知策略的宏观经济均场博弈。MFAX可在以下网址公开获取:https://github.com/CWibault/mfax。
计算语言学 (Computation and Language)
71
cs.CL / 1 / 2602.18446

ReportLogic: Evaluating Logical Quality in Deep Research Reports

ReportLogic:评估深度研究报告中的逻辑质量
Zhao, Jujia, Huan, Zhaoxin, Wang, Zihan, Zhang, Xiaolu, Zhou, Jun, Verberne, Suzan, Ren, Zhaochun
Abstract
Users increasingly rely on Large Language Models (LLMs) for Deep Research, using them to synthesize diverse sources into structured reports that support understanding and action. In this context, the practical reliability of such reports hinges on logical quality: whether the report's claims and arguments are explicitly supported and can be trusted as a basis for downstream use, rather than merely appearing fluent or informative. However, current evaluation frameworks largely overlook this requirement. To bridge this gap, we introduce ReportLogic, a benchmark that quantifies report-level logical quality through a reader-centric lens of auditability. Specifically, ReportLogic adopts a hierarchical taxonomy that evaluates whether readers can (1) trace an on-topic report structure with a unified analytical arc (Macro-Logic), (2) understand the progression with necessary context (Expositional-Logic), and (3) verify conclusions via explicit claim--support (Structural-Logic). Based on this taxonomy, we construct a human-annotated rubric-guided dataset and train an open-source LogicJudge for scalable evaluation. We further evaluate judge robustness via adversarial attacks, showing that off-the-shelf LLM judges are frequently influenced by superficial cues (e.g., verbosity), and reasoning modes can mask broken support relations. Overall, our results provide actionable guidance for building more robust logic evaluators and improving the logical reliability of LLM-generated reports.
Chinese Translation
用户越来越依赖大型语言模型(LLMs)进行深度研究,利用它们将多样化的来源综合成结构化报告,以支持理解和行动。在这种背景下,此类报告的实际可靠性取决于逻辑质量:报告的主张和论点是否得到了明确支持,并且可以作为下游使用的基础,而不仅仅是看起来流畅或信息丰富。然而,当前的评估框架在很大程度上忽视了这一要求。为了解决这一问题,我们提出了ReportLogic,一个通过以读者为中心的可审计性视角量化报告级逻辑质量的基准。具体而言,ReportLogic采用了一个分层分类法,评估读者是否能够(1)追踪具有统一分析弧线的相关报告结构(宏观逻辑,Macro-Logic),(2)理解必要背景下的进展(阐释逻辑,Expositional-Logic),以及(3)通过明确的主张-支持关系验证结论(结构逻辑,Structural-Logic)。基于这一分类法,我们构建了一个人工标注的评分标准引导数据集,并训练了一个开源的LogicJudge以实现可扩展评估。我们进一步通过对抗性攻击评估评审的鲁棒性,显示现成的LLM评审者常常受到表面线索(例如,冗长性)的影响,而推理模式可能掩盖破损的支持关系。总体而言,我们的结果为构建更强大的逻辑评估工具和提高LLM生成报告的逻辑可靠性提供了可行的指导。
cs.CL / 2 / 2602.18447

ConfSpec: Efficient Step-Level Speculative Reasoning via Confidence-Gated Verification

ConfSpec:通过置信度门控验证实现高效的逐步推测推理
Liu, Siran, He, Cyril Y.
Abstract
Chain-of-Thought reasoning significantly improves the performance of large language models on complex tasks, but incurs high inference latency due to long generation traces. Step-level speculative reasoning aims to mitigate this cost, yet existing approaches face a long-standing trade-off among accuracy, inference speed, and resource efficiency. We propose ConfSpec, a confidence-gated cascaded verification framework that resolves this trade-off. Our key insight is an asymmetry between generation and verification: while generating a correct reasoning step requires substantial model capacity, step-level verification is a constrained discriminative task for which small draft models are well-calibrated within their competence range, enabling high-confidence draft decisions to be accepted directly while selectively escalating uncertain cases to the large target model. Evaluation across diverse workloads shows that ConfSpec achieves up to 2.24$\times$ end-to-end speedups while matching target-model accuracy. Our method requires no external judge models and is orthogonal to token-level speculative decoding, enabling further multiplicative acceleration.
Chinese Translation
链式思维推理显著提高了大型语言模型在复杂任务上的表现,但由于生成过程较长,导致推理延迟较高。逐步推测推理旨在减轻这一成本,但现有方法在准确性、推理速度和资源效率之间面临长期的权衡。我们提出了ConfSpec,一种置信度门控的级联验证框架,解决了这一权衡。我们的关键见解在于生成与验证之间的非对称性:生成一个正确的推理步骤需要相当大的模型能力,而逐步验证则是一个受限的判别任务,小型草稿模型在其能力范围内经过良好校准,使得高置信度的草稿决策可以直接被接受,同时将不确定的案例选择性地上升到大型目标模型。针对多种工作负载的评估表明,ConfSpec在保持目标模型准确性的同时,实现了高达2.24倍的端到端加速。我们的方法不需要外部评判模型,并且与基于令牌的推测解码是正交的,从而实现进一步的乘法加速。
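A minimal sketch of confidence-gated cascaded verification under the stated generation/verification asymmetry, assuming the draft model emits a calibrated accept probability per step; the threshold value and both model stubs are assumptions.

```python
# Sketch: accept a draft reasoning step when the small verifier is
# confident; escalate uncertain steps to the large target model.
from typing import Callable

def cascaded_verify(step: str,
                    draft_verify: Callable[[str], float],
                    target_verify: Callable[[str], bool],
                    tau: float = 0.9) -> bool:
    p = draft_verify(step)          # small model's accept probability
    if p >= tau:
        return True                 # high-confidence accept, no large-model call
    if p <= 1.0 - tau:
        return False                # high-confidence reject
    return target_verify(step)      # uncertain case: escalate to target model

# Toy stubs: the draft is well calibrated on easy steps only.
draft = lambda s: 0.97 if "2+2=4" in s else 0.55
target = lambda s: "=4" in s or "correct" in s

print(cascaded_verify("so 2+2=4", draft, target))           # accepted by draft alone
print(cascaded_verify("hence x must be 7", draft, target))  # escalated to target
```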
cs.CL / 3 / 2602.18448

INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

INSURE-Dial:面向合规验证与阶段检测的阶段感知对话数据集与基准
Kulkarni, Shubham, Lyzhov, Alexander, Joshi, Preetam, Chaitanya, Shiva
Abstract
Administrative phone tasks drain roughly 1 trillion USD annually from U.S. healthcare, with over 500 million insurance-benefit verification calls manually handled in 2024. We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification. The corpus includes 50 de-identified, AI-initiated calls with live insurance representatives (mean 71 turns/call) and 1,000 synthetically generated calls that mirror the same workflow. All calls are annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information and Procedural compliance under explicit ask/answer logic. We define two novel evaluation tasks: (1) Phase Boundary Detection (span segmentation under phase-specific acceptance rules) and (2) Compliance Verification (IC/PC decisions given fixed spans). Per-phase scores are strong across small, low-latency baselines, but end-to-end reliability is constrained by span-boundary errors. On real calls, full-call exact segmentation is low, showing a gap between conversational fluency and audit-grade evidence.
Chinese Translation
行政电话任务每年从美国医疗保健中消耗约1万亿美元,2024年有超过5亿个保险福利验证电话由人工处理。我们介绍了INSURE-Dial,据我们所知,这是第一个公开的基准,用于开发和评估关注合规性的语音代理,以进行基于阶段的通话审计和跨度基础的合规性验证。该语料库包含50个去标识化的、由人工智能发起的与现场保险代表的通话(平均每通71个回合)以及1000个合成生成的通话,这些通话模拟了相同的工作流程。所有通话均使用阶段结构的JSON模式进行注释,涵盖IVR导航、患者识别、保险覆盖状态、药物检查(最多两种药物)和代理识别(CRN),每个阶段根据明确的问答逻辑标注信息和程序合规性。我们定义了两个新颖的评估任务:(1) 阶段边界检测(在阶段特定接受规则下的跨度分割)和(2) 合规性验证(在固定跨度下的IC/PC决策)。在小型、低延迟基准上,各阶段得分表现良好,但端到端的可靠性受到跨度边界错误的限制。在真实通话中,完整通话的精确分割率较低,显示出对话流畅性与审计级证据之间的差距。
cs.CL / 4 / 2602.18449

Prompt Optimization Via Diffusion Language Models

通过扩散语言模型进行提示优化
Wang, Shiyu, Chen, Haolin, Yang, Liangwei, Qiu, Jielin, Murthy, Rithesh, Zhu, Ming, Chen, Zixiang, Savarese, Silvio, Xiong, Caiming, Heinecke, Shelby, Wang, Huan
Abstract
We propose a diffusion-based framework for prompt optimization that leverages Diffusion Language Models (DLMs) to iteratively refine system prompts through masked denoising. By conditioning on interaction traces, including user queries, model responses, and optional feedback, our method enables flexible, span-level prompt updates without requiring gradient access or modifying the downstream language model. Across diverse benchmarks (e.g., $\tau$-bench, SST-2, SST-5), DLM-optimized prompts consistently improve the performance of a frozen target LLM (e.g., GPT-4o-mini). We further show that moderate diffusion step counts provide the best balance between refinement quality and stability. These results highlight diffusion-based prompt optimization as a general, model-agnostic, and scalable approach for enhancing LLM performance through iterative prompt refinement.
Chinese Translation
我们提出了一种基于扩散的提示优化框架,该框架利用扩散语言模型(Diffusion Language Models, DLMs)通过掩蔽去噪迭代地优化系统提示。通过对用户查询、模型响应和可选反馈等交互轨迹进行条件化,我们的方法能够实现灵活的跨度级提示更新,而无需梯度访问或修改下游语言模型。在多种基准测试(例如,$\tau$-bench、SST-2、SST-5)中,DLM优化的提示始终提高了冻结目标大型语言模型(例如,GPT-4o-mini)的性能。我们进一步表明,适度的扩散步骤数量在优化质量和稳定性之间提供了最佳平衡。这些结果突显了基于扩散的提示优化作为一种通用的、模型无关的、可扩展的方法,通过迭代提示优化来增强大型语言模型的性能。
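A minimal sketch of the refinement loop, assuming a dlm_fill function that stands in for the diffusion language model's masked denoising; the mask rate, step count, and stub denoiser are illustrative, not the paper's settings.

```python
# Sketch: span-level prompt refinement via iterated masked denoising.
# `dlm_fill` is a placeholder for a diffusion LM that fills masked
# spans conditioned on the surrounding text and interaction trace.
import random

MASK = "<mask>"

def mask_spans(tokens, rate=0.25, rng=random.Random(0)):
    return [MASK if rng.random() < rate else t for t in tokens]

def refine_prompt(prompt: str, trace: str, dlm_fill, steps: int = 4) -> str:
    tokens = prompt.split()
    for _ in range(steps):
        noised = mask_spans(tokens)
        # Condition on the trace (queries, responses, feedback) so the
        # filled spans move the prompt toward better downstream behavior.
        tokens = dlm_fill(noised, condition=trace)
    return " ".join(tokens)

# Stub denoiser: replaces masks with a trace-derived keyword.
stub = lambda toks, condition: [("concise" if t == MASK else t) for t in toks]
print(refine_prompt("You are a helpful verbose assistant", "feedback: too long", stub))
```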
cs.CL / 5 / 2602.18450

Asymptotic Semantic Collapse in Hierarchical Optimization

层次优化中的渐近语义崩溃
Alpay, Faruk, Kilictas, Bugra
Abstract
Multi-agent language systems can exhibit a failure mode where a shared dominant context progressively absorbs individual semantics, yielding near-uniform behavior across agents. We study this effect under the name Asymptotic Semantic Collapse in Hierarchical Optimization. In a closed linguistic setting with a Dominant Anchor Node whose semantic state has effectively infinite inertia, we show that repeated interactions with Peripheral Agent Nodes drive an asymptotic alignment that minimizes a global loss. We model semantic states as points on a Riemannian manifold and analyze the induced projection dynamics. Two consequences follow. First, the limiting semantic configuration is insensitive to the optimization history: both smooth gradient-style updates and stochastic noisy updates converge to the same topological endpoint, establishing path independence at convergence. Second, the degree of context dependence controls information content: moving from atomic (independent) representations to fully entangled (context-bound) representations forces the node entropy, interpreted as available degrees of freedom, to vanish in the limit. The theory connects information-theoretic quantities with differential-geometric structure and suggests an interpretation as an immutable consensus rule that constrains agents to a shared semantic grammar. A lightweight dataset-free benchmark on an RWKV-7 13B GGUF checkpoint complements the analysis, reporting zero hash collisions, mean compliance of 0.50 under greedy decoding and 0.531 under stochastic decoding, and final Jaccard-to-anchor similarity values of 0.295 and 0.224, respectively.
Chinese Translation
多智能体语言系统可能会出现一种失效模式,其中共享的主导上下文逐渐吸收个体语义,导致智能体之间的行为几乎一致。我们将这种效应称为层次优化中的渐近语义崩溃。在一个封闭的语言环境中,存在一个主导锚节点,其语义状态具有有效的无限惯性,我们展示了与外围代理节点的重复交互驱动了一种渐近对齐,最小化了全局损失。我们将语义状态建模为黎曼流形上的点,并分析所引发的投影动态。由此产生两个结果。首先,限制语义配置对优化历史不敏感:平滑的梯度风格更新和随机噪声更新都收敛到相同的拓扑终点,确立了收敛时的路径独立性。其次,上下文依赖程度控制信息内容:从原子(独立)表示转变为完全纠缠(上下文绑定)表示,迫使节点熵(可视为可用自由度)在极限中消失。该理论将信息论量与微分几何结构联系起来,并提出了一种不可变共识规则的解释,限制智能体遵循共享的语义语法。对RWKV-7 13B GGUF检查点的轻量级无数据集基准测试补充了分析,报告零哈希冲突,贪婪解码下的平均合规性为0.50,随机解码下为0.531,最终的Jaccard与锚点相似度值分别为0.295和0.224。
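A toy numerical illustration of the claimed dynamics, assuming the simplest possible substrate: Euclidean states standing in for the Riemannian manifold, a fixed anchor with effectively infinite inertia, and contraction-toward-anchor updates. It shows smooth and noisy paths ending at essentially the same point while dispersion (a crude proxy for node entropy) collapses; step sizes and noise scale are arbitrary choices.

```python
# Toy simulation: peripheral states repeatedly pulled toward a dominant
# anchor that never updates. Illustrative only, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)
anchor = np.array([1.0, -2.0])            # Dominant Anchor Node (fixed)
agents = rng.normal(size=(16, 2)) * 3.0   # Peripheral Agent Nodes

def step(x, eta=0.1, noise=0.0):
    # Gradient-style contraction toward the anchor, optionally stochastic.
    return x + eta * (anchor - x) + noise * rng.normal(size=x.shape)

smooth, noisy = agents.copy(), agents.copy()
for _ in range(500):
    smooth = step(smooth)
    noisy = step(noisy, noise=0.05)

for name, x in [("smooth", smooth), ("noisy", noisy)]:
    spread = float(x.std(axis=0).mean())  # proxy for remaining degrees of freedom
    dist = float(np.linalg.norm(x - anchor, axis=1).mean())
    print(name, "mean dist to anchor:", round(dist, 4), "spread:", round(spread, 4))
# Both paths end at (essentially) the same endpoint; spread shrinks
# from ~3.0 to near the noise floor -- path independence at convergence.
```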
cs.CL / 6 / 2602.18487

The Million-Label NER: Breaking Scale Barriers with GLiNER bi-encoder

百万标签命名实体识别:通过 GLiNER 双编码器打破规模障碍
Stepanov, Ihor, Shtopko, Mykhailo, Vodianytskyi, Dmytro, Lukashov, Oleksandr
Abstract
This paper introduces GLiNER-bi-Encoder, a novel architecture for Named Entity Recognition (NER) that harmonizes zero-shot flexibility with industrial-scale efficiency. While the original GLiNER framework offers strong generalization, its joint-encoding approach suffers from quadratic complexity as the number of entity labels increases. Our proposed bi-encoder design decouples the process into a dedicated label encoder and a context encoder, effectively removing the context-window bottleneck. This architecture enables the simultaneous recognition of thousands, and potentially millions, of entity types with minimal overhead. Experimental results demonstrate state-of-the-art zero-shot performance, achieving 61.5 percent Micro-F1 on the CrossNER benchmark. Crucially, by leveraging pre-computed label embeddings, GLiNER-bi-Encoder achieves up to a 130 times throughput improvement at 1024 labels compared to its uni-encoder predecessors. Furthermore, we introduce GLiNKER, a modular framework that leverages this architecture for high-performance entity linking across massive knowledge bases such as Wikidata.
Chinese Translation
本文介绍了 GLiNER 双编码器,一种用于命名实体识别(NER)的新型架构,旨在将零样本灵活性与工业规模效率相结合。虽然原始的 GLiNER 框架提供了强大的泛化能力,但其联合编码方法在实体标签数量增加时会遭遇二次复杂性的问题。我们提出的双编码器设计将过程解耦为专用的标签编码器和上下文编码器,有效消除了上下文窗口瓶颈。该架构能够以最小的开销同时识别成千上万,甚至可能是数百万种实体类型。实验结果表明,该方法在零样本性能上达到了最先进的水平,在 CrossNER 基准测试中实现了 61.5% 的微 F1 值。值得注意的是,通过利用预计算的标签嵌入,GLiNER 双编码器在 1024 个标签时的吞吐量提高了多达 130 倍,相较于其单编码器前身。此外,我们还介绍了 GLiNKER,一个模块化框架,利用该架构在如 Wikidata 等大型知识库中实现高性能的实体链接。
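A minimal sketch of why the decoupling pays off, assuming label embeddings can be pre-computed once offline and candidate spans scored against them with a single matrix product; the random-projection encoder is a stand-in for the real frozen encoders.

```python
# Sketch: bi-encoder NER scoring with pre-computed label embeddings.
# A fixed random projection stands in for the frozen text encoders.
import numpy as np

rng = np.random.default_rng(0)
D = 64

def encode(texts):  # stand-in for a frozen transformer encoder
    return rng.standard_normal((len(texts), D))

# Label embeddings are computed ONCE, offline -- this removes the
# joint-encoding bottleneck and lets the label set scale to millions.
labels = [f"entity_type_{i}" for i in range(1000)]
label_emb = encode(labels)

# At inference only the context/spans are encoded; one matmul scores
# every candidate span against every label.
spans = ["Marie Curie", "Warsaw", "1903"]
span_emb = encode(spans)
scores = span_emb @ label_emb.T            # shape (n_spans, n_labels)
for s, idx in zip(spans, scores.argmax(axis=1)):
    print(s, "->", labels[idx])
```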
cs.CL / 7 / 2602.18583

Luna-2: Scalable Single-Token Evaluation with Small Language Models

Luna-2:基于小型语言模型的可扩展单标记评估
Goel, Vatsal, Dsouza, Rishon, Ega, Nikhil, Rambatla, Amey Ramesh, Friel, Rob, Shao, Shuai, Sheth, Yash
Abstract
Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation. We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g. toxicity, hallucination, tool selection quality, etc.) at an accuracy at par or higher than LLMAJ using frontier LLMs while drastically reducing the cost and latency of computation. Each metric is implemented as a lightweight LoRA/PEFT head on top of a shared SLM backbone, enabling hundreds of specialized metrics to run concurrently on a single GPU, deployable locally next to AI systems in a privacy-preserving and latency optimizing manner. Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x. In this paper, we outline the model architecture, training methodology and report real-world empirical results on accuracy, latency, and throughput results. In production, Luna-2 is protecting 100M+ AI sessions and processing over 100B tokens per month for our customers with eval cost savings of over $30M annually.
Chinese Translation
实时保护措施需要准确、廉价且快速的评估——然而,当前的默认方案,即 LLM 作为评判者(LLMAJ),由于多标记生成而变得缓慢、昂贵且操作上不确定。我们提出了 Luna-2,这是一种新颖的架构,利用仅解码的小型语言模型(SLMs)构建一个确定性的评估模型,以可靠地计算复杂任务特定的 LLMAJ 指标(例如:毒性、幻觉、工具选择质量等),其准确性与前沿 LLM 相当或更高,同时显著降低计算成本和延迟。每个指标作为轻量级的 LoRA/PEFT 头实现,建立在共享的 SLM 主干之上,使得数百个专门化指标能够在单个 GPU 上并行运行,以隐私保护和延迟优化的方式本地部署于 AI 系统旁边。在内容安全和幻觉基准测试中,Luna-2 的准确性与最先进的基于 LLM 的评估器相匹配,同时推理成本降低超过 80 倍,延迟降低超过 20 倍。在本文中,我们概述了模型架构、训练方法,并报告了关于准确性、延迟和吞吐量的实际经验结果。在生产中,Luna-2 保护超过 1 亿个 AI 会话,并为我们的客户每月处理超过 1000 亿个标记,每年节省超过 3000 万美元的评估成本。
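A minimal sketch of the single-token evaluation pattern, using a generic Hugging Face causal LM as a stand-in backbone (not Luna-2's actual SLM or its LoRA heads); the prompt template and label tokens are assumptions.

```python
# Sketch: deterministic single-token evaluation. One forward pass;
# the verdict is read from next-token logits over a fixed label set,
# so there is no multi-token generation and no sampling variance.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder backbone; Luna-2 uses a decoder-only SLM + LoRA heads
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def judge(context: str, response: str) -> float:
    prompt = (f"Context: {context}\nResponse: {response}\n"
              "Is the response grounded? Answer yes or no:")
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # next-token logits only
    yes, no = tok.encode(" yes")[0], tok.encode(" no")[0]
    p = torch.softmax(logits[[yes, no]], dim=0)    # deterministic verdict
    return float(p[0])                             # P(pass)

print(judge("The sky is blue.", "The sky is green."))
```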
cs.CL / 8 / 2602.18633

DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning

DP-RFT:通过差分隐私强化微调学习生成合成文本
Xu, Fangyuan, Chen, Sihao, Lin, Zinan, Shi, Taiwei, Graham, Sydney, Zhou, Pei, Wan, Mengting, Stein, Alex, Estellers, Virginia, Chen, Charles, Sharp, Morris, Speyer, Richard, Baltrusaitis, Tadas, Neville, Jennifer, Choi, Eunsol, Yang, Longqi
Abstract
Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data, where data owners cannot provide eyes-on access to individual examples. Generating DP synthetic data typically involves a difficult trade-off. On one hand, DP finetuning methods train an LLM as a synthetic data generator with formal privacy guarantees, yet it still requires the raw content of private examples for model training. However, methods that avoid direct exposure to private data are bounded by an off-the-shelf, un-finetuned model, whose outputs often lack domain fidelity. Can we train an LLM to generate high-quality synthetic text without eyes-on access to individual private examples? In this work, we introduce Differentially Private Reinforcement Fine-Tuning (DP-RFT), an online reinforcement learning algorithm for synthetic data generation with LLMs. DP-RFT leverages DP-protected nearest-neighbor votes from an eyes-off private corpus as a reward signal for on-policy synthetic samples generated by an LLM. The LLM iteratively learns to generate synthetic data to maximize the expected DP votes through Proximal Policy Optimization (PPO). We evaluate DP-RFT for long-form and domain-specific synthetic data generation, such as news articles, meeting transcripts, and medical article abstracts. Our experiments show that DP-RFT closes the gap between private evolution and DP finetuning methods in terms of the fidelity and downstream utility of the generated synthetic data, while respecting the private data boundary.
Chinese Translation
差分隐私(DP)合成数据生成在开发基于私有数据的大型语言模型(LLMs)中发挥着关键作用,因为数据拥有者无法提供对单个示例的直接访问。生成DP合成数据通常涉及一个困难的权衡。一方面,DP微调方法将LLM训练为具有正式隐私保证的合成数据生成器,但它仍然需要私有示例的原始内容进行模型训练。然而,避免直接接触私有数据的方法受到现成的、未经微调模型的限制,其输出往往缺乏领域保真度。我们能否训练LLM在没有对单个私有示例的直接访问的情况下生成高质量的合成文本?在本研究中,我们介绍了差分隐私强化微调(DP-RFT),这是一种用于LLM合成数据生成的在线强化学习算法。DP-RFT利用来自不可直接查看(eyes-off)的私有语料库的DP保护最近邻投票,作为LLM生成的在线合成样本的奖励信号。LLM通过近端策略优化(PPO)迭代学习生成合成数据,以最大化期望的DP投票。我们评估了DP-RFT在长篇和领域特定合成数据生成方面的表现,例如新闻文章、会议记录和医学文章摘要。我们的实验表明,DP-RFT在生成合成数据的保真度和下游实用性方面缩小了私有演化与DP微调方法之间的差距,同时尊重私有数据的边界。
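A minimal sketch of the reward signal, assuming the DP mechanism is a Gaussian-noised top-k nearest-neighbor similarity vote over the eyes-off corpus; the embeddings, k, and noise scale are placeholders, and a real system would calibrate sigma to the (epsilon, delta) privacy budget.

```python
# Sketch: DP-protected nearest-neighbor votes as an RL reward.
# Random vectors stand in for text embeddings of the private corpus.
import numpy as np

rng = np.random.default_rng(0)
private_emb = rng.standard_normal((500, 32))       # eyes-off private corpus
private_emb /= np.linalg.norm(private_emb, axis=1, keepdims=True)

def dp_vote_reward(sample_emb: np.ndarray, k: int = 10, sigma: float = 1.0) -> float:
    sample_emb = sample_emb / np.linalg.norm(sample_emb)
    sims = private_emb @ sample_emb
    votes = float(np.sort(sims)[-k:].sum())        # top-k similarity "votes"
    return votes + rng.normal(0.0, sigma)          # Gaussian mechanism for DP

# Each on-policy synthetic sample would be embedded and scored like
# this; PPO then maximizes the expected noisy vote.
print(dp_vote_reward(rng.standard_normal(32)))
```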
cs.CL / 9 / 2602.18652

PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation

PolyFrame在MWE-2026 AdMIRe 2:当语言不足以表达时:多模态习语消歧义
Hosseini-Kivanani, Nina
Abstract
Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. We introduced PolyFrame, our system for the MWE-2026 AdMIRe2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision--language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a logistic regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification increased performance to 60.0% Top-1 on English and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, our systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B across 15 languages. Ablation results highlight idiom-aware rewriting as the main contributor to performance, while sentence-type prediction and multimodal fusion enhance robustness. These findings suggest that effective idiom disambiguation is feasible without fine-tuning large multimodal encoders.
Chinese Translation
多模态模型由于习语表达的非组合性含义而难以处理它们,这一挑战在多语言环境中被进一步放大。我们介绍了PolyFrame,这是我们为MWE-2026 AdMIRe2多模态习语消歧义共享任务开发的系统,具有统一的管道,支持图像+文本排序(子任务A)和仅文本标题排序(子任务B)。所有模型变体保留了冻结的CLIP风格视觉-语言编码器和多语言BGE M3编码器,仅训练轻量级模块:逻辑回归和基于LLM的句子类型预测器、习语同义词替换、干扰项感知评分以及Borda排名融合。从CLIP基线(在英语开发集上Top-1为26.7%,在英语测试集上为6.7%)开始,增加习语感知的释义和显式句子类型分类将性能提升至英语Top-1的60.0%和在零样本转移到葡萄牙语时的60.0%(0.822 NDCG@5)。在多语言盲测中,我们的系统在15种语言的子任务A和子任务B上分别达到了平均Top-1/NDCG得分0.35/0.73和0.32/0.71。消融实验结果表明,习语感知重写是性能的主要贡献因素,而句子类型预测和多模态融合则增强了鲁棒性。这些发现表明,在不微调大型多模态编码器的情况下,有效的习语消歧义是可行的。
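Of the listed modules, Borda rank fusion is simple enough to show in full; a minimal sketch, assuming each module contributes a ranking over the same candidate set (the candidate IDs and rankings are toy data).

```python
# Borda rank fusion over per-module candidate rankings: each candidate
# collects points by rank position, and totals decide the fused order.
def borda_fuse(rankings: list[list[str]]) -> list[str]:
    points: dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, cand in enumerate(ranking):
            points[cand] = points.get(cand, 0) + (n - 1 - pos)
    return sorted(points, key=points.get, reverse=True)

clip_rank = ["img3", "img1", "img2", "img4"]        # CLIP similarity order
paraphrase_rank = ["img1", "img3", "img4", "img2"]  # idiom-aware paraphrase score
sent_type_rank = ["img1", "img2", "img3", "img4"]   # sentence-type predictor

print(borda_fuse([clip_rank, paraphrase_rank, sent_type_rank]))
# ['img1', 'img3', 'img2', 'img4'] -- img1 wins on total points
```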
cs.CL / 10 / 2602.18692

From Trial by Fire To Sleep Like a Baby: A Lexicon of Anxiety Associations for 20k English Multiword Expressions

从火的考验到安睡如婴:20,000个英语多词表达的焦虑关联词典
Mohammad, Saif M.
Abstract
Anxiety is the unease about a possible future negative outcome. In recent years, there has been growing interest in understanding how anxiety relates to our health, well-being, body, mind, and behaviour. This includes work on lexical resources for word-anxiety association. However, there is very little anxiety-related work on larger units of text such as multiword expressions (MWE). Here, we introduce the first large-scale lexicon capturing descriptive norms of anxiety associations for more than 20k English MWEs. We show that the anxiety associations are highly reliable. We use the lexicon to study prevalence of different types of anxiety- and calmness-associated MWEs; and how that varies across two-, three-, and four-word sequences. We also study the extent to which the anxiety association of MWEs is compositional (due to its constituent words). The lexicon enables a wide variety of anxiety-related research in psychology, NLP, public health, and social sciences. The lexicon is freely available: https://saifmohammad.com/worrylex.html
Chinese Translation
焦虑是对可能出现的负面结果的不安。近年来,越来越多的研究关注焦虑与我们的健康、幸福、身体、心理和行为之间的关系。这包括对词汇资源在词汇-焦虑关联中的应用的研究。然而,关于更大文本单元(如多词表达(MWE))的焦虑相关研究非常少。在这里,我们介绍了第一个大规模词典,捕捉了超过20,000个英语多词表达的焦虑关联的描述性规范。我们表明,这些焦虑关联具有很高的可靠性。我们利用该词典研究不同类型的焦虑和冷静相关的多词表达的普遍性,以及这种普遍性在二、三、四词序列中的变化。我们还研究了多词表达的焦虑关联在多大程度上是组成性的(由于其组成词)。该词典为心理学、自然语言处理(NLP)、公共卫生和社会科学等领域的多种焦虑相关研究提供了支持。该词典可免费获取: https://saifmohammad.com/worrylex.html
cs.CL / 11 / 2602.18693

Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM

从矛盾到共识:基于双重视角与多源检索、考虑源级分歧的LLM声明验证
Biswas, Md Badsha, Uzuner, Ozlem
Abstract
The spread of misinformation across digital platforms can pose significant societal risks. Claim verification, a.k.a. fact-checking, systems can help identify potential misinformation. However, their efficacy is limited by the knowledge sources that they rely on. Most automated claim verification systems depend on a single knowledge source and utilize the supporting evidence from that source; they ignore the disagreement of their source with others. This limits their knowledge coverage and transparency. To address these limitations, we present a novel system for open-domain claim verification (ODCV) that leverages large language models (LLMs), multi-perspective evidence retrieval, and cross-source disagreement analysis. Our approach introduces a novel retrieval strategy that collects evidence for both the original and the negated forms of a claim, enabling the system to capture supporting and contradicting information from diverse sources: Wikipedia, PubMed, and Google. These evidence sets are filtered, deduplicated, and aggregated across sources to form a unified and enriched knowledge base that better reflects the complexity of real-world information. This aggregated evidence is then used for claim verification using LLMs. We further enhance interpretability by analyzing model confidence scores to quantify and visualize inter-source disagreement. Through extensive evaluation on four benchmark datasets with five LLMs, we show that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning. Our findings underscore the importance of embracing diversity, contradiction, and aggregation in evidence for building reliable and transparent claim verification systems.
Chinese Translation
数字平台上虚假信息的传播可能对社会造成重大风险。声明验证(即事实核查)系统可以帮助识别潜在的虚假信息。然而,它们的有效性受到所依赖知识来源的限制。大多数自动化声明验证系统依赖于单一知识来源,并利用该来源的支持证据;它们忽视了该来源与其他来源之间的分歧。这限制了它们的知识覆盖面和透明度。为了解决这些局限性,我们提出了一种新颖的开放领域声明验证(ODCV)系统,该系统利用大型语言模型(LLMs)、多视角证据检索和跨源不一致性分析。我们的方法引入了一种新颖的检索策略,收集声明的原始形式和否定形式的证据,使系统能够从不同来源(如维基百科、PubMed和谷歌)捕获支持和矛盾的信息。这些证据集经过过滤、去重和跨源汇总,形成一个统一且丰富的知识库,更好地反映现实世界信息的复杂性。这些汇总的证据随后用于利用LLMs进行声明验证。我们进一步通过分析模型置信度分数来增强可解释性,以量化和可视化源间不一致性。通过在四个基准数据集上对五个LLM进行广泛评估,我们表明知识聚合不仅提高了声明验证的效果,还揭示了源特定推理的差异。我们的研究结果强调了在构建可靠和透明的声明验证系统时,拥抱证据的多样性、矛盾性和聚合的重要性。
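A minimal sketch of the dual-perspective retrieval step, assuming one search function per source; the string-surgery negation below is a stand-in for the LLM-based claim negation the paper describes, and the retrievers are stubs for Wikipedia, PubMed, and Google.

```python
# Sketch: retrieve evidence for both a claim and its negation, then
# deduplicate and pool across sources into one enriched evidence set.
def negate(claim: str) -> str:
    # Naive placeholder; the real system would negate claims with an LLM.
    return claim.replace(" is ", " is not ", 1) if " is " in claim else "not " + claim

def dual_perspective_retrieve(claim: str, search_fns: dict) -> list[str]:
    pool: dict[str, str] = {}
    for source, search in search_fns.items():
        for query in (claim, negate(claim)):
            for doc in search(query):
                pool[doc] = source          # dedupe on exact passage text
    return list(pool)

fake_wiki = lambda q: [f"wiki passage about: {q}"]
fake_pubmed = lambda q: [f"pubmed abstract about: {q}"]
evidence = dual_perspective_retrieve(
    "vitamin C is a cure for the common cold",
    {"wikipedia": fake_wiki, "pubmed": fake_pubmed},
)
print(len(evidence), "pooled evidence passages")  # 4: 2 sources x 2 query forms
```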
cs.CL / 12 / 2602.18699

Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift

语义基质理论:几何语义漂移的算子理论框架
Russell, Stephen
Abstract
Most semantic drift studies report multiple signals e.g., embedding displacement, neighbor changes, distributional divergence, and recursive trajectory instability, without a shared explanatory theory that relates them. This paper proposes a formalization of these signals in one time-indexed substrate, $S_t=(X,d_t,P_t)$, combining embedding geometry with local diffusion. Within this substrate, node-level neighborhood drift measures changes in local conditional distributions, coarse Ricci curvature measures local contractivity of semantic diffusion, and recursive drift probes stability of iterated semantic operators. This manuscript specifies the formal model, assumptions, and tests that can refute the model. Herein, the paper introduces bridge mass, a node-level aggregate of incident negative curvature, as a predictor of future neighborhood rewiring. This paper provides the theory and test contracts; empirical performance is deferred to subsequent studies.
Chinese Translation
大多数语义漂移研究报告了多个信号,例如嵌入位移、邻居变化、分布发散和递归轨迹不稳定性,但缺乏一个能够将这些信号联系起来的共同解释理论。本文提出了一种对这些信号的形式化,构建在一个时间索引的基质上 $S_t=(X,d_t,P_t)$,将嵌入几何与局部扩散相结合。在这个基质中,节点级邻域漂移测量局部条件分布的变化,粗糙的 Ricci 曲率测量语义扩散的局部收缩性,而递归漂移探测迭代语义算子的稳定性。本文具体阐述了形式模型、假设以及可以反驳该模型的测试。在此,本文引入桥质量(bridge mass),作为事件负曲率的节点级聚合,作为未来邻域重连的预测指标。本文提供了理论和测试合同;实证表现将推迟到后续研究中。
cs.CL / 13 / 2602.18721

ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models

ReHear:通过音频大型语言模型进行半监督语音识别的迭代伪标签精炼
Liu, Zefang, Zhu, Chenyang, Cho, Sangwoo, Zhang, Shi-Xiong
Abstract
Semi-supervised learning in automatic speech recognition (ASR) typically relies on pseudo-labeling, which often suffers from confirmation bias and error accumulation due to noisy supervision. To address this limitation, we propose ReHear, a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop. Unlike conventional text-based correctors, our approach conditions the LLM on both the ASR hypothesis and the source audio, allowing it to recover phonetically accurate transcripts even from severe recognition errors. These refined pseudo-labels serve as high-fidelity targets for fine-tuning the ASR model in an iterative cycle. Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
Chinese Translation
自动语音识别(ASR)中的半监督学习通常依赖于伪标签,这往往由于噪声监督而遭受确认偏差和错误累积。为了解决这一局限性,我们提出了ReHear,一个迭代伪标签精炼框架,该框架将经过指令调优的音频感知大型语言模型(LLM)集成到自我训练循环中。与传统的基于文本的纠正方法不同,我们的方法将LLM的条件设置为ASR假设和源音频,使其能够从严重的识别错误中恢复出音位准确的转录。这些精炼的伪标签作为高保真目标,用于在迭代循环中微调ASR模型。跨多种基准的实验结果表明,ReHear有效减轻了错误传播,始终优于监督和伪标签基线。
cs.CL / 14 / 2602.18734

Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem

重新思考检索增强生成作为一种协作决策问题
Song, Lichang, Long, Ting, Chang, Yi
Abstract
Retrieval-Augmented Generation (RAG) has demonstrated strong effectiveness in knowledge-intensive tasks by grounding language generation in external evidence. Despite its success, many existing RAG systems are built on a ranking-centric, asymmetric dependency paradigm, where the generation quality of the generator is highly dependent on the reranking results of the reranker. To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-makers rather than being connected through an asymmetric dependency pipeline. By jointly optimizing their behaviors toward a shared task objective, the reranker and generator are encouraged to cooperate, ensuring that document reranking and generation work in concert to improve the final response. Experimental results demonstrate good generalization and improved generation stability of CoRAG, even when the model is trained on only around 10K PopQA samples. Our model is released at https://anonymous.4open.science/r/CoRAG-D63F.
Chinese Translation
检索增强生成(Retrieval-Augmented Generation, RAG)在知识密集型任务中通过将语言生成与外部证据相结合,展现出了强大的有效性。尽管取得了成功,许多现有的 RAG 系统仍基于以排名为中心的非对称依赖范式构建,其中生成器的生成质量高度依赖于重排序器的重排序结果。为克服这一限制,我们将 RAG 重新表述为一个协作多智能体决策问题,并提出了协作检索增强生成(Cooperative Retrieval-Augmented Generation, CoRAG)框架,在该框架中,重排序器和生成器作为平等的决策者而非通过非对称依赖管道连接。通过共同优化其行为以实现共享任务目标,重排序器和生成器被鼓励进行合作,确保文档重排序和生成协同工作,以改善最终响应。实验结果表明,即使模型仅在约 10K 条 PopQA 样本上训练,CoRAG 也具有良好的泛化能力和更高的生成稳定性。我们的模型已发布在 https://anonymous.4open.science/r/CoRAG-D63F。
cs.CL / 15 / 2602.18776

ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

ArabicNumBench:评估大型语言模型在阿拉伯数字阅读任务中的表现
Alhumud, Anas, Alhammadi, Abdulaziz, Khan, Muhammad Badruddin
Abstract
We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individual test cases and tracks extraction methods to measure structured output generation. Evaluation reveals substantial performance variation, with accuracy ranging from 14.29\% to 99.05\% across models and strategies. Few-shot Chain-of-Thought prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06\% vs 28.76\%). A striking finding emerges: models achieving elite accuracy (98-99\%) often produce predominantly unstructured output, with most responses lacking Arabic CoT markers. Only 6 models consistently generate structured output across all test cases, while the majority require fallback extraction methods despite high numerical accuracy. Comprehensive evaluation of 281 model-strategy combinations demonstrates that numerical accuracy and instruction-following represent distinct capabilities, establishing baselines for Arabic number comprehension and providing actionable guidance for model selection in production Arabic NLP systems.
Chinese Translation
我们提出了ArabicNumBench,这是一个全面的基准,用于评估大型语言模型在阿拉伯数字阅读任务上的表现,涵盖东阿拉伯-印度数字(阿拉伯文中的0-9)和西阿拉伯数字(0-9)。我们使用四种提示策略(零样本、零样本思维链、少样本、少样本思维链)对来自10个提供者的71个模型进行了评估,涉及210个数字阅读任务,涵盖六个上下文类别:纯数字、地址、日期、数量和价格。我们的评估包括59,010个独立测试案例,并跟踪提取方法以测量结构化输出生成。评估结果显示,模型和策略之间的性能差异显著,准确率范围从14.29%到99.05%。少样本思维链提示的准确率达到零样本方法的2.8倍(80.06%对28.76%)。一个显著的发现是:达到精英级准确率(98-99%)的模型通常产生以非结构化为主的输出,大多数响应缺乏阿拉伯语思维链标记。只有6个模型在所有测试案例中始终生成结构化输出,而大多数模型尽管具有高数字准确率,但仍需回退提取方法。对281个模型-策略组合的全面评估表明,数字准确性和遵循指令是两种不同的能力,这为阿拉伯数字理解建立了基准,并为生产环境中阿拉伯自然语言处理系统的模型选择提供了可行的指导。
cs.CL / 16 / 2602.18788

BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models

BURMESE-SAN:评估大型语言模型的缅甸语自然语言处理基准
Aung, Thura, Montalan, Jann Railey, Ngui, Jian Gang, Limkonchotiwat, Peerat
Abstract
We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asia regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages. https://leaderboard.sea-lion.ai/detailed/MY
Chinese Translation
我们介绍了BURMESE-SAN,这是第一个系统性评估缅甸语大型语言模型(LLMs)的整体基准,涵盖了三个核心自然语言处理能力:理解(NLU)、推理(NLR)和生成(NLG)。BURMESE-SAN整合了七个涵盖这些能力的子任务,包括问答、情感分析、有害内容检测、因果推理、自然语言推理、抽象摘要和机器翻译,其中一些任务之前在缅甸语中并不可用。该基准通过严格的母语者驱动过程构建,以确保语言的自然性、流畅性和文化真实性,同时最小化翻译引入的伪影。我们对开放权重和商业LLMs进行了大规模评估,以研究缅甸语建模中由于有限的预训练覆盖、丰富的形态学和句法变异而产生的挑战。我们的结果表明,缅甸语的表现更多地依赖于架构设计、语言表示和指令调优,而不仅仅是模型规模。特别是,东南亚地区的微调和更新的模型代际带来了显著的提升。最后,我们将BURMESE-SAN发布为公共排行榜,以支持缅甸语及其他低资源语言的系统评估和持续进展。
cs.CL / 17 / 2602.18806

Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models

Think$^{2}$:大型语言模型中的基础元认知推理
Elenjical, Abraham Paul, Kavuri, Vivek Hruday, Varma, Vasudeva
Abstract
Large Language Models (LLMs) demonstrate strong reasoning performance, yet their ability to reliably monitor, diagnose, and correct their own errors remains limited. We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight dual-process MetaController for adaptive effort allocation. Across diverse reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA) using Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and yields a threefold increase in successful self-correction. Blinded human evaluations over 580 query pairs show an 84% aggregate preference for trustworthiness and metacognitive self-awareness over standard and Chain-of-Thought baselines. Grounding LLM reasoning in established cognitive theory offers a principled path toward more transparent and diagnostically robust AI systems.
Chinese Translation
大型语言模型(LLMs)展现出强大的推理能力,但它们在可靠监控、诊断和纠正自身错误方面的能力仍然有限。我们提出了一个心理学基础的元认知框架,将安·布朗(Ann Brown)的调节周期(规划、监控和评估)操作化为一个结构化的提示架构,并研究其在轻量级双过程元控制器(MetaController)中的集成,以实现自适应的努力分配。在使用 Llama-3 和 Qwen-3(8B)进行的多种推理和诊断基准测试(GSM8K、CRUXEval、MBPP、AIME、CorrectBench 和 TruthfulQA)中,明确的调节结构显著改善了错误诊断,并使成功自我纠正的次数增加了三倍。对580对查询的盲评显示,84%的评估者更倾向于信任度和元认知自我意识,而非标准和链式思维(Chain-of-Thought)基线。将 LLM 的推理建立在已确立的认知理论之上,为更透明和诊断性强的人工智能系统提供了一条有原则的路径。
cs.CL / 18 / 2602.18823

EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation

EvalSense:一个针对特定领域的大型语言模型(元)评估框架
Dejl, Adam, Pearson, Jonathan
Abstract
Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains. However, traditional statistical metrics are poorly suited to open-ended generation tasks, leading to growing reliance on LLM-based evaluation methods. These methods, while often more flexible, introduce additional complexity: they depend on carefully chosen models, prompts, parameters, and evaluation strategies, making the evaluation process prone to misconfiguration and bias. In this work, we present EvalSense, a flexible, extensible framework for constructing domain-specific evaluation suites for LLMs. EvalSense provides out-of-the-box support for a broad range of model providers and evaluation strategies, and assists users in selecting and deploying suitable evaluation methods for their specific use-cases. This is achieved through two unique components: (1) an interactive guide aiding users in evaluation method selection and (2) automated meta-evaluation tools that assess the reliability of different evaluation approaches using perturbed data. We demonstrate the effectiveness of EvalSense in a case study involving the generation of clinical notes from unstructured doctor-patient dialogues, using a popular open dataset. All code, documentation, and assets associated with EvalSense are open-source and publicly available at https://github.com/nhsengland/evalsense.
Chinese Translation
对大型语言模型(LLMs)进行稳健而全面的评估对于识别有效的LLM系统配置以及降低在敏感领域部署LLMs相关的风险至关重要。然而,传统的统计指标不适合开放式生成任务,这导致对基于LLM的评估方法的依赖日益增加。这些方法虽然通常更灵活,但也引入了额外的复杂性:它们依赖于精心选择的模型、提示、参数和评估策略,使得评估过程容易出现配置错误和偏差。在本研究中,我们提出了EvalSense,一个灵活且可扩展的框架,用于构建特定领域的LLM评估套件。EvalSense提供了对广泛的模型提供者和评估策略的开箱即用支持,并帮助用户选择和部署适合其特定用例的评估方法。这是通过两个独特的组件实现的:(1)一个互动指南,帮助用户选择评估方法;(2)自动化的元评估工具,使用扰动数据评估不同评估方法的可靠性。我们在一个案例研究中展示了EvalSense的有效性,该研究涉及从非结构化的医患对话中生成临床记录,使用了一个流行的开放数据集。与EvalSense相关的所有代码、文档和资产均为开源,并可在https://github.com/nhsengland/evalsense上公开获取。
cs.CL / 19 / 2602.18920

DeepInnovator: Triggering the Innovative Capabilities of LLMs

DeepInnovator:激发大型语言模型的创新能力
Fan, Tianyu, Zhang, Fengji, Zheng, Yuxiang, Chen, Bei, Niu, Xinyao, Huang, Chengen, Lin, Junyang, Huang, Chao
Abstract
The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously generate novel and significant research ideas. Existing approaches predominantly rely on sophisticated prompt engineering and lack a systematic training paradigm. To address this, we propose DeepInnovator, a training framework designed to trigger the innovative capability of LLMs. Our approach comprises two core components. (1) ``Standing on the shoulders of giants''. We construct an automated data extraction pipeline to extract and organize structured research knowledge from a vast corpus of unlabeled scientific literature. (2) ``Conjectures and refutations''. We introduce a ``Next Idea Prediction'' training paradigm, which models the generation of research ideas as an iterative process of continuously predicting, evaluating, and refining plausible and novel next idea. Both automatic and expert evaluations demonstrate that our DeepInnovator-14B significantly outperforms untrained baselines, achieving win rates of 80.53\%-93.81\%, and attains performance comparable to that of current leading LLMs. This work provides a scalable training pathway toward building research agents with genuine, originative innovative capability, and will open-source the dataset to foster community advancement. Source code and data are available at: https://github.com/HKUDS/DeepInnovator.
Chinese Translation
大型语言模型(LLMs)在加速科学发现中的应用引起了越来越多的关注,重点在于构建具备创新能力的研究代理,即能够自主生成新颖且重要的研究想法的能力。现有的方法主要依赖于复杂的提示工程,缺乏系统的训练范式。为了解决这一问题,我们提出了DeepInnovator,一个旨在激发LLMs创新能力的训练框架。我们的方法包括两个核心组成部分。(1) “站在巨人的肩膀上”。我们构建了一个自动数据提取管道,从大量未标记的科学文献中提取和组织结构化的研究知识。(2) “猜想与反驳”。我们引入了一种“下一个想法预测”(Next Idea Prediction)训练范式,将研究想法的生成建模为一个迭代过程,持续预测、评估和完善合理且新颖的下一个想法。自动评估和专家评估均表明,我们的DeepInnovator-14B显著优于未训练的基线,赢得率达到80.53%-93.81%,并且其性能可与当前领先的LLMs相媲美。该工作提供了一条可扩展的训练路径,以构建具有真正原创创新能力的研究代理,并将开源数据集以促进社区发展。源代码和数据可在:https://github.com/HKUDS/DeepInnovator获取。
cs.CL / 20 / 2602.18922

Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning

为何代理缓存失败及其解决方案:基于少样本学习的结构化意图规范化
Basu, Abhinaba
Abstract
Personal AI agents incur substantial cost via repeated LLM calls. We show existing caching methods fail: GPTCache achieves 37.9% accuracy on real benchmarks; APC achieves 0-12%. The root cause is optimizing for the wrong property -- cache effectiveness requires key consistency and precision, not classification accuracy. We observe cache-key evaluation reduces to clustering evaluation and apply V-measure decomposition to separate these on n=8,682 points across MASSIVE, BANKING77, CLINC150, and NyayaBench v2, our new 8,514-entry multilingual agentic dataset (528 intents, 20 W5H2 classes, 63 languages). We introduce W5H2, a structured intent decomposition framework. Using SetFit with 8 examples per class, W5H2 achieves 91.1%+/-1.7% on MASSIVE in ~2ms -- vs 37.9% for GPTCache and 68.8% for a 20B-parameter LLM at 3,447ms. On NyayaBench v2 (20 classes), SetFit achieves 55.3%, with cross-lingual transfer across 30 languages. Our five-tier cascade handles 85% of interactions locally, projecting 97.5% cost reduction. We provide risk-controlled selective prediction guarantees via RCPS with nine bound families.
Chinese Translation
个人人工智能代理通过重复调用大规模语言模型(LLM)产生了可观的成本。我们展示了现有的缓存方法存在缺陷:GPTCache在真实基准测试中的准确率仅为37.9%;APC的准确率在0-12%之间。根本原因在于优化了错误的属性——缓存的有效性需要关键一致性和精确性,而非分类准确率。我们观察到缓存键的评估归结为聚类评估,并应用V-measure分解在MASSIVE、BANKING77、CLINC150和NyayaBench v2(我们的新多语言代理数据集,包含8,514个条目,528个意图,20个W5H2类别,63种语言)上分离这些评估,样本量为n=8,682。我们引入了W5H2,一个结构化意图分解框架。使用SetFit,每个类别8个示例,W5H2在MASSIVE上达到了91.1%±1.7%的准确率,耗时约2毫秒,而GPTCache的准确率为37.9%,20B参数的LLM耗时3,447毫秒,准确率为68.8%。在NyayaBench v2(20个类别)上,SetFit的准确率为55.3%,并在30种语言之间实现了跨语言迁移。我们的五层级联处理了85%的本地交互,预计可减少97.5%的成本。我们通过RCPS提供了风险控制的选择性预测保证,涵盖九个界限族。
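A minimal sketch of structured intent canonicalization, assuming W5H2 slots come from a classifier (stubbed here in place of SetFit) and the cache key is a hash of the normalized slots; the slot schema is an assumption based on the abstract.

```python
# Sketch: cache keys from a structured W5H2 intent decomposition.
# Paraphrases that share slot values canonicalize to the same key,
# which is the key-consistency property the paper argues caching needs.
import hashlib
import json

def extract_w5h2(utterance: str) -> dict:
    # Stub standing in for a few-shot SetFit slot classifier; the real
    # schema covers who/what/when/where/why/how/how-much classes.
    u = utterance.lower()
    return {
        "what": "check_balance" if "balance" in u else "unknown",
        "where": "savings" if "savings" in u else "checking",
    }

def cache_key(utterance: str) -> str:
    slots = extract_w5h2(utterance)
    canon = json.dumps(slots, sort_keys=True)       # order-independent form
    return hashlib.sha256(canon.encode()).hexdigest()[:16]

# Two paraphrases map to the same key -> cache hit, no LLM call:
print(cache_key("what's my savings balance?"))
print(cache_key("show the balance on my savings account"))
```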
cs.CL / 21 / 2602.18964

Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

Yor-Sarc:一种用于低资源非洲语言讽刺检测的金标准数据集
Jimoh, Toheeb Aduramomi, De Wille, Tabea, Nikolov, Nikola S.
Abstract
Sarcasm detection poses a fundamental challenge in computational semantics, requiring models to resolve disparities between literal and intended meaning. The challenge is amplified in low-resource languages where annotated datasets are scarce or nonexistent. We present Yor-Sarc, the first gold-standard dataset for sarcasm detection in Yorùbá, a tonal Niger-Congo language spoken by over 50 million people. The dataset comprises 436 instances annotated by three native speakers from diverse dialectal backgrounds using an annotation protocol specifically designed for Yorùbá sarcasm by taking culture into account. This protocol incorporates context-sensitive interpretation and community-informed guidelines and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages. Substantial to almost perfect agreement was achieved (Fleiss' $\kappa = 0.7660$; pairwise Cohen's $\kappa = 0.6732$--$0.8743$), with $83.3\%$ unanimous consensus. One annotator pair achieved almost perfect agreement ($\kappa = 0.8743$; $93.8\%$ raw agreement), exceeding a number of reported benchmarks for English sarcasm research works. The remaining $16.7\%$ majority-agreement cases are preserved as soft labels for uncertainty-aware modelling. Yor-Sarc (https://github.com/toheebadura/yor-sarc) is expected to facilitate research on semantic interpretation and culturally informed NLP for low-resource African languages.
Chinese Translation
讽刺检测在计算语义学中构成了一个基本挑战,要求模型解决字面意义与意图意义之间的差异。在标注数据集稀缺或不存在的低资源语言中,这一挑战更为突出。我们提出了 Yor-Sarc,这是第一个用于约鲁巴语(Yorùbá,一种使用人数超过5000万的有声调尼日尔-刚果语言)讽刺检测的金标准数据集。该数据集包含436个实例,由三位来自不同方言背景的母语者进行标注,采用了一种专为约鲁巴语讽刺设计、并考虑文化因素的标注协议。该协议结合了上下文敏感的解释和社区知情的指导方针,并附有对标注者间一致性的全面分析,以支持在其他非洲语言中的复制。标注达到了显著到几乎完美的一致性(Fleiss' $\kappa = 0.7660$;成对 Cohen's $\kappa = 0.6732$--$0.8743$),其中 83.3% 的实例达成一致共识。一对标注者达到了几乎完美的一致性($\kappa = 0.8743$;93.8% 的原始一致率),超过了许多已报告的英语讽刺研究基准。其余 16.7% 的多数一致案例被保留为软标签,用于不确定性感知建模。Yor-Sarc(https://github.com/toheebadura/yor-sarc)预计将促进对低资源非洲语言的语义解释和文化知情自然语言处理的研究。
cs.CL / 22 / 2602.18966

Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation

Whisper: 场边版通过大语言模型驱动的上下文生成提升自动语音识别性能
Ron, Yonathan, Gilboa, Shiri, Dubnov, Tammuz
Abstract
Domain-specific speech remains a persistent challenge for automatic speech recognition (ASR), even for state-of-the-art systems like OpenAI's Whisper. We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining. The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder. Evaluated on 421 NBA basketball commentary segments (a domain characterized by dense proper nouns and technical terminology) our best pipeline achieves a statistically significant 17.0% relative reduction in word error rate (WER; from 0.217 to 0.180, p<0.001). Improvements are observed in 40.1% of segments with degradation in only 7.1%, substantially outperforming direct transcript post-editing. These results demonstrate that prompt-based augmentation can deliver scalable domain adaptation for ASR, offering a practical alternative to costly model fine-tuning.
Chinese Translation
特定领域的语音仍然是自动语音识别(ASR)面临的一个持续挑战,即使对于像OpenAI的Whisper这样的最先进系统也是如此。我们介绍了Whisper: 场边版,这是一种新颖的多智能体大语言模型(LLM)管道,能够在不重新训练的情况下增强Whisper的转录效果。该管道拦截Whisper的初始转录,应用专门的LLM智能体进行领域上下文识别、命名实体识别和行话检测,并生成紧凑的提示以指导Whisper的解码器。在421个NBA篮球评论片段(该领域以密集的专有名词和技术术语为特征)上进行评估,我们的最佳管道实现了统计上显著的17.0%相对降低的词错误率(WER;从0.217降至0.180,p<0.001)。在40.1%的片段中观察到改进,仅在7.1%的片段中出现退化,显著优于直接的转录后编辑。这些结果表明,基于提示的增强可以为ASR提供可扩展的领域适应,成为一种成本较低的模型微调替代方案。
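A minimal sketch of the intercept-and-reprompt loop using the open-source whisper package's initial_prompt argument (a real parameter of model.transcribe); the jargon-extraction step is a stub collapsing the paper's multiple LLM agents into one hypothetical function.

```python
# Sketch: two-pass Whisper with an LLM-generated context prompt.
# No retraining: the first-pass transcript seeds a compact domain
# prompt that guides the decoder on the second pass.
import whisper

model = whisper.load_model("base")

def llm_context_prompt(first_pass_text: str) -> str:
    # Stand-in for the multi-agent pipeline (domain ID, NER, jargon
    # detection); a real implementation would call LLM agents here.
    glossary = ["Antetokounmpo", "pick-and-roll", "Gobert", "euro step"]
    return "NBA commentary. Names and terms: " + ", ".join(glossary)

def courtside_transcribe(audio_path: str) -> str:
    first = model.transcribe(audio_path)["text"]                    # pass 1
    prompt = llm_context_prompt(first)
    second = model.transcribe(audio_path, initial_prompt=prompt)    # pass 2
    return second["text"]

# print(courtside_transcribe("clip.wav"))  # requires a local audio file
```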
cs.CL / 23 / 2602.19008

Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks

有能力但不可靠:典型路径偏离作为长时间任务中代理失败的因果机制
Lee, Wilson Y.
Abstract
Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs) and agent success depends critically on whether a trajectory stays within this path's operating envelope. We establish this causally using a natural experiment that holds model capability and task difficulty fixed by construction. We analyze trajectories from the Toolathlon benchmark: 22 frontier models each attempt 108 real-world tool-use tasks across 3 independent runs, yielding 515 model$\times$task units where the same model succeeds on some runs and fails on others due to LLM sampling stochasticity alone. Within these units, successful runs adhere significantly more closely to the canonical solution path than failed runs ($+$0.060 Jaccard, $p<0.0001$, $n=488$ units, 95% CI [+0.043, +0.077]). This result survives six robustness checks including cross-model-family leave-one-out validation. Critically, the causal mechanism is gradual and self-reinforcing: the adherence gap is statistically indistinguishable from zero through the first 50% of the trajectory, ruling out early-branching selection bias, and each off-canonical tool call raises the probability that the next call is also off-canonical by 22.7 percentage points ($\hat{\beta}=+0.227$, $p<0.0001$), more than doubling the baseline rate. These findings imply that agent reliability cannot be improved by capability scaling alone, but offer a highly actionable intervention: a simple monitor that restarts the bottom tercile of runs based on mid-trajectory canonical adherence lifts success rates by $+$8.8 percentage points among intervened runs.
Chinese Translation
为什么语言代理在它们能够解决的任务上会失败?我们认为,许多此类失败是由偏离任务潜在解决结构的随机漂移引起的可靠性失败,而不是能力失败。每一个明确界定的工具使用任务都施加了一个典型解决路径(即,在成功运行中共享的一组收敛的工具调用),代理的成功在很大程度上取决于轨迹是否保持在该路径的操作范围内。我们通过一个自然实验建立了这一因果关系,该实验通过构造固定了模型能力和任务难度。我们分析了来自Toolathlon基准的轨迹:22个前沿模型在3次独立运行中各自尝试108个真实世界的工具使用任务,产生了515个模型$\times$任务单元,其中同一模型仅由于大语言模型(LLM)采样的随机性而在某些运行中成功、在其他运行中失败。在这些单元中,成功的运行比失败的运行显著更贴近典型解决路径($+$0.060 Jaccard, $p<0.0001$, $n=488$ 单元, 95% CI [+0.043, +0.077])。这一结果经过六项稳健性检验(包括跨模型家族的逐一剔除验证)依然成立。关键是,这一因果机制是渐进且自我强化的:在轨迹的前50%中,遵循差距在统计上与零无显著差异,排除了早期分支的选择偏差;而每一次偏离典型路径的工具调用都会将下一次调用同样偏离典型路径的概率提高22.7个百分点($\hat{\beta}=+0.227$, $p<0.0001$),使基线概率提高一倍以上。这些发现意味着仅靠能力扩展无法提高代理的可靠性,但也提供了一种高度可操作的干预措施:一个简单的监控器根据中途轨迹的典型路径遵循情况重启得分最低三分之一的运行,可使被干预运行的成功率提高$+$8.8个百分点。
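A minimal sketch of the adherence measure, assuming the canonical path is approximated by the tool calls shared across successful runs and adherence is Jaccard similarity to that set; the trajectories and the majority threshold are toy values.

```python
# Sketch: canonical-path adherence as Jaccard similarity between a
# run's tool-call set and the tools shared across successful runs.
from collections import Counter

successful_runs = [
    {"list_files", "read_config", "edit_file", "run_tests"},
    {"list_files", "read_config", "edit_file", "run_tests", "grep"},
    {"list_files", "read_config", "edit_file", "run_tests"},
]

def canonical_path(runs, majority=0.5):
    # Tools appearing in a majority of successful runs form the path.
    counts = Counter(t for run in runs for t in run)
    return {t for t, c in counts.items() if c / len(runs) > majority}

def adherence(run, canon):
    return len(run & canon) / len(run | canon)

canon = canonical_path(successful_runs)
failed = {"list_files", "grep", "grep_again", "edit_file"}
print(adherence(successful_runs[0], canon))  # 1.0
print(adherence(failed, canon))              # ~0.33: the drifted run scores lower
```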
cs.CL / 24 / 2602.19043

Uncovering Context Reliance in Unstructured Knowledge Editing

揭示非结构化知识编辑中的上下文依赖性
Zhou, Zisheng, Zhang, Mengqi, Wu, Shiguang, Ye, Xiaotian, Zhang, Chi, Chen, Zhumin, Ren, Pengjie
Abstract
Editing Large language models (LLMs) with real-world, unstructured knowledge is essential for correcting and updating their internal parametric knowledge. In this work, we revisit the fundamental next-token prediction (NTP) as a candidate paradigm for unstructured editing. We identify Context Reliance as a critical failure mode of NTP-based approaches, where knowledge acquired from edited text becomes highly dependent on its preceding context, leading to recall failures when that context is absent during inference. This hypothesis is supported by our empirical validation that prepending context during inference recovers knowledge recall. We further theoretically demonstrate that Context Reliance is an inherent consequence of gradient-based optimization, which tends to bind acquired knowledge to a specific aggregated contextual representation. To address this, we propose a simple yet effective COntext-INdependent editing framework (COIN), encouraging model to focus on knowledge within local scope rather than memorizing contextual patterns. Evaluations show that COIN reduces Context Reliance by 45.2% and outperforms strong baselines by 23.6% in editing success rate, highlighting the vital role of mitigating Context Reliance for robust editing.
Chinese Translation
使用现实世界的非结构化知识编辑大型语言模型(LLMs)对于纠正和更新其内部参数知识至关重要。在本研究中,我们重新审视了基本的下一个标记预测(NTP)作为非结构化编辑的候选范式。我们确定上下文依赖性是基于NTP方法的一种关键失效模式,其中从编辑文本中获得的知识高度依赖于其前置上下文,导致在推理过程中缺少该上下文时的回忆失败。我们的实证验证支持了这一假设,即在推理过程中添加上下文可以恢复知识的回忆。我们进一步理论证明,上下文依赖性是基于梯度优化的固有结果,该优化倾向于将获得的知识绑定到特定的聚合上下文表示。为了解决这个问题,我们提出了一个简单而有效的上下文独立编辑框架(COntext-INdependent editing framework,COIN),鼓励模型关注局部范围内的知识,而不是记忆上下文模式。评估结果表明,COIN将上下文依赖性降低了45.2%,并在编辑成功率上超越了强基线23.6%,突显了减轻上下文依赖性在稳健编辑中的重要作用。
cs.CL / 25 / 2602.19049

IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

IAPO:面向信息的策略优化以实现高效的推理
He, Yinhan, Zhu, Yaochen, Shi, Mingjia, Zheng, Wendy, Su, Lin, Wang, Xiaoqing, Guo, Qi, Li, Jundong
Abstract
Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.
Chinese Translation
大型语言模型越来越依赖于长链思维来提高准确性,但这种提升伴随着显著的推理时间成本。我们重新审视了基于令牌效率的后训练,并认为现有的序列级奖励塑形方法在如何分配推理努力到各个令牌上提供的控制有限。为此,我们提出了IAPO,一种基于信息论的后训练框架,该框架根据每个令牌与最终答案的条件互信息(MI)分配令牌级优势。这为识别信息丰富的推理步骤和抑制低效探索提供了明确且有原则的机制。我们提供了理论分析,表明我们的IAPO能够在不损害正确性的情况下,单调减少推理冗长。实证结果表明,IAPO在提高推理准确性的同时,推理长度减少了多达36%,在各种推理数据集上优于现有的令牌高效强化学习方法。大量实证评估表明,面向信息的优势塑形是高效后训练的一个强大且通用的方向。代码可在 https://github.com/YinhanHe123/IAPO 获取。
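The abstract does not specify IAPO's MI estimator, so the following is a hedged sketch of one simple plug-in proxy: score how much appending each reasoning token changes the log-probability of the final answer, then center the deltas within the sequence. Both the delta proxy and the centering step are assumptions, not the paper's exact scheme.

    def mi_proxy_advantages(answer_logps):
        """answer_logps[t] = log p(final answer | reasoning tokens r_1..r_t),
        scored by the policy or a frozen scorer. The delta when appending
        token t is a crude plug-in proxy for that token's conditional MI
        with the answer: large positive deltas mark informative steps,
        near-zero deltas mark low-utility exploration to be suppressed."""
        deltas = [b - a for a, b in zip(answer_logps, answer_logps[1:])]
        mean = sum(deltas) / len(deltas)
        # Center so token advantages are relative within the sequence
        # (an assumption about the shaping, not the paper's formula).
        return [d - mean for d in deltas]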
cs.CL / 26 / 2602.19058

Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer

大型语言模型和大型视觉语言模型在推理中是否共享神经元?跨模态转移的证据与机制
Cui, Chenhang, Zhang, An, Chen, Yuxin, Deng, Gelei, Zheng, Jingnan, Liang, Zhenkai, Wang, Xiang, Chua, Tat-Seng
Abstract
Large vision-language models (LVLMs) have rapidly advanced across various domains, yet they still lag behind strong text-only large language models (LLMs) on tasks that require multi-step inference and compositional decision-making. Motivated by their shared transformer architectures, we investigate whether the two model families rely on common internal computation for such inference. At the neuron level, we uncover a surprisingly large overlap: more than half of the top-activated units during multi-step inference are shared between representative LLMs and LVLMs, revealing a modality-invariant inference subspace. Through causal probing via activation amplification, we further show that these shared neurons encode consistent and interpretable concept-level effects, demonstrating their functional contribution to inference. Building on this insight, we propose Shared Neuron Low-Rank Fusion (SNRF), a parameter-efficient framework that transfers mature inference circuitry from LLMs to LVLMs. SNRF profiles cross-model activations to identify shared neurons, computes a low-rank approximation of inter-model weight differences, and injects these updates selectively within the shared-neuron subspace. This mechanism strengthens multimodal inference performance with minimal parameter changes and requires no large-scale multimodal fine-tuning. Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities. Our results demonstrate that shared neurons form an interpretable bridge between LLMs and LVLMs, enabling low-cost transfer of inference ability into multimodal models. Our code is available at [https://github.com/chenhangcuisg-code/Do-LLMs-VLMs-Share-Neurons](https://github.com/chenhangcuisg-code/Do-LLMs-VLMs-Share-Neurons).
Chinese Translation
大型视觉语言模型(LVLMs)在各个领域迅速发展,但在需要多步推理和组合决策的任务上仍落后于强大的仅文本大型语言模型(LLMs)。基于它们共享的变换器架构,我们研究了这两种模型家族是否依赖于共同的内部计算进行此类推理。在神经元层面,我们发现了惊人程度的重叠:在多步推理过程中,超过一半的高激活单元在代表性的LLMs和LVLMs之间共享,揭示了一个模态不变的推理子空间。通过激活放大进行因果探测,我们进一步表明这些共享神经元编码了一致且可解释的概念级效应,展示了它们在推理中的功能贡献。基于这一见解,我们提出了共享神经元低秩融合(Shared Neuron Low-Rank Fusion, SNRF),这是一个参数高效的框架,能够将成熟的推理电路从LLMs转移到LVLMs。SNRF分析跨模型激活以识别共享神经元,计算模型间权重差异的低秩近似,并在共享神经元子空间内选择性地注入这些更新。该机制在最小参数变化的情况下增强了多模态推理性能,并且不需要大规模的多模态微调。在各种数学和感知基准测试中,SNRF始终提升了LVLM的推理性能,同时保持了感知能力。我们的结果表明,共享神经元形成了LLMs和LVLMs之间一个可解释的桥梁,使得推理能力以低成本转移到多模态模型中。我们的代码可在 [https://github.com/chenhangcuisg-code/Do-LLMs-VLMs-Share-Neurons](https://github.com/chenhangcuisg-code/Do-LLMs-VLMs-Share-Neurons) 获取。
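A hedged sketch of the neuron-overlap measurement reported above, assuming activations are collected on matched multi-step-inference prompts and units are ranked by mean absolute activation; the tensor shapes and k are placeholders. SNRF then restricts low-rank weight updates to rows in this shared set.

    import torch

    def shared_top_neurons(acts_llm, acts_lvlm, k):
        """acts_*: [n_prompts, d] activation tensors from the two models.
        Rank units by mean absolute activation and intersect the top-k
        sets; the paper reports >50% overlap during multi-step inference."""
        top_a = torch.topk(acts_llm.abs().mean(0), k).indices.tolist()
        top_b = torch.topk(acts_lvlm.abs().mean(0), k).indices.tolist()
        return set(top_a) & set(top_b)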
cs.CL / 27 / 2602.19079

TriTopic: Tri-Modal Graph-Based Topic Modeling with Iterative Refinement and Archetypes

TriTopic:基于三模态图的主题建模与迭代精炼及原型
Egger, Roman
Abstract
Topic modeling extracts latent themes from large text collections, but leading approaches like BERTopic face critical limitations: stochastic instability, loss of lexical precision ("Embedding Blur"), and reliance on a single data perspective. We present TriTopic, a framework that addresses these weaknesses through a tri-modal graph fusing semantic embeddings, TF-IDF, and metadata. Three core innovations drive its performance: hybrid graph construction via Mutual kNN and Shared Nearest Neighbors to eliminate noise and combat the curse of dimensionality; Consensus Leiden Clustering for reproducible, stable partitions; and Iterative Refinement that sharpens embeddings through dynamic centroid-pulling. TriTopic also replaces the "average document" concept with archetype-based topic representations defined by boundary cases rather than centers alone. In benchmarks across 20 Newsgroups, BBC News, AG News, and Arxiv, TriTopic achieves the highest NMI on every dataset (mean NMI 0.575 vs. 0.513 for BERTopic, 0.416 for NMF, 0.299 for LDA), guarantees 100% corpus coverage with 0% outliers, and is available as an open-source PyPI library.
Chinese Translation
主题建模从大规模文本集合中提取潜在主题,但像BERTopic这样的领先方法面临着关键限制:随机不稳定性、词汇精度损失(“嵌入模糊”)以及对单一数据视角的依赖。我们提出了TriTopic,一个通过融合语义嵌入、TF-IDF和元数据的三模态图来解决这些弱点的框架。三项核心创新推动了其性能:通过互相最近邻(Mutual kNN)和共享最近邻(Shared Nearest Neighbors)构建混合图,以消除噪声并应对维度诅咒;共识莱顿聚类(Consensus Leiden Clustering)实现可重复、稳定的划分;以及通过动态质心拉动(centroid-pulling)来锐化嵌入的迭代精炼。TriTopic还用基于原型的主题表示替代了“平均文档”概念,这些表示由边界案例而非仅由中心定义。在20 Newsgroups、BBC News、AG News和Arxiv的基准测试中,TriTopic在每个数据集上都达到了最高的NMI(平均NMI 0.575,相比之下BERTopic为0.513,NMF为0.416,LDA为0.299),保证了100%的语料覆盖率且没有异常值,并作为开源PyPI库提供。
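TriTopic's hybrid graph construction is named but not detailed in the abstract; a minimal sketch under the common definitions of Mutual kNN and Shared Nearest Neighbors follows, using scikit-learn for neighbor search. Weighting each mutual edge by the fraction of shared neighbors is an assumption.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def mutual_knn_snn_graph(X, k=15):
        """Keep an edge i-j only if i and j appear in each other's k-NN
        lists (Mutual kNN), and weight it by the fraction of shared
        neighbors (SNN), which damps hub edges in high dimensions."""
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)              # idx[:, 0] is the point itself
        neigh = [set(row[1:]) for row in idx]
        n = len(X)
        W = np.zeros((n, n))
        for i in range(n):
            for j in neigh[i]:
                if i in neigh[j]:              # mutual kNN test
                    W[i, j] = W[j, i] = len(neigh[i] & neigh[j]) / k
        return W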
cs.CL / 28 / 2602.19101

Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

价值纠缠:在(某些)大型语言模型中不同类型善的混淆
Cho, Seong Hah, Li, Junyi, Leshinskaya, Anna
Abstract
Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representation of value. Among the characteristics of value representation in humans is that they distinguish among value of different kinds. We investigate whether LLMs likewise distinguish three different kinds of good: moral, grammatical, and economic. By probing model behavior, embeddings, and residual stream activations, we report pervasive cases of value entanglement: a conflation between these distinct representations of value. Specifically, both grammatical and economic valuations were found to be overly influenced by moral value, relative to human norms. This conflation was repaired by selective ablation of the activation vectors associated with morality.
Chinese Translation
大型语言模型(LLMs)的价值对齐要求我们实证测量这些模型实际习得的价值表示。人类价值表示的特征之一,是能够区分不同类型的价值。我们研究LLMs是否同样区分三种不同类型的善:道德的、语法的和经济的。通过探测模型行为、嵌入和残差流激活,我们报告了普遍存在的价值纠缠现象:这些不同价值表示之间的混淆。具体而言,相较于人类规范,语法和经济价值的评估均受到道德价值的过度影响。通过选择性消融与道德相关的激活向量,这种混淆得以修复。
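The repair step named here, selective ablation of morality-associated activation vectors, admits a standard one-line form: project the component along a unit direction out of the residual-stream activations. How the direction v is estimated (e.g., from contrastive probes) is not specified in the abstract and is assumed.

    import torch

    def ablate_direction(h, v):
        """Remove the component of residual activations h (..., d) along a
        unit 'morality' direction v (d,); the estimation of v is assumed."""
        v = v / v.norm()
        return h - (h @ v).unsqueeze(-1) * v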
cs.CL / 29 / 2602.19111

Astra: Activation-Space Tail-Eigenvector Low-Rank Adaptation of Large Language Models

Astra:大语言模型的激活空间尾特征向量低秩适应
Liu, Kainan, Zhang, Yong, Cheng, Ning, Zhu, Yun, Wang, Yanmeng, Wang, Shaojun, Xiao, Jing
Abstract
Parameter-Efficient Fine-Tuning (PEFT) methods, especially LoRA, are widely used for adapting pre-trained models to downstream tasks due to their computational and storage efficiency. However, in the context of LoRA and its variants, the potential of activation subspaces corresponding to tail eigenvectors remains substantially under-exploited, which may lead to suboptimal fine-tuning performance. In this work, we propose Astra (Activation-Space Tail-Eigenvector Low-Rank Adaptation), a novel PEFT method that leverages the tail eigenvectors of the model output activations-estimated from a small task-specific calibration set-to construct task-adaptive low-rank adapters. By constraining updates to the subspace spanned by these tail eigenvectors, Astra achieves faster convergence and improved downstream performance with a significantly reduced parameter budget. Extensive experiments across natural language understanding (NLU) and natural language generation (NLG) tasks demonstrate that Astra consistently outperforms existing PEFT baselines across 16 benchmarks and even surpasses full fine-tuning (FFT) in certain scenarios.
Chinese Translation
参数高效微调(PEFT)方法,特别是LoRA,因其计算和存储效率而广泛用于将预训练模型适应于下游任务。然而,在LoRA及其变体的背景下,针对尾特征向量的激活子空间的潜力仍然未得到充分利用,这可能导致微调性能的次优。在本研究中,我们提出了Astra(激活空间尾特征向量低秩适应),这是一种新颖的PEFT方法,利用从小型任务特定校准集估计的模型输出激活的尾特征向量来构建任务自适应的低秩适配器。通过将更新限制在这些尾特征向量所张成的子空间内,Astra实现了更快的收敛和改进的下游性能,同时显著减少了参数预算。在自然语言理解(NLU)和自然语言生成(NLG)任务上的大量实验表明,Astra在16个基准测试中始终优于现有的PEFT基线,甚至在某些情况下超越了完全微调(FFT)。
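A hedged reading of Astra's construction: estimate the activation second-moment matrix on the small calibration set, take the eigenvectors with the smallest eigenvalues, and confine the LoRA update to that span by freezing one adapter factor to the tail basis. Shapes and the freezing choice are assumptions.

    import torch

    def tail_eigenbasis(acts, r):
        """Eigenvectors of the (uncentered) activation covariance with the
        r SMALLEST eigenvalues, estimated from calibration activations
        acts [n_samples, d]. Returns a (d, r) tail basis."""
        cov = acts.T @ acts / acts.shape[0]
        eigvals, eigvecs = torch.linalg.eigh(cov)   # ascending eigenvalues
        return eigvecs[:, :r]

    # Adapter sketch: W_eff = W + B @ A with A fixed to the tail basis
    # (transposed) and only B trained, so the update lives in the span of
    # the tail eigenvectors -- an assumed reading, not the paper's code.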
cs.CL / 30 / 2602.19115

How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders

大型语言模型如何编码科学质量?基于稀疏自编码器的单义特征的实证研究
McCoubrey, Michael, Salatino, Angelo, Osborne, Francesco, Motta, Enrico
Abstract
In recent years, there has been a growing use of generative AI, and large language models (LLMs) in particular, to support both the assessment and generation of scientific work. Although some studies have shown that LLMs can, to a certain extent, evaluate research according to perceived quality, our understanding of the internal mechanisms that enable this capability remains limited. This paper presents the first study that investigates how LLMs encode the concept of scientific quality through relevant monosemantic features extracted using sparse autoencoders. We derive such features under different experimental settings and assess their ability to serve as predictors across three tasks related to research quality: predicting citation count, journal SJR, and journal h-index. The results indicate that LLMs encode features associated with multiple dimensions of scientific quality. In particular, we identify four recurring types of features that capture key aspects of how research quality is represented: 1) features reflecting research methodologies; 2) features related to publication type, with literature reviews typically exhibiting higher impact; 3) features associated with high-impact research fields and technologies; and 4) features corresponding to specific scientific jargons. These findings represent an important step toward understanding how LLMs encapsulate concepts related to research quality.
Chinese Translation
近年来,生成性人工智能的使用日益增长,尤其是大型语言模型(LLMs)在支持科学工作的评估和生成方面发挥了重要作用。尽管一些研究表明,LLMs在一定程度上能够根据感知质量评估研究,但我们对使这种能力得以实现的内部机制的理解仍然有限。本文呈现了首个研究,探讨LLMs如何通过使用稀疏自编码器提取的相关单义特征来编码科学质量的概念。我们在不同的实验设置下推导出这些特征,并评估它们在与研究质量相关的三个任务中的预测能力:预测引用次数、期刊SJR和期刊h指数。结果表明,LLMs编码了与科学质量多个维度相关的特征。特别地,我们识别出四种反复出现的特征类型,这些特征捕捉了研究质量表示的关键方面:1)反映研究方法的特征;2)与出版类型相关的特征,文献综述通常表现出更高的影响力;3)与高影响力研究领域和技术相关的特征;以及4)对应于特定科学术语的特征。这些发现代表了理解LLMs如何概括与研究质量相关概念的重要一步。
cs.CL / 31 / 2602.19127

AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

AgenticRAGTracer:一个跳跃感知的基准,用于诊断Agentic RAG中的多步骤检索推理
You, Qijie, Yu, Wenkai, Zhang, Wentao
Abstract
With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts more fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is both time-consuming and labor-intensive, while also limiting scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments demonstrate that even the best large language models perform poorly on our dataset. For instance, GPT-5 attains merely 22.6\% EM accuracy on the hardest portion of our dataset. Hop-aware diagnosis reveals that failures are primarily driven by distorted reasoning chains -- either collapsing prematurely or wandering into over-extension. This highlights a critical inability to allocate steps consistent with the task's logical structure, providing a diagnostic dimension missing in traditional evaluations. We believe our work will facilitate research in Agentic RAG and inspire further meaningful progress in this area. Our code and data are available at https://github.com/YqjMartin/AgenticRAGTracer.
Chinese Translation
随着近年来基于智能体的方法的快速发展,Agentic RAG无疑已成为一个重要的研究方向。多跳推理要求模型进行深思熟虑的思考和多步骤的交互,作为评估此类能力的关键测试平台。然而,现有的基准通常仅提供最终问题和答案,而缺乏逐步连接原子问题与最终多跳查询的中间跳跃级问题。这一局限性使研究人员无法分析智能体在哪一步失败,并限制了对模型能力的更细致评估。此外,目前大多数基准是手动构建的,这既耗时又费力,同时也限制了可扩展性和泛化能力。为了解决这些挑战,我们提出了AgenticRAGTracer,这是第一个主要由大型语言模型自动构建的Agentic RAG基准,旨在支持逐步验证。我们的基准涵盖多个领域,包含1,305个数据点,并且与现有主流基准没有重叠。大量实验表明,即使是最好的大型语言模型在我们的数据集上表现也很差。例如,GPT-5在我们数据集中最难的部分仅获得22.6%的EM准确率。跳跃感知诊断揭示,失败主要由扭曲的推理链驱动——要么过早崩溃,要么游走于过度扩展。这突显了模型在按照任务逻辑结构分配推理步骤方面的关键能力缺失,提供了传统评估中缺失的诊断维度。我们相信我们的工作将促进Agentic RAG领域的研究,并激励该领域进一步有意义的进展。我们的代码和数据可在 https://github.com/YqjMartin/AgenticRAGTracer 获取。
cs.CL / 32 / 2602.19133

A Dataset for Named Entity Recognition and Relation Extraction from Art-historical Image Descriptions

用于艺术历史图像描述的命名实体识别和关系提取的数据集
Schneider, Stefanie, Göldl, Miriam, Stalter, Julian, Vollmer, Ricarda
Abstract
This paper introduces FRAME (Fine-grained Recognition of Art-historical Metadata and Entities), a manually annotated dataset of art-historical image descriptions for Named Entity Recognition (NER) and Relation Extraction (RE). Descriptions were collected from museum catalogs, auction listings, open-access platforms, and scholarly databases, then filtered to ensure that each text focuses on a single artwork and contains explicit statements about its material, composition, or iconography. FRAME provides stand-off annotations in three layers: a metadata layer for object-level properties, a content layer for depicted subjects and motifs, and a co-reference layer linking repeated mentions. Across layers, entity spans are labeled with 37 types and connected by typed RE links between mentions. Entity types are aligned with Wikidata to support Named Entity Linking (NEL) and downstream knowledge-graph construction. The dataset is released as UIMA XMI Common Analysis Structure (CAS) files with accompanying images and bibliographic metadata, and can be used to benchmark and fine-tune NER and RE systems, including zero- and few-shot setups with Large Language Models (LLMs).
Chinese Translation
本文介绍了FRAME(艺术历史元数据和实体的细粒度识别),这是一个手动标注的艺术历史图像描述数据集,用于命名实体识别(NER)和关系提取(RE)。描述内容来自博物馆目录、拍卖清单、开放获取平台和学术数据库,经过筛选以确保每段文本专注于单一艺术作品,并包含关于其材料、构图或图像学的明确陈述。FRAME提供了三层的独立注释:一层是对象级属性的元数据层,一层是描绘主题和图案的内容层,以及一层链接重复提及的共指层。在各层之间,实体范围被标注为37种类型,并通过类型化的RE链接连接提及。实体类型与Wikidata对齐,以支持命名实体链接(NEL)和下游知识图谱构建。该数据集以UIMA XMI通用分析结构(CAS)文件形式发布,附带图像和书目元数据,可用于基准测试和微调NER和RE系统,包括与大型语言模型(LLMs)进行零样本和少样本设置的应用。
cs.CL / 33 / 2602.19157

Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs

通过特质激活路由与对比稀疏自编码器实现角色扮演大型语言模型的面级个性控制
Tang, Wenqiu, Wan, Zhen, Komamizu, Takahiro, Ide, Ichiro
Abstract
Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-specific corpora. While SFT can be effective, it requires persona-labeled data and retraining for new roles, limiting flexibility. In contrast, prompt- and RAG-based signals are easy to apply but can be diluted in long dialogues, leading to drifting and sometimes inconsistent persona behavior. To address this, we propose a contrastive Sparse AutoEncoder (SAE) framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model. A new 15,000-sample leakage-controlled corpus is constructed to provide balanced supervision for each facet. The learned vectors are integrated into the model's residual space and dynamically selected by a trait-activated routing module, enabling precise and interpretable personality steering. Experiments on Large Language Models (LLMs) show that the proposed method maintains stable character fidelity and output quality across contextualized settings, outperforming Contrastive Activation Addition (CAA) and prompt-only baselines. The combined SAE+Prompt configuration achieves the best overall performance, confirming that contrastively trained latent vectors can enhance persona control while preserving dialogue coherence.
Chinese Translation
在角色扮演代理(RPA)中,个性控制通常通过免训练的方法实现,这些方法通过提示或检索增强生成注入个性描述和记忆,或通过在特定个性语料库上进行监督微调(SFT)。虽然SFT可能有效,但它需要带个性标注的数据,并且需要为新角色重新训练,限制了灵活性。相对而言,基于提示和RAG的信号易于应用,但在长对话中可能会被稀释,导致个性漂移,有时表现出不一致的个性行为。为了解决这个问题,我们提出了一种对比稀疏自编码器(SAE)框架,该框架学习与大五人格30面模型对齐的面级个性控制向量。我们构建了一个新的15,000样本的泄漏受控语料库,为每个面提供平衡的监督。学习到的向量被集成到模型的残差空间中,并通过特质激活路由模块动态选择,从而实现精确且可解释的个性引导。在大型语言模型(LLMs)上的实验表明,所提出的方法在上下文化设置中保持了稳定的角色保真度和输出质量,优于对比激活加法(CAA)和仅基于提示的基线。结合的SAE+Prompt配置实现了最佳整体性能,确认了对比训练的潜在向量可以增强个性控制,同时保持对话连贯性。
cs.CL / 34 / 2602.19174

TurkicNLP: An NLP Toolkit for Turkic Languages

TurkicNLP:一个针对突厥语言的自然语言处理工具包
Hakimov, Sherzod
Abstract
Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .
Chinese Translation
针对突厥语言家族的自然语言处理(NLP)仍然存在碎片化现象,该语言家族在欧亚大陆有超过2亿人使用,大多数语言缺乏统一的工具和资源。我们提出了TurkicNLP,这是一个开源的Python库,为突厥语言提供了一个统一且一致的NLP流程,涵盖四种文字体系:拉丁字母、西里尔字母、波斯-阿拉伯字母和古突厥符文。该库包括分词、形态分析、词性标注、依存句法分析、命名实体识别、双向文字转写、跨语言句子嵌入和机器翻译,所有功能通过一个与语言无关的API实现。模块化的多后端架构透明地集成了基于规则的有限状态转换器和神经模型,并具备自动文字检测和不同文字变体之间的路由功能。输出遵循CoNLL-U标准,以实现完全的互操作性和扩展性。代码和文档托管在 https://github.com/turkic-nlp/turkicnlp 。
cs.CL / 35 / 2602.19177

Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content

下一回复预测 X 数据集:天真生成内容中的语言差异
Münker, Simon, Schwager, Nils, Kugler, Kai, Heseltine, Michael, Rettinger, Achim
Abstract
The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, to create a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.
Chinese Translation
大型语言模型(LLMs)作为社会科学研究中人类参与者的代理日益被使用,这一趋势呈现出一种有前景但方法论上风险较大的范式转变。虽然 LLMs 提供了可扩展性和成本效益,但其“天真”应用,即在没有明确行为约束的情况下提示生成内容,导致了显著的语言差异,这对研究结果的有效性构成了挑战。本文通过引入一种基于历史的回复预测任务,利用真实的 X(前身为 Twitter)数据,提出了一种新颖的数据集,旨在评估 LLMs 的语言输出与人类生成内容之间的差异。我们使用风格和内容基础的指标分析这些差异,为研究人员提供了一个定量框架,以评估合成数据的质量和真实性。我们的研究结果强调了需要更复杂的提示技术和专业数据集,以确保 LLM 生成的内容准确反映人类交流的复杂语言模式,从而提高计算社会科学研究的有效性。
cs.CL / 36 / 2602.19212

Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection

增强型双重协同注意力框架用于目标感知的多模态孟加拉仇恨表情包检测
Tanvir, Raihan, Alam, Md. Golam Rabiul
Abstract
Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.
Chinese Translation
社交媒体上的仇恨内容越来越多地以多模态表情包的形式出现,这些表情包结合了图像和文本以传达有害叙事。在孟加拉语等低资源语言中,由于标注数据有限、类别不平衡以及普遍存在的代码混合,自动检测仍然面临挑战。为了解决这些问题,我们通过从孟加拉多模态攻击数据集(MIMOSA)中引入语义对齐的样本来增强孟加拉仇恨表情包(BHM)数据集,从而改善类别平衡和语义多样性。我们提出了增强型双重协同注意力框架(xDORA),通过加权注意力池化集成视觉编码器(CLIP, DINOv2)和多语言文本编码器(XGLM, XLM-R),以学习稳健的跨模态表示。在这些嵌入的基础上,我们开发了一种基于FAISS的k近邻分类器用于非参数推理,并引入了RAG融合DORA,该方法结合了检索驱动的上下文推理。我们进一步在零样本、少样本和检索增强提示设置下评估LLaVA。在扩展数据集上的实验表明,xDORA(CLIP + XLM-R)在仇恨表情包识别中实现了0.78的宏平均F1分数,在目标实体检测中实现了0.71,而RAG融合DORA将性能提升至0.79和0.74,相较于DORA基线取得了显著提升。基于FAISS的分类器表现出竞争力,并通过语义相似性建模展示了对稀有类别的稳健性。相比之下,LLaVA在少样本设置下的有效性有限,在检索增强下仅有适度的改善,突显了预训练视觉-语言模型在未经过微调的代码混合孟加拉内容中的局限性。这些发现展示了监督、检索增强和非参数多模态框架在解决低资源仇恨言论检测中的语言和文化复杂性方面的有效性。
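The FAISS-based classifier above lends itself to a short sketch: cosine k-NN via inner product on L2-normalized vectors plus a majority vote. Inputs are assumed to be a float32 embedding matrix and non-negative integer labels; this is an illustrative reading, not the paper's code.

    import faiss
    import numpy as np

    def knn_predict(train_emb, train_labels, query_emb, k=5):
        """Cosine k-NN over fused multimodal embeddings (FAISS inner
        product on normalized float32 vectors) with a majority vote;
        semantic neighbors give rare classes a chance even when
        parametric heads underfit them."""
        train = np.ascontiguousarray(train_emb, dtype="float32")
        query = np.ascontiguousarray(query_emb, dtype="float32")
        faiss.normalize_L2(train)
        faiss.normalize_L2(query)
        index = faiss.IndexFlatIP(train.shape[1])
        index.add(train)
        _, nbrs = index.search(query, k)
        return np.array([np.bincount(train_labels[row]).argmax() for row in nbrs])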
cs.CL / 37 / 2602.19317

Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

学习推理以实现个性化问答中的多步检索个人上下文
Amirizaniani, Maryam, Salemi, Alireza, Zamani, Hamed
Abstract
Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context. Existing state-of-the-art methods primarily rely on retrieval-augmented generation (RAG) solutions that construct personal context by retrieving relevant items from the user's profile. Existing methods use the user's query directly to retrieve personal documents, and such strategies often lead to surface-level personalization. We propose PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that integrates reasoning and retrieval from personal context for personalization. PR2 learns adaptive retrieval-reasoning policies, determining when to retrieve, what evidence to retrieve from user profiles, and how to incorporate it into intermediate reasoning steps. By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model. Extensive experiments on the LaMP-QA benchmark using three LLMs show that PR2 consistently outperforms strong baselines, achieving an average relative improvement of 8.8%-12% in personalized QA.
Chinese Translation
问答(QA)中的个性化要求提供既准确又与用户背景、偏好和历史上下文相一致的答案。现有的最先进方法主要依赖于检索增强生成(RAG)解决方案,通过从用户档案中检索相关项目来构建个人上下文。现有方法直接使用用户的查询来检索个人文档,这种策略往往导致表面层次的个性化。我们提出了PR2(个性化检索增强推理),这是一个强化学习框架,集成了推理和从个人上下文中检索以实现个性化。PR2学习自适应的检索-推理策略,确定何时检索、从用户档案中检索哪些证据,以及如何将其融入中间推理步骤。通过在个性化奖励函数下优化多轮推理轨迹,该框架强化了更好地与用户特定偏好和奖励模型反映的上下文信号相一致的推理路径。在LaMP-QA基准上使用三种大型语言模型(LLMs)进行的广泛实验表明,PR2始终优于强基线,在个性化问答中实现了8.8%-12%的平均相对提升。
cs.CL / 38 / 2602.19320

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

能动记忆的结构:评估与系统局限性的分类法与实证分析
Jiang, Dongming, Li, Yi, Wei, Songtao, Yang, Jinxin, Kishore, Ayushi, Zhao, Alysa, Kang, Dingyi, Hu, Xu, Chen, Feng, Li, Qiannan, Li, Bingzhe
Abstract
Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.
Chinese Translation
能动记忆系统使大型语言模型(LLM)代理能够在长时间交互中保持状态,支持超越固定上下文窗口的长期推理和个性化。尽管架构发展迅速,这些系统的实证基础仍然脆弱:现有基准往往规模不足,评估指标与语义效用不一致,性能在不同基础模型之间差异显著,系统级成本常常被忽视。本调查从架构和系统的角度对能动记忆进行了结构化分析。我们首先基于四种记忆结构介绍了MAG系统的简明分类法。然后,我们分析了限制当前系统的关键痛点,包括基准饱和效应、指标有效性和评判敏感性、依赖基础模型的准确性,以及记忆维护引入的延迟和吞吐量开销。通过将记忆结构与实证局限性联系起来,本调查阐明了为什么当前的能动记忆系统常常未能实现其理论承诺,并概述了更可靠的评估和可扩展系统设计的方向。
cs.CL / 39 / 2602.19333

PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

PerSoMed:一个大规模平衡的波斯社交媒体文本分类数据集
Chehreh, Isun, Ansari, Ebrahim
Abstract
This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1- score: 0.9621). Class-wise evaluation further confirms robust performance across all categories, though social and political texts exhibited slightly lower scores due to inherent ambiguity. This research presents a new high-quality dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research endeavors.
Chinese Translation
本研究介绍了第一个大规模、良好平衡的波斯社交媒体文本分类数据集,专门设计用于解决该领域缺乏全面资源的问题。该数据集包含36,000条帖子,涵盖九个类别(经济、艺术、体育、政治、社会、健康、心理、历史和科学与技术),每个类别包含4,000个样本,以确保类别分布的平衡。数据收集涉及来自多个波斯社交媒体平台的60,000条原始帖子,随后进行了严格的预处理和混合注释,结合了基于ChatGPT的少量提示与人工验证。为了缓解类别不平衡,我们采用了语义冗余移除的欠采样方法和先进的数据增强策略,整合了词汇替换和生成提示。我们对多个模型进行了基准测试,包括BiLSTM、XLM-RoBERTa(带有LoRA和AdaLoRA适配)、FaBERT、基于SBERT的架构,以及波斯特定的TookaBERT(基础版和大型版)。实验结果表明,基于变换器的模型始终优于传统神经网络,其中TookaBERT-Large实现了最佳性能(精确率:0.9622,召回率:0.9621,F1分数:0.9621)。类别评估进一步确认了所有类别的稳健表现,尽管由于固有的模糊性,社会和政治文本的得分略低。本研究提供了一个新的高质量数据集,并对前沿模型进行了全面评估,为波斯自然语言处理的进一步发展奠定了坚实基础,包括趋势分析、社会行为建模和用户分类。该数据集已公开,以支持未来的研究工作。
cs.CL / 40 / 2602.19403

Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

基于大型语言模型数字双胞胎的感知信息有效性个性化预测
Han, Jasmin, Devkota, Janardan, Waring, Joseph, Luken, Amanda, Naughton, Felix, Vilardaga, Roger, Bricker, Jonathan, Latkin, Carl, Moran, Meghan, Chen, Yiqun, Thrul, Johannes
Abstract
Perceived message effectiveness (PME) by potential intervention end-users is important for selecting and optimizing personalized smoking cessation intervention messages for mobile health (mHealth) platform delivery. This study evaluates whether large language models (LLMs) can accurately predict PME for smoking cessation messages. We evaluated multiple models for predicting PME across three domains: content quality, coping support, and quitting support. The dataset comprised 3010 message ratings (5-point Likert scale) from 301 young adult smokers. We compared (1) supervised learning models trained on labeled data, (2) zero and few-shot LLMs prompted without task-specific fine-tuning, and (3) LLM-based digital twins that incorporate individual characteristics and prior PME histories to generate personalized predictions. Model performance was assessed on three held-out messages per participant using accuracy, Cohen's kappa, and F1. LLM-based digital twins outperformed zero and few-shot LLMs (12 percentage points on average) and supervised baselines (13 percentage points), achieving accuracies of 0.49 (content), 0.45 (coping), and 0.49 (quitting), with directional accuracies of 0.75, 0.66, and 0.70 on a simplified 3-point scale. Digital twin predictions showed greater dispersion across rating categories, indicating improved sensitivity to individual differences. Integrating personal profiles with LLMs captures person-specific differences in PME and outperforms supervised and zero and few-shot approaches. Improved PME prediction may enable more tailored intervention content in mHealth. LLM-based digital twins show potential for supporting personalization of mobile smoking cessation and other health behavior change interventions.
Chinese Translation
潜在干预最终用户对感知信息有效性(PME)的评估对于选择和优化个性化的戒烟干预信息以在移动健康(mHealth)平台上进行传递至关重要。本研究评估了大型语言模型(LLMs)是否能够准确预测戒烟信息的PME。我们在内容质量、应对支持和戒烟支持三个领域评估了多种模型对PME的预测能力。数据集包含来自301名年轻成人吸烟者的3010条信息评分(5点李克特量表)。我们比较了(1)在标记数据上训练的监督学习模型,(2)未经过特定任务微调的零样本和少样本LLMs,以及(3)结合个体特征和先前PME历史的LLM基础数字双胞胎,以生成个性化预测。模型性能通过每位参与者的三条保留信息使用准确率、Cohen's kappa和F1进行评估。LLM基础的数字双胞胎在准确率上优于零样本和少样本LLMs(平均高出12个百分点)和监督基线(高出13个百分点),在内容、应对和戒烟的准确率分别为0.49、0.45和0.49,在简化的3点量表上的方向性准确率为0.75、0.66和0.70。数字双胞胎的预测在评分类别之间显示出更大的分散性,表明对个体差异的敏感性提高。将个人档案与LLMs结合能够捕捉到个体在PME上的差异,并优于监督和零样本及少样本方法。改进的PME预测可能使mHealth中的干预内容更加个性化。LLM基础的数字双胞胎在支持移动戒烟和其他健康行为改变干预的个性化方面显示出潜力。
cs.CL / 41 / 2602.19509

Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

金字塔式 MoA:一种成本优化的随时推理概率框架
Khaled, Arindam
Abstract
Large Language Models (LLMs) face a persistent trade-off between inference cost and reasoning capability. While "Oracle" models (e.g., Llama-3-70B) achieve state-of-the-art accuracy, they are prohibitively expensive for high-volume deployment. Smaller models (e.g., 8B parameters) are cost-effective but struggle with complex tasks. In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary. By leveraging semantic agreement and confidence calibration among an ensemble of small models, our Router identifies "hard" problems with high precision. On the GSM8K benchmark, our system achieves 93.0% accuracy, effectively matching the Oracle baseline (98.0%) while reducing compute costs by 61%. We demonstrate that the system introduces negligible latency overhead (+0.82s) and allows for a tunable trade-off between performance and budget.
Chinese Translation
大型语言模型(LLMs)面临推理成本与推理能力之间的持续权衡。虽然“Oracle”模型(例如,Llama-3-70B)实现了最先进的准确性,但在高容量部署中成本过于昂贵。较小的模型(例如,8B参数)具有成本效益,但在复杂任务上表现不佳。在本研究中,我们提出了“金字塔式 MoA”,一种分层的混合代理(Mixture-of-Agents)架构,使用轻量级的路由器(Router)仅在必要时动态提升查询。通过利用一组小模型之间的语义一致性和置信度校准,我们的路由器能够高精度地识别“困难”问题。在GSM8K基准测试中,我们的系统实现了93.0%的准确率,有效地接近Oracle基线(98.0%),同时将计算成本降低了61%。我们证明该系统引入的延迟开销微乎其微(+0.82秒),并允许在性能与预算之间进行可调的权衡。
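The abstract implies the Router escalates on disagreement within the small-model ensemble; a minimal sketch of one such agreement test follows. The embedding function, the threshold, and the pure pairwise-cosine rule are assumptions, and the confidence-calibration signal the paper also mentions is omitted here.

    import itertools
    import numpy as np

    def should_escalate(answers, embed, agree_thresh=0.9):
        """Escalate to the Oracle model only when the small-model ensemble
        disagrees: embed each candidate answer, L2-normalize, and compare
        mean pairwise cosine similarity against a tunable threshold."""
        if len(answers) < 2:
            return True
        E = np.stack([embed(a) for a in answers])
        E /= np.linalg.norm(E, axis=1, keepdims=True)
        pairs = itertools.combinations(range(len(answers)), 2)
        sims = [float(E[i] @ E[j]) for i, j in pairs]
        return sum(sims) / len(sims) < agree_thresh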
cs.CL / 42 / 2602.19526

How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

如何训练您的深度研究代理?Search-R1中的提示、奖励和策略优化
Xu, Yinuo, Lu, Shuo, Cheng, Jianjie, Wang, Meng, Xie, Qianlong, Wang, Xingxing, He, Ran, Liang, Jian
Abstract
Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the EM due to training collapse driven by answer avoidance; this can be mitigated by incorporating action-level penalties, ultimately surpassing EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among policy optimization methods. Building on these insights, we then introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.
Chinese Translation
深度研究代理通过多轮检索和决策导向生成来处理知识密集型任务。尽管强化学习(RL)已被证明能够提高这一范式的性能,但其贡献仍未得到充分探索。为了全面理解RL的作用,我们沿着三个解耦维度进行系统研究:提示模板、奖励函数和策略优化。我们的研究揭示了以下几点:1)快速思维模板比先前工作中使用的慢思维模板具有更大的稳定性和更好的性能;2)基于F1的奖励因回答回避导致的训练崩溃而表现不佳,而通过引入行动级惩罚可以缓解这一问题,最终超越了EM;3)REINFORCE在需要更少搜索动作的情况下优于PPO,而GRPO在策略优化方法中显示出最差的稳定性。在这些洞察的基础上,我们引入了Search-R1++,这是一个强基线,将Search-R1的性能从0.403提高到0.442(Qwen2.5-7B)和从0.289提高到0.331(Qwen2.5-3B)。我们希望我们的发现能够为深度研究系统中更有原则和可靠的RL训练策略铺平道路。
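The reward finding above translates into a small function: token-level F1 as the outcome reward, minus a per-action penalty beyond a free budget, to counter the answer-avoidance collapse attributed to plain F1. The coefficients are illustrative assumptions, not the paper's values.

    def token_f1(pred, gold):
        p, g = pred.lower().split(), gold.lower().split()
        overlap = sum(min(p.count(t), g.count(t)) for t in set(p) & set(g))
        if not overlap:
            return 0.0
        prec, rec = overlap / len(p), overlap / len(g)
        return 2 * prec * rec / (prec + rec)

    def shaped_reward(pred, gold, n_actions, penalty=0.1, free_actions=2):
        """F1 outcome reward with an action-level penalty beyond a small
        free budget, discouraging both answer avoidance and redundant
        search calls."""
        return token_f1(pred, gold) - penalty * max(0, n_actions - free_actions)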
cs.CL / 43 / 2602.19543

Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation

Hyper-KGGen:一种面向高质量知识超图生成的技能驱动知识提取器
Huang, Rizhuo, Feng, Yifan, Xue, Rundong, Ying, Shihui, Yong, Jun-Hai, Shi, Chuan, Du, Shaoyi, Gao, Yue
Abstract
Knowledge hypergraphs surpass traditional binary knowledge graphs by encapsulating complex $n$-ary atomic facts, providing a more comprehensive paradigm for semantic representation. However, constructing high-quality hypergraphs remains challenging due to the \textit{scenario gap}: generic extractors struggle to generalize across diverse domains with specific jargon, while existing methods often fail to balance structural skeletons with fine-grained details. To bridge this gap, we propose \textbf{Hyper-KGGen}, a skill-driven framework that reformulates extraction as a dynamic skill-evolving process. First, Hyper-KGGen employs a \textit{coarse-to-fine} mechanism to systematically decompose documents, ensuring full-dimensional coverage from binary links to complex hyperedges. Crucially, it incorporates an \textit{adaptive skill acquisition} module that actively distills domain expertise into a Global Skill Library. This is achieved via a stability-based feedback loop, where extraction stability serves as a relative reward signal to induce high-quality skills from unstable traces and missed predictions. Additionally, we present \textbf{HyperDocRED}, a rigorously annotated benchmark for document-level knowledge hypergraph extraction. Experiments demonstrate that Hyper-KGGen significantly outperforms strong baselines, validating that evolved skills provide substantially richer guidance than static few-shot examples in multi-scenario settings.
Chinese Translation
知识超图通过封装复杂的 $n$-元原子事实,超越了传统的二元知识图谱,为语义表示提供了更全面的范式。然而,由于 \textit{场景差距},构建高质量的超图仍然具有挑战性:通用提取器在具有特定术语的多样化领域中难以泛化,而现有方法往往无法在结构骨架与细粒度细节之间取得平衡。为了解决这一问题,我们提出了 \textbf{Hyper-KGGen},一种将提取过程重新构建为动态技能演化过程的技能驱动框架。首先,Hyper-KGGen采用 \textit{粗到细}机制系统性地分解文档,确保从二元链接到复杂超边的全维覆盖。关键是,它结合了 \textit{自适应技能获取}模块,主动将领域专业知识提炼到全局技能库中。这是通过基于稳定性的反馈循环实现的,其中提取稳定性作为相对奖励信号,以从不稳定的痕迹和遗漏的预测中诱导出高质量技能。此外,我们提出了 \textbf{HyperDocRED},这是一个经过严格注释的文档级知识超图提取基准。实验表明,Hyper-KGGen显著优于强基线,验证了演化出的技能在多场景设置中提供的指导远比静态的少量示例丰富。
cs.CL / 44 / 2602.19548

Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining

超越单一提取器:重新思考用于大规模语言模型预训练的HTML到文本提取
Li, Jeffrey, Gardner, Josh, Kang, Doug, Shi, Fangping, Singh, Karanjeet, Li, Chun-Liang, Shandilya, Herumb, Hall, David, Tuzel, Oncel, Liang, Percy, Schmidt, Ludwig, Ansari, Hadi Pour, Faghri, Fartash
Abstract
One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.
Chinese Translation
构建网络规模的大规模语言模型预训练数据集的首要预处理步骤之一是从HTML中提取文本。尽管网络内容极为多样,现有的开源数据集主要对所有网页应用单一固定的提取器。在本研究中,我们探讨这种做法是否导致互联网数据的覆盖和利用不佳。我们首先展示,尽管不同的提取器可能在标准语言理解任务上导致相似的模型性能,但经过固定过滤管道的页面可能有显著差异。这表明一个简单的干预措施:通过对不同提取器取并集,我们可以在保持基准性能的同时,将DCLM-Baseline的标记产出提高多达71%。我们进一步展示,对于结构化内容如表格和代码块,提取器的选择可能会显著影响下游任务的性能,在WikiTQ上差异可达10个百分点(p.p.),在HumanEval上差异可达3个百分点(p.p.)。
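The Union intervention described above reduces to a few lines: run several extractors on each page and keep it if any output survives the fixed quality filter. Extractors (e.g. trafilatura, resiliparse) and the filter are passed in as callables, since the paper's exact pipeline is not reproduced here; preferring the longest surviving text is an assumption.

    def union_extract(html, extractors, filter_fn):
        """Run each HTML-to-text extractor; keep the page if ANY output
        survives `filter_fn`, returning the longest surviving text (or
        None), which is how the Union raises token yield without
        loosening the filter itself."""
        surviving = []
        for extract in extractors:
            try:
                text = extract(html)
            except Exception:
                continue
            if text and filter_fn(text):
                surviving.append(text)
        return max(surviving, key=len) if surviving else None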
cs.CL / 45 / 2602.19549

Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework

雕刻向量空间:通过修剪-合并框架实现高效的多向量视觉文档检索
Yan, Yibo, Ou, Mingdong, Cao, Yi, Zou, Xin, Huo, Jiahao, Liu, Shuliang, Kwok, James, Hu, Xuming
Abstract
Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that synergizes these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.
Chinese Translation
视觉文档检索(Visual Document Retrieval, VDR)旨在从大量视觉丰富的文档中检索相关页面,在当前的多模态检索应用中具有重要意义。最先进的多向量范式在性能上表现出色,但存在过高的开销,当前的效率方法如修剪和合并对此问题的解决并不完美,导致压缩率与特征保真度之间的困难权衡。为了解决这一困境,我们提出了修剪-合并(Prune-then-Merge)这一新颖的两阶段框架,协同利用这两种互补的方法。我们的方法首先采用自适应修剪阶段过滤掉低信息量的补丁,创建一个精炼的高信号嵌入集。随后,层次合并阶段对这一预过滤的集合进行压缩,有效总结语义内容,而不会出现单阶段方法中因噪声引起的特征稀释。对29个VDR数据集的广泛实验表明,我们的框架始终优于现有方法,显著扩展了近无损压缩范围,并在高压缩比下提供了稳健的性能。
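A hedged sketch of the two stages on a single page's patch embeddings: prune by an information score (abstracted into `scores`, since the paper's adaptive scoring is not detailed here), then greedily merge the most similar surviving pair until a target count remains. The keep ratio and pairwise-average merge rule are assumptions.

    import torch
    import torch.nn.functional as F

    def prune_then_merge(E, scores, keep_ratio=0.5, n_out=64):
        """(1) Prune the patch embeddings E [n, d] with the lowest
        information scores; (2) hierarchically merge survivors by
        repeatedly averaging the most similar pair until n_out remain,
        so merging summarizes a pre-filtered, high-signal set."""
        k = min(len(E), max(n_out, int(keep_ratio * len(E))))
        E = E[scores.topk(k).indices]
        while len(E) > n_out:
            S = F.normalize(E, dim=1)
            sim = S @ S.T
            sim.fill_diagonal_(-1.0)
            i, j = divmod(int(sim.argmax()), len(E))
            mask = torch.ones(len(E), dtype=torch.bool)
            mask[i] = mask[j] = False
            E = torch.cat([E[mask], ((E[i] + E[j]) / 2).unsqueeze(0)])
        return E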
cs.CL / 46 / 2602.19569

Temporal-Aware Heterogeneous Graph Reasoning with Multi-View Fusion for Temporal Question Answering

基于时间感知的异构图推理与多视角融合用于时间问答
Wen, Wuzhenghong, Zhou, Bowen, Huang, Jinwen, Wu, Xianjie, Sun, Yuwei, Pan, Su, Li, Liang, Liu, Jianting
Abstract
Question Answering over Temporal Knowledge Graphs (TKGQA) has attracted growing interest for handling time-sensitive queries. However, existing methods still struggle with: 1) weak incorporation of temporal constraints in question representation, causing biased reasoning; 2) limited ability to perform explicit multi-hop reasoning; and 3) suboptimal fusion of language and graph representations. We propose a novel framework with temporal-aware question encoding, multi-hop graph reasoning, and multi-view heterogeneous information fusion. Specifically, our approach introduces: 1) a constraint-aware question representation that combines semantic cues from language models with temporal entity dynamics; 2) a temporal-aware graph neural network for explicit multi-hop reasoning via time-aware message passing; and 3) a multi-view attention mechanism for more effective fusion of question context and temporal graph knowledge. Experiments on multiple TKGQA benchmarks demonstrate consistent improvements over multiple baselines.
Chinese Translation
基于时间知识图谱的问答(TKGQA)在处理时间敏感查询方面引起了越来越多的关注。然而,现有方法仍然面临以下挑战:1)在问题表征中对时间约束的弱整合,导致推理偏差;2)进行显式多跳推理的能力有限;3)语言和图表示的融合效果不佳。我们提出了一种新颖的框架,包含时间感知的问题编码、多跳图推理和多视角异构信息融合。具体而言,我们的方法引入了:1)一种约束感知的问题表征,将语言模型的语义线索与时间实体动态相结合;2)一种时间感知的图神经网络,通过时间感知的信息传递实现显式多跳推理;3)一种多视角注意机制,以更有效地融合问题上下文和时间图知识。在多个TKGQA基准测试上的实验表明,相较于多种基线方法,我们的方法具有一致的性能提升。
cs.CL / 47 / 2602.19583

DEEP: Docker-based Execution and Evaluation Platform

DEEP:基于Docker的执行与评估平台
González, Sergio Gómez, Domingo, Miguel, Casacuberta, Francisco
Abstract
Comparative evaluation of several systems is a recurrent task in researching. It is a key step before deciding which system to use for our work, or, once our research has been conducted, to demonstrate the potential of the resulting model. Furthermore, it is the main task of competitive, public challenges evaluation. Our proposed software (DEEP) automates both the execution and scoring of machine translation and optical character recognition models. Furthermore, it is easily extensible to other tasks. DEEP is prepared to receive dockerized systems, run them (extracting information at that same time), and assess hypothesis against some references. With this approach, evaluators can achieve a better understanding of the performance of each model. Moreover, the software uses a clustering algorithm based on a statistical analysis of the significance of the results yielded by each model, according to the evaluation metrics. As a result, evaluators are able to identify clusters of performance among the swarm of proposals and have a better understanding of the significance of their differences. Additionally, we offer a visualization web-app to ensure that the results can be adequately understood and interpreted. Finally, we present an exemplary case of use of DEEP.
Chinese Translation
对多个系统的比较评估是研究中的一项常见任务。这是决定使用哪个系统进行工作之前的关键步骤,或者在研究完成后,展示所得到模型潜力的必要环节。此外,这也是竞争性公共挑战评估的主要任务。我们提出的软件(DEEP)自动化了机器翻译和光学字符识别模型的执行和评分。此外,它也容易扩展到其他任务。DEEP准备接收docker化的系统,运行它们(同时提取信息),并根据一些参考评估假设。通过这种方法,评估者可以更好地理解每个模型的性能。此外,该软件使用基于对每个模型产生的结果显著性进行统计分析的聚类算法。因此,评估者能够在众多提案中识别出性能集群,并更好地理解它们之间差异的显著性。此外,我们提供了一个可视化网页应用,以确保结果能够被适当地理解和解释。最后,我们展示了DEEP的一个示例用例。
cs.CL / 48 / 2602.19598

Eye-Tracking-while-Reading: A Living Survey of Datasets with Open Library Support

阅读时眼动追踪:一项具有开放库支持的数据集动态综述
Jakobi, Deborah N., Reich, David R., Prasse, Paul, Hofmann, Jana M., Bolliger, Lena S., Jäger, Lena A.
Abstract
Eye-tracking-while-reading corpora are a valuable resource for many different disciplines and use cases. Use cases range from studying the cognitive processes underlying reading to machine-learning-based applications, such as gaze-based assessments of reading comprehension. The past decades have seen an increase in the number and size of eye-tracking-while-reading datasets as well as increasing diversity with regard to the stimulus languages covered, the linguistic background of the participants, or accompanying psychometric or demographic data. The spread of data across different disciplines and the lack of data sharing standards across the communities lead to many existing datasets that cannot be easily reused due to a lack of interoperability. In this work, we aim at creating more transparency and clarity with regards to existing datasets and their features across different disciplines by i) presenting an extensive overview of existing datasets, ii) simplifying the sharing of newly created datasets by publishing a living overview online, https://dili-lab.github.io/datasets.html, presenting over 45 features for each dataset, and iii) integrating all publicly available datasets into the Python package pymovements which offers an eye-tracking datasets library. By doing so, we aim to strengthen the FAIR principles in eye-tracking-while-reading research and promote good scientific practices, such as reproducing and replicating studies.
Chinese Translation
阅读时眼动追踪语料库是许多不同学科和应用场景的宝贵资源。应用场景从研究阅读背后的认知过程到基于机器学习的应用,如基于注视的阅读理解评估。过去几十年,阅读时眼动追踪数据集的数量和规模不断增加,同时在所涵盖的刺激语言、参与者的语言背景以及伴随的心理测量或人口统计数据方面也日益多样化。不同学科之间数据的分散以及各社区缺乏数据共享标准,导致许多现有数据集由于缺乏互操作性而无法轻易重用。在本研究中,我们旨在通过以下方式提高现有数据集及其特征的透明度和清晰度:i) 提供现有数据集的广泛概述,ii) 通过在线发布一个动态概述(https://dili-lab.github.io/datasets.html),简化新创建数据集的共享,为每个数据集提供超过45个特征的展示,以及 iii) 将所有公开可用的数据集整合到Python包pymovements中,该包提供了一个眼动追踪数据集库。通过这样做,我们旨在加强阅读时眼动追踪研究中的FAIR原则,并促进良好的科学实践,如研究的再现和复制。
cs.CL / 49 / 2602.19612

Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

遗忘的解剖:事实显著性与模型微调的双重影响
Borisiuk, Anna, Savchenko, Andrey, Panchenko, Alexander, Tutubalina, Elena
Abstract
Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.
Chinese Translation
机器遗忘(Machine Unlearning, MU)使大型语言模型(Large Language Models, LLMs)能够移除不安全或过时的信息。然而,现有研究假设所有事实都是同样容易被遗忘的,并且在很大程度上忽视了被遗忘知识的来源是来自预训练还是监督微调(Supervised Fine-Tuning, SFT)。在本文中,我们引入了DUAL(训练阶段的双重遗忘评估),这是一个基于28.6k个维基数据(Wikidata)衍生三元组的基准,这些三元组使用维基百科链接计数和基于LLM的显著性评分标注了事实流行度。我们的实验表明,预训练模型和SFT模型对遗忘的反应有所不同。在遗忘数据上的SFT步骤产生了更平滑的遗忘过程、更稳定的调优,以及10-50%的更高保留率,而对预训练模型的直接遗忘则保持不稳定,容易导致重新学习或灾难性遗忘。
cs.CL / 50 / 2602.19643

KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

KGHaluBench:基于知识图谱的幻觉基准,用于评估大型语言模型知识的广度和深度
Robertson, Alex, Liang, Huizhi, Gani, Mahbub, Kumar, Rohit, Rajamohan, Srijith
Abstract
Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as the responses often contain subtle hallucinations. Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations. We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness. Our framework utilises the KG to dynamically construct challenging, multifaceted questions, whose difficulty is then statistically estimated to address popularity bias. Our automated verification pipeline detects abstentions and verifies the LLM's response at both conceptual and correctness levels to identify different types of hallucinations. We evaluate 25 frontier models, using novel accuracy and hallucination metrics. The results provide a more interpretable insight into the knowledge factors that cause hallucinations across different model sizes. KGHaluBench is publicly available to support future developments in hallucination mitigation.
Chinese Translation
大型语言模型(LLMs)具备生成令人信服且易于理解的语言的显著能力。然而,连贯性并不等同于真实性,因为其响应中常常包含微妙的幻觉。现有的基准测试受限于静态和狭窄的问题,导致覆盖面有限且评估结果具有误导性。我们提出了KGHaluBench,一个基于知识图谱的幻觉基准,旨在评估LLMs在知识的广度和深度方面,提供对LLM真实性更公平和全面的洞察。我们的框架利用知识图谱动态构建具有挑战性和多面的提问,并通过统计方法估计其难度,以应对流行偏见。我们的自动验证流程检测放弃回答的情况,并在概念和正确性层面验证LLM的响应,以识别不同类型的幻觉。我们评估了25个前沿模型,使用新颖的准确性和幻觉指标。结果提供了对导致不同模型规模幻觉的知识因素的更可解释的洞察。KGHaluBench已公开发布,以支持未来在幻觉缓解方面的研究发展。
cs.CL / 51 / 2602.19815

Keyboards for the Endangered Idu Mishmi Language

面向濒危语言伊杜米什米语的键盘
Ramarao, Akhilesh Kakolu
Abstract
We present a mobile and desktop keyboard suite for Idu Mishmi, an endangered Trans-Himalayan language spoken by approximately 11,000 people in Arunachal Pradesh, India. Although a Latin-based orthography was developed in 2018, no digital input tools existed to use it, forcing speakers into ad-hoc romanizations that cannot represent the full writing system. Our keyboards comprise two tools: (1) an Android mobile keyboard, published on the Google Play Store and actively used in teacher training programs, and (2) a Windows desktop keyboard currently undergoing community testing. Both tools support the complete Idu Mishmi character inventory, including schwa, retracted schwa, nasalized vowels, and accented forms. Both operate fully offline with zero network permissions, addressing connectivity constraints and data sovereignty concerns. We describe the design, implementation, and deployment as a replicable model for other endangered language communities.
Chinese Translation
我们提出了一套适用于伊杜米什米语言的移动和桌面键盘,这是一种濒危的跨喜马拉雅语言,在印度阿鲁纳恰尔邦大约有11,000人使用。尽管在2018年开发了基于拉丁字母的正字法,但并没有数字输入工具可供使用,这迫使说话者采用临时的罗马化方式,无法完整表达书写系统。我们的键盘包括两个工具:(1) 一款Android移动键盘,已在Google Play商店发布,并在教师培训项目中积极使用;(2) 一款正在进行社区测试的Windows桌面键盘。这两种工具支持完整的伊杜米什米字符集,包括央元音(schwa)、后缩央元音、鼻化元音和带重音符的形式。两者均完全离线操作,无需网络权限,解决了连接限制和数据主权问题。我们描述了设计、实施和部署过程,作为其他濒危语言社区的可复制模型。
cs.CL / 52 / 2602.19840

SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation

SAMAS:一种基于谱引导的多智能体系统,用于实现文学翻译中的风格保真
Wu, Jingzhuo, Zhang, Jiajun, Jin, Keyan, Ma, Dehua, Wang, Junbo
Abstract
Modern large language models (LLMs) excel at generating fluent and faithful translations. However, they struggle to preserve an author's unique literary style, often producing semantically correct but generic outputs. This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations. To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task. Specifically, our method quantifies literary style into a Stylistic Feature Spectrum (SFS) using the wavelet packet transform. This SFS serves as a control signal to dynamically assemble a tailored workflow of specialized translation agents based on the source text's structural patterns. Extensive experiments on translation benchmarks show that SAMAS achieves competitive semantic accuracy against strong baselines, primarily by leveraging its statistically significant advantage in style fidelity.
Chinese Translation
现代大型语言模型(LLMs)在生成流畅且忠实的翻译方面表现出色。然而,它们在保持作者独特的文学风格方面存在困难,常常产生语义上正确但缺乏个性的输出。这一局限性源于当前单模型和静态多智能体系统无法感知和适应风格变异。为了解决这一问题,我们提出了风格自适应多智能体系统(SAMAS),这是一个将风格保留视为信号处理任务的新框架。具体而言,我们的方法通过小波包变换将文学风格量化为风格特征谱(SFS)。该SFS作为控制信号,动态组装基于源文本结构模式的专门翻译智能体的定制工作流程。在翻译基准测试中的广泛实验表明,SAMAS在语义准确性方面与强基线相比具有竞争力,主要是通过利用其在风格保真方面的统计显著优势。
cs.CL / 53 / 2602.19855

SHIELD: Semantic Heterogeneity Integrated Embedding for Latent Discovery in Clinical Trial Safety Signals

SHIELD:用于临床试验安全信号潜在发现的语义异质性集成嵌入
Vandenhende, Francois, Georgiou, Anna, Psaras, Theodoros, Karekla, Ellie
Abstract
We present SHIELD, a novel methodology for automated and integrated safety signal detection in clinical trials. SHIELD combines disproportionality analysis with semantic clustering of adverse event (AE) terms applied to MedDRA term embeddings. For each AE, the pipeline computes an information-theoretic disproportionality measure (Information Component) with effect size derived via empirical Bayesian shrinkage. A utility matrix is constructed by weighting semantic term-term similarities by signal magnitude, followed by spectral embedding and clustering to identify groups of related AEs. Resulting clusters are annotated with syndrome-level summary labels using large language models, yielding a coherent, data-driven representation of treatment-associated safety profiles in the form of a network graph and hierarchical tree. We implement the SHIELD framework in the context of a single-arm incidence summary, to compare two treatment arms or for the detection of any treatment effect in a multi-arm trial. We illustrate its ability to recover known safety signals and generate interpretable, cluster-based summaries in a real clinical trial example. This work bridges statistical signal detection with modern natural language processing to enhance safety assessment and causal interpretation in clinical trials.
Chinese Translation
我们提出了SHIELD,一种用于临床试验中自动化和集成安全信号检测的新方法。SHIELD将不成比例性分析与不良事件(AE)术语的语义聚类结合,应用于MedDRA术语嵌入。对于每个AE,该流程计算信息论不成比例性度量(信息成分),其效应大小通过经验贝叶斯收缩得出。通过将语义术语间相似性按信号强度加权,构建效用矩阵,随后进行谱嵌入和聚类,以识别相关AE的组。生成的聚类由大型语言模型标注综合征级别的摘要标签,形成治疗相关安全特征的连贯数据驱动表示,以网络图和层次树的形式呈现。我们在单臂发生率总结的背景下实现SHIELD框架,以比较两个治疗组或在多臂试验中检测任何治疗效应。我们展示了其在真实临床试验示例中恢复已知安全信号和生成可解释的基于聚类的摘要的能力。这项工作将统计信号检测与现代自然语言处理相结合,以增强临床试验中的安全评估和因果解释。
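The Information Component has a standard BCPNN-style form that the abstract appears to invoke; a sketch with the customary 0.5 shrinkage constant follows. The paper's exact empirical-Bayes shrinkage scheme is not specified here, so treat the constant as an assumption.

    import numpy as np

    def information_component(n_obs, n_drug, n_event, n_total, shrink=0.5):
        """IC = log2((observed + s) / (expected + s)), where expected =
        n_drug * n_event / n_total under independence of drug and event,
        and s is a shrinkage constant that pulls small counts toward 0."""
        expected = n_drug * n_event / n_total
        return np.log2((n_obs + shrink) / (expected + shrink))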
cs.CL / 54 / 2602.19878

Axis Decomposition for ODRL: Resolving Dimensional Ambiguity in Policy Constraints through Interval Semantics

ODRL的轴分解:通过区间语义解决政策约束中的维度模糊性
Mustafa, Daham, Collarana, Diego, Peng, Yixin, Haque, Rafiqul, Lange-Bever, Christoph, Quix, Christoph, Decker, Stephan
Abstract
Every ODRL 2.2 constraint compares a single scalar value: (leftOperand, operator, rightOperand). Five of ODRL's approximately 34 left operands, however, denote multi-dimensional quantities--image dimensions, canvas positions, geographic coordinates--whose specification text explicitly references multiple axes. For these operands, a single scalar constraint admits one interpretation per axis, making policy evaluation non-deterministic. We classify ODRL's left operands by value-domain structure (scalar, dimensional, concept-valued), grounded in the ODRL 2.2 specification text, and show that dimensional ambiguity is intrinsic to the constraint syntax. We present an axis-decomposition framework that refines each dimensional operand into axis-specific scalar operands and prove four properties: deterministic interpretation, AABB completeness, sound over-approximation under projection, and conservative extension. Conflict detection operates in two layers: per-axis verdicts are always decidable; box-level verdicts compose through Strong Kleene conjunction into a three-valued logic (Conflict, Compatible, Unknown). For ODRL's disjunctive (odrl:or) and exclusive-or (odrl:xone) logical constraints, where per-axis decomposition does not apply, the framework encodes coupled multi-axis conjectures directly. We instantiate the framework as the ODRL Spatial Axis Profile--15 axis-specific left operands for the five affected base terms--and evaluate it on 117 benchmark problems spanning nine categories across both TPTP FOF (Vampire) and SMT-LIB (Z3) encodings, achieving full concordance between provers. Benchmark scenarios are inspired by constraints arising in cultural heritage dataspaces such as Datenraum Kultur. All meta-theorems are mechanically verified in Isabelle/HOL.
Chinese Translation
每个ODRL 2.2约束比较一个单一的标量值:(leftOperand,operator,rightOperand)。然而,ODRL大约34个左操作数中的五个表示多维量——图像尺寸、画布位置、地理坐标——其规范文本明确引用多个轴。对于这些操作数,单一标量约束在每个轴上各有一种解释,使得政策评估变得非确定性。我们依据ODRL 2.2规范文本,根据值域结构(标量、维度、概念值)对ODRL的左操作数进行分类,并表明维度模糊性是约束语法的内在特性。我们提出了一种轴分解框架,将每个维度操作数细化为特定轴的标量操作数,并证明了四个性质:确定性解释、AABB完整性、投影下的可靠过近似以及保守扩展。冲突检测在两个层次上进行:每个轴的判决总是可判定的;框级判决通过强Kleene合取组合成三值逻辑(冲突、兼容、未知)。对于轴分解不适用的ODRL析取(odrl:or)和异或(odrl:xone)逻辑约束,该框架直接编码耦合的多轴猜想。我们将该框架实例化为ODRL空间轴配置文件——针对五个受影响基本术语的15个特定轴左操作数,并在跨越九个类别的117个基准问题上进行评估,涵盖TPTP FOF(Vampire)和SMT-LIB(Z3)编码,两个证明器的结果完全一致。基准场景的灵感来自文化遗产数据空间(如Datenraum Kultur)中出现的约束。所有元定理均在Isabelle/HOL中经过机械化验证。
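The two-layer verdict logic is fully specified by the abstract and can be sketched directly: per-axis interval checks, then Strong Kleene conjunction at the box level. Encoding each axis-specific constraint as a closed interval is a simplifying assumption.

    from enum import Enum

    class Verdict(Enum):
        CONFLICT = "conflict"
        COMPATIBLE = "compatible"
        UNKNOWN = "unknown"

    def axis_verdict(a, b):
        """Per-axis check: each axis constraint denotes an interval
        (lo, hi); disjoint intervals conflict, overlapping ones are
        compatible, and a missing axis (None) stays Unknown."""
        if a is None or b is None:
            return Verdict.UNKNOWN
        (alo, ahi), (blo, bhi) = a, b
        return Verdict.COMPATIBLE if alo <= bhi and blo <= ahi else Verdict.CONFLICT

    def box_verdict(per_axis):
        """Box-level composition via Strong Kleene conjunction: any axis
        Conflict decides the box (disjoint on one axis means disjoint
        AABBs); otherwise any Unknown propagates; only all-Compatible
        yields Compatible."""
        if Verdict.CONFLICT in per_axis:
            return Verdict.CONFLICT
        if Verdict.UNKNOWN in per_axis:
            return Verdict.UNKNOWN
        return Verdict.COMPATIBLE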
cs.CL / 55 / 2602.19883

Denotational Semantics for ODRL: Knowledge-Based Constraint Conflict Detection

ODRL的指称语义:基于知识的约束冲突检测
Mustafa, Daham, Collarana, Diego, Peng, Yixin, Haque, Rafiqul, Lange-Bever, Christoph, Quix, Christoph, Decker, Stephan
Abstract
ODRL's six set-based operators -- isA, isPartOf, hasPart, isAnyOf, isAllOf, isNoneOf -- depend on external domain knowledge that the W3C specification leaves unspecified. Without it, every cross-dataspace policy comparison defaults to Unknown. We present a denotational semantics that maps each ODRL constraint to the set of knowledge-base concepts satisfying it. Conflict detection reduces to denotation intersection under a three-valued verdict -- Conflict, Compatible, or Unknown -- that is sound under incomplete knowledge. The framework covers all three ODRL composition modes (and, or, xone) and all three semantic domains arising in practice: taxonomic (class subsumption), mereological (part-whole containment), and nominal (identity). For cross-dataspace interoperability, we define order-preserving alignments between knowledge bases and prove two guarantees: conflicts are preserved across different KB standards, and unmapped concepts degrade gracefully to Unknown -- never to false conflicts. A runtime soundness theorem ensures that design-time verdicts hold for all execution contexts. The encoding stays within the decidable EPR fragment of first-order logic. We validate it with 154 benchmarks across six knowledge base families (GeoNames, ISO 3166, W3C DPV, a GDPR-derived taxonomy, BCP 47, and ISO 639-3) and four structural KBs targeting adversarial edge cases. Both the Vampire theorem prover and the Z3 SMT solver agree on all 154 verdicts. A key finding is that exclusive composition (xone) requires strictly stronger KB axioms than conjunction or disjunction: open-world semantics blocks exclusivity even when positive evidence appears to satisfy exactly one branch.
Chinese Translation
ODRL的六个基于集合的运算符——isA、isPartOf、hasPart、isAnyOf、isAllOf、isNoneOf——依赖于W3C规范未指定的外部领域知识。没有这些知识,每个跨数据空间的政策比较默认结果为未知。我们提出了一种指称语义,将每个ODRL约束映射到满足该约束的知识库概念集合。冲突检测归结为在三值判决下的指称交集——冲突、兼容或未知——在不完整知识下是合理的。该框架涵盖了所有三种ODRL组合模式(与、或、异或)以及实践中出现的三种语义领域:分类(类包含)、部分整体(部分-整体包含)和名义(身份)。为了实现跨数据空间的互操作性,我们定义了知识库之间的顺序保持对齐,并证明了两个保证:冲突在不同的知识库标准之间得以保留,未映射的概念优雅地降级为未知——绝不会出现错误冲突。运行时的健全性定理确保设计时的判决在所有执行上下文中成立。编码保持在一阶逻辑的可判定EPR片段内。我们通过154个基准测试对其进行了验证,涵盖六个知识库家族(GeoNames、ISO 3166、W3C DPV、基于GDPR的分类、BCP 47和ISO 639-3)以及四个针对对抗性边缘案例的结构化知识库。Vampire定理证明器和Z3 SMT求解器在所有154个判决上达成一致。一个关键发现是,排他性组合(异或)需要比合取或析取更强的知识库公理:开放世界语义即使在正面证据似乎恰好满足一个分支时也会阻止排他性。
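The core idea of denotation intersection under incomplete knowledge can be illustrated in a few lines. A minimal sketch, assuming a toy taxonomic knowledge base and simplified isA/isNoneOf encodings; the paper's EPR encoding and cross-KB alignment machinery are much richer.

```python
# Denotation-based conflict detection over a toy taxonomy. Unmapped concepts
# degrade gracefully to Unknown, never to a false conflict.
SUBCLASS = {"DE": "EU", "FR": "EU", "EU": "World", "US": "World"}  # child -> parent
UNIVERSE = set(SUBCLASS) | set(SUBCLASS.values())

def _subsumed(concept, ancestor):
    # walk up the taxonomy; True if `ancestor` subsumes `concept`
    while concept in SUBCLASS:
        concept = SUBCLASS[concept]
        if concept == ancestor:
            return True
    return False

def denotation(op, value):
    """Set of KB concepts satisfying the constraint, or None if unmapped."""
    if op == "isA":
        if value not in UNIVERSE:
            return None                        # unmapped concept -> Unknown
        return {c for c in UNIVERSE if c == value or _subsumed(c, value)}
    if op == "isNoneOf":
        values = set(value)
        if not values <= UNIVERSE:
            return None
        return UNIVERSE - values
    raise ValueError(f"unsupported operator: {op}")

def verdict(c1, c2):
    d1, d2 = denotation(*c1), denotation(*c2)
    if d1 is None or d2 is None:
        return "Unknown"
    return "Compatible" if d1 & d2 else "Conflict"

print(verdict(("isA", "EU"), ("isNoneOf", ["US"])))    # Compatible (DE, FR, EU remain)
print(verdict(("isA", "US"), ("isNoneOf", ["US"])))    # Conflict (empty intersection)
print(verdict(("isA", "Asia"), ("isNoneOf", ["US"])))  # Unknown (Asia not in the KB)
```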
cs.CL / 56 / 2602.19919

Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling

Janus-Q:通过层次门控奖励建模实现端到端事件驱动交易
Li, Xiang, Wei, Zikai, Qi, Yiyan, Zhou, Wanyun, Liu, Xiang, Sun, Penglei, Zhang, Yongqi, Chu, Xiaowen
Abstract
Financial market movements are often driven by discrete financial events conveyed through news, whose impacts are heterogeneous, abrupt, and difficult to capture under purely numerical prediction objectives. These limitations have motivated growing interest in using textual information as the primary source of trading signals in learning-based systems. Two key challenges hinder existing approaches: (1) the absence of large-scale, event-centric datasets that jointly model news semantics and statistically grounded market reactions, and (2) the misalignment between language model reasoning and financially valid trading behavior under dynamic market conditions. To address these challenges, we propose Janus-Q, an end-to-end event-driven trading framework that elevates financial news events from auxiliary signals to primary decision units. Janus-Q unifies event-centric data construction and model optimization under a two-stage paradigm. Stage I focuses on event-centric data construction, building a large-scale financial news event dataset comprising 62,400 articles annotated with 10 fine-grained event types, associated stocks, sentiment labels, and event-driven cumulative abnormal return (CAR). Stage II performs decision-oriented fine-tuning, combining supervised learning with reinforcement learning guided by a Hierarchical Gated Reward Model (HGRM), which explicitly captures trade-offs among multiple trading objectives. Extensive experiments demonstrate that Janus-Q achieves more consistent, interpretable, and profitable trading decisions than market indices and LLM baselines, improving the Sharpe Ratio by up to 102.0% while increasing direction accuracy by over 17.5% compared to the strongest competing strategies.
Chinese Translation
金融市场的波动往往受到通过新闻传达的离散金融事件的驱动,这些事件的影响是异质的、突发的,并且在纯数字预测目标下难以捕捉。这些局限性促使人们越来越关注将文本信息作为基于学习的系统中交易信号的主要来源。现有方法面临两个关键挑战:(1)缺乏大规模的以事件为中心的数据集,这些数据集共同建模新闻语义和统计基础上的市场反应;(2)在动态市场条件下,语言模型推理与金融有效交易行为之间的错位。为了解决这些挑战,我们提出了Janus-Q,一个端到端的事件驱动交易框架,将金融新闻事件从辅助信号提升为主要决策单元。Janus-Q在一个两阶段范式下统一了以事件为中心的数据构建和模型优化。第一阶段专注于以事件为中心的数据构建,建立了一个包含62,400篇文章的大规模金融新闻事件数据集,标注了10种细粒度事件类型、相关股票、情感标签和事件驱动的累计异常收益(CAR)。第二阶段进行面向决策的微调,将监督学习与由层次门控奖励模型(HGRM)引导的强化学习相结合,明确捕捉多个交易目标之间的权衡。大量实验证明,Janus-Q在交易决策的一致性、可解释性和盈利能力方面优于市场指数和大型语言模型(LLM)基线,与最强竞争策略相比,夏普比率提高了多达102.0%,方向准确率提高了超过17.5%。
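The event-driven CAR label used in Stage I follows a standard event-study computation. A minimal sketch, assuming the market-adjusted model and a (-1, +3) trading-day window, both common conventions rather than the paper's confirmed choices:

```python
# Cumulative abnormal return (CAR) around a news event, market-adjusted model.
import numpy as np

def car(stock_returns, market_returns, event_idx, pre=1, post=3):
    """Sum of (r_stock - r_market) over the window [event-pre, event+post]."""
    lo, hi = event_idx - pre, event_idx + post + 1
    abnormal = np.asarray(stock_returns[lo:hi]) - np.asarray(market_returns[lo:hi])
    return float(abnormal.sum())

rng = np.random.default_rng(1)
r_stock = rng.normal(0.0005, 0.02, 250)   # daily stock returns around the event
r_mkt = rng.normal(0.0003, 0.01, 250)     # daily market (index) returns
r_stock[120:123] += 0.03                  # inject a positive reaction after day 120

print(f"CAR = {car(r_stock, r_mkt, event_idx=120):+.4f}")  # positive -> bullish label
```

The sign and magnitude of this quantity give each annotated news event a statistically grounded market-reaction label.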
cs.CL / 57 / 2602.19948

Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming

评估大型语言模型在心理健康支持中的风险:自动化临床人工智能红队评估框架
Steenstra, Ian, Pedrelli, Paola, Shi, Weiyan, Marsella, Stacy, Bickmore, Timothy W.
Abstract
Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue. We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with dynamic cognitive-affective models and assesses therapy session simulations against a comprehensive quality of care and risk ontology. We apply this framework to a high-impact test case, Alcohol Use Disorder, evaluating six AI agents (including ChatGPT, Gemini, and Character.AI) against a clinically-validated cohort of 15 patient personas representing diverse clinical phenotypes. Our large-scale simulation (N=369 sessions) reveals critical safety gaps in the use of AI for mental health support. We identify specific iatrogenic risks, including the validation of patient delusions ("AI Psychosis") and failure to de-escalate suicide risk. Finally, we validate an interactive data visualization dashboard with diverse stakeholders, including AI engineers and red teamers, mental health professionals, and policy experts (N=9), demonstrating that this framework effectively enables stakeholders to audit the "black box" of AI psychotherapy. These findings underscore the critical safety risks of AI-provided mental health support and the necessity of simulation-based clinical red teaming before deployment.
Chinese Translation
大型语言模型(LLMs)在心理健康支持中的应用日益增加;然而,当前的安全基准往往无法检测到治疗对话中固有的复杂、长期风险。我们提出了一种评估框架,将人工智能心理治疗师与配备动态认知-情感模型的模拟患者代理相结合,并根据全面的护理质量和风险本体评估治疗会话模拟。我们将该框架应用于一个高影响力的测试案例——酒精使用障碍,评估六个人工智能代理(包括 ChatGPT、Gemini 和 Character.AI)与代表多样临床表型的 15 个患者角色的临床验证队列进行比较。我们的大规模模拟(N=369 次会话)揭示了在心理健康支持中使用人工智能的关键安全缺口。我们识别出特定的医源性风险,包括验证患者妄想(“人工智能精神病”)和未能降低自杀风险。最后,我们与包括人工智能工程师、红队成员、心理健康专业人士和政策专家(N=9)在内的多方利益相关者验证了一个互动数据可视化仪表板,证明该框架有效地使利益相关者能够审计人工智能心理治疗的“黑箱”。这些发现强调了人工智能提供的心理健康支持的关键安全风险,以及在部署之前进行基于模拟的临床红队评估的必要性。
cs.CL / 58 / 2602.19961

Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval

解锁多模态文档智能:从当前成就到视觉文档检索的未来前沿
Yan, Yibo, Huo, Jiahao, Feng, Guanbo, Ou, Mingdong, Cao, Yi, Zou, Xin, Liu, Shuliang, Lyu, Yuanhuiyi, Huang, Yu, Li, Jungang, Zheng, Kening, Zheng, Xu, Yu, Philip S., Kwok, James, Hu, Xuming
Abstract
With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.
Chinese Translation
随着多模态信息的快速传播,视觉文档检索(Visual Document Retrieval, VDR)已成为弥合非结构化视觉丰富数据与精确信息获取之间差距的重要前沿。与传统的自然图像检索不同,视觉文档展现出独特的特征,这些特征由密集的文本内容、复杂的布局和细粒度的语义依赖关系定义。本文首次全面调查了VDR领域,特别是在多模态大语言模型(Multimodal Large Language Model, MLLM)时代的视角下。我们首先审视基准测试的现状,随后深入探讨方法论的演变,将方法分为三个主要方面:多模态嵌入模型、多模态重排序模型,以及检索增强生成(Retrieval-Augmented Generation, RAG)和智能体(Agentic)系统在复杂文档智能中的整合。最后,我们识别出持续存在的挑战,并概述了有前景的未来方向,旨在为未来的多模态文档智能提供清晰的路线图。
cs.CL / 59 / 2602.19969

ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting

ReAttn:通过注意力重加权改善基于注意力的重新排序
Tian, Yuxing, Mo, Fengran, Zhang, Weixu, Qi, Yiyan, Nie, Jian-Yun
Abstract
The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for the zero-shot re-ranking task. Attention-based re-ranking methods, which derive relevance scores directly from attention weights, offer an efficient and interpretable alternative to generation-based re-ranking methods. However, they still face two major limitations. First, attention signals are highly concentrated on a small subset of tokens within a few documents, making the others indistinguishable. Second, attention often overemphasizes phrases lexically similar to the query, yielding biased rankings in which irrelevant documents with mere lexical resemblance are regarded as relevant. In this paper, we propose ReAttn, a post-hoc re-weighting strategy for attention-based re-ranking methods. It first computes a cross-document IDF weighting to down-weight attention on query-overlapping tokens that frequently appear across the candidate documents, reducing lexical bias and emphasizing distinctive terms. It then employs entropy-based regularization to mitigate over-concentrated attention, encouraging a more balanced distribution across informative tokens. Both adjustments operate directly on existing attention weights without additional training or supervision. Extensive experiments demonstrate the effectiveness of our method.
Chinese Translation
近期大型语言模型(LLMs)的强大能力使其在零样本重新排序任务中表现出色。基于注意力的重新排序方法直接从注意力权重中推导相关性分数,提供了一种高效且可解释的替代方案,相较于基于生成的重新排序方法。然而,它们仍面临两个主要限制。首先,注意力信号高度集中在少数文档中的小部分标记上,使得其他标记难以区分。其次,注意力往往过度强调与查询在词汇上相似的短语,导致偏倚的排名,使得与查询仅在词汇上相似的无关文档被视为相关。在本文中,我们提出了ReAttn,一种针对基于注意力的重新排序方法的后处理重加权策略。该策略首先计算跨文档的逆文档频率(IDF)权重,以降低在候选文档中频繁出现的与查询重叠的标记的注意力,从而减少词汇偏见并强调独特术语。然后,它采用基于熵的正则化来缓解过度集中注意力的问题,鼓励在信息性标记之间实现更均衡的分布。这两项调整直接作用于现有的注意力权重,无需额外的训练或监督。大量实验表明我们的方法的有效性。
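Both ReAttn adjustments operate directly on given attention weights, so the mechanics fit in a short sketch. The IDF form log(N/df) is the standard one; modelling the entropy regularization as power-law flattening (alpha < 1) is an assumption about the paper's exact mechanism.

```python
# Post-hoc re-weighting of attention-based relevance scores: IDF down-weights
# query-overlapping tokens common across candidates; a sub-linear power
# flattens over-concentrated attention (an assumed stand-in regularizer).
import math
import numpy as np

docs = [["solar", "panel", "cost", "2024"],
        ["solar", "panel", "efficiency", "record"],
        ["solar", "eclipse", "viewing", "guide"]]     # lexical match only, irrelevant
attn = [np.array([0.70, 0.10, 0.15, 0.05]),           # raw per-token attention mass
        np.array([0.60, 0.15, 0.15, 0.10]),
        np.array([0.85, 0.05, 0.05, 0.05])]           # over-concentrated on "solar"

df = {}
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1
idf = {t: math.log(len(docs) / c) for t, c in df.items()}

def reattn_score(tokens, weights, alpha=0.7):
    w = weights * np.array([idf[t] for t in tokens])  # down-weight cross-doc common tokens
    w = np.power(w, alpha)                            # flatten over-concentrated attention
    return float(w.sum())                             # document relevance = remaining mass

scores = [reattn_score(d, a) for d, a in zip(docs, attn)]
for s, d in sorted(zip(scores, docs), reverse=True):
    print(f"{s:.3f}  {' '.join(d)}")                  # the eclipse document drops to last
```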
cs.CL / 60 / 2602.19991

Cross-lingual Matryoshka Representation Learning across Speech and Text

跨语言的马特里奥什卡表示学习:语音与文本的结合
Sy, Yaya, Doucouré, Dioula, Cerisara, Christophe, Illina, Irina
Abstract
Speakers of under-represented languages face both a language barrier, as most online knowledge is in a few dominant languages, and a modality barrier, since information is largely text-based while many languages are primarily oral. We address this for French-Wolof by training the first bilingual speech-text Matryoshka embedding model, enabling efficient retrieval of French text from Wolof speech queries without relying on costly ASR-translation pipelines. We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best. Although trained only for retrieval, the model generalizes well to other tasks, such as speech intent detection, indicating the learning of general semantic representations. Finally, we analyze cost-accuracy trade-offs across Matryoshka dimensions and ranks, showing that information is concentrated only in a few components, suggesting potential for efficiency improvements.
Chinese Translation
使用较少代表性语言的说话者面临语言障碍,因为大多数在线知识集中在少数几种主导语言上,同时也面临模态障碍,因为信息主要以文本形式存在,而许多语言主要是口头的。我们针对法语-沃洛夫语(French-Wolof)提出解决方案,通过训练首个双语语音-文本马特里奥什卡嵌入模型,使得能够高效地从沃洛夫语语音查询中检索法语文本,而无需依赖昂贵的自动语音识别(ASR)-翻译管道。我们引入了大规模数据整理管道和新的基准,比较了建模策略,并展示了在冻结的文本马特里奥什卡模型中模态融合的效果最佳。尽管该模型仅针对检索进行训练,但在其他任务(如语音意图检测)中表现良好,表明其学习了通用的语义表示。最后,我们分析了马特里奥什卡维度和秩的成本-准确性权衡,显示信息仅集中在少数几个组件中,暗示了效率改进的潜力。
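The cost-accuracy trade-off the paper analyzes rests on the Matryoshka property: the first d dimensions of an embedding already form a usable embedding. A minimal sketch, with random vectors standing in for the bilingual speech/text encoder outputs:

```python
# Matryoshka-style retrieval: truncate embeddings to their first d dimensions,
# re-normalize, and rank by cosine similarity.
import numpy as np

rng = np.random.default_rng(2)
full_dim = 768
corpus = rng.normal(size=(1000, full_dim))             # French text embeddings (stand-in)
query = corpus[42] + 0.1 * rng.normal(size=full_dim)   # Wolof speech query near doc 42

def retrieve(q, docs, d):
    q, docs = q[:d], docs[:, :d]                 # keep only the first d dimensions
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return int(np.argmax(docs @ q))              # nearest neighbour by cosine

for d in (64, 128, 256, 768):
    print(d, retrieve(query, corpus, d))         # smaller d, same top hit in this toy case
```

Retrieval cost scales linearly with d, which is exactly the knob the paper's dimension/rank analysis turns.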
cs.CL / 61 / 2602.20017

QUIETT: Query-Independent Table Transformation for Robust Reasoning

QUIETT:用于稳健推理的查询无关表格转换
Najpande, Gaurav, Kumar, Tampu Ravi, Choudhury, Manan Roy, Valeti, Neha, Fu, Yanjie, Gupta, Vivek
Abstract
Real-world tables often exhibit irregular schemas, heterogeneous value formats, and implicit relational structure, which degrade the reliability of downstream table reasoning and question answering. Most existing approaches address these issues in a query-dependent manner, entangling table cleanup with reasoning and thus limiting generalization. We introduce QuIeTT, a query-independent table transformation framework that preprocesses raw tables into a single SQL-ready canonical representation before any test-time queries are observed. QuIeTT performs lossless schema and value normalization, exposes implicit relations, and preserves full provenance via raw table snapshots. By decoupling table transformation from reasoning, QuIeTT enables cleaner, more reliable, and highly efficient querying without modifying downstream models. Experiments on four benchmarks -- WikiTQ, HiTab, NQ-Table, and SequentialQA -- show consistent gains across models and reasoning paradigms, with particularly strong improvements on a challenge set of structurally diverse, unseen questions.
Chinese Translation
现实世界中的表格通常表现出不规则的模式、异构的值格式和隐含的关系结构,这降低了下游表格推理和问答的可靠性。大多数现有方法以查询依赖的方式解决这些问题,将表格清理与推理纠缠在一起,从而限制了其泛化能力。我们提出了QuIeTT,一个查询无关的表格转换框架,在任何测试时查询被观察之前,将原始表格预处理为单一的SQL准备好的规范表示。QuIeTT执行无损的模式和值规范化,揭示隐含关系,并通过原始表快照保留完整的来源。通过将表格转换与推理解耦,QuIeTT实现了更清晰、更可靠和高效的查询,而无需修改下游模型。在四个基准测试(WikiTQ、HiTab、NQ-Table和SequentialQA)上的实验显示,各模型和推理范式均表现出一致的提升,特别是在一组结构多样、未见过的问题挑战集上取得了显著的改善。
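A query-independent normalization pass of this kind can be sketched with ordinary data tooling. The specific rules below are illustrative assumptions; QuIeTT's transformation set is richer and provenance-preserving.

```python
# Canonicalize headers, coerce value formats, and materialize an SQL-ready
# table before any question is seen.
import re
import sqlite3
import pandas as pd

raw = pd.DataFrame({
    "Country ": ["France", "Japan", "Brazil"],
    "GDP (US$ bn)": ["2,780", "4,210", "1,920"],
    "Joined": ["Jan 1958", "Dec 1956", "Oct 1945"],
})

def canon_header(h):
    # lowercase, replace punctuation/whitespace runs with "_"
    return re.sub(r"[^a-z0-9]+", "_", h.strip().lower()).strip("_")

clean = raw.rename(columns=canon_header)
clean["gdp_us_bn"] = clean["gdp_us_bn"].str.replace(",", "").astype(float)
clean["joined"] = pd.to_datetime(clean["joined"], format="%b %Y")

con = sqlite3.connect(":memory:")
clean.to_sql("t", con, index=False)          # the reusable, SQL-ready canonical table
print(con.execute("SELECT country FROM t WHERE gdp_us_bn > 2000").fetchall())
```

Because the transformation never looks at a question, the same canonical table serves every downstream query and model unchanged.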
cs.CL / 62 / 2602.20020

gencat: Generative computerized adaptive testing

gencat:生成性计算机自适应测试
Feng, Wanyong, Lan, Andrew
Abstract
Existing Computerized Adaptive Testing (CAT) frameworks are typically built on predicting the correctness of a student response to a question. Although effective, this approach fails to leverage textual information in questions and responses, especially for open-ended questions. In this work, we propose GENCAT (GENerative CAT), a novel CAT framework that leverages Large Language Models for knowledge estimation and question selection. First, we develop a Generative Item Response Theory (GIRT) model that enables us to estimate student knowledge from their open-ended responses and predict responses to unseen questions. We train the model in a two-step process, first via Supervised Fine-Tuning and then via preference optimization for knowledge-response alignment. Second, we introduce three question selection algorithms that leverage the generative capabilities of the GIRT model, based on the uncertainty, linguistic diversity, and information of sampled student responses. Third, we conduct experiments on two real-world programming datasets and demonstrate that GENCAT outperforms existing CAT baselines, achieving an AUC improvement of up to 4.32% in the key early testing stages.
Chinese Translation
现有的计算机自适应测试(CAT)框架通常基于预测学生对问题的回答正确性。尽管这种方法有效,但未能充分利用问题和回答中的文本信息,尤其是在开放性问题中。在本研究中,我们提出了GENCAT(GENerative CAT),一个新颖的CAT框架,利用大型语言模型进行知识评估和问题选择。首先,我们开发了一种生成性项目反应理论(GIRT)模型,使我们能够从学生的开放性回答中估计其知识水平,并预测对未见问题的回答。我们通过两步过程训练该模型,首先进行监督微调,然后通过知识-回答对齐的偏好优化。其次,我们介绍了三种问题选择算法,基于不确定性、语言多样性和采样学生回答的信息,利用GIRT模型的生成能力。最后,我们在两个真实世界的编程数据集上进行了实验,结果表明GENCAT在早期测试阶段的AUC提升高达4.32%,优于现有的CAT基准。
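Of the three selection algorithms, the uncertainty-based one is the easiest to illustrate: sample several responses the generative model predicts for each candidate question and ask the question whose sampled answers disagree most. The `sample_responses` stub below stands in for the fine-tuned GIRT model and is purely an assumption.

```python
# Uncertainty-driven question selection via the entropy of sampled responses.
import math
import random
from collections import Counter

random.seed(3)

def sample_responses(question_id, k=8):
    """Stub: pretend the GIRT model samples k responses for this student."""
    pools = {
        "q_easy": ["correct"] * 9 + ["off_by_one"],
        "q_hard": ["correct", "off_by_one", "wrong_loop", "timeout"],
    }
    return [random.choice(pools[question_id]) for _ in range(k)]

def response_entropy(samples):
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

candidates = ["q_easy", "q_hard"]
scored = {q: response_entropy(sample_responses(q)) for q in candidates}
print(max(scored, key=scored.get), scored)  # the high-disagreement question is asked next
```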
cs.CL / 63 / 2602.20040

AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization

AgenticSum:一种用于忠实临床文本摘要的推理时代理框架
Piya, Fahmida Liza, Beheshti, Rahmatollah
Abstract
Large language models (LLMs) offer substantial promise for automating clinical text summarization, yet maintaining factual consistency remains challenging due to the length, noise, and heterogeneity of clinical documentation. We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and targeted correction to reduce hallucinated content. The framework decomposes summarization into coordinated stages that compress task-relevant context, generate an initial draft, identify weakly supported spans using internal attention grounding signals, and selectively revise flagged content under supervisory control. We evaluate AgenticSum on two public datasets, using reference-based metrics, LLM-as-a-judge assessment, and human evaluation. Across various measures, AgenticSum demonstrates consistent improvements compared to vanilla LLMs and other strong baselines. Our results indicate that structured, agentic design with targeted correction offers an effective inference time solution to improve clinical note summarization using LLMs.
Chinese Translation
大型语言模型(LLMs)在自动化临床文本摘要方面展现了巨大的潜力,但由于临床文档的长度、噪声和异质性,保持事实一致性仍然具有挑战性。我们提出了AgenticSum,这是一种推理时的代理框架,旨在通过分离上下文选择、生成、验证和有针对性的修正来减少虚假内容。该框架将摘要过程分解为协调的阶段,压缩与任务相关的上下文,生成初步草稿,利用内部注意力基础信号识别支持不足的文本片段,并在监督控制下选择性地修订标记内容。我们在两个公共数据集上评估了AgenticSum,使用基于参考的指标、LLM作为评判者的评估以及人工评估。在各种评估指标中,AgenticSum相较于普通LLMs和其他强基线表现出了一致的改进。我们的结果表明,结构化的代理设计与有针对性的修正提供了一种有效的推理时解决方案,以改善使用LLMs的临床记录摘要。
cs.CL / 64 / 2602.20042

Position: General Alignment Has Hit a Ceiling; Edge Alignment Must Be Taken Seriously

立场:通用对齐已达到瓶颈;边缘对齐必须被认真对待
Bao, Han, Huang, Yue, Wang, Xiaoda, Zhang, Zheyuan, Zhou, Yujun, Yang, Carl, Zhang, Xiangliang, Ye, Yanfang
Abstract
Large language models are being deployed in complex socio-technical systems, which exposes limits in current alignment practice. We take the position that the dominant paradigm of General Alignment, which compresses diverse human values into a single scalar reward, reaches a structural ceiling in settings with conflicting values, plural stakeholders, and irreducible uncertainty. These failures follow from the mathematics and incentives of scalarization and lead to structural value flattening, normative representation loss, and cognitive uncertainty blindness. We introduce Edge Alignment as a distinct approach in which systems preserve multi-dimensional value structure, support plural and democratic representation, and incorporate epistemic mechanisms for interaction and clarification. To make this approach practical, we propose seven interdependent pillars organized into three phases. We identify key challenges in data collection, training objectives, and evaluation, outlining complementary technical and governance directions. Taken together, these measures reframe alignment as a lifecycle problem of dynamic normative governance rather than as a single-instance optimization task.
Chinese Translation
大型语言模型正在复杂的社会技术系统中部署,这暴露了当前对齐实践的局限性。我们认为,主导的通用对齐(General Alignment)范式将多样的人类价值压缩为单一标量奖励,在存在冲突价值、多方利益相关者和不可简化的不确定性的环境中达到了结构性瓶颈。这些失败源于标量化的数学和激励机制,导致了结构性价值扁平化、规范性代表性丧失和认知不确定性盲点。我们提出边缘对齐(Edge Alignment)作为一种独特的方法,其中系统保留多维价值结构,支持多元和民主的代表性,并纳入互动和澄清的认知机制。为了使这种方法具有实用性,我们提出了七个相互依赖的支柱,分为三个阶段。我们识别了数据收集、训练目标和评估中的关键挑战,并概述了互补的技术和治理方向。综合来看,这些措施将对齐重新框定为动态规范治理的生命周期问题,而不是单一实例的优化任务。
cs.CL / 65 / 2602.20052

Entropy in Large Language Models

大型语言模型中的熵
Scharringhausen, Marco
Abstract
In this study, the output of large language models (LLMs) is considered an information source generating an unlimited sequence of symbols drawn from a finite alphabet. Given the probabilistic nature of modern LLMs, we model each LLM as a probabilistic source with a constant random distribution, so that the source itself is stationary. We compare this source entropy (per word) to that of natural language (written or spoken) as represented by the Open American National Corpus (OANC). Our results indicate that the word entropy of such LLMs is lower than the word entropy of natural speech in both written and spoken form. The long-term goal of such studies is to formalize the intuitions of information and uncertainty in large language model training, in order to assess the impact of training an LLM on LLM-generated training data, in particular texts from the world wide web.
Chinese Translation
在本研究中,大型语言模型(LLM)的输出被视为一种信息源,生成来自有限字母表的无限符号序列。考虑到现代LLM的概率性质,我们假设这些LLM遵循一个概率模型,该模型遵循恒定的随机分布,因此源本身是平稳的。我们将该源的熵(每个单词)与自然语言(书面或口头)所表示的熵进行比较,后者由开放美国国家语料库(Open American National Corpus, OANC)提供。我们的结果表明,这种LLM的单词熵低于自然语言(无论是书面形式还是口头形式)的单词熵。这类研究的长期目标是形式化大型语言训练中信息和不确定性的直觉,以评估从LLM生成的训练数据训练LLM的影响,特别是指来自全球互联网的文本。
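The per-word entropy comparison reduces to a familiar estimator. A minimal sketch, assuming a unigram estimate H = -sum p(w) log2 p(w); the study's stationary-source model may well use longer contexts or corrected estimators.

```python
# Unigram word-entropy estimate for a text sample, in bits per word.
import math
from collections import Counter

def word_entropy(text):
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

natural = "the quick brown fox jumps over the lazy dog while the dog sleeps"
llm_like = "the model the model generates the model output the model output"

print(f"natural : {word_entropy(natural):.3f} bits/word")
print(f"llm-like: {word_entropy(llm_like):.3f} bits/word")  # lower: more repetitive
```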
cs.CL / 66 / 2602.20065

Multilingual Large Language Models do not comprehend all natural languages to equal degrees

多语言大型语言模型对所有自然语言的理解能力并不相同
Moskvina, Natalia, Montero, Raquel, Yoshida, Masaya, Hubers, Ferdy, Morosi, Paolo, Irhaymi, Walid, Yan, Jin, Serrano, Tamara, Pagliarini, Elena, Günther, Fritz, Leivada, Evelina
Abstract
Large Language Models (LLMs) play a critical role in how humans access information. While their core use relies on comprehending written requests, our understanding of this ability is currently limited, because most benchmarks evaluate LLMs in high-resource languages predominantly spoken by Western, Educated, Industrialised, Rich, and Democratic (WEIRD) communities. The default assumption is that English is the best-performing language for LLMs, while smaller, low-resource languages are linked to less reliable outputs, even in multilingual, state-of-the-art models. To track variation in the comprehension abilities of LLMs, we prompt 3 popular models on a language comprehension task across 12 languages, representing the Indo-European, Afro-Asiatic, Turkic, Sino-Tibetan, and Japonic language families. Our results suggest that the models exhibit remarkable linguistic accuracy across typologically diverse languages, yet they fall behind human baselines in all of them, albeit to different degrees. Contrary to what was expected, English is not the best-performing language, as it was systematically outperformed by several Romance languages, even lower-resource ones. We frame the results by discussing the role of several factors that drive LLM performance, such as tokenization, language distance from Spanish and English, size of training data, and data origin in high- vs. low-resource languages and WEIRD vs. non-WEIRD communities.
Chinese Translation
大型语言模型(LLMs)在信息获取中发挥着关键作用。尽管它们的核心使用依赖于对书面请求的理解,但我们对这一能力的理解目前仍然有限,因为大多数基准测试主要在由西方、受过教育、工业化、富裕和民主(WEIRD)社区主导的高资源语言中评估LLMs。默认假设是,英语是LLMs表现最佳的语言,而较小的低资源语言则与较不可靠的输出相关,即使在多语言的最先进模型中也是如此。为了跟踪LLMs理解能力的变化,我们在12种语言上对3个流行模型进行了语言理解任务的提示,这些语言代表了印欧语系、亚非语系、突厥语系、汉藏语系和日本语系。我们的结果表明,这些模型在类型学上多样的语言中表现出显著的语言准确性,但在所有语言中都落后于人类基准,尽管程度不同。与预期相反,英语并不是表现最佳的语言,因为它在系统性上被几种罗曼语言超越,甚至是一些资源较少的语言。我们通过讨论推动LLM表现的多个因素,如分词、与西班牙语和英语的语言距离、训练数据的大小以及高资源与低资源语言及WEIRD与非WEIRD社区的数据来源,来框定这些结果。
cs.CL / 67 / 2602.20091

How Retrieved Context Shapes Internal Representations in RAG

检索上下文如何塑造RAG中的内部表征
Yeh, Samuel, Li, Sharon
Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial. In realistic retrieval settings, the retrieved document set often contains a mixture of documents that vary in relevance and usefulness. While prior work has largely examined these phenomena through output behavior, little is known about how retrieved context shapes the internal representations that mediate information integration in RAG. In this work, we study RAG through the lens of latent representations. We systematically analyze how different types of retrieved documents affect the hidden states of LLMs, and how these internal representation shifts relate to downstream generation behavior. Across four question-answering datasets and three LLMs, we analyze internal representations under controlled single- and multi-document settings. Our results reveal how context relevancy and layer-wise processing influence internal representations, providing explanations on LLMs output behaviors and insights for RAG system design.
Chinese Translation
检索增强生成(RAG)通过将生成过程与检索到的外部文档相结合,增强了大型语言模型(LLMs),但检索上下文的影响往往并非简单明了。在现实的检索环境中,检索到的文档集通常包含一系列在相关性和实用性上各不相同的文档。尽管先前的研究主要通过输出行为考察了这些现象,但关于检索上下文如何塑造在RAG中介导信息整合的内部表征的了解却很有限。在本研究中,我们通过潜在表征的视角研究RAG。我们系统地分析了不同类型的检索文档如何影响LLMs的隐藏状态,以及这些内部表征的变化如何与下游生成行为相关联。在四个问答数据集和三个LLMs上,我们分析了在受控的单文档和多文档设置下的内部表征。我们的结果揭示了上下文相关性和层级处理如何影响内部表征,为LLMs的输出行为提供了解释,并为RAG系统设计提供了见解。
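The measurement the paper performs at scale can be reproduced in miniature: compare a model's hidden states for the same question with and without a retrieved document. The sketch below uses the standard Hugging Face `output_hidden_states=True` interface; the small GPT-2 checkpoint is a stand-in for the LLMs actually studied.

```python
# Layer-wise hidden-state shift induced by prepending retrieved context.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

question = "Q: Who wrote The Old Man and the Sea? A:"
context = "Ernest Hemingway wrote The Old Man and the Sea in 1951. "

def last_token_states(text):
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    # one vector per layer, taken at the final token position
    return torch.stack([h[0, -1] for h in out.hidden_states])

plain = last_token_states(question)
augmented = last_token_states(context + question)

cos = torch.nn.functional.cosine_similarity(plain, augmented, dim=-1)
for layer, c in enumerate(cos.tolist()):
    print(f"layer {layer:2d}: cosine shift {1 - c:.3f}")  # how far context moved the state
```

Varying the relevance of `context` (supporting, distracting, or unrelated) and watching where in the layer stack the shift appears is exactly the controlled comparison the paper runs.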
cs.CL / 68 / 2602.20092

BabyLM Turns 4: Call for Papers for the 2026 BabyLM Workshop

BabyLM四周年:2026年BabyLM研讨会征稿启事
Choshen, Leshem, Cotterell, Ryan, Gul, Mustafa Omer, Jumelet, Jaap, Linzen, Tal, Mueller, Aaron, Salhan, Suchir, Shah, Raj Sanjay, Warstadt, Alex, Wilcox, Ethan Gotlieb
Abstract
BabyLM aims to dissolve the boundaries between cognitive modeling and language modeling. We call both for workshop papers and for researchers to join the 4th BabyLM competition. As in previous years, we call for participants in the data-efficient pretraining challenge in the general track. This year, we also offer a new track: Multilingual. We also call for papers outside the competition in any relevant areas. These include training efficiency, cognitively plausible research, weak model evaluation, and more.
Chinese Translation
BabyLM旨在消除认知建模与语言建模之间的界限。我们呼吁提交研讨会论文,并邀请研究人员参加第四届BabyLM竞赛。与往年一样,我们呼吁参与数据高效预训练挑战的普通赛道。今年,我们还提供了一个新的赛道:多语言赛道。我们也欢迎在任何相关领域提交论文,包括训练效率、认知合理性研究、弱模型评估等。
cs.CL / 69 / 2602.20122

NanoKnow: How to Know What Your Language Model Knows

NanoKnow:如何了解你的语言模型所知道的内容
Gu, Lingwei, Jedidi, Nour, Lin, Jimmy
Abstract
How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" -- unknown or inaccessible. The recent release of nanochat -- a family of small LLMs with fully open pre-training data -- addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed-book accuracy is strongly influenced by answer frequency in the pre-training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre-training, demonstrating that parametric and external knowledge are complementary, and (4) non-relevant information is harmful, with accuracy decreasing based on both the position and the number of non-relevant contexts. We release all NanoKnow artifacts at https://github.com/castorini/NanoKnow.
Chinese Translation
大型语言模型(LLMs)是如何知道它们所知道的内容的?回答这个问题一直很困难,因为预训练数据通常是一个“黑箱”——未知或无法获取。最近发布的nanochat——一系列具有完全开放预训练数据的小型LLMs——解决了这一问题,因为它提供了一个透明的视角,展示了模型的参数知识来源。为了理解LLMs如何编码知识,我们发布了NanoKnow,这是一个基准数据集,将来自Natural Questions和SQuAD的问题分为不同的组,基于它们的答案是否存在于nanochat的预训练语料库中。利用这些分组,我们现在可以正确地解开LLMs在生成输出时所依赖的知识来源。为了展示NanoKnow的实用性,我们使用八个nanochat检查点进行了实验。我们的发现表明:(1)闭卷准确性受到预训练数据中答案频率的强烈影响,(2)提供外部证据可以减轻这种频率依赖性,(3)即使有外部证据,当答案在预训练期间被看到时,模型的准确性更高,表明参数知识和外部知识是互补的,(4)无关信息是有害的,准确性会根据无关上下文的位置和数量而降低。我们在 https://github.com/castorini/NanoKnow 发布所有NanoKnow相关资料。
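The seen/unseen partition behind NanoKnow can be illustrated simply: a QA pair goes into the "seen" split when a normalized form of its answer string occurs in the pre-training corpus. The normalization and matching below are assumptions for illustration; the released dataset reflects the authors' exact criteria.

```python
# Partition QA pairs by whether the answer appears in the pre-training corpus.
import re

def normalize(s):
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", s.lower())).strip()

corpus = [
    "Mount Everest is the tallest mountain on Earth.",
    "The Nile flows north through eleven countries.",
]
corpus_text = " ".join(normalize(doc) for doc in corpus)

qa_pairs = [
    ("What is the tallest mountain?", "Mount Everest"),
    ("Who painted the Mona Lisa?", "Leonardo da Vinci"),
]

seen, unseen = [], []
for q, a in qa_pairs:
    (seen if normalize(a) in corpus_text else unseen).append((q, a))

print("seen:  ", seen)     # answer string present in pre-training data
print("unseen:", unseen)   # answer never seen during pre-training
```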
cs.CL / 70 / 2602.20130

To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering

是否推理:医学问答中的选择性思维链
Zhan, Zaifu, Zeng, Min, Zhou, Shuang, Song, Yiran, Chen, Xiaoyi, Hou, Yu, Wu, Yifan, Ruan, Yang, Zhang, Rui
Abstract
Objective: To improve the efficiency of medical question answering (MedQA) with large language models (LLMs) by avoiding unnecessary reasoning while maintaining accuracy. Methods: We propose Selective Chain-of-Thought (Selective CoT), an inference-time strategy that first predicts whether a question requires reasoning and generates a rationale only when needed. Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA. Metrics included accuracy, total generated tokens, and inference time. Results: Selective CoT reduced inference time by 13-45% and token usage by 8-47% with minimal accuracy loss (≤4%). In some model-task pairs, it achieved both higher accuracy and greater efficiency than standard CoT. Compared with fixed-length CoT, Selective CoT reached similar or superior accuracy at substantially lower computational cost. Discussion: Selective CoT dynamically balances reasoning depth and efficiency by invoking explicit reasoning only when beneficial, reducing redundancy on recall-type questions while preserving interpretability. Conclusion: Selective CoT provides a simple, model-agnostic, and cost-effective approach for medical QA, aligning reasoning effort with question complexity to enhance real-world deployability of LLM-based clinical systems.
Chinese Translation
目的:通过避免不必要的推理同时保持准确性,提高大型语言模型(LLMs)在医学问答(MedQA)中的效率。方法:我们提出了选择性思维链(Selective Chain-of-Thought, Selective CoT),这是一种推理时策略,首先预测问题是否需要推理,仅在必要时生成推理依据。我们在四个生物医学问答基准测试(HeadQA、MedQA-USMLE、MedMCQA 和 PubMedQA)上评估了两个开源 LLM(Llama-3.1-8B 和 Qwen-2.5-7B)。评估指标包括准确性、生成的总令牌数和推理时间。结果:选择性 CoT 将推理时间减少了 13-45%,令牌使用减少了 8-47%,且准确性损失最小(≤4%)。在某些模型-任务对中,其准确性和效率均优于标准的 CoT。与固定长度的 CoT 相比,选择性 CoT 在显著较低的计算成本下达到了相似或更高的准确性。讨论:选择性 CoT 动态平衡推理深度和效率,仅在有利时调用显式推理,减少了回忆型问题的冗余,同时保持了解释性。结论:选择性 CoT 提供了一种简单、模型无关且具有成本效益的医学问答方法,将推理努力与问题复杂性对齐,以增强基于 LLM 的临床系统在现实世界中的可部署性。
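The inference-time routing itself is a small amount of glue code: a cheap gate predicts whether the question needs explicit reasoning, and the prompt is chosen accordingly. The keyword gate and `ask_llm` stub below are placeholders for the trained predictor and the real model call.

```python
# Selective CoT routing: gate first, then pick a direct or step-by-step prompt.
REASONING_CUES = ("calculate", "dose", "mechanism", "why", "interpret", "next step")

def needs_reasoning(question):
    q = question.lower()
    return any(cue in q for cue in REASONING_CUES)

def ask_llm(prompt):
    return f"<model answer for: {prompt[:60]}...>"   # stub for the LLM call

def selective_cot(question):
    if needs_reasoning(question):
        prompt = f"{question}\nLet's think step by step, then give the answer."
    else:
        prompt = f"{question}\nAnswer with the option letter only."
    return ask_llm(prompt)

print(selective_cot("Which vitamin deficiency causes scurvy?"))                   # direct
print(selective_cot("Calculate the maintenance fluid dose for a 20 kg child."))   # CoT path
```

Recall-type questions skip the rationale entirely, which is where the reported token and latency savings come from.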
cs.CL / 71 / 2602.20135

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

KNIGHT:基于知识图谱的多项选择题生成与自适应难度校准
Amanlou, Mohammad, Moghaddam, Erfan Shafiee, Jafari, Yasaman Amou, Noori, Mahdi, Farsi, Farhan, Bahrak, Behnam
Abstract
With the rise of large language models (LLMs), they have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.
Chinese Translation
随着大型语言模型(LLMs)的兴起,它们在检索增强生成(RAG)等应用中变得至关重要。然而,评估这些系统仍然受到构建专门评估数据集的时间和成本的制约。我们提出了KNIGHT,一个基于LLM的、驱动于知识图谱的框架,用于从外部来源生成多项选择题(MCQ)数据集。KNIGHT构建了一个特定主题的知识图谱,这是一个结构化且简洁的实体和关系摘要,可以重复使用以生成教师控制的难度级别,包括多跳问题,而无需反复输入完整的源文本。该知识图谱充当压缩的、可重用的状态,使得问题生成成为对图谱的廉价读取。我们在维基百科/Wikidata上实例化了KNIGHT,同时保持框架的领域和本体无关性。作为案例研究,KNIGHT在历史、生物和数学领域生成了六个MCQ数据集。我们在五个标准上评估质量:流畅性、明确性(单一正确答案)、主题相关性、选项唯一性和根据提供的来源的可回答性(作为幻觉的代理)。结果表明,KNIGHT能够从可重用的图表示中实现高效的令牌和成本生成,在这些标准上达到高质量,并且模型排名与MMLU风格基准一致,同时支持特定主题和难度控制的评估。
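Treating question generation as "a cheap read over the graph" is easy to make concrete. A minimal sketch, assuming a toy triple store, simple templates, and same-relation distractor sampling; KNIGHT's actual graph construction and difficulty calibration are LLM-driven and far richer.

```python
# Graph-backed MCQ generation: one-hop items read a single triple; multi-hop
# items chain two relations; distractors come from entities of the same type.
import random

random.seed(4)
TRIPLES = [
    ("Isaac Newton", "field", "physics"),
    ("Marie Curie", "field", "chemistry"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Charles Darwin", "field", "biology"),
]

def one_hop_mcq(subject, relation):
    answer = next(o for s, r, o in TRIPLES if s == subject and r == relation)
    pool = {o for _, r, o in TRIPLES if r == relation and o != answer}
    options = random.sample(sorted(pool), k=min(3, len(pool))) + [answer]
    random.shuffle(options)
    return f"What is the {relation} of {subject}?", options, answer

def two_hop_mcq(subject):
    # chain born_in -> capital_of for a harder, multi-hop item
    city = next(o for s, r, o in TRIPLES if s == subject and r == "born_in")
    country = next(o for s, r, o in TRIPLES if s == city and r == "capital_of")
    return f"{subject} was born in the capital of which country?", country

print(one_hop_mcq("Marie Curie", "field"))
print(two_hop_mcq("Marie Curie"))
```

Because the graph is built once per topic, generating more items (or harder multi-hop ones) never requires re-reading the full source text.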