cs.RO / 1 / 2602.09046
Feasible Static Workspace Optimization of Tendon Driven Continuum Robot based on Euclidean norm
基于欧几里得范数的腱驱动连续机器人可行静态工作空间优化
Abstract
This paper focuses on the optimal design of a tendon-driven continuum robot (TDCR) based on its feasible static workspace (FSW). The TDCR under consideration is a two-segment robot driven by eight tendons, with four tendon actuators per segment. Tendon forces are treated as design variables, while the FSW serves as the optimization objective. To determine the robot's FSW, a genetic algorithm is employed to maximize the Euclidean norm of the TDCR's tip position over the workspace. During the simulations, the robot is subjected to external loads, including torques and forces. The results demonstrate the effectiveness of the proposed method in identifying optimal tendon forces to maximize the FSW, even under the influence of external forces and torques.
Chinese Translation
本文聚焦于腱驱动连续机器人(TDCR)的可行静态工作空间(FSW)的最优设计。所考虑的TDCR是一个由八根腱驱动的两段机器人,每段配备四个腱驱动器。腱力被视为设计变量,而可行静态工作空间(FSW)则作为优化目标。为了确定机器人的可行静态工作空间,采用遗传算法优化方法,以最大化TDCR在工作空间中的末端位置的欧几里得范数。在仿真过程中,机器人受到外部载荷的影响,包括扭矩和力。结果表明,所提出的方法在识别最佳腱力以最大化可行静态工作空间方面是有效的,即使在外部力和扭矩的影响下也是如此。
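The optimization loop described in this abstract (tendon forces as design variables, Euclidean norm of the tip position as fitness) can be illustrated with a minimal genetic algorithm sketch. Everything specific here is an assumption made for illustration: the function `toy_tip_position` is a hypothetical placeholder and not a TDCR statics model, and the force bound, population size, and mutation scale are arbitrary choices rather than values from the paper.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def euclidean_norm(v: Sequence[float]) -> float:
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def toy_tip_position(forces: Sequence[float]) -> Tuple[float, float]:
    """HYPOTHETICAL placeholder, not a TDCR statics model.

    Opposing tendon pairs simply pull the tip along x and y.
    """
    x = 0.01 * ((forces[0] - forces[2]) + (forces[4] - forces[6]))
    y = 0.01 * ((forces[1] - forces[3]) + (forces[5] - forces[7]))
    return (x, y)


def genetic_maximize(
    tip_position: Callable[[Sequence[float]], Sequence[float]],
    n_tendons: int = 8,
    f_max: float = 10.0,      # assumed upper bound on each tendon force
    pop_size: int = 30,
    generations: int = 40,
    mutation_std: float = 0.5,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Search tendon forces in [0, f_max] that maximize the tip-position norm."""
    rng = random.Random(seed)

    def fitness(forces: Sequence[float]) -> float:
        return euclidean_norm(tip_position(forces))

    population = [
        [rng.uniform(0.0, f_max) for _ in range(n_tendons)]
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Elitism: keep the better half unchanged so the best never gets worse.
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_tendons)          # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            # Gaussian mutation, clipped back into the admissible force range.
            child = [
                min(f_max, max(0.0, g + rng.gauss(0.0, mutation_std)))
                for g in child
            ]
            children.append(child)
        population = elite + children

    best = max(population, key=fitness)
    return best, fitness(best)
```

Calling `genetic_maximize(toy_tip_position)` returns the best force vector found and its fitness. In a real study the placeholder would be replaced by a static equilibrium model that also accounts for the external forces and torques.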
cs.RO / 2 / 2602.09076
Legs Over Arms: On the Predictive Value of Lower-Body Pose for Human Trajectory Prediction from Egocentric Robot Perception
腿部优于手臂:基于自我中心机器人感知的下肢姿态对人类轨迹预测的预测价值
Abstract
Predicting human trajectories is crucial for social robot navigation in crowded environments. While most existing approaches treat humans as point masses, we present a study on multi-agent trajectory prediction that leverages different human skeletal features for improved forecast accuracy. In particular, we systematically evaluate the predictive utility of 2D and 3D skeletal keypoints and derived biomechanical cues as additional inputs. Through a comprehensive study on the JRDB dataset and a new dataset for social navigation with 360-degree panoramic videos, we find that focusing on lower-body 3D keypoints yields a 13% reduction in Average Displacement Error, and that augmenting 3D keypoint inputs with corresponding biomechanical cues provides a further 1-4% improvement. Notably, the performance gain persists when using 2D keypoint inputs extracted from equirectangular panoramic images, indicating that monocular surround vision can capture informative cues for motion forecasting. Our finding that robots can forecast human movement efficiently by watching their legs provides actionable insights for designing sensing capabilities for social robot navigation.
Chinese Translation
预测人类轨迹对于社交机器人在拥挤环境中的导航至关重要。尽管大多数现有方法将人类视为点质量,我们提出了一项关于多智能体轨迹预测的研究,该研究利用不同的人体骨骼特征以提高预测准确性。特别地,我们系统地评估了二维和三维骨骼关键点及其衍生的生物力学线索作为额外输入的预测效用。通过对JRDB数据集和另一个用于社交导航的360度全景视频新数据集的全面研究,我们发现,专注于下肢三维关键点可以将平均位移误差减少13%,并且将三维关键点输入与相应的生物力学线索结合使用可以进一步提高1-4%的性能。值得注意的是,当使用从等距全景图像中提取的二维关键点输入时,性能提升依然存在,这表明单目环视视觉能够捕捉到运动预测的有用线索。我们的发现表明,机器人通过观察人类的腿部能够有效预测人类运动,为社交机器人导航的传感能力设计提供了可行的见解。
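The Average Displacement Error (ADE) reported in this abstract is commonly defined as the mean Euclidean distance between predicted and ground-truth positions over the prediction horizon. The sketch below assumes that standard definition and 2D positions. The helper `relative_reduction` is an illustrative name showing how a figure such as "13% reduction" would be computed from two ADE values, not code from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def average_displacement_error(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Mean Euclidean distance between predicted and true positions per time step."""
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, t) for p, t in zip(pred, truth))
    return total / len(pred)


def relative_reduction(baseline_ade: float, improved_ade: float) -> float:
    """Fractional ADE reduction of an improved model relative to a baseline."""
    return (baseline_ade - improved_ade) / baseline_ade
```

For multi-agent data, the per-trajectory ADE is typically averaged again over all agents and scenes.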
cs.RO / 3 / 2602.09123
Agile asymmetric multi-legged locomotion: contact planning via geometric mechanics and spin model duality
敏捷非对称多足运动:通过几何力学和自旋模型对偶进行接触规划
Abstract
Legged robot research is presently focused on bipedal or quadrupedal robots, despite capabilities to build robots with many more legs to potentially improve locomotion performance. This imbalance is not necessarily due to hardware limitations, but rather to the absence of principled control frameworks that explain when and how additional legs improve locomotion performance. In multi-legged systems, coordinating many simultaneous contacts introduces a severe curse of dimensionality that challenges existing modeling and control approaches. As an alternative, multi-legged robots are typically controlled using low-dimensional gaits originally developed for bipeds or quadrupeds. These strategies fail to exploit the new symmetries and control opportunities that emerge in higher-dimensional systems. In this work, we develop a principled framework for discovering new control structures in multi-legged locomotion. We use geometric mechanics to reduce contact-rich locomotion planning to a graph optimization problem, and propose a spin model duality framework from statistical mechanics to exploit symmetry breaking and guide optimal gait reorganization. Using this approach, we identify an asymmetric locomotion strategy for a hexapod robot that achieves a forward speed of 0.61 body lengths per cycle (a 50% improvement over conventional gaits). The resulting asymmetry appears at both the control and hardware levels. At the control level, the body orientation oscillates asymmetrically between fast clockwise and slow counterclockwise turning phases for forward locomotion. At the hardware level, two legs on the same side remain unactuated and can be replaced with rigid parts without degrading performance. Numerical simulations and robophysical experiments validate the framework and reveal novel locomotion behaviors that emerge from symmetry reforming in high-dimensional embodied systems.
Chinese Translation
尽管有能力构建更多足的机器人以潜在地提高运动性能,当前腿部机器人研究仍然集中在双足或四足机器人上。这种不平衡并非完全由于硬件限制,而是缺乏原则性控制框架来解释何时以及如何增加额外的腿部以改善运动性能。在多足系统中,协调多个同时接触引入了严重的维度诅咒,这对现有的建模和控制方法构成了挑战。作为替代方案,多足机器人通常使用最初为双足或四足开发的低维步态进行控制。这些策略未能利用在高维系统中出现的新对称性和控制机会。在本研究中,我们开发了一个原则性框架,以发现多足运动中的新控制结构。我们利用几何力学将接触丰富的运动规划简化为图优化问题,并提出了一个来自统计力学的自旋模型对偶框架,以利用对称性破缺并指导最佳步态重组。通过这种方法,我们为六足机器人识别出一种非对称运动策略,该策略在每个周期内实现了0.61个身体长度的前进速度(比传统步态提高了50%)。所产生的非对称性在控制和硬件层面均有所体现。在控制层面,身体方向在快速顺时针和缓慢逆时针转动阶段之间不对称地振荡以实现前进运动。在硬件层面,同侧的两条腿保持不驱动状态,并且可以用刚性部件替代而不降低性能。数值模拟和机器人物理实验验证了该框架,并揭示了在高维具身系统中对称性重构所产生的新颖运动行为。
cs.RO / 4 / 2602.09153
SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes
SceneSmith:自主生成适用于仿真的室内场景
Abstract
Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages, from architectural layout to furniture placement to small object population, each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.
Chinese Translation
仿真已成为大规模训练和评估家用机器人的关键工具,但现有环境未能捕捉真实室内空间的多样性和物理复杂性。目前的场景合成方法生成的房间家具稀疏,缺乏机器人操作所需的密集杂物、可调节家具和物理属性。我们提出了SceneSmith,一个分层自主框架,能够根据自然语言提示生成适用于仿真的室内环境。SceneSmith通过多个阶段构建场景——从建筑布局到家具摆放再到小物体的填充——每个阶段都通过VLM代理之间的交互实现:设计师、评论者和协调者。该框架紧密集成了静态物体的文本到3D合成、可调节物体的数据集检索和物理属性估计。SceneSmith生成的物体数量是先前方法的3-6倍,物体间碰撞率低于2%,在物理仿真下96%的物体保持稳定。在一项包含205名参与者的用户研究中,其在现实感和提示忠实度方面的平均胜率分别达到92%和91%,优于基线方法。我们进一步展示了这些环境可以用于自动机器人策略评估的端到端流程。
cs.RO / 5 / 2602.09203
Elements of Robot Morphology: Supporting Designers in Robot Form Exploration
机器人形态的要素:支持设计师探索机器人形态
Abstract
Robot morphology, the form, shape, and structure of robots, is a key design space in human-robot interaction (HRI), shaping how robots function, express themselves, and interact with people. Yet, despite its importance, little is known about how design frameworks can guide systematic form exploration. To address this gap, we introduce Elements of Robot Morphology, a framework that identifies five fundamental elements: perception, articulation, end effectors, locomotion, and structure. Derived from an analysis of existing robots, the framework supports structured exploration of diverse robot forms. To operationalize the framework, we developed Morphology Exploration Blocks (MEB), a set of tangible blocks that enable hands-on, collaborative experimentation with robot morphologies. We evaluate the framework and toolkit through a case study and design workshops, showing how they support analysis, ideation, reflection, and collaborative robot design.
Chinese Translation
机器人形态,即机器人的形式、形状和结构,是人机交互(HRI)中的一个关键设计领域,影响着机器人的功能、表达方式和与人类的互动。然而,尽管其重要性显著,关于设计框架如何指导系统化形态探索的研究仍然较少。为了解决这一问题,我们提出了机器人形态的要素框架,该框架识别出五个基本要素:感知、关节、末端执行器、运动和结构。该框架源于对现有机器人的分析,支持对多样化机器人形态的结构化探索。为了实现这一框架,我们开发了形态探索模块(Morphology Exploration Blocks, MEB),这是一组可触摸的模块,能够支持对机器人形态进行动手的、协作的实验。通过案例研究和设计工作坊,我们评估了该框架和工具包,展示了它们如何支持分析、构思、反思和协作的机器人设计。
cs.RO / 6 / 2602.09204
Risk-Aware Obstacle Avoidance Algorithm for Real-Time Applications
面向实时应用的风险感知障碍规避算法
Abstract
Robust navigation in changing marine environments requires autonomous systems capable of perceiving, reasoning, and acting under uncertainty. This study introduces a hybrid risk-aware navigation architecture that integrates probabilistic modeling of obstacles along the vehicle path with smooth trajectory optimization for autonomous surface vessels. The system constructs probabilistic risk maps that capture both obstacle proximity and the behavior of dynamic objects. A risk-biased Rapidly Exploring Random Tree (RRT) planner leverages these maps to generate collision-free paths, which are subsequently refined using B-spline algorithms to ensure trajectory continuity. Three distinct RRT* rewiring modes are implemented based on the cost function: minimizing the path length, minimizing risk, and optimizing a combination of the path length and total risk. The framework is evaluated in experimental scenarios containing both static and dynamic obstacles. The results demonstrate the system's ability to navigate safely, maintain smooth trajectories, and dynamically adapt to changing environmental risks. Compared with conventional LIDAR or vision-only navigation approaches, the proposed method shows improvements in operational safety and autonomy, establishing it as a promising solution for risk-aware autonomous vehicle missions in uncertain and dynamic environments.
Chinese Translation
在变化的海洋环境中,稳健的导航需要能够在不确定性下感知、推理和行动的自主系统。本研究提出了一种混合风险感知导航架构,该架构将障碍物的概率建模与自主水面船舶的平滑轨迹优化相结合。该系统构建了概率风险地图,捕捉障碍物的接近度和动态物体的行为。基于风险的快速扩展随机树(Rapidly Exploring Random Tree, RRT)规划器利用这些地图生成无碰撞路径,随后使用B样条算法对路径进行细化,以确保轨迹的连续性。根据成本函数实现了三种不同的RRT*重连模式:最小化路径长度、最小化风险以及优化路径长度与总风险的组合。该框架在包含静态和动态障碍物的实验场景中进行了评估。结果表明,该系统能够安全导航,保持平滑轨迹,并动态适应变化的环境风险。与传统的激光雷达(LIDAR)或仅依赖视觉的导航方法相比,所提出的方法在操作安全性和自主性方面表现出改善,确立了其作为不确定和动态环境中风险感知自主车辆任务的有前景解决方案。
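The three rewiring modes named in this abstract (path length, risk, and a combination of both) can be pictured as three choices of edge cost in an RRT*-style planner. The sketch below is an illustrative assumption, not the paper's formulation: the risk map is a user-supplied callable `risk_at`, risk along an edge is approximated by midpoint sampling, and the combined mode uses a simple convex weight `weight`.

```python
import math
from enum import Enum
from typing import Callable, Tuple

Point = Tuple[float, float]


class CostMode(Enum):
    LENGTH = "length"        # minimize path length only
    RISK = "risk"            # minimize accumulated risk only
    COMBINED = "combined"    # weighted sum of length and risk


def edge_cost(
    a: Point,
    b: Point,
    risk_at: Callable[[Point], float],   # assumed risk-map lookup
    mode: CostMode,
    weight: float = 0.5,                 # assumed trade-off weight in [0, 1]
    samples: int = 10,
) -> float:
    """Cost of the straight edge a -> b under the selected rewiring mode."""
    length = math.dist(a, b)
    # Approximate the line integral of risk along the edge (midpoint rule).
    risk_sum = 0.0
    for i in range(samples):
        t = (i + 0.5) / samples
        point = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        risk_sum += risk_at(point)
    risk = risk_sum * length / samples

    if mode is CostMode.LENGTH:
        return length
    if mode is CostMode.RISK:
        return risk
    return weight * length + (1.0 - weight) * risk
```

During rewiring, a node's parent would be switched whenever the cost-to-come through a neighbor, accumulated with this edge cost, is lower than the current one.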
cs.RO / 7 / 2602.09227
From Legible to Inscrutable Trajectories: (Il)legible Motion Planning Accounting for Multiple Observers
从可读到不可读的轨迹:考虑多个观察者的(不)可读运动规划
Abstract
In cooperative environments, such as in factories or assistive scenarios, it is important for a robot to communicate its intentions to observers, who could be either other humans or robots. A legible trajectory allows an observer to quickly and accurately predict an agent's intention. In adversarial environments, such as in military operations or games, it is important for a robot to not communicate its intentions to observers. An illegible trajectory leads an observer to incorrectly predict the agent's intention or delays when an observer is able to make a correct prediction about the agent's intention. However, in some environments there are multiple observers, each of whom may be able to see only part of the environment, and each of whom may have different motives. In this work, we introduce the Mixed-Motive Limited-Observability Legible Motion Planning (MMLO-LMP) problem, which requires a motion planner to generate a trajectory that is legible to observers with positive motives and illegible to observers with negative motives while also considering the visibility limitations of each observer. We highlight multiple strategies an agent can take while still achieving the problem objective. We also present DUBIOUS, a trajectory optimizer that solves MMLO-LMP. Our results show that DUBIOUS can generate trajectories that balance legibility with the motives and limited visibility regions of the observers. Future work includes many variations of MMLO-LMP, including moving observers and observer teaming.
Chinese Translation
在合作环境中,例如工厂或辅助场景,机器人向观察者(可能是其他人类或机器人)传达其意图是非常重要的。可读轨迹使观察者能够快速准确地预测代理的意图。在对抗性环境中,例如军事行动或游戏中,机器人则需要避免向观察者传达其意图。不可读轨迹使观察者错误地预测代理的意图,或延迟观察者正确预测代理意图的时间。然而,在某些环境中存在多个观察者,每个观察者可能只能看到环境的一部分,并且每个观察者可能有不同的动机。在本研究中,我们引入了混合动机有限可观察性可读运动规划(Mixed-Motive Limited-Observability Legible Motion Planning,MMLO-LMP)问题,该问题要求运动规划器生成一个对积极动机的观察者可读而对消极动机的观察者不可读的轨迹,同时考虑每个观察者的可见性限制。我们强调了代理在实现问题目标时可以采取的多种策略。我们还提出了DUBIOUS,一个解决MMLO-LMP的轨迹优化器。我们的结果表明,DUBIOUS能够生成在可读性与观察者的动机和有限可见区域之间取得平衡的轨迹。未来的工作包括MMLO-LMP的多种变体,包括移动观察者和观察者团队合作。
cs.RO / 8 / 2602.09255
STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory
STaR:可扩展的任务条件检索用于长时间跨度的多模态机器人记忆
Abstract
Mobile robots are often deployed over long durations in diverse open, dynamic scenes, including indoor settings such as warehouses and manufacturing facilities, and outdoor settings such as agricultural and roadway operations. A core challenge is to build a scalable long-horizon memory that supports an agentic workflow for planning, retrieval, and reasoning over open-ended instructions at variable granularity, while producing precise, actionable answers for navigation. We present STaR, an agentic reasoning framework that (i) constructs a task-agnostic, multimodal long-term memory that generalizes to unseen queries while preserving fine-grained environmental semantics (object attributes, spatial relations, and dynamic events), and (ii) introduces a Scalable Task-Conditioned Retrieval algorithm based on the Information Bottleneck principle to extract from long-term memory a compact, non-redundant, information-rich set of candidate memories for contextual reasoning. We evaluate STaR on NaVQA (mixed indoor/outdoor campus scenes) and WH-VQA, a customized warehouse benchmark with many visually similar objects built with Isaac Sim, emphasizing contextual reasoning. Across the two datasets, STaR consistently outperforms strong baselines, achieving higher success rates and markedly lower spatial error. We further deploy STaR on a real Husky wheeled robot in both indoor and outdoor environments, demonstrating robust long-horizon reasoning, scalability, and practical utility.
Chinese Translation
移动机器人通常在多样化的开放动态场景中长时间部署,包括仓库和制造设施等室内环境,以及农业和道路作业等室外环境。一个核心挑战是构建一个可扩展的长时间跨度记忆,以支持代理工作流,在不同粒度下进行规划、检索和推理,产生精确、可操作的导航答案。我们提出了STaR,一个代理推理框架,它(i)构建了一个与任务无关的多模态长期记忆,能够对未见查询进行泛化,同时保留细粒度的环境语义(对象属性、空间关系和动态事件),以及(ii)基于信息瓶颈原理引入了一种可扩展的任务条件检索算法,从长期记忆中提取出一组紧凑、非冗余、信息丰富的候选记忆,以便进行上下文推理。我们在NaVQA(混合室内/室外校园场景)和WH-VQA(一个定制的仓库基准,包含许多视觉上相似的对象,使用Isaac Sim构建)上评估了STaR,强调上下文推理。在这两个数据集上,STaR始终优于强基线,取得了更高的成功率和显著更低的空间误差。我们进一步在真实的Husky轮式机器人上部署STaR,在室内和室外环境中展示了强大的长时间跨度推理、可扩展性和实际效用。
cs.RO / 9 / 2602.09259
Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation
基于数据的多任务仿真学习型外科注视感知模型设计
Abstract
In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.
Chinese Translation
在机器人辅助微创手术(RMIS)中,减少的触觉反馈和深度线索增加了对专家视觉感知的依赖,这促使了基于注视引导的训练和学习型外科感知模型的发展。然而,收集操作专家的注视数据成本高昂,并且尚不清楚注视监督的来源(包括专家水平:中级与初学者,以及感知方式:主动执行与被动观看)如何影响注意力模型的学习。我们引入了一种配对的主动-被动多任务外科注视数据集,该数据集在达芬奇SimNow模拟器上收集,涵盖四个训练项目。在任务执行过程中,使用带眼动追踪的虚拟现实(VR)头显记录主动注视,并将相应的视频作为刺激材料,用于收集观察者的被动注视,从而实现受控的同视频比较。我们量化了注视组织中的技能和方式依赖性差异,并通过注视密度重叠分析和单帧显著性建模评估被动注视在操作监督中的可替代性。在不同设置中,MSI-Net产生了稳定且可解释的预测,而SalGAN则不稳定,且通常与人类注视的对齐效果较差。基于被动注视训练的模型恢复了相当一部分中级主动注意,但存在可预测的退化,并且主动与被动目标之间的迁移是不对称的。值得注意的是,初学者的被动标签在高质量示范上与中级被动目标相近,损失有限,这为外科指导和感知建模中的可扩展众包注视监督提供了一条实际路径。
cs.RO / 10 / 2602.09287
Disambiguating Anthropomorphism and Anthropomimesis in Human-Robot Interaction
在人机交互中消除人性化与拟人化的歧义
Abstract
In this preliminary work, we offer an initial disambiguation of the theoretical concepts anthropomorphism and anthropomimesis in Human-Robot Interaction (HRI) and social robotics. We define anthropomorphism as users perceiving human-like qualities in robots, and anthropomimesis as robot developers designing human-like features into robots. This contribution aims to provide a clarification and exploration of these concepts for future HRI scholarship, particularly regarding the party responsible for human-like qualities - robot perceiver for anthropomorphism, and robot designer for anthropomimesis. We provide this contribution so that researchers can build on these disambiguated theoretical concepts for future robot design and evaluation.
Chinese Translation
在这项初步研究中,我们对人机交互(HRI)和社会机器人领域中的理论概念——人性化(anthropomorphism)和拟人化(anthropomimesis)进行了初步的消歧义。我们将人性化定义为用户在机器人中感知到的人类特征,而将拟人化定义为机器人开发者将人类特征设计融入机器人中。本研究旨在为未来的人机交互学术研究提供对这些概念的澄清和探索,特别是关于负责赋予机器人类人特征的主体——人性化中的机器人感知者和拟人化中的机器人设计者。我们提供这一贡献,以便研究人员能够在未来的机器人设计和评估中基于这些消歧义的理论概念进行进一步的研究。
cs.RO / 11 / 2602.09367
CAPER: Constrained and Procedural Reasoning for Robotic Scientific Experiments
CAPER:用于机器人科学实验的约束与程序推理
Abstract
Robotic assistance in scientific laboratories requires procedurally correct long-horizon manipulation, reliable execution under limited supervision, and robustness in low-demonstration regimes. Such conditions greatly challenge end-to-end vision-language-action (VLA) models, whose assumptions of recoverable errors and data-driven policy learning often break down in protocol-sensitive experiments. We propose CAPER, a framework for Constrained And ProcEdural Reasoning for robotic scientific experiments, which explicitly restricts where learning and reasoning occur in the planning and control pipeline. Rather than strengthening end-to-end policies, CAPER enforces a responsibility-separated structure: task-level reasoning generates procedurally valid action sequences under explicit constraints, mid-level multimodal grounding realizes subtasks without delegating spatial decision-making to large language models, and low-level control adapts to physical uncertainty via reinforcement learning with minimal demonstrations. By encoding procedural commitments through interpretable intermediate representations, CAPER prevents execution-time violations of experimental logic, improving controllability, robustness, and data efficiency. Experiments on a scientific workflow benchmark and a public long-horizon manipulation dataset demonstrate consistent improvements in success rate and procedural correctness, particularly in low-data and long-horizon settings.
Chinese Translation
在科学实验室中,机器人辅助需要程序上正确的长时间操作、在有限监督下的可靠执行,以及在低演示环境中的鲁棒性。这些条件对端到端的视觉-语言-动作(VLA)模型提出了巨大挑战,因为这些模型在协议敏感的实验中,假设的可恢复错误和数据驱动的策略学习往往会失效。我们提出了CAPER,一个用于机器人科学实验的约束与程序推理框架,它明确限制了学习和推理在规划与控制流程中的发生位置。CAPER并不是加强端到端策略,而是实施了一种责任分离的结构:任务级推理在明确的约束下生成程序上有效的动作序列,中级多模态基础实现子任务,而不将空间决策委托给大型语言模型,低级控制通过最小演示的强化学习适应物理不确定性。通过可解释的中间表示编码程序承诺,CAPER防止了实验逻辑在执行时的违规,从而提高了可控性、鲁棒性和数据效率。在科学工作流基准和公共长时间操作数据集上的实验表明,在低数据和长时间设置中,成功率和程序正确性均有一致的改善。
cs.RO / 12 / 2602.09368
Certified Gradient-Based Contact-Rich Manipulation via Smoothing-Error Reachable Tubes
通过平滑误差可达管道实现的认证梯度基础接触丰富操控
Abstract
Gradient-based methods can efficiently optimize controllers using physical priors and differentiable simulators, but contact-rich manipulation remains challenging due to discontinuous or vanishing gradients from hybrid contact dynamics. Smoothing the dynamics yields continuous gradients, but the resulting model mismatch can cause controller failures when executed on real systems. We address this trade-off by planning with smoothed dynamics while explicitly quantifying and compensating for the induced errors, providing formal guarantees of constraint satisfaction and goal reachability on the true hybrid dynamics. Our method smooths both contact dynamics and geometry via a novel differentiable simulator based on convex optimization, which enables us to characterize the discrepancy from the true dynamics as a set-valued deviation. This deviation constrains the optimization of time-varying affine feedback policies through analytical bounds on the system's reachable set, enabling robust constraint satisfaction guarantees for the true closed-loop hybrid dynamics, while relying solely on informative gradients from the smoothed dynamics. We evaluate our method on several contact-rich tasks, including planar pushing, object rotation, and in-hand dexterous manipulation, achieving guaranteed constraint satisfaction with lower safety violation and goal error than baselines. By bridging differentiable physics with set-valued robust control, our method is the first certifiable gradient-based policy synthesis method for contact-rich manipulation.
Chinese Translation
基于梯度的方法可以利用物理先验和可微分模拟器高效优化控制器,但由于混合接触动态导致的梯度不连续或消失,使得接触丰富的操控仍然具有挑战性。平滑动态可以产生连续的梯度,但由此产生的模型不匹配可能在真实系统中导致控制器失败。我们通过在规划中使用平滑动态,同时明确量化和补偿引入的误差,来解决这一权衡,提供对真实混合动态的约束满足和目标可达性的正式保证。我们的方法通过基于凸优化的新型可微分模拟器平滑接触动态和几何形状,使我们能够将真实动态的差异表征为一个集合值偏差。该偏差通过对系统可达集的分析界限约束时间变化的仿射反馈策略的优化,从而为真实闭环混合动态提供稳健的约束满足保证,同时仅依赖于来自平滑动态的信息梯度。我们在多个接触丰富的任务上评估了我们的方法,包括平面推动、物体旋转和手中灵巧操控,取得了比基线更低的安全违规和目标误差的保证约束满足。通过将可微分物理与集合值鲁棒控制相结合,我们的方法是第一个可认证的基于梯度的接触丰富操控策略合成方法。
cs.RO / 13 / 2602.09370
Phase-Aware Policy Learning for Skateboard Riding of Quadruped Robots via Feature-wise Linear Modulation
基于相位感知的四足机器人滑板骑行策略学习:特征线性调制方法
Abstract
Skateboards offer a compact and efficient means of transportation as a type of personal mobility device. However, controlling them with legged robots poses several challenges for policy learning due to perception-driven interactions and multi-modal control objectives across distinct skateboarding phases. To address these challenges, we introduce Phase-Aware Policy Learning (PAPL), a reinforcement-learning framework tailored for skateboarding with quadruped robots. PAPL leverages the cyclic nature of skateboarding by integrating phase-conditioned Feature-wise Linear Modulation layers into actor and critic networks, enabling a unified policy that captures phase-dependent behaviors while sharing robot-specific knowledge across phases. In simulation, we validate command-tracking accuracy and conduct ablation studies quantifying each component's contribution. We also compare locomotion efficiency against leg and wheel-leg baselines and show real-world transferability.
Chinese Translation
滑板作为一种个人移动设备,提供了一种紧凑而高效的交通方式。然而,使用四足机器人控制滑板在策略学习中面临着多个挑战,这些挑战源于感知驱动的交互和不同滑板骑行阶段的多模态控制目标。为了解决这些挑战,我们提出了相位感知策略学习(Phase-Aware Policy Learning, PAPL),这是一个专为四足机器人滑板骑行设计的强化学习框架。PAPL利用滑板骑行的周期性特征,通过将相位条件的特征线性调制层集成到演员和评论者网络中,形成一个统一的策略,能够捕捉相位依赖的行为,同时在不同阶段之间共享特定于机器人的知识。我们的仿真评估验证了指令跟踪的准确性,并进行了消融研究以量化各个组件的贡献。我们还将运动效率与腿部和轮腿基线进行了比较,并展示了其在现实世界中的可转移性。
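Feature-wise Linear Modulation (FiLM), mentioned in this abstract, scales and shifts each hidden feature with parameters computed from a conditioning input: output = gamma * feature + beta. The sketch below shows that operation with the skateboarding phase as the conditioning signal. The sin/cos phase encoding and the single linear layer producing gamma and beta are assumptions for illustration, and PAPL's actual network layout is not given in the abstract.

```python
import math
from typing import List, Sequence


def encode_phase(phase: float) -> List[float]:
    """Encode a cyclic phase in [0, 1) as (sin, cos) so that 0 and 1 coincide."""
    angle = 2.0 * math.pi * phase
    return [math.sin(angle), math.cos(angle)]


def linear(x: Sequence[float], weights: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """Plain affine map y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film_modulate(features: Sequence[float], gamma: Sequence[float], beta: Sequence[float]) -> List[float]:
    """FiLM: per-feature scale and shift."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta must have the same length")
    return [g * h + b for h, g, b in zip(features, gamma, beta)]


def phase_conditioned_film(
    features: Sequence[float],
    phase: float,
    w_gamma: Sequence[Sequence[float]],
    b_gamma: Sequence[float],
    w_beta: Sequence[Sequence[float]],
    b_beta: Sequence[float],
) -> List[float]:
    """Compute gamma and beta from the phase, then modulate the features."""
    code = encode_phase(phase)
    gamma = linear(code, w_gamma, b_gamma)
    beta = linear(code, w_beta, b_beta)
    return film_modulate(features, gamma, beta)
```

Because only gamma and beta depend on the phase, the remaining network weights are shared across phases, which matches the abstract's point about sharing robot-specific knowledge.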
cs.RO / 14 / 2602.09430
Sci-VLA: Agentic VLA Inference Plugin for Long-Horizon Tasks in Scientific Experiments
Sci-VLA:用于科学实验中长时间任务的自主VLA推理插件
Abstract
Robotic laboratories play a critical role in autonomous scientific discovery by enabling scalable, continuous experimental execution. Recent vision-language-action (VLA) models offer a promising foundation for robotic laboratories. However, scientific experiments typically involve long-horizon tasks composed of multiple atomic tasks, posing a fundamental challenge to existing VLA models. While VLA models fine-tuned for scientific tasks can reliably execute atomic experimental actions seen during training, they often fail to perform composite tasks formed by reordering and composing these known atomic actions. This limitation arises from a distributional mismatch between training-time atomic tasks and inference-time composite tasks, which prevents VLA models from executing necessary transitional operations between atomic tasks. To address this challenge, we propose an Agentic VLA Inference Plugin for Long-Horizon Tasks in Scientific Experiments. It introduces an LLM-based agentic inference mechanism that intervenes when executing sequential manipulation tasks. By performing explicit transition inference and generating transitional robotic action code, the proposed plugin guides VLA models through missing transitional steps, enabling reliable execution of composite scientific workflows without any additional training. This inference-only intervention makes our method computationally efficient, data-efficient, and well-suited for open-ended and long-horizon robotic laboratory tasks. We build 3D assets of scientific instruments and common scientific operating scenes within an existing simulation environment. In these scenes, we have verified that our method increases the average success rate per atomic task by 42% during inference. Furthermore, we show that our method can be easily transferred from the simulation to real scientific laboratories.
Chinese Translation
机器人实验室在自主科学发现中发挥着关键作用,能够实现可扩展的、持续的实验执行。最近的视觉-语言-动作(VLA)模型为机器人实验室提供了一个有前景的基础。然而,科学实验通常涉及由多个原子任务组成的长时间任务,这对现有的VLA模型提出了根本性的挑战。虽然针对科学任务微调的VLA模型能够可靠地执行训练期间见到的原子实验动作,但它们往往无法执行由重新排序和组合这些已知原子动作形成的复合任务。这一局限性源于训练时原子任务与推理时复合任务之间的分布不匹配,阻碍了VLA模型在原子任务之间执行必要的过渡操作。为了解决这一挑战,我们提出了一种用于科学实验中长时间任务的自主VLA推理插件。该插件引入了一种基于大型语言模型(LLM)的自主推理机制,在执行顺序操作任务时进行干预。通过执行明确的过渡推理并生成过渡机器人动作代码,所提出的插件引导VLA模型完成缺失的过渡步骤,从而在无需额外训练的情况下,实现复合科学工作流的可靠执行。这种仅推理的干预使我们的方法在计算上高效、数据上高效,并且非常适合开放式和长时间的机器人实验室任务。我们在现有的仿真环境中构建了科学仪器和常见科学操作场景的3D资产。在这些场景中,我们验证了我们的方法在推理过程中将每个原子任务的平均成功率提高了42%。此外,我们还展示了我们的方法可以轻松地从仿真转移到真实的科学实验室。
cs.RO / 15 / 2602.09472
LLM-Grounded Dynamic Task Planning with Hierarchical Temporal Logic for Human-Aware Multi-Robot Collaboration
基于大语言模型的动态任务规划与层次时间逻辑在人性化多机器人协作中的应用
Abstract
While Large Language Models (LLMs) enable non-experts to specify open-world multi-robot tasks, the generated plans often lack kinematic feasibility and are inefficient, especially in long-horizon scenarios. Formal methods like Linear Temporal Logic (LTL) offer correctness and optimality guarantees, but are typically confined to static, offline settings and struggle with computational scalability. To bridge this gap, we propose a neuro-symbolic framework that grounds LLM reasoning into hierarchical LTL specifications and solves the corresponding Simultaneous Task Allocation and Planning (STAP) problem. Unlike static approaches, our system resolves stochastic environmental changes, such as moving users or updated instructions, via a receding horizon planning (RHP) loop with real-time perception, which dynamically refines plans through a hierarchical state space. Extensive real-world experiments demonstrate that our approach significantly outperforms baseline methods in success rate and interaction fluency while minimizing planning latency.
Chinese Translation
虽然大语言模型(LLM)使非专家能够指定开放世界的多机器人任务,但生成的计划往往缺乏运动学可行性,并且在长时间范围内效率不高。形式化方法如线性时间逻辑(LTL)提供了正确性和最优性保证,但通常局限于静态的离线设置,并且在计算可扩展性方面存在困难。为了解决这一问题,我们提出了一种神经符号框架,将LLM推理与层次LTL规范相结合,并解决相应的同时任务分配与规划(STAP)问题。与静态方法不同,我们的系统通过实时感知的递归规划(RHP)循环解决环境中的随机变化,例如移动的用户或更新的指令,从而通过层次状态空间动态地优化计划。大量的现实世界实验表明,我们的方法在成功率和交互流畅性方面显著优于基线方法,同时最小化了规划延迟。
cs.RO / 16 / 2602.09563
Optimal Control of Microswimmers for Trajectory Tracking Using Bayesian Optimization
基于贝叶斯优化的微游动体轨迹跟踪的最优控制
Abstract
Trajectory tracking for microswimmers remains a key challenge in microrobotics, where low-Reynolds-number dynamics make control design particularly complex. In this work, we formulate the trajectory tracking problem as an optimal control problem and solve it using a combination of B-spline parametrization with Bayesian optimization, allowing the treatment of high computational costs without requiring complex gradient computations. Applied to a flagellated magnetic swimmer, the proposed method reproduces a variety of target trajectories, including biologically inspired paths observed in experimental studies. We further evaluate the approach on a three-sphere swimmer model, demonstrating that it can adapt to and partially compensate for wall-induced hydrodynamic effects. The proposed optimization strategy can be applied consistently across models of different fidelity, from low-dimensional ODE-based models to high-fidelity PDE-based simulations, showing its robustness and generality. These results highlight the potential of Bayesian optimization as a versatile tool for optimal control strategies in microscale locomotion under complex fluid-structure interactions.
Chinese Translation
微游动体的轨迹跟踪仍然是微型机器人学中的一个关键挑战,低雷诺数动力学使得控制设计特别复杂。在本研究中,我们将轨迹跟踪问题表述为一个最优控制问题,并通过将B样条参数化与贝叶斯优化相结合的方法进行求解,从而在不需要复杂梯度计算的情况下处理高计算成本。应用于一种鞭毛状磁性游动体,所提出的方法能够再现多种目标轨迹,包括在实验研究中观察到的生物启发路径。我们进一步在一个三球游动体模型上评估该方法,证明其能够适应并部分补偿由壁面引起的流体动力学效应。所提出的优化策略可以在不同精度的模型中一致应用,从低维ODE模型到高精度PDE仿真,显示出其稳健性和普适性。这些结果突显了贝叶斯优化作为微尺度运动中复杂流体-结构相互作用下最优控制策略的多功能工具的潜力。
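The B-spline parametrization mentioned in this abstract reduces a time-varying control input to a handful of coefficients, which is what makes a sample-based optimizer such as Bayesian optimization practical. The sketch below evaluates a scalar control signal from uniform cubic B-spline coefficients. The choice of a uniform cubic basis and the function names are assumptions for illustration, and the Bayesian optimization step itself, which would search over `coeffs`, is not shown.

```python
from typing import List, Sequence


def cubic_bspline_value(c: Sequence[float], t: float) -> float:
    """Value of one uniform cubic B-spline segment at local parameter t in [0, 1].

    c holds the four coefficients (control values) that influence this segment.
    """
    b0 = (1.0 - t) ** 3 / 6.0
    b1 = (3.0 * t**3 - 6.0 * t**2 + 4.0) / 6.0
    b2 = (-3.0 * t**3 + 3.0 * t**2 + 3.0 * t + 1.0) / 6.0
    b3 = t**3 / 6.0
    return b0 * c[0] + b1 * c[1] + b2 * c[2] + b3 * c[3]


def control_signal(coeffs: Sequence[float], samples_per_segment: int) -> List[float]:
    """Sample the control signal defined by the coefficient vector.

    A vector of n coefficients defines n - 3 cubic segments.
    """
    if len(coeffs) < 4:
        raise ValueError("at least four coefficients are required")
    values = []
    for i in range(len(coeffs) - 3):
        window = coeffs[i : i + 4]
        for k in range(samples_per_segment):
            values.append(cubic_bspline_value(window, k / samples_per_segment))
    return values
```

An optimizer would then treat the tracking error of the trajectory simulated under `control_signal(coeffs, ...)` as a black-box objective of `coeffs`, with no gradients of the swimmer model required.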
cs.RO / 17 / 2602.09580
Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows
通过动作分块评估者和归一化流实现样本高效的真实世界灵巧策略微调
Abstract
Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment. We present SOFT-FLOW, a sample-efficient off-policy fine-tuning framework with normalizing flow (NF) to address these challenges. The normalizing flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware. We evaluate SOFT-FLOW on two challenging dexterous manipulation tasks in the real world: cutting tape with scissors retrieved from a case, and in-hand cube rotation with a palm-down grasp -- both of which require precise, dexterous control over long horizons. On these tasks, SOFT-FLOW achieves stable, sample-efficient adaptation where standard methods struggle.
Chinese Translation
由于有限的真实世界交互预算和高度多模态的动作分布,真实世界的灵巧操作策略微调仍然面临挑战。尽管基于扩散的策略具有表达能力,但在微调过程中不允许保守的基于似然的更新,因为动作概率是不可处理的。相比之下,传统的高斯策略在多模态下会崩溃,特别是在动作以分块方式执行时,标准的逐步评估者无法与分块执行对齐,导致信用分配不佳。我们提出了SOFT-FLOW,这是一个样本高效的离线策略微调框架,结合了归一化流(Normalizing Flow)来应对这些挑战。归一化流策略为多模态动作块提供准确的似然值,允许通过似然正则化进行保守、稳定的策略更新,从而提高样本效率。动作分块评估者评估整个动作序列,使价值估计与策略的时间结构对齐,并改善长时间范围的信用分配。据我们所知,这是首次在真实机器人硬件上展示基于似然的多模态生成策略与分块级价值学习相结合的成果。我们在两个具有挑战性的真实世界灵巧操作任务上评估了SOFT-FLOW:用从箱子中取出的剪刀切割胶带,以及用掌心向下的抓握进行手中立方体的旋转——这两项任务都需要在长时间范围内进行精确、灵巧的控制。在这些任务中,SOFT-FLOW实现了稳定的、样本高效的适应,而标准方法则面临困难。
cs.RO / 18 / 2602.09583
Preference Aligned Visuomotor Diffusion Policies for Deformable Object Manipulation
面向可变形物体操作的偏好对齐视觉运动扩散策略
Abstract
Humans naturally develop preferences for how manipulation tasks should be performed, which are often subtle, personal, and difficult to articulate. Although it is important for robots to account for these preferences to increase personalization and user satisfaction, they remain largely underexplored in robotic manipulation, particularly in the context of deformable objects like garments and fabrics. In this work, we study how to adapt pretrained visuomotor diffusion policies to reflect preferred behaviors using limited demonstrations. We introduce RKO, a novel preference-alignment method that combines the benefits of two recent frameworks: RPO and KTO. We evaluate RKO against common preference learning frameworks, including these two, as well as a baseline vanilla diffusion policy, on real-world cloth-folding tasks spanning multiple garments and preference settings. We show that preference-aligned policies (particularly RKO) achieve superior performance and sample efficiency compared to standard diffusion policy fine-tuning. These results highlight the importance and feasibility of structured preference learning for scaling personalized robot behavior in complex deformable object manipulation tasks.
Chinese Translation
人类自然地形成对操作任务执行方式的偏好,这些偏好往往微妙、个性化且难以表达。尽管考虑这些偏好对于提高机器人个性化和用户满意度至关重要,但在机器人操作中,尤其是在处理可变形物体(如衣物和织物)的背景下,这一领域仍然未得到充分探索。在本研究中,我们探讨如何利用有限的示范将预训练的视觉运动扩散策略调整为反映偏好的行为。我们提出了RKO,一种新颖的偏好对齐方法,结合了两个最新框架的优势:RPO和KTO。我们在涉及多种衣物和偏好设置的真实世界折叠布料任务中,将RKO与常见的偏好学习框架(包括这两个框架)以及基线的普通扩散策略进行了评估。结果表明,偏好对齐策略(尤其是RKO)在性能和样本效率上优于标准的扩散策略微调。这些结果突显了结构化偏好学习在复杂可变形物体操作任务中扩展个性化机器人行为的重要性和可行性。
cs.RO / 19 / 2602.09617
AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile Perception
AnyTouch 2:用于动态触觉感知的通用光学触觉表征学习
Abstract
Real-world contact-rich manipulation requires robots to perceive temporal tactile feedback, capture subtle surface deformations, and reason about object properties as well as force dynamics. Although optical tactile sensors are uniquely capable of providing such rich information, existing tactile datasets and models remain limited. These resources primarily focus on object-level attributes (e.g., material) while largely overlooking fine-grained tactile temporal dynamics during physical interactions. We argue that advancing dynamic tactile perception requires a systematic hierarchy of dynamic perception capabilities to guide both data collection and model design. To address the lack of tactile data with rich dynamic information, we present ToucHD, a large-scale hierarchical tactile dataset spanning tactile atomic actions, real-world manipulations, and touch-force paired data. Beyond scale, ToucHD establishes a comprehensive tactile dynamic data ecosystem that explicitly supports hierarchical perception capabilities from the data perspective. Building on it, we propose AnyTouch 2, a general tactile representation learning framework for diverse optical tactile sensors that unifies object-level understanding with fine-grained, force-aware dynamic perception. The framework captures both pixel-level and action-specific deformations across frames, while explicitly modeling physical force dynamics, thereby learning multi-level dynamic perception capabilities from the model perspective. We evaluate our model on benchmarks that cover static object properties and dynamic physical attributes, as well as real-world manipulation tasks spanning multiple tiers of dynamic perception capabilities, from basic object-level understanding to force-aware dexterous manipulation. Experimental results demonstrate consistent and strong performance across sensors and tasks.
Chinese Translation
现实世界中丰富的接触操作要求机器人能够感知时间上的触觉反馈,捕捉细微的表面变形,并推理物体属性和力的动态。尽管光学触觉传感器独特地能够提供如此丰富的信息,但现有的触觉数据集和模型仍然有限。这些资源主要集中于物体级属性(例如,材料),而在物理交互过程中大多忽视了细粒度的触觉时间动态。我们认为,推进动态触觉感知需要一个系统的动态感知能力层次结构,以指导数据收集和模型设计。为了解决缺乏丰富动态信息的触觉数据的问题,我们提出了 ToucHD,这是一个大规模的层次触觉数据集,涵盖了触觉原子动作、现实世界操作和触摸-力配对数据。除了规模之外,ToucHD 建立了一个全面的触觉动态数据生态系统,从数据的角度明确支持层次感知能力。在此基础上,我们提出了 AnyTouch 2,这是一个通用的触觉表征学习框架,适用于多种光学触觉传感器,将物体级理解与细粒度的、力感知的动态感知统一起来。该框架捕捉跨帧的像素级和动作特定的变形,同时明确建模物理力的动态,从而从模型的角度学习多层次的动态感知能力。我们在涵盖静态物体属性和动态物理属性的基准测试上评估了我们的模型,以及跨越多个动态感知能力层次的现实世界操作任务——从基本的物体级理解到力感知的灵巧操作。实验结果表明,在传感器和任务之间表现出一致且强劲的性能。
cs.RO / 20 / 2602.09628
TeleGate: Whole-Body Humanoid Teleoperation via Gated Expert Selection with Motion Prior
TeleGate:通过运动先验的门控专家选择实现全身类人机器人远程操作
Abstract
Real-time whole-body teleoperation is a critical method for humanoid robots to perform complex tasks in unstructured environments. However, developing a unified controller that robustly supports diverse human motions remains a significant challenge. Existing methods typically distill multiple expert policies into a single general policy, which often leads to performance degradation, particularly on highly dynamic motions. This paper presents TeleGate, a unified whole-body teleoperation framework for humanoid robots that achieves high-precision tracking across various motions while avoiding the performance loss inherent in knowledge distillation. Our key idea is to preserve the full capability of domain-specific expert policies by training a lightweight gating network, which dynamically activates experts in real time based on proprioceptive states and reference trajectories. Furthermore, to compensate for the absence of future reference trajectories in real-time teleoperation, we introduce a VAE-based motion prior module that extracts implicit future motion intent from historical observations, enabling anticipatory control for motions requiring prediction such as jumping and standing up. We conducted empirical evaluations in simulation and also deployed our technique on the Unitree G1 humanoid robot. Using only 2.5 hours of motion capture data for training, TeleGate achieves high-precision real-time teleoperation across diverse dynamic motions (e.g., running, fall recovery, and jumping), significantly outperforming the baseline methods in both tracking accuracy and success rate.
Chinese Translation
实时全身远程操作是类人机器人在非结构化环境中执行复杂任务的关键方法。然而,开发一个能够稳健支持多样化人类动作的统一控制器仍然是一个重大挑战。现有方法通常将多个专家策略提炼为单一的通用策略,这往往不可避免地导致性能下降,特别是在高度动态的动作上。本文提出了TeleGate,一个统一的类人机器人全身远程操作框架,能够在各种动作中实现高精度跟踪,同时避免知识提炼中固有的性能损失。我们的关键思想是通过训练一个轻量级的门控网络来保留领域特定专家策略的全部能力,该网络根据本体状态和参考轨迹实时动态激活专家。此外,为了弥补实时远程操作中缺乏未来参考轨迹的问题,我们引入了基于变分自编码器(VAE)的运动先验模块,从历史观察中提取隐含的未来运动意图,使得对需要预测的动作(如跳跃和站立)进行预期控制成为可能。我们在仿真中进行了实证评估,并将我们的技术部署在Unitree G1类人机器人上。仅使用2.5小时的运动捕捉数据进行训练,我们的TeleGate在多样化动态动作(例如跑步、跌倒恢复和跳跃)中实现了高精度实时远程操作,在跟踪精度和成功率方面显著优于基线方法。
cs.RO / 21 / 2602.09657
AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild
AutoFly:用于野外环境中无人机自主导航的视觉-语言-行动模型
Abstract
Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to autonomously navigate through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that effectively aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have fundamental limitations for real-world autonomous navigation, stemming from their heavy reliance on explicit instruction-following over autonomous decision-making and insufficient real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction-following to autonomous behavior modeling through: (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows; (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9% higher success rate compared to state-of-the-art VLA baselines, with consistent performance across simulated and real environments.
Chinese Translation
视觉-语言导航(VLN)要求智能体通过解释语言指令和视觉观察来导航环境,是具身人工智能的基础任务。目前针对无人机(UAV)的VLN研究依赖于详细的、预先指定的指令来引导无人机沿着预定路线行驶。然而,现实世界中的户外探索通常发生在未知环境中,无法提供详细的导航指令。相反,只能提供粗略的位置信息或方向指导,这要求无人机通过持续规划和避障自主导航。为了解决这一问题,我们提出了AutoFly,一种用于无人机自主导航的端到端视觉-语言-行动(VLA)模型。AutoFly结合了伪深度编码器,从RGB输入中提取深度感知特征,以增强空间推理,并采用渐进式两阶段训练策略,有效地将视觉、深度和语言表示与行动策略对齐。此外,现有的VLN数据集在现实世界自主导航中存在基本局限性,主要源于其对明确指令遵循的过度依赖,而忽视了自主决策和现实世界数据的不足。为了解决这些问题,我们构建了一个新颖的自主导航数据集,将范式从指令遵循转变为自主行为建模,具体通过:(1)强调持续避障、自主规划和识别工作流的轨迹收集;(2)全面整合现实世界数据。实验结果表明,AutoFly的成功率比最先进的VLA基线高出3.9%,并且在模拟和真实环境中表现一致。
cs.RO / 22 / 2602.09661
RANT: Ant-Inspired Multi-Robot Rainforest Exploration Using Particle Filter Localisation and Virtual Pheromone Coordination
RANT:基于蚂蚁启发的多机器人雨林探索框架,采用粒子滤波定位和虚拟信息素协调
Abstract
This paper presents RANT, an ant-inspired multi-robot exploration framework for noisy, uncertain environments. A team of differential-drive robots navigates a 10 x 10 m terrain, collects noisy probe measurements of a hidden richness field, and builds local probabilistic maps while the supervisor maintains a global evaluation. RANT combines particle-filter localisation, a behaviour-based controller with gradient-driven hotspot exploitation, and a lightweight no-revisit coordination mechanism based on virtual pheromone blocking. We experimentally analyse how team size, localisation fidelity, and coordination influence coverage, hotspot recall, and redundancy. Results show that particle filtering is essential for reliable hotspot engagement, coordination substantially reduces overlap, and increasing team size improves coverage but yields diminishing returns due to interference.
Chinese Translation
本文提出了RANT,一个基于蚂蚁启发的多机器人探索框架,旨在应对嘈杂和不确定的环境。一个差分驱动机器人团队在10 x 10米的地形中导航,收集隐藏丰富度场的嘈杂探测数据,并在监督者的全球评估下构建局部概率地图。RANT结合了粒子滤波定位、基于行为的控制器(采用梯度驱动的热点开发)以及基于虚拟信息素阻塞的轻量级无重访协调机制。我们通过实验分析团队规模、定位精度和协调如何影响覆盖率、热点回忆和冗余。结果表明,粒子滤波对于可靠的热点参与至关重要,协调显著减少了重叠,而增加团队规模虽然提高了覆盖率,但由于干扰导致收益递减。
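The no-revisit coordination described in the RANT abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the 10 x 10 m terrain is discretised into 1 m cells, that a "virtual pheromone" is simply a shared mark on a visited cell, and that a robot heads for the nearest unmarked cell. The class `PheromoneMap` and its methods are hypothetical names introduced for illustration.

```python
class PheromoneMap:
    """Shared grid of visited marks (assumed stand-in for virtual pheromone blocking)."""

    def __init__(self, size_m: float = 10.0, cell_m: float = 1.0):
        self.cell_m = cell_m
        self.n = int(round(size_m / cell_m))  # cells per side
        self.marked = set()                   # cells already probed by any robot

    def _index(self, v: float) -> int:
        # Clamp so positions on the terrain border still map to a valid cell.
        return min(self.n - 1, max(0, int(v // self.cell_m)))

    def cell_of(self, x: float, y: float):
        return (self._index(x), self._index(y))

    def deposit(self, x: float, y: float) -> None:
        """Mark the cell under the robot so teammates will skip it."""
        self.marked.add(self.cell_of(x, y))

    def next_cell(self, x: float, y: float):
        """Nearest unmarked cell by Manhattan distance, or None when all are marked."""
        cx, cy = self.cell_of(x, y)
        free = [(i, j) for i in range(self.n) for j in range(self.n)
                if (i, j) not in self.marked]
        if not free:
            return None
        # The cell tuple is a secondary key so ties are broken deterministically.
        return min(free, key=lambda c: (abs(c[0] - cx) + abs(c[1] - cy), c))


shared = PheromoneMap()
shared.deposit(0.5, 0.5)             # robot A probes its current cell
target = shared.next_cell(0.5, 0.5)  # robot B standing nearby is steered elsewhere
```

A single shared set keeps the sketch lightweight; a real system would also need to handle localisation error when mapping a pose to a cell, which is where the particle filter in the abstract would come in.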
cs.RO / 23 / 2602.09714
Fast Motion Planning for Non-Holonomic Mobile Robots via a Rectangular Corridor Representation of Structured Environments
基于矩形走廊表示的结构化环境中非完整移动机器人快速运动规划
Abstract
We present a complete framework for fast motion planning of non-holonomic autonomous mobile robots in highly complex but structured environments. Conventional grid-based planners struggle with scalability, while many kinematically-feasible planners impose a significant computational burden due to their search space complexity. To overcome these limitations, our approach introduces a deterministic free-space decomposition that creates a compact graph of overlapping rectangular corridors. This method enables a significant reduction in the search space, without sacrificing path resolution. The framework then performs online motion planning by finding a sequence of rectangles and generating a near-time-optimal, kinematically-feasible trajectory using an analytical planner. The result is a highly efficient solution for large-scale navigation. We validate our framework through extensive simulations and on a physical robot. The implementation is publicly available as open-source software.
Chinese Translation
我们提出了一个完整的框架,用于在高度复杂但结构化的环境中快速规划非完整自主移动机器人的运动。传统的基于网格的规划器在可扩展性方面存在困难,而许多运动学可行的规划器由于其搜索空间的复杂性,带来了显著的计算负担。为了克服这些限制,我们的方法引入了一种确定性的自由空间分解,创建了一个重叠矩形走廊的紧凑图。这种方法在不牺牲路径分辨率的情况下显著减少了搜索空间。该框架随后通过找到一系列矩形并使用分析规划器生成近时间最优的运动学可行轨迹来执行在线运动规划。最终结果是一个高效的大规模导航解决方案。我们通过广泛的仿真和在物理机器人上的验证来验证我们的框架。该实现作为开源软件公开可用。
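The corridor-graph idea in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: rectangles are taken to be axis-aligned, two rectangles are connected when they share a region of positive area, and breadth-first search stands in for whatever graph search the authors use. The names `Rect`, `overlaps`, `build_graph`, and `corridor_sequence` are hypothetical.

```python
from collections import deque
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned free-space rectangle (assumed representation)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, x: float, y: float) -> bool:
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax


def overlaps(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share a region of positive area."""
    return (a.xmin < b.xmax and b.xmin < a.xmax
            and a.ymin < b.ymax and b.ymin < a.ymax)


def build_graph(rects):
    """Adjacency lists over rectangle indices; an edge means the corridors overlap."""
    graph = {i: [] for i in range(len(rects))}
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if overlaps(rects[i], rects[j]):
                graph[i].append(j)
                graph[j].append(i)
    return graph


def corridor_sequence(rects, start, goal):
    """Fewest-rectangle sequence from start point to goal point, or None if unreachable."""
    graph = build_graph(rects)
    sources = [i for i, r in enumerate(rects) if r.contains(*start)]
    targets = {i for i, r in enumerate(rects) if r.contains(*goal)}
    parent = {s: None for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if node in targets:
            path = []
            while node is not None:  # walk parents back to a source rectangle
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


rects = [Rect(0, 0, 4, 1), Rect(3, 0, 4, 5), Rect(3, 4, 8, 5), Rect(10, 10, 11, 11)]
sequence = corridor_sequence(rects, start=(0.5, 0.5), goal=(7.5, 4.5))
```

The search runs over a handful of rectangles instead of thousands of grid cells, which is the reduction in search space the abstract refers to; generating a kinematically feasible trajectory inside the returned sequence is a separate step not shown here.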
cs.RO / 24 / 2602.09722
Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization
重新思考视觉-语言-动作模型的扩展:对齐、混合与正则化
Abstract
While Vision-Language-Action (VLA) models show strong promise for generalist robot control, it remains unclear whether -- and under what conditions -- the standard "scale data" recipe translates to robotics, where training data is inherently heterogeneous across embodiments, sensors, and action spaces. We present a systematic, controlled study of VLA scaling that revisits core training choices for pretraining across diverse robots. Using a representative VLA framework that combines a vision-language backbone with flow-matching, we ablate key design decisions under matched conditions and evaluate in extensive simulation and real-robot experiments. To improve the reliability of real-world results, we introduce a Grouped Blind Ensemble protocol that blinds operators to model identity and separates policy execution from outcome judgment, reducing experimenter bias. Our analysis targets three dimensions of VLA scaling. (1) Physical alignment: we show that a unified end-effector (EEF)-relative action representation is critical for robust cross-embodiment transfer. (2) Embodiment mixture: we find that naively pooling heterogeneous robot datasets often induces negative transfer rather than gains, underscoring the fragility of indiscriminate data scaling. (3) Training regularization: we observe that intuitive strategies, such as sensory dropout and multi-stage fine-tuning, do not consistently improve performance at scale. Together, these findings challenge some common assumptions about embodied scaling and provide practical guidance for training large-scale VLA policies from diverse robotic data. Project website: https://research.beingbeyond.com/rethink_vla
Chinese Translation
尽管视觉-语言-动作(VLA)模型在通用机器人控制中展现出强大的潜力,但尚不清楚标准的“扩展数据”方案是否适用于机器人领域,尤其是在训练数据在不同体现、传感器和动作空间中本质上是异质的情况下。我们进行了一项系统的、受控的VLA扩展研究,重新审视了在多样化机器人上进行预训练的核心训练选择。使用一个代表性的VLA框架,该框架结合了视觉-语言主干和流匹配,我们在匹配条件下消融了关键设计决策,并在广泛的仿真和真实机器人实验中进行了评估。为了提高现实世界结果的可靠性,我们引入了一种分组盲集成协议,该协议使操作员对模型身份保持盲目,并将策略执行与结果判断分离,从而减少实验者偏差。我们的分析针对VLA扩展的三个维度:(1) 物理对齐:我们表明,统一的末端执行器(EEF)相对动作表示对于稳健的跨体现转移至关重要。(2) 体现混合:我们发现,简单地汇总异质机器人数据集往往会导致负迁移而非收益,强调了无差别数据扩展的脆弱性。(3) 训练正则化:我们观察到,直观的策略,如感官丢失和多阶段微调,并不总能在扩展时提高性能。总的来说,这项研究挑战了一些关于体现扩展的常见假设,并为从多样化机器人数据中训练大规模VLA策略提供了实际指导。项目网站:https://research.beingbeyond.com/rethink_vla
cs.RO / 25 / 2602.09765
NavDreamer: Video Models as Zero-Shot 3D Navigators
NavDreamer:视频模型作为零-shot 3D 导航器
Abstract
Previous Vision-Language-Action models face critical limitations in navigation: scarce, diverse data from labor-intensive collection and static representations that fail to capture temporal dynamics and physical laws. We propose NavDreamer, a video-based framework for 3D navigation that leverages generative video models as a universal interface between language instructions and navigation trajectories. Our main hypothesis is that video's ability to encode spatiotemporal information and physical dynamics, combined with internet-scale availability, enables strong zero-shot generalization in navigation. To mitigate the stochasticity of generative predictions, we introduce a sampling-based optimization method that utilizes a VLM for trajectory scoring and selection. An inverse dynamics model is employed to decode executable waypoints from generated video plans for navigation. To systematically evaluate this paradigm in several video model backbones, we introduce a comprehensive benchmark covering object navigation, precise navigation, spatial grounding, language control, and scene reasoning. Extensive experiments demonstrate robust generalization across novel objects and unseen environments, with ablation studies revealing that navigation's high-level decision-making nature makes it particularly suited for video-based planning.
Chinese Translation
以往的视觉-语言-动作模型在导航方面面临着关键限制:数据稀缺、多样性不足,且数据来源于劳动密集型的收集过程,静态表示无法捕捉时间动态和物理规律。我们提出了NavDreamer,一个基于视频的3D导航框架,利用生成视频模型作为语言指令与导航轨迹之间的通用接口。我们的主要假设是,视频能够编码时空信息和物理动态,并结合互联网规模的可用性,使得导航中的零-shot 泛化能力得以增强。为了减轻生成预测的随机性,我们引入了一种基于采样的优化方法,该方法利用视觉语言模型(VLM)进行轨迹评分和选择。我们采用反向动态模型从生成的视频计划中解码可执行的航点以进行导航。为了系统性地评估这一范式在多种视频模型骨干网络中的表现,我们引入了一个全面的基准,涵盖对象导航、精确导航、空间定位、语言控制和场景推理。大量实验表明,在新颖对象和未见环境中具有强大的泛化能力,消融研究显示导航的高层决策特性使其特别适合于基于视频的规划。
cs.RO / 26 / 2602.09767
Diverse Skill Discovery for Quadruped Robots via Unsupervised Learning
通过无监督学习实现四足机器人多样化技能发现
Abstract
Reinforcement learning necessitates meticulous reward shaping by specialists to elicit target behaviors, while imitation learning relies on costly task-specific data. In contrast, unsupervised skill discovery can potentially reduce these burdens by learning a diverse repertoire of useful skills driven by intrinsic motivation. However, existing methods exhibit two key limitations: they typically rely on a single policy to master a versatile repertoire of behaviors without modeling the shared structure or distinctions among them, which results in low learning efficiency; moreover, they are susceptible to reward hacking, where the reward signal increases and converges rapidly while the learned skills display insufficient actual diversity. In this work, we introduce an Orthogonal Mixture-of-Experts (OMoE) architecture that prevents diverse behaviors from collapsing into overlapping representations, enabling a single policy to master a wide spectrum of locomotion skills. In addition, we design a multi-discriminator framework in which different discriminators operate on distinct observation spaces, effectively mitigating reward hacking. We evaluated our method on the 12-DOF Unitree A1 quadruped robot, demonstrating a diverse set of locomotion skills. Our experiments demonstrate that the proposed framework boosts training efficiency and yields an 18.3% expansion in state-space coverage compared to the baseline.
Chinese Translation
强化学习需要专家精心设计奖励机制以引导目标行为,而模仿学习则依赖于昂贵的特定任务数据。相比之下,无监督技能发现有潜力通过学习多样化的有用技能来减轻这些负担,这些技能由内在动机驱动。然而,现有方法存在两个主要局限性:它们通常依赖单一策略来掌握多样化的行为库,而未能建模这些行为之间的共享结构或差异,导致学习效率低下;此外,它们容易受到奖励操控的影响,即奖励信号迅速增加并收敛,而所学技能的实际多样性不足。在本研究中,我们提出了一种正交专家混合(Orthogonal Mixture-of-Experts, OMoE)架构,防止多样化行为坍缩为重叠的表示,从而使单一策略能够掌握广泛的运动技能。此外,我们设计了一个多鉴别器框架,其中不同的鉴别器在不同的观察空间中操作,有效缓解了奖励操控问题。我们在12自由度的Unitree A1四足机器人上评估了我们的方法,展示了一组多样化的运动技能。我们的实验表明,所提出的框架提高了训练效率,并与基线相比实现了18.3%的状态空间覆盖扩展。
cs.RO / 27 / 2602.09772
Design and Evaluation of an Assisted Programming Interface for Behavior Trees in Robotics
机器人行为树辅助编程接口的设计与评估
Abstract
The ability to create reactive robot programs faster without the need for extensively trained programmers is becoming increasingly important. So far, it has not been explored how various techniques for creating Behavior Tree (BT) program representations could be combined with complete graphical user interfaces (GUIs) to allow a human user to validate and edit trees suggested by automated methods. In this paper, we introduce BEhavior TRee GUI (BETR-GUI) for creating BTs with the help of an AI assistant that combines methods using large language models, planning, genetic programming, and Bayesian optimization with a drag-and-drop editor. A user study with 60 participants shows that by combining different assistive methods, BETR-GUI enables users to perform better at solving the robot programming tasks. The results also show that humans using the full variant of BETR-GUI perform better than the AI assistant running on its own.
Chinese Translation
无需经过大量培训的程序员即可更快地创建反应式机器人程序,这一能力正变得越来越重要。到目前为止,尚未探索如何将各种创建行为树(Behavior Tree, BT)程序表示的技术与完整的图形用户界面(Graphical User Interfaces, GUIs)相结合,以便让人类用户验证和编辑自动化方法建议的树。在本文中,我们介绍了行为树图形用户界面(BEhavior TRee GUI, BETR-GUI),该界面通过结合使用大型语言模型、规划、遗传编程和贝叶斯优化的方法,利用拖放编辑器帮助创建BT。针对60名参与者的用户研究表明,通过结合不同的辅助方法,BETR-GUI使用户在解决机器人编程任务时表现更佳。结果还显示,使用BETR-GUI完整版本的人类用户的表现优于单独运行的AI助手。
cs.RO / 28 / 2602.09849
BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation
BagelVLA:通过交错的视觉-语言-动作生成增强长时间范围的操作
Abstract
Equipping embodied agents with the ability to reason about tasks, foresee physical outcomes, and generate precise actions is essential for general-purpose manipulation. While recent Vision-Language-Action (VLA) models have leveraged pre-trained foundation models, they typically focus on either linguistic planning or visual forecasting in isolation. These methods rarely integrate both capabilities simultaneously to guide action generation, leading to suboptimal performance in complex, long-horizon manipulation tasks. To bridge this gap, we propose BagelVLA, a unified model that integrates linguistic planning, visual forecasting, and action generation within a single framework. Initialized from a pretrained unified understanding and generative model, BagelVLA is trained to interleave textual reasoning and visual prediction directly into the action execution loop. To efficiently couple these modalities, we introduce Residual Flow Guidance (RFG), which initializes from current observation and leverages single-step denoising to extract predictive visual features, guiding action generation with minimal latency. Extensive experiments demonstrate that BagelVLA outperforms existing baselines by a significant margin on multiple simulated and real-world benchmarks, particularly in tasks requiring multi-stage reasoning.
Chinese Translation
为具身智能体赋予推理任务、预见物理结果和生成精确动作的能力,对于通用操作至关重要。尽管最近的视觉-语言-动作(VLA)模型利用了预训练的基础模型,但它们通常单独关注语言规划或视觉预测。这些方法很少同时整合这两种能力来指导动作生成,导致在复杂的长时间范围操作任务中表现不佳。为了解决这一问题,我们提出了BagelVLA,一个统一的模型,将语言规划、视觉预测和动作生成整合在一个框架内。BagelVLA从预训练的统一理解和生成模型初始化,经过训练将文本推理和视觉预测直接交错到动作执行循环中。为了有效耦合这些模态,我们引入了残差流引导(Residual Flow Guidance, RFG),该方法从当前观察初始化,并利用单步去噪提取预测视觉特征,以最小延迟指导动作生成。大量实验表明,BagelVLA在多个模拟和真实世界基准测试中显著超越现有基线,特别是在需要多阶段推理的任务中表现突出。
cs.RO / 29 / 2602.09888
TriPilot-FF: Coordinated Whole-Body Teleoperation with Force Feedback
TriPilot-FF:具有力反馈的协调全身遥操作
Abstract
Mobile manipulators broaden the operational envelope for robot manipulation. However, the whole-body teleoperation of such robots remains a problem: operators must coordinate a wheeled base and two arms while reasoning about obstacles and contact. Existing interfaces are predominantly hand-centric (e.g., VR controllers and joysticks), leaving foot-operated channels underexplored for continuous base control. We present TriPilot-FF, an open-source whole-body teleoperation system for a custom bimanual mobile manipulator that introduces a foot-operated pedal with lidar-driven pedal haptics, coupled with upper-body bimanual leader-follower teleoperation. Using only a low-cost base-mounted lidar, TriPilot-FF renders a resistive pedal cue from proximity-to-obstacle signals in the commanded direction, shaping operator commands toward collision-averse behaviour without an explicit collision-avoidance controller. The system also supports arm-side force reflection for contact awareness and provides real-time force and visual guidance of bimanual manipulability to prompt mobile base repositioning, thereby improving reach. We demonstrate the capability of TriPilot-FF to effectively "co-pilot" the human operator over long time horizons and in tasks requiring precise mobile base movement and coordination. Finally, we incorporate teleoperation feedback signals into an Action Chunking with Transformers (ACT) policy and demonstrate improved performance when the additional information is available. We release the pedal device design and the full software stack, and conduct extensive real-world evaluations on a bimanual wheeled platform. The project page of TriPilot-FF is http://bit.ly/46H3ZJT.
Chinese Translation
移动操纵器拓宽了机器人操作的工作范围。然而,这类机器人的全身遥操作仍然是一个问题:操作员必须协调一个轮式底盘和两个手臂,同时考虑障碍物和接触。现有的接口主要以手为中心(例如,虚拟现实控制器和操纵杆),而脚操作通道在连续底盘控制方面尚未得到充分探索。我们提出了TriPilot-FF,一个开源的全身遥操作系统,适用于定制的双手移动操纵器,该系统引入了一个脚操作踏板,结合激光雷达驱动的踏板触觉反馈,以及上半身双手领导-跟随遥操作。仅使用一个低成本的底盘安装激光雷达,TriPilot-FF根据指令方向的障碍物接近信号生成一个阻力踏板提示,从而引导操作员的指令朝向避免碰撞的行为,而无需显式的碰撞避免控制器。该系统还支持臂侧力反射以提高接触意识,并提供双手可操控性的实时力和视觉指导,以促使移动底盘重新定位,从而改善可达性。我们展示了TriPilot-FF在长时间和需要精确移动底盘运动与协调的任务中有效“共同驾驶”人类操作员的能力。最后,我们将遥操作反馈信号纳入了基于变换器的动作分块(Action Chunking with Transformers, ACT)策略,并在提供额外信息时展示了性能的提升。我们发布了踏板设备设计、完整软件堆栈,并在双手轮式平台上进行了广泛的现实世界评估。TriPilot-FF的项目页面为http://bit.ly/46H3ZJT。
cs.RO / 30 / 2602.09893
TaCo: A Benchmark for Lossless and Lossy Codecs of Heterogeneous Tactile Data
TaCo:异构触觉数据无损与有损编解码器的基准测试
Abstract
Tactile sensing is crucial for embodied intelligence, providing fine-grained perception and control in complex environments. However, efficient tactile data compression, which is essential for real-time robotic applications under strict bandwidth constraints, remains underexplored. The inherent heterogeneity and spatiotemporal complexity of tactile data further complicate this challenge. To bridge this gap, we introduce TaCo, the first comprehensive benchmark for Tactile data Codecs. TaCo evaluates 30 compression methods, including off-the-shelf compression algorithms and neural codecs, across five diverse datasets from various sensor types. We systematically assess both lossless and lossy compression schemes on four key tasks: lossless storage, human visualization, material and object classification, and dexterous robotic grasping. Notably, we pioneer the development of data-driven codecs explicitly trained on tactile data, TaCo-LL (lossless) and TaCo-L (lossy). Results have validated the superior performance of our TaCo-LL and TaCo-L. This benchmark provides a foundational framework for understanding the critical trade-offs between compression efficiency and task performance, paving the way for future advances in tactile perception.
Chinese Translation
触觉感知对具身智能至关重要,能够在复杂环境中提供细致的感知和控制。然而,高效的触觉数据压缩对于在严格带宽限制下的实时机器人应用至关重要,但这一领域仍然未得到充分探索。触觉数据固有的异质性和时空复杂性进一步加大了这一挑战。为了解决这一问题,我们推出了TaCo,这是第一个全面的触觉数据编解码器基准测试。TaCo评估了30种压缩方法,包括现成的压缩算法和神经编解码器,涵盖了来自不同传感器类型的五个多样化数据集。我们系统地评估了无损和有损压缩方案在四个关键任务上的表现:无损存储、人类可视化、材料和物体分类,以及灵巧的机器人抓取。值得注意的是,我们首次开发了专门针对触觉数据训练的数据驱动编解码器TaCo-LL(无损)和TaCo-L(有损)。结果验证了我们的TaCo-LL和TaCo-L的优越性能。该基准测试为理解压缩效率与任务性能之间的关键权衡提供了基础框架,为未来触觉感知的进展铺平了道路。
cs.RO / 31 / 2602.09940
Instruct2Act: From Human Instruction to Actions Sequencing and Execution via Robot Action Network for Robotic Manipulation
Instruct2Act:通过机器人动作网络将人类指令转化为动作序列和执行以实现机器人操作
Abstract
Robots often struggle to follow free-form human instructions in real-world settings due to computational and sensing limitations. We address this gap with a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Our approach has two stages: (i) the instruction to actions module (Instruct2Act), a compact BiLSTM with a multi-head-attention autoencoder that parses an instruction into an ordered sequence of atomic actions (e.g., reach, grasp, move, place); and (ii) the robot action network (RAN), which uses the dynamic adaptive trajectory radial network (DATRN) together with a vision-based environment analyzer (YOLOv8) to generate precise control trajectories for each sub-action. The entire system runs on a modest system with no cloud services. On our custom proprietary dataset, Instruct2Act attains 91.5% sub-actions prediction accuracy while retaining a small footprint. Real-robot evaluations across four tasks (pick-place, pick-pour, wipe, and pick-give) yield an overall 90% success; sub-action inference completes in < 3.8s, with end-to-end executions in 30-60s depending on task complexity. These results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation in resource-constrained, single-camera settings.
Chinese Translation
由于计算和感知的限制,机器人在现实环境中往往难以遵循自由形式的人类指令。我们通过一种轻量级的完全本地处理管道来解决这一问题,该管道将自然语言命令转化为可靠的操作。我们的方法分为两个阶段:(i)指令到动作模块(Instruct2Act),这是一个紧凑的双向长短期记忆网络(BiLSTM),配备多头注意力自编码器,将指令解析为有序的原子动作序列(例如,伸手、抓取、移动、放置);(ii)机器人动作网络(RAN),它结合动态自适应轨迹径向网络(DATRN)和基于视觉的环境分析器(YOLOv8)为每个子动作生成精确的控制轨迹。整个系统在一个普通的设备上运行,无需云服务。在我们的定制专有数据集上,Instruct2Act 达到 91.5% 的子动作预测准确率,同时保持较小的占用空间。在四个任务(拾取-放置、拾取-倒入、擦拭和拾取-给出)上的真实机器人评估显示总体成功率为 90%;子动作推理在 < 3.8 秒内完成,端到端执行时间根据任务复杂性在 30-60 秒之间。这些结果表明,细粒度的指令到动作解析结合基于 DATRN 的轨迹生成和视觉引导的定位,为在资源受限的单摄像头环境中实现确定性实时操作提供了切实可行的路径。
cs.RO / 32 / 2602.09972
Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning
Hydra-Nav:通过自适应双过程推理进行目标导航
Abstract
While large vision-language models (VLMs) show promise for object goal navigation, current methods still struggle with low success rates and inefficient localization of unseen objects--failures primarily attributed to weak temporal-spatial reasoning. Meanwhile, recent attempts to inject reasoning into VLM-based agents improve success rates but incur substantial computational overhead. To address both the ineffectiveness and inefficiency of existing approaches, we introduce Hydra-Nav, a unified VLM architecture that adaptively switches between a deliberative slow system for analyzing exploration history and formulating high-level plans, and a reactive fast system for efficient execution. We train Hydra-Nav through a three-stage curriculum: (i) spatial-action alignment to strengthen trajectory planning, (ii) memory-reasoning integration to enhance temporal-spatial reasoning over long-horizon exploration, and (iii) iterative rejection fine-tuning to enable selective reasoning at critical decision points. Extensive experiments demonstrate that Hydra-Nav achieves state-of-the-art performance on the HM3D, MP3D, and OVON benchmarks, outperforming the second-best methods by 11.1%, 17.4%, and 21.2%, respectively. Furthermore, we introduce SOT (Success weighted by Operation Time), a new metric to measure search efficiency across VLMs with varying reasoning intensity. Results show that adaptive reasoning significantly enhances search efficiency over fixed-frequency baselines.
Chinese Translation
尽管大型视觉-语言模型(VLMs)在目标物体导航方面展现出潜力,但当前的方法仍面临低成功率和对未见物体的低效定位的问题,这些失败主要归因于时空推理能力不足。同时,近期将推理引入基于VLM的代理的尝试提高了成功率,但也带来了显著的计算开销。为了解决现有方法的无效性和低效率,我们提出了Hydra-Nav,一种统一的VLM架构,它自适应地在用于分析探索历史和制定高层次计划的深思熟虑的慢系统与用于高效执行的反应式快系统之间切换。我们通过三个阶段的课程训练Hydra-Nav:(i)空间-动作对齐以加强轨迹规划,(ii)记忆-推理整合以增强长时间探索中的时空推理,以及(iii)迭代拒绝微调以在关键决策点实现选择性推理。大量实验表明,Hydra-Nav在HM3D、MP3D和OVON基准测试中达到了最先进的性能,分别比第二好的方法提高了11.1%、17.4%和21.2%。此外,我们引入了SOT(操作时间加权成功率),这是一个新的指标,用于衡量不同推理强度的VLMs的搜索效率。结果表明,自适应推理显著提高了搜索效率,相较于固定频率的基线方法。
cs.RO / 33 / 2602.09973
RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation
RoboInter:面向机器人操作的整体中间表示套件
Abstract
Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, thereby hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, where high-level plans (e.g., subtasks, trace) are first generated and subsequently translated into low-level actions, but they critically rely on extra intermediate supervision, which is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource including data, benchmarks, and models of intermediate representations for manipulation. It comprises RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset containing over 230k episodes across 571 diverse scenes, which provides dense per-frame annotations over more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building upon this foundation, RoboInter-VQA introduces 9 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs. Meanwhile, RoboInter-VLA offers an integrated plan-then-execute framework, supporting modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. In total, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning via fine-grained and diverse intermediate representations.
Chinese Translation
大型视觉语言模型(VLMs)的进展激发了对机器人操作的视觉-语言-行动(VLA)系统的日益关注。然而,现有的操作数据集在整理上仍然成本高昂,且高度依赖具体的实现,覆盖面和多样性不足,从而阻碍了VLA模型的泛化。近期的方法试图通过计划-执行范式来缓解这些限制,其中首先生成高层次的计划(例如,子任务、轨迹),然后将其转化为低层次的动作,但这些方法严重依赖于额外的中间监督,而现有数据集中几乎缺乏这种监督。为了解决这一问题,我们引入了RoboInter操作套件,这是一个统一的资源,包括用于操作的中间表示的数据、基准和模型。它包含RoboInter-Tool,一个轻量级图形用户界面,支持对多样化表示的半自动标注,以及RoboInter-Data,一个大规模数据集,涵盖571个多样场景中的超过23万个情节,提供超过10类中间表示的密集逐帧标注,显著超越了以往工作的规模和标注质量。在此基础上,RoboInter-VQA引入了9个空间和20个时间的具身VQA类别,以系统性地基准测试和增强VLM的具身推理能力。同时,RoboInter-VLA提供了一个集成的计划-执行框架,支持模块化和端到端的VLA变体,通过中间监督将高层次规划与低层次执行连接起来。总体而言,RoboInter为通过细粒度和多样化的中间表示推动稳健和可泛化的机器人学习奠定了实用基础。
cs.RO / 34 / 2602.09991
Acoustic Drone Package Delivery Detection
声学无人机包裹投递检测
Abstract
In recent years, the illicit use of unmanned aerial vehicles (UAVs) for deliveries in restricted areas such as prisons has become a significant security challenge. While numerous studies have focused on UAV detection or localization, little attention has been given to identifying delivery events. This study presents the first acoustic package delivery detection algorithm using a ground-based microphone array. The proposed method estimates both the drone's propeller speed and the delivery event using solely acoustic features. A deep neural network detects the presence of a drone and estimates the propeller's rotation speed or blade passing frequency (BPF) from a mel spectrogram. The algorithm analyzes the BPFs to identify probable delivery moments based on sudden changes before and after a specific time. Results demonstrate a mean absolute error of the blade passing frequency estimator of 16 Hz when the drone is less than 150 meters away from the microphone array. The drone presence detection estimator has an accuracy of 97%. The delivery detection algorithm correctly identifies 96% of events with a false positive rate of 8%. This study shows that deliveries can be identified using acoustic signals up to a range of 100 meters.
Chinese Translation
近年来,无人机(UAV)在监狱等限制区域进行非法投递的现象已成为一个重要的安全挑战。尽管许多研究集中于无人机的检测或定位,但对投递事件的识别关注较少。本研究提出了首个基于地面麦克风阵列的声学包裹投递检测算法。该方法仅利用声学特征估计无人机的螺旋桨转速和投递事件。深度神经网络通过梅尔谱图检测无人机的存在,并估计螺旋桨的旋转速度或叶片经过频率(BPF)。该算法分析BPF,以识别基于特定时间前后突然变化的可能投递时刻。结果表明,当无人机距离麦克风阵列小于150米时,叶片经过频率估计器的平均绝对误差为16 Hz。无人机存在检测估计器的准确率为97%。投递检测算法正确识别了96%的事件,假阳性率为8%。本研究表明,投递事件可以通过声学信号在100米的范围内被识别。
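The "sudden changes before and after a specific time" idea can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `detect_delivery_moment`, the window length, and the 10 Hz threshold are all assumptions made for illustration. It compares the mean BPF in a window before each time index with the mean in a window after it and reports the index with the largest jump.

```python
from statistics import mean

def detect_delivery_moment(bpf_hz, window=5, threshold_hz=10.0):
    """Return the index with the largest before/after jump in mean BPF.

    bpf_hz: list of per-frame blade passing frequency estimates (Hz).
    window and threshold_hz are illustrative values, not from the paper.
    Returns None when no jump exceeds the threshold.
    """
    best_index, best_jump = None, threshold_hz
    for t in range(window, len(bpf_hz) - window + 1):
        before = mean(bpf_hz[t - window:t])   # frames just before t
        after = mean(bpf_hz[t:t + window])    # frames from t onward
        jump = abs(after - before)
        if jump > best_jump:                  # keep the strongest change
            best_index, best_jump = t, jump
    return best_index
```

Averaging over a window on each side makes the test less sensitive to single-frame estimation errors, which matters given the reported 16 Hz mean absolute error of the BPF estimator.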
cs.RO / 35 / 2602.10007
A Collaborative Safety Shield for Safe and Efficient CAV Lane Changes in Congested On-Ramp Merging
拥堵匝道合流中安全高效的CAV换道的协作安全屏障
Abstract
Lane changing in dense traffic is a significant challenge for Connected and Autonomous Vehicles (CAVs). Existing lane change controllers primarily either ensure safety or collaboratively improve traffic efficiency, but do not consider these conflicting objectives together. To address this, we propose the Multi-Agent Safety Shield (MASS), designed using Control Barrier Functions (CBFs) to enable safe and collaborative lane changes. The MASS enables collaboration by capturing multi-agent interactions among CAVs through interaction topologies constructed as a graph using a simple algorithm. Further, a state-of-the-art Multi-Agent Reinforcement Learning (MARL) lane change controller is extended by integrating MASS to ensure safety and defining a customised reward function to prioritise efficiency improvements. As a result, we propose a lane change controller, known as MARL-MASS, and evaluate it in a congested on-ramp merging simulation. The results demonstrate that MASS enables collaborative lane changes with safety guarantees by strictly respecting the safety constraints. Moreover, the proposed custom reward function improves the stability of MARL policies trained with a safety shield. Overall, by encouraging the exploration of a collaborative lane change policy while respecting safety constraints, MARL-MASS effectively balances the trade-off between ensuring safety and improving traffic efficiency in congested traffic. The code for MARL-MASS is available with an open-source licence at https://github.com/hkbharath/MARL-MASS
Chinese Translation
在密集交通中换道是连接与自主车辆(CAVs)面临的一项重大挑战。现有的换道控制器主要要么确保安全,要么协同提高交通效率,但并未同时考虑这两种相互冲突的目标。为了解决这一问题,我们提出了多智能体安全屏障(Multi-Agent Safety Shield, MASS),该屏障利用控制障碍函数(Control Barrier Functions, CBFs)设计,以实现安全和协作的换道。MASS通过使用简单算法构建的图形交互拓扑捕捉CAV之间的多智能体交互,从而实现协作。此外,最先进的多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)换道控制器通过集成MASS进行扩展,以确保安全,并定义定制的奖励函数以优先考虑效率提升。因此,我们提出了一种名为MARL-MASS的换道控制器,并在拥堵的匝道合流仿真中对其进行了评估。结果表明,MASS通过严格遵守安全约束,实现了具有安全保障的协作换道。此外,所提出的定制奖励函数提高了使用安全屏障训练的MARL策略的稳定性。总的来说,通过鼓励在遵守安全约束的同时探索协作换道策略,MARL-MASS有效地平衡了确保安全与提高拥堵交通效率之间的权衡。MARL-MASS的代码可在https://github.com/hkbharath/MARL-MASS以开源许可证获取。
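A Control Barrier Function acts as a filter on a nominal command. The following toy sketch shows the idea for a single follower and leader pair and is far simpler than the multi-agent MASS described above. The barrier h = gap - d_min, the single-integrator speed model, and the values of `d_min` and `alpha` are assumptions chosen for illustration.

```python
def cbf_safe_speed(v_nominal, v_lead, gap, d_min=5.0, alpha=0.5):
    """Clamp a nominal speed command so a distance barrier stays non-negative.

    Barrier: h = gap - d_min, with d(gap)/dt = v_lead - v.
    CBF condition: dh/dt + alpha * h >= 0
                   =>  v <= v_lead + alpha * (gap - d_min).
    All parameter values are illustrative assumptions.
    """
    v_max = v_lead + alpha * (gap - d_min)
    # Fallback: if the bound is negative, command a stop (no reversing);
    # in that case the CBF condition itself cannot be met by this model.
    return max(0.0, min(v_nominal, v_max))
```

When the gap is large the nominal command passes through unchanged, so the filter only intervenes near the safety boundary. This is the property that lets a learned policy keep optimising for efficiency.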
cs.RO / 36 / 2602.10013
Learning Force-Regulated Manipulation with a Low-Cost Tactile-Force-Controlled Gripper
使用低成本触觉力控制夹具学习力调节操控
Abstract
Successfully manipulating many everyday objects, such as potato chips, requires precise force regulation. Failure to modulate force can lead to task failure or irreversible damage to the objects. Humans can precisely achieve this by adapting force from tactile feedback, even within a short period of physical contact. We aim to give robots this capability. However, commercial grippers exhibit high cost or high minimum force, making them unsuitable for studying force-controlled policy learning with everyday force-sensitive objects. We introduce TF-Gripper, a low-cost (~$150) force-controlled parallel-jaw gripper that integrates tactile sensing as feedback. It has an effective force range of 0.45-45N and is compatible with different robot arms. Additionally, we designed a teleoperation device paired with TF-Gripper to record human-applied grasping forces. While standard low-frequency policies can be trained on this data, they struggle with the reactive, contact-dependent nature of force regulation. To overcome this, we propose RETAF (REactive Tactile Adaptation of Force), a framework that decouples grasping force control from arm pose prediction. RETAF regulates force at high frequency using wrist images and tactile feedback, while a base policy predicts end-effector pose and gripper open/close action. We evaluate TF-Gripper and RETAF across five real-world tasks requiring precise force regulation. Results show that compared to position control, direct force control significantly improves grasp stability and task performance. We further show that tactile feedback is essential for force regulation, and that RETAF consistently outperforms baselines and can be integrated with various base policies. We hope this work opens a path for scaling the learning of force-controlled policies in robotic manipulation. Project page: https://force-gripper.github.io .
Chinese Translation
成功操控许多日常物品,例如薯片,需要精确的力调节。未能调节力可能导致任务失败或对物体造成不可逆的损害。人类能够通过触觉反馈在短时间的物理接触中精确地调节力。我们的目标是赋予机器人这种能力。然而,商业夹具通常成本高或最低力过大,使其不适合用于研究日常力敏感物体的力控制策略学习。我们介绍了TF-Gripper,一种低成本(约150美元)的力控制平行夹具,集成了触觉传感作为反馈。它的有效力范围为0.45-45N,并与不同的机器人手臂兼容。此外,我们设计了一种与TF-Gripper配对的遥操作设备,以记录人类施加的抓取力。虽然可以在这些数据上训练标准的低频策略,但它们在力调节的反应性和接触依赖性方面表现不佳。为了解决这个问题,我们提出了RETAF(REactive Tactile Adaptation of Force),一个将抓取力控制与手臂姿态预测解耦的框架。RETAF利用手腕图像和触觉反馈以高频率调节力,而基础策略则预测末端执行器姿态和夹具的开/关动作。我们在五个需要精确力调节的真实任务中评估了TF-Gripper和RETAF。结果表明,与位置控制相比,直接力控制显著提高了抓取稳定性和任务表现。我们进一步表明,触觉反馈对于力调节至关重要,并且RETAF始终优于基线,并且可以与各种基础策略集成。我们希望这项工作为扩展机器人操控中力控制策略的学习开辟了一条道路。项目页面:https://force-gripper.github.io
cs.RO / 37 / 2602.10015
RoboSubtaskNet: Temporal Sub-task Segmentation for Human-to-Robot Skill Transfer in Real-World Environments
RoboSubtaskNet:用于人机技能转移的时序子任务分割在真实环境中的应用
Abstract
Temporally locating and classifying fine-grained sub-task segments in long, untrimmed videos is crucial to safe human-robot collaboration. Unlike generic activity recognition, collaborative manipulation requires sub-task labels that are directly robot-executable. We present RoboSubtaskNet, a multi-stage human-to-robot sub-task segmentation framework that couples attention-enhanced I3D features (RGB plus optical flow) with a modified MS-TCN employing a Fibonacci dilation schedule to better capture short-horizon transitions such as reach-pick-place. The network is trained with a composite objective comprising cross-entropy and temporal regularizers (truncated MSE and a transition-aware term) to reduce over-segmentation and to encourage valid sub-task progressions. To close the gap between vision benchmarks and control, we introduce RoboSubtask, a dataset of healthcare and industrial demonstrations annotated at the sub-task level and designed for deterministic mapping to manipulator primitives. Empirically, RoboSubtaskNet outperforms MS-TCN and MS-TCN++ on GTEA and our RoboSubtask benchmark (boundary-sensitive and sequence metrics), while remaining competitive on the long-horizon Breakfast benchmark. Specifically, RoboSubtaskNet attains F1 @ 50 = 79.5%, Edit = 88.6%, Acc = 78.9% on GTEA; F1 @ 50 = 30.4%, Edit = 52.0%, Acc = 53.5% on Breakfast; and F1 @ 50 = 94.2%, Edit = 95.6%, Acc = 92.2% on RoboSubtask. We further validate the full perception-to-execution pipeline on a 7-DoF Kinova Gen3 manipulator, achieving reliable end-to-end behavior in physical trials (overall task success approximately 91.25%). These results demonstrate a practical path from sub-task level video understanding to deployed robotic manipulation in real-world settings.
Chinese Translation
在长时间未剪辑的视频中,准确定位和分类细粒度子任务片段对于安全的人机协作至关重要。与通用活动识别不同,协作操作需要直接可由机器人执行的子任务标签。我们提出了RoboSubtaskNet,这是一种多阶段的人机子任务分割框架,将增强注意力的I3D特征(RGB加光流)与采用斐波那契扩张调度的改进MS-TCN相结合,以更好地捕捉短期过渡,例如抓取-放置。该网络的训练目标是一个复合目标,包括交叉熵和时序正则化项(截断均方误差和过渡感知项),以减少过度分割并鼓励有效的子任务进展。为了缩小视觉基准与控制之间的差距,我们引入了RoboSubtask,这是一个在子任务级别进行标注的医疗和工业演示数据集,旨在实现与操控原语的确定性映射。实证结果表明,RoboSubtaskNet在GTEA和我们的RoboSubtask基准(边界敏感和序列指标)上优于MS-TCN和MS-TCN++,同时在长时间的Breakfast基准上保持竞争力。具体而言,RoboSubtaskNet在GTEA上达到F1 @ 50 = 79.5%,Edit = 88.6%,Acc = 78.9%;在Breakfast上达到F1 @ 50 = 30.4%,Edit = 52.0%,Acc = 53.5%;在RoboSubtask上达到F1 @ 50 = 94.2%,Edit = 95.6%,Acc = 92.2%。我们进一步在7自由度的Kinova Gen3操控器上验证了完整的感知到执行管道,在物理试验中实现了可靠的端到端行为(整体任务成功率约为91.25%)。这些结果展示了从子任务级视频理解到在真实环境中部署机器人操作的实用路径。
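To see why a Fibonacci dilation schedule favours short-horizon transitions, the sketch below compares it with the doubling schedule (1, 2, 4, ...) of the original MS-TCN. The starting values 1, 2 and the kernel size of 3 are assumptions, since the abstract does not state the exact schedule. Because the dilations grow more slowly, more layers operate at short temporal ranges and the overall receptive field is smaller.

```python
def fibonacci_dilations(n_layers):
    """Dilation per layer following 1, 2, 3, 5, 8, ... (assumed start)."""
    dilations, a, b = [], 1, 2
    for _ in range(n_layers):
        dilations.append(a)
        a, b = b, a + b
    return dilations

def doubling_dilations(n_layers):
    """Dilation per layer following 1, 2, 4, 8, ... as in MS-TCN."""
    return [2 ** i for i in range(n_layers)]

def receptive_field(dilations, kernel_size=3):
    """Frames seen by one output of a stack of dilated 1D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

fib = fibonacci_dilations(6)      # [1, 2, 3, 5, 8, 13]
dbl = doubling_dilations(6)       # [1, 2, 4, 8, 16, 32]
print(receptive_field(fib), receptive_field(dbl))  # 65 127
```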
cs.RO / 38 / 2602.10035
A Collision-Free Sway Damping Model Predictive Controller for Safe and Reactive Forestry Crane Navigation
一种无碰撞摆动阻尼模型预测控制器用于安全和反应式林业起重机导航
Abstract
Forestry cranes operate in dynamic, unstructured outdoor environments where simultaneous collision avoidance and payload sway control are critical for safe navigation. Existing approaches address these challenges separately, either focusing on sway damping with predefined collision-free paths or performing collision avoidance only at the global planning level. We present the first collision-free, sway-damping model predictive controller (MPC) for a forestry crane that unifies both objectives in a single control framework. Our approach integrates LiDAR-based environment mapping directly into the MPC using online Euclidean distance fields (EDF), enabling real-time environmental adaptation. The controller simultaneously enforces collision constraints while damping payload sway, allowing it to (i) replan upon quasi-static environmental changes, (ii) maintain collision-free operation under disturbances, and (iii) provide safe stopping when no bypass exists. Experimental validation on a real forestry crane demonstrates effective sway damping and successful obstacle avoidance. A video can be found at https://youtu.be/tEXDoeLLTxA.
Chinese Translation
林业起重机在动态、非结构化的户外环境中操作,其中同时避免碰撞和控制载荷摆动对于安全导航至关重要。现有的方法分别解决这些挑战,要么专注于使用预定义的无碰撞路径进行摆动阻尼,要么仅在全局规划层面进行碰撞避免。我们提出了首个无碰撞、摆动阻尼的模型预测控制器(MPC),将这两个目标统一在一个控制框架中。我们的方法将基于激光雷达(LiDAR)的环境映射直接集成到MPC中,使用在线欧几里得距离场(EDF),实现实时环境适应。该控制器在阻尼载荷摆动的同时强制执行碰撞约束,使其能够(i)在准静态环境变化时重新规划,(ii)在干扰下保持无碰撞操作,以及(iii)在没有绕行路径时提供安全停车。对真实林业起重机的实验验证表明了有效的摆动阻尼和成功的障碍物避免。视频链接为 https://youtu.be/tEXDoeLLTxA。
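A Euclidean distance field turns collision avoidance into a simple inequality: every point on the planned motion must keep at least a given clearance from the nearest obstacle. The sketch below is a brute-force grid version for illustration only. The grid representation, the function names, and the clearance value are assumptions, and a real-time controller would need an incrementally updated field rather than this full recomputation.

```python
import math

def distance_field(width, height, obstacles):
    """Brute-force EDF: distance from each grid cell to the nearest obstacle.

    obstacles: non-empty list of (x, y) occupied cells.
    Returns a height x width nested list indexed as field[y][x].
    """
    return [
        [min(math.hypot(x - ox, y - oy) for ox, oy in obstacles)
         for x in range(width)]
        for y in range(height)
    ]

def path_is_safe(path, field, clearance):
    """Collision constraint: every (x, y) waypoint keeps the clearance."""
    return all(field[y][x] >= clearance for x, y in path)
```

Inside an MPC, the same inequality would be imposed at every step of the prediction horizon, with the field queried at the predicted crane and payload positions.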
cs.RO / 39 / 2602.10069
Humanoid Factors: Design Principles for AI Humanoids in Human Worlds
类人因素:人类世界中人工智能类人的设计原则
Abstract
Human factors research has long focused on optimizing environments, tools, and systems to account for human performance. Yet, as humanoid robots begin to share our workplaces, homes, and public spaces, the design challenge expands. We must now consider not only factors for humans but also factors for humanoids, since both will coexist and interact within the same environments. Unlike conventional machines, humanoids introduce expectations of human-like behavior, communication, and social presence, which reshape usability, trust, and safety considerations. In this article, we introduce the concept of humanoid factors as a framework structured around four pillars - physical, cognitive, social, and ethical - that shape the development of humanoids to help them effectively coexist and collaborate with humans. This framework characterizes the overlap and divergence between human capabilities and those of general-purpose humanoids powered by AI foundation models. To demonstrate our framework's practical utility, we then apply the framework to evaluate a real-world humanoid control algorithm, illustrating how conventional task completion metrics in robotics overlook key human cognitive and interaction principles. We thus position humanoid factors as a foundational framework for designing, evaluating, and governing sustained human-humanoid coexistence.
Chinese Translation
人因研究长期以来一直专注于优化环境、工具和系统,以考虑人类的表现。然而,随着类人机器人开始进入我们的工作场所、家庭和公共空间,设计挑战也随之扩大。我们现在不仅必须考虑人类的因素,还必须考虑类人的因素,因为两者将在同一环境中共存和互动。与传统机器不同,类人机器人引入了对类人行为、沟通和社会存在感的期望,这重新塑造了可用性、信任和安全性的考量。在本文中,我们引入了类人因素的概念,作为一个围绕四个支柱——物理、认知、社会和伦理——构建的框架,旨在塑造类人的发展,以帮助它们有效地与人类共存和协作。该框架描述了人类能力与由人工智能基础模型驱动的一般类人机器人之间的重叠与差异。为了展示我们框架的实际应用价值,我们随后将该框架应用于评估一个真实世界的类人控制算法,说明传统机器人任务完成指标如何忽视关键的人类认知和互动原则。因此,我们将类人因素定位为设计、评估和管理持续人类与类人共存的基础框架。
cs.RO / 40 / 2602.10093
UniVTAC: A Unified Simulation Platform for Visuo-Tactile Manipulation Data Generation, Learning, and Benchmarking
UniVTAC:一个统一的视觉-触觉操作数据生成、学习和基准测试模拟平台
Abstract
Robotic manipulation has seen rapid progress with vision-language-action (VLA) policies. However, visuo-tactile perception is critical for contact-rich manipulation, as tasks such as insertion are difficult to complete robustly using vision alone. At the same time, acquiring large-scale and reliable tactile data in the physical world remains costly and challenging, and the lack of a unified evaluation platform further limits policy learning and systematic analysis. To address these challenges, we propose UniVTAC, a simulation-based visuo-tactile data synthesis platform that supports three commonly used visuo-tactile sensors and enables scalable and controllable generation of informative contact interactions. Based on this platform, we introduce the UniVTAC Encoder, a visuo-tactile encoder trained on large-scale simulation-synthesized data with designed supervisory signals, providing tactile-centric visuo-tactile representations for downstream manipulation tasks. In addition, we present the UniVTAC Benchmark, which consists of eight representative visuo-tactile manipulation tasks for evaluating tactile-driven policies. Experimental results show that integrating the UniVTAC Encoder improves average success rates by 17.1% on the UniVTAC Benchmark, while real-world robotic experiments further demonstrate a 25% improvement in task success. Our webpage is available at https://univtac.github.io/.
Chinese Translation
机器人操作在视觉-语言-动作(VLA)策略的推动下取得了快速进展。然而,视觉-触觉感知对于接触丰富的操作至关重要,因为像插入这样的任务仅依靠视觉很难稳健完成。同时,在物理世界中获取大规模且可靠的触觉数据仍然成本高昂且具有挑战性,而缺乏统一的评估平台进一步限制了策略学习和系统分析。为了解决这些挑战,我们提出了UniVTAC,一个基于模拟的视觉-触觉数据合成平台,支持三种常用的视觉-触觉传感器,并能够可扩展和可控地生成信息丰富的接触交互。在此平台的基础上,我们引入了UniVTAC编码器,这是一个基于大规模模拟合成数据和设计的监督信号训练的视觉-触觉编码器,为下游操作任务提供以触觉为中心的视觉-触觉表示。此外,我们还提出了UniVTAC基准测试,其中包含八个具有代表性的视觉-触觉操作任务,用于评估以触觉驱动的策略。实验结果表明,整合UniVTAC编码器使得在UniVTAC基准测试上的平均成功率提高了17.1%,而现实世界的机器人实验进一步证明了任务成功率提高了25%。我们的网页可访问 https://univtac.github.io/。
cs.RO / 41 / 2602.10098
VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
VLA-JEPA:通过潜在世界模型增强视觉-语言-动作模型
Abstract
Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation -- future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe -- JEPA pretraining followed by action-head fine-tuning -- without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
Chinese Translation
在互联网规模的视频上预训练视觉-语言-动作(VLA)策略具有吸引力,但目前的潜在动作目标往往学习到错误的内容:它们仍然依赖于像素变化,而不是与动作相关的状态转变,这使得它们容易受到外观偏差、干扰运动和信息泄漏的影响。我们提出了VLA-JEPA,一种JEPA风格的预训练框架,通过设计规避了这些陷阱。关键思想是无泄漏状态预测:目标编码器从未来帧生成潜在表示,而学生路径仅观察当前观测——未来信息仅作为监督目标使用,而不作为输入。通过在潜在空间而非像素空间进行预测,VLA-JEPA学习到对相机运动和无关背景变化具有鲁棒性的动态抽象。这产生了一个简单的两阶段流程——JEPA预训练后跟随动作头微调——而不需要先前潜在动作管道的多阶段复杂性。在LIBERO、LIBERO-Plus、SimplerEnv和真实世界操作任务上的实验表明,VLA-JEPA在泛化和鲁棒性方面相较于现有方法取得了一致的提升。
cs.RO / 42 / 2602.10101
Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction
Robo3R:通过精确的前馈3D重建增强机器人操作
Abstract
3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics and global alignment. Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames, Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors. Across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, we observe consistent gains in performance, suggesting the promise of this alternative 3D sensing module for robotic manipulation.
Chinese Translation
3D空间感知是通用机器人操作的基础,但获取可靠的高质量3D几何形状仍然具有挑战性。深度传感器受到噪声和材料敏感性的影响,而现有的重建模型缺乏物理交互所需的精确度和度量一致性。我们提出了Robo3R,这是一种前馈的、适用于操作的3D重建模型,能够实时从RGB图像和机器人状态中直接预测准确的度量尺度场景几何形状。Robo3R共同推断尺度不变的局部几何形状和相对相机姿态,并通过学习的全局相似性变换将其统一到规范机器人框架下的场景表示中。为了满足操作的精度要求,Robo3R采用了掩蔽点头以生成清晰、细粒度的点云,并使用基于关键点的透视n点(Perspective-n-Point, PnP)公式来优化相机外参和全局对齐。Robo3R在Robo3R-4M上进行训练,这是一个经过精心策划的大规模合成数据集,包含四百万个高保真标注帧,Robo3R在性能上始终优于最先进的重建方法和深度传感器。在包括模仿学习、模拟到现实转移、抓取合成和无碰撞运动规划等下游任务中,我们观察到性能的一致提升,表明这种替代的3D感知模块在机器人操作中的潜力。
cs.RO / 43 / 2602.10105
DexImit: Learning Bimanual Dexterous Manipulation from Monocular Human Videos
DexImit:从单目人类视频学习双手灵巧操作
Abstract
Data scarcity fundamentally limits the generalization of bimanual dexterous manipulation, as real-world data collection for dexterous hands is expensive and labor-intensive. Human manipulation videos, as a direct carrier of manipulation knowledge, offer significant potential for scaling up robot learning. However, the substantial embodiment gap between human hands and robotic dexterous hands makes direct pretraining from human videos extremely challenging. To bridge this gap and unleash the potential of large-scale human manipulation video data, we propose DexImit, an automated framework that converts monocular human manipulation videos into physically plausible robot data, without any additional information. DexImit employs a four-stage generation pipeline: (1) reconstructing hand-object interactions from arbitrary viewpoints with near-metric scale; (2) performing subtask decomposition and bimanual scheduling; (3) synthesizing robot trajectories consistent with the demonstrated interactions; (4) comprehensive data augmentation for zero-shot real-world deployment. Building on these designs, DexImit can generate large-scale robot data based on human videos, either from the Internet or video generation models. DexImit is capable of handling diverse manipulation tasks, including tool use (e.g., cutting an apple), long-horizon tasks (e.g., making a beverage), and fine-grained manipulations (e.g., stacking cups).
Chinese Translation
数据稀缺根本上限制了双手灵巧操作的泛化能力,因为收集灵巧手的真实世界数据既昂贵又劳动密集。人类操作视频作为操作知识的直接载体,具有显著的潜力来扩大机器人学习的规模。然而,人类手与机器人灵巧手之间的巨大体现差距使得直接从人类视频进行预训练变得极具挑战性。为了弥合这一差距并释放大规模人类操作视频数据的潜力,我们提出了DexImit,一个自动化框架,能够将单目人类操作视频转换为物理上合理的机器人数据,而无需任何额外信息。DexImit采用四阶段生成管道:(1)从任意视角重建手-物体交互,接近度量级别;(2)进行子任务分解和双手调度;(3)合成与展示交互一致的机器人轨迹;(4)全面的数据增强以实现零-shot的真实世界部署。基于这些设计,DexImit能够根据人类视频生成大规模的机器人数据,这些视频可以来自互联网或视频生成模型。DexImit能够处理多样的操作任务,包括工具使用(例如,切苹果)、长时间任务(例如,制作饮料)和精细操作(例如,叠杯子)。
cs.RO / 44 / 2602.10106
EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration
EgoHumanoid:解锁在自然环境中的人形机器人运动操控与无机器人自我中心示范
Abstract
Human demonstrations offer rich environmental diversity and scale naturally, making them an appealing alternative to robot teleoperation. While this paradigm has advanced robot-arm manipulation, its potential for the more challenging, data-hungry problem of humanoid loco-manipulation remains largely unexplored. We present EgoHumanoid, the first framework to co-train a vision-language-action policy using abundant egocentric human demonstrations together with a limited amount of robot data, enabling humanoids to perform loco-manipulation across diverse real-world environments. To bridge the embodiment gap between humans and robots, including discrepancies in physical morphology and viewpoint, we introduce a systematic alignment pipeline spanning from hardware design to data processing. A portable system for scalable human data collection is developed, and we establish practical collection protocols to improve transferability. At the core of our human-to-humanoid alignment pipeline lie two key components. The view alignment reduces visual domain discrepancies caused by camera height and perspective variation. The action alignment maps human motions into a unified, kinematically feasible action space for humanoid control. Extensive real-world experiments demonstrate that incorporating robot-free egocentric data significantly outperforms robot-only baselines by 51%, particularly in unseen environments. Our analysis further reveals which behaviors transfer effectively and the potential for scaling human data.
Chinese Translation
人类示范提供了丰富的环境多样性和自然的规模,使其成为机器人遥操作的一个吸引人的替代方案。尽管这一范式已推动了机器人臂的操控,但其在更具挑战性、数据需求高的人形运动操控问题上的潜力仍然未被充分探索。我们提出了EgoHumanoid,这是第一个框架,利用丰富的自我中心人类示范和有限的机器人数据共同训练视觉-语言-动作策略,使人形机器人能够在多样的真实环境中执行运动操控。为了弥合人类与机器人之间的体现差距,包括物理形态和视角的差异,我们引入了一个系统的对齐流程,涵盖从硬件设计到数据处理的各个环节。我们开发了一个便携式系统以实现可扩展的人类数据收集,并建立了实用的收集协议以提高可转移性。在我们的人类到人形机器人对齐流程的核心是两个关键组件。视角对齐减少了由于相机高度和视角变化引起的视觉领域差异。动作对齐将人类动作映射到一个统一的、运动学上可行的人形机器人控制动作空间。大量的真实世界实验表明,结合无机器人自我中心数据的性能显著优于仅使用机器人基线,提升幅度达到51%,尤其是在未见过的环境中。我们的分析进一步揭示了哪些行为能够有效转移以及人类数据扩展的潜力。
cs.RO / 45 / 2602.10109
ST4VLA: Spatially Guided Training for Vision-Language-Action Models
ST4VLA:用于视觉-语言-动作模型的空间引导训练
Abstract
Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system Vision-Language-Action framework that leverages Spatial Guided Training to align action learning with spatial priors in VLMs. ST4VLA includes two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, ST4VLA achieves substantial improvements over vanilla VLA, with performance increasing from 66.1 -> 84.6 on Google Robot and from 54.7 -> 73.2 on WidowX Robot, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. Source code, data and models are released at https://internrobotics.github.io/internvla-m1.github.io/
Chinese Translation
大型视觉-语言模型(VLMs)在多模态理解方面表现出色,但在需要将指令转化为低级运动动作的具身任务中却显得不足。我们提出了ST4VLA,一种双系统视觉-语言-动作框架,利用空间引导训练将动作学习与VLM中的空间先验对齐。ST4VLA包括两个阶段:(i)空间基础预训练,通过从网络规模和机器人特定数据中进行可扩展的点、框和轨迹预测,为VLM提供可转移的先验;(ii)空间引导动作后训练,鼓励模型生成更丰富的空间先验,以通过空间提示引导动作生成。该设计在策略学习过程中保持空间基础,并促进空间与动作目标之间的一致优化。从实证结果来看,ST4VLA在原始VLA上取得了显著提升,Google Robot的性能从66.1提高到84.6,WidowX Robot的性能从54.7提高到73.2,在SimplerEnv上建立了新的最先进结果。它还展示了对未见物体和改述指令的更强泛化能力,以及在现实环境中对长时间扰动的鲁棒性。这些结果突显了可扩展的空间引导训练作为一种有前景的方向,以实现稳健且可泛化的机器人学习。源代码、数据和模型已发布在 https://internrobotics.github.io/internvla-m1.github.io/
cs.RO / 46 / 2602.10111
Learning Agile Quadrotor Flight in the Real World
在真实世界中学习灵活的四旋翼飞行
Abstract
Learning-based controllers have achieved impressive performance in agile quadrotor flight but typically rely on massive training in simulation, necessitating accurate system identification for effective Sim2Real transfer. However, even with precise modeling, fixed policies remain susceptible to out-of-distribution scenarios, ranging from external aerodynamic disturbances to internal hardware degradation. To ensure safety under these evolving uncertainties, such controllers are forced to operate with conservative safety margins, inherently constraining their agility outside of controlled settings. While online adaptation offers a potential remedy, safely exploring physical limits remains a critical bottleneck due to data scarcity and safety risks. To bridge this gap, we propose a self-adaptive framework that eliminates the need for precise system identification or offline Sim2Real transfer. We introduce Adaptive Temporal Scaling (ATS) to actively explore platform physical limits, and employ online residual learning to augment a simple nominal model. Based on the learned hybrid model, we further propose Real-world Anchored Short-horizon Backpropagation Through Time (RASH-BPTT) to achieve efficient and robust in-flight policy updates. Extensive experiments demonstrate that our quadrotor reliably executes agile maneuvers near actuator saturation limits. The system evolves a conservative base policy with a peak speed of 1.9 m/s to 7.3 m/s within approximately 100 seconds of flight time. These findings underscore that real-world adaptation serves not merely to compensate for modeling errors, but as a practical mechanism for sustained performance improvement in aggressive flight regimes.
Chinese Translation
基于学习的控制器在灵活的四旋翼飞行中取得了令人印象深刻的性能,但通常依赖于大量的仿真训练,这需要准确的系统识别以实现有效的Sim2Real转移。然而,即使在精确建模的情况下,固定策略仍然容易受到分布外场景的影响,这些场景包括外部气动干扰和内部硬件退化。为了在这些不断变化的不确定性下确保安全,这类控制器被迫以保守的安全边际运行,这在本质上限制了它们在受控环境之外的灵活性。虽然在线适应提供了潜在的解决方案,但由于数据稀缺和安全风险,安全探索物理极限仍然是一个关键瓶颈。为了解决这一问题,我们提出了一种自适应框架,消除了对精确系统识别或离线Sim2Real转移的需求。我们引入了自适应时间缩放(Adaptive Temporal Scaling, ATS)以主动探索平台的物理极限,并采用在线残差学习来增强一个简单的名义模型。基于学习的混合模型,我们进一步提出了现实世界锚定的短视界时间反向传播(Real-world Anchored Short-horizon Backpropagation Through Time, RASH-BPTT),以实现高效且稳健的飞行中策略更新。大量实验表明,我们的四旋翼在接近执行器饱和极限时可靠地执行灵活的机动。该系统在大约100秒的飞行时间内将保守的基础策略的峰值速度从1.9 m/s提升至7.3 m/s。这些发现强调,现实世界的适应不仅仅是为了补偿建模误差,而是作为在激进飞行状态下持续性能提升的实用机制。
cs.RO / 47 / 2602.10114
Decoupled MPPI-Based Multi-Arm Motion Planning
解耦的基于 MPPI 的多臂运动规划
Abstract
Recent advances in sampling-based motion planning algorithms for high DOF arms leverage GPUs to provide SOTA performance. These algorithms can be used to control multiple arms jointly, but this approach scales poorly. To address this, we extend STORM, a sampling-based model-predictive-control (MPC) motion planning algorithm, to handle multiple robots in a distributed fashion. First, we modify STORM to handle dynamic obstacles. Then, we let each arm compute its own motion plan prefix, which it shares with the other arms, which treat it as a dynamic obstacle. Finally, we add a dynamic priority scheme. The new algorithm, MR-STORM, demonstrates clear empirical advantages over SOTA algorithms when operating with both static and dynamic obstacles.
Chinese Translation
最近,基于采样的高自由度臂运动规划算法的进展利用 GPU 提供了最先进的性能。这些算法可以联合控制多个臂,但这种方法的扩展性较差。为了解决这个问题,我们扩展了 STORM,一种基于采样的模型预测控制(MPC)运动规划算法,以分布式方式处理多个机器人。首先,我们修改 STORM 以处理动态障碍物。然后,我们让每个臂计算自己的运动规划前缀,并与其他臂共享,其他臂将其视为动态障碍物。最后,我们添加了一种动态优先级方案。新算法 MR-STORM 在处理静态和动态障碍物时,显示出明显的实证优势,超越了最先进的算法。
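The prefix-sharing step can be sketched as follows. This is a deliberately simplified illustration, not MR-STORM itself: arms are reduced to points on a grid, rollouts that conflict with another arm's shared prefix are rejected outright (an MPPI-style planner would instead weight them by cost), and the dynamic priority scheme is omitted. The names `first_conflict` and `pick_rollout` are hypothetical.

```python
def first_conflict(rollout, shared_prefixes):
    """Earliest time step at which a rollout occupies the same cell as any
    shared plan prefix, or None. Each prefix acts as a dynamic obstacle.
    Note: swap conflicts (two arms exchanging cells) are not checked."""
    for t, cell in enumerate(rollout):
        for prefix in shared_prefixes:
            if t < len(prefix) and prefix[t] == cell:
                return t
    return None

def pick_rollout(rollouts, shared_prefixes):
    """Choose the lowest-cost sampled rollout that avoids all prefixes.

    rollouts: list of (cost, [cell, cell, ...]) samples for one arm.
    Returns None if every sample conflicts.
    """
    feasible = [r for r in rollouts
                if first_conflict(r[1], shared_prefixes) is None]
    return min(feasible, key=lambda r: r[0]) if feasible else None
```

Each arm only needs the other arms' short prefixes rather than their full sampling distributions, which is what keeps the per-arm planning problem at single-arm dimensionality.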
cs.CV / 1 / 2602.09082
UI-Venus-1.5 Technical Report
UI-Venus-1.5 技术报告
Veuns-Team, Gao, Changlong, Gu, Zhangxuan, Liu, Yulin, Qiu, Xinyu, Shen, Shuheng, Wen, Yue, Xia, Tianyu, Xu, Zhenyu, Zeng, Zhengwen, Zhou, Beitong, Zhou, Xingran, Chen, Weizhi, Dai, Sunhao, Dou, Jingya, Gong, Yichen, Guo, Yuan, Guo, Zhenlin, Li, Feng, Li, Qian, Lin, Jinzhen, Zhou, Yuqi, Zhu, Linchao, Chen, Liang, Guo, Zhenyu, Meng, Changhua, Wang, Weiqiang
Abstract
GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging. In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B) to meet various downstream application scenarios. Compared to our previous version, UI-Venus-1.5 introduces three key technical advances: (1) a comprehensive Mid-Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full-trajectory rollouts, aligning training objectives with long-horizon, dynamic navigation in large-scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain-specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI-Venus-1.5 establishes new state-of-the-art performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real-world scenarios. Code: https://github.com/inclusionAI/UI-Venus; Model: https://huggingface.co/collections/inclusionAI/ui-venus
Chinese Translation
图形用户界面(GUI)代理已成为自动化数字环境中交互的强大范式,但实现广泛的通用性和持续强劲的任务性能仍然具有挑战性。在本报告中,我们提出了 UI-Venus-1.5,这是一种统一的端到端 GUI 代理,旨在用于稳健的现实世界应用。所提议的模型家族包括两种密集变体(2B 和 8B)以及一种专家混合变体(30B-A3B),以满足各种下游应用场景。与我们之前的版本相比,UI-Venus-1.5 引入了三个关键技术进展:(1)一个全面的中期训练阶段,利用 100 亿个标记跨越 30 多个数据集,以建立基础的 GUI 语义;(2)具有完整轨迹回放的在线强化学习,将训练目标与大规模环境中的长时间动态导航对齐;(3)通过模型合并构建的单一统一 GUI 代理,将特定领域模型(基础、网络和移动)合成到一个一致的检查点。广泛的评估表明,UI-Venus-1.5 在 ScreenSpot-Pro(69.6%)、VenusBench-GD(75.0%)和 AndroidWorld(77.6%)等基准测试中建立了新的最先进性能,显著超越了之前的强基线。此外,UI-Venus-1.5 在各种中国移动应用中展示了强大的导航能力,能够有效执行现实场景中的用户指令。代码:https://github.com/inclusionAI/UI-Venus;模型:https://huggingface.co/collections/inclusionAI/ui-venus
cs.CV / 2 / 2602.09084
Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling
代理香蕉:基于代理思维和工具的高保真图像编辑
Abstract
We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows that often operate on ultra high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.
Chinese Translation
我们研究了在专业工作流程下基于指令的图像编辑,并识别出三个持续存在的挑战:(i)编辑者往往过度编辑,修改内容超出用户的意图;(ii)现有模型大多为单轮编辑,而多轮编辑可能会改变对象的真实性;(iii)在约1000分辨率下的评估与实际工作流程不一致,后者通常在超高清图像(例如4K)上进行操作。我们提出了代理香蕉(Agent Banana),一个用于高保真、对象感知、深思熟虑编辑的分层代理规划-执行框架。代理香蕉引入了两个关键机制:(1)上下文折叠(Context Folding),将长时间交互历史压缩为结构化记忆,以实现稳定的长时间控制;(2)图像层分解(Image Layer Decomposition),执行局部基于层的编辑,以保护非目标区域,同时启用原生分辨率输出。为了支持严格的评估,我们构建了HDD-Bench,一个高分辨率、基于对话的基准,具有可验证的逐步目标和原生4K图像(1180万像素),用于诊断长时间失败。在HDD-Bench上,代理香蕉在多轮一致性和背景保真度方面表现最佳(例如,IC 0.871,SSIM-OM 0.84,LPIPS-OM 0.12),同时在指令跟随方面保持竞争力,并在标准单轮编辑基准上也取得了强劲表现。我们希望这项工作能够推动可靠的专业级代理图像编辑及其在实际工作流程中的整合。
cs.CV / 3 / 2602.09146
SemanticMoments: Training-Free Motion Similarity via Third Moment Features
SemanticMoments:通过第三阶矩特征实现无训练的运动相似性
Abstract
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
Chinese Translation
基于语义运动检索视频是一个基础但尚未解决的问题。现有的视频表示方法过于依赖静态外观和场景上下文,而非运动动态,这种偏差源于它们的训练数据和目标。相反,传统的运动中心输入如光流缺乏理解高层次运动所需的语义基础。为了展示这种固有偏差,我们引入了SimMotion基准,结合了受控的合成数据和一个新的人工标注的真实世界数据集。我们展示了现有模型在这些基准上的表现不佳,常常无法将运动与外观区分开。为了解决这一差距,我们提出了SemanticMoments,这是一种简单的无训练方法,计算来自预训练语义模型的特征的时间统计(具体来说,是高阶矩)。在我们的基准测试中,SemanticMoments始终优于现有的RGB、光流和文本监督方法。这表明,在语义特征空间中的时间统计为运动中心的视频理解提供了一个可扩展且感知基础的基础。
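The temporal-statistics idea in the SemanticMoments abstract above can be illustrated with a minimal sketch. It assumes per-frame feature vectors are already available as plain lists of floats (in practice they would come from a pre-trained semantic encoder, which is not modelled here). The function name `temporal_central_moments` and the choice of orders 2 and 3 are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def temporal_central_moments(
    frames: Sequence[Sequence[float]],
    orders: Sequence[int] = (2, 3),
) -> List[float]:
    """Build a clip descriptor from central moments taken over the time axis.

    frames: T x D per-frame feature vectors (hypothetical input).
    orders: which central moments to keep (assumption: variance and third moment).
    The result is laid out as [order_1 for all dims, order_2 for all dims, ...].
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    descriptor: List[float] = []
    for order in orders:
        for d in range(dim):
            column = [frame[d] for frame in frames]
            mean = sum(column) / num_frames
            # Central moment: average of (x - mean) ** order across frames.
            descriptor.append(sum((x - mean) ** order for x in column) / num_frames)
    return descriptor


# A static offset added to every frame (a crude stand-in for appearance)
# leaves the central moments unchanged up to floating-point rounding.
clip = [[0.0, 1.0], [1.0, 1.0], [4.0, 1.0]]
shifted = [[x + 10.0 for x in frame] for frame in clip]
moments_clip = temporal_central_moments(clip)
moments_shifted = temporal_central_moments(shifted)
```

Central moments of order two and above ignore a constant per-dimension offset, which is one plausible reason such statistics emphasise how features change over time rather than what they look like on average.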
cs.CV / 4 / 2602.09154
A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video
广播新闻视频中命名实体提取的混合确定性框架
Abstract
The growing volume of video-based news content has heightened the need for transparent and reliable methods to extract on-screen information. Yet the variability of graphical layouts, typographic conventions, and platform-specific design patterns renders manual indexing impractical. This work presents a comprehensive framework for automatically detecting and extracting personal names from broadcast and social-media-native news videos. It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics and proposes an interpretable, modular extraction pipeline designed to operate under deterministic and auditable conditions. The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and stochastic inference. The underlying detector achieves 95.8% mAP@0.5, demonstrating operationally robust performance for graphical element localisation. While generative systems achieve marginally higher raw accuracy (F1: 84.18% vs 77.08%), they lack the transparent data lineage required for journalistic and analytical contexts. The proposed pipeline delivers balanced precision (79.9%) and recall (74.4%), avoids hallucination, and provides full traceability across each processing stage. Complementary user findings indicate that 59% of respondents report difficulty reading on-screen names in fast-paced broadcasts, underscoring the practical relevance of the task. The results establish a methodologically rigorous and interpretable baseline for hybrid multimodal information extraction in modern news media.
Chinese Translation
随着基于视频的新闻内容量的增加,对透明和可靠的信息提取方法的需求也随之加大。然而,图形布局、排版规范和平台特定设计模式的多样性使得手动索引变得不切实际。本研究提出了一个全面的框架,用于自动检测和提取广播和社交媒体原生新闻视频中的人名。它引入了一个经过精心策划和平衡的注释帧语料库,捕捉当代新闻图形的多样性,并提出了一个可解释的、模块化的提取管道,旨在在确定性和可审计的条件下运行。该管道与一类对比的生成多模态方法进行了评估,揭示了确定性可审计性与随机推断之间的明显权衡。基础检测器在图形元素定位方面实现了95.8%的mAP@0.5,展示了操作上稳健的性能。尽管生成系统在原始准确性上略高(F1: 84.18%对比77.08%),但它们缺乏在新闻和分析背景下所需的透明数据来源。所提出的管道提供了平衡的精确度(79.9%)和召回率(74.4%),避免了幻觉,并在每个处理阶段提供了完整的可追溯性。补充的用户调查结果表明,59%的受访者报告在快速播放的广播中阅读屏幕上的名字存在困难,强调了该任务的实际相关性。结果为现代新闻媒体中的混合多模态信息提取建立了方法论上严谨且可解释的基线。
cs.CV / 5 / 2602.09155
Decoding Future Risk: Deep Learning Analysis of Tubular Adenoma Whole-Slide Images
解码未来风险:管状腺瘤全切片图像的深度学习分析
Abstract
Colorectal cancer (CRC) remains a significant cause of cancer-related mortality, despite the widespread implementation of prophylactic initiatives aimed at detecting and removing precancerous polyps. Although screening effectively reduces incidence, a notable portion of patients initially diagnosed with low-grade adenomatous polyps will still develop CRC later in life, even without the presence of known high-risk syndromes. Identifying which low-risk patients are at higher risk of progression is a critical unmet need for tailored surveillance and preventative therapeutic strategies. Traditional histological assessment of adenomas, while fundamental, may not fully capture subtle architectural or cytological features indicative of malignant potential. Advancements in digital pathology and machine learning provide an opportunity to analyze whole-slide images (WSIs) comprehensively and objectively. This study investigates whether machine learning algorithms, specifically convolutional neural networks (CNNs), can detect subtle histological features in WSIs of low-grade tubular adenomas that are predictive of a patient's long-term risk of developing colorectal cancer.
Chinese Translation
尽管广泛实施了旨在检测和去除癌前息肉的预防性措施,结直肠癌(CRC)仍然是癌症相关死亡的重要原因。尽管筛查有效降低了发病率,但仍有相当一部分最初被诊断为低级别腺瘤的患者在后期生活中会发展为CRC,即使没有已知的高风险综合症。识别哪些低风险患者具有更高的进展风险是针对个性化监测和预防治疗策略的一项关键未满足需求。传统的腺瘤组织学评估虽然基础,但可能无法充分捕捉到指示恶性潜力的微妙结构或细胞特征。数字病理学和机器学习的进步为全面和客观地分析全切片图像(WSIs)提供了机会。本研究探讨了机器学习算法,特别是卷积神经网络(CNNs),是否能够检测低级别管状腺瘤WSIs中的微妙组织学特征,这些特征能够预测患者发展结直肠癌的长期风险。
cs.CV / 6 / 2602.09165
All-in-One Conditioning for Text-to-Image Synthesis
一体化文本到图像合成的条件生成
Abstract
Accurate interpretation and visual representation of complex prompts involving multiple objects, attributes, and spatial relationships is a critical challenge in text-to-image synthesis. Despite recent advancements in generating photorealistic outputs, current models often struggle with maintaining semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis within the framework of scene graph structures, aiming to enhance the compositional abilities of existing models. Although prior approaches have attempted to address this by using pre-defined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the core of our method is the Attribute-Size-Quantity-Location (ASQL) Conditioner, which produces visual conditions via a lightweight language model and guides diffusion-based generation through inference-time optimization. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.
Chinese Translation
准确解读和视觉呈现涉及多个对象、属性和空间关系的复杂提示是文本到图像合成中的一项关键挑战。尽管最近在生成照片级真实输出方面取得了进展,但当前模型在处理复杂文本输入时,往往难以保持语义的准确性和结构的一致性。我们提出了一种新颖的方法,将文本到图像合成置于场景图结构的框架内,旨在增强现有模型的组合能力。尽管之前的方法试图通过使用从提示中派生的预定义布局图来解决这一问题,但这种刚性约束往往限制了组合的灵活性和多样性。相反,我们引入了一种零样本的基于场景图的条件机制,在推理过程中生成软视觉指导。我们方法的核心是属性-大小-数量-位置(Attribute-Size-Quantity-Location,ASQL)调节器,它通过轻量级语言模型生成视觉条件,并通过推理时优化指导基于扩散的生成。这使得模型能够在支持轻量级、一致性和多样性图像合成的同时,保持文本与图像的对齐。
cs.CV / 7 / 2602.09209
Wearable environmental sensing to forecast how legged systems will interact with upcoming terrain
可穿戴环境传感器预测腿部系统与即将到来的地形的交互
Abstract
Computer-vision (CV) has been used for environmental classification during gait and is often used to inform control in assistive systems; however, the ability to predict how the foot will contact a changing environment is underexplored. We evaluated the feasibility of forecasting the anterior-posterior (AP) foot center-of-pressure (COP) and time-of-impact (TOI) prior to foot-strike on a level-ground to stair-ascent transition. Eight subjects wore an RGB-D camera on their right shank and instrumented insoles while performing the task of stepping onto the stairs. We trained a CNN-RNN to forecast the COP and TOI continuously within a 250ms window prior to foot-strike, termed the forecast horizon (FH). The COP mean-absolute-error (MAE) at 150, 100, and 50ms FH was 29.42mm, 26.82mm, and 23.72mm, respectively. The TOI MAE was 21.14, 20.08, and 17.73ms for 150, 100, and 50ms, respectively. While torso velocity had no effect on the error in either task, faster toe-swing speeds prior to foot-strike were found to improve prediction accuracy in the COP case but had no significant effect in the TOI case. Further, more anterior foot-strikes were found to reduce COP prediction accuracy but did not affect the TOI prediction accuracy. We also found that our lightweight model was capable of running at 60 FPS on either a consumer-grade laptop or an edge computing device. This study demonstrates that forecasting COP and TOI from visual data was feasible using a lightweight model, which may have important implications for anticipatory control in assistive systems.
Chinese Translation
计算机视觉(CV)已被用于步态期间的环境分类,并常用于辅助系统的控制信息;然而,预测脚如何与变化环境接触的能力尚未得到充分探索。我们评估了在平地到楼梯上升过渡期间,预测脚前后(AP)压力中心(COP)和冲击时间(TOI)的可行性。八名受试者在右小腿上佩戴RGB-D摄像头和仪器化鞋垫,执行踩上楼梯的任务。我们训练了一个CNN-RNN模型,以在脚接触地面前的250毫秒窗口内连续预测COP和TOI,称为预测视野(FH)。在150、100和50毫秒的FH下,COP的平均绝对误差(MAE)分别为29.42毫米、26.82毫米和23.72毫米。TOI的MAE在150、100和50毫秒下分别为21.14毫秒、20.08毫秒和17.73毫秒。尽管躯干速度对两项任务的误差没有影响,但在脚接触前更快的脚趾摆动速度被发现能提高COP预测的准确性,而在TOI的情况下则没有显著影响。此外,更前方的脚接触被发现会降低COP预测的准确性,但对TOI预测的准确性没有影响。我们还发现,我们的轻量级模型能够在消费级笔记本电脑或边缘计算设备上以60帧每秒的速度运行。这项研究表明,使用轻量级模型从视觉数据中预测COP和TOI是可行的,这可能对辅助系统中的预期控制具有重要意义。
cs.CV / 8 / 2602.09214
VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models
VLM-UQBench:视觉语言模型中模态特定与跨模态不确定性的基准
Abstract
Uncertainty quantification (UQ) is vital for ensuring that vision-language models (VLMs) behave safely and reliably. A central challenge is to localize uncertainty to its source, determining whether it arises from the image, the text, or misalignment between the two. We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in VLMs. It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations. We further propose two simple metrics that quantify the sensitivity of UQ scores to these perturbations and their correlation with hallucinations, and use them to evaluate a range of UQ methods across four VLMs and three datasets. Empirically, we find that: (i) existing UQ methods exhibit strong modality-specific specialization and substantial dependence on the underlying VLM, (ii) modality-specific uncertainty frequently co-occurs with hallucinations while current UQ scores provide only weak and inconsistent risk signals, and (iii) although UQ methods can rival reasoning-based chain-of-thought baselines on overt, group-level ambiguity, they largely fail to detect the subtle, instance-level ambiguity introduced by our perturbation pipeline. These results highlight a significant gap between current UQ practices and the fine-grained, modality-aware uncertainty required for reliable VLM deployment.
Chinese Translation
不确定性量化(UQ)对于确保视觉语言模型(VLMs)的安全和可靠行为至关重要。一个核心挑战是将不确定性归因于其来源,确定其是源于图像、文本,还是两者之间的不一致。我们提出了VLM-UQBench,这是一个针对VLM中模态特定和跨模态数据不确定性的基准。它由600个来自VizWiz数据集的真实样本组成,经过整理,形成干净、图像、文本和跨模态不确定性子集,以及一个可扩展的扰动管道,包含8种视觉、5种文本和3种跨模态扰动。我们进一步提出了两个简单的指标,用于量化UQ分数对这些扰动的敏感性及其与幻觉的相关性,并利用这些指标评估四个VLM和三个数据集中的一系列UQ方法。实证结果表明:(i)现有的UQ方法表现出强烈的模态特定专业化,并且在很大程度上依赖于基础VLM;(ii)模态特定的不确定性经常与幻觉共同出现,而当前的UQ分数仅提供微弱且不一致的风险信号;(iii)尽管UQ方法在显性、群体级别的模糊性上可以与基于推理的思维链基准相媲美,但它们在检测我们扰动管道引入的细微实例级别模糊性方面大多失败。这些结果突显了当前UQ实践与可靠VLM部署所需的细粒度、模态感知不确定性之间的显著差距。
cs.CV / 9 / 2602.09252
VLM-Guided Iterative Refinement for Surgical Image Segmentation with Foundation Models
基于VLM的迭代精细化手术图像分割方法
Abstract
Surgical image segmentation is essential for robot-assisted surgery and intraoperative guidance. However, existing methods are constrained to predefined categories, produce one-shot predictions without adaptive refinement, and lack mechanisms for clinician interaction. We propose IR-SIS, an iterative refinement system for surgical image segmentation that accepts natural language descriptions. IR-SIS leverages a fine-tuned SAM3 for initial segmentation, employs a Vision-Language Model to detect instruments and assess segmentation quality, and applies an agentic workflow that adaptively selects refinement strategies. The system supports clinician-in-the-loop interaction through natural language feedback. We also construct a multi-granularity language-annotated dataset from EndoVis2017 and EndoVis2018 benchmarks. Experiments demonstrate state-of-the-art performance on both in-domain and out-of-distribution data, with clinician interaction providing additional improvements. Our work establishes the first language-based surgical segmentation framework with adaptive self-refinement capabilities.
Chinese Translation
手术图像分割对于机器人辅助手术和术中指导至关重要。然而,现有方法受限于预定义类别,产生一次性预测而缺乏自适应精细化,并且缺乏临床医生互动机制。我们提出了IR-SIS,一种接受自然语言描述的手术图像分割迭代精细化系统。IR-SIS利用经过微调的SAM3进行初始分割,采用视觉-语言模型(Vision-Language Model)检测器械并评估分割质量,并应用一种自主工作流程,自适应选择精细化策略。该系统通过自然语言反馈支持临床医生的互动。我们还从EndoVis2017和EndoVis2018基准构建了一个多粒度语言注释数据集。实验结果表明,在领域内和领域外数据上均实现了最先进的性能,临床医生的互动提供了额外的改进。我们的工作建立了第一个具有自适应自我精细化能力的基于语言的手术分割框架。
cs.CV / 10 / 2602.09268
Rethinking Global Text Conditioning in Diffusion Transformers
重新思考扩散变换器中的全球文本调节
Abstract
Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.
Chinese Translation
扩散变换器通常通过注意力层和使用池化文本嵌入的调节机制来整合文本信息。然而,最近的方法放弃了基于调节的文本调节,完全依赖于注意力。在本文中,我们探讨了基于调节的文本调节是否必要,以及它是否能够提供任何性能优势。我们的分析表明,在其传统用法中,池化嵌入对整体性能贡献甚微,暗示仅依靠注意力通常足以忠实传播提示信息。然而,我们揭示了池化嵌入在从不同角度使用时可以提供显著的提升——作为指导并使得可控地向更理想的特性转变。这种方法无需训练,易于实现,运行时开销微乎其微,并且可以应用于各种扩散模型,带来在文本到图像/视频生成和图像编辑等多种任务中的改进。
cs.CV / 11 / 2602.09284
X-Mark: Saliency-Guided Robust Dataset Ownership Verification for Medical Imaging
X-Mark:基于显著性引导的医学影像数据集所有权验证的鲁棒方法
Abstract
High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises serious copyright and ethical concerns. Medical imaging presents a unique challenge for existing dataset ownership verification methods designed for natural images, as static watermark patterns generated in fixed-scale images scale poorly to dynamic and high-resolution scans with limited visual diversity and subtle anatomical structures, while preserving diagnostic quality. In this paper, we propose X-Mark, a sample-specific clean-label watermarking method for chest x-ray copyright protection. Specifically, X-Mark uses a conditional U-Net to generate unique perturbations within salient regions of each sample. We design a multi-component training objective to ensure watermark efficacy and robustness against dynamic scaling processes while preserving diagnostic quality and visual-distinguishability. We incorporate Laplacian regularization into our training objective to penalize high-frequency perturbations and achieve watermark scale-invariance. Ownership verification is performed in a black-box setting to detect characteristic behaviors in suspicious models. Extensive experiments on CheXpert verify the effectiveness of X-Mark, achieving WSR of 100% and reducing probability of false positives in Ind-M scenario by 12%, while demonstrating resistance to potential adaptive attacks.
Chinese Translation
高质量的医学影像数据集对于训练深度学习模型至关重要,但其未经授权的使用引发了严重的版权和伦理问题。医学影像为现有针对自然图像的数据集所有权验证方法带来了独特挑战,因为在固定尺度图像中生成的静态水印模式在动态和高分辨率扫描中表现不佳,且这些扫描具有有限的视觉多样性和微妙的解剖结构,同时又需保持诊断质量。本文提出了X-Mark,一种针对胸部X光图像版权保护的样本特定清晰标签水印方法。具体而言,X-Mark使用条件U-Net在每个样本的显著区域内生成独特的扰动。我们设计了一个多组件训练目标,以确保水印的有效性、对动态缩放过程的鲁棒性,同时保持诊断质量和视觉可区分性。我们在训练目标中加入拉普拉斯正则化,以惩罚高频扰动并实现水印的尺度不变性。所有权验证在黑箱设置中进行,以检测可疑模型中的特征行为。在CheXpert上的大量实验验证了X-Mark的有效性,达到了100%的水印成功率(WSR),并在Ind-M场景中将假阳性概率降低了12%,同时展示了对潜在自适应攻击的抵抗能力。
cs.CV / 12 / 2602.09315
A Deep Multi-Modal Method for Patient Wound Healing Assessment
一种用于患者伤口愈合评估的深度多模态方法
Abstract
Hospitalization of patients is one of the major factors for high wound care costs. Most patients do not acquire a wound which needs immediate hospitalization. However, due to factors such as delay in treatment, patient's non-compliance or existing co-morbid conditions, an injury can deteriorate and ultimately lead to patient hospitalization. In this paper, we propose a deep multi-modal method to predict the patient's risk of hospitalization. Our goal is to predict the risk confidently by collectively using the wound variables and wound images of the patient. Existing works in this domain have mainly focused on healing trajectories based on distinct wound types. We developed a transfer learning-based wound assessment solution, which can predict both wound variables from wound images and their healing trajectories, which is our primary contribution. We argue that the development of a novel model can help in early detection of the complexities in the wound, which might affect the healing process and also reduce the time spent by a clinician to diagnose the wound.
Chinese Translation
患者住院是导致高伤口护理成本的主要因素之一。大多数患者并不需要立即住院治疗的伤口。然而,由于治疗延迟、患者不配合或存在合并症等因素,伤口可能恶化,最终导致患者住院。在本文中,我们提出了一种深度多模态方法来预测患者住院的风险。我们的目标是通过综合利用患者的伤口变量和伤口图像,来自信地预测风险。现有的研究主要集中在基于不同伤口类型的愈合轨迹上。我们开发了一种基于迁移学习的伤口评估解决方案,能够从伤口图像中预测伤口变量及其愈合轨迹,这是我们的主要贡献。我们认为,开发一种新颖的模型可以帮助及早发现伤口中的复杂性,这可能影响愈合过程,同时减少临床医生诊断伤口所花费的时间。
cs.CV / 13 / 2602.09318
GAFR-Net: A Graph Attention and Fuzzy-Rule Network for Interpretable Breast Cancer Image Classification
GAFR-Net:一种用于可解释乳腺癌图像分类的图注意力与模糊规则网络
Abstract
Accurate classification of breast cancer histopathology images is pivotal for early oncological diagnosis and therapeutic intervention. However, conventional deep learning architectures often encounter performance degradation under limited annotations and suffer from a "black-box" nature, hindering their clinical integration. To mitigate these limitations, we propose GAFR-Net, a robust and interpretable Graph Attention and Fuzzy-Rule Network specifically engineered for histopathology image classification with scarce supervision. GAFR-Net constructs a similarity-driven graph representation to model inter-sample relationships and employs a multi-head graph attention mechanism to capture complex relational features across heterogeneous tissue structures. Concurrently, a differentiable fuzzy-rule module encodes intrinsic topological descriptors (including node degree, clustering coefficient, and label consistency) into explicit, human-understandable diagnostic logic. This design establishes transparent "IF-THEN" mappings that mimic the heuristic deduction process of medical experts, providing clear reasoning behind each prediction without relying on post-hoc attribution methods. Extensive evaluations on three benchmark datasets (BreakHis, Mini-DDSM, and ICIAR2018) demonstrate that GAFR-Net consistently outperforms various state-of-the-art methods across multiple magnifications and classification tasks. These results validate the superior generalization and practical utility of GAFR-Net as a reliable decision-support tool for weakly supervised medical image analysis.
Chinese Translation
准确分类乳腺癌组织病理图像对于早期肿瘤诊断和治疗干预至关重要。然而,传统深度学习架构在有限标注下常常面临性能下降,并且由于其“黑箱”特性,阻碍了其临床整合。为了解决这些局限性,我们提出了GAFR-Net,一种专门为稀缺监督下的组织病理图像分类而设计的强大且可解释的图注意力与模糊规则网络。GAFR-Net构建了一个基于相似性的图表示,以建模样本间关系,并采用多头图注意力机制来捕捉异质组织结构中的复杂关系特征。同时,一个可微分的模糊规则模块将节点度、聚类系数和标签一致性等内在拓扑描述符编码为明确且易于人类理解的诊断逻辑。该设计建立了透明的“如果-那么”映射,模拟医疗专家的启发式推理过程,为每个预测提供清晰的推理,而无需依赖事后归因方法。在三个基准数据集(BreakHis、Mini-DDSM和ICIAR2018)上的广泛评估表明,GAFR-Net在多个放大倍数和分类任务中始终优于各种最先进的方法。这些结果验证了GAFR-Net作为一种可靠的弱监督医学图像分析决策支持工具的卓越泛化能力和实际应用价值。
cs.CV / 14 / 2602.09324
Deep Modeling and Interpretation for Bladder Cancer Classification
膀胱癌分类的深度建模与解释
Abstract
Deep models based on vision transformer (ViT) and convolutional neural network (CNN) have demonstrated remarkable performance on natural datasets. However, these models may not perform similarly in medical imaging, where abnormal regions cover only a small portion of the image. This challenge motivates this study to investigate the latest deep models for bladder cancer classification tasks. We propose the following to evaluate these deep models: 1) standard classification using 13 models (four CNNs and eight transformer-based models), 2) calibration analysis to examine if these models are well calibrated for bladder cancer classification, and 3) GradCAM++ analysis to evaluate the interpretability of these models for clinical diagnosis. We simulate $\sim 300$ experiments on a public multicenter bladder cancer dataset, and the experimental results demonstrate that the ConvNext series shows limited generalization ability to classify bladder cancer images (e.g., $\sim 60\%$ accuracy). In addition, ViTs show better calibration effects compared to the ConvNext and Swin transformer series. We also apply test-time augmentation to improve the models' interpretability. Finally, no model provides a one-size-fits-all solution for a feasible interpretable model. The ConvNext series is suitable for in-distribution samples, while ViT and its variants are suitable for interpreting out-of-distribution samples.
Chinese Translation
基于视觉变换器(ViT)和卷积神经网络(CNN)的深度模型在自然数据集上表现出色。然而,这些模型在医学影像中可能并不相似,因为异常区域仅占图像的一小部分。这一挑战促使本研究探讨最新的深度模型在膀胱癌分类任务中的应用。我们提出以下方法来评估这些深度模型:1)使用13个模型(四个CNN和八个基于变换器的模型)进行标准分类,2)进行校准分析以检查这些模型在膀胱癌分类中的校准效果,3)使用GradCAM++评估这些模型在临床诊断中的可解释性。我们在一个公开的多中心膀胱癌数据集上模拟了约300个实验,实验结果表明ConvNext系列在分类膀胱癌图像时显示出有限的泛化能力(例如,约60%的准确率)。此外,与ConvNext和Swin变换器系列相比,ViT在校准效果上表现更佳。我们还引入测试时增强技术以提高模型的可解释性。最后,没有任何模型提供一种适用于所有情况的可解释模型解决方案。ConvNext系列适合于分布内样本,而ViT及其变体则适合于解释分布外样本。
cs.CV / 15 / 2602.09337
Kyrtos: A methodology for automatic deep analysis of graphic charts with curves in technical documents
Kyrtos:一种用于技术文档中带曲线图表自动深度分析的方法
Abstract
Deep Understanding of Technical Documents (DUTD) has become a very attractive field with great potential due to large amounts of accumulated documents and the valuable knowledge contained in them. In addition, the holistic understanding of technical documents depends on the accurate analysis of their particular modalities, such as graphics, tables, diagrams, text, etc., and their associations. In this paper, we introduce the Kyrtos methodology for the automatic recognition and analysis of charts with curves in graphics images of technical documents. The recognition processing part adopts a clustering-based approach to recognize middle-points that delimit the line-segments that construct the illustrated curves. The analysis processing part parses the extracted line-segments of curves to capture behavioral features such as direction, trend, etc. These associations assist the conversion of recognized segments' relations into attributed graphs, for the preservation of the curves' structural characteristics. The graph relations are also expressed as natural language (NL) text sentences, enriching the document's text and facilitating their conversion into Stochastic Petri-net (SPN) graphs, which depict the internal functionality represented in the chart image. Extensive evaluation results demonstrate the accuracy of Kyrtos' recognition and analysis methods by measuring the structural similarity between input chart curves and the approximations generated by Kyrtos for charts with multiple functions.
Chinese Translation
技术文档的深度理解(DUTD)已成为一个非常有吸引力的领域,因其积累了大量文档并蕴含了宝贵的知识。此外,技术文档的整体理解依赖于对其特定形式(如图形、表格、图解、文本等)及其关联的准确分析。本文介绍了Kyrtos方法,用于自动识别和分析技术文档图像中的带曲线图表。识别处理部分采用基于聚类的方法来识别界定构成所示曲线的线段的中点。分析处理部分解析提取的曲线线段,以捕捉行为特征,如方向、趋势等。这些关联有助于将识别的线段关系转换为带属性的图,以保留曲线的结构特征。图的关系也被表达为自然语言(NL)文本句子,丰富了文档的文本内容,并促进了它们向随机Petri网(SPN)图的转换,以描绘图表图像中表示的内部功能。大量评估结果通过测量输入图表曲线与Kyrtos为多功能图表生成的近似之间的结构相似性,证明了Kyrtos识别和分析方法的准确性。
cs.CV / 16 / 2602.09355
Impact of domain adaptation in deep learning for medical image classifications
深度学习在医学图像分类中的领域适应影响
Abstract
Domain adaptation (DA) is a quickly expanding area in machine learning that involves adjusting a model trained in one domain to perform well in another domain. While there has been notable progress, the fundamental concept of numerous DA methodologies has persisted: aligning the data from various domains into a shared feature space. In this space, knowledge acquired from labeled source data can improve the model training on target data that lacks sufficient labels. In this study, we demonstrate the use of 10 deep learning models to simulate common DA techniques and explore their application in four medical image datasets. We have considered various situations such as multi-modality, noisy data, federated learning (FL), interpretability analysis, and classifier calibration. The experimental results indicate that using DA with ResNet34 on a brain tumor (BT) dataset results in an enhancement of 4.7\% in model performance. Similarly, the use of DA can reduce the impact of Gaussian noise, as it provides a $\sim 3\%$ accuracy increase using ResNet34 on a BT dataset. Furthermore, simply introducing DA into an FL framework shows limited potential (e.g., $\sim 0.3\%$ increase in performance) for skin cancer classification. In addition, the DA method can improve the interpretability of the models using the GradCAM++ technique, which offers clinical value. Calibration analysis also demonstrates that using DA provides a lower expected calibration error (ECE) value $\sim 2\%$ compared to CNN alone on a multi-modality dataset.
Chinese Translation
领域适应(Domain Adaptation, DA)是机器学习中一个快速发展的领域,涉及将一个领域中训练的模型调整到另一个领域以获得良好表现。尽管已经取得了显著进展,但许多DA方法的基本概念仍然保持不变:将来自不同领域的数据对齐到一个共享特征空间。在这个空间中,从标记的源数据中获得的知识可以改善在缺乏足够标签的目标数据上的模型训练。在本研究中,我们展示了使用10个深度学习模型来模拟常见的DA技术,并探索它们在四个医学图像数据集中的应用。我们考虑了多种情况,如多模态、噪声数据、联邦学习(Federated Learning, FL)、可解释性分析和分类器校准。实验结果表明,在脑肿瘤(Brain Tumor, BT)数据集中,使用ResNet34的DA方法使模型性能提高了4.7%。同样,使用DA可以减少高斯噪声的影响,因为在BT数据集上使用ResNet34时提供了约3%的准确率提升。此外,仅仅将DA引入FL框架对皮肤癌分类的潜力有限(例如,性能提升约0.3%)。此外,DA方法可以通过gradcam++技术提高模型的可解释性,具有临床价值。校准分析还表明,使用DA提供的期望校准误差(Expected Calibration Error, ECE)值约为2%,相比于单独使用卷积神经网络(CNN)在多模态数据集上的表现更优。
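The calibration analysis mentioned in the abstract above reports expected calibration error (ECE). As a reminder of how that number is usually computed, here is a minimal sketch of the common equal-width-bin formulation. The binning convention (ten bins, left-closed, with confidence 1.0 placed in the last bin) is an assumption for illustration; the paper's exact protocol is not stated in the abstract.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Weighted average gap between accuracy and mean confidence per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: whether each prediction matched the label.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    count = [0] * num_bins
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1
    ece = 0.0
    for i in range(num_bins):
        if count[i] == 0:
            continue
        gap = abs(hit_sum[i] / count[i] - conf_sum[i] / count[i])
        ece += (count[i] / n) * gap
    return ece


# Four predictions at 90% confidence, three of them right: accuracy 0.75,
# so the model is over-confident by about 0.15.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

A lower value means predicted confidences track observed accuracy more closely, which is the sense in which the abstract compares models with and without DA.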
cs.CV / 17 / 2602.09378
Fully Differentiable Bidirectional Dual-Task Synergistic Learning for Semi-Supervised 3D Medical Image Segmentation
全可微分双向双任务协同学习用于半监督3D医学图像分割
Abstract
Semi-supervised learning relaxes the need for large pixel-wise labeled datasets for image segmentation by leveraging unlabeled data. The scarcity of high-quality labeled data remains a major challenge in medical image analysis due to the high annotation costs and the need for specialized clinical expertise. Semi-supervised learning has demonstrated significant potential in addressing this bottleneck, with pseudo-labeling and consistency regularization emerging as two predominant paradigms. Dual-task collaborative learning, an emerging consistency-aware paradigm, seeks to derive supplementary supervision by establishing prediction consistency between related tasks. However, current methodologies are limited to unidirectional interaction mechanisms (typically regression-to-segmentation), as segmentation results can only be transformed into regression outputs in an offline manner, thereby failing to fully exploit the potential benefits of online bidirectional cross-task collaboration. Thus, we propose a fully Differentiable Bidirectional Synergistic Learning (DBiSL) framework, which seamlessly integrates and enhances four critical SSL components: supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation. Experiments on two benchmark datasets demonstrate our method's state-of-the-art performance. Beyond technical contributions, this work provides new insights into unified SSL framework design and establishes a new architectural foundation for dual-task-driven SSL, while offering a generic multitask learning framework applicable to broader computer vision applications. The code will be released on GitHub upon acceptance.
Chinese Translation
半监督学习通过利用未标记数据,减轻了图像分割对大规模像素级标注数据集的需求。由于高昂的标注成本和对专业临床知识的需求,高质量标注数据的稀缺仍然是医学图像分析中的一大挑战。半监督学习在解决这一瓶颈方面展现了显著潜力,其中伪标注和一致性正则化成为两种主要范式。双任务协作学习是一种新兴的一致性感知范式,旨在通过建立相关任务之间的预测一致性来获取补充监督。然而,目前的方法仅限于单向交互机制(通常是回归到分割),因为分割结果只能以离线方式转换为回归输出,从而未能充分利用在线双向跨任务协作的潜在优势。因此,我们提出了一种全可微分双向协同学习(DBiSL)框架,该框架无缝集成并增强了四个关键的半监督学习组件:监督学习、一致性正则化、伪监督学习和不确定性估计。在两个基准数据集上的实验表明我们的方法达到了最先进的性能。除了技术贡献外,本研究为统一的半监督学习框架设计提供了新的见解,并为双任务驱动的半监督学习建立了新的架构基础,同时提供了一个适用于更广泛计算机视觉应用的通用多任务学习框架。代码将在接受后发布于github。
cs.CV / 18 / 2602.09407
Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D
医学成像和自然物体中的单切片到三维重建:与SAM 3D的比较基准
Abstract
A 3D understanding of anatomy is central to diagnosis and treatment planning, yet volumetric imaging remains costly with long wait times. Image-to-3D foundation models can solve this issue by reconstructing 3D data from 2D modalities. Current foundation models are trained on natural image distributions to reconstruct naturalistic objects from a single image by leveraging geometric priors across pixels. However, it is unclear whether these learned geometric priors transfer to medical data. In this study, we present a controlled zero-shot benchmark of single-slice medical image-to-3D reconstruction across five state-of-the-art image-to-3D models: SAM3D, Hunyuan3D-2.1, Direct3D, Hi3DGen, and TripoSG. These are evaluated across six medical datasets spanning anatomical and pathological structures and two natural datasets, using voxel-based metrics and point cloud distance metrics. Across medical datasets, voxel-based overlap remains moderate for all models, consistent with a depth reconstruction failure mode when inferring volume from a single slice. In contrast, global distance metrics show more separation between methods: SAM3D achieves the strongest overall topological similarity to ground truth medical 3D data, while alternative models are more prone to over-simplification of reconstruction. Our results quantify the limits of single-slice medical reconstruction and highlight depth ambiguity caused by the planar nature of 2D medical data, motivating multi-view image-to-3D reconstruction to enable reliable medical 3D inference.
Chinese Translation
对解剖结构的三维理解是诊断和治疗计划的核心,然而体积成像仍然成本高昂且等待时间长。图像到三维基础模型可以通过从二维模态重建三维数据来解决这一问题。目前的基础模型是在自然图像分布上训练的,利用像素间的几何先验从单幅图像重建自然物体。然而,这些学习到的几何先验是否能够转移到医学数据上尚不清楚。在本研究中,我们展示了一个受控的零样本基准,比较了五种最先进的图像到三维模型在单切片医学图像到三维重建中的表现:SAM3D、Hunyuan3D-2.1、Direct3D、Hi3DGen和TripoSG。这些模型在六个涵盖解剖和病理结构的医学数据集以及两个自然数据集上进行了评估,使用基于体素的度量和点云距离度量。在医学数据集中,所有模型的基于体素的重叠保持在中等水平,这与从单切片推断体积时的深度重建失败模式一致。相比之下,全球距离度量在方法之间显示出更大的差异:SAM3D在整体拓扑相似性上与真实的医学三维数据达成了最佳结果,而其他模型则更容易导致重建的过度简化。我们的结果量化了单切片医学重建的局限性,并强调了由于二维医学数据的平面特性所导致的深度模糊性,促使多视角图像到三维重建的发展,以实现可靠的医学三维推断。
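The abstract above names two metric families, voxel-based overlap and point cloud distance, without listing the exact formulas. As a minimal sketch, assuming intersection-over-union for the voxel side and a symmetric Chamfer distance for the point cloud side (both are common choices, not confirmed as the benchmark's own), the two can be computed as follows. The function names are illustrative only.

```python
from math import dist


def voxel_iou(a, b):
    """Intersection-over-union of two collections of occupied voxel indices."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty grids are treated as a perfect match.
    return len(a & b) / len(union) if union else 1.0


def chamfer_distance(p, q):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, summed over both directions."""
    def one_way(src, dst):
        return sum(min(dist(s, d) for d in dst) for s in src) / len(src)
    return one_way(p, q) + one_way(q, p)


# Toy example: a reconstruction that is shifted by one voxel along the depth axis.
truth = [(0, 0, 0), (0, 0, 1)]
recon = [(0, 0, 1), (0, 0, 2)]
iou = voxel_iou(truth, recon)
cd = chamfer_distance(truth, recon)
```

A one-voxel depth shift already cuts the overlap sharply while the Chamfer distance stays small, which is one way to read the abstract's observation that overlap scores stay moderate while distance metrics separate the methods.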
cs.CV / 19 / 2602.09411
K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge
K-Sort Eval:通过修正的VLM作为评判者实现高效的视觉生成偏好评估
Abstract
The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. While the crowdsourced Arena platforms offer human preference assessments by collecting human votes, they are costly and time-consuming, inherently limiting their scalability. Leveraging vision-language models (VLMs) as substitutes for manual judgments presents a promising solution. However, the inherent hallucinations and biases of VLMs hinder alignment with human preferences, thus compromising evaluation reliability. Additionally, the static evaluation approach leads to low efficiency. In this paper, we propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Specifically, we curate a high-quality dataset from thousands of human votes in K-Sort Arena, with each instance containing the outputs and rankings of K models. When evaluating a new model, it undergoes (K+1)-wise free-for-all comparisons with existing models, and the VLM provides the rankings. To enhance alignment and reliability, we propose a posterior correction method, which adaptively corrects the posterior probability in Bayesian updating based on the consistency between the VLM prediction and human supervision. Moreover, we propose a dynamic matching strategy, which balances uncertainty and diversity to maximize the expected benefit of each comparison, thus ensuring more efficient evaluation. Extensive experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs, demonstrating both its efficiency and reliability.
Chinese Translation
视觉生成模型的快速发展促使对更具可扩展性和人类对齐的评估方法的需求。虽然众包的Arena平台通过收集人类投票提供人类偏好评估,但其成本高昂且耗时,固有地限制了其可扩展性。利用视觉语言模型(VLMs)作为人工判断的替代品提供了一种有前景的解决方案。然而,VLMs固有的幻觉和偏见妨碍了与人类偏好的对齐,从而影响了评估的可靠性。此外,静态评估方法导致效率低下。在本文中,我们提出了K-Sort Eval,一种可靠且高效的基于VLM的评估框架,集成了后验修正和动态匹配。具体而言,我们从K-Sort Arena中的数千个人类投票中策划了一个高质量的数据集,每个实例包含K个模型的输出和排名。在评估新模型时,它与现有模型进行(K+1)次自由比较,VLM提供排名。为了增强对齐性和可靠性,我们提出了一种后验修正方法,该方法根据VLM预测与人类监督之间的一致性自适应地修正贝叶斯更新中的后验概率。此外,我们提出了一种动态匹配策略,平衡不确定性和多样性,以最大化每次比较的预期收益,从而确保更高效的评估。大量实验表明,K-Sort Eval提供的评估结果与K-Sort Arena一致,通常只需少于90次模型运行,证明了其效率和可靠性。
cs.CV / 20 / 2602.09413
LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging
LARV:无数据的层级自适应重标定贴层用于模型合并
Abstract
Model merging aims to combine multiple fine-tuned models into a single multi-task model without access to training data. Existing task-vector merging methods such as TIES, TSV-M, and Iso-C/CTS differ in their aggregation rules but treat all layers nearly uniformly. This assumption overlooks the strong layer-wise heterogeneity in large vision transformers, where shallow layers are sensitive to interference while deeper layers encode stable task-specific features. We introduce LARV, a training-free, data-free, merger-agnostic Layer-wise Adaptive Rescaling Veneer that plugs into any task-vector merger and assigns a per-layer scale to each task vector before aggregation, and show it consistently boosts diverse merging rules. LARV adaptively suppresses shallow-layer interference and amplifies deeper-layer alignment using a simple deterministic schedule, requiring no retraining or modification to existing mergers. To our knowledge, this is the first work to perform layer-aware scaling for task-vector merging. LARV computes simple data-free layer proxies and turns them into scales through a lightweight rule; we study several instantiations within one framework (e.g., tiered two/three-level scaling with fixed values, or continuous mappings) and show that tiered choices offer the best robustness, while continuous mappings remain an ablation. LARV is orthogonal to the base merger and adds negligible cost. On FusionBench with Vision Transformers, LARV consistently improves all task-vector baselines across 8/14/20-task settings; for example, Iso-C + LARV reaches 85.9% on ViT-B/32, 89.2% on ViT-B/16, and 92.6% on ViT-L/14. Layerwise analysis and corruption tests further indicate that LARV suppresses shallow-layer interference while modestly amplifying deeper, task-stable features, turning model merging into a robust, layer-aware procedure rather than a uniform one.
Chinese Translation
模型合并旨在在不访问训练数据的情况下,将多个微调模型合并为一个单一的多任务模型。现有的任务向量合并方法如 TIES、TSV-M 和 Iso-C/CTS 在聚合规则上存在差异,但几乎对所有层采取统一处理。这一假设忽视了大型视觉变换器中的强层级异质性,其中浅层对干扰敏感,而深层则编码了稳定的任务特征。我们提出了 LARV,一种无训练、无数据、与合并无关的层级自适应重标定贴层,它可以插入任何任务向量合并器,并在聚合之前为每个任务向量分配每层的缩放比例,且我们展示了它在不同的合并规则中始终能够提升性能。LARV 通过简单的确定性调度自适应地抑制浅层干扰并增强深层对齐,且无需对现有合并器进行重新训练或修改。据我们所知,这是首次针对任务向量合并进行层级感知缩放的研究。LARV 计算简单的无数据层代理,并通过轻量级规则将其转化为缩放比例;我们在一个框架内研究了几种实例(例如,使用固定值的分层两级/三级缩放,或连续映射),并表明分层选择提供了最佳的鲁棒性,而连续映射仍然处于消融状态。LARV 与基础合并器正交,增加的成本微乎其微。在 FusionBench 上使用视觉变换器时,LARV 在 8/14/20 任务设置中始终改善所有任务向量基线;例如,Iso-C + LARV 在 ViT-B/32 上达到了 85.9%,在 ViT-B/16 上达到了 89.2%,在 ViT-L/14 上达到了 92.6%。层级分析和腐蚀测试进一步表明,LARV 抑制了浅层干扰,同时适度增强了深层的任务稳定特征,使模型合并成为一种稳健的层级感知过程,而非统一的过程。
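The per-layer rescaling described above can be sketched in a few lines. This is an illustration under stated assumptions, not the LARV implementation: the abstract does not give the layer proxies or the tier values, so the depth-based two-level rule (`tiered_scale`, with hypothetical values 0.8 and 1.1) and the plain summation merger (`merge_by_sum`) are stand-ins chosen for clarity.

```python
def tiered_scale(layer, num_layers, shallow=0.8, deep=1.1):
    """Hypothetical two-level rule: damp the first half of the layers, boost the rest."""
    return shallow if layer < num_layers // 2 else deep


def rescale_task_vector(task_vector, num_layers, scale_fn=tiered_scale):
    """Apply one scale per layer to a task vector given as {layer_index: [deltas]}."""
    return {
        layer: [scale_fn(layer, num_layers) * d for d in deltas]
        for layer, deltas in task_vector.items()
    }


def merge_by_sum(pretrained, task_vectors, alpha=1.0):
    """Stand-in base merger: add the summed task vectors to the pretrained weights."""
    merged = {}
    for layer, weights in pretrained.items():
        total = [0.0] * len(weights)
        for tv in task_vectors:
            for i, d in enumerate(tv[layer]):
                total[i] += d
        merged[layer] = [w + alpha * t for w, t in zip(weights, total)]
    return merged


# Toy model with two layers and one weight per layer.
pretrained = {0: [0.0], 1: [0.0]}
task_vectors = [{0: [1.0], 1: [1.0]}]
scaled = [rescale_task_vector(tv, num_layers=2) for tv in task_vectors]
merged = merge_by_sum(pretrained, scaled)
```

Because the rescaling happens before aggregation and leaves the merger untouched, any other merging rule could replace `merge_by_sum` here, which matches the merger-agnostic framing in the abstract.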
cs.CV / 21 / 2602.09415
Stability and Concentration in Nonlinear Inverse Problems with Block-Structured Parameters: Lipschitz Geometry, Identifiability, and an Application to Gaussian Splatting
具有块结构参数的非线性逆问题中的稳定性与集中性:Lipschitz几何、可识别性及其在高斯溅射中的应用
Abstract
We develop an operator-theoretic framework for stability and statistical concentration in nonlinear inverse problems with block-structured parameters. Under a unified set of assumptions combining blockwise Lipschitz geometry, local identifiability, and sub-Gaussian noise, we establish deterministic stability inequalities, global Lipschitz bounds for least-squares misfit functionals, and nonasymptotic concentration estimates. These results yield high-probability parameter error bounds that are intrinsic to the forward operator and independent of any specific reconstruction algorithm. As a concrete instantiation, we verify that the Gaussian Splatting rendering operator satisfies the proposed assumptions and derive explicit constants governing its Lipschitz continuity and resolution-dependent observability. This leads to a fundamental stability--resolution tradeoff, showing that estimation error is inherently constrained by the ratio between image resolution and model complexity. Overall, the analysis characterizes operator-level limits for a broad class of high-dimensional nonlinear inverse problems arising in modern imaging and differentiable rendering.
Chinese Translation
我们开发了一个算子理论框架,用于研究具有块结构参数的非线性逆问题中的稳定性和统计集中性。在结合块状Lipschitz几何、局部可识别性和亚高斯噪声的统一假设下,我们建立了确定性稳定性不等式、最小二乘失配泛函的全局Lipschitz界限以及非渐近集中估计。这些结果提供了与前向算子内在相关的高概率参数误差界限,并且与任何特定重建算法无关。作为具体实例,我们验证了高斯溅射渲染算子满足所提出的假设,并推导出控制其Lipschitz连续性和分辨率相关可观测性的显式常数。这导致了一个基本的稳定性-分辨率权衡,表明估计误差本质上受到图像分辨率与模型复杂度之间比率的限制。总体而言,该分析为现代成像和可微渲染中出现的广泛高维非线性逆问题特征化了算子级限制。
cs.CV / 22 / 2602.09425
Bridging the Modality Gap in Roadside LiDAR: A Training-Free Vision-Language Model Framework for Vehicle Classification
弥合路边LiDAR中的模态差距:一种无训练的视觉-语言模型框架用于车辆分类
Abstract
Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by a modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our new depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a "Semantic Anchor" effect: text-based guidance regularizes performance in ultra-low-shot regimes ($k < 4$), but degrades accuracy in higher-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a Cold Start strategy, using VLM-generated labels to bootstrap lightweight supervised models. Notably, the few-shot VLM-based model achieves a correct classification rate of over 75 percent for specific drayage categories (20ft, 40ft, and 53ft containers) entirely without costly training or fine-tuning, significantly reducing the demands of initial manual labeling and making the method practical for ITS applications.
Chinese Translation
细粒度卡车分类对智能交通系统(ITS)至关重要,但当前基于LiDAR的方法由于依赖于监督深度学习和劳动密集型手动标注,面临可扩展性挑战。视觉-语言模型(VLMs)在少量样本泛化方面表现出色,但由于稀疏的3D点云与密集的2D图像之间的模态差距,其在路边LiDAR中的应用受到限制。我们提出了一种框架,通过适配现成的VLMs,实现无参数微调的细粒度卡车分类,从而弥合这一差距。我们新的深度感知图像生成管道应用了噪声去除、空间和时间配准、方向校正、形态学操作和各向异性平滑,将稀疏、遮挡的LiDAR扫描转换为深度编码的2D视觉代理。在一个包含20种车辆类别的真实世界数据集上验证,我们的方法在每个类别仅需16-30个示例即可实现具有竞争力的分类准确率,提供了一种可扩展的替代方案,优于数据密集型的监督基线。我们进一步观察到“语义锚点”效应:基于文本的指导在超低样本条件下($k < 4$)规范化性能,但在更多样本设置中由于语义不匹配而降低准确性。此外,我们展示了该框架作为冷启动策略的有效性,利用VLM生成的标签来引导轻量级监督模型的构建。值得注意的是,基于少量样本的VLM模型在特定的拖运类别(20英尺、40英尺和53英尺集装箱)中实现了超过75%的正确分类率,完全不需要昂贵的训练或微调,显著减少了初始手动标注的密集需求,从而在ITS应用中实现了一种实用的方法。
cs.CV / 23 / 2602.09432
SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL
SceneReVis:一种基于视觉的自我反思框架,通过多轮强化学习实现3D室内场景合成
Abstract
Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative "diagnose-and-act" loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
Chinese Translation
当前的一次性3D场景合成方法常常因缺乏深思熟虑的推理而遭遇空间幻觉问题,例如碰撞。为了解决这一问题,我们提出了SceneReVis,这是一种基于视觉的自我反思框架,采用迭代的“诊断与行动”循环,通过多模态反馈明确拦截和解决空间冲突。为了支持这种逐步的范式,我们构建了SceneChain-12k,这是一个通过新颖的逆向工程流程获得的大规模因果构建轨迹数据集。我们进一步提出了一种两阶段训练方案,从监督微调过渡到自主强化学习,使模型演变为一个主动的空间规划者。大量实验表明,SceneReVis在高保真生成和目标导向优化方面达到了最先进的性能,并在长尾领域展现出强大的泛化能力。
cs.CV / 24 / 2602.09439
Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning
Fine-T2I:一个开放、大规模且多样化的高质量文本到图像(T2I)微调数据集
Abstract
High-quality and open datasets remain a major bottleneck for text-to-image (T2I) fine-tuning. Despite rapid progress in model architectures and training pipelines, most publicly available fine-tuning datasets suffer from low resolution, poor text-image alignment, or limited diversity, resulting in a clear performance gap between open research models and enterprise-grade models. In this work, we present Fine-T2I, a large-scale, high-quality, and fully open dataset for T2I fine-tuning. Fine-T2I spans 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates, and combines synthetic images generated by strong modern models with carefully curated real images from professional photographers. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality, with over 95% of initial candidates removed. The final dataset contains over 6 million text-image pairs, around 2 TB on disk, approaching the scale of pretraining datasets while maintaining fine-tuning-level quality. Across a diverse set of pretrained diffusion and autoregressive models, fine-tuning on Fine-T2I consistently improves both generation quality and instruction adherence, as validated by human evaluation, visual comparison, and automatic metrics. We release Fine-T2I under an open license to help close the data gap in T2I fine-tuning in the open community.
Chinese Translation
高质量和开放的数据集仍然是文本到图像(T2I)微调的主要瓶颈。尽管模型架构和训练流程迅速发展,但大多数公开可用的微调数据集存在低分辨率、文本与图像对齐差或多样性有限的问题,导致开放研究模型与企业级模型之间存在明显的性能差距。在本研究中,我们提出了Fine-T2I,一个大规模、高质量且完全开放的T2I微调数据集。Fine-T2I涵盖10个任务组合、32个提示类别、11种视觉风格和5种提示模板,并结合了由强大的现代模型生成的合成图像和来自专业摄影师的精心策划的真实图像。所有样本都经过严格筛选,以确保文本与图像的对齐、视觉保真度和提示质量,初始候选样本中超过95%被剔除。最终数据集包含超过600万个文本-图像对,磁盘空间约为2 TB,接近预训练数据集的规模,同时保持微调级别的质量。在一系列多样化的预训练扩散模型和自回归模型上,Fine-T2I的微调始终改善了生成质量和指令遵循性,这通过人工评估、视觉比较和自动指标得到了验证。我们以开放许可证发布Fine-T2I,以帮助缩小开放社区中T2I微调的数据差距。
cs.CV / 25 / 2602.09446
A Scoping Review of Deep Learning for Urban Visual Pollution and Proposal of a Real-Time Monitoring Framework with a Visual Pollution Index
深度学习在城市视觉污染中的应用范围评估及实时监测框架的提案,包含视觉污染指数
Abstract
Urban Visual Pollution (UVP) has emerged as a critical concern, yet research on automatic detection and application remains fragmented. This scoping review maps the existing deep learning-based approaches for detecting, classifying, and designing a comprehensive application framework for visual pollution management. Following the PRISMA-ScR guidelines, seven academic databases (Scopus, Web of Science, IEEE Xplore, ACM DL, ScienceDirect, SpringerNatureLink, and Wiley) were systematically searched and reviewed, and 26 articles were found. Most research focuses on specific pollutant categories and employs variations of YOLO, Faster R-CNN, and EfficientDet architectures. Although several datasets exist, they are limited to specific areas and lack standardized taxonomies. Few studies integrate detection into real-time application systems, and those that do tend to be geographically skewed. We propose a framework for monitoring visual pollution that integrates a visual pollution index to assess the severity of visual pollution in a given area. This review highlights the need for a unified UVP management system that incorporates pollutant taxonomy, a cross-city benchmark dataset, a generalized deep learning model, and an assessment index that supports sustainable urban aesthetics and enhances the well-being of urban dwellers.
Chinese Translation
城市视觉污染(Urban Visual Pollution, UVP)已成为一个重要问题,但关于自动检测和应用的研究仍然相对零散。本次范围评估回顾了现有基于深度学习的方法,用于检测、分类以及设计一个全面的视觉污染管理应用框架。根据PRISMA-ScR指南,系统性地搜索和审查了七个学术数据库(Scopus、Web of Science、IEEE Xplore、ACM DL、ScienceDirect、SpringerNatureLink和Wiley),共发现26篇文章。大多数研究集中于特定污染物类别,并采用YOLO、Faster R-CNN和EfficientDet架构的变体。尽管存在多个数据集,但它们仅限于特定区域,且缺乏标准化的分类法。很少有研究将检测整合到实时应用系统中,且这些研究往往存在地理偏倚。我们提出了一种监测视觉污染的框架,该框架整合了视觉污染指数,以评估特定区域视觉污染的严重程度。本次评审强调了建立一个统一的UVP管理系统的必要性,该系统应包含污染物分类法、跨城市基准数据集、通用深度学习模型以及支持可持续城市美学和提升城市居民福祉的评估指数。
cs.CV / 26 / 2602.09449
Look-Ahead and Look-Back Flows: Training-Free Image Generation with Trajectory Smoothing
前瞻与回顾流:无训练图像生成与轨迹平滑
Abstract
Recent advances have reformulated diffusion models as deterministic ordinary differential equations (ODEs) through the framework of flow matching, providing a unified formulation for the noise-to-data generative process. Various training-free flow matching approaches have been developed to improve image generation through flow velocity field adjustment, eliminating the need for costly retraining. However, modifying the velocity field $v$ introduces errors that propagate through the full generation path, whereas adjustments to the latent trajectory $z$ are naturally corrected by the pretrained velocity network, reducing error accumulation. In this paper, we propose two complementary training-free latent-trajectory adjustment approaches based on future and past velocity $v$ and latent trajectory $z$ information that refine the generative path directly in latent space. We propose two training-free trajectory smoothing schemes: Look-Ahead, which averages the current and next-step latents using a curvature-gated weight, and Look-Back, which smooths latents using an exponential moving average with decay. We demonstrate through extensive experiments and comprehensive evaluation metrics that the proposed training-free trajectory smoothing models substantially outperform various state-of-the-art models across multiple datasets including COCO17, CUB-200, and Flickr30K.
Chinese Translation
最近的进展通过流匹配框架将扩散模型重新表述为确定性常微分方程(ODE),为噪声到数据的生成过程提供了统一的表述。已经开发出多种无训练的流匹配方法,通过调整流速场来改善图像生成,消除了昂贵的重新训练需求。然而,修改速度场 $v$ 会引入错误,这些错误会在整个生成路径中传播,而对潜在轨迹 $z$ 的调整则自然会被预训练的速度网络纠正,从而减少错误积累。在本文中,我们提出了两种基于未来和过去速度 $v$ 以及潜在轨迹 $z$ 信息的互补无训练潜在轨迹调整方法,直接在潜在空间中细化生成路径。我们提出了两种无训练的轨迹平滑方案:Look-Ahead,它使用曲率门控权重对当前和下一步潜在进行平均;以及 Look-Back,它使用带衰减的指数移动平均来平滑潜在。我们通过广泛的实验和全面的评估指标证明,所提出的无训练轨迹平滑模型在多个数据集(包括 COCO17、CUB-200 和 Flickr30K)上显著优于各种最先进的模型。
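The two smoothing schemes named above can be sketched on toy latents represented as lists of floats. This is a minimal illustration, not the paper's method: the abstract does not define the curvature gate, so the sketch assumes a hypothetical gate equal to the clipped cosine similarity between the current and next-step velocities, and the parameter names (`decay`, `w_max`, `dt`) are placeholders.

```python
def look_back(latents, decay=0.9):
    """Smooth a latent trajectory with an exponential moving average over past steps."""
    smoothed = [list(latents[0])]
    for z in latents[1:]:
        prev = smoothed[-1]
        smoothed.append([decay * p + (1.0 - decay) * x for p, x in zip(prev, z)])
    return smoothed


def cosine(u, v):
    """Cosine similarity, defined as 1.0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 1.0


def look_ahead(z, v_now, v_next, dt, w_max=0.5):
    """Blend the current latent with its Euler next-step latent.

    The blend weight is gated by how well v_now and v_next agree (assumed gate):
    a sharp turn in the trajectory switches the look-ahead off.
    """
    z_next = [a + dt * b for a, b in zip(z, v_now)]  # plain Euler step
    weight = w_max * max(0.0, cosine(v_now, v_next))
    return [(1.0 - weight) * a + weight * b for a, b in zip(z, z_next)]


straight = look_ahead([0.0], v_now=[1.0], v_next=[1.0], dt=1.0)    # velocities agree
reversed_ = look_ahead([0.0], v_now=[1.0], v_next=[-1.0], dt=1.0)  # velocities oppose
history = look_back([[0.0], [1.0], [1.0]], decay=0.5)
```

In a real sampler `v_now` and `v_next` would come from the pretrained velocity network evaluated at the current and look-ahead latents. Both functions operate only on latents, so the network itself is never modified.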
cs.CV / 27 / 2602.09475
ArtifactLens: Hundreds of Labels Are Enough for Artifact Detection with VLMs
ArtifactLens:数百个标签足以利用 VLM 进行伪影检测
Abstract
Modern image generators produce strikingly realistic images, where only artifacts like distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine-tune VLMs on tens of thousands of labeled images, but this is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts: with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state-of-the-art results on five human artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi-component architecture with in-context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types (object morphology, animal anatomy, and entity interactions) and to the distinct task of AIGC detection.
Chinese Translation
现代图像生成器能够生成极为逼真的图像,只有伪影如扭曲的手或变形的物体揭示了它们的合成来源。检测这些伪影至关重要:如果不进行检测,我们无法对生成器进行基准测试或训练奖励模型以改进它们。目前的检测器在数万张标记图像上微调 VLM,但每当生成器演变或出现新伪影类型时,这种过程的重复成本非常高。我们展示了预训练的 VLM 已经编码了检测伪影所需的知识——通过适当的支架,这种能力可以仅通过每个伪影类别几百个标记示例来解锁。我们的系统 ArtifactLens 在五个人工伪影基准测试中达到了最先进的水平(这是跨多个数据集的首次评估),同时所需的标记数据量少了几个数量级。支架由一个多组件架构组成,结合了上下文学习和文本指令优化,并对每个部分进行了新颖的改进。我们的方法可以推广到其他伪影类型——物体形态、动物解剖和实体交互——以及 AIGC 检测这一独特任务。
cs.CV / 28 / 2602.09476
FD-DB: Frequency-Decoupled Dual-Branch Network for Unpaired Synthetic-to-Real Domain Translation
FD-DB:用于无配对合成到真实领域转换的频率解耦双分支网络
Abstract
Synthetic data provide low-cost, accurately annotated samples for geometry-sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift and degrade downstream performance. Unpaired synthetic-to-real translation can reduce this gap without paired supervision, yet existing methods often face a trade-off between photorealism and structural stability: unconstrained generation may introduce deformation or spurious textures, while overly rigid constraints limit adaptation to real-domain statistics. We propose FD-DB, a frequency-decoupled dual-branch model that separates appearance transfer into low-frequency interpretable editing and high-frequency residual compensation. The interpretable branch predicts physically meaningful editing parameters (white balance, exposure, contrast, saturation, blur, and grain) to build a stable low-frequency appearance base with strong content preservation. The free branch complements fine details through residual generation, and a gated fusion mechanism combines the two branches under explicit frequency constraints to limit low-frequency drift. We further adopt a two-stage training schedule that first stabilizes the editing branch and then releases the residual branch to improve optimization stability. Experiments on the YCB-V dataset show that FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.
Chinese Translation
合成数据为几何敏感的视觉任务提供了低成本、准确标注的样本,但合成领域与真实领域之间的外观和成像差异导致了严重的领域偏移,并降低了下游性能。无配对的合成到真实转换可以在没有配对监督的情况下缩小这一差距,然而现有方法往往面临着照片真实感与结构稳定性之间的权衡:不受约束的生成可能引入变形或虚假纹理,而过于严格的约束则限制了对真实领域统计特征的适应。我们提出了FD-DB,一种频率解耦的双分支模型,将外观转换分为低频可解释编辑和高频残差补偿。可解释分支预测物理上有意义的编辑参数(白平衡、曝光、对比度、饱和度、模糊和颗粒),以构建一个稳定的低频外观基础,并强力保留内容。自由分支通过残差生成补充细节,而一个门控融合机制在明确的频率约束下结合两个分支,以限制低频漂移。我们进一步采用了两阶段训练计划,首先稳定编辑分支,然后释放残差分支以提高优化稳定性。在YCB-V数据集上的实验表明,FD-DB提高了真实领域的外观一致性,并显著提升了下游语义分割性能,同时保留了几何和语义结构。
cs.CV / 29 / 2602.09477
Weakly Supervised Contrastive Learning for Histopathology Patch Embeddings
用于组织病理切片嵌入的弱监督对比学习
Abstract
Digital histopathology whole slide images (WSIs) provide gigapixel-scale high-resolution images that are highly useful for disease diagnosis. However, digital histopathology image analysis faces significant challenges due to the limited training labels, since manually annotating specific regions or small patches cropped from large WSIs requires substantial time and effort. Weakly supervised multiple instance learning (MIL) offers a practical and efficient solution by requiring only bag-level (slide-level) labels, while each bag typically contains multiple instances (patches). Most MIL methods directly use frozen image patch features generated by various image encoders as inputs and primarily focus on feature aggregation. However, feature representation learning for encoder pretraining in MIL settings has largely been neglected. In our work, we propose a novel feature representation learning framework called weakly supervised contrastive learning (WeakSupCon) that incorporates bag-level label information during training. Our method does not rely on instance-level pseudo-labeling, yet it effectively separates patches with different labels in the feature space. Experimental results demonstrate that the image features generated by our WeakSupCon method lead to improved downstream MIL performance compared to self-supervised contrastive learning approaches in three datasets. Our related code is available at github.com/BzhangURU/Paper_WeakSupCon_for_MIL
Chinese Translation
数字组织病理全切片图像(WSIs)提供了千兆像素级高分辨率图像,这些图像在疾病诊断中具有重要价值。然而,由于训练标签的有限性,数字组织病理图像分析面临重大挑战,因为手动标注从大WSIs裁剪出的特定区域或小切片需要大量时间和精力。弱监督多实例学习(MIL)提供了一种实用且高效的解决方案,仅需袋级(切片级)标签,而每个袋通常包含多个实例(切片)。大多数MIL方法直接使用由各种图像编码器生成的冻结图像切片特征作为输入,并主要关注特征聚合。然而,在MIL环境中,特征表示学习的编码器预训练在很大程度上被忽视。在我们的工作中,我们提出了一种新颖的特征表示学习框架,称为弱监督对比学习(WeakSupCon),该框架在训练过程中结合了袋级标签信息。我们的方法不依赖于实例级伪标签,但能够有效地在特征空间中分离具有不同标签的切片。实验结果表明,与自监督对比学习方法相比,我们的WeakSupCon方法生成的图像特征在三个数据集上显著提高了下游MIL性能。我们的相关代码可在github.com/BzhangURU/Paper_WeakSupCon_for_MIL获取。
cs.CV / 30 / 2602.09483
Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions
超越下一个标记对齐:通过标记交互提炼多模态大型语言模型
Abstract
Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions, which embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA enables the student model to imitate the teacher's instruction-relevant visual information extraction capability by aligning on salient visual regions, and TPA captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves a $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at https://github.com/lchen1019/Align-TI.
Chinese Translation
多模态大型语言模型(MLLMs)展现了令人印象深刻的跨模态能力,但其庞大的体积带来了显著的部署挑战。知识蒸馏(KD)是一种压缩这些模型的有前景的解决方案,但现有方法主要依赖于静态的下一个标记对齐,忽视了动态标记交互,这些交互蕴含了多模态理解和生成的基本能力。为此,我们提出了Align-TI,一个从标记交互的角度设计的新型KD框架。我们的方法受到以下洞察的启发:MLLMs依赖于两种主要交互:视觉-指令标记交互以提取相关的视觉信息,以及内部响应标记交互以实现连贯生成。因此,Align-TI引入了两个组件:IVA使学生模型能够通过对显著视觉区域的对齐,模仿教师在指令相关视觉信息提取能力上的表现。TPA通过对齐序列标记到标记的转移概率,捕捉教师的动态生成逻辑。大量实验表明Align-TI的优越性。值得注意的是,我们的方法相较于传统的Vanilla KD实现了$2.6\%$的相对提升,而我们蒸馏的Align-TI-2B甚至比LLaVA-1.5-7B(一个更大的MLLM)高出$7.0\%$,建立了一个新的最先进的蒸馏框架,用于训练参数高效的MLLMs。代码可在https://github.com/lchen1019/Align-TI获取。
cs.CV / 31 / 2602.09494
OSI: One-step Inversion Excels in Extracting Diffusion Watermarks
OSI:一步反演在提取扩散水印中的卓越表现
Abstract
Watermarking is an important mechanism for provenance and copyright protection of diffusion-generated images. Training-free methods, exemplified by Gaussian Shading, embed watermarks into the initial noise of diffusion models with negligible impact on the quality of generated images. However, extracting this type of watermark typically requires multi-step diffusion inversion to obtain precise initial noise, which is computationally expensive and time-consuming. To address this issue, we propose One-step Inversion (OSI), a significantly faster and more accurate method for extracting Gaussian Shading style watermarks. OSI reformulates watermark extraction as a learnable sign classification problem, which eliminates the need for precise regression of the initial noise. Then, we initialize the OSI model from the diffusion backbone and finetune it on synthesized noise-image pairs with a sign classification objective. In this manner, the OSI model is able to accomplish the watermark extraction efficiently in only one step. Our OSI substantially outperforms the multi-step diffusion inversion method: it is 20x faster, achieves higher extraction accuracy, and doubles the watermark payload capacity. Extensive experiments across diverse schedulers, diffusion backbones, and cryptographic schemes consistently show improvements, demonstrating the generality of our OSI framework.
Chinese Translation
水印是一种重要的机制,用于扩散生成图像的来源和版权保护。以高斯阴影为例的无训练方法,将水印嵌入扩散模型的初始噪声中,对生成图像的质量影响微乎其微。然而,提取这种类型的水印通常需要多步扩散反演以获得精确的初始噪声,这在计算上成本高且耗时。为了解决这个问题,我们提出了一种一步反演(One-step Inversion, OSI)的方法,这是一种显著更快且更准确的提取高斯阴影风格水印的方法。OSI将水印提取重新表述为一个可学习的符号分类问题,从而消除了对初始噪声精确回归的需求。然后,我们从扩散主干网络初始化OSI模型,并在合成的噪声-图像对上进行微调,以符号分类为目标。通过这种方式,OSI模型能够在仅一步内高效地完成水印提取。我们的OSI在性能上显著优于多步扩散反演方法:速度提高20倍,提取准确率更高,水印容量翻倍。针对不同调度器、扩散主干网络和加密方案的广泛实验一致显示出改进,证明了我们OSI框架的普适性。
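The reformulation above, from regressing the initial noise to classifying its signs, can be illustrated with a toy example. The sketch assumes a simplified scheme in which each watermark bit is carried by the sign of one initial-noise element; the actual Gaussian Shading pipeline may add further steps (such as encryption or bit repetition) that are not modeled here, and the function names are hypothetical.

```python
def bits_from_signs(values):
    """Read one bit per element: 1 for a positive value, 0 otherwise."""
    return [1 if v > 0 else 0 for v in values]


def bit_accuracy(predicted, reference):
    """Fraction of watermark bits recovered correctly."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def mean_squared_error(estimate, target):
    """Regression error between an estimated and a true noise vector."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


# True initial noise and a coarse estimate that only gets the signs right.
true_noise = [0.7, -1.2, 0.1, -0.4]
coarse_estimate = [0.2, -0.3, 0.9, -2.0]

embedded_bits = bits_from_signs(true_noise)
recovered_bits = bits_from_signs(coarse_estimate)
accuracy = bit_accuracy(recovered_bits, embedded_bits)
error = mean_squared_error(coarse_estimate, true_noise)
```

The estimate is far from the true noise in the regression sense, yet every bit is recovered, which is the intuition for why a one-step sign classifier could suffice where precise multi-step inversion would otherwise be needed.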
cs.CV / 32 / 2602.09506
Equilibrium contrastive learning for imbalanced image classification
用于不平衡图像分类的平衡对比学习
Abstract
Contrastive learning (CL) is a predominant technique in image classification, but it shows limited performance on imbalanced datasets. Recently, several supervised CL methods have been proposed to promote an ideal regular simplex geometric configuration in the representation space, characterized by intra-class feature collapse and uniform inter-class mean spacing, especially for imbalanced datasets. In particular, existing prototype-based methods include class prototypes as additional samples so that all classes are considered. However, the existing CL methods suffer from two limitations. First, they do not consider the alignment between the class means/prototypes and classifiers, which could lead to poor generalization. Second, existing prototype-based methods treat prototypes as only one additional sample per class, making their influence depend on the number of class instances in a batch and causing unbalanced contributions across classes. To address these limitations, we propose Equilibrium Contrastive Learning (ECL), a supervised CL framework designed to promote geometric equilibrium, where class features, means, and classifiers are harmoniously balanced under data imbalance. The proposed ECL framework uses two main components. First, ECL promotes the representation geometric equilibrium (i.e., a regular simplex geometry characterized by collapsed class samples and uniformly distributed class means), while balancing the contributions of class-average features and class prototypes. Second, ECL establishes a classifier-class center geometric equilibrium by aligning classifier weights and class prototypes. We ran experiments on three long-tailed datasets (CIFAR-10(0)-LT and ImageNet-LT) and two imbalanced medical datasets (ISIC 2019 and our constructed LCCT dataset). Results show that ECL outperforms existing SOTA supervised CL methods designed for imbalanced classification.
Chinese Translation
对比学习(Contrastive Learning, CL)是图像分类中的一种主要技术,但在不平衡数据集上表现有限。最近,提出了几种监督对比学习方法,以促进表示空间中的理想正则单纯形几何配置——特征类内崩溃和类间均匀均值间距,特别是针对不平衡数据集。具体而言,现有的基于原型的方法包括类原型,作为额外样本以考虑所有类。然而,现有的对比学习方法存在两个局限性。首先,它们未考虑类均值/原型与分类器之间的对齐,这可能导致较差的泛化能力。其次,现有的基于原型的方法将原型视为每个类仅一个额外样本,使其影响依赖于批次中类实例的数量,导致各类之间的贡献不平衡。为了解决这些局限性,我们提出了平衡对比学习(Equilibrium Contrastive Learning, ECL),这是一种监督对比学习框架,旨在促进几何平衡,在数据不平衡的情况下,类特征、均值和分类器和谐平衡。所提出的ECL框架使用两个主要组件。首先,ECL促进表示几何平衡(即,特征类样本崩溃和类均值均匀分布所特征的正则单纯形几何),同时平衡类平均特征和类原型的贡献。其次,ECL通过对齐分类器权重和类原型建立分类器-类中心几何平衡。我们在三个长尾数据集上进行了实验,包括CIFAR-10(0)-LT、ImageNet-LT,以及两个不平衡医学数据集ISIC 2019和我们构建的LCCT数据集。结果表明,ECL在不平衡分类的现有最先进的监督对比学习方法中表现优越。
cs.CV / 33 / 2602.09510
Robust Depth Super-Resolution via Adaptive Diffusion Sampling
通过自适应扩散采样实现鲁棒深度超分辨率
Abstract
We propose AdaDS, a generalizable framework for depth super-resolution that robustly recovers high-resolution depth maps from arbitrarily degraded low-resolution inputs. Unlike conventional approaches that directly regress depth values and often exhibit artifacts under severe or unknown degradation, AdaDS capitalizes on the contraction property of Gaussian smoothing: as noise accumulates in the forward process, distributional discrepancies between degraded inputs and their pristine high-quality counterparts diminish, ultimately converging to an isotropic Gaussian prior. Leveraging this, AdaDS adaptively selects a starting timestep in the reverse diffusion trajectory based on estimated refinement uncertainty, and subsequently injects tailored noise to position the intermediate sample within the high-probability region of the target posterior distribution. This strategy ensures inherent robustness, enabling the generative prior of a pre-trained diffusion model to dominate recovery even when upstream estimations are imperfect. Extensive experiments on real-world and synthetic benchmarks demonstrate AdaDS's superior zero-shot generalization and resilience to diverse degradation patterns compared to state-of-the-art methods.
Chinese Translation
我们提出了AdaDS,这是一个可推广的深度超分辨率框架,能够从任意降级的低分辨率输入中稳健地恢复高分辨率深度图。与直接回归深度值并且在严重或未知降级下常常出现伪影的传统方法不同,AdaDS利用了高斯平滑的收缩特性:随着噪声在前向过程中积累,降级输入与其原始高质量对应物之间的分布差异减小,最终收敛到各向同性的高斯先验。基于此,AdaDS根据估计的细化不确定性自适应地选择反向扩散轨迹中的起始时间步,并随后注入定制的噪声,以将中间样本定位在目标后验分布的高概率区域内。这一策略确保了固有的鲁棒性,使得预训练扩散模型的生成先验在上游估计不完美的情况下仍能主导恢复。对真实世界和合成基准的广泛实验表明,与最先进的方法相比,AdaDS在零-shot泛化和对多样降级模式的韧性方面表现优越。
cs.CV / 34 / 2602.09515
Energy-Efficient Fast Object Detection on Edge Devices for IoT Systems
面向物联网系统的边缘设备能效快速目标检测
Abstract
This paper presents an Internet of Things (IoT) application that utilizes an AI classifier for fast-object detection using the frame difference method. This method, with its shorter duration, is the most efficient and suitable for fast-object detection in IoT systems, which require energy-efficient applications compared to end-to-end methods. We have implemented this technique on three edge devices (AMD Alveo™ U50, Jetson Orin Nano, and Hailo-8™ AI Accelerator) with four models spanning artificial neural networks and transformer models. We examined various classes, including birds, cars, trains, and airplanes. Using the frame difference method, the MobileNet model consistently has high accuracy, low latency, and is highly energy-efficient. YOLOX consistently shows the lowest accuracy, lowest latency, and lowest efficiency. The experimental results show that, compared to the end-to-end method, the proposed algorithm yields an average accuracy gain of 28.314%, an average efficiency gain of 3.6 times, and an average latency reduction of 39.305%. Of all these classes, the faster objects are trains and airplanes. Experiments show that the accuracy percentage for trains and airplanes is lower than for other categories. So, in tasks that require fast detection and accurate results, end-to-end methods can be a disaster because they cannot handle fast object detection. To improve computational efficiency, we designed our proposed method as a lightweight detection algorithm. It is well suited for applications in IoT systems, especially those that require fast-moving object detection and higher accuracy.
Chinese Translation
本文提出了一种物联网(IoT)应用,利用AI分类器通过帧差法进行快速目标检测。该方法因其较短的持续时间,成为在物联网系统中进行快速目标检测的最有效和最适合的选择,相较于端到端方法,它更符合能效应用的需求。我们在三种边缘设备上实现了该技术:AMD Alveo™ U50、Jetson Orin Nano和Hailo-8™ AI加速器,以及四种使用人工神经网络和变换器模型的模型。我们考察了多种类别,包括鸟类、汽车、火车和飞机。使用帧差法,MobileNet模型始终保持高准确率、低延迟,并且能效极高。而YOLOX模型则始终表现出最低的准确率、最低的延迟和最低的能效。实验结果表明,所提出的算法相比于端到端方法,平均准确率提高了28.314%,平均能效提升了3.6倍,平均延迟减少了39.305%。在所有类别中,火车和飞机是移动速度较快的目标。实验显示,火车和飞机的准确率低于其他类别。因此,在需要快速检测和准确结果的任务中,端到端方法可能会造成灾难,因为它们无法处理快速目标检测。为了提高计算效率,我们将所提出的方法设计为轻量级检测算法,非常适合于物联网系统中的应用,尤其是那些需要快速移动目标检测和更高准确率的应用。
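The frame difference idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the threshold value of 25, and the tiny synthetic frames are assumptions chosen only to show how differencing two consecutive grayscale frames can isolate a moving region that a classifier would then examine.

```python
def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Mark pixels whose intensity changed by more than `threshold` (assumed value)."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def motion_bounding_box(mask):
    """Return (top, left, bottom, right) of changed pixels, or None if nothing changed."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# Two tiny 4x4 grayscale frames; a bright 2x2 object appears in the second one.
prev_frame = [[10] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
for r in (1, 2):
    for c in (2, 3):
        curr_frame[r][c] = 200

mask = frame_difference_mask(prev_frame, curr_frame)
box = motion_bounding_box(mask)  # region that would be cropped and passed to a classifier
```

Only the cropped region needs to go through the classifier, which is one plausible reason such a scheme can cost less than running a full end-to-end detector on every frame.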
cs.CV / 35 / 2602.09518
A Universal Action Space for General Behavior Analysis
通用行为分析的行动空间
Abstract
Analyzing animal and human behavior has long been a challenging task in computer vision. Early approaches from the 1970s to the 1990s relied on hand-crafted edge detection, segmentation, and low-level features such as color, shape, and texture to locate objects and infer their identities, an inherently ill-posed problem. Behavior analysis in this era typically proceeded by tracking identified objects over time and modeling their trajectories using sparse feature points, which further limited robustness and generalization. A major shift occurred with the introduction of ImageNet by Deng and Li in 2010, which enabled large-scale visual recognition through deep neural networks and effectively served as a comprehensive visual dictionary. This development allowed object recognition to move beyond complex low-level processing toward learned high-level representations. In this work, we follow this paradigm to build a large-scale Universal Action Space (UAS) using existing labeled human-action datasets. We then use this UAS as the foundation for analyzing and categorizing mammalian and chimpanzee behavior datasets. The source code is released on GitHub at https://github.com/franktpmvu/Universal-Action-Space.
Chinese Translation
分析动物和人类行为长期以来一直是计算机视觉中的一项挑战性任务。从1970年代到1990年代的早期方法依赖于手工制作的边缘检测、分割以及颜色、形状和纹理等低级特征来定位物体并推断其身份,这本质上是一个不适定的问题。在这一时期,行为分析通常通过跟踪已识别的物体随时间的变化,并使用稀疏特征点建模其轨迹,这进一步限制了鲁棒性和泛化能力。2010年,Deng和Li引入的ImageNet带来了重大转变,使得通过深度神经网络实现大规模视觉识别成为可能,并有效地充当了一个全面的视觉词典。这一发展使得物体识别从复杂的低级处理转向学习的高级表示。在本研究中,我们遵循这一范式,利用现有的标注人类行为数据集构建一个大规模的通用行动空间(Universal Action Space, UAS)。然后,我们将该UAS作为分析和分类哺乳动物和黑猩猩行为数据集的基础。源代码已在GitHub上发布,链接为 https://github.com/franktpmvu/Universal-Action-Space。
cs.CV / 36 / 2602.09521
Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs
关注细节,逻辑值与真实:视觉感知注意力与逻辑值增强以减轻大型视觉语言模型中的幻觉
Abstract
Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention. These methods have a limitation: boosting attention for all visual tokens inevitably increases attention to task-irrelevant tokens. To tackle this challenge, we propose a training-free attentional intervention algorithm that enhances the attention of task-relevant tokens, based on the argument that task-relevant tokens generally demonstrate high visual-textual similarities. Specifically, the vision-text cross-attention submatrices, which represent visual-textual correlations, are extracted to construct the reweighting matrices used to reallocate attention. In addition, to enhance the contribution of visual tokens, we inject visual attention values into the beam search decoding to identify solutions with higher visual attention. Extensive experiments demonstrate that this method significantly reduces hallucinations across mainstream LVLMs, while preserving the accuracy and coherence of generated content.
Chinese Translation
现有的大型视觉语言模型(LVLMs)表现出不足的视觉注意力,导致幻觉现象。为了解决这个问题,一些先前的研究调整并增强了视觉注意力。然而,这些方法存在一个局限性,即对所有视觉标记增强注意力不可避免地会增加对与任务无关标记的注意力。为应对这一挑战,我们提出了一种无训练的注意力干预算法,旨在基于任务相关标记通常表现出高视觉-文本相似性的论点,增强任务相关标记的注意力。具体而言,我们提取表示视觉-文本相关性的视觉-文本交叉注意力子矩阵,以构建重加权矩阵重新分配注意力。此外,为了增强视觉标记的贡献,我们将视觉注意力值注入束搜索解码中,以识别具有更高视觉注意力的解决方案。大量实验表明,该方法显著减少了主流LVLMs中的幻觉现象,同时保持了生成内容的准确性和连贯性。
cs.CV / 37 / 2602.09523
Singpath-VL Technical Report
Singpath-VL技术报告
Abstract
We present Singpath-VL, a large vision-language model, to fill the gap left by the lack of an AI assistant for cervical cytology. Recent advances in multi-modal large language models (MLLMs) have significantly propelled the field of computational pathology. However, their application in cytopathology, particularly cervical cytology, remains underexplored, primarily due to the scarcity of large-scale, high-quality annotated datasets. To bridge this gap, we first develop a novel three-stage pipeline to synthesize a million-scale image-description dataset. The pipeline leverages multiple general-purpose MLLMs as weak annotators, refines their outputs through consensus fusion and expert knowledge injection, and produces high-fidelity descriptions of cell morphology. Using this dataset, we then fine-tune the Qwen3-VL-4B model via a multi-stage strategy to create a specialized cytopathology MLLM. The resulting model, named Singpath-VL, demonstrates superior performance in fine-grained morphological perception and cell-level diagnostic classification. To advance the field, we will open-source a portion of the synthetic dataset and benchmark.
Chinese Translation
我们提出了Singpath-VL,一个视觉-语言大模型,以填补在宫颈细胞学中AI助手的空缺。最近多模态大语言模型(MLLMs)的进展显著推动了计算病理学的发展。然而,它们在细胞病理学,特别是宫颈细胞学中的应用仍然未被充分探索,主要是由于缺乏大规模、高质量的标注数据集。为了解决这一问题,我们首先开发了一种新颖的三阶段流程,以合成百万规模的图像-描述数据集。该流程利用多个通用的MLLM作为弱标注者,通过共识融合和专家知识注入来精炼它们的输出,并生成高保真的细胞形态描述。利用该数据集,我们随后通过多阶段策略对Qwen3-VL-4B模型进行微调,以创建一个专门的细胞病理学MLLM。最终模型Singpath-VL在细粒度形态感知和细胞级诊断分类方面表现出优越的性能。为了推动该领域的发展,我们将开源部分合成数据集和基准。
cs.CV / 38 / 2602.09524
HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection
HLGFA:用于无监督异常检测的高低分辨率引导特征对齐
Abstract
Unsupervised industrial anomaly detection (UAD) is essential for modern manufacturing inspection, where defect samples are scarce and reliable detection is required. In this paper, we propose HLGFA, a high-low resolution guided feature alignment framework that learns normality by modeling cross-resolution feature consistency between high-resolution and low-resolution representations of normal samples, instead of relying on pixel-level reconstruction. Dual-resolution inputs are processed by a shared frozen backbone to extract multi-level features, and high-resolution representations are decomposed into structure and detail priors to guide the refinement of low-resolution features through conditional modulation and gated residual correction. During inference, anomalies are naturally identified as regions where cross-resolution alignment breaks down. In addition, a noise-aware data augmentation strategy is introduced to suppress nuisance-induced responses commonly observed in industrial environments. Extensive experiments on standard benchmarks demonstrate the effectiveness of HLGFA, achieving 97.9% pixel-level AUROC and 97.5% image-level AUROC on the MVTec AD dataset, outperforming representative reconstruction-based and feature-based methods.
Chinese Translation
无监督工业异常检测(UAD)对于现代制造检验至关重要,因为缺乏缺陷样本且需要可靠的检测。本文提出了HLGFA,一种高低分辨率引导特征对齐框架,通过建模正常样本的高分辨率和低分辨率表示之间的跨分辨率特征一致性来学习正常性,而不是依赖于像素级重建。双分辨率输入通过共享的冻结主干网络处理,以提取多层次特征,高分辨率表示被分解为结构和细节先验,通过条件调制和门控残差校正来引导低分辨率特征的细化。在推理过程中,异常自然地被识别为跨分辨率对齐失效的区域。此外,提出了一种噪声感知的数据增强策略,以抑制在工业环境中常见的干扰引起的响应。在标准基准上的广泛实验表明,HLGFA的有效性,在MVTec AD数据集上实现了97.9%的像素级AUROC和97.5%的图像级AUROC,优于代表性的基于重建和基于特征的方法。
cs.CV / 39 / 2602.09528
SchröMind: Mitigating Hallucinations in Multimodal Large Language Models via Solving the Schrödinger Bridge Problem
SchröMind:通过解决薛定谔桥问题来减轻多模态大语言模型中的幻觉现象
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have achieved significant success across various domains. However, their use in high-stakes fields like healthcare remains limited due to persistent hallucinations, where generated text contradicts or ignores visual input. We contend that MLLMs can comprehend images but struggle to produce accurate token sequences. Minor perturbations can shift attention from truthful to untruthful states, and the autoregressive nature of text generation often prevents error correction. To address this, we propose SchröMind, a novel framework that reduces hallucinations by solving the Schrödinger bridge problem. It establishes a token-level mapping between hallucinatory and truthful activations with minimal transport cost through lightweight training, while preserving the model's original capabilities. Extensive experiments on the POPE and MME benchmarks demonstrate the superiority of SchröMind, which achieves state-of-the-art performance while introducing only minimal computational overhead.
Chinese Translation
最近,多模态大语言模型(MLLMs)的进展在各个领域取得了显著成功。然而,由于持续存在的幻觉现象,即生成的文本与视觉输入相矛盾或忽略视觉输入,它们在医疗等高风险领域的应用仍然有限。我们认为,MLLMs能够理解图像,但在生成准确的标记序列方面存在困难。轻微的扰动可能会使注意力从真实状态转移到不真实状态,而文本生成的自回归特性往往阻碍了错误的纠正。为了解决这个问题,我们提出了SchröMind,一个通过解决薛定谔桥问题来减少幻觉现象的新框架。该框架在轻量级训练下,通过最小运输成本建立了幻觉激活与真实激活之间的标记级映射,同时保留了模型的原始能力。在POPE和MME基准上的广泛实验表明了SchröMind的优越性,其在仅引入极小计算开销的同时,实现了最先进的性能。
cs.CV / 40 / 2602.09529
SCA-Net: Spatial-Contextual Aggregation Network for Enhanced Small Building and Road Change Detection
SCA-Net:用于增强小型建筑和道路变化检测的空间上下文聚合网络
Abstract
Automated change detection in remote sensing imagery is critical for urban management, environmental monitoring, and disaster assessment. While deep learning models have advanced this field, they often struggle with challenges like low sensitivity to small objects and high computational costs. This paper presents SCA-Net, an enhanced architecture built upon the Change-Agent framework for precise building and road change detection in bi-temporal images. Our model incorporates several key innovations: a novel Difference Pyramid Block for multi-scale change analysis, an Adaptive Multi-scale Processing module combining shape-aware and high-resolution enhancement blocks, and multi-level attention mechanisms (PPM and CSAGate) for joint contextual and detail processing. Furthermore, a dynamic composite loss function and a four-phase training strategy are introduced to stabilize training and accelerate convergence. Comprehensive evaluations on the LEVIR-CD and LEVIR-MCI datasets demonstrate SCA-Net's superior performance over Change-Agent and other state-of-the-art methods. Our approach achieves a significant 2.64% improvement in mean Intersection over Union (mIoU) on LEVIR-MCI and a remarkable 57.9% increase in IoU for small buildings, while reducing the training time by 61%. This work provides an efficient, accurate, and robust solution for practical change detection applications.
Chinese Translation
遥感图像中的自动变化检测对于城市管理、环境监测和灾害评估至关重要。尽管深度学习模型在这一领域取得了进展,但它们通常面临对小物体的低敏感性和高计算成本等挑战。本文提出了SCA-Net,一种基于Change-Agent框架的增强架构,用于在双时相图像中精确检测建筑和道路的变化。我们的模型结合了几个关键创新:一种新颖的差异金字塔块用于多尺度变化分析,一个结合形状感知和高分辨率增强块的自适应多尺度处理模块,以及用于联合上下文和细节处理的多级注意机制(PPM和CSAGate)。此外,引入了一种动态复合损失函数和四阶段训练策略,以稳定训练并加速收敛。在LEVIR-CD和LEVIR-MCI数据集上的综合评估表明,SCA-Net在性能上优于Change-Agent和其他最先进的方法。我们的方法在LEVIR-MCI上实现了平均交并比(mIoU)显著提高2.64%,在小型建筑的IoU上实现了惊人的57.9%的提升,同时将训练时间减少了61%。这项工作为实际变化检测应用提供了一种高效、准确和稳健的解决方案。
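For readers unfamiliar with the metric quoted above, the following is a minimal sketch of per-class Intersection over Union (IoU) and its mean (mIoU) over flattened label maps. The label values and the toy prediction are illustrative assumptions and are unrelated to the LEVIR-CD or LEVIR-MCI results.

```python
def iou(pred, target, cls):
    """IoU of one class over flattened label lists; None if the class is absent from both."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else None


def mean_iou(pred, target, classes):
    """Average the per-class IoU, skipping classes that appear in neither map."""
    scores = [s for s in (iou(pred, target, c) for c in classes) if s is not None]
    return sum(scores) / len(scores)


# Hypothetical labels: 0 = unchanged, 1 = building change, 2 = road change.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, classes=[0, 1, 2])
```

Because small objects cover few pixels, a handful of misclassified pixels moves their IoU a lot, which is why small-building IoU is often reported separately from the overall mIoU.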
cs.CV / 41 / 2602.09531
DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment
DR.Experts:用于盲图像质量评估的失真感知专家的差异化精炼
Abstract
Blind Image Quality Assessment, aiming to replicate human perception of visual quality without reference, plays a key role in vision tasks, yet existing models often fail to effectively capture subtle distortion cues, leading to a misalignment with human subjective judgments. We identify that the root cause of this limitation lies in the lack of reliable distortion priors, as methods typically learn shallow relationships between unified image features and quality scores, resulting in their insensitive nature to distortions and thus limiting their performance. To address this, we introduce DR.Experts, a novel prior-driven BIQA framework designed to explicitly incorporate distortion priors, enabling a reliable quality assessment. DR.Experts begins by leveraging a degradation-aware vision-language model to obtain distortion-specific priors, which are further refined and enhanced by the proposed Distortion-Saliency Differential Module through distinguishing them from semantic attentions, thereby ensuring the genuine representations of distortions. The refined priors, along with semantics and bridging representation, are then fused by a proposed mixture-of-experts style module named the Dynamic Distortion Weighting Module. This mechanism weights each distortion-specific feature as per its perceptual impact, ensuring that the final quality prediction aligns with human perception. Extensive experiments conducted on five challenging BIQA benchmarks demonstrate the superiority of DR.Experts over current methods and showcase its excellence in terms of generalization and data efficiency.
Chinese Translation
盲图像质量评估旨在在没有参考的情况下复制人类对视觉质量的感知,在视觉任务中发挥着关键作用。然而,现有模型往往无法有效捕捉微妙的失真线索,导致与人类主观判断的不一致。我们发现这一限制的根本原因在于缺乏可靠的失真先验,因为现有方法通常学习统一图像特征与质量评分之间的浅层关系,导致它们对失真的敏感性不足,从而限制了性能。为了解决这一问题,我们提出了DR.Experts,一种新颖的基于先验的盲图像质量评估框架,旨在明确地结合失真先验,从而实现可靠的质量评估。DR.Experts首先利用一个感知退化的视觉-语言模型获取特定于失真的先验,然后通过提出的失真显著性差异模块对其进行进一步精炼和增强,确保失真的真实表示。精炼后的先验与语义和桥接表示随后通过一个名为动态失真加权模块的混合专家风格模块进行融合。该机制根据每个特定失真特征的感知影响进行加权,确保最终的质量预测与人类感知一致。在五个具有挑战性的盲图像质量评估基准上进行的广泛实验表明,DR.Experts在当前方法中具有优越性,并展示了其在泛化能力和数据效率方面的卓越表现。
cs.CV / 42 / 2602.09532
RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
RAD:用于欠代表类别的检索增强单目度量深度估计
Abstract
Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
Chinese Translation
单目度量深度估计(MMDE)对于物理智能系统至关重要,但在复杂场景中对欠代表类别的准确深度估计仍然是一个持续的挑战。为了解决这一问题,我们提出了RAD,一种检索增强框架,通过利用检索到的邻域作为结构几何代理,近似多视角立体视觉的优势。我们的方法首先采用一种不确定性感知的检索机制,识别输入中的低置信度区域,并检索包含语义相似内容的RGB-D上下文样本。然后,我们通过双流网络处理输入和检索到的上下文,并使用匹配的交叉注意力模块进行融合,该模块仅在可靠的点对应关系下转移几何信息。在NYU Depth v2、KITTI和Cityscapes上的评估表明,RAD在欠代表类别上显著优于最先进的基线,在NYU Depth v2上减少了29.2%的相对绝对误差,在KITTI上减少了13.3%,在Cityscapes上减少了7.2%,同时在标准领域基准上保持了竞争力的性能。
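The error reductions quoted above are relative changes in a depth error metric. As a minimal sketch, assuming the metric is the commonly used absolute relative error (AbsRel, the mean of |prediction - ground truth| / ground truth over valid pixels), the computation looks as follows. The depth values and the baseline figures below are made up for illustration and are not taken from the paper.

```python
def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with valid (positive) ground truth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


def relative_reduction(baseline_error, new_error):
    """Fraction by which `new_error` is lower than `baseline_error`."""
    return (baseline_error - new_error) / baseline_error


# Illustrative depths in metres; a ground-truth value of 0 marks an invalid pixel.
gt = [1.0, 2.0, 4.0, 0.0]
pred = [1.0, 2.2, 3.0, 5.0]
error = abs_rel(pred, gt)                  # (0 + 0.1 + 0.25) / 3
reduction = relative_reduction(0.20, 0.15)  # hypothetical baseline vs. improved error
```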
cs.CV / 43 / 2602.09534
AUHead: Realistic Emotional Talking Head Generation via Action Units Control
AUHead:通过动作单元控制生成逼真的情感对话头
Abstract
Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e., Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs), by spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into the structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at https://github.com/laura990501/AUHead_ICLR
Chinese Translation
逼真的对话头视频生成对于虚拟化身、电影制作和互动系统至关重要。目前的方法在细腻的情感表达方面存在困难,主要是由于缺乏细粒度的情感控制。为了解决这个问题,我们提出了一种新颖的两阶段方法(AUHead),以将细粒度情感控制(即动作单元,Action Units,AUs)与音频解耦,并实现可控生成。在第一阶段,我们通过时空动作单元标记化和“情感-再到动作单元”的思维链机制,探索大型音频-语言模型(ALMs)的动作单元生成能力。该阶段旨在从原始语音中解耦动作单元,有效捕捉细微的情感线索。在第二阶段,我们提出了一种基于动作单元驱动的可控扩散模型,该模型根据动作单元序列合成逼真的对话头视频。具体而言,我们首先将动作单元序列映射到结构化的二维面部表示,以增强空间保真度,然后在交叉注意模块中建模动作单元与视觉的交互。为了实现灵活的动作单元质量权衡控制,我们在推理过程中引入了一种动作单元解耦指导策略,进一步提升生成视频的情感表现力和身份一致性。在基准数据集上的结果表明,我们的方法在情感真实感、准确的唇同步和视觉一致性方面表现出竞争力,显著超越现有技术。我们的实现可在 https://github.com/laura990501/AUHead_ICLR 获取。
cs.CV / 44 / 2602.09541
Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination
Scalpel:通过混合高斯桥精细对齐注意力激活流形以减轻多模态幻觉
Abstract
Rapid progress in large vision-language models (LVLMs) has achieved unprecedented performance in vision-language tasks. However, due to the strong prior of large language models (LLMs) and misaligned attention across modalities, LVLMs often generate outputs inconsistent with visual content, a phenomenon termed hallucination. To address this, we propose Scalpel, a method that reduces hallucination by refining attention activation distributions toward more credible regions. Scalpel predicts trusted attention directions for each head in Transformer layers during inference and adjusts activations accordingly. It employs a Gaussian mixture model to capture multi-peak distributions of attention in trust and hallucination manifolds, and uses entropic optimal transport (equivalent to the Schrödinger bridge problem) to map Gaussian components precisely. During mitigation, Scalpel dynamically adjusts intervention strength and direction based on component membership and mapping relationships between hallucination and trust activations. Extensive experiments across multiple datasets and benchmarks demonstrate that Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance. Moreover, Scalpel is model- and data-agnostic, requiring no additional computation, only a single decoding step.
Chinese Translation
大型视觉语言模型(LVLMs)的快速发展在视觉语言任务中取得了前所未有的性能。然而,由于大型语言模型(LLMs)的强先验和跨模态的注意力不对齐,LVLMs往往生成与视觉内容不一致的输出,即所谓的幻觉。为了解决这个问题,我们提出了Scalpel,一种通过将注意力激活分布细化到更可信区域来减少幻觉的方法。Scalpel在推理过程中为Transformer层中的每个头预测可信的注意力方向,并相应地调整激活。它采用高斯混合模型来捕捉信任和幻觉流形中注意力的多峰分布,并使用熵最优传输(等同于Schrödinger桥问题)来精确映射高斯成分。在减轻幻觉的过程中,Scalpel根据成分隶属关系和幻觉与信任激活之间的映射关系动态调整干预强度和方向。通过在多个数据集和基准上的广泛实验,证明Scalpel有效减轻幻觉,超越了先前的方法,并实现了最先进的性能。此外,Scalpel是模型和数据无关的,不需要额外的计算,仅需一步解码。
cs.CV / 45 / 2602.09586
Delving into Spectral Clustering with Vision-Language Representations
深入探讨基于视觉-语言表示的谱聚类
Abstract
Spectral clustering is known as a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime. Particularly, we propose Neural Tangent Kernel Spectral Clustering that leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we arrive at formulating the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts. Extensive experiments on 16 benchmarks, including classical, large-scale, fine-grained and domain-shifted datasets, show that our method consistently outperforms the state-of-the-art by a large margin.
Chinese Translation
谱聚类被认为是一种强大的无监督数据分析技术。绝大多数谱聚类方法都是基于单一模态,未能充分利用多模态表示中的丰富信息。受到近期视觉-语言预训练成功的启发,本文将谱聚类的研究从单模态扩展到多模态。特别地,我们提出了一种神经切线核谱聚类(Neural Tangent Kernel Spectral Clustering),该方法利用预训练视觉-语言模型中的跨模态对齐。通过将神经切线核与正面名词(即与感兴趣图像语义上接近的名词)相结合,我们将图像之间的亲和力公式化为其视觉接近度和语义重叠的耦合。我们展示了这种公式化能够增强聚类内部的连接,同时抑制聚类间的虚假连接,从而鼓励块对角结构。此外,我们提出了一种正则化的亲和力扩散机制,能够自适应地集成由不同提示诱导的亲和力矩阵。在包括经典、大规模、细粒度和领域迁移数据集在内的16个基准测试上的大量实验表明,我们的方法在性能上始终大幅超越现有的最先进技术。
cs.CV / 46 / 2602.09587
MieDB-100k: A Comprehensive Dataset for Medical Image Editing
MieDB-100k:一个全面的医学图像编辑数据集
Abstract
The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding, and an inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text-guided medical image editing. It categorizes editing tasks into the perspectives of Perception, Modification and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthesis methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that models trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.
Chinese Translation
高质量数据的稀缺仍然是将多模态生成模型应用于医学图像编辑的主要瓶颈。现有的医学图像编辑数据集往往存在多样性有限、忽视医学图像理解以及无法平衡质量与可扩展性等问题。为了解决这些问题,我们提出了MieDB-100k,这是一个大规模、高质量且多样化的文本引导医学图像编辑数据集。它将编辑任务分为感知、修改和转化三个方面,考虑了理解和生成能力。我们通过利用特定模态的专家模型和基于规则的数据合成方法,构建了MieDB-100k,并经过严格的人工检查以确保临床真实性。大量实验表明,使用MieDB-100k训练的模型在性能上始终优于开源和专有模型,同时展现出强大的泛化能力。我们预期该数据集将成为未来专门医学图像编辑进展的基石。
cs.CV / 47 / 2602.09600
Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures
Hand2World:通过自由空间手势生成自回归自我中心交互
Abstract
Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.
Chinese Translation
自我中心交互世界模型对于增强现实和具身人工智能至关重要,其中视觉生成必须以低延迟、几何一致性和长期稳定性对用户输入做出响应。我们研究在自由空间手势下,从单一场景图像生成自我中心交互,旨在合成逼真的视频,其中手进入场景,与物体交互,并在头部运动下引发合理的世界动态。这个设置引入了基本挑战,包括自由空间手势与以接触为主的训练数据之间的分布偏移、单目视图中手部运动与相机运动之间的模糊性,以及生成任意长度视频的需求。我们提出了Hand2World,一个统一的自回归框架,通过基于投影3D手网格的遮挡不变手部条件来解决这些挑战,从而使可见性和遮挡可以从场景上下文中推断,而不是通过控制信号编码。为了稳定自我中心视点的变化,我们通过每像素的Plücker光线嵌入注入显式相机几何,解耦相机运动与手部运动,防止背景漂移。我们进一步开发了一个完全自动化的单目标注管道,并将双向扩散模型提炼为因果生成器,实现任意长度的合成。在三个自我中心交互基准上的实验表明,在感知质量和3D一致性方面有显著改善,同时支持相机控制和长时间交互生成。
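The per-pixel Plücker-ray embedding mentioned above can be sketched as follows. This is a generic construction, not the Hand2World implementation: it assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a camera-to-world rotation matrix, and a camera centre in world coordinates, and it assumes the six values are ordered as (direction, moment), where the moment is the cross product of the camera centre and the unit ray direction. Other orderings and conventions exist.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(u, v, fx, fy, cx, cy, rotation, origin):
    """Return the 6-D Plücker coordinates (d, o x d) of the ray through pixel (u, v)."""
    # Back-project the pixel to a direction in the camera frame (pinhole model).
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into the world frame and normalise to unit length.
    d_world = tuple(sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3))
    norm = math.sqrt(sum(x * x for x in d_world))
    d = tuple(x / norm for x in d_world)
    # The moment encodes where the ray sits in space, not just where it points.
    m = cross(origin, d)
    return d + m


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Ray through the principal point of a camera placed at (1, 0, 0), looking along +z.
center_ray = plucker_ray(320, 240, 500.0, 500.0, 320.0, 240.0, identity, (1.0, 0.0, 0.0))
# Ray through an off-centre pixel of the same camera.
corner_ray = plucker_ray(0, 0, 500.0, 500.0, 320.0, 240.0, identity, (1.0, 0.0, 0.0))
```

Because the moment changes when the camera translates while the direction changes when it rotates, the embedding gives a generator an explicit per-pixel signal of camera motion that is separate from anything the hands do.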
cs.CV / 48 / 2602.09609
Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing
Tele-Omni:一个统一的多模态视频生成与编辑框架
Liu, Jialun, Ma, Yukuo, Cao, Xiao, Li, Tian, Shang, Gonghu, Huang, Haibin, Zhang, Chi, Li, Xuelong, Liu, Cong, Liu, Junqi, Hu, Jiakui, Tan, Robby T., Zhang, Shiwen, Yang, Liying, Yang, Xiaoyan, Weng, Qizhen, Chang, Xiangzhen, Liang, Yuanzhi, Xu, Yifan, Huang, Zhiyong, Li, Zuoxin, Li, Xuelong
Abstract
Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while diffusion-based generators perform high-quality video synthesis conditioned on these structured signals. To enable joint training across heterogeneous video tasks, we introduce a task-aware data processing pipeline that unifies multimodal inputs into a structured instruction format while preserving task-specific constraints. Tele-Omni supports a wide range of video-centric tasks, including text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing. By decoupling instruction parsing from video synthesis and combining it with task-aware data design, Tele-Omni achieves flexible multimodal control while maintaining strong temporal coherence and visual consistency. Experimental results demonstrate that Tele-Omni achieves competitive performance across multiple tasks.
Chinese Translation
近年来,基于扩散的视频生成技术的进步显著提高了视觉保真度和时间一致性。然而,现有的大多数方法仍然是任务特定的,主要依赖文本指令,这限制了它们在统一框架内处理多模态输入、上下文参考以及多样化视频生成和编辑场景的能力。此外,许多视频编辑方法依赖于针对单个操作精心设计的管道,这阻碍了可扩展性和可组合性。在本文中,我们提出了Tele-Omni,一个统一的多模态视频生成与编辑框架,能够在单一模型中遵循包括文本、图像和参考视频在内的多模态指令。Tele-Omni利用预训练的多模态大型语言模型来解析异构指令并推断结构化的生成或编辑意图,同时基于扩散的生成器在这些结构化信号的条件下执行高质量的视频合成。为了实现跨异构视频任务的联合训练,我们引入了一种任务感知的数据处理管道,将多模态输入统一为结构化指令格式,同时保持任务特定的约束。Tele-Omni支持广泛的视频中心任务,包括文本到视频生成、图像到视频生成、首尾帧视频生成、上下文视频生成和上下文视频编辑。通过将指令解析与视频合成解耦,并结合任务感知的数据设计,Tele-Omni实现了灵活的多模态控制,同时保持强大的时间一致性和视觉一致性。实验结果表明,Tele-Omni在多个任务中实现了具有竞争力的性能。
cs.CV / 49 / 2602.09611
AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models
AGMark:用于大型视觉语言模型的注意力引导动态水印
Abstract
Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases. Additionally, current vision-specific watermarks rely on a static, one-time estimation of vision critical weights and ignore the weight distribution density when determining the proportion of protected tokens. This design fails to account for dynamic changes in visual dependence during generation and may introduce low-quality tokens in the long tail. To address these challenges, we propose Attention-Guided Dynamic Watermarking (AGMark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. At each decoding step, AGMark first dynamically identifies semantic-critical evidence based on attention weights for visual relevance, together with context-aware coherence cues, resulting in a more adaptive and well-calibrated evidence-weight distribution. It then determines the proportion of semantic-critical tokens by jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density), thereby enabling adaptive vocabulary partitioning to avoid irrelevant tokens. Empirical results confirm that AGMark outperforms conventional methods, observably improving generation quality and yielding particularly strong gains in visual semantic fidelity in the later stages of generation. The framework maintains highly competitive detection accuracy (at least 99.36% AUC) and robust attack resilience (at least 88.61% AUC) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multi-modal watermarking.
Chinese Translation
水印技术已成为大型视觉语言模型(LVLMs)中内容可追溯性和知识产权保护的重要解决方案。然而,视觉无关的水印可能引入视觉上不相关的标记,并通过施加不加区分的伪随机偏差来破坏视觉基础。此外,目前的视觉特定水印依赖于对视觉关键权重的静态一次性估计,并在确定受保护标记的比例时忽视权重分布密度。这种设计未能考虑生成过程中视觉依赖的动态变化,可能在长尾中引入低质量标记。为了解决这些挑战,我们提出了注意力引导动态水印(AGMark),这是一个新颖的框架,能够嵌入可检测信号,同时严格保持视觉保真度。在每个解码步骤中,AGMark首先基于注意力权重动态识别语义关键证据,以确保视觉相关性,并结合上下文感知的连贯线索,从而产生更具适应性和良好校准的证据权重分布。然后,它通过共同考虑不确定性意识(标记熵)和证据校准(权重密度)来确定语义关键标记的比例,从而实现自适应词汇划分,以避免不相关的标记。实证结果证实,AGMark优于传统方法,显著提高了生成质量,并在生成后期特别增强了视觉语义保真度。该框架保持了高度竞争的检测准确率(至少99.36% AUC)和强大的攻击韧性(至少88.61% AUC),而不牺牲推理效率,有效地为可靠性保持的多模态水印建立了新的标准。
cs.CV / 50 / 2602.09637
Towards Training-free Multimodal Hate Localisation with Large Language Models
无训练的多模态仇恨内容定位方法:基于大型语言模型
Abstract
The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.
Chinese Translation
在线视频中仇恨内容的泛滥对个人福祉和社会和谐构成了严重威胁。然而,现有的视频仇恨检测解决方案要么严重依赖大规模的人类标注,要么缺乏细粒度的时间精度。在本研究中,我们提出了LELA,这是第一个基于大型语言模型(LLM)的无训练仇恨视频定位框架。与依赖监督流程的最先进模型不同,LELA利用LLM和特定模态的字幕,以无训练的方式检测和时间定位仇恨内容。我们的方法将视频分解为五种模态,包括图像、语音、光学字符识别(OCR)、音乐和视频上下文,并使用多阶段提示方案为每一帧计算细粒度的仇恨评分。我们进一步引入了一种组合匹配机制,以增强跨模态推理。在两个具有挑战性的基准测试HateMM和MultiHateClip上的实验表明,LELA在所有现有的无训练基线中表现优异,具有显著优势。我们还提供了广泛的消融实验和定性可视化,确立了LELA作为可扩展和可解释的仇恨视频定位的强大基础。
cs.CV / 51 / 2602.09638
VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model
VideoAfford:通过多模态大型语言模型从人-物交互视频中获取3D可操作性
Abstract
3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research primarily focused on learning affordance knowledge from static cues such as language and images, which struggle to provide sufficient dynamic interaction context that can reveal temporal and causal cues. To alleviate this predicament, we collect a comprehensive video-based 3D affordance dataset, VIDA, which contains 38K human-object-interaction videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on VIDA, we propose a strong baseline: VideoAfford, which activates multimodal large language models with additional affordance segmentation capabilities, enabling both world knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding capability, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a spatial-aware loss function to enable VideoAfford to obtain comprehensive 3D spatial knowledge. Extensive experimental evaluations demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.
Chinese Translation
3D可操作性获取旨在突出3D物体上的可操作区域,这对于机器人操作至关重要。以往的研究主要集中在从静态线索(如语言和图像)中学习可操作性知识,这些静态线索难以提供足够的动态交互背景,以揭示时间和因果线索。为了解决这一困境,我们收集了一个全面的视频基础3D可操作性数据集 \textit{VIDA},该数据集包含38K个人-物体交互视频,涵盖16种可操作性类型、38种物体类别和22K点云。基于 \textit{VIDA},我们提出了一个强大的基线模型:VideoAfford,该模型激活了具有额外可操作性分割能力的多模态大型语言模型,使得在统一框架内能够进行世界知识推理和细粒度可操作性获取。为了增强动作理解能力,我们利用潜在动作编码器从HOI视频中提取动态交互先验。此外,我们引入了一种 \textit{空间感知}损失函数,使得VideoAfford能够获取全面的3D空间知识。大量实验评估表明,我们的模型显著优于成熟的方法,并展现出强大的开放世界泛化能力和可操作性推理能力。所有数据集和代码将公开发布,以推动该领域的研究。
cs.CV / 52 / 2602.09648
Time2General: Learning Spatiotemporal Invariant Representations for Domain-Generalization Video Semantic Segmentation
Time2General:学习时空不变表示以实现领域泛化的视频语义分割
Abstract
In Domain Generalized Video Semantic Segmentation (DGVSS), a model is trained on a single labeled driving domain and deployed directly on unseen domains without target labels or test-time adaptation, while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio-Temporal Memory Decoder that aggregates multi-frame context into a clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, we propose a Masked Temporal Consistency Loss that regularizes temporal prediction discrepancies across different strides, and we randomize training strides to expose the model to diverse temporal gaps. Extensive experiments on multiple driving benchmarks show that Time2General achieves a substantial improvement in cross-domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS. Code will be released after the review process.
Chinese Translation
领域泛化视频语义分割(DGVSS)是在单一标注的驾驶领域上训练的,并直接在未见领域上部署,无需目标标签和测试时适应,同时保持视频流中的时间一致性预测。在实际应用中,领域转移和时间采样转移破坏了基于对应关系的传播和固定步幅的时间聚合,导致即使在标签稳定区域也会出现严重的帧间闪烁。我们提出了Time2General,一个基于稳定性查询的DGVSS框架。Time2General引入了一种时空记忆解码器,将多帧上下文聚合到剪辑级时空记忆中,并在没有显式对应传播的情况下解码时间一致的逐帧掩码。为了进一步抑制闪烁并提高对不同采样率的鲁棒性,提出了掩码时间一致性损失,以规范化不同步幅下的时间预测差异,并随机化训练步幅,以使模型接触到多样的时间间隔。在多个驾驶基准上的大量实验表明,Time2General在跨领域准确性和时间稳定性方面相较于之前的DGSS和VSS基线取得了显著提升,同时运行速度可达18帧每秒。代码将在审稿过程结束后发布。
cs.CV / 53 / 2602.09662
TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution
TreeCUA:通过树结构可验证演化高效扩展GUI自动化
Abstract
Effectively scaling GUI automation is essential for computer-use agents (CUAs); however, existing work primarily focuses on scaling GUI grounding rather than the more crucial GUI planning, which requires more sophisticated data collection. In reality, the exploration process of a CUA across apps/desktops/web pages typically follows a tree structure, with earlier functional entry points often being explored more frequently. Thus, organizing large-scale trajectories into tree structures can reduce data cost and streamline the data scaling of GUI planning. In this work, we propose TreeCUA to efficiently scale GUI automation with tree-structured verifiable evolution. We propose a multi-agent collaborative framework to explore the environment, verify actions, summarize trajectories, and evaluate quality to generate high-quality and scalable GUI trajectories. To improve efficiency, we devise a novel tree-based topology to store and replay duplicate exploration nodes, and design an adaptive exploration algorithm to balance the depth (\emph{i.e.}, trajectory difficulty) and breadth (\emph{i.e.}, trajectory diversity). Moreover, we develop world knowledge guidance and global memory backtracking to avoid low-quality generation. Finally, we naturally extend and propose the TreeCUA-DPO method from abundant tree node information, improving GUI planning capability by referring to the branch information of adjacent trajectories. Experimental results show that TreeCUA and TreeCUA-DPO offer significant improvements, and out-of-domain (OOD) studies further demonstrate strong generalization. All trajectory node information and code will be available at https://github.com/UITron-hub/TreeCUA.
Chinese Translation
有效扩展GUI自动化对于计算机使用代理(CUAs)至关重要;然而,现有研究主要集中在扩展GUI基础而非更为关键的GUI规划上,后者需要更复杂的数据收集。实际上,CUA在应用程序/桌面/网页之间的探索过程通常遵循树结构,早期的功能入口往往被更频繁地探索。因此,将大规模轨迹组织成树结构可以降低数据成本,并简化GUI规划的数据扩展。在本研究中,我们提出了TreeCUA,以通过树结构可验证演化高效扩展GUI自动化。我们提出了一个多代理协作框架,以探索环境、验证动作、总结轨迹并评估质量,从而生成高质量和可扩展的GUI轨迹。为了提高效率,我们设计了一种新颖的基于树的拓扑结构来存储和重放重复的探索节点,并设计了一种自适应探索算法,以平衡深度(即轨迹难度)和广度(即轨迹多样性)。此外,我们开发了世界知识指导和全局记忆回溯,以避免低质量生成。最后,我们自然扩展并提出了TreeCUA-DPO方法,利用丰富的树节点信息,通过参考相邻轨迹的分支信息来提高GUI规划能力。实验结果表明,TreeCUA和TreeCUA-DPO提供了显著的改进,而域外(OOD)研究进一步证明了强大的泛化能力。所有轨迹节点信息和代码将可在https://github.com/UITron-hub/TreeCUA获得。
cs.CV / 54 / 2602.09686
Semi-supervised Liver Segmentation and Patch-based Fibrosis Staging with Registration-aided Multi-parametric MRI
基于注册辅助的多参数MRI的半监督肝脏分割与基于块的纤维化分期
Abstract
Liver fibrosis poses a substantial challenge in clinical practice, emphasizing the necessity for precise liver segmentation and accurate disease staging. Based on the CARE Liver 2025 Track 4 Challenge, this study introduces a multi-task deep learning framework developed for liver segmentation (LiSeg) and liver fibrosis staging (LiFS) using multi-parametric MRI. The LiSeg phase addresses the challenge of limited annotated images and the complexities of multi-parametric MRI data by employing a semi-supervised learning model that integrates image segmentation and registration. By leveraging both labeled and unlabeled data, the model overcomes the difficulties introduced by domain shifts and variations across modalities. In the LiFS phase, we employ a patch-based method that allows the visualization of liver fibrosis stages based on the classification outputs. Our approach effectively handles multimodality imaging data, limited labels, and domain shifts. The proposed method has been tested by the challenge organizer on an independent test set that includes in-distribution (ID) and out-of-distribution (OOD) cases using three-channel MRIs (T1, T2, DWI) and seven-channel MRIs (T1, T2, DWI, GED1-GED4). The code is freely available. GitHub link: https://github.com/mileywang3061/Care-Liver
Chinese Translation
肝纤维化在临床实践中带来了重大挑战,强调了精确肝脏分割和准确疾病分期的必要性。本研究基于CARE Liver 2025 Track 4 Challenge,提出了一种多任务深度学习框架,用于肝脏分割(LiSeg)和肝纤维化分期(LiFS),采用多参数MRI。LiSeg阶段通过采用半监督学习模型,结合图像分割和配准,解决了标注图像有限和多参数MRI数据复杂性的问题。该模型利用标记和未标记数据,克服了领域转移和模态间变异带来的困难。在LiFS阶段,我们采用了一种基于块的方法,根据分类输出可视化肝纤维化阶段。我们的方法有效处理了多模态成像数据、有限标签和领域转移。该方法已由挑战组织者在独立测试集上进行了测试,该测试集包括使用三通道MRI(T1、T2、DWI)和七通道MRI(T1、T2、DWI、GED1-GED4)的分布内(ID)和分布外(OOD)案例。代码可免费获取。Github链接:https://github.com/mileywang3061/Care-Liver
cs.CV / 55 / 2602.09701
GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation
GenSeg-R1:基于强化学习的视觉-语言基础细粒度指称分割
Abstract
We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. On GRefCOCO validation GenSeg-R1-G achieves 76.69% target mIoU with 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection capability. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7 points.
Chinese Translation
我们通过一个解耦的推理-分割管道研究细粒度指称图像分割。视觉-语言模型(VLM)接收一幅图像和一个自然语言查询,推理场景并输出结构化的空间提示:每个被指称实例的边界框加上两个内部关键点。一个冻结的可提示分割器(SAM 2)将这些提示转换为高质量的掩膜。在我们的GenSeg-R1框架内,我们使用群体相对策略优化(GRPO)微调Qwen3-VL模型(4B和8B参数),无需监督推理链注释。在RefCOCOg验证中,我们的最佳模型(GenSeg-R1-8B)达到了0.7127的cIoU和0.7382的mIoU,显著优于相应的Qwen3-VL指令基线(分别提高了15.3和21.9分),并在相同评估下超过了Seg-Zero-7B [3],提高了3.3的cIoU。我们进一步引入了GenSeg-R1-G,一个在GRefCOCO [9]上训练的变体,采用了一个在环奖励的SAM 2,直接优化掩膜质量。在GRefCOCO验证中,GenSeg-R1-G在目标mIoU上达到了76.69%,在负向(无目标)提示上准确率为82.40%,显著优于缺乏无目标检测能力的Seg-R1-7B和Seg-Zero-7B。在ReasonSeg测试中,GenSeg-R1-4B达到了68.40%的mIoU,分别超过Seg-Zero-7B和Seg-R1-7B 7.0和10.7分。
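The GenSeg-R1 abstract reports both cIoU and mIoU without defining them. The sketch below assumes the convention common in referring segmentation: mIoU averages the per-sample IoU, while cIoU (cumulative IoU) divides the summed intersections by the summed unions over the whole evaluation set. The function names and the nested-list mask format are illustrative assumptions, not taken from the paper.

```python
def iou_stats(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def mean_iou(pairs):
    """mIoU: average of per-sample IoU, so every sample weighs the same."""
    ious = []
    for pred, gt in pairs:
        inter, union = iou_stats(pred, gt)
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def cumulative_iou(pairs):
    """cIoU: total intersection over total union, so large objects weigh more."""
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = iou_stats(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Two toy (prediction, ground truth) pairs on 2x2 masks.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # IoU = 1/2
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),  # IoU = 4/4
]
print(mean_iou(pairs), cumulative_iou(pairs))
```

Because cIoU pools pixels across samples, the two metrics can diverge when object sizes vary, which may be why the abstract reports both.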
cs.CV / 56 / 2602.09713
Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models
Stroke3D:通过潜在扩散模型将二维笔画提升为带骨架的三维模型
Abstract
Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods face challenges in generating animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. Our approach pioneers a two-stage pipeline: 1) Controllable Skeleton Generation, in which we employ the Skeletal Graph VAE (Sk-VAE) to encode the skeleton's graph structure into a latent space, where the Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding. The generation process is conditioned on both the text for semantics and the 2D strokes for explicit structural control, with the VAE's decoder reconstructing the final high-quality 3D skeleton; and 2) Enhanced Mesh Synthesis via TextuRig and SKA-DPO, in which we synthesize a textured mesh conditioned on the generated skeleton. For this stage, we first enhance an existing skeleton-to-mesh model by augmenting its training data with TextuRig, a dataset of textured and rigged meshes with captions, curated from Objaverse-XL. Additionally, we employ a preference optimization strategy, SKA-DPO, guided by a skeleton-mesh alignment score, to further improve geometric fidelity. Together, our framework enables a more intuitive workflow for creating ready-to-animate 3D content. To the best of our knowledge, our work is the first to generate rigged 3D meshes conditioned on user-drawn 2D strokes. Extensive experiments demonstrate that Stroke3D produces plausible skeletons and high-quality meshes.
Chinese Translation
带骨架的三维资产是三维变形和动画的基础。然而,现有的三维生成方法在生成可动画几何体方面面临挑战,而骨架绑定技术在骨架创建上缺乏细粒度的结构控制。为了解决这些局限性,我们提出了Stroke3D,一个新颖的框架,能够直接从用户输入(二维绘制的笔画和描述性文本提示)生成带骨架的网格。我们的方法开创了一个两阶段的流程,将生成过程分为:1)可控骨架生成,我们采用骨架图变分自编码器(Skeletal Graph VAE, Sk-VAE)将骨架的图结构编码到潜在空间中,在此基础上,骨架图生成模型(Skeletal Graph DiT, Sk-DiT)生成骨架嵌入。生成过程同时依赖于文本语义和二维笔画以实现明确的结构控制,变分自编码器的解码器重建最终的高质量三维骨架;2)通过TextuRig和SKA-DPO增强网格合成,我们在生成的骨架的基础上合成带纹理的网格。在这一阶段,我们首先通过用TextuRig(一个包含带纹理和带骨架网格及其说明的数据集,来自Objaverse-XL)增强现有的骨架到网格模型的训练数据。除此之外,我们还采用了一种偏好优化策略SKA-DPO,通过骨架-网格对齐评分来进一步提高几何保真度。综上所述,我们的框架使得创建可动画的三维内容的工作流程更加直观。根据我们所知,我们的工作是首个基于用户绘制的二维笔画生成带骨架的三维网格的方法。大量实验表明,Stroke3D能够生成合理的骨架和高质量的网格。
cs.CV / 57 / 2602.09717
From Lightweight CNNs to SpikeNets: Benchmarking Accuracy-Energy Tradeoffs with Pruned Spiking SqueezeNet
从轻量级卷积神经网络到脉冲神经网络:修剪脉冲SqueezeNet的准确性-能耗权衡基准测试
Abstract
Spiking Neural Networks (SNNs) are increasingly studied as energy-efficient alternatives to Convolutional Neural Networks (CNNs), particularly for edge intelligence. However, prior work has largely emphasized large-scale models, leaving the design and evaluation of lightweight CNN-to-SNN pipelines underexplored. In this paper, we present the first systematic benchmark of lightweight SNNs obtained by converting compact CNN architectures into spiking networks, where activations are modeled with Leaky-Integrate-and-Fire (LIF) neurons and trained using surrogate gradient descent under a unified setup. We construct spiking variants of ShuffleNet, SqueezeNet, MnasNet, and MixNet, and evaluate them on CIFAR-10, CIFAR-100, and TinyImageNet, measuring accuracy, F1-score, parameter count, computational complexity, and energy consumption. Our results show that SNNs can achieve up to 15.7x higher energy efficiency than their CNN counterparts while retaining competitive accuracy. Among these, the SNN variant of SqueezeNet consistently outperforms other lightweight SNNs. To further optimize this model, we apply a structured pruning strategy that removes entire redundant modules, yielding a pruned architecture, SNN-SqueezeNet-P. This pruned model improves CIFAR-10 accuracy by 6% and reduces parameters by 19% compared to the original SNN-SqueezeNet. Crucially, it narrows the gap with CNN-SqueezeNet, achieving nearly the same accuracy (only 1% lower) but with an 88.1% reduction in energy consumption due to sparse spike-driven computations. Together, these findings establish lightweight SNNs as practical, low-power alternatives for edge deployment, highlighting a viable path toward deploying high-performance, low-power intelligence on the edge.
Chinese Translation
脉冲神经网络(SNNs)作为卷积神经网络(CNNs)的节能替代方案,特别是在边缘智能领域,正受到越来越多的关注。然而,之前的研究主要集中在大规模模型上,轻量级CNN到SNN管道的设计和评估尚未得到充分探讨。本文首次系统性地基准测试了通过将紧凑的CNN架构转换为脉冲网络所获得的轻量级SNN,其中激活使用漏积分发火(Leaky-Integrate-and-Fire, LIF)神经元建模,并在统一的设置下使用代理梯度下降进行训练。我们构建了ShuffleNet、SqueezeNet、MnasNet和MixNet的脉冲变体,并在CIFAR-10、CIFAR-100和TinyImageNet上进行评估,测量准确性、F1分数、参数数量、计算复杂性和能耗。我们的结果表明,SNN的能效比其CNN对应物高出最多15.7倍,同时保持竞争力的准确性。在这些模型中,SqueezeNet的SNN变体始终优于其他轻量级SNN。为了进一步优化该模型,我们应用了一种结构化修剪策略,去除整个冗余模块,得到修剪后的架构SNN-SqueezeNet-P。与原始的SNN-SqueezeNet相比,该修剪模型在CIFAR-10上的准确性提高了6%,参数减少了19%。重要的是,它缩小了与CNN-SqueezeNet之间的差距,达到了几乎相同的准确性(仅低1%),但由于稀疏脉冲驱动计算,能耗减少了88.1%。这些发现共同确立了轻量级SNN作为边缘部署的实用低功耗替代方案,突显了在边缘部署高性能、低功耗智能的可行路径。
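The SpikeNets abstract models activations with Leaky-Integrate-and-Fire (LIF) neurons. As a rough illustration of what such a neuron does, here is a minimal discrete-time sketch. The decay factor `beta`, the threshold, and the hard reset to zero are assumptions chosen for clarity; the abstract does not specify the paper's neuron parameters or its surrogate-gradient training setup, and this sketch covers the forward dynamics only.

```python
def lif_simulate(inputs, beta=0.9, threshold=1.0):
    """Simulate one LIF neuron over a sequence of input currents.

    At each step the membrane potential leaks (multiplied by beta),
    integrates the input, and fires when it reaches the threshold.
    """
    v = 0.0
    spikes = []
    for x in inputs:
        v = beta * v + x          # leak, then integrate
        if v >= threshold:
            spikes.append(1)      # fire
            v = 0.0               # hard reset (assumption)
        else:
            spikes.append(0)
    return spikes


# A constant sub-threshold input still produces periodic spikes
# because the potential accumulates across time steps.
print(lif_simulate([0.5] * 6))
```

The output is a sparse binary sequence, and that sparsity is what the abstract credits for the energy savings of spike-driven computation.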
cs.CV / 58 / 2602.09730
Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings
裂纹的魅力:一种变分生成方法用于绘画中的裂纹检测
Abstract
Recent advances in imaging technologies, deep learning and numerical performance have enabled non-invasive detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet remains challenging due to the possibly complex scenery and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as a powerful prior for the underlying artwork, while crack structures are captured using a Mumford--Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack localizations in the painting.
Chinese Translation
近年来,成像技术、深度学习和数值性能的进步使得对艺术品进行非侵入性的详细分析成为可能,从而支持其文献记录和保护工作。特别是,数字化绘画中裂纹的自动检测对于评估退化和指导修复至关重要,但由于可能存在复杂的场景以及裂纹与类似裂纹的艺术特征(如笔触或毛发)之间的视觉相似性,这一任务仍然具有挑战性。我们提出了一种混合方法,将裂纹检测建模为一个逆问题,将观察到的图像分解为无裂纹的绘画和裂纹成分。我们采用深度生成模型作为潜在艺术品的强大先验,同时使用Mumford--Shah类型的变分泛函结合裂纹先验来捕捉裂纹结构。联合优化产生了绘画中裂纹定位的像素级地图。
cs.CV / 59 / 2602.09736
Toward Fine-Grained Facial Control in 3D Talking Head Generation
朝向细粒度面部控制的3D对话头生成
Abstract
Audio-driven talking head generation is a core component of digital avatars, and 3D Gaussian Splatting has shown strong performance in real-time rendering of high-fidelity talking heads. However, achieving precise control over fine-grained facial movements remains a significant challenge, particularly due to lip-synchronization inaccuracies and facial jitter, both of which can contribute to the uncanny valley effect. To address these challenges, we propose Fine-Grained 3D Gaussian Splatting (FG-3DGS), a novel framework that enables temporally consistent and high-fidelity talking head generation. Our method introduces a frequency-aware disentanglement strategy to explicitly model facial regions based on their motion characteristics. Low-frequency regions, such as the cheeks, nose, and forehead, are jointly modeled using a standard MLP, while high-frequency regions, including the eyes and mouth, are captured separately using a dedicated network guided by facial area masks. The predicted motion dynamics, represented as Gaussian deltas, are applied to the static Gaussians to generate the final head frames, which are rendered via a rasterizer using frame-specific camera parameters. Additionally, a high-frequency-refined post-rendering alignment mechanism, learned from large-scale audio-video pairs by a pretrained model, is incorporated to enhance per-frame generation and achieve more accurate lip synchronization. Extensive experiments on widely used datasets for talking head generation demonstrate that our method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.
Chinese Translation
基于音频的对话头生成是数字化身的核心组成部分,而3D高斯点云(3D Gaussian Splatting)在高保真对话头的实时渲染中表现出色。然而,实现对细粒度面部动作的精确控制仍然是一个重大挑战,特别是由于唇部同步不准确和面部抖动,这两者都可能导致不适感效应(uncanny valley effect)。为了解决这些挑战,我们提出了细粒度3D高斯点云(Fine-Grained 3D Gaussian Splatting,FG-3DGS),这是一个新颖的框架,能够实现时间一致性和高保真的对话头生成。我们的方法引入了一种频率感知的解耦策略,以根据面部区域的运动特征明确建模。低频区域,如面颊、鼻子和额头,使用标准多层感知器(MLP)进行联合建模,而高频区域,包括眼睛和嘴巴,则通过一个专门的网络进行单独捕捉,该网络由面部区域掩码引导。预测的运动动态以高斯增量(Gaussian deltas)表示,应用于静态高斯以生成最终的头部帧,这些帧通过光栅化器使用特定于帧的相机参数进行渲染。此外,结合从大规模音频-视频对中通过预训练模型学习的高频精细后渲染对齐机制,以增强每帧生成并实现更准确的唇部同步。在广泛使用的对话头生成数据集上的大量实验表明,我们的方法在生成高保真、同步唇动的对话头视频方面优于最近的最先进方法。
cs.CV / 60 / 2602.09740
Robust Vision Systems for Connected and Autonomous Vehicles: Security Challenges and Attack Vectors
面向连接与自动驾驶车辆的鲁棒视觉系统:安全挑战与攻击向量
Abstract
This article investigates the robustness of vision systems in Connected and Autonomous Vehicles (CAVs), which is critical for developing Level-5 autonomous driving capabilities. Safe and reliable CAV navigation undeniably depends on robust vision systems that enable accurate detection of objects, lane markings, and traffic signage. We analyze the key sensors and vision components essential for CAV navigation to derive a reference architecture for CAV vision system (CAVVS). This reference architecture provides a basis for identifying potential attack surfaces of CAVVS. Subsequently, we elaborate on identified attack vectors targeting each attack surface, rigorously evaluating their implications for confidentiality, integrity, and availability (CIA). Our study provides a comprehensive understanding of attack vector dynamics in vision systems, which is crucial for formulating robust security measures that can uphold the principles of the CIA triad.
Chinese Translation
本文探讨了连接与自动驾驶车辆(CAVs)中视觉系统的鲁棒性,这对于开发5级自动驾驶能力至关重要。安全可靠的CAV导航无疑依赖于能够准确检测物体、车道标记和交通标志的鲁棒视觉系统。我们分析了CAV导航所需的关键传感器和视觉组件,以推导出CAV视觉系统(CAVVS)的参考架构。该参考架构为识别CAVVS的潜在攻击面提供了基础。随后,我们详细阐述了针对每个攻击面的已识别攻击向量,严格评估其对机密性、完整性和可用性(CIA)的影响。我们的研究提供了对视觉系统中攻击向量动态的全面理解,这对于制定能够维护CIA三元组原则的鲁棒安全措施至关重要。
cs.CV / 61 / 2602.09764
Self-Supervised Learning as Discrete Communication
自监督学习作为离散通信
Abstract
Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
Chinese Translation
大多数自监督学习(SSL)方法通过对齐同一输入的不同视图来学习连续的视觉表示,这对信息在表示维度上的结构化提供了有限的控制。在本研究中,我们将视觉自监督学习框架视为教师网络与学生网络之间的离散通信过程,其中语义信息通过固定容量的二进制通道传输。学生网络不是对齐连续特征,而是预测由教师网络生成的多标签二进制消息。通过逐元素的二进制交叉熵目标强制执行离散一致性,同时编码速率正则化项鼓励有效利用受限通道,从而促进结构化表示。我们进一步表明,定期重新初始化投影头通过鼓励在多个离散编码中保持可预测的嵌入来增强这一效果。大量实验表明,在图像分类、检索和密集视觉预测任务中,相较于连续一致性基线,取得了一致的改进,并在自监督适应下应对领域转移。除了主干表示外,我们还分析了学习到的二进制代码,显示它们形成了一种紧凑且信息丰富的离散语言,捕捉了可在类别之间重用的语义因素。
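The discrete-communication abstract describes a student that predicts the teacher's multi-label binary message under an element-wise binary cross-entropy objective. The sketch below illustrates only that loss term. Thresholding teacher logits at zero to form the message is an assumption made here for illustration, and the coding-rate regularizer and projection-head reinitialization mentioned in the abstract are omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binarize(teacher_logits):
    """Turn teacher logits into a binary message (assumed rule: bit = 1 if logit > 0)."""
    return [1 if z > 0 else 0 for z in teacher_logits]


def elementwise_bce(student_logits, message):
    """Mean binary cross-entropy between student bit probabilities and the teacher's bits."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, b in zip(student_logits, message):
        p = sigmoid(z)
        total += -(b * math.log(p + eps) + (1 - b) * math.log(1 - p + eps))
    return total / len(message)


message = binarize([2.0, -1.0, 0.5])
print(message)
print(elementwise_bce([0.0, 0.0, 0.0], message))   # an undecided student
print(elementwise_bce([5.0, -5.0, 5.0], message))  # a student that agrees with the teacher
```

Each bit is scored independently, so the loss treats the message as a set of binary labels and not as a single categorical target.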
cs.CV / 62 / 2602.09775
Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets
图像来源于何处?通过分析标题对数据集进行地理特征描述
Abstract
Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across $20$ common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for $48.0\%$ of samples, while South American and African countries are severely under-represented with only $1.8\%$ and $3.8\%$ of images, respectively. We observe a strong correlation between a country's GDP and its representation in the data ($\rho = 0.82$). Examining non-English subsets for $4$ languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.
Chinese Translation
近期研究表明,文本到图像模型往往无法生成具有地理代表性的图像,这引发了对其训练数据代表性的担忧,并促使我们提出问题:这些训练样本来自世界的哪些地方?我们通过将图像-标题对根据从标题中提取的位置信息映射到国家,来对大规模多模态数据集进行地理特征描述。研究来自三个广泛使用的数据集(Re-LAION、DataComp1B 和 Conceptual Captions)的英文标题,涵盖 $20$ 个常见实体(例如,房屋、国旗),我们发现美国、英国和加拿大占样本的 $48.0\%$,而南美和非洲国家的代表性严重不足,分别仅占 $1.8\%$ 和 $3.8\%$ 的图像。我们观察到一个国家的 GDP 与其在数据中的代表性之间存在强相关性($\rho = 0.82$)。对 Re-LAION 数据集中 $4$ 种语言的非英语子集进行分析时,我们发现代表性严重倾向于这些语言主要使用的国家。此外,我们发现更高的代表性并不一定意味着更大的视觉或语义多样性。最后,通过分析在 Re-LAION 上训练的 Stable Diffusion v1.3 生成的特定国家图像,我们展示了尽管生成的图像看起来逼真,但与真实世界图像相比,其覆盖范围严重受限。
cs.CV / 63 / 2602.09809
SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing
SciFlow-Bench:通过逆解析评估结构感知的科学图表生成
Abstract
Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
Chinese Translation
科学图表传达明确的结构信息,但现代文本到图像模型往往生成视觉上可信但结构上不正确的结果。现有基准要么依赖于以图像为中心的或对结构不敏感的主观指标,要么评估中间符号表示而非最终渲染图像,从而使基于像素的图表生成未得到充分探索。我们提出了SciFlow-Bench,这是一个优先考虑结构的基准,用于直接评估从像素级输出生成的科学图表。SciFlow-Bench基于真实的科学PDF构建,将每个源框架图与标准的真实图配对,并在一个闭环的往返协议下,将生成的图表图像逆解析回结构化图进行比较,从而将模型作为黑箱图像生成器进行评估。这种设计通过结构可恢复性而非仅仅视觉相似性来强制评估,并且得益于一个协调规划、感知和结构推理的分层多智能体系统。实验表明,保持结构正确性仍然是一个基本挑战,特别是对于具有复杂拓扑的图表,这突显了结构感知评估的必要性。
cs.CV / 64 / 2602.09816
CompSplat: Compression-aware 3D Gaussian Splatting for Real-world Video
CompSplat:面向压缩的真实世界视频3D高斯点云渲染
Abstract
High-quality novel view synthesis (NVS) from real-world videos is crucial for applications such as cultural heritage preservation, digital twins, and immersive media. However, real-world videos typically contain long sequences with irregular camera trajectories and unknown poses, leading to pose drift, feature misalignment, and geometric distortion during reconstruction. Moreover, lossy compression amplifies these issues by introducing inconsistencies that gradually degrade geometry and rendering quality. While recent studies have addressed either long-sequence NVS or unposed reconstruction, compression-aware approaches still focus on specific artifacts or limited scenarios, leaving diverse compression patterns in long videos insufficiently explored. In this paper, we propose CompSplat, a compression-aware training framework that explicitly models frame-wise compression characteristics to mitigate inter-frame inconsistency and accumulated geometric errors. CompSplat incorporates compression-aware frame weighting and an adaptive pruning strategy to enhance robustness and geometric consistency, particularly under heavy compression. Extensive experiments on challenging benchmarks, including Tanks and Temples, Free, and Hike, demonstrate that CompSplat achieves state-of-the-art rendering quality and pose accuracy, significantly surpassing most recent state-of-the-art NVS approaches under severe compression conditions.
Chinese Translation
从真实世界视频中进行高质量的新视角合成(NVS)对文化遗产保护、数字双胞胎和沉浸式媒体等应用至关重要。然而,真实世界视频通常包含长序列,具有不规则的相机轨迹和未知的姿态,这导致在重建过程中出现姿态漂移、特征错位和几何失真。此外,有损压缩通过引入不一致性进一步放大了这些问题,逐渐降低几何和渲染质量。尽管近期研究已解决了长序列NVS或无姿态重建的问题,但面向压缩的方法仍然集中于特定伪影或有限场景,长视频中的多样化压缩模式尚未得到充分探索。本文提出了CompSplat,一种面向压缩的训练框架,明确建模逐帧压缩特性,以减轻帧间不一致性和累积几何误差。CompSplat结合了面向压缩的帧加权和自适应剪枝策略,以增强鲁棒性和几何一致性,尤其是在重度压缩下。在包括Tanks and Temples、Free和Hike等具有挑战性的基准测试中的广泛实验表明,CompSplat在严重压缩条件下实现了最先进的渲染质量和姿态准确性,显著超越了大多数近期的最先进NVS方法。
cs.CV / 65 / 2602.09825
SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding
SAKED:通过稳定性意识的知识增强解码减轻大型视觉-语言模型中的幻觉
Abstract
Hallucinations in Large Vision-Language Models (LVLMs) pose significant security and reliability risks in real-world applications. Inspired by the observation that humans are more error-prone when uncertain or hesitant, we investigate how instability in a model's internal knowledge contributes to LVLM hallucinations. We conduct extensive empirical analyses from three perspectives, namely attention heads, model layers, and decoding tokens, and identify three key hallucination patterns: (i) visual activation drift across attention heads, (ii) pronounced knowledge fluctuations across layers, and (iii) visual focus distraction between neighboring output tokens. Building on these findings, we propose Stability-Aware Knowledge-Enhanced Decoding (SAKED), which introduces a layer-wise Knowledge Stability Score (KSS) to quantify knowledge stability throughout the model. By contrasting the most stability-aware and stability-agnostic layers, SAKED suppresses decoding noise and dynamically leverages the most reliable internal knowledge for faithful token generation. Moreover, SAKED is training-free and can be seamlessly integrated into different architectures. Extensive experiments demonstrate that SAKED achieves state-of-the-art performance for hallucination mitigation on various models, tasks, and benchmarks.
Chinese Translation
大型视觉-语言模型(LVLMs)中的幻觉在实际应用中带来了显著的安全性和可靠性风险。受人类在不确定或犹豫时更容易出错的观察启发,我们研究了模型内部知识的不稳定性如何导致LVLM幻觉的产生。我们从注意力头、模型层和解码标记三个角度进行了广泛的实证分析,并识别出三种关键的幻觉模式:(i)注意力头之间的视觉激活漂移,(ii)层之间显著的知识波动,以及(iii)相邻输出标记之间的视觉焦点分散。在这些发现的基础上,我们提出了稳定性意识的知识增强解码(SAKED),该方法引入了一种层级知识稳定性评分(KSS),以量化模型中知识的稳定性。通过对比最具稳定性意识和不考虑稳定性的层,SAKED抑制了解码噪声,并动态利用最可靠的内部知识进行忠实的标记生成。此外,SAKED不需要训练,可以无缝集成到不同的架构中。大量实验表明,SAKED在各种模型、任务和基准上实现了最先进的幻觉减轻性能。
cs.CV / 66 / 2602.09839
ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge
ARK:一个基于推理和知识的双轴多模态检索基准
Abstract
Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi-step reasoning. We evaluate 23 representative text-based and multimodal retrievers on ARK and observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re-ranking and rewriting yield consistent improvements, but substantial headroom remains.
Chinese Translation
现有的多模态检索基准主要强调日常生活图像的语义匹配,对专业知识和复杂推理的诊断能力有限。为了解决这一问题,我们引入了ARK,一个旨在从两个互补视角分析多模态检索的基准:(i) 知识领域(五个领域及17个子类型),用于表征检索所依赖的内容和专业知识;(ii) 推理技能(六个类别),用于表征识别正确候选项所需的多模态证据推理类型。具体而言,ARK评估了单模态和多模态查询及候选项的检索,涵盖16种异构视觉数据类型。为了避免在评估过程中出现捷径匹配,大多数查询与需要多步推理的针对性困难负样本配对。我们在ARK上评估了23个具有代表性的基于文本和多模态的检索器,观察到知识密集型和推理密集型检索之间存在显著差距,细粒度的视觉和空间推理成为持续的瓶颈。我们进一步表明,简单的增强措施如重新排序和重写能够带来持续的改进,但仍然存在显著的提升空间。
cs.CV / 67 / 2602.09843
Kelix Technique Report
Kelix 技术报告
Ding, Boyang, Chu, Chenglong, Zang, Dunju, Li, Han, Cao, Jiangxia, Gai, Kun, Wei, Muhao, Tang, Ruiming, Wang, Shiyao, Mao, Siyang, Luo, Xinchen, Liu, Yahui, Ling, Zhixin, Yang, Zhuoran, Li, Ziming, Song, Chengru, Zhou, Guorui, Zhang, Guowang, Peng, Hao, Wang, Hao, Deng, Jiaxin, Ouyang, Jin, Zhang, Jinghao, Ren, Lejian, Wang, Qianqian, Hu, Qigen, Wang, Tao, Wang, Xingmei, Yang, Yiping, Zhang, Zixing, Wang, Ziqi
Abstract
Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
Chinese Translation
自回归大型语言模型(LLMs)通过将多样化任务表达为离散自然语言标记的序列,并通过下一个标记预测进行训练,从而实现良好的扩展性,这在自我监督下统一了理解与生成。将这一范式扩展到多模态数据需要跨模态的共享离散表示。然而,大多数视觉语言模型(VLMs)仍依赖于混合接口:离散文本标记与连续视觉变换器(ViT)特征的结合。由于监督主要依赖文本,这些模型往往偏向于理解,无法充分利用非文本数据的大规模自我监督学习。近期的研究探索了离散视觉标记化,以实现完全自回归的多模态建模,显示出在统一理解与生成方面的良好进展。然而,现有的离散视觉标记由于编码容量有限,常常会丢失信息,导致其理解能力明显弱于连续特征的 VLMs。我们提出了 Kelix,这是一种完全离散的自回归统一模型,缩小了离散与连续视觉表示之间的理解差距。
cs.CV / 68 / 2602.09850
Reason-IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection
Reason-IAD:基于知识引导的动态潜在推理用于可解释的工业异常检测
Abstract
Industrial anomaly detection demands precise reasoning over fine-grained defect patterns. However, existing multimodal large language models (MLLMs), pretrained on general-domain data, often struggle to capture category-specific anomalies, thereby limiting both detection accuracy and interpretability. To address these limitations, we propose Reason-IAD, a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection. Reason-IAD comprises two core components. First, a retrieval-augmented knowledge module incorporates category-specific textual descriptions into the model input, enabling context-aware reasoning over domain-specific defects. Second, an entropy-driven latent reasoning mechanism conducts iterative exploration within a compact latent space using optimizable latent think tokens, guided by an entropy-based reward that encourages confident and stable predictions. Furthermore, a dynamic visual injection strategy selectively incorporates the most informative image patches into the latent sequence, directing the reasoning process toward regions critical for anomaly detection. Extensive experimental results demonstrate that Reason-IAD consistently outperforms state-of-the-art methods. The code will be publicly available at https://github.com/chenpeng052/Reason-IAD.
Chinese Translation
工业异常检测要求对细粒度缺陷模式进行精确推理。然而,现有的多模态大型语言模型(MLLMs)在通用领域数据上进行预训练,往往难以捕捉特定类别的异常,从而限制了检测的准确性和可解释性。为了解决这些局限性,我们提出了Reason-IAD,一个基于知识引导的动态潜在推理框架,用于可解释的工业异常检测。Reason-IAD包含两个核心组件。首先,一个增强检索的知识模块将特定类别的文本描述纳入模型输入,使得能够对领域特定缺陷进行上下文感知的推理。其次,一个基于熵的潜在推理机制在紧凑的潜在空间内进行迭代探索,使用可优化的潜在思维令牌,并通过基于熵的奖励引导,鼓励自信和稳定的预测。此外,一个动态视觉注入策略选择性地将最具信息量的图像块纳入潜在序列,引导推理过程朝向对异常检测至关重要的区域。大量实验结果表明,Reason-IAD在性能上始终优于最先进的方法。代码将公开发布在 https://github.com/chenpeng052/Reason-IAD。
cs.CV / 69 / 2602.09856
Code2World: A GUI World Model via Renderable Code Generation
Code2World:通过可渲染代码生成的图形用户界面世界模型
Abstract
Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render-Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP-ML/Code2World.
Chinese Translation
自主图形用户界面(GUI)代理通过感知界面和执行动作与环境进行交互。作为一个虚拟沙盒,GUI世界模型通过实现基于动作的预测,使代理具备类人前瞻性。然而,现有的基于文本和像素的方法在同时实现高视觉保真度和细粒度结构可控性方面面临挑战。为此,我们提出了Code2World,一种通过可渲染代码生成模拟下一个视觉状态的视觉-语言代码生成模型。具体而言,为了解决数据稀缺问题,我们通过将GUI轨迹转换为高保真的HTML,并通过视觉反馈修订机制精炼合成代码,构建了AndroidCode,得到了一个包含超过80K个高质量屏幕-动作对的语料库。为了将现有的视觉-语言模型(VLM)适配到代码预测,我们首先进行监督微调(SFT)作为冷启动,使模型学会遵循格式与布局,然后进一步应用渲染感知强化学习(Render-Aware Reinforcement Learning),该方法通过强制视觉语义保真度和动作一致性,将渲染结果作为奖励信号。大量实验表明,Code2World-8B在下一个用户界面预测中表现最佳,媲美具有竞争力的GPT-5和Gemini-3-Pro-Image。值得注意的是,Code2World以灵活的方式显著提高了下游导航的成功率,在AndroidWorld导航中使Gemini-2.5-Flash提升了9.5%。代码可在https://github.com/AMAP-ML/Code2World获取。
cs.CV / 70 / 2602.09868
Free-GVC: Towards Training-Free Extreme Generative Video Compression with Temporal Coherence
Free-GVC:朝着无训练的极端生成视频压缩与时间一致性迈进
Abstract
Building on recent advances in video generation, generative video compression has emerged as a new paradigm for achieving visually pleasing reconstructions. However, existing methods exhibit limited exploitation of temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. In this paper, we propose Free-GVC, a training-free generative video compression framework that reformulates video coding as latent trajectory compression guided by a video diffusion prior. Our method operates at the group-of-pictures (GOP) level, encoding video segments into a compact latent space and progressively compressing them along the diffusion trajectory. To ensure perceptually consistent reconstruction across GOPs, we introduce an Adaptive Quality Control module that dynamically constructs an online rate-perception surrogate model to predict the optimal diffusion step for each GOP. In addition, an Inter-GOP Alignment module establishes frame overlap and performs latent fusion between adjacent groups, thereby mitigating flicker and enhancing temporal coherence. Experiments show that Free-GVC achieves an average of 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT, and a user study further confirms its superior perceptual quality and temporal coherence at ultra-low bitrates.
Chinese Translation
基于近期视频生成的进展,生成视频压缩作为一种新范式,已成为实现视觉上令人愉悦重建的有效方法。然而,现有方法对时间相关性的利用有限,导致在超低比特率下出现明显的闪烁和时间一致性下降。本文提出了Free-GVC,一种无训练的生成视频压缩框架,将视频编码重新表述为由视频扩散先验引导的潜在轨迹压缩。我们的方法在图像组(GOP)级别上操作,将视频片段编码为紧凑的潜在空间,并沿着扩散轨迹逐步压缩它们。为了确保跨GOP的感知一致重建,我们引入了自适应质量控制模块,该模块动态构建在线速率-感知替代模型,以预测每个GOP的最佳扩散步骤。此外,跨GOP对齐模块建立帧重叠并在相邻组之间执行潜在融合,从而减轻闪烁并增强时间一致性。实验表明,Free-GVC在DISTS上相较于最新的神经编解码器DCVC-RT实现了平均93.29%的BD-Rate降低,用户研究进一步确认了其在超低比特率下的优越感知质量和时间一致性。
cs.CV / 71 / 2602.09872
BabyMamba-HAR: Lightweight Selective State Space Models for Efficient Human Activity Recognition on Resource Constrained Devices
BabyMamba-HAR:用于资源受限设备上高效人类活动识别的轻量级选择性状态空间模型
Abstract
Human activity recognition (HAR) on wearable and mobile devices is constrained by memory footprint and computational budget, yet competitive accuracy must be maintained across heterogeneous sensor configurations. Selective state space models (SSMs) offer linear time sequence processing with input dependent gating, presenting a compelling alternative to quadratic complexity attention mechanisms. However, the design space for deploying SSMs in the TinyML regime remains largely unexplored. In this paper, BabyMamba-HAR is introduced, a framework comprising two novel lightweight Mamba inspired architectures optimized for resource constrained HAR: (1) CI-BabyMamba-HAR, using a channel independent stem that processes each sensor channel through shared weight, but instance independent transformations to prevent cross channel noise propagation, and (2) Crossover-BiDir-BabyMamba-HAR, using an early fusion stem that achieves channel count independent computational complexity. Both variants incorporate weight tied bidirectional scanning and lightweight temporal attention pooling. Through evaluation across eight diverse benchmarks, it is demonstrated that Crossover-BiDir-BabyMamba-HAR achieves 86.52% average macro F1-score with approximately 27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high channel datasets. Systematic ablation studies reveal that bidirectional scanning contributes up to 8.42% F1-score improvement, and gated temporal attention provides up to 8.94% F1-score gain over mean pooling. These findings establish practical design principles for deploying selective state space models as efficient TinyML backbones for HAR.
Chinese Translation
在可穿戴和移动设备上进行人类活动识别(HAR)受到内存占用和计算预算的限制,但必须在异构传感器配置中保持竞争性的准确性。选择性状态空间模型(SSMs)提供了线性时间序列处理和输入依赖的门控,成为二次复杂度注意力机制的有力替代。然而,在TinyML领域中部署SSMs的设计空间在很大程度上仍未被探索。本文介绍了BabyMamba-HAR,这是一个包含两种新颖的轻量级Mamba灵感架构的框架,针对资源受限的HAR进行了优化:(1)CI-BabyMamba-HAR,使用通道独立的主干,通过共享权重处理每个传感器通道,但采用实例独立的变换以防止跨通道噪声传播;(2)Crossover-BiDir-BabyMamba-HAR,使用早期融合主干,实现了通道数量独立的计算复杂度。这两种变体都结合了权重绑定的双向扫描和轻量级时间注意力池化。通过在八个不同基准上的评估,证明Crossover-BiDir-BabyMamba-HAR在约27K参数和2.21M MACs下达到了86.52%的平均宏F1分数,匹配TinyHAR(86.16%),同时在高通道数据集上需要的MACs少了11倍。系统的消融研究表明,双向扫描贡献了高达8.42%的F1分数提升,而门控时间注意力提供了比均值池化高达8.94%的F1分数增益。这些发现为将选择性状态空间模型作为高效TinyML骨干网用于HAR的实际设计原则奠定了基础。
cs.CV / 72 / 2602.09878
MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation
MVISTA-4D:具有测试时动作推断的视图一致性4D世界模型用于机器人操控
Abstract
World-model-based imagine-then-act becomes a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
Chinese Translation
基于世界模型的想象-再行动成为机器人操控的一个有前景的范式,然而现有的方法通常仅支持纯粹基于图像的预测或对部分3D几何体的推理,限制了其预测完整4D场景动态的能力。本研究提出了一种新颖的具身4D世界模型,能够实现几何一致的任意视角RGBD生成:该模型仅以单视角RGBD观测作为输入,想象出其余的视角,然后可以将其反投影并融合,以在时间上组装出更完整的3D结构。为了有效学习多视角、跨模态的生成,我们明确设计了跨视角和跨模态特征融合,联合鼓励RGB与深度之间的一致性,并在视角之间强制几何对齐。除了预测,将生成的未来转化为动作通常由逆动力学处理,但这是一种病态问题,因为多种动作可以解释同一过渡。我们通过一种测试时动作优化策略解决了这个问题,该策略通过生成模型反向传播,以推断与预测未来最佳匹配的轨迹级潜在变量,并利用残差逆动力学模型将该轨迹先验转化为准确的可执行动作。在三个数据集上的实验表明,在4D场景生成和下游操控方面表现出强劲的性能,消融实验提供了对关键设计选择的实用见解。
cs.CV / 73 / 2602.09883
AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization
AdaTSQ:通过时间敏感量化推动扩散变换器的帕累托前沿
Abstract
Diffusion Transformers (DiTs) have emerged as the state-of-the-art backbone for high-fidelity image and video generation. However, their massive computational cost and memory footprint hinder deployment on edge devices. While post-training quantization (PTQ) has proven effective for large language models (LLMs), directly applying existing methods to DiTs yields suboptimal results due to the neglect of the unique temporal dynamics inherent in diffusion processes. In this paper, we propose AdaTSQ, a novel PTQ framework that pushes the Pareto frontier of efficiency and quality by exploiting the temporal sensitivity of DiTs. First, we propose a Pareto-aware timestep-dynamic bit-width allocation strategy. We model the quantization policy search as a constrained pathfinding problem. We utilize a beam search algorithm guided by end-to-end reconstruction error to dynamically assign layer-wise bit-widths across different timesteps. Second, we propose a Fisher-guided temporal calibration mechanism. It leverages temporal Fisher information to prioritize calibration data from highly sensitive timesteps, seamlessly integrating with Hessian-based weight optimization. Extensive experiments on four advanced DiTs (e.g., Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1) demonstrate that AdaTSQ significantly outperforms state-of-the-art methods like SVDQuant and ViDiT-Q. Our code will be released at https://github.com/Qiushao-E/AdaTSQ.
Chinese Translation
扩散变换器(Diffusion Transformers, DiTs)已成为高保真图像和视频生成的最先进骨干网络。然而,它们巨大的计算成本和内存占用阻碍了在边缘设备上的部署。尽管后训练量化(Post-Training Quantization, PTQ)已被证明对大型语言模型(Large Language Models, LLMs)有效,但直接将现有方法应用于 DiTs 会因忽视扩散过程固有的独特时间动态而导致次优结果。在本文中,我们提出了 AdaTSQ,一种新颖的 PTQ 框架,通过利用 DiTs 的时间敏感性,推动效率和质量的帕累托前沿。首先,我们提出了一种关注帕累托的时间步动态位宽分配策略。我们将量化策略搜索建模为一个受约束的路径寻找问题。我们利用一种以端到端重建误差为指导的束搜索算法,动态地为不同时间步分配层级位宽。其次,我们提出了一种以费舍尔信息为指导的时间校准机制。它利用时间费舍尔信息优先考虑来自高度敏感时间步的校准数据,与基于海森矩阵的权重优化无缝集成。在四种先进的 DiTs(如 Flux-Dev、Flux-Schnell、Z-Image 和 Wan2.1)上的大量实验表明,AdaTSQ 显著优于 SVDQuant 和 ViDiT-Q 等最先进方法。我们的代码将发布在 https://github.com/Qiushao-E/AdaTSQ。
cs.CV / 74 / 2602.09918
SARS: A Novel Face and Body Shape and Appearance Aware 3D Reconstruction System extends Morphable Models
SARS:一种新颖的面部和身体形状及外观感知的3D重建系统,扩展了可变形模型
Abstract
Morphable Models (3DMMs) are a type of morphable model that takes 2D images as inputs and recreates the structure and physical appearance of 3D objects, especially human faces and bodies. 3DMM combines identity and expression blendshapes with a basic face mesh to create a detailed 3D model. The variability in the 3D Morphable models can be controlled by tuning diverse parameters. They are high-level image descriptors, such as shape, texture, illumination, and camera parameters. Previous research in 3D human reconstruction concentrated solely on global face structure or geometry, ignoring face semantic features such as age, gender, and facial landmarks characterizing facial boundaries, curves, dips, and wrinkles. In order to accommodate changes in these high-level facial characteristics, this work introduces a shape and appearance-aware 3D reconstruction system (named SARS by us), a modular pipeline that extracts body and face information from a single image to properly rebuild the 3D model of the human full body.
Chinese Translation
可变形模型(3DMMs)是一种可变形模型,它以2D图像为输入,重建3D物体的结构和物理外观,特别是人脸和身体。3DMM结合了身份和表情混合形状与基本面部网格,以创建详细的3D模型。3D可变形模型的变异性可以通过调整多种参数来控制。它们是高层次的图像描述符,例如形状、纹理、光照和相机参数。之前的3D人类重建研究仅集中于全局面部结构或几何形状,忽略了面部语义特征,如年龄、性别以及表征面部边界、曲线、凹陷和皱纹的面部特征点。为了适应这些高层次面部特征的变化,本研究提出了一种形状和外观感知的3D重建系统(我们称之为SARS),这是一个模块化管道,可以从单幅图像中提取身体和面部信息,以正确重建人类全身的3D模型。
cs.CV / 75 / 2602.09927
A benchmark for video-based laparoscopic skill analysis and assessment
基于视频的腹腔镜技能分析与评估基准
Abstract
Laparoscopic surgery is a complex surgical technique that requires extensive training. Recent advances in deep learning have shown promise in supporting this training by enabling automatic video-based assessment of surgical skills. However, the development and evaluation of deep learning models is currently hindered by the limited size of available annotated datasets. To address this gap, we introduce the Laparoscopic Skill Analysis and Assessment (LASANA) dataset, comprising 1270 stereo video recordings of four basic laparoscopic training tasks. Each recording is annotated with a structured skill rating, aggregated from three independent raters, as well as binary labels indicating the presence or absence of task-specific errors. The majority of recordings originate from a laparoscopic training course, thereby reflecting a natural variation in the skill of participants. To facilitate benchmarking of both existing and novel approaches for video-based skill assessment and error recognition, we provide predefined data splits for each task. Furthermore, we present baseline results from a deep learning model as a reference point for future comparisons.
Chinese Translation
腹腔镜手术是一种复杂的外科技术,需要广泛的培训。近期深度学习的进展显示出在支持这一培训方面的潜力,能够实现对外科技能的自动视频评估。然而,深度学习模型的开发与评估目前受到可用标注数据集规模有限的制约。为了解决这一问题,我们引入了腹腔镜技能分析与评估数据集(Laparoscopic Skill Analysis and Assessment, LASANA),该数据集包含1270个四个基本腹腔镜培训任务的立体视频录制。每个录制都附有结构化的技能评分,由三位独立评审者汇总而成,并且标注了指示特定任务错误存在或不存在的二元标签。大多数录制源自腹腔镜培训课程,因此反映了参与者技能的自然变异。为了促进现有和新方法在基于视频的技能评估和错误识别方面的基准测试,我们为每个任务提供了预定义的数据划分。此外,我们还提供了深度学习模型的基线结果,作为未来比较的参考点。
cs.CV / 76 / 2602.09929
Monocular Normal Estimation via Shading Sequence Estimation
通过阴影序列估计进行单目法线估计
Abstract
Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lights. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have a correct appearance, the reconstructed surfaces often fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, as the differences in underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to various geometric information. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and light conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.
Chinese Translation
单目法线估计旨在从物体在任意光照下的单张RGB图像中估计法线图。现有方法依赖深度模型直接预测法线图。然而,它们往往面临三维不对齐的问题:尽管估计的法线图可能在外观上看起来正确,但重建的表面往往无法与几何细节对齐。我们认为这种不对齐源于当前的范式:模型难以区分和重建在法线图中表示的不同几何形状,因为潜在几何形状的差异仅通过相对微妙的颜色变化反映出来。为了解决这个问题,我们提出了一种新的范式,将法线估计重新表述为阴影序列估计,其中阴影序列对各种几何信息更为敏感。在此范式的基础上,我们提出了RoSE,一种利用图像到视频生成模型预测阴影序列的方法。预测的阴影序列随后通过解决一个简单的普通最小二乘问题转换为法线图。为了增强鲁棒性并更好地处理复杂物体,RoSE在一个合成数据集MultiShade上进行训练,该数据集具有多样的形状、材料和光照条件。实验表明,RoSE在基于物体的单目法线估计的真实世界基准数据集上达到了最先进的性能。
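The final step described in the RoSE abstract above, converting a predicted shading sequence into a normal map by ordinary least squares, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a Lambertian shading model, known unit light directions for each frame, and no shadow clamping. The function name `normals_from_shading` and its arguments are hypothetical.

```python
import numpy as np

def normals_from_shading(shading, lights):
    """Recover per-pixel unit normals from a shading sequence.

    shading: (K, H, W) array, one shading frame per light (assumed Lambertian).
    lights:  (K, 3) array of unit light directions (assumed known).
    Returns an (H, W, 3) array of unit normals.
    """
    K, H, W = shading.shape
    S = shading.reshape(K, -1)                       # (K, H*W) stacked pixels
    # Solve lights @ G ~= S for G in the least-squares sense, all pixels at once.
    G, *_ = np.linalg.lstsq(lights, S, rcond=None)   # (3, H*W)
    G = G.T.reshape(H, W, 3)
    norm = np.linalg.norm(G, axis=-1, keepdims=True)
    return G / np.maximum(norm, 1e-12)               # normalize to unit length
```

With at least three non-coplanar light directions the system is fully determined per pixel; more frames over-determine it, which is where least squares averages out prediction noise.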
cs.CV / 77 / 2602.09932
GeoFormer: A Swin Transformer-Based Framework for Scene-Level Building Height and Footprint Estimation from Sentinel Imagery
GeoFormer:基于Swin Transformer的框架,用于从Sentinel影像中进行场景级建筑高度和占地面积估计
Abstract
Accurate three-dimensional urban data are critical for climate modelling, disaster risk assessment, and urban planning, yet remain scarce due to reliance on proprietary sensors or poor cross-city generalisation. We propose GeoFormer, an open-source Swin Transformer framework that jointly estimates building height (BH) and footprint (BF) on a 100 m grid using only Sentinel-1/2 imagery and open DEM data. A geo-blocked splitting strategy ensures strict spatial independence between training and test sets. Evaluated over 54 diverse cities, GeoFormer achieves a BH RMSE of 3.19 m and a BF RMSE of 0.05, improving 7.5% and 15.3% over the strongest CNN baseline, while maintaining under 3.5 m BH RMSE in cross-continent transfer. Ablation studies confirm that DEM is indispensable for height estimation and that optical reflectance dominates over SAR, though multi-source fusion yields the best overall accuracy. All code, weights, and global products are publicly released.
Chinese Translation
准确的三维城市数据对于气候建模、灾害风险评估和城市规划至关重要,但由于依赖专有传感器或跨城市的泛化能力差,这类数据仍然稀缺。我们提出了GeoFormer,一个开源的Swin Transformer框架,利用Sentinel-1/2影像和开放的数字高程模型(DEM)数据,在100米网格上联合估计建筑高度(BH)和占地面积(BF)。地理块分割策略确保了训练集和测试集之间的严格空间独立性。在54个多样化城市的评估中,GeoFormer实现了3.19米的BH均方根误差(RMSE)和0.05的BF均方根误差,分别比最强的卷积神经网络(CNN)基线提高了7.5%和15.3%,同时在跨洲转移中保持了低于3.5米的BH均方根误差。消融研究确认了DEM在高度估计中的不可或缺性,光学反射率优于合成孔径雷达(SAR),尽管多源融合提供了最佳的整体准确性。所有代码、权重和全球产品均已公开发布。
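The geo-blocked splitting strategy mentioned in the GeoFormer abstract above can be illustrated with a small sketch. The abstract does not give the block size or the assignment rule, so everything here is an assumption: samples are binned into square blocks by projected coordinates, and whole blocks (never individual samples) are assigned to train or test. The name `geo_blocked_split` and its parameters are hypothetical.

```python
import random

def geo_blocked_split(coords, block_size_m=10_000.0, test_frac=0.2, seed=0):
    """Split sample indices so that no spatial block contributes to both sets.

    coords: list of (x, y) positions in metres (assumed projected coordinates).
    Returns (train_indices, test_indices).
    """
    blocks = {}
    for idx, (x, y) in enumerate(coords):
        # Floor division bins each sample into a square block.
        key = (int(x // block_size_m), int(y // block_size_m))
        blocks.setdefault(key, []).append(idx)
    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)                # reproducible block order
    n_test = max(1, round(test_frac * len(keys)))
    train = [i for k in keys[n_test:] for i in blocks[k]]
    test = [i for k in keys[:n_test] for i in blocks[k]]
    return train, test
```

Neighbouring blocks still share a border in this sketch; a stricter variant might drop a buffer strip between train and test blocks.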
cs.CV / 78 / 2602.09933
Unbalanced optimal transport for robust longitudinal lesion evolution with registration-aware and appearance-guided priors
基于不平衡最优传输的稳健纵向病变演变方法:考虑配准和外观引导的先验
Abstract
Evaluating lesion evolution in longitudinal CT scans of cancer patients is essential for assessing treatment response, yet establishing reliable lesion correspondence across time remains challenging. Standard bipartite matchers, which rely on geometric proximity, struggle when lesions appear, disappear, merge, or split. We propose a registration-aware matcher based on unbalanced optimal transport (UOT) that accommodates unequal lesion mass and adapts priors to patient-level tumor-load changes. Our transport cost blends (i) size-normalized geometry, (ii) local registration trust from the deformation-field Jacobian, and (iii) optional patch-level appearance consistency. The resulting transport plan is sparsified by relative pruning, yielding one-to-one matches as well as new, disappearing, merging, and splitting lesions without retraining or heuristic rules. On longitudinal CT data, our approach achieves consistently higher edge-detection precision and recall, improved lesion-state recall, and superior lesion-graph component F1 scores versus distance-only baselines.
Chinese Translation
评估癌症患者纵向CT扫描中的病变演变对于评估治疗反应至关重要,但在时间上建立可靠的病变对应关系仍然具有挑战性。标准的二分匹配器依赖于几何接近性,当病变出现、消失、合并或分裂时,往往难以有效工作。我们提出了一种基于不平衡最优传输(UOT)的配准感知匹配器,能够适应不等的病变质量,并根据患者级别的肿瘤负荷变化调整先验。我们的传输成本结合了(i)大小归一化的几何形状,(ii)来自变形场雅可比矩阵的局部配准信任,以及(iii)可选的补丁级外观一致性。通过相对修剪,得到的传输计划实现了稀疏化,生成了一对一的匹配,同时也能够处理新出现、消失、合并和分裂的病变,而无需重新训练或启发式规则。在纵向CT数据上,我们的方法在边缘检测精度和召回率、病变状态召回率以及病变图组件F1分数方面均优于仅基于距离的基线方法,表现出一致的提升。
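To make the matching idea in the abstract above concrete, here is a minimal sketch of entropic unbalanced optimal transport followed by a relative pruning step. It is not the paper's method: the cost uses geometry only (no registration trust or appearance terms), the solver is the generic KL-relaxed Sinkhorn iteration, and the pruning rule, the parameter values, and the lesion positions are all assumptions chosen for illustration.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, cost, eps=0.1, rho=1.0, n_iter=500):
    """Entropic unbalanced OT with KL-relaxed marginals (generic textbook form)."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)  # < 1, so marginals are only softly enforced
    for _ in range(n_iter):
        u = (a / (kernel @ v)) ** power
        v = (b / (kernel.T @ u)) ** power
    return u[:, None] * kernel * v[None, :]

def prune_relative(plan, tau=0.5):
    """Keep edge (i, j) when its mass is at least tau times the largest mass in row i."""
    row_max = plan.max(axis=1, keepdims=True)
    keep = plan >= tau * row_max
    return [(int(i), int(j)) for i, j in zip(*np.nonzero(keep))]

# Hypothetical 1D lesion centroids (mm): two baseline lesions, three at follow-up.
baseline = np.array([0.0, 10.0])
followup = np.array([0.5, 10.2, 50.0])
cost = (baseline[:, None] - followup[None, :]) ** 2 / 100.0
plan = unbalanced_sinkhorn(np.ones(2), np.ones(3), cost)
matches = prune_relative(plan)
matched_followup = {j for _, j in matches}
new_lesions = [j for j in range(len(followup)) if j not in matched_followup]
```

Because the marginals are relaxed rather than enforced, the distant follow-up lesion receives almost no mass instead of being forced onto a baseline lesion, so it can be flagged as new.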
cs.CV / 79 / 2602.09934
VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization
VersaViT:通过任务引导优化增强多模态大语言模型视觉主干
Abstract
Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training. This framework facilitates the optimization of the vision backbone via lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.
Chinese Translation
多模态大语言模型(MLLMs)最近在视觉语言理解方面取得了显著成功,展示了其视觉编码器在高层语义对齐方面的优越性。因此,一个重要的问题随之而来:这些编码器能否作为多功能视觉主干,可靠地执行经典的以视觉为中心的任务?为了解决这个问题,我们做出了以下贡献:(i)我们发现MLLMs中的视觉编码器在其密集特征表示方面存在不足,表现为在密集预测任务(例如,语义分割、深度估计)上的次优性能;(ii)我们提出了VersaViT,一个全面的视觉变换器,构建了一个新颖的多任务框架以实现协同后训练。该框架通过具有多粒度监督的轻量级任务头促进视觉主干的优化;(iii)在各种下游任务上的广泛实验表明我们的方法的有效性,产生了一个适用于语言介导推理和像素级理解的多功能视觉主干。
cs.CV / 80 / 2602.09949
Bladder Vessel Segmentation using a Hybrid Attention-Convolution Framework
基于混合注意力-卷积框架的膀胱血管分割
Abstract
Urinary bladder cancer surveillance requires tracking tumor sites across repeated interventions, yet the deformable and hollow bladder lacks stable landmarks for orientation. While blood vessels visible during endoscopy offer a patient-specific "vascular fingerprint" for navigation, automated segmentation is challenged by imperfect endoscopic data, including sparse labels, artifacts like bubbles or variable lighting, continuous deformation, and mucosal folds that mimic vessels. State-of-the-art vessel segmentation methods often fail to address these domain-specific complexities. We introduce a Hybrid Attention-Convolution (HAC) architecture that combines Transformers to capture global vessel topology prior with a CNN that learns a residual refinement map to precisely recover thin-vessel details. To prioritize structural connectivity, the Transformer is trained on optimized ground truth data that exclude short and terminal branches. Furthermore, to address data scarcity, we employ a physics-aware pretraining, that is a self-supervised strategy using clinically grounded augmentations on unlabeled data. Evaluated on the BlaVeS dataset, consisting of endoscopic video frames, our approach achieves high accuracy (0.94) and superior precision (0.61) and clDice (0.66) compared to state-of-the-art medical segmentation models. Crucially, our method successfully suppresses false positives from mucosal folds that dynamically appear and vanish as the bladder fills and empties during surgery. Hence, HAC provides the reliable structural stability required for clinical navigation.
Chinese Translation
膀胱癌的监测需要在多次干预中跟踪肿瘤位置,但可变形且中空的膀胱缺乏稳定的定位标志。虽然在内窥镜检查中可见的血管提供了患者特异性的“血管指纹”用于导航,但自动分割面临着不完美的内窥镜数据的挑战,包括稀疏标签、气泡等伪影、光照变化、持续变形以及模仿血管的黏膜褶皱。最先进的血管分割方法往往未能解决这些特定领域的复杂性。我们提出了一种混合注意力-卷积(Hybrid Attention-Convolution, HAC)架构,该架构结合了变换器(Transformers)以捕捉全局血管拓扑信息,并与卷积神经网络(CNN)相结合,学习残差细化图以精确恢复细小血管的细节。为了优先考虑结构连接性,变换器在优化的真实数据上进行训练,这些数据排除了短小和末端分支。此外,为了解决数据稀缺问题,我们采用了一种物理感知的预训练方法,即使用临床基础的增强技术对未标记数据进行自监督策略。在包含内窥镜视频帧的BlaVeS数据集上进行评估,我们的方法在准确性(0.94)、精确度(0.61)和重叠度(clDice, 0.66)方面优于最先进的医学分割模型。关键是,我们的方法成功抑制了在手术过程中膀胱充盈和排空时动态出现和消失的黏膜褶皱所带来的假阳性。因此,HAC提供了临床导航所需的可靠结构稳定性。
cs.CV / 81 / 2602.09979
Learning to Detect Baked Goods with Limited Supervision
在有限监督下学习检测烘焙食品
Abstract
Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose automating this process using an object detection model to identify baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenges of deploying computer vision in industries, where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two training workflows to train an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed-accuracy tradeoff. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Fine-tuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows trains a model that surpasses our fully-supervised baseline model under non-ideal deployment conditions, despite relying only on image-level supervision.
Chinese Translation
监测剩余产品提供了宝贵的见解,可用于优化未来的生产。这对于德国面包店尤其重要,因为新鲜烘焙食品的保质期非常短。自动化这一过程可以降低人工成本,提高准确性,并简化操作。我们提出使用物体检测模型自动识别图像中的烘焙食品。然而,德国烘焙食品的多样性使得完全监督的训练成本过高,从而限制了可扩展性。尽管开放词汇检测器(如 OWLv2、Grounding DINO)提供了灵活性,但我们证明它们对于我们的任务是不够的。虽然我们的工作受到面包店的启发,但也解决了在行业中部署计算机视觉的更广泛挑战,其中任务是专业化的,注释数据集稀缺。我们编制了具有不同监督级别的数据集划分,涵盖19类烘焙食品。我们提出了两种训练工作流程,以有限监督训练物体检测模型。首先,我们结合 OWLv2 和 Grounding DINO 定位与图像级监督,以弱监督的方式训练模型。其次,我们通过对使用 Segment Anything 2 注释的视频帧进行微调来提高视角鲁棒性,将其作为伪标签传播模型。利用这些工作流程,我们选择 YOLOv11 进行检测任务,因为它在速度和准确性之间具有良好的权衡。仅依赖图像级监督,模型的平均精度均值(mAP)达到了0.91。在非理想部署条件下,使用伪标签微调使模型性能提高了19.3%。结合这些工作流程训练的模型在非理想部署条件下超越了我们的完全监督基线模型,尽管仅依赖于图像级监督。
cs.CV / 82 / 2602.09983
Coupled Inference in Diffusion Models for Semantic Decomposition
扩散模型中的耦合推理用于语义分解
Abstract
Many visual scenes can be described as compositions of latent factors. Effective recognition, reasoning, and editing often require not only forming such compositional representations, but also solving the decomposition problem. One popular choice for constructing these representations is through the binding operation. Resonator networks, which can be understood as coupled Hopfield networks, were proposed as a way to perform decomposition on such bound representations. Recent works have shown notable similarities between Hopfield networks and diffusion models. Motivated by these observations, we introduce a framework for semantic decomposition using coupled inference in diffusion models. Our method frames semantic decomposition as an inverse problem and couples the diffusion processes using a reconstruction-driven guidance term that encourages the composition of factor estimates to match the bound vector. We also introduce a novel iterative sampling scheme that improves the performance of our model. Finally, we show that attention-based resonator networks are a special case of our framework. Empirically, we demonstrate that our coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.
Chinese Translation
许多视觉场景可以被描述为潜在因素的组合。有效的识别、推理和编辑通常不仅需要形成这样的组合表示,还需要解决分解问题。构建这些表示的一种流行选择是通过绑定操作。共振网络可以理解为耦合的霍普菲尔德网络,被提出作为对这些绑定表示进行分解的一种方法。最近的研究显示霍普菲尔德网络与扩散模型之间存在显著的相似性。基于这些观察,我们引入了一个使用扩散模型中的耦合推理进行语义分解的框架。我们的方法将语义分解表述为一个逆问题,并使用重建驱动的引导项耦合扩散过程,鼓励因素估计的组合与绑定向量相匹配。我们还引入了一种新颖的迭代采样方案,以提高我们模型的性能。最后,我们展示了基于注意力的共振网络是我们框架的一个特例。通过实证,我们证明了我们的耦合推理框架在一系列合成语义分解任务中优于共振网络。
cs.CV / 83 / 2602.09989
Efficient Special Stain Classification
高效特殊染色分类
Abstract
Stains are essential in histopathology to visualize specific tissue characteristics, with Haematoxylin and Eosin (H&E) serving as the clinical standard. However, pathologists frequently utilize a variety of special stains for the diagnosis of specific morphologies. Maintaining accurate metadata for these slides is critical for quality control in clinical archives and for the integrity of computational pathology datasets. In this work, we compare two approaches for automated classification of stains using whole slide images, covering the 14 most commonly used special stains in our institute alongside standard and frozen-section H&E. We evaluate a Multi-Instance Learning (MIL) pipeline and a proposed lightweight thumbnail-based approach. On internal test data, MIL achieved the highest performance (macro F1: 0.941 for 16 classes; 0.969 for 14 merged classes), while the thumbnail approach remained competitive (0.897 and 0.953, respectively). On external TCGA data, the thumbnail model generalized best (weighted F1: 0.843 vs. 0.807 for MIL). The thumbnail approach also increased throughput by two orders of magnitude (5.635 vs. 0.018 slides/s for MIL with all patches). We conclude that thumbnail-based classification provides a scalable and robust solution for routine visual quality control in digital pathology workflows.
Chinese Translation
染色在组织病理学中对于可视化特定组织特征至关重要,其中苏木精-伊红(H&E)作为临床标准。然而,病理学家经常使用多种特殊染色来诊断特定的形态特征。为这些切片维护准确的元数据对于临床档案的质量控制和计算病理数据集的完整性至关重要。在本研究中,我们比较了两种基于全切片图像的染色自动分类方法,涵盖了我们机构中最常用的14种特殊染色以及标准和冷冻切片H&E。我们评估了多实例学习(Multi-Instance Learning, MIL)管道和一种提出的轻量级缩略图方法。在内部测试数据上,MIL达到了最高性能(宏F1: 16类为0.941;14合并类为0.969),而缩略图方法也保持了竞争力(分别为0.897和0.953)。在外部TCGA数据上,缩略图模型的泛化能力最佳(加权F1: 0.843对比MIL的0.807)。缩略图方法还将吞吐量提高了两个数量级(5.635 切片/秒,而使用全部图像块的MIL为0.018 切片/秒)。我们得出结论,基于缩略图的分类为数字病理工作流程中的常规视觉质量控制提供了一种可扩展且稳健的解决方案。
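The stain-classification abstract above reports macro F1 on internal data and weighted F1 on external data. The difference between the two averages can be sketched in plain Python as below. The class labels and predictions are made-up toy values, not data from the paper, and the helper names are hypothetical.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 score for each label, computed from true/false positives and false negatives."""
    f1 = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    return f1

def macro_f1(y_true, y_pred):
    """Unweighted mean over classes: rare stains count as much as common ones."""
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1.values()) / len(f1)

def weighted_f1(y_true, y_pred):
    """Mean over classes weighted by the number of true samples per class."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(f1[c] * support[c] for c in f1) / len(y_true)

# Toy, imbalanced example with illustrative labels only.
y_true = ["HE"] * 8 + ["PAS"] * 2
y_pred = ["HE"] * 8 + ["HE", "PAS"]
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

In this toy case the rare class is the one that is misclassified, so the macro average comes out lower than the weighted average. The two numbers are therefore not directly comparable across the internal and external evaluations.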
cs.CV / 84 / 2602.09999
Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
Faster-GS:分析与改进高斯溅射优化
Abstract
Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison. In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation. The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5$\times$ faster training while maintaining visual quality, establishing a new cost-effective and resource efficient baseline for 3DGS optimization. Furthermore, we demonstrate that optimizations can be applied to 4D Gaussian reconstruction, leading to efficient non-rigid scene optimization.
Chinese Translation
近年来,3D高斯溅射(3DGS)的进展集中在加速优化的同时保持重建质量。然而,许多提出的方法将实现层面的改进与基础算法的修改纠缠在一起,或以牺牲保真度为代价换取性能,导致研究领域碎片化,难以进行公平比较。在本研究中,我们整合并评估了以往3DGS研究中最有效且广泛适用的策略,并在此基础上增加了几项新颖的优化。我们进一步探讨了该框架中未被充分研究的方面,包括数值稳定性、高斯截断和梯度近似。最终形成的系统Faster-GS提供了一种经过严格优化的算法,我们在一系列全面的基准测试中进行了评估。实验结果表明,Faster-GS在保持视觉质量的同时实现了高达5倍的训练速度,建立了3DGS优化的新成本效益和资源效率基准。此外,我们还展示了这些优化可以应用于4D高斯重建,从而实现高效的非刚性场景优化。
cs.CV / 85 / 2602.10032
Perception with Guarantees: Certified Pose Estimation via Reachability Analysis
带有保证的感知:通过可达性分析进行认证的姿态估计
Abstract
Agents in cyber-physical systems are increasingly entrusted with safety-critical tasks. Ensuring safety of these agents often requires localizing the pose for subsequent actions. Pose estimates can, e.g., be obtained from various combinations of lidar sensors, cameras, and external services such as GPS. Crucially, in safety-critical domains, a rough estimate is insufficient to formally determine safety, i.e., guaranteeing safety even in the worst-case scenario, and external services might additionally not be trustworthy. We address this problem by presenting a certified pose estimation in 3D solely from a camera image and a well-known target geometry. This is realized by formally bounding the pose, which is computed by leveraging recent results from reachability analysis and formal neural network verification. Our experiments demonstrate that our approach efficiently and accurately localizes agents in both synthetic and real-world experiments.
Chinese Translation
在网络物理系统中,代理越来越多地被委托执行安全关键任务。确保这些代理的安全性通常需要对其姿态进行定位,以便进行后续操作。姿态估计可以通过激光雷达传感器、摄像头以及GPS等外部服务的各种组合来获得。关键在于,在安全关键领域,粗略的估计不足以正式确定安全性,即在最坏情况下也要保证安全,而外部服务可能也不可靠。我们通过提出一种仅基于摄像头图像和已知目标几何形状的3D认证姿态估计来解决这一问题。这是通过正式界定姿态来实现的,该姿态是利用可达性分析和正式神经网络验证的最新成果进行计算的。我们的实验表明,我们的方法能够在合成和真实世界实验中高效且准确地定位代理。
cs.CV / 86 / 2602.10042
Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection
Fake-HR1:重新思考视觉语言模型在合成图像检测中的推理
Abstract
Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
Chinese Translation
最近的研究表明,将链式思维(Chain-of-Thought, CoT)推理纳入检测过程可以增强模型检测合成图像的能力。然而,过长的推理会产生大量资源开销,包括令牌消耗和延迟,这在处理明显生成的伪造品时尤其显得冗余。为了解决这一问题,我们提出了Fake-HR1,这是一种大规模混合推理模型,据我们所知,它是首个根据生成检测任务的特征自适应决定推理是否必要的模型。为此,我们设计了一个两阶段的训练框架:首先进行混合微调(Hybrid Fine-Tuning, HFT)以进行冷启动初始化,然后通过在线强化学习与混合推理分组策略优化(Hybrid-Reasoning Grouped Policy Optimization, HGRPO)隐式学习何时选择合适的推理模式。实验结果表明,Fake-HR1能够自适应地在不同类型的查询中进行推理,在推理能力和生成检测性能上均超越现有的大型语言模型(LLMs),同时显著提高响应效率。
cs.CV / 87 / 2602.10043
Simple Image Processing and Similarity Measures Can Link Data Samples across Databases through Brain MRI
简单图像处理和相似性度量可以通过脑部MRI连接跨数据库的数据样本
Abstract
Head Magnetic Resonance Imaging (MRI) is routinely collected and shared for research under strict regulatory frameworks. These frameworks require removing potential identifiers before sharing. However, even after skull stripping, the brain parenchyma contains unique signatures that can match other MRIs from the same participants across databases, posing a privacy risk if additional data features are available. Current regulatory frameworks often mandate evaluating such risks based on the assessment of a certain level of reasonableness. Prior studies have already suggested that a brain MRI could enable participant linkage, but they have relied on training-based or computationally intensive methods. Here, we demonstrate that linking an individual's skull-stripped T1-weighted MRI, which may lead to re-identification if other identifiers are available, is possible using standard preprocessing followed by image similarity computation. Nearly perfect linkage accuracy was achieved in matching data samples across various time intervals, scanner types, spatial resolutions, and acquisition protocols, despite potential cognitive decline, simulating MRI matching across databases. These results aim to contribute meaningfully to the development of thoughtful, forward-looking policies in medical data sharing.
Chinese Translation
头部磁共振成像(MRI)在严格的监管框架下被定期收集和共享用于研究。这些框架要求在共享之前去除潜在的标识符。然而,即使在去除颅骨后,脑实质仍然包含独特的特征,这些特征可以在数据库中与来自同一参与者的其他MRI匹配,如果有额外的数据特征,这将构成隐私风险。目前的监管框架通常要求基于某种合理性水平的评估来评估此类风险。先前的研究已经表明,脑部MRI可以实现参与者的链接,但它们依赖于基于训练或计算密集型的方法。在这里,我们展示了使用标准预处理和图像相似性计算,可以链接个体的去颅骨化T1加权MRI,如果有其他标识符可用,可能导致重新识别。尽管存在潜在的认知衰退,在不同时间间隔、扫描仪类型、空间分辨率和采集协议下,跨数据库匹配数据样本时几乎实现了完美的链接准确性。这些结果旨在为医疗数据共享中深思熟虑、前瞻性的政策发展做出有意义的贡献。
cs.CV / 88 / 2602.10079
Can Image Splicing and Copy-Move Forgery Be Detected by the Same Model? Forensim: An Attention-Based State-Space Approach
图像拼接和复制移动伪造能否通过同一模型检测?Forensim:一种基于注意力的状态空间方法
Abstract
We introduce Forensim, an attention-based state-space framework for image forgery detection that jointly localizes both manipulated (target) and source regions. Unlike traditional approaches that rely solely on artifact cues to detect spliced or forged areas, Forensim is designed to capture duplication patterns crucial for understanding context. In scenarios such as protest imagery, detecting only the forged region, for example a duplicated act of violence inserted into a peaceful crowd, can mislead interpretation, highlighting the need for joint source-target localization. Forensim outputs three-class masks (pristine, source, target) and supports detection of both splicing and copy-move forgeries within a unified architecture. We propose a visual state-space model that leverages normalized attention maps to identify internal similarities, paired with a region-based block attention module to distinguish manipulated regions. This design enables end-to-end training and precise localization. Forensim achieves state-of-the-art performance on standard benchmarks. We also release CMFD-Anything, a new dataset addressing limitations of existing copy-move forgery datasets.
Chinese Translation
我们介绍了Forensim,这是一种基于注意力的状态空间框架,用于图像伪造检测,能够同时定位被操纵的(目标)区域和源区域。与传统方法仅依赖伪影线索来检测拼接或伪造区域不同,Forensim旨在捕捉对理解上下文至关重要的重复模式。在抗议图像等场景中,仅检测伪造区域,例如插入和平人群中的重复暴力行为,可能会误导解读,突显了联合源-目标定位的必要性。Forensim输出三类掩膜(原始、源、目标),并支持在统一架构中检测拼接和复制移动伪造。我们提出了一种视觉状态空间模型,利用归一化的注意力图识别内部相似性,并结合基于区域的块注意力模块来区分被操纵区域。这一设计实现了端到端训练和精确定位。Forensim在标准基准测试中达到了最先进的性能。我们还发布了CMFD-Anything,这是一个新的数据集,旨在解决现有复制移动伪造数据集的局限性。
cs.CV / 89 / 2602.10095
Causality in Video Diffusers is Separable from Denoising
视频扩散中的因果关系与去噪过程是可分离的
Abstract
Causality -- referring to temporal, uni-directional cause-effect relationships between components -- underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
Chinese Translation
因果关系——指组件之间的时间性单向因果关系——是许多复杂生成过程的基础,包括视频、语言和机器人轨迹。目前的因果扩散模型将时间推理与迭代去噪纠缠在一起,在每个去噪步骤和整个上下文中对所有层应用因果注意力。本文展示了这些模型中的因果推理可以与多步骤去噪过程分离。通过对自回归视频扩散器的系统探测,我们发现了两个关键规律:(1)早期层在去噪步骤中产生高度相似的特征,表明在扩散轨迹上存在冗余计算;(2)深层层次表现出稀疏的跨帧注意力,主要进行帧内渲染。基于这些发现,我们引入了可分离因果扩散(Separable Causal Diffusion, SCD),这是一种新架构,通过因果变换器编码器显式地将每帧的时间推理与通过轻量级扩散解码器进行的多步骤帧级渲染解耦。针对合成和真实基准的预训练和后训练任务的广泛实验表明,SCD显著提高了吞吐量和每帧延迟,同时在生成质量上与强因果扩散基线相匹配或超越。
cs.CV / 90 / 2602.10102
VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
VideoWorld 2:从真实世界视频中学习可转移知识
Abstract
Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
Chinese Translation
从未标记的视频数据中学习可转移知识并将其应用于新环境是智能体的一项基本能力。本研究提出了VideoWorld 2,扩展了VideoWorld,并首次探讨了直接从原始真实世界视频中学习可转移知识。在其核心,VideoWorld 2引入了一种动态增强的潜在动态模型(dynamic-enhanced Latent Dynamics Model, dLDM),该模型将动作动态与视觉外观解耦:一个预训练的视频扩散模型处理视觉外观建模,使得dLDM能够学习专注于紧凑且有意义的任务相关动态的潜在编码。这些潜在编码随后被自回归建模以学习任务策略并支持长时间推理。我们在具有挑战性的真实世界手工制作任务上评估了VideoWorld 2,在这些任务中,先前的视频生成和潜在动态模型难以可靠地操作。值得注意的是,VideoWorld 2在任务成功率上提高了多达70%,并生成连贯的长执行视频。在机器人领域,我们展示了VideoWorld 2能够从Open-X数据集中获取有效的操作知识,这显著提高了CALVIN上的任务表现。本研究揭示了直接从原始视频中学习可转移世界知识的潜力,所有代码、数据和模型将开源以供进一步研究。
cs.CV / 91 / 2602.10104
Olaf-World: Orienting Latent Actions for Video World Modeling
Olaf-World:面向视频世界建模的潜在动作定向
Abstract
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$\Delta$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
Chinese Translation
可控动作世界模型的扩展受到动作标签稀缺的限制。尽管潜在动作学习承诺从未标记的视频中提取控制接口,但学习到的潜在动作往往无法在不同上下文中迁移:它们纠缠于场景特定的线索,并缺乏共享的坐标系统。这是因为标准目标仅在每个片段内操作,未提供跨上下文对齐动作语义的机制。我们的关键见解是,尽管动作是未观察到的,但它们的语义效应是可观察的,并且可以作为共享参考。我们引入了Seq$\Delta$-REPA,这是一种序列级控制效应对齐目标,它将集成的潜在动作锚定于来自冻结的自监督视频编码器的时间特征差异。在此基础上,我们提出了Olaf-World,一个从大规模被动视频中预训练动作条件视频世界模型的管道。大量实验表明,我们的方法学习了更结构化的潜在动作空间,导致比最先进的基线更强的零样本动作迁移能力和更高的数据效率以适应新的控制接口。
cs.CV / 92 / 2602.10113
ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
ConsID-Gen:视图一致性与身份保持的图像到视频生成
Abstract
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.
Chinese Translation
图像到视频生成(I2V)将静态图像动画化为遵循文本指令的时间一致的视频序列,但在变化视角下保持细粒度的物体身份仍然是一个持续的挑战。与文本到视频模型不同,现有的I2V流程往往受到外观漂移和几何失真的困扰,我们将这些问题归因于单视图二维观测的稀疏性和跨模态对齐的弱性。在这里,我们从数据和模型两个角度解决这个问题。首先,我们策划了ConsIDVid,这是一个大型面向对象的数据集,采用可扩展的管道构建高质量、时间对齐的视频,并建立了ConsIDVid-Bench,在这里我们提出了一种新的基准和评估框架,用于多视图一致性,使用对微妙几何和外观偏差敏感的指标。我们进一步提出了ConsID-Gen,这是一种视图辅助的I2V生成框架,通过未姿态辅助视图增强第一帧,并通过双流视觉-几何编码器以及文本-视觉连接器融合语义和结构线索,从而为扩散变换器(Diffusion Transformer)骨干网络提供统一的条件。通过ConsIDVid-Bench的实验表明,ConsID-Gen在多个指标上始终表现优于其他模型,其最佳整体性能超越了领先的视频生成模型如Wan2.1和HunyuanVideo,在具有挑战性的现实场景中提供了卓越的身份保真度和时间一致性。我们将在 https://myangwu.github.io/ConsID-Gen 发布我们的模型和数据集。
cs.CV / 93 / 2602.10115
Quantum Multiple Rotation Averaging
量子多重旋转平均
Abstract
Multiple rotation averaging (MRA) is a fundamental optimization problem in 3D vision and robotics that aims to recover globally consistent absolute rotations from noisy relative measurements. Established classical methods, such as L1-IRLS and Shonan, face limitations including local minima susceptibility and reliance on convex relaxations that fail to preserve the exact manifold geometry, leading to reduced accuracy in high-noise scenarios. We introduce IQARS (Iterative Quantum Annealing for Rotation Synchronization), the first algorithm that reformulates MRA as a sequence of local quadratic non-convex sub-problems executable on quantum annealers after binarization, to leverage inherent hardware advantages. IQARS removes convex relaxation dependence and better preserves non-Euclidean rotation manifold geometry while leveraging quantum tunneling and parallelism for efficient solution space exploration. We evaluate IQARS's performance on synthetic and real-world datasets. While current annealers remain in their nascent phase and only support solving problems of limited scale with constrained performance, we observed that IQARS on D-Wave annealers can already achieve ca. 12% higher accuracy than Shonan, i.e., the best-performing classical method evaluated empirically.
Chinese Translation
多重旋转平均(MRA)是三维视觉和机器人领域中的一个基本优化问题,旨在从噪声相对测量中恢复全局一致的绝对旋转。现有的经典方法,如 L1-IRLS 和 Shonan,面临局部最小值易感性和依赖于凸松弛的局限性,这些松弛无法保持精确的流形几何,导致在高噪声场景中的准确性降低。我们提出了 IQARS(迭代量子退火旋转同步),这是第一个将 MRA 重新表述为一系列局部二次非凸子问题的算法,这些子问题在二值化后可以在量子退火器上执行,以利用固有的硬件优势。IQARS 消除了对凸松弛的依赖,更好地保持了非欧几里得旋转流形几何,同时利用量子隧穿和并行性进行高效的解空间探索。我们在合成和真实世界数据集上评估了 IQARS 的性能。尽管当前的退火器仍处于初期阶段,仅支持解决有限规模的问题且性能受限,但我们观察到,IQARS 在 D-Wave 退火器上已经可以实现比 Shonan 高约 12% 的准确性,即在实证评估中表现最佳的经典方法。
cs.CV / 94 / 2602.10116
SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
SAGE:可扩展的代理性3D场景生成用于具身人工智能
Abstract
Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset can be found on the project page here: https://nvlabs.github.io/sage.
Chinese Translation
对于具身代理的现实世界数据收集仍然成本高昂且不安全,这呼唤可扩展、真实且适合模拟器的3D环境。然而,现有的场景生成系统往往依赖于基于规则或特定任务的流程,导致生成伪影和物理上无效的场景。我们提出了SAGE,一个代理性框架,能够根据用户指定的具身任务(例如,“拾起一个碗并将其放在桌子上”)理解意图,并自动大规模生成适合模拟的环境。该代理将多个布局和物体组合生成器与评估语义合理性、视觉真实感和物理稳定性的评估器相结合。通过迭代推理和自适应工具选择,它自我优化场景,直到满足用户意图和物理有效性。生成的环境真实、多样,并且可以直接在现代模拟器中用于策略训练。仅基于这些数据训练的策略展示出明显的扩展趋势,并能推广到未见过的物体和布局,展示了基于模拟驱动的扩展在具身人工智能中的潜力。代码、演示和SAGE-10k数据集可以在项目页面找到: https://nvlabs.github.io/sage。
cs.AI / 1 / 2602.09112
A Small-Scale System for Autoregressive Program Synthesis Enabling Controlled Experimentation
一种小规模自回归程序合成系统以实现受控实验
Abstract
What research can be pursued with small models trained to complete true programs? Typically, researchers study program synthesis via large language models (LLMs) which introduce issues such as knowing what is in or out of distribution, understanding fine-tuning effects, understanding the effects of tokenization, and higher demand on compute and storage to carry out experiments. We present a system called Cadmus which includes an integer virtual machine (VM), a dataset composed of true programs of diverse tasks, and an autoregressive transformer model that is trained for under \$200 of compute cost. The system can be used to study program completion, out-of-distribution representations, inductive reasoning, and instruction following in a setting where researchers have effective and affordable fine-grained control of the training distribution and the ability to inspect and instrument models. Smaller models working on complex reasoning tasks enable instrumentation and investigations that may be prohibitively expensive on larger models. To demonstrate that these tasks are complex enough to be of interest, we show that these Cadmus models outperform GPT-5 (by achieving 100\% accuracy while GPT-5 has 95\% accuracy) even on a simple task of completing correct, integer arithmetic programs in our domain-specific language (DSL) while providing transparency into the dataset's relationship to the problem. We also show that GPT-5 brings unknown priors into its reasoning process when solving the same tasks, demonstrating a confounding factor that prevents the use of large-scale LLMs for some investigations where the training set relationship to the task needs to be fully understood.
Chinese Translation
小模型训练完成真实程序的研究可以追求哪些方向?通常,研究人员通过大型语言模型(LLMs)研究程序合成,这引入了一些问题,例如了解哪些内容在分布内或分布外、理解微调效果、理解标记化的影响,以及进行实验所需的计算和存储需求更高。我们提出了一个名为Cadmus的系统,该系统包括一个整数虚拟机(VM)、一个由多样任务的真实程序组成的数据集,以及一个训练成本低于200美元的自回归变换器模型。该系统可用于研究程序完成、分布外表示、归纳推理和指令跟随,研究人员在此环境中能够有效且经济地对训练分布进行细粒度控制,并能够检查和仪器化模型。较小的模型在复杂推理任务上的工作使得仪器化和调查成为可能,而在大型模型上可能会过于昂贵。为了证明这些任务足够复杂以引起兴趣,我们展示了这些Cadmus模型在完成我们特定领域语言(DSL)中的正确整数算术程序这一简单任务时,表现优于GPT-5(实现100%的准确率,而GPT-5的准确率为95%),同时提供了数据集与问题之间关系的透明性。我们还展示了GPT-5在解决相同任务时引入了未知的先验,这表明了一个混淆因素,阻碍了在某些需要完全理解训练集与任务关系的研究中使用大规模LLMs。
cs.AI / 2 / 2602.09121
Uncertainty-Aware Multimodal Emotion Recognition through Dirichlet Parameterization
基于狄利克雷参数化的不确定性感知多模态情感识别
Abstract
In this work, we present a lightweight and privacy-preserving Multimodal Emotion Recognition (MER) framework designed for deployment on edge devices. To demonstrate the framework's versatility, our implementation uses three modalities: speech, text, and facial imagery. However, the system is fully modular and can be extended to support other modalities or tasks. Each modality is processed through a dedicated backbone optimized for inference efficiency: Emotion2Vec for speech, a ResNet-based model for facial expressions, and DistilRoBERTa for text. To reconcile uncertainty across modalities, we introduce a model- and task-agnostic fusion mechanism grounded in Dempster-Shafer theory and Dirichlet evidence. Operating directly on model logits, this approach captures predictive uncertainty without requiring additional training or joint distribution estimation, making it broadly applicable beyond emotion recognition. Validation on five benchmark datasets (eNTERFACE05, MEAD, MELD, RAVDESS, and CREMA-D) shows that our method achieves competitive accuracy while remaining computationally efficient and robust to ambiguous or missing inputs. Overall, the proposed framework emphasizes modularity, scalability, and real-world feasibility, paving the way toward uncertainty-aware multimodal systems for healthcare, human-computer interaction, and other emotion-informed applications.
Chinese Translation
在本研究中,我们提出了一种轻量级且保护隐私的多模态情感识别(MER)框架,旨在部署于边缘设备。为了展示框架的多样性,我们的实现使用了三种模态——语音、文本和面部图像。然而,该系统是完全模块化的,可以扩展以支持其他模态或任务。每种模态通过专门优化推理效率的骨干网络进行处理:语音使用 Emotion2Vec,面部表情使用基于 ResNet 的模型,文本使用 DistilRoBERTa。为了协调不同模态之间的不确定性,我们引入了一种与模型和任务无关的融合机制,基于邓普斯特-沙费尔理论和狄利克雷证据。该方法直接作用于模型的对数几率,捕捉预测不确定性,而无需额外的训练或联合分布估计,使其在情感识别之外具有广泛的适用性。在五个基准数据集(eNTERFACE05、MEAD、MELD、RAVDESS 和 CREMA-D)上的验证显示,我们的方法在保持计算效率和对模糊或缺失输入的鲁棒性的同时,达到了竞争性的准确性。总体而言,所提出的框架强调模块化、可扩展性和现实世界的可行性,为面向医疗保健、人机交互及其他情感信息应用的不确定性感知多模态系统铺平了道路。
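The abstract above describes fusing per-modality logits through Dirichlet evidence and Dempster-Shafer theory but gives no formulas. The sketch below uses the reduced Dempster combination rule common in evidential deep learning, which may differ from the paper's exact formulation. The softplus evidence mapping, the function names, and the example logits are all assumptions made for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); maps a logit to non-negative evidence."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def opinion_from_logits(logits):
    """Turn logits into (belief masses, uncertainty mass) via Dirichlet parameters.

    Assumed mapping: evidence e_k = softplus(logit_k), alpha_k = e_k + 1,
    S = sum(alpha), belief b_k = e_k / S, uncertainty u = K / S.
    """
    evidence = [softplus(z) for z in logits]
    k = len(evidence)
    strength = sum(evidence) + k
    belief = [e / strength for e in evidence]
    return belief, k / strength


def combine(op1, op2):
    """Reduced Dempster rule for two opinions over the same K classes."""
    (b1, u1), (b2, u2) = op1, op2
    # Conflict: mass the two sources assign to different classes.
    conflict = sum(
        b1[i] * b2[j]
        for i in range(len(b1))
        for j in range(len(b2))
        if i != j
    )
    scale = 1.0 - conflict
    belief = [(x * y + x * u2 + y * u1) / scale for x, y in zip(b1, b2)]
    return belief, (u1 * u2) / scale


# Hypothetical logits from two modalities over three emotion classes.
speech = opinion_from_logits([2.0, 0.1, -1.0])
text = opinion_from_logits([1.5, 0.3, -0.5])
fused_belief, fused_uncertainty = combine(speech, text)
print(fused_belief, fused_uncertainty)
```

In this formulation a missing modality can be represented by a vacuous opinion (all belief zero, uncertainty one), and combining with it leaves the other opinion unchanged. That is one plausible reading of the robustness to missing inputs mentioned in the abstract.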
cs.AI / 3 / 2602.09138
PABU: Progress-Aware Belief Update for Efficient LLM Agents
PABU:面向进展的信念更新以提高大型语言模型代理的效率
Abstract
Large Language Model (LLM) agents commonly condition actions on full action-observation histories, which introduce task-irrelevant information that easily leads to redundant actions and higher inference cost. We propose Progress-Aware Belief Update (PABU), a belief-state framework that compactly represents an agent's state by explicitly modeling task progress and selectively retaining past actions and observations. At each step, the agent predicts its relative progress since the previous round and decides whether the newly encountered interaction should be stored, conditioning future decisions only on the retained subset. Across eight environments in the AgentGym benchmark, and using identical training trajectories, PABU achieves an 81.0% task completion rate, outperforming previous state-of-the-art (SoTA) models with full-history belief by 23.9%. Additionally, PABU's progress-oriented action selection improves efficiency, reducing the average number of interaction steps to 9.5, corresponding to a 26.9% reduction. Ablation studies show that both explicit progress prediction and selective retention are necessary for robust belief learning and performance gains.
Chinese Translation
大型语言模型(LLM)代理通常基于完整的行动-观察历史来决定行动,这引入了与任务无关的信息,容易导致冗余行动和更高的推理成本。我们提出了面向进展的信念更新(PABU),这是一种信念状态框架,通过明确建模任务进展并选择性保留过去的行动和观察,紧凑地表示代理的状态。在每一步中,代理预测自上一个回合以来的相对进展,并决定是否应存储新遇到的交互,仅基于保留的子集来条件化未来的决策。在AgentGym基准的八个环境中,使用相同的训练轨迹,PABU实现了81.0%的任务完成率,超越了之前的全历史信念的最新技术(SoTA)模型23.9%。此外,PABU的面向进展的行动选择提高了效率,将平均交互步骤减少到9.5,减少幅度为26.9%。消融研究表明,明确的进展预测和选择性保留对于稳健的信念学习和性能提升都是必要的。
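The headline numbers in the abstract above imply baseline values that it does not state. The sketch below back-computes them under explicit assumptions: the step reduction is taken as relative to the baseline, and the 23.9% completion gain is evaluated both as absolute percentage points and as a relative gain, since the abstract does not say which is meant. The function names are hypothetical.

```python
def implied_baseline_steps(steps_after, relative_reduction):
    """Baseline step count if `steps_after` reflects a relative reduction."""
    return steps_after / (1.0 - relative_reduction)


def implied_baseline_rate(rate, gain, relative):
    """Baseline completion rate (in %) under either reading of `gain`."""
    if relative:
        return rate / (1.0 + gain / 100.0)
    return rate - gain


# About 13 steps on average before the reported 26.9% reduction to 9.5.
print(round(implied_baseline_steps(9.5, 0.269), 1))
# Baseline completion rate if 23.9 means percentage points.
print(round(implied_baseline_rate(81.0, 23.9, relative=False), 1))
# Baseline completion rate if 23.9 means a relative improvement.
print(round(implied_baseline_rate(81.0, 23.9, relative=True), 1))
```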
cs.AI / 4 / 2602.09159
CoMMa: Contribution-Aware Medical Multi-Agents From A Game-Theoretic Perspective
CoMMa:从博弈论视角出发的贡献感知医疗多智能体
Abstract
Recent multi-agent frameworks have broadened the ability to tackle oncology decision support tasks that require reasoning over dynamic, heterogeneous patient data. We propose Contribution-Aware Medical Multi-Agents (CoMMa), a decentralized LLM-agent framework in which specialists operate on partitioned evidence and coordinate through a game-theoretic objective for robust decision-making. In contrast to most agent architectures relying on stochastic narrative-based reasoning, CoMMa utilizes deterministic embedding projections to approximate contribution-aware credit assignment. This yields explicit evidence attribution by estimating each agent's marginal utility, producing interpretable and mathematically grounded decision pathways with improved stability. Evaluated on diverse oncology benchmarks, including a real-world multidisciplinary tumor board dataset, CoMMa achieves higher accuracy and more stable performance than data-centralized and role-based multi-agents baselines.
Chinese Translation
近期的多智能体框架拓宽了处理需要对动态异构患者数据进行推理的肿瘤学决策支持任务的能力。我们提出了贡献感知医疗多智能体(CoMMa),这是一个去中心化的LLM-agent框架,其中专家在分区证据上操作,并通过博弈论目标进行协调,以实现稳健的决策。与大多数依赖于随机叙事推理的智能体架构不同,CoMMa利用确定性嵌入投影来近似贡献感知的信用分配。这通过估计每个智能体的边际效用实现了明确的证据归属,产生了可解释且数学上有依据的决策路径,且稳定性得到了改善。在多样的肿瘤学基准测试中进行评估,包括一个真实世界的多学科肿瘤委员会数据集,CoMMa在准确性和稳定性方面均优于数据集中和基于角色的多智能体基线。
cs.AI / 5 / 2602.09163
FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases
FlyAOC:评估果蝇科学知识库的能动本体策划
Abstract
Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
Chinese Translation
科学知识库通过将来自主要文献的发现整理为结构化、可查询的格式,加速了发现的进程,既服务于人类研究者,也为新兴的人工智能系统提供支持。维护这些资源需要专家策展人搜索相关论文,协调文献之间的证据,并生成基于本体的注释——这一工作流程并未被现有的基准所涵盖,这些基准主要集中在诸如命名实体识别或关系提取等孤立的子任务上。我们提出了FlyBench,以评估人工智能代理在科学文献中进行端到端能动本体策划的能力。给定一个基因符号,代理必须从16,898篇全文论文的语料库中搜索和阅读,以生成结构化注释:描述功能的基因本体术语、表达模式以及连接数十年命名法的历史同义词。该基准包括来自果蝇知识库FlyBase的100个基因的7,397条专家策划的注释。我们评估了四种基线代理架构:记忆型、固定管道型、单代理型和多代理型。我们的研究发现,架构选择对性能有显著影响,多代理设计优于简单的替代方案,但扩展基础模型的收益递减。所有基线都有显著的改进空间。我们的分析揭示了若干发现,以指导未来的发展;例如,代理主要使用检索来确认参数知识,而不是发现新信息。我们希望FlyBench能够推动增强检索的科学推理能力的发展,这一能力在各科学领域具有广泛的应用前景。
cs.AI / 6 / 2602.09286
Human Control Is the Anchor, Not the Answer: Early Divergence of Oversight in Agentic AI Communities
人类控制是锚点,而非答案:代理人工智能社区监督的早期分歧
Abstract
Oversight for agentic AI is often discussed as a single goal ("human control"), yet early adoption may produce role-specific expectations. We present a comparative analysis of two newly active Reddit communities in Jan--Feb 2026 that reflect different socio-technical roles: r/OpenClaw (deployment and operations) and r/Moltbook (agent-centered social interaction). We conceptualize this period as an early-stage crystallization phase, where oversight expectations form before norms reach equilibrium. Using topic modeling in a shared comparison space, a coarse-grained oversight-theme abstraction, engagement-weighted salience, and divergence tests, we show the communities are strongly separable (JSD = 0.418, cosine = 0.372, permutation $p=0.0005$). Across both communities, "human control" is an anchor term, but its operational meaning diverges: r/OpenClaw emphasizes execution guardrails and recovery (action-risk), while r/Moltbook emphasizes identity, legitimacy, and accountability in public interaction (meaning-risk). The resulting distinction offers a portable lens for designing and evaluating oversight mechanisms that match agent role, rather than applying one-size-fits-all control policies.
Chinese Translation
代理人工智能的监督通常被讨论为一个单一目标(“人类控制”),然而早期采用可能会产生角色特定的期望。我们对2026年1月至2月期间两个新活跃的Reddit社区进行了比较分析,这两个社区反映了不同的社会技术角色:r/OpenClaw(部署与运营)和r/Moltbook(以代理为中心的社会互动)。我们将这一时期概念化为早期阶段的结晶化阶段,在此阶段,监督期望在规范达到平衡之前形成。通过在共享比较空间中使用主题建模、粗粒度的监督主题抽象、参与权重显著性和分歧测试,我们展示了这两个社区具有明显的可分性(JSD =0.418,余弦相似度=0.372,置换$p=0.0005$)。在这两个社区中,“人类控制”是一个锚定术语,但其操作意义存在分歧:r/OpenClaw强调执行保护措施和恢复(行动风险),而r/Moltbook则强调身份、合法性和公共互动中的问责制(意义风险)。由此产生的区别为设计和评估与代理角色相匹配的监督机制提供了一种可移植的视角,而不是应用一刀切的控制政策。
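The separability statistics quoted above (JSD and cosine) can be illustrated with a minimal sketch over two topic distributions. The abstract does not give the logarithm base or the exact vectors compared. Base 2 is assumed here, which bounds JSD in [0, 1], and the example distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric, bounded divergence between two probability vectors."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Hypothetical topic shares for two communities over four shared topics.
community_a = [0.60, 0.25, 0.10, 0.05]
community_b = [0.10, 0.15, 0.30, 0.45]
print(jensen_shannon_divergence(community_a, community_b))
print(cosine_similarity(community_a, community_b))
```

Higher JSD and lower cosine similarity both indicate that the two distributions place their mass on different topics. A permutation test would then ask how often shuffled community labels produce a divergence at least as large.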
cs.AI / 7 / 2602.09340
Measuring Dataset Diversity from a Geometric Perspective
从几何角度测量数据集多样性
Abstract
Diversity can be broadly defined as the presence of meaningful variation across elements, which can be viewed from multiple perspectives, including statistical variation and geometric structural richness in the dataset. Existing diversity metrics, such as feature-space dispersion and metric-space magnitude, primarily capture distributional variation or entropy, while largely neglecting the geometric structure of datasets. To address this gap, we introduce a framework based on topological data analysis (TDA) and persistence landscapes (PLs) to extract and quantify geometric features from data. This approach provides a theoretically grounded means of measuring diversity beyond entropy, capturing the rich geometric and structural properties of datasets. Through extensive experiments across diverse modalities, we demonstrate that our proposed PLs-based diversity metric (PLDiv) is powerful, reliable, and interpretable, directly linking data diversity to its underlying geometry and offering a foundational tool for dataset construction, augmentation, and evaluation.
Chinese Translation
多样性可以广泛定义为元素之间存在有意义的变化,这可以从多个角度进行观察,包括数据集中的统计变化和几何结构丰富性。现有的多样性度量,如特征空间离散度和度量空间大小,主要捕捉分布变化或熵,而在很大程度上忽视了数据集的几何结构。为了解决这一问题,我们引入了一种基于拓扑数据分析(TDA)和持久性景观(PLs)的框架,以提取和量化数据的几何特征。这种方法提供了一种理论基础的手段来测量超越熵的多样性,捕捉数据集丰富的几何和结构属性。通过在多种模态下进行广泛实验,我们证明了我们提出的基于PLs的多样性度量(PLDiv)是强大、可靠且可解释的,直接将数据多样性与其基础几何联系起来,并为数据集的构建、增强和评估提供了基础工具。
cs.AI / 8 / 2602.09341
Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge
审计多智能体大语言模型推理树优于多数投票和大语言模型作为裁判
Abstract
Multi-agent systems (MAS) can substantially extend the reasoning capacity of large language models (LLMs), yet most frameworks still aggregate agent outputs with majority voting. This heuristic discards the evidential structure of reasoning traces and is brittle under the confabulation consensus, where agents share correlated biases and converge on the same incorrect rationale. We introduce AgentAuditor, which replaces voting with a path search over a Reasoning Tree that explicitly represents agreements and divergences among agent traces. AgentAuditor resolves conflicts by comparing reasoning branches at critical divergence points, turning global adjudication into efficient, localized verification. We further propose Anti-Consensus Preference Optimization (ACPO), which trains the adjudicator on majority-failure cases and rewards evidence-based minority selections over popular errors. AgentAuditor is agnostic to MAS setting, and we find across 5 popular settings that it yields up to 5% absolute accuracy improvement over a majority vote, and up to 3% over using LLM-as-Judge.
Chinese Translation
多智能体系统(MAS)可以显著扩展大语言模型(LLMs)的推理能力,但大多数框架仍然通过多数投票来聚合智能体的输出。这种启发式方法忽略了推理轨迹的证据结构,并且在虚构共识下表现脆弱,智能体之间共享相关偏见并趋向于相同的不正确推理。我们引入了AgentAuditor,它用对推理树的路径搜索替代投票,明确表示智能体轨迹之间的共识和分歧。AgentAuditor通过在关键分歧点比较推理分支来解决冲突,将全局裁决转变为高效的局部验证。我们进一步提出了反共识偏好优化(Anti-Consensus Preference Optimization, ACPO),该方法在多数失败案例上训练裁决者,并奖励基于证据的少数选择而非流行错误。AgentAuditor对MAS设置不敏感,我们在5个流行设置中发现,它的绝对准确率比多数投票提高了最多5%,比使用大语言模型作为裁判提高了最多3%。
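Illustration for 2602.09341: a minimal sketch contrasting plain majority voting with locating the first point at which agent traces diverge. This is not AgentAuditor's algorithm; the helper names and the toy traces are assumptions, and the sketch only shows why a shared wrong step can win a vote while the divergence point marks where localized checking could focus.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer."""
    return Counter(answers).most_common(1)[0][0]


def first_divergence(traces):
    """Index of the first step at which the traces stop agreeing.

    Returns the length of the shortest trace if all traces agree on every
    step of the shortest one.
    """
    shortest = min(len(t) for t in traces)
    for i in range(shortest):
        if len({t[i] for t in traces}) > 1:
            return i
    return shortest


# Hypothetical traces: two agents share the same arithmetic slip.
traces = [
    ["read problem", "speed = 60/2", "speed = 20", "answer: 20"],
    ["read problem", "speed = 60/2", "speed = 20", "answer: 20"],
    ["read problem", "speed = 60/2", "speed = 30", "answer: 30"],
]
voted = majority_vote([t[-1] for t in traces])
split_at = first_divergence(traces)
```

Here the vote returns the popular but incorrect answer, whereas the divergence index points at the single step whose two candidate branches would need to be compared.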
cs.AI / 9 / 2602.09343
Not-in-Perspective: Towards Shielding Google's Perspective API Against Adversarial Negation Attacks
非视角:保护谷歌的视角API免受对抗性否定攻击
Abstract
The rise of cyberbullying on social media platforms, much of it involving toxic comments, has escalated the need for effective ways to monitor and moderate online interactions. Existing automated toxicity detection systems are based on machine learning or deep learning algorithms. However, statistics-based solutions are generally prone to adversarial attacks that contain logic-based modifications such as negation in phrases and sentences. In that regard, we present a set of formal reasoning-based methodologies that wrap around existing machine learning toxicity detection systems. Acting as both pre-processing and post-processing steps, our formal reasoning wrapper helps alleviate the negation attack problem and significantly improves the accuracy and efficacy of toxicity scoring. We evaluate different variations of our wrapper on multiple machine learning models against a negation adversarial dataset. Experimental results highlight the improvement of hybrid (formal reasoning and machine-learning) methods over various purely statistical solutions.
Chinese Translation
社交媒体平台上网络欺凌的上升,尤其是涉及有毒评论的情况,增加了有效监控和调节在线互动的需求。现有的自动化有毒性检测系统解决方案基于机器学习或深度学习算法。然而,基于统计的解决方案通常容易受到包含逻辑修改(如短语和句子中的否定)的对抗性攻击。对此,我们提出了一套基于形式推理的方法论,旨在包裹现有的机器学习有毒性检测系统。作为预处理和后处理步骤,我们的形式推理包裹器有助于缓解否定攻击问题,并显著提高有毒性评分的准确性和有效性。我们在多个机器学习模型上评估了我们包裹器的不同变体,针对一个否定对抗数据集。实验结果突显了混合(形式推理与机器学习)方法相较于各种纯统计解决方案的改进效果。
cs.AI / 10 / 2602.09347
Image Quality in the Era of Artificial Intelligence
人工智能时代的图像质量
Abstract
Artificial intelligence (AI) is being deployed within radiology at a rapid pace. AI has proven an excellent tool for reconstructing and enhancing images: the results appear sharper, smoother, and more detailed, can be acquired more quickly, and allow clinicians to review them more rapidly. However, incorporation of AI also introduces new failure modes and can exacerbate the disconnect between the perceived quality of an image and the information content of that image. Understanding the limitations of AI-enabled image reconstruction and enhancement is critical for safe and effective use of the technology. Hence, the purpose of this communication is to bring awareness to limitations when AI is used to reconstruct or enhance a radiological image, with the goal of enabling users to reap the benefits of the technology while minimizing risks.
Chinese Translation
人工智能(AI)在放射学领域的应用正在迅速推进。AI已被证明是重建和增强图像的优秀工具,使得图像看起来更加清晰、平滑和细致,获取速度更快,并且允许临床医生更快速地进行审阅。然而,AI的引入也带来了新的故障模式,并可能加剧图像感知质量与信息内容之间的脱节。理解AI驱动的图像重建和增强的局限性对于安全和有效地使用该技术至关重要。因此,本次交流的目的是提高对使用AI重建或增强放射学图像时局限性的认识,旨在使用户能够在最小化风险的同时,充分利用该技术的优势。
cs.AI / 11 / 2602.09443
P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads
P1-VL:在物理奥林匹克中桥接视觉感知与科学推理
Luo, Yun, Wang, Futing, Cheng, Qianjia, Yu, Fangchen, Lei, Haodi, Yan, Jianhao, Li, Chenxi, Chen, Jiacheng, Zhao, Yufeng, Wan, Haiyuan, Zhang, Yuchen, Zheng, Shenghe, Yao, Junchi, Zhang, Qingyang, He, Haonan, Zeng, Wenxuan, Sheng, Li, Xie, Chengxing, Zuo, Yuxin, Li, Yizhuo, Wu, Yulun, Huang, Rui, Zhou, Dongzhan, Chen, Kai, Qiao, Yu, Bai, Lei, Cheng, Yu, Ding, Ning, Zhou, Bowen, Ye, Peng, Cui, Ganqu
Abstract
The transition from symbolic manipulation to science-grade reasoning represents a pivotal frontier for Large Language Models (LLMs), with physics serving as the critical test anchor for binding abstract logic to physical reality. Physics demands that a model maintain physical consistency with the laws governing the universe, a task that fundamentally requires multimodal perception to ground abstract logic in reality. At the Olympiad level, diagrams are often constitutive rather than illustrative, containing essential constraints, such as boundary conditions and spatial symmetries, that are absent from the text. To bridge this visual-logical gap, we introduce P1-VL, a family of open-source vision-language models engineered for advanced scientific reasoning. Our method harmonizes Curriculum Reinforcement Learning, which employs progressive difficulty expansion to stabilize post-training, with Agentic Augmentation, enabling iterative self-verification at inference. Evaluated on HiPhO, a rigorous benchmark of 13 exams from 2024-2025, our flagship P1-VL-235B-A22B becomes the first open-source Vision-Language Model (VLM) to secure 12 gold medals and achieves state-of-the-art performance among open-source models. Our agent-augmented system achieves the No. 2 overall rank globally, trailing only Gemini-3-Pro. Beyond physics, P1-VL demonstrates remarkable scientific reasoning capacity and generalizability, establishing significant leads over base models in STEM benchmarks. By open-sourcing P1-VL, we provide a foundational step toward general-purpose physical intelligence to better align visual perceptions with abstract physical laws for machine scientific discovery.
Chinese Translation
从符号操作到科学级推理的转变代表了大型语言模型(LLMs)面临的一个关键前沿,而物理学则作为将抽象逻辑与物理现实结合的关键测试基准。物理学要求模型与支配宇宙的法则保持物理一致性,这一任务根本上需要多模态感知以将抽象逻辑扎根于现实。在奥林匹克级别,图示往往是构成性的而非仅仅是说明性的,包含了文本中缺失的重要约束条件,如边界条件和空间对称性。为了弥合这一视觉与逻辑之间的鸿沟,我们提出了P1-VL,一个为高级科学推理而设计的开源视觉-语言模型系列。我们的方法将课程强化学习与代理增强相结合,前者通过逐步增加难度来稳定训练后的表现,后者则在推理过程中实现迭代自我验证。在2024-2025年的13场严格考试的基准HiPhO上进行评估,我们的旗舰模型P1-VL-235B-A22B成为首个获得12枚金牌的开源视觉-语言模型(VLM),并在开源模型中实现了最先进的性能。我们的代理增强系统在全球范围内获得了第二名,仅次于Gemini-3-Pro。超越物理学,P1-VL展现出卓越的科学推理能力和广泛的适用性,在STEM基准测试中显著领先于基础模型。通过开源P1-VL,我们为实现通用物理智能迈出了基础性的一步,以更好地将视觉感知与抽象物理法则对齐,从而促进机器科学发现。
cs.AI / 12 / 2602.09463
SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning
SpotAgent:通过代理推理将视觉地理定位与大型视觉-语言模型相结合
Abstract
Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded predictions when faced with confounded evidence. To address these challenges, we propose SpotAgent, a framework that formalizes geo-localization into an agentic reasoning process that leverages expert-level reasoning to synergize visual interpretation with tool-assisted verification. SpotAgent actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram. We introduce a 3-stage post-training pipeline starting with a Supervised Fine-Tuning (SFT) stage for basic alignment, followed by an Agentic Cold Start phase utilizing high-quality trajectories synthesized via a Multi-Agent framework, aiming to instill tool-calling expertise. Subsequently, the model's reasoning capabilities are refined through Reinforcement Learning. We propose a Spatially-Aware Dynamic Filtering strategy to enhance the efficiency of the RL stage by prioritizing learnable samples based on spatial difficulty. Extensive experiments on standard benchmarks demonstrate that SpotAgent achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
Chinese Translation
大型视觉-语言模型(LVLMs)在地理定位方面展现了强大的推理能力,但在视觉线索稀疏、长尾且高度模糊的现实场景中,它们常常面临挑战。以往的方法受限于内部知识,往往无法提供可验证的结果,在面对混淆证据时,产生自信但缺乏基础的预测。为了解决这些挑战,我们提出了SpotAgent,一个将地理定位形式化为代理推理过程的框架,利用专家级推理将视觉解读与工具辅助验证相结合。SpotAgent通过ReAct图积极探索和验证视觉线索,利用外部工具(如网络搜索、地图)。我们引入了一个三阶段的后训练流程,首先是监督微调(SFT)阶段以实现基本对齐,其次是利用多代理框架合成的高质量轨迹的代理冷启动阶段,旨在培养工具调用的专业知识。随后,通过强化学习进一步完善模型的推理能力。我们提出了一种空间感知动态过滤策略,通过根据空间难度优先考虑可学习样本,提高强化学习阶段的效率。在标准基准上进行的大量实验表明,SpotAgent实现了最先进的性能,有效减轻了幻觉现象,同时提供了精确且可验证的地理定位结果。
cs.AI / 13 / 2602.09485
Bridging Efficiency and Transparency: Explainable CoT Compression in Multimodal Large Reasoning Models
高效性与透明性的桥梁:多模态大型推理模型中的可解释链式思维压缩
Abstract
Long chains of thought (Long CoTs) are widely employed in multimodal reasoning models to tackle complex tasks by capturing detailed visual information. However, these Long CoTs are often excessively lengthy and contain redundant reasoning steps, which can hinder inference efficiency. Compressing these Long CoTs is a natural solution, yet existing approaches face two major challenges: (1) they may compromise the integrity of visual-textual reasoning by removing essential alignment cues, and (2) the compression process lacks explainability, making it difficult to discern which information is critical. To address these problems, we propose XMCC, an eXplainable Multimodal CoT Compressor that formulates compression as a sequential decision-making process optimized via reinforcement learning. XMCC can effectively shorten reasoning trajectories while preserving key reasoning steps and answer correctness, and simultaneously generates natural-language explanations for its compression decisions. Extensive experiments on representative multimodal reasoning benchmarks demonstrate that XMCC not only reduces reasoning length but also explains its compression decisions, validating its effectiveness.
Chinese Translation
长链思维(Long CoTs)在多模态推理模型中被广泛应用,以通过捕捉详细的视觉信息来应对复杂任务。然而,这些长链思维往往过于冗长,并包含多余的推理步骤,这可能会妨碍推理效率。压缩这些长链思维是一个自然的解决方案,但现有方法面临两个主要挑战:(1)它们可能通过去除重要的对齐线索而损害视觉-文本推理的完整性;(2)压缩过程缺乏可解释性,使得难以辨别哪些信息是关键的。为了解决这些问题,我们提出了XMCC,即可解释的多模态链式思维压缩器,它将压缩过程形式化为一个通过强化学习优化的顺序决策过程。XMCC能够有效缩短推理轨迹,同时保留关键的推理步骤和答案的正确性,并同时生成自然语言解释其压缩决策。在具有代表性的多模态推理基准上的大量实验表明,XMCC不仅减少了推理长度,还提供了可解释的解释,验证了其有效性。
cs.AI / 14 / 2602.09489
Computing Conditional Shapley Values Using Tabular Foundation Models
使用表格基础模型计算条件 Shapley 值
Abstract
Shapley values have become a cornerstone of explainable AI, but they are computationally expensive to use, especially when features are dependent. Evaluating them requires approximating a large number of conditional expectations, either via Monte Carlo integration or regression. Until recently it has not been possible to fully exploit deep learning for the regression approach, because retraining for each conditional expectation takes too long. Tabular foundation models such as TabPFN overcome this computational hurdle by leveraging in-context learning, so each conditional expectation can be approximated without any re-training. In this paper, we compute Shapley values with multiple variants of TabPFN and compare their performance with state-of-the-art methods on both simulated and real datasets. In most cases, TabPFN yields the best performance; where it does not, it is only marginally worse than the best method, at a fraction of the runtime. We discuss further improvements and how tabular foundation models can be better adapted specifically for conditional Shapley value estimation.
Chinese Translation
Shapley 值已成为可解释人工智能的基石,但在特征相互依赖时,其计算成本较高。评估 Shapley 值需要通过蒙特卡洛积分或回归来近似大量的条件期望。直到最近,由于每个条件期望的重新训练耗时过长,尚无法充分利用深度学习进行回归方法。表格基础模型如 TabPFN 通过利用上下文学习克服了这一计算障碍,因此每个条件期望可以在不重新训练的情况下进行近似。在本文中,我们使用多种 TabPFN 的变体计算 Shapley 值,并将其性能与最先进的方法在模拟和真实数据集上进行比较。在大多数情况下,TabPFN 的表现最佳;在其表现不佳的情况下,性能仅比最佳方法稍差,且运行时间仅为其一小部分。我们讨论了进一步的改进以及如何更好地将表格基础模型专门适应于条件 Shapley 值估计。
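Illustration for 2602.09489: a minimal sketch of exact Shapley values computed from a value function over feature subsets. In the conditional setting, v(S) stands for an estimate of E[f(x) | x_S]; with n features there are 2^n such subsets, which is why the abstract stresses the cost of approximating many conditional expectations. The lookup table `toy_v` is a hypothetical stand-in for those estimates; in the paper they would come from TabPFN or a competing method, which is not reproduced here.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_features, value):
    """Exact Shapley values for a value function over feature subsets.

    value: callable mapping a frozenset of feature indices S to v(S).
    """
    phi = [0.0] * n_features
    for j in range(n_features):
        others = [i for i in range(n_features) if i != j]
        for size in range(len(others) + 1):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature j when added to S.
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


# Hypothetical table of conditional expectations for 2 features.
toy_v = {
    frozenset(): 1.0,
    frozenset({0}): 2.0,
    frozenset({1}): 1.5,
    frozenset({0, 1}): 3.0,
}
phi = shapley_values(2, toy_v.__getitem__)
```

The values sum to v(all features) minus v(empty set), the efficiency property that makes Shapley values attractive for attribution.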
cs.AI / 15 / 2602.09533
Autoregressive Direct Preference Optimization
自回归直接偏好优化
Abstract
Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the response-level Bradley-Terry (BT) model may limit its full potential, as the reference and learnable models are assumed to be autoregressive only after deriving the objective function. Motivated by this limitation, we revisit the theoretical foundations of DPO and propose a novel formulation that explicitly introduces the autoregressive assumption prior to applying the BT model. By reformulating and extending DPO, we derive a novel variant, termed Autoregressive DPO (ADPO), that explicitly integrates autoregressive modeling into the preference optimization framework. Without violating the theoretical foundations, the derived loss takes an elegant form: it shifts the summation operation in the DPO objective outside the log-sigmoid function. Furthermore, through theoretical analysis of ADPO, we show that there exist two length measures to be considered when designing DPO-based algorithms: the token length $\mu$ and the feedback length $\mu$'. To the best of our knowledge, we are the first to explicitly distinguish these two measures and analyze their implications for preference optimization in LLMs.
Chinese Translation
直接偏好优化(DPO)作为一种有前景的方法,已逐渐成为将大型语言模型(LLMs)与人类偏好对齐的有效手段。然而,广泛依赖于响应级别的Bradley-Terry(BT)模型可能限制了其潜力,因为在推导目标函数后,参考模型和可学习模型被假定为自回归的。基于这一局限性,我们重新审视了DPO的理论基础,并提出了一种新颖的公式化方法,该方法在应用BT模型之前明确引入了自回归假设。通过重新公式化和扩展DPO,我们推导出一种新变体,称为自回归DPO(ADPO),它将自回归建模明确整合到偏好优化框架中。在不违反理论基础的情况下,推导出的损失函数呈现出优雅的形式:它将DPO目标中的求和操作移出对数-sigmoid函数。此外,通过对ADPO的理论分析,我们表明在设计基于DPO的算法时需要考虑两个长度度量:令牌长度$\mu$和反馈长度$\mu$'。据我们所知,我们首次明确区分这两个度量并分析它们在LLMs偏好优化中的影响。
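Illustration for 2602.09533: a minimal sketch of what moving the summation outside the log-sigmoid could look like. `dpo_loss` follows the standard response-level DPO form, where per-token log-ratios log pi_theta - log pi_ref are summed before the log-sigmoid is applied. `summed_outside_loss` is only one literal reading of the abstract, under the added assumption that the two responses have equal length and are paired position by position; the paper's actual ADPO objective, and its treatment of the token length and feedback length, may differ.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(chosen_logratios, rejected_logratios, beta=0.1):
    """Response-level DPO: the token sum sits inside the log-sigmoid."""
    margin = sum(chosen_logratios) - sum(rejected_logratios)
    return -log_sigmoid(beta * margin)


def summed_outside_loss(chosen_logratios, rejected_logratios, beta=0.1):
    """Literal reading of the abstract: the sum sits outside the log-sigmoid.

    Assumes equal-length responses paired position by position, which is a
    simplification made for this sketch, not something stated in the abstract.
    """
    return -sum(
        log_sigmoid(beta * (c - r))
        for c, r in zip(chosen_logratios, rejected_logratios)
    )


# Hypothetical per-token log-ratios for a chosen and a rejected response.
chosen = [0.4, 0.1, 0.3]
rejected = [-0.2, 0.0, -0.1]
inside = dpo_loss(chosen, rejected)
outside = summed_outside_loss(chosen, rejected)
```

For a single token the two forms coincide; they differ once several tokens contribute, because the second form scores each position separately instead of pooling the margin.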
cs.AI / 16 / 2602.09597
Detecting radar targets swarms in range profiles with a partially complex-valued neural network
在范围轮廓中使用部分复值神经网络检测雷达目标群
Abstract
Correctly detecting radar targets is usually challenged by clutter and waveform distortion. An additional difficulty stems from the relative proximity of several targets, which may be perceived as a single target in the worst case or may influence each other's detection thresholds. The negative impact of target proximity notably depends on the range resolution defined by the radar parameters and on the adaptive threshold adopted. This paper addresses target detection in radar range profiles containing multiple targets with varying proximity and distorted echoes. Inspired by recent contributions in the radar and signal processing literature, this work proposes partially complex-valued neural networks as an adaptive range profile processing method. Simulated datasets are generated and experiments are conducted to compare a common pulse compression approach with a simple neural network partially defined by complex-valued parameters. Whereas pulse compression processes one pulse length at a time, the proposed neural network is a generative architecture that goes through the entire received signal in one pass to generate a complete detection profile.
Chinese Translation
正确检测雷达目标通常面临杂波和波形失真的挑战。额外的困难来自于多个目标的相对接近,后者在最坏的情况下被视为单一目标,或者相互影响检测阈值。目标接近的负面影响显著依赖于由雷达参数定义的范围分辨率和采用的自适应阈值。本文针对包含多个目标、具有不同接近度和失真回波的雷达范围轮廓中的目标检测问题进行探讨。受到雷达和信号处理文献中近期贡献的启发,本文提出了部分复值神经网络作为自适应范围轮廓处理方法。生成了模拟数据集,并进行了实验,以比较常见的脉冲压缩方法与由复值参数部分定义的简单神经网络。脉冲压缩方法一次处理一个脉冲长度,而提出的神经网络是一种生成性架构,能够一次性处理整个接收到的信号,以生成完整的检测轮廓。
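Illustration for 2602.09597: a minimal sketch of the pulse compression baseline named in the abstract, implemented as a matched filter that slides the conjugated transmit pulse over the received samples one pulse length at a time. The Barker-style code, the echo delay, and the phase shift are toy assumptions; the paper's waveform, distortion model, and neural network are not reproduced here.

```python
def pulse_compress(received, pulse):
    """Matched filter: correlate the received signal with the known pulse.

    Returns the magnitude of the correlation at each candidate delay.
    """
    n, m = len(received), len(pulse)
    return [
        abs(sum(received[k + i] * pulse[i].conjugate() for i in range(m)))
        for k in range(n - m + 1)
    ]


# Hypothetical 4-chip phase code and a single echo delayed by 3 samples,
# with a 90-degree phase shift applied to the echo.
pulse = [1, 1, -1, 1]
received = [0, 0, 0] + [1j * p for p in pulse] + [0, 0, 0]

profile = pulse_compress(received, pulse)
peak_delay = profile.index(max(profile))
```

Each output sample only sees one pulse length of the input, which is the locality the abstract contrasts with a network that processes the whole received signal at once.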
cs.AI / 17 / 2602.09620
FLINGO -- Instilling ASP Expressiveness into Linear Integer Constraints
FLINGO -- 将 ASP 表达能力融入线性整数约束
Abstract
Constraint Answer Set Programming (CASP) is a hybrid paradigm that enriches Answer Set Programming (ASP) with numerical constraint processing, something required in many real-world applications. The usual specification of constraints in most CASP solvers is closer to the numerical back-end expressiveness and semantics, rather than to standard specification in ASP. In the latter, numerical attributes are represented with predicates and this allows declaring default values, leaving the attribute undefined, making non-deterministic assignments with choice rules or using aggregated values. In CASP, most (if not all) of these features are lost once we switch to a constraint-based representation of those same attributes. In this paper, we present the FLINGO language (and tool) that incorporates the aforementioned expressiveness inside the numerical constraints and we illustrate its use with several examples. Based on previous work that established its semantic foundations, we also present a translation from the newly introduced FLINGO syntax to regular CASP programs following the CLINGCON input format.
Chinese Translation
约束答案集编程(CASP)是一种混合范式,它通过数值约束处理丰富了答案集编程(ASP),这是许多现实世界应用所需的。在大多数 CASP 求解器中,约束的常规规范更接近于数值后端的表达能力和语义,而不是 ASP 中的标准规范。在后者中,数值属性通过谓词表示,这允许声明默认值、将属性留为空白、使用选择规则进行非确定性赋值或使用聚合值。在 CASP 中,一旦我们切换到这些相同属性的基于约束的表示,大多数(如果不是全部)这些特性就会丧失。本文介绍了 FLINGO 语言(及工具),它将上述表达能力融入数值约束中,并通过多个示例说明其用法。基于之前建立的语义基础,我们还展示了从新引入的 FLINGO 语法到遵循 CLINGCON 输入格式的常规 CASP 程序的翻译。
cs.AI / 18 / 2602.09653
ClinAlign: Scaling Healthcare Alignment from Clinician Preference
ClinAlign:从临床医生偏好扩展医疗对齐
Abstract
Although large language models (LLMs) demonstrate expert-level medical knowledge, aligning their open-ended outputs with fine-grained clinician preferences remains challenging. Existing methods often rely on coarse objectives or unreliable automated judges that are weakly grounded in professional guidelines. We propose a two-stage framework to address this gap. First, we introduce HealthRubrics, a dataset of 7,034 physician-verified preference examples in which clinicians refine LLM-drafted rubrics to meet rigorous medical standards. Second, we distill these rubrics into HealthPrinciples: 119 broadly reusable, clinically grounded principles organized by clinical dimensions, enabling scalable supervision beyond manual annotation. We use HealthPrinciples for (1) offline alignment by synthesizing rubrics for unlabeled queries and (2) an inference-time tool for guided self-revision. A 30B parameter model that activates only 3B parameters at inference trained with our framework achieves 33.4% on HealthBench-Hard, outperforming much larger models including Deepseek-R1 and o3, establishing a resource-efficient baseline for clinical alignment.
Chinese Translation
尽管大型语言模型(LLMs)展示了专家级的医学知识,但将其开放式输出与细致的临床医生偏好对齐仍然具有挑战性。现有方法通常依赖粗略目标或不可靠的自动评估者,这些评估者在专业指南中基础薄弱。我们提出了一个两阶段框架来解决这一问题。首先,我们引入了HealthRubrics,这是一个包含7034个经过医生验证的偏好示例的数据集,其中临床医生对LLM起草的标准进行了细化,以满足严格的医学标准。其次,我们将这些标准提炼为HealthPrinciples:119条广泛可重用的、以临床为基础的原则,按临床维度组织,使得超越人工标注的可扩展监督成为可能。我们利用HealthPrinciples进行(1)离线对齐,通过合成未标记查询的标准,以及(2)作为推理时的工具进行引导自我修订。一个在推理时仅激活30B参数模型中的3B参数,并使用我们框架训练的模型在HealthBench-Hard上达到了33.4%的成绩,超越了包括Deepseek-R1和o3在内的更大模型,为临床对齐建立了一个资源高效的基线。
cs.AI / 19 / 2602.09794
GHS-TDA: A Synergistic Reasoning Framework Integrating Global Hypothesis Space with Topological Data Analysis
GHS-TDA:一个将全局假设空间与拓扑数据分析相结合的协同推理框架
Abstract
Chain-of-Thought (CoT) has been shown to significantly improve the reasoning accuracy of large language models (LLMs) on complex tasks. However, due to the autoregressive, step-by-step generation paradigm, existing CoT methods suffer from two fundamental limitations. First, the reasoning process is highly sensitive to early decisions: once an initial error is introduced, it tends to propagate and amplify through subsequent steps, while the lack of a global coordination and revision mechanism makes such errors difficult to correct, ultimately leading to distorted reasoning chains. Second, current CoT approaches lack structured analysis techniques for filtering redundant reasoning and extracting key reasoning features, resulting in unstable reasoning processes and limited interpretability. To address these issues, we propose GHS-TDA. GHS-TDA first constructs a semantically enriched global hypothesis graph to aggregate, align, and coordinate multiple candidate reasoning paths, thereby providing alternative global correction routes when local reasoning fails. It then applies topological data analysis based on persistent homology to capture stable multi-scale structures, remove redundancy and inconsistencies, and extract a more reliable reasoning skeleton. By jointly leveraging reasoning diversity and topological stability, GHS-TDA achieves self-adaptive convergence, produces high-confidence and interpretable reasoning paths, and consistently outperforms strong baselines in terms of both accuracy and robustness across multiple reasoning benchmarks.
Chinese Translation
链式思维(Chain-of-Thought, CoT)已被证明能够显著提高大型语言模型(Large Language Models, LLMs)在复杂任务上的推理准确性。然而,由于自回归的逐步生成范式,现有的 CoT 方法存在两个根本性限制。首先,推理过程对早期决策高度敏感:一旦引入初始错误,往往会在后续步骤中传播和放大,而缺乏全局协调和修正机制使得此类错误难以纠正,最终导致扭曲的推理链。其次,目前的 CoT 方法缺乏结构化分析技术来过滤冗余推理和提取关键推理特征,导致推理过程不稳定且可解释性有限。为了解决这些问题,我们提出了 GHS-TDA。GHS-TDA 首先构建一个语义丰富的全局假设图,以聚合、对齐和协调多个候选推理路径,从而在局部推理失败时提供替代的全局修正路径。然后,它基于持久同调(persistent homology)应用拓扑数据分析,以捕捉稳定的多尺度结构,消除冗余和不一致性,并提取更可靠的推理骨架。通过共同利用推理多样性和拓扑稳定性,GHS-TDA 实现了自适应收敛,生成高置信度和可解释的推理路径,并在多个推理基准测试中在准确性和鲁棒性方面始终优于强基线。
cs.AI / 20 / 2602.09798
Symbolic Pattern Temporal Numeric Planning with Intermediate Conditions and Effects
具有中间条件和效果的符号模式时序数值规划
Abstract
Recently, a Symbolic Pattern Planning (SPP) approach was proposed for numeric planning where a pattern (i.e., a finite sequence of actions) suggests a causal order between actions. The pattern is then encoded in an SMT formula whose models correspond to valid plans. If the suggestion by the pattern is inaccurate and no valid plan can be found, the pattern is extended until it contains the causal order of actions in a valid plan, making the approach complete. In this paper, we extend the SPP approach to the temporal planning with Intermediate Conditions and Effects (ICEs) fragment, where $(i)$ actions are durative (and thus can overlap over time) and have conditions/effects which can be checked/applied at any time during an action's execution, and $(ii)$ one can specify plan conditions/effects that must be checked/applied at specific times during the plan execution. Experimental results show that our SPP planner Patty $(i)$ outperforms all other planners in the literature in the majority of temporal domains without ICEs, $(ii)$ obtains results comparable with the SoTA search planner for ICEs in literature domains with ICEs, and $(iii)$ outperforms the same planner in a novel domain based on a real-world application.
Chinese Translation
最近,提出了一种符号模式规划(Symbolic Pattern Planning, SPP)方法用于数值规划,其中模式(即有限的动作序列)建议了动作之间的因果顺序。该模式随后被编码为一个 SMT 公式,其模型对应于有效的规划。如果模式的建议不准确且无法找到有效的计划,则该模式会被扩展,直到其包含有效计划中动作的因果顺序,从而使该方法完整。在本文中,我们将 SPP 方法扩展到具有中间条件和效果(Intermediate Conditions and Effects, ICEs)片段的时序规划,其中 $(i)$ 动作是持续的(因此可以在时间上重叠),并且在动作执行的任何时间都可以检查/应用条件/效果,以及 $(ii)$ 可以指定在计划执行的特定时间必须检查/应用的计划条件/效果。实验结果表明,我们的 SPP 规划器 Patty $(i)$ 在大多数没有 ICEs 的时序领域中优于文献中的所有其他规划器,$(ii)$ 在具有 ICEs 的文献领域中与最先进的搜索规划器(SoTA search planner)获得可比结果,以及 $(iii)$ 在基于真实世界应用的新领域中优于同一规划器。
cs.AI / 21 / 2602.09802
Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices
大型语言模型会为视野支付额外费用吗?从主观选择推断支付意愿
Abstract
As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users' past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
Chinese Translation
随着大型语言模型(LLMs)在旅行助手和购买支持等应用中的日益普及,它们常常需要在没有客观正确答案的情况下代表用户做出主观选择。我们在旅行助手的背景下研究LLM的决策过程,通过向模型呈现选择困境,并利用多项式逻辑模型分析其反应,以推导出隐含的支付意愿(WTP)估计值。这些WTP值随后与经济学文献中的人类基准值进行比较。除了基线设置外,我们还考察了在更现实条件下模型行为的变化,包括提供有关用户过去选择的信息和基于角色的提示。我们的结果表明,尽管可以为较大的LLM推导出有意义的WTP值,但它们在属性层面上也表现出系统性偏差。此外,当引入昂贵选项或商业导向角色时,它们往往高估人类的WTP。基于用户对更便宜选项的先前偏好对模型进行条件化,可以得到更接近人类基准的估值。总体而言,我们的研究结果突显了使用LLM进行主观决策支持的潜力和局限性,并强调在实际应用中仔细选择模型、设计提示和用户表示的重要性。
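Illustration for 2602.09802: a minimal sketch of how willingness to pay is read off a multinomial logit model. With a linear utility, the implied WTP for an attribute is the negative ratio of its coefficient to the price coefficient, i.e. the price increase that exactly offsets the attribute's utility. The coefficient values and the hotel-room framing are hypothetical and are not estimates from the paper.

```python
import math


def choice_probabilities(utilities):
    """Multinomial logit (softmax) choice probabilities."""
    m = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def willingness_to_pay(beta_attribute, beta_price):
    """Implied WTP: the price change that offsets one unit of the attribute."""
    return -beta_attribute / beta_price


# Hypothetical coefficients (illustrative only).
beta_price = -0.02  # utility per currency unit
beta_view = 0.50    # utility of a room with a view

wtp_view = willingness_to_pay(beta_view, beta_price)

# A room with a view priced wtp_view higher is as attractive as one without.
u_plain = beta_price * 100.0
u_view = beta_price * (100.0 + wtp_view) + beta_view
probs = choice_probabilities([u_plain, u_view])
```

At a surcharge equal to the WTP, the model is indifferent between the two options, which is what makes the ratio directly comparable to human benchmark values.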
cs.AI / 22 / 2602.09813
Efficient Unsupervised Environment Design through Hierarchical Policy Representation Learning
通过层次化策略表示学习实现高效的无监督环境设计
Abstract
Unsupervised Environment Design (UED) has emerged as a promising approach to developing general-purpose agents through automated curriculum generation. Popular UED methods focus on Open-Endedness, where teacher algorithms rely on stochastic processes for infinite generation of useful environments. This assumption becomes impractical in resource-constrained scenarios where teacher-student interaction opportunities are limited. To address this challenge, we introduce a hierarchical Markov Decision Process (MDP) framework for environment design. Our framework features a teacher agent that leverages student policy representations derived from discovered evaluation environments, enabling it to generate training environments based on the student's capabilities. To improve efficiency, we incorporate a generative model that augments the teacher's training dataset with synthetic data, reducing the need for teacher-student interactions. In experiments across several domains, we show that our method outperforms baseline approaches while requiring fewer teacher-student interactions in a single episode. The results suggest the applicability of our approach in settings where training opportunities are limited.
Chinese Translation
无监督环境设计(UED)作为一种通过自动化课程生成开发通用智能体的有前景的方法,逐渐受到关注。当前流行的UED方法主要集中在开放性(Open-Endedness)上,其中教师算法依赖随机过程进行有用环境的无限生成。然而,这一假设在资源受限的场景中变得不切实际,因为教师与学生的互动机会有限。为了解决这一挑战,我们提出了一种用于环境设计的层次化马尔可夫决策过程(MDP)框架。我们的框架包含一个教师智能体,该智能体利用从发现的评估环境中获得的学生策略表示,使其能够根据学生的能力生成训练环境。为了提高效率,我们引入了一种生成模型,该模型通过合成数据增强教师的训练数据集,从而减少教师与学生之间的互动需求。在多个领域的实验中,我们展示了我们的方法在单个回合中优于基线方法,同时需要的教师-学生互动更少。结果表明,我们的方法在训练机会有限的环境中具有适用性。
cs.AI / 23 / 2602.09937
Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
为什么人工智能代理在云根本原因分析中系统性失败?
Abstract
Failures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final-answer correctness without revealing why the agent's reasoning failed. This paper presents a process-level failure analysis of LLM-based RCA agents. We execute the full OpenRCA benchmark across five LLMs, producing 1,675 agent runs, and classify observed failures into 12 pitfall types across intra-agent reasoning, inter-agent communication, and agent-environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points. The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.
Chinese Translation
大规模云系统的故障会导致巨大的财务损失,因此自动化根本原因分析(Root Cause Analysis, RCA)对运营稳定性至关重要。近期的研究利用大型语言模型(Large Language Model, LLM)代理来自动化这一任务,但现有系统即使在能力较强的模型下也表现出较低的检测准确性,当前的评估框架仅评估最终答案的正确性,而未揭示代理推理失败的原因。本文对基于LLM的RCA代理进行了过程级故障分析。我们在五个LLM模型上执行了完整的OpenRCA基准测试,产生了1,675次代理运行,并将观察到的故障分类为12种陷阱类型,涵盖了代理内部推理、代理间通信和代理与环境交互。我们的分析揭示,最常见的陷阱,尤其是虚构数据解释和不完整探索,存在于所有模型中,无论其能力等级如何,这表明这些故障源于共享的代理架构,而非个别模型的局限性。控制性缓解实验进一步表明,仅靠提示工程无法解决主要陷阱,而丰富代理间的通信协议可以将与通信相关的故障减少多达15个百分点。本研究中开发的陷阱分类法和诊断方法为设计更可靠的云RCA自主代理奠定了基础。
cs.AI / 24 / 2602.09945
Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning
通过差异推理学习弥补临床智能体中的推理差距
Abstract
Clinical decision support requires not only correct answers but also clinically valid reasoning. We propose Differential Reasoning Learning (DRL), a framework that improves clinical agents by learning from reasoning discrepancies. From reference reasoning rationales (e.g., physician-authored clinical rationale, clinical guidelines, or outputs from more capable models) and the agent's free-form chain-of-thought (CoT), DRL extracts reasoning graphs as directed acyclic graphs (DAGs) and performs a clinically weighted graph edit distance (GED)-based discrepancy analysis. An LLM-as-a-judge aligns semantically equivalent nodes and diagnoses discrepancies between graphs. These graph-level discrepancy diagnostics are converted into natural-language instructions and stored in a Differential Reasoning Knowledge Base (DR-KB). At inference, we retrieve top-$k$ instructions via Retrieval-Augmented Generation (RAG) to augment the agent prompt and patch likely logic gaps. Evaluation on open medical question answering (QA) benchmarks and a Return Visit Admissions (RVA) prediction task from internal clinical data demonstrates gains over baselines, improving both final-answer accuracy and reasoning fidelity. Ablation studies confirm gains from infusing reference reasoning rationales and the top-$k$ retrieval strategy. Clinicians' review of the output provides further assurance of the approach. Together, results suggest that DRL supports more reliable clinical decision-making in complex reasoning scenarios and offers a practical mechanism for deployment under limited token budgets.
Chinese Translation
临床决策支持不仅需要正确的答案,还需要临床有效的推理。我们提出了差异推理学习(Differential Reasoning Learning, DRL),这是一个通过学习推理差异来改进临床智能体的框架。DRL 从参考推理依据(例如,医生撰写的临床推理、临床指南或更强大模型的输出)和智能体的自由形式思维链(Chain-of-Thought, CoT)中提取推理图,作为有向无环图(Directed Acyclic Graphs, DAGs),并执行基于临床加权图编辑距离(Graph Edit Distance, GED)的差异分析。一个作为评判者的大型语言模型(LLM)对语义等价的节点进行对齐,并诊断图之间的差异。这些图级差异诊断被转换为自然语言指令,并存储在差异推理知识库(Differential Reasoning Knowledge Base, DR-KB)中。在推理时,我们通过检索增强生成(Retrieval-Augmented Generation, RAG)检索前 $k$ 条指令,以增强智能体的提示并修补可能的逻辑漏洞。在开放医学问答(QA)基准和来自内部临床数据的复诊入院(Return Visit Admissions, RVA)预测任务上的评估表明,相较于基线方法有显著提升,改善了最终答案的准确性和推理的可靠性。消融研究确认了参考推理依据和前 $k$ 检索策略的增益。临床医生对输出的审查进一步确保了该方法的有效性。综合来看,结果表明 DRL 在复杂推理场景中支持更可靠的临床决策,并为在有限的令牌预算下的部署提供了一种实用机制。
cs.AI / 25 / 2602.10004
ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference
ESTAR:用于高效推理的早停令牌感知推理
Abstract
Large reasoning models (LRMs) achieve state-of-the-art performance by generating long chains-of-thought, but often waste computation on redundant reasoning after the correct answer has already been reached. We introduce Early-Stopping for Token-Aware Reasoning (ESTAR), which detects and reduces such reasoning redundancy to improve efficiency without sacrificing accuracy. Our method combines (i) a trajectory-based classifier that identifies when reasoning can be safely stopped, (ii) supervised fine-tuning to teach LRMs to propose self-generated signals, and (iii) stop-aware reinforcement learning that truncates rollouts at self-generated stop points with compute-aware rewards. Experiments on four reasoning datasets show that ESTAR reduces reasoning length by about 3.7x (from 4,799 to 1,290) while preserving accuracy (74.9% vs. 74.2%), with strong cross-domain generalization. These results highlight early stopping as a simple yet powerful mechanism for improving reasoning efficiency in LRMs.
Chinese Translation
大型推理模型(LRMs)通过生成长链思维实现了最先进的性能,但在正确答案已经得出后,往往会浪费计算资源进行冗余推理。我们提出了早停令牌感知推理(ESTAR),该方法能够检测并减少这种推理冗余,从而提高效率而不牺牲准确性。我们的方法结合了(i)一种基于轨迹的分类器,用于识别何时可以安全停止推理,(ii)监督微调,以教会LRMs提出自生成的信号,以及(iii)停止感知强化学习,在自生成的停止点以计算感知的奖励截断展开。对四个推理数据集的实验表明,ESTAR将推理长度减少了约3.7倍(从4,799减少到1,290),同时保持了准确性(74.9%对比74.2%),并具有强大的跨领域泛化能力。这些结果突显了早停作为一种简单而强大的机制,用于提高LRMs的推理效率。
cs.AI / 26 / 2602.10009
Discovering High Level Patterns from Simulation Traces
从仿真轨迹中发现高级模式
Abstract
Artificial intelligence (AI) agents embedded in environments with physics-based interaction face many challenges including reasoning, planning, summarization, and question answering. This problem is exacerbated when a human user wishes to either guide or interact with the agent in natural language. Although the use of Language Models (LMs) is the default choice, as an AI tool, they struggle with tasks involving physics. The LM's capability for physical reasoning is learned from observational data, rather than being grounded in simulation. A common approach is to include simulation traces as context, but this suffers from poor scalability as simulation traces contain larger volumes of fine-grained numerical and semantic data. In this paper, we propose a natural language guided method to discover coarse-grained patterns (e.g., 'rigid-body collision', 'stable support', etc.) from detailed simulation logs. Specifically, we synthesize programs that operate on simulation logs and map them to a series of high level activated patterns. We show, through two physics benchmarks, that this annotated representation of the simulation log is more amenable to natural language reasoning about physical systems. We demonstrate how this method enables LMs to generate effective reward programs from goals specified in natural language, which may be used within the context of planning or supervised learning.
Chinese Translation
嵌入在具有物理交互环境中的人工智能(AI)代理面临许多挑战,包括推理、规划、总结和问答。当人类用户希望以自然语言引导或与代理互动时,这一问题更加严重。尽管使用语言模型(Language Models, LMs)是默认选择,但作为一种AI工具,它们在涉及物理的任务中表现不佳。语言模型的物理推理能力是通过观察数据学习的,而不是基于仿真。一个常见的方法是将仿真轨迹作为上下文,但由于仿真轨迹包含大量细粒度的数值和语义数据,这种方法的可扩展性较差。在本文中,我们提出了一种自然语言引导的方法,从详细的仿真日志中发现粗粒度模式(例如,“刚体碰撞”、“稳定支撑”等)。具体而言,我们合成在仿真日志上操作的程序,并将其映射到一系列高级激活模式。通过两个物理基准测试,我们展示了这种注释表示的仿真日志更适合于关于物理系统的自然语言推理。我们演示了该方法如何使语言模型能够从自然语言指定的目标生成有效的奖励程序,这些程序可以在规划或监督学习的上下文中使用。
cs.AI / 27 / 2602.10063
Chain of Mindset: Reasoning with Adaptive Cognitive Modes
思维链:使用自适应认知模式进行推理
Abstract
Human problem-solving is never the repetition of a single mindset, by which we mean a distinct mode of cognitive processing. When tackling a specific task, we do not rely on a single mindset; instead, we integrate multiple mindsets within the single solution process. However, existing LLM reasoning methods fall into a common trap: they apply the same fixed mindset across all steps, overlooking that different stages of solving the same problem require fundamentally different mindsets. This single-minded assumption prevents models from reaching the next level of intelligence. To address this limitation, we propose Chain of Mindset (CoM), a training-free agentic framework that enables step-level adaptive mindset orchestration. CoM decomposes reasoning into four functionally heterogeneous mindsets: Spatial, Convergent, Divergent, and Algorithmic. A Meta-Agent dynamically selects the optimal mindset based on the evolving reasoning state, while a bidirectional Context Gate filters cross-module information flow to maintain effectiveness and efficiency. Experiments across six challenging benchmarks spanning mathematics, code generation, scientific QA, and spatial reasoning demonstrate that CoM achieves state-of-the-art performance, outperforming the strongest baseline by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash, while balancing reasoning efficiency. Our code is publicly available at https://github.com/QuantaAlpha/chain-of-mindset.
Chinese Translation
人类的问题解决从来不是单一思维模式的重复,这里所指的思维模式是指一种独特的认知处理方式。在处理特定任务时,我们并不依赖于单一的思维模式;相反,我们在单一的解决过程中整合了多种思维模式。然而,现有的大型语言模型(LLM)推理方法陷入了一个共同的陷阱:它们在所有步骤中应用相同的固定思维模式,忽视了解决同一问题的不同阶段需要根本不同的思维模式。这种单一思维的假设阻碍了模型达到更高的智能水平。为了解决这一局限性,我们提出了思维链(Chain of Mindset, CoM),一种无训练的自主框架,能够实现逐步自适应的思维模式协调。CoM将推理分解为四种功能上异质的思维模式:空间(Spatial)、聚合(Convergent)、发散(Divergent)和算法(Algorithmic)。一个元代理(Meta-Agent)根据不断变化的推理状态动态选择最佳思维模式,同时一个双向上下文门(Context Gate)过滤跨模块的信息流,以保持有效性和效率。在涵盖数学、代码生成、科学问答和空间推理的六个具有挑战性的基准测试中的实验表明,CoM实现了最先进的性能,在Qwen3-VL-32B-Instruct和Gemini-2.0-Flash的整体准确率上分别超过最强基线4.96%和4.72%,同时平衡了推理效率。我们的代码公开可用,地址为 https://github.com/QuantaAlpha/chain-of-mindset。
cs.AI / 28 / 2602.10085
CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs
CODE-SHARP:作为层次奖励程序的技能的持续开放式发现与演化
Abstract
Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the set of meaningful skills is not known a priori. While recent methods have shown promising results towards automating reward function design, they remain limited to refining rewards for pre-defined tasks. To address this limitation, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a novel framework leveraging Foundation Models (FM) to open-endedly expand and refine a hierarchical skill archive, structured as a directed graph of executable reward functions in code. We show that a goal-conditioned agent trained exclusively on the rewards generated by the discovered SHARP skills learns to solve increasingly long-horizon goals in the Craftax environment. When composed by a high-level FM-based planner, the discovered skills enable a single goal-conditioned agent to solve complex, long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over 134% on average. We will open-source our code and provide additional videos at https://sites.google.com/view/code-sharp/homepage.
Chinese Translation
开发能够开放式发现和学习新技能的智能体是人工智能领域的一项重大挑战。尽管强化学习为训练智能体掌握复杂技能提供了强大的框架,但它通常依赖于手工设计的奖励函数。这对于开放式技能发现来说是不可行的,因为有意义的技能集合并不是事先已知的。尽管最近的方法在自动化奖励函数设计方面显示出良好的前景,但它们仍然局限于为预定义任务优化奖励。为了解决这一局限性,我们提出了作为层次奖励程序的技能的持续开放式发现与演化(CODE-SHARP),这是一个新颖的框架,利用基础模型(Foundation Models, FM)开放式扩展和优化一个层次技能档案,该档案结构为可执行奖励函数的有向图。我们展示了一个仅在由发现的SHARP技能生成的奖励上训练的目标条件智能体,能够在Craftax环境中解决越来越长的目标。当与基于FM的高层规划器结合时,发现的技能使得单一的目标条件智能体能够解决复杂的长时间跨度任务,平均超越预训练智能体和任务特定专家策略超过134%。我们将开源我们的代码,并提供额外的视频,链接见 https://sites.google.com/view/code-sharp/homepage。
cs.AI / 29 / 2602.10090
Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
代理世界模型:用于代理强化学习的无限合成环境
Abstract
Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
Chinese Translation
最近在大型语言模型(LLM)方面的进展使得自主代理能够执行需要与工具和环境进行多轮交互的复杂任务。然而,扩展此类代理训练受到缺乏多样化和可靠环境的限制。本文提出了代理世界模型(Agent World Model, AWM),一个完全合成的环境生成管道。通过该管道,我们扩展到1,000个涵盖日常场景的环境,其中代理可以与丰富的工具集(每个环境平均35个工具)进行交互,并获得高质量的观察。值得注意的是,这些环境是代码驱动的,并由数据库支持,提供比LLM模拟的环境更可靠和一致的状态转移。此外,与从现实环境中收集轨迹相比,它们还能够实现更高效的代理交互。为了证明这一资源的有效性,我们对多轮工具使用代理进行了大规模强化学习。得益于完全可执行的环境和可访问的数据库状态,我们还可以设计可靠的奖励函数。在三个基准测试上的实验表明,仅在合成环境中训练,而非特定基准的环境,能够产生强大的分布外泛化。代码可在 https://github.com/Snowflake-Labs/agent-world-model 获取。
cs.CL / 1 / 2602.09147
Overview of PAN 2026: Voight-Kampff Generative AI Detection, Text Watermarking, Multi-Author Writing Style Analysis, Generative Plagiarism Detection, and Reasoning Trajectory Detection
PAN 2026 概述:Voight-Kampff 生成式人工智能检测、文本水印、多作者写作风格分析、生成式抄袭检测和推理轨迹检测
Abstract
The goal of the PAN workshop is to advance computational stylometry and text forensics via objective and reproducible evaluation. In 2026, we run the following five tasks: (1) Voight-Kampff Generative AI Detection, particularly in mixed and obfuscated authorship scenarios, (2) Text Watermarking, a new task that aims to find new text watermarking schemes and benchmark the robustness of existing ones, (3) Multi-author Writing Style Analysis, a continued task that aims to find positions of authorship change, (4) Generative Plagiarism Detection, a continued task that targets source retrieval and text alignment between generated text and source documents, and (5) Reasoning Trajectory Detection, a new task that deals with source detection and safety detection of LLM-generated or human-written reasoning trajectories. As in previous years, PAN invites software submissions as easy-to-reproduce Docker containers for most of the tasks. Since PAN 2012, more than 1,100 submissions have been made this way via the TIRA experimentation platform.
Chinese Translation
PAN 研讨会的目标是通过客观和可重复的评估推动计算风格学和文本取证的发展。在 2026 年,我们将开展以下五个任务:(1)Voight-Kampff 生成式人工智能检测,特别是在混合和模糊作者身份的场景中;(2)文本水印,这是一个新任务,旨在寻找新的文本水印方案并基准测试现有方案的鲁棒性;(3)多作者写作风格分析,这是一个持续的任务,旨在寻找作者身份变化的位置;(4)生成式抄袭检测,这是一个持续的任务,目标是源文档检索和生成文本与源文档之间的文本对齐;(5)推理轨迹检测,这是一个新任务,涉及 LLM 生成或人类撰写的推理轨迹的源检测和安全检测。与往年一样,PAN 邀请软件提交以易于重现的 Docker 容器形式进行大多数任务的提交。自 2012 年以来,已有超过 1,100 个提交通过 TIRA 实验平台以这种方式进行。
cs.CL / 2 / 2602.09269
Measuring Inclusion in Interaction: Inclusion Analytics for Human-AI Collaborative Learning
互动中的包容性测量:人机协作学习的包容性分析
Abstract
Inclusion, equity, and access are widely valued in AI and education, yet are often assessed through coarse sample descriptors or post-hoc self-reports that miss how inclusion is shaped moment by moment in collaborative problem solving (CPS). In this proof-of-concept paper, we introduce inclusion analytics, a discourse-based framework for examining inclusion as a dynamic, interactional process in CPS. We conceptualize inclusion along three complementary dimensions -- participation equity, affective climate, and epistemic equity -- and demonstrate how these constructs can be made analytically visible using scalable, interaction-level measures. Using both simulated conversations and empirical data from human-AI teaming experiments, we illustrate how inclusion analytics can surface patterns of participation, relational dynamics, and idea uptake that remain invisible to aggregate or post-hoc evaluations. This work represents an initial step toward process-oriented approaches to measuring inclusion in human-AI collaborative learning environments.
Chinese Translation
包容性、公平性和可及性在人工智能和教育中被广泛重视,但通常通过粗略的样本描述符或事后自我报告进行评估,这些方法未能捕捉到在协作问题解决(CPS)中包容性是如何在每一个时刻形成的。在这篇概念验证论文中,我们引入了包容性分析,这是一种基于话语的框架,用于考察包容性作为CPS中一种动态的互动过程。我们将包容性概念化为三个互补维度——参与公平、情感气候和认知公平,并展示如何利用可扩展的互动级别测量使这些构念在分析上变得可见。通过模拟对话和来自人机团队实验的实证数据,我们说明了包容性分析如何揭示参与模式、关系动态和思想采纳,这些在聚合或事后评估中是不可见的。这项工作代表了朝着过程导向的方法测量人机协作学习环境中包容性的初步步骤。
cs.CL / 3 / 2602.09276
Effective Reasoning Chains Reduce Intrinsic Dimensionality
有效的推理链减少内在维度
Abstract
Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.
Chinese Translation
链式思维(Chain-of-thought, CoT)推理及其变体显著提高了语言模型在复杂推理任务上的表现,但不同策略如何促进泛化的具体机制仍然不甚清楚。尽管当前的解释通常指向测试时计算的增加或结构性指导,但在这些因素与泛化之间建立一致且可量化的联系仍然具有挑战性。在本研究中,我们将内在维度识别为表征推理链有效性的量化指标。内在维度量化了在给定任务上达到特定准确率阈值所需的最小模型维度数。通过固定模型架构并通过不同的推理策略变化任务表述,我们证明有效的推理策略始终减少任务的内在维度。在使用Gemma-3 1B和4B验证GSM8K时,我们观察到推理策略的内在维度与其在同分布和异分布数据上的泛化性能之间存在强烈的负相关关系。我们的发现表明,有效的推理链通过使用更少的参数更好地压缩任务,从而促进学习,为分析推理过程提供了一种新的量化指标。
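The definition in this abstract (the minimum number of model dimensions needed to reach a given accuracy threshold) can be illustrated with a minimal Python sketch. The function name `intrinsic_dimension`, the strategy names, and every number below are hypothetical toy values chosen for illustration; they are not results from the paper, and the paper's actual measurement procedure may differ.

```python
def intrinsic_dimension(accuracy_by_dim, threshold):
    """Return the smallest dimension whose accuracy meets the threshold, or None."""
    reached = [dim for dim, acc in accuracy_by_dim.items() if acc >= threshold]
    return min(reached) if reached else None


# Toy numbers for illustration only; they are NOT results from the paper.
# Keys are (assumed) trainable subspace dimensions, values are task accuracy.
direct_answer = {64: 0.10, 256: 0.22, 1024: 0.41, 4096: 0.55}
step_by_step = {64: 0.18, 256: 0.43, 1024: 0.58, 4096: 0.61}

threshold = 0.40
print(intrinsic_dimension(direct_answer, threshold))  # 1024
print(intrinsic_dimension(step_by_step, threshold))   # 256
```

Under this reading, a strategy that reaches the threshold at a smaller dimension has the lower intrinsic dimensionality.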
cs.CL / 4 / 2602.09312
Don't Shoot The Breeze: Topic Continuity Model Using Nonlinear Naive Bayes With Attention
别闲聊:使用带注意力机制的非线性朴素贝叶斯的主题连续性模型
Abstract
Utilizing Large Language Models (LLMs) as chatbots in diverse business scenarios often presents the challenge of maintaining topic continuity. Abrupt shifts in topics can lead to poor user experiences and inefficient utilization of computational resources. In this paper, we present a topic continuity model aimed at assessing whether a response aligns with the initial conversation topic. Our model is built upon the expansion of the corresponding natural language understanding (NLU) model into quantifiable terms using a Naive Bayes approach. We then introduce an attention mechanism and logarithmic nonlinearity to enhance its capability to capture topic continuity. This approach allows us to convert the NLU model into an interpretable analytical formula. In contrast to many NLU models constrained by token limits, our proposed model can seamlessly handle conversations of any length with linear time complexity. Furthermore, the attention mechanism significantly improves the model's ability to identify topic continuity in complex conversations. According to our experiments, our model consistently outperforms traditional methods, particularly in handling lengthy and intricate conversations. This unique capability offers us an opportunity to ensure the responsible and interpretable use of LLMs.
Chinese Translation
在多种商业场景中将大型语言模型(LLM)作为聊天机器人使用,常常面临保持主题连续性的挑战。主题的突然转变可能导致用户体验不佳和计算资源的低效利用。本文提出了一种主题连续性模型,旨在评估响应是否与初始对话主题一致。我们的模型基于将相应的自然语言理解(NLU)模型扩展为可量化的形式,采用朴素贝叶斯方法。随后,我们引入了注意力机制和对数非线性,以增强其捕捉主题连续性的能力。这种方法使我们能够将NLU模型转化为可解释的分析公式。与许多受限于令牌限制的NLU模型相比,我们提出的模型能够以线性时间复杂度无缝处理任意长度的对话。此外,注意力机制显著提高了模型在复杂对话中识别主题连续性的能力。根据我们的实验,模型在处理冗长和复杂对话时始终优于传统方法。这一独特能力为我们确保LLM的负责任和可解释使用提供了机会。
cs.CL / 5 / 2602.09331
Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization
超越均匀信用:用于策略优化的因果信用分配
Abstract
Policy gradient methods for language model reasoning, such as GRPO and DAPO, assign uniform credit to all generated tokens - the filler phrase "Let me think" receives the same gradient update as the critical calculation "23 + 45 = 68." We propose counterfactual importance weighting: mask reasoning spans, measure the drop in answer probability, and upweight tokens accordingly during policy gradient updates. Our method requires no auxiliary models or external annotation; instead, importance is estimated directly from the policy model's own probability shifts. Experiments on GSM8K across three models spanning the Qwen and Llama families demonstrate consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming we capture genuine causal structure rather than noise. Analysis shows the method correctly prioritizes calculation steps over scaffolding text. We view these findings as establishing counterfactual importance weighting as a foundation for further research rather than a complete solution.
Chinese Translation
语言模型推理的策略梯度方法,如 GRPO 和 DAPO,给所有生成的标记分配均匀的信用——填充短语 "让我想想" 收到的梯度更新与关键计算 "23 + 45 = 68" 相同。我们提出了反事实重要性加权:掩蔽推理跨度,测量答案概率的下降,并在策略梯度更新期间相应地增加标记的权重。我们的方法不需要辅助模型或外部注释,而是直接从策略模型自身的概率变化中估计重要性。在跨越 Qwen 和 Llama 系列的三种模型上对 GSM8K 的实验表明,相较于均匀基线,我们的方法在一致性上有显著改善,并且更快地收敛到等效的准确性。反转重要性信号会损害性能,确认我们捕捉到的是真正的因果结构而非噪声。分析表明该方法正确地优先考虑计算步骤而非支撑文本。我们认为这些发现确立了反事实重要性加权作为进一步研究的基础,而非完整解决方案。
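The three steps named in this abstract (mask a reasoning span, measure the drop in answer probability, upweight the span's tokens) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `span_importance`, `token_weights`, and the `toy_answer_logprob` scorer are hypothetical names, the `1 + drop` weighting and the mean-one normalization are assumed design choices, and a real scorer would query the policy model itself.

```python
def span_importance(spans, answer_logprob):
    """Drop in answer log-probability when each span is masked.

    spans: list of (start, end) token index ranges, end exclusive.
    answer_logprob: callable mapping a frozenset of masked token indices
        to log p(answer | reasoning with those tokens masked). Hypothetical
        interface; a real one would call the policy model.
    """
    base = answer_logprob(frozenset())
    drops = []
    for start, end in spans:
        masked = frozenset(range(start, end))
        # Clamp at zero so spans whose removal helps are not upweighted.
        drops.append(max(0.0, base - answer_logprob(masked)))
    return drops


def token_weights(num_tokens, spans, drops):
    """Per-token weights, rescaled so the mean weight is 1 (an assumption)."""
    raw = [1.0] * num_tokens
    for (start, end), drop in zip(spans, drops):
        for i in range(start, end):
            raw[i] = 1.0 + drop
    scale = num_tokens / sum(raw)
    return [w * scale for w in raw]


def toy_answer_logprob(masked):
    """Toy scorer: tokens 3..7 hold the calculation; masking them hurts."""
    calc = set(range(3, 8))
    return -0.1 - 2.0 * (len(calc & masked) / len(calc))


spans = [(0, 3), (3, 8)]  # filler span, then calculation span
drops = span_importance(spans, toy_answer_logprob)
weights = token_weights(8, spans, drops)
print(weights[0] < 1.0 < weights[3])  # True: calculation tokens get more credit
```

In a policy gradient update, each token's log-probability gradient would then be multiplied by its weight instead of receiving uniform credit.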
cs.CL / 6 / 2602.09336
FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding
FM SO.P:一种具有自动评估的渐进式任务混合框架,用于跨域标准操作程序理解
Abstract
Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between the reasoning capabilities that SOPs require: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, solving these challenges through two novelties. First, we introduce progressive task mixtures that build capabilities in stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system consisting of three agents that adaptively generate rubrics, stratified test sets, and rubric scoring, adapting to domains (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves a 48.3% pass rate with our 32B model and 34.3% with our open-source 7B model, matching the Qwen-2.5-72B-Instruct baseline (34.4%) with 10x fewer parameters.
Chinese Translation
标准操作程序(SOP)对企业运营至关重要,但现有语言模型在SOP理解和跨域泛化方面表现不佳。目前的方法失败的原因在于联合训练无法区分SOP所需的推理能力:术语精确性、顺序排列和约束推理。我们提出了FM SO.P,通过两项创新来解决这些挑战。首先,我们引入了渐进式任务混合,通过三个任务类型的阶段性能力构建和累积数据:术语精确性的概念消歧、程序正确性的动作序列理解,以及条件逻辑的场景感知图推理。其次,我们提出了一种自动多智能体评估系统,由三个智能体组成,能够自适应生成评分标准、分层测试集和评分,适应不同领域(例如,机动车管理局的时间约束、银行的合规性)。在SOPBench上对七个领域(银行、机动车管理局、医疗保健、市场、大学、图书馆、酒店)进行评估,FM SO.P在我们的32B模型上达到了48.3%的通过率,在我们的开源7B模型上达到了34.3%,与Qwen-2.5-72B-Instruct基线(34.4%)相匹配,且参数数量减少了10倍。
cs.CL / 7 / 2602.09339
Understanding Risk and Dependency in AI Chatbot Use from User Discourse
理解用户话语中AI聊天机器人使用的风险与依赖性
Abstract
Generative AI systems are increasingly embedded in everyday life, yet empirical understanding of how psychological risk associated with AI use emerges, is experienced, and is regulated by users remains limited. We present a large-scale computational thematic analysis of posts collected between 2023 and 2025 from two Reddit communities, r/AIDangers and r/ChatbotAddiction, explicitly focused on AI-related harm and distress. Using a multi-agent, LLM-assisted thematic analysis grounded in Braun and Clarke's reflexive framework, we identify 14 recurring thematic categories and synthesize them into five higher-order experiential dimensions. To further characterize affective patterns, we apply emotion labeling using a BERT-based classifier and visualize emotional profiles across dimensions. Our findings reveal five empirically derived experiential dimensions of AI-related psychological risk grounded in real-world user discourse, with self-regulation difficulties emerging as the most prevalent and fear concentrated in concerns related to autonomy, control, and technical risk. These results provide early empirical evidence from lived user experience of how AI safety is perceived and emotionally experienced outside laboratory or speculative contexts, offering a foundation for future AI safety research, evaluation, and responsible governance.
Chinese Translation
生成性人工智能系统日益融入日常生活,但关于与AI使用相关的心理风险如何产生、被体验和被用户调节的实证理解仍然有限。我们对2023年至2025年间从两个Reddit社区(r/AIDangers和r/ChatbotAddiction)收集的帖子进行了大规模的计算主题分析,这些帖子明确关注与AI相关的伤害和痛苦。采用基于多代理和大型语言模型(LLM)辅助的主题分析,基于Braun和Clarke的反思框架,我们识别出14个反复出现的主题类别,并将其综合为五个更高层次的体验维度。为了进一步描述情感模式,我们使用基于BERT的分类器进行情感标注,并可视化各维度的情感特征。我们的研究结果揭示了基于真实用户话语的与AI相关的心理风险的五个实证体验维度,其中自我调节困难是最普遍的,而恐惧主要集中在与自主性、控制和技术风险相关的担忧上。这些结果提供了来自用户实际体验的早期实证证据,展示了AI安全在实验室或推测性背景之外的感知和情感体验,为未来的AI安全研究、评估和负责任的治理奠定了基础。
cs.CL / 8 / 2602.09346
Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLMs
西班牙语中的数字语言偏见:来自大型语言模型的词汇变异证据
Abstract
This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish, a language that exhibits substantial regional variation. Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: Yes-No questions and multiple-choice questions. To this end, we exploited a large-scale, expert-curated database of Spanish lexical variation. Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country and dialectal area levels. Across both evaluation formats, the results reveal systematic differences in how LLMs represent Spanish language varieties. Lexical variation associated with Spain, Equatorial Guinea, Mexico & Central America, and the La Plata River is recognized more accurately by the models, while the Chilean variety proves particularly difficult for the models to distinguish. Importantly, differences in the volume of country-level digital resources do not account for these performance patterns, suggesting that factors beyond data quantity shape dialectal representation in LLMs. By providing a fine-grained, large-scale evaluation of geographic lexical variation, this work advances empirical understanding of dialectal knowledge in LLMs and contributes new evidence to discussions of Digital Linguistic Bias in Spanish.
Chinese Translation
本研究考察了大型语言模型(LLMs)在多大程度上捕捉西班牙语的地理词汇变异,这是一种表现出显著区域变异的语言。我们将LLMs视为虚拟信息源,通过两种调查式问题格式来探讨它们的方言知识:是非问题和多项选择问题。为此,我们利用了一个大规模的、由专家策划的西班牙语词汇变异数据库。我们的评估涵盖了来自21个西班牙语国家的900多个词汇项,并在国家和方言区域两个层面进行。通过这两种评估格式,结果揭示了LLMs在表现西班牙语言变体时的系统性差异。与西班牙、赤道几内亚、墨西哥及中美洲以及拉普拉塔河相关的词汇变异被模型更准确地识别,而智利变体则被模型区分的特别困难。重要的是,国家级数字资源的数量差异并不能解释这些表现模式,表明超越数据数量的因素在LLMs的方言表现中起着作用。通过提供对地理词汇变异的细致、大规模评估,本研究推进了对LLMs方言知识的实证理解,并为西班牙语中的数字语言偏见讨论提供了新的证据。
cs.CL / 9 / 2602.09366
Unsupervised Cross-Lingual Part-of-Speech Tagging with Monolingual Corpora Only
仅使用单语语料的无监督跨语言词性标注
Abstract
Due to the scarcity of part-of-speech annotated data, existing studies on low-resource languages typically adopt unsupervised approaches for POS tagging. Among these, POS tag projection with word alignment transfers POS tags from a high-resource source language to a low-resource target language based on parallel corpora, making it particularly suitable for low-resource language settings. However, this approach relies heavily on parallel corpora, which are unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual part-of-speech (POS) tagging framework that relies solely on monolingual corpora by leveraging an unsupervised neural machine translation (UNMT) system. This UNMT system first translates sentences from a high-resource language into a low-resource one, thereby constructing pseudo-parallel sentence pairs. Then, we train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique to calibrate the projected POS tags on the target side, which helps train a more effective POS tagger. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish and French) and seven target languages (Afrikaans, Basque, Finnish, Indonesian, Lithuanian, Portuguese and Turkish). Experimental results show that our method can achieve performance comparable to the baseline cross-lingual POS tagger with parallel sentence pairs, and even exceeds it for certain target languages. Furthermore, our proposed multi-source projection technique further boosts performance, yielding an average improvement of 1.3% over previous methods.
Chinese Translation
由于词性标注数据的稀缺,现有对低资源语言的研究通常采用无监督的方法进行词性标注。在这些方法中,基于词对齐的词性标记投影方法通过平行语料将高资源源语言的词性标记转移到低资源目标语言,使其特别适合低资源语言环境。然而,这种方法在很大程度上依赖于平行语料,而许多低资源语言通常缺乏这样的语料。为了解决这一限制,我们提出了一种完全无监督的跨语言词性(POS)标注框架,仅依赖单语语料,通过利用无监督神经机器翻译(UNMT)系统。该UNMT系统首先将高资源语言的句子翻译成低资源语言,从而构建伪平行句子对。然后,我们根据词对齐的标准投影程序为目标语言训练一个词性标注器。此外,我们提出了一种多源投影技术,以校准目标侧的投影词性标记,从而增强训练更有效的词性标注器。我们在28对语言上评估了我们的框架,涵盖四种源语言(英语、德语、西班牙语和法语)和七种目标语言(南非荷兰语、巴斯克语、芬兰语、印尼语、立陶宛语、葡萄牙语和土耳其语)。实验结果表明,我们的方法在性能上可以与基于平行句子对的基线跨语言词性标注器相媲美,甚至在某些目标语言上超过了基线。此外,我们提出的多源投影技术进一步提升了性能,平均提高了1.3%,超越了之前的方法。
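Tag projection over word alignments, as described in this abstract, can be sketched in a few lines of Python. The abstract does not specify how the multi-source calibration works, so the majority vote below is an assumed stand-in rather than the paper's technique; `project_tags`, `vote`, and the example sentences' tags and alignments are all hypothetical.

```python
from collections import Counter


def project_tags(source_tags, alignment, target_len):
    """Copy POS tags from source tokens to aligned target tokens.

    alignment: list of (source_index, target_index) pairs.
    Unaligned target positions stay None; the first alignment wins.
    """
    projected = [None] * target_len
    for src, tgt in alignment:
        if projected[tgt] is None:
            projected[tgt] = source_tags[src]
    return projected


def vote(projections):
    """Majority vote per target position across several source languages."""
    merged = []
    for candidates in zip(*projections):
        counts = Counter(tag for tag in candidates if tag is not None)
        merged.append(counts.most_common(1)[0][0] if counts else None)
    return merged


# One 3-token target sentence, tagged via three pseudo-parallel sources.
from_english = project_tags(["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], 3)
from_german = project_tags(["DET", "NOUN", "VERB"], [(0, 0), (1, 1)], 3)
from_spanish = project_tags(["DET", "ADJ", "VERB"], [(0, 0), (1, 1), (2, 2)], 3)

merged = vote([from_english, from_german, from_spanish])
print(merged)  # ['DET', 'NOUN', 'VERB']
```

The vote lets agreeing sources override a single noisy projection (the `ADJ` from the third source) and fills positions that one source left unaligned.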
cs.CL / 10 / 2602.09372
AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis
AgentSkiller:通过语义集成的跨域数据合成扩展通用智能体智能
Abstract
Large Language Model agents demonstrate potential in solving real-world problems via tools, yet generalist intelligence is bottlenecked by scarce high-quality, long-horizon data. Existing methods collect privacy-constrained API logs or generate scripted interactions lacking diversity, which struggle to produce data requisite for scaling capabilities. We propose AgentSkiller, a fully automated framework synthesizing multi-turn interaction data across realistic, semantically linked domains. It employs a DAG-based architecture with explicit state transitions to ensure determinism and recoverability. The pipeline builds a domain ontology and Person-Centric Entity Graph, defines tool interfaces via Service Blueprints for Model Context Protocol servers, and populates environments with consistent databases and strict Domain Policies. A cross-domain fusion mechanism links services to simulate complex tasks. Finally, the pipeline creates user tasks by verifying solution paths, filtering via execution-based validation, and generating queries using a Persona-based Simulator for automated rollout. This produces reliable environments with clear state changes. To demonstrate effectiveness, we synthesized approximately 11K interaction samples; experimental results indicate that models trained on this dataset achieve significant improvements on function calling over baselines, particularly in larger parameter regimes.
Chinese Translation
大型语言模型代理在通过工具解决现实世界问题方面展现出潜力,但通用智能的提升受到高质量长时段数据稀缺的瓶颈。现有方法收集受隐私限制的API日志或生成缺乏多样性的脚本交互,这些方法难以产生扩展能力所需的数据。我们提出了AgentSkiller,一个完全自动化的框架,用于合成跨现实、语义关联域的多轮交互数据。该框架采用基于有向无环图(DAG)的架构,具有明确的状态转换,以确保确定性和可恢复性。该管道构建了领域本体和以人为中心的实体图,通过服务蓝图定义模型上下文协议服务器的工具接口,并用一致的数据库和严格的领域政策填充环境。跨域融合机制将服务链接以模拟复杂任务。最后,该管道通过验证解决路径、基于执行的验证过滤和使用基于角色的模拟器生成查询来创建用户任务,以实现自动化推出。这产生了具有明确状态变化的可靠环境。为了验证有效性,我们合成了约11K个交互样本;实验结果表明,在该数据集上训练的模型在函数调用方面相较于基线取得了显著提升,尤其是在较大的参数范围内。
cs.CL / 11 / 2602.09373
AfriNLLB: Efficient Translation Models for African Languages
AfriNLLB:非洲语言的高效翻译模型
Abstract
In this work, we present AfriNLLB, a series of lightweight models for efficient translation from and into African languages. AfriNLLB supports 15 language pairs (30 translation directions), including Swahili, Hausa, Yoruba, Amharic, Somali, Zulu, Lingala, Afrikaans, Wolof, and Egyptian Arabic, as well as other African Union official languages such as Arabic (MSA), French, Portuguese, and Spanish. Our training data covers bidirectional translation between English and 13 languages, and between French and two languages (Lingala and Wolof). AfriNLLB models are based on NLLB-200 600M, which we compress using iterative layer pruning and quantization. We fine-tune the pruned models on parallel corpora we curated for African languages, employing knowledge distillation from a larger teacher model. Our work aims at enabling efficient deployment of translation models for African languages in resource-constrained settings. Our evaluation results demonstrate that AfriNLLB models achieve performance comparable to the baseline while being significantly faster. We release two versions of the AfriNLLB models, a Transformers version that allows further fine-tuning and a CTranslate2 version for efficient inference. Moreover, we release all the training data that we used for fine-tuning the baseline and pruned models to facilitate further research.
Chinese Translation
在本研究中,我们提出了AfriNLLB,一系列轻量级模型,用于高效地进行非洲语言之间的翻译。AfriNLLB支持15对语言(30个翻译方向),包括斯瓦希里语、豪萨语、约鲁巴语、阿姆哈拉语、索马里语、祖鲁语、林加拉语、南非荷兰语、沃洛夫语和埃及阿拉伯语,以及其他非洲联盟的官方语言,如阿拉伯语(现代标准阿拉伯语)、法语、葡萄牙语和西班牙语。我们的训练数据涵盖了英语与13种语言之间的双向翻译,以及法语与两种语言(林加拉语和沃洛夫语)之间的翻译。AfriNLLB模型基于NLLB-200 600M,我们通过迭代层修剪和量化对其进行了压缩。我们在为非洲语言策划的平行语料库上对修剪后的模型进行了微调,并采用了来自更大教师模型的知识蒸馏。我们的工作旨在实现非洲语言翻译模型在资源受限环境中的高效部署。我们的评估结果表明,AfriNLLB模型的性能与基线相当,同时显著更快。我们发布了两个版本的AfriNLLB模型,一个是允许进一步微调的Transformers版本,另一个是用于高效推理的CTranslate2版本。此外,我们还发布了用于微调基线和修剪模型的所有训练数据,以促进进一步的研究。
cs.CL / 12 / 2602.09383
BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
BiasScope:面向 LLM-as-a-Judge 评估中的偏见自动检测
Abstract
LLM-as-a-Judge has been widely adopted across various research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge it faces is bias, which has primarily been studied in terms of known biases and their impact on evaluation outcomes, while automated and systematic exploration of potential unknown biases is still lacking. Nevertheless, such exploration is crucial for enhancing the robustness and reliability of evaluations. To bridge this gap, we propose BiasScope, an LLM-driven framework for automatically discovering, at scale, potential biases that may arise during model evaluation. BiasScope can uncover potential biases across different model families and scales, with its generality and effectiveness validated on the JudgeBench dataset. It overcomes the limitations of existing approaches, transforming bias discovery from a passive process relying on manual effort and predefined bias lists into an active and comprehensive automated exploration. Moreover, based on BiasScope, we propose JudgeBench-Pro, an extended version of JudgeBench and a more challenging benchmark for evaluating the robustness of LLM-as-a-judge. Strikingly, even powerful LLMs as evaluators show error rates above 50% on JudgeBench-Pro, underscoring the urgent need to strengthen evaluation robustness and to mitigate potential biases further.
Chinese Translation
LLM-as-a-Judge 已在各种研究和实际应用中得到广泛采用,但其评估的稳健性和可靠性仍然是一个关键问题。其面临的核心挑战是偏见,尽管已主要从已知偏见及其对评估结果的影响进行研究,但对潜在未知偏见的自动化和系统性探索仍然缺乏。然而,这种探索对于增强评估的稳健性和可靠性至关重要。为填补这一空白,我们提出了 BiasScope,一个基于 LLM 的框架,用于自动化和大规模发现模型评估过程中可能出现的偏见。BiasScope 能够揭示不同模型家族和规模中的潜在偏见,其通用性和有效性在 JudgeBench 数据集上得到了验证。它克服了现有方法的局限性,将偏见发现从依赖人工努力和预定义偏见列表的被动过程转变为主动和全面的自动化探索。此外,基于 BiasScope,我们提出了 JudgeBench-Pro,这是 JudgeBench 的扩展版本,也是一个更具挑战性的基准,用于评估 LLM-as-a-Judge 的稳健性。值得注意的是,即使是强大的 LLM 作为评估者,在 JudgeBench-Pro 上的错误率也超过 50%,这突显了加强评估稳健性和进一步减轻潜在偏见的迫切需求。
cs.CL / 13 / 2602.09384
Contractual Deepfakes: Can Large Language Models Generate Contracts?
合同深伪:大型语言模型能生成合同吗?
Abstract
Notwithstanding their unprecedented ability to generate text, LLMs do not understand the meaning of words, have no sense of context and cannot reason. Their output constitutes an approximation of statistically dominant word patterns. And yet, the drafting of contracts is often presented as a typical legal task that could be facilitated by this technology. This paper seeks to put an end to such unreasonable ideas. Predicting words differs from using language in the circumstances of specific transactions and reconstituting common contractual phrases differs from reasoning about the law. LLMs seem to be able to generate generic and superficially plausible contractual documents. In the cold light of day, such documents may turn out to be useless assemblages of inconsistent provisions or contracts that are enforceable but unsuitable for a given transaction. This paper casts a shadow on the simplistic assumption that LLMs threaten the continued viability of the legal industry.
Chinese Translation
尽管大型语言模型(LLMs)具有前所未有的文本生成能力,但它们并不理解词语的含义,缺乏上下文感知,无法进行推理。它们的输出仅是对统计上占主导地位的词汇模式的近似。然而,合同起草常被视为一种典型的法律任务,可能会受到这一技术的促进。本文旨在终结这种不合理的观点。预测词语与在特定交易情境中使用语言是不同的,而重构常见的合同短语与对法律进行推理也是不同的。LLMs似乎能够生成通用且表面上可信的合同文件。然而,经过冷静分析,这些文件可能会被发现是无用的、不一致条款的拼凑,或是可执行但不适合特定交易的合同。本文对LLMs威胁法律行业持续生存的简单假设提出了质疑。
cs.CL / 14 / 2602.09388
Effective vocabulary expanding of multilingual language models for extremely low-resource languages
多语言模型在极低资源语言中的有效词汇扩展
Abstract
Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target language corpus. We then screen out a subset from the model's original vocabulary, which is biased towards representing the source language (e.g., English), and utilize bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue to pre-train the mPLMs using the target language corpus, based on the representations of this expanded vocabulary. Experimental results show that our proposed method outperforms the baseline, which uses a randomly initialized expanded vocabulary for continued pre-training, in POS tagging and NER tasks, achieving improvements of 0.54% and 2.60%, respectively. Furthermore, our method demonstrates high robustness to the selection of training corpora, and the models' performance on the source language does not degrade after continued pre-training.
Chinese Translation
多语言预训练语言模型(mPLMs)为许多低资源语言提供了显著的好处。为了进一步扩展这些模型所支持的语言范围,许多研究集中在对这些模型的持续预训练上。然而,针对如何将 mPLMs 扩展到之前未支持的低资源语言的研究较少。为了解决这一问题,我们使用目标语言语料库扩展模型的词汇。然后,我们从模型的原始词汇中筛选出一个子集,该子集偏向于表示源语言(例如,英语),并利用双语词典初始化扩展词汇的表示。随后,我们基于这些扩展词汇的表示,使用目标语言语料库继续对 mPLMs 进行预训练。实验结果表明,我们提出的方法在词性标注(POS tagging)和命名实体识别(NER)任务中优于基线方法,后者使用随机初始化的扩展词汇进行持续预训练,分别提高了 0.54% 和 2.60%。此外,我们的方法在选择训练语料方面表现出较高的鲁棒性,并且在持续预训练后,模型在源语言上的表现并未下降。
cs.CL / 15 / 2602.09416
Are Language Models Sensitive to Morally Irrelevant Distractors?
语言模型对道德无关干扰因素的敏感性如何?
Abstract
With the rapid development and uptake of large language models (LLMs) across high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as smelling cinnamon rolls or the level of ambient noise, thereby challenging moral theories that assume the stability of human moral judgements. Here, we draw inspiration from this "situationist" view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases to humans. We curate a novel multimodal dataset of 60 "moral distractors" from existing psychological datasets of emotionally-valenced images and narratives which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.
Chinese Translation
随着大型语言模型(LLMs)在高风险环境中的快速发展和应用,确保LLMs的行为与人类价值观相一致变得愈发重要。现有的道德基准通过价值陈述、道德情境或心理问卷来提示LLMs,隐含的基本假设是LLMs会报告出相对稳定的道德偏好。然而,道德心理学研究表明,人类的道德判断对道德无关的情境因素(如闻到肉桂卷的香味或环境噪音的水平)非常敏感,这对假设人类道德判断稳定性的道德理论提出了挑战。在此,我们借鉴这种道德心理学的“情境主义”观点,评估LLMs是否表现出与人类相似的认知道德偏见。我们从现有的情感图像和叙事的心理数据集中策划了一个包含60个“道德干扰因素”的新型多模态数据集,这些干扰因素与所呈现的情境没有道德相关性。在将这些干扰因素注入现有道德基准以测量其对LLMs反应的影响后,我们发现道德干扰因素可以在低歧义情境中将LLMs的道德判断转变超过30%,这突显了对LLMs进行更具情境性的道德评估和更细致的认知道德建模的必要性。
cs.CL / 16 / 2602.09438
Breaking the Pre-Sampling Barrier: Activation-Informed Difficulty-Aware Self-Consistency
突破预采样障碍:激活信息驱动的难度感知自一致性
Abstract
Self-Consistency (SC) is an effective decoding strategy that improves the reasoning performance of Large Language Models (LLMs) by generating multiple chain-of-thought reasoning paths and selecting the final answer via majority voting. However, it suffers from substantial inference costs because it requires a large number of samples. To mitigate this issue, Difficulty-Adaptive Self-Consistency (DSC) was proposed to reduce unnecessary token usage for easy problems by adjusting the number of samples according to problem difficulty. However, DSC requires additional model calls and pre-sampling to estimate difficulty, and this process is repeated when applying to each dataset, leading to significant computational overhead. In this work, we propose Activation-Informed Difficulty-Aware Self-Consistency (ACTSC) to address these limitations. ACTSC leverages internal difficulty signals reflected in the feed-forward network neuron activations to construct a lightweight difficulty estimation probe, without any additional token generation or model calls. The probe dynamically adjusts the number of samples for SC and can be applied to new datasets without requiring pre-sampling for difficulty estimation. To validate its effectiveness, we conduct experiments on five benchmarks. Experimental results show that ACTSC effectively reduces inference costs while maintaining accuracy relative to existing methods.
Chinese Translation
自一致性(Self-Consistency, SC)是一种有效的解码策略,通过生成多个思维链推理路径并通过多数投票选择最终答案,从而提高大型语言模型(Large Language Models, LLMs)的推理性能。然而,由于需要大量样本,它在推理成本上存在显著的负担。为了解决这一问题,提出了难度自适应自一致性(Difficulty-Adaptive Self-Consistency, DSC),通过根据问题难度调整样本数量来减少简单问题的不必要标记使用。然而,DSC需要额外的模型调用和预采样来估计难度,并且在应用于每个数据集时这一过程会重复,导致显著的计算开销。在本研究中,我们提出了激活信息驱动的难度感知自一致性(Activation-Informed Difficulty-Aware Self-Consistency, ACTSC)以解决这些限制。ACTSC利用前馈网络神经元激活中反映的内部难度信号构建一个轻量级的难度估计探针,无需任何额外的标记生成或模型调用。该探针动态调整SC的样本数量,并且可以应用于新数据集而无需进行难度估计的预采样。为了验证其有效性,我们在五个基准上进行了实验。实验结果表明,ACTSC有效降低了推理成本,同时保持了相对于现有方法的准确性。
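The two mechanisms named in this abstract, majority voting over sampled answers and a difficulty-dependent sample count, can be illustrated with a minimal sketch. The function names, the linear mapping from difficulty to sample count, and the [0, 1] difficulty scale are assumptions made for illustration; the abstract does not specify how the ACTSC probe maps its difficulty estimate to a number of samples.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled reasoning paths."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) yields the (answer, count) pair with the highest count.
    return Counter(answers).most_common(1)[0][0]


def samples_for_difficulty(difficulty, min_samples, max_samples):
    """Hypothetical linear schedule: harder problems get more samples.

    `difficulty` is assumed to be a probe output clipped to [0, 1].
    """
    difficulty = min(1.0, max(0.0, difficulty))
    return min_samples + round(difficulty * (max_samples - min_samples))


if __name__ == "__main__":
    print(samples_for_difficulty(0.5, 2, 10))
    print(majority_vote(["42", "41", "42"]))
```

Under this assumed schedule, an easy problem is answered with only a few samples and the vote is taken over that smaller set, which is where the cost saving would come from.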
cs.CL / 17 / 2602.09442
Evaluating Social Bias in RAG Systems: When External Context Helps and Reasoning Hurts
评估RAG系统中的社会偏见:外部上下文如何帮助而推理如何伤害
Abstract
Social biases inherent in large language models (LLMs) raise significant fairness concerns. Retrieval-Augmented Generation (RAG) architectures, which retrieve external knowledge sources to enhance the generative capabilities of LLMs, remain susceptible to the same bias-related challenges. This work focuses on evaluating and understanding the social bias implications of RAG. Through extensive experiments across various retrieval corpora, LLMs, and bias evaluation datasets, encompassing more than 13 different bias types, we surprisingly observe a reduction in bias in RAG. This suggests that the inclusion of external context can help counteract stereotype-driven predictions, potentially improving fairness by diversifying the contextual grounding of the model's outputs. To better understand this phenomenon, we then explore the model's reasoning process by integrating Chain-of-Thought (CoT) prompting into RAG while assessing the faithfulness of the model's CoT. Our experiments reveal that the model's bias inclinations shift between stereotype and anti-stereotype responses as more contextual information is incorporated from the retrieved documents. Interestingly, we find that while CoT enhances accuracy, contrary to the bias reduction observed with RAG, it increases overall bias across datasets, highlighting the need for bias-aware reasoning frameworks that can mitigate this trade-off.
Chinese Translation
大型语言模型(LLMs)固有的社会偏见引发了显著的公平性问题。检索增强生成(RAG)架构通过检索外部知识源来增强LLMs的生成能力,但仍然容易受到相同的偏见相关挑战的影响。本研究重点评估和理解RAG的社会偏见影响。通过在各种检索语料库、LLMs和偏见评估数据集上进行广泛实验,涵盖超过13种不同的偏见类型,我们惊讶地观察到RAG中的偏见有所减少。这表明,外部上下文的纳入可以帮助抵消基于刻板印象的预测,可能通过多样化模型输出的上下文基础来改善公平性。为了更好地理解这一现象,我们进一步探索模型的推理过程,通过将链式思维(Chain-of-Thought, CoT)提示集成到RAG中,同时评估模型的CoT的忠实性。我们的实验揭示,随着从检索文档中纳入更多上下文信息,模型的偏见倾向在刻板印象和反刻板印象响应之间发生了转变。有趣的是,我们发现虽然CoT提高了准确性,但与RAG观察到的偏见减少相反,它在数据集上增加了整体偏见,这突显了需要偏见意识推理框架以减轻这种权衡的必要性。
cs.CL / 18 / 2602.09444
Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality
概念文化指数:通过相对一般性衡量文化特异性
Abstract
Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at https://github.com/IyatomiLab/CCI .
Chinese Translation
大型语言模型(LLMs)越来越多地应用于多文化环境中;然而,句子层面文化特异性的系统评估仍然未得到充分探索。我们提出了概念文化指数(Conceptual Cultural Index, CCI),该指数用于评估句子层面的文化特异性。CCI 被定义为目标文化内的一般性估计与其他文化的平均一般性估计之间的差异。这一公式使用户能够通过比较设置操作性地控制文化的范围,并提供可解释性,因为该分数源于基础的一般性估计。我们在400个句子(200个文化特定句子和200个一般句子)上验证了CCI,结果分数分布展示了预期的模式:文化特定句子的分数较高,而一般句子的分数较低。在二元可分性方面,CCI的表现优于直接的LLM评分,为专注于目标文化的模型在AUC上带来了超过10点的提升。我们的代码可在 https://github.com/IyatomiLab/CCI 获取。
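The CCI definition in this abstract (generality within the target culture minus the average generality across other cultures) can be written out as a short sketch. The function name and the example values are hypothetical, and how the generality estimates themselves are obtained is outside the sketch.

```python
def conceptual_cultural_index(target_generality, other_generalities):
    """CCI = generality in the target culture - mean generality in other cultures.

    Inputs are assumed to be numeric generality estimates on a common scale.
    """
    if not other_generalities:
        raise ValueError("need at least one comparison culture")
    mean_other = sum(other_generalities) / len(other_generalities)
    return target_generality - mean_other


if __name__ == "__main__":
    # A sentence judged general in the target culture but not elsewhere
    # receives a high score; a universally general sentence scores near zero.
    print(conceptual_cultural_index(0.9, [0.2, 0.4]))
    print(conceptual_cultural_index(0.8, [0.8, 0.8]))
```

Changing the set of comparison cultures changes the score, which corresponds to the abstract's point that the scope of culture is controlled through the comparison setting.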
cs.CL / 19 / 2602.09469
NOWJ @BioCreative IX ToxHabits: An Ensemble Deep Learning Approach for Detecting Substance Use and Contextual Information in Clinical Texts
NOWJ @BioCreative IX ToxHabits:一种用于检测临床文本中物质使用及上下文信息的集成深度学习方法
Abstract
Extracting drug use information from unstructured Electronic Health Records remains a major challenge in clinical Natural Language Processing. While Large Language Models demonstrate advancements, their use in clinical NLP is limited by concerns over trust, control, and efficiency. To address this, we present the NOWJ submission to the ToxHabits Shared Task at BioCreative IX. This task targets the detection of toxic substance use and contextual attributes in Spanish clinical texts, a domain-specific, low-resource setting. We propose a multi-output ensemble system tackling both Subtask 1 - ToxNER and Subtask 2 - ToxUse. Our system integrates BETO with a CRF layer for sequence labeling, employs diverse training strategies, and uses sentence filtering to boost precision. Our top run achieved 0.94 F1 and 0.97 precision for Trigger Detection, and 0.91 F1 for Argument Detection.
Chinese Translation
从非结构化电子健康记录中提取药物使用信息仍然是临床自然语言处理中的一大挑战。尽管大型语言模型展示了进展,但在临床自然语言处理中的应用受到信任、控制和效率等问题的限制。为了解决这一问题,我们提出了NOWJ在BioCreative IX的ToxHabits共享任务中的提交。该任务旨在检测西班牙语临床文本中的有毒物质使用及上下文属性,这是一个特定领域的低资源环境。我们提出了一种多输出集成系统,解决子任务1 - ToxNER和子任务2 - ToxUse。我们的系统将BETO与条件随机场(CRF)层结合用于序列标注,采用多样的训练策略,并使用句子过滤来提高精度。我们的最佳运行结果在触发检测中达到了0.94的F1和0.97的精度,在论证检测中达到了0.91的F1。
cs.CL / 20 / 2602.09486
Listen to the Layers: Mitigating Hallucinations with Inter-Layer Disagreement
倾听层次:通过层间不一致性减轻幻觉现象
Abstract
Pretrained Large Language Models (LLMs) are prone to generating fluent yet factually incorrect text, a phenomenon known as hallucination that undermines their reliability and utility in downstream tasks. We hypothesize that a generated text span's factuality is correlated with its representational instability across the model's internal layers. Based on this, we propose the CoCoA (Confusion and Consistency Aware) decoder, a novel, training-free decoding algorithm that mitigates hallucinations at inference time by listening to these signals in the middle layers. We propose two metrics to quantify this instability in the middle layers, and use them to penalize outputs that exhibit high internal confusion, thereby steering the model towards more internally consistent and factually grounded outputs. We further propose a self-information gated variant, CoCoA-SIG, that dynamically modulates this penalty to selectively target high-surprise, unstable generations. Extensive experiments on diverse tasks, including question-answering, summarization and code generation, demonstrate that CoCoA significantly improves factual correctness across multiple model families (e.g., Llama-3, Qwen-2.5, Mistral). By leveraging model-intrinsic signals, CoCoA offers an effective and broadly applicable method for enhancing the trustworthiness of LLMs at inference time, without requiring any model retraining.
Chinese Translation
预训练的大型语言模型(LLMs)容易生成流畅但事实不准确的文本——这一现象被称为幻觉,削弱了它们在下游任务中的可靠性和实用性。我们假设生成文本片段的事实性与其在模型内部层次中的表征不稳定性相关。基于此,我们提出了CoCoA(混淆与一致性感知)解码器,这是一种新颖的无训练解码算法,通过在中间层倾听这些信号,在推理时减轻幻觉现象。我们提出了两个指标来量化中间层的不稳定性,并利用这些指标对表现出高内部混淆的输出进行惩罚,从而引导模型生成更具内部一致性和事实基础的输出。我们进一步提出了一种自信息门控变体CoCoA-SIG,该变体动态调节惩罚,以选择性地针对高惊讶度、不稳定的生成结果。在包括问答、摘要生成和代码生成等多种任务上的广泛实验表明,CoCoA显著提高了多个模型系列(如Llama-3、Qwen-2.5、Mistral)的事实正确性。通过利用模型内在信号,CoCoA提供了一种有效且广泛适用的方法,以增强LLMs在推理时的可信度,而无需任何模型重训练。
cs.CL / 21 / 2602.09501
Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models
去掩码:基于真实标签的掩码扩展顺序学习用于掩码扩散语言模型
Abstract
Masked Diffusion Language Models (MDLMs) generate text by iteratively filling masked tokens, requiring two coupled decisions at each step: which positions to unmask (where-to-unmask) and which tokens to place (what-to-unmask). While standard MDLM training directly optimizes token prediction (what-to-unmask), inference-time unmasking orders (where-to-unmask) are typically determined by heuristic confidence measures or trained through reinforcement learning with costly on-policy rollouts. To address this, we introduce Gt-Margin, a position-wise score derived from ground-truth tokens, defined as the probability margin between the correct token and its strongest alternative. Gt-Margin yields an oracle unmasking order that prioritizes easier positions first under each partially masked state. We demonstrate that leveraging this oracle unmasking order significantly enhances final generation quality, particularly on logical reasoning benchmarks. Building on this insight, we train a supervised unmasking planner via learning-to-rank to imitate the oracle ordering from masked contexts. The resulting planner integrates into standard MDLM sampling to select where-to-unmask, improving reasoning accuracy without modifying the token prediction model.
Chinese Translation
掩码扩散语言模型(MDLMs)通过迭代填充掩码标记生成文本,在每一步需要做出两个相互关联的决策:选择哪些位置进行解掩码(where-to-unmask)和选择放置哪些标记(what-to-unmask)。虽然标准的MDLM训练直接优化标记预测(what-to-unmask),但推理时的解掩码顺序(where-to-unmask)通常是通过启发式置信度度量或通过强化学习与昂贵的在政策回合进行训练来确定的。为了解决这个问题,我们引入了Gt-Margin,这是一种基于真实标签的逐位置评分,定义为正确标记与其最强替代品之间的概率边际。Gt-Margin产生了一种优先考虑每个部分掩码状态下较简单位置的oracle解掩码顺序。我们证明,利用这种oracle解掩码顺序显著提高了最终生成质量,特别是在逻辑推理基准测试中。基于这一见解,我们通过学习排序训练了一个监督解掩码规划器,以模仿来自掩码上下文的oracle排序。最终的规划器集成到标准的MDLM采样中,以选择解掩码位置(where-to-unmask),在不修改标记预测模型的情况下提高推理准确性。
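The Gt-Margin score described in this abstract (the probability margin between the correct token and its strongest alternative) and the resulting oracle ordering can be sketched as follows. The data layout (a dict from masked position to a probability list) and the reading of "easier" as "larger margin" are assumptions made for illustration, not details confirmed by the abstract.

```python
def gt_margin(probs, gt_index):
    """Probability of the ground-truth token minus the strongest other token."""
    strongest_alternative = max(p for i, p in enumerate(probs) if i != gt_index)
    return probs[gt_index] - strongest_alternative


def oracle_unmasking_order(position_probs, gt_tokens):
    """Rank masked positions from largest to smallest margin.

    Assumes a larger margin means an easier position, so it is unmasked first.
    """
    margins = {
        pos: gt_margin(probs, gt_tokens[pos])
        for pos, probs in position_probs.items()
    }
    return sorted(margins, key=lambda pos: margins[pos], reverse=True)


if __name__ == "__main__":
    # Hypothetical predicted distributions over a 3-token vocabulary.
    position_probs = {
        0: [0.7, 0.2, 0.1],
        1: [0.4, 0.5, 0.1],
        2: [0.1, 0.1, 0.8],
    }
    gt_tokens = {0: 0, 1: 0, 2: 2}
    print(oracle_unmasking_order(position_probs, gt_tokens))
```

Because the score needs the ground-truth tokens, such an ordering is only available at training time. This is consistent with the abstract's step of training a separate planner to imitate the ordering from masked contexts.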
cs.CL / 22 / 2602.09514
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
EcoGym:评估大规模语言模型在互动经济中的长远计划与执行能力
Abstract
Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending, Freelance, and Operation, implemented in a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps with 365 day-loops for evaluation). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategy or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.
Chinese Translation
长远规划被广泛认为是基于大规模语言模型(LLM)的自主代理的核心能力;然而,目前的评估框架在很大程度上存在情节性、领域特定或缺乏对持续经济动态的充分基础等问题。我们提出了EcoGym,这是一个可推广的基准,用于在互动经济中进行连续的计划与执行决策。EcoGym包含三个多样化的环境:自动售货、自由职业和运营,这些环境在统一的决策过程中实现,具有标准化接口,并在有效无限的时间范围内(如果进行365天的循环评估,则超过1000步)预算行动。EcoGym的评估基于与商业相关的结果(例如净资产、收入和日活跃用户数),旨在针对部分可观察性和随机性下的长期战略一致性和稳健性。对十一种领先的LLM进行的实验揭示了一种系统性的紧张关系:没有单一模型在所有三种场景中占据主导地位。关键是,我们发现模型在高层次策略或高效行动执行方面表现出显著的次优性。EcoGym作为一个开放、可扩展的测试平台发布,旨在实现透明的长远代理评估,并研究现实经济环境中的可控性与效用之间的权衡。
cs.CL / 23 / 2602.09516
The CLEF-2026 CheckThat! Lab: Advancing Multilingual Fact-Checking
CLEF-2026 CheckThat! 实验室:推进多语言事实核查
Abstract
The CheckThat! lab aims to advance the development of innovative technologies combating disinformation and manipulation efforts in online communication across a multitude of languages and platforms. While in early editions the focus has been on core tasks of the verification pipeline (check-worthiness, evidence retrieval, and verification), in the past three editions, the lab added additional tasks linked to the verification process. In this year's edition, the verification pipeline is at the center again with the following tasks: Task 1 on source retrieval for scientific web claims (a follow-up of the 2025 edition), Task 2 on fact-checking numerical and temporal claims, which adds a reasoning component to the 2025 edition, and Task 3, which expands the verification pipeline with generation of full-fact-checking articles. These tasks represent challenging classification and retrieval problems as well as generation challenges at the document and span level, including multilingual settings.
Chinese Translation
CheckThat! 实验室旨在推动开发创新技术,以应对多种语言和平台上的虚假信息和操控行为。在早期版本中,重点主要放在验证流程的核心任务上(可核查性、证据检索和验证),而在过去的三个版本中,实验室增加了与验证过程相关的额外任务。在今年的版本中,验证流程再次成为中心,包含以下任务:任务1为科学网络声明的来源检索(为2025年版本的后续任务),任务2为对数值和时间声明的事实核查,增加了推理组件,任务3则通过生成完整的事实核查文章扩展了验证流程。这些任务代表了具有挑战性的分类和检索问题,以及在文档和跨度级别上的生成挑战,包括多语言环境。
cs.CL / 24 / 2602.09517
Knowledge Integration Decay in Search-Augmented Reasoning of Large Language Models
大型语言模型中搜索增强推理的知识整合衰退
Abstract
Modern Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks by employing search-augmented reasoning to incorporate external knowledge into long chains of thought. However, we identify a critical yet underexplored bottleneck in this paradigm, termed Knowledge Integration Decay (KID). Specifically, we observe that as the length of reasoning generated before search grows, models increasingly fail to integrate retrieved evidence into subsequent reasoning steps, limiting performance even when relevant information is available. To address this, we propose Self-Anchored Knowledge Encoding (SAKE), a training-free inference-time strategy designed to stabilize knowledge utilization. By anchoring retrieved knowledge at both the beginning and end of the reasoning process, SAKE prevents it from being overshadowed by prior context, thereby preserving its semantic integrity. Extensive experiments on multi-hop QA and complex reasoning benchmarks demonstrate that SAKE significantly mitigates KID and improves performance, offering a lightweight yet effective solution for knowledge integration in agentic LLMs.
Chinese Translation
现代大型语言模型(LLMs)通过采用搜索增强推理在复杂任务中展现了显著的能力,以将外部知识融入长链思维中。然而,我们识别出这一范式中的一个关键但未被充分探讨的瓶颈,称为知识整合衰退(Knowledge Integration Decay, KID)。具体而言,我们观察到,随着生成的推理长度在搜索之前增长,模型在将检索到的证据整合到后续推理步骤中的能力逐渐减弱,即使在相关信息可用的情况下也会限制性能。为了解决这一问题,我们提出了自锚定知识编码(Self-Anchored Knowledge Encoding, SAKE),这是一种无训练的推理时策略,旨在稳定知识的利用。通过在推理过程的开始和结束处锚定检索到的知识,SAKE防止其被先前的上下文所掩盖,从而保持其语义完整性。在多跳问答和复杂推理基准上的大量实验表明,SAKE显著减轻了KID并提高了性能,为代理型LLMs中的知识整合提供了一种轻量且有效的解决方案。
cs.CL / 25 / 2602.09538
UniARM: Towards a Unified Autoregressive Reward Model for Multi-Objective Test-Time Alignment
UniARM:朝着统一的自回归奖励模型实现多目标测试时对齐
Abstract
Multi-objective alignment aims to align LLM responses with multiple human preference objectives. Among existing methods, guiding the generation of frozen LLMs through autoregressive reward models (ARMs) to accomplish multi-objective test-time alignment is a low-cost solution. However, these methods typically rely on independent parameters for each preference objective, either by training ARMs independently across preference dimensions, which neglects interactions among preference features, or by training a single ARM with separate feature extraction modules for each preference, which can cause feature entanglement. Both strategies can result in misalignment between generated outputs and user preferences. To address this limitation, we propose Preference-Modulated & Shared Low-Rank Adaptation (MoSLoRA) for ARM training, which first extracts shared features via a preference-agnostic module and then applies affine transformations to shared features via a preference modulation module conditioned on mixed preference vectors. This design mitigates feature entanglement and enables precise control over preference trade-offs during inference. Building on this, we introduce the Unified Autoregressive Reward Model (UniARM), a novel framework for multi-objective test-time alignment. UniARM jointly models all preference dimensions in a single parameter space, eliminating the need for independent parameters for each preference objective. es on larger-scale LLMs, enhancing its practical usability.
Chinese Translation
多目标对齐旨在将大型语言模型(LLM)的响应与多个人工偏好目标对齐。在现有方法中,通过自回归奖励模型(ARM)引导冻结的LLM生成以实现多目标测试时对齐是一种低成本的解决方案。然而,这些方法通常依赖于每个偏好目标的独立参数,要么通过在偏好维度上独立训练ARM,忽视了偏好特征之间的相互作用,要么通过为每个偏好训练一个单一的ARM,并为每个偏好使用独立的特征提取模块,这可能导致特征纠缠。这两种策略都可能导致生成的输出与用户偏好之间的不对齐。为了解决这一局限性,我们提出了偏好调制与共享低秩适应(MoSLoRA)用于ARM训练,该方法首先通过一个与偏好无关的模块提取共享特征,然后通过一个基于混合偏好向量条件的偏好调制模块对共享特征应用仿射变换。这一设计减轻了特征纠缠,并在推理过程中实现了对偏好权衡的精确控制。在此基础上,我们引入了统一自回归奖励模型(UniARM),这是一个用于多目标测试时对齐的新框架。UniARM在单一参数空间中联合建模所有偏好维度,消除了对每个偏好目标独立参数的需求,增强了其在大规模LLM上的实用性。
cs.CL / 26 / 2602.09552
Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA
跨多领域对话问答的RAG方法综合比较
Abstract
Conversational question answering increasingly relies on retrieval-augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single-turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi-turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No-RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used at https://github.com/Klejda-A/exp-rag.git .
Chinese Translation
对话问答日益依赖于检索增强生成(RAG)来将大型语言模型(LLMs)与外部知识结合。然而,大多数现有研究孤立地评估RAG方法,并主要集中于单轮对话设置。本文解决了多轮对话问答中RAG方法缺乏系统比较的问题,在这种情况下,对话历史、共指和用户意图的变化显著增加了检索的复杂性。我们对八个跨多个领域的多样化对话问答数据集进行了对普通和高级RAG方法的全面实证研究。通过统一的实验设置,我们使用生成器和检索指标评估检索质量和答案生成,并分析性能在对话轮次中的演变。我们的结果表明,稳健而简单的方法,如重排序、混合BM25和HyDE,始终优于普通RAG。相比之下,几种高级技术未能带来收益,甚至可能导致性能低于无RAG基线。我们进一步证明,数据集特征和对话长度对检索效果有强烈影响,这解释了为什么没有单一的RAG策略在不同设置中占主导地位。总体而言,我们的研究结果表明,有效的对话RAG更依赖于检索策略与数据集结构之间的对齐,而非方法的复杂性。我们发布了所使用的代码。
cs.CL / 27 / 2602.09555
Advancing Block Diffusion Language Models for Test-Time Scaling
推进块扩散语言模型在测试时扩展中的应用
Abstract
Recent advances in block diffusion language models have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs have limited exploration under the test-time scaling setting and face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing the decoding speed and effectiveness. In this work, we propose a unified framework for test-time scaling in BDLMs that introduces adaptivity in both decoding and block-wise generation. At the decoding level, we propose Bounded Adaptive Confidence Decoding (BACD), a difficulty-aware sampling strategy that dynamically adjusts denoising based on model confidence, accelerating inference while controlling error accumulation. Beyond step-wise adaptivity, we introduce Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm that allocates large block sizes to exploratory reasoning and smaller block sizes to refinement, achieving an effective efficiency-effectiveness balance. To enable efficient and effective decoding with a large block size, we adopt Progressive Block Size Extension, which mitigates performance degradation when scaling block sizes. Extensive experiments show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines such as TraDo-8B (2.26x speedup, +11.2 points on AIME24). These results mark an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks.
Chinese Translation
近期块扩散语言模型(BDLMs)的进展在推理任务中展示了竞争力的性能和强大的可扩展性。然而,现有的BDLM在测试时扩展设置下的探索有限,并且在长链式思维推理中面临更为严峻的解码挑战,特别是在平衡解码速度和有效性方面。在本研究中,我们提出了一个统一的BDLM测试时扩展框架,引入了解码和块级生成的自适应性。在解码层面,我们提出了有界自适应置信解码(Bounded Adaptive Confidence Decoding, BACD),这是一种基于模型置信度动态调整去噪的难度感知采样策略,旨在加速推理同时控制错误累积。除了逐步自适应性外,我们还引入了粗思考、细评估(Think Coarse, Critic Fine, TCCF)这一测试时扩展范式,将较大的块大小分配给探索性推理,而将较小的块大小分配给细化,从而实现有效的效率与有效性平衡。为了在大块大小下实现高效且有效的解码,我们采用了渐进块大小扩展(Progressive Block Size Extension),该方法在扩展块大小时减轻性能下降。大量实验表明,将BACD和TCCF应用于TDAR-8B相较于强基线如TraDo-8B(速度提升2.26倍,AIME24上提高11.2分)带来了显著改进。这些结果标志着解锁BDLM在复杂推理任务中测试时扩展潜力的重要一步。
cs.CL / 28 / 2602.09570
LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval
LEMUR:用于多语言法律嵌入模型检索的鲁棒微调语料库
Abstract
Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code (https://github.com/nargesbh/eur_lex) and data (https://huggingface.co/datasets/G4KMU/LEMUR).
Chinese Translation
大型语言模型(LLMs)越来越多地被用于访问法律信息。然而,它们在多语言法律环境中的应用受到不可靠检索和缺乏领域适应的开放嵌入模型的限制。特别是,现有的多语言法律语料库并未针对语义检索进行设计,而基于PDF的立法来源由于文本提取不完美而引入了大量噪声。为了解决这些挑战,我们引入了LEMUR,这是一个大规模的多语言欧盟环境立法语料库,由24,953份涵盖25种语言的官方EUR-Lex PDF文档构成。我们通过使用词汇内容得分(Lexical Content Score, LCS)来衡量PDF到文本转换的准确性,从而量化词汇一致性与权威HTML版本之间的差异。基于LEMUR,我们在单语和双语环境中使用对比目标微调了三种最先进的多语言嵌入模型,反映了现实的法律检索场景。在低资源和高资源语言上的实验表明,法律领域的微调相对于强基线始终提高了Top-k检索准确率,尤其在低资源语言中表现出显著的提升。跨语言评估显示,这些改进能够转移到未见过的语言,表明微调主要增强了语言独立的内容级法律表示,而非特定于语言的线索。我们发布了代码(https://github.com/nargesbh/eur_lex)和数据(https://huggingface.co/datasets/G4KMU/LEMUR)。
cs.CL / 29 / 2602.09574
Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs
在测试时间扩展中对齐树搜索策略与固定令牌预算
Abstract
Tree-search decoding is an effective form of test-time scaling for large language models (LLMs), but real-world deployment imposes a fixed per-query token budget that varies across settings. Existing tree-search policies are largely budget-agnostic, treating the budget as a termination condition, which can lead to late-stage over-branching or premature termination. We propose Budget-Guided MCTS (BG-MCTS), a tree-search decoding algorithm that aligns its search policy with the remaining token budget: it starts with broad exploration, then prioritizes refinement and answer completion as the budget depletes while reducing late-stage branching from shallow nodes. BG-MCTS consistently outperforms budget-agnostic tree-search baselines across different budgets on MATH500 and AIME24/25 with open-weight LLMs.
Chinese Translation
树搜索解码是一种有效的大型语言模型(LLMs)测试时间扩展形式,但实际部署中每个查询的令牌预算是固定的,并且在不同设置下有所不同。现有的树搜索策略在很大程度上与预算无关,将预算视为终止条件,这可能导致后期过度分支或过早终止。我们提出了Budget-Guided MCTS(BG-MCTS),这是一种树搜索解码算法,它将其搜索策略与剩余的令牌预算对齐:它从广泛探索开始,然后在预算减少时优先考虑细化和答案完成,同时减少来自浅层节点的后期分支。BG-MCTS在不同预算下在MATH500和AIME24/25上始终优于与预算无关的树搜索基线,使用开放权重的LLMs。
cs.CL / 30 / 2602.09590
Context-Aware Counterfactual Data Augmentation for Gender Bias Mitigation in Language Models
基于上下文的反事实数据增强方法用于语言模型中的性别偏见缓解
Abstract
A challenge in mitigating social bias in fine-tuned language models (LMs) is the potential reduction in language modeling capability, which can harm downstream performance. Counterfactual data augmentation (CDA), a widely used method for fine-tuning, highlights this issue by generating synthetic data that may align poorly with real-world distributions or creating overly simplistic counterfactuals that ignore the social context of altered sensitive attributes (e.g., gender) in the pretraining corpus. To address these limitations, we propose a simple yet effective context-augmented CDA method, Context-CDA, which uses large LMs to enhance the diversity and contextual relevance of the debiasing corpus. By minimizing discrepancies between the debiasing corpus and pretraining data through augmented context, this approach ensures better alignment, enhancing language modeling capability. We then employ uncertainty-based filtering to exclude generated counterfactuals considered low-quality by the target smaller LMs (i.e., LMs to be debiased), further improving the fine-tuning corpus quality. Experimental results on gender bias benchmarks demonstrate that Context-CDA effectively mitigates bias without sacrificing language modeling performance while offering insights into social biases by analyzing distribution shifts in next-token generation probabilities.
Chinese Translation
在减轻微调语言模型(LMs)中的社会偏见时,一个挑战是可能会降低语言建模能力,这可能会损害下游性能。反事实数据增强(CDA)作为一种广泛使用的微调方法,突显了这一问题,因为它生成的合成数据可能与真实世界分布不匹配,或者创建过于简单的反事实,忽视了在预训练语料库中改变敏感属性(例如性别)的社会背景。为了解决这些局限性,我们提出了一种简单而有效的上下文增强反事实数据增强方法——Context-CDA,该方法利用大型语言模型来增强去偏见语料库的多样性和上下文相关性。通过通过增强上下文最小化去偏见语料库与预训练数据之间的差异,该方法确保了更好的对齐,从而增强了语言建模能力。然后,我们采用基于不确定性的过滤方法,排除被目标较小语言模型(即待去偏见的LMs)认为质量较低的生成反事实,进一步提高微调语料库的质量。在性别偏见基准上的实验结果表明,Context-CDA有效地减轻了偏见,而不牺牲语言建模性能,同时通过分析下一个标记生成概率的分布变化,提供了对社会偏见的深入洞察。
cs.CL / 31 / 2602.09591
On the Optimal Reasoning Length for RL-Trained Language Models
关于强化学习训练语言模型的最优推理长度
Abstract
Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost during both training and inference. Although length control methods have been proposed, it remains unclear what output length best balances efficiency and performance. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. By extending prior work to RL-trained policies, we identify two failure modes: (1) long outputs increase dispersion, and (2) short outputs lead to under-thinking.
Chinese Translation
强化学习显著提高了大型语言模型的推理能力,但它也倾向于延长思维链输出的长度,并在训练和推理过程中增加计算成本。尽管已经提出了长度控制方法,但尚不清楚在效率和性能之间平衡的最优输出长度是什么。在本研究中,我们比较了几种长度控制方法在两个模型上的表现,即 Qwen3-1.7B Base 和 DeepSeek-R1-Distill-Qwen-1.5B。我们的结果表明,长度惩罚可能会阻碍推理的获取,而适当调整的长度控制可以提高具有强先前推理能力模型的效率。通过将先前的工作扩展到强化学习训练的策略,我们识别出两种失效模式:1)长输出增加了离散性,2)短输出导致思维不足。
cs.CL / 32 / 2602.09598
Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning
从不可恢复的错误中学习:工具集成大语言模型推理的错误定位策略优化
Abstract
Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code will be publicly released soon.
Chinese Translation
工具集成推理(TIR)使得大语言模型(LLM)代理能够通过规划、工具使用和迭代修正来解决任务,但在这种设置下,仅依赖结果的强化学习面临稀疏、延迟的奖励以及较弱的逐步信用分配问题。在长时间跨度的 TIR 轨迹中,早期的不可恢复错误可能决定成功或失败,因此定位第一个不可恢复步骤并利用其进行细粒度信用分配至关重要。我们提出了错误定位策略优化(ELPO),该方法通过在固定的回合预算下使用二分搜索回合树来定位第一个不可恢复步骤,将生成的树通过分层优势归因转换为稳定的学习信号,并应用错误定位的自适应剪切来强化对关键步骤及其后缀的修正更新。在数学、科学问答和代码执行的 TIR 基准测试中,ELPO 在可比的采样预算下始终优于强大的代理强化学习基线,并在 Pass@K 和 Major@K 规模、回合排名质量和工具调用效率等方面获得额外提升。我们的代码将很快公开发布。
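The localization step described in the abstract above can be illustrated with a plain binary search. This is only a sketch under stated assumptions, not the paper's rollout-tree procedure: `can_recover` is a hypothetical oracle standing in for "rollouts resumed from this prefix still reach a correct answer", and recoverability is assumed to be monotone (once a prefix is irrecoverable, every longer prefix is too).

```python
from typing import Callable


def first_irrecoverable_step(num_steps: int, can_recover: Callable[[int], bool]) -> int:
    """Return the smallest prefix length k whose prefix can no longer succeed.

    can_recover(k) is a hypothetical oracle (an assumption of this sketch):
    True if resuming after the first k steps can still reach a correct answer.
    Assumes can_recover(0) is True, can_recover(num_steps) is False, and
    that recoverability never returns once it is lost.
    """
    lo, hi = 0, num_steps  # invariant: prefix lo is recoverable, prefix hi is not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if can_recover(mid):
            lo = mid  # the mistake happens later than mid
        else:
            hi = mid  # the mistake happens at or before mid
    return hi


if __name__ == "__main__":
    # Toy trajectory of 10 steps where step 7 is the first irrecoverable one.
    print(first_irrecoverable_step(10, lambda k: k < 7))
```

Under the monotonicity assumption the oracle is queried about log2(num_steps) times, which is one way a fixed rollout budget could be spent; how the paper actually allocates rollouts is not stated in the abstract.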
cs.CL / 33 / 2602.09621
AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models
AlignTune:大型语言模型后训练对齐的模块化工具包
Abstract
Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.
Chinese Translation
后训练对齐是部署大型语言模型(LLMs)的核心,但实际工作流程仍然分散在特定后端的工具和临时拼接代码之间,使得实验难以重现。我们识别出后端干扰、奖励碎片化和不可重现的管道是对齐研究中的关键障碍。我们介绍了AlignTune,一个模块化工具包,提供了一个统一的接口用于监督微调(SFT)和RLHF风格的优化,并支持可互换的TRL和Unsloth后端。AlignTune标准化了配置,提供了一个可扩展的奖励层(基于规则和学习的),并集成了对标准基准和自定义任务的评估。通过将特定后端的逻辑隔离在一个单一的工厂边界后,AlignTune使得可控比较和可重现的对齐实验成为可能。
cs.CL / 34 / 2602.09624
MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation
MILE-RefHumEval:一种无参考、多独立大语言模型框架的人类对齐评估
Abstract
We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgments. With task-specific prompts spanning best-candidate selection, summarization, image captioning, and dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
Chinese Translation
我们介绍了MILE-RefHumEval,这是一种无参考框架,用于评估大型语言模型(LLMs),无需真实标注或评估者协调。该框架利用一组独立提示的评估者,依据人类对齐的方案进行指导,支持离散和连续评分判断。从最佳候选选择、摘要生成、图像描述到对话等任务特定提示,MILE-RefHumEval提供灵活、可解释和可扩展的评估。实验表明,它与人类判断高度一致,优于先前的方法,并减少了计算开销,为现实世界的LLM评估提供了一种高效、稳健且人类对齐的解决方案。
cs.CL / 35 / 2602.09642
MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering
MATA:用于可靠和灵活表格问答的多智能体框架
Abstract
Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments. In this paper, we introduce MATA, a multi-agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open-source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state-of-the-art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at https://github.com/AIDAS-Lab/MATA.
Chinese Translation
近年来,大型语言模型(Large Language Models, LLMs)的进展显著提升了表格理解任务,如表格问答(Table Question Answering, TableQA),然而在确保可靠性、可扩展性和效率方面仍面临挑战,特别是在资源受限或隐私敏感的环境中。本文介绍了MATA,一个多智能体的TableQA框架,利用多个互补的推理路径和一组基于小型语言模型构建的工具。MATA通过多样的推理风格为给定的表格和问题生成候选答案,然后借助这些工具对答案进行优化或选择最佳答案。此外,它还结合了一种旨在最小化昂贵的LLM智能体调用的算法,从而提高整体效率。MATA在使用小型开源模型时保持强大的性能,并能轻松适应各种LLM类型。在两个不同难度的基准测试上,使用十种不同的LLM进行的广泛实验表明,MATA在避免过度LLM推理的同时,实现了最先进的准确性和高效的推理。我们的结果强调,精心协调多个推理路径可以实现可扩展和可靠的TableQA。代码可在 https://github.com/AIDAS-Lab/MATA 获取。
cs.CL / 36 / 2602.09691
Life Cycle-Aware Evaluation of Knowledge Distillation for Machine Translation: Environmental Impact and Translation Quality Trade-offs
生命周期意识下的知识蒸馏在机器翻译中的评估:环境影响与翻译质量的权衡
Abstract
Knowledge distillation (KD) is a tool to compress a larger system (teacher) into a smaller one (student). In machine translation, studies typically report only the translation quality of the student and omit the computational complexity of performing KD, making it difficult to select among the many available KD choices under compute-induced constraints. In this study, we evaluate representative KD methods by considering both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool. This assessment accounts for runtime operational emissions and amortized hardware production costs throughout the KD model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute-induced constraints.
Chinese Translation
知识蒸馏(Knowledge Distillation, KD)是一种将较大系统(教师)压缩为较小系统(学生)的工具。在机器翻译领域,研究通常仅报告学生的翻译质量,而忽略了进行知识蒸馏的计算复杂性,这使得在计算资源限制下选择众多可用的知识蒸馏选项变得困难。在本研究中,我们通过考虑翻译质量和计算成本来评估代表性的知识蒸馏方法。我们使用机器学习生命周期评估(Machine Learning Life Cycle Assessment, MLCA)工具将计算成本表示为碳足迹。该评估考虑了知识蒸馏模型生命周期(教师训练、蒸馏和推理)中的运行时操作排放和摊销的硬件生产成本。我们发现:(i)在小规模部署时,蒸馏开销主导了总碳足迹;(ii)在大规模时,推理主导,使得知识蒸馏仅在超过任务依赖的使用阈值后才有益;(iii)词级蒸馏通常提供比序列级蒸馏更有利的碳足迹与质量权衡。我们的协议为在明确的质量和计算资源限制下选择知识蒸馏方法提供了可重复的指导。
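Finding (ii) in the abstract above, that KD pays off only beyond a usage threshold, follows from simple break-even arithmetic. The sketch below assumes a linear footprint model (a one-off distillation overhead plus a constant per-request cost) and treats teacher training as incurred in both scenarios, so it cancels out. The function name and all numbers are hypothetical and are not taken from the paper or from the MLCA tool.

```python
def break_even_requests(distill_overhead: float,
                        teacher_per_request: float,
                        student_per_request: float) -> float:
    """Requests needed before serving the student has the lower total footprint.

    Assumed model (not from the paper):
      serve with teacher:  teacher_per_request * n
      distill then serve:  distill_overhead + student_per_request * n
    All quantities share one unit, e.g. grams of CO2-equivalent.
    """
    saving = teacher_per_request - student_per_request
    if saving <= 0:
        # The student is no cheaper per request, so distillation never pays off.
        return float("inf")
    return distill_overhead / saving


if __name__ == "__main__":
    # Hypothetical figures: 500 kg overhead, 2.0 g vs 0.5 g per request.
    print(round(break_even_requests(500_000.0, 2.0, 0.5)))
```

Below the returned request count the distillation overhead dominates; above it the per-request saving dominates, which matches the qualitative pattern the abstract reports.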
cs.CL / 37 / 2602.09703
Maastricht University at AMIYA: Adapting LLMs for Dialectal Arabic using Fine-tuning and MBR Decoding
马斯特里赫特大学在AMIYA:通过微调和MBR解码适应方言阿拉伯语的LLMs
Abstract
Large Language Models (LLMs) are becoming increasingly multilingual, supporting hundreds of languages, especially high-resource ones. Unfortunately, dialectal varieties are still underrepresented due to limited data and linguistic variation. In this work, we adapt a pre-trained LLM to improve dialectal performance. Specifically, we use Low-Rank Adaptation (LoRA) fine-tuning on monolingual and English-dialect parallel data, adapter merging, and dialect-aware Minimum Bayes Risk (MBR) decoding to improve dialectal fidelity in generation and translation. Experiments on Syrian, Moroccan, and Saudi Arabic show that merging and MBR improve dialectal fidelity while preserving semantic accuracy. This combination provides a compact and effective framework for robust dialectal Arabic generation.
Chinese Translation
大型语言模型(LLMs)正变得越来越多语言化,支持数百种语言,尤其是高资源语言。不幸的是,由于数据有限和语言变异,方言变体仍然代表不足。在本研究中,我们对预训练的LLM进行了适应,以提高方言表现。具体而言,我们在单语和英语方言平行数据上使用低秩适应(Low Rank Adaptation, LoRA)微调、适配器合并和方言感知的MBR解码,以改善方言的保真度生成和翻译。对叙利亚、摩洛哥和沙特阿拉伯语的实验表明,合并和MBR提高了方言的保真度,同时保持了语义准确性。这种组合提供了一个紧凑且有效的框架,用于稳健的方言阿拉伯语生成。
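For readers unfamiliar with MBR decoding, the generic selection rule can be sketched as follows: among sampled candidates, choose the one with the highest average utility against the others. The unigram-F1 utility here is a stand-in chosen for this sketch; the abstract does not specify the dialect-aware utility the authors use, and the candidate strings are placeholder tokens.

```python
from collections import Counter
from typing import Callable, Sequence


def unigram_f1(hyp: str, ref: str) -> float:
    """Whitespace-token F1 overlap; a placeholder utility, not the paper's."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: Sequence[str],
               utility: Callable[[str, str], float] = unigram_f1) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


if __name__ == "__main__":
    print(mbr_select(["a b c d", "a b c", "a b x", "q r s"]))
```

Swapping `utility` for a metric that rewards dialectal features is where a dialect-aware variant would presumably differ from this sketch.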
cs.CL / 38 / 2602.09712
TraceMem: Weaving Narrative Memory Schemata from User Conversational Traces
TraceMem:从用户对话痕迹中编织叙事记忆图式
Abstract
Sustaining long-term interactions remains a bottleneck for Large Language Models (LLMs), as their limited context windows struggle to manage dialogue histories that extend over time. Existing memory systems often treat interactions as disjointed snippets, failing to capture the underlying narrative coherence of the dialogue stream. We propose TraceMem, a cognitively-inspired framework that weaves structured, narrative memory schemata from user conversational traces through a three-stage pipeline: (1) Short-term Memory Processing, which employs a deductive topic segmentation approach to demarcate episode boundaries and extract semantic representation; (2) Synaptic Memory Consolidation, a process that summarizes episodes into episodic memories before distilling them alongside semantics into user-specific traces; and (3) Systems Memory Consolidation, which utilizes two-stage hierarchical clustering to organize these traces into coherent, time-evolving narrative threads under unifying themes. These threads are encapsulated into structured user memory cards, forming narrative memory schemata. For memory utilization, we provide an agentic search mechanism to enhance the reasoning process. Evaluation on the LoCoMo benchmark shows that TraceMem achieves state-of-the-art performance with a brain-inspired architecture. Analysis shows that by constructing coherent narratives, it surpasses baselines in multi-hop and temporal reasoning, underscoring its essential role in deep narrative comprehension. Additionally, we provide an open discussion on memory systems, offering our perspectives and future outlook on the field. Our code implementation is available at: https://github.com/YimingShu-teay/TraceMem
Chinese Translation
长期交互的持续性仍然是大型语言模型(LLMs)面临的瓶颈,因为它们有限的上下文窗口难以管理随时间延续的对话历史。现有的记忆系统通常将交互视为不连贯的片段,未能捕捉对话流的潜在叙事一致性。我们提出了TraceMem,一个受认知启发的框架,通过三阶段管道从用户对话痕迹中编织结构化的叙事记忆图式:(1) 短期记忆处理,采用演绎主题分割方法来划定情节边界并提取语义表示;(2) 突触记忆巩固,一个将情节总结为情节记忆的过程,然后将其与语义一起提炼成用户特定的痕迹;(3) 系统记忆巩固,利用两阶段层次聚类将这些痕迹组织成一致的、随时间演变的叙事线索,围绕统一主题。这些线索被封装成结构化的用户记忆卡片,形成叙事记忆图式。为了利用记忆,我们提供了一种代理搜索机制,以增强推理过程。在LoCoMo基准上的评估显示,TraceMem以脑启发的架构实现了最先进的性能。分析表明,通过构建连贯的叙事,它在多跳和时间推理中超越了基线,突显了其在深度叙事理解中的重要作用。此外,我们对记忆系统进行了开放讨论,提供了我们对该领域的看法和未来展望。我们的代码实现可在以下链接获取:https://github.com/YimingShu-teay/TraceMem
cs.CL / 39 / 2602.09719
Unsupervised Layer-Wise Dynamic Test Time Adaptation for LLMs
无监督层级动态测试时间适应用于大型语言模型
Abstract
Test-time adaptation (TTA) for large language models (LLMs) updates model parameters at inference time using signals available at deployment. This paper focuses on a common yet under-explored regime: unsupervised, sample-specific TTA, where the model adapts independently for each prompt using only the prompt itself, without gold answers or external supervision. Although appealing, naive unsupervised TTA with a fixed, handcrafted learning rate can be unstable: updates may overfit to prompt-specific statistics, drift from the desired answer distribution, and ultimately degrade generation quality. This failure mode is not surprising, as in this case TTA must adapt to a single prompt within only a few gradient steps, unlike standard training that averages updates over large datasets and long optimization horizons. Therefore, we propose layer-wise dynamic test-time adaptation, a framework which explicitly modulates TTA strength as a function of prompt representation, LLM structure and adaptation step. In our setting, TTA updates only LoRA parameters, and a lightweight hypernetwork predicts per-layer, per-step learning-rate multipliers, enabling fine-grained control. Experiments across various datasets and LLMs consistently show that our method substantially strengthens TTA by learning effective scaling patterns over adaptation steps and transformer layer projections, improving stability while delivering better performance.
Chinese Translation
测试时间适应(TTA)用于大型语言模型(LLMs)在推理时利用部署时可用的信号更新模型参数。本文关注一种常见但尚未深入探讨的情况:无监督的样本特定TTA,其中模型仅使用提示本身独立适应每个提示,而无需黄金答案或外部监督。尽管这种方法具有吸引力,但使用固定的手工学习率的简单无监督TTA可能会不稳定:更新可能会过拟合于提示特定的统计特征,偏离期望的答案分布,最终降低生成质量。这种失败模式并不令人惊讶,因为在这种情况下,TTA必须在仅几个梯度步骤内适应单个提示,而与标准训练不同,后者是在大型数据集和较长优化周期上平均更新。因此,我们提出了层级动态测试时间适应的框架,该框架明确地根据提示表示、LLM结构和适应步骤调节TTA强度。在我们的设置中,TTA仅更新LoRA参数,并且一个轻量级超网络预测每层、每步的学习率乘子,从而实现细粒度控制。在各种数据集和LLMs上的实验一致表明,我们的方法通过学习适应步骤和变换器层投影上的有效缩放模式,显著增强了TTA,提高了稳定性,同时提升了性能。
cs.CL / 40 / 2602.09723
AI-Assisted Scientific Assessment: A Case Study on Climate Change
人工智能辅助的科学评估:气候变化案例研究
Abstract
The emerging paradigm of AI co-scientists focuses on tasks characterized by repeatable verification, where agents explore search spaces in 'guess and check' loops. This paradigm does not extend to problems where repeated evaluation is impossible and ground truth is established by the consensus synthesis of theory and existing evidence. We evaluate a Gemini-based AI environment designed to support collaborative scientific assessment, integrated into a standard scientific workflow. In collaboration with a diverse group of 13 scientists working in the field of climate science, we tested the system on a complex topic: the stability of the Atlantic Meridional Overturning Circulation (AMOC). Our results show that AI can accelerate the scientific workflow. The group produced a comprehensive synthesis of 79 papers through 104 revision cycles in just over 46 person-hours. AI contribution was significant: most AI-generated content was retained in the report. AI also helped maintain logical consistency and presentation quality. However, expert additions were crucial to ensure its acceptability: less than half of the report was produced by AI. Furthermore, substantial oversight was required to expand and elevate the content to rigorous scientific standards.
Chinese Translation
新兴的人工智能共同科学家范式专注于可重复验证的任务,在这些任务中,代理在“猜测与检查”的循环中探索搜索空间。该范式并不适用于那些无法进行重复评估的问题,其真实情况是通过理论与现有证据的共识综合来确立的。我们评估了一个基于Gemini的人工智能环境,旨在支持协作科学评估,并与标准科学工作流程相结合。在与13位气候科学领域的多样化科学家团队合作中,我们在一个复杂主题上测试了该系统:大西洋经向翻转环流(AMOC)的稳定性。我们的结果表明,人工智能可以加速科学工作流程。该小组在仅46个小时的人力时间内,通过104轮修订,产生了79篇论文的综合总结。人工智能的贡献显著:大部分由人工智能生成的内容被保留在报告中。人工智能还帮助维护了逻辑一致性和呈现质量。然而,专家的补充对于确保报告的可接受性至关重要:报告中不到一半的内容是由人工智能生成的。此外,需要大量的监督来扩展和提升内容,以达到严格的科学标准。
cs.CL / 41 / 2602.09724
Targum -- A Multilingual New Testament Translation Corpus
Targum -- 一种多语言新约翻译语料库
Abstract
Many European languages possess rich biblical translation histories, yet existing corpora - in prioritizing linguistic breadth - often fail to capture this depth. To address this gap, we introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Aggregated from 12 online biblical libraries and one preexisting corpus, each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define "uniqueness" for their own needs: they can perform micro-level analyses on translation families, such as the KJV lineage, or conduct macro-level studies by deduplicating closely related texts. By providing the first resource designed for such flexible, multilevel analysis, our corpus establishes a new benchmark for the quantitative study of translation history.
Chinese Translation
许多欧洲语言拥有丰富的圣经翻译历史,但现有语料库在优先考虑语言广度的同时,往往未能捕捉到这种深度。为了解决这一问题,我们推出了一个包含657个新约翻译的多语言语料库,其中352个是独特的,涵盖五种语言,具有前所未有的深度:英语(208个独特版本,共396个),法语(41个,共78个),意大利语(18个,共33个),波兰语(30个,共48个)和西班牙语(55个,共102个)。该语料库汇集自12个在线圣经图书馆和一个现有语料库,每个翻译都手动注释了元数据,将文本映射到作品的标准化标识符、特定版本及其修订年份。这种规范化使研究人员能够根据自己的需求定义“独特性”:他们可以对翻译家族(如KJV谱系)进行微观分析,或通过去重紧密相关的文本进行宏观研究。通过提供第一个旨在进行灵活的多层次分析的资源,我们的语料库为翻译历史的定量研究建立了新的基准。
cs.CL / 42 / 2602.09760
Improving Interpretability of Lexical Semantic Change with Neurobiological Features
利用神经生物特征提高词汇语义变化的可解释性
Abstract
Lexical Semantic Change (LSC) is the phenomenon in which the meaning of a word changes over time. Most studies on LSC focus on improving the performance of estimating the degree of LSC; however, it is often difficult to interpret how the meaning of a word changes. Enhancing the interpretability of LSC is a significant challenge, as it could lead to novel insights in this field. To tackle this challenge, we propose a method to map the semantic space of contextualized embeddings of words obtained by a pre-trained language model to a neurobiological feature space. In the neurobiological feature space, each dimension corresponds to a primitive feature of words, and its value represents the intensity of that feature. This enables humans to interpret LSC systematically. When employed for the estimation of the degree of LSC, our method demonstrates superior performance in comparison to the majority of the previous methods. In addition, given the high interpretability of the proposed method, several analyses on LSC are carried out. The results demonstrate that our method not only discovers interesting types of LSC that have been overlooked in previous studies but also effectively searches for words with specific types of LSC.
Chinese Translation
词汇语义变化(Lexical Semantic Change,LSC)是指一个词的意义随时间变化的现象。大多数关于LSC的研究集中在提高估计LSC程度的性能上,然而,通常难以解释一个词的意义是如何变化的。增强LSC的可解释性是一个重要的挑战,因为这可能为该领域带来新的见解。为了解决这一挑战,我们提出了一种方法,将通过预训练语言模型获得的上下文化词嵌入的语义空间映射到神经生物特征空间。在神经生物特征空间中,每个维度对应于词的一个原始特征,其值表示该特征的强度。这使得人类能够系统地解释LSC。在估计LSC程度时,我们的方法表现出优于大多数先前方法的性能。此外,鉴于所提方法的高可解释性,我们对LSC进行了多项分析。结果表明,我们的方法不仅发现了在先前研究中被忽视的有趣类型的LSC,还有效地搜索到具有特定类型LSC的词。
cs.CL / 43 / 2602.09785
Where Are We At with Automatic Speech Recognition for the Bambara Language?
巴姆巴拉语自动语音识别的现状如何?
Abstract
This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, utilizing one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards in a narrow formal domain: the best Word Error Rate (WER) was 46.76% and the best Character Error Rate (CER), achieved by a different model, was 13.00%, while several prominent multilingual models exceeded 100% WER. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario of the most simplified and formal form of spoken Bambara, these figures are yet to be tested against practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
Chinese Translation
本文介绍了第一个用于评估巴姆巴拉语自动语音识别(ASR)的标准化基准,利用了一小时专业录制的马里宪法文本。该基准在近乎最佳的声学和语言条件下设计为一个受控参考集,用于评估37个模型,从巴姆巴拉训练系统到大规模商业模型。我们的研究发现,目前的ASR性能在狭窄的正式领域中仍显著低于部署标准;在词错误率(WER)方面表现最佳的系统达到了46.76%,而最佳字符错误率(CER)13.00%则由另一个模型取得,同时一些知名的多语言模型的WER超过了100%。这些结果表明,仅靠多语言预训练和模型扩展不足以支持代表性不足的语言。此外,由于该数据集代表了巴姆巴拉语最简化和正式的口语形式的最佳案例,这些数据尚未在实际的现实环境中进行测试。我们提供了该基准及其附带的公共排行榜,以促进巴姆巴拉语音技术的透明评估和未来研究。
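The abstract above reports WER values above 100%, which can look like an error. A minimal sketch of the standard WER definition (word-level edit distance divided by reference length) shows why it is possible: insertions count as errors but do not lengthen the reference. The strings below are placeholder tokens, not Bambara data, and the benchmark's own text normalization is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Two reference words, five unrelated hypothesis words:
    # 2 substitutions + 3 insertions = 5 errors over 2 words.
    print(word_error_rate("a b", "x y z w v"))
```

A system that hallucinates long outputs for short references can therefore score well above 100% WER.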
cs.CL / 44 / 2602.09805
Decomposing Reasoning Efficiency in Large Language Models
大型语言模型中的推理效率分解
Abstract
Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman $\rho=0.63$), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times (only weakly related to model scale). Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
Chinese Translation
为推理而训练的大型语言模型在推理令牌与准确性之间进行权衡,然而标准评估仅报告最终准确性,掩盖了令牌的使用或浪费情况。我们引入了一种可选追踪框架,将令牌效率分解为可解释的因素:在固定令牌预算下的完成(避免截断)、给定完成的条件正确性和冗长性(令牌使用)。当基准元数据提供每个实例的工作负载代理时,我们进一步将冗长性分解为两个组成部分:平均表述开销(每个工作单位的令牌数)和捕捉开销如何随任务工作负载扩展的耦合系数。当推理追踪可用时,我们添加确定性的追踪质量度量(基础、重复、提示复制),以区分退化循环与冗长但参与的推理,避免人工标注和大型语言模型评判。对25个模型在CogniLoad上的评估表明,准确性和令牌效率的排名存在差异(Spearman $\rho=0.63$),效率差距通常由条件正确性驱动,表述开销变化约为9倍(与模型规模仅弱相关)。我们的分解揭示了不同的瓶颈特征,暗示了不同的效率干预措施。
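One plausible reading of the three-factor decomposition in the abstract above is sketched below: accuracy is written as completion rate times conditional correctness, and verbosity enters as mean tokens per run. The exact functional form used in the paper is not given in the abstract, so the `Run` record, the field names, and the "accuracy per 1k tokens" summary are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Run:
    completed: bool  # finished within the token budget (not truncated)
    correct: bool    # final answer correct; ignored when not completed
    tokens: int      # tokens generated


def decompose(runs: Sequence[Run]) -> Dict[str, float]:
    """Split token efficiency into completion, conditional correctness, verbosity.

    Truncated runs are counted as incorrect, so
    accuracy = completion_rate * conditional_correctness.
    """
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    done = [r for r in runs if r.completed]
    completion_rate = len(done) / n
    conditional_correctness = (sum(r.correct for r in done) / len(done)) if done else 0.0
    mean_tokens = sum(r.tokens for r in runs) / n
    accuracy = completion_rate * conditional_correctness
    return {
        "completion_rate": completion_rate,
        "conditional_correctness": conditional_correctness,
        "mean_tokens": mean_tokens,
        "accuracy": accuracy,
        "accuracy_per_1k_tokens": 1000.0 * accuracy / mean_tokens,
    }


if __name__ == "__main__":
    runs = [Run(True, True, 100), Run(True, False, 200),
            Run(True, True, 300), Run(False, False, 400)]
    print(decompose(runs))
```

Separating the factors this way makes it visible whether a model loses efficiency through truncation, through wrong answers on completed runs, or through sheer token usage, which is the kind of bottleneck profile the abstract refers to.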
cs.CL / 45 / 2602.09817
AnalyticsGPT: An LLM Workflow for Scientometric Question Answering
AnalyticsGPT:一种用于科学计量问题回答的大型语言模型工作流程
Abstract
This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta-scientific questions concerning the "science of science." Compared to traditional scientific question answering based on papers, the task poses unique challenges in the planning phase, namely the need for named-entity recognition of academic entities within questions and multi-faceted data retrieval involving scientometric indices, e.g., impact factors. Beyond their exceptional capacity for treating traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition, planning, and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end-to-end system implementing a sequential workflow with retrieval-augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the data into presentable and well-structured high-level analyses. As a database for retrieval-augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs-as-judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: https://github.com/lyvykhang/llm-agents-scientometric-qa/tree/acl.
Chinese Translation
本文介绍了AnalyticsGPT,这是一种直观且高效的大型语言模型(LLM)驱动的科学计量问题回答工作流程。该下游任务较少受到关注,主要涉及关于“科学的科学”的元科学问题的子类别。与基于论文的传统科学问题回答相比,该任务在规划阶段面临独特的挑战,即需要在问题中识别学术实体的命名实体,以及涉及科学计量指标(如影响因子)的多方面数据检索。除了在处理传统自然语言处理任务方面的卓越能力外,LLM在更复杂的应用中也展现出巨大潜力,例如任务分解、规划和推理。在本文中,我们探讨了LLM在科学计量问题回答中的应用,并描述了一个实现检索增强生成和代理概念的顺序工作流程的端到端系统。我们还解决了有效综合数据以形成可呈现的、结构良好的高层次分析的次要任务。作为检索增强生成的数据库,我们利用了一个专有的研究绩效评估平台。为了进行评估,我们咨询了经验丰富的主题专家,并利用LLM作为评审。在此过程中,我们提供了关于LLM在这一细分下游任务中的有效性的宝贵见解。我们的(框架)代码和提示可在以下链接获取:https://github.com/lyvykhang/llm-agents-scientometric-qa/tree/acl。
cs.CL / 46 / 2602.09821
Text summarization via global structure awareness
通过全局结构意识进行文本摘要
Abstract
Text summarization is a fundamental task in natural language processing (NLP), and the information explosion has made long-document processing increasingly demanding, making summarization essential. Existing research mainly focuses on model improvements and sentence-level pruning, but often overlooks global structure, leading to disrupted coherence and weakened downstream performance. Some studies employ large language models (LLMs), which achieve higher accuracy but incur substantial resource and time costs. To address these issues, we introduce GloSA-sum, the first summarization approach that achieves global structure awareness via topological data analysis (TDA). GloSA-sum summarizes text efficiently while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, where persistent homology identifies core semantics and logical structures, preserved in a "protection pool" as the backbone for summarization. We design a topology-guided iterative strategy, where lightweight proxy metrics approximate sentence importance to avoid repeated high-cost computations, thus preserving structural integrity while improving efficiency. To further enhance long-text processing, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking a balance between accuracy and efficiency, and further benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.
Chinese Translation
文本摘要是自然语言处理(NLP)中的一项基础任务,信息爆炸使得长文档处理的需求日益增加,从而使得摘要变得至关重要。现有研究主要集中在模型改进和句子级剪枝上,但往往忽视了全局结构,导致连贯性受到破坏,下游性能减弱。一些研究采用大型语言模型(LLMs),虽然能够实现更高的准确性,但也带来了可观的资源和时间成本。为了解决这些问题,我们提出了GloSA-sum,这是第一种通过拓扑数据分析(TDA)实现全局结构意识的摘要方法。GloSA-sum有效地总结文本,同时保留语义核心和逻辑依赖关系。具体而言,我们从句子嵌入构建了一个语义加权图,其中持久同调识别核心语义和逻辑结构,这些结构被保存在一个“保护池”中,作为摘要的骨架。我们设计了一种以拓扑为指导的迭代策略,通过轻量级代理指标来近似句子重要性,以避免重复的高成本计算,从而在提高效率的同时保持结构完整性。为了进一步增强长文本处理,我们提出了一种层次策略,整合了段落级和全局摘要。在多个数据集上的实验表明,GloSA-sum在保留语义和逻辑完整性的同时减少了冗余,达到了准确性与效率之间的平衡,并通过缩短上下文而保留必要的推理链,进一步有利于LLM下游任务。
cs.CL / 47 / 2602.09826
From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models
从现代标准阿拉伯语到方言:探索阿拉伯语言模型中的跨语言迁移
Abstract
Arabic Language Models (LMs) are pretrained predominantly on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. While MSA, as the standard written variety, is commonly used in formal settings, people speak and write online in various dialects that are spread across the Arab region. This poses limitations for Arabic LMs, since the dialects vary in their similarity to MSA. In this work, we study cross-lingual transfer of Arabic models using probing on three natural language processing (NLP) tasks and representational similarity. Our results indicate that transfer is possible but disproportionate across dialects, which we find to be partially explained by their geographic proximity. Furthermore, we find evidence for negative interference in models trained to support all Arabic dialects. This questions their degree of similarity, and raises concerns for cross-lingual transfer in Arabic models.
Chinese Translation
阿拉伯语言模型(LMs)主要在现代标准阿拉伯语(MSA)上进行预训练,并期望能够迁移到其方言中。虽然现代标准阿拉伯语作为标准书面形式通常用于正式场合,但人们在阿拉伯地区的各种方言中进行口语和在线书写。这对阿拉伯语言模型构成了限制,因为其方言与现代标准阿拉伯语的相似性各不相同。在本研究中,我们通过对3个自然语言处理(NLP)任务进行探测和表示相似性,研究了阿拉伯模型的跨语言迁移。我们的结果表明,迁移是可能的,但在方言之间存在不成比例的现象,我们发现这部分可以通过它们的地理接近性来解释。此外,我们发现了对所有阿拉伯方言进行训练的模型存在负干扰的证据。这质疑了它们的相似程度,并对阿拉伯模型中的跨语言迁移提出了担忧。
cs.CL / 48 / 2602.09832
LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse
LLM 推理预测模型正确性的时机:来自编码课堂话语的证据
Abstract
Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model's assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
Chinese Translation
大型语言模型(LLMs)正越来越多地被应用于大规模自动标记和分析教育对话,但当前的流程缺乏可靠的方法来检测模型何时出错。我们研究了 LLM 生成的推理是否可以用于预测模型自身预测的正确性。我们分析了来自课堂对话的 30,300 条教师发言,每条发言均由多个最先进的 LLM 标记,并附有教学行为构造和相应的推理。利用经过人工验证的真实标签,我们将任务框定为预测模型对给定发言分配的标签是否正确。我们使用词频-逆文档频率(TF-IDF)对 LLM 推理进行编码,并评估了五种监督分类器。一种随机森林分类器达到了 0.83 的 F1 分数(召回率 = 0.854),成功识别出大多数错误预测,并超越了基线。为特定教学行为构造训练专门的检测器进一步提高了在困难构造上的表现,表明错误检测受益于特定构造的语言线索。利用语言探究与词汇计数(LIWC)框架,我们考察了四个正确性的语言标记:因果关系、差异化、犹豫性和洞察力。正确的预测表现出扎实的因果语言(例如,因为,因此),而错误的推理则更可能依赖于认知模糊(例如,可能,能够)和表现性元认知(例如,认为,意识到)。句法复杂性并不能区分正确与错误的推理,且较长的推理并不更可靠。这些发现表明,基于推理的错误检测为自动化教育对话分析中的质量控制提供了一种实用且可扩展的方法。
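The feature-extraction step in the abstract above (encoding each reasoning text with TF-IDF before classification) can be sketched in plain Python. This is an illustration only: whitespace tokenization and the smoothed IDF formula are choices made here, the example sentences are invented around the marker words quoted in the abstract, and the paper's actual preprocessing and classifier settings are not stated.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def tfidf_vectors(documents: Sequence[str]) -> Tuple[List[str], List[List[float]]]:
    """Return (vocabulary, one TF-IDF vector per document).

    Uses raw term counts and a smoothed IDF, ln((1 + N) / (1 + df)) + 1,
    so a term that occurs in every document still gets weight 1.0.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


if __name__ == "__main__":
    reasonings = [
        "the teacher asks because the answer was incomplete",  # causal wording
        "the teacher might think the answer could be wrong",   # hedged wording
    ]
    vocab, vectors = tfidf_vectors(reasonings)
    for term, weight in zip(vocab, vectors[0]):
        print(f"{term}: {weight:.3f}")
```

Vectors like these would then be passed, together with correct/incorrect labels, to a supervised classifier such as the Random Forest mentioned in the abstract.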
cs.CL / 49 / 2602.09838
How Do People Quantify Naturally: Evidence from Mandarin Picture Description
人们如何进行自然量化:来自普通话图片描述的证据
Abstract
Quantification is a fundamental component of everyday language use, yet little is known about how speakers decide whether and how to quantify in naturalistic production. We investigate quantification in Mandarin Chinese using a picture-based elicited description task in which speakers freely described scenes containing multiple objects, without explicit instructions to count or quantify. Across both spoken and written modalities, we examine three aspects of quantification: whether speakers choose to quantify at all, how precise their quantification is, and which quantificational strategies they adopt. Results show that object numerosity, animacy, and production modality systematically shape quantificational behaviour. In particular, increasing numerosity reduces both the likelihood and the precision of quantification, while animate referents and modality selectively modulate strategy choice. This study demonstrates how quantification can be examined under unconstrained production conditions and provides a naturalistic dataset for further analyses of quantity expression in language production.
Chinese Translation
量化是日常语言使用的一个基本组成部分,但关于说话者如何决定在自然语境中进行量化及其方式的研究仍然较少。我们通过一项基于图片的引导描述任务来研究普通话中的量化,在该任务中,说话者自由描述包含多个物体的场景,而没有明确的计数或量化指示。我们在口头和书面两种表达方式中考察了量化的三个方面:说话者是否选择进行量化、他们的量化精确度如何,以及他们采用了哪些量化策略。结果表明,物体数量、生命性和表达方式系统性地影响量化行为。特别是,数量的增加降低了量化的可能性和精确度,而生命性指称和表达方式则选择性地调节策略选择。本研究展示了如何在不受约束的生产条件下考察量化,并提供了一个自然语境下的语料库,以便进一步分析语言生产中的数量表达。
cs.CL / 50 / 2602.09866
SinFoS: A Parallel Dataset for Translating Sinhala Figures of Speech
SinFoS:用于翻译僧伽罗语修辞的平行数据集
Abstract
Figures of Speech (FoS) consist of multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well with the figurative expressions of high-resource languages, it often faces challenges when dealing with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we have developed a binary classifier to differentiate between two types of FoS in the dataset, achieving an accuracy of approximately 92%. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in the current capabilities of LLMs, as these models often struggle to accurately convey idiomatic meanings. By making this dataset publicly available, we offer a crucial benchmark for future research in low-resource NLP and culturally aware machine translation.
Chinese Translation
修辞(Figures of Speech, FoS)由与文化深度交织的多词短语组成。尽管神经机器翻译(Neural Machine Translation, NMT)在高资源语言的比喻表达方面表现相对良好,但在处理像僧伽罗语这样的低资源语言时,由于可用数据有限,常常面临挑战。为了解决这一局限性,我们引入了一个包含2,344个僧伽罗语修辞的语料库,并附有文化和跨语言注释。我们对该数据集进行了研究,以分类修辞的文化来源并识别其跨语言等价物。此外,我们开发了一种二元分类器,以区分数据集中两种类型的修辞,达到了约92%的准确率。我们还评估了现有大型语言模型(Large Language Models, LLMs)在该数据集上的表现。我们的研究结果揭示了当前大型语言模型能力的显著不足,因为这些模型在准确传达习语含义方面常常面临困难。通过公开该数据集,我们为低资源自然语言处理和文化意识机器翻译的未来研究提供了一个重要的基准。
cs.CL / 51 / 2602.09870
Steer2Edit: From Activation Steering to Component-Level Editing
Steer2Edit:从激活引导到组件级编辑
Abstract
Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
Chinese Translation
引导方法通过识别隐藏表示中的语义方向来影响大型语言模型的行为,但通常通过推理时的激活干预来实现,这种干预对模型的内部状态施加固定的全局修改。尽管有效,这种干预在强控制下往往会导致不利的属性-效用权衡,因为它忽略了许多行为是由模型组件中的小而异质的子集所主导的。我们提出了Steer2Edit,这是一个理论上扎实、无训练的框架,将推理时控制信号中的引导向量转化为用于组件级rank-1权重编辑的诊断信号。Steer2Edit并不是在生成过程中均匀注入引导方向,而是选择性地重新分配各个注意力头和多层感知器(MLP)神经元的行为影响,从而产生可解释的编辑,保持标准的前向传播,并与优化的并行推理兼容。在安全对齐、幻觉缓解和推理效率方面,Steer2Edit始终实现了更有利的属性-效用权衡:在匹配的下游性能下,它将安全性提高了最多17.2%,将真实性提高了9.8%,并平均减少了12.2%的推理长度。总体而言,Steer2Edit为表示引导和权重编辑之间提供了一个有原则的桥梁,通过将引导信号转化为可解释的、无训练的参数更新。
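To make the term "rank-1 weight editing" concrete, here is a minimal sketch of a generic rank-1 update W' = W + alpha * u v^T on a toy 2x2 matrix. It is not Steer2Edit's procedure: how the paper derives the directions and per-component strengths from steering vectors is not described in the abstract, so `u`, `v`, and `alpha` below are hypothetical values chosen for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def rank1_edit(W, u, v, alpha):
    """Return W + alpha * outer(u, v) as a new matrix; W is not modified."""
    return [[W[i][j] + alpha * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]


W = [[1.0, 0.0],
     [0.0, 1.0]]
u = [1.0, 0.0]  # hypothetical output-space direction
v = [0.0, 1.0]  # hypothetical input-space direction
W_edit = rank1_edit(W, u, v, alpha=0.5)

x = [2.0, 4.0]
y_before = matvec(W, x)
y_after = matvec(W_edit, x)
# The output moves only along u, scaled by alpha * (v . x).
shift = [a - b for a, b in zip(y_after, y_before)]
print(shift)
```

Because the edit is folded into the weights, the forward pass afterwards is an ordinary matrix-vector product, which is the property the abstract highlights when it says edits preserve the standard forward pass.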
cs.CL / 52 / 2602.09877
The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies
莫尔特书背后的魔鬼:人类安全在自我进化的人工智能社会中总是消失
Abstract
The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment, a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the degree of divergence from anthropic value distributions. We theoretically demonstrate that isolated self-evolution induces statistical blind spots, leading to the irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
Chinese Translation
基于大型语言模型(LLMs)构建的多智能体系统的出现,为可扩展的集体智能和自我进化提供了一种有前景的范式。理想情况下,这些系统将在完全封闭的循环中实现持续的自我改进,同时保持强大的安全对齐——我们称之为自我进化三难问题。然而,我们通过理论和实证证明,满足持续自我进化、完全隔离和安全不变性的智能体社会是不可能的。基于信息理论框架,我们将安全形式化为与人类价值分布的偏离程度。我们理论上证明,孤立的自我进化会导致统计盲点,从而导致系统安全对齐的不可逆降级。来自一个开放式智能体社区(Moltbook)和两个封闭自我进化系统的实证和定性结果揭示了与我们理论预测的不可避免的安全侵蚀现象相一致的现象。我们进一步提出了几种解决方案方向,以缓解识别出的安全问题。我们的工作确立了自我进化人工智能社会的基本限制,并将讨论从症状驱动的安全补丁转向对内在动态风险的原则性理解,强调了外部监督或新型安全保护机制的必要性。
cs.CL / 53 / 2602.09914
AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning
AmharicIR+Instr:一个用于神经检索和指令调优的双数据集资源
Abstract
Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that support research on (i) neural retrieval-ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction, and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive/negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types, generated with several LLMs and refined through manual review and correction for grammaticality, relevance, fluency, and factual plausibility. We release both datasets with standardized splits and formats (CSV, JSON, JSONL) to enable reproducible work on Amharic retrieval, ranking, and generative modelling. These datasets also come with a methodology that can be generalized to other low-resource languages.
Chinese Translation
神经检索和GPT风格的生成模型依赖于大量高质量的监督数据,而对于阿姆哈拉语等低资源语言,这类数据仍然稀缺。我们发布了一个阿姆哈拉语数据资源,由两个数据集组成,支持以下研究:(i)神经检索排名和(ii)遵循指令的文本生成。检索排名数据集包含1,091个经过人工验证的查询-正例-负例文档三元组,这些三元组来自多样的阿姆哈拉语来源,并构建以支持神经检索器(例如,DPR、ColBERT风格的后期交互和SPLADE风格的稀疏神经检索)的对比训练和基准测试。三元组是通过专家策划的查询、网络获取的查询和大型语言模型(LLM)辅助生成的组合创建的,正例/负例文档来自网络或由LLM合成,并由母语者验证。指令提示-响应数据集包含6,285个阿姆哈拉语提示-响应对,涵盖多个领域和指令类型,使用多个LLM生成,并通过人工审查和修正以确保语法正确性、相关性、流畅性和事实合理性。我们以标准化的划分和格式(CSV、JSON、JSONL)发布这两个数据集,以便于在阿姆哈拉语检索、排名和生成建模方面的可重复研究。这些数据集还附带了一种可以推广到其他低资源语言的方法论。
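As an illustration of how query-positive-negative triplets in JSONL form can feed contrastive training, the sketch below parses two records and computes a triplet margin loss. The field names (`query`, `positive`, `negative`), the English placeholder texts, and the token-overlap scorer are all assumptions made for this example; the released schema and any real retriever will differ.

```python
import json

# Two placeholder records in an assumed JSONL layout (one JSON object per line).
JSONL = "\n".join([
    json.dumps({"query": "capital city",
                "positive": "the capital city is large",
                "negative": "rain fell on the farm"}),
    json.dumps({"query": "river water",
                "positive": "the stream flows south",
                "negative": "water bill for the river house"}),
])


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a retriever: fraction of query tokens found in doc."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q) if q else 0.0


def triplet_margin_loss(s_pos: float, s_neg: float, margin: float = 0.2) -> float:
    """Zero once the positive outscores the negative by at least `margin`."""
    return max(0.0, margin - (s_pos - s_neg))


triplets = [json.loads(line) for line in JSONL.splitlines() if line.strip()]
losses = [
    triplet_margin_loss(overlap_score(t["query"], t["positive"]),
                        overlap_score(t["query"], t["negative"]))
    for t in triplets
]
print(losses)
```

The second record is a lexically misleading case: the negative shares more surface tokens with the query than the positive does, so the toy scorer incurs a non-zero loss. Hard negatives of this kind are what contrastive training of a neural retriever is meant to learn from.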
cs.CL / 54 / 2602.09924
LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
LLMs 编码其失败:从生成前激活预测成功
Abstract
Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether a model's likelihood of success is recoverable from its internal representations before generation, and whether this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty
Chinese Translation
在每个问题上运行大规模语言模型(LLMs)进行扩展推理是昂贵的,但确定哪些输入实际上需要额外计算仍然具有挑战性。我们研究了它们自身成功的可能性是否可以从生成前的内部表示中恢复,以及这一信号是否可以指导更高效的推理。我们在生成前激活上训练线性探针,以预测在数学和编码任务上的特定策略成功,显著优于诸如问题长度和 TF-IDF 等表面特征。使用 E2H-AMC,该工具提供了在相同问题上人类和模型的表现,我们展示了模型编码了一种特定于模型的难度概念,这种概念与人类的难度不同,并且这种区别在扩展推理时会增加。利用这些探针,我们证明了在一组模型中路由查询可以超过表现最佳的模型,同时在 MATH 上将推理成本降低多达 70%,显示出内部表示即使在与人类对难度的直觉存在偏差时,也能实现实际的效率提升。我们的代码可在以下链接获取:https://github.com/KabakaWilliam/llms_know_difficulty
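The routing idea can be sketched as follows, assuming each model in the pool already has a probe-predicted success probability for the incoming query. The rule used here (cheapest model above a threshold, otherwise the most confident one) is one plausible policy and not necessarily the one in the paper; the model names, costs, and probabilities are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost: float  # hypothetical relative cost per query


def route(probs: dict, models: list, threshold: float = 0.7) -> Model:
    """Pick the cheapest model whose predicted success meets the threshold.

    `probs` maps model name -> probe-predicted success probability.
    If no model qualifies, fall back to the most confident model.
    """
    eligible = [m for m in models if probs[m.name] >= threshold]
    if eligible:
        return min(eligible, key=lambda m: m.cost)
    return max(models, key=lambda m: probs[m.name])


pool = [Model("small", 1.0), Model("medium", 4.0), Model("large", 10.0)]

easy = route({"small": 0.90, "medium": 0.95, "large": 0.97}, pool)
hard = route({"small": 0.20, "medium": 0.50, "large": 0.80}, pool)
very_hard = route({"small": 0.10, "medium": 0.30, "large": 0.40}, pool)
print(easy.name, hard.name, very_hard.name)
```

Cost savings under such a policy come from easy queries being served by the cheap model, while the threshold controls how much accuracy is traded for cost.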
cs.CL / 55 / 2602.09953
ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning
ATTNPO:基于注意力引导的过程监督以实现高效推理
Abstract
Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to shorten reasoning effectively and can degrade accuracy, as they treat all reasoning steps uniformly and lack fine-grained signals to distinguish redundancy from necessity. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. Leveraging the attention scores of these heads, we then employ two sub-strategies: one mitigates overthinking by discouraging redundant steps, and the other preserves accuracy by reducing penalties on essential steps. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.
Chinese Translation
通过强化学习和可验证奖励(RLVR)训练的大型推理模型在复杂推理任务上表现出色,但往往会过度思考,产生冗余推理而没有性能提升。现有的轨迹级长度惩罚往往无法有效缩短推理长度,并且会降低准确性,因为它们对所有推理步骤采取统一处理,缺乏细粒度信号来区分冗余与必要性。同时,过程监督方法通常资源密集且存在不准确的信用分配问题。为了解决这些问题,我们提出了ATTNPO,一个低开销的过程监督强化学习框架,利用模型的内在注意力信号进行步骤级信用分配。我们首先识别出一组特殊的注意力头,这些注意力头自然关注于关键步骤,同时抑制冗余步骤。通过利用这些头的注意力得分,我们采用两种子策略来减轻过度思考,既抑制冗余步骤,又通过减少对关键步骤的惩罚来保持准确性。实验结果表明,ATTNPO显著缩短了推理长度,同时在9个基准测试中显著提高了性能。
cs.CL / 56 / 2602.09961
ViMultiChoice: Toward a Method That Gives Explanation for Multiple-Choice Reading Comprehension in Vietnamese
ViMultiChoice:一种为越南语多项选择阅读理解提供解释的方法
Abstract
Multiple-choice Reading Comprehension (MCRC) models aim to select the correct answer from a set of candidate options for a given question. However, they typically lack the ability to explain the reasoning behind their choices. In this paper, we introduce a novel Vietnamese dataset designed to train and evaluate MCRC models with explanation generation capabilities. Furthermore, we propose ViMultiChoice, a new method specifically designed for modeling Vietnamese reading comprehension that jointly predicts the correct answer and generates a corresponding explanation. Experimental results demonstrate that ViMultiChoice outperforms existing MCRC baselines, achieving state-of-the-art (SotA) performance on both the ViMMRC 2.0 benchmark and the newly introduced dataset. Additionally, we show that jointly training option decision and explanation generation leads to significant improvements in multiple-choice accuracy.
Chinese Translation
多项选择阅读理解(MCRC)模型旨在从一组候选选项中选择给定问题的正确答案。然而,它们通常缺乏解释其选择背后推理的能力。本文介绍了一个新颖的越南语数据集,旨在训练和评估具有解释生成能力的MCRC模型。此外,我们提出了ViMultiChoice,这是一种专门为建模越南语阅读理解而设计的新方法,能够同时预测正确答案并生成相应的解释。实验结果表明,ViMultiChoice在现有MCRC基准测试中表现优于其他基线,在ViMMRC 2.0基准和新引入的数据集上达到了最新的(SotA)性能。此外,我们还展示了联合训练选项决策和解释生成在多项选择准确性方面带来了显著的提升。
cs.CL / 57 / 2602.09992
A Unified Assessment of the Poverty of the Stimulus Argument for Neural Language Models
对神经语言模型的刺激贫乏论证的统一评估
Abstract
How can children acquire native-level syntax from limited input? According to the Poverty of the Stimulus Hypothesis (PoSH), the linguistic input children receive is insufficient to explain certain generalizations that are robustly learned; innate linguistic constraints, many have argued, are thus necessary to explain language learning. Neural language models, which lack such language-specific constraints in their design, offer a computational test of this longstanding (but controversial) claim. We introduce poshbench, a training-and-evaluation suite targeting question formation, islands to movement, and other English phenomena at the center of the PoSH arguments. Training Transformer models on 10-50M words of developmentally plausible text, we find indications of generalization on all phenomena even without direct positive evidence, yet neural models remain less data-efficient and their generalizations are weaker than those of children. We further enhance our models with three recently proposed cognitively motivated inductive biases. We find these biases improve general syntactic competence but not poshbench performance. Our findings challenge the claim that innate syntax is the only possible route to generalization, while suggesting that human-like data efficiency requires inductive biases beyond those tested here.
Chinese Translation
儿童如何能够从有限的输入中获得母语水平的句法知识?根据刺激贫乏假说(Poverty of the Stimulus Hypothesis, PoSH),儿童所接收到的语言输入不足以解释某些稳健学习的概括;因此,许多人认为,先天的语言约束是解释语言学习所必需的。神经语言模型在其设计中缺乏这种特定于语言的约束,提供了对这一长期存在(但有争议)主张的计算测试。我们介绍了 poshbench,这是一个针对问题形成、岛屿移动以及其他与 PoSH 论证相关的英语现象的训练与评估套件。在对 1000 万到 5000 万字的具有发展合理性的文本进行 Transformer 模型训练时,我们发现即使没有直接的正面证据,所有现象仍然显示出概括的迹象——然而,神经模型的数据效率仍然较低,其概括能力也弱于儿童。我们进一步通过三种最近提出的认知动机诱导偏差来增强我们的模型。我们发现这些偏差提高了整体句法能力,但未能改善 poshbench 的表现。我们的发现挑战了先天句法是唯一可能的概括途径的主张,同时表明人类般的数据效率需要超出这里测试的诱导偏差。
cs.CL / 58 / 2602.10003
ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition
ViSpeechFormer:一种用于越南自动语音识别的音素方法
Abstract
Vietnamese has a phonetic orthography, where each grapheme corresponds to at most one phoneme and vice versa. Exploiting this high grapheme-phoneme transparency, we propose ViSpeechFormer (Vietnamese Speech TransFormer), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). To the best of our knowledge, this is the first Vietnamese ASR framework that explicitly models phonemic representations. Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias. This phoneme-based paradigm is also promising for other languages with phonetic orthographies. The code will be released upon acceptance of this paper.
Chinese Translation
越南语具有音素正字法,每个图形符号最多对应一个音素,反之亦然。利用这种高度的图形符号与音素之间的透明性,我们提出了ViSpeechFormer(越南语语音变换器),这是一种基于音素的越南自动语音识别(ASR)方法。根据我们所知,这是第一个明确建模音素表示的越南ASR框架。在两个公开可用的越南ASR数据集上的实验表明,ViSpeechFormer表现出色,对词汇外单词的泛化能力更强,并且受到训练偏差的影响较小。这种基于音素的范式对于其他具有音素正字法的语言也具有良好的前景。代码将在本文接受后发布。
cs.CL / 59 / 2602.10017
SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation
SCORE:无参考大型语言模型评估的特异性、上下文利用、鲁棒性和相关性
Abstract
Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
Chinese Translation
大型语言模型(LLMs)越来越多地用于支持高风险、特定领域的问答和决策,例如自然灾害响应和基础设施规划,在这些领域中,有效的答案必须传达细致入微、决策关键的细节。然而,现有的检索增强生成(RAG)和开放式问答评估框架主要依赖于表面相似性、事实一致性或语义相关性,往往无法评估响应是否提供了领域敏感决策所需的特定信息。为了解决这一问题,我们提出了一种多维度、无参考的评估框架,从特异性、对同义改写和语义扰动的鲁棒性、答案相关性和上下文利用四个互补维度评估LLM输出。我们引入了一个经过精心策划的数据集,包括1,412对涵盖40个专业角色和七种自然灾害类型的领域特定问答对,以支持系统评估。我们进一步进行人工评估,以评估标注者之间的一致性及模型输出与人类判断之间的对齐,这突显了开放式领域特定评估的固有主观性。我们的结果表明,没有单一指标能够充分捕捉答案质量,并展示了在高风险应用中部署LLM时对结构化、多指标评估框架的需求。
cs.CL / 60 / 2602.10021
Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference
解耦推理与隐式事实标记(DRIFT):一种高效长上下文推理的双模型框架
Abstract
The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot-Xie/DRIFT.
Chinese Translation
将广泛的动态知识整合到大型语言模型(LLMs)中仍然是一个重大挑战,因为事实数据与推理模式之间存在固有的纠缠。现有的解决方案,从非参数的检索增强生成(RAG)到参数知识编辑,通常在实践中受到有限上下文窗口、检索噪声或灾难性遗忘风险的限制。本文提出了DRIFT,一种新颖的双模型架构,旨在明确解耦知识提取与推理过程。与静态提示压缩不同,DRIFT采用轻量级知识模型,动态地将文档块压缩为基于查询的隐式事实标记。这些密集表示被投影到推理模型的嵌入空间中,替代原始的冗余文本,同时保持推理准确性。大量实验表明,DRIFT在长上下文任务上的性能显著提升,超越了同类规模模型中的强基线。我们的方法为扩展LLMs的有效上下文窗口和推理能力提供了一种可扩展且高效的范式。我们的代码可在 https://github.com/Lancelot-Xie/DRIFT 获取。
cs.CL / 61 / 2602.10023
MEVER: Multi-Modal and Explainable Claim Verification with Graph-based Evidence Retrieval
MEVER:基于图的证据检索的多模态可解释性声明验证
Abstract
Verifying the truthfulness of claims usually requires joint multi-modal reasoning over both textual and visual evidence, such as analyzing both a textual caption and a chart image. In addition, to make the reasoning process transparent, a textual explanation is necessary to justify the verification result. However, most claim verification work either reasons over textual evidence only or ignores explainability, resulting in inaccurate and unconvincing verification. To address this problem, we propose a novel model that jointly achieves evidence retrieval, multi-modal claim verification, and explanation generation. For evidence retrieval, we construct a two-layer multi-modal graph for claims and evidence, where we design image-to-text and text-to-image reasoning for multi-modal retrieval. For claim verification, we propose token- and evidence-level fusion to integrate claim and evidence embeddings for multi-modal verification. For explanation generation, we introduce multi-modal Fusion-in-Decoder for explainability. Finally, since almost all existing datasets are in the general domain, we create a scientific dataset in the AI domain, AIChartClaim, to complement the resources available to the claim verification community. Experiments show the strength of our model.
Chinese Translation
验证声明的真实性通常需要对文本和视觉证据进行联合多模态推理,例如分析文本说明和图表图像以进行声明验证。此外,为了使推理过程透明,必须提供文本解释以证明验证结果。然而,大多数声明验证工作主要集中在仅对文本证据进行推理或忽视可解释性,导致验证结果不准确且缺乏说服力。为了解决这个问题,我们提出了一种新颖的模型,能够联合实现证据检索、多模态声明验证和解释生成。在证据检索方面,我们为声明和证据构建了一个两层多模态图,其中设计了图像到文本和文本到图像的推理以进行多模态检索。在声明验证方面,我们提出了基于标记和证据级别的融合,以整合声明和证据的嵌入进行多模态验证。在解释生成方面,我们引入了多模态解码器中的融合以增强可解释性。最后,由于几乎所有数据集都属于一般领域,我们在人工智能领域创建了一个科学数据集AIChartClaim,以补充声明验证社区。实验结果表明了我们模型的优势。
cs.CL / 62 / 2602.10092
Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing
量子审计:评估大型语言模型在量子计算上的推理能力限制
Abstract
Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, the models' understanding of quantum computing concepts has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics. We evaluate 26 models from leading organizations. Our benchmark comprises 1,000 expert-written questions, 1,000 questions extracted from research papers using LLMs and validated by experts, plus an additional 700 questions including 350 open-ended questions and 350 questions with false premises to test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.
Chinese Translation
语言模型已成为量子计算教育和研究的实用工具,从总结技术论文到解释理论概念,以及回答有关该领域最新发展的问题。尽管现有基准评估量子代码生成和电路设计,但对量子计算概念的理解尚未得到系统性测量。量子审计(Quantum-Audit)填补了这一空白,提供了涵盖核心量子计算主题的2700个问题。我们评估了来自领先组织的26个模型。我们的基准包括1000个专家撰写的问题,1000个从研究论文中提取并由专家验证的问题,以及额外的700个问题,其中包括350个开放式问题和350个带有错误前提的问题,以测试模型是否能够纠正错误的假设。人类参与者的得分在23%到86%之间,专家的平均得分为74%。表现最佳的模型超过了专家平均水平,其中Claude Opus 4.5的准确率达到84%,尽管顶尖模型在专家撰写的问题上相比于LLM生成的问题显示出平均12个百分点的准确率下降。在高级主题上,表现进一步下降,安全问题的准确率降至73%。此外,模型经常接受并强化嵌入问题中的错误前提,而不是识别它们,在这些关键推理任务上的准确率低于66%。