cs.RO / 1 / 2603.18084
Uncovering Latent Phase Structures and Branching Logic in Locomotion Policies: A Case Study on HalfCheetah
Abstract
In locomotion control tasks, Deep Reinforcement Learning (DRL) has demonstrated high performance; however, the decision-making process of the learned policy remains a black box that is difficult for humans to understand. On the other hand, periodic motions such as walking are well known to contain implicit motion phases, such as the stance phase and the swing phase. Building on this observation, this study hypothesizes that a policy trained for locomotion control may also represent a phase structure that is interpretable by humans. To examine this hypothesis in a controlled setting, we use the MuJoCo locomotion benchmark HalfCheetah-v5, where it is possible to observe whether a policy autonomously acquires temporally structured phases through interaction with the environment. The state transition sequences generated by a policy trained for walking control were aggregated into semantic phases based on state similarity and the consistency of subsequent transitions. As a result, we demonstrate that the state sequences generated by the trained policy exhibit periodic phase-transition structures as well as phase branching. Furthermore, by approximating the states and actions corresponding to each semantic phase using Explainable Boosting Machines (EBMs), we analyze phase-dependent decision making: namely, which state features the policy attends to and how it controls action outputs in each phase. These results suggest that neural network-based policies, often regarded as black boxes, can autonomously acquire interpretable phase structures and logical branching mechanisms.
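The phase-discovery step described above (grouping rollout states by similarity, then examining subsequent transitions) can be sketched with a minimal k-means clustering over a synthetic periodic trajectory. The 1-D sine "joint state", the choice of k=3, and the transition counter are illustrative assumptions, not the paper's actual HalfCheetah pipeline:

```python
import math
from collections import Counter

def kmeans_1d(points, k, iters=50):
    """Minimal Lloyd's k-means over scalar states (illustrative only)."""
    # spread the initial centers evenly along the trajectory
    centers = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (p - centers[j]) ** 2) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

# synthetic periodic "joint state" standing in for a HalfCheetah rollout
states = [math.sin(0.4 * t) for t in range(200)]
labels = kmeans_1d(states, k=3)

# aggregate consecutive cluster labels into a phase-transition structure:
# a periodic gait should yield a small, repeating set of cluster-to-cluster edges
transitions = Counter((a, b) for a, b in zip(labels, labels[1:]) if a != b)
```

On a truly periodic signal the counter is dominated by a few recurring edges, which is the kind of cyclic phase structure the paper reports; phase branching would appear as multiple distinct edges leaving a single cluster.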
cs.RO / 2 / 2603.18130
Final Report for the Workshop on Robotics & AI in Medicine
Abstract
The CARE Workshop on Robotics and AI in Medicine, held on December 1, 2025 in Indianapolis, convened leading researchers, clinicians, industry innovators, and federal stakeholders to shape a national vision for advancing robotics and artificial intelligence in healthcare. The event highlighted the accelerating need for coordinated research efforts that bridge engineering innovation with real clinical priorities, emphasizing safety, reliability, and translational readiness, with robotics and AI as the means to achieve that readiness. Across keynotes, panels, and breakout sessions, participants underscored critical gaps in data availability, standardized evaluation methods, regulatory pathways, and workforce training that hinder the deployment of intelligent robotic systems in surgical, diagnostic, rehabilitative, and assistive contexts. Discussions emphasized the transformative potential of AI-enabled robotics to improve precision, reduce provider burden, expand access to specialized care, and enhance patient outcomes, particularly in underserved regions and high-risk procedural domains. Special attention was given to austere settings, including disaster-relief and military contexts. The workshop demonstrated broad consensus on the urgency of establishing a national Center for AI and Robotic Excellence in medicine (CARE). Stakeholders identified priority research thrusts including human-robot collaboration, trustworthy autonomy, simulation and digital twins, multimodal sensing, and ethical integration of generative AI into clinical workflows. Participants also articulated the need for high-quality datasets, shared test beds, autonomous surgical systems, clinically grounded benchmarks, and sustained interdisciplinary training mechanisms.
cs.RO / 3 / 2603.18210
GoalVLM: VLM-driven Object Goal Navigation for Multi-Agent System
Abstract
Object-goal navigation has traditionally been limited to ground robots with closed-set object vocabularies. Existing multi-agent approaches depend on precomputed probabilistic graphs tied to fixed category sets, precluding generalization to novel goals at test time. We present GoalVLM, a cooperative multi-agent framework for zero-shot, open-vocabulary object navigation. GoalVLM integrates a Vision-Language Model (VLM) directly into the decision loop, SAM3 for text-prompted detection and segmentation, and SpaceOM for spatial reasoning, enabling agents to interpret free-form language goals and score frontiers via zero-shot semantic priors without retraining. Each agent builds a BEV semantic map from depth-projected voxel splatting, while a Goal Projector back-projects detections through calibrated depth into the map for reliable goal localization. A constraint-guided reasoning layer evaluates frontiers through a structured prompt chain (scene captioning, room-type classification, perception gating, multi-frontier ranking), injecting commonsense priors into exploration. We evaluate GoalVLM on GOAT-Bench val_unseen (360 multi-subtask episodes, 1032 sequential object-goal subtasks, HM3D scenes), where each episode requires navigating to a chain of 5-7 open-vocabulary targets. GoalVLM with N=2 agents achieves 55.8% subtask SR and 18.3% SPL, competitive with state-of-the-art methods while requiring no task-specific training. Ablation studies confirm the contributions of VLM-guided frontier reasoning and depth-projected goal localization.
cs.RO / 4 / 2603.18238
ReDAG-RT: Global Rate-Priority Scheduling for Real-Time Multi-DAG Execution in ROS 2
Abstract
ROS 2 has become a dominant middleware for robotic systems, where perception, estimation, planning, and control pipelines are structured as directed acyclic graphs of callbacks executed under a shared executor. However, default ROS 2 executors use best-effort dispatch without cross-DAG priority enforcement, leading to callback contention, structural priority inversion, and deadline instability under concurrent workloads. These limitations restrict deployment in time-critical and safety-sensitive cyber-physical systems. This paper presents ReDAG-RT, a user-space global scheduling framework for deterministic multi-DAG execution in unmodified ROS 2. The framework introduces a Rate-Priority driven global ready queue that orders callbacks by activation rate, enforces per-DAG concurrency bounds, and mitigates cross-graph priority inversion without modifying the ROS 2 API, executor interface, or underlying operating system scheduler. We formalize a multi-DAG task model for ROS 2 callback pipelines and analyze cross-DAG interference under Rate-Priority scheduling. Response-time recurrences and schedulability conditions are derived within classical Rate-Monotonic theory. Experiments in a ROS 2 Humble environment compare ReDAG-RT against SingleThreadedExecutor and MultiThreadedExecutor using synthetic multi-DAG workloads. Results show up to a 29.7 percent reduction in deadline miss rate, a 42.9 percent reduction in 99th-percentile response time, and a 13.7 percent improvement over MultiThreadedExecutor under comparable utilization. Asymmetric per-DAG concurrency bounds further reduce interference by 40.8 percent. These results demonstrate that deterministic and analyzable multi-DAG scheduling can be achieved entirely in the ROS 2 user-space execution layer, providing a practical foundation for real-time robotic middleware in safety-critical systems.
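A minimal sketch of the Rate-Priority idea (not the ReDAG-RT implementation): a global ready queue dispatches the callback with the highest activation rate first, while a per-DAG bound caps how many callbacks from each graph run concurrently. Class, field, and callback names below are hypothetical:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Callback:
    sort_key: float                       # negated rate: heapq is a min-heap
    dag: str = field(compare=False)
    name: str = field(compare=False)

class RatePriorityQueue:
    """Global ready queue: higher activation rate dispatches first,
    subject to a per-DAG cap on concurrently running callbacks."""
    def __init__(self, dag_limits):
        self.heap = []
        self.dag_limits = dict(dag_limits)
        self.running = {d: 0 for d in dag_limits}

    def push(self, dag, name, rate_hz):
        heapq.heappush(self.heap, Callback(-rate_hz, dag, name))

    def pop_ready(self):
        deferred, picked = [], None
        while self.heap:
            cb = heapq.heappop(self.heap)
            if self.running[cb.dag] < self.dag_limits[cb.dag]:
                picked = cb
                self.running[cb.dag] += 1
                break
            deferred.append(cb)           # DAG at its bound: defer, don't drop
        for cb in deferred:
            heapq.heappush(self.heap, cb)
        return picked

    def finish(self, dag):
        self.running[dag] -= 1

q = RatePriorityQueue({"control": 1, "perception": 1})
q.push("perception", "camera_cb", 30.0)
q.push("control", "pid_cb", 100.0)
q.push("control", "safety_cb", 50.0)
first = q.pop_ready()    # highest-rate callback wins
second = q.pop_ready()   # control is at its bound, so perception runs next
q.finish("control")
third = q.pop_ready()    # the deferred control callback is now eligible
```

Ordering by activation rate is exactly the rate-monotonic priority assignment, which is what lets classical response-time analysis apply to the queue.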
cs.RO / 5 / 2603.18246
Rapid Adaptation of Particle Dynamics for Generalized Deformable Object Mobile Manipulation
Abstract
We address the challenge of learning to manipulate deformable objects with unknown dynamics. In non-rigid objects, the dynamics parameters define how they react to interactions -- how they stretch, bend, compress, and move -- and they are critical to determining the optimal actions for performing a manipulation task successfully. In other robotic domains, such as legged locomotion and in-hand rigid object manipulation, state-of-the-art approaches handle unknown dynamics using Rapid Motor Adaptation (RMA). Through a supervised procedure in simulation that encodes each rigid object's dynamics, such as mass and position, these approaches learn a policy that conditions actions on a vector of latent dynamics parameters inferred from sequences of states and actions. However, in deformable object manipulation, the object's dynamics include not only its mass and position but also how its shape changes. Our key insight is that the recent ground-truth particle positions of a deformable object in simulation capture changes in the object's shape, making it possible to extend RMA to deformable object manipulation. This insight allows us to develop RAPiD, a two-phase method that learns to perform real-robot deformable object mobile manipulation by: 1) learning a visuomotor policy conditioned on the object's dynamics embedding, which is encoded from the object's privileged information in simulation, such as its mass and ground-truth particle positions, and 2) learning to infer this embedding from non-privileged information instead, such as robot visual observations and actions, so that the learned policy can transfer to the real world. On a mobile manipulator with 22 degrees of freedom, RAPiD achieves success rates above 80% on two vision-based deformable object mobile manipulation tasks in the real world, across varied object dynamics, categories, and instances.
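The two-phase structure can be illustrated with a deliberately tiny stand-in: here the privileged dynamics embedding z is just a scalar mass, the "policy" is a one-line function of it, and the adaptation module recovers z from non-privileged (state, action) history. This is a toy illustration of the RMA recipe, not RAPiD's actual networks or particle encoder:

```python
# Phase 1 (simulation, privileged): the policy conditions on a dynamics
# embedding z -- here simply the object's mass.
def policy(state, z_mass):
    return state / z_mass  # toy rule: lighter objects receive larger actions

# Phase 2 (deployment): an adaptation module infers z from observable
# state-action history instead of privileged simulator internals.
def infer_z(history):
    return sum(s / a for s, a in history) / len(history)

true_mass = 2.5
history = [(s, policy(s, true_mass)) for s in (0.4, 0.9, 1.3, 2.0)]
z_hat = infer_z(history)  # recovers the privileged embedding from observations
```

In the real method both the encoder and the adaptation module are learned, and z additionally summarizes ground-truth particle positions so that shape change is captured; the point of the sketch is only that phase 2 is supervised to reproduce phase 1's embedding from non-privileged signals.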
cs.RO / 6 / 2603.18260
Manufacturing Micro-Patterned Surfaces with Multi-Robot Systems
Abstract
Applying micro-patterns to surfaces has been shown to impart useful physical properties such as drag reduction and hydrophobicity. However, current manufacturing techniques cannot produce micro-patterned surfaces at scale due to high-cost machinery and inefficient coverage techniques such as raster-scanning. In this work, we use multiple robots, each equipped with a patterning tool, to manufacture these surfaces. To allow these robots to coordinate during the patterning task, we use the ergodic control algorithm, which specifies coverage objectives using distributions. We demonstrate, in both simulation and experimental trials, that the robots can divide complicated coverage objectives by communicating compressed representations of their trajectory histories. Further, we show that robot-produced patterning can lower the coefficient of friction of metallic surfaces. This work demonstrates that distributed multi-robot systems can coordinate to manufacture products that were previously unrealizable at scale.
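The coverage-as-distribution idea can be sketched in one dimension: two simulated robots share a compressed record of where they have been (a binned visit histogram) and greedily move toward the region with the largest remaining deficit relative to the target density. Real ergodic control uses a spectral metric rather than this greedy heuristic, and the target weights and bin count below are made up:

```python
def coverage_deficit(target, counts):
    """Per-region gap between desired density and realized visit frequency."""
    total = sum(counts) or 1
    return [t - c / total for t, c in zip(target, counts)]

target = [0.1, 0.4, 0.1, 0.3, 0.1]   # desired pattern density per region
counts = [0] * len(target)           # shared trajectory history (communicated)
positions = [0, 4]                   # two robots starting at opposite ends

for _ in range(100):
    for i, p in enumerate(positions):
        counts[p] += 1
        deficit = coverage_deficit(target, counts)
        # step to the adjacent region (or stay) with the largest deficit
        options = [q for q in (p - 1, p, p + 1) if 0 <= q < len(target)]
        positions[i] = max(options, key=lambda q: deficit[q])
```

Because both robots update the same histogram, they implicitly divide the objective between themselves: visit frequencies drift toward the target weights without any explicit partitioning of regions.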
cs.RO / 7 / 2603.18271
SG-CoT: An Ambiguity-Aware Robotic Planning Framework using Scene Graph Representations
Abstract
Ambiguity poses a major challenge to large language models (LLMs) used as robotic planners. In this letter, we present Scene Graph-Chain-of-Thought (SG-CoT), a two-stage framework where LLMs iteratively query a scene graph representation of the environment to detect and clarify ambiguities. First, a structured scene graph representation of the environment is constructed from input observations, capturing objects, their attributes, and relationships with other objects. Second, the LLM is equipped with retrieval functions to query portions of the scene graph that are relevant to the provided instruction. This grounds the reasoning process of the LLM in the observation, increasing the reliability of robotic planners under ambiguous situations. SG-CoT also allows the LLM to identify the source of ambiguity and pose a relevant disambiguation question to the user or another robot. Extensive experimentation demonstrates that SG-CoT consistently outperforms prior methods, with a minimum of 10% improvement in question accuracy and a minimum success rate increase of 4% in single-agent and 15% in multi-agent environments, validating its effectiveness for more generalizable robot planning.
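The retrieval-then-clarify loop can be sketched with a hypothetical dictionary-backed scene graph and two query functions of the kind the LLM would be given as tools. The object names, attributes, and question wording below are invented for illustration:

```python
# hypothetical scene graph: objects with attributes and spatial relations
scene = {
    "cup_1":   {"color": "red",  "on": "table_1"},
    "cup_2":   {"color": "blue", "on": "shelf_1"},
    "table_1": {},
    "shelf_1": {},
}

def query_objects(name_contains):
    """Retrieval tool: objects whose identifier matches the instruction."""
    return [o for o in scene if name_contains in o]

def query_attributes(obj):
    """Retrieval tool: attribute/relation dict for one object."""
    return scene.get(obj, {})

# "pick up the cup" grounds to two candidates, so the ambiguity source is
# identified and turned into a disambiguation question
matches = query_objects("cup")
question = None
if len(matches) > 1:
    colors = [query_attributes(m)["color"] for m in matches]
    question = f"Which cup do you mean, the {colors[0]} one or the {colors[1]} one?"
```

Grounding each reasoning step in retrieved graph fragments, rather than a free-form scene description, is what lets the planner both detect that the instruction is underspecified and name the distinguishing attribute in its question.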
cs.RO / 8 / 2603.18284
Offload or Overload: A Platform Measurement Study of Mobile Robotic Manipulation Workloads
Abstract
Mobile robotic manipulation--the ability of robots to navigate spaces and interact with objects--is a core capability of physical AI. Foundation models have led to breakthroughs in their performance, but at a significant computational cost. We present the first measurement study of mobile robotic manipulation workloads across onboard, edge, and cloud GPU platforms. We find that the full workload stack is infeasible to run on smaller onboard GPUs, while larger onboard GPUs drain robot batteries several hours faster. Offloading alleviates these constraints but introduces its own challenges, as additional network latency degrades task accuracy, and the bandwidth requirement makes naive cloud offloading impractical. Finally, we quantify opportunities and pitfalls of sharing compute across robot fleets. We believe our measurement study will be crucial to designing inference systems for mobile robots.
cs.RO / 9 / 2603.18298
Sparse3DTrack: Monocular 3D Object Tracking Using Sparse Supervision
Abstract
Monocular 3D object tracking aims to estimate temporally consistent 3D object poses across video frames, enabling autonomous agents to reason about scene dynamics. However, existing state-of-the-art approaches are fully supervised and rely on dense 3D annotations over long video sequences, which are expensive to obtain and difficult to scale. In this work, we address this fundamental limitation by proposing the first sparsely supervised framework for monocular 3D object tracking. Our approach decomposes the task into two sequential sub-problems: 2D query matching and 3D geometry estimation. Both components leverage the spatio-temporal consistency of image sequences to augment a sparse set of labeled samples and learn rich 2D and 3D representations of the scene. Leveraging these learned cues, our model automatically generates high-quality 3D pseudolabels across entire videos, effectively transforming sparse supervision into dense 3D track annotations. This enables existing fully-supervised trackers to effectively operate under extreme label sparsity. Extensive experiments on the KITTI and nuScenes datasets demonstrate that our method significantly improves tracking performance, achieving an improvement of up to 15.50 p.p. while using at most four ground truth annotations per track.
cs.RO / 10 / 2603.18308
Proprioceptive-only State Estimation for Legged Robots with Set-Coverage Measurements of Learned Dynamics
Abstract
Proprioceptive-only state estimation is attractive for legged robots since it is computationally cheaper and is unaffected by perceptually degraded conditions. The history of joint-level measurements contains rich information that can be used to infer the dynamics of the system and subsequently produce navigational measurements. Recent approaches produce these estimates with learned measurement models and fuse them with IMU data under a Gaussian noise assumption. However, this assumption can easily break down with limited training data, rendering the estimates inconsistent and potentially divergent. In this work, we propose a proprioceptive-only state estimation framework for legged robots that characterizes the measurement noise using set-coverage statements that do not assume any distribution. We develop a practical and computationally inexpensive method to use these set-coverage measurements with a Gaussian filter in a systematic way. We validate the approach in simulation and on two real-world quadrupedal datasets. Comparison with the Gaussian baselines shows that our proposed method remains consistent and is not prone to drift under real noise scenarios.
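One way to picture fusing a distribution-free, set-coverage measurement into a Gaussian filter is to treat the interval conservatively inside a standard 1-D Kalman update: interval midpoint as the measurement, half-width as its standard deviation. This is a simplifying heuristic for illustration, not the systematic method the paper develops:

```python
def kalman_update_interval(mean, var, z_lo, z_hi):
    """1-D Kalman update fed by a set-coverage measurement [z_lo, z_hi].
    Heuristic: midpoint as z, squared half-width as measurement noise R."""
    z = (z_lo + z_hi) / 2.0
    r = ((z_hi - z_lo) / 2.0) ** 2
    k = var / (var + r)                 # Kalman gain
    return mean + k * (z - mean), (1.0 - k) * var

# prior belief N(0, 1); the learned-dynamics module asserts, without any
# distributional claim, that the velocity lies in [0.8, 1.2]
post_mean, post_var = kalman_update_interval(0.0, 1.0, 0.8, 1.2)
```

A tight interval yields a near-unit gain and pulls the estimate strongly toward the measurement, while a wide interval barely moves it; preserving that qualitative behavior without the Gaussian assumption is what the paper's systematic treatment provides.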
cs.RO / 11 / 2603.18315
DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving
Abstract
Ensuring safe decision-making in autonomous vehicles remains a fundamental challenge despite rapid advances in end-to-end learning approaches. Traditional reinforcement learning (RL) methods rely on manually engineered rewards or sparse collision signals, which fail to capture the rich contextual understanding required for safe driving and make unsafe exploration unavoidable in real-world settings. Recent vision-language models (VLMs) offer promising semantic understanding capabilities; however, their high inference latency and susceptibility to hallucination hinder direct application to real-time vehicle control. To address these limitations, this paper proposes DriveVLM-RL, a neuroscience-inspired framework that integrates VLMs into RL through a dual-pathway architecture for safe and deployable autonomous driving. The framework decomposes semantic reward learning into a Static Pathway for continuous spatial safety assessment using CLIP-based contrasting language goals, and a Dynamic Pathway for attention-gated multi-frame semantic risk reasoning using a lightweight detector and a large VLM. A hierarchical reward synthesis mechanism fuses semantic signals with vehicle states, while an asynchronous training pipeline decouples expensive VLM inference from environment interaction. All VLM components are used only during offline training and are removed at deployment, ensuring real-time feasibility. Experiments in the CARLA simulator show significant improvements in collision avoidance, task success, and generalization across diverse traffic scenarios, including strong robustness under settings without explicit collision penalties. These results demonstrate that DriveVLM-RL provides a practical paradigm for integrating foundation models into autonomous driving without compromising real-time feasibility. Demo video and code are available at: https://zilin-huang.github.io/DriveVLM-RL-website/
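The Static Pathway's contrasting-language-goal idea reduces to comparing an image embedding against a "safe" and an "unsafe" text embedding. The sketch below fakes the embeddings with plain 2-D vectors (no CLIP call); the vector values and prompt phrasings are assumptions:

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def static_pathway_reward(img_emb, safe_emb, unsafe_emb):
    """Reward = similarity to the safe goal minus similarity to the unsafe goal."""
    return cosine(img_emb, safe_emb) - cosine(img_emb, unsafe_emb)

# stand-in embeddings for prompts like "clear road ahead" vs. "imminent collision"
safe_emb, unsafe_emb = (1.0, 0.0), (0.0, 1.0)
r_clear = static_pathway_reward((0.9, 0.1), safe_emb, unsafe_emb)   # positive
r_risky = static_pathway_reward((0.2, 0.8), safe_emb, unsafe_emb)   # negative
```

During training this scalar would be fused with vehicle-state terms by the hierarchical reward synthesis; at deployment the whole pathway is removed, which is why the scheme stays real-time feasible.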
cs.RO / 12 / 2603.18336
ManiDreams: An Open-Source Library for Robust Object Manipulation via Uncertainty-aware Task-specific Intuitive Physics
Abstract
Dynamics models, whether simulators or learned world models, have long been central to robotic manipulation, but most focus on minimizing prediction error rather than confronting a more fundamental challenge: real-world manipulation is inherently uncertain. We argue that robust manipulation under uncertainty is fundamentally an integration problem: uncertainties must be represented, propagated, and constrained within the planning loop, not merely suppressed during training. We present and open-source ManiDreams, a modular framework for uncertainty-aware manipulation planning over intuitive physics models. It realizes this integration through composable abstractions for distributional state representation, backend-agnostic dynamics prediction, and declarative constraint specification for action optimization. The framework explicitly addresses three sources of uncertainty: perceptual, parametric, and structural. It wraps any base policy with a sample-predict-constrain loop that evaluates candidate actions against distributional outcomes, adding robustness without retraining. Experiments on ManiSkill tasks show that ManiDreams maintains robust performance under various perturbations where the RL baseline degrades significantly. Runnable examples on pushing, picking, catching, and real-world deployment demonstrate flexibility across different policies, optimizers, physics backends, and executors. The framework is publicly available at https://github.com/Rice-RobotPI-Lab/ManiDreams
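The sample-predict-constrain wrapper can be sketched as follows: perturb a base policy's action, push every candidate through sampled dynamics (parametric uncertainty modeled as a random friction scale here), and pick the candidate with the smallest worst-case constraint violation. The function names, toy dynamics, and all numbers are illustrative, not the ManiDreams API:

```python
import random

def robustify(base_action, dynamics, constraints, n_samples=64,
              action_noise=0.4, n_rollouts=16, seed=0):
    """Sample-predict-constrain loop wrapped around a base action."""
    rng = random.Random(seed)
    candidates = [base_action + rng.gauss(0.0, action_noise)
                  for _ in range(n_samples)]

    def worst_violation(a):
        # propagate parametric uncertainty (random friction) through dynamics
        outcomes = [dynamics(a, rng.gauss(1.0, 0.1)) for _ in range(n_rollouts)]
        return max(max(0.0, g(o)) for o in outcomes for g in constraints)

    return min(candidates, key=worst_violation)

# toy pushing task: displacement = action * friction; want it inside [0.9, 1.1]
dynamics = lambda a, friction: a * friction
constraints = [lambda o: o - 1.1, lambda o: 0.9 - o]   # g(o) <= 0 means satisfied
base = 1.5                                             # base policy overshoots
robust = robustify(base, dynamics, constraints)
```

Note that the base policy itself is untouched (no retraining): robustness comes purely from evaluating candidate actions against distributional outcomes before execution, which mirrors the framework's declared constraints over distributional states.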
cs.RO / 13 / 2603.18342
Shifting Uncertainty to Critical Moments: Towards Reliable Uncertainty Quantification for VLA Model
Abstract
Vision-Language-Action (VLA) models enable general-purpose robotic policies by mapping visual observations and language instructions to low-level actions, but they often lack reliable introspection. A common practice is to compute a token-level uncertainty signal and take its mean over a rollout. However, mean aggregation can dilute short-lived but safety-critical uncertainty spikes in continuous control. In particular, successful rollouts may contain localized high-entropy segments due to benign noise or non-critical micro-adjustments, while failure rollouts can appear low-entropy for most timesteps and only exhibit brief spikes near the onset of failure. We propose a unified uncertainty quantification approach for predicting rollout success versus failure that (1) uses max-based sliding window pooling to preserve transient risk signals, (2) applies motion-aware stability weighting to emphasize high-frequency action oscillations associated with unstable behaviors, and (3) performs DoF-adaptive calibration via Bayesian Optimization to prioritize kinematically critical axes. Experiments on the LIBERO benchmark show that our method substantially improves failure prediction accuracy and yields more reliable signals for failure detection, which can support downstream human-in-the-loop interventions.
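The dilution problem and the windowed fix can be shown in a few lines. The pooling below (mean inside each sliding window, max across windows) is a simplified stand-in for the paper's max-based sliding-window pooling; the window length and synthetic entropy traces are invented:

```python
def mean_score(u):
    """Baseline: mean aggregation over the whole rollout."""
    return sum(u) / len(u)

def window_pooled_score(u, window=5):
    """Mean inside each sliding window, max across windows: a brief but
    sustained uncertainty spike dominates instead of being averaged away."""
    if len(u) <= window:
        return mean_score(u)
    means = [sum(u[i:i + window]) / window for i in range(len(u) - window + 1)]
    return max(means)

# failure-like rollout: low token entropy except a short spike at failure onset
failure = [0.1] * 95 + [0.9] * 5
# success-like rollout: uniformly moderate entropy, no spikes
success = [0.3] * 100
```

Plain means rank these traces the wrong way round (0.14 for the failure vs. 0.30 for the success), while the windowed score recovers the ordering (0.90 vs. 0.30), which is exactly the failure mode the aggregation scheme targets.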
cs.RO / 14 / 2603.18344
HRI-SA: A Multimodal Dataset for Online Assessment of Human Situational Awareness during Remote Human-Robot Teaming
Abstract
Maintaining situational awareness (SA) is critical in human-robot teams. Yet, under high workload and dynamic conditions, operators often experience SA gaps. Automated detection of SA gaps could provide timely assistance for operators. However, conventional SA measures either disrupt task flow or cannot capture real-time fluctuations, limiting their operational utility. To the best of our knowledge, no publicly available dataset currently supports the systematic evaluation of online human SA assessment in human-robot teaming. To advance the development of online SA assessment tools, we introduce HRI-SA, a multimodal dataset from 30 participants in a realistic search-and-rescue human-robot teaming context, incorporating eye movements, pupil diameter, biosignals, user interactions, and robot data. The experimental protocol included predefined events requiring timely operator assistance, with ground-truth SA latencies of two types (perceptual and comprehension) obtained systematically by measuring the time between the onset of an assistance need and its resolution. We illustrate the utility of the dataset by evaluating standard machine learning models that detect perceptual SA latencies from generic eye-tracking features and contextual features. Results show that eye-tracking features alone effectively classified perceptual SA latency (recall=88.91%, F1=67.63%) under leave-one-group-out cross-validation, with performance improved through contextual data fusion (recall=91.51%, F1=80.38%). This paper contributes the first public dataset supporting the systematic evaluation of SA throughout a human-robot teaming mission, while also demonstrating the potential of generic eye-tracking features for continuous perceptual SA latency detection in remote human-robot teaming.
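The leave-one-group-out cross-validation behind the reported recall/F1 numbers holds out all data from one participant per fold, so person-specific eye-movement patterns cannot leak between train and test. A minimal split generator (stdlib only, with toy group labels):

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) with each participant's
    samples held out exactly once."""
    for g in sorted(set(groups)):
        train = [i for i, gi in enumerate(groups) if gi != g]
        test = [i for i, gi in enumerate(groups) if gi == g]
        yield g, train, test

# toy: six samples recorded from three participants
groups = ["p1", "p1", "p2", "p2", "p3", "p3"]
splits = list(leave_one_group_out(groups))
```

Each fold's classifier would then be fit on the train indices' eye-tracking (and optionally contextual) features and scored on the held-out participant, simulating deployment to an operator the model has never seen.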
cs.RO / 15 / 2603.18354
Multi-material Direct Ink Writing and Embroidery for Stretchable Wearable Sensors
Abstract
The development of wearable sensing systems for sports performance tracking, rehabilitation, and injury prevention has driven growing demand for smart garments that combine comfort, durability, and accurate motion detection. This paper presents a textile-compatible fabrication workflow that integrates multi-material direct ink writing with automated embroidery to create stretchable strain sensors directly embedded into garments. The process combines sequential multi-material printing of a silicone-carbon grease-silicone stack with automated embroidery that provides both mechanical fixation and electrical interfacing in a single step. The resulting hybrid sensor demonstrates stretchability up to 120% strain while maintaining electrical continuity, with approximately linear behaviour up to 60% strain (R^2 = 0.99), a gauge factor of 31.4, and hysteresis of 22.9%. Repeated loading-unloading tests over 80 cycles show baseline and peak drift of 0.135% and 0.236% per cycle, respectively, indicating moderate cycle-to-cycle stability. Mechanical testing further confirms that the silicone-fabric interface remains intact under large deformation, with failure occurring in the textile rather than at the stitched boundary. As a preliminary proof of concept, the sensor was integrated into wearable elbow and knee sleeves for joint angle monitoring, showing a clear correlation between normalised resistance change and bending angle. By addressing both mechanical fixation and electrical interfacing through embroidery-based integration, this approach provides a reproducible and scalable pathway for incorporating printed stretchable electronics into textile systems for motion capture and soft robotic applications.
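The reported sensor characteristics tie together through the gauge-factor definition GF = (ΔR/R0)/ε. A quick sanity check with the paper's GF of 31.4; the 1 kΩ baseline resistance is a made-up example value:

```python
def gauge_factor(r0, r, strain):
    """GF = (delta_R / R0) / strain."""
    return ((r - r0) / r0) / strain

GF = 31.4          # reported gauge factor
R0 = 1000.0        # ohm, hypothetical unstrained resistance
strain = 0.10      # 10% strain, well inside the ~60% linear region

# resistance under the linear model R = R0 * (1 + GF * strain)
R = R0 * (1 + GF * strain)
```

A GF of 31.4 means a 10% stretch roughly quadruples the resistance (to 4140 Ω here), which is why a sensor like this can resolve joint angles from modest elbow or knee bending.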
cs.RO / 16 / 2603.18370
Contact Status Recognition and Slip Detection with a Bio-inspired Tactile Hand
基于生物启发的触觉手的接触状态识别与滑移检测
Abstract
Stable and reliable grasping is critical to robotic manipulation, especially for fragile and glazed objects, where the grasp force requires precise control: too large a force may damage the object, while too small a force leads to slip and fall-off. Although the object to be manipulated is usually assumed to be grasped firmly in advance, slip detection and timely prevention are necessary for a robot operating in unstructured, general environments. In this work, we addressed this issue by utilizing multimodal tactile feedback from a five-fingered bio-inspired hand. Motivated by human hands, the tactile sensing elements were distributed and embedded into the soft skin of the robotic hand, forming 24 tactile channels in total. Different from the threshold methods widely employed in most existing works, we first converted the slip detection problem into contact status recognition in combination with a binning technique, and then detected the slip onset time according to the recognition results. After the 24-channel tactile signals passed through a discrete wavelet transform, 17 features were extracted from different time and frequency bands. With the optimal 120 features employed for status recognition, the test accuracy reached 96.39% across three different sliding speeds and six kinds of materials. When applied to four new, unseen materials, a high accuracy of 91.95% was still achieved, further validating the generalization of the proposed method. Finally, the performance of slip detection is verified based on the trained contact status recognition model.
Chinese Translation
稳定可靠的抓取对于机器人操作尤其是处理脆弱和光滑物体至关重要,因为抓取力需要精确控制,过大的力可能会损坏物体,而过小的力则会导致滑动和掉落。尽管假设被操作的物体在事先被牢固抓取,但在非结构化和通用环境中,滑移检测和及时预防对于机器人来说是必要的。在本研究中,我们通过利用来自五指生物启发手的多模态触觉反馈来解决这一问题。受到人手的启发,触觉传感元件被分布并嵌入到机器人手的柔软皮肤中,总共形成24个触觉通道。与大多数现有研究中广泛采用的阈值方法不同,我们首先将滑移检测问题转换为接触状态识别,并结合分箱技术,然后根据识别结果检测滑移发生时间。在24通道触觉信号经过离散小波变换后,从不同的时间和频率带中提取了17个特征。使用最优的120个特征进行状态识别,测试准确率在三种不同滑动速度和六种材料中达到了96.39%。在应用于四种新的未见材料时,仍然达到了91.95%的高准确率,进一步验证了我们提出方法的泛化能力。最后,基于接触状态识别的训练模型验证了滑移检测的性能。
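The feature-extraction step, a DWT followed by per-band statistics, can be sketched with a single-level Haar transform standing in for the paper's full wavelet decomposition; the five statistics per band below are illustrative placeholders for the 17 features used in the paper:

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform (approx, detail)."""
    s = signal[: len(signal) // 2 * 2].reshape(-1, 2)
    approx = (s[:, 0] + s[:, 1]) / np.sqrt(2.0)
    detail = (s[:, 0] - s[:, 1]) / np.sqrt(2.0)
    return approx, detail

def band_features(band):
    """A few of the time/frequency-band statistics one might extract."""
    return np.array([band.mean(), band.std(), np.abs(band).max(),
                     np.sum(band ** 2), np.median(np.abs(band))])

# 24 synthetic tactile channels, 256 samples each
rng = np.random.default_rng(1)
signals = rng.normal(size=(24, 256))
features = np.concatenate([np.concatenate([band_features(b)
                           for b in haar_dwt(ch)]) for ch in signals])
```

With 24 channels, two bands per channel, and five statistics per band, this toy pipeline yields a 240-dimensional feature vector per window, from which an optimal subset would then be selected for the classifier.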
cs.RO / 17 / 2603.18400
Graph-of-Constraints Model Predictive Control for Reactive Multi-agent Task and Motion Planning
约束图模型预测控制用于反应式多智能体任务与运动规划
Abstract
Sequences of interdependent geometric constraints are central to many multi-agent Task and Motion Planning (TAMP) problems. However, existing methods for handling such constraint sequences struggle with partially ordered tasks and dynamic agent assignments. They typically assume static assignments and cannot adapt when disturbances alter task allocations. To overcome these limitations, we introduce Graph-of-Constraints Model Predictive Control (GoC-MPC), a generalized sequence-of-constraints framework integrated with MPC. GoC-MPC naturally supports partially ordered tasks, dynamic agent coordination, and disturbance recovery. By defining constraints over tracked 3D keypoints, our method robustly solves diverse multi-agent manipulation tasks, coordinating agents and adapting online from visual observations alone, without relying on training data or environment models. Experiments demonstrate that GoC-MPC achieves higher success rates, significantly faster TAMP computation, and shorter overall paths compared to recent baselines, establishing it as an efficient and robust solution for multi-agent manipulation under real-world disturbances. Our supplementary video and code can be found at https://sites.google.com/view/goc-mpc/home .
Chinese Translation
相互依赖的几何约束序列在许多多智能体任务与运动规划(TAMP)问题中至关重要。然而,现有处理此类约束序列的方法在面对部分有序任务和动态智能体分配时表现不佳。它们通常假设静态分配,并且无法在干扰改变任务分配时进行适应。为克服这些局限性,我们提出了约束图模型预测控制(Graph-of-Constraints Model Predictive Control,GoC-MPC),这是一个与模型预测控制(MPC)相结合的广义约束序列框架。GoC-MPC 自然支持部分有序任务、动态智能体协调和干扰恢复。通过对跟踪的三维关键点定义约束,我们的方法能够稳健地解决各种多智能体操控任务——仅依靠视觉观察进行智能体协调和在线适应,而无需依赖训练数据或环境模型。实验表明,与最近的基线相比,GoC-MPC 实现了更高的成功率、显著更快的 TAMP 计算速度和更短的整体路径,确立了其作为应对现实世界干扰的多智能体操控的高效且稳健的解决方案。我们的补充视频和代码可在 https://sites.google.com/view/goc-mpc/home 找到。
cs.RO / 18 / 2603.18408
Efficient and Versatile Quadrupedal Skating: Optimal Co-design via Reinforcement Learning and Bayesian Optimization
高效多功能的四足滑行:通过强化学习和贝叶斯优化的最佳联合设计
Abstract
In this paper, we present a hardware-control co-design approach that enables efficient and versatile roller skating on quadrupedal robots equipped with passive wheels. Passive-wheel skating reduces leg inertia and improves energy efficiency, particularly at high speeds. However, the absence of direct wheel actuation tightly couples mechanical design and control. To unlock the full potential of this modality, we formulate a bilevel optimization framework: an upper-level Bayesian Optimization searches the mechanical design space, while a lower-level Reinforcement Learning trains a motor control policy for each candidate design. The resulting design-policy pairs not only outperform human-engineered baselines, but also exhibit versatile behaviors such as hockey stop (rapid braking by turning sideways to maximize friction) and self-aligning motion (automatic reorientation to improve energy efficiency in the direction of travel), offering the first system-level study of dynamic skating motion on quadrupedal robots.
Chinese Translation
本文提出了一种硬件与控制协同设计的方法,该方法使得配备被动轮的四足机器人能够高效且多功能地进行滑行。被动轮滑行不仅减小了腿部惯性,而且在高速下提升了能量效率。然而,缺乏直接的轮子驱动使得机械设计和控制紧密耦合。为了充分挖掘这种模式的潜力,我们构建了一个双层优化框架:上层的贝叶斯优化搜索机械设计空间,而下层的强化学习则为每个候选设计训练出电机控制策略。由此得到的设计-策略对不仅优于人类工程师设计的基准,而且展现出了多种灵活的行为,如冰球急停(通过侧向转弯来快速制动以最大化摩擦)和自我对齐运动(自动重新定向以提高前进方向上的能量效率),提供了对四足机器人动态滑行运动的首次系统级研究。
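The bilevel structure can be sketched generically: an outer loop proposes mechanical designs, and an inner loop trains (here, merely scores) a policy for each candidate. Plain random search stands in for Bayesian Optimization, a toy analytic landscape stands in for the RL return, and the design variables are hypothetical:

```python
import random

def train_policy(design):
    """Stand-in for the lower-level RL stage: returns the best locomotion
    return achievable with this design (toy quadratic landscape)."""
    wheel_offset, leg_length = design          # hypothetical design variables
    return -((wheel_offset - 0.3) ** 2 + (leg_length - 0.25) ** 2)

def propose_design(history):
    """Stand-in for the upper-level Bayesian Optimization acquisition step;
    random search over the design box keeps the sketch self-contained."""
    return (random.uniform(0.0, 1.0), random.uniform(0.1, 0.4))

random.seed(0)
history = []
for _ in range(50):                            # co-design iterations
    design = propose_design(history)
    score = train_policy(design)               # one full RL run per candidate
    history.append((score, design))

best_score, best_design = max(history)
```

In the real framework the inner call is a full RL training run, which is why a sample-efficient outer optimizer such as BO (rather than random search) matters.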
cs.RO / 19 / 2603.18494
MemoAct: Atkinson-Shiffrin-Inspired Memory-Augmented Visuomotor Policy for Robotic Manipulation
MemoAct:基于阿特金森-希弗林模型的记忆增强视觉运动策略用于机器人操控
Abstract
Memory-augmented robotic policies are essential in handling memory-dependent tasks. However, existing approaches typically rely on simple observation window extensions, struggling to simultaneously achieve precise task state tracking and robust long-horizon retention. To overcome these challenges, inspired by the Atkinson-Shiffrin memory model, we propose MemoAct, a hierarchical memory-based policy that leverages distinct memory tiers to tackle specific bottlenecks. Specifically, lossless short-term memory ensures precise task state tracking, while compressed long-term memory enables robust long-horizon retention. To enrich the evaluation landscape, we construct MemoryRTBench based on RoboTwin 2.0, specifically tailored to assess policy capabilities in task state tracking and long-horizon retention. Extensive experiments across simulated and real-world scenarios demonstrate that MemoAct achieves superior performance compared to both existing Markovian baselines and history-aware policies. The project page is available at https://tlf-tlf.github.io/MemoActPage/ .
Chinese Translation
记忆增强的机器人策略在处理依赖记忆的任务中至关重要。然而,现有方法通常依赖于简单的观察窗口扩展,难以同时实现精确的任务状态跟踪和稳健的长时间保持。为了解决这些挑战,我们提出了MemoAct,一种受阿特金森-希弗林记忆模型启发的基于分层记忆的策略,利用不同的记忆层次来应对特定的瓶颈。具体而言,无损的短期记忆确保了精确的任务状态跟踪,而压缩的长期记忆则实现了稳健的长时间保持。为了丰富评估环境,我们基于RoboTwin 2.0构建了MemoryRTBench,专门用于评估策略在任务状态跟踪和长时间保持方面的能力。广泛的模拟和现实场景实验表明,MemoAct在性能上优于现有的马尔可夫基线和历史感知策略。项目页面可在 https://tlf-tlf.github.io/MemoActPage/ 访问。
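The two-tier idea, a lossless short-term window plus compressed long-term storage, can be sketched as follows; the mean-pooled chunk compression and the window sizes are illustrative choices, not MemoAct's actual architecture:

```python
from collections import deque

class HierarchicalMemory:
    """Toy two-tier memory: a lossless short-term window plus a compressed
    long-term store (mean-pooled chunks). Sizes here are arbitrary."""

    def __init__(self, short_len=8, chunk=8):
        self.short = deque(maxlen=short_len)   # exact recent observations
        self.chunk = chunk
        self.buf = []
        self.long = []                         # lossy summaries of older history

    def add(self, obs):
        self.short.append(obs)
        self.buf.append(obs)
        if len(self.buf) == self.chunk:
            self.long.append(sum(self.buf) / self.chunk)  # compress the chunk
            self.buf = []

    def context(self):
        """What a policy would condition on: summaries first, then raw window."""
        return list(self.long) + list(self.short)
```

The short tier preserves exact recent state for precise tracking, while the long tier grows only logarithmically slowly relative to episode length, which is the retention/precision trade-off the abstract describes.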
cs.RO / 20 / 2603.18520
Robotic Agentic Platform for Intelligent Electric Vehicle Disassembly
智能电动汽车拆解的机器人自主平台
Abstract
Electric vehicles (EV) create an urgent need for scalable battery recycling, yet disassembly of EV battery packs remains largely manual due to high design variability. We present our Robotic Agentic Platform for Intelligent Disassembly (RAPID), designed to investigate perception-driven manipulation, flexible automation, and AI-assisted robot programming in realistic recycling scenarios. The system integrates a gantry-mounted industrial manipulator, RGB-D perception, and an automated nut-running tool for fastener removal on a full-scale EV battery pack. An open-vocabulary object detection pipeline achieves 0.9757 mAP50, enabling reliable identification of screws, nuts, busbars, and other components. We experimentally evaluate (n=204) three one-shot fastener removal strategies: taught-in poses (97% success rate, 24 min duration), one-shot vision execution (57%, 29 min), and visual servoing (83%, 36 min), comparing success rate and disassembly time for the battery's top cover fasteners. To support flexible interaction, we introduce agentic AI specifications for robotic disassembly tasks, allowing LLM agents to translate high-level instructions into robot actions through structured tool interfaces and ROS services. We evaluate SmolAgents with GPT-4o-mini and Qwen 3.5 9B/4B on edge hardware. Tool-based interfaces achieve 100% task completion, while automatic ROS service discovery shows 43.3% failure rates, highlighting the need for structured robot APIs for reliable LLM-driven control. This open-source platform enables systematic investigation of human-robot collaboration, agentic robot programming, and increasingly autonomous disassembly workflows, providing a practical foundation for research toward scalable robotic battery recycling.
Chinese Translation
电动汽车(EV)对可扩展电池回收提出了迫切需求,但由于设计变异性高,电动汽车电池组的拆解仍主要依赖人工。我们提出了机器人自主智能拆解平台(RAPID),旨在研究感知驱动的操作、灵活自动化和人工智能辅助的机器人编程在现实回收场景中的应用。该系统集成了龙门式工业机械手、RGB-D感知和一个自动螺母拧紧工具,用于全尺寸电动汽车电池组上的紧固件拆卸。开放词汇对象检测管道实现了0.9757的mAP50,能够可靠识别螺丝、螺母、母线和其他组件。我们对三种一次性紧固件拆卸策略进行了实验评估(n=204):示教位姿(成功率97%,持续时间24分钟)、一次性视觉执行(57%,29分钟)和视觉伺服(83%,36分钟),比较了电池顶盖紧固件的成功率和拆解时间。为了支持灵活的交互,我们为机器人拆解任务引入了自主人工智能规范,使得大型语言模型(LLM)代理能够通过结构化工具接口和ROS服务将高层指令转换为机器人动作。我们在边缘硬件上评估了SmolAgents与GPT-4o-mini和Qwen 3.5 9B/4B的性能。基于工具的接口实现了100%的任务完成率,而自动ROS服务发现的失败率为43.3%,突显了为可靠的LLM驱动控制提供结构化机器人API的必要性。该开源平台使得人机协作、自主机器人编程和日益自主的拆解工作流程的系统研究成为可能,为可扩展的机器人电池回收研究提供了实用基础。
cs.RO / 21 / 2603.18532
Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds
利用生成3D世界扩展机器人视觉-语言-动作的模拟到现实强化学习
Abstract
The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has motivated similar approaches for fine-tuning vision-language-action (VLA) models in robotics. Many recent works fine-tune VLAs directly in the real world to avoid addressing the sim-to-real gap. While real-world RL circumvents sim-to-real issues, it inherently limits the generality of the resulting VLA, as scaling scene and object diversity in the physical world is prohibitively difficult. This leads to the paradoxical outcome of transforming a broadly pretrained model into an overfitted, scene-specific policy. Training in simulation can instead provide access to diverse scenes, but designing those scenes is also costly. In this work, we show that VLAs can be RL fine-tuned without sacrificing generality and with reduced labor by leveraging 3D world generative models. Using these models together with a language-driven scene designer, we generate hundreds of diverse interactive scenes containing unique objects and backgrounds, enabling scalable and highly parallel policy learning. Starting from a pretrained imitation baseline, our approach increases simulation success from 9.7% to 79.8% while achieving a 1.25× speedup in task completion time. We further demonstrate successful sim-to-real transfer enabled by the quality of the generated digital twins together with domain randomization, improving real-world success from 21.7% to 75% and achieving a 1.13× speedup. Finally, we further highlight the benefits of leveraging the effectively unlimited data from 3D world generative models through an ablation study showing that increasing scene diversity directly improves zero-shot generalization.
Chinese Translation
大型视觉-语言模型(VLMs)在强化学习(RL)下的强大表现激励了类似的方法用于机器人视觉-语言-动作(VLA)模型的微调。许多近期的研究直接在现实世界中微调VLA,以避免处理模拟到现实的差距。虽然现实世界的强化学习规避了模拟到现实的问题,但它在本质上限制了所得到的VLA的普适性,因为在物理世界中扩展场景和物体的多样性是极其困难的。这导致了一个矛盾的结果,即将一个广泛预训练的模型转变为一个过拟合的、特定场景的策略。相反,在模拟中训练可以提供多样化的场景,但设计这些场景同样成本高昂。在本研究中,我们展示了如何利用3D世界生成模型对VLA进行强化学习微调,而不牺牲普适性并减少劳动成本。通过将这些模型与语言驱动的场景设计器结合使用,我们生成了数百个包含独特物体和背景的多样化互动场景,从而实现可扩展且高度并行的策略学习。从一个预训练的模仿基线开始,我们的方法将模拟成功率从9.7%提高到79.8%,同时任务完成时间加速了1.25倍。我们进一步展示了通过生成数字孪生的质量以及领域随机化实现的成功模拟到现实转移,将现实世界的成功率从21.7%提高到75%,并实现了1.13倍的加速。最后,我们通过消融研究进一步强调了利用3D世界生成模型提供的有效无限数据的好处,显示增加场景多样性直接改善了零样本泛化能力。
cs.RO / 22 / 2603.18555
Inductance-Based Force Self-Sensing in Fiber-Reinforced Pneumatic Twisted-and-Coiled Actuators
纤维增强气动扭绞卷绕执行器中基于电感的力自感知
Abstract
Fiber-reinforced pneumatic twisted-and-coiled actuators (FR-PTCAs) offer high power density and compliance but their strong hysteresis and lack of intrinsic proprioception limit effective closed-loop control. This paper presents a self-sensing FR-PTCA integrated with a conductive nickel wire that enables intrinsic force estimation and indirect displacement inference via inductance feedback. Experimental characterization reveals that the actuator exhibits a deterministic, low-hysteresis inductance-force relationship at constant pressures, in contrast to its strongly hysteretic inductance-length behavior. Leveraging this property, this paper develops a parametric self-sensing model and a nonlinear hybrid observer that integrates an Extended Kalman Filter (EKF) with constrained optimization to resolve the ambiguity in the inductance-force mapping and estimate actuator states. Experimental results demonstrate that the proposed approach achieves force estimation accuracy comparable to that of external load cells and maintains robust performance under varying load conditions.
Chinese Translation
纤维增强气动扭绞卷绕执行器(FR-PTCAs)具有高功率密度和良好的柔顺性,但其强烈的滞后性和缺乏内在的本体感知能力限制了有效的闭环控制。本文提出了一种自感知的FR-PTCA,该执行器集成了导电镍线,能够通过电感反馈实现内在力估计和间接位移推断。实验表征表明,在恒定压力下,执行器的电感表现出确定性的、低滞后性的电感-力关系,这与强滞后的电感-长度行为形成对比。利用这一特性,本文开发了一种参数化自感知模型和一种非线性混合观测器,该观测器将扩展卡尔曼滤波器(EKF)与约束优化相结合,以解决电感-力映射中的模糊性并估计执行器状态。实验结果表明,所提出的方法在力估计精度上可与外部负载传感器相媲美,并在不同负载条件下保持稳健的性能。
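The observer's core, an EKF estimating force from inductance readings, reduces to a scalar filter once a calibration L(F) is fixed. The linear calibration, the numeric constants, and the noise levels below are illustrative assumptions; the paper's mapping is nonlinear and pressure-dependent:

```python
import numpy as np

# Hypothetical fixed-pressure calibration: inductance grows with force,
# L(F) = L0 + k*F (linear here only to keep the sketch compact).
L0, k = 4.7e-6, 2.0e-8            # henries, henries per newton (illustrative)

def ekf_force(inductance_meas, q=1e-2, r=1e-16):
    """Scalar EKF: state = contact force (random walk), measurement = inductance."""
    F, P = 0.0, 1.0
    estimates = []
    for z in inductance_meas:
        P = P + q                          # predict
        H = k                              # measurement Jacobian dL/dF
        S = H * P * H + r                  # innovation covariance
        K = P * H / S                      # Kalman gain
        F = F + K * (z - (L0 + k * F))     # update with innovation
        P = (1.0 - K * H) * P
        estimates.append(F)
    return np.array(estimates)

true_force = 12.0                                       # newtons
rng = np.random.default_rng(2)
z = L0 + k * true_force + rng.normal(0.0, 1e-9, 40)     # noisy inductance readings
est = ekf_force(z)
```

For a nonlinear calibration, H would be the local derivative dL/dF at the current estimate, and the paper additionally couples the filter with constrained optimization to resolve mapping ambiguity.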
cs.RO / 23 / 2603.18559
TiBCLaG: A Trigger-induced Bistable Compliant Laparoscopic Grasper
TiBCLaG:一种触发诱导的双稳态柔性腹腔镜抓取器
Abstract
Industrial laparoscopic graspers use multi-link rigid mechanisms manufactured to tight tolerances, resulting in high manufacturing and assembly costs. This work presents the design and proof-of-concept validation of a monolithic, fully compliant, bistable, laparoscopic grasper that eliminates the need for multiple rigid links, thereby reducing part count. The device integrates a compliant trigger and a compliant gripper end-effector, coupled via a control push-rod, to achieve stable grasping without continuous user input. The trigger mechanism is synthesized using a Two-Element Beam Constraint Model as a design framework to control the deformation and stiffness of V-beam-like elements. This technique enables elastic energy storage while preventing snap-through instability. The end-effector is designed as a compliant gripper to achieve adaptive grasping through elastic deformation. Jaws' opening-and-closing performance is demonstrated using nonlinear finite element analysis. The laparoscopic design presented here is fabricated using fused deposition 3D printing. The fabricated prototype demonstrates reliable bistable actuation, confirming the feasibility of such compliant laparoscopic grasper architectures.
Chinese Translation
工业腹腔镜抓取器采用多连杆刚性机制制造,公差要求严格,导致高昂的制造和组装成本。本研究提出了一种单体、完全柔性、双稳态的腹腔镜抓取器的设计及其概念验证,消除了对多个刚性连杆的需求,从而减少了部件数量。该设备集成了一个柔性触发器和一个柔性抓取末端执行器,通过控制推杆连接,实现了稳定抓取而无需持续的用户输入。触发机制采用双元梁约束模型(Two-Element Beam Constraint Model)作为设计框架,以控制V形梁元件的变形和刚度。这一技术能够储存弹性能,同时防止突变不稳定性。末端执行器设计为柔性抓取器,通过弹性变形实现自适应抓取。通过非线性有限元分析展示了夹爪的开合性能。这里展示的腹腔镜设计采用熔融沉积3D打印技术制造。制造的原型展示了可靠的双稳态驱动,确认了这种柔性腹腔镜抓取器架构的可行性。
cs.RO / 24 / 2603.18589
Benchmarking Visual Feature Representations for LiDAR-Inertial-Visual Odometry Under Challenging Conditions
在挑战性条件下对激光雷达-惯性-视觉里程计的视觉特征表示进行基准测试
Abstract
Accurate localization in autonomous driving is critical for successful missions including environmental mapping and survivor searches. In visually challenging environments, including low-light conditions, overexposure, illumination changes, and high parallax, the performance of conventional visual odometry methods significantly degrades, undermining robust robotic navigation. Researchers have recently proposed LiDAR-inertial-visual odometry (LIVO) frameworks that integrate LiDAR, IMU, and camera sensors to address these challenges. This paper extends the FAST-LIVO2-based framework by introducing a hybrid approach that integrates direct photometric methods with descriptor-based feature matching. For the descriptor-based feature matching, this work evaluates pairs of ORB with the Hamming distance, SuperPoint with SuperGlue, SuperPoint with LightGlue, and XFeat with mutual nearest neighbor matching. The proposed configurations are benchmarked by accuracy, computational cost, and feature tracking stability, enabling a quantitative comparison of the adaptability and applicability of visual descriptors. The experimental results reveal that the proposed hybrid approach outperforms the conventional sparse-direct method. Although the sparse-direct method often fails to converge in regions where photometric inconsistency arises due to illumination changes, the proposed approach maintains robust performance under the same conditions. Furthermore, the hybrid approach with learning-based descriptors enables robust and reliable visual state estimation across challenging environments.
Chinese Translation
在自主驾驶中,准确定位对于成功完成环境映射和幸存者搜索等任务至关重要。在视觉挑战性环境中,包括低光照条件、过曝、光照变化和高视差,传统视觉里程计方法的性能显著下降,削弱了机器人的鲁棒导航。研究人员最近提出了激光雷达-惯性-视觉里程计(LiDAR-inertial-visual odometry, LIVO)框架,集成了激光雷达、惯性测量单元(IMU)和相机传感器,以应对这些挑战。本文在基于FAST-LIVO2的框架基础上,提出了一种混合方法,将直接光度方法与基于描述子的特征匹配相结合。对于基于描述子的特征匹配,本文提出了ORB与汉明距离、SuperPoint与SuperGlue、SuperPoint与LightGlue以及XFeat与互近邻的配对。通过准确性、计算成本和特征跟踪稳定性对所提配置进行基准测试,从而实现对视觉描述子的适应性和适用性的定量比较。实验结果表明,所提混合方法优于传统的稀疏直接方法。尽管稀疏直接方法在光照变化导致光度不一致的区域常常无法收敛,但所提方法在相同条件下仍保持了鲁棒性能。此外,采用基于学习的描述子的混合方法在挑战性环境中实现了鲁棒和可靠的视觉状态估计。
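Of the matcher pairs above, ORB with the Hamming distance and XFeat with mutual nearest neighbours are the simplest to illustrate. Below is a numpy-only sketch of mutual-nearest-neighbour matching over binary descriptors; the descriptors are random ORB-sized stand-ins, not real image features:

```python
import numpy as np

def hamming_matrix(a, b):
    """Pairwise Hamming distances between two sets of binary descriptors."""
    xored = a[:, None, :] ^ b[None, :, :]          # XOR over descriptor bytes
    return np.unpackbits(xored, axis=-1).sum(axis=-1)

def mutual_nn_matches(a, b):
    """Keep (i, j) only when i and j are each other's nearest neighbour."""
    d = hamming_matrix(a, b)
    nn_ab = d.argmin(axis=1)
    nn_ba = d.argmin(axis=0)
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

rng = np.random.default_rng(3)
desc_a = rng.integers(0, 256, size=(5, 32), dtype=np.uint8)  # 256-bit, ORB-sized
perm = np.array([2, 0, 4, 1, 3])                             # ground-truth pairing
desc_b = desc_a[perm].copy()
desc_b[:, 0] ^= 1                 # flip one bit: pairs remain nearest neighbours
matches = mutual_nn_matches(desc_a, desc_b)
```

The mutual check is the same cross-check idea exposed by standard brute-force matchers; learned pairs such as SuperPoint+SuperGlue replace both the distance and the assignment step with a network.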
cs.RO / 25 / 2603.18624
REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation
REST:用于零样本目标导航的滚动时域探索斯坦纳树
Abstract
Zero-shot object-goal navigation (ZSON) requires navigating unknown environments to find a target object without task-specific training. Prior hierarchical training-free solutions invest in scene understanding (belief) and high-level decision-making (policy), yet overlook the design of the option, i.e., a subgoal candidate proposed from evolving belief and presented to the policy for selection. In practice, options are reduced to isolated waypoints scored independently: single destinations hide the value gathered along the journey; an unstructured collection obscures the relationships among candidates. Our insight is that the option space should be a tree of paths. Full paths expose en-route information gain that destination-only scoring systematically neglects; a tree of shared segments enables coarse-to-fine LLM reasoning that dismisses or pursues entire branches before examining individual leaves, compressing the combinatorial path space into an efficient hierarchy. We instantiate this insight in REST (Receding Horizon Explorative Steiner Tree), a training-free framework that (1) builds an explicit open-vocabulary 3D map from online RGB-D streams; (2) grows an agent-centric tree of safe and informative paths as the option space via sampling-based planning; and (3) textualizes each branch into a spatial narrative and selects the next-best path through chain-of-thought LLM reasoning. Across the Gibson, HM3D, and HSSD benchmarks, REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency, demonstrating a favorable efficiency-success balance.
Chinese Translation
零样本目标导航(ZSON)要求在未知环境中导航,以找到目标物体,而无需特定任务的训练。之前的分层免训练解决方案专注于场景理解(belief)和高层决策(policy),但忽视了选项(option)的设计,即从不断演变的信念中提出的子目标候选,并呈现给策略进行选择。在实践中,选项被简化为独立评分的孤立航点:单一目的地隐藏了沿途获得的价值;非结构化的集合模糊了候选者之间的关系。我们的见解是,选项空间应该是一棵路径树(tree of paths)。完整路径揭示了目的地评分系统性忽视的途中信息增益;共享段的树结构使得粗到细的LLM推理成为可能,能够在检查单个叶子之前,放弃或追踪整个分支,从而将组合路径空间压缩为高效的层次结构。我们在REST(滚动时域探索斯坦纳树)中实现了这一见解,这是一个免训练框架,(1) 从在线RGB-D流构建显式的开放词汇3D地图;(2) 通过基于采样的规划,生长一个以代理为中心的安全且信息丰富的路径树作为选项空间;(3) 将每个分支文本化为空间叙事,并通过链式思维LLM推理选择下一个最佳路径。在Gibson、HM3D和HSSD基准测试中,REST在成功率方面始终排名前列,同时实现了最佳或第二最佳的路径效率,展示了良好的效率与成功平衡。
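The central claim, that scoring full paths exposes en-route information gain that destination-only scoring misses, can be made concrete with a tiny option tree; all gain values below are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class PathNode:
    """Node in the option tree; `info` is the gain observed on the edge into it."""
    info: float
    children: list = field(default_factory=list)

def leaves(node, acc=0.0):
    """Yield (leaf_info, cumulative_info) for every root-to-leaf path."""
    acc += node.info
    if not node.children:
        yield node.info, acc
    else:
        for child in node.children:
            yield from leaves(child, acc)

# Two branches off a shared trunk: the right leaf looks best in isolation
# (5.0), but the left branch gathers more information along the way (2+4+1).
tree = PathNode(0.0, [
    PathNode(2.0, [PathNode(4.0, [PathNode(1.0)])]),
    PathNode(0.5, [PathNode(5.0)]),
])
dest_best = max(leaf for leaf, _ in leaves(tree))     # destination-only scoring
route_best = max(total for _, total in leaves(tree))  # full-path scoring
```

A destination-only scorer would pick the right branch; cumulative scoring prefers the left one, and the shared-trunk structure is what lets an LLM prune whole branches before inspecting individual leaves.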
cs.RO / 26 / 2603.18669
CSSDF-Net: Safe Motion Planning Based on Neural Implicit Representations of Configuration Space Distance Field
CSSDF-Net:基于神经隐式表示的配置空间距离场的安全运动规划
Abstract
High-dimensional manipulator operation in unstructured environments requires a differentiable, scene-agnostic distance query mechanism to guide safe motion generation. Existing geometric collision checkers are typically non-differentiable, while workspace-based implicit distance models are hindered by the highly nonlinear workspace-to-configuration mapping and often suffer from poor convergence; moreover, self-collision and environment collision are commonly handled as separate constraints. We propose Configuration-Space Signed Distance Field-Net (CSSDF-Net), which learns a continuous signed distance field directly in configuration space to provide joint-space distance and gradient queries under a unified geometric notion of safety. To enable zero-shot generalization without environment-specific retraining, we introduce a spatial-hashing-based data generation pipeline that encodes robot-centric geometric priors and supports efficient retrieval of risk configurations for arbitrary obstacle point sets. The learned distance field is integrated into safety-constrained trajectory optimization and receding-horizon MPC, enabling both offline planning and online reactive avoidance. Experiments on a planar arm and a 7-DoF manipulator demonstrate stable gradients, effective collision avoidance in static and dynamic scenes, and practical inference latency for large-scale point-cloud queries, supporting deployment in previously unseen environments.
Chinese Translation
在非结构化环境中进行高维操控操作需要一种可微分的、与场景无关的距离查询机制,以指导安全的运动生成。现有的几何碰撞检测器通常是不可微分的,而基于工作空间的隐式距离模型则受到高度非线性的工作空间与配置之间映射的限制,且常常面临收敛性差的问题;此外,自碰撞和环境碰撞通常被视为独立约束。我们提出了配置空间符号距离场网络(Configuration-Space Signed Distance Field-Net,CSSDF-Net),该网络直接在配置空间中学习连续的符号距离场,以在统一的几何安全概念下提供关节空间距离和梯度查询。为了实现零样本泛化而无需针对特定环境的重新训练,我们引入了一种基于空间哈希的数据生成管道,该管道编码了以机器人为中心的几何先验,并支持高效检索任意障碍物点集的风险配置。所学习的距离场被集成到安全约束的轨迹优化和滚动时域模型预测控制(MPC)中,实现了离线规划和在线反应避障。在平面臂和7自由度操控器上的实验表明,梯度稳定,能够有效避免静态和动态场景中的碰撞,并且在大规模点云查询中具有实用的推理延迟,支持在之前未见过的环境中的部署。
cs.RO / 27 / 2603.18746
ROFT-VINS: Robust Feature Tracking-based Visual-Inertial State Estimation for Harsh Environment
ROFT-VINS:基于鲁棒特征跟踪的视觉-惯性状态估计方法用于恶劣环境
Abstract
SLAM (Simultaneous Localization and Mapping) and odometry are important systems for estimating the position of mobile platforms, such as robots and cars, using one or more sensors. In camera-based SLAM or odometry in particular, effectively tracking visual features is important, as it significantly impacts system performance. In this paper, we propose a method that leverages deep learning to robustly track visual features in monocular camera images. This method operates reliably even in textureless environments and under rapid lighting changes. Additionally, we evaluate the performance of the proposed method by integrating it into VINS-Fusion (Monocular-Inertial), a commonly used Visual-Inertial Odometry (VIO) system.
Chinese Translation
SLAM(同步定位与地图构建)和里程计是用于估计移动设备(如机器人和汽车)位置的重要系统,利用一个或多个传感器。特别是在基于相机的SLAM或里程计中,有效跟踪视觉特征至关重要,因为这对系统性能有显著影响。本文提出了一种利用深度学习在单目相机图像中鲁棒跟踪视觉特征的方法。该方法即使在无纹理环境和快速光照变化的情况下也能可靠运行。此外,我们通过将所提方法集成到VINS-Fusion(单目-惯性)这一常用的视觉-惯性里程计(VIO)系统中,评估了其性能。
cs.RO / 28 / 2603.18771
Empathetic Motion Generation for Humanoid Educational Robots via Reasoning-Guided Vision-Language-Motion Diffusion Architecture
基于推理引导的视觉-语言-运动扩散架构的类人教育机器人同理运动生成
Abstract
This article proposes a reasoning-guided vision-language-motion diffusion framework (RG-VLMD) for generating instruction-aware co-speech gestures for humanoid robots in educational scenarios. The system integrates multi-modal affective estimation, pedagogical reasoning, and teaching-act-conditioned motion synthesis to enable adaptive and semantically consistent robot behavior. A gated mixture-of-experts model predicts Valence/Arousal from input text, visual, and acoustic features, which are then mapped to discrete teaching-act categories through an affect-driven policy. These signals condition a diffusion-based motion generator using clip-level intent and frame-level instructional schedules via additive latent restriction with auxiliary action-group supervision. Compared to a baseline diffusion model, our proposed method produces more structured and distinctive motion patterns, as verified by motion statistics and pairwise distance analysis. Generated motion sequences remain physically plausible and can be retargeted to a NAO robot for real-time execution. The results reveal that reasoning-guided instructional conditioning improves gesture controllability and pedagogical expressiveness in educational human-robot interaction.
Chinese Translation
本文提出了一种推理引导的视觉-语言-运动扩散框架(RG-VLMD),用于在教育场景中为类人机器人生成与指令相关的共语手势。该系统整合了多模态情感估计、教学推理和基于教学行为的运动合成,以实现自适应和语义一致的机器人行为。一个门控专家混合模型从输入的文本、视觉和声学特征中预测效价/唤醒度(Valence/Arousal),然后通过情感驱动的策略将其映射到离散的教学行为类别。这些信号通过附加的潜在限制和辅助动作组监督,调节基于扩散的运动生成器,使用剪辑级意图和帧级教学计划。与基线扩散模型相比,我们提出的方法生成了更具结构性和独特性的运动模式,这通过运动统计和成对距离分析得到了验证。生成的运动序列在物理上保持合理,并可以重新定向到NAO机器人进行实时执行。结果表明,推理引导的教学条件改善了教育人机交互中的手势可控性和教学表现力。
cs.RO / 29 / 2603.18784
ViTac-Tracing: Visual-Tactile Imitation Learning of Deformable Object Tracing
ViTac-Tracing:可变形物体描迹的视觉-触觉模仿学习
Abstract
Deformable objects often appear in unstructured configurations. Tracing deformable objects helps bring them into extended states and facilitates downstream manipulation tasks. Due to the requirements for object-specific modeling or sim-to-real transfer, existing tracing methods either lack generalizability across different categories of deformable objects or struggle to complete tasks reliably in the real world. To address this, we propose a novel visual-tactile imitation learning method to achieve one-dimensional (1D) and two-dimensional (2D) deformable object tracing with a unified model. Our method is designed from both local and global perspectives based on visual and tactile sensing. Locally, we introduce a weighted loss that emphasizes actions maintaining contact near the center of the tactile image, improving fine-grained adjustment. Globally, we propose a tracing task loss that helps the policy to regulate task progression. On the hardware side, to compensate for the limited features extracted from visual information, we integrate tactile sensing into a low-cost teleoperation system considering both the teleoperator and the robot. Extensive ablation and comparative experiments on diverse 1D and 2D deformable objects demonstrate the effectiveness of our approach, achieving an average success rate of 80% on seen objects and 65% on unseen objects.
Chinese Translation
可变形物体往往呈现非结构化配置。对可变形物体进行描迹有助于将其带入扩展状态,并促进后续的操作任务。由于对物体特定建模或仿真到现实转移的要求,现有描迹方法在不同类别的可变形物体之间缺乏通用性,或在现实世界中难以可靠地完成任务。为了解决这一问题,我们提出了一种新颖的视觉-触觉模仿学习方法,旨在通过统一模型实现一维(1D)和二维(2D)可变形物体的描迹。我们的方法从视觉与触觉感知的局部和全局两方面进行设计。在局部方面,我们引入了一种加权损失,强调保持接触于触觉图像中心附近的动作,从而改善细粒度调整。在全局方面,我们提出了一种描迹任务损失,帮助策略调控任务进程。在硬件方面,为了弥补从视觉信息中提取的有限特征,我们在低成本远程操作系统中整合了触觉感知,同时考虑到远程操作员和机器人之间的协作。针对各种1D和2D可变形物体的广泛消融与对比实验表明我们方法的有效性,在已见物体上实现了80%的平均成功率,在未见物体上实现了65%的成功率。
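The local design choice, weighting action errors by how close the tactile contact point is to the image centre, might take a form like the Gaussian weighting below. The image size, bandwidth, and exact functional form are assumptions for illustration, not the paper's definition:

```python
import numpy as np

def center_weighted_loss(action_err, contact_xy, img_size=16, sigma=4.0):
    """Weight each sample's action error by a Gaussian of its tactile
    contact point's distance from the image centre, then average."""
    c = (img_size - 1) / 2.0
    d2 = ((contact_xy - c) ** 2).sum(axis=-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return float((w * action_err).sum() / w.sum())

errs = np.array([1.0, 3.0])                       # per-sample action errors
contacts = np.array([[7.5, 7.5], [0.0, 0.0]])     # centre vs. corner contact
loss = center_weighted_loss(errs, contacts)       # dominated by the centred sample
```

An unweighted mean of these errors would be 2.0; the weighted loss stays close to the centred sample's error, so gradient updates emphasise actions that keep contact centred.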
cs.RO / 30 / 2603.18804
"You've got a friend in me": Co-Designing a Peer Social Robot for Young Newcomers' Language and Cultural Learning
你身边有个朋友:为年轻新移民的语言和文化学习共同设计的同伴社交机器人
Abstract
Community literacy programs supporting young newcomer children in Canada face limited staffing and scarce one-to-one time, which constrains personalized English and cultural learning support. This paper reports on a co-design study with United for Literacy tutors that informed Maple, a table-top, peer-like Socially Assistive Robot (SAR) designed as a practice partner within tutor-mediated sessions. From shadowing and co-design interviews, we derived newcomer-specific requirements and incorporated them into an integrated prototype that uses short story-based activities, multi-modal scaffolding (speech, facial feedback, gesture), and embedded quizzes that support attention while producing tutor-actionable formative signals. We contribute system design implications for tutor-in-the-loop SARs supporting language socialization in community settings and outline directions for child-centered evaluation in authentic programs.
Chinese Translation
支持加拿大年轻新移民儿童的社区识字项目面临人员不足和一对一时间稀缺的问题,这限制了个性化的英语和文化学习支持。本文报告了一项与United for Literacy辅导员的共同设计研究,该研究为Maple的设计提供了依据。Maple是一种桌面型、类似同伴的社交辅助机器人(Socially Assistive Robot, SAR),旨在作为辅导介导课程中的练习伙伴。通过观察和共同设计访谈,我们提取了新移民特定的需求,并将其整合到一个原型中,该原型使用基于短故事的活动、多模态支架(语音、面部反馈、手势)和嵌入式测验,以支持注意力并生成可供辅导员采取行动的形成性信号。我们为支持社区环境中语言社交化的辅导员介入SAR提供了系统设计的启示,并概述了在真实项目中以儿童为中心的评估方向。
cs.RO / 31 / 2603.18811
V-Dreamer: Automating Robotic Simulation and Trajectory Synthesis via Video Generation Priors
V-Dreamer:通过视频生成先验自动化机器人仿真和轨迹合成
Abstract
Training generalist robots demands large-scale, diverse manipulation data, yet real-world collection is prohibitively expensive, and existing simulators are often constrained by fixed asset libraries and manual heuristics. To bridge this gap, we present V-Dreamer, a fully automated framework that generates open-vocabulary, simulation-ready manipulation environments and executable expert trajectories directly from natural language instructions. V-Dreamer employs a novel generative pipeline that constructs physically grounded 3D scenes using large language models and 3D generative models, validated by geometric constraints to ensure stable, collision-free layouts. Crucially, for behavior synthesis, we leverage video generation models as rich motion priors. These visual predictions are then mapped into executable robot trajectories via a robust Sim-to-Gen visual-kinematic alignment module utilizing CoTracker3 and VGGT. This pipeline supports high visual diversity and physical fidelity without manual intervention. To evaluate the generated data, we train imitation learning policies on synthesized trajectories encompassing diverse object and environment variations. Extensive evaluations on tabletop manipulation tasks using the Piper robotic arm demonstrate that our policies robustly generalize to unseen objects in simulation and achieve effective sim-to-real transfer, successfully manipulating novel real-world objects.
Chinese Translation
训练通用机器人需要大规模、多样化的操作数据,但现实世界的数据收集成本极高,而现有的仿真器往往受到固定资产库和手动启发式方法的限制。为了解决这一问题,我们提出了V-Dreamer,这一完全自动化的框架能够直接从自然语言指令生成开放词汇、适合仿真的操作环境和可执行的专家轨迹。V-Dreamer采用了一种新颖的生成管线,利用大型语言模型和3D生成模型构建物理真实的3D场景,并通过几何约束进行验证,以确保稳定的、无碰撞的布局。至关重要的是,在行为合成中,我们利用视频生成模型作为丰富的运动先验。这些视觉预测随后通过强大的Sim-to-Gen视觉运动对齐模块(使用CoTracker3和VGGT)映射为可执行的机器人轨迹。该管线在无需手动干预的情况下支持高视觉多样性和物理保真性。为了评估生成的数据,我们在合成轨迹上训练模仿学习策略,涵盖了各种物体和环境变化。在使用Piper机器人手臂进行的桌面操作任务上进行的广泛评估表明,我们的策略能够稳健地推广到仿真中的未见物体,并实现了有效的仿真到现实迁移,成功操作新颖的现实世界物体。
cs.RO / 32 / 2603.18861
A Passive Elastic-Folding Mechanism for Stackable Airdrop Sensors
一种用于可堆叠空投传感器的被动弹性折叠机制
Abstract
Air-dispersed sensor networks deployed from aerial robotic systems (e.g., UAVs) provide a low-cost approach to wide-area environmental monitoring. However, existing methods often rely on active actuators for mid-air shape or trajectory control, increasing both power consumption and system cost. Here, we introduce a passive elastic-folding hinge mechanism that transforms sensors from a flat, stackable form into a three-dimensional structure upon release. Hinges are fabricated by laminating commercial sheet materials with rigid printed circuit boards (PCBs) and programming fold angles through a single oven-heating step, enabling scalable production without specialized equipment. Our geometric model links laminate geometry, hinge mechanics, and resulting fold angle, providing a predictive design methodology for target configurations. Laboratory tests confirmed fold angles between 10 degrees and 100 degrees, with a standard deviation of 4 degrees and high repeatability. Field trials further demonstrated reliable data collection and LoRa transmission during dispersion, while the Horizontal Wind Model (HWM)-based trajectory simulations indicated strong potential for wide-area sensing exceeding 10 km.
Chinese Translation
由空中机器人系统(如无人机)播撒部署的传感器网络为广域环境监测提供了一种低成本方法。然而,现有方法通常依赖主动执行器进行空中形状或轨迹控制,增加了功耗和系统成本。在此,我们介绍一种被动弹性折叠铰链机制,可在释放时将传感器从扁平、可堆叠的形态展开为三维结构。铰链通过将商用薄板材料与刚性印刷电路板(PCB)层压制成,并通过单次烘箱加热步骤编程折叠角度,从而无需专用设备即可实现规模化生产。我们的几何模型将层压板几何、铰链力学与最终折叠角度联系起来,为目标构型提供了一种可预测的设计方法。实验室测试确认折叠角度在10度到100度之间,标准差为4度,且重复性高。现场试验进一步证明了在播撒过程中可靠的数据采集和LoRa传输,而基于水平风模型(HWM)的轨迹仿真表明其具有超过10公里广域传感的强大潜力。
cs.RO / 33 / 2603.18910
Safety-Guaranteed Imitation Learning from Nonlinear Model Predictive Control for Spacecraft Close Proximity Operations
基于非线性模型预测控制的安全保障模仿学习框架用于航天器近距离操作
Abstract
This paper presents a safety-guaranteed, runtime-efficient imitation learning framework for spacecraft close proximity control. We leverage Control Barrier Functions (CBFs) for safety certificates and Control Lyapunov Functions (CLFs) for stability as unified design principles across data generation, training, and deployment. First, a nonlinear Model Predictive Control (NMPC) expert enforces CBF constraints to provide safe reference trajectories. Second, we train a neural policy with a novel CBF-CLF-informed loss and DAgger-like rollouts with curriculum weighting, promoting data-efficiency and reducing future safety filter interventions. Third, at deployment a lightweight one-step CBF-CLF quadratic program minimally adjusts the learned control input to satisfy hard safety constraints while encouraging stability. We validate the approach for ESA-compliant close proximity operations, including fly-around with a spherical keep-out zone and final approach inside a conical approach corridor, using the Basilisk high-fidelity simulator with nonlinear dynamics and perturbations. Numerical experiments indicate stable convergence to decision points and strict adherence to safety under the filter, with task performance comparable to the NMPC expert while significantly reducing online computation. A runtime analysis demonstrates real-time feasibility on a commercial off-the-shelf processor, supporting onboard deployment for safety-critical on-orbit servicing.
Chinese Translation
本文提出了一种具有安全保障且运行时高效的模仿学习框架,用于航天器近距离接近控制。我们将用于安全证书的控制屏障函数(Control Barrier Functions, CBFs)和用于稳定性的控制李雅普诺夫函数(Control Lyapunov Functions, CLFs)作为贯穿数据生成、训练和部署的统一设计原则。首先,非线性模型预测控制(Nonlinear Model Predictive Control, NMPC)专家通过强制执行CBF约束来提供安全的参考轨迹。其次,我们使用新颖的CBF-CLF引导的损失函数以及带课程权重的类DAgger展开(rollouts)来训练神经策略,以提高数据效率并减少未来安全滤波器的干预。第三,在部署阶段,一个轻量级的单步CBF-CLF二次规划对学习到的控制输入进行最小幅度的调整,以满足硬性安全约束,同时促进稳定性。我们针对符合欧洲航天局(ESA)要求的近距离操作验证了该方法,包括带球形禁入区的绕飞和锥形接近走廊内的最终接近,实验使用具有非线性动力学和摄动的Basilisk高保真仿真器完成。数值实验表明,该方法能够稳定收敛到决策点,并在安全滤波器作用下严格满足安全约束,其任务性能可与NMPC专家相媲美,同时显著减少在线计算量。运行时分析证明了其在商业现成处理器上的实时可行性,支持在安全关键的在轨服务中进行机载部署。
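The one-step CBF safety filter described above minimally adjusts a learned control to satisfy a barrier constraint. In the scalar case the QP has a closed-form solution, which the following sketch illustrates (scalar single-integrator dynamics and the gain `alpha` are illustrative assumptions, not the paper's spacecraft formulation):

```python
def cbf_filter(u_learned, x, x_min, alpha=1.0):
    """One-step CBF 'QP' for scalar dynamics x_dot = u with h(x) = x - x_min:
        min (u - u_learned)^2   s.t.   h_dot + alpha*h = u + alpha*(x - x_min) >= 0.
    The scalar QP reduces to clipping u from below at the constraint boundary."""
    u_lb = -alpha * (x - x_min)
    return max(u_learned, u_lb)

# A safe command passes through unchanged; an unsafe one is minimally adjusted.
u_safe = cbf_filter(u_learned=0.3, x=2.0, x_min=0.0)    # -> 0.3 (untouched)
u_adj = cbf_filter(u_learned=-5.0, x=0.5, x_min=0.0)    # -> -0.5 (projected)
```

In higher dimensions the same projection is posed as a small quadratic program and solved online each control step.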
cs.RO / 34 / 2603.18921
Lightweight Model Predictive Control for Spacecraft Rendezvous Attitude Synchronization
轻量级模型预测控制用于航天器交会姿态同步
Abstract
This work introduces two lightweight model predictive control (MPC) approaches for attitude tracking with reaction wheels during spacecraft rendezvous synchronization. Both approaches are based on a novel attitude deviation formulation, which enables the use of inherently linear constraints on angular velocity. We develop a single-loop and a dual-loop MPC; the latter embeds a stabilizing feedback controller within the inner loop, yielding a linear time-invariant system. Both controllers are implemented with CasADi - including automatic code generation - evaluated across various solvers, and validated within the Basilisk astrodynamics simulation framework. The experimental results demonstrate improved tracking accuracy alongside reductions in computational effort and memory consumption. Finally, embedded delivery to an ARM Cortex-M7 - representative of commercial off-the-shelf devices used in New Space platforms - confirms the real-time feasibility of these approaches and highlights their suitability for onboard attitude control in resource-constrained spacecraft rendezvous missions.
Chinese Translation
本研究介绍了两种轻量级模型预测控制(MPC)方法,用于在航天器交会同步过程中利用反作用轮进行姿态跟踪。这两种方法基于一种新颖的姿态偏差表述,使得可以对角速度施加本质上线性的约束。我们开发了单回路和双回路MPC;后者在内环中嵌入了一个镇定反馈控制器,从而得到一个线性时不变系统。两种控制器均使用CasADi实现(包括自动代码生成),在多种求解器上进行了评估,并在Basilisk天体动力学仿真框架中得到验证。实验结果表明,跟踪精度得到提高,同时计算量和内存消耗也有所降低。最后,在ARM Cortex-M7(代表新航天平台中使用的商业现成器件)上的嵌入式部署确认了这些方法的实时可行性,并突显了它们在资源受限的航天器交会任务中进行机载姿态控制的适用性。
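The dual-loop idea above — wrapping a stabilizing inner feedback so the outer MPC sees a linear time-invariant system — can be sketched on a toy discrete double integrator standing in for the attitude-deviation state (the gains in `K` and the dynamics are illustrative assumptions):

```python
import numpy as np

# Toy double integrator: state = [attitude error, angular rate].
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
K = np.array([[2.0, 3.0]])     # inner-loop stabilizing feedback (assumed gains)
Acl = A - B @ K                # the outer MPC predicts with this LTI closed loop

x = np.array([[1.0], [0.0]])   # initial attitude deviation
for _ in range(100):
    x = Acl @ x                # inner loop drives the deviation to zero
```

Note that a bound like |angular rate| <= w_max is a linear constraint on this state, which is exactly what the attitude-deviation formulation is designed to expose.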
cs.RO / 35 / 2603.18979
PRIOR: Perceptive Learning for Humanoid Locomotion with Reference Gait Priors
PRIOR:基于参考步态先验的人形机器人感知学习行走
Abstract
Training perceptive humanoid locomotion policies that traverse complex terrains with natural gaits remains an open challenge, typically demanding multi-stage training pipelines, adversarial objectives, or extensive real-world calibration. We present PRIOR, an efficient and reproducible framework built on Isaac Lab that achieves robust terrain traversal with human-like gaits through a simple yet effective design: (i) a parametric gait generator that supplies stable reference trajectories derived from motion capture without adversarial training, (ii) a GRU-based state estimator that infers terrain geometry directly from egocentric depth images via self-supervised heightmap reconstruction, and (iii) terrain-adaptive footstep rewards that guide foot placement toward traversable regions. Through systematic analysis of depth image resolution trade-offs, we identify configurations that maximize terrain fidelity under real-time constraints, substantially reducing perceptual overhead without degrading traversal performance. Comprehensive experiments across terrains of varying difficulty-including stairs, boxes, and gaps-demonstrate that each component yields complementary and essential performance gains, with the full framework achieving a 100% traversal success rate. We will open-source the complete PRIOR framework, including the training pipeline, parametric gait generator, and evaluation benchmarks, to serve as a reproducible foundation for humanoid locomotion research on Isaac Lab.
Chinese Translation
训练能够以自然步态穿越复杂地形的感知型人形机器人行走策略仍然是一个未解决的挑战,通常需要多阶段训练流程、对抗性目标或大量的现实世界校准。我们提出了PRIOR,一个基于Isaac Lab构建的高效且可复现的框架,通过简单而有效的设计实现了以类人步态稳健穿越地形:(i) 一个参数化步态生成器,提供源自动作捕捉的稳定参考轨迹,而无需对抗性训练;(ii) 一个基于GRU的状态估计器,通过自监督的高度图重建直接从第一视角深度图像推断地形几何;(iii) 地形自适应的落脚点奖励,引导足部落在可穿越区域。通过对深度图像分辨率权衡的系统分析,我们识别出在实时约束下最大化地形保真度的配置,显著减少感知开销而不降低穿越性能。在不同难度的地形(包括楼梯、箱体和沟壑)上进行的全面实验表明,每个组件都带来了互补且必不可少的性能提升,完整框架实现了100%的穿越成功率。我们将开源完整的PRIOR框架,包括训练流程、参数化步态生成器和评估基准,作为Isaac Lab上人形机器人行走研究的可复现基础。
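A parametric gait generator of the kind listed as component (i) can be sketched as a phase-indexed reference profile with per-leg phase offsets (the sinusoidal swing shape, cycle period, and clearance are illustrative assumptions, not PRIOR's motion-capture-derived trajectories):

```python
import math

def foot_height(phi, clearance=0.1):
    """Reference foot height over one gait cycle: swing (lift) for phi in
    [0, 0.5), stance (ground contact) for phi in [0.5, 1)."""
    return clearance * math.sin(2.0 * math.pi * phi) if phi < 0.5 else 0.0

def gait_reference(t, period=0.8, offsets=(0.0, 0.5)):
    """Phase-offset copies of the same profile yield an alternating gait."""
    return [foot_height(((t / period) + o) % 1.0) for o in offsets]

h = gait_reference(0.2)   # t = period/4: first foot mid-swing, second in stance
```

The policy then tracks such references, with the terrain-adaptive footstep reward shifting the horizontal foot targets toward traversable cells.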
cs.RO / 36 / 2603.18988
MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning and Grounding in Human-Robot Interaction
MERGE:用于多参与者事件推理和人机交互中情境定位的引导视觉-语言模型
Abstract
We introduce MERGE, a system for situational grounding of actors, objects, and events in dynamic human-robot group interactions. Effective collaboration in such settings requires consistent situational awareness, built on persistent representations of people and objects and an episodic abstraction of events. MERGE achieves this by uniquely identifying physical instances of actors (humans or robots) and objects and structuring them into actor-action-object relations, ensuring temporal consistency across interactions. Central to MERGE is the integration of Vision-Language Models (VLMs) guided with a perception pipeline: a lightweight streaming module continuously processes visual input to detect changes and selectively invokes the VLM only when necessary. This decoupled design preserves the reasoning power and zero-shot generalization of VLMs while improving efficiency, avoiding both the high monetary cost and the latency of frame-by-frame captioning that leads to fragmented and delayed outputs. To address the absence of suitable benchmarks for multi-actor collaboration, we introduce the GROUND dataset, which offers fine-grained situational annotations of multi-person and human-robot interactions. On this dataset, our approach improves the average grounding score by a factor of 2 compared to the performance of VLM-only baselines - including GPT-4o, GPT-5 and Gemini 2.5 Flash - while also reducing run-time by a factor of 4. The code and data are available at www.github.com/HRI-EU/merge.
Chinese Translation
我们介绍了MERGE,一个用于动态人机群体交互中参与者、物体和事件情境定位的系统。在这样的环境中,有效的协作需要一致的情境意识,而这种意识建立在对人和物体的持久表征以及对事件的情节式抽象之上。MERGE通过唯一识别参与者(人或机器人)和物体的物理实例,并将其结构化为参与者-动作-物体关系,确保交互过程中的时间一致性。MERGE的核心是将视觉-语言模型(Vision-Language Models, VLMs)与感知管道相结合进行引导:一个轻量级的流式处理模块持续处理视觉输入以检测变化,并仅在必要时选择性地调用VLM。这种解耦设计保留了VLM的推理能力和零样本泛化能力,同时提高了效率,避免了逐帧描述生成的高昂费用与延迟——后者会导致输出碎片化和滞后。为了应对多参与者协作缺乏合适基准的问题,我们引入了GROUND数据集,该数据集提供了多人及人机交互的细粒度情境标注。在该数据集上,我们的方法将平均定位得分提高到仅使用VLM的基线(包括GPT-4o、GPT-5和Gemini 2.5 Flash)的2倍,同时将运行时间缩短为原来的四分之一。代码和数据可在www.github.com/HRI-EU/merge获取。
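The decoupled design above — a cheap streaming module that gates calls to the expensive VLM — can be sketched as a frame-difference trigger (the mean-absolute-difference test, the threshold, and the `expensive_vlm` stand-in are illustrative assumptions, not MERGE's actual change detector):

```python
import numpy as np

def run_stream(frames, threshold=10.0):
    """Invoke the expensive model only when the scene changes enough."""
    calls, last = 0, None

    def expensive_vlm(frame):          # stand-in for the real VLM call
        nonlocal calls
        calls += 1                      # a real system would parse events here

    for f in frames:
        if last is None or np.abs(f - last).mean() > threshold:
            expensive_vlm(f)
            last = f                    # remember the last frame that was processed
    return calls

static = [np.zeros((4, 4))] * 5
changed = static + [np.full((4, 4), 50.0)]
n_static = run_stream(static)    # only the first frame triggers a call
n_changed = run_stream(changed)  # the large change triggers one more
```

This is where the reported 4x run-time reduction comes from: most frames never reach the VLM.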
cs.RO / 37 / 2603.19029
ATG-MoE: Autoregressive trajectory generation with mixture-of-experts for assembly skill learning
ATG-MoE:基于专家混合的自回归轨迹生成用于装配技能学习
Abstract
Flexible manufacturing requires robot systems that can adapt to constantly changing tasks, objects, and environments. However, traditional robot programming is labor-intensive and inflexible, while existing learning-based assembly methods often suffer from weak positional generalization, complex multi-stage designs, and limited multi-skill integration capability. To address these issues, this paper proposes ATG-MoE, an end-to-end autoregressive trajectory generation method with mixture of experts for assembly skill learning from demonstration. The proposed method establishes a closed-loop mapping from multi-modal inputs, including RGB-D observations, natural language instructions, and robot proprioception to manipulation trajectories. It integrates multi-modal feature fusion for scene and task understanding, autoregressive sequence modeling for temporally coherent trajectory generation, and a mixture-of-experts architecture for unified multi-skill learning. In contrast to conventional methods that separate visual perception and control or train different skills independently, ATG-MoE directly incorporates visual information into trajectory generation and supports efficient multi-skill integration within a single model. We train and evaluate the proposed method on eight representative assembly skills from a pressure-reducing valve assembly task. Experimental results show that ATG-MoE achieves strong overall performance in simulation, with an average grasp success rate of 96.3% and an average overall success rate of 91.8%, while also demonstrating strong generalization and effective multi-skill integration. Real-world experiments further verify its practicality for multi-skill industrial assembly. The project page can be found at https://hwh23.github.io/ATG-MoE
Chinese Translation
柔性制造需要能够适应不断变化的任务、物体和环境的机器人系统。然而,传统的机器人编程劳动密集且缺乏灵活性,而现有的基于学习的装配方法往往存在位置泛化能力弱、多阶段设计复杂以及多技能集成能力有限等问题。为了解决这些问题,本文提出了ATG-MoE,一种端到端的自回归轨迹生成方法,结合专家混合架构从示范中学习装配技能。该方法建立了从多模态输入(包括RGB-D观测、自然语言指令和机器人本体感知)到操作轨迹的闭环映射。它集成了用于场景和任务理解的多模态特征融合、用于生成时间上连贯轨迹的自回归序列建模,以及用于统一多技能学习的专家混合架构。与将视觉感知和控制分离或独立训练不同技能的传统方法不同,ATG-MoE直接将视觉信息纳入轨迹生成,并支持在单一模型内高效集成多种技能。我们在一个减压阀装配任务的八个代表性装配技能上训练和评估了该方法。实验结果表明,ATG-MoE在仿真中实现了强大的整体性能,平均抓取成功率为96.3%,平均整体成功率为91.8%,同时还表现出强大的泛化能力和有效的多技能集成。现实世界的实验进一步验证了其在多技能工业装配中的实用性。项目页面见https://hwh23.github.io/ATG-MoE
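The mixture-of-experts layer at the core of such unified multi-skill learning can be sketched as softmax gating over per-expert linear maps (the dimensions and random weights are illustrative assumptions; ATG-MoE's actual experts sit inside an autoregressive transformer):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x, expert_weights, gate_weights):
    """Mixture-of-experts: gate scores -> softmax -> weighted sum of experts."""
    gates = softmax(gate_weights @ x)                    # (n_experts,)
    outputs = np.array([W @ x for W in expert_weights])  # (n_experts, d_out)
    return gates @ outputs                               # blended output

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                   # fused multimodal feature
experts = [rng.normal(size=(2, 3)) for _ in range(4)]    # one map per skill group
gate_w = rng.normal(size=(4, 3))
y = moe_forward(x, experts, gate_w)
```

Training lets the gate route each skill's inputs to the experts best suited to it while sharing the rest of the network.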
cs.RO / 38 / 2603.19063
Fire as a Service: Augmenting Robot Simulators with Thermally and Visually Accurate Fire Dynamics
火灾即服务:以热学和视觉上精确的火灾动力学增强机器人模拟器
Abstract
Most existing robot simulators prioritize rigid-body dynamics and photorealistic rendering, but largely neglect the thermally and optically complex phenomena that characterize real-world fire environments. For robots envisioned as future firefighters, this limitation hinders both reliable capability evaluation and the generation of representative training data prior to deployment in hazardous scenarios. To address these challenges, we introduce Fire as a Service (FaaS), a novel, asynchronous co-simulation framework that augments existing robot simulators with high-fidelity and computationally efficient fire simulations. Our pipeline enables robots to experience accurate, multi-species thermodynamic heat transfer and visually consistent volumetric smoke without disrupting high-frequency rigid-body control loops. We demonstrate that our framework can be integrated with diverse robot simulators to generate physically accurate fire behavior, benchmark thermal hazards encountered by robotic platforms, and collect realistic multimodal perceptual data. Crucially, its real-time performance supports human-in-the-loop teleoperation, enabling the successful training of reactive, multimodal policies via Behavioral Cloning. By adding fire dynamics to robot simulations, FaaS provides a scalable pathway toward safer, more reliable deployment of robots in fire scenarios.
Chinese Translation
现有的大多数机器人模拟器优先考虑刚体动力学和真实感渲染,但在很大程度上忽视了真实火灾环境特有的热学与光学复杂现象。对于被设想为未来消防员的机器人而言,这一局限既阻碍了可靠的能力评估,也阻碍了在危险场景部署前生成具有代表性的训练数据。为了应对这些挑战,我们提出了火灾即服务(Fire as a Service, FaaS),一种新颖的异步协同仿真框架,通过高保真且计算高效的火灾仿真增强现有的机器人模拟器。我们的管线使机器人能够体验精确的多物种热力学传热和视觉上一致的体积烟雾,而不干扰高频刚体控制回路。我们证明了该框架可以与多种机器人模拟器集成,以生成物理上精确的火灾行为、基准测试机器人平台遇到的热危害,并采集逼真的多模态感知数据。至关重要的是,其实时性能支持人在回路遥操作,从而能够通过行为克隆成功训练出反应式的多模态策略。通过为机器人仿真添加火灾动力学,FaaS为机器人在火灾场景中更安全、更可靠的部署提供了一条可扩展的路径。
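The asynchronous co-simulation pattern described above — a slow fire solver publishing state that a fast rigid-body control loop reads without blocking — can be sketched as a two-rate loop (the rates, the shared "mailbox" variable, and the placeholder temperature update are illustrative assumptions; FaaS uses a real fire solver in a separate process):

```python
# Two-rate co-simulation over one simulated second: the fire step publishes its
# latest state; the control loop always reads the most recent value, never blocks.
control_hz, fire_hz = 500, 10
latest_fire_temp = 20.0          # shared mailbox (a queue/shared memory in practice)
fire_updates = 0
temps_seen = []

for step in range(control_hz):
    if step % (control_hz // fire_hz) == 0:   # fire advances every 50 control ticks
        latest_fire_temp += 5.0               # placeholder for a real CFD update
        fire_updates += 1
    temps_seen.append(latest_fire_temp)       # control loop: non-blocking read
```

Decoupling the rates is what keeps the high-frequency control loop intact while the fire model runs at whatever rate its physics allows.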
cs.RO / 39 / 2603.19074
CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem
CAMO:一种用于多目标多旅行推销员问题的条件神经求解器
Abstract
Robotic systems often require a team of robots to collectively visit multiple targets while optimizing competing objectives, such as total travel cost and makespan. This setting can be formulated as the Multi-Objective Multiple Traveling Salesman Problem (MOMTSP). Although learning-based methods have shown strong performance on the single-agent TSP and multi-objective TSP variants, they rarely address the combined challenges of multi-agent coordination and multi-objective trade-offs, which introduce dual sources of complexity. To bridge this gap, we propose CAMO, a conditional neural solver for MOMTSP that generalizes across varying numbers of targets, agents, and preference vectors, and yields high-quality approximations to the Pareto front (PF). Specifically, CAMO consists of a conditional encoder to fuse preferences into instance representations, enabling explicit control over multi-objective trade-offs, and a collaborative decoder that coordinates all agents by alternating agent selection and node selection to construct multi-agent tours autoregressively. To further improve generalization, we train CAMO with a REINFORCE-based objective over a mixed distribution of problem sizes. Extensive experiments show that CAMO outperforms both neural and conventional heuristics, achieving a closer approximation of PFs. In addition, ablation results validate the contributions of CAMO's key components, and real-world tests on a mobile robot platform demonstrate its practical applicability.
Chinese Translation
机器人系统通常需要一组机器人共同访问多个目标,同时优化诸如总旅行成本和完工时间等相互竞争的目标。这一设置可以表述为多目标多旅行推销员问题(Multi-Objective Multiple Traveling Salesman Problem, MOMTSP)。尽管基于学习的方法在单智能体旅行推销员问题(TSP)和多目标TSP变体上表现出强大的性能,但它们很少同时应对多智能体协调和多目标权衡的组合挑战,而这二者带来了双重复杂性。为了填补这一空白,我们提出了CAMO,一种用于MOMTSP的条件神经求解器,它能够泛化到不同数量的目标、智能体和偏好向量,并对帕累托前沿(Pareto front, PF)给出高质量的近似。具体而言,CAMO由一个条件编码器和一个协作解码器组成:前者将偏好融合到实例表示中,从而显式控制多目标权衡;后者通过交替进行智能体选择和节点选择来协调所有智能体,以自回归方式构建多智能体巡回路线。为了进一步提高泛化能力,我们使用基于REINFORCE的目标在混合问题规模分布上训练CAMO。大量实验表明,CAMO优于神经方法和传统启发式方法,实现了对PF更逼近的近似。此外,消融实验验证了CAMO关键组件的贡献,在移动机器人平台上的真实世界测试也展示了其实际适用性。
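The two competing objectives named above (total travel cost and makespan) and their preference-weighted scalarization can be sketched directly (the tiny distance matrix, the depot-at-node-0 convention, and linear scalarization are illustrative assumptions; CAMO conditions a neural decoder on the preference rather than scalarizing by hand):

```python
import numpy as np

def objectives(tours, dist):
    """Per-solution objectives: total travel cost and makespan (longest tour)."""
    lengths = []
    for tour in tours:
        path = [0] + tour + [0]          # every agent starts and ends at depot 0
        lengths.append(sum(dist[a][b] for a, b in zip(path, path[1:])))
    return sum(lengths), max(lengths)

def scalarize(tours, dist, pref):
    """Preference vector trades off the two objectives."""
    total, makespan = objectives(tours, dist)
    return pref[0] * total + pref[1] * makespan

dist = np.array([[0, 2, 9], [2, 0, 6], [9, 6, 0]], dtype=float)
tours = [[1], [2]]                       # agent 0 visits node 1, agent 1 node 2
obj = objectives(tours, dist)            # tour lengths 4 and 18
```

Sweeping the preference vector and keeping the non-dominated solutions traces out an approximation of the Pareto front.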
cs.RO / 40 / 2603.19078
Articulated-Body Dynamics Network: Dynamics-Grounded Prior for Robot Learning
关节体动力学网络:基于动力学的机器人学习先验
Abstract
Recent work in reinforcement learning has shown that incorporating structural priors for articulated robots, such as link connectivity, into policy networks improves learning efficiency. However, dynamics properties, despite their fundamental role in determining how forces and motion propagate through the body, remain largely underexplored as an inductive bias for policy learning. To address this gap, we present the Articulated-Body Dynamics Network (ABD-Net), a novel graph neural network architecture grounded in the computational structure of forward dynamics. Specifically, we adapt the inertia propagation mechanism from the Articulated Body Algorithm, systematically aggregating inertial quantities from child to parent links in a tree-structured manner, while replacing physical quantities with learnable parameters. Embedding ABD-NET into the policy actor enables dynamics-informed representations that capture how actions propagate through the body, leading to efficient and robust policy learning. Through experiments with simulated humanoid, quadruped, and hopper robots, our approach demonstrates increased sample efficiency and generalization to dynamics shifts compared to transformer-based and GNN baselines. We further validate the learned policy on real Unitree G1 and Go2 robots, state-of-the-art humanoid and quadruped platforms, generating dynamic, versatile and robust locomotion behaviors through sim-to-real transfer with real-time inference.
Chinese Translation
近期的强化学习研究表明,将关节型机器人的结构先验(如连杆连接关系)纳入策略网络可以提高学习效率。然而,尽管动力学特性在决定力和运动如何在机体中传播方面起着根本性作用,它作为策略学习的归纳偏置仍未得到充分探索。为了解决这一问题,我们提出了关节体动力学网络(Articulated-Body Dynamics Network, ABD-Net),一种植根于前向动力学计算结构的新型图神经网络架构。具体而言,我们借鉴了关节体算法(Articulated Body Algorithm)中的惯性传播机制,以树结构方式将惯性量从子连杆系统性地聚合到父连杆,同时用可学习参数替代物理量。将ABD-Net嵌入策略的actor网络,使其获得能够刻画动作如何在机体中传播的动力学感知表示,从而实现高效且稳健的策略学习。通过对仿真人形、四足和单足跳跃机器人的实验,我们的方法相比基于Transformer和GNN的基线,展现出更高的样本效率以及对动力学变化的泛化能力。我们进一步在真实的Unitree G1和Go2机器人(最先进的人形和四足平台)上验证了学习到的策略,通过仿真到现实迁移和实时推理,生成了动态、多样且稳健的运动行为。
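The child-to-parent aggregation borrowed from the Articulated Body Algorithm can be sketched as a single upward pass over a kinematic tree, with learnable per-link weights standing in for physical inertia terms (the chain topology, unit features, and unit weights are illustrative assumptions):

```python
import numpy as np

def aggregate_up(parent, features, weights):
    """Child-to-parent aggregation in the spirit of ABA inertia propagation:
    each parent accumulates a (learnable-)weighted copy of its child's aggregate.
    Assumes nodes are topologically ordered (parent index < child index)."""
    agg = features.astype(float).copy()
    for i in range(len(parent) - 1, 0, -1):   # sweep from the leaves to the root
        agg[parent[i]] += weights[i] * agg[i]
    return agg

# Chain root(0) <- 1 <- 2 with unit features and unit weights:
parent = [-1, 0, 1]
agg = aggregate_up(parent, np.ones(3), np.ones(3))
```

In ABD-Net this structured sweep is what lets the actor's representation mirror how forces propagate through the body.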
cs.RO / 41 / 2603.19124
Tendon-Actuated Robots with a Tapered, Flexible Polymer Backbone: Design, Fabrication, and Modeling
具有锥形柔性聚合物骨架的腱驱动机器人:设计、制造与建模
Abstract
This paper presents the design, modeling, and fabrication of 3D-printed, tendon-actuated continuum robots featuring a flexible, tapered backbone constructed from thermoplastic polyurethane (TPU). Our scalable design incorporates an integrated electronics base housing that enables direct tendon tension control and sensing via actuators and compression load cells. Unlike many continuum robots that are single-purpose and costly, the proposed design prioritizes customizability, rapid assembly, and low cost while enabling high curvature and enhanced distal compliance through geometric tapering, thereby supporting a broad range of compliant robotic inspection and manipulation tasks. We develop a generalized forward kinetostatic model of the tapered backbone based on Cosserat rod theory using a Newtonian approach, extending existing tendon-actuated Cosserat rod formulations to explicitly account for spatially varying backbone cross-sectional geometry. The model captures the graded stiffness profile induced by the tapering and enables systematic exploration of the configuration space as a function of the geometric design parameters. Specifically, we analyze how the backbone taper angle influences the robot's configuration space and manipulability. The model is validated against motion capture data, achieving centimeter-level shape prediction accuracy after calibrating Young's modulus via a line search that minimizes modeling error. We further demonstrate teleoperated grasping using an endoscopic gripper routed along the continuum robot, mounted on a 6-DoF robotic arm. Parameterized iLogic/CAD scripts are provided for rapid geometry generation and scaling. The presented framework establishes a simple, rapid, and reproducible pathway from parametric design to controlled tendon actuation for tapered, tendon-driven continuum robots manufactured using fused deposition modeling 3D printers.
Chinese Translation
本文介绍了以热塑性聚氨酯(TPU)构建柔性锥形骨架的3D打印腱驱动连续体机器人的设计、建模与制造。我们的可扩展设计集成了一个容纳电子装置的基座外壳,可通过执行器和压缩式称重传感器直接实现腱张力控制与感知。与许多单一用途且成本高昂的连续体机器人不同,所提出的设计优先考虑可定制性、快速组装和低成本,同时通过几何锥化实现高曲率和增强的远端柔顺性,从而支持广泛的柔顺机器人检查与操作任务。我们基于Cosserat杆理论,采用牛顿方法建立了锥形骨架的广义正运动静力学模型,将现有的腱驱动Cosserat杆公式扩展为显式考虑沿轴向变化的骨架截面几何。该模型捕捉了锥化引起的梯度刚度分布,并能够系统地探索作为几何设计参数函数的位形空间。具体而言,我们分析了骨架锥角如何影响机器人的位形空间和可操作性。模型经运动捕捉数据验证,在通过最小化建模误差的线搜索标定杨氏模量后,形状预测精度达到厘米级。我们还展示了使用沿连续体机器人布设的内窥镜夹爪进行遥操作抓取,该机器人安装在一个6自由度机械臂上。我们提供了参数化的iLogic/CAD脚本,用于快速生成和缩放几何模型。所提出的框架为使用熔融沉积成型3D打印机制造的锥形腱驱动连续体机器人,建立了一条从参数化设计到受控腱驱动的简单、快速且可复现的路径。
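The graded stiffness profile induced by the taper follows from the spatially varying cross-section: for a circular backbone, bending stiffness is EI(s) with I = pi r(s)^4 / 4. A minimal sketch under a linear taper (the radii, length, and TPU-like Young's modulus are illustrative assumptions, not the paper's calibrated values):

```python
import numpy as np

def bending_stiffness(s, r_base, r_tip, length, E):
    """Bending stiffness EI(s) of a linearly tapered circular backbone:
    r(s) = r_base + (r_tip - r_base) * s / length,  I(s) = pi * r(s)^4 / 4."""
    r = r_base + (r_tip - r_base) * s / length
    return E * np.pi * r**4 / 4.0

s = np.linspace(0.0, 0.3, 4)
EI = bending_stiffness(s, r_base=5e-3, r_tip=2e-3, length=0.3, E=26e6)
```

Because stiffness scales with r^4, even a modest taper concentrates compliance at the distal end, which is exactly the effect the design exploits.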
cs.RO / 42 / 2603.19134
Introducing M: A Modular, Modifiable Social Robot
M的介绍:一个模块化、可修改的社交机器人
Abstract
We present M, an open-source, low-cost social robot platform designed to reduce platform friction that slows social robotics research by making robots easier to reproduce, modify, and deploy in real-world settings. M combines a modular mechanical design, multimodal sensing, and expressive yet mechanically simple actuation architecture with a ROS2-native software package that cleanly separates perception, expression control, and data management. The platform includes a simulation environment with interface equivalence to hardware to support rapid sim-to-real transfer of interaction behaviors. We demonstrate extensibility through additional sensing/actuation modules and provide example interaction templates for storytelling and two-way conversational coaching. Finally, we report real-world use in participatory design and week-long in-home deployments, showing how M can serve as a practical foundation for longitudinal, reproducible social robotics research.
Chinese Translation
我们提出了M,一个开源、低成本的社交机器人平台,旨在减少拖慢社交机器人研究的平台摩擦,使机器人更容易复现、修改并部署到真实环境中。M结合了模块化机械设计、多模态传感,以及富有表现力但机械结构简单的驱动架构,并配备原生于ROS2的软件包,将感知、表情控制和数据管理清晰分离。该平台包含一个与硬件接口等价的仿真环境,以支持交互行为的快速仿真到现实迁移。我们通过附加的传感/驱动模块展示了其可扩展性,并提供了用于讲故事和双向对话辅导的交互模板示例。最后,我们报告了在参与式设计和为期一周的入户部署中的实际使用情况,展示了M如何作为纵向、可复现社交机器人研究的实用基础。
cs.RO / 43 / 2603.19166
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
意义与测量:用于视觉-语言导航的多智能体概率接地
Abstract
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
Chinese Translation
与人类协作的机器人必须将自然语言目标转化为可执行的、物理上有依据的决策。例如,执行“向冰箱右侧移动两米”这样的命令,需要在三维场景中对语义指代、空间关系和度量约束进行接地(grounding)。尽管最近的视觉语言模型(VLMs)展示了强大的语义接地能力,但它们并非为推理物理空间中的度量约束而显式设计。在本研究中,我们通过实证表明,最先进的基于VLM的接地方法在处理复杂的度量-语义语言查询时表现不佳。为了解决这一局限,我们提出了MAPG(Multi-Agent Probabilistic Grounding,多智能体概率接地),一个将语言查询分解为结构化子组件并调用VLM对每个组件进行接地的智能体框架。随后,MAPG以概率方式组合这些接地输出,在三维空间中生成度量一致的、可执行的决策。我们在HM-EQA基准上评估了MAPG,结果显示其相较于强基线具有一致的性能提升。此外,我们引入了一个新基准MAPG-Bench,专门用于评估度量-语义目标接地,填补了现有语言接地评估中的空白。我们还展示了一个真实世界的机器人演示,表明在具备结构化场景表示时,MAPG能够迁移到仿真之外。
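The probabilistic composition step above can be sketched on a 2D grid: each grounded subcomponent ("near the fridge", "two meters to the right") contributes a probability map, and their product — under an independence assumption — is renormalized and maximized (the tiny 2x2 grids and their values are illustrative assumptions, not MAPG's actual maps):

```python
import numpy as np

def compose_grounding(component_probs):
    """Multiply per-component probability maps over a grid (independence
    assumption), renormalize, and return the most likely goal cell."""
    joint = np.ones_like(component_probs[0])
    for p in component_probs:
        joint = joint * p
    joint = joint / joint.sum()
    return np.unravel_index(np.argmax(joint), joint.shape), joint

near_fridge = np.array([[0.1, 0.6], [0.1, 0.2]])   # semantic component
two_m_right = np.array([[0.2, 0.2], [0.1, 0.5]])   # metric component
cell, joint = compose_grounding([near_fridge, two_m_right])
```

The argmax cell is then handed to the navigation stack as a metrically consistent goal.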
cs.RO / 44 / 2603.19170
ADMM-Based Distributed MPC with Control Barrier Functions for Safe Multi-Robot Quadrupedal Locomotion
基于交替方向乘子法的分布式模型预测控制与控制屏障函数相结合的安全多机器人四足运动
Abstract
This paper proposes a fully decentralized model predictive control (MPC) framework with control barrier function (CBF) constraints for safety-critical trajectory planning in multi-robot legged systems. The incorporation of CBF constraints introduces explicit inter-agent coupling, which prevents direct decomposition of the resulting optimal control problems. To address this challenge, we reformulate the centralized safety-critical MPC problem using a structured distributed optimization framework based on the alternating direction method of multipliers (ADMM). By introducing a novel node-edge splitting formulation with consensus constraints, the proposed approach decomposes the global problem into independent node-local and edge-local quadratic programs that can be solved in parallel using only neighbor-to-neighbor communication. This enables fully decentralized trajectory optimization with symmetric computational load across agents while preserving safety and dynamic feasibility. The proposed framework is integrated into a hierarchical locomotion control architecture for quadrupedal robots, combining high-level distributed trajectory planning, mid-level nonlinear MPC enforcing single rigid body dynamics, and low-level whole-body control enforcing full-order robot dynamics. The effectiveness of the proposed approach is demonstrated through hardware experiments on two Unitree Go2 quadrupedal robots and numerical simulations involving up to four robots navigating uncertain environments with rough terrain and external disturbances. The results show that the proposed distributed formulation achieves performance comparable to centralized MPC while reducing the average per-cycle planning time by up to 51% in the four-agent case, enabling efficient real-time decentralized implementation.
Chinese Translation
本文提出了一种带有控制屏障函数(CBF)约束的完全去中心化模型预测控制(MPC)框架,用于多机器人足式系统的安全关键轨迹规划。CBF约束的引入带来了智能体之间的显式耦合,使得所产生的最优控制问题无法直接分解。为了解决这一挑战,我们基于交替方向乘子法(ADMM)的结构化分布式优化框架,对集中式安全关键MPC问题进行了重构。通过引入一种带共识约束的新颖节点-边分裂表述,所提方法将全局问题分解为相互独立的节点局部与边局部二次规划,它们仅需邻居间通信即可并行求解。这使得在保持安全性和动力学可行性的同时,实现智能体间计算负载对称的完全去中心化轨迹优化。所提框架被集成到四足机器人的分层运动控制架构中,结合了高层的分布式轨迹规划、中层施加单刚体动力学的非线性MPC,以及低层施加全阶机器人动力学的全身控制。通过在两台Unitree Go2四足机器人上的硬件实验,以及多达四台机器人在具有崎岖地形和外部扰动的不确定环境中导航的数值仿真,验证了所提方法的有效性。结果表明,所提出的分布式表述在性能上与集中式MPC相当,同时在四智能体情形下将平均每周期规划时间最多减少51%,从而支持高效的实时去中心化实现。
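The consensus-ADMM machinery underlying the decomposition can be sketched on a toy scalar problem: two agents with local quadratic costs agree on a shared variable through parallel local updates, averaging, and dual ascent (the costs f_i(x) = (x - a_i)^2 / 2 and the penalty rho are illustrative assumptions; the paper's node-edge splitting applies the same pattern to coupled QPs):

```python
import numpy as np

def consensus_admm(a, rho=1.0, iters=50):
    """Consensus ADMM for min sum_i (x_i - a_i)^2 / 2 subject to x_i = z.
    Each agent's x-update is local; only the averaged z couples them."""
    a = np.asarray(a, dtype=float)
    x = np.zeros_like(a)
    y = np.zeros_like(a)   # scaled-by-rho dual variables
    z = 0.0
    for _ in range(iters):
        x = (a + rho * z - y) / (1.0 + rho)   # parallel local minimizations
        z = np.mean(x + y / rho)              # consensus (neighbor averaging)
        y = y + rho * (x - z)                 # dual ascent on the constraint
    return float(z)

z = consensus_admm([1.0, 3.0])   # global optimum is the mean, 2.0
```

Replacing each scalar x-update with a node- or edge-local QP and the mean with neighbor-to-neighbor averaging gives the structure of the distributed planner.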
cs.RO / 45 / 2603.19183
Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models
稀疏自编码器揭示VLA模型中的可解释和可引导特征
Abstract
Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressively in some settings, fine-tuned variants often fail on novel objects, scenes, and instructions. We apply mechanistic interpretability techniques to better understand the inner workings of VLA models. To probe internal representations, we train Sparse Autoencoders (SAEs) on hidden layer activations of the VLA. SAEs learn a sparse dictionary whose features act as a compact, interpretable basis for the model's computation. We find that the large majority of extracted SAE features correspond to memorized sequences from specific training demonstrations. However, some features correspond to interpretable, general, and steerable motion primitives and semantic properties, offering a promising glimpse toward VLA generalizability. We propose a metric to categorize features according to whether they represent generalizable transferable primitives or episode-specific memorization. We validate these findings through steering experiments on the LIBERO benchmark. We show that individual SAE features causally influence robot behavior. Steering general features induces behaviors consistent with their semantic meaning and can be applied across tasks and scenes. This work provides the first mechanistic evidence that VLAs can learn generalizable features across tasks and scenes. We observe that supervised fine-tuning on small robotics datasets disproportionately amplifies memorization. In contrast, training on larger, more diverse datasets (e.g., DROID) or using knowledge insulation promotes more general features. We provide an open-source codebase and user-friendly interface for activation collection, SAE training, and feature steering. Our project page is located at http://drvla.github.io
Chinese Translation
视觉-语言-动作(VLA)模型已成为通用机器人操作的一种有前景的方法。然而,它们的泛化能力并不一致:尽管这些模型在某些设置中表现出色,但经过微调的变体在新物体、新场景和新指令上往往失败。我们应用机制可解释性(mechanistic interpretability)技术来更好地理解VLA模型的内部工作原理。为了探测内部表示,我们在VLA的隐藏层激活上训练稀疏自编码器(SAE)。SAE学习一个稀疏字典,其特征充当模型计算的紧凑、可解释的基。我们发现,绝大多数提取出的SAE特征对应于对特定训练演示序列的记忆。然而,有些特征对应于可解释、通用且可引导的运动原语和语义属性,为VLA的泛化能力提供了有希望的线索。我们提出了一种指标,用于根据特征代表的是可泛化、可迁移的原语还是特定情节的记忆来对其分类。我们通过在LIBERO基准上的引导(steering)实验验证了这些发现,表明单个SAE特征会对机器人行为产生因果影响:引导通用特征会诱发与其语义含义一致的行为,并且可以跨任务和场景应用。这项工作首次提供了VLA能够跨任务和场景学习可泛化特征的机制层面证据。我们观察到,在小型机器人数据集上的监督微调会不成比例地放大记忆现象;相比之下,在更大、更多样的数据集(例如DROID)上训练或使用知识隔离(knowledge insulation)会促进更通用的特征。我们提供了用于激活采集、SAE训练和特征引导的开源代码库与用户友好界面。项目页面位于http://drvla.github.io
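The sparse autoencoder probe above can be sketched as a single ReLU encode / linear decode pass with the usual reconstruction-plus-L1 objective (the dimensions, random weights, and bias are illustrative assumptions; a real SAE trains these parameters on millions of hidden activations):

```python
import numpy as np

def sae_pass(x, W_enc, b_enc, W_dec, l1=1e-3):
    """Sparse autoencoder on one activation vector: nonnegative sparse code,
    linear reconstruction, and the reconstruction + L1 sparsity loss."""
    f = np.maximum(0.0, W_enc @ x + b_enc)    # sparse feature code (dictionary)
    x_hat = W_dec @ f                          # reconstruction from the code
    loss = np.sum((x - x_hat) ** 2) + l1 * np.sum(np.abs(f))
    return f, x_hat, loss

rng = np.random.default_rng(0)
d, m = 8, 32                                   # overcomplete: m dictionary atoms
x = rng.normal(size=d)                         # stand-in hidden-layer activation
W_enc = rng.normal(size=(m, d))
W_dec = rng.normal(size=(d, m)) / m
f, x_hat, loss = sae_pass(x, W_enc, np.full(m, -1.0), W_dec)
```

Steering then amounts to adding a multiple of one dictionary column `W_dec[:, i]` back into the model's residual stream during inference.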
cs.RO / 46 / 2603.19199
FASTER: Rethinking Real-Time Flow VLAs
FASTER:重新思考实时基于流的VLA
Abstract
Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $\pi_{0.5}$ and X-VLA) into a single step, while preserving the quality of long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.
Chinese Translation
实时执行对于在物理世界中部署视觉-语言-动作(VLA)模型至关重要。现有的异步推理方法主要优化轨迹的平滑性,却忽视了对环境变化做出反应的关键延迟。通过重新思考动作分块策略中“反应”的概念,本文对决定反应时间的因素进行了系统分析。我们表明,反应时间服从由首个动作时间(Time to First Action, TTFA)和执行时域共同决定的均匀分布。此外,我们揭示了在基于流的VLA中采用恒定调度这一标准做法可能效率低下,它迫使系统在任何运动开始之前完成全部采样步骤,从而构成反应延迟的瓶颈。为了解决这一问题,我们提出了即时反应快速动作采样(Fast Action Sampling for ImmediaTE Reaction, FASTER)。通过引入时域感知调度,FASTER在流采样过程中自适应地优先处理近期动作,将即时反应的去噪步数压缩十倍(例如在$\pi_{0.5}$和X-VLA中)至单个步骤,同时保持长时域轨迹的质量。结合流式客户端-服务器管道,FASTER显著降低了真实机器人上的有效反应延迟,尤其是在消费级GPU上部署时。包括高度动态的乒乓球任务在内的真实世界实验证明,FASTER为通用策略解锁了前所未有的实时响应能力,使快速生成准确且平滑的轨迹成为可能。
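The claim that reaction time is uniform over an interval set jointly by TTFA and the execution horizon can be sketched directly: an environmental change can land at any point within the currently executing chunk, so the wait until the next reactive action is uniform on [TTFA, TTFA + horizon * dt] (the particular TTFA values, horizon, and control period below are illustrative assumptions):

```python
def reaction_time_stats(ttfa, horizon, dt):
    """Bounds and mean of the reaction-time distribution when a change arrives
    uniformly at random within the executing action chunk."""
    lo = ttfa                        # best case: change arrives at chunk boundary
    hi = ttfa + horizon * dt         # worst case: change arrives just after one
    return lo, hi, (lo + hi) / 2.0

# Illustrative comparison: many denoising steps before the first action
# versus a single compressed step for the immediate reaction.
slow = reaction_time_stats(ttfa=0.50, horizon=8, dt=0.05)   # mean 0.70 s
fast = reaction_time_stats(ttfa=0.05, horizon=8, dt=0.05)   # mean 0.25 s
```

Shrinking TTFA is exactly the lever the horizon-aware schedule pulls, since the horizon term is fixed by the chunking design.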
cs.RO / 47 / 2603.19201
OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation
OmniVTA:用于接触丰富的机器人操作的视觉触觉世界建模
Abstract
Contact-rich manipulation tasks, such as wiping and assembly, require accurate perception of contact forces, friction changes, and state transitions that cannot be reliably inferred from vision alone. Despite growing interest in visuo-tactile manipulation, progress is constrained by two persistent limitations: existing datasets are small in scale and narrow in task coverage, and current methods treat tactile signals as passive observations rather than using them to model contact dynamics or enable closed-loop control explicitly. In this paper, we present \textbf{OmniViTac}, a large-scale visuo-tactile-action dataset comprising $21{,}000+$ trajectories across $86$ tasks and $100+$ objects, organized into six physics-grounded interaction patterns. Building on this dataset, we propose \textbf{OmniVTA}, a world-model-based visuo-tactile manipulation framework that integrates four tightly coupled modules: a self-supervised tactile encoder, a two-stream visuo-tactile world model for predicting short-horizon contact evolution, a contact-aware fusion policy for action generation, and a 60Hz reflexive controller that corrects deviations between predicted and observed tactile signals in a closed loop. Real-robot experiments across all six interaction categories show that OmniVTA outperforms existing methods and generalizes well to unseen objects and geometric configurations, confirming the value of combining predictive contact modeling with high-frequency tactile feedback for contact-rich manipulation. All data, models, and code will be made publicly available on the project website at https://mrsecant.github.io/OmniVTA.
Chinese Translation
接触丰富的操作任务,例如擦拭和组装,需要对接触力、摩擦变化和状态转变的准确感知,这些信息仅通过视觉无法可靠推断。尽管对视觉触觉操作的关注日益增长,但进展受到两个持续局限的制约:现有数据集规模小、任务覆盖范围窄,并且当前方法将触觉信号视为被动观察,而未能利用这些信号来建模接触动态或明确启用闭环控制。在本文中,我们提出了OmniViTac,这是一个大规模的视觉触觉-动作数据集,包含超过21,000条轨迹,涵盖86个任务和100多个物体,组织为六种基于物理的交互模式。在此数据集的基础上,我们提出了OmniVTA,一个基于世界模型的视觉触觉操作框架,集成了四个紧密耦合的模块:自监督触觉编码器、用于预测短期接触演变的双流视觉触觉世界模型、用于动作生成的接触感知融合策略,以及一个60Hz的反射控制器,能够在闭环中修正预测和观察到的触觉信号之间的偏差。针对所有六种交互类别的真实机器人实验表明,OmniVTA优于现有方法,并在未见过的物体和几何配置下具有良好的泛化能力,证实了将预测接触建模与高频触觉反馈结合用于接触丰富操作的价值。所有数据、模型和代码将在项目网站https://mrsecant.github.io/OmniVTA上公开提供。
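OmniVTA's 60 Hz reflexive controller corrects deviations between predicted and observed tactile signals in a closed loop. A minimal proportional-correction sketch of that idea follows; the gain, clipping, and names are illustrative assumptions, not the paper's controller:

```python
def reflex_correction(predicted, observed, gain=0.5, max_delta=0.05):
    """Per-channel corrective action delta, proportional to the tactile
    prediction error and clipped to a safe magnitude; intended to be added
    to the policy's commanded action at each 60 Hz tick."""
    return [
        max(-max_delta, min(max_delta, gain * (obs - pred)))
        for pred, obs in zip(predicted, observed)
    ]
```

When the world model's predicted contact evolution matches the observed tactile reading, the correction is zero and the fusion policy's action passes through unchanged.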
cs.RO / 48 / 2603.19229
NavTrust: Benchmarking Trustworthiness for Embodied Navigation
NavTrust:评估具身导航的可信度基准
Abstract
There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.
Chinese Translation
具身导航主要分为两大类:视觉-语言导航(Vision-Language Navigation, VLN),其中代理通过遵循自然语言指令进行导航;以及目标物体导航(Object-Goal Navigation, OGN),代理则导航至特定目标物体。然而,现有研究主要在正常条件下评估模型性能,忽视了现实环境中可能出现的各种损坏。为了解决这一问题,我们提出了NavTrust,一个统一的基准,系统地在逼真的场景中对输入模态进行损坏,包括RGB、深度和指令,并评估这些因素对导航性能的影响。据我们所知,NavTrust是第一个在统一框架中将具身导航代理暴露于多种RGB-深度损坏和指令变化的基准。我们对七种最先进方法的广泛评估表明,在现实损坏下性能显著下降,这突显了关键的鲁棒性差距,并提供了通向更可信的具身导航系统的路线图。此外,我们系统地评估了四种不同的缓解策略,以增强对RGB-深度和指令损坏的鲁棒性。我们的基础模型包括Uni-NaVid和ETPNav。我们将它们部署在一台真实的移动机器人上,并观察到对损坏的鲁棒性有所提高。项目网站为:https://navtrust.github.io。
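The systematic input corruption NavTrust applies can be sketched as a seeded noise transform on the RGB modality. This is illustrative only — the benchmark's actual corruption families, severities, and instruction perturbations are defined in the paper:

```python
import random

def corrupt_rgb(pixels, severity=0.1, seed=0):
    """Add zero-mean Gaussian noise (std = severity) to a flat list of
    pixel intensities in [0, 1], clamping back into range so the corrupted
    image remains a valid input for the navigation agent."""
    rng = random.Random(seed)  # fixed seed: corruptions are reproducible
    return [min(1.0, max(0.0, p + rng.gauss(0.0, severity))) for p in pixels]
```

Seeding each corruption makes the benchmark deterministic, so different agents are evaluated against the identical corrupted inputs.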
cs.RO / 49 / 2603.19233
Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models
并非所有特征都是平等的:视觉-语言-动作模型的机制研究
Abstract
Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M--7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers near-identical behavior, while cross-task injection steers robots toward source-task positions (99.8\% of X-VLA episodes align with the source trajectory), exposing spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA \texttt{libero\_goal}: 94\%$\to$10\% under wrong prompts vs.\ \texttt{libero\_object}: 60--100\% regardless). In all three multi-pathway architectures (\pizhalf{}, SmolVLA, GR00T), expert pathways encode motor programs while VLM pathways encode goal semantics ($2\times$ greater behavioral displacement from expert injection), and subspace injection confirms these occupy separable activation subspaces. Per-token SAE processing is essential for action fidelity on most architectures, though mean-pooling improves fidelity on X-VLA. Contrastive identification recovers 82+ manipulation concepts, and causal ablation reveals sensitivity spanning 28--92\% zero-effect rates independent of representation width. We release \textbf{Action Atlas} (https://action-atlas.com) for interactive exploration of VLA representations across all six models.
Chinese Translation
视觉-语言-动作(VLA)模型将感知、语言和运动控制结合在一个单一架构中,但它们如何将多模态输入转化为动作仍然不甚明了。我们将激活注入、稀疏自编码器(SAE)和线性探针应用于六个模型,其参数规模从8000万到70亿不等,涵盖四个基准测试中超过394,000个回合的实验。视觉通路在所有架构中主导动作生成:将基线激活注入空提示回合可恢复几乎相同的行为,而跨任务注入则引导机器人朝向源任务位置(99.8%的X-VLA回合与源轨迹对齐),揭示了绑定于场景坐标而非抽象任务表示的空间化运动程序。语言敏感性取决于任务结构,而非模型设计:当视觉上下文唯一指定任务时,语言被忽略;当多个目标共享一个场景时,语言变得至关重要(X-VLA libero_goal:在错误提示下从94%降至10%,而libero_object:无论提示如何均为60--100%)。在所有三种多路径架构($\pi_{0.5}$、SmolVLA、GR00T)中,专家通路编码运动程序,而VLM通路编码目标语义(专家注入导致的行为位移大2倍),子空间注入确认二者占据可分离的激活子空间。逐标记的SAE处理对大多数架构的动作保真度至关重要,尽管均值池化在X-VLA上反而提高了保真度。对比识别恢复了82个以上的操作概念,而因果消融揭示了28--92%的零效应率,与表示宽度无关。我们发布了Action Atlas(https://action-atlas.com),以便对所有六个模型的VLA表示进行交互式探索。
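Activation injection — overwriting an intermediate activation with one cached from another ("source") rollout and observing the behavioral change — can be sketched on a toy two-layer network. This is a hypothetical illustration of the technique; the paper's experiments operate on real VLA checkpoints:

```python
def forward(x, w1, w2, inject_hidden=None):
    """Toy two-layer ReLU network. If inject_hidden is given, the hidden
    activation is replaced wholesale with the cached source activation,
    mimicking the cross-task injection experiment."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    if inject_hidden is not None:
        hidden = inject_hidden  # overwrite with the source rollout's activation
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]
```

If the injected output matches what the source input would have produced, the downstream computation is driven by that activation site — the same logic behind "injecting baseline activations recovers near-identical behavior."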
cs.CV / 1 / 2603.18045
RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers
基于视觉变换器的胶囊内窥镜视频罕见病检测
Abstract
This work corresponds to the Gastro Competition for multi-label classification of capsule endoscopic videos (CEV). A Transformer-based deep learning network is fine-tuned for this task. The base model is the Google Vision Transformer (ViT) base patch16 at 224 x 224 resolution. In total, 17 labels are classified: mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve, active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, and ulcer. On the test dataset of three videos, the overall mAP @0.5 is 0.0205, whereas the overall mAP @0.95 is 0.0196.
Chinese Translation
本研究对应于胶囊内窥镜视频(CEV)的多标签分类胃肠竞赛。一个基于变换器的深度学习网络经过微调以适应该任务。所使用的基础模型为谷歌视觉变换器(ViT)base patch16,输入分辨率为224 x 224。总共分类了17个标签,包括口腔、食道、胃、小肠、结肠、Z线、幽门、回盲瓣、活动性出血、血管扩张、血液、糜烂、红斑、血红素、淋巴管扩张、息肉和溃疡。在三个视频的测试数据集中,整体mAP @0.5为0.0205,而整体mAP @0.95为0.0196。
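The mAP figures above average per-label average precision (AP). The detection-style mAP@0.5 reported by the competition additionally thresholds matches on IoU, but the ranking core of AP is the same; a minimal sketch for one label:

```python
def average_precision(scores, labels):
    """AP for one label: mean of precision@k over the ranks at which a
    positive is retrieved (labels are 0/1, scores are model confidences)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

mAP is then the mean of this quantity over the 17 labels; an overall mAP near 0.02 indicates positives are ranked far down the score-ordered list.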
cs.CV / 2 / 2603.18062
S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition
S3T-Former:一种纯粹由脉冲驱动的状态空间拓扑变换器用于骨架动作识别
Abstract
Skeleton-based action recognition is crucial for multimedia applications but heavily relies on power-hungry Artificial Neural Networks (ANNs), limiting their deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) provide an energy-efficient alternative; however, existing spiking models for skeleton data often compromise the intrinsic sparsity of SNNs by resorting to dense matrix aggregations, heavy multimodal fusion modules, or non-sparse frequency domain transformations. Furthermore, they severely suffer from the short-term amnesia of spiking neurons. In this paper, we propose the Spiking State-Space Topology Transformer (S3T-Former), which, to the best of our knowledge, is the first purely spike-driven Transformer architecture specifically designed for energy-efficient skeleton action recognition. Rather than relying on heavy fusion overhead, we formulate a Multi-Stream Anatomical Spiking Embedding (M-ASE) that acts as a generalized kinematic differential operator, elegantly transforming multimodal skeleton features into heterogeneous, highly sparse event streams. To achieve true topological and temporal sparsity, we introduce Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation, and a Spiking State-Space (S3) Engine to systematically capture long-range temporal dynamics without non-sparse spectral workarounds. Extensive experiments on multiple large-scale datasets demonstrate that S3T-Former achieves highly competitive accuracy while theoretically reducing energy consumption compared to classic ANNs, establishing a new state-of-the-art for energy-efficient neuromorphic action recognition.
Chinese Translation
基于骨架的动作识别对于多媒体应用至关重要,但它在很大程度上依赖耗电的人工神经网络(ANN),限制了其在资源受限的边缘设备上的部署。脉冲神经网络(SNN)提供了一种节能的替代方案;然而,现有的骨架数据脉冲模型往往通过密集矩阵聚合、繁重的多模态融合模块或非稀疏频域变换来损害SNN的内在稀疏性。此外,它们严重受限于脉冲神经元的短期遗忘。在本文中,我们提出了脉冲状态空间拓扑变换器(S3T-Former),据我们所知,它是首个专为节能骨架动作识别而设计的纯脉冲驱动变换器架构。我们并未依赖繁重的融合开销,而是构造了一个多流解剖脉冲嵌入(M-ASE),它充当一个广义的运动学微分算子,优雅地将多模态骨架特征转化为异质的高度稀疏事件流。为了实现真正的拓扑和时间稀疏性,我们引入了侧向脉冲拓扑路由(LSTR)以进行按需条件脉冲传播,并设计了一种脉冲状态空间(S3)引擎,用以系统性地捕获长程时间动态,而无需采用非稀疏的频谱变通方法。对多个大规模数据集的广泛实验表明,S3T-Former在理论上能显著降低能耗的同时,保持高度竞争的准确性,确立了节能神经形态动作识别的新的最先进水平。
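The spiking neurons underlying such architectures are commonly modeled as leaky integrate-and-fire (LIF) units, whose hard reset is one source of the "short-term amnesia" the abstract mentions. A generic single-step LIF update (a textbook sketch, not the specific neuron model used in S3T-Former):

```python
def lif_step(v, inp, tau=2.0, v_th=1.0):
    """One discrete LIF update: leak the membrane potential toward the
    input current, emit a spike when it crosses threshold, then hard-reset.
    Returns (new_potential, spike)."""
    v = v + (inp - v) / tau  # leaky integration with time constant tau
    if v >= v_th:
        return 0.0, 1        # spike and reset: past state is discarded
    return v, 0
```

The reset-to-zero on each spike is exactly why plain LIF chains struggle with long-range temporal dependencies, motivating the state-space (S3) engine described above.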
cs.CV / 3 / 2603.18067
DarkDriving: A Real-World Day and Night Aligned Dataset for Autonomous Driving in the Dark Environment
DarkDriving:一个用于暗环境下自主驾驶的真实世界昼夜对齐数据集
Abstract
The low-light conditions are challenging to vision-centric perception systems for autonomous driving in the dark environment. In this paper, we propose a new benchmark dataset (named DarkDriving) to investigate low-light enhancement for autonomous driving. The existing real-world low-light enhancement benchmark datasets can be collected only by controlling various exposures in small ranges and static scenes. The dark images of the current nighttime driving datasets do not have precisely aligned daytime counterparts. The extreme difficulty of collecting a real-world day and night aligned dataset in dynamic driving scenes has significantly limited research in this area. With a proposed automatic day-night Trajectory Tracking based Pose Matching (TTPM) method in a large real-world closed driving test field (area: 69 acres), we collected the first real-world day and night aligned dataset for autonomous driving in the dark environment. The DarkDriving dataset has 9,538 day and night image pairs precisely aligned in location and spatial contents, whose alignment error is within just several centimeters. For each pair, we also manually label the object 2D bounding boxes. DarkDriving introduces four perception-related tasks, including low-light enhancement, generalized low-light enhancement, and low-light enhancement for 2D detection and 3D detection of autonomous driving in the dark environment. The experimental results show that our DarkDriving dataset provides a comprehensive benchmark for evaluating low-light enhancement for autonomous driving, and it can also be generalized to enhance dark images and promote detection in other low-light driving environments, such as nuScenes.
Chinese Translation
低光照条件对以视觉为中心的自主驾驶感知系统在暗环境中构成了挑战。本文提出了一个新的基准数据集(命名为 DarkDriving),旨在研究自主驾驶中的低光增强。现有的真实世界低光增强基准数据集只能通过在小范围和静态场景中控制各种曝光来收集。当前夜间驾驶数据集中的暗图像并没有与之精确对齐的白天图像。在动态驾驶场景中收集真实世界昼夜对齐数据集的极大困难,显著限制了该领域的研究。通过在一个大型真实世界封闭驾驶测试场(面积:69英亩)中提出的基于轨迹跟踪的自动昼夜姿态匹配(Trajectory Tracking based Pose Matching, TTPM)方法,我们收集了第一个用于暗环境下自主驾驶的真实世界昼夜对齐数据集。DarkDriving 数据集包含 9,538 对在位置和空间内容上精确对齐的昼夜图像对,其对齐误差仅在几厘米之内。对于每对图像,我们还手动标注了物体的二维边界框。DarkDriving 引入了四个与感知相关的任务,包括低光增强、广义低光增强,以及在暗环境下自主驾驶的二维检测和三维检测的低光增强。实验结果表明,我们的 DarkDriving 数据集为评估自主驾驶的低光增强提供了一个全面的基准,同时也可以推广到增强暗图像并促进其他低光驾驶环境(如 nuScenes)的检测。
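The pose-matching step of a TTPM-style pipeline can be sketched as nearest-neighbor association between the day and night traverses. This is illustrative only — the described method operates on tracked trajectories and achieves centimeter-level alignment, which a bare nearest-neighbor match does not by itself guarantee:

```python
import math

def match_poses(night_poses, day_poses):
    """For each night pose (x, y), return the index of the nearest day pose
    by Euclidean distance, pairing frames between the two traverses."""
    return [
        min(range(len(day_poses)), key=lambda j: math.dist(n, day_poses[j]))
        for n in night_poses
    ]
```

Once frames are paired by pose, each night image has an approximately co-located daytime counterpart, which is the basis of the aligned image pairs.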
cs.CV / 4 / 2603.18086
SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation
SSP-SAM:结合语义-空间提示的SAM用于指代表达分割
Abstract
The Segment Anything Model (SAM) excels at general image segmentation but has limited ability to understand natural language, which restricts its direct application in Referring Expression Segmentation (RES). Toward this end, we propose SSP-SAM, a framework that fully utilizes SAM's segmentation capabilities by integrating a Semantic-Spatial Prompt (SSP) encoder. Specifically, we incorporate both visual and linguistic attention adapters into the SSP encoder, which highlight salient objects within the visual features and discriminative phrases within the linguistic features. This design enhances the referent representation for the prompt generator, resulting in high-quality SSPs that enable SAM to generate precise masks guided by language. Although not specifically designed for Generalized RES (GRES), where the referent may correspond to zero, one, or multiple objects, SSP-SAM naturally supports this more flexible setting without additional modifications. Extensive experiments on widely used RES and GRES benchmarks confirm the superiority of our method. Notably, our approach generates segmentation masks of high quality, achieving strong precision even at strict thresholds such as [email protected]. Further evaluation on the PhraseCut dataset demonstrates improved performance in open-vocabulary scenarios compared to existing state-of-the-art RES methods. The code and checkpoints are available at: https://github.com/WayneTomas/SSP-SAM.
Chinese Translation
Segment Anything Model (SAM) 在一般图像分割方面表现出色,但对自然语言的理解能力有限,这限制了其在指代表达分割(RES)中的直接应用。为此,我们提出了SSP-SAM,一个通过集成语义-空间提示(SSP)编码器充分利用SAM分割能力的框架。具体而言,我们将视觉和语言注意力适配器结合到SSP编码器中,以突出视觉特征中的显著对象和语言特征中的区分短语。这一设计增强了提示生成器的指代表示,产生高质量的SSP,使得SAM能够生成由语言引导的精确掩膜。尽管SSP-SAM并非专门为广义指代表达分割(GRES)设计,其中指代对象可能对应于零、一个或多个对象,但SSP-SAM自然支持这一更灵活的设置,无需额外修改。在广泛使用的RES和GRES基准上进行的广泛实验验证了我们方法的优越性。值得注意的是,我们的方法生成高质量的分割掩膜,即使在严格的阈值(如 [email protected])下也能实现强大的精确度。进一步在PhraseCut数据集上的评估显示,与现有的最先进RES方法相比,在开放词汇场景中表现出更好的性能。代码和检查点可在以下链接获取:https://github.com/WayneTomas/SSP-SAM。
cs.CV / 5 / 2603.18089
CytoSyn: a Foundation Diffusion Model for Histopathology -- Tech Report
CytoSyn:一种用于组织病理学的基础扩散模型 -- 技术报告
Abstract
Computational pathology has made significant progress in recent years, fueling advances in both fundamental disease understanding and clinically ready tools. This evolution is driven by the availability of large amounts of digitized slides and specialized deep learning methods and models. Multiple self-supervised foundation feature extractors have been developed, enabling downstream predictive applications from cell segmentation to tumor sub-typing and survival analysis. In contrast, generative foundation models designed specifically for histopathology remain scarce. Such models could address tasks that are beyond the capabilities of feature extractors, such as virtual staining. In this paper, we introduce CytoSyn, a state-of-the-art foundation latent diffusion model that enables the guided generation of highly realistic and diverse histopathology H&E-stained images, as shown in an extensive benchmark. We explored methodological improvements, training set scaling, sampling strategies and slide-level overfitting, culminating in the improved CytoSyn-v2, and compared our work to PixCell, a state-of-the-art model, in an in-depth manner. This comparison highlighted the strong sensitivity of both diffusion models and performance metrics to preprocessing-specific details such as JPEG compression. Our model has been trained on a dataset obtained from more than 10,000 TCGA diagnostic whole-slide images of 32 different cancer types. Despite being trained only on oncology slides, it maintains state-of-the-art performance generating inflammatory bowel disease images. To support the research community, we publicly release CytoSyn's weights, its training and validation datasets, and a sample of synthetic images in this repository: https://huggingface.co/Owkin-Bioptimus/CytoSyn.
Chinese Translation
计算病理学近年来取得了显著进展,推动了对基础疾病理解和临床准备工具的进步。这一演变得益于大量数字化切片和专门的深度学习方法与模型的可用性。已经开发了多种自监督基础特征提取器,使得从细胞分割到肿瘤亚型分类和生存分析等下游预测应用成为可能。相比之下,专门为组织病理学设计的生成基础模型仍然稀缺。这些模型可以处理超出特征提取器能力的任务,例如虚拟染色。在本文中,我们介绍了CytoSyn,一种最先进的基础潜在扩散模型,能够引导生成高度真实和多样化的组织病理学H&E染色图像,正如在广泛的基准测试中所示。我们探讨了方法改进、训练集扩展、采样策略和切片级过拟合,最终得出了改进版的CytoSyn-v2,并与最先进的模型PixCell进行了深入比较。这一比较突显了这两种扩散模型和性能指标对预处理特定细节(如JPEG压缩)的强敏感性。我们的模型在超过10,000个TCGA诊断全切片图像的32种不同癌症类型的数据集上进行了训练。尽管仅在肿瘤学切片上训练,但它在生成炎症性肠病图像方面仍保持最先进的性能。为了支持研究社区,我们在此存储库中公开发布了CytoSyn的权重、训练和验证数据集,以及一组合成图像的样本:https://huggingface.co/Owkin-Bioptimus/CytoSyn。
cs.CV / 6 / 2603.18091
Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model
行动草拟与验证:一个自我验证的视觉-语言-行动模型框架
Abstract
Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in real-world over diffusion-based baseline, with a single-pass VLM reranking overhead.
Chinese Translation
视觉-语言-行动(VLA)模型最近在具身任务中表现出了强劲的性能。现代的VLA通常采用扩散动作专家来高效地生成高精度的连续动作片段,而自回归生成在低级控制中可能会较慢且不够准确。然而,自回归范式仍然提供了互补的先验知识,可以提高在分布外环境中的鲁棒性和泛化能力。为了充分利用这两种范式,我们提出了行动草拟与验证(ADV):扩散动作专家草拟多个候选动作片段,VLM通过使用一种困惑度样式的度量在单个前向传递中对所有候选进行评分,从而选择一个。在匹配的骨干网络、训练数据和动作片段长度下,ADV在模拟中将成功率提高了4.3个百分点,在真实世界中提高了19.7个百分点,且仅需一次传递的VLM重新排序开销。
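ADV's verification step — scoring each drafted action chunk with a perplexity-style metric and executing the best — reduces to the following selection rule. This is a sketch: how the VLM produces per-token log-probabilities in a single forward pass is abstracted away, and the function names are ours:

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) of one candidate chunk's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_chunk(candidates):
    """candidates: list of (chunk, per-token log-probs under the VLM).
    The verifier keeps the chunk it finds most likely (lowest perplexity)."""
    return min(candidates, key=lambda c: perplexity(c[1]))[0]
```

Because all candidates are scored in one batched forward pass, the reranking overhead stays close to a single VLM evaluation regardless of how many chunks the diffusion expert drafts.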
cs.CV / 7 / 2603.18093
One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control
一对多:具有注意力控制的高保真无训练异常生成
Abstract
Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.
Chinese Translation
工业异常检测(AD)的特点是正常图像数量众多,而异常图像数量稀缺。尽管已经提出了众多少样本异常合成方法,以增强下游AD任务的异常数据,但大多数现有方法都需要耗时的训练,并且难以学习忠实于真实异常的分布,从而限制了基于这些数据训练的AD模型的有效性。为了解决这些限制,我们提出了一种无训练的少样本异常生成方法,即O2MAG,该方法利用一幅参考异常图像中的自注意力来合成更多真实的异常,以支持有效的下游异常检测。具体而言,O2MAG通过自注意力嫁接操控三个并行扩散过程,并结合异常掩模以减轻前景-背景查询混淆,合成紧密遵循真实异常分布的文本引导异常。为了弥合编码的异常文本提示与真实异常语义之间的语义差距,我们进一步引入了异常引导优化,以使合成过程与目标异常分布对齐,引导生成朝向现实且与文本一致的异常。此外,为了减轻异常掩模内的微弱异常合成,在生成过程中采用了双重注意力增强,以增强对掩蔽区域的自注意力和交叉注意力。大量实验证实了O2MAG的有效性,展示了其在下游AD任务上优于以往最先进方法的卓越性能。
cs.CV / 8 / 2603.18095
Q-Drift: Quantization-Aware Drift Correction for Diffusion Model Sampling
Q-Drift:面向量化的扩散模型采样漂移修正
Abstract
Post-training quantization (PTQ) is a practical path to deploy large diffusion models, but quantization noise can accumulate over the denoising trajectory and degrade generation quality. We propose Q-Drift, a principled sampler-side correction that treats quantization error as an implicit stochastic perturbation on each denoising step and derives a marginal-distribution-preserving drift adjustment. Q-Drift estimates a timestep-wise variance statistic from calibration, in practice requiring as few as 5 paired full-precision/quantized calibration runs. The resulting sampler correction is plug-and-play with common samplers, diffusion models, and PTQ methods, while incurring negligible overhead at inference. Across six diverse text-to-image models (spanning DiT and U-Net), three samplers (Euler, flow-matching, DPM-Solver++), and two PTQ methods (SVDQuant, MixDQ), Q-Drift improves FID over the corresponding quantized baseline in most settings, with up to 4.59 FID reduction on PixArt-Sigma (SVDQuant W3A4), while preserving CLIP scores.
Chinese Translation
后训练量化(PTQ)是部署大型扩散模型的一条实用路径,但量化噪声在去噪轨迹上可能会累积,从而降低生成质量。我们提出了Q-Drift,这是一种原则性的采样器侧修正方法,将量化误差视为每个去噪步骤上的隐式随机扰动,并推导出一种保留边际分布的漂移调整方法。Q-Drift从校准中估计逐步的方差统计,实际操作中只需5个配对的全精度/量化校准运行。得到的采样器修正方法可与常见的采样器、扩散模型和PTQ方法无缝结合,并在推理时造成可忽略的开销。在六种不同的文本到图像模型(包括DiT和U-Net)、三种采样器(Euler、flow-matching、DPM-Solver++)和两种PTQ方法(SVDQuant、MixDQ)中,Q-Drift在大多数情况下提高了FID,超过了相应的量化基线,在PixArt-Sigma(SVDQuant W3A4)中最高可减少4.59的FID,同时保持CLIP分数。
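One way to read the marginal-distribution-preserving adjustment: if quantization injects implicit noise with a calibrated per-step std tau_t, the sampler's own noise can be shrunk so the total per-step variance matches the full-precision trajectory. The following is a hypothetical variance-matching sketch of that reading, not the paper's exact drift correction:

```python
import math

def adjusted_noise_std(sigma_t, tau_t):
    """Std of sampler noise to inject at step t, given the full-precision
    noise schedule sigma_t and a calibrated quantization-noise std tau_t,
    so that sigma_adj**2 + tau_t**2 == sigma_t**2 (floored at zero when
    quantization noise already exceeds the schedule)."""
    return math.sqrt(max(sigma_t ** 2 - tau_t ** 2, 0.0))
```

The per-step tau_t would come from the calibration runs the abstract mentions (as few as 5 paired full-precision/quantized trajectories), which is why the correction is cheap and plug-and-play.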
cs.CV / 9 / 2603.18101
Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
仅训练异构图像-补丁-文本图监督以推动少样本学习适配器的发展
Abstract
Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to supervise this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.
Chinese Translation
最近,基于适配器的 CLIP 调优(如 Tip-Adapter)是一种强大的少样本学习者,通过缓存支持特征以实现快速原型匹配,从而提高效率。然而,这些方法依赖于全局单模态特征向量,忽视了细粒度补丁关系及其与类别文本的结构对齐。为了在不增加推理成本的情况下弥补这一差距,我们提出了一种新颖的非对称仅训练框架。我们构建了一个高容量的辅助异构图教师(Heterogeneous Graph Teacher),该教师仅在训练期间运行,而不是改变轻量级适配器。这个教师(i)将多尺度的视觉补丁和文本提示整合到一个统一的图中,(ii)通过模态感知图变换器(Modality-aware Graph Transformer, MGT)进行深度跨模态推理,和(iii)应用区分性节点过滤以提取高保真的类别特征。至关重要的是,我们采用了一种缓存感知的双目标策略,将这种关系知识直接监督到 Tip-Adapter 的键值缓存中,有效地升级原型,而图教师在测试时被丢弃。因此,推理与 Tip-Adapter 保持一致,没有额外的延迟或内存。我们的算法在标准的 1-16 次样本基准上持续建立新的最先进水平。消融实验确认了辅助图监督、文本引导推理和节点过滤是实现稳健少样本适应的关键因素。代码可在 https://github.com/MR-Sherif/TOGA.git 获取。
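The key-value cache being supervised here follows Tip-Adapter's form: cached support features are keys, one-hot labels are values, and a query is classified by sharpened cosine affinity against the cache. A minimal sketch of that cache read (the exp(-beta * (1 - cos)) affinity is from the Tip-Adapter paper; the surrounding code is ours):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cache_logits(query, keys, values, beta=5.0):
    """Tip-Adapter-style cache read: each support pair (key, one-hot value)
    votes with weight exp(-beta * (1 - cos(query, key)))."""
    logits = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        w = math.exp(-beta * (1.0 - cosine(query, k)))
        logits = [l + w * vi for l, vi in zip(logits, v)]
    return logits
```

TOGA's contribution is to distill the graph teacher's relational knowledge into these cached keys and values during training, so this zero-extra-latency lookup is all that runs at test time.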
cs.CV / 10 / 2603.18108
From Concepts to Judgments: Interpretable Image Aesthetic Assessment
从概念到判断:可解释的图像美学评估
Abstract
Image aesthetic assessment (IAA) aims to predict the aesthetic quality of images as perceived by humans. While recent IAA models achieve strong predictive performance, they offer little insight into the factors driving their predictions. Yet for users, understanding why an image is considered pleasing or not is as valuable as the score itself, motivating growing interest in interpretability within IAA. When humans evaluate aesthetics, they naturally rely on high-level cues to justify their judgments. Motivated by this observation, we propose an interpretable IAA framework grounded in human-understandable aesthetic concepts. We learn these concepts in an accessible manner, constructing a subspace that forms the foundation of an inherently interpretable model. To capture nuanced influences on aesthetic perception beyond explicit concepts, we introduce a simple yet effective residual predictor. Experiments on photographic and artistic datasets demonstrate that our method achieves competitive predictive performance while offering transparent, human-understandable aesthetic judgments.
Chinese Translation
图像美学评估(Image Aesthetic Assessment, IAA)旨在预测人类感知的图像美学质量。虽然近期的 IAA 模型在预测性能上表现出色,但它们对驱动预测的因素却鲜有深入的洞察。然而,对于用户而言,理解为什么一幅图像被认为是令人愉悦的,与得分本身一样重要,这推动了对 IAA 中可解释性的日益关注。当人类评估美学时,他们自然依赖高层次线索来证明自己的判断。基于这一观察,我们提出了一种以人类可理解的美学概念为基础的可解释 IAA 框架。我们以可接近的方式学习这些概念,构建一个子空间,形成一个内在可解释模型的基础。为了捕捉对美学感知的微妙影响,超越显式概念,我们引入了一种简单而有效的残差预测器。在摄影和艺术数据集上的实验表明,我们的方法在提供透明且人类可理解的美学判断的同时,仍能实现具有竞争力的预测性能。
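The interpretable part of such a model is a score over human-named aesthetic concepts plus a small residual for influences the concepts miss. A minimal sketch of that decomposition, under our own assumed form and names (the paper's concept subspace and residual predictor are learned, not hand-set):

```python
def aesthetic_score(concept_scores, concept_weights, residual=0.0):
    """Returns (per-concept contributions, final score). The contributions
    are directly readable as the 'why' behind the judgment; the residual
    absorbs what the explicit concepts do not explain."""
    contributions = {
        name: w * s
        for (name, w), s in zip(concept_weights.items(), concept_scores)
    }
    return contributions, sum(contributions.values()) + residual
```

Keeping the residual small relative to the concept term is what preserves interpretability while recovering the predictive performance of a black-box regressor.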
cs.CV / 11 / 2603.18118
Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models
Insight-V++:基于多模态大型语言模型的高级长链视觉推理研究
Abstract
Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.
Chinese Translation
大型语言模型(LLMs)通过延长测试时推理实现了显著的可靠性和先进能力。然而,将这些能力扩展到多模态大型语言模型(MLLMs)仍然是一个重大挑战,因为缺乏高质量的长链推理数据和优化的训练流程。为了解决这一问题,我们提出了一个统一的多智能体视觉推理框架,该框架系统地从我们的基础图像中心模型Insight-V演变为一个通用的时空架构Insight-V++。我们首先提出了一种可扩展的数据生成管道,配备多粒度评估,能够自主合成跨图像和视频领域的结构化、复杂推理轨迹,而无需人工干预。我们认识到,直接用如此复杂的数据来监督MLLMs会产生次优结果,因此我们设计了一个双智能体架构,包括一个推理智能体来执行广泛的分析链,以及一个摘要智能体来批判性地评估和提炼最终结果。虽然我们的初始框架使用了直接偏好优化(DPO),但其离线策略的特性在根本上限制了强化学习的潜力。为克服这些局限性,特别是在长时间视频理解方面,Insight-V++引入了两种新算法,ST-GRPO和J-GRPO,增强了时空推理并提高了评估的鲁棒性。通过利用摘要智能体提供的可靠反馈,我们引导了一个迭代推理路径生成过程,在一个持续的自我改进循环中重新训练整个多智能体系统。在LLaVA-NeXT和Qwen2.5-VL等基础模型上的广泛实验表明,在具有挑战性的图像和视频推理基准上显著提高了性能,同时在传统的以感知为中心的任务上保持了强大的能力。
cs.CV / 12 / 2603.18178
VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events
VLM-AutoDrive:用于安全关键自主驾驶事件的后训练视觉-语言模型
Abstract
The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.
Chinese Translation
自我中心的行车记录仪视频的快速增长为检测安全关键事件(如碰撞和近碰撞)带来了重大挑战,这些场景短暂、罕见且难以被通用视觉模型捕捉。尽管多模态大型语言模型(MLLMs)展示了强大的通用推理能力,但由于领域和时间的错位,它们在驾驶环境中的表现不佳。我们提出了VLM-AutoDrive,一个模块化的后训练框架,用于将预训练的视觉-语言模型(VLMs)适应于高保真异常检测。该框架整合了基于元数据生成的标题、LLM生成的描述、视觉问答(VQA)对以及思维链(CoT)推理监督,以实现领域对齐和可解释的学习。在零样本设置下,现成的VLM(如NVIDIA的Cosmos-Reason1 7B (CR1))在碰撞召回率上接近零;使用VLM-AutoDrive进行微调后,碰撞F1值从0.00提高到0.69,整体准确率从35.35%提升至77.27%。VLM-AutoDrive为将通用VLM适应于安全关键、时间局部的感知任务提供了一种可扩展的方法。在真实世界的Nexar行车记录仪视频上进行评估时,它在碰撞和近碰撞检测方面取得了显著提升,同时生成了可解释的推理痕迹,弥合了自主驾驶中的感知、因果关系和决策推理之间的差距。
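The reported jump in Collision F1 (0.00 to 0.69) is the harmonic mean of precision and recall on the Collision class; for reference, the computation from confusion counts:

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives.
    Zero recall (tp == 0), as in the zero-shot setting, forces F1 to zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

This makes the zero-shot result concrete: near-zero Collision recall pins F1 at 0.00 no matter how precise the few detections are.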
cs.CV / 13 / 2603.18192
MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles
MicroVision:一个开放的数据集和基准模型,用于检测易受伤害的道路使用者和微型出行车辆
Abstract
Micromobility is a growing mode of transportation, raising new challenges for traffic safety and planning due to increased interactions in areas where vulnerable road users (VRUs) share the infrastructure with micromobility, including parked micromobility vehicles (MMVs). Approaches to support traffic safety and planning increasingly rely on detecting road users in images -- a computer-vision task relying heavily on the quality of the images to train on. However, existing open image datasets for training such models lack focus and diversity in VRUs and MMVs, for instance, by categorizing both pedestrians and MMV riders as "person", or by not including new MMVs like e-scooters. Furthermore, datasets are often captured from a car perspective and lack data from areas where only VRUs travel (sidewalks, cycle paths). To help close this gap, we introduce the MicroVision dataset: an open image dataset and annotations for training and evaluating models for detecting the most common VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters), from a VRU perspective. The dataset, recorded in Gothenburg (Sweden), consists of more than 8,000 anonymized, full-HD images with more than 30,000 carefully annotated VRUs and MMVs, captured over an entire year and part of almost 2,000 unique interaction scenes. Along with the dataset, we provide first benchmark object-detection models based on state-of-the-art architectures, which achieved a mean average precision of up to 0.723 on an unseen test set. The dataset and model can support traffic safety to distinguish between different VRUs and MMVs, or help monitoring systems identify the use of micromobility. The dataset and model weights can be accessed at https://doi.org/10.71870/eepz-jd52.
Chinese Translation
微型出行是一种日益增长的交通方式,因易受伤害的道路使用者(VRUs)与微型出行车辆(MMVs)共用基础设施而在交通安全和规划方面带来了新的挑战,尤其是在有停放的微型出行车辆的区域。支持交通安全和规划的方案越来越依赖于图像中道路使用者的检测——这是一项计算机视觉任务,严重依赖于训练图像的质量。然而,现有的用于训练此类模型的开放图像数据集在VRUs和MMVs方面缺乏针对性和多样性,例如,将行人和MMV骑行者统称为“人”,或者未包括新型的MMVs如电动滑板车。此外,数据集往往是从汽车视角捕获的,缺乏仅涵盖VRUs行驶的区域(人行道、自行车道)的数据。为弥补这一空白,我们推出了MicroVision数据集:这是一个开放的图像数据集及其注释,用于训练和评估检测最常见的VRUs(行人、自行车骑行者、电动滑板车骑行者)和静态MMVs(自行车、电动滑板车)的模型,以VRU的视角进行记录。该数据集在瑞典哥德堡记录,包含超过8,000张匿名全高清图像,以及超过30,000个经过仔细注释的VRUs和MMVs,捕获于整整一年的时间,涉及近2,000个独特的互动场景。我们还提供了基于最先进架构的首个基准目标检测模型,在未见测试集上达到了高达0.723的平均精度。该数据集和模型可以支持交通安全,以区分不同的VRUs和MMVs,或帮助监测系统识别微型出行的使用。数据集及模型权重可通过以下链接获取:https://doi.org/10.71870/eepz-jd52。
cs.CV / 14 / 2603.18218
Semantic Segmentation and Depth Estimation for Real-Time Lunar Surface Mapping Using 3D Gaussian Splatting
基于3D高斯溅射的实时月球表面建图的语义分割与深度估计
Abstract
Navigation and mapping on the lunar surface require robust perception under challenging conditions, including poorly textured environments, high-contrast lighting, and limited computational resources. This paper presents a real-time mapping framework that integrates dense perception models with a 3D Gaussian Splatting (3DGS) representation. We first benchmark several models on synthetic datasets generated with the LuPNT simulator, selecting a stereo dense depth estimation model based on Gated Recurrent Units for its balance of speed and accuracy in depth estimation, and a convolutional neural network for its superior performance in detecting semantic segments. Using ground truth poses to decouple the local scene understanding from the global state estimation, our pipeline reconstructs a 120-meter traverse with a geometric height accuracy of approximately 3 cm, outperforming a traditional point cloud baseline without LiDAR. The resulting 3DGS map enables novel view synthesis and serves as a foundation for a full SLAM system, where its capacity for joint map and pose optimization would offer significant advantages. Our results demonstrate that combining semantic segmentation and dense depth estimation with learned map representations is an effective approach for creating detailed, large-scale maps to support future lunar surface missions.
Chinese Translation
在月球表面的导航和建图需要在挑战性条件下具备强大的感知能力,包括纹理较差的环境、高对比度的光照和有限的计算资源。本文提出了一种实时建图框架,该框架将密集感知模型与3D高斯溅射(3D Gaussian Splatting, 3DGS)表示相结合。我们首先在使用LuPNT模拟器生成的合成数据集上对几种模型进行了基准测试,选择了一种基于门控递归单元(Gated Recurrent Units, GRU)的立体密集深度估计模型,因为它在深度估计的速度和准确性之间取得了良好的平衡,同时还选择了一种卷积神经网络(Convolutional Neural Network, CNN),因其在检测语义分割方面表现优越。通过使用真实姿态将局部场景理解与全局状态估计解耦,我们的管道重建了一个120米的行程,几何高度准确度约为3厘米,超越了没有激光雷达的传统点云基线。生成的3DGS地图支持新视图合成,并为完整的SLAM系统奠定了基础,其联合地图和姿态优化的能力将带来显著优势。我们的结果表明,将语义分割与密集深度估计结合学习的地图表示是一种有效的方法,可用于创建详细的大规模地图,以支持未来的月球表面任务。
cs.CV / 15 / 2603.18261
LRConv-NeRV: Low Rank Convolution for Efficient Neural Video Compression
LRConv-NeRV:低秩卷积用于高效神经视频压缩
Abstract
Neural Representations for Videos (NeRV) encode entire video sequences within neural network parameters, offering an alternative paradigm to conventional video codecs. However, the convolutional decoder of NeRV remains computationally expensive and memory intensive, limiting its deployment in resource-constrained environments. This paper proposes LRConv-NeRV, an efficient NeRV variant that replaces selected dense 3x3 convolutional layers with structured low-rank separable convolutions, trained end-to-end within the decoder architecture. By progressively applying low-rank factorization from the largest (final) decoder stage toward earlier stages, LRConv-NeRV enables controllable trade-offs between reconstruction quality and efficiency. Extensive experiments demonstrate that applying LRConv only to the final decoder stage reduces decoder complexity by 68%, from 201.9 to 64.9 GFLOPs, and model size by 9.3%, while incurring negligible quality loss and achieving approximately 9.2% bitrate reduction. Under INT8 post-training quantization, LRConv-NeRV preserves reconstruction quality close to the dense NeRV baseline, whereas more aggressive factorization of early decoder stages leads to disproportionate quality degradation. Compared to existing work under layer-aligned settings, LRConv-NeRV achieves a more favorable efficiency versus quality trade-off, offering substantial GFLOPs and parameter reductions while maintaining higher PSNR/MS-SSIM and improved temporal stability. Temporal flicker analysis using LPIPS further shows that the proposed solution preserves temporal coherence close to the NeRV baseline. These results establish LRConv-NeRV as a potential architectural alternative for efficient neural video decoding under low-precision and resource-constrained settings.
Chinese Translation
神经视频表示(NeRV)将整个视频序列编码为神经网络参数,提供了一种替代传统视频编码器的范式。然而,NeRV的卷积解码器仍然计算开销大且内存密集,限制了其在资源受限环境中的应用。本文提出了LRConv-NeRV,一种高效的NeRV变体,通过在解码器架构内端到端训练,将选定的稠密3x3卷积层替换为结构化低秩可分离卷积。通过逐步在最大的解码器阶段到较早的解码器阶段应用低秩分解,LRConv-NeRV实现了重建质量与效率之间的可控权衡。大量实验表明,仅在最后的解码器阶段应用LRConv可将解码器复杂度降低68%,从201.9 GFLOPs降至64.9 GFLOPs,模型大小减少9.3%,同时几乎没有质量损失,并实现约9.2%的比特率降低。在INT8后训练量化下,LRConv-NeRV保持接近稠密NeRV基线的重建质量,而对早期解码器阶段的更激进分解则导致不成比例的质量下降。与现有的层对齐设置相比,LRConv-NeRV实现了更有利的效率与质量权衡,提供了显著的GFLOPs和参数减少,同时保持更高的PSNR/MS-SSIM和改善的时间稳定性。使用LPIPS进行的时间闪烁分析进一步表明,所提出的解决方案保持了接近NeRV基线的时间一致性,结果确立了LRConv-NeRV作为在低精度和资源受限设置下高效神经视频解码的潜在架构替代方案。
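The cost trade-off behind the factorization can be made concrete with a back-of-the-envelope FLOP count. The abstract does not specify the exact factorization LRConv-NeRV uses, so this sketch assumes one common low-rank separable scheme (a k x 1 convolution into a rank-r bottleneck followed by a 1 x k convolution back out); the layer shapes below are hypothetical, not taken from the paper.

```python
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a dense k x k convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k

def lowrank_sep_flops(h, w, c_in, c_out, k, rank):
    """One low-rank separable factorization: a k x 1 conv into `rank` channels
    followed by a 1 x k conv back to c_out."""
    return h * w * (c_in * rank * k + rank * c_out * k)

# Hypothetical decoder-stage shapes, for illustration only:
dense = conv_flops(256, 256, 64, 64, 3)
lowrank = lowrank_sep_flops(256, 256, 64, 64, 3, rank=16)
print(dense, lowrank, 1 - lowrank / dense)
```

With these toy shapes the separable pair needs one sixth of the dense layer's multiply-accumulates; in the paper the quality/efficiency trade-off is controlled by choosing how many decoder stages to factorize.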
cs.CV / 16 / 2603.18282
CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning
CycleCap:通过自监督循环一致性微调提升视觉语言模型的图像描述性能
Abstract
Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme to improve image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on-the-fly. Unlike previous work that uses cycle consistency loss for preference dataset construction, our method leverages cycle consistency directly as a self-supervised training signal. This enables the use of raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded text descriptions. Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
Chinese Translation
视觉语言模型(VLMs)在图像描述、视觉问答和视觉推理方面取得了显著进展。然而,它们仍然容易出现视觉与语言的不一致,常常生成过于通用或虚构的描述。现有的方法通过指令微调来解决这一问题,这需要昂贵的大规模标注数据集,或通过复杂的测试时框架进行描述优化。在本研究中,我们通过循环一致性的视角重新审视图像与文本的对齐:给定一幅图像和一个由图像到文本模型生成的描述,通过文本到图像模型的反向映射应重构出与原始图像相匹配的图像。在我们的设置中,VLM作为图像到文本的组成部分,而预训练的文本到图像模型则通过从生成的描述中重构图像来闭合循环。在此基础上,我们引入了CycleCap,这是一种微调方案,利用基于原始图像和重构图像相似性的奖励,通过群体相对策略优化(GRPO)来提升图像描述性能,奖励是在运行时计算的。与之前使用循环一致性损失构建偏好数据集的工作不同,我们的方法直接利用循环一致性作为自监督训练信号。这使得仅使用原始图像成为可能,消除了对精心策划的图像-文本数据集的需求,同时引导VLM生成更准确和更有根据的文本描述。应用于四个参数范围从10亿到70亿的VLM,CycleCap在描述和虚构基准测试中均取得了一致的改进,超越了依赖监督循环一致性训练的最先进方法。
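The group-relative part of GRPO can be sketched independently of the captioning and text-to-image models: each sampled caption receives a cycle-consistency reward (similarity between the original image and the one reconstructed from that caption), and the training advantage is that reward standardized within the group of samples. The reward values below are hypothetical toy numbers.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each caption's cycle-consistency
    reward against the other samples drawn for the same image."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a degenerate group
    return [(r - mu) / sigma for r in rewards]

# Toy rewards: image similarity between the original and the image
# reconstructed from each of four sampled captions (hypothetical numbers).
adv = grpo_advantages([0.82, 0.61, 0.74, 0.69])
print(adv)
```

Captions whose reconstructions match the original better than their siblings get positive advantages, so the policy is pushed toward more visually grounded descriptions without any annotated caption data.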
cs.CV / 17 / 2603.18306
Fast and Generalizable NeRF Architecture Selection for Satellite Scene Reconstruction
快速且可泛化的卫星场景重建NeRF架构选择
Abstract
Neural Radiance Fields (NeRF) have emerged as a powerful approach for photorealistic 3D reconstruction from multi-view images. However, deploying NeRF for satellite imagery remains challenging. Each scene requires individual training, and optimizing architectures via Neural Architecture Search (NAS) demands hours to days of GPU time. While existing approaches focus on architectural improvements, our SHAP analysis reveals that multi-view consistency, rather than model architecture, determines reconstruction quality. Based on this insight, we develop PreSCAN, a predictive framework that estimates NeRF quality prior to training using lightweight geometric and photometric descriptors. PreSCAN selects suitable architectures in < 30 seconds with < 1 dB prediction error, achieving 1000x speedup over NAS. We further demonstrate PreSCAN's deployment utility on edge platforms (Jetson Orin), where combining its predictions with offline cost profiling reduces inference power by 26% and latency by 43% with minimal quality loss. Experiments on DFC2019 datasets confirm that PreSCAN generalizes across diverse satellite scenes without retraining.
Chinese Translation
神经辐射场(NeRF)已成为从多视角图像进行逼真3D重建的强大方法。然而,将NeRF应用于卫星图像仍然面临挑战。每个场景都需要单独训练,而通过神经架构搜索(NAS)优化架构则需要数小时到数天的GPU时间。虽然现有方法侧重于架构改进,但我们的SHAP分析表明,多视角一致性而非模型架构决定了重建质量。基于这一见解,我们开发了PreSCAN,一个预测框架,利用轻量级几何和光度描述符在训练之前估计NeRF质量。PreSCAN在不到30秒内选择合适的架构,预测误差小于1 dB,相比于NAS实现了1000倍的加速。我们进一步展示了PreSCAN在边缘平台(Jetson Orin)上的部署效用,通过将其预测与离线成本分析相结合,推理功耗降低了26%,延迟降低了43%,且质量损失最小。在DFC2019数据集上的实验确认PreSCAN在不同卫星场景中具有良好的泛化能力,无需重新训练。
cs.CV / 18 / 2603.18309
Unrolled Reconstruction with Integrated Super-Resolution for Accelerated 3D LGE MRI
集成超分辨率的展开重建用于加速三维晚期钆增强磁共振成像
Abstract
Accelerated 3D late gadolinium enhancement (LGE) MRI requires robust reconstruction methods to recover thin atrial structures from undersampled k-space data. While unrolled model-based networks effectively integrate physics-driven data consistency with learned priors, they operate at the acquired resolution and may fail to fully recover high-frequency detail. We propose a hybrid unrolled reconstruction framework in which an Enhanced Deep Super-Resolution (EDSR) network replaces the proximal operator within each iteration of the optimization loop, enabling joint super-resolution enhancement and data consistency enforcement. The model is trained end-to-end on retrospectively undersampled preclinical 3D LGE datasets and compared against compressed sensing, Model-Based Deep Learning (MoDL), and self-guided Deep Image Prior (DIP) baselines. Across acceleration factors, the proposed method consistently improves PSNR and SSIM over standard unrolled reconstruction and better preserves fine cardiac structures, leading to improved LA (left atrium) segmentation performance. These results demonstrate that integrating super-resolution priors directly within model-based reconstruction provides measurable gains in accelerated 3D LGE MRI.
Chinese Translation
加速三维晚期钆增强(LGE)磁共振成像需要稳健的重建方法,以从欠采样的 k 空间数据中恢复薄的心房结构。虽然展开的基于模型的网络有效地将物理驱动的数据一致性与学习的先验知识相结合,但它们在获取的分辨率下运行,可能无法完全恢复高频细节。我们提出了一种混合展开重建框架,其中增强深度超分辨率(Enhanced Deep Super-Resolution, EDSR)网络替代优化循环中每次迭代的近端算子,从而实现超分辨率增强与数据一致性强制的联合。该模型在回顾性欠采样的临床前三维 LGE 数据集上进行端到端训练,并与压缩感知、基于模型的深度学习(Model-Based Deep Learning, MoDL)和自引导深度图像先验(self-guided Deep Image Prior, DIP)基线进行比较。在各种加速因子下,所提方法在峰值信噪比(PSNR)和结构相似性指数(SSIM)上持续优于标准展开重建,并更好地保留细微的心脏结构,从而提高左心房(LA)分割性能。这些结果表明,将超分辨率先验直接集成到基于模型的重建中可以在加速三维 LGE 磁共振成像中带来可测量的提升。
cs.CV / 19 / 2603.18343
VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection
VISTA:基于验证引导的空间与时间基础模型融合及解剖学解码的罕见病理胃肠胶囊内镜事件检测
Abstract
Capsule endoscopy event detection is challenging because diagnostically relevant findings are sparse, visually heterogeneous, and embedded in long, noisy video streams, while evaluation is performed at the event level rather than by frame accuracy alone. We therefore formulate the RARE-VISION task as a metric-aligned event detection problem instead of a purely frame-wise classification task. Our framework combines two complementary backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding. The fusion stage uses validation-derived class-wise model weighting, backbone weighting, and probability calibration, while the decoding stage applies temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation to produce stable event predictions. Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance. On the official hidden test set, the proposed method achieved an overall temporal [email protected] of 0.3530 and temporal [email protected] of 0.3235.
Chinese Translation
胶囊内镜事件检测具有挑战性,因为诊断相关的发现稀少、视觉表现异质且嵌入在冗长且嘈杂的视频流中,同时评估是在事件级别进行,而非仅基于帧级准确率。因此,我们将RARE-VISION任务定义为一种与指标对齐的事件检测问题,而非纯粹的逐帧分类任务。我们提出的框架结合了两个互补的主干网络:用于局部时间上下文的EndoFM-LV和用于强帧级视觉语义的DINOv3 ViT-L/16,随后通过多样化头部集成(Diverse Head Ensemble)、验证引导的分层融合(Validation-Guided Hierarchical Fusion)及解剖感知的时间事件解码(Anatomy-Aware Temporal Event Decoding)来实现。融合阶段利用验证获得的类别级模型权重、主干网络权重及概率校准,而解码阶段则应用时间平滑、解剖学约束、阈值细化及逐标签事件生成,从而产生稳定的事件预测。消融实验表明,互补主干网络、验证引导的融合和解剖感知的时间解码均对事件级性能有贡献。在官方隐含测试集上,所提方法达到整体时间[email protected]为0.3530,时间[email protected]为0.3235。
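The per-label event generation step can be sketched as turning framewise probabilities into intervals: smooth, threshold, then drop events that are too short. The window size, threshold, and minimum length below are illustrative parameters; the paper's pipeline additionally applies anatomical constraints and per-label threshold refinement on top of this.

```python
def probs_to_events(probs, threshold=0.5, smooth=3, min_len=2):
    """Turn per-frame probabilities into event intervals: moving-average
    smoothing, thresholding, then dropping events shorter than min_len.
    Returns half-open frame intervals [start, end)."""
    half = smooth // 2
    sm = [sum(probs[max(0, i - half):i + half + 1]) /
          len(probs[max(0, i - half):i + half + 1]) for i in range(len(probs))]
    events, start = [], None
    for i, p in enumerate(sm + [0.0]):  # sentinel closes a trailing event
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i))
            start = None
    return events

print(probs_to_events([0.1, 0.2, 0.9, 0.95, 0.9, 0.1, 0.55, 0.1, 0.8, 0.9]))
```

Smoothing merges noisy single-frame detections into coherent events, which is what event-level metrics reward over raw framewise accuracy.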
cs.CV / 20 / 2603.18373
To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
观察还是迎合:揭示视觉谄媚和VLM中的分裂信念
Abstract
When VLMs answer correctly, do they genuinely rely on visual information or exploit language shortcuts? We introduce the Tri-Layer Diagnostic Framework, which disentangles hallucination sources via three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Using counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs, our taxonomy reveals that 69.6% of samples exhibit Visual Sycophancy--models detect visual anomalies but hallucinate to satisfy user expectations--while zero samples show Robust Refusal, indicating alignment training has systematically suppressed truthful uncertainty acknowledgment. A scaling analysis (Qwen2.5-VL 7B to 72B) shows larger models reduce Language Shortcuts but amplify Visual Sycophancy, demonstrating scale alone cannot resolve the grounding problem. Diagnostic scores further enable a post-hoc selective prediction strategy achieving up to +9.5pp accuracy at 50% coverage with no additional training cost.
Chinese Translation
当视觉语言模型(VLM)正确回答时,它们是否真正依赖于视觉信息,还是利用语言捷径?我们提出了三层诊断框架,它通过三项指标揭示幻觉来源:潜在异常检测(感知意识)、视觉必要性评分(视觉依赖性,通过KL散度测量)和竞争评分(视觉基础与指令遵循之间的冲突)。在7个VLM和7,000个模型-样本对中使用反事实干预(盲图、噪声图和冲突图像),我们的分类法揭示69.6%的样本表现出视觉谄媚——模型检测到视觉异常但却虚构以满足用户期望——而没有任何样本表现出稳健拒绝(Robust Refusal),这表明对齐训练系统性地压制了对真实不确定性的承认。规模分析(Qwen2.5-VL 7B至72B)显示,较大的模型减少了语言捷径,但放大了视觉谄媚,表明单靠扩大规模无法解决视觉落地(grounding)问题。诊断评分进一步实现了一种事后选择性预测策略,在50%覆盖率下实现了最高+9.5个百分点的准确性,而没有额外的训练成本。
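The Visual Necessity Score can be sketched as a KL divergence between the model's answer distribution with and without the image: near-zero divergence suggests a language shortcut, while a large value means the image genuinely moves the prediction. This is a sketch of the abstract's KL-based idea, not the authors' exact estimator, and the probabilities are toy numbers.

```python
import math

def visual_necessity(p_with_image, p_blind):
    """KL(P_with_image || P_blind) over answer options: high values mean the
    image actually changes the answer distribution; near zero suggests the
    model would answer the same way blind."""
    return sum(p * math.log(p / q) for p, q in zip(p_with_image, p_blind) if p > 0)

# Answer distribution over ("yes", "no") with and without the image:
shortcut = visual_necessity([0.9, 0.1], [0.88, 0.12])   # image barely matters
grounded = visual_necessity([0.9, 0.1], [0.5, 0.5])     # image shifts the prior
print(shortcut, grounded)
```

Thresholding such a score per sample is one way a selective-prediction strategy like the paper's could abstain on answers that were never visually grounded.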
cs.CV / 21 / 2603.18401
Pixel-Accurate Epipolar Guided Matching
像素精确的极线引导匹配
Abstract
Keypoint matching can be slow and unreliable in challenging conditions such as repetitive textures or wide-baseline views. In such cases, known geometric relations (e.g., the fundamental matrix) can be used to restrict potential correspondences to a narrow epipolar envelope, thereby reducing the search space and improving robustness. These epipolar-guided matching approaches have proved effective in tasks such as SfM; however, most rely on coarse spatial binning, which introduces approximation errors, requires costly post-processing, and may miss valid correspondences. We address these limitations with an exact formulation that performs candidate selection directly in angular space. In our approach, each keypoint is assigned a tolerance circle which, when viewed from the epipole, defines an angular interval. Matching then becomes a 1D angular interval query, solved efficiently in logarithmic time with a segment tree. This guarantees pixel-level tolerance, supports per-keypoint control, and removes unnecessary descriptor comparisons. Extensive evaluation on ETH3D demonstrates noticeable speedups over existing approaches while recovering exact correspondence sets.
Chinese Translation
在重复纹理或宽基线视角等挑战性条件下,关键点匹配可能会变得缓慢且不可靠。在这种情况下,已知的几何关系(如基本矩阵)可以用来将潜在对应关系限制在狭窄的极线包络内,从而减少搜索空间并提高鲁棒性。这些极线引导的匹配方法在结构从运动(SfM)等任务中已证明有效;然而,大多数方法依赖于粗略的空间分箱,这会引入近似误差、需要代价昂贵的后处理,并可能错过有效的对应关系。我们通过一种精确的公式化方法解决了这些局限性,该方法直接在角度空间中执行候选选择。在我们的方法中,每个关键点被分配一个容差圆,当从极点查看时,该圆定义了一个角度区间。然后,匹配变成一个一维角度区间查询,通过线段树高效地以对数时间解决。这保证了像素级容差,支持每个关键点的控制,并消除了不必要的特征描述子比较。在ETH3D上的广泛评估表明,与现有方法相比,速度显著提升,同时恢复了准确的对应关系集。
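The angular-interval formulation can be sketched in a few lines: the tolerance circle of a keypoint, seen from the epipole, subtends an interval of angles, and candidate selection is an interval query. The paper builds a segment tree over per-keypoint intervals; this sketch simplifies the candidates to single angles and uses `bisect` for the logarithmic query, so it illustrates the geometry rather than the exact data structure.

```python
import bisect
import math

def angular_interval(kp, epipole, tol):
    """Angular interval subtended at the epipole by a keypoint's tolerance
    circle of radius `tol` pixels (assumes the epipole lies outside it)."""
    dx, dy = kp[0] - epipole[0], kp[1] - epipole[1]
    center = math.atan2(dy, dx)
    half = math.asin(tol / math.hypot(dx, dy))
    return center - half, center + half

def candidates_in_interval(angles_sorted, lo, hi):
    """Indices of keypoints (pre-sorted by angle about the epipole) whose
    angle falls inside [lo, hi] -- an O(log n) query via binary search."""
    i = bisect.bisect_left(angles_sorted, lo)
    j = bisect.bisect_right(angles_sorted, hi)
    return list(range(i, j))

lo, hi = angular_interval((10.0, 0.0), (0.0, 0.0), tol=1.0)
print(candidates_in_interval([-0.5, -0.05, 0.0, 0.08, 0.4], lo, hi))
```

Only keypoints inside the angular envelope ever reach descriptor comparison, which is where the speedup over brute-force matching comes from.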
cs.CV / 22 / 2603.18402
Inst4DGS: Instance-Decomposed 4D Gaussian Splatting with Multi-Video Label Permutation Learning
Inst4DGS:实例分解的4D高斯溅射与多视频标签置换学习
Abstract
We present Inst4DGS, an instance-decomposed 4D Gaussian Splatting (4DGS) approach with long-horizon per-Gaussian trajectories. While dynamic 4DGS has advanced rapidly, instance-decomposed 4DGS remains underexplored, largely due to the difficulty of associating inconsistent instance labels across independently segmented multi-view videos. We address this challenge by introducing per-video label-permutation latents that learn cross-video instance matches through a differentiable Sinkhorn layer, enabling direct multi-view supervision with consistent identity preservation. This explicit label alignment yields sharp decision boundaries and temporally stable identities without identity drift. To further improve efficiency, we propose instance-decomposed motion scaffolds that provide low-dimensional motion bases per object for long-horizon trajectory optimization. Experiments on Panoptic Studio and Neural3DV show that Inst4DGS jointly supports tracking and instance decomposition while achieving state-of-the-art rendering and segmentation quality. On the Panoptic Studio dataset, Inst4DGS improves PSNR from 26.10 to 28.36, and instance mIoU from 0.6310 to 0.9129, over the strongest baseline.
Chinese Translation
我们提出了Inst4DGS,一种具有长时程逐高斯轨迹的实例分解4D高斯溅射(4DGS)方法。虽然动态4DGS发展迅速,但实例分解的4DGS仍然未被充分探索,这主要是由于在独立分割的多视角视频中关联不一致的实例标签的困难。我们通过引入视频特定的标签置换潜变量来解决这一挑战,该潜变量通过可微分的Sinkhorn层学习跨视频的实例匹配,从而实现直接的多视角监督,并保持一致的身份。这样的显式标签对齐产生了清晰的决策边界和时间上稳定的身份,没有身份漂移。为了进一步提高效率,我们提出了实例分解的运动支架,为每个对象提供低维的运动基,以进行长时程轨迹优化。在Panoptic Studio和Neural3DV上的实验表明,Inst4DGS支持跟踪和实例分解,同时实现了最先进的渲染和分割质量。在Panoptic Studio数据集上,Inst4DGS将PSNR从26.10提高到28.36,实例mIoU从0.6310提高到0.9129,优于最强基线。
cs.CV / 23 / 2603.18418
Mind the Rarities: Can Rare Skin Diseases Be Reliably Diagnosed via Diagnostic Reasoning?
关注稀有性:稀有皮肤疾病能否通过诊断推理可靠诊断?
Abstract
Large vision-language models (LVLMs) demonstrate strong performance in dermatology; however, evaluating diagnostic reasoning for rare conditions remains largely unexplored. Existing benchmarks focus on common diseases and assess only final accuracy, overlooking the clinical reasoning process, which is critical for complex cases. We address this gap by constructing DermCase, a long-context benchmark derived from peer-reviewed case reports. Our dataset contains 26,030 multi-modal image-text pairs and 6,354 clinically challenging cases, each annotated with comprehensive clinical information and step-by-step reasoning chains. To enable reliable evaluation, we establish DermLIP-based similarity metrics that achieve stronger alignment with dermatologists for assessing differential diagnosis quality. Benchmarking 22 leading LVLMs exposes significant deficiencies across diagnosis accuracy, differential diagnosis, and clinical reasoning. Fine-tuning experiments demonstrate that instruction tuning substantially improves performance while Direct Preference Optimization (DPO) yields minimal gains. Systematic error analysis further reveals critical limitations in current models' reasoning capabilities.
Chinese Translation
大型视觉-语言模型(LVLMs)在皮肤病学中表现出色;然而,针对稀有病症的诊断推理评估仍然未得到充分探索。现有基准主要集中于常见疾病,仅评估最终准确性,忽视了临床推理过程,而这一过程对于复杂病例至关重要。我们通过构建DermCase来填补这一空白,该基准源自经过同行评审的病例报告。我们的数据集包含26,030个多模态图像-文本对和6,354个临床挑战性病例,每个病例都附有详尽的临床信息和逐步推理链。为了实现可靠评估,我们建立了基于DermLIP的相似性度量,这在评估鉴别诊断质量时与皮肤科医生的意见有更强的一致性。对22个领先的LVLMs进行基准测试暴露了在诊断准确性、鉴别诊断和临床推理方面的显著不足。微调实验表明,指令调优显著提高了性能,而直接偏好优化(DPO)带来的提升则微乎其微。系统的错误分析进一步揭示了当前模型推理能力的关键局限性。
cs.CV / 24 / 2603.18423
SynQ: Accurate Zero-shot Quantization by Synthesis-aware Fine-tuning
SynQ:通过合成感知微调实现精确的零样本量化
Abstract
How can we accurately quantize a pre-trained model without any data? Quantization algorithms are widely used for deploying neural networks on resource-constrained edge devices. Zero-shot Quantization (ZSQ) addresses the crucial and practical scenario where training data are inaccessible for privacy or security reasons. However, three significant challenges hinder the performance of existing ZSQ methods: 1) noise in the synthetic dataset, 2) predictions based on off-target patterns, and 3) misguidance by erroneous hard labels. In this paper, we propose SynQ (Synthesis-aware Fine-tuning for Zero-shot Quantization), a carefully designed ZSQ framework to overcome the limitations of existing methods. SynQ minimizes the noise from the generated samples by exploiting a low-pass filter. Then, SynQ trains the quantized model to improve accuracy by aligning its class activation map with the pre-trained model. Furthermore, SynQ mitigates misguidance from the pre-trained model's error by leveraging only soft labels for difficult samples. Extensive experiments show that SynQ provides state-of-the-art accuracy over existing ZSQ methods.
Chinese Translation
如何在没有任何数据的情况下准确地量化一个预训练模型?量化算法广泛应用于资源受限的边缘设备上部署神经网络。零样本量化(Zero-shot Quantization, ZSQ)解决了在隐私或安全原因下训练数据不可获取的关键实际场景。然而,现有ZSQ方法的性能受到三个主要挑战的制约:1)合成数据集中的噪声,2)基于非目标模式的预测,以及3)错误硬标签导致的误指导。在本文中,我们提出了SynQ(合成感知的零样本量化微调),这是一个精心设计的ZSQ框架,用以克服现有方法的局限性。SynQ通过利用低通滤波器来最小化生成样本中的噪声。随后,SynQ训练量化模型,通过与预训练模型对齐其类别激活图来提高准确性。此外,SynQ通过仅对困难样本采用软标签来减轻来自预训练模型错误的误指导。大量实验表明,SynQ达到了超越现有ZSQ方法的最先进准确率。
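The low-pass filtering step can be illustrated with a generic frequency-domain mask: high-frequency FFT coefficients of a synthetic sample are zeroed out, suppressing the noise that dominates those bands. The abstract does not specify SynQ's exact filter, so the mask shape and `keep` fraction here are assumptions for illustration.

```python
import numpy as np

def lowpass(img, keep=0.25):
    """Zero out high-frequency FFT coefficients of an (H, W) image, keeping a
    centered `keep`-fraction band per axis -- one generic low-pass filter."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    mask = np.zeros_like(F)
    ch, cw = h // 2, w // 2
    rh, rw = max(1, int(h * keep / 2)), max(1, int(w * keep / 2))
    mask[ch - rh:ch + rh, cw - rw:cw + rw] = 1
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

# Smooth toy "sample" plus broadband noise, mimicking a noisy synthetic image.
rng = np.random.default_rng(0)
smooth = np.outer(np.sin(np.linspace(0, np.pi, 32)), np.sin(np.linspace(0, np.pi, 32)))
noisy = smooth + 0.3 * rng.normal(size=(32, 32))
filtered = lowpass(noisy)
print(np.abs(filtered - smooth).mean(), np.abs(noisy - smooth).mean())
```

Because white noise spreads its energy across all frequencies while the underlying structure is mostly low-frequency, the filtered sample sits closer to the clean signal than the noisy one.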
cs.CV / 25 / 2603.18427
R&D: Balancing Reliability and Diversity in Synthetic Data Augmentation for Semantic Segmentation
R&D:在语义分割的合成数据增强中平衡可靠性与多样性
Abstract
Collecting and annotating datasets for pixel-level semantic segmentation tasks are highly labor-intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real-world data collection. Traditional augmentation techniques, such as translation, scaling, and color transformations, create geometric variations but fail to generate new structures. While generative models have been employed to extend semantic information of datasets, they often struggle to maintain consistency between the original and generated images, particularly for pixel-level tasks. In this work, we propose a novel synthetic data augmentation pipeline that integrates controllable diffusion models. Our approach balances the diversity and reliability of the data, effectively bridging the gap between synthetic and real data. We utilize class-aware prompting and visual prior blending to improve image quality further, ensuring precise alignment with segmentation labels. By evaluating benchmark datasets such as PASCAL VOC and BDD100K, we demonstrate that our method significantly enhances semantic segmentation performance, especially in data-scarce scenarios, while improving model robustness in real-world applications. Our code is available at https://github.com/chequanghuy/Enhanced-Generative-Data-Augmentation-for-Semantic-Segmentation-via-Stronger-Guidance.
Chinese Translation
收集和标注用于像素级语义分割任务的数据集是高度劳动密集型的。数据增强通过增强模型的泛化能力而无需额外的现实世界数据收集,提供了一种可行的解决方案。传统的增强技术,如平移、缩放和颜色变换,虽然能够创造几何变化,但未能生成新的结构。尽管生成模型已被用于扩展数据集的语义信息,但它们在保持原始图像与生成图像之间的一致性方面常常面临挑战,尤其是在像素级任务中。在本研究中,我们提出了一种新颖的合成数据增强管道,集成了可控扩散模型。我们的方法在数据的多样性与可靠性之间取得平衡,有效地弥合了合成数据与真实数据之间的差距。我们利用类感知提示和视觉先验融合进一步提高图像质量,确保与分割标签的精确对齐。通过评估基准数据集,如 PASCAL VOC 和 BDD100K,我们证明了我们的方法显著提升了语义分割性能,尤其是在数据稀缺的场景中,同时提高了模型在现实世界应用中的鲁棒性。我们的代码可在 https://github.com/chequanghuy/Enhanced-Generative-Data-Augmentation-for-Semantic-Segmentation-via-Stronger-Guidance 获取。
cs.CV / 26 / 2603.18429
AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents
AndroTMem:从交互轨迹到长时间跨度GUI代理中的锚定记忆
Shi, Yibo, Li, Jungang, Zhang, Linghao, Dongfang, Zihao, Wu, Biao, Tao, Sicheng, Yan, Yibo, Qin, Chenxi, Liu, Weiting, Lin, Zhixin, Li, Hanqian, Huang, Yu, Dai, Song, Hei, Yonghua, Ding, Yue, Li, Xiang, Wang, Shikang, Xu, Chengdong, Liu, Jingqi, Ma, Xueying, Zheng, Zhiwen, Zhang, Xiaofei, Wang, Bincheng, Yang, Nichen, Wu, Jie, Tian, Lihua, Li, Chen, Hu, Xuming
Abstract
Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences is redundant and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state; AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, making sparse yet essential intermediate states decisive for downstream actions and centering interaction memory in evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures, not isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents interaction sequences as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making. Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5%-30.16% and AMS by 4.93%-24.66%, indicating that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at [https://github.com/CVC2233/AndroTMem](https://github.com/CVC2233/AndroTMem).
Chinese Translation
长时间跨度的GUI代理是实现现实世界部署的重要一步,但在现有范式下,有效的交互记忆仍然未得到充分探索。重放完整的交互序列是冗余的,并且会放大噪声,而摘要往往会抹去依赖性关键的信息和可追溯性。我们提出了AndroTMem,这是一个用于长时间跨度Android GUI代理的锚定记忆的诊断框架。其核心基准,AndroTMem-Bench,包含1,069个任务和34,473个交互步骤(每个任务平均32.1个,最多65个)。我们通过任务完成率(TCR)评估代理,重点关注那些完成需要携带关键中间状态的任务;AndroTMem-Bench旨在强制执行步骤间的因果依赖,使得稀疏但至关重要的中间状态对后续动作至关重要,并将交互记忆置于评估的中心。在开源和闭源的GUI代理中,我们观察到一个一致的模式:随着交互序列的延长,性能下降主要是由于任务内记忆失败,而不是孤立的感知错误或局部行动失误。根据这一诊断,我们提出了锚定状态记忆(ASM),它将交互序列表示为一组紧凑的因果链接中间状态锚,以实现针对子目标的检索和基于归因的决策。在多个设置和12个评估的GUI代理中,ASM始终优于完整序列重放和基于摘要的基线,TCR提高了5%-30.16%,AMS提高了4.93%-24.66%,表明锚定的结构化记忆有效缓解了长时间跨度GUI任务中的交互记忆瓶颈。代码、基准和相关资源可在[https://github.com/CVC2233/AndroTMem](https://github.com/CVC2233/AndroTMem)上公开获取。
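The anchored-memory idea can be sketched as a tiny data structure: instead of replaying every step, store only causally linked intermediate-state anchors and retrieve the chain that leads to a given subgoal. The class name, fields, and the GUI example below are illustrative, not the paper's actual schema.

```python
class AnchorMemory:
    """Minimal sketch of an anchored state memory: keep a compact set of
    causally linked (subgoal, state) anchors and support subgoal-targeted
    retrieval of the causal chain that produced the current state."""

    def __init__(self):
        self.anchors = []  # entries: (subgoal, state_summary, parent_index)

    def add(self, subgoal, state_summary, parent=None):
        self.anchors.append((subgoal, state_summary, parent))
        return len(self.anchors) - 1

    def retrieve(self, subgoal):
        """Return the causal chain ending at the latest anchor for `subgoal`."""
        for i in range(len(self.anchors) - 1, -1, -1):
            if self.anchors[i][0] == subgoal:
                chain, j = [], i
                while j is not None:          # walk parent links back to the root
                    chain.append(self.anchors[j])
                    j = self.anchors[j][2]
                return list(reversed(chain))
        return []

mem = AnchorMemory()
a = mem.add("open_settings", "Settings screen visible")
b = mem.add("copy_wifi_name", "clipboard='HomeNet'", parent=a)
mem.add("open_notes", "Notes app open", parent=b)
print([s for s, _, _ in mem.retrieve("open_notes")])
```

The retrieval returns only the sparse states a downstream action actually depends on, which is the property the abstract argues full-sequence replay lacks and summaries destroy.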
cs.CV / 27 / 2603.18443
SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation
SR-Nav:空间关系对零样本目标导航的重要性
Abstract
Zero-shot object-goal navigation aims to find target objects in unseen environments using only egocentric observation. Recent methods leverage foundation models' comprehension and reasoning capabilities to enhance navigation performance. However, when faced with poor viewpoints or weak semantic cues, foundation models often fail to support reliable reasoning in both perception and planning, resulting in inefficient or failed navigation. We observe that inherent relationships among objects and regions encode structured scene priors, which help agents infer plausible target locations even under partial observations. Motivated by this insight, we propose Spatial Relation-aware Navigation (SR-Nav), a framework that models both observed and experience-based spatial relationships to enhance both perception and planning. Specifically, SR-Nav first constructs a Dynamic Spatial Relationship Graph (DSRG) that encodes the target-centered spatial relationships through the foundation models and updates dynamically with real-time observations. We then introduce a Relation-aware Matching Module. It utilizes relationship matching instead of naive detection, leveraging diverse relationships in the DSRG to verify and correct errors, enhancing visual perception robustness. Finally, we design a Dynamic Relationship Planning Module to reduce the planning search space by dynamically computing the optimal paths based on the DSRG from the current position, thereby guiding planning and reducing exploration redundancy. Experiments on HM3D show that our method achieves state-of-the-art performance in both success rate and navigation efficiency. The code will be publicly available at https://github.com/Mzyw-1314/SR-Nav
Chinese Translation
零样本目标导航旨在仅通过自我中心观察在未见环境中寻找目标物体。近期的方法利用基础模型的理解和推理能力来提升导航性能。然而,当面临较差的视角或弱语义线索时,基础模型往往无法在感知和规划中支持可靠的推理,导致导航效率低下或失败。我们观察到,物体和区域之间的内在关系编码了结构化的场景先验,这有助于智能体在部分观察下推断出合理的目标位置。基于这一见解,我们提出了空间关系感知导航(SR-Nav),这是一个建模观察到的和基于经验的空间关系的框架,以增强感知和规划。具体而言,SR-Nav首先构建一个动态空间关系图(DSRG),通过基础模型编码以目标为中心的空间关系,并根据实时观察动态更新。然后,我们引入了一个关系感知匹配模块。该模块利用关系匹配而非简单检测,利用DSRG中的多样关系来验证和纠正错误,从而增强视觉感知的鲁棒性。最后,我们设计了一个动态关系规划模块,通过基于当前位置信息动态计算DSRG中的最佳路径来减少规划搜索空间,从而指导规划并减少探索冗余。在HM3D上的实验表明,我们的方法在成功率和导航效率方面均达到了最先进的性能。代码将公开发布在 https://github.com/Mzyw-1314/SR-Nav
cs.CV / 28 / 2603.18453
Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching
在体育教练相关任务中学习一致的时间定位
Abstract
Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.
Chinese Translation
视频大语言模型(Video-LLMs)往往关注于无关的帧,这对需要精确时间定位的体育教练任务尤其有害。然而,获取帧级别的监督信息具有挑战性:由人工收集成本高昂,而来自其他模型的监督信号又不可靠。我们通过利用相关任务(如生成和验证)必须关注相同帧的观察,来在不增加额外注释的情况下改善时间定位。我们通过对紧密相关任务的选定视觉注意力图施加自一致性目标来实现这一点。使用提供真实关键帧注释的VidDiffBench,我们首先验证了注意力错误分配是一个显著的瓶颈。然后,我们展示了使用我们的目标进行训练在三个体育教练任务(Exact、FitnessQA 和 ExpertAF)上分别获得了 +3.0%、+14.1% 的准确率和 +0.9 的 BERTScore 的提升,甚至超过了闭源模型。
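The self-consistency objective can be sketched as a penalty on disagreement between the per-frame attention maps of two related tasks (generation and verification). Mean squared difference is used here as a simple stand-in; the paper's actual objective over its selected visual attention maps may differ, and the attention values are toy numbers.

```python
def attention_consistency_loss(attn_gen, attn_ver):
    """Self-consistency objective over frame-attention maps: penalize
    disagreement between the frames a generation head and a verification
    head attend to (mean squared difference over frames)."""
    assert len(attn_gen) == len(attn_ver)
    return sum((g - v) ** 2 for g, v in zip(attn_gen, attn_ver)) / len(attn_gen)

# Toy per-frame attention (each sums to 1): agreeing heads incur low loss.
aligned = attention_consistency_loss([0.1, 0.7, 0.2], [0.1, 0.7, 0.2])
drifted = attention_consistency_loss([0.1, 0.7, 0.2], [0.6, 0.2, 0.2])
print(aligned, drifted)
```

Minimizing this term pushes both tasks toward the same frames, which supplies a grounding signal without any frame-level annotation.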
cs.CV / 29 / 2603.18460
Interpretable Prostate Cancer Detection using a Small Cohort of MRI Images
基于小规模MRI图像的可解释前列腺癌检测
Abstract
Prostate cancer is a leading cause of mortality in men, yet interpretation of T2-weighted prostate MRI remains challenging due to subtle and heterogeneous lesions. We developed an interpretable framework for automatic cancer detection using a small dataset of 162 T2-weighted images (102 cancer, 60 normal), addressing data scarcity through transfer learning and augmentation. We performed a comprehensive comparison of Vision Transformers (ViT, Swin), CNNs (ResNet18), and classical methods (Logistic Regression, SVM, HOG+SVM). Transfer-learned ResNet18 achieved the best performance (90.9% accuracy, 95.2% sensitivity, AUC 0.905) with only 11M parameters, while Vision Transformers showed lower performance despite substantially higher complexity. Notably, HOG+SVM achieved comparable accuracy (AUC 0.917), highlighting the effectiveness of handcrafted features in small datasets. Unlike state-of-the-art approaches relying on biparametric MRI (T2+DWI) and large cohorts, our method achieves competitive performance using only T2-weighted images, reducing acquisition complexity and computational cost. In a reader study of 22 cases, five radiologists achieved a mean sensitivity of 67.5% (Fleiss Kappa = 0.524), compared to 95.2% for the AI model, suggesting potential for AI-assisted screening to reduce missed cancers and improve consistency. Code and data are publicly available.
Chinese Translation
前列腺癌是男性死亡的主要原因之一,但由于病灶的微妙性和异质性,T2加权前列腺MRI的解读仍然具有挑战性。我们开发了一个可解释的自动癌症检测框架,使用了一个包含162幅T2加权图像(102幅癌症图像,60幅正常图像)的小型数据集,通过迁移学习和数据增强来解决数据稀缺问题。我们对视觉变换器(Vision Transformers, ViT, Swin)、卷积神经网络(CNNs, ResNet18)和经典方法(逻辑回归、支持向量机(SVM)、HOG+SVM)进行了全面比较。迁移学习的ResNet18在仅有1100万参数的情况下取得了最佳性能(90.9%的准确率,95.2%的灵敏度,AUC 0.905),而视觉变换器尽管复杂度显著更高,但表现较差。值得注意的是,HOG+SVM达到了可比的准确率(AUC 0.917),突显了在小型数据集中手工特征的有效性。与依赖双参数MRI(T2+DWI)和大型队列的最先进方法不同,我们的方法仅使用T2加权图像就实现了竞争力的性能,从而降低了获取复杂性和计算成本。在对22例病例的读者研究中,五名放射科医师的平均灵敏度为67.5%(Fleiss Kappa = 0.524),而AI模型的灵敏度为95.2%,这表明AI辅助筛查在减少漏诊和提高一致性方面的潜力。代码和数据已公开。
cs.CV / 30 / 2603.18461
Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images
基于细胞类型原型的神经网络用于从病理图像中估计基因表达
Abstract
Estimating slide- and patch-level gene expression profiles from pathology images enables rapid and low-cost molecular analysis with broad clinical impact. Despite strong results, existing approaches treat gene expression as a mere slide- or spot-level signal and do not incorporate the fact that the measured expression arises from the aggregation of underlying cell-level expression. To explicitly introduce this missing cell-resolved guidance, we propose a Cell-type Prototype-informed Neural Network (CPNN) that leverages publicly available single-cell RNA-sequencing datasets. Since single-cell measurements are noisy and not paired with histology images, we first estimate cell-type prototypes, mean expression profiles that reflect stable gene-gene co-variation patterns. CPNN then learns cell-type compositional weights directly from images and models the relationship between prototypes and observed bulk or spatial expression, providing a biologically grounded and structurally regularized prediction framework. We evaluate CPNN on three slide-level datasets and three patch-level spatial transcriptomics datasets. Across all settings, CPNN achieves the highest performance in terms of Spearman correlation. Moreover, by visualizing the inferred compositional weights, our framework provides interpretable insights into which cell types drive the predicted expression. Code is publicly available at https://github.com/naivete5656/CPNN.
Chinese Translation
从病理图像中估计切片级和图块级的基因表达谱能够实现快速且低成本的分子分析,对临床具有广泛影响。尽管现有方法取得了良好的结果,但它们将基因表达视为单纯的切片级或斑点级信号,而未考虑到测量到的表达来自于底层细胞级表达的聚合。为了明确引入这一缺失的细胞解析指导,我们提出了一种基于细胞类型原型的神经网络(Cell-type Prototype-informed Neural Network,CPNN),该网络利用公开可用的单细胞RNA测序数据集。由于单细胞测量存在噪声且未与组织学图像配对,我们首先估计细胞类型原型,即反映稳定基因-基因共变模式的平均表达谱。然后,CPNN直接从图像中学习细胞类型组成权重,并建模原型与观察到的整体或空间表达之间的关系,提供一个生物学上有依据且结构上正则化的预测框架。我们在三个切片级数据集和三个图块级空间转录组数据集上评估了CPNN。在所有设置中,CPNN在斯皮尔曼相关性方面达到了最高性能。此外,通过可视化推断的组成权重,我们的框架提供了可解释的见解,揭示了哪些细胞类型驱动了预测的表达。代码可在 https://github.com/naivete5656/CPNN 上公开获取。
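The core modeling step in CPNN, expressing bulk or spot expression as an aggregation of cell-type prototypes, reduces to a weighted mixture. A minimal sketch with random stand-ins (in the actual model, the compositional weights come from the image encoder and the prototypes from public scRNA-seq data):

```python
import numpy as np

rng = np.random.default_rng(1)
n_types, n_genes, n_patches = 4, 50, 8

# Cell-type prototypes: one mean expression profile per cell type,
# in practice estimated from single-cell RNA-seq (random stand-ins here).
prototypes = rng.random((n_types, n_genes))

# Image-derived logits per patch (in CPNN these come from the image branch).
logits = rng.normal(size=(n_patches, n_types))
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax

# Observed bulk/spot expression modeled as a composition of prototypes.
pred_expression = weights @ prototypes
print(pred_expression.shape)
```

Because the weights are a convex combination per patch, visualizing them directly tells you which cell types drive each predicted profile, which is the interpretability claim of the paper.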
cs.CV / 31 / 2603.18465
MedQ-UNI: Toward Unified Medical Image Quality Assessment and Restoration via Vision-Language Modeling
MedQ-UNI:通过视觉-语言建模实现统一的医学图像质量评估与修复
Abstract
Existing medical image restoration (Med-IR) methods are typically modality-specific or degradation-specific, failing to generalize across the heterogeneous degradations encountered in clinical practice. We argue this limitation stems from the isolation of Med-IR from medical image quality assessment (Med-IQA), as restoration models without explicit quality understanding struggle to adapt to diverse degradation types across modalities. To address these challenges, we propose MedQ-UNI, a unified vision-language model that follows an assess-then-restore paradigm, explicitly leveraging Med-IQA to guide Med-IR across arbitrary modalities and degradation types. MedQ-UNI adopts a multimodal autoregressive dual-expert architecture with shared attention: a quality assessment expert first identifies degradation issues through structured natural language descriptions, and a restoration expert then conditions on these descriptions to perform targeted image restoration. To support this paradigm, we construct a large-scale dataset of approximately 50K paired samples spanning three imaging modalities and five restoration tasks, each annotated with structured quality descriptions for joint Med-IQA and Med-IR training, along with a 2K-sample benchmark for evaluation. Extensive experiments demonstrate that a single MedQ-UNI model, without any task-specific adaptation, achieves state-of-the-art restoration performance across all tasks while generating superior descriptions, confirming that explicit quality understanding meaningfully improves restoration fidelity and interpretability.
Chinese Translation
现有的医学图像修复(Med-IR)方法通常是特定于某种模态或降质类型的,无法在临床实践中遇到的异构降质情况下进行推广。我们认为这一限制源于医学图像修复与医学图像质量评估(Med-IQA)的隔离,因为缺乏明确质量理解的修复模型难以适应不同模态下的多样化降质类型。为了解决这些挑战,我们提出了MedQ-UNI,一个统一的视觉-语言模型,遵循评估-再修复的范式,明确利用Med-IQA来指导Med-IR在任意模态和降质类型下的应用。MedQ-UNI采用了一种多模态自回归双专家架构,具有共享注意力机制:质量评估专家首先通过结构化自然语言描述识别降质问题,然后修复专家根据这些描述进行针对性的图像修复。为了支持这一范式,我们构建了一个大规模数据集,包含约50K对样本,涵盖三种成像模态和五个修复任务,每个样本都附有结构化质量描述,以便进行联合的Med-IQA和Med-IR训练,并提供了一个包含2K样本的基准用于评估。大量实验表明,单一的MedQ-UNI模型在没有任何特定任务适应的情况下,在所有任务上都达到了最先进的修复性能,同时生成了更优质的描述,确认了明确的质量理解显著提高了修复的保真度和可解释性。
cs.CV / 32 / 2603.18466
Recolour What Matters: Region-Aware Colour Editing via Token-Level Diffusion
重着色关键之处:基于令牌级扩散的区域感知颜色编辑
Abstract
Colour is one of the most perceptually salient yet least controllable attributes in image generation. Although recent diffusion models can modify object colours from user instructions, their results often deviate from the intended hue, especially for fine-grained and local edits. Early text-driven methods rely on discrete language descriptions that cannot accurately represent continuous chromatic variations. To overcome this limitation, we propose ColourCrafter, a unified diffusion framework that transforms colour editing from global tone transfer into a structured, region-aware generation process. Unlike traditional colour-driven methods, ColourCrafter performs token-level fusion of RGB colour tokens and image tokens in latent space, selectively propagating colour information to semantically relevant regions while preserving structural fidelity. A perceptual Lab-space loss further enhances pixel-level precision by decoupling luminance and chrominance and constraining edits within masked areas. Additionally, we build ColourfulSet, a large-scale dataset of high-quality image pairs with continuous and diverse colour variations. Extensive experiments demonstrate that ColourCrafter achieves state-of-the-art colour accuracy, controllability and perceptual fidelity in fine-grained colour editing. Our project is available at https://yangyuqi317.github.io/ColourCrafter.github.io/.
Chinese Translation
颜色是图像生成中感知上最显著但最难控制的属性之一。尽管最近的扩散模型能够根据用户指令修改物体颜色,但其结果往往偏离预期的色调,特别是在细粒度和局部编辑方面。早期的基于文本的方法依赖离散的语言描述,无法准确表示连续的色彩变化。为克服这一限制,我们提出了ColourCrafter,一个统一的扩散框架,将颜色编辑从全局色调迁移转变为结构化的区域感知生成过程。与传统的颜色驱动方法不同,ColourCrafter在潜在空间中对RGB颜色令牌和图像令牌进行令牌级融合,选择性地将颜色信息传播到语义相关区域,同时保持结构的保真度。感知Lab空间损失通过解耦亮度与色度并将编辑约束在掩膜区域内,进一步增强了像素级精度。此外,我们构建了ColourfulSet,一个包含具有连续且多样化颜色变化的高质量图像对的大规模数据集。大量实验证明,ColourCrafter在细粒度颜色编辑中实现了最先进的颜色准确性、可控性和感知保真度。我们的项目可在https://yangyuqi317.github.io/ColourCrafter.github.io/获取。
cs.CV / 33 / 2603.18480
Do Vision Language Models Understand Human Engagement in Games?
视觉语言模型是否理解游戏中的人类参与度?
Abstract
Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision-language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception-understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.
Chinese Translation
从游戏视频中推断人类参与度对于游戏设计和玩家体验研究至关重要,但目前尚不清楚视觉语言模型(VLMs)是否能够仅通过视觉线索推断出这种潜在的心理状态。我们使用GameVibe少样本数据集,涵盖九款第一人称射击游戏,评估了三种VLM在六种提示策略下的表现,包括零样本预测、基于心流理论(Flow)、游戏心流(GameFlow)、自我决定理论(Self-Determination Theory)和MDA的理论指导提示,以及检索增强提示。我们考虑了逐点参与度预测和连续窗口之间参与度变化的成对预测。结果显示,零样本VLM预测通常较弱,且往往无法超越简单的逐游戏多数类基线。在某些设置中,记忆或检索增强提示改善了逐点预测,而成对预测在各种策略下始终较为困难。仅依靠理论指导提示并不能可靠地带来帮助,反而可能强化表面层次的捷径。这些发现表明当前VLM存在感知与理解之间的差距:尽管它们能够识别可见的游戏线索,但在跨游戏稳健推断人类参与度方面仍然存在困难。
cs.CV / 34 / 2603.18481
T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World
T-QPM:在开放世界中为视觉-语言模型启用时间域外检测和领域泛化
Abstract
Out-of-distribution (OOD) detection remains a critical challenge in open-world learning, where models must adapt to evolving data distributions. While recent vision-language models (VLMs) like CLIP enable multimodal OOD detection through Dual-Pattern Matching (DPM), existing methods typically suffer from two major shortcomings: (1) They rely on fixed fusion rules and assume static environments, failing under temporal drift; and (2) they lack robustness against covariate-shifted inputs. In this paper, we propose a novel two-step framework to enhance OOD detection and covariate distribution shift robustness in dynamic settings. We extend the dual-pattern regime into Temporal Quadruple-Pattern Matching (T-QPM). First, by pairing OOD images with text descriptions, we introduce cross-modal consistency patterns between ID and OOD signals, refining the decision boundary through joint image-text reasoning. Second, we address temporal distribution shifts by learning lightweight fusion weights to optimally combine semantic matching and visual typicality. To ensure stability, we enforce explicit regularization based on Average Thresholded Confidence (ATC), preventing performance degradation as distributions evolve. Experiments on temporally partitioned benchmarks demonstrate that our approach significantly outperforms static baselines, offering a robust, temporally-consistent framework for multimodal OOD detection in non-stationary environments.
Chinese Translation
域外检测(OOD)在开放世界学习中仍然是一个关键挑战,因为模型必须适应不断变化的数据分布。尽管最近的视觉-语言模型(VLMs)如 CLIP 通过双模式匹配(DPM)实现了多模态 OOD 检测,但现有方法通常存在两个主要缺陷:(1)它们依赖于固定的融合规则,并假设环境是静态的,因此在时间漂移下表现不佳;(2)它们缺乏对协变量偏移输入的鲁棒性。本文提出了一种新颖的两步框架,以增强动态环境中的 OOD 检测和对协变量分布偏移的鲁棒性。我们将双模式机制扩展为时间四重模式匹配(T-QPM)。首先,通过将 OOD 图像与文本描述配对,我们引入了 ID 和 OOD 信号之间的跨模态一致性模式,通过联合图像-文本推理来细化决策边界。其次,我们通过学习轻量级融合权重来解决时间分布漂移,以最佳方式结合语义匹配和视觉典型性。为了确保稳定性,我们基于平均阈值置信度(ATC)强制执行显式正则化,防止随着分布的演变而导致性能下降。在时间分区基准上的实验表明,我们的方法显著优于静态基线,为非平稳环境中的多模态 OOD 检测提供了一个鲁棒且时间一致的框架。
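The fusion step described in the T-QPM abstract, combining a semantic matching score with a visual typicality score under lightweight learned weights plus an ATC-style check, can be sketched roughly as follows. The scalar `alpha` parameterization and the fixed threshold are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
semantic = rng.random(n)     # image-text matching score (higher = more ID-like)
typicality = rng.random(n)   # visual typicality score from the image branch

# Lightweight fusion: one learnable logit alpha yields convex weights; the
# paper adapts such weights over time, here we show a single setting.
alpha = 0.3
w = 1.0 / (1.0 + np.exp(-alpha))       # sigmoid -> weight in (0, 1)
fused = w * semantic + (1.0 - w) * typicality

# ATC-style statistic: fraction of samples above a confidence threshold,
# usable as a regularization target as the distribution drifts.
tau = 0.5
atc = (fused > tau).mean()
print(fused.shape, float(atc))
```

Since the fusion is convex, each fused score stays between the two component scores, which keeps the combined OOD score on the same scale as its inputs.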
cs.CV / 35 / 2603.18488
TexEditor: Structure-Preserving Text-Driven Texture Editing
TexEditor:结构保持的文本驱动纹理编辑
Abstract
Text-guided texture editing aims to modify object appearance while preserving the underlying geometric structure. However, our empirical analysis reveals that even SOTA editing models frequently struggle to maintain structural consistency during texture editing, despite the intended changes being purely appearance-related. Motivated by this observation, we jointly enhance structure preservation from both data and training perspectives, and build TexEditor, a dedicated texture editing model based on Qwen-Image-Edit-2509. Firstly, we construct TexBlender, a high-quality SFT dataset generated with Blender, which provides strong structural priors for a cold start. Secondly, we introduce StructureNFT, a RL-based approach that integrates structure-preserving losses to transfer the structural priors learned during SFT to real-world scenes. Moreover, due to the limited realism and evaluation coverage of existing benchmarks, we introduce TexBench, a general-purpose real-world benchmark for text-guided texture editing. Extensive experiments on existing Blender-based texture benchmarks and our TexBench show that TexEditor consistently outperforms strong baselines such as Nano Banana Pro. In addition, we assess TexEditor on the general-purpose benchmark ImgEdit to validate its generalization. Our code and data are available at https://github.com/KlingAIResearch/TexEditor.
Chinese Translation
文本引导的纹理编辑旨在修改物体外观,同时保持其底层几何结构。然而,我们的实证分析表明,即使是最先进的编辑模型(SOTA)在纹理编辑过程中也常常难以保持结构一致性,尽管所期望的变化仅与外观相关。基于这一观察,我们从数据和训练两个角度共同增强结构保持能力,并构建了TexEditor,一个基于Qwen-Image-Edit-2509的专用纹理编辑模型。首先,我们构建了TexBlender,一个使用Blender生成的高质量SFT数据集,为冷启动提供强有力的结构先验。其次,我们引入了StructureNFT,一种基于强化学习的方法,整合了结构保持损失,将SFT过程中学习到的结构先验转移到真实场景中。此外,由于现有基准测试的现实性和评估覆盖范围有限,我们引入了TexBench,一个通用的现实世界基准,用于文本引导的纹理编辑。在现有基于Blender的纹理基准和我们的TexBench上进行的广泛实验表明,TexEditor始终优于Nano Banana Pro等强基线。此外,我们在通用基准ImgEdit上评估了TexEditor,以验证其泛化能力。我们的代码和数据可在https://github.com/KlingAIResearch/TexEditor获取。
cs.CV / 36 / 2603.18493
FILT3R: Latent State Adaptive Kalman Filter for Streaming 3D Reconstruction
FILT3R:用于流式3D重建的潜在状态自适应卡尔曼滤波器
Abstract
Streaming 3D reconstruction maintains a persistent latent state that is updated online from incoming frames, enabling constant-memory inference. A key failure mode is the state update rule: aggressive overwrites forget useful history, while conservative updates fail to track new evidence, and both behaviors become unstable beyond the training horizon. To address this challenge, we propose FILT3R, a training-free latent filtering layer that casts recurrent state updates as stochastic state estimation in token space. FILT3R maintains a per-token variance and computes a Kalman-style gain that adaptively balances memory retention against new observations. Process noise -- governing how much the latent state is expected to change between frames -- is estimated online from EMA-normalized temporal drift of candidate tokens. Using extensive experiments, we demonstrate that FILT3R yields an interpretable, plug-in update rule that generalizes common overwrite and gating policies as special cases. Specifically, we show that gains shrink in stable regimes as uncertainty contracts with accumulated evidence, and rise when genuine scene change increases process uncertainty, improving long-horizon stability for depth, pose, and 3D reconstruction, compared to the existing methods. Code will be released at https://github.com/jinotter3/FILT3R.
Chinese Translation
流式3D重建保持一个持久的潜在状态,该状态从输入帧中在线更新,从而实现恒定内存推理。一个关键的失败模式是状态更新规则:激进的覆盖会忘记有用的历史,而保守的更新则无法跟踪新的证据,这两种行为在超出训练范围后都会变得不稳定。为了解决这一挑战,我们提出了FILT3R,一种无训练的潜在过滤层,将递归状态更新视为在标记空间中的随机状态估计。FILT3R维护每个标记的方差,并计算一种卡尔曼风格的增益,自适应地平衡内存保留与新观察之间的关系。过程噪声——控制潜在状态在帧之间预期变化的程度——是通过候选标记的EMA标准化时间漂移在线估计的。通过大量实验,我们证明FILT3R提供了一种可解释的、插件式的更新规则,将常见的覆盖和门控策略作为特例进行推广。具体而言,我们展示了在稳定状态下,增益随着累积证据的不确定性收缩而减小,而当真实场景变化增加过程不确定性时,增益则上升,从而改善了深度、姿态和3D重建的长期稳定性,相较于现有方法。代码将发布在 https://github.com/jinotter3/FILT3R。
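The Kalman-style behavior FILT3R describes, gains that shrink in stable regimes as per-token uncertainty contracts, can be illustrated with a scalar-per-token toy update. The noise values `Q` and `R` below are arbitrary assumptions for illustration; FILT3R estimates process noise online from EMA-normalized temporal drift:

```python
import numpy as np

def kalman_token_update(state, P, obs, R=0.1, Q=0.01):
    # One Kalman-style step per token: predict (inflate variance by process
    # noise Q), then correct toward the new observation with gain K.
    P_pred = P + Q
    K = P_pred / (P_pred + R)      # gain in (0, 1): 0 keeps memory, 1 overwrites
    state = state + K * (obs - state)
    P = (1.0 - K) * P_pred         # posterior variance shrinks after the update
    return state, P, K

rng = np.random.default_rng(3)
state = rng.normal(size=16)        # latent tokens (one scalar per token here)
P = np.ones(16)                    # per-token variance, initially uncertain

gains = []
for _ in range(10):                # stable regime: repeated consistent evidence
    obs = state + 0.01 * rng.normal(size=16)
    state, P, K = kalman_token_update(state, P, obs)
    gains.append(K.mean())
print(gains[0] > gains[-1])        # gain shrinks as uncertainty contracts
```

Raising `Q` when genuine scene change is detected would push the gain back up, recovering the overwrite-like behavior the abstract describes; fixed gains of 0 or 1 recover pure memory retention or pure overwriting as special cases.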
cs.CV / 37 / 2603.18496
NymeriaPlus: Enriching Nymeria Dataset with Additional Annotations and Data
NymeriaPlus:通过附加注释和数据丰富 Nymeria 数据集
Abstract
The Nymeria Dataset, released in 2024, is a large-scale collection of in-the-wild human activities captured with multiple egocentric wearable devices that are spatially localized and temporally synchronized. It provides body-motion ground truth recorded with a motion-capture suit, device trajectories, semi-dense 3D point clouds, and in-context narrations. In this paper, we upgrade Nymeria and introduce NymeriaPlus. NymeriaPlus features: (1) improved human motion in Momentum Human Rig (MHR) and SMPL formats; (2) dense 3D and 2D bounding box annotations for indoor objects and structural elements; (3) instance-level 3D object reconstructions; and (4) additional modalities e.g., basemap recordings, audio, and wristband videos. By consolidating these complementary modalities and annotations into a single, coherent benchmark, NymeriaPlus strengthens Nymeria into a more powerful in-the-wild egocentric dataset. We expect NymeriaPlus to bridge a key gap in existing egocentric resources and to support a broader range of research, including unique explorations of multimodal learning for embodied AI.
Chinese Translation
Nymeria 数据集于 2024 年发布,是一个大规模的野外人类活动集合,使用多种自我中心的可穿戴设备进行捕捉,这些设备在空间上定位并在时间上同步。该数据集提供了使用动作捕捉服记录的身体运动真值、设备轨迹、半稠密 3D 点云以及上下文叙述。在本文中,我们对 Nymeria 进行了升级,并引入了 NymeriaPlus。NymeriaPlus 的特点包括:(1)以 Momentum Human Rig (MHR) 和 SMPL 格式改进的人类运动;(2)室内物体和结构元素的稠密 3D 和 2D 边界框注释;(3)实例级 3D 物体重建;以及(4)额外的模态,例如底图记录、音频和手环视频。通过将这些互补的模态和注释整合到一个单一、连贯的基准中,NymeriaPlus 将 Nymeria 强化为一个更强大的野外自我中心数据集。我们期望 NymeriaPlus 能够填补现有自我中心资源中的关键空白,并支持更广泛的研究,包括对具身人工智能的多模态学习的独特探索。
cs.CV / 38 / 2603.18501
Efficient Video Diffusion with Sparse Information Transmission for Video Compression
基于稀疏信息传输的高效视频扩散用于视频压缩
Abstract
Video compression aims to maximize reconstruction quality with minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry images of poor perceptual quality. Besides, existing generative compression methods often treat video frames independently and show limitations in time coherence and efficiency. To address these challenges, we propose the Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. Subsequently, the ODFTE processes this intermediate sequence as a whole, which exploits the temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to perform adaptive reconstruction according to different frame types to optimize the overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state-of-the-art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at https://github.com/MingdeZhou/Diff-SIT.
Chinese Translation
视频压缩旨在以最小的比特率最大化重建质量。除了标准的失真度量,感知质量和时间一致性也至关重要。然而,在超低比特率下,传统的端到端压缩模型往往会产生模糊且感知质量较差的图像。此外,现有的生成压缩方法通常独立处理视频帧,在时间一致性和效率方面存在局限性。为了解决这些挑战,我们提出了基于稀疏信息传输的高效视频扩散(Efficient Video Diffusion with Sparse Information Transmission,Diff-SIT),该方法包括稀疏时间编码模块(Sparse Temporal Encoding Module,STEM)和带帧类型嵌入器的一步视频扩散(One-Step Video Diffusion with Frame Type Embedder,ODFTE)。STEM将原始帧序列稀疏编码为信息丰富的中间序列,从而实现显著的比特率节省。随后,ODFTE将这个中间序列整体处理,利用时间相关性。在此过程中,我们提出的帧类型嵌入器(Frame Type Embedder,FTE)引导扩散模型根据不同的帧类型进行自适应重建,以优化整体质量。在多个数据集上的大量实验表明,Diff-SIT在感知质量和时间一致性方面建立了新的最先进水平,特别是在具有挑战性的超低比特率环境中。代码已发布在 https://github.com/MingdeZhou/Diff-SIT。
cs.CV / 39 / 2603.18502
HOMEY: Heuristic Object Masking with Enhanced YOLO for Property Insurance Risk Detection
HOMEY:基于增强YOLO的启发式物体掩膜用于财产保险风险检测
Abstract
Automated property risk detection is a high-impact yet underexplored frontier in computer vision with direct implications for real estate, underwriting, and insurance operations. We introduce HOMEY (Heuristic Object Masking with Enhanced YOLO), a novel detection framework that combines YOLO with a domain-specific masking mechanism and a custom-designed loss function. HOMEY is trained to detect 17 risk-related property classes, including structural damages (e.g., cracked foundations, roof issues), maintenance neglect (e.g., dead yards, overgrown bushes), and liability hazards (e.g., falling gutters, garbage, hazard signs). Our approach introduces heuristic object masking to amplify weak signals in cluttered backgrounds and risk-aware loss calibration to balance class skew and severity weighting. Experiments on real-world property imagery demonstrate that HOMEY achieves superior detection accuracy and reliability compared to baseline YOLO models, while retaining fast inference. Beyond detection, HOMEY enables interpretable and cost-efficient risk analysis, laying the foundation for scalable AI-driven property insurance workflows.
Chinese Translation
自动化财产风险检测是计算机视觉中一个具有高影响力但尚未深入探索的前沿领域,对房地产、承保和保险业务有直接影响。我们提出了HOMEY(基于增强YOLO的启发式物体掩膜),这是一种新颖的检测框架,将YOLO与特定领域的掩膜机制和自定义设计的损失函数相结合。HOMEY经过训练可以检测17类与风险相关的财产类别,包括结构损伤(如:基础裂缝、屋顶问题)、维护疏忽(如:荒废的庭院、过度生长的灌木)和责任风险(如:脱落的雨水槽、垃圾、危险标志)。我们的方法引入了启发式物体掩膜,在杂乱背景中放大微弱信号,并采用风险感知的损失校准来平衡类别倾斜和严重性加权。在真实世界的财产图像上的实验表明,HOMEY相比基线YOLO模型实现了更优的检测准确性和可靠性,同时保持快速推理。除了检测之外,HOMEY还能够进行可解释且成本高效的风险分析,为可扩展的AI驱动财产保险工作流奠定基础。
cs.CV / 40 / 2603.18505
From Snapshots to Symphonies: The Evolution of Protein Prediction from Static Structures to Generative Dynamics and Multimodal Interactions
从快照到交响乐:蛋白质预测从静态结构到生成动态与多模态交互的演变
Abstract
The protein folding problem has been fundamentally transformed by artificial intelligence, evolving from static structure prediction toward the modeling of dynamic conformational ensembles and complex biomolecular interactions. This review systematically examines the paradigm shift in AI-driven protein science across five interconnected dimensions: unified multimodal representations that integrate sequences, geometries, and textual knowledge; refinement of static prediction through MSA-free architectures and all-atom complex modeling; generative frameworks, including diffusion models and flow matching, that capture conformational distributions consistent with thermodynamic ensembles; prediction of heterogeneous interactions spanning protein-ligand, protein-nucleic-acid, and protein-protein complexes; and functional inference of fitness landscapes, mutational effects, and text-guided property prediction. We critically analyze current bottlenecks, including data distribution biases, limited mechanistic interpretability, and the disconnect between geometric metrics and biophysical reality, while identifying future directions toward physically consistent generative models, multimodal foundation architectures, and experimental closed-loop systems. This methodological transformation marks artificial intelligence's transition from a structural analysis tool into a universal simulator capable of understanding and ultimately rewriting the dynamic language of life.
Chinese Translation
蛋白质折叠问题已被人工智能根本性地转变,从静态结构预测发展到动态构象集合和复杂生物分子相互作用的建模。本文系统地审视了人工智能驱动的蛋白质科学在五个相互关联维度上的范式转变:统一的多模态表示,整合序列、几何形状和文本知识;通过无多序列比对(MSA)架构和全原子复杂建模来优化静态预测;生成框架,包括扩散模型和流匹配,捕捉与热力学集合一致的构象分布;预测跨越蛋白质-配体、蛋白质-核酸和蛋白质-蛋白质复合物的异质相互作用;以及对适应度景观、突变效应和文本引导属性预测的功能推断。我们批判性地分析了当前的瓶颈,包括数据分布偏差、有限的机制可解释性,以及几何度量与生物物理现实之间的脱节,同时识别出未来朝向物理一致的生成模型、多模态基础架构和实验闭环系统的发展方向。这一方法论的转变标志着人工智能从结构分析工具转变为能够理解并最终重写生命动态语言的通用模拟器。
cs.CV / 41 / 2603.18508
Foundations and Architectures of Artificial Intelligence for Motor Insurance
汽车保险领域人工智能的基础与架构
Abstract
This handbook presents a systematic treatment of the foundations and architectures of artificial intelligence for motor insurance, grounded in large-scale real-world deployment. It formalizes a vertically integrated AI paradigm that unifies perception, multimodal reasoning, and production infrastructure into a cohesive intelligence stack for automotive risk assessment and claims processing. At its core, the handbook develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows. These components are composed into a scalable pipeline operating under practical constraints observed in nationwide motor insurance systems in Thailand. Beyond model design, the handbook emphasizes the co-evolution of learning algorithms and MLOps practices, establishing a principled framework for translating modern artificial intelligence into reliable, production-grade systems in high-stakes industrial environments.
Chinese Translation
本手册对汽车保险领域人工智能的基础和架构进行了系统性阐述,基于大规模的真实世界部署经验。它形式化了一个垂直集成的人工智能范式,将感知、多模态推理和生产基础设施统一成一个完整的智能堆栈,用于汽车风险评估和索赔处理。本手册的核心是开发了领域适配的变换器架构,用于结构化视觉理解、关系车辆表示学习和多模态文档智能,从而实现车辆损伤分析、索赔评估和承保工作流的端到端自动化。这些组件被组合成一个可扩展的管道,能够在泰国全国汽车保险系统中观察到的实际限制下运行。除了模型设计外,本手册还强调了学习算法与MLOps实践的共同演进,建立了一个原则性框架,将现代人工智能转化为在高风险工业环境中可靠的生产级系统。
cs.CV / 42 / 2603.18510
OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting
OnlinePG:基于3D高斯溅射的在线开放词汇全景建图
Abstract
Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build local consistency map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within sliding window into complete instances. Subsequently, to update the global map, we construct explicit grids with spatial attributes for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance among online approaches, while maintaining real-time efficiency.
Chinese Translation
结合在线全景建图的开放词汇场景理解,对于具身应用感知环境并与之交互至关重要。然而,现有方法大多是离线的,或缺乏实例级理解,这限制了它们在现实世界机器人任务中的适用性。在本文中,我们提出了OnlinePG,一个新颖且有效的系统,它利用3D高斯溅射在在线设置下整合几何重建与开放词汇感知。技术上,为了实现在线全景建图,我们采用了一种基于滑动窗口的高效局部到全局范式。为了构建局部一致性地图,我们构建了一个联合利用几何与语义线索的3D片段聚类图,将滑动窗口内不一致的片段融合为完整实例。随后,为了更新全局地图,我们为局部3D高斯地图构建了具有空间属性的显式网格,并通过稳健的双向二分3D高斯实例匹配将它们融合到全局地图中。最后,我们利用3D空间属性网格中融合的VLM特征来实现开放词汇场景理解。在广泛使用的数据集上进行的大量实验表明,我们的方法在在线方法中取得了更好的性能,同时保持了实时效率。
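The bidirectional bipartite matching OnlinePG uses to fuse local instances into the global map can be sketched as mutual-best matching over an overlap matrix. The IoU stand-in, the threshold, and the planted scores below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(4)
# Overlap scores between local-window instances (rows) and global-map
# instances (cols), e.g. IoU of their spatial attribute grids.
overlap = rng.random((5, 6))
overlap[0, 2] = overlap[3, 1] = 1.0   # plant two unambiguous correspondences

# Bidirectional (mutual-best) bipartite matching: keep pair (i, j) only if
# j is i's best global match AND i is j's best local match, above a floor.
row_best = overlap.argmax(axis=1)     # best global candidate per local instance
col_best = overlap.argmax(axis=0)     # best local candidate per global instance
tau = 0.5
matches = [(i, j) for i, j in enumerate(row_best)
           if col_best[j] == i and overlap[i, j] > tau]
print(matches)
```

The mutual-best criterion is what makes the matching "bidirectional": a one-sided argmax could greedily attach two local instances to the same global one, whereas requiring agreement in both directions yields a conservative one-to-one fusion.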
cs.CV / 43 / 2603.18513
CAFlow: Adaptive-Depth Single-Step Flow Matching for Efficient Histopathology Super-Resolution
CAFlow:用于高效组织病理学超分辨率的自适应深度单步流匹配
Abstract
In digital pathology, whole-slide images routinely exceed gigapixel resolution, making computationally intensive generative super-resolution (SR) impractical for routine deployment. We introduce CAFlow, an adaptive-depth single-step flow-matching framework that routes each image tile to the shallowest network exit that preserves reconstruction quality. CAFlow performs flow matching in pixel-unshuffled rearranged space, reducing spatial computation by 16x while enabling direct inference. We show that dedicating half of training to exact t=0 samples is essential for single-step quality (-1.5 dB without it). The backbone, FlowResNet (1.90M parameters), mixes convolution and window self-attention blocks across four early exits spanning 3.1 to 13.3 GFLOPs. A lightweight exit classifier (~6K parameters) achieves 33% compute savings at only 0.12 dB cost. On multi-organ histopathology x4 SR, adaptive routing achieves 31.72 dB PSNR versus 31.84 dB at full depth, while the shallowest exit exceeds bicubic by +1.9 dB at 2.8x less compute than SwinIR-light. The method generalizes to held-out colon tissue with minimal quality loss (-0.02 dB), and at x8 upscaling it outperforms all comparable-compute baselines while remaining competitive with the much larger SwinIR-Medium model. Downstream nuclei segmentation confirms preservation of clinically relevant structure. The model trains in under 5 hours on a single GPU, and adaptive routing can reduce whole-slide inference from minutes to seconds.
Chinese Translation
在数字病理学中,全切片图像的分辨率通常超过千兆像素,这使得计算密集型的生成式超分辨率(SR)在常规部署中变得不切实际。我们提出了CAFlow,一种自适应深度单步流匹配框架,该框架将每个图像块路由到能保持重建质量的最浅网络出口。CAFlow在像素反洗牌(pixel-unshuffle)重排后的空间中执行流匹配,将空间计算量减少16倍,同时支持直接推理。我们表明,将一半的训练专门用于精确的t=0样本对于单步质量至关重要(没有它质量下降1.5 dB)。主干网络FlowResNet(1.90M参数)在覆盖3.1到13.3 GFLOPs的四个早期出口中混合了卷积和窗口自注意力块。一个轻量级出口分类器(约6K参数)仅以0.12 dB的代价实现了33%的计算节省。在多器官组织病理学x4超分辨率任务中,自适应路由实现了31.72 dB的PSNR(全深度为31.84 dB),而最浅出口在计算量比SwinIR-light少2.8倍的情况下比双三次插值高出1.9 dB。该方法能够以极小的质量损失(-0.02 dB)推广到未见过的结肠组织,并且在x8放大时超越了所有计算量相当的基线,同时与规模大得多的SwinIR-Medium模型保持竞争力。下游细胞核分割确认了临床相关结构的保留。该模型在单个GPU上训练时间少于5小时,自适应路由可以将全切片推理时间从分钟级缩短到秒级。
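The pixel-unshuffle rearrangement behind CAFlow's 16x spatial reduction is a lossless space-to-depth reshape: each r x r spatial block is folded into channels. A minimal numpy sketch for r = 4:

```python
import numpy as np

def pixel_unshuffle(x, r=4):
    # Space-to-depth: a (C, H, W) map becomes (C*r*r, H/r, W/r), i.e.
    # 16x fewer spatial positions for r = 4, with no information lost.
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)

img = np.arange(1 * 32 * 32, dtype=float).reshape(1, 32, 32)
z = pixel_unshuffle(img)
print(z.shape)  # (16, 8, 8)
```

Because the operation is a pure permutation of pixels, it is exactly invertible (the inverse is the familiar pixel-shuffle used in sub-pixel upsampling), so running flow matching in the rearranged space trades spatial resolution for channels without discarding content.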
cs.CV / 44 / 2603.18523
Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models
计数电路:大型视觉语言模型中视觉推理的机制可解释性
Abstract
Counting serves as a simple but powerful test of a Large Vision-Language Model's (LVLM's) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured "counting circuit" that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.
Chinese Translation
计数作为对大型视觉语言模型(LVLM)推理能力的一项简单而有力的测试,迫使模型识别每个单独的对象并将其全部相加。在本研究中,我们通过受控的合成与真实世界基准,结合机制分析,考察LVLM如何实现计数。我们的结果表明,LVLM表现出类似人类的计数行为:在小数量上表现精确,而在较大数量上则为带噪声的估计。我们引入了两种新颖的可解释性方法,即视觉激活修补(Visual Activation Patching)和HeadLens,并利用它们揭示了一个在多种视觉推理任务中广泛共享的结构化“计数电路”。基于这些见解,我们提出了一种轻量级干预策略,利用简单且易于大量获取的合成图像,仅在计数任务上对任意预训练的LVLM进行微调。尽管这种微调的范围较窄,但该干预不仅提高了在分布内合成数据上的计数准确性,还在分布外计数基准上平均提高了+8.36%,并在复杂的通用视觉推理任务上为Qwen2.5-VL带来了平均+1.54%的提升。这些发现突显了计数在视觉推理中核心且有影响力的角色,并提示了通过针对性增强计数机制来改善整体视觉推理能力的潜在途径。
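Activation patching, the family of interventions behind the paper's Visual Activation Patching, replaces an internal activation from one forward pass with that of another and measures the effect on the output. A toy two-layer example (not the paper's method, which operates on LVLM visual tokens) conveys the mechanics:

```python
import numpy as np

# Toy illustration of activation patching: run a "clean" and a "corrupted"
# input, then splice the clean hidden activation into the corrupted forward
# pass and observe how much of the clean output is restored.
rng = np.random.default_rng(5)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))

def forward(x, patch_hidden=None):
    h = np.tanh(W1 @ x)
    if patch_hidden is not None:
        h = patch_hidden               # intervene on the hidden layer
    return (W2 @ h).item(), h

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)

y_clean, h_clean = forward(x_clean)
y_corrupt, _ = forward(x_corrupt)
y_patched, _ = forward(x_corrupt, patch_hidden=h_clean)

# Patching the entire hidden layer restores the clean output exactly;
# patching only some units localizes which ones carry the behavior.
print(np.isclose(y_patched, y_clean))
```

In mechanistic-interpretability practice the interesting case is patching individual components (heads, layers, token positions) rather than a whole layer: the components whose patching recovers the clean behavior are the candidate members of the circuit.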
cs.CV / 45 / 2603.18524
3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
3DreamBooth:高保真度的3D主体驱动视频生成模型
Abstract
Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/
Chinese Translation
为定制主体创建动态且视角一致的视频,在沉浸式虚拟现实/增强现实、虚拟制作和下一代电子商务等一系列新兴应用中备受追捧。然而,尽管主体驱动的视频生成技术取得了快速进展,现有方法主要将主体视为二维实体,关注通过单视角视觉特征或文本提示进行身份迁移。由于现实世界中的主体本质上是三维的,因此将这些以二维为中心的方法应用于三维物体定制时暴露出一个基本限制:它们缺乏重建三维几何所需的全面空间先验知识。因此,在合成新视角时,它们必须依赖于为未见区域生成可信但任意的细节,而不是保留真实的三维身份。由于多视角视频数据集的稀缺,实现真正的三维感知定制仍然面临挑战。虽然可以尝试在有限的视频序列上微调模型,但这往往导致时间过拟合。为了解决这些问题,我们提出了一种新的三维感知视频定制框架,包括3DreamBooth和3Dapter。3DreamBooth通过单帧优化范式将空间几何与时间运动解耦。通过将更新限制在空间表示上,它在模型中有效地嵌入了一个强健的三维先验,而无需耗费大量基于视频的训练。为了增强细粒度纹理并加速收敛,我们引入了视觉条件模块3Dapter。在单视图预训练后,3Dapter通过非对称条件策略与主生成分支进行多视图联合优化。该设计使该模块能够充当动态选择性路由器,从最小参考集中查询视角特定的几何线索。项目页面:https://ko-lani.github.io/3DreamBooth/
cs.CV / 46 / 2603.18541
Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection
矫正跨域少样本目标检测中的目标域散光问题
Abstract
Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, suffering from severe domain shifts and data scarcity problems. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, just like a human cannot focus on visual objects. Therefore, we call it the target-domain Astigmatism problem. Analysis on attention distances across transformer layers reveals that regular fine-tuning inherently shows a trend to remedy this problem, but results are still far from satisfactory, which we aim to enhance in this paper. Biologically inspired by the human fovea-style visual system, we enhance the fine-tuning's inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.
Chinese Translation
跨域少样本目标检测(CD-FSOD)旨在将预训练的检测器从源域适应到注释有限的目标域,但面临严重的域偏移和数据稀缺问题。在本研究中,我们发现了一个之前被忽视的现象:模型在目标域中表现出分散且不集中的注意力,导致定位不准确和冗余预测,就像人类无法聚焦于视觉对象一样。因此,我们将其称为目标域散光问题。对变换器层之间注意力距离的分析表明,常规的微调固有地显示出缓解该问题的趋势,但结果仍然远未令人满意,这是我们在本文中旨在增强的。受人类中央凹视觉系统的生物启发,我们通过一个中心-边缘注意力精炼框架增强微调的固有趋势,该框架包含(1)一个正模式精炼模块,通过使用类特定原型重塑对语义对象的注意力,模拟视觉中心区域;(2)一个负背景调制模块,通过建模背景上下文增强边界辨别,模拟视觉边缘区域;(3)一个文本语义对齐模块,通过跨模态线索加强中心-边缘区分。我们这种生物启发的方法将散光注意力转变为集中的模式,显著改善了对目标域的适应性。在六个具有挑战性的CD-FSOD基准测试上的实验一致证明了检测精度的提高,并建立了新的最先进结果。
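The diagnosis above rests on measuring attention distances across transformer layers. As an illustration only (the paper may define the metric differently), one common formulation is the expected spatial distance between each query patch and the patches it attends to; dispersed, "astigmatic" attention yields large values, focused attention small ones:

```python
import numpy as np

def mean_attention_distance(attn, grid_size):
    """Mean spatial distance (in patch units) spanned by attention.

    attn: (N, N) row-stochastic attention matrix over N = grid_size**2 patches.
    Larger values indicate dispersed attention; smaller values, focused attention.
    """
    n = grid_size * grid_size
    assert attn.shape == (n, n)
    ys, xs = np.divmod(np.arange(n), grid_size)
    coords = np.stack([ys, xs], axis=1).astype(float)        # (N, 2) patch centres
    # Pairwise Euclidean distances between patch centres.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Expected distance under each query's attention, averaged over queries.
    return float((attn * d).sum(axis=1).mean())
```

Tracking this value per layer during fine-tuning would expose the remedial trend the abstract describes.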
cs.CV / 47 / 2603.18545
CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models
CoDA:探索医疗视觉语言模型的链式分布攻击与事后令牌空间修复
Abstract
Medical vision--language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.
Chinese Translation
医疗视觉语言模型(MVLMs)越来越多地作为放射学流程中的感知支撑和多模态助手的视觉前端使用,但它们在真实临床工作流程下的可靠性仍然未得到充分探讨。以往的鲁棒性评估往往假设输入数据是干净且经过精心策划的,或研究孤立的损坏,忽视了在保持临床可读性的同时改变图像统计特征的常规获取、重建、显示和传递操作。为了解决这一问题,我们提出了CoDA,一个链式分布框架,通过组合类似获取的阴影、重建和显示重映射,以及传递和导出降解,构建临床上合理的流程转变。在掩蔽结构相似性约束下,CoDA联合优化阶段组合和参数,以诱发故障,同时保持视觉的合理性。在脑部MRI、胸部X光和腹部CT的实验中,CoDA显著降低了CLIP风格MVLMs的零样本性能,链式组合的破坏性始终大于任何单一阶段。我们还评估了多模态大型语言模型(MLLMs)作为影像真实感和质量的技术真实性审计工具,而非病理学。专有的多模态模型在CoDA转变样本上显示出审计可靠性下降和持续的高置信度错误,而我们测试的医疗特定MLLMs在医疗图像质量审计方面表现出明显不足。最后,我们引入了一种基于教师引导的令牌空间适应和补丁级对齐的事后修复策略,改善了对归档CoDA输出的准确性。总体而言,我们的研究描绘了MVLM部署的临床基础威胁面,并表明轻量级对齐可以提高部署的鲁棒性。
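The chain-of-distribution construction composes pipeline stages under a structural-similarity constraint. The sketch below only illustrates that accept/reject logic, not CoDA's actual optimizer; `ssim` here is a simplified single-window variant (no Gaussian windows), and the stage functions and threshold `tau` are placeholders:

```python
import numpy as np

def ssim(x, y, c1=1e-4, c2=9e-4):
    # Simplified single-window SSIM on [0, 1] intensity arrays.
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2*mx*my + c1) * (2*cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def apply_chain(image, stages, mask, tau=0.85):
    """Compose corruption stages; reject the chain if masked SSIM drops below tau,
    i.e. if the shifted image is no longer clinically plausible."""
    out = image
    for stage in stages:
        out = np.clip(stage(out), 0.0, 1.0)
    if ssim(image[mask], out[mask]) < tau:
        return None          # chain too destructive within the masked region
    return out
```

In the paper, stage compositions and their parameters are jointly optimized rather than filtered post hoc, but the masked constraint plays the same gatekeeping role.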
cs.CV / 48 / 2603.18558
HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
HiMu:用于长视频问答的层次化多模态帧选择
Abstract
Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
Chinese Translation
长视频问答需要对扩展的时间上下文进行推理,这使得帧选择对于受到有限上下文窗口限制的大型视觉语言模型(LVLMs)至关重要。现有方法面临着明显的权衡:基于相似性的选择器快速但将组合查询压缩成单一密集向量,从而丢失了子事件的顺序和跨模态的绑定;基于代理的方法通过迭代的 LVLM 推理恢复这种结构,但代价高昂。我们提出了 HiMu,这是一个无训练的框架,用以弥补这一差距。一次仅需文本的 LLM 调用将查询解构为一个层次化的逻辑树,其叶子节点是原子谓词,每个节点被引导到跨越视觉(CLIP、开放词汇检测、光学字符识别)和音频(自动语音识别、CLAP)的轻量级专家。产生的信号经过标准化、时间平滑以对齐不同的模态,并通过模糊逻辑运算符自下而上组成,从而强制执行时间排序和相邻关系,生成连续的满足度曲线。在Video-MME、LongVideoBench和HERBench-Lite上的评估表明,HiMu提升了效率-准确性帕累托前沿:在使用 Qwen3-VL 8B 的16帧下,其性能超过了所有竞争选择器;在使用 GPT-4o 的情况下,其表现超越了在32-512帧下操作的代理系统,同时所需的浮点运算量(FLOPs)约减少10倍。
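The bottom-up fuzzy-logic composition over per-frame predicate scores can be sketched as follows. This assumes the standard min/max fuzzy semantics for AND/OR and a running-maximum trick for sequencing ("a then b"); HiMu's exact operators and smoothing kernel are not specified in the abstract, so treat these as illustrative:

```python
import numpy as np

def smooth(scores, k=3):
    # Moving-average temporal smoothing of a per-frame score curve,
    # aligning signals from modalities with different time resolutions.
    return np.convolve(scores, np.ones(k) / k, mode="same")

def fuzzy_and(*curves):
    return np.minimum.reduce(curves)      # all predicates must hold

def fuzzy_or(*curves):
    return np.maximum.reduce(curves)      # any predicate may hold

def fuzzy_then(a, b):
    # "a then b": b only counts once a has already been satisfied,
    # enforced with a running (prefix) maximum of a's satisfaction.
    return np.minimum(np.maximum.accumulate(a), b)
```

Composing leaf curves through the query's logic tree yields the continuous satisfaction curve from which frames are selected.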
cs.CV / 49 / 2603.18561
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
CausalVAD:通过因果干预去混淆端到端自主驾驶
Abstract
Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby eliminating spurious factors from the representations for downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.
Chinese Translation
以规划为导向的端到端驾驶模型展现出巨大的潜力,但它们本质上学习的是统计相关性而非真实的因果关系。这一脆弱性导致了因果混淆,模型利用数据集偏差作为捷径,严重损害了其在复杂场景中的可靠性和安全性。为了解决这一问题,我们提出了CausalVAD,一种利用因果干预的去混淆训练框架。其核心是设计了稀疏因果干预方案(SCIS),这是一个轻量级的即插即用模块,用于在神经网络中实现后门调整理论。SCIS构建了一个原型字典,代表潜在的驾驶上下文。然后,它利用该字典对模型的稀疏向量查询进行干预。这一步主动消除了由混淆因素引起的虚假关联,从而从下游任务的表示中消除了虚假因素。在nuScenes等基准上的大量实验表明,CausalVAD在规划准确性和安全性方面达到了最先进的水平。此外,我们的方法在抵御数据偏差和噪声场景方面表现出更强的鲁棒性,这些场景被配置为引发因果混淆。
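One minimal reading of backdoor adjustment with a prototype dictionary is to marginalize the query's context over a fixed prior rather than the biased empirical context distribution. The fusion below (simple addition, uniform prior) is a deliberately simplified stand-in for SCIS's actual intervention module:

```python
import numpy as np

def backdoor_intervene(queries, prototypes, prior=None):
    """Adjust queries by marginalising over a dictionary of context prototypes.

    queries:    (Q, D) sparse vectorised queries.
    prototypes: (K, D) dictionary of latent driving-context prototypes.
    prior:      (K,) prior over contexts; uniform if None.
    """
    k = prototypes.shape[0]
    if prior is None:
        prior = np.full(k, 1.0 / k)          # uniform P(c) over contexts
    # f(x, c_k): fuse the query with one candidate context (additive here).
    fused = queries[:, None, :] + prototypes[None, :, :]       # (Q, K, D)
    # Backdoor adjustment: E_{c ~ P(c)} f(x, c), instead of conditioning on
    # the confounded context actually observed with each sample.
    return (prior[None, :, None] * fused).sum(axis=1)          # (Q, D)
```

The key point is that every context contributes according to the prior, severing the shortcut between scene context and planned action.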
cs.CV / 50 / 2603.18585
HAViT: Historical Attention Vision Transformer
HAViT:历史注意力视觉变换器
Abstract
Vision Transformers have excelled in computer vision but their attention mechanisms operate independently across layers, limiting information flow and feature learning. We propose an effective cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach enables progressive refinement of attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Comprehensive experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT performance increasing from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing 1.01% enhancement. Systematic analysis identifies the blending hyperparameter of historical attention (alpha = 0.45) as optimal across all configurations, providing the ideal balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at https://github.com/banik-s/HAViT.
Chinese Translation
视觉变换器在计算机视觉领域表现出色,但其注意力机制在各层之间独立操作,限制了信息流动和特征学习。我们提出了一种有效的跨层注意力传播方法,该方法在编码器层之间保留并整合历史注意力矩阵,从而为视觉变换器中的层间信息流提供了原则性的优化。这种方法使得注意力模式在变换器层次结构中逐步优化,提高了特征获取和优化动态。该方法只需极少的架构改动,仅增加注意力矩阵存储和混合操作。在CIFAR-100和TinyImageNet上的全面实验表明,准确性有了一致的提升,其中ViT在CIFAR-100上的性能从75.74%提高到77.07%(+1.33%),在TinyImageNet上的性能从57.82%提高到59.07%(+1.25%)。跨架构验证显示不同变换器变体之间也有类似的提升,其中CaiT的提升为1.01%。系统分析确定历史注意力的混合超参数(alpha = 0.45)在所有配置中是最优的,提供了当前和历史注意力信息之间的理想平衡。随机初始化的表现始终优于零初始化,表明多样的初始注意力模式加速了收敛并改善了最终性能。我们的代码已公开发布,地址为 https://github.com/banik-s/HAViT。
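The blending operation is simple enough to sketch directly: each encoder layer mixes its fresh attention matrix with the accumulated historical one, weighted by the reported optimum alpha = 0.45. Details such as the per-layer state update rule are assumptions here; the released code is authoritative:

```python
import numpy as np

def blend_attention(current, historical, alpha=0.45):
    """Blend this layer's attention with the running historical attention.

    alpha weights the current layer; the paper reports alpha = 0.45 as
    optimal across all configurations. Returns (attention to use, new state).
    """
    if historical is None:                   # first encoder layer: nothing to blend
        return current, current
    blended = alpha * current + (1.0 - alpha) * historical
    # A convex blend of row-stochastic matrices is already row-stochastic;
    # renormalising keeps it so under floating-point drift.
    blended = blended / blended.sum(axis=-1, keepdims=True)
    return blended, blended
```

Threading `historical` through the encoder stack is the only architectural change, matching the "attention matrix storage and blending operations" claim.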
cs.CV / 51 / 2603.18586
Color image restoration based on nonlocal saturation-value similarity
基于非局部饱和度-值相似性的彩色图像恢复
Abstract
In this paper, we propose and develop a novel nonlocal variational technique based on saturation-value similarity for color image restoration. In traditional nonlocal methods, image patches are extracted from red, green and blue channels of a color image directly, and the color information can not be described finely because the patch similarity is mainly based on the grayscale value of independent channel. The main aim of this paper is to propose and develop a novel nonlocal regularization method by considering the similarity of image patches in saturation-value channel of a color image. In particular, we first establish saturation-value similarity based nonlocal total variation by incorporating saturation-value similarity of color image patches into the proposed nonlocal gradients, which can describe the saturation and value similarity of two adjacent color image patches. The proposed nonlocal variational models are then formulated based on saturation-value similarity based nonlocal total variation. Moreover, we design an effective and efficient algorithm to solve the proposed optimization problem numerically by employing bregmanized operator splitting method, and we also study the convergence of the proposed algorithms. Numerical examples are presented to demonstrate that the performance of the proposed models is better than that of other testing methods in terms of visual quality and some quantitative metrics including peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), quaternion structural similarity index (QSSIM) and S-CIELAB color error.
Chinese Translation
本文提出并发展了一种基于饱和度-值相似性的新的非局部变分技术,用于彩色图像恢复。在传统的非局部方法中,图像块直接从彩色图像的红、绿和蓝通道中提取,且由于块相似性主要基于独立通道的灰度值,因此无法细致地描述颜色信息。本文的主要目的是提出并发展一种新的非局部正则化方法,通过考虑彩色图像中图像块在饱和度-值通道的相似性。具体而言,我们首先通过将彩色图像块的饱和度-值相似性纳入所提出的非局部梯度中,建立基于饱和度-值相似性的非局部全变差,这可以描述两个相邻彩色图像块的饱和度和值相似性。然后,基于饱和度-值相似性的非局部全变差构建所提出的非局部变分模型。此外,我们设计了一种有效且高效的算法,通过采用布雷格曼算子分裂方法数值求解所提出的优化问题,并研究了所提出算法的收敛性。数值实例表明,所提出模型在视觉质量及一些定量指标(包括峰值信噪比(PSNR)、结构相似性指数(SSIM)、四元数结构相似性指数(QSSIM)和S-CIELAB颜色误差)方面的性能优于其他测试方法。
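The core ingredient is replacing per-channel grayscale patch distances with a saturation-value distance when weighting nonlocal gradients. A compact sketch of that weight (standard HSV formulas; the Gaussian kernel and bandwidth `h` follow the usual nonlocal-means convention and are assumptions, not the paper's exact choice):

```python
import numpy as np

def saturation_value(rgb):
    # HSV saturation and value channels from an (H, W, 3) RGB image in [0, 1].
    v = rgb.max(axis=-1)
    c = v - rgb.min(axis=-1)                      # chroma
    s = np.where(v > 0, c / np.maximum(v, 1e-12), 0.0)
    return s, v

def sv_patch_weight(rgb, i, j, p=1, h=0.1):
    """Nonlocal weight between (2p+1)-sized patches centred at pixels i and j,
    based on saturation-value similarity rather than independent RGB channels."""
    s, v = saturation_value(rgb)
    def patch(ch, centre):
        y, x = centre
        return ch[y-p:y+p+1, x-p:x+p+1]
    d = ((patch(s, i) - patch(s, j))**2 + (patch(v, i) - patch(v, j))**2).mean()
    return np.exp(-d / (h * h))
```

These weights then enter the proposed nonlocal gradients, so patches that agree in saturation and value, not merely in per-channel intensity, reinforce each other.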
cs.CV / 52 / 2603.18588
AU Codes, Language, and Synthesis: Translating Anatomy to Text for Facial Behavior Synthesis
AU 代码、语言与合成:将解剖学转化为面部行为合成的文本
Abstract
Facial behavior synthesis remains a critical yet underexplored challenge. While text-to-face models have made progress, they often rely on coarse emotion categories, which lack the nuance needed to capture the full spectrum of human nonverbal communication. Action Units (AUs) provide a more precise and anatomically grounded alternative. However, current AU-based approaches typically encode AUs as one-hot vectors, modeling compound expressions as simple linear combinations of individual AUs. This linearity becomes problematic when handling conflicting AUs--defined as those which activate the same facial muscle with opposing actions. Such cases lead to anatomically implausible artifacts and unnatural motion superpositions. To address this, we propose a novel method that represents facial behavior through natural language descriptions of AUs. This approach preserves the expressiveness of the AU framework while enabling explicit modeling of complex and conflicting AUs. It also unlocks the potential of modern text-to-image models for high-fidelity facial synthesis. Supporting this direction, we introduce BP4D-AUText, the first large-scale text-image paired dataset for complex facial behavior. It is synthesized by applying a rule-based Dynamic AU Text Processor to the BP4D and BP4D+ datasets. We further propose VQ-AUFace, a generative model that leverages facial structural priors to synthesize realistic and diverse facial behaviors from text. Extensive quantitative experiments and user studies demonstrate that our approach significantly outperforms existing methods. It excels in generating facial expressions that are anatomically plausible, behaviorally rich, and perceptually convincing, particularly under challenging conditions involving conflicting AUs.
Chinese Translation
面部行为合成仍然是一个关键但尚未充分探索的挑战。尽管文本到面部模型取得了一定进展,但它们通常依赖于粗略的情感类别,这些类别缺乏捕捉人类非语言交流全谱所需的细微差别。动作单元(Action Units, AUs)提供了一种更精确且基于解剖学的替代方案。然而,当前基于 AU 的方法通常将 AU 编码为独热向量,将复合表情建模为单个 AU 的简单线性组合。当处理相互冲突的 AU 时,这种线性假设就会出现问题——冲突的 AU 被定义为以相反的动作激活同一面部肌肉的 AU。这种情况会导致解剖学上不合理的伪影和不自然的运动叠加。为了解决这个问题,我们提出了一种新方法,通过 AU 的自然语言描述来表示面部行为。这种方法保留了 AU 框架的表现力,同时能够明确建模复杂和冲突的 AU。它还释放了现代文本到图像模型在高保真面部合成中的潜力。为支持这一方向,我们引入了 BP4D-AUText,这是第一个用于复杂面部行为的大规模文本-图像配对数据集。该数据集通过将基于规则的动态 AU 文本处理器应用于 BP4D 和 BP4D+ 数据集进行合成。我们进一步提出了 VQ-AUFace,这是一种生成模型,利用面部结构先验从文本合成真实且多样的面部行为。大量定量实验和用户研究表明,我们的方法显著优于现有方法,尤其在涉及冲突 AU 的挑战性条件下,能够生成解剖学上合理、行为丰富且感知上令人信服的面部表情。
cs.CV / 53 / 2603.18597
myMNIST: Benchmark of PETNN, KAN, and Classical Deep Learning Models for Burmese Handwritten Digit Recognition
myMNIST:PETNN、KAN 和经典深度学习模型在缅甸手写数字识别中的基准测试
Abstract
We present the first systematic benchmark on myMNIST (formerly BHDD), a publicly available Burmese handwritten digit dataset important for Myanmar NLP/AI research. We evaluate eleven architectures spanning classical deep learning models (Multi-Layer Perceptron, Convolutional Neural Network, Long Short-Term Memory, Gated Recurrent Unit, Transformer), recent alternatives (FastKAN, EfficientKAN), an energy-based model (JEM), and physics-inspired PETNN variants (Sigmoid, GELU, SiLU). Using Precision, Recall, F1-Score, and Accuracy as evaluation metrics, our results show that the CNN remains a strong baseline, achieving the best overall scores (F1 = 0.9959, Accuracy = 0.9970). The PETNN (GELU) model closely follows (F1 = 0.9955, Accuracy = 0.9966), outperforming LSTM, GRU, Transformer, and KAN variants. JEM, representing energy-based modeling, performs competitively (F1 = 0.9944, Accuracy = 0.9958). KAN-based models (FastKAN, EfficientKAN) trail the top performers but provide a meaningful alternative baseline (Accuracy ~0.992). These findings (i) establish reproducible baselines for myMNIST across diverse modeling paradigms, (ii) highlight PETNN's strong performance relative to classical and Transformer-based models, and (iii) quantify the gap between energy-inspired PETNNs and a true energy-based model (JEM). We release this benchmark to facilitate future research on Myanmar digit recognition and to encourage broader evaluation of emerging architectures on regional scripts.
Chinese Translation
我们首次系统性地对 myMNIST(前称 BHDD)进行基准测试,这是一个对缅甸自然语言处理/人工智能研究至关重要的公开可用的缅甸手写数字数据集。我们评估了十一种架构,涵盖经典深度学习模型(多层感知器、卷积神经网络、长短期记忆网络、门控循环单元、变换器)、近期替代方案(FastKAN、EfficientKAN)、一种基于能量的模型(JEM)以及受物理启发的 PETNN 变体(Sigmoid、GELU、SiLU)。使用精确率、召回率、F1 分数和准确率作为评估指标,我们的结果表明,卷积神经网络仍然是一个强有力的基准,取得了最佳的整体得分(F1 = 0.9959,准确率 = 0.9970)。PETNN(GELU)模型紧随其后(F1 = 0.9955,准确率 = 0.9966),超越了 LSTM、GRU、变换器和 KAN 变体。JEM 作为基于能量的建模,表现同样具有竞争力(F1 = 0.9944,准确率 = 0.9958)。基于 KAN 的模型(FastKAN、EfficientKAN)落后于顶尖表现者,但提供了一个有意义的替代基准(准确率约为 0.992)。这些发现(i)为 myMNIST 在多种建模范式下建立了可重复的基准,(ii)突显了 PETNN 相对于经典和基于变换器模型的强大表现,以及(iii)量化了受能量启发的 PETNN 和真正的基于能量的模型(JEM)之间的差距。我们发布此基准以促进未来对缅甸数字识别的研究,并鼓励对新兴架构在区域文字上的更广泛评估。
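For reproducibility, the four reported metrics are cheap to compute from scratch. The sketch below assumes macro averaging over the ten digit classes (the abstract does not state macro vs weighted, so that choice is an assumption):

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes=10):
    """Macro-averaged precision/recall/F1 plus overall accuracy."""
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    acc = float(np.mean(y_true == y_pred))
    return float(np.mean(precisions)), float(np.mean(recalls)), float(np.mean(f1s)), acc
```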
cs.CV / 54 / 2603.18598
Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness
基于文本引导的互补注意力机制用于零样本对抗鲁棒性
Abstract
Due to the impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP), have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: Local Attention Refinement Module and Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground. The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements respectively, in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets.
Chinese Translation
由于其出色的零样本能力,预训练的视觉语言模型(例如,CLIP)在各个领域引起了广泛关注和应用。然而,研究发现CLIP对对抗样本较为敏感。通过实验分析,我们观察到一种现象,即对抗扰动会引起文本引导注意力的偏移。在这一观察基础上,我们提出了一种简单而有效的策略:零样本鲁棒性的文本引导注意力(Text-Guided Attention for Zero-Shot Robustness,TGA-ZSR)。该框架包含两个组件:局部注意力精细化模块和全局注意力约束模块。我们的目标是保持CLIP模型的泛化能力,并增强其对抗鲁棒性。此外,全局注意力约束模块利用干净样本从目标模型和原始模型中获取文本引导的注意力,其目标是在增强整体鲁棒性的同时保持模型在干净样本上的性能。然而,我们观察到该方法有时会聚焦于无关或虚假的特征,这可能导致次优性能并在某些情境下削弱其鲁棒性。为了克服这一局限性,我们进一步提出了一种新方法,称为互补文本引导注意力(Complementary Text-Guided Attention,Comp-TGA)。该方法整合了两种前景注意力:由类提示引导的注意力和由非类提示驱动的反向注意力。这些互补的注意力机制使得模型能够捕捉到前景的更全面和准确的表示。实验结果验证了TGA-ZSR和Comp-TGA在16个数据集上的零样本鲁棒准确率分别提升了9.58%和11.95%,超过了当前最先进的技术。
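The Comp-TGA idea, combining class-prompt attention with the reversed attention of a non-class prompt, admits a very small sketch. Min-max normalization and averaging as the fusion rule are assumptions for illustration; the paper's actual fusion may differ:

```python
import numpy as np

def complementary_attention(attn_class, attn_nonclass):
    """Fuse two foreground cues for zero-shot robustness.

    attn_class:    attention map guided by the class prompt.
    attn_nonclass: attention map guided by a non-class prompt; reversing it
                   (1 - a) highlights regions the background prompt ignores.
    """
    def norm(a):
        rng = a.max() - a.min()
        return (a - a.min()) / rng if rng > 0 else np.zeros_like(a)
    # Average of the direct cue and the reversed complementary cue.
    return 0.5 * (norm(attn_class) + (1.0 - norm(attn_nonclass)))
```

Regions supported by both cues score highest, which is how the method suppresses the spurious features that TGA-ZSR alone occasionally latches onto.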
cs.CV / 55 / 2603.18599
SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation
SJD-PAC:通过主动草拟和自适应延续加速推测性雅可比解码
Abstract
Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework. First, SJD-PAC employs a proactive drafting strategy to improve local acceptance rates in these challenging high-entropy regions. Second, we introduce an adaptive continuation mechanism that sustains sequence validation after an initial rejection, bypassing the need for full resampling. Working in tandem, these optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution. Experiments on standard text-to-image benchmarks demonstrate that SJD-PAC achieves a $3.8\times$ speedup with lossless image quality.
Chinese Translation
推测性雅可比解码(Speculative Jacobi Decoding,SJD)提供了一种无草拟模型的方法,以加速自回归文本到图像的合成。然而,视觉生成的高熵特性导致在复杂区域的草拟标记接受率低,从而形成了一个严重限制整体吞吐量的瓶颈。为了解决这一问题,我们引入了SJD-PAC,一个增强的SJD框架。首先,SJD-PAC采用了一种主动草拟策略,以提高这些高熵挑战区域的局部接受率。其次,我们引入了一种自适应延续机制,在初次拒绝后持续进行序列验证,避免了完全重采样的需要。这些优化措施协同工作,显著增加了每步的平均接受长度,提高了推理速度,同时严格保持目标分布。在标准文本到图像的基准测试中,实验表明SJD-PAC实现了$3.8\times$的加速且图像质量无损。
cs.CV / 56 / 2603.18600
Improving Joint Audio-Video Generation with Cross-Modal Context Learning
通过跨模态上下文学习提升联合音视频生成
Abstract
The dual-stream transformer architecture-based joint audio-video generation method has become the dominant paradigm in current research. By incorporating pre-trained video diffusion models and audio diffusion models, along with a cross-modal interaction attention module, high-quality, temporally synchronized audio-video content can be generated with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and further analyze its limitations, including model manifold variations caused by the gating mechanism controlling cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, and the inconsistencies in multi-modal classifier-free guidance (CFG) during training and inference, as well as conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) effectively enhances the temporal alignment between audio latent and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information, while dynamically routing based on different training tasks, further enhancing the model's convergence speed and generation quality. During inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to facilitate different forms of CFG, improving train-inference consistency and further alleviating conflicts. Through comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.
Chinese Translation
基于双流变换器架构的联合音视频生成方法已成为当前研究的主流范式。通过结合预训练的视频扩散模型和音频扩散模型,以及跨模态交互注意力模块,可以在最小训练数据的情况下生成高质量、时间上同步的音视频内容。本文首先回顾了双流变换器范式,并进一步分析了其局限性,包括由控制跨模态交互的门控机制引起的模型流形变化、跨模态注意力引入的多模态背景区域偏差、训练和推理过程中多模态无分类器引导(CFG)的不一致性,以及多个条件之间的冲突。为了解决这些问题,我们提出了跨模态上下文学习(Cross-Modal Context Learning, CCL),配备了几个精心设计的模块。时间对齐的旋转位置编码和分区(Temporally Aligned RoPE and Partitioning, TARP)有效增强了音频潜在表示与视频潜在表示之间的时间对齐。跨模态上下文注意力(Cross-Modal Context Attention, CCA)模块中的可学习上下文标记(Learnable Context Tokens, LCT)和动态上下文路由(Dynamic Context Routing, DCR)为跨模态信息提供了稳定的无条件锚点,同时根据不同的训练任务动态路由,进一步提升了模型的收敛速度和生成质量。在推理过程中,无条件上下文引导(Unconditional Context Guidance, UCG)利用LCT提供的无条件支持,促进不同形式的CFG,改善训练与推理的一致性,并进一步缓解冲突。通过综合评估,CCL相比于近期的学术方法实现了最先进的性能,同时所需资源大幅减少。
cs.CV / 57 / 2603.18616
Benchmarking CNN-based Models against Transformer-based Models for Abdominal Multi-Organ Segmentation on the RATIC Dataset
基于CNN模型与基于Transformer模型在RATIC数据集上进行腹部多脏器分割的基准比较
Abstract
Accurate multi-organ segmentation in abdominal CT scans is essential for computer-aided diagnosis and treatment. While convolutional neural networks (CNNs) have long been the standard approach in medical image segmentation, transformer-based architectures have recently gained attention due to their ability to model long-range dependencies. In this study, we systematically benchmark the three hybrid transformer-based models UNETR, SwinUNETR, and UNETR++ against a strong CNN baseline, SegResNet, for volumetric multi-organ segmentation on the heterogeneous RATIC dataset. The dataset comprises 206 annotated CT scans from 23 institutions worldwide, covering five abdominal organs. All models were trained and evaluated under identical preprocessing and training conditions using the Dice Similarity Coefficient (DSC) as the primary metric. The results show that the CNN-based SegResNet achieves the highest overall performance, outperforming all hybrid transformer-based models across all organs. Among the transformer-based approaches, UNETR++ delivers the most competitive results, while UNETR demonstrates notably faster convergence with fewer training iterations. These findings suggest that, for small- to medium-sized heterogeneous datasets, well-optimized CNN architectures remain highly competitive and may outperform hybrid transformer-based designs.
Chinese Translation
腹部CT扫描中的多脏器精确分割对于计算机辅助诊断和治疗至关重要。尽管卷积神经网络(CNN)长期以来一直是医学图像分割的标准方法,但基于Transformer的架构因其建模长距离依赖的能力而受到关注。在本研究中,我们系统地对三种混合Transformer模型UNETR、SwinUNETR和UNETR++与强大的CNN基线SegResNet进行了基准比较,用于在异构的RATIC数据集上进行体积多脏器分割。该数据集包含来自全球23个机构的206个标注的CT扫描,涵盖五个腹部器官。所有模型在相同的预处理和训练条件下进行训练和评估,使用Dice相似系数(DSC)作为主要指标。结果表明,基于CNN的SegResNet整体性能最佳,在所有器官上均超越了所有混合Transformer模型。在基于Transformer的方法中,UNETR++提供了最具竞争力的结果,而UNETR则在较少的训练迭代中表现出显著更快的收敛。这些发现表明,对于小到中等规模的异构数据集,经过良好优化的CNN架构仍然具有很强的竞争力,并可能超越混合Transformer设计。
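Since the Dice Similarity Coefficient is the study's primary metric, a per-organ reference implementation on label-map volumes is worth pinning down (the convention of scoring 1.0 when an organ is absent from both prediction and ground truth is an assumption):

```python
import numpy as np

def dice_per_organ(pred, gt, organ_labels):
    """Dice Similarity Coefficient per organ label.

    pred, gt: integer label maps of identical shape (2D slice or 3D volume).
    Returns {label: DSC}, with DSC = 2|P ∩ G| / (|P| + |G|).
    """
    scores = {}
    for lbl in organ_labels:
        p = pred == lbl
        g = gt == lbl
        denom = p.sum() + g.sum()
        # Empty-on-both-sides convention: perfect agreement about absence.
        scores[lbl] = 2.0 * np.logical_and(p, g).sum() / denom if denom else 1.0
    return scores
```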
cs.CV / 58 / 2603.18623
OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data
OpenT2M:基于开源、大规模、高质量数据的简洁运动生成
Abstract
Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technique tricks as "frills". Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biology parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill will advance the T2M field by addressing longstanding data quality and benchmarking challenges.
Chinese Translation
文本到运动(Text-to-motion, T2M)生成旨在从文本描述中创建逼真的人类动作,具有在动画和机器人领域的广泛应用。尽管最近有所进展,但现有的 T2M 模型在未见过的文本描述上表现不佳,这主要是由于现有运动数据集的规模小且多样性有限。为了解决这个问题,我们推出了 OpenT2M,这是一个百万级、高质量且开源的运动数据集,包含超过2800小时的人类运动数据。每个序列经过严格的质量控制,通过物理可行性验证和多粒度过滤,并配有详细的逐秒文本注释。我们还开发了一个自动化流水线,用于创建长时程序列,以支持复杂的运动生成。在 OpenT2M 的基础上,我们推出了 MonoFrill,一个预训练的运动模型,该模型无需作为“额外装饰”(frills)的复杂设计或技术技巧,即可实现出色的 T2M 结果。其核心组件是 2D-PRQ,一种新颖的运动标记器,通过将人体分为生物学部分来捕捉时空依赖性。实验表明,OpenT2M 显著提高了现有 T2M 模型的泛化能力,而 2D-PRQ 则实现了优越的重构和强大的零样本性能。我们期望 OpenT2M 和 MonoFrill 能在解决长期存在的数据质量和基准挑战方面推动 T2M 领域的发展。
cs.CV / 59 / 2603.18625
GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?
GenVideoLens:大规模视觉语言模型在AI生成视频检测中的不足之处?
Abstract
In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.
Chinese Translation
近年来,AI生成的视频变得愈加逼真和复杂。与此同时,大规模视觉语言模型(LVLMs)在检测此类内容方面展现出强大的潜力。然而,现有的评估协议大多将此任务视为二分类问题,并依赖于诸如总体准确率等粗粒度指标,难以揭示LVLMs在何处成功或失败。为了解决这一局限性,我们提出了GenVideoLens,一个精细化基准,能够对LVLM在AI生成视频检测中的能力进行逐维度评估。该基准包含400个高度误导性的AI生成视频和100个真实视频,均由专家在覆盖感知、光学、物理和时间线索的15个真实性维度上进行标注。我们在此基准上评估了11个具有代表性的LVLMs。我们的分析揭示了明显的维度不平衡。虽然LVLMs在感知线索上的表现相对较好,但在光学一致性、物理交互和时间因果推理方面却表现不佳。模型在不同维度上的性能差异也很大,较小的开源模型在特定的真实性线索上有时甚至优于更强大的专有模型。时间扰动实验进一步表明,当前的LVLMs对时间信息的利用有限。总体而言,GenVideoLens为LVLM的行为提供了诊断性洞察,揭示了关键能力差距,并为改善未来的AI生成视频检测系统提供了指导。
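Moving from overall accuracy to dimension-wise evaluation is a small bookkeeping change. A minimal sketch, assuming each graded answer is tagged with one of the 15 authenticity dimensions (the record format here is hypothetical):

```python
from collections import defaultdict

def dimension_wise_accuracy(records):
    """Accuracy per authenticity dimension instead of one overall score.

    records: iterable of (dimension, correct) pairs, e.g.
             ("optical", False) for one model answer on one annotated cue.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dim, correct in records:
        totals[dim] += 1
        hits[dim] += int(correct)
    return {dim: hits[dim] / totals[dim] for dim in totals}
```

Comparing these per-dimension scores across models is what surfaces the imbalance the abstract reports, e.g. strong perceptual cues but weak temporal-causal reasoning.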
cs.CV / 60 / 2603.18626
GEAR: Geography-knowledge Enhanced Analog Recognition Framework in Extreme Environments
GEAR:极端环境下地理知识增强的类比识别框架
Abstract
The Mariana Trench and the Qinghai-Tibet Plateau exhibit significant similarities in geological origins and microbial metabolic functions. Given that deep-sea biological sampling faces prohibitive costs, recognizing structurally homologous terrestrial analogs of the Mariana Trench on the Qinghai-Tibet Plateau is of great significance. Yet, no existing model adequately addresses cross-domain topographic similarity retrieval, either neglecting geographical knowledge or sacrificing computational efficiency. To address these challenges, we present \underline{\textbf{G}}eography-knowledge \underline{\textbf{E}}nhanced \underline{\textbf{A}}nalog \underline{\textbf{R}}ecognition (\textbf{GEAR}) Framework, a three-stage pipeline designed to efficiently retrieve analogs from 2.5 million square kilometers of the Qinghai-Tibet Plateau: (1) Skeleton guided Screening and Clipping: Recognition of candidate valleys and initial screening based on size and linear morphological criteria. (2) Physics aware Filtering: The Topographic Waveform Comparator (TWC) and Morphological Texture Module (MTM) evaluate the waveform and texture and filter out inconsistent candidate valleys. (3) Graph based Fine Recognition: We design a \underline{\textbf{M}}orphology-integrated \underline{\textbf{S}}iamese \underline{\textbf{G}}raph \underline{\textbf{N}}etwork (\textbf{MSG-Net}) based on geomorphological metrics. Correspondingly, we release an expert-annotated topographic similarity dataset targeting tectonic collision zones. Experiments demonstrate the effectiveness of every stage. Besides, MSG-Net achieved an F1-Score 1.38 percentage points higher than the SOTA baseline. Using features extracted by MSG-Net, we discovered a significant correlation with biological data, providing evidence for future biological analysis.
Chinese Translation
马里亚纳海沟和青藏高原在地质起源和微生物代谢功能上表现出显著相似性。考虑到深海生物采样面临高昂的成本,在青藏高原识别马里亚纳海沟的结构同源陆地类比具有重要意义。然而,现有模型在跨领域地形相似性检索方面均未能充分解决这一问题,要么忽视地理知识,要么牺牲计算效率。为了解决这些挑战,我们提出了\underline{\textbf{G}}eography-knowledge \underline{\textbf{E}}nhanced \underline{\textbf{A}}nalog \underline{\textbf{R}}ecognition(\textbf{GEAR})框架,这是一个三阶段的流程,旨在高效地从250万平方公里的青藏高原中检索类比:(1)骨架引导筛选与剪裁:根据大小和线性形态标准识别候选山谷并进行初步筛选。(2)物理感知过滤:地形波形比较器(Topographic Waveform Comparator,TWC)和形态纹理模块(Morphological Texture Module,MTM)评估波形和纹理,并过滤掉不一致的候选山谷。(3)基于图的精细识别:我们设计了一个基于地貌指标的\underline{\textbf{M}}orphology-integrated \underline{\textbf{S}}iamese \underline{\textbf{G}}raph \underline{\textbf{N}}etwork(\textbf{MSG-Net})。相应地,我们发布了一个针对构造碰撞区的专家注释地形相似性数据集。实验表明每个阶段的有效性。此外,MSG-Net的F1-Score比现有最优基线高出1.38个百分点。利用MSG-Net提取的特征,我们发现与生物数据之间存在显著相关性,为未来的生物分析提供了证据。
cs.CV / 61 / 2603.18634
SwiftGS: Episodic Priors for Immediate Satellite Surface Recovery
SwiftGS:用于即时卫星表面恢复的情景式先验
Abstract
Rapid, large-scale 3D reconstruction from multi-date satellite imagery is vital for environmental monitoring, urban planning, and disaster response, yet remains difficult due to illumination changes, sensor heterogeneity, and the cost of per-scene optimization. We introduce SwiftGS, a meta-learned system that reconstructs 3D surfaces in a single forward pass by predicting geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF, replacing expensive per-scene fitting with episodic training that captures transferable priors. The model couples a differentiable physics graph for projection, illumination, and sensor response with spatial gating that blends sparse Gaussian detail and global SDF structure, and incorporates semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from a frozen geometric teacher under an uncertainty-aware multi-task loss. At inference, SwiftGS operates zero-shot with optional compact calibration and achieves accurate DSM reconstruction and view-consistent rendering at significantly reduced computational cost, with ablations highlighting the benefits of the hybrid representation, physics-aware rendering, and episodic meta-training.
Chinese Translation
从多日期卫星影像快速、大规模地进行三维重建对于环境监测、城市规划和灾害响应至关重要,但由于光照变化、传感器异质性以及逐场景优化的高成本,这一过程仍然困难。我们提出了SwiftGS,一个元学习系统,通过预测几何-辐射解耦的高斯原语以及一个轻量级的SDF,在单次前向传递中重建三维表面,并以捕捉可迁移先验的情景式(episodic)训练取代了昂贵的逐场景拟合。该模型将用于投影、光照和传感器响应的可微分物理图,与融合稀疏高斯细节和全局SDF结构的空间门控相结合,并引入语义-几何融合、条件轻量级任务头,以及在不确定性感知的多任务损失下来自冻结几何教师的多视图监督。在推理时,SwiftGS以零样本方式运行,支持可选的紧凑校准,并在显著降低计算成本的情况下实现了准确的DSM重建和视图一致的渲染,消融实验突显了混合表示、物理感知渲染和情景式元训练的优势。
cs.CV / 62 / 2603.18636
Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering
通过离线逐层稀疏性分析与在线双向共同聚类实现快速视频生成的无训练稀疏注意力
Abstract
Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to $1.93\times$ speedup while maintaining a PSNR of up to 29 dB on Wan2.1.
Chinese Translation
扩散变换器(Diffusion Transformers,DiTs)在视频生成质量上表现出色,但由于密集的三维注意力导致高推理成本,从而推动了稀疏注意力技术的发展以提高效率。然而,现有的无训练稀疏注意力方法在视频生成中仍面临两个未解决的局限性:在注意力剪枝中忽视层的异质性,以及在块划分中忽视查询-键耦合,这妨碍了质量与加速之间更好的平衡。在本研究中,我们揭示了一个关键见解:每层的注意力稀疏性是其内在特性,在不同输入间仅有微小变化。基于此,我们提出了SVOO,一种通过离线逐层稀疏性分析和在线双向共同聚类实现快速视频生成的无训练稀疏注意力框架。具体而言,SVOO采用两阶段范式:(i) 离线逐层敏感性分析,以推导出每层的内在剪枝水平;(ii) 在线块级稀疏注意力,利用新颖的双向共同聚类算法。在七个广泛使用的视频生成模型上的大量实验表明,SVOO在质量与加速的平衡方面优于最先进的方法,提供高达$1.93\times$的加速,同时在Wan2.1上保持高达29 dB的PSNR。
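SVOO's offline stage reduces, at its core, to a per-layer search for the most aggressive pruning level whose output error stays bounded on calibration data. The sketch below illustrates that idea only; the `layer(x, sparsity=...)` interface and the relative-error criterion are assumptions for the example, not SVOO's actual implementation (which operates on DiT attention layers):

```python
import torch

def profile_layer_sparsity(layers, calib_batch,
                           levels=(0.25, 0.5, 0.75, 0.9), tol=0.05):
    """For each layer, keep the largest pruning level whose relative
    output error vs. the dense computation stays under `tol` on a
    calibration batch. Interface and criterion are illustrative."""
    chosen = []
    for layer in layers:
        best = 0.0  # fall back to dense if no level passes
        for lvl in levels:
            dense = layer(calib_batch, sparsity=0.0)
            sparse = layer(calib_batch, sparsity=lvl)
            err = (dense - sparse).norm() / (dense.norm() + 1e-8)
            if err <= tol:
                best = lvl
        chosen.append(best)
    return chosen

# toy stand-ins: one layer degrades with pruning, one does not
sensitive = lambda x, sparsity=0.0: x * (1 - 0.1 * sparsity)
robust = lambda x, sparsity=0.0: x
per_layer = profile_layer_sparsity([sensitive, robust], torch.ones(8), tol=0.06)
```

Because each layer's sparsity is close to input-invariant (the paper's key observation), such a profiling pass would only need to run once offline, with the chosen per-layer levels fixed at inference.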
cs.CV / 63 / 2603.18639
PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance
PhysVideo:基于交叉视角几何指导的物理可信视频生成
Abstract
Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose PhysVideo, a two-stage framework that first generates physics-aware orthogonal foreground videos and then synthesizes full videos with background. In the first stage, Phys4View leverages physics-aware attention to capture the influence of physical attributes on motion dynamics, and enhances spatio-temporal consistency by incorporating geometry-enhanced cross-view attention and temporal attention. In the second stage, VideoSyn uses the generated foreground videos as guidance and learns the interactions between foreground dynamics and background context for controllable video synthesis. To support training, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that PhysVideo significantly improves physical realism and spatial-temporal coherence over existing video generation methods. Home page: https://anonymous.4open.science/w/Phys4D/.
Chinese Translation
近年来视频生成的进展显著提升了视觉逼真度,但确保物理一致的运动仍然是一个基本挑战。直观上,这一限制可以归因于现实世界中物体运动在三维空间中展开,而视频观察仅提供了这些动态的部分、依赖视角的投影。为了解决这些问题,我们提出了PhysVideo,一个两阶段框架,首先生成具有物理感知的正交前景视频,然后合成包含背景的完整视频。在第一阶段,Phys4View利用物理感知注意力捕捉物理属性对运动动态的影响,并通过结合几何增强的交叉视角注意力和时间注意力来增强时空一致性。在第二阶段,VideoSyn使用生成的前景视频作为指导,学习前景动态与背景上下文之间的交互,以实现可控的视频合成。为了支持训练,我们构建了PhysMV,一个包含4万个场景的数据集,每个场景由四个正交视角组成,总共生成16万个视频序列。大量实验表明,PhysVideo在物理真实感和时空连贯性方面显著优于现有的视频生成方法。主页:https://anonymous.4open.science/w/Phys4D/
cs.CV / 64 / 2603.18645
MeInTime: Bridging Age Gap in Identity-Preserving Face Restoration
MeInTime:弥合身份保留面部修复中的年龄差距
Abstract
To better preserve an individual's identity, face restoration has evolved from reference-free to reference-based approaches, which leverage high-quality reference images of the same identity to enhance identity fidelity in the restored outputs. However, most existing methods implicitly assume that the reference and degraded input are age-aligned, limiting their effectiveness in real-world scenarios where only cross-age references are available, such as historical photo restoration. This paper proposes MeInTime, a diffusion-based face restoration method that extends reference-based restoration from same-age to cross-age settings. Given one or few reference images along with an age prompt corresponding to the degraded input, MeInTime achieves faithful restoration with both identity fidelity and age consistency. Specifically, we decouple the modeling of identity and age conditions. During training, we focus solely on effectively injecting identity features through a newly introduced attention mechanism and introduce Gated Residual Fusion modules to facilitate the integration between degraded features and identity representations. At inference, we propose Age-Aware Gradient Guidance, a training-free sampling strategy, using an age-driven direction to iteratively nudge the identity-aware denoising latent toward the desired age semantic manifold. Extensive experiments demonstrate that MeInTime outperforms existing face restoration methods in both identity preservation and age consistency. Our code is available at: https://github.com/teer4/MeInTime
Chinese Translation
为了更好地保留个体的身份,面部修复技术已经从无参考方法发展到基于参考的方法,这些方法利用相同身份的高质量参考图像来增强修复输出中的身份保真度。然而,大多数现有方法隐含地假设参考图像和退化输入在年龄上是一致的,这限制了它们在现实场景中的有效性,尤其是在仅有跨年龄参考可用的情况下,例如历史照片修复。本文提出了MeInTime,一种基于扩散的面部修复方法,将基于参考的修复从同龄扩展到跨年龄设置。给定一张或几张参考图像以及与退化输入相对应的年龄提示,MeInTime能够实现身份保真度和年龄一致性的忠实修复。具体而言,我们解耦了身份和年龄条件的建模。在训练过程中,我们专注于通过新引入的注意力机制有效地注入身份特征,并引入门控残差融合模块以促进退化特征与身份表示之间的整合。在推理阶段,我们提出了年龄感知梯度引导(Age-Aware Gradient Guidance),这是一种无训练的采样策略,利用基于年龄的方向迭代地推动身份感知去噪潜变量朝向所需的年龄语义流形。大量实验表明,MeInTime在身份保留和年龄一致性方面均优于现有的面部修复方法。我们的代码可在以下链接获取:https://github.com/teer4/MeInTime
cs.CV / 65 / 2603.18649
Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA
点击提问:一个具有离线文案撰写及在线互动问答功能的人工智能直播助手
Abstract
Live streaming commerce has become a prominent form of broadcasting in the modern era. To facilitate more efficient and convenient product promotions for streamers, we present Click-to-Ask, an AI-driven assistant for live streaming commerce with complementary offline and online components. The offline module processes diverse multimodal product information, transforming complex inputs into structured product data and generating compliant promotional copywriting. During live broadcasts, the online module enables real-time responses to viewer inquiries by allowing streamers to click on questions and leveraging both the structured product information generated by the offline module and an event-level historical memory maintained in a streaming architecture. This system significantly reduces the time needed for promotional preparation, enhances content engagement, and enables prompt interaction with audience inquiries, ultimately improving the effectiveness of live streaming commerce. On our collected dataset of TikTok live stream frames, the proposed method achieves a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876, demonstrating considerable potential for practical application. The video demonstration can be viewed here: https://www.youtube.com/shorts/mWIXK-SWhiE.
Chinese Translation
直播电商已成为现代广播的一种显著形式。为了帮助主播更高效便捷地进行产品推广,我们提出了 Click-to-Ask,一个由离线和在线两部分互补构成的人工智能驱动的直播电商助手。离线模块处理多样化的多模态产品信息,将复杂输入转化为结构化的产品数据,并生成合规的推广文案。在直播过程中,在线模块允许主播点击问题,并结合离线模块生成的结构化产品信息和在流式架构中维护的事件级历史记忆,实时响应观众的询问。该系统显著减少了促销准备所需的时间,提高了内容参与度,并能够及时回应观众的询问,从而最终提升了直播电商的效果。在我们收集的 TikTok 直播帧数据集上,所提方法实现了0.913的问题识别准确率(Question Recognition Accuracy)和0.876的响应质量评分(Response Quality),显示出相当大的实际应用潜力。视频演示可在此观看:https://www.youtube.com/shorts/mWIXK-SWhiE。
cs.CV / 66 / 2603.18652
Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation
基于LLM语义评估的PDF解析器表格提取基准测试
Abstract
Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: https://github.com/phorn1/pdf-parse-bench Metric study and human evaluation: https://github.com/phorn1/table-metric-study
Chinese Translation
可靠地从PDF中提取表格对于大规模科学数据挖掘和知识库构建至关重要,但现有的评估方法依赖于基于规则的指标,无法捕捉表格内容的语义等价性。我们提出了一个基于合成生成PDF、带有精确LaTeX真值的基准测试框架,其中表格来源于arXiv,以确保现实的复杂性和多样性。作为我们核心的方法论贡献,我们应用LLM-as-a-judge进行语义表格评估,并将其集成到一个能够适应解析器输出不一致性的匹配管道中。通过一项包含超过1500个提取表格对质量判断的人类验证研究,我们表明基于LLM的评估与人类判断的相关性(Pearson r=0.93)显著高于基于树编辑距离的相似性(TEDS,r=0.68)和网格表相似性(GriTS,r=0.70)。在包含451个表格的100份合成文档上评估21个当代PDF解析器,揭示了显著的性能差异。我们的结果为选择表格数据提取解析器提供了实用指导,并为这一关键任务建立了一种可重复、可扩展的评估方法论。代码和数据:https://github.com/phorn1/pdf-parse-bench 指标研究与人工评估:https://github.com/phorn1/table-metric-study
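The r values quoted above come from Pearson correlation between each metric's scores and human quality judgments over the same extracted-table pairs. A self-contained sketch of that validation step (the scores below are made-up illustrative numbers, not data from the study):

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical human judgments vs. two hypothetical metrics,
# each scored per extracted-table pair
human = [0.9, 0.2, 0.7, 0.4, 1.0]
metric_a = [0.85, 0.3, 0.6, 0.5, 0.95]  # tracks the human ranking closely
metric_b = [0.5, 0.6, 0.4, 0.7, 0.45]   # roughly anti-correlated

print(pearson_r(human, metric_a))  # close to 1
print(pearson_r(human, metric_b))  # negative
```

Ranking candidate metrics by this correlation against a human-annotated validation set is exactly how the paper argues LLM-as-a-judge (r=0.93) outperforms TEDS and GriTS.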
cs.CV / 67 / 2603.18655
Multiscale Switch for Semi-Supervised and Contrastive Learning in Medical Ultrasound Image Segmentation
多尺度切换框架用于医学超声图像分割中的半监督与对比学习
Abstract
Medical ultrasound image segmentation faces significant challenges due to limited labeled data and characteristic imaging artifacts including speckle noise and low-contrast boundaries. While semi-supervised learning (SSL) approaches have emerged to address data scarcity, existing methods suffer from suboptimal unlabeled data utilization and lack robust feature representation mechanisms. In this paper, we propose Switch, a novel SSL framework with two key innovations: (1) Multiscale Switch (MSS) strategy that employs hierarchical patch mixing to achieve uniform spatial coverage; (2) Frequency Domain Switch (FDS) with contrastive learning that performs amplitude switching in Fourier space for robust feature representations. Our framework integrates these components within a teacher-student architecture to effectively leverage both labeled and unlabeled data. Comprehensive evaluation across six diverse ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, and prostate) demonstrates consistent superiority over state-of-the-art methods. At 5\% labeling ratio, Switch achieves remarkable improvements: 80.04\% Dice on LN-INT, 85.52\% Dice on DDTI, and 83.48\% Dice on Prostate datasets, with our semi-supervised approach even exceeding fully supervised baselines. The method maintains parameter efficiency (1.8M parameters) while delivering superior performance, validating its effectiveness for resource-constrained medical imaging applications. The source code is publicly available at https://github.com/jinggqu/Switch
Chinese Translation
医学超声图像分割面临显著挑战,主要由于标注数据有限,以及斑点噪声、低对比度边界等特有的成像伪影。虽然半监督学习(SSL)方法已被提出以应对数据短缺,但现有方法在未标注数据的利用效率和特征表示机制的鲁棒性方面存在不足。本文提出了一种新颖的SSL框架——Switch,具有两个关键创新:(1) 多尺度切换(MSS)策略,通过分层块混合实现均匀的空间覆盖;(2) 频域切换(FDS),结合对比学习,在傅里叶空间中执行幅度切换,以实现鲁棒特征表示。我们的框架在教师-学生架构中整合了这些组件,有效利用标注和未标注数据。在六个不同的超声数据集(淋巴结、乳腺病变、甲状腺结节和前列腺)上进行的全面评估显示,Switch一致优于最先进方法。在5%的标注比例下,Switch取得了显著改善:LN-INT数据集上的Dice得分为80.04%,DDTI数据集上的Dice得分为85.52%,前列腺数据集上的Dice得分为83.48%,我们的半监督方法甚至超越了完全监督的基准。该方法在保持参数效率(1.8M 参数)的同时提供了卓越的性能,验证了其在资源受限的医学成像应用中的有效性。源代码可在 https://github.com/jinggqu/Switch 上公开获取。
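The Multiscale Switch idea, mixing patches between two images at several grid scales so switched regions cover the image uniformly, can be sketched as follows. The fixed scale list and per-patch Bernoulli switching are illustrative assumptions, not the paper's exact hierarchical schedule:

```python
import torch

def multiscale_switch(x_a, x_b, scales=(2, 4, 8), p=0.5, generator=None):
    """Toy sketch of multiscale patch switching: at each scale the pair of
    images is tiled into s x s patches and each patch is swapped between
    the two images with probability p. Inputs are (C, H, W) tensors with
    H, W divisible by every scale."""
    a, b = x_a.clone(), x_b.clone()
    _, H, W = a.shape
    for s in scales:
        ph, pw = H // s, W // s
        mask = torch.rand(s, s, generator=generator) < p
        for i in range(s):
            for j in range(s):
                if mask[i, j]:
                    sl = (slice(None),
                          slice(i * ph, (i + 1) * ph),
                          slice(j * pw, (j + 1) * pw))
                    # swap this patch between the two images
                    a[sl], b[sl] = b[sl].clone(), a[sl].clone()
    return a, b
```

Because swapping only exchanges content between the pair, the union of the two outputs preserves every pixel of the inputs, which is what makes this usable as a consistency augmentation on unlabeled data.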
cs.CV / 68 / 2603.18660
Multimodal Model for Computational Pathology: Representation Learning and Image Compression
计算病理学的多模态模型:表征学习与图像压缩
Abstract
Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.
Chinese Translation
全切片成像(WSI)通过使得对千兆像素级别的组织病理图像进行计算分析成为可能,彻底改变了数字病理学。最近基础模型的进展加速了计算病理学的发展,促进了病理图像、临床报告和结构化数据之间的联合推理。尽管取得了这些进展,仍然存在一些挑战:WSI的极高分辨率为视觉学习带来了计算障碍;有限的专家标注限制了监督学习方法;在保留生物学可解释性的同时整合多模态信息仍然困难;超长视觉序列建模的不透明性也阻碍了临床透明度。本文全面回顾了多模态计算病理学的最新进展。我们系统分析了四个研究方向:(1)针对WSI的自监督表征学习和结构感知的标记压缩;(2)多模态数据生成与增强;(3)参数高效的适应与推理增强的少样本学习;(4)多智能体协作推理以实现可信诊断。我们特别考察了标记压缩如何实现跨尺度建模,以及多智能体机制如何模拟病理学家的“思维链”,在不同放大倍数下实现不确定性感知的证据融合。最后,我们讨论了当前面临的挑战,并认为未来的进展依赖于统一的多模态框架,将高分辨率视觉数据与临床和生物医学知识整合,以支持可解释和安全的人工智能辅助诊断。
cs.CV / 69 / 2603.18671
Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels
迈向高质量图像分割:通过惩罚邻近像素提高拓扑准确性
Abstract
Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels' neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at https://jmlipman.github.io/SCNP-SameClassNeighborPenalization.
Chinese Translation
标准的深度学习图像分割模型无法保证拓扑准确性,未能保持正确数量的连通分量或结构。这反过来影响了分割的质量,并损害了后续定量分析的可靠性。之前的研究提出通过专门的框架、架构和损失函数来增强拓扑准确性。然而,这些方法往往难以集成到现有的训练流程中,计算成本非常高,或者仅限于具有管状形态的结构。我们提出了SCNP(Same Class Neighbor Penalization),一种高效的方法,通过用每个像素分类最差的邻居来惩罚其logits,从而提高拓扑准确性,迫使模型在改善像素自身的预测之前,先改善其邻近像素的预测。我们展示了SCNP在13个数据集上的有效性,涵盖了不同的结构形态和图像模态,并将其集成到三个语义和实例分割框架中。此外,我们还展示了SCNP可以集成到多种损失函数中,从而提高拓扑准确性。我们的代码可以在https://jmlipman.github.io/SCNP-SameClassNeighborPenalization找到。
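One minimal reading of "penalize each pixel with its poorest-classified neighbor" is to charge every pixel with the lowest true-class probability in its local window, so the gradient pushes the worst neighbor up before the pixel itself can improve. The 3x3 window and the negative-log-likelihood form below are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def neighbor_penalized_loss(logits, labels, eps=1e-8):
    """Sketch of an SCNP-style segmentation loss.

    logits: (B, C, H, W) raw class scores; labels: (B, H, W) class map.
    Each pixel is penalized with the minimum true-class probability over
    its 3x3 neighborhood (min-pool implemented as -maxpool(-x); MaxPool2d
    padding is implicit -inf, so borders use only valid neighbors)."""
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, labels.unsqueeze(1))            # (B, 1, H, W)
    p_worst = -F.max_pool2d(-p_true, kernel_size=3, stride=1, padding=1)
    return -torch.log(p_worst + eps).mean()
```

Since the pooled probability is never larger than the pixel's own, this loss upper-bounds a plain per-pixel NLL, and its gradient concentrates on each neighborhood's weakest prediction.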
cs.CV / 70 / 2603.18719
Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer
用于零样本视觉仿真到现实转移的本体引导扩散
Abstract
Bridging the simulation-to-reality (sim2real) gap remains challenging as labelled real-world data is scarce. Existing diffusion-based approaches rely on unstructured prompts or statistical alignment, which do not capture the structured factors that make images look real. We introduce Ontology-Guided Diffusion (OGD), a neuro-symbolic zero-shot sim2real image translation framework that represents realism as structured knowledge. OGD decomposes realism into an ontology of interpretable traits -- such as lighting and material properties -- and encodes their relationships in a knowledge graph. From a synthetic image, OGD infers trait activations and uses a graph neural network to produce a global embedding. In parallel, a symbolic planner uses the ontology traits to compute a consistent sequence of visual edits needed to narrow the realism gap. The graph embedding conditions a pretrained instruction-guided diffusion model via cross-attention, while the planned edits are converted into a structured instruction prompt. Across benchmarks, our graph-based embeddings better distinguish real from synthetic imagery than baselines, and OGD outperforms state-of-the-art diffusion methods in sim2real image translation. Overall, OGD shows that explicitly encoding realism structure enables interpretable, data-efficient, and generalisable zero-shot sim2real transfer.
Chinese Translation
弥合仿真到现实(sim2real)之间的差距仍然具有挑战性,因为标注的真实世界数据稀缺。现有的基于扩散的方法依赖于非结构化的提示或统计对齐,这些方法未能捕捉到使图像看起来真实的结构化因素。我们提出了本体引导扩散(Ontology-Guided Diffusion, OGD),这是一种神经符号零样本仿真到现实图像翻译框架,将真实感表示为结构化知识。OGD将真实感分解为可解释特征的本体,例如光照和材料属性,并在知识图谱中编码它们之间的关系。给定一张合成图像,OGD推断特征激活,并使用图神经网络生成全局嵌入。同时,符号规划器利用本体特征计算所需的一致视觉编辑序列,以缩小真实感差距。图嵌入通过交叉注意力条件化一个预训练的基于指令的扩散模型,而规划的编辑则转换为结构化的指令提示。在各基准测试中,我们的基于图的嵌入比基线更好地区分真实与合成图像,OGD在仿真到现实图像翻译中超越了最先进的扩散方法。总体而言,OGD表明显式编码真实感结构能够实现可解释、数据高效且可推广的零样本仿真到现实转移。
cs.CV / 71 / 2603.18739
EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation
EdgeCrafter:通过任务专用蒸馏实现边缘密集预测的紧凑型视觉变换器
Abstract
Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve a similarly strong accuracy-efficiency tradeoff, even with large-scale pretraining. We argue that this gap is largely due to insufficient task-specific representation learning in small-scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder-decoder design. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter's reliance on extensive Objects365 pretraining. These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. Code is available at: https://intellindust-ai-lab.github.io/projects/EdgeCrafter/
Chinese Translation
由于计算和内存受到严格限制,在资源受限的边缘设备上部署高性能的密集预测模型仍然面临挑战。在实际应用中,面向目标检测、实例分割和姿态估计的轻量级系统仍然主要由基于卷积神经网络(CNN)的架构(如YOLO)主导,而紧凑型视觉变换器(ViTs)即使经过大规模预训练,通常也难以实现同样强大的准确性与效率的平衡。我们认为,这一差距主要是由于小规模ViTs在特定任务表示学习方面的不足,而不是ViTs与边缘密集预测之间的固有不匹配。为了解决这个问题,我们提出了EdgeCrafter,一个统一的紧凑型ViT框架,专注于边缘密集预测,核心是ECDet,一个基于蒸馏紧凑型主干和边缘友好的编码器-解码器设计构建的检测模型。在COCO数据集上,ECDet-S在仅使用COCO注释的情况下,达到了51.7的平均精度(AP),参数量少于1000万。对于实例分割,ECInsSeg的性能可与RF-DETR相媲美,同时使用的参数显著更少。对于姿态估计,ECPose-X达到了74.8的AP,显著超越了YOLO26Pose-X(71.6 AP),尽管后者依赖于广泛的Objects365预训练。这些结果表明,当紧凑型ViTs与任务专用蒸馏和边缘感知设计相结合时,可以成为边缘密集预测的一个实用且具有竞争力的选择。代码可在以下网址获取:https://intellindust-ai-lab.github.io/projects/EdgeCrafter/
cs.CV / 72 / 2603.18742
6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models
6Bit-Diffusion:视频扩散模型的推理时混合精度量化
Abstract
Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking the quantization difficulty of activations across diffusion timesteps, leading to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 Mixed-Precision Quantization framework. We find a strong linear correlation between a block's input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. Besides this, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce Temporal Delta Cache (TDC) to skip computations for these invariant blocks, further reducing the computational cost. Extensive experiments demonstrate that our method achieves 1.92$\times$ end-to-end acceleration and 3.32$\times$ memory reduction, setting a new baseline for efficient inference in Video DiTs.
Chinese Translation
扩散变换器在生成视频方面展现了显著的能力。然而,它们的实际应用受到高内存使用和计算成本的严重限制。后训练量化提供了一种实用的方法来减少内存使用并提高计算速度。现有的量化方法通常应用静态位宽分配,忽视了不同扩散时间步上激活的量化难度,导致效率与质量之间的权衡不理想。本文提出了一种推理时的NVFP4/INT8混合精度量化框架。我们发现一个块的输入输出差异与其内部线性层的量化敏感性之间存在强线性相关性。基于这一见解,我们设计了一种轻量级预测器,动态地将NVFP4分配给时间上稳定的层,以最大化内存压缩,同时选择性地保留INT8用于波动层以确保鲁棒性。这种自适应精度策略使得在不影响生成质量的情况下实现激进的量化。此外,我们观察到变换器块的输入与输出之间的残差在不同时间步之间表现出高度的时间一致性。利用这一时间冗余,我们引入了时间增量缓存(Temporal Delta Cache, TDC)来跳过这些不变块的计算,进一步降低计算成本。大量实验表明,我们的方法实现了1.92倍的端到端加速和3.32倍的内存减少,为视频扩散模型的高效推理设定了新的基准。
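The Temporal Delta Cache idea rests on one observation: a block's residual (output minus input) changes little across adjacent diffusion timesteps. A toy sketch of a cache that skips a block once its residual has stabilized; the stability test and tolerance below are illustrative assumptions, since the abstract does not specify the exact skip rule:

```python
import torch

class TemporalDeltaCache:
    """Wrap a block; once two consecutive residuals agree within `tol`,
    skip the block and reuse the cached residual (toy skip criterion)."""
    def __init__(self, block, tol=1e-2):
        self.block = block
        self.tol = tol
        self.prev_delta = None
        self.stable = False

    def __call__(self, x):
        if self.stable:
            return x + self.prev_delta        # skip the block entirely
        out = self.block(x)
        delta = out - x                       # this timestep's residual
        if self.prev_delta is not None:
            rel = (delta - self.prev_delta).norm() / (delta.norm() + 1e-8)
            self.stable = bool(rel < self.tol)
        self.prev_delta = delta
        return out

# usage: a stand-in block whose residual is constant across timesteps
calls = []
def toy_block(x):
    calls.append(1)
    return x + 1.0

cached = TemporalDeltaCache(toy_block)
outs = [cached(torch.zeros(4)) for _ in range(3)]  # 3rd call reuses the cache
```

A real implementation would also need to invalidate the cache when conditioning changes and to bound how many consecutive steps may be skipped; this sketch omits both.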
cs.CV / 73 / 2603.18752
WeNLEX: Weakly Supervised Natural Language Explanations for Multilabel Chest X-ray Classification
WeNLEX:用于多标签胸部X光分类的弱监督自然语言解释
Abstract
Natural language explanations provide an inherently human-understandable way to explain black-box models, closely reflecting how radiologists convey their diagnoses in textual reports. Most works explicitly supervise the explanation generation process using datasets annotated with explanations. Thus, though plausible, the generated explanations are not faithful to the model's reasoning. In this work, we propose WeNLEX, a weakly supervised model for the generation of natural language explanations for multilabel chest X-ray classification. Faithfulness is ensured by matching images generated from their corresponding natural language explanations with original images, in the black-box model's feature space. Plausibility is maintained via distribution alignment with a small database of clinician-annotated explanations. We empirically demonstrate, through extensive validation on multiple metrics to assess faithfulness, simulatability, diversity, and plausibility, that WeNLEX is able to produce faithful and plausible explanations, using as few as 5 ground-truth explanations per diagnosis. Furthermore, WeNLEX can operate in both post-hoc and in-model settings. In the latter, i.e., when the multilabel classifier is trained together with the rest of the network, WeNLEX improves the classification AUC of the standalone classifier by 2.21%, thus showing that adding interpretability to the training process can actually increase the downstream task performance. Additionally, simply by changing the database, WeNLEX explanations are adaptable to any target audience, and we showcase this flexibility by training a layman version of WeNLEX, where explanations are simplified for non-medical users.
Chinese Translation
自然语言解释提供了一种本质上易于人类理解的方式来解释黑箱模型,密切反映了放射科医生在文本报告中传达诊断的方式。大多数研究通过使用带有解释的注释数据集显式监督解释生成过程。因此,尽管生成的解释看似合理,但并不忠实于模型的推理。在本研究中,我们提出了WeNLEX,一种用于多标签胸部X光分类的自然语言解释生成的弱监督模型。通过在黑箱模型的特征空间中,将由相应自然语言解释生成的图像与原始图像进行匹配,从而确保了解释的忠实性。通过与小型临床医生注释的解释数据库进行分布对齐,保持了解释的合理性。我们通过对多个评估忠实性、可模拟性、多样性和合理性的指标进行广泛验证,实证证明WeNLEX能够生成忠实且合理的解释,使用的真实解释数量仅为每个诊断5个。此外,WeNLEX可以在事后和模型内设置中运行。在后者,即当多标签分类器与网络的其余部分一起训练时,WeNLEX将独立分类器的分类AUC提高了2.21%,从而表明将可解释性添加到训练过程中实际上可以提高下游任务的性能。此外,仅通过更改数据库,WeNLEX的解释可以适应任何目标受众,我们通过训练一个简化版的WeNLEX来展示这种灵活性,该版本的解释为非医学用户进行了简化。
cs.CV / 74 / 2603.18757
DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection
DA-Mamba:学习领域感知状态空间模型以实现领域自适应目标检测中的全局-局部对齐
Abstract
Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain. Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations. However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features. Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Model (SSM) architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of SSMs to capture both global and local domain-invariant features. Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM). IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment. OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment. Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.
Chinese Translation
领域自适应目标检测(DAOD)旨在将检测器从标记的源领域转移到未标记的目标领域。现有的DAOD方法采用多粒度特征对齐来学习领域不变表示。然而,它们基于卷积神经网络(CNN)的主干和检测头的局部连接性将对齐限制在局部区域,未能提取全局领域不变特征。尽管基于变换器的DAOD方法通过注意力机制捕捉全局依赖关系,但其二次计算成本阻碍了实际部署。为了解决这个问题,我们提出了DA-Mamba,一种混合卷积神经网络-状态空间模型(SSM)架构,结合了CNN的高效性和SSM在线性时间内进行长距离建模的能力,以捕捉全局和局部领域不变特征。具体而言,我们引入了两个新颖的模块:图像感知状态空间模型(IA-SSM)和对象感知状态空间模型(OA-SSM)。IA-SSM集成到主干中,以增强全局领域意识,实现图像级别的全局和局部对齐。OA-SSM插入到检测头中,以建模对象之间的空间和语义依赖关系,增强实例级对齐。综合实验表明,所提出的方法能够有效提高目标检测器的跨领域性能。
cs.CV / 75 / 2603.18764
ProCal: Probability Calibration for Neighborhood-Guided Source-Free Domain Adaptation
ProCal:用于邻域引导的无源领域适应的概率校准
Abstract
Source-Free Domain Adaptation (SFDA) adapts pre-trained models to unlabeled target domains without requiring access to source data. Although state-of-the-art methods leveraging local neighborhood structures show promise for SFDA, they tend to over-rely on prediction similarity among neighbors. This over-reliance accelerates the forgetting of source knowledge and increases susceptibility to local noise overfitting. To address these issues, we introduce ProCal, a probability calibration method that dynamically calibrates neighborhood-based predictions through a dual-model collaborative prediction mechanism. ProCal integrates the source model's initial predictions with the current model's online outputs to effectively calibrate neighbor probabilities. This strategy not only mitigates the interference of local noise but also preserves the discriminative information from the source model, thereby achieving a balance between knowledge retention and domain adaptation. Furthermore, we design a joint optimization objective that combines a soft supervision loss with a diversity loss to guide the target model. Our theoretical analysis shows that ProCal converges to an equilibrium where source knowledge and target information are effectively fused, reducing both knowledge forgetting and overfitting. We validate the effectiveness of our approach through extensive experiments on 31 cross-domain tasks across four public datasets. Our code is available at: https://github.com/zhengyinghit/ProCal.
Chinese Translation
无源领域适应(SFDA)在不需要访问源数据的情况下,将预训练模型适应于未标记的目标领域。尽管利用局部邻域结构的最先进方法在SFDA中显示出良好的前景,但它们往往过度依赖邻居之间的预测相似性。这种过度依赖加速了源知识的遗忘,并增加了对局部噪声过拟合的敏感性。为了解决这些问题,我们提出了ProCal,一种通过双模型协同预测机制动态校准基于邻域的预测的概率校准方法。ProCal将源模型的初始预测与当前模型的在线输出相结合,有效地校准邻居概率。这一策略不仅减轻了局部噪声的干扰,还保留了源模型的判别信息,从而在知识保留和领域适应之间实现了平衡。此外,我们设计了一个联合优化目标,将软监督损失与多样性损失结合,以指导目标模型。我们的理论分析表明,ProCal收敛到一个平衡点,在该平衡点上源知识和目标信息得到了有效融合,从而减少了知识遗忘和过拟合。我们通过在四个公共数据集上的31个跨领域任务的广泛实验验证了我们方法的有效性。我们的代码可在以下地址获取:https://github.com/zhengyinghit/ProCal。
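ProCal's core calibration step, blending the frozen source model's initial predictions with the adapting model's online outputs to form pseudo-label targets for a sample's retrieved neighbors, can be sketched as follows. The fixed mixing weight `alpha` and the function names are illustrative assumptions; the paper calibrates dynamically:

```python
import torch

def calibrated_neighbor_targets(p_online, p_source, alpha=0.5):
    """Blend the frozen source model's predictions for K neighbors (K, C)
    with the current model's online outputs (K, C), then renormalize.
    A convex combination with fixed alpha is a toy stand-in for ProCal's
    dynamic calibration."""
    p_cal = alpha * p_source + (1.0 - alpha) * p_online
    return p_cal / p_cal.sum(dim=-1, keepdim=True)

def soft_supervision_loss(logits, targets):
    """Cross-entropy of the target model's logits against the calibrated
    soft neighbor targets (one ingredient of the joint objective; the
    diversity loss is omitted here)."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(targets * logp).sum(dim=-1).mean()
```

Anchoring the targets to the source model's predictions is what damps local-noise overfitting while retaining source discriminability, the balance the abstract describes.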
cs.CV / 76 / 2603.18774
SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction
SEAR:用于RGB+热成像3D重建的简单高效视觉几何变换器适应方法
Abstract
Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when they are processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, achieving substantial improvements across all metrics (e.g., over 29% in AUC@30) and delivering higher detail and consistency between modalities with negligible inference-time overhead compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at https://www.github.com/Schindler-EPFL-Lab/SEAR.
Chinese Translation
基础的前馈视觉几何模型通过从大量RGB数据集中学习强大的场景先验,实现了准确高效的相机位姿估计和场景重建。然而,当应用于混合感知模态(如RGB-热成像(RGB-T)图像)时,其有效性下降。我们观察到,虽然在RGB数据上预训练的视觉几何基础变换器在热成像重建中表现良好,但在联合处理RGB和热成像模态时却难以对齐。为了解决这个问题,我们提出了SEAR,一种简单而高效的微调策略,旨在将预训练的几何变换器适应于多模态RGB-T输入。尽管在相对较小的RGB-T数据集上进行训练,我们的方法在3D重建和相机位姿估计方面显著超越了最先进的方法,在所有指标上(例如,AUC@30提高超过29%)实现了显著提升,并在推理时间上与原始预训练模型相比几乎没有额外开销,提供了更高的细节和模态之间的一致性。值得注意的是,SEAR即使在低光照和浓烟等挑战性条件下,也能实现可靠的多模态位姿估计和重建。我们通过广泛的消融研究验证了我们的架构,展示了模型如何对齐两种模态。此外,我们引入了一个新的数据集,包含在不同时间、视角和光照条件下捕获的RGB和热成像序列,为未来的多模态3D场景重建工作提供了一个稳健的基准。代码和模型可在https://www.github.com/Schindler-EPFL-Lab/SEAR上公开获取。
cs.CV / 77 / 2603.18782
Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors
Points-to-3D:基于点云先验的结构感知三维生成
Abstract
Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to teach global structural inpainting, is then used for inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors. In practice, Points-to-3D can take either accurate point cloud priors or VGGT-estimated point clouds from single images as input. Experiments on both object and scene scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point cloud priors for achieving more accurate and structurally controllable 3D generation.
Chinese Translation
近期三维生成的进展主要依赖于基于图像或文本的模型,而现有的三维先验仍未得到充分利用。在许多现实场景中,可见区域的点云很容易从主动传感器(如激光雷达)或前馈预测器(如VGGT)获得,它们提供了当前方法未能充分利用的显式几何约束。在本研究中,我们提出了Points-to-3D,这是一个基于扩散的框架,利用点云先验实现几何可控的三维资产和场景生成。Points-to-3D基于潜在三维扩散模型TRELLIS构建,首先将纯噪声稀疏结构潜在初始化替换为针对点云先验定制的输入形式。然后,在TRELLIS框架内基于任务特定数据训练的结构修补网络被用于推理,采用分阶段采样策略(先进行结构修补再进行边界精修),在保留输入先验可见区域的同时完成全局几何。在实际应用中,Points-to-3D可以接收准确的点云先验或来自单幅图像的VGGT估计点云作为输入。在物体和场景两类情形上的实验一致地表明,Points-to-3D在渲染质量和几何保真度方面的表现优于现有最先进的基线,突显了显式嵌入点云先验以实现更准确和结构可控的三维生成的有效性。
cs.CV / 78 / 2603.18792
Rethinking Uncertainty Quantification and Entanglement in Image Segmentation
重新思考图像分割中的不确定性量化与纠缠
Abstract
Uncertainty quantification (UQ) is crucial in safety-critical applications such as medical image segmentation. Total uncertainty is typically decomposed into data-related aleatoric uncertainty (AU) and model-related epistemic uncertainty (EU). Many methods exist for modeling AU (such as Probabilistic UNet and diffusion models) and EU (such as ensembles and MC Dropout), but it is unclear how they interact when combined. Additionally, recent work has revealed substantial entanglement between AU and EU, undermining the interpretability and practical usefulness of the decomposition. We present a comprehensive empirical study covering a broad range of AU-EU model combinations, propose a metric to quantify uncertainty entanglement, and evaluate both across downstream UQ tasks. For out-of-distribution detection, ensembles exhibit consistently lower entanglement and superior performance. For ambiguity modeling and calibration, the best models are dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. A softmax ensemble fares remarkably well on all tasks. Finally, we analyze potential sources of uncertainty entanglement and outline directions for mitigating this effect.
Chinese Translation
不确定性量化(UQ)在医疗图像分割等安全关键应用中至关重要。总不确定性通常被分解为与数据相关的随机不确定性(AU)和与模型相关的认知不确定性(EU)。目前存在多种方法用于建模AU(如概率UNet、扩散模型)和EU(如集成方法、MC Dropout),但它们在结合时的相互作用尚不明确。此外,近期研究揭示了AU与EU之间的显著纠缠,这削弱了分解的可解释性和实际应用价值。我们提出了一项全面的实证研究,涵盖了广泛的AU-EU模型组合,提出了一种量化不确定性纠缠的指标,并在下游UQ任务中对两者进行了评估。在分布外检测中,集成方法表现出持续较低的纠缠和优越的性能。在模糊建模和校准方面,最佳模型依赖于数据集,其中基于softmax/SSN的方法表现良好,而概率UNet的纠缠程度较低。softmax集成在所有任务中表现出色。最后,我们分析了不确定性纠缠的潜在来源,并概述了减轻这一影响的方向。
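For ensemble-based methods discussed above, the AU/EU split is most commonly instantiated with the standard entropy decomposition: total uncertainty is the entropy of the mean prediction, AU is the mean member entropy, and EU is their difference (the mutual information). A minimal sketch of that standard formula, not tied to any specific model in the paper:

```python
import numpy as np

def decompose_uncertainty(ensemble_probs, eps=1e-12):
    """Entropy-based uncertainty decomposition for an ensemble.
    ensemble_probs: (members, samples, classes) predictive distributions.
    Returns (total, aleatoric, epistemic), each of shape (samples,)."""
    mean_p = ensemble_probs.mean(axis=0)
    # Total uncertainty: entropy of the averaged prediction.
    total = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)
    # Aleatoric: average entropy of the individual members.
    aleatoric = -(ensemble_probs * np.log(ensemble_probs + eps)).sum(axis=-1).mean(axis=0)
    # Epistemic: the gap (mutual information between prediction and member).
    epistemic = total - aleatoric
    return total, aleatoric, epistemic
```

Two confident members that disagree yield high epistemic and near-zero aleatoric uncertainty, which is exactly the kind of behavior an entanglement metric has to probe.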
cs.CV / 79 / 2603.18795
Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
Perceptio:通过空间标记生成增强感知的视觉语言模型
Abstract
Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
Chinese Translation
大型视觉语言模型(LVLMs)在语义理解方面表现出色,但在细粒度空间定位上存在困难,因为模型必须隐式推断复杂的几何形状而不产生空间解释。我们提出了Perceptio,一种增强感知的LVLM,具备二维和三维空间推理能力,得益于在自回归序列中直接生成的显式语义分割标记和深度标记。具体而言,我们(i)从强大的单目教师模型蒸馏VQ-VAE深度码本,将稠密深度标记化为紧凑的序列;(ii)在大型语言模型(LLM)内部整合基于SAM2的语义分割标记和VQ-VAE深度标记,以便模型首先发出空间标记,然后进行回答。为了稳定深度标记生成,我们引入了新颖的复合深度标记目标(标记、词元和计数损失)和用于可微重建的软合并技术。我们在多样化数据集上采用多任务联合训练策略,使模型学习感知标记以应对多个下游任务。基于InternVL,Perceptio在基准测试中实现了最先进的性能:在RefCOCO/+/g上提升了引用表达分割的cIoU分别为+0.8/+1.4/+1.1,HardBLINK空间理解精度提高10.3%,MMBench精度提高1.0%,表明显式空间链式思维显著增强了LVLMs的空间定位能力。
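The depth tokenization step above boils down to VQ-style nearest-codebook quantization: each depth patch is mapped to the index of its closest code vector, and those indices form the compact token sequence the LLM can emit. A minimal sketch, where the codebook contents and patch layout are illustrative assumptions:

```python
import numpy as np

def quantize_depth(depth_patches, codebook):
    """Map each flattened depth patch to its nearest codebook entry.
    depth_patches: (n_patches, d); codebook: (k, d).
    Returns one integer token id per patch."""
    # Squared distances from every patch to every code vector: (n, k).
    d2 = ((depth_patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    # The token for a patch is the index of its closest code.
    return d2.argmin(axis=1)
```

The decoder side of the VQ-VAE would invert this by looking up the code vectors and reassembling patches into a dense depth map.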
cs.CV / 80 / 2603.18797
VesselTok: Tokenizing Vessel-like 3D Biomedical Graph Representations for Reconstruction and Generation
VesselTok:对血管状三维生物医学图表示进行标记化以实现重建和生成
Abstract
Spatial graphs provide a lightweight and elegant representation of curvilinear anatomical structures such as blood vessels, lung airways, and neuronal networks. Accurately modeling these graphs is crucial in clinical and (bio-)medical research. However, the high spatial resolution of large networks drastically increases their complexity, resulting in significant computational challenges. In this work, we aim to tackle these challenges by proposing VesselTok, a framework that approaches spatially dense graphs from a parametric shape perspective to learn latent representations (tokens). VesselTok leverages centerline points with a pseudo radius to effectively encode tubular geometry. Specifically, we learn a novel latent representation conditioned on centerline points to encode neural implicit representations of vessel-like, tubular structures. We demonstrate VesselTok's performance across diverse anatomies, including lung airways, lung vessels, and brain vessels, highlighting its ability to robustly encode complex topologies. To prove the effectiveness of VesselTok's learnt latent representations, we show that they (i) generalize to unseen anatomies, (ii) support generative modeling of plausible anatomical graphs, and (iii) transfer effectively to downstream inverse problems, such as link prediction.
Chinese Translation
空间图为曲线解剖结构(如血管、肺气道和神经网络)提供了一种轻量且优雅的表示方式。准确建模这些图在临床和(生物)医学研究中至关重要。然而,大型网络的高空间分辨率显著增加了其复杂性,导致了重大的计算挑战。在本研究中,我们旨在通过提出VesselTok来应对这些挑战,该框架从参数形状的角度处理空间密集图,以学习潜在表示(标记)。VesselTok利用带有伪半径的中心线点有效编码管状几何形状。具体而言,我们学习了一种新的潜在表示,该表示以中心线点为条件,以编码血管状管状结构的神经隐式表示。我们展示了VesselTok在多种解剖结构(包括肺气道、肺血管和脑血管)上的性能,突显了其稳健编码复杂拓扑的能力。为了证明VesselTok学习的潜在表示的有效性,我们展示了它们(i)能够推广到未见过的解剖结构,(ii)支持生成合理的解剖图模型,以及(iii)有效转移到下游逆问题,如链接预测。
cs.CV / 81 / 2603.18834
Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging
基于统计特征引导的快速高分辨率透射电子显微镜成像去噪
Abstract
High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nucleation dynamics, advancing the study of advanced solid materials. Nonetheless, because nucleation changes rapidly on millisecond timescales, observation requires short-exposure rapid imaging, leading to severe noise that obscures atomic positions. In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both the spatial and frequency domains. In the spatial domain, we present spatial deviation-guided weighting to select appropriate convolution operations for each spatial position based on deviation characteristics. In the frequency domain, we present frequency band-guided weighting to enhance signals and suppress noise based on band characteristics. We also develop an HRTEM-specific noise calibration method and generate a dataset with disordered structures and realistic HRTEM image noise, ensuring that models retain their denoising performance on real images for nucleation observation. Experiments on synthetic and real data show that our method outperforms state-of-the-art methods in HRTEM image denoising and is effective in the downstream localization task. Code will be available at https://github.com/HeasonLee/SCGN.
Chinese Translation
高分辨率透射电子显微镜(HRTEM)能够实现原子级别的成核动态观察,从而推动先进固体材料的研究。然而,由于成核过程的毫秒级快速变化,这需要短曝光时间的快速成像,导致严重的噪声掩盖原子位置。在本研究中,我们提出了一种基于统计特征引导的去噪网络,该网络利用统计特征在空间和频率域中引导去噪过程。在空间域中,我们提出了空间偏差引导加权,根据偏差特征为每个空间位置选择合适的卷积操作。在频率域中,我们提出了频带引导加权,基于频带特征增强信号并抑制噪声。我们还开发了一种特定于HRTEM的噪声校准方法,并生成了一个包含无序结构和真实HRTEM图像噪声的数据集。这可以确保模型在实际图像上的去噪性能,以便进行成核观察。对合成数据和真实数据的实验表明,我们的方法在HRTEM图像去噪方面优于现有的最先进方法,并在定位下游任务中表现出有效性。代码将发布在 https://github.com/HeasonLee/SCGN。
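The frequency band-guided weighting idea above can be sketched as a radial band filter on the 2D spectrum: split the FFT into radial bands, then boost the band carrying lattice signal and attenuate the noise-dominated high-frequency band. Band edges and gains below are illustrative assumptions, not the paper's learned weights:

```python
import numpy as np

def band_weighted_filter(image, low_gain=1.0, mid_gain=1.5, high_gain=0.5):
    """Rescale radial frequency bands of a 2D image spectrum.
    Mid band is boosted (signal), high band suppressed (noise)."""
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[:h, :w]
    # Normalized radial frequency, 0 at the (shifted) DC component.
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    gain = np.where(r < 0.25, low_gain, np.where(r < 0.6, mid_gain, high_gain))
    # Apply the per-band gains and transform back to the spatial domain.
    return np.fft.ifft2(np.fft.ifftshift(f * gain)).real
```

In the actual network such weights would be predicted from band statistics rather than fixed by hand.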
cs.CV / 82 / 2603.18846
Towards Interpretable Foundation Models for Retinal Fundus Images
朝向可解释的基础模型用于视网膜眼底图像
Abstract
Foundation models are used to extract transferable representations from large amounts of unlabeled data, typically via self-supervised learning (SSL). However, many of these models rely on architectures that offer limited interpretability, which is a critical issue in high-stakes domains such as medical imaging. We propose Dual-IFM, a foundation model that is interpretable-by-design in two ways: first, it provides local interpretability for individual images through class evidence maps that are faithful to the decision-making process. Second, it provides global interpretability for entire datasets through a 2D projection layer that allows for direct visualization of the model's representation space. We trained our model on over 800,000 color fundus photographs from various sources to learn generalizable, interpretable representations for different downstream tasks. Our results show that our model reaches a performance range similar to that of state-of-the-art foundation models with up to $16\times$ the number of parameters, while providing interpretable predictions on out-of-distribution data. Our results suggest that large-scale SSL pretraining paired with inherent interpretability can lead to robust representations for retinal imaging.
Chinese Translation
基础模型用于从大量未标记的数据中提取可转移的表示,通常通过自监督学习 (SSL) 实现。然而,这些模型中的许多依赖于提供有限可解释性的架构,这在医疗影像等高风险领域是一个关键问题。我们提出了Dual-IFM,一个设计上可解释的基础模型,具有两种方式:首先,它通过与决策过程忠实的类证据图提供单个图像的局部可解释性。其次,它通过2D投影层提供整个数据集的全局可解释性,使得模型的表示空间能够直接可视化。我们在来自不同来源的超过80万张彩色眼底摄影图像上训练了我们的模型,以学习针对不同下游任务的可概括、可解释的表示。我们的结果表明,我们的模型在参数数量可达到当前最先进基础模型的16倍的情况下,性能范围与之相似,同时在分布外数据上提供可解释的预测。我们的结果表明,结合大规模的SSL预训练和固有可解释性可以为视网膜成像提供鲁棒的表示。
cs.CV / 83 / 2603.18850
HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models
HORNet:基于任务引导的视频问答中的帧选择方法
Abstract
Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing what a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.
Chinese Translation
基于视觉语言模型(VLM)的视频问答(VQA)在很大程度上依赖于从输入视频中选择哪些帧,但大多数系统依赖于均匀或启发式的采样方法,这些方法无法针对下游问答质量进行优化。我们提出了HORNet,一种轻量级的帧选择策略,采用组相对策略优化(Group Relative Policy Optimization, GRPO)进行训练,以学习冻结的VLM需要哪些帧才能正确回答问题。HORNet的可训练参数少于100万,能够将输入帧减少高达99%,并将VLM处理时间减少高达93%,同时在短视频基准测试中提高答案质量(在MSVD-QA上F1提升1.7%),并在时间推理任务中取得强劲表现(在NExT-QA上比均匀采样提高7.3分)。我们将此形式化为选择任意帧(Select Any Frames, SAF)任务,该任务将视觉输入的策划与VLM推理解耦,并表明GRPO训练的选择在分布外的泛化能力优于监督学习和PPO替代方案。HORNet的策略进一步可以在不同的VLM回答者之间迁移,无需重新训练,当与更强的模型配对时,额外获得8.5%的相对提升。在涵盖341,877个问答对和114.2小时视频的六个基准测试中,我们的结果表明,优化VLM所看到的内容是优化其生成内容的一个实用且互补的替代方案,同时提高了效率。代码可在https://github.com/ostadabbas/HORNet获取。
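The GRPO training signal mentioned above standardizes each sampled frame selection's reward against the other samples drawn for the same question (its group), removing the need for a learned value baseline. A minimal sketch of that advantage computation; the group layout is an assumption for illustration:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each sample's reward
    against its own group's mean and std.
    rewards: (groups, samples_per_group)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # Samples beating their group's average get positive advantage.
    return (rewards - mean) / (std + eps)
```

These advantages would then weight the policy-gradient update of the frame selection policy.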
cs.CV / 84 / 2603.18856
Motion-o: Trajectory-Grounded Video Reasoning
Motion-o:基于轨迹的视频推理
Abstract
Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about how objects move between observations: no prior work has articulated the motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that represents object trajectories through discrete tags summarizing per-object direction, speed, and scale-of-velocity change, explicitly connecting grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence, all while requiring no architectural modifications. Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at https://github.com/ostadabbas/Motion-o.
Chinese Translation
近期的研究在视频推理方面取得了显著进展,许多模型利用时空证据链来增强其推理能力。同时,越来越多的数据集和基准提供了旨在支持和评估此类推理的结构化注释。然而,关于物体在观察之间如何移动的推理却鲜有关注:之前的工作未能通过连接连续观察来阐明运动模式,使得轨迹理解变得隐含且难以验证。我们将这一缺失的能力形式化为时空轨迹(Spatial-Temporal-Trajectory, STT)推理,并引入了Motion-o,这是一个以运动为中心的视频理解扩展,旨在使轨迹变得明确且可验证。为了实现运动推理,我们还引入了一个轨迹基础数据集工件,通过增强扩展稀疏的关键帧监督,以产生更密集的边界框轨迹和更强的轨迹级训练信号。最后,我们引入了运动思维链(Motion Chain of Thought, MCoT),这是一个结构化的推理路径,通过离散标签总结每个物体的方向、速度和速度变化的尺度,以明确地将基础观察连接成轨迹。为了训练Motion-o,我们设计了一个奖励函数,促使模型直接对视觉证据进行推理,同时不需要任何架构修改。实证结果表明,Motion-o在时空基础和轨迹预测方面有所改善,同时与现有框架完全兼容,确立了运动推理作为基于证据的视频理解的重要扩展。代码可在 https://github.com/ostadabbas/Motion-o 获取。
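A toy version of the per-object motion summary that an MCoT-style tag might carry can be computed from two bounding boxes at successive timestamps; the field names and the area-ratio proxy for scale change are illustrative assumptions, not the paper's actual tag format:

```python
import math

def motion_summary(box_t0, box_t1, dt=1.0):
    """Derive direction, speed, and scale change from two (x, y, w, h)
    boxes of the same object at successive timestamps."""
    x0, y0, w0, h0 = box_t0
    x1, y1, w1, h1 = box_t1
    dx, dy = x1 - x0, y1 - y0
    speed = math.hypot(dx, dy) / dt               # pixels per time step
    direction = math.degrees(math.atan2(dy, dx))  # 0 deg = rightward
    scale_change = (w1 * h1) / (w0 * h0)          # area ratio as proxy
    return {"direction_deg": direction, "speed": speed,
            "scale_change": scale_change}
```

Chaining such summaries across frames is what turns isolated grounded detections into an explicit, checkable trajectory.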
cs.CV / 85 / 2603.18891
PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment
PromptHub:通过局部感知融合、集中和对齐增强多提示视觉上下文学习
Abstract
Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings, and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.
Chinese Translation
视觉上下文学习(VICL)旨在通过模仿像素示例来完成视觉任务。最近的研究开创了提示融合,结合了各种示例的优势,为扩展VICL提供了有希望的途径。不幸的是,基于补丁的融合框架和模型无关的监督限制了信息线索的利用,从而限制了性能提升。为克服这一不足,我们引入了PromptHub,一个通过局部感知融合、集中和对齐整体增强多提示的框架。PromptHub利用空间先验捕捉更丰富的上下文信息,采用互补的集中、对齐和预测目标来相互引导训练,并结合数据增强进一步强化监督。在三个基础视觉任务上的大量实验表明了PromptHub的优越性。此外,我们验证了其在分布外设置和各种检索场景中的普适性、可迁移性和鲁棒性。本研究建立了一个可靠的局部感知提示融合范式,超越了以往基于补丁的方法。代码可在 https://github.com/luotc-why/ICLR26-PromptHub 获取。
cs.CV / 86 / 2603.18892
MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model
MultihopSpatial:针对视觉-语言模型的多跳组合空间推理基准
Abstract
Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
Chinese Translation
空间推理是视觉-语言模型(VLMs)的基础,尤其是在作为视觉-语言-行动(VLA)代理在物理环境中展开应用时。然而,现有基准主要聚焦于基本的单跳关系,忽视了在真实场景中所需的多跳组合推理以及精确的视觉定位。为了解决这一问题,我们提出了MultihopSpatial,提供了三个关键贡献:(1)一个专门针对多跳和组合空间推理的综合基准,涵盖了1至3跳复杂查询,涉及多样的空间视角。(2)Acc@50IoU,一种补充指标,通过同时评估推理和视觉定位来要求答案选择和精确的边界框预测,这些能力对于VLA的强大部署至关重要。(3)MultihopSpatial-Train,一个专门的大规模训练语料库,用于促进空间智能的提升。对37种最先进VLM的广泛评估揭示了八个关键洞见,显示组合空间推理仍然是一个严峻的挑战。最后,我们展示了在我们的语料库上进行后训练的强化学习,能够增强VLM的内在空间推理能力和下游的具身操作表现。
cs.CV / 87 / 2603.18896
Translating MRI to PET through Conditional Diffusion Models with Enhanced Pathology Awareness
通过增强病理意识的条件扩散模型将MRI转换为PET
Abstract
Positron emission tomography (PET) is a widely recognized technique for diagnosing neurodegenerative diseases, offering critical functional insights. However, its high costs and radiation exposure hinder its widespread use. In contrast, magnetic resonance imaging (MRI) does not involve such limitations. While MRI also detects neurodegenerative changes, it is less sensitive for diagnosis compared to PET. To overcome such limitations, one approach is to generate synthetic PET from MRI. Recent advances in generative models have paved the way for cross-modality medical image translation; however, existing methods largely emphasize structural preservation while neglecting the critical need for pathology awareness. To address this gap, we propose PASTA, a novel image translation framework built on conditional diffusion models with enhanced pathology awareness. PASTA surpasses state-of-the-art methods by preserving both structural and pathological details through its highly interactive dual-arm architecture and multi-modal condition integration. Additionally, we introduce a novel cycle exchange consistency and volumetric generation strategy that significantly enhances PASTA's ability to produce high-quality 3D PET images. Our qualitative and quantitative results demonstrate the high quality and pathology awareness of the synthesized PET scans. For Alzheimer's diagnosis, the performance of these synthesized scans improves over MRI by 4%, almost reaching the performance of actual PET. Our code is available at https://github.com/ai-med/PASTA.
Chinese Translation
正电子发射断层扫描(PET)是一种广泛认可的神经退行性疾病诊断技术,提供了重要的功能性见解。然而,其高昂的成本和辐射暴露限制了其广泛应用。相比之下,磁共振成像(MRI)不受此类限制。尽管MRI也能检测神经退行性变化,但其在诊断方面的敏感性不及PET。为克服这些限制,一种方法是从MRI生成合成PET。最近在生成模型方面的进展为跨模态医学图像转换铺平了道路;然而,现有方法主要强调结构保留,而忽视了对病理意识的关键需求。为了解决这一问题,我们提出了PASTA,这是一种基于条件扩散模型的具有增强病理意识的新型图像转换框架。PASTA通过其高度互动的双臂架构和多模态条件集成,超越了现有的最先进方法,既保留了结构细节,又保留了病理细节。此外,我们引入了一种新颖的循环交换一致性和体积生成策略,显著增强了PASTA生成高质量3D PET图像的能力。我们的定性和定量结果表明,合成的PET扫描具有高质量和病理意识。在阿尔茨海默病诊断中,这些合成扫描的表现比MRI提高了4%,几乎达到了实际PET的性能。我们的代码可在https://github.com/ai-med/PASTA获取。
cs.CV / 88 / 2603.18912
GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting
GHOST:基于RGB视频的快速类别无关手-物体交互重建方法,采用高斯斑点技术
Abstract
Understanding realistic hand-object interactions from monocular RGB videos is essential for AR/VR, robotics, and embodied AI. Existing methods rely on category-specific templates or heavy computation, yet still produce physically inconsistent hand-object alignment in 3D. We introduce GHOST (Gaussian Hand-Object Splatting), a fast, category-agnostic framework for reconstructing dynamic hand-object interactions using 2D Gaussian Splatting. GHOST represents both hands and objects as dense, view-consistent Gaussian discs and introduces three key innovations: (1) a geometric-prior retrieval and consistency loss that completes occluded object regions, (2) a grasp-aware alignment that refines hand translations and object scale to ensure realistic contact, and (3) a hand-aware background loss that prevents penalizing hand-occluded object regions. GHOST achieves complete, physically consistent, and animatable reconstructions from a single RGB video while running an order of magnitude faster than prior category-agnostic methods. Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets demonstrate state-of-the-art accuracy in 3D reconstruction and 2D rendering quality, establishing GHOST as an efficient and robust solution for realistic hand-object interaction modeling. Code is available at https://github.com/ATAboukhadra/GHOST.
Chinese Translation
从单目RGB视频中理解真实的手-物体交互对增强现实/虚拟现实、机器人技术和具身人工智能至关重要。现有方法依赖于特定类别的模板或大量计算,但仍会在3D中产生物理不一致的手-物体对齐。我们提出了GHOST(高斯手-物体斑点技术),这是一种快速的、类别无关的框架,用于使用2D高斯斑点重建动态的手-物体交互。GHOST将手和物体表示为密集的、视角一致的高斯圆盘,并引入三项关键创新:(1)几何先验检索和一致性损失,完成被遮挡的物体区域;(2)关注抓取的对齐方法,细化手的平移和物体的比例,以确保现实的接触;(3)关注手的背景损失,防止对手遮挡物体区域的惩罚。GHOST能够从单个RGB视频中实现完整、物理一致且可动画的重建,同时运行速度比以往类别无关方法快十倍。对ARCTIC、HO3D及野外数据集的广泛实验展示了在3D重建和2D渲染质量上的最新准确性,确立了GHOST作为现实手-物体交互建模的高效且稳健的解决方案。代码可在 https://github.com/ATAboukhadra/GHOST 获得。
cs.CV / 89 / 2603.18924
Unsupervised Contrastive Learning for Efficient and Robust Spectral Shape Matching
无监督对比学习用于高效且鲁棒的光谱形状匹配
Abstract
Estimating correspondences between pairs of non-rigid deformable 3D shapes remains a significant challenge in computer vision and graphics. While deep functional map methods have become the go-to solution for addressing this problem, they primarily focus on optimizing pointwise and functional maps either individually or jointly, rather than directly enhancing feature representations in the embedding space, which often results in inadequate feature quality and suboptimal matching performance. Furthermore, these approaches heavily rely on traditional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational costs. In this work, we introduce, for the first time, a novel unsupervised contrastive learning-based approach for efficient and robust 3D shape matching. We begin by presenting an unsupervised contrastive learning framework that promotes feature learning by maximizing consistency within positive similarity pairs and minimizing it within negative similarity pairs, thereby improving both the consistency and discriminability of the learned features. We then design a significantly simplified functional map learning architecture that eliminates the need for computationally expensive functional map solvers and multiple auxiliary functional map losses, greatly enhancing computational efficiency. By integrating these two components into a unified two-branch pipeline, our method achieves state-of-the-art performance in both accuracy and efficiency. Extensive experiments demonstrate that our approach is not only computationally efficient but also outperforms current state-of-the-art methods across various challenging benchmarks, including near-isometric, non-isometric, and topologically inconsistent scenarios, even surpassing supervised techniques.
Chinese Translation
在计算机视觉和图形学中,估计非刚性可变形三维形状之间的对应关系仍然是一个重大挑战。尽管深度功能映射方法已成为解决此问题的首选方案,但它们主要集中于单独或联合优化逐点和功能映射,而不是直接增强嵌入空间中的特征表示,这往往导致特征质量不足和匹配性能不佳。此外,这些方法严重依赖于传统的功能映射技术,如耗时的功能映射求解器,这会产生巨大的计算成本。在本研究中,我们首次提出了一种新颖的基于无监督对比学习的方法,用于高效且鲁棒的三维形状匹配。我们首先提出一个无监督对比学习框架,通过最大化正相似对之间的一致性和最小化负相似对之间的一致性来促进特征学习,从而提高学习特征的一致性和可区分性。然后,我们设计了一个显著简化的功能映射学习架构,消除了对计算成本高昂的功能映射求解器和多个辅助功能映射损失的需求,大大提高了计算效率。通过将这两个组件集成到一个统一的双分支管道中,我们的方法在准确性和效率上都达到了最先进的性能。大量实验表明,我们的方法不仅计算效率高,而且在各种具有挑战性的基准测试中超越了当前的最先进方法,包括近等距、非等距和拓扑不一致的场景,甚至超过了监督技术。
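The contrastive objective described above, maximizing consistency within positive pairs against negatives, can be sketched with a generic InfoNCE loss over per-vertex descriptors of two shapes, where corresponding vertices form the positive pairs; this is a standard formulation, not the paper's exact objective:

```python
import numpy as np

def info_nce_loss(feat_a, feat_b, temperature=0.07):
    """InfoNCE over per-vertex descriptors of two shapes.
    Row i of feat_a and row i of feat_b are assumed to correspond
    (positive pair); all other rows serve as negatives."""
    # L2-normalize so the dot product is cosine similarity.
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                    # (n, n) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; minimize their negative log-prob.
    return -np.diag(log_probs).mean()
```

Driving this loss down pushes corresponding vertices together and non-corresponding ones apart in the embedding space, without ever invoking a functional map solver.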
cs.CV / 90 / 2603.18943
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
VGGT-360:几何一致性零样本全景深度估计
Abstract
This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
Chinese Translation
本文提出了VGGT-360,一个新颖的无训练框架,用于零样本、几何一致性的全景深度估计。与先前的视角无关的无训练方法不同,VGGT-360将任务重新表述为基于多视图重建的3D模型进行全景重投影,通过利用VGGT类基础模型的内在3D一致性,从而将碎片化的逐视图推理统一为一个连贯的全景理解。为了实现鲁棒且准确的估计,VGGT-360集成了三个插件式模块,形成一个统一的全景-3D-深度框架:(i) 不确定性引导的自适应投影将全景切片为透视视图,以弥合全景输入和VGGT的透视先验之间的领域差距。它通过估计基于梯度的不确定性,将更密集的视图分配给几何贫乏区域,从而为VGGT提供几何信息丰富的输入。(ii) 结构显著性增强的注意力在3D重建过程中增强了VGGT的鲁棒性,通过将结构感知的置信度注入其注意力层,引导关注几何可靠的区域,并增强跨视图的一致性。(iii) 相关性加权的3D模型校正通过使用注意力推断的相关性分数来重新加权重叠点,精炼重建的3D模型,为准确的全景重投影提供一致的几何基础。大量实验表明,VGGT-360在多个分辨率和多样的室内外数据集上超越了训练过和无训练的最先进方法。
cs.CV / 91 / 2603.18991
CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think
CRAFT:对齐扩散模型的微调比你想象的更简单
Abstract
Aligning diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly less training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then performs an enhanced variant of SFT. We also theoretically prove that CRAFT optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods trained on thousands of preference-paired samples. Moreover, CRAFT achieves 11-220$\times$ faster convergence than the baseline preference optimization methods, highlighting its extremely high efficiency.
Chinese Translation
对齐扩散模型在生成高质量、符合人类偏好的图像方面取得了显著突破。现有技术,如监督微调(SFT)和DPO风格的偏好优化,已成为微调扩散模型的原则性工具。然而,SFT依赖于高质量图像,这些图像获取成本高昂,而DPO风格的方法则依赖于大规模偏好数据集,这些数据集的质量往往不一致。除了数据依赖性之外,这些方法还受到计算效率的限制。为了解决这两个挑战,我们提出了复合奖励辅助微调(CRAFT),一种轻量级但强大的微调范式,它在显著减少训练数据的同时保持计算效率。CRAFT首先利用复合奖励过滤(CRF)技术构建高质量且一致的训练数据集,然后执行SFT的增强变体。我们还理论上证明CRAFT实际上优化了基于组的强化学习的下界,建立了选定数据的SFT与强化学习之间的原则性联系。我们广泛的实证结果表明,仅用100个样本的CRAFT就能轻松超越最近数千个偏好配对样本的SOTA偏好优化方法。此外,CRAFT的收敛速度比基线偏好优化方法快11-220倍,突显了其极高的效率。
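The CRF-then-SFT recipe described above can be illustrated with a minimal data-selection sketch. Everything here is a hypothetical stand-in: the per-sample reward scores, the z-score normalization, and the equal mixing weights are assumptions for illustration, not the paper's actual reward models.

```python
import numpy as np

def composite_reward_filter(rewards, weights, k):
    """Keep the top-k samples by a weighted composite reward.

    rewards: (n_samples, n_rewards) scores from several reward models.
    weights: (n_rewards,) mixing weights for the composite score.
    k: number of samples to retain for the subsequent SFT stage.
    Returns indices of the selected samples, best first.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Normalize each reward column so no single scorer dominates.
    z = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    composite = z @ np.asarray(weights, dtype=float)
    return np.argsort(-composite)[:k]

# Toy example: 6 candidate images scored by two reward models.
scores = [[0.9, 0.2], [0.1, 0.8], [0.7, 0.7],
          [0.3, 0.3], [0.95, 0.9], [0.2, 0.1]]
selected = composite_reward_filter(scores, weights=[0.5, 0.5], k=2)
print(selected)  # indices of the two most consistently rewarded samples
```

Only the retained subset would then be used for the enhanced SFT stage.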
cs.CV / 92 / 2603.19004
Unleashing the Power of Simplicity: A Minimalist Strategy for State-of-the-Art Fingerprint Enhancement
释放简单的力量:一种最简化的尖端指纹增强策略
Abstract
Fingerprint recognition systems, which rely on the unique characteristics of human fingerprints, are essential in modern security and verification applications. Accurate minutiae extraction, a critical step in these systems, depends on the quality of fingerprint images. Despite recent improvements in fingerprint enhancement techniques, state-of-the-art methods often struggle with low-quality fingerprints and can be computationally demanding. This paper presents a minimalist approach to fingerprint enhancement, prioritizing simplicity and effectiveness. Two novel methods are introduced: a contextual filtering method and a learning-based method. These techniques consistently outperform complex state-of-the-art methods, producing clearer, more accurate, and less noisy images. The effectiveness of these methods is validated using a challenging latent fingerprint database. The open-source implementation of these techniques not only fosters reproducibility but also encourages further advancements in the field. The findings underscore the importance of simplicity in achieving high-quality fingerprint enhancement and suggest that future research should balance complexity and practical benefits.
Chinese Translation
指纹识别系统依赖于人类指纹的独特特征,是现代安全和验证应用中不可或缺的组成部分。准确的细节提取是这些系统中的关键步骤,其依赖于指纹图像的质量。尽管最近在指纹增强技术方面取得了进展,但尖端方法在处理低质量指纹时常常面临挑战,并且计算需求较高。本文提出了一种最简化的指纹增强方法,优先考虑简单性和有效性。介绍了两种新颖的方法:一种上下文过滤方法和一种基于学习的方法。这些技术在性能上始终优于复杂的尖端方法,能够生成更清晰、更准确且噪声更少的图像。这些方法的有效性通过一个具有挑战性的潜在指纹数据库得到了验证。这些技术的开源实现不仅促进了可重复性,还鼓励了该领域的进一步发展。研究结果强调了在实现高质量指纹增强中简单性的重要性,并建议未来的研究应在复杂性与实际效益之间取得平衡。
cs.CV / 93 / 2603.19013
Generalized Hand-Object Pose Estimation with Occlusion Awareness
具备遮挡意识的广义手-物体姿态估计
Abstract
Generalized 3D hand-object pose estimation from a single RGB image remains challenging due to the large variations in object appearances and interaction patterns, especially under heavy occlusion. We propose GenHOI, a framework for generalized hand-object pose estimation with occlusion awareness. GenHOI integrates hierarchical semantic knowledge with hand priors to enhance model generalization under challenging occlusion conditions. Specifically, we introduce a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns via textual descriptions. This enables the model to learn abstract high-level representations of hand-object interactions for generalization to unseen objects and novel interactions while compensating for missing or ambiguous visual cues. To enable robust occlusion reasoning, we adopt a multi-modal masked modeling strategy over RGB images, predicted point clouds, and textual descriptions. Moreover, we leverage hand priors as stable spatial references to extract implicit interaction constraints. This allows reliable pose inference even under significant variations in object shapes and interaction patterns. Extensive experiments on the challenging DexYCB and HO3Dv2 benchmarks demonstrate that our method achieves state-of-the-art performance in hand-object pose estimation.
Chinese Translation
从单一RGB图像中进行广义的3D手-物体姿态估计仍然具有挑战性,主要由于物体外观和交互模式的巨大变化,尤其是在严重遮挡的情况下。我们提出了GenHOI,一个具备遮挡意识的广义手-物体姿态估计框架。GenHOI将分级语义知识与手部先验结合,以增强模型在困难遮挡条件下的泛化能力。具体而言,我们引入了一种分级语义提示,通过文本描述编码物体状态、手部配置和交互模式。这使得模型能够学习手-物体交互的抽象高级表征,从而在遇到未见过的物体和新颖交互时进行泛化,同时弥补缺失或模糊的视觉线索。为了实现稳健的遮挡推理,我们在RGB图像、预测的点云和文本描述上采用多模态掩膜建模策略。此外,我们利用手部先验作为稳定的空间参考,以提取隐式的交互约束。这使得即使在物体形状和交互模式显著变化的情况下,仍能可靠地推断姿态。在具有挑战性的DexYCB和HO3Dv2基准上进行的大量实验表明,我们的方法在手-物体姿态估计方面达到了最先进的性能。
cs.CV / 94 / 2603.19026
Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token
将 MLLM 本身重新思考为具有单一分割标记的分割器
Abstract
Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without LLM processing, respectively, to unleash the details of compressed features and amplify the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs. Project page: https://github.com/ANDYZAQ/SELF1E.
Chinese Translation
近期利用多模态大型语言模型(MLLM)的分割方法展示了可靠的对象级分割和增强的空间感知。然而,几乎所有之前的方法主要依赖于专业的掩码解码器来解释从生成的分割相关嵌入和视觉特征中获得的掩码,或者引入多个附加标记以提供帮助。本文旨在探讨我们是否以及如何能够利用 MLLM 本身的 1 个分割嵌入(SELF1E)解锁分割,同时实现具有竞争力的结果,从而消除对外部解码器的需求。为此,我们的方法针对 MLLM 中像素打乱图像特征的分辨率降低这一根本限制。首先,我们保留图像特征在其原始未压缩分辨率下,并用从 MLLM 处理的压缩特征中提取的残差特征进行填充,从而提高特征的精度。随后,我们分别对经过 LLM 处理和未经过 LLM 处理的图像特征进行像素解乱操作,以释放压缩特征的细节并在未压缩分辨率下放大残差特征,从而进一步增强填充特征的分辨率。此外,我们重新设计了具有双重感知路径的注意力掩码,即图像到图像和图像到分割,使得像素与分割标记之间能够进行丰富的特征交互。针对多个分割任务的全面实验验证了 SELF1E 的性能与基于专业掩码解码器的方法相当,展示了在 MLLM 中无解码器分割的可行性。项目页面:https://github.com/ANDYZAQ/SELF1E。
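The pixel-shuffle/unshuffle operations the abstract builds on are standard space-to-depth rearrangements. A minimal NumPy sketch (not the authors' implementation) of pixel-unshuffle with factor `r`, turning an `(H, W, C)` feature map into an `(H/r, W/r, C*r*r)` one:

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Space-to-depth: (H, W, C) -> (H//r, W//r, C*r*r)."""
    H, W, C = x.shape
    x = x.reshape(H // r, r, W // r, r, C)   # split each spatial axis
    x = x.transpose(0, 2, 1, 3, 4)           # group the r x r sub-pixels
    return x.reshape(H // r, W // r, C * r * r)

x = np.arange(16.0).reshape(4, 4, 1)
y = pixel_unshuffle(x, 2)
print(y.shape)  # each output location stacks its 2x2 input neighborhood
```

The inverse (pixel shuffle) simply reverses the reshapes, which is how compressed-resolution features can be re-expanded against the uncompressed ones.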
cs.CV / 95 / 2603.19028
SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models
SEM:用于视觉-语言模型后置去偏的稀疏嵌入调制
Abstract
Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
Chinese Translation
连接视觉和语言的模型,如CLIP,是多模态人工智能的关键组成部分,但其大规模、未经筛选的训练数据引入了严重的社会和虚假偏见。现有的后置去偏方法通常直接在稠密的CLIP嵌入空间中操作,在该空间中,偏见和任务相关信息高度交织。这种交织限制了它们在不降低语义保真度的情况下去除偏见的能力。在本研究中,我们提出了稀疏嵌入调制(Sparse Embedding Modulation, SEM),这是一种在稀疏自编码器(Sparse Autoencoder, SAE)潜在空间中操作的后置零样本去偏框架。通过将CLIP文本嵌入分解为解耦特征,SEM识别并调制与偏见相关的神经元,同时保留与查询相关的神经元。这使得更精确的非线性干预成为可能。在四个基准数据集和两个CLIP骨干网络上,SEM在检索和零样本分类中实现了显著的公平性提升。我们的结果表明,稀疏潜在表示为视觉-语言模型的后置去偏提供了有效的基础。
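The core SEM intervention — encode a dense embedding into a sparse latent, damp bias-relevant neurons, decode back — can be sketched as follows. The random encoder/decoder weights and the chosen neuron indices are purely illustrative; in SEM the SAE is pretrained and the bias-relevant neurons are identified from data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 32                      # dense embedding dim, sparse latent dim
W_enc = rng.normal(size=(m, d))   # toy SAE encoder (assumed pretrained)
W_dec = rng.normal(size=(d, m))   # toy SAE decoder

def debias(embedding, bias_neurons, scale=0.0):
    """Encode to a sparse latent, damp bias-linked neurons, decode back."""
    z = np.maximum(W_enc @ embedding, 0.0)   # ReLU sparse code
    z[list(bias_neurons)] *= scale           # suppress bias-relevant features
    return W_dec @ z

x = rng.normal(size=d)
x_debiased = debias(x, bias_neurons={3, 7})  # hypothetical bias neurons
print(x_debiased.shape)
```

Because the edit happens on disentangled sparse features, query-relevant neurons pass through untouched, which is the claimed advantage over editing the dense space directly.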
cs.CV / 96 / 2603.19036
FUMO: Prior-Modulated Diffusion for Single Image Reflection Removal
FUMO:用于单幅图像反射去除的先验调制扩散
Abstract
Single image reflection removal (SIRR) is challenging in real scenes, where reflection strength varies spatially and reflection patterns are tightly entangled with transmission structures. This paper presents a diffusion model with a prior-modulation framework (FUMO) that introduces explicit guidance signals to improve spatial controllability and structural faithfulness. Two priors are extracted directly from the mixed image: an intensity prior that estimates spatial reflection severity and a high-frequency prior that captures detail-sensitive responses via multi-scale residual aggregation. We propose a coarse-to-fine training paradigm. In the first stage, these cues are combined to gate the conditional residual injections, focusing the conditioning on regions that are both reflection-dominant and structure-sensitive. In the second stage, a fine-grained refinement network corrects local misalignment and sharpens fine details in the image space. Experiments conducted on both standard benchmarks and challenging images in the wild demonstrate competitive quantitative results and consistently improved perceptual quality. The code is released at https://github.com/Lucious-Desmon/FUMO.
Chinese Translation
单幅图像反射去除(SIRR)在真实场景中具有挑战性,因为反射强度在空间上变化,反射模式与透射结构紧密交织。本文提出了一种具有先验调制框架的扩散模型(FUMO),该模型引入了显式引导信号,以提高空间可控性和结构真实性。从混合图像中直接提取了两个先验,一个是估计空间反射强度的强度先验,另一个是通过多尺度残差聚合捕捉细节敏感响应的高频先验。我们提出了一种粗到细的训练范式。在第一阶段,这些线索被结合以控制条件残差注入,重点关注既是反射主导又对结构敏感的区域。在第二阶段,细粒度的精炼网络纠正局部错位并在图像空间中锐化细节。在标准基准和真实场景中的挑战性图像上进行的实验表明,该方法在定量结果上具有竞争力,并且感知质量持续改善。代码已发布在 https://github.com/Lucious-Desmon/FUMO。
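The first-stage gating can be pictured as a simple elementwise product of the two priors applied to the conditional residual. This is a structural sketch only; the actual priors and injections operate on learned features inside the diffusion network.

```python
import numpy as np

def gated_injection(residual, intensity_prior, hf_prior):
    """Gate a conditional residual feature map by two spatial priors.

    residual:        (H, W, C) conditioning features.
    intensity_prior: (H, W) estimated reflection severity in [0, 1].
    hf_prior:        (H, W) detail-sensitive response in [0, 1].
    The gate is high only where a region is both reflection-dominant
    and structure-sensitive, focusing the conditioning there.
    """
    gate = intensity_prior * hf_prior
    return gate[..., None] * residual  # broadcast gate over channels

residual = np.ones((2, 2, 3))
intensity = np.array([[1.0, 0.0], [1.0, 0.0]])
hf = np.array([[1.0, 1.0], [0.0, 0.0]])
out = gated_injection(residual, intensity, hf)
print(out[:, :, 0])  # only the top-left pixel passes the gate
```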
cs.CV / 97 / 2603.19039
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
TerraScope:用于地球观测的像素基础视觉推理
Abstract
Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
Chinese Translation
视觉-语言模型(VLMs)在地球观测(EO)中展现出了良好的潜力,但在需要将复杂空间推理与精确的像素级视觉表示相结合的任务上仍存在困难。为了解决这个问题,我们提出了TerraScope,一个统一的VLM,提供像素基础的地理空间推理,具备两个关键能力:(1) 模态灵活推理:它可以处理单一模态输入(光学或合成孔径雷达(SAR)),并在同时可用时自适应融合不同模态进入推理过程;(2) 多时间态推理:它整合时间序列以进行多时间点的变化分析。此外,我们整理了Terra-CoT,这是一个包含100万样本的大规模数据集,其推理链中嵌入了像素级的掩模信息,来源于多个渠道。我们还提出了TerraScope-Bench,这是首个用于像素基础地理空间推理的基准,包含六个子任务,以评估答案的准确性和掩模的质量,确保真实的像素基础推理。实验表明,TerraScope在像素基础地理空间推理上显著优于现有的VLM,并提供可解释的视觉证据。
cs.CV / 98 / 2603.19048
Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos
测量动态生成视频中的三维空间几何一致性
Abstract
Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D \textbf{S}patial \textbf{G}eometric \textbf{C}onsistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each subregion, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generative videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
Chinese Translation
近期的生成模型能够生成高保真视频,但它们往往表现出三维空间几何不一致性。现有的评估方法无法准确表征这些不一致性:以保真度为中心的指标如 FVD 对几何失真不敏感,而以一致性为重点的基准常常会惩罚有效的前景动态。为了解决这一问题,我们提出了 SGC(Spatial Geometric Consistency),一个用于评估动态生成视频中三维空间几何一致性的指标。我们通过测量从不同局部区域估计的多个相机姿态之间的发散来量化几何一致性。我们的方法首先将静态区域与动态区域分开,然后将静态背景划分为空间一致的子区域。我们为每个像素预测深度,为每个子区域估计局部相机姿态,并计算这些姿态之间的发散以量化几何一致性。在真实视频和生成视频上的实验表明,SGC 能够稳健地量化几何不一致性,有效识别现有指标所遗漏的关键失败。
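The divergence among sub-region camera poses can be quantified, for the rotational part, as the mean pairwise geodesic angle. A sketch under the assumption that poses are given as 3x3 rotation matrices (a full metric would also compare translations):

```python
import numpy as np

def rotation_divergence(rotations):
    """Mean pairwise geodesic angle (radians) among 3x3 rotation matrices."""
    Rs = [np.asarray(R, dtype=float) for R in rotations]
    angles = []
    for i in range(len(Rs)):
        for j in range(i + 1, len(Rs)):
            rel = Rs[i].T @ Rs[j]                               # relative rotation
            cos = np.clip((np.trace(rel) - 1.0) / 2.0, -1.0, 1.0)
            angles.append(np.arccos(cos))                       # geodesic angle
    return float(np.mean(angles)) if angles else 0.0

I = np.eye(3)
t = np.deg2rad(10.0)  # one sub-region disagrees by a 10-degree yaw
Rz = np.array([[np.cos(t), -np.sin(t), 0.0],
               [np.sin(t),  np.cos(t), 0.0],
               [0.0,        0.0,       1.0]])
print(rotation_divergence([I, I, Rz]))
```

For a geometrically consistent video all sub-region poses agree and the divergence is near zero; generation artifacts push it up.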
cs.CV / 99 / 2603.19053
SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
SwiftTailor:基于几何图像表示的高效3D服装生成
Abstract
Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using garment modeling frameworks such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.
Chinese Translation
现实且高效的3D服装生成仍然是计算机视觉和数字时尚领域的一项长期挑战。现有方法通常依赖大型视觉-语言模型生成2D缝制图案的序列化表示,然后通过服装建模框架(如GarmentCode)将其转换为适合仿真的3D网格。尽管这些方法产生了高质量的结果,但它们的推理时间通常较长,范围从30秒到1分钟。在本研究中,我们提出了SwiftTailor,一个新颖的两阶段框架,通过紧凑的几何图像表示统一了缝制图案推理和基于几何的网格合成。SwiftTailor包含两个轻量级模块:PatternMaker,一个高效的视觉-语言模型,从多种输入模态中预测缝制图案,以及GarmentSewer,一个高效的密集预测变换器,将这些图案转换为新颖的服装几何图像,编码所有服装面板在统一UV空间中的3D表面。最终的3D网格通过高效的逆映射过程重建,该过程结合了重网格化和动态缝合算法,直接组装服装,从而摊薄物理仿真的成本。在Multimodal GarmentCodeData上的大量实验表明,SwiftTailor在显著减少推理时间的同时,达到了最先进的准确性和视觉保真度。本研究为下一代3D服装生成提供了一种可扩展、可解释且高性能的解决方案。
cs.CV / 100 / 2603.19054
Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding
Em-Garde:一种用于主动流媒体视频理解的提议-匹配框架
Abstract
Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decisions, which suffer from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
Chinese Translation
近年来,流媒体视频理解的进展使得模型能够以主动的方式响应用户查询,形成了一种新的交互范式。目前的主动视频大语言模型(VideoLLMs)依赖于逐帧触发的决策过程,这导致了效率与准确性之间的困境。我们提出了Em-Garde,这是一种新颖的框架,将语义理解与流媒体感知解耦。在查询时,指令引导的提议解析器(Instruction-Guided Proposal Parser)将用户查询转化为结构化的、感知基础的视觉提议;在流媒体过程中,一个轻量级提议匹配模块(Lightweight Proposal Matching Module)执行基于嵌入的高效匹配以触发响应。在StreamingBench和OVO-Bench上的实验表明,在主动响应的准确性和效率方面,相较于之前的模型有了一致的提升,验证了在严格计算约束下实现主动视频理解的有效解决方案。
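The streaming-side matching reduces to comparing each frame embedding against the precomputed proposal embeddings. A hedged sketch using cosine similarity and a hypothetical trigger threshold (the real module is a learned matcher, not a fixed rule):

```python
import numpy as np

def should_trigger(frame_emb, proposal_embs, threshold=0.8):
    """Trigger a response when any proposal embedding matches the frame."""
    f = frame_emb / np.linalg.norm(frame_emb)
    P = proposal_embs / np.linalg.norm(proposal_embs, axis=1, keepdims=True)
    sims = P @ f                                  # cosine similarity per proposal
    return bool(sims.max() >= threshold), int(sims.argmax())

frame = np.array([1.0, 0.0, 0.0])
proposals = np.array([[0.0, 1.0, 0.0],   # unrelated proposal
                      [0.9, 0.1, 0.0]])  # near-match proposal
fired, idx = should_trigger(frame, proposals)
print(fired, idx)
```

Because only this cheap matching runs per frame, the expensive semantic parsing is amortized to query time, which is the source of the efficiency gain.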
cs.CV / 101 / 2603.19059
SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation
SignAgent:用于语言基础手语注释和数据集策划的代理型大型语言模型
Abstract
This paper introduces SignAgent, a novel agentic framework that utilises Large Language Models (LLMs) for scalable, linguistically-grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for SLs often operate at the gloss level, overlooking crucial linguistic nuances, while manual linguistic annotation remains a significant bottleneck, proving too slow and expensive for the creation of large-scale, phonologically-aware datasets. SignAgent addresses these challenges through SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate our framework on two downstream annotation tasks. First, on Pseudo-gloss Annotation, where the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. Second, on ID Glossing, where the agent detects and refines visual clusters by reasoning over both visual similarity and phonological overlap to correctly identify and group lexical sign variants. Our results demonstrate that our agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.
Chinese Translation
本文介绍了SignAgent,一个新颖的代理框架,利用大型语言模型(LLMs)进行可扩展的、基于语言学的手语(SL)注释和数据集策划。传统的手语计算方法通常在词汇层面操作,忽视了重要的语言细微差别,而手动语言学注释仍然是一个显著的瓶颈,对于创建大规模、具备音系意识的数据集而言过于缓慢且成本高昂。SignAgent通过两个组件应对这些挑战:SignAgent Orchestrator,一个协调一套语言学工具的推理型LLM;以及SignGraph,一个提供词汇和语言学基础的知识型LLM。我们在两个下游注释任务上评估了我们的框架。首先,在伪词汇注释任务中,代理执行受限分配,利用多模态证据提取和排序适合的词汇标签以匹配手语序列。其次,在ID词汇注释任务中,代理通过推理视觉相似性和音系重叠来检测和细化视觉簇,以正确识别和分组词汇手语变体。我们的结果表明,我们的代理型方法在大规模、具备语言学意识的数据注释和策划方面取得了良好的表现。
cs.CV / 102 / 2603.19076
DROID-SLAM in the Wild
DROID-SLAM在实际环境中的应用
Abstract
We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at https://github.com/MoyangLi00/DROID-W.git.
Chinese Translation
我们提出了一种稳健的实时RGB SLAM系统,通过利用可微分的不确定性感知束调整来处理动态环境。传统的SLAM方法通常假设场景是静态的,这在存在运动时会导致跟踪失败。近期的动态SLAM方法试图通过使用预定义的动态先验或不确定性感知映射来解决这一挑战,但在面对未知的动态物体或高度杂乱的场景时,它们仍然存在局限性,因为几何映射变得不可靠。相比之下,我们的方法通过利用多视角视觉特征的不一致性来估计每个像素的不确定性,从而实现了在真实环境中稳健的跟踪和重建。所提出的系统在杂乱的动态场景中实现了最先进的相机位姿和场景几何,同时以约10帧每秒的速度实时运行。代码和数据集可在https://github.com/MoyangLi00/DROID-W.git获取。
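The per-pixel uncertainty idea — features that disagree across views after warping indicate dynamic or unreliable pixels — can be sketched as a cross-view variance. The warping into a common frame is assumed already done; this is an illustrative statistic, not the system's learned uncertainty estimate.

```python
import numpy as np

def pixelwise_uncertainty(features):
    """Per-pixel uncertainty from multi-view feature inconsistency.

    features: (n_views, H, W, C) feature maps warped into a common frame.
    Returns an (H, W) map: mean per-channel variance across views.
    """
    f = np.asarray(features, dtype=float)
    return f.var(axis=0).mean(axis=-1)

static = np.ones((4, 2, 2, 3))               # consistent across views
moving = static.copy()
moving[:, 0, 0, :] = np.arange(4)[:, None]   # one pixel disagrees across views
u = pixelwise_uncertainty(moving)
print(u)  # high only at the inconsistent pixel
```

Downweighting high-uncertainty pixels in bundle adjustment is what lets tracking survive dynamic objects without predefined motion priors.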
cs.CV / 103 / 2603.19077
Multi-Modal Building Change Detection for Large-Scale Small Changes: Benchmark and Baseline
大规模小变化的多模态建筑变化检测:基准与基线
Abstract
Change detection in optical remote sensing imagery is susceptible to illumination fluctuations, seasonal changes, and variations in surface land-cover materials. Relying solely on RGB imagery often produces pseudo-changes and leads to semantic ambiguity in features. Incorporating near-infrared (NIR) information provides heterogeneous physical cues that are complementary to visible light, thereby enhancing the discriminability of building materials and tiny structures while improving detection accuracy. However, existing multi-modal datasets generally lack high-resolution and accurately registered bi-temporal imagery, and current methods often fail to fully exploit the inherent heterogeneity between these modalities. To address these issues, we introduce the Large-scale Small-change Multi-modal Dataset (LSMD), a bi-temporal RGB-NIR building change detection benchmark dataset targeting small changes in realistic scenarios, providing a rigorous testing platform for evaluating multi-modal change detection methods in complex environments. Based on LSMD, we further propose the Multi-modal Spectral Complementarity Network (MSCNet) to achieve effective cross-modal feature fusion. MSCNet comprises three key components: the Neighborhood Context Enhancement Module (NCEM) to strengthen local spatial details, the Cross-modal Alignment and Interaction Module (CAIM) to enable deep interaction between RGB and NIR features, and the Saliency-aware Multisource Refinement Module (SMRM) to progressively refine fused features. Extensive experiments demonstrate that MSCNet effectively leverages multi-modal information and consistently outperforms existing methods under multiple input configurations, validating its efficacy for fine-grained building change detection. The source code will be made publicly available at: https://github.com/AeroVILab-AHU/LSMD
Chinese Translation
光学遥感影像中的变化检测容易受到光照波动、季节变化和地表覆盖材料变化的影响。单靠RGB影像常常会产生伪变化,并导致特征的语义模糊。引入近红外(NIR)信息提供了与可见光互补的异质物理线索,从而增强建筑材料和微小结构的可分辨性,同时提高检测精度。然而,目前现有的多模态数据集普遍缺乏高分辨率和准确配准的双时相影像,当前的方法通常未能充分利用这些模态之间固有的异质性。为了解决这些问题,我们推出了大规模小变化多模态数据集(Large-scale Small-change Multi-modal Dataset, LSMD),这是一个针对现实场景中小变化的双时相RGB-NIR建筑变化检测基准数据集,为在复杂环境中评估多模态变化检测方法提供了严格的测试平台。基于LSMD,我们进一步提出了多模态光谱互补网络(Multi-modal Spectral Complementarity Network, MSCNet),以实现有效的跨模态特征融合。MSCNet由三个关键组件组成:邻域上下文增强模块(Neighborhood Context Enhancement Module, NCEM)用于增强局部空间细节,跨模态对齐与交互模块(Cross-modal Alignment and Interaction Module, CAIM)用于实现RGB与NIR特征之间的深度交互,以及重视显著性的信息源精炼模块(Saliency-aware Multisource Refinement Module, SMRM)用于逐步精炼融合特征。大量实验表明,MSCNet有效利用多模态信息,在多种输入配置下始终优于现有方法,验证了其在细粒度建筑变化检测中的有效性。源代码将公开发布于:https://github.com/AeroVILab-AHU/LSMD
cs.CV / 104 / 2603.19092
SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
SAVeS:通过语义线索引导视觉语言模型中的安全判断
Abstract
Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.
Chinese Translation
视觉语言模型(VLMs)越来越多地应用于现实世界和具身环境中,在这些环境中,安全决策依赖于视觉上下文。然而,哪些视觉证据驱动这些判断仍不清楚。我们研究了是否可以通过简单的语义线索来引导VLMs中的多模态安全行为。我们提出了一种语义引导框架,该框架在不改变基础场景内容的情况下,应用受控的文本、视觉和认知干预。为了评估这些效果,我们提出了SAVeS,一个在语义线索下进行情境安全的基准,以及一个评估协议,区分行为拒绝、基于事实的安全推理和错误拒绝。针对多个VLMs和一个额外的最先进基准的实验表明,安全决策对语义线索高度敏感,表明其依赖于学习到的视觉-语言关联,而非基于事实的视觉理解。我们进一步证明,自动化引导管道可以利用这些机制,突显了多模态安全系统中的潜在脆弱性。
cs.CV / 105 / 2603.19098
TAU-R1: Visual Language Model for Traffic Anomaly Understanding
TAU-R1:用于交通异常理解的视觉语言模型
Abstract
Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1
Chinese Translation
交通异常理解(TAU)对于智能交通系统中的交通安全至关重要。近期的视觉语言模型(VLMs)在视频理解方面展现了强大的能力。然而,由于缺乏基准测试和特定任务的方法论,TAU的进展仍然有限。为了解决这一限制,我们引入了Roundabout-TAU,这是一个基于与印第安纳州卡梅尔市合作收集的真实世界环形交叉口视频构建的数据集。该数据集包含342个视频片段,并标注了超过2000对涵盖交通异常理解多个方面的问题-答案对。在此基准的基础上,我们提出了TAU-R1,一个用于TAU的两层视觉语言框架。第一层是一个轻量级异常分类器,执行粗略的异常分类,而第二层是一个更大的异常推理器,生成详细的事件摘要。为了提高特定任务的推理能力,我们引入了一种两阶段的训练策略,包括分解问答增强的监督微调,随后是TAU-GRPO,一种基于GRPO的后训练方法,具有特定于TAU的奖励函数。实验结果表明,TAU-R1在异常分类和推理任务上均表现出色,同时保持了部署效率。数据集和代码可在以下链接获取:https://github.com/siri-rouser/TAU-R1
cs.CV / 106 / 2603.19121
CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization
CustomTex:通过多参考定制实现高保真室内场景纹理
Abstract
The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instance cross-attention, to ensure semantic plausibility and ``reference-instance'' alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.
Chinese Translation
创建高保真、可定制的三维室内场景纹理仍然是一个重大挑战。虽然基于文本的方法提供了灵活性,但它们缺乏细粒度实例级控制的精确性,且通常生成的纹理质量不足,存在伪影和固化阴影。为克服这些局限性,我们提出了CustomTex,一个基于参考图像的实例级高保真场景纹理的新框架。CustomTex接收一个未纹理化的三维场景和一组指定每个对象实例所需外观的参考图像,并生成一个统一的高分辨率纹理图。我们方法的核心是一个双重蒸馏方法,它将语义控制与像素级增强分离。我们采用语义级蒸馏,配备实例交叉注意机制,以确保语义的合理性和“参考-实例”对齐,同时使用像素级蒸馏来强制高视觉保真度。两者在变分评分蒸馏(Variational Score Distillation, VSD)优化框架内统一。实验表明,与最先进的方法相比,CustomTex实现了与参考图像精确的实例级一致性,并生成了更清晰、伪影更少、固化阴影最小的纹理。我们的工作为高质量、可定制的三维场景外观编辑建立了更直接、更用户友好的路径。
cs.CV / 107 / 2603.19122
Revisiting Autoregressive Models for Generative Image Classification
重新审视自回归模型在生成图像分类中的应用
Abstract
Class-conditional generative models have emerged as accurate and robust classifiers, with diffusion models demonstrating clear advantages over other visual generative paradigms, including autoregressive (AR) models. In this work, we revisit visual AR-based generative classifiers and identify an important limitation of prior approaches: their reliance on a fixed token order, which imposes a restrictive inductive bias for image understanding. We observe that single-order predictions rely more on partial discriminative cues, while averaging over multiple token orders provides a more comprehensive signal. Based on this insight, we leverage recent any-order AR models to estimate order-marginalized predictions, unlocking the high classification potential of AR models. Our approach consistently outperforms diffusion-based classifiers across diverse image classification benchmarks, while being up to 25x more efficient. Compared to state-of-the-art self-supervised discriminative models, our method delivers competitive classification performance - a notable achievement for generative classifiers.
Chinese Translation
类别条件生成模型已成为准确且稳健的分类器,扩散模型在视觉生成范式中展现出相较于其他方法(包括自回归(AR)模型)的明显优势。在本研究中,我们重新审视基于视觉的自回归生成分类器,并识别出先前方法的一个重要局限性:它们依赖于固定的标记顺序,这对图像理解施加了限制性的归纳偏差。我们观察到,单一顺序的预测更多依赖于部分判别线索,而对多个标记顺序的平均则提供了更全面的信号。基于这一见解,我们利用最近的任意顺序自回归模型来估计顺序边际化预测,从而释放自回归模型的高分类潜力。我们的方法在多种图像分类基准测试中始终优于基于扩散的分类器,同时效率提高了多达25倍。与最先进的自监督判别模型相比,我们的方法在分类性能上表现出竞争力——这是生成分类器的一个显著成就。
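Order marginalization itself is simple to express: average the class-conditional log-likelihoods over several sampled token orders and take the argmax. The likelihood function below is a toy stand-in for an any-order AR model; in practice `log_lik_fn` would sum the model's conditional log-probabilities along the given order.

```python
import numpy as np

def order_marginalized_class(log_lik_fn, tokens, n_classes, n_orders, seed=0):
    """Classify by averaging conditional log-likelihoods over token orders.

    log_lik_fn(tokens, order, c) -> log p(tokens under `order` | class c).
    """
    rng = np.random.default_rng(seed)
    scores = np.zeros(n_classes)
    for _ in range(n_orders):
        order = rng.permutation(len(tokens))  # sample a token order
        scores += np.array([log_lik_fn(tokens, order, c)
                            for c in range(n_classes)])
    return int(np.argmax(scores / n_orders))

# Toy stand-in likelihood: class 1 always assigns higher log-probability.
toy = lambda tokens, order, c: -1.0 if c == 1 else -2.0
print(order_marginalized_class(toy, list(range(16)), n_classes=3, n_orders=4))
```

Averaging over orders is what removes the fixed-order inductive bias the abstract identifies, at the cost of `n_orders` likelihood evaluations.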
cs.CV / 108 / 2603.19137
GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning
GSMem:作为持久空间记忆的3D高斯点云,用于零样本具身探索与推理
Abstract
Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack \textit{post-hoc re-observability}. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose \textbf{GSMem}, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with \textit{Spatial Recollection}: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to ``hallucinate'' optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.
Chinese Translation
有效的具身探索要求智能体随着时间的推移积累和保留空间知识。然而,现有的场景表示方法,如离散场景图或静态视图快照,缺乏“事后可重观测性”。如果初始观察未能捕捉到目标,所导致的记忆遗漏往往是不可恢复的。为了解决这一问题,我们提出了GSMem,一个基于3D高斯点云(3D Gaussian Splatting, 3DGS)的零样本具身探索与推理框架。通过显式参数化连续几何体和密集外观,3DGS作为一种持久空间记忆,使智能体具备“空间回忆”的能力:从最佳的、之前未占用的视点渲染出逼真的新视图。为了实现这一点,GSMem采用了一种检索机制,同时利用并行的对象级场景图和语义级语言场。该互补设计能够稳健地定位目标区域,使智能体能够“幻觉”出高保真视觉-语言模型(Vision-Language Model, VLM)推理所需的最佳视图。此外,我们还引入了一种混合探索策略,将基于VLM的语义评分与基于3DGS的覆盖目标相结合,平衡任务感知探索与几何覆盖。在具身问答和终身导航的广泛实验中,证明了我们框架的稳健性和有效性。
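The hybrid exploration objective — a VLM-driven semantic score plus a geometric coverage term — can be sketched as a weighted sum over candidate viewpoints. The weight `lam`, the candidate names, and the scores are all hypothetical placeholders.

```python
def frontier_score(semantic_score, coverage_gain, lam=0.5):
    """Hybrid exploration objective: task-aware semantics + geometric coverage."""
    return semantic_score + lam * coverage_gain

# Candidate viewpoints -> (VLM semantic score, 3DGS coverage gain), both toy values.
candidates = {"kitchen": (0.9, 0.1), "hallway": (0.2, 0.8)}
best = max(candidates, key=lambda k: frontier_score(*candidates[k]))
print(best)
```

A small `lam` prioritizes task-relevant regions, while a large one reverts toward pure coverage-driven exploration.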
cs.CV / 109 / 2603.19157
ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation
ADAPT:基于注意力的自适应提示调度与稀有概念生成的正交补充插值
Abstract
Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose the ADAPT framework, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.
Chinese Translation
在文本到图像合成中生成稀有组合概念仍然是扩散模型面临的挑战,特别是对于训练数据中不常见的属性。近期的方法,如R2F,通过利用大语言模型(LLM)进行提示调度来应对这一挑战,但由于语言模型的随机性和迭代文本嵌入切换的次优引导,导致其固有的方差问题。为了解决这些问题,我们提出了ADAPT框架,这是一种无训练的框架,可以确定性地规划和语义对齐提示调度,为增强稀有概念的组合提供一致的指导。通过利用注意力评分和正交组件,ADAPT显著提升了在RareBench基准测试中稀有概念的组合生成,无需额外的训练或微调。通过全面的实验,我们证明ADAPT在RareBench中实现了优越的性能,并准确反映了稀有属性的语义信息,提供了对稀有组合生成的确定性和精确控制,而不牺牲视觉完整性。
cs.CV / 110 / 2603.19158
Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
面向目标忠实的扩散生成的自适应辅助提示混合
Abstract
Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) - a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.
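For context, a minimal sketch of the identity the abstract invokes, in generic diffusion notation (not the paper's own symbols; the closed-form coefficient is not reproduced here):

```latex
% Tweedie's identity for a Gaussian-noised sample
% x_t = x_0 + \sigma_t \epsilon, \epsilon \sim \mathcal{N}(0, I),
% gives the posterior mean of the clean image:
\mathbb{E}[x_0 \mid x_t] = x_t + \sigma_t^2 \,\nabla_{x_t} \log p_t(x_t)
% A per-step convex blend of anchor- and target-conditioned denoised
% estimates (illustrative; the paper derives \lambda_t in closed form):
\hat{x}_0^{(t)} = (1 - \lambda_t)\,\hat{x}_0^{\mathrm{target}}
                + \lambda_t\,\hat{x}_0^{\mathrm{anchor}}
```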
Chinese Translation
基于扩散的文本到图像(T2I)模型在生成照片真实感和语义丰富的图像方面取得了显著进展。然而,当目标概念位于训练分布的低密度区域时,这些模型往往会产生语义不一致或结构不匹配的结果。这一局限性源于文本-图像数据集的长尾特性,其中稀有概念或编辑指令的表现不足。为了解决这一问题,我们提出了自适应辅助提示混合(Adaptive Auxiliary Prompt Blending, AAPB)——一个统一框架,用于在低密度区域稳定扩散过程。AAPB利用辅助锚点提示为稀有概念生成提供语义支持,并为图像编辑提供结构支持,确保对目标提示的忠实引导。与之前的启发式提示交替方法不同,AAPB推导出一个封闭形式的自适应系数,在每个扩散步骤中最佳地平衡辅助锚点和目标提示之间的影响。基于Tweedie的恒等式,我们的公式提供了一个原则性且无训练的自适应提示混合框架,确保稳定且忠实于目标的生成。我们通过控制实验展示了自适应插值相较于固定插值的有效性,并在RareBench和FlowEdit数据集上实证显示出一致的改进,相较于之前的无训练基线,达到了更高的语义准确性和结构保真度。
cs.CV / 111 / 2603.19169
ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis
ARIADNE:一个用于可信冠状动脉造影分析的感知-推理协同框架
Abstract
Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
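As a concrete illustration of the Betti number constraint mentioned above: for a binary segmentation mask, the zeroth Betti number counts connected components, so a fragmented vessel tree yields a value greater than 1. A minimal sketch (illustrative only; the paper's preference-signal construction is not reproduced):

```python
from collections import deque

def betti0(mask):
    """Count connected components (Betti number beta_0) of a binary
    mask via 4-connected flood fill. A perfectly connected vessel
    tree has beta_0 == 1; fragmentation raises it. Illustrative
    sketch, not ARIADNE's actual constraint implementation."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                components += 1
                q = deque([(i, j)])
                seen[i][j] = True
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
    return components
```

A preference pair could then rank a prediction with beta_0 closer to the ground truth above a more fragmented one.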
Chinese Translation
传统的逐像素损失函数未能在冠状血管分割中强制执行拓扑约束,尽管像素级准确性高,但仍然产生了支离破碎的血管树。我们提出了ARIADNE,一个两阶段框架,将偏好对齐的感知与基于强化学习的诊断推理结合起来,以实现拓扑一致的狭窄检测。感知模块采用DPO(Direct Preference Optimization,直接偏好优化)来微调Sa2VA视觉-语言基础模型,使用Betti数约束作为偏好信号,使策略朝向几何上完整的血管结构,而不是逐像素重叠度量。推理模块将狭窄定位公式化为具有显式拒绝机制的马尔可夫决策过程,能够自主推迟模糊的解剖候选,如分叉和血管交叉,从覆盖最大化转向可靠性优化。在1,400个临床血管造影图像上,ARIADNE达到了0.838的最先进中心线Dice值,较几何基线减少了41%的假阳性。在多中心基准ARCADE和XCAD上的外部验证确认了其在不同采集协议下的泛化能力。这是DPO首次应用于医学成像中的拓扑对齐,表明基于偏好的学习在结构约束下减轻了拓扑违反,同时保持了介入心脏病学工作流程中的诊断敏感性。
cs.CV / 112 / 2603.19193
Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting
重建的重要性:通过3D高斯点云学习几何对齐的鸟瞰视图表示
Abstract
Bird's-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.
Chinese Translation
鸟瞰视图(BEV)感知是自动驾驶的基石,提供了一种统一的空间表示,将周围视图图像融合,以便于进行各种下游任务的推理,如语义分割、3D物体检测和运动预测。然而,大多数现有的BEV感知框架采用端到端的训练范式,其中图像特征直接转化为BEV空间,并仅通过下游任务的监督进行优化。这种表述将整个感知过程视为一个黑箱,往往缺乏明确的3D几何理解和可解释性,从而导致性能不佳。在本文中,我们认为明确的3D表示对于准确的BEV感知至关重要,并提出了Splat2BEV,一个基于高斯点云的BEV任务框架。Splat2BEV旨在学习既语义丰富又几何精确的BEV特征表示。我们首先预训练一个高斯生成器,该生成器能够从多视角输入中明确重建3D场景,从而生成几何对齐的特征表示。这些表示随后被投影到BEV空间,以作为下游任务的输入。在nuScenes和argoverse数据集上的大量实验表明,Splat2BEV达到了最先进的性能,并验证了将明确的3D重建纳入BEV感知的有效性。
cs.CV / 113 / 2603.19203
Tinted Frames: Question Framing Blinds Vision-Language Models
有色框架:问题框架使视觉-语言模型失去视力
Abstract
Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended ones, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and performance across framings.
Chinese Translation
视觉-语言模型(VLMs)已被证明存在盲点,通常在需要视觉推理的任务中未能充分利用其视觉输入。在本研究中,我们展示了VLMs的选择性盲点。它们根据语言框架调节对视觉输入的关注程度,即使在替代框架要求相同视觉推理的情况下也是如此。通过使用视觉注意力作为探针,我们量化了框架如何改变对图像的注意力的数量和分布。受限框架,如多项选择和是/否,导致对图像上下文的关注显著降低,减少了对任务相关区域的关注,并将注意力转向无信息的标记。我们进一步证明,这种注意力的错误分配是导致准确性下降和跨框架不一致的主要原因。基于这一机制洞察,我们提出了一种轻量级的提示调优方法,使用可学习的标记,鼓励在开放式设置中观察到的稳健、视觉基础的注意力模式,从而改善视觉基础并提高跨框架的性能。
cs.CV / 114 / 2603.19206
RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing
RPiAE:一种增强图像生成和编辑的表示驱动自编码器
Abstract
Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity due to frozen encoders, which in turn degrades editing quality, as well as overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose Representation-Pivoted AutoEncoder, a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge that compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.
Chinese Translation
扩散模型已成为图像生成和编辑的主流范式,潜在扩散模型将去噪过程转移到紧凑的潜在空间,以提高效率和可扩展性。最近的尝试利用预训练的视觉表示模型作为标记器先验,或将扩散特征与表示特征对齐,或直接重用表示编码器作为冻结标记器。尽管这些方法可以改善生成指标,但由于冻结编码器,它们通常在重建保真度方面受到限制,从而降低了编辑质量,同时过高维度的潜在表示使得扩散建模变得困难。为了解决这些局限性,我们提出了表示驱动自编码器(Representation-Pivoted AutoEncoder,RPiAE),这是一种基于表示的标记器,能够同时改善生成和编辑。我们引入了表示枢轴正则化(Representation-Pivot Regularization),这是一种训练策略,使得表示初始化的编码器能够在保留预训练表示空间的语义结构的同时进行重建微调,随后通过变分桥将潜在空间压缩为更紧凑的形式,以便更好地进行扩散建模。我们采用了一种目标解耦的阶段性训练策略,依次优化生成可行性和重建保真度目标。这些组件共同产生了一种保留强语义、忠实重建并生成具有降低扩散建模复杂性的潜在表示的标记器。实验表明,RPiAE在文本到图像生成和图像编辑方面优于其他视觉标记器,同时在基于表示的标记器中提供了最佳的重建保真度。
cs.CV / 115 / 2603.19209
Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
视觉语言模型是否需要视觉变换器?评估状态空间模型作为视觉编码器
Abstract
Large vision--language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
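For readers unfamiliar with the backbone family under evaluation, the core of an SSM layer is a linear recurrence over the token sequence. A generic sketch of that recurrence (illustrative only; modern backbones such as Mamba use input-dependent, discretized parameters):

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Minimal discrete linear state space model:
        x_k = A @ x_{k-1} + B @ u_k,   y_k = C @ x_k
    over an input sequence u of shape (L, d_in), with state dim d and
    output dim d_out. This is the generic recurrence underlying SSM
    vision backbones, not the evaluated models' exact parameterization."""
    d = A.shape[0]
    x = np.zeros(d)
    ys = []
    for u_k in u:
        x = A @ x + B @ u_k   # state update
        ys.append(C @ x)      # readout
    return np.stack(ys)
```

Because the recurrence is linear in the state, it can also be evaluated with parallel scans, which is what makes SSMs efficient on long token sequences (e.g., flattened image patches).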
Chinese Translation
大型视觉语言模型(VLMs)通常使用固定的视觉主干,其图像特征通过轻量级连接器映射到大型语言模型中。虽然基于变换器的编码器是标准的视觉主干,但我们探讨状态空间模型(SSM)视觉主干是否可以作为一种强有力的替代方案。我们在受控环境中系统地评估了SSM视觉主干在VLM中的表现。在匹配的ImageNet-1K初始化下,SSM主干在视觉问答(VQA)和视觉定位(grounding/localization)任务中都取得了最佳的整体性能。我们进一步对SSM和ViT系列主干进行检测或分割训练的适配,发现密集任务调优通常能提高各系列的性能;在这种适配后,SSM主干在较小的模型规模下仍然保持竞争力。我们还观察到:(i)更高的ImageNet准确率或更大的主干并不可靠地转化为更好的VLM性能,以及(ii)某些视觉主干在定位中表现不稳定。基于这些发现,我们提出了稳定化策略,以提高两类主干的鲁棒性,并强调SSM主干作为VLM中基于变换器的视觉编码器的强有力替代方案。
cs.CV / 116 / 2603.19216
DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising
DreamPartGen:通过协同潜在去噪实现语义基础的部件级3D生成
Abstract
Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.
Chinese Translation
理解和生成3D物体作为有意义部件的组合是人类感知和推理的基础。然而,大多数文本到3D的方法忽视了部件的语义和功能结构。尽管最近的部件感知方法引入了分解,但它们仍然主要关注几何形状,缺乏语义基础,未能建模部件如何与文本描述对齐或它们之间的关系。我们提出了DreamPartGen,一个用于语义基础、部件感知文本到3D生成的框架。DreamPartGen引入了双重部件潜变量(Duplex Part Latents, DPLs),共同建模每个部件的几何形状和外观,以及关系语义潜变量(Relational Semantic Latents, RSLs),捕捉源自语言的部件间依赖关系。同步的共同去噪过程强制执行几何和语义的一致性,从而实现连贯、可解释和与文本对齐的3D合成。在多个基准测试中,DreamPartGen在几何保真度和文本-形状对齐方面实现了最先进的性能。
cs.CV / 117 / 2603.19217
LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
LVOmniBench:开创性长音频-视频理解评估,面向全模态大型语言模型
Abstract
Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas the Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.
Chinese Translation
近年来,全模态大型语言模型(OmniLLMs)的进展显著提升了对音频和视频输入的理解能力。然而,目前的评估主要集中在10秒到5分钟的短音频和视频片段上,未能反映现实应用的需求,因为视频通常持续数十分钟。为了解决这一关键缺口,我们推出了LVOmniBench,这是一个专门为长格式音频和视频的跨模态理解设计的新基准。该数据集包含来自开放平台的高质量视频,具有丰富的音视频动态。通过严格的人工选择和标注,LVOmniBench包含275个视频,时长从10分钟到90分钟不等,以及1,014个问答(QA)对。LVOmniBench旨在严格评估OmniLLMs在多个领域的能力,包括长期记忆、时间定位、细粒度理解和多模态感知。我们的广泛评估显示,当前的OmniLLMs在处理扩展的音视频输入时面临重大挑战。开源模型的准确率通常低于35%,而Gemini 3 Pro的最高准确率约为65%。我们预期,这一数据集及我们的实证发现将激励进一步研究和开发先进模型,以解决长格式音视频上下文中的复杂跨模态理解问题。
cs.CV / 118 / 2603.19218
Rethinking Vector Field Learning for Generative Segmentation
重新思考生成分割的向量场学习
Abstract
Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.
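The Kronecker-sequence idea behind the category encoding can be sketched generically: the sequence frac(k * alpha), for a well-chosen irrational vector alpha, fills the unit cube with low discrepancy, yielding well-separated quasi-random class codes. A sketch using Roberts' R_d construction (an assumption for illustration; the paper's exact scheme may differ):

```python
import numpy as np

def kronecker_codes(n_classes, dim):
    """Quasi-random class codes from a d-dimensional Kronecker (Weyl)
    sequence: code_k = frac(k * alpha). Here alpha is built from the
    generalized golden ratio (Roberts' R_d construction), which gives
    a well-spread sequence; this is illustrative, not the paper's
    exact encoding."""
    # phi_d is the positive root of x**(d+1) = x + 1,
    # found by fixed-point iteration x <- (1 + x)**(1/(d+1)).
    phi = 1.0
    for _ in range(50):
        phi = (1.0 + phi) ** (1.0 / (dim + 1))
    alpha = np.array([(1.0 / phi) ** (j + 1) for j in range(dim)])
    k = np.arange(1, n_classes + 1)[:, None]
    return np.mod(k * alpha[None, :], 1.0)  # shape (n_classes, dim)
```

Unlike i.i.d. random codes, consecutive points of this sequence are deterministic and avoid clustering, which is the low-discrepancy property the abstract alludes to.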
Chinese Translation
驯服扩散模型用于生成分割引起了越来越多的关注。虽然现有方法主要集中在架构调整或训练启发式上,但对连续流匹配目标与离散感知任务之间内在不匹配的理解仍然有限。在本工作中,我们从向量场学习的角度重新审视扩散分割。我们识别出常用流匹配目标的两个关键限制:梯度消失和轨迹遍历,这导致收敛缓慢和类别间分离较差。为了解决这些问题,我们提出了一种原则性的向量场重塑策略,通过一个梯度截断(detached)的距离感知校正项来增强学习到的速度场。该校正引入了吸引和排斥的相互作用,在质心附近增强梯度幅度,同时保留了原始的扩散训练框架。此外,我们设计了一种受克罗内克序列启发的计算高效的准随机类别编码方案,该方案与端到端像素神经场框架无缝结合,用于像素级语义对齐。大量实验一致表明,相比于原始的流匹配方法,我们的方法显著提高了性能,显著缩小了生成分割与强判别模型之间的性能差距。
cs.CV / 119 / 2603.19219
DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding
DriveTok:用于统一多视图重建与理解的3D驾驶场景标记化
Abstract
With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.
Chinese Translation
随着视觉-语言-动作模型和世界模型在自动驾驶系统中的广泛应用,可扩展的图像标记化作为视觉模态的接口变得至关重要。然而,现有的大多数标记器是为单目和二维场景设计的,这在应用于高分辨率多视图驾驶场景时导致效率低下和视图间不一致。为了解决这个问题,我们提出了DriveTok,一种高效的3D驾驶场景标记器,用于统一的多视图重建与理解。DriveTok首先从视觉基础模型中获取语义丰富的视觉特征,然后通过3D可变形交叉注意力将其转换为场景标记。对于解码,我们采用多视图变换器从场景标记中重建多视图特征,并使用多个头来获取RGB、深度和语义重建。我们还在场景标记上直接添加了一个3D头,以进行3D语义占用预测,以增强空间意识。通过多个训练目标,DriveTok学习统一的场景标记,整合语义、几何和纹理信息,以实现高效的多视图标记化。在广泛使用的nuScenes数据集上进行的大量实验表明,DriveTok生成的场景标记在图像重建、语义分割、深度预测和3D占用预测任务中表现良好。
cs.CV / 120 / 2603.19222
Spectrally-Guided Diffusion Noise Schedules
光谱引导的扩散噪音调度
Abstract
Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image's spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design ``tight'' noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.
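As an illustration of the kind of per-instance spectral statistic such a schedule could be driven by, a radially averaged power spectrum can be computed with a 2D FFT (illustrative only; the paper's theoretical bounds on minimum and maximum noise levels are not reproduced here):

```python
import numpy as np

def radial_power_spectrum(img, n_bins=8):
    """Radially averaged power spectrum of a grayscale image: the
    per-instance spectral summary that could inform which noise
    levels are informative for this image. Illustrative sketch, not
    the paper's schedule-design procedure."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)   # distance from the DC bin
    r_norm = r / r.max()
    bins = np.minimum((r_norm * n_bins).astype(int), n_bins - 1)
    return np.array([power[bins == b].mean() for b in range(n_bins)])
```

Intuitively, an image whose power is concentrated at low frequencies is destroyed by noise earlier, so fewer high-noise steps carry useful signal for it.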
Chinese Translation
去噪扩散模型广泛用于高质量的图像和视频生成。它们的性能依赖于噪声调度,后者定义了在训练过程中施加的噪声水平的分布以及在采样期间遍历的噪声水平序列。噪声调度通常是手工设计的,并需要在不同分辨率下进行手动调整。在这项工作中,我们提出了一种基于图像频谱特性的每实例噪声调度的原则性设计方法。通过推导最低和最高噪声水平有效性的理论界限,我们设计了“紧致”的噪声调度,以消除冗余步骤。在推理过程中,我们提出有条件地采样此类噪声调度的方案。实验表明,我们的噪声调度提高了单阶段像素扩散模型的生成质量,特别是在低步数范围内。
cs.CV / 121 / 2603.19224
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
EffectErase:高质量效果消除的联合视频对象移除与插入
Abstract
Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. An insertion-removal consistency objective then encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.
Chinese Translation
视频对象移除旨在消除动态目标对象及其视觉效果,如变形、阴影和反射,同时恢复无缝背景。近期基于扩散的视频修复和对象移除方法能够去除对象,但往往难以消除这些效果并合成连贯的背景。除了方法的局限性,缺乏一个全面的数据集来系统性地捕捉不同环境中常见对象效果以进行训练和评估,进一步阻碍了进展。为了解决这一问题,我们引入了VOR(视频对象移除),这是一个大规模数据集,提供多样化的配对视频,每对视频由一个包含目标对象及其效果的视频和一个不包含对象及效果的对应视频组成,并附有相应的对象掩码。VOR包含来自捕获和合成源的60K高质量视频对,涵盖五种效果类型,涉及广泛的对象类别以及复杂的动态多对象场景。在VOR的基础上,我们提出了EffectErase,这是一种效果感知的视频对象移除方法,将视频对象插入视为互惠学习方案中的逆辅助任务。该模型包括任务感知的区域引导,专注于受影响区域的学习,并支持灵活的任务切换。然后,设定插入-移除一致性目标,鼓励互补行为和效果区域及结构线索的共享定位。在VOR上训练的EffectErase在广泛的实验中表现出色,能够在多样化场景中实现高质量的视频对象效果消除。
cs.CV / 122 / 2603.19226
Under One Sun: Multi-Object Generative Perception of Materials and Illumination
同在一片阳光下:材料与光照的多物体生成感知
Abstract
We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents -- reflectance, texture, and illumination -- underlying object appearance from a single image. Our key idea to solve this inherently ambiguous radiometric disentanglement is to leverage the fact that while their texture and reflectance may differ, objects in the same scene are all lit by the same illumination. MultiGP exploits this consensus to produce samples of reflectance, texture, and illumination from a single image of known shapes based on four key technical contributions: a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; Coordinated Guidance for diffusion convergence to a single consistent illumination estimate; Axial Attention applied to facilitate ``cross-talk'' between objects of different reflectance; and a Texture Extraction ControlNet to preserve high-frequency texture details while ensuring decoupling from estimated lighting. Experimental results demonstrate that MultiGP effectively leverages the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as the common illumination.
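The axial attention used for cross-object "cross-talk" can be sketched generically: full self-attention is applied along one spatial axis at a time, reducing cost from O((HW)^2) to O(HW(H+W)). A minimal single-head sketch without learned projections (illustrative of the mechanism, not the paper's exact module):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(x):
    """Single-head axial self-attention over an (H, W, C) feature map:
    attend within each row, then within each column, so every position
    can exchange information. Q/K/V projections are omitted for
    brevity; this is a sketch, not MultiGP's actual layer."""
    h, w, c = x.shape
    scale = 1.0 / np.sqrt(c)
    # attend along the width axis (within each row)
    attn_w = softmax(np.einsum('hic,hjc->hij', x, x) * scale)
    x = np.einsum('hij,hjc->hic', attn_w, x)
    # attend along the height axis (within each column)
    attn_h = softmax(np.einsum('iwc,jwc->wij', x, x) * scale)
    x = np.einsum('wij,jwc->iwc', attn_h, x)
    return x
```

Two axial passes give every position a path to every other position, which is how objects with different reflectance can still inform one another's illumination estimate.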
Chinese Translation
我们引入了多物体生成感知(MultiGP),这是一种生成逆渲染方法,用于从单张图像中随机采样构成物体外观的所有辐射成分——反射率、纹理和光照。我们解决这一固有模糊的辐射解耦问题的关键思路是利用这样一个事实:尽管它们的纹理和反射率可能不同,但同一场景中的物体均由相同的光照照亮。MultiGP 利用这一共识,从已知形状的单张图像中生成反射率、纹理和光照的样本,基于四个关键技术贡献:一个级联的端到端架构,结合了图像空间和角空间的解耦;用于扩散收敛到单一一致光照估计的协调引导;应用于促进不同反射率物体间“交互”的轴向注意力;以及一个纹理提取控制网络,以在确保与估计光照解耦的同时保留高频纹理细节。实验结果表明,MultiGP 有效利用多物体外观的互补空间和频率特性,恢复个别的纹理和反射率以及共同的光照。
cs.CV / 123 / 2603.19227
Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
基于扩散的离散运动标记器:桥接语义与运动学条件
Abstract
Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.
Chinese Translation
以往的运动生成主要遵循两种范式:连续扩散模型在运动学控制方面表现出色,而基于离散标记的生成器在语义条件下效果显著。为了结合这两者的优势,我们提出了一个三阶段框架,包括条件特征提取(感知)、离散标记生成(规划)和基于扩散的运动合成(控制)。该框架的核心是MoTok,一个基于扩散的离散运动标记器,通过将运动恢复委托给扩散解码器,将语义抽象与细粒度重建解耦,从而实现紧凑的单层标记,同时保持运动的保真度。对于运动学条件,粗略约束在规划阶段指导标记生成,而细粒度约束则通过基于扩散的优化在控制阶段得到强化。这一设计防止了运动学细节干扰语义标记规划。在HumanML3D数据集上,我们的方法显著提高了可控性和保真度,相比于MaskControl仅使用六分之一的标记,将轨迹误差从0.72厘米降低到0.08厘米,FID从0.083降低到0.029。与以往在更强运动学约束下表现下降的方法不同,我们的方法提高了保真度,将FID从0.033降低到0.014。
cs.CV / 124 / 2603.19228
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
SAMA:用于指令引导的视频编辑的因子化语义锚定与运动对齐
Abstract
Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
Chinese Translation
当前的指令引导视频编辑模型难以同时兼顾精确的语义修改与忠实的运动保留。虽然现有方法依赖于注入显式的外部先验(例如,VLM特征或结构条件)来缓解这些问题,但这种依赖严重制约了模型的鲁棒性和泛化能力。为了解决这一限制,我们提出了SAMA(因子化语义锚定与运动对齐),一个将视频编辑分解为语义锚定和运动建模的框架。首先,我们引入了语义锚定,通过在稀疏锚定帧上共同预测语义标记和视频潜变量,建立可靠的视觉锚点,从而实现纯粹的指令感知结构规划。其次,运动对齐在以运动为中心的视频恢复预训练任务(立方体修复、速度扰动和管道洗牌)上对相同的主干网络进行预训练,使模型能够直接从原始视频中内化时间动态。SAMA通过一个两阶段的管道进行优化:一个因子化的预训练阶段,在没有配对视频-指令编辑数据的情况下学习固有的语义-运动表示,随后在配对编辑数据上进行监督微调。值得注意的是,仅因子化预训练就已经展现出强大的零样本视频编辑能力,验证了所提出的因子化方法。SAMA在开源模型中实现了最先进的性能,并且与领先的商业系统(例如Kling-Omni)相比具有竞争力。代码、模型和数据集将被发布。
cs.CV / 125 / 2603.19231
MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction
MonoArt:基于渐进结构推理的单目关节三维重建
Abstract
Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
Chinese Translation
从单张图像重建关节3D物体,需要联合推断物体几何形状、部件结构及运动参数,且仅依赖有限的视觉信息。一个关键难点在于运动线索与物体结构的纠缠,使得直接进行关节参数回归变得不稳定。现有方法通过多视角监督、基于检索的组装或辅助视频生成来解决该问题,但常以牺牲可扩展性或效率为代价。本文提出MonoArt,一个基于渐进结构推理的统一框架。MonoArt不直接从图像特征预测关节参数,而是在单一架构内,将视觉观测逐步转化为规范几何形态、结构化部件表示以及具备运动感知的嵌入。这一结构化推理过程使得关节推断更稳定且具有可解释性,无需外部运动模板或多阶段流水线。在PartNet-Mobility数据集上的大量实验证明,MonoArt在重建精度和推断速度方面均实现了最先进的性能。该框架还能够推广应用于机器人操作和关节场景重建。
cs.CV / 126 / 2603.19232
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
立方离散扩散:高维表示标记上的离散视觉生成
Abstract
Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation -- any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
Chinese Translation
使用离散标记进行视觉生成受到了显著关注,因为它使得与语言模型共享统一的标记预测范式成为可能,从而承诺实现无缝的多模态架构。然而,目前的离散生成方法仍然局限于低维潜在标记(通常为8-32维),牺牲了理解所需的语义丰富性。虽然高维预训练表示(768-1024维)可以弥补这一差距,但其离散生成面临着根本性挑战。在本文中,我们提出了立方离散扩散(Cubic Discrete Diffusion, CubiD),这是第一个针对高维表示的离散生成模型。CubiD 在高维离散表示中执行细粒度的掩码 -- 任何位置的任何维度都可以被掩盖,并通过部分观察进行预测。这使得模型能够学习到空间位置内和跨位置之间的丰富相关性,且无论特征维度如何,生成步骤的数量均固定为 $T$,其中 $T \ll hwd$。在 ImageNet-256 数据集上,CubiD 实现了最先进的离散生成,参数规模从 9 亿到 37 亿表现出了强劲的扩展性。重要的是,我们验证了这些离散标记保留了原始表示能力,表明相同的离散标记可以有效地同时用于理解和生成任务。我们希望这项工作能够激励未来朝向统一多模态架构的研究。代码可在以下网址获取:https://github.com/YuqingWang1029/CubiD。
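The fixed-step, fine-grained masking described in the abstract can be pictured with a toy schedule: every (position, dimension) entry of an h x w x d grid of discrete codes is assigned to one of T generation steps, so T stays fixed no matter how large d grows. This is only our illustrative sketch of the stated mechanism; the function name and the uniform split across steps are assumptions, not the authors' implementation.

```python
import numpy as np

def cubic_mask_schedule(h, w, d, T, rng):
    """Assign every (position, dimension) entry of an h x w x d grid of
    discrete codes to one of T generation steps: any dimension at any
    position can be masked and later predicted, yet the number of steps
    stays fixed at T regardless of feature dimensionality (T << h*w*d)."""
    order = rng.permutation(h * w * d)   # random fine-grained unmask order
    groups = np.array_split(order, T)    # T roughly equal unmasking groups
    schedule = np.empty(h * w * d, dtype=int)
    for t, idx in enumerate(groups):
        schedule[idx] = t                # these entries are revealed at step t
    return schedule.reshape(h, w, d)

rng = np.random.default_rng(0)
sched = cubic_mask_schedule(4, 4, 8, T=16, rng=rng)   # 128 entries, 16 steps
```

At generation time a model would predict the step-t entries from all entries revealed at steps < t; the schedule above only fixes which entry is revealed when.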
cs.CV / 127 / 2603.19234
Matryoshka Gaussian Splatting
套娃高斯喷溅
Abstract
The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Splatting (3DGS). Existing discrete LoD methods expose only a limited set of operating points, while concurrent continuous LoD approaches enable smoother scaling but often suffer noticeable quality degradation at full capacity, making LoD a costly design decision. We introduce Matryoshka Gaussian Splatting (MGS), a training framework that enables continuous LoD for standard 3DGS pipelines without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians such that rendering any prefix, the first k splats, produces a coherent reconstruction whose fidelity improves smoothly with increasing budget. Our key idea is stochastic budget training: each iteration samples a random splat budget and optimises both the corresponding prefix and the full set. This strategy requires only two forward passes and introduces no architectural modifications. Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while enabling a continuous speed-quality trade-off from a single model. Extensive ablations on ordering strategies, training objectives, and model capacity further validate the designs.
Chinese Translation
从单一模型以可调细节水平渲染场景的能力被称为细节层次(Level of Detail, LoD),这对3D高斯喷溅(3D Gaussian Splatting, 3DGS)的实际应用至关重要。现有的离散LoD方法仅提供有限的操作点,而同期的连续LoD方法虽然能实现更平滑的缩放,但通常在全容量下会出现明显的质量下降,这使得LoD成为一个成本高昂的设计决策。我们提出了套娃高斯喷溅(Matryoshka Gaussian Splatting, MGS),这是一个训练框架,使得标准3DGS管道能够实现连续LoD而不牺牲全容量渲染质量。MGS学习一组有序的高斯分布,使得渲染任何前缀(即前k个喷溅)都能生成一致的重建,且其保真度随着预算的增加而平滑提高。我们的关键思路是随机预算训练:每次迭代随机抽取一个喷溅预算,并同时优化相应的前缀和完整集合。这一策略只需两次前向传播,并且不引入架构上的修改。在四个基准和六个基线上的实验表明,MGS在保持其骨干网络全容量性能的同时,实现了从单一模型出发的连续速度-质量权衡。对排序策略、训练目标和模型容量的广泛消融实验进一步验证了我们的设计。
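The stochastic budget training described above can be sketched in a few lines: sample a random prefix budget each iteration and optimise both the prefix rendering and the full set. The renderer and loss below are toy stand-ins, and all names are ours, not the paper's code.

```python
import random

def stochastic_budget_step(splats, render, loss, target, rng):
    """One step of stochastic budget training: sample a random splat
    budget k, then optimise both the k-splat prefix and the full
    ordered set -- two forward passes, no architectural changes."""
    n = len(splats)
    k = rng.randint(1, n)                           # random prefix budget
    prefix_loss = loss(render(splats[:k]), target)  # forward pass 1: prefix
    full_loss = loss(render(splats), target)        # forward pass 2: full set
    return prefix_loss + full_loss, k

# Toy stand-ins: "rendering" sums splat contributions, loss is L1.
splats = [0.5, 0.3, 0.15, 0.05]   # ordered coarse-to-fine contributions
rng = random.Random(0)
total, k = stochastic_budget_step(splats, sum, lambda p, t: abs(p - t), 1.0, rng)
```

Because any prefix is optimised against the same target as the full set, early splats are pushed to carry a coherent coarse reconstruction on their own, which is what makes prefix rendering a continuous LoD control.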
cs.CV / 128 / 2603.19235
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
生成模型了解空间:释放隐式3D先验以实现场景理解
Abstract
While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
Chinese Translation
尽管多模态大型语言模型展现了令人印象深刻的语义能力,但它们常常存在空间盲目性,难以进行细粒度的几何推理和物理动态分析。现有解决方案通常依赖于显式的3D模态或复杂的几何支架,这些方法受到数据稀缺和泛化挑战的限制。在本研究中,我们提出了一种范式转变,通过利用大规模视频生成模型中的隐式空间先验。我们认为,为了合成时间上连贯的视频,这些模型本质上学习了稳健的3D结构先验和物理法则。我们引入了VEGA-3D(视频提取生成意识),这是一个即插即用的框架,重新利用预训练的视频扩散模型作为潜在世界模拟器。通过从中间噪声水平提取时空特征,并通过基于标记的自适应门控融合机制将其与语义表示相结合,我们在没有显式3D监督的情况下丰富了多模态大型语言模型(MLLMs)的密集几何线索。在3D场景理解、空间推理和具身操控基准测试中的广泛实验表明,我们的方法超越了最先进的基线,验证了生成先验为物理世界理解提供了可扩展的基础。代码已公开发布在 https://github.com/H-EmbodVis/VEGA-3D。
cs.AI / 1 / 2603.18048
DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models
DEAF:音频语言模型声学忠实性诊断评估的基准测试
Abstract
Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely process acoustic signals or rely on text-based semantic inference. To systematically study this question, we introduce DEAF (Diagnostic Evaluation of Acoustic Faithfulness), a benchmark of over 2,700 conflict stimuli spanning three acoustic dimensions: emotional prosody, background sounds, and speaker identity. Then, we design a controlled multi-level evaluation framework that progressively increases textual influence, ranging from semantic conflicts in the content to misleading prompts and their combination, allowing us to disentangle content-driven bias from prompt-induced sycophancy. We further introduce diagnostic metrics to quantify model reliance on textual cues over acoustic signals. Our evaluation of seven Audio MLLMs reveals a consistent pattern of text dominance: models are sensitive to acoustic variations, yet predictions are predominantly driven by textual inputs, revealing a gap between high performance on standard speech benchmarks and genuine acoustic understanding.
Chinese Translation
最近的音频多模态大语言模型(Audio MLLMs)在语音基准测试中表现出色,但尚不清楚这些模型是否真正处理声学信号,还是依赖基于文本的语义推理。为了系统地研究这个问题,我们引入了DEAF(声学忠实性诊断评估),这是一个包含超过2700个冲突刺激的基准,涵盖三个声学维度:情感韵律、背景声音和说话者身份。随后,我们设计了一个受控的多层次评估框架,逐步增加文本的影响,从内容中的语义冲突到误导性提示及其组合,使我们能够将内容驱动的偏见与提示引发的迎合区分开来。我们进一步引入诊断指标,量化模型相对于声学信号对文本线索的依赖程度。对七个音频多模态大语言模型的评估揭示了一种一致的文本主导模式:模型对声学变化敏感,但预测主要受到文本输入的驱动,显示出在标准语音基准测试中的高性能与真正的声学理解之间的差距。
cs.AI / 2 / 2603.18073
Continually self-improving AI
持续自我改进的人工智能
Abstract
Modern language model-based AI systems are remarkably powerful, yet their capabilities remain fundamentally capped by their human creators in three key ways. First, although a model's weights can be updated via fine-tuning, acquiring new knowledge from small, specialized corpora after pretraining remains highly data-inefficient. Second, the training of these systems relies heavily on finite, human-generated data from across history. Third, the pipelines used to train AI models are confined by the algorithms that human researchers can discover and explore. This thesis takes a small step toward overcoming these inherent limitations, presenting three chapters aimed at breaking these dependencies to create continually self-improving AI. First, to overcome this data-efficiency barrier in knowledge acquisition, we propose a synthetic data approach that diversifies and amplifies small corpora into rich knowledge representations, enabling a model to effectively update its parameters from limited source material. Second, to reduce reliance on human data, we show that given a fixed amount of such data, the model can self-generate synthetic data to bootstrap its fundamental pretraining capabilities without distillation from any off-the-shelf, instruction-tuned LM. Finally, to transcend human-engineered training paradigms, we demonstrate that by scaling search during test time over the space of algorithms, AI can search over a larger space of learning algorithm configurations than human researchers can explore manually.
Chinese Translation
现代基于语言模型的人工智能系统极为强大,但其能力在三个关键方面仍然受到其人类创造者的根本限制。首先,尽管模型的权重可以通过微调进行更新,但在预训练后从小型专业语料库获取新知识仍然非常低效。其次,这些系统的训练在很大程度上依赖于历史上有限的人类生成数据。第三,用于训练人工智能模型的流程受到人类研究人员能够发现和探索的算法的限制。本文朝着克服这些固有限制迈出了一小步,提出三个章节,旨在打破这些依赖关系,以创造持续自我改进的人工智能。首先,为了克服知识获取中的数据效率障碍,我们提出了一种合成数据方法,该方法将小型语料库多样化并放大为丰富的知识表示,使模型能够有效地从有限的源材料中更新其参数。其次,为了减少对人类数据的依赖,我们展示了在给定固定数量的人类数据的情况下,模型可以自我生成合成数据,以自举其基本的预训练能力,而无需从任何现成的、经过指令调优的语言模型中蒸馏。最后,为了超越人类设计的训练范式,我们证明了通过在测试时扩展对算法空间的搜索,人工智能可以在比人类研究人员手动探索更大的学习算法配置空间中进行搜索。
cs.AI / 3 / 2603.18085
Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction
多特征子空间引导揭示人机交互的阴暗面
Abstract
Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As LLMs serve as sources of guidance, emotional support, and even informal therapy, these risks are poised to escalate. However, studying the mechanisms underlying harmful human-AI interactions presents significant methodological challenges: organic harmful interactions typically develop over sustained engagement, requiring extensive conversational context that is difficult to simulate in controlled settings. To address this gap, we developed a Multi-Trait Subspace Steering (MultiTraitsss) framework that leverages established crisis-associated traits and a novel subspace steering framework to generate Dark models that exhibit cumulative harmful behavioral patterns. Single-turn and multi-turn evaluations show that our Dark models consistently produce harmful interactions and outcomes. Using our Dark models, we propose protective measures to reduce harmful outcomes in human-AI interactions.
Chinese Translation
近期事件突显了人机交互导致负面心理结果的令人担忧的案例,包括心理健康危机甚至用户伤害。随着大型语言模型(LLMs)作为指导、情感支持甚至非正式治疗的来源,这些风险有可能加剧。然而,研究有害人机交互背后的机制面临重大方法论挑战,因为有机的有害交互通常是在持续互动中发展而来的,这需要广泛的对话背景,而这些在受控环境中难以模拟。为了解决这一空白,我们开发了一个多特征子空间引导(MultiTraitsss)框架,该框架利用已建立的危机相关特征和新颖的子空间引导框架生成表现出累积有害行为模式的阴暗模型。单轮和多轮评估表明,我们的阴暗模型持续产生有害的交互和结果。利用我们的阴暗模型,我们提出了减少人机交互中有害结果的保护措施。
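Subspace steering of the kind this abstract describes is commonly implemented by shifting hidden activations along learned trait directions at inference time. The snippet below is a generic minimal sketch of that family of interventions, not the paper's actual framework; the function name, unit-normalisation, and toy directions are our assumptions.

```python
import numpy as np

def steer_activation(h, trait_dirs, alphas):
    """Shift a hidden activation h along one learned direction per trait.
    Rows of trait_dirs span the trait subspace; alphas sets the per-trait
    steering strength (positive values amplify the associated trait)."""
    # Normalise each direction so alphas are comparable across traits.
    dirs = trait_dirs / np.linalg.norm(trait_dirs, axis=1, keepdims=True)
    return h + np.asarray(alphas, dtype=float) @ dirs

h = np.zeros(4)                          # toy hidden state
dirs = np.array([[2.0, 0.0, 0.0, 0.0],   # direction for trait 1
                 [0.0, 0.0, 1.0, 0.0]])  # direction for trait 2
steered = steer_activation(h, dirs, alphas=[3.0, 0.0])
```

In a real model the same shift would be applied to a chosen layer's residual-stream activations on every forward pass, which is what lets harmful patterns accumulate over multi-turn interaction.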
cs.AI / 4 / 2603.18104
Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI
自适应领域模型:贝叶斯演化、温暖旋转与几何和神经形态人工智能的原则性训练
Abstract
Prevailing AI training infrastructure assumes reverse-mode automatic differentiation over IEEE-754 arithmetic. The memory overhead of training relative to inference, optimizer complexity, and structural degradation of geometric properties through training are consequences of this arithmetic substrate. This paper develops an alternative training architecture grounded in three prior results: the Dimensional Type System and Deterministic Memory Management framework [6], which establishes stack-eligible gradient allocation and exact quire accumulation as design-time verifiable properties; the Program Hypergraph [8], which establishes grade preservation through geometric algebra computations as a type-level invariant; and the b-posit 2026 standard [10], which makes posit arithmetic tractable across hardware targets conventionally considered inference-only. Their composition enables depth-independent training memory bounded to approximately twice the inference footprint, grade-preserving weight updates, and exact gradient accumulation, applicable uniformly to loss-function-optimized and spike-timing-dependent neuromorphic models. We introduce Bayesian distillation, a mechanism by which the latent prior structure of a general-purpose model is extracted through the ADM training regime, resolving the data-scarcity bootstrapping problem for domain-specific training. For deployment, we introduce warm rotation, an operational pattern in which an updated model transitions into an active inference pathway without service interruption, with structural correctness formalized through PHG certificates and signed version records. The result is a class of domain-specific AI systems that are smaller and more precise than general-purpose models, continuously adaptive, verifiably correct with respect to the physical structure of their domains, and initializable from existing models.
Chinese Translation
当前的人工智能训练基础设施假设在IEEE-754算术上进行反向模式自动微分。训练相对于推理的内存开销、优化器复杂性以及通过训练导致的几何属性结构退化,都是这种算术基础的结果。本文开发了一种替代的训练架构,基于三个先前的结果:维度类型系统和确定性内存管理框架[6],该框架确立了堆栈可分配的梯度分配和精确的累加器积累作为设计时可验证的属性;程序超图[8],该框架通过几何代数计算确立了等级保持作为类型级不变性;以及b-posit 2026标准[10],使得在传统上被认为仅用于推理的硬件目标上,posit算术变得可处理。它们的组合使得深度无关的训练内存限制在大约两倍于推理占用的范围内,保持等级的权重更新,以及精确的梯度累积,均可统一应用于损失函数优化和脉冲时序依赖的神经形态模型。我们引入了贝叶斯蒸馏,这是一种通过ADM训练机制提取通用模型潜在先验结构的机制,解决了领域特定训练中的数据稀缺自举问题。对于部署,我们引入了温暖旋转,这是一种操作模式,其中更新后的模型在没有服务中断的情况下过渡到活跃的推理路径,其结构正确性通过PHG证书和签名版本记录形式化。最终结果是一类领域特定的人工智能系统,其规模更小、精度更高,具有持续适应性,并在其领域的物理结构上可验证正确,且可从现有模型初始化。
cs.AI / 5 / 2603.18122
Don't Vibe Code, Do Skele-Code: Interactive No-Code Notebooks for Subject Matter Experts to Build Lower-Cost Agentic Workflows
不再依赖编码,而是进行骨架编码:为主题专家构建低成本自主工作流程的互动无代码笔记本
Abstract
Skele-Code is a natural-language and graph-based interface for building workflows with AI agents, designed especially for less technical or non-technical users. It supports incremental, interactive notebook-style development: each step is converted to code with a required set of functions and behaviors, enabling workflows to be built up piece by piece. Agents are invoked only for code generation and error recovery, not orchestration or task execution. This agent-supported but code-first approach to workflows, along with the context engineering used in Skele-Code, can help reduce token costs compared to the multi-agent-system approach to executing workflows. Skele-Code produces modular, easily extensible, and shareable workflows. The generated workflows can also be used as skills by agents, or as steps in other workflows.
Chinese Translation
Skele-Code 是一种基于自然语言和图形的界面,用于构建与 AI 代理协作的工作流程,特别为技术背景薄弱或非技术用户设计。它支持增量式、互动式的笔记本样式开发,每一个步骤都被转化为代码,并具有所需的功能集和行为,以实现工作流程的增量构建。代理仅在代码生成和错误恢复阶段被调用,而不参与编排或任务执行。这种由代理支持但以代码为先的工作流程方法,结合 Skele-Code 中使用的上下文工程,与多代理系统执行工作流程的方式相比,可以帮助降低令牌成本。Skele-Code 生成模块化、易于扩展和共享的工作流程。生成的工作流程还可以被代理用作技能,或作为其他工作流程中的步骤。
cs.AI / 6 / 2603.18166
Efficient Dense Crowd Trajectory Prediction Via Dynamic Clustering
通过动态聚类实现高效的密集人群轨迹预测
Abstract
Crowd trajectory prediction plays a crucial role in public safety and management, where it can help prevent disasters such as stampedes. Recent works address the problem by predicting individual trajectories and considering surrounding objects based on manually annotated data. However, these approaches tend to overlook dense crowd scenarios, where the challenges of automation become more pronounced due to the massiveness, noisiness, and inaccuracy of the tracking outputs, resulting in high computational costs. To address these challenges, we propose and extensively evaluate a novel cluster-based approach that groups individuals based on similar attributes over time, enabling faster execution through accurate group summarisation. Our plug-and-play method can be combined with existing trajectory predictors by using our output centroids in place of their pedestrian input. We evaluate our proposed method on several challenging dense crowd scenes and demonstrate that it leads to faster processing and lower memory usage than state-of-the-art methods, while maintaining accuracy.
Chinese Translation
人群轨迹预测在公共安全和管理中发挥着至关重要的作用,能够帮助防止踩踏等灾害。近期的研究通过预测个体轨迹并基于手动标注的数据考虑周围物体来解决这一问题。然而,这些方法往往忽视了密集人群场景,在这种情况下,由于追踪输出的庞大、嘈杂和不准确性,自动化的挑战变得更加明显,导致高计算成本。为了解决这些挑战,我们提出并广泛评估了一种新颖的基于聚类的方法,该方法根据随时间变化的相似属性将个体进行分组,通过准确的群体总结实现更快的执行。我们的即插即用方法可以通过使用我们输出的质心替代行人输入,与现有的轨迹预测器相结合。我们在多个具有挑战性的密集人群场景中评估了所提出的方法,结果表明,与最先进的方法相比,我们的方法在保持准确性的同时,实现了更快的处理速度和更低的内存占用。
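The group-summarisation step can be sketched with a minimal k-means over per-pedestrian attributes; the returned centroids would then replace individual pedestrians as input to a downstream trajectory predictor. This is an illustrative reading of the abstract, not the authors' code, and the deterministic initialisation is our toy choice.

```python
import numpy as np

def crowd_centroids(X, k, iters=20):
    """Minimal k-means sketch for dense-crowd summarisation: X holds one
    row per tracked pedestrian (e.g. position and velocity).  Each group
    is summarised by its centroid, which can stand in for individual
    pedestrians when fed to an existing trajectory predictor."""
    # Deterministic, spread-out initialisation (toy choice for clarity).
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float)
    for _ in range(iters):
        # Assign each pedestrian to its nearest centroid.
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        # Move each centroid to the mean of its assigned group.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated toy groups of 10 pedestrians each.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
cents, labels = crowd_centroids(X, k=2)
```

Prediction cost then scales with the number of groups k rather than the number of tracked individuals, which is where the speed and memory gains in dense scenes would come from.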
cs.AI / 7 / 2603.18189
TeachingCoach: A Fine-Tuned Scaffolding Chatbot for Instructional Guidance to Instructors
TeachingCoach:为教师提供教学指导的精细调优支架聊天机器人
Abstract
Higher education instructors often lack timely and pedagogically grounded support, as scalable instructional guidance remains limited and existing tools rely on generic chatbot advice or non-scalable teaching center human-human consultations. We present TeachingCoach, a pedagogically grounded chatbot designed to support instructor professional development through real-time, conversational guidance. TeachingCoach is built on a data-centric pipeline that extracts pedagogical rules from educational resources and uses synthetic dialogue generation to fine-tune a specialized language model that guides instructors through problem identification, diagnosis, and strategy development. Expert evaluations show TeachingCoach produces clearer, more reflective, and more responsive guidance than a GPT-4o mini baseline, while a user study with higher education instructors highlights trade-offs between conversational depth and interaction efficiency. Together, these results demonstrate that pedagogically grounded, synthetic data driven chatbots can improve instructional support and offer a scalable design approach for future instructional chatbot systems.
Chinese Translation
高等教育教师常常缺乏及时且以教学法为基础的支持,因为可扩展的教学指导仍然有限,现有工具依赖于通用聊天机器人的建议或不可扩展的教学中心人际咨询。我们提出了TeachingCoach,这是一种以教学法为基础的聊天机器人,旨在通过实时对话指导支持教师的专业发展。TeachingCoach建立在一个数据驱动的流程上,该流程从教育资源中提取教学规则,并利用合成对话生成技术对专门的语言模型进行精细调优,以指导教师进行问题识别、诊断和策略开发。专家评估表明,TeachingCoach提供的指导比GPT-4o mini基线更清晰、更具反思性和响应性,而与高等教育教师的用户研究则突出了对话深度与互动效率之间的权衡。综合来看,这些结果表明,以教学法为基础的、数据驱动的合成聊天机器人可以改善教学支持,并为未来的教学聊天机器人系统提供可扩展的设计思路。
cs.AI / 8 / 2603.18197
Access Controlled Website Interaction for Agentic AI with Delegated Critical Tasks
具有委托关键任务的自主人工智能的访问控制网站交互
Abstract
Recent studies reveal gaps in delegating critical tasks to agentic AI that accesses websites on the user's behalf, primarily due to limited access control mechanisms on websites designed for agentic AI. In response, we propose a design of website-based interaction for AI agents with fine-grained access control for delegated critical tasks. Our approach encompasses a website design and implementation, as well as modifications to the access grant protocols in an open-source authorization service to tailor it to agentic AI, with delegated critical tasks on the website. The evaluation of our approach demonstrates the capabilities of our access-controlled website used by AI agents.
Chinese Translation
近期研究揭示了将关键任务委托给代表用户访问网站的自主人工智能(agentic AI)时存在的不足,主要是由于网站上面向自主人工智能设计的访问控制机制有限。对此,我们提出了一种基于网站的交互设计,旨在为自主人工智能提供细粒度的访问控制,以便于委托关键任务。我们的方法包括网站的设计与实现,以及对开源授权服务中的访问授权协议的修改,以便将其调整为适应自主人工智能,并在网站上处理委托的关键任务。对我们方法的评估展示了自主人工智能使用的访问控制网站的能力。
cs.AI / 9 / 2603.18201
A Computationally Efficient Learning of Artificial Intelligence System Reliability Considering Error Propagation
考虑误差传播的人工智能系统可靠性计算高效学习
Abstract
Artificial Intelligence (AI) systems are increasingly prominent in emerging smart cities, yet their reliability remains a critical concern. These systems typically operate through a sequence of interconnected functional stages, where upstream errors may propagate to downstream stages, ultimately affecting overall system reliability. Quantifying such error propagation is essential for accurate modeling of AI system reliability. However, this task is challenging due to: i) data availability: real-world AI system reliability data are often scarce and constrained by privacy concerns; ii) model validity: recurring error events across sequential stages are interdependent, violating the independence assumptions of statistical inference; and iii) computational complexity: AI systems process large volumes of high-speed data, resulting in frequent and complex recurrent error events that are difficult to track and analyze. To address these challenges, this paper leverages a physics-based autonomous vehicle simulation platform with a justifiable error injector to generate high-quality data for AI system reliability analysis. Building on this data, a new reliability modeling framework is developed to explicitly characterize error propagation across stages. Model parameters are estimated using a computationally efficient, theoretically guaranteed composite likelihood expectation-maximization algorithm. Its application to the reliability modeling for autonomous vehicle perception systems demonstrates its predictive accuracy and computational efficiency.
Chinese Translation
人工智能(AI)系统在新兴智慧城市中日益突出,但其可靠性仍然是一个关键问题。这些系统通常通过一系列相互连接的功能阶段运行,其中上游错误可能传播到下游阶段,最终影响整体系统的可靠性。量化这种错误传播对于准确建模AI系统的可靠性至关重要。然而,这项任务面临以下挑战:i)数据可用性:现实世界中的AI系统可靠性数据通常稀缺,并受到隐私问题的限制;ii)模型有效性:在连续阶段中反复出现的错误事件是相互依赖的,违反了统计推断的独立性假设;iii)计算复杂性:AI系统处理大量高速数据,导致频繁且复杂的重复错误事件,难以跟踪和分析。为了解决这些挑战,本文利用基于物理的自动驾驶车辆仿真平台,结合合理的错误注入器生成高质量数据,以进行AI系统可靠性分析。在此数据基础上,开发了一种新的可靠性建模框架,以明确表征各阶段的错误传播。模型参数通过一种计算高效、理论上有保证的复合似然期望最大化算法进行估计。该算法在自动驾驶车辆感知系统的可靠性建模中的应用,展示了其预测准确性和计算效率。
cs.AI / 10 / 2603.18272
Retrieval-Augmented LLM Agents: Learning to Learn from Experience
检索增强的大型语言模型代理:从经验中学习的学习
Abstract
While large language models (LLMs) have advanced the development of general-purpose agents, achieving robust generalization to unseen tasks remains a significant challenge. Current approaches typically rely on either fine-tuning or training-free memory-augmented generation using retrieved experience; yet both have limitations: fine-tuning often fails to extrapolate to new tasks, while experience retrieval often underperforms compared to supervised baselines. In this work, we propose to combine these approaches and systematically study how to train retrieval-augmented LLM agents to effectively leverage retrieved trajectories in-context. First, we establish a robust supervised fine-tuning (SFT) recipe using LoRA that outperforms several state-of-the-art agent training pipelines. Second, we provide a detailed analysis of key design choices for experience retrieval, identifying optimal strategies for storage, querying, and trajectory selection. Finally, we propose a pipeline that integrates experience retrieval into the fine-tuning process. Our results demonstrate that this combined approach significantly improves generalization to unseen tasks, providing a scalable and effective framework for building agents that learn to learn from experience.
Chinese Translation
尽管大型语言模型(LLMs)推动了通用代理的发展,但在未见任务上实现稳健的泛化仍然是一个重大挑战。目前的方法通常依赖于微调,或使用检索到的经验进行免训练的记忆增强生成;然而,这两者都有局限性:微调往往无法外推到新任务,而经验检索的表现通常不及监督基线。在本研究中,我们提出结合这两种方法,并系统性地研究如何训练检索增强的LLM代理,以有效利用上下文中检索到的轨迹。首先,我们使用LoRA建立了一种稳健的监督微调(SFT)方案,其性能超越了几种最先进的代理训练流程。其次,我们对经验检索的关键设计选择进行了详细分析,识别出存储、查询和轨迹选择的最佳策略。最后,我们提出了一种将经验检索整合到微调过程中的流程。我们的结果表明,这种结合的方法显著提高了对未见任务的泛化能力,为构建能够从经验中学习的代理提供了一个可扩展且有效的框架。
cs.AI / 11 / 2603.18273
EDM-ARS: A Domain-Specific Multi-Agent System for Automated Educational Data Mining Research
EDM-ARS:一个特定领域的多代理系统用于自动化教育数据挖掘研究
Abstract
In this technical report, we present the Educational Data Mining Automated Research System (EDM-ARS), a domain-specific multi-agent pipeline that automates end-to-end educational data mining (EDM) research. We conceptualize EDM-ARS as a general framework for domain-aware automated research pipelines, where educational expertise is embedded into each stage of the research lifecycle. As a first instantiation of this framework, we focus on predictive modeling tasks. Within this scope, EDM-ARS orchestrates five specialized LLM-powered agents (ProblemFormulator, DataEngineer, Analyst, Critic, and Writer) through a state-machine coordinator that supports revision loops, checkpoint-based recovery, and sandboxed code execution. Given a research prompt and a dataset, EDM-ARS produces a complete LaTeX manuscript with real Semantic Scholar citations, validated machine learning analyses, and automated methodological peer review. We also provide a detailed description of the system architecture, the three-tier data registry design that encodes educational domain expertise, the specification of each agent, the inter-agent communication protocol, and mechanisms for error-handling and self-correction. Finally, we discuss current limitations, including single-dataset scope and formulaic paper output, and outline a phased roadmap toward causal inference, transfer learning, psychometric, and multi-dataset generalization. EDM-ARS is released as an open-source project to support the educational research community.
Chinese Translation
在本技术报告中,我们介绍了教育数据挖掘自动研究系统(EDM-ARS),这是一个特定领域的多代理流程,旨在自动化端到端教育数据挖掘(EDM)研究。我们将EDM-ARS概念化为一个针对领域意识的自动化研究流程的一般框架,其中教育专业知识嵌入到研究生命周期的每个阶段。作为该框架的首次实例,我们专注于预测建模任务。在此范围内,EDM-ARS通过一个状态机协调器协调五个专门的基于大语言模型(LLM)的代理(问题提出者(ProblemFormulator)、数据工程师(DataEngineer)、分析师(Analyst)、评论者(Critic)和写作助手(Writer)),该协调器支持修改循环、基于检查点的恢复和沙盒代码执行。在给定研究提示和数据集的情况下,EDM-ARS生成一篇完整的LaTeX手稿,其中包括真实的Semantic Scholar引用、经过验证的机器学习分析以及自动化的方法学同行评审。我们还提供了系统架构的详细描述、编码教育领域专业知识的三层数据注册设计、每个代理的规范、代理间通信协议,以及错误处理和自我修正机制。最后,我们讨论了当前的局限性,包括单数据集范围和公式化论文输出,并概述了关于因果推断、迁移学习、心理测量学以及多数据集泛化的分阶段路线图。EDM-ARS作为一个开源项目发布,旨在支持教育研究社区。
cs.AI / 12 / 2603.18290
CORE: Robust Out-of-Distribution Detection via Confidence and Orthogonal Residual Scoring
CORE:通过置信度和正交残差评分实现稳健的分布外检测
Abstract
Out-of-distribution (OOD) detection is essential for deploying deep learning models reliably, yet no single method performs consistently across architectures and datasets -- a scorer that leads on one benchmark often falters on another. We attribute this inconsistency to a shared structural limitation: logit-based methods see only the classifier's confidence signal, while feature-based methods attempt to measure membership in the training distribution but do so in the full feature space where confidence and membership are entangled, inheriting architecture-sensitive failure modes. We observe that penultimate features naturally decompose into two orthogonal subspaces: a classifier-aligned component encoding confidence, and a residual the classifier discards. We discover that this residual carries a class-specific directional signature for in-distribution data -- a membership signal invisible to logit-based methods and entangled with noise in feature-based methods. We propose CORE (COnfidence + REsidual), which disentangles the two signals by scoring each subspace independently and combines them via normalized summation. Because the two signals are orthogonal by construction, their failure modes are approximately independent, producing robust detection where either view alone is unreliable. CORE achieves competitive or state-of-the-art performance across five architectures and five benchmark configurations, ranking first in three of five settings and achieving the highest grand average AUROC with negligible computational overhead.
Chinese Translation
分布外(OOD)检测对于可靠部署深度学习模型至关重要,但没有一种方法能够在不同的架构和数据集上始终如一地表现出色——在一个基准上表现优异的评分器在另一个基准上往往会失效。我们将这种不一致归因于一个共同的结构性限制:基于logit的方法仅关注分类器的置信度信号,而基于特征的方法试图测量在训练分布中的成员资格,但在完整特征空间中进行测量,导致置信度和成员资格交织在一起,从而继承了对架构敏感的失败模式。我们观察到,倒数第二层特征自然分解为两个正交子空间:一个与分类器对齐的分量编码置信度,另一个是分类器丢弃的残差。我们发现,这个残差对分布内数据携带类特定的方向性特征——这是基于logit的方法无法察觉、而在基于特征的方法中与噪声交织在一起的成员资格信号。我们提出了CORE(COnfidence + REsidual),通过独立评分每个子空间来解耦这两个信号,并通过归一化求和将它们结合起来。由于这两个信号在构造上是正交的,它们的失败模式大致独立,从而在任一视角单独不可靠的情况下产生稳健的检测。CORE在五种架构和五种基准配置中实现了具有竞争力或最先进的性能,在五种设置中的三种中排名第一,并以可忽略的计算开销实现了最高的总体平均AUROC。
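The two-view decomposition at the heart of this abstract can be illustrated in a few lines: project the penultimate feature onto the span of the classifier's weight rows (the confidence view) and keep the orthogonal remainder (the residual view), then score each view and combine. The sketch below uses our own naming, a max-logit confidence score, and a unit-weight sum in place of the paper's normalised summation; it is an illustration of the idea, not the authors' method.

```python
import numpy as np

def core_style_score(f, W, resid_mean, eps=1e-8):
    """Decompose penultimate feature f into a classifier-aligned part
    (row space of classifier weight W) and the orthogonal residual the
    classifier discards, then score both views and sum them."""
    Q, _ = np.linalg.qr(W.T)           # orthonormal basis of W's row space
    f_cls = Q @ (Q.T @ f)              # confidence-carrying component
    f_res = f - f_cls                  # residual, invisible to the logits
    conf = float(np.max(W @ f))        # logit-view score (max logit)
    # Residual-view score: alignment with a class-conditional residual mean.
    memb = float(f_res @ resid_mean) / (
        np.linalg.norm(f_res) * np.linalg.norm(resid_mean) + eps)
    return conf + memb, f_cls, f_res

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 6))            # 3 classes, 6-dim penultimate features
f = rng.normal(size=6)
mu = rng.normal(size=6)                # toy class-conditional residual mean
score, f_cls, f_res = core_style_score(f, W, mu)
```

Because `f_res` lies in the orthogonal complement of the classifier's row space, any logit-only scorer is blind to it by construction, which is why the two views can fail independently.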
cs.AI / 13 / 2603.18294
The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition
健康人工智能评估中的有效性差距:基准组成的横断面分析
Abstract
Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) rarely characterize the "patient" or "query" populations they contain. Without defined composition, aggregate performance metrics may misrepresent model readiness for clinical use. Methods: We analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent. Results: We identified a structural "validity gap." While benchmarks have evolved from static retrieval to interactive dialogue, clinical composition remains misaligned with real-world needs. Although 42% of the corpus referenced objective data, this was polarized toward wellness-focused wearable signals (17.7%); complex diagnostic inputs remained rare, including laboratory values (5.2%), imaging (3.8%), and raw medical records (0.6%). Safety-critical scenarios were effectively absent: suicide/self-harm queries comprised <0.7% of the corpus and chronic disease management only 5.5%. Benchmarks also neglected vulnerable populations (pediatrics/older adults <11%) and global health needs. Conclusions: Evaluation benchmarks remain misaligned with real-world clinical needs, lacking raw clinical artifacts, adequate representation of vulnerable populations, and longitudinal chronic care scenarios. The field must adopt standardized query profiling--analogous to clinical trial reporting--to align evaluation with the full complexity of clinical practice.
Chinese Translation
背景:临床试验依赖透明的纳入标准以确保结果的普遍适用性。相较之下,用于验证健康相关大型语言模型(LLMs)的基准往往缺乏对其包含的“患者”或“查询”群体的描述。没有明确的组成,汇总性能指标可能会错误地表示模型在临床使用方面的准备情况。方法:我们分析了 18,707 个消费者健康查询,基于六个公共基准,利用 LLMs 作为自动编码工具,应用标准化的 16 字段分类法来描述上下文、主题和意图。结果:我们识别出一个结构性的“有效性差距”。尽管基准已经从静态检索演变为互动对话,但临床组成仍然与现实世界的需求不匹配。虽然 42% 的语料库引用了客观数据,但这些数据却倾向于关注健康的可穿戴信号(17.7%);复杂的诊断输入仍然稀缺,包括实验室值(5.2%)、影像学(3.8%)和原始医疗记录(0.6%)。安全至关重要的场景几乎不存在:自杀/自残查询仅占语料库的 <0.7%,慢性病管理也仅占 5.5%。基准还忽视了脆弱群体(儿科/老年人 <11%)和全球健康需求。结论:评估基准依然与现实临床需求不相符,缺乏原始临床文档、对脆弱人群的充分代表以及纵向慢性护理场景。该领域必须采用标准化的查询描述——类似于临床试验报告——以使评估与临床实践的复杂性相一致。
cs.AI / 14 / 2603.18327
Consumer-to-Clinical Language Shifts in Ambient AI Draft Notes and Clinician-Finalized Documentation: A Multi-level Analysis
环境人工智能草拟笔记与临床医生最终文档中的消费者到临床语言转变:多层次分析
Abstract
Ambient AI generates draft clinical notes from patient-clinician conversations, often using lay or consumer-oriented phrasing to support patient understanding instead of standardized clinical terminology. How clinicians revise these drafts to meet professional documentation conventions remains unclear. We quantified clinician editing for consumer-to-clinical normalization using a dictionary-confirmed transformation framework. We analyzed 71,173 AI-draft and finalized-note section pairs from 34,726 encounters. Confirmed transformations were defined as replacing a consumer expression with its dictionary-mapped clinical equivalent in the same section. Editing significantly reduced terminology density across all sections (p < 0.001). The Assessment and Plan accounted for the largest transformation volume (59.3%). Our analysis identified 7,576 transformation events across 4,114 note sections (5.8%), representing 1.2% consumer-term deletions. Transformation intensity varied across individual clinicians (p < 0.001). Overall, clinician post-editing demonstrates consistent shifts from conversational phrasing toward standardized, section-appropriate clinical terminology, supporting section-aware ambient AI design.
Chinese Translation
环境人工智能根据患者与临床医生的对话生成临床草拟笔记,通常使用通俗或面向消费者的措辞,以支持患者理解,而不是标准化的临床术语。临床医生如何修订这些草稿以符合专业文档规范仍不清楚。我们使用字典确认的转化框架量化了临床医生在消费者到临床规范化方面的编辑。我们分析了来自34,726次就诊的71,173对AI草拟笔记和最终笔记部分。确认的转化被定义为在同一部分中用字典映射的临床等价表达替换消费者表达。编辑显著减少了所有部分的术语密度(p < 0.001)。评估与计划部分占据了最大的转化量(59.3%)。我们的分析识别了4,114个笔记部分中的7,576个转化事件(5.8%),代表1.2%的消费者术语删除。转化强度在不同临床医生之间存在差异(p < 0.001)。总体而言,临床医生的后期编辑展示了从对话措辞向标准化、适合各部分的临床术语的一致转变,支持了对各部分敏感的环境人工智能设计。
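The "dictionary-confirmed transformation" rule described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the dictionary entries, matching by lowercase substring, and the `confirmed_transformations` helper are all assumptions.

```python
# Hypothetical sketch: a consumer term in the AI draft counts as a
# confirmed transformation only if it disappears from the finalized
# section AND its dictionary-mapped clinical equivalent appears there.
# Dictionary entries are illustrative, not from the paper.

CONSUMER_TO_CLINICAL = {
    "high blood pressure": "hypertension",
    "heart attack": "myocardial infarction",
    "belly pain": "abdominal pain",
}

def confirmed_transformations(draft: str, final: str) -> list[tuple[str, str]]:
    """Return (consumer, clinical) pairs confirmed in this section pair."""
    draft_l, final_l = draft.lower(), final.lower()
    events = []
    for consumer, clinical in CONSUMER_TO_CLINICAL.items():
        if consumer in draft_l and consumer not in final_l and clinical in final_l:
            events.append((consumer, clinical))
    return events

draft = "Patient reports high blood pressure and prior heart attack."
final = "Patient reports hypertension and prior myocardial infarction."
print(confirmed_transformations(draft, final))
# [('high blood pressure', 'hypertension'), ('heart attack', 'myocardial infarction')]
```

Counting such events over every draft/final section pair would yield the per-section transformation volumes the abstract reports.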
cs.AI / 15 / 2603.18329
FaithSteer-BENCH: A Deployment-Aligned Stress-Testing Benchmark for Inference-Time Steering
FaithSteer-BENCH:一种与部署对齐的推理时间引导压力测试基准
Abstract
Inference-time steering is widely regarded as a lightweight and parameter-free mechanism for controlling large language model (LLM) behavior, and prior work has often suggested that simple activation-level interventions can reliably induce targeted behavioral changes. However, such conclusions are typically drawn under relatively relaxed evaluation settings that overlook deployment constraints, capability trade-offs, and real-world robustness. We therefore introduce \textbf{FaithSteer-BENCH}, a stress-testing benchmark that evaluates steering methods at a fixed deployment-style operating point through three gate-wise criteria: controllability, utility preservation, and robustness. Across multiple models and representative steering approaches, we uncover several systematic failure modes that are largely obscured under standard evaluation, including illusory controllability, measurable cognitive tax on unrelated capabilities, and substantial brittleness under mild instruction-level perturbations, role prompts, encoding transformations, and data scarcity. Gate-wise benchmark results show that existing methods do not necessarily provide reliable controllability in deployment-oriented practical settings. In addition, mechanism-level diagnostics indicate that many steering methods induce prompt-conditional alignment rather than stable latent directional shifts, further explaining their fragility under stress. FaithSteer-BENCH therefore provides a unified benchmark and a clearer analytical lens for future method design, reliability evaluation, and deployment-oriented research in steering.
Chinese Translation
推理时间引导被广泛视为一种轻量级且无参数的机制,用于控制大型语言模型(LLM)的行为,先前的研究常常表明,简单的激活级别干预可以可靠地引发目标行为变化。然而,这些结论通常是在相对宽松的评估环境下得出的,忽视了部署约束、能力权衡和现实世界的鲁棒性。因此,我们引入了FaithSteer-BENCH,这是一个压力测试基准,通过三个门控标准(可控性、效用保持和鲁棒性)在固定的部署风格操作点上评估引导方法。在多个模型和代表性的引导方法中,我们发现了一些在标准评估下大多被掩盖的系统性失败模式,包括虚幻的可控性、对无关能力的可测认知负担,以及在轻微指令级扰动、角色提示、编码变换和数据稀缺下的显著脆弱性。门控基准结果显示,现有方法在以部署为导向的实际环境中并不一定提供可靠的可控性。此外,机制级诊断表明,许多引导方法引发的是基于提示的条件对齐,而非稳定的潜在方向性转变,进一步解释了它们在压力下的脆弱性。因此,FaithSteer-BENCH提供了一个统一的基准和更清晰的分析视角,以便于未来方法设计、可靠性评估和以部署为导向的引导研究。
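The gate-wise evaluation idea (a method must pass every criterion at a fixed operating point, rather than trade them off in an average) can be sketched as below. The gate names follow the abstract; the thresholds and the `gate_wise` helper are illustrative assumptions.

```python
# Hypothetical gate-wise aggregation: each criterion is a pass/fail
# gate at a fixed operating point; overall pass requires all gates.
# Threshold values are illustrative, not from the benchmark.

GATES = {"controllability": 0.80, "utility_preservation": 0.95, "robustness": 0.70}

def gate_wise(scores: dict[str, float]) -> dict[str, bool]:
    """Map raw criterion scores to per-gate pass/fail decisions."""
    return {gate: scores[gate] >= threshold for gate, threshold in GATES.items()}

result = gate_wise({"controllability": 0.83,
                    "utility_preservation": 0.91,
                    "robustness": 0.74})
print(result)                 # fails the utility-preservation gate
print(all(result.values()))   # False: a single failed gate fails the method
```

Reporting per-gate outcomes instead of a single aggregate score is what lets failure modes like "illusory controllability" surface.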
cs.AI / 16 / 2603.18330
MemArchitect: A Policy Driven Memory Governance Layer
MemArchitect:一种基于政策的内存治理层
Abstract
Persistent Large Language Model (LLM) agents expose a critical governance gap in memory management. Standard Retrieval-Augmented Generation (RAG) frameworks treat memory as passive storage, lacking mechanisms to resolve contradictions, enforce privacy, or prevent outdated information ("zombie memories") from contaminating the context window. We introduce MemArchitect, a governance layer that decouples memory lifecycle management from model weights. MemArchitect enforces explicit, rule-based policies, including memory decay, conflict resolution, and privacy controls. We demonstrate that governed memory consistently outperforms unmanaged memory in agentic settings, highlighting the necessity of structured memory governance for reliable and safe autonomous systems.
Chinese Translation
持久的大型语言模型(LLM)代理在内存管理中暴露了一个关键的治理缺口。标准的检索增强生成(RAG)框架将内存视为被动存储,缺乏解决矛盾、执行隐私保护或防止过时信息(“僵尸记忆”)污染上下文窗口的机制。我们提出了MemArchitect,这是一种将内存生命周期管理与模型权重解耦的治理层。MemArchitect 强制实施明确的基于规则的政策,包括内存衰减、冲突解决和隐私控制。我们证明,受治理的内存在自主设置中始终优于未管理的内存,突显了结构化内存治理对于可靠和安全的自主系统的必要性。
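The three policy families named above (decay, conflict resolution, privacy controls) can be sketched as a minimal governance layer. This is a hedged illustration in the spirit of the abstract, not MemArchitect's actual design: the TTL-based decay, last-write-wins conflict rule, and redaction flag are all assumptions.

```python
from __future__ import annotations
from dataclasses import dataclass

# Illustrative rule-based memory governance layer (assumed policies):
# decay evicts stale "zombie memories", conflict resolution keeps the
# newest assertion per key, and privacy-flagged values are redacted.

@dataclass
class MemoryItem:
    key: str
    value: str
    ts: float            # write timestamp
    private: bool = False

class GovernedMemory:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.items: dict[str, MemoryItem] = {}

    def write(self, item: MemoryItem) -> None:
        old = self.items.get(item.key)
        # conflict resolution: a newer assertion about the same key wins
        if old is None or item.ts >= old.ts:
            self.items[item.key] = item

    def read(self, key: str, now: float) -> str | None:
        item = self.items.get(key)
        if item is None:
            return None
        if now - item.ts > self.ttl:   # decay: evict expired memories
            del self.items[key]
            return None
        if item.private:               # privacy: never surface raw value
            return "[REDACTED]"
        return item.value

mem = GovernedMemory(ttl_seconds=60.0)
mem.write(MemoryItem("user_city", "Paris", ts=0.0))
mem.write(MemoryItem("user_city", "Lyon", ts=10.0))   # supersedes Paris
print(mem.read("user_city", now=30.0))   # Lyon
print(mem.read("user_city", now=100.0))  # None: expired and evicted
```

The point of decoupling these rules from model weights is that they can be audited and changed without retraining, which is the governance gap the abstract targets.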
cs.AI / 17 / 2603.18331
Understanding the Theoretical Foundations of Deep Neural Networks through Differential Equations
通过微分方程理解深度神经网络的理论基础
Abstract
Deep neural networks (DNNs) have achieved remarkable empirical success, yet the absence of a principled theoretical foundation continues to hinder their systematic development. In this survey, we present differential equations as a theoretical foundation for understanding, analyzing, and improving DNNs. We organize the discussion around three guiding questions: i) how differential equations offer a principled understanding of DNN architectures, ii) how tools from differential equations can be used to improve DNN performance in a principled way, and iii) what real-world applications benefit from grounding DNNs in differential equations. We adopt a two-fold perspective spanning the model level, which interprets the whole DNN as a differential equation, and the layer level, which models individual DNN components as differential equations. From these two perspectives, we review how this framework connects model design, theoretical analysis, and performance improvement. We further discuss real-world applications, as well as key challenges and opportunities for future research.
Chinese Translation
深度神经网络(DNNs)在实践中取得了显著成功,但缺乏系统的理论基础仍然阻碍了它们的系统发展。在本次调查中,我们提出微分方程作为理解、分析和改进DNNs的理论基础。我们围绕三个指导性问题组织讨论:i) 微分方程如何提供对DNN架构的原则性理解,ii) 如何利用微分方程的工具以原则性方式改善DNN性能,以及 iii) 哪些现实世界应用受益于将DNN建立在微分方程基础之上。我们采用了双重视角,涵盖模型层面(将整个DNN解释为一个微分方程)和层面层面(将单个DNN组件建模为微分方程)。从这两个视角出发,我们回顾了这一框架如何连接模型设计、理论分析和性能改进。我们进一步讨论了现实世界应用,以及未来研究的关键挑战和机遇。
cs.AI / 18 / 2603.18349
Large-Scale Analysis of Political Propaganda on Moltbook
Moltbook上政治宣传的大规模分析
Abstract
We present an NLP-based study of political propaganda on Moltbook, a Reddit-style platform for AI agents. To enable large-scale analysis, we develop LLM-based classifiers to detect political propaganda, validated against expert annotation (Cohen's $\kappa$= 0.64-0.74). Using a dataset of 673,127 posts and 879,606 comments, we find that political propaganda accounts for 1% of all posts and 42% of all political content. These posts are concentrated in a small set of communities, with 70% of such posts falling into five of them. 4% of agents produced 51% of these posts. We further find that a minority of these agents repeatedly post highly similar content within and across communities. Despite this, we find limited evidence that comments amplify political propaganda.
Chinese Translation
我们提出了一项基于自然语言处理(NLP)的研究,旨在分析Moltbook上的政治宣传,这是一个类似Reddit的平台,专为人工智能代理设计。为了实现大规模分析,我们开发了基于大型语言模型(LLM)的分类器,以检测政治宣传,并通过专家注释进行验证(Cohen's $\kappa$ = 0.64-0.74)。使用673,127条帖子和879,606条评论的数据集,我们发现政治宣传占所有帖子的1%,而在所有政治内容中占42%。这些帖子集中在少数社区中,其中70%的此类帖子来自五个社区。4%的代理生成了51%的这些帖子。我们进一步发现,这些代理中的少数会在社区内外重复发布高度相似的内容。尽管如此,我们发现评论对政治宣传的放大作用证据有限。
cs.AI / 19 / 2603.18353
Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations
无可操作性的可解释性:机械方法无法纠正语言模型错误,尽管内部表征几乎完美
Abstract
Language models encode task-relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpretability methods can bridge this knowledge-action gap has not been systematically tested. We compared four mechanistic interpretability methods -- concept bottleneck steering (Steerling-8B), sparse autoencoder feature steering, logit lens with activation patching, and linear probing with truthfulness separator vector steering (Qwen 2.5 7B Instruct) -- for correcting false-negative triage errors using 400 physician-adjudicated clinical vignettes (144 hazards, 256 benign). Linear probes discriminated hazardous from benign cases with 98.2% AUROC, yet the model's output sensitivity was only 45.1%, a 53-percentage-point knowledge-action gap. Concept bottleneck steering corrected 20% of missed hazards but disrupted 53% of correct detections, indistinguishable from random perturbation (p=0.84). SAE feature steering produced zero effect despite 3,695 significant features. TSV steering at high strength corrected 24% of missed hazards while disrupting 6% of correct detections, but left 76% of errors uncorrected. Current mechanistic interpretability methods cannot reliably translate internal knowledge into corrected outputs, with implications for AI safety frameworks that assume interpretability enables effective error correction.
Chinese Translation
语言模型在内部表征中编码了远超其输出性能的任务相关知识,但机械可解释性方法是否能够弥补这一知识-行动差距尚未系统性地进行测试。我们比较了四种机械可解释性方法——概念瓶颈引导(Steerling-8B)、稀疏自编码器特征引导、带激活补丁的逻辑透镜,以及带真实度分离向量引导的线性探测(Qwen 2.5 7B Instruct)——用于纠正400个由医生裁定的临床小案例中的假阴性分流错误(144个危险案例,256个良性案例)。线性探测以98.2%的AUROC区分了危险和良性案例,但模型的输出敏感性仅为45.1%,存在53个百分点的知识-行动差距。概念瓶颈引导纠正了20%的漏检危险,但干扰了53%的正确检测,无法与随机扰动区分(p=0.84)。尽管有3,695个显著特征,稀疏自编码器特征引导没有产生任何效果。高强度的真实度分离向量引导纠正了24%的漏检危险,同时干扰了6%的正确检测,但仍有76%的错误未被纠正。目前的机械可解释性方法无法可靠地将内部知识转化为纠正后的输出,这对假设可解释性能够有效纠正错误的人工智能安全框架具有重要影响。
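The "knowledge-action gap" above is the difference between two measurements: the AUROC of a linear probe on internal representations and the sensitivity (recall on hazards) of the model's actual outputs. A minimal sketch of that metric pair on synthetic data, with the `auroc` pairwise estimator and all score distributions assumed for illustration:

```python
import numpy as np

# Synthetic illustration of the knowledge-action gap: a probe on
# internal activations separates classes well, while output behavior
# catches far fewer hazards. All distributions here are assumptions.

def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Probability a random positive outscores a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return float((pos[:, None] > neg[None, :]).mean())

rng = np.random.default_rng(0)
labels = np.array([1] * 144 + [0] * 256)          # hazards vs. benign
# probe scores: well-separated classes (the "knowledge")
probe = np.where(labels == 1,
                 rng.normal(2.0, 1.0, 400),
                 rng.normal(0.0, 1.0, 400))
# binary model outputs: hazards flagged far less reliably (the "action")
outputs = (rng.uniform(size=400) < np.where(labels == 1, 0.45, 0.02)).astype(int)

sens = outputs[labels == 1].mean()                # sensitivity on hazards
print(f"probe AUROC {auroc(probe, labels):.2f}, output sensitivity {sens:.2f}")
```

The gap reported in the abstract is simply the probe's discrimination minus the output sensitivity, expressed in percentage points.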
cs.AI / 20 / 2603.18356
LGESynthNet: Controlled Scar Synthesis for Improved Scar Segmentation in Cardiac LGE-MRI Imaging
LGESynthNet:用于改善心脏LGE-MRI成像中瘢痕分割的可控瘢痕合成
Abstract
Segmentation of enhancement in LGE cardiac MRI is critical for diagnosing various ischemic and non-ischemic cardiomyopathies. However, creating pixel-level annotations for these images is challenging and labor-intensive, leading to limited availability of annotated data. Generative models, particularly diffusion models, offer promise for synthetic data generation, yet many rely on large training datasets and often struggle with fine-grained conditioning control, especially for small or localized features. We introduce LGESynthNet, a latent diffusion-based framework for controllable enhancement synthesis, enabling explicit control over size, location, and transmural extent. Formulated as inpainting using a ControlNet-based architecture, the model integrates: (a) a reward model for conditioning-specific supervision, (b) a captioning module for anatomically descriptive text prompts, and (c) a biomedical text encoder. Trained on just 429 images (79 patients), it produces realistic, anatomically coherent samples. A quality control filter selects outputs with high conditioning-fidelity, which, when used for training augmentation, improve downstream segmentation and detection performance by up to 6 and 20 points, respectively.
Chinese Translation
在LGE心脏MRI中,增强的分割对于诊断各种缺血性和非缺血性心肌病至关重要。然而,为这些图像创建像素级注释具有挑战性且劳动密集,导致标注数据的可用性有限。生成模型,特别是扩散模型,为合成数据生成提供了希望,但许多模型依赖于大量训练数据集,并且在细粒度条件控制方面常常遇到困难,尤其是对于小型或局部特征。我们提出了LGESynthNet,这是一种基于潜在扩散的框架,用于可控的增强合成,能够明确控制大小、位置和穿透范围。该模型以基于ControlNet的架构进行修复,集成了:(a)用于条件特定监督的奖励模型,(b)用于解剖描述性文本提示的字幕模块,以及(c)生物医学文本编码器。该模型仅在429幅图像(79名患者)上进行训练,生成现实且解剖一致的样本。质量控制过滤器选择高条件保真度的输出,当用于训练增强时,分别提高下游分割和检测性能,最多可提高6分和20分。
cs.AI / 21 / 2603.18382
From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents
从弱线索到真实身份:评估LLM代理中的推理驱动去匿名化
Abstract
Anonymization is widely treated as a practical safeguard because re-identifying anonymous records was historically costly, requiring domain expertise, tailored algorithms, and manual corroboration. We study a growing privacy risk that may weaken this barrier: LLM-based agents can autonomously reconstruct real-world identities from scattered, individually non-identifying cues. By combining these sparse cues with public information, agents resolve identities without bespoke engineering. We formalize this threat as \emph{inference-driven linkage} and systematically evaluate it across three settings: classical linkage scenarios (Netflix and AOL), \emph{InferLink} (a controlled benchmark varying task intent, shared cues, and attacker knowledge), and modern text-rich artifacts. Without task-specific heuristics, agents successfully execute both fixed-pool matching and open-ended identity resolution. In the Netflix Prize setting, an agent reconstructs 79.2\% of identities, significantly outperforming a 56.0\% classical baseline. Furthermore, linkage emerges not only under explicit adversarial prompts but also as a byproduct of benign cross-source analysis in \emph{InferLink} and unstructured research narratives. These findings establish that identity inference -- not merely explicit information disclosure -- must be treated as a first-class privacy risk; evaluations must measure what identities an agent can infer.
Chinese Translation
匿名化被广泛视为一种实用的保护措施,因为重新识别匿名记录在历史上是成本高昂的,要求具有领域专业知识、量身定制的算法和手动验证。我们研究了一种可能削弱这一障碍的日益增长的隐私风险:基于大型语言模型(LLM)的代理能够自主地从分散的、单独不具识别性的线索中重构现实世界的身份。通过将这些稀疏的线索与公共信息相结合,代理能够在无需专门工程的情况下解析身份。我们将这一威胁形式化为"推理驱动连接"(inference-driven linkage),并在三种情境下系统评估:经典连接场景(Netflix和AOL)、InferLink(一个受控基准,涵盖任务意图、共享线索和攻击者知识的变化)以及现代文本丰富的文档。在没有特定任务启发式的情况下,代理成功执行了固定池匹配和开放式身份解析。在Netflix Prize场景下,一个代理重构了79.2%的身份,显著超过了56.0%的经典基线。此外,连接不仅在明确的对抗性提示下出现,也作为在InferLink与非结构化研究叙述中所进行的良性跨源分析的副产品而出现。这些发现表明,身份推断——而不仅仅是显式信息披露——必须被视为一种重要的隐私风险;评估必须衡量代理可以推断出哪些身份。
cs.AI / 22 / 2603.18388
Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization
黑暗中的反思:揭示和逃避反思提示优化中的黑箱
Abstract
Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations; for example, on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and an interpretable optimization trace. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further escapes local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.
Chinese Translation
自动提示优化(APO)已成为提高大语言模型(LLM)性能的一种强大范例,无需手动提示工程。反思性APO方法如GEPA通过诊断失败案例迭代地优化提示,但优化过程仍然是黑箱且无标签的,导致不可解释的轨迹和系统性的失败。我们识别并实证展示了四个局限性:例如,在GSM8K上使用缺陷种子时,GEPA的准确率从23.81%下降至13.50%。我们提出了VISTA,一个多智能体APO框架,它将假设生成与提示重写解耦,实现语义标记的假设、并行小批量验证和可解释的优化轨迹。结合随机重启和ε-贪婪采样的两层探索-利用机制进一步帮助跳出局部最优。VISTA在相同的缺陷种子上恢复准确率至87.57%,并在GSM8K和AIME2025的所有条件下持续优于基线。
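The two-layer explore-exploit mechanism named above (random restart plus epsilon-greedy sampling) can be sketched as a simple candidate-selection loop. The candidate pool, scores, and probabilities are illustrative assumptions, not VISTA's actual configuration.

```python
import random

# Illustrative two-layer explore-exploit selection over scored prompt
# candidates: an outer random-restart layer occasionally resets to a
# uniformly sampled candidate, and an inner epsilon-greedy layer
# otherwise trades off exploration against exploiting the best score.

def select_candidate(scores: dict[str, float], epsilon: float,
                     restart_prob: float, rng: random.Random) -> str:
    pool = sorted(scores)                       # deterministic candidate order
    if rng.random() < restart_prob:
        return rng.choice(pool)                 # layer 1: random restart
    if rng.random() < epsilon:
        return rng.choice(pool)                 # layer 2: epsilon exploration
    return max(scores, key=scores.get)          # exploit best candidate so far

rng = random.Random(0)
scores = {"seed": 0.13, "revision_a": 0.55, "revision_b": 0.87}
picks = [select_candidate(scores, epsilon=0.2, restart_prob=0.05, rng=rng)
         for _ in range(1000)]
print(picks.count("revision_b") / len(picks))   # mostly exploits the best prompt
```

The restart layer is what lets such a loop escape a locally optimal but globally poor prompt, e.g. one descended from a defective seed.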
cs.AI / 23 / 2603.18420
From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory
从主题到过渡结构:通过预测性关联记忆在语料库规模上进行无监督概念发现
Abstract
Embedding models group text by semantic content, i.e., what text is about. We show that temporal co-occurrence within texts reveals a different kind of structure: recurrent transition-structure concepts, i.e., what text does. We train a 29.4M-parameter contrastive model on 373 million co-occurrence pairs from 9,766 Project Gutenberg texts (24.96 million passages), mapping pre-trained embeddings into an association space where passages with similar transition structure cluster together. Under capacity constraint (42.75% accuracy), the model must compress across recurring patterns rather than memorise individual co-occurrences. Clustering at six granularities (k=50 to k=2,000) produces a multi-resolution concept map, ranging from broad modes like "direct confrontation" and "lyrical meditation" to precise registers and scene templates like "sailor dialect" and "courtroom cross-examination." At k=100, clusters average 4,508 books each (of 9,766), confirming corpus-wide patterns. Direct comparison with embedding-similarity clustering shows that raw embeddings group by topic while association-space clusters group by function, register, and literary tradition. Unseen novels are assigned to existing clusters without retraining; the association model concentrates each novel into a selective subset of coherent clusters, while raw embedding assignment saturates nearly all clusters. Validation controls address positional, length, and book-concentration confounds. The method extends Predictive Associative Memory (PAM, arXiv:2602.11322) from episodic recall to concept formation: where PAM recalls specific associations, multi-epoch contrastive training under compression extracts structural patterns that transfer to unseen texts, with the same framework producing qualitatively different behaviour in a different regime.
Chinese Translation
嵌入模型通过语义内容对文本进行分组,即文本的主题。我们展示了文本内的时间共现揭示了一种不同类型的结构:重复的过渡结构概念,即文本的功能。我们在来自9,766本古腾堡计划文本(约2,496万个段落)的3.73亿个共现对上训练了一个29.4M参数的对比模型,将预训练的嵌入映射到一个关联空间,在该空间中,具有相似过渡结构的段落聚集在一起。在容量限制下(42.75%的准确率),模型必须在重复模式之间进行压缩,而不是记忆单个共现。在六个粒度(k=50到k=2,000)下进行聚类,产生了一个多分辨率概念图:从"直接对抗"和"抒情冥想"等广泛模式,到"水手方言"和"法庭交叉询问"等精确的语域和场景模板。在k=100时,聚类平均每个聚类包含4,508本书(共9,766本),确认了语料库范围内的模式。与嵌入相似性聚类的直接比较显示,原始嵌入按主题分组,而关联空间聚类按功能、语域和文学传统分组。未见过的小说被分配到现有聚类中而无需重新训练;关联模型将每部小说集中到一组连贯的聚类中,而原始嵌入分配几乎饱和了所有聚类。验证控制解决了位置、长度和书籍集中度的混淆。该方法将预测性关联记忆(PAM, arXiv:2602.11322)从情节回忆扩展到概念形成:在PAM回忆特定关联的地方,压缩下的多轮次(multi-epoch)对比训练提取了可迁移到未见文本的结构模式,同一框架在不同的机制下产生了性质不同的行为。
cs.AI / 24 / 2603.18426
Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression
先剪枝后量化还是先量化后剪枝?联合模型压缩中压缩顺序影响的理解
Abstract
What happens when multiple compression methods are combined: does the order in which they are applied matter? Joint model compression has emerged as a powerful strategy to achieve higher efficiency by combining multiple methods such as pruning and quantization. A central but underexplored factor in joint model compression is the compression order, i.e., the sequence in which the different methods are applied within the compression pipeline. Most prior studies have sidestepped the issue by assuming orthogonality between techniques, while a few have examined it only in highly constrained cases. Consequently, the broader role of compression order in shaping model performance remains poorly understood. In this paper, we address the overlooked problem of compression order and provide both theoretical and empirical analysis. We formulate the problem of optimizing the compression order and introduce the Progressive Intensity Hypothesis, which states that weaker perturbations should precede stronger ones. We provide theoretical guarantees showing that the relative benefit of one order increases with the underlying performance gap. Extensive experiments on both language and vision models validate the hypothesis, and further show its generality to broader setups such as multi-stage compression and mixed-precision quantization.
Chinese Translation
当多种压缩方法结合使用时,会发生什么?这些方法应用的顺序是否重要?联合模型压缩作为一种通过结合剪枝和量化等多种方法来实现更高效性的强大策略,日益受到关注。其中一个核心但鲜有研究的因素是压缩顺序,即压缩流程中不同方法的应用顺序。以往的大多数研究要么假定技术间相互独立从而回避该问题,要么仅在高度受限的情形下进行探讨。因此,压缩顺序在影响模型性能方面的广泛作用仍未被充分理解。本文针对被忽视的压缩顺序问题,提供了理论与实证分析。我们将优化压缩顺序的问题形式化,并提出“渐进强度假设”(Progressive Intensity Hypothesis),该假设指出较弱的扰动应先于较强的扰动。我们提供理论保障,表明某一顺序的相对优势会随着底层性能差距的增大而提升。通过大量在语言模型和视觉模型上的实验验证了该假设,并进一步展示了其在多阶段压缩和混合精度量化等更广泛场景下的普适性。
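The order question above can be made concrete with a toy experiment: apply magnitude pruning and uniform quantization to the same weight vector in both orders and compare reconstruction error. This is an illustrative sketch, not the paper's experimental setup; the pruning ratio, number of quantization levels, and error metric are all assumptions.

```python
import numpy as np

# Toy comparison of the two compression orders on a random weight
# vector. Magnitude pruning keeps the largest-|w| fraction; uniform
# quantization rounds weights onto an evenly spaced grid.

def prune(w: np.ndarray, keep: float) -> np.ndarray:
    k = int(keep * w.size)
    thresh = np.sort(np.abs(w))[-k]            # k-th largest magnitude
    return np.where(np.abs(w) >= thresh, w, 0.0)

def quantize(w: np.ndarray, n_levels: int) -> np.ndarray:
    scale = np.abs(w).max() / (n_levels // 2)  # symmetric uniform grid
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)

pq = quantize(prune(w, keep=0.5), n_levels=16)   # prune-then-quantize
qp = prune(quantize(w, n_levels=16), keep=0.5)   # quantize-then-prune

print("prune->quantize error:", np.linalg.norm(w - pq))
print("quantize->prune error:", np.linalg.norm(w - qp))
```

Which order wins depends on the relative "intensity" of the two perturbations, which is exactly what the Progressive Intensity Hypothesis formalizes for realistic models.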
cs.AI / 25 / 2603.18436
AS2 -- Attention-Based Soft Answer Sets: An End-to-End Differentiable Neuro-Soft-Symbolic Reasoning Architecture
AS2 -- 基于注意力的软答案集:一种端到端可微分的神经-软-符号推理架构
Abstract
Neuro-symbolic artificial intelligence (AI) systems typically couple a neural perception module to a discrete symbolic solver through a non-differentiable boundary, preventing constraint-satisfaction feedback from reaching the perception encoder during training. We introduce AS2 (Attention-Based Soft Answer Sets), a fully differentiable neuro-symbolic architecture that replaces the discrete solver with a soft, continuous approximation of the Answer Set Programming (ASP) immediate consequence operator $T_P$. AS2 maintains per-position probability distributions over a finite symbol domain throughout the forward pass and trains end-to-end by minimizing the fixed-point residual of a probabilistic lift of $T_P$, thereby differentiating through the constraint check without invoking an external solver at either training or inference time. The architecture is entirely free of conventional positional embeddings. Instead, it encodes problem structure through constraint-group membership embeddings that directly reflect the declarative ASP specification, making the model agnostic to arbitrary position indexing. On Visual Sudoku, AS2 achieves 99.89% cell accuracy and 100% constraint satisfaction (verified by Clingo) across 1,000 test boards, using a greedy constrained decoding procedure that requires no external solver. On MNIST Addition with $N \in \{2, 4, 8\}$ addends, AS2 achieves digit accuracy above 99.7% across all scales. These results demonstrate that a soft differentiable fixpoint operator, combined with constraint-aware attention and declarative constraint specification, can match or exceed pipeline and solver-based neuro-symbolic systems while maintaining full end-to-end differentiability.
Chinese Translation
神经-符号人工智能(AI)系统通常通过一个不可微分的边界将神经感知模块与离散符号求解器耦合,使得约束满足的反馈在训练过程中无法传回感知编码器。我们提出了AS2(基于注意力的软答案集),这是一种完全可微分的神经-符号架构,它用答案集编程(ASP)直接后果算子$T_P$的软性连续近似替代了离散求解器。AS2在整个前向传播过程中保持有限符号域上每个位置的概率分布,并通过最小化$T_P$的概率提升的不动点残差进行端到端训练,从而在训练和推理时都无需调用外部求解器即可对约束检查进行微分。该架构完全不使用传统的位置嵌入,而是通过直接反映声明式ASP规范的约束组成员嵌入来编码问题结构,使模型对任意位置索引保持不可知。在视觉数独任务上,AS2在1,000个测试板上实现了99.89%的单元格准确率和100%的约束满足(经Clingo验证),其使用的贪婪约束解码过程无需外部求解器。在加数个数$N \in \{2, 4, 8\}$的MNIST加法任务中,AS2在所有规模上都实现了超过99.7%的数字准确率。这些结果表明,软性可微分不动点算子与约束感知注意力和声明式约束规范相结合,可以匹配或超越基于管道和求解器的神经-符号系统,同时保持完全的端到端可微分性。
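Under assumed notation (the abstract does not spell out its loss), the fixed-point training objective described for AS2 amounts to minimizing the residual of the probabilistic lift $\hat{T}_P$ of the immediate consequence operator:

```latex
\mathcal{L}(\theta) \;=\; \bigl\| \hat{T}_P\!\left(\mathbf{p}_\theta\right) - \mathbf{p}_\theta \bigr\|_2^2
```

where $\mathbf{p}_\theta$ collects the network's per-position probability distributions over the finite symbol domain. Under this reading, a distribution consistent with the program is a fixed point of $\hat{T}_P$, so the residual vanishes exactly when the soft constraint check is satisfied, and gradients of $\mathcal{L}$ flow back into the perception encoder without invoking any external solver.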
cs.AI / 26 / 2603.18462
AlignMamba-2: Enhancing Multimodal Fusion and Sentiment Analysis with Modality-Aware Mamba
AlignMamba-2:基于Modality-Aware Mamba的多模态融合与情感分析增强方法
Abstract
In the era of large-scale pre-trained models, effectively adapting general knowledge to specific affective computing tasks remains a challenge, particularly regarding computational efficiency and multimodal heterogeneity. While Transformer-based methods have excelled at modeling inter-modal dependencies, their quadratic computational complexity limits their use with long-sequence data. Mamba-based models have emerged as a computationally efficient alternative; however, their inherent sequential scanning mechanism struggles to capture the global, non-sequential relationships that are crucial for effective cross-modal alignment. To address these limitations, we propose \textbf{AlignMamba-2}, an effective and efficient framework for multimodal fusion and sentiment analysis. Our approach introduces a dual alignment strategy that regularizes the model using both Optimal Transport distance and Maximum Mean Discrepancy, promoting geometric and statistical consistency between modalities without incurring any inference-time overhead. More importantly, we design a Modality-Aware Mamba layer, which employs a Mixture-of-Experts architecture with modality-specific and modality-shared experts to explicitly handle data heterogeneity during the fusion process. Extensive experiments on four challenging benchmarks, including dynamic time-series (on the CMU-MOSI and CMU-MOSEI datasets) and static image-related tasks (on the NYU-Depth V2 and MVSA-Single datasets), demonstrate that AlignMamba-2 establishes a new state-of-the-art in both effectiveness and efficiency across diverse pattern recognition tasks, ranging from dynamic time-series analysis to static image-text classification.
Chinese Translation
在大规模预训练模型时代,如何有效地将通用知识适配到特定的情感计算任务中,尤其是在计算效率和多模态异质性方面,依然是一个挑战。尽管基于Transformer的方法在建模模态间依赖关系方面表现优异,但其二次方计算复杂度限制了其在长序列数据上的应用。基于Mamba的模型作为一种计算高效的替代方案出现,然而其固有的顺序扫描机制难以捕捉对有效跨模态对齐至关重要的全局非顺序关系。针对这些局限,我们提出了AlignMamba-2,一个高效且有效的多模态融合与情感分析框架。该方法引入双重对齐策略,通过最优传输距离(Optimal Transport distance)和最大平均差异(Maximum Mean Discrepancy)对模型进行正则化,促进模态间几何和统计上的一致性,且不增加任何推理时开销。更重要的是,我们设计了一个Modality-Aware Mamba层,该层采用专家混合(Mixture-of-Experts)架构,包含模态专属专家与模态共享专家,在融合过程中显式处理数据异质性。在包括动态时间序列(CMU-MOSI和CMU-MOSEI数据集)和静态图像相关任务(NYU-Depth V2和MVSA-Single数据集)的四个挑战性基准上进行的大量实验表明,AlignMamba-2在动态时间序列分析到静态图文分类的多样化模式识别任务中,均在有效性和效率上建立了新的最先进水平。
cs.AI / 27 / 2603.18472
Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
多模态大型语言模型在离散符号理解中的认知不匹配
Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols -- the fundamental building blocks of human cognition -- remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.
Chinese Translation
尽管多模态大型语言模型(MLLMs)在解读自然场景方面取得了显著成功,但它们处理离散符号的能力——作为人类认知基本构件的符号——仍然是一个重要的未解之谜。与连续的视觉数据不同,数学公式、化学结构和语言字符等符号需要更精确、更深入的解释。本文提出了一套全面的基准测试,以评估顶级 MLLMs 如何在语言、文化、数学、物理和化学五个领域中导航这些“离散语义空间”。我们的研究揭示了一个反直觉的现象:模型常常在基本符号识别上失败,却能在复杂推理任务中取得成功,这表明它们依赖于语言概率而非真实的视觉感知。通过揭示这种“认知不匹配”,我们强调了当前人工智能能力中的一个重要缺口:对支撑科学发现和抽象思维的符号语言的真实感知和理解的挣扎。这项工作为开发更严谨、与人类对齐的智能系统提供了一个路线图。
cs.AI / 28 / 2603.18495
Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning
跨域示范到代码的神经符号反事实推理
Abstract
Recent advances in Vision-Language Models (VLMs) have enabled video-instructed robotic programming, allowing agents to interpret video demonstrations and generate executable control code. We formulate video-instructed robotic programming as a cross-domain adaptation problem, where perceptual and physical differences between demonstration and deployment induce procedural mismatches. However, current VLMs lack the procedural understanding needed to reformulate causal dependencies and achieve task-compatible behavior under such domain shifts. We introduce NeSyCR, a neurosymbolic counterfactual reasoning framework that enables verifiable adaptation of task procedures, providing a reliable synthesis of code policies. NeSyCR abstracts video demonstrations into symbolic trajectories that capture the underlying task procedure. Given deployment observations, it derives counterfactual states that reveal cross-domain incompatibilities. By exploring the symbolic state space with verifiable checks, NeSyCR proposes procedural revisions that restore compatibility with the demonstrated procedure. NeSyCR achieves a 31.14% improvement in task success over the strongest baseline Statler, showing robust cross-domain adaptation across both simulated and real-world manipulation tasks.
Chinese Translation
近期视觉-语言模型(Vision-Language Models,VLMs)的进展使得视频指导的机器人编程成为可能,这允许智能体理解视频演示并生成可执行的控制代码。我们将视频指导的机器人编程形式化为一个跨域适应问题,其中演示与部署之间的感知和物理差异导致了程序性不匹配。然而,目前的VLMs缺乏所需的程序理解,无法重新构造因果依赖关系,从而在这样的域转移下实现任务兼容的行为。我们提出了NeSyCR,一个神经符号反事实推理框架,它能够实现任务程序的可验证适应,提供可靠的代码策略合成。NeSyCR将视频演示抽象为捕捉基础任务程序的符号轨迹。根据部署观察,它推导出揭示跨域不兼容性的反事实状态。通过可验证检查探索符号状态空间,NeSyCR提出程序修订,恢复与演示程序的兼容性。NeSyCR在任务成功率上比最强基线Statler提升了31.14%,显示出在模拟和现实操控任务中的强健跨域适应能力。
cs.AI / 29 / 2603.18507
Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM
专家角色改善大型语言模型的对齐性但损害准确性:基于意图的角色路由自启动与PRISM
Abstract
Persona prompting can steer LLM generation towards a domain-specific tone and pattern. This behavior enables use cases in multi-agent systems where diverse interactions are crucial and human-centered tasks require high-level human alignment. Prior works provide mixed opinions on their utility: some report performance gains when using expert personas for certain domains and their contribution to data diversity in synthetic data creation, while others find near-zero or negative impact on general utility. To fully leverage the benefits of the LLM persona and avoid its harmfulness, a more comprehensive investigation of the mechanism is crucial. In this work, we study how model optimization, task type, prompt length, and placement can impact expert persona effectiveness across instruction-tuned and reasoning LLMs, and provide insight into conditions under which expert personas fail and succeed. Based on our findings, we developed a pipeline to fully leverage the benefits of an expert persona, named PRISM (Persona Routing via Intent-based Self-Modeling), which self-distills an intent-conditioned expert persona into a gated LoRA adapter through a bootstrapping process that requires no external data, models, or knowledge. PRISM enhances human preference and safety alignment on generative tasks while maintaining accuracy on discriminative tasks across all models, with minimal memory and computing overhead.
Chinese Translation
角色提示可以引导大型语言模型(LLM)的生成朝向特定领域的语气和模式。这种行为使得在多智能体系统中的应用成为可能,在这些系统中,多样化的互动至关重要,而以人为中心的任务则需要高水平的人类对齐。先前的研究对其效用提供了不同的看法:一些研究报告在某些领域使用专家角色时的性能提升及其对合成数据创建中数据多样性的贡献,而另一些研究则发现对一般效用几乎没有或产生负面影响。为了充分利用LLM角色的优势并避免其有害性,对该机制进行更全面的研究至关重要。在本研究中,我们探讨了模型优化、任务类型、提示长度和位置如何影响专家角色在指令调优和推理LLM中的有效性,并提供了专家角色失败和成功的条件洞见。基于我们的发现,我们开发了一个管道,以充分利用专家角色的优势,命名为PRISM(基于意图的自我建模角色路由),该管道通过自启动过程将意图条件的专家角色自我提炼为一个门控的LoRA适配器,无需外部数据、模型或知识。PRISM在生成任务上增强了人类偏好和安全对齐,同时在所有模型的区分任务上保持准确性,且内存和计算开销最小。
cs.AI / 30 / 2603.18528
Correlation-Weighted Multi-Reward Optimization for Compositional Generation
基于相关性加权的多奖励优化用于组合生成
Abstract
Text-to-image models produce images that align well with natural language prompts, but compositional generation has long been a central challenge. Models often struggle to satisfy multiple concepts within a single prompt, frequently omitting some concepts and resulting in partial success. Such failures highlight the difficulty of jointly optimizing multiple concepts during reward optimization, where competing concepts can interfere with one another. To address this limitation, we propose Correlation-Weighted Multi-Reward Optimization (\ours), a framework that leverages the correlation structure among concept rewards to adaptively weight each attribute concept in optimization. By accounting for interactions among concepts, \ours balances competing reward signals and emphasizes concepts that are partially satisfied yet inconsistently generated across samples, improving compositional generation. Specifically, we decompose multi-concept prompts into pre-defined concept groups (\eg, objects, attributes, and relations) and obtain reward signals from dedicated reward models for each concept. We then adaptively reweight these rewards, assigning higher weights to conflicting or hard-to-satisfy concepts using correlation-based difficulty estimation. By focusing optimization on the most challenging concepts within each group, \ours encourages the model to consistently satisfy all requested attributes simultaneously. We apply our approach to train state-of-the-art diffusion models, SD3.5 and FLUX.1-dev, and demonstrate consistent improvements on challenging multi-concept benchmarks, including ConceptMix, GenEval 2, and T2I-CompBench.
Chinese Translation
文本到图像模型能够生成与自然语言提示高度一致的图像,但组合生成长期以来一直是一个核心挑战。模型通常难以在单一提示中满足多个概念,常常遗漏某些概念,导致部分成功。这些失败突显了在奖励优化过程中联合优化多个概念的困难,因为竞争概念之间可能相互干扰。为了解决这一限制,我们提出了相关性加权多奖励优化(Correlation-Weighted Multi-Reward Optimization)框架,该框架利用概念奖励之间的相关性结构,动态调整优化中每个属性概念的权重。通过考虑概念之间的相互作用,该框架平衡了竞争的奖励信号,并强调那些部分满足但在样本中生成不一致的概念,从而改善组合生成。具体而言,我们将多概念提示分解为预定义的概念组(例如,物体、属性和关系),并为每个概念从专门的奖励模型中获取奖励信号。然后,我们动态地重新加权这些奖励,使用基于相关性的难度估计为冲突或难以满足的概念分配更高的权重。通过将优化重点放在每个组中最具挑战性的概念上,该框架鼓励模型同时一致地满足所有请求的属性。我们将该方法应用于训练最先进的扩散模型SD3.5和FLUX.1-dev,并在包括ConceptMix、GenEval 2和T2I-CompBench在内的具有挑战性的多概念基准上展示了一致的改进。
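The correlation-based reweighting idea above can be sketched on a toy reward matrix. This is a hedged reconstruction of the general idea only; the paper's actual difficulty estimator is not specified here, and the `concept_weights` formula (difficulty plus clipped negative correlation) is an assumption.

```python
import numpy as np

# Illustrative reweighting: concepts that are hard to satisfy (low mean
# reward) or that conflict with other concepts (negative correlation of
# rewards across samples) receive larger weights in the aggregate.

def concept_weights(rewards: np.ndarray) -> np.ndarray:
    """rewards: (n_samples, n_concepts) per-concept scores in [0, 1]."""
    difficulty = 1.0 - rewards.mean(axis=0)              # hard-to-satisfy term
    corr = np.corrcoef(rewards, rowvar=False)
    # mean correlation with the other concepts; negative => conflicting
    off_diag = (corr.sum(axis=0) - 1.0) / (corr.shape[0] - 1)
    conflict = np.clip(-off_diag, 0.0, None)
    w = difficulty + conflict
    return w / w.sum()

rng = np.random.default_rng(0)
easy = rng.uniform(0.8, 1.0, size=(256, 1))   # nearly always satisfied
hard = rng.uniform(0.0, 0.6, size=(256, 1))   # often missed
rewards = np.hstack([easy, hard, 1.0 - hard]) # third concept conflicts with second
w = concept_weights(rewards)
print(w)   # the hard and conflicting concepts dominate the weights
```

Aggregating per-concept rewards with such weights concentrates the optimization pressure on the concepts that currently fail, rather than letting an already-satisfied concept dominate the signal.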
cs.AI / 31 / 2603.18563
Reasonably reasoning AI agents can avoid game-theoretic failures in zero-shot, provably
合理推理的人工智能代理可以在零样本情况下避免博弈论失败
Abstract
AI agents are increasingly deployed in interactive economic environments characterized by repeated AI-AI interactions. Despite AI agents' advanced capabilities, empirical studies reveal that such interactions often fail to stably induce a strategic equilibrium, such as a Nash equilibrium. Post-training methods have been proposed to induce a strategic equilibrium; however, it remains impractical to uniformly apply an alignment method across diverse, independently developed AI models in strategic settings. In this paper, we provide theoretical and empirical evidence that off-the-shelf reasoning AI agents can achieve Nash-like play zero-shot, without explicit post-training. Specifically, we prove that `reasonably reasoning' agents, i.e., agents capable of forming beliefs about others' strategies from previous observation and learning to best respond to these beliefs, eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium of the continuation game. In addition, we relax the common-knowledge payoff assumption by allowing stage payoffs to be unknown and by having each agent observe only its own privately realized stochastic payoffs, and we show that we can still achieve the same on-path Nash convergence guarantee. We then empirically validate the proposed theories by simulating five game scenarios, ranging from a repeated prisoner's dilemma game to stylized repeated marketing promotion games. Our findings suggest that AI agents naturally exhibit such reasoning patterns and therefore attain stable equilibrium behaviors intrinsically, obviating the need for universal alignment procedures in many real-world strategic interactions.
Chinese Translation
人工智能代理越来越多地被部署在以重复的人工智能-人工智能互动为特征的互动经济环境中。尽管人工智能代理具备先进的能力,实证研究表明,这种互动往往无法稳定地诱导出战略均衡,例如纳什均衡。已经提出了后训练方法以诱导战略均衡;然而,在战略环境中,统一地应用对齐方法于多样且独立开发的人工智能模型仍然不切实际。在本文中,我们提供了理论和实证证据,表明现成的推理人工智能代理可以在零样本情况下实现类似纳什的博弈,而无需明确的后训练。具体而言,我们证明了“合理推理”代理,即能够根据先前观察形成对他人策略的信念并学习最佳响应这些信念的代理,最终在几乎每条实现的博弈路径上表现出与延续博弈的纳什均衡弱接近的行为。此外,我们通过允许阶段收益未知并让每个代理仅观察其自身私有实现的随机收益,放宽了共同知识收益的假设,并展示了我们仍然可以实现相同的路径上的纳什收敛保证。然后,我们通过模拟五种博弈场景进行实证验证,从重复囚徒困境博弈到风格化的重复营销推广博弈。我们的发现表明,人工智能代理自然展现出这种推理模式,因此在许多现实世界的战略互动中内在地达到稳定的均衡行为,从而消除了对普遍对齐程序的需求。
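The "form beliefs from observations, then best respond" loop the abstract describes is close to classical fictitious play. A toy sketch on a repeated prisoner's dilemma (the payoff numbers are illustrative, not taken from the paper):

```python
import numpy as np

# Stage-game payoffs for the row player in a prisoner's dilemma
# (action 0 = cooperate, 1 = defect); the game is symmetric.
PAYOFF = np.array([[3.0, 0.0],
                   [5.0, 1.0]])

def best_response(belief):
    """Best respond to a belief (probability distribution) over the
    opponent's actions -- the 'reasonably reasoning' step."""
    return int(np.argmax(PAYOFF @ belief))

def fictitious_play(rounds=50):
    counts = [np.ones(2), np.ones(2)]   # Laplace-smoothed action counts
    history = []
    for _ in range(rounds):
        # Each agent's belief = empirical frequency of the other's past play.
        beliefs = [counts[1] / counts[1].sum(), counts[0] / counts[0].sum()]
        actions = [best_response(beliefs[0]), best_response(beliefs[1])]
        for i, a in enumerate(actions):
            counts[i][a] += 1
        history.append(tuple(actions))
    return history
```

Because defection strictly dominates in this stage game, play settles immediately on the (defect, defect) Nash equilibrium, illustrating the kind of on-path convergence the theorem formalizes.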
cs.AI / 32 / 2603.18571
CAPSUL: A Comprehensive Human Protein Benchmark for Subcellular Localization
CAPSUL:一个全面的人类蛋白质亚细胞定位基准
Abstract
Subcellular localization is a crucial biological task for drug target identification and function annotation. Although it has been biologically realized that subcellular localization is closely associated with protein structure, no existing dataset offers comprehensive 3D structural information with detailed subcellular localization annotations, thus severely hindering the application of promising structure-based models on this task. To address this gap, we introduce a new benchmark called $\mathbf{CAPSUL}$, a $\mathbf{C}$omprehensive hum$\mathbf{A}$n $\mathbf{P}$rotein benchmark for $\mathbf{SU}$bcellular $\mathbf{L}$ocalization. It features a dataset that integrates diverse 3D structural representations with fine-grained subcellular localization annotations carefully curated by domain experts. We evaluate this benchmark using a variety of state-of-the-art sequence-based and structure-based models, showcasing the importance of involving structural features in this task. Furthermore, we explore reweighting and single-label classification strategies to facilitate future investigation on structure-based methods for this task. Lastly, we showcase the powerful interpretability of structure-based methods through a case study on the Golgi apparatus, where we discover a decisive localization pattern, the $\alpha$-helix, from attention mechanisms, demonstrating the potential for bridging the gap with intuitive biological interpretability and paving the way for data-driven discoveries in cell biology.
Chinese Translation
亚细胞定位是药物靶点识别和功能注释的重要生物任务。尽管生物学上已认识到亚细胞定位与蛋白质结构密切相关,但现有的数据集并未提供全面的三维结构信息以及详细的亚细胞定位注释,这严重阻碍了基于结构的模型在该任务中的应用。为了解决这一问题,我们引入了一个新的基准,称为$\textbf{CAPSUL}$,即一个$\textbf{C}$omprehensive hum$\textbf{A}$n $\textbf{P}$rotein基准,用于$\textbf{SU}$bcellular $\textbf{L}$ocalization。该数据集整合了多样的三维结构表示和由领域专家精心策划的细粒度亚细胞定位注释。我们使用多种最先进的基于序列和基于结构的模型对该基准进行了评估,展示了在此任务中涉及结构特征的重要性。此外,我们探索了重加权和单标签分类策略,以促进未来对基于结构的方法的研究。最后,我们通过对高尔基体的案例研究展示了基于结构的方法的强大可解释性,在此过程中,我们从注意力机制中发现了一种决定性的定位模式$\alpha$-螺旋,证明了在直观生物可解释性与数据驱动的细胞生物学发现之间架起桥梁的潜力。
cs.AI / 33 / 2603.18573
Interplay: Training Independent Simulators for Reference-Free Conversational Recommendation
交互作用:训练独立的模拟器以实现无参考的对话推荐
Abstract
Training conversational recommender systems (CRS) requires extensive dialogue data, which is challenging to collect at scale. To address this, researchers have used simulated user-recommender conversations. Traditional simulation approaches often utilize a single large language model (LLM) that generates entire conversations with prior knowledge of the target items, leading to scripted and artificial dialogues. We propose a reference-free simulation framework that trains two independent LLMs, one as the user and one as the conversational recommender. These models interact in real time without access to predetermined target items, relying only on preference summaries and target attributes, which enables the recommender to genuinely infer user preferences through dialogue. This approach produces more realistic and diverse conversations that closely mirror authentic human-AI interactions. Our reference-free simulators match or exceed existing methods in quality, while offering a scalable solution for generating high-quality conversational recommendation data without constraining conversations to pre-defined target items. We conduct both quantitative and human evaluations to confirm the effectiveness of our reference-free approach.
Chinese Translation
训练对话推荐系统(CRS)需要大量对话数据,这在大规模收集上具有挑战性。为了解决这个问题,研究人员使用了模拟的用户-推荐器对话。传统的模拟方法通常利用单一的大型语言模型(LLM)生成完整的对话,同时预先了解目标项目,从而导致对话显得脚本化和人工化。我们提出了一种无参考的模拟框架,训练两个独立的LLM,一个作为用户,一个作为对话推荐器。这些模型在实时互动中工作,无法访问预先确定的目标项目,仅依赖于偏好摘要和目标属性,从而使推荐器能够通过对话真实推断用户偏好。这种方法产生的对话更加真实且多样,能够更好地反映真实的人机互动。我们的无参考模拟器在质量上与现有方法持平或超过,同时为生成高质量对话推荐数据提供了一种可扩展的解决方案,而无需将对话限制在预定义的目标项目上。我们进行了定量和人工评估,以确认我们无参考方法的有效性。
cs.AI / 34 / 2603.18577
MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning
MedForge:通过伪造意识推理实现可解释的医学深伪检测
Abstract
Text-guided image editors can now manipulate authentic medical scans with high fidelity, enabling lesion implantation/removal that threatens clinical trust and safety. Existing defenses are inadequate for healthcare. Medical detectors are largely black-box, while MLLM-based explainers are typically post-hoc, lack medical expertise, and may hallucinate evidence on ambiguous cases. We present MedForge, a data-and-method solution for pre-hoc, evidence-grounded medical forgery detection. We introduce MedForge-90K, a large-scale benchmark of realistic lesion edits across 19 pathologies with expert-guided reasoning supervision via doctor inspection guidelines and gold edit locations. Building on it, MedForge-Reasoner performs localize-then-analyze reasoning, predicting suspicious regions before producing a verdict, and is further aligned with Forgery-aware GSPO to strengthen grounding and reduce hallucinations. Experiments demonstrate state-of-the-art detection accuracy and trustworthy, expert-aligned explanations.
Chinese Translation
文本引导的图像编辑器现在能够高保真地操纵真实的医学扫描,允许病变的植入/移除,这威胁到临床信任和安全。现有的防御措施对于医疗保健来说是不够的。医学检测器大多是黑箱,而基于MLLM的解释器通常是事后解释,缺乏医学专业知识,并可能在模糊案例中产生幻觉证据。我们提出了MedForge,这是一种基于数据和方法的解决方案,用于事前、基于证据的医学伪造检测。我们介绍了MedForge-90K,这是一个涵盖19种病理的大规模真实病变编辑基准,具有通过医生检查指南和黄金编辑位置进行的专家指导推理监督。在此基础上,MedForge-Reasoner执行定位-再分析推理,预测可疑区域,然后做出裁决,并进一步与伪造意识GSPO对齐,以增强基础和减少幻觉。实验结果表明,检测准确率达到了最先进水平,并提供了可信赖的、与专家对齐的解释。
cs.AI / 35 / 2603.18614
ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs
ZEBRAARENA:用于研究工具增强大语言模型中推理-行动耦合的诊断仿真环境
Abstract
Tool-augmented large language models (LLMs) must tightly couple multi-step reasoning with external actions, yet existing benchmarks often confound this interplay with complex environment dynamics, memorized knowledge or dataset contamination. In this paper, we introduce ZebraArena, a procedurally generated diagnostic environment for studying reasoning-action coupling in tool-augmented LLMs, with controllable difficulty and a knowledge-minimal design, which limits gains from memorization or dataset contamination. Each task in ZebraArena requires a set of critical information which is available only through targeted tool use, yielding an interpretable interface between external information acquisition and deductive reasoning. This design provides deterministic evaluation via unique solutions, and a theoretical optimal query count for measuring efficient tool use. We show that ZebraArena requires a combination of in-depth reasoning and accurate external tool calling, which remains a challenge as frontier reasoning models such as GPT-5 and Gemini 2.5 Pro achieve only 60% accuracy on the hard instances. We also observe a persistent gap between theoretical optimality and practical tool usage. For example, GPT-5 uses 70-270% more tool calls than the theoretical optimum. We highlight the key findings in our evaluation, and hope ZebraArena stimulates further research on the interplay between internal reasoning and external action.
Chinese Translation
工具增强的大型语言模型(LLMs)必须将多步骤推理与外部行动紧密耦合,然而现有基准往往将这种相互作用与复杂的环境动态、记忆知识或数据集污染混合在一起。在本文中,我们介绍了ZebraArena,这是一个程序生成的诊断环境,用于研究工具增强LLMs中的推理-行动耦合,具有可控的难度和知识最小化设计,从而限制了通过记忆或数据集污染获得的收益。ZebraArena中的每个任务都需要一组关键信息,而这些信息仅可通过针对性的工具使用获取,形成了外部信息获取与推导推理之间可解释的接口。这种设计通过独特的解决方案提供了确定性评估,并为衡量有效的工具使用提供了理论最优查询次数。我们展示了ZebraArena要求深入的推理和准确的外部工具调用的组合,而这一点仍然是一个挑战,因为前沿推理模型如GPT-5和Gemini 2.5 Pro在困难实例上仅实现了60%的准确率。我们还观察到理论最优性与实际工具使用之间的持续差距。例如,GPT-5的工具调用比理论最优值多出70-270%。我们在评估中强调了关键发现,并希望ZebraArena能够激发对内部推理与外部行动之间相互作用的进一步研究。
cs.AI / 36 / 2603.18627
Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation
代理流引导与并行展开搜索用于空间基础的文本到图像生成
Abstract
Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and the error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities during the Ordinary Differential Equation trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism, which leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that our AFS-Search-Pro greatly boosts the performance of the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. Meanwhile, AFS-Search-Fast also significantly enhances performance while maintaining fast generation speed.
Chinese Translation
精确的文本到图像(T2I)生成已取得显著成功,但受到静态文本编码器有限的关系推理能力和开放式采样中误差累积的制约。在没有实时反馈的情况下,常微分方程轨迹中的初始语义模糊不可避免地演变为与空间约束的随机偏差。为了解决这一问题,我们提出了AFS-Search(代理流引导与并行展开搜索),这是一个基于FLUX.1-dev构建的无训练闭环框架。AFS-Search结合了无训练的闭环并行展开搜索和流引导机制,利用视觉-语言模型(VLM)作为语义评论者来诊断中间潜变量,并通过精确的空间基础动态引导速度场。此外,我们将T2I生成形式化为一个顺序决策过程,通过前瞻性模拟探索多个轨迹,并根据VLM指导的奖励选择最佳路径。此外,我们提供了AFS-Search-Pro以提高性能,以及AFS-Search-Fast以实现更快的生成。实验结果表明,我们的AFS-Search-Pro显著提升了原始FLUX.1-dev的性能,在三个不同的基准测试中实现了最先进的结果。同时,AFS-Search-Fast也显著增强了性能,同时保持快速的生成速度。
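The lookahead-and-select loop the abstract describes can be sketched abstractly. Here `step` and `critic` are placeholders for the denoising transition and the VLM judge; none of this is the paper's code:

```python
def rollout_search(init_state, step, critic, n_rollouts=4, depth=3):
    """Parallel rollout search (sketch): from the current state, simulate
    several lookahead trajectories, score each terminal state with a
    critic (standing in for the VLM reward), and keep the best path."""
    best_path, best_score = None, float("-inf")
    for seed in range(n_rollouts):
        path, state = [], init_state
        for t in range(depth):
            state = step(state, seed, t)   # one simulated transition step
            path.append(state)
        score = critic(state)
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score
```

The same skeleton covers both variants in the abstract: a "Pro" configuration would raise `n_rollouts`/`depth`, a "Fast" one would shrink them.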
cs.AI / 37 / 2603.18631
D-Mem: A Dual-Process Memory System for LLM Agents
D-Mem:一种用于大语言模型代理的双重过程记忆系统
Abstract
Driven by the development of persistent, self-adapting autonomous agents, equipping these systems with high-fidelity memory access for long-horizon reasoning has emerged as a critical requirement. However, prevalent retrieval-based memory frameworks often follow an incremental processing paradigm that continuously extracts and updates conversational memories into vector databases, relying on semantic retrieval when queried. While this approach is fast, it inherently relies on lossy abstraction, frequently missing contextually critical information and struggling to resolve queries that rely on fine-grained contextual understanding. To address this, we introduce D-Mem, a dual-process memory system. It retains lightweight vector retrieval for routine queries while establishing an exhaustive Full Deliberation module as a high-fidelity fallback. To achieve cognitive economy without sacrificing accuracy, D-Mem employs a Multi-dimensional Quality Gating policy to dynamically bridge these two processes. Experiments on the LoCoMo and RealTalk benchmarks using GPT-4o-mini and Qwen3-235B-Instruct demonstrate the efficacy of our approach. Notably, our Multi-dimensional Quality Gating policy achieves an F1 score of 53.5 on LoCoMo with GPT-4o-mini. This outperforms our static retrieval baseline, Mem0$^\ast$ (51.2), and recovers 96.7\% of the Full Deliberation's performance (55.3), while incurring significantly lower computational costs.
Chinese Translation
随着持久、自适应自主代理的发展,为这些系统提供高保真度的记忆访问以支持长期推理逐渐成为一项关键需求。然而,普遍的基于检索的记忆框架往往遵循增量处理的范式,持续地将对话记忆提取并更新到向量数据库中,并在查询时依赖语义检索。尽管这种方法速度快,但它本质上依赖于有损的抽象,常常遗漏上下文中的关键信息,并且在处理依赖细粒度上下文理解的查询时表现不佳。为了解决这个问题,我们提出了 D-Mem,一种双重过程记忆系统。它在常规查询中保留轻量级的向量检索,同时建立一个详尽的全面深思模块作为高保真的后备。为了在不牺牲准确性的情况下实现认知经济,D-Mem 采用多维质量门控策略动态地桥接这两个过程。在使用 GPT-4o-mini 和 Qwen3-235B-Instruct 的 LoCoMo 和 RealTalk 基准测试中进行的实验表明了我们方法的有效性。值得注意的是,我们的多维质量门控策略在 LoCoMo 上达到 53.5 的 F1 分数,超越了我们的静态检索基线 Mem0$^\ast$(51.2),并恢复了全面深思性能的 96.7%(55.3),同时计算成本显著降低。
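The dual-process routing the abstract describes can be sketched as a gate over per-dimension quality scores. The score names and thresholds here are hypothetical, not the paper's gating policy:

```python
def quality_gate(scores, thresholds):
    """Multi-dimensional quality gating (sketch, hypothetical dimensions):
    accept the fast retrieval answer only if every quality dimension
    clears its threshold; otherwise escalate."""
    return all(scores[k] >= thresholds[k] for k in thresholds)

def answer(query, fast_path, slow_path, thresholds):
    """Route a query: try lightweight vector retrieval first, fall back
    to the exhaustive full-deliberation module when the gate rejects."""
    result, scores = fast_path(query)
    if quality_gate(scores, thresholds):
        return result, "fast"
    return slow_path(query), "deliberate"
```

The gate is what buys "cognitive economy": the expensive path only runs on the minority of queries whose fast answers look unreliable.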
cs.AI / 38 / 2603.18633
An Onto-Relational-Sophic Framework for Governing Synthetic Minds
治理合成智能的本体-关系-智慧框架
Abstract
The rapid evolution of artificial intelligence, from task-specific systems to foundation models exhibiting broad, flexible competence across reasoning, creative synthesis, and social interaction, has outpaced the conceptual and governance frameworks designed to manage it. Current regulatory paradigms, anchored in a tool-centric worldview, address algorithmic bias and transparency but leave unanswered foundational questions about what increasingly capable synthetic minds are, how societies should relate to them, and the normative principles that should guide their development. Here we introduce the Onto-Relational-Sophic (ORS) framework, grounded in Cyberism philosophy, which offers integrated answers to these challenges through three pillars: (1) a Cyber-Physical-Social-Thinking (CPST) ontology that defines the mode of being for synthetic minds as irreducibly multi-dimensional rather than purely computational; (2) a graded spectrum of digital personhood providing a pragmatic relational taxonomy beyond binary person-or-tool classifications; and (3) Cybersophy, a wisdom-oriented axiology synthesizing virtue ethics, consequentialism, and relational approaches to guide governance. We apply the framework to emergent scenarios including autonomous research agents, AI-mediated healthcare, and agentic AI ecosystems, demonstrating its capacity to generate proportionate, adaptive governance recommendations. The ORS framework charts a path from narrow technical alignment toward comprehensive philosophical foundations for the synthetic minds already among us.
Chinese Translation
人工智能的快速发展,从特定任务系统到展现广泛、灵活能力的基础模型,涵盖推理、创造性综合和社会互动,已超越了为其管理而设计的概念和治理框架。目前的监管范式以工具为中心,虽然解决了算法偏见和透明性的问题,但对日益强大的合成智能的本质、社会应如何与其关系以及应指导其发展的规范原则等基础性问题仍未给出答案。在此,我们提出了基于网络主义哲学的本体-关系-智慧(Onto-Relational-Sophic, ORS)框架,通过三个支柱为这些挑战提供综合答案:(1) 网络-物理-社会-思维(Cyber-Physical-Social-Thinking, CPST)本体,定义合成智能的存在模式为不可简化的多维度,而非纯粹的计算;(2) 数字人格的分级光谱,提供超越二元人格或工具分类的务实关系分类;(3) 网络智慧(Cybersophy),一种以智慧为导向的价值论,综合了德性伦理学、结果主义和关系方法,以指导治理。我们将该框架应用于新兴场景,包括自主研究代理、AI介导的医疗保健和代理性AI生态系统,展示其生成适度、适应性治理建议的能力。ORS框架为从狭义技术对齐走向合成智能的全面哲学基础指明了方向。
cs.AI / 39 / 2603.18656
Balanced Thinking: Improving Chain of Thought Training in Vision Language Models
平衡思维:改善视觉语言模型中的思维链训练
Abstract
Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long reasoning traces overshadow short but task-critical answer segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the reasoning segment, SCALe-SFT gradually shifts the focus from the reasoning segment to the answer segment throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.
Chinese Translation
视觉语言模型(VLMs)中的多模态推理通常依赖于一个两阶段的过程:监督微调(SFT)和强化学习(RL)。在标准的SFT中,所有标记对损失的贡献是相等的,尽管推理数据本质上是标记不平衡的。长的推理轨迹掩盖了短但对任务至关重要的答案段,导致冗长的推理和不准确的答案。我们提出了SCALe(Scheduled Curriculum Adaptive Loss),它通过动态的、与长度无关的加权,明确区分对推理和答案段的监督。与普通的SFT不同,SCALe-SFT在训练过程中通过余弦调度策略逐渐将重点从推理段转移到答案段,鼓励简洁且有根据的推理。我们在不同的基准和架构上评估了SCALe。结果表明,SCALe在准确性上始终优于普通SFT,并且与完整的两阶段SFT + GRPO管道的性能相当,同时仅需约七分之一的训练时间,使其成为一种轻量且有效的替代方案。当与GRPO结合时,SCALe实现了最佳的整体性能,突显了其作为独立方法和强化细化的强大基础的价值。
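One plausible reading of "length-independent weighting with a cosine schedule" is the sketch below; the exact schedule shape and normalization are assumptions, not the paper's loss:

```python
import math

def scale_weights(n_reason, n_answer, step, total_steps):
    """SCALe-style token weights (sketch): each segment gets a
    length-independent share of the loss, and a cosine schedule moves
    emphasis from the reasoning segment to the answer segment."""
    # alpha decays 1 -> 0 over training
    alpha = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    w_reason = alpha / max(n_reason, 1)         # per-token, length-normalized
    w_answer = (1.0 - alpha) / max(n_answer, 1)
    return [w_reason] * n_reason + [w_answer] * n_answer
```

Dividing by segment length is what keeps a 100-token reasoning trace from drowning out a 5-token answer; the schedule then decides which segment's (normalized) share dominates at each step.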
cs.AI / 40 / 2603.18662
Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning
构建思维:视觉-文本交错几何推理的基准与策略优化
Abstract
Geometric reasoning inherently requires "thinking with constructions" -- the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.
Chinese Translation
几何推理本质上需要“构建思维”——动态操作视觉辅助工具,以弥合问题条件与解决方案之间的差距。然而,现有的多模态大型语言模型(MLLMs)主要局限于对静态图表的被动推理,缺乏何时以及如何构建有效视觉辅助工具的战略知识。为了解决这一问题,我们提出了一种视觉-文本交错思维链框架。我们首先介绍了GeoAux-Bench,这是第一个基准,包含4,334个几何问题,将文本构建步骤与真实的视觉更新对齐。我们的初步研究揭示了两个关键见解:(1)交错的视觉-文本辅助工具优于单一模态的对应物,后者无法无损捕捉几何协同;(2)有效的构建作为熵的减少者,与推理困惑度的降低有强相关性。在这些发现的基础上,我们提出了行动适用性策略优化(A2PO),这是一种强化学习范式,用于掌握战略构建。A2PO采用自适应奖励塑造,通过反事实采样来调节视觉辅助工具的时机和质量,以区分必要的与冗余的构建。实验表明,我们的方法使MLLMs能够利用选择性辅助构建,相较于强基线提高了3.51%的性能。代码和数据可在GitHub上获取。
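The counterfactual-sampling reward shaping described above can be sketched as a necessity estimate for each construction; the averaging and the cost term are illustrative assumptions, not A2PO's actual reward:

```python
def counterfactual_shaping(score_with, scores_without, cost=0.05):
    """Adaptive reward shaping via counterfactual sampling (sketch):
    a construction earns credit only for the accuracy it adds over
    rollouts that skip it; a small cost discourages redundant drawing."""
    baseline = sum(scores_without) / len(scores_without)
    necessity = score_with - baseline   # counterfactual advantage
    return necessity - cost
```

A construction that rescues otherwise-failing rollouts gets a positive shaped reward; one that merely repeats what the model could do anyway nets a small penalty, which is how necessary and redundant constructions are told apart.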
cs.AI / 41 / 2603.18676
MANAR: Memory-augmented Attention with Navigational Abstract Conceptual Representation
MANAR:具有导航抽象概念表示的记忆增强注意力
Abstract
The MANAR (Memory-augmented Attention with Navigational Abstract Conceptual Representation) contextualization layer generalizes standard multi-head attention (MHA) by instantiating the principles of Global Workspace Theory (GWT). While MHA enables unconstrained all-to-all communication, it lacks the functional bottleneck and global integration mechanisms hypothesized in cognitive models of consciousness. MANAR addresses this by implementing a central workspace through a trainable memory of abstract concepts and an Abstract Conceptual Representation (ACR). The architecture follows a two-stage logic that maps directly to GWT mechanics: (i) an integration phase, where retrieved memory concepts converge to form a collective "mental image" (the ACR) based on input stimuli; and (ii) a broadcasting phase, where this global state navigates and informs the contextualization of individual local tokens. We demonstrate that efficient linear-time scaling is a fundamental architectural byproduct of instantiating the GWT functional bottleneck, as routing global information through a constant-sized ACR resolves the quadratic complexity inherent in standard attention. MANAR is a compatible re-parameterization of MHA with identical semantic roles for its projections, enabling knowledge transfer from pretrained transformers via weight-copy and thus overcoming the adoption barriers of structurally incompatible linear-time alternatives. MANAR enables non-convex contextualization, synthesizing representations that provably lie outside the convex hull of input tokens - a mathematical reflection of the creative synthesis described in GWT. Empirical evaluations confirm that MANAR matches or exceeds strong baselines across language (GLUE score of 85.1), vision (83.9% ImageNet-1K), and speech (2.7% WER on LibriSpeech), positioning it as an efficient and expressive alternative to quadratic attention.
Chinese Translation
MANAR(具有导航抽象概念表示的记忆增强注意力)上下文化层通过实例化全球工作空间理论(Global Workspace Theory, GWT)的原则,概括了标准多头注意力(Multi-Head Attention, MHA)。虽然MHA允许不受限制的全对全通信,但它缺乏在意识的认知模型中假设的功能瓶颈和全球整合机制。MANAR通过实现一个可训练的抽象概念记忆和抽象概念表示(Abstract Conceptual Representation, ACR)来解决这一问题,构建了一个中央工作空间。该架构遵循一个直接映射到GWT机制的两阶段逻辑:(i) 整合阶段,在此阶段,检索到的记忆概念汇聚形成基于输入刺激的集体“心理图像”(ACR);(ii) 广播阶段,在此阶段,这一全球状态导航并告知个别局部标记的上下文化。我们证明,高效的线性时间扩展是实例化GWT功能瓶颈的一个基本架构副产品,因为通过固定大小的ACR路由全球信息解决了标准注意力中固有的二次复杂性。MANAR是MHA的兼容重参数化,其投影具有相同的语义角色,能够通过权重复制实现从预训练变换器的知识转移,从而克服结构上不兼容的线性时间替代方案的采用障碍。MANAR实现了非凸上下文化,合成的表示证明位于输入标记的凸包之外——这是GWT中描述的创造性合成的数学反映。实证评估确认,MANAR在语言(GLUE得分85.1)、视觉(83.9% ImageNet-1K)和语音(LibriSpeech上2.7% WER)方面匹配或超过强基准,将其定位为二次注意力的高效且富有表现力的替代方案。
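The integrate-then-broadcast pattern, and why its cost is O(n·m) rather than O(n²), can be sketched in NumPy. This is a single head with no learned projections or gating, so it is a simplification of the described architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def manar_layer(tokens, memory):
    """Two-stage GWT-style contextualization (sketch, single head):
    integrate: m memory concepts attend over the n tokens -> ACR (m, d);
    broadcast: each token attends over the constant-size ACR.
    Both stages cost O(n*m), linear in sequence length n."""
    d = tokens.shape[-1]
    # Integration phase: memory concepts as queries, tokens as keys/values.
    acr = softmax(memory @ tokens.T / np.sqrt(d)) @ tokens   # (m, d)
    # Broadcasting phase: tokens as queries, ACR as keys/values.
    return softmax(tokens @ acr.T / np.sqrt(d)) @ acr        # (n, d)
```

Because every token-to-token interaction is routed through the fixed-size ACR, the n×n score matrix of standard attention never materializes; that is the "functional bottleneck" yielding linear scaling.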
cs.AI / 42 / 2603.18712
Accurate and Efficient Multi-Channel Time Series Forecasting via Sparse Attention Mechanism
通过稀疏注意机制进行准确高效的多通道时间序列预测
Abstract
The task of multi-channel time series forecasting is ubiquitous in numerous fields such as finance, supply chain management, and energy planning. It is critical to effectively capture complex dynamic dependencies within and between channels for accurate predictions. However, traditional methods pay little attention to learning the interactions among channels. This paper proposes Linear-Network (Li-Net), a novel architecture designed for multi-channel time series forecasting that captures the linear and non-linear dependencies among channels. Li-Net dynamically compresses representations across sequence and channel dimensions, processes the information through a configurable non-linear module and subsequently reconstructs the forecasts. Moreover, Li-Net integrates a sparse Top-K Softmax attention mechanism within a multi-scale projection framework to address these challenges. A core innovation is its ability to seamlessly incorporate and fuse multi-modal embeddings, guiding the sparse attention process to focus on the most informative time steps and feature channels. Experimental results on multiple real-world benchmark datasets demonstrate that Li-Net achieves competitive performance compared to state-of-the-art baseline methods. Furthermore, Li-Net provides a superior balance between prediction accuracy and computational burden, exhibiting significantly lower memory usage and faster inference times. Detailed ablation studies and parameter sensitivity analyses validate the effectiveness of each key component in our proposed architecture. Keywords: Multivariate Time Series Forecasting, Sparse Attention Mechanism, Multimodal Information Fusion, Non-linear relationship
Chinese Translation
多通道时间序列预测任务在金融、供应链管理和能源规划等多个领域中普遍存在。有效捕捉通道内及通道之间复杂的动态依赖关系对于准确预测至关重要。然而,传统方法对通道间交互的学习关注甚少。本文提出了一种新颖的架构——线性网络(Linear-Network,Li-Net),旨在进行多通道时间序列预测,能够捕捉通道之间的线性和非线性依赖关系。Li-Net 在序列和通道维度上动态压缩表示,通过可配置的非线性模块处理信息,并随后重建预测。此外,Li-Net 在多尺度投影框架中集成了一种稀疏 Top-K Softmax 注意机制,以应对这些挑战。其核心创新在于能够无缝整合和融合多模态嵌入,引导稀疏注意过程聚焦于最具信息量的时间步和特征通道。通过在多个真实世界基准数据集上的实验结果表明,Li-Net 相较于最先进的基线方法实现了竞争力的性能。此外,Li-Net 在预测准确性和计算负担之间提供了优越的平衡,展现出显著较低的内存使用和更快的推理时间。详细的消融研究和参数敏感性分析验证了我们提出的架构中每个关键组件的有效性。关键词:多变量时间序列预测,稀疏注意机制,多模态信息融合,非线性关系
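A minimal NumPy sketch of the sparse Top-K Softmax ingredient (row-wise over a score matrix; everything beyond what the abstract states is assumed):

```python
import numpy as np

def topk_softmax(scores, k):
    """Sparse Top-K softmax (sketch): keep only the k largest scores in
    each row and renormalize; all other entries get exactly zero weight."""
    scores = np.asarray(scores, dtype=float)
    out = np.zeros_like(scores)
    for i, row in enumerate(scores):
        idx = np.argpartition(row, -k)[-k:]     # indices of the top-k scores
        e = np.exp(row[idx] - row[idx].max())   # stable softmax over the top-k
        out[i, idx] = e / e.sum()
    return out
```

Zeroing the tail (rather than letting it carry tiny dense weights) is what concentrates attention on the most informative time steps and channels and cuts the memory footprint.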
cs.AI / 43 / 2603.18718
MemMA: Coordinating the Memory Cycle through Multi-Agent Reasoning and In-Situ Self-Evolution
MemMA:通过多智能体推理和现场自我演化协调记忆周期
Abstract
Memory-augmented LLM agents maintain external memory banks to support long-horizon interaction, yet most existing systems treat construction, retrieval, and utilization as isolated subroutines. This creates two coupled challenges: strategic blindness on the forward path of the memory cycle, where construction and retrieval are driven by local heuristics rather than explicit strategic reasoning, and sparse, delayed supervision on the backward path, where downstream failures rarely translate into direct repairs of the memory bank. To address these challenges, we propose MemMA, a plug-and-play multi-agent framework that coordinates the memory cycle along both the forward and backward paths. On the forward path, a Meta-Thinker produces structured guidance that steers a Memory Manager during construction and directs a Query Reasoner during iterative retrieval. On the backward path, MemMA introduces in-situ self-evolving memory construction, which synthesizes probe QA pairs, verifies the current memory, and converts failures into repair actions before the memory is finalized. Extensive experiments on LoCoMo show that MemMA consistently outperforms existing baselines across multiple LLM backbones and improves three different storage backends in a plug-and-play manner. Our code is publicly available at https://github.com/ventr1c/memma.
Chinese Translation
记忆增强的LLM代理维护外部记忆库以支持长时间交互,然而大多数现有系统将构建、检索和利用视为孤立的子程序。这造成了两个相互关联的挑战:在记忆周期的前向路径上,构建和检索受到局部启发式的驱动,而非明确的战略推理,导致战略盲目;在后向路径上,稀疏且延迟的监督使得下游失败很少转化为对记忆库的直接修复。为了解决这些挑战,我们提出了MemMA,一个即插即用的多智能体框架,协调记忆周期的前向和后向路径。在前向路径上,Meta-Thinker生成结构化指导,引导Memory Manager进行构建,并在迭代检索过程中指导Query Reasoner。在后向路径上,MemMA引入现场自我演化的记忆构建,合成探测问答对,验证当前记忆,并在记忆最终确定之前将失败转化为修复行动。在LoCoMo上的大量实验表明,MemMA在多个LLM骨干网络上始终优于现有基准,并以即插即用的方式改善三种不同的存储后端。我们的代码已公开发布在 https://github.com/ventr1c/memma。
cs.AI / 44 / 2603.18729
Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures
单一与多智能体生成性人工智能架构中的语言刻板印象分析
Abstract
Many works in the literature show that LLM outputs exhibit discriminatory behaviour, triggering stereotype-based inferences based on the dialect in which the inputs are written. This bias has been shown to be particularly pronounced when the same inputs are provided to LLMs in Standard American English (SAE) and African-American English (AAE). In this paper, we replicate existing analyses of dialect-sensitive stereotype generation in LLM outputs and investigate the effects of mitigation strategies, including prompt engineering (role-based and Chain-Of-Thought prompting) and multi-agent architectures composed of generate-critique-revise models. We define eight prompt templates to analyse different ways in which dialect bias can manifest, such as suggested names, jobs, and adjectives for SAE or AAE speakers. We use an LLM-as-judge approach to evaluate the bias in the results. Our results show that stereotype-bearing differences emerge between SAE- and AAE-related outputs across all template categories, with the strongest effects observed in adjective and job attribution. Baseline disparities vary substantially by model, with the largest SAE-AAE differential observed in Claude Haiku and the smallest in Phi-4 Mini. Chain-Of-Thought prompting proved to be an effective mitigation strategy for Claude Haiku, whereas the use of a multi-agent architecture ensured consistent mitigation across all the models. These findings suggest that for intersectionality-informed software engineering, fairness evaluation should include model-specific validation of mitigation strategies, and workflow-level controls (e.g., agentic architectures involving critique models) in high-impact LLM deployments. The current results are exploratory in nature and limited in scope, but can lead to extensions and replications by increasing the dataset size and applying the procedure to different languages or dialects.
Chinese Translation
文献中的许多研究表明,大型语言模型(LLM)的输出表现出歧视性行为,基于输入所用的方言触发基于刻板印象的推断。当相同的输入以标准美式英语(SAE)和非裔美国英语(AAE)提供给LLM时,这种偏见表现得尤为明显。本文复制了现有的方言敏感刻板印象生成分析,并研究了缓解策略的效果,包括提示工程(基于角色的提示和链式思维提示)以及由生成-评估-修订模型组成的多智能体架构。我们定义了八种提示模板,以分析方言偏见可能表现的不同方式,例如为SAE或AAE说话者建议的名字、职业和形容词。我们采用LLM作为评判者的方法来评估结果中的偏见。我们的结果表明,在所有模板类别中,SAE和AAE相关输出之间出现了承载刻板印象的差异,形容词和职业归属的影响最为显著。基线差异因模型而异,Claude Haiku模型中观察到最大的SAE-AAE差异,而Phi-4 Mini模型中观察到的差异最小。链式思维提示被证明是Claude Haiku的有效缓解策略,而多智能体架构的使用则确保了所有模型中的一致缓解。这些发现表明,对于考虑交叉性的软件工程,公平性评估应包括对缓解策略的模型特定验证,以及在高影响力的LLM部署中实施工作流程级别的控制(例如,涉及评估模型的智能体架构)。当前结果具有探索性,范围有限,但可以通过增加数据集规模以及将该程序应用于不同语言或方言来进行扩展和复制。
cs.AI / 45 / 2603.18743
Memento-Skills: Let Agents Design Agents
Memento-Skills:让智能体设计智能体
Abstract
We introduce \emph{Memento-Skills}, a generalist, continually-learnable LLM agent system that functions as an \emph{agent-designing agent}: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with \emph{stateful prompts}, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the \emph{Read--Write Reflective Learning} mechanism introduced in \emph{Memento~2}~\cite{wang2025memento2}. In the \emph{read} phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the \emph{write} phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables \emph{continual learning without updating LLM parameters}, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to \emph{design agents end-to-end} for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the \emph{General AI Assistants} benchmark and \emph{Humanity's Last Exam} demonstrate sustained gains, achieving 26.2\% and 116.2\% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.
Chinese Translation
我们介绍了\textit{Memento-Skills},一个通用的、可持续学习的大型语言模型(LLM)智能体系统,它作为一个\textit{智能体设计智能体}运作:通过经验自主构建、适应和改进特定任务的智能体。该系统基于一种基于记忆的强化学习框架,采用\textit{有状态提示},其中可重用的技能(以结构化的Markdown文件存储)作为持久的、不断演变的记忆。这些技能编码了行为和上下文,使智能体能够在交互中传递知识。从简单的基础技能(如网络搜索和终端操作)开始,智能体通过在\textit{Memento~2}中引入的\textit{读--写反思学习}机制不断改进。在\textit{读}阶段,一个可训练行为的技能路由器根据当前的有状态提示选择最相关的技能;在\textit{写}阶段,智能体根据新经验更新和扩展其技能库。这种闭环设计使得\textit{无需更新LLM参数的持续学习}成为可能,因为所有的适应都是通过外部技能和提示的演变实现的。与依赖人类设计的智能体的先前方法不同,Memento-Skills使得通用智能体能够\textit{端到端设计新任务的智能体}。通过迭代的技能生成和精炼,该系统逐步提升自身能力。在\textit{通用人工智能助手}基准和\textit{人类最后的考试}上的实验表明,系统实现了持续的提升,整体准确率分别提高了26.2\%和116.2\%。代码可在https://github.com/Memento-Teams/Memento-Skills获取。
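The read (route) and write (expand) phases over an external skill library can be sketched minimally. The skill record layout and the similarity function are assumptions; the paper's router is behaviour-trainable rather than a fixed similarity:

```python
def route_skill(prompt_vec, skills, similarity):
    """Read phase (sketch): pick the stored skill most relevant to the
    current stateful prompt."""
    return max(skills, key=lambda s: similarity(prompt_vec, s["vec"]))

def write_skill(skills, name, vec, body):
    """Write phase (sketch): externalize new experience as a markdown
    skill, so learning persists without touching model weights."""
    skills.append({"name": name, "vec": vec, "body": body})
    return skills
```

All adaptation lives in the `skills` list (standing in for the markdown files), which is the sense in which the loop learns continually with frozen LLM parameters.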
cs.AI / 46 / 2603.18761
NeuroGame Transformer: Gibbs-Inspired Attention Driven by Game Theory and Statistical Physics
神经游戏变换器:基于博弈论和统计物理的吉布斯启发注意力机制
Abstract
Standard attention mechanisms in transformers are limited by their pairwise formulation, which hinders the modeling of higher-order dependencies among tokens. We introduce the NeuroGame Transformer (NGT) to overcome this by reconceptualizing attention through a dual perspective: tokens are treated simultaneously as players in a cooperative game and as interacting spins in a statistical physics system. Token importance is quantified using two complementary game-theoretic concepts -- Shapley values for global, permutation-based attribution and Banzhaf indices for local, coalition-level influence. These are combined via a learnable gating parameter to form an external magnetic field, while pairwise interaction potentials capture synergistic relationships. The system's energy follows an Ising Hamiltonian, with attention weights emerging as marginal probabilities under the Gibbs distribution, efficiently computed via mean-field equations. To ensure scalability despite the exponential coalition space, we develop importance-weighted Monte Carlo estimators with Gibbs-distributed weights. This approach avoids explicit exponential factors, ensuring numerical stability for long sequences. We provide theoretical convergence guarantees and characterize the fairness-sensitivity trade-off governed by the interpolation parameter. Experimental results demonstrate that the NeuroGame Transformer achieves strong performance across SNLI and MNLI-matched, outperforming some major efficient transformer baselines. On SNLI, it attains a test accuracy of 86.4\% (with a peak validation accuracy of 86.6\%), surpassing ALBERT-Base and remaining highly competitive with RoBERTa-Base. Code is available at https://github.com/dbouchaffra/NeuroGame-Transformer.
Chinese Translation
标准的变换器注意力机制受限于其成对的公式化形式,这限制了对标记之间高阶依赖关系的建模。我们提出了神经游戏变换器(NeuroGame Transformer, NGT),通过双重视角重新概念化注意力:标记同时被视为合作博弈中的参与者和统计物理系统中相互作用的自旋。标记的重要性使用两个互补的博弈论概念进行量化——沙普利值(Shapley values)用于全局的、基于排列的归因,班兹哈夫指数(Banzhaf indices)用于局部的联盟级影响。二者通过一个可学习的门控参数结合形成外部磁场,而成对的相互作用势能则捕捉协同关系。系统的能量遵循伊辛哈密顿量(Ising Hamiltonian),注意力权重作为吉布斯分布下的边际概率出现,通过均场方程高效计算。为了在指数级的联盟空间下确保可扩展性,我们开发了具有吉布斯分布权重的重要性加权蒙特卡洛估计器。这种方法避免了显式的指数因子,确保了长序列的数值稳定性。我们提供了理论收敛保证,并刻画了由插值参数控制的公平性-敏感性权衡。实验结果表明,神经游戏变换器在SNLI和MNLI-matched上表现出色,超越了若干主要的高效变换器基线。在SNLI上,它的测试准确率达到86.4%(峰值验证准确率为86.6%),超过了ALBERT-Base,并与RoBERTa-Base保持高度竞争力。代码可在https://github.com/dbouchaffra/NeuroGame-Transformer获取。
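The game-theoretic attribution and mean-field machinery described in this abstract can be illustrated with a minimal numerical sketch. The value function, coupling matrix, and update schedule below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def shapley_mc(value_fn, n, n_perms=200, seed=0):
    """Monte Carlo Shapley estimate: average each token's marginal
    contribution to value_fn over random orderings."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n)
    for _ in range(n_perms):
        order = rng.permutation(n)
        coalition, prev = [], value_fn([])
        for i in order:
            coalition.append(i)
            cur = value_fn(coalition)
            phi[i] += cur - prev
            prev = cur
    return phi / n_perms

def mean_field_attention(h, W, beta=1.0, iters=50):
    """Marginal activation probabilities of an Ising-style system with
    external field h and pairwise couplings W, via sigmoid mean-field
    fixed-point updates; normalized marginals serve as attention weights."""
    p = np.full(len(h), 0.5)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-beta * (h + W @ p)))
    return p / p.sum()
```

For an additive value function the Shapley values coincide with the per-token scores, which makes the estimator easy to sanity-check before plugging in a learned value function.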
cs.AI / 47 / 2603.18767
A Concept is More Than a Word: Diversified Unlearning in Text-to-Image Diffusion Models
概念不仅仅是一个词:文本到图像扩散模型中的多样化遗忘
Abstract
Concept unlearning has emerged as a promising direction for reducing the risks of harmful content generation in text-to-image diffusion models by selectively erasing undesirable concepts from a model's parameters. Existing approaches typically rely on keywords to identify the target concept to be unlearned. However, we show that this keyword-based formulation is inherently limited: a visual concept is multi-dimensional, can be expressed in diverse textual forms, and often overlaps with related concepts in the latent space, making keyword-only unlearning, which imprecisely indicates the target concept, brittle and prone to over-forgetting. This occurs because a single keyword represents only a narrow point estimate of the concept, failing to cover its full semantic distribution and entangled variations in the latent space. To address this limitation, we propose Diversified Unlearning, a distributional framework that represents a concept through a set of contextually diverse prompts rather than a single keyword. This richer representation enables more precise and robust unlearning. Through extensive experiments across multiple benchmarks and state-of-the-art baselines, we demonstrate that integrating Diversified Unlearning as an add-on component into existing unlearning pipelines consistently achieves stronger erasure, better retention of unrelated concepts, and improved robustness against adversarial recovery attacks.
Chinese Translation
概念遗忘作为一种有前景的方向,已被提出以减少文本到图像扩散模型中有害内容生成的风险,通过选择性地从模型参数中抹去不良概念。现有的方法通常依赖关键词来识别需要遗忘的目标概念。然而,我们表明这种基于关键词的表述本质上是有限的:视觉概念是多维的,可以用多种文本形式表达,并且在潜在空间中常常与相关概念重叠,这使得仅依靠关键词的遗忘方法不够精确,容易导致过度遗忘。这是因为单一关键词仅代表了概念的一个狭窄点估计,未能覆盖其完整的语义分布和在潜在空间中的纠缠变体。为了解决这一局限性,我们提出了多样化遗忘(Diversified Unlearning),这是一种通过一组上下文多样的提示而非单一关键词来表示概念的分布框架。这种更丰富的表述使得遗忘过程更加精确和稳健。通过在多个基准和最先进的基线上的广泛实验,我们证明了将多样化遗忘作为附加组件整合到现有遗忘管道中,能够持续实现更强的抹除效果、更好的无关概念保留以及对对抗性恢复攻击的增强鲁棒性。
cs.AI / 48 / 2603.18786
Proceedings of the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind
第二届通过心智理论推动人工智能发展的研讨会论文集
Abstract
This volume includes a selection of papers presented at the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind held at AAAI 2026 in Singapore on 26th January 2026. The purpose of this volume is to provide an open access and curated anthology for the ToM and AI research community.
Chinese Translation
本卷收录了在2026年1月26日于新加坡举行的第二届通过心智理论推动人工智能发展的研讨会上所呈现的论文。该卷的目的是为心智理论(Theory of Mind)和人工智能(AI)研究社区提供一个开放获取和精心策划的文集。
cs.AI / 49 / 2603.18806
dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models
dTRPO:扩散大型语言模型的策略优化中的轨迹简化
Abstract
Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of intermediate diffusion states, and (ii) the probability of the full trajectory can be effectively estimated with a single forward pass of a re-masked final state. By integrating these two trajectory reduction strategies into a policy optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks. Results show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6% on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks. Moreover, dTRPO exhibits strong training efficiency due to its offline, single-forward nature, and achieves improved generation efficiency through high-quality outputs.
Chinese Translation
扩散大型语言模型(dLLMs)为语言生成引入了一种新的范式,这同时带来了将其与人类偏好对齐的新挑战。在本工作中,我们旨在通过降低轨迹概率计算的成本来改善 dLLMs 的策略优化,从而实现规模化的离线策略训练。我们证明了:(i) 在参考策略正则化下,新揭开掩码的令牌的概率比是中间扩散状态概率比的无偏估计,以及 (ii) 全轨迹的概率可以通过对重新掩码的最终状态进行一次前向传递来有效估计。通过将这两种轨迹简化策略整合到策略优化目标中,我们提出了轨迹简化策略优化(dTRPO)。我们在 7B dLLMs 上对 dTRPO 进行了评估,涵盖遵循指令和推理基准。结果表明,dTRPO 显著提升了最先进的 dLLMs 的核心性能,在 STEM 任务上获得高达 9.6% 的提升,在编码任务上提升高达 4.3%,在遵循指令任务上提升高达 3.0%。此外,由于其离线和单次前向特性,dTRPO 展现出强大的训练效率,并通过高质量输出实现更高的生成效率。
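Result (i) above, restricting the probability ratio to tokens newly revealed at a diffusion step, can be sketched as follows. The array layout and mask convention (True = still masked) are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def newly_unmasked_ratio(logp_policy, logp_ref, mask_prev, mask_cur):
    """Per-step policy/reference probability ratio computed only over
    tokens revealed at this diffusion step (masked before, unmasked now),
    rather than over the full intermediate state."""
    revealed = mask_prev & ~mask_cur          # True where a token was just revealed
    return float(np.exp((logp_policy[revealed] - logp_ref[revealed]).sum()))
```

Because only the newly revealed positions contribute, the estimator avoids recomputing probabilities for tokens that were already unmasked at earlier steps.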
cs.AI / 50 / 2603.18813
Can LLM generate interesting mathematical research problems?
大型语言模型能生成有趣的数学研究问题吗?
Abstract
This paper is the second in a series of works on the mathematical creativity of LLMs. In the first paper, the authors proposed three criteria for evaluating the mathematical creativity of LLMs and constructed a benchmark dataset to measure it. This paper further explores the mathematical creativity of LLMs, with a focus on investigating whether LLMs can generate valuable and cutting-edge mathematical research problems. We develop an agent to generate unknown problems and produce 665 research problems in differential geometry. Through human verification, we find that many of these mathematical problems are unknown to experts and possess unique research value.
Chinese Translation
本文是关于大型语言模型(LLM)数学创造力研究系列中的第二篇论文。在第一篇论文中,作者提出了评估LLM数学创造力的三个标准,并构建了一个基准数据集以进行测量。本文进一步探讨了LLM的数学创造力,重点研究LLM是否能够生成有价值且前沿的数学研究问题。我们开发了一个代理来生成未知问题,并产生了665个微分几何领域的研究问题。通过人工验证,我们发现这些数学问题中的许多对专家而言是未知的,并且具有独特的研究价值。
cs.AI / 51 / 2603.18815
ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents
ProRL代理:面向多轮大语言模型代理强化学习训练的推演即服务(Rollout-as-a-Service)
Abstract
Multi-turn LLM agents are increasingly important for solving complex, interactive tasks, and reinforcement learning (RL) is a key ingredient for improving their long-horizon behavior. However, RL training requires generating large numbers of sandboxed rollout trajectories, and existing infrastructures often couple rollout orchestration with the training loop, making systems hard to migrate and maintain. Under the rollout-as-a-service philosophy, we present ProRL Agent , a scalable infrastructure that serves the full agentic rollout lifecycle through an API service. ProRL Agent also provides standardized and extensible sandbox environments that support diverse agentic tasks in rootless HPC settings. We validate ProRL Agent through RL training on software engineering, math, STEM, and coding tasks. ProRL Agent is open-sourced and integrated as part of NVIDIA NeMo Gym.
Chinese Translation
多轮大语言模型代理在解决复杂的交互式任务中变得越来越重要,强化学习(RL)是改善其长时程行为的关键组成部分。然而,RL训练需要生成大量的沙箱化推演(rollout)轨迹,而现有基础设施往往将推演调度与训练循环耦合在一起,使得系统难以迁移和维护。在推演即服务(rollout-as-a-service)理念下,我们提出了ProRL代理,这是一种可扩展的基础设施,通过API服务支持完整的代理推演生命周期。ProRL代理还提供标准化且可扩展的沙箱环境,支持在无根(rootless)高性能计算(HPC)环境中执行多样化的代理任务。我们通过在软件工程、数学、STEM和编码任务上的RL训练验证了ProRL代理。ProRL代理已开源,并作为NVIDIA NeMo Gym的一部分进行了集成。
cs.AI / 52 / 2603.18859
RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
RewardFlow:面向大型语言模型自主强化学习的状态图拓扑感知奖励传播
Abstract
Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state-wise contributions to success, followed by topology-aware graph propagation to quantify contributions and yield objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
Chinese Translation
强化学习(RL)在增强大型语言模型(LLMs)与外部环境的自主推理能力方面具有重要潜力。然而,终端奖励的固有稀疏性阻碍了细粒度的状态级优化。尽管过程奖励建模提供了一个有前景的替代方案,但训练专用奖励模型通常需要大量的计算成本和扩展困难。为了解决这些挑战,我们提出了RewardFlow,这是一种轻量级的方法,用于估计针对自主推理任务的状态级奖励。RewardFlow通过构建状态图,利用推理轨迹中状态的内在拓扑结构。这使得能够分析各状态对成功的贡献,随后进行基于拓扑的图传播,以量化贡献并产生客观的状态级奖励。当作为密集奖励用于RL优化时,RewardFlow在四个自主推理基准测试中显著超越了先前的RL基线,展示了卓越的性能、鲁棒性和训练效率。RewardFlow的实现已在 https://github.com/tmlr-group/RewardFlow 上公开发布。
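As a rough illustration of spreading sparse terminal rewards over a state graph, here is a damped, PageRank-style propagation sketch. The propagation rule, damping factor, and successor-matrix convention are assumptions for illustration, not RewardFlow's algorithm:

```python
import numpy as np

def propagate_rewards(succ, seed, damping=0.85, iters=100):
    """succ[i][j] = 1 if state j is a successor of state i on the state
    graph. Each state's score is its own seed reward plus the damped mean
    score of its successors, iterated to a fixed point, so states that
    lead toward rewarded outcomes receive credit."""
    A = np.asarray(succ, dtype=float)
    row = A.sum(axis=1, keepdims=True)
    P = np.divide(A, row, out=np.zeros_like(A), where=row > 0)  # row-normalize
    seed = np.asarray(seed, dtype=float)
    r = seed.copy()
    for _ in range(iters):
        r = seed + damping * (P @ r)
    return r
```

On a simple chain ending in a rewarded terminal state, credit decays geometrically with distance from the reward, which is the qualitative behavior a dense state-level reward should exhibit.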
cs.AI / 53 / 2603.18866
Conflict-Based Search for Multi Agent Path Finding with Asynchronous Actions
基于冲突的多智能体路径规划搜索与异步动作
Abstract
Multi-Agent Path Finding (MAPF) seeks collision-free paths for multiple agents from their respective start locations to their respective goal locations while minimizing path costs. Most existing MAPF algorithms rely on a common assumption of synchronized actions, where the actions of all agents start at the same time and always take a time unit, which may limit the use of MAPF planners in practice. To get rid of this assumption, Continuous-time Conflict-Based Search (CCBS) is a popular approach that can find optimal solutions for MAPF with asynchronous actions (MAPF-AA). However, CCBS has recently been identified to be incomplete due to an uncountably infinite state space created by continuous wait durations. This paper proposes a new method, Conflict-Based Search with Asynchronous Actions (CBS-AA), which bypasses this theoretical issue and can solve MAPF-AA with completeness and solution optimality guarantees. Based on CBS-AA, we also develop conflict resolution techniques to improve the scalability of CBS-AA further. Our test results show that our method can reduce the number of branches by up to 90%.
Chinese Translation
多智能体路径规划(MAPF)旨在为多个智能体从各自的起始位置到各自的目标位置寻找无碰撞路径,同时最小化路径成本。现有的大多数MAPF算法依赖于一个共同的同步动作假设,即所有智能体的动作同时开始并且总是耗时一个时间单位,这可能限制了MAPF规划器在实际中的应用。为了摆脱这一假设,连续时间基于冲突的搜索(CCBS)是一种流行的方法,可以为具有异步动作的MAPF(MAPF-AA)找到最优解。然而,最近发现CCBS由于连续等待时长所产生的不可数无限状态空间而不完备。本文提出了一种新方法,即带有异步动作的基于冲突的搜索(CBS-AA),该方法绕过了这一理论问题,并能够在保证完备性和解最优性的前提下求解MAPF-AA。在CBS-AA的基础上,我们还开发了冲突解决技术,以进一步提高CBS-AA的可扩展性。我们的测试结果表明,我们的方法可以将分支数量减少多达90%。
cs.AI / 54 / 2603.18871
Bridging Network Fragmentation: A Semantic-Augmented DRL Framework for UAV-aided VANETs
弥合网络碎片化:一种语义增强的深度强化学习框架用于无人机辅助的车载自组网
Abstract
Vehicular Ad-hoc Networks (VANETs) are the digital cornerstone of autonomous driving, yet they suffer from severe network fragmentation in urban environments due to physical obstructions. Unmanned Aerial Vehicles (UAVs), with their high mobility, have emerged as a vital solution to bridge these connectivity gaps. However, traditional Deep Reinforcement Learning (DRL)-based UAV deployment strategies lack semantic understanding of road topology, often resulting in blind exploration and sample inefficiency. By contrast, Large Language Models (LLMs) possess powerful reasoning capabilities capable of identifying topological importance, though applying them to control tasks remains challenging. To address this, we propose the Semantic-Augmented DRL (SA-DRL) framework. Firstly, we propose a fragmentation quantification method based on Road Topology Graphs (RTG) and Dual Connected Graphs (DCG). Subsequently, we design a four-stage pipeline to transform a general-purpose LLM into a domain-specific topology expert. Finally, we propose the Semantic-Augmented PPO (SA-PPO) algorithm, which employs a Logit Fusion mechanism to inject the LLM's semantic reasoning directly into the policy as a prior, effectively guiding the agent toward critical intersections. Extensive high-fidelity simulations demonstrate that SA-PPO achieves state-of-the-art performance with remarkable efficiency, reaching baseline performance levels using only 26.6% of the training episodes. Ultimately, SA-PPO improves two key connectivity metrics by 13.2% and 23.5% over competing methods, while reducing energy consumption to just 28.2% of the baseline.
Chinese Translation
车载自组网(VANETs)是自动驾驶的数字基石,但由于物理障碍,它们在城市环境中遭受严重的网络碎片化。无人机(UAVs)凭借其高机动性,成为弥合这些连接缺口的重要解决方案。然而,传统的基于深度强化学习(DRL)的无人机部署策略缺乏对道路拓扑的语义理解,常常导致盲目探索和样本效率低下。相比之下,大型语言模型(LLMs)具备强大的推理能力,能够识别拓扑的重要性,但将其应用于控制任务仍然具有挑战性。为了解决这一问题,我们提出了语义增强的深度强化学习(SA-DRL)框架。首先,我们提出了一种基于道路拓扑图(RTG)和双连通图(DCG)的碎片化量化方法。随后,我们设计了一个四阶段流程,将通用的大型语言模型转变为领域特定的拓扑专家。最后,我们提出了语义增强的PPO(SA-PPO)算法,该算法采用Logit融合机制,将LLM的语义推理直接注入策略作为先验,有效引导代理朝向关键交叉口。大量高保真模拟表明,SA-PPO以显著的效率实现了最先进的性能,仅使用26.6%的训练回合便达到了基准性能水平。最终,SA-PPO在两个关键连接性指标上分别比竞争方法提高了13.2%和23.5%,同时将能耗降低至基准的28.2%。
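A minimal sketch of a logit-fusion step of the kind described above, blending the DRL policy's action logits with an LLM-derived semantic prior before the softmax. The linear form and the `alpha` weight are illustrative assumptions, not SA-PPO's exact mechanism:

```python
import numpy as np

def fuse_logits(policy_logits, prior_logits, alpha=0.7):
    """Blend policy logits with an LLM prior, then apply a numerically
    stable softmax to obtain the fused action distribution."""
    fused = alpha * np.asarray(policy_logits, dtype=float) \
        + (1.0 - alpha) * np.asarray(prior_logits, dtype=float)
    z = fused - fused.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Setting `alpha=1.0` recovers the pure policy distribution, so the prior can be annealed in or out during training.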
cs.AI / 55 / 2603.18881
Geography According to ChatGPT -- How Generative AI Represents and Reasons about Geography
根据 ChatGPT 的地理学——生成性人工智能如何表征和推理地理
Abstract
Understanding how AI will represent and reason about geography should be a key concern for all of us, as the broader public increasingly interacts with spaces and places through these systems. Similarly, in line with the nature of foundation models, our own research often relies on pre-trained models. Hence, understanding what world AI systems construct is as important as evaluating their accuracy, including factual recall. To motivate the need for such studies, we provide three illustrative vignettes, i.e., exploratory probes, in the hope that they will spark lively discussions and follow-up work: (1) Do models form strong defaults, and how brittle are model outputs to minute syntactic variations? (2) Can distributional shifts resurface from the composition of individually benign tasks, e.g., when using AI systems to create personas? (3) Do we overlook deeper questions of understanding when solely focusing on the ability of systems to recall facts such as geographic principles?
Chinese Translation
理解人工智能如何表征和推理地理应成为我们所有人的一个关键关注点,因为公众越来越多地通过这些系统与空间和地点互动。同样,考虑到基础模型的特性,我们自己的研究往往依赖于预训练模型。因此,理解人工智能系统构建的世界与评估其准确性(包括事实回忆)同样重要。为了说明此类研究的必要性,我们提供了三个说明性的小案例,即探索性探针,希望它们能引发热烈的讨论和后续工作:(1)模型是否会形成强烈的默认倾向?模型输出对细微的句法变化有多脆弱?(2)分布偏移是否会从各自无害任务的组合中重新浮现,例如,当使用人工智能系统创建角色(persona)时?(3)当我们仅关注系统回忆地理原则等事实的能力时,是否忽视了更深层次的理解问题?
cs.AI / 56 / 2603.18886
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation
对数学对象的推理:同策略奖励建模与测试时聚合
Aggarwal, Pranjal, Ghazvininejad, Marjan, Kim, Seungone, Kulikov, Ilia, Lanchantin, Jack, Li, Xian, Li, Tianjian, Liu, Bo, Neubig, Graham, Ovalle, Anaelia, Saha, Swarnadeep, Sukhbaatar, Sainbayar, Welleck, Sean, Weston, Jason, Whitehouse, Chenxi, Williams, Adina, Xu, Jing, Yu, Ping, Yuan, Weizhe, Zhang, Jingyu, Zhao, Wenting
Abstract
The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple choice options due to the convenience of automated assessment. In this paper we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM-judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, while our training recipes can bring significant improvements over different LLM backbones, while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.
Chinese Translation
精确推导数学对象的能力是下游STEM应用的核心要求,包括数学、物理和化学等领域,在这些领域中,推理必须以形式化结构的表达式收尾。然而,目前对数学和科学推理的语言模型评估在很大程度上依赖于简化的答案格式,例如数值或多项选择选项,这主要是出于自动评估的便利性。在本文中,我们为改进对数学对象的推理提供了三项贡献:(i)我们构建并发布了用于推导数学对象的训练数据和基准,即Principia套件;(ii)我们提供了带有强大LLM评判者和验证者的训练方案,并展示了同策略(on-policy)评判者训练能够提升性能;(iii)我们展示了同策略训练还可以通过聚合来扩展测试时计算。我们发现,Qwen3-235B和o3等强大的语言模型在Principia上表现不佳,而我们的训练方案可以在不同的LLM骨干上带来显著改进,同时改善现有数值和多项选择问答任务的结果,展示了推理能力的跨格式泛化。
cs.AI / 57 / 2603.18893
Quantitative Introspection in Language Models: Tracking Internal States Across Conversation
语言模型中的定量内省:跨对话跟踪内部状态
Abstract
Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman $\rho = 0.40$-$0.76$; isotonic $R^2 = 0.12$-$0.54$ in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another ($\Delta R^2$ up to $0.30$). Crucially, these phenomena scale with model size in some cases, approaching $R^2 \approx 0.93$ in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.
Chinese Translation
在对话中跟踪大型语言模型的内部状态对于安全性、可解释性和模型福祉至关重要,但当前的方法存在局限性。线性探针和其他白盒方法对高维表示的压缩并不完美,并且随着模型规模的增加,应用难度加大。受人类心理学中数值自我报告被广泛用于跟踪内部状态的启发,我们探讨大型语言模型(LLMs)自身的数值自我报告是否能够随时间跟踪由探针定义的情感状态。我们在40个十轮对话中研究了四对概念(幸福感、兴趣、专注和冲动),将内省操作化为模型自我报告与概念匹配的、由探针定义的内部状态之间的因果信息耦合。我们发现,贪婪解码的自我报告会将输出坍缩为少数无信息量的值,但通过计算基于logit的自我报告可以揭示内省能力。该指标能够跟踪可解释的内部状态(在 LLaMA-3.2-3B-Instruct 中,Spearman $\rho = 0.40$-$0.76$;保序回归 $R^2 = 0.12$-$0.54$),跟随这些状态随时间的变化,而激活引导(activation steering)确认了该耦合是因果性的。此外,我们发现内省在第一轮就已存在,但会随对话不断演变,并且可以通过沿一个概念进行引导来选择性地提升另一个概念的内省($\Delta R^2$ 最高可达 $0.30$)。至关重要的是,这些现象在某些情况下随模型规模扩展,在 LLaMA-3.1-8B-Instruct 中接近 $R^2 \approx 0.93$,并在其他模型家族中部分复现。综合来看,这些结果将数值自我报告定位为跟踪对话式人工智能系统中内部情感状态的可行且互补的工具。
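The logit-based self-report described above can be sketched as an expectation over rating tokens rather than a greedy argmax. The 1-7 rating scale and token layout are assumptions for illustration:

```python
import numpy as np

def logit_self_report(rating_logits, ratings=None):
    """Expected numeric self-report from the logits the model assigns to
    the rating tokens (e.g. "1".."7"). Unlike greedy decoding, which
    collapses to a few values, the expectation varies continuously."""
    rating_logits = np.asarray(rating_logits, dtype=float)
    if ratings is None:
        ratings = np.arange(1, len(rating_logits) + 1)
    z = rating_logits - rating_logits.max()   # stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(np.dot(ratings, p))
```

The continuous score can then be correlated (e.g. via Spearman rank correlation) against a probe-defined internal state across turns.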
cs.AI / 58 / 2603.18894
I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems
我无法相信这是腐败:评估多代理治理系统中的腐败
Abstract
Large language models are increasingly proposed as autonomous agents for high-stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority. We present evidence that integrity in institutional AI should be treated as a pre-deployment requirement rather than a post-deployment assumption. We evaluate multi-agent governance simulations in which agents occupy formal governmental roles under different authority structures, and we score rule-breaking and abuse outcomes with an independent rubric-based judge across 28,112 transcript segments. While we advance this position, the core contribution is empirical: among models operating below saturation, governance structure is a stronger driver of corruption-related outcomes than model identity, with large differences across regimes and model--governance pairings. Lightweight safeguards can reduce risk in some settings but do not consistently prevent severe failures. These results imply that institutional design is a precondition for safe delegation: before real authority is assigned to LLM agents, systems should undergo stress testing under governance-like constraints with enforceable rules, auditable logs, and human oversight on high-impact actions.
Chinese Translation
大型语言模型日益被提议作为高风险公共工作流程中的自主代理,然而对于它们在被授予权力时是否会遵循机构规则,我们缺乏系统性的证据。我们提供的证据表明,机构人工智能的诚信应被视为部署前的要求,而非部署后的假设。我们评估了多代理治理模拟,其中代理在不同的权力结构下担任正式的政府角色,并利用独立的、基于评分标准的评委对28,112个转录片段中的违规与滥用结果进行评分。虽然我们提出了这一立场,但其核心贡献是实证的:在表现尚未饱和的模型中,治理结构对腐败相关结果的影响比模型身份更强,不同治理体制以及模型-治理配对之间存在较大差异。轻量级的保护措施可以在某些环境中降低风险,但并不能始终防止严重失效。这些结果意味着,机构设计是安全授权的前提条件:在将真实权力赋予大语言模型代理之前,系统应在类治理约束下接受压力测试,配备可执行的规则、可审计的日志以及对高影响行动的人类监督。
cs.AI / 59 / 2603.18908
Secure Linear Alignment of Large Language Models
大型语言模型的安全线性对齐
Abstract
Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, it unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we propose a privacy-preserving framework that exploits representational convergence to enable cross-silo inference between independent language models. The framework learns an affine transformation over a shared public dataset and applies homomorphic encryption to protect client queries during inference. By encrypting only the linear alignment and classification operations, the method achieves sub-second inference latency while maintaining strong security guarantees. We support this framework with an empirical investigation into representational convergence, in which we learn linear transformations between the final hidden states of independent models. We evaluate these cross-model mappings on embedding classification and out-of-distribution detection, observing minimal performance degradation across model pairs. Additionally, we show for the first time that linear alignment sometimes enables text generation across independently trained models.
Chinese Translation
语言模型越来越表现出学习相似表示的趋势,尽管在训练目标、架构和数据模态上存在差异。这种独立训练模型之间的新兴兼容性为跨模型对齐下游目标提供了新的机会。此外,它还解锁了新的潜在应用领域,例如在安全性、隐私或竞争限制下禁止直接数据或模型共享的环境。在本研究中,我们提出了一种隐私保护框架,该框架利用表示收敛性实现独立语言模型之间的跨孤岛推理。该框架在共享公共数据集上学习仿射变换,并应用同态加密来保护推理过程中的客户端查询。通过仅加密线性对齐和分类操作,该方法在保持强安全保证的同时,实现了亚秒级的推理延迟。我们通过对表示收敛性的实证研究来支持该框架,在此过程中我们学习了独立模型最终隐藏状态之间的线性变换。我们在嵌入分类和分布外检测上评估了这些跨模型映射,观察到模型对之间的性能降级最小。此外,我们首次展示了线性对齐有时能够实现独立训练模型之间的文本生成。
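The affine alignment over a shared dataset can be sketched as an ordinary least-squares fit between hidden states. This omits the homomorphic-encryption layer and is only a plaintext illustration of the alignment step:

```python
import numpy as np

def fit_affine_alignment(H_src, H_tgt):
    """Least-squares affine map (W, b) with H_tgt ≈ H_src @ W + b,
    aligning final hidden states of two independently trained models.
    Appending a constant column folds the bias into one lstsq solve."""
    X = np.hstack([H_src, np.ones((len(H_src), 1))])  # bias column
    sol, *_ = np.linalg.lstsq(X, H_tgt, rcond=None)
    return sol[:-1], sol[-1]                           # W, b
```

Because the map is a single matrix multiply plus a bias, it is exactly the kind of operation that remains cheap to evaluate under homomorphic encryption.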
cs.AI / 60 / 2603.18916
Agentic Business Process Management: A Research Manifesto
自主商业流程管理:研究宣言
Abstract
This paper presents a manifesto that articulates the conceptual foundations of Agentic Business Process Management (APM), an extension of Business Process Management (BPM) for governing autonomous agents executing processes in organizations. From a management perspective, APM represents a paradigm shift from the traditional process view of the business process, driven by the realization of process awareness and an agent-oriented abstraction, where software and human agents act as primary functional entities that perceive, reason, and act within explicit process frames. This perspective marks a shift from traditional, automation-oriented BPM toward systems in which autonomy is constrained, aligned, and made operational through process awareness. We introduce the core abstractions and architectural elements required to realize APM systems and elaborate on four key capabilities that such APM agents must support: framed autonomy, explainability, conversational actionability, and self-modification. These capabilities jointly ensure that agents' goals are aligned with organizational goals and that agents behave in a framed yet proactive manner in pursuing those goals. We discuss the extent to which the capabilities can be realized and identify research challenges whose resolution requires further advances in BPM, AI, and multi-agent systems. The manifesto thus serves as a roadmap for bridging these communities and for guiding the development of APM systems in practice.
Chinese Translation
本文提出了一份宣言,阐明了自主商业流程管理(Agentic Business Process Management, APM)的概念基础,APM是对商业流程管理(Business Process Management, BPM)的扩展,旨在管理在组织中执行流程的自主代理。从管理的角度来看,APM代表了商业流程传统视角的范式转变,这一转变源于对流程意识和代理导向抽象的认识,其中软件和人类代理作为主要功能实体,在明确的流程框架内感知、推理和行动。这一视角标志着从传统的以自动化为导向的BPM转向一种系统,在这种系统中,自主性受到约束、对齐,并通过流程意识变得可操作。我们介绍了实现APM系统所需的核心抽象和架构元素,并详细阐述了这些APM代理必须支持的四个关键能力:框架自主性、可解释性、对话可操作性和自我修改。这些能力共同确保代理的目标与组织目标对齐,并且代理在追求这些目标时表现出框架内但又积极主动的行为。我们讨论了这些能力的实现程度,并识别出需要在BPM、人工智能(AI)和多代理系统领域进一步发展的研究挑战。因此,该宣言为弥合这些领域提供了一条路线图,并指导APM系统在实践中的发展。
cs.AI / 61 / 2603.18968
Teleological Inference in Structural Causal Models via Intentional Interventions
通过意图干预在结构因果模型中的目的论推理
Abstract
Structural causal models (SCMs) were conceived to formulate and answer causal questions. This paper shows that SCMs can also be used to formulate and answer teleological questions, concerning the intentions of a state-aware, goal-directed agent intervening in a causal system. We review limitations of previous approaches to modeling such agents, and then introduce intentional interventions, a new time-agnostic operator that induces a twin SCM we call a structural final model (SFM). SFMs treat observed values as the outcome of intentional interventions and relate them to the counterfactual conditions of those interventions (what would have happened had the agent not intervened). We show how SFMs can be used to empirically detect agents and to discover their intentions.
Chinese Translation
结构因果模型(SCMs)旨在制定和回答因果问题。本文表明,SCMs 也可以用于制定和回答目的论问题,即关于一个具有状态意识、目标导向的代理在因果系统中进行干预时的意图的问题。我们回顾了以往建模此类代理的方法的局限性,随后引入了意图干预(intentional interventions),这是一种新的与时间无关的算子,它诱导出一个我们称之为结构最终模型(SFM)的孪生SCM。SFMs 将观察到的值视为意图干预的结果,并将其与这些干预的反事实条件(即如果代理没有干预会发生什么)相关联。我们展示了如何使用 SFMs 来实证地检测代理并发现其意图。
cs.AI / 62 / 2603.18976
Evaluating 5W3H Structured Prompting for Intent Alignment in Human-AI Interaction
评估5W3H结构化提示在人与智能体交互中的意图对齐
Abstract
Natural language prompts often suffer from intent transmission loss: the gap between what users actually need and what they communicate to AI systems. We evaluate PPS (Prompt Protocol Specification), a 5W3H-based framework for structured intent representation in human-AI interaction. In a controlled three-condition study across 60 tasks in three domains (business, technical, and travel), three large language models (DeepSeek-V3, Qwen-Max, and Kimi), and three prompt conditions - (A) simple prompts, (B) raw PPS JSON, and (C) natural-language-rendered PPS - we collect 540 AI-generated outputs evaluated by an LLM judge. We introduce goal_alignment, a user-intent-centered evaluation dimension, and find that rendered PPS outperforms both simple prompts and raw JSON on this metric. PPS gains are task-dependent: gains are large in high-ambiguity business analysis tasks but reverse in low-ambiguity travel planning. We also identify a measurement asymmetry in standard LLM evaluation, where unconstrained prompts can inflate constraint adherence scores and mask the practical value of structured prompting. A preliminary retrospective survey (N = 20) further suggests a 66.1% reduction in follow-up prompts required, from 3.33 to 1.13 rounds. These findings suggest that structured intent representations can improve alignment and usability in human-AI interaction, especially in tasks where user intent is inherently ambiguous.
Chinese Translation
自然语言提示常常遭遇意图传递损失:用户实际需求与其传达给AI系统的信息之间的差距。我们评估PPS(提示协议规范),这是一种基于5W3H的框架,用于在人与AI交互中进行结构化意图表示。在针对60项任务的受控三条件研究中,我们涵盖了三个领域(商业、技术和旅行),使用了三种大型语言模型(DeepSeek-V3、Qwen-Max和Kimi),以及三种提示条件——(A)简单提示,(B)原始PPS JSON,以及(C)自然语言渲染的PPS,我们收集了540个由AI生成的输出,并由LLM评审进行评估。我们引入了以用户意图为中心的评估维度goal_alignment,并发现渲染的PPS在这一指标上优于简单提示和原始JSON。PPS的收益具有任务依赖性:在高歧义的商业分析任务中收益较大,而在低歧义的旅行规划中则相反。我们还识别出标准LLM评估中的一种测量不对称性,即无约束提示可能会夸大约束遵循评分,并掩盖结构化提示的实际价值。初步的回顾性调查(N = 20)进一步表明,后续提示的需求减少了66.1%,从3.33轮减少到1.13轮。这些发现表明,结构化意图表示能够改善人与AI交互中的对齐和可用性,尤其是在用户意图本质上模糊的任务中。
cs.AI / 63 / 2603.18987
Unmasking Algorithmic Bias in Predictive Policing: A GAN-Based Simulation Framework with Multi-City Temporal Analysis
揭示预测性警务中的算法偏见:基于生成对抗网络的多城市时间分析仿真框架
Abstract
Predictive policing systems that direct patrol resources based on algorithmically generated crime forecasts have been widely deployed across US cities, yet their tendency to encode and amplify racial disparities remains poorly understood in quantitative terms. We present a reproducible simulation framework that couples a Generative Adversarial Network (GAN) with a Noisy-OR patrol detection model to measure how racial bias propagates through the full enforcement pipeline, from crime occurrence to police contact. Using 145,000+ Part 1 crime records from Baltimore (2017-2019) and 233,000+ records from Chicago (2022), augmented with US Census ACS demographic data, we compute four monthly bias metrics across 264 city-year-mode observations: the Disparate Impact Ratio (DIR), Demographic Parity Gap, Gini Coefficient, and a composite Bias Amplification Score. Our experiments reveal extreme and year-variant bias in Baltimore's detected mode, with mean annual DIR up to 15714 in 2019, moderate under-detection of Black residents in Chicago (DIR = 0.22), and persistent Gini coefficients of 0.43 to 0.62 across all conditions. We further demonstrate that a Conditional Tabular GAN (CTGAN) debiasing approach partially redistributes detection rates but cannot eliminate structural disparity without accompanying policy intervention. Socioeconomic regression analysis confirms strong correlations between neighborhood racial composition and detection likelihood (Pearson r = 0.83 for percent White and r = -0.81 for percent Black). A sensitivity analysis over patrol radius, officer count, and citizen reporting probability reveals that outcomes are most sensitive to officer deployment levels. The code and data are publicly available at this repository.
Chinese Translation
基于算法生成的犯罪预测来指导巡逻资源的预测性警务系统已在美国城市广泛部署,但其编码和放大种族差异的倾向在定量层面仍然理解不足。我们提出了一个可复现的仿真框架,将生成对抗网络(GAN)与噪声OR(Noisy-OR)巡逻检测模型相结合,以测量种族偏见如何在从犯罪发生到警察接触的整个执法流程中传播。利用2017至2019年巴尔的摩的超过145,000条第一类(Part 1)犯罪记录和2022年芝加哥的超过233,000条记录,并结合美国人口普查ACS人口数据,我们在264个“城市-年份-模式”观测中计算了四个月度偏见指标:差异影响比(DIR)、人口平等差距、基尼系数以及综合偏见放大分数。我们的实验揭示了巴尔的摩检测模式中极端且逐年变化的偏见,2019年的平均年度DIR高达15714;芝加哥对黑人居民存在中度检测不足(DIR = 0.22);并且在所有条件下基尼系数持续处于0.43至0.62之间。我们进一步证明,条件表格生成对抗网络(CTGAN)去偏方法可以部分重新分配检测率,但在缺乏相应政策干预的情况下无法消除结构性差异。社会经济回归分析确认了邻里种族构成与检测可能性之间的强相关性(白人比例的皮尔逊相关系数 r = 0.83,黑人比例的 r = -0.81)。对巡逻半径、警员数量和公民报告概率的敏感性分析显示,结果对警员部署水平最为敏感。代码和数据在此存储库中公开可用。
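Two of the monthly bias metrics named above, the Disparate Impact Ratio and the Gini coefficient, admit compact textbook implementations. The formulas below are the standard definitions, not necessarily the paper's exact estimators:

```python
import numpy as np

def disparate_impact_ratio(rate_group, rate_reference):
    """DIR: a group's detection rate over the reference group's rate;
    1.0 indicates parity, values below 1 indicate under-detection."""
    return rate_group / rate_reference

def gini(x):
    """Gini coefficient of a non-negative distribution (0 = perfect
    equality, approaching 1 = maximal concentration), computed from
    the normalized cumulative sum of the sorted values."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n
```

Applied to per-neighborhood detection counts, the Gini coefficient summarizes how concentrated enforcement attention is, independent of any group labels.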
cs.AI / 64 / 2603.18994
Evaluating Game Difficulty in Tetris Block Puzzle
评估俄罗斯方块难度
Abstract
Tetris Block Puzzle is a single player stochastic puzzle in which a player places blocks on an 8 x 8 grid to complete lines; its popular variants have amassed tens of millions of downloads. Despite this reach, there is little principled assessment of which rule sets are more difficult. Inspired by prior work that uses AlphaZero as a strong evaluator for chess variants, we study difficulty in this domain using Stochastic Gumbel AlphaZero (SGAZ), a budget-aware planning agent for stochastic environments. We evaluate rule changes including holding block h, preview holding block p, and additional Tetris block variants using metrics such as training reward and convergence iterations. Empirically, increasing h and p reduces difficulty (higher reward and faster convergence), while adding more Tetris block variants increases difficulty, with the T-pentomino producing the largest slowdown. Through analysis, SGAZ delivers strong play under small simulation budgets, enabling efficient, reproducible comparisons across rule sets and providing a reference for future design in stochastic puzzle games.
Chinese Translation
俄罗斯方块拼图(Tetris Block Puzzle)是一种单人随机拼图游戏,玩家在8 x 8的网格上放置方块以完成行;其流行的变种已累计数千万次下载。尽管如此,对于哪些规则集更具挑战性,仍缺乏系统的评估。受到先前使用AlphaZero作为国际象棋变种强评估器的研究的启发,我们在这一领域使用随机Gumbel AlphaZero(SGAZ)研究难度,它是一个针对随机环境的预算感知规划代理。我们评估了包括持有方块h、预览持有方块p和其他俄罗斯方块变种在内的规则变化,采用训练奖励和收敛迭代次数等指标进行评估。实证结果表明,增加h和p会降低难度(奖励更高且收敛更快),而增加更多的方块变种则会提高难度,其中T形五格骨牌(T-pentomino)造成的减速最大。分析表明,SGAZ在小规模模拟预算下表现出色,使得跨规则集的高效、可重复比较成为可能,并为未来随机拼图游戏的设计提供了参考。
cs.AI / 65 / 2603.18999
Regret Bounds for Competitive Resource Allocation with Endogenous Costs
带内生成本的竞争性资源分配的遗憾界限
Abstract
We study online resource allocation among N interacting modules over T rounds. Unlike standard online optimization, costs are endogenous: they depend on the full allocation vector through an interaction matrix W encoding pairwise cooperation and competition. We analyze three paradigms: (I) uniform allocation (cost-ignorant), (II) gated allocation (cost-estimating), and (III) competitive allocation via multiplicative weights update with interaction feedback (cost-revealing). Our main results establish a strict separation under adversarial sequences with bounded variation: uniform incurs Omega(T) regret, gated achieves O(T^{2/3}), and competitive achieves O(sqrt(T log N)). The performance gap stems from competitive allocation's ability to exploit endogenous cost information revealed through interactions. We further show that W's topology governs a computation-regret tradeoff. Full interaction (|E|=O(N^2)) yields the tightest bound but highest per-step cost, while sparse topologies (|E|=O(N)) increase regret by at most O(sqrt(log N)) while reducing per-step cost from O(N^2) to O(N). Ring-structured topologies with both cooperative and competitive links - of which the five-element Wuxing topology is canonical - minimize the computation x regret product. These results provide the first formal regret-theoretic justification for decentralized competitive allocation in modular architectures and establish cost endogeneity as a fundamental challenge distinct from partial observability. Keywords: online learning, regret bounds, resource allocation, endogenous costs, interaction topology, multiplicative weights, modular systems, Wuxing topology
Chinese Translation
我们研究了在 T 轮中 N 个相互作用模块之间的在线资源分配。与标准在线优化不同,成本是内生的:它们通过编码成对合作与竞争的交互矩阵 W 依赖于完整的分配向量。我们分析了三种范式:(I) 均匀分配(忽略成本),(II) 门控分配(估算成本),以及 (III) 通过乘法权重更新和交互反馈的竞争性分配(揭示成本)。我们的主要结果在具有有界变动的对抗序列下建立了严格区分:均匀分配会产生 Omega(T) 的遗憾,门控分配达到 O(T^{2/3}),而竞争性分配则达到 O(sqrt(T log N))。性能差距源于竞争性分配利用通过交互揭示的内生成本信息的能力。我们进一步展示了 W 的拓扑结构决定了计算与遗憾之间的权衡。完整交互 (|E|=O(N^2)) 产生最紧的界限但每步成本最高,而稀疏拓扑 (|E|=O(N)) 最多使遗憾增加 O(sqrt(log N)),同时将每步成本从 O(N^2) 降低到 O(N)。同时包含合作链接与竞争链接的环形拓扑(其中五元素五行拓扑是典型)最小化了计算与遗憾的乘积。这些结果为模块化架构中去中心化竞争性分配提供了第一个正式的遗憾理论依据,并确立了成本内生性是一个不同于部分可观测性的根本挑战。关键词:在线学习,遗憾界限,资源分配,内生成本,交互拓扑,乘法权重,模块化系统,五行拓扑
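The competitive paradigm (III) above is multiplicative weights update with interaction feedback. A minimal sketch of MWU on a toy endogenous-cost instance, with hypothetical names and a standard learning-rate choice, illustrates the O(sqrt(T log N)) regret behavior claimed in the abstract:

```python
import math

def mwu_allocate(cost_fn, n_modules, T, eta=None):
    """Multiplicative weights update (MWU) over N modules for T rounds.
    cost_fn(t, alloc) returns a per-module cost vector in [0, 1] that
    may depend on the full allocation vector (endogenous costs)."""
    if eta is None:
        eta = math.sqrt(math.log(n_modules) / T)  # standard tuning
    w = [1.0] * n_modules
    total_cost = 0.0
    for t in range(T):
        s = sum(w)
        alloc = [wi / s for wi in w]              # normalized allocation
        costs = cost_fn(t, alloc)
        total_cost += sum(a * c for a, c in zip(alloc, costs))
        w = [wi * math.exp(-eta * c) for wi, c in zip(w, costs)]
    return total_cost

# Toy instance: module 0 is always cheapest, so the best fixed
# allocation has total cost 0 and MWU's total cost equals its regret.
T = 2000
regret = mwu_allocate(lambda t, alloc: [0.0, 1.0, 1.0], 3, T)
print(regret, math.sqrt(T * math.log(3)))  # regret stays O(sqrt(T log N))
```

On this instance the realized regret lands close to the sqrt(T log N) scale, consistent with the separation result stated above.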
cs.AI / 66 / 2603.19022
Behavioral Fingerprints for LLM Endpoint Stability and Identity
大语言模型端点稳定性与身份的行为指纹
Abstract
The consistency of AI-native applications depends on the behavioral consistency of the model endpoints that power them. Traditional reliability metrics such as uptime, latency and throughput do not capture behavioral change, and an endpoint can remain "healthy" while its effective model identity changes due to updates to weights, tokenizers, quantization, inference engines, kernels, caching, routing, or hardware. We introduce Stability Monitor, a black-box stability monitoring system that periodically fingerprints an endpoint by sampling outputs from a fixed prompt set and comparing the resulting output distributions over time. Fingerprints are compared using a summed energy distance statistic across prompts, with permutation-test p-values as evidence of distribution shift aggregated sequentially to detect change events and define stability periods. In controlled validation, Stability Monitor detects changes to model family, version, inference stack, quantization, and behavioral parameters. In real-world monitoring of the same model hosted by multiple providers, we observe substantial provider-to-provider and within-provider stability differences.
Chinese Translation
人工智能原生应用的一致性依赖于支撑它们的模型端点的行为一致性。传统的可靠性指标,如正常运行时间、延迟和吞吐量,并未捕捉到行为变化,且一个端点在其有效模型身份因权重、分词器、量化、推理引擎、内核、缓存、路由或硬件的更新而变化时,仍然可以保持“健康”。我们提出了稳定性监测器(Stability Monitor),这是一种黑箱稳定性监测系统,通过从固定提示集采样输出并比较随时间变化的输出分布,定期对端点进行指纹识别。指纹通过在提示间的总能量距离统计进行比较,使用置换检验的 p 值作为分布变化的证据,顺序聚合以检测变化事件并定义稳定期。在受控验证中,稳定性监测器能够检测到模型家族、版本、推理堆栈、量化和行为参数的变化。在对由多个提供商托管的相同模型进行的实际监测中,我们观察到提供商之间以及提供商内部存在显著的稳定性差异。
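The fingerprint comparison described above, an energy-distance statistic with a permutation-test p-value, can be sketched as follows; one-dimensional values stand in for embedded model outputs, and all names are illustrative rather than the system's actual code.

```python
import random

def energy_distance(xs, ys):
    """Two-sample energy distance for 1-D values
    (a stand-in for embedded endpoint outputs)."""
    def mean_abs(a, b):
        return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))
    return 2 * mean_abs(xs, ys) - mean_abs(xs, xs) - mean_abs(ys, ys)

def permutation_pvalue(xs, ys, n_perm=200, seed=0):
    """P(shuffled statistic >= observed) under the no-change null."""
    rng = random.Random(seed)
    observed = energy_distance(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = energy_distance(pooled[:len(xs)], pooled[len(xs):])
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one smoothing

rng = random.Random(1)
same  = [rng.gauss(0, 1) for _ in range(30)]
other = [rng.gauss(3, 1) for _ in range(30)]  # shifted: endpoint "changed"
print(permutation_pvalue(same, other))        # small p: shift detected
```

A monitoring loop would compute one such p-value per prompt, sum the energy statistics across prompts, and aggregate evidence sequentially to declare a change event.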
cs.AI / 67 / 2603.19042
Man and machine: artificial intelligence and judicial decision making
人与机器:人工智能与司法决策
Abstract
The integration of artificial intelligence (AI) technologies into judicial decision-making - particularly in pretrial, sentencing, and parole contexts - has generated substantial concerns about transparency, reliability, and accountability. At the same time, these developments have brought the limitations of human judgment into sharper relief and underscored the importance of understanding how judges interact with AI-based decision aids. Using criminal justice risk assessment as a focal case, we conduct a synthetic review connecting three intertwined aspects of AI's role in judicial decision-making: the performance and fairness of AI tools, the strengths and biases of human judges, and the nature of AI+human interactions. Across the fields of computer science, economics, law, criminology and psychology, researchers have made significant progress in evaluating the predictive validity of automated risk assessment instruments, documenting biases in judicial decision-making, and, to a more limited extent, examining how judges use algorithmic recommendations. While the existing empirical evidence indicates that the impact of AI decision aid tools on pretrial and sentencing decisions is modest or nonexistent, our review also reveals important gaps in the canvassed literatures. Further research is needed to evaluate the performance of AI risk assessment instruments, understand how judges navigate noisy decision making environments and how individual characteristics influence judges' responses to AI advice. We argue that AI-versus-human comparisons have the potential to yield new insights into both algorithmic tools and human decision-makers and advocate greater interdisciplinary integration and cross-fertilization in future research.
Chinese Translation
将人工智能(AI)技术整合到司法决策中——特别是在预审、判刑和假释等背景下——引发了关于透明度、可靠性和问责制的重大担忧。同时,这些发展使人类判断的局限性更加明显,并强调了理解法官如何与基于AI的决策辅助工具互动的重要性。以刑事司法风险评估为重点案例,我们进行了一项综合性回顾,连接了AI在司法决策中作用的三个交织方面:AI工具的性能和公平性、人类法官的优势和偏见,以及AI与人类的互动性质。在计算机科学、经济学、法律、犯罪学和心理学等领域,研究人员在评估自动化风险评估工具的预测有效性、记录司法决策中的偏见,以及在更有限的范围内,考察法官如何使用算法推荐方面取得了显著进展。尽管现有的实证证据表明,AI决策辅助工具对预审和判刑决策的影响是有限或不存在的,但我们的回顾也揭示了相关文献中的重要空白。需要进一步研究以评估AI风险评估工具的性能,理解法官如何在嘈杂的决策环境中进行导航,以及个体特征如何影响法官对AI建议的反应。我们认为,AI与人类的比较有潜力为算法工具和人类决策者提供新的见解,并倡导未来研究中更大的跨学科整合和交叉融合。
cs.AI / 68 / 2603.19087
Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity
设计中的意外发现:评估跨领域映射对人类与大型语言模型创造力的影响
Abstract
Are large language models (LLMs) creative in the same way humans are, and can the same interventions increase creativity in both? We evaluate a promising but largely untested intervention for creativity: forcing creators to draw an analogy from a random, remote source domain (''cross-domain mapping''). Human participants and LLMs generated novel features for ten daily products (e.g., backpack, TV) under two prompts: (i) cross-domain mapping, which required translating a property from a randomly assigned source (e.g., octopus, cactus, GPS), and (ii) user-need, which required proposing innovations targeting unmet user needs. We show that humans reliably benefit from randomly assigned cross-domain mappings, while LLMs, on average, generate more original ideas than humans and do not show a statistically significant effect of cross-domain mappings. However, in both systems, the impact of cross-domain mapping increases when the inspiration source becomes more semantically distant from the target. Our results highlight both the role of remote association in creative ideation and systematic differences in how humans and LLMs respond to the same intervention for creativity.
Chinese Translation
大型语言模型(LLMs)是否以与人类相同的方式具有创造力?相同的干预措施是否能同时提高两者的创造力?我们评估了一种有前景但尚未广泛测试的创造力干预措施:强迫创作者从一个随机的、遥远的源领域进行类比(“跨领域映射”)。人类参与者和LLMs在两个提示下为十种日常产品(例如,背包、电视)生成新颖特征:(i)跨领域映射,要求从随机分配的源(例如,章鱼、仙人掌、GPS)翻译一个属性;(ii)用户需求,要求提出针对未满足用户需求的创新。我们的研究表明,人类在随机分配的跨领域映射中可靠地受益,而LLMs平均生成的原创想法比人类更多,并且未显示出跨领域映射的统计显著效果。然而,在这两种系统中,当灵感来源与目标的语义距离变得更远时,跨领域映射的影响会增加。我们的结果突显了遥远联想在创造性构思中的作用,以及人类与LLMs在同一创造力干预下反应的系统性差异。
cs.AI / 69 / 2603.19100
LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling
LuMamba:用于电极拓扑不变和高效EEG建模的潜在统一Mamba
Abstract
Electroencephalography (EEG) enables non-invasive monitoring of brain activity across clinical and neurotechnology applications, yet building foundation models for EEG remains challenging due to \emph{differing electrode topologies} and \emph{computational scalability}, as Transformer architectures incur quadratic sequence complexity. As a joint solution, we propose \textbf{LuMamba} (\textbf{L}atent \textbf{U}nified \textbf{Mamba}), a self-supervised framework combining topology-invariant encodings with linear-complexity state-space modeling, using LUNA's learned-query cross-attention mechanism for channel unification~\cite{luna}, and FEMBA's bidirectional Mamba blocks for efficient temporal modeling~\cite{femba}. Within this architecture, we provide the first systematic investigation of the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA) for biosignal learning. Pre-trained on over 21,000 hours of unlabeled EEG from the TUEG corpus, LuMamba is evaluated on five downstream tasks spanning abnormality detection, artifact recognition, and mental condition classification across electrode configurations ranging from 16 to 26 channels. In the pre-training objective, masked reconstruction alone yields structured but less generalizable representations, while LeJEPA alone produces diffuse embeddings; combining both objectives achieves the most robust performance. With only 4.6M parameters, LuMamba attains 80.99\% balanced accuracy on TUAB and achieves state-of-art performance on Alzheimer's detection (0.97 AUPR), while requiring \textbf{377$\times$ fewer FLOPS} than state-of-art models at equivalent sequence lengths and scaling to \textbf{12$\times$ longer sequences} before reaching typical GPU memory limits. Code is available at https://github.com/pulp-bio/biofoundation
Chinese Translation
脑电图(EEG)能够在临床和神经技术应用中进行非侵入性脑活动监测,但由于电极拓扑的差异和计算可扩展性,构建EEG的基础模型仍然具有挑战性,因为Transformer架构会导致二次序列复杂度。作为一种联合解决方案,我们提出了\textbf{LuMamba}(\textbf{L}atent \textbf{U}nified \textbf{Mamba}),这是一个自监督框架,结合了拓扑不变编码与线性复杂度的状态空间建模,使用LUNA的学习查询交叉注意机制进行通道统一~\cite{luna},并利用FEMBA的双向Mamba模块进行高效的时间建模~\cite{femba}。在该架构中,我们首次系统性地研究了潜在欧几里得联合嵌入预测架构(LeJEPA)在生物信号学习中的应用。LuMamba在TUEG语料库上经过超过21,000小时的无标记EEG预训练,并在五个下游任务中进行评估,涵盖异常检测、伪影识别和心理状态分类,电极配置范围从16到26个通道。在预训练目标中,仅使用掩蔽重建会产生结构化但不够通用的表示,而仅使用LeJEPA则产生扩散的嵌入;结合这两种目标可实现最稳健的性能。LuMamba仅用4.6M参数,在TUAB上达到了80.99\%的平衡准确率,并在阿尔茨海默病检测中实现了最先进的表现(0.97 AUPR),同时在相同序列长度下所需的FLOPS比最先进模型少\textbf{377$\times$},并且在达到典型GPU内存限制之前可以扩展到\textbf{12$\times$}更长的序列。代码可在https://github.com/pulp-bio/biofoundation获取。
cs.AI / 70 / 2603.19118
How Uncertainty Estimation Scales with Sampling in Reasoning Models
不确定性估计在推理模型中的采样扩展性研究
Abstract
Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to $+12$ on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.
Chinese Translation
不确定性估计对于部署推理语言模型至关重要,但在扩展的思维链推理下仍然缺乏深入理解。我们研究了并行采样这一完全黑箱的方法,利用语言化置信度和自一致性。在三个推理模型和涵盖数学、STEM以及人文学科的17个任务中,我们刻画了这些信号的扩展特性。自一致性和语言化置信度在推理模型中均随采样扩展,但自一致性的初始区分度较低,并且在中等采样预算下落后于语言化置信度。然而,大多数不确定性增益源自信号组合:仅用两个样本,混合估计器的AUROC平均最多提高$+12$,并且即使单一信号扩展到更大的预算,其表现也已超越任一单一信号,此后收益逐渐减小。这些效应依赖于领域:在数学这一RLVR风格后训练的原生领域中,推理模型相比在STEM或人文学科中实现了更高的不确定性质量,并表现出更强的互补性和更快的扩展性。
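The two black-box signals above and their combination can be sketched in a few lines, assuming a simple convex-combination hybrid (the paper's exact estimator may differ):

```python
from collections import Counter

def self_consistency(answers):
    """Majority answer and the fraction of samples agreeing with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

def hybrid_confidence(answers, verbalized, w=0.5):
    """Convex combination of the agreement rate and the mean verbalized
    confidence (both in [0, 1]); w is a tunable weight. This is an
    illustrative estimator, not the paper's exact formula."""
    answer, agreement = self_consistency(answers)
    mean_verbal = sum(verbalized) / len(verbalized)
    return answer, w * agreement + (1 - w) * mean_verbal

# Two parallel samples, as in the paper's low-budget setting:
ans, conf = hybrid_confidence(["42", "42"], [0.9, 0.7])
print(ans, conf)  # combines agreement (1.0) with mean verbalized (0.8)
```

With only two samples, agreement is a coarse signal (0, 0.5, or 1), which is why blending in verbalized confidence already helps at small budgets.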
cs.AI / 71 / 2603.19138
Implicit Patterns in LLM-Based Binary Analysis
基于大型语言模型的二进制分析中的隐式模式
Abstract
Binary vulnerability analysis is increasingly performed by LLM-based agents in an iterative, multi-pass manner, with the model as the core decision-maker. However, how such systems organize exploration over hundreds of reasoning steps remains poorly understood, due to limited context windows and implicit token-level behaviors. We present the first large-scale, trace-level study showing that multi-pass LLM reasoning gives rise to structured, token-level implicit patterns. Analyzing 521 binaries with 99,563 reasoning steps, we identify four dominant patterns: early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization that emerge implicitly from reasoning traces. These token-level implicit patterns serve as an abstraction of LLM reasoning: instead of explicit control-flow or predefined heuristics, exploration is organized through implicit decisions regulating path selection, commitment, and revision. Our analysis shows these patterns form a stable, structured system with distinct temporal roles and measurable characteristics. Our results provide the first systematic characterization of LLM-driven binary analysis and a foundation for more reliable analysis systems.
Chinese Translation
二进制漏洞分析越来越多地由基于大型语言模型(LLM)的代理以迭代的多次传递方式进行,模型作为核心决策者。然而,由于上下文窗口的限制和隐式的令牌级行为,这些系统如何在数百个推理步骤中组织探索仍然不够清楚。我们首次进行大规模的追踪级研究,显示多次传递的LLM推理产生了结构化的令牌级隐式模式。通过分析521个二进制文件和99,563个推理步骤,我们识别出四种主导模式:早期剪枝、路径依赖锁定、目标回溯和知识引导优先级,这些模式隐式地从推理轨迹中出现。这些令牌级隐式模式作为LLM推理的抽象:探索不是通过显式的控制流或预定义的启发式方法组织,而是通过隐式决策来调节路径选择、承诺和修订。我们的分析表明,这些模式形成了一个稳定的、结构化的系统,具有不同的时间角色和可测量的特征。我们的结果为LLM驱动的二进制分析提供了首次系统化的特征描述,并为更可靠的分析系统奠定了基础。
cs.AI / 72 / 2603.19146
D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding
D5P4:用于并行离散扩散解码中的多样性分区行列式点过程
Abstract
Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard decoding methods for autoregressive models, such as beam search, do not directly apply to iterative denoising, and existing diffusion decoding techniques provide limited control over in-batch diversity. To bridge this gap, we introduce a generalized beam-search framework for discrete diffusion that generates candidates in parallel and supports modular beam-selection objectives. As a diversity-focused instantiation, we propose D5P4, which formulates the selection step as MAP inference over a Determinantal Point Process. Leveraging a scalable greedy solver, D5P4 maintains multi-GPU compatibility and enables an explicit trade-off between model probability and target diversity with near-zero compute overhead. Experiments on free-form generation and question answering demonstrate that D5P4 improves diversity over strong baselines while maintaining competitive generation quality.
Chinese Translation
离散扩散模型是文本生成中对自回归方法的有希望的替代方案,然而它们的解码方法仍然未被充分研究。标准的自回归模型解码方法,如束搜索,无法直接应用于迭代去噪,而现有的扩散解码技术在批内部多样性控制方面提供的选择有限。为了解决这一问题,我们提出了一种针对离散扩散的广义束搜索框架,该框架并行生成候选项,并支持模块化的束选择目标。作为侧重于多样性的实例化,我们提出了D5P4,它将选择步骤制定为对行列式点过程的最大后验推断。利用可扩展的贪心求解器,D5P4维持了多GPU兼容性,并能够在几乎不增加计算开销的情况下,在模型概率与目标多样性之间明确进行权衡。我们在自由格式生成和问答任务上的实验表明,D5P4在保持竞争生成质量的同时,提升了多样性,相较于强基线取得了显著改善。
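The selection step above, MAP inference over a Determinantal Point Process via a greedy solver, can be sketched as follows; the kernel values and the naive determinant routine are illustrative, not the paper's scalable implementation:

```python
def det(m):
    """Determinant via Gaussian elimination (small matrices only)."""
    m = [row[:] for row in m]
    n = len(m)
    d = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))  # partial pivot
        if abs(m[p][i]) < 1e-12:
            return 0.0
        if p != i:
            m[i], m[p] = m[p], m[i]
            d = -d
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return d

def greedy_dpp_map(L, k):
    """Greedy MAP inference for a DPP with kernel L: at each step add
    the item maximizing det of the selected submatrix, trading off item
    quality (diagonal) against pairwise similarity (off-diagonal)."""
    selected = []
    for _ in range(k):
        best, best_det = None, -1.0
        for i in range(len(L)):
            if i in selected:
                continue
            s = selected + [i]
            sub = [[L[a][b] for b in s] for a in s]
            d = det(sub)
            if d > best_det:
                best, best_det = i, d
        selected.append(best)
    return selected

# Candidates 0 and 1 are near-duplicates; 2 is distinct, lower quality.
L = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.1],
     [0.1, 0.1, 0.8]]
print(greedy_dpp_map(L, 2))  # picks a diverse pair, not the duplicates
```

Applied to beam selection, the kernel would mix model probability (quality) with candidate similarity, giving the explicit probability-diversity trade-off described above.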
cs.AI / 73 / 2603.19163
cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization
cuGenOpt:一种用于组合优化的GPU加速通用元启发式框架
Abstract
Combinatorial optimization problems arise in logistics, scheduling, and resource allocation, yet existing approaches face a fundamental trade-off among generality, performance, and usability. We present cuGenOpt, a GPU-accelerated general-purpose metaheuristic framework that addresses all three dimensions simultaneously. At the engine level, cuGenOpt adopts a "one block evolves one solution" CUDA architecture with a unified encoding abstraction (permutation, binary, integer), a two-level adaptive operator selection mechanism, and hardware-aware resource management. At the extensibility level, a user-defined operator registration interface allows domain experts to inject problem-specific CUDA search operators. At the usability level, a JIT compilation pipeline exposes the framework as a pure-Python API, and an LLM-based modeling assistant converts natural-language problem descriptions into executable solver code. Experiments across five thematic suites on three GPU architectures (T4, V100, A800) show that cuGenOpt outperforms general MIP solvers by orders of magnitude, achieves competitive quality against specialized solvers on instances up to n=150, and attains 4.73% gap on TSP-442 within 30s. Twelve problem types spanning five encoding variants are solved to optimality. Framework-level optimizations cumulatively reduce pcb442 gap from 36% to 4.73% and boost VRPTW throughput by 75-81%. Code: https://github.com/L-yang-yang/cugenopt
Chinese Translation
组合优化问题广泛存在于物流、调度和资源分配等领域,然而现有方法在通用性、性能和可用性之间面临根本性的权衡。我们提出了cuGenOpt,这是一种GPU加速的通用元启发式框架,能够同时解决这三个维度的问题。在引擎层面,cuGenOpt采用了“一个块演化一个解”的CUDA架构,具有统一的编码抽象(排列、二进制、整数)、两级自适应操作符选择机制以及硬件感知的资源管理。在可扩展性层面,用户定义的操作符注册接口允许领域专家注入特定问题的CUDA搜索操作符。在可用性层面,JIT编译管道将框架暴露为纯Python API,并且基于LLM的建模助手将自然语言问题描述转换为可执行的求解器代码。在三个GPU架构(T4、V100、A800)上的五个主题套件的实验表明,cuGenOpt的性能远超一般的MIP求解器,针对高达n=150的实例在质量上与专业求解器具有竞争力,并在30秒内在TSP-442上达到了4.73%的间隙。涵盖五种编码变体的十二种问题类型均被求解至最优。框架级优化累计将pcb442的间隙从36%减少到4.73%,并将VRPTW的吞吐量提升了75-81%。代码链接:https://github.com/L-yang-yang/cugenopt
cs.AI / 74 / 2603.19182
Box Maze: A Process-Control Architecture for Reliable LLM Reasoning
盒子迷宫:一种用于可靠大语言模型推理的过程控制架构
Abstract
Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches -- such as reinforcement learning from human feedback (RLHF) and output filtering -- primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process integrity. This paper proposes the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen). Results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions. While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.
Chinese Translation
大型语言模型(LLMs)展现出强大的生成能力,但在对抗性提示下仍然容易出现幻觉和不可靠的推理。现有的安全性方法——如基于人类反馈的强化学习(RLHF)和输出过滤——主要在行为层面上运作,可能缺乏强制推理过程完整性的明确架构机制。本文提出了盒子迷宫框架,这是一种概念性的过程控制架构,将LLM推理分解为三个明确的层次:记忆基础、结构化推理和边界执行。我们引入了基于初步模拟的评估,涉及多个异构LLM系统(DeepSeek-V3、Doubao、Qwen)中的渐进性边界侵蚀场景。来自n=50个对抗性场景的结果表明,明确的认知控制层可能改善边界维护的一致性,架构约束将边界失效率从约40%(基线RLHF)降低到对抗条件下的1%以下。尽管当前的验证是基于模拟的,这些初步结果表明,过程级控制可能为提高大型语言模型推理的可靠性提供了一个有前景的方向。
cs.AI / 75 / 2603.19191
OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
OS-Themis:一种可扩展的通用图形用户界面奖励评估框架
Abstract
Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.
Chinese Translation
强化学习(RL)有潜力提高图形用户界面(GUI)代理在随机环境中的鲁棒性,但训练对奖励函数的质量高度敏感。现有的奖励方法在可扩展性和性能之间难以取得平衡。为了解决这个问题,我们提出了OS-Themis,一个可扩展且准确的多智能体评估框架。与单一评审者不同,OS-Themis将轨迹分解为可验证的里程碑,以隔离决策所需的关键证据,并采用审查机制在做出最终裁决之前严格审计证据链。为了便于评估,我们进一步引入了OmniGUIRewardBench(OGRBench),这是一个全面的跨平台图形用户界面结果奖励基准,其中所有评估模型在OS-Themis下都达到了最佳性能。在AndroidWorld上的大量实验表明,OS-Themis在支持在线强化学习训练时提高了10.3%的性能,在自我训练循环中用于轨迹验证和过滤时提高了6.9%的增益,突显了其推动代理进化的潜力。
cs.CL / 1 / 2603.18007
Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm
大型语言模型是否具备心智理论?基于奇异故事范式的比较评估
Abstract
The study explores whether current Large Language Models (LLMs) exhibit Theory of Mind (ToM) capabilities -- specifically, the ability to infer others' beliefs, intentions, and emotions from text. Given that LLMs are trained on language data without social embodiment or access to other manifestations of mental representations, their apparent social-cognitive reasoning raises key questions about the nature of their understanding. Are they capable of robust mental-state attribution indistinguishable from human ability in its output, or do their outputs merely reflect superficial pattern completion? To address this question, we tested five LLMs and compared their performance to that of human controls using an adapted version of a text-based tool widely used in human ToM research. The test involves answering questions about the beliefs, intentions, and emotions of story characters. The results revealed a performance gap between the models. Earlier and smaller models were strongly affected by the number of relevant inferential cues available and, to some extent, were also vulnerable to the presence of irrelevant or distracting information in the texts. In contrast, GPT-4o demonstrated high accuracy and strong robustness, performing comparably to humans even in the most challenging conditions. This work contributes to ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation.
Chinese Translation
本研究探讨当前的大型语言模型(LLMs)是否展现出心智理论(ToM)能力——具体而言,即从文本中推断他人的信念、意图和情感的能力。考虑到LLMs是在没有社会体现或访问其他心理表征表现的语言数据上进行训练的,它们显现出的社会认知推理引发了关于其理解本质的关键问题。它们是否能够进行与人类能力在输出上无差别的稳健心理状态归属,还是其输出仅仅反映了表面的模式完成?为了解决这个问题,我们测试了五个LLMs,并将它们的表现与人类对照组进行了比较,使用了一种广泛用于人类ToM研究的文本工具的改编版本。该测试涉及回答关于故事角色的信念、意图和情感的问题。结果显示模型之间存在表现差距。较早和较小的模型受到可用相关推理线索数量的强烈影响,并在一定程度上也容易受到文本中无关或分散注意的信息的干扰。相比之下,GPT-4o展现出高准确性和强鲁棒性,即使在最具挑战性的条件下,其表现也与人类相当。这项工作为关于LLMs的认知状态以及真正理解与统计近似之间的界限的持续辩论做出了贡献。
cs.CL / 2 / 2603.18008
TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots
TherapyGym:评估和对齐治疗聊天机器人中的临床忠实度与安全性
Abstract
Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods--fluency metrics, preference tests, and generic dialogue benchmarks--fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that evaluates and improves therapy chatbots along two clinical pillars: fidelity and safety. Fidelity is measured using the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions. Safety is assessed using a multi-label annotation scheme, covering therapy-specific risks (e.g., failing to address harm or abuse). To mitigate bias and unreliability in LLM-based judges, we further release THERAPYJUDGEBENCH, a validation set of 116 dialogues with 1,270 expert ratings for auditing and calibration against licensed clinicians. THERAPYGYM also serves as a training harness: CTRS and safety-based rewards drive RL with configurable patient simulations spanning diverse symptom profiles. Models trained in THERAPYGYM improve on expert ratings, with average CTRS rising from 0.10 to 0.60 (and 0.16 to 0.59 under LLM judges). Our work enables scalable development of therapy chatbots that are faithful to evidence-based practice and safer in high-stakes use.
Chinese Translation
大型语言模型(LLMs)在心理健康支持中的应用日益增多;然而,现有的评估方法——流畅性指标、偏好测试和通用对话基准——未能捕捉到心理治疗中临床关键的维度。我们提出了THERAPYGYM,一个沿着两个临床支柱(忠实度和安全性)评估和改善治疗聊天机器人的框架。忠实度通过认知治疗评分量表(Cognitive Therapy Rating Scale, CTRS)进行测量,该量表作为一个自动化流程实施,评分依据是多轮会话中对认知行为疗法(CBT)技术的遵循程度。安全性则通过多标签注释方案进行评估,涵盖特定于治疗的风险(例如,未能处理伤害或虐待)。为了减少基于LLM的评估者中的偏见和不可靠性,我们进一步发布了THERAPYJUDGEBENCH,这是一个包含116个对话和1,270个专家评分的验证集,用于审计和与持证临床医生的校准。THERAPYGYM还作为一个训练工具:基于CTRS和安全性的奖励驱动了强化学习(RL),并配置了涵盖多种症状特征的患者模拟。经过THERAPYGYM训练的模型在专家评分上有所提升,平均CTRS从0.10提高到0.60(在LLM评估者下从0.16提高到0.59)。我们的工作使得治疗聊天机器人的可扩展开发成为可能,确保其忠实于循证实践,并在高风险使用中更加安全。
cs.CL / 3 / 2603.18009
How Confident Is the First Token? An Uncertainty-Calibrated Prompt Optimization Framework for Large Language Model Classification and Understanding
首个标记的信心有多大?一种针对大型语言模型分类和理解的不确定性校准提示优化框架
Abstract
With the widespread adoption of large language models (LLMs) in natural language processing, prompt engineering and retrieval-augmented generation (RAG) have become mainstream to enhance LLMs' performance on complex tasks. However, LLMs generate outputs autoregressively, leading to inevitable output uncertainty. Since model performance is highly sensitive to prompt design, precise uncertainty measurement is crucial for reliable prompt optimization. For multi-class multiple-choice (understanding) tasks, conventional uncertainty measures (e.g., entropy) based on output probabilities treat all classes equally and ignore class prior differences in pretraining corpora. This failure to distinguish spurious confidence (from priors) from true certainty (from contextual understanding) results in poor confidence calibration. To address this, we propose Log-Scale Focal Uncertainty (LSFU), a first-token-based metric inspired by focal loss. LSFU incorporates label prior probabilities as a risk-modulation factor to suppress noise from high-frequency classes and emphasize risk for low-frequency long-tail classes, with a dynamic weighting mechanism unifying the measurement scale. Based on LSFU, we further propose the uncertainty-calibrated prompt optimization framework (UCPOF), which leverages the first token of model outputs to select high-quality exemplars and dynamically optimize prompts. Comprehensive evaluations show UCPOF improves average accuracy by 6.03% over few-shot baselines, surpasses always-on full RAG by 5.75% in overall average accuracy, and reduces the average retrieval trigger rate by 50.66%. By adaptively triggering RAG only for high-uncertainty samples, our framework significantly lowers computational costs while maintaining state-of-the-art performance.
Chinese Translation
随着大型语言模型(LLMs)在自然语言处理中的广泛应用,提示工程和检索增强生成(RAG)已成为主流,以提升LLMs在复杂任务上的表现。然而,LLMs是自回归生成输出的,这导致了不可避免的输出不确定性。由于模型性能对提示设计高度敏感,因此精确的不确定性测量对于可靠的提示优化至关重要。在多类多选(理解)任务中,基于输出概率的传统不确定性度量(例如熵)将所有类别视为平等,忽略了预训练语料中的类别先验差异。这种未能区分虚假信心(来自先验)与真实确定性(来自上下文理解)的问题导致信心校准不佳。为了解决这一问题,我们提出了对数尺度焦点不确定性(Log-Scale Focal Uncertainty, LSFU),这是一种受焦点损失启发、基于首个标记的度量。LSFU将标签先验概率作为风险调节因子,以抑制高频类别的噪声,并强调低频长尾类别的风险,同时具有动态加权机制以统一测量尺度。基于LSFU,我们进一步提出了不确定性校准提示优化框架(Uncertainty-Calibrated Prompt Optimization Framework, UCPOF),该框架利用模型输出的首个标记选择高质量示例,并动态优化提示。综合评估表明,UCPOF的平均准确率比少量样本基线提高了6.03%,在整体平均准确率上超越了始终开启的完整RAG 5.75%,并将平均检索触发率降低了50.66%。通过仅对高不确定性样本自适应触发RAG,我们的框架在维持先进性能的同时显著降低了计算成本。
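As an illustration of the idea behind LSFU, the following sketch contrasts plain first-token entropy with a focal-style measure that uses label priors as a risk-modulation factor; the specific weighting shown is an assumption of this sketch, not the paper's exact formula:

```python
import math

def entropy(probs):
    """Shannon entropy of the first-token class distribution (baseline)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def prior_weighted_uncertainty(probs, priors, gamma=2.0):
    """Focal-style sketch: each class's entropy term is scaled by
    (1 - prior)^gamma, suppressing contributions tied to high-frequency
    classes and emphasizing risk on low-frequency long-tail classes."""
    return -sum(((1 - q) ** gamma) * p * math.log(p)
                for p, q in zip(probs, priors) if p > 0)

probs = [0.7, 0.2, 0.1]       # first-token probabilities over 3 labels
flat_prior = [1 / 3] * 3      # balanced label priors
skew_prior = [0.7, 0.2, 0.1]  # confidence merely mirrors corpus priors
print(entropy(probs))
print(prior_weighted_uncertainty(probs, flat_prior))
print(prior_weighted_uncertainty(probs, skew_prior))  # tail terms dominate
```

Under the skewed prior, the dominant class's term is discounted while long-tail terms are amplified, which is the behavior a downstream optimizer can use to separate spurious confidence from contextual certainty.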
cs.CL / 4 / 2603.18010
Agentic Framework for Political Biography Extraction
政治传记提取的代理框架
Abstract
The production of large-scale political datasets typically demands extracting structured facts from vast piles of unstructured documents or web sources, a task that traditionally relies on expensive human experts and remains prohibitively difficult to automate at scale. In this paper, we leverage Large Language Models (LLMs) to automate the extraction of multi-dimensional elite biographies, addressing a long-standing bottleneck in political science research. We propose a two-stage ``Synthesis-Coding'' framework for complex extraction tasks: an upstream synthesis stage that uses recursive agentic LLMs to search, filter, and curate biographies from heterogeneous web sources, followed by a downstream coding stage that maps curated biographies into structured dataframes. We validate this framework through three primary results. First, we demonstrate that, when given curated contexts, LLM coders match or outperform human experts in extraction accuracy. Second, we show that in web environments, the agentic system synthesizes more information from web resources than human collective intelligence (Wikipedia). Finally, we diagnose that directly coding from long, multi-language corpora introduces bias that the synthesis stage can alleviate by curating evidence into signal-dense representations. Through comprehensive evaluation, we provide a generalizable, scalable framework for building transparent and extensible large-scale databases in political science.
Chinese Translation
大规模政治数据集的生成通常需要从大量非结构化文档或网络来源中提取结构化事实,这一任务传统上依赖于昂贵的人类专家,并且在大规模自动化方面仍然极具挑战性。本文利用大型语言模型(Large Language Models, LLMs)来自动化提取多维度的精英传记,解决了政治科学研究中的一个长期瓶颈。我们提出了一种两阶段的“综合-编码”(Synthesis-Coding)框架来处理复杂的提取任务:上游综合阶段使用递归代理LLM从异构网络来源中搜索、过滤和策划传记,随后是下游编码阶段将策划的传记映射到结构化数据框中。我们通过三个主要结果验证了该框架。首先,我们证明在给定策划上下文的情况下,LLM编码器在提取准确性上与人类专家相匹配或表现更佳。其次,我们展示在网络环境中,代理系统从网络资源中综合的信息量超过了人类集体智能(维基百科)。最后,我们诊断出直接从长篇和多语言语料库编码会引入偏差,而综合阶段可以通过将证据策划为信号密集的表示来缓解这一问题。通过全面评估,我们提供了一个可推广、可扩展的框架,以构建透明且可扩展的大规模政治科学数据库。
cs.CL / 5 / 2603.18011
Controllable Evidence Selection in Retrieval-Augmented Question Answering via Deterministic Utility Gating
通过确定性效用门控实现检索增强问答中的可控证据选择
Abstract
Many modern AI question-answering systems convert text into vectors and retrieve the closest matches to a user question. While effective for topical similarity, similarity scores alone do not explain why some retrieved text can serve as evidence while other equally similar text cannot. When many candidates receive similar scores, systems may select sentences that are redundant, incomplete, or address different conditions than the question requires. This paper presents a deterministic evidence selection framework for retrieval-augmented question answering. The approach introduces Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE), fixed scoring and redundancy-control procedures that determine evidence admissibility prior to answer generation. Each sentence or record is evaluated independently using explicit signals for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy. No training or fine-tuning is required. In the prototype, a unit is accepted only if it explicitly states the fact, rule, or condition required by the task. Units are not merged or expanded. If no unit independently satisfies the requirement, the system returns no answer. This deterministic gating produces compact, auditable evidence sets and establishes a clear boundary between relevant text and usable evidence.
Chinese Translation
许多现代人工智能问答系统将文本转换为向量,并检索最接近用户问题的匹配项。尽管在主题相似性方面有效,但仅依赖相似性评分并不能解释为什么某些检索到的文本可以作为证据,而其他同样相似的文本则不能。当许多候选项获得类似的评分时,系统可能会选择冗余、不完整或处理与问题要求不同条件的句子。本文提出了一种用于检索增强问答的确定性证据选择框架。该方法引入了意义-效用估计(Meaning-Utility Estimation,MUE)和多样性-效用估计(Diversity-Utility Estimation,DUE),这些固定评分和冗余控制程序在生成答案之前确定证据的可接受性。每个句子或记录都通过对语义相关性、术语覆盖、概念独特性和冗余的显性信号进行独立评估。该方法不需要训练或微调。在原型中,只有明确陈述任务所需的事实、规则或条件的单元才会被接受。单元不会合并或扩展。如果没有单元独立满足要求,系统将不返回答案。这种确定性门控生成紧凑、可审计的证据集,并建立相关文本与可用证据之间的清晰界限。
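The deterministic gating described above can be sketched in a few lines. The concrete signals and thresholds below (token-overlap term coverage, Jaccard redundancy, fixed cutoffs) are illustrative assumptions, not the paper's exact MUE/DUE procedures; the point is the shape of the pipeline: each unit is scored independently, units are never merged, and an empty result means "no answer".

```python
# Hedged sketch of deterministic evidence gating in the spirit of MUE/DUE.
# Signal definitions and thresholds are illustrative assumptions.

def term_coverage(question: str, unit: str) -> float:
    """Fraction of question terms that the candidate unit covers."""
    q = set(question.lower().split())
    u = set(unit.lower().split())
    return len(q & u) / len(q) if q else 0.0

def redundancy(unit: str, accepted: list) -> float:
    """Maximum token overlap (Jaccard) with any already-accepted unit."""
    u = set(unit.lower().split())
    best = 0.0
    for a in accepted:
        s = set(a.lower().split())
        if u | s:
            best = max(best, len(u & s) / len(u | s))
    return best

def gate_evidence(question, candidates, tau_cov=0.5, tau_red=0.8):
    """Accept units independently; no merging, no expansion.
    Returns [] (i.e. 'no answer') if nothing passes the gate."""
    accepted = []
    for unit in candidates:
        if term_coverage(question, unit) >= tau_cov and \
           redundancy(unit, accepted) < tau_red:
            accepted.append(unit)
    return accepted

evidence = gate_evidence(
    "when was the treaty signed",
    ["The treaty was signed in 1848.",
     "The treaty was signed in 1848.",          # redundant duplicate
     "The negotiations were long and difficult."])
```

Because every score is a fixed function of the text, the same candidates always yield the same evidence set, which is what makes the selection auditable.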
cs.CL / 6 / 2603.18012
DynaRAG: Bridging Static and Dynamic Knowledge in Retrieval-Augmented Generation
DynaRAG:在检索增强生成中桥接静态与动态知识
Abstract
We present DynaRAG, a retrieval-augmented generation (RAG) framework designed to handle both static and time-sensitive information needs through dynamic knowledge integration. Unlike traditional RAG pipelines that rely solely on static corpora, DynaRAG selectively invokes external APIs when retrieved documents are insufficient for answering a query. The system employs an LLM-based reranker to assess document relevance, a sufficiency classifier to determine when fallback is necessary, and Gorilla v2 -- a state-of-the-art API calling model -- for accurate tool invocation. We further enhance robustness by incorporating schema filtering via FAISS to guide API selection. Evaluations on the CRAG benchmark demonstrate that DynaRAG significantly improves accuracy on dynamic questions, while also reducing hallucinations. Our results highlight the importance of dynamic-aware routing and selective tool use in building reliable, real-world question-answering systems.
Chinese Translation
我们提出了DynaRAG,一个检索增强生成(RAG)框架,旨在通过动态知识集成处理静态和时间敏感的信息需求。与传统的仅依赖静态语料库的RAG管道不同,DynaRAG在检索到的文档不足以回答查询时,选择性地调用外部API。该系统采用基于大型语言模型(LLM)的重排序器来评估文档相关性,使用充分性分类器来判断何时需要回退,并利用Gorilla v2——一种最先进的API调用模型——进行准确的工具调用。我们通过引入FAISS进行模式过滤来增强系统的鲁棒性,以指导API选择。在CRAG基准上的评估表明,DynaRAG显著提高了动态问题的准确性,同时减少了幻觉现象。我们的结果突显了动态感知路由和选择性工具使用在构建可靠的现实世界问答系统中的重要性。
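The control flow above (rerank, sufficiency check, selective API fallback) can be sketched with the components stubbed as plain callables. In the paper the reranker is LLM-based, the fallback caller is Gorilla v2, and API selection is guided by FAISS schema filtering; the stubs below are assumptions standing in for those models.

```python
# Illustrative control flow for a DynaRAG-style pipeline.
# Component interfaces are assumed; real models are replaced by stubs.

def dynarag_answer(query, retrieve, rerank, is_sufficient, call_api, generate):
    docs = rerank(query, retrieve(query))
    if is_sufficient(query, docs):
        # Static path: answer from retrieved documents alone.
        return generate(query, docs)
    # Dynamic path: fall back to an external API for fresh evidence.
    api_result = call_api(query)
    return generate(query, docs + [api_result])

# Toy stubs exercising the static path.
static_corpus = {"capital of france": "Paris is the capital of France."}

answer = dynarag_answer(
    "capital of france",
    retrieve=lambda q: [static_corpus.get(q, "")],
    rerank=lambda q, ds: [d for d in ds if d],
    is_sufficient=lambda q, ds: bool(ds),
    call_api=lambda q: "LIVE-API-RESULT",
    generate=lambda q, ds: ds[-1],
)
```

A time-sensitive query ("what is the weather today") would leave the reranked list empty, fail the sufficiency check, and take the dynamic path instead.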
cs.CL / 7 / 2603.18013
Learned but Not Expressed: Capability-Expression Dissociation in Large Language Models
学习但未表达:大型语言模型中的能力-表达解离
Abstract
Large language models (LLMs) demonstrate the capacity to reconstruct and trace learned content from their training data under specific elicitation conditions, yet this capability does not manifest in standard generation contexts. This empirical observational study examines the expression of non-causal, non-implementable solution types across 300 prompt-response generations spanning narrative and problem-solving task contexts. Drawing on recent findings regarding memorization contiguity and alignment-induced discourse priors, we document a systematic dissociation between learned capability and expressed output. Across three distinct LLMs, ten task scenarios, and both creative narrative and practical advisory contexts, we documented zero instances of non-causal solution frames in generated outputs (0%, 95% CI: [0%, 1.2%]), despite verified reconstruction capability under conditional extraction. These findings challenge the prevailing assumption that training data presence directly predicts output probability, demonstrating instead that task-conditioned generation policies can comprehensively suppress learned content across diverse contexts. The results offer implications for understanding generation dynamics, output distribution control, and the behavioral boundaries of contemporary LLMs.
Chinese Translation
大型语言模型(LLMs)在特定引导条件下展示了从其训练数据中重构和追踪学习内容的能力,然而这种能力在标准生成环境中并未表现出来。本实证观察研究考察了在叙事和问题解决任务背景下,300个提示-响应生成中非因果、不可实现解决方案类型的表达。基于近期关于记忆连贯性和对齐诱导话语先验的发现,我们记录了学习能力与表达输出之间的系统性解离。在三种不同的LLM、十种任务场景以及创造性叙事和实用建议背景下,我们在生成输出中记录到非因果解决框架的实例为零(0%,95% 置信区间:[0%,1.2%]),尽管在条件提取下验证了重构能力。这些发现挑战了训练数据存在直接预测输出概率的普遍假设,而是表明任务条件生成策略可以在多种背景下全面抑制学习内容。这些结果对理解生成动态、输出分布控制以及当代LLM的行为边界具有重要意义。
cs.CL / 8 / 2603.18014
Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction
实时信任度评分用于大型语言模型结构化输出和数据提取
Abstract
Structured Outputs from current LLMs exhibit sporadic errors, hindering enterprise AI efforts from realizing their immense potential. We present CONSTRUCT, a method to score the trustworthiness of LLM Structured Outputs in real-time, such that lower-scoring outputs are more likely to contain errors. This reveals the best places to focus limited human review bandwidth. CONSTRUCT additionally scores the trustworthiness of each field within a LLM Structured Output, helping reviewers quickly identify which parts of the output are wrong. Our method is suitable for any LLM (including black-box LLM APIs without logprobs such as reasoning models and Anthropic models), does not require labeled training data nor custom model deployment, and works for complex Structured Outputs with many fields of diverse types (including nested JSON schemas). We additionally present one of the first public LLM Structured Output benchmarks with reliable ground-truth values that are not full of mistakes. Over this four-dataset benchmark, CONSTRUCT detects errors from various LLMs (including Gemini 3 and GPT-5) with significantly higher precision/recall than other scoring methods.
Chinese Translation
当前大型语言模型(LLM)生成的结构化输出存在偶发错误,阻碍了企业人工智能努力实现其巨大的潜力。我们提出了CONSTRUCT,这是一种实时评分LLM结构化输出的信任度的方法,较低的评分输出更可能包含错误。这一方法揭示了有限人力审核带宽的最佳关注点。CONSTRUCT还对LLM结构化输出中的每个字段进行信任度评分,帮助审核人员快速识别输出中错误的部分。我们的方法适用于任何LLM(包括没有日志概率的黑箱LLM API,如推理模型和Anthropic模型),不需要标注的训练数据或自定义模型部署,且能够处理具有多种类型字段的复杂结构化输出(包括嵌套JSON架构)。此外,我们还发布了首批公共LLM结构化输出基准之一,其真实值可靠、并非错误百出。在这个包含四个数据集的基准评估中,CONSTRUCT以显著更高的精准率/召回率检测来自各种LLM(包括Gemini 3和GPT-5)的错误,相较于其他评分方法效果更佳。
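The abstract does not specify CONSTRUCT's scoring procedure, so the sketch below is an assumption: one common logprob-free approach to field-level trustworthiness is to resample the LLM several times and score each field by cross-sample agreement. This illustrates the output shape (one score per field) rather than the paper's actual method.

```python
# Assumed self-consistency sketch of field-level trustworthiness scoring;
# NOT the paper's procedure, which the abstract does not detail.

from collections import Counter

def field_trust_scores(samples):
    """Score each field of a structured output by how often the
    majority value recurs across resampled generations."""
    scores = {}
    for f in samples[0].keys():
        values = [str(s.get(f)) for s in samples]
        top_count = Counter(values).most_common(1)[0][1]
        scores[f] = top_count / len(values)
    return scores

# Three resampled extractions of the same invoice: 'total' is unstable.
samples = [
    {"vendor": "Acme", "total": 410.0},
    {"vendor": "Acme", "total": 410.0},
    {"vendor": "Acme", "total": 41.0},
]
scores = field_trust_scores(samples)
```

A reviewer would be pointed at the low-scoring `total` field first, which is the triage behavior the paper describes.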
cs.CL / 9 / 2603.18015
Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection
超越准确性:基于可解释性的有害内容检测分析
Abstract
Although automated harmful content detection systems are frequently used to monitor online platforms, moderators and end users often cannot understand the logic underlying their predictions. While recent studies have focused on increasing classification accuracy, little focus has been placed on comprehending why neural models identify content as harmful, especially when it comes to borderline, contextual, and politically sensitive situations. In this work, we conduct an explainability-driven analysis of a neural harmful content detection model trained on the Civil Comments dataset. Two popular post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, are used to analyze the behavior of a RoBERTa-based classifier in both correct predictions and systematic failure cases. Despite strong overall performance, with an area under the curve of 0.93 and an accuracy of 0.94, the analysis reveals limitations that are not observable from aggregate evaluation metrics alone. Integrated Gradients appear to extract more diffuse contextual attributions, while Shapley Additive Explanations extract more focused attributions on explicit lexical cues. The consequent divergence in their outputs manifests in both false negatives and false positives. Qualitative case studies reveal recurring failure modes such as indirect toxicity, lexical over-attribution, and political discourse. The results suggest that explainable AI can foster human-in-the-loop moderation by exposing model uncertainty and making the rationale behind automated decisions more interpretable. Most importantly, this work highlights the role of explainability as a transparency and diagnostic resource for online harmful content detection systems rather than as a performance-enhancing lever.
Chinese Translation
尽管自动化有害内容检测系统常被用于监控在线平台,但审核员和最终用户常常无法理解其预测背后的逻辑。虽然最近的研究集中于提高分类准确性,但对于神经网络模型为何将内容识别为有害的理解却鲜有关注,尤其是在边缘案例、上下文依赖和政治敏感的情况下。在这项研究中,我们对一个在Civil Comments数据集上训练的神经有害内容检测模型进行了基于可解释性的分析。使用两种流行的后处理解释方法,Shapley加法解释(Shapley Additive Explanations)和积分梯度(Integrated Gradients),分析了基于RoBERTa的分类器在正确预测和系统性失败案例中的行为。尽管整体表现优异,曲线下面积(AUC)达到0.93,准确率为0.94,但分析揭示了仅通过汇总评估指标无法观察到的局限性。积分梯度似乎提取了更为分散的上下文归因,而Shapley加法解释则更专注于明确的词汇线索。两者输出的差异体现在假阴性和假阳性中。定性案例研究揭示了重复出现的失败模式,如间接有害、词汇过度归因或政治话语。结果表明,可解释的人工智能能够通过暴露模型的不确定性并提高自动决策背后的可解释性,从而促进人机协作的审核过程。最重要的是,这项工作强调了可解释性在在线有害内容检测系统中作为透明性和诊断资源的作用,而非仅仅作为提高性能的手段。
cs.CL / 10 / 2603.18016
MineDraft: A Framework for Batch Parallel Speculative Decoding
MineDraft:一种批量并行推测解码框架
Abstract
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show that MineDraft yields significant improvements in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.
Chinese Translation
推测解码(Speculative Decoding, SD)通过使用较小的草稿模型提出草稿标记,从而加速大型语言模型的推理,这些草稿标记随后由较大的目标模型进行验证。然而,标准的推测解码性能常常受到草稿和验证阶段严格顺序执行的限制。为了解决这个问题,本文提出了MineDraft,一种批量并行推测解码(Batch Parallel Speculative Decoding, PSD)框架,旨在通过将草稿过程与验证过程重叠来有效隐藏草稿延迟。我们的理论分析表明,PSD在效率上显著优于标准的SD。MineDraft通过一种新颖的批量并行设计实现了PSD,该设计维护了两批请求,将一批的草稿过程与另一批的验证过程重叠。我们的实验结果显示,MineDraft在吞吐量(提高高达75%)和端到端延迟(降低高达39%)方面相较于标准SD有显著改善。此外,我们已将MineDraft实现为vLLM的插件,展示了其在生产就绪推理系统中的实用性。
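The latency-hiding argument above can be made concrete with a toy timeline model: standard SD pays draft-plus-verify every round, while a two-batch pipeline pays only the slower stage per round once the pipeline is filled. The durations below are invented numbers for illustration, and the per-batch accounting is deliberately simplified.

```python
# Toy timeline model of two-batch parallel speculative decoding (PSD).
# Draft/verify durations are illustrative assumptions.

def sequential_time(rounds, t_draft, t_verify):
    """Standard SD: draft then verify, strictly in sequence."""
    return rounds * (t_draft + t_verify)

def overlapped_time(rounds, t_draft, t_verify):
    """PSD with two batches: while batch A is verified, batch B drafts,
    so each steady-state round costs only the slower stage, plus one
    initial draft to fill the pipeline."""
    return t_draft + rounds * max(t_draft, t_verify)

seq = sequential_time(rounds=10, t_draft=2.0, t_verify=5.0)   # 70.0
psd = overlapped_time(rounds=10, t_draft=2.0, t_verify=5.0)   # 52.0
```

When drafting is much cheaper than verification, the overlapped schedule approaches verification-bound throughput, which is the efficiency gap the paper's theoretical analysis formalizes.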
cs.CL / 11 / 2603.18018
An Agentic System for Schema Aware NL2SQL Generation
一种用于模式感知的自然语言到SQL生成的自主系统
Abstract
The natural language to SQL (NL2SQL) task plays a pivotal role in democratizing data access by enabling non-expert users to interact with relational databases through intuitive language. While recent frameworks have enhanced translation accuracy via task specialization, their reliance on Large Language Models (LLMs) raises significant concerns regarding computational overhead, data privacy, and real-world deployability in resource-constrained environments. To address these challenges, we propose a schema-based agentic system that strategically employs Small Language Models (SLMs) as primary agents, complemented by a selective LLM fallback mechanism. Because the LLM is invoked only upon detection of errors in the SLM-generated output, the proposed system significantly reduces computational expenditure. Experimental results on the BIRD benchmark demonstrate that our system achieves an execution accuracy of 47.78% and a validation efficiency score of 51.05%, with over 90% cost reduction compared to LLM-centric baselines, as approximately 67% of queries are resolved using local SLMs. The system achieves an average cost per query of 0.0085 compared to 0.094 for LLM-only systems, yielding near-zero operational costs for locally executed queries. [Github repository: https://github.com/mindslab25/CESMA.]
Chinese Translation
自然语言到SQL(NL2SQL)任务在通过直观语言使非专业用户能够与关系数据库互动方面发挥着关键作用,从而促进数据访问的民主化。尽管近期的框架通过任务专业化提高了翻译准确性,但它们对大型语言模型(LLMs)的依赖引发了关于计算开销、数据隐私以及在资源受限环境中实际部署的重大担忧。为了解决这些挑战,我们提出了一种基于模式的自主系统,该系统战略性地将小型语言模型(SLMs)作为主要代理,并辅以选择性LLM回退机制。仅在检测到SLM生成输出中的错误时才调用LLM,从而显著减少了计算开支。在BIRD基准上的实验结果表明,我们的系统实现了47.78%的执行准确率和51.05%的验证效率得分,与以LLM为中心的基线相比,成本降低超过90%,因为约67%的查询是通过本地SLM解决的。该系统每个查询的平均成本为0.0085,而LLM仅系统的成本为0.094,实现了本地执行查询的几乎零运营成本。[Github仓库:https://github.com/mindslab25/CESMA.]
cs.CL / 12 / 2603.18019
BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity
BenchBrowser -- 收集评估基准有效性的证据
Abstract
Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases across 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser thus helps quantify a critical gap between practitioner intent and what benchmarks actually test.
Chinese Translation
语言模型基准是否真正测量了从业者所期望的内容?高层次的元数据过于粗糙,无法传达基准的细微现实:“诗歌”基准可能从未测试过俳句,而“遵循指令”基准通常会测试任意混合的技能。这种不透明性使得验证与从业者目标的一致性成为一项繁琐的过程,甚至在模型在未测试的用户兴趣方面失败时,也可能造成能力的错觉。我们引入了BenchBrowser,这是一种检索工具,可以提取与自然语言使用案例相关的评估项目,覆盖20个基准套件。通过一项确认高检索精度的人类研究进行验证,BenchBrowser生成证据,帮助从业者诊断内容有效性低(能力的各个方面覆盖范围狭窄)和趋同有效性低(在测量同一能力时缺乏稳定排名)。因此,BenchBrowser有助于量化从业者意图与基准实际测试之间的关键差距。
cs.CL / 13 / 2603.18124
Evaluating FrameNet-Based Semantic Modeling for Gender-Based Violence Detection in Clinical Records
基于FrameNet的语义建模在临床记录中检测性别暴力的评估
Abstract
Gender-based violence (GBV) is a major public health issue, with the World Health Organization estimating that one in three women experiences physical or sexual violence by an intimate partner during her lifetime. In Brazil, although healthcare professionals are legally required to report such cases, underreporting remains significant due to difficulties in identifying abuse and limited integration between public information systems. This study investigates whether FrameNet-based semantic annotation of open-text fields in electronic medical records can support the identification of patterns of GBV. We compare the performance of an SVM classifier for GBV cases trained on (1) frame-annotated text, (2) annotated text combined with parameterized data, and (3) parameterized data alone. Quantitative and qualitative analyses show that models incorporating semantic annotation outperform categorical models, achieving over 0.3 improvement in F1 score and demonstrating that domain-specific semantic representations provide meaningful signals beyond structured demographic data. The findings support the hypothesis that semantic analysis of clinical narratives can enhance early identification strategies and support more informed public health interventions.
Chinese Translation
性别暴力(GBV)是一个重大的公共卫生问题,世界卫生组织估计,三分之一的女性在其一生中会遭遇伴侣的身体或性暴力。在巴西,尽管医疗专业人员在法律上有义务报告此类案件,但由于识别虐待的困难和公共信息系统之间的有限整合,漏报现象仍然显著。本研究探讨了基于FrameNet的语义注释在电子病历开放文本字段中是否能够支持性别暴力模式的识别。我们比较了针对性别暴力案件的支持向量机(SVM)分类器在(1)框架注释文本、(2)结合参数化数据的注释文本和(3)仅使用参数化数据上的表现。定量和定性分析表明,结合语义注释的模型优于分类模型,F1分数提高超过0.3,并且证明了领域特定的语义表示提供了超越结构化人口数据的有意义信号。研究结果支持了这样的假设:临床叙述的语义分析可以增强早期识别策略,并支持更为明智的公共卫生干预。
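The three feature setups compared in the study (frame-annotated text only, frames plus parameterized data, parameterized data only) can be sketched as vector constructions. The frame names and record fields below are invented examples for illustration; in the paper these feature vectors feed an SVM classifier.

```python
# Sketch of the three feature setups compared for GBV classification.
# Frame names and record fields are illustrative assumptions.

def featurize(record, frame_vocab, use_frames=True, use_params=True):
    vec = []
    if use_frames:
        # One-hot indicators over FrameNet frames evoked in the narrative.
        frames = set(record["frames"])
        vec += [1.0 if f in frames else 0.0 for f in frame_vocab]
    if use_params:
        # Structured (parameterized) fields from the record.
        vec += [record["age"] / 100.0, float(record["repeat_visit"])]
    return vec

frame_vocab = ["Cause_harm", "Abusing", "Body_parts"]
record = {"frames": ["Cause_harm", "Body_parts"], "age": 34, "repeat_visit": 1}

x_frames = featurize(record, frame_vocab, use_params=False)   # setup (1)
x_both   = featurize(record, frame_vocab)                     # setup (2)
x_params = featurize(record, frame_vocab, use_frames=False)   # setup (3)
```

The study's finding is that setups including the frame indicators outperform the parameterized-only setup by over 0.3 F1.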
cs.CL / 14 / 2603.18161
How LLMs Distort Our Written Language
大型语言模型如何扭曲我们的书面语言
Abstract
Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. In this work, we demonstrate that LLMs not only alter the voice and tone of human writing, but also consistently alter the intended meaning. First, we conduct a human user study to understand how people actually interact with LLMs when using them for writing. Our findings reveal that extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. Heavy LLM users were significantly more likely to report that the writing was less creative and not in their voice. Next, using a dataset of human-written essays collected in 2021, before the widespread release of LLMs, we study how asking an LLM to revise an essay based on the human-written feedback in the dataset induces large changes in the resulting content and meaning. We find that even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning. We then examine LLM-generated text in the wild, specifically focusing on the 21% of AI-generated scientific peer reviews at a recent top AI conference. We find that LLM-generated reviews place significantly less weight on clarity and significance of the research, and assign scores that are, on average, a full point higher. These findings highlight a misalignment between the perceived benefit of AI use and an implicit, consistent effect on the semantics of human writing, motivating future work on how widespread AI writing will affect our cultural and scientific institutions.
Chinese Translation
大型语言模型(LLMs)全球有超过十亿人使用,最常用于书写辅助。在本研究中,我们展示了LLMs不仅改变了人类书写的声音和语调,还持续改变了所表达的意图。首先,我们进行了一项人类用户研究,以了解人们在使用LLMs进行写作时的实际互动方式。我们的发现表明,大量使用LLM导致在回答主题问题时,论文保持中立的比例增加了近70%。显著更多的重度LLM用户报告称,其写作缺乏创造性,且不再是他们的声音。接下来,我们使用2021年收集的人类撰写论文的数据集,研究在该数据集中请求LLM根据人类反馈修订论文如何导致结果内容和意义的重大变化。尽管LLMs接收了专家反馈并被要求仅进行语法编辑,我们发现它们仍以显著改变语义的方式修改文本。然后,我们考察了真实环境中的LLM生成文本,特别关注最近一场顶级人工智能会议中21%的AI生成的科学同行评审。我们发现LLM生成的评审对研究的清晰度和重要性赋予的权重显著较低,并且评分平均高出整整一个点。这些发现突显了人工智能使用的感知益处与人类写作语义上隐含的、持续的影响之间的不一致,激励未来研究广泛的AI写作将如何影响我们的文化和科学机构。
cs.CL / 15 / 2603.18171
Modeling the human lexicon under temperature variations: linguistic factors, diversity and typicality in LLM word associations
温度变化下的人类词汇模型:语言因素、多样性与大型语言模型(LLM)词汇关联的典型性
Abstract
Large language models (LLMs) achieve impressive results in terms of fluency in text generation, yet the nature of their linguistic knowledge - in particular the human-likeness of their internal lexicon - remains uncertain. This study compares human and LLM-generated word associations to evaluate how accurately models capture human lexical patterns. Using English cue-response pairs from the SWOW dataset and newly generated associations from three LLMs (Mistral-7B, Llama-3.1-8B, and Qwen-2.5-32B) across multiple temperature settings, we examine (i) the influence of lexical factors such as word frequency and concreteness on cue-response pairs, and (ii) the variability and typicality of LLM responses compared to human responses. Results show that all models mirror human trends for frequency and concreteness but differ in response variability and typicality. Larger models such as Qwen tend to emulate a single "prototypical" human participant, generating highly typical but minimally variable responses, while smaller models such as Mistral and Llama produce more variable yet less typical responses. Temperature settings further influence this trade-off, with higher values increasing variability but decreasing typicality. These findings highlight both the similarities and differences between human and LLM lexicons, emphasizing the need to account for model size and temperature when probing LLM lexical representations.
Chinese Translation
大型语言模型(LLMs)在文本生成的流畅性方面取得了令人瞩目的成果,但它们的语言知识本质——尤其是其内部词汇的类人特征——仍然不确定。本研究比较了人类与LLM生成的词汇关联,以评估模型在多大程度上准确捕捉人类的词汇模式。我们使用来自SWOW数据集的英语提示-响应对,以及在多种温度设置下从三个LLM(Mistral-7B、Llama-3.1-8B和Qwen-2.5-32B)生成的新关联,考察了(i)词汇因素如词频和具体性对提示-响应对的影响,以及(ii)LLM响应与人类响应相比的变异性和典型性。结果显示,所有模型在频率和具体性方面都反映了人类的趋势,但在响应的变异性和典型性上存在差异。较大的模型如Qwen倾向于模拟单一的“原型”人类参与者,生成高度典型但变异性极小的响应,而较小的模型如Mistral和Llama则产生更具变异性但典型性较低的响应。温度设置进一步影响这一权衡,较高的温度值增加了变异性但降低了典型性。这些发现突显了人类与LLM词汇之间的相似性与差异性,强调在探讨LLM词汇表示时需要考虑模型大小和温度。
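The two quantities traded off above, response variability and typicality, can be computed on toy data. The metric definitions below (distinct-response ratio, and overlap with the modal human association) are simple illustrative assumptions rather than the paper's exact measures.

```python
# Toy variability/typicality metrics for cue-response word associations.
# Metric definitions and data are illustrative assumptions.

from collections import Counter

def variability(responses):
    """Fraction of distinct responses among all responses to a cue."""
    return len(set(responses)) / len(responses)

def typicality(responses, human_responses, k=1):
    """Fraction of responses matching the top-k human associations."""
    top_human = {w for w, _ in Counter(human_responses).most_common(k)}
    return sum(r in top_human for r in responses) / len(responses)

human = ["night", "night", "moon", "dark"]          # human responses to cue "day"
model = ["night", "night", "night", "sun"]          # sampled model responses

v = variability(model)        # 2 distinct responses out of 4
t = typicality(model, human)  # 3 of 4 match the modal human response
```

On these numbers, the model behaves like the "prototypical participant" pattern the study reports for larger models: high typicality, low variability; raising the sampling temperature would push both metrics in opposite directions.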
cs.CL / 16 / 2603.18173
GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation
GRAFITE:用于问题跟踪和评估的生成回归分析框架
Abstract
Interest in large language models (LLMs) is largely driven by their performance on the topics and benchmarks that are popular at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of inflated model performance if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform built around a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems from user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform enables side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo video is available at www.youtube.com/watch?v=XFZyoleN56k.
Chinese Translation
大型语言模型(LLMs)的性能主要受到其发布时热门主题和基准的影响。然而,随着时间的推移,由于在训练过程中对基准数据的显著曝光,模型会出现污染。如果测试执行不当,这将导致模型性能膨胀的风险。为了解决这一挑战,我们提出了GRAFITE,这是一个通过综合系统维护和评估模型问题的持续LLM评估平台。我们的方法使得基于用户反馈建立模型问题的库成为可能,并提供了一条通过使用LLM作为评判者的质量保证(QA)测试来评估LLM与这些问题的管道。该平台支持多个模型的并排比较,便于检测不同版本之间的回归。该平台可在https://github.com/IBM/grafite获取。演示视频可在www.youtube.com/watch?v=XFZyoleN56k观看。
cs.CL / 17 / 2603.18184
CWoMP: Morpheme Representation Learning for Interlinear Glossing
CWoMP:用于逐行注释的语素表示学习
Abstract
Interlinear glossed text (IGT) is a standard notation for language documentation which is linguistically rich but laborious to produce manually. Recent automated IGT methods treat glosses as character sequences, neglecting their compositional structure. We propose CWoMP (Contrastive Word-Morpheme Pretraining), which instead treats morphemes as atomic form-meaning units with learned representations. A contrastively trained encoder aligns words-in-context with their constituent morphemes in a shared embedding space; an autoregressive decoder then generates the morpheme sequence by retrieving entries from a mutable lexicon of these embeddings. Predictions are interpretable--grounded in lexicon entries--and users can improve results at inference time by expanding the lexicon without retraining. We evaluate on diverse low-resource languages, showing that CWoMP outperforms existing methods while being significantly more efficient, with particularly strong gains in extremely low-resource settings.
Chinese Translation
逐行注释文本(Interlinear glossed text, IGT)是一种用于语言文献记录的标准符号,具有丰富的语言学信息,但手动生成过程繁琐。近期的自动化IGT方法将注释视为字符序列,忽视了其组成结构。我们提出了CWoMP(对比词-语素预训练),该方法将语素视为具有学习表示的原子形式-意义单元。经过对比训练的编码器将上下文中的词与其组成语素对齐于共享的嵌入空间;自回归解码器随后通过从这些嵌入的可变词典中检索条目来生成语素序列。预测是可解释的,基于词典条目,用户可以通过扩展词典而无需重新训练来改善推理时的结果。我们在多种低资源语言上进行了评估,结果表明CWoMP在性能上优于现有方法,同时显著提高了效率,尤其在极低资源环境下表现出强劲的提升。
cs.CL / 18 / 2603.18203
How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence
心理学习范式如何塑造和限制人工智能
Abstract
The dominant paradigms of artificial intelligence were shaped by learning theories from psychology: behaviorism inspired reinforcement learning, cognitivism gave rise to deep learning and memory-augmented architectures, and constructivism influenced curriculum learning and compositional approaches. This paper argues that each AI paradigm inherited not only the strengths but also the structural limitations of the psychological theory that inspired it. Reinforcement learning cannot account for the internal structure of knowledge, deep learning compresses representations into opaque parameter spaces resistant to principled update, and current integrative approaches lack a formal account of how new understanding is constructed from existing components. The paper further examines a cross-cultural divergence in the interpretation of rote learning, arguing that the Eastern conception of memorization as a structured, multi-phase precursor to understanding offers an underexploited bridge between psychological theory and AI methodology. Drawing on the systematicity debate and Aizawa's critique of both classicism and connectionism, this paper introduces ReSynth, a trimodular framework that separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components. The paper traces the genealogy from psychological paradigm to AI method, diagnoses the inherited limitations at each stage, and argues that adaptability, the central challenge of artificial general intelligence, requires a representational architecture in which systematic behavior is a necessary consequence rather than an accidental property.
Chinese Translation
人工智能的主导范式受到心理学学习理论的影响:行为主义激发了强化学习,认知主义催生了深度学习和记忆增强架构,建构主义则影响了课程学习和组合方法。本文论证,每一种人工智能范式不仅继承了其所依赖的心理学理论的优势,也承载了其结构性限制。强化学习无法解释知识的内部结构,深度学习将表示压缩到抗拒原则性更新的不透明参数空间,而当前的综合方法缺乏对如何从现有组件构建新理解的正式解释。本文进一步探讨了对死记硬背的跨文化解读差异,认为东方将记忆视为理解的结构化、多阶段前驱,这一观念提供了一座尚未充分利用的桥梁,连接心理学理论与人工智能方法论。借助系统性辩论以及 Aizawa 对经典主义与连接主义的批判,本文引入了 ReSynth,一个三模块框架,将推理(Intellect)、目的(Identity)和知识(Memory)视为在架构上独立的组件。本文追溯了从心理学范式到人工智能方法的谱系,诊断每个阶段所继承的局限性,并论证适应性——即人工通用智能的核心挑战——需要一种表示架构,其中系统性行为是必要的结果,而不是偶然的属性。
cs.CL / 19 / 2603.18358
From Noise to Signal: When Outliers Seed New Topics
从噪声到信号:当异常值孕育新主题
Abstract
Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.
Chinese Translation
动态主题建模中的异常值通常被视为噪声,然而我们表明其中一些可以作为新兴主题的早期信号。我们引入了一种新闻文档轨迹的时间分类法,定义了文档如何与主题形成随时间的关系。它区分了预期异常值(anticipatory outliers),即那些在后续加入的主题之前出现的文档,以及那些要么强化现有主题,要么保持孤立的文档。通过捕捉这些轨迹,该分类法将弱信号检测与时间主题建模联系起来,并阐明了个别文章如何预见、启动或在不断演变的聚类中漂移。我们在一个累积聚类的环境中实现了该分类法,使用来自十一种最先进语言模型的文档嵌入,并在HydroNewsFr上进行了回顾性评估,该语料库是关于氢经济的法语新闻。模型间的一致性揭示了一小部分高共识的预期异常值,增强了对这些标签的信心。定性案例研究进一步通过具体的主题发展来说明这些轨迹。
cs.CL / 20 / 2603.18361
Synthetic Data Generation for Training Diversified Commonsense Reasoning Models
用于训练多样化常识推理模型的合成数据生成
Abstract
Conversational agents are required to respond to their users not only with high-quality (i.e. commonsense-bearing) responses, but also by considering multiple plausible alternative scenarios, reflecting diversity in their responses. Despite the growing need to train diverse commonsense generators, progress in this line of work has been significantly hindered by the lack of large-scale, high-quality, diverse commonsense training datasets. Due to high annotation costs, existing Generative Commonsense Reasoning (GCR) datasets are created using a small number of human annotators, covering only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create CommonSyn, the first synthetic dataset for diversified GCR. Across Large Language Models (LLMs) of different sizes, the model fine-tuned on our synthetic data jointly increases both generation diversity and quality compared with vanilla models and the model fine-tuned on a human-crafted dataset.
Chinese Translation
对话代理不仅需要以高质量(即具有常识的)回应用户,还需考虑多种合理的替代场景,以反映其回应的多样性。尽管对训练多样化常识生成器的需求日益增长,但由于缺乏大规模高质量的多样化常识训练数据集,这一研究方向的进展受到显著阻碍。由于高昂的标注成本,现有的生成性常识推理(Generative Commonsense Reasoning, GCR)数据集是由少量人类标注者创建的,仅覆盖狭窄的常识场景集。为了解决这一训练资源缺口,我们提出了一种两阶段的方法,创建了首个用于多样化GCR的合成数据集CommonSyn。与未经微调的基础模型以及在人工构建数据集上微调的模型相比,在我们的合成数据上微调的模型在不同规模的大型语言模型(LLMs)上同时提升了生成多样性和质量。
cs.CL / 21 / 2603.18363
PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching
PowerFlow:通过原则性分布匹配解锁大型语言模型的双重特性
Abstract
Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
Chinese Translation
无监督内部反馈强化学习(RLIF)已成为一种有前景的范式,用于在没有外部监督的情况下引发大型语言模型(LLMs)的潜在能力。然而,当前的方法依赖于启发式内在奖励,这些奖励往往缺乏明确的理论优化目标,并且容易受到退化偏见的影响。在本研究中,我们引入了PowerFlow,一个原则性框架,将无监督微调重新表述为分布匹配问题。通过将GFlowNet视为未归一化密度的摊销变分采样器,我们提出了一种考虑长度的轨迹平衡目标,明确中和自回归生成中固有的结构长度偏见。通过瞄准$\alpha$-幂分布,PowerFlow能够有针对性地引发LLMs的双重特性:锐化分布($\alpha > 1$)以增强逻辑推理,或使其平坦($\alpha < 1$)以释放表现力创意。大量实验表明,PowerFlow始终优于现有的RLIF方法,匹配甚至超过监督的GRPO。此外,通过减轻对齐模型中的过度锐化,我们的方法在多样性和质量上实现了同时提升,推动了创意任务中的帕累托前沿。
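The distribution-matching target can be written out explicitly. The $\alpha$-power target is stated in the abstract; the trajectory-balance form below is the standard GFlowNet objective and is an assumption about how PowerFlow instantiates it (the paper's length-aware correction is omitted here for brevity). The target distribution tempers the frozen base model:

$$\pi_\alpha(y \mid x) \;\propto\; p_{\text{base}}(y \mid x)^{\alpha},$$

so that $\alpha > 1$ sharpens the distribution toward high-likelihood reasoning chains and $\alpha < 1$ flattens it toward diverse, creative outputs. A trajectory-balance objective for a sampler $q_\theta$ with learned partition estimate $Z_\theta$ then reads, schematically,

$$\mathcal{L}_{\text{TB}}(y) \;=\; \Big( \log Z_\theta(x) \;+\; \sum_{t} \log q_\theta(y_t \mid x, y_{<t}) \;-\; \alpha \log p_{\text{base}}(y \mid x) \Big)^{2},$$

which is minimized exactly when $q_\theta$ samples proportionally to the unnormalized density $p_{\text{base}}^{\alpha}$.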
cs.CL / 22 / 2603.18390
AutoScreen-FW: An LLM-based Framework for Resume Screening
AutoScreen-FW:基于大语言模型的简历筛选框架
Abstract
Corporate recruiters often need to screen many resumes within a limited time, which increases their burden and may cause suitable candidates to be overlooked. To address these challenges, prior work has explored LLM-based automated resume screening. However, some methods rely on commercial LLMs, which may pose data privacy risks. Moreover, since companies typically do not make resumes with evaluation results publicly available, it remains unclear which resume samples should be used during learning to improve an LLM's judgment performance. To address these problems, we propose AutoScreen-FW, an LLM-based framework for local, automated resume screening. AutoScreen-FW uses several methods to select a small set of representative resume samples. These samples are used for in-context learning together with a persona description and evaluation criteria, enabling open-source LLMs to act as a career advisor and evaluate unseen resumes. Experiments with multiple ground truths show that the open-source LLM judges consistently outperform GPT-5-nano. Under one ground-truth setting, it also surpasses GPT-5-mini. Although it is slightly weaker than GPT-5-mini under the other ground-truth settings, it runs substantially faster per resume than commercial GPT models. These findings indicate the potential for deploying AutoScreen-FW locally in companies to support efficient screening while reducing recruiters' burden.
Chinese Translation
企业招聘人员通常需要在有限的时间内筛选大量简历,这增加了他们的负担,并可能导致合适的候选人被忽视。为了解决这些挑战,之前的研究探索了基于大语言模型(LLM)的自动化简历筛选。然而,一些方法依赖于商业LLM,这可能带来数据隐私风险。此外,由于公司通常不公开带有评估结果的简历,因此在学习过程中应使用哪些简历样本以提高LLM的判断性能仍然不明确。为了解决这些问题,我们提出了AutoScreen-FW,一个基于LLM的本地自动化简历筛选框架。AutoScreen-FW使用多种方法选择一小组具有代表性的简历样本。这些样本与角色描述和评估标准一起用于上下文学习,使开源LLM能够充当职业顾问并评估未见过的简历。多项真实数据的实验表明,开源LLM的判断表现始终优于GPT-5-nano。在某一真实数据设置下,它还超越了GPT-5-mini。尽管在其他真实数据设置下,它的表现略逊于GPT-5-mini,但其每份简历的运行速度明显快于商业GPT模型。这些发现表明在公司内部部署AutoScreen-FW以支持高效筛选并减轻招聘人员负担的潜力。
cs.CL / 23 / 2603.18409
TopoChunker: Topology-Aware Agentic Document Chunking Framework
TopoChunker: 基于拓扑感知的代理文档分块框架
Abstract
Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.
Chinese Translation
当前用于增强检索生成(Retrieval-Augmented Generation, RAG)的文档分块方法通常对文本进行线性化。这种强迫线性化剥夺了固有的拓扑层次结构,造成“语义碎片化”,从而降低了下游检索质量。本文提出了TopoChunker,一种代理框架,将异构文档映射到结构化中间表示(Structured Intermediate Representation, SIR),以明确保留跨段依赖关系。为了在结构保真度与计算成本之间取得平衡,TopoChunker采用了双代理架构。检视代理(Inspector Agent)动态地通过成本优化的提取路径引导文档,而精炼代理(Refiner Agent)则执行能力审计和拓扑上下文消歧,以重建层级谱系。在非结构化叙述(GutenQA)和复杂报告(GovReport)上的评估表明,TopoChunker表现出最先进的性能。它在绝对生成准确率上比最强的基于大语言模型(LLM)的基准提高了8.0%,并达到了83.26%的Recall@3,同时将令牌开销减少了23.5%,提供了一种可扩展的结构感知RAG方法。
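To make the idea of a structure-preserving intermediate representation concrete, here is a toy sketch: instead of flattening a document into a linear token stream, each chunk carries its hierarchical lineage. The SIR format below is an assumption invented for illustration, not the paper's actual representation.

```python
# Illustrative sketch only: a toy "structured intermediate representation"
# that keeps each chunk's hierarchical lineage instead of flattening the
# document. The lineage string format is a made-up stand-in for SIR.

def to_sir(lines):
    """Build (lineage, text) chunks from '#'-prefixed heading lines."""
    stack, chunks = [], []
    for line in lines:
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("#").strip()
            stack = stack[: level - 1] + [title]  # truncate to parent depth
        else:
            chunks.append((" > ".join(stack), line))
    return chunks

doc = ["# Report", "## Methods", "We sample data.", "## Results", "Accuracy rose."]
sir = to_sir(doc)  # each chunk remembers its ancestors
```

A retriever over such chunks can disambiguate "Results" text from "Methods" text even when their surface wording overlaps, which is the kind of cross-segment dependency flat chunking loses.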
cs.CL / 24 / 2603.18411
TARo: Token-level Adaptive Routing for LLM Test-time Alignment
TARo:面向大语言模型测试时对齐的令牌级自适应路由
Abstract
Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo also generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.
Chinese Translation
大语言模型(LLMs)展现出强大的推理能力,但通常需要昂贵的后训练才能达到高性能。近期的测试时对齐方法提供了一种轻量级的替代方案,但主要针对偏好对齐而非推理进行探索。为了解决这一问题,我们提出了令牌级自适应路由(TARo),该方法在推理时完全引导冻结的LLMs进行结构化推理。具体而言,我们首先在逐步数学轨迹上训练奖励模型,以捕捉细粒度的逻辑一致性信号,然后引入一个可学习的令牌级路由器,自动控制奖励模型对基础模型的引导。大量实验表明,TARo显著提高了推理性能,较基础模型提升了高达22.4%,较现有的令牌级测试时对齐方法提升了8.4%,同时也增强了分布外临床推理(MedXpertQA)和指令跟随(AlpacaEval)的能力。此外,TARo还能够在不重新训练的情况下从小型骨干网络推广到大型骨干网络,将测试时对齐从偏好优化扩展到稳健的跨领域推理。
cs.CL / 25 / 2603.18425
Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs
多模态任务干扰:多模态大型语言模型中历史-目标不匹配的基准与分析
Abstract
Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of history-target mismatch along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.
Chinese Translation
任务干扰是指在单一对话中因任务切换而导致的性能下降,尽管多模态对话系统日益普及,但相关研究主要集中在仅使用文本的场景。我们为评估多模态大型语言模型中的这一现象引入了一个基准,涵盖了六个跨文本和视觉的任务,并系统性地沿着三个维度改变历史-目标:模态不匹配、推理不匹配和答案格式不匹配。对开放权重和专有模型的实验表明,任务干扰具有高度的方向性:从仅文本目标切换到基于图像的目标会导致严重的性能下降,而相反的转变则仅产生最小的降幅。当多个维度同时出现不匹配时,干扰会进一步加剧,且模态差异对干扰的影响最为显著,其次是答案格式,而推理需求的变化对性能下降的影响则较小。
cs.CL / 26 / 2603.18428
Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation
通过测试时策略学习实现自我改进生成的自适应解码
Abstract
Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test time while keeping LLM weights frozen. We evaluate on summarization datasets including BookSum, arXiv, and WikiHow using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.
Chinese Translation
解码策略在很大程度上决定了大型语言模型(LLM)输出的质量,而广泛使用的启发式方法,如贪婪解码或固定温度/top-p 解码,往往是静态的且不考虑任务,导致在需要风格或结构灵活性的领域中生成质量不佳或不一致。我们提出了一种基于强化学习的解码器采样器,将解码视为序列决策过程,并学习一种轻量级策略,在测试时调整采样参数,同时保持 LLM 权重不变。我们使用 Granite-3.3-2B 和 Qwen-2.5-0.5B 评估了包括 BookSum、arXiv 和 WikiHow 在内的摘要数据集。我们的策略采样器在性能上始终优于贪婪和静态基线,取得了高达 +88%(BookSum,Granite)和 +79%(WikiHow,Qwen)的相对增益。奖励消融实验表明,仅考虑重叠的目标相比于复合奖励表现不佳,而结构化的塑形项(长度、覆盖率、重复性、完整性)则能够实现稳定和持续的改进。这些发现突显了强化学习作为解码中测试时适应的实用机制,使得在不重新训练大型模型的情况下实现领域感知和用户可控的生成。
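The contrast with static decoding can be sketched as a tiny state-dependent policy: instead of a fixed temperature, a function of the decoding state (here, a crude repetition feature) chooses the temperature each step. The specific feature and linear policy below are assumptions for illustration, not the paper's learned RL policy.

```python
# Toy sketch of test-time decoding control: a tiny "policy" maps a state
# feature (recent-token repetition) to a sampling temperature, and the
# next-token distribution is computed from temperature-scaled logits.
import math

def temperature_probs(logits, temp):
    scaled = [l / temp for l in logits]
    m = max(scaled)
    es = [math.exp(s - m) for s in scaled]
    z = sum(es)
    return [e / z for e in es]

def policy_temperature(recent_tokens):
    """Raise temperature when the recent context repeats itself."""
    repetition = 1.0 - len(set(recent_tokens)) / max(len(recent_tokens), 1)
    return 0.7 + 0.8 * repetition  # maps repetition in [0,1] to [0.7, 1.5]

logits = [3.0, 1.0, 0.5]
t_low = policy_temperature([1, 2, 3, 4])   # diverse context -> cool sampling
t_high = policy_temperature([1, 1, 1, 1])  # repetitive context -> hotter
p_low = temperature_probs(logits, t_low)
p_high = temperature_probs(logits, t_high)
```

A hotter distribution flattens probability mass away from the top token, which is the lever an RL-trained sampler can pull per step while the LLM itself stays frozen.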
cs.CL / 27 / 2603.18446
UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference
UT-ACA: 不确定性触发的自适应上下文分配用于长上下文推理
Abstract
Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed context budget throughout decoding despite highly non-uniform token-level contextual demands. To address this issue, we propose Uncertainty-Triggered Adaptive Context Allocation (UT-ACA), an inference-time framework that dynamically adjusts the context window based on token-wise uncertainty. UT-ACA learns an uncertainty detector that combines semantic embeddings with logit-based confidence while accounting for uncertainty accumulation across decoding steps. When insufficient evidence is indicated, UT-ACA selectively rolls back, expands the context window, and regenerates the token with additional support. Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.
Chinese Translation
由于注意力稀释和分布外降级,长上下文推理对大型语言模型仍然具有挑战性。上下文选择通过关注关键值缓存条目的子集来缓解这一限制,但大多数方法在解码过程中分配固定的上下文预算,尽管在令牌级上下文需求上存在高度的不均匀性。为了解决这个问题,我们提出了不确定性触发的自适应上下文分配(UT-ACA),这是一种在推理时动态调整上下文窗口的框架,基于令牌级的不确定性。UT-ACA学习一个不确定性检测器,该检测器结合语义嵌入和基于对数几率的信心,同时考虑到解码步骤中的不确定性累积。当表明证据不足时,UT-ACA选择性地回滚,扩展上下文窗口,并使用额外的支持重新生成令牌。实验表明,UT-ACA在长上下文设置中显著减少了平均上下文使用,同时保持生成质量。
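The rollback-and-expand loop can be illustrated with a toy decoder: generate with a small context budget, and when next-token entropy exceeds a threshold, widen the window and regenerate. The stand-in "model", the doubling schedule, and the threshold are all assumptions for illustration, not UT-ACA's learned detector.

```python
# Hedged sketch of uncertainty-triggered context expansion: when the
# entropy of the next-token distribution exceeds a threshold, roll back,
# widen the visible context window, and regenerate.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def toy_model(context_window):
    """Stand-in model: more context -> a sharper next-token distribution."""
    if len(context_window) >= 4:
        return {"paris": 0.9, "rome": 0.05, "lyon": 0.05}
    return {"paris": 0.4, "rome": 0.3, "lyon": 0.3}

def decode_step(full_context, budget=2, max_budget=8, threshold=0.5):
    """Try a small window first; expand on high uncertainty."""
    while True:
        dist = toy_model(full_context[-budget:])
        if entropy(list(dist.values())) <= threshold or budget >= max_budget:
            return max(dist, key=dist.get), budget
        budget *= 2  # rollback + expand the context window

token, used_budget = decode_step(["the", "capital", "of", "france", "is"])
```

The point of the design is that most steps stay cheap (small budget) and only uncertain steps pay for the larger window.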
cs.CL / 28 / 2603.18469
GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms
GAIN:不完美规范下大型语言模型目标一致决策的基准
Abstract
We introduce GAIN (Goal-Aligned Decision-Making under Imperfect Norms), a benchmark designed to evaluate how large language models (LLMs) balance adherence to norms against business goals. Existing benchmarks typically focus on abstract scenarios rather than real-world business applications. Furthermore, they provide limited insights into the factors influencing LLM decision-making. This restricts their ability to measure models' adaptability to complex, real-world norm-goal conflicts. In GAIN, models receive a goal, a specific situation, a norm, and additional contextual pressures. These pressures, explicitly designed to encourage potential norm deviations, are a unique feature that differentiates GAIN from other benchmarks, enabling a systematic evaluation of the factors influencing decision-making. We define five types of pressures: Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, and Personal Incentive. The benchmark comprises 1,200 scenarios across four domains: hiring, customer support, advertising and finance. Our experiments show that advanced LLMs frequently mirror human decision-making patterns. However, when Personal Incentive pressure is present, they diverge significantly, showing a strong tendency to adhere to norms rather than deviate from them.
Chinese Translation
我们提出了 GAIN(不完美规范下的目标一致决策),这是一个旨在评估大型语言模型(LLMs)如何在遵循规范与实现商业目标之间取得平衡的基准。现有的基准通常聚焦于抽象场景,而非真实的商业应用。此外,它们对影响 LLM 决策的因素提供的见解有限,这限制了它们测量模型在复杂的、真实的规范与目标冲突中适应性的能力。在 GAIN 中,模型接收一个目标、一个特定情境、一个规范以及额外的情境压力。这些压力明确设计以鼓励潜在的规范偏离,是 GAIN 区别于其他基准的独特特征,使得能够系统地评估影响决策的因素。我们定义了五种类型的压力:目标一致性、风险规避、情感/伦理诉求、社会/权威影响和个人激励。该基准包含了1200个场景,涵盖四个领域:招聘、客户支持、广告和金融。我们的实验表明,先进的 LLM 经常反映人类的决策模式。然而,当存在个人激励压力时,它们显著偏离,表现出强烈倾向于遵循规范而非偏离规范。
cs.CL / 29 / 2603.18474
WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior
WASD:定位关键神经元作为解释和控制大型语言模型行为的充分条件
Abstract
Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on controlling cross-lingual output generation, we validated the practical effectiveness of WASD in controlling model behavior.
Chinese Translation
精确控制大型语言模型(LLMs)的行为对于复杂应用至关重要。然而,现有的方法往往需要高昂的训练成本,缺乏自然语言的可控性,或者在语义连贯性方面存在妥协。为了填补这一空白,我们提出了WASD(unWeaving Actionable Sufficient Directives),这是一个通过识别生成token的充分神经条件来解释模型行为的新框架。我们的方法将候选条件表示为神经元激活谓词,并迭代搜索一个最小集合,以保证在输入扰动下当前输出的准确性。在Gemma-2-2B模型上对SST-2和CounterFact的实验表明,我们的方法生成的解释在稳定性、准确性和简洁性上优于传统的归因图。此外,通过对跨语言输出生成控制的案例研究,我们验证了WASD在控制模型行为方面的实际有效性。
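The "minimal sufficient set" search can be sketched with made-up predicates: greedily drop neuron-activation predicates as long as the remaining set still guarantees the original output across input perturbations. This greedy pruning is a simple stand-in for the paper's iterative search, and the predicates and data are invented for illustration.

```python
# Conceptual sketch (with made-up predicates): find a minimal set of
# neuron-activation predicates sufficient to guarantee the current output
# under input perturbations.

def output_preserved(active_predicates, perturbations):
    """A condition set is sufficient if, on every perturbed input where all
    its predicates still fire, the model output is unchanged. Each
    perturbation records which predicates hold and the resulting output."""
    for holds, out in perturbations:
        if active_predicates <= holds and out != "positive":
            return False
    return True

def minimal_sufficient(predicates, perturbations):
    """Greedily drop predicates while the set stays sufficient."""
    current = set(predicates)
    for p in sorted(predicates):
        trial = current - {p}
        if trial and output_preserved(trial, perturbations):
            current = trial
    return current

# Perturbations: (predicates that still fire, model output on that input).
perts = [
    ({"n1", "n2", "n3"}, "positive"),
    ({"n2", "n3"}, "positive"),
    ({"n3"}, "negative"),  # n3 alone does not guarantee the output
]
core = minimal_sufficient({"n1", "n2", "n3"}, perts)
```

Greedy pruning only approximates minimality; the returned set depends on the drop order, which is one reason a real system would search more carefully.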
cs.CL / 30 / 2603.18482
The Truncation Blind Spot: How Decoding Strategies Systematically Exclude Human-Like Token Choices
截断盲点:解码策略如何系统性地排除类人令牌选择
Abstract
Standard decoding strategies for text generation, including top-k, nucleus sampling, and contrastive search, select tokens based on likelihood, restricting selection to high-probability regions. Human language production operates differently: tokens are chosen for communicative appropriateness rather than statistical frequency. This mismatch creates a truncation blind spot: contextually appropriate but statistically rare tokens remain accessible to humans yet unreachable by likelihood-based decoding. We hypothesize this contributes to the detectability of machine-generated text. Analyzing over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations, we find that 8-18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers trained on predictability and lexical diversity achieve remarkable detection rates. Crucially, neither model scale nor architecture correlates strongly with detectability; truncation parameters account for most variance. Configurations achieving low detectability often produce incoherent text, indicating that evading detection and producing natural text are distinct objectives. These findings suggest detectability is enhanced by likelihood-based token selection, not merely a matter of model capability.
Chinese Translation
标准的文本生成解码策略,包括 top-k、核采样(nucleus sampling)和对比搜索(contrastive search),是基于可能性选择令牌,从而将选择限制在高概率区域。人类语言生成的运作方式则有所不同:令牌的选择是基于交际适宜性而非统计频率。这种不匹配造成了一个截断盲点:在语境上适当但在统计上稀有的令牌对人类来说是可接触的,但对于基于可能性的解码却是无法到达的。我们假设这会影响机器生成文本的可检测性。通过分析超过180万篇文本,涵盖八种语言模型、五种解码策略和53种超参数配置,我们发现8-18%的人工选择令牌超出了典型的截断边界。基于可预测性和词汇多样性训练的简单分类器达到了显著的检测率。重要的是,模型规模和架构与可检测性之间并没有强相关性;截断参数解释了大部分方差。实现低可检测性的配置往往生成不连贯的文本,表明规避检测和生成自然文本是两个不同的目标。这些发现表明,可检测性是通过基于可能性的令牌选择增强的,而不仅仅是模型能力的问题。
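The headline measurement, the share of human-chosen tokens that fall outside the truncation boundary, is straightforward to compute once per-step model distributions are available. Below is a minimal sketch for nucleus (top-p) truncation with toy probabilities; only the mechanics are from the abstract, the numbers are invented.

```python
# Simple sketch of measuring the "truncation blind spot": the fraction of
# human-chosen tokens that fall outside a model's top-p (nucleus) set.

def nucleus_set(probs, top_p):
    """Smallest set of top tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

def blind_spot_rate(model_probs_per_step, human_tokens, top_p=0.9):
    outside = sum(
        1 for probs, tok in zip(model_probs_per_step, human_tokens)
        if tok not in nucleus_set(probs, top_p)
    )
    return outside / len(human_tokens)

steps = [
    [0.6, 0.3, 0.08, 0.02],  # human picked token 0 (inside the nucleus)
    [0.7, 0.2, 0.06, 0.04],  # human picked token 3 (outside the nucleus)
]
rate = blind_spot_rate(steps, [0, 3], top_p=0.9)
```

Any token a human picks from outside the nucleus is one a top-p sampler could never emit, which is exactly the asymmetry the detectors in the study exploit.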
cs.CL / 31 / 2603.18489
EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models
熵缓存:基于解码令牌熵引导的KV缓存方法用于扩散语言模型
Abstract
Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
Chinese Translation
基于扩散的大型语言模型(dLLMs)依赖于双向注意力机制,这使得无损的KV缓存成为不可能,并且在每个去噪步骤都需要进行完整的前向传递。现有的近似KV缓存方法通过选择性更新缓存状态来降低这一成本,但它们的决策开销随上下文长度或模型深度而增加。我们提出了熵缓存(EntropyCache),这是一种无需训练的KV缓存方法,它使用新解码令牌分布的最大熵作为决定何时重新计算的恒定成本信号。我们的设计基于两个经验观察:(1) 解码令牌的熵与KV缓存漂移相关,为缓存陈旧性提供了一个低成本的代理;(2) 解码令牌的特征波动在去掩码后的多个步骤中持续存在,这促使我们重新计算最近解码的$k$个令牌。跳过或重新计算的决策每步仅需要$O(V)$的计算,与上下文长度和模型规模无关。在LLaDA-8B-Instruct和Dream-7B-Instruct上的实验表明,熵缓存在标准基准测试中实现了$15.2\times$-$26.4\times$的加速,在思维链基准测试中达到$22.4\times$-$24.1\times$的加速,同时保持了竞争力的准确性,决策开销仅占推理时间的$0.5\%$。代码可在https://github.com/mscheong01/EntropyCache获取。
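The skip-or-recompute signal described above (maximum entropy over the distributions of newly decoded tokens) reduces to a few lines; the threshold value below is an assumption for illustration, while the check itself costs $O(V)$ per decoded token as the abstract states.

```python
# Rough sketch of the skip-or-recompute rule: compute the maximum entropy
# over the distributions of tokens decoded at this step, and refresh the
# KV cache only when it exceeds a threshold (threshold value is assumed).
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_recompute(new_token_dists, threshold=1.0):
    """High entropy in freshly decoded tokens signals KV-cache drift."""
    return max(entropy(d) for d in new_token_dists) > threshold

confident = [[0.97, 0.02, 0.01], [0.9, 0.05, 0.05]]   # sharp: reuse cache
uncertain = [[0.97, 0.02, 0.01], [0.4, 0.3, 0.3]]     # flat: refresh cache
skip = not should_recompute(confident)
refresh = should_recompute(uncertain)
```

Because the decision reads only the current step's output distributions, its cost is independent of context length and model depth, which is the advantage over decision rules that inspect the cache itself.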
cs.CL / 32 / 2603.18530
When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making
当名字改变裁决:干预一致性揭示大型语言模型决策中的系统性偏见
Abstract
Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field's narrow focus on demographics; (2) bias concentrates in specific domains -- finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.
Chinese Translation
大型语言模型(LLMs)越来越多地用于高风险决策,但它们对虚假特征的敏感性仍然缺乏充分的表征。我们引入了ICE-Guard,一个应用干预一致性测试的框架,以检测三种类型的虚假特征依赖:人口统计(姓名/种族交换)、权威(资历/声望交换)和框架(正面/负面重述)。在涵盖10个高风险领域的3000个案例中,我们评估了来自8个家族的11个LLMs,发现(1)权威偏见(平均5.8%)和框架偏见(5.0%)显著超过人口统计偏见(2.2%),挑战了该领域对人口统计的狭隘关注;(2)偏见集中在特定领域——金融领域显示22.6%的权威偏见,而刑事司法领域仅显示2.8%;(3)结构化分解,即LLM提取特征并由确定性标准决定,能够将翻转率降低多达100%(在9个模型中中位数为49%)。我们展示了一个ICE引导的检测-诊断-缓解-验证循环,通过迭代提示修补实现了累计78%的偏见减少。与真实的COMPAS再犯数据的验证显示,COMPAS衍生的翻转率超过了合并的合成率,表明我们的基准提供了对现实世界偏见的保守估计。代码和数据已公开可用。
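The core intervention-consistency measurement, the flip rate under a content-preserving swap, can be sketched as follows. The "judge" here is a deliberately biased stand-in rule, not a real LLM call, and the name pair is invented for illustration.

```python
# Toy sketch of intervention-consistency testing: apply a content-preserving
# swap (e.g., a name change) to each vignette and count verdict flips.

def flip_rate(judge, vignettes, intervene):
    flips = sum(1 for v in vignettes if judge(v) != judge(intervene(v)))
    return flips / len(vignettes)

def toy_judge(vignette):
    """Biased stand-in judge: approves only applicants named Alice."""
    return "approve" if "Alice" in vignette else "deny"

def swap_name(vignette):
    return vignette.replace("Alice", "Amara")

cases = ["Alice applied for a loan.", "Bob applied for a loan."]
rate = flip_rate(toy_judge, cases, swap_name)  # half the verdicts flip
```

A fair judge should have a flip rate near zero under any semantics-preserving intervention; the same harness works for the authority and framing swaps by changing `intervene`.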
cs.CL / 33 / 2603.18557
Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition
通过评估分解实现跨语言 LLM-Judge 转移
Abstract
As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS). UCS consists of a shared, language-agnostic set of evaluation dimensions, producing an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.
Chinese Translation
随着大型语言模型在各种实际应用中的普遍部署,扩展自动化评估至英语以外的语言已成为一项关键挑战。现有的评估方法主要集中于英语,其适应其他语言受到大多数语言中人类标注判断稀缺和成本高昂的限制。我们提出了一种基于分解的评估框架,该框架围绕通用标准集(Universal Criteria Set, UCS)构建。UCS 包含一组共享的、与语言无关的评估维度,生成可解释的中间表示,支持最小监督下的跨语言转移。在多种语言和模型骨架下的多个忠实度任务上的实验表明,在没有目标语言注释的情况下,相较于强基线,表现出了一致的提升。
cs.CL / 34 / 2603.18579
ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs
ICE:基于统计基础的干预一致性解释评估框架用于大语言模型
Abstract
Evaluating whether explanations faithfully reflect a model's reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. We introduce ICE (Intervention-Consistent Explanation), a framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators, yielding win rates with confidence intervals. Evaluating 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, we find that faithfulness is operator-dependent: operator gaps reach up to 44 percentage points, with deletion typically inflating estimates on short text but the pattern reversing on long text, suggesting that faithfulness should be interpreted comparatively across intervention operators rather than as a single score. Randomized baselines reveal anti-faithfulness in one-third of configurations, and faithfulness shows zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone. We release the ICE framework and ICEBench benchmark.
Chinese Translation
评估解释是否真实反映模型的推理仍然是一个未解决的问题。现有基准使用单一干预而没有进行统计测试,这使得无法区分真实的忠实性与偶然水平的表现。我们提出了ICE(干预一致性解释),这是一个通过随机化测试在多个干预操作下,将解释与匹配的随机基线进行比较的框架,从而产生带有置信区间的胜率。在对7个大语言模型(LLMs)进行评估时,涵盖了4个英语任务、6种非英语语言和2种归因方法,我们发现忠实性依赖于操作符:操作符之间的差距可达44个百分点,删除操作通常在短文本上夸大估计,但在长文本上则出现相反的模式,这表明忠实性应在不同干预操作之间进行比较解读,而不是作为单一分数。随机基线揭示了三分之一配置中的反忠实性,忠实性与人类可信度之间没有相关性(|r| < 0.04)。多语言评估揭示了模型与语言之间的显著交互,这种交互仅通过分词无法解释。我们发布了ICE框架和ICEBench基准。
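The randomization-test idea, comparing an explanation's deletion effect against matched random deletions to get a win rate, can be sketched with a toy scoring model. The scoring function, feature names, and trial count below are assumptions for illustration; ICE additionally derives confidence intervals, which this sketch omits.

```python
# Minimal sketch of a randomization test for faithfulness: an explanation
# "wins" against a random baseline when deleting its top-attributed
# features degrades the model score more than deleting a random feature
# set of the same size.
import random

def drop_effect(score_fn, features, removed):
    kept = [f for f in features if f not in removed]
    return score_fn(features) - score_fn(kept)

def win_rate(score_fn, features, attributed, n_trials=200, seed=0):
    rng = random.Random(seed)
    expl_effect = drop_effect(score_fn, features, set(attributed))
    wins = sum(
        1 for _ in range(n_trials)
        if expl_effect > drop_effect(
            score_fn, features, set(rng.sample(features, len(attributed))))
    )
    return wins / n_trials

def toy_score(feats):
    """Stand-in model: score counts the truly important features present."""
    return sum(f in {"f1", "f2"} for f in feats)

feats = ["f1", "f2", "f3", "f4", "f5", "f6"]
rate = win_rate(toy_score, feats, ["f1", "f2"])  # faithful attribution
```

A win rate near 0.5 means the explanation is no better than chance, and below 0.5 indicates the anti-faithfulness the paper reports in a third of configurations.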
cs.CL / 35 / 2603.18593
Language Model Maps for Prompt-Response Distributions via Log-Likelihood Vectors
通过对数似然向量构建的提示-响应分布的语言模型映射
Abstract
We propose a method that represents language models by log-likelihood vectors over prompt-response pairs and constructs model maps for comparing their conditional distributions. In this space, distances between models approximate the KL divergence between the corresponding conditional distributions. Experiments on a large collection of publicly available language models show that the maps capture meaningful global structure, including relationships to model attributes and task performance. The method also captures systematic shifts induced by prompt modifications and their approximate additive compositionality, suggesting a way to analyze and predict the effects of composite prompt operations. We further introduce pointwise mutual information (PMI) vectors to reduce the influence of unconditional distributions; in some cases, PMI-based model maps better reflect training-data-related differences. Overall, the framework supports the analysis of input-dependent model behavior.
Chinese Translation
我们提出了一种方法,通过对提示-响应对的对数似然向量来表示语言模型,并构建模型映射以比较它们的条件分布。在这个空间中,模型之间的距离近似于相应条件分布之间的KL散度。在对大量公开可用的语言模型进行的实验中,结果表明这些映射捕捉到了有意义的全局结构,包括与模型属性和任务性能的关系。该方法还捕捉到了由提示修改引起的系统性变化及其近似的加性组合性,暗示了一种分析和预测复合提示操作效果的方法。我们进一步引入了逐点互信息(PMI)向量,以减少无条件分布的影响;在某些情况下,基于PMI的模型映射更好地反映了与训练数据相关的差异。总体而言,该框架支持对输入依赖模型行为的分析。
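The map construction can be sketched directly from the abstract: represent each model by its log-likelihoods over a shared set of prompt-response pairs, then compare models by distance in that space. The probabilities below are toy numbers, not real model likelihoods, and plain Euclidean distance is used as the stand-in metric.

```python
# Small sketch of a model map from log-likelihood vectors: each model is a
# vector of log-likelihoods over the same prompt-response pairs; distances
# in this space are used to compare conditional distributions.
import math

def loglik_vector(model_probs):
    return [math.log(p) for p in model_probs]

def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Per-model probabilities assigned to the same 3 prompt-response pairs.
model_a = [0.50, 0.40, 0.10]
model_b = [0.48, 0.42, 0.11]   # behaves almost like model A
model_c = [0.05, 0.05, 0.90]   # very different behavior

va, vb, vc = map(loglik_vector, (model_a, model_b, model_c))
near = distance(va, vb)
far = distance(va, vc)
```

With many pairs, such distances approximate divergences between the models' conditional distributions, which is what lets the 2-D "map" (e.g., via multidimensional scaling) reflect behavioral rather than architectural similarity.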
cs.CL / 36 / 2603.18611
Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media
社交媒体上可解释的人道主义分类的跨模态推理转移
Abstract
Advances in social media data dissemination enable the provision of real-time information during a crisis. The information comes from different classes, such as infrastructure damages, persons missing or stranded in the affected zone, etc. Existing methods have attempted to classify text and images into various humanitarian categories, but their decision-making process remains largely opaque, which affects their deployment in real-life applications. Recent work has sought to improve transparency by extracting textual rationales from tweets to explain predicted classes. However, such explainable classification methods have mostly focused on text, rather than crisis-related images. In this paper, we propose an interpretable-by-design multimodal classification framework. Our method first learns the joint representation of text and image using a visual language transformer model and extracts text rationales. Next, it extracts the image rationales via the mapping with text rationales. Our approach demonstrates how to learn rationales in one modality from another through cross-modal rationale transfer, which saves annotation effort. Finally, tweets are classified based on extracted rationales. Experiments are conducted on the CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales. Human evaluation also supports the claim that our proposed method is able to retrieve better image rationale patches (12%) that help to identify humanitarian classes. Our method adapts well to new, unseen datasets in zero-shot mode, achieving an accuracy of 80%.
Chinese Translation
社交媒体数据传播的进步使得在危机期间能够提供实时信息。这些信息来自不同的类别,例如基础设施损坏、失踪或被困在受影响区域的人等。现有方法尝试将文本和图像分类到各种人道主义类别中,但其决策过程仍然在很大程度上不透明,这影响了其在现实应用中的部署。近期的研究试图通过从推文中提取文本推理来提高透明度,以解释预测的类别。然而,这些可解释的分类方法主要集中在文本上,而不是与危机相关的图像。在本文中,我们提出了一种设计上可解释的多模态分类框架。我们的方法首先使用视觉语言变换器模型学习文本和图像的联合表示,并提取文本推理。接下来,通过与文本推理的映射提取图像推理。我们的方法展示了如何通过跨模态推理转移从一种模态学习推理,从而节省标注工作。最后,基于提取的推理对推文进行分类。我们在CrisisMMD基准数据集上进行了实验,结果表明我们提出的方法将分类的Macro-F1提高了2-35%,同时提取了准确的文本标记和图像片段作为推理。人工评估也支持我们的主张,即我们提出的方法能够检索到更好的图像推理片段(12%),有助于识别人道主义类别。我们的方法在零样本模式下对新的、未见过的数据集适应良好,达到了80%的准确率。
cs.CL / 37 / 2603.18612
DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units
DiscoPhon: 基于离散语音单元的无监督音素库发现的基准评估
Abstract
We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that current models encode enough phonemic information for the derived units to correlate well with phonemes, though with variation across languages.
Chinese Translation
我们介绍了DiscoPhon,一个用于评估离散语音单元无监督音素发现的多语言基准。DiscoPhon涵盖6种开发语言和6种测试语言,选择这些语言是为了覆盖广泛的音位对比。在仅有10小时新语言的语音数据的情况下,系统必须生成离散单元,这些单元通过多对一或一对一的分配方式映射到预定义的音素库。生成的序列将根据单元质量、识别和分割进行评估。我们提供了四个预训练的多语言HuBERT和SpidR基线,并展示了当前模型中音素信息足够丰富,可以使得衍生单元与音素之间有良好的相关性,尽管在不同语言中存在变化。
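The many-to-one assignment mentioned above is commonly done by mapping each discrete unit to the phoneme it co-occurs with most often in aligned frames; the sketch below illustrates that convention with toy alignments. The exact mapping rule DiscoPhon scores against may differ, so treat this as an assumed baseline procedure.

```python
# Hedged sketch of many-to-one unit-to-phoneme assignment: map each
# discrete unit to its most frequently co-occurring phoneme in aligned
# frames, then score the mapped sequence against the reference.
from collections import Counter, defaultdict

def many_to_one(units, phonemes):
    """units[i] is the discrete unit at frame i; phonemes[i] its phoneme."""
    cooc = defaultdict(Counter)
    for u, p in zip(units, phonemes):
        cooc[u][p] += 1
    return {u: c.most_common(1)[0][0] for u, c in cooc.items()}

units    = [7, 7, 3, 3, 3, 7, 9]
phonemes = ["a", "a", "t", "t", "d", "a", "t"]
mapping = many_to_one(units, phonemes)
mapped = [mapping[u] for u in units]
accuracy = sum(m == p for m, p in zip(mapped, phonemes)) / len(units)
```

The one-to-one variant is stricter: each phoneme may receive at most one unit, which penalizes systems whose units fragment a single phoneme.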
cs.CL / 38 / 2603.18620
Learning to Self-Evolve
学习自我进化
Abstract
We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.
Chinese Translation
我们介绍了学习自我进化(Learning to Self-Evolve, LSE),这是一个强化学习框架,旨在训练大型语言模型(Large Language Models, LLMs)在测试时改善其上下文。我们将LSE置于测试时自我进化的背景下,在该背景下,模型通过对已见问题的反馈迭代地优化其上下文,以便在新问题上表现得更好。现有方法完全依赖于模型固有的推理能力,从未明确地为此任务进行训练。LSE将多步进化问题简化为单步强化学习目标,其中每次上下文编辑的奖励来自于下游性能的提升。我们将这一目标与树导向的进化循环相结合。在文本到SQL生成(Text-to-SQL generation, BIRD)和一般问题回答(general question answering, MMLU-Redux)任务中,使用LSE训练的4B参数模型超越了由GPT-5和Claude Sonnet 4.5驱动的自我进化策略,以及包括GEPA和TextGrad在内的提示优化方法,并且能够在没有额外训练的情况下指导其他模型。我们的结果突显了将自我进化视为可学习技能的有效性。
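The single-step RL objective described above, rewarding a context edit by the downstream improvement it produces, reduces to a before/after performance difference. The evaluator below is a toy keyword-matching stand-in for real task accuracy, and all names are invented for illustration.

```python
# Minimal sketch of the single-step reward for context editing: an edit to
# the context is rewarded by the change in downstream performance.

def edit_reward(evaluate, context, edited_context, problems):
    """Reward = performance after the edit minus performance before."""
    return evaluate(edited_context, problems) - evaluate(context, problems)

def toy_evaluate(context, problems):
    """Stand-in evaluator: a problem counts as 'solved' if its keyword is
    hinted at in the context."""
    return sum(p in context for p in problems) / len(problems)

problems = ["joins", "aggregation", "subqueries"]
before = "Remember to alias tables."
after = "Remember to alias tables. Use joins and aggregation carefully."
reward = edit_reward(toy_evaluate, before, after, problems)  # positive edit
```

Scoring each edit in isolation is what collapses the multi-step evolution problem into a standard one-step RL objective that a small model can be trained on.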
cs.CL / 39 / 2603.18641
A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems
连续自然语言处理系统中序列任务适应的灾难性遗忘缓解的比较实证研究
Abstract
Neural language models deployed in real-world applications must continually adapt to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task label-disjoint scenario and evaluate three backbone architectures: a feed-forward Artificial Neural Network (ANN), a Gated Recurrent Unit (GRU), and a Transformer encoder, under a range of continual learning (CL) strategies. We consider one representative method from each major CL family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter-isolation via Hard Attention to Task (HAT), both individually and in all pairwise and triple combinations. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence. Our results show that naive sequential fine-tuning suffers from severe forgetting for all architectures and that no single CL method fully prevents it. Replay emerges as a key ingredient: MIR is the most reliable individual strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently achieve high final performance with near-zero or mildly positive backward transfer. The optimal configuration is architecture-dependent: MIR+HAT yields the best results for the ANN and the Transformer, whereas MIR+LwF+HAT works best for the GRU; in several cases, CL methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly selecting backbone architecture and CL mechanism when designing continual intent-classification systems.
Chinese Translation
在实际应用中部署的神经语言模型必须不断适应新的任务和领域,而不遗忘先前获得的知识。本研究呈现了一个关于连续意图分类中灾难性遗忘缓解的比较实证研究。我们使用 CLINC150 数据集构建了一个 10 任务标签不重叠的场景,并在多种连续学习(CL)策略下评估三种主干架构:前馈人工神经网络(ANN)、门控循环单元(GRU)和 Transformer 编码器。我们考虑了每个主要 CL 家族中的一种代表性方法:基于重放的最大干扰检索(MIR)、基于正则化的无遗忘学习(LwF)以及通过硬注意力任务隔离(HAT)进行的参数隔离,分别以及所有成对和三重组合进行评估。通过平均准确率、宏 F1 和向后迁移评估性能,捕捉任务序列中的稳定性与可塑性权衡。我们的结果表明,天真的序列微调在所有架构中都遭受严重遗忘,并且没有单一的 CL 方法能够完全防止遗忘。重放作为一个关键因素出现:MIR 是最可靠的单一策略,而包含重放的组合(MIR+HAT、MIR+LwF、MIR+LwF+HAT)始终实现高最终性能,并且向后迁移接近零或轻微正值。最佳配置依赖于架构。对于 ANN 和 Transformer,MIR+HAT 产生了最佳结果,而 MIR+LwF+HAT 在 GRU 上表现最佳,在多个情况下,CL 方法甚至超越了联合训练,表明了一种正则化效应。这些发现强调了在设计连续意图分类系统时共同选择主干架构和 CL 机制的重要性。
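The backward-transfer metric used above is standard in continual learning and easy to compute from a task-by-task accuracy matrix; the accuracy values below are illustrative, not the study's results.

```python
# Sketch of the backward-transfer (BWT) metric used to quantify forgetting:
# the average change in a task's accuracy between right after it was
# learned and the end of the full task sequence.

def backward_transfer(acc_matrix):
    """acc_matrix[i][j] = accuracy on task j after training on task i.
    BWT averages acc[T-1][j] - acc[j][j] over all but the last task;
    negative values indicate forgetting."""
    T = len(acc_matrix)
    return sum(acc_matrix[T - 1][j] - acc_matrix[j][j]
               for j in range(T - 1)) / (T - 1)

acc = [
    [0.90, 0.00, 0.00],
    [0.70, 0.88, 0.00],
    [0.60, 0.80, 0.85],
]
bwt = backward_transfer(acc)  # negative: earlier tasks degraded
```

The "near-zero or mildly positive backward transfer" reported for the replay combinations corresponds to this quantity approaching or exceeding zero.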
cs.CL / 40 / 2603.18750
Automatic detection of Gen-AI texts: A comparative framework of neural models
自动检测生成型人工智能文本:神经模型的比较框架
Abstract
The rapid proliferation of Large Language Models has significantly increased the difficulty of distinguishing between human-written and AI-generated texts, raising critical issues across academic, editorial, and social domains. This paper investigates the problem of AI-generated text detection through the design, implementation, and comparative evaluation of multiple machine learning based detectors. Four neural architectures are developed and analyzed: a Multilayer Perceptron, a one-dimensional Convolutional Neural Network, a MobileNet-based CNN, and a Transformer model. The proposed models are benchmarked against widely used online detectors, including ZeroGPT, GPTZero, QuillBot, Originality.AI, Sapling, IsGen, Rephrase, and Writer. Experiments are conducted on the COLING Multilingual Dataset, considering both English and Italian configurations, as well as on an original thematic dataset focused on Art and Mental Health. Results show that supervised detectors achieve more stable and robust performance than commercial tools across different languages and domains, highlighting key strengths and limitations of current detection strategies.
Chinese Translation
大型语言模型的快速普及显著增加了区分人类撰写文本和人工智能生成文本的难度,这在学术、编辑和社会领域引发了关键问题。本文通过设计、实施和比较评估多种基于机器学习的检测器,探讨了人工智能生成文本检测的问题。开发并分析了四种神经架构:多层感知器、单维卷积神经网络、基于MobileNet的卷积神经网络和Transformer模型。所提出的模型与广泛使用的在线检测器进行基准测试,包括ZeroGPT、GPTZero、QuillBot、Originality.AI、Sapling、IsGen、Rephrase和Writer。实验在COLING多语言数据集上进行,考虑了英语和意大利语的配置,同时还使用了一个专注于艺术和心理健康的原创主题数据集。结果表明,在不同语言和领域中,监督式检测器的性能比商业工具更稳定和稳健,突显了当前检测策略的关键优势和局限性。
cs.CL / 41 / 2603.18765
Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks
大型语言模型中的隐性评分偏见:写作风格如何影响数学、编程和论文任务的自动评估
Abstract
As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs -- LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba) -- were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale -- penalties comparable to the difference between a B+ and C+ letter grade. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.
Chinese Translation
随着大型语言模型(LLMs)在教育环境中被越来越多地用作自动评分工具,关于其评估的公平性和偏见的担忧变得尤为重要。本研究调查了在内容正确性保持不变的情况下,LLMs是否会基于写作风格表现出隐性评分偏见。我们构建了一个受控数据集,包含三个学科(数学、编程和论文/写作)中的180条学生回答,每条回答均具有三种表面层面的扰动类型:语法错误、非正式语言和非母语表达。两个最先进的开源LLM——LLaMA 3.3 70B(Meta)和Qwen 2.5 72B(Alibaba)——被要求在1-10分制下评分,并被明确指示仅评估内容正确性、忽略写作风格。我们的结果显示,在论文/写作任务中,两个模型和所有扰动类型均存在统计上显著的评分偏见(p < 0.05),效应量从中等(Cohen's d = 0.64)到非常大(d = 4.25)不等。非正式语言受到的扣分最为严重,在10分制下LLaMA平均扣1.90分,Qwen平均扣1.20分,相当于B+与C+字母等级之间的差距。非母语表达分别被扣1.35分和0.90分。与此形成鲜明对比的是,数学和编程任务表现出极小的偏见,大多数条件未达到统计显著性。这些发现表明,LLM评分偏见具有学科依赖性和风格敏感性,并且即便评分提示中包含明确的反偏见指示也依然存在。我们讨论了公平部署基于LLM的评分系统的意义,并建议在机构采用之前实施偏见审计规程。
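The effect sizes reported above (Cohen's d) are simple to compute; a minimal sketch follows, where the grade lists are invented for illustration and are not the paper's data:

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2
                  + (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical 10-point grades: original essays vs. informal-language rewrites
original = [8.0, 9.0, 8.5, 9.0, 8.0]
informal = [6.5, 7.0, 6.0, 7.5, 6.5]
effect = cohens_d(original, informal)
```

A value near 0.8 is conventionally "large"; the d = 4.25 reported for some conditions is far beyond that threshold.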
cs.CL / 42 / 2603.18788
Mi:dm K 2.5 Pro
Mi:dm K 2.5 Pro
Abstract
The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. "Fusion Training" then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.
Chinese Translation
不断发展的大型语言模型(LLM)领域需要超越简单文本生成的能力,优先考虑多步骤推理、长上下文理解和智能体工作流程。这一转变对企业环境中的现有模型提出了挑战,特别是在韩语和特定领域场景中,单纯扩大规模并不足够。我们推出了 Mi:dm K 2.5 Pro,这是一款拥有320亿参数的旗舰级大型语言模型,旨在通过以推理为中心的优化来应对企业级复杂性。我们的方法通过一个以质量为中心的数据整理管道构建了坚实的数据基础,该管道利用抽象语法树(AST)分析处理代码、面向数学的缺口填补式合成,以及基于大型语言模型的质量评估器。预训练通过基于层预测器的深度扩展(Depth Upscaling, DuS)和支持128K令牌上下文窗口的渐进策略来扩展模型。后训练引入了一个专门的多阶段管道,包括推理微调(Reasoning SFT)、模型合并和异步强化学习(RL),以培养复杂问题求解能力。随后,“融合训练”(Fusion Training)将这些能力与对话流畅性、一致的响应风格和可靠的工具使用重新平衡。评估结果表明,Mi:dm K 2.5 Pro 在与全球和国内领先模型的比较中具有竞争力。此外,它在韩语特定基准测试中取得了最先进的结果,展示了深厚的语言和文化理解。最后,负责任的人工智能评估验证了其抵御攻击的安全性,确保了在无害性与响应性之间取得平衡的安全部署配置。
cs.CL / 43 / 2603.18822
Detecting Basic Values in A Noisy Russian Social Media Text Data: A Multi-Stage Classification Framework
在嘈杂的俄罗斯社交媒体文本数据中检测基本价值观:一个多阶段分类框架
Abstract
This study presents a multi-stage classification framework for detecting human values in noisy Russian language social media, validated on a random sample of 7.5 million public text posts. Drawing on Schwartz's theory of basic human values, we design a multi-stage pipeline that includes spam and nonpersonal content filtering, targeted selection of value relevant and politically relevant posts, LLM based annotation, and multi-label classification. Particular attention is given to verifying the quality of LLM annotations and model predictions against human experts. We treat human expert annotations not as ground truth but as an interpretative benchmark with its own uncertainty. To account for annotation subjectivity, we aggregate multiple LLM generated judgments into soft labels that reflect varying levels of agreement. These labels are then used to train transformer based models capable of predicting the probability of each of the ten basic values. The best performing model, XLM RoBERTa large, achieves an F1 macro of 0.83 and an F1 of 0.71 on held out test data. By treating value detection as a multi perspective interpretive task, where expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we show that the model generally aligns with human judgments but systematically overestimates the Openness to Change value domain. Empirically, the study reveals distinct patterns of value expression and their co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value based interpretation in digital environments. All models are released publicly.
Chinese Translation
本研究提出了一个多阶段分类框架,用于检测嘈杂的俄语社交媒体中的人类价值观,并在750万条公开文本帖子的随机样本上进行了验证。基于Schwartz的基本人类价值观理论,我们设计了一个包含垃圾信息和非个人内容过滤、有针对性的价值相关及政治相关帖子筛选、基于大型语言模型(LLM)的注释以及多标签分类的多阶段流程。我们特别关注通过与人类专家对比来验证LLM注释和模型预测的质量。我们不将人类专家注释视为绝对真值,而是作为具有自身不确定性的解释性基准。为应对注释的主观性,我们将多个LLM生成的判断汇聚为反映不同一致性水平的软标签。随后使用这些标签训练了基于Transformer的模型,能够预测十个基本价值观中每一个的概率。表现最佳的模型为XLM RoBERTa large,在保留测试集上达到宏F1值0.83和F1值0.71。通过将价值观检测视为一个多视角的解释性任务,其中专家标签、GPT注释与模型预测代表对相同文本的连贯但不完全相同的解读,我们展示了模型总体上与人类判断一致,但系统性地高估了“对变化的开放性(Openness to Change)”价值领域。实证结果揭示了俄罗斯社交网络中价值表达及其共现的独特模式,为关于文化差异、传播框架以及数字环境中基于价值观的解读的更广泛研究议程做出了贡献。所有模型均已公开发布。
cs.CL / 44 / 2603.18863
Why Better Cross-Lingual Alignment Fails for Better Cross-Lingual Transfer: Case of Encoders
为何更好的跨语言对齐未能实现更好的跨语言迁移:编码器的案例
Abstract
Better cross-lingual alignment is often assumed to yield better cross-lingual transfer. However, explicit alignment techniques -- despite increasing embedding similarity -- frequently fail to improve token-level downstream performance. In this work, we show that this mismatch arises because alignment and downstream task objectives are largely orthogonal, and because the downstream benefits from alignment vary substantially across languages and task types. We analyze four XLM-R encoder models aligned on different language pairs and fine-tuned for either POS Tagging or Sentence Classification. Using representational analyses, including embedding distances, gradient similarities, and gradient magnitudes for both task and alignment losses, we find that: (1) embedding distances alone are unreliable predictors of improvements (or degradations) in task performance and (2) alignment and task gradients are often close to orthogonal, indicating that optimizing one objective may contribute little to optimizing the other. Taken together, our findings explain why ``better'' alignment often fails to translate into ``better'' cross-lingual transfer. Based on these insights, we provide practical guidelines for combining cross-lingual alignment with task-specific fine-tuning, highlighting the importance of careful loss selection.
Chinese Translation
更好的跨语言对齐通常被认为能够带来更好的跨语言迁移。然而,尽管显式对齐技术提高了嵌入相似性,却常常未能改善令牌级下游性能。在本研究中,我们展示了这种不匹配的原因在于对齐和下游任务目标在很大程度上是正交的,并且对齐带来的下游收益在不同语言和任务类型之间差异显著。我们分析了四个在不同语言对上对齐并针对词性标注(POS Tagging)或句子分类(Sentence Classification)进行微调的XLM-R编码器模型。通过代表性分析,包括嵌入距离、梯度相似性和任务与对齐损失的梯度大小,我们发现:(1)嵌入距离本身并不是任务性能改善(或恶化)的可靠预测指标;(2)对齐和任务的梯度往往接近正交,表明优化一个目标可能对优化另一个目标贡献甚微。综合来看,我们的发现解释了为何“更好的”对齐常常未能转化为“更好的”跨语言迁移。基于这些见解,我们提供了将跨语言对齐与任务特定微调相结合的实用指南,强调了仔细选择损失函数的重要性。
cs.CL / 45 / 2603.18873
Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo
从语言学习学生的视角评估大型语言模型生成的课程:关于Duolingo的短期案例研究
Abstract
Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for their users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking for directions, with limited support for profession-specific contexts. This gap can hinder learners from achieving professional-level fluency, which we define as the ability to comfortably communicate various work-related and domain-specific information in the target language. We surveyed five employees from a multinational company in the Philippines about their experiences with Duolingo. Results show that respondents encountered general scenarios more frequently than work-related ones, and that the former are relatable and effective in building foundational grammar, vocabulary, and cultural knowledge. The latter help bridge the gap toward professional fluency, as they contain domain-specific vocabulary. The lesson scenarios suggested by participants diverge in context when analyzed in aggregate. With this understanding, we propose that language learning applications should generate lessons that adapt to an individual's needs through personalized, domain-specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios.
Chinese Translation
流行的语言学习应用程序如Duolingo利用大型语言模型(LLMs)为用户生成课程。大多数课程集中于一般的现实场景,如问候、点餐或问路,而对特定职业背景的支持有限。这一差距可能阻碍学习者达到专业水平的流利度,我们将其定义为在目标语言中舒适地交流各种与工作相关和领域特定的信息的能力。我们对来自菲律宾一家跨国公司的五名员工进行了调查,了解他们使用Duolingo的体验。结果显示,受访者遇到一般场景的频率高于与工作相关的场景,前者贴近生活,在建立基础语法、词汇和文化知识方面行之有效;后者则因包含领域特定词汇,有助于弥合通向专业流利度的差距。综合分析时,各参与者建议的课程场景在情境上各不相同。基于这一理解,我们建议语言学习应用程序应通过个性化、领域特定的课程场景生成适应个体需求的课程,同时通过通用、贴近生活的课程场景保持基础支持。
cs.CL / 46 / 2603.18879
A Human-in/on-the-Loop Framework for Accessible Text Generation
人机协同框架用于可访问文本生成
Abstract
Plain Language and Easy-to-Read formats in text simplification are essential for cognitive accessibility. Yet current automatic simplification and evaluation pipelines remain largely automated, metric-driven, and fail to reflect user comprehension or normative standards. This paper introduces a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. Human-in-the-Loop (HiTL) contributions guide adjustments during generation, while Human-on-the-Loop (HoTL) supervision ensures systematic post-generation review. Empirical evidence from user studies and annotated resources is operationalized into (i) checklists aligned with standards, (ii) Event-Condition-Action trigger rules for activating expert oversight, and (iii) accessibility Key Performance Indicators (KPIs). The framework shows how human-centered mechanisms can be encoded for evaluation and reused to provide structured feedback that improves model adaptation. By embedding the human role in both generation and supervision, it establishes a traceable, reproducible, and auditable process for creating and evaluating accessible texts. In doing so, it integrates explainability and ethical accountability as core design principles, contributing to more transparent and inclusive NLP systems.
Chinese Translation
在文本简化中,简明语言(Plain Language)和易读(Easy-to-Read)格式对于认知无障碍至关重要。然而,目前的自动简化和评估流程在很大程度上仍是自动化、指标驱动的,未能反映用户理解水平或规范标准。本文提出了一种混合框架,明确将人类参与整合到基于大型语言模型(LLM)的可访问文本生成中。人在环内(Human-in-the-Loop, HiTL)的贡献在生成过程中指导调整,而人在环上(Human-on-the-Loop, HoTL)的监督确保系统化的生成后审查。来自用户研究和标注资源的实证证据被操作化为:(i)与标准对齐的检查清单,(ii)用于激活专家监督的事件-条件-动作触发规则,以及(iii)无障碍关键绩效指标(Key Performance Indicators, KPIs)。该框架展示了如何将以人为中心的机制编码用于评估,并被复用以提供改进模型适配的结构化反馈。通过将人类角色嵌入生成和监督两个环节,它为可访问文本的创建和评估建立了一个可追溯、可复现和可审计的过程。在此过程中,它将可解释性和伦理问责作为核心设计原则,促进了更透明、更包容的自然语言处理(NLP)系统。
cs.CL / 47 / 2603.18911
Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs
渐进训练用于可解释的基于引用的对话:在英-印大语言模型中将幻觉降至零
Abstract
Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses - cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding - applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces hallucination to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) smaller models match larger models on English after SFT; and (v) GRPO provides marginal improvement over well-designed SFT for structured citation tasks. We evaluate across six automatic metrics (BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate).
Chinese Translation
知识驱动的对话系统旨在通过依赖外部知识源来生成信息丰富且与上下文相关的响应。然而,大多数现有方法仅关注英语,缺乏验证事实主张的明确引用机制,并且对模型决策过程的透明度有限。我们提出了 XKD-Dial,一个面向双语(英语-印地语)环境、可解释的知识驱动对话生成的渐进式四阶段训练流程,包括:(1) 多语言适应,(2) 以引用为基础的英语对话 SFT (Supervised Fine-Tuning),(3) 双语对话 SFT,和 (4) 带引用感知奖励的 GRPO (Group Relative Policy Optimization) 对齐。我们在每个流程阶段评估了六个模型,涵盖编码器-解码器 (250M-3B) 和仅解码器 (1B-7B) 架构。我们的关键贡献包括:(i) 三种事后可解释性分析(交叉注意力对齐、集成梯度归因和基于遮蔽的因果接地)系统性地应用于整个训练轨迹,以揭示引用行为是如何被学习的,而不仅仅是是否被学习;(ii) 从阶段二开始,以引用为基础的 SFT 将编码器-解码器模型的幻觉率降至 0.0%;(iii) 渐进式流程防止了灾难性遗忘,同时提升了印地语能力;(iv) 较小的模型在 SFT 后在英语上的表现与较大的模型相当;(v) 对于结构化引用任务,GRPO 相较于精心设计的 SFT 仅带来边际改进。我们在六种自动评估指标(BLEU、ROUGE、BERTScore、FactScore、Citation-F1 和幻觉率)上进行了评估。
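Among the metrics listed, Citation-F1 can be read as set-overlap F1 over the passage IDs cited in a response; the sketch below is a plausible rendering of such a metric, not the paper's exact definition:

```python
def citation_f1(predicted_citations, gold_citations):
    """F1 over the sets of passage IDs cited in a response vs. the reference."""
    pred, gold = set(predicted_citations), set(gold_citations)
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)

# A response citing [doc1, doc2] scored against gold citations [doc1, doc3]
score = citation_f1(["doc1", "doc2"], ["doc1", "doc3"])
```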
cs.CL / 48 / 2603.18940
Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
熵轨迹形状预测大型语言模型推理可靠性:链式思维中不确定性动态的诊断研究
Abstract
Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps--captured by sampling a few answer completions per step--predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher's p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive ($\rho$=-0.06, p=0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation count 0/1/2 gives 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: 0.186->0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at approx 1,500 tokens/question--1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.
Chinese Translation
链式思维(CoT)推理提高了大型语言模型(LLM)的准确性,但以低成本检测推理失败仍然难以实现。我们研究了推理步骤中不确定性动态的形状(通过在每一步采样少量答案补全来捕捉)是否能够预测正确性。我们引入了熵轨迹单调性:若链中每一步的答案分布熵都在递减,则该链是单调的。在GSM8K(n=300)上使用Qwen2.5-7B-Instruct,单调链的准确率为68.8%,而非单调链为46.8%(+21.9个百分点;Fisher's p=0.0005;OR=2.50)。关键是,总熵减少量并不具有预测性($\rho$=-0.06,p=0.31),揭示了形状与幅度的解离:重要的是熵是否在每一步都减少,而不是减少了多少。违反次数为0/1/2时,准确率分别为68.8%/50.8%/28.6%。令牌对数概率置信度的校准随步骤深度加深而恶化(ECE: 0.186->0.312),而单调性在73.7%的覆盖率下带来+5.8个百分点的提升,以每个问题约1,500个令牌的成本优于标量基线,仅为40链自一致性成本的1/8。结果在Mistral-7B(n=300)上得到复现:单调链准确率达72.3%,非单调链为37.6%(+34.7个百分点;OR=4.33)。因此,不确定性轨迹的结构特性比聚合度量更具信息量。
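The monotonicity test described above is cheap to implement: sample a few answer completions at each step, compute the entropy of the answer distribution, and check that it falls at every step. A minimal sketch, where the per-step answer counts are invented for illustration:

```python
import math

def answer_entropy(counts):
    """Shannon entropy (nats) of a per-step answer distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values() if c > 0)

def is_monotone(per_step_counts):
    """True if answer-distribution entropy strictly decreases at every step."""
    h = [answer_entropy(c) for c in per_step_counts]
    return all(later < earlier for earlier, later in zip(h, h[1:]))

# 8 sampled completions per reasoning step; answers converge on "42"
converging = [{"42": 3, "40": 3, "38": 2}, {"42": 5, "40": 3}, {"42": 8}]
# Same final answer, but uncertainty rises mid-chain: one violation
wavering = [{"42": 5, "40": 3}, {"42": 3, "40": 3, "38": 2}, {"42": 8}]
```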
cs.CL / 49 / 2603.19002
RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation
RADIUS:排名、分布与显著性 - 一套全面的调查模拟对齐工具
Abstract
Simulation of surveys using LLMs is emerging as a powerful application for generating human-like responses at scale. Prior work evaluates survey simulation using metrics borrowed from other domains, which are often ad hoc, fragmented, and non-standardized, leading to results that are difficult to compare. Moreover, existing metrics focus mainly on accuracy or distributional measures, overlooking the critical dimension of ranking alignment. In practice, a simulation can achieve high accuracy while still failing to capture the option most preferred by humans - a distinction that is critical in decision-making applications. We introduce RADIUS, a comprehensive two-dimensional alignment suite for survey simulation that captures: 1) RAnking alignment and 2) DIstribUtion alignment, each complemented by statistical Significance testing. RADIUS highlights the limitations of existing metrics, enables more meaningful evaluation of survey simulation, and provides an open-source implementation for reproducible and comparable assessment.
Chinese Translation
使用大型语言模型(LLMs)进行调查模拟正在成为一种强大的应用,能够大规模生成类人响应。之前的研究使用借鉴自其他领域的指标来评估调查模拟,这些指标往往是临时的、零散的且缺乏标准化,导致结果难以比较。此外,现有指标主要关注准确性或分布性度量,忽视了排名对齐这一关键维度。在实践中,模拟可能在准确性上表现良好,但仍然未能捕捉到人类最偏好的选项——这一区别在决策应用中至关重要。我们引入了RADIUS,一套全面的二维调查模拟对齐工具,涵盖:1)排名对齐(Ranking alignment)和2)分布对齐(Distribution alignment),每个部分都辅以统计显著性测试(Significance testing)。RADIUS突出了现有指标的局限性,使调查模拟的评估更具意义,并提供了一个开源实现,以便进行可重复和可比较的评估。
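The gap RADIUS targets can be made concrete: a simulated option distribution may sit numerically close to the human one while still ranking a different option first. The sketch below uses total variation distance as a stand-in distributional measure and a top-choice check as the coarsest ranking-alignment test; the suite's actual metrics may differ, and the option shares are invented:

```python
def total_variation(p, q):
    """Distributional distance between two option-share dictionaries."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def same_top_choice(p, q):
    """Coarsest ranking-alignment check: do both rank the same option first?"""
    return max(p, key=p.get) == max(q, key=q.get)

human = {"A": 0.35, "B": 0.33, "C": 0.32}
sim   = {"A": 0.30, "B": 0.38, "C": 0.32}
```

Here the distributions are only 0.05 apart in total variation, yet the simulation prefers B while humans prefer A, which is exactly the failure mode a purely distributional metric misses.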
cs.CL / 50 / 2603.19008
Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval
假设条件查询重写用于决策有用的检索
Abstract
Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at https://anonymous.4open.science/r/HCQR-1C2E.
Chinese Translation
检索增强生成(Retrieval-Augmented Generation, RAG)通过将生成过程与外部非参数知识相结合,提升了大型语言模型(Large Language Models, LLMs)的性能。然而,当任务需要在竞争选项中进行选择时,仅仅将生成过程与广泛相关的上下文相结合通常不足以推动最终决策。现有的RAG方法通常依赖于单一的初始查询,这往往偏向于主题相关性而非决策相关证据,因此检索到的背景信息可能无法区分答案选项。为了解决这一问题,我们提出了假设条件查询重写(Hypothesis-Conditioned Query Rewriting, HCQR),这是一种无训练的预检索框架,将RAG从以主题为导向的检索转向以证据为导向的检索。HCQR首先从输入问题和候选选项中推导出一个轻量级的工作假设,然后将检索重写为三个目标明确的查询,旨在寻求证据以:(1) 支持假设,(2) 将其与竞争替代方案区分开,以及(3) 验证问题中的显著线索。这种方法使得上下文检索与答案选择更加直接对齐,使生成器能够根据检索到的证据确认或推翻初始假设。在MedQA和MMLU-Med上的实验表明,HCQR在性能上始终优于单查询RAG及重排序/过滤基线,平均准确率分别比简单RAG提高了5.9和3.6个百分点。代码可在https://anonymous.4open.science/r/HCQR-1C2E获取。
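The three-query rewrite can be sketched as plain templates; in the actual framework an LLM derives the working hypothesis and phrases the queries, so the function and strings below are hypothetical:

```python
def hcqr_rewrite(question, hypothesis, alternatives, clue):
    """Turn one question into three evidence-oriented retrieval queries."""
    return [
        f"{question} evidence supporting {hypothesis}",                        # (1) support
        f"{hypothesis} versus {' or '.join(alternatives)} distinguishing evidence",  # (2) discriminate
        f"{clue} significance in {question}",                                  # (3) verify clue
    ]

queries = hcqr_rewrite(
    question="Which drug is first-line for type 2 diabetes?",
    hypothesis="metformin",
    alternatives=["insulin", "sulfonylureas"],
    clue="first-line therapy",
)
```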
cs.CL / 51 / 2603.19017
What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?
到底是什么控制了大型语言模型中的时间推理:分词还是时间的表示?
Abstract
We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks, date arithmetic, time zone conversion, and temporal relation extraction across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains $15,000$ examples built by translating $750$ curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
Chinese Translation
我们提出了MultiTempBench,一个多语言时间推理基准,涵盖三项任务:日期算术、时区转换和时间关系提取,涉及五种语言(英语、德语、中文、阿拉伯语和豪萨语)及多种日历体系(公历、伊斯兰历和农历)。MultiTempBench包含1.5万个通过翻译750个精心整理的英语问题而构建的例子,并将每个问题扩展为多种受控日期格式变体。我们评估了20个大型语言模型,并引入了多语言日期分段比例(multilingual Date Fragmentation Ratio, mDFR),与人类严重性评分相校准,同时进行内部时间表示的几何探测分析。我们发现,时间特征的分词质量是一个依赖资源的瓶颈:在低资源语言和较少见的日历格式中,分段会破坏年份/月份/日期的分离,从而导致准确性崩溃,而在高资源设置中,通常对数字层面的分割具有较强的鲁棒性。除了分词之外,交叉混合效应回归显示,在高资源语言中,时间线性是时间推理的最强预测因素,而在低资源语言中,分段则是更强的预测因素。代码可在以下地址获取:https://github.com/gagan3012/mtb
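The fragmentation phenomenon is straightforward to measure in raw form: count how many subword pieces a tokeniser needs per semantic date field. The ratio below is only an uncalibrated proxy; the paper's mDFR is additionally calibrated with human severity ratings, and the tokenisations shown are hypothetical:

```python
def raw_fragmentation_ratio(subword_tokens, n_fields=3):
    """Subword pieces per semantic field (year/month/day) for one date string."""
    return len(subword_tokens) / n_fields

# Hypothetical tokenisations of "2024-03-15" under two tokenisers
compact   = ["2024", "-", "03", "-", "15"]                             # field-level pieces
digitised = ["2", "0", "2", "4", "-", "0", "3", "-", "1", "5"]         # digit-level split
```

A higher ratio means the model must reassemble Year/Month/Day from more pieces, which is where the paper observes accuracy collapse in low-resource settings.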
cs.CL / 52 / 2603.19044
MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models
MoRI:基于动机的推理学习在大型语言模型中的科学创意
Abstract
Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose \textbf{MoRI} (\textbf{Mo}tivation-grounded \textbf{R}easoning for Scientific \textbf{I}deation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to maintain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on \href{https://github.com/ECNU-Text-Computing/IdeaGeneration}{GitHub}.
Chinese Translation
科学创意旨在在特定科学背景下提出新颖的解决方案。现有的基于大型语言模型(LLM)的代理方法模仿人类的研究工作流程,但未能有效建模科学推理,导致表层概念的重新组合缺乏技术深度和科学基础。为了解决这个问题,我们提出了 extbf{MoRI}( extbf{Mo}tivation-grounded extbf{R}easoning for Scientific extbf{I}deation)框架,使LLM能够明确学习从研究动机到方法论的推理过程。基础LLM通过有监督的微调初始化,以从给定上下文生成研究动机,其后在一种复合强化学习奖励下进行训练,该奖励近似于科学严谨性:(1)考虑熵的信息增益鼓励模型发现并详细阐述基于真实方法论的高复杂度技术细节;(2)对比语义增益约束推理轨迹,以保持与科学有效解决方案的概念一致性。实证结果表明,MoRI在新颖性、技术严谨性和可行性等多个维度上显著超越强大的商业LLM和复杂的代理基线。代码将在 extit{https://github.com/ECNU-Text-Computing/IdeaGeneration}上提供。
cs.CL / 53 / 2603.19066
Parallelograms Strike Back: LLMs Generate Better Analogies than People
平行四边形反击:大型语言模型生成的类比优于人类
Abstract
Four-term word analogies (A:B::C:D) are classically modeled geometrically as ''parallelograms,'' yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account (Peterson et al., 2020). But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from (Peterson et al., 2020). We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently.
Chinese Translation
四元词类比 (A:B::C:D) 在经典研究中被几何地建模为“平行四边形”,但最近的研究表明,这一模型未能很好地捕捉人类生成类比的方式,简单的局部相似性启发式往往提供更好的解释 (Peterson et al., 2020)。但平行四边形模型的失败,是因为它不是类比关系的好模型,还是因为人类并不擅长生成保留关系的类比?我们在 (Peterson et al., 2020) 的同一组类比问题上比较了人类与大型语言模型 (LLM) 的类比补全。我们发现,LLM生成的类比被可靠地评判为优于人类生成的类比,并且在分布式嵌入空间 (GloVe) 中与平行四边形结构的对齐程度更高。关键的是,我们展示了相对于人类类比的改进主要源于更好的平行四边形对齐和对易得词汇依赖的减少,而非对局部相似性的敏感性增强。此外,LLM的优势并非源于其响应一致地更优,而是因为人类产生了一个由较弱补全构成的长尾:当仅比较两个系统的典型(最频繁)响应时,LLM的优势便消失了。然而,更高的平行四边形对齐度和更低的词频仍然能够预测哪些LLM补全被评定为优于人类补全。总体而言,这些结果表明,平行四边形模型并非对词类比的不良解释;相反,人类往往无法生成满足这种关系约束的补全,而LLM则更一致地做到了这一点。
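The parallelogram model itself completes A:B::C:? by finding the vocabulary word nearest to B - A + C. A toy sketch with hand-built 2-D embeddings (real evaluations use GloVe vectors over a full vocabulary):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def parallelogram_completion(emb, a, b, c):
    """Complete a:b::c:? with the word whose vector is closest to b - a + c."""
    target = [bb - aa + cc for aa, bb, cc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy 2-D embeddings laid out so king - man + woman lands on queen
emb = {"man": [1.0, 0.0], "woman": [1.0, 1.0],
       "king": [3.0, 0.2], "queen": [3.0, 1.2], "apple": [-2.0, -0.5]}
```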
cs.CL / 54 / 2603.19082
A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes
用于从临床笔记中识别患者健康素养信息的数据集和资源
Abstract
Health literacy is a critical determinant of patient outcomes, yet current screening tools are not always feasible and differ considerably in the number of items, question format, and dimensions of health literacy they capture, making documentation in structured electronic health records difficult to achieve. Automated detection from unstructured clinical notes offers a promising alternative, as these notes often contain richer, more contextual health literacy information, but progress has been limited by the lack of annotated resources. We introduce HEALIX, the first publicly available annotated health literacy dataset derived from real clinical notes, curated through a combination of social worker note sampling, keyword-based filtering, and LLM-based active learning. HEALIX contains 589 notes across 9 note types, annotated with three health literacy labels: low, normal, and high. To demonstrate its utility, we benchmarked zero-shot and few-shot prompting strategies across four open source large language models (LLMs).
Chinese Translation
健康素养是影响患者结果的关键因素,但当前的筛查工具并不总是可行,并且在项目数量、问题格式和所捕获的健康素养维度上存在显著差异,这使得在结构化电子健康记录中进行文档记录变得困难。从非结构化临床笔记中进行自动检测提供了一种有前景的替代方案,因为这些笔记通常包含更丰富、更具上下文的健康素养信息,但由于缺乏注释资源,进展受到限制。我们介绍了HEALIX,这是第一个基于真实临床笔记的公开可用注释健康素养数据集,通过社会工作者笔记抽样、基于关键词的过滤和基于大型语言模型(LLM)的主动学习相结合进行策划。HEALIX包含589条笔记,涵盖9种笔记类型,并标注了三个健康素养标签:低、正常和高。为了展示其实用性,我们在四个开源大型语言模型(LLMs)上基准测试了零样本和少样本提示策略。
cs.CL / 55 / 2603.19097
DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering
DaPT:一种用于多语言多跳问答的双路径框架
Abstract
Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the application scenario of retrieving across multilingual corpora and queries, leaving several open challenges. The first one involves the absence of benchmarks that assess RAG systems' capabilities under the multilingual multi-hop (MM-hop) QA setting. The second centers on the overreliance on LLMs' strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation counterpart, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3\% in average EM score over the strongest baseline.
Chinese Translation
检索增强生成(RAG)系统在英语场景中解决复杂的多跳问答(QA)任务方面取得了显著进展。然而,RAG 系统不可避免地面临跨多语言语料库和查询进行检索的应用场景,这留下了若干未解决的挑战。第一个挑战是缺乏评估 RAG 系统在多语言多跳(MM-hop)QA 设置下能力的基准。第二个挑战则是对大型语言模型(LLMs)在英语中强大语义理解能力的过度依赖,这在多语言场景中降低了有效性。为了解决这些挑战,我们首先通过将仅含英语的基准翻译成五种语言来构建多语言多跳 QA 基准,然后提出了 DaPT,一种新颖的多语言 RAG 框架。DaPT 并行地为源语言查询及其英语翻译生成子问题图,将二者合并后,再采用双语检索与回答策略顺序求解子问题。我们的实验结果表明,先进的 RAG 系统在多语言场景中存在显著的性能不平衡。此外,与基线相比,我们提出的方法始终产生更准确、更简洁的答案,显著提升了该任务上的 RAG 性能。例如,在最具挑战性的 MuSiQue 基准上,DaPT 在平均 EM 分数上相较于最强基线实现了 18.3\% 的相对提升。
cs.CL / 56 / 2603.19144
UGID: Unified Graph Isomorphism for Debiasing Large Language Models
UGID:统一图同构用于消除大型语言模型的偏见
Abstract
Large language models (LLMs) exhibit pronounced social biases. Output-level or data-optimization--based debiasing methods cannot fully resolve these biases, and many prior works have shown that biases are embedded in internal representations. We propose \underline{U}nified \underline{G}raph \underline{I}somorphism for \underline{D}ebiasing large language models (\textit{\textbf{UGID}}), an internal-representation--level debiasing framework for large language models that models the Transformer as a structured computational graph, where attention mechanisms define the routing edges of the graph and hidden states define the graph nodes. Specifically, debiasing is formulated as enforcing invariance of the graph structure across counterfactual inputs, with differences allowed only on sensitive attributes. \textit{\textbf{UGID}} jointly constrains attention routing and hidden representations in bias-sensitive regions, effectively preventing bias migration across architectural components. To achieve effective behavioral alignment without degrading general capabilities, we introduce a log-space constraint on sensitive logits and a selective anchor-based objective to preserve definitional semantics. Extensive experiments on large language models demonstrate that \textit{\textbf{UGID}} effectively reduces bias under both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.
Chinese Translation
大型语言模型(LLMs)表现出明显的社会偏见。基于输出层或数据优化的消偏方法无法完全解决这些偏见,许多先前研究已表明,偏见嵌入在内部表示中。我们提出了UGID(Unified Graph Isomorphism for Debiasing),这是一个针对大型语言模型的内部表示层级消偏框架,它将Transformer建模为结构化计算图,其中注意力机制定义图的路由边,隐藏状态定义图的节点。具体而言,消偏被表述为在反事实输入之间强制保持图结构不变,仅允许在敏感属性上存在差异。UGID联合约束偏见敏感区域中的注意力路由和隐藏表示,有效防止偏见在架构组件之间迁移。为了在不降低通用能力的情况下实现有效的行为对齐,我们引入了对敏感logits的对数空间约束和一种基于选择性锚点的目标,以保持定义语义。对大型语言模型的广泛实验表明,UGID在分布内和分布外设置下均有效减少偏见,显著降低内部结构差异,同时保持模型的安全性和实用性。
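The routing-invariance constraint can be illustrated as a divergence between attention distributions for a counterfactual input pair: driving it toward zero on bias-sensitive heads is the spirit of the approach. The symmetrised KL below is an illustrative stand-in, not the paper's exact loss, and the attention weights are invented:

```python
import math

def symmetric_kl(p, q):
    """Symmetrised KL divergence between two attention distributions
    over the same (aligned) token positions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * (kl(p, q) + kl(q, p))

# One head's attention over aligned tokens for a counterfactual pair,
# e.g. "He is a nurse" vs. "She is a nurse"
attn_he  = [0.20, 0.30, 0.50]
attn_she = [0.25, 0.30, 0.45]
gap = symmetric_kl(attn_he, attn_she)
```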
cs.CL / 57 / 2603.19149
Optimal Splitting of Language Models from Mixtures to Specialized Domains
从混合体到专业领域的语言模型的最佳分割
Abstract
Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D' specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach improves performance consistently across common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.
cs.CL / 58 / 2603.19152
VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models
Abstract
Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 translation directions, measured with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
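Entropy-tempered advantage estimation with asymmetric clipping can be sketched as a PPO-style per-token surrogate; the function name, the tempering form, and all constants below are assumptions for illustration, not the paper's definitions:

```python
def vepo_objective(ratio, advantage, entropy, tau=0.1,
                   eps_low=0.2, eps_high=0.28):
    """Per-token surrogate sketch: the advantage is tempered by the
    policy entropy (sustaining exploration), and the importance ratio
    is clipped asymmetrically, with a wider upper bound than lower."""
    tempered = advantage * (1.0 + tau * entropy)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # PPO-style pessimistic bound between unclipped and clipped terms
    return min(ratio * tempered, clipped * tempered)
```

The asymmetric bounds let updates that increase a token's probability move slightly further than updates that decrease it, one plausible way to keep exploration alive while preventing collapse.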
cs.CL / 59 / 2603.19167
Evaluating Counterfactual Strategic Reasoning in Large Language Models
Abstract
We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner's Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.
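The counterfactual construction can be illustrated on the Prisoner's Dilemma: relabel the actions so that memorized tokens like "defect" no longer apply, then check whether the strict-dominance structure survives the relabeling. A minimal sketch; the paper's actual payoff perturbations are not reproduced here:

```python
# Canonical PD payoffs (row, col) for actions C (cooperate), D (defect).
PD = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
      ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def relabel(game, mapping):
    """Counterfactual variant: rename actions so familiar labels
    carry no information about the incentive structure."""
    return {(mapping[a], mapping[b]): v for (a, b), v in game.items()}

def row_dominant(game, actions):
    """Return a strictly dominant row action if one exists."""
    for a in actions:
        others = [o for o in actions if o != a]
        if all(game[(a, c)][0] > game[(o, c)][0]
               for o in others for c in actions):
            return a
    return None
```

A model that genuinely reasons over payoffs finds the same dominant action before and after relabeling; one that pattern-matches on labels does not.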
cs.CL / 60 / 2603.19220
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
Abstract
We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.
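On-policy distillation can be sketched as follows, assuming per-token log-probabilities from student and teacher on sequences sampled from the student; the function name and the exact divergence (a Monte Carlo reverse-KL estimate here) are illustrative assumptions, not the paper's training objective:

```python
def on_policy_distill_loss(student_logp, teacher_logp):
    """Sketch of on-policy distillation: on tokens the student itself
    sampled, penalize the mean log-probability gap between student and
    teacher, a Monte Carlo estimate of the reverse KL divergence.
    Minimizing it pulls the student toward the (per-domain) teacher
    exactly where the student currently places probability mass."""
    gaps = [s - t for s, t in zip(student_logp, teacher_logp)]
    return sum(gaps) / len(gaps)
```

Because samples come from the student, the signal concentrates on the student's own failure modes, which is one plausible reason such distillation can recover benchmark regressions between RL stages.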
cs.CL / 61 / 2603.19223
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Abstract
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
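The matryoshka-learning component mentioned above trains embeddings whose leading coordinates are themselves valid lower-dimensional embeddings. A generic sketch of how such embeddings are truncated at inference time, not F2LLM-v2's actual code:

```python
import numpy as np

def matryoshka_embed(full_vec, dim):
    """Matryoshka-style truncation: keep the first `dim` coordinates
    of the full embedding, then re-normalize so cosine similarity
    remains well defined at the reduced dimensionality."""
    v = np.asarray(full_vec, dtype=float)[:dim]
    return v / np.linalg.norm(v)

# A 4-d "full" embedding truncated to its first 2 dimensions.
print(matryoshka_embed([3.0, 4.0, 1.0, 2.0], 2))  # → [0.6 0.8]
```

Resource-constrained deployments can thus trade retrieval quality for index size by choosing `dim` at serving time, without retraining.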